author     mruberry <mruberry@nvidia.com>  2018-10-31 18:10:40 -0700
committer  Facebook Github Bot <facebook-github-bot@users.noreply.github.com>  2018-10-31 18:13:00 -0700
commit     6fe089c6eab1a14e5bc381f1f5b3d541531c2cb4 (patch)
tree       f4eda03d29f2a44d9fb680a303da8088c17a3c65 /torch/CMakeLists.txt
parent     2df6d3e3c745673c715619967b4260bd7de59c3d (diff)
download   pytorch-6fe089c6eab1a14e5bc381f1f5b3d541531c2cb4.tar.gz
           pytorch-6fe089c6eab1a14e5bc381f1f5b3d541531c2cb4.tar.bz2
           pytorch-6fe089c6eab1a14e5bc381f1f5b3d541531c2cb4.zip
Hierarchical device independent -> device specific architecture (#13108)
Summary: This PR principally redesigns the fuser's logical flow to be hierarchical, with device-independent logic directing (relatively little) device-specific logic. This design is based on reviews of XLA, TVM, an internal design review at NVIDIA, and discussions with fuser owners at Facebook. To further vet the design I have begun developing the next significant PR (extended fusion logic) on top of this architecture, and it has made the work significantly easier. This PR also improves fuser modularity, which should make it easier for others to contribute. Unfortunately, this PR is large and its nature has made breaking it into smaller pieces challenging. Future PRs should be smaller.

The fusion flow is now:
- Fusions are "registered" and "upfront compilation" occurs. The fusion specifications, which include the graph, go into a thread-safe, device-independent cache. Upfront compilation generates some information used later during shape inference.
- Fusions are run, which passes them to an executor that performs shape inference, requests an instantiated fusion from the specification's thread-safe store, and launches it. Launch logic eventually defers to device-specific logic.
- Fusions not previously instantiated are compiled. Compilation is device-specific and arg-specific, and compilation logic eventually defers to device-specific logic.
- If the fusion cannot be run because fusion on the requested device is disabled or shape inference fails, a fallback is invoked.

This flow can be thought of as PyTorch IR -> Device-Independent Fusion Logic -> Device-Specific Fusion Logic. The current upstream logic is, by contrast, PyTorch IR -> Device-Specific Logic -> Device-Independent Logic, which results in needless code duplication and a lack of conceptual clarity. That was my mistake when splitting the fuser off from the rest of the jit, and our reviews since then have been incredibly helpful in understanding why the approach in this PR is better. This PR does not just move code around: it also fixes a couple of bugs and makes some logical/code changes (a rough sketch of the new flow follows below).
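As a rough illustration of the register -> run -> compile-on-demand -> fallback flow above: the sketch below uses placeholder names (FusionSpec, registerFusion, runFusion, argSignature, compileKernel, launch), not the actual fuser API; the real logic lives in the new torch/csrc/jit/fuser/ sources (compiler.cpp, executor.cpp, kernel_cache.cpp, fallback.cpp).

// Illustrative sketch only -- all type and function names here are hypothetical.
#include <cstdint>
#include <map>
#include <memory>
#include <mutex>
#include <string>

struct Graph {};        // stand-in for the fused PyTorch IR subgraph
struct Stack {};        // stand-in for runtime inputs/outputs
struct FusedKernel {};  // stand-in for a device-specific compiled kernel

// Device-independent fusion specification, created once at registration.
struct FusionSpec {
  std::shared_ptr<Graph> graph;
  std::mutex mutex;  // guards kernels against concurrent compilation
  std::map<std::string, std::unique_ptr<FusedKernel>> kernels;  // keyed by arg signature
};

// Thread-safe, device-independent cache of specifications.
static std::mutex cache_mutex;
static std::map<int64_t, std::shared_ptr<FusionSpec>> specs;

// Hypothetical helpers; the real versions do shape inference, code generation,
// and a device-specific kernel launch.
std::string argSignature(const Stack&) { return "float32[2x3]"; }
std::unique_ptr<FusedKernel> compileKernel(const Graph&, const std::string&) {
  return std::unique_ptr<FusedKernel>(new FusedKernel());
}
void launch(const FusedKernel&, Stack&) {}

// Registration: store the graph in the device-independent cache.
int64_t registerFusion(std::shared_ptr<Graph> graph) {
  std::lock_guard<std::mutex> lock(cache_mutex);
  const int64_t key = static_cast<int64_t>(specs.size());
  auto spec = std::make_shared<FusionSpec>();
  spec->graph = std::move(graph);
  specs[key] = std::move(spec);
  return key;
}

// Execution: shape inference -> fetch or compile a device-specific kernel -> launch.
// Returns false so the caller can invoke the fallback path instead.
bool runFusion(int64_t key, Stack& stack) {
  std::shared_ptr<FusionSpec> spec;
  {
    std::lock_guard<std::mutex> lock(cache_mutex);
    auto it = specs.find(key);
    if (it == specs.end()) return false;
    spec = it->second;
  }
  const std::string sig = argSignature(stack);         // shape inference
  std::lock_guard<std::mutex> lock(spec->mutex);
  auto it = spec->kernels.find(sig);
  if (it == spec->kernels.end()) {
    auto kernel = compileKernel(*spec->graph, sig);    // device- and arg-specific
    if (!kernel) return false;                         // e.g. fusion disabled on this device
    it = spec->kernels.emplace(sig, std::move(kernel)).first;
  }
  launch(*it->second, stack);                          // defers to device-specific logic
  return true;
}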
Bug fixes:
- Thread-safety is improved, with caches guarded against concurrent access.
- The nvrtc version is now checked to determine the appropriate compute architecture to compile for, fixing a bug that would cause runtime errors if a user's nvrtc didn't support the compute architecture their GPU reported (a rough sketch of the idea follows below).
- An issue with DeviceGuard not setting the device properly and failing silently is worked around (ezyang mentioned he was reviewing the dynamic registration DeviceGuard uses, which may resolve the issue).

Code/Logical changes:
- "const" now appears in many more places. (Note: I cast const away in operator.h because of some obscure build issues -- I think we should be able to fix this and will take a look while this goes through testing.)
- The new flow allowed some redundant code to be removed (AnnotatedGraph is gone, for example, and the more straightforward flow eliminated duplication of effort elsewhere).
- Fallback logic is now also invoked if a fusion is requested on a device that cannot handle fusions.
- Use of macros to determine which files are compiled is reduced (though they may come back if the Windows build is unhappy).
- There is no more "common" code or folder; the device-independent logic at the forefront of the fuser replaces and improves upon the goal of sharing code.

Thanks to:
- apaszke, who I promised naming rights to
- zdevito, who correctly pointed out that the device-independent logic should be the bulk of what the fuser is doing
- ngimel, who contributed to the design of this architecture

Pull Request resolved: https://github.com/pytorch/pytorch/pull/13108
Reviewed By: gchanan, fmassa
Differential Revision: D12850608
Pulled By: soumith
fbshipit-source-id: 24e2df6dfa97591ee36aeca8944519678c301fa3
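For the nvrtc bug fix above, a minimal sketch of the clamping idea, assuming NVRTC 7.x can target up to sm_5x, 8.x up to sm_6x, and 9.x up to sm_7x (the exact cutoffs and the function name targetArchFor are illustrative, not the fuser's actual code):

#include <algorithm>
#include <utility>
#include <cuda_runtime.h>
#include <nvrtc.h>

// Pick the (major, minor) compute capability to hand to NVRTC for `device`,
// capped by what the installed NVRTC release can compile for, so compilation
// doesn't fail at runtime on a newer GPU paired with an older NVRTC.
std::pair<int, int> targetArchFor(int device) {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, device);

  int nvrtc_major = 0, nvrtc_minor = 0;
  nvrtcVersion(&nvrtc_major, &nvrtc_minor);

  // Assumed cutoffs: NVRTC 7.x -> sm_5x, 8.x -> sm_6x, 9.x -> sm_7x.
  int max_major = 5;
  if (nvrtc_major >= 9)      max_major = 7;
  else if (nvrtc_major == 8) max_major = 6;

  const int major = std::min(prop.major, max_major);
  const int minor = (major == prop.major) ? prop.minor : 0;
  return std::make_pair(major, minor);
}

The returned pair would then be turned into an NVRTC target option such as --gpu-architecture=compute_<major><minor> when compiling the generated kernel.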
Diffstat (limited to 'torch/CMakeLists.txt')
-rw-r--r--  torch/CMakeLists.txt  |  20
1 file changed, 10 insertions, 10 deletions
diff --git a/torch/CMakeLists.txt b/torch/CMakeLists.txt
index 08f0347051..01aa644ecb 100644
--- a/torch/CMakeLists.txt
+++ b/torch/CMakeLists.txt
@@ -183,8 +183,8 @@ set(TORCH_SRCS
${TORCH_SRC_DIR}/csrc/jit/passes/shape_analysis.cpp
${TORCH_SRC_DIR}/csrc/jit/passes/requires_grad_analysis.cpp
${TORCH_SRC_DIR}/csrc/jit/passes/specialize_undef.cpp
- ${TORCH_SRC_DIR}/csrc/jit/fusers/interface.cpp
${TORCH_SRC_DIR}/csrc/jit/passes/pretty_print.cpp
+ ${TORCH_SRC_DIR}/csrc/jit/fuser/interface.cpp
${TORCH_SRC_DIR}/csrc/jit/register_prim_ops.cpp
${TORCH_SRC_DIR}/csrc/jit/register_special_ops.cpp
${TORCH_SRC_DIR}/csrc/jit/scope.cpp
@@ -205,11 +205,12 @@ if (NOT WIN32)
SET(USE_CPU_FUSER 1)
list(APPEND TORCH_SRCS
- ${TORCH_SRC_DIR}/csrc/jit/fusers/common/tensor_desc.cpp
- ${TORCH_SRC_DIR}/csrc/jit/fusers/common/fusion_handle_impl.cpp
- ${TORCH_SRC_DIR}/csrc/jit/fusers/common/fused_kernel.cpp
- ${TORCH_SRC_DIR}/csrc/jit/fusers/cpu/fusion_compiler.cpp
- ${TORCH_SRC_DIR}/csrc/jit/fusers/cpu/fused_kernel.cpp
+ ${TORCH_SRC_DIR}/csrc/jit/fuser/kernel_cache.cpp
+ ${TORCH_SRC_DIR}/csrc/jit/fuser/compiler.cpp
+ ${TORCH_SRC_DIR}/csrc/jit/fuser/executor.cpp
+ ${TORCH_SRC_DIR}/csrc/jit/fuser/codegen.cpp
+ ${TORCH_SRC_DIR}/csrc/jit/fuser/fallback.cpp
+ ${TORCH_SRC_DIR}/csrc/jit/fuser/cpu/fused_kernel.cpp
)
endif()
@@ -218,15 +219,14 @@ if (USE_CUDA AND NOT USE_ROCM AND NOT WIN32)
SET(USE_CUDA_FUSER 1)
list(APPEND TORCH_SRCS
- ${TORCH_SRC_DIR}/csrc/jit/fusers/cuda/fusion_compiler.cpp
- ${TORCH_SRC_DIR}/csrc/jit/fusers/cuda/fused_kernel.cpp
+ ${TORCH_SRC_DIR}/csrc/jit/fuser/cuda/fused_kernel.cpp
)
endif()
CONFIGURE_FILE(
- ${TORCH_SRC_DIR}/csrc/jit/fusers/Config.h.in
- ${CMAKE_CURRENT_SOURCE_DIR}/csrc/jit/fusers/Config.h)
+ ${TORCH_SRC_DIR}/csrc/jit/fuser/config.h.in
+ ${CMAKE_CURRENT_SOURCE_DIR}/csrc/jit/fuser/config.h)
if (NOT NO_API AND NOT USE_ROCM)
list(APPEND TORCH_SRCS