Signed-off-by: Edward Z. Yang <ezyang@fb.com>
|
- gloo, pybind11, nanopb and nccl now live in third_party.
- ATen builds in aten/build rather than torch/lib/build/aten
- A bit of faffing about in the scripts was necessary, because they used to assume that everything lived in the same directory. Now you are expected to cd into the correct directory before calling one of the build functions. The actual builder script lives in tools.
- Lint now just unconditionally ignores third_party, rather than enumerating folders explicitly
|
This PR addresses #5648. In particular, following the discussion there:
- it adds Catch as a submodule (https://github.com/catchorg/Catch2) in torch/aten/utils
- it ports all ATen tests to Catch
- it ports torch/csrc/jit/test_jit.cpp to Catch (libtorch only, Python build is unaffected)
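For reference, a test in Catch's style looks like this (a generic sketch, not one of the actual ported tests):

```cpp
// Defining CATCH_CONFIG_MAIN in exactly one translation unit makes Catch
// supply main(), so a test binary needs no extra scaffolding.
#define CATCH_CONFIG_MAIN
#include "catch.hpp"

#include <vector>

TEST_CASE("vector grows when pushed to", "[example]") {
  std::vector<int> v;
  v.push_back(1);
  REQUIRE(v.size() == 1);

  // SECTIONs re-run the enclosing TEST_CASE from the top for each branch.
  SECTION("clearing empties it") {
    v.clear();
    REQUIRE(v.empty());
  }
}
```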
|
#5481 was reverted due to a strange test bug. This PR attempts to fix that.
This diff adds vectorization to ATen. It uses Intel intrinsics to build a general vec256 class that represents 256-bit wide types, which can then be treated like regular variables. Using these, it implements torch.sum() for the contiguous case. It uses Intel TBB for multithreading, which allows work stealing, and it chunks the reduction operations based on an experimentally chosen value (_THRESHOLD). It uses cpuinfo to pick the right code depending on the host's capabilities.
The kernels are implemented under native/cpu. Each .cpp file is compiled three times: with -avx, with -avx2, and with no additional flags. A macro is used to append AVX, AVX2 or NONE to the function name, so the header needs to declare each function three times, once per capability. This could be improved by tweaking the CMake file a bit, or possibly by generating the source with a Python script, etc.
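As a rough sketch of both ideas (the names and signatures below are hypothetical, not the actual ATen declarations):

```cpp
#include <immintrin.h>
#include <cstdint>

// A vec256-style wrapper: a 256-bit register that can be treated like a
// regular variable (illustrative only).
struct Vec256f {
  __m256 values;
  static Vec256f loadu(const float* ptr) { return {_mm256_loadu_ps(ptr)}; }
  Vec256f operator+(const Vec256f& other) const {
    return {_mm256_add_ps(values, other.values)};
  }
};

// The capability-suffix scheme: the same source is compiled three times and
// token pasting produces e.g. sumImplAVX2, sumImplAVX, sumImplNONE, one of
// which is picked at runtime based on what cpuinfo reports.
#define VEC_PASTE_(name, cap) name##cap
#define VEC_CAPABILITY_NAME(name, cap) VEC_PASTE_(name, cap)
// float VEC_CAPABILITY_NAME(sumImpl, CPU_CAPABILITY)(const float* data, int64_t n);
```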
For the non-contiguous case this defaults to the current implementation within TH. For CUDA it entirely defaults to the implementation within THC.
There probably needs to be a bit of a debate around the design decisions here: the additional dependencies, the parallelization strategy, clarity, etc. The numerical results also diverge from NumPy with larger tensors, which is expected since we're summing, for example, 8 numbers at a time and then adding each partial result to the running sum, instead of adding each number one by one. But there might be something to be said about accumulating into a double for floats, the acceptable degree of divergence, the behavior with respect to CUDA, etc.
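To see why the accumulation order matters, here is a small standalone demonstration (not part of this diff) of blocked versus strictly sequential float summation:

```cpp
#include <cstdio>
#include <vector>

int main() {
  // Float addition is not associative, so summing in 8-wide blocks (as a
  // vectorized kernel effectively does) can differ from summing one element
  // at a time. 0.1f is not exactly representable, which makes this visible.
  std::vector<float> v(1 << 20, 0.1f);

  float seq = 0.0f;
  for (float x : v) seq += x;  // strictly sequential running sum

  float lanes[8] = {0};
  for (size_t i = 0; i < v.size(); i += 8)
    for (int j = 0; j < 8; ++j) lanes[j] += v[i + j];  // 8 running sums
  float blocked = 0.0f;
  for (float l : lanes) blocked += l;  // combine the partial sums

  // Both values differ from the exact sum (104857.6) and from each other.
  std::printf("sequential = %.2f\nblocked    = %.2f\n", seq, blocked);
}
```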
I wrote a [small Python script](https://github.com/cpuhrsch/benchmark/blob/sumall/benchmarks/sum_bench.py) to compare the results with NumPy, both numerically and on timing. I ran this script to create timings on both master and this branch.
Here is the command for 1 core:
`OMP_NUM_THREADS=1 taskset -c 0 python sum_bench.py --enable_numpy 200`
Here is the command for all cores:
`python sum_bench.py --enable_numpy 200`
Here are the results of each:
[Master, 1 core](https://paste.fedoraproject.org/paste/Nho9JzHpPVK9av8a6mByjQ)
[This branch, 1 core](https://paste.fedoraproject.org/paste/6xLHkYvcVJx9z~5MoHxN4w)
[Master, all cores](https://paste.fedoraproject.org/paste/5l3V1d5zGqvJcMXIUteMRw)
[This branch, all cores](https://paste.fedoraproject.org/paste/J4RuDU-0Drz0aZwtphQwEA)
To test, the command is:
`python sum_bench.py --test 200`
[This branch, test results](https://paste.fedoraproject.org/paste/kTEoUC~oWgXA6XWMAfNfNw)
For this test we look at the average absolute value of the differences, which does not take the relative magnitude of the numbers into account. The numbers are sampled from a standard normal distribution.
In terms of performance, this diff should bring PyTorch on par with NumPy and usually exceed it by 1.5 to 2x.
|
* Revert "ATen ReduceOps (#5481)"
This reverts commit 310c3735b9eb97f30cee743b773e5bb054989edc.
* Revert "Check that new cpuinfo and tbb submodules exist (#5714)"
This reverts commit 1a23c9901dbfee295bf5b3dad36e4d3ee7e86366.
|
This diff adds vectorization to ATen. It uses Intel intrinsics to build a general vec256 class that represents 256-bit wide types, which can then be treated like regular variables. Using these, it implements torch.sum() for the contiguous case. It uses Intel TBB for multithreading, which allows work stealing, and it chunks the reduction operations based on an experimentally chosen value (_THRESHOLD). It uses cpuinfo to pick the right code depending on the host's capabilities.
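A minimal sketch of the chunked, work-stealing reduction (the grain size and function name here are illustrative, not the actual kernel):

```cpp
#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>
#include <cstdint>

// Illustrative stand-in for the experimentally chosen chunking value.
constexpr int64_t _THRESHOLD = 32768;

// Work-stealing parallel sum: TBB splits [0, n) into chunks no smaller than
// the grain size, sums each chunk, then combines the partial sums.
float parallel_sum(const float* data, int64_t n) {
  return tbb::parallel_reduce(
      tbb::blocked_range<int64_t>(0, n, _THRESHOLD),
      0.0f,
      [=](const tbb::blocked_range<int64_t>& r, float acc) {
        for (int64_t i = r.begin(); i != r.end(); ++i) acc += data[i];
        return acc;
      },
      [](float a, float b) { return a + b; });
}
```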
The kernels are implemented under native/cpu. Each .cpp file is compiled three times: with -avx, with -avx2, and with no additional flags. A macro is used to append AVX, AVX2 or NONE to the function name, so the header needs to declare each function three times, once per capability. This could be improved by tweaking the CMake file a bit, or possibly by generating the source with a Python script, etc.
For the non-contiguous case this defaults to the current implementation within TH. For CUDA it entirely defaults to the implementation within THC.
There probably needs to be a bit of a debate around the design decisions here: the additional dependencies, the parallelization strategy, clarity, etc. The numerical results also diverge from NumPy with larger tensors, which is expected since we're summing, for example, 8 numbers at a time and then adding each partial result to the running sum, instead of adding each number one by one. But there might be something to be said about accumulating into a double for floats, the acceptable degree of divergence, the behavior with respect to CUDA, etc.
I wrote a [small Python script](https://github.com/cpuhrsch/benchmark/blob/sumall/benchmarks/sum_bench.py) to compare the results with NumPy, both numerically and on timing. I ran this script to create timings on both master and this branch.
Here is the command for 1 core:
`OMP_NUM_THREADS=1 taskset -c 0 python sum_bench.py --enable_numpy 200`
Here is the command for all cores:
`python sum_bench.py --enable_numpy 200`
Here are the results of each:
[Master, 1 core](https://paste.fedoraproject.org/paste/Nho9JzHpPVK9av8a6mByjQ)
[This branch, 1 core](https://paste.fedoraproject.org/paste/6xLHkYvcVJx9z~5MoHxN4w)
[Master, all cores](https://paste.fedoraproject.org/paste/5l3V1d5zGqvJcMXIUteMRw)
[This branch, all cores](https://paste.fedoraproject.org/paste/J4RuDU-0Drz0aZwtphQwEA)
To test, the command is:
`python sum_bench.py --test 200`
[This branch, test results](https://paste.fedoraproject.org/paste/kTEoUC~oWgXA6XWMAfNfNw)
For this test we look at the average absolute value of the differences, which does not take the relative magnitude of the numbers into account. The numbers are sampled from a standard normal distribution.
In terms of performance, this diff should bring PyTorch on par with NumPy and usually exceed it by 1.5 to 2x.
|
Caffe2 ARM Compute Library Integration
|
Summary:
Include six, enum34, and PeachPy as Caffe2 submodules, and use the versions from the submodules instead of downloading them at configuration time
Closes https://github.com/caffe2/caffe2/pull/1917
Reviewed By: orionr
Differential Revision: D6938735
Pulled By: Maratyszcza
fbshipit-source-id: 841a6c47a1cd003a19f48f6c256aa4d9eb2cc6e4
|
Summary:
Original commit changeset: d0c1c7681605
Reverting because this commit broke the OSS build
Reviewed By: bddppq
Differential Revision: D6935666
fbshipit-source-id: 955cfeb6d5a4ed265b2e099094cfb5bfe960ff95
|
Summary:
Include six, enum34, and PeachPy as Caffe2 submodules, and use the versions from the submodules instead of downloading them at configuration time
Closes https://github.com/caffe2/caffe2/pull/1901
Differential Revision: D6930731
Pulled By: Maratyszcza
fbshipit-source-id: d0c1c7681605d957de6f51bd24fbb25afc0f282f
|
Summary:
This is in order for us to share the compression ops with OSS.
Closes https://github.com/caffe2/caffe2/pull/1463
Reviewed By: hlu1
Differential Revision: D6319101
Pulled By: Yangqing
fbshipit-source-id: 16c94e71fc3efe256054a648170aaf7702e5bcfe
|
Summary:
This operator allows the use of Torch's underlying TH libraries (TH, THC, THNN, and THCUNN)
through the ATen tensor library. Use of the operator is described in the README.
The operator itself is generated from ATen's Declarations.yaml file, which describes ATen's public API.
Closes https://github.com/caffe2/caffe2/pull/1235
Reviewed By: dzhulgakov
Differential Revision: D5876944
Pulled By: zdevito
fbshipit-source-id: b558e8563a5e82a0e6278705a4a359bd7df4e70a
|
Summary: TSIA
Reviewed By: Yangqing
Differential Revision: D5815624
fbshipit-source-id: 1a6c0e471eac778aeac80001eac947178fc105ed
|
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
|
We make gloo a submodule because it contains submodules itself, and
Git cannot handle subtrees with nested submodules.
Fixes https://github.com/pytorch/pytorch/issues/2426
|
Summary:
(1) Changed android-cmake to use Yangqing/android-cmake, which supports NEON fp16.
(2) Added cmake scripts to build opengl.
(3) Updated nnpack to master, and changed the corresponding build files.
Closes https://github.com/caffe2/caffe2/pull/1061
Differential Revision: D5591387
Pulled By: Yangqing
fbshipit-source-id: 1d3f28511d33c09df6ecef5041448ac9a3246601
|
Summary:
This also requires a change to cmake/External/nccl.cmake to use the
static NCCL binary instead of the shared object. When the Caffe2/Gloo
build uses the bundled NCCL version, it should be packaged up in the
resulting libraries and not cause another runtime dependency on a
library that has to be installed separately.
Closes https://github.com/caffe2/caffe2/pull/218
Differential Revision: D4769926
Pulled By: pietern
fbshipit-source-id: 5c85559992c200d874f4218724823815ffb5adb5
|
Summary:
Added a CMake script for Android under scripts, and set up the Travis contbuild target.
Closes https://github.com/caffe2/caffe2/pull/109
Reviewed By: bwasti
Differential Revision: D4468767
Pulled By: Yangqing
fbshipit-source-id: 709f3eb6be24727b0a989d0901dbf377871b122a
|
Pinned protobuf to v3.1.0
Removed the USE_SYSTEM_PROTOBUF option in cmake. It is no longer used.
|
now googletest
|
is now eigen.
|
(1) nccl submodule, cnmem submodule
(2) mpi ops fallback test
(3) a bit more blob interface
(4) fixed tests
(5) caffe2.python.io -> caffe2.python.dataio to avoid name conflicts
(6) In the build system, autogenerate __init__.py instead of having manual
rules just to copy over an empty __init__.py.