* Make AT_ASSERT/AT_ERROR non-printf based, other tweaks
- AT_ASSERT/AT_ERROR don't take printf strings anymore; instead,
they take a comma-separated list of things you want to print
(bringing it in line with Caffe2's conventions).
Instead of AT_ASSERT(x == 0, "%d is not zero", x)
you write AT_ASSERT(x == 0, x, " is not zero")
This is done by way of a new variadic template at::str(), which
takes a list of arguments and concatenates their string representations
(as per operator<<); see the sketch after this list.
- A bunch of the demangling logic that was in Error.h is now
moved to Error.cpp (better header hygiene). Also, demangle
has been moved out into its own helper function, and a new
helper demangle_type (from Caffe2) has been added.
- A bunch of AT_ASSERTs were converted into AT_CHECK, to more properly
convey which checks can be caused by user error and which are
due to logic errors in ATen.
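For reference, a minimal sketch of the kind of variadic at::str() described above; the placement and internals here are assumptions, not the exact ATen code:

#include <sstream>
#include <string>

namespace at {

inline void _str(std::ostream&) {}

template <typename T, typename... Args>
void _str(std::ostream& os, const T& t, const Args&... rest) {
  os << t;            // rely on each argument's operator<<
  _str(os, rest...);
}

// Concatenates the operator<< representations of all arguments.
template <typename... Args>
std::string str(const Args&... args) {
  std::ostringstream os;
  _str(os, args...);
  return os.str();
}

} // namespace at

With something like this, AT_ASSERT(x == 0, x, " is not zero") can build its message as at::str(x, " is not zero").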
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* CR
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Fix test failure.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* buildfix
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* More fixes.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* One more fix
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Try harder
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
|
|
Thank you ngimel and zou3519!
|
|
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
|
|
* Add max mode support to EmbeddingBag
* Lint fix
* Fix compilation issue on other platforms
* Rebase + don't waste memory when not in max mode
* Oops, missed a spot
* Fix whitespace from merge
* less precision
* Lower precision to avoid spurious failures
* Minor typo
* Switch to size()
|
|
* Add autograd API to at::Tensor
* Trying to fix linker errors on Windows
* Add AT_API to set_data
|
|
Changelist:
- Move *.c to *.cpp
- Change includes of ".c" to ".cpp"
- A bunch of CMake configuration that modified CMAKE_C_FLAGS was changed
to CMAKE_CXX_FLAGS or add_compile_options, because CMAKE_C_FLAGS only applies when compiling C code
- Explicitly cast void* to T* in a number of places
- Delete extern "C" { ... } blocks; instead, properly apply TH_API to everything that should have it (TH_API handles extern "C")
- Stop using stdatomic.h; use <atomic> instead. This required a bunch of placement new/delete to be "totally properly correct"
- Refactor of THLongStorageView to not have static constructor methods (since it no longer has a copy/move constructor)
- Documentation about how the TH C interface (and extern C business) works
- Note that THD master_worker mode is dead
- C++ headers in TH libraries are given .hpp suffix, to make it less likely that you'll confuse them with the C-compatible headers (now suffixed .h)
- New function THCStream_stream and THCStream_device to project out fields of THCStream instead of accessing fields directly
- New function THStorage_(retainIfLive), which is equivalent to a retain, but only if the refcount is greater than zero (see the sketch after this list).
- In general, I tried to avoid using hpp headers outside of ATen/TH. However, there were a few places where I gave up and depended on the headers for my own sanity. See Note [TH abstraction violation] for all the sites where this occurred. All other sites were refactored to use functions
- Some extra Werror fixes (char* versus const char*)
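A hedged sketch of the retain-if-live idea from the THStorage_(retainIfLive) item above; the real function lives in TH and uses TH's own atomics and storage struct, so the names and fields here are illustrative assumptions:

#include <atomic>

struct Storage {
  std::atomic<int> refcount;
  // ... data pointer, size, allocator, etc.
};

// Returns true if the storage was still live (refcount > 0) and an
// additional reference was taken; returns false if it was already dead,
// in which case no reference is taken.
bool retainIfLive(Storage* s) {
  int count = s->refcount.load();
  while (count > 0) {
    if (s->refcount.compare_exchange_weak(count, count + 1)) {
      return true;  // bumped the refcount from a live value
    }
    // compare_exchange_weak reloaded `count`; retry while still live
  }
  return false;
}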
|
|
* Enable WERROR in tests
* Also set WERROR=1 for cpp_build in CI
* Enable Werror after the compiler checks
* Remove -DWERROR because its picked up from the env var
* Had to fix some errors in aten/contrib/data
* Allow an uninitialized variable in ReduceOpsKernel.cpp
* Use CUDNN_DATA_UINT8 in cuDNN type string conversion
* Fixes and use target_compile_options
* Fix uninitialized variables in THNN
* Include Python.h earlier in tensor_types.cpp
* Use CUDNN_VERSION 7100 instead of 7000?
* More Python.h includes
* Make switch case in common_subexpression_elimination.cpp exhaustive
* Build with WERROR=0 just to see all the warnings
* Remove some Python includes
* Enable WERROR=1 again
* Bring back switch case default
|
|
|
|
* Implement matmul_out and dot_out.
* Fix autograd by only calling _out variants if we have an out ourselves.
* Disallow mismatched types in dot_out.
* Make sure out variant doesn't have a method.
* Do proper type conversion.
|
|
|
|
* Enhance diagonal
This patch
- adds Tensor.diagonal to complement torch.diagonal
- implements diagonal natively in ATen
- makes diagonal a view (see the sketch after this list)
- implements taking arbitrary diagonals
- implements diagonal backward instead of referring
to the (more limited) diag
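A sketch of how an arbitrary diagonal can be exposed as a view, in the spirit of the native implementation referred to above; written against the current ATen API, with dim wrapping and argument checks omitted, and the helper name being an assumption (not the PR's code):

#include <ATen/ATen.h>
#include <algorithm>

// Assumes dim1 != dim2 and both are already non-negative.
at::Tensor diagonal_view(const at::Tensor& self, int64_t offset,
                         int64_t dim1, int64_t dim2) {
  auto sizes = self.sizes().vec();
  auto strides = self.strides().vec();
  int64_t storage_offset = self.storage_offset();
  int64_t diag_size;
  if (offset >= 0) {
    // Positive offsets move the starting point along dim2.
    diag_size = std::max<int64_t>(std::min(sizes[dim1], sizes[dim2] - offset), 0);
    storage_offset += offset * strides[dim2];
  } else {
    // Negative offsets move the starting point along dim1.
    diag_size = std::max<int64_t>(std::min(sizes[dim1] + offset, sizes[dim2]), 0);
    storage_offset -= offset * strides[dim1];
  }
  // Walking the diagonal advances dim1 and dim2 simultaneously.
  int64_t diag_stride = strides[dim1] + strides[dim2];
  // Replace the two matrix dims by a single diagonal dim (appended last).
  sizes.erase(sizes.begin() + std::max(dim1, dim2));
  strides.erase(strides.begin() + std::max(dim1, dim2));
  sizes.erase(sizes.begin() + std::min(dim1, dim2));
  strides.erase(strides.begin() + std::min(dim1, dim2));
  sizes.push_back(diag_size);
  strides.push_back(diag_stride);
  // No copy: the result shares storage with self, so it is a true view.
  return self.as_strided(sizes, strides, storage_offset);
}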
* add tests, copy diagonal code to backward for double differentiability
* improve tests and doc comment. Thank you, Adam!
* Mark diagonal as view function in gen_autograd.py, use simple backward.
|
|
|
|
|
|
* Refactor standard_gamma and implement CUDA gamma sampling
* Attempt fixes for AT_CUDA_ENABLED changes
* Gamma cuda and cpu forward as ATen native
* implement standard_gamma_grad_cuda
* update native_test.cpp, try to fix windows and various cuda version compiles
* searching a windows fix via CI... use std:: for math
* casting some constants in the calculation, compute at float for half precision
* whitespace fixes
* add acctype to do half->float computation, include HALF in generation, cast locally rather than tensors
* fix cuda8 half compilation
* always use scalar_cast with CUDACC, lock CPU generator, CPU acctype = double
Thank you for your review comments!
|
|
|
|
|
|
"maybeOverlappingIndices" (#6953)
* Changes incorrect "overlappingIndices" call to correct "maybeOverlappingIndices"
THE PROBLEM
The current overlappingIndices() is meant to detect if a tensor defines multiple valid indices for the same data element. There are two significant issues with this function:
(1) The algorithm it attempts to implement cannot do this.
(2) That algorithm is not implemented correctly.
This call is used by pointwiseApply() and scatter(). If a tensor is readable/writable and detected as overlapped, these algorithms will create a non-overlapped copy of it to work on. When tensors are improperly identified as overlapped, this causes extra work. If tensors are improperly identified as non-overlapped, the operations may exhibit unexpected behavior.
For example,
ref = torch.arange(0, 32 * 5).view(4, 8, 5).cuda().double()
p = ref[:,:,::2]
p += 1
Results in a call to pointwiseApply1, which detects p as an overlapped tensor (it is not), causing a call to pointwiseApply2 that copies it into a non-overlapped temporary, and then another call to pointwiseApply2 later that copies it back to the original tensor. If, however, the original tensor is given dimensions of (4, 8, 4), instead, it is correctly detected as non-overlapped and only a single pointwiseApply1 call is made.
DISCUSSION + FIX
The algorithm that overlappingIndices() attempts to implement tests for a sufficient but not necessary condition of a tensor to be non-overlapping. That is, if its algorithm were implemented properly then it would be a conservative check that would ensure all overlapped tensors were copied (as desired), but also that some non-overlapped tensors were copied too.
The algorithm can be thought of as trying to test whether the dimensions can be ordered like "nesting dolls," with each dimension fitting within the next one larger than it. If this is true then the tensor is non-overlapping, but if it's false the tensor may or may not be overlapped. For example, a tensor with dims (2, 3) and strides (4, 3) cannot be "nested," but is non-overlapping. (The tensor looks like [[0, 3, 6], [4, 7, 10]].)
The algorithm is currently implemented improperly, as can be seen in the example above. The tensor p has dimensions [4, 8, 3] and strides [40, 5, 2]. This confuses the current implementation, which thinks the innermost dimension needs a stride of 6, which is incorrect. The first row is [0, 2, 4] and the next row begins with 5. The current implementation also improperly implemented its sorting behavior. (qsort comparators require -1, 0, and 1, not true/false return values.)
Fixing the existing algorithm is straightforward (and what this PR does, see below), but it is important to note that the algorithm never performed as intended, so its name and the documentation around it have been updated, too. A natural question is whether it's possible to write an efficient overlappingIndices(), and I believe the answer is "no." Disambiguating overlapping from non-overlapping tensors is equivalent to finding a nonzero solution to a linear Diophantine equation with restricted coefficients, that is, an equation of the form x_0*s_0 + x_1*s_1 + ... = 0, where s_i is the stride in dimension i and x_i is an integer in [-size_i + 1, size_i - 1].
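To make the "nesting dolls" test concrete, here is one way to write such a conservative check; this is an illustration of the idea described above, not necessarily the exact code in the PR, and the struct/function names are assumptions:

#include <algorithm>
#include <cstdint>
#include <vector>

struct SizeAndStride { int64_t size; int64_t stride; };

// false => provably non-overlapping; true => maybe overlapping
// (callers should be conservative and copy).
bool maybeOverlappingIndices(std::vector<SizeAndStride> dims) {
  // Size-1 dimensions contribute no extra indices; ignore them.
  dims.erase(std::remove_if(dims.begin(), dims.end(),
                            [](const SizeAndStride& d) { return d.size <= 1; }),
             dims.end());
  // Sort by stride, ascending. Note that a C++ comparator returns bool,
  // unlike the -1/0/+1 contract of qsort that the old code violated.
  std::sort(dims.begin(), dims.end(),
            [](const SizeAndStride& a, const SizeAndStride& b) {
              return a.stride < b.stride;
            });
  int64_t maxInnerOffset = 0;  // largest offset reachable using the inner dims
  for (const SizeAndStride& d : dims) {
    if (d.stride <= maxInnerOffset) {
      return true;  // this dimension can collide with the inner ones
    }
    maxInnerOffset += (d.size - 1) * d.stride;
  }
  return false;
}

On the example above, p with sizes [4, 8, 3] and strides [40, 5, 2] passes (2 > 0, 5 > 4, 40 > 39) and is reported non-overlapping, while the (2, 3)/(4, 3) example is conservatively reported as maybe overlapping even though it is not.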
Another note is that the CPU does not perform this check. For example, if we run:
a = torch.FloatTensor([[0,1], [10, 11]])
b = torch.FloatTensor([[0,0],[0,0]])
b = b.set_(a.storage(), storage_offset=0, size=a.size(), stride=(1,1))
b += 1
Then b is [[1, 3], [3, 11]] because the operation is applied twice to the second element of the original tensor. This causes no warning.
Since the CPU does not perform a similar check, another question is whether the GPU code should remove its check. While it may seem that writing to overlapping tensors is an error state, running test_cuda.py reveals 171 instances of possibly overlapped tensors being copied by pointwiseApply(). (The prior incorrect version has 176 copies.) Allowing writing to overlapped tensors on the GPU may violate assumptions about memory accesses, too. In fairness, these assumptions may be violated on the CPU already.
Leaving the CPU vs GPU behavior question for the future, this fix corrects the current intended GPU behavior. This means that there will be fewer unnecessary copies and no chance of an overlapped tensor sneaking through on the GPU. The CPU behavior remains unchanged. The fix also adds a test to test_cuda.py to ensure that overlapped tensors on the GPU are written to as expected.
* cleanup
* Fixes Python formatting
|
|
* Make CUDA 9 behave as CUDA 8 wrt half conversions
CUDA 9 is too smart about implicit half conversions; this disables them so that CUDA 8 and CUDA 9 behave in the same way wrt half.
* try fixing windows build
* one more broken conversion
|
|
It was unclear to me whether the "viewed" tensor was the input or the output.
|
|
PyTorch uses THC's THCStream API.
|
|
THC had a concept of per-device per-stream scratch space that was
persistent in THCState. This was useful before the caching allocator
because it avoided synchronizations in kernels that needed temporary
scratch space. However, it's not thread-safe since multiple threads can
operate on the same stream: In a two-pass reduction the scratch space
may get clobbered in between the two kernels.
This removes the scratch space and just uses THCudaMalloc and THCudaFree
within the reductions.
I've kept THCState_getCurrentDeviceScratchSpaceSize for now since it's
useful to have the temporary buffer be sized based on the number of SMs.
|
|
TH_TENSOR_APPLY_REDUCTION_OMP (#6946)
|
|
|
|
|
|
* add threshold for ops using omp macro
* modify interface for ops using omp macro
* modify some thresholds
* implement C macros with optional parameters to avoid duplicating definitions for all pointwise operations
* add a parameter of LAB_IMPLEMENT_BASIC_FUNCTION for vectorizing
* modify the comment
* Revert "add a parameter of LAB_IMPLEMENT_BASIC_FUNCTION for vectorizing"
Modify macro LAB_IMPLEMENT_VECTORIZED_FUNCTION to enable optional parameters
This reverts commit 8ef783a0cc67b653c435e64a3beb6866a6b4216d.
Conflicts:
aten/src/TH/generic/THTensorMath.c
* fix build error on windows
* retrigger the test
|
|
* [aten] Move submodules to third_party
* [aten] Update aten_mirror.sh script for third_party
* [aten] Move ATen submodules def to root and rename
* [aten] Update cpuinfo cmake build
* [aten] Fix cpuinfo cmake build
* Update third_party/cpuinfo to d03d5d296063063c66877fb559cf34469734e3e1
* [aten] Fix JIT test reference to catch
|
|
Sebastian Messmer noticed that these iterators were writeable by
default, which seemed dangerous. Replaced with const iterators.
This doesn't seem to affect any ATen code; seems reasonable enough.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
|
|
|
|
This class as well as several functions using it appear to not be used. This is simply code cleanup.
Testing:
All tests in test_cuda.py pass.
|
|
* add static linkage option for CUDA libs
* add CuFFT linking via fakelink
* remove warning for 5.0 cuda architecture
|
|
|
|
* Fix debug build for Windows
* Fix for wrong placement
* Fix variable name
|
|
Fixes #6759.
Before, `tensor.chunk(0)` would cause a divide by 0.
`tensor.chunk(-1)` would throw an error complaining that "split_size
needs to be positive".
This PR changes it so that the error message makes it clear that
`chunks` has to be greater than 0.
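A minimal sketch of the kind of check described above (the exact message and where it lives in ATen are assumptions):

#include <ATen/ATen.h>
#include <vector>

std::vector<at::Tensor> chunk(const at::Tensor& self, int64_t chunks, int64_t dim = 0) {
  AT_CHECK(chunks > 0,
           "chunk expects `chunks` to be greater than 0, got: ", chunks);
  // Ceiling division; with chunks == 0 this division previously blew up.
  int64_t split_size = (self.size(dim) + chunks - 1) / chunks;
  return self.split(split_size, dim);
}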
|
|
* Eliminate handle_zero_dim when broadcasting is applied earlier.
Eliminating it ends up not actually changing anything unless all the broadcasted tensors are scalars,
and in that case only the behavior was inconsistent, because the type promotion rules are different.
This is better solved with real type promotion logic.
* Change type of script comparison to long.
* Fix jit tests.
* Fix cpp jit test by being consistent about long-vs-float.
* Consistent float and long.
* Use int64_t rather than long.
|
|
|
|
* Sort declarations when generating Python bindings
This helps resolve ambiguities in argument parsing according to
any rules we will need.
For now, this allows us to make scalar operations more conservative
w.r.t. argument types, but makes them commutative again.
* Fix inconsistencies between mod with tensor and scalar
* Fix a stupid mistake
|
|
* Add mutex to THC random number generator
* Add test for CUDA RNG multithread
* fix lint
* Rename gen_state to state and remove unnecessary mutex lock
* Remove RNG test from cpp_extensions
* Add CUDA RNG test to libtorch
* Build test_rng only if CUDA exists
* Move test to aten/src/ATen/test/
* Separate ATen build and test, and run ATen test in CI test phase
* Don't test ATen in ASAN build
* Fix bug in ATen scalar_test
* Fix bug in ATen native_test
* Add FIXME to some CUDA tests in scalar_tensor_test
* Valgrind doesn't work well with CUDA, seed the CPU and CUDA RNG separately instead
|
|
* start at generic trilinear
* Implement einsum (fixes #1889)
This provides a simple implementation of einsum. It is built on
top of the work for computing bilinear (#6110).
It uses a naive left-to-right resolution at the moment.
Autograd is able to differentiate by itself.
The obvious unsupported feature is taking diagonals (einsum('ii->i', (a,))); see the usage sketch after this list.
* add tests and docs
* fix flake8
* clean diff
* rebase on current master to resolve conflicting String wrapping
* clean up after rebase
* better commentary in einsum and sumproduct_pair
* don't say fixme if it's fixed and rename num_outputs to num_output_dims
* adapt python wrapper to use std::string instead of String to avoid typedef at::String
* typos and some vector to array conversion
* fix accidental python<->python3 change
* really fix bad rebase
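A small usage sketch of the feature described above, assuming the native function is exposed as at::einsum taking an equation string and a TensorList (an assumption for illustration):

#include <ATen/ATen.h>

void einsum_example() {
  auto a = at::randn({2, 3});
  auto b = at::randn({3, 4});
  // Matrix multiplication written as an einsum equation; per the note
  // above it is resolved naively left-to-right into pairwise sum-products.
  auto c = at::einsum("ij,jk->ik", {a, b});   // same result as a.mm(b)
  // Diagonals such as "ii->i" are not supported by this first version.
}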
|
|
|
|
google perf tools. Looks like a harmless fix (#6676)
|
|
It solves the problem of chaining externally defined functions.
|
|
This makes it compatible with arange and numpy.random.permutation
|
|
* add cuda trtrs
* remove queue
* add test trtrs
|
|
* Add dtypes (with reasonable defaults) to sum, prod, cumsum, cumprod.
This adds optional dtypes to torch.sum, torch.prod, torch.cumsum, torch.cumprod.
By default, the dtype is torch.float64 for integral types, and the dtype of the input for floating point types.
* Don't use optional<ScalarType>, because the jit can't handle it yet.
Instead, we manually build the overloads. This is fairly painful because of default arguments, but should be easy to pull out once the jit can handle optional<ScalarType>.
* Fix keepdim with out parameters.
* Fix _cudnn_rnn_flatten_weight.
* If dtype is provided to an out function, make sure it matches the dtype of the result (see the sketch after this list).
* Fix typo.
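A sketch of the dtype-vs-out consistency check mentioned above; this is a hypothetical standalone helper for illustration, whereas the real check is emitted per overload by the code generator:

#include <ATen/ATen.h>

static void check_out_dtype(const at::Tensor& result, at::ScalarType dtype) {
  AT_CHECK(result.scalar_type() == dtype,
           "dtype ", at::toString(dtype),
           " does not match the dtype of the provided out tensor (",
           at::toString(result.scalar_type()), ")");
}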
|
|
Fixes #5748.
Added an unsafe version so embedding isn't slowed.
* Create safe and unsafe versions of sparse_coo_tensor
* rename sparse_coo_tensor_unsafe to _sparse_coo_tensor_unsafe
* refactor
* make helper static inline
* add sparse size check test
* fix lint
|
|
The current implementation of bilinear uses a matrix multiplication approach. This creates a large intermediate matrix (batch * output dimension * input dimension). Relative to the previous pure Python approach, this caused a severe performance regression (600ms vs. 18ms for 300x100x200 weights and a batch of 50 on CPU, and also quadratic memory).
The attached change restores the performance using the previous strategy of looping over output features. It implements forward, backward, and double backward as native ATen code.
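A sketch of the loop-over-output-features strategy described above, which avoids materializing a (batch x out x in) intermediate; written against the current ATen API and purely illustrative, not the PR's native _trilinear code:

#include <ATen/ATen.h>

// x1: (B, I1), x2: (B, I2), weight: (O, I1, I2), bias: (O)
at::Tensor bilinear_looped(const at::Tensor& x1, const at::Tensor& x2,
                           const at::Tensor& weight, const at::Tensor& bias) {
  auto out = at::empty({x1.size(0), weight.size(0)}, x1.options());
  for (int64_t l = 0; l < weight.size(0); ++l) {
    // out[:, l] = sum_j ((x1 @ weight[l]) * x2)[:, j]
    out.select(1, l).copy_((x1.mm(weight[l]) * x2).sum(1));
  }
  return out + bias;  // broadcast bias over the batch dimension
}

Each iteration only needs a (batch x in) temporary, so peak memory stays linear in the input size instead of growing with the number of output features.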
Credits:
Martin Tutek reported the regression and pinpointed the problem
Adam Paszke patiently answered my questions about ATen
I would not have been able to prepare this without you, thank you!
I referenced the old python implementation, used a python version of the naive implementation, and coded manual functions etc.
The tests have gradgradcheck etc.
* fix memory use of native bilinear
* bilinear double backward
* Move bilinear_double_backward to Functions.cpp
Addresses review comment by Tongzhou Wang. Thank you!
* add WrapDimUtilsMulti.h
* start at generic trilinear
* move to generic trilinear
* catch up on dim_list_to_bitset
* switch bilinear to use _trilinear implement _trilinear_backward
* add comments to Linear.cpp, move _trilinear in yaml
|
|
* More precise digamma
Fixes #6190.
This is a rebase of #3955 with some tweaks for better performance around
poles. The code is ported over from cephes with permission.
By itself, the cephes code returns inf for the poles.
For better performance around the poles with float32, one intermediate
step is always computed in double precision, regardless of dtype.
This step computes PI / tan(PI * input). This is necessary because small (1e-6)
rounding errors in the inputs to tan have strong effects on the output
(i.e., the derivative of tan is very large at some points); see the sketch after this list.
* Replace usages of finite-differences digamma with newly implemented digamma
* Better behavior near and at poles
* ScalarConvert -> scalar_cast for readability
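A small sketch of the double-precision step described above; the surrounding cephes-derived series is omitted and the helper name is an assumption, not the kernel code itself:

#include <cmath>

// Compute the pole-sensitive term in double even for float32 inputs,
// since tiny rounding errors in pi * x are amplified by tan near a pole.
float digamma_reflection_term(float x) {
  const double pi = 3.141592653589793;
  double r = pi / std::tan(pi * static_cast<double>(x));
  return static_cast<float>(r);
}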
|
|
* use THC allocation for CUFFT
* use auto& instead
|
|
|