path: root/aten/src
Age | Commit message | Author | Files | Lines
2018-05-01 | Removing references to CUDA_SDK_ROOT_DIR to see if it breaks anything (#7125) | Paul Jesse Hellemn | 1 | -1/+0
2018-05-01 | Make AT_ASSERT/AT_ERROR non-printf based, other tweaks (#7104) | Edward Z. Yang | 36 | -352/+400
* Make AT_ASSERT/AT_ERROR non-printf based, other tweaks - AT_ASSERT/AT_ERROR don't take printf strings anymore; instead, they take a comma-separated list of things you wanted to print (bringing it inline with Caffe2's conventions). Instead of AT_ASSERT(x == 0, "%d is not zero", x) you write AT_ASSERT(x == 0, x, " is not zero") This is done by way of a new variadic template at::str(), which takes a list of arguments and cats their string reps (as per operator<<) together. - A bunch of the demangling logic that was in Error.h is now moved to Error.cpp (better header hygiene.) Also, demangle has been moved out to its own helper function, and also a new helper demangle_type (from Caffe2) added. - A bunch of AT_ASSERT converted into AT_CHECK, to more properly convey which checks can be caused by user error, and which are due to logic error in ATen. Signed-off-by: Edward Z. Yang <ezyang@fb.com> * CR Signed-off-by: Edward Z. Yang <ezyang@fb.com> * Fix test failure. Signed-off-by: Edward Z. Yang <ezyang@fb.com> * buildfix Signed-off-by: Edward Z. Yang <ezyang@fb.com> * More fixes. Signed-off-by: Edward Z. Yang <ezyang@fb.com> * One more fix Signed-off-by: Edward Z. Yang <ezyang@fb.com> * Try harder Signed-off-by: Edward Z. Yang <ezyang@fb.com>
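The calling convention described in #7104 can be mimicked in a few lines of Python. This is only an analogy under stated assumptions: `cat_str`, `CheckError` and `check` are hypothetical names made up for illustration and are not part of ATen; the real `at::str()` is a C++ variadic template that formats each argument with `operator<<`.

```python
def cat_str(*parts):
    """Concatenate the string representation of every argument (hypothetical helper)."""
    return "".join(str(part) for part in parts)


class CheckError(RuntimeError):
    """Hypothetical error type standing in for the error an assertion macro raises."""


def check(condition, *parts):
    """Raise CheckError with a message built from `parts` when `condition` is false."""
    if not condition:
        raise CheckError(cat_str(*parts))


x = 3
try:
    # Old style would have been a printf-like format string: "%d is not zero" % x
    check(x == 0, x, " is not zero")
except CheckError as err:
    message = str(err)
```

The message is only assembled on the failure path, so a passing check never pays for string formatting.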
2018-04-30 | fix max/min on cuda in presence of NaN (fixes #6996) (#7052) | Thomas Viehmann | 2 | -6/+24
Thank you ngimel and zou3519!
2018-04-30 | Delete unnecessary header includes. (#7094) | Edward Z. Yang | 1 | -6/+0
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2018-04-29 | Add max pooling support to EmbeddingBag (#5725) | Ethan Steinberg | 3 | -128/+300
* Add max mode support to EmbeddingBag * Lint fix * Fix compilation issue on other platforms * Rebase + don't waste memory when not in max mode * Oops, missed a spot * Fix whitespace from merge * less precision * Lower precision to avoid spurious failures * Minor typo * Switch to size()
2018-04-28 | Add autograd API to at::Tensor (#6582) | Peter Goldsborough | 3 | -6/+84
* Add autograd API to at::Tensor * Trying to fix linker errors on Windows * Add AT_API to set_data
2018-04-28 | Make all of TH and THC C++. (#6913) | Edward Z. Yang | 197 | -760/+789
Changelist: - Move *.c to *.cpp - Change includes of ".c" to ".cpp" - A bunch of cmake configuration modifying CMAKE_C_FLAGS changed to CMAKE_CXX_FLAGS or add_compile_options, because if you do CMAKE_C_FLAGS it only applies when you compile C code - Explicitly cast void* to T* in a number of places - Delete extern "C" { ... } blocks; instead, properly apply TH_API to everything that should have it (TH_API handles extern "C") - Stop using stdatomic.h, instead, use <atomic>. This resulted in a bunch of placement-new/delete to be "totally properly correct" - Refactor of THLongStorageView to not have static constructor methods (since it no longer has a copy/move constructor) - Documentation about how the TH C interface (and extern C business) works - Note that THD master_worker mode is dead - C++ headers in TH libraries are given .hpp suffix, to make it less likely that you'll confuse them with the C-compatible headers (now suffixed .h) - New function THCStream_stream and THCStream_device to project out fields of THCStream instead of accessing fields directly - New function THStorage_(retainIfLive), which is equivalent to a retain but only if the refcount is greater than zero. - In general, I tried to avoid using hpp headers outside of ATen/TH. However, there were a few places where I gave up and depended on the headers for my own sanity. See Note [TH abstraction violation] for all the sites where this occurred. All other sites were refactored to use functions - Some extra Werror fixes (char* versus const char*)
2018-04-28 | [WIP] Enable WERROR in tests (#6539) | Peter Goldsborough | 13 | -160/+181
* Enable WERROR in tests * Also set WERROR=1 for cpp_build in CI * Enable Werror after the compiler checks * Remove -DWERROR because it's picked up from the env var * Had to fix some errors in aten/contrib/data * Allow an uninitialized variable in ReduceOpsKernel.cpp * Use CUDNN_DATA_UINT8 in cuDNN type string conversion * Fixes and use target_compile_options * Fix uninitialized variables in THNN * Include Python.h earlier in tensor_types.cpp * Use CUDNN_VERSION 7100 instead of 7000? * More Python.h includes * Make switch case in common_subexpression_elimination.cpp exhaustive * Build with WERROR=0 just to see all the warnings * Remove some Python includes * Enable WERROR=1 again * Bring back switch case default
2018-04-27 | Support non-contiguous tensors for unary ops (#6119) | cpuhrsch | 10 | -378/+640
2018-04-26 | Implement matmul_out and dot_out. (#6961) | gchanan | 2 | -15/+39
* Implement matmul_out and dot_out. * Fix autograd by only calling _out variants if we have an out ourselves. * Disallow mismatched types in dot_out. * Make sure out variant doesn't have a method. * Do proper type conversion.
2018-04-26 | Fixes some build warnings. (#7004) | Mike Ruberry | 2 | -6/+10
2018-04-26 | Enhance diagonal (fixes #6479) (#6718) | Thomas Viehmann | 2 | -7/+37
* Enhance diagonal This patch - adds Tensor.diagonal to complement torch.diagonal - implements diagonal natively in ATen - makes diagonal a view - implements taking arbitrary diagonals - implements diagonal backward instead of referring to the (more limited) diag * add tests, copy diagonal code to backward for double differentiability * improve tests and doc comment. Thank you, Adam! * Mark diagonal as view function in gen_autograd.py, use simple backward.
2018-04-26 | typo corrected: is -> if (#6980) | derek_kim | 1 | -1/+1
2018-04-26 | Fix forward and backward for norm/renorm with infty norm (fixes #6817) (#6969) | Thomas Viehmann | 3 | -19/+85
2018-04-25 | implement gamma cuda (#6855) | Thomas Viehmann | 10 | -187/+356
* Refactor standard_gamma and implement CUDA gamma sampling * Attempt fixes for AT_CUDA_ENABLED changes * Gamma cuda and cpu forward as ATen native * implement standard_gamma_grad_cuda * update native_test.cpp, try to fix windows and various cuda version compiles * searching a windows fix via CI... use std:: for math * casting some constants in the calculation, compute at float for half precision * whitespace fixes * add acctype to do half->float computation, include HALF in generation, cast locally rather than tensors * fix cuda8 half compilation * always use scalar_cast with CUDACC, lock CPU generator, CPU acctype = double
Thank you for your review comments!
2018-04-25 | Code Cleanup: removes unused getTextureObject (#6974) | Mike Ruberry | 2 | -27/+0
2018-04-25 | Removes unused _long functions in THCTensorIndex (#6971) | Mike Ruberry | 2 | -55/+0
2018-04-25 | Changes incorrect "overlappingIndices" call to correct "maybeOverlappingIndices" (#6953) | Mike Ruberry | 7 | -139/+113
* Changes incorrect "overlappingIndices" call to correct "maybeOverlappingIndices"

THE PROBLEM

The current overlappingIndices() is meant to detect if a tensor defines multiple valid indices for the same data element. There are two significant issues with this function:

(1) The algorithm it attempts to implement cannot do this.
(2) That algorithm is not implemented correctly.

This call is used by pointwiseApply() and scatter(). If a tensor is readable/writable and detected as overlapped these algorithms will create a non-overlapped copy of it to work on. When tensors are improperly identified as overlapped this causes extra work. If tensors are improperly identified as non-overlapped then this would cause the operations to exhibit unexpected behavior.

For example,

    ref = torch.arange(0, 32 * 5).view(4, 8, 5).cuda().double()
    p = ref[:,:,::2]
    p += 1

Results in a call to pointwiseApply1, which detects p as an overlapped tensor (it is not), causing a call to pointwiseApply2 that copies it into a non-overlapped temporary, and then another call to pointwiseApply2 later that copies it back to the original tensor. If, however, the original tensor is given dimensions of (4, 8, 4), instead, it is correctly detected as non-overlapped and only a single pointwiseApply1 call is made.

DISCUSSION + FIX

The algorithm that overlappingIndices() attempts to implement tests for a sufficient but not necessary condition of a tensor to be non-overlapping. That is, if its algorithm were implemented properly then it would be a conservative check that would ensure all overlapped tensors were copied (as desired), but also that some non-overlapped tensors were copied too.

The algorithm can be thought of as trying to test whether the dimensions can be ordered like "nesting dolls," with each dimension fitting within the next one larger than it. If this is true then the tensor is non-overlapping, but if it's false the tensor may or may not be overlapped. For example, a tensor with dims (2, 3) and strides (4, 3) cannot be "nested," but is non-overlapping. (The tensor looks like [[0, 3, 6], [4, 7, 10]].)

The algorithm is currently implemented improperly, as can be seen in the example above. The tensor p has dimensions [4, 8, 3] and strides [40, 5, 2]. This confuses the current implementation, which thinks the innermost dimension needs a stride of 6, which is incorrect. The first row is [0, 2, 4] and the next row begins with 5. The current implementation also improperly implemented its sorting behavior. (qsort comparators require -1, 0, and 1, not true/false return values.)

Fixing the existing algorithm is straightforward (and what this PR does, see below), but it is important to note that the algorithm never performed as intended, so its name and the documentation around it has been updated, too. A natural question is if it's possible to write an efficient overlappingIndices(), and I believe the answer is "no." Disambiguating overlapping from non-overlapping tensors is equivalent to finding a nonzero solution to a linear diophantine equation with restricted coefficients, that is, an equation of the form

    x_0s_0 + x_1s_1 ... = 0

where s_X is the stride in dimension X and x_X is an integer from [-size_X + 1, size_X - 1].

Another note is that the CPU does not perform this check. For example, if we run:

    a = torch.FloatTensor([[0,1], [10, 11]])
    b = torch.FloatTensor([[0,0],[0,0]])
    b = b.set_(a.storage(), storage_offset=0, size=a.size(), stride=(1,1))
    b += 1

Then b is [[1, 3], [3, 11]] because the operation is applied twice to the second element of the original tensor. This causes no warning.

Since the CPU does not perform a similar check, another question is whether the GPU code should remove its check. While it may seem that writing to overlapping tensors is an error state, running test_cuda.py reveals 171 instances of possibly overlapped tensors being copied by pointwiseApply(). (The prior incorrect version has 176 copies.) Allowing writing to overlapped tensors on the GPU may violate assumptions about memory accesses, too. In fairness, these assumptions may be violated on the CPU already.

Leaving the CPU vs GPU behavior question for the future, this fix corrects the current intended GPU behavior. This means that there will be fewer unnecessary copies and no chance of an overlapped tensor sneaking through on the GPU. The CPU behavior remains unchanged. The fix also adds a test to test_cuda.py to ensure that overlapped tensors on the GPU are written to as expected.

* cleanup * Fixes Python formatting
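The "nesting dolls" test from #6953 can be sketched in pure Python. This is an illustrative reimplementation under stated assumptions, not the code from the PR: the function names are hypothetical, strides are assumed non-negative, and the conservative check compares each stride against the cumulative extent of all smaller-stride dimensions. A brute-force enumeration of offsets is included only to show where the conservative answer is pessimistic.

```python
from itertools import product


def maybe_overlapping(sizes, strides):
    """Conservative test: False means definitely non-overlapping,
    True means the tensor may or may not overlap (hypothetical helper)."""
    # Dimensions of size 1 never contribute an offset, so ignore them.
    dims = sorted((st, sz) for sz, st in zip(sizes, strides) if sz > 1)
    max_offset = 0  # largest offset reachable using the dimensions seen so far
    for stride, size in dims:
        # A dimension "nests" only if one step along it jumps past everything
        # the smaller dimensions can reach. Stride 0 always fails this test.
        if stride <= max_offset:
            return True
        max_offset += (size - 1) * stride
    return False


def really_overlapping(sizes, strides):
    """Exact but exponential check: enumerate every index and look for a repeated offset."""
    seen = set()
    for index in product(*(range(s) for s in sizes)):
        offset = sum(i * st for i, st in zip(index, strides))
        if offset in seen:
            return True
        seen.add(offset)
    return False


# p = ref[:, :, ::2] from the commit message: nests cleanly, so no copy is needed.
p_flagged = maybe_overlapping([4, 8, 3], [40, 5, 2])
# dims (2, 3), strides (4, 3): cannot be nested, yet no two indices collide.
odd_flagged = maybe_overlapping([2, 3], [4, 3])
odd_actual = really_overlapping([2, 3], [4, 3])
# The CPU example with stride (1, 1): a genuine overlap.
cpu_flagged = maybe_overlapping([2, 2], [1, 1])
cpu_actual = really_overlapping([2, 2], [1, 1])
```

The second case is the false positive the commit message describes: the conservative check requests a copy even though the exact enumeration finds no collision.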
2018-04-25 | Make cuda 9 behave as cuda 8 wrt half conversions (#6958) | ngimel | 2 | -10/+10
* Make cuda 9 behave as cuda 8 wrt half conversions Cuda 9 is too smart about implicit half conversions, this would disable them so that cuda 8 and cuda 9 behave in the same way wrt half. * try fixing windows build * one more broken conversion
2018-04-25 | Clarify _unsafe_view comment. (#6952) | gchanan | 1 | -1/+1
It was unclear to me whether the "viewed" tensor was the input or the output.
2018-04-25 | Delete unused legacy indexed based streams (#6964) | Sam Gross | 2 | -67/+3
PyTorch uses THC's THCStream API.
2018-04-25 | Remove scratch space from THCState (#6956) | Sam Gross | 3 | -113/+10
THC had a concept of per-device per-stream scratch space that was persistent in THCState. This was useful before the caching allocator because it avoided synchronizations in kernels that needed temporary scratch space. However, it's not thread-safe since multiple threads can operate on the same stream: In a two-pass reduction the scratch space may get clobbered in between the two kernels. This removes the scratch space and just uses THCudaMalloc and THCudaFree within the reductions. I've kept THCState_getCurrentDeviceScratchSpaceSize for now since it's useful to have the temporary buffer be sized based on the number of SMs.
2018-04-25 | add missing UNCERTAIN_TH_OMP_OVERHEAD_THRESHOLD to TH_TENSOR_APPLY_REDUCTION_OMP (#6946) | Soumith Chintala | 1 | -2/+2
2018-04-25 | Make any and all on ByteTensor behave like sum/prod. (#4627) | Tao He | 5 | -22/+278
2018-04-24 | silence compiler warnings (#6915) | li-roy | 1 | -2/+2
2018-04-24 | Add threshold for ops using openmp macro (#5584) | Yang, Zhen | 3 | -79/+119
* add threshold for ops using omp macro * modify interface for ops using omp macro * modify some thresholds * implement C macros with optional parameters to avoid duplicating definitions for all pointwise operations * add a parameter of LAB_IMPLEMENT_BASIC_FUNCTION for vectorizing * modify the comment * Revert "add a parameter of LAB_IMPLEMENT_BASIC_FUNCTION for vectorizing" Modify macro LAB_IMPLEMENT_VECTORIZED_FUNCTION to enable optional parameters This reverts commit 8ef783a0cc67b653c435e64a3beb6866a6b4216d. Conflicts: aten/src/TH/generic/THTensorMath.c * fix build error on windows * retrigger the test
2018-04-24 | [aten] Move submodules to third_party (#6866) | Orion Reblitz-Richardson | 5 | -9/+28
* [aten] Move submodules to third_party * [aten] Update aten_mirror.sh script for third_party * [aten] Move ATen submodules def to root and rename * [aten] Update cpuinfo cmake build * [aten] Fix cpuinfo cmake build * Update third_party/cpuinfo to d03d5d296063063c66877fb559cf34469734e3e1 * [aten] Fix JIT test reference to catch
2018-04-24 | Make ArrayRef read-only by default. (#6444) | Edward Z. Yang | 1 | -2/+2
Sebastian Messmer noticed that these iterators were writeable by default, which seemed dangerous. Replaced with const iterators. This doesn't seem to affect any ATen code; seems reasonable enough. Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2018-04-23 | fix memory leak in median (#6889) | Soumith Chintala | 1 | -0/+3
2018-04-22 | Removes (unused) LinearIndexCalcData. (#6791) | Mike Ruberry | 3 | -152/+0
This class as well as several functions using it appear to not be used. This is simply code cleanup. Testing: All tests in test_cuda.py pass.
2018-04-22 | Static linkage for CUDA (#6807) | Soumith Chintala | 1 | -6/+49
* add static linkage option for CUDA libs * add CuFFT linking via fakelink * remove warning for 5.0 cuda architecture
2018-04-22 | Fix reductions on some contiguous tensors where size(dim) == 1 (#6815) | cpuhrsch | 1 | -0/+8
2018-04-20 | Fix debug build for Windows (#6758) | peterjc123 | 1 | -4/+6
* Fix debug build for Windows * Fix for wrong placement * Fix variable name
2018-04-19 | Disallow chunks that are <= 0 in torch.chunk (#6761) | Richard Zou | 1 | -1/+5
Fixes #6759. Before, `tensor.chunk(0)` would cause a divide by 0. `tensor.chunk(-1)` would throw an error complaining that "split_size needs to be positive". This PR changes it so that the error message makes it clear that `chunks` has to be greater than 0.
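The failure mode fixed by #6761 can be illustrated with a list-based sketch. It assumes the split size is computed by ceiling division of the length by `chunks`, which is why `chunks == 0` used to divide by zero; the function name, the list-based interface and the exact error text are illustrative assumptions, not the ATen implementation.

```python
def chunk(items, chunks):
    """Split `items` into at most `chunks` pieces of equal size (last may be shorter)."""
    if chunks <= 0:
        # Validate up front so the caller sees a message about `chunks`,
        # rather than a division by zero or a complaint about split_size.
        raise ValueError(f"chunk expects `chunks` to be greater than 0, got: {chunks}")
    # Ceiling division; max(1, ...) keeps the range step valid for empty input.
    split_size = max(1, (len(items) + chunks - 1) // chunks)
    return [items[i:i + split_size] for i in range(0, len(items), split_size)]


two_pieces = chunk(list(range(5)), 2)
three_pieces = chunk(list(range(5)), 3)

try:
    chunk([1, 2, 3], 0)
    rejected_zero = False
except ValueError:
    rejected_zero = True
```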
2018-04-18 | Eliminate handle_zero_dim when broadcasting is applied earlier. (#6683) | gchanan | 1 | -6/+10
* Eliminate handle_zero_dim when broadcasting is applied earlier. This ends up not actually doing anything unless all the broadcasted tensors are scalars, which ends up with inconsistent behavior in that case only, because the type promotion rules are different. This is better solved with real type promotion logic. * Change type of script comparison to long. * Fix jit tests. * Fix cpp jit test by being consistent about long-vs-float. * Consistent float and long. * Use int64_t rather than long.
2018-04-18 | Better error message for gels on CUDA (#6726) | Tongzhou Wang | 1 | -2/+4
2018-04-18 | Sort declarations when generating Python bindings (#6701) | Adam Paszke | 4 | -21/+45
* Sort declarations when generating Python bindings This helps resolve ambiguities in argument parsing according to any rules we will need. For now, this allows us to make scalar operations more conservative wrt. argument types, but makes them commutative again. * Fix inconsistencies between mod with tensor and scalar * Fix a stupid mistake
2018-04-18 | Add mutex to THC random number generator (#6527) | Will Feng | 21 | -88/+148
* Add mutex to THC random number generator * Add test for CUDA RNG multithread * fix lint * Rename gen_state to state and remove unnecessary mutex lock * Remove RNG test from cpp_extensions * Add CUDA RNG test to libtorch * Build test_rng only if CUDA exists * Move test to aten/src/ATen/test/ * Separate ATen build and test, and run ATen test in CI test phase * Don't test ATen in ASAN build * Fix bug in ATen scalar_test * Fix bug in ATen native_test * Add FIXME to some CUDA tests in scalar_tensor_test * Valgrind doesn't work well with CUDA, seed the CPU and CUDA RNG separately instead
2018-04-18 | Implement torch.einsum (fixes #1889) (#6307) | Thomas Viehmann | 2 | -10/+167
* start at generic trilinear * Implement einsum (fixes #1889) This provides a simple implementation of einsum. It is built on top of the work for computing bilinear (#6110). It uses a naive left-to-right resolution at the moment. Autograd is able to differentiate by itself. The obvious unsupported feature is taking diagonals (einsum('ii->i',(a,)). * add tests and docs * fix flake8 * clean diff * rebase on current master to resolve conflicting String wrapping * clean up after rebase * better commentary in einsum and sumproduct_pair * don't say fixme if it's fixed and rename num_outputs to num_output_dims * adapt python wrapper to use std::string instead of String to avoid typedef at::String * typos and some vector to array conversion * fix accidental python<->python3 change * really fix bad rebase
2018-04-18 | Better dispatch (#6687) | Zeming Lin | 1 | -2/+2
2018-04-17 | __STDC_FORMAT_MACROS was conflicting with some thirdparty include from google perf tools. Looks like a harmless fix (#6676) | Dmytro Dzhulgakov | 1 | -0/+2
2018-04-17 | Adding dispatch to Tensors (#6664) | Zeming Lin | 2 | -0/+11
It solves the problem of chaining externally defined functions.
2018-04-17 | randperm supports n=0 (#6656) | Francisco Massa | 1 | -2/+2
This makes it compatible with arange and numpy.random.permutation
2018-04-17 | Support gpu triangle solve (#6648) | Du Phan | 3 | -0/+40
* add cuda trtrs * remove queue * add test trtrs
2018-04-16 | Add dtypes (with reasonable defaults) to sum, prod, cumsum, cumprod. (#6573) | gchanan | 5 | -13/+237
* Add dtypes (with reasonable defaults) to sum, prod, cumsum, cumprod. This adds optional dtypes to torch.sum, torch.prod, torch.cumsum, torch.cumprod. By default, the dtype is torch.float64 for integral types, and the dtype of the input for floating point types. * Don't use optional<ScalarType>, because the jit can't handle it yet. Instead, we manually build the overloads. This is fairly painful because of default arguments, but should be easy to pull out once the jit can handle optional<ScalarType>. * Fix keepdim with out parameters. * Fix _cudnn_rnn_flatten_weight. * If dtype is provided to an out function, make sure it matches the dtype of the result. * Fix typo.
2018-04-16 | Create safe and unsafe versions of sparse_coo_tensor (#6058) | li-roy | 7 | -145/+185
Fixes #5748. Added an unsafe version so embedding isn't slowed. * Create safe and unsafe versions of sparse_coo_tensor * rename sparse_coo_tensor_unsafe to _sparse_coo_tensor_unsafe * refactor * make helper static inline * add sparse size check test * fix lint
2018-04-16 | Fix bilinear performance regression (#6110) | Thomas Viehmann | 3 | -6/+194
The current implementation of bilinear uses a matrix multiplication approach. This creates a large intermediate matrix (batch * output dimension * input dimension). Relative to the previous pure python approach, this caused severe performance regression (600ms vs. 18ms for 300x100x200 weights and a batch of 50 on CPU, and also quadratic memory). The attached change restores the performance using the previous strategy of looping over output features. It implements forward, backward, and double backward as native ATen code. Credits: Martin Tutek reported the regression and pinpointed the problem. Adam Paszke patiently answered my questions about ATen. I would not have been able to prepare this without you, thank you! I referenced the old python implementation, used a python version of the naive implementation, and coded manual functions etc. The tests have gradgradcheck etc. * fix memory use of native bilinear * bilinear double backward * Move bilinear_double_backward to Functions.cpp Addresses review comment by Tongzhou Wang. Thank you! * add WrapDimUtilsMulti.h * start at generic trilinear * move to generic trilinear * catch up on dim_list_to_bitset * switch bilinear to use _trilinear implement _trilinear_backward * add comments to Linear.cpp, move _trilinear in yaml
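The "looping over output features" strategy mentioned in #6110 can be sketched with plain nested lists. This is a minimal illustration of the forward pass only, assuming the usual bilinear form out[b][k] = x1[b]^T W[k] x2[b] + bias[k]; the function name and list-based layout are hypothetical and the native ATen code is organised differently.

```python
def bilinear(x1, x2, weight, bias=None):
    """Forward bilinear map on nested lists.

    x1: batch x in1, x2: batch x in2, weight: out x in1 x in2, bias: out (optional).
    """
    batch = len(x1)
    out_features = len(weight)
    out = [[0.0] * out_features for _ in range(batch)]
    # One output feature at a time: only a single in1 x in2 slice of the weight
    # is touched per iteration, so no batch * out * in intermediate is built.
    for k in range(out_features):
        w_k = weight[k]
        for b in range(batch):
            acc = 0.0
            for i, x1_i in enumerate(x1[b]):
                # Row i of W[k] dotted with x2[b], scaled by x1[b][i].
                acc += x1_i * sum(w * x2_j for w, x2_j in zip(w_k[i], x2[b]))
            out[b][k] = acc + (bias[k] if bias is not None else 0.0)
    return out


result = bilinear(
    [[1.0, 2.0]],
    [[3.0, 4.0, 5.0]],
    [
        [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],  # picks x1[0]*x2[0] + x1[1]*x2[1]
        [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]],  # sum(x1) * sum(x2)
    ],
    bias=[0.5, 0.0],
)
```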
2018-04-13 | More precise digamma (#6517) | Richard Zou | 4 | -34/+179
* More precise digamma Fixes #6190. This is a rebase of #3955 with some tweaks for better performance around poles. The code is ported over from cephes with permission. By itself, the cephes code returns inf for the poles. For better performance around the poles with float32, one intermediate step is always computed with double precision, regardless of dtype. This step does `PI / tan(PI * input)`. This is necessary because small (1e-6) rounding errors for the inputs to tan have strong effects on the output (ie, the derivative of tan is very large at some points). * Replace usages of finite-differences digamma with newly implemented digamma * Better behavior near and at poles * ScalarConvert -> scalar_cast for readability
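The role of the `PI / tan(PI * input)` step in #6517 is easier to see in a small scalar sketch. This is not the cephes port from the PR: it is a textbook digamma built from the reflection formula, the upward recurrence and a truncated asymptotic series, with the shift threshold of 10 and the number of series terms chosen here as assumptions. Python floats are already double precision, so the float32 concern from the commit message does not arise in this sketch.

```python
import math


def digamma(x):
    """Scalar digamma (psi) sketch; returns inf at the poles 0, -1, -2, ..."""
    if x <= 0.0 and x == math.floor(x):
        return math.inf  # matches the inf-at-poles behaviour described for cephes
    if x < 0.0:
        # Reflection: psi(x) = psi(1 - x) - pi / tan(pi * x).
        # tan() is steep near the poles, so rounding in x is strongly amplified
        # here, which is why this step benefits from double precision.
        return digamma(1.0 - x) - math.pi / math.tan(math.pi * x)
    result = 0.0
    # Recurrence psi(x) = psi(x + 1) - 1/x, applied until x is large enough
    # for the asymptotic series to be accurate.
    while x < 10.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # 1/(12x^2) - 1/(120x^4) + 1/(252x^6) - 1/(240x^8), in nested form.
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 * (1.0 / 252 - inv2 / 240)))
    return result + math.log(x) - 0.5 / x - series


psi_one = digamma(1.0)        # equals minus the Euler-Mascheroni constant
psi_half = digamma(0.5)       # equals -gamma - 2 ln 2
psi_neg_half = digamma(-0.5)  # exercises the reflection branch
```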
2018-04-12 | Use THC allocation for CUFFT workspace (#6568) | Tongzhou Wang | 1 | -7/+15
* use THC allocation for CUFFT * use auto& instead
2018-04-12 | Support arbitrary number of batch dimensions in *FFT (#6528) | Tongzhou Wang | 1 | -16/+31