path: root/test/test_c10d.py
Age | Commit message | Author | Files | Lines
2019-04-19 | Make finding unused model parameters optional (#19515) | Pieter Noordhuis | 1 | -0/+63
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/19515 This is still done by default, but can now be disabled by specifying `find_unused_parameters=False`. There are use cases where finding unused parameters results in erroneous behavior, because a subset of model parameters is used *outside* the `forward` function. One can argue that doing this is not a good idea, but we should not break existing use cases without an escape hatch. This configuration parameter is that escape hatch. Reviewed By: bddppq Differential Revision: D15016381 fbshipit-source-id: f2f86b60771b3801ab52776e62b5fd6748ddeed0
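For illustration, a minimal single-process sketch of the new escape hatch (the init_method address and tensor sizes are arbitrary placeholders; a real job would span many ranks):
```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process stand-in for a real multi-rank job.
dist.init_process_group(backend="gloo", init_method="tcp://127.0.0.1:29500",
                        rank=0, world_size=1)

model = torch.nn.Linear(10, 10)

# By default DDP still searches the autograd graph of the forward output for
# parameters that received no gradient; find_unused_parameters=False skips
# that search when every parameter is guaranteed to be used inside forward().
ddp_model = DDP(model, find_unused_parameters=False)

ddp_model(torch.randn(4, 10)).sum().backward()

dist.destroy_process_group()
```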
2019-04-18 | Recursively find tensors in DDP module output (#19360) | Pieter Noordhuis | 1 | -0/+90
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/19360 We'll return the output object verbatim since it is a freeform object. We need to find any tensors in this object, though, because we need to figure out which parameters were used during this forward pass, to ensure we short circuit reduction for any unused parameters. Before this commit only lists were handled and the functionality went untested. This commit adds support for dicts and recursive structures, and also adds a test case. Closes #19354. Reviewed By: mrshenli Differential Revision: D14978016 fbshipit-source-id: 4bb6999520871fb6a9e4561608afa64d55f4f3a8
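For illustration, a hypothetical toy module showing why recursive traversal of the output is needed; the tensors DDP must discover sit inside a dict and a nested list:
```python
import torch
import torch.nn as nn

class DictOutputModel(nn.Module):
    """Toy module whose forward() returns a dict with a tensor nested inside
    a list, the kind of freeform output DDP now traverses recursively."""
    def __init__(self):
        super(DictOutputModel, self).__init__()
        self.a = nn.Linear(8, 8)
        self.b = nn.Linear(8, 8)

    def forward(self, x, use_b=False):
        out = {"a": self.a(x)}
        if use_b:
            out["nested"] = [self.b(x)]  # only present on some iterations
        return out

model = DictOutputModel()
out = model(torch.randn(2, 8), use_b=False)  # self.b is unused this pass
```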
2019-04-17 | Allow DDP to wrap multi-GPU modules (#19271) | Shen Li | 1 | -11/+190
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/19271 allow DDP to take multi-gpu models Reviewed By: pietern Differential Revision: D14822375 fbshipit-source-id: 1eebfaa33371766d3129f0ac6f63a573332b2f1c
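For illustration, a hypothetical sketch of the kind of multi-device module DDP can now wrap; with such a module, device_ids is left unset so DDP does not try to pin the whole module to a single GPU:
```python
import torch
import torch.nn as nn

class TwoDeviceModel(nn.Module):
    """Hypothetical module pipelined across two devices owned by one process,
    the kind of model DDP can now wrap directly."""
    def __init__(self, dev0, dev1):
        super(TwoDeviceModel, self).__init__()
        self.dev0, self.dev1 = dev0, dev1
        self.net0 = nn.Linear(8, 8).to(dev0)
        self.net1 = nn.Linear(8, 8).to(dev1)

    def forward(self, x):
        x = self.net0(x.to(self.dev0))
        return self.net1(x.to(self.dev1))

# Requires >= 2 GPUs and an initialized process group:
# ddp = torch.nn.parallel.DistributedDataParallel(TwoDeviceModel("cuda:0", "cuda:1"))
```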
2019-04-15 | Make DistributedDataParallel use new reducer (#18953) | Pieter Noordhuis | 1 | -0/+46
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/18953 This removes Python side bucketing code from DistributedDataParallel and replaces it with calls to the new C++ based bucketing and reducing code. To confirm this is working well, we ran a test with both the previous implementation and the new implementation, and confirmed they are numerically equivalent. Performance is improved by a couple percent or more, including the single machine multiple GPU runs. Closes #13273. Reviewed By: mrshenli Differential Revision: D14580911 fbshipit-source-id: 44e76f8b0b7e58dd6c91644e3df4660ca2ee4ae2
2019-04-10 | Fix flaky store timeout test (#19114) | Shen Li | 1 | -12/+23
Summary: ~Sometimes, `init_process_group()`, `store.get()`, and `destroy_process_group()` can take more than a few seconds. Hence, removing thread join timeout.~ The error was due to `Address already in use` when starting the TCP backend. The solution is to catch the error and report it to the `retry_on_address_already_in_use_error` decorator. Pull Request resolved: https://github.com/pytorch/pytorch/pull/19114 Reviewed By: ezyang Differential Revision: D14872680 Pulled By: mrshenli fbshipit-source-id: fc504d02853ca73f76288c0ade564ab20bc01f7e
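For illustration, a hypothetical sketch of what a retry decorator like the one mentioned above could look like; this is not the actual helper from the test suite:
```python
import functools
import time

def retry_on_address_already_in_use_error(func):
    """Hypothetical sketch: rerun a test that raced with another process for
    a port, instead of failing it outright."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        retries, delay = 3, 0.5
        for attempt in range(retries):
            try:
                return func(*args, **kwargs)
            except RuntimeError as err:
                if "Address already in use" not in str(err) or attempt == retries - 1:
                    raise
                time.sleep(delay)
    return wrapper
```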
2019-04-09 | Propagate ProcessGroup timeout to Store (#16571) | Shen Li | 1 | -0/+47
Summary: closes #16520 Hi pietern, I am not sure if this is the expected way to pass timeout to `Store`, could you please help take a look? Thanks! Questions: 1. How do I write tests for this? I wanted to do something like `test_barrier_timeout_global`, but it seems I need to set the pg's timeout larger than the `Store`'s default timeout (3 min) to see a difference, which is too long for a unit test. And I do not want to change the `Store`'s default timeout either. Any suggestion? 2. Should I also propagate timeout configuration down to `PrefixStore` in `_new_process_group_helper`? Pull Request resolved: https://github.com/pytorch/pytorch/pull/16571 Differential Revision: D13954527 Pulled By: mrshenli fbshipit-source-id: 77f2653903f24255207233eb298f7c0321119a87
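For illustration, a minimal sketch of passing a process-group timeout at init time, which this change propagates down to the underlying Store (the backend, file path, and duration below are placeholders):
```python
from datetime import timedelta
import torch.distributed as dist

# The timeout passed here configures the process group; this change makes the
# same value flow down to the Store used for rendezvous as well.
dist.init_process_group(
    backend="gloo",
    init_method="file:///tmp/c10d_timeout_example",  # placeholder path
    rank=0,
    world_size=1,
    timeout=timedelta(seconds=30),
)
dist.destroy_process_group()
```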
2019-04-05 | Increase default c10d/ProcessGroupGloo test timeout (#18916) | Pieter Noordhuis | 1 | -1/+1
Summary: See #18659. Pull Request resolved: https://github.com/pytorch/pytorch/pull/18916 Differential Revision: D14808749 Pulled By: pietern fbshipit-source-id: 9a9c8beddb2dbbb1bf4c5e575743d9e1fa3f07fa
2019-04-05 | Add tests for reducer class (#18845) | Pieter Noordhuis | 1 | -0/+136
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/18845 This adds a few CPU only test cases for the reducer class. Reviewed By: mrshenli Differential Revision: D14768432 fbshipit-source-id: c008a52206826304e634a95bc14167ed94c97662
2019-03-22 | Correctly call superclass setUp in TestCase subclasses. (#18291) | Edward Yang | 1 | -6/+5
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/18291 ghimport-source-id: d6e95e899bd320407967df41435801e54864ba62 Stack from [ghstack](https://github.com/ezyang/ghstack): * #18292 Add test for #17271 (torch.exp incorrect for 2**31 size tensor) * **#18291 Correctly call superclass setUp in TestCase subclasses.** This makes PYTORCH_TEST_SKIP_FAST work correctly for more tests, reducing the wasted testing effort on our slow_test job. Signed-off-by: Edward Z. Yang <ezyang@fb.com> Differential Revision: D14567643 fbshipit-source-id: 40cf1d6556e0dd0a0550ff3d9ffed8b6000f8191
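For illustration, a sketch of the pattern this commit enforces in the test files, assuming it runs from pytorch/test where common_utils.py provides the TestCase base class and run_tests helper:
```python
from common_utils import TestCase, run_tests  # PyTorch's test harness (test/common_utils.py)

class MyC10dTest(TestCase):
    def setUp(self):
        super(MyC10dTest, self).setUp()  # the fix: always chain to the harness's setUp
        self.world_size = 2              # per-test setup comes after the super() call

    def test_world_size(self):
        self.assertEqual(self.world_size, 2)

if __name__ == "__main__":
    run_tests()
```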
2019-01-23 | Disable flaky test | Edward Yang | 1 | -0/+1
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/16274 Reviewed By: pietern Differential Revision: D13788036 fbshipit-source-id: a9b7353fb0655908e6d47387cc77af33e9471aed
2019-01-18 | TCP init method race condition fix (#15684) | Teng Li | 1 | -40/+24
Summary: This PR fixes a race condition in the TCP init method, where the master rank can exit earlier than the slave ranks, so the TCP daemon thread gets shut down before the other slaves are able to access it. This change lets every rank (process) write a special key to the store to mark that it has completed (and is thus about to exit). The master rank (which is the server) always waits for all ranks to complete before completing itself. This should fix: https://github.com/pytorch/pytorch/issues/15638 Tested using the repro from https://github.com/pytorch/pytorch/issues/15638 and it works fine. test_distributed and test_c10d should already have coverage for this. I had to give the rendezvous test in c10d a world size of 1, since it is single-process code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/15684 Differential Revision: D13570904 Pulled By: teng-li fbshipit-source-id: 34f3bc471204bbd29320df359347ad5561c6b589
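For illustration, a single-process sketch of the tcp:// rendezvous path that this fix hardens (address and port are placeholders):
```python
import torch
import torch.distributed as dist

# Single-process stand-in; a real job launches one process per rank.
dist.init_process_group(backend="gloo", init_method="tcp://127.0.0.1:23456",
                        rank=0, world_size=1)

t = torch.zeros(1)
dist.all_reduce(t)

# With the fix, the rank that hosts the TCP daemon waits for every rank to
# mark itself done in the store before shutting the daemon down.
dist.destroy_process_group()
```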
2018-12-11 | add gloo support for gather on GPU (#14916) | Jane Wang | 1 | -3/+54
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14916 as titled Reviewed By: pietern Differential Revision: D13267832 fbshipit-source-id: 3b89d08af93f74941f17ff892c33fc2a4a023c19
2018-12-11 | add gloo scatter support on GPU (#14917) | Jane Wang | 1 | -3/+54
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14917 as titled Reviewed By: pietern Differential Revision: D13271560 fbshipit-source-id: 0187a3390f8ebd72a2c074e7a651432159d427c0
2018-12-10 | add gloo allgather support on GPU (#14576) | Jane Wang | 1 | -3/+45
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14576 as titled Reviewed By: pietern Differential Revision: D13266063 fbshipit-source-id: e262f77d63724a7504a7112907bbfba49612fe75
2018-12-06 | Skipping two c10d tests only if there are multi-GPUs (#14860) | Teng Li | 1 | -0/+2
Summary: Otherwise, these tests will fail, even though they are never meant to run on single-GPU machines. Pull Request resolved: https://github.com/pytorch/pytorch/pull/14860 Differential Revision: D13369060 Pulled By: teng-li fbshipit-source-id: 8a637a6d57335491ba8602cd09927700b2bbf8a0
2018-12-05 | Increase test timeout (#14814) | Pieter Noordhuis | 1 | -1/+1
Summary: It is possible that some sort of contention causes process scheduling delays which in turn cause the timeout to *not* be hit. Increased sleep here will decrease the probability of this happening. Fixes #14555. Pull Request resolved: https://github.com/pytorch/pytorch/pull/14814 Differential Revision: D13351924 Pulled By: pietern fbshipit-source-id: 1222cf0855408dfcb79f30f94694c790ee998cf9
2018-12-05 | Retry test on address already in use error (#14815) | Pieter Noordhuis | 1 | -0/+1
Summary: Thanks nairbv for the suggestion. Also see #14589. Fixes #14703. Pull Request resolved: https://github.com/pytorch/pytorch/pull/14815 Differential Revision: D13351913 Pulled By: pietern fbshipit-source-id: d11a4152505d0ce15592b13e417bb80551476a61
2018-12-03 | Fix multi-argument allreduce in ProcessGroupGloo (#14688) | Pieter Noordhuis | 1 | -4/+43
Summary: If multiple arguments are specified to c10d allreduce, they are interpreted as if they are expanding the ranks in the process group. Therefore, not only is every argument to allreduce an input that must be considered, it is also an output. The problem that this commit fixes is that they were not correctly considered as outputs. The upstream problem is tracked in facebookincubator/gloo#152. Once this is fixed there we can remove the copies that this commit adds. This fixes #14676. Pull Request resolved: https://github.com/pytorch/pytorch/pull/14688 Differential Revision: D13294405 Pulled By: pietern fbshipit-source-id: 078a2a0a0ff12d051392461438f1496201ec3cb9
2018-11-29 | Make env init_method support both env and args for rank and size (#14494) | Teng Li | 1 | -0/+49
Summary: Fixing: https://github.com/pytorch/pytorch/issues/14446 This was supported behavior in the old torch.distributed, and we want to keep supporting it in the new release. The tests should cover every combination of scenarios where rank and world size are provided via environment variables, via arguments, or both. Pull Request resolved: https://github.com/pytorch/pytorch/pull/14494 Differential Revision: D13253433 Pulled By: teng-li fbshipit-source-id: c05974d84f1bdf969f74ec45763e11a841fe4848
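For illustration, a sketch of the two setups the tests should cover: rank and world size taken from the environment, and the same values passed as arguments while still using env:// (addresses and ports are placeholders):
```python
import os
import torch.distributed as dist

os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29501"

# Variant 1: rank and world size come from the environment.
os.environ["RANK"] = "0"
os.environ["WORLD_SIZE"] = "1"
dist.init_process_group(backend="gloo", init_method="env://")
dist.destroy_process_group()

# Variant 2: the same values passed as arguments, still using env://,
# which is the behavior this change restores.
del os.environ["RANK"], os.environ["WORLD_SIZE"]
os.environ["MASTER_PORT"] = "29502"  # fresh port to avoid rebind races
dist.init_process_group(backend="gloo", init_method="env://", rank=0, world_size=1)
dist.destroy_process_group()
```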
2018-11-29 | add gloo support for reduce on GPU (#14443) | Jane Wang | 1 | -2/+46
Summary: as titled Pull Request resolved: https://github.com/pytorch/pytorch/pull/14443 Reviewed By: pietern Differential Revision: D13222907 Pulled By: janewangfb fbshipit-source-id: f418c5d84880196f97089114d02957cf739243f8
2018-11-27 | Fixed SyncParam/QueueReduction/SyncReduction test for 2+ GPUs (#14452) | Teng Li | 1 | -5/+5
Summary: Fixed: https://github.com/pytorch/pytorch/issues/14445 Also bumped up timeout to 30 seconds, since on 8-GPU machines, DDP test will take more than 15 seconds sometimes. Tested on 8 GPU machines:
```
tengli@learnfair062:~/pytorch/test$ python test_c10d.py --verbose
test_dist_broadcast_coalesced_gloo (__main__.DistributedDataParallelTest) ... ok
test_dist_broadcast_coalesced_nccl (__main__.DistributedDataParallelTest) ... skipped 'Test skipped due to known issues'
test_fp16 (__main__.DistributedDataParallelTest) ... ok
test_gloo_backend (__main__.DistributedDataParallelTest) ... ok
test_nccl_backend (__main__.DistributedDataParallelTest) ... ok
test_queue_reduction (__main__.DistributedDataParallelTest) ... ok
test_sync_params_no_buffers (__main__.DistributedDataParallelTest) ... ok
test_sync_params_with_buffers (__main__.DistributedDataParallelTest) ... ok
test_sync_reduction (__main__.DistributedDataParallelTest) ... ok
test_set_get (__main__.FileStoreTest) ... ok
test_set_get (__main__.PrefixFileStoreTest) ... ok
test_set_get (__main__.PrefixTCPStoreTest) ... ok
test_allgather_basics (__main__.ProcessGroupGlooTest) ... ok
test_allgather_checks (__main__.ProcessGroupGlooTest) ... ok
test_allreduce_basics (__main__.ProcessGroupGlooTest) ... ok
test_allreduce_basics_cuda (__main__.ProcessGroupGlooTest) ... ok
test_allreduce_checks (__main__.ProcessGroupGlooTest) ... ok
test_allreduce_stress (__main__.ProcessGroupGlooTest) ... ok
test_allreduce_stress_cuda (__main__.ProcessGroupGlooTest) ... ok
test_broadcast_basics (__main__.ProcessGroupGlooTest) ... ok
test_broadcast_basics_cuda (__main__.ProcessGroupGlooTest) ... ok
test_broadcast_checks (__main__.ProcessGroupGlooTest) ... ok
test_broadcast_stress (__main__.ProcessGroupGlooTest) ... ok
test_broadcast_stress_cuda (__main__.ProcessGroupGlooTest) ... ok
test_gather_basics (__main__.ProcessGroupGlooTest) ... ok
test_gather_checks (__main__.ProcessGroupGlooTest) ... ok
test_reduce_basics (__main__.ProcessGroupGlooTest) ... ok
test_reduce_checks (__main__.ProcessGroupGlooTest) ... ok
test_scatter_basics (__main__.ProcessGroupGlooTest) ... ok
test_scatter_checks (__main__.ProcessGroupGlooTest) ... ok
test_send_recv_all_to_all (__main__.ProcessGroupGlooTest) ... ok
test_timeout_kwarg (__main__.ProcessGroupGlooTest) ... ok
test_allgather_ops (__main__.ProcessGroupNCCLTest) ... ok
test_allreduce_ops (__main__.ProcessGroupNCCLTest) ... ok
test_barrier (__main__.ProcessGroupNCCLTest) ... ok
test_broadcast_ops (__main__.ProcessGroupNCCLTest) ... ok
test_reduce_ops (__main__.ProcessGroupNCCLTest) ... ok
test_common_errors (__main__.RendezvousEnvTest) ... ok
test_nominal (__main__.RendezvousEnvTest) ... ok
test_common_errors (__main__.RendezvousFileTest) ... ok
test_nominal (__main__.RendezvousFileTest) ... ok
test_common_errors (__main__.RendezvousTCPTest) ... ok
test_nominal (__main__.RendezvousTCPTest) ... ok
test_unknown_handler (__main__.RendezvousTest) ... ok
test_address_already_in_use (__main__.TCPStoreTest) ... ok
test_set_get (__main__.TCPStoreTest) ... ok

----------------------------------------------------------------------
Ran 46 tests in 162.980s

OK (skipped=1)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14452 Differential Revision: D13230652 Pulled By: teng-li fbshipit-source-id: 88580fe55b3a4fbc7a499ca3b591958f11623bf8
2018-11-27 | Barrier synchronizes with prior work before completing (#14386) | Pieter Noordhuis | 1 | -3/+2
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14386 See #13573, #14142, and #14271 for discussion. This change updates ProcessGroupGloo to ensure that all prior operations have completed before executing the barrier. Reviewed By: manojkris Differential Revision: D13205022 fbshipit-source-id: 673e7e6ca357dc843874d6dd8da590832e1de7fa
2018-11-27 | Make ProcessGroup::Work::wait() throw (#14298) | Pieter Noordhuis | 1 | -3/+6
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14298 This is a breaking API change for users of the C++ c10d API. The work object defined wait() to return a boolean. If the work completed successfully it would return true, if it didn't it would return false. It was then up to the user to call the exception() function to figure out what went wrong. This has proven suboptimal as it allows users to forget about failure handling and errors may be ignored. The work class is semantically very similar to std::future, where a call to get() may throw if the underlying std::promise has set an exception. This commit changes the semantic of the work class to be similar to this and turns wait() into a void function that throws if the work completes with an exception. The exception() function can still be used to retrieve the exception if isSuccess() returns false, but now returns an std::exception_ptr instead of a reference to a std::exception. Reviewed By: manojkris Differential Revision: D13158475 fbshipit-source-id: 9cd8569b9e7cbddc867a5f34c6fd0b7be85581b8
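For illustration, a rough Python-level analogue of the new semantics; the C++ Work object described above backs the handle returned by an async collective (backend, address, and port are placeholders):
```python
import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo", init_method="tcp://127.0.0.1:29503",
                        rank=0, world_size=1)

work = dist.all_reduce(torch.ones(4), async_op=True)
try:
    work.wait()  # now raises on failure instead of returning False
except RuntimeError as err:
    print("allreduce failed:", err)

dist.destroy_process_group()
```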
2018-11-27 | Use new style barrier support in c10d/gloo (#14294) | Pieter Noordhuis | 1 | -0/+19
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14294 This is the final collective to be ported to the new style where there is no longer a need to keep a cached algorithm instance around. There is a follow up change incoming to remove the algorithm caching functionality in ProcessGroupGloo. Reviewed By: manojkris Differential Revision: D13111509 fbshipit-source-id: f3ea0d955a62029fc4e7cfc09055e4957e0943ac
2018-11-26 | Fixed c10d test (#14389) | Teng Li | 1 | -1/+1
Summary: Most likely a typo. Tested on 8-GPU machine
```
tengli@learnfair062:~/pytorch/test$ python test_c10d.py ProcessGroupNCCLTest.test_barrier
.
----------------------------------------------------------------------
Ran 1 test in 29.341s

OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14389 Differential Revision: D13207207 Pulled By: teng-li fbshipit-source-id: aaffe14237076fe19d94e2fa4d9c093397f07bb9
2018-11-21 | Robust NCCL barrier improvement to cover all devices combinations (#14271) | Teng Li | 1 | -0/+30
Summary: This covers the edge case where we run the same NCCL process group with multiple GPU combinations rather than only the last GPU combination. We now keep track of which GPUs have been used previously in the NCCL process group, and barrier() itself synchronizes on each such GPU's NCCL stream. A test is included as well. Tested on an 8-GPU machine. Pull Request resolved: https://github.com/pytorch/pytorch/pull/14271 Differential Revision: D13164993 Pulled By: teng-li fbshipit-source-id: 81e04352740ea50b5e943369e74cfcba40bb61c1
2018-11-14 | Retry test on "Address already in use" error (#13911) | Pieter Noordhuis | 1 | -0/+3
Summary: This fixes #13907. Pull Request resolved: https://github.com/pytorch/pytorch/pull/13911 Differential Revision: D13046256 Pulled By: pietern fbshipit-source-id: bab70cd73ef868e23d4857b06e72830ad29ddb4f
2018-11-14 | FileStore auto deletes file and FileStore::add bug fix (#13708) | Teng Li | 1 | -40/+59
Summary: This addresses https://github.com/pytorch/pytorch/issues/11874, giving us identical file init_method behavior to the previous THD file init. The FileStore::add bug is also pretty annoying. Two bugs: (1) add() doesn't append to the end of the file; (2) the cache doesn't get updated. Both are fixed and covered by tests. I examined /tmp to ensure that all temp files are auto-deleted after running test_c10d.py. Pull Request resolved: https://github.com/pytorch/pytorch/pull/13708 Reviewed By: pietern Differential Revision: D12972810 Pulled By: teng-li fbshipit-source-id: 917255390aa52845f6b0ad0f283875a7a704da48
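For illustration, a minimal sketch of the file:// init path whose temp-file cleanup this commit fixes, using a throwaway temporary file:
```python
import tempfile
import torch.distributed as dist

# A throwaway file for the file:// rendezvous; with this fix the store cleans
# the file up automatically, matching the old THD behavior.
rendezvous_file = tempfile.NamedTemporaryFile(delete=False)

dist.init_process_group(
    backend="gloo",
    init_method="file://" + rendezvous_file.name,
    rank=0,
    world_size=1,
)
dist.destroy_process_group()
```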
2018-11-10 | Remove potential infinite loop from test_c10d.py (#13816) | Pieter Noordhuis | 1 | -1/+4
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/13816 If common.find_free_port() returns the same port over and over again, and the TCPStore fails to bind to it over and over again, this function has the potential to loop forever. If we can't find a free port after 10 tries, we are safe to assume something is wrong... Differential Revision: D13017700 fbshipit-source-id: 2139a0ea0f30ce08b5571f80ae0551f1fa7ba4a2
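For illustration, a hypothetical sketch of the bounded-retry pattern described above, using a plain socket bind as a stand-in for the TCPStore binding the port returned by find_free_port():
```python
import socket

def bind_free_port(max_tries=10):
    """Hypothetical sketch of the bounded retry: try a fresh OS-assigned port
    on every failed bind and give up after max_tries instead of looping
    forever, as the fixed test helper now does."""
    for _ in range(max_tries):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind(("127.0.0.1", 0))  # stand-in for the TCPStore binding a port
            return sock
        except OSError:
            sock.close()
    raise RuntimeError("no free port found after %d tries" % max_tries)
```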
2018-11-07 | Added the finer bucketing option for DDP (#13607) | Teng Li | 1 | -2/+3
Summary: We only need this for the backward pass; for the forward cast, the non-fine-grained bucketing should be better since it's sequential anyway. This is all covered by the c10d test; the bucket size was reduced so that bucketing actually happens in that test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/13607 Differential Revision: D12944515 Pulled By: teng-li fbshipit-source-id: d982e8dca2874c91d39b30b73a85bfbeb768c508
2018-11-06 | Consolidate argument checkers (#13623) | Pieter Noordhuis | 1 | -14/+14
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/13623 Moves the bulk of shared argument checkers in the gloo backend to Utils.hpp. Reviewed By: teng-li Differential Revision: D12934598 fbshipit-source-id: 7b80e67ccc3425f21498c30fbe7837af314f96f2
2018-11-05 | Disabling NCCL coalesced bcast test since it hangs in CI (#13606) | Teng Li | 1 | -0/+11
Summary: Functionality test shouldn't be affected since we have both backends testing for the same thing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/13606 Differential Revision: D12937185 Pulled By: teng-li fbshipit-source-id: 03d897b6690f7932654fdb7d11a07016dfffa751
2018-11-05 | Mixed precision DDP hang fix and fine-grained option for DDP perf (#13496) | Teng Li | 1 | -16/+64
Summary: When switching to mixed-precision fp16 training, DDP randomly hangs. Initially, this smelled like a similar NCCL bug I filed a while ago, but it turns out it's not. Again, I was seeing different rank processes end up with different bucket sizes. How could this even happen? It turns out that take_tensors generates the list of bucketed tensors in a non-deterministic order, because the key to the map is a pointer. An interesting bug to dig into and fix; fp16 DDP training should now be fully working. Also added another fine-grained take_tensors helper that aims to improve DDP performance, with a TODO to replace DDP's use of take_tensors with it. Fixed: https://github.com/pytorch/pytorch/issues/12150 Pull Request resolved: https://github.com/pytorch/pytorch/pull/13496 Differential Revision: D12920985 Pulled By: teng-li fbshipit-source-id: 26f3edae7be45a80fa7b2410a2e5a1baab212d9c
2018-11-05 | Add new style broadcast support in c10d/gloo (#13497) | Pieter Noordhuis | 1 | -7/+97
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/13497 This replaces the existing broadcast implementation with the new style collective call in the gloo backend. The CUDA path copies CUDA tensors to CPU tensors and then runs the CPU broadcast implementation. Reviewed By: teng-li Differential Revision: D12890013 fbshipit-source-id: 43f346fb2814f421bedc7babf89169703a46bb9c
2018-11-05 | Add new style allreduce support in c10d/gloo (#13426) | Pieter Noordhuis | 1 | -25/+58
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/13426 This replaces the existing allreduce implementation with the new style collective call in the gloo backend. This is the first one to include both a CPU and a CUDA path. The CUDA path copies CUDA tensors to CPU tensors and then runs the CPU allreduce implementation. This is not much different from the current situation in the case where there is a single input tensor per call (which is the case when called from DistributedDataParallel). Reviewed By: teng-li Differential Revision: D12855689 fbshipit-source-id: 574281d762dd29149fa7f634fb71f8f6a9787598
2018-11-05 | Add reduce support in c10d/gloo (#13425) | Pieter Noordhuis | 1 | -0/+69
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/13425 This adds support for the new style reduce collective call in the gloo backend. Reviewed By: teng-li Differential Revision: D12869404 fbshipit-source-id: 93c641e6aba3b03c796bda80737547c565cfa571
2018-11-05 | Add allgather support in c10d/gloo (#13424) | Pieter Noordhuis | 1 | -0/+58
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/13424 This adds support for the allgather collective call in the gloo backend. The gloo implementation does not support multiple inputs per rank (nor one or more outputs per rank), so we use a temporary flattened buffer and unflatten once the collective finishes. Reviewed By: teng-li Differential Revision: D12832009 fbshipit-source-id: 2f5c1934a338589cef1d3192bd92ada135fecd7a
2018-11-05 | Add gather support in c10d/gloo (#13423) | Pieter Noordhuis | 1 | -0/+88
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/13423 This adds support for the gather collective call in the gloo backend. The gloo implementation does not yet support the mode where the root has multiple output tensors (one per rank), so we use a temporary flattened buffer and unflatten on the root once the collective finishes. Reviewed By: teng-li Differential Revision: D12811647 fbshipit-source-id: 90fe8af8c390090b7d4ef43aa74f4e3e67ab9d0b
2018-11-05 | Add scatter support in c10d/gloo (#13422) | Pieter Noordhuis | 1 | -0/+91
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/13422 This adds support for the scatter collective call in the gloo backend. This is the first of the new style collectives that do not expect to be created once and used many times. This commit contains some shortcuts to make this new style work side by side with the existing implementations (such as the std::tuple with nullptr's). These shortcuts are temporary until we have moved over all collectives to this new style. Reviewed By: teng-li Differential Revision: D12310219 fbshipit-source-id: 32e68717f819d5980f0e469d297204948351cefc
2018-10-29 | Test scripts only run cases defined in the running script (#13250) | Tongzhou Wang | 1 | -2/+2
Summary: 1. Refactors `TestTorch` into `TestTorchMixin` (subclass of `object`) and `TestTorch` (subclass of `TestCase`, MRO `(TestCase, TestTorchMixin)`, only defined if `__name__ == '__main__'`). So other scripts won't accidentally run it. 2. Adds an assertion in `load_tests` that each script only runs cases defined in itself. cc yf225 ezyang Pull Request resolved: https://github.com/pytorch/pytorch/pull/13250 Differential Revision: D12823734 Pulled By: SsnL fbshipit-source-id: 7a169f35fe0794ce76e310d8a137d9a3265c012b
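For illustration, a generic sketch of the mixin pattern described above, using unittest.TestCase as a stand-in for PyTorch's own TestCase:
```python
import unittest

class TestTorchMixin(object):
    """Shared test bodies live on a plain object subclass so that merely
    importing this module elsewhere does not register the tests."""
    def test_addition(self):
        self.assertEqual(1 + 1, 2)

if __name__ == "__main__":
    # Only the script being run derives a real TestCase from the mixin.
    class TestTorch(unittest.TestCase, TestTorchMixin):
        pass
    unittest.main()
```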
2018-10-26 | Shard all of tests based on how many tests exist. (#13160) | Zachary DeVito | 1 | -1/+4
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/13160 Reduces pytorch_core build from 2 hours to 30 minutes Reviewed By: soumith, dzhulgakov Differential Revision: D10524261 fbshipit-source-id: 97270ac73404b5ea4c264cd0e9d8d4b1be79b0e9
2018-10-26 | Ignore flake8 warnings in test_c10d.py (#13159) | Pieter Noordhuis | 1 | -2/+6
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/13159 These lint violations are intentional. Reviewed By: ezyang Differential Revision: D10862131 fbshipit-source-id: 70ad4b0a360cb12d050805fd7b1080dfe4566e86
2018-10-25 | Use default timeout of 30 minutes for gloo backend (#13056) | Pieter Noordhuis | 1 | -0/+18
Summary: The existing default timeout was set at 10 seconds, which is too low for asynchronous tasks that depend on a barrier to resynchronize. Having a single timeout for all operations is not ideal and this will be addressed in future commits. Pull Request resolved: https://github.com/pytorch/pytorch/pull/13056 Reviewed By: teng-li Differential Revision: D10558746 Pulled By: pietern fbshipit-source-id: d857ea55b1776fc7d0baf2efd77951b5d98beabb
2018-10-24 | DDP perf improvement: move sync_reduction to C++, dedicated CUDA streams for memcpy (#12954) | Teng Li | 1 | -0/+18
Summary: Moved sync_reduction to C++; use a dedicated CUDA stream for memcpy; also use a dedicated CUDA stream for memcpy in queue_reduction. Added a test as well. CI should cover both DDP and the unit test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/12954 Differential Revision: D10520069 Pulled By: teng-li fbshipit-source-id: 64348e4e43c15f9695a4c28b036c232587ecfb65
2018-10-22 | Move DDP queue_reduction to C++ (#12852) | Teng Li | 1 | -2/+30
Summary: Fully working version, continuing from goldsborough's initial version. Waiting on the stream guard to be merged before adding more stream perf logic into the C++ version. Pull Request resolved: https://github.com/pytorch/pytorch/pull/12852 Differential Revision: D10468696 Pulled By: teng-li fbshipit-source-id: 8e46d408796973817abfd9dbd6566e0ca5b7a13f
2018-10-18 | Try to reduce c10d test flakiness (#12782) | Pieter Noordhuis | 1 | -16/+29
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/12782 We have seen the "Address already in use" error popup a few times when instantiating the TCPStore. The port that it uses is dynamically generated through common.find_free_port(), which binds a new socket to a random port, closes the socket, and returns the port that the OS had assigned. If some other process grabs that port in the time between closing the socket and the TCPStore binding to it, the bind error shows up. This commit changes most tests to use the FileStore instead and includes a retry when testing the TCPStore. Differential Revision: D10433401 fbshipit-source-id: 8dd575ac91a3cddd1cc41ddb0ff4311ddc58c813
2018-10-17 | Rename test/common.py to test/common_utils.py (#12794) | James Sun | 1 | -2/+2
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/12794 common.py is used in base_module for almost all tests in test/. The name of this file is so common that can easily conflict with other dependencies if they happen to have another common.py in the base module. Rename the file to avoid conflict. Reviewed By: orionr Differential Revision: D10438204 fbshipit-source-id: 6a996c14980722330be0a9fd3a54c20af4b3d380
2018-09-19 | Add env:// rendezvous test (#11782) | Pieter Noordhuis | 1 | -0/+72
Summary: A missing environment variable raised a missing key error. Now it raises a more descriptive error of the actual problem, for example: ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable WORLD_SIZE expected, but not set Pull Request resolved: https://github.com/pytorch/pytorch/pull/11782 Differential Revision: D9888962 Pulled By: pietern fbshipit-source-id: 5947e7a7bf7aa45f13bbd7b5e997529f26cc92d6
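For illustration, a sketch that triggers the descriptive error quoted above by leaving WORLD_SIZE unset (address and port are placeholders):
```python
import os
import torch.distributed as dist

os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29504"
os.environ["RANK"] = "0"
os.environ.pop("WORLD_SIZE", None)  # deliberately missing

try:
    dist.init_process_group(backend="gloo", init_method="env://")
except ValueError as err:
    # e.g. "Error initializing torch.distributed using env:// rendezvous:
    # environment variable WORLD_SIZE expected, but not set"
    print(err)
```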
2018-09-14 | Add message tag parameter to send/recv | Pieter Noordhuis | 1 | -2/+2
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/11490 Reviewed By: teng-li Differential Revision: D9828116 Pulled By: pietern fbshipit-source-id: 98be1ae84b6763ffb329e63c030c5e3ec0e748b7
2018-09-11 | convert output_device at data_parallel from torch.device to index (#10189) | Wei Yang | 1 | -4/+7
Summary: - fixes #9984 Pull Request resolved: https://github.com/pytorch/pytorch/pull/10189 Differential Revision: D9545390 Pulled By: weiyangfb fbshipit-source-id: 3a6a705437553ba319e9fd4b7f676ff73857a27e
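For illustration, a minimal sketch of the fixed call path, assuming at least one CUDA device is available; output_device may now be given as a torch.device and is converted to an index internally:
```python
import torch
from torch.nn.parallel import data_parallel

module = torch.nn.Linear(8, 8).cuda()
inp = torch.randn(4, 8).cuda()

# output_device passed as a torch.device instead of an integer index.
out = data_parallel(module, inp, device_ids=[0], output_device=torch.device("cuda:0"))
```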