author    | Teng Li <teng-li@users.noreply.github.com>  | 2018-04-15 18:53:10 -0700
committer | Soumith Chintala <soumith@gmail.com>        | 2018-04-15 21:53:10 -0400
commit    | f5beff334bb511ff50d606e4ae5b47938723cd13 (patch)
tree      | 84e0eccc1441b7dee03fa3ccc0db7a7b8c8c545c /docs
parent    | 5463a4a3199b89aa1e944068694fa0f50635807b (diff)
Added distributed docs on NCCL2 backend/functions and launch module (#6579)
Diffstat (limited to 'docs')
-rw-r--r-- | docs/source/distributed.rst | 123
1 file changed, 98 insertions, 25 deletions
diff --git a/docs/source/distributed.rst b/docs/source/distributed.rst
index 27decd0f99..23846f18b1 100644
--- a/docs/source/distributed.rst
+++ b/docs/source/distributed.rst
@@ -7,35 +7,35 @@ Distributed communication package - torch.distributed
 .. automodule:: torch.distributed
 .. currentmodule:: torch.distributed
 
-Currently torch.distributed supports three backends, each with
+Currently torch.distributed supports four backends, each with
 different capabilities. The table below shows which functions are
 available for use with CPU / CUDA tensors.
 MPI supports cuda only if the implementation used to build PyTorch supports it.
 
-+------------+-----------+-----------+-----------+
-| Backend    | ``tcp``   | ``gloo``  | ``mpi``   |
-+------------+-----+-----+-----+-----+-----+-----+
-| Device     | CPU | GPU | CPU | GPU | CPU | GPU |
-+============+=====+=====+=====+=====+=====+=====+
-| send       | ✓   | ✘   | ✘   | ✘   | ✓   | ?   |
-+------------+-----+-----+-----+-----+-----+-----+
-| recv       | ✓   | ✘   | ✘   | ✘   | ✓   | ?   |
-+------------+-----+-----+-----+-----+-----+-----+
-| broadcast  | ✓   | ✘   | ✓   | ✓   | ✓   | ?   |
-+------------+-----+-----+-----+-----+-----+-----+
-| all_reduce | ✓   | ✘   | ✓   | ✓   | ✓   | ?   |
-+------------+-----+-----+-----+-----+-----+-----+
-| reduce     | ✓   | ✘   | ✘   | ✘   | ✓   | ?   |
-+------------+-----+-----+-----+-----+-----+-----+
-| all_gather | ✓   | ✘   | ✘   | ✘   | ✓   | ?   |
-+------------+-----+-----+-----+-----+-----+-----+
-| gather     | ✓   | ✘   | ✘   | ✘   | ✓   | ?   |
-+------------+-----+-----+-----+-----+-----+-----+
-| scatter    | ✓   | ✘   | ✘   | ✘   | ✓   | ?   |
-+------------+-----+-----+-----+-----+-----+-----+
-| barrier    | ✓   | ✘   | ✓   | ✓   | ✓   | ?   |
-+------------+-----+-----+-----+-----+-----+-----+
++------------+-----------+-----------+-----------+-----------+
+| Backend    | ``tcp``   | ``gloo``  | ``mpi``   | ``nccl``  |
++------------+-----+-----+-----+-----+-----+-----+-----+-----+
+| Device     | CPU | GPU | CPU | GPU | CPU | GPU | CPU | GPU |
++============+=====+=====+=====+=====+=====+=====+=====+=====+
+| send       | ✓   | ✘   | ✘   | ✘   | ✓   | ?   | ✘   | ✘   |
++------------+-----+-----+-----+-----+-----+-----+-----+-----+
+| recv       | ✓   | ✘   | ✘   | ✘   | ✓   | ?   | ✘   | ✘   |
++------------+-----+-----+-----+-----+-----+-----+-----+-----+
+| broadcast  | ✓   | ✘   | ✓   | ✓   | ✓   | ?   | ✘   | ✓   |
++------------+-----+-----+-----+-----+-----+-----+-----+-----+
+| all_reduce | ✓   | ✘   | ✓   | ✓   | ✓   | ?   | ✘   | ✓   |
++------------+-----+-----+-----+-----+-----+-----+-----+-----+
+| reduce     | ✓   | ✘   | ✘   | ✘   | ✓   | ?   | ✘   | ✓   |
++------------+-----+-----+-----+-----+-----+-----+-----+-----+
+| all_gather | ✓   | ✘   | ✘   | ✘   | ✓   | ?   | ✘   | ✓   |
++------------+-----+-----+-----+-----+-----+-----+-----+-----+
+| gather     | ✓   | ✘   | ✘   | ✘   | ✓   | ?   | ✘   | ✓   |
++------------+-----+-----+-----+-----+-----+-----+-----+-----+
+| scatter    | ✓   | ✘   | ✘   | ✘   | ✓   | ?   | ✘   | ✓   |
++------------+-----+-----+-----+-----+-----+-----+-----+-----+
+| barrier    | ✓   | ✘   | ✓   | ✓   | ✓   | ?   | ✘   | ✘   |
++------------+-----+-----+-----+-----+-----+-----+-----+-----+
 
 .. _distributed-basics:
@@ -173,7 +173,7 @@ as they should never be created manually, but they are guaranteed to support two
 * ``is_completed()`` - returns True if the operation has finished
 * ``wait()`` - will block the process until the operation is finished.
   ``is_completed()`` is guaranteed to return True once it returns.
-
+
 When using the MPI backend, :func:`~torch.distributed.isend` and :func:`~torch.distributed.irecv`
 support non-overtaking, which has some guarantees on supporting message order. For more detail, see
 http://mpi-forum.org/docs/mpi-2.2/mpi22-report/node54.htm#Node54
@@ -199,3 +199,76 @@ Collective functions
 
 .. autofunction:: barrier
 
+Multi-GPU collective functions
+------------------------------
+
+If you have more than one GPU on each node, when using the NCCL backend,
+:func:`~torch.distributed.broadcast_multigpu`,
+:func:`~torch.distributed.all_reduce_multigpu`,
+:func:`~torch.distributed.reduce_multigpu`, and
+:func:`~torch.distributed.all_gather_multigpu` support distributed collective
+operations among multiple GPUs within each node. These functions can potentially
+improve the overall distributed training performance and be easily used by
+passing a list of tensors. Each tensor in the passed list needs
+to be on a separate GPU device of the host where the function is called. Note
+that the length of the tensor list needs to be identical among all the
+distributed processes. Also note that currently the multi-GPU collective
+functions are only supported by the NCCL backend.
+
+For example, suppose the system we use for distributed training has 2 nodes,
+each of which has 8 GPUs. On each of the 16 GPUs, there is a tensor that we
+would like to all-reduce. The following code can serve as a reference:
+
+Code running on Node 0
+
+::
+
+    import torch
+    import torch.distributed as dist
+
+    dist.init_process_group(backend="nccl",
+                            init_method="file:///distributed_test",
+                            world_size=2,
+                            rank=0)
+    tensor_list = []
+    for dev_idx in range(torch.cuda.device_count()):
+        tensor_list.append(torch.FloatTensor([1]).cuda(dev_idx))
+
+    dist.all_reduce_multigpu(tensor_list)
+
+Code running on Node 1
+
+::
+
+    import torch
+    import torch.distributed as dist
+
+    dist.init_process_group(backend="nccl",
+                            init_method="file:///distributed_test",
+                            world_size=2,
+                            rank=1)
+    tensor_list = []
+    for dev_idx in range(torch.cuda.device_count()):
+        tensor_list.append(torch.FloatTensor([1]).cuda(dev_idx))
+
+    dist.all_reduce_multigpu(tensor_list)
+
+After the call, all 16 tensors on the two nodes will have the all-reduced value
+of 16.
+
+.. autofunction:: broadcast_multigpu
+
+.. autofunction:: all_reduce_multigpu
+
+.. autofunction:: reduce_multigpu
+
+.. autofunction:: all_gather_multigpu
+
+
+Launch utility
+--------------
+
+The `torch.distributed` package also provides a launch utility in
+`torch.distributed.launch`.
+
+.. automodule:: torch.distributed.launch
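The new ``nccl`` column in the table above only covers GPU tensors, and the ordinary single-tensor collectives follow the same setup pattern as the ``*_multigpu`` examples in the patch. As a minimal sketch of a plain NCCL all-reduce with one GPU per process (the rendezvous file path, world size, and rank values below are illustrative assumptions, not taken from the patch):

::

    import torch
    import torch.distributed as dist

    # One process per GPU; in practice rank and world_size would come from the
    # environment or a launcher rather than being hard-coded like this.
    rank, world_size = 0, 2
    dist.init_process_group(backend="nccl",
                            init_method="file:///tmp/nccl_rendezvous",  # assumed shared path
                            world_size=world_size,
                            rank=rank)

    # Bind this process to its own GPU before creating CUDA tensors.
    torch.cuda.set_device(rank % torch.cuda.device_count())
    tensor = torch.FloatTensor([1]).cuda()

    # Defaults to a SUM reduction; every rank ends up holding world_size.
    dist.all_reduce(tensor)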
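The launch utility's command-line interface is rendered from the module docstring by the ``automodule`` directive above, so it is not visible in this diff. As a rough sketch of the intended workflow, assuming the launcher passes each spawned process its GPU index via a ``--local_rank`` argument and exports the rendezvous settings as environment variables (both are assumptions about `torch.distributed.launch`, not shown in this patch), a training script would start roughly like this:

::

    # assumed invocation, one process per GPU on this node:
    #   python -m torch.distributed.launch --nproc_per_node=NUM_GPUS train.py
    import argparse

    import torch
    import torch.distributed as dist

    parser = argparse.ArgumentParser()
    # The launcher is assumed to append --local_rank=<gpu index> for each process.
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()

    # Bind this process to its GPU before any CUDA work.
    torch.cuda.set_device(args.local_rank)

    # env:// reads MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE from the
    # environment, which the launcher is assumed to have set for each process.
    dist.init_process_group(backend="nccl", init_method="env://")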