author    mhouston <mhouston@nvidia.com>    2015-08-18 15:29:26 -0700
committer mhouston <mhouston@nvidia.com>    2015-08-18 15:29:26 -0700
commit    26a9880d72e81d415d1dc3bf449586ce54185ea4 (patch)
tree      e95c6b63426c242b2a9c1045e64bf38cfa623a11 /docs
parent    7453bbf6ea1aeb03330b5892a06276b69434f699 (diff)
Add information about how to get GPU topology from nvidia-smi
Diffstat (limited to 'docs')
-rw-r--r--  docs/multigpu.md  2
1 file changed, 2 insertions, 0 deletions
diff --git a/docs/multigpu.md b/docs/multigpu.md
index 4b202347..01cfb893 100644
--- a/docs/multigpu.md
+++ b/docs/multigpu.md
@@ -19,6 +19,8 @@ For best performance, P2P DMA access between devices is needed. Without P2P acce
The current implementation has a "soft" assumption that the devices being used are homogeneous. In practice, any devices of the same general class should work together, but performance and total size are limited by the smallest device being used. E.g. if you combine a TitanX and a GTX980, performance will be limited by the 980. Mixing vastly different levels of boards, e.g. Kepler and Fermi, is not supported.
+"nvidia-smi topo -m" will show you the connectivity matrix. You can do P2P through PCIe bridges, but not across socket level links at this time, e.g. across CPU sockets on a multi-socket motherboard.
+
# Scaling Performance
Performance is **heavily** dependent on the PCIe topology of the system, the configuration of the neural network you are training, and the speed of each of the layers. Systems like the DIGITS DevBox have an optimized PCIe topology (X99-E WS chipset). In general, scaling on 2 GPUs tends to be ~1.8x on average for networks like AlexNet, CaffeNet, VGG, and GoogleNet. Scaling begins to fall off at 4 GPUs. Generally, with "weak scaling", where the batch size increases with the number of GPUs, you will see roughly 3.5x scaling. With "strong scaling", the system can become communication bound, especially with layer performance optimizations like those in [cuDNNv3](http://nvidia.com/cudnn), and you will likely see closer to mid-2.x scaling in performance. Networks that have heavy computation compared to the number of parameters tend to have the best scaling performance. \ No newline at end of file
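
For reference, the P2P capability that the diff above discusses can also be checked programmatically with the CUDA runtime, which is what the training code ultimately relies on. The sketch below is illustrative only (it is not part of this commit, and the file name `check_p2p.cu` is hypothetical); it queries each device pair with `cudaDeviceCanAccessPeer`, the API-level counterpart of the matrix printed by `nvidia-smi topo -m`.

```cpp
// check_p2p.cu -- illustrative sketch, not part of the commit.
// Queries pairwise P2P capability via the CUDA runtime API.
// Build (assumed): nvcc check_p2p.cu -o check_p2p
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int n = 0;
  if (cudaGetDeviceCount(&n) != cudaSuccess || n < 2) {
    std::printf("Fewer than two CUDA devices; P2P does not apply.\n");
    return 0;
  }
  for (int i = 0; i < n; ++i) {
    for (int j = 0; j < n; ++j) {
      if (i == j) continue;
      int can_access = 0;
      // Reports whether device i can directly address memory on device j
      // (P2P DMA, e.g. through a PCIe bridge but not across CPU sockets).
      cudaDeviceCanAccessPeer(&can_access, i, j);
      std::printf("GPU %d -> GPU %d : %s\n", i, j,
                  can_access ? "P2P capable" : "no P2P");
    }
  }
  return 0;
}
```

Device pairs that report no P2P path will still work, but transfers are staged through host memory, which matches the performance caveat in the text above.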