author     James Reed <jamesreed@fb.com>  2019-04-19 19:13:10 -0700
committer  Facebook Github Bot <facebook-github-bot@users.noreply.github.com>  2019-04-19 19:16:24 -0700
commit     d17c22d024467b7185e33c4652b44739f67965be (patch)
tree       151571fa171e1b269cda0781d9dd1655cebd1ee9 /caffe2
parent     6325b6e44e56f518d423cb46c93cfad892b236d0 (diff)
Improve embedding_bag add kernel (#19329)
Summary:
This was actually getting pretty poor throughput with respect to memory bandwidth. I used this test to measure the memory bandwidth specifically for the AXPY call: https://gist.github.com/jamesr66a/b27ff9ecbe036eed5ec310c0a3cc53c5
And I got ~8 GB/s before this change, but ~14 GB/s after this change.
This seems to speed up the operator overall by around 1.3x (benchmark: https://gist.github.com/jamesr66a/c533817c334d0be432720ef5e54a4166):
== Before ==
time_per_iter 0.0001298875093460083
GB/s 3.082544287868467
== After ==
time_per_iter 0.00010104801654815674
GB/s 3.9623142905451076
The large difference between the local BW increase and the full-op BW increase likely indicates significant time is being spent elsewhere in the op, so I will investigate that.
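The GB/s figures above follow the usual effective-bandwidth recipe: bytes touched per iteration divided by wall time. The actual benchmark gists are linked rather than reproduced here; below is a minimal pure-Python sketch of that recipe (hypothetical helper names; the real benchmark times the compiled operator, not this naive loop):

```python
import time

def embedding_bag_sum(weight, indices, offsets, block_size):
    # Naive EmbeddingBag "sum": for each bag, gather rows of `weight`
    # (a list of length-`block_size` rows) and accumulate them.
    bounds = list(offsets) + [len(indices)]
    out = []
    for i in range(len(offsets)):
        row = [0.0] * block_size
        for idx in indices[bounds[i]:bounds[i + 1]]:
            for k in range(block_size):
                row[k] += weight[idx][k]
        out.append(row)
    return out

def effective_bandwidth_gbs(weight, indices, offsets, block_size, iters=10):
    # Traffic model: every gathered row is read once and every output
    # row is written once, at 8 bytes per float64 element. This is a
    # lower bound on real memory traffic.
    bytes_per_iter = (len(indices) + len(offsets)) * block_size * 8
    start = time.perf_counter()
    for _ in range(iters):
        embedding_bag_sum(weight, indices, offsets, block_size)
    elapsed = (time.perf_counter() - start) / iters
    return bytes_per_iter / elapsed / 1e9
```

Comparing such a bytes-moved/second number against the machine's peak DRAM bandwidth is what shows whether a kernel like this one is memory-bound, which is the framing used throughout this PR.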
EDIT: I updated this PR to include a call into caffe2/perfkernels. This is the progression:
Before
time_per_iter 8.983819484710693e-05
GB/s 4.456723564864611
After no axpy
time_per_iter 7.19951868057251e-05
GB/s 5.56126065872172
After perfkernels
time_per_iter 5.6699180603027346e-05
GB/s 7.061548257694262
After perfkernels no grad
time_per_iter 4.388842582702637e-05
GB/s 9.122769670026413
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19329
Reviewed By: dzhulgakov
Differential Revision: D14969630
Pulled By: jamesr66a
fbshipit-source-id: 42d1015772c87bedd119e33c0aa2c8105160a738
Diffstat (limited to 'caffe2')
-rw-r--r-- | caffe2/perfkernels/embedding_lookup.h | 3 |
1 file changed, 3 insertions, 0 deletions
diff --git a/caffe2/perfkernels/embedding_lookup.h b/caffe2/perfkernels/embedding_lookup.h
index 1d0cd2abfa..37830d69c8 100644
--- a/caffe2/perfkernels/embedding_lookup.h
+++ b/caffe2/perfkernels/embedding_lookup.h
@@ -28,6 +28,9 @@ namespace caffe2 {
  *   if (normalize_weights && lengths[i] > 0)
  *     for (k = 0..block_size-1)
  *       out[i*block_size + k] /= lengths[i]
+ *
+ * TODO: make this API also take "offsets" rather than "lengths" to match the
+ * API for PyTorch's EmbeddingBag
  */
 template <
     typename IndexType,
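The TODO in the hunk is about the two ways of describing how a flat index array splits into bags: caffe2's `lengths` (indices per bag) versus PyTorch EmbeddingBag's `offsets` (start position of each bag). The two are interconvertible by a prefix sum; a quick sketch with hypothetical helper names:

```python
def lengths_to_offsets(lengths):
    # offsets[i] = sum(lengths[:i]), i.e. an exclusive prefix sum:
    # bag i starts where all previous bags' indices end.
    offsets = []
    total = 0
    for n in lengths:
        offsets.append(total)
        total += n
    return offsets

def offsets_to_lengths(offsets, num_indices):
    # Inverse mapping: lengths[i] = offsets[i+1] - offsets[i],
    # with the final bag ending at num_indices.
    bounds = list(offsets) + [num_indices]
    return [bounds[i + 1] - bounds[i] for i in range(len(offsets))]
```

For example, `lengths = [2, 1, 3]` over 6 indices corresponds to `offsets = [0, 2, 3]`, and the round trip recovers the original lengths.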