summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authorjgorbe <jgorbe@google.com>2018-02-02 18:38:30 -0800
committerVictor Costan <pwnall@chromium.org>2018-02-17 00:47:18 -0800
commitca37ab7fb9b718e056009babb4fea591626e5882 (patch)
tree0d1a8702bfa75a722559934646b47fb583f4a465
parent15a2804cd2fcea1dab72b9c1eae0f71e4294c0da (diff)
downloadsnappy-ca37ab7fb9b718e056009babb4fea591626e5882.tar.gz
snappy-ca37ab7fb9b718e056009babb4fea591626e5882.tar.bz2
snappy-ca37ab7fb9b718e056009babb4fea591626e5882.zip
Ensure DecompressAllTags starts on a 32-byte boundary + 16 bytes.
First of all, I'm sorry about this ugly hack. I hope the following long explanation is enough to justify it. We have observed that, in some conditions, the results for dataset number 10 (pb) in the zippy benchmark can show a >20% regression on Skylake CPUs. In order to diagnose this, we profiled the benchmark looking at hot functions (99% of the time is spent on DecompressAllTags), then looked at the generated code to see if there was any difference. In order to discard a minor difference we observed in register allocation we replaced zippy.cc with a pre-built assembly file so it was the same in both variants, and we still were able to reproduce the regression. After discarding a regression caused by the compiler, we digged a bit further and noticed that the alignment of the function in the final binary was different. Both were aligned to a 16-byte boundary, but the slower one was also (by chance) aligned to a 32-byte boundary. A regression caused by alignment differences would explain why I could reproduce it consistently on the same CitC client, but not others: slight differences in the sources can cause the resulting binary to have different layout. Here are some detailed benchmark results before/after the fix. Note how fixing the alignment makes the difference between baseline and experiment go away, but regular 32-byte alignment puts both variants in the same ballpark as the original regression: Original (note BM_UCord_10 and BM_UDataBuffer_10 around the -24% line): BASELINE BM_UCord/10 2938 2932 24194 3.767GB/s pb BM_UDataBuffer/10 3008 3004 23316 3.677GB/s pb EXPERIMENT BM_UCord/10 3797 3789 18512 2.915GB/s pb BM_UDataBuffer/10 4024 4016 17543 2.750GB/s pb Aligning DecompressAllTags to a 32-byte boundary: BASELINE BM_UCord/10 3872 3862 18035 2.860GB/s pb BM_UDataBuffer/10 4010 3998 17591 2.763GB/s pb EXPERIMENT BM_UCord/10 3884 3876 18126 2.850GB/s pb BM_UDataBuffer/10 4037 4027 17199 2.743GB/s pb Aligning DecompressAllTags to a 32-byte boundary + 16 bytes (this patch): BASELINE BM_UCord/10 3103 3095 22642 3.569GB/s pb BM_UDataBuffer/10 3186 3177 21947 3.476GB/s pb EXPERIMENT BM_UCord/10 3104 3095 22632 3.569GB/s pb BM_UDataBuffer/10 3167 3159 22076 3.496GB/s pb This change forces the "good" alignment for DecompressAllTags which, if anything, should make benchmark results more stable (and maybe we'll improve some unlucky application!).
-rw-r--r--snappy.cc19
1 files changed, 19 insertions, 0 deletions
diff --git a/snappy.cc b/snappy.cc
index 081482d..d0eb323 100644
--- a/snappy.cc
+++ b/snappy.cc
@@ -699,7 +699,26 @@ class SnappyDecompressor {
// Process the next item found in the input.
// Returns true if successful, false on error or end of input.
template <class Writer>
+#if defined(__GNUC__) && defined(__x86_64__)
+ __attribute__((aligned(32)))
+#endif
void DecompressAllTags(Writer* writer) {
+ // In x86, pad the function body to start 16 bytes later. This function has
+ // a couple of hotspots that are highly sensitive to alignment: we have
+ // observed regressions by more than 20% in some metrics just by moving the
+ // exact same code to a different position in the benchmark binary.
+ //
+ // Putting this code on a 32-byte-aligned boundary + 16 bytes makes us hit
+ // the "lucky" case consistently. Unfortunately, this is a very brittle
+ // workaround, and future differences in code generation may reintroduce
+ // this regression. If you experience a big, difficult to explain, benchmark
+ // performance regression here, first try removing this hack.
+#if defined(__GNUC__) && defined(__x86_64__)
+ // Two 8-byte "NOP DWORD ptr [EAX + EAX*1 + 00000000H]" instructions.
+ asm(".byte 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00");
+ asm(".byte 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00");
+#endif
+
const char* ip = ip_;
// For position-independent executables, accessing global arrays can be
// slow. Move wordmask array onto the stack to mitigate this.