Ensure DecompressAllTags starts on a 32-byte boundary + 16 bytes.

First of all, I'm sorry about this ugly hack. I hope the following long explanation is enough to justify it. We have observed that, in some conditions, the results for dataset number 10 (pb) in the zippy benchmark can show a >20% regression on Skylake CPUs. In order to diagnose this, we profiled the benchmark looking at hot functions (99% of the time is spent on DecompressAllTags), then looked at the generated code to see if there was any difference. In order to discard a minor difference we observed in register allocation we replaced zippy.cc with a pre-built assembly file so it was the same in both variants, and we still were able to reproduce the regression. After discarding a regression caused by the compiler, we digged a bit further and noticed that the alignment of the function in the final binary was different. Both were aligned to a 16-byte boundary, but the slower one was also (by chance) aligned to a 32-byte boundary. A regression caused by alignment differences would explain why I could reproduce it consistently on the same CitC client, but not others: slight differences in the sources can cause the resulting binary to have different layout. Here are some detailed benchmark results before/after the fix. Note how fixing the alignment makes the difference between baseline and experiment go away, but regular 32-byte alignment puts both variants in the same ballpark as the original regression: Original (note BM_UCord_10 and BM_UDataBuffer_10 around the -24% line): BASELINE BM_UCord/10 2938 2932 24194 3.767GB/s pb BM_UDataBuffer/10 3008 3004 23316 3.677GB/s pb EXPERIMENT BM_UCord/10 3797 3789 18512 2.915GB/s pb BM_UDataBuffer/10 4024 4016 17543 2.750GB/s pb Aligning DecompressAllTags to a 32-byte boundary: BASELINE BM_UCord/10 3872 3862 18035 2.860GB/s pb BM_UDataBuffer/10 4010 3998 17591 2.763GB/s pb EXPERIMENT BM_UCord/10 3884 3876 18126 2.850GB/s pb BM_UDataBuffer/10 4037 4027 17199 2.743GB/s pb Aligning DecompressAllTags to a 32-byte boundary + 16 bytes (this patch): BASELINE BM_UCord/10 3103 3095 22642 3.569GB/s pb BM_UDataBuffer/10 3186 3177 21947 3.476GB/s pb EXPERIMENT BM_UCord/10 3104 3095 22632 3.569GB/s pb BM_UDataBuffer/10 3167 3159 22076 3.496GB/s pb This change forces the "good" alignment for DecompressAllTags which, if anything, should make benchmark results more stable (and maybe we'll improve some unlucky application!).
author: jgorbe <jgorbe@google.com> 2018-02-02 18:38:30 -0800
committer: Victor Costan <pwnall@chromium.org> 2018-02-17 00:47:18 -0800
commit: ca37ab7fb9b718e056009babb4fea591626e5882 (patch)
tree: 0d1a8702bfa75a722559934646b47fb583f4a465
parent: 15a2804cd2fcea1dab72b9c1eae0f71e4294c0da (diff)
download: snappy-ca37ab7fb9b718e056009babb4fea591626e5882.tar.gz
snappy-ca37ab7fb9b718e056009babb4fea591626e5882.tar.bz2
snappy-ca37ab7fb9b718e056009babb4fea591626e5882.zip
1 files changed, 19 insertions, 0 deletions
diff --git a/snappy.cc b/snappy.cc
index 081482d..d0eb323 100644
--- a/snappy.cc
+++ b/snappy.cc
@@ -699,7 +699,26 @@ class SnappyDecompressor {
   // Process the next item found in the input.
   // Returns true if successful, false on error or end of input.
   template <class Writer>
+#if defined(__GNUC__) && defined(__x86_64__)
+  __attribute__((aligned(32)))
+#endif
   void DecompressAllTags(Writer* writer) {
+    // In x86, pad the function body to start 16 bytes later. This function has
+    // a couple of hotspots that are highly sensitive to alignment: we have
+    // observed regressions by more than 20% in some metrics just by moving the
+    // exact same code to a different position in the benchmark binary.
+    //
+    // Putting this code on a 32-byte-aligned boundary + 16 bytes makes us hit
+    // the "lucky" case consistently. Unfortunately, this is a very brittle
+    // workaround, and future differences in code generation may reintroduce
+    // this regression. If you experience a big, difficult to explain, benchmark
+    // performance regression here, first try removing this hack.
+#if defined(__GNUC__) && defined(__x86_64__)
+    // Two 8-byte "NOP DWORD ptr [EAX + EAX*1 + 00000000H]" instructions.
+    asm(".byte 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00");
+    asm(".byte 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00");
+#endif
+
     const char* ip = ip_;
     // For position-independent executables, accessing global arrays can be
     // slow.  Move wordmask array onto the stack to mitigate this.
author	jgorbe <jgorbe@google.com>	2018-02-02 18:38:30 -0800
committer	Victor Costan <pwnall@chromium.org>	2018-02-17 00:47:18 -0800
commit	ca37ab7fb9b718e056009babb4fea591626e5882 (patch)
tree	0d1a8702bfa75a722559934646b47fb583f4a465
parent	15a2804cd2fcea1dab72b9c1eae0f71e4294c0da (diff)
download	snappy-ca37ab7fb9b718e056009babb4fea591626e5882.tar.gz snappy-ca37ab7fb9b718e056009babb4fea591626e5882.tar.bz2 snappy-ca37ab7fb9b718e056009babb4fea591626e5882.zip