author     Alkis Evlogimenos <alkis@google.com>    2017-01-27 09:10:36 +0100
committer  Alkis Evlogimenos <alkis@google.com>    2017-01-27 09:10:36 +0100
commit     8bfb028b618747a6ac8af159c87e9196c729566f
tree       cfe7f409aa3b26d02ddfeb2b79322938337b7ab4  /snappy.cc
parent     818b583387a5288cb2778031655a5e764e5ad124
Improve zippy decompression speed.
The CL contains the following optimizations:
1) Rewrite the IncrementalCopy routine: a single routine that splits the code into sections based on the probabilities typically observed across a variety of inputs, which reduces branch mispredictions for both FDO and non-FDO builds. IncrementalCopy is adaptive and selects the best copy strategy for its input.
2) Introduce UnalignedCopy128, which copies 16 bytes (128 bits) at a time using SSE2.
3) Add a branch hint to the main decoding loop: the non-literal case is taken more often in benchmarks. I expect this hint to be a no-op in production builds with FDO. Note that this only became apparent after step 1 above.
4) Use the new IncrementalCopy in ZippyScatteredWriter.
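For context, here is a minimal standalone C++ sketch of ideas (1) and (2): expanding a short repeating pattern so the copy loop can move 8/16 bytes per step, and a 16-byte copy that maps to SSE2 unaligned loads/stores. The names Copy64, Copy128 and CopyFromSelf plus the main() harness are illustrative only; the real routines are UnalignedCopy64, UnalignedCopy128 and IncrementalCopy in the patch below, which also adds branch hints and keeps a byte-wise IncrementalCopySlow for the rare low-slop cases.

// Standalone sketch (not part of the CL): pattern expansion + 16-byte copy.
#include <cstdio>
#include <cstring>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// 8-byte copy done as load-then-store so overlapping 8-byte windows are safe.
static inline void Copy64(const void* src, void* dst) {
  unsigned char tmp[8];
  std::memcpy(tmp, src, 8);
  std::memcpy(dst, tmp, 8);
}

// 16-byte copy; uses SSE2 unaligned load/store when available (this is what
// UnalignedCopy128 in the patch does), otherwise plain memcpy.
static inline void Copy128(const void* src, void* dst) {
#if defined(__SSE2__)
  __m128i x = _mm_loadu_si128(static_cast<const __m128i*>(src));
  _mm_storeu_si128(static_cast<__m128i*>(dst), x);
#else
  std::memcpy(dst, src, 16);
#endif
}

// Copy-from-history in the LZ77 sense: the region [src, op) repeats into
// [op, op_limit). If the pattern (op - src) is shorter than 8 bytes, double
// it in place first so the main loop can move 16 bytes per iteration.
// buf_limit bounds how far past op_limit we may scribble.
static char* CopyFromSelf(const char* src, char* op, char* const op_limit,
                          char* const buf_limit) {
  size_t pattern_size = op - src;
  if (pattern_size < 8 && op <= buf_limit - 14) {
    while (pattern_size < 8) {       // e.g. "ab" -> "abab" -> "abababab"
      Copy64(src, op);
      op += pattern_size;
      pattern_size *= 2;
    }
    if (op >= op_limit) return op_limit;
  }
  while (pattern_size >= 8 && op <= buf_limit - 16) {
    Copy64(src, op);                 // 2x 8 bytes per iteration
    Copy64(src + 8, op + 8);
    src += 16;
    op += 16;
    if (op >= op_limit) return op_limit;
  }
  while (op < op_limit) *op++ = *src++;  // slow byte-at-a-time fallback
  return op_limit;
}

int main() {
  char buf[64] = "ab";  // pattern "ab", then 20 bytes of copy-from-history
  CopyFromSelf(buf, buf + 2, buf + 22, buf + sizeof(buf));
  buf[22] = '\0';
  std::printf("%s\n", buf);  // ababababababababababab (11 copies of "ab")

  char in[16] = "0123456789abcde";
  char out[16];
  Copy128(in, out);
  std::printf("%s\n", out);  // 0123456789abcde
  return 0;
}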
I tested two architectures: x86_haswell and ppc_power8.
For x86_haswell I used FDO; for ppc_power8 I did not.
x86_haswell + FDO
name old speed new speed delta
BM_UCord/0 1.97GB/s ± 1% 3.19GB/s ± 1% +62.08% (p=0.000 n=19+18)
BM_UCord/1 1.28GB/s ± 1% 1.51GB/s ± 1% +18.14% (p=0.000 n=19+18)
BM_UCord/2 15.6GB/s ± 9% 15.5GB/s ± 7% ~ (p=0.620 n=20+20)
BM_UCord/3 811MB/s ± 1% 808MB/s ± 1% -0.38% (p=0.009 n=17+18)
BM_UCord/4 12.4GB/s ± 4% 12.7GB/s ± 8% +2.70% (p=0.002 n=17+20)
BM_UCord/5 1.77GB/s ± 0% 2.33GB/s ± 1% +31.37% (p=0.000 n=18+18)
BM_UCord/6 900MB/s ± 1% 1006MB/s ± 1% +11.71% (p=0.000 n=18+17)
BM_UCord/7 858MB/s ± 1% 938MB/s ± 2% +9.36% (p=0.000 n=19+16)
BM_UCord/8 921MB/s ± 1% 985MB/s ±21% +6.94% (p=0.028 n=19+20)
BM_UCord/9 824MB/s ± 1% 800MB/s ±20% ~ (p=0.113 n=19+20)
BM_UCord/10 2.60GB/s ± 1% 3.67GB/s ±21% +41.31% (p=0.000 n=19+20)
BM_UCord/11 1.07GB/s ± 1% 1.21GB/s ± 1% +13.17% (p=0.000 n=16+16)
BM_UCord/12 1.84GB/s ± 8% 2.18GB/s ± 1% +18.44% (p=0.000 n=16+19)
BM_UCord/13 1.83GB/s ±18% 1.89GB/s ± 1% +3.14% (p=0.000 n=17+19)
BM_UCord/14 1.96GB/s ± 2% 1.97GB/s ± 1% +0.55% (p=0.000 n=16+17)
BM_UCord/15 1.30GB/s ±20% 1.43GB/s ± 1% +9.85% (p=0.000 n=20+20)
BM_UCord/16 658MB/s ±20% 705MB/s ± 1% +7.22% (p=0.000 n=20+19)
BM_UCord/17 1.96GB/s ± 2% 2.15GB/s ± 1% +9.73% (p=0.000 n=16+19)
BM_UCord/18 555MB/s ± 1% 833MB/s ± 1% +50.11% (p=0.000 n=18+19)
BM_UCord/19 1.57GB/s ± 1% 1.75GB/s ± 1% +11.34% (p=0.000 n=20+20)
BM_UCord/20 1.72GB/s ± 2% 1.70GB/s ± 2% -1.01% (p=0.001 n=20+20)
BM_UCordStringSink/0 2.88GB/s ± 1% 3.15GB/s ± 1% +9.56% (p=0.000 n=17+20)
BM_UCordStringSink/1 1.50GB/s ± 1% 1.52GB/s ± 1% +1.96% (p=0.000 n=19+20)
BM_UCordStringSink/2 14.5GB/s ±10% 14.6GB/s ±10% ~ (p=0.542 n=20+20)
BM_UCordStringSink/3 1.06GB/s ± 1% 1.08GB/s ± 1% +1.77% (p=0.000 n=18+20)
BM_UCordStringSink/4 12.6GB/s ± 7% 13.2GB/s ± 4% +4.63% (p=0.000 n=20+20)
BM_UCordStringSink/5 2.29GB/s ± 1% 2.36GB/s ± 1% +3.05% (p=0.000 n=19+20)
BM_UCordStringSink/6 1.01GB/s ± 2% 1.01GB/s ± 0% ~ (p=0.055 n=20+18)
BM_UCordStringSink/7 945MB/s ± 1% 939MB/s ± 1% -0.60% (p=0.000 n=19+20)
BM_UCordStringSink/8 1.06GB/s ± 1% 1.07GB/s ± 1% +0.62% (p=0.000 n=18+20)
BM_UCordStringSink/9 866MB/s ± 1% 864MB/s ± 1% ~ (p=0.107 n=19+20)
BM_UCordStringSink/10 3.64GB/s ± 2% 3.98GB/s ± 1% +9.32% (p=0.000 n=19+20)
BM_UCordStringSink/11 1.22GB/s ± 1% 1.22GB/s ± 1% +0.61% (p=0.001 n=19+20)
BM_UCordStringSink/12 2.23GB/s ± 1% 2.23GB/s ± 1% ~ (p=0.692 n=19+20)
BM_UCordStringSink/13 1.96GB/s ± 1% 1.94GB/s ± 1% -0.82% (p=0.000 n=17+18)
BM_UCordStringSink/14 2.09GB/s ± 2% 2.08GB/s ± 1% ~ (p=0.147 n=20+18)
BM_UCordStringSink/15 1.47GB/s ± 1% 1.45GB/s ± 1% -0.88% (p=0.000 n=20+19)
BM_UCordStringSink/16 908MB/s ± 1% 917MB/s ± 1% +0.97% (p=0.000 n=19+19)
BM_UCordStringSink/17 2.11GB/s ± 1% 2.20GB/s ± 1% +4.35% (p=0.000 n=18+20)
BM_UCordStringSink/18 804MB/s ± 2% 1106MB/s ± 1% +37.52% (p=0.000 n=20+20)
BM_UCordStringSink/19 1.67GB/s ± 1% 1.72GB/s ± 0% +2.81% (p=0.000 n=18+20)
BM_UCordStringSink/20 1.77GB/s ± 3% 1.77GB/s ± 3% ~ (p=0.815 n=20+20)
ppc_power8
name old speed new speed delta
BM_UCord/0 918MB/s ± 6% 1262MB/s ± 0% +37.56% (p=0.000 n=17+16)
BM_UCord/1 671MB/s ±13% 879MB/s ± 2% +30.99% (p=0.000 n=18+16)
BM_UCord/2 12.6GB/s ± 8% 12.6GB/s ± 5% ~ (p=0.452 n=17+19)
BM_UCord/3 285MB/s ±10% 284MB/s ± 4% -0.50% (p=0.021 n=19+17)
BM_UCord/4 5.21GB/s ±12% 6.59GB/s ± 1% +26.37% (p=0.000 n=17+16)
BM_UCord/5 913MB/s ± 4% 1253MB/s ± 1% +37.27% (p=0.000 n=16+17)
BM_UCord/6 461MB/s ±13% 547MB/s ± 1% +18.67% (p=0.000 n=18+16)
BM_UCord/7 455MB/s ± 2% 524MB/s ± 3% +15.28% (p=0.000 n=16+18)
BM_UCord/8 489MB/s ± 2% 584MB/s ± 2% +19.47% (p=0.000 n=17+17)
BM_UCord/9 410MB/s ±33% 490MB/s ± 1% +19.64% (p=0.000 n=17+18)
BM_UCord/10 1.10GB/s ± 3% 1.55GB/s ± 2% +41.21% (p=0.000 n=16+16)
BM_UCord/11 494MB/s ± 1% 558MB/s ± 1% +12.92% (p=0.000 n=17+18)
BM_UCord/12 608MB/s ± 3% 793MB/s ± 1% +30.45% (p=0.000 n=17+16)
BM_UCord/13 545MB/s ±18% 721MB/s ± 2% +32.22% (p=0.000 n=19+17)
BM_UCord/14 594MB/s ± 4% 748MB/s ± 3% +25.99% (p=0.000 n=17+17)
BM_UCord/15 628MB/s ± 1% 822MB/s ± 3% +30.94% (p=0.000 n=18+16)
BM_UCord/16 277MB/s ± 2% 280MB/s ±15% +0.86% (p=0.001 n=17+17)
BM_UCord/17 864MB/s ± 1% 1001MB/s ± 3% +15.96% (p=0.000 n=17+17)
BM_UCord/18 121MB/s ± 2% 284MB/s ± 4% +134.08% (p=0.000 n=17+18)
BM_UCord/19 594MB/s ± 0% 713MB/s ± 2% +19.93% (p=0.000 n=16+17)
BM_UCord/20 553MB/s ±10% 662MB/s ± 5% +19.74% (p=0.000 n=16+18)
BM_UCordStringSink/0 1.37GB/s ± 4% 1.48GB/s ± 2% +8.51% (p=0.000 n=16+16)
BM_UCordStringSink/1 969MB/s ± 1% 990MB/s ± 1% +2.16% (p=0.000 n=16+18)
BM_UCordStringSink/2 13.1GB/s ±11% 13.0GB/s ±14% ~ (p=0.858 n=17+18)
BM_UCordStringSink/3 411MB/s ± 1% 415MB/s ± 1% +0.93% (p=0.000 n=16+17)
BM_UCordStringSink/4 6.81GB/s ± 8% 7.29GB/s ± 5% +7.12% (p=0.000 n=16+19)
BM_UCordStringSink/5 1.35GB/s ± 5% 1.45GB/s ±13% +8.00% (p=0.000 n=16+17)
BM_UCordStringSink/6 653MB/s ± 8% 653MB/s ± 3% -0.12% (p=0.007 n=17+19)
BM_UCordStringSink/7 618MB/s ±13% 597MB/s ±18% -3.45% (p=0.001 n=18+18)
BM_UCordStringSink/8 702MB/s ± 5% 702MB/s ± 1% -0.10% (p=0.012 n=17+16)
BM_UCordStringSink/9 590MB/s ± 2% 564MB/s ±13% -4.46% (p=0.000 n=16+17)
BM_UCordStringSink/10 1.63GB/s ± 2% 1.76GB/s ± 4% +8.28% (p=0.000 n=17+16)
BM_UCordStringSink/11 630MB/s ±14% 684MB/s ±15% +8.51% (p=0.000 n=19+17)
BM_UCordStringSink/12 858MB/s ±12% 903MB/s ± 9% +5.17% (p=0.000 n=19+17)
BM_UCordStringSink/13 806MB/s ±22% 879MB/s ± 1% +8.98% (p=0.000 n=19+19)
BM_UCordStringSink/14 854MB/s ±13% 901MB/s ± 5% +5.60% (p=0.000 n=19+17)
BM_UCordStringSink/15 930MB/s ± 2% 964MB/s ± 3% +3.59% (p=0.000 n=16+16)
BM_UCordStringSink/16 363MB/s ±10% 356MB/s ± 6% ~ (p=0.050 n=20+19)
BM_UCordStringSink/17 976MB/s ±12% 1078MB/s ± 1% +10.52% (p=0.000 n=20+17)
BM_UCordStringSink/18 227MB/s ± 1% 355MB/s ± 3% +56.45% (p=0.000 n=16+17)
BM_UCordStringSink/19 751MB/s ± 4% 808MB/s ± 4% +7.70% (p=0.000 n=18+17)
BM_UCordStringSink/20 761MB/s ± 8% 786MB/s ± 4% +3.23% (p=0.000 n=18+17)
Diffstat (limited to 'snappy.cc')
 snappy.cc | 265
 1 file changed, 143 insertions(+), 122 deletions(-)
diff --git a/snappy.cc b/snappy.cc
--- a/snappy.cc
+++ b/snappy.cc
@@ -30,6 +30,9 @@
 #include "snappy-internal.h"
 #include "snappy-sinksource.h"
 
+#if defined(__x86_64__) || defined(_M_X64)
+#include <emmintrin.h>
+#endif
 #include <stdio.h>
 
 #include <algorithm>
@@ -83,71 +86,125 @@ size_t MaxCompressedLength(size_t source_len) {
   return 32 + source_len + source_len/6;
 }
 
-// Copy "len" bytes from "src" to "op", one byte at a time. Used for
-// handling COPY operations where the input and output regions may
-// overlap. For example, suppose:
-//    src    == "ab"
-//    op     == src + 2
-//    len    == 20
-// After IncrementalCopy(src, op, len), the result will have
-// eleven copies of "ab"
-//    ababababababababababab
-// Note that this does not match the semantics of either memcpy()
-// or memmove().
-static inline void IncrementalCopy(const char* src, char* op, ssize_t len) {
-  assert(len > 0);
-  do {
-    *op++ = *src++;
-  } while (--len > 0);
+namespace {
+
+void UnalignedCopy64(const void* src, void* dst) {
+  memcpy(dst, src, 8);
 }
 
-// Equivalent to IncrementalCopy except that it can write up to ten extra
-// bytes after the end of the copy, and that it is faster.
-//
-// The main part of this loop is a simple copy of eight bytes at a time until
-// we've copied (at least) the requested amount of bytes. However, if op and
-// src are less than eight bytes apart (indicating a repeating pattern of
-// length < 8), we first need to expand the pattern in order to get the correct
-// results. For instance, if the buffer looks like this, with the eight-byte
-// <src> and <op> patterns marked as intervals:
-//
-//    abxxxxxxxxxxxx
-//    [------]           src
-//      [------]         op
-//
-// a single eight-byte copy from <src> to <op> will repeat the pattern once,
-// after which we can move <op> two bytes without moving <src>:
-//
-//    ababxxxxxxxxxx
-//    [------]           src
-//        [------]       op
-//
-// and repeat the exercise until the two no longer overlap.
-//
-// This allows us to do very well in the special case of one single byte
-// repeated many times, without taking a big hit for more general cases.
-//
-// The worst case of extra writing past the end of the match occurs when
-// op - src == 1 and len == 1; the last copy will read from byte positions
-// [0..7] and write to [4..11], whereas it was only supposed to write to
-// position 1. Thus, ten excess bytes.
+void UnalignedCopy128(const void* src, void* dst) {
+  // TODO(alkis): Remove this when we upgrade to a recent compiler that emits
+  // SSE2 moves for memcpy(dst, src, 16).
+#ifdef __SSE2__
+  __m128i x = _mm_loadu_si128(static_cast<const __m128i*>(src));
+  _mm_storeu_si128(static_cast<__m128i*>(dst), x);
+#else
+  memcpy(dst, src, 16);
+#endif
+}
 
-namespace {
+
+// Copy [src, src+(op_limit-op)) to [op, (op_limit-op)) a byte at a time. Used
+// for handling COPY operations where the input and output regions may overlap.
+// For example, suppose:
+//    src       == "ab"
+//    op        == src + 2
+//    op_limit  == op + 20
+// After IncrementalCopySlow(src, op, op_limit), the result will have eleven
+// copies of "ab"
+//    ababababababababababab
+// Note that this does not match the semantics of either memcpy() or memmove().
+inline char* IncrementalCopySlow(const char* src, char* op,
+                                 char* const op_limit) {
+  while (op < op_limit) {
+    *op++ = *src++;
+  }
+  return op_limit;
+}
 
-const int kMaxIncrementCopyOverflow = 10;
+// Copy [src, src+(op_limit-op)) to [op, (op_limit-op)) but faster than
+// IncrementalCopySlow. buf_limit is the address past the end of the writable
+// region of the buffer.
+inline char* IncrementalCopy(const char* src, char* op, char* const op_limit,
+                             char* const buf_limit) {
+  // Terminology:
+  //
+  // slop = buf_limit - op
+  // pat  = op - src
+  // len  = limit - op
+  assert(src < op);
+  assert(op_limit <= buf_limit);
+  // NOTE: The compressor always emits 4 <= len <= 64. It is ok to assume that
+  // to optimize this function but we have to also handle these cases in case
+  // the input does not satisfy these conditions.
+
+  size_t pattern_size = op - src;
+  // The cases are split into different branches to allow the branch predictor,
+  // FDO, and static prediction hints to work better. For each input we list the
+  // ratio of invocations that match each condition.
+  //
+  // input         slop < 16   pat < 8  len > 16
+  // ------------------------------------------
+  // html|html4|cp    0%         1.01%   27.73%
+  // urls             0%         0.88%   14.79%
+  // jpg              0%        64.29%    7.14%
+  // pdf              0%         2.56%   58.06%
+  // txt[1-4]         0%         0.23%    0.97%
+  // pb               0%         0.96%   13.88%
+  // bin              0.01%     22.27%   41.17%
+  //
+  // It is very rare that we don't have enough slop for doing block copies. It
+  // is also rare that we need to expand a pattern. Small patterns are common
+  // for incompressible formats and for those we are plenty fast already.
+  // Lengths are normally not greater than 16 but they vary depending on the
+  // input. In general if we always predict len <= 16 it would be an ok
+  // prediction.
+  //
+  // In order to be fast we want a pattern >= 8 bytes and an unrolled loop
+  // copying 2x 8 bytes at a time.
+
+  // Handle the uncommon case where pattern is less than 8 bytes.
+  if (PREDICT_FALSE(pattern_size < 8)) {
+    // Expand pattern to at least 8 bytes. The worse case scenario in terms of
+    // buffer usage is when the pattern is size 3. ^ is the original position
+    // of op. x are irrelevant bytes copied by the last UnalignedCopy64.
+    //
+    //    abc
+    //    abcabcxxxxx
+    //    abcabcabcabcxxxxx
+    //       ^
+    // The last x is 14 bytes after ^.
+    if (PREDICT_TRUE(op <= buf_limit - 14)) {
+      while (pattern_size < 8) {
+        UnalignedCopy64(src, op);
+        op += pattern_size;
+        pattern_size *= 2;
+      }
+      if (PREDICT_TRUE(op >= op_limit)) return op_limit;
+    } else {
+      return IncrementalCopySlow(src, op, op_limit);
+    }
+  }
+  assert(pattern_size >= 8);
 
-inline void IncrementalCopyFastPath(const char* src, char* op, ssize_t len) {
-  while (PREDICT_FALSE(op - src < 8)) {
+  // Copy 2x 8 bytes at a time. Because op - src can be < 16, a single
+  // UnalignedCopy128 might overwrite data in op. UnalignedCopy64 is safe
+  // because expanding the pattern to at least 8 bytes guarantees that
+  // op - src >= 8.
+  while (op <= buf_limit - 16) {
     UnalignedCopy64(src, op);
-    len -= op - src;
-    op += op - src;
+    UnalignedCopy64(src + 8, op + 8);
+    src += 16;
+    op += 16;
+    if (PREDICT_TRUE(op >= op_limit)) return op_limit;
   }
-  while (len > 0) {
+  // We only take this branch if we didn't have enough slop and we can do a
+  // single 8 byte copy.
+  if (PREDICT_FALSE(op <= buf_limit - 8)) {
     UnalignedCopy64(src, op);
     src += 8;
     op += 8;
-    len -= 8;
   }
+  return IncrementalCopySlow(src, op, op_limit);
 }
 
 }  // namespace
@@ -172,8 +229,7 @@ static inline char* EmitLiteral(char* op,
     // Fits in tag byte
     *op++ = LITERAL | (n << 2);
 
-    UnalignedCopy64(literal, op);
-    UnalignedCopy64(literal + 8, op + 8);
+    UnalignedCopy128(literal, op);
     return op + len;
   }
@@ -599,7 +655,19 @@ class SnappyDecompressor {
     for ( ;; ) {
       const unsigned char c = *(reinterpret_cast<const unsigned char*>(ip++));
 
-      if ((c & 0x3) == LITERAL) {
+      // Ratio of iterations that have LITERAL vs non-LITERAL for different
+      // inputs.
+      //
+      // input          LITERAL  NON_LITERAL
+      // -----------------------------------
+      // html|html4|cp    23%       77%
+      // urls             36%       64%
+      // jpg              47%       53%
+      // pdf              19%       81%
+      // txt[1-4]         25%       75%
+      // pb               24%       76%
+      // bin              24%       76%
+      if (PREDICT_FALSE((c & 0x3) == LITERAL)) {
         size_t literal_length = (c >> 2) + 1u;
         if (writer->TryFastAppend(ip, ip_limit_ - ip, literal_length)) {
           assert(literal_length < 61);
@@ -663,10 +731,8 @@ bool SnappyDecompressor::RefillTag() {
     size_t n;
     ip = reader_->Peek(&n);
     peeked_ = n;
-    if (n == 0) {
-      eof_ = true;
-      return false;
-    }
+    eof_ = (n == 0);
+    if (eof_) return false;
     ip_limit_ = ip + n;
   }
@@ -906,8 +972,7 @@ class SnappyIOVecWriter {
         output_iov_[curr_iov_index_].iov_len - curr_iov_written_ >= 16) {
       // Fast path, used for the majority (about 95%) of invocations.
       char* ptr = GetIOVecPointer(curr_iov_index_, curr_iov_written_);
-      UnalignedCopy64(ip, ptr);
-      UnalignedCopy64(ip + 8, ptr + 8);
+      UnalignedCopy128(ip, ptr);
      curr_iov_written_ += len;
      total_written_ += len;
      return true;
@@ -971,9 +1036,10 @@ class SnappyIOVecWriter {
       if (to_copy > len) {
         to_copy = len;
       }
-      IncrementalCopy(GetIOVecPointer(from_iov_index, from_iov_offset),
-                      GetIOVecPointer(curr_iov_index_, curr_iov_written_),
-                      to_copy);
+      IncrementalCopySlow(
+          GetIOVecPointer(from_iov_index, from_iov_offset),
+          GetIOVecPointer(curr_iov_index_, curr_iov_written_),
+          GetIOVecPointer(curr_iov_index_, curr_iov_written_) + to_copy);
       curr_iov_written_ += to_copy;
       from_iov_offset += to_copy;
       total_written_ += to_copy;
@@ -1043,8 +1109,7 @@ class SnappyArrayWriter {
     const size_t space_left = op_limit_ - op;
     if (len <= 16 && available >= 16 + kMaximumTagLength && space_left >= 16) {
       // Fast path, used for the majority (about 95%) of invocations.
-      UnalignedCopy64(ip, op);
-      UnalignedCopy64(ip + 8, op + 8);
+      UnalignedCopy128(ip, op);
       op_ = op + len;
       return true;
     } else {
@@ -1053,8 +1118,7 @@ class SnappyArrayWriter {
   }
 
   inline bool AppendFromSelf(size_t offset, size_t len) {
-    char* op = op_;
-    const size_t space_left = op_limit_ - op;
+    char* const op_end = op_ + len;
 
     // Check if we try to append from before the start of the buffer.
     // Normally this would just be a check for "produced < offset",
@@ -1063,52 +1127,13 @@ class SnappyArrayWriter {
     // to a very big number. This is convenient, as offset==0 is another
     // invalid case that we also want to catch, so that we do not go
     // into an infinite loop.
-    assert(op >= base_);
-    size_t produced = op - base_;
-    if (produced <= offset - 1u) {
-      return false;
-    }
-    if (offset >= 8 && space_left >= 16) {
-      UnalignedCopy64(op - offset, op);
-      UnalignedCopy64(op - offset + 8, op + 8);
-      if (PREDICT_TRUE(len <= 16)) {
-        // Fast path, used for the majority (70-80%) of dynamic invocations.
-        op_ = op + len;
-        return true;
-      }
-      op += 16;
-      // Copy 8 bytes at a time. This will write as many as 7 bytes more
-      // than necessary, so we check if space_left >= len + 7.
-      if (space_left >= len + 7) {
-        const char* src = op - offset;
-        ssize_t l = len - 16;  // 16 bytes were already handled, above.
-        do {
-          UnalignedCopy64(src, op);
-          src += 8;
-          op += 8;
-          l -= 8;
-        } while (l > 0);
-        // l is now negative if we wrote extra bytes; adjust op_ accordingly.
-        op_ = op + l;
-        return true;
-      } else if (space_left < len) {
-        return false;
-      } else {
-        len -= 16;
-        IncrementalCopy(op - offset, op, len);
-      }
-    } else if (space_left >= len + kMaxIncrementCopyOverflow) {
-      IncrementalCopyFastPath(op - offset, op, len);
-    } else if (space_left < len) {
-      return false;
-    } else {
-      IncrementalCopy(op - offset, op, len);
-    }
+    if (Produced() <= offset - 1u || op_end > op_limit_) return false;
+    op_ = IncrementalCopy(op_ - offset, op_, op_end, op_limit_);
 
-    op_ = op + len;
     return true;
   }
   inline size_t Produced() const {
+    assert(op_ >= base_);
     return op_ - base_;
   }
   inline void Flush() {}
@@ -1276,8 +1301,7 @@ class SnappyScatteredWriter {
     if (length <= 16 && available >= 16 + kMaximumTagLength && space_left >= 16) {
       // Fast path, used for the majority (about 95%) of invocations.
-      UNALIGNED_STORE64(op, UNALIGNED_LOAD64(ip));
-      UNALIGNED_STORE64(op + 8, UNALIGNED_LOAD64(ip + 8));
+      UnalignedCopy128(ip, op);
       op_ptr_ = op + length;
       return true;
     } else {
@@ -1286,16 +1310,13 @@ class SnappyScatteredWriter {
   }
 
   inline bool AppendFromSelf(size_t offset, size_t len) {
+    char* const op_end = op_ptr_ + len;
     // See SnappyArrayWriter::AppendFromSelf for an explanation of
     // the "offset - 1u" trick.
-    if (offset - 1u < op_ptr_ - op_base_) {
-      const size_t space_left = op_limit_ - op_ptr_;
-      if (space_left >= len + kMaxIncrementCopyOverflow) {
-        // Fast path: src and dst in current block.
-        IncrementalCopyFastPath(op_ptr_ - offset, op_ptr_, len);
-        op_ptr_ += len;
-        return true;
-      }
+    if (PREDICT_TRUE(offset - 1u < op_ptr_ - op_base_ && op_end <= op_limit_)) {
+      // Fast path: src and dst in current block.
+      op_ptr_ = IncrementalCopy(op_ptr_ - offset, op_ptr_, op_end, op_limit_);
+      return true;
     }
     return SlowAppendFromSelf(offset, len);
   }
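Side note on the "offset - 1u" check that both writers keep: because offset is unsigned, offset == 0 wraps around to a huge value, so a single comparison rejects both offset == 0 and copies that would start before the buffer. The following tiny illustration uses a hypothetical helper and made-up values, not code from the patch:

#include <cstddef>
#include <cstdio>

// Mirrors the bounds check in AppendFromSelf: reject offset == 0 (wraps to
// SIZE_MAX) and offset > produced with a single unsigned comparison, and
// reject copies that would run past the end of the output buffer.
static bool SelfCopyIsValid(std::size_t produced, std::size_t offset,
                            std::size_t len, std::size_t space_left) {
  if (produced <= offset - 1u) return false;  // offset == 0 or too far back
  return len <= space_left;
}

int main() {
  std::printf("%d\n", SelfCopyIsValid(10, 4, 8, 100));   // 1: valid copy
  std::printf("%d\n", SelfCopyIsValid(10, 0, 8, 100));   // 0: offset == 0
  std::printf("%d\n", SelfCopyIsValid(10, 11, 8, 100));  // 0: before buffer start
  std::printf("%d\n", SelfCopyIsValid(10, 4, 8, 4));     // 0: output too small
  return 0;
}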