Re: Improve CRC32C performance on SSE4.2 - Mailing list pgsql-hackers
From | John Naylor |
---|---|
Subject | Re: Improve CRC32C performance on SSE4.2 |
Date | |
Msg-id | CANWCAZYRhLHArpyfV4uRK-Rw9N5oV5HMkkKtBehcuTjNOMwCZg@mail.gmail.com |
In response to | Re: Improve CRC32C performance on SSE4.2 (Nathan Bossart <nathandbossart@gmail.com>) |
Responses | Re: Improve CRC32C performance on SSE4.2 |
List | pgsql-hackers |
On Wed, Mar 5, 2025 at 12:36 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Tue, Mar 04, 2025 at 12:09:09PM +0700, John Naylor wrote:
> > On Tue, Mar 4, 2025 at 2:11 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> >> This could potentially lead to a small regression for machines with SSE
> >> 4.2 but not PCLMUL, but that may be uncommon enough at this point to not
> >> worry about.
> >
> > Note also upthread I mentioned we may have to go to 512-bit pclmul,
> > since Zen 2 regresses on 128-bit. :-(
>
> Ah, okay.  You mean the AVX-512 version [0]?

Right, except not that version, rather a more efficient way and with
only one accumulator, so still a minimum length of 64 bytes. I'll
share that once we have agreement on detection/dispatch.

> And are you thinking we'd use
> the same strategy for the compiled-in-SSE4.2 builds, i.e., inline the
> SSE4.2 version for small inputs and use a function pointer for larger ones?

Yes. Although, we may not even have to inline for non-constant input,
see below. Inlining loops does take binary space.

> > I actually haven't seen any measurable difference with direct calls
> > versus indirect, but it could very well be that the microbenchmark is
> > hiding that since it's doing something unnatural by calling things a
> > bunch of times in a loop. I want to try changing the benchmark to base
> > the address it's computing on some bits from the crc from the last
> > loop iteration. I think that would make it more latency-sensitive. We
> > could also make it do an additional constant 20-byte input every time
> > to make it resemble WAL more closely.
>
> Looking back on some old benchmarks for small-ish inputs [1], the
> difference does seem within the noise range.  I suppose these functions
> might be expensive enough to make the function pointer overhead negligible.
> IME there's a big difference when a function pointer is used for an
> instruction or two [2], but even relatively small inputs to the CRC-32C
> functions might require several instructions.

That was my hunch too, but I wanted to be more sure, so I modified the
benchmark so it doesn't know the address of the next calculation until
it finishes the last calculation, so we can hopefully see the latency
caused by indirection. It also does an additional calculation on
constant 20 bytes, like the WAL header. I also tweaked the length each
iteration so the branch predictor maybe has a harder time predicting
the constant 20 input. And to make it more challenging, I removed the
part that inlined all small inputs, so it inlines only constant
inputs:

0001+0002 (test only) func pointer:

32
latency average = 24.021 ms
latency average = 24.020 ms
latency average = 23.733 ms

40
latency average = 25.018 ms
latency average = 24.253 ms
latency average = 24.278 ms

48
latency average = 25.437 ms
latency average = 24.817 ms
latency average = 24.793 ms

SSE4.2 build (direct func):

32
latency average = 22.422 ms
latency average = 22.387 ms
latency average = 22.391 ms

40
latency average = 23.444 ms
latency average = 22.887 ms
latency average = 22.988 ms

48
latency average = 23.432 ms
latency average = 23.380 ms
latency average = 23.384 ms

0001-0006 SSE 4.2 build (inlined constant / otherwise func pointer):

32
latency average = 22.135 ms
latency average = 21.874 ms
latency average = 21.910 ms

40
latency average = 22.916 ms
latency average = 23.086 ms
latency average = 22.422 ms

48
latency average = 23.255 ms
latency average = 22.780 ms
latency average = 22.804 ms

These are still a bit noisy, and close, but it seems there is no
penalty in using the function pointer as long as the header
calculation is inlined.

--
John Naylor
Amazon Web Services
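
For readers following along, the benchmark change described above (the next address is unknown until the previous calculation finishes, the length varies per iteration, and a constant 20-byte input is added each time) might look roughly like the sketch below. This is a hypothetical illustration, not the test module from the patch set: the names crc32c_sw, crc_fn and chained_crc_bench are invented here, a portable bitwise CRC-32C stands in for the SSE4.2/PCLMUL code, and the constant-length call is made directly to mimic the "inlined constant / otherwise func pointer" configuration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Portable bitwise CRC-32C (Castagnoli polynomial, reflected form).
 * Assumption: this stands in for the real hardware-accelerated
 * implementations; only the call pattern matters for the sketch.
 */
static uint32_t
crc32c_sw(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;

    crc = ~crc;
    while (len--)
    {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Hypothetical function-pointer type used for runtime dispatch. */
typedef uint32_t (*crc_fn) (uint32_t crc, const void *data, size_t len);

/*
 * Hypothetical latency-sensitive loop: each iteration derives the offset of
 * its variable-length input from the previous iteration's CRC, so the CPU
 * cannot start the next calculation before the last one has finished.
 */
static uint32_t
chained_crc_bench(crc_fn fn, const unsigned char *buf, size_t buf_len,
                  size_t base_len, int iterations)
{
    static const unsigned char header[20] = {0};    /* constant 20-byte input */
    uint32_t    crc = 0;

    /* The largest length used below is base_len + 7. */
    assert(buf_len >= base_len + 8);

    for (int i = 0; i < iterations; i++)
    {
        /* Vary the length a little so it is harder to predict. */
        size_t      len = base_len + (size_t) (i & 7);
        /* Offset depends on the previous result (data dependency). */
        size_t      off = crc % (buf_len - len + 1);

        /* Constant-length input: direct call, which a compiler may inline. */
        crc = crc32c_sw(0, header, sizeof(header));
        /* Variable-length input: goes through the function pointer. */
        crc = fn(crc, buf + off, len);
    }
    return crc;
}
```

Timing this loop once with a directly called function and once through a pointer chosen at runtime would be one way to look for the indirection cost discussed above; the numbers in this message came from the actual benchmark, not from this sketch.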