Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) - Mailing list pgsql-hackers

From Bryan Green
Subject Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c)
Date
Msg-id CAF+pBj9M8rCoS-EEBwLiA6hdxm_UNjxMXG9Vc4RU8xifqnQB-g@mail.gmail.com
In response to Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c)  (Bryan Green <dbryan.green@gmail.com>)
Responses Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c)
List pgsql-hackers
I meant to add that the micro-benchmark ran each loop for 100,000,000 iterations per run, across 20 runs, to produce those numbers.

--bg

On Mon, Mar 9, 2026 at 11:02 AM Bryan Green <dbryan.green@gmail.com> wrote:
I ran a micro-benchmark on my dual-EPYC (Zen 2) server, and version 1 wins for small values of n.

20 runs: 

n       version       min  median    mean     max  stddev  noise%
-----------------------------------------------------------------------
n=1     version1     2.440   2.440   2.450   2.550   0.024    4.5%
n=1     version2     4.260   4.280   4.277   4.290   0.007    0.7%

n=2     version1     2.740   2.750   2.757   2.880   0.029    5.1%
n=2     version2     3.970   3.980   3.980   4.020   0.010    1.3%

n=4     version1     4.580   4.595   4.649   4.910   0.094    7.2%
n=4     version2     5.780   5.815   5.809   5.820   0.013    0.7%
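For context, the two shapes being compared look roughly like this (a sketch of the loop in genam.c's systable_beginscan, with KeyStub as a made-up stand-in for ScanKeyData; mapping "version 1" to the per-element loop and "version 2" to the single memcpy is my reading of the thread, not confirmed code):

```c
#include <string.h>

/* KeyStub is a made-up stand-in for ScanKeyData. */
typedef struct
{
    long sk_attno;
    long sk_argument;
} KeyStub;

/*
 * version 1: per-element copy in a loop; the compiler is free to
 * inline this as SSE + scalar moves per iteration.
 */
void
keys_copy_loop(KeyStub *dst, const KeyStub *src, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[i];
}

/*
 * version 2: a single memcpy over the whole array; for a
 * non-constant n this typically compiles to a real call into
 * the libc memcpy.
 */
void
keys_copy_once(KeyStub *dst, const KeyStub *src, int n)
{
    memcpy(dst, src, (size_t) n * sizeof(KeyStub));
}
```

Both produce identical results; the difference is only in the code the compiler emits for the copy.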

But micro-benchmarks always make me nervous, so I also looked at the actual instruction cost on my
platform for the version 1 and version 2 code.

If we count CPU cycles using the AMD Zen 2 instruction latency/throughput tables, version 1's loop body
has a critical path of ~5-6 cycles per iteration, while version 2's loop body has ~3-4 cycles per iteration.

The problem for version 2 is that the call to memcpy costs ~24-30 cycles due to the stub + function call + return,
plus branch-predictor pressure on the first call.  This probably adds ~2.5 ns of per-iteration cost for version 2.

So, no, I wouldn't call it an optimization.  But it will be interesting to hear other opinions on this.

--bg


On Mon, Mar 9, 2026 at 10:25 AM Ranier Vilela <ranier.vf@gmail.com> wrote:


On Mon, Mar 9, 2026 at 11:47 AM Bryan Green <dbryan.green@gmail.com> wrote:
I created an example that is a little closer to the actual code, and switched from C++ to C.

It is interesting which optimizations the compiler has chosen for version 1 versus version 2: one calls
memcpy and one doesn't.  There is a good chance that inlining memcpy as SSE + scalar copies per iteration
will be faster for syscache scans, which I believe usually involve small key counts (1-4 keys?).
I doubt the inline version is better.
Clang is supported too, and the code it generates is much better, with a single call to memcpy outside the loop.
 

Probably the only reasons to do this patch would be if N is normally large, or if it is considered an
improvement in code clarity without a detrimental impact on small-N syscache scans.
I realize you only said "possible small optimization".  It might be worthwhile to benchmark the code for
different values of n to determine whether there is a tipping point either way.
 In your opinion, shouldn't this be considered an optimization, even a small one?

best regards,
Ranier Vilela
