Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) - Mailing list pgsql-hackers

From lakshmi
Subject Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c)
Date
Msg-id CAEvyyTjLYVuy0V6gXE9_HgyB4gUftuTZm65gdbacE_OOKk1cqw@mail.gmail.com
Whole thread
In response to Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c)  (Bryan Green <dbryan.green@gmail.com>)
Responses Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c)
List pgsql-hackers
Hi,

I tried this change on PostgreSQL (19devel) and ran a few simple tests using pgbench to see how it behaves in practice.

I used 10 clients and 4 threads and ran each test for 60 seconds.

From my runs, the original version was giving around ~663 TPS with ~15.07 ms latency.
With the patched version, I observed TPS in the range of ~638–657, averaging around ~648 TPS, with latency slightly higher (~15.2–15.6 ms).

So in my tests, I didn’t see a clear improvement. If anything, the patched version appears slightly slower, though the difference is small.

It might be that this change doesn’t help much in common cases where the number of keys is small, but it would be interesting to see results from other workloads as well.

Thanks for the discussion—it was helpful to look into this.


Regards,  
Lakshmi G


On Fri, Mar 20, 2026 at 9:57 AM Bryan Green <dbryan.green@gmail.com> wrote:
I meant to add that the micro-benchmark went through each loop 100,000,000 iterations for each run...
for 20 runs to come up with those numbers.  

--bg

On Mon, Mar 9, 2026 at 11:02 AM Bryan Green <dbryan.green@gmail.com> wrote:
I performed a micro-benchmark on my dual epyc (zen 2) server and version 1 wins for small values of n.

20 runs: 

n       version       min  median    mean     max  stddev  noise%
-----------------------------------------------------------------------
n=1     version1     2.440   2.440   2.450   2.550   0.024    4.5%
n=1     version2     4.260   4.280   4.277   4.290   0.007    0.7%

n=2     version1     2.740   2.750   2.757   2.880   0.029    5.1%
n=2     version2     3.970   3.980   3.980   4.020   0.010    1.3%

n=4     version1     4.580   4.595   4.649   4.910   0.094    7.2%
n=4     version2     5.780   5.815   5.809   5.820   0.013    0.7%

But, micro-benchmarks always make me nervous, so I looked at the actual instruction cost for my 
platform given the version 1 and version 2 code.

If we count cpu cycles using the AMD Zen 2 instruction latency/throughput tables:  version 1 (loop body) 
has a critical path of ~5-6 cycles per iteration.  version 2 (loop body) has ~3-4 cycles per iteration. 

The problem for version 2 is that the call to memcpy is ~24-30 cycles due to the stub + function call + return
and branch predictor pressure on first call.  This probably results in ~2.5 ns per iteration cost for version 2.

So, no I wouldn't call it an optimization.  But, it will be interesting to hear other opinions on this. 

--bg


On Mon, Mar 9, 2026 at 10:25 AM Ranier Vilela <ranier.vf@gmail.com> wrote:


Em seg., 9 de mar. de 2026 às 11:47, Bryan Green <dbryan.green@gmail.com> escreveu:
I created an example that is a little bit closer to the actual code and changed the compiler from C++ to C. 

It is interesting the optimization that the compiler has chosen for version 1 versus version 2.  One calls
memcpy and one doesn't.  There is a good chance the inlining of memcpy as SSE+scalar per iteration
will be faster for syscache scans-- which I believe are usually small (1-4 keys?).  
I doubt the inline version is better.
Clang is supported too and the code generated is much better with memcpy one call outside of the loop.
 

Probably the only reason to do this patch would be if N is normally large or if this is considered an
improvement in code clarity without a detrimental impact on small N syscache scans.  
I realize you only said "possible small optimization".  It might be worthwhile to benchmark the code for 
different values of n to determine if there is a tipping point either way?
 In your opinion, shouldn't this be considered an optimization, even a small one?

best regards,
Ranier Vilela

pgsql-hackers by date:

Previous
From: Ashutosh Sharma
Date:
Subject: Re: synchronized_standby_slots behavior inconsistent with quorum-based synchronous replication
Next
From: Henson Choi
Date:
Subject: Re: [PATCH] rewriteGraphTable: Fix missing RTEs in FROM clause by setting inFromCl=true