Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) - Mailing list pgsql-hackers

From Bryan Green
Subject Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c)
Msg-id CAF+pBj-pAGnTh2un8RGcDqSYuMnwGhXv5_MteB77FNjf-Af=tg@mail.gmail.com
In response to Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c)  (Ranier Vilela <ranier.vf@gmail.com>)
Responses Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c)
List pgsql-hackers
I don't think your version 1 memcpy is doing what you think it is doing.
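One way this can happen, as a sketch only (memcpy1.c is not reproduced here, so everything below is an assumption): 1519 ns for 10000000 loops is about 0.00015 ns per loop, well under one clock cycle, which suggests the compiler may have dropped the copy because nothing reads the destination afterwards. A barrier after each copy is one common way to keep the work inside the timed loop. DemoKey, keep_alive and bench_single_memcpy are hypothetical names, and the inline asm is a GCC/Clang extension.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Hypothetical stand-in for a scan key; NOT PostgreSQL's ScanKeyData. */
typedef struct DemoKey {
    int      flags;
    int16_t  attno;
    uint64_t argument;
} DemoKey;

/*
 * Compiler barrier (GCC/Clang extension). The pointer is passed to an empty
 * asm statement with a "memory" clobber, so the optimizer has to assume the
 * buffer may be read and cannot discard the stores that filled it.
 */
static inline void keep_alive(void *p)
{
    __asm__ volatile("" : : "r"(p) : "memory");
}

static int64_t elapsed_ns(struct timespec a, struct timespec b)
{
    return (int64_t)(b.tv_sec - a.tv_sec) * 1000000000LL
         + (int64_t)(b.tv_nsec - a.tv_nsec);
}

/* Copies nkeys keys with one memcpy per loop; returns elapsed nanoseconds. */
int64_t bench_single_memcpy(DemoKey *dst, const DemoKey *src,
                            size_t nkeys, long loops)
{
    struct timespec t0, t1;

    assert(dst != NULL && src != NULL);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < loops; i++) {
        memcpy(dst, src, nkeys * sizeof(DemoKey));
        keep_alive(dst);   /* without this, -O2 may remove the copy */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return elapsed_ns(t0, t1);
}
```

Comparing the timing with and without the keep_alive call, or inspecting the generated assembly, would show whether the copy survives optimization in a given build.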

On Thu, Mar 12, 2026 at 12:35 PM Ranier Vilela <ranier.vf@gmail.com> wrote:
Hi.

On Mon, Mar 9, 2026 at 14:02, Bryan Green <dbryan.green@gmail.com> wrote:
I performed a micro-benchmark on my dual EPYC (Zen 2) server and version 1 wins for small values of n.
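For readers without the attachments, the two shapes being compared in this thread are roughly a fixed-size copy per key inside a loop and a single variable-size copy of the whole array. The following is a minimal sketch with a hypothetical DemoKey struct; it is not the genam.c code and makes no claim about which shape is labelled version 1 or version 2.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a scan key; NOT PostgreSQL's ScanKeyData. */
typedef struct DemoKey {
    int      flags;
    int16_t  attno;
    uint64_t argument;
} DemoKey;

/*
 * One fixed-size memcpy per key. The size is a compile-time constant, so
 * compilers commonly expand each copy inline as plain loads and stores.
 */
void copy_per_key(DemoKey *dst, const DemoKey *src, size_t nkeys)
{
    for (size_t i = 0; i < nkeys; i++)
        memcpy(&dst[i], &src[i], sizeof(DemoKey));
}

/*
 * One memcpy for the whole array. The size is only known at run time, so
 * this is commonly emitted as a real call into the C library.
 */
void copy_all_at_once(DemoKey *dst, const DemoKey *src, size_t nkeys)
{
    memcpy(dst, src, nkeys * sizeof(DemoKey));
}
```

Both functions produce the same destination contents. What the compiler actually emits for each depends on the compiler version and flags, so the generated assembly is the thing to check.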

20 runs: 

n       version       min  median    mean     max  stddev  noise%
-----------------------------------------------------------------------
n=1     version1     2.440   2.440   2.450   2.550   0.024    4.5%
n=1     version2     4.260   4.280   4.277   4.290   0.007    0.7%

n=2     version1     2.740   2.750   2.757   2.880   0.029    5.1%
n=2     version2     3.970   3.980   3.980   4.020   0.010    1.3%

n=4     version1     4.580   4.595   4.649   4.910   0.094    7.2%
n=4     version2     5.780   5.815   5.809   5.820   0.013    0.7%
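
The table does not say how noise% is computed. The figures are consistent with the spread relative to the fastest run, (max - min) / min * 100; for n=1, version1 that is (2.550 - 2.440) / 2.440, or about 4.5%. A small sketch of that reading follows. The formula is inferred from the numbers, not stated in the message, and noise_percent is a hypothetical name.

```c
/*
 * Spread of a set of runs relative to the fastest one, in percent.
 * Assumed definition, inferred from the table; min must be non-zero.
 */
double noise_percent(double min, double max)
{
    return (max - min) / min * 100.0;
}
```

Under this reading, the version1 rows show a wider spread between fastest and slowest run than the version2 rows at every n in the table.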

But micro-benchmarks always make me nervous, so I looked at the actual instruction cost on my
platform for the version 1 and version 2 code.

If we count CPU cycles using the AMD Zen 2 instruction latency/throughput tables: version 1 (loop body)
has a critical path of ~5-6 cycles per iteration. Version 2 (loop body) has ~3-4 cycles per iteration.

The problem for version 2 is that the call to memcpy costs ~24-30 cycles due to the stub + function call + return,
plus branch predictor pressure on the first call. This probably results in a cost of ~2.5 ns per iteration for version 2.
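
To make the trade-off concrete, the cycle estimates above can be put into a simple linear model: total cycles = per-iteration cost * n + fixed call overhead. This is a rough sketch that ignores out-of-order overlap and cache effects; CostModel, model_cycles and break_even_n are hypothetical names, and the inputs are just the midpoints of the ranges quoted above.

```c
#include <assert.h>

/* Linear cost model: total cycles = per_iter * n + fixed. */
typedef struct CostModel {
    double per_iter;   /* cycles per loop iteration */
    double fixed;      /* one-time overhead, e.g. a memcpy call */
} CostModel;

/* Estimated cycles for n keys under the model. */
double model_cycles(CostModel m, int n)
{
    assert(n >= 0);
    return m.per_iter * n + m.fixed;
}

/* Value of n at which models a and b cost the same number of cycles. */
double break_even_n(CostModel a, CostModel b)
{
    assert(a.per_iter != b.per_iter);
    return (b.fixed - a.fixed) / (a.per_iter - b.per_iter);
}
```

With the midpoints (5.5 and 3.5 cycles per iteration, 27 cycles for the call) the model breaks even at 13.5 keys; taking the extremes of the quoted ranges moves that to somewhere between 8 and 30 keys. Below the break-even point the fixed call overhead dominates in this model.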

So, no, I wouldn't call it an optimization. But it will be interesting to hear other opinions on this.
I ran quick and dirty tests with the two versions:
gcc 15.2.0
gcc -O2 memcpy1.c -o memcpy1

The first test used 10000000 keys and 10000000 loops:
version1: one memcpy call
done in 1873 nanoseconds

version2: inlined memcpy
did not finish

The second test used 4 keys and 10000000 loops:
version1: one memcpy call
version2: inlined memcpy call

version1: done in 1519 nanoseconds
version2: done in 104981851 nanoseconds
(1.44692e-05 times faster)

version1: done in 1979 nanoseconds
version2: done in 110568901 nanoseconds
(1.78983e-05 times faster)

version1: done in 1814 nanoseconds
version2: done in 108555484 nanoseconds
(1.67103e-05 times faster)

version1: done in 1631 nanoseconds
version2: done in 109867919 nanoseconds
(1.48451e-05 times faster)

version1: done in 1269 nanoseconds
version2: done in 111639106 nanoseconds
(1.1367e-05 times faster)

Unless I'm doing something wrong, the single memcpy call wins!
memcpy1.c attached.

best regards,
Ranier Vilela
