Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) - Mailing list pgsql-hackers

From	Ranier Vilela
Subject	Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c)
Date	March 12 17:35:04
Msg-id	CAEudQApqk6DXWgqSBdHyH7+wSxJuk7D-DwkGODUcGkUWpYu0UA@mail.gmail.com
In response to	Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) (Bryan Green <dbryan.green@gmail.com>)
Responses	Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c)
List	pgsql-hackers

Hi.

On Mon, Mar 9, 2026 at 14:02, Bryan Green <dbryan.green@gmail.com> wrote:

I ran a micro-benchmark on my dual EPYC (Zen 2) server, and version 1 wins for small values of n.

20 runs:

n    version   min    median  mean   max    stddev  noise%
-----------------------------------------------------------
n=1  version1  2.440  2.440   2.450  2.550  0.024   4.5%
n=1  version2  4.260  4.280   4.277  4.290  0.007   0.7%

n=2  version1  2.740  2.750   2.757  2.880  0.029   5.1%
n=2  version2  3.970  3.980   3.980  4.020  0.010   1.3%

n=4  version1  4.580  4.595   4.649  4.910  0.094   7.2%
n=4  version2  5.780  5.815   5.809  5.820  0.013   0.7%

But micro-benchmarks always make me nervous, so I looked at the actual instruction cost on my
platform for the version 1 and version 2 code.

If we count CPU cycles using the AMD Zen 2 instruction latency/throughput tables, version 1 (loop body)
has a critical path of ~5-6 cycles per iteration, and version 2 (loop body) has ~3-4 cycles per iteration.

The problem for version 2 is that the call to memcpy costs ~24-30 cycles due to the stub + function call + return
and branch predictor pressure on the first call. This probably results in a cost of ~2.5 ns per iteration for version 2.

So no, I wouldn't call it an optimization. But it will be interesting to hear other opinions on this.
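
For readers following the thread, the two copy shapes under discussion can be sketched roughly as below. This is only an illustration, assuming the comparison is between copying an array of keys one element at a time and copying it with a single call. DemoKey and both function names are hypothetical stand-ins, not code from genam.c or from either benchmark, and no claim is made here about which variant the thread calls version 1 or version 2.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a scan key; not PostgreSQL's ScanKeyData. */
typedef struct DemoKey
{
    int  attno;
    int  strategy;
    long argument;
} DemoKey;

/* Variant A: one memcpy call per key, issued inside a loop. */
static void
copy_keys_per_element(DemoKey *dst, const DemoKey *src, size_t nkeys)
{
    assert(nkeys == 0 || (dst != NULL && src != NULL));
    for (size_t i = 0; i < nkeys; i++)
        memcpy(&dst[i], &src[i], sizeof(DemoKey));
}

/* Variant B: a single memcpy call covering the whole array. */
static void
copy_keys_bulk(DemoKey *dst, const DemoKey *src, size_t nkeys)
{
    assert(nkeys == 0 || (dst != NULL && src != NULL));
    if (nkeys > 0)
        memcpy(dst, src, nkeys * sizeof(DemoKey));
}
```

Both variants leave dst with the same bytes; the only difference is how many memcpy calls are made and how much each one copies, which is what the cycle counts above are trying to weigh.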

I ran some quick and dirty tests with the two versions:

gcc 15.2.0

gcc -O2 memcpy1.c -o memcpy1

The first test used 10000000 keys and 10000000 loops:

version1: one memcpy call

done in 1873 nanoseconds

version2: inlined memcpy call

did not finish

The second test used 4 keys and 10000000 loops:

version1: one memcpy call

version2: inlined memcpy call

version1: done in 1519 nanoseconds
version2: done in 104981851 nanoseconds
(1.44692e-05 times faster)

version1: done in 1979 nanoseconds
version2: done in 110568901 nanoseconds
(1.78983e-05 times faster)

version1: done in 1814 nanoseconds
version2: done in 108555484 nanoseconds
(1.67103e-05 times faster)

version1: done in 1631 nanoseconds
version2: done in 109867919 nanoseconds
(1.48451e-05 times faster)

version1: done in 1269 nanoseconds
version2: done in 111639106 nanoseconds
(1.1367e-05 times faster)

Unless I'm doing something wrong, the single memcpy call wins!

memcpy1.c attached.
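
memcpy1.c itself is not reproduced here. As a rough idea of how such a comparison can be timed, the sketch below is a hypothetical harness and not the attached file: DemoKey, the two copy functions, now_ns and time_copy are all illustrative names. It assumes a POSIX system with clock_gettime. The source array changes on every pass and the destination is folded into a checksum, so that the copies have an observable effect when compiled at -O2.

```c
#define _POSIX_C_SOURCE 199309L

#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Hypothetical stand-in for a scan key; not PostgreSQL's ScanKeyData. */
typedef struct DemoKey
{
    int  attno;
    int  strategy;
    long argument;
} DemoKey;

typedef void (*copy_fn) (DemoKey *dst, const DemoKey *src, size_t nkeys);

/* One memcpy call per key. */
static void
copy_per_element(DemoKey *dst, const DemoKey *src, size_t nkeys)
{
    for (size_t i = 0; i < nkeys; i++)
        memcpy(&dst[i], &src[i], sizeof(DemoKey));
}

/* One memcpy call for the whole array. */
static void
copy_bulk(DemoKey *dst, const DemoKey *src, size_t nkeys)
{
    memcpy(dst, src, nkeys * sizeof(DemoKey));
}

/* Monotonic clock reading in nanoseconds. */
static int64_t
now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t) ts.tv_sec * 1000000000 + ts.tv_nsec;
}

/*
 * Run fn `loops` times and return the elapsed nanoseconds.
 * src[0] is modified before each pass and the first and last copied
 * keys are added to *checksum after each pass.
 */
static int64_t
time_copy(copy_fn fn, DemoKey *dst, DemoKey *src, size_t nkeys,
          long loops, uint64_t *checksum)
{
    uint64_t sum = 0;
    int64_t  start;
    int64_t  elapsed;

    assert(nkeys > 0);
    start = now_ns();
    for (long i = 0; i < loops; i++)
    {
        src[0].argument = i;
        fn(dst, src, nkeys);
        sum += (uint64_t) dst[0].argument + (uint64_t) dst[nkeys - 1].argument;
    }
    elapsed = now_ns() - start;
    *checksum = sum;
    return elapsed;
}
```

A caller would invoke time_copy once with copy_per_element and once with copy_bulk on the same arrays, check that the two checksums match, and then compare the elapsed times over several runs.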

best regards,

Ranier Vilela

Attachment

memcpy1.c
