Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) - Mailing list pgsql-hackers
| From | Bryan Green |
|---|---|
| Subject | Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) |
| Date | |
| Msg-id | CAF+pBj87nrr1jGadSRqZrhik+Y9T7V5e7177MOJX6=8KjUgOkA@mail.gmail.com Whole thread |
| In response to | Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) (Bryan Green <dbryan.green@gmail.com>) |
| Responses |
Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c)
|
| List | pgsql-hackers |
I modified your memcpy1.c program to not inline the version functions. I changed the memcpy function
call in version 1, added volatile to keep some DCE opportunities from happening and added a range
of N values to keep the compiler from specializing the code for N = 4. Before it did DCE and the test1
function was just a ret.
The interesting issue is the use of malloc versus the stack. The use of malloc will probably track closer
with PG's use of palloc so I would say in that case this is an optimization. It might be fun to compile PG
with and without the patch (in debug mode) and actually see what gets generated for this function.
Here are the results I got using your modified benchmark:
--- stack allocated ---
stack n=1 v1(patch): 49721599 ns v2(original): 21477302 ns ratio: 2.315 original wins
stack n=2 v1(patch): 52065462 ns v2(original): 28765199 ns ratio: 1.810 original wins
stack n=3 v1(patch): 58914958 ns v2(original): 39726110 ns ratio: 1.483 original wins
stack n=4 v1(patch): 64585275 ns v2(original): 47046397 ns ratio: 1.373 original wins
stack n=5 v1(patch): 73929844 ns v2(original): 58588698 ns ratio: 1.262 original wins
stack n=6 v1(patch): 95465376 ns v2(original): 67807817 ns ratio: 1.408 original wins
stack n=7 v1(patch): 86910226 ns v2(original): 76999488 ns ratio: 1.129 original wins
stack n=8 v1(patch): 107765417 ns v2(original): 86046016 ns ratio: 1.252 original wins
--- malloc allocated ---
malloc n=1 v1(patch): 133283824 ns v2(original): 141361091 ns ratio: 0.943 patch wins
malloc n=2 v1(patch): 145625895 ns v2(original): 180912711 ns ratio: 0.805 patch wins
malloc n=3 v1(patch): 153975594 ns v2(original): 228459879 ns ratio: 0.674 patch wins
malloc n=4 v1(patch): 154483094 ns v2(original): 248157408 ns ratio: 0.623 patch wins
malloc n=5 v1(patch): 157710598 ns v2(original): 298795018 ns ratio: 0.528 patch wins
malloc n=6 v1(patch): 165196636 ns v2(original): 332940132 ns ratio: 0.496 patch wins
malloc n=7 v1(patch): 169576370 ns v2(original): 358438778 ns ratio: 0.473 patch wins
malloc n=8 v1(patch): 184463815 ns v2(original): 403721513 ns ratio: 0.457 patch wins
stack n=1 v1(patch): 49721599 ns v2(original): 21477302 ns ratio: 2.315 original wins
stack n=2 v1(patch): 52065462 ns v2(original): 28765199 ns ratio: 1.810 original wins
stack n=3 v1(patch): 58914958 ns v2(original): 39726110 ns ratio: 1.483 original wins
stack n=4 v1(patch): 64585275 ns v2(original): 47046397 ns ratio: 1.373 original wins
stack n=5 v1(patch): 73929844 ns v2(original): 58588698 ns ratio: 1.262 original wins
stack n=6 v1(patch): 95465376 ns v2(original): 67807817 ns ratio: 1.408 original wins
stack n=7 v1(patch): 86910226 ns v2(original): 76999488 ns ratio: 1.129 original wins
stack n=8 v1(patch): 107765417 ns v2(original): 86046016 ns ratio: 1.252 original wins
--- malloc allocated ---
malloc n=1 v1(patch): 133283824 ns v2(original): 141361091 ns ratio: 0.943 patch wins
malloc n=2 v1(patch): 145625895 ns v2(original): 180912711 ns ratio: 0.805 patch wins
malloc n=3 v1(patch): 153975594 ns v2(original): 228459879 ns ratio: 0.674 patch wins
malloc n=4 v1(patch): 154483094 ns v2(original): 248157408 ns ratio: 0.623 patch wins
malloc n=5 v1(patch): 157710598 ns v2(original): 298795018 ns ratio: 0.528 patch wins
malloc n=6 v1(patch): 165196636 ns v2(original): 332940132 ns ratio: 0.496 patch wins
malloc n=7 v1(patch): 169576370 ns v2(original): 358438778 ns ratio: 0.473 patch wins
malloc n=8 v1(patch): 184463815 ns v2(original): 403721513 ns ratio: 0.457 patch wins
The modified program:
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <time.h>
typedef void (*RegProcedure)(void);
typedef uintptr_t Datum;
typedef struct ScanKeyData
{
int sk_flags;
int sk_attno;
RegProcedure sk_func;
Datum sk_argument;
} ScanKeyData;
/* version1: bulk memcpy + fixup (the patch) */
static __attribute__((noinline))
void version1_stack(int n, const ScanKeyData *key, ScanKeyData *idxkey)
{
memcpy(idxkey, key, n * sizeof(ScanKeyData));
for (int i = 0; i < n; i++)
idxkey[i].sk_attno = i + 1;
}
/* version2: per-element memcpy + fixup (the original) */
static __attribute__((noinline))
void version2_stack(int n, const ScanKeyData *key, ScanKeyData *idxkey)
{
for (int i = 0; i < n; i++)
{
memcpy(&idxkey[i], &key[i], sizeof(ScanKeyData));
idxkey[i].sk_attno = i + 1;
}
}
/* version1: bulk memcpy + fixup (the patch) */
static __attribute__((noinline))
ScanKeyData *version1_malloc(int n, const ScanKeyData *key)
{
ScanKeyData *idxkey = (ScanKeyData *) malloc(n * sizeof(ScanKeyData));
memcpy(idxkey, key, n * sizeof(ScanKeyData));
for (int i = 0; i < n; i++)
idxkey[i].sk_attno = i + 1;
return idxkey;
}
/* version2: per-element memcpy + fixup (the original) */
static __attribute__((noinline))
ScanKeyData *version2_malloc(int n, const ScanKeyData *key)
{
ScanKeyData *idxkey = (ScanKeyData *) malloc(n * sizeof(ScanKeyData));
for (int i = 0; i < n; i++)
{
memcpy(&idxkey[i], &key[i], sizeof(ScanKeyData));
idxkey[i].sk_attno = i + 1;
}
return idxkey;
}
#define NANOSEC_PER_SEC 1000000000
int64_t
get_clock_diff(struct timespec *t1, struct timespec *t2)
{
int64_t nanosec = (t1->tv_sec - t2->tv_sec) * NANOSEC_PER_SEC;
nanosec += (t1->tv_nsec - t2->tv_nsec);
return nanosec;
}
#define MAX_KEYS 8
#define LOOPS 10000000
void test_stack(int n)
{
ScanKeyData keys[MAX_KEYS];
ScanKeyData idxkey[MAX_KEYS];
struct timespec start, end;
int64_t version1_time, version2_time;
memset(keys, 0, sizeof(keys));
/* warmup */
for (int i = 0; i < 1000; i++)
{
version1_stack(n, keys, idxkey);
volatile int sink = idxkey[n-1].sk_attno;
(void) sink;
}
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
for (int i = 0; i < LOOPS; i++)
{
version1_stack(n, keys, idxkey);
volatile int sink = idxkey[n-1].sk_attno;
(void) sink;
}
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
version1_time = get_clock_diff(&end, &start);
/* warmup */
for (int i = 0; i < 1000; i++)
{
version2_stack(n, keys, idxkey);
volatile int sink = idxkey[n-1].sk_attno;
(void) sink;
}
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
for (int i = 0; i < LOOPS; i++)
{
version2_stack(n, keys, idxkey);
volatile int sink = idxkey[n-1].sk_attno;
(void) sink;
}
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
version2_time = get_clock_diff(&end, &start);
printf("stack n=%d v1(patch): %ld ns v2(original): %ld ns ratio: %.3f %s\n",
n, version1_time, version2_time,
(double) version1_time / version2_time,
version1_time < version2_time ? "patch wins" : "original wins");
}
void test_malloc(int n)
{
ScanKeyData keys[MAX_KEYS];
ScanKeyData *idxkey;
struct timespec start, end;
int64_t version1_time, version2_time;
memset(keys, 0, sizeof(keys));
/* warmup */
for (int i = 0; i < 1000; i++)
{
idxkey = version1_malloc(n, keys);
volatile int sink = idxkey[n-1].sk_attno;
(void) sink;
free(idxkey);
}
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
for (int i = 0; i < LOOPS; i++)
{
idxkey = version1_malloc(n, keys);
volatile int sink = idxkey[n-1].sk_attno;
(void) sink;
free(idxkey);
}
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
version1_time = get_clock_diff(&end, &start);
/* warmup */
for (int i = 0; i < 1000; i++)
{
idxkey = version2_malloc(n, keys);
volatile int sink = idxkey[n-1].sk_attno;
(void) sink;
free(idxkey);
}
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
for (int i = 0; i < LOOPS; i++)
{
idxkey = version2_malloc(n, keys);
volatile int sink = idxkey[n-1].sk_attno;
(void) sink;
free(idxkey);
}
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
version2_time = get_clock_diff(&end, &start);
printf("malloc n=%d v1(patch): %ld ns v2(original): %ld ns ratio: %.3f %s\n",
n, version1_time, version2_time,
(double) version1_time / version2_time,
version1_time < version2_time ? "patch wins" : "original wins");
}
int main(void)
{
printf("--- stack allocated ---\n");
for (int n = 1; n <= MAX_KEYS; n++)
test_stack(n);
printf("\n--- malloc allocated ---\n");
for (int n = 1; n <= MAX_KEYS; n++)
test_malloc(n);
return 0;
}
#include <string.h>
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <time.h>
typedef void (*RegProcedure)(void);
typedef uintptr_t Datum;
typedef struct ScanKeyData
{
int sk_flags;
int sk_attno;
RegProcedure sk_func;
Datum sk_argument;
} ScanKeyData;
/* version1: bulk memcpy + fixup (the patch) */
static __attribute__((noinline))
void version1_stack(int n, const ScanKeyData *key, ScanKeyData *idxkey)
{
memcpy(idxkey, key, n * sizeof(ScanKeyData));
for (int i = 0; i < n; i++)
idxkey[i].sk_attno = i + 1;
}
/* version2: per-element memcpy + fixup (the original) */
static __attribute__((noinline))
void version2_stack(int n, const ScanKeyData *key, ScanKeyData *idxkey)
{
for (int i = 0; i < n; i++)
{
memcpy(&idxkey[i], &key[i], sizeof(ScanKeyData));
idxkey[i].sk_attno = i + 1;
}
}
/* version1: bulk memcpy + fixup (the patch) */
static __attribute__((noinline))
ScanKeyData *version1_malloc(int n, const ScanKeyData *key)
{
ScanKeyData *idxkey = (ScanKeyData *) malloc(n * sizeof(ScanKeyData));
memcpy(idxkey, key, n * sizeof(ScanKeyData));
for (int i = 0; i < n; i++)
idxkey[i].sk_attno = i + 1;
return idxkey;
}
/* version2: per-element memcpy + fixup (the original) */
static __attribute__((noinline))
ScanKeyData *version2_malloc(int n, const ScanKeyData *key)
{
ScanKeyData *idxkey = (ScanKeyData *) malloc(n * sizeof(ScanKeyData));
for (int i = 0; i < n; i++)
{
memcpy(&idxkey[i], &key[i], sizeof(ScanKeyData));
idxkey[i].sk_attno = i + 1;
}
return idxkey;
}
#define NANOSEC_PER_SEC 1000000000
int64_t
get_clock_diff(struct timespec *t1, struct timespec *t2)
{
int64_t nanosec = (t1->tv_sec - t2->tv_sec) * NANOSEC_PER_SEC;
nanosec += (t1->tv_nsec - t2->tv_nsec);
return nanosec;
}
#define MAX_KEYS 8
#define LOOPS 10000000
void test_stack(int n)
{
ScanKeyData keys[MAX_KEYS];
ScanKeyData idxkey[MAX_KEYS];
struct timespec start, end;
int64_t version1_time, version2_time;
memset(keys, 0, sizeof(keys));
/* warmup */
for (int i = 0; i < 1000; i++)
{
version1_stack(n, keys, idxkey);
volatile int sink = idxkey[n-1].sk_attno;
(void) sink;
}
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
for (int i = 0; i < LOOPS; i++)
{
version1_stack(n, keys, idxkey);
volatile int sink = idxkey[n-1].sk_attno;
(void) sink;
}
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
version1_time = get_clock_diff(&end, &start);
/* warmup */
for (int i = 0; i < 1000; i++)
{
version2_stack(n, keys, idxkey);
volatile int sink = idxkey[n-1].sk_attno;
(void) sink;
}
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
for (int i = 0; i < LOOPS; i++)
{
version2_stack(n, keys, idxkey);
volatile int sink = idxkey[n-1].sk_attno;
(void) sink;
}
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
version2_time = get_clock_diff(&end, &start);
printf("stack n=%d v1(patch): %ld ns v2(original): %ld ns ratio: %.3f %s\n",
n, version1_time, version2_time,
(double) version1_time / version2_time,
version1_time < version2_time ? "patch wins" : "original wins");
}
void test_malloc(int n)
{
ScanKeyData keys[MAX_KEYS];
ScanKeyData *idxkey;
struct timespec start, end;
int64_t version1_time, version2_time;
memset(keys, 0, sizeof(keys));
/* warmup */
for (int i = 0; i < 1000; i++)
{
idxkey = version1_malloc(n, keys);
volatile int sink = idxkey[n-1].sk_attno;
(void) sink;
free(idxkey);
}
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
for (int i = 0; i < LOOPS; i++)
{
idxkey = version1_malloc(n, keys);
volatile int sink = idxkey[n-1].sk_attno;
(void) sink;
free(idxkey);
}
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
version1_time = get_clock_diff(&end, &start);
/* warmup */
for (int i = 0; i < 1000; i++)
{
idxkey = version2_malloc(n, keys);
volatile int sink = idxkey[n-1].sk_attno;
(void) sink;
free(idxkey);
}
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
for (int i = 0; i < LOOPS; i++)
{
idxkey = version2_malloc(n, keys);
volatile int sink = idxkey[n-1].sk_attno;
(void) sink;
free(idxkey);
}
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
version2_time = get_clock_diff(&end, &start);
printf("malloc n=%d v1(patch): %ld ns v2(original): %ld ns ratio: %.3f %s\n",
n, version1_time, version2_time,
(double) version1_time / version2_time,
version1_time < version2_time ? "patch wins" : "original wins");
}
int main(void)
{
printf("--- stack allocated ---\n");
for (int n = 1; n <= MAX_KEYS; n++)
test_stack(n);
printf("\n--- malloc allocated ---\n");
for (int n = 1; n <= MAX_KEYS; n++)
test_malloc(n);
return 0;
}
-- bg
On Thu, Mar 12, 2026 at 12:48 PM Bryan Green <dbryan.green@gmail.com> wrote:
I don't think your version 1 memcpy is doing what you think it is doing.On Thu, Mar 12, 2026 at 12:35 PM Ranier Vilela <ranier.vf@gmail.com> wrote:Hi.Em seg., 9 de mar. de 2026 às 14:02, Bryan Green <dbryan.green@gmail.com> escreveu:I performed a micro-benchmark on my dual epyc (zen 2) server and version 1 wins for small values of n.20 runs:n version min median mean max stddev noise%
-----------------------------------------------------------------------
n=1 version1 2.440 2.440 2.450 2.550 0.024 4.5%
n=1 version2 4.260 4.280 4.277 4.290 0.007 0.7%
n=2 version1 2.740 2.750 2.757 2.880 0.029 5.1%
n=2 version2 3.970 3.980 3.980 4.020 0.010 1.3%
n=4 version1 4.580 4.595 4.649 4.910 0.094 7.2%
n=4 version2 5.780 5.815 5.809 5.820 0.013 0.7%But, micro-benchmarks always make me nervous, so I looked at the actual instruction cost for myplatform given the version 1 and version 2 code.If we count cpu cycles using the AMD Zen 2 instruction latency/throughput tables: version 1 (loop body)has a critical path of ~5-6 cycles per iteration. version 2 (loop body) has ~3-4 cycles per iteration.The problem for version 2 is that the call to memcpy is ~24-30 cycles due to the stub + function call + returnand branch predictor pressure on first call. This probably results in ~2.5 ns per iteration cost for version 2.So, no I wouldn't call it an optimization. But, it will be interesting to hear other opinions on this.I made dirty and quick tests with two versions:gcc 15.2.0gcc -O2 memcpy1.c -o memcpy1The first test was with keys 10000000 and 10000000 loops:version1: on memcpy calldone in 1873 nanosecondsversion2: inlined memcpynot finishThe second test was with keys 4 and 10000000 loops:version1: one memcpy callversion2: inlined memcpy callversion1: done in 1519 nanoseconds
version2: done in 104981851 nanoseconds
(1.44692e-05 times faster)
version1: done in 1979 nanoseconds
version2: done in 110568901 nanoseconds
(1.78983e-05 times faster)version1: done in 1814 nanoseconds
version2: done in 108555484 nanoseconds
(1.67103e-05 times faster)version1: done in 1631 nanoseconds
version2: done in 109867919 nanoseconds
(1.48451e-05 times faster)
version1: done in 1269 nanoseconds
version2: done in 111639106 nanoseconds
(1.1367e-05 times faster)Unless I'm doing something wrong, one call memcpy wins!memcpy1.c attached.best regards,Ranier Vilela
pgsql-hackers by date: