Thread: Align large shared memory allocations
Attached is a patch that aligns large shared memory allocations beyond
MAXIMUM_ALIGNOF. The reason for this is that Intel's CPUs have a fast
path for bulk memory copies that only works with aligned addresses.
It's possible that other CPUs have similar restrictions. With 7.3.4, it
achieves a 5% performance gain with pgbench. It has no effect with
7.3.3, because there the buffers happen to be aligned already by chance.
I haven't properly tested 7.4cvs yet.

One problem is the "32" - it's arbitrary; it probably belongs in an
arch-dependent header file. But where?

--
    Manfred

diff -u pgsql.orig/src/backend/storage/ipc/shmem.c pgsql/src/backend/storage/ipc/shmem.c
--- pgsql.orig/src/backend/storage/ipc/shmem.c	2003-09-20 20:17:08.000000000 +0200
+++ pgsql/src/backend/storage/ipc/shmem.c	2003-09-20 20:34:21.000000000 +0200
@@ -131,6 +131,7 @@
 void *
 ShmemAlloc(Size size)
 {
+	uint32		newStart;
 	uint32		newFree;
 	void	   *newSpace;
 
@@ -146,10 +147,21 @@
 
 	SpinLockAcquire(ShmemLock);
 
-	newFree = shmemseghdr->freeoffset + size;
+	newStart = shmemseghdr->freeoffset;
+	if (size >= BLCKSZ)
+	{
+		/*
+		 * Align BLCKSZ-sized buffers even further:
+		 * - the costs are small
+		 * - some CPUs (most notably Intel Pentium III)
+		 *   prefer well-aligned addresses for memory copies
+		 */
+		newStart = TYPEALIGN(32, newStart);
+	}
+
+	newFree = newStart + size;
 	if (newFree <= shmemseghdr->totalsize)
 	{
-		newSpace = (void *) MAKE_PTR(shmemseghdr->freeoffset);
+		newSpace = (void *) MAKE_PTR(newStart);
 		shmemseghdr->freeoffset = newFree;
 	}
 	else
Manfred Spraul <manfred@colorfullife.com> writes:
> Attached is a patch that aligns large shared memory allocations beyond
> MAXIMUM_ALIGNOF. The reason for this is that Intel's CPUs have a fast
> path for bulk memory copies that only works with aligned addresses.

This patch is missing a demonstration that it's actually worth anything.
What kind of performance gain do you get?

> One problem is the "32" - it's arbitrary; it probably belongs in an
> arch-dependent header file. But where?

We don't really have arch-dependent header files.  What I'd be inclined
to do is "#define ALIGNOF_BUFFER 32" in pg_config_manual.h, then
#define BUFFERALIGN(LEN) to parallel the other TYPEALIGN macros in c.h,
and finally use that in the ShmemAlloc code.

			regards, tom lane
Tom Lane wrote:
> Manfred Spraul <manfred@colorfullife.com> writes:
>> Attached is a patch that aligns large shared memory allocations beyond
>> MAXIMUM_ALIGNOF. The reason for this is that Intel's CPUs have a fast
>> path for bulk memory copies that only works with aligned addresses.
>
> This patch is missing a demonstration that it's actually worth anything.
> What kind of performance gain do you get?

7.4cvs on a 1.13 GHz Intel Celeron mobile, 384 MB RAM, "Severn" RedHat
Linux 2.4 beta, postmaster -N 30 -B 64, data directory on ramdisk,
pgbench -c 10 -s 11 -t 1000:

    Without the patch: 124 tps
    With the patch:    130 tps

I've reduced the buffer setting to 64 because otherwise too large a part
of the database was cached by postgres. I expect that on all Intel
Pentium III chips it will be worth 10-20% less system time. I had around
30% system time after reducing the number of buffers, hence the ~5%
performance improvement.

> We don't really have arch-dependent header files.  What I'd be inclined
> to do is "#define ALIGNOF_BUFFER 32" in pg_config_manual.h, then
> #define BUFFERALIGN(LEN) to parallel the other TYPEALIGN macros in c.h,
> and finally use that in the ShmemAlloc code.

Ok, new patch attached.
--
    Manfred

Index: src/backend/storage/ipc/shmem.c
===================================================================
RCS file: /projects/cvsroot/pgsql-server/src/backend/storage/ipc/shmem.c,v
retrieving revision 1.70
diff -u -r1.70 shmem.c
--- src/backend/storage/ipc/shmem.c	4 Aug 2003 02:40:03 -0000	1.70
+++ src/backend/storage/ipc/shmem.c	21 Sep 2003 07:53:13 -0000
@@ -131,6 +131,7 @@
 void *
 ShmemAlloc(Size size)
 {
+	uint32		newStart;
 	uint32		newFree;
 	void	   *newSpace;
 
@@ -146,10 +147,14 @@
 
 	SpinLockAcquire(ShmemLock);
 
-	newFree = shmemseghdr->freeoffset + size;
+	newStart = shmemseghdr->freeoffset;
+	if (size >= BLCKSZ)
+		newStart = BUFFERALIGN(newStart);
+
+	newFree = newStart + size;
 	if (newFree <= shmemseghdr->totalsize)
 	{
-		newSpace = (void *) MAKE_PTR(shmemseghdr->freeoffset);
+		newSpace = (void *) MAKE_PTR(newStart);
 		shmemseghdr->freeoffset = newFree;
 	}
 	else

Index: src/include/c.h
===================================================================
RCS file: /projects/cvsroot/pgsql-server/src/include/c.h,v
retrieving revision 1.152
diff -u -r1.152 c.h
--- src/include/c.h	4 Aug 2003 02:40:10 -0000	1.152
+++ src/include/c.h	21 Sep 2003 07:53:14 -0000
@@ -529,6 +529,7 @@
 #define LONGALIGN(LEN)			TYPEALIGN(ALIGNOF_LONG, (LEN))
 #define DOUBLEALIGN(LEN)		TYPEALIGN(ALIGNOF_DOUBLE, (LEN))
 #define MAXALIGN(LEN)			TYPEALIGN(MAXIMUM_ALIGNOF, (LEN))
+#define BUFFERALIGN(LEN)		TYPEALIGN(ALIGNOF_BUFFER, (LEN))
 
 /* ----------------------------------------------------------------

Index: src/include/pg_config_manual.h
===================================================================
RCS file: /projects/cvsroot/pgsql-server/src/include/pg_config_manual.h,v
retrieving revision 1.5
diff -u -r1.5 pg_config_manual.h
--- src/include/pg_config_manual.h	4 Aug 2003 00:43:29 -0000	1.5
+++ src/include/pg_config_manual.h	21 Sep 2003 07:53:14 -0000
@@ -176,6 +176,14 @@
  */
 #define MAX_RANDOM_VALUE  (0x7FFFFFFF)
 
+/*
+ * Alignment of the disk blocks in the shared memory area.
+ * A significant amount of the total system time is required for
+ * copying disk blocks between the OS buffers and the cache in the
+ * shared memory area. Some CPUs (most notably Intel Pentium III)
+ * prefer well-aligned addresses for memory copies.
+ */
+#define ALIGNOF_BUFFER	32
+
 /*
  *------------------------------------------------------------------------
Manfred Spraul <manfred@colorfullife.com> writes:
> Tom Lane wrote:
>> This patch is missing a demonstration that it's actually worth anything.
>> What kind of performance gain do you get?

> 7.4cvs on a 1.13 GHz Intel Celeron mobile, 384 MB RAM, "Severn" RedHat
> Linux 2.4 beta, postmaster -N 30 -B 64, data directory on ramdisk,
> pgbench -c 10 -s 11 -t 1000:
> Without the patch: 124 tps
> with the patch: 130 tps.

I tried it on an Intel box here (P4 I think).  Using postmaster -B 64 -N 30
and three tries of pgbench -s 10 -c 1 -t 1000 after creation of the test
tables, I get:

tps = 92.461144 (including connections establishing)
tps = 92.500572 (excluding connections establishing)

tps = 88.078814 (including connections establishing)
tps = 88.115905 (excluding connections establishing)

tps = 85.434473 (including connections establishing)
tps = 85.468807 (excluding connections establishing)

and with the patch:

tps = 122.927066 (including connections establishing)
tps = 122.998129 (excluding connections establishing)

tps = 110.716370 (including connections establishing)
tps = 110.773928 (excluding connections establishing)

tps = 138.155991 (including connections establishing)
tps = 138.245777 (excluding connections establishing)

So there's definitely a visible difference on recent Pentiums.  It might
not help on other CPUs, but we can surely afford to waste a couple dozen
bytes in the hope that it might.

Patch applied.  Do you want to look at making the same thing happen for
local buffers and buffile.c as well?

			regards, tom lane