I've spent a bit more time hacking on this patch.
Changes:
1. Changed GetFreeListLink() so that it stores the AllocFreelistLink
at the end of the chunk rather than at the start.
2. Made it so MemoryChunk stores a magic number in the spare 60 bits
of the hdrmask when the chunk is "external". This is always set but
only verified in assert builds.
3. In aset.c, I'm no longer storing the chunk_size in the hdrmask. I'm
now instead storing the freelist index. I'll explain this below.
4. Various other cleanups.
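To illustrate #1, here is a minimal sketch of what "link at the end of the chunk" means. This is not the code from the patch: FreeLink and free_link_at_end() are hypothetical names, and I'm assuming the chunk's usable size is known and at least as large as the link.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for AllocFreelistLink: a single "next" pointer. */
typedef struct FreeLink
{
	void	   *next;
} FreeLink;

/*
 * Return the address of the freelist link when it is kept in the last
 * sizeof(FreeLink) bytes of a free chunk's usable space, rather than in
 * the first bytes.  Assumes payload_size >= sizeof(FreeLink) and that
 * payload_size keeps the result suitably aligned (true for power-of-2
 * sizes of 8 bytes or more on an aligned payload).
 */
static inline FreeLink *
free_link_at_end(void *payload, size_t payload_size)
{
	assert(payload_size >= sizeof(FreeLink));
	return (FreeLink *) ((char *) payload + payload_size - sizeof(FreeLink));
}
```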
For #3, I was doing some benchmarking of the patch with a function I
wrote to heavily exercise palloc() and pfree(). When this function is
called to only allocate a small amount of memory at once, I saw a
small regression in the palloc() / pfree() performance for aset.c. On
looking at profiles, I saw that the code in AllocSetFreeIndex() was
standing out in AllocSetFree(). That function uses the __builtin_clz()
intrinsic function which I see on x86-64 uses the "bsr" instruction.
Going by page 104 of [1], it tells me the latency of that instruction
is 4 for my Zen 2 CPU. I'm not yet sure why the v3 patch appeared
slower than master for this workload.
To make AllocSetFree() faster, I've now changed things so that instead
of storing the chunk size in the hdrmask of the MemoryChunk, I'm now
just storing the freelist index. The chunk size is always a power of
2 for non-external chunks. It's very cheap to obtain the chunk size
from the freelist index when we need to. That's just a "sal" or "shl"
instruction, effectively 8 << freelist_idx, both of which have a
latency of 1. This means that AllocSetFreeIndex() is now only called
in AllocSetAlloc().
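As a rough sketch of the two directions of that mapping (not the actual aset.c code): the function names and MIN_CHUNK_BITS below are made up for illustration, I'm assuming a minimum chunk size of 8 bytes to match the "8 << freelist_idx" above, and __builtin_clzll() requires GCC or Clang.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed: the smallest chunk is 8 bytes, i.e. 1 << 3. */
#define MIN_CHUNK_BITS 3

/*
 * Size -> freelist index.  This is the more expensive direction, since it
 * needs a count-leading-zeros, and is only needed at allocation time.
 * Computes ceil(log2(size)) - MIN_CHUNK_BITS, clamped to 0.
 */
static inline int
size_to_freelist_idx(size_t size)
{
	assert(size > 0);
	if (size <= ((size_t) 1 << MIN_CHUNK_BITS))
		return 0;
	return (int) (sizeof(unsigned long long) * 8) -
		__builtin_clzll((unsigned long long) size - 1) - MIN_CHUNK_BITS;
}

/*
 * Freelist index -> chunk size.  This is the cheap direction: one shift.
 */
static inline size_t
freelist_idx_to_size(int idx)
{
	return (size_t) 8 << idx;
}
```

With the index stored in the header, the free path only ever needs the second function.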
This changes the performance as follows:
Master:
postgres=# select pg_allocate_memory_test(64, 1024,
20::bigint*1024*1024*1024, 'aset');
Time: 2524.438 ms (00:02.524)
Old patch (v3):
postgres=# select pg_allocate_memory_test(64, 1024,
20::bigint*1024*1024*1024, 'aset');
Time: 2646.438 ms (00:02.646)
New patch (v4):
postgres=# select pg_allocate_memory_test(64, 1024,
20::bigint*1024*1024*1024, 'aset');
Time: 2296.228 ms (00:02.296)
(about 10% faster than master)
This function is allocating 64-byte chunks and keeping 1k of them
around at once, but allocating a total of 20GB of them. I've attached
another patch with that function in it for anyone who wants to check
the performance.
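For anyone who just wants the shape of the workload without applying the patch, the allocation pattern is roughly the following. This is a standalone approximation, not the attached function: allocate_memory_pattern() is a made-up name and it uses malloc()/free() where the real test uses palloc()/pfree() in the chosen context type.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Allocate chunks of chunk_size bytes until total_bytes have been
 * allocated, keeping at most keep_chunks of them live at once by freeing
 * the oldest chunk before each new allocation.  Returns the number of
 * allocations performed.
 */
static uint64_t
allocate_memory_pattern(size_t chunk_size, size_t keep_chunks,
						uint64_t total_bytes)
{
	void	  **ring;
	uint64_t	nallocs = 0;
	uint64_t	allocated = 0;
	size_t		pos = 0;

	assert(chunk_size > 0 && keep_chunks > 0);

	ring = calloc(keep_chunks, sizeof(void *));
	if (ring == NULL)
		return 0;

	while (allocated < total_bytes)
	{
		free(ring[pos]);		/* free(NULL) is a no-op on the first lap */
		ring[pos] = malloc(chunk_size);
		if (ring[pos] == NULL)
			break;
		allocated += chunk_size;
		nallocs++;
		pos = (pos + 1) % keep_chunks;
	}

	/* Release whatever is still live. */
	for (pos = 0; pos < keep_chunks; pos++)
		free(ring[pos]);
	free(ring);

	return nallocs;
}
```

The calls shown above would correspond to chunk_size = 64, keep_chunks = 1024 and total_bytes = 20GB.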
I also tried another round of the pgbench -S workload that I ran
upthread [2] on the v2 patch. Confusingly, even when testing on
0b039e3a8 as I was last week, I'm unable to see that same 10%
performance increase.
Does anyone else want to have a go at taking v4 for a spin to see how
it performs?
David
[1] https://www.agner.org/optimize/instruction_tables.pdf
[2] https://www.postgresql.org/message-id/CAApHDvrrYfcCXfuc_bZ0xsqBP8U62Y0i27agr9Qt-2geE_rv0Q@mail.gmail.com