Re: slab allocator performance issues - Mailing list pgsql-hackers

From David Rowley
Subject Re: slab allocator performance issues
Msg-id CAApHDvq6eUdLJxAUdSmukGiiTQNT79cNtntL=3FE52T_AP3XDQ@mail.gmail.com
In response to Re: slab allocator performance issues  (John Naylor <john.naylor@enterprisedb.com>)
Responses Re: slab allocator performance issues
List pgsql-hackers
I've spent quite a bit more time on the slab changes now and I've
attached a v3 patch.

One of the major things that I've spent time on is benchmarking this.
I'm aware that Tomas wrote some functions to benchmark.  I've taken
those and made some modifications to allow the memory context type to
be specified as a function parameter.  This allows me to easily
compare the performance of slab with both aset and generation.

Another change that I made to Tomas' module was how the random
ordering part works.  What I wanted was the ability to specify how
randomly to pfree the chunks and test various "degrees-of-randomness"
to see how that affects the performance.  What I ended up coming up
with was the ability to specify the number of "random segments".  This
controls how many groups we split all allocated chunks into to
randomise.  If there is 1 random segment, then that's just randomising
over all chunks. If there are 10 random segments, then we split the
array of allocated chunks into 10 portions based on either FIFO or
LIFO order, then randomise the order of the chunks only within each of
those segments. This allows us to test FIFO/LIFO allocation patterns
with and without randomisation, and any degree of it in between. If the
number of random segments is set to 0, then no randomisation is done.
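
To make the "random segments" idea concrete, here is a minimal sketch of how such a free order could be built. It is not the code from the benchmark module: build_free_order, its parameters, and the use of rand() are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Fill order[0..nchunks-1] with chunk indexes in the order they would be
 * pfree'd.  Hypothetical helper, not taken from the benchmark module.
 */
static void
build_free_order(size_t *order, size_t nchunks, int lifo,
                 size_t nsegments, unsigned int seed)
{
    assert(order != NULL || nchunks == 0);

    /* Start from a pure FIFO or LIFO order */
    for (size_t i = 0; i < nchunks; i++)
        order[i] = lifo ? nchunks - 1 - i : i;

    /* 0 random segments means no randomisation at all */
    if (nsegments == 0 || nchunks == 0)
        return;

    /* More segments than chunks would only produce empty segments */
    if (nsegments > nchunks)
        nsegments = nchunks;

    srand(seed);

    for (size_t s = 0; s < nsegments; s++)
    {
        /* Segment s covers the half-open range [start, end) */
        size_t      start = s * nchunks / nsegments;
        size_t      end = (s + 1) * nchunks / nsegments;

        /* Fisher-Yates shuffle restricted to this segment */
        for (size_t i = end - start; i > 1; i--)
        {
            size_t      j = (size_t) rand() % i;
            size_t      tmp = order[start + i - 1];

            order[start + i - 1] = order[start + j];
            order[start + j] = tmp;
        }
    }
}
```

With nsegments = 1 the whole array is shuffled. With nsegments = 10 each tenth of the FIFO or LIFO order is shuffled on its own, so a chunk never moves outside its segment.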

Another change I made to Tomas' code is that I'm now using palloc0()
instead of palloc(), and I'm also checking that the first byte of the
allocated chunk is '\0' before pfreeing it.  What I was finding was
that pfree was showing as highly dominant in perf output due to it
having to dereference the MemoryChunk to find the context-type bits.
pfree had to take that cache miss as none of the calling code had
touched any of the memory in the chunk.  I felt it was unrealistic to
palloc memory and then pfree it without having done anything with it.
Mostly this just moves around the responsibility for which function
is penalised by having to load the cache line. I mostly did this as I
was struggling to make any sense of perf's output.
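
As a rough illustration of that allocate, touch, then free pattern, here is a standalone sketch. It uses calloc() and free() as stand-ins for palloc0() and pfree() so that it compiles outside the backend. The function name and the volatile read are assumptions and are not taken from the attached module.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Allocate nchunks zeroed chunks, then free them, reading the first byte of
 * each chunk just before the free.  Returns the number of chunks whose first
 * byte was not '\0' (expected to be 0), or (size_t) -1 if the pointer array
 * could not be allocated.  calloc()/free() stand in for palloc0()/pfree();
 * bench_touch_and_free is a hypothetical name.
 */
static size_t
bench_touch_and_free(size_t nchunks, size_t chunk_size)
{
    char      **chunks;
    size_t      nonzero = 0;

    assert(nchunks > 0 && chunk_size > 0);

    chunks = malloc(nchunks * sizeof(char *));
    if (chunks == NULL)
        return (size_t) -1;

    for (size_t i = 0; i < nchunks; i++)
        chunks[i] = calloc(1, chunk_size);

    for (size_t i = 0; i < nchunks; i++)
    {
        if (chunks[i] == NULL)
            continue;

        /*
         * Read the chunk before freeing it, so the caller rather than the
         * free routine is the first to touch the chunk's memory.  The
         * volatile cast keeps the compiler from eliding the load.
         */
        if (*(volatile char *) chunks[i] != '\0')
            nonzero++;

        free(chunks[i]);
    }

    free(chunks);
    return nonzero;
}
```

In a profile of something like this, the cost of the first access to each chunk shows up in the caller's loop instead of inside the free routine, which is the shift in blame described above.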

I've attached alloc_bench_contrib.patch which I used for testing.

I've also attached a spreadsheet with the benchmark results.  The
general summary from having done those is that slab is now generally
on-par with aset in terms of palloc performance. Previously slab
was performing at about half the speed of aset unless CPU cache
pressure became more significant, in which case the performance is
dominated by fetching cache lines from RAM. However, the new code
still makes meaningful improvements even under heavy CPU cache
pressure.  When it comes to pfree performance, the updated slab code
is much faster than it was previously, but not quite on-par with aset
or generation.

The attached spreadsheet is broken down into 3 tabs.  Each tab is
testing a chunk size and a fixed total number of chunks allocated at
once.  Within each tab, I'm testing FIFO and then LIFO allocation
patterns each with a different degree of randomness introduced, as I
described above. In none of the tests was the patched version slower
than the unpatched version.

One pending question I had was about SlabStats where we list free
chunks.  Since we now have a list of emptyblocks, I wasn't too sure if
the chunks from those should be included in that total.  I currently
am not including them, but I have added some additional information to
list the number of completely empty blocks that we've got in the
emptyblocks list.

Some follow-up work that I think would be a good idea:

1. Reduce the SlabContext's chunkSize, fullChunkSize and blockSize
fields from Size down to uint32.  These have no need to be 64 bits. We
don't allow slab blocks over 1GB since c6e0fe1f2.  I thought of doing
this separately as we might need to rationalise the equivalent fields
in aset.c and generation.c.  Those can have external chunks, so I'm
not 100% sure if we should do that there or not yet. I just didn't
want to touch those files in this effort.
2. Slab should probably gain the ability to grow the block size as
aset and generation both do. Since the performance of the slab context
is good now, we might want to use it for hash join's 32kB chunks, but
I doubt we can without the block size growth.
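
For item 1, a small sketch of what narrowing those fields saves. The two structs below are hypothetical cut-down versions holding only the three size fields, not the real SlabContext, and SLAB_MAX_BLOCK_SIZE is an assumed name for the 1GB limit. Size in PostgreSQL is a typedef for size_t, so size_t is used directly here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 1GB block size limit; the macro name is hypothetical */
#define SLAB_MAX_BLOCK_SIZE ((uint64_t) 1 << 30)

/* Size fields as they are today: size_t, 8 bytes each on 64-bit builds */
typedef struct SlabSizesWide
{
    size_t      chunkSize;
    size_t      fullChunkSize;
    size_t      blockSize;
} SlabSizesWide;

/* Proposed narrowing: 4 bytes each regardless of platform */
typedef struct SlabSizesNarrow
{
    uint32_t    chunkSize;
    uint32_t    fullChunkSize;
    uint32_t    blockSize;
} SlabSizesNarrow;

/* Any size up to 1GB fits in 32 bits, so nothing is lost by narrowing */
static_assert(SLAB_MAX_BLOCK_SIZE <= UINT32_MAX,
              "1GB block size must fit in uint32");

/* The narrow layout is never larger than the wide one */
static_assert(sizeof(SlabSizesNarrow) <= sizeof(SlabSizesWide),
              "narrow fields must not grow the struct");
```

On a typical 64-bit build the three fields shrink from 24 bytes to 12. Whether that changes the overall size of the real struct depends on the padding around its other members.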

I'm planning on pushing the attached v3 patch shortly. I've spent
several days reading over this and testing it in detail along with
adding additional features to the SlabCheck code to find more
inconsistencies.

David

