Re: slab allocator performance issues - Mailing list pgsql-hackers
From | David Rowley |
---|---|
Subject | Re: slab allocator performance issues |
Date | |
Msg-id | CAApHDvr0pKjrGVKFg2VqSUjbB1QxsKobu0PhD11qkSO_caL3dg@mail.gmail.com Whole thread Raw |
In response to | Re: slab allocator performance issues (John Naylor <john.naylor@enterprisedb.com>) |
Responses |
Re: slab allocator performance issues
|
List | pgsql-hackers |
.On Mon, 5 Dec 2022 at 23:18, John Naylor <john.naylor@enterprisedb.com> wrote: > > > On Mon, Dec 5, 2022 at 3:02 PM David Rowley <dgrowleyml@gmail.com> wrote: > > Going by [2], the instructions are very different with each method, so > > other machines with different latencies on those instructions might > > show something different. I attached what I used to test if anyone > > else wants a go. > > I get about 0.1% difference on my machine. Both ways boil down to (on gcc) 3 instructions with low latency. The laterones need the prior results to execute, which I think is what the XXX comment "isn't great" was referring to. The newcoding is more mysterious (does it do the right thing on all platforms?), so I guess the original is still the way togo unless we get a better idea. I don't think it would work well on a one's complement machine, but I don't think we support those going by the comments above RIGHTMOST_ONE in bitmapset.c. In anycase, I found that it wasn't any faster than what Andres wrote. In fact, even changing the code there to "index = (freecount > 0);" seems to do very little to increase performance. I do see that having 3 freelist items performs a decent amount better than having 9. However, which workload is run may alter the result of that, assuming that keeping new allocations on fuller blocks is a winning strategy for the CPU's caches. I've now done quite a bit more work on Andres' patch to try and get it into (hopefully) somewhere close to a committable shape. I'm fairly happy with what's there now. It does seem to perform much better than current master, especially so when the workload would have caused master to continually malloc and free an entire block when freeing and allocating a single chunk. I've done some basic benchmarking, mostly using the attached alloc_bench patch. If I run: select *, round(slab_result / aset_result * 100 - 100,1)::text || '%' as slab_slower_by from ( select chunk_size, keep_chunks, sum(pg_allocate_memory_test(chunk_size, chunk_size * keep_chunks, 1024*1024*1024, 'slab'))::numeric(1000,3) as slab_result, sum(pg_allocate_memory_test(chunk_size, chunk_size * keep_chunks, 1024*1024*1024, 'aset'))::numeric(1000,3) as aset_result from (values(1),(10),(50),(100),(200),(300),(400),(500),(1000),(2000),(3000),(4000),(5000),(10000)) v1(keep_chunks), (values(64),(128),(256),(512),(1024)) v2(chunk_size) group by rollup(1,2) ); The results for the 64-byte chunk are shown in the attached chart. It's not as fast as aset, but much faster than master's slab.c The first blue bar of the chart is well above the vertical axis. It took master 1118958 milliseconds for that test. The attached patch took 194 ms. The rest of the tests seem to put the patched code around somewhere in the middle between the unpatched code and aset.c's performance. The benchmark I did was entirely a FIFO workload that keeps around "keep_chunks" at once before starting to free the oldest chunks. I know Tomas has some LIFO and random benchmarks. I edited that code [1] a little to add support for other context types so that a comparison could be done more easily, however, I'm getting very weird performance results where sometimes it runs about twice as fast (Tomas mentioned he got this too). I'm not seeing that with my own benchmarking function, so I'm wondering if there's something weird going on with the benchmark itself rather than the slab.c code. I've likely made much more changes than I can list here, but here are a few of the more major ones: 1. Make use of dclist for empty blocks 2. In SlabAlloc() allocate chunks from the freelist before the unused list. 3. Added output for showing information about empty blocks in the SlabStats output. 4. Renamed the context freelists to blocklist. I found this was likely just confusing things with the block-level freelist. In any case, it seemed weird to have freelist[0] store full blocks. Not much free there! I renamed to blocklist[]. 5. I did a round of changing the order of the fields in SlabBlock. This seems to affect performance quite a bit. Having the context first seems to improve performance. Having the blocklist[] node last also helps. 6. Removed nblocks and headerSize from SlabContext. headerSize is no longer needed. nblocks was only really used for Asserts and SlabIsEmpty. I changed the Asserts to use a local count of blocks and changed SlabIsEmpty to look at the context's mem_allocated. 7. There's now no integer division in any of the alloc and free code. The only time we divide by fullChunkSize is in the check function. 8. When using a block from the emptyblock list, I changed the code to not re-init the block. It's now used as it was left previously. This means no longer having to build the freelist again. 9. Updated all comments to reflect the current state of the code. Some things I thought about but didn't do: a. Change the size of SlabContext's chunkSize, fullChunkSize and blockSize to be uint32 instead of Size. It might be possible to get SlabContext below 128 bytes with a bit more work. b. I could have done a bit more experimentation with unlikely() and likely() to move less frequently accessed code off into a cold area. For #2 above, I didn't really see much change in performance when I swapped the order of what we allocate from first. I expected free chunks would be better as they've been used and are seemingly more likely to be in some CPU cache than one of the unused chunks. I might need a different allocation pattern than the one I used to highlight that fact though. David [1] https://github.com/david-rowley/postgres/tree/alloc_bench_contrib
Attachment
pgsql-hackers by date: