slab allocator performance issues - Mailing list pgsql-hackers

From Andres Freund
Subject slab allocator performance issues
Date
Msg-id 20210717194333.mr5io3zup3kxahfm@alap3.anarazel.de
Responses Re: slab allocator performance issues  (Andres Freund <andres@anarazel.de>)
Re: slab allocator performance issues  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List pgsql-hackers
Hi,

I just tried to use the slab allocator for a case where aset.c was
bloating memory usage substantially. First: It worked wonders for memory
usage, nearly eliminating overhead.

But it turned out to cause a *substantial* slowdown. With aset the
allocator is barely in the profile. With slab the profile is dominated
by allocator performance.

slab:
NOTICE:  00000: 100000000 ordered insertions in 5.216287 seconds, 19170724/sec
LOCATION:  bfm_test_insert_bulk, radix.c:2880
  Overhead  Command   Shared Object     Symbol

+   28.27%  postgres  postgres          [.] SlabAlloc
+    9.64%  postgres  bdbench.so        [.] bfm_delete
+    9.03%  postgres  bdbench.so        [.] bfm_set
+    8.39%  postgres  bdbench.so        [.] bfm_lookup
+    8.36%  postgres  bdbench.so        [.] bfm_set_leaf.constprop.0
+    8.16%  postgres  libc-2.31.so      [.] __memmove_avx_unaligned_erms
+    6.88%  postgres  bdbench.so        [.] bfm_delete_leaf
+    3.24%  postgres  libc-2.31.so      [.] _int_malloc
+    2.58%  postgres  bdbench.so        [.] bfm_tests
+    2.33%  postgres  postgres          [.] SlabFree
+    1.29%  postgres  libc-2.31.so      [.] _int_free
+    1.09%  postgres  libc-2.31.so      [.] unlink_chunk.constprop.0

aset:

NOTICE:  00000: 100000000 ordered insertions in 2.082602 seconds, 48016848/sec
LOCATION:  bfm_test_insert_bulk, radix.c:2880

+   16.43%  postgres  bdbench.so        [.] bfm_lookup
+   15.38%  postgres  bdbench.so        [.] bfm_delete
+   12.82%  postgres  libc-2.31.so      [.] __memmove_avx_unaligned_erms
+   12.65%  postgres  bdbench.so        [.] bfm_set
+   12.15%  postgres  bdbench.so        [.] bfm_set_leaf.constprop.0
+   10.57%  postgres  bdbench.so        [.] bfm_delete_leaf
+    4.05%  postgres  bdbench.so        [.] bfm_tests
+    2.93%  postgres  [kernel.vmlinux]  [k] clear_page_erms
+    1.59%  postgres  postgres          [.] AllocSetAlloc
+    1.15%  postgres  bdbench.so        [.] memmove@plt
+    1.06%  postgres  bdbench.so        [.] bfm_grow_leaf_16

OS:
NOTICE:  00000: 100000000 ordered insertions in 2.089790 seconds, 47851690/sec
LOCATION:  bfm_test_insert_bulk, radix.c:2880


That is somewhat surprising - part of the promise of a slab allocator is
that it's fast...


This is caused by multiple issues, I think, some of which seem fairly easy to
fix.

1) If allocations are short-lived, slab.c can end up constantly
freeing/initializing blocks. Initializing a block is fairly expensive, as it
requires iterating over all potential chunks in the block. Just to then free
that memory again after a small number of allocations. The extreme case of
this is alternating phases of allocating and freeing a single allocation.

I "fixed" this by adding a few && slab->nblocks > 1 checks in SlabFree(), and
the problem is vastly reduced. Instead of running at 0.4x of aset's speed it's
now at 0.88x. Of course that only helps while the workload fits in a single
block, so it's not a great approach. Perhaps just keeping the last allocated
block around would work?


2) SlabChunkIndex() in SlabFree() is slow. It requires a 64bit division,
taking up ~50% of the cycles in SlabFree(). A 64bit div, according to [1], has
a latency of 35-88 cycles on Skylake-X (and a reciprocal throughput of 21-83,
i.e. no parallelism). While it's getting a bit faster on Icelake / Zen 3, it's
still slow enough there to be very worrisome.

I don't see a way to get around the division while keeping the freelist
structure as is. But:

ISTM that we only need the index because of the free-chunk list, right? Why
don't we make the chunk list use actual pointers? Is the concern that that'd
increase the minimum allocation size?  If so, I see two ways around that:
First, we could make the index just the offset from the start of the block,
that's much cheaper to calculate. Second, we could store the next pointer in
SlabChunk->slab/block instead (making it a union) - while on the freelist we
don't need to dereference those, right?

I suspect both would also make the block initialization a bit cheaper.

That should also accelerate SlabBlockGetChunk(), which currently shows up as
an imul, which isn't exactly fast either (and uses a lot of execution ports).


3) Every free/alloc needing to unlink from slab->freelist[i] and then relink
shows up prominently in profiles. That's ~tripling the number of cachelines
touched in the happy path, with unpredictable accesses to boot.

Perhaps we could reduce the precision of slab->freelist indexing to amortize
that cost? I.e. not have one slab->freelist entry for each nfree, but instead
have an upper limit on the number of freelists?


4) Less of a performance issue, and more of a usability issue: The constant
block size strikes me as problematic. Most users of an allocator sometimes
deal with a small amount of data, and sometimes with a large amount.

Greetings,

Andres Freund

[1] https://www.agner.org/optimize/instruction_tables.pdf


