Thread: slab allocator performance issues
Hi,

I just tried to use the slab allocator for a case where aset.c was bloating memory usage substantially. First: It worked wonders for memory usage, nearly eliminating overhead.

But it turned out to cause a *substantial* slowdown. With aset the allocator is barely in the profile. With slab the profile is dominated by allocator performance.

slab:

NOTICE:  00000: 100000000 ordered insertions in 5.216287 seconds, 19170724/sec
LOCATION:  bfm_test_insert_bulk, radix.c:2880

  Overhead  Command   Shared Object     Symbol
+   28.27%  postgres  postgres          [.] SlabAlloc
+    9.64%  postgres  bdbench.so        [.] bfm_delete
+    9.03%  postgres  bdbench.so        [.] bfm_set
+    8.39%  postgres  bdbench.so        [.] bfm_lookup
+    8.36%  postgres  bdbench.so        [.] bfm_set_leaf.constprop.0
+    8.16%  postgres  libc-2.31.so      [.] __memmove_avx_unaligned_erms
+    6.88%  postgres  bdbench.so        [.] bfm_delete_leaf
+    3.24%  postgres  libc-2.31.so      [.] _int_malloc
+    2.58%  postgres  bdbench.so        [.] bfm_tests
+    2.33%  postgres  postgres          [.] SlabFree
+    1.29%  postgres  libc-2.31.so      [.] _int_free
+    1.09%  postgres  libc-2.31.so      [.] unlink_chunk.constprop.0

aset:

NOTICE:  00000: 100000000 ordered insertions in 2.082602 seconds, 48016848/sec
LOCATION:  bfm_test_insert_bulk, radix.c:2880

+   16.43%  postgres  bdbench.so        [.] bfm_lookup
+   15.38%  postgres  bdbench.so        [.] bfm_delete
+   12.82%  postgres  libc-2.31.so      [.] __memmove_avx_unaligned_erms
+   12.65%  postgres  bdbench.so        [.] bfm_set
+   12.15%  postgres  bdbench.so        [.] bfm_set_leaf.constprop.0
+   10.57%  postgres  bdbench.so        [.] bfm_delete_leaf
+    4.05%  postgres  bdbench.so        [.] bfm_tests
+    2.93%  postgres  [kernel.vmlinux]  [k] clear_page_erms
+    1.59%  postgres  postgres          [.] AllocSetAlloc
+    1.15%  postgres  bdbench.so        [.] memmove@plt
+    1.06%  postgres  bdbench.so        [.] bfm_grow_leaf_16

OS:

NOTICE:  00000: 100000000 ordered insertions in 2.089790 seconds, 47851690/sec
LOCATION:  bfm_test_insert_bulk, radix.c:2880

That is somewhat surprising - part of the promise of a slab allocator is that it's fast...

This is caused by multiple issues, I think, some of which seem fairly easy to fix.

1) If allocations are short-lived, slab.c can end up constantly freeing/initializing blocks, which requires fairly expensively iterating over all potential chunks in the block and initializing them - just to then free that memory again after a small number of allocations. The extreme case of this is when there are phases of alloc/free of a single allocation.

I "fixed" this by adding a few && slab->nblocks > 1 in SlabFree(), and the problem was vastly reduced. Instead of a 0.4x slowdown it's 0.88x. Of course that only works if the problem is with the only block, so it's not a great approach. Perhaps just keeping the last allocated block around would work?

2) SlabChunkIndex() in SlabFree() is slow. It requires a 64-bit division, taking up ~50% of the cycles in SlabFree(). A 64-bit div, according to [1], has a latency of 35-88 cycles on skylake-x (and a reciprocal throughput of 21-83, i.e. no parallelism). While it's getting a bit faster on icelake / zen 3, it's still slow enough there to be very worrisome.

I don't see a way to get around the division while keeping the freelist structure as is. But: ISTM that we only need the index because of the free-chunk list, right? Why don't we make the chunk list use actual pointers? Is the concern that that'd increase the minimum allocation size? If so, I see two ways around that: First, we could make the index just the offset from the start of the block; that's much cheaper to calculate. Second, we could store the next pointer in SlabChunk->slab/block instead (making it a union) - while on the freelist we don't need to dereference those, right?

I suspect both would also make the block initialization a bit cheaper.

That should also accelerate SlabBlockGetChunk(), which currently shows up as an imul, which isn't exactly fast either (and uses a lot of execution ports).

3) Every free/alloc needing to unlink from slab->freelist[i] and then relink shows up prominently in profiles. That's ~tripling the number of cachelines touched in the happy path, with unpredictable accesses to boot.

Perhaps we could reduce the precision of slab->freelist indexing to amortize that cost? I.e. not have one slab->freelist entry for each nfree, but instead have an upper limit on the number of freelists?

4) Less of a performance, and more of a usability issue: The constant block size strikes me as problematic. Most users of an allocator sometimes deal with a small amount of data, and sometimes with a large amount.

Greetings,

Andres Freund

[1] https://www.agner.org/optimize/instruction_tables.pdf
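To make the second idea concrete, here is a minimal sketch of storing the freelist link in the chunk header union. All names here (SketchChunk, SketchBlock, sketch_free) are hypothetical, not the actual slab.c definitions:

    #include <stddef.h>

    struct SketchBlock;

    typedef struct SketchChunk
    {
        union
        {
            struct SketchBlock *block;  /* owning block, valid while allocated */
            struct SketchChunk *next;   /* freelist link, valid while free */
        }           hdr;
        /* chunk payload follows the header */
    } SketchChunk;

    typedef struct SketchBlock
    {
        SketchChunk *freehead;          /* head of this block's free chunks */
        int          nfree;             /* number of free chunks */
    } SketchBlock;

    /* Freeing becomes two pointer stores - no 64-bit division required. */
    static inline void
    sketch_free(SketchBlock *block, SketchChunk *chunk)
    {
        chunk->hdr.next = block->freehead;
        block->freehead = chunk;
        block->nfree++;
    }

While a chunk sits on the freelist its block pointer is dead anyway, so the union costs nothing beyond the already pointer-sized header.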
Hi,

On 2021-07-17 12:43:33 -0700, Andres Freund wrote:
> 2) SlabChunkIndex() in SlabFree() is slow. It requires a 64-bit division,
> taking up ~50% of the cycles in SlabFree(). A 64-bit div, according to
> [1], has a latency of 35-88 cycles on skylake-x (and a reciprocal
> throughput of 21-83, i.e. no parallelism). While it's getting a bit
> faster on icelake / zen 3, it's still slow enough there to be very
> worrisome.
>
> I don't see a way to get around the division while keeping the freelist
> structure as is. But: ISTM that we only need the index because of the
> free-chunk list, right? Why don't we make the chunk list use actual
> pointers? Is the concern that that'd increase the minimum allocation
> size? If so, I see two ways around that: First, we could make the index
> just the offset from the start of the block; that's much cheaper to
> calculate. Second, we could store the next pointer in
> SlabChunk->slab/block instead (making it a union) - while on the
> freelist we don't need to dereference those, right?
>
> I suspect both would also make the block initialization a bit cheaper.
>
> That should also accelerate SlabBlockGetChunk(), which currently shows
> up as an imul, which isn't exactly fast either (and uses a lot of
> execution ports).

Oh - I just saw that effectively the allocation size already is a uintptr_t at minimum. I had only seen

    /* Make sure the linked list node fits inside a freed chunk */
    if (chunkSize < sizeof(int))
        chunkSize = sizeof(int);

but it's followed by

    /* chunk, including SLAB header (both addresses nicely aligned) */
    fullChunkSize = sizeof(SlabChunk) + MAXALIGN(chunkSize);

which means we are reserving enough space for a pointer on just about any platform already. Seems we can just make that official and reserve space for a pointer as part of rounding up the chunk size, instead of via fullChunkSize?

Greetings,

Andres Freund
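A hedged sketch of making that official, mirroring the two snippets quoted above (MAXALIGN as in PostgreSQL's c.h; the sizeof(void *) bound and adjusted comment are the assumed change, not committed code):

    /* Make sure a freelist pointer fits inside a freed chunk */
    if (chunkSize < sizeof(void *))
        chunkSize = sizeof(void *);

    /* chunk, including SLAB header (both addresses nicely aligned) */
    fullChunkSize = sizeof(SlabChunk) + MAXALIGN(chunkSize);

Since MAXALIGN already rounds up to at least pointer alignment on common platforms, this typically doesn't grow fullChunkSize at all - it just documents the reservation.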
Hi,

On 7/17/21 9:43 PM, Andres Freund wrote:
> Hi,
>
> I just tried to use the slab allocator for a case where aset.c was
> bloating memory usage substantially. First: It worked wonders for memory
> usage, nearly eliminating overhead.
>
> But it turned out to cause a *substantial* slowdown. With aset the
> allocator is barely in the profile. With slab the profile is dominated
> by allocator performance.
>
> slab:
>
> NOTICE:  00000: 100000000 ordered insertions in 5.216287 seconds, 19170724/sec
> LOCATION:  bfm_test_insert_bulk, radix.c:2880
>
>   Overhead  Command   Shared Object     Symbol
> +   28.27%  postgres  postgres          [.] SlabAlloc
> +    9.64%  postgres  bdbench.so        [.] bfm_delete
> +    9.03%  postgres  bdbench.so        [.] bfm_set
> +    8.39%  postgres  bdbench.so        [.] bfm_lookup
> +    8.36%  postgres  bdbench.so        [.] bfm_set_leaf.constprop.0
> +    8.16%  postgres  libc-2.31.so      [.] __memmove_avx_unaligned_erms
> +    6.88%  postgres  bdbench.so        [.] bfm_delete_leaf
> +    3.24%  postgres  libc-2.31.so      [.] _int_malloc
> +    2.58%  postgres  bdbench.so        [.] bfm_tests
> +    2.33%  postgres  postgres          [.] SlabFree
> +    1.29%  postgres  libc-2.31.so      [.] _int_free
> +    1.09%  postgres  libc-2.31.so      [.] unlink_chunk.constprop.0
>
> aset:
>
> NOTICE:  00000: 100000000 ordered insertions in 2.082602 seconds, 48016848/sec
> LOCATION:  bfm_test_insert_bulk, radix.c:2880
>
> +   16.43%  postgres  bdbench.so        [.] bfm_lookup
> +   15.38%  postgres  bdbench.so        [.] bfm_delete
> +   12.82%  postgres  libc-2.31.so      [.] __memmove_avx_unaligned_erms
> +   12.65%  postgres  bdbench.so        [.] bfm_set
> +   12.15%  postgres  bdbench.so        [.] bfm_set_leaf.constprop.0
> +   10.57%  postgres  bdbench.so        [.] bfm_delete_leaf
> +    4.05%  postgres  bdbench.so        [.] bfm_tests
> +    2.93%  postgres  [kernel.vmlinux]  [k] clear_page_erms
> +    1.59%  postgres  postgres          [.] AllocSetAlloc
> +    1.15%  postgres  bdbench.so        [.] memmove@plt
> +    1.06%  postgres  bdbench.so        [.] bfm_grow_leaf_16
>
> OS:
>
> NOTICE:  00000: 100000000 ordered insertions in 2.089790 seconds, 47851690/sec
> LOCATION:  bfm_test_insert_bulk, radix.c:2880
>
> That is somewhat surprising - part of the promise of a slab allocator is
> that it's fast...
>
> This is caused by multiple issues, I think, some of which seem fairly
> easy to fix.
>
> 1) If allocations are short-lived, slab.c can end up constantly
> freeing/initializing blocks, which requires fairly expensively iterating
> over all potential chunks in the block and initializing them - just to
> then free that memory again after a small number of allocations. The
> extreme case of this is when there are phases of alloc/free of a single
> allocation.
>
> I "fixed" this by adding a few && slab->nblocks > 1 in SlabFree(), and
> the problem was vastly reduced. Instead of a 0.4x slowdown it's 0.88x.
> Of course that only works if the problem is with the only block, so it's
> not a great approach. Perhaps just keeping the last allocated block
> around would work?

+1

I think it makes perfect sense to not free the blocks immediately, and keep one (or a small number) as a cache. I'm not sure why we decided not to have a "keeper" block, but I suspect memory consumption was my main concern at that point. But I never expected the cost to be this high.

> 2) SlabChunkIndex() in SlabFree() is slow. It requires a 64-bit
> division, taking up ~50% of the cycles in SlabFree(). A 64-bit div,
> according to [1], has a latency of 35-88 cycles on skylake-x (and a
> reciprocal throughput of 21-83, i.e. no parallelism). While it's getting
> a bit faster on icelake / zen 3, it's still slow enough there to be very
> worrisome.
>
> I don't see a way to get around the division while keeping the freelist
> structure as is. But: ISTM that we only need the index because of the
> free-chunk list, right? Why don't we make the chunk list use actual
> pointers? Is the concern that that'd increase the minimum allocation
> size? If so, I see two ways around that: First, we could make the index
> just the offset from the start of the block; that's much cheaper to
> calculate. Second, we could store the next pointer in
> SlabChunk->slab/block instead (making it a union) - while on the
> freelist we don't need to dereference those, right?
>
> I suspect both would also make the block initialization a bit cheaper.
>
> That should also accelerate SlabBlockGetChunk(), which currently shows
> up as an imul, which isn't exactly fast either (and uses a lot of
> execution ports).

Hmm, I think you're right we could simply use the pointers, but I have not tried that.

> 3) Every free/alloc needing to unlink from slab->freelist[i] and then
> relink shows up prominently in profiles. That's ~tripling the number of
> cachelines touched in the happy path, with unpredictable accesses to
> boot.
>
> Perhaps we could reduce the precision of slab->freelist indexing to
> amortize that cost? I.e. not have one slab->freelist entry for each
> nfree, but instead have an upper limit on the number of freelists?

Yeah. The purpose of organizing the freelists like this is to prioritize the "more full" blocks when allocating new chunks, in the hope that the "less full" blocks will end up empty and freed faster.

But this is naturally imprecise, and strongly depends on the workload, of course, and I bet for most cases a less precise approach would work just as well.

I'm not sure how exactly the upper limit you propose would work, but perhaps we could group the blocks by nfree ranges, say [0-15], [16-31] and so on. So after the alloc/free we'd calculate the new freelist index as (nfree/16) and only move the block if the index changed. This would reduce the overhead to 1/16, and probably even more in practice.

Of course, we could also say we have e.g. 8 freelists and work the ranges backwards from that - I guess that's what you propose.

> 4) Less of a performance, and more of a usability issue: The constant
> block size strikes me as problematic. Most users of an allocator
> sometimes deal with a small amount of data, and sometimes with a large
> amount.

I doubt this is worth the effort, really. The constant block size makes various places much simpler (both to code and reason about), so this should not make a huge difference in performance. And IMHO the block size is mostly an implementation detail, so I don't see that as a usability issue.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
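A rough sketch of that [0-15], [16-31], ... grouping (types and names hypothetical; the constant-16 division compiles down to a shift, so no expensive div is reintroduced):

    #define SLAB_FREELIST_GROUP 16     /* nfree values per bucket */

    typedef struct SketchBlock
    {
        int         nfree;             /* free chunks in this block */
    } SketchBlock;

    /* Map a block's free-chunk count to a coarse freelist bucket. */
    static inline int
    freelist_index(const SketchBlock *block)
    {
        return block->nfree / SLAB_FREELIST_GROUP;
    }

After an alloc/free, the block would only be relinked when freelist_index() differs from the value computed before the change, so roughly 15 out of 16 operations skip the list manipulation entirely.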
Hi,

On 2021-07-17 22:35:07 +0200, Tomas Vondra wrote:
> On 7/17/21 9:43 PM, Andres Freund wrote:
> > 1) If allocations are short-lived, slab.c can end up constantly
> > freeing/initializing blocks, which requires fairly expensively
> > iterating over all potential chunks in the block and initializing
> > them - just to then free that memory again after a small number of
> > allocations. The extreme case of this is when there are phases of
> > alloc/free of a single allocation.
> >
> > I "fixed" this by adding a few && slab->nblocks > 1 in SlabFree(), and
> > the problem was vastly reduced. Instead of a 0.4x slowdown it's 0.88x.
> > Of course that only works if the problem is with the only block, so
> > it's not a great approach. Perhaps just keeping the last allocated
> > block around would work?
>
> +1
>
> I think it makes perfect sense to not free the blocks immediately, and
> keep one (or a small number) as a cache. I'm not sure why we decided not
> to have a "keeper" block, but I suspect memory consumption was my main
> concern at that point. But I never expected the cost to be this high.

I think one free block might be too low in some cases. It's pretty common to have workloads where the number of allocations is "bursty", and it's imo one case where one might justifiably want to use a slab allocator... Perhaps a portion of a high watermark? Or a portion of the in-use blocks?

Hm. I wonder if we should just not populate the freelist eagerly, to drive down the initialization cost. I.e. have a separate allocation path for chunks that have never been allocated, by having a SlabBlock->free_offset or such.

Sure, it adds a branch to the allocation happy path, but it also makes the first allocation for a chunk cheaper, because there's no need to get the next element from the freelist (adding a likely cache miss). And it should make the allocation of a new block faster by a lot.

> > 2) SlabChunkIndex() in SlabFree() is slow. It requires a 64-bit
> > division, taking up ~50% of the cycles in SlabFree(). A 64-bit div,
> > according to [1], has a latency of 35-88 cycles on skylake-x (and a
> > reciprocal throughput of 21-83, i.e. no parallelism). While it's
> > getting a bit faster on icelake / zen 3, it's still slow enough there
> > to be very worrisome.
> >
> > I don't see a way to get around the division while keeping the
> > freelist structure as is. But: ISTM that we only need the index
> > because of the free-chunk list, right? Why don't we make the chunk
> > list use actual pointers? Is the concern that that'd increase the
> > minimum allocation size? If so, I see two ways around that: First, we
> > could make the index just the offset from the start of the block;
> > that's much cheaper to calculate. Second, we could store the next
> > pointer in SlabChunk->slab/block instead (making it a union) - while
> > on the freelist we don't need to dereference those, right?
> >
> > I suspect both would also make the block initialization a bit cheaper.
> >
> > That should also accelerate SlabBlockGetChunk(), which currently shows
> > up as an imul, which isn't exactly fast either (and uses a lot of
> > execution ports).
>
> Hmm, I think you're right we could simply use the pointers, but I have
> not tried that.

I quickly tried that, and it does seem to improve matters considerably. The block initialization still shows up as expensive, but not as bad. The div and imul are gone (except in an assertion build right now). The list manipulation still is visible.

> > 3) Every free/alloc needing to unlink from slab->freelist[i] and then
> > relink shows up prominently in profiles. That's ~tripling the number
> > of cachelines touched in the happy path, with unpredictable accesses
> > to boot.
> >
> > Perhaps we could reduce the precision of slab->freelist indexing to
> > amortize that cost? I.e. not have one slab->freelist entry for each
> > nfree, but instead have an upper limit on the number of freelists?
>
> Yeah. The purpose of organizing the freelists like this is to prioritize
> the "more full" blocks when allocating new chunks, in the hope that the
> "less full" blocks will end up empty and freed faster.
>
> But this is naturally imprecise, and strongly depends on the workload,
> of course, and I bet for most cases a less precise approach would work
> just as well.
>
> I'm not sure how exactly the upper limit you propose would work, but
> perhaps we could group the blocks by nfree ranges, say [0-15], [16-31]
> and so on. So after the alloc/free we'd calculate the new freelist index
> as (nfree/16) and only move the block if the index changed. This would
> reduce the overhead to 1/16, and probably even more in practice.
>
> Of course, we could also say we have e.g. 8 freelists and work the
> ranges backwards from that - I guess that's what you propose.

Yea, I was thinking something along those lines. As you say, how about there's always at most 8 freelists or such. During initialization we compute a shift that distributes chunksPerBlock from 0 to 8. Then we only need to perform list manipulation if block->nfree >> slab->freelist_shift != --block->nfree >> slab->freelist_shift.

That seems nice from a memory usage POV as well - for small and frequent allocations, needing chunksPerBlock freelists isn't great. The freelists are on a fair number of cachelines right now.

> > 4) Less of a performance, and more of a usability issue: The constant
> > block size strikes me as problematic. Most users of an allocator
> > sometimes deal with a small amount of data, and sometimes with a large
> > amount.
>
> I doubt this is worth the effort, really. The constant block size makes
> various places much simpler (both to code and reason about), so this
> should not make a huge difference in performance. And IMHO the block
> size is mostly an implementation detail, so I don't see that as a
> usability issue.

Hm? It's something the user has to specify, so it's not really an implementation detail. It needs to be specified without sufficient information, as well, since externally one doesn't know how much memory the block header and chunk headers + rounding up will use, so computing a good block size isn't easy. I've wondered whether it should just be a count...

Why do you not think it's relevant for performance? Either one uses a too-large block size and wastes memory, or one ends up losing perf through frequent allocations?

Greetings,

Andres Freund
On 7/17/21 11:14 PM, Andres Freund wrote:
> Hi,
>
> On 2021-07-17 22:35:07 +0200, Tomas Vondra wrote:
>> On 7/17/21 9:43 PM, Andres Freund wrote:
>>> 1) If allocations are short-lived, slab.c can end up constantly
>>> freeing/initializing blocks, which requires fairly expensively
>>> iterating over all potential chunks in the block and initializing
>>> them - just to then free that memory again after a small number of
>>> allocations. The extreme case of this is when there are phases of
>>> alloc/free of a single allocation.
>>>
>>> I "fixed" this by adding a few && slab->nblocks > 1 in SlabFree(), and
>>> the problem was vastly reduced. Instead of a 0.4x slowdown it's 0.88x.
>>> Of course that only works if the problem is with the only block, so
>>> it's not a great approach. Perhaps just keeping the last allocated
>>> block around would work?
>>
>> +1
>>
>> I think it makes perfect sense to not free the blocks immediately, and
>> keep one (or a small number) as a cache. I'm not sure why we decided
>> not to have a "keeper" block, but I suspect memory consumption was my
>> main concern at that point. But I never expected the cost to be this
>> high.
>
> I think one free block might be too low in some cases. It's pretty
> common to have workloads where the number of allocations is "bursty",
> and it's imo one case where one might justifiably want to use a slab
> allocator... Perhaps a portion of a high watermark? Or a portion of the
> in-use blocks?

I think a portion of the watermark would be problematic for cases with one huge transaction - that'll set a high watermark, and we'll keep way too many free blocks. But the portion of in-use blocks might work, I think.

> Hm. I wonder if we should just not populate the freelist eagerly, to
> drive down the initialization cost. I.e. have a separate allocation path
> for chunks that have never been allocated, by having a
> SlabBlock->free_offset or such.
>
> Sure, it adds a branch to the allocation happy path, but it also makes
> the first allocation for a chunk cheaper, because there's no need to get
> the next element from the freelist (adding a likely cache miss). And it
> should make the allocation of a new block faster by a lot.

Not sure what you mean by 'not populate eagerly' so can't comment :-(

>>> 2) SlabChunkIndex() in SlabFree() is slow. It requires a 64-bit
>>> division, taking up ~50% of the cycles in SlabFree(). A 64-bit div,
>>> according to [1], has a latency of 35-88 cycles on skylake-x (and a
>>> reciprocal throughput of 21-83, i.e. no parallelism). While it's
>>> getting a bit faster on icelake / zen 3, it's still slow enough there
>>> to be very worrisome.
>>>
>>> I don't see a way to get around the division while keeping the
>>> freelist structure as is. But: ISTM that we only need the index
>>> because of the free-chunk list, right? Why don't we make the chunk
>>> list use actual pointers? Is the concern that that'd increase the
>>> minimum allocation size? If so, I see two ways around that: First, we
>>> could make the index just the offset from the start of the block;
>>> that's much cheaper to calculate. Second, we could store the next
>>> pointer in SlabChunk->slab/block instead (making it a union) - while
>>> on the freelist we don't need to dereference those, right?
>>>
>>> I suspect both would also make the block initialization a bit cheaper.
>>>
>>> That should also accelerate SlabBlockGetChunk(), which currently shows
>>> up as an imul, which isn't exactly fast either (and uses a lot of
>>> execution ports).
>>
>> Hmm, I think you're right we could simply use the pointers, but I have
>> not tried that.
>
> I quickly tried that, and it does seem to improve matters considerably.
> The block initialization still shows up as expensive, but not as bad.
> The div and imul are gone (except in an assertion build right now). The
> list manipulation still is visible.

Understood. I didn't expect this to be a full solution.

>>> 3) Every free/alloc needing to unlink from slab->freelist[i] and then
>>> relink shows up prominently in profiles. That's ~tripling the number
>>> of cachelines touched in the happy path, with unpredictable accesses
>>> to boot.
>>>
>>> Perhaps we could reduce the precision of slab->freelist indexing to
>>> amortize that cost? I.e. not have one slab->freelist entry for each
>>> nfree, but instead have an upper limit on the number of freelists?
>>
>> Yeah. The purpose of organizing the freelists like this is to
>> prioritize the "more full" blocks when allocating new chunks, in the
>> hope that the "less full" blocks will end up empty and freed faster.
>>
>> But this is naturally imprecise, and strongly depends on the workload,
>> of course, and I bet for most cases a less precise approach would work
>> just as well.
>>
>> I'm not sure how exactly the upper limit you propose would work, but
>> perhaps we could group the blocks by nfree ranges, say [0-15], [16-31]
>> and so on. So after the alloc/free we'd calculate the new freelist
>> index as (nfree/16) and only move the block if the index changed. This
>> would reduce the overhead to 1/16, and probably even more in practice.
>>
>> Of course, we could also say we have e.g. 8 freelists and work the
>> ranges backwards from that - I guess that's what you propose.
>
> Yea, I was thinking something along those lines. As you say, how about
> there's always at most 8 freelists or such. During initialization we
> compute a shift that distributes chunksPerBlock from 0 to 8. Then we
> only need to perform list manipulation if
> block->nfree >> slab->freelist_shift != --block->nfree >> slab->freelist_shift.
>
> That seems nice from a memory usage POV as well - for small and frequent
> allocations, needing chunksPerBlock freelists isn't great. The freelists
> are on a fair number of cachelines right now.

Agreed.

>>> 4) Less of a performance, and more of a usability issue: The constant
>>> block size strikes me as problematic. Most users of an allocator
>>> sometimes deal with a small amount of data, and sometimes with a large
>>> amount.
>>
>> I doubt this is worth the effort, really. The constant block size makes
>> various places much simpler (both to code and reason about), so this
>> should not make a huge difference in performance. And IMHO the block
>> size is mostly an implementation detail, so I don't see that as a
>> usability issue.
>
> Hm? It's something the user has to specify, so it's not really an
> implementation detail. It needs to be specified without sufficient
> information, as well, since externally one doesn't know how much memory
> the block header and chunk headers + rounding up will use, so computing
> a good block size isn't easy. I've wondered whether it should just be a
> count...

I think this is mixing two problems - how to specify the block size, and whether the block size is constant (as in slab) or grows over time (as in allocset).

As for specifying the block size, I agree that maybe setting a chunk count and deriving the bytes from that might be easier to use, for the reasons you mentioned.

But growing the block size seems problematic for long-lived contexts with workloads that change a lot over time - imagine e.g. decoding many small transactions, with one huge transaction mixed in. The one huge transaction will grow the block size, and we'll keep using it forever. But in that case we might have just as well allocated the large blocks from the start, I guess.

> Why do you not think it's relevant for performance? Either one uses a
> too-large block size and wastes memory, or one ends up losing perf
> through frequent allocations?

True. I simply would not expect this to make a huge difference - I may be wrong, and I'm sure there are workloads where it matters. But I still think it's easier to just use larger blocks than to make the slab code more complex.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 2021-07-18 00:46:09 +0200, Tomas Vondra wrote:
> On 7/17/21 11:14 PM, Andres Freund wrote:
> > Hm. I wonder if we should just not populate the freelist eagerly, to
> > drive down the initialization cost. I.e. have a separate allocation
> > path for chunks that have never been allocated, by having a
> > SlabBlock->free_offset or such.
> >
> > Sure, it adds a branch to the allocation happy path, but it also makes
> > the first allocation for a chunk cheaper, because there's no need to
> > get the next element from the freelist (adding a likely cache miss).
> > And it should make the allocation of a new block faster by a lot.
>
> Not sure what you mean by 'not populate eagerly' so can't comment :-(

Instead of populating a linked list with all chunks upon creation of a block - which requires touching a fair bit of memory - keep a per-block pointer (or an offset) into the "unused" area of the block. When allocating from the block and there's still "unused" memory left, use that, instead of bothering with the freelist.

I tried that, and it nearly got slab up to the allocation/freeing performance of aset.c (while winning after allocation, due to the higher memory density).

> > > > 4) Less of a performance, and more of a usability issue: The
> > > > constant block size strikes me as problematic. Most users of an
> > > > allocator sometimes deal with a small amount of data, and
> > > > sometimes with a large amount.
> > >
> > > I doubt this is worth the effort, really. The constant block size
> > > makes various places much simpler (both to code and reason about),
> > > so this should not make a huge difference in performance. And IMHO
> > > the block size is mostly an implementation detail, so I don't see
> > > that as a usability issue.
> >
> > Hm? It's something the user has to specify, so it's not really an
> > implementation detail. It needs to be specified without sufficient
> > information, as well, since externally one doesn't know how much
> > memory the block header and chunk headers + rounding up will use, so
> > computing a good block size isn't easy. I've wondered whether it
> > should just be a count...
>
> I think this is mixing two problems - how to specify the block size, and
> whether the block size is constant (as in slab) or grows over time (as
> in allocset).

That was in response to the "implementation detail" bit solely.

> But growing the block size seems problematic for long-lived contexts
> with workloads that change a lot over time - imagine e.g. decoding many
> small transactions, with one huge transaction mixed in. The one huge
> transaction will grow the block size, and we'll keep using it forever.
> But in that case we might have just as well allocated the large blocks
> from the start, I guess.

I was thinking of capping the growth fairly low. I don't think after a 16x growth or so you're likely to still see allocation performance gains with slab. And I don't think that'd be too bad for decoding - we'd start with a small initial block size, and in many workloads that will be enough, and just the workloads where that doesn't suffice will adapt performance-wise.

And: Medium term I wouldn't expect reorderbuffer.c to stay the only slab.c user...

> > Why do you not think it's relevant for performance? Either one uses a
> > too-large block size and wastes memory, or one ends up losing perf
> > through frequent allocations?
>
> True. I simply would not expect this to make a huge difference - I may
> be wrong, and I'm sure there are workloads where it matters. But I
> still think it's easier to just use larger blocks than to make the slab
> code more complex.

IDK. I'm looking at using slab as part of a radix tree implementation right now, which I'd hope to be used in various different situations. So it's hard to choose the right block size - and it does seem to matter for performance.

Greetings,

Andres Freund
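A minimal sketch of that scheme under assumed names (the real patch may differ in its details):

    #include <stddef.h>

    typedef struct SketchChunk
    {
        struct SketchChunk *next;      /* freelist link, valid while free */
    } SketchChunk;

    typedef struct SketchBlock
    {
        SketchChunk *freehead;         /* chunks that were handed out and freed */
        size_t       unused_off;       /* start of the never-yet-used tail */
        size_t       block_size;       /* total block size, header included */
    } SketchBlock;

    static void *
    sketch_alloc(SketchBlock *block, size_t full_chunk_size)
    {
        /* First recycle explicitly freed chunks, if any. */
        if (block->freehead != NULL)
        {
            SketchChunk *chunk = block->freehead;

            block->freehead = chunk->next;
            return chunk;
        }

        /* Otherwise carve a chunk off the block's untouched tail. */
        if (block->unused_off + full_chunk_size <= block->block_size)
        {
            void       *chunk = (char *) block + block->unused_off;

            block->unused_off += full_chunk_size;
            return chunk;
        }

        return NULL;                   /* block is full */
    }

Block initialization then only has to set freehead and unused_off, instead of threading a linked list through every chunk in the block.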
Hi,

On 2021-07-17 16:10:19 -0700, Andres Freund wrote:
> Instead of populating a linked list with all chunks upon creation of a
> block - which requires touching a fair bit of memory - keep a per-block
> pointer (or an offset) into the "unused" area of the block. When
> allocating from the block and there's still "unused" memory left, use
> that, instead of bothering with the freelist.
>
> I tried that, and it nearly got slab up to the allocation/freeing
> performance of aset.c (while winning after allocation, due to the higher
> memory density).

Combining that with limiting the number of freelists, and some microoptimizations, allocation performance is now on par.

Freeing still seems to be a tad slower, mostly because SlabFree() practically is immediately stalled on fetching the block, whereas AllocSetFree() can happily speculate ahead and do work like computing the freelist index. And then aset only needs to access memory inside the context - which is much more likely to be in cache than a freelist inside a block (there are many more of those).

But that's ok, I think. It's close, and it's only a small share of the overall runtime of my workload...

Greetings,

Andres Freund
On 7/18/21 3:06 AM, Andres Freund wrote:
> Hi,
>
> On 2021-07-17 16:10:19 -0700, Andres Freund wrote:
>> Instead of populating a linked list with all chunks upon creation of a
>> block - which requires touching a fair bit of memory - keep a per-block
>> pointer (or an offset) into the "unused" area of the block. When
>> allocating from the block and there's still "unused" memory left, use
>> that, instead of bothering with the freelist.
>>
>> I tried that, and it nearly got slab up to the allocation/freeing
>> performance of aset.c (while winning after allocation, due to the
>> higher memory density).
>
> Combining that with limiting the number of freelists, and some
> microoptimizations, allocation performance is now on par.
>
> Freeing still seems to be a tad slower, mostly because SlabFree()
> practically is immediately stalled on fetching the block, whereas
> AllocSetFree() can happily speculate ahead and do work like computing
> the freelist index. And then aset only needs to access memory inside the
> context - which is much more likely to be in cache than a freelist
> inside a block (there are many more of those).
>
> But that's ok, I think. It's close, and it's only a small share of the
> overall runtime of my workload...

Sounds great! Thanks for investigating this and for the improvements.

It might be good to do some experiments to see how the changes affect memory consumption for practical workloads. I'm willing to spend some time on that, if needed.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 2021-07-18 19:23:31 +0200, Tomas Vondra wrote:
> Sounds great! Thanks for investigating this and for the improvements.
>
> It might be good to do some experiments to see how the changes affect
> memory consumption for practical workloads. I'm willing to spend some
> time on that, if needed.

I've attached my changes. They're in a rough shape right now, but I think good enough for an initial look.

Greetings,

Andres Freund
Attachment
On Mon, Jul 19, 2021 at 17:56, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2021-07-18 19:23:31 +0200, Tomas Vondra wrote:
> Sounds great! Thanks for investigating this and for the improvements.
>
> It might be good to do some experiments to see how the changes affect memory
> consumption for practical workloads. I'm willing to spend some time on that,
> if needed.
I've attached my changes. They're in a rough shape right now, but I
think good enough for an initial look.
Hi Andres, I took a look.
Perhaps you would agree with me that in the vast majority of cases, malloc will not fail.
So it makes more sense to test:
if (ret != NULL)
than
if (ret == NULL)
which might help branch prediction.
This change also makes it possible to reduce the scope of some variables.
Example:
+static void * pg_noinline
+AllocSetAllocLarge(AllocSet set, Size size, int flags)
+{
+    AllocBlock  block;
+    Size        chunk_size;
+    Size        blksize;
+
+    /* check size, only allocation path where the limits could be hit */
+    MemoryContextCheckSize(&set->header, size, flags);
+
+    AssertArg(AllocSetIsValid(set));
+
+    chunk_size = MAXALIGN(size);
+    blksize = chunk_size + ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ;
+    block = (AllocBlock) malloc(blksize);
+    if (block != NULL)
+    {
+        AllocChunk  chunk;
+
+        set->header.mem_allocated += blksize;
+
+        block->aset = set;
+        block->freeptr = block->endptr = ((char *) block) + blksize;
+
+        /*
+         * Stick the new block underneath the active allocation block, if any,
+         * so that we don't lose the use of the space remaining therein.
+         */
+        if (set->blocks != NULL)
+        {
+            block->prev = set->blocks;
+            block->next = set->blocks->next;
+            if (block->next)
+                block->next->prev = block;
+            set->blocks->next = block;
+        }
+        else
+        {
+            block->prev = NULL;
+            block->next = NULL;
+            set->blocks = block;
+        }
+
+        chunk = (AllocChunk) (((char *) block) + ALLOC_BLOCKHDRSZ);
+        chunk->size = chunk_size;
+
+        return AllocSetAllocReturnChunk(set, size, chunk, chunk_size);
+    }
+
+    return NULL;
+}
regards,
Ranier Vilela
On Tue, 20 Jul 2021 at 10:04, Ranier Vilela <ranier.vf@gmail.com> wrote:
> Perhaps you would agree with me that in the vast majority of cases, malloc will not fail.
> So it makes more sense to test:
> if (ret != NULL)
> than
> if (ret == NULL)

I think it'd be better to use unlikely() for that.

David
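For instance, applied to the excerpt above (unlikely() is PostgreSQL's existing hint macro around __builtin_expect; the early-return shape is just one way to write it):

    block = (AllocBlock) malloc(blksize);
    if (unlikely(block == NULL))
        return NULL;

    /* The happy path continues without extra nesting; the compiler is
     * told the NULL branch is cold and can move it out of line. */
    set->header.mem_allocated += blksize;

This keeps the NULL test in its conventional form while still steering branch layout.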
On Tue, Jul 20, 2021 at 11:15, David Rowley <dgrowleyml@gmail.com> wrote:
On Tue, 20 Jul 2021 at 10:04, Ranier Vilela <ranier.vf@gmail.com> wrote:
> Perhaps you would agree with me that in the vast majority of cases, malloc will not fail.
> So it makes more sense to test:
> if (ret != NULL)
> than
> if (ret == NULL)
I think it'd be better to use unlikely() for that.
Sure, it can be, but in this case, there is no way to reduce the scope.
regards,
Ranier Vilela
Hi,

I spent a bit of time benchmarking this - the first patch adds an extension with three functions, each executing a slightly different allocation pattern:

1) FIFO (allocates and frees in the same order)
2) LIFO (frees in reverse order)
3) random

Each function can also do a custom number of iterations, each allocating and freeing a certain number of chunks.

The bench.sql script executes three combinations (for each pattern):

1) no loops
2) increase: 100 loops, each freeing 10k chunks and allocating 15k
3) decrease: 100 loops, each freeing 10k chunks and allocating 5k

The idea is to test simple one-time allocation, and workloads that mix allocations and frees (to see how the changes affect reuse etc.). The script tests this with a range of block sizes (1kB-32kB) and chunk sizes (32B-512B).

In the attached .ods file with results, the "comparison" sheets are the interesting ones - the last couple of columns compare the main metrics for the two patches (labeled patch-1 and patch-2) to master.

Overall, the results look quite good - patch-1 is mostly on par with master, with maybe 5% variability in both directions. That's expected, considering the patch does not aim to improve performance.

The second patch brings some nice improvements - 30%-50% in most cases (for both allocation and free) seems pretty nice. But for the "increase" FIFO pattern (incrementally allocating/freeing more memory) there's a significant regression - particularly in the allocation time. In some cases (larger chunks; block size does not matter too much) it jumps from 25ms to almost 200ms.

This seems unfortunate - the allocation pattern (FIFO, allocating more memory over time) seems pretty common, and the slowdown is significant.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
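For reference, the FIFO pattern being measured boils down to a loop like the following (a sketch, not the actual extension code; chunk_size, keep and niter stand in for the function's parameters):

    /* Keep `keep` chunks live; repeatedly free the oldest and replace it. */
    void      **live = palloc(sizeof(void *) * keep);
    int         head = 0;

    for (int i = 0; i < keep; i++)
        live[i] = palloc(chunk_size);      /* initial fill */

    for (long i = 0; i < niter; i++)
    {
        pfree(live[head]);                 /* the oldest chunk goes first... */
        live[head] = palloc(chunk_size);   /* ...and is immediately replaced */
        head = (head + 1) % keep;
    }

The LIFO variant differs only in freeing the most recently allocated chunks first, which tends to reuse cache-hot memory and flatters the allocator.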
Hi,

On 2021-08-01 19:59:18 +0200, Tomas Vondra wrote:
> In the attached .ods file with results, the "comparison" sheets are the
> interesting ones - the last couple of columns compare the main metrics
> for the two patches (labeled patch-1 and patch-2) to master.

I assume with patch-1/2 you mean the ones after the benchmark patch itself?

> Overall, the results look quite good - patch-1 is mostly on par with
> master, with maybe 5% variability in both directions. That's expected,
> considering the patch does not aim to improve performance.

Not for slab anyway...

> The second patch brings some nice improvements - 30%-50% in most cases
> (for both allocation and free) seems pretty nice. But for the "increase"
> FIFO pattern (incrementally allocating/freeing more memory) there's a
> significant regression - particularly in the allocation time. In some
> cases (larger chunks; block size does not matter too much) it jumps from
> 25ms to almost 200ms.

I'm not surprised to see memory usage increase some, but that degree of time overhead does surprise me. ISTM there's something wrong.

It'd probably be worth benchmarking the different improvements inside the "WIP: slab performance" patch. There are some that I'd expect to be all-around improvements, whereas others likely aren't quite that clear-cut. I assume you'd prefer that I split the patch up?

> This seems unfortunate - the allocation pattern (FIFO, allocating more
> memory over time) seems pretty common, and the slowdown is significant.

Did you analyze what causes the regressions?

Greetings,

Andres Freund
On 8/1/21 11:07 PM, Andres Freund wrote:
> Hi,
>
> On 2021-08-01 19:59:18 +0200, Tomas Vondra wrote:
>> In the attached .ods file with results, the "comparison" sheets are the
>> interesting ones - the last couple of columns compare the main metrics
>> for the two patches (labeled patch-1 and patch-2) to master.
>
> I assume with patch-1/2 you mean the ones after the benchmark patch
> itself?

Yes, those are the two WIP patches you shared on 19/7.

>> Overall, the results look quite good - patch-1 is mostly on par with
>> master, with maybe 5% variability in both directions. That's expected,
>> considering the patch does not aim to improve performance.
>
> Not for slab anyway...

Maybe the hot/cold separation could have some effect, but probably not for the workloads I've tested.

>> The second patch brings some nice improvements - 30%-50% in most cases
>> (for both allocation and free) seems pretty nice. But for the
>> "increase" FIFO pattern (incrementally allocating/freeing more memory)
>> there's a significant regression - particularly in the allocation time.
>> In some cases (larger chunks; block size does not matter too much) it
>> jumps from 25ms to almost 200ms.
>
> I'm not surprised to see memory usage increase some, but that degree of
> time overhead does surprise me. ISTM there's something wrong.

Yeah, the higher amount of allocated memory is due to the couple of fields added to the SlabBlock struct, but even that only affects a single case with 480B chunks and 1kB blocks. Seems fine to me, especially if we end up growing the slab blocks. Not sure about the allocation time, though.

> It'd probably be worth benchmarking the different improvements inside
> the "WIP: slab performance" patch. There are some that I'd expect to be
> all-around improvements, whereas others likely aren't quite that
> clear-cut. I assume you'd prefer that I split the patch up?

Yeah, if you split that patch into sensible parts, I'll benchmark those. Also, we can add more interesting workloads if you have some ideas.

>> This seems unfortunate - the allocation pattern (FIFO, allocating more
>> memory over time) seems pretty common, and the slowdown is significant.
>
> Did you analyze what causes the regressions?

No, not yet. I'll run the same set of benchmarks for the Generation context, discussed in the other thread, and then I'll investigate this. But if you split the patch, that'd probably help.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
FWIW I tried running the benchmarks again, with some minor changes in the extension code - most importantly, the time is now counted in microseconds (instead of milliseconds). I suspected the coarser unit might have been causing rounding errors (essentially not counting anything below 1ms, because it rounds to 0), and the results are indeed a bit different.

On the i5-2500k machine it's an improvement across the board, while on the bigger Xeon e5-2620v3 machine it shows roughly the same regression for the "decreasing" allocation pattern.

There's another issue in the benchmarking script - the queries are meant to do multiple runs for each combination of parameters, but they're written in a way that simply runs each one once and then does a cross product with generate_series(1,5). I'll look into fixing that, but judging by the stability of results for similar chunk sizes it won't change much.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
Hi,

I've been investigating the regressions in some of the benchmark results, together with the generation context benchmarks [1].

Turns out it's pretty difficult to benchmark this, because the results strongly depend on what the backend did before. For example if I run slab_bench_fifo with the "decreasing" test for 32kB blocks and 512B chunks, I get this:

    select * from slab_bench_fifo(1000000, 32768, 512, 100, 10000, 5000);

     mem_allocated | alloc_ms | free_ms
    ---------------+----------+---------
         528547840 |   155394 |   87440

i.e. palloc() takes ~155ms and pfree() ~87ms (and these results are stable; the numbers don't change much with more runs).

But if I run a set of "lifo" tests in the backend first, the results look like this:

     mem_allocated | alloc_ms | free_ms
    ---------------+----------+---------
         528547840 |    41728 |   71524
    (1 row)

so the pallocs are suddenly about ~4x faster. Clearly, what the backend did before may have a pretty dramatic impact on results, even for simple benchmarks like this.

Note: The benchmark was a single SQL script, running all the different workloads in the same backend.

I did a fair amount of perf profiling, and the main difference between the slow and fast runs seems to be this:

                 0      page-faults:u
                 0      minor-faults:u
                 0      major-faults:u

vs

        20,634,153      page-faults:u
        20,634,153      minor-faults:u
                 0      major-faults:u

Attached is a more complete perf stat output, but the page faults seem to be the main issue.

My theory is that in the "fast" case, the past backend activity puts the glibc memory management into a state that prevents page faults in the benchmark. But of course, this theory may be incomplete - for example it's not clear why running the benchmark repeatedly would not "condition" the backend the same way. But it doesn't - it's ~150ms even for repeated runs. Secondly, I'm not sure this explains why some of the timings actually got much slower with the 0003 patch, when the sequence of the steps is still the same. Of course, it's possible 0003 changes the allocation pattern a bit, interfering with glibc memory management.

This leads to a couple of interesting questions, I think:

1) I've only tested this on Linux, with glibc. I wonder how it'd behave on other platforms, or with other allocators.

2) Which cases are more important? When the backend was warmed up, or when each benchmark runs in a new backend? It seems the "new backend" case is something like a "worst case" leading to more page faults, so maybe that's the thing to watch. OTOH it's unlikely to have a completely new backend, so maybe not.

3) Can this teach us something about how to allocate stuff, to better "prepare" the backend for future allocations? For example, it's a bit strange that repeated runs of the same benchmark don't do the trick, for some reason.

regards

[1] https://www.postgresql.org/message-id/bcdd4e3e-c12d-cd2b-7ead-a91ad416100a%40enterprisedb.com

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
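The page-fault counters shown above are stock perf software events; an invocation along these lines (the exact one used isn't shown in the thread) attaches to a running backend:

    $ perf stat -e page-faults,minor-faults,major-faults -p <backend pid>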
On Sat, 11 Sept 2021 at 09:07, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> I've been investigating the regressions in some of the benchmark
> results, together with the generation context benchmarks [1].

I've not looked into the regression you found with this yet, but I did rebase the patch. slab.c has seen quite a number of changes recently.

I didn't spend a lot of time checking over the patch. I mainly wanted to see what the performance was like before reviewing in too much detail.

To test the performance, I used [1] and ran:

    select pg_allocate_memory_test(<nbytes>, 1024*1024, 10::bigint*1024*1024*1024, 'slab');

That basically allocates chunks of <nbytes>, keeps around 1MB of them at a time, and allocates a total of 10GB of them. I saw:

Master:
16 byte chunk = 8754.678 ms
32 byte chunk = 4511.725 ms
64 byte chunk = 2244.885 ms
128 byte chunk = 1135.349 ms
256 byte chunk = 548.030 ms
512 byte chunk = 272.017 ms
1024 byte chunk = 144.618 ms

Master + attached patch:
16 byte chunk = 5255.974 ms
32 byte chunk = 2640.807 ms
64 byte chunk = 1328.949 ms
128 byte chunk = 668.078 ms
256 byte chunk = 330.564 ms
512 byte chunk = 166.844 ms
1024 byte chunk = 85.399 ms

So the patched version runs in about 60% of the time that master runs in.

I plan to look at the patch in a bit more detail and see if I can recreate and figure out the regression that Tomas reported. For now, I just want to share the rebased patch.

The only thing I really adjusted from Andres' version is that instead of using pointers for the linked-list block freelist, I made it store the chunk's offset in bytes from the start of the block. This means we can use 4 bytes instead of 8 bytes for these links. The block size is limited to 1GB now anyway, so 32 bits is large enough for these offsets.

David

[1] https://www.postgresql.org/message-id/attachment/137056/allocate_performance_functions.patch.txt
Attachment
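A sketch of that offset representation (helper names hypothetical; uint32_t rather than PostgreSQL's uint32 to keep the snippet self-contained):

    #include <stdint.h>

    /*
     * Freelist links stored as byte offsets from the block start: 4 bytes
     * each even on 64-bit platforms, and trivially range-checkable.
     */
    static inline uint32_t
    chunk_to_offset(void *block, void *chunk)
    {
        return (uint32_t) ((char *) chunk - (char *) block);
    }

    static inline void *
    offset_to_chunk(void *block, uint32_t off)
    {
        return (char *) block + off;
    }

With blocks capped at 1GB, 30 bits suffice, so the offset always fits in 32 bits.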
On Wed, Oct 12, 2022 at 4:37 PM David Rowley <dgrowleyml@gmail.com> wrote:
> [v3]
+ /*
+ * Compute a shift that guarantees that shifting chunksPerBlock with it
+ * yields a value smaller than SLAB_FREELIST_COUNT - 1 (one freelist is used for full blocks).
+ */
+ slab->freelist_shift = 0;
+ while ((slab->chunksPerBlock >> slab->freelist_shift) >= (SLAB_FREELIST_COUNT - 1))
+ slab->freelist_shift++;
+ /*
+ * Ensure, without a branch, that index 0 is only used for blocks entirely
+ * without free chunks.
+ * XXX: There probably is a cheaper way to do this. Needing to shift twice
+ * by slab->freelist_shift isn't great.
+ */
+ index = (freecount + (1 << slab->freelist_shift) - 1) >> slab->freelist_shift;
How about something like
#define SLAB_FREELIST_COUNT ((1<<3) + 1)
index = (freecount & (SLAB_FREELIST_COUNT - 2)) + (freecount != 0);
and dispense with both freelist_shift and the loop that computes it?
--
John Naylor
EDB: http://www.enterprisedb.com
On Fri, 11 Nov 2022 at 22:20, John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Wed, Oct 12, 2022 at 4:37 PM David Rowley <dgrowleyml@gmail.com> wrote:
> > [v3]
>
> + /*
> + * Compute a shift that guarantees that shifting chunksPerBlock with it
> + * yields a value smaller than SLAB_FREELIST_COUNT - 1 (one freelist is used for full blocks).
> + */
> + slab->freelist_shift = 0;
> + while ((slab->chunksPerBlock >> slab->freelist_shift) >= (SLAB_FREELIST_COUNT - 1))
> + slab->freelist_shift++;
>
> + /*
> + * Ensure, without a branch, that index 0 is only used for blocks entirely
> + * without free chunks.
> + * XXX: There probably is a cheaper way to do this. Needing to shift twice
> + * by slab->freelist_shift isn't great.
> + */
> + index = (freecount + (1 << slab->freelist_shift) - 1) >> slab->freelist_shift;
>
> How about something like
>
> #define SLAB_FREELIST_COUNT ((1<<3) + 1)
> index = (freecount & (SLAB_FREELIST_COUNT - 2)) + (freecount != 0);

Doesn't this create a sort of round-robin use of the free list? What we want is a sort of "histogram" bucket set of free lists, so we can group together blocks that have a close-enough number of free chunks. Unless I'm mistaken, I think what you have doesn't do that.

I wondered if simply:

index = -(-freecount >> slab->freelist_shift);

would be faster than Andres' version. I tried it out, and on my AMD machine it's about the same speed. Same on a Raspberry Pi 4.

Going by [2], the instructions are very different with each method, so other machines with different latencies on those instructions might show something different. I attached what I used to test, if anyone else wants a go.

AMD Zen2
$ ./freecount 2000000000
Test 'a' in 0.922766 seconds
Test 'd' in 0.922762 seconds (0.000433% faster)

RPI4
$ ./freecount 2000000000
Test 'a' in 3.341350 seconds
Test 'd' in 3.338690 seconds (0.079672% faster)

That was gcc. Trying it with clang, it went in a little heavy-handed and optimized out my loop, so some more trickery might be needed for a useful test on that compiler.

David

[2] https://godbolt.org/z/dh95TohEG
Attachment
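For the record, the expression works because -(-x >> s) is a ceiling division by 2^s - the same mapping as Andres' add-then-shift, including sending only freecount = 0 to bucket 0. A sketch, with the portability caveat that comes up below:

    /*
     * -(-x >> s) == ceil(x / 2^s) for x >= 0, assuming two's complement and
     * an arithmetic right shift of negative values (the C standard leaves
     * the latter implementation-defined). With s = 4:
     * x = 0 -> 0, x = 1..16 -> 1, x = 17..32 -> 2, and so on.
     */
    static inline int
    ceil_shift(int x, int s)
    {
        return -(-x >> s);
    }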
On Mon, Dec 5, 2022 at 3:02 PM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Fri, 11 Nov 2022 at 22:20, John Naylor <john.naylor@enterprisedb.com> wrote:
> > #define SLAB_FREELIST_COUNT ((1<<3) + 1)
> > index = (freecount & (SLAB_FREELIST_COUNT - 2)) + (freecount != 0);
>
> Doesn't this create a sort of round-robin use of the free list? What
> we want is a sort of "histogram" bucket set of free lists so we can
> group together blocks that have a close-enough free number of chunks.
> Unless I'm mistaken, I think what you have doesn't do that.
The intent must have slipped my mind along the way.
> I wondered if simply:
>
> index = -(-freecount >> slab->freelist_shift);
>
> would be faster than Andres' version. I tried it out and on my AMD
> machine, it's about the same speed. Same on a Raspberry Pi 4.
>
> Going by [2], the instructions are very different with each method, so
> other machines with different latencies on those instructions might
> show something different. I attached what I used to test if anyone
> else wants a go.
I get about 0.1% difference on my machine. Both ways boil down to (on gcc) 3 instructions with low latency. The later ones need the prior results to execute, which I think is what the XXX comment "isn't great" was referring to. The new coding is more mysterious (does it do the right thing on all platforms?), so I guess the original is still the way to go unless we get a better idea.
--
John Naylor
EDB: http://www.enterprisedb.com
On Fri, Sep 10, 2021 at 5:07 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> Turns out it's pretty difficult to benchmark this, because the results
> strongly depend on what the backend did before.

What you report here seems to be mostly cold-cache effects, with which I don't think we need to be overly concerned. We don't want big regressions in the cold-cache case, but there is always going to be some overhead when a new backend starts up, because you've got to fault some pages into the heap/malloc arena/whatever before they can be efficiently accessed.

What would be more concerning is if we found out that the performance depended heavily on the internal state of the allocator. For example, suppose you have two warmup routines W1 and W2, each of which touches the same amount of total memory, but with different details. Then you have a benchmark B. If you do W1-B and W2-B and the time for B varies dramatically between them, then you've maybe got an issue. For instance, it could indicate that the allocator has issues when the old and new allocations are very different sizes, or something like that.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Mon, 5 Dec 2022 at 23:18, John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Mon, Dec 5, 2022 at 3:02 PM David Rowley <dgrowleyml@gmail.com> wrote:
> > Going by [2], the instructions are very different with each method, so
> > other machines with different latencies on those instructions might
> > show something different. I attached what I used to test if anyone
> > else wants a go.
>
> I get about 0.1% difference on my machine. Both ways boil down to (on gcc) 3 instructions with low latency. The later ones need the prior results to execute, which I think is what the XXX comment "isn't great" was referring to. The new coding is more mysterious (does it do the right thing on all platforms?), so I guess the original is still the way to go unless we get a better idea.

I don't think it would work well on a one's complement machine, but I don't think we support those, going by the comments above RIGHTMOST_ONE in bitmapset.c. In any case, I found that it wasn't any faster than what Andres wrote. In fact, even changing the code there to "index = (freecount > 0);" seems to do very little to increase performance.

I do see that having 3 freelist items performs a decent amount better than having 9. However, which workload is run may alter the result of that, assuming that keeping new allocations on fuller blocks is a winning strategy for the CPU's caches.

I've now done quite a bit more work on Andres' patch to try and get it into (hopefully) somewhere close to a committable shape. I'm fairly happy with what's there now. It does seem to perform much better than current master, especially so when the workload would have caused master to continually malloc and free an entire block when freeing and allocating a single chunk.

I've done some basic benchmarking, mostly using the attached alloc_bench patch. If I run:

select *,
       round(slab_result / aset_result * 100 - 100, 1)::text || '%' as slab_slower_by
from (
    select chunk_size,
           keep_chunks,
           sum(pg_allocate_memory_test(chunk_size, chunk_size * keep_chunks,
               1024*1024*1024, 'slab'))::numeric(1000,3) as slab_result,
           sum(pg_allocate_memory_test(chunk_size, chunk_size * keep_chunks,
               1024*1024*1024, 'aset'))::numeric(1000,3) as aset_result
    from (values(1),(10),(50),(100),(200),(300),(400),(500),(1000),(2000),
                 (3000),(4000),(5000),(10000)) v1(keep_chunks),
         (values(64),(128),(256),(512),(1024)) v2(chunk_size)
    group by rollup(1,2)
);

The results for the 64-byte chunk are shown in the attached chart. It's not as fast as aset, but much faster than master's slab.c. The first blue bar of the chart is well above the vertical axis: it took master 1118958 milliseconds for that test; the attached patch took 194 ms. The rest of the tests seem to put the patched code somewhere in the middle between the unpatched code and aset.c's performance.

The benchmark I did was entirely a FIFO workload that keeps around "keep_chunks" at once before starting to free the oldest chunks. I know Tomas has some LIFO and random benchmarks. I edited that code [1] a little to add support for other context types so that a comparison could be done more easily. However, I'm getting very weird performance results where sometimes it runs about twice as fast (Tomas mentioned he got this too). I'm not seeing that with my own benchmarking function, so I'm wondering if there's something weird going on with the benchmark itself rather than the slab.c code.

I've likely made many more changes than I can list here, but here are a few of the major ones:

1. Make use of dclist for empty blocks (sketched below).
2. In SlabAlloc(), allocate chunks from the freelist before the unused list.
3. Added output showing information about empty blocks in the SlabStats output.
4. Renamed the context freelists to blocklist. I found the old name was likely just confusing things with the block-level freelist. In any case, it seemed weird to have freelist[0] store full blocks. Not much free there! I renamed it to blocklist[].
5. I did a round of changing the order of the fields in SlabBlock. This seems to affect performance quite a bit. Having the context first seems to improve performance. Having the blocklist[] node last also helps.
6. Removed nblocks and headerSize from SlabContext. headerSize is no longer needed. nblocks was only really used for Asserts and SlabIsEmpty. I changed the Asserts to use a local count of blocks and changed SlabIsEmpty to look at the context's mem_allocated.
7. There's now no integer division in any of the alloc and free code. The only time we divide by fullChunkSize is in the check function.
8. When using a block from the emptyblock list, I changed the code to not re-init the block. It's now used as it was left previously, which means no longer having to build the freelist again.
9. Updated all comments to reflect the current state of the code.

Some things I thought about but didn't do:

a. Change the size of SlabContext's chunkSize, fullChunkSize and blockSize to be uint32 instead of Size. It might be possible to get SlabContext below 128 bytes with a bit more work.
b. I could have done a bit more experimentation with unlikely() and likely() to move less frequently accessed code off into a cold area.

For #2 above, I didn't really see much change in performance when I swapped the order of what we allocate from first. I expected free chunks would be better as they've been used and are seemingly more likely to be in some CPU cache than one of the unused chunks. I might need a different allocation pattern than the one I used to highlight that fact though.

David

[1] https://github.com/david-rowley/postgres/tree/alloc_bench_contrib
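For item 1, a minimal sketch of the empty-block caching (illustrative struct and function names, not the committed slab.c definitions), using the counted doubly-linked list API from lib/ilist.h. Combined with item 8, a cached block keeps its chunk freelist intact, so reuse skips re-initialization:

#include "postgres.h"
#include "lib/ilist.h"

typedef struct SketchBlock
{
    dlist_node  node;           /* links block into a blocklist */
    /* chunk freelist head, free count, etc. would follow */
} SketchBlock;

typedef struct SketchContext
{
    dclist_head emptyblocks;    /* wholly-empty blocks kept for reuse */
} SketchContext;

/* free side: a block that just became empty is cached, not freed */
static void
sketch_cache_empty_block(SketchContext *cxt, SketchBlock *block)
{
    dclist_push_head(&cxt->emptyblocks, &block->node);
}

/* alloc side: prefer a cached block; fall back to a fresh malloc */
static SketchBlock *
sketch_get_block(SketchContext *cxt)
{
    if (!dclist_is_empty(&cxt->emptyblocks))
        return dlist_container(SketchBlock, node,
                               dclist_pop_head_node(&cxt->emptyblocks));
    return NULL;                /* caller allocates and initializes */
}

Since a dclist tracks its own length, it presumably also lets the SlabStats output of item 3 report the number of empty blocks without walking the list.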
On Sat, Dec 10, 2022 at 11:02 AM David Rowley <dgrowleyml@gmail.com> wrote:
> [v4]
Thanks for working on this!
I ran an in-situ benchmark using the v13 radix tree patchset ([1] WIP, but should be useful enough for testing allocation speed), applying only the first five patches, which are local-memory only. The benchmark is not meant to represent a realistic workload, and primarily stresses traversal and allocation of the smallest node type. Minimum of five runs, with turbo-boost off, on recent Intel laptop hardware:
v13-0001 to 0005:
# select * from bench_load_random_int(500 * 1000);
 mem_allocated | load_ms
---------------+---------
     151123432 |     222
47.06% postgres postgres [.] rt_set
22.89% postgres postgres [.] SlabAlloc
9.65% postgres postgres [.] rt_node_insert_inner.isra.0
5.94% postgres [unknown] [k] 0xffffffffb5e011b7
3.62% postgres postgres [.] MemoryContextAlloc
2.70% postgres libc.so.6 [.] __memmove_avx_unaligned_erms
2.60% postgres postgres [.] SlabFree
+ v4 slab:
# select * from bench_load_random_int(500 * 1000);
 mem_allocated | load_ms
---------------+---------
     152463112 |     213
52.42% postgres postgres [.] rt_set
12.80% postgres postgres [.] SlabAlloc
9.38% postgres postgres [.] rt_node_insert_inner.isra.0
7.87% postgres [unknown] [k] 0xffffffffb5e011b7
4.98% postgres postgres [.] SlabFree
While allocation is markedly improved, freeing looks worse here. The proportion is surprising because only about 2% of nodes are freed during the load, but doing that takes up 10-40% of the time compared to allocating.
num_keys = 500000, height = 7
n4 = 2501016, n15 = 56932, n32 = 270, n125 = 0, n256 = 257
Side note: I don't recall ever seeing vsyscall (I think that's what the 0xffffffffb5e011b7 address refers to) in a profile, so I'm not sure what is happening there.
[1] https://www.postgresql.org/message-id/CAFBsxsHNE621mGuPhd7kxaGc22vMkoSu7R4JW9Zan1jjorGy3g%40mail.gmail.com
--
John Naylor
EDB: http://www.enterprisedb.com
Thanks for testing the patch.

On Mon, 12 Dec 2022 at 20:14, John Naylor <john.naylor@enterprisedb.com> wrote:
> v13-0001 to 0005:
> 2.60% postgres postgres [.] SlabFree
>
> + v4 slab:
> 4.98% postgres postgres [.] SlabFree
>
> While allocation is markedly improved, freeing looks worse here. The proportion is surprising because only about 2% of nodes are freed during the load, but doing that takes up 10-40% of the time compared to allocating.

I've tried to reproduce this with the v13 patches applied and I'm not really getting the same as you are. To run the function 100 times I used:

select x, a.* from generate_series(1,100) x(x), lateral
    (select * from bench_load_random_int(500 * 1000 * (1+x-x))) a;

(I had to add the * (1+x-x) to create a lateral dependency and stop the function just being executed once.)

v13-0001 - 0005 gives me:

37.71%  postgres  [.] rt_set
19.24%  postgres  [.] SlabAlloc
 8.73%  [kernel]  [k] clear_page_rep
 5.21%  postgres  [.] rt_node_insert_inner.isra.0
 2.63%  [kernel]  [k] asm_exc_page_fault
 2.24%  postgres  [.] SlabFree

and fairly consistently 122 ms runtime per call. Applying the v4 slab patch I get:

41.06%  postgres  [.] rt_set
10.84%  postgres  [.] SlabAlloc
 9.01%  [kernel]  [k] clear_page_rep
 6.49%  postgres  [.] rt_node_insert_inner.isra.0
 2.76%  postgres  [.] SlabFree

and fairly consistently 112 ms per call.

I wonder if you can consistently get the same result on another compiler, or after patching something like master~50 or master~100. Maybe it's just a code alignment thing.

Looking at the annotation of perf report for SlabFree with the patched version, I see:

      │      /* push this chunk onto the head of the free list */
      │      *(MemoryChunk **) pointer = block->freehead;
 0.09 │      mov    0x10(%r8),%rax
      │      slab = block->slab;
59.15 │      mov    (%r8),%rbp
      │      *(MemoryChunk **) pointer = block->freehead;
 9.43 │      mov    %rax,(%rdi)
      │      block->freehead = chunk;
      │
      │      block->nfree++;

I think what that's telling me is that dereferencing the block's memory is slow, likely due to that particular cache line no longer being cached. I tried running the test with 10,000 ints instead of 500,000 so that there would be less CPU cache pressure. I see:

29.76 │      mov    (%r8),%rbp
      │      *(MemoryChunk **) pointer = block->freehead;
12.72 │      mov    %rax,(%rdi)
      │      block->freehead = chunk;
      │
      │      block->nfree++;
      │      mov    0x8(%r8),%eax
      │      block->freehead = chunk;
 4.27 │      mov    %rdx,0x10(%r8)
      │      SlabBlocklistIndex():
      │      index = (nfree + (1 << blocklist_shift) - 1) >> blocklist_shift;
      │      mov    $0x1,%edx
      │      SlabFree():
      │      block->nfree++;
      │      lea    0x1(%rax),%edi
      │      mov    %edi,0x8(%r8)
      │      SlabBlocklistIndex():
      │      int32 blocklist_shift = slab->blocklist_shift;
      │      mov    0x70(%rbp),%ecx
      │      index = (nfree + (1 << blocklist_shift) - 1) >> blocklist_shift;
 8.46 │      shl    %cl,%edx

Various other instructions in SlabFree() are proportionally taking longer now; for example, the bit shift at the end was insignificant previously. That indicates to me that this is due to caching effects.

We must fetch the block in SlabFree() in both versions. It's possible that something is going on in SlabAlloc() that is causing more useful cache lines to be evicted, but (I think) one of the primary design goals Andres was going for was to reduce exactly that. For example, not having to write out the freelist for an entire block when the block is first allocated means no longer having to load possibly every cache line in the block.

I tried looking at perf stat during the run. Without slab changes:

drowley@amd3990x:~$ sudo perf stat --pid=74922 sleep 2

 Performance counter stats for process id '74922':

          2,000.74 msec task-clock               #    1.000 CPUs utilized
                 4      context-switches         #    1.999 /sec
                 0      cpu-migrations           #    0.000 /sec
           578,139      page-faults              #  288.963 K/sec
     8,614,687,392      cycles                   #    4.306 GHz                     (83.21%)
       682,574,688      stalled-cycles-frontend  #    7.92% frontend cycles idle    (83.33%)
     4,822,904,271      stalled-cycles-backend   #   55.98% backend cycles idle     (83.41%)
    11,447,124,105      instructions             #    1.33  insn per cycle
                                                 #    0.42  stalled cycles per insn (83.41%)
     1,947,647,575      branches                 #  973.464 M/sec                   (83.41%)
        13,914,897      branch-misses            #    0.71% of all branches         (83.24%)

       2.000924020 seconds time elapsed

With slab changes:

drowley@amd3990x:~$ sudo perf stat --pid=75967 sleep 2

 Performance counter stats for process id '75967':

          2,000.89 msec task-clock               #    1.000 CPUs utilized
                 1      context-switches         #    0.500 /sec
                 0      cpu-migrations           #    0.000 /sec
           607,423      page-faults              #  303.576 K/sec
     8,566,091,176      cycles                   #    4.281 GHz                     (83.21%)
       737,839,390      stalled-cycles-frontend  #    8.61% frontend cycles idle    (83.32%)
     4,454,357,725      stalled-cycles-backend   #   52.00% backend cycles idle     (83.41%)
    10,760,559,837      instructions             #    1.26  insn per cycle
                                                 #    0.41  stalled cycles per insn (83.41%)
     1,872,047,962      branches                 #  935.606 M/sec                   (83.41%)
        14,928,953      branch-misses            #    0.80% of all branches         (83.25%)

       2.000960610 seconds time elapsed

It would be interesting to see if your perf stat output shows something significantly different with and without the slab changes. It does not seem impossible that, because the slab changes look at less memory in SlabAlloc(), SlabFree() now has to fetch cache lines that in the unpatched version would already have been available. If that is the case, then I think we shouldn't worry about it unless we can find some workload that demonstrates an overall performance regression with the patch. I just don't quite have enough perf experience to know how I might go about proving that.

David
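To make the code shape in the annotation concrete, here is a self-contained sketch (illustrative type names, not the actual slab.c definitions) of an intrusive free list of this kind: the freed chunk's own memory stores the next-free pointer, so a free touches only the chunk and the block header.

#include <stddef.h>

typedef struct SketchChunk SketchChunk;   /* opaque; lives inside the block */

typedef struct SketchBlock
{
    SketchChunk *freehead;      /* head of this block's free-chunk list */
    int          nfree;         /* free count, feeds the blocklist index */
} SketchBlock;

/* free: push the chunk onto the head of the block's free list */
static inline void
sketch_free(SketchBlock *block, void *pointer)
{
    *(SketchChunk **) pointer = block->freehead;
    block->freehead = (SketchChunk *) pointer;
    block->nfree++;
}

/* alloc: pop the head; the next-free pointer lives in the chunk itself */
static inline void *
sketch_alloc(SketchBlock *block)
{
    void       *pointer = block->freehead;

    if (pointer != NULL)
    {
        block->freehead = *(SketchChunk **) pointer;
        block->nfree--;
    }
    return pointer;
}

Because only the chunk and the block header are written, the first load of the block's memory is where any cache miss lands, which is consistent with the 59% hotspot on "mov (%r8),%rbp" above.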
On Tue, Dec 13, 2022 at 7:50 AM David Rowley <dgrowleyml@gmail.com> wrote:
>
> Thanks for testing the patch.
>
> On Mon, 12 Dec 2022 at 20:14, John Naylor <john.naylor@enterprisedb.com> wrote:
> > While allocation is markedly improved, freeing looks worse here. The proportion is surprising because only about 2% of nodes are freed during the load, but doing that takes up 10-40% of the time compared to allocating.
>
> I've tried to reproduce this with the v13 patches applied and I'm not
> really getting the same as you are. To run the function 100 times I
> used:
>
> select x, a.* from generate_series(1,100) x(x), lateral (select * from
> bench_load_random_int(500 * 1000 * (1+x-x))) a;
Simply running over a longer period of time like this makes the SlabFree difference much closer to your results, so it doesn't seem out of line anymore. Here SlabAlloc seems to take maybe 2/3 of the time of current slab, with a 5% reduction in total time:
500k ints:
v13-0001-0005
average of 30: 217ms
47.61% postgres postgres [.] rt_set
20.99% postgres postgres [.] SlabAlloc
10.00% postgres postgres [.] rt_node_insert_inner.isra.0
6.87% postgres [unknown] [k] 0xffffffffbce011b7
3.53% postgres postgres [.] MemoryContextAlloc
2.82% postgres postgres [.] SlabFree
+slab v4
average of 30: 206ms
51.13% postgres postgres [.] rt_set
14.08% postgres postgres [.] SlabAlloc
11.41% postgres postgres [.] rt_node_insert_inner.isra.0
7.44% postgres [unknown] [k] 0xffffffffbce011b7
3.89% postgres postgres [.] MemoryContextAlloc
3.39% postgres postgres [.] SlabFree
It doesn't look mysterious anymore, but I went ahead and took some more perf measurements, including for cache misses. My naive impression is that we're spending a bit more time waiting for data, but having to do less work with it once we get it, which is consistent with your earlier comments:
perf stat -p $pid sleep 2
v13:
2,001.55 msec task-clock:u # 1.000 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
311,690 page-faults:u # 155.724 K/sec
3,128,740,701 cycles:u # 1.563 GHz
4,739,333,861 instructions:u # 1.51 insn per cycle
820,014,588 branches:u # 409.690 M/sec
7,385,923 branch-misses:u # 0.90% of all branches
+slab v4:
2,001.09 msec task-clock:u # 1.000 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
326,017 page-faults:u # 162.920 K/sec
3,016,668,818 cycles:u # 1.508 GHz
4,324,863,908 instructions:u # 1.43 insn per cycle
761,839,927 branches:u # 380.712 M/sec
7,718,366 branch-misses:u # 1.01% of all branches
perf stat -e LLC-loads,LLC-loads-misses -p $pid sleep 2
min/max of 3 runs:
v13: LL cache misses: 25.08% - 25.41%
+slab v4: LL cache misses: 25.74% - 26.01%
--
John Naylor
EDB: http://www.enterprisedb.com
I've spent quite a bit more time on the slab changes now and I've attached a v3 patch.

One of the major things I've spent time on is benchmarking this. I'm aware that Tomas wrote some functions to benchmark. I've taken those and made some modifications to allow the memory context type to be specified as a function parameter. This allows me to easily compare the performance of slab with both aset and generation.

Another change that I made to Tomas' module was how the random ordering part works. What I wanted was the ability to specify how randomly to pfree the chunks and to test various "degrees of randomness" to see how that affects the performance. What I ended up coming up with was the ability to specify the number of "random segments". This controls how many groups we split all allocated chunks into to randomise. If there is 1 random segment, then that's just randomising over all chunks. If there are 10 random segments, then we split the array of allocated chunks into 10 portions based on either FIFO or LIFO order, then randomise the order of the chunks only within each of those segments. This allows us to test FIFO/LIFO allocation patterns with and without randomness, and any degree of that in between. If the number of random segments is set to 0, then no randomisation is done.

Another change I made to Tomas' code: I'm now using palloc0() instead of palloc(), and I'm also checking that the first byte of the allocated chunk is '\0' before pfreeing it (a sketch of this pattern follows below). What I was finding was that pfree showed up as highly dominant in perf output due to it having to dereference the MemoryChunk to find the context-type bits; pfree had to do this because none of the calling code had touched any of the memory in the chunk. I felt it was unrealistic to palloc memory and then pfree it without having done anything with it. Mostly this just moves around the responsibility for which function is penalised by having to load the cache line. I mostly did this because I was struggling to make any sense of perf's output.

I've attached alloc_bench_contrib.patch, which I used for testing. I've also attached a spreadsheet with the benchmark results.

The general summary from those is that slab is now generally on par with aset in terms of palloc performance. Previously slab was performing at about half the speed of aset unless CPU cache pressure became more significant, in which case the performance is dominated by fetching cache lines from RAM. However, the new code still makes meaningful improvements even under heavy CPU cache pressure. When it comes to pfree performance, the updated slab code is much faster than it was previously, but not quite on par with aset or generation.

The attached spreadsheet is broken down into 3 tabs. Each tab tests a chunk size and a fixed total number of chunks allocated at once. Within each tab, I'm testing FIFO and then LIFO allocation patterns, each with a different degree of randomness introduced, as I described above. In none of the tests was the patched version slower than the unpatched version.

One pending question I had was about SlabStats, where we list free chunks. Since we now have a list of emptyblocks, I wasn't too sure whether the chunks from those should be included in that total. I'm currently not including them, but I have added some additional information to list the number of completely empty blocks in the emptyblocks list.

Some follow-up work that I'm thinking is a good idea:

1. Reduce the SlabContext's chunkSize, fullChunkSize and blockSize fields from Size down to uint32. These have no need to be 64 bits; we don't allow slab blocks over 1GB since c6e0fe1f2. I thought of doing this separately as we might need to rationalise the equivalent fields in aset.c and generation.c. Those can have external chunks, so I'm not 100% sure if we should do that there or not yet. I just didn't want to touch those files in this effort.

2. Slab should probably gain the ability to grow the block size as aset and generation both do. Since the performance of the slab context is good now, we might want to use it for hash join's 32kb chunks, but I doubt we can without the block size growth.

I'm planning on pushing the attached v3 patch shortly. I've spent several days reading over this and testing it in detail, along with adding additional features to the SlabCheck code to find more inconsistencies.

David
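A minimal sketch of the palloc0-and-touch pattern described above (a simplified FIFO loop with hypothetical function and variable names; the actual alloc_bench code differs):

#include "postgres.h"
#include "utils/memutils.h"

/*
 * Simplified FIFO benchmark loop: keep keep_chunks chunks live, freeing
 * the oldest before each new allocation.  palloc0-style allocation writes
 * the chunk, and reading its first byte before pfree() means the cache
 * line is loaded by "user" code rather than by pfree()'s own header
 * dereference.  Assumes loops >= keep_chunks.
 */
static void
bench_fifo(MemoryContext cxt, Size chunk_size, int keep_chunks, long loops)
{
    char  **chunks = palloc(sizeof(char *) * keep_chunks);
    int     slot = 0;

    for (long i = 0; i < loops; i++)
    {
        if (i >= keep_chunks)
        {
            /* touch the chunk so pfree() isn't the first to load it */
            if (chunks[slot][0] != '\0')
                elog(ERROR, "unexpected chunk contents");
            pfree(chunks[slot]);
        }
        chunks[slot] = MemoryContextAllocZero(cxt, chunk_size);
        slot = (slot + 1) % keep_chunks;
    }

    /* release whatever is still held */
    for (int j = 0; j < keep_chunks; j++)
        pfree(chunks[j]);
    pfree(chunks);
}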
On Tue, Dec 20, 2022 at 10:36 AM David Rowley <dgrowleyml@gmail.com> wrote:
>
> I'm planning on pushing the attached v3 patch shortly. I've spent
> several days reading over this and testing it in detail along with
> adding additional features to the SlabCheck code to find more
> inconsistencies.
FWIW, I reran my test from last week and got similar results.
--
John Naylor
EDB: http://www.enterprisedb.com
On Tue, 20 Dec 2022 at 21:19, John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Tue, Dec 20, 2022 at 10:36 AM David Rowley <dgrowleyml@gmail.com> wrote:
> >
> > I'm planning on pushing the attached v3 patch shortly. I've spent
> > several days reading over this and testing it in detail along with
> > adding additional features to the SlabCheck code to find more
> > inconsistencies.
>
> FWIW, I reran my test from last week and got similar results.

Thanks a lot for testing that stuff last week. I got a bit engrossed in the perf weirdness and forgot to reply. I found the results made much more sense after using palloc0 and touching the allocated memory just before freeing. I think this is a more realistic test.

I've now pushed the patch after making a small adjustment to the version I sent earlier.

David