Thread: slab allocator performance issues
Hi,

I just tried to use the slab allocator for a case where aset.c was bloating memory usage substantially. First: It worked wonders for memory usage, nearly eliminating overhead.

But it turned out to cause a *substantial* slowdown. With aset the allocator is barely in the profile. With slab the profile is dominated by allocator performance.

slab:

NOTICE:  00000: 100000000 ordered insertions in 5.216287 seconds, 19170724/sec
LOCATION:  bfm_test_insert_bulk, radix.c:2880

  Overhead  Command   Shared Object     Symbol
+   28.27%  postgres  postgres          [.] SlabAlloc
+    9.64%  postgres  bdbench.so        [.] bfm_delete
+    9.03%  postgres  bdbench.so        [.] bfm_set
+    8.39%  postgres  bdbench.so        [.] bfm_lookup
+    8.36%  postgres  bdbench.so        [.] bfm_set_leaf.constprop.0
+    8.16%  postgres  libc-2.31.so      [.] __memmove_avx_unaligned_erms
+    6.88%  postgres  bdbench.so        [.] bfm_delete_leaf
+    3.24%  postgres  libc-2.31.so      [.] _int_malloc
+    2.58%  postgres  bdbench.so        [.] bfm_tests
+    2.33%  postgres  postgres          [.] SlabFree
+    1.29%  postgres  libc-2.31.so      [.] _int_free
+    1.09%  postgres  libc-2.31.so      [.] unlink_chunk.constprop.0

aset:

NOTICE:  00000: 100000000 ordered insertions in 2.082602 seconds, 48016848/sec
LOCATION:  bfm_test_insert_bulk, radix.c:2880

+   16.43%  postgres  bdbench.so        [.] bfm_lookup
+   15.38%  postgres  bdbench.so        [.] bfm_delete
+   12.82%  postgres  libc-2.31.so      [.] __memmove_avx_unaligned_erms
+   12.65%  postgres  bdbench.so        [.] bfm_set
+   12.15%  postgres  bdbench.so        [.] bfm_set_leaf.constprop.0
+   10.57%  postgres  bdbench.so        [.] bfm_delete_leaf
+    4.05%  postgres  bdbench.so        [.] bfm_tests
+    2.93%  postgres  [kernel.vmlinux]  [k] clear_page_erms
+    1.59%  postgres  postgres          [.] AllocSetAlloc
+    1.15%  postgres  bdbench.so        [.] memmove@plt
+    1.06%  postgres  bdbench.so        [.] bfm_grow_leaf_16

OS:

NOTICE:  00000: 100000000 ordered insertions in 2.089790 seconds, 47851690/sec
LOCATION:  bfm_test_insert_bulk, radix.c:2880

That is somewhat surprising - part of the promise of a slab allocator is that it's fast...

This is caused by multiple issues, I think, some of which seem fairly easy to fix.

1) If allocations are short-lived, slab.c can end up constantly freeing/initializing blocks, which requires fairly expensively iterating over all potential chunks in the block and initializing them - just to then free that memory again after a small number of allocations. The extreme case of this is when there are phases of alloc/free of a single allocation.

I "fixed" this by adding a few && slab->nblocks > 1 in SlabFree(), and the problem was vastly reduced. Instead of a 0.4x slowdown it's 0.88x. Of course that only works if the problem is with the only block, so it's not a great approach. Perhaps just keeping the last allocated block around would work?

2) SlabChunkIndex() in SlabFree() is slow. It requires a 64-bit division, taking up ~50% of the cycles in SlabFree(). A 64-bit div, according to [1], has a latency of 35-88 cycles on skylake-x (and a reciprocal throughput of 21-83, i.e. no parallelism). While it's getting a bit faster on icelake / zen 3, it's still slow enough there to be very worrisome.

I don't see a way to get around the division while keeping the freelist structure as is. But: ISTM that we only need the index because of the free-chunk list, right? Why don't we make the chunk list use actual pointers? Is the concern that that'd increase the minimum allocation size? If so, I see two ways around that: First, we could make the index just the offset from the start of the block; that's much cheaper to calculate. Second, we could store the next pointer in SlabChunk->slab/block instead (making it a union) - while on the freelist we don't need to dereference those, right?

I suspect both would also make the block initialization a bit cheaper.

That should also accelerate SlabBlockGetChunk(), which currently shows up as an imul, which isn't exactly fast either (and uses a lot of execution ports).

3) Every free/alloc needing to unlink from slab->freelist[i] and then relink shows up prominently in profiles. That's ~tripling the number of cachelines touched in the happy path, with unpredictable accesses to boot.

Perhaps we could reduce the precision of slab->freelist indexing to amortize that cost? I.e. not have one slab->freelist entry for each nfree, but instead have an upper limit on the number of freelists?

4) Less of a performance, and more of a usability issue: The constant block size strikes me as problematic. Most users of an allocator sometimes deal with a small amount of data, and sometimes with a large amount.

Greetings,

Andres Freund

[1] https://www.agner.org/optimize/instruction_tables.pdf
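To make the second idea concrete, here is a minimal sketch of storing the freelist link in the chunk header union. All names here (SketchChunk, SketchBlock, sketch_free) are hypothetical, not the actual slab.c definitions:

    #include <stddef.h>

    struct SketchBlock;

    typedef struct SketchChunk
    {
        union
        {
            struct SketchBlock *block;  /* owning block, valid while allocated */
            struct SketchChunk *next;   /* freelist link, valid while free */
        }           hdr;
        /* chunk payload follows the header */
    } SketchChunk;

    typedef struct SketchBlock
    {
        SketchChunk *freehead;          /* head of this block's free chunks */
        int          nfree;             /* number of free chunks */
    } SketchBlock;

    /* Freeing becomes two pointer stores - no 64-bit division required. */
    static inline void
    sketch_free(SketchBlock *block, SketchChunk *chunk)
    {
        chunk->hdr.next = block->freehead;
        block->freehead = chunk;
        block->nfree++;
    }

While a chunk sits on the freelist its block pointer is dead anyway, so the union costs nothing beyond the already pointer-sized header.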
Hi,

On 2021-07-17 12:43:33 -0700, Andres Freund wrote:
> 2) SlabChunkIndex() in SlabFree() is slow. It requires a 64-bit division,
> taking up ~50% of the cycles in SlabFree(). A 64-bit div, according to
> [1], has a latency of 35-88 cycles on skylake-x (and a reciprocal
> throughput of 21-83, i.e. no parallelism). While it's getting a bit
> faster on icelake / zen 3, it's still slow enough there to be very
> worrisome.
>
> I don't see a way to get around the division while keeping the freelist
> structure as is. But: ISTM that we only need the index because of the
> free-chunk list, right? Why don't we make the chunk list use actual
> pointers? Is the concern that that'd increase the minimum allocation
> size? If so, I see two ways around that: First, we could make the index
> just the offset from the start of the block; that's much cheaper to
> calculate. Second, we could store the next pointer in
> SlabChunk->slab/block instead (making it a union) - while on the
> freelist we don't need to dereference those, right?
>
> I suspect both would also make the block initialization a bit cheaper.
>
> That should also accelerate SlabBlockGetChunk(), which currently shows
> up as an imul, which isn't exactly fast either (and uses a lot of
> execution ports).

Oh - I just saw that effectively the allocation size already is a uintptr_t at minimum. I had only seen

    /* Make sure the linked list node fits inside a freed chunk */
    if (chunkSize < sizeof(int))
        chunkSize = sizeof(int);

but it's followed by

    /* chunk, including SLAB header (both addresses nicely aligned) */
    fullChunkSize = sizeof(SlabChunk) + MAXALIGN(chunkSize);

which means we are reserving enough space for a pointer on just about any platform already. Seems we can just make that official and reserve space for a pointer as part of rounding up the chunk size, instead of via fullChunkSize?

Greetings,

Andres Freund
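A hedged sketch of making that official, mirroring the two snippets quoted above (MAXALIGN as in PostgreSQL's c.h; the sizeof(void *) bound and adjusted comment are the assumed change, not committed code):

    /* Make sure a freelist pointer fits inside a freed chunk */
    if (chunkSize < sizeof(void *))
        chunkSize = sizeof(void *);

    /* chunk, including SLAB header (both addresses nicely aligned) */
    fullChunkSize = sizeof(SlabChunk) + MAXALIGN(chunkSize);

Since MAXALIGN already rounds up to at least pointer alignment on common platforms, this typically doesn't grow fullChunkSize at all - it just documents the reservation.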
Hi,

On 7/17/21 9:43 PM, Andres Freund wrote:
> Hi,
>
> I just tried to use the slab allocator for a case where aset.c was
> bloating memory usage substantially. First: It worked wonders for memory
> usage, nearly eliminating overhead.
>
> But it turned out to cause a *substantial* slowdown. With aset the
> allocator is barely in the profile. With slab the profile is dominated
> by allocator performance.
>
> slab:
>
> NOTICE:  00000: 100000000 ordered insertions in 5.216287 seconds, 19170724/sec
> LOCATION:  bfm_test_insert_bulk, radix.c:2880
>
>   Overhead  Command   Shared Object     Symbol
> +   28.27%  postgres  postgres          [.] SlabAlloc
> +    9.64%  postgres  bdbench.so        [.] bfm_delete
> +    9.03%  postgres  bdbench.so        [.] bfm_set
> +    8.39%  postgres  bdbench.so        [.] bfm_lookup
> +    8.36%  postgres  bdbench.so        [.] bfm_set_leaf.constprop.0
> +    8.16%  postgres  libc-2.31.so      [.] __memmove_avx_unaligned_erms
> +    6.88%  postgres  bdbench.so        [.] bfm_delete_leaf
> +    3.24%  postgres  libc-2.31.so      [.] _int_malloc
> +    2.58%  postgres  bdbench.so        [.] bfm_tests
> +    2.33%  postgres  postgres          [.] SlabFree
> +    1.29%  postgres  libc-2.31.so      [.] _int_free
> +    1.09%  postgres  libc-2.31.so      [.] unlink_chunk.constprop.0
>
> aset:
>
> NOTICE:  00000: 100000000 ordered insertions in 2.082602 seconds, 48016848/sec
> LOCATION:  bfm_test_insert_bulk, radix.c:2880
>
> +   16.43%  postgres  bdbench.so        [.] bfm_lookup
> +   15.38%  postgres  bdbench.so        [.] bfm_delete
> +   12.82%  postgres  libc-2.31.so      [.] __memmove_avx_unaligned_erms
> +   12.65%  postgres  bdbench.so        [.] bfm_set
> +   12.15%  postgres  bdbench.so        [.] bfm_set_leaf.constprop.0
> +   10.57%  postgres  bdbench.so        [.] bfm_delete_leaf
> +    4.05%  postgres  bdbench.so        [.] bfm_tests
> +    2.93%  postgres  [kernel.vmlinux]  [k] clear_page_erms
> +    1.59%  postgres  postgres          [.] AllocSetAlloc
> +    1.15%  postgres  bdbench.so        [.] memmove@plt
> +    1.06%  postgres  bdbench.so        [.] bfm_grow_leaf_16
>
> OS:
>
> NOTICE:  00000: 100000000 ordered insertions in 2.089790 seconds, 47851690/sec
> LOCATION:  bfm_test_insert_bulk, radix.c:2880
>
> That is somewhat surprising - part of the promise of a slab allocator is
> that it's fast...
>
> This is caused by multiple issues, I think, some of which seem fairly
> easy to fix.
>
> 1) If allocations are short-lived, slab.c can end up constantly
> freeing/initializing blocks, which requires fairly expensively iterating
> over all potential chunks in the block and initializing them - just to
> then free that memory again after a small number of allocations. The
> extreme case of this is when there are phases of alloc/free of a single
> allocation.
>
> I "fixed" this by adding a few && slab->nblocks > 1 in SlabFree(), and
> the problem was vastly reduced. Instead of a 0.4x slowdown it's 0.88x.
> Of course that only works if the problem is with the only block, so it's
> not a great approach. Perhaps just keeping the last allocated block
> around would work?

+1

I think it makes perfect sense to not free the blocks immediately, and keep one (or a small number) as a cache. I'm not sure why we decided not to have a "keeper" block, but I suspect memory consumption was my main concern at that point. But I never expected the cost to be this high.

> 2) SlabChunkIndex() in SlabFree() is slow. It requires a 64-bit
> division, taking up ~50% of the cycles in SlabFree(). A 64-bit div,
> according to [1], has a latency of 35-88 cycles on skylake-x (and a
> reciprocal throughput of 21-83, i.e. no parallelism). While it's getting
> a bit faster on icelake / zen 3, it's still slow enough there to be very
> worrisome.
>
> I don't see a way to get around the division while keeping the freelist
> structure as is. But: ISTM that we only need the index because of the
> free-chunk list, right? Why don't we make the chunk list use actual
> pointers? Is the concern that that'd increase the minimum allocation
> size? If so, I see two ways around that: First, we could make the index
> just the offset from the start of the block; that's much cheaper to
> calculate. Second, we could store the next pointer in
> SlabChunk->slab/block instead (making it a union) - while on the
> freelist we don't need to dereference those, right?
>
> I suspect both would also make the block initialization a bit cheaper.
>
> That should also accelerate SlabBlockGetChunk(), which currently shows
> up as an imul, which isn't exactly fast either (and uses a lot of
> execution ports).

Hmm, I think you're right we could simply use the pointers, but I have not tried that.

> 3) Every free/alloc needing to unlink from slab->freelist[i] and then
> relink shows up prominently in profiles. That's ~tripling the number of
> cachelines touched in the happy path, with unpredictable accesses to
> boot.
>
> Perhaps we could reduce the precision of slab->freelist indexing to
> amortize that cost? I.e. not have one slab->freelist entry for each
> nfree, but instead have an upper limit on the number of freelists?

Yeah. The purpose of organizing the freelists like this is to prioritize the "more full" blocks when allocating new chunks, in the hope that the "less full" blocks will end up empty and freed faster.

But this is naturally imprecise, and strongly depends on the workload, of course, and I bet for most cases a less precise approach would work just as well.

I'm not sure how exactly the upper limit you propose would work, but perhaps we could group the blocks by nfree ranges, say [0-15], [16-31] and so on. So after the alloc/free we'd calculate the new freelist index as (nfree/16) and only move the block if the index changed. This would reduce the overhead to 1/16, and probably even more in practice.

Of course, we could also say we have e.g. 8 freelists and work the ranges backwards from that - I guess that's what you propose.

> 4) Less of a performance, and more of a usability issue: The constant
> block size strikes me as problematic. Most users of an allocator
> sometimes deal with a small amount of data, and sometimes with a large
> amount.

I doubt this is worth the effort, really. The constant block size makes various places much simpler (both to code and reason about), so this should not make a huge difference in performance. And IMHO the block size is mostly an implementation detail, so I don't see that as a usability issue.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
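A rough sketch of that [0-15], [16-31], ... grouping (types and names hypothetical; the constant-16 division compiles down to a shift, so no expensive div is reintroduced):

    #define SLAB_FREELIST_GROUP 16     /* nfree values per bucket */

    typedef struct SketchBlock
    {
        int         nfree;             /* free chunks in this block */
    } SketchBlock;

    /* Map a block's free-chunk count to a coarse freelist bucket. */
    static inline int
    freelist_index(const SketchBlock *block)
    {
        return block->nfree / SLAB_FREELIST_GROUP;
    }

After an alloc/free, the block would only be relinked when freelist_index() differs from the value computed before the change, so roughly 15 out of 16 operations skip the list manipulation entirely.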
Hi,

On 2021-07-17 22:35:07 +0200, Tomas Vondra wrote:
> On 7/17/21 9:43 PM, Andres Freund wrote:
> > 1) If allocations are short-lived, slab.c can end up constantly
> > freeing/initializing blocks, which requires fairly expensively
> > iterating over all potential chunks in the block and initializing
> > them - just to then free that memory again after a small number of
> > allocations. The extreme case of this is when there are phases of
> > alloc/free of a single allocation.
> >
> > I "fixed" this by adding a few && slab->nblocks > 1 in SlabFree(), and
> > the problem was vastly reduced. Instead of a 0.4x slowdown it's 0.88x.
> > Of course that only works if the problem is with the only block, so
> > it's not a great approach. Perhaps just keeping the last allocated
> > block around would work?
>
> +1
>
> I think it makes perfect sense to not free the blocks immediately, and
> keep one (or a small number) as a cache. I'm not sure why we decided not
> to have a "keeper" block, but I suspect memory consumption was my main
> concern at that point. But I never expected the cost to be this high.

I think one free block might be too low in some cases. It's pretty common to have workloads where the number of allocations is "bursty", and it's imo one case where one might justifiably want to use a slab allocator... Perhaps a portion of a high watermark? Or a portion of the in-use blocks?

Hm. I wonder if we should just not populate the freelist eagerly, to drive down the initialization cost. I.e. have a separate allocation path for chunks that have never been allocated, by having a SlabBlock->free_offset or such.

Sure, it adds a branch to the allocation happy path, but it also makes the first allocation for a chunk cheaper, because there's no need to get the next element from the freelist (adding a likely cache miss). And it should make the allocation of a new block faster by a lot.

> > 2) SlabChunkIndex() in SlabFree() is slow. It requires a 64-bit
> > division, taking up ~50% of the cycles in SlabFree(). A 64-bit div,
> > according to [1], has a latency of 35-88 cycles on skylake-x (and a
> > reciprocal throughput of 21-83, i.e. no parallelism). While it's
> > getting a bit faster on icelake / zen 3, it's still slow enough there
> > to be very worrisome.
> >
> > I don't see a way to get around the division while keeping the
> > freelist structure as is. But: ISTM that we only need the index
> > because of the free-chunk list, right? Why don't we make the chunk
> > list use actual pointers? Is the concern that that'd increase the
> > minimum allocation size? If so, I see two ways around that: First, we
> > could make the index just the offset from the start of the block;
> > that's much cheaper to calculate. Second, we could store the next
> > pointer in SlabChunk->slab/block instead (making it a union) - while
> > on the freelist we don't need to dereference those, right?
> >
> > I suspect both would also make the block initialization a bit cheaper.
> >
> > That should also accelerate SlabBlockGetChunk(), which currently shows
> > up as an imul, which isn't exactly fast either (and uses a lot of
> > execution ports).
>
> Hmm, I think you're right we could simply use the pointers, but I have
> not tried that.

I quickly tried that, and it does seem to improve matters considerably. The block initialization still shows up as expensive, but not as bad. The div and imul are gone (except in an assertion build right now). The list manipulation still is visible.

> > 3) Every free/alloc needing to unlink from slab->freelist[i] and then
> > relink shows up prominently in profiles. That's ~tripling the number
> > of cachelines touched in the happy path, with unpredictable accesses
> > to boot.
> >
> > Perhaps we could reduce the precision of slab->freelist indexing to
> > amortize that cost? I.e. not have one slab->freelist entry for each
> > nfree, but instead have an upper limit on the number of freelists?
>
> Yeah. The purpose of organizing the freelists like this is to prioritize
> the "more full" blocks when allocating new chunks, in the hope that the
> "less full" blocks will end up empty and freed faster.
>
> But this is naturally imprecise, and strongly depends on the workload,
> of course, and I bet for most cases a less precise approach would work
> just as well.
>
> I'm not sure how exactly the upper limit you propose would work, but
> perhaps we could group the blocks by nfree ranges, say [0-15], [16-31]
> and so on. So after the alloc/free we'd calculate the new freelist index
> as (nfree/16) and only move the block if the index changed. This would
> reduce the overhead to 1/16, and probably even more in practice.
>
> Of course, we could also say we have e.g. 8 freelists and work the
> ranges backwards from that - I guess that's what you propose.

Yea, I was thinking something along those lines. As you say, how about there's always at most 8 freelists or such. During initialization we compute a shift that distributes chunksPerBlock from 0 to 8. Then we only need to perform list manipulation if block->nfree >> slab->freelist_shift != --block->nfree >> slab->freelist_shift.

That seems nice from a memory usage POV as well - for small and frequent allocations, needing chunksPerBlock freelists isn't great. The freelists are on a fair number of cachelines right now.

> > 4) Less of a performance, and more of a usability issue: The constant
> > block size strikes me as problematic. Most users of an allocator
> > sometimes deal with a small amount of data, and sometimes with a large
> > amount.
>
> I doubt this is worth the effort, really. The constant block size makes
> various places much simpler (both to code and reason about), so this
> should not make a huge difference in performance. And IMHO the block
> size is mostly an implementation detail, so I don't see that as a
> usability issue.

Hm? It's something the user has to specify, so it's not really an implementation detail. It needs to be specified without sufficient information, as well, since externally one doesn't know how much memory the block header and chunk headers + rounding up will use, so computing a good block size isn't easy. I've wondered whether it should just be a count...

Why do you not think it's relevant for performance? Either one uses a too-large block size and wastes memory, or one ends up losing perf through frequent allocations?

Greetings,

Andres Freund
On 7/17/21 11:14 PM, Andres Freund wrote:
> Hi,
>
> On 2021-07-17 22:35:07 +0200, Tomas Vondra wrote:
>> On 7/17/21 9:43 PM, Andres Freund wrote:
>>> 1) If allocations are short-lived, slab.c can end up constantly
>>> freeing/initializing blocks, which requires fairly expensively
>>> iterating over all potential chunks in the block and initializing
>>> them - just to then free that memory again after a small number of
>>> allocations. The extreme case of this is when there are phases of
>>> alloc/free of a single allocation.
>>>
>>> I "fixed" this by adding a few && slab->nblocks > 1 in SlabFree(), and
>>> the problem was vastly reduced. Instead of a 0.4x slowdown it's 0.88x.
>>> Of course that only works if the problem is with the only block, so
>>> it's not a great approach. Perhaps just keeping the last allocated
>>> block around would work?
>>
>> +1
>>
>> I think it makes perfect sense to not free the blocks immediately, and
>> keep one (or a small number) as a cache. I'm not sure why we decided
>> not to have a "keeper" block, but I suspect memory consumption was my
>> main concern at that point. But I never expected the cost to be this
>> high.
>
> I think one free block might be too low in some cases. It's pretty
> common to have workloads where the number of allocations is "bursty",
> and it's imo one case where one might justifiably want to use a slab
> allocator... Perhaps a portion of a high watermark? Or a portion of the
> in-use blocks?

I think a portion of the watermark would be problematic for cases with one huge transaction - that'll set a high watermark, and we'll keep way too many free blocks. But the portion of in-use blocks might work, I think.

> Hm. I wonder if we should just not populate the freelist eagerly, to
> drive down the initialization cost. I.e. have a separate allocation path
> for chunks that have never been allocated, by having a
> SlabBlock->free_offset or such.
>
> Sure, it adds a branch to the allocation happy path, but it also makes
> the first allocation for a chunk cheaper, because there's no need to get
> the next element from the freelist (adding a likely cache miss). And it
> should make the allocation of a new block faster by a lot.

Not sure what you mean by 'not populate eagerly' so can't comment :-(

>>> 2) SlabChunkIndex() in SlabFree() is slow. It requires a 64-bit
>>> division, taking up ~50% of the cycles in SlabFree(). A 64-bit div,
>>> according to [1], has a latency of 35-88 cycles on skylake-x (and a
>>> reciprocal throughput of 21-83, i.e. no parallelism). While it's
>>> getting a bit faster on icelake / zen 3, it's still slow enough there
>>> to be very worrisome.
>>>
>>> I don't see a way to get around the division while keeping the
>>> freelist structure as is. But: ISTM that we only need the index
>>> because of the free-chunk list, right? Why don't we make the chunk
>>> list use actual pointers? Is the concern that that'd increase the
>>> minimum allocation size? If so, I see two ways around that: First, we
>>> could make the index just the offset from the start of the block;
>>> that's much cheaper to calculate. Second, we could store the next
>>> pointer in SlabChunk->slab/block instead (making it a union) - while
>>> on the freelist we don't need to dereference those, right?
>>>
>>> I suspect both would also make the block initialization a bit cheaper.
>>>
>>> That should also accelerate SlabBlockGetChunk(), which currently shows
>>> up as an imul, which isn't exactly fast either (and uses a lot of
>>> execution ports).
>>
>> Hmm, I think you're right we could simply use the pointers, but I have
>> not tried that.
>
> I quickly tried that, and it does seem to improve matters considerably.
> The block initialization still shows up as expensive, but not as bad.
> The div and imul are gone (except in an assertion build right now). The
> list manipulation still is visible.

Understood. I didn't expect this to be a full solution.

>>> 3) Every free/alloc needing to unlink from slab->freelist[i] and then
>>> relink shows up prominently in profiles. That's ~tripling the number
>>> of cachelines touched in the happy path, with unpredictable accesses
>>> to boot.
>>>
>>> Perhaps we could reduce the precision of slab->freelist indexing to
>>> amortize that cost? I.e. not have one slab->freelist entry for each
>>> nfree, but instead have an upper limit on the number of freelists?
>>
>> Yeah. The purpose of organizing the freelists like this is to
>> prioritize the "more full" blocks when allocating new chunks, in the
>> hope that the "less full" blocks will end up empty and freed faster.
>>
>> But this is naturally imprecise, and strongly depends on the workload,
>> of course, and I bet for most cases a less precise approach would work
>> just as well.
>>
>> I'm not sure how exactly the upper limit you propose would work, but
>> perhaps we could group the blocks by nfree ranges, say [0-15], [16-31]
>> and so on. So after the alloc/free we'd calculate the new freelist
>> index as (nfree/16) and only move the block if the index changed. This
>> would reduce the overhead to 1/16, and probably even more in practice.
>>
>> Of course, we could also say we have e.g. 8 freelists and work the
>> ranges backwards from that - I guess that's what you propose.
>
> Yea, I was thinking something along those lines. As you say, how about
> there's always at most 8 freelists or such. During initialization we
> compute a shift that distributes chunksPerBlock from 0 to 8. Then we
> only need to perform list manipulation if
> block->nfree >> slab->freelist_shift != --block->nfree >> slab->freelist_shift.
>
> That seems nice from a memory usage POV as well - for small and frequent
> allocations, needing chunksPerBlock freelists isn't great. The freelists
> are on a fair number of cachelines right now.

Agreed.

>>> 4) Less of a performance, and more of a usability issue: The constant
>>> block size strikes me as problematic. Most users of an allocator
>>> sometimes deal with a small amount of data, and sometimes with a large
>>> amount.
>>
>> I doubt this is worth the effort, really. The constant block size makes
>> various places much simpler (both to code and reason about), so this
>> should not make a huge difference in performance. And IMHO the block
>> size is mostly an implementation detail, so I don't see that as a
>> usability issue.
>
> Hm? It's something the user has to specify, so it's not really an
> implementation detail. It needs to be specified without sufficient
> information, as well, since externally one doesn't know how much memory
> the block header and chunk headers + rounding up will use, so computing
> a good block size isn't easy. I've wondered whether it should just be a
> count...

I think this is mixing two problems - how to specify the block size, and whether the block size is constant (as in slab) or grows over time (as in allocset).

As for specifying the block size, I agree that maybe setting a chunk count and deriving the bytes from that might be easier to use, for the reasons you mentioned.

But growing the block size seems problematic for long-lived contexts with workloads that change a lot over time - imagine e.g. decoding many small transactions, with one huge transaction mixed in. The one huge transaction will grow the block size, and we'll keep using it forever. But in that case we might have just as well allocated the large blocks from the start, I guess.

> Why do you not think it's relevant for performance? Either one uses a
> too-large block size and wastes memory, or one ends up losing perf
> through frequent allocations?

True. I simply would not expect this to make a huge difference - I may be wrong, and I'm sure there are workloads where it matters. But I still think it's easier to just use larger blocks than to make the slab code more complex.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 2021-07-18 00:46:09 +0200, Tomas Vondra wrote:
> On 7/17/21 11:14 PM, Andres Freund wrote:
> > Hm. I wonder if we should just not populate the freelist eagerly, to
> > drive down the initialization cost. I.e. have a separate allocation
> > path for chunks that have never been allocated, by having a
> > SlabBlock->free_offset or such.
> >
> > Sure, it adds a branch to the allocation happy path, but it also makes
> > the first allocation for a chunk cheaper, because there's no need to
> > get the next element from the freelist (adding a likely cache miss).
> > And it should make the allocation of a new block faster by a lot.
>
> Not sure what you mean by 'not populate eagerly' so can't comment :-(

Instead of populating a linked list with all chunks upon creation of a block - which requires touching a fair bit of memory - keep a per-block pointer (or an offset) into the "unused" area of the block. When allocating from the block and there's still "unused" memory left, use that, instead of bothering with the freelist.

I tried that, and it nearly got slab up to the allocation/freeing performance of aset.c (while winning after allocation, due to the higher memory density).

> > > > 4) Less of a performance, and more of a usability issue: The
> > > > constant block size strikes me as problematic. Most users of an
> > > > allocator sometimes deal with a small amount of data, and
> > > > sometimes with a large amount.
> > >
> > > I doubt this is worth the effort, really. The constant block size
> > > makes various places much simpler (both to code and reason about),
> > > so this should not make a huge difference in performance. And IMHO
> > > the block size is mostly an implementation detail, so I don't see
> > > that as a usability issue.
> >
> > Hm? It's something the user has to specify, so it's not really an
> > implementation detail. It needs to be specified without sufficient
> > information, as well, since externally one doesn't know how much
> > memory the block header and chunk headers + rounding up will use, so
> > computing a good block size isn't easy. I've wondered whether it
> > should just be a count...
>
> I think this is mixing two problems - how to specify the block size, and
> whether the block size is constant (as in slab) or grows over time (as
> in allocset).

That was in response to the "implementation detail" bit solely.

> But growing the block size seems problematic for long-lived contexts
> with workloads that change a lot over time - imagine e.g. decoding many
> small transactions, with one huge transaction mixed in. The one huge
> transaction will grow the block size, and we'll keep using it forever.
> But in that case we might have just as well allocated the large blocks
> from the start, I guess.

I was thinking of capping the growth fairly low. I don't think after a 16x growth or so you're likely to still see allocation performance gains with slab. And I don't think that'd be too bad for decoding - we'd start with a small initial block size, and in many workloads that will be enough, and just the workloads where that doesn't suffice will adapt performance-wise.

And: Medium term I wouldn't expect reorderbuffer.c to stay the only slab.c user...

> > Why do you not think it's relevant for performance? Either one uses a
> > too-large block size and wastes memory, or one ends up losing perf
> > through frequent allocations?
>
> True. I simply would not expect this to make a huge difference - I may
> be wrong, and I'm sure there are workloads where it matters. But I
> still think it's easier to just use larger blocks than to make the slab
> code more complex.

IDK. I'm looking at using slab as part of a radix tree implementation right now, which I'd hope to be used in various different situations. So it's hard to choose the right block size - and it does seem to matter for performance.

Greetings,

Andres Freund
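A minimal sketch of that scheme under assumed names (the real patch may differ in its details):

    #include <stddef.h>

    typedef struct SketchChunk
    {
        struct SketchChunk *next;      /* freelist link, valid while free */
    } SketchChunk;

    typedef struct SketchBlock
    {
        SketchChunk *freehead;         /* chunks that were handed out and freed */
        size_t       unused_off;       /* start of the never-yet-used tail */
        size_t       block_size;       /* total block size, header included */
    } SketchBlock;

    static void *
    sketch_alloc(SketchBlock *block, size_t full_chunk_size)
    {
        /* First recycle explicitly freed chunks, if any. */
        if (block->freehead != NULL)
        {
            SketchChunk *chunk = block->freehead;

            block->freehead = chunk->next;
            return chunk;
        }

        /* Otherwise carve a chunk off the block's untouched tail. */
        if (block->unused_off + full_chunk_size <= block->block_size)
        {
            void       *chunk = (char *) block + block->unused_off;

            block->unused_off += full_chunk_size;
            return chunk;
        }

        return NULL;                   /* block is full */
    }

Block initialization then only has to set freehead and unused_off, instead of threading a linked list through every chunk in the block.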
Hi,

On 2021-07-17 16:10:19 -0700, Andres Freund wrote:
> Instead of populating a linked list with all chunks upon creation of a
> block - which requires touching a fair bit of memory - keep a per-block
> pointer (or an offset) into the "unused" area of the block. When
> allocating from the block and there's still "unused" memory left, use
> that, instead of bothering with the freelist.
>
> I tried that, and it nearly got slab up to the allocation/freeing
> performance of aset.c (while winning after allocation, due to the higher
> memory density).

Combining that with limiting the number of freelists, and some microoptimizations, allocation performance is now on par.

Freeing still seems to be a tad slower, mostly because SlabFree() practically is immediately stalled on fetching the block, whereas AllocSetFree() can happily speculate ahead and do work like computing the freelist index. And then aset only needs to access memory inside the context - which is much more likely to be in cache than a freelist inside a block (there are many more of those).

But that's ok, I think. It's close, and it's only a small share of the overall runtime of my workload...

Greetings,

Andres Freund
On 7/18/21 3:06 AM, Andres Freund wrote:
> Hi,
>
> On 2021-07-17 16:10:19 -0700, Andres Freund wrote:
>> Instead of populating a linked list with all chunks upon creation of a
>> block - which requires touching a fair bit of memory - keep a per-block
>> pointer (or an offset) into the "unused" area of the block. When
>> allocating from the block and there's still "unused" memory left, use
>> that, instead of bothering with the freelist.
>>
>> I tried that, and it nearly got slab up to the allocation/freeing
>> performance of aset.c (while winning after allocation, due to the
>> higher memory density).
>
> Combining that with limiting the number of freelists, and some
> microoptimizations, allocation performance is now on par.
>
> Freeing still seems to be a tad slower, mostly because SlabFree()
> practically is immediately stalled on fetching the block, whereas
> AllocSetFree() can happily speculate ahead and do work like computing
> the freelist index. And then aset only needs to access memory inside the
> context - which is much more likely to be in cache than a freelist
> inside a block (there are many more of those).
>
> But that's ok, I think. It's close, and it's only a small share of the
> overall runtime of my workload...

Sounds great! Thanks for investigating this and for the improvements.

It might be good to do some experiments to see how the changes affect memory consumption for practical workloads. I'm willing to spend some time on that, if needed.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 2021-07-18 19:23:31 +0200, Tomas Vondra wrote:
> Sounds great! Thanks for investigating this and for the improvements.
>
> It might be good to do some experiments to see how the changes affect
> memory consumption for practical workloads. I'm willing to spend some
> time on that, if needed.

I've attached my changes. They're in a rough shape right now, but I think good enough for an initial look.

Greetings,

Andres Freund
Attachment
On Mon, Jul 19, 2021 at 17:56, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2021-07-18 19:23:31 +0200, Tomas Vondra wrote:
> Sounds great! Thanks for investigating this and for the improvements.
>
> It might be good to do some experiments to see how the changes affect memory
> consumption for practical workloads. I'm willing to spend some time on that,
> if needed.
I've attached my changes. They're in a rough shape right now, but I
think good enough for an initial look.
Hi Andres, I took a look.
Perhaps you would agree with me that in the vast majority of cases, malloc will not fail.
So it makes more sense to test:
if (ret != NULL)
than
if (ret == NULL)
which might help branch prediction.
This change also makes it possible to reduce the scope of some variables.
Example:
+static void * pg_noinline
+AllocSetAllocLarge(AllocSet set, Size size, int flags)
+{
+    AllocBlock  block;
+    Size        chunk_size;
+    Size        blksize;
+
+    /* check size, only allocation path where the limits could be hit */
+    MemoryContextCheckSize(&set->header, size, flags);
+
+    AssertArg(AllocSetIsValid(set));
+
+    chunk_size = MAXALIGN(size);
+    blksize = chunk_size + ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ;
+    block = (AllocBlock) malloc(blksize);
+    if (block != NULL)
+    {
+        AllocChunk  chunk;
+
+        set->header.mem_allocated += blksize;
+
+        block->aset = set;
+        block->freeptr = block->endptr = ((char *) block) + blksize;
+
+        /*
+         * Stick the new block underneath the active allocation block, if any,
+         * so that we don't lose the use of the space remaining therein.
+         */
+        if (set->blocks != NULL)
+        {
+            block->prev = set->blocks;
+            block->next = set->blocks->next;
+            if (block->next)
+                block->next->prev = block;
+            set->blocks->next = block;
+        }
+        else
+        {
+            block->prev = NULL;
+            block->next = NULL;
+            set->blocks = block;
+        }
+
+        chunk = (AllocChunk) (((char *) block) + ALLOC_BLOCKHDRSZ);
+        chunk->size = chunk_size;
+
+        return AllocSetAllocReturnChunk(set, size, chunk, chunk_size);
+    }
+
+    return NULL;
+}
regards,
Ranier Vilela
On Tue, 20 Jul 2021 at 10:04, Ranier Vilela <ranier.vf@gmail.com> wrote:
> Perhaps you would agree with me that in the vast majority of cases, malloc will not fail.
> So it makes more sense to test:
> if (ret != NULL)
> than
> if (ret == NULL)

I think it'd be better to use unlikely() for that.

David
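For instance, applied to the excerpt above (unlikely() is PostgreSQL's existing hint macro around __builtin_expect; the early-return shape is just one way to write it):

    block = (AllocBlock) malloc(blksize);
    if (unlikely(block == NULL))
        return NULL;

    /* The happy path continues without extra nesting; the compiler is
     * told the NULL branch is cold and can move it out of line. */
    set->header.mem_allocated += blksize;

This keeps the NULL test in its conventional form while still steering branch layout.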
On Tue, Jul 20, 2021 at 11:15, David Rowley <dgrowleyml@gmail.com> wrote:
On Tue, 20 Jul 2021 at 10:04, Ranier Vilela <ranier.vf@gmail.com> wrote:
> Perhaps you would agree with me that in the vast majority of cases, malloc will not fail.
> So it makes more sense to test:
> if (ret != NULL)
> than
> if (ret == NULL)
I think it'd be better to use unlikely() for that.
Sure, it can be, but in this case, there is no way to reduce the scope.
regards,
Ranier Vilela
Hi,

I spent a bit of time benchmarking this - the first patch adds an extension with three functions, each executing a slightly different allocation pattern:

1) FIFO (allocates and frees in the same order)
2) LIFO (frees in reverse order)
3) random

Each function can also do a custom number of iterations, each allocating and freeing a certain number of chunks.

The bench.sql script executes three combinations (for each pattern):

1) no loops
2) increase: 100 loops, each freeing 10k chunks and allocating 15k
3) decrease: 100 loops, each freeing 10k chunks and allocating 5k

The idea is to test simple one-time allocation, and workloads that mix allocations and frees (to see how the changes affect reuse etc.). The script tests this with a range of block sizes (1kB-32kB) and chunk sizes (32B-512B).

In the attached .ods file with results, the "comparison" sheets are the interesting ones - the last couple of columns compare the main metrics for the two patches (labeled patch-1 and patch-2) to master.

Overall, the results look quite good - patch-1 is mostly on par with master, with maybe 5% variability in both directions. That's expected, considering the patch does not aim to improve performance.

The second patch brings some nice improvements - 30%-50% in most cases (for both allocation and free) seems pretty nice. But for the "increase" FIFO pattern (incrementally allocating/freeing more memory) there's a significant regression - particularly in the allocation time. In some cases (larger chunks; block size does not matter too much) it jumps from 25ms to almost 200ms.

This seems unfortunate - the allocation pattern (FIFO, allocating more memory over time) seems pretty common, and the slowdown is significant.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
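For reference, the FIFO pattern being measured boils down to a loop like the following (a sketch, not the actual extension code; chunk_size, keep and niter stand in for the function's parameters):

    /* Keep `keep` chunks live; repeatedly free the oldest and replace it. */
    void      **live = palloc(sizeof(void *) * keep);
    int         head = 0;

    for (int i = 0; i < keep; i++)
        live[i] = palloc(chunk_size);      /* initial fill */

    for (long i = 0; i < niter; i++)
    {
        pfree(live[head]);                 /* the oldest chunk goes first... */
        live[head] = palloc(chunk_size);   /* ...and is immediately replaced */
        head = (head + 1) % keep;
    }

The LIFO variant differs only in freeing the most recently allocated chunks first, which tends to reuse cache-hot memory and flatters the allocator.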
Hi,

On 2021-08-01 19:59:18 +0200, Tomas Vondra wrote:
> In the attached .ods file with results, the "comparison" sheets are the
> interesting ones - the last couple of columns compare the main metrics
> for the two patches (labeled patch-1 and patch-2) to master.

I assume with patch-1/2 you mean the ones after the benchmark patch itself?

> Overall, the results look quite good - patch-1 is mostly on par with
> master, with maybe 5% variability in both directions. That's expected,
> considering the patch does not aim to improve performance.

Not for slab anyway...

> The second patch brings some nice improvements - 30%-50% in most cases
> (for both allocation and free) seems pretty nice. But for the "increase"
> FIFO pattern (incrementally allocating/freeing more memory) there's a
> significant regression - particularly in the allocation time. In some
> cases (larger chunks; block size does not matter too much) it jumps from
> 25ms to almost 200ms.

I'm not surprised to see memory usage increase some, but that degree of time overhead does surprise me. ISTM there's something wrong.

It'd probably be worth benchmarking the different improvements inside the "WIP: slab performance" patch. There are some that I'd expect to be all-around improvements, whereas others likely aren't quite that clear-cut. I assume you'd prefer that I split the patch up?

> This seems unfortunate - the allocation pattern (FIFO, allocating more
> memory over time) seems pretty common, and the slowdown is significant.

Did you analyze what causes the regressions?

Greetings,

Andres Freund
On 8/1/21 11:07 PM, Andres Freund wrote:
> Hi,
>
> On 2021-08-01 19:59:18 +0200, Tomas Vondra wrote:
>> In the attached .ods file with results, the "comparison" sheets are the
>> interesting ones - the last couple of columns compare the main metrics
>> for the two patches (labeled patch-1 and patch-2) to master.
>
> I assume with patch-1/2 you mean the ones after the benchmark patch
> itself?

Yes, those are the two WIP patches you shared on 19/7.

>> Overall, the results look quite good - patch-1 is mostly on par with
>> master, with maybe 5% variability in both directions. That's expected,
>> considering the patch does not aim to improve performance.
>
> Not for slab anyway...

Maybe the hot/cold separation could have some effect, but probably not for the workloads I've tested.

>> The second patch brings some nice improvements - 30%-50% in most cases
>> (for both allocation and free) seems pretty nice. But for the
>> "increase" FIFO pattern (incrementally allocating/freeing more memory)
>> there's a significant regression - particularly in the allocation time.
>> In some cases (larger chunks; block size does not matter too much) it
>> jumps from 25ms to almost 200ms.
>
> I'm not surprised to see memory usage increase some, but that degree of
> time overhead does surprise me. ISTM there's something wrong.

Yeah, the higher amount of allocated memory is due to the couple of fields added to the SlabBlock struct, but even that only affects a single case with 480B chunks and 1kB blocks. Seems fine to me, especially if we end up growing the slab blocks. Not sure about the allocation time, though.

> It'd probably be worth benchmarking the different improvements inside
> the "WIP: slab performance" patch. There are some that I'd expect to be
> all-around improvements, whereas others likely aren't quite that
> clear-cut. I assume you'd prefer that I split the patch up?

Yeah, if you split that patch into sensible parts, I'll benchmark those. Also, we can add more interesting workloads if you have some ideas.

>> This seems unfortunate - the allocation pattern (FIFO, allocating more
>> memory over time) seems pretty common, and the slowdown is significant.
>
> Did you analyze what causes the regressions?

No, not yet. I'll run the same set of benchmarks for the Generation context, discussed in the other thread, and then I'll investigate this. But if you split the patch, that'd probably help.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
FWIW I tried running the benchmarks again, with some minor changes in the extension code - most importantly, the time is now counted in microseconds (instead of milliseconds). I suspected the coarser unit might have been causing rounding errors (essentially not counting anything below 1ms, because it rounds to 0), and the results are indeed a bit different.

On the i5-2500k machine it's an improvement across the board, while on the bigger Xeon e5-2620v3 machine it shows roughly the same regression for the "decreasing" allocation pattern.

There's another issue in the benchmarking script - the queries are meant to do multiple runs for each combination of parameters, but they're written in a way that simply runs each one once and then does a cross product with generate_series(1,5). I'll look into fixing that, but judging by the stability of results for similar chunk sizes it won't change much.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
Hi,

I've been investigating the regressions in some of the benchmark results, together with the generation context benchmarks [1].

Turns out it's pretty difficult to benchmark this, because the results strongly depend on what the backend did before. For example if I run slab_bench_fifo with the "decreasing" test for 32kB blocks and 512B chunks, I get this:

    select * from slab_bench_fifo(1000000, 32768, 512, 100, 10000, 5000);

     mem_allocated | alloc_ms | free_ms
    ---------------+----------+---------
         528547840 |   155394 |   87440

i.e. palloc() takes ~155ms and pfree() ~87ms (and these results are stable; the numbers don't change much with more runs).

But if I run a set of "lifo" tests in the backend first, the results look like this:

     mem_allocated | alloc_ms | free_ms
    ---------------+----------+---------
         528547840 |    41728 |   71524
    (1 row)

so the pallocs are suddenly about ~4x faster. Clearly, what the backend did before may have a pretty dramatic impact on results, even for simple benchmarks like this.

Note: The benchmark was a single SQL script, running all the different workloads in the same backend.

I did a fair amount of perf profiling, and the main difference between the slow and fast runs seems to be this:

                 0      page-faults:u
                 0      minor-faults:u
                 0      major-faults:u

vs

        20,634,153      page-faults:u
        20,634,153      minor-faults:u
                 0      major-faults:u

Attached is a more complete perf stat output, but the page faults seem to be the main issue.

My theory is that in the "fast" case, the past backend activity puts the glibc memory management into a state that prevents page faults in the benchmark. But of course, this theory may be incomplete - for example it's not clear why running the benchmark repeatedly would not "condition" the backend the same way. But it doesn't - it's ~150ms even for repeated runs. Secondly, I'm not sure this explains why some of the timings actually got much slower with the 0003 patch, when the sequence of the steps is still the same. Of course, it's possible 0003 changes the allocation pattern a bit, interfering with glibc memory management.

This leads to a couple of interesting questions, I think:

1) I've only tested this on Linux, with glibc. I wonder how it'd behave on other platforms, or with other allocators.

2) Which cases are more important? When the backend was warmed up, or when each benchmark runs in a new backend? It seems the "new backend" case is something like a "worst case" leading to more page faults, so maybe that's the thing to watch. OTOH it's unlikely to have a completely new backend, so maybe not.

3) Can this teach us something about how to allocate stuff, to better "prepare" the backend for future allocations? For example, it's a bit strange that repeated runs of the same benchmark don't do the trick, for some reason.

regards

[1] https://www.postgresql.org/message-id/bcdd4e3e-c12d-cd2b-7ead-a91ad416100a%40enterprisedb.com

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
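The page-fault counters shown above are stock perf software events; an invocation along these lines (the exact one used isn't shown in the thread) attaches to a running backend:

    $ perf stat -e page-faults,minor-faults,major-faults -p <backend pid>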
On Sat, 11 Sept 2021 at 09:07, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> I've been investigating the regressions in some of the benchmark
> results, together with the generation context benchmarks [1].

I've not looked into the regression you found with this yet, but I did rebase the patch. slab.c has seen quite a number of changes recently.

I didn't spend a lot of time checking over the patch. I mainly wanted to see what the performance was like before reviewing in too much detail.

To test the performance, I used [1] and ran:

    select pg_allocate_memory_test(<nbytes>, 1024*1024, 10::bigint*1024*1024*1024, 'slab');

That basically allocates chunks of <nbytes>, keeps around 1MB of them at a time, and allocates a total of 10GB of them. I saw:

Master:
16 byte chunk = 8754.678 ms
32 byte chunk = 4511.725 ms
64 byte chunk = 2244.885 ms
128 byte chunk = 1135.349 ms
256 byte chunk = 548.030 ms
512 byte chunk = 272.017 ms
1024 byte chunk = 144.618 ms

Master + attached patch:
16 byte chunk = 5255.974 ms
32 byte chunk = 2640.807 ms
64 byte chunk = 1328.949 ms
128 byte chunk = 668.078 ms
256 byte chunk = 330.564 ms
512 byte chunk = 166.844 ms
1024 byte chunk = 85.399 ms

So the patched version runs in about 60% of the time that master runs in.

I plan to look at the patch in a bit more detail and see if I can recreate and figure out the regression that Tomas reported. For now, I just want to share the rebased patch.

The only thing I really adjusted from Andres' version is that instead of using pointers for the linked-list block freelist, I made it store the chunk's offset in bytes from the start of the block. This means we can use 4 bytes instead of 8 bytes for these links. The block size is limited to 1GB now anyway, so 32 bits is large enough for these offsets.

David

[1] https://www.postgresql.org/message-id/attachment/137056/allocate_performance_functions.patch.txt
Attachment
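A sketch of that offset representation (helper names hypothetical; uint32_t rather than PostgreSQL's uint32 to keep the snippet self-contained):

    #include <stdint.h>

    /*
     * Freelist links stored as byte offsets from the block start: 4 bytes
     * each even on 64-bit platforms, and trivially range-checkable.
     */
    static inline uint32_t
    chunk_to_offset(void *block, void *chunk)
    {
        return (uint32_t) ((char *) chunk - (char *) block);
    }

    static inline void *
    offset_to_chunk(void *block, uint32_t off)
    {
        return (char *) block + off;
    }

With blocks capped at 1GB, 30 bits suffice, so the offset always fits in 32 bits.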
On Wed, Oct 12, 2022 at 4:37 PM David Rowley <dgrowleyml@gmail.com> wrote:
> [v3]
+ /*
+ * Compute a shift that guarantees that shifting chunksPerBlock with it
+ * yields a value smaller than SLAB_FREELIST_COUNT - 1 (one freelist is used for full blocks).
+ */
+ slab->freelist_shift = 0;
+ while ((slab->chunksPerBlock >> slab->freelist_shift) >= (SLAB_FREELIST_COUNT - 1))
+ slab->freelist_shift++;
+ /*
+ * Ensure, without a branch, that index 0 is only used for blocks entirely
+ * without free chunks.
+ * XXX: There probably is a cheaper way to do this. Needing to shift twice
+ * by slab->freelist_shift isn't great.
+ */
+ index = (freecount + (1 << slab->freelist_shift) - 1) >> slab->freelist_shift;
How about something like
#define SLAB_FREELIST_COUNT ((1<<3) + 1)
index = (freecount & (SLAB_FREELIST_COUNT - 2)) + (freecount != 0);
and dispense with both freelist_shift and the loop that computes it?
--
John Naylor
EDB: http://www.enterprisedb.com
On Fri, 11 Nov 2022 at 22:20, John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Wed, Oct 12, 2022 at 4:37 PM David Rowley <dgrowleyml@gmail.com> wrote:
> > [v3]
>
> + /*
> + * Compute a shift that guarantees that shifting chunksPerBlock with it
> + * yields a value smaller than SLAB_FREELIST_COUNT - 1 (one freelist is used for full blocks).
> + */
> + slab->freelist_shift = 0;
> + while ((slab->chunksPerBlock >> slab->freelist_shift) >= (SLAB_FREELIST_COUNT - 1))
> + slab->freelist_shift++;
>
> + /*
> + * Ensure, without a branch, that index 0 is only used for blocks entirely
> + * without free chunks.
> + * XXX: There probably is a cheaper way to do this. Needing to shift twice
> + * by slab->freelist_shift isn't great.
> + */
> + index = (freecount + (1 << slab->freelist_shift) - 1) >> slab->freelist_shift;
>
> How about something like
>
> #define SLAB_FREELIST_COUNT ((1<<3) + 1)
> index = (freecount & (SLAB_FREELIST_COUNT - 2)) + (freecount != 0);

Doesn't this create a sort of round-robin use of the free list? What we want is a sort of "histogram" bucket set of free lists, so we can group together blocks that have a close-enough number of free chunks. Unless I'm mistaken, I think what you have doesn't do that.

I wondered if simply:

index = -(-freecount >> slab->freelist_shift);

would be faster than Andres' version. I tried it out, and on my AMD machine it's about the same speed. Same on a Raspberry Pi 4.

Going by [2], the instructions are very different with each method, so other machines with different latencies on those instructions might show something different. I attached what I used to test, if anyone else wants a go.

AMD Zen2
$ ./freecount 2000000000
Test 'a' in 0.922766 seconds
Test 'd' in 0.922762 seconds (0.000433% faster)

RPI4
$ ./freecount 2000000000
Test 'a' in 3.341350 seconds
Test 'd' in 3.338690 seconds (0.079672% faster)

That was gcc. Trying it with clang, it went in a little heavy-handed and optimized out my loop, so some more trickery might be needed for a useful test on that compiler.

David

[2] https://godbolt.org/z/dh95TohEG
Attachment
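For the record, the expression works because -(-x >> s) is a ceiling division by 2^s - the same mapping as Andres' add-then-shift, including sending only freecount = 0 to bucket 0. A sketch, with the portability caveat that comes up below:

    /*
     * -(-x >> s) == ceil(x / 2^s) for x >= 0, assuming two's complement and
     * an arithmetic right shift of negative values (the C standard leaves
     * the latter implementation-defined). With s = 4:
     * x = 0 -> 0, x = 1..16 -> 1, x = 17..32 -> 2, and so on.
     */
    static inline int
    ceil_shift(int x, int s)
    {
        return -(-x >> s);
    }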
On Mon, Dec 5, 2022 at 3:02 PM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Fri, 11 Nov 2022 at 22:20, John Naylor <john.naylor@enterprisedb.com> wrote:
> > #define SLAB_FREELIST_COUNT ((1<<3) + 1)
> > index = (freecount & (SLAB_FREELIST_COUNT - 2)) + (freecount != 0);
>
> Doesn't this create a sort of round-robin use of the free list? What
> we want is a sort of "histogram" bucket set of free lists so we can
> group together blocks that have a close-enough free number of chunks.
> Unless I'm mistaken, I think what you have doesn't do that.
The intent must have slipped my mind along the way.
> I wondered if simply:
>
> index = -(-freecount >> slab->freelist_shift);
>
> would be faster than Andres' version. I tried it out and on my AMD
> machine, it's about the same speed. Same on a Raspberry Pi 4.
>
> Going by [2], the instructions are very different with each method, so
> other machines with different latencies on those instructions might
> show something different. I attached what I used to test if anyone
> else wants a go.
I get about 0.1% difference on my machine. Both ways boil down to (on gcc) 3 instructions with low latency. The later ones need the prior results to execute, which I think is what the XXX comment "isn't great" was referring to. The new coding is more mysterious (does it do the right thing on all platforms?), so I guess the original is still the way to go unless we get a better idea.
--
John Naylor
EDB: http://www.enterprisedb.com
On Fri, Sep 10, 2021 at 5:07 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> Turns out it's pretty difficult to benchmark this, because the results
> strongly depend on what the backend did before.

What you report here seems to be mostly cold-cache effects, with which I don't think we need to be overly concerned. We don't want big regressions in the cold-cache case, but there is always going to be some overhead when a new backend starts up, because you've got to fault some pages into the heap/malloc arena/whatever before they can be efficiently accessed.

What would be more concerning is if we found out that the performance depended heavily on the internal state of the allocator. For example, suppose you have two warmup routines W1 and W2, each of which touches the same amount of total memory, but with different details. Then you have a benchmark B. If you do W1-B and W2-B and the time for B varies dramatically between them, then you've maybe got an issue. For instance, it could indicate that the allocator has issues when the old and new allocations are very different sizes, or something like that.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Mon, 5 Dec 2022 at 23:18, John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Mon, Dec 5, 2022 at 3:02 PM David Rowley <dgrowleyml@gmail.com> wrote:
> > Going by [2], the instructions are very different with each method, so
> > other machines with different latencies on those instructions might
> > show something different. I attached what I used to test if anyone
> > else wants a go.
>
> I get about 0.1% difference on my machine. Both ways boil down to (on gcc) 3 instructions with low latency. The later ones need the prior results to execute, which I think is what the XXX comment "isn't great" was referring to. The new coding is more mysterious (does it do the right thing on all platforms?), so I guess the original is still the way to go unless we get a better idea.

I don't think it would work well on a one's complement machine, but I don't think we support those, going by the comments above RIGHTMOST_ONE in bitmapset.c. In any case, I found that it wasn't any faster than what Andres wrote. In fact, even changing the code there to "index = (freecount > 0);" seems to do very little to increase performance.

I do see that having 3 freelist items performs a decent amount better than having 9. However, which workload is run may alter the result of that, assuming that keeping new allocations on fuller blocks is a winning strategy for the CPU's caches.

I've now done quite a bit more work on Andres' patch to try and get it into (hopefully) somewhere close to a committable shape. I'm fairly happy with what's there now. It does seem to perform much better than current master, especially so when the workload would have caused master to continually malloc and free an entire block when freeing and allocating a single chunk.

I've done some basic benchmarking, mostly using the attached alloc_bench patch. If I run:

select *,
       round(slab_result / aset_result * 100 - 100, 1)::text || '%' as slab_slower_by
from (
    select chunk_size,
           keep_chunks,
           sum(pg_allocate_memory_test(chunk_size, chunk_size * keep_chunks,
               1024*1024*1024, 'slab'))::numeric(1000,3) as slab_result,
           sum(pg_allocate_memory_test(chunk_size, chunk_size * keep_chunks,
               1024*1024*1024, 'aset'))::numeric(1000,3) as aset_result
    from (values(1),(10),(50),(100),(200),(300),(400),(500),(1000),(2000),
                 (3000),(4000),(5000),(10000)) v1(keep_chunks),
         (values(64),(128),(256),(512),(1024)) v2(chunk_size)
    group by rollup(1,2)
);

The results for the 64-byte chunk are shown in the attached chart. It's not as fast as aset, but much faster than master's slab.c. The first blue bar of the chart is well above the vertical axis: it took master 1118958 milliseconds for that test; the attached patch took 194 ms. The rest of the tests seem to put the patched code somewhere in the middle between the unpatched code and aset.c's performance.

The benchmark I did was entirely a FIFO workload that keeps around "keep_chunks" at once before starting to free the oldest chunks. I know Tomas has some LIFO and random benchmarks. I edited that code [1] a little to add support for other context types so that a comparison could be done more easily. However, I'm getting very weird performance results where sometimes it runs about twice as fast (Tomas mentioned he got this too). I'm not seeing that with my own benchmarking function, so I'm wondering if there's something weird going on with the benchmark itself rather than the slab.c code.

I've likely made many more changes than I can list here, but here are a few of the major ones:

1. Make use of dclist for empty blocks (sketched below).
2. In SlabAlloc(), allocate chunks from the freelist before the unused list.
3. Added output showing information about empty blocks in the SlabStats output.
4. Renamed the context freelists to blocklist. I found the old name was likely just confusing things with the block-level freelist. In any case, it seemed weird to have freelist[0] store full blocks. Not much free there! I renamed it to blocklist[].
5. I did a round of changing the order of the fields in SlabBlock. This seems to affect performance quite a bit. Having the context first seems to improve performance. Having the blocklist[] node last also helps.
6. Removed nblocks and headerSize from SlabContext. headerSize is no longer needed. nblocks was only really used for Asserts and SlabIsEmpty. I changed the Asserts to use a local count of blocks and changed SlabIsEmpty to look at the context's mem_allocated.
7. There's now no integer division in any of the alloc and free code. The only time we divide by fullChunkSize is in the check function.
8. When using a block from the emptyblock list, I changed the code to not re-init the block. It's now used as it was left previously, which means no longer having to build the freelist again.
9. Updated all comments to reflect the current state of the code.

Some things I thought about but didn't do:

a. Change the size of SlabContext's chunkSize, fullChunkSize and blockSize to be uint32 instead of Size. It might be possible to get SlabContext below 128 bytes with a bit more work.
b. I could have done a bit more experimentation with unlikely() and likely() to move less frequently accessed code off into a cold area.

For #2 above, I didn't really see much change in performance when I swapped the order of what we allocate from first. I expected free chunks would be better as they've been used and are seemingly more likely to be in some CPU cache than one of the unused chunks. I might need a different allocation pattern than the one I used to highlight that fact though.

David

[1] https://github.com/david-rowley/postgres/tree/alloc_bench_contrib
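For item 1, a minimal sketch of the empty-block caching (illustrative struct and function names, not the committed slab.c definitions), using the counted doubly-linked list API from lib/ilist.h. Combined with item 8, a cached block keeps its chunk freelist intact, so reuse skips re-initialization:

#include "postgres.h"
#include "lib/ilist.h"

typedef struct SketchBlock
{
    dlist_node  node;           /* links block into a blocklist */
    /* chunk freelist head, free count, etc. would follow */
} SketchBlock;

typedef struct SketchContext
{
    dclist_head emptyblocks;    /* wholly-empty blocks kept for reuse */
} SketchContext;

/* free side: a block that just became empty is cached, not freed */
static void
sketch_cache_empty_block(SketchContext *cxt, SketchBlock *block)
{
    dclist_push_head(&cxt->emptyblocks, &block->node);
}

/* alloc side: prefer a cached block; fall back to a fresh malloc */
static SketchBlock *
sketch_get_block(SketchContext *cxt)
{
    if (!dclist_is_empty(&cxt->emptyblocks))
        return dlist_container(SketchBlock, node,
                               dclist_pop_head_node(&cxt->emptyblocks));
    return NULL;                /* caller allocates and initializes */
}

Since a dclist tracks its own length, it presumably also lets the SlabStats output of item 3 report the number of empty blocks without walking the list.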
On Sat, Dec 10, 2022 at 11:02 AM David Rowley <dgrowleyml@gmail.com> wrote:
> [v4]
Thanks for working on this!
I ran an in-situ benchmark using the v13 radix tree patchset ([1] WIP, but should be useful enough for testing allocation speed), applying only the first five patches, which are local-memory only. The benchmark is not meant to represent a realistic workload, and primarily stresses traversal and allocation of the smallest node type. Minimum of five runs, with turbo-boost off, on recent Intel laptop hardware:
v13-0001 to 0005:
# select * from bench_load_random_int(500 * 1000);
 mem_allocated | load_ms
---------------+---------
     151123432 |     222
47.06% postgres postgres [.] rt_set
22.89% postgres postgres [.] SlabAlloc
9.65% postgres postgres [.] rt_node_insert_inner.isra.0
5.94% postgres [unknown] [k] 0xffffffffb5e011b7
3.62% postgres postgres [.] MemoryContextAlloc
2.70% postgres libc.so.6 [.] __memmove_avx_unaligned_erms
2.60% postgres postgres [.] SlabFree
+ v4 slab:
# select * from bench_load_random_int(500 * 1000);
 mem_allocated | load_ms
---------------+---------
     152463112 |     213
52.42% postgres postgres [.] rt_set
12.80% postgres postgres [.] SlabAlloc
9.38% postgres postgres [.] rt_node_insert_inner.isra.0
7.87% postgres [unknown] [k] 0xffffffffb5e011b7
4.98% postgres postgres [.] SlabFree
While allocation is markedly improved, freeing looks worse here. The proportion is surprising because only about 2% of nodes are freed during the load, but doing that takes up 10-40% of the time compared to allocating.
num_keys = 500000, height = 7
n4 = 2501016, n15 = 56932, n32 = 270, n125 = 0, n256 = 257
Side note: I don't recall ever seeing vsyscall (I think that's what the 0xffffffffb5e011b7 address refers to) in a profile, so I'm not sure what is happening there.
[1] https://www.postgresql.org/message-id/CAFBsxsHNE621mGuPhd7kxaGc22vMkoSu7R4JW9Zan1jjorGy3g%40mail.gmail.com
--
John Naylor
EDB: http://www.enterprisedb.com
Thanks for testing the patch.

On Mon, 12 Dec 2022 at 20:14, John Naylor <john.naylor@enterprisedb.com> wrote:
> v13-0001 to 0005:
> 2.60% postgres postgres [.] SlabFree
>
> + v4 slab:
> 4.98% postgres postgres [.] SlabFree
>
> While allocation is markedly improved, freeing looks worse here. The proportion is surprising because only about 2% of nodes are freed during the load, but doing that takes up 10-40% of the time compared to allocating.

I've tried to reproduce this with the v13 patches applied and I'm not really getting the same as you are. To run the function 100 times I used:

select x, a.* from generate_series(1,100) x(x), lateral
    (select * from bench_load_random_int(500 * 1000 * (1+x-x))) a;

(I had to add the * (1+x-x) to create a lateral dependency and stop the function just being executed once.)

v13-0001 - 0005 gives me:

37.71%  postgres  [.] rt_set
19.24%  postgres  [.] SlabAlloc
 8.73%  [kernel]  [k] clear_page_rep
 5.21%  postgres  [.] rt_node_insert_inner.isra.0
 2.63%  [kernel]  [k] asm_exc_page_fault
 2.24%  postgres  [.] SlabFree

and fairly consistently 122 ms runtime per call. Applying the v4 slab patch I get:

41.06%  postgres  [.] rt_set
10.84%  postgres  [.] SlabAlloc
 9.01%  [kernel]  [k] clear_page_rep
 6.49%  postgres  [.] rt_node_insert_inner.isra.0
 2.76%  postgres  [.] SlabFree

and fairly consistently 112 ms per call.

I wonder if you can consistently get the same result on another compiler, or after patching something like master~50 or master~100. Maybe it's just a code alignment thing.

Looking at the annotation of perf report for SlabFree with the patched version, I see:

      │      /* push this chunk onto the head of the free list */
      │      *(MemoryChunk **) pointer = block->freehead;
 0.09 │      mov    0x10(%r8),%rax
      │      slab = block->slab;
59.15 │      mov    (%r8),%rbp
      │      *(MemoryChunk **) pointer = block->freehead;
 9.43 │      mov    %rax,(%rdi)
      │      block->freehead = chunk;
      │
      │      block->nfree++;

I think what that's telling me is that dereferencing the block's memory is slow, likely due to that particular cache line no longer being cached. I tried running the test with 10,000 ints instead of 500,000 so that there would be less CPU cache pressure. I see:

29.76 │      mov    (%r8),%rbp
      │      *(MemoryChunk **) pointer = block->freehead;
12.72 │      mov    %rax,(%rdi)
      │      block->freehead = chunk;
      │
      │      block->nfree++;
      │      mov    0x8(%r8),%eax
      │      block->freehead = chunk;
 4.27 │      mov    %rdx,0x10(%r8)
      │      SlabBlocklistIndex():
      │      index = (nfree + (1 << blocklist_shift) - 1) >> blocklist_shift;
      │      mov    $0x1,%edx
      │      SlabFree():
      │      block->nfree++;
      │      lea    0x1(%rax),%edi
      │      mov    %edi,0x8(%r8)
      │      SlabBlocklistIndex():
      │      int32 blocklist_shift = slab->blocklist_shift;
      │      mov    0x70(%rbp),%ecx
      │      index = (nfree + (1 << blocklist_shift) - 1) >> blocklist_shift;
 8.46 │      shl    %cl,%edx

Various other instructions in SlabFree() are proportionally taking longer now; for example, the bit shift at the end was insignificant previously. That indicates to me that this is due to caching effects.

We must fetch the block in SlabFree() in both versions. It's possible that something is going on in SlabAlloc() that is causing more useful cache lines to be evicted, but (I think) one of the primary design goals Andres was going for was to reduce exactly that. For example, not having to write out the freelist for an entire block when the block is first allocated means no longer having to load possibly every cache line in the block.

I tried looking at perf stat during the run. Without slab changes:

drowley@amd3990x:~$ sudo perf stat --pid=74922 sleep 2

 Performance counter stats for process id '74922':

          2,000.74 msec task-clock               #    1.000 CPUs utilized
                 4      context-switches         #    1.999 /sec
                 0      cpu-migrations           #    0.000 /sec
           578,139      page-faults              #  288.963 K/sec
     8,614,687,392      cycles                   #    4.306 GHz                     (83.21%)
       682,574,688      stalled-cycles-frontend  #    7.92% frontend cycles idle    (83.33%)
     4,822,904,271      stalled-cycles-backend   #   55.98% backend cycles idle     (83.41%)
    11,447,124,105      instructions             #    1.33  insn per cycle
                                                 #    0.42  stalled cycles per insn (83.41%)
     1,947,647,575      branches                 #  973.464 M/sec                   (83.41%)
        13,914,897      branch-misses            #    0.71% of all branches         (83.24%)

       2.000924020 seconds time elapsed

With slab changes:

drowley@amd3990x:~$ sudo perf stat --pid=75967 sleep 2

 Performance counter stats for process id '75967':

          2,000.89 msec task-clock               #    1.000 CPUs utilized
                 1      context-switches         #    0.500 /sec
                 0      cpu-migrations           #    0.000 /sec
           607,423      page-faults              #  303.576 K/sec
     8,566,091,176      cycles                   #    4.281 GHz                     (83.21%)
       737,839,390      stalled-cycles-frontend  #    8.61% frontend cycles idle    (83.32%)
     4,454,357,725      stalled-cycles-backend   #   52.00% backend cycles idle     (83.41%)
    10,760,559,837      instructions             #    1.26  insn per cycle
                                                 #    0.41  stalled cycles per insn (83.41%)
     1,872,047,962      branches                 #  935.606 M/sec                   (83.41%)
        14,928,953      branch-misses            #    0.80% of all branches         (83.25%)

       2.000960610 seconds time elapsed

It would be interesting to see if your perf stat output shows something significantly different with and without the slab changes. It does not seem impossible that, because the slab changes look at less memory in SlabAlloc(), SlabFree() now has to fetch cache lines that in the unpatched version would already have been available. If that is the case, then I think we shouldn't worry about it unless we can find some workload that demonstrates an overall performance regression with the patch. I just don't quite have enough perf experience to know how I might go about proving that.

David
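To make the code shape in the annotation concrete, here is a self-contained sketch (illustrative type names, not the actual slab.c definitions) of an intrusive free list of this kind: the freed chunk's own memory stores the next-free pointer, so a free touches only the chunk and the block header.

#include <stddef.h>

typedef struct SketchChunk SketchChunk;   /* opaque; lives inside the block */

typedef struct SketchBlock
{
    SketchChunk *freehead;      /* head of this block's free-chunk list */
    int          nfree;         /* free count, feeds the blocklist index */
} SketchBlock;

/* free: push the chunk onto the head of the block's free list */
static inline void
sketch_free(SketchBlock *block, void *pointer)
{
    *(SketchChunk **) pointer = block->freehead;
    block->freehead = (SketchChunk *) pointer;
    block->nfree++;
}

/* alloc: pop the head; the next-free pointer lives in the chunk itself */
static inline void *
sketch_alloc(SketchBlock *block)
{
    void       *pointer = block->freehead;

    if (pointer != NULL)
    {
        block->freehead = *(SketchChunk **) pointer;
        block->nfree--;
    }
    return pointer;
}

Because only the chunk and the block header are written, the first load of the block's memory is where any cache miss lands, which is consistent with the 59% hotspot on "mov (%r8),%rbp" above.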
On Tue, Dec 13, 2022 at 7:50 AM David Rowley <dgrowleyml@gmail.com> wrote:
>
> Thanks for testing the patch.
>
> On Mon, 12 Dec 2022 at 20:14, John Naylor <john.naylor@enterprisedb.com> wrote:
> > While allocation is markedly improved, freeing looks worse here. The proportion is surprising because only about 2% of nodes are freed during the load, but doing that takes up 10-40% of the time compared to allocating.
>
> I've tried to reproduce this with the v13 patches applied and I'm not
> really getting the same as you are. To run the function 100 times I
> used:
>
> select x, a.* from generate_series(1,100) x(x), lateral (select * from
> bench_load_random_int(500 * 1000 * (1+x-x))) a;
Simply running over a longer period of time like this makes the SlabFree difference much closer to your results, so it doesn't seem out of line anymore. Here SlabAlloc seems to take maybe 2/3 of the time of current slab, with a 5% reduction in total time:
500k ints:
v13-0001-0005
average of 30: 217ms
47.61% postgres postgres [.] rt_set
20.99% postgres postgres [.] SlabAlloc
10.00% postgres postgres [.] rt_node_insert_inner.isra.0
6.87% postgres [unknown] [k] 0xffffffffbce011b7
3.53% postgres postgres [.] MemoryContextAlloc
2.82% postgres postgres [.] SlabFree
+slab v4
average of 30: 206ms
51.13% postgres postgres [.] rt_set
14.08% postgres postgres [.] SlabAlloc
11.41% postgres postgres [.] rt_node_insert_inner.isra.0
7.44% postgres [unknown] [k] 0xffffffffbce011b7
3.89% postgres postgres [.] MemoryContextAlloc
3.39% postgres postgres [.] SlabFree
It doesn't look mysterious anymore, but I went ahead and took some more perf measurements, including for cache misses. My naive impression is that we're spending a bit more time waiting for data, but having to do less work with it once we get it, which is consistent with your earlier comments:
perf stat -p $pid sleep 2
v13:
2,001.55 msec task-clock:u # 1.000 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
311,690 page-faults:u # 155.724 K/sec
3,128,740,701 cycles:u # 1.563 GHz
4,739,333,861 instructions:u # 1.51 insn per cycle
820,014,588 branches:u # 409.690 M/sec
7,385,923 branch-misses:u # 0.90% of all branches
+slab v4:
2,001.09 msec task-clock:u # 1.000 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
326,017 page-faults:u # 162.920 K/sec
3,016,668,818 cycles:u # 1.508 GHz
4,324,863,908 instructions:u # 1.43 insn per cycle
761,839,927 branches:u # 380.712 M/sec
7,718,366 branch-misses:u # 1.01% of all branches
perf stat -e LLC-loads,LLC-loads-misses -p $pid sleep 2
min/max of 3 runs:
v13: LL cache misses: 25.08% - 25.41%
+slab v4: LL cache misses: 25.74% - 26.01%
--
John Naylor
EDB: http://www.enterprisedb.com
I've spent quite a bit more time on the slab changes now and I've attached a v3 patch.

One of the major things I've spent time on is benchmarking this. I'm aware that Tomas wrote some functions to benchmark. I've taken those and made some modifications to allow the memory context type to be specified as a function parameter. This allows me to easily compare the performance of slab with both aset and generation.

Another change that I made to Tomas' module was how the random ordering part works. What I wanted was the ability to specify how randomly to pfree the chunks and to test various "degrees of randomness" to see how that affects the performance. What I ended up coming up with was the ability to specify the number of "random segments". This controls how many groups we split all allocated chunks into to randomise. If there is 1 random segment, then that's just randomising over all chunks. If there are 10 random segments, then we split the array of allocated chunks into 10 portions based on either FIFO or LIFO order, then randomise the order of the chunks only within each of those segments. This allows us to test FIFO/LIFO allocation patterns with and without randomness, and any degree of that in between. If the number of random segments is set to 0, then no randomisation is done.

Another change I made to Tomas' code: I'm now using palloc0() instead of palloc(), and I'm also checking that the first byte of the allocated chunk is '\0' before pfreeing it (a sketch of this pattern follows below). What I was finding was that pfree showed up as highly dominant in perf output due to it having to dereference the MemoryChunk to find the context-type bits; pfree had to do this because none of the calling code had touched any of the memory in the chunk. I felt it was unrealistic to palloc memory and then pfree it without having done anything with it. Mostly this just moves around the responsibility for which function is penalised by having to load the cache line. I mostly did this because I was struggling to make any sense of perf's output.

I've attached alloc_bench_contrib.patch, which I used for testing. I've also attached a spreadsheet with the benchmark results.

The general summary from those is that slab is now generally on par with aset in terms of palloc performance. Previously slab was performing at about half the speed of aset unless CPU cache pressure became more significant, in which case the performance is dominated by fetching cache lines from RAM. However, the new code still makes meaningful improvements even under heavy CPU cache pressure. When it comes to pfree performance, the updated slab code is much faster than it was previously, but not quite on par with aset or generation.

The attached spreadsheet is broken down into 3 tabs. Each tab tests a chunk size and a fixed total number of chunks allocated at once. Within each tab, I'm testing FIFO and then LIFO allocation patterns, each with a different degree of randomness introduced, as I described above. In none of the tests was the patched version slower than the unpatched version.

One pending question I had was about SlabStats, where we list free chunks. Since we now have a list of emptyblocks, I wasn't too sure whether the chunks from those should be included in that total. I'm currently not including them, but I have added some additional information to list the number of completely empty blocks in the emptyblocks list.

Some follow-up work that I'm thinking is a good idea:

1. Reduce the SlabContext's chunkSize, fullChunkSize and blockSize fields from Size down to uint32. These have no need to be 64 bits; we don't allow slab blocks over 1GB since c6e0fe1f2. I thought of doing this separately as we might need to rationalise the equivalent fields in aset.c and generation.c. Those can have external chunks, so I'm not 100% sure if we should do that there or not yet. I just didn't want to touch those files in this effort.

2. Slab should probably gain the ability to grow the block size as aset and generation both do. Since the performance of the slab context is good now, we might want to use it for hash join's 32kb chunks, but I doubt we can without the block size growth.

I'm planning on pushing the attached v3 patch shortly. I've spent several days reading over this and testing it in detail, along with adding additional features to the SlabCheck code to find more inconsistencies.

David
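A minimal sketch of the palloc0-and-touch pattern described above (a simplified FIFO loop with hypothetical function and variable names; the actual alloc_bench code differs):

#include "postgres.h"
#include "utils/memutils.h"

/*
 * Simplified FIFO benchmark loop: keep keep_chunks chunks live, freeing
 * the oldest before each new allocation.  palloc0-style allocation writes
 * the chunk, and reading its first byte before pfree() means the cache
 * line is loaded by "user" code rather than by pfree()'s own header
 * dereference.  Assumes loops >= keep_chunks.
 */
static void
bench_fifo(MemoryContext cxt, Size chunk_size, int keep_chunks, long loops)
{
    char  **chunks = palloc(sizeof(char *) * keep_chunks);
    int     slot = 0;

    for (long i = 0; i < loops; i++)
    {
        if (i >= keep_chunks)
        {
            /* touch the chunk so pfree() isn't the first to load it */
            if (chunks[slot][0] != '\0')
                elog(ERROR, "unexpected chunk contents");
            pfree(chunks[slot]);
        }
        chunks[slot] = MemoryContextAllocZero(cxt, chunk_size);
        slot = (slot + 1) % keep_chunks;
    }

    /* release whatever is still held */
    for (int j = 0; j < keep_chunks; j++)
        pfree(chunks[j]);
    pfree(chunks);
}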
On Tue, Dec 20, 2022 at 10:36 AM David Rowley <dgrowleyml@gmail.com> wrote:
>
> I'm planning on pushing the attached v3 patch shortly. I've spent
> several days reading over this and testing it in detail along with
> adding additional features to the SlabCheck code to find more
> inconsistencies.
FWIW, I reran my test from last week and got similar results.
--
John Naylor
EDB: http://www.enterprisedb.com
On Tue, 20 Dec 2022 at 21:19, John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Tue, Dec 20, 2022 at 10:36 AM David Rowley <dgrowleyml@gmail.com> wrote:
> >
> > I'm planning on pushing the attached v3 patch shortly. I've spent
> > several days reading over this and testing it in detail along with
> > adding additional features to the SlabCheck code to find more
> > inconsistencies.
>
> FWIW, I reran my test from last week and got similar results.

Thanks a lot for testing that stuff last week. I got a bit engrossed in the perf weirdness and forgot to reply. I found the results made much more sense after using palloc0 and touching the allocated memory just before freeing. I think this is a more realistic test.

I've now pushed the patch after making a small adjustment to the version I sent earlier.

David