Thread: Use generation context to speed up tuplesorts
Hackers,

While reviewing and benchmarking 91e9e89dc (Make nodeSort.c use Datum sorts for single column sorts), I noticed that we use a separate memory context to store tuple data, and we just reset it when we're done if the sort fits in memory, or, if it does not, we dump the tuples to disk in the same order they're added and then reset the context. There is a little pfree() work going on via writetup_heap() which I think could possibly just be removed to get some additional gains.

Anyway, this context that stores tuples uses the standard aset.c allocator, which has the usual power-of-2 wastage plus the additional overheads of freelists etc. I wondered how much faster it would go if I set it to use a generation context instead of an aset.c one.

After running make installcheck to create the tenk1 table, then running the attached tuplesortbench script, I get this:

Master:
work_mem = '4MB';  Sort Method: external merge  Disk: 2496kB
work_mem = '4GB';  Sort Method: quicksort  Memory: 5541kB

Patched:
work_mem = '4MB';  Sort Method: quicksort  Memory: 3197kB

So it seems to save quite a bit of memory by getting away from rounding allocations up to the next power of 2.

Performance-wise, there are some pretty good gains. (Results in TPS)

work_mem = '4GB';
Test    master   gen sort   compare
Test1   317.2    665.6      210%
Test2   228.6    388.9      170%
Test3   207.4    330.7      159%
Test4   185.5    279.4      151%
Test5   292.2    563.9      193%

If I drop work_mem down to the standard setting, the unpatched version goes to disk but the patched version does not. The gains get a little bigger.

work_mem = '4MB';
Test    master   gen sort   compare
Test1   177.5    658.2      371%
Test2   149.7    385.2      257%
Test3   137.5    330.0      240%
Test4   129.0    275.1      213%
Test5   161.7    546.4      338%

The patch is just a simple 1-liner at the moment. I likely do need to adjust what I'm passing as the blockSize to GenerationContextCreate(). Maybe a better number would be something that's calculated from work_mem, e.g. Min(ALLOCSET_DEFAULT_MAXSIZE, ((Size) work_mem) * 64) so that we just allocate at most a 16th of work_mem per chunk, but not bigger than 8MB. I don't think changing this will affect the performance of the above very much.

David
Attachment
On Fri, Jul 30, 2021 at 2:42 AM David Rowley <dgrowleyml@gmail.com> wrote:
> Master:
> Sort Method: quicksort  Memory: 5541kB
> Patched:
> Sort Method: quicksort  Memory: 3197kB

Whoa.

> work_mem = '4GB';
> Test    master   gen sort   compare
> Test1   317.2    665.6      210%
> Test2   228.6    388.9      170%
> Test3   207.4    330.7      159%
> Test4   185.5    279.4      151%
> Test5   292.2    563.9      193%

Very impressive.

An early version of what eventually became DSA worked with backend-local memory, and I saw very substantial memory usage improvements on large sorts, similar to what you show here. I am not sure I saw the same CPU improvements, and in any case I abandoned the idea of using that infrastructure to manage backend-local memory at some point, since the whole thing had lots of problems that I didn't know how to solve. What you've done here looks like a much more promising approach.

--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,

On 2021-07-30 18:42:18 +1200, David Rowley wrote:
> While reviewing and benchmarking 91e9e89dc (Make nodeSort.c use Datum
> sorts for single column sorts), I noticed that we use a separate
> memory context to store tuple data and we just reset when we're done
> if the sort fits in memory, or we dump the tuples to disk in the same
> order they're added and reset the context when it does not. There is
> a little pfree() work going on via writetup_heap() which I think
> possibly could just be removed to get some additional gains.
>
> Anyway, this context that stores tuples uses the standard aset.c
> allocator which has the usual power of 2 wastage and additional
> overheads of freelists etc. I wondered how much faster it would go if
> I set it to use a generation context instead of an aset.c one.
>
> After running make installcheck to make the tenk1 table, running the
> attached tuplesortbench script, I get this:
>
> So it seems to save quite a bit of memory getting away from rounding
> up allocations to the next power of 2.
>
> Performance-wise, there's some pretty good gains. (Results in TPS)

Very nice!

I wonder if there are cases where generation.c would regress performance over aset.c due to not having an initial / "keeper" block?

> The patch is just a simple 1-liner at the moment. I likely do need to
> adjust what I'm passing as the blockSize to GenerationContextCreate().
> Maybe a better number would be something that's calculated from
> work_mem, e.g. Min(ALLOCSET_DEFAULT_MAXSIZE, ((Size) work_mem) * 64)
> so that we just allocate at most a 16th of work_mem per chunk, but not
> bigger than 8MB. I don't think changing this will affect the
> performance of the above very much.

I think it's bad that both generation and slab don't have internal handling of block sizes. Needing to err on the side of too-big blocks to handle large amounts of memory well, just so the contexts don't need to deal with variably sized blocks, isn't a sensible tradeoff.

I don't think it's acceptable to use ALLOCSET_DEFAULT_MAXSIZE or Min(ALLOCSET_DEFAULT_MAXSIZE, ((Size) work_mem) * 64) for tuplesort.c. There are plenty of cases where we'll just sort a handful of tuples, and increasing the memory usage of those by a factor of 1024 isn't good. The Min() won't do any good if a larger work_mem is used.

Nor will it be good to use thousands of small allocations for a large in-memory tuplesort just because we're concerned about the initial allocation size. Both because of the allocation overhead, but importantly also because that will make context resets more expensive.

To me this says that we should transplant aset.c's block size growing into generation.c.

There is at least one path using tuplecontext that reaches code that could end up freeing memory to a significant enough degree to care about generation.c effectively not using that memory: tuplesort_putdatum()->datumCopy()->EOH_flatten_into(). On a quick look I didn't find any expanded record user that frees nontrivial amounts of memory, but I didn't look all that carefully.

Greetings,

Andres Freund
On 7/30/21 10:38 PM, Andres Freund wrote:
> On 2021-07-30 18:42:18 +1200, David Rowley wrote:
>> While reviewing and benchmarking 91e9e89dc (Make nodeSort.c use Datum
>> sorts for single column sorts), I noticed that we use a separate
>> memory context to store tuple data and we just reset when we're done
>> if the sort fits in memory, or we dump the tuples to disk in the same
>> order they're added and reset the context when it does not. There is
>> a little pfree() work going on via writetup_heap() which I think
>> possibly could just be removed to get some additional gains.
>>
>> Anyway, this context that stores tuples uses the standard aset.c
>> allocator which has the usual power of 2 wastage and additional
>> overheads of freelists etc. I wondered how much faster it would go if
>> I set it to use a generation context instead of an aset.c one.
>>
>> Performance-wise, there's some pretty good gains. (Results in TPS)
>
> Very nice!

Yes, very nice. I wouldn't have expected such a significant difference, particularly in CPU usage. It's pretty interesting that it reduces both memory and CPU usage; I'd have guessed it would be one or the other.

> I wonder if there's cases where generation.c would regress performance
> over aset.c due to not having an initial / "keeper" block?

Not sure. I guess such a workload would need to allocate and free a single block (so very little memory) very often. I guess that's possible, but I'm not aware of a place doing that very often. Although, maybe decoding could do that for a simple (serial) workload. I'm not opposed to adding a keeper block to Generation, similarly to what was discussed for Slab not too long ago.

>> The patch is just a simple 1-liner at the moment. I likely do need to
>> adjust what I'm passing as the blockSize to GenerationContextCreate().
>> Maybe a better number would be something that's calculated from
>> work_mem, e.g. Min(ALLOCSET_DEFAULT_MAXSIZE, ((Size) work_mem) * 64)
>> so that we just allocate at most a 16th of work_mem per chunk, but not
>> bigger than 8MB. I don't think changing this will affect the
>> performance of the above very much.
>
> I think it's bad that both generation and slab don't have internal
> handling of block sizes. Needing to err on the side of too-big blocks to
> handle large amounts of memory well, just so the contexts don't need to
> deal with variably sized blocks, isn't a sensible tradeoff.

Well, back then it seemed like a sensible trade-off to me, but I agree it may have negative consequences. I'm not opposed to revisiting this.

> I don't think it's acceptable to use ALLOCSET_DEFAULT_MAXSIZE or
> Min(ALLOCSET_DEFAULT_MAXSIZE, ((Size) work_mem) * 64) for
> tuplesort.c. There's plenty of cases where we'll just sort a handful of
> tuples, and increasing the memory usage of those by a factor of 1024
> isn't good. The Min() won't do any good if a larger work_mem is used.
>
> Nor will it be good to use thousands of small allocations for a large
> in-memory tuplesort just because we're concerned about the initial
> allocation size. Both because of the allocation overhead, but
> importantly also because that will make context resets more expensive.

True.

> To me this says that we should transplant aset.c's block size growing
> into generation.c.

Yeah, maybe.

> There is at least one path using tuplecontext that reaches code that
> could end up freeing memory to a significant enough degree to care about
> generation.c effectively not using that memory:
> tuplesort_putdatum()->datumCopy()->EOH_flatten_into()
> On a quick look I didn't find any expanded record user that frees
> nontrivial amounts of memory, but I didn't look all that carefully.

Not sure, I'm not familiar with EOH_flatten_into or expanded records. But I wonder if there's some sort of metric that we could track in Generation and use it to identify "interesting" places.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

I spent a bit of time hacking on the Generation context, adding the two improvements discussed in this thread:

1) internal handling of block sizes, similar to what AllocSet does (it pretty much just copies parts of it)

2) keeper block (we keep one empty block instead of freeing it)

3) I've also added allocChunkLimit, which makes it look a bit more like AllocSet (instead of using just blockSize/8, which does not work too well with dynamic blockSize)

I haven't done any extensive tests on it, but it does pass check-world with asserts etc. I haven't touched the comments; those need updating.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
On Sat, 31 Jul 2021 at 14:34, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> I spent a bit of time hacking on the Generation context, adding the two
> improvements discussed in this thread:
>
> 1) internal handling of block sizes, similar to what AllocSet does (it
> pretty much just copies parts of it)
>
> 2) keeper block (we keep one empty block instead of freeing it)
>
> 3) I've also added allocChunkLimit, which makes it look a bit more like
> AllocSet (instead of using just blockSize/8, which does not work too
> well with dynamic blockSize)
>
> I haven't done any extensive tests on it, but it does pass check-world
> with asserts etc. I haven't touched the comments, those need updating.

Thanks for starting work on that. I've only had a quick look, but I can have a more detailed look once you've got it more complete.

For now it does not really look like the keeper block stuff is wired up the same way as in aset.c. I'd expect you to be allocating that in the same malloc() as you're using to allocate the context struct itself in GenerationContextCreate().

Also, likely as a result of the above, minContextSize does not seem to be wired up to anything apart from an Assert().

David
On Sat, 31 Jul 2021 at 08:38, Andres Freund <andres@anarazel.de> wrote:
> There is at least one path using tuplecontext that reaches code that
> could end up freeing memory to a significant enough degree to care about
> generation.c effectively not using that memory:
> tuplesort_putdatum()->datumCopy()->EOH_flatten_into()
> On a quick look I didn't find any expanded record user that frees
> nontrivial amounts of memory, but I didn't look all that carefully.

I guess we could just use a normal context for datum sorts if we thought that might be a problem.

I'm not too familiar with the expanded object code, but I'm struggling to imagine why anything would need to do a pfree in there. We just do EOH_get_flat_size() to determine how big to make the allocation, then allocate some memory for EOH_flatten_into() to use to expand the object into.

David
On 8/2/21 1:17 PM, David Rowley wrote:
> On Sat, 31 Jul 2021 at 14:34, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> I spent a bit of time hacking on the Generation context, adding the two
>> improvements discussed in this thread:
>>
>> 1) internal handling of block sizes, similar to what AllocSet does (it
>> pretty much just copies parts of it)
>>
>> 2) keeper block (we keep one empty block instead of freeing it)
>>
>> 3) I've also added allocChunkLimit, which makes it look a bit more like
>> AllocSet (instead of using just blockSize/8, which does not work too
>> well with dynamic blockSize)
>
> Thanks for starting work on that. I've only had a quick look, but I
> can have a more detailed look once you've got it more complete.

A review would be nice, although it can wait. It'd be interesting to know if those patches help with the workload(s) you've been looking at.

> For now it does not really look like the keeper block stuff is wired
> up the same way as in aset.c. I'd expect you to be allocating that in
> the same malloc as you're using to allocate the context struct itself
> in GenerationContextCreate().

Yes, that difference is deliberate. AllocSet works a bit differently, as it does not release its blocks (except during reset), while the Generation context frees blocks as they empty out. So it seems pointless to use the same "keeper" block as AllocSet - instead my intention was to keep one "allocated" block as a cache, which should help with tight pfree/palloc cycles. Maybe we should not call that a "keeper" block?

> Also, likely as a result of the above, minContextSize does not seem to
> be wired up to anything apart from an Assert().

Hmm, yeah. This is probably due to copying some of the block-growth and keeper block code from AllocSet. There should be just init/max block size, I think.

I did run the same set of benchmarks as for Slab, measuring some usual allocation patterns. The results for the i5-2500k machine are attached (for the Xeon it's almost exactly the same behavior). While running those tests I realized the last patch is wrong and sets allocChunkLimit=1, which is bogus and causes a significant regression. So here's an updated version of the patch series too.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
Hi,

On 2021-08-03 10:59:25 +1200, David Rowley wrote:
> On Sat, 31 Jul 2021 at 08:38, Andres Freund <andres@anarazel.de> wrote:
>> There is at least one path using tuplecontext that reaches code that
>> could end up freeing memory to a significant enough degree to care about
>> generation.c effectively not using that memory:
>> tuplesort_putdatum()->datumCopy()->EOH_flatten_into()
>> On a quick look I didn't find any expanded record user that frees
>> nontrivial amounts of memory, but I didn't look all that carefully.
>
> I guess we could just use a normal context for datum sorts if we
> thought that might be a problem.

I think that's probably a cure worse than the disease. I suspect datum sorts can benefit from the higher density quite a bit...

> I'm not too familiar with the expanded object code, but I'm struggling
> to imagine why anything would need to do a pfree in there. We just do
> EOH_get_flat_size() to determine how big to make the allocation then
> allocate some memory for EOH_flatten_into() to use to expand the
> object into.

I can see some scenarios with a bit more creative uses of expanded objects. We've e.g. been talking about using EA to avoid repeated and partial detoasting overhead, and you might need to do some more toast fetches when flattening. Toast fetches always allocate, and if the fetch were only for later parts of the tuple, the fetched data would need to be freed.

It's probably fine to deal with this at a later time, and just leave a comment somewhere. It could be addressed by having a bump-style allocator combined with freelists. It's not like the tuplesort.c case is actually interested in the generational behaviour of generation.c (which makes freelists uninteresting); it's just that generation.c is the densest allocator that we have right now...

Greetings,

Andres Freund
On Wed, 4 Aug 2021 at 02:10, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> A review would be nice, although it can wait - It'd be interesting to
> know if those patches help with the workload(s) you've been looking at.

I tried out the v2 set of patches using the attached scripts. The attached spreadsheet includes the original tests and compares master with the patch which uses the generation context vs that patch plus your v2 patches.

I've also included 4 additional tests, each of which starts with a 1 column table and then adds another 32 columns, testing the performance after adding each additional column. I did this because I wanted to see if the performance was more similar to master when the allocations had less power-of-2 wastage from allocset. If, for example, you look at row 123 of the spreadsheet, you can see that both patched and unpatched the allocations were 272 bytes each, yet there was still a 50% performance improvement with just the generation context patch when compared to master.

Looking at the spreadsheet, you'll also notice that in the 2 column test of each of the 4 new tests, the number of bytes used for each allocation is larger with the generation context: 56 vs 48. This is due to the GenerationChunk struct being larger than AllocSet's version by 8 bytes, because it also stores a pointer to its owning GenerationBlock. So with the patch there are some cases where we'll use slightly more memory.

Additional tests:

1. Sort 10000 tuples on a column with values 0-99 in memory.
2. As #1 but with 1 million tuples.
3. As #1 but with a large OFFSET to remove the overhead of sending to the client.
4. As #2 but with a large OFFSET.

Test #3 above is the most similar one to the original tests and shows similar gains. When the sort becomes larger (the 1 million tuple test), the gains reduce. This indicates the gains are coming from improved CPU cache efficiency from the removal of the power-of-2 wastage in memory allocations.

All of the tests show that the patches to improve the allocation efficiency of generation.c don't help to improve the results of the test cases. I wondered if it's maybe worth trying to see what happens if, instead of doubling the allocations each time, we quadruple them instead. I didn't try this.

David
Attachment
On 8/6/21 3:07 PM, David Rowley wrote:
> On Wed, 4 Aug 2021 at 02:10, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>> A review would be nice, although it can wait - It'd be interesting to
>> know if those patches help with the workload(s) you've been looking at.
>
> I tried out the v2 set of patches using the attached scripts. The
> attached spreadsheet includes the original tests and compares master
> with the patch which uses the generation context vs that patch plus
> your v2 patch.
>
> [...]
>
> Test #3 above is the most similar one to the original tests and shows
> similar gains. When the sort becomes larger (1 million tuple test),
> the gains reduce. This indicates the gains are coming from improved
> CPU cache efficiency from the removal of the power of 2 wastage in
> memory allocations.
>
> All of the tests show that the patches to improve the allocation
> efficiency of generation.c don't help to improve the results of the
> test cases. I wondered if it's maybe worth trying to see what happens
> if instead of doubling the allocations each time, quadruple them
> instead. I didn't try this.

Thanks for the scripts and the spreadsheet with results.

I doubt quadrupling the allocations will help very much, but I suspect the problem might be in the 0004 patch - at least that's what shows a regression in my results. Could you try with just 0001-0003 applied?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, 7 Aug 2021 at 12:10, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> On 8/6/21 3:07 PM, David Rowley wrote:
>> All of the tests show that the patches to improve the allocation
>> efficiency of generation.c don't help to improve the results of the
>> test cases. I wondered if it's maybe worth trying to see what happens
>> if instead of doubling the allocations each time, quadruple them
>> instead. I didn't try this.
>
> I doubt quadrupling the allocations will help very much, but I suspect
> the problem might be in the 0004 patch - at least that's what shows
> regression in my results. Could you try with just 0001-0003 applied?

But 0004 only changes the logic which controls the threshold of when we allocate an oversized chunk. It looks like the threshold is 512KB with the 0004 patch. My test is only doing a maximum allocation of 296 bytes, so it will never allocate an oversized chunk.

Can you explain why you think 0004 would cause performance regressions?

David
On Sat, 7 Aug 2021 at 12:10, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>> All of the tests show that the patches to improve the allocation
>> efficiency of generation.c don't help to improve the results of the
>> test cases. I wondered if it's maybe worth trying to see what happens
>> if instead of doubling the allocations each time, quadruple them
>> instead. I didn't try this.
>
> I doubt quadrupling the allocations will help very much, but I suspect
> the problem might be in the 0004 patch - at least that's what shows
> regression in my results. Could you try with just 0001-0003 applied?

I tried the quadrupling of the buffer instead of doubling it each time and got the attached. Column E, or green in the graphs, shows the results of that. It's now much closer to the original patch which just made the block size 8MB.

David
Attachment
On 8/8/21 9:02 AM, David Rowley wrote:
> On Sat, 7 Aug 2021 at 12:10, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>> I doubt quadrupling the allocations will help very much, but I suspect
>> the problem might be in the 0004 patch - at least that's what shows
>> regression in my results. Could you try with just 0001-0003 applied?
>
> But 0004 only changes the logic which controls the threshold of when
> we allocate an oversized chunk. It looks like the threshold is 512KB
> with the 0004 patch. My test is only doing a maximum allocation of
> 296 bytes so will never allocate an oversized chunk.
>
> Can you explain why you think 0004 would cause performance regressions?

It's based solely on the results of my benchmarks, where this patch seems to cause a performance regression. I agree it's a bit bizarre, considering what the patch does.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 8/8/21 12:11 PM, David Rowley wrote:
> On Sat, 7 Aug 2021 at 12:10, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>> All of the tests show that the patches to improve the allocation
>>> efficiency of generation.c don't help to improve the results of the
>>> test cases. I wondered if it's maybe worth trying to see what happens
>>> if instead of doubling the allocations each time, quadruple them
>>> instead. I didn't try this.
>>
>> I doubt quadrupling the allocations will help very much, but I suspect
>> the problem might be in the 0004 patch - at least that's what shows
>> regression in my results. Could you try with just 0001-0003 applied?
>
> I tried the quadrupling of the buffer instead of doubling it each time
> and got the attached. Column E, or green in the graphs, shows the
> results of that. It's now much closer to the original patch which just
> made the block size 8MB.

Interesting, I wouldn't have expected that to make such a difference.

I'm not sure quadrupling the size is a good idea, though, because it increases the amount of memory we might be wasting. With doubling, the amount of wasted/unused memory is limited to ~50%, because the next block is (roughly) equal to the sum of the already-allocated blocks, so allocating just 1B on it leaves us with 50% unused. But quadrupling the size means we'll end up with ~75% free space. Of course, this is capped by the maximum block size etc., but still ...

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, 4 Aug 2021 at 02:10, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> I did run the same set of benchmarks as for Slab, measuring some usual
> allocation patterns. The results for i5-2500k machine are attached (for
> the xeon it's almost exactly the same behavior). While running those
> tests I realized the last patch is wrong and sets allocChunkLimit=1,
> which is bogus and causes significant regression. So here's an updated
> version of the patch series too.

I know you're not done with these yet, but FWIW, I was getting an Assert failure with these patches on:

Assert(total_allocated == context->mem_allocated);

It seems to be because you've forgotten to ignore keeper blocks when adjusting context->mem_allocated in GenerationReset().

David
On Mon, 9 Aug 2021 at 00:38, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> I'm not sure quadrupling the size is a good idea, though, because it
> increases the amount of memory we might be wasting. With the doubling,
> the amount of wasted/unused memory is limited to ~50%, because the next
> block is (roughly) equal to sum of already allocated blocks, so
> allocating just 1B on it leaves us with 50%. But quadrupling the size
> means we'll end up with ~75% free space. Of course, this is capped by
> the maximum block size etc. but still ...

Yeah, not sure what is best. It does however seem likely that the majority of the performance improvement that I saw is due to either the malloc()/free() calls or just having fewer blocks in the context.

Maybe it's worth getting the planner on board with deciding how to do the allocations. It feels a bit overcautious to go allocating blocks in each power of two starting at 8192 bytes when doing a 1GB sort. Maybe we should be looking towards making the initial allocation size something more like pg_prevpower2_64(Min(work_mem * 1024L, sort_tuples * tuple_width)), or maybe half or a quarter of that. It would certainly not be the only executor node to allocate memory based on what the planner thought. Just look at ExecHashTableCreate().

David
Hi,

I've spent quite a bit of time investigating the regressions clearly visible in the benchmark results for some allocation/free patterns. Attached is a subset of results from a recent run, but the old results show mostly the same thing.

Some allocation patterns (e.g. fifo-cycle or lifo-cycle) are perfectly fine - the patches are a clear improvement. But for example the results for the fifo-decrease pattern look like this (showing just timings of palloc):

    block    chunk     0001     0002     0003     0004
    ----------------------------------------------------
    32768      512    38497    37234    39817   226451

0001 just adds an extension used for the benchmarking, so it's pretty much "master" without any behavior changes. 0002 and 0003 make some changes that seem to have no impact on this workload, but 0004 makes it about 5x slower.

Which is bizarre, because 0004 just tweaks how we calculate the threshold for oversized chunks. That shouldn't have any impact here, because with the benchmark parameters the threshold should be 64kB, way more than the chunk size (which is what we allocate). Similarly, it should not matter whether we double or quadruple the block size, because we reach the maximum block size (1MB) very fast in both cases, and then just use 1MB blocks for 99% of the benchmark.

I was thinking that maybe the extra allocChunkLimit increases the size of the GenerationContext struct, requiring another cacheline. But no, it's 3 cachelines in both cases.

But while trying to investigate / profile this, I noticed a rather bizarre behavior, which I can't explain but I suspect may be related to the slowdown.
The benchmark queries execute generation_bench_fifo() for a range of block size / chunk size combinations, and look about like this:

    select block_size, chunk_size, x.*
      from generate_series(32,512,32) a(chunk_size),
           (values (1024), (2048), (4096), (8192), (16384), (32768)) AS b(block_size),
           lateral generation_bench_fifo(1000000, block_size, chunk_size,
                                         2*chunk_size, 100, 10000, 5000) x;

This takes a while to run, and then produces something like this:

     block_size | chunk_size | mem_allocated | alloc_ms | free_ms
    ------------+------------+---------------+----------+---------
           1024 |         32 |      81528832 |    62016 |   24601
           1024 |         64 |     137449472 |    61951 |   36034
           1024 |         96 |     208408464 |    85437 |   57984
    ...
          32768 |        448 |     707756032 |    36605 |   67648
          32768 |        480 |     757104640 |    36838 |   71967
          32768 |        512 |     806256640 |    37073 |   76489
    (96 rows)

Now, I can run the benchmark for a single case (e.g. the last one in the results above), say like this:

    select block_size, chunk_size, x.*
      from (values (512)) AS a(chunk_size),
           (values (32768)) AS b(block_size),
           lateral generation_bench_fifo(1000000, block_size, chunk_size,
                                         2*chunk_size, 100, 10000, 5000) x;

And now comes the funny part - if I run it in the same backend as the "full" benchmark, I get roughly the same results:

     block_size | chunk_size | mem_allocated | alloc_ms | free_ms
    ------------+------------+---------------+----------+---------
          32768 |        512 |     806256640 |    37159 |   76669

but if I reconnect and run it in a new backend, I get this:

     block_size | chunk_size | mem_allocated | alloc_ms | free_ms
    ------------+------------+---------------+----------+---------
          32768 |        512 |     806158336 |   233909 |  100785
    (1 row)

It does not matter if I wait a bit before running the query, if I run it repeatedly, etc. The machine is not doing anything else, the CPU is set to use the "performance" governor, etc.

This is really strange, and the really suspicious thing is that this slower timing is very close to the 0004 timing (which is 226451).
So perhaps the reason for these slowdowns is the same, it's just that with 0004 it happens all the time. According to perf profiling, it seems there's one pretty significant change - 0004 spends quite a bit of time in asm_exc_page_fault:

    41.17%     5.26%  postgres  postgres  [.] GenerationAlloc
            |
            --36.06%--GenerationAlloc
                      |
                      |--28.95%--asm_exc_page_fault
                      |          |
                      |           --23.17%--exc_page_fault
                      |          |

while on 0003 there's not a single frame with asm_exc_page_fault. I have no idea why, though. I wonder if this might be some unexpected / unfortunate interaction with kernel memory management, or something like that. Any ideas what might be the root cause, what to look for, etc.?

regards

-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
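[The doubling-vs-quadrupling remark above - that either growth factor reaches the 1MB block-size cap almost immediately - can be sketched with a small standalone helper. `blocks_until_max` and `GEN_MAX_BLOCK` are illustrative names for this sketch, not the actual generation.c code.]

```c
#include <assert.h>
#include <stddef.h>

#define GEN_MAX_BLOCK (1024 * 1024)     /* the 1MB cap discussed above */

/*
 * Count how many allocations happen before the block size reaches the
 * cap, growing the block size by `factor` on each new block.
 */
static int
blocks_until_max(size_t init_size, int factor)
{
    int     n = 0;
    size_t  sz = init_size;

    while (sz < GEN_MAX_BLOCK)
    {
        sz *= factor;
        n++;
    }
    return n;
}
```

Starting from an 8kB block, doubling takes 7 allocations to hit 1MB and quadrupling takes 4, so either way virtually the whole benchmark runs with 1MB blocks.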
Le jeudi 9 septembre 2021, 15:37:59 CET Tomas Vondra a écrit :
> And now comes the funny part - if I run it in the same backend as the
> "full" benchmark, I get roughly the same results:
>
> block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> ------------+------------+---------------+----------+---------
> 32768 | 512 | 806256640 | 37159 | 76669
>
> but if I reconnect and run it in the new backend, I get this:
>
> block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> ------------+------------+---------------+----------+---------
> 32768 | 512 | 806158336 | 233909 | 100785
> (1 row)
>
> It does not matter if I wait a bit before running the query, if I run it
> repeatedly, etc. The machine is not doing anything else, the CPU is set
> to use "performance" governor, etc.

I've reproduced the behaviour you mention. I also noticed asm_exc_page_fault showing up in the perf report in that case.

Running an strace on it shows that in one case we have a lot of brk calls, while when we run in the same process as the previous tests, we don't.

My suspicion is that the previous workload makes glibc malloc change its trim_threshold and possibly other dynamic options, which leads to constantly moving the brk pointer in one case and not the other.

Running your fifo test with absurd malloc options shows that indeed that might be the case (I needed to change several, because changing one disables the dynamic adjustment for every single one of them, and malloc would fall back to using mmap and freeing it on each iteration):

mallopt(M_TOP_PAD, 1024 * 1024 * 1024);
mallopt(M_TRIM_THRESHOLD, 256 * 1024 * 1024);
mallopt(M_MMAP_THRESHOLD, 4*1024*1024*sizeof(long));

I get the following results for your self contained test.
I ran the query twice, in each case, seeing the importance of the first allocation and the subsequent ones:

With default malloc options:

 block_size | chunk_size | mem_allocated | alloc_ms | free_ms
------------+------------+---------------+----------+---------
      32768 |        512 |     795836416 |   300156 |  207557

 block_size | chunk_size | mem_allocated | alloc_ms | free_ms
------------+------------+---------------+----------+---------
      32768 |        512 |     795836416 |   211942 |   77207

With the oversized values above:

 block_size | chunk_size | mem_allocated | alloc_ms | free_ms
------------+------------+---------------+----------+---------
      32768 |        512 |     795836416 |   219000 |   36223

 block_size | chunk_size | mem_allocated | alloc_ms | free_ms
------------+------------+---------------+----------+---------
      32768 |        512 |     795836416 |    75761 |   78082
(1 row)

I can't tell how representative your benchmark extension would be of real-life allocation / free patterns, but there is probably something we can improve here. I'll try to see if I can understand more precisely what is happening.

-- Ronan Dunklau
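[The mallopt() experiment above can be packaged as a small glibc-only sketch. `pin_glibc_thresholds` is a hypothetical helper, the particular values are arbitrary (chosen smaller than the "absurd" ones quoted above), and mallopt() itself is glibc-specific.]

```c
#include <assert.h>
#include <malloc.h>     /* glibc-specific: mallopt() and the M_* options */

/*
 * Pin glibc's malloc thresholds explicitly. Setting any M_* option turns
 * off the dynamic adjustment of all of them, which is why several must be
 * raised together (otherwise malloc falls back to mmap'ing and freeing
 * each large block). mallopt() returns 1 on success.
 */
static int
pin_glibc_thresholds(void)
{
    int     ok = 1;

    ok &= mallopt(M_TOP_PAD, 64 * 1024 * 1024);
    ok &= mallopt(M_TRIM_THRESHOLD, 256 * 1024 * 1024);
    ok &= mallopt(M_MMAP_THRESHOLD, 32 * 1024 * 1024);
    return ok;
}
```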
On 12/8/21 16:51, Ronan Dunklau wrote: > Le jeudi 9 septembre 2021, 15:37:59 CET Tomas Vondra a écrit : >> And now comes the funny part - if I run it in the same backend as the >> "full" benchmark, I get roughly the same results: >> >> block_size | chunk_size | mem_allocated | alloc_ms | free_ms >> ------------+------------+---------------+----------+--------- >> 32768 | 512 | 806256640 | 37159 | 76669 >> >> but if I reconnect and run it in the new backend, I get this: >> >> block_size | chunk_size | mem_allocated | alloc_ms | free_ms >> ------------+------------+---------------+----------+--------- >> 32768 | 512 | 806158336 | 233909 | 100785 >> (1 row) >> >> It does not matter if I wait a bit before running the query, if I run it >> repeatedly, etc. The machine is not doing anything else, the CPU is set >> to use "performance" governor, etc. > > I've reproduced the behaviour you mention. > I also noticed asm_exc_page_fault showing up in the perf report in that case. > > Running an strace on it shows that in one case, we have a lot of brk calls, > while when we run in the same process as the previous tests, we don't. > > My suspicion is that the previous workload makes glibc malloc change it's > trim_threshold and possibly other dynamic options, which leads to constantly > moving the brk pointer in one case and not the other. > > Running your fifo test with absurd malloc options shows that indeed that might > be the case (I needed to change several, because changing one disable the > dynamic adjustment for every single one of them, and malloc would fall back to > using mmap and freeing it on each iteration): > > mallopt(M_TOP_PAD, 1024 * 1024 * 1024); > mallopt(M_TRIM_THRESHOLD, 256 * 1024 * 1024); > mallopt(M_MMAP_THRESHOLD, 4*1024*1024*sizeof(long)); > > I get the following results for your self contained test. 
I ran the query > twice, in each case, seeing the importance of the first allocation and the > subsequent ones: > > With default malloc options: > > block_size | chunk_size | mem_allocated | alloc_ms | free_ms > ------------+------------+---------------+----------+--------- > 32768 | 512 | 795836416 | 300156 | 207557 > > block_size | chunk_size | mem_allocated | alloc_ms | free_ms > ------------+------------+---------------+----------+--------- > 32768 | 512 | 795836416 | 211942 | 77207 > > > With the oversized values above: > > block_size | chunk_size | mem_allocated | alloc_ms | free_ms > ------------+------------+---------------+----------+--------- > 32768 | 512 | 795836416 | 219000 | 36223 > > > block_size | chunk_size | mem_allocated | alloc_ms | free_ms > ------------+------------+---------------+----------+--------- > 32768 | 512 | 795836416 | 75761 | 78082 > (1 row) > > I can't tell how representative your benchmark extension would be of real life > allocation / free patterns, but there is probably something we can improve > here. > Thanks for looking at this. I think those allocation / free patterns are fairly extreme, and there probably are no workloads doing exactly this. The idea is the actual workloads are likely some combination of these extreme cases. > I'll try to see if I can understand more precisely what is happening. > Thanks, that'd be helpful. Maybe we can learn something about tuning malloc parameters to get significantly better performance. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Le mercredi 8 décembre 2021, 22:07:12 CET Tomas Vondra a écrit : > On 12/8/21 16:51, Ronan Dunklau wrote: > > Le jeudi 9 septembre 2021, 15:37:59 CET Tomas Vondra a écrit : > >> And now comes the funny part - if I run it in the same backend as the > >> > >> "full" benchmark, I get roughly the same results: > >> block_size | chunk_size | mem_allocated | alloc_ms | free_ms > >> > >> ------------+------------+---------------+----------+--------- > >> > >> 32768 | 512 | 806256640 | 37159 | 76669 > >> > >> but if I reconnect and run it in the new backend, I get this: > >> block_size | chunk_size | mem_allocated | alloc_ms | free_ms > >> > >> ------------+------------+---------------+----------+--------- > >> > >> 32768 | 512 | 806158336 | 233909 | 100785 > >> > >> (1 row) > >> > >> It does not matter if I wait a bit before running the query, if I run it > >> repeatedly, etc. The machine is not doing anything else, the CPU is set > >> to use "performance" governor, etc. > > > > I've reproduced the behaviour you mention. > > I also noticed asm_exc_page_fault showing up in the perf report in that > > case. > > > > Running an strace on it shows that in one case, we have a lot of brk > > calls, > > while when we run in the same process as the previous tests, we don't. > > > > My suspicion is that the previous workload makes glibc malloc change it's > > trim_threshold and possibly other dynamic options, which leads to > > constantly moving the brk pointer in one case and not the other. 
> > > > Running your fifo test with absurd malloc options shows that indeed that > > might be the case (I needed to change several, because changing one > > disable the dynamic adjustment for every single one of them, and malloc > > would fall back to using mmap and freeing it on each iteration): > > > > mallopt(M_TOP_PAD, 1024 * 1024 * 1024); > > mallopt(M_TRIM_THRESHOLD, 256 * 1024 * 1024); > > mallopt(M_MMAP_THRESHOLD, 4*1024*1024*sizeof(long)); > > > > I get the following results for your self contained test. I ran the query > > twice, in each case, seeing the importance of the first allocation and the > > subsequent ones: > > > > With default malloc options: > > block_size | chunk_size | mem_allocated | alloc_ms | free_ms > > > > ------------+------------+---------------+----------+--------- > > > > 32768 | 512 | 795836416 | 300156 | 207557 > > > > block_size | chunk_size | mem_allocated | alloc_ms | free_ms > > > > ------------+------------+---------------+----------+--------- > > > > 32768 | 512 | 795836416 | 211942 | 77207 > > > > With the oversized values above: > > block_size | chunk_size | mem_allocated | alloc_ms | free_ms > > > > ------------+------------+---------------+----------+--------- > > > > 32768 | 512 | 795836416 | 219000 | 36223 > > > > block_size | chunk_size | mem_allocated | alloc_ms | free_ms > > > > ------------+------------+---------------+----------+--------- > > > > 32768 | 512 | 795836416 | 75761 | 78082 > > > > (1 row) > > > > I can't tell how representative your benchmark extension would be of real > > life allocation / free patterns, but there is probably something we can > > improve here. > > Thanks for looking at this. I think those allocation / free patterns are > fairly extreme, and there probably are no workloads doing exactly this. > The idea is the actual workloads are likely some combination of these > extreme cases. > > > I'll try to see if I can understand more precisely what is happening. > > Thanks, that'd be helpful. 
> Maybe we can learn something about tuning malloc parameters to get
> significantly better performance.

Apologies for the long email; maybe what I will state here is obvious for others, but I learnt a lot, so... I think I understood what the problem was in your generation tests: depending on the sequence of allocations, we will raise a different maximum for mmap_threshold and trim_threshold.

When an mmap'ed block is freed, malloc will raise its threshold, as it considers freed memory to be better served by regular moving of the sbrk pointer instead of using mmap to map memory. This threshold is upped by multiplying it by two anytime we free an mmap'ed block. At the same time, the trim_threshold is raised to double the mmap_threshold, considering that this amount of memory should not be released to the OS because we have a good chance of reusing it.

This can be demonstrated using the attached systemtap script, along with a patch adding new traces to generation_context for this purpose. When running your query:

select block_size, chunk_size, x.*
  from (values (512)) AS a(chunk_size),
       (values (32768)) AS b(block_size),
       lateral generation_bench_fifo(1000000, block_size, chunk_size,
                                     2*chunk_size, 100, 10000, 5000) x;

we obtain the following trace for threshold adjustments:

2167837: New thresholds: mmap: 135168 bytes, trim: 270336 bytes
2167837: New thresholds: mmap: 266240 bytes, trim: 532480 bytes
2167837: New thresholds: mmap: 528384 bytes, trim: 1056768 bytes
2167837: New thresholds: mmap: 1052672 bytes, trim: 2105344 bytes
2167837: New thresholds: mmap: 16003072 bytes, trim: 32006144 bytes

When running the full benchmark, we reach a higher threshold at some point:

2181301: New thresholds: mmap: 135168 bytes, trim: 270336 bytes
2181301: New thresholds: mmap: 266240 bytes, trim: 532480 bytes
2181301: New thresholds: mmap: 528384 bytes, trim: 1056768 bytes
2181301: New thresholds: mmap: 1052672 bytes, trim: 2105344 bytes
2181301: New thresholds: mmap: 16003072 bytes, trim:
32006144 bytes
2181301: New thresholds: mmap: 24002560 bytes, trim: 48005120 bytes

This is because at some point in the full benchmark, we allocate a block bigger than the mmap threshold, which malloc serves by an mmap, then frees it, which means we now also raise the trim_threshold. The subsequent allocations in the lone query end up between those thresholds, so they are served by moving the sbrk pointer, then releasing the memory back to the OS, which turns out to be expensive too.

One thing that I observed is that the effect of constantly moving the sbrk pointer still happens, but not as much, in a real tuplesort workload. I haven't tested the reordering buffer for now, which is the other part of the code using the generation context.

So, I decided to benchmark while trying to control the malloc thresholds. This result is on my local laptop, with an i7 processor, for the benchmark initially proposed by David, with 4GB work_mem and default settings. I benchmarked master, the original patch with a fixed block size, and the one adjusting the size. For each of them, I ran them using default malloc options (ie, adjusting the thresholds dynamically), and with MALLOC_MMAP_THRESHOLD_ set to 32MB (the maximum on my platform, 4 * 1024 * 1024 * sizeof(long)) and MALLOC_TRIM_THRESHOLD_ set to the value malloc would have adjusted it to by itself on reaching that threshold (ie, 64MB). I did that by setting the corresponding env variables.

The results are in the attached spreadsheet. I will follow up with a benchmark of the test sorting a table with a width varying from 1 to 32 columns. As of now, my conclusion is that for glibc malloc, the block size we use doesn't really matter, as long as we tell malloc not to release a certain amount of memory back to the system once it's been allocated.
Setting the mmap_threshold to min(work_mem, DEFAULT_MMAP_THRESHOLD_MAX) and the trim_threshold to two times that would IMO take us to where a long-lived backend would likely end up anyway: as soon as we alloc min(work_mem, 32MB), we won't give it back to the system, saving us a huge amount of syscalls in common cases. Doing that will not change the allocation profiles for other platforms, and should be safe for those. The only problem I see if we were to do that would be for allocations in excess of work_mem, which would no longer trigger a threshold raise, since once we're in "non-dynamic" mode, glibc's malloc would keep our manually-set values. I guess this proposal could be refined to set it up dynamically ourselves when pfreeing blocks, just like malloc does. What do you think of this analysis and idea?

-- Ronan Dunklau
Attachment
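[The proposed tuning boils down to a simple computation. The sketch below states the rule as described in the message - mmap_threshold = min(work_mem, DEFAULT_MMAP_THRESHOLD_MAX), trim_threshold twice that - with hypothetical names; it is not code from any posted patch.]

```c
#include <assert.h>
#include <stddef.h>

/* 32MB, glibc's default cap for the dynamic mmap threshold on 64-bit */
#define DEFAULT_MMAP_THRESHOLD_MAX (32 * 1024 * 1024)

typedef struct MallocTuning
{
    size_t  mmap_threshold;
    size_t  trim_threshold;
} MallocTuning;

/*
 * Proposed rule: cap the mmap threshold at min(work_mem, 32MB) and keep
 * twice that amount of free memory at the top of the heap before
 * trimming. work_mem is taken in bytes here.
 */
static MallocTuning
propose_thresholds(size_t work_mem)
{
    MallocTuning t;

    t.mmap_threshold = (work_mem < DEFAULT_MMAP_THRESHOLD_MAX)
        ? work_mem : DEFAULT_MMAP_THRESHOLD_MAX;
    t.trim_threshold = 2 * t.mmap_threshold;
    return t;
}
```

With the default 4MB work_mem this yields a 4MB mmap threshold and an 8MB trim threshold; with 4GB work_mem it saturates at 32MB / 64MB.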
Le jeudi 16 décembre 2021, 11:56:15 CET Ronan Dunklau a écrit :
> I will follow up with a benchmark of the test sorting a table with a width
> varying from 1 to 32 columns.

So please find attached another benchmark for that case.

The 3 different patchsets tested are:

- master
- fixed (David's original patch)
- adjust (Thomas growing blocks patch)

So it looks like tuning malloc for this would be very beneficial for any kind of allocation, and by doing so we reduce the problems seen with the growing blocks patch to next to nothing, while keeping the ability to not allocate too much memory from the get-go.

I would like to try to implement some dynamic glibc malloc tuning, if that is something we don't reject on principle from the get-go.

-- Ronan Dunklau
Attachment
On 12/16/21 17:03, Ronan Dunklau wrote:
> Le jeudi 16 décembre 2021, 11:56:15 CET Ronan Dunklau a écrit :
>> I will follow up with a benchmark of the test sorting a table with a width
>> varying from 1 to 32 columns.
>
> So please find attached another benchmark for that case.
>
> The 3 different patchsets tested are:
>
> - master
> - fixed (David's original patch)
> - adjust (Thomas growing blocks patch)

Presumably Thomas is me, right?

> So it looks like tuning malloc for this would be very benificial for any kind
> of allocation, and by doing so we reduce the problems seen with the growing
> blocks patch to next to nothing, while keeping the ability to not allocate too
> much memory from the get go.

Thanks for running those tests and investigating the glibc behavior! I find those results very interesting. My conclusion from this is that the interaction between "our" allocator and the allocator in malloc (e.g. glibc) can be problematic. Which makes benchmarking and optimization somewhat tricky, because code changes may trigger behavior changes in glibc (or whatever allocator backs malloc).

I think it's worth exploring if we can tune this in a reasonable way, but I have a couple of concerns related to that:

1) I wonder how glibc-specific this is - I'd bet it applies to other allocators (either on another OS or just a different allocator on Linux) too. Tweaking glibc parameters won't affect those systems, of course, but maybe we should allow tweaking those systems too ...

2) In fact, I wonder if different glibc versions behave differently? Hopefully it's not changing that much, though. Ditto kernel versions, but the mmap/sbrk interface is likely more stable. We can test this.

3) If we bump the thresholds, won't that work against reusing the memory? I mean, if we free a whole block (from any allocator we have), glibc might return it to the kernel, depending on the mmap threshold value.
It's not guaranteed, but increasing the malloc thresholds will make that even less likely. So we might just as well increase the minimum block size, with about the same effect, no?

> I would like to try to implement some dynamic glibc malloc tuning, if that is
> something we don't reject on principle from the get go.

+1 to that

regards

-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Le jeudi 16 décembre 2021, 18:00:56 CET Tomas Vondra a écrit : > On 12/16/21 17:03, Ronan Dunklau wrote: > > Le jeudi 16 décembre 2021, 11:56:15 CET Ronan Dunklau a écrit : > >> I will follow up with a benchmark of the test sorting a table with a > >> width > >> varying from 1 to 32 columns. > > > > So please find attached another benchmark for that case. > > > > The 3 different patchsets tested are: > > - master > > - fixed (David's original patch) > > - adjust (Thomas growing blocks patch) > > Presumably Thomas is me, right? I'm really sorry for this typo... Please accept my apologies. > > > So it looks like tuning malloc for this would be very benificial for any > > kind of allocation, and by doing so we reduce the problems seen with the > > growing blocks patch to next to nothing, while keeping the ability to not > > allocate too much memory from the get go. > > Thanks for running those tests and investigating the glibc behavior! I > find those results very interesting. My conclusions from this is that > the interaction interaction between "our" allocator and the allocator in > malloc (e.g. glibc) can be problematic. Which makes benchmarking and > optimization somewhat tricky because code changes may trigger behavior > change in glibc (or whatever allocator backs malloc). > > I think it's worth exploring if we can tune this in a reasonable way, > but I have a couple concerns related to that: > > 1) I wonder how glibc-specific this is - I'd bet it applies to other > allocators (either on another OS or just different allocator on Linux) > too. Tweaking glibc parameters won't affect those systems, of course, > but maybe we should allow tweaking those systems too ... I agree, finding their specific problems and see if we can workaround it would be interesting. I suppose glibc's malloc is the most commonly used allocator in production, as it is the default for most Linux distributions. > > 2) In fact, I wonder if different glibc versions behave differently? 
> Hopefully it's not changing that much, though. Ditto kernel versions,
> but the mmap/sbrk interface is likely more stable. We can test this.

That could be tested, yes. As a matter of fact, a commit removing the upper limit for MALLOC_MMAP_THRESHOLD has just been committed yesterday to glibc, which means we can service much bigger allocations without mmap.

> 3) If we bump the thresholds, won't that work against reusing the
> memory? I mean, if we free a whole block (from any allocator we have),
> glibc might return it to kernel, depending on mmap threshold value. It's
> not guaranteed, but increasing the malloc thresholds will make that even
> less likely. So we might just as well increase the minimum block size,
> with about the same effect, no?

It is my understanding that malloc will try to compact memory by moving it around. So the memory should actually be released to the kernel at some point. In the meantime, malloc can reuse it for our next invocation (which can be in a different memory context on our side).

If we increase the minimum block size, this is memory we will actually reserve, and it will not protect us against the ramping-up behaviour:
- the first allocation of a big block may be over mmap_threshold, and serviced by an expensive mmap
- when it's freed, the threshold is doubled
- the next invocation is serviced by an sbrk call
- freeing it will be above the trim threshold, and it will be returned

After several "big" allocations, the thresholds will rise to their maximum values (well, it used to work that way; I need to check what happens with that latest patch of glibc...). This will typically happen several times, as malloc doubles the threshold each time. This is probably the reason quadrupling the block sizes was more effective.

> > I would like to try to implement some dynamic glibc malloc tuning, if that
> > is something we don't reject on principle from the get go.
>
> +1 to that

Ok, I'll work on a patch for this and submit a new thread.

-- Ronan Dunklau
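[The ramping-up behaviour described in the message above can be modelled as a toy simulation. This follows the simplified "threshold doubles each time an mmap'ed block is freed" rule as stated in the thread, not glibc's exact internal logic; all names are illustrative.]

```c
#include <assert.h>
#include <stddef.h>

#define MMAP_THRESHOLD_MAX (32 * 1024 * 1024)   /* glibc's historical cap */

/*
 * Simulate the ramp-up: each time an allocation of `alloc_size` is served
 * by mmap and then freed, the mmap threshold doubles, until the
 * allocation falls below the threshold (or the cap is reached). Returns
 * how many expensive mmap round-trips happen before the ramp-up settles.
 */
static int
mmap_round_trips(size_t threshold, size_t alloc_size)
{
    int     n = 0;

    while (alloc_size >= threshold && threshold < MMAP_THRESHOLD_MAX)
    {
        threshold *= 2;         /* freeing the mmap'ed block bumps it */
        n++;
    }
    return n;
}
```

For a repeated 1MB allocation against the default 128kB starting threshold this gives four mmap round-trips before sbrk takes over, which matches the "this will typically happen several times" observation.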
Le vendredi 17 décembre 2021, 09:08:06 CET Ronan Dunklau a écrit : > It is my understanding that malloc will try to compact memory by moving it > around. So the memory should be actually be released to the kernel at some > point. In the meantime, malloc can reuse it for our next invocation (which > can be in a different memory context on our side). I've been told off-list this comment wasn't clear: I meant that it compacts *free* memory, consolidating into larger blocks that will eventually reach the threshold, and be released. -- Ronan Dunklau
On 12/17/21 09:08, Ronan Dunklau wrote: > Le jeudi 16 décembre 2021, 18:00:56 CET Tomas Vondra a écrit : >> On 12/16/21 17:03, Ronan Dunklau wrote: >>> Le jeudi 16 décembre 2021, 11:56:15 CET Ronan Dunklau a écrit : >>>> I will follow up with a benchmark of the test sorting a table with a >>>> width >>>> varying from 1 to 32 columns. >>> >>> So please find attached another benchmark for that case. >>> >>> The 3 different patchsets tested are: >>> - master >>> - fixed (David's original patch) >>> - adjust (Thomas growing blocks patch) >> >> Presumably Thomas is me, right? > > I'm really sorry for this typo... Please accept my apologies. > Nah, no apology needed. I was just wondering if I missed some patch from Thomas Munro ... >> >>> So it looks like tuning malloc for this would be very benificial for any >>> kind of allocation, and by doing so we reduce the problems seen with the >>> growing blocks patch to next to nothing, while keeping the ability to not >>> allocate too much memory from the get go. >> >> Thanks for running those tests and investigating the glibc behavior! I >> find those results very interesting. My conclusions from this is that >> the interaction interaction between "our" allocator and the allocator in >> malloc (e.g. glibc) can be problematic. Which makes benchmarking and >> optimization somewhat tricky because code changes may trigger behavior >> change in glibc (or whatever allocator backs malloc). >> >> I think it's worth exploring if we can tune this in a reasonable way, >> but I have a couple concerns related to that: >> >> 1) I wonder how glibc-specific this is - I'd bet it applies to other >> allocators (either on another OS or just different allocator on Linux) >> too. Tweaking glibc parameters won't affect those systems, of course, >> but maybe we should allow tweaking those systems too ... > > I agree, finding their specific problems and see if we can workaround it would > be interesting. 
I suppose glibc's malloc is the most commonly used allocator > in production, as it is the default for most Linux distributions. > I wasn't really suggesting to investigate those other allocators in this patch - it seems like a task requiring a pretty significant amount of work/time. My point was that we should make it reasonably easy to add tweaks for those other environments, if someone is interested enough to do the legwork. >> >> 2) In fact, I wonder if different glibc versions behave differently? >> Hopefully it's not changing that much, though. Ditto kernel versions, >> but the mmap/sbrk interface is likely more stable. We can test this. > > That could be tested, yes. As a matter of fact, a commit removing the upper > limit for MALLOC_MMAP_THRESHOLD has just been committed yesterday to glibc, > which means we can service much bigger allocation without mmap. > Yeah, I noticed that commit too. Most systems stick to one glibc version, so it'll take time to reach most systems. Let's continue with just one glibc version and then maybe test other versions. > >> >> 3) If we bump the thresholds, won't that work against reusing the >> memory? I mean, if we free a whole block (from any allocator we have), >> glibc might return it to kernel, depending on mmap threshold value. It's >> not guaranteed, but increasing the malloc thresholds will make that even >> less likely. So we might just as well increase the minimum block size, >> with about the same effect, no? > > It is my understanding that malloc will try to compact memory by moving it > around. So the memory should be actually be released to the kernel at some > point. In the meantime, malloc can reuse it for our next invocation (which can > be in a different memory context on our side). 
>
> If we increase the minimum block size, this is memory we will actually
> reserve, and it will not protect us against the ramping-up behaviour:
> - the first allocation of a big block may be over mmap_threshold, and serviced
> by an expensive mmap
> - when it's free, the threshold is doubled
> - next invocation is serviced by an sbrk call
> - freeing it will be above the trim threshold, and it will be returned.
>
> After several "big" allocations, the thresholds will raise to their maximum
> values (well, it used to, I need to check what happens with that latest patch
> of glibc...)
>
> This will typically happen several times as malloc doubles the threshold each
> time. This is probably the reason quadrupling the block sizes was more
> effective.

Hmmm, OK. Can we benchmark the case with a large initial block size, at least for comparison?

>>> I would like to try to implement some dynamic glibc malloc tuning, if that
>>> is something we don't reject on principle from the get go.
>>
>> +1 to that
>
> Ok, I'll work on a patch for this and submit a new thread.

Cool!

regards

-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Le vendredi 17 décembre 2021, 14:39:10 CET Tomas Vondra a écrit : > I wasn't really suggesting to investigate those other allocators in this > patch - it seems like a task requiring a pretty significant amount of > work/time. My point was that we should make it reasonably easy to add > tweaks for those other environments, if someone is interested enough to > do the legwork. > > >> 2) In fact, I wonder if different glibc versions behave differently? > >> Hopefully it's not changing that much, though. Ditto kernel versions, > >> but the mmap/sbrk interface is likely more stable. We can test this. > > > > That could be tested, yes. As a matter of fact, a commit removing the > > upper > > limit for MALLOC_MMAP_THRESHOLD has just been committed yesterday to > > glibc, > > which means we can service much bigger allocation without mmap. > > Yeah, I noticed that commit too. Most systems stick to one glibc > version, so it'll take time to reach most systems. Let's continue with > just one glibc version and then maybe test other versions. Yes, I also need to figure out how to detect we're using glibc as I'm not very familiar with configure. > > >> 3) If we bump the thresholds, won't that work against reusing the > >> memory? I mean, if we free a whole block (from any allocator we have), > >> glibc might return it to kernel, depending on mmap threshold value. It's > >> not guaranteed, but increasing the malloc thresholds will make that even > >> less likely. So we might just as well increase the minimum block size, > >> with about the same effect, no? > > > > It is my understanding that malloc will try to compact memory by moving it > > around. So the memory should be actually be released to the kernel at some > > point. In the meantime, malloc can reuse it for our next invocation (which > > can be in a different memory context on our side). 
> >
> > If we increase the minimum block size, this is memory we will actually
> > reserve, and it will not protect us against the ramping-up behaviour:
> > - the first allocation of a big block may be over mmap_threshold, and
> > serviced by an expensive mmap
> > - when it's free, the threshold is doubled
> > - next invocation is serviced by an sbrk call
> > - freeing it will be above the trim threshold, and it will be returned.
> >
> > After several "big" allocations, the thresholds will raise to their
> > maximum values (well, it used to, I need to check what happens with that
> > latest patch of glibc...)
> >
> > This will typically happen several times as malloc doubles the threshold
> > each time. This is probably the reason quadrupling the block sizes was
> > more effective.
>
> Hmmm, OK. Can we we benchmark the case with large initial block size, at
> least for comparison?

The benchmark I called "fixed" was with a fixed block size of ALLOCSET_DEFAULT_MAXSIZE (first proposed patch) and showed roughly the same performance profile as the growing blocks + malloc tuning. But if I understand correctly, you implemented the growing blocks logic after concerns about wasting memory with a constant large block size. If we tune malloc, that memory would not be wasted if we don't alloc it, just not released as eagerly when it's allocated.

Or do you want a benchmark with an even bigger initial block size? With the growing blocks patch with a large initial size? I can run either, I just want to understand what is interesting to you.

One thing that would be interesting would be to trace the total amount of memory allocated in the different cases. This is something I will need to do anyway when I propose that patch.

Best regards,

-- Ronan Dunklau
On Fri, 10 Sept 2021 at 01:38, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > I've spent quite a bit of time investigating the regressions clearly > visible in the benchmark results for some allocation/free patterns. > Attached is a subset of results from a recent run, but the old results > show mostly the same thing. I started looking at the original patch again and wondered if it might be a good idea to allow the estimates from the planner about the tuple size and the number of tuples to help determine the size of the chunks in the generation context. The number of tuples can maybe also help size the array that we're sorting, which would save us allocating just 1024 elements and doubling it each time we run out of space in the array. The main problem with that is that there will be many cases where we need to sort but don't have any estimates from the planner, so if we do this, then likely, we'll still need the chunk growing code. I've attached some benchmark results that I took recently. The spreadsheet contains results from 3 versions. master, master + 0001 - 0002, then master + 0001 - 0003. The 0003 patch makes the code a bit more conservative about the chunk sizes it allocates and also tries to allocate the tuple array according to the number of tuples we expect to be able to sort in a single batch for when the sort is not estimated to fit inside work_mem. For the 4MB work_mem test in the spreadsheet, this does end up using 2MB chunks rather than 4MB chunks. This is because the 10k tuple test is just so close to requiring 4MB of memory to perform the sort. This does mean that for the 10k tuple test, the 0003 patch makes things look quite a bit slower than just with the 0001 + 0002 patch. However, I think not making the chunk size the same size as work_mem is a good idea. The patch is still WIP. 
I still have a few magic numbers in there, for example the 48 in: state->est_tuple_memory = pg_nextpower2_64(state->est_tuples * (state->est_tupwidth + 48)); I should really have a sizeof() in there. I've not really looked for other opportunities besides nodeSort.c where I can pass tuple estimates down into tuplesort.c David
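For what it's worth, the shape of that computation can be sketched as below. `TUPLE_OVERHEAD` is a hypothetical stand-in for the magic 48 (which, as noted, should really be a sizeof() of the per-tuple overhead), and `next_power2_64` mimics what pg_nextpower2_64 does:

```c
#include <assert.h>
#include <stdint.h>

/* Equivalent of pg_nextpower2_64: smallest power of 2 >= v, for v > 0. */
static uint64_t
next_power2_64(uint64_t v)
{
	uint64_t	p = 1;

	while (p < v)
		p <<= 1;
	return p;
}

/* Placeholder for the magic 48: per-tuple header/bookkeeping overhead. */
#define TUPLE_OVERHEAD 48

/* Sketch of the estimate: planner row count times per-tuple footprint,
 * rounded up to a power of 2 to pick a block size. */
static uint64_t
est_tuple_memory(uint64_t est_tuples, uint64_t est_tupwidth)
{
	return next_power2_64(est_tuples * (est_tupwidth + TUPLE_OVERHEAD));
}
```

For example, 10k tuples of width 52 estimate to 1,000,000 bytes, rounded up to 1MB, which is how a sort that is "just so close" to a boundary can flip to the next chunk size.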
Attachment
Le vendredi 31 décembre 2021, 22:26:37 CET David Rowley a écrit : > I've attached some benchmark results that I took recently. The > spreadsheet contains results from 3 versions. master, master + 0001 - > 0002, then master + 0001 - 0003. The 0003 patch makes the code a bit > more conservative about the chunk sizes it allocates and also tries to > allocate the tuple array according to the number of tuples we expect > to be able to sort in a single batch for when the sort is not > estimated to fit inside work_mem. (Sorry for trying to merge back the discussion on the two sides of the thread) In https://www.postgresql.org/message-id/4776839.iZASKD2KPV%40aivenronan, I expressed the idea of being able to tune glibc's malloc behaviour. I implemented that (patch 0001) to provide a new hook which is called on backend startup, and anytime we set work_mem. This hook is #defined depending on the malloc implementation: currently a default, no-op implementation is provided, as well as a glibc malloc implementation. The glibc malloc implementation relies on a new GUC, glibc_malloc_max_trim_threshold. When set to its default value of -1, we don't tune malloc at all, exactly as in HEAD. If a different value is provided, we set M_MMAP_THRESHOLD to half this value, and M_TRIM_THRESHOLD to this value, capped by work_mem / 2 and work_mem respectively. The net result is that we can then keep more unused memory at the top of the heap, and use mmap less frequently, if the DBA chooses to. Another possible use case would be, on the contrary, to limit the allocated memory in idle backends to a minimum. The reasoning behind this is that glibc malloc's default way of handling those two thresholds is to adapt to the size of the last freed mmapped block. I've run the same "up to 32 columns" benchmark as you did, with this new patch applied on top of both HEAD and your v2 patchset incorporating planner estimates for the block sizes.
Those are called "aset" and "generation" in the attached spreadsheet. For each, I've run it with glibc_malloc_max_trim_threshold set to -1, 1MB, 4MB and 64MB. In each case I've measured two things: - query latency, as reported by pgbench - total memory allocated by malloc at backend exit after running each query three times. This represents the "idle" memory consumption, and thus what we waste in malloc instead of releasing it back to the system. This measurement has been performed using the very small module presented in patch 0002. Please note that I in no way propose that we include this module, it was just a convenient way for me to measure memory footprint. My conclusion is that the impressive gains you see from using the generation context with bigger blocks mostly come from the fact that we allocate bigger blocks, and that this moves the mmap thresholds accordingly. I wonder how much of a difference it would make on other malloc implementations: I'm afraid the optimisation presented here would in fact be specific to glibc's malloc, since we have almost the same gains with both allocators when tuning malloc to keep more memory. I still think both approaches are useful, and would be necessary. Since this affects all memory allocations, I need to come up with other meaningful scenarios to benchmark. -- Ronan Dunklau
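The threshold computation described earlier in this message could look roughly like the following sketch. The function name and the kB units are assumptions for illustration, not the actual patch; only mallopt()/malloc_trim() are real glibc calls, shown here in a comment:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical sketch: derive the two glibc tuning values from the
 * glibc_malloc_max_trim_threshold GUC and work_mem (both in kB, like
 * work_mem itself).  M_MMAP_THRESHOLD gets half the GUC value and
 * M_TRIM_THRESHOLD the full value, capped by work_mem / 2 and work_mem
 * respectively, as described above.
 */
static void
compute_glibc_thresholds(int max_trim_threshold_kb, int work_mem_kb,
						 size_t *mmap_bytes, size_t *trim_bytes)
{
	int			mmap_kb = max_trim_threshold_kb / 2;
	int			trim_kb = max_trim_threshold_kb;

	if (mmap_kb > work_mem_kb / 2)
		mmap_kb = work_mem_kb / 2;
	if (trim_kb > work_mem_kb)
		trim_kb = work_mem_kb;

	*mmap_bytes = (size_t) mmap_kb * 1024;
	*trim_bytes = (size_t) trim_kb * 1024;

	/*
	 * On glibc the values would then be applied with
	 * mallopt(M_MMAP_THRESHOLD, ...) and mallopt(M_TRIM_THRESHOLD, ...),
	 * plus a malloc_trim(0) when lowering them to give memory back.
	 */
}
```

So with the GUC at 64MB but work_mem at 4MB, the caps win: mmap threshold 2MB, trim threshold 4MB.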
Attachment
On 1/7/22 12:03, Ronan Dunklau wrote: > Le vendredi 31 décembre 2021, 22:26:37 CET David Rowley a écrit : >> I've attached some benchmark results that I took recently. The >> spreadsheet contains results from 3 versions. master, master + 0001 - >> 0002, then master + 0001 - 0003. The 0003 patch makes the code a bit >> more conservative about the chunk sizes it allocates and also tries to >> allocate the tuple array according to the number of tuples we expect >> to be able to sort in a single batch for when the sort is not >> estimated to fit inside work_mem. > > (Sorry for trying to merge back the discussion on the two sides of the thread) > > In https://www.postgresql.org/message-id/4776839.iZASKD2KPV%40aivenronan, I > expressed the idea of being able to tune glibc's malloc behaviour. > > I implemented that (patch 0001) to provide a new hook which is called on > backend startup, and anytime we set work_mem. This hook is # defined depending > on the malloc implementation: currently a default, no-op implementation is > provided as well as a glibc's malloc implementation. > Not sure I'd call this a hook - that usually means a way to plug-in custom code through a callback, and this is simply ifdefing a block of code to pick the right implementation. Which may be a good way to do that, just let's not call that a hook. There's a commented-out MallocTuneHook() call, probably not needed. I wonder if #ifdefing is sufficient solution, because it happens at compile time, so what if someone overrides the allocator in LD_PRELOAD? That was a fairly common way to use a custom allocator in an existing application. But I don't know how many people do that with Postgres (I'm not aware of anyone doing that) or if we support that (it'd probably apply to other stuff too, not just malloc). So maybe it's OK, and I can't think of a better way anyway. > The glibc's malloc implementation relies on a new GUC, > glibc_malloc_max_trim_threshold. 
When set to it's default value of -1, we > don't tune malloc at all, exactly as in HEAD. If a different value is provided, > we set M_MMAP_THRESHOLD to half this value, and M_TRIM_TRESHOLD to this value, > capped by work_mem / 2 and work_mem respectively. > > The net result is that we can then allow to keep more unused memory at the top > of the heap, and to use mmap less frequently, if the DBA chooses too. A > possible other use case would be to on the contrary, limit the allocated > memory in idle backends to a minimum. > > The reasoning behind this is that glibc's malloc default way of handling those > two thresholds is to adapt to the size of the last freed mmaped block. > > I've run the same "up to 32 columns" benchmark as you did, with this new patch > applied on top of both HEAD and your v2 patchset incorporating planner > estimates for the block sizez. Those are called "aset" and "generation" in the > attached spreadsheet. For each, I've run it with > glibc_malloc_max_trim_threshold set to -1, 1MB, 4MB and 64MB. In each case > I've measured two things: > - query latency, as reported by pgbench > - total memory allocated by malloc at backend ext after running each query > three times. This represents the "idle" memory consumption, and thus what we > waste in malloc inside of releasing back to the system. This measurement has > been performed using the very small module presented in patch 0002. Please > note that I in no way propose that we include this module, it was just a > convenient way for me to measure memory footprint. > > My conclusion is that the impressive gains you see from using the generation > context with bigger blocks mostly comes from the fact that we allocate bigger > blocks, and that this moves the mmap thresholds accordingly. 
I wonder how much > of a difference it would make on other malloc implementation: I'm afraid the > optimisation presented here would in fact be specific to glibc's malloc, since > we have almost the same gains with both allocators when tuning malloc to keep > more memory. I still think both approaches are useful, and would be necessary. > Interesting measurements. It's intriguing that for generation contexts, the default "-1" often outperforms "1MB" (but not the other options), while for aset it's pretty much "the higher value the better". > Since this affects all memory allocations, I need to come up with other > meaningful scenarios to benchmarks. > OK. Are you thinking about a different microbenchmark, or something closer to real workload? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Le vendredi 7 janvier 2022, 13:03:28 CET Tomas Vondra a écrit : > On 1/7/22 12:03, Ronan Dunklau wrote: > > Le vendredi 31 décembre 2021, 22:26:37 CET David Rowley a écrit : > >> I've attached some benchmark results that I took recently. The > >> spreadsheet contains results from 3 versions. master, master + 0001 - > >> 0002, then master + 0001 - 0003. The 0003 patch makes the code a bit > >> more conservative about the chunk sizes it allocates and also tries to > >> allocate the tuple array according to the number of tuples we expect > >> to be able to sort in a single batch for when the sort is not > >> estimated to fit inside work_mem. > > > > (Sorry for trying to merge back the discussion on the two sides of the > > thread) > > > > In https://www.postgresql.org/message-id/4776839.iZASKD2KPV%40aivenronan, > > I expressed the idea of being able to tune glibc's malloc behaviour. > > > > I implemented that (patch 0001) to provide a new hook which is called on > > backend startup, and anytime we set work_mem. This hook is # defined > > depending on the malloc implementation: currently a default, no-op > > implementation is provided as well as a glibc's malloc implementation. > > Not sure I'd call this a hook - that usually means a way to plug-in > custom code through a callback, and this is simply ifdefing a block of > code to pick the right implementation. Which may be a good way to do > that, just let's not call that a hook. > > There's a commented-out MallocTuneHook() call, probably not needed. Ok, I'll clean that up if we decide to proceed with this. > > I wonder if #ifdefing is sufficient solution, because it happens at > compile time, so what if someone overrides the allocator in LD_PRELOAD? > That was a fairly common way to use a custom allocator in an existing > application. But I don't know how many people do that with Postgres (I'm > not aware of anyone doing that) or if we support that (it'd probably > apply to other stuff too, not just malloc). 
So maybe it's OK, and I > can't think of a better way anyway. I couldn't think of a better way either, maybe there is something to be done with trying to dlsym something specific to glibc's malloc implementation ? > > > The glibc's malloc implementation relies on a new GUC, > > glibc_malloc_max_trim_threshold. When set to it's default value of -1, we > > don't tune malloc at all, exactly as in HEAD. If a different value is > > provided, we set M_MMAP_THRESHOLD to half this value, and M_TRIM_TRESHOLD > > to this value, capped by work_mem / 2 and work_mem respectively. > > > > The net result is that we can then allow to keep more unused memory at the > > top of the heap, and to use mmap less frequently, if the DBA chooses too. > > A possible other use case would be to on the contrary, limit the > > allocated memory in idle backends to a minimum. > > > > The reasoning behind this is that glibc's malloc default way of handling > > those two thresholds is to adapt to the size of the last freed mmaped > > block. > > > > I've run the same "up to 32 columns" benchmark as you did, with this new > > patch applied on top of both HEAD and your v2 patchset incorporating > > planner estimates for the block sizez. Those are called "aset" and > > "generation" in the attached spreadsheet. For each, I've run it with > > glibc_malloc_max_trim_threshold set to -1, 1MB, 4MB and 64MB. In each case > > > > I've measured two things: > > - query latency, as reported by pgbench > > - total memory allocated by malloc at backend ext after running each > > query > > > > three times. This represents the "idle" memory consumption, and thus what > > we waste in malloc inside of releasing back to the system. This > > measurement has been performed using the very small module presented in > > patch 0002. Please note that I in no way propose that we include this > > module, it was just a convenient way for me to measure memory footprint. 
> > My conclusion is that the impressive gains you see from using the > > generation context with bigger blocks mostly comes from the fact that we > > allocate bigger blocks, and that this moves the mmap thresholds > > accordingly. I wonder how much of a difference it would make on other > > malloc implementation: I'm afraid the optimisation presented here would > > in fact be specific to glibc's malloc, since we have almost the same > > gains with both allocators when tuning malloc to keep more memory. I > > still think both approaches are useful, and would be necessary. > Interesting measurements. It's intriguing that for generation contexts, > the default "-1" often outperforms "1MB" (but not the other options), > while for aset it's pretty much "the higher value the better". For generation contexts with "big block sizes" this result is expected, as the malloc dynamic tuning will adapt to the big block size. This can also be seen on the "idle memory" measurement: the memory consumption is identical to the 64MB value when using -1, since that's what we converge to. This makes it possible to configure postgres to be more conservative with memory: for example, if we have long-lived backends where we sometimes temporarily set work_mem to a high value, we may end up with a large memory footprint. The implementation I provide also requests a malloc trim when we lower the threshold, making it possible to release memory that would otherwise have been kept around forever. For aset, the memory allocation pattern is a bit more complicated, and we don't end up with such a high value for mmap_threshold. Also, one thing that I haven't explained yet is the weird outlier when there is only one column. > > Since this affects all memory allocations, I need to come up with other > > meaningful scenarios to benchmark. > OK. Are you thinking about a different microbenchmark, or something > closer to real workload? Both.
As for microbenchmarking, I'd like to test the following scenarios: - set returning functions allocating a lot of memory - maintenance operations: REINDEX TABLE and the like, where we may end up with a large amount of memory used. - operations involving large hash tables For real workloads, if you have something specific in mind let me know. One thing I didn't mention is that I set max_parallel_workers_per_gather to 0 in all tests. -- Ronan Dunklau
On Tue, 18 Jan 2022 at 19:45, Julien Rouhaud <rjuju123@gmail.com> wrote: > I'm also a bit confused about which patch(es) should actually be reviewed in > that thread. My understanding is that the original patch might not be relevant > anymore but I might be wrong. The CF entry should probably be updated to > clarify that, with an annotation/ title change / author update or something. > > In the meantime I will switch the entry to Waiting on Author. I think what's going on here is a bit confusing. There's quite a few patches on here that aim to do very different things. The patch I originally proposed was just to make tuplesort use generation memory contexts. This helps speed up tuplesort because generation contexts allow palloc() to be faster than the standard allocator. Additionally, there's no power-of-2 wastage with generation contexts like there is with the standard allocator. This helps fit more tuples on fewer CPU cache lines. I believe Andres then mentioned that the fixed block size for generation contexts might become a problem. With the standard allocator the actual size of the underlying malloc() grows up until a maximum size. With generation contexts this is always the same, so we could end up with a very large number of blocks which means lots of small mallocs which could end up slow. Tomas then posted a series of patches to address this. I then posted another patch that has the planner make a choice on the size of the blocks to use for the generation context based on the tuple width and number of tuple estimates. My hope there was to improve performance further by making a better choice for how large to make the blocks in the first place. This reduces the need for Tomas' patch, but does not eliminate the need for it. As of now, I still believe we'll need Tomas' patches to allow the block size to grow up to a maximum size. I think those patches are likely needed before we think about making tuplesort use generation contexts. 
The reason I believe this is that we don't always have good estimates for the number of tuples we're going to sort. nodeSort.c is fairly easy as it just fetches tuples once and then sorts them. The use of tuplesort for nodeIncrementalSort.c is much more complex as we'll likely be performing a series of smaller sorts which are much harder to get size estimates for. This means there's likely no magic block size that we can use for the generation context that's used to store the tuples in tuplesort. This is also the case for order by aggregates and probably most other usages of tuplesort. David
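To illustrate the power-of-2 wastage argument with a standalone sketch (illustrative only, not actual aset.c code): the standard allocator rounds every chunk up to the next power of 2, the generation context does not.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest power of 2 >= n (n > 0), as the standard allocator rounds. */
static size_t
round_up_pow2(size_t n)
{
	size_t		p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/*
 * Bytes wasted per tuple by the rounding.  A 40-byte tuple burns 24 extra
 * bytes under aset-style rounding; a generation context would store it in
 * 40 bytes (plus a fixed chunk header, ignored here).
 */
static size_t
aset_wastage(size_t tuple_size)
{
	return round_up_pow2(tuple_size) - tuple_size;
}
```

This is the effect behind the 5541kB-vs-3197kB memory numbers at the top of the thread: tuples just over a power-of-2 boundary nearly double their footprint under the standard allocator.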
Le jeudi 20 janvier 2022, 02:31:15 CET David Rowley a écrit : > On Tue, 18 Jan 2022 at 19:45, Julien Rouhaud <rjuju123@gmail.com> wrote: > > I'm also a bit confused about which patch(es) should actually be reviewed > > in that thread. My understanding is that the original patch might not be > > relevant anymore but I might be wrong. The CF entry should probably be > > updated to clarify that, with an annotation/ title change / author update > > or something. > > > > In the meantime I will switch the entry to Waiting on Author. > > I think what's going on here is a bit confusing. There's quite a few > patches on here that aim to do very different things. > > The patch I originally proposed was just to make tuplesort use > generation memory contexts. This helps speed up tuplesort because > generation contexts allow palloc() to be faster than the standard > allocator. Additionally, there's no power-of-2 wastage with generation > contexts like there is with the standard allocator. This helps fit > more tuples on fewer CPU cache lines. > > I believe Andres then mentioned that the fixed block size for > generation contexts might become a problem. With the standard > allocator the actual size of the underlying malloc() grows up until a > maximum size. With generation contexts this is always the same, so we > could end up with a very large number of blocks which means lots of > small mallocs which could end up slow. Tomas then posted a series of > patches to address this. > > I then posted another patch that has the planner make a choice on the > size of the blocks to use for the generation context based on the > tuple width and number of tuple estimates. My hope there was to > improve performance further by making a better choice for how large to > make the blocks in the first place. This reduces the need for Tomas' > patch, but does not eliminate the need for it. 
> > As of now, I still believe we'll need Tomas' patches to allow the > > block size to grow up to a maximum size. I think those patches are > > likely needed before we think about making tuplesort use generation > > contexts. The reason I believe this is that we don't always have good > > estimates for the number of tuples we're going to sort. nodeSort.c is > > fairly easy as it just fetches tuples once and then sorts them. The > > use of tuplesort for nodeIncrementalSort.c is much more complex as > > we'll likely be performing a series of smaller sorts which are much > > harder to get size estimates for. This means there's likely no magic > > block size that we can use for the generation context that's used to > > store the tuples in tuplesort. This is also the case for order by > > aggregates and probably most other usages of tuplesort. You left out the discussion I started about glibc's malloc-specific behaviour. I tried to demonstrate that a big part of the performance gain you were seeing was not in fact due to our allocator by itself, but to the way different block-size allocations interact with this malloc implementation and how it handles releasing memory back to the system. I also tried to show that we can give DBAs more control over how much memory to "waste" as the price for faster allocations. I agree that the generation context memory manager is useful in the tuplesort context, if only for the fact that we fall back to disk less quickly due to the reduced wastage of memory, but I would be wary of drawing conclusions too quickly based on a specific malloc implementation which shows threshold effects, which are compounded by our growing block sizes code in general. I have on my TODO list to run the same kind of benchmarks using jemalloc, to better isolate the performance gains expected from the generation allocator itself from the side effect of avoiding glibc's pattern of releasing memory back to the kernel too quickly.
Julien, sorry for the confusion of adding that discussion and those patches to the same thread, but I don't really see a way of dissociating those two aspects. -- Ronan Dunklau
Le jeudi 20 janvier 2022, 08:36:46 CET Ronan Dunklau a écrit : > Le jeudi 20 janvier 2022, 02:31:15 CET David Rowley a écrit : > > On Tue, 18 Jan 2022 at 19:45, Julien Rouhaud <rjuju123@gmail.com> wrote: > > > I'm also a bit confused about which patch(es) should actually be > > > reviewed > > > in that thread. My understanding is that the original patch might not > > > be > > > relevant anymore but I might be wrong. The CF entry should probably be > > > updated to clarify that, with an annotation/ title change / author > > > update > > > or something. > > > > > > In the meantime I will switch the entry to Waiting on Author. > > > > I think what's going on here is a bit confusing. There's quite a few > > patches on here that aim to do very different things. > > > > The patch I originally proposed was just to make tuplesort use > > generation memory contexts. This helps speed up tuplesort because > > generation contexts allow palloc() to be faster than the standard > > allocator. Additionally, there's no power-of-2 wastage with generation > > contexts like there is with the standard allocator. This helps fit > > more tuples on fewer CPU cache lines. > > > > I believe Andres then mentioned that the fixed block size for > > generation contexts might become a problem. With the standard > > allocator the actual size of the underlying malloc() grows up until a > > maximum size. With generation contexts this is always the same, so we > > could end up with a very large number of blocks which means lots of > > small mallocs which could end up slow. Tomas then posted a series of > > patches to address this. > > > > I then posted another patch that has the planner make a choice on the > > size of the blocks to use for the generation context based on the > > tuple width and number of tuple estimates. My hope there was to > > improve performance further by making a better choice for how large to > > make the blocks in the first place. 
This reduces the need for Tomas' > > patch, but does not eliminate the need for it. > > > > As of now, I still believe we'll need Tomas' patches to allow the > > block size to grow up to a maximum size. I think those patches are > > likely needed before we think about making tuplesort use generation > > contexts. The reason I believe this is that we don't always have good > > estimates for the number of tuples we're going to sort. nodeSort.c is > > fairly easy as it just fetches tuples once and then sorts them. The > > use of tuplesort for nodeIncrementalSort.c is much more complex as > > we'll likely be performing a series of smaller sorts which are much > > harder to get size estimates for. This means there's likely no magic > > block size that we can use for the generation context that's used to > > store the tuples in tuplesort. This is also the case for order by > > aggregates and probably most other usages of tuplesort. > > You left out the discussion I started about glibc's malloc specific > behaviour. > > I tried to demonstrate that a big part of the performance gain you were > seeing were not in fact due to our allocator by itself, but by the way > different block sizes allocations interact with this malloc implementation > and how it handles releasing memory back to the system. I also tried to > show that we can give DBAs more control over how much memory to "waste" as > the price for faster allocations. > > I agree that the generation context memory manager is useful in the > tuplesort context, if only for the fact that we fall back to disk less > quickly due to the reduced wastage of memory, but I would be wary of > drawing conclusions too quickly based on a specific malloc implementation > which shows threshold effects, which are compounded by our growing block > sizes code in general. 
> > I have on my TODO list to run the same kind of benchmarks using jemalloc, to > > better isolate the performance gains expected from the generation allocator > > itself from the side effect of avoiding glibc's pattern of releasing memory > > back to the kernel too quickly. I've run the same 1-to-32-columns sort benchmark, using both glibc malloc and jemalloc, with the standard aset memory manager or with David's generation v2 patch. To use jemalloc, I just use an LD_PRELOAD env variable before starting postgres. I have not tried to measure jemalloc's memory footprint like I did for glibc's malloc in the previous benchmark, as I'm not trying to evaluate jemalloc as a replacement for glibc's malloc. Please find the results attached. As I suspected, most of the gain observed with David's patch comes from working around glibc's malloc idiosyncrasies, but the good news is that it also helps (to a lesser degree) with jemalloc. I'm not sure how that would behave with growing block sizes though: as Tomas' patch and stress-testing benchmarks showed, we can hit some particular thresholds (and regressions) in glibc's self-tuning behaviour. -- Ronan Dunklau
Attachment
Hi:
On Thu, Jan 20, 2022 at 9:31 AM David Rowley <dgrowleyml@gmail.com> wrote:
As of now, I still believe we'll need Tomas' patches to allow the
block size to grow up to a maximum size. I think those patches are
likely needed before we think about making tuplesort use generation
contexts. The reason I believe this is that we don't always have good
estimates for the number of tuples we're going to sort.
+1
I spent some time studying the case here and my current thought is:
we can discuss/commit the minimum committable changes, which
should be beneficial for some cases and harmless for others.
Tomas's patch 0002 would mean no more blocks are needed
if we switch to GenerationContext (compared with the standard context), and
David's patch can obviously reduce total memory usage and improve
the performance. So IMO Tomas's patch 0002 + David's patch is a committable
patchset for a first round, and Tomas's 0001 patch would be good to have
as well.
I double-checked Tomas's 0002 patch and it looks good to me. I then applied
David's patch with ALLOCSET_DEFAULT_SIZES, testing the same workload.
Here are the results (numbers are tps):
work_mem = '4GB'
| Test Case | master | patched |
|-----------+--------+---------|
| Test 1 | 306 | 406 |
| Test 2 | 225 | 278 |
| Test 3 | 202 | 248 |
| Test 4 | 184 | 218 |
| Test 5 | 281 | 360 |
work_mem = '4MB'
| Test Case | master | patched |
|-----------+--------+---------|
| Test 1 | 124 | 409 |
| Test 2 | 106 | 280 |
| Test 3 | 100 | 249 |
| Test 4 | 97 | 218 |
| Test 5 | 120 | 369 |
I didn't get as much of a performance improvement as David did originally; I
think that is because David used ALLOCSET_DEFAULT_MAXSIZE directly, which
needs fewer memory allocation calls.
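As a rough model of why a bigger (or growing) block size means fewer underlying malloc calls (a simplified sketch, not the actual aset.c/generation.c logic):

```c
#include <assert.h>
#include <stddef.h>

/* Each new block doubles the previous size, capped at a maximum, mirroring
 * how aset.c ramps up toward ALLOCSET_DEFAULT_MAXSIZE (8MB). */
static size_t
next_block_size(size_t cur_size, size_t max_size)
{
	size_t		next = cur_size * 2;

	return next > max_size ? max_size : next;
}

/* Number of blocks (i.e. malloc calls) needed to hold total bytes,
 * starting from init_size and growing via next_block_size. */
static int
blocks_needed(size_t total, size_t init_size, size_t max_size)
{
	int			nblocks = 0;
	size_t		size = init_size;

	while (total > 0)
	{
		total = (total > size) ? total - size : 0;
		size = next_block_size(size, max_size);
		nblocks++;
	}
	return nblocks;
}
```

Holding 4MB of tuples takes 10 blocks when growing from 8kB, but a single block when starting at the 8MB maximum directly, which is consistent with the smaller speedup seen here versus David's ALLOCSET_DEFAULT_MAXSIZE runs.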
AFAICS, Tomas's patch 0002 + David's patch should be ready to commit as round 1.
We can then pursue other opportunities, like using row estimates to size the initial
memory allocation, and GenerationContext improvements like 0003/0004. Would this work?
Best Regards
Andy Fan
On 1/25/22 13:34, Ronan Dunklau wrote: > > ... > > I've run the same 1-to-32-columns sort benchmark, using both glibc malloc and > jemalloc, with the standard aset memory manager or with David's generation v2 > patch. To use jemalloc, I just use a LD_PRELOAD env variable before starting > postgres. > > I have not tried to measure jemalloc's memory footprint like I did for glibc's > malloc in the previous benchmark, as I'm not trying to evaluate jemalloc as a > glibc's malloc replacement. > > Please find the results attached. As I suspected, most of the gain observed > with David's patch come from working around glibc's malloc idiosyncracies but > the good news is that it also helps (to a lesser degree) with jemalloc. > > I'm not sure how that would behave with growing block sizes though: as Tomas > patch and stress-testing benchmarks showed, we can hit some particular > thresholds (and regressions) in glibc's self-tuning behaviour. > Interesting results! So just switching to jemalloc can yield much better performance (in some cases), even beating the generation context (on top of glibc malloc). Of course, those are just simple (somewhat synthetic) benchmarks, and there is no info about memory consumption. Perhaps there are cases where this would be slower than glibc malloc, and I doubt many people would switch to a non-default malloc implementation. But I think in general this mostly confirms our suspicion about the patch (somewhat accidentally) working around glibc internal behavior. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 1/20/22 08:36, Ronan Dunklau wrote: > Le jeudi 20 janvier 2022, 02:31:15 CET David Rowley a écrit : >> On Tue, 18 Jan 2022 at 19:45, Julien Rouhaud <rjuju123@gmail.com> wrote: >>> I'm also a bit confused about which patch(es) should actually be reviewed >>> in that thread. My understanding is that the original patch might not be >>> relevant anymore but I might be wrong. The CF entry should probably be >>> updated to clarify that, with an annotation/ title change / author update >>> or something. >>> >>> In the meantime I will switch the entry to Waiting on Author. >> >> I think what's going on here is a bit confusing. There's quite a few >> patches on here that aim to do very different things. >> >> The patch I originally proposed was just to make tuplesort use >> generation memory contexts. This helps speed up tuplesort because >> generation contexts allow palloc() to be faster than the standard >> allocator. Additionally, there's no power-of-2 wastage with generation >> contexts like there is with the standard allocator. This helps fit >> more tuples on fewer CPU cache lines. >> >> I believe Andres then mentioned that the fixed block size for >> generation contexts might become a problem. With the standard >> allocator the actual size of the underlying malloc() grows up until a >> maximum size. With generation contexts this is always the same, so we >> could end up with a very large number of blocks which means lots of >> small mallocs which could end up slow. Tomas then posted a series of >> patches to address this. >> >> I then posted another patch that has the planner make a choice on the >> size of the blocks to use for the generation context based on the >> tuple width and number of tuple estimates. My hope there was to >> improve performance further by making a better choice for how large to >> make the blocks in the first place. This reduces the need for Tomas' >> patch, but does not eliminate the need for it. 
>> >> As of now, I still believe we'll need Tomas' patches to allow the block size to grow up to a maximum size. I think those patches are likely needed before we think about making tuplesort use generation contexts. The reason I believe this is that we don't always have good estimates for the number of tuples we're going to sort. nodeSort.c is fairly easy as it just fetches tuples once and then sorts them. The use of tuplesort for nodeIncrementalSort.c is much more complex as we'll likely be performing a series of smaller sorts which are much harder to get size estimates for. This means there's likely no magic block size that we can use for the generation context that's used to store the tuples in tuplesort. This is also the case for order by aggregates and probably most other usages of tuplesort. >> > > You left out the discussion I started about glibc's malloc-specific behaviour. > > I tried to demonstrate that a big part of the performance gain you were seeing was not in fact due to our allocator by itself, but to the way different block-size allocations interact with this malloc implementation and how it handles releasing memory back to the system. I also tried to show that we can give DBAs more control over how much memory to "waste" as the price for faster allocations. > > I agree that the generation context memory manager is useful in the tuplesort context, if only for the fact that we fall back to disk less quickly due to the reduced wastage of memory, but I would be wary of drawing conclusions too quickly based on a specific malloc implementation which shows threshold effects, which are compounded by our growing block sizes code in general. > Yeah, I share this concern - how much of the improvement is real and how much is due to accidentally not hitting some glibc-specific self-tuning behaviour? Would realistic workloads get the same improvement, or would they cause regressions?
What about non-glibc systems with other malloc? I'm not against pushing the generation context improvements, e.g. something like the patches posted in [1], because those seem reasonable in general. But I'm somewhat confused about the last patch (adjusting allocChunkLimit) causing fairly significant regressions. I'll try investigating this a bit more next week. regards [1] https://www.postgresql.org/message-id/flat/bcdd4e3e-c12d-cd2b-7ead-a91ad416100a%40enterprisedb.com#a5ecd4e5a9e7d5b07ff25b5684da894f -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sun, 13 Feb 2022 at 09:56, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > I'm not against pushing the generation context improvements, e.g. > something like the patches posted in [1], because those seem reasonable > in general. But I'm somewhat confused about the last patch (adjusting > allocChunkLimit) causing fairly significant regressions. My thoughts here are that we should pursue the patch that allows growing of the block size in the generation context. I do think the investigation of the malloc() behaviour and performance is worth some pursuit; I just don't think it should be done here or as part of this patch. I think it's a fairly large undertaking to ensure any changes to the malloc options are an overall win and not just a win for whatever benchmark we test them on. If they're really an overall win, then shouldn't glibc know about them and maybe adopt them as the standard options? To get the ball rolling again on the changes to the generation context, I took your patches, Tomas, and fixed a few things around the keeper blocks not working correctly. I've now made the keeper block behaviour match how aset.c works. There the keeper block is allocated in the same allocation as the context itself. That reduces the number of malloc() calls when setting up a new memory context. On testing this, I discovered a problem regarding the use of generation contexts for storing tuples in tuplesort.c. The problem is that when the sort is bounded (e.g. SELECT * FROM ... ORDER BY .. LIMIT N), we'll store and directly throw away any tuples that sort greater than the existing set of N tuples. This causes a problem because once we move away from initially using the keeper block, we'll at some point end up storing a tuple in a newly allocated generation block and then proceed to pfree() that tuple due to it sorting greater than the existing N tuples. Because we currently always free() newly empty blocks, we end up thrashing free()/malloc().
This behaviour results in one of the regression test queries in tuplesort.sql going from 10 ms to 1 minute on my machine. I had a few thoughts about how best to work around this problem. Firstly, I thought that we should just *not* use the Generation context when the sort is bounded. That turns out to be a bit icky as we only make the sort bounded when we call tuplesort_set_bound(), which is after we've allocated the memory context for storing tuples. I thought about maybe just adding another bool flag named "allowBounding" and using a Generation context only if that flag is not set. I just don't really like that as a solution. We also cannot make the set bound part of the tuplesort_begin_heap() call as nodeIncrementalSort.c does reset the bound. Possibly we could ditch tuplesort_set_bound(), invent tuplesort_reset_bound() and change the API so the bound is set when making the Tuplesortstate. The other way I thought to fix it was by changing the logic for when generation blocks are freed. In the problem case mentioned above, the block being freed is the current block (which was just allocated). I made some changes to adjust this behaviour so that we no longer free the block when the final chunk is pfree()'d. Instead, the block now lingers and can be reused by future allocations, providing they fit inside it. If they don't fit then we must free the block, otherwise it would just remain empty forever. The problem I see with this method is that there still could be some pathological case that causes us to end up storing just a single tuple per generation block. Hitting the worst case here does seem pretty unlikely, so I'm a bit torn on whether we should just play it safe and stick to the standard memory allocator for bounded sorts. I've attached 2 patches. The 0001 patch adds the new logic to generation.c to allow the block to grow and also adds the logic to skip freeing the current block when it becomes empty.
The 0002 patch simply makes tuplesort.c use the generation context for tuples unconditionally. I've also attached the benchmark results I got from running the attached sortbenchall script with this patch. It's fairly obvious that there are performance improvements here even when the malloc() usage pattern is similar to aset.c's. David
Attachment
Hi, On 2022-02-18 12:10:51 +1300, David Rowley wrote: > The other way I thought to fix it was by changing the logic for when > generation blocks are freed. In the problem case mentioned above, the > block being freed is the current block (which was just allocated). I > made some changes to adjust this behaviour so that we no longer free > the block when the final chunk is pfree()'d. Instead, that now lingers > and can be reused by future allocations, providing they fit inside it. That makes sense to me, as long as we keep just one such block. > The problem I see with this method is that there still could be some > pathological case that causes us to end up storing just a single tuple per > generation block. Crazy idea: Detect the situation, and recompact. Create a new context, copy all the tuples over, delete the old context. That could be a win even in less adversarial situations than "a single tuple per generation block". Greetings, Andres Freund
On 2/23/22 03:25, Andres Freund wrote: > Hi, > > On 2022-02-18 12:10:51 +1300, David Rowley wrote: >> The other way I thought to fix it was by changing the logic for when >> generation blocks are freed. In the problem case mentioned above, the >> block being freed is the current block (which was just allocated). I >> made some changes to adjust this behaviour so that we no longer free >> the block when the final chunk is pfree()'d. Instead, that now lingers >> and can be reused by future allocations, providing they fit inside it. > > That makes sense to me, as long as we keep just one such block. > > >> The problem I see with this method is that there still could be some >> pathological case that causes us to end up storing just a single tuple per >> generation block. > > Crazy idea: Detect the situation, and recompact. Create a new context, copy > all the tuples over, delete the old context. That could be a win even in less > adversarial situations than "a single tuple per generation block". > What about pointers to the chunks in the old memory context? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, 23 Feb 2022 at 15:25, Andres Freund <andres@anarazel.de> wrote: > On 2022-02-18 12:10:51 +1300, David Rowley wrote: > > The other way I thought to fix it was by changing the logic for when > > generation blocks are freed. In the problem case mentioned above, the > > block being freed is the current block (which was just allocated). I > > made some changes to adjust this behaviour so that we no longer free > > the block when the final chunk is pfree()'d. Instead, that now lingers > > and can be reused by future allocations, providing they fit inside it. > > That makes sense to me, as long as we keep just one such block. I've rewritten the patch in such a way that it no longer matters which block becomes free. What I did was add a new field to the GenerationContext named "freeblock". We now simply store up to one recently emptied block there instead of calling free() on it. When we're allocating new memory, we'll try to make use of the "freeblock" when we don't have enough space in the current block. I feel like this is quite a good change as it saves continual free/malloc calls for FIFO-pattern users of the generation context. Since that's pretty much the workload that this context is intended for, that seems like a very good change to make. I did consider only recording a block as free once it reaches the maximum size. As I have the code currently, we'll continually reuse all blocks, including the initial-sized ones. It will end up filling blocks more quickly as we'll be reusing those smaller initial blocks continually with FIFO workloads. I'm not entirely sure if that matters. I just wanted to point out that I thought about it. Aside from the freeblock, I've also added code for adding a "keeper" block. This is allocated in the same malloc'd chunk as the Generation context itself. One thing I'm not too sure about is whether the keeper block change is worth the trouble.
If we pfree() all of the chunks out of the keeper block, there's a special case in GenerationFree() to empty that block rather than free() it. This also means we need a special case in GenerationAlloc() so we try to see if the keeper block has any space before we go and allocate a new block. My current thoughts are that the keeper block is really only useful for Generation contexts that remain very small and don't ever require allocation of additional blocks. This will mean all the memory can be allocated along with the context, which saves a malloc and free call. Does anyone have any thoughts on this? > > The problem I see with this method is that there still could be some > > pathological case that causes us to end up storing just a single tuple per > > generation block. > > Crazy idea: Detect the situation, and recompact. Create a new context, copy > all the tuples over, delete the old context. That could be a win even in less > adversarial situations than "a single tuple per generation block". I think that would need some new MemoryContext API functions which callers could use to determine the fragmentation and do something about it on the consumer side. Otherwise, as Tomas says, if we did it from within the context code itself we'd have no visibility of what is pointing to the memory we're moving around. If nobody has any objections to the 0001 patch then I'd like to move ahead with it in the next few days. For the 0002 patch, I'm currently feeling like we shouldn't be using the Generation context for bounded sorts. The only way I can think to do that is with an API change to tuplesort. I feel 0001 is useful even without 0002. David
Attachment
On Wed, 23 Mar 2022 at 04:08, David Rowley <dgrowleyml@gmail.com> wrote: > If nobody has any objections to the 0001 patch then I'd like to move > ahead with it in the next few days. For the 0002 patch, I'm currently > feeling like we shouldn't be using the Generation context for bounded > sorts. The only way I can think to do that is with an API change to > tuplesort. I feel 0001 is useful even without 0002. 0001: I've made some small revisions to the 0001 patch so that the keeper block and freeblock are only used again when they're entirely empty. The situation I want to avoid is this: when the current block does not have enough free space to store some new allocation, we'll then try the freeblock and then the keeper block. The problem I saw there is that we may previously have partially filled the keeper block or freeblock and been unable to store some medium-sized allocation, which caused us to move to a new block. If we didn't check that the keeper block and freeblock were empty first, we could end up being able to store some small allocation in there where some previous medium-sized allocation couldn't fit. While that's not necessarily an actual problem, what it does mean is that consecutive allocations might not end up in the same block or the next block. Over time, in a FIFO-type workload, it would be possible to get fragmentation, which could result in being unable to free blocks. For longer-lived contexts I imagine that could end up fairly bad. The updated patch should avoid that problem. 0002: This modifies the tuplesort API so that instead of having a randomAccess bool flag, this is changed to a bitwise flag to which we can add further options in the future. It's slightly annoying to break the API, but it's not exactly going to be hard for people to code around that. It might also mean we don't have to break the API in the future if we're doing some change where we can just add a new bitwise flag.
0003: This adds a new flag for TUPLESORT_ALLOWBOUNDED and modifies the places where we set a sort bound to pass the flag. The patch only uses the generation context when the flag is not passed. I feel this is a pretty simple patch and if nobody has any objections then I plan to push all 3 of them on my New Zealand Monday morning. David
Attachment
On Fri, Apr 01, 2022 at 10:00:08PM +1300, David Rowley wrote: > 0002: > This modifies the tuplesort API so that instead of having a > randomAccess bool flag, Per cirrus, this missed updating one line for dtrace. On Fri, Apr 01, 2022 at 10:00:08PM +1300, David Rowley wrote: > I feel this is a pretty simple patch and if nobody has any objections > then I plan to push all 3 of them on my New Zealand Monday morning. I'll mark the CF as such. -- Justin
Attachment
On Sat, 2 Apr 2022 at 02:25, Justin Pryzby <pryzby@telsasoft.com> wrote: > > On Fri, Apr 01, 2022 at 10:00:08PM +1300, David Rowley wrote: > > 0002: > > This modifies the tuplesort API so that instead of having a > > randomAccess bool flag, > > Per cirrus, this missed updating one line for dtrace. Thanks. I've pushed all 3 of the patches now. David
On Fri, Apr 01, 2022 at 10:00:08PM +1300, David Rowley wrote: > 0002: > This modifies the tuplesort API so that instead of having a > randomAccess bool flag, this is changed to a bitwise flag to which we can > add further options in the future. It's slightly annoying to break > the API, but it's not exactly going to be hard for people to code > around that. For what it's worth, the three PGXN extensions using tuplesort_begin_* are "citus", "pg_bulkload", and "vector". Nothing calls tuplesort_set_bound(). This (commit 77bae39) did not change function parameter counts, and TUPLESORT_RANDOMACCESS generally has the same numeric value as "true". I get no warning if I pass "true" for the "sortopt" flags parameter. Hence, I suspect this did not break the API. Should we be happy about that? I'm fine with it.
On Sat, Apr 23, 2022 at 5:59 PM Noah Misch <noah@leadboat.com> wrote: > This (commit 77bae39) did not change function parameter counts, and > TUPLESORT_RANDOMACCESS generally has same the same numeric value as "true". I > get no warning if I pass "true" for the "sortopt" flags parameter. Hence, I > suspect this did not break the API. Should we be happy about that? I'm fine > with it. If I happened to believe that this issue (or one like it) might have real negative consequences, and that those consequences could easily be avoided (by making the API break impossible to overlook), then I would object -- why even take a small chance? Fortunately I don't believe that we're even taking a small chance here, all things considered. And so I agree; this issue isn't a concern. -- Peter Geoghegan
Thanks for raising this. On Sun, 24 Apr 2022 at 12:59, Noah Misch <noah@leadboat.com> wrote: > This (commit 77bae39) did not change function parameter counts, and > TUPLESORT_RANDOMACCESS generally has the same numeric value as "true". I > get no warning if I pass "true" for the "sortopt" flags parameter. Hence, I > suspect this did not break the API. Should we be happy about that? I'm fine > with it. I'd be open to doing something like making sortopt the first argument of each of the tuplesort_begin* functions if people have concerns about this causing problems. However, given that TUPLESORT_RANDOMACCESS and true likely have the same value, it seems we'd be more likely to break code down the line if we added some needed option that wouldn't get set by passing "true" as the sortopt. If anyone were to use tuplesort_set_bound() without updating their tuplesort_begin* call, then on non-assert builds, nothing would misbehave. There's just a risk of bad memory fragmentation from using the generation context for bounded sorts. David
Hi Ronan, We briefly chatted about the glibc-tuning part of this thread at pgcon, so I wonder if you're still planning to pursue that. If you do, I suggest we start a fresh thread, so that it's not mixed with the already-committed generation context improvements. I wonder what the situation is with the generation context improvements already pushed - does setting the glibc thresholds still help, and if so, by how much? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sunday, 18 June 2023 at 20:22:17 CEST, Tomas Vondra wrote: > Hi Ronan, > > We briefly chatted about the glibc-tuning part of this thread at pgcon, > so I wonder if you're still planning to pursue that. If you do, I > suggest we start a fresh thread, so that it's not mixed with the > already-committed generation context improvements. > > I wonder what the situation is with the generation context improvements > already pushed - does setting the glibc thresholds still help, and if > so, by how much? Hi Tomas! I'm currently working on it, trying to evaluate the possible benefits for the current ASet and Generation context use cases. I'll make sure to use a new thread to post my findings. Thank you for remembering! Best regards, -- Ronan Dunklau