Thread: Use generation context to speed up tuplesorts
Hackers,

While reviewing and benchmarking 91e9e89dc (Make nodeSort.c use Datum sorts for single column sorts), I noticed that we use a separate memory context to store tuple data, and we just reset it when we're done if the sort fits in memory, or, if it does not, we dump the tuples to disk in the same order they're added and then reset the context. There is a little pfree() work going on via writetup_heap() which I think could possibly just be removed to get some additional gains.

Anyway, this context that stores tuples uses the standard aset.c allocator, which has the usual power-of-2 wastage plus the additional overheads of freelists etc. I wondered how much faster it would go if I set it to use a generation context instead of an aset.c one.

After running make installcheck to create the tenk1 table, then running the attached tuplesortbench script, I get this:

Master:
work_mem = '4MB';  Sort Method: external merge  Disk: 2496kB
work_mem = '4GB';  Sort Method: quicksort  Memory: 5541kB

Patched:
work_mem = '4MB';  Sort Method: quicksort  Memory: 3197kB

So it seems to save quite a bit of memory by getting away from rounding allocations up to the next power of 2.

Performance-wise, there are some pretty good gains. (Results in TPS)

work_mem = '4GB';
Test    master   gen sort   compare
Test1   317.2    665.6      210%
Test2   228.6    388.9      170%
Test3   207.4    330.7      159%
Test4   185.5    279.4      151%
Test5   292.2    563.9      193%

If I drop work_mem down to the standard setting, the unpatched version goes to disk but the patched version does not. The gains get a little bigger.

work_mem = '4MB';
Test    master   gen sort   compare
Test1   177.5    658.2      371%
Test2   149.7    385.2      257%
Test3   137.5    330.0      240%
Test4   129.0    275.1      213%
Test5   161.7    546.4      338%

The patch is just a simple 1-liner at the moment. I likely do need to adjust what I'm passing as the blockSize to GenerationContextCreate(). Maybe a better number would be something that's calculated from work_mem, e.g. Min(ALLOCSET_DEFAULT_MAXSIZE, ((Size) work_mem) * 64) so that we just allocate at most a 16th of work_mem per chunk, but not bigger than 8MB. I don't think changing this will affect the performance of the above very much.

David
Attachment
On Fri, Jul 30, 2021 at 2:42 AM David Rowley <dgrowleyml@gmail.com> wrote:
> Master:
> Sort Method: quicksort  Memory: 5541kB
> Patched:
> Sort Method: quicksort  Memory: 3197kB

Whoa.

> work_mem = '4GB';
> Test    master   gen sort   compare
> Test1   317.2    665.6      210%
> Test2   228.6    388.9      170%
> Test3   207.4    330.7      159%
> Test4   185.5    279.4      151%
> Test5   292.2    563.9      193%

Very impressive.

An early version of what eventually became DSA worked with backend-local memory, and I saw very substantial memory usage improvements on large sorts, similar to what you show here. I am not sure I saw the same CPU improvements, and in any case I abandoned the idea of using that infrastructure to manage backend-local memory at some point, since the whole thing had lots of problems that I didn't know how to solve. What you've done here looks like a much more promising approach.

--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,

On 2021-07-30 18:42:18 +1200, David Rowley wrote:
> While reviewing and benchmarking 91e9e89dc (Make nodeSort.c use Datum
> sorts for single column sorts), I noticed that we use a separate
> memory context to store tuple data and we just reset when we're done
> if the sort fits in memory, or we dump the tuples to disk in the same
> order they're added and reset the context when it does not. There is
> a little pfree() work going on via writetup_heap() which I think
> possibly could just be removed to get some additional gains.
>
> Anyway, this context that stores tuples uses the standard aset.c
> allocator which has the usual power of 2 wastage and additional
> overheads of freelists etc. I wondered how much faster it would go if
> I set it to use a generation context instead of an aset.c one.
>
> After running make installcheck to make the tenk1 table, running the
> attached tuplesortbench script, I get this:
>
> So it seems to save quite a bit of memory getting away from rounding
> up allocations to the next power of 2.
>
> Performance-wise, there's some pretty good gains. (Results in TPS)

Very nice!

I wonder if there are cases where generation.c would regress performance over aset.c due to not having an initial / "keeper" block?

> The patch is just a simple 1-liner at the moment. I likely do need to
> adjust what I'm passing as the blockSize to GenerationContextCreate().
> Maybe a better number would be something that's calculated from
> work_mem, e.g. Min(ALLOCSET_DEFAULT_MAXSIZE, ((Size) work_mem) * 64)
> so that we just allocate at most a 16th of work_mem per chunk, but not
> bigger than 8MB. I don't think changing this will affect the
> performance of the above very much.

I think it's bad that both generation and slab don't have internal handling of block sizes. Needing to err on the side of too-big blocks to handle large amounts of memory well, just so the contexts don't need to deal with variably sized blocks, isn't a sensible tradeoff.

I don't think it's acceptable to use ALLOCSET_DEFAULT_MAXSIZE or Min(ALLOCSET_DEFAULT_MAXSIZE, ((Size) work_mem) * 64) for tuplesort.c. There are plenty of cases where we'll just sort a handful of tuples, and increasing the memory usage of those by a factor of 1024 isn't good. The Min() won't do any good if a larger work_mem is used.

Nor will it be good to use thousands of small allocations for a large in-memory tuplesort just because we're concerned about the initial allocation size. Both because of the allocation overhead, but importantly also because that will make context resets more expensive.

To me this says that we should transplant aset.c's block size growing into generation.c.

There is at least one path using tuplecontext that reaches code that could end up freeing memory to a significant enough degree to care about generation.c effectively not using that memory: tuplesort_putdatum()->datumCopy()->EOH_flatten_into(). On a quick look I didn't find any expanded record user that frees nontrivial amounts of memory, but I didn't look all that carefully.

Greetings,

Andres Freund
On 7/30/21 10:38 PM, Andres Freund wrote:
> On 2021-07-30 18:42:18 +1200, David Rowley wrote:
>> While reviewing and benchmarking 91e9e89dc (Make nodeSort.c use Datum
>> sorts for single column sorts), I noticed that we use a separate
>> memory context to store tuple data and we just reset when we're done
>> if the sort fits in memory, or we dump the tuples to disk in the same
>> order they're added and reset the context when it does not. There is
>> a little pfree() work going on via writetup_heap() which I think
>> possibly could just be removed to get some additional gains.
>>
>> Anyway, this context that stores tuples uses the standard aset.c
>> allocator which has the usual power of 2 wastage and additional
>> overheads of freelists etc. I wondered how much faster it would go if
>> I set it to use a generation context instead of an aset.c one.
>>
>> Performance-wise, there's some pretty good gains. (Results in TPS)
>
> Very nice!

Yes, very nice. I wouldn't have expected such a significant difference, particularly in CPU usage. It's pretty interesting that it reduces both memory and CPU usage; I'd have guessed it would be one or the other.

> I wonder if there's cases where generation.c would regress performance
> over aset.c due to not having an initial / "keeper" block?

Not sure. I guess such a workload would need to allocate and free a single block (so very little memory) very often. I guess that's possible, but I'm not aware of a place doing that very often. Although, maybe decoding could do that for a simple (serial) workload. I'm not opposed to adding a keeper block to Generation, similarly to what was discussed for Slab not too long ago.

>> The patch is just a simple 1-liner at the moment. I likely do need to
>> adjust what I'm passing as the blockSize to GenerationContextCreate().
>> Maybe a better number would be something that's calculated from
>> work_mem, e.g. Min(ALLOCSET_DEFAULT_MAXSIZE, ((Size) work_mem) * 64)
>> so that we just allocate at most a 16th of work_mem per chunk, but not
>> bigger than 8MB. I don't think changing this will affect the
>> performance of the above very much.
>
> I think it's bad that both generation and slab don't have internal
> handling of block sizes. Needing to err on the side of too-big blocks to
> handle large amounts of memory well, just so the contexts don't need to
> deal with variably sized blocks, isn't a sensible tradeoff.

Well, back then it seemed like a sensible trade-off to me, but I agree it may have negative consequences. I'm not opposed to revisiting this.

> I don't think it's acceptable to use ALLOCSET_DEFAULT_MAXSIZE or
> Min(ALLOCSET_DEFAULT_MAXSIZE, ((Size) work_mem) * 64) for
> tuplesort.c. There's plenty of cases where we'll just sort a handful of
> tuples, and increasing the memory usage of those by a factor of 1024
> isn't good. The Min() won't do any good if a larger work_mem is used.
>
> Nor will it be good to use thousands of small allocations for a large
> in-memory tuplesort just because we're concerned about the initial
> allocation size. Both because of the allocation overhead, but
> importantly also because that will make context resets more expensive.

True.

> To me this says that we should transplant aset.c's block size growing
> into generation.c.

Yeah, maybe.

> There is at least one path using tuplecontext that reaches code that
> could end up freeing memory to a significant enough degree to care about
> generation.c effectively not using that memory:
> tuplesort_putdatum()->datumCopy()->EOH_flatten_into()
> On a quick look I didn't find any expanded record user that frees
> nontrivial amounts of memory, but I didn't look all that carefully.

Not sure, I'm not familiar with EOH_flatten_into or expanded records. But I wonder if there's some sort of metric that we could track in Generation and use it to identify "interesting" places.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

I spent a bit of time hacking on the Generation context, adding the two improvements discussed in this thread:

1) internal handling of block sizes, similar to what AllocSet does (it pretty much just copies parts of it)

2) keeper block (we keep one empty block instead of freeing it)

3) I've also added allocChunkLimit, which makes it look a bit more like AllocSet (instead of using just blockSize/8, which does not work too well with dynamic blockSize)

I haven't done any extensive tests on it, but it does pass check-world with asserts etc. I haven't touched the comments; those need updating.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
On Sat, 31 Jul 2021 at 14:34, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> I spent a bit of time hacking on the Generation context, adding the two
> improvements discussed in this thread:
>
> 1) internal handling of block sizes, similar to what AllocSet does (it
> pretty much just copies parts of it)
>
> 2) keeper block (we keep one empty block instead of freeing it)
>
> 3) I've also added allocChunkLimit, which makes it look a bit more like
> AllocSet (instead of using just blockSize/8, which does not work too
> well with dynamic blockSize)
>
> I haven't done any extensive tests on it, but it does pass check-world
> with asserts etc. I haven't touched the comments, those need updating.

Thanks for starting work on that. I've only had a quick look, but I can have a more detailed look once you've got it more complete.

For now it does not really look like the keeper block stuff is wired up the same way as in aset.c. I'd expect you to be allocating that in the same malloc() as you're using to allocate the context struct itself in GenerationContextCreate().

Also, likely as a result of the above, minContextSize does not seem to be wired up to anything apart from an Assert().

David
On Sat, 31 Jul 2021 at 08:38, Andres Freund <andres@anarazel.de> wrote:
> There is at least one path using tuplecontext that reaches code that
> could end up freeing memory to a significant enough degree to care about
> generation.c effectively not using that memory:
> tuplesort_putdatum()->datumCopy()->EOH_flatten_into()
> On a quick look I didn't find any expanded record user that frees
> nontrivial amounts of memory, but I didn't look all that carefully.

I guess we could just use a normal context for datum sorts if we thought that might be a problem.

I'm not too familiar with the expanded object code, but I'm struggling to imagine why anything would need to do a pfree in there. We just do EOH_get_flat_size() to determine how big to make the allocation, then allocate some memory for EOH_flatten_into() to use to expand the object into.

David
On 8/2/21 1:17 PM, David Rowley wrote:
> On Sat, 31 Jul 2021 at 14:34, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> I spent a bit of time hacking on the Generation context, adding the two
>> improvements discussed in this thread:
>>
>> 1) internal handling of block sizes, similar to what AllocSet does (it
>> pretty much just copies parts of it)
>>
>> 2) keeper block (we keep one empty block instead of freeing it)
>>
>> 3) I've also added allocChunkLimit, which makes it look a bit more like
>> AllocSet (instead of using just blockSize/8, which does not work too
>> well with dynamic blockSize)
>
> Thanks for starting work on that. I've only had a quick look, but I
> can have a more detailed look once you've got it more complete.

A review would be nice, although it can wait. It'd be interesting to know if those patches help with the workload(s) you've been looking at.

> For now it does not really look like the keeper block stuff is wired
> up the same way as in aset.c. I'd expect you to be allocating that in
> the same malloc as you're using to allocate the context struct itself
> in GenerationContextCreate().

Yes, that difference is deliberate. AllocSet works a bit differently, as it does not release its blocks (except during reset), while the Generation context frees blocks as they empty out. So it seems pointless to use the same "keeper" block as AllocSet - instead my intention was to keep one "allocated" block as a cache, which should help with tight pfree/palloc cycles. Maybe we should not call that a "keeper" block?

> Also, likely as a result of the above, minContextSize does not seem to
> be wired up to anything apart from an Assert().

Hmm, yeah. This is probably due to copying some of the block-growth and keeper block code from AllocSet. There should be just init/max block size, I think.

I did run the same set of benchmarks as for Slab, measuring some usual allocation patterns. The results for the i5-2500k machine are attached (for the Xeon it's almost exactly the same behavior). While running those tests I realized the last patch is wrong and sets allocChunkLimit=1, which is bogus and causes a significant regression. So here's an updated version of the patch series too.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
Hi,

On 2021-08-03 10:59:25 +1200, David Rowley wrote:
> On Sat, 31 Jul 2021 at 08:38, Andres Freund <andres@anarazel.de> wrote:
>> There is at least one path using tuplecontext that reaches code that
>> could end up freeing memory to a significant enough degree to care about
>> generation.c effectively not using that memory:
>> tuplesort_putdatum()->datumCopy()->EOH_flatten_into()
>> On a quick look I didn't find any expanded record user that frees
>> nontrivial amounts of memory, but I didn't look all that carefully.
>
> I guess we could just use a normal context for datum sorts if we
> thought that might be a problem.

I think that's probably a cure worse than the disease. I suspect datum sorts can benefit from the higher density quite a bit...

> I'm not too familiar with the expanded object code, but I'm struggling
> to imagine why anything would need to do a pfree in there. We just do
> EOH_get_flat_size() to determine how big to make the allocation then
> allocate some memory for EOH_flatten_into() to use to expand the
> object into.

I can see some scenarios with a bit more creative uses of expanded objects. We've e.g. been talking about using EA to avoid repeated and partial detoasting overhead, and you might need to do some more toast fetches when flattening. Toast fetches always allocate, and if the fetch were only for later parts of the tuple, the fetched data would need to be freed.

It's probably fine to deal with this at a later time, and just leave a comment somewhere. It could be addressed by having a bump-style allocator combined with freelists. It's not like the tuplesort.c case is actually interested in the generational behaviour of generation.c (which makes freelists uninteresting); it's just that generation.c is the densest allocator that we have right now...

Greetings,

Andres Freund
On Wed, 4 Aug 2021 at 02:10, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> A review would be nice, although it can wait - It'd be interesting to
> know if those patches help with the workload(s) you've been looking at.

I tried out the v2 set of patches using the attached scripts. The attached spreadsheet includes the original tests and compares master with the patch which uses the generation context vs that patch plus your v2 patches.

I've also included 4 additional tests, each of which starts with a 1 column table and then adds another 32 columns, testing the performance after adding each additional column. I did this because I wanted to see if the performance was more similar to master when the allocations had less power-of-2 wastage from allocset. If, for example, you look at row 123 of the spreadsheet, you can see that both patched and unpatched the allocations were 272 bytes each, yet there was still a 50% performance improvement with just the generation context patch when compared to master.

Looking at the spreadsheet, you'll also notice that in the 2 column test of each of the 4 new tests, the number of bytes used for each allocation is larger with the generation context: 56 vs 48. This is due to the GenerationChunk struct being larger than AllocSet's version by 8 bytes, because it also stores a pointer to its owning GenerationBlock. So with the patch there are some cases where we'll use slightly more memory.

Additional tests:

1. Sort 10000 tuples on a column with values 0-99 in memory.
2. As #1 but with 1 million tuples.
3. As #1 but with a large OFFSET to remove the overhead of sending to the client.
4. As #2 but with a large OFFSET.

Test #3 above is the most similar one to the original tests and shows similar gains. When the sort becomes larger (the 1 million tuple test), the gains reduce. This indicates the gains are coming from improved CPU cache efficiency from the removal of the power-of-2 wastage in memory allocations.

All of the tests show that the patches to improve the allocation efficiency of generation.c don't help to improve the results of the test cases. I wondered if it's maybe worth trying to see what happens if, instead of doubling the allocations each time, we quadruple them instead. I didn't try this.

David
Attachment
On 8/6/21 3:07 PM, David Rowley wrote:
> On Wed, 4 Aug 2021 at 02:10, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>> A review would be nice, although it can wait - It'd be interesting to
>> know if those patches help with the workload(s) you've been looking at.
>
> I tried out the v2 set of patches using the attached scripts. The
> attached spreadsheet includes the original tests and compares master
> with the patch which uses the generation context vs that patch plus
> your v2 patch.
>
> [...]
>
> Test #3 above is the most similar one to the original tests and shows
> similar gains. When the sort becomes larger (1 million tuple test),
> the gains reduce. This indicates the gains are coming from improved
> CPU cache efficiency from the removal of the power of 2 wastage in
> memory allocations.
>
> All of the tests show that the patches to improve the allocation
> efficiency of generation.c don't help to improve the results of the
> test cases. I wondered if it's maybe worth trying to see what happens
> if instead of doubling the allocations each time, quadruple them
> instead. I didn't try this.

Thanks for the scripts and the spreadsheet with results.

I doubt quadrupling the allocations will help very much, but I suspect the problem might be in the 0004 patch - at least that's what shows a regression in my results. Could you try with just 0001-0003 applied?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, 7 Aug 2021 at 12:10, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> On 8/6/21 3:07 PM, David Rowley wrote:
>> All of the tests show that the patches to improve the allocation
>> efficiency of generation.c don't help to improve the results of the
>> test cases. I wondered if it's maybe worth trying to see what happens
>> if instead of doubling the allocations each time, quadruple them
>> instead. I didn't try this.
>
> I doubt quadrupling the allocations will help very much, but I suspect
> the problem might be in the 0004 patch - at least that's what shows
> regression in my results. Could you try with just 0001-0003 applied?

But 0004 only changes the logic which controls the threshold of when we allocate an oversized chunk. It looks like the threshold is 512KB with the 0004 patch. My test is only doing a maximum allocation of 296 bytes, so it will never allocate an oversized chunk.

Can you explain why you think 0004 would cause performance regressions?

David
On Sat, 7 Aug 2021 at 12:10, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>> All of the tests show that the patches to improve the allocation
>> efficiency of generation.c don't help to improve the results of the
>> test cases. I wondered if it's maybe worth trying to see what happens
>> if instead of doubling the allocations each time, quadruple them
>> instead. I didn't try this.
>
> I doubt quadrupling the allocations will help very much, but I suspect
> the problem might be in the 0004 patch - at least that's what shows
> regression in my results. Could you try with just 0001-0003 applied?

I tried the quadrupling of the buffer instead of doubling it each time and got the attached. Column E, or green in the graphs, shows the results of that. It's now much closer to the original patch which just made the block size 8MB.

David
Attachment
On 8/8/21 9:02 AM, David Rowley wrote:
> On Sat, 7 Aug 2021 at 12:10, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>> I doubt quadrupling the allocations will help very much, but I suspect
>> the problem might be in the 0004 patch - at least that's what shows
>> regression in my results. Could you try with just 0001-0003 applied?
>
> But 0004 only changes the logic which controls the threshold of when
> we allocate an oversized chunk. It looks like the threshold is 512KB
> with the 0004 patch. My test is only doing a maximum allocation of
> 296 bytes so will never allocate an oversized chunk.
>
> Can you explain why you think 0004 would cause performance regressions?

It's based solely on the results of my benchmarks, where this patch seems to cause a performance regression. I agree it's a bit bizarre, considering what the patch does.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 8/8/21 12:11 PM, David Rowley wrote:
> On Sat, 7 Aug 2021 at 12:10, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>> All of the tests show that the patches to improve the allocation
>>> efficiency of generation.c don't help to improve the results of the
>>> test cases. I wondered if it's maybe worth trying to see what happens
>>> if instead of doubling the allocations each time, quadruple them
>>> instead. I didn't try this.
>>
>> I doubt quadrupling the allocations will help very much, but I suspect
>> the problem might be in the 0004 patch - at least that's what shows
>> regression in my results. Could you try with just 0001-0003 applied?
>
> I tried the quadrupling of the buffer instead of doubling it each time
> and got the attached. Column E, or green in the graphs, shows the
> results of that. It's now much closer to the original patch which just
> made the block size 8MB.

Interesting, I wouldn't have expected that to make such a difference.

I'm not sure quadrupling the size is a good idea, though, because it increases the amount of memory we might be wasting. With doubling, the amount of wasted/unused memory is limited to ~50%, because the next block is (roughly) equal to the sum of the already-allocated blocks, so allocating just 1B on it leaves us with 50% unused. But quadrupling the size means we'll end up with ~75% free space. Of course, this is capped by the maximum block size etc., but still ...

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, 4 Aug 2021 at 02:10, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> I did run the same set of benchmarks as for Slab, measuring some usual
> allocation patterns. The results for i5-2500k machine are attached (for
> the xeon it's almost exactly the same behavior). While running those
> tests I realized the last patch is wrong and sets allocChunkLimit=1,
> which is bogus and causes significant regression. So here's an updated
> version of the patch series too.

I know you're not done with these yet, but FWIW, I was getting an Assert failure with these patches on:

Assert(total_allocated == context->mem_allocated);

It seems to be because you've forgotten to ignore keeper blocks when adjusting context->mem_allocated in GenerationReset().

David
On Mon, 9 Aug 2021 at 00:38, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> I'm not sure quadrupling the size is a good idea, though, because it
> increases the amount of memory we might be wasting. With the doubling,
> the amount of wasted/unused memory is limited to ~50%, because the next
> block is (roughly) equal to sum of already allocated blocks, so
> allocating just 1B on it leaves us with 50%. But quadrupling the size
> means we'll end up with ~75% free space. Of course, this is capped by
> the maximum block size etc. but still ...

Yeah, not sure what is best. It does however seem likely that the majority of the performance improvement that I saw is due to either the malloc()/free() calls or just having fewer blocks in the context.

Maybe it's worth getting the planner on board with deciding how to do the allocations. It feels a bit overcautious to go allocating blocks in each power of two starting at 8192 bytes when doing a 1GB sort. Maybe we should be looking towards making the initial allocation size something more like pg_prevpower2_64(Min(work_mem * 1024L, sort_tuples * tuple_width)), or maybe half or a quarter of that. It would certainly not be the only executor node to allocate memory based on what the planner thought. Just look at ExecHashTableCreate().

David
Hi,

I've spent quite a bit of time investigating the regressions clearly visible in the benchmark results for some allocation/free patterns. Attached is a subset of results from a recent run, but the old results show mostly the same thing.

Some allocation patterns (e.g. fifo-cycle or lifo-cycle) are perfectly fine - the patches are a clear improvement. But for example the results for the fifo-decrease pattern look like this (showing just timings of palloc):

    block    chunk     0001     0002     0003     0004
    ----------------------------------------------------
    32768      512    38497    37234    39817   226451

0001 just adds an extension used for the benchmarking, so it's pretty much "master" without any behavior changes. 0002 and 0003 make some changes that seem to have no impact on this workload, but 0004 makes it about 5x slower.

Which is bizarre, because 0004 just tweaks how we calculate the threshold for oversized chunks. That shouldn't have any impact here, because with the benchmark parameters the threshold should be 64kB, way more than the chunk size (which is what we allocate). Similarly, it should not matter whether we double or quadruple the block size, because we reach the maximum block size (1MB) very fast in both cases, and then just use 1MB blocks for 99% of the benchmark.

I was thinking that maybe the extra allocChunkLimit increases the size of the GenerationContext struct, requiring another cacheline. But no, it's 3 cachelines in both cases.

But while trying to investigate / profile this, I noticed a rather bizarre behavior, which I can't explain but I suspect may be related to the slowdown.
The benchmark queries execute generation_bench_fifo() for a range of block size / chunk size combinations, and look about like this:

    select block_size, chunk_size, x.*
      from generate_series(32,512,32) a(chunk_size),
           (values (1024), (2048), (4096), (8192), (16384), (32768)) AS b(block_size),
           lateral generation_bench_fifo(1000000, block_size, chunk_size,
                                         2*chunk_size, 100, 10000, 5000) x;

This takes a while to run, and then produces something like this:

     block_size | chunk_size | mem_allocated | alloc_ms | free_ms
    ------------+------------+---------------+----------+---------
           1024 |         32 |      81528832 |    62016 |   24601
           1024 |         64 |     137449472 |    61951 |   36034
           1024 |         96 |     208408464 |    85437 |   57984
    ...
          32768 |        448 |     707756032 |    36605 |   67648
          32768 |        480 |     757104640 |    36838 |   71967
          32768 |        512 |     806256640 |    37073 |   76489
    (96 rows)

Now, I can run the benchmark for a single case (e.g. the last one in the results above), say like this:

    select block_size, chunk_size, x.*
      from (values (512)) AS a(chunk_size),
           (values (32768)) AS b(block_size),
           lateral generation_bench_fifo(1000000, block_size, chunk_size,
                                         2*chunk_size, 100, 10000, 5000) x;

And now comes the funny part - if I run it in the same backend as the "full" benchmark, I get roughly the same results:

     block_size | chunk_size | mem_allocated | alloc_ms | free_ms
    ------------+------------+---------------+----------+---------
          32768 |        512 |     806256640 |    37159 |   76669

but if I reconnect and run it in a new backend, I get this:

     block_size | chunk_size | mem_allocated | alloc_ms | free_ms
    ------------+------------+---------------+----------+---------
          32768 |        512 |     806158336 |   233909 |  100785
    (1 row)

It does not matter if I wait a bit before running the query, if I run it repeatedly, etc. The machine is not doing anything else, the CPU is set to use the "performance" governor, etc.

This is really strange, and the really suspicious thing is that this slower timing is very close to the 0004 timing (which is 226451).
So perhaps the reason for these slowdowns is the same, it's just that with 0004 it happens all the time. According to perf profiling, it seems there's one pretty significant change - 0004 spends quite a bit of time in asm_exc_page_fault:

    41.17%     5.26%  postgres  postgres  [.] GenerationAlloc
            |
            --36.06%--GenerationAlloc
                      |
                      |--28.95%--asm_exc_page_fault
                      |          |
                      |           --23.17%--exc_page_fault
                      |          |

while on 0003 there's not a single frame with asm_exc_page_fault. I have no idea why, though. I wonder if this might be some unexpected / unfortunate interaction with kernel memory management, or something like that. Any ideas what might be the root cause, what to look for, etc.?

regards

-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
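[The doubling-vs-quadrupling remark above - that either growth factor reaches the 1MB block-size cap almost immediately - can be sketched with a small standalone helper. `blocks_until_max` and `GEN_MAX_BLOCK` are illustrative names for this sketch, not the actual generation.c code.]

```c
#include <assert.h>
#include <stddef.h>

#define GEN_MAX_BLOCK (1024 * 1024)     /* the 1MB cap discussed above */

/*
 * Count how many allocations happen before the block size reaches the
 * cap, growing the block size by `factor` on each new block.
 */
static int
blocks_until_max(size_t init_size, int factor)
{
    int     n = 0;
    size_t  sz = init_size;

    while (sz < GEN_MAX_BLOCK)
    {
        sz *= factor;
        n++;
    }
    return n;
}
```

Starting from an 8kB block, doubling takes 7 allocations to hit 1MB and quadrupling takes 4, so either way virtually the whole benchmark runs with 1MB blocks.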
Le jeudi 9 septembre 2021, 15:37:59 CET Tomas Vondra a écrit :
> And now comes the funny part - if I run it in the same backend as the
> "full" benchmark, I get roughly the same results:
>
> block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> ------------+------------+---------------+----------+---------
> 32768 | 512 | 806256640 | 37159 | 76669
>
> but if I reconnect and run it in the new backend, I get this:
>
> block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> ------------+------------+---------------+----------+---------
> 32768 | 512 | 806158336 | 233909 | 100785
> (1 row)
>
> It does not matter if I wait a bit before running the query, if I run it
> repeatedly, etc. The machine is not doing anything else, the CPU is set
> to use "performance" governor, etc.

I've reproduced the behaviour you mention. I also noticed asm_exc_page_fault showing up in the perf report in that case.

Running an strace on it shows that in one case we have a lot of brk calls, while when we run in the same process as the previous tests, we don't.

My suspicion is that the previous workload makes glibc malloc change its trim_threshold and possibly other dynamic options, which leads to constantly moving the brk pointer in one case and not the other.

Running your fifo test with absurd malloc options shows that indeed that might be the case (I needed to change several, because changing one disables the dynamic adjustment for every single one of them, and malloc would fall back to using mmap and freeing it on each iteration):

mallopt(M_TOP_PAD, 1024 * 1024 * 1024);
mallopt(M_TRIM_THRESHOLD, 256 * 1024 * 1024);
mallopt(M_MMAP_THRESHOLD, 4*1024*1024*sizeof(long));

I get the following results for your self contained test.
I ran the query twice, in each case, seeing the importance of the first allocation and the subsequent ones:

With default malloc options:

 block_size | chunk_size | mem_allocated | alloc_ms | free_ms
------------+------------+---------------+----------+---------
      32768 |        512 |     795836416 |   300156 |  207557

 block_size | chunk_size | mem_allocated | alloc_ms | free_ms
------------+------------+---------------+----------+---------
      32768 |        512 |     795836416 |   211942 |   77207

With the oversized values above:

 block_size | chunk_size | mem_allocated | alloc_ms | free_ms
------------+------------+---------------+----------+---------
      32768 |        512 |     795836416 |   219000 |   36223

 block_size | chunk_size | mem_allocated | alloc_ms | free_ms
------------+------------+---------------+----------+---------
      32768 |        512 |     795836416 |    75761 |   78082
(1 row)

I can't tell how representative your benchmark extension would be of real-life allocation / free patterns, but there is probably something we can improve here. I'll try to see if I can understand more precisely what is happening.

-- Ronan Dunklau
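[The mallopt() experiment above can be packaged as a small glibc-only sketch. `pin_glibc_thresholds` is a hypothetical helper, the particular values are arbitrary (chosen smaller than the "absurd" ones quoted above), and mallopt() itself is glibc-specific.]

```c
#include <assert.h>
#include <malloc.h>     /* glibc-specific: mallopt() and the M_* options */

/*
 * Pin glibc's malloc thresholds explicitly. Setting any M_* option turns
 * off the dynamic adjustment of all of them, which is why several must be
 * raised together (otherwise malloc falls back to mmap'ing and freeing
 * each large block). mallopt() returns 1 on success.
 */
static int
pin_glibc_thresholds(void)
{
    int     ok = 1;

    ok &= mallopt(M_TOP_PAD, 64 * 1024 * 1024);
    ok &= mallopt(M_TRIM_THRESHOLD, 256 * 1024 * 1024);
    ok &= mallopt(M_MMAP_THRESHOLD, 32 * 1024 * 1024);
    return ok;
}
```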
On 12/8/21 16:51, Ronan Dunklau wrote: > Le jeudi 9 septembre 2021, 15:37:59 CET Tomas Vondra a écrit : >> And now comes the funny part - if I run it in the same backend as the >> "full" benchmark, I get roughly the same results: >> >> block_size | chunk_size | mem_allocated | alloc_ms | free_ms >> ------------+------------+---------------+----------+--------- >> 32768 | 512 | 806256640 | 37159 | 76669 >> >> but if I reconnect and run it in the new backend, I get this: >> >> block_size | chunk_size | mem_allocated | alloc_ms | free_ms >> ------------+------------+---------------+----------+--------- >> 32768 | 512 | 806158336 | 233909 | 100785 >> (1 row) >> >> It does not matter if I wait a bit before running the query, if I run it >> repeatedly, etc. The machine is not doing anything else, the CPU is set >> to use "performance" governor, etc. > > I've reproduced the behaviour you mention. > I also noticed asm_exc_page_fault showing up in the perf report in that case. > > Running an strace on it shows that in one case, we have a lot of brk calls, > while when we run in the same process as the previous tests, we don't. > > My suspicion is that the previous workload makes glibc malloc change it's > trim_threshold and possibly other dynamic options, which leads to constantly > moving the brk pointer in one case and not the other. > > Running your fifo test with absurd malloc options shows that indeed that might > be the case (I needed to change several, because changing one disable the > dynamic adjustment for every single one of them, and malloc would fall back to > using mmap and freeing it on each iteration): > > mallopt(M_TOP_PAD, 1024 * 1024 * 1024); > mallopt(M_TRIM_THRESHOLD, 256 * 1024 * 1024); > mallopt(M_MMAP_THRESHOLD, 4*1024*1024*sizeof(long)); > > I get the following results for your self contained test. 
I ran the query > twice, in each case, seeing the importance of the first allocation and the > subsequent ones: > > With default malloc options: > > block_size | chunk_size | mem_allocated | alloc_ms | free_ms > ------------+------------+---------------+----------+--------- > 32768 | 512 | 795836416 | 300156 | 207557 > > block_size | chunk_size | mem_allocated | alloc_ms | free_ms > ------------+------------+---------------+----------+--------- > 32768 | 512 | 795836416 | 211942 | 77207 > > > With the oversized values above: > > block_size | chunk_size | mem_allocated | alloc_ms | free_ms > ------------+------------+---------------+----------+--------- > 32768 | 512 | 795836416 | 219000 | 36223 > > > block_size | chunk_size | mem_allocated | alloc_ms | free_ms > ------------+------------+---------------+----------+--------- > 32768 | 512 | 795836416 | 75761 | 78082 > (1 row) > > I can't tell how representative your benchmark extension would be of real life > allocation / free patterns, but there is probably something we can improve > here. > Thanks for looking at this. I think those allocation / free patterns are fairly extreme, and there probably are no workloads doing exactly this. The idea is the actual workloads are likely some combination of these extreme cases. > I'll try to see if I can understand more precisely what is happening. > Thanks, that'd be helpful. Maybe we can learn something about tuning malloc parameters to get significantly better performance. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Le mercredi 8 décembre 2021, 22:07:12 CET Tomas Vondra a écrit : > On 12/8/21 16:51, Ronan Dunklau wrote: > > Le jeudi 9 septembre 2021, 15:37:59 CET Tomas Vondra a écrit : > >> And now comes the funny part - if I run it in the same backend as the > >> > >> "full" benchmark, I get roughly the same results: > >> block_size | chunk_size | mem_allocated | alloc_ms | free_ms > >> > >> ------------+------------+---------------+----------+--------- > >> > >> 32768 | 512 | 806256640 | 37159 | 76669 > >> > >> but if I reconnect and run it in the new backend, I get this: > >> block_size | chunk_size | mem_allocated | alloc_ms | free_ms > >> > >> ------------+------------+---------------+----------+--------- > >> > >> 32768 | 512 | 806158336 | 233909 | 100785 > >> > >> (1 row) > >> > >> It does not matter if I wait a bit before running the query, if I run it > >> repeatedly, etc. The machine is not doing anything else, the CPU is set > >> to use "performance" governor, etc. > > > > I've reproduced the behaviour you mention. > > I also noticed asm_exc_page_fault showing up in the perf report in that > > case. > > > > Running an strace on it shows that in one case, we have a lot of brk > > calls, > > while when we run in the same process as the previous tests, we don't. > > > > My suspicion is that the previous workload makes glibc malloc change it's > > trim_threshold and possibly other dynamic options, which leads to > > constantly moving the brk pointer in one case and not the other. 
> > > > Running your fifo test with absurd malloc options shows that indeed that > > might be the case (I needed to change several, because changing one > > disable the dynamic adjustment for every single one of them, and malloc > > would fall back to using mmap and freeing it on each iteration): > > > > mallopt(M_TOP_PAD, 1024 * 1024 * 1024); > > mallopt(M_TRIM_THRESHOLD, 256 * 1024 * 1024); > > mallopt(M_MMAP_THRESHOLD, 4*1024*1024*sizeof(long)); > > > > I get the following results for your self contained test. I ran the query > > twice, in each case, seeing the importance of the first allocation and the > > subsequent ones: > > > > With default malloc options: > > block_size | chunk_size | mem_allocated | alloc_ms | free_ms > > > > ------------+------------+---------------+----------+--------- > > > > 32768 | 512 | 795836416 | 300156 | 207557 > > > > block_size | chunk_size | mem_allocated | alloc_ms | free_ms > > > > ------------+------------+---------------+----------+--------- > > > > 32768 | 512 | 795836416 | 211942 | 77207 > > > > With the oversized values above: > > block_size | chunk_size | mem_allocated | alloc_ms | free_ms > > > > ------------+------------+---------------+----------+--------- > > > > 32768 | 512 | 795836416 | 219000 | 36223 > > > > block_size | chunk_size | mem_allocated | alloc_ms | free_ms > > > > ------------+------------+---------------+----------+--------- > > > > 32768 | 512 | 795836416 | 75761 | 78082 > > > > (1 row) > > > > I can't tell how representative your benchmark extension would be of real > > life allocation / free patterns, but there is probably something we can > > improve here. > > Thanks for looking at this. I think those allocation / free patterns are > fairly extreme, and there probably are no workloads doing exactly this. > The idea is the actual workloads are likely some combination of these > extreme cases. > > > I'll try to see if I can understand more precisely what is happening. > > Thanks, that'd be helpful. 
> Maybe we can learn something about tuning malloc parameters to get
> significantly better performance.

Apologies for the long email; maybe what I will state here is obvious for others, but I learnt a lot, so... I think I understood what the problem was in your generation tests: depending on the sequence of allocations, we will raise a different maximum for mmap_threshold and trim_threshold.

When an mmap'ed block is freed, malloc will raise its threshold, as it considers freed memory to be better served by regular moving of the sbrk pointer instead of using mmap to map memory. This threshold is upped by multiplying it by two anytime we free an mmap'ed block. At the same time, the trim_threshold is raised to double the mmap_threshold, considering that this amount of memory should not be released to the OS because we have a good chance of reusing it.

This can be demonstrated using the attached systemtap script, along with a patch adding new traces to generation_context for this purpose. When running your query:

select block_size, chunk_size, x.*
  from (values (512)) AS a(chunk_size),
       (values (32768)) AS b(block_size),
       lateral generation_bench_fifo(1000000, block_size, chunk_size,
                                     2*chunk_size, 100, 10000, 5000) x;

we obtain the following trace for threshold adjustments:

2167837: New thresholds: mmap: 135168 bytes, trim: 270336 bytes
2167837: New thresholds: mmap: 266240 bytes, trim: 532480 bytes
2167837: New thresholds: mmap: 528384 bytes, trim: 1056768 bytes
2167837: New thresholds: mmap: 1052672 bytes, trim: 2105344 bytes
2167837: New thresholds: mmap: 16003072 bytes, trim: 32006144 bytes

When running the full benchmark, we reach a higher threshold at some point:

2181301: New thresholds: mmap: 135168 bytes, trim: 270336 bytes
2181301: New thresholds: mmap: 266240 bytes, trim: 532480 bytes
2181301: New thresholds: mmap: 528384 bytes, trim: 1056768 bytes
2181301: New thresholds: mmap: 1052672 bytes, trim: 2105344 bytes
2181301: New thresholds: mmap: 16003072 bytes, trim:
32006144 bytes
2181301: New thresholds: mmap: 24002560 bytes, trim: 48005120 bytes

This is because at some point in the full benchmark, we allocate a block bigger than the mmap threshold, which malloc serves by an mmap, then frees it, which means we now also raise the trim_threshold. The subsequent allocations in the lone query end up between those thresholds, so they are served by moving the sbrk pointer, then releasing the memory back to the OS, which turns out to be expensive too.

One thing that I observed is that the effect of constantly moving the sbrk pointer still happens, but not as much, in a real tuplesort workload. I haven't tested the reordering buffer for now, which is the other part of the code using the generation context.

So, I decided to benchmark while trying to control the malloc thresholds. This result is on my local laptop, with an i7 processor, for the benchmark initially proposed by David, with 4GB work_mem and default settings. I benchmarked master, the original patch with a fixed block size, and the one adjusting the size. For each of them, I ran them using default malloc options (ie, adjusting the thresholds dynamically), and with MALLOC_MMAP_THRESHOLD_ set to 32MB (the maximum on my platform, 4 * 1024 * 1024 * sizeof(long)) and MALLOC_TRIM_THRESHOLD_ set to the value malloc would have adjusted it to by itself on reaching that threshold (ie, 64MB). I did that by setting the corresponding env variables.

The results are in the attached spreadsheet. I will follow up with a benchmark of the test sorting a table with a width varying from 1 to 32 columns. As of now, my conclusion is that for glibc malloc, the block size we use doesn't really matter, as long as we tell malloc not to release a certain amount of memory back to the system once it's been allocated.
Setting the mmap_threshold to min(work_mem, DEFAULT_MMAP_THRESHOLD_MAX) and the trim_threshold to two times that would IMO take us to where a long-lived backend would likely end up anyway: as soon as we alloc min(work_mem, 32MB), we won't give it back to the system, saving us a huge amount of syscalls in common cases. Doing that will not change the allocation profiles for other platforms, and should be safe for those. The only problem I see if we were to do that would be for allocations in excess of work_mem, which would no longer trigger a threshold raise, since once we're in "non-dynamic" mode, glibc's malloc would keep our manually-set values. I guess this proposal could be refined to set it up dynamically ourselves when pfreeing blocks, just like malloc does. What do you think of this analysis and idea?

-- Ronan Dunklau
Attachment
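[The proposed tuning boils down to a simple computation. The sketch below states the rule as described in the message - mmap_threshold = min(work_mem, DEFAULT_MMAP_THRESHOLD_MAX), trim_threshold twice that - with hypothetical names; it is not code from any posted patch.]

```c
#include <assert.h>
#include <stddef.h>

/* 32MB, glibc's default cap for the dynamic mmap threshold on 64-bit */
#define DEFAULT_MMAP_THRESHOLD_MAX (32 * 1024 * 1024)

typedef struct MallocTuning
{
    size_t  mmap_threshold;
    size_t  trim_threshold;
} MallocTuning;

/*
 * Proposed rule: cap the mmap threshold at min(work_mem, 32MB) and keep
 * twice that amount of free memory at the top of the heap before
 * trimming. work_mem is taken in bytes here.
 */
static MallocTuning
propose_thresholds(size_t work_mem)
{
    MallocTuning t;

    t.mmap_threshold = (work_mem < DEFAULT_MMAP_THRESHOLD_MAX)
        ? work_mem : DEFAULT_MMAP_THRESHOLD_MAX;
    t.trim_threshold = 2 * t.mmap_threshold;
    return t;
}
```

With the default 4MB work_mem this yields a 4MB mmap threshold and an 8MB trim threshold; with 4GB work_mem it saturates at 32MB / 64MB.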
Le jeudi 16 décembre 2021, 11:56:15 CET Ronan Dunklau a écrit :
> I will follow up with a benchmark of the test sorting a table with a width
> varying from 1 to 32 columns.

So please find attached another benchmark for that case.

The 3 different patchsets tested are:

- master
- fixed (David's original patch)
- adjust (Thomas growing blocks patch)

So it looks like tuning malloc for this would be very beneficial for any kind of allocation, and by doing so we reduce the problems seen with the growing blocks patch to next to nothing, while keeping the ability to not allocate too much memory from the get-go.

I would like to try to implement some dynamic glibc malloc tuning, if that is something we don't reject on principle from the get-go.

-- Ronan Dunklau
Attachment
On 12/16/21 17:03, Ronan Dunklau wrote:
> Le jeudi 16 décembre 2021, 11:56:15 CET Ronan Dunklau a écrit :
>> I will follow up with a benchmark of the test sorting a table with a width
>> varying from 1 to 32 columns.
>
> So please find attached another benchmark for that case.
>
> The 3 different patchsets tested are:
>
> - master
> - fixed (David's original patch)
> - adjust (Thomas growing blocks patch)

Presumably Thomas is me, right?

> So it looks like tuning malloc for this would be very benificial for any kind
> of allocation, and by doing so we reduce the problems seen with the growing
> blocks patch to next to nothing, while keeping the ability to not allocate too
> much memory from the get go.

Thanks for running those tests and investigating the glibc behavior! I find those results very interesting. My conclusion from this is that the interaction between "our" allocator and the allocator in malloc (e.g. glibc) can be problematic. Which makes benchmarking and optimization somewhat tricky, because code changes may trigger behavior changes in glibc (or whatever allocator backs malloc).

I think it's worth exploring if we can tune this in a reasonable way, but I have a couple of concerns related to that:

1) I wonder how glibc-specific this is - I'd bet it applies to other allocators (either on another OS or just a different allocator on Linux) too. Tweaking glibc parameters won't affect those systems, of course, but maybe we should allow tweaking those systems too ...

2) In fact, I wonder if different glibc versions behave differently? Hopefully it's not changing that much, though. Ditto kernel versions, but the mmap/sbrk interface is likely more stable. We can test this.

3) If we bump the thresholds, won't that work against reusing the memory? I mean, if we free a whole block (from any allocator we have), glibc might return it to the kernel, depending on the mmap threshold value.
It's not guaranteed, but increasing the malloc thresholds will make that even less likely. So we might just as well increase the minimum block size, with about the same effect, no?

> I would like to try to implement some dynamic glibc malloc tuning, if that is
> something we don't reject on principle from the get go.

+1 to that

regards

-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Le jeudi 16 décembre 2021, 18:00:56 CET Tomas Vondra a écrit : > On 12/16/21 17:03, Ronan Dunklau wrote: > > Le jeudi 16 décembre 2021, 11:56:15 CET Ronan Dunklau a écrit : > >> I will follow up with a benchmark of the test sorting a table with a > >> width > >> varying from 1 to 32 columns. > > > > So please find attached another benchmark for that case. > > > > The 3 different patchsets tested are: > > - master > > - fixed (David's original patch) > > - adjust (Thomas growing blocks patch) > > Presumably Thomas is me, right? I'm really sorry for this typo... Please accept my apologies. > > > So it looks like tuning malloc for this would be very benificial for any > > kind of allocation, and by doing so we reduce the problems seen with the > > growing blocks patch to next to nothing, while keeping the ability to not > > allocate too much memory from the get go. > > Thanks for running those tests and investigating the glibc behavior! I > find those results very interesting. My conclusions from this is that > the interaction interaction between "our" allocator and the allocator in > malloc (e.g. glibc) can be problematic. Which makes benchmarking and > optimization somewhat tricky because code changes may trigger behavior > change in glibc (or whatever allocator backs malloc). > > I think it's worth exploring if we can tune this in a reasonable way, > but I have a couple concerns related to that: > > 1) I wonder how glibc-specific this is - I'd bet it applies to other > allocators (either on another OS or just different allocator on Linux) > too. Tweaking glibc parameters won't affect those systems, of course, > but maybe we should allow tweaking those systems too ... I agree, finding their specific problems and see if we can workaround it would be interesting. I suppose glibc's malloc is the most commonly used allocator in production, as it is the default for most Linux distributions. > > 2) In fact, I wonder if different glibc versions behave differently? 
> Hopefully it's not changing that much, though. Ditto kernel versions,
> but the mmap/sbrk interface is likely more stable. We can test this.

That could be tested, yes. As a matter of fact, a commit removing the upper limit for MALLOC_MMAP_THRESHOLD has just been committed yesterday to glibc, which means we can service much bigger allocations without mmap.

> 3) If we bump the thresholds, won't that work against reusing the
> memory? I mean, if we free a whole block (from any allocator we have),
> glibc might return it to kernel, depending on mmap threshold value. It's
> not guaranteed, but increasing the malloc thresholds will make that even
> less likely. So we might just as well increase the minimum block size,
> with about the same effect, no?

It is my understanding that malloc will try to compact memory by moving it around. So the memory should actually be released to the kernel at some point. In the meantime, malloc can reuse it for our next invocation (which can be in a different memory context on our side).

If we increase the minimum block size, this is memory we will actually reserve, and it will not protect us against the ramping-up behaviour:
- the first allocation of a big block may be over mmap_threshold, and serviced by an expensive mmap
- when it's freed, the threshold is doubled
- the next invocation is serviced by an sbrk call
- freeing it will be above the trim threshold, and it will be returned

After several "big" allocations, the thresholds will rise to their maximum values (well, it used to work that way; I need to check what happens with that latest patch of glibc...). This will typically happen several times, as malloc doubles the threshold each time. This is probably the reason quadrupling the block sizes was more effective.

> > I would like to try to implement some dynamic glibc malloc tuning, if that
> > is something we don't reject on principle from the get go.
>
> +1 to that

Ok, I'll work on a patch for this and submit a new thread.

-- Ronan Dunklau
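[The ramping-up behaviour described in the message above can be modelled as a toy simulation. This follows the simplified "threshold doubles each time an mmap'ed block is freed" rule as stated in the thread, not glibc's exact internal logic; all names are illustrative.]

```c
#include <assert.h>
#include <stddef.h>

#define MMAP_THRESHOLD_MAX (32 * 1024 * 1024)   /* glibc's historical cap */

/*
 * Simulate the ramp-up: each time an allocation of `alloc_size` is served
 * by mmap and then freed, the mmap threshold doubles, until the
 * allocation falls below the threshold (or the cap is reached). Returns
 * how many expensive mmap round-trips happen before the ramp-up settles.
 */
static int
mmap_round_trips(size_t threshold, size_t alloc_size)
{
    int     n = 0;

    while (alloc_size >= threshold && threshold < MMAP_THRESHOLD_MAX)
    {
        threshold *= 2;         /* freeing the mmap'ed block bumps it */
        n++;
    }
    return n;
}
```

For a repeated 1MB allocation against the default 128kB starting threshold this gives four mmap round-trips before sbrk takes over, which matches the "this will typically happen several times" observation.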
Le vendredi 17 décembre 2021, 09:08:06 CET Ronan Dunklau a écrit : > It is my understanding that malloc will try to compact memory by moving it > around. So the memory should be actually be released to the kernel at some > point. In the meantime, malloc can reuse it for our next invocation (which > can be in a different memory context on our side). I've been told off-list this comment wasn't clear: I meant that it compacts *free* memory, consolidating into larger blocks that will eventually reach the threshold, and be released. -- Ronan Dunklau
On 12/17/21 09:08, Ronan Dunklau wrote: > Le jeudi 16 décembre 2021, 18:00:56 CET Tomas Vondra a écrit : >> On 12/16/21 17:03, Ronan Dunklau wrote: >>> Le jeudi 16 décembre 2021, 11:56:15 CET Ronan Dunklau a écrit : >>>> I will follow up with a benchmark of the test sorting a table with a >>>> width >>>> varying from 1 to 32 columns. >>> >>> So please find attached another benchmark for that case. >>> >>> The 3 different patchsets tested are: >>> - master >>> - fixed (David's original patch) >>> - adjust (Thomas growing blocks patch) >> >> Presumably Thomas is me, right? > > I'm really sorry for this typo... Please accept my apologies. > Nah, no apology needed. I was just wondering if I missed some patch from Thomas Munro ... >> >>> So it looks like tuning malloc for this would be very benificial for any >>> kind of allocation, and by doing so we reduce the problems seen with the >>> growing blocks patch to next to nothing, while keeping the ability to not >>> allocate too much memory from the get go. >> >> Thanks for running those tests and investigating the glibc behavior! I >> find those results very interesting. My conclusions from this is that >> the interaction interaction between "our" allocator and the allocator in >> malloc (e.g. glibc) can be problematic. Which makes benchmarking and >> optimization somewhat tricky because code changes may trigger behavior >> change in glibc (or whatever allocator backs malloc). >> >> I think it's worth exploring if we can tune this in a reasonable way, >> but I have a couple concerns related to that: >> >> 1) I wonder how glibc-specific this is - I'd bet it applies to other >> allocators (either on another OS or just different allocator on Linux) >> too. Tweaking glibc parameters won't affect those systems, of course, >> but maybe we should allow tweaking those systems too ... > > I agree, finding their specific problems and see if we can workaround it would > be interesting. 
I suppose glibc's malloc is the most commonly used allocator > in production, as it is the default for most Linux distributions. > I wasn't really suggesting to investigate those other allocators in this patch - it seems like a task requiring a pretty significant amount of work/time. My point was that we should make it reasonably easy to add tweaks for those other environments, if someone is interested enough to do the legwork. >> >> 2) In fact, I wonder if different glibc versions behave differently? >> Hopefully it's not changing that much, though. Ditto kernel versions, >> but the mmap/sbrk interface is likely more stable. We can test this. > > That could be tested, yes. As a matter of fact, a commit removing the upper > limit for MALLOC_MMAP_THRESHOLD has just been committed yesterday to glibc, > which means we can service much bigger allocation without mmap. > Yeah, I noticed that commit too. Most systems stick to one glibc version, so it'll take time to reach most systems. Let's continue with just one glibc version and then maybe test other versions. > >> >> 3) If we bump the thresholds, won't that work against reusing the >> memory? I mean, if we free a whole block (from any allocator we have), >> glibc might return it to kernel, depending on mmap threshold value. It's >> not guaranteed, but increasing the malloc thresholds will make that even >> less likely. So we might just as well increase the minimum block size, >> with about the same effect, no? > > It is my understanding that malloc will try to compact memory by moving it > around. So the memory should be actually be released to the kernel at some > point. In the meantime, malloc can reuse it for our next invocation (which can > be in a different memory context on our side). 
>
> If we increase the minimum block size, this is memory we will actually
> reserve, and it will not protect us against the ramping-up behaviour:
> - the first allocation of a big block may be over mmap_threshold, and serviced
> by an expensive mmap
> - when it's free, the threshold is doubled
> - next invocation is serviced by an sbrk call
> - freeing it will be above the trim threshold, and it will be returned.
>
> After several "big" allocations, the thresholds will raise to their maximum
> values (well, it used to, I need to check what happens with that latest patch
> of glibc...)
>
> This will typically happen several times as malloc doubles the threshold each
> time. This is probably the reason quadrupling the block sizes was more
> effective.

Hmmm, OK. Can we benchmark the case with a large initial block size, at least for comparison?

>>> I would like to try to implement some dynamic glibc malloc tuning, if that
>>> is something we don't reject on principle from the get go.
>>
>> +1 to that
>
> Ok, I'll work on a patch for this and submit a new thread.

Cool!

regards

-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Le vendredi 17 décembre 2021, 14:39:10 CET Tomas Vondra a écrit : > I wasn't really suggesting to investigate those other allocators in this > patch - it seems like a task requiring a pretty significant amount of > work/time. My point was that we should make it reasonably easy to add > tweaks for those other environments, if someone is interested enough to > do the legwork. > > >> 2) In fact, I wonder if different glibc versions behave differently? > >> Hopefully it's not changing that much, though. Ditto kernel versions, > >> but the mmap/sbrk interface is likely more stable. We can test this. > > > > That could be tested, yes. As a matter of fact, a commit removing the > > upper > > limit for MALLOC_MMAP_THRESHOLD has just been committed yesterday to > > glibc, > > which means we can service much bigger allocation without mmap. > > Yeah, I noticed that commit too. Most systems stick to one glibc > version, so it'll take time to reach most systems. Let's continue with > just one glibc version and then maybe test other versions. Yes, I also need to figure out how to detect we're using glibc as I'm not very familiar with configure. > > >> 3) If we bump the thresholds, won't that work against reusing the > >> memory? I mean, if we free a whole block (from any allocator we have), > >> glibc might return it to kernel, depending on mmap threshold value. It's > >> not guaranteed, but increasing the malloc thresholds will make that even > >> less likely. So we might just as well increase the minimum block size, > >> with about the same effect, no? > > > > It is my understanding that malloc will try to compact memory by moving it > > around. So the memory should be actually be released to the kernel at some > > point. In the meantime, malloc can reuse it for our next invocation (which > > can be in a different memory context on our side). 
> >
> > If we increase the minimum block size, this is memory we will actually
> > reserve, and it will not protect us against the ramping-up behaviour:
> > - the first allocation of a big block may be over mmap_threshold, and
> > serviced by an expensive mmap
> > - when it's free, the threshold is doubled
> > - next invocation is serviced by an sbrk call
> > - freeing it will be above the trim threshold, and it will be returned.
> >
> > After several "big" allocations, the thresholds will raise to their
> > maximum values (well, it used to, I need to check what happens with that
> > latest patch of glibc...)
> >
> > This will typically happen several times as malloc doubles the threshold
> > each time. This is probably the reason quadrupling the block sizes was
> > more effective.
>
> Hmmm, OK. Can we we benchmark the case with large initial block size, at
> least for comparison?

The benchmark I called "fixed" was with a fixed block size of ALLOCSET_DEFAULT_MAXSIZE (first proposed patch) and showed roughly the same performance profile as the growing blocks + malloc tuning. But if I understand correctly, you implemented the growing blocks logic after concerns about wasting memory with a constant large block size. If we tune malloc, that memory would not be wasted if we don't alloc it, just not released as eagerly when it's allocated.

Or do you want a benchmark with an even bigger initial block size? With the growing blocks patch with a large initial size? I can run either, I just want to understand what is interesting to you.

One thing that would be interesting would be to trace the total amount of memory allocated in the different cases. This is something I will need to do anyway when I propose that patch.

Best regards,

-- Ronan Dunklau
On Fri, 10 Sept 2021 at 01:38, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > I've spent quite a bit of time investigating the regressions clearly > visible in the benchmark results for some allocation/free patterns. > Attached is a subset of results from a recent run, but the old results > show mostly the same thing. I started looking at the original patch again and wondered if it might be a good idea to allow the estimates from the planner about the tuple size and the number of tuples to help determine the size of the chunks in the generation context. The number of tuples can maybe also help size the array that we're sorting, which would save us allocating just 1024 elements and doubling it each time we run out of space in the array. The main problem with that is that there will be many cases where we need to sort but don't have any estimates from the planner, so if we do this, then likely, we'll still need the chunk growing code. I've attached some benchmark results that I took recently. The spreadsheet contains results from 3 versions. master, master + 0001 - 0002, then master + 0001 - 0003. The 0003 patch makes the code a bit more conservative about the chunk sizes it allocates and also tries to allocate the tuple array according to the number of tuples we expect to be able to sort in a single batch for when the sort is not estimated to fit inside work_mem. For the 4MB work_mem test in the spreadsheet, this does end up using 2MB chunks rather than 4MB chunks. This is because the 10k tuple test is just so close to requiring 4MB of memory to perform the sort. This does mean that for the 10k tuple test, the 0003 patch makes things look quite a bit slower than just with the 0001 + 0002 patch. However, I think not making the chunk size the same size as work_mem is a good idea. The patch is still WIP. 
I still have a few magic numbers in there, for example the 48 in: state->est_tuple_memory = pg_nextpower2_64(state->est_tuples * (state->est_tupwidth + 48)); I should really have a sizeof() in there. I've not really looked for other opportunities besides nodeSort.c where I can pass tuple estimates down into tuplesort.c David
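For what it's worth, the shape of that computation can be sketched as below. `TUPLE_OVERHEAD` is a hypothetical stand-in for the magic 48 (which, as noted, should really be a sizeof() of the per-tuple overhead), and `next_power2_64` mimics what pg_nextpower2_64 does:

```c
#include <assert.h>
#include <stdint.h>

/* Equivalent of pg_nextpower2_64: smallest power of 2 >= v, for v > 0. */
static uint64_t
next_power2_64(uint64_t v)
{
	uint64_t	p = 1;

	while (p < v)
		p <<= 1;
	return p;
}

/* Placeholder for the magic 48: per-tuple header/bookkeeping overhead. */
#define TUPLE_OVERHEAD 48

/* Sketch of the estimate: planner row count times per-tuple footprint,
 * rounded up to a power of 2 to pick a block size. */
static uint64_t
est_tuple_memory(uint64_t est_tuples, uint64_t est_tupwidth)
{
	return next_power2_64(est_tuples * (est_tupwidth + TUPLE_OVERHEAD));
}
```

For example, 10k tuples of width 52 estimate to 1,000,000 bytes, rounded up to 1MB, which is how a sort that is "just so close" to a boundary can flip to the next chunk size.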
Attachment
Le vendredi 31 décembre 2021, 22:26:37 CET David Rowley a écrit : > I've attached some benchmark results that I took recently. The > spreadsheet contains results from 3 versions. master, master + 0001 - > 0002, then master + 0001 - 0003. The 0003 patch makes the code a bit > more conservative about the chunk sizes it allocates and also tries to > allocate the tuple array according to the number of tuples we expect > to be able to sort in a single batch for when the sort is not > estimated to fit inside work_mem. (Sorry for trying to merge back the discussion on the two sides of the thread) In https://www.postgresql.org/message-id/4776839.iZASKD2KPV%40aivenronan, I expressed the idea of being able to tune glibc's malloc behaviour. I implemented that (patch 0001) to provide a new hook which is called on backend startup, and anytime we set work_mem. This hook is #defined depending on the malloc implementation: currently a default, no-op implementation is provided, as well as a glibc malloc implementation. The glibc malloc implementation relies on a new GUC, glibc_malloc_max_trim_threshold. When set to its default value of -1, we don't tune malloc at all, exactly as in HEAD. If a different value is provided, we set M_MMAP_THRESHOLD to half this value, and M_TRIM_THRESHOLD to this value, capped by work_mem / 2 and work_mem respectively. The net result is that we can then keep more unused memory at the top of the heap, and use mmap less frequently, if the DBA chooses to. Another possible use case would be, on the contrary, to limit the allocated memory in idle backends to a minimum. The reasoning behind this is that glibc malloc's default way of handling those two thresholds is to adapt to the size of the last freed mmapped block. I've run the same "up to 32 columns" benchmark as you did, with this new patch applied on top of both HEAD and your v2 patchset incorporating planner estimates for the block sizes.
Those are called "aset" and "generation" in the attached spreadsheet. For each, I've run it with glibc_malloc_max_trim_threshold set to -1, 1MB, 4MB and 64MB. In each case I've measured two things: - query latency, as reported by pgbench - total memory allocated by malloc at backend exit after running each query three times. This represents the "idle" memory consumption, and thus what we waste in malloc instead of releasing it back to the system. This measurement has been performed using the very small module presented in patch 0002. Please note that I in no way propose that we include this module, it was just a convenient way for me to measure memory footprint. My conclusion is that the impressive gains you see from using the generation context with bigger blocks mostly come from the fact that we allocate bigger blocks, and that this moves the mmap thresholds accordingly. I wonder how much of a difference it would make on other malloc implementations: I'm afraid the optimisation presented here would in fact be specific to glibc's malloc, since we have almost the same gains with both allocators when tuning malloc to keep more memory. I still think both approaches are useful, and would be necessary. Since this affects all memory allocations, I need to come up with other meaningful scenarios to benchmark. -- Ronan Dunklau
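The threshold computation described earlier in this message could look roughly like the following sketch. The function name and the kB units are assumptions for illustration, not the actual patch; only mallopt()/malloc_trim() are real glibc calls, shown here in a comment:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical sketch: derive the two glibc tuning values from the
 * glibc_malloc_max_trim_threshold GUC and work_mem (both in kB, like
 * work_mem itself).  M_MMAP_THRESHOLD gets half the GUC value and
 * M_TRIM_THRESHOLD the full value, capped by work_mem / 2 and work_mem
 * respectively, as described above.
 */
static void
compute_glibc_thresholds(int max_trim_threshold_kb, int work_mem_kb,
						 size_t *mmap_bytes, size_t *trim_bytes)
{
	int			mmap_kb = max_trim_threshold_kb / 2;
	int			trim_kb = max_trim_threshold_kb;

	if (mmap_kb > work_mem_kb / 2)
		mmap_kb = work_mem_kb / 2;
	if (trim_kb > work_mem_kb)
		trim_kb = work_mem_kb;

	*mmap_bytes = (size_t) mmap_kb * 1024;
	*trim_bytes = (size_t) trim_kb * 1024;

	/*
	 * On glibc the values would then be applied with
	 * mallopt(M_MMAP_THRESHOLD, ...) and mallopt(M_TRIM_THRESHOLD, ...),
	 * plus a malloc_trim(0) when lowering them to give memory back.
	 */
}
```

So with the GUC at 64MB but work_mem at 4MB, the caps win: mmap threshold 2MB, trim threshold 4MB.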
Attachment
On 1/7/22 12:03, Ronan Dunklau wrote: > Le vendredi 31 décembre 2021, 22:26:37 CET David Rowley a écrit : >> I've attached some benchmark results that I took recently. The >> spreadsheet contains results from 3 versions. master, master + 0001 - >> 0002, then master + 0001 - 0003. The 0003 patch makes the code a bit >> more conservative about the chunk sizes it allocates and also tries to >> allocate the tuple array according to the number of tuples we expect >> to be able to sort in a single batch for when the sort is not >> estimated to fit inside work_mem. > > (Sorry for trying to merge back the discussion on the two sides of the thread) > > In https://www.postgresql.org/message-id/4776839.iZASKD2KPV%40aivenronan, I > expressed the idea of being able to tune glibc's malloc behaviour. > > I implemented that (patch 0001) to provide a new hook which is called on > backend startup, and anytime we set work_mem. This hook is # defined depending > on the malloc implementation: currently a default, no-op implementation is > provided as well as a glibc's malloc implementation. > Not sure I'd call this a hook - that usually means a way to plug-in custom code through a callback, and this is simply ifdefing a block of code to pick the right implementation. Which may be a good way to do that, just let's not call that a hook. There's a commented-out MallocTuneHook() call, probably not needed. I wonder if #ifdefing is sufficient solution, because it happens at compile time, so what if someone overrides the allocator in LD_PRELOAD? That was a fairly common way to use a custom allocator in an existing application. But I don't know how many people do that with Postgres (I'm not aware of anyone doing that) or if we support that (it'd probably apply to other stuff too, not just malloc). So maybe it's OK, and I can't think of a better way anyway. > The glibc's malloc implementation relies on a new GUC, > glibc_malloc_max_trim_threshold. 
When set to it's default value of -1, we > don't tune malloc at all, exactly as in HEAD. If a different value is provided, > we set M_MMAP_THRESHOLD to half this value, and M_TRIM_TRESHOLD to this value, > capped by work_mem / 2 and work_mem respectively. > > The net result is that we can then allow to keep more unused memory at the top > of the heap, and to use mmap less frequently, if the DBA chooses too. A > possible other use case would be to on the contrary, limit the allocated > memory in idle backends to a minimum. > > The reasoning behind this is that glibc's malloc default way of handling those > two thresholds is to adapt to the size of the last freed mmaped block. > > I've run the same "up to 32 columns" benchmark as you did, with this new patch > applied on top of both HEAD and your v2 patchset incorporating planner > estimates for the block sizez. Those are called "aset" and "generation" in the > attached spreadsheet. For each, I've run it with > glibc_malloc_max_trim_threshold set to -1, 1MB, 4MB and 64MB. In each case > I've measured two things: > - query latency, as reported by pgbench > - total memory allocated by malloc at backend ext after running each query > three times. This represents the "idle" memory consumption, and thus what we > waste in malloc inside of releasing back to the system. This measurement has > been performed using the very small module presented in patch 0002. Please > note that I in no way propose that we include this module, it was just a > convenient way for me to measure memory footprint. > > My conclusion is that the impressive gains you see from using the generation > context with bigger blocks mostly comes from the fact that we allocate bigger > blocks, and that this moves the mmap thresholds accordingly. 
I wonder how much > of a difference it would make on other malloc implementation: I'm afraid the > optimisation presented here would in fact be specific to glibc's malloc, since > we have almost the same gains with both allocators when tuning malloc to keep > more memory. I still think both approaches are useful, and would be necessary. > Interesting measurements. It's intriguing that for generation contexts, the default "-1" often outperforms "1MB" (but not the other options), while for aset it's pretty much "the higher value the better". > Since this affects all memory allocations, I need to come up with other > meaningful scenarios to benchmarks. > OK. Are you thinking about a different microbenchmark, or something closer to real workload? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Le vendredi 7 janvier 2022, 13:03:28 CET Tomas Vondra a écrit : > On 1/7/22 12:03, Ronan Dunklau wrote: > > Le vendredi 31 décembre 2021, 22:26:37 CET David Rowley a écrit : > >> I've attached some benchmark results that I took recently. The > >> spreadsheet contains results from 3 versions. master, master + 0001 - > >> 0002, then master + 0001 - 0003. The 0003 patch makes the code a bit > >> more conservative about the chunk sizes it allocates and also tries to > >> allocate the tuple array according to the number of tuples we expect > >> to be able to sort in a single batch for when the sort is not > >> estimated to fit inside work_mem. > > > > (Sorry for trying to merge back the discussion on the two sides of the > > thread) > > > > In https://www.postgresql.org/message-id/4776839.iZASKD2KPV%40aivenronan, > > I expressed the idea of being able to tune glibc's malloc behaviour. > > > > I implemented that (patch 0001) to provide a new hook which is called on > > backend startup, and anytime we set work_mem. This hook is # defined > > depending on the malloc implementation: currently a default, no-op > > implementation is provided as well as a glibc's malloc implementation. > > Not sure I'd call this a hook - that usually means a way to plug-in > custom code through a callback, and this is simply ifdefing a block of > code to pick the right implementation. Which may be a good way to do > that, just let's not call that a hook. > > There's a commented-out MallocTuneHook() call, probably not needed. Ok, I'll clean that up if we decide to proceed with this. > > I wonder if #ifdefing is sufficient solution, because it happens at > compile time, so what if someone overrides the allocator in LD_PRELOAD? > That was a fairly common way to use a custom allocator in an existing > application. But I don't know how many people do that with Postgres (I'm > not aware of anyone doing that) or if we support that (it'd probably > apply to other stuff too, not just malloc). 
So maybe it's OK, and I > can't think of a better way anyway. I couldn't think of a better way either, maybe there is something to be done with trying to dlsym something specific to glibc's malloc implementation ? > > > The glibc's malloc implementation relies on a new GUC, > > glibc_malloc_max_trim_threshold. When set to it's default value of -1, we > > don't tune malloc at all, exactly as in HEAD. If a different value is > > provided, we set M_MMAP_THRESHOLD to half this value, and M_TRIM_TRESHOLD > > to this value, capped by work_mem / 2 and work_mem respectively. > > > > The net result is that we can then allow to keep more unused memory at the > > top of the heap, and to use mmap less frequently, if the DBA chooses too. > > A possible other use case would be to on the contrary, limit the > > allocated memory in idle backends to a minimum. > > > > The reasoning behind this is that glibc's malloc default way of handling > > those two thresholds is to adapt to the size of the last freed mmaped > > block. > > > > I've run the same "up to 32 columns" benchmark as you did, with this new > > patch applied on top of both HEAD and your v2 patchset incorporating > > planner estimates for the block sizez. Those are called "aset" and > > "generation" in the attached spreadsheet. For each, I've run it with > > glibc_malloc_max_trim_threshold set to -1, 1MB, 4MB and 64MB. In each case > > > > I've measured two things: > > - query latency, as reported by pgbench > > - total memory allocated by malloc at backend ext after running each > > query > > > > three times. This represents the "idle" memory consumption, and thus what > > we waste in malloc inside of releasing back to the system. This > > measurement has been performed using the very small module presented in > > patch 0002. Please note that I in no way propose that we include this > > module, it was just a convenient way for me to measure memory footprint. 
> > My conclusion is that the impressive gains you see from using the > > generation context with bigger blocks mostly comes from the fact that we > > allocate bigger blocks, and that this moves the mmap thresholds > > accordingly. I wonder how much of a difference it would make on other > > malloc implementation: I'm afraid the optimisation presented here would > > in fact be specific to glibc's malloc, since we have almost the same > > gains with both allocators when tuning malloc to keep more memory. I > > still think both approaches are useful, and would be necessary. > Interesting measurements. It's intriguing that for generation contexts, > the default "-1" often outperforms "1MB" (but not the other options), > while for aset it's pretty much "the higher value the better". For generation contexts with "big block sizes" this result is expected, as the malloc dynamic tuning will adapt to the big block size. This can also be seen on the "idle memory" measurement: the memory consumption is identical to the 64MB value when using -1, since that's what we converge to. This makes it possible to configure postgres to be more conservative with memory: for example, if we have long-lived backends where we sometimes temporarily set work_mem to a high value, we may end up with a large memory footprint. The implementation I provide also requests a malloc trim when we lower the threshold, making it possible to release memory that would otherwise have been kept around forever. For aset, the memory allocation pattern is a bit more complicated, and we don't end up with such a high value for mmap_threshold. Also, one thing that I haven't explained yet is the weird outlier when there is only one column. > > Since this affects all memory allocations, I need to come up with other > > meaningful scenarios to benchmark. > OK. Are you thinking about a different microbenchmark, or something > closer to real workload? Both.
As for microbenchmarking, I'd like to test the following scenarios: - set returning functions allocating a lot of memory - maintenance operations: REINDEX TABLE and the like, where we may end up with a large amount of memory used. - operations involving large hash tables For real workloads, if you have something specific in mind let me know. One thing I didn't mention is that I set max_parallel_workers_per_gather to 0 in all tests. -- Ronan Dunklau
On Tue, 18 Jan 2022 at 19:45, Julien Rouhaud <rjuju123@gmail.com> wrote: > I'm also a bit confused about which patch(es) should actually be reviewed in > that thread. My understanding is that the original patch might not be relevant > anymore but I might be wrong. The CF entry should probably be updated to > clarify that, with an annotation/ title change / author update or something. > > In the meantime I will switch the entry to Waiting on Author. I think what's going on here is a bit confusing. There's quite a few patches on here that aim to do very different things. The patch I originally proposed was just to make tuplesort use generation memory contexts. This helps speed up tuplesort because generation contexts allow palloc() to be faster than the standard allocator. Additionally, there's no power-of-2 wastage with generation contexts like there is with the standard allocator. This helps fit more tuples on fewer CPU cache lines. I believe Andres then mentioned that the fixed block size for generation contexts might become a problem. With the standard allocator the actual size of the underlying malloc() grows up until a maximum size. With generation contexts this is always the same, so we could end up with a very large number of blocks which means lots of small mallocs which could end up slow. Tomas then posted a series of patches to address this. I then posted another patch that has the planner make a choice on the size of the blocks to use for the generation context based on the tuple width and number of tuple estimates. My hope there was to improve performance further by making a better choice for how large to make the blocks in the first place. This reduces the need for Tomas' patch, but does not eliminate the need for it. As of now, I still believe we'll need Tomas' patches to allow the block size to grow up to a maximum size. I think those patches are likely needed before we think about making tuplesort use generation contexts. 
The reason I believe this is that we don't always have good estimates for the number of tuples we're going to sort. nodeSort.c is fairly easy as it just fetches tuples once and then sorts them. The use of tuplesort for nodeIncrementalSort.c is much more complex as we'll likely be performing a series of smaller sorts which are much harder to get size estimates for. This means there's likely no magic block size that we can use for the generation context that's used to store the tuples in tuplesort. This is also the case for order by aggregates and probably most other usages of tuplesort. David
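To illustrate the power-of-2 wastage argument with a standalone sketch (illustrative only, not actual aset.c code): the standard allocator rounds every chunk up to the next power of 2, the generation context does not.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest power of 2 >= n (n > 0), as the standard allocator rounds. */
static size_t
round_up_pow2(size_t n)
{
	size_t		p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/*
 * Bytes wasted per tuple by the rounding.  A 40-byte tuple burns 24 extra
 * bytes under aset-style rounding; a generation context would store it in
 * 40 bytes (plus a fixed chunk header, ignored here).
 */
static size_t
aset_wastage(size_t tuple_size)
{
	return round_up_pow2(tuple_size) - tuple_size;
}
```

This is the effect behind the 5541kB-vs-3197kB memory numbers at the top of the thread: tuples just over a power-of-2 boundary nearly double their footprint under the standard allocator.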
Le jeudi 20 janvier 2022, 02:31:15 CET David Rowley a écrit : > On Tue, 18 Jan 2022 at 19:45, Julien Rouhaud <rjuju123@gmail.com> wrote: > > I'm also a bit confused about which patch(es) should actually be reviewed > > in that thread. My understanding is that the original patch might not be > > relevant anymore but I might be wrong. The CF entry should probably be > > updated to clarify that, with an annotation/ title change / author update > > or something. > > > > In the meantime I will switch the entry to Waiting on Author. > > I think what's going on here is a bit confusing. There's quite a few > patches on here that aim to do very different things. > > The patch I originally proposed was just to make tuplesort use > generation memory contexts. This helps speed up tuplesort because > generation contexts allow palloc() to be faster than the standard > allocator. Additionally, there's no power-of-2 wastage with generation > contexts like there is with the standard allocator. This helps fit > more tuples on fewer CPU cache lines. > > I believe Andres then mentioned that the fixed block size for > generation contexts might become a problem. With the standard > allocator the actual size of the underlying malloc() grows up until a > maximum size. With generation contexts this is always the same, so we > could end up with a very large number of blocks which means lots of > small mallocs which could end up slow. Tomas then posted a series of > patches to address this. > > I then posted another patch that has the planner make a choice on the > size of the blocks to use for the generation context based on the > tuple width and number of tuple estimates. My hope there was to > improve performance further by making a better choice for how large to > make the blocks in the first place. This reduces the need for Tomas' > patch, but does not eliminate the need for it. 
> > As of now, I still believe we'll need Tomas' patches to allow the > > block size to grow up to a maximum size. I think those patches are > > likely needed before we think about making tuplesort use generation > > contexts. The reason I believe this is that we don't always have good > > estimates for the number of tuples we're going to sort. nodeSort.c is > > fairly easy as it just fetches tuples once and then sorts them. The > > use of tuplesort for nodeIncrementalSort.c is much more complex as > > we'll likely be performing a series of smaller sorts which are much > > harder to get size estimates for. This means there's likely no magic > > block size that we can use for the generation context that's used to > > store the tuples in tuplesort. This is also the case for order by > > aggregates and probably most other usages of tuplesort. You left out the discussion I started about glibc's malloc-specific behaviour. I tried to demonstrate that a big part of the performance gain you were seeing was not in fact due to our allocator by itself, but to the way different block-size allocations interact with this malloc implementation and how it handles releasing memory back to the system. I also tried to show that we can give DBAs more control over how much memory to "waste" as the price for faster allocations. I agree that the generation context memory manager is useful in the tuplesort context, if only for the fact that we fall back to disk less quickly due to the reduced wastage of memory, but I would be wary of drawing conclusions too quickly based on a specific malloc implementation which shows threshold effects, which are compounded by our growing block sizes code in general. I have on my TODO list to run the same kind of benchmarks using jemalloc, to better isolate the performance gains expected from the generation allocator itself from the side effect of avoiding glibc's pattern of releasing memory back to the kernel too quickly.
Julien, sorry for the confusion of adding that discussion and those patches to the same thread, but I don't really see a way of dissociating those two aspects. -- Ronan Dunklau
Le jeudi 20 janvier 2022, 08:36:46 CET Ronan Dunklau a écrit : > Le jeudi 20 janvier 2022, 02:31:15 CET David Rowley a écrit : > > On Tue, 18 Jan 2022 at 19:45, Julien Rouhaud <rjuju123@gmail.com> wrote: > > > I'm also a bit confused about which patch(es) should actually be > > > reviewed > > > in that thread. My understanding is that the original patch might not > > > be > > > relevant anymore but I might be wrong. The CF entry should probably be > > > updated to clarify that, with an annotation/ title change / author > > > update > > > or something. > > > > > > In the meantime I will switch the entry to Waiting on Author. > > > > I think what's going on here is a bit confusing. There's quite a few > > patches on here that aim to do very different things. > > > > The patch I originally proposed was just to make tuplesort use > > generation memory contexts. This helps speed up tuplesort because > > generation contexts allow palloc() to be faster than the standard > > allocator. Additionally, there's no power-of-2 wastage with generation > > contexts like there is with the standard allocator. This helps fit > > more tuples on fewer CPU cache lines. > > > > I believe Andres then mentioned that the fixed block size for > > generation contexts might become a problem. With the standard > > allocator the actual size of the underlying malloc() grows up until a > > maximum size. With generation contexts this is always the same, so we > > could end up with a very large number of blocks which means lots of > > small mallocs which could end up slow. Tomas then posted a series of > > patches to address this. > > > > I then posted another patch that has the planner make a choice on the > > size of the blocks to use for the generation context based on the > > tuple width and number of tuple estimates. My hope there was to > > improve performance further by making a better choice for how large to > > make the blocks in the first place. 
This reduces the need for Tomas' > > patch, but does not eliminate the need for it. > > > > As of now, I still believe we'll need Tomas' patches to allow the > > block size to grow up to a maximum size. I think those patches are > > likely needed before we think about making tuplesort use generation > > contexts. The reason I believe this is that we don't always have good > > estimates for the number of tuples we're going to sort. nodeSort.c is > > fairly easy as it just fetches tuples once and then sorts them. The > > use of tuplesort for nodeIncrementalSort.c is much more complex as > > we'll likely be performing a series of smaller sorts which are much > > harder to get size estimates for. This means there's likely no magic > > block size that we can use for the generation context that's used to > > store the tuples in tuplesort. This is also the case for order by > > aggregates and probably most other usages of tuplesort. > > You left out the discussion I started about glibc's malloc specific > behaviour. > > I tried to demonstrate that a big part of the performance gain you were > seeing were not in fact due to our allocator by itself, but by the way > different block sizes allocations interact with this malloc implementation > and how it handles releasing memory back to the system. I also tried to > show that we can give DBAs more control over how much memory to "waste" as > the price for faster allocations. > > I agree that the generation context memory manager is useful in the > tuplesort context, if only for the fact that we fall back to disk less > quickly due to the reduced wastage of memory, but I would be wary of > drawing conclusions too quickly based on a specific malloc implementation > which shows threshold effects, which are compounded by our growing block > sizes code in general. 
> > I have on my TODO list to run the same kind of benchmarks using jemalloc, to > > better isolate the performance gains expected from the generation allocator > > itself from the side effect of avoiding glibc's pattern of releasing memory > > back to the kernel too quickly. I've run the same 1-to-32-columns sort benchmark, using both glibc malloc and jemalloc, with the standard aset memory manager or with David's generation v2 patch. To use jemalloc, I just use an LD_PRELOAD env variable before starting postgres. I have not tried to measure jemalloc's memory footprint like I did for glibc's malloc in the previous benchmark, as I'm not trying to evaluate jemalloc as a replacement for glibc's malloc. Please find the results attached. As I suspected, most of the gain observed with David's patch comes from working around glibc's malloc idiosyncrasies, but the good news is that it also helps (to a lesser degree) with jemalloc. I'm not sure how that would behave with growing block sizes though: as Tomas' patch and stress-testing benchmarks showed, we can hit some particular thresholds (and regressions) in glibc's self-tuning behaviour. -- Ronan Dunklau
Attachment
Hi:
On Thu, Jan 20, 2022 at 9:31 AM David Rowley <dgrowleyml@gmail.com> wrote:
As of now, I still believe we'll need Tomas' patches to allow the
block size to grow up to a maximum size. I think those patches are
likely needed before we think about making tuplesort use generation
contexts. The reason I believe this is that we don't always have good
estimates for the number of tuples we're going to sort.
+1
I spent some time studying the case here and my current thought is:
we can discuss/commit the minimum committable changes, which
should be beneficial for some cases and harmless for others.
Tomas's patch 0002 would mean no more blocks are needed
if we switch to GenerationContext (compared with the standard context), and
David's patch can obviously reduce total memory usage and improve
the performance. So IMO Tomas's patch 0002 + David's patch is a committable
patchset for a first round, and Tomas's 0001 patch would be good to have
as well.
I double-checked Tomas's 0002 patch and it looks good to me. I then applied
David's patch with ALLOCSET_DEFAULT_SIZES, testing the same workload.
Here are the results (numbers are tps):
work_mem = '4GB'
| Test Case | master | patched |
|-----------+--------+---------|
| Test 1 | 306 | 406 |
| Test 2 | 225 | 278 |
| Test 3 | 202 | 248 |
| Test 4 | 184 | 218 |
| Test 5 | 281 | 360 |
work_mem = '4MB'
| Test Case | master | patched |
|-----------+--------+---------|
| Test 1 | 124 | 409 |
| Test 2 | 106 | 280 |
| Test 3 | 100 | 249 |
| Test 4 | 97 | 218 |
| Test 5 | 120 | 369 |
I didn't get as much of a performance improvement as David did originally; I
think that is because David used ALLOCSET_DEFAULT_MAXSIZE directly, which
needs fewer memory allocation calls.
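As a rough model of why a bigger (or growing) block size means fewer underlying malloc calls (a simplified sketch, not the actual aset.c/generation.c logic):

```c
#include <assert.h>
#include <stddef.h>

/* Each new block doubles the previous size, capped at a maximum, mirroring
 * how aset.c ramps up toward ALLOCSET_DEFAULT_MAXSIZE (8MB). */
static size_t
next_block_size(size_t cur_size, size_t max_size)
{
	size_t		next = cur_size * 2;

	return next > max_size ? max_size : next;
}

/* Number of blocks (i.e. malloc calls) needed to hold total bytes,
 * starting from init_size and growing via next_block_size. */
static int
blocks_needed(size_t total, size_t init_size, size_t max_size)
{
	int			nblocks = 0;
	size_t		size = init_size;

	while (total > 0)
	{
		total = (total > size) ? total - size : 0;
		size = next_block_size(size, max_size);
		nblocks++;
	}
	return nblocks;
}
```

Holding 4MB of tuples takes 10 blocks when growing from 8kB, but a single block when starting at the 8MB maximum directly, which is consistent with the smaller speedup seen here versus David's ALLOCSET_DEFAULT_MAXSIZE runs.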
AFAICS, Tomas's patch 0002 + David's patch should be ready to commit as round 1.
We can then pursue other opportunities, like using row estimates to size the initial
memory allocation, and GenerationContext improvements like 0003/0004. Would this work?
Best Regards
Andy Fan
On 1/25/22 13:34, Ronan Dunklau wrote: > > ... > > I've run the same 1-to-32-columns sort benchmark, using both glibc malloc and > jemalloc, with the standard aset memory manager or with David's generation v2 > patch. To use jemalloc, I just use a LD_PRELOAD env variable before starting > postgres. > > I have not tried to measure jemalloc's memory footprint like I did for glibc's > malloc in the previous benchmark, as I'm not trying to evaluate jemalloc as a > glibc's malloc replacement. > > Please find the results attached. As I suspected, most of the gain observed > with David's patch come from working around glibc's malloc idiosyncracies but > the good news is that it also helps (to a lesser degree) with jemalloc. > > I'm not sure how that would behave with growing block sizes though: as Tomas > patch and stress-testing benchmarks showed, we can hit some particular > thresholds (and regressions) in glibc's self-tuning behaviour. > Interesting results! So just switching to jemalloc can yield much better performance (in some cases), even beating the generation context (on top of glibc malloc). Of course, those are just simple (somewhat synthetic) benchmarks, and there is no info about memory consumption. Perhaps there are cases where this would be slower than glibc malloc, and I doubt many people would switch to a non-default malloc implementation. But I think in general this mostly confirms our suspicion about the patch (somewhat accidentally) working around glibc internal behavior. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 1/20/22 08:36, Ronan Dunklau wrote: > Le jeudi 20 janvier 2022, 02:31:15 CET David Rowley a écrit : >> On Tue, 18 Jan 2022 at 19:45, Julien Rouhaud <rjuju123@gmail.com> wrote: >>> I'm also a bit confused about which patch(es) should actually be reviewed >>> in that thread. My understanding is that the original patch might not be >>> relevant anymore but I might be wrong. The CF entry should probably be >>> updated to clarify that, with an annotation/ title change / author update >>> or something. >>> >>> In the meantime I will switch the entry to Waiting on Author. >> >> I think what's going on here is a bit confusing. There's quite a few >> patches on here that aim to do very different things. >> >> The patch I originally proposed was just to make tuplesort use >> generation memory contexts. This helps speed up tuplesort because >> generation contexts allow palloc() to be faster than the standard >> allocator. Additionally, there's no power-of-2 wastage with generation >> contexts like there is with the standard allocator. This helps fit >> more tuples on fewer CPU cache lines. >> >> I believe Andres then mentioned that the fixed block size for >> generation contexts might become a problem. With the standard >> allocator the actual size of the underlying malloc() grows up until a >> maximum size. With generation contexts this is always the same, so we >> could end up with a very large number of blocks which means lots of >> small mallocs which could end up slow. Tomas then posted a series of >> patches to address this. >> >> I then posted another patch that has the planner make a choice on the >> size of the blocks to use for the generation context based on the >> tuple width and number of tuple estimates. My hope there was to >> improve performance further by making a better choice for how large to >> make the blocks in the first place. This reduces the need for Tomas' >> patch, but does not eliminate the need for it. 
>> >> As of now, I still believe we'll need Tomas' patches to allow the block size to grow up to a maximum size. I think those patches are likely needed before we think about making tuplesort use generation contexts. The reason I believe this is that we don't always have good estimates for the number of tuples we're going to sort. nodeSort.c is fairly easy as it just fetches tuples once and then sorts them. The use of tuplesort for nodeIncrementalSort.c is much more complex as we'll likely be performing a series of smaller sorts which are much harder to get size estimates for. This means there's likely no magic block size that we can use for the generation context that's used to store the tuples in tuplesort. This is also the case for order by aggregates and probably most other usages of tuplesort. >> > > You left out the discussion I started about glibc's malloc-specific behaviour. > > I tried to demonstrate that a big part of the performance gain you were seeing was not in fact due to our allocator by itself, but to the way different block-size allocations interact with this malloc implementation and how it handles releasing memory back to the system. I also tried to show that we can give DBAs more control over how much memory to "waste" as the price for faster allocations. > > I agree that the generation context memory manager is useful in the tuplesort context, if only for the fact that we fall back to disk less quickly due to the reduced wastage of memory, but I would be wary of drawing conclusions too quickly based on a specific malloc implementation which shows threshold effects, which are compounded by our growing block sizes code in general. > Yeah, I share this concern - how much of the improvement is real and how much is due to accidentally not hitting some glibc-specific self-tuning behaviour? Would realistic workloads get the same improvement, or would they cause regressions?
What about non-glibc systems with other malloc? I'm not against pushing the generation context improvements, e.g. something like the patches posted in [1], because those seem reasonable in general. But I'm somewhat confused about the last patch (adjusting allocChunkLimit) causing fairly significant regressions. I'll try investigating this a bit more next week. regards [1] https://www.postgresql.org/message-id/flat/bcdd4e3e-c12d-cd2b-7ead-a91ad416100a%40enterprisedb.com#a5ecd4e5a9e7d5b07ff25b5684da894f -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sun, 13 Feb 2022 at 09:56, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > I'm not against pushing the generation context improvements, e.g. > something like the patches posted in [1], because those seem reasonable > in general. But I'm somewhat confused about the last patch (adjusting > allocChunkLimit) causing fairly significant regressions. My thoughts here are that we should pursue the patch that allows growing of the block size in the generation context. I do think the investigation of the malloc() behaviour and performance is worth some pursuit; I just don't think it should be done here or as part of this patch. I think it's a fairly large undertaking to ensure any changes to the malloc options are an overall win and not just a win for whatever benchmark we test them on. If they're really an overall win, then shouldn't glibc know about them and maybe adopt them as the standard options? To get the ball rolling again on the changes to the generation context, I took your patches, Tomas, and fixed a few things around the keeper blocks not working correctly. I've now made the keeper block behaviour match how aset.c works. There the keeper block is allocated in the same allocation as the context itself. That reduces the number of malloc() calls when setting up a new memory context. On testing this, I discovered a problem regarding the use of generation contexts for storing tuples in tuplesort.c. The problem is that when the sort is bounded (e.g. SELECT * FROM ... ORDER BY .. LIMIT N), we'll store and directly throw away any tuples that sort greater than the existing set of N tuples. This causes a problem because once we move away from initially using the keeper block, we'll at some point end up storing a tuple in a newly allocated generation block and then proceed to pfree() that tuple due to it sorting greater than the existing N tuples. Because we currently always free() newly empty blocks, we end up thrashing free()/malloc().
This behaviour results in one of the regression test queries in tuplesort.sql going from 10 ms to 1 minute on my machine. I had a few thoughts about how best to work around this problem. Firstly, I thought that we should just *not* use the Generation context when the sort is bounded. That turns out to be a bit icky as we only make the sort bounded when we call tuplesort_set_bound(), which is after we've allocated the memory context for storing tuples. I thought about maybe just adding another bool flag named "allowBounding" and using a Generation context only if that flag is not set. I just don't really like that as a solution. We also cannot make the set bound part of the tuplesort_begin_heap() call as nodeIncrementalSort.c does reset the bound. Possibly we could ditch tuplesort_set_bound(), invent tuplesort_reset_bound() and change the API so the bound is set when making the Tuplesortstate. The other way I thought to fix it was by changing the logic for when generation blocks are freed. In the problem case mentioned above, the block being freed is the current block (which was just allocated). I made some changes to adjust this behaviour so that we no longer free the block when the final chunk is pfree()'d. Instead, the block now lingers and can be reused by future allocations, providing they fit inside it. If they don't fit then we must free the block, otherwise it would just remain empty forever. The problem I see with this method is that there still could be some pathological case that causes us to end up storing just a single tuple per generation block. Hitting the worst case here does seem pretty unlikely, so I'm a bit torn on whether we should just play it safe and stick to the standard memory allocator for bounded sorts. I've attached 2 patches. The 0001 patch adds the new logic to generation.c to allow the block to grow and also adds the logic to skip freeing the current block when it becomes empty.
The 0002 patch simply makes tuplesort.c use the generation context for tuples unconditionally. I've also attached the benchmark results I got from running the attached sortbenchall script with this patch. It's fairly obvious that there are performance improvements here even when the malloc() usage pattern is similar to aset.c's. David
Attachment
Hi, On 2022-02-18 12:10:51 +1300, David Rowley wrote: > The other way I thought to fix it was by changing the logic for when > generation blocks are freed. In the problem case mentioned above, the > block being freed is the current block (which was just allocated). I > made some changes to adjust this behaviour so that we no longer free > the block when the final chunk is pfree()'d. Instead, that now lingers > and can be reused by future allocations, providing they fit inside it. That makes sense to me, as long as we keep just one such block. > The problem I see with this method is that there still could be some > pathological case that causes us to end up storing just a single tuple per > generation block. Crazy idea: Detect the situation, and recompact. Create a new context, copy all the tuples over, delete the old context. That could be a win even in less adversarial situations than "a single tuple per generation block". Greetings, Andres Freund
On 2/23/22 03:25, Andres Freund wrote: > Hi, > > On 2022-02-18 12:10:51 +1300, David Rowley wrote: >> The other way I thought to fix it was by changing the logic for when >> generation blocks are freed. In the problem case mentioned above, the >> block being freed is the current block (which was just allocated). I >> made some changes to adjust this behaviour so that we no longer free >> the block when the final chunk is pfree()'d. Instead, that now lingers >> and can be reused by future allocations, providing they fit inside it. > > That makes sense to me, as long as we keep just one such block. > > >> The problem I see with this method is that there still could be some >> pathological case that causes us to end up storing just a single tuple per >> generation block. > > Crazy idea: Detect the situation, and recompact. Create a new context, copy > all the tuples over, delete the old context. That could be a win even in less > adversarial situations than "a single tuple per generation block". > What about pointers to the chunks in the old memory context? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, 23 Feb 2022 at 15:25, Andres Freund <andres@anarazel.de> wrote: > On 2022-02-18 12:10:51 +1300, David Rowley wrote: > > The other way I thought to fix it was by changing the logic for when > > generation blocks are freed. In the problem case mentioned above, the > > block being freed is the current block (which was just allocated). I > > made some changes to adjust this behaviour so that we no longer free > > the block when the final chunk is pfree()'d. Instead, that now lingers > > and can be reused by future allocations, providing they fit inside it. > > That makes sense to me, as long as we keep just one such block. I've rewritten the patch in such a way that it no longer matters which block becomes free. What I did was add a new field to the GenerationContext named "freeblock". We now simply store up to one recently emptied block there instead of calling free() on it. When we're allocating new memory, we'll try to make use of the "freeblock" when we don't have enough space in the current block. I feel like this is quite a good change as it saves continual free/malloc calls for FIFO-pattern users of the generation context. Since that's pretty much the workload that this context is intended for, that seems like a very good change to make. I did consider only recording a block as free once it reaches the maximum size. As I have the code currently, we'll continually reuse all blocks, including the initial-sized ones. It will end up filling blocks more quickly as we'll be reusing those smaller initial blocks continually with FIFO workloads. I'm not entirely sure if that matters. I just wanted to point out that I thought about it. Aside from the freeblock, I've also added code for adding a "keeper" block. This is allocated in the same malloc'd chunk as the Generation context itself. One thing I'm not too sure about is whether the keeper block change is worth the trouble.
If we pfree() all of the chunks out of the keeper block, there's a special case in GenerationFree() to empty that block rather than free() it. This also means we need a special case in GenerationAlloc() so we try to see if the keeper block has any space before we go and allocate a new block. My current thoughts are that the keeper block is really only useful for Generation contexts that remain very small and don't ever require allocation of additional blocks. This will mean all the memory can be allocated along with the context, which saves a malloc and free call. Does anyone have any thoughts on this? > > The problem I see with this method is that there still could be some > > pathological case that causes us to end up storing just a single tuple per > > generation block. > > Crazy idea: Detect the situation, and recompact. Create a new context, copy > all the tuples over, delete the old context. That could be a win even in less > adversarial situations than "a single tuple per generation block". I think that would need some new MemoryContext API functions which callers could use to determine the fragmentation and do something about it on the consumer side. Otherwise, as Tomas says, if we did it from within the context code itself we'd have no visibility of what is pointing to the memory we're moving around. If nobody has any objections to the 0001 patch then I'd like to move ahead with it in the next few days. For the 0002 patch, I'm currently feeling like we shouldn't be using the Generation context for bounded sorts. The only way I can think to do that is with an API change to tuplesort. I feel 0001 is useful even without 0002. David
Attachment
On Wed, 23 Mar 2022 at 04:08, David Rowley <dgrowleyml@gmail.com> wrote: > If nobody has any objections to the 0001 patch then I'd like to move > ahead with it in the next few days. For the 0002 patch, I'm currently > feeling like we shouldn't be using the Generation context for bounded > sorts. The only way I can think to do that is with an API change to > tuplesort. I feel 0001 is useful even without 0002. 0001: I've made some small revisions to the 0001 patch so that the keeper block and freeblock are only used again when they're entirely empty. The situation I want to avoid is this: when the current block does not have enough free space to store some new allocation, we'll then try the freeblock and then the keeper block. The problem I saw there is that we may previously have partially filled the keeper block or freeblock and been unable to store some medium-sized allocation, which caused us to move to a new block. If we didn't check that the keeper block and freeblock were empty first, we could end up being able to store some small allocation in there where some previous medium-sized allocation couldn't fit. While that's not necessarily an actual problem, what it does mean is that consecutive allocations might not end up in the same block or the next block. Over time, in a FIFO-type workload, it would be possible to get fragmentation, which could result in being unable to free blocks. For longer-lived contexts I imagine that could end up fairly bad. The updated patch should avoid that problem. 0002: This modifies the tuplesort API so that instead of having a randomAccess bool flag, this is changed to a bitwise flag to which we can add further options in the future. It's slightly annoying to break the API, but it's not exactly going to be hard for people to code around that. It might also mean we don't have to break the API in the future if we're doing some change where we can just add a new bitwise flag.
0003: This adds a new flag for TUPLESORT_ALLOWBOUNDED and modifies the places where we set a sort bound to pass the flag. The patch only uses the generation context when the flag is not passed. I feel this is a pretty simple patch and if nobody has any objections then I plan to push all 3 of them on my New Zealand Monday morning. David
Attachment
On Fri, Apr 01, 2022 at 10:00:08PM +1300, David Rowley wrote: > 0002: > This modifies the tuplesort API so that instead of having a > randomAccess bool flag, Per cirrus, this missed updating one line for dtrace. On Fri, Apr 01, 2022 at 10:00:08PM +1300, David Rowley wrote: > I feel this is a pretty simple patch and if nobody has any objections > then I plan to push all 3 of them on my New Zealand Monday morning. I'll mark the CF as such. -- Justin
Attachment
On Sat, 2 Apr 2022 at 02:25, Justin Pryzby <pryzby@telsasoft.com> wrote: > > On Fri, Apr 01, 2022 at 10:00:08PM +1300, David Rowley wrote: > > 0002: > > This modifies the tuplesort API so that instead of having a > > randomAccess bool flag, > > Per cirrus, this missed updating one line for dtrace. Thanks. I've pushed all 3 of the patches now. David
On Fri, Apr 01, 2022 at 10:00:08PM +1300, David Rowley wrote: > 0002: > This modifies the tuplesort API so that instead of having a > randomAccess bool flag, this is changed to a bitwise flag to which we can > add further options in the future. It's slightly annoying to break > the API, but it's not exactly going to be hard for people to code > around that. For what it's worth, the three PGXN extensions using tuplesort_begin_* are "citus", "pg_bulkload", and "vector". Nothing calls tuplesort_set_bound(). This (commit 77bae39) did not change function parameter counts, and TUPLESORT_RANDOMACCESS generally has the same numeric value as "true". I get no warning if I pass "true" for the "sortopt" flags parameter. Hence, I suspect this did not break the API. Should we be happy about that? I'm fine with it.
On Sat, Apr 23, 2022 at 5:59 PM Noah Misch <noah@leadboat.com> wrote: > This (commit 77bae39) did not change function parameter counts, and > TUPLESORT_RANDOMACCESS generally has same the same numeric value as "true". I > get no warning if I pass "true" for the "sortopt" flags parameter. Hence, I > suspect this did not break the API. Should we be happy about that? I'm fine > with it. If I happened to believe that this issue (or one like it) might have real negative consequences, and that those consequences could easily be avoided (by making the API break impossible to overlook), then I would object -- why even take a small chance? Fortunately I don't believe that we're even taking a small chance here, all things considered. And so I agree; this issue isn't a concern. -- Peter Geoghegan
Thanks for raising this. On Sun, 24 Apr 2022 at 12:59, Noah Misch <noah@leadboat.com> wrote: > This (commit 77bae39) did not change function parameter counts, and > TUPLESORT_RANDOMACCESS generally has the same numeric value as "true". I > get no warning if I pass "true" for the "sortopt" flags parameter. Hence, I > suspect this did not break the API. Should we be happy about that? I'm fine > with it. I'd be open to doing something like making sortopt the first argument of each of the tuplesort_begin* functions if people have concerns about this causing problems. However, given that TUPLESORT_RANDOMACCESS and true likely have the same value, it seems we'd be more likely to break code down the line if we added some needed option that wouldn't get set by passing "true" as the sortopt. If anyone were to use tuplesort_set_bound() without updating their tuplesort_begin* call, then on non-assert builds, nothing would misbehave. There's just a risk of bad memory fragmentation from using the generation context for bounded sorts. David
Hi Ronan, We briefly chatted about the glibc-tuning part of this thread at pgcon, so I wonder if you're still planning to pursue that. If you do, I suggest we start a fresh thread, so that it's not mixed with the already-committed generation context improvements. I wonder what the situation is with the generation context improvements already pushed - does setting the glibc thresholds still help, and if so, by how much? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sunday, 18 June 2023 at 20:22:17 CEST, Tomas Vondra wrote: > Hi Ronan, > > We briefly chatted about the glibc-tuning part of this thread at pgcon, > so I wonder if you're still planning to pursue that. If you do, I > suggest we start a fresh thread, so that it's not mixed with the > already-committed generation context improvements. > > I wonder what the situation is with the generation context improvements > already pushed - does setting the glibc thresholds still help, and if > so, by how much? Hi Tomas! I'm currently working on it, trying to evaluate the possible benefits for the current ASet and Generation context use cases. I'll make sure to use a new thread to post my findings. Thank you for remembering! Best regards, -- Ronan Dunklau