Re: Use generation context to speed up tuplesorts - Mailing list pgsql-hackers

From Ronan Dunklau
Subject Re: Use generation context to speed up tuplesorts
Date
Msg-id 3470027.R56niFO833@aivenronan
Whole thread Raw
In response to Re: Use generation context to speed up tuplesorts  (Ronan Dunklau <ronan.dunklau@aiven.io>)
Responses Re: Use generation context to speed up tuplesorts
List pgsql-hackers
Le jeudi 20 janvier 2022, 08:36:46 CET Ronan Dunklau a écrit :
> Le jeudi 20 janvier 2022, 02:31:15 CET David Rowley a écrit :
> > On Tue, 18 Jan 2022 at 19:45, Julien Rouhaud <rjuju123@gmail.com> wrote:
> > > I'm also a bit confused about which patch(es) should actually be
> > > reviewed
> > > in that thread.  My understanding is that the original patch might not
> > > be
> > > relevant anymore but I might be wrong.  The CF entry should probably be
> > > updated to clarify that, with an annotation/ title change / author
> > > update
> > > or something.
> > >
> > > In the meantime I will switch the entry to Waiting on Author.
> >
> > I think what's going on here is a bit confusing. There's quite a few
> > patches on here that aim to do very different things.
> >
> > The patch I originally proposed was just to make tuplesort use
> > generation memory contexts. This helps speed up tuplesort because
> > generation contexts allow palloc() to be faster than the standard
> > allocator. Additionally, there's no power-of-2 wastage with generation
> > contexts like there is with the standard allocator. This helps fit
> > more tuples on fewer CPU cache lines.
> >
> > I believe Andres then mentioned that the fixed block size for
> > generation contexts might become a problem. With the standard
> > allocator the actual size of the underlying malloc() grows up until a
> > maximum size.  With generation contexts this is always the same, so we
> > could end up with a very large number of blocks which means lots of
> > small mallocs which could end up slow.  Tomas then posted a series of
> > patches to address this.
> >
> > I then posted another patch that has the planner make a choice on the
> > size of the blocks to use for the generation context based on the
> > tuple width and number of tuple estimates. My hope there was to
> > improve performance further by making a better choice for how large to
> > make the blocks in the first place.  This reduces the need for Tomas'
> > patch, but does not eliminate the need for it.
> >
> > As of now, I still believe we'll need Tomas' patches to allow the
> > block size to grow up to a maximum size.  I think those patches are
> > likely needed before we think about making tuplesort use generation
> > contexts.  The reason I believe this is that we don't always have good
> > estimates for the number of tuples we're going to sort.  nodeSort.c is
> > fairly easy as it just fetches tuples once and then sorts them.  The
> > use of tuplesort for nodeIncrementalSort.c is much more complex as
> > we'll likely be performing a series of smaller sorts which are much
> > harder to get size estimates for. This means there's likely no magic
> > block size that we can use for the generation context that's used to
> > store the tuples in tuplesort.  This is also the case for order by
> > aggregates and probably most other usages of tuplesort.
>
> You left out the discussion I started about glibc's malloc specific
> behaviour.
>
> I tried to demonstrate that a big part of the performance gain you were
> seeing were not in fact due to our allocator by itself, but by the way
> different block sizes allocations interact with this malloc implementation
> and how it handles releasing memory back to the system. I also tried to
> show that we can give DBAs more control over how much memory to "waste" as
> the price for faster allocations.
>
> I agree that the generation context memory manager is useful in the
> tuplesort context, if only for the fact that we fall back to disk less
> quickly due to the reduced wastage of memory, but I would be wary of
> drawing conclusions too quickly based on a specific malloc implementation
> which shows threshold effects, which are compounded by our growing block
> sizes code in general.
>
> I have on my TODO list to run the same kind of benchmarks using jemalloc, to
> better isolate the performance gains expected from the generation allocator
> itself from the side effect of avoiding glibc's pattern of releasing memory
> back to the kernel too quickly.
>

I've run the same 1-to-32-columns sort benchmark, using both glibc malloc and
jemalloc, with the standard aset memory manager or with David's generation v2
patch.  To use jemalloc, I just use a LD_PRELOAD env variable before starting
postgres.

I have not tried to measure jemalloc's memory footprint like I did for glibc's
malloc in the previous benchmark, as I'm not trying to evaluate jemalloc as a
glibc's malloc replacement.

Please find the results attached. As I suspected, most of the gain observed
with David's patch come from working around glibc's malloc idiosyncracies but
the good news is that it also helps (to a lesser degree) with jemalloc.

I'm not sure how that would behave with growing block sizes though: as Tomas
patch and stress-testing benchmarks showed, we can hit some particular
thresholds (and regressions) in glibc's self-tuning behaviour.

--
Ronan Dunklau
Attachment

pgsql-hackers by date:

Previous
From: Julien Rouhaud
Date:
Subject: Re: [PATCH] Full support for index LP_DEAD hint bits on standby
Next
From: Peter Eisentraut
Date:
Subject: Re: Skipping logical replication transactions on subscriber side