Re: Use generation context to speed up tuplesorts - Mailing list pgsql-hackers

From David Rowley
Subject Re: Use generation context to speed up tuplesorts
Date
Msg-id CAApHDvpnpoGrEcoJT0bQjCR55_zqZgMhd0mmgQ748cyXZoQQ8w@mail.gmail.com
Whole thread Raw
In response to Re: Use generation context to speed up tuplesorts  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: Use generation context to speed up tuplesorts  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On Sun, 13 Feb 2022 at 09:56, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> I'm not against pushing the generation context improvements, e.g.
> something like the patches posted in [1], because those seem reasonable
> in general. But I'm somewhat confused about the last patch (adjusting
> allocChunkLimit) causing fairly significant regressions.

My thoughts here are that we should pursue the patch that allows
growing of the block size in the generation context.  I do think the
investigation of the malloc() behaviour and performance is worth some
pursuit, I just don't think it should be done here or as part of this
patch.  I think it's a fairly large undertaking to ensure any changes
to the malloc options are an overall win and not just a win for
whatever benchmark we test them on. If they're really an overall win,
then shouldn't glibc know about them and maybe adopt them as the
standard options?

To get the ball rolling again on the changes to the generation
context, I took your patches, Tomas, and fixed a few things around the
keeper blocks not working correctly.  I've now made the keeper block
behaviour match how aset.c works.  There the keeper block is allocated
in the same allocation as the context itself. That reduces the number
of malloc() calls when setting up a new memory context.

On testing this, I discovered a problem regarding the use of
generation contexts for storing tuples in tuplesort.c.  The problem is
when the sort is bounded (e.g. SELECT * FROM ... ORDER BY .. LIMIT N),
we'll store and directly throw away any tuples that sort greater than
the existing set of N tuples. This causes a problem because once we
move away from using the keeper block, initially, we'll at some point
end up storing a tuple in a newly allocated generation block then
proceed to pfree() that tuple due to it sorting greater than the
existing N tuples. Because we currently always free() newly empty
blocks, we end up thrashing free()/malloc().  This behaviour results
in one of the regression test queries in tuplesort.sql going from
taking 10 ms to 1 minute, on my machine.

I had a few thoughts about how best to work around this problem.
Firstly, I thought that we should just *not* use the Generation
context when the sort is bounded.  That turns out to be a bit icky as
we only make the sort bounding when we call tuplesort_set_bound(),
which is after we've allocated the memory context for storing tuples.
I thought about maybe just adding another bool flag named
"allowBounding" and use a Generation context if that flag is not set.
I just don't really like that as a solution.  We also cannot make the
set bound part of the tuplesort_begin_heap() call as
nodeIncrementalSort.c does reset the bound. Possibly we could ditch
tuplesort_set_bound() and invent tuplesort_reset_bound() and change
the API so the bound is set when making the Tuplesortstate.

The other way I thought to fix it was by changing the logic for when
generation blocks are freed.  In the problem case mentioned above, the
block being freed is the current block (which was just allocated).  I
made some changes to adjust this behaviour so that we no longer free
the block when the final chunk is pfree()'d. Instead, that now lingers
and can be reused by future allocations, providing they fit inside it.
If they don't fit then we must free the block, otherwise it would just
remain empty forever.   The problem I see with this method is that
there still could be some pathological case that causes us to end up
storing just a single tuple per generation block.

Hitting the worst case here does seem pretty unlikely, so I'm a bit
drawn between if we should just play it safe and stick to the standard
memory allocator for bounded sort.

I've attached 2 patches. The 0001 patch adds the new logic to
generation.c to allow the block to grow and also adds the logic to
skip freeing the current block when it becomes empty.  The 0002 patch
simply makes tuplesort.c use the generation context for tuples
unconditionally.

I've also attached the benchmark results I got from this patch running
the attached sortbenchall script. It's fairly obvious that there are
performance improvements here even when the malloc() usage pattern is
similar to aset.c's

David

Attachment

pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
Next
From: Andres Freund
Date:
Subject: Re: O(n) tasks cause lengthy startups and checkpoints