Re: Use generation context to speed up tuplesorts - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Use generation context to speed up tuplesorts
Date 2021-07-30 20:38:53
Msg-id 20210730203853.utjj43f6zzn5e2hy@alap3.anarazel.de
In response to Use generation context to speed up tuplesorts  (David Rowley <dgrowleyml@gmail.com>)
Responses Re: Use generation context to speed up tuplesorts  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Re: Use generation context to speed up tuplesorts  (David Rowley <dgrowleyml@gmail.com>)
List pgsql-hackers
Hi,

On 2021-07-30 18:42:18 +1200, David Rowley wrote:
> While reviewing and benchmarking 91e9e89dc (Make nodeSort.c use Datum
> sorts for single column sorts), I noticed that we use a separate
> memory context to store tuple data, and we just reset it when we're
> done if the sort fits in memory; if it does not, we dump the tuples
> to disk in the order they were added and then reset the context.
> There is a little pfree() work going on via writetup_heap(), which I
> think could possibly just be removed to get some additional gains.
> 
> Anyway, this context that stores tuples uses the standard aset.c
> allocator which has the usual power of 2 wastage and additional
> overheads of freelists etc.  I wondered how much faster it would go if
> I set it to use a generation context instead of an aset.c one.
> 
> After running make installcheck to create the tenk1 table, and then
> running the attached tuplesortbench script, I get this:

> So it seems to save quite a bit of memory by getting away from
> rounding up allocations to the next power of 2.
> 
> Performance-wise, there are some pretty good gains. (Results in TPS)

Very nice!


I wonder if there are cases where generation.c would regress
performance relative to aset.c due to not having an initial / "keeper"
block?
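
To illustrate the concern, here's a sketch of the reset-then-allocate
pattern where the keeper block helps; this is a hypothetical
micro-benchmark, not code from the patch:

    MemoryContext cxt = AllocSetContextCreate(CurrentMemoryContext,
                                              "keeper demo",
                                              ALLOCSET_DEFAULT_SIZES);

    for (int i = 0; i < 1000000; i++)
    {
        /* aset.c: the keeper block survives the reset, so the
         * allocation below needs no malloc */
        MemoryContextReset(cxt);
        (void) MemoryContextAlloc(cxt, 64);
    }

    /* with GenerationContextCreate() instead, each reset frees all
     * blocks, so every iteration pays for a fresh malloc */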


> The patch is just a simple 1-liner at the moment.  I likely do need to
> adjust what I'm passing as the blockSize to GenerationContextCreate().
> Maybe a better number would be something that's calculated from
> work_mem, e.g. Min(ALLOCSET_DEFAULT_MAXSIZE, ((Size) work_mem) * 64)
> so that we just allocate at most a 16th of work_mem per chunk, but not
> bigger than 8MB.  I don't think changing this will affect the
> performance of the above very much.
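
For readers following along, I assume the 1-liner is essentially this,
somewhere around tuplesort_begin_common() (a sketch based on the
description above, not the actual patch):

    /* before: power-of-2 rounding, freelist overhead, etc. */
    state->tuplecontext = AllocSetContextCreate(state->sortcontext,
                                                "Caller tuples",
                                                ALLOCSET_DEFAULT_SIZES);

    /* after: densely packed generation context */
    state->tuplecontext = GenerationContextCreate(state->sortcontext,
                                                  "Caller tuples",
                                                  ALLOCSET_DEFAULT_MAXSIZE);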

I think it's bad that both generation and slab don't have internal
handling of block sizes. Needing to err on the side of too-big blocks
to handle large amounts of memory well, just so the contexts don't
need to deal with variably sized blocks, isn't a sensible tradeoff.

I don't think it's acceptable to use ALLOCSET_DEFAULT_MAXSIZE or
Min(ALLOCSET_DEFAULT_MAXSIZE, ((Size) work_mem) * 64) for
tuplesort.c. There are plenty of cases where we'll just sort a handful
of tuples, and increasing the memory usage of those by a factor of
1024 isn't good. The Min() won't do any good if a larger work_mem is
used.
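
To spell out the arithmetic (work_mem is measured in kB, so
((Size) work_mem) * 64 bytes is a 16th of work_mem):

    work_mem =    4MB: Min(8MB, 4096 * 64 bytes)    = 256kB first block
    work_mem = 1024MB: Min(8MB, 1048576 * 64 bytes) = 8MB first block

i.e. with a large work_mem, even a sort of ten tuples would allocate
an 8MB block up front.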

Nor will it be good to use thousands of small allocations for a large
in-memory tuplesort just because we're concerned about the initial
allocation size: both because of the per-block allocation overhead,
and, importantly, because that will make context resets more
expensive.

To me this says that we should transplant aset.c's block-size growth
behavior into generation.c.
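
Roughly what I mean, as a hypothetical sketch (the nextBlockSize /
maxBlockSize fields are made up here; generation.c currently has only
a single fixed blockSize):

    /* Double the block size for each new block, capped at
     * maxBlockSize, the way AllocSetAlloc() does it. */
    static Size
    GenerationNextBlockSize(GenerationContext *set)
    {
        Size        blksize = set->nextBlockSize;

        set->nextBlockSize <<= 1;
        if (set->nextBlockSize > set->maxBlockSize)
            set->nextBlockSize = set->maxBlockSize;

        return blksize;
    }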


There is at least one path using tuplecontext that reaches code that
could end up freeing enough memory for generation.c's inability to
reuse that memory to matter:
tuplesort_putdatum()->datumCopy()->EOH_flatten_into()
On a quick look I didn't find any expanded-record user that frees
nontrivial amounts of memory, but I didn't look all that carefully.
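
For context, generation.c only reclaims space at whole-block
granularity. A simplified standalone model of its free semantics (not
the real code):

    #include <stdlib.h>

    typedef struct GenBlock
    {
        int     nchunks;        /* chunks handed out from this block */
        int     nfree;          /* chunks freed so far */
        char   *mem;            /* the block's storage */
    } GenBlock;

    /* pfree() of one chunk: the space is never reused directly; the
     * block is released only once every chunk on it has been freed */
    static void
    gen_free_chunk(GenBlock *block)
    {
        block->nfree++;
        if (block->nfree == block->nchunks)
        {
            free(block->mem);
            free(block);
        }
    }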


Greetings,

Andres Freund


