Re: Sort optimizations: Making in-memory sort cache-aware - Mailing list pgsql-hackers

From Ankit Kumar Pandey
Subject Re: Sort optimizations: Making in-memory sort cache-aware
Date
Msg-id 04a0dc6c-b789-67fe-f3c6-b2a8e8a36509@gmail.com
In response to Re: Sort optimizations: Making in-memory sort cache-aware  (Andres Freund <andres@anarazel.de>)
Responses Re: Sort optimizations: Making in-memory sort cache-aware
List pgsql-hackers
On 12/02/23 01:59, Andres Freund wrote:

> However, tuplesort.c completely breaks that for > 1 column
> sorts. While spatial locality for accesses to the ->memtuples array is decent
> during sorting, due to qsort's subdividing of the problem, the locality for
> access to the tuples is *awful*.

> The memtuples array is reordered while sorting, but the tuples themselves
> aren't. Unless the input data is vaguely presorted, the access pattern for the
> tuples has practically zero locality.

This probably explains the issue.
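
For my own reference, the structure involved is roughly this (paraphrased
from tuplesort.h; comments mine):

    typedef struct SortTuple
    {
        void   *tuple;      /* the tuple proper (e.g. a MinimalTuple) */
        Datum   datum1;     /* value of the first key column, inlined */
        bool    isnull1;    /* is the first key column NULL? */
        int     srctape;    /* source tape number (used when merging) */
    } SortTuple;

So qsort swaps these small structs around, which is cache friendly, but the
tuple pointers still point wherever each tuple was originally allocated. Any
comparison that has to look past the first column chases a pointer with
effectively no locality.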


> One idea is to keep track of the distinctness of the first column sorted and
> to behave differently if it's significantly lower than the number of to be
> sorted tuples.  E.g. by doing a first sort solely on the first column, then
> reorder the MinimalTuples in memory, and then continue normally.

> There's two main problems with that idea:
> 1) It's hard to re-order the tuples in memory, without needing substantial
>   amounts of additional memory
> 2) If the second column also is not very distinct, it doesn't buy you much, if
>   anything.

> But it might provide sufficient benefits regardless. And a naive version,
> requiring additional memory, should be quick to hack up.

I get the second point (reordering the MinimalTuples by sorting on the first
column), but I am not sure how we can keep `track of the distinctness of the first column`.
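
One naive approach I can imagine (purely a sketch; the counter and function
names are made up, this is not existing tuplesort code): count ties on the
first key inside the comparator during the first-column-only pass, and compare
the tie count against the number of tuples afterwards:

    #include "utils/sortsupport.h"

    static int64 firstkey_ties = 0;     /* made-up counter, for illustration */

    /* hypothetical comparator for a first-key-only sort pass */
    static int
    cmp_first_key(const SortTuple *a, const SortTuple *b, SortSupport ssup)
    {
        int     cmp = ApplySortComparator(a->datum1, a->isnull1,
                                          b->datum1, b->isnull1, ssup);

        if (cmp == 0)
            firstkey_ties++;    /* many ties => first column not very distinct */
        return cmp;
    }

If firstkey_ties ends up large relative to the tuple count, the first column
has low distinctness. Is that the kind of tracking you had in mind, or
something cheaper (e.g. sampling datum1 values before sorting)?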


> Most sorting papers don't discuss
> variable-width data, nor a substantial amount of cache-polluting work while
> gathering the data that needs to be sorted.

I definitely agree with this.


> My suggestion would be to collect a roughly ~L3 sized amount of tuples, sort
> just those using the existing code, allocate new memory for all the
> corresponding MinimalTuples in one allocation, and copy the MinimalTuples into
> that, obviously in ->memtuples order.

This should be an easy hack, and we can easily profile the benefits from it.
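
Just to check my understanding, something like this (all names are made up;
the real code path would live in tuplesort.c):

    /*
     * After sorting one ~L3-sized batch of memtuples, copy the
     * MinimalTuples into a single allocation in sorted order, so that
     * later comparisons walk memory sequentially.
     */
    static void
    compact_batch(SortTuple *memtuples, int ntuples, MemoryContext newctx)
    {
        Size    total = 0;
        char   *buf;
        char   *ptr;

        for (int i = 0; i < ntuples; i++)
        {
            MinimalTuple mtup = (MinimalTuple) memtuples[i].tuple;

            total += MAXALIGN(mtup->t_len);
        }

        buf = MemoryContextAlloc(newctx, total);
        ptr = buf;

        for (int i = 0; i < ntuples; i++)
        {
            MinimalTuple mtup = (MinimalTuple) memtuples[i].tuple;

            memcpy(ptr, mtup, mtup->t_len);
            memtuples[i].tuple = ptr;   /* repoint at the compacted copy */
            ptr += MAXALIGN(mtup->t_len);
        }
        /* the old copies go away when the previous context is reset */
    }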

> It's very likely we can do better than just doing a plain sort of everything
> after that.
> You effectively end up with a bounded number of pre-sorted blocks, so the most
> obvious thing to try is to build a heap of those blocks and effectively do a 
> heapsort between the presorted blocks.

This is very interesting. It is actually what a few papers had suggested: do
the sort in blocks and then merge (while sorting) the presorted blocks.
I am a bit fuzzy on the implementation of this (whether it is the same as what
I am thinking), but this is definitely what I was looking for.
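
To make sure we are talking about the same thing, my mental model is the
k-way merge below (a naive O(nblocks) scan per output tuple, just to show the
idea; a real version would presumably use a binary heap, like the existing
tape merge does; all names made up):

    typedef struct BlockCursor
    {
        SortTuple  *tuples;     /* one presorted block */
        int         pos;        /* next unread element */
        int         ntuples;
    } BlockCursor;

    static void
    merge_presorted_blocks(BlockCursor *blocks, int nblocks, SortTuple *out,
                           int (*cmp) (const SortTuple *, const SortTuple *))
    {
        int     outpos = 0;

        for (;;)
        {
            int     best = -1;

            /* pick the block whose head tuple sorts lowest */
            for (int k = 0; k < nblocks; k++)
            {
                if (blocks[k].pos >= blocks[k].ntuples)
                    continue;
                if (best < 0 ||
                    cmp(&blocks[k].tuples[blocks[k].pos],
                        &blocks[best].tuples[blocks[best].pos]) < 0)
                    best = k;
            }
            if (best < 0)
                break;          /* all blocks exhausted */
            out[outpos++] = blocks[best].tuples[blocks[best].pos++];
        }
    }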


> A related, but separate, improvement is to reduce / remove the memory
> allocation overhead. The switch to GenerationContext helped some, but still
> leaves a bunch of overhead. And it's not used for bounded sorts right now.
> We don't palloc/pfree individual tuples during a normal sorts, but we do have
> some, for bounded sorts.  I think with a reasonable amount of work we could
> avoid that for all tuples in ->tuplecontext. And switch to a trivial bump
> allocator, getting rid of all allocator overhead.

> The biggest source of individual pfree()s in the bounded case is that we
> unconditionally copy the tuple into base->tuplecontext during puttuple. Even
> though quite likely we'll immediately free it in the "case TSS_BOUNDED" block.

> We could instead pre-check that the tuple won't immediately be discarded,
> before copying it into tuplecontext. Only in the TSS_BOUNDED, case, of
> course.

This looks doable; I will try this.
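
Roughly like this, I think (pseudo-code against the puttuple path; the
comparison itself is what the TSS_BOUNDED branch already does today, just
moved ahead of the copy):

    /*
     * Before copying the incoming tuple into tuplecontext, compare it
     * against the current heap top (the worst of the retained tuples).
     * If it cannot displace the top, skip the copy (and the later
     * pfree) entirely.
     */
    if (state->status == TSS_BOUNDED &&
        state->memtupcount >= state->bound &&
        COMPARETUP(state, &stup, &state->memtuples[0]) >= 0)
    {
        /* tuple loses to the current worst: discard without copying */
        return;
    }
    /* otherwise copy into tuplecontext and proceed as before */

One wrinkle I see: stup.datum1 would need to be filled in from the
not-yet-copied tuple for COMPARETUP to work there.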

> I think we also can replace the individual freeing of tuples in
> tuplesort_heap_replace_top(), by allowing a number of dead tuples to
> accumulate (up to work_mem/2 maybe), and then copying the still living tuples
> into new memory context, freeing the old one.

> While that doesn't sound cheap, in bounded sorts the number of tuples commonly
> is quite limited, the pre-check before copying the tuple will prevent this
> from occurring too often, the copy will result in higher locality, and, most
> importantly, the reduced palloc() overhead (~25% or so with aset.c) will
> result in a considerably higher cache hit ratio / lower memory usage.

I will try this as well.
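
My reading of the compaction part, as a sketch (the helper name and the
threshold handling are made up; real tuplesort code will differ):

    /*
     * Instead of pfree()ing each tuple evicted from the heap top, let
     * dead tuples accumulate; once they exceed some threshold (say
     * work_mem/2), copy the surviving tuples into a fresh context and
     * drop the old one wholesale.  The copy also packs the survivors
     * together, improving locality.
     */
    static void
    compact_live_tuples(Tuplesortstate *state, MemoryContext newctx)
    {
        for (int i = 0; i < state->memtupcount; i++)
        {
            MinimalTuple old = (MinimalTuple) state->memtuples[i].tuple;
            MinimalTuple copy = MemoryContextAlloc(newctx, old->t_len);

            memcpy(copy, old, old->t_len);
            state->memtuples[i].tuple = copy;
        }
        /* frees all the dead tuples in one go */
        MemoryContextDelete(state->tuplecontext);
        state->tuplecontext = newctx;
    }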

> There are things we could do to improve upon this that don't require swapping
> out our sorting implementation wholesale.

Thanks a lot Andres, these are plenty of pointers to work on (without a major
overhaul of the sort implementation, and with a potentially good amount of
improvement). I will give these a try and see if I can get some performance gains.

Thanks,
Ankit



