Re: Sort optimizations: Making in-memory sort cache-aware - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Sort optimizations: Making in-memory sort cache-aware
Date
Msg-id 20230211202959.lkhrxiy7awsyctef@awork3.anarazel.de
In response to Sort optimizations: Making in-memory sort cache-aware  (Ankit Kumar Pandey <itsankitkp@gmail.com>)
Responses Re: Sort optimizations: Making in-memory sort cache-aware
List pgsql-hackers
Hi,

On 2023-02-11 17:49:02 +0530, Ankit Kumar Pandey wrote:
> 2. Frequent cache misses
>
> Issue #1 is being looked in separate patch. I am currently looking at #2.
>
> Possible solution was to batch tuples into groups (which can fit into L3
> cache) before pushing them to sort function.
>
> After looking at different papers on this (multi-Quicksort, memory-tuned
> quicksort, Samplesort and various distributed sorts),  although they look
> promising (especially samplesort), I would like to get more inputs as
> changes look bit too steep and may or may not be in of scope of solving
> actual problem in hand.
>
> Please let me know your opinions, do we really need to re-look at quicksort
> for this use-case or we can perform optimization without major change in
> core sorting algorithm? Are we are open for trying new algorithms for sort?

I think it'll require some experimentation to know what we can and should
do. Clearly we're not going to do anything fundamental if the gains are a few
percent. But it's not hard to imagine that the gains will be substantially
larger.


I believe that a significant part of the reason we have low cache hit ratios
once the input gets larger, is that we kind of break the fundamental benefit
of qsort:

The reason quicksort is a good sorting algorithm, despite plenty of
downsides, is that it has pretty decent locality, due to its divide and
conquer approach. However, tuplesort.c completely breaks that for > 1 column
sorts. While spatial locality for accesses to the ->memtuples array is decent
during sorting, due to qsort's subdividing of the problem, the locality for
accesses to the tuples themselves is *awful*.

The memtuples array is reordered while sorting, but the tuples themselves
aren't. Unless the input data is vaguely presorted, the access pattern for the
tuples has practically zero locality.

The only reason that doesn't completely kill us is that SortTuple contains
datum1 inline and that abbreviated keys reduce the cost of by-reference datums
in the first column.
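To make the indirection concrete, here's a toy, self-contained model of that
layout (the names mirror the SortTuple/tuple split, but this is illustrative
code, not tuplesort's): qsort reorders the small SortTuple entries, while the
out-of-line tuples stay wherever they were allocated, so a post-sort walk
chases pointers in effectively random memory order.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy model of tuplesort's layout: a small SortTuple array that gets
 * reordered, pointing at out-of-line "tuples" that do not move. */
typedef struct Tuple { int key; char payload[56]; } Tuple;
typedef struct SortTuple { int datum1; Tuple *tuple; } SortTuple;

static int
cmp_datum1(const void *a, const void *b)
{
    int ka = ((const SortTuple *) a)->datum1;
    int kb = ((const SortTuple *) b)->datum1;

    return (ka > kb) - (ka < kb);
}

/* Sorts only the SortTuple array; the Tuple storage is left untouched,
 * so accesses through ->tuple afterwards jump around in allocation order. */
void
sort_memtuples(SortTuple *memtuples, size_t n)
{
    qsort(memtuples, n, sizeof(SortTuple), cmp_datum1);
}
```

After sorting, the pointers still target the original, unsorted storage;
only the inline datum1 is read with good locality.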


There are things we could do to improve upon this that don't require swapping
out our sorting implementation wholesale.


One idea is to keep track of the distinctness of the first sorted column and
to behave differently if it's significantly lower than the number of tuples
to be sorted.  E.g. by doing a first sort solely on the first column, then
reordering the MinimalTuples in memory, and then continuing normally.

There are two main problems with that idea:
1) It's hard to re-order the tuples in memory, without needing substantial
   amounts of additional memory
2) If the second column also is not very distinct, it doesn't buy you much, if
   anything.

But it might provide sufficient benefits regardless. And a naive version,
requiring additional memory, should be quick to hack up.
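The naive version could look roughly like this (illustrative C, not
tuplesort code; the freshly allocated block is exactly the "substantial
additional memory" from problem 1):

```c
#include <assert.h>
#include <stdlib.h>

typedef struct Tuple { int key; char payload[56]; } Tuple;
typedef struct SortTuple { int datum1; Tuple *tuple; } SortTuple;

static int
cmp_datum1(const void *a, const void *b)
{
    int ka = ((const SortTuple *) a)->datum1;
    int kb = ((const SortTuple *) b)->datum1;

    return (ka > kb) - (ka < kb);
}

/* Sort on the first column only, then copy the tuples into a freshly
 * allocated block in the new memtuples order, so comparisons on later
 * columns read memory sequentially.  Doubles peak tuple memory; returns
 * the new block (caller frees). */
Tuple *
reorder_tuples(SortTuple *memtuples, size_t n)
{
    Tuple *block;

    qsort(memtuples, n, sizeof(SortTuple), cmp_datum1);

    block = malloc(n * sizeof(Tuple));
    for (size_t i = 0; i < n; i++)
    {
        block[i] = *memtuples[i].tuple; /* copy in sorted order */
        memtuples[i].tuple = &block[i]; /* repoint */
    }
    return block;
}
```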


I have *not* looked at a whole lot of papers on cache-optimized sorts, and
the little I did read was not recent. Partially because I am not sure that
they are that applicable to our scenarios: most sorting papers don't discuss
variable-width data, nor a substantial amount of cache-polluting work while
gathering the data that needs to be sorted.

I think:

> Possible solution was to batch tuples into groups (which can fit into L3
> cache) before pushing them to sort function.

is the most general solution to the issue outlined above. I wouldn't try to
implement this via a full new sorting algorithm though.

My suggestion would be to collect a roughly ~L3 sized amount of tuples, sort
just those using the existing code, allocate new memory for all the
corresponding MinimalTuples in one allocation, and copy the MinimalTuples into
that, obviously in ->memtuples order.

Even if we just use the existing code for the overall sort after that, I'd
expect that to yield noticeable benefits.
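A sketch of that suggestion, with a small constant standing in for "fits in
L3" and plain structs standing in for MinimalTuples (illustrative only):

```c
#include <assert.h>
#include <stdlib.h>

/* BATCH stands in for "roughly L3-sized"; real code would size it from
 * the cache size and tuple width. */
#define BATCH 4

typedef struct Tuple { int key; char payload[56]; } Tuple;
typedef struct SortTuple { int datum1; Tuple *tuple; } SortTuple;

static int
cmp_datum1(const void *a, const void *b)
{
    int ka = ((const SortTuple *) a)->datum1;
    int kb = ((const SortTuple *) b)->datum1;

    return (ka > kb) - (ka < kb);
}

/* Sort each batch with the existing comparator, then place that batch's
 * tuples contiguously, in ->memtuples order, into one allocation ("area").
 * Afterwards every batch is presorted and its tuples are laid out in
 * access order. */
void
sort_in_batches(SortTuple *memtuples, size_t n, Tuple *area)
{
    for (size_t start = 0; start < n; start += BATCH)
    {
        size_t len = (n - start < BATCH) ? n - start : BATCH;

        qsort(memtuples + start, len, sizeof(SortTuple), cmp_datum1);
        for (size_t i = 0; i < len; i++)
        {
            area[start + i] = *memtuples[start + i].tuple;
            memtuples[start + i].tuple = &area[start + i];
        }
    }
}
```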

It's very likely we can do better than just doing a plain sort of everything
after that.

You effectively end up with a bounded number of pre-sorted blocks, so the most
obvious thing to try is to build a heap of those blocks and effectively do a
heapsort between the presorted blocks.
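That merge step could look something like this (again a toy sketch, merging
runs of plain ints rather than tuples):

```c
#include <assert.h>
#include <stddef.h>

/* A presorted block: values, length, and the current read position. */
typedef struct Run { const int *vals; size_t len, pos; } Run;

static int
run_front(const Run *r)
{
    return r->vals[r->pos];
}

static void
sift_down(Run **heap, size_t nheap, size_t i)
{
    for (;;)
    {
        size_t smallest = i, l = 2 * i + 1, r = 2 * i + 2;

        if (l < nheap && run_front(heap[l]) < run_front(heap[smallest]))
            smallest = l;
        if (r < nheap && run_front(heap[r]) < run_front(heap[smallest]))
            smallest = r;
        if (smallest == i)
            return;
        Run *tmp = heap[i]; heap[i] = heap[smallest]; heap[smallest] = tmp;
        i = smallest;
    }
}

/* Merge up to 8 presorted runs into out[] via a binary min-heap keyed on
 * each run's front element -- "heapsort between the presorted blocks". */
void
merge_runs(Run *runs, size_t nruns, int *out)
{
    Run *heap[8];
    size_t nheap = 0, n = 0;

    for (size_t i = 0; i < nruns; i++)
        if (runs[i].pos < runs[i].len)
            heap[nheap++] = &runs[i];
    for (size_t i = nheap; i-- > 0;)
        sift_down(heap, nheap, i);

    while (nheap > 0)
    {
        Run *top = heap[0];

        out[n++] = run_front(top);
        if (++top->pos == top->len)
            heap[0] = heap[--nheap]; /* run exhausted, drop it */
        sift_down(heap, nheap, 0);
    }
}
```

Only the heap of run fronts is touched per step, so the working set stays
small even when the total input is much larger than cache.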



A related, but separate, improvement is to reduce / remove the memory
allocation overhead. The switch to GenerationContext helped some, but still
leaves a bunch of overhead. And it's not used for bounded sorts right now.

We don't palloc/pfree individual tuples during a normal sort, but we do have
some, for bounded sorts.  I think with a reasonable amount of work we could
avoid that for all tuples in ->tuplecontext, and switch to a trivial bump
allocator, getting rid of all allocator overhead.
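A bump allocator really is trivial; something along these lines
(illustrative only, not an existing PostgreSQL API):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Trivial bump allocator: one block, a cursor, no per-chunk header and no
 * individual pfree() -- reset reclaims everything at once. */
typedef struct BumpContext
{
    char   *base;
    size_t  size;
    size_t  used;
} BumpContext;

void
bump_init(BumpContext *cx, size_t size)
{
    cx->base = malloc(size);
    cx->size = size;
    cx->used = 0;
}

void *
bump_alloc(BumpContext *cx, size_t len)
{
    size_t  aligned = (len + 7) & ~(size_t) 7; /* MAXALIGN-style rounding */
    void   *p;

    if (cx->used + aligned > cx->size)
        return NULL;            /* real code would allocate another block */
    p = cx->base + cx->used;
    cx->used += aligned;
    return p;
}

void
bump_reset(BumpContext *cx)
{
    cx->used = 0;
}
```

Compared to aset.c there is no chunk header, no freelist maintenance, and no
rounding to power-of-two sizes, so both the per-allocation CPU cost and the
per-tuple memory overhead drop.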

The biggest source of individual pfree()s in the bounded case is that we
unconditionally copy the tuple into base->tuplecontext during puttuple. Even
though quite likely we'll immediately free it in the "case TSS_BOUNDED" block.

We could instead pre-check that the tuple won't immediately be discarded
before copying it into tuplecontext. Only in the TSS_BOUNDED case, of
course.
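Sketched with a plain max-heap of keys standing in for the bounded sort's
heap (illustrative names, not tuplesort's actual functions):

```c
#include <assert.h>
#include <stddef.h>

typedef struct BoundedSort
{
    int     keys[8];    /* max-heap of the N smallest keys seen so far */
    size_t  n;
    size_t  bound;
    size_t  copies;     /* tuple copies actually made into the context */
} BoundedSort;

static void
swap_int(int *a, int *b)
{
    int t = *a; *a = *b; *b = t;
}

static void
sift_up(BoundedSort *bs, size_t i)
{
    while (i > 0 && bs->keys[i] > bs->keys[(i - 1) / 2])
    {
        swap_int(&bs->keys[i], &bs->keys[(i - 1) / 2]);
        i = (i - 1) / 2;
    }
}

static void
sift_down(BoundedSort *bs, size_t i)
{
    for (;;)
    {
        size_t big = i, l = 2 * i + 1, r = 2 * i + 2;

        if (l < bs->n && bs->keys[l] > bs->keys[big]) big = l;
        if (r < bs->n && bs->keys[r] > bs->keys[big]) big = r;
        if (big == i) return;
        swap_int(&bs->keys[i], &bs->keys[big]);
        i = big;
    }
}

void
puttuple(BoundedSort *bs, int key)
{
    if (bs->n == bs->bound && key >= bs->keys[0])
        return;                 /* pre-check: discard before ever copying */
    bs->copies++;               /* only now would we copy the tuple */
    if (bs->n < bs->bound)
    {
        bs->keys[bs->n++] = key;
        sift_up(bs, bs->n - 1);
    }
    else
    {
        bs->keys[0] = key;      /* replace the old bound tuple... */
        sift_down(bs, 0);       /* ...the one that is pfree()d today */
    }
}
```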

I think we also can replace the individual freeing of tuples in
tuplesort_heap_replace_top() by allowing a number of dead tuples to
accumulate (up to work_mem/2, maybe), and then copying the still-living
tuples into a new memory context, freeing the old one.
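Roughly like this (toy sketch; "threshold" stands in for the work_mem/2
budget, and marking a tuple dead replaces the individual pfree()):

```c
#include <assert.h>
#include <stdlib.h>

typedef struct Tuple { int key; int live; } Tuple;

typedef struct Arena
{
    Tuple  *tuples;
    size_t  n;
    size_t  dead;
    size_t  threshold;
} Arena;

/* Drop a tuple without freeing it: mark it dead, and once enough garbage
 * has accumulated, copy the survivors into a fresh allocation and free
 * the old one wholesale. */
void
arena_kill(Arena *a, size_t i)
{
    a->tuples[i].live = 0;
    if (++a->dead >= a->threshold)
    {
        Tuple  *fresh = malloc(a->n * sizeof(Tuple));
        size_t  m = 0;

        for (size_t j = 0; j < a->n; j++)
            if (a->tuples[j].live)
                fresh[m++] = a->tuples[j];
        free(a->tuples);
        a->tuples = fresh;
        a->n = m;
        a->dead = 0;
    }
}
```

The copy also densifies the survivors, which is where the locality benefit
mentioned below comes from.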

While that doesn't sound cheap: in bounded sorts the number of tuples is
commonly quite limited, the pre-check before copying the tuple will prevent
this from occurring too often, the copy will result in higher locality, and,
most importantly, the reduced palloc() overhead (~25% or so with aset.c)
will result in a considerably higher cache hit ratio / lower memory usage.

Greetings,

Andres Freund


