
Sort optimizations: Making in-memory sort cache-aware

From
Ankit Kumar Pandey
Date:
Hi all,

While working on sort optimizations for window functions, we observed that the
performance of a sort where all tuples are in memory degrades badly when the
number of tuples is very large [1].

E.g.: work_mem = 4 GB, sorting on 4 int columns of a table with 10 million
tuples.


The issues we saw were as follows:

1. The comparetup function re-compares the first key in case of a tie-break.

2. Frequent cache misses.

Issue #1 is being addressed in a separate patch. I am currently looking at #2.

A possible solution is to batch tuples into groups (each small enough to fit
into the L3 cache) before pushing them to the sort function.

I have looked at several papers on this (multi-quicksort, memory-tuned
quicksort, samplesort and various distributed sorts). Although they look
promising (especially samplesort), I would like to get more input, as the
changes look rather steep and may or may not be in scope for the actual
problem at hand.


Please let me know your opinions: do we really need to rethink quicksort for
this use case, or can we optimize without major changes to the core sorting
algorithm? Are we open to trying new algorithms for sort?

Any suggestions to narrow down the search space for this problem are welcome.


[1] 
https://www.postgresql.org/message-id/CAApHDvqh+qOHk4sbvvy=Qr2NjPqAAVYf82oXY0g=Z2hRpC2Vmg@mail.gmail.com


Thanks,

Ankit




Re: Sort optimizations: Making in-memory sort cache-aware

From
Andres Freund
Date:
Hi,

On 2023-02-11 17:49:02 +0530, Ankit Kumar Pandey wrote:
> 2. Frequent cache misses
>
> Issue #1 is being looked in separate patch. I am currently looking at #2.
>
> A possible solution is to batch tuples into groups (each small enough to fit
> into the L3 cache) before pushing them to the sort function.
>
> I have looked at several papers on this (multi-quicksort, memory-tuned
> quicksort, samplesort and various distributed sorts). Although they look
> promising (especially samplesort), I would like to get more input, as the
> changes look rather steep and may or may not be in scope for the actual
> problem at hand.
>
> Please let me know your opinions: do we really need to rethink quicksort for
> this use case, or can we optimize without major changes to the core sorting
> algorithm? Are we open to trying new algorithms for sort?

I think it'll require some experimentation to know what we can and should
do. Clearly we're not going to do anything fundamental if the gains are a few
percent. But it's not hard to imagine that the gains will be substantially
larger.


I believe that a significant part of the reason we have low cache hit ratios
once the input gets larger, is that we kind of break the fundamental benefit
of qsort:

The reason quicksort is a good sorting algorithm, despite plenty of downsides, is
that it has pretty decent locality, due to its divide and conquer
approach. However, tuplesort.c completely breaks that for > 1 column
sorts. While spatial locality for accesses to the ->memtuples array is decent
during sorting, due to qsort's subdividing of the problem, the locality for
access to the tuples is *awful*.

The memtuples array is reordered while sorting, but the tuples themselves
aren't. Unless the input data is vaguely presorted, the access pattern for the
tuples has practically zero locality.

The only reason that doesn't completely kill us is that SortTuple contains
datum1 inline and that abbreviated keys reduce the cost of by-reference datums
in the first column.
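The indirection can be sketched outside of PostgreSQL. The struct below is a
simplified stand-in for tuplesort's SortTuple, not the real definition: the
comparator stays cheap while it only needs the inline first key, but every
tie-break chases a pointer into memory that qsort's shuffling has effectively
randomized.

```c
#include <stdlib.h>

/* Simplified stand-in for tuplesort's SortTuple: the first sort key is
 * kept inline, the full tuple lives out of line. */
typedef struct Tuple { int col1; int col2; } Tuple;

typedef struct SortEntry
{
	int		datum1;	/* first key inline: comparable without a pointer chase */
	Tuple  *tuple;	/* out-of-line tuple: only touched on a tie-break */
} SortEntry;

static int
cmp_entries(const void *a, const void *b)
{
	const SortEntry *ea = a, *eb = b;

	/* cheap path: inline data only, good locality */
	if (ea->datum1 != eb->datum1)
		return (ea->datum1 > eb->datum1) - (ea->datum1 < eb->datum1);

	/* tie-break: dereference the tuple; once qsort has shuffled the
	 * entry array, these accesses have essentially no locality */
	return (ea->tuple->col2 > eb->tuple->col2) -
		   (ea->tuple->col2 < eb->tuple->col2);
}
```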


There are things we could do to improve upon this that don't require swapping
out our sorting implementation wholesale.


One idea is to keep track of the distinctness of the first column sorted and
to behave differently if it's significantly lower than the number of tuples to
be sorted.  E.g. by doing a first sort solely on the first column, then
reorder the MinimalTuples in memory, and then continue normally.

There's two main problems with that idea:
1) It's hard to re-order the tuples in memory, without needing substantial
   amounts of additional memory
2) If the second column also is not very distinct, it doesn't buy you much, if
   anything.

But it might provide sufficient benefits regardless. And a naive version,
requiring additional memory, should be quick to hack up.
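A naive version of that reorder (hypothetical names, plain malloc; the real
code would operate on SortTuple/MinimalTuple inside a memory context) might
look roughly like this. It pays the extra-memory cost of problem 1 by
compacting the tuples into one fresh allocation in first-key order:

```c
#include <stdlib.h>

typedef struct Tuple { int col1; int col2; } Tuple;
typedef struct SortEntry { int datum1; Tuple *tuple; } SortEntry;

static int
cmp_first_key(const void *a, const void *b)
{
	int		x = ((const SortEntry *) a)->datum1;
	int		y = ((const SortEntry *) b)->datum1;

	return (x > y) - (x < y);
}

/* Sort on the first key only, then copy the tuples into one fresh
 * allocation in sorted order and repoint the entries at the copies.
 * The fresh buffer is the "substantial additional memory" caveat.
 * The subsequent full sort then reads the tuples sequentially.
 * Returns the new buffer (caller frees), or NULL on failure. */
static Tuple *
reorder_tuples(SortEntry *entries, size_t n)
{
	Tuple  *fresh = malloc(n * sizeof(Tuple));

	if (fresh == NULL)
		return NULL;
	qsort(entries, n, sizeof(SortEntry), cmp_first_key);
	for (size_t i = 0; i < n; i++)
	{
		fresh[i] = *entries[i].tuple;	/* copy in sorted order */
		entries[i].tuple = &fresh[i];
	}
	return fresh;
}
```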


I have *not* looked at a whole lot of papers of cache optimized sorts, and the
little I did was not recently. Partially because I am not sure that they are
that applicable to our scenarios: Most sorting papers don't discuss
variable-width data, nor a substantial amount of cache-polluting work while
gathering the data that needs to be sorted.

I think:

> Possible solution was to batch tuples into groups (which can fit into L3
> cache) before pushing them to sort function.

is the most general solution to the issue outlined above. I wouldn't try to
implement this via a full new sorting algorithm though.

My suggestion would be to collect a roughly ~L3 sized amount of tuples, sort
just those using the existing code, allocate new memory for all the
corresponding MinimalTuples in one allocation, and copy the MinimalTuples into
that, obviously in ->memtuples order.
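Sketched in miniature (illustrative names, plain malloc instead of a memory
context, ints standing in for MinimalTuples), that per-batch sort-and-compact
step might look like:

```c
#include <stdlib.h>

typedef struct Tuple { int key; int payload; } Tuple;
typedef struct SortEntry { int datum1; Tuple *tuple; } SortEntry;

static int
cmp_key(const void *a, const void *b)
{
	int		x = ((const SortEntry *) a)->datum1;
	int		y = ((const SortEntry *) b)->datum1;

	return (x > y) - (x < y);
}

/* For each batch of batch_size entries: sort just that batch, then copy
 * its tuples into one per-batch allocation, in sorted order.  batch_size
 * would be chosen so a batch fits in L3.  The allocations are returned
 * via blocks[] (one per batch) for the caller to free. */
static size_t
sort_batches(SortEntry *entries, size_t n, size_t batch_size, Tuple **blocks)
{
	size_t	nblocks = 0;

	for (size_t start = 0; start < n; start += batch_size)
	{
		size_t	len = (n - start < batch_size) ? n - start : batch_size;
		Tuple  *block = malloc(len * sizeof(Tuple));

		if (block == NULL)
			return nblocks;
		qsort(entries + start, len, sizeof(SortEntry), cmp_key);
		for (size_t i = 0; i < len; i++)
		{
			block[i] = *entries[start + i].tuple;	/* compact copy */
			entries[start + i].tuple = &block[i];
		}
		blocks[nblocks++] = block;
	}
	return nblocks;
}
```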

Even if we just use the existing code for the overall sort after that, I'd
expect that to yield noticeable benefits.

It's very likely we can do better than just doing a plain sort of everything
after that.

You effectively end up with a bounded number of pre-sorted blocks, so the most
obvious thing to try is to build a heap of those blocks and effectively do a
heapsort between the presorted blocks.
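For illustration, here is a standalone k-way merge over presorted blocks of
ints, using a binary min-heap of per-block cursors. The real thing would reuse
tuplesort's existing heap routines and comparators; this sketch only shows the
shape of the merge.

```c
#include <stddef.h>
#include <stdlib.h>

/* One cursor per presorted block: current position and end. */
typedef struct Cursor { const int *pos, *end; } Cursor;

static void
sift_down(Cursor *heap, size_t nheap, size_t i)
{
	for (;;)
	{
		size_t	l = 2 * i + 1, r = l + 1, min = i;

		if (l < nheap && *heap[l].pos < *heap[min].pos) min = l;
		if (r < nheap && *heap[r].pos < *heap[min].pos) min = r;
		if (min == i)
			break;
		Cursor	tmp = heap[i]; heap[i] = heap[min]; heap[min] = tmp;
		i = min;
	}
}

/* blocks[i] holds lens[i] sorted ints; writes the merged result to out.
 * Repeatedly emit the smallest block head, advance that cursor, and
 * restore the heap property. */
static void
merge_blocks(const int **blocks, const size_t *lens, size_t nblocks, int *out)
{
	Cursor *heap = malloc(nblocks * sizeof(Cursor));
	size_t	nheap = 0;

	for (size_t i = 0; i < nblocks; i++)
		if (lens[i] > 0)
			heap[nheap++] = (Cursor) { blocks[i], blocks[i] + lens[i] };
	for (size_t i = nheap; i-- > 0;)	/* heapify */
		sift_down(heap, nheap, i);
	while (nheap > 0)
	{
		*out++ = *heap[0].pos++;		/* emit smallest head */
		if (heap[0].pos == heap[0].end)	/* block exhausted */
			heap[0] = heap[--nheap];
		sift_down(heap, nheap, 0);
	}
	free(heap);
}
```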



A related, but separate, improvement is to reduce / remove the memory
allocation overhead. The switch to GenerationContext helped some, but still
leaves a bunch of overhead. And it's not used for bounded sorts right now.

We don't palloc/pfree individual tuples during normal sorts, but we do have
some, for bounded sorts.  I think with a reasonable amount of work we could
avoid that for all tuples in ->tuplecontext. And switch to a trivial bump
allocator, getting rid of all allocator overhead.
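A trivial bump allocator is just a forward-moving cursor over one large block:
no per-chunk header, no individual pfree, and the whole arena is reset or
freed at once. A minimal sketch (not aset.c/generation.c code; plain malloc
stands in for the context machinery):

```c
#include <stdlib.h>

typedef struct BumpContext
{
	char   *base;
	size_t	used;
	size_t	capacity;
} BumpContext;

static int
bump_init(BumpContext *cxt, size_t capacity)
{
	cxt->base = malloc(capacity);
	cxt->used = 0;
	cxt->capacity = capacity;
	return cxt->base != NULL;
}

/* Allocate by advancing the cursor; no chunk header, so there is no way
 * to free an individual allocation -- that's the whole point. */
static void *
bump_alloc(BumpContext *cxt, size_t size)
{
	void   *p;

	size = (size + 7) & ~(size_t) 7;	/* keep 8-byte alignment */
	if (cxt->capacity - cxt->used < size)
		return NULL;					/* real code would grow/chain */
	p = cxt->base + cxt->used;
	cxt->used += size;
	return p;
}

static void bump_reset(BumpContext *cxt)   { cxt->used = 0; }
static void bump_destroy(BumpContext *cxt) { free(cxt->base); }
```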

The biggest source of individual pfree()s in the bounded case is that we
unconditionally copy the tuple into base->tuplecontext during puttuple. Even
though quite likely we'll immediately free it in the "case TSS_BOUNDED" block.

We could instead pre-check that the tuple won't immediately be discarded,
before copying it into tuplecontext. Only in the TSS_BOUNDED case, of
course.
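The pre-check amounts to comparing the incoming tuple's key against the
current "worst" kept tuple before doing any copying. A sketch with
illustrative names (not tuplesort's): the bounded heap keeps the worst kept
tuple at the top, which is what makes an O(1) check possible.

```c
#include <stdbool.h>
#include <stddef.h>

/* Sketch of the bounded (LIMIT N) case: the N best tuples kept so far,
 * in a heap ordered so that the "worst" kept key sits at heap_keys[0]
 * (for an ascending sort, the largest kept key). */
typedef struct BoundedState
{
	int	   *heap_keys;	/* keys of kept tuples; heap_keys[0] is worst */
	size_t	nkept;
	size_t	bound;		/* N */
} BoundedState;

/* Decide whether an incoming tuple is worth copying at all.  If it
 * cannot displace the current worst, skip the copy-then-free dance. */
static bool
tuple_would_be_kept(const BoundedState *state, int key)
{
	if (state->nkept < state->bound)
		return true;				/* heap not full: always keep */
	return key < state->heap_keys[0];
}
```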

I think we also can replace the individual freeing of tuples in
tuplesort_heap_replace_top(), by allowing a number of dead tuples to
accumulate (up to work_mem/2 maybe), and then copying the still-living tuples
into a new memory context, freeing the old one.

While that doesn't sound cheap, in bounded sorts the number of tuples commonly
is quite limited, the pre-check before copying the tuple will prevent this
from occurring too often, the copy will result in higher locality, and, most
importantly, the reduced palloc() overhead (~25% or so with aset.c) will
result in a considerably higher cache hit ratio / lower memory usage.
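The compaction step could look like this sketch (plain malloc/free standing in
for memory contexts; names are illustrative): copy the live tuples, repoint
their slots, and drop the old arena together with all its dead tuples in a
single free.

```c
#include <stdlib.h>

/* A slot points at a live tuple somewhere in the old arena; dead tuples
 * are simply no longer referenced by any slot. */
typedef struct Slot { int *tuple; } Slot;

/* Copy the nlive still-referenced tuples into a fresh buffer in slot
 * order (a locality win as a side effect), repoint the slots, and free
 * the old arena -- reclaiming every dead tuple in it at once, with no
 * per-tuple pfree.  Returns the new arena; keeps the old one on OOM. */
static int *
compact_live_tuples(Slot *slots, size_t nlive, int *old_arena)
{
	int	   *fresh = malloc(nlive * sizeof(int));

	if (fresh == NULL)
		return old_arena;
	for (size_t i = 0; i < nlive; i++)
	{
		fresh[i] = *slots[i].tuple;
		slots[i].tuple = &fresh[i];
	}
	free(old_arena);
	return fresh;
}
```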

Greetings,

Andres Freund



Re: Sort optimizations: Making in-memory sort cache-aware

From
Ankit Kumar Pandey
Date:
> On 12/02/23 01:59, Andres Freund wrote:

> However, tuplesort.c completely breaks that for > 1 column
> sorts. While spatial locality for accesses to the ->memtuples array is decent
> during sorting, due to qsort's subdividing of the problem, the locality for
> access to the tuples is *awful*.

> The memtuples array is reordered while sorting, but the tuples themselves
> aren't. Unless the input data is vaguely presorted, the access pattern for the
> tuples has practically zero locality.

This probably explains the issue.


> One idea is to keep track of the distinctness of the first column sorted and
> to behave differently if it's significantly lower than the number of tuples
> to be sorted.  E.g. by doing a first sort solely on the first column, then
> reorder the MinimalTuples in memory, and then continue normally.

> There's two main problems with that idea:
> 1) It's hard to re-order the tuples in memory, without needing substantial
>   amounts of additional memory
> 2) If the second column also is not very distinct, it doesn't buy you much, if
>   anything.

> But it might provide sufficient benefits regardless. And a naive version,
> requiring additional memory, should be quick to hack up.

I get the second point (reordering MinimalTuples by sorting on the first
column), but how can we keep `track of the distinctness of the first column`?


> Most sorting papers don't discuss
> variable-width data, nor a substantial amount of cache-polluting work while
> gathering the data that needs to be sorted.

I definitely agree with this.


> My suggestion would be to collect a roughly ~L3 sized amount of tuples, sort
> just those using the existing code, allocate new memory for all the
> corresponding MinimalTuples in one allocation, and copy the MinimalTuples into
> that, obviously in ->memtuples order.

This should be an easy hack, and we can easily profile the benefits from it.

> It's very likely we can do better than just doing a plain sort of everything
> after that.
> You effectively end up with a bounded number of pre-sorted blocks, so the most
> obvious thing to try is to build a heap of those blocks and effectively do a 
> heapsort between the presorted blocks.

This is very interesting. It is actually what a few papers suggested: sort in
blocks and then merge (while sorting) the presorted blocks. I am a bit fuzzy
on the implementation (if it is the same as what I am thinking), but this is
definitely what I was looking for.


> A related, but separate, improvement is to reduce / remove the memory
> allocation overhead. The switch to GenerationContext helped some, but still
> leaves a bunch of overhead. And it's not used for bounded sorts right now.
> We don't palloc/pfree individual tuples during normal sorts, but we do have
> some, for bounded sorts.  I think with a reasonable amount of work we could
> avoid that for all tuples in ->tuplecontext. And switch to a trivial bump
> allocator, getting rid of all allocator overhead.

> The biggest source of individual pfree()s in the bounded case is that we
> unconditionally copy the tuple into base->tuplecontext during puttuple. Even
> though quite likely we'll immediately free it in the "case TSS_BOUNDED" block.

> We could instead pre-check that the tuple won't immediately be discarded,
> before copying it into tuplecontext. Only in the TSS_BOUNDED case, of
> course.

This looks doable; I will try it.

> I think we also can replace the individual freeing of tuples in
> tuplesort_heap_replace_top(), by allowing a number of dead tuples to
> accumulate (up to work_mem/2 maybe), and then copying the still living tuples
> into new memory context, freeing the old one.

> While that doesn't sound cheap, in bounded sorts the number of tuples commonly
> is quite limited, the pre-check before copying the tuple will prevent this
> from occurring too often, the copy will result in higher locality, and, most
> importantly, the reduced palloc() overhead (~25% or so with aset.c) will
> result in a considerably higher cache hit ratio / lower memory usage.

I will try this as well.

> There are things we could do to improve upon this that don't require swapping
> out our sorting implementation wholesale.

Thanks a lot Andres, these are a lot of pointers to work on (without a major
overhaul of the sort implementation, and with a potentially good amount of
improvement). I will give these a try and see if I can get some performance
gains.

Thanks,
Ankit




Re: Sort optimizations: Making in-memory sort cache-aware

From
Ankit Kumar Pandey
Date:
Hi Andres,


I took a stab at a naive version of this but have been stuck for some time now.

I added logic to sort on the first column in a first pass, then realloc all
tuples and do the full sort in a second pass, but I am not seeing any benefit
(it is actually regressing).

I tried doing the above both in bulk and in chunks of data.

> You effectively end up with a bounded number of pre-sorted blocks, so the most
> obvious thing to try is to build a heap of those blocks and effectively do a
> heapsort between the presorted blocks.

I am not very clear on the implementation of this. How can we do a heapsort
between the presorted blocks? Do we keep changing state->bound = i, i+n, i+2n
and so on, and keep calling make_bounded_heap/sort_bounded_heap?

> A related, but separate, improvement is to reduce / remove the memory
> allocation overhead.

This is still pending from my side.

I have attached some benchmarking results, along with the script and POC
patches (which include a GUC switch to enable the optimization, for ease of
testing).

Tested with work_mem = 3 GB on 1 million and 10 million rows.

Please let me know what I can fix so I can re-attempt this.


Thanks,

Ankit



Attachment