Re: Sort optimizations: Making in-memory sort cache-aware - Mailing list pgsql-hackers

From: Ankit Kumar Pandey
Subject: Re: Sort optimizations: Making in-memory sort cache-aware
Msg-id: 04a0dc6c-b789-67fe-f3c6-b2a8e8a36509@gmail.com
In response to: Re: Sort optimizations: Making in-memory sort cache-aware (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
On 12/02/23 01:59, Andres Freund wrote:

> However, tuplesort.c completely breaks that for > 1 column sorts. While
> spatial locality for accesses to the ->memtuples array is decent during
> sorting, due to qsort's subdividing of the problem, the locality for
> access to the tuples is *awful*.
> The memtuples array is reordered while sorting, but the tuples themselves
> aren't. Unless the input data is vaguely presorted, the access pattern for
> the tuples has practically zero locality.

This probably explains the issue.

> One idea is to keep track of the distinctness of the first column sorted
> and to behave differently if it's significantly lower than the number of
> to-be-sorted tuples. E.g. by doing a first sort solely on the first
> column, then reorder the MinimalTuples in memory, and then continue
> normally.
> There's two main problems with that idea:
> 1) It's hard to re-order the tuples in memory, without needing substantial
> amounts of additional memory
> 2) If the second column also is not very distinct, it doesn't buy you
> much, if anything.
> But it might provide sufficient benefits regardless. And a naive version,
> requiring additional memory, should be quick to hack up.

I get the second point (reordering the MinimalTuples by sorting on the
first column), but I am not sure how we can keep track of the distinctness
of the first column.

> Most sorting papers don't discuss variable-width data, nor a substantial
> amount of cache-polluting work while gathering the data that needs to be
> sorted.

I definitely agree with this.

> My suggestion would be to collect a roughly ~L3 sized amount of tuples,
> sort just those using the existing code, allocate new memory for all the
> corresponding MinimalTuples in one allocation, and copy the MinimalTuples
> into that, obviously in ->memtuples order.

This should be an easy hack, and we can easily profile the benefits from it.

> It's very likely we can do better than just doing a plain sort of
> everything after that.
> You effectively end up with a bounded number of pre-sorted blocks, so the
> most obvious thing to try is to build a heap of those blocks and
> effectively do a heapsort between the presorted blocks.

This is very interesting. It is actually what a few papers have suggested:
sort in blocks and then merge (while sorting) the presorted blocks. I am a
bit fuzzy on the implementation of this (whether it is the same as what I am
thinking), but it is definitely what I was looking for.

> A related, but separate, improvement is to reduce / remove the memory
> allocation overhead. The switch to GenerationContext helped some, but
> still leaves a bunch of overhead. And it's not used for bounded sorts
> right now.
> We don't palloc/pfree individual tuples during normal sorts, but we do
> have some, for bounded sorts. I think with a reasonable amount of work we
> could avoid that for all tuples in ->tuplecontext. And switch to a
> trivial bump allocator, getting rid of all allocator overhead.
> The biggest source of individual pfree()s in the bounded case is that we
> unconditionally copy the tuple into base->tuplecontext during puttuple.
> Even though quite likely we'll immediately free it in the "case
> TSS_BOUNDED" block.
> We could instead pre-check that the tuple won't immediately be discarded,
> before copying it into tuplecontext. Only in the TSS_BOUNDED case, of
> course.

This looks doable; I will try this.

> I think we also can replace the individual freeing of tuples in
> tuplesort_heap_replace_top(), by allowing a number of dead tuples to
> accumulate (up to work_mem/2 maybe), and then copying the still living
> tuples into new memory context, freeing the old one.
> While that doesn't sound cheap, in bounded sorts the number of tuples
> commonly is quite limited, the pre-check before copying the tuple will
> prevent this from occurring too often, the copy will result in higher
> locality, and, most importantly, the reduced palloc() overhead (~25% or
> so with aset.c) will result in a considerably higher cache hit ratio /
> lower memory usage.

I will try this as well.

> There are things we could do to improve upon this that don't require
> swapping out our sorting implementation wholesale.

Thanks a lot Andres, these are lots of pointers to work on (without a major
overhaul of the sort implementation, and with a potentially good amount of
improvement). I will give these a try and see if I can get some performance
gains.

Thanks,
Ankit