Re: Sort optimizations: Making in-memory sort cache-aware - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: Sort optimizations: Making in-memory sort cache-aware
Msg-id: 20230211202959.lkhrxiy7awsyctef@awork3.anarazel.de
In response to: Sort optimizations: Making in-memory sort cache-aware (Ankit Kumar Pandey <itsankitkp@gmail.com>)
Responses: Re: Sort optimizations: Making in-memory sort cache-aware
List: pgsql-hackers
Hi,

On 2023-02-11 17:49:02 +0530, Ankit Kumar Pandey wrote:
> 2. Frequent cache misses
>
> Issue #1 is being looked in separate patch. I am currently looking at #2.
>
> Possible solution was to batch tuples into groups (which can fit into L3
> cache) before pushing them to sort function.
>
> After looking at different papers on this (multi-Quicksort, memory-tuned
> quicksort, Samplesort and various distributed sorts), although they look
> promising (especially samplesort), I would like to get more inputs as
> changes look bit too steep and may or may not be in of scope of solving
> actual problem in hand.
>
> Please let me know your opinions, do we really need to re-look at quicksort
> for this use-case or we can perform optimization without major change in
> core sorting algorithm? Are we are open for trying new algorithms for sort?

I think it'll require some experimentation to know what we can and should do. Clearly we're not going to do anything fundamental if the gains are only a few percent. But it's not hard to imagine that the gains will be substantially larger.

I believe a significant part of the reason we have low cache hit ratios once the input gets larger is that we break the fundamental benefit of qsort: the reason quicksort is a good sorting algorithm, despite plenty of downsides, is that it has pretty decent locality, due to its divide-and-conquer approach.

However, tuplesort.c completely breaks that for sorts on more than one column. While spatial locality for accesses to the ->memtuples array is decent during sorting, due to qsort's subdividing of the problem, the locality for accesses to the tuples themselves is *awful*. The memtuples array is reordered while sorting, but the tuples themselves aren't. Unless the input data is vaguely presorted, the access pattern for the tuples has practically zero locality. The only reason that doesn't completely kill us is that SortTuple contains datum1 inline, and that abbreviated keys reduce the cost of by-reference datums in the first column.

There are things we could do to improve upon this that don't require swapping out our sorting implementation wholesale.

One idea is to keep track of the distinctness of the first sort column, and to behave differently if it's significantly lower than the number of to-be-sorted tuples. E.g. by doing a first sort solely on the first column, then reordering the MinimalTuples in memory, and then continuing normally.

There are two main problems with that idea:

1) It's hard to re-order the tuples in memory without needing a substantial
   amount of additional memory.

2) If the second column also is not very distinct, it doesn't buy you much,
   if anything.

But it might provide sufficient benefits regardless. And a naive version, requiring additional memory, should be quick to hack up (a rough sketch of such a version follows below).

I have *not* looked at a whole lot of papers on cache-optimized sorts, and what little I did read was a while ago. Partially because I am not sure they are that applicable to our scenarios: most sorting papers discuss neither variable-width data, nor a substantial amount of cache-polluting work while gathering the data that needs to be sorted.

I think:

> Possible solution was to batch tuples into groups (which can fit into L3
> cache) before pushing them to sort function.

is the most general solution to the issue outlined above. I wouldn't try to implement it via a full new sorting algorithm, though.
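Coming back to the naive reordering version from above, here is a minimal standalone sketch of the idea. It intentionally does not use the real tuplesort.c types or APIs: ModelTuple, ModelSortTuple and reorder_tuples() are invented stand-ins, the tuples are fixed-size so alignment is a non-issue (the real thing would need MAXALIGN'ed lengths), and, as noted above, extra memory proportional to the tuple data is required while copying:

    /*
     * Model of the naive reordering idea: sort (first-key, pointer)
     * pairs on the first key alone, copy the pointed-to tuples into
     * one fresh allocation in that order, then finish the multi-key
     * sort, which now touches tuple memory mostly sequentially.
     */
    #include <stdlib.h>
    #include <string.h>

    typedef struct ModelTuple       /* stand-in for a MinimalTuple */
    {
        size_t      len;            /* total size of this tuple in bytes */
        int         key1;           /* first sort column */
        int         key2;           /* second sort column */
    } ModelTuple;

    typedef struct ModelSortTuple   /* stand-in for SortTuple */
    {
        int         datum1;         /* first column, kept inline */
        ModelTuple *tuple;          /* out-of-line tuple */
    } ModelSortTuple;

    static int
    cmp_first_column(const void *a, const void *b)
    {
        const ModelSortTuple *sa = a;
        const ModelSortTuple *sb = b;

        return (sa->datum1 > sb->datum1) - (sa->datum1 < sb->datum1);
    }

    static int
    cmp_all_columns(const void *a, const void *b)
    {
        const ModelSortTuple *sa = a;
        const ModelSortTuple *sb = b;

        if (sa->datum1 != sb->datum1)
            return (sa->datum1 > sb->datum1) - (sa->datum1 < sb->datum1);
        /* tie on column one: dereference the tuple, the cache-miss risk */
        return (sa->tuple->key2 > sb->tuple->key2) -
               (sa->tuple->key2 < sb->tuple->key2);
    }

    /*
     * The memory-hungry step: copy each tuple into a single new
     * allocation, in memtuples order.  Old and new copies coexist
     * until the caller frees the old tuple memory.
     */
    static char *
    reorder_tuples(ModelSortTuple *memtuples, size_t ntuples,
                   size_t total_bytes)
    {
        char       *newspace = malloc(total_bytes);
        char       *insert = newspace;

        for (size_t i = 0; i < ntuples; i++)
        {
            size_t      len = memtuples[i].tuple->len;

            memcpy(insert, memtuples[i].tuple, len);
            memtuples[i].tuple = (ModelTuple *) insert;
            insert += len;
        }
        return newspace;
    }

The intended call sequence would be qsort() with cmp_first_column, then reorder_tuples(), then qsort() with cmp_all_columns: after the copy, tuples with equal first columns sit next to each other in memory, so the tie-breaking comparisons get the locality the current code lacks.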
My suggestion would be to collect a roughly L3-sized amount of tuples, sort just those using the existing code, allocate new memory for all the corresponding MinimalTuples in one allocation, and copy the MinimalTuples into that, obviously in ->memtuples order. Even if we just use the existing code for the overall sort after that, I'd expect that to yield noticeable benefits.

It's very likely we can do better than just doing a plain sort of everything after that. You effectively end up with a bounded number of pre-sorted blocks, so the most obvious thing to try is to build a heap of those blocks and effectively do a heapsort between the presorted blocks (the first sketch below shows such a merge).

A related, but separate, improvement is to reduce / remove the memory allocation overhead. The switch to GenerationContext helped some, but still leaves a bunch of overhead. And it's not used for bounded sorts right now. We don't palloc/pfree individual tuples during normal sorts, but we do for bounded sorts. I think with a reasonable amount of work we could avoid that for all tuples in ->tuplecontext, and switch to a trivial bump allocator, getting rid of all allocator overhead.

The biggest source of individual pfree()s in the bounded case is that we unconditionally copy the tuple into base->tuplecontext during puttuple, even though quite likely we'll immediately free it in the "case TSS_BOUNDED" block. We could instead pre-check that the tuple won't immediately be discarded before copying it into tuplecontext (second sketch below). Only in the TSS_BOUNDED case, of course.

I think we can also replace the individual freeing of tuples in tuplesort_heap_replace_top() by allowing a number of dead tuples to accumulate (up to work_mem/2, maybe), and then copying the still-living tuples into a new memory context, freeing the old one (third sketch below). While that doesn't sound cheap: in bounded sorts the number of tuples commonly is quite limited, the pre-check before copying the tuple will prevent this from occurring too often, the copy will result in higher locality, and, most importantly, the reduced palloc() overhead (~25% or so with aset.c) will result in a considerably higher cache hit ratio / lower memory usage.
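First, a sketch of the heap-based merge between presorted in-memory blocks. For brevity the model drops the tuple pointers and merges bare integer keys; Block, sift_down() and merge_blocks() are invented names, and a real version would compare SortTuples and hand results to the consumer instead of filling an output array:

    #include <stdlib.h>

    typedef struct Block
    {
        int        *keys;       /* presorted batch; bare int keys for brevity */
        size_t      ntuples;
        size_t      next;       /* next unconsumed element */
    } Block;

    static int
    block_head(const Block *b)
    {
        return b->keys[b->next];
    }

    /* restore the min-heap property, sinking heap[pos] */
    static void
    sift_down(int *heap, int heapsize, Block *blocks, int pos)
    {
        for (;;)
        {
            int         child = 2 * pos + 1;

            if (child >= heapsize)
                break;
            if (child + 1 < heapsize &&
                block_head(&blocks[heap[child + 1]]) <
                block_head(&blocks[heap[child]]))
                child++;
            if (block_head(&blocks[heap[pos]]) <=
                block_head(&blocks[heap[child]]))
                break;

            int         tmp = heap[pos];

            heap[pos] = heap[child];
            heap[child] = tmp;
            pos = child;
        }
    }

    /* merge all blocks into out[], which must hold the total key count */
    static void
    merge_blocks(Block *blocks, int nblocks, int *out)
    {
        int        *heap = malloc(nblocks * sizeof(int));
        int         heapsize = 0;
        size_t      nout = 0;

        /* build a min-heap of block indexes, ordered by each head key */
        for (int i = 0; i < nblocks; i++)
            if (blocks[i].ntuples > 0)
                heap[heapsize++] = i;
        for (int i = heapsize / 2 - 1; i >= 0; i--)
            sift_down(heap, heapsize, blocks, i);

        while (heapsize > 0)
        {
            Block      *b = &blocks[heap[0]];

            out[nout++] = b->keys[b->next++];

            if (b->next >= b->ntuples)
                heap[0] = heap[--heapsize];     /* block drained, drop it */
            sift_down(heap, heapsize, blocks, 0);
        }
        free(heap);
    }

Since the number of blocks is bounded by total sort memory divided by the L3 size, the heap stays small and only block heads are touched, so the merge's working set should stay modest.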
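Second, a sketch of the TSS_BOUNDED pre-check, reusing the model types from the first sketch. BoundedState, compare_tuple_keys(), copy_into_tuplecontext() and the heap_* helpers are hypothetical stand-ins for the corresponding tuplesort state and routines, not real APIs:

    /*
     * TSS_BOUNDED pre-check: an incoming tuple that cannot beat the
     * current top of the bounded (max-)heap would be copied into the
     * tuple context and then pfree()d immediately; test first and
     * skip both the copy and the later pfree.
     */
    typedef struct BoundedState
    {
        ModelSortTuple *heaptuples; /* max-heap on the sort key; worst at [0] */
        size_t      ntuples;
        size_t      bound;          /* LIMIT-derived bound */
    } BoundedState;

    static void
    puttuple_bounded(BoundedState *state, const ModelTuple *tup)
    {
        if (state->ntuples >= state->bound &&
            compare_tuple_keys(tup, state->heaptuples[0].tuple) >= 0)
            return;                 /* would be discarded: don't even copy */

        /* survives the pre-check, so the copy is not wasted work */
        ModelTuple *copy = copy_into_tuplecontext(tup);     /* hypothetical */

        if (state->ntuples < state->bound)
            heap_insert_bounded(state, copy);               /* hypothetical */
        else
            heap_replace_top(state, copy);                  /* hypothetical */
    }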
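Third, a sketch of the batched freeing, assuming the BoundedState above additionally tracks dead_bytes, work_mem_bytes and a tuplecontext pointer; the context_* and copy_into_context() calls are stand-ins for the memory-context API, not its real functions:

    /*
     * Batched freeing: tuplesort_heap_replace_top() only accounts for
     * evicted tuples in dead_bytes instead of pfree()ing them; once
     * more than work_mem/2 is dead, copy the survivors to a fresh
     * context and drop the old context wholesale.
     */
    static void
    maybe_compact_tuples(BoundedState *state)
    {
        if (state->dead_bytes <= state->work_mem_bytes / 2)
            return;

        void       *newctx = context_create();      /* stand-in */

        /* copying the survivors also packs them densely: better locality */
        for (size_t i = 0; i < state->ntuples; i++)
            state->heaptuples[i].tuple =
                copy_into_context(newctx, state->heaptuples[i].tuple);

        context_delete(state->tuplecontext);  /* frees all dead tuples at once */
        state->tuplecontext = newctx;
        state->dead_bytes = 0;
    }

Each compaction frees more than work_mem/2 of dead space, so the copying cost amortizes; and in bounded sorts the number of live tuples is small to begin with.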
Greetings,

Andres Freund