Re: tuple radix sort - Mailing list pgsql-hackers
| From | John Naylor | 
|---|---|
| Subject | Re: tuple radix sort | 
| Date | |
| Msg-id | CANWCAZbsx5VY-F5RVvZ5H1abBWcinMY90gNra6EbdGS1CTx=7g@mail.gmail.com |
| In response to | tuple radix sort (John Naylor <johncnaylorls@gmail.com>) | 
| List | pgsql-hackers | 
I wrote:

> The v1 patch has some optimizations, but in other ways things are
> simple and/or wasteful. Exactly how things fit together will be
> informed by what, if anything, has to be done to avoid regressions.

In v1, radix sort diverts to qsort_tuple for small partitions (similar to how quicksort diverts to insertion sort), but qsort_tuple is inefficient because the comparator is called via a function pointer. I also thought having two different radix sorts was too complex, so I wondered if it'd be better to get rid of the smaller radix sort (whose control flow I find harder to understand, even ignoring the unsightly goto) and have the larger sort divert to a new quicksort specialization that compares on the conditioned datum. That allows skipping branches for NULL comparisons and order reversal. I've done this in v2. It also makes sense to replace the three current integer-comparison quicksorts with this one.

v1 was careful to restore isnull1 to false when diverting to quicksort for the tiebreak. v2 doesn't bother, since the only tiebreak in core that looks at isnull1 is comparetup_datum_tiebreak, and that is not reachable by radix sort, which requires a pass-by-value datum that compares like an integer. This is a bit of a risk, since it's possible a third-party fork is doing something weird. That seems unlikely, but it's something to keep in mind.

I used a standalone program (attached) to microbenchmark this new fallback qsort against a pass of radix sort on one byte, to get a decent threshold value. This is not quite fair, since the quicksort will then be finished, but the radix sort could still need to recurse to the next byte(s), so these numbers could underestimate the threshold. It's just to get an idea. The numbers are RDTSC ticks per element sorted.

cardinality: 256

| elements | qsort | radix |
|---|---|---|
| 100 | 35.4 | 49.2 |
| 200 | 34.9 | 38.1 |
| 400 | 42.4 | 34.4 |
| 800 | 95.0 | 29.2 |
| 1600 | 115.0 | 22.4 |
| 3200 | 125.5 | 19.4 |
| 6400 | 128.1 | 17.6 |

With the highest cardinality possible on a single byte, radix sort is actually not bad at small input sizes. Notice that its time per element goes down consistently as inputs get larger: smaller inputs carry large constant overheads, made worse by my unrolling of the counting step.

cardinality: 2

| elements | qsort | radix |
|---|---|---|
| 100 | 9.2 | 28.0 |
| 200 | 9.1 | 19.5 |
| 400 | 10.4 | 15.7 |
| 800 | 10.1 | 14.5 |
| 1600 | 10.4 | 13.7 |
| 3200 | 15.8 | 13.6 |
| 6400 | 22.2 | 13.8 |

This is an extreme best case for B&M quicksort, which here is basically O(n) -- the point at which its per-element time goes up seems purely due to exceeding L1 cache. Radix sort takes a big input to catch up, but it doesn't seem awful, either.

cardinality: 16

| elements | qsort | radix |
|---|---|---|
| 100 | 19.5 | 34.5 |
| 200 | 18.7 | 22.6 |
| 400 | 18.5 | 17.2 |
| 800 | 25.0 | 14.8 |
| 1600 | 43.8 | 13.8 |
| 3200 | 51.2 | 13.2 |
| 6400 | 59.0 | 12.8 |

This is still low cardinality, but it behaves more like the high-cardinality case. I've set the threshold to 400 for now, but I'm not claiming that's the end of the story.
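To make the divert concrete, here is a minimal, self-contained sketch of the shape v2 takes: an MSD radix sort that hands partitions below the threshold to a comparison sort whose comparison is inlined rather than reached through a function pointer. This is illustrative only, not code from the patch: uint64_t stands in for the conditioned pass-by-value datum, an insertion sort stands in for the specialized quicksort, and the names are made up; only the threshold of 400 comes from the numbers above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DIVERT_THRESHOLD 400	/* chosen from the microbenchmark above */

/* Stand-in for the specialized quicksort: the comparison on the
 * conditioned datum is inlined, with no function-pointer call and no
 * branches for NULLs or order reversal. */
static void
small_sort(uint64_t *a, size_t n)
{
	for (size_t i = 1; i < n; i++)
	{
		uint64_t	key = a[i];
		size_t		j = i;

		while (j > 0 && a[j - 1] > key)
		{
			a[j] = a[j - 1];
			j--;
		}
		a[j] = key;
	}
}

/* MSD radix sort on one byte at a time; call with shift = 56 to start
 * at the most significant byte. */
void
radix_sort_rec(uint64_t *a, uint64_t *scratch, size_t n, int shift)
{
	size_t		count[256] = {0};
	size_t		offset[256];
	size_t		total = 0;

	/* Small partitions (or exhausted key bytes) divert to the
	 * comparison sort rather than paying radix sort's constant costs. */
	if (n < DIVERT_THRESHOLD || shift < 0)
	{
		small_sort(a, n);
		return;
	}

	/* counting pass over the current byte */
	for (size_t i = 0; i < n; i++)
		count[(a[i] >> shift) & 0xFF]++;

	/* exclusive prefix sums give each bucket's start position */
	for (int b = 0; b < 256; b++)
	{
		offset[b] = total;
		total += count[b];
	}

	/* scatter into buckets (out of place, for simplicity) */
	for (size_t i = 0; i < n; i++)
		scratch[offset[(a[i] >> shift) & 0xFF]++] = a[i];
	memcpy(a, scratch, n * sizeof(uint64_t));

	/* recurse into each non-trivial bucket on the next byte */
	for (size_t start = 0, b = 0; b < 256; start += count[b], b++)
	{
		if (count[b] > 1)
			radix_sort_rec(a + start, scratch + start, count[b], shift - 8);
	}
}
```

The real patch of course moves whole SortTuples rather than bare keys, but the divert has this shape: small partitions never pay the counting and scatter passes.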
In addition to the underestimation mentioned above, unrolling of the counting step is a factor. Unrolling makes smaller inputs worse (and we can reach small inputs by recursing from larger ones), but it seems important for large inputs with low cardinality (a few percent, though I haven't shared numbers for that yet). We've found that a large input with only 4-5 distinct values just barely wins with radix sort, so I'll be curious to see whether unrolling is actually needed to prevent regressions there.

Other things to consider:

- I don't quite like how the NULL partitioning step looks, and it could be costly when the distribution of NULLs is not predictable, so I'm thinking of turning part of it into a branch-free cyclic permutation (rough sketch after this list), similar to https://www.postgresql.org/message-id/CANWCAZbAmaZ7P%2BARjS97sJLXsBB5CPZyzFgqNDiqe-L%2BBqXzug%40mail.gmail.com

- The quicksort on the NULL partition still compares isnull1. The branches are predictable, but perhaps it's worth adding a specialization that skips them.
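On the first item, here is a rough sketch of the branch-free flavor I mean. It is not the cyclic permutation from the linked message (that one works in place); it's the simpler out-of-place version, just to show the idea: the destination index is computed arithmetically from the null flag, so nothing branches on the unpredictable NULL distribution. The Elem type and function name are made up for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct
{
	uint64_t	datum;
	bool		isnull;
} Elem;							/* stand-in for SortTuple */

/* Scatter 'in' into 'out' so that non-NULLs come first (NULLS LAST);
 * returns the length of the non-NULL prefix. */
size_t
partition_nulls_last(const Elem *in, Elem *out, size_t n)
{
	size_t		nulls = 0;

	/* Branch-free count of NULLs: the flag is 0 or 1, so just sum it. */
	for (size_t i = 0; i < n; i++)
		nulls += (size_t) in[i].isnull;

	size_t		dst_val = 0;	/* next slot for a non-NULL */
	size_t		dst_null = n - nulls;	/* next slot for a NULL */

	for (size_t i = 0; i < n; i++)
	{
		size_t		flag = (size_t) in[i].isnull;

		/* Select the destination arithmetically; no data-dependent
		 * branch, so throughput doesn't depend on how NULLs cluster. */
		size_t		dst = dst_val + flag * (dst_null - dst_val);

		out[dst] = in[i];
		dst_val += 1 - flag;
		dst_null += flag;
	}

	return n - nulls;
}
```

The in-place cyclic-permutation version would avoid the scratch array at the cost of trickier bookkeeping; the appeal in both cases is the same, that cost no longer depends on the predictability of the NULL distribution.

--
John Naylor
Amazon Web Services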