Re: tuple radix sort - Mailing list pgsql-hackers

From: John Naylor
Subject: Re: tuple radix sort
Msg-id: CANWCAZbsx5VY-F5RVvZ5H1abBWcinMY90gNra6EbdGS1CTx=7g@mail.gmail.com
In response to: tuple radix sort (John Naylor <johncnaylorls@gmail.com>)
List: pgsql-hackers
I wrote:

> The v1 patch
> has some optimizations, but in other ways things are simple and/or
> wasteful. Exactly how things fit together will be informed by what, if
> anything, has to be done to avoid regressions.

In v1, radix sort diverts to qsort_tuple for small partitions (similar
to how quicksort diverts to insertion sort), but qsort_tuple is
inefficient because the comparator is called via a function pointer.

I also thought having two different radix sorts was too complex, so I
wondered if it'd be better to get rid of the smaller radix sort (whose
control flow I find harder to understand, even ignoring the unsightly
goto) and have the larger sort divert to a new quicksort
specialization that compares on the conditioned datum. That allows
skipping branches for NULL comparisons and order reversal. I've done
this in v2. It makes sense to replace the three current
integer-comparison quicksorts with one.
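
To illustrate what that specialization buys, here's a sketch of the
comparator (the names and exact shape are mine, not necessarily what's
in v2). With NULLs already partitioned away and datum1 conditioned so
that an unsigned comparison gives the final order, there's no isnull1
check and no DESC-reversal branch:

#include <stdbool.h>
#include <stdint.h>

typedef uintptr_t Datum;

/* mirrors SortTuple in tuplesort.c, for illustration */
typedef struct SortTuple
{
	void	   *tuple;
	Datum		datum1;
	bool		isnull1;
	int			srctape;
} SortTuple;

static inline int
qsort_tuple_conditioned_cmp(const SortTuple *a, const SortTuple *b)
{
	if (a->datum1 < b->datum1)
		return -1;
	if (a->datum1 > b->datum1)
		return 1;
	return 0;			/* equal: fall back to the full tiebreak */
}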

v1 was careful to restore isnull1 to false when diverting to quicksort
for the tiebreak. v2 doesn't bother, since the only tiebreak in core
that looks at isnull1 is comparetup_datum_tiebreak, which radix sort
can't reach because radix sort requires a pass-by-value datum that
compares like an integer. This is a bit of a risk: it's possible a
third-party fork is doing something weird. That seems unlikely, but
it's something to keep in mind.

I used a standalone program (attached) to microbenchmark this new
fallback qsort vs. a pass of radix sort on one byte to get a decent
threshold value. This is not quite fair: the quicksort will then be
finished, but the radix sort could still need to recurse to the next
byte(s), so these numbers could underestimate the threshold. This is
just to get an idea.
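
For reference, the core of such a harness boils down to something like
the sketch below (my reconstruction for illustration, not the attached
program itself; __rdtsc() comes from x86intrin.h on x86-64):

#include <stddef.h>
#include <stdint.h>
#include <x86intrin.h>		/* __rdtsc() */

/*
 * Time one sort call in RDTSC ticks per element.  A real harness would
 * also randomize the input per cardinality and repeat runs, reporting
 * a minimum or median.
 */
static double
ticks_per_element(void (*sort_fn) (uint64_t *, size_t),
				  uint64_t *keys, size_t n)
{
	uint64_t	start = __rdtsc();

	sort_fn(keys, n);
	return (double) (__rdtsc() - start) / n;
}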

The numbers are in RDTSC ticks per element sorted.

cardinality: 256
number of elements:  100   qsort: 35.4 radix: 49.2
number of elements:  200   qsort: 34.9 radix: 38.1
number of elements:  400   qsort: 42.4 radix: 34.4
number of elements:  800   qsort: 95.0 radix: 29.2
number of elements: 1600   qsort: 115.0 radix: 22.4
number of elements: 3200   qsort: 125.5 radix: 19.4
number of elements: 6400   qsort: 128.1 radix: 17.6

With the highest cardinality possible on a single byte, radix sort is
actually not bad even at small input sizes. Notice that the time per
element consistently goes down with larger inputs. Smaller inputs have
large constant overheads, made worse by my unrolling of the counting
step (a sketch of which follows below).
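
To make the unrolling concrete, the counting step I'm talking about
looks roughly like this (a sketch, not the patch's exact code). One
histogram per unrolled lane avoids back-to-back increments of the same
counter when many keys share a byte value -- exactly the
low-cardinality case -- but the setup and merge are pure overhead for
small inputs:

#include <stddef.h>
#include <stdint.h>

static void
count_byte(const uint64_t *keys, size_t n, int shift, size_t counts[256])
{
	size_t		c0[256] = {0},
				c1[256] = {0},
				c2[256] = {0},
				c3[256] = {0};
	size_t		i = 0;

	/* 4x unrolled: each lane increments its own histogram */
	for (; i + 4 <= n; i += 4)
	{
		c0[(keys[i + 0] >> shift) & 0xFF]++;
		c1[(keys[i + 1] >> shift) & 0xFF]++;
		c2[(keys[i + 2] >> shift) & 0xFF]++;
		c3[(keys[i + 3] >> shift) & 0xFF]++;
	}
	for (; i < n; i++)
		c0[(keys[i] >> shift) & 0xFF]++;

	/* merge the per-lane histograms */
	for (int b = 0; b < 256; b++)
		counts[b] = c0[b] + c1[b] + c2[b] + c3[b];
}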

cardinality: 2
number of elements:  100   qsort:  9.2 radix: 28.0
number of elements:  200   qsort:  9.1 radix: 19.5
number of elements:  400   qsort: 10.4 radix: 15.7
number of elements:  800   qsort: 10.1 radix: 14.5
number of elements: 1600   qsort: 10.4 radix: 13.7
number of elements: 3200   qsort: 15.8 radix: 13.6
number of elements: 6400   qsort: 22.2 radix: 13.8

This is an extreme best case for B&M quicksort, whose three-way
partitioning puts all the equal keys in place in a single pass, making
it basically O(n) here -- the point at which the per-element time goes
up seems purely due to exceeding L1 cache. Radix sort takes a big
input to catch up, but it doesn't seem awful, either.

cardinality: 16
number of elements:  100   qsort: 19.5 radix: 34.5
number of elements:  200   qsort: 18.7 radix: 22.6
number of elements:  400   qsort: 18.5 radix: 17.2
number of elements:  800   qsort: 25.0 radix: 14.8
number of elements: 1600   qsort: 43.8 radix: 13.8
number of elements: 3200   qsort: 51.2 radix: 13.2
number of elements: 6400   qsort: 59.0 radix: 12.8

This is still low cardinality, but behaves more like the high cardinality case.

I've set the threshold to 400 for now, but I'm not claiming that's the
end of the story. In addition to the underestimation mentioned above,
unrolling the counting step is a factor. Unrolling makes smaller
inputs worse (sizes we can reach by recursing from larger inputs), but
it seems important for large inputs with low cardinality (a few
percent, but I haven't shared numbers yet). We've found that a large
input with only 4-5 distinct values just barely wins with radix sort.
I'll be curious to see if unrolling is actually needed to prevent
regressions there.
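
In code terms the divert is just a size check at the top of the
recursion, something like this (names and structure are placeholders
building on the SortTuple sketch above; qsort_tuple_conditioned is a
made-up name for the new specialization):

#define RADIX_DIVERT_THRESHOLD 400	/* provisional, per the numbers above */

static void
radix_sort_rec(SortTuple *tuples, size_t n, int shift, void *arg)
{
	if (n < RADIX_DIVERT_THRESHOLD)
	{
		/* small partition: specialized quicksort on conditioned datum1 */
		qsort_tuple_conditioned(tuples, n, arg);
		return;
	}
	/* ... count byte values, permute into buckets, recurse on next byte ... */
}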

Other things to consider:

- I don't quite like how the NULL partitioning step looks, and it
could be costly when the distribution of NULLs is not predictable, so
I'm thinking of turning part of it into a branch-free cyclic
permutation (see the sketch after this list), similar to
https://www.postgresql.org/message-id/CANWCAZbAmaZ7P%2BARjS97sJLXsBB5CPZyzFgqNDiqe-L%2BBqXzug%40mail.gmail.com

- The quicksort on the NULL partition still compares isnull1 -- the
branches are predictable, but perhaps it's worth adding a
specialization that skips them.
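
Here's the kind of branch-free pass I have in mind for the first point
above (an illustration of the idea only; it assumes NULLS FIRST for
simplicity and reuses the SortTuple sketch from earlier). The swap is
unconditional and the boundary advances by the predicate value, so
there is no data-dependent branch to mispredict; the non-NULL side
comes out cyclically permuted, which is fine since it gets sorted
afterward:

#include <stddef.h>

/* Move NULL tuples to the front, branch-free; returns the NULL count. */
static size_t
partition_nulls(SortTuple *tuples, size_t n)
{
	size_t		boundary = 0;

	for (size_t i = 0; i < n; i++)
	{
		SortTuple	tmp = tuples[i];

		tuples[i] = tuples[boundary];
		tuples[boundary] = tmp;
		boundary += tmp.isnull1;	/* advances by 0 or 1, no branch */
	}
	return boundary;
}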

--
John Naylor
Amazon Web Services
