Thread: some aspects of our qsort might not be ideal

some aspects of our qsort might not be ideal

From
John Naylor
Date:
While reviewing Thomas Munro's patchset to consider expanding the uses
of specialized qsort [1], I wondered about some aspects of the current
qsort implementation. For background, ours is based on Bentley &
McIlroy "Engineering a Sort Function" [2], which is a classic paper
worth studying.

1) the check for pre-sorted input at the start of each and every recursion

This has been discussed before [3]. This is great if the input is
sorted, and mostly harmless if the current partition is unsorted
enough that this check fails quickly, but it's not hard to imagine
that this is mostly a waste of cycles. There have been proposals to
base the pre-sorted check on input size [4] or to do it only once at
the beginning, but that strikes me as looking for the keys under the
lamppost because that's where the light is. The logical criterion for
proceeding to check if the input is sorted is whether we *think* the
input could be sorted. That may sound like a tautology, but consider
the case where the partitioning phase didn't perform any swaps. Then,
it makes sense to check, but we can go further. What if we do the
check, but it fails towards the end? If just a couple of elements
are out of place, does it really make sense to give up, partition, and
recurse? If just a few iterations of insertion sort are all that is
needed to finish, why not just do that? This way, we *dynamically*
decide to optimistically start an insertion sort, in the hopes that it
will occasionally prevent a recursion, and worst case the input is
slightly more sorted for the next recursion. All we need is a
lightweight way to detect that the partitioning phase didn't do any
swaps. More on this later.

2) cardinality issues can cancel abbreviated keys, but our qsort is
not optimized for that

Since in this case comparison is very expensive, it stands to reason
that qsort could profitably be optimized for a low number of unique
keys. The Bentley & McIlroy scheme does take great pains to prevent
quadratic behavior from lots of duplicates, but I've found something
that might have stronger guarantees: Pattern-defeating quicksort
(paper: [5], C++ implementation: [6]) claims worst-case runtime of
O(nk) for inputs with k distinct elements. This is achieved via an
asymmetric partitioning scheme. It's not complex, but it is subtle, so
I won't try to describe it here. I recommend reading section 3 of the
paper for details. Rust and Boost use PDQuicksort in some form, so it
has some production use.

The partitioning scheme just mentioned also provides an easy way to
detect when no swaps have been done, thus solving #1 above.

There is one requirement that is a double-edged sword: Recursion to
the partitions must happen in order. This has an additional advantage
that in every case but one, insertion sort doesn't need to guard
against running off the beginning of the array. The downside for us is
that we currently recurse to a partition based on its size as a
stack-guarding measure, so that guard would have to be replaced. The
C++ implementation of PDQuicksort falls back to heap sort as a last
resort to bound runtime, but it would be just as effective at guarding
the stack. That's a bit of a snag for making it production-ready, but
not enough to prevent testing the actual qsort part.

Does anyone see a reason not to put in the necessary work to try it out?

[1] https://www.postgresql.org/message-id/CA%2BhUKGKztHEWm676csTFjYzortziWmOcf8HDss2Zr0muZ2xfEg%40mail.gmail.com
[2] https://cs.fit.edu/~pkc/classes/writing/samples/bentley93engineering.pdf
[3] https://www.postgresql.org/message-id/flat/87F42982BF2B434F831FCEF4C45FC33E5BD369DF%40EXCHANGE.corporate.connx.com#e69718293c45d89555020bd0922ad055
[4] https://www.postgresql.org/message-id/CA%2BTgmobKFcb6ajVEN8eSnBa78sB3FSwuEWTHXd2x9JC5DOh5OA%40mail.gmail.com
[5] https://arxiv.org/pdf/2106.05123.pdf
[6] https://github.com/orlp/pdqsort

-- 
John Naylor
EDB: http://www.enterprisedb.com



Re: some aspects of our qsort might not be ideal

From
John Naylor
Date:
> This way, we *dynamically*
> decide to optimistically start an insertion sort, in the hopes that it
> will occasionally prevent a recursion, and worst case the input is
> slightly more sorted for the next recursion.

I should mention one detail -- we put a limit on how many iterations
we attempt before we give up and partition/recurse. The author's
implementation chooses 8 for this limit. The paper mentions this
technique in section 5.2, but is not the origin of it.
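
To make the idea concrete, here is a rough standalone sketch for an
array of ints (the names and the int element type are mine, purely
for illustration -- this is not the patch). Following the pdqsort
implementation, the budget counts element moves rather than loop
iterations:

#include <stdbool.h>
#include <stddef.h>

#define PARTIAL_INSERTION_SORT_LIMIT 8

/*
 * Optimistically run insertion sort, but bail out once elements have
 * moved more than a small total distance.  Returns true if the input
 * is now fully sorted; on false, the caller partitions and recurses
 * as usual, with the input slightly more sorted than before.
 */
static bool
partial_insertion_sort(int *a, size_t n)
{
	size_t		moves = 0;

	for (size_t i = 1; i < n; i++)
	{
		int			val = a[i];
		size_t		j = i;

		/* sift larger elements right to make room for val */
		while (j > 0 && a[j - 1] > val)
		{
			a[j] = a[j - 1];
			j--;
		}
		a[j] = val;

		moves += i - j;
		if (moves > PARTIAL_INSERTION_SORT_LIMIT)
			return false;	/* too unsorted -- give up */
	}
	return true;
}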

-- 
John Naylor
EDB: http://www.enterprisedb.com



Re: some aspects of our qsort might not be ideal

From
Robert Haas
Date:
On Wed, Feb 16, 2022 at 2:29 AM John Naylor
<john.naylor@enterprisedb.com> wrote:
> Does anyone see a reason not to put in the necessary work to try it out?

Seems reasonable to me. It's always a bit difficult, I feel, to know
what test cases to use - almost any idea is going to have some case
where it's worse than what we do today, and there can always be some
user who does that exact thing 100% of the time. Moreover, it's hard
to be certain that test cases we construct - say, ordered data,
reverse ordered data, randomly ordered data, almost ordered data with
a single element out of place, etc. - are actually covering all of the
interesting cases. At the same time, I don't think anyone would
seriously disagree with what you say in the subject line, and we won't
make any progress by NOT trying things that are recommended in the
academic literature.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: some aspects of our qsort might not be ideal

From
John Naylor
Date:
Hi,

Attached is a draft series that implements some but not all features
of pattern-defeating quicksort, namely the ones I thought were
interesting for us. Recently this quicksort variant was committed for
the next release of the Go language, 1.19 [1] (which in turn was based
on that of Rust [2]), and that implementation was a helpful additional
example. The bottom line is that replacing the partitioning scheme
this way is likely not worth doing because of our particular use
cases, but along the way I found some other things that might be worth
doing, so some good may come out of this.

0001 is a test module for running a microbenchmark, based on earlier
work by Thomas Munro.

0002 is just preliminary refactoring.

(From here on some regression tests change from patch to patch, since
tests that rely on a sort order might have duplicates in the sort
key.)

0003 starts a "partial insertion sort" if partitioning didn't do any
swaps, and bails if it doesn't finish quickly. A footnote to what I
said earlier:

> All we need is a
> lightweight way to detect that the partitioning phase didn't do any
> swaps.

Actually, Postgres did detect this until a3f0b3d68 [3], but wasn't
very bright about it -- it attempted a complete insertion sort.
Limiting the iterations is more sensible. In addition to the swap
count that I borrowed from ancient PG code, this patch checks the root
level as we do today (but now in a different code path), and there are
a couple additional heuristics to limit wasted work:
- Only try it if the pivot candidates are in order
- If the input is presorted we always return; if it isn't, we only
proceed with swapping elements if the input was large enough that we
checked the order of nine pivot candidates.
- Only try partial insertion sort if the partitions are not highly
different in size, since bad pivots will lead to fewer swaps in
general, causing the swap count heuristic to be unhelpful. (Future
work: look into randomizing pivot candidates when encountering such
skewed partitions, as some other implementations do)

0004 builds upon the previous patch and preemptively reverses the input
if there are nine pivot candidates and they are in descending order.
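
The gist in standalone form (again an int-array sketch with made-up
names; the real patch works on SortTuples):

#include <stddef.h>

/*
 * If the pivot candidates sampled from the input are in strictly
 * descending order, bet that the whole input is descending and
 * reverse it up front.  A truly descending input then becomes
 * ascending and passes the presorted check immediately.
 */
static void
maybe_reverse(int *a, size_t n, const int *cand, int ncand)
{
	if (n < 2)
		return;

	for (int i = 1; i < ncand; i++)
	{
		if (cand[i - 1] <= cand[i])
			return;			/* not strictly descending */
	}

	for (size_t lo = 0, hi = n - 1; lo < hi; lo++, hi--)
	{
		int			tmp = a[lo];

		a[lo] = a[hi];
		a[hi] = tmp;
	}
}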

0005 implements the PDQuicksort partitioning scheme. In working on
this, I found I need to correct something I said earlier:

> There is one requirement that is a double-edged sword: Recursion to
> the partitions must happen in order. This has an additional advantage
> that in every case but one, insertion sort doesn't need to guard
> against running off the beginning of the array. The downside for us is
> that we currently recurse to a partition based on its size as a
> stack-guarding measure, so that guard would have to be replaced.

This is not true actually (I read too much into an implementation
detail), and with a tiny bit of bookkeeping we can retain our
practice of recursing to the smaller partition.

A very brief description of 0005: There are two partitioning
functions: left and right. Partition-right puts elements >= pivot in
the right partition. This is the default. When many duplicates end up
in the right partition, the pivot chosen for the next iteration there
could be the same as the previous pivot. In that case, we switch to
partition-left: put elements equal to the pivot in the left partition.
Since all those elements are the same, we are done and can iterate on
the right side. With this scheme, only 2-way comparators are needed.
Also, the previous counting of swaps during partitioning is replaced
by a simple pointer check once per iteration.
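
To give a flavor of it, here is a simplified sketch of the
partition-right half for an int array, without the sentinel tricks
the real implementation uses to avoid the bounds checks (function and
variable names are mine):

#include <stdbool.h>
#include <stddef.h>

/*
 * Partition a[lo..hi) around the pivot stored in a[lo]; elements
 * >= pivot end up in the right partition.  Returns the pivot's final
 * index and reports whether the input was already partitioned (no
 * swaps needed), which is the cheap signal for attempting the partial
 * insertion sort mentioned above.
 */
static size_t
partition_right(int *a, size_t lo, size_t hi, bool *already_partitioned)
{
	int			pivot = a[lo];
	size_t		first = lo + 1;
	size_t		last = hi;

	*already_partitioned = true;
	for (;;)
	{
		/* skip over elements already on the correct side */
		while (first < last && a[first] < pivot)
			first++;
		while (last > first && a[last - 1] >= pivot)
			last--;
		if (first >= last)
			break;

		/* a[first] >= pivot and a[last - 1] < pivot: swap them */
		int			tmp = a[first];

		a[first] = a[last - 1];
		a[last - 1] = tmp;
		*already_partitioned = false;
		first++;
		last--;
	}

	/* move the pivot into place between the partitions */
	a[lo] = a[first - 1];
	a[first - 1] = pivot;
	return first - 1;
}

Partition-left is the mirror image: elements equal to the pivot go to
the left partition, so when it is invoked on a run of duplicates the
whole left side is immediately done.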

-- Tests:

I have two flavors of tests:
1) ORDER BY queries with 10 million records, selecting a single value
for three types:
- bigint (so can use datum sort)
- 16-byte abbreviatable text
- 16-byte non-abbreviatable text
(median of 5 runs, script attached)

2) a microbenchmark sorting 1 million int32 datums with two ways to
vary comparison schemes: direct C function/fmgr function and
runtime/specialized, so four different combinations. The idea here is
to get an easy way to test comparators of different "heaviness", not
that these actual cases are all necessarily relevant.
(minimum of 10 runs)

The input distributions can be put into three categories:
- Varying amounts of randomness and presortedness (random, sort50/90/99, dither)
- Long runs of ascending and descending sequences (ascending,
descending, organ, merge)
- Low cardinality (dup8, dupsq, mod100/8)

To get the microbenchmark output in the same format as the query script, I ran

psql -c 'select test_sort_cmp_weight(1 * 1024*1024)' 2> result.txt
perl -n -e 'print "$2\t$3\t$1\t$4\n" if /NOTICE:  \[(.+)\] num=(\d+), order=(\w+), time=(\d+\.\d+)/;' result.txt > result.csv

-- Summary of results:

Partial insertion sort:
- For small types sometimes possibly a bit faster, sometimes
significantly faster.
- With queries, significantly slower for some partially-presorted
distributions, otherwise mostly no change.
- Could be further optimized for types known at compile time by using
a "sift" algorithm rather than swapping at every iteration. We could
also try basing the iteration limit on type size.

Optimistically reversing input:
- 20-25x faster on descending input, which is nice. It's not clear if
it would benefit "mostly descending" input.

PDQuicksort:
- With light or moderate types/comparators, it seems significantly
faster than B&M.
- With heavier comparators (including all queries), it's slightly
faster at best, and regresses noticeably with low cardinality inputs.
This latter finding was somewhat surprising. The O(kn) behavior for
low cardinality input is academically interesting, but doesn't
translate to gains for some of our key use cases.

-- Conclusion: The partial insertion sort seems useful if the
regressions can be worked out. The reverse-input optimization is a
small addition on top of that and seems worth doing. The PDQuicksort
partitioning scheme is a win in some cases I tested but a loss in
others, so at this point I'm inclined not to pursue that part further.

I'd be happy to hear any thoughts anyone might have.

[1] https://github.com/golang/go/blob/72e77a7f41bbf45d466119444307fd3ae996e257/src/sort/zsortfunc.go
[2] https://github.com/rust-lang/rust/blob/415c9f95884caebd0be9b837ef0885d1ffde1b9b/library/core/src/slice/sort.rs
[3] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=a3f0b3d68f9a5357a3f72b40a45bcc714a9e0649;hp=570b726533fb561bc5a35ba50ea944f139a250d5

Attachment

Re: some aspects of our qsort might not be ideal

From
Peter Geoghegan
Date:
On Thu, Jun 2, 2022 at 8:33 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> Attached is a draft series that implements some but not all features
> of pattern-defeating quicksort, namely the ones I thought were
> interesting for us. Recently this quicksort variant got committed for
> the next release of the Go language 1.19 [1] (which in turn was based
> on that of Rust [2]), and that implementation was a helpful additional
> example. The bottom line is that replacing the partitioning scheme
> this way is likely not worth doing because of our particular use
> cases, but along the way I found some other things that might be worth
> doing, so some good may come out of this.

What about dual-pivot quicksort, which is used in Java 7+? That is the
de facto successor to Bentley & McIlroy. In fact, Jon Bentley himself
collaborated with its author, and provided benchmarking input. The
underlying philosophy is essentially the same as the original -- it
is supposed to be an "industrial strength" quicksort, with various
adversarial cases considered directly.

See:

https://www.wild-inter.net/publications/wild-2018b.pdf

https://codeblab.com/wp-content/uploads/2009/09/DualPivotQuicksort.pdf

At one point quite a few years back I planned on investigating it
myself, but never followed through.

--
Peter Geoghegan



Re: some aspects of our qsort might not be ideal

From
John Naylor
Date:
On Fri, Jun 3, 2022 at 10:44 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> What about dual-pivot quicksort, which is used in Java 7+? That is the
> de facto successor to Bentley & McIlroy. In fact, Jon Bentley himself
> collaborated with its author, and provided benchmarking input. The
> underlying philosophy is essentially the same as the original -- it
> is supposed to be an "industrial strength" quicksort, with various
> adversarial cases considered directly.

I had heard of it but not looked into it deeply. I did read that Java
7 uses dual pivot quicksort for primitives and timsort for objects. I
wasn't sure if dual pivot was not good for objects (which could have
possibly-complex comparators) or if timsort was just simply good for
Java's use cases. It seems accessible to try doing, so I'll look into
that.

-- 
John Naylor
EDB: http://www.enterprisedb.com



Re: some aspects of our qsort might not be ideal

From
Peter Geoghegan
Date:
On Thu, Jun 2, 2022 at 9:24 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> I had heard of it but not looked into it deeply. I did read that Java
> 7 uses dual pivot quicksort for primitives and timsort for objects. I
> wasn't sure if dual pivot was not good for objects (which could have
> possibly-complex comparators) or if timsort was just simply good for
> Java's use cases. It seems accessible to try doing, so I'll look into
> that.

I think that Timsort is most useful with cases where individual
comparisons are inherently very expensive (perhaps even implemented in
Python), which might be common with objects due to the indirection
that Python (and even Java) imposes on objects in general.

I would imagine that this is a lot less relevant in
Postgres/tuplesort, because we have optimizations like abbreviated
keys. And because we're mostly just sorting tuples on one or more
attributes of scalar types.

The abbreviated keys optimization is very much something that comes
from the world of databases, not the world of sorting. It's pretty much a
domain-specific technique. That seems relevant to me.

--
Peter Geoghegan



Re: some aspects of our qsort might not be ideal

From
John Naylor
Date:
On Fri, Jun 3, 2022 at 10:44 AM Peter Geoghegan <pg@bowt.ie> wrote:

> What about dual-pivot quicksort, which is used in Java 7+? That is the
> de facto successor to Bentley & McIlroy. In fact, Jon Bentley himself
> collaborated with its author, and provided benchmarking input. The
> underlying philosophy is essentially the same as the original -- it
> is supposed to be an "industrial strength" quicksort, with various
> adversarial cases considered directly.

Here is a *rough* first pass at dual-pivot quicksort. I haven't
changed the regression tests to adjust for qsort being an unstable
sort, and there is some dead code. I looked to a couple JDKs for
examples of design decisions as a first approximation. It includes a
module with a few debugging printouts, run like so: select
debug_qsort_int(array[7,6,5,4,3,2,1]);

Pivot selection: A common way is to pick five candidates (here: mid,
+/- 1/7, +/- 2/7), sort them in place, then pick the 2nd and 4th of
them, or "tertile of five". If they are all evenly spaced, we can do
insertion sort.

Fall back to single-pivot: If any of the pivot candidates are equal,
JDK assumes there could be a large number of duplicates and falls back
to single-pivot, using the median of the five. I've done the same
here. Despite having two code paths in all of our template instances,
the VM binary is only about 5kb bigger, since MED3 is no longer built.
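
Roughly, the pivot selection looks like this for an int array (my
names; this also assumes n is large enough that the five candidate
positions are distinct):

#include <stdbool.h>
#include <stddef.h>

/*
 * Sort five spaced candidates in place, then take the 2nd and 4th as
 * pivots ("tertile of five").  Returns false if any candidates compare
 * equal, in which case the caller falls back to single-pivot using
 * the median, as described above.
 */
static bool
select_two_pivots(int *a, size_t n, int *pivot1, int *pivot2)
{
	size_t		mid = n / 2;
	size_t		seventh = n / 7;
	size_t		idx[5] = {
		mid - 2 * seventh, mid - seventh, mid,
		mid + seventh, mid + 2 * seventh
	};

	/* insertion sort of the five candidates, in place */
	for (int i = 1; i < 5; i++)
	{
		int			val = a[idx[i]];
		int			j = i;

		while (j > 0 && a[idx[j - 1]] > val)
		{
			a[idx[j]] = a[idx[j - 1]];
			j--;
		}
		a[idx[j]] = val;
	}

	/* equal candidates hint at many duplicates in the input */
	for (int i = 1; i < 5; i++)
	{
		if (a[idx[i - 1]] == a[idx[i]])
			return false;
	}

	*pivot1 = a[idx[1]];
	*pivot2 = a[idx[3]];
	return true;
}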

Performance testing: Not started yet. I'm thinking an apples-to-apples
comparison is not insightful enough, since the current insertion sort
threshold 7 is already a bit constrained for single-pivot, and would
be even more so for dual pivot, especially since 5 of the elements are
pre-sorted to choose the pivots. My plan is to retest the threshold on
HEAD using my latest tests, then use that as a starting point to test
thresholds with dual-pivot.

-- 
John Naylor
EDB: http://www.enterprisedb.com

Attachment

Re: some aspects of our qsort might not be ideal

From
Robert Haas
Date:
On Thu, Jun 23, 2022 at 6:13 AM John Naylor
<john.naylor@enterprisedb.com> wrote:
> Here is a *rough* first pass at dual-pivot quicksort. I haven't
> changed the regression tests to adjust for qsort being an unstable
> sort, ...

Hmm. I thought we had some reasons for preferring a stable sort algorithm.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: some aspects of our qsort might not be ideal

From
Matthias van de Meent
Date:
On Thu, 23 Jun 2022 at 15:52, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Jun 23, 2022 at 6:13 AM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> > Here is a *rough* first pass at dual-pivot quicksort. I haven't
> > changed the regression tests to adjust for qsort being an unstable
> > sort, ...
>
> Hmm. I thought we had some reasons for preferring a stable sort algorithm.

I think that mostly has to do with reliability / stability of ORDER BY
in combination with LIMIT and OFFSET, even though right now we cannot
fully guarantee such stability due to unstable results from underlying
plan nodes.

As an example, a table scan node under a sort node can start its scan
at an arbitrary point in the table (using synchronize_seqscans), and
because Sort nodes only sort MinimalTuple-s, each set of tuples that
have an equal sort value will be ordered by TID + y (mod tablesize),
with arbitrary values for y.

- Matthias



Re: some aspects of our qsort might not be ideal

From
Peter Geoghegan
Date:
On Thu, Jun 23, 2022 at 7:51 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> I think that mostly has to do with reliability / stability of ORDER BY
> in combination with LIMIT and OFFSET, even though right now we cannot
> fully guarantee such stability due to unstable results from underlying
> plan nodes.

The current qsort isn't stable. While quicksort is never stable, our
particular implementation has fairly significant optimizations that
strongly rely on not using a stable sort. In particular, the B&M
optimizations for duplicate elements.

These optimizations make things like datum tuplesorts for
'SELECT COUNT(DISTINCT mycol) ...' style queries on low cardinality columns
extremely fast. We're not really sorting so much as bucketing. This is
based on Dijkstra's Dutch national flag problem.
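
For reference, the textbook three-way partition looks like this (B&M
actually collect equal elements at both ends and swap them to the
middle at the end of the pass, but the effect is the same):

#include <stddef.h>

/*
 * Dutch-national-flag partition: one pass puts elements < pivot on
 * the left, == pivot in the middle, > pivot on the right.  The middle
 * run is in final position and is never looked at again, so with few
 * distinct values the sort degenerates into cheap bucketing.
 */
static void
three_way_partition(int *a, size_t n, int pivot,
					size_t *eq_begin, size_t *eq_end)
{
	size_t		lt = 0;			/* a[0..lt) < pivot */
	size_t		i = 0;			/* a[lt..i) == pivot */
	size_t		gt = n;			/* a[gt..n) > pivot */

	while (i < gt)
	{
		if (a[i] < pivot)
		{
			int			tmp = a[lt];

			a[lt] = a[i];
			a[i] = tmp;
			lt++;
			i++;
		}
		else if (a[i] > pivot)
		{
			gt--;
			int			tmp = a[gt];

			a[gt] = a[i];
			a[i] = tmp;
		}
		else
			i++;
	}

	*eq_begin = lt;				/* recurse only on a[0..lt) and a[gt..n) */
	*eq_end = gt;
}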

-- 
Peter Geoghegan



Re: some aspects of our qsort might not be ideal

From
Matthias van de Meent
Date:
On Thu, 23 Jun 2022 at 17:04, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Thu, Jun 23, 2022 at 7:51 AM Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> > I think that mostly has to do with reliability / stability of ORDER BY
> > in combination with LIMIT and OFFSET, even though right now we cannot
> > fully guarantee such stability due to unstable results from underlying
> > plan nodes.
>
> The current qsort isn't stable.

Then I misunderstood Robert's comment, thanks for correcting me.

- Matthias



Re: some aspects of our qsort might not be ideal

From
John Naylor
Date:
I've run the microbenchmarks only so far, since they complete much
faster than queries, and I wanted to tell quickly if this is a good
avenue to pursue. Later I'll repeat for queries. Methodology is
similar to what I did earlier in the thread, so I won't belabor that.

Takeaways:

- Contrary to some testing on single pivot I did some months ago with
fewer input distributions, in this test neither single nor dual pivot
benefit much from a large insertion sort threshold (i.e. 20 or so). I
picked 12 for both to do the head-to-head comparison, since this seems
to give a slight boost to specialized sorts on the slowest inputs.
They don't have to be the same, of course, but it seems about right
with these results. (Of course, it's easy to have specialized sorts
define their own threshold, as Thomas Munro has shown.)
- Generic qsort (type length and comparator passed at runtime) sees no
benefit from dual-pivot in this test, but specialized qsorts do get a
decent speedup.
- Descending inputs get a huge boost when specialized. This is
surprising to me, enough that I'm skeptical and want to double check
the test. If it's legitimate, I'm happy about it.

I've attached three spreadsheets with graphs, two for threshold tests
for single- and dual-pivot, and one comparing single and dual for
threshold=12.

0001 is the test module and rough script (not for commit). Note: The
server won't build, since the threshold is passed via CPPFLAGS.
0002 is trivial and mostly for assert builds.
0003 is a mostly cleaned up patch for dual pivot, with passing
regression tests and default insertion sort threshold of 12

I'll add a CF entry for this.

TODO: add results for queries.
-- 
John Naylor
EDB: http://www.enterprisedb.com

Attachment

Re: some aspects of our qsort might not be ideal

From
John Naylor
Date:
I wrote:
> TODO: add results for queries.

Attached are three spreadsheets for queries, similar to the earlier
ones for microbenchmarks. There are three datatypes:

- Bigint: by selecting just the value, these can use Datum sorts
- Text: the same number as bigint, but in lpadded text form, followed
by 8 identical characters, so 16 bytes.
- Text no-abbrev: also 16 bytes of text derived from bigint, but with
the 8 characters at the beginning to defeat abbreviation.

For the single-vs-dual comparison, this time I chose the insertion
sort threshold to be 10 for single pivot and 18 for dual-pivot. I see
a big speedup in descending values here as well, completing in 20-50%
less time. Everything else is the same, slightly faster, or slightly
slower, at least on this machine (Comet Lake, gcc 12.1). The slightly
slower cases are ones with duplicates. It would make sense that the
duplicate-detection heuristic lags behind where we would ideally be
using single-pivot, but the difference is small. In any case, there
doesn't seem to be a compelling benefit from dual-pivot for most
cases.

I suspect the branches in the sorttuple comparators for null handling
are negating the benefits from better d-cache and TLB behavior. In
that case, a useful thing to try would be to pre-partition the
sorttuple array between null and non-null values as it's being
populated, then sort each of those in turn. (The null portion would
only need sorting if there was more than one sort key.) That would
automatically take care of nulls-first vs nulls-last upfront, save a
bunch of binary space used for branches (worth doing on its own), and
the "isnull" member would not be needed anymore. That wouldn't save
data space because of alignment, but the compiler wouldn't need to
generate a load/store instruction for it. gcc does, but clang and msvc
just treat the 24-byte struct as 3 8-byte integers:

https://godbolt.org/z/d8j1EvroP
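
A sketch of what populating the array might look like (struct and
names are hypothetical stand-ins, not the real tuplesort types):

#include <stdbool.h>
#include <stddef.h>

typedef struct
{
	void	   *tuple;			/* stand-in for SortTuple's contents */
	long		datum1;
} SimpleSortTuple;

typedef struct
{
	SimpleSortTuple *tuples;
	size_t		n_nonnull;		/* next free slot from the front */
	size_t		null_boundary;	/* last back-filled null; starts at capacity */
	size_t		capacity;
} TupleStore;

/*
 * Non-null keys fill from the front, nulls from the back.  Afterwards,
 * sort tuples[0..n_nonnull) with a comparator that has no null branch;
 * the null group only needs sorting if there are multiple sort keys.
 * NULLS FIRST vs NULLS LAST just changes which group is consumed first.
 */
static void
store_tuple(TupleStore *store, SimpleSortTuple tup, bool isnull)
{
	if (isnull)
		store->tuples[--store->null_boundary] = tup;
	else
		store->tuples[store->n_nonnull++] = tup;
}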

-- 
John Naylor
EDB: http://www.enterprisedb.com

Attachment

Re: some aspects of our qsort might not be ideal

From
John Naylor
Date:
On Tue, Jul 5, 2022 at 12:14 PM John Naylor <john.naylor@enterprisedb.com> wrote:

> I suspect the branches in the sorttuple comparators for null handling
> are negating the benefits from better d-cache and TLB behavior. In
> that case, a useful thing to try would be to pre-partition the
> sorttuple array between null and non-null values as it's being
> populated, then sort each of those in turn. (The null portion would
> only need sorting if there was more than one sort key.) That would
> automatically take care of nulls-first vs nulls-last upfront, save a
> bunch of binary space used for branches (worth doing on its own),

To see if this idea had any legs, I simply ripped out the null handling from the specialized sorttuple comparators. This works for my tests since none of the values are null. Doing this much seems to be more effective than dual pivot in every case but purely-descending input. The combination of the two seems a bit faster still on bigint, but not on text. From this test, it seems the conclusions for tuple sorts with dual-pivot haven't changed: impressive speedup on descending input, not so much elsewhere. (spreadsheet attached, note the insertion sort thresholds are reused from previous testing, namely 10 and 18 for single / dual pivot)

> and
> the "isnull" member would not be needed anymore. That wouldn't save
> data space because of alignment, but the compiler wouldn't need to
> generate a load/store instruction for it.

This piece might not be workable, or would make external merges more difficult. That said, doing the null-handling up front when forming sort tuples seems worth doing, and I intend to come back to it at some point.

Regarding binary size, removing the null handling from the comparators shrinks the binary by about 4.3kB. Adding dual pivot on top of that expands it to about 3.5kB larger than HEAD, but I'm pretty sure I could get that down to net zero by replacing the presorted check with a partial insertion sort (as discussed before), and reusing that for the pivot selection as well. Being able to add partial insertion sort with net ~zero binary increase could be another small point in favor of dual-pivot.

I've attached the patches for this experiment for completeness, but withdrawing the CF entry for now.

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment