Thread: some aspects of our qsort might not be ideal
While reviewing Thomas Munro's patchset to consider expanding the uses of specialized qsort [1], I wondered about some aspects of the current qsort implementation. For background, ours is based on Bentley & McIlroy's "Engineering a Sort Function" [2], which is a classic paper worth studying.

1) The check for pre-sorted input at the start of each and every recursion

This has been discussed before [3]. It's great if the input is sorted, and mostly harmless if the current partition is badly sorted enough that the check fails quickly, but it's not hard to imagine that this is mostly a waste of cycles. There have been proposals to base the pre-sorted check on input size [4] or to do it only once at the beginning, but that strikes me as looking for the keys under the lamppost because that's where the light is.

The logical criterion for proceeding to check whether the input is sorted is whether we *think* the input could be sorted. That may sound like a tautology, but consider the case where the partitioning phase didn't perform any swaps. Then it makes sense to check, but we can go further. What if we do the check, but towards the end that check fails? If just a couple of elements are out of place, does it really make sense to give up, partition, and recurse? If just a few iterations of insertion sort are all that is needed to finish, why not just do that? This way, we *dynamically* decide to optimistically start an insertion sort, in the hope that it will occasionally prevent a recursion, and in the worst case the input is slightly more sorted for the next recursion. All we need is a lightweight way to detect that the partitioning phase didn't do any swaps. More on this later.

2) Cardinality issues can cancel abbreviated keys, but our qsort is not optimized for that

Since comparison is very expensive in this case, it stands to reason that qsort could profitably be optimized for a low number of unique keys. The Bentley & McIlroy scheme does take great pains to prevent quadratic behavior from lots of duplicates, but I've found something that might have stronger guarantees: pattern-defeating quicksort (paper: [5], C++ implementation: [6]) claims worst-case runtime of O(nk) for inputs with k distinct elements. This is achieved via an asymmetric partitioning scheme. It's not complex, but it is subtle, so I won't try to describe it here; I recommend reading section 3 of the paper for details. Rust and Boost use PDQuicksort in some form, so it has some production use.

The partitioning scheme just mentioned also provides an easy way to detect when no swaps have been done, thus solving #1 above. There is one requirement that is a double-edged sword: recursion to the partitions must happen in order. This has an additional advantage that in every case but one, insertion sort doesn't need to guard against running off the beginning of the array. The downside for us is that we currently recurse to a partition based on its size as a stack-guarding measure, so that guard would have to be replaced. The C++ implementation of PDQuicksort falls back to heap sort as a last resort to bound runtime, and that would be just as effective at guarding the stack. That's a bit of a snag for making it production-ready, but not enough to prevent testing the actual qsort part.

Does anyone see a reason not to put in the necessary work to try it out?
[1] https://www.postgresql.org/message-id/CA%2BhUKGKztHEWm676csTFjYzortziWmOcf8HDss2Zr0muZ2xfEg%40mail.gmail.com
[2] https://cs.fit.edu/~pkc/classes/writing/samples/bentley93engineering.pdf
[3] https://www.postgresql.org/message-id/flat/87F42982BF2B434F831FCEF4C45FC33E5BD369DF%40EXCHANGE.corporate.connx.com#e69718293c45d89555020bd0922ad055
[4] https://www.postgresql.org/message-id/CA%2BTgmobKFcb6ajVEN8eSnBa78sB3FSwuEWTHXd2x9JC5DOh5OA%40mail.gmail.com
[5] https://arxiv.org/pdf/2106.05123.pdf
[6] https://github.com/orlp/pdqsort

--
John Naylor
EDB: http://www.enterprisedb.com
> [3] https://www.postgresql.org/message-id/flat/87F42982BF2B434F831FCEF4C45FC33E5BD369DF%40EXCHANGE.corporate.connx.com#e69718293c45d89555020bd0922ad055

The top of that thread is where I meant to point to:

https://www.postgresql.org/message-id/flat/CAEYLb_Xn4-6f1ofsf2qduf24dDCVHbQidt7JPpdL_RiT1zBJ6A%40mail.gmail.com

--
John Naylor
EDB: http://www.enterprisedb.com
> This way, we *dynamically*
> decide to optimistically start an insertion sort, in the hopes that it
> will occasionally prevent a recursion, and worst case the input is
> slightly more sorted for the next recursion.

I should mention one detail -- we put a limit on how many iterations we attempt before we give up and partition/recurse. The author's implementation chooses 8 for this limit. The paper mentions this technique in section 5.2, but is not the origin of it.

--
John Naylor
EDB: http://www.enterprisedb.com
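To make that concrete, here is a minimal sketch of the idea on plain ints (hypothetical names, not the eventual patch; the limit of 8 is the one mentioned above):

#include <stdbool.h>
#include <stddef.h>

#define PARTIAL_SORT_LIMIT 8

/*
 * Insertion sort that gives up once more than a handful of elements have
 * turned out to be out of place.  Returns true if the range ended up fully
 * sorted; on false, the caller partitions and recurses as usual, with the
 * input now slightly more sorted than before.
 */
static bool
partial_insertion_sort(int *a, size_t n)
{
    size_t  out_of_place = 0;

    for (size_t i = 1; i < n; i++)
    {
        int     val = a[i];
        size_t  j = i;

        /* sift a[i] down to its place within the sorted prefix a[0..i-1] */
        while (j > 0 && val < a[j - 1])
        {
            a[j] = a[j - 1];
            j--;
        }
        a[j] = val;

        /* give up if too many elements have needed moving */
        if (j != i && ++out_of_place > PARTIAL_SORT_LIMIT)
            return false;
    }
    return true;
}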
On Wed, Feb 16, 2022 at 2:29 AM John Naylor <john.naylor@enterprisedb.com> wrote:
> Does anyone see a reason not to put in the necessary work to try it out?

Seems reasonable to me. It's always a bit difficult, I feel, to know what test cases to use - almost any idea is going to have some case where it's worse than what we do today, and there can always be some user who does that exact thing 100% of the time. Moreover, it's hard to be certain that test cases we construct - say, ordered data, reverse ordered data, randomly ordered data, almost ordered data with a single element out of place, etc. - are actually covering all of the interesting cases. At the same time, I don't think anyone would seriously disagree with what you say in the subject line, and we won't make any progress by NOT trying things that are recommended in the academic literature.

--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,

Attached is a draft series that implements some but not all features of pattern-defeating quicksort, namely the ones I thought were interesting for us. Recently this quicksort variant got committed for the next release of the Go language 1.19 [1] (which in turn was based on that of Rust [2]), and that implementation was a helpful additional example. The bottom line is that replacing the partitioning scheme this way is likely not worth doing because of our particular use cases, but along the way I found some other things that might be worth doing, so some good may come out of this.

0001 is a test module for running a microbenchmark, based on earlier work by Thomas Munro.

0002 is just preliminary refactoring. (From here on some regression tests change from patch to patch, since tests that rely on a sort order might have duplicates in the sort key.)

0003 starts a "partial insertion sort" if partitioning didn't do any swaps, and bails if it doesn't finish quickly. A footnote to what I said earlier:

> All we need is a
> lightweight way to detect that the partitioning phase didn't do any
> swaps.

Actually, Postgres did detect this until a3f0b3d68 [3], but wasn't very bright about it -- it attempted a complete insertion sort. Limiting the iterations is more sensible. In addition to the swap count that I borrowed from ancient PG code, this patch checks the root level as we do today (but now in a different code path), and there are a couple of additional heuristics to limit wasted work:

- Only try it if the pivot candidates are in order.
- If the input is presorted we always return, but if it isn't we will only proceed with swapping elements if the input was large enough that we checked the order of nine pivot candidates.
- Only try partial insertion sort if the partitions are not highly different in size, since bad pivots will lead to fewer swaps in general, causing the swap count heuristic to be unhelpful. (Future work: look into randomizing pivot candidates when encountering such skewed partitions, as some other implementations do.)

0004 builds upon the previous and preemptively reverses the input if there are nine pivot candidates and they are in descending order.

0005 implements the PDQuicksort partitioning scheme. In working on this, I found I need to correct something I said earlier:

> There is one requirement that is a double-edged sword: Recursion to
> the partitions must happen in order. This has an additional advantage
> that in every case but one, insertion sort doesn't need to guard
> against running off the beginning of the array. The downside for us is
> that we currently recurse to a partition based on its size as a
> stack-guarding measure, so that guard would have to be replaced.

This is not true actually (I read too much into an implementation detail), and with a tiny bit of bookkeeping we can retain our practice of recursing to the smaller partition.

A very brief description of 0005: There are two partitioning functions, left and right. Partition-right puts elements >= pivot in the right partition. This is the default. When many duplicates end up in the right partition, the pivot chosen for the next iteration there could be the same as the previous pivot. In that case, we switch to partition-left: put elements equal to the pivot in the left partition. Since all those elements are the same, we are done and can iterate on the right side. With this scheme, only 2-way comparators are needed.
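To make that description concrete, here is my reading of the scheme, sketched on plain ints with explicit bounds checks (the real implementations avoid some of these with sentinels); this is a sketch, not the code in the attached patch. Note that every comparison below can be phrased in terms of a single "less than" test, which is why only 2-way comparators are needed.

/*
 * partition_right: elements >= pivot go to the right partition.  Also
 * reports whether no swaps were needed, i.e. the range was already
 * partitioned around the pivot.
 */
static size_t
partition_right(int *a, size_t n, bool *already_partitioned)
{
    int     pivot = a[0];       /* caller moved the chosen pivot here */
    size_t  lo = 1;
    size_t  hi = n;
    bool    swapped = false;
    int     tmp;

    for (;;)
    {
        while (lo < hi && a[lo] < pivot)
            lo++;
        while (hi > lo && a[hi - 1] >= pivot)
            hi--;
        if (lo >= hi)
            break;

        /* a[lo] >= pivot and a[hi - 1] < pivot: swap them */
        tmp = a[lo];
        a[lo] = a[hi - 1];
        a[hi - 1] = tmp;
        swapped = true;
        lo++;
        hi--;
    }

    /* place the pivot between the two partitions */
    a[0] = a[lo - 1];
    a[lo - 1] = pivot;

    *already_partitioned = !swapped;
    return lo - 1;              /* final pivot position */
}

/*
 * partition_left: used when the pivot equals the element just before this
 * range, i.e. there are likely many duplicates.  Elements equal to the
 * pivot end up in the left partition, which then needs no further work.
 */
static size_t
partition_left(int *a, size_t n)
{
    int     pivot = a[0];
    size_t  lo = 1;
    size_t  hi = n;
    int     tmp;

    for (;;)
    {
        while (lo < hi && a[lo] <= pivot)
            lo++;
        while (hi > lo && a[hi - 1] > pivot)
            hi--;
        if (lo >= hi)
            break;

        tmp = a[lo];
        a[lo] = a[hi - 1];
        a[hi - 1] = tmp;
        lo++;
        hi--;
    }

    a[0] = a[lo - 1];
    a[lo - 1] = pivot;
    return lo - 1;              /* iteration continues to the right of this */
}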
Also, the previous counting of swaps during partitioning is replaced by a simple pointer check once per iteration.

--
Tests:

I have two flavors of tests:

1) ORDER BY queries with 10 million records: Select a single value for three types:
- bigint (so can use datum sort)
- 16-byte abbreviatable text
- 16-byte non-abbreviatable text
(median of 5 runs, script attached)

2) a microbenchmark sorting 1 million int32 datums with two ways to vary comparison schemes: direct C function/fmgr function and runtime/specialized, so four different combinations. The idea here is to get an easy way to test comparators of different "heaviness", not that these actual cases are all necessarily relevant. (minimum of 10 runs)

The input distributions can be put into three categories:
- Varying amounts of randomness and presortedness (random, sort50/90/99, dither)
- Long runs of ascending and descending sequences (ascending, descending, organ, merge)
- Low cardinality (dup8, dupsq, mod100/8)

To get the microbenchmark output in the same format as the query script I ran

psql -c 'select test_sort_cmp_weight(1 * 1024*1024)' 2> result.txt
perl -n -e 'print "$2\t$3\t$1\t$4\n" if /NOTICE: \[(.+)\] num=(\d+), order=(\w+), time=(\d+\.\d+)/;' result.txt > result.csv

--
Summary of results:

Partial insertion sort:
- For small types sometimes possibly a bit faster, sometimes significantly faster.
- With queries, significantly slower for some partially-presorted distributions, otherwise mostly no change.
- Could be further optimized for types known at compile time by using a "sift" algorithm rather than swapping at every iteration. We could also try basing the iteration limit on type size.

Optimistically reversing input:
- 20-25x faster on descending input, which is nice. It's not clear if it would benefit "mostly descending" input.

PDQuicksort:
- With light or moderate types/comparators, it seems significantly faster than B&M.
- With heavier comparators (including all queries), it's slightly faster at best, and regresses noticeably with low cardinality inputs.

This latter finding was somewhat surprising. The O(nk) behavior for low cardinality input is academically interesting, but doesn't translate to gains for some of our key use cases.

--
Conclusion:

The partial insertion sort seems useful if the regressions can be worked out. The reverse-input optimization is a small addition on top of that and seems worth doing. The PDQuicksort partitioning scheme is a win in some cases I tested but a loss in others, so at this point I'm inclined not to pursue that part further.

I'd be happy to hear any thoughts anyone might have.

[1] https://github.com/golang/go/blob/72e77a7f41bbf45d466119444307fd3ae996e257/src/sort/zsortfunc.go
[2] https://github.com/rust-lang/rust/blob/415c9f95884caebd0be9b837ef0885d1ffde1b9b/library/core/src/slice/sort.rs
[3] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=a3f0b3d68f9a5357a3f72b40a45bcc714a9e0649;hp=570b726533fb561bc5a35ba50ea944f139a250d5
Attachment
- v1-0003-Optimistically-attempt-insertion-sort-on-pre-part.patch
- v1-0002-Create-internal-workhorse-for-ST_SORT.patch
- v1-0001-Add-sort-test-module.patch
- v1-0004-Optimize-sorts-on-likely-reversed-input.patch
- v1-0005-Replace-qsort-s-partitioning-scheme.patch
- v1-results-query-graph.ods
- v1-results-microbenchmark-graph.ods
- sort-bench-pdqsort-sql-20220529.sh
On Thu, Jun 2, 2022 at 8:33 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> Attached is a draft series that implements some but not all features
> of pattern-defeating quicksort, namely the ones I thought were
> interesting for us. Recently this quicksort variant got committed for
> the next release of the Go language 1.19 [1] (which in turn was based
> on that of Rust [2]), and that implementation was a helpful additional
> example. The bottom line is that replacing the partitioning scheme
> this way is likely not worth doing because of our particular use
> cases, but along the way I found some other things that might be worth
> doing, so some good may come out of this.

What about dual-pivot quicksort, which is used in Java 7+? That is the de facto successor to Bentley & McIlroy. In fact, Jon Bentley himself collaborated with its author, and provided benchmarking input. The underlying philosophy is essentially the same as the original -- it is supposed to be an "industrial strength" quicksort, with various adversarial cases considered directly. See:

https://www.wild-inter.net/publications/wild-2018b.pdf
https://codeblab.com/wp-content/uploads/2009/09/DualPivotQuicksort.pdf

At one point quite a few years back I planned on investigating it myself, but never followed through.

--
Peter Geoghegan
On Fri, Jun 3, 2022 at 10:44 AM Peter Geoghegan <pg@bowt.ie> wrote:
> What about dual-pivot quicksort, which is used in Java 7+? That is the
> de facto successor to Bentley & McIlroy. In fact, Jon Bentley himself
> collaborated with its author, and provided benchmarking input. The
> underlying philosophy is essentially the same as the original -- it
> is supposed to be an "industrial strength" quicksort, with various
> adversarial cases considered directly.

I had heard of it but not looked into it deeply. I did read that Java 7 uses dual pivot quicksort for primitives and timsort for objects. I wasn't sure if dual pivot was not good for objects (which could have possibly-complex comparators) or if timsort was just simply good for Java's use cases. It seems accessible to try doing, so I'll look into that.

--
John Naylor
EDB: http://www.enterprisedb.com
On Thu, Jun 2, 2022 at 9:24 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> I had heard of it but not looked into it deeply. I did read that Java
> 7 uses dual pivot quicksort for primitives and timsort for objects. I
> wasn't sure if dual pivot was not good for objects (which could have
> possibly-complex comparators) or if timsort was just simply good for
> Java's use cases. It seems accessible to try doing, so I'll look into
> that.

I think that Timsort is most useful with cases where individual comparisons are inherently very expensive (perhaps even implemented in Python), which might be common with objects due to the indirection that Python (and even Java) impose on objects in general.

I would imagine that this is a lot less relevant in Postgres/tuplesort, because we have optimizations like abbreviated keys. And because we're mostly just sorting tuples on one or more attributes of scalar types. The abbreviated keys optimization is very much something that comes from the world of databases, not the world of sorting. It's pretty much a domain-specific technique. That seems relevant to me.

--
Peter Geoghegan
On Fri, Jun 3, 2022 at 10:44 AM Peter Geoghegan <pg@bowt.ie> wrote:
> What about dual-pivot quicksort, which is used in Java 7+? That is the
> de facto successor to Bentley & McIlroy. In fact, Jon Bentley himself
> collaborated with its author, and provided benchmarking input. The
> underlying philosophy is essentially the same as the original -- it
> is supposed to be an "industrial strength" quicksort, with various
> adversarial cases considered directly.

Here is a *rough* first pass at dual-pivot quicksort. I haven't changed the regression tests to adjust for qsort being an unstable sort, and there is some dead code. I looked to a couple JDKs for examples of design decisions as a first approximation. It includes a module with a few debugging printouts, run like so:

select debug_qsort_int(array[7,6,5,4,3,2,1]);

Pivot selection: A common way is to pick five candidates (here: mid, +/- 1/7, +/- 2/7), sort them in place, then pick the 2nd and 4th of them, or "tertile of five". If they are all evenly spaced, we can do insertion sort.

Fall back to single-pivot: If any of the pivot candidates are equal, JDK assumes there could be a large number of duplicates and falls back to single-pivot, using the median of the five. I've done the same here. Despite having two code paths in all of our template instances, the VM binary is only about 5kb bigger, since MED3 is no longer built.

Performance testing: Not started yet. I'm thinking an apples-to-apples comparison is not insightful enough, since the current insertion sort threshold 7 is already a bit constrained for single-pivot, and would be even more so for dual pivot, especially since 5 of the elements are pre-sorted to choose the pivots. My plan is to retest the threshold on HEAD using my latest tests, then use that as a starting point to test thresholds with dual-pivot.

--
John Naylor
EDB: http://www.enterprisedb.com
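For reference, the candidate selection and fallback described above amount to something like the following sketch on plain ints (hypothetical name, not the attached patch; assumes n is comfortably above the insertion sort threshold):

/*
 * Pick five evenly spaced candidates around the midpoint, sort those five
 * elements in place, and use the 2nd and 4th as pivots ("tertile of five").
 * If any candidates compare equal, return false and let the caller fall
 * back to single-pivot quicksort with the median candidate a[c[2]].
 */
static bool
choose_dual_pivots(int *a, size_t n, size_t *pivot1_idx, size_t *pivot2_idx)
{
    size_t  seventh = n / 7;
    size_t  mid = n / 2;
    size_t  c[5] = {
        mid - 2 * seventh, mid - seventh, mid, mid + seventh, mid + 2 * seventh
    };

    /* insertion-sort the five candidate elements in place */
    for (int i = 1; i < 5; i++)
        for (int j = i; j > 0 && a[c[j]] < a[c[j - 1]]; j--)
        {
            int     tmp = a[c[j]];

            a[c[j]] = a[c[j - 1]];
            a[c[j - 1]] = tmp;
        }

    /* equal candidates hint at many duplicates: fall back to single pivot */
    if (a[c[0]] == a[c[1]] || a[c[1]] == a[c[2]] ||
        a[c[2]] == a[c[3]] || a[c[3]] == a[c[4]])
        return false;

    *pivot1_idx = c[1];
    *pivot2_idx = c[3];
    return true;
}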
Attachment
On Thu, Jun 23, 2022 at 6:13 AM John Naylor <john.naylor@enterprisedb.com> wrote: > Here is a *rough* first pass at dual-pivot quicksort. I haven't > changed the regression tests to adjust for qsort being an unstable > sort, ... Hmm. I thought we had some reasons for preferring a stable sort algorithm. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, 23 Jun 2022 at 15:52, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Jun 23, 2022 at 6:13 AM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> > Here is a *rough* first pass at dual-pivot quicksort. I haven't
> > changed the regression tests to adjust for qsort being an unstable
> > sort, ...
>
> Hmm. I thought we had some reasons for preferring a stable sort algorithm.

I think that mostly has to do with reliability / stability of ORDER BY in combination with LIMIT and OFFSET, even though right now we cannot fully guarantee such stability due to unstable results from underlying plan nodes. As an example, a table scan node under a sort node can start its scan at an arbitrary point in the table (using synchronize_seqscans), and because Sort nodes only sort MinimalTuple-s, each set of tuples that have an equal sort value will be ordered by TID + y (mod tablesize), with arbitrary values for y.

- Matthias
On Thu, Jun 23, 2022 at 7:51 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
> I think that mostly has to do with reliability / stability of ORDER BY
> in combination with LIMIT and OFFSET, even though right now we cannot
> fully guarantee such stability due to unstable results from underlying
> plan nodes.

The current qsort isn't stable. While quicksort is never stable, our particular implementation has fairly significant optimizations that strongly rely on not using a stable sort. In particular, the B&M optimizations for duplicate elements. These optimizations make things like datum tuplesorts for 'SELECT(DISTINCT mycol) ...' style queries on low cardinality columns extremely fast. We're not really sorting so much as bucketing. This is based on Dijkstra's Dutch national flag problem.

--
Peter Geoghegan
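(For reference, the bucketing described above is in the spirit of the following three-way partitioning sketch. This illustrates the Dutch national flag idea only, not the actual B&M code, which gathers equal elements at both ends and swaps them into the middle afterwards.)

/*
 * One pass splits the array into <, ==, and > regions.  With low
 * cardinality input the == region is large and needs no further sorting,
 * so most of the work is bucketing rather than comparing.
 */
static void
three_way_partition(int *a, size_t n, int pivot,
                    size_t *lt_end, size_t *gt_begin)
{
    size_t  lt = 0;             /* a[0 .. lt-1]  <  pivot */
    size_t  i = 0;              /* a[lt .. i-1]  == pivot */
    size_t  gt = n;             /* a[gt .. n-1]  >  pivot */
    int     tmp;

    while (i < gt)
    {
        if (a[i] < pivot)
        {
            tmp = a[lt]; a[lt] = a[i]; a[i] = tmp;
            lt++;
            i++;
        }
        else if (a[i] > pivot)
        {
            gt--;
            tmp = a[i]; a[i] = a[gt]; a[gt] = tmp;
        }
        else
            i++;
    }

    *lt_end = lt;               /* recurse on a[0 .. lt-1] ... */
    *gt_begin = gt;             /* ... and on a[gt .. n-1]; the middle is done */
}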
On Thu, 23 Jun 2022 at 17:04, Peter Geoghegan <pg@bowt.ie> wrote: > > On Thu, Jun 23, 2022 at 7:51 AM Matthias van de Meent > <boekewurm+postgres@gmail.com> wrote: > > I think that mostly has to do with reliability / stability of ORDER BY > > in combination with LIMIT and OFFSET, even though right now we cannot > > fully guarantee such stability due to unstable results from underlying > > plan nodes. > > The current qsort isn't stable. Then I misunderstood Robert's comment, thanks for correcting me. - Matthias
I've run the microbenchmarks only so far, since they complete much faster than queries, and I wanted to tell quickly if this is a good avenue to pursue. Later I'll repeat for queries. Methodology is similar to what I did earlier in the thread, so I won't belabor that.

Takeaways:

- Contrary to some testing on single pivot I did some months ago with fewer input distributions, in this test neither single nor dual pivot benefit much from a large insertion sort threshold (i.e. 20 or so). I picked 12 for both to do the head-to-head comparison, since this seems to give a slight boost to specialized sorts on the slowest inputs. They don't have to be the same, of course, but it seems about right with these results. (Of course, it's easy to have specialized sorts define their own threshold, as Thomas Munro has shown.)

- Generic qsort (type length and comparator passed at runtime) sees no benefit from dual-pivot in this test, but specialized qsorts do get a decent speedup.

- Descending inputs get a huge boost when specialized. This is surprising to me, enough that I'm skeptical and want to double check the test. If it's legitimate, I'm happy about it.

I've attached three spreadsheets with graphs, two for threshold tests for single- and dual-pivot, and one comparing single and dual for threshold=12.

0001 is the test module and rough script (not for commit). Note: The server won't build, since the threshold is passed via CPPFLAGS.
0002 is trivial and mostly for assert builds.
0003 is a mostly cleaned up patch for dual pivot, with passing regression tests and a default insertion sort threshold of 12.

I'll add a CF entry for this.

TODO: add results for queries.

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment
I wrote:
> TODO: add results for queries.

Attached are three spreadsheets for queries, similar to the earlier ones for microbenchmarks. There are three datatypes:

- Bigint: by selecting just the value, these can use Datum sorts
- Text: the same number as bigint, but in lpadded text form, followed by 8 identical characters, so 16 bytes.
- Text no-abbrev: also 16 bytes of text derived from bigint, but with the 8 characters at the beginning to defeat abbreviation.

For the single-vs-dual comparison, this time I chose the insertion sort threshold to be 10 for single pivot and 18 for dual-pivot.

I see a big speedup in descending values here as well, completing in 20-50% less time. Everything else is the same, slightly faster, or slightly slower, at least on this machine (Comet Lake, gcc 12.1). The slightly slower cases are ones with duplicates. It would make sense that the duplicate-detection heuristic lags behind where we would ideally be using single-pivot, but the difference is small. In any case, there doesn't seem to be a compelling benefit from dual-pivot for most cases.

I suspect the branches in the sorttuple comparators for null handling are negating the benefits from better d-cache and TLB behavior. In that case, a useful thing to try would be to pre-partition the sorttuple array between null and non-null values as it's being populated, then sort each of those in turn. (The null portion would only need sorting if there was more than one sort key.) That would automatically take care of nulls-first vs nulls-last upfront, save a bunch of binary space used for branches (worth doing on its own), and the "isnull" member would not be needed anymore. That wouldn't save data space because of alignment, but the compiler wouldn't need to generate a load/store instruction for it. gcc does, but clang and msvc just treat the 24-byte struct as 3 8-byte integers:

https://godbolt.org/z/d8j1EvroP

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment
On Tue, Jul 5, 2022 at 12:14 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> I suspect the branches in the sorttuple comparators for null handling
> are negating the benefits from better d-cache and TLB behavior. In
> that case, a useful thing to try would be to pre-partition the
> sorttuple array between null and non-null values as it's being
> populated, then sort each of those in turn. (The null portion would
> only need sorting if there was more than one sort key.) That would
> automatically take care of nulls-first vs nulls-last upfront, save a
> bunch of binary space used for branches (worth doing on its own),
To see if this idea had any legs, I simply ripped out the null handling from the specialized sorttuple comparators. This works for my tests since all the values are not null. Doing this much seems to be more effective than dual pivot in every case but purely-descending input. The combination of the two seems a bit faster still on bigint, but not on text. From this test, it seems the conclusions for tuple sorts with dual-pivot haven't changed: impressive speedup on descending input, not so much elsewhere. (spreadsheet attached, note the insertion sort thresholds are reused from previous testing, namely 10 and 18 for single / dual pivot)
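The branches in question look roughly like this (a simplified illustration, not the actual sorttuple comparator):

/*
 * Simplified shape of the null handling that every comparison normally
 * pays for, whether or not any NULLs are present.  The experiment keeps
 * only the final "real" comparison.
 */
static inline int
demo_compare_datum(long a, bool a_isnull, long b, bool b_isnull,
                   bool nulls_first)
{
    if (a_isnull)
    {
        if (b_isnull)
            return 0;                       /* NULL vs. NULL */
        return nulls_first ? -1 : 1;
    }
    if (b_isnull)
        return nulls_first ? 1 : -1;

    /* the comparison we actually care about */
    return (a < b) ? -1 : (a > b) ? 1 : 0;
}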
> and
> the "isnull" member would not be needed anymore. That wouldn't save
> data space because of alignment, but the compiler wouldn't need to
> generate a load/store instruction for it.
This piece might not be workable, or would make external merges more difficult. That said, doing the null-handling up front when forming sort tuples seems worth doing, and I intend to come back to it at some point.
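A rough sketch of how that pre-partitioning might look, with made-up names and a simplified tuple struct (not tuplesort.c's actual code), assuming the array is sized up front:

typedef struct DemoSortTuple
{
    long    datum1;             /* first-column or abbreviated key */
    void   *tuple;              /* no "isnull" member needed anymore */
} DemoSortTuple;

typedef struct DemoSortState
{
    DemoSortTuple *tuples;
    size_t      capacity;
    size_t      n_notnull;      /* non-null keys fill [0, n_notnull) */
    size_t      n_null;         /* null keys fill [capacity - n_null, capacity) */
} DemoSortState;

/* While loading, keep tuples with a NULL first key at the far end of the
 * array, so the comparators used for the main sort need no null branches. */
static void
demo_puttuple(DemoSortState *state, long datum1, bool isnull, void *tuple)
{
    DemoSortTuple *slot;

    if (isnull)
        slot = &state->tuples[state->capacity - (++state->n_null)];
    else
        slot = &state->tuples[state->n_notnull++];

    slot->datum1 = datum1;
    slot->tuple = tuple;
}

/* To sort: run the null-free comparator over [0, n_notnull); sort the null
 * region only if there is more than one sort key; emit the two regions in
 * whichever order NULLS FIRST/LAST dictates. */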
Regarding binary size, removing the null handling from the comparators shrinks the binary by about 4.3kB. Adding dual pivot on top of that expands it to about 3.5kB larger than HEAD, but I'm pretty sure I could get that down to net zero by replacing the presorted check with a partial insertion sort (as discussed before), and reusing that for the pivot selection as well. Being able to add partial insertion sort with net ~zero binary increase could be another small point in favor of dual-pivot.
I've attached the patches for this experiment for completeness, but withdrawing the CF entry for now.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachment
- v3-0005-Set-insertion-sort-threshold-to-18.patch
- v3-0006-Revert-Remove-null-comparisons-from-sort-comparat.patch
- v3-0004-Implement-dual-pivot-quicksort.patch
- remove-null-cmp-sp10-dp18.ods
- v3-0003-Create-internal-workhorse-for-ST_SORT.patch
- v3-0002-Remove-null-comparisons-from-sort-comparators.patch
- v3-0001-Set-insertion-sort-threshold-to-10.patch