Thread: An implementation of multi-key sort

An implementation of multi-key sort

From
Wang Yao
Date:
Hi hackers,

I'd like to submit an implementation of multi-key sort for review. Please
see the code attached. Thanks in advance for your response.


Overview
--------

MKsort (multi-key sort) is an alternative to the standard qsort algorithm,
which has better performance in particular sort scenarios, i.e. when the
data set has multiple keys to be sorted.

The implementation is based on the paper:
Jon L. Bentley and Robert Sedgewick, "Fast Algorithms for Sorting and
Searching Strings", Jan 1997 [1]
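
For reference, here is a minimal, self-contained sketch of multi-key
quicksort as described in the paper, shown on plain arrays of small int
tuples rather than SortTuple (purely illustrative, not the patch's code).
Ties on the current key move on to the next key; partitions with distinct
values never touch the later keys:

#define NKEYS 3

typedef int Tuple[NKEYS];

static void
swap_tuple(Tuple a, Tuple b)
{
    for (int k = 0; k < NKEYS; k++)
    {
        int tmp = a[k];

        a[k] = b[k];
        b[k] = tmp;
    }
}

/* Sort a[lo..hi] on keys depth..NKEYS-1 using three-way partitioning. */
static void
mk_qsort(Tuple *a, int lo, int hi, int depth)
{
    if (lo >= hi || depth >= NKEYS)
        return;

    int pivot = a[lo + (hi - lo) / 2][depth];
    int lt = lo, i = lo, gt = hi;

    while (i <= gt)             /* split into <, =, > on the current key */
    {
        if (a[i][depth] < pivot)
            swap_tuple(a[lt++], a[i++]);
        else if (a[i][depth] > pivot)
            swap_tuple(a[i], a[gt--]);
        else
            i++;
    }

    mk_qsort(a, lo, lt - 1, depth);     /* '<' part: same key */
    mk_qsort(a, lt, gt, depth + 1);     /* '=' part: next key */
    mk_qsort(a, gt + 1, hi, depth);     /* '>' part: same key */
}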

The patch applies mksort only to tuple sort. Theoretically it could be
applied to general-purpose sort scenarios whenever multiple sort keys are
available, but that is relatively difficult in practice because a uniform
interface would be needed to manipulate the keys. So I limit the usage of
mksort to sorting SortTuple.

Compared to classic quick sort, it can achieve a significant performance
improvement when multiple keys are available. A rough test shows ~129%
improvement over qsort for ORDER BY on 6 keys, and ~52% for CREATE INDEX
on the same data set. (See more details in section "Performance Test")

Author: Yao Wang <yaowangm@outlook.com>
Co-author: Hongxu Ma <interma@outlook.com>

Scope
-----

The interface of mksort is pretty simple: in tuplesort_sort_memtuples(),
mksort_tuple() is invoked instead of qsort_tuple() if mksort is applicable.
The major logic in mksort_tuple() applies the mksort algorithm to
SortTuple, and a callback mechanism is used to handle sort-variant-specific
issues (e.g. comparing different datums), just as qsort_tuple() does. It
also handles the complexity of "abbreviated keys".
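
Roughly, the dispatch looks like this (the names follow the description
above, but this is a sketch rather than the patch's exact code):

static void
tuplesort_sort_memtuples(Tuplesortstate *state)
{
    if (state->memtupcount > 1)
    {
        if (mksort_applicable(state))   /* multiple keys, supported variant */
            mksort_tuple(state->memtuples, state->memtupcount,
                         0 /* start at the first sort key */, state);
        else
            qsort_tuple(state->memtuples, state->memtupcount,
                        state->base.comparetup, state);
    }
}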

A small difference from the classic mksort algorithm: for IndexTuple, when
all the columns are equal, an additional comparison based on ItemPointer
is performed to determine the order. This keeps the result consistent
with the existing qsort.
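
A sketch of that tie-break (ItemPointerCompare() is the existing server
routine; the wrapper name here is hypothetical):

#include "access/itup.h"
#include "storage/itemptr.h"

static int
mksort_index_tiebreak(IndexTuple t1, IndexTuple t2)
{
    /* Order equal-keyed tuples by heap TID, matching the qsort path. */
    return ItemPointerCompare(&t1->t_tid, &t2->t_tid);
}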

I did consider implementing mksort with a template mechanism like qsort's
(see sort_template.h), but it seems unnecessary because all the concrete
tuple types that need to be handled are derived from SortTuple. Using
callbacks to isolate type-specific features is good enough.

Note that not all tuple types are supported by mksort. Please see the
comments inside tuplesort_sort_memtuples().

Test Cases
----------

The changes of test cases include:

* Generally, mksort should generate exactly the same results as qsort.
However, some test cases don't. The reason is that the SQL doesn't specify
an order on all possible columns, e.g. "select c1, c2 from t1 order by c1"
can generate different results under mksort/qsort when c1 values are equal,
and the solution is to order by c2 as well ("select c1, c2 from t1 order by
c1, c2"). (e.g. geometry)
* Some cases need to be updated to display the new sort method "multi-key
sort" in explain result. (e.g. incremental_sort)
* regress/tuplesort was updated with new cases to cover some scenarios of
mksort.

Performance Test
----------------

The script I used to configure the build:

CFLAGS="-O3 -fargument-noalias-global -fno-omit-frame-pointer -g"
./configure --prefix=$PGHOME --with-pgport=5432 --with-perl --with-openssl
--with-python --with-pam --with-blocksize=16 --with-wal-blocksize=16
--with-perl --enable-tap-tests --with-gssapi --with-ldap

I used the script for a rough test for ORDER BY:

\timing on
create table t1 (c1 int, c2 int, c3 int, c4 int, c5 int, c6 varchar(100));
insert into t1 values (generate_series(1,499999), 0, 0, 0, 0,
  'aaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbb');
update t1 set c2 = c1 % 100, c3 = c1 % 50, c4 = c1 % 10, c5 = c1 % 3;
update t1 set c6 = 'aaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbb'
|| (c1 % 5)::text;

-- Use a large work mem to ensure the entire sort happens in memory
set work_mem='1GB';

-- switch between qsort/mksort
set enable_mk_sort=off;

explain analyze select c1 from t1 order by c6, c5, c4, c3, c2, c1;

Results:

mksort:
1341.283 ms (00:01.341)
1379.098 ms (00:01.379)
1369.868 ms (00:01.370)

qsort:
3137.277 ms (00:03.137)
3147.771 ms (00:03.148)
3131.887 ms (00:03.132)

The perf improvement is ~129%.

Another perf test for CREATE INDEX:

create index idx_t1_mk on t1 (c6, c5, c4, c3, c2, c1);

Results:

mksort:
1147.207 ms (00:01.147)
1200.501 ms (00:01.201)
1235.657 ms (00:01.236)

Qsort:
1852.957 ms (00:01.853)
1824.209 ms (00:01.824)
1808.781 ms (00:01.809)

The perf improvement is ~52%.

Another test is to use one of the TPC-H [2] queries:

set work_mem='1GB';

-- query rewritten from TPCH-Q1, and there are 6001215 rows in lineitem
explain analyze select
 l_returnflag,l_linestatus,l_quantity,l_shipmode
from
 lineitem
where
 l_shipdate <= date'1998-12-01' - interval '65 days'
order by
 l_returnflag,l_linestatus,l_quantity,l_shipmode;

Result:

Qsort:
14582.626 ms
14524.188 ms
14524.111 ms

mksort:
11390.891 ms
11647.065 ms
11546.791 ms

The perf improvement is ~25.8%.

[1] https://www.cs.tufts.edu/~nr/cs257/archive/bob-sedgewick/fast-strings.pdf
[2] https://www.tpc.org/tpch/


Thanks,

Yao Wang
Attachment

Re: An implementation of multi-key sort

From
Heikki Linnakangas
Date:
On 22/05/2024 15:48, Wang Yao wrote:
> Compared to classic quick sort, it can achieve a significant performance
> improvement when multiple keys are available. A rough test shows ~129%
> improvement over qsort for ORDER BY on 6 keys, and ~52% for CREATE INDEX
> on the same data set. (See more details in section "Performance
> Test")

Impressive. Did you test the performance of the cases where MK-sort 
doesn't help, to check if there is a performance regression?

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: An implementation of multi-key sort

From
Wang Yao
Date:
No obvious perf regression is expected because PG will follow the original
qsort code path when mksort is disabled. In that case, the only extra cost
is the check in tuplesort_sort_memtuples() to enter the mksort code path.

It's also proved by the experiment today:

Mksort disabled:
2949.287 ms
2955.258 ms
2947.262 ms

No mksort code:
2947.094 ms
2946.419 ms
2953.215 ms

Almost the same.

I also updated code with small enhancements. Please see the latest code
as attachment.


Thanks,

Yao Wang

From: Heikki Linnakangas <hlinnaka@iki.fi>
Sent: May 22, 2024 23:29
To: Wang Yao <yaowangm@outlook.com>; PostgreSQL Hackers <pgsql-hackers@postgresql.org>
Cc: interma@outlook.com <interma@outlook.com>
Subject: Re: An implementation of multi-key sort
 
On 22/05/2024 15:48, Wang Yao wrote:
> Compared to classic quick sort, it can achieve a significant performance
> improvement when multiple keys are available. A rough test shows ~129%
> improvement over qsort for ORDER BY on 6 keys, and ~52% for CREATE INDEX
> on the same data set. (See more details in section "Performance
> Test")

Impressive. Did you test the performance of the cases where MK-sort
doesn't help, to check if there is a performance regression?

--
Heikki Linnakangas
Neon (https://neon.tech)

Attachment

Re: An implementation of multi-key sort

From
Heikki Linnakangas
Date:
On 23/05/2024 15:39, Wang Yao wrote:
> No obvious perf regression is expected because PG will follow the original
> qsort code path when mksort is disabled. In that case, the only extra cost
> is the check in tuplesort_sort_memtuples() to enter the mksort code path.

And what about the case where mksort is enabled, but it's not effective
because all leading keys are different?

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: An implementation of multi-key sort

From
Yao Wang
Date:
When all leading keys are different, mksort will finish the entire sort at
the first sort key and never touch the other keys. In that case, mksort
effectively falls back to a kind of qsort.

I created another data set with distinct values in all sort keys:

create table t2 (c1 int, c2 int, c3 int, c4 int, c5 int, c6 varchar(100));
insert into t2 values (generate_series(1,499999), 0, 0, 0, 0, '');
update t2 set c2 = 999990 - c1, c3 = 999991 - c1, c4 = 999992 - c1, c5
= 999993 - c1;
update t2 set c6 = 'aaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbb'
  || (999994 - c1)::text;
explain analyze select c1 from t2 order by c6, c5, c4, c3, c2, c1;

Results:

MKsort:
12374.427 ms
12528.068 ms
12554.718 ms

qsort:
12251.422 ms
12279.938 ms
12280.254 ms

MKsort is a bit slower than qsort, which can be explained by MKsort's
extra checks.

Yao Wang

On Fri, May 24, 2024 at 8:36 PM Wang Yao <yaowangm@outlook.com> wrote:
>
>
>
> Get Outlook for Android
> ________________________________
> From: Heikki Linnakangas <hlinnaka@iki.fi>
> Sent: Thursday, May 23, 2024 8:47:29 PM
> To: Wang Yao <yaowangm@outlook.com>; PostgreSQL Hackers <pgsql-hackers@postgresql.org>
> Cc: interma@outlook.com <interma@outlook.com>
> Subject: Re: An implementation of multi-key sort
>
> On 23/05/2024 15:39, Wang Yao wrote:
> > No obvious perf regression is expected because PG will follow the original
> > qsort code path when mksort is disabled. In that case, the only extra cost
> > is the check in tuplesort_sort_memtuples() to enter the mksort code path.
>
> And what about the case where mksort is enabled, but it's not effective
> because all leading keys are different?
>
> --
> Heikki Linnakangas
> Neon (https://neon.tech)
>




Re: An implementation of multi-key sort

From
Yao Wang
Date:
I added two optimizations to mksort which already exist in qsort_tuple():

1. When selecting the pivot, always pick the item in the middle of the
array rather than at random. Theoretically this has the same effect as
the old approach, but it eliminates some unstable perf test results, plus
a small perf benefit from removing the random value generator.
2. Always check whether the array is already ordered, and return
immediately if it is. The pre-ordered check adds extra cost and impacts
perf numbers on some data sets, but can improve perf significantly on
other data sets. (See the sketch after this list.)
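
Sketches of the two optimizations, under assumed names (SortTuple is the
existing struct; MKSortContext and mksort_comparetup are placeholders for
the patch's actual context/callback):

/* 1. Deterministic pivot: the middle element of the current range. */
#define MKSORT_PIVOT(lo, hi)    ((lo) + ((hi) - (lo)) / 2)

/* 2. Pre-ordered check: true if tuples[0..n-1] is already sorted. */
static bool
mksort_preordered(SortTuple *tuples, int n, int depth, MKSortContext *ctx)
{
    for (int i = 1; i < n; i++)
    {
        if (mksort_comparetup(&tuples[i - 1], &tuples[i], depth, ctx) > 0)
            return false;
    }
    return true;
}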

With these changes, mksort has perf results equal to or better than qsort
on all the data sets I have used.

I also updated the test cases. Please see the v3 code as attachment.

Perf test results:

Data set 1 (with mass duplicate values):
-----------------------------------------

create table t1 (c1 int, c2 int, c3 int, c4 int, c5 int, c6 varchar(100));
insert into t1 values (generate_series(1,499999), 0, 0, 0, 0,
'aaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbb');
update t1 set c2 = c1 % 100, c3 = c1 % 50, c4 = c1 % 10, c5 = c1 % 3;
update t1 set c6 = 'aaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbb'
|| (c1 % 5)::text;

Query 1:

explain analyze select c1 from t1 order by c6, c5, c4, c3, c2, c1;

Disable Mksort

3021.636 ms
3014.669 ms
3033.588 ms

Enable Mksort

1688.590 ms
1686.956 ms
1688.567 ms

The improvement is 78.9%, which is reduced from the previous version
(129%). Most of the extra cost should be the pre-ordered check.

Query 2:

create index idx_t1_mk on t1 (c6, c5, c4, c3, c2, c1);

Disable Mksort

1674.648 ms
1680.608 ms
1681.373 ms

Enable Mksort

1143.341 ms
1143.462 ms
1143.894 ms

The improvement is ~47%, which is also reduced a bit from before (52%).

Data set 2 (with distinct values):
----------------------------------

create table t2 (c1 int, c2 int, c3 int, c4 int, c5 int, c6 varchar(100));
insert into t2 values (generate_series(1,499999), 0, 0, 0, 0, '');
update t2 set c2 = 999990 - c1, c3 = 999991 - c1, c4 = 999992 - c1, c5
= 999993 - c1;
update t2 set c6 = 'aaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbb'
|| (999994 - c1)::text;

Query 1:

explain analyze select c1 from t2 order by c6, c5, c4, c3, c2, c1;

Disable Mksort

12199.963 ms
12197.068 ms
12191.657 ms

Enable Mksort

9538.219 ms
9571.681 ms
9536.335 ms

The improvement is 27.9%, which is much better than the old approach (-6.2%).

Query 2 (the data is pre-ordered):

explain analyze select c1 from t2 order by c6 desc, c5, c4, c3, c2, c1;

Enable Mksort

768.191 ms
768.079 ms
767.026 ms

Disable Mksort

768.757 ms
766.166 ms
766.149 ms

They are almost the same since no actual sort was performed, and much
better than the old approach (-1198.1%).


Thanks,

Yao Wang

On Fri, May 24, 2024 at 8:50 PM Yao Wang <yao-yw.wang@broadcom.com> wrote:
>
> When all leading keys are different, mksort will finish the entire sort at
> the first sort key and never touch the other keys. In that case, mksort
> effectively falls back to a kind of qsort.
>
> I created another data set with distinct values in all sort keys:
>
> create table t2 (c1 int, c2 int, c3 int, c4 int, c5 int, c6 varchar(100));
> insert into t2 values (generate_series(1,499999), 0, 0, 0, 0, '');
> update t2 set c2 = 999990 - c1, c3 = 999991 - c1, c4 = 999992 - c1, c5
> = 999993 - c1;
> update t2 set c6 = 'aaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbb'
>   || (999994 - c1)::text;
> explain analyze select c1 from t2 order by c6, c5, c4, c3, c2, c1;
>
> Results:
>
> MKsort:
> 12374.427 ms
> 12528.068 ms
> 12554.718 ms
>
> qsort:
> 12251.422 ms
> 12279.938 ms
> 12280.254 ms
>
> MKsort is a bit slower than qsort, which can be explained by MKsort's
> extra checks.
>
> Yao Wang
>
> On Fri, May 24, 2024 at 8:36 PM Wang Yao <yaowangm@outlook.com> wrote:
> >
> >
> >
> > Get Outlook for Android
> > ________________________________
> > From: Heikki Linnakangas <hlinnaka@iki.fi>
> > Sent: Thursday, May 23, 2024 8:47:29 PM
> > To: Wang Yao <yaowangm@outlook.com>; PostgreSQL Hackers <pgsql-hackers@postgresql.org>
> > Cc: interma@outlook.com <interma@outlook.com>
> > Subject: Re: An implementation of multi-key sort
> >
> > On 23/05/2024 15:39, Wang Yao wrote:
> > > No obvious perf regression is expected because PG will follow the original
> > > qsort code path when mksort is disabled. In that case, the only extra cost
> > > is the check in tuplesort_sort_memtuples() to enter the mksort code path.
> >
> > And what about the case where mksort is enabled, but it's not effective
> > because all leading keys are different?
> >
> > --
> > Heikki Linnakangas
> > Neon (https://neon.tech)
> >


Attachment

Re: An implementation of multi-key sort

From
Yao Wang
Date:
To be accurate, "multi-key sort" includes both "multi-key quick sort"
and "multi-key heap sort". This patch includes code changes related
only to "multi-key quick sort", which is used to replace the standard
quick sort for tuplesort. "Multi-key heap sort" is an implementation of
a multi-key heap and should be treated as a separate task. We need
to clarify the naming to avoid confusion.

I updated the code; the changes are limited to function/variable renaming
and relevant comments, plus some minor assertion changes. Please see the
attachment.


Thanks,

Yao Wang

On Fri, May 31, 2024 at 8:09 PM Yao Wang <yao-yw.wang@broadcom.com> wrote:
>
> I added two optimizations to mksort which already exist in qsort_tuple():
>
> 1. When selecting the pivot, always pick the item in the middle of the
> array rather than at random. Theoretically this has the same effect as
> the old approach, but it eliminates some unstable perf test results,
> plus a small perf benefit from removing the random value generator.
> 2. Always check whether the array is already ordered, and return
> immediately if it is. The pre-ordered check adds extra cost and impacts
> perf numbers on some data sets, but can improve perf significantly on
> other data sets.
>
> With these changes, mksort has perf results equal to or better than qsort
> on all the data sets I have used.
>
> I also updated the test cases. Please see the v3 code as attachment.
>
> Perf test results:
>
> Data set 1 (with mass duplicate values):
> -----------------------------------------
>
> create table t1 (c1 int, c2 int, c3 int, c4 int, c5 int, c6 varchar(100));
> insert into t1 values (generate_series(1,499999), 0, 0, 0, 0,
> 'aaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbb');
> update t1 set c2 = c1 % 100, c3 = c1 % 50, c4 = c1 % 10, c5 = c1 % 3;
> update t1 set c6 = 'aaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbb'
> || (c1 % 5)::text;
>
> Query 1:
>
> explain analyze select c1 from t1 order by c6, c5, c4, c3, c2, c1;
>
> Disable Mksort
>
> 3021.636 ms
> 3014.669 ms
> 3033.588 ms
>
> Enable Mksort
>
> 1688.590 ms
> 1686.956 ms
> 1688.567 ms
>
> The improvement is 78.9%, which is reduced from the previous version
> (129%). Most of the extra cost should be the pre-ordered check.
>
> Query 2:
>
> create index idx_t1_mk on t1 (c6, c5, c4, c3, c2, c1);
>
> Disable Mksort
>
> 1674.648 ms
> 1680.608 ms
> 1681.373 ms
>
> Enable Mksort
>
> 1143.341 ms
> 1143.462 ms
> 1143.894 ms
>
> The improvement is ~47%, which is also reduced a bit from before (52%).
>
> Data set 2 (with distinct values):
> ----------------------------------
>
> create table t2 (c1 int, c2 int, c3 int, c4 int, c5 int, c6 varchar(100));
> insert into t2 values (generate_series(1,499999), 0, 0, 0, 0, '');
> update t2 set c2 = 999990 - c1, c3 = 999991 - c1, c4 = 999992 - c1, c5
> = 999993 - c1;
> update t2 set c6 = 'aaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbb'
> || (999994 - c1)::text;
>
> Query 1:
>
> explain analyze select c1 from t2 order by c6, c5, c4, c3, c2, c1;
>
> Disable Mksort
>
> 12199.963 ms
> 12197.068 ms
> 12191.657 ms
>
> Enable Mksort
>
> 9538.219 ms
> 9571.681 ms
> 9536.335 ms
>
> The improvement is 27.9%, which is much better than the old approach (-6.2%).
>
> Query 2 (the data is pre-ordered):
>
> explain analyze select c1 from t2 order by c6 desc, c5, c4, c3, c2, c1;
>
> Enable Mksort
>
> 768.191 ms
> 768.079 ms
> 767.026 ms
>
> Disable Mksort
>
> 768.757 ms
> 766.166 ms
> 766.149 ms
>
> They are almost the same since no actual sort was performed, and much
> better than the old approach (-1198.1%).
>
>
> Thanks,
>
> Yao Wang
>
> On Fri, May 24, 2024 at 8:50 PM Yao Wang <yao-yw.wang@broadcom.com> wrote:
> >
> > When all leading keys are different, mksort will finish the entire sort at
> > the first sort key and never touch the other keys. In that case, mksort
> > effectively falls back to a kind of qsort.
> >
> > I created another data set with distinct values in all sort keys:
> >
> > create table t2 (c1 int, c2 int, c3 int, c4 int, c5 int, c6 varchar(100));
> > insert into t2 values (generate_series(1,499999), 0, 0, 0, 0, '');
> > update t2 set c2 = 999990 - c1, c3 = 999991 - c1, c4 = 999992 - c1, c5
> > = 999993 - c1;
> > update t2 set c6 = 'aaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbb'
> >   || (999994 - c1)::text;
> > explain analyze select c1 from t2 order by c6, c5, c4, c3, c2, c1;
> >
> > Results:
> >
> > MKsort:
> > 12374.427 ms
> > 12528.068 ms
> > 12554.718 ms
> >
> > qsort:
> > 12251.422 ms
> > 12279.938 ms
> > 12280.254 ms
> >
> > MKsort is a bit slower than qsort, which can be explained by MKsort's
> > extra checks.
> >
> > Yao Wang
> >
> > On Fri, May 24, 2024 at 8:36 PM Wang Yao <yaowangm@outlook.com> wrote:
> > >
> > >
> > >
> > > Get Outlook for Android
> > > ________________________________
> > > From: Heikki Linnakangas <hlinnaka@iki.fi>
> > > Sent: Thursday, May 23, 2024 8:47:29 PM
> > > To: Wang Yao <yaowangm@outlook.com>; PostgreSQL Hackers <pgsql-hackers@postgresql.org>
> > > Cc: interma@outlook.com <interma@outlook.com>
> > > Subject: Re: An implementation of multi-key sort
> > >
> > > On 23/05/2024 15:39, Wang Yao wrote:
> > > > No obvious perf regression is expected because PG will follow the original
> > > > qsort code path when mksort is disabled. In that case, the only extra cost
> > > > is the check in tuplesort_sort_memtuples() to enter the mksort code path.
> > >
> > > And what about the case where mksort is enabled, but it's not effective
> > > because all leading keys are different?
> > >
> > > --
> > > Heikki Linnakangas
> > > Neon (https://neon.tech)
> > >


Attachment

Re: An implementation of multi-key sort

From
Tomas Vondra
Date:
Hello Yao,

I was interested in the patch, considering the promise of significant
speedups of sorting, so I took a quick look and did some basic perf
testing today. Unfortunately, my benchmarks don't really confirm any
performance benefits, so I haven't looked at the code very much and only
have some very basic feedback:

1) The new GUC is missing from the .sample config, triggering a failure
of "make check-world". Fixed by 0002.

2) There's a place mixing tabs/spaces in indentation. Fixed by 0003.

3) I tried running pgindent, mostly to see how that would affect the
comments, and for most it's probably fine, but a couple are mangled
(usually those with a numbered list of items). Might need some changes
to use formatting that's not reformatted like this. The changes from
pgindent are in 0004, but this is not a fix - it just shows the changes
after running pgindent.

Now, regarding the performance tests - I decided to do the usual black
box testing, i.e. generate tables with varying numbers of columns, data
types, different data distribution (random, correlated, ...) and so on.
And then run simple ORDER BY queries on that, measuring timing with and
without mk-sort, and checking the effect.

So I wrote a simple bash script (attached) that does exactly that - it
generates a table with 1k - 10M rows, fills it with data (with some
basic simple data distributions), and then runs the queries.

The raw results are too large to attach, I'm only attaching a PDF
showing the summary with a "speedup heatmap" - it's a pivot with the
parameters on the left, and then the GUC and number of columns on top.
So the first group of columns is with enable_mk_sort=off, the second
group with enable_mk_sort=on, and finally the heatmap with relative
timing (enable_mk_sort=on / enable_mk_sort=off).

So values <100% mean it got faster (green color - good), and values
>100% mean it got slower (red - bad). And the thing is - pretty much
everything is red, often in the 200%-300% range, meaning it got 2x-3x
slower. There's only very few combinations where it got faster. That
does not seem very promising ... but maybe I did something wrong?

After seeing this, I took a look at your example again, which showed
some nice speedups. But it seems very dependent on the order of keys in
the ORDER BY clause. For example consider this:

set enable_mk_sort = off;
explain (analyze, timing off)
select * from t1 order by c6, c5, c4, c3, c2, c1;

                         QUERY PLAN
-------------------------------------------------------------------
 Sort  (cost=72328.81..73578.81 rows=499999 width=76)
       (actual rows=499999 loops=1)
   Sort Key: c6, c5, c4, c3, c2, c1
   Sort Method: quicksort  Memory: 59163kB
   ->  Seq Scan on t1  (cost=0.00..24999.99 rows=499999 width=76)
                       (actual rows=499999 loops=1)
 Planning Time: 0.054 ms
 Execution Time: 1095.183 ms
(6 rows)

set enable_mk_sort = on;
explain (analyze, timing off)
select * from t1 order by c6, c5, c4, c3, c2, c1;

                         QUERY PLAN
-------------------------------------------------------------------
 Sort  (cost=72328.81..73578.81 rows=499999 width=76)
       (actual rows=499999 loops=1)
   Sort Key: c6, c5, c4, c3, c2, c1
   Sort Method: multi-key quick sort  Memory: 59163kB
   ->  Seq Scan on t1  (cost=0.00..24999.99 rows=499999 width=76)
                       (actual rows=499999 loops=1)
 Planning Time: 0.130 ms
 Execution Time: 633.635 ms
(6 rows)

Which seems great, but let's reverse the sort keys:

set enable_mk_sort = off;
explain (analyze, timing off)
select * from t1 order by c1, c2, c3, c4, c5, c6;

                         QUERY PLAN
-------------------------------------------------------------------

 Sort  (cost=72328.81..73578.81 rows=499999 width=76)
       (actual rows=499999 loops=1)
   Sort Key: c1, c2, c3, c4, c5, c6
   Sort Method: quicksort  Memory: 59163kB
   ->  Seq Scan on t1  (cost=0.00..24999.99 rows=499999 width=76)
                       (actual rows=499999 loops=1)
 Planning Time: 0.146 ms
 Execution Time: 170.085 ms
(6 rows)

set enable_mk_sort = on;
explain (analyze, timing off)
select * from t1 order by c1, c2, c3, c4, c5, c6;

                         QUERY PLAN
-------------------------------------------------------------------
 Sort  (cost=72328.81..73578.81 rows=499999 width=76)
       (actual rows=499999 loops=1)
   Sort Key: c1, c2, c3, c4, c5, c6
   Sort Method: multi-key quick sort  Memory: 59163kB
   ->  Seq Scan on t1  (cost=0.00..24999.99 rows=499999 width=76)
                       (actual rows=499999 loops=1)
 Planning Time: 0.127 ms
 Execution Time: 367.263 ms
(6 rows)

I believe this is the case Heikki was asking about. I see the response
was that it's OK and the overhead is very low, but without much
detail, so I don't know which case you measured.

Anyway, I think it seems to be very sensitive to the exact data set.
Which is not entirely surprising, I guess - most optimizations have a
mix of improved/regressed cases, yielding a heatmap with a mix of green
and red areas, and we have to either optimize the code (or heuristics to
enable the feature), or convince ourselves the "red" cases are less
important / unlikely etc.

But here the results are almost universally "red", so it's going to be
very hard to convince ourselves this is a good trade off. Of course, you
may argue the cases I've tested are wrong and not representative. I
don't think that's the case, though.

It's also interesting (and perhaps a little bit bizarre) that almost all
the cases that got better are for a single-column sort. Which is exactly
the case the patch should not affect. But it seems pretty consistent, so
maybe this is something worth investigating.

FWIW I'm not familiar with the various quicksort variants, but I noticed
that the Bentley & Sedgewick paper mentioned as the basis for the patch
is from 1997, and apparently implements stuff originally proposed by
Hoare in 1961. So maybe this is just an example of an algorithm that was
good for the hardware of that time, but the changes (e.g. the growing
importance of on-CPU caches) made it less relevant?

Another thing I noticed while skimming [1] is this:

    The algorithm is designed to exploit the property that in many
    problems, strings tend to have shared prefixes.

If that's the case, isn't it wrong to apply this to all sorts, including
sorts with non-string keys? It might explain why your example works OK,
as it involves key c6 which is string with all values sharing the same
(fairly long) prefix. But then maybe we should be careful and restrict
this to only those cases?

regards

[1] https://en.wikipedia.org/wiki/Multi-key_quicksort

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment

Re: An implementation of multi-key sort

From
Yao Wang
Date:
hi Tomas,

So many thanks for your kind response and detailed report. I am working
on locating issues based on your report/script and optimizing the code, and
will update later.

Could you please also send me the script to generate report pdf
from the test results (explain*.log)? I can try to make one by myself,
but I'd like to get a report exactly the same as yours. It's really
helpful.

Thanks in advance.


Yao Wang

On Mon, Jun 10, 2024 at 5:09 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> Hello Yao,
>
> I was interested in the patch, considering the promise of significant
> speedups of sorting, so I took a quick look and did some basic perf
> testing today. Unfortunately, my benchmarks don't really confirm any
> performance benefits, so I haven't looked at the code very much and only
> have some very basic feedback:
>
> 1) The new GUC is missing from the .sample config, triggering a failure
> of "make check-world". Fixed by 0002.
>
> 2) There's a place mixing tabs/spaces in indentation. Fixed by 0003.
>
> 3) I tried running pgindent, mostly to see how that would affect the
> comments, and for most it's probably fine, but a couple are mangled
> (usually those with a numbered list of items). Might need some changes
> to use formatting that's not reformatted like this. The changes from
> pgindent are in 0004, but this is not a fix - it just shows the changes
> after running pgindent.
>
> Now, regarding the performance tests - I decided to do the usual black
> box testing, i.e. generate tables with varying numbers of columns, data
> types, different data distribution (random, correlated, ...) and so on.
> And then run simple ORDER BY queries on that, measuring timing with and
> without mk-sort, and checking the effect.
>
> So I wrote a simple bash script (attached) that does exactly that - it
> generates a table with 1k - 10M rows, fills it with data (with some
> basic simple data distributions), and then runs the queries.
>
> The raw results are too large to attach, I'm only attaching a PDF
> showing the summary with a "speedup heatmap" - it's a pivot with the
> parameters on the left, and then the GUC and number of columns on top.
> So the first group of columns is with enable_mk_sort=off, the second
> group with enable_mk_sort=on, and finally the heatmap with relative
> timing (enable_mk_sort=on / enable_mk_sort=off).
>
> So values <100% mean it got faster (green color - good), and values
> >100% mean it got slower (red - bad). And the thing is - pretty much
> everything is red, often in the 200%-300% range, meaning it got 2x-3x
> slower. There's only very few combinations where it got faster. That
> does not seem very promising ... but maybe I did something wrong?
>
> After seeing this, I took a look at your example again, which showed
> some nice speedups. But it seems very dependent on the order of keys in
> the ORDER BY clause. For example consider this:
>
> set enable_mk_sort = off;
> explain (analyze, timing off)
> select * from t1 order by c6, c5, c4, c3, c2, c1;
>
>                          QUERY PLAN
> -------------------------------------------------------------------
>  Sort  (cost=72328.81..73578.81 rows=499999 width=76)
>        (actual rows=499999 loops=1)
>    Sort Key: c6, c5, c4, c3, c2, c1
>    Sort Method: quicksort  Memory: 59163kB
>    ->  Seq Scan on t1  (cost=0.00..24999.99 rows=499999 width=76)
>                        (actual rows=499999 loops=1)
>  Planning Time: 0.054 ms
>  Execution Time: 1095.183 ms
> (6 rows)
>
> set enable_mk_sort = on;
> explain (analyze, timing off)
> select * from t1 order by c6, c5, c4, c3, c2, c1;
>
>                          QUERY PLAN
> -------------------------------------------------------------------
>  Sort  (cost=72328.81..73578.81 rows=499999 width=76)
>        (actual rows=499999 loops=1)
>    Sort Key: c6, c5, c4, c3, c2, c1
>    Sort Method: multi-key quick sort  Memory: 59163kB
>    ->  Seq Scan on t1  (cost=0.00..24999.99 rows=499999 width=76)
>                        (actual rows=499999 loops=1)
>  Planning Time: 0.130 ms
>  Execution Time: 633.635 ms
> (6 rows)
>
> Which seems great, but let's reverse the sort keys:
>
> set enable_mk_sort = off;
> explain (analyze, timing off)
> select * from t1 order by c1, c2, c3, c4, c5, c6;
>
>                          QUERY PLAN
> -------------------------------------------------------------------
>
>  Sort  (cost=72328.81..73578.81 rows=499999 width=76)
>        (actual rows=499999 loops=1)
>    Sort Key: c1, c2, c3, c4, c5, c6
>    Sort Method: quicksort  Memory: 59163kB
>    ->  Seq Scan on t1  (cost=0.00..24999.99 rows=499999 width=76)
>                        (actual rows=499999 loops=1)
>  Planning Time: 0.146 ms
>  Execution Time: 170.085 ms
> (6 rows)
>
> set enable_mk_sort = on;
> explain (analyze, timing off)
> select * from t1 order by c1, c2, c3, c4, c5, c6;
>
>                          QUERY PLAN
> -------------------------------------------------------------------
>  Sort  (cost=72328.81..73578.81 rows=499999 width=76)
>        (actual rows=499999 loops=1)
>    Sort Key: c1, c2, c3, c4, c5, c6
>    Sort Method: multi-key quick sort  Memory: 59163kB
>    ->  Seq Scan on t1  (cost=0.00..24999.99 rows=499999 width=76)
>                        (actual rows=499999 loops=1)
>  Planning Time: 0.127 ms
>  Execution Time: 367.263 ms
> (6 rows)
>
> I believe this is the case Heikki was asking about. I see the response
> was that it's OK and the overhead is very low, but without much
> detail, so I don't know which case you measured.
>
> Anyway, I think it seems to be very sensitive to the exact data set.
> Which is not entirely surprising, I guess - most optimizations have a
> mix of improved/regressed cases, yielding a heatmap with a mix of green
> and red areas, and we have to either optimize the code (or heuristics to
> enable the feature), or convince ourselves the "red" cases are less
> important / unlikely etc.
>
> But here the results are almost universally "red", so it's going to be
> very hard to convince ourselves this is a good trade off. Of course, you
> may argue the cases I've tested are wrong and not representative. I
> don't think that's the case, though.
>
> It's also interesting (and perhaps a little bit bizarre) that almost all
> the cases that got better are for a single-column sort. Which is exactly
> the case the patch should not affect. But it seems pretty consistent, so
> maybe this is something worth investigating.
>
> FWIW I'm not familiar with the various quicksort variants, but I noticed
> that the Bentley & Sedgewick paper mentioned as the basis for the patch
> is from 1997, and apparently implements stuff originally proposed by
> Hoare in 1961. So maybe this is just an example of an algorithm that was
> good for the hardware of that time, but the changes (e.g. the growing
> importance of on-CPU caches) made it less relevant?
>
> Another thing I noticed while skimming [1] is this:
>
>     The algorithm is designed to exploit the property that in many
>     problems, strings tend to have shared prefixes.
>
> If that's the case, isn't it wrong to apply this to all sorts, including
> sorts with non-string keys? It might explain why your example works OK,
> as it involves key c6 which is string with all values sharing the same
> (fairly long) prefix. But then maybe we should be careful and restrict
> this to only those cases?
>
> regards
>
> [1] https://en.wikipedia.org/wiki/Multi-key_quicksort
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company




Re: An implementation of multi-key sort

From
Tomas Vondra
Date:

On 6/14/24 13:20, Yao Wang wrote:
> hi Tomas,
> 
> So many thanks for your kind response and detailed report. I am working
> on locating issues based on your report/script and optimizing the code, and
> will update later.
> 
> Could you please also send me the script to generate report pdf
> from the test results (explain*.log)? I can try to make one by myself,
> but I'd like to get a report exactly the same as yours. It's really
> helpful.
> 

I don't have a script for that. I simply load the results into a
spreadsheet, do a pivot table to "aggregate and reshuffle" it a bit, and
then add a heatmap. I use google sheets for this, but any other
spreadsheet should handle this too, I think.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: An implementation of multi-key sort

From
John Naylor
Date:
On Fri, Jun 14, 2024 at 6:20 PM Yao Wang <yao-yw.wang@broadcom.com> wrote:
>
> hi Tomas,
>
> So many thanks for your kind response and detailed report. I am working
> on locating issues based on your report/script and optimizing code, and
> will update later.

Hi,
This is an interesting proof-of-concept!

Given the above, I've set this CF entry to "waiting on author".

Also, I see you've added Heikki as a reviewer. I'm not sure how others
think, but I consider a "reviewer" in the CF app to be someone who has
volunteered to be responsible for helping move this patch forward. If
there is a name in the reviewer column, it may discourage others from
doing review. It can also happen that people ping reviewers to ask
"There's been no review for X months -- are you planning on looking at
this?", and it's not great if that message is a surprise.

Note that we prefer not to top-post in emails since it makes our web
archive more difficult to read.

Thanks,
John



Re: An implementation of multi-key sort

From
Yao Wang
Date:
Hi John,

Thanks for your kind message. I talked to Heikki before getting Tomas's
response, and he said "no promise but I will take a look". That's why I
added his email. I have updated the CF entry and added Tomas as reviewer.

Hi Tomas,

Again, a big thank you. The report and script are really, really
helpful. And your ideas are very valuable.

Firstly, the expectation of mksort performance:

1. When mksort works well, it should be faster than qsort because it saves
the cost of comparing duplicated values every time.
2. When all values are distinct in a particular column, the comparison
will finish immediately, and mksort will actually fall back to qsort. In
that case, mksort should be equal to or a bit slower than qsort because
it needs to maintain more complex state.

Generally, the benefit of mksort comes mainly from duplicated values and
sort keys: the more duplicated values and sort keys there are, the bigger
the benefit.

Analysis on the report in your previous mail
--------------------------------------------

1. It seems the script uses $count to specify the duplicated values:

number of repetitions for each value (ndistinct = nrows/count)

However, it is not always correct. For the text type, the script generates
values like this:

expr="md5(((i / $count) + random())::text)"

But md5() generates totally random values regardless of $count. Some cases
of timestamptz have the same problem.

With all-distinct values, the sort finishes at the first depth and
effectively falls back to qsort.

2. Even for the types with a correct duplicate setting, the duplicate
ratio is very small: e.g. with $nrows = 10000 and $count = 100, only 1% of
duplicated rows can go to depth 2, and only 0.01% of them can go to depth
3. So it still works on nearly all-distinct values.

3. PG17's qsort uses a kind of specialization for the tuple comparator,
i.e. it uses specialized functions for different types, e.g.
qsort_tuple_unsigned() for unsigned ints. The specialized comparators
avoid all type-related checks and are much faster than the regular
comparator. That is why we saw 200% or more regression in those cases.
(See the sketch after this list.)
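
Roughly, that dispatch looks like this (paraphrased and simplified from
PG17's tuplesort.c; conditions such as haveDatum1 are elided):

if (state->base.sortKeys[0].comparator == ssup_datum_unsigned_cmp)
    qsort_tuple_unsigned(state->memtuples, state->memtupcount, state);
/* ... analogous branches for ssup_datum_signed_cmp and
 * ssup_datum_int32_cmp ... */
else
    qsort_tuple(state->memtuples, state->memtupcount,
                state->base.comparetup, state);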


Code optimizations I did for mk qsort
-------------------------------------

1. Adapted specialization for the tuple comparator.
2. Used a kind of "hybrid" sort: when we fall back to bubble sort because
there are few sort items, the bubble sort compares datums starting from
the specified depth. (See the sketch after this list.)
3. Other optimizations such as the pre-ordered check.
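
For item 2, a sketch with hypothetical names (mksort_comparetup_depth()
stands in for the patch's depth-aware comparator):

/*
 * Bubble sort a small partition, comparing keys from 'depth' onward,
 * since keys before 'depth' are already known to be equal.
 */
static void
mksort_bubble(SortTuple *tuples, int n, int depth, MKSortContext *ctx)
{
    for (int i = 0; i < n - 1; i++)
    {
        for (int j = 0; j < n - 1 - i; j++)
        {
            if (mksort_comparetup_depth(&tuples[j], &tuples[j + 1],
                                        depth, ctx) > 0)
            {
                SortTuple   tmp = tuples[j];

                tuples[j] = tuples[j + 1];
                tuples[j + 1] = tmp;
            }
        }
    }
}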


Analysis on the new report
--------------------------

I also made some modifications to your script for the data type issues,
plus an output of the distinct value count/distinct ratio, and an
indicator for improvement/regression. I attached the new script and a
report on a data set with 100,000 rows and 2, 5, and 8 columns.

1. Generally, the results match the expectation: "When mksort works well,
it should be faster than qsort; when mksort falls back to qsort, it should
be equal to or a bit slower than qsort."
2. For all values of "sequential" (except text type), mksort is a bit slower
than qsort because no actual sort is performed due to the "pre-ordered"
check.
3. For int and bigint type, mksort became faster and faster when
there were more and more duplicated values and sort keys. Improvement of
the best cases is about 58% (line 333) and 57% (line 711).
4. For timestamptz type, mksort is a bit slower than qsort because the
distinct ratio is always 1 for almost all cases. I think more benefit is
available by increasing the duplicated values.
5. For text type, mksort is faster than qsort for all cases, and
improvement of the best case is about 160% (line 1510). It is the only
tested type in which specialization comparators are disabled.

Obviously, text shows much better improvement than the others. I suppose
the cause is the specialization comparators: for the types that have them,
the comparison is so fast that the cost saved by mksort is not
significant. Only when the saved cost becomes big enough can mksort beat
qsort.

For the types without specialization comparators, mksort can beat qsort
completely. That is the "real" performance of mksort.


Answers for some other questions you mentioned
----------------------------------------------

Q1: Why are almost all the cases that got better for a single-column sort?

A: mksort is enabled only for multi-column sorts. When there is only one
column, qsort is used. So we can simply ignore those cases.

Q2: Why did the perf become worse by just reversing the sort keys?

A: In the example we used, the sort keys are ordered from more duplicated
to less. Please see the SQL:

update t1 set c2 = c1 % 100, c3 = c1 % 50, c4 = c1 % 10, c5 = c1 % 3;
update t1 set c6 = 'aaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbb'
|| (c1 % 5)::text;

So c6 has the most duplicated values, c5 has fewer, and so on. With the
order "c6, c5, c4, ...", mksort can take effect on every sort key.

With the reverse order "c2, c3, c4, ...", mksort almost finishes on the
first sort key (c2) because it has only 1% duplicated values, and
effectively falls back to qsort.

Based on the new code, I reran the example, and got about 141% improvement
for order "c6, c5, c4...", and about -4% regression for order
"c2, c3, c4...".

Q3: Does mksort work effectively only for particular types, e.g. strings?

A: No, the implementation of mksort does not give any data type special
handling. It just calls the existing comparators which are also used by
qsort. I used a long prefix for strings just to enlarge the comparison
cost and amplify the result. The new report shows mksort can work
effectively on non-string types and on strings without a long prefix.

Q4: Was the algorithm good for the hardware of that time, but the changes
(e.g. the growing importance of on-CPU caches) made it less relevant?

A: In my understanding, the answer is no, because the benefit of mksort
comes from saving the cost of duplicated comparisons, which is not related
to hardware. I suppose the new report can prove it.

However, hardware variation definitely affects the perf, especially
considering that the perf difference between mksort and qsort is not so
big when mksort falls back to qsort. I am not able to test on a wide
range of hardware, so any findings are appreciated.


Potential improvement spaces
----------------------------

I tried some other optimizations but ultimately didn't add the code,
because the benefit is not certain and/or the implementation is complex.
I raise them here for more discussion if necessary:

1. Use distinct stats info of table to enable mksort

It's a kind of heuristic: in the optimizer, check a table's
Form_pg_statistic->stadistinct via pg_statistic. Enable mksort only when
it is less than a threshold. (See the sketch below.)

The hacked code works, but it needs to modify a couple of optimizer
interfaces. In addition, a complete solution should consider the types and
distinct values of all columns, which might be too complex, and the
benefit seems not so big.
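
A rough sketch of the heuristic using the existing selfuncs.c helpers
(mksort_worthwhile() and MKSORT_MIN_DUPLICATES are hypothetical names):

#include "utils/selfuncs.h"

#define MKSORT_MIN_DUPLICATES   10      /* hypothetical threshold */

static bool
mksort_worthwhile(PlannerInfo *root, Node *leading_key, double nrows)
{
    VariableStatData vardata;
    double      ndistinct;
    bool        isdefault;

    examine_variable(root, leading_key, 0, &vardata);
    ndistinct = get_variable_numdistinct(&vardata, &isdefault);
    ReleaseVariableStats(vardata);

    /* Prefer mksort only when each leading-key value repeats enough. */
    return !isdefault && ndistinct > 0 &&
        nrows / ndistinct >= MKSORT_MIN_DUPLICATES;
}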

2. Cache of datum positions

e.g. for heap tuples, we need to extract the datum position from the
SortTuple via extract_heaptuple_from_sorttuple() for comparing, and this
is executed for each datum. By comparison, qsort does it once for each
tuple. Theoretically we can create a cache to remember the datum
positions and avoid duplicated extraction.

The hacked code works, but the improvement seems limited. I'm not sure if
more improvement space is available.

3. Template mechanism

Qsort uses a kind of template mechanism via macros (see sort_template.h),
which avoids the cost of runtime type checks. Theoretically the template
mechanism could be applied to mksort, but I am hesitant because it would
impose more complexity and make the code more difficult to maintain.

Please let me know your opinion, thanks!

Yao Wang


Attachment

Re: An implementation of multi-key sort

From
Konstantin Knizhnik
Date:
On 04/07/2024 3:45 pm, Yao Wang wrote:
> Generally, the benefit of mksort comes mainly from duplicated values and
> sort keys: the more duplicated values and sort keys there are, the bigger
> the benefit.
...
> 1. Use distinct stats info of table to enable mksort
>
> It's a kind of heuristic: in the optimizer, check a table's
> Form_pg_statistic->stadistinct via pg_statistic. Enable mksort only when
> it is less than a threshold.
>
> The hacked code works, but it needs to modify a couple of optimizer
> interfaces. In addition, a complete solution should consider the types and
> distinct values of all columns, which might be too complex, and the
> benefit seems not so big.


If mksort really provides an advantage only when there are a lot of
duplicates (for prefix keys?), and with a small fraction of duplicates
there is even some (small) regression, then IMHO taking into account the
planner's information about the estimated number of distinct values seems
really important. What was the problem with accessing these statistics,
and why does it require modifying optimizer interfaces? There is a
`get_variable_numdistinct` function which is defined and used only in
selfuncs.c

Information about value distribution seems to be quite useful for
choosing the optimal sort algorithm, and not only for the multi-key sort
optimization. For example, if we know the min/max values of a sort key
and the range is small, we can use an O(N) algorithm for sorting (a
counting-sort sketch follows). Also, it can help to estimate when a
TOP-N search is preferable.
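
For illustration, the O(N) idea as a counting sort over a known small
key range (purely a sketch, not related to the patch):

#include <stdlib.h>

/* Counting sort for keys known to lie in [min_key, max_key]. */
static void
counting_sort(int *keys, int n, int min_key, int max_key)
{
    int     range = max_key - min_key + 1;
    int    *counts = calloc(range, sizeof(int));
    int     pos = 0;

    for (int i = 0; i < n; i++)
        counts[keys[i] - min_key]++;

    /* Emit each value count times: O(n + range) overall. */
    for (int v = 0; v < range; v++)
        while (counts[v]-- > 0)
            keys[pos++] = v + min_key;

    free(counts);
}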

Right now Postgres creates a special path for incremental sort. I am not
sure if we also need a separate path for mk-sort.
But IMHO if we need to change some optimizer interfaces to be able to
take statistics into account and choose the preferred sort algorithm at
planning time, then it should be done.
If mksort can speed up sorting by more than two times (for a large number
of duplicates), it would be nice to take it into account when choosing
the optimal plan.

Also, in this case we would not need an extra GUC for explicitly enabling
mksort. There are too many parameters for the optimizer, and adding one
more will make tuning more complex. So I prefer that the decision be
taken by the optimizer itself based on the available information,
especially if the criteria seem to be obvious.


Best regards,
Konstantin




Re: An implementation of multi-key sort

From
Tomas Vondra
Date:
Hello,

Thanks for posting a new version of the patch, and for reporting a bunch
of issues in the bash scripts I used for testing. I decided to repeat
those fixed tests on both the old and new versions of the patch, and I
finally have the results from three machines (the i5/xeon I usually use,
and also an rpi5 for fun).

The complete scripts, raw results (CSV), and various reports (ODS and
PDF) are available in my github:

  https://github.com/tvondra/mksort-tests

I'm not going to attach all of it to this message, because the raw CSV
results alone are ~3MB for each of the three machines.

You can do your own analysis on the raw CSV results, of course - see the
'csv' directory, there are data for the clean branch and the two patch
versions.

But I've also prepared PDF reports comparing how the patches work on
each of the machines - see the 'pdf' directory. There are two types of
reports, depending on what's compared to what.

The general report structure is the same - columns with results for
different combinations of parameters, followed by comparison of the
results and a heatmap (red - bad/regression, green - good/speedup).

The "patch comparison" reports compare v5/v4, so it's essentially

    (timing with v5) / (timing with v4)

with the mksort enabled or disabled. And the charts are pretty green,
which means v5 is much faster than v4 - so seems like a step in the
right direction.

The "patch impact" reports compare v4/master and v5/master, i.e. this is
what the users would see after an upgrade. Attached is a small example
from the i5 machine, but the other machines behave in almost exactly the
same way (including the tiny rpi5).

For v4, the results were not great - almost everything regressed (red
color), except for the "text" data type (green).

You can immediately see v5 does much better - it still regresses, but
the regressions are way smaller. And the speedup for "text" is actually
a bit more significant (there's more/darker green).

So as I said before, I think v5 is definitely moving in the right
direction, but the regressions still seem far too significant. If you're
sorting a lot of text data, then sure - this will help a lot. But if
you're sorting int data, and it happens to be random/correlated, you're
going to pay 10-20% more. That's not great.

I haven't analyzed the code very closely, and I don't have a great idea
on how to fix this. But I think to make this patch committable, this
needs to be solved.

Considering the benefits seem to be pretty specific to "text" (and
perhaps some other data types), maybe the best solution would be to only
enable this for those cases. Yes, there are some cases where this helps
for the other data types too, but that also comes with the regressions.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment

Re: An implementation of multi-key sort

From
Tomas Vondra
Date:

On 7/4/24 14:45, Yao Wang wrote:
> Hi John,
> 
> Thanks for your kind message. I talked to Heikki before getting Tomas's
> response, and he said "no promise but I will take a look". That's why I
> added his email. I have updated the CF entry and added Tomas as reviewer.
> 
> Hi Tomas,
> 
> Again, a big thank you. The report and script are really, really
> helpful. And your ideas are very valuable.
> 
> Firstly, the expectation of mksort performance:
> 
> 1. When mksort works well, it should be faster than qsort because it saves
> the cost of comparing duplicated values every time.
> 2. When all values are distinct in a particular column, the comparison
> will finish immediately, and mksort will actually fall back to qsort. In
> that case, mksort should be equal to or a bit slower than qsort because
> it needs to maintain more complex state.
> 
> Generally, the benefit of mksort comes mainly from duplicated values and
> sort keys: the more duplicated values and sort keys there are, the bigger
> the benefit.
> 
> Analysis on the report in your previous mail
> --------------------------------------------
> 
> 1. It seems the script uses $count to specify the duplicated values:
> 
> number of repetitions for each value (ndistinct = nrows/count)
> 
> However, it is not always correct. For the text type, the script generates
> values like this:
> 
> expr="md5(((i / $count) + random())::text)"
> 
> But md5() generates totally random values regardless of $count. Some cases
> of timestamptz have the same problem.
> 
> With all-distinct values, the sort finishes at the first depth and
> effectively falls back to qsort.
> 

You're right, thanks for noticing / fixing this.

> 2. Even for the types with a correct duplicate setting, the duplicate
> ratio is very small: e.g. with $nrows = 10000 and $count = 100, only 1% of
> duplicated rows can go to depth 2, and only 0.01% of them can go to depth
> 3. So it still works on nearly all-distinct values.
> 

True, but that's why the scripts test with much larger data sets too,
with more comparisons needing to look at other columns. It'd be possible
to construct data sets that are likely to benefit more from mksort - I'm
not against doing that, but then there's the question of what data sets
are more representative of what users actually do.

I'd say a random data set like the ones I used are fairly common - it's
fine to not improve them, but we should not regress them.

> 3. PG17's qsort uses a kind of specialization for the tuple comparator,
> i.e. it uses specialized functions for different types, e.g.
> qsort_tuple_unsigned() for unsigned ints. The specialized comparators
> avoid all type-related checks and are much faster than the regular
> comparator. That is why we saw 200% or more regression in those cases.
> 

OK, I'm not familiar with this code enough to have an opinion.

> 
> Code optimizations I did for mk qsort
> -------------------------------------
> 
> 1. Adapted specialization for the tuple comparator.
> 2. Used a kind of "hybrid" sort: when we fall back to bubble sort because
> there are few sort items, the bubble sort compares datums starting from
> the specified depth.
> 3. Other optimizations such as the pre-ordered check.
> 
> 
> Analysis on the new report
> --------------------------
> 
> I also made some modifications to your script for the data type issues,
> plus an output of the distinct value count/distinct ratio, and an
> indicator for improvement/regression. I attached the new script and a
> report on a data set with 100,000 rows and 2, 5, and 8 columns.
> 

OK, but I think a report for a single data set size is not sufficient to
evaluate a patch like this; it can easily miss various caching effects
etc. The results I shared a couple of minutes ago are from 1000 to 10M
rows, and give a much more complete view.

> 1. Generally, the results match the expectation: "When mksort works well,
> it should be faster than qsort; when mksort falls back to qsort, it should
> be equal to or a bit slower than qsort."

The challenge is how to know in advance if mksort is likely to work well.

> 2. For all "sequential" data sets (except the text type), mksort is a bit
> slower than qsort because no actual sort is performed, due to the
> "pre-ordered" check.

OK

> 3. For the int and bigint types, mksort became faster and faster as the
> duplicate values and sort keys increased. The improvement in the best cases
> is about 58% (line 333) and 57% (line 711).

I find it hard to interpret the text-only report, but I suppose these
are essentially the "green" patches in the PDF report I attached to my
earlier message. And indeed, there are nice improvements, but only in
cases with very many duplicates, and the price for that is 10-20%
regressions in the other cases. That does not seem like a great
trade-off to me.

> 4. For the timestamptz type, mksort is a bit slower than qsort because the
> distinct ratio is 1 in almost all cases. I think more benefit would be
> available by increasing the number of duplicates.

Yeah, this was a bug in my script, generating too many distinct values.
After fixing that, it behaves pretty much exactly like int/bigint, which
is not really surprising.

> 5. For the text type, mksort is faster than qsort in all cases, and the
> improvement in the best case is about 160% (line 1510). It is the only
> tested type for which the specialized comparators are disabled.
> 

Correct.

> Obviously, text shows a much bigger improvement than the others. I suppose
> the cause is the specialized comparators: for the types that have them, the
> comparison is so fast that the cost saved by mksort is not significant.
> Only when the saved cost becomes big enough can mksort beat qsort.
> 
> For the types without specialized comparators, mksort can beat
> qsort completely. It is the "real" performance of mksort.
> 

No opinion, but if this is the case, then maybe the best solution is to
only use mksort for types without specialized comparators.

> 
> Answers for some other questions you mentioned
> ----------------------------------------------
> 
> Q1: Why were almost all the cases that got better single-column sorts?
> 
> A: mksort is enabled only for multi-column sorts. When there is only one
> column, qsort is used. So we can simply ignore those cases.
> 
> Q2: Why did the perf become worse by just reversing the sort keys?
> 
> A: In the example we used, the sort keys are ordered from more duplicates
> to fewer. Please see the SQL:
> 
> update t1 set c2 = c1 % 100, c3 = c1 % 50, c4 = c1 % 10, c5 = c1 % 3;
> update t1 set c6 = 'aaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbb'
> || (c1 % 5)::text;
> 
> So c6 has the most duplicate values, c5 has fewer, and so on. With the
> order "c6, c5, c4, ...", mksort can take effect on every sort key.
> 
> With the reverse order "c2, c3, c4...", mksort almost finishes on the
> first sort key (c2) because it has only 1% duplicate values, and
> effectively falls back to qsort.
> 
> Based on the new code, I reran the example and got about a 141% improvement
> for the order "c6, c5, c4..." and about a 4% regression for the order
> "c2, c3, c4...".
> 

OK
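
(To make the depth mechanics concrete: the recursion in Bentley &
Sedgewick's scheme looks roughly like the sketch below. This is a
simplified illustration with an invented partition_by_key() helper, not
the patch's actual mk_qsort_tuple.c.

    /* Simplified multi-key quicksort sketch (not the patch's code). */
    static void
    mksort_sketch(SortTuple *tuples, int n, int depth, int nkeys)
    {
        int     lt;     /* end of the "less than pivot" partition */
        int     gt;     /* start of the "greater than pivot" partition */

        if (n < 2 || depth >= nkeys)
            return;     /* single tuple, or all keys already compared */

        /* three-way partition on the key at 'depth' (details omitted) */
        partition_by_key(tuples, n, depth, &lt, &gt);

        mksort_sketch(tuples, lt, depth, nkeys);               /* less    */
        mksort_sketch(tuples + lt, gt - lt, depth + 1, nkeys); /* equal   */
        mksort_sketch(tuples + gt, n - gt, depth, nkeys);      /* greater */
    }

With a nearly unique first key, the "equal" partitions are tiny and the
recursion rarely advances past depth 0, which is the fall-back-to-qsort
behavior described above.)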

> Q3: Does mksort work effectively only for particular type, e.g. string?
> 
> A: No, the implementation of mksort does not single out any data type for
> special handling. It just calls the existing comparators, which are also
> used by qsort. I used a long prefix for strings just to increase the
> comparison cost and amplify the result. The new report shows mksort works
> effectively on non-string types and on strings without a long prefix.
> 

Maybe I misunderstood, but it seems to help much less for types
with specialized comparators, so maybe I'd rephrase this to say it only
works effectively for types without them.

> Q4: Was the algorithm good for the hardware of its time, but changes since
> (e.g. the growing importance of on-CPU caches) have made it less relevant?
> 
> A: My understanding is that the answer is no, because the benefit of
> mksort comes from saving the cost of duplicate comparisons, which is not
> related to hardware. I suppose the new report proves it.
> 
> However, hardware variation definitely affects the performance, especially
> considering that the performance difference between mksort and qsort is
> not so big when mksort falls back to qsort. I am not able to test on a
> wide range of hardware, so any findings are appreciated.
> 

OK. FWIW I think it's important to test with a range of data set sizes
precisely to evaluate these hardware-related effects (some of which are
related to the sizes of various caches).

> 
> Potential improvement spaces
> ----------------------------
> 
> I tried some other optimizations but ultimately didn't add the code
> because the benefit is uncertain and/or the implementation is complex. I
> raise them here for more discussion if necessary:
> 
> 1. Use distinct stats info of table to enable mksort
> 
> It's a kind of heuristic: in the optimizer, check a table's
> Form_pg_statistic->stadistinct via pg_statistic. Enable mksort only when
> it is less than a threshold.
> 
> The hacked code works, but it needs to modify a couple of optimizer
> interfaces. In addition, a complete solution should consider the types and
> distinct values of all columns, which might be too complex, and the
> benefit seems not so big.
> 

I assume that's not in v5, or did I miss a part of the patch doing it?


In any case, I suspect relying on stadistinct is going to be unreliable.
It's known to be pretty likely to be off, especially if this is
about multiple columns, which can be correlated in some way.

It would be much better if we could make this decision at runtime, based
on some cheap heuristics. Not sure if that's possible, though.
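
Just to illustrate the idea, a cheap runtime check might have a shape like
the following. This is entirely hypothetical (function name, sample size,
and threshold are all invented), and a tie on a possibly abbreviated
datum1 is only a hint, not proof of a real duplicate:

    /* Hypothetical sketch only; not part of any patch version. */
    static bool
    mksort_seems_profitable(SortTuple *tuples, int n, Tuplesortstate *state)
    {
        int     sample = Min(n, 100);
        int     ties = 0;

        if (n < 2)
            return false;

        /* count adjacent input pairs that tie on the leading key */
        for (int i = 1; i < sample; i++)
        {
            if (ApplySortComparator(tuples[i - 1].datum1, tuples[i - 1].isnull1,
                                    tuples[i].datum1, tuples[i].isnull1,
                                    &state->base.sortKeys[0]) == 0)
                ties++;
        }

        /* arbitrary cutoff: use mksort if at least 20% of pairs tie */
        return ties * 5 >= sample - 1;
    }

Whether something this crude could actually avoid the regressions is
exactly the open question.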

> 2. Cache of datum positions
> 
> e.g. for heap tuples, we need to extract the datum position from the
> SortTuple via extract_heaptuple_from_sorttuple() for comparison, and this
> is executed for each datum. By contrast, qsort does it once per tuple.
> Theoretically we could create a cache to remember the datum positions and
> avoid the duplicated extraction.
> 
> The hacked code works, but the improvement seems limited. Not sure if more
> improvement is available there.
> 

No idea. But it seems more like an independent optimization than a fix
for the cases where mksort v5 regresses, right?
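
(For the record, such a cache would presumably look something like this;
hypothetical, names invented:

    /*
     * One entry per in-memory tuple, filled lazily on first access, so
     * deeper depths can fetch column values without re-extracting them
     * from the tuple.
     */
    typedef struct MkSortDatumCache
    {
        Datum  *values;     /* one slot per sort key column */
        bool   *isnull;     /* matching null flags */
        bool    valid;      /* set once the arrays are populated */
    } MkSortDatumCache;

The extra memory per tuple might be part of why the observed improvement
was limited.)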

> 3. Template mechanism
> 
> Qsort uses a kind of template mechanism via macros (see sort_template.h),
> which avoids the cost of runtime type checks. Theoretically the template
> mechanism could be applied to mksort, but I am hesitant because it would
> impose more complexity and the code would become difficult to maintain.
> 

No idea, but same as above - I don't see how templating could address
the regressions.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: 回复: An implementation of multi-key sort

From
Tomas Vondra
Date:

On 7/7/24 08:32, Konstantin Knizhnik wrote:
> 
> On 04/07/2024 3:45 pm, Yao Wang wrote:
>> Generally, the benefit of mksort comes mainly from duplicate values and
>> sort keys: the more duplicate values and sort keys there are, the bigger
>> the benefit.
> ...
>> 1. Use distinct stats info of table to enable mksort
>>
>> It's a kind of heuristic: in the optimizer, check a table's
>> Form_pg_statistic->stadistinct via pg_statistic. Enable mksort only when
>> it is less than a threshold.
>>
>> The hacked code works, but it needs to modify a couple of optimizer
>> interfaces. In addition, a complete solution should consider the types
>> and distinct values of all columns, which might be too complex, and the
>> benefit seems not so big.
> 
> 
> If mksort really provides an advantage only when there are a lot of
> duplicates (for prefix keys?), and with a small fraction of duplicates
> there is even some (small) regression, then IMHO taking into account the
> planner's information about the estimated number of distinct values seems
> really important. What was the problem with accessing these statistics,
> and why does it require modifying optimizer interfaces? There is a
> `get_variable_numdistinct` function which is defined and used only in
> selfuncs.c
> 

Yeah, I've been wondering about that too. But I'm also a bit unsure whether
using these known-unreliable statistics (especially with filters and
multiple columns) would actually fix the regressions.

> Information about the value distribution seems to be quite useful for
> choosing the optimal sort algorithm, not only for the multi-key sort
> optimization. For example, if we know the min/max values of the sort key
> and the range is small, we can use an O(N) algorithm for sorting. It can
> also help to estimate when a top-N search is preferable.
> 

This assumes the information is accurate / reliable, and I'm far from
sure about that.

> Right now Postgres creates a special path for incremental sort. I am not
> sure if we also need a separate path for mk-sort.
> But IMHO if we need to change some optimizer interfaces to be able to
> take statistics into account and choose the preferred sort algorithm at
> planning time, then it should be done.
> If mksort can speed up a sort by more than two times (for a large number
> of duplicates), it would be nice to take it into account when choosing
> the optimal plan.
> 

I did commit the incremental sort patch, and TBH I'm not convinced I'd
do that again. It's a great optimization when it works (and it seems to
work in plenty of cases), but we've also had a number of reports about
significant regressions, where the incremental sort costing is quite
off. Granted, it's often about cases where we already had issues and
incremental sort just "exacerbates" that (say, with LIMIT queries), but
that's kinda the point I'm trying to make - stats are inherently
incomplete / simplified, and some plans are more sensitive to that.

Which is why I'm wondering if we might make the decision based on some
information collected at runtime.

> Also, in this case we would not need an extra GUC for explicitly enabling
> mksort. There are too many optimizer parameters already, and adding one
> more will make tuning more complex. So I prefer that the decision be taken
> by the optimizer itself based on the available information, especially if
> the criteria seem obvious.
> 

The GUC is very useful for testing, so let's keep it for now.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: 回复: An implementation of multi-key sort

From
Tomas Vondra
Date:
BTW I forgot to report that I intended to test this on 32-bit ARM too,
because that sometimes triggers "funny" behavior, but the build fails
like this:

In file included from tuplesort.c:630:
mk_qsort_tuple.c: In function ‘mkqs_compare_datum_by_shortcut’:
mk_qsort_tuple.c:167:23: warning: implicit declaration of function
‘ApplySignedSortComparator’; did you mean ‘ApplyUnsignedSortComparator’?
[-Wimplicit-function-declaration]
  167 |                 ret = ApplySignedSortComparator(tuple1->datum1,
      |                       ^~~~~~~~~~~~~~~~~~~~~~~~~
      |                       ApplyUnsignedSortComparator
mk_qsort_tuple.c: In function ‘mkqs_compare_tuple’:
mk_qsort_tuple.c:376:23: warning: implicit declaration of function
‘qsort_tuple_signed_compare’; did you mean
‘qsort_tuple_unsigned_compare’? [-Wimplicit-function-declaration]
  376 |                 ret = qsort_tuple_signed_compare(a, b, state);
      |                       ^~~~~~~~~~~~~~~~~~~~~~~~~~
      |                       qsort_tuple_unsigned_compare
/usr/bin/ld: utils/sort/tuplesort.o: in function
`mkqs_compare_datum_by_shortcut':
/home/debian/postgres/src/backend/utils/sort/mk_qsort_tuple.c:167:
undefined reference to `ApplySignedSortComparator'
/usr/bin/ld:
/home/debian/postgres/src/backend/utils/sort/mk_qsort_tuple.c:167:
undefined reference to `ApplySignedSortComparator'
/usr/bin/ld:
/home/debian/postgres/src/backend/utils/sort/mk_qsort_tuple.c:167:
undefined reference to `ApplySignedSortComparator'
/usr/bin/ld: utils/sort/tuplesort.o: in function `mkqs_compare_tuple':
/home/debian/postgres/src/backend/utils/sort/mk_qsort_tuple.c:376:
undefined reference to `qsort_tuple_signed_compare'
/usr/bin/ld: utils/sort/tuplesort.o: in function
`mkqs_compare_datum_by_shortcut':
/home/debian/postgres/src/backend/utils/sort/mk_qsort_tuple.c:167:
undefined reference to `ApplySignedSortComparator'
collect2: error: ld returned 1 exit status
make[2]: *** [Makefile:67: postgres] Error 1
make[1]: *** [Makefile:42: all-backend-recurse] Error 2
make: *** [GNUmakefile:11: all-src-recurse] Error 2

I haven't investigated why it fails.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: 回复: An implementation of multi-key sort

From
Robert Haas
Date:
On Sun, Jul 7, 2024 at 2:32 AM Konstantin Knizhnik <knizhnik@garret.ru> wrote:
> If mksort really provides an advantage only when there are a lot of
> duplicates (for prefix keys?), and with a small fraction of duplicates
> there is even some (small) regression, then IMHO taking into account the
> planner's information about the estimated number of distinct values seems
> really important.

I don't think we can rely on the planner's n_distinct estimates for
this at all. That information tends to be massively unreliable when we
have it at all. If we rely on it for good performance, it will be easy
to find cases where it's wrong and performance is bad.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: 回复: An implementation of multi-key sort

From
Yao Wang
Date:
Thanks for all of your comments.


Progress
--------

Because there are too many details to discuss, let me summarize them into
two major issues:

1. Since the performance seems ideal for data types without specialized
comparators, can we improve the performance for data types with specialized
comparators to a satisfactory level?

I refactored some code on the mksort code path and eliminated some
performance bottlenecks. With the latest code, most results show mksort
performing better than qsort. In the other cases, the regressions are
usually less than 5%, with rare exceptions. Since most of the exceptions
are transient (happening occasionally in a series of "normal" results), I
suspect they are due to external interference, because the test machine is
not dedicated. Let's discuss based on the latest code.

2. Should we update the optimizer to take statistics
(pg_statistic->stadistinct) into account and choose mksort/qsort accordingly?

The tricky part here is not just the reliability of stadistinct. The sort
performance is affected by several factors: sort key count, distinct
ratio, data type, and data layout. E.g. with the int type, 5 sort keys,
and a distinct ratio of about 0.05, we see about a 1% performance
improvement for the "random" data set, about 7% for the "correlated" data
set, and almost no difference for the "sequential" data set because of the
pre-ordered check. So it is pretty difficult to create a formula to
calculate the benefit.

Anyway, I made an implementation that adds an "mkqsApplicable" flag
determined by the optimizer, so we can discuss based on code and
performance results. I also updated the test script to add an extra column
"mk_enabled" indicating whether mksort is enabled by the optimizer.

According to the results, the choosing mechanism in the optimizer
eliminated almost all regressions. Please note that even when mksort is
disabled (i.e. both runs actually performed qsort), there are still
"regressions", which are usually less than 5% and should be attributed
to the error range.

I am still hesitating about putting the code into the final version because
there are a number of concerns:

a. For sort keys without a real table, stadistinct is unavailable.
b. stadistinct may not be accurate, as mentioned.
c. The formula I used may not cover all possible cases.

On the other hand, the worst result may be just a 5% regression. So the
side effects may not be so serious?

I can refine the code (e.g. for now mkqsApplicable is valid only on
some particular code paths), but I prefer to do more after we have a
clear decision about whether the optimizer code is needed.

Please let me know your comments, thanks.


Attachments
-----------

- v6-Implement-multi-key-quick-sort.patch
(v6) The latest code without the optimizer change

- v6-1-add-Sort-ndistInFirstRow.patch
(v6.1) The latest code with the optimizer change; applies on top of v6

- mksort-test-v2.sh
The script made by Tomas and modified by me to produce the test results

- new_report.txt
Test results on a small data set (100,000 rows) based on v6

- new_report_opti.txt
Test results on a small data set (100,000 rows) based on v6.1

I tried to produce a "full" report with all data ranges, but it kept
running for more than 15 hours on my machine and was constantly disturbed
by other heavy loads. However, I did run tests on some data sets of other
sizes and got similar results.

Answers for other questions
---------------------------

1. Can we enable mksort for just particular data types?

As I mentioned, it is not easy to make that decision considering all the
factors impacting the result and all their possible combinations. The
optimizer code I showed may be a starting point.

2. Does v5 include the code about "distinct stats info" and others?

No. As I mentioned, none of the code in "Potential improvement spaces" was
included in v5 (some of it was not implemented at all). v6.1 includes some
optimizer code.

3. Should we remove the GUC enable_mk_sort?

I kept it at least for the development phase. And I prefer keeping it
permanently, in case there are scenarios we are not aware of.

4. Build failure on 32-bit ARM

It is my own fault: ApplySignedSortComparator() is built only when
SIZEOF_DATUM >= 8. I was aware of that, but missed wrapping all the
relevant code in that condition. It should be fixed in v6, but I don't
have a 32-bit ARM platform to verify. @Tomas please give it a try if
you are still interested, thanks.
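
In other words, every caller needs the same guard that sortsupport.h
itself uses. Roughly (a sketch of the shape of the fix, not the exact v6
hunk):

    #if SIZEOF_DATUM >= 8
            /* 64-bit Datum: the signed fast path is available */
            ret = ApplySignedSortComparator(tuple1->datum1, tuple1->isnull1,
                                            tuple2->datum1, tuple2->isnull1,
                                            ssup);
    #else
            /* 32-bit Datum: no signed fast path; use the generic
             * comparator instead */
            ret = ApplySortComparator(tuple1->datum1, tuple1->isnull1,
                                      tuple2->datum1, tuple2->isnull1,
                                      ssup);
    #endif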

5. How could templating address the regressions?

Please refer to the implementation of qsort in sort_template.h, which
adopts a kind of template mechanism using macros, since the C language has
no built-in templates. E.g. for the comparator, it uses a macro ST_COMPARE
which is specialized to different functions (such as
qsort_tuple_unsigned_compare()) for different data types. By contrast,
mksort needs to determine the comparator at runtime for each comparison
(see mkqs_compare_tuple()), which costs more. Although the cost is small,
comparison is very performance sensitive. (About a 1~2% regression, if my
memory is correct.)
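
For reference, the instantiation pattern in tuplesort.c looks roughly like
this (abridged), which is why the specialized sorts pay nothing to select
a comparator at runtime:

    /* Stamp out a specialized sort by defining the ST_* macros and
     * including sort_template.h; the comparator is fixed at compile
     * time. */
    #define ST_SORT qsort_tuple_unsigned
    #define ST_ELEMENT_TYPE SortTuple
    #define ST_COMPARE(a, b, state) \
            qsort_tuple_unsigned_compare(a, b, state)
    #define ST_COMPARE_ARG_TYPE Tuplesortstate
    #define ST_CHECK_FOR_INTERRUPTS
    #define ST_SCOPE static
    #define ST_DEFINE
    #include "lib/sort_template.h"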


Thanks,

Yao Wang

On Wed, Jul 10, 2024 at 2:58 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Sun, Jul 7, 2024 at 2:32 AM Konstantin Knizhnik <knizhnik@garret.ru> wrote:
> > If mksort really provides an advantage only when there are a lot of
> > duplicates (for prefix keys?), and with a small fraction of duplicates
> > there is even some (small) regression, then IMHO taking into account the
> > planner's information about the estimated number of distinct values
> > seems really important.
>
> I don't think we can rely on the planner's n_distinct estimates for
> this at all. That information tends to be massively unreliable when we
> have it at all. If we rely on it for good performance, it will be easy
> to find cases where it's wrong and performance is bad.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com



Re: 回复: An implementation of multi-key sort

From
John Naylor
Date:
On Fri, Jul 26, 2024 at 6:18 PM Yao Wang <yao-yw.wang@broadcom.com> wrote:

> 2. Should we update the optimizer to take statistics
> (pg_statistic->stadistinct) into account and choose mksort/qsort accordingly?

> According to the results, the choosing mechanism in the optimizer
> eliminated almost all regressions. Please note that even when mksort is
> disabled (i.e. both runs actually performed qsort), there are still
> "regressions", which are usually less than 5% and should be attributed
> to the error range.

This is kind of an understatement. To be clear, I see mostly ~1%,
which I take to be at the noise level. If it were commonly a 5%
regression, that'd be cause for concern. If the actual noise were 5%, the
testing methodology is not strict enough.

> I am still hesitating about putting the code into the final version
> because there are a number of concerns:
>
> a. For sort keys without a real table, stadistinct is unavailable.
> b. stadistinct may not be accurate, as mentioned.
> c. The formula I used may not cover all possible cases.
>
> On the other hand, the worst result may be just a 5% regression. So the
> side effects may not be so serious?

FWIW, I share these concerns as well. I don't believe this proves a
bound on the worst result, since the test is using ideal, well-behaved
data with no filter. The worst result is when the estimates are wrong,
so real-world use could easily be back to a 10-20% regression, which is
not acceptable. I believe this is what Robert and Tomas were warning
about.

It'd be good to understand what causes the differences, whether better
or worse. Some initial thoughts:

- If the first key is unique, then I would hope multikey would be no
different than a standard sort.

- If the first key commonly ties, and other following keys tie also,
those later comparisons are a waste. In that case, it's not hard to
imagine that partitioning on only one key at a time might be fast.

- If the first key commonly ties, but the second key is closer to
unique, I'm not sure which way is better. Have we tested this case?

- If we actually only have one sort key, a multi-key sort with a
single depth should ideally have no significant performance difference
from a standard sort. That seems like a good sanity check. Has this been
tried?

- For the biggest benefit/regression cases, it'd be good to know what
changed at the hardware level. # comparisons? # swaps? # cache
misses? # branch mispredicts?

Looking at the code a bit, I have some questions and see some
architectural issues. My thoughts below have a bunch of brainstorms so
should be taken with a large grain of salt:

1. The new single-use abstraction for the btree tid tiebreak seems
awkward. In standard sort, all that knowledge was confined to btree's
full comparetup function. Now it's spread out, and general code has to
worry about "duplicated" tuples. The passed-down isNull seems only
needed for this? (Quick idea: It seems we could pass a start-depth and
max-depth to some "comparetup_mk", which would be very similar to
current "comparetup + comparetup_tiebreak". The btree function would
have to know that a depth greater than the last sortkey is a signal to
do the tid comparison. And if start-depth and max-depth are the same,
that means comparing at a single depth. That might simplify the code
elsewhere because there is no need for a separate getDatum function.)
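
(A possible shape for that interface, purely illustrative and using the
names from the quick idea above:

    /*
     * Compare 'a' and 'b' on sort keys [start_depth .. max_depth].
     * For btree, a max_depth one past the last sortkey would signal the
     * tid tiebreak; start_depth == max_depth compares a single key.
     */
    static int
    comparetup_mk(const SortTuple *a, const SortTuple *b,
                  Tuplesortstate *state,
                  int start_depth, int max_depth);
)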

2. I don't understand why the pre-ordered check sometimes tolerates
duplicates and sometimes doesn't.

3. "tiebreak" paths and terminology is already somewhat awkward for
standard sort (my fault), but seems really out of place in multikey
sort. It already has a general concept of "depth", so that should be
used in fullest generality.

3A. Random thought: I wonder if the shortcut (and abbreviated?)
comparisons could be thought of as having their own depth < 0. If it's
worth it to postpone later keys, maybe it's worth it to postpone the
full comparison for the first key as well? I could be wrong, though.

3B. Side note: I've long wanted to try separating all NULL first keys
to a separate array, so we can remove all those branches for NULL
ordering and reduce SortTuple to 16 bytes. That might be easier to
code if we could simply specify "start_depth = 1" at the top level for
that.

4. Trying to stuff all our optimized comparators in the same path was
a heroic effort, but it's quite messy and seems pretty bad for the
instruction cache and branch predictor. I don't think we need yet
another template, but some of these branches should be taken as we
recurse into a partition to keep them out of the hot path.

5.
+ /*
+ * When the count < 16 and no need to handle duplicated tuples, use
+ * bubble sort.
+ *
+ * Use 16 instead of 7 which is used in standard qsort, because mk qsort
+ * need more cost to maintain more complex state.

Note: 7 isn't ideal for standard sort either, and should probably be
at least 10 (at least for single-key sorts). If one implementation's
parameter is more ideal than the other, it obscures what the true
trade-offs are. 16 happens to be a power of two -- how many different
values did you test? (And isn't this the same as our insertion sort,
not bubble sort?).