tuple radix sort - Mailing list pgsql-hackers
| From | John Naylor |
|---|---|
| Subject | tuple radix sort |
| Date | |
| Msg-id | CANWCAZYzx7a7E9AY16Jt_U3+GVKDADfgApZ-42SYNiig8dTnFA@mail.gmail.com Whole thread Raw |
| Responses |
Re: tuple radix sort
|
| List | pgsql-hackers |
First, a quick demonstration of what this PoC can do on 1 million
random not-NULL bigints:
set wip_radix_sort = 'off'; select * from test order by a offset 1_000_000_000;
240ms
set wip_radix_sort = 'on'; select * from test order by a offset 1_000_000_000;
140ms
Background: Peter Geoghegan recently mentioned to me off-list an
interesting set of techniques for sorting in the context of databases.
I'm not yet sure how to approach certain aspects of that architecture,
so I won't go into the full picture at this point. However, there is
one piece that already fits well within our existing architecture, and
that is using radix sort on datum1. The basic sequence is:
1. Partition tuples on first key NULL and not-NULL, according to NULLS
FIRST or NULLS LAST.
2. Do normal qsort on the NULL partition using the tiebreak comparator.
3. Create a "conditioned" or "normalized" datum that encodes datum1
such that unsigned comparison is order-preserving, accounting for ASC
/ DESC as well. I've reused space now unused during in-memory not-NULL
sorts:
typedef struct
{
void *tuple; /* the tuple itself */
Datum datum1; /* value of first key column */
union
{
struct
{
bool isnull1; /* is first key column NULL? */
int srctape; /* source tape number */
};
Datum cond_datum1; /* sort key for radix sort */
};
} SortTuple;
4. Radix sort on cond_datum1. For the PoC I've based it on the
implementation in "ska sort" [1] (C++, Boost license). For
medium-sized sorts it uses "American flag sort" (there is a paper [3]
co-authored by M. D. McIlroy, same as in the paper we reference for
quicksort). For larger sorts it's similar, but performs multiple
passes, which takes better advantage of modern CPUs. Upon recursion,
sorts on small partitions divert to quicksort. Any necessary tiebreaks
are handled by quicksort, either after the end of radix sort, or when
diverting to small quicksort.
5. Reset isnull1 to "false" before returning to the caller. This also
must be done when diverting to quicksort.
Next steps: Try to find regressions (help welcome here). The v1 patch
has some optimizations, but in other ways things are simple and/or
wasteful. Exactly how things fit together will be informed by what, if
anything, has to be done to avoid regressions. I suspect the challenge
will be multikey sorts when the first key has low cardinality. This is
because tiebreaks are necessarily postponed rather than taken care of
up front. I'm optimistic, since low cardinality cases can be even
faster than our B&M qsort, so we have some headroom:
drop table if exists test;
create unlogged table test (a bigint);
insert into test select
(1_000_000_000 * random())::bigint % 8 -- mod
-- (1_000_000_000 * random())::bigint -- random, for the case at the top
from generate_series(1,1_000_000,1) i;
vacuum freeze test;
select pg_prewarm('test');
set work_mem = '64MB';
set wip_radix_sort = 'off'; select * from test order by a offset 1_000_000_000;
95ms
set wip_radix_sort = 'on'; select * from test order by a offset 1_000_000_000;
84ms
[1] https://github.com/skarupke/ska_sort/tree/master
[2] https://probablydance.com/2016/12/27/i-wrote-a-faster-sorting-algorithm/
[3] http://static.usenix.org/publications/compsystems/1993/win_mcilroy.pdf
--
John Naylor
Amazon Web Services
Attachment
pgsql-hackers by date: