tuple radix sort - Mailing list pgsql-hackers

From John Naylor
Subject tuple radix sort
Date
Msg-id CANWCAZYzx7a7E9AY16Jt_U3+GVKDADfgApZ-42SYNiig8dTnFA@mail.gmail.com
Whole thread Raw
Responses Re: tuple radix sort
List pgsql-hackers
First, a quick demonstration of what this PoC can do on 1 million
random not-NULL bigints:

set wip_radix_sort = 'off'; select * from test order by a offset 1_000_000_000;
240ms

set wip_radix_sort = 'on'; select * from test order by a offset 1_000_000_000;
140ms

Background: Peter Geoghegan recently mentioned to me off-list an
interesting set of techniques for sorting in the context of databases.
I'm not yet sure how to approach certain aspects of that architecture,
so I won't go into the full picture at this point. However, there is
one piece that already fits well within our existing architecture, and
that is using radix sort on datum1. The basic sequence is:

1. Partition tuples on first key NULL and not-NULL, according to NULLS
FIRST or NULLS LAST.
2. Do normal qsort on the NULL partition using the tiebreak comparator.
3. Create a "conditioned" or "normalized" datum that encodes datum1
such that unsigned comparison is order-preserving, accounting for ASC
/ DESC as well. I've reused space now unused during in-memory not-NULL
sorts:

typedef struct
{
  void     *tuple;      /* the tuple itself */
  Datum    datum1;      /* value of first key column */

  union
  {
    struct
    {
      bool    isnull1;    /* is first key column NULL? */
      int     srctape;    /* source tape number */
    };
    Datum    cond_datum1; /* sort key for radix sort */
  };
} SortTuple;


4. Radix sort on cond_datum1. For the PoC I've based it on the
implementation in "ska sort" [1] (C++, Boost license). For
medium-sized sorts it uses "American flag sort" (there is a paper [3]
co-authored by M. D. McIlroy, same as in the paper we reference for
quicksort). For larger sorts it's similar, but performs multiple
passes, which takes better advantage of modern CPUs. Upon recursion,
sorts on small partitions divert to quicksort. Any necessary tiebreaks
are handled by quicksort, either after the end of radix sort, or when
diverting to small quicksort.
5. Reset isnull1 to "false" before returning to the caller. This also
must be done when diverting to quicksort.

Next steps: Try to find regressions (help welcome here). The v1 patch
has some optimizations, but in other ways things are simple and/or
wasteful. Exactly how things fit together will be informed by what, if
anything, has to be done to avoid regressions. I suspect the challenge
will be multikey sorts when the first key has low cardinality. This is
because tiebreaks are necessarily postponed rather than taken care of
up front. I'm optimistic, since low cardinality cases can be even
faster than our B&M qsort, so we have some headroom:

drop table if exists test;
create unlogged table test (a bigint);
insert into test select
(1_000_000_000 * random())::bigint % 8 -- mod
-- (1_000_000_000 * random())::bigint  -- random, for the case at the top
from generate_series(1,1_000_000,1) i;
vacuum freeze test;

select pg_prewarm('test');
set work_mem = '64MB';

set wip_radix_sort = 'off'; select * from test order by a offset 1_000_000_000;
95ms

set wip_radix_sort = 'on'; select * from test order by a offset 1_000_000_000;
84ms


[1] https://github.com/skarupke/ska_sort/tree/master
[2] https://probablydance.com/2016/12/27/i-wrote-a-faster-sorting-algorithm/
[3] http://static.usenix.org/publications/compsystems/1993/win_mcilroy.pdf

--
John Naylor
Amazon Web Services

Attachment

pgsql-hackers by date:

Previous
From: Nico Williams
Date:
Subject: Re: Channel binding for post-quantum cryptography
Next
From: Amit Langote
Date:
Subject: Re: Batching in executor