Thread: Considering additional sort specialisation functions for PG16


From
David Rowley
Date:
Hi,

(Background: 697492434 added 3 new sort functions to remove the
indirect function calls for the comparator function.  This sped up
sorting for several of our built-in data types.)

There was a bit of unfinished discussion around exactly how far to
take these specialisations for PG15.  We could certainly add more.

There are various other things we could do to further speed up sorting
for these datatypes.  For example, we could add 3 more variations
of these functions to be called when there are no NULL datums to
sort.  That effectively doubles the number of specialisations,
i.e. adds another dimension.

I have the following dimensions in mind for consideration:

1. Specialisations to handle sorting of non-null datums (eliminates
checking for nulls in the comparison function)
2. Specialisations to handle single column sorts (eliminates
tiebreaker function call or any checks for existence of tiebreaker)
3. ASC sort (No need for if (ssup->ssup_reverse) INVERT_COMPARE_RESULT(compare))

If we did all of the above then we'd end up with 3 * 2 * 2 * 2 = 24
specialization functions.  That seems a bit excessive. So here I'd
like to discuss which ones we should add, if any.
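To make the trade-off concrete, here is a toy sketch of the branching these dimensions would remove. The types and names below are mine, not tuplesort.c's; real comparators work on Datum/SortSupport:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the first-key portion of a SortTuple */
typedef struct
{
	int64_t		datum1;
	bool		isnull1;
} MiniSortTuple;

/* Generic comparator: branches on nulls, null ordering, and direction */
static int
cmp_generic(const MiniSortTuple *a, const MiniSortTuple *b,
			bool nulls_first, bool reverse)
{
	int			compare;

	if (a->isnull1)
		compare = b->isnull1 ? 0 : (nulls_first ? -1 : 1);
	else if (b->isnull1)
		compare = nulls_first ? 1 : -1;
	else
		compare = (a->datum1 < b->datum1) ? -1 :
			(a->datum1 > b->datum1) ? 1 : 0;

	if (reverse)
		compare = -compare;		/* like INVERT_COMPARE_RESULT() */
	return compare;
}

/* Specialisation for no-nulls ASC: all of the branches above vanish */
static int
cmp_notnull_asc(const MiniSortTuple *a, const MiniSortTuple *b)
{
	return (a->datum1 < b->datum1) ? -1 : (a->datum1 > b->datum1);
}
```

Multiplying these independent choices out is where the 24 figure comes from.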

I've attached a very basic implementation of #1 which adds 3 new
functions for sorting non-null datums.  This could be made a bit more
advanced.  For now, I just added a bool flag to track whether we have
any NULL datum1s in memtuples[].  For bounded sorts, we may remove
NULLs from that array, so we could end up with no NULLs after having
previously seen one.  So maybe a count would be better than a flag.
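A count-based version of that tracking might look like the following sketch (NullTracker and these function names are hypothetical, not from the attached patch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-sort state counting NULL datum1s in memtuples[] */
typedef struct
{
	size_t		null_count;
} NullTracker;

static void
tracker_tuple_added(NullTracker *t, bool isnull1)
{
	if (isnull1)
		t->null_count++;
}

/* Bounded sorts can discard tuples; a count survives that, a flag can't */
static void
tracker_tuple_discarded(NullTracker *t, bool isnull1)
{
	if (isnull1)
		t->null_count--;
}

static bool
can_use_notnull_specialization(const NullTracker *t)
{
	return t->null_count == 0;
}
```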

A quick performance test with 1 million random INTs shows ~6%
performance improvement when there are no nulls.

Master
$ pgbench -n -f bench.sql -T 60 postgres
latency average = 159.837 ms
latency average = 161.193 ms
latency average = 159.512 ms

master + not_null_sort_specializations.patch
$ pgbench -n -f bench.sql -T 60 postgres
latency average = 150.791 ms
latency average = 149.843 ms
latency average = 150.319 ms

I didn't test for any regression when there are NULLs and we're unable
to use the new specializations. I'm hoping the null tracking will be
almost free, but I will need to check.

Which specializations should be added is all quite subjective. I think
#1 is likely to have the biggest wins when it can be used, as it
removes the most branching in the comparator function. However, the
biggest gains are not the only thing to consider; we also need to
consider how commonly these functions will be used. I don't have any
information about that.

David

Attachment

Re: Considering additional sort specialisation functions for PG16

From
John Naylor
Date:
On Tue, Aug 23, 2022 at 9:18 AM David Rowley <dgrowleyml@gmail.com> wrote:
>
> I have the following dimensions in mind for consideration:
>
> 1. Specialisations to handle sorting of non-null datums (eliminates
> checking for nulls in the comparison function)
> 2. Specialisations to handle single column sorts (eliminates
> tiebreaker function call or any checks for existence of tiebreaker)
> 3. ASC sort (No need for if (ssup->ssup_reverse) INVERT_COMPARE_RESULT(compare))
>
> If we did all of the above then we'd end up with 3 * 2 * 2 * 2 = 24
> specialization functions.  That seems a bit excessive. So here I'd
> like to discuss which ones we should add, if any.
>
> I've attached a very basic implementation of #1 which adds 3 new
> functions for sorting non-null datums.

Did you happen to see

https://www.postgresql.org/message-id/CAFBsxsFhq8VUSkUL5YO17cFXbCPwtbbxBu%2Bd9MFrrsssfDXm3Q%40mail.gmail.com

where I experimented with removing all null handling? What I had in
mind was pre-partitioning nulls and non-nulls when populating the
SortTuple array, then calling qsort twice, once with the non-null
partition with comparators that assume non-null, and (only if there
are additional sort keys) once on the null partition. And the
pre-partitioning would take care of nulls first/last upfront. I
haven't looked into the feasibility of this yet, but the good thing
about the concept is that it removes null handling in the comparators
without additional sort specializations.
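A toy model of that pre-partitioning, assuming NULLS LAST and a simplified tuple (the names and layout are mine; the real SortTuple array lives in tuplesort.c):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct
{
	int64_t		datum1;
	bool		isnull1;
} MiniSortTuple;

/* Comparator that can assume both inputs are non-null */
static int
cmp_notnull(const void *a, const void *b)
{
	int64_t		x = ((const MiniSortTuple *) a)->datum1;
	int64_t		y = ((const MiniSortTuple *) b)->datum1;

	return (x < y) ? -1 : (x > y);
}

/*
 * Move non-null tuples to the front (NULLS LAST), then qsort only that
 * prefix with the null-free comparator.  Returns the non-null count.
 * With additional sort keys, the null partition would get its own qsort.
 */
static size_t
partition_and_sort(MiniSortTuple *tuples, size_t n)
{
	size_t		notnull = 0;

	for (size_t i = 0; i < n; i++)
	{
		if (!tuples[i].isnull1)
		{
			MiniSortTuple tmp = tuples[notnull];

			tuples[notnull] = tuples[i];
			tuples[i] = tmp;
			notnull++;
		}
	}
	qsort(tuples, notnull, sizeof(MiniSortTuple), cmp_notnull);
	return notnull;
}
```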

-- 
John Naylor
EDB: http://www.enterprisedb.com



Re: Considering additional sort specialisation functions for PG16

From
David Rowley
Date:
On Tue, 23 Aug 2022 at 15:22, John Naylor <john.naylor@enterprisedb.com> wrote:
> Did you happen to see
>
> https://www.postgresql.org/message-id/CAFBsxsFhq8VUSkUL5YO17cFXbCPwtbbxBu%2Bd9MFrrsssfDXm3Q%40mail.gmail.com

I missed that.  It looks like a much more promising idea than what I
came up with. I've not looked at your code yet, but I'm interested and
will aim to look soon.

David



Re: Considering additional sort specialisation functions for PG16

From
John Naylor
Date:
On Tue, Aug 23, 2022 at 11:24 AM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Tue, 23 Aug 2022 at 15:22, John Naylor <john.naylor@enterprisedb.com> wrote:
> > Did you happen to see
> >
> > https://www.postgresql.org/message-id/CAFBsxsFhq8VUSkUL5YO17cFXbCPwtbbxBu%2Bd9MFrrsssfDXm3Q%40mail.gmail.com
>
> I missed that.  It looks like a much more promising idea than what I
> came up with. I've not looked at your code yet, but I'm interested and
> will aim to look soon.

Note that I haven't actually implemented this idea yet, just tried to
model the effects by lobotomizing the current comparators. I think
it's worth pursuing and will try to come back to it this cycle, but if
you or anyone else wants to try, that's fine of course.

-- 
John Naylor
EDB: http://www.enterprisedb.com



Re: Considering additional sort specialisation functions for PG16

From
John Naylor
Date:

On Tue, Aug 23, 2022 at 1:13 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Tue, Aug 23, 2022 at 11:24 AM David Rowley <dgrowleyml@gmail.com> wrote:
> >
> > On Tue, 23 Aug 2022 at 15:22, John Naylor <john.naylor@enterprisedb.com> wrote:
> > > Did you happen to see
> > >
> > > https://www.postgresql.org/message-id/CAFBsxsFhq8VUSkUL5YO17cFXbCPwtbbxBu%2Bd9MFrrsssfDXm3Q%40mail.gmail.com
> >
> > I missed that.  It looks like a much more promising idea than what I
> > came up with. I've not looked at your code yet, but I'm interested and
> > will aim to look soon.
>
> Note that I haven't actually implemented this idea yet, just tried to
> model the effects by lobotomizing the current comparators. I think
> it's worth pursuing and will try to come back to it this cycle, but if
> you or anyone else wants to try, that's fine of course.

Coming back to this, I wanted to sketch out this idea in a bit more detail.

Have two memtuple arrays, one for first sortkey null and one for first sortkey non-null:
- Qsort the non-null array, including whatever specialization is available. Existing specialized comparators could ignore nulls (and their ordering) taking less space in the binary.
- Only if there is more than one sort key, qsort the null array. Ideally at some point we would have a method of ignoring the first sortkey (this is an existing opportunity that applies elsewhere as well).
- To handle two arrays, grow_memtuples() would need some adjustment, as would any callers that read the final result of an in-memory sort -- they would need to retrieve the tuples starting with the appropriate array depending on NULLS FIRST/LAST behavior.
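To illustrate the last point, a sketch of reading the final result across the two arrays, honoring NULLS FIRST/LAST (a toy with ints standing in for tuples; none of these names exist in tuplesort):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct
{
	const int  *notnulls;		/* sorted first-key-non-null "tuples" */
	size_t		n_notnulls;
	const int  *nulls;			/* first-key-null "tuples" */
	size_t		n_nulls;
	bool		nulls_first;	/* NULLS FIRST vs. NULLS LAST */
	size_t		pos;
} TwoArrayReader;

/* Yield tuples in output order; returns false when exhausted */
static bool
reader_next(TwoArrayReader *r, int *out)
{
	const int  *first = r->nulls_first ? r->nulls : r->notnulls;
	size_t		n_first = r->nulls_first ? r->n_nulls : r->n_notnulls;
	const int  *second = r->nulls_first ? r->notnulls : r->nulls;
	size_t		n_second = r->nulls_first ? r->n_notnulls : r->n_nulls;

	if (r->pos < n_first)
		*out = first[r->pos++];
	else if (r->pos < n_first + n_second)
		*out = second[r->pos++ - n_first];
	else
		return false;
	return true;
}
```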

I believe external merges wouldn't have to do anything different, since when writing out the tapes, we read from the arrays in the right order.

(One could extend this idea further and have two pools of tapes for null and non-null first sortkey, that are merged separately, in the right order. That sounds like quite a bit more complexity than it's worth, however.)

--
John Naylor
EDB: http://www.enterprisedb.com

Re: Considering additional sort specialisation functions for PG16

From
Pavel Borisov
Date:
Hi, John!
Generally, I like the idea of separating out the non-null values
before sorting, and I would like to join as a reviewer once we get to
a patch. I have only a small question:

> - Only if there is more than one sort key, qsort the null array. Ideally at some point we would have a method of ignoring the first sortkey (this is an existing opportunity that applies elsewhere as well).
Do we need to sort by the second sort key when the first one is NULL?
Is that required by the standard, or by some part of the code relying
on it? I suppose NULL values in the first sort key mean the attribute
values are undefined, so there is no preferred order between these
tuples, even if their second sort keys are different.

And maybe (unlikely, IMO) we need some analog of NULLS DISTINCT/NOT
DISTINCT in this scope?

Kind regards,
Pavel Borisov,
Supabase.



Re: Considering additional sort specialisation functions for PG16

From
John Naylor
Date:

On Thu, Jan 26, 2023 at 6:14 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote:

> > - Only if there is more than one sort key, qsort the null array. Ideally at some point we would have a method of ignoring the first sortkey (this is an existing opportunity that applies elsewhere as well).

> Should we need to sort by the second sort key provided the first one
> in NULL by standard or by some part of the code relying on this? I

I'm not sure I quite understand the question.

If there is more than one sort key, and the specialized comparison on the first key gives a definitive zero result, it falls back to comparing all keys from the full tuple. (The sorttuple struct only contains the first sortkey, which might actually be an abbreviated key.) A possible optimization, relevant here and also elsewhere, is to compare only using keys starting from key2. But note: if the first key is abbreviated, a zero result is not definitive, and we must check the first key's full value from the tuple.
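A toy illustration of why an abbreviated-key tie isn't definitive, using the first four bytes of a string as a made-up abbreviation (PostgreSQL's real abbreviated keys come from SortSupport; this only shows the shape of the logic):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Made-up abbreviation: pack the first 4 bytes into an integer */
static uint32_t
abbrev(const char *s)
{
	uint32_t	v = 0;

	for (int i = 0; i < 4 && s[i] != '\0'; i++)
		v |= (uint32_t) (unsigned char) s[i] << (24 - 8 * i);
	return v;
}

/*
 * Compare on the abbreviated key first; a zero result there is not
 * definitive, so fall back to the authoritative comparison before
 * moving on to any later sort keys.
 */
static int
compare_first_key(const char *a, const char *b)
{
	uint32_t	aa = abbrev(a);
	uint32_t	ab = abbrev(b);

	if (aa < ab)
		return -1;
	if (aa > ab)
		return 1;
	return strcmp(a, b);	/* abbreviated tie: compare full values */
}
```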

> suppose NULL values in the first sort key mean attribute values are
> undefined and there is no preferred order between these tuples, even
> if their second sort keys are different.

There is in fact a preferred order between these tuples -- the second key is the tie breaker in this case.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: Considering additional sort specialisation functions for PG16

From
David Rowley
Date:
On Thu, 26 Jan 2023 at 23:29, John Naylor <john.naylor@enterprisedb.com> wrote:
> Coming back to this, I wanted to sketch out this idea in a bit more detail.
>
> Have two memtuple arrays, one for first sortkey null and one for first sortkey non-null:
> - Qsort the non-null array, including whatever specialization is available. Existing specialized comparators could ignore nulls (and their ordering) taking less space in the binary.
> - Only if there is more than one sort key, qsort the null array. Ideally at some point we would have a method of ignoring the first sortkey (this is an existing opportunity that applies elsewhere as well).
> - To handle two arrays, grow_memtuples() would need some adjustment, as would any callers that read the final result of an in-memory sort -- they would need to retrieve the tuples starting with the appropriate array depending on NULLS FIRST/LAST behavior.

Thanks for coming back to this. I've been thinking about this again
recently due to what was discovered in [1].  Basically, the patch on
that thread is trying to eliminate the need for the ORDER BY sort in a
query such as: SELECT a,b,row_number() over (order by a) from ab order
by a,b;  the idea is to just perform the full sort on a,b for the
WindowClause, saving us from having to do an Incremental Sort on b for
the ORDER BY after evaluating the window funcs.  Surprisingly (for
me), we found a bunch of cases where it performs better to sort on
some of the keys and then do an Incremental Sort on the remainder,
rather than just doing a single sort on everything.

I don't fully understand why this is yet, but one theory I have is
that it might be down to work_mem being larger than the CPU's L3
cache, causing more cache line misses.  With the simpler sort, there's
less swapping of items in the array because the comparison function
sees tuples as equal more often.  I did find that the gap between the
two is smaller with fewer tuples to sort.

I think the slower sorts I found in [2] could also be partially caused
by the current sort specialisation comparators re-comparing the
leading column during a tie-break. I've not gotten around to disabling
the sort specialisations to see if and how much this is a factor for
that test.

Why I'm bringing this up here is that I wondered, in addition to what
you're mentioning above, if we're making changes to allow multiple
in-memory arrays, would it be worth going a little further and
allowing N arrays, which we'd try to keep under the L3 cache size?
Maybe some new GUC could be used to determine a good value.  With fast
enough disks, it's often faster to use smaller values of work_mem that
don't exceed L3, to keep these batches smaller.  Keeping them in
memory would remove the need to write out tapes.

I don't really know if such a feature could easily be tacked on to
what you propose above, but I feel like what you're talking about is
at least part of the way towards that.  At the very least, the
additional work could be kept in mind when this is written so that
it's easier to extend in the future.

David

[1] https://postgr.es/m/CAApHDvpAO5H_L84kn9gCJ_hihOavtmDjimKYyftjWtF69BJ=8Q@mail.gmail.com
[2] https://postgr.es/m/CAApHDvqh%2BqOHk4sbvvy%3DQr2NjPqAAVYf82oXY0g%3DZ2hRpC2Vmg%40mail.gmail.com



Re: Considering additional sort specialisation functions for PG16

From
John Naylor
Date:

On Thu, Jan 26, 2023 at 7:15 PM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Thu, 26 Jan 2023 at 23:29, John Naylor <john.naylor@enterprisedb.com> wrote:
> > Coming back to this, I wanted to sketch out this idea in a bit more detail.
> >
> > Have two memtuple arrays, one for first sortkey null and one for first sortkey non-null:
> > - Qsort the non-null array, including whatever specialization is available. Existing specialized comparators could ignore nulls (and their ordering) taking less space in the binary.
> > - Only if there is more than one sort key, qsort the null array. Ideally at some point we would have a method of ignoring the first sortkey (this is an existing opportunity that applies elsewhere as well).
> > - To handle two arrays, grow_memtuples() would need some adjustment, as would any callers that read the final result of an in-memory sort -- they would need to retrieve the tuples starting with the appropriate array depending on NULLS FIRST/LAST behavior.
>
> Thanks for coming back to this. I've been thinking about this again
> recently due to what was discovered in [1].

That was indeed part of the motivation for bringing this up.

> Why I'm bringing this up here is that I wondered, in addition to what
> you're mentioning above, if we're making some changes to allow
> multiple in-memory arrays, would it be worth going a little further
> and allowing it so we could have N arrays which we'd try to keep under
> L3 cache size. Maybe some new GUC could be used to know what a good
> value is.  With fast enough disks, it's often faster to use smaller
> values of work_mem which don't exceed L3 to keep these batches
> smaller.  Keeping them in memory would remove the need to write out
> tapes.

That's interesting. I don't know enough to guess how complex it would be to make "external" merges agnostic about whether the tapes are on disk or in memory.

If in-memory sorts were designed analogously to external ones, grow_memtuples would never have to repalloc, it could just allocate a new tape, which could further shave some cycles.

My hunch in my last email was that having separate groups of tapes for each null/non-null first sortkey would be complex, because it increases the number of places that have to know about nulls and their ordering. If we wanted to go to N arrays instead of 2, and additionally wanted separate null/non-null treatment, it seems we would still need 2 sets of arrays, one with N non-null-first-sortkey tapes and one with M null-first-sortkey tapes.

So using 2 arrays seems like the logical first step. I'd be curious to hear other possible development paths.

> I think the slower sorts I found in [2] could also be partially caused
> by the current sort specialisation comparators re-comparing the
> leading column during a tie-break. I've not gotten around to disabling
> the sort specialisations to see if and how much this is a factor for
> that test.

Right, that's worth addressing independently of the window function consideration. I'm still swapping this area back in my head, but I believe one issue is that state->base.onlyKey signals two things: "one sortkey, not abbreviated". We could add a separate branch for "first key unabbreviated, nkeys>1" -- I don't think we'd need to specialize, just branch -- and instead of state->base.comparetup, call a set of analogous functions that only handle keys 2 and above (comparetup_tail_* ? or possibly just add a boolean parameter compare_first). That would not pose a huge challenge, I think, since they're already written like this:

/* Compare the leading sort key */
compare = ApplySortComparator(...);
if (compare != 0)
    return compare;

/* Compare additional sort keys */
...

The null/non-null separation would eliminate a bunch of branches in inlined comparators, so we could afford to add another branch for number of keys.
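A minimal sketch of that split, with hypothetical names (comparetup_tail is only the suggestion above, not an existing function):

```c
typedef struct
{
	int			key1;			/* stand-in for the leading sort key */
	int			key2;			/* stand-in for "all the later keys" */
} TwoKeyTuple;

/* Keys 2 and above only: the caller has already compared key1 */
static int
comparetup_tail(const TwoKeyTuple *a, const TwoKeyTuple *b)
{
	return (a->key2 < b->key2) ? -1 : (a->key2 > b->key2);
}

/* Existing shape: compare the leading key, then fall back to the tail */
static int
comparetup(const TwoKeyTuple *a, const TwoKeyTuple *b)
{
	if (a->key1 != b->key1)
		return (a->key1 < b->key1) ? -1 : 1;
	return comparetup_tail(a, b);
}
```

A caller that has already compared key1 (unabbreviated, and found it equal) could then call comparetup_tail directly and skip the re-comparison.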

I haven't thought through either of these ideas in the gory detail, but I don't yet see any big obstacles.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: Considering additional sort specialisation functions for PG16

From
John Naylor
Date:

I wrote:

> On Thu, Jan 26, 2023 at 7:15 PM David Rowley <dgrowleyml@gmail.com> wrote:
> > I think the slower sorts I found in [2] could also be partially caused
> > by the current sort specialisation comparators re-comparing the
> > leading column during a tie-break. I've not gotten around to disabling
> > the sort specialisations to see if and how much this is a factor for
> > that test.
>
> Right, that's worth addressing independently of the window function consideration. I'm still swapping this area back in my head, but I believe one issue is that state->base.onlyKey signals two things: "one sortkey, not abbreviated". We could add a separate branch for "first key unabbreviated, nkeys>1" -- I don't think we'd need to specialize, just branch -- and instead of state->base.comparetup, call a set of analogous functions that only handle keys 2 and above (comparetup_tail_* ? or possibly just add a boolean parameter compare_first). That would not pose a huge challenge, I think, since they're already written like this:
>
> /* Compare the leading sort key */
> compare = ApplySortComparator(...);
> if (compare != 0)
>     return compare;
>
> /* Compare additional sort keys */
> ...
>
> The null/non-null separation would eliminate a bunch of branches in inlined comparators, so we could afford to add another branch for number of keys.

I gave this a go, and it turns out we don't need any extra branches in the inlined comparators -- the new fallbacks are naturally written to account for the "!onlyKey" case. If the first sortkey was abbreviated, call its full comparator, otherwise skip to the next sortkey (if there isn't one, we shouldn't have gotten here). The existing comparetup functions try the simple case and then call the fallback (which could be inlined for them but I haven't looked).

Tests pass, but I'm not sure yet if we need more tests. I don't have a purpose-built benchmark at the moment, but I'll see if any of my existing tests exercise this code path. I can also try the window function case unless someone beats me to it.

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: Considering additional sort specialisation functions for PG16

From
John Naylor
Date:

I wrote:
> Have two memtuple arrays, one for first sortkey null and one for first sortkey non-null:

Hacking on this has gotten only as far as the "compiles but segfaults" stage, but I wanted to note an idea that occurred to me:

Internal qsort doesn't need the srctape member, and removing both that and isnull1 would allow 16-byte "init-tuples" for qsort, which would save a bit of work_mem space, binary space for qsort specializations, and work done during swaps.
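For illustration, roughly the layouts in question (field types are my approximations; the real SortTuple uses Datum and is defined in tuplesort internals):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Approximation of the current SortTuple: 24 bytes on typical 64-bit */
typedef struct
{
	void	   *tuple;			/* the out-of-line tuple */
	int64_t		datum1;			/* first key value (Datum stand-in) */
	bool		isnull1;		/* is datum1 NULL? */
	int			srctape;		/* source tape, only needed when merging */
} FullSortTuple;

/* Proposed "init-tuple" for internal qsort: 16 bytes on 64-bit */
typedef struct
{
	void	   *tuple;
	int64_t		datum1;
} InitTuple;
```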

During heap sort, we already copy one entry into a stack variable to keep from clobbering it, so it's not a big step to read a member from the init array and form a regular sorttuple from it.

--
John Naylor
EDB: http://www.enterprisedb.com