Thread: Considering additional sort specialisation functions for PG16
Hi,

(Background: 697492434 added 3 new sort functions to remove the indirect function calls for the comparator function. This sped up sorting for various built-in data types.)

There was a bit of unfinished discussion around exactly how far to take these specialisations for PG15. We could certainly add more. There are various other things we could do to further speed up sorting for these datatypes. For example, we could add 3 more variations of these functions to be called when there are no NULL datums to sort. That effectively multiplies the number of specialisations by 2, or adds another dimension.

I have the following dimensions in mind for consideration:

1. Specialisations to handle sorting of non-null datums (eliminates checking for nulls in the comparison function)
2. Specialisations to handle single column sorts (eliminates the tiebreaker function call or any checks for the existence of a tiebreaker)
3. ASC sort (no need for: if (ssup->ssup_reverse) INVERT_COMPARE_RESULT(compare))

If we did all of the above then we'd end up with 3 * 2 * 2 * 2 = 24 specialization functions. That seems a bit excessive, so here I'd like to discuss which ones we should add, if any.

I've attached a very basic implementation of #1, which adds 3 new functions for sorting non-null datums. This could be made a bit more advanced. For now, I just added a bool flag to track whether we have any NULL datum1s in memtuples[]. For bounded sorts, we may remove NULLs from that array, and could end up with no nulls after having seen a null, so maybe a count would be better than a flag.

A quick performance test with 1 million random INTs shows a ~6% performance improvement when there are no nulls.
master:

$ pgbench -n -f bench.sql -T 60 postgres
latency average = 159.837 ms
latency average = 161.193 ms
latency average = 159.512 ms

master + not_null_sort_specializations.patch:

$ pgbench -n -f bench.sql -T 60 postgres
latency average = 150.791 ms
latency average = 149.843 ms
latency average = 150.319 ms

I didn't test for any regression when there are NULLs and we're unable to use the new specializations. I'm hoping the null tracking will be almost free, but I will need to check.

It's quite subjective which specializations should be added. I think #1 is likely to have the biggest wins when it can be used, as it removes the most branching in the comparator function; however, the biggest gains are not the only thing to consider. We also need to consider how commonly these functions will be used. I don't have any information about that.

David
Attachment
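To illustrate the kind of branching #1 removes, here is a minimal standalone sketch. The types and names are simplified stand-ins, not PostgreSQL's actual definitions: a generic first-key comparator must branch on the null-ness of both inputs and on the NULLS FIRST/LAST ordering, while a not-null specialization compares the datums directly.

```c
#include <stdbool.h>

/* Simplified stand-ins for Datum/SortTuple, for illustration only */
typedef struct
{
    long    datum1;
    bool    isnull1;
} SortTuple;

/* Generic comparator: must handle NULLs and NULLS FIRST/LAST */
static int
cmp_int_generic(const SortTuple *a, const SortTuple *b, bool nulls_first)
{
    if (a->isnull1)
    {
        if (b->isnull1)
            return 0;           /* both NULL: consider equal */
        return nulls_first ? -1 : 1;
    }
    if (b->isnull1)
        return nulls_first ? 1 : -1;
    return (a->datum1 > b->datum1) - (a->datum1 < b->datum1);
}

/* Specialization usable only when the caller knows no NULLs exist */
static int
cmp_int_notnull(const SortTuple *a, const SortTuple *b)
{
    return (a->datum1 > b->datum1) - (a->datum1 < b->datum1);
}
```

The flag (or count) of NULL datum1 values in memtuples[] is what would let tuplesort choose the second function safely.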
On Tue, Aug 23, 2022 at 9:18 AM David Rowley <dgrowleyml@gmail.com> wrote:
>
> I have the following dimensions in mind for consideration:
>
> 1. Specialisations to handle sorting of non-null datums (eliminates
> checking for nulls in the comparison function)
> 2. Specialisations to handle single column sorts (eliminates
> tiebreaker function call or any checks for existence of tiebreaker)
> 3. ASC sort (No need for if (ssup->ssup_reverse) INVERT_COMPARE_RESULT(compare))
>
> If we did all of the above then we'd end up with 3 * 2 * 2 * 2 = 24
> specialization functions. That seems a bit excessive. So here I'd
> like to discuss which ones we should add, if any.
>
> I've attached a very basic implementation of #1 which adds 3 new
> functions for sorting non-null datums.

Did you happen to see

https://www.postgresql.org/message-id/CAFBsxsFhq8VUSkUL5YO17cFXbCPwtbbxBu%2Bd9MFrrsssfDXm3Q%40mail.gmail.com

where I experimented with removing all null handling?

What I had in mind was pre-partitioning nulls and non-nulls when populating the SortTuple array, then calling qsort twice: once on the non-null partition with comparators that assume non-null, and (only if there are additional sort keys) once on the null partition. The pre-partitioning would take care of nulls first/last upfront.

I haven't looked into the feasibility of this yet, but the good thing about the concept is that it removes null handling in the comparators without additional sort specializations.

--
John Naylor
EDB: http://www.enterprisedb.com
On Tue, 23 Aug 2022 at 15:22, John Naylor <john.naylor@enterprisedb.com> wrote:
> Did you happen to see
>
> https://www.postgresql.org/message-id/CAFBsxsFhq8VUSkUL5YO17cFXbCPwtbbxBu%2Bd9MFrrsssfDXm3Q%40mail.gmail.com

I missed that. It looks like a much more promising idea than what I came up with. I've not looked at your code yet, but I'm interested and will aim to look soon.

David
On Tue, Aug 23, 2022 at 11:24 AM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Tue, 23 Aug 2022 at 15:22, John Naylor <john.naylor@enterprisedb.com> wrote:
> > Did you happen to see
> >
> > https://www.postgresql.org/message-id/CAFBsxsFhq8VUSkUL5YO17cFXbCPwtbbxBu%2Bd9MFrrsssfDXm3Q%40mail.gmail.com
>
> I missed that. It looks like a much more promising idea than what I
> came up with. I've not looked at your code yet, but I'm interested and
> will aim to look soon.

Note that I haven't actually implemented this idea yet, just tried to model the effects by lobotomizing the current comparators. I think it's worth pursuing and will try to come back to it this cycle, but if you or anyone else wants to try, that's fine of course.

--
John Naylor
EDB: http://www.enterprisedb.com
On Tue, Aug 23, 2022 at 1:13 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Tue, Aug 23, 2022 at 11:24 AM David Rowley <dgrowleyml@gmail.com> wrote:
> >
> > On Tue, 23 Aug 2022 at 15:22, John Naylor <john.naylor@enterprisedb.com> wrote:
> > > Did you happen to see
> > >
> > > https://www.postgresql.org/message-id/CAFBsxsFhq8VUSkUL5YO17cFXbCPwtbbxBu%2Bd9MFrrsssfDXm3Q%40mail.gmail.com
> >
> > I missed that. It looks like a much more promising idea than what I
> > came up with. I've not looked at your code yet, but I'm interested and
> > will aim to look soon.
>
> Note that I haven't actually implemented this idea yet, just tried to
> model the effects by lobotomizing the current comparators. I think
> it's worth pursuing and will try to come back to it this cycle, but if
> you or anyone else wants to try, that's fine of course.
Coming back to this, I wanted to sketch out this idea in a bit more detail.
Have two memtuple arrays, one for first sortkey null and one for first sortkey non-null:
- Qsort the non-null array, including whatever specialization is available. Existing specialized comparators could ignore nulls (and their ordering) taking less space in the binary.
- Only if there is more than one sort key, qsort the null array. Ideally at some point we would have a method of ignoring the first sortkey (this is an existing opportunity that applies elsewhere as well).
- To handle two arrays, grow_memtuples() would need some adjustment, as would any callers that read the final result of an in-memory sort -- they would need to retrieve the tuples starting with the appropriate array depending on NULLS FIRST/LAST behavior.
I believe external merges wouldn't have to do anything different, since when writing out the tapes, we read from the arrays in the right order.
(One could extend this idea further and have two pools of tapes for null and non-null first sortkey, that are merged separately, in the right order. That sounds like quite a bit more complexity than is worth, however.)
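A rough sketch of the pre-partitioning idea, with simplified stand-in types (the real change would live in tuplesort's tuple-loading and grow_memtuples() paths, which this does not attempt to model): each incoming tuple is routed by first-key null-ness, so the non-null array can be qsorted with a comparator that never checks for NULLs, and NULLS FIRST/LAST becomes a question of which array the caller reads back first.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for the first-sortkey portion of a SortTuple */
typedef struct
{
    long    datum1;
    bool    isnull1;
} SortTuple;

/* Two arrays instead of one memtuples array */
typedef struct
{
    SortTuple  *nonnull;        /* tuples with first sortkey non-null */
    SortTuple  *nulls;          /* tuples with first sortkey null */
    size_t      n_nonnull;
    size_t      n_nulls;
} TupleArrays;

/* Route each incoming tuple by first-key null-ness */
static void
puttuple(TupleArrays *ta, SortTuple tup)
{
    if (tup.isnull1)
        ta->nulls[ta->n_nulls++] = tup;
    else
        ta->nonnull[ta->n_nonnull++] = tup;
}

/* Comparator for the non-null array: no null checks needed */
static int
cmp_nonnull(const void *a, const void *b)
{
    long    la = ((const SortTuple *) a)->datum1;
    long    lb = ((const SortTuple *) b)->datum1;

    return (la > lb) - (la < lb);
}
```

The null array would only need sorting when there is more than one sort key, as described above.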
Hi, John!

Generally, I like the separation of non-null values before sorting and would like to join as a reviewer when we come to a patch. I have only a small question:

> - Only if there is more than one sort key, qsort the null array. Ideally at some point we would have a method of ignoring the first sortkey (this is an existing opportunity that applies elsewhere as well).

Should we need to sort by the second sort key provided the first one is NULL by standard or by some part of the code relying on this? I suppose NULL values in the first sort key mean attribute values are undefined and there is no preferred order between these tuples, even if their second sort keys are different. And maybe (unlikely, IMO) we need some analog of NULLS DISTINCT/NOT DISTINCT in this scope?

Kind regards,
Pavel Borisov, Supabase.
On Thu, Jan 26, 2023 at 6:14 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote:
> > - Only if there is more than one sort key, qsort the null array. Ideally at some point we would have a method of ignoring the first sortkey (this is an existing opportunity that applies elsewhere as well).
> Should we need to sort by the second sort key provided the first one
> is NULL by standard or by some part of the code relying on this? I
> suppose NULL values in the first sort key mean attribute values are
> undefined and there is no preferred order between these tuples, even
> if their second sort keys are different.
> is NULL by standard or by some part of the code relying on this?
I'm not sure I quite understand the question.
If there is more than one sort key, and the specialized comparison on the first key gives a definitive zero result, it falls back to comparing all keys from the full tuple. (The sorttuple struct only contains the first sortkey, which might actually be an abbreviated key.) A possible optimization, relevant here and also elsewhere, is to compare only using keys starting from key2. But note: if the first key is abbreviated, a zero result is not definitive, and we must check the first key's full value from the tuple.
> suppose NULL values in the first sort key mean attribute values are
> undefined and there is no preferred order between these tuples, even
> if their second sort keys are different.
There is in fact a preferred order between these tuples -- the second key is the tie breaker in this case.
--
John Naylor
EDB: http://www.enterprisedb.com
On Thu, 26 Jan 2023 at 23:29, John Naylor <john.naylor@enterprisedb.com> wrote:
> Coming back to this, I wanted to sketch out this idea in a bit more detail.
>
> Have two memtuple arrays, one for first sortkey null and one for first sortkey non-null:
> - Qsort the non-null array, including whatever specialization is available. Existing specialized comparators could ignore nulls (and their ordering) taking less space in the binary.
> - Only if there is more than one sort key, qsort the null array. Ideally at some point we would have a method of ignoring the first sortkey (this is an existing opportunity that applies elsewhere as well).
> - To handle two arrays, grow_memtuples() would need some adjustment, as would any callers that read the final result of an in-memory sort -- they would need to retrieve the tuples starting with the appropriate array depending on NULLS FIRST/LAST behavior.

Thanks for coming back to this. I've been thinking about this again recently due to what was discovered in [1]. Basically, the patch on that thread is trying to eliminate the need for the ORDER BY sort in a query such as:

SELECT a,b,row_number() over (order by a) from ab order by a,b;

The idea is to just perform the full sort on a,b for the WindowClause to save having to do an Incremental Sort on b for the ORDER BY after evaluating the window funcs. Surprisingly (for me), we found a bunch of cases where performance is better when doing a sort on some of the keys and then an Incremental Sort on the remainder, rather than just doing a single sort on everything. I don't fully understand why this is yet, but one theory I have is that it might be down to work_mem being larger than the CPU's L3 cache and causing more cache line misses. With the simpler sort, there's less swapping of items in the array because the comparison function sees tuples as equal more often. I did find that the gap between the two is not as large with fewer tuples to sort.
I think the slower sorts I found in [2] could also be partially caused by the current sort specialisation comparators re-comparing the leading column during a tie-break. I've not gotten around to disabling the sort specialisations to see if and how much this is a factor for that test.

Why I'm bringing this up here is that I wondered, in addition to what you're mentioning above, if we're making some changes to allow multiple in-memory arrays, would it be worth going a little further and allowing N arrays which we'd try to keep under the L3 cache size? Maybe some new GUC could be used to know what a good value is. With fast enough disks, it's often faster to use smaller values of work_mem which don't exceed L3 to keep these batches smaller. Keeping them in memory would remove the need to write out tapes. I don't really know if such a feature could easily be tacked on to what you propose above, but I feel like what you're talking about is part of the way towards that. Maybe, at the least, the additional work could be kept in mind when this is written so that it's easier to extend in the future.

David

[1] https://postgr.es/m/CAApHDvpAO5H_L84kn9gCJ_hihOavtmDjimKYyftjWtF69BJ=8Q@mail.gmail.com
[2] https://postgr.es/m/CAApHDvqh%2BqOHk4sbvvy%3DQr2NjPqAAVYf82oXY0g%3DZ2hRpC2Vmg%40mail.gmail.com
On Thu, Jan 26, 2023 at 7:15 PM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Thu, 26 Jan 2023 at 23:29, John Naylor <john.naylor@enterprisedb.com> wrote:
> > Coming back to this, I wanted to sketch out this idea in a bit more detail.
> >
> > Have two memtuple arrays, one for first sortkey null and one for first sortkey non-null:
> > - Qsort the non-null array, including whatever specialization is available. Existing specialized comparators could ignore nulls (and their ordering) taking less space in the binary.
> > - Only if there is more than one sort key, qsort the null array. Ideally at some point we would have a method of ignoring the first sortkey (this is an existing opportunity that applies elsewhere as well).
> > - To handle two arrays, grow_memtuples() would need some adjustment, as would any callers that read the final result of an in-memory sort -- they would need to retrieve the tuples starting with the appropriate array depending on NULLS FIRST/LAST behavior.
>
> Thanks for coming back to this. I've been thinking about this again
> recently due to what was discovered in [1].
That was indeed part of the motivation for bringing this up.
> Why I'm bringing this up here is that I wondered, in addition to what
> you're mentioning above, if we're making some changes to allow
> multiple in-memory arrays, would it be worth going a little further
> and allowing it so we could have N arrays which we'd try to keep under
> L3 cache size. Maybe some new GUC could be used to know what a good
> value is. With fast enough disks, it's often faster to use smaller
> values of work_mem which don't exceed L3 to keep these batches
> smaller. Keeping them in memory would remove the need to write out
> tapes.
That's interesting. I don't know enough to guess how complex it would be to make "external" merges agnostic about whether the tapes are on disk or in memory.
If in-memory sorts were designed analogously to external ones, grow_memtuples would never have to repalloc, it could just allocate a new tape, which could further shave some cycles.
My hunch in my last email was that having separate groups of tapes for each null/non-null first sortkey would be complex, because it increases the number of places that have to know about nulls and their ordering. If we wanted to go to N arrays instead of 2, and additionally wanted separate null/non-null treatment, it seems we would still need 2 sets of arrays, one with N non-null-first-sortkey tapes and one with M null-first-sortkey tapes.
So using 2 arrays seems like the logical first step. I'd be curious to hear other possible development paths.
> I think the slower sorts I found in [2] could also be partially caused
> by the current sort specialisation comparators re-comparing the
> leading column during a tie-break. I've not gotten around to disabling
> the sort specialisations to see if and how much this is a factor for
> that test.
Right, that's worth addressing independently of the window function consideration. I'm still swapping this area back in my head, but I believe one issue is that state->base.onlyKey signals two things: "one sortkey, not abbreviated". We could add a separate branch for "first key unabbreviated, nkeys>1" -- I don't think we'd need to specialize, just branch -- and instead of state->base.comparetup, call a set of analogous functions that only handle keys 2 and above (comparetup_tail_* ? or possibly just add a boolean parameter compare_first). That would not pose a huge challenge, I think, since they're already written like this:
/* Compare the leading sort key */
compare = ApplySortComparator(...);
if (compare != 0)
return compare;
/* Compare additional sort keys */
...
The null/non-null separation would eliminate a bunch of branches in inlined comparators, so we could afford to add another branch for number of keys.
I haven't thought through either of these ideas in the gory detail, but I don't yet see any big obstacles.
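A minimal model of that split, with illustrative names (not the actual tuplesort code): the full comparetup routine keeps its current shape, and everything below the leading-key comparison becomes a separately callable tail.

```c
/* Illustrative stand-in: a tuple with two integer sort keys */
typedef struct
{
    long    keys[2];
} Tup;

static int
cmp_key(long a, long b)
{
    return (a > b) - (a < b);
}

/* Tail comparator: handles keys 2..n only (here, just key 2) */
static int
comparetup_tail(const Tup *a, const Tup *b)
{
    return cmp_key(a->keys[1], b->keys[1]);
}

/* Full comparator: compare the leading sort key, then fall back */
static int
comparetup(const Tup *a, const Tup *b)
{
    int     compare = cmp_key(a->keys[0], b->keys[0]);

    if (compare != 0)
        return compare;
    return comparetup_tail(a, b);
}
```

A specialized qsort that has already compared an unabbreviated key 1 could then call the tail directly instead of re-comparing the leading column.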
--
John Naylor
EDB: http://www.enterprisedb.com
I wrote:
> On Thu, Jan 26, 2023 at 7:15 PM David Rowley <dgrowleyml@gmail.com> wrote:
> > I think the slower sorts I found in [2] could also be partially caused
> > by the current sort specialisation comparators re-comparing the
> > leading column during a tie-break. I've not gotten around to disabling
> > the sort specialisations to see if and how much this is a factor for
> > that test.
>
> Right, that's worth addressing independently of the window function consideration. I'm still swapping this area back in my head, but I believe one issue is that state->base.onlyKey signals two things: "one sortkey, not abbreviated". We could add a separate branch for "first key unabbreviated, nkeys>1" -- I don't think we'd need to specialize, just branch -- and instead of state->base.comparetup, call a set of analogous functions that only handle keys 2 and above (comparetup_tail_* ? or possibly just add a boolean parameter compare_first). That would not pose a huge challenge, I think, since they're already written like this:
>
> /* Compare the leading sort key */
> compare = ApplySortComparator(...);
> if (compare != 0)
> return compare;
>
> /* Compare additional sort keys */
> ...
>
> The null/non-null separation would eliminate a bunch of branches in inlined comparators, so we could afford to add another branch for number of keys.
I gave this a go, and it turns out we don't need any extra branches in the inlined comparators -- the new fallbacks are naturally written to account for the "!onlyKey" case. If the first sortkey was abbreviated, call its full comparator, otherwise skip to the next sortkey (if there isn't one, we shouldn't have gotten here). The existing comparetup functions try the simple case and then call the fallback (which could be inlined for them but I haven't looked).
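As a sketch of that control flow (hypothetical names; only the branch structure matters here): the fallback re-checks the first key from the full tuple only when it was abbreviated, since equal abbreviated keys are not definitive.

```c
#include <stdbool.h>

/* Illustrative stand-in: full first-key value plus one additional key */
typedef struct
{
    long    full_key1;
    long    key2;
} Tuple;

static int
cmp_long(long a, long b)
{
    return (a > b) - (a < b);
}

/*
 * Entered after the specialized qsort found the (possibly abbreviated)
 * first keys equal.  Abbreviated keys can compare equal even when the
 * full values differ, so only in that case re-check key 1.
 */
static int
tail_fallback(const Tuple *a, const Tuple *b, bool key1_abbreviated)
{
    if (key1_abbreviated)
    {
        int     compare = cmp_long(a->full_key1, b->full_key1);

        if (compare != 0)
            return compare;
    }
    /* the first key is definitively equal: move on to keys 2..n */
    return cmp_long(a->key2, b->key2);
}
```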
Tests pass, but I'm not sure yet if we need more tests. I don't have a purpose-built benchmark at the moment, but I'll see if any of my existing tests exercise this code path. I can also try the window function case unless someone beats me to it.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachment
I wrote:
> Have two memtuple arrays, one for first sortkey null and one for first sortkey non-null:
Hacking on this has gotten only as far as the "compiles but segfaults" stage, but I wanted to note an idea that occurred to me:
Internal qsort doesn't need the srctape member, and removing both that and isnull1 would allow 16-byte "init-tuples" for qsort, which would save a bit of work_mem space, binary space for qsort specializations, and work done during swaps.
During heap sort, we already copy one entry into a stack variable to keep from clobbering it, so it's not a big step to read a member from the init array and form a regular sorttuple from it.
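For reference, a sketch of the layouts involved (approximate; the field names follow tuplesort's SortTuple but this is not the actual definition): dropping srctape and isnull1 shrinks the struct from 24 to 16 bytes on typical 64-bit platforms.

```c
#include <stdint.h>
#include <stdbool.h>

typedef uintptr_t Datum;        /* stand-in for PostgreSQL's Datum */

/* Approximation of the current SortTuple layout */
typedef struct
{
    void   *tuple;              /* the tuple itself */
    Datum   datum1;             /* value of first key column */
    bool    isnull1;            /* is first key column NULL? */
    int     srctape;            /* source tape number, for merging */
} SortTuple;

/*
 * Hypothetical "init-tuple" for internal qsort: srctape is unneeded
 * there, and null-ness of the first key is implied by which array the
 * tuple lives in, so both fields (and the padding) can go.
 */
typedef struct
{
    void   *tuple;
    Datum   datum1;
} InitTuple;
```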