Thread: [PATCH] Use optimized single-datum tuplesort in ExecSort

[PATCH] Use optimized single-datum tuplesort in ExecSort

From
Ronan Dunklau
Date:
Hello,

While testing the patch "Add proper planner support for ORDER BY / DISTINCT 
aggregates" [0] I discovered the performance penalty from adding a sort node 
essentially came from not using the single-datum tuplesort optimization in 
ExecSort (contrary to the sorting done in ExecAgg).

I originally proposed this patch as a companion in the same thread [1], but 
following James suggestion I'm making a separate thread just for this as the 
optimization is worthwhile independently of David's patch: it looks like we 
can expect a 2x speedup on a "select a single ordered column" case.

The patch aimed to be as simple as possible: we only turn this optimization on 
when the tuple being sorted has only one attribute, it is "byval" (so as not 
to incur copies which would be hard to track in the execution tree) and 
unbound (again, not having to deal with copying borrowed datum anywhere).

The attached patch is originally by me, with some cleanup by Ranier Vilela. 
I'm sending Ranier's version here.


[0]: https://commitfest.postgresql.org/33/3164/
[1]: https://www.postgresql.org/message-id/4480689.ObhdGn8bVM%40aivenronan

-- 
Ronan Dunklau
Attachment

Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

From
Ranier Vilela
Date:
Em ter., 6 de jul. de 2021 às 03:15, Ronan Dunklau <ronan.dunklau@aiven.io> escreveu:
Hello,

While testing the patch "Add proper planner support for ORDER BY / DISTINCT
aggregates" [0] I discovered the performance penalty from adding a sort node
essentially came from not using the single-datum tuplesort optimization in
ExecSort (contrary to the sorting done in ExecAgg).

I originally proposed this patch as a companion in the same thread [1], but
following James suggestion I'm making a separate thread just for this as the
optimization is worthwhile independently of David's patch: it looks like we
can expect a 2x speedup on a "select a single ordered column" case.

The patch aimed to be as simple as possible: we only turn this optimization on
when the tuple being sorted has only one attribute, it is "byval" (so as not
to incur copies which would be hard to track in the execution tree) and
unbound (again, not having to deal with copying borrowed datum anywhere).

The attached patch is originally by me, with some cleanup by Ranier Vilela.
I'm sending Ranier's version here.
Nice Ronan.
But I think there is some confusion here.
The author is not you?

Just to clarify, at Commitfest, it was supposed to be the other way around.
You as an author and David as a reviewer.
I'll put myself as a reviewer too.

regards,
Ranier Vilela

Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

From
Ranier Vilela
Date:
Em ter., 6 de jul. de 2021 às 08:25, Ranier Vilela <ranier.vf@gmail.com> escreveu:
Em ter., 6 de jul. de 2021 às 03:15, Ronan Dunklau <ronan.dunklau@aiven.io> escreveu:
Hello,

While testing the patch "Add proper planner support for ORDER BY / DISTINCT
aggregates" [0] I discovered the performance penalty from adding a sort node
essentially came from not using the single-datum tuplesort optimization in
ExecSort (contrary to the sorting done in ExecAgg).

I originally proposed this patch as a companion in the same thread [1], but
following James suggestion I'm making a separate thread just for this as the
optimization is worthwhile independently of David's patch: it looks like we
can expect a 2x speedup on a "select a single ordered column" case.

The patch aimed to be as simple as possible: we only turn this optimization on
when the tuple being sorted has only one attribute, it is "byval" (so as not
to incur copies which would be hard to track in the execution tree) and
unbound (again, not having to deal with copying borrowed datum anywhere).

The attached patch is originally by me, with some cleanup by Ranier Vilela.
I'm sending Ranier's version here.
Nice Ronan.
But I think there is some confusion here.
The author is not you?

Just to clarify, at Commitfest, it was supposed to be the other way around.
You as an author and David as a reviewer.
I'll put myself as a reviewer too.
Sorry David, my mistake.
I confused the numbers (id) of Commitfest.

regards,
Ranier Vilela

Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

From
James Coleman
Date:
Adding David since this patch is likely a precondition for [1].

On Tue, Jul 6, 2021 at 2:15 AM Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
>
> Hello,
>
> While testing the patch "Add proper planner support for ORDER BY / DISTINCT
> aggregates" [0] I discovered the performance penalty from adding a sort node
> essentially came from not using the single-datum tuplesort optimization in
> ExecSort (contrary to the sorting done in ExecAgg).
>
> I originally proposed this patch as a companion in the same thread [1], but
> following James suggestion I'm making a separate thread just for this as the
> optimization is worthwhile independently of David's patch: it looks like we
> can expect a 2x speedup on a "select a single ordered column" case.
>
> The patch aimed to be as simple as possible: we only turn this optimization on
> when the tuple being sorted has only one attribute, it is "byval" (so as not
> to incur copies which would be hard to track in the execution tree) and
> unbound (again, not having to deal with copying borrowed datum anywhere).

Thanks again for finding this and working up a patch.

I've taken a look, and while I haven't dug into testing it yet, I have
a few comments.

First, the changes are lacking any explanatory comments. Probably we
should follow how nodeAgg does this and add both comments to the
ExecSort function header as well as specific comments above the "if"
around the new tuplesort_begin_datum explaining the specific
conditions that are required for the optimization to be useful and
safe.

That leads to a question I had: I don't follow why bounded mode (when
using byval) needs to be excluded. Comments should be added if there's
a good reason (as noted above), but maybe it's a case we can handle
safely?

A second question: at first glance it's intuitively the case we might
not be able to handle byref values. But nodeAgg doesn't seem to have
that restriction. What's the difference here?

A few small code observations:
- In my view the addition of unlikely() in ExecSort is unlikely to be
of benefit because it's a single call for the entire node's execution
(not in the tuple loop).
- It seems clearer to change the "if (!node->is_single_val)" to flip
the true/false cases so we don't need the negation.
- I assume there are tests that likely already cover this case, but
it'd be worth verifying that.

Finally, I believe the same optimization likely ought to be added to
nodeIncrementalSort. It's less likely the tests there are sufficient
for both this and the original case, but we'll see.

Thanks,
James

1: https://www.postgresql.org/message-id/CAApHDvpHzfo92%3DR4W0%2BxVua3BUYCKMckWAmo-2t_KiXN-wYH%3Dw%40mail.gmail.com



Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

From
Ranier Vilela
Date:
Em ter., 6 de jul. de 2021 às 10:19, James Coleman <jtc331@gmail.com> escreveu:
Adding David since this patch is likely a precondition for [1].

On Tue, Jul 6, 2021 at 2:15 AM Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
>
> Hello,
>
> While testing the patch "Add proper planner support for ORDER BY / DISTINCT
> aggregates" [0] I discovered the performance penalty from adding a sort node
> essentially came from not using the single-datum tuplesort optimization in
> ExecSort (contrary to the sorting done in ExecAgg).
>
> I originally proposed this patch as a companion in the same thread [1], but
> following James suggestion I'm making a separate thread just for this as the
> optimization is worthwhile independently of David's patch: it looks like we
> can expect a 2x speedup on a "select a single ordered column" case.
>
> The patch aimed to be as simple as possible: we only turn this optimization on
> when the tuple being sorted has only one attribute, it is "byval" (so as not
> to incur copies which would be hard to track in the execution tree) and
> unbound (again, not having to deal with copying borrowed datum anywhere).

Thanks again for finding this and working up a patch.

I've taken a look, and while I haven't dug into testing it yet, I have
a few comments.

First, the changes are lacking any explanatory comments. Probably we
should follow how nodeAgg does this and add both comments to the
ExecSort function header as well as specific comments above the "if"
around the new tuplesort_begin_datum explaining the specific
conditions that are required for the optimization to be useful and
safe.

That leads to a question I had: I don't follow why bounded mode (when
using byval) needs to be excluded. Comments should be added if there's
a good reason (as noted above), but maybe it's a case we can handle
safely?

A second question: at first glance it's intuitively the case we might
not be able to handle byref values. But nodeAgg doesn't seem to have
that restriction. What's the difference here?

A few small code observations:
- In my view the addition of unlikely() in ExecSort is unlikely to be
of benefit because it's a single call for the entire node's execution
(not in the tuple loop).
No objection. And I agree that testing is complex and needs to remain as it is.

- It seems clearer to change the "if (!node->is_single_val)" to flip
the true/false cases so we don't need the negation.
I think yes, it can be better.

regards,
Ranier Vilela

Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

From
Dilip Kumar
Date:
On Tue, Jul 6, 2021 at 6:49 PM James Coleman <jtc331@gmail.com> wrote:
>
> Adding David since this patch is likely a precondition for [1].
>
> On Tue, Jul 6, 2021 at 2:15 AM Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
> >
> > Hello,
> >
> > While testing the patch "Add proper planner support for ORDER BY / DISTINCT
> > aggregates" [0] I discovered the performance penalty from adding a sort node
> > essentially came from not using the single-datum tuplesort optimization in
> > ExecSort (contrary to the sorting done in ExecAgg).
> >
> > I originally proposed this patch as a companion in the same thread [1], but
> > following James suggestion I'm making a separate thread just for this as the
> > optimization is worthwhile independently of David's patch: it looks like we
> > can expect a 2x speedup on a "select a single ordered column" case.
> >
> > The patch aimed to be as simple as possible: we only turn this optimization on
> > when the tuple being sorted has only one attribute, it is "byval" (so as not
> > to incur copies which would be hard to track in the execution tree) and
> > unbound (again, not having to deal with copying borrowed datum anywhere).
>
> Thanks again for finding this and working up a patch.
>
> I've taken a look, and while I haven't dug into testing it yet, I have
> a few comments.
>
> First, the changes are lacking any explanatory comments. Probably we
> should follow how nodeAgg does this and add both comments to the
> ExecSort function header as well as specific comments above the "if"
> around the new tuplesort_begin_datum explaining the specific
> conditions that are required for the optimization to be useful and
> safe.
>
> That leads to a question I had: I don't follow why bounded mode (when
> using byval) needs to be excluded. Comments should be added if there's
> a good reason (as noted above), but maybe it's a case we can handle
> safely?
>
> A second question: at first glance it's intuitively the case we might
> not be able to handle byref values. But nodeAgg doesn't seem to have
> that restriction. What's the difference here?
>

I think tuplesort_begin_datum, doesn't have any such limitation, it
can handle any type of Datum so I think we don't need to consider the
only attbyval, we can consider any type of attribute for this
optimization.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

From
Ronan Dunklau
Date:
Thank you for the review, I will address those shortly, but will answer some 
questions in the meantime.

> > First, the changes are lacking any explanatory comments. Probably we
> > should follow how nodeAgg does this and add both comments to the
> > ExecSort function header as well as specific comments above the "if"
> > around the new tuplesort_begin_datum explaining the specific
> > conditions that are required for the optimization to be useful and
> > safe.

Done, since I lifted the restrictions following your questions, there isn't 
much left to comment. (see below)

> > 
> > That leads to a question I had: I don't follow why bounded mode (when
> > using byval) needs to be excluded. Comments should be added if there's
> > a good reason (as noted above), but maybe it's a case we can handle
> > safely?

I had test failures when trying to move the Datum around when performing a 
bounded sort, but did not look into it at first.

Now I've looked into it, and the switch to a heapsort when using bounded mode 
just unconditionnaly tried to free a tuple that was never there to begin with. 
So if the SortTuple does not contain an actual tuple, but only a single datum, 
do not do that. 

I've updated the patch to fix this and enable the optimization in the case of 
bounded sort.

> > 
> > A second question: at first glance it's intuitively the case we might
> > not be able to handle byref values. But nodeAgg doesn't seem to have
> > that restriction. What's the difference here?
> 
> I think tuplesort_begin_datum, doesn't have any such limitation, it
> can handle any type of Datum so I think we don't need to consider the
> only attbyval, we can consider any type of attribute for this
> optimization.

I've restricted the optimization to byval types because of the following 
comment in nodeAgg.c:

    /*
     * Note: if input type is pass-by-ref, the datums returned by the 
sort are
     * freshly palloc'd in the per-query context, so we must be careful 
to
     * pfree them when they are no longer needed.
     */

As I was not sure how to handle that, I prefered the safety of not enabling 
it. Since you both told me it should be safe, I've lifted that restriction 
too.


> A few small code observations:
> - In my view the addition of unlikely() in ExecSort is unlikely to be
> of benefit because it's a single call for the entire node's execution
> (not in the tuple loop).

Done.

> - It seems clearer to change the "if (!node->is_single_val)" to flip
> the true/false cases so we don't need the negation.

Agreed, done.

> - I assume there are tests that likely already cover this case, but
> it'd be worth verifying that.

Yes many test cases cover that, but maybe it would be better to explictly 
check for it on some cases: do you think we could add a debug message that can 
be checked for ? 

> Finally, I believe the same optimization likely ought to be added to
> nodeIncrementalSort. It's less likely the tests there are sufficient
> for both this and the original case, but we'll see.

I will look into it, but isn't incrementalsort used to sort tuples on several 
keys, when they are already sorted on the first ? In that case, I doubt we 
would ever have a single-valued tuple here, except if there is a projection to 
strip the tuple from extraneous attributes.

-- 
Ronan Dunklau
Attachment

Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

From
James Coleman
Date:
On Tue, Jul 6, 2021 at 11:03 AM Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
>
> Thank you for the review, I will address those shortly, but will answer some
> questions in the meantime.
>
> > > First, the changes are lacking any explanatory comments. Probably we
> > > should follow how nodeAgg does this and add both comments to the
> > > ExecSort function header as well as specific comments above the "if"
> > > around the new tuplesort_begin_datum explaining the specific
> > > conditions that are required for the optimization to be useful and
> > > safe.
>
> Done, since I lifted the restrictions following your questions, there isn't
> much left to comment. (see below)
>
> > >
> > > That leads to a question I had: I don't follow why bounded mode (when
> > > using byval) needs to be excluded. Comments should be added if there's
> > > a good reason (as noted above), but maybe it's a case we can handle
> > > safely?
>
> I had test failures when trying to move the Datum around when performing a
> bounded sort, but did not look into it at first.
>
> Now I've looked into it, and the switch to a heapsort when using bounded mode
> just unconditionnaly tried to free a tuple that was never there to begin with.
> So if the SortTuple does not contain an actual tuple, but only a single datum,
> do not do that.
>
> I've updated the patch to fix this and enable the optimization in the case of
> bounded sort.

Awesome.

> > > A second question: at first glance it's intuitively the case we might
> > > not be able to handle byref values. But nodeAgg doesn't seem to have
> > > that restriction. What's the difference here?
> >
> > I think tuplesort_begin_datum, doesn't have any such limitation, it
> > can handle any type of Datum so I think we don't need to consider the
> > only attbyval, we can consider any type of attribute for this
> > optimization.
>
> I've restricted the optimization to byval types because of the following
> comment in nodeAgg.c:
>
>         /*
>          * Note: if input type is pass-by-ref, the datums returned by the
> sort are
>          * freshly palloc'd in the per-query context, so we must be careful
> to
>          * pfree them when they are no longer needed.
>          */
>
> As I was not sure how to handle that, I prefered the safety of not enabling
> it. Since you both told me it should be safe, I've lifted that restriction
> too.

To be clear, I don't know for certain it's safe [without extra work],
but even if it involves some extra manual pfree'ing (a la nodeAgg)
it's probably worth it. Maybe someone else will weigh in on whether or
not anything special is required here to ensure we don't leak memory
(I haven't looked in detail yet).

> > A few small code observations:
> > - In my view the addition of unlikely() in ExecSort is unlikely to be
> > of benefit because it's a single call for the entire node's execution
> > (not in the tuple loop).
>
> Done.
>
> > - It seems clearer to change the "if (!node->is_single_val)" to flip
> > the true/false cases so we don't need the negation.
>
> Agreed, done.

Thanks

> > - I assume there are tests that likely already cover this case, but
> > it'd be worth verifying that.
>
> Yes many test cases cover that, but maybe it would be better to explictly
> check for it on some cases: do you think we could add a debug message that can
> be checked for ?

Mostly I think we should verify code coverage and _maybe_ add a
specific test or two that we know execises this path. I don't know
that the debug message needs to be matched in the test (probably more
pain than it's worth), but the debug message ("enabling datum sort
optimizaton" or similar) might be good anyway.

I wonder if we need to change costing of sorts for this case. I don't
like having to do so, but it's a significant change in speed, so
probably should impact what plan gets chosen. Hopefully others will
weigh on this also.

> > Finally, I believe the same optimization likely ought to be added to
> > nodeIncrementalSort. It's less likely the tests there are sufficient
> > for both this and the original case, but we'll see.
>
> I will look into it, but isn't incrementalsort used to sort tuples on several
> keys, when they are already sorted on the first ? In that case, I doubt we
> would ever have a single-valued tuple here, except if there is a projection to
> strip the tuple from extraneous attributes.

Yes and no. When incremental sort has to do a full sort there will
always be at least 2 attributes. But in prefix sort mode (see
prefixsort_state) only non-presorted columns are sorted (i.e., if
given a,b already sorted by a, then only b is sorted). So the
prefixsort_state could use this optimization.

James



Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

From
Ronan Dunklau
Date:
Le mardi 6 juillet 2021, 17:37:53 CEST James Coleman a écrit :
> Yes and no. When incremental sort has to do a full sort there will
> always be at least 2 attributes. But in prefix sort mode (see
> prefixsort_state) only non-presorted columns are sorted (i.e., if
> given a,b already sorted by a, then only b is sorted). So the
> prefixsort_state could use this optimization.

The optimization is not when we actually sort on a single key, but when we get
a single attribute in / out of the tuplesort.  Since sorting always add
resjunk entries for the keys being sorted on, I don't think we can ever end up
in a situation where the optimization would kick in, since the entries for the
already-performed-sort keys will need to be present in the output.

Maybe if instead of adding resjunk entries to the whole query's targetlist,
sort and incrementalsort nodes were able to do a projection from the input
(needed tle + resjunk sorting tle) to a tuple containing only the needed tle
on output before actually sorting it, it would be possible, but that would be
quite a big design change.

In the meantime I fixed some formatting issues, please find attached a new
patch.


--
Ronan Dunklau
Attachment

Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

From
David Rowley
Date:
On Wed, 7 Jul 2021 at 21:32, Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
> In the meantime I fixed some formatting issues, please find attached a new
> patch.

I started to look at this.

First I wondered how often we might be able to apply this
optimisation, so I ran make check after adding some elog(NOTICE) calls
to output which method is going to be used just before we do the
tuplestore_begin_* calls.  It looks like there are 614 instances of
Datum sorts and 4223 of tuple sorts. That's about 14.5% datum sorts.
223 of the 614 are byval types and the other 391 are byref. Not that
the regression tests are a good reflection of the real world, but if
it were then that's quite a good number of cases to be able to
optimise.

As for the patch, just a few things:

1. Can you add the missing braces in this if condition and the else
condition that belongs to it.

+ if (node->is_single_val)
+ for (;;)
+ {

2. I think it would nicer to name the new is_single_val field
"datumSort" instead. To me it seems more clear what it is for.

3. This seems to be a bug fix where byval datum sorts do not properly
handle bounded sorts.  I think that maybe that should be fixed and
backpatched.  I don't see anything that says Datum sorts can't be
bounded and if there were some restriction on that I'd expect
tuplesort_set_bound() to fail when the Tuplesortstate had been set up
with tuplesort_begin_datum().

 static void
 free_sort_tuple(Tuplesortstate *state, SortTuple *stup)
 {
- FREEMEM(state, GetMemoryChunkSpace(stup->tuple));
- pfree(stup->tuple);
+ /*
+ * If the SortTuple is actually only a single Datum, which was not copied
+ * as it is a byval type, do not try to free it nor account for it in
+ * memory used.
+ */
+ if (stup->tuple)
+ {
+ FREEMEM(state, GetMemoryChunkSpace(stup->tuple));
+ pfree(stup->tuple);
+ }

I can take this to another thread.

That's all I have for now.

David



Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

From
Ronan Dunklau
Date:
Le lundi 12 juillet 2021, 15:11:17 CEST David Rowley a écrit :
> On Wed, 7 Jul 2021 at 21:32, Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
> > In the meantime I fixed some formatting issues, please find attached a new
> > patch.
>
> I started to look at this.

Thank you ! I'm attaching a new version of the patch taking your remarks into
account.
>
> First I wondered how often we might be able to apply this
> optimisation, so I ran make check after adding some elog(NOTICE) calls
> to output which method is going to be used just before we do the
> tuplestore_begin_* calls.  It looks like there are 614 instances of
> Datum sorts and 4223 of tuple sorts. That's about 14.5% datum sorts.
> 223 of the 614 are byval types and the other 391 are byref. Not that
> the regression tests are a good reflection of the real world, but if
> it were then that's quite a good number of cases to be able to
> optimise.

That's an interesting stat.

>
> As for the patch, just a few things:
>
> 1. Can you add the missing braces in this if condition and the else
> condition that belongs to it.
>
> + if (node->is_single_val)
> + for (;;)
> + {
>

Done.

> 2. I think it would nicer to name the new is_single_val field
> "datumSort" instead. To me it seems more clear what it is for.

Done.

>
> 3. This seems to be a bug fix where byval datum sorts do not properly
> handle bounded sorts.  I think that maybe that should be fixed and
> backpatched.  I don't see anything that says Datum sorts can't be
> bounded and if there were some restriction on that I'd expect
> tuplesort_set_bound() to fail when the Tuplesortstate had been set up
> with tuplesort_begin_datum().

I've kept this as-is for now, but I will remove it from my patch if it is
deemed worthy of back-patching in your other thread.

Regards,

--
Ronan Dunklau
Attachment

Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

From
David Rowley
Date:
On Tue, 13 Jul 2021 at 01:59, Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
> > 3. This seems to be a bug fix where byval datum sorts do not properly
> > handle bounded sorts.  I think that maybe that should be fixed and
> > backpatched.  I don't see anything that says Datum sorts can't be
> > bounded and if there were some restriction on that I'd expect
> > tuplesort_set_bound() to fail when the Tuplesortstate had been set up
> > with tuplesort_begin_datum().
>
> I've kept this as-is for now, but I will remove it from my patch if it is
> deemed worthy of back-patching in your other thread.

I've now pushed that bug fix so it's fine to remove the change to
tuplesort.c now.

I also did a round of benchmarking on this patch using the attached
script. Anyone wanting to run it will need to run make installcheck
first to create the required tables.

On an AMD machine, I got the following results.

Result in transactions per second.
Test master v5 patch compare
Test1 446.1 657.3 147.32%
Test2 315.8 314.0 99.44%
Test3 302.3 392.1 129.67%
Test4 232.7 230.7 99.12%
Test5 230.0 446.1 194.00%
Test6 199.5 217.9 109.23%
Test7 188.7 185.3 98.21%
Test8 385.4 544.0 141.17%

Tests 2, 4, 7 are designed to check if there is any regression from
doing the additional run-time checks to see if we're doing datumSort.
I measured a very small penalty from this. It's most visible in test7
with a drop of about 1.8%. Each test did OFFSET 1000000 as I didn't
want to measure the overhead of outputting tuples.

All the other tests show a pretty good gain. Test6 is testing a byref
type, so it appears the gains are not just from byval datums.

It would be good to see the benchmark script run on a few other
machines to get an idea if the gains and losses are consistent.

In theory, we likely could get rid of the small regression by having
two versions of ExecSort() and setting the correct one during
ExecInitSort() by setting the function pointer to the version we want
to use in sortstate->ss.ps.ExecProcNode. But maybe the small
regression is not worth going to that trouble over. I'm not aware of
any other executor nodes that have logic like that so maybe it would
be a bad idea to introduce something like that.

David

Attachment

Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

From
Ronan Dunklau
Date:
> I've now pushed that bug fix so it's fine to remove the change to
> tuplesort.c now.

Thanks, I've rebased the patch, please find attached the v6.

> 
> I also did a round of benchmarking on this patch using the attached
> script. Anyone wanting to run it will need to run make installcheck
> first to create the required tables.

I've run your benchmark, keeping the best of three runs each time.
This is an intel laptop, so as many things are running on it there is a lot of 
noise... 

Both standard and patched run come from a compilation with gcc -O2. No changes 
have been done to the default settings.

Query #    Master    Patched    Variation
1    884    1627    184.05%
2    364    375    103.02%
3    568    783    137.85%
4    296    297    100.34%
5    421    484    114.96%
6    359    408    113.65%
7    237    251    105.91%
8    806    1271    157.69%

Since I didn't reproduce your slowdown at all on the first run, I tried to 
rerun the benchmark several times and for the "dubious cases" (2, 4 and 7), 
the results are too jittery to conclude one way or another in my case.  I 
don't have access to proper hardware, so not sure if that would be useful in 
any way to just run the bench for thousands of xacts instead. I would be 
surprised the check adds that much to the whole execution though.

I attach a graph similar to yours for reference.


-- 
Ronan Dunklau
Attachment

Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

From
Ranier Vilela
Date:
Em ter., 13 de jul. de 2021 às 04:19, Ronan Dunklau <ronan.dunklau@aiven.io> escreveu:
> I've now pushed that bug fix so it's fine to remove the change to

 
I would be
surprised the check adds that much to the whole execution though.
I think this branch is a misprediction.
In most cases is it not datumSort?
That's why I would like to use unlikely.

IMO all the tests should all be to verify past behavior first.

regards,
Ranier Vilela

Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

From
David Rowley
Date:
On Wed, 14 Jul 2021 at 00:06, Ranier Vilela <ranier.vf@gmail.com> wrote:
>
> Em ter., 13 de jul. de 2021 às 04:19, Ronan Dunklau <ronan.dunklau@aiven.io> escreveu:
>> I would be
>> surprised the check adds that much to the whole execution though.
>
> I think this branch is a misprediction.

It could be.  I wondered that myself when I saw Ronan's results were
better than mine for 2,4 and 7.  However, I think Ronan had quite a
bit of noise in his results as there's no reason for the speedup in
tests 2,4 and 7.

> In most cases is it not datumSort?

who knows.  Maybe someone's workload always requires the datum sort.

> That's why I would like to use unlikely.

We really only use unlikely() in cases where we want to move code out
of line to a cold area because it's really never executed under normal
circumstances. We tend to do that for ERROR cases as we don't ever
really want to optimise for errors. We also sometimes do it when some
function has a branch to initialise something during the first call.
The case in question here does not fit for either of those two cases.

> IMO all the tests should all be to verify past behavior first.

I'm not quire sure what you mean there.

David



Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

From
Ranier Vilela
Date:
Em ter., 13 de jul. de 2021 às 09:24, David Rowley <dgrowleyml@gmail.com> escreveu:
On Wed, 14 Jul 2021 at 00:06, Ranier Vilela <ranier.vf@gmail.com> wrote:
>
> Em ter., 13 de jul. de 2021 às 04:19, Ronan Dunklau <ronan.dunklau@aiven.io> escreveu:
>> I would be
>> surprised the check adds that much to the whole execution though.
>
> I think this branch is a misprediction.

It could be.  I wondered that myself when I saw Ronan's results were
better than mine for 2,4 and 7.  However, I think Ronan had quite a
bit of noise in his results as there's no reason for the speedup in
tests 2,4 and 7. 

> In most cases is it not datumSort?

who knows.  Maybe someone's workload always requires the datum sort.

> That's why I would like to use unlikely.

We really only use unlikely() in cases where we want to move code out
of line to a cold area because it's really never executed under normal
circumstances. We tend to do that for ERROR cases as we don't ever
really want to optimise for errors. We also sometimes do it when some
function has a branch to initialise something during the first call.
The case in question here does not fit for either of those two cases.
Hum, I understand the usage cases now.
Thanks for the hint.
 

> IMO all the tests should all be to verify past behavior first.

I'm not quire sure what you mean there.
I'm saying we could help the branch by keeping the same testing logic as before and not reversing it.
Attached is a version to demonstrate this, I don't pretend to be v7.

I couldn't find a good phrase to the contrary:
"are we *not* using the single value optimization ?"

I don't have time to take the tests right now.

regards,
Ranier Vilela
Attachment

Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

From
Ranier Vilela
Date:
Em ter., 13 de jul. de 2021 às 09:44, Ranier Vilela <ranier.vf@gmail.com> escreveu:
Em ter., 13 de jul. de 2021 às 09:24, David Rowley <dgrowleyml@gmail.com> escreveu:
On Wed, 14 Jul 2021 at 00:06, Ranier Vilela <ranier.vf@gmail.com> wrote:
>
> Em ter., 13 de jul. de 2021 às 04:19, Ronan Dunklau <ronan.dunklau@aiven.io> escreveu:
>> I would be
>> surprised the check adds that much to the whole execution though.
>
> I think this branch is a misprediction.

It could be.  I wondered that myself when I saw Ronan's results were
better than mine for 2,4 and 7.  However, I think Ronan had quite a
bit of noise in his results as there's no reason for the speedup in
tests 2,4 and 7. 

> In most cases is it not datumSort?

who knows.  Maybe someone's workload always requires the datum sort.

> That's why I would like to use unlikely.

We really only use unlikely() in cases where we want to move code out
of line to a cold area because it's really never executed under normal
circumstances. We tend to do that for ERROR cases as we don't ever
really want to optimise for errors. We also sometimes do it when some
function has a branch to initialise something during the first call.
The case in question here does not fit for either of those two cases.
Hum, I understand the usage cases now.
Thanks for the hint.
 

> IMO all the tests should all be to verify past behavior first.

I'm not quire sure what you mean there.
I'm saying we could help the branch by keeping the same testing logic as before and not reversing it.
Attached is a version to demonstrate this, I don't pretend to be v7.

I couldn't find a good phrase to the contrary:
"are we *not* using the single value optimization ?"

I don't have time to take the tests right now.
Finally I had time to benchmark (David's benchsort.sh)

ubuntu 64 bits (20.04) 8gb ram SSD 256GB.
Table with the best results of each.


         HEAD           v6           v7            v7b        v6 vs master              v7 vs v6           v7b vs v6
Test1288,149636449,018541469,757169550,48505155,83%104,62%122,60%
Test294,76695595,45140694,55624994,718982100,72%99,06%99,23%
Test3190,521319260,279802259,597067278,115296136,61%99,74%106,85%
Test478,77934478,25345578,11406877,94148299,33%99,82%99,60%
Test5131,362614142,662223136,436347149,639041108,60%95,64%104,89%
Test6112,884298124,181671115,528328127,58497110,01%93,03%102,74%
Test769,30858768,64306766,1019569,08754499,04%96,30%100,65%
Test8243,674171364,681142371,928453419,259703149,66%101,99%114,97%

I have no idea why v7 failed with test6?
v6 slowdown with test4 and test7.
v7b slowdown with test2 and test4, in relation with v7.

If field struct datumSort is not absolutely necessary, I think that v7 will be better.
Attached the patchs and file results.
Attachment

Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

From
Ranier Vilela
Date:
Em ter., 13 de jul. de 2021 às 14:42, Ranier Vilela <ranier.vf@gmail.com> escreveu:
Em ter., 13 de jul. de 2021 às 09:44, Ranier Vilela <ranier.vf@gmail.com> escreveu:
Em ter., 13 de jul. de 2021 às 09:24, David Rowley <dgrowleyml@gmail.com> escreveu:
On Wed, 14 Jul 2021 at 00:06, Ranier Vilela <ranier.vf@gmail.com> wrote:
>
> Em ter., 13 de jul. de 2021 às 04:19, Ronan Dunklau <ronan.dunklau@aiven.io> escreveu:
>> I would be
>> surprised the check adds that much to the whole execution though.
>
> I think this branch is a misprediction.

It could be.  I wondered that myself when I saw Ronan's results were
better than mine for 2,4 and 7.  However, I think Ronan had quite a
bit of noise in his results as there's no reason for the speedup in
tests 2,4 and 7. 

> In most cases is it not datumSort?

who knows.  Maybe someone's workload always requires the datum sort.

> That's why I would like to use unlikely.

We really only use unlikely() in cases where we want to move code out
of line to a cold area because it's really never executed under normal
circumstances. We tend to do that for ERROR cases as we don't ever
really want to optimise for errors. We also sometimes do it when some
function has a branch to initialise something during the first call.
The case in question here does not fit for either of those two cases.
Hum, I understand the usage cases now.
Thanks for the hint.
 

> IMO all the tests should all be to verify past behavior first.

I'm not quire sure what you mean there.
I'm saying we could help the branch by keeping the same testing logic as before and not reversing it.
Attached is a version to demonstrate this, I don't pretend to be v7.

I couldn't find a good phrase to the contrary:
"are we *not* using the single value optimization ?"

I don't have time to take the tests right now.
Finally I had time to benchmark (David's benchsort.sh)

ubuntu 64 bits (20.04) 8gb ram SSD 256GB.
Table with the best results of each.


         HEAD           v6           v7            v7b        v6 vs master              v7 vs v6           v7b vs v6
Test1288,149636449,018541469,757169550,48505155,83%104,62%122,60%
Test294,76695595,45140694,55624994,718982100,72%99,06%99,23%
Test3190,521319260,279802259,597067278,115296136,61%99,74%106,85%
Test478,77934478,25345578,11406877,94148299,33%99,82%99,60%
Test5131,362614142,662223136,436347149,639041108,60%95,64%104,89%
Test6112,884298124,181671115,528328127,58497110,01%93,03%102,74%
Test769,30858768,64306766,1019569,08754499,04%96,30%100,65%
Test8243,674171364,681142371,928453419,259703149,66%101,99%114,97%

I have no idea why v7 failed with test6?
v6 slowdown with test4 and test7.
v7b slowdown with test2 and test4, in relation with v7.
 v7b slowdown with test2 and test4, in relation with *v6*.


If field struct datumSort is not absolutely necessary, I think that v7 will be better.
 *v7b* will be better.

Sorry for the noise.

regards,
Ranier Vilela

Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

From
David Rowley
Date:
On Tue, 13 Jul 2021 at 15:15, David Rowley <dgrowleyml@gmail.com> wrote:
> In theory, we likely could get rid of the small regression by having
> two versions of ExecSort() and setting the correct one during
> ExecInitSort() by setting the function pointer to the version we want
> to use in sortstate->ss.ps.ExecProcNode.

Just to see how it would perform, I tried what I mentioned above. I've
included what I ended up with in the attached POC patch.

I got the following results on my AMD hardware.

Test master v8 patch comparison
Test1   448.0   671.7   149.9%
Test2   316.4   317.5   100.3%
Test3   299.5   381.6   127.4%
Test4   219.7   229.5   104.5%
Test5   226.3   254.6   112.5%
Test6   197.9   217.9   110.1%
Test7   179.2   185.3   103.4%
Test8   389.2   544.8   140.0%

This time I saw no regression on tests 2, 4 and 7.

I looked to see if there was anywhere else in the executor that
conditionally uses a different exec function in this way and found
nothing, so I'm not too sure if it's a good idea to start doing this.

It would be good to get a 2nd opinion about this idea.  Also, more
benchmark results with v6 and v8 would be good too.

David

Attachment

Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

From
Ranier Vilela
Date:
Em qua., 14 de jul. de 2021 às 07:14, David Rowley <dgrowleyml@gmail.com> escreveu:
On Tue, 13 Jul 2021 at 15:15, David Rowley <dgrowleyml@gmail.com> wrote:
> In theory, we likely could get rid of the small regression by having
> two versions of ExecSort() and setting the correct one during
> ExecInitSort() by setting the function pointer to the version we want
> to use in sortstate->ss.ps.ExecProcNode.

Just to see how it would perform, I tried what I mentioned above. I've
included what I ended up with in the attached POC patch.

I got the following results on my AMD hardware.

Test master v8 patch comparison
Test1   448.0   671.7   149.9%
Test2   316.4   317.5   100.3%
Test3   299.5   381.6   127.4%
Test4   219.7   229.5   104.5%
Test5   226.3   254.6   112.5%
Test6   197.9   217.9   110.1%
Test7   179.2   185.3   103.4%
Test8   389.2   544.8   140.0%
I'm a little surprised by your results.
Test1 and Test8 look pretty good to me.
What is compiler and environment?

I repeated (3 times) the benchmark with v8 here,
and the results were not good.


                  HEAD           v6             v7b           v8        v6 vs head             v8 vs v6            v8 vs v7b
Test1288,149636449,018541550,48505468,168165155,83%104,26%85,05%
Test294,76695595,45140694,71898294,800275100,72%99,32%100,09%
Test3190,521319260,279802278,115296262,538383136,61%100,87%94,40%
Test478,77934478,25345577,94148278,47154699,33%100,28%100,68%
Test5131,362614142,662223149,639041144,849303108,60%101,53%96,80%
Test6112,884298124,181671127,58497124,29376110,01%100,09%97,42%
Test769,30858768,64306769,08754469,43731299,04%101,16%100,51%
Test8243,674171364,681142419,259703369,239176149,66%101,25%88,07%



This time I saw no regression on tests 2, 4 and 7.

I looked to see if there was anywhere else in the executor that
conditionally uses a different exec function in this way and found
nothing, so I'm not too sure if it's a good idea to start doing this.
Specialized functions can be a way to optimize. The compilers themselves do it.
But the ExecSortTuple and ExecSortDatum are much more similar,
which can cause maintenance problems.
I don't think in this case it would be a good idea.
 

It would be good to get a 2nd opinion about this idea.  Also, more
benchmark results with v6 and v8 would be good too.
Yeah, another different machine.
I would like to see other results with v7b.

Attached the file with all results from v8.

regards,
Ranier Vilela
Attachment

Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

From
John Naylor
Date:

On Wed, Jul 14, 2021 at 6:14 AM David Rowley <dgrowleyml@gmail.com> wrote:

> It would be good to get a 2nd opinion about this idea.  Also, more
> benchmark results with v6 and v8 would be good too.

I tested this on an older Xeon, gcc 8.4 (here median of each test, full results attached):

test HEAD v6 v8

Test1 588 1007 998
Test2 198 202 197
Test3 374 516 512
Test4 172 165 166
Test5 255 279 283
Test6 227 251 251
Test7 145 147 146
Test8 474 783 770

Test4 could be a regression, but 2 and 7 look fine here.

--
John Naylor
EDB: http://www.enterprisedb.com

Attachment

Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

From
David Rowley
Date:
On Thu, 15 Jul 2021 at 05:55, Ranier Vilela <ranier.vf@gmail.com> wrote:
> I repeated (3 times) the benchmark with v8 here,
> and the results were not good.

Do you have any good theories on why the additional branching that's
done in v7b vs v8 might cause it to run faster?

David



Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

From
Ranier Vilela
Date:
Em qua., 14 de jul. de 2021 às 20:43, David Rowley <dgrowleyml@gmail.com> escreveu:
On Thu, 15 Jul 2021 at 05:55, Ranier Vilela <ranier.vf@gmail.com> wrote:
> I repeated (3 times) the benchmark with v8 here,
> and the results were not good.

Do you have any good theories on why the additional branching that's
done in v7b vs v8 might cause it to run faster?
 
Branch Predictions works with *more* probable path,
otherwise a penalty occurs and the cpu must revert the results.

In this case it seems to me that most of the time, tuplesort is the path.
So as it is tested if it is *datumSort* and the *prediction* fails,
the cpu has more work to reverse the wrong path.

To help the branch, test a more probable case first, anywhere.
if, switch, etc.

Another gain is the local variable tupleSort, which is obviously faster than node.

regards,
Ranier Vilela

Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

From
David Rowley
Date:
On Thu, 15 Jul 2021 at 12:10, Ranier Vilela <ranier.vf@gmail.com> wrote:
>
> Em qua., 14 de jul. de 2021 às 20:43, David Rowley <dgrowleyml@gmail.com> escreveu:
>>
>> On Thu, 15 Jul 2021 at 05:55, Ranier Vilela <ranier.vf@gmail.com> wrote:
>> > I repeated (3 times) the benchmark with v8 here,
>> > and the results were not good.
>>
>> Do you have any good theories on why the additional branching that's
>> done in v7b vs v8 might cause it to run faster?
>
>
> Branch Predictions works with *more* probable path,
> otherwise a penalty occurs and the cpu must revert the results.

But, in v8 there is no additional branch, so no branch to mispredict.
I don't really see how your explanation fits.

It seems much more likely to me that the results were just noisy.  It
would be good to see if you can recreate them consistently.

David



Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

From
Ranier Vilela
Date:
Em qua., 14 de jul. de 2021 às 21:21, David Rowley <dgrowleyml@gmail.com> escreveu:
On Thu, 15 Jul 2021 at 12:10, Ranier Vilela <ranier.vf@gmail.com> wrote:
>
> Em qua., 14 de jul. de 2021 às 20:43, David Rowley <dgrowleyml@gmail.com> escreveu:
>>
>> On Thu, 15 Jul 2021 at 05:55, Ranier Vilela <ranier.vf@gmail.com> wrote:
>> > I repeated (3 times) the benchmark with v8 here,
>> > and the results were not good.
>>
>> Do you have any good theories on why the additional branching that's
>> done in v7b vs v8 might cause it to run faster?
>
>
> Branch Predictions works with *more* probable path,
> otherwise a penalty occurs and the cpu must revert the results.

But, in v8 there is no additional branch, so no branch to mispredict.
I don't really see how your explanation fits.
In v8 the branch occurs at :
+ if (ExecGetResultType(outerPlanState(sortstate))->natts == 1)

datumSort is tested first.

Cpu time is a more expensive resource.
Always is executed two branches, if it is right path, win,
otherwise occurs a penalty time.


It seems much more likely to me that the results were just noisy.  It
would be good to see if you can recreate them consistently.
I do.
Can you please share results with v7b?

regards,
Ranier Vilela

Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

From
David Rowley
Date:
On Thu, 15 Jul 2021 at 12:30, Ranier Vilela <ranier.vf@gmail.com> wrote:
>
> Em qua., 14 de jul. de 2021 às 21:21, David Rowley <dgrowleyml@gmail.com> escreveu:
>> But, in v8 there is no additional branch, so no branch to mispredict.
>> I don't really see how your explanation fits.
>
> In v8 the branch occurs at :
> + if (ExecGetResultType(outerPlanState(sortstate))->natts == 1)

You do know that branch is in a function that's only executed once
during executor initialization, right?

David



Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

From
Ranier Vilela
Date:
Em qua., 14 de jul. de 2021 às 22:22, David Rowley <dgrowleyml@gmail.com> escreveu:
On Thu, 15 Jul 2021 at 12:30, Ranier Vilela <ranier.vf@gmail.com> wrote:
>
> Em qua., 14 de jul. de 2021 às 21:21, David Rowley <dgrowleyml@gmail.com> escreveu:
>> But, in v8 there is no additional branch, so no branch to mispredict.
>> I don't really see how your explanation fits.
>
> In v8 the branch occurs at :
> + if (ExecGetResultType(outerPlanState(sortstate))->natts == 1)

You do know that branch is in a function that's only executed once
during executor initialization, right?
The branch prediction should work better.
I have no idea why it works worse.

I redid all tests:
notebook 8GB RAM 256GB SSD
ubuntu 64 bits (20.04)
clang-12
powerhigh (charger on)
none configuration (all defaults)


          HEAD          v6         v7b             v8         v6 vs head           v7b vs v6          v8 vs v7b
Test1576,868013940,9472361090,2538591016,0443163,11%115,87%93,19%
Test2184,748363177,6254177,346229178,23025896,14%99,84%100,50%
Test3410,030055541,889704605,843924534,946166132,16%111,80%88,30%
Test4153,331752147,98418148,010894147,77115596,51%100,02%99,84%
Test5268,97555301,979647316,928492300,94932112,27%104,95%94,96%
Test6234,910125259,71483269,851427260,567637110,56%103,90%96,56%
Test7142,704153136,09163136,802695136,93570995,37%100,52%100,10%
Test8498,634855763,482151867,350046804,833884153,11%113,60%92,79%

The values are high here, because now, the tests are made with full power of cpu to all patchs!
I think that more testing is needed with v7b and v8.

Anyway, two functions (ExecSortTuple and ExecSortDatum) are almost equal, maybe not a good idea.

file results attached.

regards,
Ranier Vilela
Attachment

Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

From
Ronan Dunklau
Date:
Le jeudi 15 juillet 2021, 01:30:26 CEST John Naylor a écrit :
> On Wed, Jul 14, 2021 at 6:14 AM David Rowley <dgrowleyml@gmail.com> wrote:
> > It would be good to get a 2nd opinion about this idea.  Also, more
> > benchmark results with v6 and v8 would be good too.
>

Hello,

Thank you for trying this approach in v8 David !

I've decided to test on more "stable" hardware, an EC-2 medium instance,
compiling with Debian's gcc 8.3. That's still not ideal but a lot better than
a laptop.

To gather more meaningful results, I ran every pgbench for 30s instead of the
10 in the initial script provided by David. I ran the full script once for
HEAD, v6, v8, then a second time for HEAD, v6, v8 to try to eliminate noise
that could happen for 90 consecutive seconds, and took for each of those the
median of the 6 runs.  It's much less noisy than my previous runs but still
not as as stable as I'd like to.

The results are attached in graph form, as well as the raw data if someone
wants it.

As a conclusion, I don't think it's worth it to introduce a separate
execprocnode function for that case. It is likely the minor difference still
observed can be explained to noise, as they fluctuate if you compare the min,
max, average or median values from the results.

Best regards,

--
Ronan Dunklau
Attachment

Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

From
Ranier Vilela
Date:
Em qui., 15 de jul. de 2021 às 07:18, Ronan Dunklau <ronan.dunklau@aiven.io> escreveu:
Le jeudi 15 juillet 2021, 01:30:26 CEST John Naylor a écrit :
> On Wed, Jul 14, 2021 at 6:14 AM David Rowley <dgrowleyml@gmail.com> wrote:
> > It would be good to get a 2nd opinion about this idea.  Also, more
> > benchmark results with v6 and v8 would be good too.
>

Hello,

Thank you for trying this approach in v8 David !

I've decided to test on more "stable" hardware, an EC-2 medium instance,
compiling with Debian's gcc 8.3. That's still not ideal but a lot better than
a laptop.

To gather more meaningful results, I ran every pgbench for 30s instead of the
10 in the initial script provided by David. I ran the full script once for
HEAD, v6, v8, then a second time for HEAD, v6, v8 to try to eliminate noise
that could happen for 90 consecutive seconds, and took for each of those the
median of the 6 runs.  It's much less noisy than my previous runs but still
not as as stable as I'd like to.

The results are attached in graph form, as well as the raw data if someone
wants it.

As a conclusion, I don't think it's worth it to introduce a separate
execprocnode function for that case. It is likely the minor difference still
observed can be explained to noise, as they fluctuate if you compare the min,
max, average or median values from the results.
Is there a special reason to not share v7b tests and results?

IMHO he is much more branch friendly.

regards,
Ranier Vilela

Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

From
Ronan Dunklau
Date:
Le jeudi 15 juillet 2021, 14:09:26 CEST Ranier Vilela a écrit :
> Is there a special reason to not share v7b tests and results?
>

The v7b patch is wrong, as it loses the type of tuplesort being used and as
such always tries to fetch results using tuplesort_gettupleslot after the first
tuple is fetched.


--
Ronan Dunklau





Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

From
Ranier Vilela
Date:
Em qui., 15 de jul. de 2021 às 09:27, Ronan Dunklau <ronan.dunklau@aiven.io> escreveu:
Le jeudi 15 juillet 2021, 14:09:26 CEST Ranier Vilela a écrit :
> Is there a special reason to not share v7b tests and results?
>

The v7b patch is wrong, as it loses the type of tuplesort being used
I don't see 'node->datumSort' being anywhere else yet.
 
and as
such always tries to fetch results using tuplesort_gettupleslot after the first
tuple is fetched.
Is that why it is faster than v6?

regards,
Ranier Vilela

Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

From
James Coleman
Date:
On Wed, Jul 14, 2021 at 9:22 PM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Thu, 15 Jul 2021 at 12:30, Ranier Vilela <ranier.vf@gmail.com> wrote:
> >
> > Em qua., 14 de jul. de 2021 às 21:21, David Rowley <dgrowleyml@gmail.com> escreveu:
> >> But, in v8 there is no additional branch, so no branch to mispredict.
> >> I don't really see how your explanation fits.
> >
> > In v8 the branch occurs at :
> > + if (ExecGetResultType(outerPlanState(sortstate))->natts == 1)
>
> You do know that branch is in a function that's only executed once
> during executor initialization, right?

This is why I have a hard time believing there's a "real" change here
and not the result of either noise or something not really
controllable like executable layout changing.

James



Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

From
David Rowley
Date:
On Fri, 16 Jul 2021 at 01:44, James Coleman <jtc331@gmail.com> wrote:
>
> On Wed, Jul 14, 2021 at 9:22 PM David Rowley <dgrowleyml@gmail.com> wrote:
> >
> > On Thu, 15 Jul 2021 at 12:30, Ranier Vilela <ranier.vf@gmail.com> wrote:
> > >
> > > Em qua., 14 de jul. de 2021 às 21:21, David Rowley <dgrowleyml@gmail.com> escreveu:
> > >> But, in v8 there is no additional branch, so no branch to mispredict.
> > >> I don't really see how your explanation fits.
> > >
> > > In v8 the branch occurs at :
> > > + if (ExecGetResultType(outerPlanState(sortstate))->natts == 1)
> >
> > You do know that branch is in a function that's only executed once
> > during executor initialization, right?
>
> This is why I have a hard time believing there's a "real" change here
> and not the result of either noise or something not really
> controllable like executable layout changing.

Yeah, I think we likely are at the level where layout changes in the
compiled code are going to make things hard to measure.  I just want
to make sure we're not going to end up with some regression that's
actual and not random depending on layout changes of unrelated code.
I think a branch that's taken consistently *should* be predicted
correctly each time.

Anyway, I think all the comparisons with v7b can safely be ignored. As
Ronan pointed out, v7b has some issues due to it not recording the
sort method in the executor state that leads to it forgetting which
method it used once we start pulling tuples from it. The reproductions
of that are it calling tuplesort_gettupleslot() from the 2nd tuple
onwards regardless of if we've done a datum or tuple sort.

Ronan's latest results plus John's make me think there's no need to
separate out the node function as I did in v8.  However, I do think v6
could learn a little from v8. I think I'd rather see the sort method
determined in ExecInitSort() rather than ExecSort(). I think
minimising those few extra instructions in ExecSort() might help the
L1 instruction cache.

David



Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

From
Ranier Vilela
Date:
Em qua., 14 de jul. de 2021 às 22:22, David Rowley <dgrowleyml@gmail.com> escreveu:
On Thu, 15 Jul 2021 at 12:30, Ranier Vilela <ranier.vf@gmail.com> wrote:
>
> Em qua., 14 de jul. de 2021 às 21:21, David Rowley <dgrowleyml@gmail.com> escreveu:
>> But, in v8 there is no additional branch, so no branch to mispredict.
>> I don't really see how your explanation fits.
>
> In v8 the branch occurs at :
> + if (ExecGetResultType(outerPlanState(sortstate))->natts == 1)

You do know that branch is in a function that's only executed once
during executor initialization, right?
There's a real difference between v8 and v6, if I understood correctly.

v6 the branches is per tuple:
+ if (tupDesc->natts == 1)

v8 the branches is per state:
+ if (ExecGetResultType(outerPlanState(sortstate))->natts == 1)

I think that a big different way to solve the problem.
Or am I getting it wrong?

If the sortstate number of attributes is equal to 1, is it worth the same for each tuple?
Can you explain this, please?

regards,
Ranier Vilela

Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

From
Ranier Vilela
Date:
Em qui., 15 de jul. de 2021 às 11:19, David Rowley <dgrowleyml@gmail.com> escreveu:
On Fri, 16 Jul 2021 at 01:44, James Coleman <jtc331@gmail.com> wrote:
>
> On Wed, Jul 14, 2021 at 9:22 PM David Rowley <dgrowleyml@gmail.com> wrote:
> >
> > On Thu, 15 Jul 2021 at 12:30, Ranier Vilela <ranier.vf@gmail.com> wrote:
> > >
> > > Em qua., 14 de jul. de 2021 às 21:21, David Rowley <dgrowleyml@gmail.com> escreveu:
> > >> But, in v8 there is no additional branch, so no branch to mispredict.
> > >> I don't really see how your explanation fits.
> > >
> > > In v8 the branch occurs at :
> > > + if (ExecGetResultType(outerPlanState(sortstate))->natts == 1)
> >
> > You do know that branch is in a function that's only executed once
> > during executor initialization, right?
>
> This is why I have a hard time believing there's a "real" change here
> and not the result of either noise or something not really
> controllable like executable layout changing.

Yeah, I think we likely are at the level where layout changes in the
compiled code are going to make things hard to measure.  I just want
to make sure we're not going to end up with some regression that's
actual and not random depending on layout changes of unrelated code.
I think a branch that's taken consistently *should* be predicted
correctly each time.

Anyway, I think all the comparisons with v7b can safely be ignored. As
Ronan pointed out, v7b has some issues due to it not recording the
sort method in the executor state that leads to it forgetting which
method it used once we start pulling tuples from it. The reproductions
of that are it calling tuplesort_gettupleslot() from the 2nd tuple
onwards regardless of if we've done a datum or tuple sort.
Sorry for insisting on this.
Assuming v7b is doing it the wrong way, which I still don't think it is.
Why is it still faster than v6 and v8?

regards,
Ranier Vilela

Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

From
Ronan Dunklau
Date:
Le jeudi 15 juillet 2021, 16:19:23 CEST David Rowley a écrit :>
> Ronan's latest results plus John's make me think there's no need to
> separate out the node function as I did in v8.  However, I do think v6
> could learn a little from v8. I think I'd rather see the sort method
> determined in ExecInitSort() rather than ExecSort(). I think
> minimising those few extra instructions in ExecSort() might help the
> L1 instruction cache.
>

I'm not sure I understand what you expect from moving that to ExecInitSort ?
Maybe we should also implement the tuplesort_state initialization in
ExecInitSort ? (not the actual feeding and sorting of course).

Please find attached a v9 just moving the flag setting to ExecInitSort, and my
apologies if I misunderstood your point.

--
Ronan Dunklau
Attachment

Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

From
James Coleman
Date:
On Thu, Jul 15, 2021 at 10:19 AM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Fri, 16 Jul 2021 at 01:44, James Coleman <jtc331@gmail.com> wrote:
> >
> > On Wed, Jul 14, 2021 at 9:22 PM David Rowley <dgrowleyml@gmail.com> wrote:
> > >
> > > On Thu, 15 Jul 2021 at 12:30, Ranier Vilela <ranier.vf@gmail.com> wrote:
> > > >
> > > > Em qua., 14 de jul. de 2021 às 21:21, David Rowley <dgrowleyml@gmail.com> escreveu:
> > > >> But, in v8 there is no additional branch, so no branch to mispredict.
> > > >> I don't really see how your explanation fits.
> > > >
> > > > In v8 the branch occurs at :
> > > > + if (ExecGetResultType(outerPlanState(sortstate))->natts == 1)
> > >
> > > You do know that branch is in a function that's only executed once
> > > during executor initialization, right?
> >
> > This is why I have a hard time believing there's a "real" change here
> > and not the result of either noise or something not really
> > controllable like executable layout changing.
>
> Yeah, I think we likely are at the level where layout changes in the
> compiled code are going to make things hard to measure.  I just want
> to make sure we're not going to end up with some regression that's
> actual and not random depending on layout changes of unrelated code.
> I think a branch that's taken consistently *should* be predicted
> correctly each time.
>
> Anyway, I think all the comparisons with v7b can safely be ignored. As
> Ronan pointed out, v7b has some issues due to it not recording the
> sort method in the executor state that leads to it forgetting which
> method it used once we start pulling tuples from it. The reproductions
> of that are it calling tuplesort_gettupleslot() from the 2nd tuple
> onwards regardless of if we've done a datum or tuple sort.
>
> Ronan's latest results plus John's make me think there's no need to
> separate out the node function as I did in v8.  However, I do think v6
> could learn a little from v8. I think I'd rather see the sort method
> determined in ExecInitSort() rather than ExecSort(). I think
> minimising those few extra instructions in ExecSort() might help the
> L1 instruction cache.

I ran master/v6/v8 tests for 90s each with David's test script on an
AWS c5n.metal instance (so should be immune to noise neighbor issues).
Here are comparative results:

    Test1 Test2 Test3 Test4 Test5 Test6 Test7 Test8
v6 68.66% 0.05% 32.21% -0.83% 12.58% 10.42% -1.48% 50.98%
v8 69.78% -0.44% 32.45% -1.11% 12.01% 10.58% -1.40% 49.30%

So I see a consistent change in the data, but I don't really see a
good explanation for it not being noise. Can't prove that yet though.

James



Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

From
David Rowley
Date:
On Fri, 16 Jul 2021 at 02:53, Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
>
> Le jeudi 15 juillet 2021, 16:19:23 CEST David Rowley a écrit :>
> > Ronan's latest results plus John's make me think there's no need to
> > separate out the node function as I did in v8.  However, I do think v6
> > could learn a little from v8. I think I'd rather see the sort method
> > determined in ExecInitSort() rather than ExecSort(). I think
> > minimising those few extra instructions in ExecSort() might help the
> > L1 instruction cache.
> >
>
> I'm not sure I understand what you expect from moving that to ExecInitSort ?

The motivation was to reduce the extra code that's being added to
ExecSort. I checked the assembly of ExecSort on v6 and v9 and v6 was
544 lines of assembly and v9 is 534 lines.

> Maybe we should also implement the tuplesort_state initialization in
> ExecInitSort ? (not the actual feeding and sorting of course).

I don't think that would be a good idea.  Setting the datumSort does
not require any new memory to be allocated. That's not the case for
the tuplesort_begin routines.  The difference here is that we can
delay the memory allocation until we pull the first tuple and if we
don't pull any tuples from the outer node then there are no needless
allocations.

> Please find attached a v9 just moving the flag setting to ExecInitSort, and my
> apologies if I misunderstood your point.

That's exactly what I meant. Thanks

David



Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

From
David Rowley
Date:
On Fri, 16 Jul 2021 at 02:53, Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
> Please find attached a v9 just moving the flag setting to ExecInitSort, and my
> apologies if I misunderstood your point.

I took this and adjusted a few things and ended up with the attached patch.

The changes are fairly minor. I made the bracing consistent between
both tuplesort_begin calls. I rewrote the comment at the top of
ExecSort() to make it more clear about each method used.

I also adjusted the comment down at the end of ExecSort that was
mentioning something about tuplesort_gettupleslot returning NULL.
Your patch didn't touch this, but to me, the comment just looked wrong
both before and after the changes. tuplesort_gettupleslot returns
false and sets the slot to empty when it runs out of tuples.  Anyway,
I wrote something there that I think improves that.

I feel like this patch is commit-worthy now.  However, I'll leave it
for a few days, maybe until after the weekend as there's been a fair
bit of interest and I imagine someone will have comments to make.

David

Attachment

Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

From
Ranier Vilela
Date:
Em sex., 16 de jul. de 2021 às 00:45, David Rowley <dgrowleyml@gmail.com> escreveu:
On Fri, 16 Jul 2021 at 02:53, Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
> Please find attached a v9 just moving the flag setting to ExecInitSort, and my
> apologies if I misunderstood your point.

I took this and adjusted a few things and ended up with the attached patch.

The changes are fairly minor. I made the bracing consistent between
both tuplesort_begin calls. I rewrote the comment at the top of
ExecSort() to make it more clear about each method used.
With relation to the braces, it's still not clear to me which style to follow.
I gave Ronan directions about it.
And I think maybe, it's still not clear when to use it or not.


I also adjusted the comment down at the end of ExecSort that was
mentioning something about tuplesort_gettupleslot returning NULL.
Your patch didn't touch this, but to me, the comment just looked wrong
both before and after the changes. tuplesort_gettupleslot returns
false and sets the slot to empty when it runs out of tuples.  Anyway,
I wrote something there that I think improves that.
Can help a little here, but, seems good to me.


I feel like this patch is commit-worthy now.  However, I'll leave it
for a few days, maybe until after the weekend as there's been a fair
bit of interest and I imagine someone will have comments to make.
A little lack of time.

But I finally can understand v7b.
Really struct field is necessary and he fails with the next tuple, ok.
The only conclusion I can come to is that he is faster because he fails to sort correctly.
It's no use being faster and getting wrong results.

So, +1 from me to commit v10.

Thanks for working together.

Ranier Vilela

Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

From
James Coleman
Date:
On Thu, Jul 15, 2021 at 11:45 PM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Fri, 16 Jul 2021 at 02:53, Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
> > Please find attached a v9 just moving the flag setting to ExecInitSort, and my
> > apologies if I misunderstood your point.
>
> I took this and adjusted a few things and ended up with the attached patch.
>
> The changes are fairly minor. I made the bracing consistent between
> both tuplesort_begin calls. I rewrote the comment at the top of
> ExecSort() to make it more clear about each method used.
>
> I also adjusted the comment down at the end of ExecSort that was
> mentioning something about tuplesort_gettupleslot returning NULL.
> Your patch didn't touch this, but to me, the comment just looked wrong
> both before and after the changes. tuplesort_gettupleslot returns
> false and sets the slot to empty when it runs out of tuples.  Anyway,
> I wrote something there that I think improves that.
>
> I feel like this patch is commit-worthy now.  However, I'll leave it
> for a few days, maybe until after the weekend as there's been a fair
> bit of interest and I imagine someone will have comments to make.

The only remaining question I have is whether or not costing needs to
change, given the very significant speedup for datum sort.

James



Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

From
David Rowley
Date:
 On Sat, 17 Jul 2021 at 01:14, James Coleman <jtc331@gmail.com> wrote:
> The only remaining question I have is whether or not costing needs to
> change, given the very significant speedup for datum sort.

I'm looking at cost_tuplesort and the only thing that I think might
make sense would be to adjust how the input_bytes value is calculated.
For now, that's done with the following function that's used in quite
a number of places.

static double
relation_byte_size(double tuples, int width)
{
    return tuples * (MAXALIGN(width) + MAXALIGN(SizeofHeapTupleHeader));
}

It seems, at least in the case of Sort, that using SizeofHeapTupleHead
is just always wrong as it should be SizeofMinimalTupleHeader. I know
that's also the case for Memoize too. I've not checked the other
locations.

The only thing I can really see that we might do would be not add the
MAXALIGN(SizeofHeapTupleHeader) when there's just a single column.
We'd need to pass down the number of attributes from
create_sort_path() so we'd know when and when not to add that. I'm not
saying that we should do this. I'm just saying that I don't really see
what else we might do.

I can imagine another patch might just want to do a complete overhaul
of all locations that use relation_byte_size().  There are various
things that function just does not account for. e.g, the fact that we
allocate chunks in powers of 2 and that there's a chunk header added
on.  Of course, "width" is just an estimate, so maybe trying to
calculate something too precisely wouldn't be too wise. However,
there's a bit of a chicken and the egg problem there as there'd be
little incentive to improve "width" unless we started making more
accurate use of the value.

Anyway, none of the above take into account that the Datum sort is
just a little faster, The only thing that exists in the existing cost
modal that we could use to adjust the cost of an in memory sort is the
comparison_cost.  The problem there is that the comparison is exactly
the same in both Datum and Tuple sorts. The only thing that really
changes between Datum and Tuple sort is the fact that we don't make a
MinimalTuple when doing a Datum sort.  The cost modal, unfortunately,
does not account for that.   That kinda makes me think that we should
do nothing as if we start to account for making MemoryTuples then
we'll just penalise Tuple sorts and that might cause someone to be
upset.

David



Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

From
Ronan Dunklau
Date:
Le samedi 17 juillet 2021, 10:36:09 CEST David Rowley a écrit :
>  On Sat, 17 Jul 2021 at 01:14, James Coleman <jtc331@gmail.com> wrote:
> > The only remaining question I have is whether or not costing needs to
> > change, given the very significant speedup for datum sort.
>
> I'm looking at cost_tuplesort and the only thing that I think might
> make sense would be to adjust how the input_bytes value is calculated.
> For now, that's done with the following function that's used in quite
> a number of places.
>
> static double
> relation_byte_size(double tuples, int width)
> {
>     return tuples * (MAXALIGN(width) + MAXALIGN(SizeofHeapTupleHeader));
> }
>
> It seems, at least in the case of Sort, that using SizeofHeapTupleHead
> is just always wrong as it should be SizeofMinimalTupleHeader. I know
> that's also the case for Memoize too. I've not checked the other
> locations.
>
> The only thing I can really see that we might do would be not add the
> MAXALIGN(SizeofHeapTupleHeader) when there's just a single column.
> We'd need to pass down the number of attributes from
> create_sort_path() so we'd know when and when not to add that. I'm not
> saying that we should do this. I'm just saying that I don't really see
> what else we might do.
>
> I can imagine another patch might just want to do a complete overhaul
> of all locations that use relation_byte_size().  There are various
> things that function just does not account for. e.g, the fact that we
> allocate chunks in powers of 2 and that there's a chunk header added
> on.  Of course, "width" is just an estimate, so maybe trying to
> calculate something too precisely wouldn't be too wise. However,
> there's a bit of a chicken and the egg problem there as there'd be
> little incentive to improve "width" unless we started making more
> accurate use of the value.
>
> Anyway, none of the above take into account that the Datum sort is
> just a little faster, The only thing that exists in the existing cost
> modal that we could use to adjust the cost of an in memory sort is the
> comparison_cost.  The problem there is that the comparison is exactly
> the same in both Datum and Tuple sorts. The only thing that really
> changes between Datum and Tuple sort is the fact that we don't make a
> MinimalTuple when doing a Datum sort.  The cost modal, unfortunately,
> does not account for that.   That kinda makes me think that we should
> do nothing as if we start to account for making MemoryTuples then
> we'll just penalise Tuple sorts and that might cause someone to be
> upset.
>
Thank you for taking the time to perform that analysis. I agree with you and
tt looks to me that if we were to start accounting for it, we would have to
make the change almost transparent for tuple sorts so that it stays roughly
the same, which is impossible since we don't apply the comparison cost to all
tuples but only to the number of tuples we actually expect to compare.

On the other hand, if we don't change the sorting cost and it just ends up
being faster in some cases I doubt anyone would complain.


--
Ronan Dunklau





Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

From
James Coleman
Date:
On Sat, Jul 17, 2021 at 4:36 AM David Rowley <dgrowleyml@gmail.com> wrote:
>
>  On Sat, 17 Jul 2021 at 01:14, James Coleman <jtc331@gmail.com> wrote:
> > The only remaining question I have is whether or not costing needs to
> > change, given the very significant speedup for datum sort.
>
> I'm looking at cost_tuplesort and the only thing that I think might
> make sense would be to adjust how the input_bytes value is calculated.
> For now, that's done with the following function that's used in quite
> a number of places.
>
> static double
> relation_byte_size(double tuples, int width)
> {
>     return tuples * (MAXALIGN(width) + MAXALIGN(SizeofHeapTupleHeader));
> }
>
> It seems, at least in the case of Sort, that using SizeofHeapTupleHead
> is just always wrong as it should be SizeofMinimalTupleHeader. I know
> that's also the case for Memoize too. I've not checked the other
> locations.
>
> The only thing I can really see that we might do would be not add the
> MAXALIGN(SizeofHeapTupleHeader) when there's just a single column.
> We'd need to pass down the number of attributes from
> create_sort_path() so we'd know when and when not to add that. I'm not
> saying that we should do this. I'm just saying that I don't really see
> what else we might do.
>
> I can imagine another patch might just want to do a complete overhaul
> of all locations that use relation_byte_size().  There are various
> things that function just does not account for. e.g, the fact that we
> allocate chunks in powers of 2 and that there's a chunk header added
> on.  Of course, "width" is just an estimate, so maybe trying to
> calculate something too precisely wouldn't be too wise. However,
> there's a bit of a chicken and the egg problem there as there'd be
> little incentive to improve "width" unless we started making more
> accurate use of the value.
>
> Anyway, none of the above take into account that the Datum sort is
> just a little faster, The only thing that exists in the existing cost
> modal that we could use to adjust the cost of an in memory sort is the
> comparison_cost.  The problem there is that the comparison is exactly
> the same in both Datum and Tuple sorts. The only thing that really
> changes between Datum and Tuple sort is the fact that we don't make a
> MinimalTuple when doing a Datum sort.  The cost modal, unfortunately,
> does not account for that.   That kinda makes me think that we should
> do nothing as if we start to account for making MemoryTuples then
> we'll just penalise Tuple sorts and that might cause someone to be
> upset.

To be clear up front: I'm in favor of the patch, and I don't want to
put unnecessary stumbling blocks up for it getting committed. So if we
decide to proceed as is, that's fine with me.

But I'm not sure that the "cost model, unfortunately, does not account
for that" is entirely accurate. The end of cost_tuplesort contains a
cost "per extracted tuple". It does, however, note that it doesn't
charge cpu_tuple_cost, which maybe is what you'd want to fully
incorporate this into the model. But given this run_cost isn't about
accounting for comparison cost (that's been done earlier) which is the
part that'd be the same between tuple and datum sort, it seems to me
that we could lower the cpu_operator_cost here by something like 10%
if it's byref and 30% if it's byval?

James



Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

From
David Rowley
Date:
On Tue, 20 Jul 2021 at 01:10, James Coleman <jtc331@gmail.com> wrote:
> To be clear up front: I'm in favor of the patch, and I don't want to
> put unnecessary stumbling blocks up for it getting committed. So if we
> decide to proceed as is, that's fine with me.

Thanks for making that clear.

> But I'm not sure that the "cost model, unfortunately, does not account
> for that" is entirely accurate. The end of cost_tuplesort contains a
> cost "per extracted tuple". It does, however, note that it doesn't
> charge cpu_tuple_cost, which maybe is what you'd want to fully
> incorporate this into the model. But given this run_cost isn't about
> accounting for comparison cost (that's been done earlier) which is the
> part that'd be the same between tuple and datum sort, it seems to me
> that we could lower the cpu_operator_cost here by something like 10%
> if it's byref and 30% if it's byval?

I failed to notice that last part that adds the additional cpu_operator_cost.

The default cpu_operator_cost is 0.0025, so with the 10k tuple
benchmark, that adds an additional charge of 25 to the total cost.

If we take test 1 from my results on v5 as an example:

> Test1 446.1 657.3 147.32%

Looking at explain for that query:

regression=# explain select two from tenk1 order by two offset 1000000;
                              QUERY PLAN
----------------------------------------------------------------------
 Limit  (cost=1133.95..1133.95 rows=1 width=4)
   ->  Sort  (cost=1108.97..1133.95 rows=9995 width=4)
         Sort Key: two
         ->  Seq Scan on tenk1  (cost=0.00..444.95 rows=9995 width=4)
(4 rows)

If we want the costs to reflect reality again here then we'd have
reduce 1133.95 by something like 147.32% (the performance difference).
That would bring the cost down to 769.72, which is way more than we
have to play with than the 25 that the cpu_operator_cost * tuples
gives us.

If we reduced the 25 by 30% in this case, we'd get 17.5 and the total
cost would become 1126.45.  That's not great considering the actual
performance indicates that 769.72 would be a better number.

If we look at John's result for test 1: He saw 588 tps on master and
998 on v8.  1133.95 / 998.0 * 588.0 = 668.09, so we'd need even more
to get close to reality on that machine.

My thoughts are that the small surcharge added at the end of
cost_tuplesort() is just not enough for us to play with.  I think to
get us closer to fixing this correctly would require a redesign of the
tuplesort costing entirely. I think that would be about an order of
magnitude more effort than this patch was, so I really feel like I
don't want to do this.

I kinda feel that since the comparison_cost is always just 2.0 *
cpu_operator_cost regardless of the number of columns in the sort,
then if we add too many new smarts to try and properly adjust for this
new optimization, unless we do a completly new cost modal for this,
then we might as well be putting lipstick on a pig.

It sounds like James mostly just mentioned the sorting just to ensure
it was properly considered and does not really feel strongly that it
needs to be adjusted.  Does anyone else feel that we should be
adjusting it?

David



Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

From
Ranier Vilela
Date:
Em ter., 20 de jul. de 2021 às 05:35, David Rowley <dgrowleyml@gmail.com> escreveu:
On Tue, 20 Jul 2021 at 01:10, James Coleman <jtc331@gmail.com> wrote:
> To be clear up front: I'm in favor of the patch, and I don't want to
> put unnecessary stumbling blocks up for it getting committed. So if we
> decide to proceed as is, that's fine with me.

Thanks for making that clear.

> But I'm not sure that the "cost model, unfortunately, does not account
> for that" is entirely accurate. The end of cost_tuplesort contains a
> cost "per extracted tuple". It does, however, note that it doesn't
> charge cpu_tuple_cost, which maybe is what you'd want to fully
> incorporate this into the model. But given this run_cost isn't about
> accounting for comparison cost (that's been done earlier) which is the
> part that'd be the same between tuple and datum sort, it seems to me
> that we could lower the cpu_operator_cost here by something like 10%
> if it's byref and 30% if it's byval?

I failed to notice that last part that adds the additional cpu_operator_cost.

The default cpu_operator_cost is 0.0025, so with the 10k tuple
benchmark, that adds an additional charge of 25 to the total cost.

If we take test 1 from my results on v5 as an example:

> Test1 446.1 657.3 147.32%

Looking at explain for that query:

regression=# explain select two from tenk1 order by two offset 1000000;
                              QUERY PLAN
----------------------------------------------------------------------
 Limit  (cost=1133.95..1133.95 rows=1 width=4)
   ->  Sort  (cost=1108.97..1133.95 rows=9995 width=4)
         Sort Key: two
         ->  Seq Scan on tenk1  (cost=0.00..444.95 rows=9995 width=4)
(4 rows)

If we want the costs to reflect reality again here then we'd have
reduce 1133.95 by something like 147.32% (the performance difference).
That would bring the cost down to 769.72, which is way more than we
have to play with than the 25 that the cpu_operator_cost * tuples
gives us.

If we reduced the 25 by 30% in this case, we'd get 17.5 and the total
cost would become 1126.45.  That's not great considering the actual
performance indicates that 769.72 would be a better number.

If we look at John's result for test 1: He saw 588 tps on master and
998 on v8.  1133.95 / 998.0 * 588.0 = 668.09, so we'd need even more
to get close to reality on that machine.

My thoughts are that the small surcharge added at the end of
cost_tuplesort() is just not enough for us to play with.  I think to
get us closer to fixing this correctly would require a redesign of the
tuplesort costing entirely. I think that would be about an order of
magnitude more effort than this patch was, so I really feel like I
don't want to do this.
I understand that redesign would require a lot of work,
but why not do it step by step?


I kinda feel that since the comparison_cost is always just 2.0 *
cpu_operator_cost regardless of the number of columns in the sort,
then if we add too many new smarts to try and properly adjust for this
new optimization, unless we do a completly new cost modal for this,
then we might as well be putting lipstick on a pig.
I think one first step is naming this 2.0?
Does this magic number don't have a good name?
 

It sounds like James mostly just mentioned the sorting just to ensure
it was properly considered and does not really feel strongly that it
needs to be adjusted.  Does anyone else feel that we should be
adjusting it?
I took a look at cost_tuplesort and I think that some small adjustments could be made as part of the improvement process.
It is attached.
1. long is a very problematic type; better int64?
2. 1024 can be int, not long?
3. 2 changed all to 2.0 (double)?
4. If disk-based is not needed, IMO can we avoid calling relation_byte_size?

Finally, to at least document (add comments) those conclusions,
would be nice, wouldn't it?

regards,
Ranier Vilela
Attachment

Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

From
David Rowley
Date:
On Fri, 16 Jul 2021 at 15:44, David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Fri, 16 Jul 2021 at 02:53, Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
> > Please find attached a v9 just moving the flag setting to ExecInitSort, and my
> > apologies if I misunderstood your point.
>
> I took this and adjusted a few things and ended up with the attached patch.

Attaching the same v10 patch again so the CF bot picks up the correct
patch again.

David

Attachment

Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

From
David Rowley
Date:
On Tue, 20 Jul 2021 at 23:28, Ranier Vilela <ranier.vf@gmail.com> wrote:
> I took a look at cost_tuplesort and I think that some small adjustments could be made as part of the improvement
process.
> It is attached.
> 1. long is a very problematic type; better int64?
> 2. 1024 can be int, not long?
> 3. 2 changed all to 2.0 (double)?
> 4. If disk-based is not needed, IMO can we avoid calling relation_byte_size?
>
> Finally, to at least document (add comments) those conclusions,
> would be nice, wouldn't it?

I don't think there's anything useful here. If you think otherwise,
please take it to another thread.  Also, I'd recommend at least
compiling any patches you send to -hackers in the future. Going by the
CF bot, this one does not.

You might also want to read up on type promotion rules in C.  Your
sort_mem calculation change does not do what you think it does. Class
it as homework to figure out what's wrong with it.  No need to report
your findings here. Just thought it would be useful for you to learn
those things.

David



Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

From
James Coleman
Date:
On Tue, Jul 20, 2021 at 4:35 AM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Tue, 20 Jul 2021 at 01:10, James Coleman <jtc331@gmail.com> wrote:
> > To be clear up front: I'm in favor of the patch, and I don't want to
> > put unnecessary stumbling blocks up for it getting committed. So if we
> > decide to proceed as is, that's fine with me.
>
> Thanks for making that clear.
>
> > But I'm not sure that the "cost model, unfortunately, does not account
> > for that" is entirely accurate. The end of cost_tuplesort contains a
> > cost "per extracted tuple". It does, however, note that it doesn't
> > charge cpu_tuple_cost, which maybe is what you'd want to fully
> > incorporate this into the model. But given this run_cost isn't about
> > accounting for comparison cost (that's been done earlier) which is the
> > part that'd be the same between tuple and datum sort, it seems to me
> > that we could lower the cpu_operator_cost here by something like 10%
> > if it's byref and 30% if it's byval?
>
> I failed to notice that last part that adds the additional cpu_operator_cost.
>
> The default cpu_operator_cost is 0.0025, so with the 10k tuple
> benchmark, that adds an additional charge of 25 to the total cost.
>
> If we take test 1 from my results on v5 as an example:
>
> > Test1 446.1 657.3 147.32%
>
> Looking at explain for that query:
>
> regression=# explain select two from tenk1 order by two offset 1000000;
>                               QUERY PLAN
> ----------------------------------------------------------------------
>  Limit  (cost=1133.95..1133.95 rows=1 width=4)
>    ->  Sort  (cost=1108.97..1133.95 rows=9995 width=4)
>          Sort Key: two
>          ->  Seq Scan on tenk1  (cost=0.00..444.95 rows=9995 width=4)
> (4 rows)
>
> If we want the costs to reflect reality again here then we'd have
> reduce 1133.95 by something like 147.32% (the performance difference).
> That would bring the cost down to 769.72, which is way more than we
> have to play with than the 25 that the cpu_operator_cost * tuples
> gives us.
>
> If we reduced the 25 by 30% in this case, we'd get 17.5 and the total
> cost would become 1126.45.  That's not great considering the actual
> performance indicates that 769.72 would be a better number.
>
> If we look at John's result for test 1: He saw 588 tps on master and
> 998 on v8.  1133.95 / 998.0 * 588.0 = 668.09, so we'd need even more
> to get close to reality on that machine.
>
> My thoughts are that the small surcharge added at the end of
> cost_tuplesort() is just not enough for us to play with.  I think to
> get us closer to fixing this correctly would require a redesign of the
> tuplesort costing entirely. I think that would be about an order of
> magnitude more effort than this patch was, so I really feel like I
> don't want to do this.
>
> I kinda feel that since the comparison_cost is always just 2.0 *
> cpu_operator_cost regardless of the number of columns in the sort,
> then if we add too many new smarts to try and properly adjust for this
> new optimization, unless we do a completly new cost modal for this,
> then we might as well be putting lipstick on a pig.
>
> It sounds like James mostly just mentioned the sorting just to ensure
> it was properly considered and does not really feel strongly that it
> needs to be adjusted.  Does anyone else feel that we should be
> adjusting it?

Thanks for doing the math measuring how much we could impact things.

I'm +lots on getting this committed as is.

Thanks all for your work on the improvement!

James



Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

From
David Rowley
Date:
On Wed, 21 Jul 2021 at 13:39, James Coleman <jtc331@gmail.com> wrote:
> Thanks for doing the math measuring how much we could impact things.
>
> I'm +lots on getting this committed as is.

Ok good. I plan on taking a final look at the v10 patch tomorrow
morning NZ time (about 12 hours from now) and if all is well, I'll
push it.

If anyone feels differently, please let me know before then.

David



RE: [PATCH] Use optimized single-datum tuplesort in ExecSort

From
"houzj.fnst@fujitsu.com"
Date:
From: David Rowley <dgrowleyml@gmail.com>
> On Wed, 21 Jul 2021 at 13:39, James Coleman <jtc331@gmail.com> wrote:
> > Thanks for doing the math measuring how much we could impact things.
> >
> > I'm +lots on getting this committed as is.
> 
> Ok good. I plan on taking a final look at the v10 patch tomorrow morning NZ
> time (about 12 hours from now) and if all is well, I'll push it.
> 
> If anyone feels differently, please let me know before then.
Hi,

I noticed a minor thing about the v10 patch.

-
-        for (;;)
+        if (node->datumSort)
         {
-            slot = ExecProcNode(outerNode);
-
-            if (TupIsNull(slot))
-                break;
-
-            tuplesort_puttupleslot(tuplesortstate, slot);
+            for (;;)
+            {
+                slot = ExecProcNode(outerNode);
+
+                if (TupIsNull(slot))
+                    break;
+                slot_getsomeattrs(slot, 1);
+                tuplesort_putdatum(tuplesortstate,
+                                   slot->tts_values[0],
+                                   slot->tts_isnull[0]);
+            }
+        }
+        else
+        {
+            for (;;)
+            {
+                slot = ExecProcNode(outerNode);
+
+                if (TupIsNull(slot))
+                    break;
+                tuplesort_puttupleslot(tuplesortstate, slot);
+            }

The above seems can be shorter like the following ?

for (;;)
{
    slot = ExecProcNode(outerNode);
    if (TupIsNull(slot))
        break;
    if (node->datumSort)
    {
        slot_getsomeattrs(slot, 1);
        tuplesort_putdatum(tuplesortstate,
                    slot->tts_values[0],
                    slot->tts_isnull[0]);
    }
    else
        tuplesort_puttupleslot(tuplesortstate, slot);
}

Best regards,
houzj


Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

From
David Rowley
Date:
On Thu, 22 Jul 2021 at 12:27, houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
> The above seems can be shorter like the following ?
>
> for (;;)
> {
>         slot = ExecProcNode(outerNode);
>         if (TupIsNull(slot))
>                 break;
>         if (node->datumSort)
>         {
>                 slot_getsomeattrs(slot, 1);
>                 tuplesort_putdatum(tuplesortstate,
>                                         slot->tts_values[0],
>                                         slot->tts_isnull[0]);
>         }
>         else
>                 tuplesort_puttupleslot(tuplesortstate, slot);
> }

I don't think that's a good change.  It puts the branch inside the
loop the pulls all tuples from the subplan.  Given the loop is likely
to be very hot combined with the fact that it's so simple, I'd much
rather have two separate loops to keep the extra branch outside the
loop.  It's true the branch predictor is likely to get the prediction
correct on each iteration, but unless the compiler rewrites this into
two loops then the comparison and jump must be done per loop.

David



RE: [PATCH] Use optimized single-datum tuplesort in ExecSort

From
"houzj.fnst@fujitsu.com"
Date:
On July 22, 2021 8:38 AM David Rowley <dgrowleyml@gmail.com>
> On Thu, 22 Jul 2021 at 12:27, houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com>
> wrote:
> > The above seems can be shorter like the following ?
> >
> > for (;;)
> > {
> >         slot = ExecProcNode(outerNode);
> >         if (TupIsNull(slot))
> >                 break;
> >         if (node->datumSort)
> >         {
> >                 slot_getsomeattrs(slot, 1);
> >                 tuplesort_putdatum(tuplesortstate,
> >                                         slot->tts_values[0],
> >                                         slot->tts_isnull[0]);
> >         }
> >         else
> >                 tuplesort_puttupleslot(tuplesortstate, slot); }
> 
> I don't think that's a good change.  It puts the branch inside the loop the pulls
> all tuples from the subplan.  Given the loop is likely to be very hot combined
> with the fact that it's so simple, I'd much rather have two separate loops to
> keep the extra branch outside the loop.  It's true the branch predictor is likely
> to get the prediction correct on each iteration, but unless the compiler
> rewrites this into two loops then the comparison and jump must be done per
> loop.

Ah, you are right, I missed that. Thanks for the explanation.

Best regards,
houzj

Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

From
David Rowley
Date:
On Wed, 21 Jul 2021 at 22:09, David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Wed, 21 Jul 2021 at 13:39, James Coleman <jtc331@gmail.com> wrote:
> > Thanks for doing the math measuring how much we could impact things.
> >
> > I'm +lots on getting this committed as is.
>
> Ok good. I plan on taking a final look at the v10 patch tomorrow
> morning NZ time (about 12 hours from now) and if all is well, I'll
> push it.

Pushed.

David