Thread: Add proper planner support for ORDER BY / DISTINCT aggregates
A few years ago I wrote a patch to implement the missing aggregate combine functions for array_agg and string_agg [1]. In the end, the patch was rejected due to some concern [2] that if we allow these aggregates to run in parallel then it might mess up the order in which values are being aggregated by some unsuspecting user who didn't include an ORDER BY clause in the aggregate. It was mentioned at the time that due to nodeAgg.c performing so terribly with ORDER BY aggregates that we couldn't possibly ask users who were upset by this change to include an ORDER BY in their aggregate functions.

I'd still quite like to finish off the remaining aggregate functions that still don't have a combine function, so to get that going again I've written some code that gets the query planner onboard with picking a plan that allows nodeAgg.c to skip doing any internal sorts when the input results are already sorted according to what's required by the aggregate function.

Of course, the query could have many aggregates all with differing ORDER BY clauses. Since we reuse the same input for each, it might not be possible to feed each aggregate with suitably sorted input. When the input is not sorted, nodeAgg.c still must perform the sort as it does today. So unfortunately we can't remove the nodeAgg.c code for that.

The best I could come up with is just picking a sort order that suits the first ORDER BY aggregate, then spinning through the remaining ones to see if there's any with a stricter order requirement that we can use to suit both that one and the first one. The planner does something similar for window functions already, although it's not quite as comprehensive: it doesn't look beyond the first window for windows with a stricter sort requirement.

This allows us to give presorted input to both aggregates in the following case:

SELECT agg(a ORDER BY a),agg2(a ORDER BY a,b) ...

but just the first agg in this one:

SELECT agg(a ORDER BY a),agg2(a ORDER BY c) ...
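The heuristic described above (take the first ORDER BY aggregate's sort order, then upgrade to any later aggregate's stricter order that still serves the first one) can be sketched roughly as follows. This is an illustrative model only, not the planner's actual pathkey code; orderings are modeled as tuples of column names standing in for pathkey lists:

```python
def choose_agg_ordering(agg_orderings):
    """Pick the first aggregate's ordering, then upgrade to any later
    ordering that has it as a prefix (i.e. is strictly more sorted)."""
    if not agg_orderings:
        return ()
    chosen = agg_orderings[0]
    for ordering in agg_orderings[1:]:
        # A longer ordering that starts with `chosen` serves both aggregates.
        if len(ordering) > len(chosen) and ordering[:len(chosen)] == chosen:
            chosen = ordering
    return chosen

# SELECT agg(a ORDER BY a), agg2(a ORDER BY a, b): both get presorted input.
print(choose_agg_ordering([("a",), ("a", "b")]))  # ('a', 'b')
# SELECT agg(a ORDER BY a), agg2(a ORDER BY c): only the first is presorted.
print(choose_agg_ordering([("a",), ("c",)]))      # ('a',)
```

As in the patch, an ordering that is longer but not prefix-compatible with the first aggregate's ordering is ignored rather than planned for separately.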
In order to make DISTINCT work, I had to add a new expression evaluation operator to allow filtering if the current value is the same as the last value. The current unpatched code implements distinct when reading back the sorted tuplestore data. Since I don't have a tuplestore with the pre-sorted version, I needed to cache the last Datum, or last tuple, somewhere. I ended up putting that in the AggStatePerTransData struct. I'm not quite sure if I like that, but I don't really see where else it could go.

When testing the performance of all this I found that when a suitable index exists to provide pre-sorted input for the aggregation that the performance does improve. Unfortunately, it looks like things get more complex when no index exists. In this case, since we're setting pathkeys to tell the planner we need a plan that provides pre-sorted input to the aggregates, the planner will add a sort below the aggregate node. I initially didn't see any problem with that as it just moves the sort to a Sort node rather than having it done implicitly inside nodeAgg.c. The problem is, it just does not perform as well. I guess this is because when the sort is done inside nodeAgg.c the transition function is called in a tight loop while reading records back from the tuplestore. In the patched version, there's an additional node transition in between nodeAgg and nodeSort and that causes slower performance. For now, I'm not quite sure what to do about that. We set the plan pathkeys well before we could possibly decide if asking for pre-sorted input for the aggregates would be a good idea or not.

Please find attached my WIP patch. It's WIP due to what I mentioned in the above paragraph and also because I've not bothered to add JIT support for the new expression evaluation steps.

I'll add this to the July commitfest.
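The DISTINCT idea above can be illustrated in miniature: once the input arrives in the aggregate's ORDER BY order, DISTINCT reduces to skipping any value equal to the one cached from the previous row (the role the patch gives to the Datum/tuple stored in AggStatePerTransData). A hedged sketch of that behavior, not the executor's C code:

```python
def distinct_presorted(values):
    """Yield each distinct value from an already-sorted stream,
    caching only the previously seen value."""
    last = object()          # sentinel: nothing seen yet
    for v in values:
        if v == last:
            continue         # same as the cached last value: filter it out
        last = v
        yield v

print(list(distinct_presorted([1, 1, 2, 2, 2, 3])))  # [1, 2, 3]
```

Unlike the unpatched tuplestore read-back, this needs only O(1) state per transition, which is why a single cached value suffices.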
David

[1] https://www.postgresql.org/message-id/CAKJS1f9sx_6GTcvd6TMuZnNtCh0VhBzhX6FZqw17TgVFH-ga_A%40mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/6538.1522096067%40sss.pgh.pa.us#c228ed67026faa15209c76dada035da4
Attachment
> This allows us to give presorted input to both aggregates in the following
> case:
>
> SELECT agg(a ORDER BY a),agg2(a ORDER BY a,b) ...
>
> but just the first agg in this one:
>
> SELECT agg(a ORDER BY a),agg2(a ORDER BY c) ...

I don't know if it's acceptable, but in the case where you add both an aggregate with an ORDER BY clause, and another aggregate without the clause, the output for the unordered one will change and use the same ordering, maybe surprising the unsuspecting user. Would that be acceptable?

> When testing the performance of all this I found that when a suitable
> index exists to provide pre-sorted input for the aggregation that the
> performance does improve. Unfortunately, it looks like things get more
> complex when no index exists. In this case, since we're setting
> pathkeys to tell the planner we need a plan that provides pre-sorted
> input to the aggregates, the planner will add a sort below the
> aggregate node. I initially didn't see any problem with that as it
> just moves the sort to a Sort node rather than having it done
> implicitly inside nodeAgg.c. The problem is, it just does not perform
> as well. I guess this is because when the sort is done inside
> nodeAgg.c that the transition function is called in a tight loop while
> reading records back from the tuplestore. In the patched version,
> there's an additional node transition in between nodeAgg and nodeSort
> and that causes slower performance. For now, I'm not quite sure what
> to do about that. We set the plan pathkeys well before we could
> possibly decide if asking for pre-sorted input for the aggregates
> would be a good idea or not.

I was curious about the performance implication of that additional transition, and could not reproduce a significant difference. I may be doing something wrong: how did you highlight it?

Regards,

--
Ronan Dunklau
On Fri, 2 Jul 2021 at 19:54, Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
> I don't know if it's acceptable, but in the case where you add both an
> aggregate with an ORDER BY clause, and another aggregate without the clause,
> the output for the unordered one will change and use the same ordering, maybe
> suprising the unsuspecting user. Would that be acceptable ?

That's a good question. There was an argument in [1] that mentions that there might be a group of people who rely on aggregation being done in a certain order where they're not specifying an ORDER BY clause in the aggregate. If that group of people exists, then it's possible they might be upset in the scenario that you describe.

I also think that it's going to be pretty hard to make significant gains in this area if we are too scared to make changes to undefined behaviour. You wouldn't have to look too hard in the pgsql-general mailing list to find someone complaining that their query output is in the wrong order on some query that does not have an ORDER BY. We pretty much always tell those people that the order is undefined without an ORDER BY. I'm not too sure why Tom in [1] classes the ORDER BY aggregate people any differently. We'll be stuck forever here and in many other areas if we're too scared to change the order of aggregation. You could argue that something like parallelism has changed that for people already. I think the multi-batch Hash Aggregate code could also change this.

> I was curious about the performance implication of that additional transition,
> and could not reproduce a signifcant difference. I may be doing something
> wrong: how did you highlight it ?

It was pretty basic. I just created a table with two columns and no index and did something like SELECT a,SUM(b ORDER BY b) from ab GROUP BY 1; the new code will include a Sort due to lack of any index and the old code would have done a sort inside nodeAgg.c.
I imagine that the overhead comes from the fact that in the patched version nodeAgg.c must ask its subnode (nodeSort.c) for the next tuple each time, whereas unpatched nodeAgg.c already has all the tuples in a tuplestore and can fetch them very quickly in a tight loop.

David

[1] https://www.postgresql.org/message-id/6538.1522096067%40sss.pgh.pa.us
On 2/07/21 8:39 pm, David Rowley wrote:
> On Fri, 2 Jul 2021 at 19:54, Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
>> I don't know if it's acceptable, but in the case where you add both an
>> aggregate with an ORDER BY clause, and another aggregate without the clause,
>> the output for the unordered one will change and use the same ordering, maybe
>> suprising the unsuspecting user. Would that be acceptable ?
> That's a good question. There was an argument in [1] that mentions
> that there might be a group of people who rely on aggregation being
> done in a certain order where they're not specifying an ORDER BY
> clause in the aggregate. If that group of people exists, then it's
> possible they might be upset in the scenario that you describe.
[...]

I've always worked on the assumption that if I do not specify an ORDER BY clause then the rdbms is expected to present rows in the order most efficient for it to do so. If pg orders rows when not requested then this could add extra elapsed time to the query, which might be significant for large queries.

I don't know of any rdbms that guarantees the order of returned rows when an ORDER BY clause is not used. So I think that pg has no obligation to protect the sensibilities of naive users in this case, especially at the expense of users that want queries to complete as quickly as possible. IMHO

Cheers,
Gavin
Gavin Flower <GavinFlower@archidevsys.co.nz> writes:
> On 2/07/21 8:39 pm, David Rowley wrote:
>> That's a good question. There was an argument in [1] that mentions
>> that there might be a group of people who rely on aggregation being
>> done in a certain order where they're not specifying an ORDER BY
>> clause in the aggregate. If that group of people exists, then it's
>> possible they might be upset in the scenario that you describe.

> So I think that pg has no obligation to protect the sensibilities of
> naive users in this case, especially at the expense of users that want
> queries to complete as quickly as possible. IMHO

I agree. We've broken such expectations in the past and I don't have much hesitation about breaking them again.

regards, tom lane
Le vendredi 2 juillet 2021, 10:39:44 CEST David Rowley a écrit :
> On Fri, 2 Jul 2021 at 19:54, Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
> > I don't know if it's acceptable, but in the case where you add both an
> > aggregate with an ORDER BY clause, and another aggregate without the
> > clause, the output for the unordered one will change and use the same
> > ordering, maybe suprising the unsuspecting user. Would that be acceptable
> > ?
>
> That's a good question. There was an argument in [1] that mentions
> that there might be a group of people who rely on aggregation being
> done in a certain order where they're not specifying an ORDER BY
> clause in the aggregate. If that group of people exists, then it's
> possible they might be upset in the scenario that you describe.
>
> I also think that it's going to be pretty hard to make significant
> gains in this area if we are too scared to make changes to undefined
> behaviour. You wouldn't have to look too hard in the pgsql-general
> mailing list to find someone complaining that their query output is in
> the wrong order on some query that does not have an ORDER BY. We
> pretty much always tell those people that the order is undefined
> without an ORDER BY. I'm not too sure why Tom in [1] classes the ORDER
> BY aggregate people any differently. We'll be stuck forever here and
> in many other areas if we're too scared to change the order of
> aggregation. You could argue that something like parallelism has
> changed that for people already. I think the multi-batch Hash
> Aggregate code could also change this.

I would agree with that.

> > I was curious about the performance implication of that additional
> > transition, and could not reproduce a signifcant difference. I may be
> > doing something wrong: how did you highlight it ?
>
> It was pretty basic.
> I just created a table with two columns and no
> index and did something like SELECT a,SUM(b ORDER BY b) from ab GROUP
> BY 1; the new code will include a Sort due to lack of any index and
> the old code would have done a sort inside nodeAgg.c. I imagine that
> the overhead comes from the fact that in the patched version nodeAgg.c
> must ask its subnode (nodeSort.c) for the next tuple each time,
> whereas unpatched nodeAgg.c already has all the tuples in a tuplestore
> and can fetch them very quickly in a tight loop.

Ok, I reproduced that case, just not using a group by: when adding the group by, a sort node is added in both cases (master and your patch), except that with your patch we sort on both keys, and that doesn't really incur a performance penalty.

I think the overhead occurs because in the ExecAgg case, we use the tuplesort_*_datum API as an optimization when we have a single column as an input, which the ExecSort code doesn't. Maybe it would be worth it to try to use that API in sort nodes too, when it can be done.

> David
>
> [1] https://www.postgresql.org/message-id/6538.1522096067%40sss.pgh.pa.us

--
Ronan Dunklau
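The single-datum point above can be illustrated abstractly: sorting bare values avoids the per-row cost of forming and deconstructing one-column tuples. A sketch under that assumption (a language-agnostic model, not the actual tuplesort_*_datum C API):

```python
def sort_as_tuples(rows):
    """General path: each row is a (possibly one-field) tuple; sort on
    the first field, then unwrap it. Models the per-row wrap/unwrap cost."""
    return [t[0] for t in sorted(rows, key=lambda t: t[0])]

def sort_as_datums(values):
    """Single-datum path: the values are sorted bare, with no wrapping."""
    return sorted(values)

rows = [(3,), (1,), (2,)]
# Both paths produce the same ordering; the datum path just skips the
# tuple handling, which is where the optimization's savings come from.
assert sort_as_tuples(rows) == sort_as_datums([3, 1, 2]) == [1, 2, 3]
```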
> Ok, I reproduced that case, just not using a group by: by adding the group
> by a sort node is added in both cases (master and your patch), except that
> with your patch we sort on both keys and that doesn't really incur a
> performance penalty.
>
> I think the overhead occurs because in the ExecAgg case, we use the
> tuplesort_*_datum API as an optimization when we have a single column as an
> input, which the ExecSort code doesn't. Maybe it would be worth it to try to
> use that API in sort nodes too, when it can be done.

Please find attached a POC patch to do just that.

The switch to the single-datum tuplesort is done when there is only one attribute, it is byval (to avoid having to deal with copies of the references everywhere) and we are not in bound mode (to also avoid having to move things around).

A naive run of make check passes on this, but I may have overlooked things.

Should I add this separately to the commitfest?

For the record, here are the times I got on my laptop, on master VS David's patch VS both. Values are an average of 100 runs, as reported by pgbench --no-vacuum -f <file.sql> -t 100. There is a good amount of noise, but the simple "select one ordered column" case seems worth the optimization. Only shared_buffers and work_mem have been set to 2GB each.
Setup 1: single table, 1 000 000 tuples, no index

CREATE TABLE tbench (a int, b int);
INSERT INTO tbench (a, b)
  SELECT a, b FROM generate_series(1, 100) a, generate_series(1, 10000) b;

Test 1: Single-column ordered select (order by b since the table is already sorted by a)
select b from tbench order by b;
  master: 303.661 ms
  with mine: 148.571 ms

Test 2: Ordered sum (using b so that the input is not presorted)
select sum(b order by b) from tbench;
  master: 112.379 ms
  with david's patch: 144.469 ms
  with david's patch + mine: 97 ms

Test 3: Ordered sum + group by
select b, sum(a order by a) from tbench GROUP BY b;
  master: 316.117 ms
  with david's patch: 297.079 ms
  with david's patch + mine: 294.601 ms

Setup 2: same as before, but adding an index on (b, a)

CREATE INDEX ON tbench (b, a);

Test 2: Ordered sum
select sum(a order by a) from tbench;
  master: 111.847 ms
  with david's patch: 48.088 ms
  with david's patch + mine: 47.678 ms

Test 3: Ordered sum + group by
select a, sum(b order by b) from tbench GROUP BY a;
  master: 76.873 ms
  with david's patch: 61.105 ms
  with david's patch + mine: 62.672 ms

--
Ronan Dunklau
Attachment
>Please find attached a POC patch to do just that.
>The switch to the single-datum tuplesort is done when there is only one
>attribute, it is byval (to avoid having to deal with copy of the references
>everywhere) and we are not in bound mode (to also avoid having to move things
>around).
Hi, nice results!
I have a few suggestions and questions about your patch:
1. Why did you move the declaration of the variable *plannode?
I think this is unnecessary; it extends the scope.
2. Why do you declare a new variable TupleDesc out_tuple_desc at ExecInitSort?
I think this is unnecessary too, but maybe I missed something.
3. I inverted the order of the check at this line; I think "!node->bounded" is cheaper than TupleDescAttr(tupDesc, 0)->attbyval.
4. Once the order of execution is changed, this test is unlikely to happen, so adding unlikely() helps the branch.
5. I think you added a loop-invariant check inside the loop: "if (node->is_single_val)"?
Wouldn't two separate loops be better?
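Suggestion 5 above is the classic loop-invariant hoisting transformation: since the flag cannot change while the loop runs, test it once and run one of two specialized loops. A small sketch (illustrative names, not the patch's C code):

```python
def fetch_all(tuples, is_single_val):
    """Hoist the loop-invariant `is_single_val` test out of the loop,
    leaving two branch-free specialized loops."""
    out = []
    if is_single_val:
        for t in tuples:
            out.append(t[0])   # single-datum path: first column only
    else:
        for t in tuples:
            out.append(t)      # general path: keep the whole row
    return out

assert fetch_all([(1, "x"), (2, "y")], True) == [1, 2]
assert fetch_all([(1, "x"), (2, "y")], False) == [(1, "x"), (2, "y")]
```

The behavior is identical to testing the flag on every iteration; only the per-iteration branch is removed.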
For your convenience, I attached a v2 version (with style changes); please take a look, and can you repeat your tests?
regards,
Ranier Vilela
Attachment
Le lundi 5 juillet 2021, 16:51:59 CEST Ranier Vilela a écrit :
> > Please find attached a POC patch to do just that.
> >
> > The switch to the single-datum tuplesort is done when there is only one
> > attribute, it is byval (to avoid having to deal with copy of the
> > references everywhere) and we are not in bound mode (to also avoid
> > having to move things around).
>
> Hi, nice results!
>
> I have a few suggestions and questions to your patch:

Thank you for those !

> 1. Why do you moved the declaration of variable *plannode?
> I think this is unnecessary, extend the scope.

Sorry, I should have cleaned it up before sending.

> 2. Why do you declare a new variable TupleDesc out_tuple_desc at
> ExecInitSort?
> I think this is unnecessary too, maybe I didn't notice something.

Same as the above, thanks for the two.

> 3. I inverted the order of check at this line, I think "!node-bounded" is
> more cheap that TupleDescAttr(tupDesc, 0)->attbyval

I'm not sure it matters since it's done once per sort but Ok

> 4. Once that you changed the order of execution, this test is unlikely that
> happens, so add unlikely helps the branch.

Ok.

> 5. I think that you add a invariant inside the loop
> "if(node->is_single_val)"?
> Would not be better two fors?

Ok for me.

> For you convenience, I attached a v2 version (with styles changes), please
> take a look and can you repeat yours tests?

Tested it quickly, and did not see any change performance wise that cannot be attributed to noise on my laptop but it's fine.

Thank you for the fixes !

> regards,
> Ranier Vilela

--
Ronan Dunklau
Em seg., 5 de jul. de 2021 às 12:07, Ronan Dunklau <ronan.dunklau@aiven.io> escreveu:
> Le lundi 5 juillet 2021, 16:51:59 CEST Ranier Vilela a écrit :
> > > Please find attached a POC patch to do just that.
> > >
> > > The switch to the single-datum tuplesort is done when there is only one
> > > attribute, it is byval (to avoid having to deal with copy of the
> > > references everywhere) and we are not in bound mode (to also avoid
> > > having to move things around).
> >
> > Hi, nice results!
> >
> > I have a few suggestions and questions to your patch:
>
> Thank you for those !
>
> > 1. Why do you moved the declaration of variable *plannode?
> > I think this is unnecessary, extend the scope.
>
> Sorry, I should have cleaned it up before sending.
>
> > 2. Why do you declare a new variable TupleDesc out_tuple_desc at
> > ExecInitSort?
> > I think this is unnecessary too, maybe I didn't notice something.
>
> Same as the above, thanks for the two.
>
> > 3. I inverted the order of check at this line, I think "!node-bounded" is
> > more cheap that TupleDescAttr(tupDesc, 0)->attbyval
>
> I'm not sure it matters since it's done once per sort but Ok
>
> > 4. Once that you changed the order of execution, this test is unlikely that
> > happens, so add unlikely helps the branch.
>
> Ok.
>
> > 5. I think that you add a invariant inside the loop
> > "if(node->is_single_val)"?
> > Would not be better two fors?
>
> Ok for me.
>
> > For you convenience, I attached a v2 version (with styles changes), please
> > take a look and can you repeat yours tests?
>
> Tested it quickly, and did not see any change performance wise that cannot be
> attributed to noise on my laptop but it's fine.

Thanks for testing again.

> Thank you for the fixes !

You are welcome.
regards,
Ranier Vilela
On Sat, Jun 12, 2021 at 11:07 AM David Rowley <dgrowleyml@gmail.com> wrote:
>
> A few years ago I wrote a patch to implement the missing aggregate
> combine functions for array_agg and string_agg [1]. In the end, the
> patch was rejected due to some concern [2] that if we allow these
> aggregates to run in parallel then it might mess up the order in which
> values are being aggregated by some unsuspecting user who didn't
> include an ORDER BY clause in the aggregate. It was mentioned at the
> time that due to nodeAgg.c performing so terribly with ORDER BY
> aggregates that we couldn't possibly ask users who were upset by this
> change to include an ORDER BY in their aggregate functions.
>
> I'd still quite like to finish off the remaining aggregate functions
> that still don't have a combine function, so to get that going again
> I've written some code that gets the query planner onboard with
> picking a plan that allows nodeAgg.c to skip doing any internal sorts
> when the input results are already sorted according to what's required
> by the aggregate function.
>
> Of course, the query could have many aggregates all with differing
> ORDER BY clauses. Since we reuse the same input for each, it might not
> be possible to feed each aggregate with suitably sorted input. When
> the input is not sorted, nodeAgg.c still must perform the sort as it
> does today. So unfortunately we can't remove the nodeAgg.c code for
> that.
>
> The best I could come up with is just picking a sort order that suits
> the first ORDER BY aggregate, then spin through the remaining ones to
> see if there's any with a more strict order requirement that we can
> use to suit that one and the first one. The planner does something
> similar for window functions already, although it's not quite as
> comprehensive to look beyond the first window for windows with a more
> strict sort requirement.

I think this is a reasonable choice.
I could imagine a more complex method, say, counting the number of aggregates benefiting from a given sort, and choosing the one that benefits the most (and this could be further complicated by ranking based on "cost" -- not costing in the normal sense since we don't have that at this point), but I think it'd take a lot of convincing that that was valuable.

> This allows us to give presorted input to both aggregates in the following case:
>
> SELECT agg(a ORDER BY a),agg2(a ORDER BY a,b) ...
>
> but just the first agg in this one:
>
> SELECT agg(a ORDER BY a),agg2(a ORDER BY c) ...
>
> In order to make DISTINCT work, I had to add a new expression
> evaluation operator to allow filtering if the current value is the
> same as the last value. The current unpatched code implements
> distinct when reading back the sorted tuplestore data. Since I don't
> have a tuplestore with the pre-sorted version, I needed to cache the
> last Datum, or last tuple somewhere. I ended up putting that in the
> AggStatePerTransData struct. I'm not quite sure if I like that, but I
> don't really see where else it could go.

That sounds like what nodeIncrementalSort.c's isCurrentGroup() does, except it's just implemented inline. Not anything you need to change in this patch, but noting it in case it triggered a thought valuable for you or me later on.

> When testing the performance of all this I found that when a suitable
> index exists to provide pre-sorted input for the aggregation that the
> performance does improve. Unfortunately, it looks like things get more
> complex when no index exists. In this case, since we're setting
> pathkeys to tell the planner we need a plan that provides pre-sorted
> input to the aggregates, the planner will add a sort below the
> aggregate node. I initially didn't see any problem with that as it
> just moves the sort to a Sort node rather than having it done
> implicitly inside nodeAgg.c. The problem is, it just does not perform
> as well. I guess this is because when the sort is done inside
> nodeAgg.c that the transition function is called in a tight loop while
> reading records back from the tuplestore. In the patched version,
> there's an additional node transition in between nodeAgg and nodeSort
> and that causes slower performance. For now, I'm not quite sure what
> to do about that. We set the plan pathkeys well before we could
> possibly decide if asking for pre-sorted input for the aggregates
> would be a good idea or not.

This opens up another path for significant plan benefits too: if there's now an explicit sort node, then it's possible for that node to be an incremental sort node, which isn't something nodeAgg is capable of utilizing currently.

> Please find attached my WIP patch. It's WIP due to what I mentioned
> in the above paragraph and also because I've not bothered to add JIT
> support for the new expression evaluation steps.

I looked this over (though didn't get a chance to play with it). I'm wondering about the changes to the test output in tuplesort.out. It looks like where a merge join used to be proceeding with an explicit sort with DESC it's now sorting with (implicit) ASC, and then an explicit sort node using DESC is above the merge join node. Two thoughts:

1. It seems like losing the "proper" sort order in the JOIN node isn't preferable. Given both are full sorts, it may not actually be a significant performance difference on its own (it can't short circuit), but it means precluding using incremental sort in the new sort node.

2. In an ideal world we'd push the more complex sort down into the merge join rather than doing a sort on "col1" and then a sort on "col1, col2". That's likely beyond this patch, but I haven't investigated at all.

Thanks for your work on this; both this patch and the original one you're trying to enable seem like great additions.

James
On Mon, Jul 5, 2021 at 8:08 AM Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
>
> > Ok, I reproduced that case, just not using a group by: by adding the group
> > by a sort node is added in both cases (master and your patch), except that
> > with your patch we sort on both keys and that doesn't really incur a
> > performance penalty.
> >
> > I think the overhead occurs because in the ExecAgg case, we use the
> > tuplesort_*_datum API as an optimization when we have a single column as an
> > input, which the ExecSort code doesn't. Maybe it would be worth it to try to
> > use that API in sort nodes too, when it can be done.
>
> Please find attached a POC patch to do just that.
>
> The switch to the single-datum tuplesort is done when there is only one
> attribute, it is byval (to avoid having to deal with copy of the references
> everywhere) and we are not in bound mode (to also avoid having to move things
> around).
>
> A naive run on make check pass on this, but I may have overlooked things.
>
> Should I add this separately to the commitfest ?

It seems like a pretty obvious win on its own, and, I'd expect, will need less discussion than David's patch, so my vote is to make it a separate thread. The patch tester wants the full series attached each time, and even without that it's difficult to discuss multiple patches on a single thread.

If you make a separate thread and CF entry, please CC me and add me as a reviewer on the CF entry.

One thing from a quick read through of the patch: your changes near the end of ExecSort, in ExecInitSort, and in execnodes.h need indentation fixes.

Thanks,
James
> If you make a separate thread and CF entry, please CC me and add me as
> a reviewer on the CF entry.

Ok, I started a new thread and added it to the next CF: https://commitfest.postgresql.org/34/3237/

--
Ronan Dunklau
On Mon, 5 Jul 2021 at 18:38, Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
> I think the overhead occurs because in the ExecAgg case, we use the
> tuplesort_*_datum API as an optimization when we have a single column as an
> input, which the ExecSort code doesn't. Maybe it would be worth it to try to
> use that API in sort nodes too, when it can be done.

That's a really great find! Looks like I was wrong to assume that the extra overhead was from transitioning between nodes.

I ran the performance results locally here with:

create table t1(a int not null);
create table t2(a int not null, b int not null);
create table t3(a int not null, b int not null, c int not null);
insert into t1 select x from generate_Series(1,1000000)x;
insert into t2 select x,x from generate_Series(1,1000000)x;
insert into t3 select x,x,1 from generate_Series(1,1000000)x;
vacuum freeze analyze t1,t2,t3;

select1: select sum(a order by a) from t1;
select2: select sum(a order by b) from t2;
select3: select c,sum(a order by b) from t3 group by c;

master = https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=8aafb02616753f5c6c90bbc567636b73c0cbb9d4
patch1 = https://www.postgresql.org/message-id/attachment/123546/wip_planner_support_for_orderby_distinct_aggs_v0.patch
patch2 = https://www.postgresql.org/message-id/attachment/124238/0001-Allow-Sort-nodes-to-use-the-fast-single-datum-tuples.patch

The attached graph shows the results. It's very good to see that with both patches applied there's no regression. I'm a bit surprised there's much performance gain here given that I didn't add any indexes to provide any presorted input.

David
Attachment
On Sun, 13 Jun 2021 at 03:07, David Rowley <dgrowleyml@gmail.com> wrote:
>
> Please find attached my WIP patch. It's WIP due to what I mentioned
> in the above paragraph and also because I've not bothered to add JIT
> support for the new expression evaluation steps.

I've split this patch into two parts.

0001 Adds planner support for ORDER BY aggregates.
0002 is a WIP patch for DISTINCT support. This still lacks JIT support and I'm still not certain of the best place to store the previous value or tuple to determine if the current one is distinct from it.

The 0001 patch is fairly simple and does not require much in the way of changes in the planner aside from standard_qp_callback(). Surprisingly, the executor does not need a great deal of work here either. It's mostly just about skipping the normal agg(.. ORDER BY) code when the Aggref is presorted.

David
Attachment
Em seg., 12 de jul. de 2021 às 09:04, David Rowley <dgrowleyml@gmail.com> escreveu:
On Sun, 13 Jun 2021 at 03:07, David Rowley <dgrowleyml@gmail.com> wrote:
>
> Please find attached my WIP patch. It's WIP due to what I mentioned
> in the above paragraph and also because I've not bothered to add JIT
> support for the new expression evaluation steps.
I've split this patch into two parts.
Hi, I'll take a look.
0001 Adds planner support for ORDER BY aggregates.
/* Normal transition function without ORDER BY / DISTINCT. */
Is it possible to avoid entering the args initialization at all if 'argno >= pertrans->numTransInputs'?
Like this:
if (!pertrans->aggsortrequired && argno < pertrans->numTransInputs)
And if argno is '>' than pertrans->numTransInputs?
Shouldn't the test be inside the loop?
+ /*
+ * Don't initialize args for any ORDER BY clause that might
+ * exist in a presorted aggregate.
+ */
+ if (argno >= pertrans->numTransInputs)
+ break;
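For what it's worth, the hunk above behaves roughly like this sketch, where the in-loop test stops argument initialization once argno reaches numTransInputs, so trailing ORDER BY columns are never initialized (an illustrative Python model with invented names, not the C code):

```python
def init_args(args, num_trans_inputs):
    """Initialize only the first num_trans_inputs arguments; any
    remaining ones belong to the aggregate's ORDER BY and are skipped."""
    initialized = []
    for argno, arg in enumerate(args):
        if argno >= num_trans_inputs:
            break            # ORDER BY columns: don't initialize
        initialized.append(arg)
    return initialized

assert init_args(["a", "b", "c"], 2) == ["a", "b"]
assert init_args(["a"], 0) == []
```

Hoisting the comparison in front of the loop would only skip the loop-entry overhead when no arguments qualify; the in-loop break is still what handles the mixed case.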
I think you can either reduce the scope of the variable 'sortlist' or simply remove it?
a)
+ /* Determine pathkeys for aggregate functions with an ORDER BY */
+ if (parse->groupingSets == NIL && root->numOrderedAggs > 0 &&
+ (qp_extra->groupClause == NIL || root->group_pathkeys))
+ {
+ ListCell *lc;
+ List *pathkeys = NIL;
+
+ foreach(lc, root->agginfos)
+ {
+ AggInfo *agginfo = (AggInfo *) lfirst(lc);
+ Aggref *aggref = agginfo->representative_aggref;
+ List *sortlist;
+
or better
b)
+ /* Determine pathkeys for aggregate functions with an ORDER BY */
+ if (parse->groupingSets == NIL && root->numOrderedAggs > 0 &&
+ (qp_extra->groupClause == NIL || root->group_pathkeys))
+ {
+ ListCell *lc;
+ List *pathkeys = NIL;
+
+ foreach(lc, root->agginfos)
+ {
+ AggInfo *agginfo = (AggInfo *) lfirst(lc);
+ Aggref *aggref = agginfo->representative_aggref;
+
+ if (AGGKIND_IS_ORDERED_SET(aggref->aggkind))
+ continue;
+
+ /* DISTINCT aggregates not yet supported by the planner */
+ if (aggref->aggdistinct != NIL)
+ continue;
+
+ if (aggref->aggorder == NIL)
+ continue;
+
+ /*
+ * Find the pathkeys with the most sorted derivative of the first
+ * Aggref. For example, if we determine the pathkeys for the first
+ * Aggref to be {a}, and we find another with {a,b}, then we use
+ * {a,b} since it's useful for more Aggrefs than just {a}. We
+ * currently ignore anything that might have a longer list of
+ * pathkeys than the first Aggref if it is not contained in the
+ * pathkeys for the first agg. We can't practically plan for all
+ * orders of each Aggref, so this seems like the best compromise.
+ */
+ if (pathkeys == NIL)
+ {
+ pathkeys = make_pathkeys_for_sortclauses(root, aggref->aggorder ,
+ aggref->args);
+ aggref->aggpresorted = true;
+ }
+ else
+ {
+ List *pathkeys2 = make_pathkeys_for_sortclauses(root,
+ aggref->aggorder,
+ aggref->args);
> 0002 is a WIP patch for DISTINCT support. This still lacks JIT
> support and I'm still not certain of the best place to store the
> previous value or tuple to determine if the current one is distinct
> from it.
In the patch 0002, I think we can reduce the scope of the variable 'aggstate'?
+ EEO_CASE(EEOP_AGG_PRESORTED_DISTINCT_SINGLE)
+ {
+ AggStatePerTrans pertrans = op->d.agg_presorted_distinctcheck.pertrans;
+ Datum value = pertrans->transfn_fcinfo->args[1].value;
+ bool isnull = pertrans->transfn_fcinfo->args[1].isnull;
+
+ if (!pertrans->haslast ||
+ pertrans->lastisnull != isnull ||
+ !DatumGetBool(FunctionCall2Coll(&pertrans->equalfnOne,
+ pertrans->aggCollation,
+ pertrans->lastdatum, value)))
+ {
+ if (pertrans->haslast && !pertrans->inputtypeByVal)
+ pfree(DatumGetPointer(pertrans->lastdatum));
+
+ pertrans->haslast = true;
+ if (!isnull)
+ {
+ AggState *aggstate = castNode(AggState, state->parent);
+
+ /*
+ * XXX is it worth having a dedicated ByVal version of this
+ * operation so that we can skip switching memory contexts
+ * and do a simple assign rather than datumCopy below?
+ */
+ MemoryContext oldContext;
+
+ oldContext = MemoryContextSwitchTo(aggstate->curaggcontext->ecxt_per_tuple_memory);
regards,
Ranier Vilela
Thanks for having a look at this.

On Tue, 13 Jul 2021 at 11:04, Ranier Vilela <ranier.vf@gmail.com> wrote:
>> 0001 Adds planner support for ORDER BY aggregates.
>
> /* Normal transition function without ORDER BY / DISTINCT. */
> Is it possible to avoid entering to initialize args if 'argno >= pertrans->numTransInputs'?
> Like this:
>
> if (!pertrans->aggsortrequired && argno < pertrans->numTransInputs)
>
> And what if argno is greater than pertrans->numTransInputs?
> Shouldn't the test be inside the loop?
>
> + /*
> + * Don't initialize args for any ORDER BY clause that might
> + * exist in a presorted aggregate.
> + */
> + if (argno >= pertrans->numTransInputs)
> + break;

The idea is to stop the loop before processing any Aggref arguments that might belong to the ORDER BY clause. We must still process the other arguments up to the ORDER BY args though, so we can't skip this loop. Note that we're doing argno++ inside the loop. If we had a for_each_to() macro, similar to for_each_from() but allowing us to specify an end element, then we could use that instead; but we don't, and we still must initialize the transition arguments.

> I think we can either reduce the scope of the variable 'sortlist' or simply remove it?

I've adjusted the scope of this. I didn't want to remove it because it's kinda useful to have it that way, otherwise the 0002 patch would need to add it.

>> 0002 is a WIP patch for DISTINCT support. This still lacks JIT
>> support and I'm still not certain of the best place to store the
>> previous value or tuple to determine if the current one is distinct
>> from it.

> In the patch 0002, I think we can reduce the scope of the variable 'aggstate'?
>
> + EEO_CASE(EEOP_AGG_PRESORTED_DISTINCT_SINGLE)

Yeah, that could be done.

I've attached the updated patches.

David
Attachment
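David's explanation — argno always starts at 0 and the loop must initialize the direct arguments, stopping before any ORDER BY columns — can be mimicked with a plain loop. This is a hypothetical Python analogue of the in-loop break, not the patch's C code; names are invented for the example:

```python
def init_trans_args(all_args, num_trans_inputs):
    """Initialize only the first num_trans_inputs arguments, mirroring
    the in-loop 'if (argno >= pertrans->numTransInputs) break;' test.
    argno starts at 0, so the bound check belongs inside the loop."""
    initialized = []
    argno = 0
    for arg in all_args:
        # Stop before any args that belong to the aggregate's ORDER BY
        # clause; they need no initialization for a presorted Aggref.
        if argno >= num_trans_inputs:
            break
        initialized.append(arg)
        argno += 1
    return initialized

# agg(a, b ORDER BY c): two transition inputs, one ORDER BY column
print(init_trans_args(['a', 'b', 'c'], 2))  # → ['a', 'b']
```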
On Tue, 13 Jul 2021 at 01:44, David Rowley <dgrowleyml@gmail.com> wrote:
Thanks for having a look at this.
On Tue, 13 Jul 2021 at 11:04, Ranier Vilela <ranier.vf@gmail.com> wrote:
>> 0001 Adds planner support for ORDER BY aggregates.
>
> /* Normal transition function without ORDER BY / DISTINCT. */
> Is it possible to avoid entering to initialize args if 'argno >= pertrans->numTransInputs'?
> Like this:
>
> if (!pertrans->aggsortrequired && argno < pertrans->numTransInputs)
>
> And if argos is '>' that pertrans->numTransInputs?
> The test shouldn't be, inside the loop?
>
> + /*
> + * Don't initialize args for any ORDER BY clause that might
> + * exist in a presorted aggregate.
> + */
> + if (argno >= pertrans->numTransInputs)
> + break;
> The idea is to stop the loop before processing any Aggref arguments
> that might belong to the ORDER BY clause.
Yes, I get the idea.
> We must still process other
> arguments up to the ORDER BY args though,
I may have misunderstood, but the other arguments are under pertrans->numTransInputs?
> so we can't skip this loop.
The question not answered is whether *argno* can be '>=' pertrans->numTransInputs
before entering the loop?
If it *can*, the loop might be useless in that case.
> Note that we're doing argno++ inside the loop.
Another question is, if *argno* can be '>' pertrans->numTransInputs
before the loop, will this test fail?
if (argno == pertrans->numTransInputs)
> I think that or can reduce the scope of variable 'sortlist' or simply remove it?
> I've adjusted the scope of this. I didn't want to remove it because
> it's kinda useful to have it that way otherwise the 0002 patch would
> need to add it.
Nice.
>> 0002 is a WIP patch for DISTINCT support. This still lacks JIT
>> support and I'm still not certain of the best where to store the
>> previous value or tuple to determine if the current one is distinct
>> from it.
>
> In the patch 0002, I think that can reduce the scope of variable 'aggstate'?
>
> + EEO_CASE(EEOP_AGG_PRESORTED_DISTINCT_SINGLE)
> Yeah, that could be done.
> I've attached the updated patches.
Thanks.
regards,
Ranier Vilela
On Tue, 13 Jul 2021 at 23:45, Ranier Vilela <ranier.vf@gmail.com> wrote:
> The question not answered is whether *argno* can be '>=' pertrans->numTransInputs
> before entering the loop?
> If it *can*, the loop might be useless in that case.
>
>> Note that we're doing argno++ inside the loop.
>
> Another question is, if *argno* can be '>' pertrans->numTransInputs
> before the loop, will this test fail?
> if (argno == pertrans->numTransInputs)

argno is *always* 0 before the loop starts.

David
On Tue, 13 Jul 2021 at 22:15, David Rowley <dgrowleyml@gmail.com> wrote:
On Tue, 13 Jul 2021 at 23:45, Ranier Vilela <ranier.vf@gmail.com> wrote:
> The question not answered is whether *argno* can be '>=' pertrans->numTransInputs
> before entering the loop?
> If it *can*, the loop might be useless in that case.
>
>>
>>
>> Note that we're doing argno++ inside the loop.
>
> Another question is, if *argno* can be '>' pertrans->numTransInputs
> before the loop, will this test fail?
> if (argno == pertrans->numTransInputs)
> argno is *always* 0 before the loop starts.
Good. Thanks.
regards,
Ranier Vilela
On Tuesday, 13 July 2021, 06:44:12 CEST, David Rowley wrote:
> I've attached the updated patches.

The approach of building a pathkey for the first ORDER BY we find, then appending to it as needed, seems sensible, but I'm a bit worried about users starting to rely on this as an optimization. Even if we don't document it, people may start to change the order of their target lists to "force" a specific sort on the lower nodes. How confident are we that we won't change this, or that we will be willing to break it?

Generating all possible pathkeys and costing the resulting plans would be too expensive, but maybe a more "stable" (and limited) approach would be fine, like generating the pathkeys only if every ordered aggref shares the same prefix. I don't think there would be any ambiguity here.

--
Ronan Dunklau
On Fri, 16 Jul 2021 at 01:02, Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
> The approach of building a pathkey for the first order by we find, then
> appending to it as needed seems sensible but I'm a bit worried about users
> starting to rely on this as an optimization. Even if we don't document it,
> people may start to change the order of their target lists to "force" a
> specific sort on the lower nodes. How confident are we that we won't change
> this or that we will be willing to break it?

That's a good question. I mainly did it that way because windowing functions work similarly based on the position of items in the targetlist. The situation there is slightly more complex as it depends on the SortGroupClause->tleSortGroupRef.

> Generating all possible pathkeys and costing the resulting plans would be too
> expensive, but maybe a more "stable" (and limited) approach would be fine, like
> generating the pathkeys only if every ordered aggref shares the same prefix. I
> don't think there would be any ambiguity here.

I think that's a bad idea as it would leave a lot on the table. I don't see any reason to make it that restrictive. Remember that before this, every Aggref with a sort clause must perform its own sort, so it's not like we'll ever increase the number of sorts here as a result.

What we maybe could consider instead would be to pick the first Aggref, then look for the most sorted derivative of that, then tally up the number of Aggrefs that can be sorted using those pathkeys, then repeat that process for the remaining Aggrefs that didn't have the same prefix, and finally use the pathkeys for the set with the most Aggrefs. We could still tiebreak on the targetlist position so at least it's not random which ones we pick. Now that we have a list of Aggrefs that are deduplicated in the planner thanks to 0a2bc5d61e, it should be fairly easy to do that.

David
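The strategy described above can be sketched as follows. This is a simplified illustrative model, not the planner code: plain column-name lists stand in for pathkeys, which in reality also carry sort direction and opfamily information, and the function name is invented for the example:

```python
def choose_agg_sort_order(aggref_orders):
    """Pick a sort order usable by the most aggregates.  Each element
    of aggref_orders is one Aggref's ORDER BY column list, given in
    targetlist order.  For each Aggref, extend its order list to the
    most sorted derivative found among the other Aggrefs, count how
    many Aggrefs each candidate satisfies, and (via the strict '>')
    tiebreak in favour of the earlier targetlist position."""
    def is_prefix(short, longer):
        return longer[:len(short)] == short

    best, best_count = None, 0
    for candidate in aggref_orders:
        # The most sorted derivative: the longest order list that the
        # candidate is a prefix of (the candidate itself at minimum).
        derived = max((o for o in aggref_orders if is_prefix(candidate, o)),
                      key=len)
        count = sum(1 for o in aggref_orders if is_prefix(o, derived))
        if count > best_count:
            best, best_count = derived, count
    return best

# agg(x ORDER BY a), agg(x ORDER BY c), agg(x ORDER BY a, b):
# sorting on [a, b] presorts two of the three aggregates.
print(choose_agg_sort_order([['a'], ['c'], ['a', 'b']]))  # → ['a', 'b']
```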
On Fri, 16 Jul 2021 at 18:04, David Rowley <dgrowleyml@gmail.com> wrote:
> What we maybe could consider instead would be to pick the first Aggref
> then look for the most sorted derivative of that then tally up the
> number of Aggrefs that can be sorted using those pathkeys, then repeat
> that process for the remaining Aggrefs that didn't have the same
> prefix then use the pathkeys for the set with the most Aggrefs. We
> could still tiebreak on the targetlist position so at least it's not
> random which ones we pick. Now that we have a list of Aggrefs that are
> deduplicated in the planner thanks to 0a2bc5d61e it should be fairly
> easy to do that.

I've attached a patch which does as I mention above.

I'm still not sold on whether this is better than just going with the order of the first aggregate. The problem might be that a plan could change as new aggregates are added to the end of the target list. It feels like there might be a bit less control over it than the previous version. Remember that suiting more aggregates is not always better, as there might be an index that could provide presorted input for another set of aggregates which would overall reduce the number of sorts. However, maybe it's not too big an issue, as for any aggregates that are not presorted we're left doing 1 sort per Aggref, so reducing the number of those might be more important than selecting the order that has an index to support it.

I've left off the 0002 patch this time as I think the lack of JIT support for DISTINCT was causing the CF bot to fail. I'd quite like to confirm that theory.

David
Attachment
On Fri, 16 Jul 2021 at 22:00, David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Fri, 16 Jul 2021 at 18:04, David Rowley <dgrowleyml@gmail.com> wrote:
> > What we maybe could consider instead would be to pick the first Aggref
> > then look for the most sorted derivative of that then tally up the
> > number of Aggrefs that can be sorted using those pathkeys, then repeat
> > that process for the remaining Aggrefs that didn't have the same
> > prefix then use the pathkeys for the set with the most Aggrefs. We
> > could still tiebreak on the targetlist position so at least it's not
> > random which ones we pick. Now that we have a list of Aggrefs that are
> > deduplicated in the planner thanks to 0a2bc5d61e it should be fairly
> > easy to do that.
>
> I've attached a patch which does as I mention above.

Looks like I did a sloppy job of that. I messed up the condition in standard_qp_callback() that sets the ORDER BY aggregate pathkeys so that it accidentally set them when there was an unsortable GROUP BY clause, as highlighted by the postgres_fdw tests failing. I've now added a comment to explain why the condition is the way it is so that I don't forget again.

Here's a cleaned-up version that passes make check-world.

David
Attachment
> Looks like I did a sloppy job of that. I messed up the condition in
> standard_qp_callback() that sets the ORDER BY aggregate pathkeys so
> that it accidentally set them when there was an unsortable GROUP BY
> clause, as highlighted by the postgres_fdw tests failing. I've now
> added a comment to explain why the condition is the way it is so that
> I don't forget again.
>
> Here's a cleaned-up version that passes make check-world.

I've noticed that when the ORDER BY is a grouping key (which to be honest doesn't seem to make much sense to me), the sort key is duplicated, as demonstrated by one of the modified tests (partition_aggregate.sql). This leads to additional sort nodes being added when there is no necessity to do so. In the case of sort and index paths, the duplicate keys are not considered; I think the same should apply here.

It means the logic for appending the order by pathkeys to the existing group by pathkeys would ideally need to remove the redundant group by keys from the order by keys, considering this example:

regression=# explain select sum(unique1 order by ten, two), sum(unique1 order by four), sum(unique1 order by two, four) from tenk2 group by ten;
                               QUERY PLAN
------------------------------------------------------------------------
 GroupAggregate  (cost=1109.39..1234.49 rows=10 width=28)
   Group Key: ten
   ->  Sort  (cost=1109.39..1134.39 rows=10000 width=16)
         Sort Key: ten, ten, two
         ->  Seq Scan on tenk2  (cost=0.00..445.00 rows=10000 width=16)

We would ideally like to sort on ten, two, four to satisfy the first and last aggref at once. Stripping the common prefix (ten) would eliminate this problem.

Also, existing regression tests cover the first problem (order by a grouping key), but I feel like they should be extended with a case similar to the above to check which pathkeys are used in the "multiple ordered aggregates + group by" cases.

--
Ronan Dunklau
On Mon, 19 Jul 2021 at 18:32, Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
> It means the logic for appending the order by pathkeys to the existing group
> by pathkeys would ideally need to remove the redundant group by keys from the
> order by keys, considering this example:
>
> regression=# explain select sum(unique1 order by ten, two), sum(unique1 order
> by four), sum(unique1 order by two, four) from tenk2 group by ten;
>                                QUERY PLAN
> ------------------------------------------------------------------------
>  GroupAggregate  (cost=1109.39..1234.49 rows=10 width=28)
>    Group Key: ten
>    ->  Sort  (cost=1109.39..1134.39 rows=10000 width=16)
>          Sort Key: ten, ten, two
>          ->  Seq Scan on tenk2  (cost=0.00..445.00 rows=10000 width=16)
>
> We would ideally like to sort on ten, two, four to satisfy the first and last
> aggref at once. Stripping the common prefix (ten) would eliminate this problem.

hmm, yeah. That's not great. This comes from the way I'm doing list_concat on the pathkeys from the GROUP BY with the ones from the ordered aggregates. If it were possible to use make_pathkeys_for_sortclauses() to make these in one go, that would fix the problem, since pathkey_is_redundant() would skip the 2nd "ten". Unfortunately, it's not possible to pass the combined list of SortGroupClauses to make_pathkeys_for_sortclauses() since they're not from the same targetlist. Aggrefs have their own targetlist, and the SortGroupClauses for the Aggref reference that tlist.

I think to do this we'd need something like pathkeys_append() in pathkeys.c which had a loop and appended the pathkey only if pathkey_is_redundant() returns false.

> Also, existing regression tests cover the first problem (order by a grouping
> key) but I feel like they should be extended with a case similar as the above
> to check which pathkeys are used in the "multiple ordered aggregates + group
> by" cases.
It does seem like a bit of a weird case to go to a lot of effort to make work, but it would be nice if it did work without having to contort the code too much. David
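The pathkeys_append() idea sketched above — append a pathkey only if a pathkey_is_redundant()-style check fails — can be modelled simply. In reality the redundancy test also covers equivalence classes and constants; this hypothetical Python version only catches the duplicate-key case from the "ten, ten, two" example, and its names are invented for the illustration:

```python
def pathkeys_append(pathkeys, extra):
    """Append keys from 'extra' that aren't already in 'pathkeys',
    mimicking a pathkeys_append() helper that runs each candidate
    through a pathkey_is_redundant()-style check before appending."""
    result = list(pathkeys)
    for key in extra:
        # A key already in the list is redundant: rows are already
        # constant in that column within each group of equal keys.
        if key not in result:
            result.append(key)
    return result

# GROUP BY ten, then agg(... ORDER BY ten, two): avoids "ten, ten, two"
print(pathkeys_append(['ten'], ['ten', 'two']))  # → ['ten', 'two']
```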
On Mon, 19 Jul 2021 at 18:32, Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
> regression=# explain select sum(unique1 order by ten, two), sum(unique1 order
> by four), sum(unique1 order by two, four) from tenk2 group by ten;
>                                QUERY PLAN
> ------------------------------------------------------------------------
>  GroupAggregate  (cost=1109.39..1234.49 rows=10 width=28)
>    Group Key: ten
>    ->  Sort  (cost=1109.39..1134.39 rows=10000 width=16)
>          Sort Key: ten, ten, two
>          ->  Seq Scan on tenk2  (cost=0.00..445.00 rows=10000 width=16)
>
> We would ideally like to sort on ten, two, four to satisfy the first and last
> aggref at once. Stripping the common prefix (ten) would eliminate this problem.

Thanks for finding this. I've made a few changes to make this case work as you describe. Please see attached v6 patches.

Because I had to add additional code to take the GROUP BY pathkeys into account when choosing the best ORDER BY agg pathkeys, the function that does that became a little bigger. To try to reduce the complexity of it, I got rid of all the processing from the initial loop and instead it now uses the logic from the 2nd loop to find the best pathkeys. The reason I'd not done that in the first place was because I'd thought I could get away without building an additional Bitmapset for simple cases, but that's probably fairly cheap compared to building Pathkeys. With the additional complexity for the GROUP BY pathkeys, the extra code seemed not worth it.

The 0001 patch is the ORDER BY aggregate code. 0002 is to fix up some broken regression tests in postgres_fdw that 0001 caused. It appears that 0001 uncovered a bug in the postgres_fdw code. I've reported that in [1]. If that turns out to be a bug then it'll need to be fixed before this can progress.

David

[1] https://www.postgresql.org/message-id/CAApHDvr4OeC2DBVY--zVP83-K=bYrTD7F8SZDhN4g+pj2f2S-A@mail.gmail.com
Attachment
On Wednesday, 21 July 2021, 04:52:43 CEST, David Rowley wrote:
> Thanks for finding this. I've made a few changes to make this case
> work as you describe. Please see attached v6 patches.
>
> Because I had to add additional code to take the GROUP BY pathkeys
> into account when choosing the best ORDER BY agg pathkeys, the
> function that does that became a little bigger. To try to reduce the
> complexity of it, I got rid of all the processing from the initial
> loop and instead it now uses the logic from the 2nd loop to find the
> best pathkeys. The reason I'd not done that in the first place was
> because I'd thought I could get away without building an additional
> Bitmapset for simple cases, but that's probably fairly cheap compared
> to building Pathkeys. With the additional complexity for the GROUP
> BY pathkeys, the extra code seemed not worth it.
>
> The 0001 patch is the ORDER BY aggregate code. 0002 is to fix up some
> broken regression tests in postgres_fdw that 0001 caused. It appears
> that 0001 uncovered a bug in the postgres_fdw code. I've reported
> that in [1]. If that turns out to be a bug then it'll need to be fixed
> before this can progress.

I tested the 0001 patch against both HEAD and my proposed bugfix for postgres_fdw.

There is a problem that the ordered aggregate is not pushed down anymore. The underlying Sort node is correctly pushed down, though.

This comes from the fact that the postgres_fdw grouping path never contains any pathkey. Since the cost is fuzzily the same between the pushed-down aggregate and the locally performed one, the tie is broken against the pathkeys. Ideally we would add the group pathkeys to the grouping path, but this would add an additional ORDER BY expression matching the GROUP BY. Moreover, some triaging of the pathkeys would be necessary since we now mix the sort-in-aggref pathkeys with the group_pathkeys.

--
Ronan Dunklau
On Thu, 22 Jul 2021 at 02:01, Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
> I tested the 0001 patch against both HEAD and my proposed bugfix for
> postgres_fdw.
>
> There is a problem that the ordered aggregate is not pushed down anymore. The
> underlying Sort node is correctly pushed down though.
>
> This comes from the fact that postgres_fdw grouping path never contains any
> pathkey. Since the cost is fuzzily the same between the pushed-down aggregate
> and the locally performed one, the tie is broken against the pathkeys.

I think this might be more down to a lack of any penalty cost for fetching foreign tuples. Looking at create_foreignscan_path(), I don't see anything that adds anything to account for fetching the tuples from the foreign server. If there was something like that then there'd be more of a preference to perform the remote aggregation so that fewer rows must arrive from the remote server.

I tested by adding: total_cost += cpu_tuple_cost * rows * 100; in create_foreignscan_path() and I got the plan with the remote aggregation. That's a fairly large penalty of 1.0 per row, much bigger than parallel_tuple_cost's default value.

I'm a bit undecided on how much this patch needs to get involved in adjusting foreign scan costs. The problem is that we've given the executor a new path to consider and nobody has done any proper costings for the foreign scan so that it properly prefers paths that have to pull fewer foreign tuples. This is a pretty similar problem to what parallel_tuple_cost aims to fix. Also similar to how we had to add APPEND_CPU_COST_MULTIPLIER to have partition-wise aggregates prefer grouping at the partition level rather than at the partitioned table level.

> Ideally we would add the group pathkeys to the grouping path, but this would
> add an additional ORDER BY expression matching the GROUP BY. Moreover, some
> triaging of the pathkeys would be necessary since we now mix the
> sort-in-aggref pathkeys with the group_pathkeys.

I think you're talking about passing pathkeys into create_foreign_upper_path in add_foreign_grouping_paths. If so, I don't really see how it would be safe to add pathkeys to the foreign grouping path. What if the foreign server did a Hash Aggregate? The rows might come back in any random order.

I kinda think that to fix this properly would need a new foreign server option such as foreign_tuple_cost. I'd feel better about something like that if some of the people with a vested interest in the FDW code were watching more closely. So far we've not managed to entice any of them with the bug report yet, but it's maybe early days yet.

David
On Thursday, 22 July 2021, 09:38:50 CEST, David Rowley wrote:
> On Thu, 22 Jul 2021 at 02:01, Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
> > I tested the 0001 patch against both HEAD and my proposed bugfix for
> > postgres_fdw.
> >
> > There is a problem that the ordered aggregate is not pushed down anymore.
> > The underlying Sort node is correctly pushed down though.
> >
> > This comes from the fact that postgres_fdw grouping path never contains
> > any pathkey. Since the cost is fuzzily the same between the pushed-down
> > aggregate and the locally performed one, the tie is broken against the
> > pathkeys.
>
> I think this might be more down to a lack of any penalty cost for
> fetching foreign tuples. Looking at create_foreignscan_path(), I don't
> see anything that adds anything to account for fetching the tuples
> from the foreign server. If there was something like that then there'd
> be more of a preference to perform the remote aggregation so that
> fewer rows must arrive from the remote server.
>
> I tested by adding: total_cost += cpu_tuple_cost * rows * 100; in
> create_foreignscan_path() and I got the plan with the remote
> aggregation. That's a fairly large penalty of 1.0 per row. Much bigger
> than parallel_tuple_cost's default value.
>
> I'm a bit undecided on how much this patch needs to get involved in
> adjusting foreign scan costs. The problem is that we've given the
> executor a new path to consider and nobody has done any proper
> costings for the foreign scan so that it properly prefers paths that
> have to pull fewer foreign tuples. This is a pretty similar problem
> to what parallel_tuple_cost aims to fix. Also similar to how we had to
> add APPEND_CPU_COST_MULTIPLIER to have partition-wise aggregates
> prefer grouping at the partition level rather than at the partitioned
> table level.
>
> > Ideally we would add the group pathkeys to the grouping path, but this
> > would add an additional ORDER BY expression matching the GROUP BY.
> > Moreover, some triaging of the pathkeys would be necessary since we now
> > mix the sort-in-aggref pathkeys with the group_pathkeys.
>
> I think you're talking about passing pathkeys into
> create_foreign_upper_path in add_foreign_grouping_paths. If so, I
> don't really see how it would be safe to add pathkeys to the foreign
> grouping path. What if the foreign server did a Hash Aggregate? The
> rows might come back in any random order.

Yes, I was suggesting to add a new path with the pathkeys factored in, which if chosen over the non-ordered path would result in an additional ORDER BY clause to prevent a HashAggregate. But that doesn't seem a good idea after all.

> I kinda think that to fix this properly would need a new foreign
> server option such as foreign_tuple_cost. I'd feel better about
> something like that with some of the people with a vested interest in
> the FDW code were watching more closely. So far we've not managed to
> entice any of them with the bug report yet, but it's maybe early days
> yet.

We already have that in the form of fdw_tuple_cost as a server option, if I'm not mistaken? It works as expected when the number of tuples is notably reduced by the foreign group by.

The problem arises when the cardinality of the groups is equal to the input's cardinality. I think even in that case we should try to use a remote aggregate since it's a computation that will not happen on the local server. I also think we're more likely to have up-to-date statistics remotely than the ones collected locally on the foreign tables, and the estimated number of groups would be more accurate on the remote side than the local one.

--
Ronan Dunklau
On Thursday, 22 July 2021, 10:42:49 CET, Ronan Dunklau wrote:
> On Thursday, 22 July 2021, 09:38:50 CEST, David Rowley wrote:
> > On Thu, 22 Jul 2021 at 02:01, Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
> > > I tested the 0001 patch against both HEAD and my proposed bugfix for
> > > postgres_fdw.
> > >
> > > There is a problem that the ordered aggregate is not pushed down
> > > anymore. The underlying Sort node is correctly pushed down though.
> > >
> > > This comes from the fact that postgres_fdw grouping path never contains
> > > any pathkey. Since the cost is fuzzily the same between the pushed-down
> > > aggregate and the locally performed one, the tie is broken against the
> > > pathkeys.
> >
> > I think this might be more down to a lack of any penalty cost for
> > fetching foreign tuples. Looking at create_foreignscan_path(), I don't
> > see anything that adds anything to account for fetching the tuples
> > from the foreign server. If there was something like that then there'd
> > be more of a preference to perform the remote aggregation so that
> > fewer rows must arrive from the remote server.
> >
> > I tested by adding: total_cost += cpu_tuple_cost * rows * 100; in
> > create_foreignscan_path() and I got the plan with the remote
> > aggregation. That's a fairly large penalty of 1.0 per row. Much bigger
> > than parallel_tuple_cost's default value.
> >
> > I'm a bit undecided on how much this patch needs to get involved in
> > adjusting foreign scan costs. The problem is that we've given the
> > executor a new path to consider and nobody has done any proper
> > costings for the foreign scan so that it properly prefers paths that
> > have to pull fewer foreign tuples. This is a pretty similar problem
> > to what parallel_tuple_cost aims to fix. Also similar to how we had to
> > add APPEND_CPU_COST_MULTIPLIER to have partition-wise aggregates
> > prefer grouping at the partition level rather than at the partitioned
> > table level.
> >
> > > Ideally we would add the group pathkeys to the grouping path, but this
> > > would add an additional ORDER BY expression matching the GROUP BY.
> > > Moreover, some triaging of the pathkeys would be necessary since we now
> > > mix the sort-in-aggref pathkeys with the group_pathkeys.
> >
> > I think you're talking about passing pathkeys into
> > create_foreign_upper_path in add_foreign_grouping_paths. If so, I
> > don't really see how it would be safe to add pathkeys to the foreign
> > grouping path. What if the foreign server did a Hash Aggregate? The
> > rows might come back in any random order.
>
> Yes, I was suggesting to add a new path with the pathkeys factored in, which
> if chosen over the non-ordered path would result in an additional ORDER BY
> clause to prevent a HashAggregate. But that doesn't seem a good idea after
> all.
>
> > I kinda think that to fix this properly would need a new foreign
> > server option such as foreign_tuple_cost. I'd feel better about
> > something like that with some of the people with a vested interest in
> > the FDW code were watching more closely. So far we've not managed to
> > entice any of them with the bug report yet, but it's maybe early days
> > yet.
>
> We already have that in the form of fdw_tuple_cost as a server option if I'm
> not mistaken? It works as expected when the number of tuples is notably
> reduced by the foreign group by.
>
> The problem arises when the cardinality of the groups is equal to the input's
> cardinality. I think even in that case we should try to use a remote
> aggregate since it's a computation that will not happen on the local
> server. I also think we're more likely to have up-to-date statistics
> remotely than the ones collected locally on the foreign tables, and the
> estimated number of groups would be more accurate on the remote side than
> the local one.

I took some time to toy with this again.

At first I thought that charging a discount in foreign grouping paths for Aggref targets (since they are computed remotely) would be a good idea, similar to what is done for the grouping keys. But in the end, it's probably not something we would like to do. Yes, the group planning will generally be more accurate on the remote side (better statistics than locally for estimating the number of groups), but executing the grouping locally when the number of groups is close to the input's cardinality (e.g. group by unique_key) gives us a form of parallelism which can actually perform better.

For the other cases where there are fewer output than input tuples, that is, when an actual grouping takes place, adjusting fdw_tuple_cost might be enough to tune the behavior to what is desirable.

--
Ronan Dunklau
This patch is now failing in the sqljson regression test. It looks like it's just the ordering of the elements in json_arrayagg() calls, which may actually be a faulty test that needs more ORDER BY clauses rather than any issue with the code. Nonetheless, it's something that needs to be addressed before this patch can be applied.

Given that it's had some feedback from Ronan and this regression test failure, I'll move it to Waiting on Author, but we're near the end of the CF and it'll probably be moved forward soon.

-- 
greg
On Thu, 31 Mar 2022 at 06:36, Greg Stark <stark@mit.edu> wrote:
> This patch is now failing in the sqljson regression test. It looks
> like it's just the ordering of the elements in json_arrayagg() calls
> which may actually be a faulty test that needs more ORDER BY clauses
> rather than any issues with the code.

Thanks for mentioning this and for keeping tabs on it. This patch in general is more than there's realistic time for in this CF. I'd very much like to get the DISTINCT part working too, not just the ORDER BY. I've pushed this one out to July's CF now.

David
On Thu, 4 Nov 2021 at 20:59, Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
> I took some time to toy with this again.
>
> At first I thought that charging a discount in foreign grouping paths for
> Aggref targets (since they are computed remotely) would be a good idea,
> similar to what is done for the grouping keys.

I've been working on this patch again. There was a bit of work to do to rebase it atop db0d67db2. The problem there was that since this patch appends pathkeys to suit ORDER BY / DISTINCT aggregates to the query's group_pathkeys, db0d67db2 goes and tries to rearrange those, but fails to find the SortGroupClause corresponding to the PathKey in group_pathkeys. I wish the code I came up with to make that work was a bit nicer, but what's there at least seems to work. There are a few more List copies being made than I'd like.

I've also gone and added LLVM support to make JIT work with the new DISTINCT expression evaluation step types.

Also, James mentioned in [1] the Merge Join plan change that this patch was causing in an existing test. I looked into that and found the cause. The plan change is not really the fault of this patch, so I've proposed a fix to make that work more efficiently in [2]. The basics there are that select_outer_pathkeys_for_merge() pre-dates Incremental Sort and didn't consider prefixes of the query_pathkeys after matching all the join quals. The patch on that thread relaxes that rule and makes this patch produce an Incremental Sort plan for the query in question.

Another annoying part of this patch is that I've added an "aggpresorted" field to Aggref, which the planner sets. That's a parse node type and it would be nicer not to have the planner mess around with those. We could maybe wrap the Aggrefs up in some planner struct and pass those to the executor instead, but that would require a bit more churn than what I've got in the attached.

I've attached the v7 patch.
David

[1] https://www.postgresql.org/message-id/CAAaqYe-yxXkXVPJkRw1nDA=CJBw28jvhACRyDcU10dAOcdSj=Q@mail.gmail.com
[2] https://www.postgresql.org/message-id/CAApHDvrtZu0PHVfDPFM4Yx3jNR2Wuwosv+T2zqa7LrhhBr2rRg@mail.gmail.com
On Thu, 4 Nov 2021 at 20:59, Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
> For the other cases, where there are fewer output tuples than input
> tuples, that is, when an actual grouping takes place, adjusting
> fdw_tuple_cost might be enough to tune the behavior to what is desirable.

I've now looked into this issue.

With the patched code, the remote aggregate path loses out in add_path() due to the fact that the local aggregate path compares fuzzily the same as the remote aggregate path. Since the local aggregate path is now fetching the rows from the foreign server with a SQL query containing an ORDER BY clause (per my change to query_pathkeys being picked up in get_useful_pathkeys_for_relation()), add_path now prefers that path due to it having pathkeys and the remote aggregate query not having any (PATHKEYS_BETTER2).

It seems what's going on is quite simply that the default fdw_tuple_cost is unrealistically low. Let's look:

#define DEFAULT_FDW_TUPLE_COST 0.01

which is even lower than DEFAULT_PARALLEL_TUPLE_COST (0.1) and the same as cpu_tuple_cost!

After some debugging, I see add_path() switches to the, seemingly better, remote aggregate plan again if I multiply fdw_tuple_cost by 28. Anything below that sticks to the (inferior) local aggregate plan.
There's also another problem going on that would make that situation better. The query planner expects the following query to produce 6 rows:

SELECT array_agg("C 1" ORDER BY "C 1" USING OPERATOR(public.<^) NULLS LAST), c2
FROM "S 1"."T 1"
WHERE (("C 1" < 100)) AND ((c2 = 6))
GROUP BY c2;

You might expect the planner to think there'd be just 1 row due to the "c2 = 6" and "GROUP BY c2", but it thinks there will be more than that. If estimate_num_groups() knew about EquivalenceClasses and checked ec_has_const, then it might be able to do better, but it doesn't, so:

GroupAggregate (cost=11.67..11.82 rows=6 width=36)

If I force that estimate to be 1 row instead of 6, then I only need fdw_tuple_cost to be 12 times the default to get it to switch to the remote aggregate plan.

I think we should likely just patch master and change DEFAULT_FDW_TUPLE_COST to at the very least 0.2, which is 20x higher than today's setting. I'd be open to a much higher setting such as 0.5 (50x).

David
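(To make the arithmetic above concrete, here is a deliberately crude model of the trade-off in Python. This is not postgres_fdw's actual cost code: only the 0.01 default is real, and the per-row aggregation constant is an invented stand-in, chosen just to show why the per-tuple transfer cost is what separates the two plans.)

```python
DEFAULT_FDW_TUPLE_COST = 0.01  # postgres_fdw's default per-tuple transfer cost

def local_agg_cost(input_rows, fdw_tuple_cost, agg_cost_per_row=0.01):
    """Fetch every input row from the remote server, then aggregate locally."""
    return input_rows * fdw_tuple_cost + input_rows * agg_cost_per_row

def remote_agg_cost(input_rows, output_groups, fdw_tuple_cost,
                    agg_cost_per_row=0.01):
    """Aggregate on the remote server; only one row per group crosses the wire."""
    return input_rows * agg_cost_per_row + output_groups * fdw_tuple_cost

# When the GROUP BY does not reduce the row count, the transfer term is the
# same on both sides and the two estimates tie exactly; add_path() then falls
# back to tie-breaks such as pathkeys, which is the situation described above.
assert local_agg_cost(100, DEFAULT_FDW_TUPLE_COST) == \
       remote_agg_cost(100, 100, DEFAULT_FDW_TUPLE_COST)

# Otherwise the gap is (input_rows - output_groups) * fdw_tuple_cost, so a
# 20x larger default widens the remote plan's advantage by the same factor.
```

In this toy model the remote plan's edge scales linearly with fdw_tuple_cost, which is the intuition behind bumping DEFAULT_FDW_TUPLE_COST well above cpu_tuple_cost.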
On Wed, Jul 20, 2022 at 1:27 PM David Rowley <dgrowleyml@gmail.com> wrote:
> I've been working on this patch again. There was a bit of work to do
> to rebase it atop db0d67db2. The problem there was that since this
> patch appends pathkeys to suit ORDER BY / DISTINCT aggregates to the
> query's group_pathkeys, db0d67db2 goes and tries to rearrange those,
> but fails to find the SortGroupClause corresponding to the PathKey in
> group_pathkeys.
record we've found pathkeys for an aggregate, because 'currpathkeys' may
include pathkeys for some latter aggregates. I can see this problem with
the query below:
select max(b order by b), max(a order by a) from t group by a;
When processing the first aggregate, we compose the 'currpathkeys' as
{a, b} and mark this aggregate in 'aggindexes'. When it comes to the
second aggregate, we compose its pathkeys as {a} and decide that it is
not stronger than 'currpathkeys'. So the second aggregate is not
recorded in 'aggindexes'. As a result, we fail to mark aggpresorted for
the second aggregate.
Thanks
Richard
On Fri, 22 Jul 2022 at 21:33, Richard Guo <guofenglinux@gmail.com> wrote:
> I can see this problem with the query below:
>
> select max(b order by b), max(a order by a) from t group by a;
>
> When processing the first aggregate, we compose the 'currpathkeys' as
> {a, b} and mark this aggregate in 'aggindexes'. When it comes to the
> second aggregate, we compose its pathkeys as {a} and decide that it is
> not stronger than 'currpathkeys'. So the second aggregate is not
> recorded in 'aggindexes'. As a result, we fail to mark aggpresorted for
> the second aggregate.

Yeah, you're right. I have a missing check to see if currpathkeys are better than the pathkeys for the current aggregate. In your example case we'd have still processed the 2nd aggregate the old way instead of realising we could take the new pre-sorted path for faster processing.

I've adjusted that in the attached to make it properly include the case where currpathkeys are better.

Thanks for the review.

David
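(The scheme being discussed here, including the missing check just fixed, can be modelled in a few lines. The Python below is only an illustrative sketch, not the planner's actual code: pathkey lists are reduced to plain tuples of column names, and the function names are invented.)

```python
def is_prefix(shorter, longer):
    """True if 'shorter' is a leading prefix of 'longer'."""
    return len(shorter) <= len(longer) and longer[:len(shorter)] == shorter

def choose_agg_sort_order(agg_orderings):
    """Pick one sort order for the input and report which aggregates it covers.

    agg_orderings holds each aggregate's required ordering (group keys
    first), e.g. for "select max(b order by b), max(a order by a) ...
    group by a" the requirements are ('a', 'b') and ('a',).

    Mirrors the approach from the thread: adopt the first aggregate's
    ordering, let any later aggregate with a strictly stronger requirement
    extend it, and -- the fix discussed above -- also mark an aggregate as
    presorted when the chosen order is already stronger than its own.
    """
    currpathkeys = ()
    presorted = set()
    for i, pathkeys in enumerate(agg_orderings):
        if is_prefix(currpathkeys, pathkeys):
            currpathkeys = pathkeys   # first, or strictly stronger, requirement
            presorted.add(i)
        elif is_prefix(pathkeys, currpathkeys):
            presorted.add(i)          # already covered: the check that was missing
    return currpathkeys, presorted
```

With this, both aggregates in Richard's example are marked presorted under the single order (a, b), while agg(a ORDER BY a) together with agg2(a ORDER BY c) still leaves only the first one covered.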
On Mon, Jul 25, 2022 at 4:39 PM David Rowley <dgrowleyml@gmail.com> wrote:
> I've adjusted that in the attached to make it properly include the
> case where currpathkeys are better.
Hi,
"sort order the the planner chooses is simply": there is a duplicate `the`.
+ /* mark this aggregate is covered by 'currpathkeys' */
is covered by -> as covered by
Cheers
On Tue, Jul 26, 2022 at 7:38 AM David Rowley <dgrowleyml@gmail.com> wrote:
> I've adjusted that in the attached to make it properly include the
> case where currpathkeys are better.
Also I'm wondering if it's possible to take into consideration the
ordering indicated by existing indexes when determining the pathkeys. So
that for the query below we can avoid the Incremental Sort node if we
consider that there is an index on t(a, c):
# explain (costs off) select max(b order by b), max(c order by c) from t group by a;
QUERY PLAN
---------------------------------------------
GroupAggregate
Group Key: a
-> Incremental Sort
Sort Key: a, b
Presorted Key: a
-> Index Scan using t_a_c_idx on t
(6 rows)
Thanks
Richard
On Tue, 26 Jul 2022 at 12:01, Zhihong Yu <zyu@yugabyte.com> wrote: > sort order the the planner chooses is simply : there is duplicate `the` I think the first "the" should be "that" > + /* mark this aggregate is covered by 'currpathkeys' */ > > is covered by -> as covered by I think it was shortened from "mark that this aggregate", but I dropped "that" to get the comment to fit on a single line. Swapping "is" for "as" makes it better. Thanks. David
On Tue, 26 Jul 2022 at 19:39, Richard Guo <guofenglinux@gmail.com> wrote:
> Also I'm wondering if it's possible to take into consideration the
> ordering indicated by existing indexes when determining the pathkeys, so
> that for the query below we can avoid the Incremental Sort node if we
> consider that there is an index on t(a, c).

That would be nice but I'm not going to add anything to this patch which does anything like that. I think the patch, as it is, is a good meaningful step forward to improve the performance of ordered aggregates.

There are other things in the planner that could gain from what you talk about. For example, choosing the evaluation order of WindowFuncs. Perhaps it would be better to try to tackle those two problems together rather than try to sneak something half-baked along with this patch.

David
On Wed, Jul 27, 2022 at 6:46 AM David Rowley <dgrowleyml@gmail.com> wrote:
> That would be nice but I'm not going to add anything to this patch
> which does anything like that. I think the patch, as it is, is a good
> meaningful step forward to improve the performance of ordered
> aggregates.

Concur with that.

> There are other things in the planner that could gain from what you
> talk about. For example, choosing the evaluation order of WindowFuncs.
> Perhaps it would be better to try to tackle those two problems
> together rather than try to sneak something half-baked along with this
> patch.
Thanks
Richard
On Wed, 27 Jul 2022 at 15:16, Richard Guo <guofenglinux@gmail.com> wrote:
> That makes sense. The patch looks in a good shape to me in this part.

Thanks for giving it another look. I'm also quite happy with the patch now.

The 2 plan changes are explained. I have a patch on another thread [1] for the change in the Merge Join plan. I'd like to consider that separately from this patch. The postgres_fdw changes are explained in [2]. This can be fixed by setting fdw_tuple_cost to something more realistic in the foreign server settings in the test.

I'd like to take a serious look at pushing this patch on the first few days of August, so if anyone is following along here that might have objections, can you do so before then?

David

[1] https://www.postgresql.org/message-id/CAApHDvrtZu0PHVfDPFM4Yx3jNR2Wuwosv+T2zqa7LrhhBr2rRg@mail.gmail.com
[2] https://www.postgresql.org/message-id/CAApHDvpXiXLxg4TsA8P_4etnuGQqAAbHWEOM4hGe=DCaXmi_jA@mail.gmail.com
David Rowley <dgrowleyml@gmail.com> writes:
> I'd like to take a serious look at pushing this patch on the first few
> days of August, so if anyone is following along here that might have
> objections, can you do so before then?

Are you going to push the other patch (adjusting select_outer_pathkeys_for_merge) first, so that we can see the residual plan changes that this patch creates? I'm not entirely comfortable with the regression test changes as posted. Likewise, it might be better to fix DEFAULT_FDW_TUPLE_COST beforehand, to detangle what the effects of that are.

Also, I think it's bad style to rely on aggpresorted defaulting to false. You should explicitly initialize it anywhere that an Aggref node is constructed. It looks like there are just two places to fix (parse_expr.c and parse_func.c).

Nothing else jumped out at me in a quick scan.

regards, tom lane
On Mon, 1 Aug 2022 at 03:49, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Are you going to push the other patch (adjusting
> select_outer_pathkeys_for_merge) first, so that we can see the residual
> plan changes that this patch creates? I'm not entirely comfortable
> with the regression test changes as posted.

Yes, I pushed that earlier.

> Likewise, it might be better to fix DEFAULT_FDW_TUPLE_COST beforehand,
> to detangle what the effects of that are.

I chatted to Andres and Thomas about this last week and their view made me think it might not be quite as clear-cut as the "just bump it up a bunch because it's ridiculously low" that I had in mind. They mentioned file_fdw and another FDW that appears to work on mmapped segments, though I don't recall if any names were mentioned. Certainly that's not a reason not to change it, but it's not quite as clear-cut as I thought. I'll open a thread with some reasonable evidence to get a topic going and see where we end up. In the meantime, I've just coded it to do a temporary adjustment of the fdw_tuple_cost foreign server setting just before the test in question.

> Also, I think it's bad style to rely on aggpresorted defaulting to false.
> You should explicitly initialize it anywhere that an Aggref node is
> constructed. It looks like there are just two places to fix
> (parse_expr.c and parse_func.c).

Ooops. I'm normally good at remembering that. Not this time!

> Nothing else jumped out at me in a quick scan.

Thanks for the quick scan. I did another few passes myself and adjusted a small number of things, mostly comments and using things like lfirst_node and list_nth_node instead of lfirst and list_nth with a cast.

I've now pushed the patch. Thank you to everyone who looked at this.

David
David Rowley <dgrowleyml@gmail.com> writes: > On Mon, 1 Aug 2022 at 03:49, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Likewise, it might be >> better to fix DEFAULT_FDW_TUPLE_COST beforehand, to detangle what >> the effects of that are. > I chatted to Andres and Thomas about this last week and their view > made me think it might not be quite as clear-cut as "just bump it up a > bunch because it's ridiculously low" that I had in mind. They > mentioned about file_fdw and another one that appears to work on > mmapped segments, which I don't recall if any names were mentioned. Um ... DEFAULT_FDW_TUPLE_COST is postgres_fdw-specific, so I do not see what connection some other FDW would have to it. regards, tom lane
On Wed, 3 Aug 2022 at 01:19, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > David Rowley <dgrowleyml@gmail.com> writes: > > I chatted to Andres and Thomas about this last week and their view > > made me think it might not be quite as clear-cut as "just bump it up a > > bunch because it's ridiculously low" that I had in mind. They > > mentioned about file_fdw and another one that appears to work on > > mmapped segments, which I don't recall if any names were mentioned. > > Um ... DEFAULT_FDW_TUPLE_COST is postgres_fdw-specific, so I do not > see what connection some other FDW would have to it. I should have devoted more brain cells to that one. Anyway, I started a thread at [1]. David [1] https://www.postgresql.org/message-id/CAApHDvopVjjfh5c1Ed2HRvDdfom2dEpMwwiu5-f1AnmYprJngA@mail.gmail.com
Hi, David:
I was looking at the final patch and noticed that the setno field in the agg_presorted_distinctcheck struct is never used.
Looks like it was copied from a neighboring struct.
Can you take a look at the patch?
Thanks
On Tue, Aug 2, 2022 at 11:02 AM Zhihong Yu <zyu@yugabyte.com> wrote:
> I was looking at the final patch and noticed that the setno field in the
> agg_presorted_distinctcheck struct is never used.

Looks like the setoff field is not used either.
Cheers
On Wed, 3 Aug 2022 at 07:31, Zhihong Yu <zyu@yugabyte.com> wrote: > On Tue, Aug 2, 2022 at 11:02 AM Zhihong Yu <zyu@yugabyte.com> wrote: >> I was looking at the final patch and noticed that setno field in agg_presorted_distinctcheck struct is never used. > Looks like setoff field is not used either. Thanks for the report. It seems transno was unused too. I just pushed a commit to remove all 3. David
On Tue, Aug 02, 2022 at 11:21:04PM +1200, David Rowley wrote:
> I've now pushed the patch.

I've not studied the patch at all, but in a few places it removes the locally-computed group_pathkeys:

-    List       *group_pathkeys = root->group_pathkeys;

However, it doesn't do that here:

        /*
         * Instead of operating directly on the input relation, we can
         * consider finalizing a partially aggregated path.
         */
        if (partially_grouped_rel != NULL)
        {
            foreach(lc, partially_grouped_rel->pathlist)
            {
                ListCell   *lc2;
                Path       *path = (Path *) lfirst(lc);
                Path       *path_original = path;
                List       *pathkey_orderings = NIL;
                List       *group_pathkeys = root->group_pathkeys;

I noticed because that creates a new shadow variable, which seems accidental.

make src/backend/optimizer/plan/planner.o COPT=-Wshadow=compatible-local
src/backend/optimizer/plan/planner.c:6642:14: warning: declaration of 'group_pathkeys' shadows a previous local [-Wshadow=compatible-local]
 6642 |   List       *group_pathkeys = root->group_pathkeys;
      |               ^~~~~~~~~~~~~~
src/backend/optimizer/plan/planner.c:6438:12: note: shadowed declaration is here
 6438 |   List       *group_pathkeys;

-- 
Justin
On Wed, 17 Aug 2022 at 13:57, Justin Pryzby <pryzby@telsasoft.com> wrote:
> But in a few places, it removes the locally-computed group_pathkeys:
>
> -    List       *group_pathkeys = root->group_pathkeys;
>
> I noticed because that creates a new shadow variable, which seems accidental.

Thanks for the report. I've just pushed a fix for this that basically just removes the line you quoted. Really I should have been using the version of group_pathkeys that has the pathkeys for the ORDER BY / DISTINCT aggregates stripped off, which is calculated earlier in that function.

In practice, there was no actual bug here, as the wrong variable was only being used in the code path that handles partial paths. We never create any partial paths when there are aggregates with ORDER BY / DISTINCT clauses, so in that code path the two versions of the group_pathkeys variable would always have been set to the same thing. It makes sense just to get rid of the shadowed variable since its value will be the same anyway.

David
Hello,

While playing with the patch I found a situation where the performance may be degraded compared to previous versions. The test case is below. If you create the proper index for the query, on (a,c), version 16 wins: on my notebook, the query runs ~50% faster. But if there is no index on (a,c), only on (a,b), previous versions use it, while with this patch a full table scan is selected.

create table t (a text, b text, c text);
insert into t (a,b,c) select x, y, x from generate_series(1,100) as x, generate_series(1,10000) y;
create index on t (a,b);
vacuum analyze t;

explain (analyze, buffers) select a, array_agg(c order by c) from t group by a;

v14.5:

                                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=0.42..46587.76 rows=100 width=34) (actual time=3.077..351.526 rows=100 loops=1)
   Group Key: a
   Buffers: shared hit=193387 read=2745
   ->  Index Scan using t_a_b_idx on t  (cost=0.42..41586.51 rows=1000000 width=4) (actual time=0.014..155.095 rows=1000000 loops=1)
         Buffers: shared hit=193387 read=2745
 Planning:
   Buffers: shared hit=9
 Planning Time: 0.059 ms
 Execution Time: 351.581 ms
(9 rows)

v16:

                                                        QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=128728.34..136229.59 rows=100 width=34) (actual time=262.930..572.915 rows=100 loops=1)
   Group Key: a
   Buffers: shared hit=5396, temp read=1950 written=1964
   ->  Sort  (cost=128728.34..131228.34 rows=1000000 width=4) (actual time=259.423..434.105 rows=1000000 loops=1)
         Sort Key: a, c
         Sort Method: external merge  Disk: 15600kB
         Buffers: shared hit=5396, temp read=1950 written=1964
         ->  Seq Scan on t  (cost=0.00..15396.00 rows=1000000 width=4) (actual time=0.005..84.104 rows=1000000 loops=1)
               Buffers: shared hit=5396
 Planning:
   Buffers: shared hit=9
 Planning Time: 0.055 ms
 Execution Time: 575.146 ms
(13 rows)

-- 
Pavel Luzanov
Postgres Professional: https://postgrespro.com
The Russian Postgres Company
On Saturday 5 November 2022, 09:51:23 CET, Pavel Luzanov wrote:
> While playing with the patch I found a situation where the performance
> may be degraded compared to previous versions.
>
> The test case below.
> If you create a proper index for the query (a,c), version 16 wins. On my
> notebook, the query runs ~50% faster.
> But if there is no index (a,c), but only (a,b), in previous versions the
> planner uses it, but with this patch a full table scan is selected.

Hello,

In your exact use case, the combo incremental sort + index scan is
estimated to cost more than doing a full sort + seqscan. If you instead
create an index on (b, a) and group by b, you will get the expected
behaviour:

ro=# create index on t (b, a);
CREATE INDEX
ro=# explain select b, array_agg(c order by c) from t group by b;
                                       QUERY PLAN
-----------------------------------------------------------------------------------------
 GroupAggregate  (cost=10.64..120926.80 rows=9970 width=36)
   Group Key: b
   ->  Incremental Sort  (cost=10.64..115802.17 rows=1000000 width=6)
         Sort Key: b, c
         Presorted Key: b
         ->  Index Scan using t_b_a_idx on t  (cost=0.42..47604.12 rows=1000000 width=6)
(6 rows)

I think we can trace that back to incremental sort being pessimistic
about its own performance. If you try the same query, but with set
enable_seqscan = off, you will get a full sort over an index scan:

                                       QUERY PLAN
-----------------------------------------------------------------------------------------
 GroupAggregate  (cost=154944.94..162446.19 rows=100 width=34)
   Group Key: a
   ->  Sort  (cost=154944.94..157444.94 rows=1000000 width=4)
         Sort Key: a, c
         ->  Index Scan using t_a_b_idx on t  (cost=0.42..41612.60 rows=1000000 width=4)
(5 rows)

This probably comes from the overly pessimistic assumption that the
number of tuples per group will be 1.5 times what we would otherwise
estimate:

    /*
     * Estimate average cost of sorting of one group where presorted keys are
     * equal.  Incremental sort is sensitive to distribution of tuples to the
     * groups, where we're relying on quite rough assumptions.  Thus, we're
     * pessimistic about incremental sort performance and increase its average
     * group size by half.
     */

I can't see why an incremental sort could be more expensive than a full
sort using the same presorted path. It looks to me that in that case we
should always prefer an incremental sort. Maybe we should bound the
incremental sort's cost to make sure it is never more expensive than the
full sort?

Also, prior to this commit I don't think it made a real difference,
because in the worst case we would simply have missed an incremental
sort that we didn't have beforehand. But with this patch, we may
actually replace a "hidden" incremental sort that was done in the agg
code path with a full sort.

Best regards,

--
Ronan Dunklau
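To make the effect of that 1.5 factor concrete, here is a toy model in Python. This is an illustration only: the function names and the n * log2(n) cost formula are my own simplification, not PostgreSQL's actual cost_incremental_sort() arithmetic.

```python
import math

def comparison_sort_cost(n):
    # simple comparison-sort cost model: n * log2(n)
    return n * math.log2(max(n, 2.0))

def incremental_sort_cost(total_rows, input_groups, pessimism=1.0):
    # Sort each presorted group independently; the pessimism factor
    # inflates the assumed average group size ("by half" -> 1.5).
    group_size = pessimism * total_rows / input_groups
    num_groups = total_rows / group_size
    return num_groups * comparison_sort_cost(group_size)

full = comparison_sort_cost(1_000_000)
honest = incremental_sort_cost(1_000_000, 100)            # 100 groups of 10k rows
pessimistic = incremental_sort_cost(1_000_000, 100, 1.5)  # inflated group size
```

Even with the inflated group size, this naive model still prices the incremental sort below the full sort; the real costing adds per-group startup and tuplesort-reset overheads on top, which is where a crossover could occur.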
Hi,

On 07.11.2022 17:53, Ronan Dunklau wrote:
> In your exact use case, the combo incremental-sort + Index scan is evaluated
> to cost more than doing a full sort + seqscan.
> I think we can trace that back to incremental sort being pessimistic about
> it's performance. If you try the same query, but with set enable_seqscan = off,
> you will get a full sort over an index scan:
>
>                                        QUERY PLAN
> -----------------------------------------------------------------------------------------
>  GroupAggregate  (cost=154944.94..162446.19 rows=100 width=34)
>    Group Key: a
>    ->  Sort  (cost=154944.94..157444.94 rows=1000000 width=4)
>          Sort Key: a, c
>          ->  Index Scan using t_a_b_idx on t  (cost=0.42..41612.60 rows=1000000 width=4)
> (5 rows)

You are right. By disabling seq scan, we can get this plan. But compare
it with the plan in v15:

postgres@db(15.0)=# explain select a, array_agg(c order by c) from t group by a;
                                    QUERY PLAN
-----------------------------------------------------------------------------------
 GroupAggregate  (cost=0.42..46667.56 rows=100 width=34)
   Group Key: a
   ->  Index Scan using t_a_b_idx on t  (cost=0.42..41666.31 rows=1000000 width=4)
(3 rows)

The total plan cost in v16 is ~4 times higher, while the index scan cost
remains the same.

> I can't see why an incrementalsort could be more expensive than a full sort,
> using the same presorted path.

The only reason I can see is the number of buffers to read. The plan
with incremental sort reads the whole index, ~190000 buffers, while the
plan with seq scan only required ~5000 (I think due to the buffer ring
optimization). Perhaps this behavior is preferable, especially when many
concurrent queries are running: the less the buffer cache is occupied,
the better. But in single-user mode this query is slower.

--
Pavel Luzanov
Postgres Professional: https://postgrespro.com
The Russian Postgres Company
On Monday 7 November 2022, 17:58:50 CET, Pavel Luzanov wrote:
> > I can't see why an incrementalsort could be more expensive than a full
> > sort, using the same presorted path.
>
> The only reason I can see is the number of buffers to read. In the plan
> with incremental sort we read the whole index, ~190000 buffers.
> And the plan with seq scan only required ~5000 (I think due to buffer
> ring optimization).

What I meant here is that even with seqscans disabled, the planner still
chooses a full sort over a partial sort. The underlying index is the
same; it is just a matter of choosing a Sort node over an
IncrementalSort node. This, I think, is wrong: I can't see how it could
be worse to use an incremental sort in that case.

It makes sense to prefer a SeqScan over an IndexScan if you are going to
sort the whole table anyway, but in that case we shouldn't. What
happened before is that some sort of incremental sort was always chosen,
because it was hidden as an implementation detail of the agg node. But
now it has to compete on a cost basis with the full sort, and that
costing is wrong in this case. Maybe the original costing code for
incremental sort was a bit too pessimistic.

--
Ronan Dunklau
On 07.11.2022 20:30, Ronan Dunklau wrote:
> What I meant here is that disabling seqscans, the planner still chooses a full
> sort over a partial sort. The underlying index is the same, it is just a
> matter of choosing a Sort node over an IncrementalSort node. This, I think, is
> wrong: I can't see how it could be worse to use an incrementalsort in that
> case.

I finally get your point. And I agree with you.

> Maybe the original costing code for incremental sort was a bit too
> pessimistic.

In this query, incremental sort lost by just a little in cost:
164468.95 vs 162504.23.

                                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=155002.98..162504.23 rows=100 width=34) (actual time=296.591..568.270 rows=100 loops=1)
   Group Key: a
   ->  Sort  (cost=155002.98..157502.98 rows=1000000 width=4) (actual time=293.810..454.170 rows=1000000 loops=1)
         Sort Key: a, c
         Sort Method: external merge  Disk: 15560kB
         ->  Index Scan using t_a_b_idx on t  (cost=0.42..41670.64 rows=1000000 width=4) (actual time=0.021..156.441 rows=1000000 loops=1)
 Settings: enable_seqscan = 'off'
 Planning Time: 0.074 ms
 Execution Time: 569.957 ms
(9 rows)

set enable_sort=off;
SET
                                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=1457.58..164468.95 rows=100 width=34) (actual time=6.623..408.833 rows=100 loops=1)
   Group Key: a
   ->  Incremental Sort  (cost=1457.58..159467.70 rows=1000000 width=4) (actual time=2.652..298.530 rows=1000000 loops=1)
         Sort Key: a, c
         Presorted Key: a
         Full-sort Groups: 100  Sort Method: quicksort  Average Memory: 27kB  Peak Memory: 27kB
         Pre-sorted Groups: 100  Sort Method: quicksort  Average Memory: 697kB  Peak Memory: 697kB
         ->  Index Scan using t_a_b_idx on t  (cost=0.42..41670.64 rows=1000000 width=4) (actual time=0.011..155.260 rows=1000000 loops=1)
 Settings: enable_seqscan = 'off', enable_sort = 'off'
 Planning Time: 0.044 ms
 Execution Time: 408.867 ms

--
Pavel Luzanov
Postgres Professional: https://postgrespro.com
The Russian Postgres Company
On Tue, 8 Nov 2022 at 03:53, Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
> I can't see why an incrementalsort could be more expensive than a full sort,
> using the same presorted path. It looks to me that in that case we should
> always prefer an incrementalsort. Maybe we should bound incremental sorts cost
> to make sure they are never more expensive than the full sort ?

The only thing I can think of that would cause incremental sort to be
more expensive than a plain sort is the tuplesort_reset() calls
performed between sorts. However, I see cost_incremental_sort() already
accounts for those with:

    run_cost += 2.0 * cpu_tuple_cost * input_groups;

Also, I see at the top of incremental_sort.sql there's a comment claiming:

-- When we have to sort the entire table, incremental sort will
-- be slower than plain sort, so it should not be used.

I'm just unable to verify that's true by doing the following:

$ echo "select * from (select * from tenk1 order by four) t order by four, ten;" > bench.sql
$ pgbench -n -f bench.sql -T 60 -M prepared regression | grep -E "^tps"
tps = 102.136151 (without initial connection time)

$ # disable sort so that the test performs Sort -> Incremental Sort rather
$ # than Sort -> Sort
$ psql -c "alter system set enable_sort=0;" regression
$ psql -c "select pg_reload_conf();" regression
$ pgbench -n -f bench.sql -T 60 -M prepared regression | grep -E "^tps"
tps = 112.378761 (without initial connection time)

When I disable sort, the plan changes to use Incremental Sort and
execution becomes faster, not slower as the comment claims it will.
Perhaps this was true during the development of incremental sort and
something was later changed to speed things up. I do recall reviewing
that patch many years ago and hinting at the invention of
tuplesort_reset(); I don't recall for certain, but I assume the patch
must have been creating a new tuplesort for each group before that.
Also, I was looking at add_paths_to_grouping_rel() and saw that if
presorted_keys > 0 we'll create both a Sort and an Incremental Sort
path. If we assume Incremental Sort is always better when it can be
done, then it seems useless to create the Sort path when an Incremental
Sort is possible. When making Incremental Sorts work for window
functions I did things that way; maybe we should make
add_paths_to_grouping_rel() work the same way.

Regarding the 1.5 factor in cost_incremental_sort(), I assume this is
for skewed groups. Imagine there's 1 huge group and 99 tiny ones.
However, even if that were the case, I imagine the performance would
still be around the same as the non-incremental variant of sort.

I've been playing around with the attached patch which does:

1. Adjusts add_paths_to_grouping_rel() so that we don't add a Sort path
   when we can add an Incremental Sort path instead. This removes quite
   a few redundant lines of code.
2. Removes the * 1.5 fuzz-factor in cost_incremental_sort().
3. Does various other code tidying in cost_incremental_sort().
4. Removes the test from incremental_sort.sql that was ensuring the
   inferior Sort -> Sort plan was being used instead of the superior
   Sort -> Incremental Sort plan.

I'm not 100% confident in the removal of the * 1.5 factor. I wonder if
there's some reason we're not considering that might cause a
performance regression if we remove it.

David
Attachment
On 08.11.2022 04:31, David Rowley wrote:
> I've been playing around with the attached patch which does:
>
> 1. Adjusts add_paths_to_grouping_rel so that we don't add a Sort path
> when we can add an Incremental Sort path instead. This removes quite a
> few redundant lines of code.
> 2. Removes the * 1.5 fuzz-factor in cost_incremental_sort()
> 3. Does various other code tidy stuff in cost_incremental_sort().
> 4. Removes the test from incremental_sort.sql that was ensuring the
> inferior Sort -> Sort plan was being used instead of the superior Sort
> -> Incremental Sort plan.

I can confirm that with this patch, the plan with incremental sorting
beats the others. Here are the test results with my previous example.

Script:

create table t (a text, b text, c text);
insert into t (a,b,c) select x,y,x from generate_series(1,100) as x, generate_series(1,10000) y;
create index on t (a);
vacuum analyze t;
reset all;
explain (settings, analyze) select a, array_agg(c order by c) from t group by a;
\echo set enable_incremental_sort=off;
set enable_incremental_sort=off;
explain (settings, analyze) select a, array_agg(c order by c) from t group by a;
\echo set enable_seqscan=off;
set enable_seqscan=off;
explain (settings, analyze) select a, array_agg(c order by c) from t group by a;

Script output:

CREATE TABLE
INSERT 0 1000000
CREATE INDEX
VACUUM
RESET
                                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=957.60..113221.24 rows=100 width=34) (actual time=6.088..381.777 rows=100 loops=1)
   Group Key: a
   ->  Incremental Sort  (cost=957.60..108219.99 rows=1000000 width=4) (actual time=2.387..272.332 rows=1000000 loops=1)
         Sort Key: a, c
         Presorted Key: a
         Full-sort Groups: 100  Sort Method: quicksort  Average Memory: 27kB  Peak Memory: 27kB
         Pre-sorted Groups: 100  Sort Method: quicksort  Average Memory: 697kB  Peak Memory: 697kB
         ->  Index Scan using t_a_idx on t  (cost=0.42..29279.42 rows=1000000 width=4) (actual time=0.024..128.083 rows=1000000 loops=1)
 Planning Time: 0.070 ms
 Execution Time: 381.815 ms
(10 rows)

set enable_incremental_sort=off;
SET
                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=128728.34..136229.59 rows=100 width=34) (actual time=234.044..495.537 rows=100 loops=1)
   Group Key: a
   ->  Sort  (cost=128728.34..131228.34 rows=1000000 width=4) (actual time=231.172..383.393 rows=1000000 loops=1)
         Sort Key: a, c
         Sort Method: external merge  Disk: 15600kB
         ->  Seq Scan on t  (cost=0.00..15396.00 rows=1000000 width=4) (actual time=0.005..78.189 rows=1000000 loops=1)
 Settings: enable_incremental_sort = 'off'
 Planning Time: 0.041 ms
 Execution Time: 497.230 ms
(9 rows)

set enable_seqscan=off;
SET
                                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=142611.77..150113.02 rows=100 width=34) (actual time=262.250..527.260 rows=100 loops=1)
   Group Key: a
   ->  Sort  (cost=142611.77..145111.77 rows=1000000 width=4) (actual time=259.551..417.154 rows=1000000 loops=1)
         Sort Key: a, c
         Sort Method: external merge  Disk: 15560kB
         ->  Index Scan using t_a_idx on t  (cost=0.42..29279.42 rows=1000000 width=4) (actual time=0.012..121.995 rows=1000000 loops=1)
 Settings: enable_incremental_sort = 'off', enable_seqscan = 'off'
 Planning Time: 0.041 ms
 Execution Time: 528.950 ms
(9 rows)

--
Pavel Luzanov
Postgres Professional: https://postgrespro.com
The Russian Postgres Company
On Tue, Nov 8, 2022 at 9:31 AM David Rowley <dgrowleyml@gmail.com> wrote:
> I've been playing around with the attached patch which does:
>
> 1. Adjusts add_paths_to_grouping_rel so that we don't add a Sort path
> when we can add an Incremental Sort path instead. This removes quite a
> few redundant lines of code.
For unsorted paths, the original logic here is to explicitly add a Sort
path only for the cheapest-total path. This patch changes that and may
add a Sort path for other paths besides the cheapest-total path. I
think this may introduce some unnecessary path candidates.

I think it's good that this patch removes redundant code. ISTM we can
do the same when we try to finalize the partially aggregated paths from
partially_grouped_rel.
Thanks
Richard
On Tuesday 8 November 2022, 02:31:12 CET, David Rowley wrote:
> 1. Adjusts add_paths_to_grouping_rel so that we don't add a Sort path
> when we can add an Incremental Sort path instead. This removes quite a
> few redundant lines of code.

This seems sensible.

> 2. Removes the * 1.5 fuzz-factor in cost_incremental_sort()
> 3. Does various other code tidy stuff in cost_incremental_sort().
> 4. Removes the test from incremental_sort.sql that was ensuring the
> inferior Sort -> Sort plan was being used instead of the superior Sort
> -> Incremental Sort plan.
>
> I'm not really that 100% confident in the removal of the * 1.5 thing.
> I wonder if there's some reason we're not considering that might cause
> a performance regression if we're to remove it.

I'm not sure about it either. It seems to me that we were afraid of
regressions, and overcharging like this just made us miss a new
optimization without changing existing plans. With ordered aggregates,
the balance is trickier, and we are at risk of regressing either on
aggregate plans or on the more common ordered ones.

--
Ronan Dunklau
On Tue, 8 Nov 2022 at 19:51, Richard Guo <guofenglinux@gmail.com> wrote:
> For unsorted paths, the original logic here is to explicitly add a Sort
> path only for the cheapest-total path. This patch changes that and may
> add a Sort path for other paths besides the cheapest-total path. I
> think this may introduce in some unnecessary path candidates.

Yeah, you're right. The patch shouldn't change that. I've adjusted the
attached patch so that part works more like it did before.

v2 attached.

Thanks

David
Attachment
On Wed, 9 Nov 2022 at 14:58, David Rowley <dgrowleyml@gmail.com> wrote:
> v2 attached.

I've been looking at this again, and this time around I understand why
the * 1.5 pessimism factor was included in the incremental sort code.

Create a table with a very large skew in the number of rows per what
will be our presorted groups:

create table ab (a int not null, b int not null);
insert into ab select 0,x from generate_Series(1,999000)x union all
select x%1000+1,0 from generate_Series(999001,1000000)x;

Here the 0 group has close to 1 million rows, but the remaining groups
1-1000 have just 1 row each. The planner only knows there are about
1001 distinct values in "a" and assumes an even distribution of rows
between those values.

With:

explain (analyze, timing off) select * from ab order by a,b;

In master, the plan is:

                                                    QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Sort  (cost=122490.27..124990.27 rows=1000000 width=8) (actual rows=1000000 loops=1)
   Sort Key: a, b
   Sort Method: quicksort  Memory: 55827kB
   ->  Index Scan using ab_a_idx on ab  (cost=0.42..22832.42 rows=1000000 width=8) (actual rows=1000000 loops=1)
 Planning Time: 0.069 ms
 Execution Time: 155.469 ms

With the v2 patch it's:

                                                    QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Incremental Sort  (cost=2767.38..109344.55 rows=1000000 width=8) (actual rows=1000000 loops=1)
   Sort Key: a, b
   Presorted Key: a
   Full-sort Groups: 33  Sort Method: quicksort  Average Memory: 27kB  Peak Memory: 27kB
   Pre-sorted Groups: 1  Sort Method: quicksort  Average Memory: 55795kB  Peak Memory: 55795kB
   ->  Index Scan using ab_a_idx on ab  (cost=0.42..22832.42 rows=1000000 width=8) (actual rows=1000000 loops=1)
 Planning Time: 0.072 ms
 Execution Time: 163.614 ms

So there is a performance regression. Sometimes teaching the planner new
tricks means it might use those tricks at a bad time. Normally we put in
an off switch for these situations to give users an escape hatch; we
have enable_incremental_sort for this. It seems incremental sort has so
far avoided this problem by always considering the same Sort paths that
we did prior to incremental sort, and additionally considering an
incremental sort for presorted paths with the 1.5 pessimism factor. The
v2 patch takes away that safety net.

I think what we need to do is: do our best to give incremental sort the
most realistic costs we can and accept that it might choose a worse plan
in some cases. Users can turn it off if they really have no other means
to convince the planner it's wrong.

Additionally, I think we also need to add a GUC such as
enable_presorted_aggregate. People can use that when their Index Scan
-> Incremental Sort -> Aggregate plan is worse than the previous Seq
Scan -> Sort -> Aggregate plan they were getting in < 16. Turning off
enable_incremental_sort alone won't give them the same aggregate plan
they had in PG15, as we always set the query_pathkeys to request a sort
order that will suit the ORDER BY / DISTINCT aggregates.

I'll draft up a patch for enable_presorted_aggregate.

David
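The skew effect above can be sketched numerically. This is a rough Python model, not the planner's real costing; it only shows how far an even-distribution assumption can undershoot the true sort work when one group holds nearly all the rows.

```python
import math

def comparison_sort_cost(n):
    # simple comparison-sort cost model: n * log2(n)
    return n * math.log2(max(n, 2.0))

rows = 1_000_000
ndistinct = 1001

# Planner's assumption: rows spread evenly over the presorted groups.
assumed_per_group = rows / ndistinct
assumed_cost = ndistinct * comparison_sort_cost(assumed_per_group)

# Actual distribution in the example: one group of 999000 rows plus
# 1000 singleton groups.
actual_cost = comparison_sort_cost(999_000) + 1000 * comparison_sort_cost(1)
```

Under this model the actual per-group sort work is roughly double the estimate, which is the kind of gap the old 1.5 fuzz-factor papered over.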
On Tue, 13 Dec 2022 at 20:53, David Rowley <dgrowleyml@gmail.com> wrote:
> I think what we need to do is: Do our best to give incremental sort
> the most realistic costs we can and accept that it might choose a
> worse plan in some cases. Users can turn it off if they really have no
> other means to convince the planner it's wrong.
>
> Additionally, I think what we also need to add a GUC such as
> enable_presorted_aggregate. People can use that when their Index Scan
> -> Incremental Sort -> Aggregate plan is worse than their previous Seq
> Scan -> Sort -> Aggregate plan that they were getting in < 16.
> Turning off enable_incremental_sort alone won't give them the same
> aggregate plan that they had in pg15 as we always set the
> query_pathkeys to request a sort order that will suit the order by /
> distinct aggregates.
>
> I'll draft up a patch for the enable_presorted_aggregate.

I've attached a patch series for this.

v3-0001 can be ignored here. I've posted about that in [1]; any
discussion about that patch should take place over there. The patch is
required to get the 0002 patch to pass the isolation check.

v3-0002 removes the 1.5x cost pessimism from incremental sort and also
rewrites how we make incremental sort paths. I've now gone through the
remaining places where we create an incremental sort path to give all
of those the same treatment I'd added to add_paths_to_grouping_rel().
There were 1 or 2 plan changes in the regression tests. One was the
isolation test change, which I claim to be a broken test that should be
fixed another way. The other was performing a Sort on the cheapest
input path which had presorted keys; that plan now uses an Incremental
Sort to make use of the presorted keys. I'm happy to see just how much
redundant code this removes: about 200 lines.

v3-0003 adds the enable_presorted_aggregate GUC.

David

[1] https://www.postgresql.org/message-id/CAApHDvrbDhObhLV+=U_K_-t+2Av2av1aL9d+2j_3AO-XndaviA@mail.gmail.com
Attachment
On Fri, 16 Dec 2022 at 00:10, David Rowley <dgrowleyml@gmail.com> wrote:
> v3-0002 removes the 1.5 x cost pessimism from incremental sort and
> also rewrites how we make incremental sort paths. I've now gone
> through the remaining places where we create an incremental sort path
> to give all those the same treatment that I'd added to
> add_paths_to_grouping_rel(). There was a 1 or 2 plan changes in the
> regression tests. One was the isolation test change, which I claim to
> be a broken test and should be fixed another way. The other was
> performing a Sort on the cheapest input path which had presorted keys.
> That plan now uses an Incremental Sort to make use of the presorted
> keys. I'm happy to see just how much redundant code this removes.
> About 200 lines.

I've now pushed this patch. Thanks for the report and everyone for all
the useful discussion. Also Richard for the review.

> v3-0003 adds the enable_presorted_aggregate GUC.

This I've moved off to [1]. We tend to have lengthy discussions about
GUCs, what to name them and whether we actually need them, and I didn't
want to bury that discussion in this old and already long thread.

David

[1] https://postgr.es/m/CAApHDvqzuHerD8zN1Qu=d66e3bp1=9iFn09ZiQ3Zug_Phi6yLQ@mail.gmail.com
While doing some random testing, I noticed that the following is broken in HEAD:

SELECT COUNT(DISTINCT random()) FROM generate_series(1,10);
ERROR:  ORDER/GROUP BY expression not found in targetlist

It appears to have been broken by 1349d279, though I haven't looked at
the details. I'm somewhat surprised that a case as simple as this
wasn't covered by any pre-existing regression tests.

Regards,
Dean
On Tue, Jan 10, 2023 at 6:12 PM Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
> While doing some random testing, I noticed that the following is broken in HEAD:
>
> SELECT COUNT(DISTINCT random()) FROM generate_series(1,10);
> ERROR:  ORDER/GROUP BY expression not found in targetlist
>
> It appears to have been broken by 1349d279, though I haven't looked at
> the details.
Yeah, this is definitely broken. For this query, we try to sort the
scan/join path by random() before performing the Aggregate, which is an
optimization implemented in 1349d2790b. However the scan/join plan's
tlist does not contain random(), which I think we need to fix.
Thanks
Richard
On Wed, 11 Jan 2023 at 15:46, Richard Guo <guofenglinux@gmail.com> wrote:
> However the scan/join plan's
> tlist does not contain random(), which I think we need to fix.

I was wondering if that's true, and considered that we don't want to
evaluate random() for the sort and then again when doing the aggregate
transitions, but I see that does not really work before 1349d279
either, per:

postgres=# set enable_presorted_aggregate=0;
SET
postgres=# select string_agg(random()::text, ',' order by random())
from generate_series(1,3);
                        string_agg
-----------------------------------------------------------
 0.8659110018246505,0.15612649559563474,0.2022878955613403
(1 row)

I'd have expected those random numbers to be concatenated in ascending
order. Running:

select random() from generate_Series(1,3) order by random();

gives me the results in the order I'd have expected.

I think whatever the fix is here, we should likely ensure that the
results are consistent regardless of which Aggrefs are the presorted
ones. Perhaps the easiest way to do that, and to ensure the volatile
functions are called the same number of times, would be to never choose
Aggrefs with volatile functions when doing
make_pathkeys_for_groupagg().

David
David Rowley <dgrowleyml@gmail.com> writes:
> I think whatever the fix is here, we should likely ensure that the
> results are consistent regardless of which Aggrefs are the presorted
> ones. Perhaps the easiest way to do that, and to ensure we call the
> volatile functions are called the same number of times would just be
> to never choose Aggrefs with volatile functions when doing
> make_pathkeys_for_groupagg().

There's existing logic in equivclass.c and other places that tries to
draw very tight lines around what we'll assume about volatile sort
expressions (pathkeys). It sounds like there's someplace in this recent
patch that didn't get that memo.

			regards, tom lane
On Wed, 11 Jan 2023 at 17:32, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> There's existing logic in equivclass.c and other places that tries
> to draw very tight lines around what we'll assume about volatile
> sort expressions (pathkeys). It sounds like there's someplace in
> this recent patch that didn't get that memo.

I'm not sure I did a good job of communicating my thoughts there. What
I mean is, having volatile functions in the aggregate's ORDER BY or
DISTINCT clause didn't seem very well behaved prior to the presorted
aggregates patch. If I go and fix the bug with the missing targetlist
items, then a query such as:

select string_agg(random()::text, ',' order by random())
from generate_series(1,3);

should start putting the random() numbers in order, which it didn't
prior to 1349d279. Perhaps users might be happy that those are in
order; however, if they then change the query to:

select sum(a order by a), string_agg(random()::text, ',' order by random())
from generate_series(1,3);

then they might become unhappy again that their string_agg is not
ordered the way they specified, because the planner opted to sort by
"a" rather than "random()" after the initial scan.

I'm wondering whether 1349d279 should simply never have opted to
presort Aggrefs which have volatile functions, so that the existing
behaviour of unordered output is always given and nobody is fooled into
thinking this works correctly, only to be disappointed later when they
add some other aggregate to their query, or whether we should fix both.
Certainly, the former seems much easier.

David
On Wed, Jan 11, 2023 at 12:13 PM David Rowley <dgrowleyml@gmail.com> wrote:
> postgres=# set enable_presorted_aggregate=0;
> SET
> postgres=# select string_agg(random()::text, ',' order by random())
> from generate_series(1,3);
>                         string_agg
> -----------------------------------------------------------
>  0.8659110018246505,0.15612649559563474,0.2022878955613403
> (1 row)
>
> I'd have expected those random numbers to be concatenated in ascending order.
select string_agg(
random()::text, -- position 1
','
order by random() -- position 2
)
from generate_series(1,3);
I traced this query a bit and found that, when executing the
aggregation, the random() calls in the aggregate expression (position 1)
and in the order by clause (position 2) are evaluated separately. The
sorting is performed based on the results from the order by clause,
while the final output shows the results from the aggregate expression.
Thus we notice the output is not sorted.
I'm not sure if this is expected or broken though.
BTW, if we explicitly add ::text for random() in the order by clause, as
select string_agg(
random()::text,
','
order by random()::text
)
from generate_series(1,3);
The random() function will be calculated only once for each tuple, and
we can get a sorted output.
Thanks
Richard
On Wed, 11 Jan 2023 at 05:24, David Rowley <dgrowleyml@gmail.com> wrote:
>
> I'm wondering if 1349d279 should have just never opted to presort
> Aggrefs which have volatile functions so that the existing behaviour
> of unordered output is given always and nobody is fooled into thinking
> this works correctly only to be disappointed later when they add some
> other aggregate to their query or if we should fix both. Certainly,
> it seems much easier to do the former.
>

I took a look at this, and I agree that the best solution is probably
to have make_pathkeys_for_groupagg() ignore Aggrefs that contain
volatile functions. Not only is that the simplest solution, preserving
the old behaviour, I think it's required for correctness.

Aside from the fact that I don't think such aggregates would benefit
from the optimisation introduced by 1349d279, I think it would be
incorrect if there was more than one such aggregate having the same
sort expression, because I think that volatile sorting should be
evaluated separately for each aggregate. For example:

SELECT string_agg(a::text, ',' ORDER BY random()),
       string_agg(a::text, ',' ORDER BY random())
  FROM generate_series(1,3) s(a);

 string_agg | string_agg
------------+------------
 2,1,3      | 3,2,1
(1 row)

so pre-sorting wouldn't be right (or at least it would change existing
behaviour in a surprising way).

Regards,
Dean
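The semantic difference Dean describes can be simulated outside the database. This is a plain Python sketch, not PostgreSQL internals: each per-aggregate ORDER BY random() draws independent sort keys, whereas a shared presort fixes one key per input row for every aggregate.

```python
import random

rows = [1, 2, 3]
rng = random.Random(0)  # seeded so the sketch is repeatable

# Per-aggregate evaluation: each aggregate draws its own sort keys,
# so the two results may be different permutations of the input.
agg1 = sorted(rows, key=lambda _: rng.random())
agg2 = sorted(rows, key=lambda _: rng.random())

# Shared presort: the key is evaluated once per input row and both
# aggregates consume the same ordering, so they always agree.
keys = {r: rng.random() for r in rows}
pre1 = sorted(rows, key=keys.get)
pre2 = sorted(rows, key=keys.get)
```

With a shared presort the two aggregates can never disagree, which is exactly the behaviour change that makes presorting volatile sort expressions unsafe.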
On Tue, 17 Jan 2023 at 13:16, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
> On Wed, 11 Jan 2023 at 05:24, David Rowley <dgrowleyml@gmail.com> wrote:
> >
> > I'm wondering if 1349d279 should have just never opted to presort
> > Aggrefs which have volatile functions so that the existing behaviour
> > of unordered output is given always and nobody is fooled into thinking
> > this works correctly only to be disappointed later when they add some
> > other aggregate to their query or if we should fix both. Certainly,
> > it seems much easier to do the former.
>
> I took a look at this, and I agree that the best solution is probably
> to have make_pathkeys_for_groupagg() ignore Aggrefs that contain
> volatile functions.

Thanks for giving that some additional thought. I've just pushed a fix
which adjusts things that way.

David
On Tue, Jan 17, 2023 at 11:39 AM David Rowley <dgrowleyml@gmail.com> wrote:
> On Tue, 17 Jan 2023 at 13:16, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
> > I took a look at this, and I agree that the best solution is probably
> > to have make_pathkeys_for_groupagg() ignore Aggrefs that contain
> > volatile functions.
>
> Thanks for giving that some additional thought. I've just pushed a
> fix which adjusts things that way.
This makes a lot of sense. I agree that we shouldn't do pre-sorting for
volatile sort expressions, especially when there are multiple aggregates
with the same volatile sort expression.
Not related to this specific issue, but I find sorting by a volatile
expression confusing in different scenarios. Consider the two queries
given by David:
Query 1:
select string_agg(random()::text, ',' order by random()) from generate_series(1,3);
Query 2:
select random()::text from generate_series(1,3) order by random();
Considering the targetlist as Aggref->args or Query->targetList, in both
queries we would add an additional TargetEntry (as resjunk column) for
the ORDER BY item 'random()', because it's not present in the existing
targetlist. Note that the existing TargetEntry for 'random()::text' is
a CoerceViaIO expression which is an explicit cast, so we cannot strip
it and match it to the ORDER BY item. Thus we would have two random()
FuncExprs in the final targetlist, for both queries.
In query 1, random() is called twice for each tuple: once for the
original TargetEntry 'random()::text', and once for the TargetEntry of
the ORDER BY item 'random()'; the sorting is done according to the
results of the second call. Thus the final output appears unsorted,
because it comes from the first random() call.
However, in query 2 we have the ORDER BY item 'random()' in the
scan/join node's targetlist. And then for the two random() FuncExprs in
the final targetlist, set_plan_references would adjust both of them to
refer to the outputs of the scan/join node. Thus random() is actually
called only once for each tuple and we would find the final output is
sorted.
It seems we fail to keep the behavior of sorting by a volatile
expression consistent between these two scenarios.
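One way to make the difference visible (a sketch; the exact plan output
varies by version, so none is reproduced here) is to compare the
verbose plans of the two queries:

```sql
-- Query 2: both random() FuncExprs in the final targetlist end up
-- referring to the scan node's single random() output column, so
-- random() is evaluated once per row and the result comes out sorted.
EXPLAIN (VERBOSE, COSTS OFF)
SELECT random()::text FROM generate_series(1,3) ORDER BY random();

-- Query 1: the Aggref's argument and its ORDER BY (resjunk) column are
-- carried as separate expressions, so random() is evaluated twice per
-- input row and the aggregated values don't match the sort keys.
EXPLAIN (VERBOSE, COSTS OFF)
SELECT string_agg(random()::text, ',' ORDER BY random())
FROM generate_series(1,3);
```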
BTW, I wonder if we should have checked CoercionForm before
fix_upper_expr_mutator steps into CoerceViaIO->arg to adjust the expr
there. It seems the parser checks it and only strips implicit coercions
when matching a TargetEntry expr to an ORDER BY item.
Thanks
Richard
Richard Guo <guofenglinux@gmail.com> writes:
> BTW, I wonder if we should have checked CoercionForm before
> fix_upper_expr_mutator steps into CoerceViaIO->arg to adjust the expr
> there.

I will just quote what it says in primnodes.h:

 * NB: equal() ignores CoercionForm fields, therefore this *must* not carry
 * any semantically significant information.

If you think the planner should act differently for different values of
CoercionForm, you are mistaken. Maybe this is evidence of some
previous bit of brain-fade, but if so we need to fix that.

			regards, tom lane
On Tue, Jan 17, 2023 at 3:05 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Richard Guo <guofenglinux@gmail.com> writes:
> > BTW, I wonder if we should have checked CoercionForm before
> > fix_upper_expr_mutator steps into CoerceViaIO->arg to adjust the expr
> > there.
>
> I will just quote what it says in primnodes.h:
>
>  * NB: equal() ignores CoercionForm fields, therefore this *must* not carry
>  * any semantically significant information.
>
> If you think the planner should act differently for different values of
> CoercionForm, you are mistaken. Maybe this is evidence of some
> previous bit of brain-fade, but if so we need to fix that.
According to this comment in primnodes.h, the planner is not supposed
to treat implicit and explicit casts differently. In this case
set_plan_references is doing its job correctly by adjusting both
random() FuncExprs in the targetlist to refer to the subplan's output
for query 2. As a consequence we get a sorted output.
I'm still confused that when the same scenario happens with ORDER BY in
an aggregate function, like in query 1, the result is different in that
we would get an unsorted output.
I wonder if we should avoid this inconsistent behavior.
Thanks
Richard
On Wed, 18 Jan 2023 at 22:37, Richard Guo <guofenglinux@gmail.com> wrote:
> I'm still confused that when the same scenario happens with ORDER BY in
> an aggregate function, like in query 1, the result is different in that
> we would get an unsorted output.
>
> I wonder if we should avoid this inconsistent behavior.

It certainly seems pretty strange that aggregates with an ORDER BY
behave differently from the query's ORDER BY. I'd have expected those
to be the same. I've not looked to see why there's a difference, but I
suspect that we thought about how we want it to work for the query's
ORDER BY, and that when ORDER BY aggregates were added, that behaviour
was not considered. Likely finding the code, or the location where that
code should be, would help us understand if something was just
forgotten in the aggregate's case.

It's probably another question as to whether we should be adjusting
this behaviour now.

David
On Wed, 18 Jan 2023 at 09:49, David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Wed, 18 Jan 2023 at 22:37, Richard Guo <guofenglinux@gmail.com> wrote:
> > I'm still confused that when the same scenario happens with ORDER BY in
> > an aggregate function, like in query 1, the result is different in that
> > we would get an unsorted output.
> >
> > I wonder if we should avoid this inconsistent behavior.
>
> It certainly seems pretty strange that aggregates with an ORDER BY
> behave differently from the query's ORDER BY. I'd have expected that
> to be the same. I've not looked to see why there's a difference, but
> suspect that we thought about how we want it to work for the query's
> ORDER BY and when ORDER BY aggregates were added, that behaviour was
> not considered.

I think the behaviour of an ORDER BY in the query can also be pretty
surprising. For example, consider:

SELECT ARRAY[random(), random(), random()]
  FROM generate_series(1, 3);

                            array
-------------------------------------------------------------
 {0.2335800863701647,0.14688842754711273,0.2975659224823368}
 {0.10616525384762876,0.8371175798972244,0.2936178886154661}
 {0.21679841321788262,0.5254761982948826,0.7789412240118161}
(3 rows)

which produces 9 different random values, as expected, and compare
that to:

SELECT ARRAY[random(), random(), random()]
  FROM generate_series(1, 3)
 ORDER BY random();

                             array
---------------------------------------------------------------
 {0.01952216253949679,0.01952216253949679,0.01952216253949679}
 {0.6735145595500629,0.6735145595500629,0.6735145595500629}
 {0.9406665780147616,0.9406665780147616,0.9406665780147616}
(3 rows)

which now has only 3 distinct random values. It's pretty
counterintuitive that adding an ORDER BY clause changes the contents of
the rows returned, not just their order.

The trouble is, if we tried to fix that, we'd risk changing some other
behaviour that users may have come to rely on.

Regards,
Dean
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> I think the behaviour of an ORDER BY in the query can also be pretty
> surprising.

Indeed. The fundamental question is this: in

> SELECT ARRAY[random(), random(), random()]
> FROM generate_series(1, 3)
> ORDER BY random();

are those four occurrences of random() supposed to refer to the same
value, or not? This only matters for volatile functions, of course;
with stable or immutable functions, textually-equal subexpressions
should have the same value in any given row.

It is very clear what we are supposed to do for

SELECT random() FROM ... ORDER BY 1;

which sadly isn't legal SQL anymore. It gets fuzzy as soon as we have

SELECT random() FROM ... ORDER BY random();

You could make an argument either way for those being the same value or
not, but historically we've concluded that it's more useful to deem
them the same value. Then the behavior you show is not such a
surprising extension, although it could be argued that such matches
should only extend to identical top-level targetlist entries.

> The trouble is, if we tried to fix that, we'd risk changing some other
> behaviour that users may have come to rely on.

Yeah. I'm hesitant to try to adjust semantics here; we're much more
likely to get complaints than kudos.

			regards, tom lane
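[Editor's note: although the ordinal form mentioned above is no longer
part of standard SQL, PostgreSQL still accepts it as an extension, and
sorting by an output-column alias expresses the same unambiguous
intent. A sketch, not from the original mails:]

```sql
-- Both forms tie the sort to the single evaluated output expression,
-- so random() is computed once per row and the rows come back sorted.
SELECT random() FROM generate_series(1, 3) ORDER BY 1;
SELECT random() AS r FROM generate_series(1, 3) ORDER BY r;
```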