Re: Add proper planner support for ORDER BY / DISTINCT aggregates - Mailing list pgsql-hackers

From David Rowley
Subject Re: Add proper planner support for ORDER BY / DISTINCT aggregates
Date
Msg-id CAApHDvpYZwbRNurFsG+cvbUFBaFaiF6J6VWKBvWdtVsbU-Bfgg@mail.gmail.com
Whole thread Raw
In response to Re: Add proper planner support for ORDER BY / DISTINCT aggregates  (Ronan Dunklau <ronan.dunklau@aiven.io>)
Responses Re: Add proper planner support for ORDER BY / DISTINCT aggregates
Re: Add proper planner support for ORDER BY / DISTINCT aggregates
List pgsql-hackers
On Fri, 2 Jul 2021 at 19:54, Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
> I don't know if it's acceptable, but in the case where you add both an
> aggregate with an ORDER BY clause, and another aggregate without the clause,
> the output for the unordered one will change and use the same ordering, maybe
> suprising the unsuspecting user. Would that be acceptable ?

That's a good question. There was an argument in [1] that mentions
that there might be a group of people who rely on aggregation being
done in a certain order where they're not specifying an ORDER BY
clause in the aggregate.  If that group of people exists, then it's
possible they might be upset in the scenario that you describe.

I also think that it's going to be pretty hard to make significant
gains in this area if we are too scared to make changes to undefined
behaviour.  You wouldn't have to look too hard in the pgsql-general
mailing list to find someone complaining that their query output is in
the wrong order on some query that does not have an ORDER BY. We
pretty much always tell those people that the order is undefined
without an ORDER BY. I'm not too sure why Tom in [1] classes the ORDER
BY aggregate people any differently. We'll be stuck forever here and
in many other areas if we're too scared to change the order of
aggregation. You could argue that something like parallelism has
changed that for people already. I think the multi-batch Hash
Aggregate code could also change this.

> I was curious about the performance implication of that additional transition,
> and could not reproduce a signifcant difference. I may be doing something
> wrong: how did you highlight it ?

It was pretty basic. I just created a table with two columns and no
index and did something like SELECT a,SUM(b ORDER BY b) from ab GROUP
BY 1;  the new code will include a Sort due to lack of any index and
the old code would have done a sort inside nodeAgg.c. I imagine that
the overhead comes from the fact that in the patched version nodeAgg.c
must ask its subnode (nodeSort.c) for the next tuple each time,
whereas unpatched nodeAgg.c already has all the tuples in a tuplestore
and can fetch them very quickly in a tight loop.

David

[1] https://www.postgresql.org/message-id/6538.1522096067%40sss.pgh.pa.us



pgsql-hackers by date:

Previous
From: Dean Rasheed
Date:
Subject: Re: rand48 replacement
Next
From: David Rowley
Date:
Subject: Re: Commit fest manager