Add proper planner support for ORDER BY / DISTINCT aggregates - Mailing list pgsql-hackers

From David Rowley
Subject Add proper planner support for ORDER BY / DISTINCT aggregates
Date
Msg-id CAApHDvpHzfo92=R4W0+xVua3BUYCKMckWAmo-2t_KiXN-wYH=w@mail.gmail.com
Whole thread Raw
Responses Re: Add proper planner support for ORDER BY / DISTINCT aggregates
Re: Add proper planner support for ORDER BY / DISTINCT aggregates
Re: Add proper planner support for ORDER BY / DISTINCT aggregates
List pgsql-hackers
A few years ago I wrote a patch to implement the missing aggregate
combine functions for array_agg and string_agg [1].  In the end, the
patch was rejected due to some concern [2] that if we allow these
aggregates to run in parallel then it might mess up the order in which
values are being aggregated by some unsuspecting user who didn't
include an ORDER BY clause in the aggregate.   It was mentioned at the
time that due to nodeAgg.c performing so terribly with ORDER BY
aggregates that we couldn't possibly ask users who were upset by this
change to include an ORDER BY in their aggregate functions.

I'd still quite like to finish off the remaining aggregate functions
that still don't have a combine function, so to get that going again
I've written some code that gets the query planner onboard with
picking a plan that allows nodeAgg.c to skip doing any internal sorts
when the input results are already sorted according to what's required
by the aggregate function.

Of course, the query could have many aggregates all with differing
ORDER BY clauses. Since we reuse the same input for each, it might not
be possible to feed each aggregate with suitably sorted input.   When
the input is not sorted, nodeAgg.c still must perform the sort as it
does today.  So unfortunately we can't remove the nodeAgg.c code for
that.

The best I could come up with is just picking a sort order that suits
the first ORDER BY aggregate, then spin through the remaining ones to
see if there's any with a more strict order requirement that we can
use to suit that one and the first one. The planner does something
similar for window functions already, although it's not quite as
comprehensive to look beyond the first window for windows with a more
strict sort requirement.

This allows us to give presorted input to both aggregates in the following case:

SELECT agg(a ORDER BY a),agg2(a ORDER BY a,b) ...

but just the first agg in this one:

SELECT agg(a ORDER BY a),agg2(a ORDER BY c) ...

In order to make DISTINCT work, I had to add a new expression
evaluation operator to allow filtering if the current value is the
same as the last value.  The current unpatched code implements
distinct when reading back the sorted tuplestore data.  Since I don't
have a tuplestore with the pre-sorted version, I needed to cache the
last Datum, or last tuple somewhere.  I ended up putting that in the
AggStatePerTransData struct. I'm not quite sure if I like that, but I
don't really see where else it could go.

When testing the performance of all this I found that when a suitable
index exists to provide pre-sorted input for the aggregation that the
performance does improve. Unfortunately, it looks like things get more
complex when no index exists.  In this case, since we're setting
pathkeys to tell the planner we need a plan that provides pre-sorted
input to the aggregates, the planner will add a sort below the
aggregate node.  I initially didn't see any problem with that as it
just moves the sort to a Sort node rather than having it done
implicitly inside nodeAgg.c.  The problem is, it just does not perform
as well.  I guess this is because when the sort is done inside
nodeAgg.c that the transition function is called in a tight loop while
reading records back from the tuplestore.  In the patched version,
there's an additional node transition in between nodeAgg and nodeSort
and that causes slower performance.  For now, I'm not quite sure what
to do about that.  We set the plan pathkeys well before we could
possibly decide if asking for pre-sorted input for the aggregates
would be a good idea or not.

Please find attached my WIP patch.  It's WIP due to what I mentioned
in the above paragraph and also because I've not bothered to add JIT
support for the new expression evaluation steps.

I'll add this to the July commitfest.

David

[1] https://www.postgresql.org/message-id/CAKJS1f9sx_6GTcvd6TMuZnNtCh0VhBzhX6FZqw17TgVFH-ga_A%40mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/6538.1522096067%40sss.pgh.pa.us#c228ed67026faa15209c76dada035da4

Attachment

pgsql-hackers by date:

Previous
From: Zhihong Yu
Date:
Subject: Re: Use pg_nextpower2_* in a few more places
Next
From: Justin Pryzby
Date:
Subject: Re: pg_filenode_relation(0,0) elog