Re: Add proper planner support for ORDER BY / DISTINCT aggregates - Mailing list pgsql-hackers

From James Coleman
Subject Re: Add proper planner support for ORDER BY / DISTINCT aggregates
Date
Msg-id CAAaqYe-yxXkXVPJkRw1nDA=CJBw28jvhACRyDcU10dAOcdSj=Q@mail.gmail.com
Whole thread Raw
In response to Add proper planner support for ORDER BY / DISTINCT aggregates  (David Rowley <dgrowleyml@gmail.com>)
List pgsql-hackers
On Sat, Jun 12, 2021 at 11:07 AM David Rowley <dgrowleyml@gmail.com> wrote:
>
> A few years ago I wrote a patch to implement the missing aggregate
> combine functions for array_agg and string_agg [1].  In the end, the
> patch was rejected due to some concern [2] that if we allow these
> aggregates to run in parallel then it might mess up the order in which
> values are being aggregated by some unsuspecting user who didn't
> include an ORDER BY clause in the aggregate.   It was mentioned at the
> time that due to nodeAgg.c performing so terribly with ORDER BY
> aggregates that we couldn't possibly ask users who were upset by this
> change to include an ORDER BY in their aggregate functions.
>
> I'd still quite like to finish off the remaining aggregate functions
> that still don't have a combine function, so to get that going again
> I've written some code that gets the query planner onboard with
> picking a plan that allows nodeAgg.c to skip doing any internal sorts
> when the input results are already sorted according to what's required
> by the aggregate function.
>
> Of course, the query could have many aggregates all with differing
> ORDER BY clauses. Since we reuse the same input for each, it might not
> be possible to feed each aggregate with suitably sorted input.   When
> the input is not sorted, nodeAgg.c still must perform the sort as it
> does today.  So unfortunately we can't remove the nodeAgg.c code for
> that.
>
> The best I could come up with is just picking a sort order that suits
> the first ORDER BY aggregate, then spin through the remaining ones to
> see if there's any with a more strict order requirement that we can
> use to suit that one and the first one. The planner does something
> similar for window functions already, although it's not quite as
> comprehensive to look beyond the first window for windows with a more
> strict sort requirement.

I think this is a reasonable choice. I could imagine a more complex
method, say, counting the number of aggregates benefiting from a given
sort, and choosing the one that benefits the most (and this could be
further complicated by ranking based on "cost" -- not costing in the
normal sense since we don't have that at this point), but I think it'd
take a lot of convincing that that was valuable.

> This allows us to give presorted input to both aggregates in the following case:
>
> SELECT agg(a ORDER BY a),agg2(a ORDER BY a,b) ...
>
> but just the first agg in this one:
>
> SELECT agg(a ORDER BY a),agg2(a ORDER BY c) ...
>
> In order to make DISTINCT work, I had to add a new expression
> evaluation operator to allow filtering if the current value is the
> same as the last value.  The current unpatched code implements
> distinct when reading back the sorted tuplestore data.  Since I don't
> have a tuplestore with the pre-sorted version, I needed to cache the
> last Datum, or last tuple somewhere.  I ended up putting that in the
> AggStatePerTransData struct. I'm not quite sure if I like that, but I
> don't really see where else it could go.

That sounds like what nodeIncrementalSort.c's isCurrentGroup() does,
except it's just implemented inline. Not anything you need to change
in this patch, but noting it in case it triggered a thought valuable
for you for me later on.

> When testing the performance of all this I found that when a suitable
> index exists to provide pre-sorted input for the aggregation that the
> performance does improve. Unfortunately, it looks like things get more
> complex when no index exists.  In this case, since we're setting
> pathkeys to tell the planner we need a plan that provides pre-sorted
> input to the aggregates, the planner will add a sort below the
> aggregate node.  I initially didn't see any problem with that as it
> just moves the sort to a Sort node rather than having it done
> implicitly inside nodeAgg.c.  The problem is, it just does not perform
> as well.  I guess this is because when the sort is done inside
> nodeAgg.c that the transition function is called in a tight loop while
> reading records back from the tuplestore.  In the patched version,
> there's an additional node transition in between nodeAgg and nodeSort
> and that causes slower performance.  For now, I'm not quite sure what
> to do about that.  We set the plan pathkeys well before we could
> possibly decide if asking for pre-sorted input for the aggregates
> would be a good idea or not.

This opens up another path for significant plan benefits too: if
there's now an explicit sort node, then it's possible for that node to
be an incremental sort node, which isn't something nodeAgg is capable
of utilizing currently.

> Please find attached my WIP patch.  It's WIP due to what I mentioned
> in the above paragraph and also because I've not bothered to add JIT
> support for the new expression evaluation steps.

I looked this over (though didn't get a chance to play with it).

I'm wondering about the changes to the test output in tuplesort.out.
It looks like where a merge join used to be proceeding with an
explicit sort with DESC it's now sorting with (implicit) ASC, and then
an explicit sort node using DESC is above the merge join node.  Two
thoughts:

1. It seems like losing the "proper" sort order in the JOIN node isn't
preferable. Given both are full sorts, it may not actually be a
significant performance difference on its own (it can't short
circuit), but it means precluding using incremental sort in the new
sort node.
2. In an ideal world we'd push the more complex sort down into the
merge join rather than doing a sort on "col1" and then a sort on
"col1, col2". That's likely beyond this patch, but I haven't
investigated at all.

Thanks for your work on this; both this patch and the original one
you're trying to enable seem like great additions.

James



pgsql-hackers by date:

Previous
From: James Coleman
Date:
Subject: Minor typo in generate_useful_gather_paths comment
Next
From: Tom Lane
Date:
Subject: Re: "debug_invalidate_system_caches_always" is too long