Thread: Aggregate node doesn't include cost for sorting
Hi hackers,

The cost of an Aggregate node seems to be the same regardless of the input being pre-sorted or not. If, however, the input is not sorted, the Aggregate node must additionally perform a sort, which can impact runtime significantly. Here is an example:

CREATE TABLE foo(col0 INT, col1 TEXT);
INSERT INTO foo SELECT generate_series(1, 100000), 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' || md5(random()::TEXT);
CREATE INDEX foo_idx ON foo(col1);
SET max_parallel_workers_per_gather = 0;
SET enable_bitmapscan = FALSE;

-- Unsorted input. Aggregate node must additionally sort all rows.
EXPLAIN ANALYZE SELECT COUNT(DISTINCT(col1)) FROM foo;

                                                     QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2584.00..2584.01 rows=1 width=8) (actual time=1684.705..1684.809 rows=1 loops=1)
   ->  Seq Scan on foo  (cost=0.00..2334.00 rows=100000 width=71) (actual time=0.018..343.280 rows=100000 loops=1)
 Planning Time: 0.317 ms
 Execution Time: 1685.543 ms

-- Pre-sorted input. Aggregate node doesn't have to sort all rows.
SET enable_seqscan = FALSE;
EXPLAIN ANALYZE SELECT COUNT(DISTINCT(col1)) FROM foo;

                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=6756.42..6756.43 rows=1 width=8) (actual time=819.028..819.041 rows=1 loops=1)
   ->  Index Only Scan using foo_idx on foo  (cost=0.42..6506.42 rows=100000 width=71) (actual time=0.046..404.260 rows=100000 loops=1)
         Heap Fetches: 100000
 Planning Time: 0.438 ms
 Execution Time: 819.515 ms

The cost the Aggregate node adds on top of its input node is the same in both cases (250.0), while the Aggregate's runtime is 1.3 s in the unsorted case and 0.4 s in the pre-sorted case.

Also, why does the Aggregate node sort itself? Why don't we instead emit an explicit Sort node in the plan so that it's visible to the user what is happening? As soon as there's also a GROUP BY in the query, a Sort node occurs in the plan. This seems inconsistent.

--
David Geier
(ServiceNow)
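To make the inconsistency concrete, here is a sketch against the same foo table (the exact plan depends on version and settings): combining a DISTINCT aggregate with a GROUP BY rules out hash aggregation, so the grouping sort becomes an explicit Sort node, while the sort performed for the DISTINCT still stays hidden inside the aggregate node:

-- Grouped variant of the query above. The expected plan shape is
-- GroupAggregate -> Sort -> Seq Scan: the sort for the grouping is an
-- explicit node, but the per-group sort of col1 for the DISTINCT is not
-- shown anywhere in the plan.
EXPLAIN SELECT col0 % 10, COUNT(DISTINCT col1)
FROM foo
GROUP BY col0 % 10;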
On Thu, 8 Dec 2022 at 22:06, David Geier <geidav.pg@gmail.com> wrote:
> The cost of an Aggregate node seems to be the same regardless of the
> input being pre-sorted or not. If, however, the input is not sorted,
> the Aggregate node must additionally perform a sort, which can impact
> runtime significantly. Here is an example:
>
> -- Unsorted input. Aggregate node must additionally sort all rows.
> EXPLAIN ANALYZE SELECT COUNT(DISTINCT(col1)) FROM foo;
>
>                                                      QUERY PLAN
> -------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=2584.00..2584.01 rows=1 width=8) (actual time=1684.705..1684.809 rows=1 loops=1)
>    ->  Seq Scan on foo  (cost=0.00..2334.00 rows=100000 width=71)

This output surely must be from a version of PostgreSQL prior to 1349d279? I can't quite figure out why you've added a "SET enable_seqscan = FALSE;". That makes it look like you've used the same version of PostgreSQL to produce both of these plans. The 2nd plan you've shown must be post 1349d279.

So, with the assumption that you've used 2 different versions to show this output, for post 1349d279, there does not seem to be much choice on how the aggregate is executed. What's your concern about the costings having to be accurate given there's no other plan choice?

The new pre-sorted aggregate code will always request the sort order that suits the largest number of ORDER BY / DISTINCT aggregates. When there are multiple ORDER BY / DISTINCT aggregates and they have different sort requirements, then there certainly are competing ways that the aggregation portion of the plan can be executed. I opted to make the choice just based on the number of aggregates that could become presorted. nodeAgg.c is currently not very smart about sharing sorts between multiple aggregates with the same sort requirements. If that was made better, there might be more motivation to have better costing code in make_pathkeys_for_groupagg(). However, the reasons for the reversion of db0d67db seemed to be mainly around the lack of ability to accurately cost multiple competing sort orders. We'd need to come up with some better way to do that if we were to want to give make_pathkeys_for_groupagg() similar abilities.

> Also, why does the Aggregate node sort itself? Why don't we instead emit
> an explicit Sort node in the plan so that it's visible to the user what
> is happening? As soon as there's also a GROUP BY in the query, a Sort
> node occurs in the plan. This seems inconsistent.

Post 1349d279 we do that, but it can only do it for 1 sort order. There can be any number of aggregate functions which require a sort and they don't all have to have the same sort order requirements. We can't do the same as WindowAgg does by chaining nodes together either, because the aggregate node aggregates the results and we'd need all the same input rows to be available at each step.

The only other way would be to have it so an Aggregate node could be fed by multiple different input nodes and then it would only work on the aggregates that suit the given input before reading the next input and doing the other aggregates. Making it work like that would cause quite a bit of additional effort during planning (not to mention the executor). We'd have to run the join search once per required order, which is one of the slowest parts of planning. Right now, you could probably make that work by just writing the SQL to have a subquery per sort requirement.

David
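A sketch of that last suggestion, using the foo table from upthread and assuming a suitable index exists on each sort column: with two ORDER BY aggregates that need different sort orders, moving each aggregate into its own subquery lets the planner provide each one with its own presorted input:

-- One Aggregate node, two conflicting sort requirements: at most one of
-- the aggregates can receive presorted input.
SELECT array_agg(col0 ORDER BY col0),
       array_agg(col1 ORDER BY col1)
FROM foo;

-- A subquery per sort requirement: each subquery is planned on its own,
-- so each aggregate can get input presorted its way.
SELECT (SELECT array_agg(col0 ORDER BY col0) FROM foo) AS by_col0,
       (SELECT array_agg(col1 ORDER BY col1) FROM foo) AS by_col1;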
Hi David,

Thanks for the explanations, and awesome that this was improved on already. I didn't have this change on my radar.

On 12/8/22 11:40, David Rowley wrote:
>
> This output surely must be from a version of PostgreSQL prior to
> 1349d279? I can't quite figure out why you've added a "SET
> enable_seqscan = FALSE;". That makes it look like you've used the same
> version of PostgreSQL to produce both of these plans. The 2nd plan
> you've shown must be post 1349d279.

Both plans were captured on 14.5, which is indeed prior to 1349d279.

I disabled sequential scan to show that there's an alternative plan which is superior to the chosen plan: Index Only Scan is more expensive and takes longer than the Seq Scan, but the subsequent Aggregate runs much faster as it doesn't have to sort, making the plan overall superior.

>
> So, with the assumption that you've used 2 different versions to show
> this output, for post 1349d279, there does not seem to be much choice
> on how the aggregate is executed. What's your concern about the
> costings having to be accurate given there's no other plan choice?

There's another plan choice, which is using the index to get pre-sorted input rows; see the previous comment.

>
> The new pre-sorted aggregate code will always request the sort order
> that suits the largest number of ORDER BY / DISTINCT aggregates. When
> there are multiple ORDER BY / DISTINCT aggregates and they have
> different sort requirements, then there certainly are competing ways
> that the aggregation portion of the plan can be executed. I opted to
> make the choice just based on the number of aggregates that could
> become presorted. nodeAgg.c is currently not very smart about sharing
> sorts between multiple aggregates with the same sort requirements. If
> that was made better, there might be more motivation to have better
> costing code in make_pathkeys_for_groupagg(). However, the reasons
> for the reversion of db0d67db seemed to be mainly around the lack of
> ability to accurately cost multiple competing sort orders. We'd need
> to come up with some better way to do that if we were to want to give
> make_pathkeys_for_groupagg() similar abilities.
>
>> Also, why does the Aggregate node sort itself? Why don't we instead emit
>> an explicit Sort node in the plan so that it's visible to the user what
>> is happening? As soon as there's also a GROUP BY in the query, a Sort
>> node occurs in the plan. This seems inconsistent.
>
> Post 1349d279 we do that, but it can only do it for 1 sort order.
> There can be any number of aggregate functions which require a sort
> and they don't all have to have the same sort order requirements. We
> can't do the same as WindowAgg does by chaining nodes together either,
> because the aggregate node aggregates the results and we'd need all
> the same input rows to be available at each step.
>
> The only other way would be to have it so an Aggregate node could be
> fed by multiple different input nodes and then it would only work on
> the aggregates that suit the given input before reading the next input
> and doing the other aggregates. Making it work like that would cause
> quite a bit of additional effort during planning (not to mention the
> executor). We'd have to run the join search once per required order,
> which is one of the slowest parts of planning. Right now, you could
> probably make that work by just writing the SQL to have a subquery per
> sort requirement.

Thanks for the explanation!

--
David Geier
(ServiceNow)
David Rowley <dgrowleyml@gmail.com> writes:
> So, with the assumption that you've used 2 different versions to show
> this output, for post 1349d279, there does not seem to be much choice
> on how the aggregate is executed. What's your concern about the
> costings having to be accurate given there's no other plan choice?

It's true that the cost attributed to the Agg node won't impact any live decisions in the plan level in which it appears. However, if that's a subquery, then the total cost attributed to the subplan could in principle affect plan choices in the outer query. So there is a valid argument for wanting to try to get it right.

Having said that, I think that the actual impact on outer-level choices is usually minimal. So it didn't bother me that we swept this under the rug before --- and I'm pretty sure that we're sweeping similar things under the rug elsewhere in top-of-query planning. However, given 1349d279 it should surely be true that the planner knows how many sorts it's left to be done at runtime (a number we did not have at hand before). So it seems like it ought to be simple enough to account for this effect more accurately. I'd be in favor of doing so if it's simple and cheap to get the numbers.

			regards, tom lane
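A sketch of the outer-level effect Tom describes, where bar is a hypothetical table: the cost attributed to the Agg node is part of the total cost of subquery s, and that total feeds into the join method and join order decisions of the outer query:

-- If the Agg node's cost misestimates the sort work, the total cost of
-- subquery "s" is off, and the outer query's join planning inherits the
-- error.
SELECT *
FROM (SELECT COUNT(DISTINCT col1) AS n FROM foo) s
JOIN bar ON bar.id = s.n;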
On Fri, 9 Dec 2022 at 01:12, David Geier <geidav.pg@gmail.com> wrote:
> Both plans were captured on 14.5, which is indeed prior to 1349d279.
>
> I disabled sequential scan to show that there's an alternative plan
> which is superior to the chosen plan: Index Only Scan is more expensive
> and takes longer than the Seq Scan, but the subsequent Aggregate runs
> much faster as it doesn't have to sort, making the plan overall superior.

Aha, 14.5. What's going on there is that it's still doing the sort. The aggregate code in that version does not skip the sort because of the presorted input. A likely explanation for the performance increase is the presorted check in our qsort implementation. The successful presort check is O(N), whereas an actual sort is O(N * logN).

It's true that if we had been doing proper costing on these ORDER BY / DISTINCT aggregates, we could have noticed that the input path's pathkeys indicate that no sort is required and costed accordingly. But if we'd gone to the trouble of factoring that into the costs, then it would also have made sense to make nodeAgg.c not sort on presorted input. We got the latter in 1349d279; it's just that we didn't do anything about the costings in that commit.

Anyway, in the next version of Postgres, the planner is highly likely to choose the 2nd plan in your original email. It'll also be even faster than you've shown, due to the aggregate code not having to store and read tuples in the tuplesort object, and there's no O(N) presort check either. The performance should be much closer to what it would be if you disabled seqscan and dropped the DISTINCT out of your aggregate.

David
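A sketch of the comparison David suggests at the end, reusing the settings from the first message (no timings shown, since they vary by machine and version):

SET enable_seqscan = FALSE;

-- Post 1349d279: the presorted DISTINCT aggregate neither sorts nor
-- stores and re-reads tuples in a tuplesort object.
EXPLAIN ANALYZE SELECT COUNT(DISTINCT col1) FROM foo;

-- Baseline without DISTINCT: same Index Only Scan, no sort-related work;
-- the query above should land close to this.
EXPLAIN ANALYZE SELECT COUNT(col1) FROM foo;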
On Fri, 9 Dec 2022 at 03:38, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> It's true that the cost attributed to the Agg node won't impact any
> live decisions in the plan level in which it appears. However, if
> that's a subquery, then the total cost attributed to the subplan
> could in principle affect plan choices in the outer query. So there
> is a valid argument for wanting to try to get it right.

I guess the JIT thresholds are another reason to try to make the costs a reflection of the expected run-time too.

> Having said that, I think that the actual impact on outer-level choices
> is usually minimal. So it didn't bother me that we swept this under
> the rug before --- and I'm pretty sure that we're sweeping similar
> things under the rug elsewhere in top-of-query planning. However,
> given 1349d279 it should surely be true that the planner knows how many
> sorts it's left to be done at runtime (a number we did not have at hand
> before). So it seems like it ought to be simple enough to account for
> this effect more accurately. I'd be in favor of doing so if it's
> simple and cheap to get the numbers.

Ok, probably Heikki's work in 0a2bc5d61 is a more useful piece of work to get us closer to that goal. I think all that's required to make it work is adding on the costs in the final foreach loop in get_agg_clause_costs(). The Aggrefs have already been marked as aggpresorted by that time, so it should be a matter of:

if ((aggref->aggorder != NIL || aggref->aggdistinct != NIL) &&
    !aggref->aggpresorted)
    /* add the costs for sorting this aggregate's input */

David