Thread: Aggregate node doesn't include cost for sorting
Hi hackers,

The cost of an Aggregate node seems to be the same regardless of the input being pre-sorted or not. If, however, the input is not sorted, the Aggregate node must additionally perform a sort, which can impact runtime significantly. Here is an example:

CREATE TABLE foo(col0 INT, col1 TEXT);
INSERT INTO foo SELECT generate_series(1, 100000), 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' || md5(random()::TEXT);
CREATE INDEX foo_idx ON foo(col1);
SET max_parallel_workers_per_gather = 0;
SET enable_bitmapscan = FALSE;

-- Unsorted input. Aggregate node must additionally sort all rows.
EXPLAIN ANALYZE SELECT COUNT(DISTINCT(col1)) FROM foo;

                                                     QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2584.00..2584.01 rows=1 width=8) (actual time=1684.705..1684.809 rows=1 loops=1)
   ->  Seq Scan on foo  (cost=0.00..2334.00 rows=100000 width=71) (actual time=0.018..343.280 rows=100000 loops=1)
 Planning Time: 0.317 ms
 Execution Time: 1685.543 ms

-- Pre-sorted input. Aggregate node doesn't have to sort all rows.
SET enable_seqscan = FALSE;
EXPLAIN ANALYZE SELECT COUNT(DISTINCT(col1)) FROM foo;

                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=6756.42..6756.43 rows=1 width=8) (actual time=819.028..819.041 rows=1 loops=1)
   ->  Index Only Scan using foo_idx on foo  (cost=0.42..6506.42 rows=100000 width=71) (actual time=0.046..404.260 rows=100000 loops=1)
         Heap Fetches: 100000
 Planning Time: 0.438 ms
 Execution Time: 819.515 ms

The cost the Aggregate node adds on top of its input node is the same in both cases (250.0), while the Aggregate's runtime is 1.3 s in the unsorted case and 0.4 s in the pre-sorted case.

Also, why does the Aggregate node sort itself? Why don't we instead emit an explicit Sort node in the plan so that it's visible to the user what is happening? As soon as there's also a GROUP BY in the query, a Sort node occurs in the plan. This seems inconsistent.

--
David Geier
(ServiceNow)
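To make the inconsistency concrete, here is a sketch against the same foo table (the exact plan depends on version and settings): combining a DISTINCT aggregate with a GROUP BY rules out hash aggregation, so the grouping sort becomes an explicit Sort node, while the sort performed for the DISTINCT still stays hidden inside the aggregate node:

-- Grouped variant of the query above. The expected plan shape is
-- GroupAggregate -> Sort -> Seq Scan: the sort for the grouping is an
-- explicit node, but the per-group sort of col1 for the DISTINCT is not
-- shown anywhere in the plan.
EXPLAIN SELECT col0 % 10, COUNT(DISTINCT col1)
FROM foo
GROUP BY col0 % 10;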
On Thu, 8 Dec 2022 at 22:06, David Geier <geidav.pg@gmail.com> wrote:
> The cost of an Aggregate node seems to be the same regardless of the
> input being pre-sorted or not. If, however, the input is not sorted,
> the Aggregate node must additionally perform a sort, which can impact
> runtime significantly. Here is an example:
>
> -- Unsorted input. Aggregate node must additionally sort all rows.
> EXPLAIN ANALYZE SELECT COUNT(DISTINCT(col1)) FROM foo;
>
>                                                      QUERY PLAN
> -------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=2584.00..2584.01 rows=1 width=8) (actual time=1684.705..1684.809 rows=1 loops=1)
>    ->  Seq Scan on foo  (cost=0.00..2334.00 rows=100000 width=71)

This output surely must be from a version of PostgreSQL prior to 1349d279? I can't quite figure out why you've added a "SET enable_seqscan = FALSE;". That makes it look like you've used the same version of PostgreSQL to produce both of these plans. The 2nd plan you've shown must be post 1349d279.

So, with the assumption that you've used 2 different versions to show this output, for post 1349d279, there does not seem to be much choice on how the aggregate is executed. What's your concern about the costings having to be accurate given there's no other plan choice?

The new pre-sorted aggregate code will always request the sort order that suits the largest number of ORDER BY / DISTINCT aggregates. When there are multiple ORDER BY / DISTINCT aggregates and they have different sort requirements, then there certainly are competing ways that the aggregation portion of the plan can be executed. I opted to make the choice just based on the number of aggregates that could become presorted. nodeAgg.c is currently not very smart about sharing sorts between multiple aggregates with the same sort requirements. If that was made better, there might be more motivation to have better costing code in make_pathkeys_for_groupagg(). However, the reasons for the reversion of db0d67db seemed to be mainly around the lack of ability to accurately cost multiple competing sort orders. We'd need to come up with some better way to do that if we were to want to give make_pathkeys_for_groupagg() similar abilities.

> Also, why does the Aggregate node sort itself? Why don't we instead emit
> an explicit Sort node in the plan so that it's visible to the user what
> is happening? As soon as there's also a GROUP BY in the query, a Sort
> node occurs in the plan. This seems inconsistent.

Post 1349d279 we do that, but it can only do it for 1 sort order. There can be any number of aggregate functions which require a sort and they don't all have to have the same sort order requirements. We can't do the same as WindowAgg does by chaining nodes together either, because the aggregate node aggregates the results and we'd need all the same input rows to be available at each step.

The only other way would be to have it so an Aggregate node could be fed by multiple different input nodes and then it would only work on the aggregates that suit the given input before reading the next input and doing the other aggregates. Making it work like that would cause quite a bit of additional effort during planning (not to mention the executor). We'd have to run the join search once per required order, which is one of the slowest parts of planning. Right now, you could probably make that work by just writing the SQL to have a subquery per sort requirement.

David
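A sketch of that last suggestion, using the foo table from upthread and assuming a suitable index exists on each sort column: with two ORDER BY aggregates that need different sort orders, moving each aggregate into its own subquery lets the planner provide each one with its own presorted input:

-- One Aggregate node, two conflicting sort requirements: at most one of
-- the aggregates can receive presorted input.
SELECT array_agg(col0 ORDER BY col0),
       array_agg(col1 ORDER BY col1)
FROM foo;

-- A subquery per sort requirement: each subquery is planned on its own,
-- so each aggregate can get input presorted its way.
SELECT (SELECT array_agg(col0 ORDER BY col0) FROM foo) AS by_col0,
       (SELECT array_agg(col1 ORDER BY col1) FROM foo) AS by_col1;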
Hi David,

Thanks for the explanations, and awesome that this was improved on already. I didn't have this change on my radar.

On 12/8/22 11:40, David Rowley wrote:
>
> This output surely must be from a version of PostgreSQL prior to
> 1349d279? I can't quite figure out why you've added a "SET
> enable_seqscan = FALSE;". That makes it look like you've used the same
> version of PostgreSQL to produce both of these plans. The 2nd plan
> you've shown must be post 1349d279.

Both plans were captured on 14.5, which is indeed prior to 1349d279.

I disabled sequential scan to show that there's an alternative plan which is superior to the chosen plan: Index Only Scan is more expensive and takes longer than the Seq Scan, but the subsequent Aggregate runs much faster as it doesn't have to sort, making the plan overall superior.

>
> So, with the assumption that you've used 2 different versions to show
> this output, for post 1349d279, there does not seem to be much choice
> on how the aggregate is executed. What's your concern about the
> costings having to be accurate given there's no other plan choice?

There's another plan choice, which is using the index to get pre-sorted input rows; see the previous comment.

>
> The new pre-sorted aggregate code will always request the sort order
> that suits the largest number of ORDER BY / DISTINCT aggregates. When
> there are multiple ORDER BY / DISTINCT aggregates and they have
> different sort requirements, then there certainly are competing ways
> that the aggregation portion of the plan can be executed. I opted to
> make the choice just based on the number of aggregates that could
> become presorted. nodeAgg.c is currently not very smart about sharing
> sorts between multiple aggregates with the same sort requirements. If
> that was made better, there might be more motivation to have better
> costing code in make_pathkeys_for_groupagg(). However, the reasons
> for the reversion of db0d67db seemed to be mainly around the lack of
> ability to accurately cost multiple competing sort orders. We'd need
> to come up with some better way to do that if we were to want to give
> make_pathkeys_for_groupagg() similar abilities.
>
>> Also, why does the Aggregate node sort itself? Why don't we instead emit
>> an explicit Sort node in the plan so that it's visible to the user what
>> is happening? As soon as there's also a GROUP BY in the query, a Sort
>> node occurs in the plan. This seems inconsistent.
>
> Post 1349d279 we do that, but it can only do it for 1 sort order.
> There can be any number of aggregate functions which require a sort
> and they don't all have to have the same sort order requirements. We
> can't do the same as WindowAgg does by chaining nodes together either,
> because the aggregate node aggregates the results and we'd need all
> the same input rows to be available at each step.
>
> The only other way would be to have it so an Aggregate node could be
> fed by multiple different input nodes and then it would only work on
> the aggregates that suit the given input before reading the next input
> and doing the other aggregates. Making it work like that would cause
> quite a bit of additional effort during planning (not to mention the
> executor). We'd have to run the join search once per required order,
> which is one of the slowest parts of planning. Right now, you could
> probably make that work by just writing the SQL to have a subquery per
> sort requirement.

Thanks for the explanation!

--
David Geier
(ServiceNow)
David Rowley <dgrowleyml@gmail.com> writes:
> So, with the assumption that you've used 2 different versions to show
> this output, for post 1349d279, there does not seem to be much choice
> on how the aggregate is executed. What's your concern about the
> costings having to be accurate given there's no other plan choice?

It's true that the cost attributed to the Agg node won't impact any live decisions in the plan level in which it appears. However, if that's a subquery, then the total cost attributed to the subplan could in principle affect plan choices in the outer query. So there is a valid argument for wanting to try to get it right.

Having said that, I think that the actual impact on outer-level choices is usually minimal. So it didn't bother me that we swept this under the rug before --- and I'm pretty sure that we're sweeping similar things under the rug elsewhere in top-of-query planning. However, given 1349d279 it should surely be true that the planner knows how many sorts it's left to be done at runtime (a number we did not have at hand before). So it seems like it ought to be simple enough to account for this effect more accurately. I'd be in favor of doing so if it's simple and cheap to get the numbers.

			regards, tom lane
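A sketch of the outer-level effect Tom describes, where bar is a hypothetical table: the cost attributed to the Agg node is part of the total cost of subquery s, and that total feeds into the join method and join order decisions of the outer query:

-- If the Agg node's cost misestimates the sort work, the total cost of
-- subquery "s" is off, and the outer query's join planning inherits the
-- error.
SELECT *
FROM (SELECT COUNT(DISTINCT col1) AS n FROM foo) s
JOIN bar ON bar.id = s.n;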
On Fri, 9 Dec 2022 at 01:12, David Geier <geidav.pg@gmail.com> wrote:
> Both plans were captured on 14.5, which is indeed prior to 1349d279.
>
> I disabled sequential scan to show that there's an alternative plan
> which is superior to the chosen plan: Index Only Scan is more expensive
> and takes longer than the Seq Scan, but the subsequent Aggregate runs
> much faster as it doesn't have to sort, making the plan overall superior.

Aha, 14.5. What's going on there is that it's still doing the sort. The aggregate code in that version does not skip the sort because of the presorted input. A likely explanation for the performance increase is the presorted check in our qsort implementation. The successful presort check is O(N), whereas an actual sort is O(N * logN).

It's true that if we had been doing proper costing on these ORDER BY / DISTINCT aggregates, we could have noticed that the input path's pathkeys indicate that no sort is required and costed accordingly. But if we'd gone to the trouble of factoring that into the costs, then it would also have made sense to make nodeAgg.c not sort on presorted input. We got the latter in 1349d279; it's just that we didn't do anything about the costings in that commit.

Anyway, in the next version of Postgres, the planner is highly likely to choose the 2nd plan in your original email. It'll also be even faster than you've shown, due to the aggregate code not having to store and read tuples in the tuplesort object, and there's no O(N) presort check either. The performance should be much closer to what it would be if you disabled seqscan and dropped the DISTINCT out of your aggregate.

David
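A sketch of the comparison David suggests at the end, reusing the settings from the first message (no timings shown, since they vary by machine and version):

SET enable_seqscan = FALSE;

-- Post 1349d279: the presorted DISTINCT aggregate neither sorts nor
-- stores and re-reads tuples in a tuplesort object.
EXPLAIN ANALYZE SELECT COUNT(DISTINCT col1) FROM foo;

-- Baseline without DISTINCT: same Index Only Scan, no sort-related work;
-- the query above should land close to this.
EXPLAIN ANALYZE SELECT COUNT(col1) FROM foo;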
On Fri, 9 Dec 2022 at 03:38, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> It's true that the cost attributed to the Agg node won't impact any
> live decisions in the plan level in which it appears. However, if
> that's a subquery, then the total cost attributed to the subplan
> could in principle affect plan choices in the outer query. So there
> is a valid argument for wanting to try to get it right.

I guess the JIT thresholds are another reason to try to make the costs a reflection of the expected run-time too.

> Having said that, I think that the actual impact on outer-level choices
> is usually minimal. So it didn't bother me that we swept this under
> the rug before --- and I'm pretty sure that we're sweeping similar
> things under the rug elsewhere in top-of-query planning. However,
> given 1349d279 it should surely be true that the planner knows how many
> sorts it's left to be done at runtime (a number we did not have at hand
> before). So it seems like it ought to be simple enough to account for
> this effect more accurately. I'd be in favor of doing so if it's
> simple and cheap to get the numbers.

Ok, probably Heikki's work in 0a2bc5d61 is a more useful piece of work to get us closer to that goal. I think all that's required to make it work is adding on the costs in the final foreach loop in get_agg_clause_costs(). The Aggrefs have already been marked as aggpresorted by that time, so it should be a matter of:

if ((aggref->aggorder != NIL || aggref->aggdistinct != NIL) &&
    !aggref->aggpresorted)
    /* add the costs for sorting this aggregate's input */

David