Thread: fix cost subqueryscan wrong parallel cost
The cost_subqueryscan function does not judge whether it is parallel.
regress
-- Incremental sort vs. set operations with varno 0
set enable_hashagg to off;
explain (costs off) select * from t union select * from t order by 1,3;
                        QUERY PLAN
----------------------------------------------------------
 Incremental Sort
   Sort Key: t.a, t.c
   Presorted Key: t.a
   ->  Unique
         ->  Sort
               Sort Key: t.a, t.b, t.c
               ->  Append
                     ->  Gather
                           Workers Planned: 2
                           ->  Parallel Seq Scan on t
                     ->  Gather
                           Workers Planned: 2
                           ->  Parallel Seq Scan on t t_1
to
 Incremental Sort
   Sort Key: t.a, t.c
   Presorted Key: t.a
   ->  Unique
         ->  Sort
               Sort Key: t.a, t.b, t.c
               ->  Gather
                     Workers Planned: 2
                     ->  Parallel Append
                           ->  Parallel Seq Scan on t
                           ->  Parallel Seq Scan on t t_1
Obviously the latter is less expensive
bucoo@sohu.com
On Tue, Apr 12, 2022 at 2:57 AM bucoo@sohu.com <bucoo@sohu.com> wrote:
> The cost_subqueryscan function does not judge whether it is parallel.

I don't see any reason why it would need to do that. A subquery scan
isn't parallel aware.

> regress
> -- Incremental sort vs. set operations with varno 0
> set enable_hashagg to off;
> explain (costs off) select * from t union select * from t order by 1,3;
>                         QUERY PLAN
> ----------------------------------------------------------
>  Incremental Sort
>    Sort Key: t.a, t.c
>    Presorted Key: t.a
>    ->  Unique
>          ->  Sort
>                Sort Key: t.a, t.b, t.c
>                ->  Append
>                      ->  Gather
>                            Workers Planned: 2
>                            ->  Parallel Seq Scan on t
>                      ->  Gather
>                            Workers Planned: 2
>                            ->  Parallel Seq Scan on t t_1
> to
>  Incremental Sort
>    Sort Key: t.a, t.c
>    Presorted Key: t.a
>    ->  Unique
>          ->  Sort
>                Sort Key: t.a, t.b, t.c
>                ->  Gather
>                      Workers Planned: 2
>                      ->  Parallel Append
>                            ->  Parallel Seq Scan on t
>                            ->  Parallel Seq Scan on t t_1
> Obviously the latter is less expensive

Generally it should be. But there's no subquery scan visible here.

There may well be something wrong here, but I don't think that you've
diagnosed the problem correctly, or explained it clearly.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Fri, Apr 15, 2022 at 12:50 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Apr 12, 2022 at 2:57 AM bucoo@sohu.com <bucoo@sohu.com> wrote:
> > The cost_subqueryscan function does not judge whether it is parallel.
>
> I don't see any reason why it would need to do that. A subquery scan
> isn't parallel aware.
>
> > regress
> > -- Incremental sort vs. set operations with varno 0
> > set enable_hashagg to off;
> > explain (costs off) select * from t union select * from t order by 1,3;
> >                         QUERY PLAN
> > ----------------------------------------------------------
> >  Incremental Sort
> >    Sort Key: t.a, t.c
> >    Presorted Key: t.a
> >    ->  Unique
> >          ->  Sort
> >                Sort Key: t.a, t.b, t.c
> >                ->  Append
> >                      ->  Gather
> >                            Workers Planned: 2
> >                            ->  Parallel Seq Scan on t
> >                      ->  Gather
> >                            Workers Planned: 2
> >                            ->  Parallel Seq Scan on t t_1
> > to
> >  Incremental Sort
> >    Sort Key: t.a, t.c
> >    Presorted Key: t.a
> >    ->  Unique
> >          ->  Sort
> >                Sort Key: t.a, t.b, t.c
> >                ->  Gather
> >                      Workers Planned: 2
> >                      ->  Parallel Append
> >                            ->  Parallel Seq Scan on t
> >                            ->  Parallel Seq Scan on t t_1
> > Obviously the latter is less expensive
>
> Generally it should be. But there's no subquery scan visible here.
The paths of the subtrees in set operations would be of type
SubqueryScan. The SubqueryScan nodes are removed later in
set_plan_references() in this case, as they are considered trivial.
> There may well be something wrong here, but I don't think that you've
> diagnosed the problem correctly, or explained it clearly.
Some debugging work shows that the second path is generated but then
fails when competing with the first path. So if there is something
wrong, I think cost calculation is the suspicious point.
Not related to this topic but I noticed another problem from the plan.
Note the first Sort node which is to unique-ify the result of the UNION.
Why cannot we re-arrange the sort keys from (a, b, c) to (a, c, b) so
that we can avoid the second Sort node?
Thanks
Richard
> Generally it should be. But there's no subquery scan visible here.

I wrote a patch for distinct/union and aggregate support last year (I
want to restart it again):
https://www.postgresql.org/message-id/2021091517250848215321%40sohu.com
If this patch is not applied, some parallel paths will never be selected.

> Some debugging work shows that the second path is generated but then
> fails when competing with the first path. So if there is something
> wrong, I think cost calculation is the suspicious point.

Maybe. I will check it again.

> Not related to this topic but I noticed another problem from the plan.
> Note the first Sort node which is to unique-ify the result of the UNION.
> Why cannot we re-arrange the sort keys from (a, b, c) to (a, c, b) so
> that we can avoid the second Sort node?

This is a regression test, just for testing the Incremental Sort plan.
On Fri, Apr 15, 2022 at 05:16:44PM +0800, Richard Guo wrote:
> Not related to this topic but I noticed another problem from the plan.
> Note the first Sort node which is to unique-ify the result of the UNION.
> Why cannot we re-arrange the sort keys from (a, b, c) to (a, c, b) so
> that we can avoid the second Sort node?

I don't know, but it's possible there's a solution related to commit
db0d67db2 "Optimize order of GROUP BY keys" - DISTINCT is the same as
GROUP BY k1, ..., kN. I guess UNION [DISTINCT] should learn to use
GROUP BY rather than DISTINCT?

--
Justin
On Fri, Apr 15, 2022 at 6:06 AM bucoo@sohu.com <bucoo@sohu.com> wrote:
> > Generally it should be. But there's no subquery scan visible here.
> I wrote a patch for distinct/union and aggregate support last year (I
> want to restart it again):
> https://www.postgresql.org/message-id/2021091517250848215321%40sohu.com
> If this patch is not applied, some parallel paths will never be selected.

Sure, but that doesn't make the patch correct. The patch proposes
that, when parallelism is in use, a subquery scan will produce fewer
rows than when parallelism is not in use, and that's 100% false.
Compare this with the case of a parallel sequential scan. If a table
contains 1000 rows, and we scan it with a regular Seq Scan, the Seq
Scan will return 1000 rows. But if we scan it with a Parallel Seq
Scan using say 4 workers, the number of rows returned in each worker
will be substantially less than 1000, because 1000 is now the *total*
number of rows to be returned across *all* processes, and what we need
is the number of rows returned in *each* process.

The same thing isn't true for a subquery scan. Consider:

Gather
  -> Subquery Scan
     -> Parallel Seq Scan

One thing is for sure: the number of rows that will be produced by the
subquery scan in each backend is exactly equal to the number of rows
that the subquery scan receives from its subpath. Parallel Seq Scan
can't just return a row count estimate based on the number of rows in
the table, because those rows are going to be divided among the
workers. But the Subquery Scan doesn't do anything like that. If it
receives let's say 250 rows as input in each worker, it's going to
produce 250 output rows in each worker. Your patch says it's going to
produce fewer than that, and that's wrong, regardless of whether it
gives you the plan you want in this particular case.

--
Robert Haas
EDB: http://www.enterprisedb.com
> Sure, but that doesn't make the patch correct. The patch proposes
> that, when parallelism in use, a subquery scan will produce fewer rows
> than when parallelism is not in use, and that's 100% false. Compare
> this with the case of a parallel sequential scan. If a table contains
> 1000 rows, and we scan it with a regular Seq Scan, the Seq Scan will
> return 1000 rows. But if we scan it with a Parallel Seq Scan using
> say 4 workers, the number of rows returned in each worker will be
> substantially less than 1000, because 1000 is now the *total* number
> of rows to be returned across *all* processes, and what we need is the
> number of rows returned in *each* process.
For now, function cost_subqueryscan always uses the *total* rows even
for a parallel path, like this:
Gather (rows=30000)
Workers Planned: 2
-> Subquery Scan (rows=30000) -- *total* rows, should be equal subpath
-> Parallel Seq Scan (rows=10000)
Maybe the codes:
/* Mark the path with the correct row estimate */
if (param_info)
path->path.rows = param_info->ppi_rows;
else
path->path.rows = baserel->rows;
should change to:
/* Mark the path with the correct row estimate */
if (path->path.parallel_workers > 0)
path->path.rows = path->subpath->rows;
else if (param_info)
path->path.rows = param_info->ppi_rows;
else
path->path.rows = baserel->rows;
bucoo@sohu.com
On Wed, Apr 20, 2022 at 10:01 AM bucoo@sohu.com <bucoo@sohu.com> wrote:
> for now function cost_subqueryscan always using *total* rows even parallel
> path. like this:
>
> Gather (rows=30000)
> Workers Planned: 2
> -> Subquery Scan (rows=30000) -- *total* rows, should be equal subpath
> -> Parallel Seq Scan (rows=10000)

OK, that's bad.

> Maybe the codes:
>
> /* Mark the path with the correct row estimate */
> if (param_info)
> path->path.rows = param_info->ppi_rows;
> else
> path->path.rows = baserel->rows;
>
> should change to:
>
> /* Mark the path with the correct row estimate */
> if (path->path.parallel_workers > 0)
> path->path.rows = path->subpath->rows;
> else if (param_info)
> path->path.rows = param_info->ppi_rows;
> else
> path->path.rows = baserel->rows;

Suppose parallelism is not in use and that param_info is NULL. Then,
is path->subpath->rows guaranteed to be equal to baserel->rows? If
yes, then we don't need a three-part if statement as you propose
here and can just change the "else" clause to say path->path.rows =
path->subpath->rows. If no, then your change gives the wrong answer.

--
Robert Haas
EDB: http://www.enterprisedb.com
> > for now function cost_subqueryscan always using *total* rows even parallel
> > path. like this:
> >
> > Gather (rows=30000)
> > Workers Planned: 2
> > -> Subquery Scan (rows=30000) -- *total* rows, should be equal subpath
> > -> Parallel Seq Scan (rows=10000)
>
> OK, that's bad.
>
> > Maybe the codes:
> >
> > /* Mark the path with the correct row estimate */
> > if (param_info)
> > path->path.rows = param_info->ppi_rows;
> > else
> > path->path.rows = baserel->rows;
> >
> > should change to:
> >
> > /* Mark the path with the correct row estimate */
> > if (path->path.parallel_workers > 0)
> > path->path.rows = path->subpath->rows;
> > else if (param_info)
> > path->path.rows = param_info->ppi_rows;
> > else
> > path->path.rows = baserel->rows;
>
> Suppose parallelism is not in use and that param_info is NULL. Then,
> is path->subpath->rows guaranteed to be equal to baserel->rows? If
> yes, then we don't need a three-part if statement as you propose
> here and can just change the "else" clause to say path->path.rows =
> path->subpath->rows. If no, then your change gives the wrong answer.
I checked some regression tests. Sometimes a subquery scan has a filter,
so path->subpath->rows is guaranteed *not* to be equal to baserel->rows.
If the first patch is wrong, I don't know how to fix this;
it looks like I need someone's help.
On Thu, Apr 21, 2022 at 2:38 AM bucoo@sohu.com <bucoo@sohu.com> wrote:
> > Suppose parallelism is not in use and that param_info is NULL. Then,
> > is path->subpath->rows guaranteed to be equal to baserel->rows? If
> > yes, then we don't need a three-part if statement as you propose
> > here and can just change the "else" clause to say path->path.rows =
> > path->subpath->rows. If no, then your change gives the wrong answer.
>
> I checked some regression tests. Sometimes a subquery scan has a filter,
> so path->subpath->rows is guaranteed *not* to be equal to baserel->rows.
> If the first patch is wrong, I don't know how to fix this;
> it looks like I need someone's help.

Please fix your mailer so that it doesn't send me a bounce message
every time I reply to one of your messages on list.

I don't know how to fix this right now either, then; maybe I or
someone else will have a good idea later.

--
Robert Haas
EDB: http://www.enterprisedb.com
> > > Suppose parallelism is not in use and that param_info is NULL. Then,
> > > is path->subpath->rows guaranteed to be equal to baserel->rows? If
> > > yes, then we don't need a three-part if statement as you propose
> > > here and can just change the "else" clause to say path->path.rows =
> > > path->subpath->rows. If no, then your change gives the wrong answer.
> >
> > I checked some regression tests. Sometimes a subquery scan has a filter,
> > so path->subpath->rows is guaranteed *not* to be equal to baserel->rows.
> > If the first patch is wrong, I don't know how to fix this;
> > it looks like I need someone's help.
>
> Please fix your mailer so that it doesn't send me a bounce message
> every time I reply to one of your messages on list.

This message was sent using Outlook.

> I don't know how to fix this right now either, then; maybe I or
> someone else will have a good idea later.

I don't know either.
On Wed, Apr 20, 2022 at 11:38 PM bucoo@sohu.com <bucoo@sohu.com> wrote:
> > > for now function cost_subqueryscan always using *total* rows even parallel
> > > path. like this:
> > >
> > > Gather (rows=30000)
> > > Workers Planned: 2
> > > -> Subquery Scan (rows=30000) -- *total* rows, should be equal subpath
> > > -> Parallel Seq Scan (rows=10000)
> >
> > OK, that's bad.

I don't understand how that plan shape is possible. Gather requires a
parallel-aware subpath, so said subpath can be executed multiple times
in parallel, and subquery isn't. If there is parallelism happening
within a subquery, the results are consolidated using Append or Gather
first - and the output rows of that path entry (all subpaths of
Subquery have the same ->rows value per set_subquery_size_estimates)
become the input tuples for Subquery, to which it then applies its
selectivity multiplier and stores the final result in baserel->rows;
which the costing code then examines when costing the RTE_SUBQUERY
path entry.
David J.
On Fri, Apr 22, 2022 at 11:55 AM David G. Johnston
<david.g.johnston@gmail.com> wrote:
> On Wed, Apr 20, 2022 at 11:38 PM bucoo@sohu.com <bucoo@sohu.com> wrote:
> > > > for now function cost_subqueryscan always using *total* rows even parallel
> > > > path. like this:
> > > >
> > > > Gather (rows=30000)
> > > > Workers Planned: 2
> > > > -> Subquery Scan (rows=30000) -- *total* rows, should be equal subpath
> > > > -> Parallel Seq Scan (rows=10000)
> > >
> > > OK, that's bad.
>
> I don't understand how that plan shape is possible. Gather requires a
> parallel-aware subpath, so said subpath can be executed multiple times
> in parallel, and subquery isn't. If there is parallelism happening
> within a subquery, the results are consolidated using Append or Gather
> first - and the output rows of that path entry (all subpaths of
> Subquery have the same ->rows value per set_subquery_size_estimates)
> become the input tuples for Subquery, to which it then applies its
> selectivity multiplier and stores the final result in baserel->rows;
> which the costing code then examines when costing the RTE_SUBQUERY
> path entry.

Gather doesn't require a parallel aware subpath, just a parallel-safe
subpath. In a case like this, the parallel seq scan will divide the
rows from the underlying relation across the three processes executing
it. Each process will pass the rows it receives through its own copy
of the subquery scan. Then, the Gather node will collect all the rows
from all the workers to produce the final result.

It's an extremely important feature of parallel query that the
parallel-aware node doesn't have to be immediately beneath the Gather.
You need to have a parallel-aware node in there someplace, but it
could be separated from the gather by any number of levels e.g.

Gather
  -> Nested Loop
     -> Nested Loop
        -> Nested Loop
           -> Parallel Seq Scan
           -> Index Scan
        -> Index Scan
     -> Index Scan

You can stick as many parameterized index scans in there as you like
and you still only need one parallel-aware node at the bottom. Once
the parallel seq scan divides up the rows across workers, each worker
can perform the index lookups for the rows that it receives without
any coordination with other workers. It neither knows nor cares that
this is happening in the midst of an operation that is parallel
overall; all the nested loops and index scans work just as they would
in a non-parallel plan. The only things that needed to know about
parallelism are the operation that is dividing up the rows among
workers (here, the parallel seq scan) and the operation that is
gathering up all the results produced by individual workers (here, the
gather node).

--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Apr 28, 2022 at 9:53 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Apr 22, 2022 at 11:55 AM David G. Johnston
> <david.g.johnston@gmail.com> wrote:
> > On Wed, Apr 20, 2022 at 11:38 PM bucoo@sohu.com <bucoo@sohu.com> wrote:
> > > > > for now function cost_subqueryscan always using *total* rows even parallel
> > > > > path. like this:
> > > > >
> > > > > Gather (rows=30000)
> > > > > Workers Planned: 2
> > > > > -> Subquery Scan (rows=30000) -- *total* rows, should be equal subpath
> > > > > -> Parallel Seq Scan (rows=10000)
> > > >
> > > > OK, that's bad.
>
> Gather doesn't require a parallel aware subpath, just a parallel-safe
> subpath. In a case like this, the parallel seq scan will divide the
> rows from the underlying relation across the three processes executing
> it. Each process will pass the rows it receives through its own copy
> of the subquery scan. Then, the Gather node will collect all the rows
> from all the workers to produce the final result.
Thank you. I think I got off on a tangent there and do understand the general design here better now.
I feel like the fact that the 2.4 divisor (for 2 planned workers) isn't shown in the explain plan anywhere is an oversight.
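The 2.4 figure itself can be re-derived from the planner's leader-participation heuristic. The sketch below is not the PostgreSQL source, just a hypothetical reconstruction following the logic of get_parallel_divisor() in costsize.c (with the leader participating): each planned worker counts as 1.0, and the leader adds a contribution that shrinks by 0.3 per worker.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical re-derivation of the planner's parallel divisor,
 * following the logic of get_parallel_divisor() in costsize.c
 * with parallel_leader_participation on. Not the actual source. */
static double
parallel_divisor(int parallel_workers)
{
    double divisor = parallel_workers;
    /* the leader helps too, but less and less as workers are added */
    double leader_contribution = 1.0 - 0.3 * parallel_workers;

    if (leader_contribution > 0.0)
        divisor += leader_contribution;
    return divisor;
}
```

With 2 planned workers this gives 2 + (1.0 - 0.6) = 2.4; from about 4 workers up, the leader contribution drops out and the divisor is simply the worker count.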
To move the original complaint forward a bit I am posting the three plan changes that using path->subpath->rows provokes in the regression tests.
======================================================================
--
-- Check for incorrect optimization when IN subquery contains a SRF
--
select * from int4_tbl o where (f1, f1) in
(select f1, generate_series(1,50) / 10 g from int4_tbl i group by f1);
The material difference between the existing plan and this one is the estimation of 250 rows
here compared to 1 row.
So (rel.rows != path->subpath->rows) at the top of cost_subqueryscan
+ -> Subquery Scan on "ANY_subquery" (cost=1.06..9.28 rows=250 width=8)
+ Output: "ANY_subquery".f1, "ANY_subquery".g
+ Filter: ("ANY_subquery".f1 = "ANY_subquery".g)
+ -> Result (cost=1.06..6.15 rows=250 width=8)
======================================================================
The second plan change is basically this same thing, going from rows=4 to rows=1
causes the plan to include a materialize node. The shape for purposes of the security barrier
remains correct.
======================================================================
select * from t union select * from t order by 1,3;
Gather here costs 2,600 vs the Append being 2,950 in the existing plan shape.
+ -> Gather (cost=0.00..2600.00 rows=120000 width=12)
+ Workers Planned: 2
+ -> Parallel Append (cost=0.00..2600.00 rows=50000 width=12)
+ -> Parallel Seq Scan on t (cost=0.00..575.00 rows=25000 width=12)
+ -> Parallel Seq Scan on t t_1 (cost=0.00..575.00 rows=25000 width=12)
=======================================================================
I've attached the two raw regression output diffs.
Using path->subpath->rows ignores the impact of the node's own filters,
but the base pre-filter number is/should be the correct one; though it
is difficult to say that with certainty when most of these nodes are
discarded and one cannot debug in the middle but only observe the end
results. Disabling that optimization is presently beyond my skill,
though I may take it up anyway: it is likely still orders of magnitude
easier to do that, and then hope some of these plans produce data to
check against, than to dive into a C debugger for the first time.

Reverse-engineering the 350 difference may be another approach - namely,
is it strictly due to the different plan shape, or is it due to the
number of rows? The fact that the row difference is 35,000 and the cost
is 1% (cpu_tuple_cost = 0.01) of that number seems like a red herring
after thinking it through... too many scans plus the differing shapes.
David J.
On Fri, Apr 29, 2022 at 12:53 AM Robert Haas <robertmhaas@gmail.com> wrote:
> Gather doesn't require a parallel aware subpath, just a parallel-safe
> subpath. In a case like this, the parallel seq scan will divide the
> rows from the underlying relation across the three processes executing
> it. Each process will pass the rows it receives through its own copy
> of the subquery scan. Then, the Gather node will collect all the rows
> from all the workers to produce the final result.
>
> It's an extremely important feature of parallel query that the
> parallel-aware node doesn't have to be immediately beneath the Gather.
> You need to have a parallel-aware node in there someplace, but it
> could be separated from the gather by any number of levels e.g.
>
> Gather
>   -> Nested Loop
>      -> Nested Loop
>         -> Nested Loop
>            -> Parallel Seq Scan
>            -> Index Scan
>         -> Index Scan
>      -> Index Scan
Thanks for the explanation. That's really helpful for understanding
the parallel query mechanism.

So for the nodes between Gather and the parallel-aware node, how
should we calculate their estimated rows?
Currently subquery scan is using rel->rows (if no parameterization),
which I believe is not correct. That's not the size the subquery scan
node in each worker needs to handle, as the rows have been divided
across workers by the parallel-aware node.
Using subpath->rows is not correct either, as subquery scan node may
have quals.
It seems to me the right way is to divide the rel->rows among all the
workers.
Thanks
Richard
Richard Guo <guofenglinux@gmail.com> writes:
> Currently subquery scan is using rel->rows (if no parameterization),
> which I believe is not correct. That's not the size the subquery scan
> node in each worker needs to handle, as the rows have been divided
> across workers by the parallel-aware node.

Really? Maybe I misunderstand the case under consideration, but what I
think will be happening is that each worker will re-execute the
pushed-down subquery in full. Otherwise it can't compute the correct
answer. What gets divided across the set of workers is the total
*number of executions* of the subquery, which should be independent of
the number of workers, so that the cost is (more or less) the same as
the non-parallel case.

At least that's true for a standard correlated subplan, which is
normally run again for each row processed by the parent node. For
hashed subplans and initplans, what would have been "execute once"
semantics becomes "execute once per worker", creating a strict cost
disadvantage for parallelization. I don't know whether the current
costing model accounts for that. But if it does that wrong,
arbitrarily altering the number of rows won't make it better.

			regards, tom lane
On Fri, Apr 29, 2022 at 7:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Richard Guo <guofenglinux@gmail.com> writes:
> > Currently subquery scan is using rel->rows (if no parameterization),
> > which I believe is not correct. That's not the size the subquery scan
> > node in each worker needs to handle, as the rows have been divided
> > across workers by the parallel-aware node.
>
> Really? Maybe I misunderstand the case under consideration, but
> what I think will be happening is that each worker will re-execute
> the pushed-down subquery in full. Otherwise it can't compute the
> correct answer. What gets divided across the set of workers is
> the total *number of executions* of the subquery, which should be
> independent of the number of workers, so that the cost is (more
> or less) the same as the non-parallel case.
I've been operating under the belief that a subquery node (or, rather,
all nodes) in a path get their row count estimates from
self-selectivity * rows provided by the next node down in the path.
In a partial path, eventually the parallel-aware node at the bottom of
the path divides its row count estimate by the parallel divisor (2.4
for two workers). That value then returns back up through the path
until it hits a gather. Every node in between, which are almost
exclusively parallel-safe (not parallel-aware), can just use the row
count being percolated up the path (i.e., they get the divided row
count in their partial pathing and full row counts in the normal
pathing). The gather node, realizing that the row count it is dealing
with has been divided by the parallel divisor, undoes the division in
order to provide the correct row count to the non-parallel executing
nodes above it.

So a subquery is only ever executed once in a path - but the number of
returned rows depends on the number of planned workers, under the
assumption that a query executed among 2 workers costs 1/2.4 the
amount of the same query done with just the leader. It is an
independent factor compared to everything else going on, and so the
multiplication can simply be done first to the original row count.
<starts looking for confirmation in the code>
allpaths.c@2974-2975 (generate_gather_paths)
(L: 2993 as well)
(L: 3167 - generate_useful_gather_paths)
rows =
cheapest_partial_path->rows * cheapest_partial_path->parallel_workers;
Shouldn't that be:
rows =
cheapest_partial_path->rows * (get_parallel_divisor(cheapest_partial_path))
?
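For concreteness, here is a hypothetical side-by-side of the two formulas (function names invented for illustration; the divisor mirrors get_parallel_divisor() in costsize.c). With a partial path estimating 25000 rows per process and 2 planned workers, the two scale-ups disagree:

```c
#include <assert.h>
#include <math.h>

/* Two candidate ways for Gather to scale a partial path's per-worker
 * row estimate back up to a total. Illustrative sketch only. */
static double
gather_rows_by_workers(double partial_rows, int parallel_workers)
{
    /* what the allpaths.c snippet quoted above does */
    return partial_rows * parallel_workers;
}

static double
gather_rows_by_divisor(double partial_rows, int parallel_workers)
{
    /* the alternative asked about: undo the division exactly */
    double divisor = parallel_workers;
    double leader_contribution = 1.0 - 0.3 * parallel_workers;

    if (leader_contribution > 0.0)
        divisor += leader_contribution;
    return partial_rows * divisor;
}
```

The second form exactly undoes the division the parallel-aware node applied, recovering a 60000-row total from a 25000-row partial estimate (the figures seen on the Gather and Parallel Seq Scan nodes elsewhere in this thread); the first form leaves 50000.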
David J.
I wrote:
> Really? Maybe I misunderstand the case under consideration, but
> what I think will be happening is that each worker will re-execute
> the pushed-down subquery in full.

Oh ... nevermind that: what I was thinking about was the
SubLink/SubPlan case, which is unrelated to SubqueryScan.
-ENOCAFFEINE, sorry about that.

			regards, tom lane
I wrote:
> -ENOCAFFEINE, sorry about that.

As penance for that blunder, I spent a little time looking into this
issue (responding to Robert's upthread complaint that it wasn't
explained clearly). See the test case in parallel_subquery.sql,
attached, which is adapted from the test in incremental_sort.sql.
It produces these plans:

explain select * from t union select * from t;

 Unique  (cost=29344.85..30544.85 rows=120000 width=12)
   ->  Sort  (cost=29344.85..29644.85 rows=120000 width=12)
         Sort Key: t.a, t.b, t.c
         ->  Append  (cost=0.00..2950.00 rows=120000 width=12)
               ->  Gather  (cost=0.00..575.00 rows=60000 width=12)
                     Workers Planned: 2
                     ->  Parallel Seq Scan on t  (cost=0.00..575.00 rows=25000 width=12)
               ->  Gather  (cost=0.00..575.00 rows=60000 width=12)
                     Workers Planned: 2
                     ->  Parallel Seq Scan on t t_1  (cost=0.00..575.00 rows=25000 width=12)

explain select * from t union all select * from t;

 Gather  (cost=0.00..1400.00 rows=120000 width=12)
   Workers Planned: 2
   ->  Parallel Append  (cost=0.00..1400.00 rows=50000 width=12)
         ->  Parallel Seq Scan on t  (cost=0.00..575.00 rows=25000 width=12)
         ->  Parallel Seq Scan on t t_1  (cost=0.00..575.00 rows=25000 width=12)

I take no position on whether the second plan's costs are correct;
but if they are, then the Gather-atop-Append structure is clearly
cheaper than the Append-atop-Gathers structure, so the planner made
the wrong choice in the first case.
I then modified setrefs.c to not remove SubqueryScan nodes (just make
trivial_subqueryscan() return constant false) and got this output from
the UNION case:

 Unique  (cost=29344.85..30544.85 rows=120000 width=12)
   ->  Sort  (cost=29344.85..29644.85 rows=120000 width=12)
         Sort Key: "*SELECT* 1".a, "*SELECT* 1".b, "*SELECT* 1".c
         ->  Append  (cost=0.00..2950.00 rows=120000 width=12)
               ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..1175.00 rows=60000 width=12)
                     ->  Gather  (cost=0.00..575.00 rows=60000 width=12)
                           Workers Planned: 2
                           ->  Parallel Seq Scan on t  (cost=0.00..575.00 rows=25000 width=12)
               ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..1175.00 rows=60000 width=12)
                     ->  Gather  (cost=0.00..575.00 rows=60000 width=12)
                           Workers Planned: 2
                           ->  Parallel Seq Scan on t t_1  (cost=0.00..575.00 rows=25000 width=12)

The UNION ALL plan doesn't change, because no SubqueryScans are ever
created in that case (we flattened the UNION ALL to an appendrel early
on). Removing the test case's settings to allow a parallel plan, the
non-parallelized plan for the UNION case looks like

 Unique  (cost=30044.85..31244.85 rows=120000 width=12)
   ->  Sort  (cost=30044.85..30344.85 rows=120000 width=12)
         Sort Key: "*SELECT* 1".a, "*SELECT* 1".b, "*SELECT* 1".c
         ->  Append  (cost=0.00..3650.00 rows=120000 width=12)
               ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..1525.00 rows=60000 width=12)
                     ->  Seq Scan on t  (cost=0.00..925.00 rows=60000 width=12)
               ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..1525.00 rows=60000 width=12)
                     ->  Seq Scan on t t_1  (cost=0.00..925.00 rows=60000 width=12)

So: the row counts for SubqueryScan are correct, thus the upthread
proposal to change them is not. The cost estimates, however, seem
pretty bogus. What we're seeing here is that we're charging 600 cost
units to pass the data through SubqueryScan, and that's consistent
across the parallel and non-parallel cases, so it's correct on its own
terms. But actually it's totally wrong because we're going to elide
that node later.
So I think the actual problem here is that we leave the decision to
elide no-op SubqueryScan nodes till setrefs.c. We should detect that
earlier, and when it applies, label the SubqueryScanPath with the
exact cost of its child.

(I think the current implementation might've been all right when it
was devised, on the grounds that we had no real planning flexibility
for UNION operations anyway. But now we do, and the bogus cost charge
is causing visible problems.)

			regards, tom lane

parallel_subquery.sql (attachment):

drop table if exists t;
create table t (a int, b int, c int);
insert into t select mod(i,10),mod(i,10),i from generate_series(1,60000) s(i);
create index on t (a);
analyze t;

set min_parallel_table_scan_size = '1kB';
set min_parallel_index_scan_size = '1kB';
set parallel_setup_cost = 0;
set parallel_tuple_cost = 0;
set max_parallel_workers_per_gather = 2;
set enable_hashagg to off;

explain select * from t union select * from t;
explain select * from t union all select * from t;
I wrote:
> So I think the actual problem here is that we leave the decision
> to elide no-op SubqueryScan nodes till setrefs.c. We should detect
> that earlier, and when it applies, label the SubqueryScanPath with
> the exact cost of its child.

Hmm ... actually, while doing that seems like it'd be a good idea,
it doesn't have much bearing on the case at hand. I approximated
the results of such a change on this test case by just removing the
"cpu_tuple_cost" component from cost_subqueryscan:

@@ -1420,7 +1420,7 @@ cost_subqueryscan(SubqueryScanPath *path, PlannerInfo *root,
     get_restriction_qual_cost(root, baserel, param_info, &qpqual_cost);

     startup_cost = qpqual_cost.startup;
-    cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple;
+    cpu_per_tuple = qpqual_cost.per_tuple;
     run_cost = cpu_per_tuple * baserel->tuples;

     /* tlist eval costs are paid per output row, not per tuple scanned */

and what I got was

 Unique  (cost=28144.85..29344.85 rows=120000 width=12)
   ->  Sort  (cost=28144.85..28444.85 rows=120000 width=12)
         Sort Key: "*SELECT* 1".a, "*SELECT* 1".b, "*SELECT* 1".c
         ->  Append  (cost=0.00..1750.00 rows=120000 width=12)
               ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..575.00 rows=60000 width=12)
                     ->  Gather  (cost=0.00..575.00 rows=60000 width=12)
                           Workers Planned: 2
                           ->  Parallel Seq Scan on t  (cost=0.00..575.00 rows=25000 width=12)
               ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..575.00 rows=60000 width=12)
                     ->  Gather  (cost=0.00..575.00 rows=60000 width=12)
                           Workers Planned: 2
                           ->  Parallel Seq Scan on t t_1  (cost=0.00..575.00 rows=25000 width=12)

The subqueryscans are being costed the way we want now, but it's still
going for the append-atop-gathers plan. So I dug a bit deeper, and
found that generate_union_paths does also create the
gather-atop-append plan, but *it's costed at 1750 units*, exactly the
same as the other path. So add_path is just making an arbitrary
choice between two paths of identical cost, and it happens to pick
this one.
(If I don't make the above tweak to cost_subqueryscan, the same thing
happens, although the two competing paths now each have cost 2950.)

So: why do we come out with a cost of 1750 for the very same plan that
in the UNION ALL case is costed at 1400?  AFAICS it's because the
UNION ALL case thinks that the two inputs of the parallel append
produce 25000 rows apiece so the parallel append produces 50000 rows,
and it costs the append's overhead on that basis.  But in the UNION
case, the parallel append sees two inputs that are claimed to return
60000 rows, so the append produces 120000 rows, meaning more append
overhead.

We can't readily get EXPLAIN to print this tree since it's rejected by
add_path, but what I'm seeing (with the above costing hack) is

{GATHERPATH
   :rows 120000
   :startup_cost 0.00
   :total_cost 1750.00
   :subpath
      {APPENDPATH
      :parallel_aware true
      :parallel_safe true
      :parallel_workers 2
      :rows 120000
      :startup_cost 0.00
      :total_cost 1750.00
      :subpaths (
         {SUBQUERYSCANPATH
         :rows 60000
         :startup_cost 0.00
         :total_cost 575.00
         :subpath
            {PATH
            :parallel_aware true
            :parallel_safe true
            :parallel_workers 2
            :rows 25000
            :startup_cost 0.00
            :total_cost 575.00
            }
         }
         {SUBQUERYSCANPATH
         :rows 60000
         :startup_cost 0.00
         :total_cost 575.00
         :subpath
            {PATH
            :parallel_aware true
            :parallel_safe true
            :parallel_workers 2
            :rows 25000
            :startup_cost 0.00
            :total_cost 575.00
            }
         }
      )
      }
}

In short, these SubqueryScans are being labeled as producing 60000 rows
when their input only produces 25000 rows, which is surely insane.

So: even though the SubqueryScan itself isn't parallel-aware, the
number of rows it processes has to be de-rated according to the number
of workers involved.  Perhaps something like the original patch in this
thread is indeed correct.  But I can't help feeling that this
definition of path_rows is mighty unintuitive and error-prone, and that
if cost_subqueryscan is wrong then it's likely got lots of company.
Maybe we need to take two steps back and rethink the whole approach.
regards, tom lane
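[Editorial note: the 60000-vs-25000 discrepancy discussed above follows from how the planner derates partial-path row counts per participating process. The sketch below is an illustrative model of that arithmetic, mirroring the leader-contribution rule used by costsize.c's get_parallel_divisor(); it is not the actual planner code.]

```python
def parallel_divisor(workers, leader_participates=True):
    """Approximate the planner's get_parallel_divisor(): each worker
    counts as 1, and a participating leader contributes a fraction
    that shrinks as workers are added (1.0 minus 0.3 per worker)."""
    divisor = float(workers)
    if leader_participates:
        leader_contribution = 1.0 - 0.3 * workers
        if leader_contribution > 0:
            divisor += leader_contribution
    return divisor

# The relation holds 60000 rows; with 2 planned workers the partial
# (per-process) row estimate is derated by the divisor:
div = parallel_divisor(2)      # 2 workers + 0.4 leader = 2.4
print(div, 60000 / div)        # 2.4 25000.0, matching the path dump
```

With 4 or more workers the leader contribution drops to zero and the divisor is just the worker count, so the per-process estimate is simply rows divided by workers.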
On Fri, Apr 29, 2022 at 11:09 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
In short, these SubqueryScans are being labeled as producing 60000 rows
when their input only produces 25000 rows, which is surely insane.
So: even though the SubqueryScan itself isn't parallel-aware, the number
of rows it processes has to be de-rated according to the number of workers
involved.
Right, so why does baserel.rows show 60,000 here when path->subpath->rows only shows 25,000? Because if you substitute path->subpath->rows for baserel.rows in cost_subqueryscan you get (with your cost change above):
 Incremental Sort  (cost=27875.50..45577.57 rows=120000 width=12) (actual time=165.285..235.749 rows=60000 loops=1)
   Sort Key: "*SELECT* 1".a, "*SELECT* 1".c
   Presorted Key: "*SELECT* 1".a
   Full-sort Groups: 10  Sort Method: quicksort  Average Memory: 28kB  Peak Memory: 28kB
   Pre-sorted Groups: 10  Sort Method: quicksort  Average Memory: 521kB  Peak Memory: 521kB
   ->  Unique  (cost=27794.85..28994.85 rows=120000 width=12) (actual time=157.882..220.501 rows=60000 loops=1)
         ->  Sort  (cost=27794.85..28094.85 rows=120000 width=12) (actual time=157.881..187.232 rows=120000 loops=1)
               Sort Key: "*SELECT* 1".a, "*SELECT* 1".b, "*SELECT* 1".c
               Sort Method: external merge  Disk: 2600kB
               ->  Gather  (cost=0.00..1400.00 rows=120000 width=12) (actual time=0.197..22.705 rows=120000 loops=1)
                     Workers Planned: 2
                     Workers Launched: 2
                     ->  Parallel Append  (cost=0.00..1400.00 rows=50000 width=12) (actual time=0.015..13.101 rows=40000 loops=3)
                           ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..575.00 rows=25000 width=12) (actual time=0.014..6.864 rows=30000 loops=2)
                                 ->  Parallel Seq Scan on t  (cost=0.00..575.00 rows=25000 width=12) (actual time=0.014..3.708 rows=30000 loops=2)
                           ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..575.00 rows=25000 width=12) (actual time=0.010..6.918 rows=30000 loops=2)
                                 ->  Parallel Seq Scan on t t_1  (cost=0.00..575.00 rows=25000 width=12) (actual time=0.010..3.769 rows=30000 loops=2)
 Planning Time: 0.137 ms
 Execution Time: 239.958 ms
(19 rows)
Which shows your 1400 cost goal from union all, and the expected row counts, for gather-atop-append.
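[Editorial note: one detail worth decoding in the EXPLAIN ANALYZE output above is that actual row counts are averaged over loops (here, over participating processes), so reconciling estimates with actuals is arithmetic like the following illustrative calculation, not planner code.]

```python
def explain_actual_total(rows_per_loop, loops):
    """EXPLAIN ANALYZE prints actual rows divided by loops;
    multiplying back recovers the total rows the node processed."""
    return rows_per_loop * loops

# Parallel Seq Scan on t: rows=30000 loops=2 -> 60000 rows in total,
# matching the table size; the 25000 estimate is per-process
# (60000 derated by the parallel divisor), while the 30000 actual
# reflects only two of the three processes touching this subplan.
print(explain_actual_total(30000, 2))   # 60000
```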
The fact that (baserel.rows > path->subpath->rows) here seems like a straight bug: there are no filters involved in this case but in the presence of filters baserel->rows should be strictly (<= path->subpath->rows), right?
David J.
"David G. Johnston" <david.g.johnston@gmail.com> writes: > The fact that (baserel.rows > path->subpath->rows) here seems like a > straight bug: there are no filters involved in this case but in the > presence of filters baserel->rows should be strictly (<= > path->subpath->rows), right? No, because the subpath's rowcount has been derated to represent the number of rows any one worker is expected to process. Since the SubqueryScan is below the Gather, it has to represent that number of rows too. Like I said, this design is pretty confusing; though I do not have a better alternative to suggest. Anyway, after chewing on this for awhile, it strikes me that the sanest way to proceed is for cost_subqueryscan to compute its rows estimate as the subpath's rows estimate times the selectivity of the subqueryscan's quals (if any). That'd be the natural way to proceed for most sorts of non-bottom-level paths to begin with. I think there are a few reasons why cost_subqueryscan currently does it the way it does: * By analogy to other sorts of relation-scan nodes. But those don't have any subpath that they could consult instead. This analogy is really a bit faulty, since SubqueryScan isn't a relation scan node in any ordinary meaning of that term. * To ensure that all paths for the same relation have the same rowcount estimate (modulo different parameterization or parallelism). But we'll have that anyway because the only possible path type for an unflattened RTE_SUBQUERY rel is a SubqueryScan, so there's no risk of different cost_xxx functions arriving at slightly different estimates. * To avoid redundant computation of the quals' selectivity. This is slightly annoying; but in practice the quals will always be RestrictInfos which will cache the per-qual selectivities, so it's not going to cost us very much to perform that calculation over again. 
So perhaps we should do it more like the attached, which produces this
plan for the UNION case:

 Unique  (cost=28994.85..30194.85 rows=120000 width=12)
   ->  Sort  (cost=28994.85..29294.85 rows=120000 width=12)
         Sort Key: t.a, t.b, t.c
         ->  Gather  (cost=0.00..2600.00 rows=120000 width=12)
               Workers Planned: 2
               ->  Parallel Append  (cost=0.00..2600.00 rows=50000 width=12)
                     ->  Parallel Seq Scan on t  (cost=0.00..575.00 rows=25000 width=12)
                     ->  Parallel Seq Scan on t t_1  (cost=0.00..575.00 rows=25000 width=12)

It'd still be a good idea to fix the cost estimation to not charge
extra for SubqueryScans that will be elided later, because this
Parallel Append's cost should be 1400 not 2600.  But as I showed
upthread, that won't affect the plan choice for this particular test
case.

			regards, tom lane

Attachment: change-cost_subqueryscan-rowcount-estimate-2.patch (text/x-diff)
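[Editorial note: in outline, the rowcount rule the attached patch adopts can be modeled as follows. This is an illustrative sketch, not the C code: clamp_row_est roughly mirrors the planner helper of that name, and the selectivity argument stands in for what clauselist_selectivity would compute from the scan's quals.]

```python
def clamp_row_est(nrows):
    """Rough mirror of the planner's clamp_row_est(): force the
    estimate to an integral value of at least one row."""
    return max(1.0, float(round(nrows)))

def subqueryscan_rows(subpath_rows, selectivity):
    """The proposed rule: scale the subpath's (possibly derated,
    per-worker) row estimate by the scan's qual selectivity,
    instead of copying baserel->rows."""
    return clamp_row_est(subpath_rows * selectivity)

# With no quals (selectivity 1.0), a partial SubqueryScan now
# inherits the derated 25000-row estimate rather than the
# whole-relation 60000:
print(subqueryscan_rows(25000, 1.0))   # 25000.0
```

For a non-partial path with quals, subpath_rows times selectivity comes out the same as baserel->rows, which is why the change is invisible in the serial case.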
I wrote:
> So perhaps we should do it more like the attached, which produces
> this plan for the UNION case:

sigh ... actually attached this time.

			regards, tom lane

diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index b787c6f81a..18749e842d 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1395,6 +1395,7 @@ cost_subqueryscan(SubqueryScanPath *path, PlannerInfo *root,
 {
 	Cost		startup_cost;
 	Cost		run_cost;
+	List	   *qpquals;
 	QualCost	qpqual_cost;
 	Cost		cpu_per_tuple;
 
@@ -1402,11 +1403,24 @@ cost_subqueryscan(SubqueryScanPath *path, PlannerInfo *root,
 	Assert(baserel->relid > 0);
 	Assert(baserel->rtekind == RTE_SUBQUERY);
 
-	/* Mark the path with the correct row estimate */
+	/*
+	 * We compute the rowcount estimate as the subplan's estimate times the
+	 * selectivity of relevant restriction clauses.  In simple cases this will
+	 * come out the same as baserel->rows; but when dealing with parallelized
+	 * paths we must do it like this to get the right answer.
+	 */
 	if (param_info)
-		path->path.rows = param_info->ppi_rows;
+		qpquals = list_concat_copy(param_info->ppi_clauses,
+								   baserel->baserestrictinfo);
 	else
-		path->path.rows = baserel->rows;
+		qpquals = baserel->baserestrictinfo;
+
+	path->path.rows = clamp_row_est(path->subpath->rows *
+									clauselist_selectivity(root,
+														   qpquals,
+														   0,
+														   JOIN_INNER,
+														   NULL));
 
 	/*
 	 * Cost of path is cost of evaluating the subplan, plus cost of evaluating
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
index 21c429226f..0d8d77140a 100644
--- a/src/test/regress/expected/incremental_sort.out
+++ b/src/test/regress/expected/incremental_sort.out
@@ -1487,14 +1487,12 @@ explain (costs off) select * from t union select * from t order by 1,3;
    ->  Unique
          ->  Sort
                Sort Key: t.a, t.b, t.c
-               ->  Append
-                     ->  Gather
-                           Workers Planned: 2
+               ->  Gather
+                     Workers Planned: 2
+                     ->  Parallel Append
                            ->  Parallel Seq Scan on t
-                     ->  Gather
-                           Workers Planned: 2
                            ->  Parallel Seq Scan on t t_1
-(13 rows)
+(11 rows)
 
 -- Full sort, not just incremental sort can be pushed below a gather merge path
 -- by generate_useful_gather_paths.
On Fri, Apr 29, 2022 at 12:31 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "David G. Johnston" <david.g.johnston@gmail.com> writes:
>> The fact that (baserel.rows > path->subpath->rows) here seems like a
>> straight bug: there are no filters involved in this case but in the
>> presence of filters baserel->rows should be strictly (<=
>> path->subpath->rows), right?
>
> No, because the subpath's rowcount has been derated to represent the
> number of rows any one worker is expected to process.  Since the
> SubqueryScan is below the Gather, it has to represent that number
> of rows too.  Like I said, this design is pretty confusing; though
> I do not have a better alternative to suggest.
>
> Anyway, after chewing on this for awhile, it strikes me that the
> sanest way to proceed is for cost_subqueryscan to compute its rows
> estimate as the subpath's rows estimate times the selectivity of
> the subqueryscan's quals (if any).
This is what Robert was getting at, and what I followed up on.

The question I ended up at is: why doesn't baserel->rows already produce the value you now propose to calculate directly within cost_subqueryscan?
set_baserel_size_estimates (multiplies rel->tuples - which is derated - by selectivity, sets rel->rows)
set_subquery_size_estimates
rel->subroot = subquery_planner(...)
// my expectation is that sub_final_rel->cheapest_total_path->rows is the derated number of rows;
// the fact you can reach the derated amount later by using path->subpath->rows seems to affirm this.
sets rel->tuples from sub_final_rel->cheapest_total_path->rows)
set_subquery_pathlist (executes the sizing call stack above, then proceeds to create_subqueryscan_path which in turn calls cost_subquery)
set_rel_size
...
> * By analogy to other sorts of relation-scan nodes.  But those don't
> have any subpath that they could consult instead.  This analogy is
> really a bit faulty, since SubqueryScan isn't a relation scan node
> in any ordinary meaning of that term.
I did observe this copy-and-paste dynamic; I take it this is why we cannot or would not just change all of the baserel->rows usages to path->subpath->rows.
> So perhaps we should do it more like the attached, which produces
> this plan for the UNION case:
The fact that this changes row counts in a costing function is bothersome; it would be nice to go the other direction and remove the if block. More generally, path->path.rows should be read-only by the time we get to costing. I'm not out to start a revolution here either, but it feels like we are just papering over a bug in how baserel->rows is computed, per my analysis above.
In short, the solution seems like it should, and in fact does here, fix the observed problem. I'm fine with that.
David J.
On Fri, Apr 29, 2022 at 3:38 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> So perhaps we should do it more like the attached, which produces
>> this plan for the UNION case:
>
> sigh ... actually attached this time.

I am not sure whether this is actually correct, but it seems a lot
more believable than the previous proposals.

The problem might be more general, though.  I think when I developed
this parallel query stuff I modeled a lot of it on what you did for
parameterized paths.  Both parameterized paths and parallelism can
create situations where executing a path to completion produces fewer
rows than you would otherwise get.  In the case of parameterized
paths, this happens because we enforce the parameterization we've
chosen on top of the user-supplied quals.  In the case of parallelism,
it happens because the rows are split up across the different workers.
I think I intended that the "rows" field of RelOptInfo should be the
row count for the relation in total, and that the "rows" field of the
Path should be the number of rows we expect to get for one execution
of the path.  But it seems like this problem is good evidence that I
didn't find all the places that need to be adjusted for parallelism,
and I wouldn't be very surprised if there are a bunch of others that I
overlooked.

It's not actually very nice that we end up having to call
clauselist_selectivity() here.  We've already called
set_baserel_size_estimates() to figure out how many rows we expect to
have been filtered out by the quals, and it sucks to have to do it
again.  Brainstorming wildly and maybe stupidly, I wonder if the whole
model is wrong here.  Maybe a path shouldn't have a row count;
instead, maybe it should have a multiplier that it applies to the
relation's row count.  Then, if X is parameterized in the same way as
its subpath Y, we can just copy the multiplier up, but now it will be
applied to the new rel's "rows" value, which will have already been
adjusted appropriately by set_baserel_size_estimates().

And having thrown out that wild and crazy idea, I will now run away
quickly and hide someplace.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes:
> I am not sure whether this is actually correct, but it seems a lot
> more believable than the previous proposals.  The problem might be
> more general, though.  I think when I developed this parallel query
> stuff I modeled a lot of it on what you did for parameterized paths.
> Both parameterized paths and parallelism can create situations where
> executing a path to completion produces fewer rows than you would
> otherwise get.  In the case of parameterized paths, this happens
> because we enforce the parameterization we've chosen on top of the
> user-supplied quals.  In the case of parallelism, it happens because
> the rows are split up across the different workers.  I think I
> intended that the "rows" field of RelOptInfo should be the row count
> for the relation in total, and that the "rows" field of the Path
> should be the number of rows we expect to get for one execution of
> the path.  But it seems like this problem is good evidence that I
> didn't find all the places that need to be adjusted for parallelism,
> and I wouldn't be very surprised if there are a bunch of others that
> I overlooked.

I did look at the rest of costsize.c for similar instances, and didn't
find any.  In any case, I think we have two options:

1. Apply this fix, and in future fix any other places that we identify
later.

2. Invent some entirely new scheme that we hope is less mistake-prone.

Option #2 is unlikely to lead to any near-term fix, and we certainly
wouldn't dare back-patch it.

> It's not actually very nice that we end up having to call
> clauselist_selectivity() here.  We've already called
> set_baserel_size_estimates() to figure out how many rows we expect to
> have been filtered out by the quals, and it sucks to have to do it
> again.  Brainstorming wildly and maybe stupidly, I wonder if the
> whole model is wrong here.  Maybe a path shouldn't have a row count;
> instead, maybe it should have a multiplier that it applies to the
> relation's row count.  Then, if X is parameterized in the same way as
> its subpath Y, we can just copy the multiplier up, but now it will be
> applied to the new rel's "rows" value, which will have already been
> adjusted appropriately by set_baserel_size_estimates().

I've wondered about that too, but it seems to depend on the assumption
that clauses are estimated independently by clauselist_selectivity,
which has not been true for a long time (and is getting less true not
more so).  So we could possibly apply something like this for
parallelism, but not for parameterized paths, and that makes it less
appealing ... IMO anyway.

I have thought it might be good to explicitly mark partial paths with
the estimated number of workers, which would be effectively the same
thing as what you're talking about.  But I wonder if we'd not still be
better off keeping the path rowcount as being
number-of-rows-in-each-worker, and just scale it up by the multiplier
for EXPLAIN output.  (And then also print the true total number of
rows in EXPLAIN ANALYZE.)  If we do the inverse of that, then we risk
bugs from failing to correct the rowcount during cost-estimation
calculations.

I kinda feel that the bottom line here is that cost estimation is
hard, and we're not going to find a magic bullet that removes bugs.

			regards, tom lane
On Mon, May 2, 2022 at 5:24 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I did look at the rest of costsize.c for similar instances, and didn't
> find any.  In any case, I think we have two options:
>
> 1. Apply this fix, and in future fix any other places that we identify
> later.
>
> 2. Invent some entirely new scheme that we hope is less mistake-prone.
>
> Option #2 is unlikely to lead to any near-term fix, and we certainly
> wouldn't dare back-patch it.

Sure, although I think it's questionable whether we should back-patch
anyway, since there's no guarantee that every plan change anybody gets
will be a desirable one.

> I've wondered about that too, but it seems to depend on the assumption
> that clauses are estimated independently by clauselist_selectivity, which
> has not been true for a long time (and is getting less true not more so).
> So we could possibly apply something like this for parallelism, but not
> for parameterized paths, and that makes it less appealing ... IMO anyway.

I agree.  We'd have to correct for that somehow, and that might be
awkward.

> I have thought it might be good to explicitly mark partial paths with the
> estimated number of workers, which would be effectively the same thing
> as what you're talking about.  But I wonder if we'd not still be better off
> keeping the path rowcount as being number-of-rows-in-each-worker, and
> just scale it up by the multiplier for EXPLAIN output.  (And then also
> print the true total number of rows in EXPLAIN ANALYZE.)  If we do the
> inverse of that, then we risk bugs from failing to correct the rowcount
> during cost-estimation calculations.

That I don't like at all.  I'm still of the opinion that it's a huge
mistake for EXPLAIN to print int(rowcount/loops) instead of just
rowcount.  The division is never what I want and in my experience is
also not what other people want and often causes confusion.  Both the
division and the rounding lose information about precisely what row
count was estimated, which makes it harder to figure out where in the
plan things went wrong.  I am not at all keen on adding more ways for
what we print out to be different from the information actually stored
in the plan tree.  I don't know for sure what we ought to be storing
in the plan tree, but I think whatever we store should also be what we
print.  I think the fact that we've chosen to store something in the
plan tree is strong evidence that that exact value, and not some
quantity derived therefrom, is what's interesting.

> I kinda feel that the bottom line here is that cost estimation is
> hard, and we're not going to find a magic bullet that removes bugs.

Well that much is certainly true.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes:
> That I don't like at all.  I'm still of the opinion that it's a huge
> mistake for EXPLAIN to print int(rowcount/loops) instead of just
> rowcount.  The division is never what I want and in my experience is
> also not what other people want and often causes confusion.  Both the
> division and the rounding lose information about precisely what row
> count was estimated, which makes it harder to figure out where in the
> plan things went wrong.

I'm inclined to look at it a bit differently: it was a mistake to use
the same "loops" notion for parallelism as for repeated node
execution.  But I think we are saying the same thing in one respect,
namely it'd be better if what EXPLAIN shows for parallelism were
totals across all workers rather than per-worker numbers.  (I'm
unconvinced about whether repeated node execution ought to work like
that.)

> I am not at all keen on adding more ways for what we print out to be
> different from the information actually stored in the plan tree.

I think the cost estimation functions want to work with per-worker
rowcounts.  We could scale that up to totals when we create the
finished plan tree, perhaps.

> I don't know for sure what we ought to be storing in the plan tree,
> but I think whatever we store should also be what we print.  I think
> the fact that we've chosen to store something in the plan tree is
> strong evidence that that exact value, and not some quantity derived
> therefrom, is what's interesting.

The only reason we store any of this in the plan tree is for EXPLAIN
to print it out.  On the other hand, I don't want the planner
expending any large number of cycles modifying the numbers it works
with before putting them in the plan tree, because most of the time
we're not doing EXPLAIN so it'd be wasted effort.

In any case, fundamental redesign of what EXPLAIN prints is a job for
v16 or later.  Are you okay with the proposed patch as a v15 fix?

			regards, tom lane
On Tue, May 3, 2022 at 2:13 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> In any case, fundamental redesign of what EXPLAIN prints is a job
> for v16 or later.  Are you okay with the proposed patch as a v15 fix?

Yes.  I can't really vouch for it, but I don't object to it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, May 3, 2022 at 2:13 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> In any case, fundamental redesign of what EXPLAIN prints is a job
>> for v16 or later.  Are you okay with the proposed patch as a v15 fix?

> Yes.  I can't really vouch for it, but I don't object to it.

I re-read the patch and noticed that I'd missed one additional change
needed:

-	run_cost = cpu_per_tuple * baserel->tuples;
+	run_cost = cpu_per_tuple * path->subpath->rows;

That is, we should estimate the number of qual evaluations and
cpu_tuple_cost charges based on the subpath row count not the
non-parallel-aware relation size.  So this reduces the cost of the
SubqueryScan itself, as well as its row count, in partial paths.
I don't see any further change in regression test results though.

Pushed with that change.

			regards, tom lane
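[Editorial note: putting the committed pieces together, the per-path arithmetic now looks roughly like the sketch below. This is an illustrative model of the fixed logic, not the C code; the parameter names are hypothetical, the charge ordering is simplified, and 0.01 is assumed as PostgreSQL's default cpu_tuple_cost.]

```python
def cost_subqueryscan(subpath_rows, subpath_startup_cost, subpath_total_cost,
                      qual_cost_per_tuple, tlist_cost_per_row,
                      cpu_tuple_cost=0.01, selectivity=1.0):
    """Sketch of the fixed costing: both the row estimate and the
    per-tuple CPU charge are driven by the subpath's (derated) row
    count, so partial paths no longer pay for the whole relation."""
    rows = max(1.0, float(round(subpath_rows * selectivity)))
    cpu_per_tuple = cpu_tuple_cost + qual_cost_per_tuple
    startup_cost = subpath_startup_cost
    run_cost = subpath_total_cost - subpath_startup_cost
    run_cost += cpu_per_tuple * subpath_rows   # was baserel->tuples before the fix
    run_cost += tlist_cost_per_row * rows      # tlist eval paid per output row
    return rows, startup_cost + run_cost

# A partial path whose subpath returns 25000 derated rows at cost 575,
# with no quals and a trivial tlist:
rows, total = cost_subqueryscan(25000, 0.0, 575.0, 0.0, 0.0)
print(rows, total)   # 25000.0 rows; 575 + 0.01 * 25000 = 825.0
```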