Thread: Considering fractional paths in Append node
Hi hackers!
A colleague of mine, Andrei Lepikhov, has found an interesting behavior
in path cost calculation for the Append node: when evaluating the cheapest
path, it does not take fractional path costs into account.
We've prepared a patch that makes the add_paths_to_append_rel function
consider non-parametrized fractional path costs.
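In rough outline, the idea looks something like the following inside
add_paths_to_append_rel (a simplified sketch for illustration only, not the
patch itself; it assumes the existing get_cheapest_fractional_path() and
create_append_path() helpers and the live_childrels list the function
already receives):

    /*
     * Sketch: when the query needs only a fraction of the rows, also build
     * an Append path from each child's cheapest non-parametrized fractional
     * path, so an upper node can pick it over the cheapest-total Append.
     */
    if (root->tuple_fraction > 0)
    {
        List       *fractional_subpaths = NIL;
        ListCell   *lc;

        foreach(lc, live_childrels)
        {
            RelOptInfo *childrel = (RelOptInfo *) lfirst(lc);
            Path       *frac_path;

            /* cheapest path for retrieving only a fraction of the rows */
            frac_path = get_cheapest_fractional_path(childrel,
                                                     root->tuple_fraction);

            /* consider only non-parametrized paths here */
            if (frac_path == NULL ||
                !bms_is_empty(PATH_REQ_OUTER(frac_path)))
            {
                fractional_subpaths = NIL;
                break;
            }

            fractional_subpaths = lappend(fractional_subpaths, frac_path);
        }

        /* add one more unordered Append built from the fractional subpaths */
        if (fractional_subpaths != NIL)
            add_path(rel, (Path *) create_append_path(root, rel,
                                                      fractional_subpaths,
                                                      NIL,   /* no partial subpaths */
                                                      NIL,   /* no pathkeys */
                                                      NULL,  /* not parametrized */
                                                      0,     /* parallel_workers */
                                                      false, /* parallel_aware */
                                                      -1));  /* rows from subpaths */
    }

With such an extra path in the parent's pathlist, a fraction-friendly upper
node such as LIMIT can choose the Append built from fractional subpaths
instead of the one built purely from cheapest-total subpaths.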
The effect is easily seen in one of the standard PG tests:
Vanilla (current master):
explain analyze
select t1.unique1 from tenk1 t1
inner join tenk2 t2 on t1.tenthous = t2.tenthous and t2.thousand = 0
union all
(values(1)) limit 1;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Limit (cost=0.00..219.55 rows=1 width=4) (actual time=6.309..6.312 rows=1 loops=1)
   -> Append (cost=0.00..2415.09 rows=11 width=4) (actual time=6.308..6.310 rows=1 loops=1)
         -> Nested Loop (cost=0.00..2415.03 rows=10 width=4) (actual time=6.307..6.308 rows=1 loops=1)
               Join Filter: (t1.tenthous = t2.tenthous)
               Rows Removed by Join Filter: 4210
               -> Seq Scan on tenk1 t1 (cost=0.00..445.00 rows=10000 width=8) (actual time=0.004..0.057 rows=422 loops=1)
               -> Materialize (cost=0.00..470.05 rows=10 width=4) (actual time=0.000..0.014 rows=10 loops=422)
                     Storage: Memory  Maximum Storage: 17kB
                     -> Seq Scan on tenk2 t2 (cost=0.00..470.00 rows=10 width=4) (actual time=0.076..5.535 rows=10 loops=1)
                           Filter: (thousand = 0)
                           Rows Removed by Filter: 9990
         -> Result (cost=0.00..0.01 rows=1 width=4) (never executed)
 Planning Time: 0.324 ms
 Execution Time: 6.336 ms
(14 rows)
Patched, the same test:
explain analyze
select t1.unique1 from tenk1 t1
inner join tenk2 t2 on t1.tenthous = t2.tenthous and t2.thousand = 0
union all
(values(1)) limit 1;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
 Limit (cost=0.29..126.00 rows=1 width=4) (actual time=0.105..0.106 rows=1 loops=1)
   -> Append (cost=0.29..1383.12 rows=11 width=4) (actual time=0.104..0.105 rows=1 loops=1)
         -> Nested Loop (cost=0.29..1383.05 rows=10 width=4) (actual time=0.104..0.104 rows=1 loops=1)
               -> Seq Scan on tenk2 t2 (cost=0.00..470.00 rows=10 width=4) (actual time=0.076..0.076 rows=1 loops=1)
                     Filter: (thousand = 0)
                     Rows Removed by Filter: 421
               -> Index Scan using tenk1_thous_tenthous on tenk1 t1 (cost=0.29..91.30 rows=1 width=8) (actual time=0.026..0.026 rows=1 loops=1)
                     Index Cond: (tenthous = t2.tenthous)
         -> Result (cost=0.00..0.01 rows=1 width=4) (never executed)
 Planning Time: 0.334 ms
 Execution Time: 0.130 ms
(11 rows)
Hope this optimization could be useful.
Nikita Malakhov <hukutoc@gmail.com> writes:

Hello Nikita,

> Hi hackers!
>
> Sorry, I forgot to attach the patch itself. Please check it out.

Could you check whether [1] is related to this subject? I think the hard
part would be deciding which tuple_fraction to use in
add_paths_to_append_rel, since root->tuple_fraction is at the subquery
level, while add_paths_to_append_rel works only at the RelOptInfo level.

[1] https://www.postgresql.org/message-id/CAApHDvry0nSV62kAOH3iccvfPhGPLN0Q97%2B%3Db1RsDPXDz3%3DCiQ%40mail.gmail.com

--
Best Regards
Andy Fan
On 10/17/24 07:05, Andy Fan wrote:
> Nikita Malakhov <hukutoc@gmail.com> writes:
>
> Hello Nikita,
>
>> Hi hackers!
>>
>> Sorry, I forgot to attach the patch itself. Please check it out.
>
> Could you check whether [1] is related to this subject? I think the hard
> part would be deciding which tuple_fraction to use in
> add_paths_to_append_rel, since root->tuple_fraction is at the subquery
> level, while add_paths_to_append_rel works only at the RelOptInfo level.
>
> [1]
> https://www.postgresql.org/message-id/CAApHDvry0nSV62kAOH3iccvfPhGPLN0Q97%2B%3Db1RsDPXDz3%3DCiQ%40mail.gmail.com

Yes, this thread is connected to the code Nikita has proposed. It is
mostly relevant to partitioned cases, of course. I'm not sure about
partitionwise joins - they look overcomplicated for the moment. We just
see cases where a SeqScan is preferred to an IndexScan.

One logical way is to add an alternative fractional path to the Append
node; then, if at the top level some Limit, IncrementalSort or another
fractional-friendly node requests only a small part of the data, the
optimiser will have the option of choosing the fractional path.

--
regards, Andrei Lepikhov
Hi,
Andy, your words make sense, but it seems to me that in add_paths_to_append_rel
we have no options other than the tuple_fraction from root and the row counts
(if any) of the paths we take into account; please correct me if I am wrong.
Thank you!
Also, on top of that I have an idea for pruning unnecessary partitions
in generate_orderedappend_paths() when we have a valid LIMIT value.
I'm currently checking whether it works correctly in multiple cases,
so I'll send it after we deal with this issue.
--
Nikita Malakhov <hukutoc@gmail.com> writes:
> Hi,
>
> Andy, your words make sense, but it seems to me that in add_paths_to_append_rel
> we have no options other than the tuple_fraction from root and the row counts
> (if any) of the paths we take into account; please correct me if I am wrong.

One option might be to apply your logic only if we can prove that the
tuple_fraction from root is the same as the RelOptInfo-level
tuple_fraction, similar to what I did before. But that has proved to be
too complex.

> Thank you!
>
> Also, on top of that I have an idea for pruning unnecessary partitions
> in generate_orderedappend_paths() when we have a valid LIMIT value.
> I'm currently checking whether it works correctly in multiple cases,
> so I'll send it after we deal with this issue.

--
Best Regards
Andy Fan
Nikita Malakhov <hukutoc@gmail.com> writes:

> The effect is easily seen in one of the standard PG tests:
> Vanilla (current master):
> explain analyze
> select t1.unique1 from tenk1 t1
> inner join tenk2 t2 on t1.tenthous = t2.tenthous and t2.thousand = 0
> union all
> (values(1)) limit 1;
>                              QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------------
>  Limit (cost=0.00..219.55 rows=1 width=4) (actual time=6.309..6.312 rows=1 loops=1)
>    -> Append (cost=0.00..2415.09 rows=11 width=4) (actual time=6.308..6.310 rows=1 loops=1)
>          -> Nested Loop (cost=0.00..2415.03 rows=10 width=4) (actual time=6.307..6.308 rows=1 loops=1)
>                Join Filter: (t1.tenthous = t2.tenthous)
>                Rows Removed by Join Filter: 4210
>                -> Seq Scan on tenk1 t1 (cost=0.00..445.00 rows=10000 width=8) (actual time=0.004..0.057 rows=422 loops=1)
>                -> Materialize (cost=0.00..470.05 rows=10 width=4) (actual time=0.000..0.014 rows=10 loops=422)
>                      Storage: Memory  Maximum Storage: 17kB
>                      -> Seq Scan on tenk2 t2 (cost=0.00..470.00 rows=10 width=4) (actual time=0.076..5.535 rows=10 loops=1)
>                            Filter: (thousand = 0)
>                            Rows Removed by Filter: 9990
>          -> Result (cost=0.00..0.01 rows=1 width=4) (never executed)
>  Planning Time: 0.324 ms
>  Execution Time: 6.336 ms
> (14 rows)
>
> Patched, the same test:
> explain analyze
> select t1.unique1 from tenk1 t1
> inner join tenk2 t2 on t1.tenthous = t2.tenthous and t2.thousand = 0
> union all
> (values(1)) limit 1;
>                              QUERY PLAN
> --------------------------------------------------------------------------------------------------------------------------------------------------
>  Limit (cost=0.29..126.00 rows=1 width=4) (actual time=0.105..0.106 rows=1 loops=1)
>    -> Append (cost=0.29..1383.12 rows=11 width=4) (actual time=0.104..0.105 rows=1 loops=1)
>          -> Nested Loop (cost=0.29..1383.05 rows=10 width=4) (actual time=0.104..0.104 rows=1 loops=1)
>                -> Seq Scan on tenk2 t2 (cost=0.00..470.00 rows=10 width=4) (actual time=0.076..0.076 rows=1 loops=1)
>                      Filter: (thousand = 0)
>                      Rows Removed by Filter: 421
>                -> Index Scan using tenk1_thous_tenthous on tenk1 t1 (cost=0.29..91.30 rows=1 width=8) (actual time=0.026..0.026 rows=1 loops=1)
>                      Index Cond: (tenthous = t2.tenthous)
>          -> Result (cost=0.00..0.01 rows=1 width=4) (never executed)
>  Planning Time: 0.334 ms
>  Execution Time: 0.130 ms
> (11 rows)

This is a nice result. After some more thought, I feel that the startup
cost calculation for seq scans is the more important thing to address.

Bad plan:  Append (cost=0.00..2415.09 ..) shows a "startup cost" of 0.
Good plan: Append (cost=0.29..1383.12 ..) shows a "startup cost" of 0.29.

The major reason for this is that we calculate the "startup cost" for
SeqScan and Index Scan following different principles. For the Index Scan,
the startup cost is "the cost to retrieve the first tuple", but for the
SeqScan it is not, as we can see from the fact that the query
"SELECT * FROM tenk2 WHERE thousand = 0" gets a startup_cost of 0. My
understanding had been that "startup cost" *already* meant the cost to
retrieve the first tuple, but at [1] Tom said:

"""
I think that things might work out better if we redefined the startup
cost as "estimated cost to retrieve the first tuple", rather than its
current very-squishy definition as "cost to initialize the scan".
"""

The above statement left me confused.
If we take the startup cost to be the cost to retrieve the first tuple,
we can do the quick hack below:

@@ -355,8 +355,8 @@ cost_seqscan(Path *path, PlannerInfo *root,
 	}
 
 	path->disabled_nodes = enable_seqscan ? 0 : 1;
-	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + cpu_run_cost + disk_run_cost;
+	path->startup_cost = startup_cost + (cpu_run_cost + disk_run_cost) * (1 - path->rows / baserel->tuples);
 }

We get the plan:

regression=# explain select t1.unique1 from tenk1 t1
  inner join tenk2 t2 on t1.tenthous = t2.tenthous and t2.thousand = 0
  union all select 1 limit 1;
                                      QUERY PLAN
-------------------------------------------------------------------------------------
 Limit (cost=470.12..514.00 rows=1 width=4)
   -> Append (cost=470.12..952.79 rows=11 width=4)
         -> Hash Join (cost=470.12..952.73 rows=10 width=4)
               Hash Cond: (t1.tenthous = t2.tenthous)
               -> Seq Scan on tenk1 t1 (cost=0.00..445.00 rows=10000 width=8)
               -> Hash (cost=470.00..470.00 rows=10 width=4)
                     -> Seq Scan on tenk2 t2 (cost=469.53..470.00 rows=10 width=4)
                           Filter: (thousand = 0)
         -> Result (cost=0.00..0.01 rows=1 width=4)
(9 rows)

set enable_hashjoin to off;

regression=# explain select t1.unique1 from tenk1 t1
  inner join tenk2 t2 on t1.tenthous = t2.tenthous and t2.thousand = 0
  union all select 1 limit 1;
                                                QUERY PLAN
--------------------------------------------------------------------------------------------------------
 Limit (cost=469.81..542.66 rows=1 width=4)
   -> Append (cost=469.81..1271.12 rows=11 width=4)
         -> Nested Loop (cost=469.81..1271.05 rows=10 width=4)
               -> Seq Scan on tenk2 t2 (cost=469.53..470.00 rows=10 width=4)
                     Filter: (thousand = 0)
               -> Index Scan using tenk1_thous_tenthous on tenk1 t1 (cost=0.29..80.09 rows=1 width=8)
                     Index Cond: (tenthous = t2.tenthous)
         -> Result (cost=0.00..0.01 rows=1 width=4)
(8 rows)

It looks like we still have some other stuff to do, but we can see that
the desired plan now has a cost closer to the estimated best plan than
before.

[1] https://www.postgresql.org/message-id/flat/3783591.1721327902%40sss.pgh.pa.us#09d6471fc59b35fa4aca939e49943c2c

--
Best Regards
Andy Fan
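For concreteness, the tenk2 numbers in the plans above follow directly from
the hacked formula (this is just arithmetic on the quoted output, nothing
more): for the tenk2 scan, path->rows = 10 and baserel->tuples = 10000, so
the new startup cost is 0 + 470.00 * (1 - 10/10000) = 469.53, exactly the
startup cost reported for the tenk2 Seq Scan, and it is what lifts the
Append's startup cost into the ~470 range in both plans.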
On 10/18/24 07:54, Andy Fan wrote:
> Nikita Malakhov <hukutoc@gmail.com> writes:
>> The effect is easily seen in one of the standard PG tests:
>
> """
> I think that things might work out better if we redefined the startup
> cost as "estimated cost to retrieve the first tuple", rather than its
> current very-squishy definition as "cost to initialize the scan".
> """
>
> The above statement left me confused. If we take the startup cost to be
> the cost to retrieve the first tuple, we can do the quick hack below,

A promising way to go. Of course, even in that case an IndexScan usually
gives way to a SeqScan (because of the additional heap fetch), and only an
IndexOnlyScan may overcome it. Moreover, the SeqScan first-tuple cost is a
contentious issue - who knows how many tuples it will filter out before the
first tuple is produced?

> It looks like we still have some other stuff to do, but we can see that
> the desired plan now has a cost closer to the estimated best plan than
> before.

But our patch is about somewhat different stuff: by adding one more Append
strategy (and, as I realised recently, one promising MergeAppend strategy
too) we give an upper fraction-friendly node the chance to decide which
plan is better. Right now, AFAIK, only the LIMIT node can profit from that
(maybe IncrementalSort too, if we include MergeAppend). But it may open a
door to improving other nodes as well.

--
regards, Andrei Lepikhov
Hi!
Andy, one quick question - what do you think about using root->limit_tuples
as guidance for how many rows we have to take into account in path costs?
Nikita Malakhov <hukutoc@gmail.com> writes:

Hi Nikita,

> Hi!
>
> Andy, one quick question - what do you think about using
> root->limit_tuples as guidance for how many rows we have to take into
> account in path costs?

In my experience, committers would probably dislike the idea of ignoring
the difference in tuple_fraction between the subquery level and the
RelOptInfo level, and they also dislike a complex solution like what I did
in [1]. Maybe you can try to figure out a complete yet simple solution that
makes everyone happy, but I have no idea about this right now.

[1] https://www.postgresql.org/message-id/CAApHDvry0nSV62kAOH3iccvfPhGPLN0Q97%2B%3Db1RsDPXDz3%3DCiQ%40mail.gmail.com

--
Best Regards
Andy Fan
Hi,
Andy, thank you, I've checked that thread out, along with run-time partition pruning.
I've spent some time on the tuple_fraction field usage and would disagree
with you on this topic - it is already used at the RelOptInfo level later on,
in the generate_orderedappend_paths() function. I mean the following piece:
    if (root->tuple_fraction > 0)
    {
        double      path_fraction = (1.0 / root->tuple_fraction);

        cheapest_fractional =
            get_cheapest_fractional_path_for_pathkeys(childrel->pathlist,
                                                      pathkeys,
                                                      NULL,
                                                      path_fraction);
        ...
So it does not seem incorrect to use its value for a single relation in a
subquery - I agree that we do not have an accurate estimate at this level,
but we could use the one we already have.
I've also tried hard to find an example where this patch could break something,
but without success.
Nikita Malakhov <hukutoc@gmail.com> writes:
> Hi,
>
> Andy, thank you, I've checked that thread out, along with run-time
> partition pruning.

I'm not sure about the relationship between this topic and run-time
partition pruning..

> I've spent some time on the tuple_fraction field usage and would disagree
> with you on this topic - it is already used at the RelOptInfo level later on,
> in the generate_orderedappend_paths() function.

It looks like you are right that root->tuple_fraction has been used at the
RelOptInfo level in generate_orderedappend_paths(). But we have also tried
to be careful about reusing it directly elsewhere, for example in
set_subquery_pathlist. See:

"""
/*
 * We can safely pass the outer tuple_fraction down to the subquery if the
 * outer level has no joining, aggregation, or sorting to do. Otherwise
 * we'd better tell the subquery to plan for full retrieval. (XXX This
 * could probably be made more intelligent ...)
 */
"""

I'm not sure that the "more intelligent" way would be to just use it
directly. So I'm not saying we can't do this, just that the facts are:
(a) root->tuple_fraction is not exactly the same as a RelOptInfo-level
tuple_fraction; (b) we have used root->tuple_fraction at the RelOptInfo
level in some cases, and also tried not to use it in some other cases (or
to use it only under certain conditions, similar to what I did before).
It looks like different committers have different opinions on this.

--
Best Regards
Andy Fan