Thread: Considering fractional paths in Append node

Considering fractional paths in Append node

From: Nikita Malakhov
Hi hackers!

A colleague of mine, Andrei Lepikhov, has found interesting behavior
in the path cost calculation for the Append node: when evaluating the
cheapest path, it does not take fractional path costs into account.

We've prepared a patch that makes the add_paths_to_append_rel function
consider the costs of non-parameterized fractional paths as well.
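
In rough pseudo-C, the idea is something like the following (a hypothetical
sketch only, not the patch itself; it assumes add_paths_to_append_rel's
live_childrels list and the existing get_cheapest_fractional_path() and
create_append_path() planner APIs):

    /*
     * Sketch: when the query is expected to fetch only a fraction of the
     * rows, also build an Append over each child's cheapest unparameterized
     * fractional path, so an upper LIMIT can pick it on cost.
     */
    if (root->tuple_fraction > 0.0)
    {
        List       *fractional_subpaths = NIL;
        ListCell   *lc;

        foreach(lc, live_childrels)
        {
            RelOptInfo *childrel = (RelOptInfo *) lfirst(lc);

            /* handles both absolute and relative tuple_fraction */
            Path       *fpath = get_cheapest_fractional_path(childrel,
                                                             root->tuple_fraction);

            /* consider only non-parameterized paths */
            if (fpath == NULL || fpath->param_info != NULL)
            {
                fractional_subpaths = NIL;
                break;
            }
            fractional_subpaths = lappend(fractional_subpaths, fpath);
        }

        if (fractional_subpaths != NIL)
            add_path(rel, (Path *) create_append_path(root, rel,
                                                      fractional_subpaths,
                                                      NIL,      /* no partial paths */
                                                      NIL, NULL, 0, false, -1));
    }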

The effect is easily seen in one of the standard PG tests:
Vanilla (current master):
explain analyze
select t1.unique1 from tenk1 t1
inner join tenk2 t2 on t1.tenthous = t2.tenthous and t2.thousand = 0
   union all
(values(1)) limit 1;
                                                          QUERY PLAN                                                          
------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..219.55 rows=1 width=4) (actual time=6.309..6.312 rows=1 loops=1)
   ->  Append  (cost=0.00..2415.09 rows=11 width=4) (actual time=6.308..6.310 rows=1 loops=1)
         ->  Nested Loop  (cost=0.00..2415.03 rows=10 width=4) (actual time=6.307..6.308 rows=1 loops=1)
               Join Filter: (t1.tenthous = t2.tenthous)
               Rows Removed by Join Filter: 4210
               ->  Seq Scan on tenk1 t1  (cost=0.00..445.00 rows=10000 width=8) (actual time=0.004..0.057 rows=422 loops=1)
               ->  Materialize  (cost=0.00..470.05 rows=10 width=4) (actual time=0.000..0.014 rows=10 loops=422)
                     Storage: Memory  Maximum Storage: 17kB
                     ->  Seq Scan on tenk2 t2  (cost=0.00..470.00 rows=10 width=4) (actual time=0.076..5.535 rows=10 loops=1)
                           Filter: (thousand = 0)
                           Rows Removed by Filter: 9990
         ->  Result  (cost=0.00..0.01 rows=1 width=4) (never executed)
 Planning Time: 0.324 ms
 Execution Time: 6.336 ms
(14 rows)

Patched, the same test:
explain analyze
select t1.unique1 from tenk1 t1
inner join tenk2 t2 on t1.tenthous = t2.tenthous and t2.thousand = 0
   union all
(values(1)) limit 1;
                                                                    QUERY PLAN                                                                    
--------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.29..126.00 rows=1 width=4) (actual time=0.105..0.106 rows=1 loops=1)
   ->  Append  (cost=0.29..1383.12 rows=11 width=4) (actual time=0.104..0.105 rows=1 loops=1)
         ->  Nested Loop  (cost=0.29..1383.05 rows=10 width=4) (actual time=0.104..0.104 rows=1 loops=1)
               ->  Seq Scan on tenk2 t2  (cost=0.00..470.00 rows=10 width=4) (actual time=0.076..0.076 rows=1 loops=1)
                     Filter: (thousand = 0)
                     Rows Removed by Filter: 421
               ->  Index Scan using tenk1_thous_tenthous on tenk1 t1  (cost=0.29..91.30 rows=1 width=8) (actual time=0.026..0.026 rows=1 loops=1)
                     Index Cond: (tenthous = t2.tenthous)
         ->  Result  (cost=0.00..0.01 rows=1 width=4) (never executed)
 Planning Time: 0.334 ms
 Execution Time: 0.130 ms
(11 rows)
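
To see why the startup cost matters under LIMIT: the Limit node's total cost
is interpolated between the child's startup and total cost by the fraction
of rows it expects to fetch (1 of 11 here):

  vanilla: 0.00 + (2415.09 - 0.00) * (1/11) ~= 219.55
  patched: 0.29 + (1383.12 - 0.29) * (1/11) ~= 126.00

which matches the Limit costs shown above.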

We hope this optimization proves useful.

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company

Re: Considering fractional paths in Append node

From: Andy Fan
Nikita Malakhov <hukutoc@gmail.com> writes:

Hello Nikita,

> Hi hackers!
>
> Sorry, I forgot to attach the patch itself. Please check it out.

Could you check whether [1] is related to this subject? I think the hard
part is deciding which tuple_fraction to use when calling
add_paths_to_append_rel, since root->tuple_fraction is at the subquery
level, while add_paths_to_append_rel works at the RelOptInfo level.

[1]
https://www.postgresql.org/message-id/CAApHDvry0nSV62kAOH3iccvfPhGPLN0Q97%2B%3Db1RsDPXDz3%3DCiQ%40mail.gmail.com
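
For reference, tuple_fraction is interpreted as follows (per the comments
in planner.c):

 * tuple_fraction is interpreted as follows:
 *    0: expect all tuples to be retrieved (normal case)
 *    0 < tuple_fraction < 1: expect the given fraction of tuples available
 *        from the plan to be retrieved
 *    tuple_fraction >= 1: tuple_fraction is the absolute number of tuples
 *        expected to be retrieved (ie, a LIMIT specification)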

-- 
Best Regards
Andy Fan




Re: Considering fractional paths in Append node

From: Andrei Lepikhov
On 10/17/24 07:05, Andy Fan wrote:
> Nikita Malakhov <hukutoc@gmail.com> writes:
> 
> Hello Nikita,
> 
>> Hi hackers!
>>
>> Sorry, I forgot to attach the patch itself. Please check it out.
> 
> Could you check whether [1] is related to this subject? I think the hard
> part is deciding which tuple_fraction to use when calling
> add_paths_to_append_rel, since root->tuple_fraction is at the subquery
> level, while add_paths_to_append_rel works at the RelOptInfo level.
> 
> [1]
> https://www.postgresql.org/message-id/CAApHDvry0nSV62kAOH3iccvfPhGPLN0Q97%2B%3Db1RsDPXDz3%3DCiQ%40mail.gmail.com
Yes, this thread is connected to the code Nikita has proposed.
It is mostly related to partitioned cases, of course.
I'm not sure about partitionwise joins - that looks overcomplicated for
the moment. We just see cases where SeqScan is preferred to IndexScan.
One logical way is to add an alternative fractional path to the Append
node; then, if at the top level a Limit, IncrementalSort, or another
fraction-friendly node requests only a small part of the data, the
optimiser will have the option to choose the fractional path.

-- 
regards, Andrei Lepikhov




Re: Considering fractional paths in Append node

From: Nikita Malakhov
Hi,

Andy, your words make sense, but it seems to me that in add_paths_to_append_rel
we have no options other than the tuple_fraction from root and the row counts
(if any) in the paths we take into account - please correct me if I am wrong.

Thank you!

Also, on top of that, I have an idea of pruning unnecessary partitions
in generate_orderedappend_paths() when we have a valid LIMIT value.
I'm currently checking that it works correctly in multiple cases,
so I'll send it after we deal with this issue.

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company

Re: Considering fractional paths in Append node

From: Andy Fan
Nikita Malakhov <hukutoc@gmail.com> writes:

> Hi,
>
> Andy, your words make sense, but it seems to me that in add_paths_to_append_rel
> we have no options other than the tuple_fraction from root and the row counts
> (if any) in the paths we take into account - please correct me if I am wrong.

One option might be to apply your logic only when we can prove that the
tuple_fraction from root is the same as the RelOptInfo-level tuple_fraction,
similar to what I did before. But that approach has proved too complex.

> Thank you!
>
> Also, on top of that, I have an idea of pruning unnecessary partitions
> in generate_orderedappend_paths() when we have a valid LIMIT value.
> I'm currently checking that it works correctly in multiple cases,
> so I'll send it after we deal with this issue.

-- 
Best Regards
Andy Fan




Re: Considering fractional paths in Append node

From: Andy Fan
Nikita Malakhov <hukutoc@gmail.com> writes:


> The effect is easily seen in one of the standard PG tests:
> Vanilla (current master):
> explain analyze
> select t1.unique1 from tenk1 t1
> inner join tenk2 t2 on t1.tenthous = t2.tenthous and t2.thousand = 0
>    union all
> (values(1)) limit 1;
>                                                           QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=0.00..219.55 rows=1 width=4) (actual time=6.309..6.312 rows=1 loops=1)
>    ->  Append  (cost=0.00..2415.09 rows=11 width=4) (actual time=6.308..6.310 rows=1 loops=1)
>          ->  Nested Loop  (cost=0.00..2415.03 rows=10 width=4) (actual time=6.307..6.308 rows=1 loops=1)
>                Join Filter: (t1.tenthous = t2.tenthous)
>                Rows Removed by Join Filter: 4210
>                ->  Seq Scan on tenk1 t1  (cost=0.00..445.00 rows=10000 width=8) (actual time=0.004..0.057 rows=422 loops=1)
>                ->  Materialize  (cost=0.00..470.05 rows=10 width=4) (actual time=0.000..0.014 rows=10 loops=422)
>                      Storage: Memory  Maximum Storage: 17kB
>                      ->  Seq Scan on tenk2 t2  (cost=0.00..470.00 rows=10 width=4) (actual time=0.076..5.535 rows=10 loops=1)
>                            Filter: (thousand = 0)
>                            Rows Removed by Filter: 9990
>          ->  Result  (cost=0.00..0.01 rows=1 width=4) (never executed)
>  Planning Time: 0.324 ms
>  Execution Time: 6.336 ms
> (14 rows)
>
> Patched, the same test:
> explain analyze
> select t1.unique1 from tenk1 t1
> inner join tenk2 t2 on t1.tenthous = t2.tenthous and t2.thousand = 0
>    union all
> (values(1)) limit 1;
>                                                                     QUERY PLAN
> --------------------------------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=0.29..126.00 rows=1 width=4) (actual time=0.105..0.106 rows=1 loops=1)
>    ->  Append  (cost=0.29..1383.12 rows=11 width=4) (actual time=0.104..0.105 rows=1 loops=1)
>          ->  Nested Loop  (cost=0.29..1383.05 rows=10 width=4) (actual time=0.104..0.104 rows=1 loops=1)
>                ->  Seq Scan on tenk2 t2  (cost=0.00..470.00 rows=10 width=4) (actual time=0.076..0.076 rows=1 loops=1)
>                      Filter: (thousand = 0)
>                      Rows Removed by Filter: 421
>                ->  Index Scan using tenk1_thous_tenthous on tenk1 t1  (cost=0.29..91.30 rows=1 width=8) (actual time=0.026..0.026 rows=1 loops=1)
>                      Index Cond: (tenthous = t2.tenthous)
>          ->  Result  (cost=0.00..0.01 rows=1 width=4) (never executed)
>  Planning Time: 0.334 ms
>  Execution Time: 0.130 ms
> (11 rows)
>

This is a nice result. After some more thought, I feel the startup cost
calculation for seq scans is the more important thing to address.

Bad plan:  Append  (cost=0.00..2415.09 ..) shows a startup cost of 0.
Good plan: Append  (cost=0.29..1383.12 ..) shows a startup cost of 0.29.

The major reason for this is that we calculate the "startup cost" for
"SeqScan" and "Index Scan" under different guidelines. For an "Index
Scan", the startup cost is "the cost to retrieve the first tuple";
for a "SeqScan" it is not, as we can see from the 0 startup_cost for
the query "SELECT * FROM tenk2 WHERE thousand = 0".

In my understanding, "startup cost" already means the cost to retrieve the
first tuple, but at [1], Tom said:

"""
I think that things might work out better if we redefined the startup
cost as "estimated cost to retrieve the first tuple", rather than its
current very-squishy definition as "cost to initialize the scan". 
"""

The above statement confuses me. If we take the startup cost as the cost
to retrieve the first tuple, we can do the quick hack below:

@@ -355,8 +355,8 @@ cost_seqscan(Path *path, PlannerInfo *root,
     }
 
     path->disabled_nodes = enable_seqscan ? 0 : 1;
-    path->startup_cost = startup_cost;
     path->total_cost = startup_cost + cpu_run_cost + disk_run_cost;
+    path->startup_cost = startup_cost + (cpu_run_cost + disk_run_cost) * (1 - path->rows / baserel->tuples);
 }
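
A quick sanity check of the formula: for the tenk2 scan (10000 tuples, run
cost 470.00, 10 matching rows) the hack gives

  startup_cost = 0 + 470.00 * (1 - 10/10000) = 469.53

which is exactly the startup cost shown for the tenk2 seq scans in the
plans below.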

We get this plan:

regression=# explain                   
select t1.unique1 from tenk1 t1
inner join tenk2 t2 on t1.tenthous = t2.tenthous and t2.thousand = 0 union all select 1 limit 1
;
                                     QUERY PLAN                                      
-------------------------------------------------------------------------------------
 Limit  (cost=470.12..514.00 rows=1 width=4)
   ->  Append  (cost=470.12..952.79 rows=11 width=4)
         ->  Hash Join  (cost=470.12..952.73 rows=10 width=4)
               Hash Cond: (t1.tenthous = t2.tenthous)
               ->  Seq Scan on tenk1 t1  (cost=0.00..445.00 rows=10000 width=8)
               ->  Hash  (cost=470.00..470.00 rows=10 width=4)
                     ->  Seq Scan on tenk2 t2  (cost=469.53..470.00 rows=10 width=4)
                           Filter: (thousand = 0)
         ->  Result  (cost=0.00..0.01 rows=1 width=4)
(9 rows)

set enable_hashjoin to off;

regression=# explain                    
select t1.unique1 from tenk1 t1
inner join tenk2 t2 on t1.tenthous = t2.tenthous and t2.thousand = 0 union all select 1 limit 1
;
                                               QUERY PLAN                                               
--------------------------------------------------------------------------------------------------------
 Limit  (cost=469.81..542.66 rows=1 width=4)
   ->  Append  (cost=469.81..1271.12 rows=11 width=4)
         ->  Nested Loop  (cost=469.81..1271.05 rows=10 width=4)
               ->  Seq Scan on tenk2 t2  (cost=469.53..470.00 rows=10 width=4)
                     Filter: (thousand = 0)
               ->  Index Scan using tenk1_thous_tenthous on tenk1 t1  (cost=0.29..80.09 rows=1 width=8)
                     Index Cond: (tenthous = t2.tenthous)
         ->  Result  (cost=0.00..0.01 rows=1 width=4)
(8 rows)

It looks like we still have other work to do, but the desired plan now has
a cost closer to the estimated best plan than before.


[1]
https://www.postgresql.org/message-id/flat/3783591.1721327902%40sss.pgh.pa.us#09d6471fc59b35fa4aca939e49943c2c 
-- 
Best Regards
Andy Fan




Re: Considering fractional paths in Append node

From: Andrei Lepikhov
On 10/18/24 07:54, Andy Fan wrote:
> Nikita Malakhov <hukutoc@gmail.com> writes:
>> The effect is easily seen in one of the standard PG tests:
> """
> I think that things might work out better if we redefined the startup
> cost as "estimated cost to retrieve the first tuple", rather than its
> current very-squishy definition as "cost to initialize the scan".
> """
> The above statement confuses me. If we take the startup cost as the cost
> to retrieve the first tuple, we can do the quick hack below:
A promising way to go. Of course, even in that case IndexScan usually gives
way to SeqScan (because of the additional heap fetch), and only
IndexOnlyScan may overcome it. Moreover, the SeqScan first-tuple cost is a
contentious issue - who knows how many tuples the filter will reject before
the first tuple is produced?

> 
> Looks we still have some other stuff to do, but we have seen the desired
> plan has a closer cost to estimated best plan than before.
But our patch is about something different: by adding one more Append 
strategy (and, as I realised recently, a promising MergeAppend one too), we 
give an upper fraction-friendly node the chance to decide which plan is 
better. Right now, AFAIK, only the LIMIT node can profit from that (maybe 
IncrementalSort too, if we include MergeAppend). But it may open the door 
to improving other nodes as well.

-- 
regards, Andrei Lepikhov




Re: Considering fractional paths in Append node

From: Nikita Malakhov
Hi!

Andy, one quick question: what do you think about using root->limit_tuples
as guidance for how many rows we have to consider in plan costs?

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company

Re: Considering fractional paths in Append node

From: Andy Fan
Nikita Malakhov <hukutoc@gmail.com> writes:

Hi Nikita,

> Hi!
>
> Andy, one quick question: what do you think about using
> root->limit_tuples as guidance for how many rows we have to consider in
> plan costs?

In my experience, committers would probably dislike the idea of ignoring
the difference in tuple_fraction between the subquery level and the
RelOptInfo level, and they also dislike a complex solution like the one I
tried in [1]. Maybe you can figure out a complete yet simple solution that
makes everyone happy, but I have no ideas on this right now.

[1]
https://www.postgresql.org/message-id/CAApHDvry0nSV62kAOH3iccvfPhGPLN0Q97%2B%3Db1RsDPXDz3%3DCiQ%40mail.gmail.com

-- 
Best Regards
Andy Fan




Re: Considering fractional paths in Append node

From: Nikita Malakhov
Hi,

Andy, thank you, I've checked that thread out, along with run-time
partition pruning. I've spent some time poring over the tuple_fraction
field usage and would disagree with you on this topic: it is already used
at the RelOptInfo level later on, in the generate_orderedappend_paths()
function. I mean the following piece:
    if (root->tuple_fraction > 0)
    {
        double      path_fraction = (1.0 / root->tuple_fraction);
        Path       *cheapest_fractional;

        cheapest_fractional =
            get_cheapest_fractional_path_for_pathkeys(childrel->pathlist,
                                                      pathkeys, NULL,
                                                      path_fraction);
        ...

So it does not seem incorrect to use its value for a single relation in a
subquery. I agree that we do not have an accurate estimate at this level,
but we could use the one we already have.
I've also tried hard to find an example where this patch could break
something, but without success.

--
Nikita Malakhov
Postgres Professional
The Russian Postgres Company

Re: Considering fractional paths in Append node

From: Andy Fan
Nikita Malakhov <hukutoc@gmail.com> writes:

> Hi,
>
> Andy, thank you, I've checked that thread out, along with run-time
> partition pruning.
I'm not sure about the relationship between this topic and run-time
partition pruning...

> I've spent some time poring over the tuple_fraction field usage and would
> disagree with you on this topic: it is already used at the RelOptInfo
> level later on, in the generate_orderedappend_paths() function.

It looks like you are right that root->tuple_fraction has been used at the
RelOptInfo level in generate_orderedappend_paths(). But we have also tried
not to use it at the RelOptInfo level, for example in
set_subquery_pathlist(). See:

"""
/*
 * We can safely pass the outer tuple_fraction down to the subquery if the
 * outer level has no joining, aggregation, or sorting to do. Otherwise
 * we'd better tell the subquery to plan for full retrieval. (XXX This
 * could probably be made more intelligent ...)
 */
"""

I'm not sure the "more intelligent" would be just use it directly. 

So I'm not saying we can't do this, just that the facts are:
(a) root->tuple_fraction is not exactly the same as a RelOptInfo's
tuple_fraction.
(b) We have used root->tuple_fraction at the RelOptInfo level in some cases
and also tried not to use it in other cases (or to use it only in certain
situations, similar to what I did before).

It looks like different committers have different opinions on this.

-- 
Best Regards
Andy Fan