make add_paths_to_append_rel aware of startup cost - Mailing list pgsql-hackers

From Andy Fan
Subject make add_paths_to_append_rel aware of startup cost
Date
Msg-id CAKU4AWrXSkUV=Pt-gRxQT7EbfUeNssprGyNsB=5mJibFZ6S3ww@mail.gmail.com
Whole thread Raw
Responses Re: make add_paths_to_append_rel aware of startup cost
Re: make add_paths_to_append_rel aware of startup cost
List pgsql-hackers
Hi:

This thread is a refactor of thread [1] for easier communication.

Currently add_paths_to_append_rel overlooked the startup cost for creating
append path, so it may have lost some optimization chances.  After a glance, 
the following 4 identifiers can be impacted.

Identifier 1:

SELECT .. FROM v1
UNION ALL
SELECT .. FROM v2
LIMIT 3;

Identifier 2:

SELECT * FROM p .. LIMIT 3;  p is a partitioned table.

Identifier 3:
SELECT * FROM p JOIN q using (partkey) LIMIT 3;

If we did the partition-wise-join, then we lost the chances for a better
plan.

Identifier 4:  -- EXISTS implies LIMIT 1;
SELECT * FROM foo
WHERE EXISTS
(SELECT 1 FROM append_rel_v_not_pullup_able WHERE xxx);

However, after I completed my patch and wanted to build some real 
queries to prove my idea,  I just find it is hard to build the case for 
Identifier 2/3/4.  But the Improvement for Identifier 1 is easy and 
my real user case in work is Identifier 1 as well.

So a patch is attached for this case, it will use fractional costs
rather than total costs if needed. The following items needs more
attention during development. 

- We shouldn't do the optimization if there are still more tables to join,
  the reason is similar to has_multiple_baserels(root) in
  set_subquery_pathlist. But the trouble here is we may inherit multiple
  levels to build an appendrel, so I have to keep the 'top_relids' all the
  time and compare it with PlannerInfo.all_baserels. If they are the same,
  then it is the case we want to optimize.

- add_partial_path doesn't consider the startup costs by design, I didn't
  rethink too much about this, but the idea of "choose a path which
  let each worker produces the top-N tuples as fast as possible" looks
  reasonable, and even though add_partial_path doesn't consider startup
  cost, it is still possible that childrel keeps more than 1 partial paths
  due to any reasons except startup_cost, for example PathKey. then we
  still have chances to choose the cheapest fractional path among
  them. The same strategy also applies to mixed partial and non-partial
  append paths.

- Due to the complexity of add_paths_to_append_rel, 3 arguments have
 to be added to get_cheapest_fractional_path...

  Path *
get_cheapest_fractional_path(RelOptInfo *rel, double tuple_fraction,
bool allow_parameterized, bool look_partial,
bool must_parallel_safe)


Cases can be improved.

Here is the simplest test case,  but it will not be hard to provide more
cases for Identifier 1. 

 (select * from tenk1 order by hundred)
 UNION ALL
 (select * from tenk1 order by hundred)
 limit 3;

master: 8.096ms.
patched: 0.204ms.

The below user case should be more reasonable for real business. 

with a as (select * from t1 join t2..),
b as (select * from t1 join t3 ..)
select * from a union all select * from b
limit 3;

The patch would also have impacts on identifier 2/3/4, even though I can't
make a demo sql which can get benefits from this patch,  I also  added
some test cases for code coverage purposes.
Attachment

pgsql-hackers by date:

Previous
From: jacktby jacktby
Date:
Subject: Re: How to add a new pg oid?
Next
From: Damir Belyalov
Date:
Subject: Output affected rows in EXPLAIN