[HACKERS] [Proposal] Make the optimiser aware of partitions ordering - Mailing list pgsql-hackers

From Ronan Dunklau
Subject [HACKERS] [Proposal] Make the optimiser aware of partitions ordering
Date
Msg-id 2401607.SfZhPQhbS4@ronan_laptop
Whole thread Raw
Responses Re: [HACKERS] [Proposal] Make the optimiser aware of partitions ordering  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Hello,

With native partitioning landing in Postgres 10, we (Julien Rouhaud and 
myself) had the idea for the 
following simple optimisation. This simple optimisation comes from a real use 
case.

===== Proposal =====

With range partitioning, we guarantee that each partition contains non-
overlapping values. Since we know the range allowed for each partition, it is 
possible to sort them according to the partition key (as is done already for 
looking up partitions).

Thus, we ca generate sorted Append plans instead of MergeAppend when sorting 
occurs on the partition key.

For example, consider the following model and query:


CREATE TABLE parent (id int) PARTITION BY RANGE (id);
CREATE TABLE child2 PARTITION OF parent FOR VALUES FROM (10) TO (20);
CREATE TABLE child1 PARTITION OF parent FOR VALUES FROM (1) TO (10);
CREATE INDEX ON child1(id);
CREATE INDEX ON child2(id);



EXPLAIN (COSTS OFF) SELECT * FROM parent ORDER BY id desc;

                          QUERY PLAN                          
--------------------------------------------------------------
 Merge Append
   Sort Key: parent.id DESC
   ->  Sort
         Sort Key: parent.id DESC
         ->  Seq Scan on parent
   ->  Index Only Scan Backward using child2_id_idx on child2
   ->  Index Only Scan Backward using child1_id_idx on child

We can guarantee that every value found in child2 will have a greater id than 
any value found in child1. Thus, we could generate a plan like this:

                          QUERY PLAN                           
---------------------------------------------------------------
 Append
   Sort Key: child2.id DESC
   ->  Index Only Scan Backward using child2_id_idx1 on child2
   ->  Index Only Scan Backward using child1_id_idx1 on child1

Skipping the merge step altogether.

This has the added benefit of providing an "early stop": if we add a limit to 
the query, we will only scan the indexes that are necessary for fetching the 
required amount of tuples. This is especially useful if we add a WHERE clause 
not covered by the index with the limit. Consider the following example, with 
a table "webvisits" partitioned by a ts colum. If we want to retrieve the 
latest 100 hits from a specific ip:

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM webvisits WHERE ipaddr = 
'93.184.216.34' ORDER BY ts DESC LIMIT 10;

                                                            

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..32.32 rows=10 width=104) (actual time=0.080..0.109 rows=10 
loops=1)
   Buffers: shared hit=7
   ->  Append  (cost=0.43..401009.65 rows=125740 width=104) (actual 
time=0.078..0.105 rows=10 loops=1)
         Sort Key: webvisits_201712.ts DESC
         Buffers: shared hit=7
         ->  Index Scan Backward using webvisits_201712_ipaddr_ts_idx on 
webvisits_201712  (cost=0.43..133617.70 rows=41901 width=104) (actual 
time=0.077..0.101 rows=10 loops=1)
               Index Cond: (ipaddr = '93.184.216.34'::inet)
               Buffers: shared hit=7
         ->  Index Scan Backward using webvisits_201711_ipaddr_ts_idx on 
webvisits_201711  (cost=0.43..131514.18 rows=41243 width=104) (never executed)
               Index Cond: (ipaddr = '93.184.216.34'::inet)
         ->  Index Scan Backward using webvisits_201710_ipaddr_ts_idx on 
webvisits_201710  (cost=0.43..135801.71 rows=42587 width=104) (never executed)
             ........

We have developed a proof-of-concept implementation for this optimisation.

For sub partitioning, intermediate Append nodes can be generated.

Consider the following examples:


CREATE TABLE parentparent (c1 int, c2 int) PARTITION BY range (c1);

-- Subpartition only on second column
CREATE TABLE parent1 PARTITION OF parentparent FOR VALUES FROM (1) TO (10)
PARTITION BY range (c2);

-- Subpartition on both columns
CREATE TABLE parent2 PARTITION OF parentparent FOR VALUES FROM (10) TO (20)
PARTITION BY range (c1, c2);

CREATE TABLE child11 PARTITION OF parent1 FOR VALUES FROM (1) TO (10);
CREATE TABLE child12 PARTITION OF parent1 FOR VALUES FROM (10) TO (20);
CREATE TABLE child21 PARTITION OF parent2 FOR VALUES FROM (10, 1) TO (15, 10);
CREATE TABLE child22 PARTITION OF parent2 FOR VALUES FROM (15, 10) TO (20, 
20);

EXPLAIN (COSTS OFF) SELECT * FROM parentparent ORDER BY c1;
              QUERY PLAN               
---------------------------------------
 Append
   Sort Key: parent1.c1
   ->  Sort
         Sort Key: parent1.c1
         ->  Append
               ->  Seq Scan on parent1
               ->  Seq Scan on child11
               ->  Seq Scan on child12
   ->  Append
         Sort Key: child21.c1
         ->  Sort
               Sort Key: child21.c1
               ->  Seq Scan on child21
         ->  Sort
               Sort Key: child22.c1
               ->  Seq Scan on child22


EXPLAIN (COSTS OFF) SELECT * FROM parentparent ORDER BY c1, c2;
                   QUERY PLAN                   
------------------------------------------------
 Append
   Sort Key: parent1.c1, parent1.c2
   ->  Sort
         Sort Key: parent1.c1, parent1.c2
         ->  Append
               ->  Seq Scan on parent1
               ->  Seq Scan on child11
               ->  Seq Scan on child12
   ->  Append
         Sort Key: child21.c1, child21.c2
         ->  Sort
               Sort Key: child21.c1, child21.c2
               ->  Seq Scan on child21
         ->  Sort
               Sort Key: child22.c1, child22.c2
               ->  Seq Scan on child22



===== Patch design =====

First, we don't really know what we're doing :). But this is a proof of 
concept, and if there is interest in this patch, let us know how we should 
have done things and we're willing to rewrite it entirely if needed.

==== Overview ====

In the optimiser, we generate PathKeys representing the two sort orders by 
wich partition can be ordered (ASC and DESC), only for "native" range-based 
partitioned tables, and store them in RelOptInfo associated with every parent 
table, along with a list of OIDs corresponding to the partitions.

Then, when the time comes to generate append paths, we add another step which 
tries to match the query_pathkeys to those of the partition, and generate 
another append node with the sorted paths for each partitions, in the expected 
order, and a pathkey to the append path itself to indicate its already sorted.

==== Known Problems with the patch ====

Once again, keep in mind it's a proof-of-concept  

- It is in conflict with the existing patch designed to avoid adding the 
parent partitions
to the append node. As such, it will almost certainly need a full rewrite to 
adapt to that since that is done in this patch in a probably less than ideal 
fashion.
- As of now, the patch doesn't try to match the partition pathkey to 
mergejoinable pathkeys. 
- maybe a new node should be created altogether, instead of porting most of 
the MergeAppend struct fields to the basic Append ?
- the way we store the PathKeys and partitions OID directly in RelOptInfo is 
probably wrong
- the major impact on existing queries is the fact that to handle sub-
partitioning, RangeTblEntries representing intermediate append nodes are added 
(but just in the case of native partitioning, regular inheritance is not 
affected). See updated prepunion.c for what that means. Those RTE are then 
ignored 
when generating the real Append nodes.
- new regression tests have not been written yet, and existing ones haven't 
been "fixed" to take into account the different partition ordering: in case of 
sub partitioning, the order is now "depth-first" rather than "breadth-first" 
like it was earlier.
- no documentation has been added, although we don't really know where it 
should go
- the patch lacks a lot of comments
- the key displayed in EXPLAIN output comes from the first partition, not the 
parent.
- maybe a "real" SortPath should be generated, instead of generating Sort
nodes 
out of nowhere when creating the plan. This has been done to conform to what 
MergeAppend already did.

===== Questions =====

Is there interest for such an optimization ?
Should we work from the patch proposed to remove parent tables already ?
Should there be a new Node for such a "SortedAppend" operation or is it fine 
keeping it with the Append node already ? Should that instead be an 
optimization of the MergeAppend Node ?
What is or is not acceptable with regards to storing things in RelOptInfo 
(among others) ?

More generally, any kind of feedback that will help us get on the right track 
will be appreciated.


-- 
Ronan Dunklau & Julien Rouhaud
http://dalibo.com - http://dalibo.org
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

pgsql-hackers by date:

Previous
From: Pavel Stehule
Date:
Subject: Re: [HACKERS] PoC plpgsql - possibility to force custom or generic plan
Next
From: Stas Kelvich
Date:
Subject: Re: [HACKERS] logical decoding of two-phase transactions