[HACKERS] [Proposal] Make the optimiser aware of partitions ordering - Mailing list pgsql-hackers
From | Ronan Dunklau |
---|---|
Subject | [HACKERS] [Proposal] Make the optimiser aware of partitions ordering |
Date | |
Msg-id | 2401607.SfZhPQhbS4@ronan_laptop Whole thread Raw |
Responses |
Re: [HACKERS] [Proposal] Make the optimiser aware of partitions ordering
|
List | pgsql-hackers |
Hello, With native partitioning landing in Postgres 10, we (Julien Rouhaud and myself) had the idea for the following simple optimisation. This simple optimisation comes from a real use case. ===== Proposal ===== With range partitioning, we guarantee that each partition contains non- overlapping values. Since we know the range allowed for each partition, it is possible to sort them according to the partition key (as is done already for looking up partitions). Thus, we ca generate sorted Append plans instead of MergeAppend when sorting occurs on the partition key. For example, consider the following model and query: CREATE TABLE parent (id int) PARTITION BY RANGE (id); CREATE TABLE child2 PARTITION OF parent FOR VALUES FROM (10) TO (20); CREATE TABLE child1 PARTITION OF parent FOR VALUES FROM (1) TO (10); CREATE INDEX ON child1(id); CREATE INDEX ON child2(id); EXPLAIN (COSTS OFF) SELECT * FROM parent ORDER BY id desc; QUERY PLAN -------------------------------------------------------------- Merge Append Sort Key: parent.id DESC -> Sort Sort Key: parent.id DESC -> Seq Scan on parent -> Index Only Scan Backward using child2_id_idx on child2 -> Index Only Scan Backward using child1_id_idx on child We can guarantee that every value found in child2 will have a greater id than any value found in child1. Thus, we could generate a plan like this: QUERY PLAN --------------------------------------------------------------- Append Sort Key: child2.id DESC -> Index Only Scan Backward using child2_id_idx1 on child2 -> Index Only Scan Backward using child1_id_idx1 on child1 Skipping the merge step altogether. This has the added benefit of providing an "early stop": if we add a limit to the query, we will only scan the indexes that are necessary for fetching the required amount of tuples. This is especially useful if we add a WHERE clause not covered by the index with the limit. Consider the following example, with a table "webvisits" partitioned by a ts colum. If we want to retrieve the latest 100 hits from a specific ip: EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM webvisits WHERE ipaddr = '93.184.216.34' ORDER BY ts DESC LIMIT 10; ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=0.43..32.32 rows=10 width=104) (actual time=0.080..0.109 rows=10 loops=1) Buffers: shared hit=7 -> Append (cost=0.43..401009.65 rows=125740 width=104) (actual time=0.078..0.105 rows=10 loops=1) Sort Key: webvisits_201712.ts DESC Buffers: shared hit=7 -> Index Scan Backward using webvisits_201712_ipaddr_ts_idx on webvisits_201712 (cost=0.43..133617.70 rows=41901 width=104) (actual time=0.077..0.101 rows=10 loops=1) Index Cond: (ipaddr = '93.184.216.34'::inet) Buffers: shared hit=7 -> Index Scan Backward using webvisits_201711_ipaddr_ts_idx on webvisits_201711 (cost=0.43..131514.18 rows=41243 width=104) (never executed) Index Cond: (ipaddr = '93.184.216.34'::inet) -> Index Scan Backward using webvisits_201710_ipaddr_ts_idx on webvisits_201710 (cost=0.43..135801.71 rows=42587 width=104) (never executed) ........ We have developed a proof-of-concept implementation for this optimisation. For sub partitioning, intermediate Append nodes can be generated. Consider the following examples: CREATE TABLE parentparent (c1 int, c2 int) PARTITION BY range (c1); -- Subpartition only on second column CREATE TABLE parent1 PARTITION OF parentparent FOR VALUES FROM (1) TO (10) PARTITION BY range (c2); -- Subpartition on both columns CREATE TABLE parent2 PARTITION OF parentparent FOR VALUES FROM (10) TO (20) PARTITION BY range (c1, c2); CREATE TABLE child11 PARTITION OF parent1 FOR VALUES FROM (1) TO (10); CREATE TABLE child12 PARTITION OF parent1 FOR VALUES FROM (10) TO (20); CREATE TABLE child21 PARTITION OF parent2 FOR VALUES FROM (10, 1) TO (15, 10); CREATE TABLE child22 PARTITION OF parent2 FOR VALUES FROM (15, 10) TO (20, 20); EXPLAIN (COSTS OFF) SELECT * FROM parentparent ORDER BY c1; QUERY PLAN --------------------------------------- Append Sort Key: parent1.c1 -> Sort Sort Key: parent1.c1 -> Append -> Seq Scan on parent1 -> Seq Scan on child11 -> Seq Scan on child12 -> Append Sort Key: child21.c1 -> Sort Sort Key: child21.c1 -> Seq Scan on child21 -> Sort Sort Key: child22.c1 -> Seq Scan on child22 EXPLAIN (COSTS OFF) SELECT * FROM parentparent ORDER BY c1, c2; QUERY PLAN ------------------------------------------------ Append Sort Key: parent1.c1, parent1.c2 -> Sort Sort Key: parent1.c1, parent1.c2 -> Append -> Seq Scan on parent1 -> Seq Scan on child11 -> Seq Scan on child12 -> Append Sort Key: child21.c1, child21.c2 -> Sort Sort Key: child21.c1, child21.c2 -> Seq Scan on child21 -> Sort Sort Key: child22.c1, child22.c2 -> Seq Scan on child22 ===== Patch design ===== First, we don't really know what we're doing :). But this is a proof of concept, and if there is interest in this patch, let us know how we should have done things and we're willing to rewrite it entirely if needed. ==== Overview ==== In the optimiser, we generate PathKeys representing the two sort orders by wich partition can be ordered (ASC and DESC), only for "native" range-based partitioned tables, and store them in RelOptInfo associated with every parent table, along with a list of OIDs corresponding to the partitions. Then, when the time comes to generate append paths, we add another step which tries to match the query_pathkeys to those of the partition, and generate another append node with the sorted paths for each partitions, in the expected order, and a pathkey to the append path itself to indicate its already sorted. ==== Known Problems with the patch ==== Once again, keep in mind it's a proof-of-concept - It is in conflict with the existing patch designed to avoid adding the parent partitions to the append node. As such, it will almost certainly need a full rewrite to adapt to that since that is done in this patch in a probably less than ideal fashion. - As of now, the patch doesn't try to match the partition pathkey to mergejoinable pathkeys. - maybe a new node should be created altogether, instead of porting most of the MergeAppend struct fields to the basic Append ? - the way we store the PathKeys and partitions OID directly in RelOptInfo is probably wrong - the major impact on existing queries is the fact that to handle sub- partitioning, RangeTblEntries representing intermediate append nodes are added (but just in the case of native partitioning, regular inheritance is not affected). See updated prepunion.c for what that means. Those RTE are then ignored when generating the real Append nodes. - new regression tests have not been written yet, and existing ones haven't been "fixed" to take into account the different partition ordering: in case of sub partitioning, the order is now "depth-first" rather than "breadth-first" like it was earlier. - no documentation has been added, although we don't really know where it should go - the patch lacks a lot of comments - the key displayed in EXPLAIN output comes from the first partition, not the parent. - maybe a "real" SortPath should be generated, instead of generating Sort nodes out of nowhere when creating the plan. This has been done to conform to what MergeAppend already did. ===== Questions ===== Is there interest for such an optimization ? Should we work from the patch proposed to remove parent tables already ? Should there be a new Node for such a "SortedAppend" operation or is it fine keeping it with the Append node already ? Should that instead be an optimization of the MergeAppend Node ? What is or is not acceptable with regards to storing things in RelOptInfo (among others) ? More generally, any kind of feedback that will help us get on the right track will be appreciated. -- Ronan Dunklau & Julien Rouhaud http://dalibo.com - http://dalibo.org -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
pgsql-hackers by date: