[HACKERS] advanced partition matching algorithm for partition-wise join - Mailing list pgsql-hackers
From | Ashutosh Bapat |
---|---|
Subject | [HACKERS] advanced partition matching algorithm for partition-wise join |
Date | |
Msg-id | CAFjFpRdjQvaUEV5DJX3TW6pU5eq54NCkadtxHX2JiJG_GvbrCA@mail.gmail.com Whole thread Raw |
Responses |
Re: [HACKERS] advanced partition matching algorithm forpartition-wise join
(Rajkumar Raghuwanshi <rajkumar.raghuwanshi@enterprisedb.com>)
|
List | pgsql-hackers |
The patch-set in [1] supports partition-wise join when the partition bounds and partition keys of the joining tables exactly match. The last two patches in the last few patch-sets in that thread implement more advanced partition matching code. In order to avoid mixing reviews for advanced partition matching and the basic partition-wise join implementation, I am starting a new thread to discuss the same. I am attaching the last two patches from that patch set here. The new partition matching algorithm handles following cases when a given partition on one side has at most one matching partition matching on the other side. 1. When the ranges of the joining tables do not match exactly E.g. partition table t1 has partitions t1p1 (0 - 100), t1p2 (150 - 200) and partition table t2 has partitions t2p1 (0 - 50), t2p2 (100 - 175). In this case (t1p1, t2p1) and (t1p2, t2p2) form the matching partition pairs, which can be joined. While matching the pairs, we also compute the partition bounds for the resulting join. An INNER join between t1 and t2 will have ranges (0 - 50) since no row with 50 <= key < 100 from t1p1 is going to find a matching row in t2p1 and (150 - 175) since no row with 100 <= key < 150 from t2p2 is going to find a matching row in t1p2 and no row with 175 <= key < 200 in t1p2 is going to find a matching row in t1p2. A t1 LEFT join t2 on the other hand will have ranges same as the outer relation i.e. t1, (0 - 100), (150 - 200) since all rows from t1 will be part of the join. Thus depending upon the type of join the partition bounds of the resultant join relation change. Similarly for list partitioned table, when the lists do not match exactly, the algorithm finds matching pairs of partitions and the lists of resultant join relation. E.g. t1 has partitions t1p1 ('a', 'b', 'c'), t1p2 ('e', 'f') and t2 has partitions t2p1 ('a', 'b'), t2p2 ('d', 'e', 'f'). In this case (t1p1, t2p1) and (t2p1, t2p2) form the matching pairs which are joined. Inner join will have bounds ('a','b'), ('e', 'f') and t1 LEFT JOIN t2 will have bounds same as t1. 2. When one or both side have at least one partition that does not have matching partition on the other side. E.g. t1 has partitions t1p1 ('a','b'), t1p2 ('c','d') and t2 has only one partition t2p1 ('a','b') OR t1 has partitions t1p1 (0 - 100), t1p2 (100 - 200) and t2 has only one partition t2p1 (0 - 100). In this case as well different types of joins will have different partition bounds for the result using similar rules described above. 3. A combination of 1 and 2 e.g. t1 has partitions t1p1('a','b','c'), t1p2('d','e','f') and t2 has a single partition t2p1 ('a','b', 'z'). Algorithm --------- The pairs of matching partitions and the partition bounds of the join are calculated by an algorithm similar to merge join. In such a join, it can be observed that every partition on either side, contributes to at most one partition of the resultant join relation. Thus for every partition on either side, we keep track of the partition of resultant join (if any), which it contributes to. If multiple partitions from any of the joining relations map to a single partition of the resultant join, we need to gang those partitions together before joining the partition/s from the other side. Since we do not have infrastructure for ganging multiple arbitrary RelOptInfos together in a parent RelOptInfo, we do not support such a partitionw-wise join right now. We stop merging the bounds immediately when we detect such a case. For list partitioned tables, we compare list values from both the sides, starting with the lowest. If the two list values being compared match, corresponding partitions from both sides form a pair of partitions to be joined. We record this mapping and also include the list value in join bounds. If the two list values do not match and the lower of those two comes from the outer side of the join, we include it in the join bounds. We advance to the next list value on side with the lower list value continuing the process of merging till list values on at least one side are exhausted. If the remaining values are from the outer side, we include those in the join partition bounds. Every list value included in the join bounds, and its originating partition/s are associated with appropriate partition of the resultant join. For more details please see partition_list_bounds_merge() in the attached patch. In case of range partitioned tables, we compare the ranges of the partitions in increasing order of their bounds. If two ranges being compared overlap, corresponding partitions from both sides form a pair of partitions to be joined. We record this mapping and also include the merged range in the bounds of resultant join. The overlapping ranges are merged based on the type of join as described above. If either of the ranges completely precedes the other, and it's on the outer side, we include that range in the bounds of resultant join. We advance to the next range on the side with lower upper bound till ranges on at least one side are exhausted. If the remaining ranges are from the outer side, we include those in the partition bounds of resultant join. While including a range in the partition bounds of the resultant join if its lower bound precedes the upper bound of the last included range, it indicates that multiple partitions on that side map to one partition on the other side, so we bail out. Notice that in this method, we always include the ranges in the partition bounds of the resultant join in the increasing order of their bounds. Every range included in the join's partition bounds and it's corresponding partition/s from joining relations are associated with appropriate partition of the resultant join. For more details please see partition_range_bounds_merge() in the attached patch. The partitions from both sides (one partition from each side) which map to the same partition of the resultant join are joined to form child-joins. The case when an outer partition may not have a matching partition from the inner side will be discussed in the next section. Except for the above algorithm to find the pairs of matching partitions and calculating bounds of the resultant join, the rest of the partition-wise join algorithm remains the same. Unsupported case: When a partition from outer side doesn't have matching partition on the inner side. -------------------------------------------------------------------------- Consider a join t1 LEFT JOIN t2 where t1 has partitions t1p1 (0 - 100), t1p2 (100 - 200) and t2 has a single partition t2p1(0 - 100). The rows in t1p2 won't have a matching row in t2 since there is no partition matching t1p2. The result of the join will have rows in t1p2 with columns from t2 NULLed. In order to execute this join as a partition-wise join, we need a dummy relation in place of the missing partition, which we can join with t1p2. We need this placeholder dummy relation (its targetlist, relids etc.), so that rest of the planner can work with the resulting child-join. We notice the missing partitions only while planning the join (during the execution of make_one_rel()), by which time we have frozen the number of base relations. Introducing a base relation during join planning is not supported in current planner. Similarly, a partition can be missing from a partitioned join relation, in which case we have to add a dummy join relation. This might need adding corresponding base relations as well. I have not spent time looking for what it takes to support these cases. For now the patch does not support partition-wise join in such cases. TODOs ----------- 1. Add tests for advanced partition matching algorithm 2. Improve code quality, commenting, function names etc. [1] https://www.postgresql.org/message-id/CAFjFpRd9Vqh_=-Ldv-XqWY006d07TJ+VXuhXCbdj=P1jukYBrw@mail.gmail.com -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
pgsql-hackers by date: