Thread: [HACKERS] advanced partition matching algorithm for partition-wise join
The patch-set in [1] supports partition-wise join when the partition bounds and partition keys of the joining tables exactly match. The last two patches in the last few patch-sets in that thread implement more advanced partition matching code. In order to avoid mixing reviews for advanced partition matching and the basic partition-wise join implementation, I am starting a new thread to discuss the same. I am attaching the last two patches from that patch set here. The new partition matching algorithm handles following cases when a given partition on one side has at most one matching partition matching on the other side. 1. When the ranges of the joining tables do not match exactly E.g. partition table t1 has partitions t1p1 (0 - 100), t1p2 (150 - 200) and partition table t2 has partitions t2p1 (0 - 50), t2p2 (100 - 175). In this case (t1p1, t2p1) and (t1p2, t2p2) form the matching partition pairs, which can be joined. While matching the pairs, we also compute the partition bounds for the resulting join. An INNER join between t1 and t2 will have ranges (0 - 50) since no row with 50 <= key < 100 from t1p1 is going to find a matching row in t2p1 and (150 - 175) since no row with 100 <= key < 150 from t2p2 is going to find a matching row in t1p2 and no row with 175 <= key < 200 in t1p2 is going to find a matching row in t1p2. A t1 LEFT join t2 on the other hand will have ranges same as the outer relation i.e. t1, (0 - 100), (150 - 200) since all rows from t1 will be part of the join. Thus depending upon the type of join the partition bounds of the resultant join relation change. Similarly for list partitioned table, when the lists do not match exactly, the algorithm finds matching pairs of partitions and the lists of resultant join relation. E.g. t1 has partitions t1p1 ('a', 'b', 'c'), t1p2 ('e', 'f') and t2 has partitions t2p1 ('a', 'b'), t2p2 ('d', 'e', 'f'). In this case (t1p1, t2p1) and (t2p1, t2p2) form the matching pairs which are joined. Inner join will have bounds ('a','b'), ('e', 'f') and t1 LEFT JOIN t2 will have bounds same as t1. 2. When one or both side have at least one partition that does not have matching partition on the other side. E.g. t1 has partitions t1p1 ('a','b'), t1p2 ('c','d') and t2 has only one partition t2p1 ('a','b') OR t1 has partitions t1p1 (0 - 100), t1p2 (100 - 200) and t2 has only one partition t2p1 (0 - 100). In this case as well different types of joins will have different partition bounds for the result using similar rules described above. 3. A combination of 1 and 2 e.g. t1 has partitions t1p1('a','b','c'), t1p2('d','e','f') and t2 has a single partition t2p1 ('a','b', 'z'). Algorithm --------- The pairs of matching partitions and the partition bounds of the join are calculated by an algorithm similar to merge join. In such a join, it can be observed that every partition on either side, contributes to at most one partition of the resultant join relation. Thus for every partition on either side, we keep track of the partition of resultant join (if any), which it contributes to. If multiple partitions from any of the joining relations map to a single partition of the resultant join, we need to gang those partitions together before joining the partition/s from the other side. Since we do not have infrastructure for ganging multiple arbitrary RelOptInfos together in a parent RelOptInfo, we do not support such a partitionw-wise join right now. We stop merging the bounds immediately when we detect such a case. For list partitioned tables, we compare list values from both the sides, starting with the lowest. If the two list values being compared match, corresponding partitions from both sides form a pair of partitions to be joined. We record this mapping and also include the list value in join bounds. If the two list values do not match and the lower of those two comes from the outer side of the join, we include it in the join bounds. We advance to the next list value on side with the lower list value continuing the process of merging till list values on at least one side are exhausted. If the remaining values are from the outer side, we include those in the join partition bounds. Every list value included in the join bounds, and its originating partition/s are associated with appropriate partition of the resultant join. For more details please see partition_list_bounds_merge() in the attached patch. In case of range partitioned tables, we compare the ranges of the partitions in increasing order of their bounds. If two ranges being compared overlap, corresponding partitions from both sides form a pair of partitions to be joined. We record this mapping and also include the merged range in the bounds of resultant join. The overlapping ranges are merged based on the type of join as described above. If either of the ranges completely precedes the other, and it's on the outer side, we include that range in the bounds of resultant join. We advance to the next range on the side with lower upper bound till ranges on at least one side are exhausted. If the remaining ranges are from the outer side, we include those in the partition bounds of resultant join. While including a range in the partition bounds of the resultant join if its lower bound precedes the upper bound of the last included range, it indicates that multiple partitions on that side map to one partition on the other side, so we bail out. Notice that in this method, we always include the ranges in the partition bounds of the resultant join in the increasing order of their bounds. Every range included in the join's partition bounds and it's corresponding partition/s from joining relations are associated with appropriate partition of the resultant join. For more details please see partition_range_bounds_merge() in the attached patch. The partitions from both sides (one partition from each side) which map to the same partition of the resultant join are joined to form child-joins. The case when an outer partition may not have a matching partition from the inner side will be discussed in the next section. Except for the above algorithm to find the pairs of matching partitions and calculating bounds of the resultant join, the rest of the partition-wise join algorithm remains the same. Unsupported case: When a partition from outer side doesn't have matching partition on the inner side. -------------------------------------------------------------------------- Consider a join t1 LEFT JOIN t2 where t1 has partitions t1p1 (0 - 100), t1p2 (100 - 200) and t2 has a single partition t2p1(0 - 100). The rows in t1p2 won't have a matching row in t2 since there is no partition matching t1p2. The result of the join will have rows in t1p2 with columns from t2 NULLed. In order to execute this join as a partition-wise join, we need a dummy relation in place of the missing partition, which we can join with t1p2. We need this placeholder dummy relation (its targetlist, relids etc.), so that rest of the planner can work with the resulting child-join. We notice the missing partitions only while planning the join (during the execution of make_one_rel()), by which time we have frozen the number of base relations. Introducing a base relation during join planning is not supported in current planner. Similarly, a partition can be missing from a partitioned join relation, in which case we have to add a dummy join relation. This might need adding corresponding base relations as well. I have not spent time looking for what it takes to support these cases. For now the patch does not support partition-wise join in such cases. TODOs ----------- 1. Add tests for advanced partition matching algorithm 2. Improve code quality, commenting, function names etc. [1] https://www.postgresql.org/message-id/CAFjFpRd9Vqh_=-Ldv-XqWY006d07TJ+VXuhXCbdj=P1jukYBrw@mail.gmail.com -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Re: [HACKERS] advanced partition matching algorithm forpartition-wise join
From
Rajkumar Raghuwanshi
Date:
On Mon, Aug 21, 2017 at 12:43 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
TODOs
-----------
1. Add tests for advanced partition matching algorithm
Hi Ashutosh,
I have applied all partition-wise-join patches (v26) and tested feature. I have modified partition_join.sql file and added extra test cases to test partition matching.
Attaching WIP test case patch which as of now have some server crashes and a data corruptions issue which is commented in the file itself and need to be removed once issue got solved. Also some of queries is not picking or picking partition-wise-join as per expectation which may need some adjustment.
Attachment
Re: [HACKERS] advanced partition matching algorithm forpartition-wise join
From
Ashutosh Bapat
Date:
PFA the patches rebased on the latest sources. There are also fixes for some of the crashes and bugs reported. I haven't yet included the testcase patch in the main patchset. On Mon, Aug 28, 2017 at 12:44 PM, Rajkumar Raghuwanshi <rajkumar.raghuwanshi@enterprisedb.com> wrote: > On Mon, Aug 21, 2017 at 12:43 PM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> >> TODOs >> ----------- >> 1. Add tests for advanced partition matching algorithm > > > Hi Ashutosh, > > I have applied all partition-wise-join patches (v26) and tested feature. I > have modified partition_join.sql file and added extra test cases to test > partition matching. > > Attaching WIP test case patch which as of now have some server crashes and a > data corruptions issue which is commented in the file itself and need to be > removed once issue got solved. Also some of queries is not picking or > picking partition-wise-join as per expectation which may need some > adjustment. > > > > > > -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Re: [HACKERS] advanced partition matching algorithm forpartition-wise join
From
Rajkumar Raghuwanshi
Date:
On Sat, Sep 2, 2017 at 12:42 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
PFA the patches rebased on the latest sources. There are also fixes
for some of the crashes and bugs reported. I haven't yet included the
testcase patch in the main patchset.
On Mon, Aug 28, 2017 at 12:44 PM, Rajkumar Raghuwanshi
<rajkumar.raghuwanshi@enterprisedb.com> wrote:
> On Mon, Aug 21, 2017 at 12:43 PM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>>
>> TODOs
>> -----------
>> 1. Add tests for advanced partition matching algorithm
>
>
> Hi Ashutosh,
>
> I have applied all partition-wise-join patches (v26) and tested feature. I
> have modified partition_join.sql file and added extra test cases to test
> partition matching.
>
> Attaching WIP test case patch which as of now have some server crashes and a
> data corruptions issue which is commented in the file itself and need to be
> removed once issue got solved. Also some of queries is not picking or
> picking partition-wise-join as per expectation which may need some
> adjustment.
>
I have applied v27 patches and tested feature. Also tried to reduce regression diff with the
existing partition_join.sql by adding new partition instead of changing original partition bounds.
Attached WIP patch have a server crash and some wrong output which need to be fixed. I have
commented these issue in patch itself, Please take a look and let me know if it need more
changes.
Attachment
Re: [HACKERS] advanced partition matching algorithm forpartition-wise join
From
Ashutosh Bapat
Date:
> > I have applied v27 patches and tested feature. Also tried to reduce > regression diff with the > existing partition_join.sql by adding new partition instead of changing > original partition bounds. > > Attached WIP patch have a server crash and some wrong output which need to > be fixed. I have > commented these issue in patch itself, Please take a look and let me know if > it need more > changes. I have fixed the issues which were marked as TODOs in the attached patches. Also, I have included your test change patch in my series of patches. Are there any other issues you have commented out? -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Re: [HACKERS] advanced partition matching algorithm forpartition-wise join
From
Rajkumar Raghuwanshi
Date:
On Tue, Sep 5, 2017 at 4:34 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
I have fixed the issues which were marked as TODOs in the attached
patches. Also, I have included your test change patch in my series of
patches. Are there any other issues you have commented out?
Thanks Ashutosh, All commented issue got fixed. I am working on some combinations of N-way joins
to test partition matching, will send those as well once done.
to test partition matching, will send those as well once done.
Re: [HACKERS] advanced partition matching algorithm for partition-wise join
From
Antonin Houska
Date:
Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > I have fixed the issues which were marked as TODOs in the attached > patches. Also, I have included your test change patch in my series of > patches. I've noticed that partition_bounds_merge() is called twice from make_join_rel(): * build_join_rel -> build_joinrel_partition_info -> partition_bounds_merge * try_partition_wise_join -> partition_bounds_merge Is this intentional, or just a thinko? -- Antonin Houska Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de, http://www.cybertec.at
Re: [HACKERS] advanced partition matching algorithm forpartition-wise join
From
Ashutosh Bapat
Date:
On Thu, Sep 7, 2017 at 7:34 PM, Antonin Houska <ah@cybertec.at> wrote: > Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > >> I have fixed the issues which were marked as TODOs in the attached >> patches. Also, I have included your test change patch in my series of >> patches. > > I've noticed that partition_bounds_merge() is called twice from > make_join_rel(): > > * build_join_rel -> build_joinrel_partition_info -> partition_bounds_merge > > * try_partition_wise_join -> partition_bounds_merge > > Is this intentional, or just a thinko? > This is expected. partition_bounds_merge() also returns the pairs of matching partitions. So, we have to call that function for every pair of joining relations. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] advanced partition matching algorithm forpartition-wise join
From
Ashutosh Bapat
Date:
Here's updated patch set based on the basic partition-wise join committed. The patchset applies on top of the patch to optimize the case of dummy partitioned tables [1]. Right now, the advanced partition matching algorithm bails out when either of the joining relations has a default partition. [1] https://www.postgresql.org/message-id/CAFjFpRcPvT5ay9_p3e-k2Cwu4bW_rypON7ceJVWhsU3Uk4Nmmg@mail.gmail.com -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On Wed, Oct 11, 2017 at 7:08 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > Here's updated patch set based on the basic partition-wise join > committed. The patchset applies on top of the patch to optimize the > case of dummy partitioned tables [1]. > > Right now, the advanced partition matching algorithm bails out when > either of the joining relations has a default partition. So is that something you are going to fix? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] advanced partition matching algorithm forpartition-wise join
From
Ashutosh Bapat
Date:
On Thu, Oct 12, 2017 at 9:46 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Oct 11, 2017 at 7:08 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> Here's updated patch set based on the basic partition-wise join >> committed. The patchset applies on top of the patch to optimize the >> case of dummy partitioned tables [1]. >> >> Right now, the advanced partition matching algorithm bails out when >> either of the joining relations has a default partition. > > So is that something you are going to fix? > Yes, if time permits. I had left the patch unattended while basic partition-wise join was getting committed. Now that it's committed, I rebased it. It still has TODOs and some work is required to improve it. But for the patch to be really complete, we have to deal with the problem of missing partitions described before. I am fine collaborating if someone else wants to pick it up. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] advanced partition matching algorithm forpartition-wise join
From
Ashutosh Bapat
Date:
On Fri, Oct 13, 2017 at 7:59 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > On Thu, Oct 12, 2017 at 9:46 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Wed, Oct 11, 2017 at 7:08 AM, Ashutosh Bapat >> <ashutosh.bapat@enterprisedb.com> wrote: >>> Here's updated patch set based on the basic partition-wise join >>> committed. The patchset applies on top of the patch to optimize the >>> case of dummy partitioned tables [1]. >>> >>> Right now, the advanced partition matching algorithm bails out when >>> either of the joining relations has a default partition. >> >> So is that something you are going to fix? >> > > Yes, if time permits. I had left the patch unattended while basic > partition-wise join was getting committed. Now that it's committed, I > rebased it. It still has TODOs and some work is required to improve > it. But for the patch to be really complete, we have to deal with the > problem of missing partitions described before. I am fine > collaborating if someone else wants to pick it up. > Here's patchset which support advanced partition matching for partition bounds with default partition. The patchset is rebased on the latest head. When a list value is present in one of the joining relations and not the other, and the other relation has default partition, match (join) the partition containing that list value with the default partition, since the default partition may contain rows with that list value. If the default partition happens to be on the outer side of the join, the resulting join partition acts as a default partition as it will contain all the values from the default partition. If the partition containing the list value happens to be on the outer side of the join, the resulting join partition is associated with the list value, since no other partition key value from the default partition makes it to the join result. When a range is present (completely or partly) in one of the joining relations and not the other, and the other relation has default partition, match (join) the partition corresponding to that range with the default partition. If the default partition happens to be on the outer side of the join, the resulting join partition acts as a default partition as it will contain all the values from the default partition. If the non-partition corresponding to the range happens to be on the outer side of the join, the resulting join partition is associated with that range, since partition key values from the default partition outside that range won't make it to the join result. If both the relations have default partition, match (join) the default partition with each other and deem the resulting join partition as default partition. If one of the relations has default partition but not the other, and the default partition happens to be on the outer side of the join, all its rows will make it to the join. Such a default partition may get joined to a non-default partition from the inner side, if inner side has a range missing in the outer side. If any of the above causes multiple partitions from one side to match with one or more partitions on the other side, we won't use partition-wise join as discussed in the first mail of this thread. I have tested the patches for two-way join, but haven't added any test involving default partitions to the patch itself. It needs to be tested for N-way join as well. So, for now I have kept the two patches supporting the default partition in case of range and list resp. separate. Also, some of the code duplication in partition matching functions can be avoided using macros. I will merge those patches into the main patch and add macros once they are tested appropriately. For hash partitioned table, we haven't implemented the advanced partition matching, since it would be rare that somebody has hash partitioned tables with holes (even if they are allowed). -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Attachment
Re: [HACKERS] advanced partition matching algorithm forpartition-wise join
From
Ashutosh Bapat
Date:
Here's a new patchset with following changes 1. Rebased on the latest head taking care of partition bound comparison function changes 2. Refactored the code to avoid duplication. 3. There's an extensive test (provided by Rajkumar) set added, which is not meant to be committed. That testset has testcases which crash or reveal a bug. I will fix those crashes and add corresponding testcases to partition_join.sql. TODO 1. FIx crashes/bugs in the testcases. On Sun, Dec 3, 2017 at 4:53 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > On Fri, Oct 13, 2017 at 7:59 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> On Thu, Oct 12, 2017 at 9:46 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> On Wed, Oct 11, 2017 at 7:08 AM, Ashutosh Bapat >>> <ashutosh.bapat@enterprisedb.com> wrote: >>>> Here's updated patch set based on the basic partition-wise join >>>> committed. The patchset applies on top of the patch to optimize the >>>> case of dummy partitioned tables [1]. >>>> >>>> Right now, the advanced partition matching algorithm bails out when >>>> either of the joining relations has a default partition. >>> >>> So is that something you are going to fix? >>> >> >> Yes, if time permits. I had left the patch unattended while basic >> partition-wise join was getting committed. Now that it's committed, I >> rebased it. It still has TODOs and some work is required to improve >> it. But for the patch to be really complete, we have to deal with the >> problem of missing partitions described before. I am fine >> collaborating if someone else wants to pick it up. >> > > Here's patchset which support advanced partition matching for > partition bounds with default partition. The patchset is rebased on > the latest head. > > When a list value is present in one of the joining relations and not > the other, and the other relation has default partition, match (join) > the partition containing that list value with the default partition, > since the default partition may contain rows with that list value. If > the default partition happens to be on the outer side of the join, the > resulting join partition acts as a default partition as it will > contain all the values from the default partition. If the partition > containing the list value happens to be on the outer side of the join, > the resulting join partition is associated with the list value, since > no other partition key value from the default partition makes it to > the join result. > > When a range is present (completely or partly) in one of the joining > relations and not the other, and the other relation has default > partition, match (join) the partition corresponding to that range with > the default partition. If the default partition happens to be on the > outer side of the join, the resulting join partition acts as a default > partition as it will contain all the values from the default > partition. If the non-partition corresponding to the range happens to > be on the outer side of the join, the resulting join partition is > associated with that range, since partition key values from the > default partition outside that range won't make it to the join result. > > If both the relations have default partition, match (join) the default > partition with each other and deem the resulting join partition as > default partition. If one of the relations has default partition but > not the other, and the default partition happens to be on the outer > side of the join, all its rows will make it to the join. Such a > default partition may get joined to a non-default partition from the > inner side, if inner side has a range missing in the outer side. > > If any of the above causes multiple partitions from one side to match > with one or more partitions on the other side, we won't use > partition-wise join as discussed in the first mail of this thread. > > I have tested the patches for two-way join, but haven't added any test > involving default partitions to the patch itself. It needs to be > tested for N-way join as well. So, for now I have kept the two patches > supporting the default partition in case of range and list resp. > separate. Also, some of the code duplication in partition matching > functions can be avoided using macros. I will merge those patches into > the main patch and add macros once they are tested appropriately. > > For hash partitioned table, we haven't implemented the advanced > partition matching, since it would be rare that somebody has hash > partitioned tables with holes (even if they are allowed). > > -- > Best Wishes, > Ashutosh Bapat > EnterpriseDB Corporation > The Postgres Database Company -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Attachment
Hi Ashutosh. On 2018/02/07 13:51, Ashutosh Bapat wrote: > Here's a new patchset with following changes > > 1. Rebased on the latest head taking care of partition bound > comparison function changes I was about to make these changes myself while revising the fast pruning patch. Instead, I decided to take a look at your patch and try to use it in my tree. I looked at the patch 0001 and noticed that git diff --check says: src/backend/catalog/partition.c:2900: trailing whitespace. +partition_rbound_datum_cmp(FmgrInfo *partsupfunc, Oid *partcollation, Also, might be a good idea to write briefly about the new arguments in the header comment. Something like that they are PartitionKey elements. Thanks, Amit
On 2018/02/08 11:55, Amit Langote wrote: > Hi Ashutosh. > > On 2018/02/07 13:51, Ashutosh Bapat wrote: >> Here's a new patchset with following changes >> >> 1. Rebased on the latest head taking care of partition bound >> comparison function changes > > I was about to make these changes myself while revising the fast pruning > patch. Instead, I decided to take a look at your patch and try to use it > in my tree. I also noticed that a later patch adds partsupfunc to PartitionScheme, which the pruning patch needs too. So, perhaps would be nice to take out that portion of the patch. That is, the changes to PartitionScheme struct definition and those to find_partition_scheme(). Regarding the latter, wouldn't be nice to have a comment before the code that does the copying about why we don't compare the partsupfunc field to decide if we have a match or not. I understand it's because the partsupfunc array contains pointers, not OIDs. But maybe, that's too obvious to warrant a comment. Thanks, Amit
Re: [HACKERS] advanced partition matching algorithm forpartition-wise join
From
Ashutosh Bapat
Date:
On Thu, Feb 8, 2018 at 8:25 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Hi Ashutosh. > > On 2018/02/07 13:51, Ashutosh Bapat wrote: >> Here's a new patchset with following changes >> >> 1. Rebased on the latest head taking care of partition bound >> comparison function changes > > I was about to make these changes myself while revising the fast pruning > patch. Instead, I decided to take a look at your patch and try to use it > in my tree. > > I looked at the patch 0001 and noticed that git diff --check says: > > src/backend/catalog/partition.c:2900: trailing whitespace. > +partition_rbound_datum_cmp(FmgrInfo *partsupfunc, Oid *partcollation, Thanks. Fixed. > > Also, might be a good idea to write briefly about the new arguments in the > header comment. Something like that they are PartitionKey elements. Here's updated patch set with those comments added. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Attachment
Re: [HACKERS] advanced partition matching algorithm forpartition-wise join
From
Ashutosh Bapat
Date:
On Thu, Feb 8, 2018 at 10:41 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2018/02/08 11:55, Amit Langote wrote: >> Hi Ashutosh. >> >> On 2018/02/07 13:51, Ashutosh Bapat wrote: >>> Here's a new patchset with following changes >>> >>> 1. Rebased on the latest head taking care of partition bound >>> comparison function changes >> >> I was about to make these changes myself while revising the fast pruning >> patch. Instead, I decided to take a look at your patch and try to use it >> in my tree. > > I also noticed that a later patch adds partsupfunc to PartitionScheme, > which the pruning patch needs too. So, perhaps would be nice to take out > that portion of the patch. That is, the changes to PartitionScheme struct > definition and those to find_partition_scheme(). I am not sure whether a patch with just that change and without any changes to use that member will be acceptable. So leaving this aside. > > Regarding the latter, wouldn't be nice to have a comment before the code > that does the copying about why we don't compare the partsupfunc field to > decide if we have a match or not. I understand it's because the > partsupfunc array contains pointers, not OIDs. But maybe, that's too > obvious to warrant a comment. It's because partsupfuncs should point to the information of the same function when partopfamily matches and partopcintype matches. I would have added an assertion for that with a comment, but with the pointer that would be risky. Or we can just assert that the oids match. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
On 2018/02/09 14:31, Ashutosh Bapat wrote: >> I also noticed that a later patch adds partsupfunc to PartitionScheme, >> which the pruning patch needs too. So, perhaps would be nice to take out >> that portion of the patch. That is, the changes to PartitionScheme struct >> definition and those to find_partition_scheme(). > > I am not sure whether a patch with just that change and without any > changes to use that member will be acceptable. So leaving this aside. I asked, because with everything that I have now changed in the partition pruning patch, one would need to pass these FmgrInfo pointers down to partition bound searching functions from the optimizer. If the changes to add partsupfunc to the optimizer were taken out from your main patch, the pruning patch could just start using it. For now, I'm making those changes part of the pruning patch. Thanks, Amit
Re: [HACKERS] advanced partition matching algorithm forpartition-wise join
From
Ashutosh Bapat
Date:
On Fri, Feb 9, 2018 at 11:26 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2018/02/09 14:31, Ashutosh Bapat wrote: >>> I also noticed that a later patch adds partsupfunc to PartitionScheme, >>> which the pruning patch needs too. So, perhaps would be nice to take out >>> that portion of the patch. That is, the changes to PartitionScheme struct >>> definition and those to find_partition_scheme(). >> >> I am not sure whether a patch with just that change and without any >> changes to use that member will be acceptable. So leaving this aside. > > I asked, because with everything that I have now changed in the partition > pruning patch, one would need to pass these FmgrInfo pointers down to > partition bound searching functions from the optimizer. If the changes to > add partsupfunc to the optimizer were taken out from your main patch, the > pruning patch could just start using it. For now, I'm making those > changes part of the pruning patch. That's fine. Someone's patch will be committed first and the other will just take out those changes. But I am open to separate those changes into other patch if a committer feels so. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Hi Ashutosh. On 2018/02/09 14:27, Ashutosh Bapat wrote: > Here's updated patch set with those comments added. I looked at patches 0002 and 0003. In 0002: + * In case of hash partition we store modulus and remainder in datums array In case of hash partitioned table? + * which has the same data type irrespective of the number of partition keys + * and their data types. Hence we can compare the hash bound collection without + * any partition key specific information. "has the same data type" sounds like it means a Postgres data type, whereas I think you mean that they are simple int32 values, so we don't need any PartitionKey information to compare them. In 0003: A portion of code in both partition_range_bounds_merge(), partition_list_bounds_merge(), and merge_null_partitions() has an extra semi-colon at the end of a line starting with else if: if (default_index < 0) default_index = merged_index; else if(default_index != merged_index); { which emits warnings like this: partition.c: In function ‘partition_range_bounds_merge’: partition.c:4192:11: warning: this ‘if’ clause does not guard... [-Wmisleading-indentation] else if(default_index != merged_index); ^~ partition.c: In function ‘partition_list_bounds_merge’: partition.c:4261:11: warning: this ‘if’ clause does not guard... [-Wmisleading-indentation] else if(default_index != merged_index); ^~ Also, get this warning. partition.c:3955:1: warning: ‘is_next_range_continuous’ defined but not used [-Wunused-function] I'm trying to understand the optimizer side of this patch. In your commit message, I read: This commit allows partition-wise join to be applied under following conditions 1. the partition bounds of joining relations are such that rows from given partition on one side join can join with rows from maximum one partition on the other side i.e. bounds of a given partition on one side match/overlap with those of maximum one partition on the other side. If the mapping happens to be m:n where m > 1 or n > 1, we have to gang multiple partition relations together into a single relation. This means that we have to add simple relations during join processing, something which is not supported right now. ALso, in such a case, different pairs of joining relations can produce different partition bounds for the same join relation, which again is not supported right now. So, is the currently allowed case (partition bounds on both sides match exactly) a special case of this new feature which tries to match partitions in a more generalized manner? I see that this patch removes the partition_bound_equal(outer_rel->boundinfo, inner_rel->boundinfo) check in build_joinrel_partition_info() in favor of reconciling any differences in the representation of the partition bounds by calling partition_bounds_merge() from try_partition_wise_join(). 2. For every partition on outer side that can contribute to the result of an OUTER side, there exists at least one (taken along with item 1, it means exactly one) matching partition on the inner side. To support partition-wise join when the inner matching partition doesn't exist, we have to add a dummy base relation corresponding to the non-existent inner partition. We don't have support add base relations during join processing. Sorry but I'm a bit confused by the last sentence; does it mean we're not able to allow partition-wise join to happen in this case? But this is in the list of the new cases that the patch makes partition-wise join to happen for. Looking at the code changes under src/backend/optimizer: + else + { + Assert(partition_bounds_equal(part_scheme->partnatts, + part_scheme->parttyplen, + part_scheme->parttypbyval, + join_boundinfo, joinrel->boundinfo)); IIUC, this else block would run when try_partition_wise_join() is called again for the same pair of relations. + /* + * Every pair of joining relations should result in the same number + * of child-joins. + */ Sorry if I'm misreading this, but does it mean: a given pair of joining relations should always result in the same number of (set of, too?) child-joins? In the new comment in build_joinrel_partition_info(): + * Because of restrictions in partition_bounds_merge(), not every pair of + * joining relation joining relations I will try to hop into partition_bounds_merge() from now... Thanks, Amit
Re: [HACKERS] advanced partition matching algorithm forpartition-wise join
From
Ashutosh Bapat
Date:
On Thu, Feb 15, 2018 at 2:41 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Hi Ashutosh. > > On 2018/02/09 14:27, Ashutosh Bapat wrote: > > In 0002: > > + * In case of hash partition we store modulus and remainder in datums array > > In case of hash partitioned table? > > + * which has the same data type irrespective of the number of partition keys > + * and their data types. Hence we can compare the hash bound collection > without > + * any partition key specific information. > > "has the same data type" sounds like it means a Postgres data type, > whereas I think you mean that they are simple int32 values, so we don't > need any PartitionKey information to compare them. Modified this comment to read + * Hash partition bounds store modulus and remainder in datums array which are + * always integers irrespective of the number of partition keys and their data + * types. Hence we can compare the hash bound collection without any partition + * key specific information. Separating this logic in a function which does not + * require partition key specific information allows it be called from places + * where the partition key specific information is not completely available. Let me know if that looks good. > > In 0003: > > A portion of code in both partition_range_bounds_merge(), > partition_list_bounds_merge(), and merge_null_partitions() has an extra > semi-colon at the end of a line starting with else if: > > if (default_index < 0) > default_index = merged_index; > else if(default_index != merged_index); > { > > which emits warnings like this: > > partition.c: In function ‘partition_range_bounds_merge’: > partition.c:4192:11: warning: this ‘if’ clause does not guard... > [-Wmisleading-indentation] > else if(default_index != merged_index); > > ^~ > partition.c: In function ‘partition_list_bounds_merge’: > partition.c:4261:11: warning: this ‘if’ clause does not guard... > [-Wmisleading-indentation] > else if(default_index != merged_index); > Thanks for catching those. They will cause bugs. > Also, get this warning. > > partition.c:3955:1: warning: ‘is_next_range_continuous’ defined but not > used [-Wunused-function] I had added that function earlier, but it's no more useful in the patch. But I have kept it there in case we need it again. I will remove it in the final patch. > > I'm trying to understand the optimizer side of this patch. In your commit > message, I read: > > This commit allows partition-wise join to be applied under > following conditions > > 1. the partition bounds of joining relations are such that rows from > given partition on one side join can join with rows from maximum one > partition on the other side i.e. bounds of a given partition on one > side match/overlap with those of maximum one partition on the other > side. If the mapping happens to be m:n where m > 1 or n > 1, we have > to gang multiple partition relations together into a single relation. > This means that we have to add simple relations during join > processing, something which is not supported right now. ALso, in such > a case, different pairs of joining relations can produce different > partition bounds for the same join relation, which again is not > supported right now. > > > So, is the currently allowed case (partition bounds on both sides match > exactly) a special case of this new feature which tries to match > partitions in a more generalized manner? Right. > > 2. For every partition on outer side that can contribute to the result > of an OUTER side, there exists at least one (taken along with item 1, > it means exactly one) matching partition on the inner side. To > support partition-wise join when the inner matching partition doesn't > exist, we have to add a dummy base relation corresponding to the > non-existent inner partition. We don't have support add base relations > during join processing. > > Sorry but I'm a bit confused by the last sentence; does it mean we're not > able to allow partition-wise join to happen in this case? But this is in > the list of the new cases that the patch makes partition-wise join to > happen for. > Consider two cases for partitioned tables t1(p1, p3, p4), t2 (p1, p2, p3). Assume that the bounds of any pk across t1 and t2 are same when k matches. So, p2 is missing from t1 and p4 from t2. This case will not be handled by current partition-wise join. With these set of patches, we will be able to handle some joins between t1 and t2. The matching algorithm will output pairs as (p1, p1), (-, p2), (p3, p3), (p4, -). In an inner join between t1 and t2, (-, p2), (p4, -) would not contribute any result, so we can eliminate those. Thus t1 INNER JOIN t2 is just (p1, p1), (p3, p3). t1 LEFT JOIN t2 will not have any result from (-, p2), so we can eliminate it. If there would not have been p4, we will be able to compute t1 LEFT JOIN t2. Such cases are not handled right now. Fixed some grammar and typos in the paragraph. > > Looking at the code changes under src/backend/optimizer: > > + else > + { > + Assert(partition_bounds_equal(part_scheme->partnatts, > + part_scheme->parttyplen, > + part_scheme->parttypbyval, > + join_boundinfo, joinrel->boundinfo)); > > IIUC, this else block would run when try_partition_wise_join() is called > again for the same pair of relations. Yes, it's all Asserts, so effectively, we run nothing in a binary without asserts. > > + /* > + * Every pair of joining relations should result in the same number > + * of child-joins. > + */ > > Sorry if I'm misreading this, but does it mean: a given pair of joining > relations should always result in the same number of (set of, too?) > child-joins? The assert there just check for number so the comment doesn't mention set. But you are correct, it's the same set. We check it when we actually deal with them by checking their relids. > > In the new comment in build_joinrel_partition_info(): > > + * Because of restrictions in partition_bounds_merge(), not every pair of > + * joining relation > > joining relations Fixed. > > > I will try to hop into partition_bounds_merge() from now... Appreciate you taking time for review. PFA updated version. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Attachment
On Fri, Feb 16, 2018 at 12:14 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > Appreciate you taking time for review. > > PFA updated version. Committed 0001. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] advanced partition matching algorithm forpartition-wise join
From
Ashutosh Bapat
Date:
On Fri, Feb 23, 2018 at 7:35 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Feb 16, 2018 at 12:14 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> Appreciate you taking time for review. >> >> PFA updated version. > > Committed 0001. Thanks. Here's patchset rebased on the latest head. I have fixed all the crashes and bugs reported till now. One of the crash was related to the targetlist of child-join relation. In basic partition-wise join, either all pairs of joining relations can use partition-wise join technique or none can use it. So, the joining pair which is used to create targetlist for a child-join corresponds to the pair of joining relations used to create targetlist for the parent join. With the restrictions related to missing partitions discussed upthread, this isn't true with advanced partition matching. The joining pair which is used to create targetlist for a child-join may not correspond to the pair of joining relations used to create targetlist for the parent join. Since these two pairs are different build_joinrel_tlist() arranges the targetlist entries in different order for child-join and parent-join. But for an appendrel, we expect the parent and child targetlists in sync. So fix is: instead of constructing the child-join targetlist from joining relations, we construct it by translating the parent join. The basic partition-wise join commit had changed build_joinrel_tlist() to handle child-joins and it had changed set_append_rel_size() to compute attr_needed for child relations so that that information can be used to built child-join's targetlist. Both of those changes are not need because of translation. I have added this fix as a separate patch with those two changes reverted. When we will lift up the restrictions related to missing partitions (I am not sure when), we will get back to a state when joining pair which creates targetlist for child-join corresponds to the joining pair which creates targetlist for the parent join. And thus we don't need translation, which consumes more memory. We can use attrs_needed information. So, may be we should retain those two changes instead of reverting those. Any suggestions? The extra extensive testcase advance_partition_join_test still fails because of plan changes caused by recent changes to the append costing. I have verified that those plan changes are fine and the queries do not have any bugs or crashes. But that testcase has many many queries and the plans are not stable. Since that testcase is not meant for committing, I haven't fixed the plan diffs right now. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Attachment
> On 26 February 2018 at 11:03, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > On Fri, Feb 23, 2018 at 7:35 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Fri, Feb 16, 2018 at 12:14 AM, Ashutosh Bapat >> <ashutosh.bapat@enterprisedb.com> wrote: >>> Appreciate you taking time for review. >>> >>> PFA updated version. >> >> Committed 0001. > > Thanks. > > Here's patchset rebased on the latest head. I have fixed all the > crashes and bugs reported till now. Hi, I've stumbled upon this patch and noticed, that it's not compiled anymore after 2af28e6033, where `parttypcoll` was renamed to `partcollation`.
Re: [HACKERS] advanced partition matching algorithm forpartition-wise join
From
Ashutosh Bapat
Date:
On Mon, Mar 12, 2018 at 1:04 AM, Dmitry Dolgov <9erthalion6@gmail.com> wrote: >> On 26 February 2018 at 11:03, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: >> On Fri, Feb 23, 2018 at 7:35 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> On Fri, Feb 16, 2018 at 12:14 AM, Ashutosh Bapat >>> <ashutosh.bapat@enterprisedb.com> wrote: >>>> Appreciate you taking time for review. >>>> >>>> PFA updated version. >>> >>> Committed 0001. >> >> Thanks. >> >> Here's patchset rebased on the latest head. I have fixed all the >> crashes and bugs reported till now. > > Hi, > > I've stumbled upon this patch and noticed, that it's not compiled anymore after > 2af28e6033, where `parttypcoll` was renamed to `partcollation`. Thanks for the note. Here are rebased patches. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Attachment
> On 12 March 2018 at 06:00, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > Thanks for the note. Here are rebased patches. Since I started to look at this patch, I can share few random notes (although it's not a complete review, I'm in the progress now), maybe it can be helpful. In `partition_range_bounds_merge` + if (!merged) + break; is a bit redundant I think, because every time `merged` set to false it followed by break. At the end same function + if (merged) + merged = merge_default_partitions(outer_bi, outer_pmap, outer_mmap, + inner_bi, inner_pmap, inner_mmap, + jointype, &next_index, + &default_index); Looks like I misunderstand something in the algorithm, but aren't default partitions were already processed before e.g. here: + /* + * Default partition on the outer side forms the default + * partition of the join result. + */ Also in here + /* Free any memory we used in this function. */ + list_free(merged_datums); + list_free(merged_indexes); + pfree(outer_pmap); + pfree(inner_pmap); + pfree(outer_mmap); + pfree(inner_mmap); I think `merged_kinds` is missing. I've noticed that in some places `IS_PARTITIONED_REL` was replaced - if (!IS_PARTITIONED_REL(joinrel)) + if (joinrel->part_scheme == NULL) but I'm not quite follow why? Is it because `boundinfo` is not available anymore at this point? If so, maybe it makes sense to update the commentary for this macro and mention to not use for joinrel. Also, as for me it would be helpful to have some commentary about this new `partsupfunc`, what is exactly the purpose of it (but it's maybe just me missing some obvious things from partitioning context) + FmgrInfo *partsupfunc;
Re: [HACKERS] advanced partition matching algorithm forpartition-wise join
From
Ashutosh Bapat
Date:
On Thu, Mar 22, 2018 at 4:32 AM, Dmitry Dolgov <9erthalion6@gmail.com> wrote: >> On 12 March 2018 at 06:00, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: >> Thanks for the note. Here are rebased patches. > > Since I started to look at this patch, I can share few random notes (although > it's not a complete review, I'm in the progress now), maybe it can be helpful. > > In `partition_range_bounds_merge` > > + if (!merged) > + break; > > is a bit redundant I think, because every time `merged` set to false it > followed by break. Yes, right now. May be I should turn it into Assert(merged); What do you think? > > At the end same function > > + if (merged) > + merged = merge_default_partitions(outer_bi, outer_pmap, outer_mmap, > + inner_bi, inner_pmap, inner_mmap, > + jointype, &next_index, > + &default_index); > > Looks like I misunderstand something in the algorithm, but aren't default > partitions were already processed before e.g. here: > > + /* > + * Default partition on the outer side forms the default > + * partition of the join result. > + */ The default partition handling in the loop handles the cases of missing partitions as explained in a comment /* * If a range appears in one of the joining relations but not the other * (as a whole or part), the rows in the corresponding partition will * not have join partners in the other relation, unless the other * relation has a default partition. But merge_default_partitions() tries to map the default partitions from both the relations. > > Also in here > > + /* Free any memory we used in this function. */ > + list_free(merged_datums); > + list_free(merged_indexes); > + pfree(outer_pmap); > + pfree(inner_pmap); > + pfree(outer_mmap); > + pfree(inner_mmap); > > I think `merged_kinds` is missing. Done. > > I've noticed that in some places `IS_PARTITIONED_REL` was replaced > > - if (!IS_PARTITIONED_REL(joinrel)) > + if (joinrel->part_scheme == NULL) > > but I'm not quite follow why? Is it because `boundinfo` is not available > anymore at this point? If so, maybe it makes sense to update the commentary for > this macro and mention to not use for joinrel. This is done in try_partitionwise_join(). As explained in the comment /* * Get the list of matching partitions to be joined along with the * partition bounds of the join relation. Because of the restrictions * imposed by partition matching algorithm, not every pair of joining * relations for this join will be able to use partition-wise join. But all * those pairs which can use partition-wise join will produce the same * partition bounds for the join relation. */ boundinfo for the join relation is built in this function. So, we don't have join relation's partitioning information fully set up yet. So we can't use IS_PARTITIONED_REL() there. joinrel->part_scheme if set tells that the joining relations have matching partition schemes and thus the join relation can possibly use partition-wise join technique. If it's not set, then we can't use partition-wise join. But IS_PARTITIONED_REL() is still useful at a number of other places, where it's known to encounter a RelOptInfo whose partitioning properties are fully setup. So, I don't think we should change the macro or the comments above it. > > Also, as for me it would be helpful to have some commentary about this new > `partsupfunc`, what is exactly the purpose of it (but it's maybe just me > missing some obvious things from partitioning context) > > + FmgrInfo *partsupfunc; It's just copied from Relation::PartitionKey as is. It stores the functions required to compare two partition key datums. Since we use those functions frequently their FmgrInfo is cached. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
> On 22 March 2018 at 14:18, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > On Thu, Mar 22, 2018 at 4:32 AM, Dmitry Dolgov <9erthalion6@gmail.com> wrote: >>> On 12 March 2018 at 06:00, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: >>> Thanks for the note. Here are rebased patches. >> >> Since I started to look at this patch, I can share few random notes (although >> it's not a complete review, I'm in the progress now), maybe it can be helpful. >> >> In `partition_range_bounds_merge` >> >> + if (!merged) >> + break; >> >> is a bit redundant I think, because every time `merged` set to false it >> followed by break. > > Yes, right now. May be I should turn it into Assert(merged); What do you think? Thank you for reply. Yes, that sounds good. But actually you also mentioned another topic that bothers me about this patch. Different parts of the algorithm implementation (at least for functions that build maps of matching partitions) are quite dependent on each other in terms of shared state. At first glance in `partition_range_bounds_merge` we have about a dozen of variables of different mutability level, that affect the control flow: outer_lb_index inner_lb_index merged merged_index overlap merged_lb merged_ub finished_outer finished_inner ub_cmpval lb_cmpval inner_has_default outer_has_default jointype It looks a bit too much for me, and would require commentaries like "if you changed the logic here, also take a look there". But I'm not saying that I have any specific suggestions how to change it (although I'll definitely try to do so, at least to get some better understanding of the underlying algorithm). >> >> I've noticed that in some places `IS_PARTITIONED_REL` was replaced >> >> - if (!IS_PARTITIONED_REL(joinrel)) >> + if (joinrel->part_scheme == NULL) >> >> but I'm not quite follow why? Is it because `boundinfo` is not available >> anymore at this point? If so, maybe it makes sense to update the commentary for >> this macro and mention to not use for joinrel. > > > This is done in try_partitionwise_join(). As explained in the comment > > * Get the list of matching partitions to be joined along with the > * partition bounds of the join relation. Because of the restrictions > * imposed by partition matching algorithm, not every pair of joining > * relations for this join will be able to use partition-wise join. But all > * those pairs which can use partition-wise join will produce the same > * partition bounds for the join relation. > > boundinfo for the join relation is built in this function. So, we > don't have join relation's partitioning information fully set up yet. > So we can't use IS_PARTITIONED_REL() there. joinrel->part_scheme if > set tells that the joining relations have matching partition schemes > and thus the join relation can possibly use partition-wise join > technique. If it's not set, then we can't use partition-wise join. > > But IS_PARTITIONED_REL() is still useful at a number of other places, > where it's known to encounter a RelOptInfo whose partitioning > properties are fully setup. So, I don't think we should change the > macro or the comments above it. Just to make myself clear, I wanted to suggest not to change the commentary for `IS_PARTITIONED_REL` significantly, but just add a sentence that you need to check if given relation is fully set up. Also, few more random notes (mostly related to readability, since I found some parts of the patch hard to read, but of course it's arguable). ``` PartitionRangeBound outer_lb; PartitionRangeBound outer_ub; PartitionRangeBound inner_lb; PartitionRangeBound inner_ub; PartitionRangeBound *merged_lb = NULL; PartitionRangeBound *merged_ub = NULL; ``` Maybe it would be better to not repeat the type here? Let's say: ``` PartitionRangeBound outer_lb, outer_ub, ... ``` It's just too long and distracting. ``` partition_range_bounds_merge(int partnatts, FmgrInfo *partsupfuncs, Oid *partcollations, PartitionBoundInfo outer_bi, int outer_nparts, PartitionBoundInfo inner_bi, int inner_nparts, JoinType jointype, List **outer_parts, List **inner_parts) ``` From what I see in `partition.c` there are a lot functions that accept `partnatts`, `partcollations` only to pass it down to, e.g. `partition_rbound_cmp`. What do you think about introducing a data structure to keep these arguments, and pass only an instance of this structure instead?
Re: [HACKERS] advanced partition matching algorithm forpartition-wise join
From
Ashutosh Bapat
Date:
Thanks for your review comments. They are helpful. On Thu, Mar 29, 2018 at 2:53 AM, Dmitry Dolgov <9erthalion6@gmail.com> wrote: > But actually you also mentioned > another topic that bothers me about this patch. Different parts of the > algorithm implementation (at least for functions that build maps of matching > partitions) are quite dependent on each other in terms of shared state. At > first glance in `partition_range_bounds_merge` we have about a dozen of > variables of different mutability level, that affect the control flow: > > outer_lb_index > inner_lb_index > merged > merged_index > overlap > merged_lb > merged_ub > finished_outer > finished_inner > ub_cmpval > lb_cmpval > inner_has_default > outer_has_default > jointype > > It looks a bit too much for me, and would require commentaries like "if you > changed the logic here, also take a look there". But I'm not saying that I have > any specific suggestions how to change it (although I'll definitely try to do > so, at least to get some better understanding of the underlying algorithm). I am working on commenting portions of the code to make it more clear and readable. Will update the patches with the comments soon, mostly this Monday. > > Just to make myself clear, I wanted to suggest not to change the commentary for > `IS_PARTITIONED_REL` significantly, but just add a sentence that you need to > check if given relation is fully set up. For that we have REL_HAS_ALL_PART_PROPS. IS_PARTITIONED_REL is really checking whether a relation is partitioned and following comment makes it clear why we have to check all those things. So, I still don't see why we need to change the comment there. * It's not enough to test whether rel->part_scheme is set, because it might * be that the basic partitioning properties of the input relations matched * but the partition bounds did not. > Also, few more random notes (mostly related to readability, since I found some > parts of the patch hard to read, but of course it's arguable). > > ``` > PartitionRangeBound outer_lb; > PartitionRangeBound outer_ub; > PartitionRangeBound inner_lb; > PartitionRangeBound inner_ub; > PartitionRangeBound *merged_lb = NULL; > PartitionRangeBound *merged_ub = NULL; > ``` > > Maybe it would be better to not repeat the type here? Let's say: > > ``` > PartitionRangeBound outer_lb, > outer_ub, PostgreSQL code uses both the styles, but I like declarations with the type is right there along side the variable. It also makes it easy to delete the variable or modify its type, without causing unnecessary diff on adjacent lines. > ``` > partition_range_bounds_merge(int partnatts, FmgrInfo *partsupfuncs, > Oid *partcollations, PartitionBoundInfo outer_bi, > int outer_nparts, PartitionBoundInfo inner_bi, > int inner_nparts, JoinType jointype, > List **outer_parts, List **inner_parts) > ``` > > From what I see in `partition.c` there are a lot functions that accept > `partnatts`, `partcollations` only to pass it down to, e.g. > `partition_rbound_cmp`. > What do you think about introducing a data structure to keep these arguments, > and pass only an instance of this structure instead? I have discussed this in an adjacent thread, but let me repeat it here. Those members come from PartitionKey or PartitionScheme. If we pack those in a structure and not include it in those two structures,we will just create a structure before calling the function and unpack those inside the comparison function e.g. partition_rbound_cmp(). We could argue that that will benefit intermediate functions like partition_range_bound_cmp(), which just pass those values down, but there is all the possibility that its future caller may not have that packing structure available readily. So, I am inclined not to add a new structure just for this. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] advanced partition matching algorithm forpartition-wise join
From
Ashutosh Bapat
Date:
On Fri, Mar 30, 2018 at 7:36 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > > I am working on commenting portions of the code to make it more clear > and readable. Will update the patches with the comments soon, mostly > this Monday. > Here's set of patches rebased on the latest head. I have added a lot of comments and revised a lot of code to avoid duplication, causing a net reduction in the number of lines. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Attachment
> On 3 April 2018 at 15:34, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > On Fri, Mar 30, 2018 at 7:36 PM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> >> I am working on commenting portions of the code to make it more clear >> and readable. Will update the patches with the comments soon, mostly >> this Monday. >> > > Here's set of patches rebased on the latest head. I have added a lot > of comments and revised a lot of code to avoid duplication, causing a > net reduction in the number of lines. Thank you. Unfortunately, there are already few more conflicts since last time, could you rebase again? In the meantime, I'm a bit confused by `merge_null_partitions` and why partitions with NULL values should be treated separately? It became even more confusing when I looked at where this functions is used: /* If merge is unsuccessful, bail out without any further processing. */ if (merged) merged = merge_null_partitions(outer_bi, outer_pmap, outer_mmap, inner_bi, inner_pmap, inner_mmap, jointype, &next_index, &null_index, &default_index); Is this commentary correct? Also, I don't see anywhere in PostgreSQL itself or in this patch a definition of the term "NULL partition", can you add it, just to make things clear? Another question, I see this pattern a lot here when there is a code like: if (!outer_has_something && inner_has_something) { // some logic } else if (outer_has_something && !inner_has_something) { // some logic symmetric to what we have above } By symmetric I mean that the code is more or less the same, just "inner" variables were changed to "outer" (and required types of joins are opposite). Is it feasible to actually use completely the same logic, and just change "inner"/"outer" variables instead?
Re: [HACKERS] advanced partition matching algorithm forpartition-wise join
From
Ashutosh Bapat
Date:
Thanks Dmitry for your review. On Mon, May 7, 2018 at 1:06 AM, Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > Thank you. Unfortunately, there are already few more conflicts since last time, > could you rebase again? Done. > > In the meantime, I'm a bit confused by `merge_null_partitions` and why > partitions with NULL values should be treated separately? It became even more > confusing when I looked at where this functions is used: > > /* If merge is unsuccessful, bail out without any further processing. */ > if (merged) > merged = merge_null_partitions(outer_bi, outer_pmap, outer_mmap, > inner_bi, inner_pmap, inner_mmap, > jointype, &next_index, &null_index, > &default_index); > > Is this commentary correct? I agree. It's misleading. Removed. > > Also, I don't see anywhere in PostgreSQL itself or in this patch a definition > of the term "NULL partition", can you add it, just to make things clear? By NULL partition, I mean a list partition which holds NULL values. Added that clarification in the prologue of merge_null_partitions(). > > Another question, I see this pattern a lot here when there is a code like: > > if (!outer_has_something && inner_has_something) > { > // some logic > } > else if (outer_has_something && !inner_has_something) > { > // some logic symmetric to what we have above > } > > By symmetric I mean that the code is more or less the same, just "inner" > variables were changed to "outer" (and required types of joins are opposite). > Is it feasible to actually use completely the same logic, and just change > "inner"/"outer" variables instead? I have added handle_missing_partition() for the same purpose. But I didn't use it in merge_null_partitions(), which I have done in the attached patches. Now the number of such instances are just at two places one in merge_default_partitions() and the other in handle_missing_partition(). But they are different enough not to extract into a common function. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Attachment
Hi Ashutosh, On 2018/05/14 20:14, Ashutosh Bapat wrote: > 0001-Hash-partition-bound-equality-refactoring.patch > 0002-Targetlist-of-a-child-join-is-produced-by-translatin.patch > 0003-Partition-wise-join-for-1-1-1-0-0-1-partition-matchi.patch > 0004-Add-a-debug-message-to-notify-whether-partition-wise.patch > 0005-Tests-for-0-1-1-1-and-1-0-partition-matching.patch > 0006-Extra-extensive-tests-for-advanced-partition-matchin.patch I noticed after *cleanly* applying 0001-0004 to today's HEAD that while 0005's test all pass, there are many failures in 0006's tests. Maybe, you need to adjust something in one of the patches or adjust test outputs. Thanks, Amit
Re: [HACKERS] advanced partition matching algorithm forpartition-wise join
From
Ashutosh Bapat
Date:
On Tue, Jun 26, 2018 at 2:27 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Hi Ashutosh, > > On 2018/05/14 20:14, Ashutosh Bapat wrote: >> 0001-Hash-partition-bound-equality-refactoring.patch >> 0002-Targetlist-of-a-child-join-is-produced-by-translatin.patch >> 0003-Partition-wise-join-for-1-1-1-0-0-1-partition-matchi.patch >> 0004-Add-a-debug-message-to-notify-whether-partition-wise.patch >> 0005-Tests-for-0-1-1-1-and-1-0-partition-matching.patch >> 0006-Extra-extensive-tests-for-advanced-partition-matchin.patch > > I noticed after *cleanly* applying 0001-0004 to today's HEAD that while > 0005's test all pass, there are many failures in 0006's tests. Maybe, you > need to adjust something in one of the patches or adjust test outputs. If the failures are because of plan changes, it's expected. If those are because of crashes or changed output, those need to be fixed. I have kept that patch to notice any crashes or output changes, in which case, I pull that test into 0005 test set after fixing the code. Once we are near commit, I will remove that patch from the patchset. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
On 2018/06/26 18:02, Ashutosh Bapat wrote: > On Tue, Jun 26, 2018 at 2:27 PM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> Hi Ashutosh, >> >> On 2018/05/14 20:14, Ashutosh Bapat wrote: >>> 0001-Hash-partition-bound-equality-refactoring.patch >>> 0002-Targetlist-of-a-child-join-is-produced-by-translatin.patch >>> 0003-Partition-wise-join-for-1-1-1-0-0-1-partition-matchi.patch >>> 0004-Add-a-debug-message-to-notify-whether-partition-wise.patch >>> 0005-Tests-for-0-1-1-1-and-1-0-partition-matching.patch >>> 0006-Extra-extensive-tests-for-advanced-partition-matchin.patch >> >> I noticed after *cleanly* applying 0001-0004 to today's HEAD that while >> 0005's test all pass, there are many failures in 0006's tests. Maybe, you >> need to adjust something in one of the patches or adjust test outputs. > > If the failures are because of plan changes, it's expected. If those > are because of crashes or changed output, those need to be fixed. I > have kept that patch to notice any crashes or output changes, in which > case, I pull that test into 0005 test set after fixing the code. Once > we are near commit, I will remove that patch from the patchset. Ah, okay. I thought of reporting this because I felt the errors may have to do with changes to the related code in HEAD between May 14 when you last posted the patches and today that you may need to account for in you patches. For instance, there are many diffs like this: *************** *** 90,132 **** -- left outer join, with whole-row reference EXPLAIN (COSTS OFF) SELECT t1, t2 FROM prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t1.b, t1.c, t2.a, t2.b, t2.c; ! QUERY PLAN ! -------------------------------------------------------- Sort Sort Key: t1.a, t1.c, t2.a, t2.b, t2.c ! -> Result ! -> Append ! -> Hash Right Join ! Hash Cond: (t2.b = t1.a) ! -> Seq Scan on prt2_p0 t2 ! -> Hash ! -> Seq Scan on prt1_p0 t1 <snip> --- 90,131 ---- -- left outer join, with whole-row reference EXPLAIN (COSTS OFF) SELECT t1, t2 FROM prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t1.b, t1.c, t2.a, t2.b, t2.c; ! QUERY PLAN ! -------------------------------------------------- Sort Sort Key: t1.a, t1.c, t2.a, t2.b, t2.c ! -> Append ! -> Hash Right Join ! Hash Cond: (t2.b = t1.a) ! -> Seq Scan on prt2_p0 t2 ! -> Hash ! -> Seq Scan on prt1_p0 t1 ! Filter: (b = 0) Looks like the Result node on top of Append is no longer there after applying your patch. Thanks, Amit
Re: [HACKERS] advanced partition matching algorithm forpartition-wise join
From
Ashutosh Bapat
Date:
On Wed, Jun 27, 2018 at 12:28 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2018/06/26 18:02, Ashutosh Bapat wrote: >> On Tue, Jun 26, 2018 at 2:27 PM, Amit Langote >> <Langote_Amit_f8@lab.ntt.co.jp> wrote: >>> Hi Ashutosh, >>> >>> On 2018/05/14 20:14, Ashutosh Bapat wrote: >>>> 0001-Hash-partition-bound-equality-refactoring.patch >>>> 0002-Targetlist-of-a-child-join-is-produced-by-translatin.patch >>>> 0003-Partition-wise-join-for-1-1-1-0-0-1-partition-matchi.patch >>>> 0004-Add-a-debug-message-to-notify-whether-partition-wise.patch >>>> 0005-Tests-for-0-1-1-1-and-1-0-partition-matching.patch >>>> 0006-Extra-extensive-tests-for-advanced-partition-matchin.patch >>> >>> I noticed after *cleanly* applying 0001-0004 to today's HEAD that while >>> 0005's test all pass, there are many failures in 0006's tests. Maybe, you >>> need to adjust something in one of the patches or adjust test outputs. >> >> If the failures are because of plan changes, it's expected. If those >> are because of crashes or changed output, those need to be fixed. I >> have kept that patch to notice any crashes or output changes, in which >> case, I pull that test into 0005 test set after fixing the code. Once >> we are near commit, I will remove that patch from the patchset. > > Ah, okay. I thought of reporting this because I felt the errors may have > to do with changes to the related code in HEAD between May 14 when you > last posted the patches and today that you may need to account for in you > patches. For instance, there are many diffs like this: > > *************** > *** 90,132 **** > -- left outer join, with whole-row reference > EXPLAIN (COSTS OFF) > SELECT t1, t2 FROM prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b WHERE t1.b = > 0 ORDER BY t1.a, t1.b, t1.c, t2.a, t2.b, t2.c; > ! QUERY PLAN > ! -------------------------------------------------------- > Sort > Sort Key: t1.a, t1.c, t2.a, t2.b, t2.c > ! -> Result > ! -> Append > ! -> Hash Right Join > ! Hash Cond: (t2.b = t1.a) > ! -> Seq Scan on prt2_p0 t2 > ! -> Hash > ! -> Seq Scan on prt1_p0 t1 > <snip> > > --- 90,131 ---- > -- left outer join, with whole-row reference > EXPLAIN (COSTS OFF) > SELECT t1, t2 FROM prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b WHERE t1.b = > 0 ORDER BY t1.a, t1.b, t1.c, t2.a, t2.b, t2.c; > ! QUERY PLAN > ! -------------------------------------------------- > Sort > Sort Key: t1.a, t1.c, t2.a, t2.b, t2.c > ! -> Append > ! -> Hash Right Join > ! Hash Cond: (t2.b = t1.a) > ! -> Seq Scan on prt2_p0 t2 > ! -> Hash > ! -> Seq Scan on prt1_p0 t1 > ! Filter: (b = 0) > > Looks like the Result node on top of Append is no longer there after > applying your patch. Yes. They are coming because of a commit which removed Result node on top of an Append node. I don't remember exactly which. I wouldn't worry about those diffs at this time. As I have mentioned in earlier mails, the expected output from 0006 is quite large and is not supposed to be committed. So, I don't see much value in fixing the plans in that output. Do you see that as a hindrance in reviewing the code changes and tests in 0005? -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
On 2018/06/27 22:21, Ashutosh Bapat wrote: > On Wed, Jun 27, 2018 at 12:28 PM, Amit Langote >> Ah, okay. I thought of reporting this because I felt the errors may have >> to do with changes to the related code in HEAD between May 14 when you >> last posted the patches and today that you may need to account for in you >> patches. For instance, there are many diffs like this: >> >> Looks like the Result node on top of Append is no longer there after >> applying your patch. > > Yes. They are coming because of a commit which removed Result node on > top of an Append node. I don't remember exactly which. > > I wouldn't worry about those diffs at this time. As I have mentioned > in earlier mails, the expected output from 0006 is quite large and is > not supposed to be committed. So, I don't see much value in fixing the > plans in that output. > > Do you see that as a hindrance in reviewing the code changes and tests in 0005? I think not. I'll ignore 0006 for now and focus on other patches. Thanks, Amit
> On Thu, 28 Jun 2018 at 07:54, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > > On 2018/06/27 22:21, Ashutosh Bapat wrote: > > On Wed, Jun 27, 2018 at 12:28 PM, Amit Langote > >> Ah, okay. I thought of reporting this because I felt the errors may have > >> to do with changes to the related code in HEAD between May 14 when you > >> last posted the patches and today that you may need to account for in you > >> patches. For instance, there are many diffs like this: > >> > >> Looks like the Result node on top of Append is no longer there after > >> applying your patch. > > > > Yes. They are coming because of a commit which removed Result node on > > top of an Append node. I don't remember exactly which. > > > > I wouldn't worry about those diffs at this time. As I have mentioned > > in earlier mails, the expected output from 0006 is quite large and is > > not supposed to be committed. So, I don't see much value in fixing the > > plans in that output. > > > > Do you see that as a hindrance in reviewing the code changes and tests in 0005? > > I think not. I'll ignore 0006 for now and focus on other patches. Hi, Sorry for my irregular reviews. Unfortunately, the patch need to be rebased again. In the meantime I have few more commentaries about range_bounds_merge: * From what I see partition_range_bound_cmp is executed on the same bounds few times to find whether they are overlapping and during the range merging, is it necessary? Probably it would be enough just to compare current ranges once per iteration. * Just to clarify - the iterating through all the partitions, is it the best way of finding matching ranges? Correct me if I'm wrong, but from what I see in the comments for PartitionBoundInfoData, all the ranges are arranged in increasing order - maybe then it's possible to use binary search for finding a matching range? * I'm not sure why in this case partition wise join is not applied? Maybe I'm wrong, but I was under the impression that it should work here =# \d+ test1; Table "public.test1" Column | Type | Collation | Nullable | Default | Storage | --------+---------+-----------+----------+---------+----------+ data | jsonb | | | | extended | id | integer | | | | plain | Partition key: RANGE (id) Indexes: "test1_idx" btree (id) Partitions: test11 FOR VALUES FROM (0) TO (100), test12 FOR VALUES FROM (100) TO (200), test13 FOR VALUES FROM (200) TO (300) =# \d+ test2; Table "public.test2" Column | Type | Collation | Nullable | Default | Storage | --------+---------+-----------+----------+---------+----------+ data | jsonb | | | | extended | id | integer | | | | plain | Partition key: RANGE (id) Indexes: "test2_idx" btree (id) Partitions: test21 FOR VALUES FROM (10) TO (110), test22 FOR VALUES FROM (110) TO (210), test23 FOR VALUES FROM (210) TO (310) =# set enable_partitionwise_join to true; =# explain analyze select * from test1 t1 inner join test2 t2 using (id); QUERY PLAN ------------------------------------------------------------------------------- Hash Join (cost=3.25..6.56 rows=9 width=54) (actual time=0.082..0.105 rows=3 loops=1) Hash Cond: (t1.id = t2.id) -> Append (cost=0.00..3.18 rows=12 width=29) (actual time=0.026..0.047 rows=12 loops=1) -> Seq Scan on test11 t1 (cost=0.00..1.05 rows=5 width=29) (actual time=0.025..0.028 rows=5 loops=1) -> Seq Scan on test12 t1_1 (cost=0.00..1.04 rows=4 width=29) (actual time=0.006..0.0 07 rows=4 loops=1) -> Seq Scan on test13 t1_2 (cost=0.00..1.03 rows=3 width=29) (actual time=0.005..0.0 06 rows=3 loops=1) -> Hash (cost=3.13..3.13 rows=9 width=29) (actual time=0.033..0.033 rows=9 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 9kB -> Append (cost=0.00..3.13 rows=9 width=29) (actual time=0.006..0.022 rows=9 loops=1) -> Seq Scan on test21 t2 (cost=0.00..1.03 rows=3 width=29) (actual time=0.005. .0.008 rows=3 loops=1) -> Seq Scan on test22 t2_1 (cost=0.00..1.03 rows=3 width=29) (actual time=0.00 4..0.005 rows=3 loops=1) -> Seq Scan on test23 t2_2 (cost=0.00..1.03 rows=3 width=29) (actual time=0.00 3..0.004 rows=3 loops=1) Planning Time: 0.921 ms Execution Time: 0.261 ms (14 rows) =# set enable_partitionwise_join to false; =# explain analyze select * from test1 t1 inner join test2 t2 using (id); QUERY PLAN ------------------------------------------------------------------------------- Hash Join (cost=3.25..6.56 rows=9 width=54) (actual time=0.073..0.095 rows=3 loops=1) Hash Cond: (t1.id = t2.id) -> Append (cost=0.00..3.18 rows=12 width=29) (actual time=0.022..0.041 rows=12 loops=1) -> Seq Scan on test11 t1 (cost=0.00..1.05 rows=5 width=29) (actual time=0.021..0.024 rows=5 loops=1) -> Seq Scan on test12 t1_1 (cost=0.00..1.04 rows=4 width=29) (actual time=0.006..0.0 07 rows=4 loops=1) -> Seq Scan on test13 t1_2 (cost=0.00..1.03 rows=3 width=29) (actual time=0.004..0.0 05 rows=3 loops=1) -> Hash (cost=3.13..3.13 rows=9 width=29) (actual time=0.031..0.031 rows=9 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 9kB -> Append (cost=0.00..3.13 rows=9 width=29) (actual time=0.006..0.021 rows=9 loops=1) -> Seq Scan on test21 t2 (cost=0.00..1.03 rows=3 width=29) (actual time=0.005. .0.008 rows=3 loops=1) -> Seq Scan on test22 t2_1 (cost=0.00..1.03 rows=3 width=29) (actual time=0.00 3..0.004 rows=3 loops=1) -> Seq Scan on test23 t2_2 (cost=0.00..1.03 rows=3 width=29) (actual time=0.00 3..0.004 rows=3 loops=1) Planning Time: 1.154 ms Execution Time: 0.201 ms (14 rows) My investigation shows that the merge function stops on the second iteration because of this condition: /* * Multiple partitions from one relation map to one partition from the * other relation. Partition-wise join does not handle this case right * now, since it requires ganging multiple partitions together (into * one RelOptInfo). */ merged_index = -1; I'm confused, since there is only one-to-one mapping of partitions.
On Sun, Jul 15, 2018 at 1:43 PM, Dmitry Dolgov <9erthalion6@gmail.com> wrote: > Partitions: test11 FOR VALUES FROM (0) TO (100), > test12 FOR VALUES FROM (100) TO (200), > test13 FOR VALUES FROM (200) TO (300) > > Partitions: test21 FOR VALUES FROM (10) TO (110), > test22 FOR VALUES FROM (110) TO (210), > test23 FOR VALUES FROM (210) TO (310) > > I'm confused, since there is only one-to-one mapping of partitions. No, test21 would have to be joined to both test11 and test12, since either could contain matching rows. Also, test22 would have to be joined to both test12 and test13. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] advanced partition matching algorithm forpartition-wise join
From
Ashutosh Bapat
Date:
On Sun, Jul 15, 2018 at 11:13 PM, Dmitry Dolgov <9erthalion6@gmail.com> wrote: >> On Thu, 28 Jun 2018 at 07:54, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> >> On 2018/06/27 22:21, Ashutosh Bapat wrote: >> > On Wed, Jun 27, 2018 at 12:28 PM, Amit Langote >> >> Ah, okay. I thought of reporting this because I felt the errors may have >> >> to do with changes to the related code in HEAD between May 14 when you >> >> last posted the patches and today that you may need to account for in you >> >> patches. For instance, there are many diffs like this: >> >> >> >> Looks like the Result node on top of Append is no longer there after >> >> applying your patch. >> > >> > Yes. They are coming because of a commit which removed Result node on >> > top of an Append node. I don't remember exactly which. >> > >> > I wouldn't worry about those diffs at this time. As I have mentioned >> > in earlier mails, the expected output from 0006 is quite large and is >> > not supposed to be committed. So, I don't see much value in fixing the >> > plans in that output. >> > >> > Do you see that as a hindrance in reviewing the code changes and tests in 0005? >> >> I think not. I'll ignore 0006 for now and focus on other patches. > > Hi, > > Sorry for my irregular reviews. Unfortunately, the patch need to be rebased > again. I rebased the patches without any problem on top of commit commit f7cb2842bf47715133b40e4a503f35dbe60d1b72 Author: Peter Eisentraut <peter_e@gmx.net> Date: Mon Jul 16 13:35:41 2018 +0200 Add plan_cache_mode setting This allows overriding the choice of custom or generic plan. Author: Pavel Stehule <pavel.stehule@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/CAFj8pRAGLaiEm8ur5DWEBo7qHRWTk9HxkuUAz00CZZtJj-LkCA%40mail.gmail.com I didn't get any problem rebasing the patches. They compiled cleanly. The tests added by commit 4513d3a4be0bb7d0141f8b7eaf669a55c08e41b0 failed since this patch-set changes the partitions of the tables used in the test. Adjusted those tests accordingly. > In the meantime I have few more commentaries about range_bounds_merge: > > * From what I see partition_range_bound_cmp is executed on the same bounds few > times to find whether they are overlapping and during the range merging, is > it necessary? Probably it would be enough just to compare current ranges once > per iteration. The bounds that are compared in those cases are different. Any two bounds are compared only once per iteration. Remember each range has two bounds, thus there are total four comparisons possible. In case of overlap, we do all four comparisons. > > * Just to clarify - the iterating through all the partitions, is it the best > way of finding matching ranges? Correct me if I'm wrong, but from what I see > in the comments for PartitionBoundInfoData, all the ranges are arranged in > increasing order - maybe then it's possible to use binary search for finding > a matching range? The loop in partition_range_bounds_merge() runs as many times as the maximum of number of datums from the given partition bounds. So the complexity is O(n) where n is the maximum of number of datums. With binary search the complexity will increase to O(nlogn). I might be missing something here. > > * I'm not sure why in this case partition wise join is not applied? Maybe I'm > wrong, but I was under the impression that it should work here > > =# \d+ test1; > Table "public.test1" > Column | Type | Collation | Nullable | Default | Storage | > --------+---------+-----------+----------+---------+----------+ > data | jsonb | | | | extended | > id | integer | | | | plain | > Partition key: RANGE (id) > Indexes: > "test1_idx" btree (id) > Partitions: test11 FOR VALUES FROM (0) TO (100), > test12 FOR VALUES FROM (100) TO (200), > test13 FOR VALUES FROM (200) TO (300) > > =# \d+ test2; > Table "public.test2" > Column | Type | Collation | Nullable | Default | Storage | > --------+---------+-----------+----------+---------+----------+ > data | jsonb | | | | extended | > id | integer | | | | plain | > Partition key: RANGE (id) > Indexes: > "test2_idx" btree (id) > Partitions: test21 FOR VALUES FROM (10) TO (110), > test22 FOR VALUES FROM (110) TO (210), > test23 FOR VALUES FROM (210) TO (310) > > > I'm confused, since there is only one-to-one mapping of partitions. In this case, test21 overlaps test11 (10-100) and test12 (100-110), test22 overlaps test12 (110-200) and test13(200-210). So, there is no one-to-one mapping. PFA rebased patches with tests fixed as described above. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Attachment
> On Tue, 17 Jul 2018 at 11:58, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > > On Sun, Jul 15, 2018 at 11:13 PM, Dmitry Dolgov <9erthalion6@gmail.com> wrote: > >> On Thu, 28 Jun 2018 at 07:54, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > >> > >> On 2018/06/27 22:21, Ashutosh Bapat wrote: > >> > On Wed, Jun 27, 2018 at 12:28 PM, Amit Langote > >> >> Ah, okay. I thought of reporting this because I felt the errors may have > >> >> to do with changes to the related code in HEAD between May 14 when you > >> >> last posted the patches and today that you may need to account for in you > >> >> patches. For instance, there are many diffs like this: > >> >> > >> >> Looks like the Result node on top of Append is no longer there after > >> >> applying your patch. > >> > > >> > Yes. They are coming because of a commit which removed Result node on > >> > top of an Append node. I don't remember exactly which. > >> > > >> > I wouldn't worry about those diffs at this time. As I have mentioned > >> > in earlier mails, the expected output from 0006 is quite large and is > >> > not supposed to be committed. So, I don't see much value in fixing the > >> > plans in that output. > >> > > >> > Do you see that as a hindrance in reviewing the code changes and tests in 0005? > >> > >> I think not. I'll ignore 0006 for now and focus on other patches. > > > > Hi, > > > > Sorry for my irregular reviews. Unfortunately, the patch need to be rebased > > again. > > I rebased the patches without any problem on top of commit Thanks! > > In the meantime I have few more commentaries about range_bounds_merge: > > > > * From what I see partition_range_bound_cmp is executed on the same bounds few > > times to find whether they are overlapping and during the range merging, is > > it necessary? Probably it would be enough just to compare current ranges once > > per iteration. > > The bounds that are compared in those cases are different. Any two > bounds are compared only once per iteration. Remember each range has > two bounds, thus there are total four comparisons possible. In case of > overlap, we do all four comparisons. Yep, indeed, now I see. > > * Just to clarify - the iterating through all the partitions, is it the best > > way of finding matching ranges? Correct me if I'm wrong, but from what I see > > in the comments for PartitionBoundInfoData, all the ranges are arranged in > > increasing order - maybe then it's possible to use binary search for finding > > a matching range? > > The loop in partition_range_bounds_merge() runs as many times as the > maximum of number of datums from the given partition bounds. So the > complexity is O(n) where n is the maximum of number of datums. With > binary search the complexity will increase to O(nlogn). I might be > missing something here. Now I'm confused even more. Correct me if I'm wrong, but here is what I see right now: * We're trying to solve the standard problem of finding overlapping intervals from two sets of intervals * The current implementation implicitly compares every range from one side of a join with every range from another side, which is O(n^2). Let's imagine we have two tables: =# \d+ test1_overlap Table "public.test1_overlap" Column | Type | Collation | Nullable | Default | Storage | --------+---------+-----------+----------+---------+----------+ id | integer | | | | plain | data | jsonb | | | | extended | Partition key: RANGE (id) Partitions: test11_overlap FOR VALUES FROM (200) TO (210), test12_overlap FOR VALUES FROM (220) TO (230), test13_overlap FOR VALUES FROM (240) TO (250) =# \d+ test2_overlap Table "public.test2_overlap" Column | Type | Collation | Nullable | Default | Storage | --------+---------+-----------+----------+---------+----------+ id | integer | | | | plain | data | jsonb | | | | extended | Partition key: RANGE (id) Partitions: test210_overlap FOR VALUES FROM (160) TO (170), test211_overlap FOR VALUES FROM (180) TO (190), test21_overlap FOR VALUES FROM (0) TO (10), test22_overlap FOR VALUES FROM (20) TO (30), test23_overlap FOR VALUES FROM (200) TO (210), test24_overlap FOR VALUES FROM (40) TO (50), test25_overlap FOR VALUES FROM (60) TO (70), test26_overlap FOR VALUES FROM (80) TO (90), test27_overlap FOR VALUES FROM (100) TO (110), test28_overlap FOR VALUES FROM (120) TO (130), test29_overlap FOR VALUES FROM (140) TO (150) And the join: select * from test1_overlap inner join test2_overlap using (id); Partition wise join will work fine, but what will happen (I see this following the code under gdb) is that inside the function partition_range_bounds_merge we start from two partitions: test11_overlap (200, 210) and test21_overlap (0, 10) In the comparison loop we figure out that there is no overlap and go to the ub_cmpval > 0 branch, because test11_overlap is higher that test21_overlap. Inside this branch we think that we need to handle a missing partition case (apparently, by mistake, but it works - in case if it's not a missing partition, there are no any records to join with from a default one). Since in this case there isn't any default partition, the result is merged = true. After that partition_range_get_next_lb_index will move us to another partition pair: test11_overlap (200, 210) and test22_overlap (20, 30) And so on and so forth until we will reach the partition test23_overlap (200, 210), which would be actually what we're looking for. So basically as I said above we will iterate over the all partitions, and we could skip some of them using binary search. > > > > * I'm not sure why in this case partition wise join is not applied? Maybe I'm > > wrong, but I was under the impression that it should work here > > > > =# \d+ test1; > > Table "public.test1" > > Column | Type | Collation | Nullable | Default | Storage | > > --------+---------+-----------+----------+---------+----------+ > > data | jsonb | | | | extended | > > id | integer | | | | plain | > > Partition key: RANGE (id) > > Indexes: > > "test1_idx" btree (id) > > Partitions: test11 FOR VALUES FROM (0) TO (100), > > test12 FOR VALUES FROM (100) TO (200), > > test13 FOR VALUES FROM (200) TO (300) > > > > =# \d+ test2; > > Table "public.test2" > > Column | Type | Collation | Nullable | Default | Storage | > > --------+---------+-----------+----------+---------+----------+ > > data | jsonb | | | | extended | > > id | integer | | | | plain | > > Partition key: RANGE (id) > > Indexes: > > "test2_idx" btree (id) > > Partitions: test21 FOR VALUES FROM (10) TO (110), > > test22 FOR VALUES FROM (110) TO (210), > > test23 FOR VALUES FROM (210) TO (310) > > > > > > I'm confused, since there is only one-to-one mapping of partitions. > > In this case, test21 overlaps test11 (10-100) and test12 (100-110), > test22 overlaps test12 (110-200) and test13(200-210). So, there is no > one-to-one mapping. Yep, thanks for the explanation.
> On Thu, 19 Jul 2018 at 21:04, Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > > * Just to clarify - the iterating through all the partitions, is it the best > > > way of finding matching ranges? Correct me if I'm wrong, but from what I see > > > in the comments for PartitionBoundInfoData, all the ranges are arranged in > > > increasing order - maybe then it's possible to use binary search for finding > > > a matching range? > > > > The loop in partition_range_bounds_merge() runs as many times as the > > maximum of number of datums from the given partition bounds. So the > > complexity is O(n) where n is the maximum of number of datums. With > > binary search the complexity will increase to O(nlogn). I might be > > missing something here. > > Now I'm confused even more. Correct me if I'm wrong, but here is what I see > right now: > > * We're trying to solve the standard problem of finding overlapping intervals > from two sets of intervals > > * The current implementation implicitly compares every range from one side of a > join with every range from another side, which is O(n^2). It's of course wrong, it's going to be O(max(m, n)) as you said, but the point is still valid - if we have partitions A1, A2 from one side and B1, ..., BN on another side, we can skip necessary the partitions from B that are between e.g. A1 and A2 faster with binary search.
Re: [HACKERS] advanced partition matching algorithm forpartition-wise join
From
Ashutosh Bapat
Date:
On Fri, Jul 20, 2018 at 3:13 AM, Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > It's of course wrong, it's going to be O(max(m, n)) as you said, but > the point is still valid - if we have partitions A1, A2 from one side > and B1, ..., BN on another side, we can skip necessary the > partitions from B that are between e.g. A1 and A2 faster with > binary search. That's possible only when there is no default partition and the join is an inner join. For an outer join, we need to include all the partitions on the outer side, so we can't just skip over some partitions. In case of a default partition, it can take place of a missing partition, so we can't skip partitions using binary search. The code right now works for all the cases and is O(n). I agree that it can be optimized for certain cases, but 1. those cases are rare enough that we can ignore those right now. How many times we would encounter the case you have quoted, for example? Usually the ranges will be continuous only differing in the first or last partition e.g time-series data. 2. The code is enough complex right now and it's also a lot. Making it complicated further is not the worth the rare use cases. If we get complaints from the field, we can certainly improve it in future. But I would wait for those complaints before improving it further. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
> On Mon, 23 Jul 2018 at 10:38, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > > On Fri, Jul 20, 2018 at 3:13 AM, Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > > It's of course wrong, it's going to be O(max(m, n)) as you said, but > > the point is still valid - if we have partitions A1, A2 from one side > > and B1, ..., BN on another side, we can skip necessary the > > partitions from B that are between e.g. A1 and A2 faster with > > binary search. > > That's possible only when there is no default partition and the join > is an inner join. For an outer join, we need to include all the > partitions on the outer side, so we can't just skip over some > partitions. In case of a default partition, it can take place of a > missing partition, so we can't skip partitions using binary search. But in this case I described above (when we have a partition A3 on one side, and another matching partition B3 from other side, but separated by some other partitions, e.g. B1, B2) it means that we will merge A3 with a default partition from other side without actually needing that, am I right? In this sense it's an overhead out of nothing. > 1. those cases are rare enough that we can ignore those right now. How > many times we would encounter the case you have quoted, for example? > Usually the ranges will be continuous only differing in the first or > last partition e.g time-series data. Well, unfortunately, I don't have enough context to discuss whether it's rare or not.
Re: [HACKERS] advanced partition matching algorithm forpartition-wise join
From
Ashutosh Bapat
Date:
On Thu, Jul 26, 2018 at 8:37 PM, Dmitry Dolgov <9erthalion6@gmail.com> wrote: >> On Mon, 23 Jul 2018 at 10:38, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: >> >> On Fri, Jul 20, 2018 at 3:13 AM, Dmitry Dolgov <9erthalion6@gmail.com> wrote: >> > >> > It's of course wrong, it's going to be O(max(m, n)) as you said, but >> > the point is still valid - if we have partitions A1, A2 from one side >> > and B1, ..., BN on another side, we can skip necessary the >> > partitions from B that are between e.g. A1 and A2 faster with >> > binary search. >> >> That's possible only when there is no default partition and the join >> is an inner join. For an outer join, we need to include all the >> partitions on the outer side, so we can't just skip over some >> partitions. In case of a default partition, it can take place of a >> missing partition, so we can't skip partitions using binary search. > > But in this case I described above (when we have a partition A3 on one side, > and another matching partition B3 from other side, but separated by some other > partitions, e.g. B1, B2) it means that we will merge A3 with a default > partition from other side without actually needing that, am I right? In this > sense it's an overhead out of nothing. No. We will join A3 with B3 since they have matching bounds. We will compare B1 and B2's bounds with A3 (assuming that there are no bounds before A3). Since they won't be compatible we will match default of A to B1 and B2. That will of-course fail since we will try to match multiple partitions to a single partition, but that's not the point of your question I think. You are right that we could skip comparing A3 with B1 and B2 and directly land on B3. Any partitions skipped in between will be matched with A's default partition. But as I have said this would be rare and complicating the logic for a rare case doesn't seem practical at this stage. Apart from the complexity there's also a possibility that this skipping will reduce the efficiency actually in normal cases. Consider a case where A and B have exactly matching partitions. Current partition matching algorithm compare any given range/bound only once and we will have O(n) algorithm. If we use a binary search, however, for every range comparison, it will be O(n log n) algorithm. There will be unnecessary comparisons during binary search. The current algorithm is always O(n), whereas binary search would be O(n log(n)) with a possibility of having sub-O(n) complexity in some cases. I would go for an algorithm which has a consistent time complexity always and which is efficient in normal cases, rather than worrying about some cases which are not practical. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
On Fri, Jul 27, 2018 at 3:17 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > Apart from the complexity there's also a possibility that this > skipping will reduce the efficiency actually in normal cases. Consider > a case where A and B have exactly matching partitions. Current > partition matching algorithm compare any given range/bound only once > and we will have O(n) algorithm. If we use a binary search, however, > for every range comparison, it will be O(n log n) algorithm. There > will be unnecessary comparisons during binary search. The current > algorithm is always O(n), whereas binary search would be O(n log(n)) > with a possibility of having sub-O(n) complexity in some cases. I > would go for an algorithm which has a consistent time complexity > always and which is efficient in normal cases, rather than worrying > about some cases which are not practical. Yeah, I think that's a good point. The normal case here will be that the partition bounds are equal, or that there are a few extra partitions on one side that don't exist on the other. We don't want other cases to be crazily inefficient, but I suspect in practice that if the partitioning bounds aren't pretty close to matching up exactly, we're going to end up failing to be able to do a partition-wise join at all. It's not very likely that somebody happens to have a case where they've partitioned two tables in randomly different ways, but then they decide to join them anyway, but then it turns out that the partition bounds happen to be compatible enough that this algorithm works. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> On Fri, 27 Jul 2018 at 20:13, Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, Jul 27, 2018 at 3:17 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: > > Apart from the complexity there's also a possibility that this > > skipping will reduce the efficiency actually in normal cases. Consider > > a case where A and B have exactly matching partitions. Current > > partition matching algorithm compare any given range/bound only once > > and we will have O(n) algorithm. If we use a binary search, however, > > for every range comparison, it will be O(n log n) algorithm. There > > will be unnecessary comparisons during binary search. The current > > algorithm is always O(n), whereas binary search would be O(n log(n)) > > with a possibility of having sub-O(n) complexity in some cases. I > > would go for an algorithm which has a consistent time complexity > > always and which is efficient in normal cases, rather than worrying > > about some cases which are not practical. > > Yeah, I think that's a good point. The normal case here will be that > the partition bounds are equal, or that there are a few extra > partitions on one side that don't exist on the other. We don't want > other cases to be crazily inefficient, but I suspect in practice that > if the partitioning bounds aren't pretty close to matching up exactly, > we're going to end up failing to be able to do a partition-wise join > at all. It's not very likely that somebody happens to have a case > where they've partitioned two tables in randomly different ways, but > then they decide to join them anyway, but then it turns out that the > partition bounds happen to be compatible enough that this algorithm > works. > On Mon, 23 Jul 2018 at 10:38, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > 1. those cases are rare enough that we can ignore those right now. How > many times we would encounter the case you have quoted, for example? > Usually the ranges will be continuous only differing in the first or > last partition e.g time-series data. Ok, but what about list partitioning? I believe the area of application for it can be more diverse than mostly just for time-series, and in the patch almost the same approach is used to merge list partitions. Other than that everything seems fine from functionality point of view, and so far I couldn't find any situation that produces a wrong plan. Some of the joins were not that intuitive from the first glance, but at the end everything was according to the documentation. Taking this into account I wonder if it's possible somehow to give any hints in an explain result about why it wasn't possible to apply partition wise join? If something wasn't clear for me I ended up looking at the value of "merged" flag, and to figure out actual reason one needs to trace the entire algorithm. Besides that I checked the performance in some simple cases, no problems on this side (but I'll also do some checks for more complicated joins). And I still have some complaints about readability, although I can imagine that it's just me: * Many functions carry some unrelated arguments just to pass them through, which obscures the purpose of a function. * I want to suggest to introduce a new data structure for partitions mapping instead of just keeping them in arrays (was it discussed already before?). * What is the reason that almost everywhere in the patch there is a naming convention "outer/inner" partition maps, and sometimes (e.g. in generate_matching_part_pairs) it's "map1/map2"?
Re: [HACKERS] advanced partition matching algorithm forpartition-wise join
From
Ashutosh Bapat
Date:
On Thu, Aug 23, 2018 at 4:01 PM, Dmitry Dolgov <9erthalion6@gmail.com> wrote: >> On Fri, 27 Jul 2018 at 20:13, Robert Haas <robertmhaas@gmail.com> wrote: >> >> On Fri, Jul 27, 2018 at 3:17 AM, Ashutosh Bapat >> <ashutosh.bapat@enterprisedb.com> wrote: >> > Apart from the complexity there's also a possibility that this >> > skipping will reduce the efficiency actually in normal cases. Consider >> > a case where A and B have exactly matching partitions. Current >> > partition matching algorithm compare any given range/bound only once >> > and we will have O(n) algorithm. If we use a binary search, however, >> > for every range comparison, it will be O(n log n) algorithm. There >> > will be unnecessary comparisons during binary search. The current >> > algorithm is always O(n), whereas binary search would be O(n log(n)) >> > with a possibility of having sub-O(n) complexity in some cases. I >> > would go for an algorithm which has a consistent time complexity >> > always and which is efficient in normal cases, rather than worrying >> > about some cases which are not practical. >> >> Yeah, I think that's a good point. The normal case here will be that >> the partition bounds are equal, or that there are a few extra >> partitions on one side that don't exist on the other. We don't want >> other cases to be crazily inefficient, but I suspect in practice that >> if the partitioning bounds aren't pretty close to matching up exactly, >> we're going to end up failing to be able to do a partition-wise join >> at all. It's not very likely that somebody happens to have a case >> where they've partitioned two tables in randomly different ways, but >> then they decide to join them anyway, but then it turns out that the >> partition bounds happen to be compatible enough that this algorithm >> works. > >> On Mon, 23 Jul 2018 at 10:38, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: >> 1. those cases are rare enough that we can ignore those right now. How >> many times we would encounter the case you have quoted, for example? >> Usually the ranges will be continuous only differing in the first or >> last partition e.g time-series data. > > Ok, but what about list partitioning? I believe the area of application for it > can be more diverse than mostly just for time-series, and in the patch almost > the same approach is used to merge list partitions. Same arguments hold for list partitioning as well. For a list partitioned table the bounds are ordered by list datums and not partitions, so it will be rather odd to have large runs of mismatching datums, the case where binary search fares, from one of side of join. So, my following argument still holds true --- >> > I would go for an algorithm which has a consistent time complexity >> > always and which is efficient in normal cases, rather than worrying >> > about some cases which are not practical. --- > > Other than that everything seems fine from functionality point of view, and so > far I couldn't find any situation that produces a wrong plan. Some of the joins > were not that intuitive from the first glance, but at the end everything was > according to the documentation. Thanks a lot for your tests and I am glad that you didn't find any failures. > Taking this into account I wonder if it's possible somehow to give any hints in > an explain result about why it wasn't possible to apply partition wise join? If > something wasn't clear for me I ended up looking at the value of "merged" flag, > and to figure out actual reason one needs to trace the entire algorithm. That's kind of tricky in PostgreSQL. The optimizer doesn't always report why a particular path was not chosen. Said that we could add trace logs printing that information, however, the main difficulty is reporting the relations being joined.See 0004 for example. > > Besides that I checked the performance in some simple cases, no problems on > this side (but I'll also do some checks for more complicated joins). Thanks a lot. > > And I still have some complaints about readability, although I can imagine that > it's just me: > > * Many functions carry some unrelated arguments just to pass them through, > which obscures the purpose of a function. Can you please provide some examples? > > * I want to suggest to introduce a new data structure for partitions mapping > instead of just keeping them in arrays (was it discussed already before?). The best I could think of was a structure with just two arrays. So, instead of encapsulating the arrays within a structure, I thought it best to pass bare arrays around. If you have any other ideas, please let me know. > > * What is the reason that almost everywhere in the patch there is a naming > convention "outer/inner" partition maps, and sometimes (e.g. in > generate_matching_part_pairs) it's "map1/map2"? You will find that the optimizer in general uses outer/inner, rel1/rel2 nomenclature interchangeably. This patch-set just inherits that. But yes, we should do more to straighten it out. I won't be working on this actively in the next commitfest. I will be glad if somebody else wants to take this up. If there's nobody, probably we should mark this entry as "returned with feedback" in the next commitfest. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
> On Wed, 29 Aug 2018 at 09:32, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > > > * Many functions carry some unrelated arguments just to pass them through, > > which obscures the purpose of a function. > > Can you please provide some examples? E.g this chain with partsupfunc & collations: partition_range_bounds_merge -> partition_range_cmp -> partition_range_bound_cmp -> partition_range_merge_next_lb I believe I already mentioned that before (although I'm not saying that I have a solution right away, it just bothers me). > > * I want to suggest to introduce a new data structure for partitions mapping > > instead of just keeping them in arrays (was it discussed already before?). > > The best I could think of was a structure with just two arrays. So, > instead of encapsulating the arrays within a structure, I thought it > best to pass bare arrays around. If you have any other ideas, please > let me know. Well, what I had in mind is something like a structure Mapping with fields from & to. > > * What is the reason that almost everywhere in the patch there is a naming > > convention "outer/inner" partition maps, and sometimes (e.g. in > > generate_matching_part_pairs) it's "map1/map2"? > > You will find that the optimizer in general uses outer/inner, > rel1/rel2 nomenclature interchangeably. This patch-set just inherits > that. But yes, we should do more to straighten it out. Thanks, good to know. > I won't be working on this actively in the next commitfest. I will be > glad if somebody else wants to take this up. If there's nobody, > probably we should mark this entry as "returned with feedback" in the > next commitfest. Since I'm more or less familiar with the code and I believe it's an interesting feature, I can try to take over it for now if you don't mind (but without any strong commitments to make it perfectly shining for the next CF).
Re: [HACKERS] advanced partition matching algorithm forpartition-wise join
From
Ashutosh Bapat
Date:
On Thu, Aug 30, 2018 at 2:23 PM, Dmitry Dolgov <9erthalion6@gmail.com> wrote: > >> I won't be working on this actively in the next commitfest. I will be >> glad if somebody else wants to take this up. If there's nobody, >> probably we should mark this entry as "returned with feedback" in the >> next commitfest. > > Since I'm more or less familiar with the code and I believe it's an interesting > feature, I can try to take over it for now if you don't mind (but without any > strong commitments to make it perfectly shining for the next CF). Please do. Thanks. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
> On Fri, 31 Aug 2018 at 08:23, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > > On Thu, Aug 30, 2018 at 2:23 PM, Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > >> I won't be working on this actively in the next commitfest. I will be > >> glad if somebody else wants to take this up. If there's nobody, > >> probably we should mark this entry as "returned with feedback" in the > >> next commitfest. > > > > Since I'm more or less familiar with the code and I believe it's an interesting > > feature, I can try to take over it for now if you don't mind (but without any > > strong commitments to make it perfectly shining for the next CF). > > Please do. Thanks. I've noticed that the patch is outdated already, so here is the rebased version. I also removed the last part with the extra tests since it was something internal and merged the debug message into the implementation part. Ashutosh, please let me know if you're not happy with these modifications. Other than that I haven't changed anything yet, but hope this will come soon. And this version is more to keep it updated for those people who may be interested.
Attachment
Re: [HACKERS] advanced partition matching algorithm forpartition-wise join
From
Ashutosh Bapat
Date:
On Thu, Sep 13, 2018 at 1:45 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> On Fri, 31 Aug 2018 at 08:23, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
>
> On Thu, Aug 30, 2018 at 2:23 PM, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> >
> >> I won't be working on this actively in the next commitfest. I will be
> >> glad if somebody else wants to take this up. If there's nobody,
> >> probably we should mark this entry as "returned with feedback" in the
> >> next commitfest.
> >
> > Since I'm more or less familiar with the code and I believe it's an interesting
> > feature, I can try to take over it for now if you don't mind (but without any
> > strong commitments to make it perfectly shining for the next CF).
>
> Please do. Thanks.
I've noticed that the patch is outdated already, so here is the rebased
version. I also removed the last part with the extra tests since it
was something
internal
I am fine with that. It was never meant to be committed. I used to run those tests to make sure that any changes to the core logic do not break any working scenarios. Whenever I found a new failure in the extra tests which wasn't there in tests to be committed, I used to move that test from the first to the second. Over the time, the number of new failures in extra has reduced and recently I didn't see any extra failures. So, may be it's time for the extra tests to be dropped. I will suggest that keep the extra tests running from time to time and certainly around the time the feature gets committed.
and merged the debug message into the implementation part. Ashutosh,
please let me know if you're not happy with these modifications.
Robert Haas raised objections, and I agreed to those, about a similar debug message I had included in the basic partition-wise join patches. I think those reasons still apply, so you will need to remove the debug message before the patches get committed. Said that the debug message is a good debugging aid, so keeping it around till that time is a good idea.
Other than that I haven't changed anything yet, but hope this will come soon.
And this version is more to keep it updated for those people who may be
interested.
Thanks.
--
Best Wishes,
Ashutosh Bapat
> On Mon, 17 Sep 2018 at 11:20, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote: > > I am fine with that. It was never meant to be committed. I used to run those > tests to make sure that any changes to the core logic do not break any > working scenarios. Whenever I found a new failure in the extra tests which > wasn't there in tests to be committed, I used to move that test from the > first to the second. Over the time, the number of new failures in extra has > reduced and recently I didn't see any extra failures. So, may be it's time > for the extra tests to be dropped. I will suggest that keep the extra tests > running from time to time and certainly around the time the feature gets > committed. Great, that's exactly what I'm doing right now - I keep these tests locally to monitor any significant failures except any changes in plans, but without including it into the patch series. Since most of my complaints about the patch were related to code readability, I modified it a bit to show more clearly what I have in mind. I haven't changed anything about the functionality, just adjusted some parts of it: * removed some unused arguments (looks like they were added for consistency in all higher level branches, but not all of them were actually in use) * replaced representation for partition mapping (instead of int arrays there is a structure, that allows to replace 0/1 with more meaningful from/to) * tried to make naming a bit more consistent - so, if a function doesn't explicitely say anything about outer/inner, we have partmap1/partmap2, otherwise outermap/innermap. I don't really like this style with partmap1/partmap2 or index1/index2, but it seems consistent with the rest of the code in partbounds.c * removed some state representation flags, e.g. merged - instead, if there is a situation when we can't merge, functions will return NULL right away * removed questionable debugging statement Ashutosh, can you please take a look at it and confirm (or not) that you also think these changes have improved readability a bit. If we're on the same page about that, I'll continue in this direction.
Attachment
> On Thu, Oct 25, 2018 at 11:19 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > Since most of my complaints about the patch were related to code readability, > I modified it a bit to show more clearly what I have in mind. I haven't changed > anything about the functionality, just adjusted some parts of it: > > * removed some unused arguments (looks like they were added for consistency in > all higher level branches, but not all of them were actually in use) > > * replaced representation for partition mapping (instead of int arrays there is > a structure, that allows to replace 0/1 with more meaningful from/to) > > * tried to make naming a bit more consistent - so, if a function doesn't > explicitely say anything about outer/inner, we have partmap1/partmap2, > otherwise outermap/innermap. I don't really like this style with > partmap1/partmap2 or index1/index2, but it seems consistent with the rest of > the code in partbounds.c > > * removed some state representation flags, e.g. merged - instead, if there is a > situation when we can't merge, functions will return NULL right away > > * removed questionable debugging statement I've noticed, that this patch set is outdated, so here is the rebased version.
Attachment
On Mon, Nov 26, 2018 at 9:03 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > I've noticed, that this patch set is outdated, so here is the rebased version. This turned red on cfbot because I turned on -Werror: partbounds.c: In function ‘partition_bounds_merge’: partbounds.c:3619:43: error: ‘outer_part’ may be used uninitialized in this function [-Werror=maybe-uninitialized] merged_index = map_and_merge_partitions(outer_pmap, outer_mmap, ^ partbounds.c:3475:6: note: ‘outer_part’ was declared here int outer_part; ^ cc1: all warnings being treated as errors -- Thomas Munro http://www.enterprisedb.com
> On Sun, Nov 25, 2018 at 10:07 PM Thomas Munro <thomas.munro@enterprisedb.com> wrote: > > This turned red on cfbot because I turned on -Werror: > > partbounds.c: In function ‘partition_bounds_merge’: > partbounds.c:3619:43: error: ‘outer_part’ may be used uninitialized in > this function [-Werror=maybe-uninitialized] > merged_index = map_and_merge_partitions(outer_pmap, outer_mmap, > ^ > partbounds.c:3475:6: note: ‘outer_part’ was declared here > int outer_part; > ^ > cc1: all warnings being treated as errors Thanks, I've messed this up too - rebased a wrong branch, the correct one doesn't have this code already. Sorry for that, and here is the proper version of this patch.
Attachment
> On Mon, Nov 26, 2018 at 1:41 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > Thanks, I've messed this up too - rebased a wrong branch, the correct one > doesn't have this code already. Sorry for that, and here is the proper version > of this patch. And one more version, because I haven't fixed the tests (sorry for the noise).
Attachment
On Mon, Nov 26, 2018 at 9:33 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> On Mon, Nov 26, 2018 at 1:41 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> Thanks, I've messed this up too - rebased a wrong branch, the correct one
> doesn't have this code already. Sorry for that, and here is the proper version
> of this patch.
And one more version, because I haven't fixed the tests (sorry for the noise).
0003 patch need a rebase.
Regards,
Amul
(2019/01/21 20:56), amul sul wrote: > On Mon, Nov 26, 2018 at 9:33 PM Dmitry Dolgov <9erthalion6@gmail.com > <mailto:9erthalion6@gmail.com>> wrote: > > > On Mon, Nov 26, 2018 at 1:41 PM Dmitry Dolgov > <9erthalion6@gmail.com <mailto:9erthalion6@gmail.com>> wrote: > > > > Thanks, I've messed this up too - rebased a wrong branch, the > correct one > > doesn't have this code already. Sorry for that, and here is the > proper version > > of this patch. > > And one more version, because I haven't fixed the tests (sorry for > the noise). > 0003 patch need a rebase. Will do. Thank you for letting us know! Best regards, Etsuro Fujita
(2019/01/21 21:09), Etsuro Fujita wrote: > (2019/01/21 20:56), amul sul wrote: >> 0003 patch need a rebase. > > Will do. While doing the rebase, I noticed that 0002 would not be necessary anymore; what was originally proposed by it on how to generate the tlists for child-joins in [1] is entirely included in commit 7cfdc77023ad50731723e85c215a4127436ed09c, which I pushed. I didn't pay attention to this thread, but that was pretty much the same as what I did in that commit (except the handling of PHVs to avoid extra work in that commit), so I think I should have mentioned about that in the commit message. Sorry for that, Ashutosh. Here is an updated version of the patchset. As above, I don't think 0002 is necessary; it added copy_pathtarget to build_child_join_rel, but I think it would be redundant because we do create_empty_pathtarget at the top of that function and then build_child_join_reltarget. Also, it added an assertion to build_joinrel_tlist, but I don't think it's really necessary IMO. So I removed it entirely from the patchset. No changes other that. Will continue to review. Best regards, Etsuro Fujita [1] https://www.postgresql.org/message-id/CAFjFpResoxfp1rnV4Op9JOnG19VNEnjvjRN5DVd8QRHD%2BagTDw%40mail.gmail.com
Attachment
(2019/01/22 21:38), Etsuro Fujita wrote: > Will continue to review. I rebased the patch set against the latest HEAD. Attached is a new version. I'll move this to the next CF, and continue to review it. Best regards, Etsuro Fujita
Attachment
There are few whitespaces in 0002 patch that I have fixed in the attached version.
Rest of the patches are untouched.
Ill continue my review and testing.
Regards,
Amul
On Thu, Jan 31, 2019 at 5:26 PM Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp> wrote:
(2019/01/22 21:38), Etsuro Fujita wrote:
> Will continue to review.
I rebased the patch set against the latest HEAD. Attached is a new
version. I'll move this to the next CF, and continue to review it.
Best regards,
Etsuro Fujita
Attachment
Attached is the rebased atop of the latest master head(35bc0ec7c8).
Regards,
Amul Sul
On Mon, Feb 4, 2019 at 11:05 AM amul sul <sulamul@gmail.com> wrote:
There are few whitespaces in 0002 patch that I have fixed in the attached version.Rest of the patches are untouched.Ill continue my review and testing.Regards,AmulOn Thu, Jan 31, 2019 at 5:26 PM Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp> wrote:(2019/01/22 21:38), Etsuro Fujita wrote:
> Will continue to review.
I rebased the patch set against the latest HEAD. Attached is a new
version. I'll move this to the next CF, and continue to review it.
Best regards,
Etsuro Fujita
Attachment
Re: [HACKERS] advanced partition matching algorithm forpartition-wise join
From
Rajkumar Raghuwanshi
Date:
On Tue, Mar 5, 2019 at 3:45 PM amul sul <sulamul@gmail.com> wrote:
Attached is the rebased atop of the latest master head(35bc0ec7c8).
thanks Amul, patches applied cleanly on PG head.
While testing this I got a server crash with below test case.
CREATE TABLE plt1 (a int, b int, c varchar) PARTITION BY LIST(c);
CREATE TABLE plt1_p1 PARTITION OF plt1 FOR VALUES IN ('0001','0002','0003');
CREATE TABLE plt1_p2 PARTITION OF plt1 FOR VALUES IN ('0004','0005','0006');
CREATE TABLE plt1_p3 PARTITION OF plt1 FOR VALUES IN (NULL,'0008','0009');
CREATE TABLE plt1_p4 PARTITION OF plt1 FOR VALUES IN ('0000','0010');
INSERT INTO plt1 SELECT i, i % 47, to_char(i % 17, 'FM0000') FROM generate_series(0, 500) i WHERE i % 17 NOT IN (7, 11, 12, 13, 14, 15,16);
INSERT INTO plt1 SELECT i, i % 47, case when i % 17 = 7 then NULL else to_char(i % 17, 'FM0000') end FROM generate_series(0, 500) i WHERE i % 17 IN (7,8,9);
ANALYSE plt1;
CREATE TABLE plt2 (a int, b int, c varchar) PARTITION BY LIST(c);
CREATE TABLE plt2_p1 PARTITION OF plt2 FOR VALUES IN ('0002','0003');
CREATE TABLE plt2_p2 PARTITION OF plt2 FOR VALUES IN ('0004','0005','0006');
CREATE TABLE plt2_p3 PARTITION OF plt2 FOR VALUES IN ('0007','0008','0009');
CREATE TABLE plt2_p4 PARTITION OF plt2 FOR VALUES IN ('0000',NULL,'0012');
INSERT INTO plt2 SELECT i, i % 47, to_char(i % 17, 'FM0000') FROM generate_series(0, 500) i WHERE i % 17 NOT IN (1, 10, 11, 13, 14, 15, 16);
INSERT INTO plt2 SELECT i, i % 47, case when i % 17 = 11 then NULL else to_char(i % 17, 'FM0000') end FROM generate_series(0, 500) i WHERE i % 17 IN (0,11,12);
ANALYZE plt2;
CREATE TABLE plt1_e (a int, b int, c text) PARTITION BY LIST(ltrim(c, 'A'));
CREATE TABLE plt1_e_p1 PARTITION OF plt1_e FOR VALUES IN ('0002', '0003');
CREATE TABLE plt1_e_p2 PARTITION OF plt1_e FOR VALUES IN ('0004', '0005', '0006');
CREATE TABLE plt1_e_p3 PARTITION OF plt1_e FOR VALUES IN ('0008', '0009');
CREATE TABLE plt1_e_p4 PARTITION OF plt1_e FOR VALUES IN ('0000');
INSERT INTO plt1_e SELECT i, i % 47, to_char(i % 17, 'FM0000') FROM generate_series(0, 500) i WHERE i % 17 NOT IN (1, 7, 10, 11, 12, 13, 14, 15, 16);
ANALYZE plt1_e;
EXPLAIN (COSTS OFF)
SELECT avg(t1.a), avg(t2.b), avg(t3.a + t3.b), t1.c, t2.c, t3.c FROM
plt1 t1, plt2 t2, plt1_e t3 WHERE t1.c = t2.c AND ltrim(t3.c, 'A') = t1.c GROUP BY t1.c, t2.c, t3.c ORDER BY t1.c, t2.c, t3.c;
CREATE TABLE plt1_p1 PARTITION OF plt1 FOR VALUES IN ('0001','0002','0003');
CREATE TABLE plt1_p2 PARTITION OF plt1 FOR VALUES IN ('0004','0005','0006');
CREATE TABLE plt1_p3 PARTITION OF plt1 FOR VALUES IN (NULL,'0008','0009');
CREATE TABLE plt1_p4 PARTITION OF plt1 FOR VALUES IN ('0000','0010');
INSERT INTO plt1 SELECT i, i % 47, to_char(i % 17, 'FM0000') FROM generate_series(0, 500) i WHERE i % 17 NOT IN (7, 11, 12, 13, 14, 15,16);
INSERT INTO plt1 SELECT i, i % 47, case when i % 17 = 7 then NULL else to_char(i % 17, 'FM0000') end FROM generate_series(0, 500) i WHERE i % 17 IN (7,8,9);
ANALYSE plt1;
CREATE TABLE plt2 (a int, b int, c varchar) PARTITION BY LIST(c);
CREATE TABLE plt2_p1 PARTITION OF plt2 FOR VALUES IN ('0002','0003');
CREATE TABLE plt2_p2 PARTITION OF plt2 FOR VALUES IN ('0004','0005','0006');
CREATE TABLE plt2_p3 PARTITION OF plt2 FOR VALUES IN ('0007','0008','0009');
CREATE TABLE plt2_p4 PARTITION OF plt2 FOR VALUES IN ('0000',NULL,'0012');
INSERT INTO plt2 SELECT i, i % 47, to_char(i % 17, 'FM0000') FROM generate_series(0, 500) i WHERE i % 17 NOT IN (1, 10, 11, 13, 14, 15, 16);
INSERT INTO plt2 SELECT i, i % 47, case when i % 17 = 11 then NULL else to_char(i % 17, 'FM0000') end FROM generate_series(0, 500) i WHERE i % 17 IN (0,11,12);
ANALYZE plt2;
CREATE TABLE plt1_e (a int, b int, c text) PARTITION BY LIST(ltrim(c, 'A'));
CREATE TABLE plt1_e_p1 PARTITION OF plt1_e FOR VALUES IN ('0002', '0003');
CREATE TABLE plt1_e_p2 PARTITION OF plt1_e FOR VALUES IN ('0004', '0005', '0006');
CREATE TABLE plt1_e_p3 PARTITION OF plt1_e FOR VALUES IN ('0008', '0009');
CREATE TABLE plt1_e_p4 PARTITION OF plt1_e FOR VALUES IN ('0000');
INSERT INTO plt1_e SELECT i, i % 47, to_char(i % 17, 'FM0000') FROM generate_series(0, 500) i WHERE i % 17 NOT IN (1, 7, 10, 11, 12, 13, 14, 15, 16);
ANALYZE plt1_e;
EXPLAIN (COSTS OFF)
SELECT avg(t1.a), avg(t2.b), avg(t3.a + t3.b), t1.c, t2.c, t3.c FROM
plt1 t1, plt2 t2, plt1_e t3 WHERE t1.c = t2.c AND ltrim(t3.c, 'A') = t1.c GROUP BY t1.c, t2.c, t3.c ORDER BY t1.c, t2.c, t3.c;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
below is stack trace, looks like some indexes got corrupted, please take a look.
Core was generated by `postgres: edb postgres [local] EXPLAIN '.
Program terminated with signal 11, Segmentation fault.
#0 0x0000000000821aee in map_and_merge_partitions (partmaps1=0x2c1c8a8, partmaps2=0x2c1c8e0, index1=45540240, index2=0, next_index=0x7ffeebd43d3c) at partbounds.c:4162
4162 if (partmap1->from < 0 && partmap2->from < 0)
Missing separate debuginfos, use: debuginfo-install keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-65.el6.x86_64 libcom_err-1.41.12-24.el6.x86_64 libselinux-2.0.94-7.el6.x86_64 openssl-1.0.1e-57.el6.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0 0x0000000000821aee in map_and_merge_partitions (partmaps1=0x2c1c8a8, partmaps2=0x2c1c8e0, index1=45540240, index2=0, next_index=0x7ffeebd43d3c) at partbounds.c:4162
#1 0x00000000008226c3 in merge_null_partitions (outer_bi=0x2b6e338, inner_bi=0x2bf90b0, outer_maps=0x2c1c8a8, inner_maps=0x2c1c8e0, jointype=JOIN_INNER,
next_index=0x7ffeebd43d3c, null_index=0x7ffeebd43d38, default_index=0x7ffeebd43d34) at partbounds.c:4610
#2 0x0000000000821726 in partition_list_bounds_merge (partsupfunc=0x2ba3548, partcollation=0x2ba34e8, outer_rel=0x2b6ce40, inner_rel=0x2bf8d28, outer_parts=0x7ffeebd43ed8,
inner_parts=0x7ffeebd43ed0, jointype=JOIN_INNER) at partbounds.c:4031
#3 0x000000000081ff5d in partition_bounds_merge (partnatts=1, partsupfunc=0x2ba3548, partcollation=0x2ba34e8, outer_rel=0x2b6ce40, inner_rel=0x2bf8d28, jointype=JOIN_INNER,
outer_parts=0x7ffeebd43ed8, inner_parts=0x7ffeebd43ed0) at partbounds.c:3053
#4 0x00000000007c610f in try_partitionwise_join (root=0x2be2a28, rel1=0x2b6ce40, rel2=0x2bf8d28, joinrel=0x2c1b0f0, parent_sjinfo=0x7ffeebd44010,
parent_restrictlist=0x2c1c070) at joinrels.c:1370
#5 0x00000000007c5521 in populate_joinrel_with_paths (root=0x2be2a28, rel1=0x2b6ce40, rel2=0x2bf8d28, joinrel=0x2c1b0f0, sjinfo=0x7ffeebd44010, restrictlist=0x2c1c070)
at joinrels.c:914
#6 0x00000000007c4f48 in make_join_rel (root=0x2be2a28, rel1=0x2b6ce40, rel2=0x2bf8d28) at joinrels.c:748
#7 0x00000000007c4514 in make_rels_by_clause_joins (root=0x2be2a28, old_rel=0x2b6ce40, other_rels=0x2bae4d8) at joinrels.c:294
#8 0x00000000007c41c8 in join_search_one_level (root=0x2be2a28, level=3) at joinrels.c:116
#9 0x00000000007abe59 in standard_join_search (root=0x2be2a28, levels_needed=3, initial_rels=0x2bae500) at allpaths.c:2716
#10 0x00000000007abdca in make_rel_from_joinlist (root=0x2be2a28, joinlist=0x2bfbae8) at allpaths.c:2647
#11 0x00000000007a86b0 in make_one_rel (root=0x2be2a28, joinlist=0x2bfbae8) at allpaths.c:227
#12 0x00000000007dada1 in query_planner (root=0x2be2a28, tlist=0x2ba01c8, qp_callback=0x7e0b25 <standard_qp_callback>, qp_extra=0x7ffeebd44390) at planmain.c:265
#13 0x00000000007ddf83 in grouping_planner (root=0x2be2a28, inheritance_update=false, tuple_fraction=0) at planner.c:1929
#14 0x00000000007dc5f5 in subquery_planner (glob=0x2be2990, parse=0x2c0e8c8, parent_root=0x0, hasRecursion=false, tuple_fraction=0) at planner.c:997
#15 0x00000000007db1b6 in standard_planner (parse=0x2c0e8c8, cursorOptions=256, boundParams=0x0) at planner.c:416
#16 0x00000000007daef7 in planner (parse=0x2c0e8c8, cursorOptions=256, boundParams=0x0) at planner.c:276
#17 0x00000000008e15c5 in pg_plan_query (querytree=0x2c0e8c8, cursorOptions=256, boundParams=0x0) at postgres.c:878
#18 0x00000000006562cc in ExplainOneQuery (query=0x2c0e8c8, cursorOptions=256, into=0x0, es=0x2c0e0a0,
queryString=0x2aa24d8 "EXPLAIN (COSTS OFF)\nSELECT avg(t1.a), avg(t2.b), avg(t3.a + t3.b), t1.c, t2.c, t3.c FROM\nplt1 t1, plt2 t2, plt1_e t3 WHERE t1.c = t2.c AND ltrim(t3.c, 'A') = t1.c GROUP BY t1.c, t2.c, t3.c ORDER BY t1"..., params=0x0, queryEnv=0x0) at explain.c:364
Program terminated with signal 11, Segmentation fault.
#0 0x0000000000821aee in map_and_merge_partitions (partmaps1=0x2c1c8a8, partmaps2=0x2c1c8e0, index1=45540240, index2=0, next_index=0x7ffeebd43d3c) at partbounds.c:4162
4162 if (partmap1->from < 0 && partmap2->from < 0)
Missing separate debuginfos, use: debuginfo-install keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-65.el6.x86_64 libcom_err-1.41.12-24.el6.x86_64 libselinux-2.0.94-7.el6.x86_64 openssl-1.0.1e-57.el6.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0 0x0000000000821aee in map_and_merge_partitions (partmaps1=0x2c1c8a8, partmaps2=0x2c1c8e0, index1=45540240, index2=0, next_index=0x7ffeebd43d3c) at partbounds.c:4162
#1 0x00000000008226c3 in merge_null_partitions (outer_bi=0x2b6e338, inner_bi=0x2bf90b0, outer_maps=0x2c1c8a8, inner_maps=0x2c1c8e0, jointype=JOIN_INNER,
next_index=0x7ffeebd43d3c, null_index=0x7ffeebd43d38, default_index=0x7ffeebd43d34) at partbounds.c:4610
#2 0x0000000000821726 in partition_list_bounds_merge (partsupfunc=0x2ba3548, partcollation=0x2ba34e8, outer_rel=0x2b6ce40, inner_rel=0x2bf8d28, outer_parts=0x7ffeebd43ed8,
inner_parts=0x7ffeebd43ed0, jointype=JOIN_INNER) at partbounds.c:4031
#3 0x000000000081ff5d in partition_bounds_merge (partnatts=1, partsupfunc=0x2ba3548, partcollation=0x2ba34e8, outer_rel=0x2b6ce40, inner_rel=0x2bf8d28, jointype=JOIN_INNER,
outer_parts=0x7ffeebd43ed8, inner_parts=0x7ffeebd43ed0) at partbounds.c:3053
#4 0x00000000007c610f in try_partitionwise_join (root=0x2be2a28, rel1=0x2b6ce40, rel2=0x2bf8d28, joinrel=0x2c1b0f0, parent_sjinfo=0x7ffeebd44010,
parent_restrictlist=0x2c1c070) at joinrels.c:1370
#5 0x00000000007c5521 in populate_joinrel_with_paths (root=0x2be2a28, rel1=0x2b6ce40, rel2=0x2bf8d28, joinrel=0x2c1b0f0, sjinfo=0x7ffeebd44010, restrictlist=0x2c1c070)
at joinrels.c:914
#6 0x00000000007c4f48 in make_join_rel (root=0x2be2a28, rel1=0x2b6ce40, rel2=0x2bf8d28) at joinrels.c:748
#7 0x00000000007c4514 in make_rels_by_clause_joins (root=0x2be2a28, old_rel=0x2b6ce40, other_rels=0x2bae4d8) at joinrels.c:294
#8 0x00000000007c41c8 in join_search_one_level (root=0x2be2a28, level=3) at joinrels.c:116
#9 0x00000000007abe59 in standard_join_search (root=0x2be2a28, levels_needed=3, initial_rels=0x2bae500) at allpaths.c:2716
#10 0x00000000007abdca in make_rel_from_joinlist (root=0x2be2a28, joinlist=0x2bfbae8) at allpaths.c:2647
#11 0x00000000007a86b0 in make_one_rel (root=0x2be2a28, joinlist=0x2bfbae8) at allpaths.c:227
#12 0x00000000007dada1 in query_planner (root=0x2be2a28, tlist=0x2ba01c8, qp_callback=0x7e0b25 <standard_qp_callback>, qp_extra=0x7ffeebd44390) at planmain.c:265
#13 0x00000000007ddf83 in grouping_planner (root=0x2be2a28, inheritance_update=false, tuple_fraction=0) at planner.c:1929
#14 0x00000000007dc5f5 in subquery_planner (glob=0x2be2990, parse=0x2c0e8c8, parent_root=0x0, hasRecursion=false, tuple_fraction=0) at planner.c:997
#15 0x00000000007db1b6 in standard_planner (parse=0x2c0e8c8, cursorOptions=256, boundParams=0x0) at planner.c:416
#16 0x00000000007daef7 in planner (parse=0x2c0e8c8, cursorOptions=256, boundParams=0x0) at planner.c:276
#17 0x00000000008e15c5 in pg_plan_query (querytree=0x2c0e8c8, cursorOptions=256, boundParams=0x0) at postgres.c:878
#18 0x00000000006562cc in ExplainOneQuery (query=0x2c0e8c8, cursorOptions=256, into=0x0, es=0x2c0e0a0,
queryString=0x2aa24d8 "EXPLAIN (COSTS OFF)\nSELECT avg(t1.a), avg(t2.b), avg(t3.a + t3.b), t1.c, t2.c, t3.c FROM\nplt1 t1, plt2 t2, plt1_e t3 WHERE t1.c = t2.c AND ltrim(t3.c, 'A') = t1.c GROUP BY t1.c, t2.c, t3.c ORDER BY t1"..., params=0x0, queryEnv=0x0) at explain.c:364
Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB.
Regards,Amul SulOn Mon, Feb 4, 2019 at 11:05 AM amul sul <sulamul@gmail.com> wrote:There are few whitespaces in 0002 patch that I have fixed in the attached version.Rest of the patches are untouched.Ill continue my review and testing.Regards,AmulOn Thu, Jan 31, 2019 at 5:26 PM Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp> wrote:(2019/01/22 21:38), Etsuro Fujita wrote:
> Will continue to review.
I rebased the patch set against the latest HEAD. Attached is a new
version. I'll move this to the next CF, and continue to review it.
Best regards,
Etsuro Fujita
Thanks Rajkumar,
I am looking into this.
Regards,
Amul
On Thu, Mar 7, 2019 at 11:54 AM Rajkumar Raghuwanshi <rajkumar.raghuwanshi@enterprisedb.com> wrote:
On Tue, Mar 5, 2019 at 3:45 PM amul sul <sulamul@gmail.com> wrote:Attached is the rebased atop of the latest master head(35bc0ec7c8).thanks Amul, patches applied cleanly on PG head.While testing this I got a server crash with below test case.CREATE TABLE plt1 (a int, b int, c varchar) PARTITION BY LIST(c);
CREATE TABLE plt1_p1 PARTITION OF plt1 FOR VALUES IN ('0001','0002','0003');
CREATE TABLE plt1_p2 PARTITION OF plt1 FOR VALUES IN ('0004','0005','0006');
CREATE TABLE plt1_p3 PARTITION OF plt1 FOR VALUES IN (NULL,'0008','0009');
CREATE TABLE plt1_p4 PARTITION OF plt1 FOR VALUES IN ('0000','0010');
INSERT INTO plt1 SELECT i, i % 47, to_char(i % 17, 'FM0000') FROM generate_series(0, 500) i WHERE i % 17 NOT IN (7, 11, 12, 13, 14, 15,16);
INSERT INTO plt1 SELECT i, i % 47, case when i % 17 = 7 then NULL else to_char(i % 17, 'FM0000') end FROM generate_series(0, 500) i WHERE i % 17 IN (7,8,9);
ANALYSE plt1;
CREATE TABLE plt2 (a int, b int, c varchar) PARTITION BY LIST(c);
CREATE TABLE plt2_p1 PARTITION OF plt2 FOR VALUES IN ('0002','0003');
CREATE TABLE plt2_p2 PARTITION OF plt2 FOR VALUES IN ('0004','0005','0006');
CREATE TABLE plt2_p3 PARTITION OF plt2 FOR VALUES IN ('0007','0008','0009');
CREATE TABLE plt2_p4 PARTITION OF plt2 FOR VALUES IN ('0000',NULL,'0012');
INSERT INTO plt2 SELECT i, i % 47, to_char(i % 17, 'FM0000') FROM generate_series(0, 500) i WHERE i % 17 NOT IN (1, 10, 11, 13, 14, 15, 16);
INSERT INTO plt2 SELECT i, i % 47, case when i % 17 = 11 then NULL else to_char(i % 17, 'FM0000') end FROM generate_series(0, 500) i WHERE i % 17 IN (0,11,12);
ANALYZE plt2;
CREATE TABLE plt1_e (a int, b int, c text) PARTITION BY LIST(ltrim(c, 'A'));
CREATE TABLE plt1_e_p1 PARTITION OF plt1_e FOR VALUES IN ('0002', '0003');
CREATE TABLE plt1_e_p2 PARTITION OF plt1_e FOR VALUES IN ('0004', '0005', '0006');
CREATE TABLE plt1_e_p3 PARTITION OF plt1_e FOR VALUES IN ('0008', '0009');
CREATE TABLE plt1_e_p4 PARTITION OF plt1_e FOR VALUES IN ('0000');
INSERT INTO plt1_e SELECT i, i % 47, to_char(i % 17, 'FM0000') FROM generate_series(0, 500) i WHERE i % 17 NOT IN (1, 7, 10, 11, 12, 13, 14, 15, 16);
ANALYZE plt1_e;
EXPLAIN (COSTS OFF)
SELECT avg(t1.a), avg(t2.b), avg(t3.a + t3.b), t1.c, t2.c, t3.c FROM
plt1 t1, plt2 t2, plt1_e t3 WHERE t1.c = t2.c AND ltrim(t3.c, 'A') = t1.c GROUP BY t1.c, t2.c, t3.c ORDER BY t1.c, t2.c, t3.c;server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.below is stack trace, looks like some indexes got corrupted, please take a look.Core was generated by `postgres: edb postgres [local] EXPLAIN '.
Program terminated with signal 11, Segmentation fault.
#0 0x0000000000821aee in map_and_merge_partitions (partmaps1=0x2c1c8a8, partmaps2=0x2c1c8e0, index1=45540240, index2=0, next_index=0x7ffeebd43d3c) at partbounds.c:4162
4162 if (partmap1->from < 0 && partmap2->from < 0)
Missing separate debuginfos, use: debuginfo-install keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-65.el6.x86_64 libcom_err-1.41.12-24.el6.x86_64 libselinux-2.0.94-7.el6.x86_64 openssl-1.0.1e-57.el6.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0 0x0000000000821aee in map_and_merge_partitions (partmaps1=0x2c1c8a8, partmaps2=0x2c1c8e0, index1=45540240, index2=0, next_index=0x7ffeebd43d3c) at partbounds.c:4162
#1 0x00000000008226c3 in merge_null_partitions (outer_bi=0x2b6e338, inner_bi=0x2bf90b0, outer_maps=0x2c1c8a8, inner_maps=0x2c1c8e0, jointype=JOIN_INNER,
next_index=0x7ffeebd43d3c, null_index=0x7ffeebd43d38, default_index=0x7ffeebd43d34) at partbounds.c:4610
#2 0x0000000000821726 in partition_list_bounds_merge (partsupfunc=0x2ba3548, partcollation=0x2ba34e8, outer_rel=0x2b6ce40, inner_rel=0x2bf8d28, outer_parts=0x7ffeebd43ed8,
inner_parts=0x7ffeebd43ed0, jointype=JOIN_INNER) at partbounds.c:4031
#3 0x000000000081ff5d in partition_bounds_merge (partnatts=1, partsupfunc=0x2ba3548, partcollation=0x2ba34e8, outer_rel=0x2b6ce40, inner_rel=0x2bf8d28, jointype=JOIN_INNER,
outer_parts=0x7ffeebd43ed8, inner_parts=0x7ffeebd43ed0) at partbounds.c:3053
#4 0x00000000007c610f in try_partitionwise_join (root=0x2be2a28, rel1=0x2b6ce40, rel2=0x2bf8d28, joinrel=0x2c1b0f0, parent_sjinfo=0x7ffeebd44010,
parent_restrictlist=0x2c1c070) at joinrels.c:1370
#5 0x00000000007c5521 in populate_joinrel_with_paths (root=0x2be2a28, rel1=0x2b6ce40, rel2=0x2bf8d28, joinrel=0x2c1b0f0, sjinfo=0x7ffeebd44010, restrictlist=0x2c1c070)
at joinrels.c:914
#6 0x00000000007c4f48 in make_join_rel (root=0x2be2a28, rel1=0x2b6ce40, rel2=0x2bf8d28) at joinrels.c:748
#7 0x00000000007c4514 in make_rels_by_clause_joins (root=0x2be2a28, old_rel=0x2b6ce40, other_rels=0x2bae4d8) at joinrels.c:294
#8 0x00000000007c41c8 in join_search_one_level (root=0x2be2a28, level=3) at joinrels.c:116
#9 0x00000000007abe59 in standard_join_search (root=0x2be2a28, levels_needed=3, initial_rels=0x2bae500) at allpaths.c:2716
#10 0x00000000007abdca in make_rel_from_joinlist (root=0x2be2a28, joinlist=0x2bfbae8) at allpaths.c:2647
#11 0x00000000007a86b0 in make_one_rel (root=0x2be2a28, joinlist=0x2bfbae8) at allpaths.c:227
#12 0x00000000007dada1 in query_planner (root=0x2be2a28, tlist=0x2ba01c8, qp_callback=0x7e0b25 <standard_qp_callback>, qp_extra=0x7ffeebd44390) at planmain.c:265
#13 0x00000000007ddf83 in grouping_planner (root=0x2be2a28, inheritance_update=false, tuple_fraction=0) at planner.c:1929
#14 0x00000000007dc5f5 in subquery_planner (glob=0x2be2990, parse=0x2c0e8c8, parent_root=0x0, hasRecursion=false, tuple_fraction=0) at planner.c:997
#15 0x00000000007db1b6 in standard_planner (parse=0x2c0e8c8, cursorOptions=256, boundParams=0x0) at planner.c:416
#16 0x00000000007daef7 in planner (parse=0x2c0e8c8, cursorOptions=256, boundParams=0x0) at planner.c:276
#17 0x00000000008e15c5 in pg_plan_query (querytree=0x2c0e8c8, cursorOptions=256, boundParams=0x0) at postgres.c:878
#18 0x00000000006562cc in ExplainOneQuery (query=0x2c0e8c8, cursorOptions=256, into=0x0, es=0x2c0e0a0,
queryString=0x2aa24d8 "EXPLAIN (COSTS OFF)\nSELECT avg(t1.a), avg(t2.b), avg(t3.a + t3.b), t1.c, t2.c, t3.c FROM\nplt1 t1, plt2 t2, plt1_e t3 WHERE t1.c = t2.c AND ltrim(t3.c, 'A') = t1.c GROUP BY t1.c, t2.c, t3.c ORDER BY t1"..., params=0x0, queryEnv=0x0) at explain.c:364Thanks & Regards,Rajkumar RaghuwanshiQMG, EnterpriseDB.Regards,Amul SulOn Mon, Feb 4, 2019 at 11:05 AM amul sul <sulamul@gmail.com> wrote:There are few whitespaces in 0002 patch that I have fixed in the attached version.Rest of the patches are untouched.Ill continue my review and testing.Regards,AmulOn Thu, Jan 31, 2019 at 5:26 PM Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp> wrote:(2019/01/22 21:38), Etsuro Fujita wrote:
> Will continue to review.
I rebased the patch set against the latest HEAD. Attached is a new
version. I'll move this to the next CF, and continue to review it.
Best regards,
Etsuro Fujita
On Thu, Mar 7, 2019 at 1:02 PM amul sul <sulamul@gmail.com> wrote:
Thanks Rajkumar,I am looking into this.
The crash happens when none of the if-else branch of handle_missing_partition()
evaluates and returns merged_index unassigned.
Let me explain, in Rajkumar 's test case, the join type is JOIN_INNER. When
only outer rel has null partition, merge_null_partitions() function calls
handle_missing_partition() with missing_side_inner = false and
missing_side_outer = false argument value which fails to set merged_index.
In the attached patch, I tried to fix this case by setting merged_index
explicitly which fixes the reported crash.
Regards,
Amul
On Thu, Mar 7, 2019 at 11:54 AM Rajkumar Raghuwanshi <rajkumar.raghuwanshi@enterprisedb.com> wrote:On Tue, Mar 5, 2019 at 3:45 PM amul sul <sulamul@gmail.com> wrote:Attached is the rebased atop of the latest master head(35bc0ec7c8).thanks Amul, patches applied cleanly on PG head.While testing this I got a server crash with below test case.CREATE TABLE plt1 (a int, b int, c varchar) PARTITION BY LIST(c);
CREATE TABLE plt1_p1 PARTITION OF plt1 FOR VALUES IN ('0001','0002','0003');
CREATE TABLE plt1_p2 PARTITION OF plt1 FOR VALUES IN ('0004','0005','0006');
CREATE TABLE plt1_p3 PARTITION OF plt1 FOR VALUES IN (NULL,'0008','0009');
CREATE TABLE plt1_p4 PARTITION OF plt1 FOR VALUES IN ('0000','0010');
INSERT INTO plt1 SELECT i, i % 47, to_char(i % 17, 'FM0000') FROM generate_series(0, 500) i WHERE i % 17 NOT IN (7, 11, 12, 13, 14, 15,16);
INSERT INTO plt1 SELECT i, i % 47, case when i % 17 = 7 then NULL else to_char(i % 17, 'FM0000') end FROM generate_series(0, 500) i WHERE i % 17 IN (7,8,9);
ANALYSE plt1;
CREATE TABLE plt2 (a int, b int, c varchar) PARTITION BY LIST(c);
CREATE TABLE plt2_p1 PARTITION OF plt2 FOR VALUES IN ('0002','0003');
CREATE TABLE plt2_p2 PARTITION OF plt2 FOR VALUES IN ('0004','0005','0006');
CREATE TABLE plt2_p3 PARTITION OF plt2 FOR VALUES IN ('0007','0008','0009');
CREATE TABLE plt2_p4 PARTITION OF plt2 FOR VALUES IN ('0000',NULL,'0012');
INSERT INTO plt2 SELECT i, i % 47, to_char(i % 17, 'FM0000') FROM generate_series(0, 500) i WHERE i % 17 NOT IN (1, 10, 11, 13, 14, 15, 16);
INSERT INTO plt2 SELECT i, i % 47, case when i % 17 = 11 then NULL else to_char(i % 17, 'FM0000') end FROM generate_series(0, 500) i WHERE i % 17 IN (0,11,12);
ANALYZE plt2;
CREATE TABLE plt1_e (a int, b int, c text) PARTITION BY LIST(ltrim(c, 'A'));
CREATE TABLE plt1_e_p1 PARTITION OF plt1_e FOR VALUES IN ('0002', '0003');
CREATE TABLE plt1_e_p2 PARTITION OF plt1_e FOR VALUES IN ('0004', '0005', '0006');
CREATE TABLE plt1_e_p3 PARTITION OF plt1_e FOR VALUES IN ('0008', '0009');
CREATE TABLE plt1_e_p4 PARTITION OF plt1_e FOR VALUES IN ('0000');
INSERT INTO plt1_e SELECT i, i % 47, to_char(i % 17, 'FM0000') FROM generate_series(0, 500) i WHERE i % 17 NOT IN (1, 7, 10, 11, 12, 13, 14, 15, 16);
ANALYZE plt1_e;
EXPLAIN (COSTS OFF)
SELECT avg(t1.a), avg(t2.b), avg(t3.a + t3.b), t1.c, t2.c, t3.c FROM
plt1 t1, plt2 t2, plt1_e t3 WHERE t1.c = t2.c AND ltrim(t3.c, 'A') = t1.c GROUP BY t1.c, t2.c, t3.c ORDER BY t1.c, t2.c, t3.c;server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.below is stack trace, looks like some indexes got corrupted, please take a look.Core was generated by `postgres: edb postgres [local] EXPLAIN '.
Program terminated with signal 11, Segmentation fault.
#0 0x0000000000821aee in map_and_merge_partitions (partmaps1=0x2c1c8a8, partmaps2=0x2c1c8e0, index1=45540240, index2=0, next_index=0x7ffeebd43d3c) at partbounds.c:4162
4162 if (partmap1->from < 0 && partmap2->from < 0)
Missing separate debuginfos, use: debuginfo-install keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-65.el6.x86_64 libcom_err-1.41.12-24.el6.x86_64 libselinux-2.0.94-7.el6.x86_64 openssl-1.0.1e-57.el6.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0 0x0000000000821aee in map_and_merge_partitions (partmaps1=0x2c1c8a8, partmaps2=0x2c1c8e0, index1=45540240, index2=0, next_index=0x7ffeebd43d3c) at partbounds.c:4162
#1 0x00000000008226c3 in merge_null_partitions (outer_bi=0x2b6e338, inner_bi=0x2bf90b0, outer_maps=0x2c1c8a8, inner_maps=0x2c1c8e0, jointype=JOIN_INNER,
next_index=0x7ffeebd43d3c, null_index=0x7ffeebd43d38, default_index=0x7ffeebd43d34) at partbounds.c:4610
#2 0x0000000000821726 in partition_list_bounds_merge (partsupfunc=0x2ba3548, partcollation=0x2ba34e8, outer_rel=0x2b6ce40, inner_rel=0x2bf8d28, outer_parts=0x7ffeebd43ed8,
inner_parts=0x7ffeebd43ed0, jointype=JOIN_INNER) at partbounds.c:4031
#3 0x000000000081ff5d in partition_bounds_merge (partnatts=1, partsupfunc=0x2ba3548, partcollation=0x2ba34e8, outer_rel=0x2b6ce40, inner_rel=0x2bf8d28, jointype=JOIN_INNER,
outer_parts=0x7ffeebd43ed8, inner_parts=0x7ffeebd43ed0) at partbounds.c:3053
#4 0x00000000007c610f in try_partitionwise_join (root=0x2be2a28, rel1=0x2b6ce40, rel2=0x2bf8d28, joinrel=0x2c1b0f0, parent_sjinfo=0x7ffeebd44010,
parent_restrictlist=0x2c1c070) at joinrels.c:1370
#5 0x00000000007c5521 in populate_joinrel_with_paths (root=0x2be2a28, rel1=0x2b6ce40, rel2=0x2bf8d28, joinrel=0x2c1b0f0, sjinfo=0x7ffeebd44010, restrictlist=0x2c1c070)
at joinrels.c:914
#6 0x00000000007c4f48 in make_join_rel (root=0x2be2a28, rel1=0x2b6ce40, rel2=0x2bf8d28) at joinrels.c:748
#7 0x00000000007c4514 in make_rels_by_clause_joins (root=0x2be2a28, old_rel=0x2b6ce40, other_rels=0x2bae4d8) at joinrels.c:294
#8 0x00000000007c41c8 in join_search_one_level (root=0x2be2a28, level=3) at joinrels.c:116
#9 0x00000000007abe59 in standard_join_search (root=0x2be2a28, levels_needed=3, initial_rels=0x2bae500) at allpaths.c:2716
#10 0x00000000007abdca in make_rel_from_joinlist (root=0x2be2a28, joinlist=0x2bfbae8) at allpaths.c:2647
#11 0x00000000007a86b0 in make_one_rel (root=0x2be2a28, joinlist=0x2bfbae8) at allpaths.c:227
#12 0x00000000007dada1 in query_planner (root=0x2be2a28, tlist=0x2ba01c8, qp_callback=0x7e0b25 <standard_qp_callback>, qp_extra=0x7ffeebd44390) at planmain.c:265
#13 0x00000000007ddf83 in grouping_planner (root=0x2be2a28, inheritance_update=false, tuple_fraction=0) at planner.c:1929
#14 0x00000000007dc5f5 in subquery_planner (glob=0x2be2990, parse=0x2c0e8c8, parent_root=0x0, hasRecursion=false, tuple_fraction=0) at planner.c:997
#15 0x00000000007db1b6 in standard_planner (parse=0x2c0e8c8, cursorOptions=256, boundParams=0x0) at planner.c:416
#16 0x00000000007daef7 in planner (parse=0x2c0e8c8, cursorOptions=256, boundParams=0x0) at planner.c:276
#17 0x00000000008e15c5 in pg_plan_query (querytree=0x2c0e8c8, cursorOptions=256, boundParams=0x0) at postgres.c:878
#18 0x00000000006562cc in ExplainOneQuery (query=0x2c0e8c8, cursorOptions=256, into=0x0, es=0x2c0e0a0,
queryString=0x2aa24d8 "EXPLAIN (COSTS OFF)\nSELECT avg(t1.a), avg(t2.b), avg(t3.a + t3.b), t1.c, t2.c, t3.c FROM\nplt1 t1, plt2 t2, plt1_e t3 WHERE t1.c = t2.c AND ltrim(t3.c, 'A') = t1.c GROUP BY t1.c, t2.c, t3.c ORDER BY t1"..., params=0x0, queryEnv=0x0) at explain.c:364Thanks & Regards,Rajkumar RaghuwanshiQMG, EnterpriseDB.Regards,Amul SulOn Mon, Feb 4, 2019 at 11:05 AM amul sul <sulamul@gmail.com> wrote:There are few whitespaces in 0002 patch that I have fixed in the attached version.Rest of the patches are untouched.Ill continue my review and testing.Regards,AmulOn Thu, Jan 31, 2019 at 5:26 PM Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp> wrote:(2019/01/22 21:38), Etsuro Fujita wrote:
> Will continue to review.
I rebased the patch set against the latest HEAD. Attached is a new
version. I'll move this to the next CF, and continue to review it.
Best regards,
Etsuro Fujita
Attachment
Re: [HACKERS] advanced partition matching algorithm forpartition-wise join
From
Rajkumar Raghuwanshi
Date:
On Thu, Mar 7, 2019 at 8:20 PM amul sul <sulamul@gmail.com> wrote:
On Thu, Mar 7, 2019 at 1:02 PM amul sul <sulamul@gmail.com> wrote:Thanks Rajkumar,I am looking into this.The crash happens when none of the if-else branch of handle_missing_partition()evaluates and returns merged_index unassigned.Let me explain, in Rajkumar 's test case, the join type is JOIN_INNER. Whenonly outer rel has null partition, merge_null_partitions() function callshandle_missing_partition() with missing_side_inner = false andmissing_side_outer = false argument value which fails to set merged_index.In the attached patch, I tried to fix this case by setting merged_indexexplicitly which fixes the reported crash.
Thanks Amul, with v20 patches, crash is fixed.
Regards,AmulOn Thu, Mar 7, 2019 at 11:54 AM Rajkumar Raghuwanshi <rajkumar.raghuwanshi@enterprisedb.com> wrote:On Tue, Mar 5, 2019 at 3:45 PM amul sul <sulamul@gmail.com> wrote:Attached is the rebased atop of the latest master head(35bc0ec7c8).thanks Amul, patches applied cleanly on PG head.While testing this I got a server crash with below test case.CREATE TABLE plt1 (a int, b int, c varchar) PARTITION BY LIST(c);
CREATE TABLE plt1_p1 PARTITION OF plt1 FOR VALUES IN ('0001','0002','0003');
CREATE TABLE plt1_p2 PARTITION OF plt1 FOR VALUES IN ('0004','0005','0006');
CREATE TABLE plt1_p3 PARTITION OF plt1 FOR VALUES IN (NULL,'0008','0009');
CREATE TABLE plt1_p4 PARTITION OF plt1 FOR VALUES IN ('0000','0010');
INSERT INTO plt1 SELECT i, i % 47, to_char(i % 17, 'FM0000') FROM generate_series(0, 500) i WHERE i % 17 NOT IN (7, 11, 12, 13, 14, 15,16);
INSERT INTO plt1 SELECT i, i % 47, case when i % 17 = 7 then NULL else to_char(i % 17, 'FM0000') end FROM generate_series(0, 500) i WHERE i % 17 IN (7,8,9);
ANALYSE plt1;
CREATE TABLE plt2 (a int, b int, c varchar) PARTITION BY LIST(c);
CREATE TABLE plt2_p1 PARTITION OF plt2 FOR VALUES IN ('0002','0003');
CREATE TABLE plt2_p2 PARTITION OF plt2 FOR VALUES IN ('0004','0005','0006');
CREATE TABLE plt2_p3 PARTITION OF plt2 FOR VALUES IN ('0007','0008','0009');
CREATE TABLE plt2_p4 PARTITION OF plt2 FOR VALUES IN ('0000',NULL,'0012');
INSERT INTO plt2 SELECT i, i % 47, to_char(i % 17, 'FM0000') FROM generate_series(0, 500) i WHERE i % 17 NOT IN (1, 10, 11, 13, 14, 15, 16);
INSERT INTO plt2 SELECT i, i % 47, case when i % 17 = 11 then NULL else to_char(i % 17, 'FM0000') end FROM generate_series(0, 500) i WHERE i % 17 IN (0,11,12);
ANALYZE plt2;
CREATE TABLE plt1_e (a int, b int, c text) PARTITION BY LIST(ltrim(c, 'A'));
CREATE TABLE plt1_e_p1 PARTITION OF plt1_e FOR VALUES IN ('0002', '0003');
CREATE TABLE plt1_e_p2 PARTITION OF plt1_e FOR VALUES IN ('0004', '0005', '0006');
CREATE TABLE plt1_e_p3 PARTITION OF plt1_e FOR VALUES IN ('0008', '0009');
CREATE TABLE plt1_e_p4 PARTITION OF plt1_e FOR VALUES IN ('0000');
INSERT INTO plt1_e SELECT i, i % 47, to_char(i % 17, 'FM0000') FROM generate_series(0, 500) i WHERE i % 17 NOT IN (1, 7, 10, 11, 12, 13, 14, 15, 16);
ANALYZE plt1_e;
EXPLAIN (COSTS OFF)
SELECT avg(t1.a), avg(t2.b), avg(t3.a + t3.b), t1.c, t2.c, t3.c FROM
plt1 t1, plt2 t2, plt1_e t3 WHERE t1.c = t2.c AND ltrim(t3.c, 'A') = t1.c GROUP BY t1.c, t2.c, t3.c ORDER BY t1.c, t2.c, t3.c;server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.below is stack trace, looks like some indexes got corrupted, please take a look.Core was generated by `postgres: edb postgres [local] EXPLAIN '.
Program terminated with signal 11, Segmentation fault.
#0 0x0000000000821aee in map_and_merge_partitions (partmaps1=0x2c1c8a8, partmaps2=0x2c1c8e0, index1=45540240, index2=0, next_index=0x7ffeebd43d3c) at partbounds.c:4162
4162 if (partmap1->from < 0 && partmap2->from < 0)
Missing separate debuginfos, use: debuginfo-install keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-65.el6.x86_64 libcom_err-1.41.12-24.el6.x86_64 libselinux-2.0.94-7.el6.x86_64 openssl-1.0.1e-57.el6.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0 0x0000000000821aee in map_and_merge_partitions (partmaps1=0x2c1c8a8, partmaps2=0x2c1c8e0, index1=45540240, index2=0, next_index=0x7ffeebd43d3c) at partbounds.c:4162
#1 0x00000000008226c3 in merge_null_partitions (outer_bi=0x2b6e338, inner_bi=0x2bf90b0, outer_maps=0x2c1c8a8, inner_maps=0x2c1c8e0, jointype=JOIN_INNER,
next_index=0x7ffeebd43d3c, null_index=0x7ffeebd43d38, default_index=0x7ffeebd43d34) at partbounds.c:4610
#2 0x0000000000821726 in partition_list_bounds_merge (partsupfunc=0x2ba3548, partcollation=0x2ba34e8, outer_rel=0x2b6ce40, inner_rel=0x2bf8d28, outer_parts=0x7ffeebd43ed8,
inner_parts=0x7ffeebd43ed0, jointype=JOIN_INNER) at partbounds.c:4031
#3 0x000000000081ff5d in partition_bounds_merge (partnatts=1, partsupfunc=0x2ba3548, partcollation=0x2ba34e8, outer_rel=0x2b6ce40, inner_rel=0x2bf8d28, jointype=JOIN_INNER,
outer_parts=0x7ffeebd43ed8, inner_parts=0x7ffeebd43ed0) at partbounds.c:3053
#4 0x00000000007c610f in try_partitionwise_join (root=0x2be2a28, rel1=0x2b6ce40, rel2=0x2bf8d28, joinrel=0x2c1b0f0, parent_sjinfo=0x7ffeebd44010,
parent_restrictlist=0x2c1c070) at joinrels.c:1370
#5 0x00000000007c5521 in populate_joinrel_with_paths (root=0x2be2a28, rel1=0x2b6ce40, rel2=0x2bf8d28, joinrel=0x2c1b0f0, sjinfo=0x7ffeebd44010, restrictlist=0x2c1c070)
at joinrels.c:914
#6 0x00000000007c4f48 in make_join_rel (root=0x2be2a28, rel1=0x2b6ce40, rel2=0x2bf8d28) at joinrels.c:748
#7 0x00000000007c4514 in make_rels_by_clause_joins (root=0x2be2a28, old_rel=0x2b6ce40, other_rels=0x2bae4d8) at joinrels.c:294
#8 0x00000000007c41c8 in join_search_one_level (root=0x2be2a28, level=3) at joinrels.c:116
#9 0x00000000007abe59 in standard_join_search (root=0x2be2a28, levels_needed=3, initial_rels=0x2bae500) at allpaths.c:2716
#10 0x00000000007abdca in make_rel_from_joinlist (root=0x2be2a28, joinlist=0x2bfbae8) at allpaths.c:2647
#11 0x00000000007a86b0 in make_one_rel (root=0x2be2a28, joinlist=0x2bfbae8) at allpaths.c:227
#12 0x00000000007dada1 in query_planner (root=0x2be2a28, tlist=0x2ba01c8, qp_callback=0x7e0b25 <standard_qp_callback>, qp_extra=0x7ffeebd44390) at planmain.c:265
#13 0x00000000007ddf83 in grouping_planner (root=0x2be2a28, inheritance_update=false, tuple_fraction=0) at planner.c:1929
#14 0x00000000007dc5f5 in subquery_planner (glob=0x2be2990, parse=0x2c0e8c8, parent_root=0x0, hasRecursion=false, tuple_fraction=0) at planner.c:997
#15 0x00000000007db1b6 in standard_planner (parse=0x2c0e8c8, cursorOptions=256, boundParams=0x0) at planner.c:416
#16 0x00000000007daef7 in planner (parse=0x2c0e8c8, cursorOptions=256, boundParams=0x0) at planner.c:276
#17 0x00000000008e15c5 in pg_plan_query (querytree=0x2c0e8c8, cursorOptions=256, boundParams=0x0) at postgres.c:878
#18 0x00000000006562cc in ExplainOneQuery (query=0x2c0e8c8, cursorOptions=256, into=0x0, es=0x2c0e0a0,
queryString=0x2aa24d8 "EXPLAIN (COSTS OFF)\nSELECT avg(t1.a), avg(t2.b), avg(t3.a + t3.b), t1.c, t2.c, t3.c FROM\nplt1 t1, plt2 t2, plt1_e t3 WHERE t1.c = t2.c AND ltrim(t3.c, 'A') = t1.c GROUP BY t1.c, t2.c, t3.c ORDER BY t1"..., params=0x0, queryEnv=0x0) at explain.c:364Thanks & Regards,Rajkumar RaghuwanshiQMG, EnterpriseDB.Regards,Amul SulOn Mon, Feb 4, 2019 at 11:05 AM amul sul <sulamul@gmail.com> wrote:There are few whitespaces in 0002 patch that I have fixed in the attached version.Rest of the patches are untouched.Ill continue my review and testing.Regards,AmulOn Thu, Jan 31, 2019 at 5:26 PM Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp> wrote:(2019/01/22 21:38), Etsuro Fujita wrote:
> Will continue to review.
I rebased the patch set against the latest HEAD. Attached is a new
version. I'll move this to the next CF, and continue to review it.
Best regards,
Etsuro Fujita
Re: [HACKERS] advanced partition matching algorithm forpartition-wise join
From
Ashutosh Bapat
Date:
On Thu, Mar 7, 2019 at 8:20 PM amul sul <sulamul@gmail.com> wrote:
On Thu, Mar 7, 2019 at 1:02 PM amul sul <sulamul@gmail.com> wrote:Thanks Rajkumar,I am looking into this.The crash happens when none of the if-else branch of handle_missing_partition()evaluates and returns merged_index unassigned.Let me explain, in Rajkumar 's test case, the join type is JOIN_INNER. Whenonly outer rel has null partition, merge_null_partitions() function callshandle_missing_partition() with missing_side_inner = false andmissing_side_outer = false
Both missing_side_ variables being false when the NULL partition is missing on the inner side looks suspicious. I guess from the variable names that the missing_side_inner should be true in this case.
argument value which fails to set merged_index.In the attached patch, I tried to fix this case by setting merged_indexexplicitly which fixes the reported crash.
I expect handle_missing_partition() to set the merged_index always. In your patches, I don't see that function in your patches is setting it explicitly. If we are setting merged_index explicitly somewhere else, other places may miss that explicit assignment. So it's better to move it inside this function.
--
Best Wishes,
Ashutosh Bapat
On Mon, Mar 11, 2019 at 8:29 AM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
On Thu, Mar 7, 2019 at 8:20 PM amul sul <sulamul@gmail.com> wrote:On Thu, Mar 7, 2019 at 1:02 PM amul sul <sulamul@gmail.com> wrote:Thanks Rajkumar,I am looking into this.The crash happens when none of the if-else branch of handle_missing_partition()evaluates and returns merged_index unassigned.Let me explain, in Rajkumar 's test case, the join type is JOIN_INNER. Whenonly outer rel has null partition, merge_null_partitions() function callshandle_missing_partition() with missing_side_inner = false andmissing_side_outer = falseBoth missing_side_ variables being false when the NULL partition is missing on the inner side looks suspicious. I guess from the variable names that the missing_side_inner should be true in this case.
All the places from where this handle_missing_partition() get called
have the following code to decide the value for missing_side_outer/_inner which
I yet to understand. Do you think this has some flaw?
/*
* For a FULL join, inner relation acts as both OUTER and INNER
* relation. For LEFT and ANTI join the inner relation acts as
* INNER relation. For INNER and SEMI join OUTER and INNER
* differentiation is immaterial.
*/
missing_side_inner = (jointype == JOIN_FULL ||
jointype == JOIN_LEFT ||
jointype == JOIN_ANTI);
missing_side_outer = (jointype == JOIN_FULL);
argument value which fails to set merged_index.In the attached patch, I tried to fix this case by setting merged_indexexplicitly which fixes the reported crash.I expect handle_missing_partition() to set the merged_index always. In your patches, I don't see that function in your patches is setting it explicitly. If we are setting merged_index explicitly somewhere else, other places may miss that explicit assignment. So it's better to move it inside this function.
Ok, that can be fixed.
Similarly, I think merge_null_partitions should set null_index instead of
asserting when null partitions missing from both the side, make sense?
Regards,
Amul
Re: [HACKERS] advanced partition matching algorithm forpartition-wise join
From
Ashutosh Bapat
Date:
On Mon, Mar 11, 2019 at 10:40 AM amul sul <sulamul@gmail.com> wrote:
All the places from where this handle_missing_partition() get calledhave the following code to decide the value for missing_side_outer/_inner whichI yet to understand. Do you think this has some flaw?/** For a FULL join, inner relation acts as both OUTER and INNER* relation. For LEFT and ANTI join the inner relation acts as* INNER relation. For INNER and SEMI join OUTER and INNER* differentiation is immaterial.*/missing_side_inner = (jointype == JOIN_FULL ||jointype == JOIN_LEFT ||jointype == JOIN_ANTI);missing_side_outer = (jointype == JOIN_FULL);
I was wrong, sorry. The comment says it all.
argument value which fails to set merged_index.In the attached patch, I tried to fix this case by setting merged_indexexplicitly which fixes the reported crash.I expect handle_missing_partition() to set the merged_index always. In your patches, I don't see that function in your patches is setting it explicitly. If we are setting merged_index explicitly somewhere else, other places may miss that explicit assignment. So it's better to move it inside this function.Ok, that can be fixed.Similarly, I think merge_null_partitions should set null_index instead ofasserting when null partitions missing from both the side, make sense?
I think not. null_index, once set shouldn't change and hence does not change with each pair of partitions being matched. So, it makes sense to make sure that null_index remains invalid if none of the tables have null partition.
--
Best Wishes,
Ashutosh Bapat
Attached version is rebase atop of the latest master head(fdc7efcc30), also
incorporates the Ashutosh's suggestion, thanks.
Regards,
Amul
On Mon, Mar 11, 2019 at 10:14 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
On Mon, Mar 11, 2019 at 10:40 AM amul sul <sulamul@gmail.com> wrote:All the places from where this handle_missing_partition() get calledhave the following code to decide the value for missing_side_outer/_inner whichI yet to understand. Do you think this has some flaw?/** For a FULL join, inner relation acts as both OUTER and INNER* relation. For LEFT and ANTI join the inner relation acts as* INNER relation. For INNER and SEMI join OUTER and INNER* differentiation is immaterial.*/missing_side_inner = (jointype == JOIN_FULL ||jointype == JOIN_LEFT ||jointype == JOIN_ANTI);missing_side_outer = (jointype == JOIN_FULL);I was wrong, sorry. The comment says it all.argument value which fails to set merged_index.In the attached patch, I tried to fix this case by setting merged_indexexplicitly which fixes the reported crash.I expect handle_missing_partition() to set the merged_index always. In your patches, I don't see that function in your patches is setting it explicitly. If we are setting merged_index explicitly somewhere else, other places may miss that explicit assignment. So it's better to move it inside this function.Ok, that can be fixed.Similarly, I think merge_null_partitions should set null_index instead ofasserting when null partitions missing from both the side, make sense?I think not. null_index, once set shouldn't change and hence does not change with each pair of partitions being matched. So, it makes sense to make sure that null_index remains invalid if none of the tables have null partition.--Best Wishes,Ashutosh Bapat
Attachment
Re: [HACKERS] advanced partition matching algorithm forpartition-wise join
From
Rajkumar Raghuwanshi
Date:
On Wed, Apr 24, 2019 at 4:56 PM amul sul <sulamul@gmail.com> wrote:
Attached version is rebase atop of the latest master head(fdc7efcc30), alsoincorporates the Ashutosh's suggestion, thanks.
Thanks for rebase patch, patches applied cleanly on PG head.
I did some crash testing with extra test case [0006 patch] [1] and found no more issue.
Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB.
[1] https://www.postgresql.org/message-id/CAFjFpReKuV_4LRRfdy80BqX8oZfwbo%2BHWLQNv3CsJ5iGPSyTfA%40mail.gmail.com
Regards,AmulOn Mon, Mar 11, 2019 at 10:14 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:On Mon, Mar 11, 2019 at 10:40 AM amul sul <sulamul@gmail.com> wrote:All the places from where this handle_missing_partition() get calledhave the following code to decide the value for missing_side_outer/_inner whichI yet to understand. Do you think this has some flaw?/** For a FULL join, inner relation acts as both OUTER and INNER* relation. For LEFT and ANTI join the inner relation acts as* INNER relation. For INNER and SEMI join OUTER and INNER* differentiation is immaterial.*/missing_side_inner = (jointype == JOIN_FULL ||jointype == JOIN_LEFT ||jointype == JOIN_ANTI);missing_side_outer = (jointype == JOIN_FULL);I was wrong, sorry. The comment says it all.argument value which fails to set merged_index.In the attached patch, I tried to fix this case by setting merged_indexexplicitly which fixes the reported crash.I expect handle_missing_partition() to set the merged_index always. In your patches, I don't see that function in your patches is setting it explicitly. If we are setting merged_index explicitly somewhere else, other places may miss that explicit assignment. So it's better to move it inside this function.Ok, that can be fixed.Similarly, I think merge_null_partitions should set null_index instead ofasserting when null partitions missing from both the side, make sense?I think not. null_index, once set shouldn't change and hence does not change with each pair of partitions being matched. So, it makes sense to make sure that null_index remains invalid if none of the tables have null partition.--Best Wishes,Ashutosh Bapat
Hi Amul, Ashutosh, On 2019/04/24 20:26, amul sul wrote: > Attached version is rebase atop of the latest master head(fdc7efcc30), also > incorporates the Ashutosh's suggestion, thanks. Reading the commit message of 0002 and after testing to confirm, I understand that the patch doesn't handle OUTER joins where the nullable side is missing some partitions. The reason given is that join planning would have to add base relations corresponding to missing partitions on the nullable side, which we can't do. While working on partition pruning refactoring recently (committed as 428b260f87e), we faced a similar situation in that pruned partitions are like missing partitions, because they're not added to the PlannerInfo anymore, whereas before that commit, they'd be added and marked dummy afterwards. Earlier versions of my patch had code to add dummy base relations for such pruned partitions, because partitionwise join expected pairs of matched partitions to be valid base relations, because that's how things were when partitionwise joins feature was committed. Join path generated in this case would have a constant-false Result path (an empty relation) for the nullable side. Tom strongly objected to that idea saying that such join paths are kind of silly [1], even outside the context of partitionwise join. He suggested that we abandon partitionwise join in such cases, because having to build a dummy base relation for pruned partitions only to generate silly-looking paths would be an ugly kludge. I guess the same argument applies to the case where the nullable side is missing some partitions, so the right way to support partitionwise join case in that case wouldn't be to figure out how joinrels.c could add dummy base relations. He did mention that cases where the nullable side is provably empty can be handled by simply returning the path of the non-nullable side with suitable projection path added on top to emit NULLs for the columns of the nullable-side. If we teach populate_joinrel_with_paths() and underlings about that, then we can allow partitionwise join even in the case where the nullable side has some partitions missing. Thanks, Amit [1] https://www.postgresql.org/message-id/25035.1553905052%40sss.pgh.pa.us
On 2019/05/14 13:23, Amit Langote wrote: > Tom > strongly objected to that idea saying that such join paths are kind of > silly [1], even outside the context of partitionwise join. He suggested > that we abandon partitionwise join in such cases, because having to build > a dummy base relation for pruned partitions only to generate silly-looking > paths would be an ugly kludge. I forgot to mention that he even committed a patch to disable partitionwise joins in such cases, which was also applied to v11 branch. https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=d70c147fa217c4bae32ac1afb86ab42d98b36fdf Note that there were also other reasons for committing, beside what I described in my previous email. Thanks, Amit
Re: [HACKERS] advanced partition matching algorithm forpartition-wise join
From
Ashutosh Bapat
Date:
On Tue, May 14, 2019 at 10:00 AM Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
On 2019/05/14 13:23, Amit Langote wrote:
> Tom
> strongly objected to that idea saying that such join paths are kind of
> silly [1], even outside the context of partitionwise join. He suggested
> that we abandon partitionwise join in such cases, because having to build
> a dummy base relation for pruned partitions only to generate silly-looking
> paths would be an ugly kludge.
I forgot to mention that he even committed a patch to disable
partitionwise joins in such cases, which was also applied to v11 branch.
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=d70c147fa217c4bae32ac1afb86ab42d98b36fdf
Note that there were also other reasons for committing, beside what I
described in my previous email.
I haven't seen the actual commit, but we could use these patches to enable partition-wise join when partitions are pruned. For that the partition descriptor of the pruned partition table should be arranged as if those partitions are missing in the table itself. However, we will still need code to handle the cases when the partitions are missing on the nullable side. Tom mentioned the idea of using just projection to produce join tuples with rows on the outer side appended with null columns from the nullable side. If we can implement that, we can remove the restrictions in this patch.
--
Best Wishes,
Ashutosh Bapat
On Tue, May 14, 2019 at 12:24 AM Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > He did mention that cases where the nullable side is provably empty can be > handled by simply returning the path of the non-nullable side with > suitable projection path added on top to emit NULLs for the columns of the > nullable-side. If we teach populate_joinrel_with_paths() and underlings > about that, then we can allow partitionwise join even in the case where > the nullable side has some partitions missing. Yes, I think that would be a good approach to pursue. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, May 18, 2019 at 12:20 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, May 14, 2019 at 12:24 AM Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: > > He did mention that cases where the nullable side is provably empty can be > > handled by simply returning the path of the non-nullable side with > > suitable projection path added on top to emit NULLs for the columns of the > > nullable-side. If we teach populate_joinrel_with_paths() and underlings > > about that, then we can allow partitionwise join even in the case where > > the nullable side has some partitions missing. > > Yes, I think that would be a good approach to pursue. Hi Ashutosh, Amul, Fujita-san, Could we please have a fresh rebase for the new Commitfest? Thanks, -- Thomas Munro https://enterprisedb.com
On Mon, Jul 1, 2019 at 6:50 PM Thomas Munro <thomas.munro@gmail.com> wrote: > On Sat, May 18, 2019 at 12:20 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, May 14, 2019 at 12:24 AM Amit Langote > > <Langote_Amit_f8@lab.ntt.co.jp> wrote: > > > He did mention that cases where the nullable side is provably empty can be > > > handled by simply returning the path of the non-nullable side with > > > suitable projection path added on top to emit NULLs for the columns of the > > > nullable-side. If we teach populate_joinrel_with_paths() and underlings > > > about that, then we can allow partitionwise join even in the case where > > > the nullable side has some partitions missing. > > > > Yes, I think that would be a good approach to pursue. > > Hi Ashutosh, Amul, Fujita-san, > > Could we please have a fresh rebase for the new Commitfest? Will do unless Ashutosh, Amul, or anyone wants to. Thanks! Best regards, Etsuro Fujita
Attached version is rebase atop of the latest master head(c74d49d41c), thanks.
Regards,
Amul
Regards,
Amul
On Mon, Jul 1, 2019 at 4:46 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
On Mon, Jul 1, 2019 at 6:50 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Sat, May 18, 2019 at 12:20 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > On Tue, May 14, 2019 at 12:24 AM Amit Langote
> > <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> > > He did mention that cases where the nullable side is provably empty can be
> > > handled by simply returning the path of the non-nullable side with
> > > suitable projection path added on top to emit NULLs for the columns of the
> > > nullable-side. If we teach populate_joinrel_with_paths() and underlings
> > > about that, then we can allow partitionwise join even in the case where
> > > the nullable side has some partitions missing.
> >
> > Yes, I think that would be a good approach to pursue.
>
> Hi Ashutosh, Amul, Fujita-san,
>
> Could we please have a fresh rebase for the new Commitfest?
Will do unless Ashutosh, Amul, or anyone wants to.
Thanks!
Best regards,
Etsuro Fujita
Attachment
On Tue, Jul 2, 2019 at 1:47 PM amul sul <sulamul@gmail.com> wrote: > Attached version is rebase atop of the latest master head(c74d49d41c), thanks. Thanks! Will review. Best regards, Etsuro Fujita
On Wed, Jul 3, 2019 at 3:44 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: > On Tue, Jul 2, 2019 at 1:47 PM amul sul <sulamul@gmail.com> wrote: > > Attached version is rebase atop of the latest master head(c74d49d41c), thanks. > > Thanks! Will review. I started reviewing this. Here is my initial review comments: * 0001-Hash-partition-bound-equality-refactoring-v22.patch First of all, I agree with your view on hash partitioning: "3. For hash partitioned tables however, we support partition-wise join only when the bounds exactly match. For hash partitioning it's unusual to have missing partitions and hence generic partition matching is not required." which is cited from the commit message for the main patch "0002-Partition-wise-join-for-1-1-1-0-0-1-partition-matchi-v22.patch". (I think it would be better if we can extend the partition matching to the hash-partitioning case where there are missing partitions in future, though.) However, I don't think it's a good idea to do this refactoring, because that would lead to duplicating the code to check whether two given hash bound collections are equal in partition_bounds_equal() and partition_hash_bounds_merge() that will be added by the main patch, after all. To avoid that, how about calling partition_bounds_equal() from partition_hash_bounds_merge() in the main patch, like the attached? I also did some cleanup for partition_hash_bounds_merge(). This change makes the refactoring patch unnecessary, so I dropped it. Also, I included the regression-test patch "0003-Tests-for-0-1-1-1-and-1-0-partition-matching-v22.patch" in the attached, because make check didn't succeed without the regression-test patch. That's it for now. I'll review the remaining parts (ie, "0002-Partition-wise-join-for-1-1-1-0-0-1-partition-matchi-v22.patch" and "0003-Tests-for-0-1-1-1-and-1-0-partition-matching-v22.patch") closely next. Best regards, Etsuro Fujita
Attachment
On Mon, Jul 8, 2019 at 8:33 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: > I'll review the remaining parts (ie, > "0002-Partition-wise-join-for-1-1-1-0-0-1-partition-matchi-v22.patch" > and "0003-Tests-for-0-1-1-1-and-1-0-partition-matching-v22.patch") > closely next. I've been reviewing the main patch "0002-Partition-wise-join-for-1-1-1-0-0-1-partition-matchi-v22.patch", but it's pretty large and complicated, so I'm still at the first pass, and so maybe I'm missing something, but let me comment a few things on the patch. First of all, I think the patch's problem setting, which is stated in the commit message, is reasonable: Earlier version of partition-wise join implementation allowed partition-wise join between two relations with exactly same partition bounds. This commit allows partition-wise join to be applied under following conditions 1. the partition bounds of joining relations are such that rows from given partition on one side join can join with rows from maximum one partition on the other side i.e. bounds of a given partition on one side match/overlap with those of maximum one partition on the other side. And I agree with the patch's approach to this that tries to find such a partition mapping between two input partitioned relations for a join relation, by trying to match their partition bounds, which is implemented in a new function partition_bounds_merge(), in general. But one thing that I'm concerned about most at this point is this in try_partitionwise_join(): /* - * Since we allow partitionwise join only when the partition bounds of the - * joining relations exactly match, the partition bounds of the join - * should match those of the joining relations. + * Get the list of matching partitions to be joined along with the + * partition bounds of the join relation. Because of the restrictions + * imposed by partition matching algorithm, not every pair of joining + * relations for this join will be able to use partition-wise join. But all + * those pairs which can use partition-wise join will produce the same + * partition bounds for the join relation. */ - Assert(partition_bounds_equal(joinrel->part_scheme->partnatts, - joinrel->part_scheme->parttyplen, - joinrel->part_scheme->parttypbyval, - joinrel->boundinfo, rel1->boundinfo)); - Assert(partition_bounds_equal(joinrel->part_scheme->partnatts, - joinrel->part_scheme->parttyplen, - joinrel->part_scheme->parttypbyval, - joinrel->boundinfo, rel2->boundinfo)); + join_boundinfo = partition_bounds_merge(part_scheme->partnatts, + part_scheme->partsupfunc, + part_scheme->partcollation, + rel1, rel2, + parent_sjinfo->jointype, + &parts1, &parts2); + + if (join_boundinfo == NULL) + return; I.e., partition_bounds_merge() is performed for each pair of input partitioned relations for a join relation in try_partitionwise_join(). Since partition_bounds_merge() would need a lot of CPU cycles, I don't think this is acceptable; ISTM that some redesign is needed to avoid this. I'm wondering that once we successfully merged partition bounds from a pair of input partitioned relations for the join relation, by using the merged partition bounds, we could get the lists of matching to-be-joined partitions for subsequent pairs of input partitioned relations for the join relation in a more efficient way than by performing partition_bounds_merge() as proposed in the patch. Best regards, Etsuro Fujita
On Mon, Jul 8, 2019 at 5:03 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
On Wed, Jul 3, 2019 at 3:44 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> On Tue, Jul 2, 2019 at 1:47 PM amul sul <sulamul@gmail.com> wrote:
> > Attached version is rebase atop of the latest master head(c74d49d41c), thanks.
>
> Thanks! Will review.
I started reviewing this. Here is my initial review comments:
* 0001-Hash-partition-bound-equality-refactoring-v22.patch
First of all, I agree with your view on hash partitioning:
"3. For hash partitioned tables however, we support partition-wise join
only when the bounds exactly match. For hash partitioning it's unusual
to have missing partitions and hence generic partition matching is not
required."
which is cited from the commit message for the main patch
"0002-Partition-wise-join-for-1-1-1-0-0-1-partition-matchi-v22.patch".
(I think it would be better if we can extend the partition matching to
the hash-partitioning case where there are missing partitions in
future, though.) However, I don't think it's a good idea to do this
refactoring, because that would lead to duplicating the code to check
whether two given hash bound collections are equal in
partition_bounds_equal() and partition_hash_bounds_merge() that will
be added by the main patch, after all. To avoid that, how about
calling partition_bounds_equal() from partition_hash_bounds_merge() in
the main patch, like the attached?
Agree, your changes look good to me, thanks for working on it.
Regards,
Amul
On Thu, Jul 18, 2019 at 2:55 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: > I.e., partition_bounds_merge() is performed for each pair of input > partitioned relations for a join relation in try_partitionwise_join(). > Since partition_bounds_merge() would need a lot of CPU cycles, I don't > think this is acceptable; ISTM that some redesign is needed to avoid > this. I'm wondering that once we successfully merged partition bounds > from a pair of input partitioned relations for the join relation, by > using the merged partition bounds, we could get the lists of matching > to-be-joined partitions for subsequent pairs of input partitioned > relations for the join relation in a more efficient way than by > performing partition_bounds_merge() as proposed in the patch. I don't know whether partition_bounds_merge() is well-implemented; I haven't looked. But in general I don't see an alternative to doing some kind of merging on each pair of input relations. That's just how planning works, and I don't see why it should need to be prohibitively expensive. I might be missing something, though; do you have an idea? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Jul 19, 2019 at 8:09 PM amul sul <sulamul@gmail.com> wrote: > On Mon, Jul 8, 2019 at 5:03 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: >> I started reviewing this. Here is my initial review comments: >> >> * 0001-Hash-partition-bound-equality-refactoring-v22.patch >> However, I don't think it's a good idea to do this >> refactoring, because that would lead to duplicating the code to check >> whether two given hash bound collections are equal in >> partition_bounds_equal() and partition_hash_bounds_merge() that will >> be added by the main patch, after all. To avoid that, how about >> calling partition_bounds_equal() from partition_hash_bounds_merge() in >> the main patch, like the attached? > Agree, your changes look good to me, thanks for working on it. Cool! Thanks for reviewing! Sorry for the delay. I was busy with something else recently. Best regards, Etsuro Fujita
On Fri, Jul 19, 2019 at 10:44 PM Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Jul 18, 2019 at 2:55 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: > > I.e., partition_bounds_merge() is performed for each pair of input > > partitioned relations for a join relation in try_partitionwise_join(). > > Since partition_bounds_merge() would need a lot of CPU cycles, I don't > > think this is acceptable; ISTM that some redesign is needed to avoid > > this. I'm wondering that once we successfully merged partition bounds > > from a pair of input partitioned relations for the join relation, by > > using the merged partition bounds, we could get the lists of matching > > to-be-joined partitions for subsequent pairs of input partitioned > > relations for the join relation in a more efficient way than by > > performing partition_bounds_merge() as proposed in the patch. > > I don't know whether partition_bounds_merge() is well-implemented; I > haven't looked. My concern about that is list partitioning. In that case that function calls partition_list_bounds_merge(), which generates the partition bounds for a join relation between two given input relations, by performing merge join for a pair of the datums arrays from both the input relations. Since the datums arrays contain all non-null list values across all partitions, if the numbers of the list values (ie, ndatums') are large, the merge join would require not a few cycles, so it would be much expensive to perform the merge join for each such pair when considering large N-way partitionwise joins of list-partitioned tables with large ndatums. To see that, I did simple tests using a list-partitioned table pt created with the attached, which has 10 partitions, each with 1000 list values, so ndatums is 10000. (The tests below are performed with enable_partitionwise_join=on.) * 2-way self-join of pt: explain analyze select * from pt t0, pt t1 where t0.a = t1.a; - HEAD: Planning Time: 1.731 ms Execution Time: 15.159 ms - Patched: Planning Time: 1.884 ms Execution Time: 15.127 ms * 4-way self-join of pt: explain analyze select * from pt t0, pt t1, pt t2, pt t3 where t0.a = t1.a and t1.a = t2.a and t2.a = t3.a; - HEAD: Planning Time: 28.787 ms Execution Time: 34.313 ms - Patched: Planning Time: 40.263 ms Execution Time: 35.019 ms * 8-way self-join of pt: explain analyze select * from pt t0, pt t1, pt t2, pt t3, pt t4, pt t5, pt t6, pt t7 where t0.a = t1.a and t1.a = t2.a and t2.a = t3.a and t3.a = t4.a and t4.a = t5.a and t5.a = t6.a and t6.a = t7.a; - HEAD: Planning Time: 2279.653 ms Execution Time: 63.303 ms - Patched: Planning Time: 3834.751 ms Execution Time: 62.949 ms Actually, these joins would not need the partition-matching algorithm the patch adds; we could probably avoid this regression by modifying the patch to plan these joins the same way as before, but ISTM that these results imply that the cost of performing the merge join for each such pair would not be negligible when considering large N-way partitionwise joins mentioned above. Maybe I'm missing something, though. > But in general I don't see an alternative to doing > some kind of merging on each pair of input relations. That's just how > planning works, and I don't see why it should need to be prohibitively > expensive. I might be missing something, though; do you have an idea? Yes, I do; but I think I should think a little more about that. Sorry for the delay. Best regards, Etsuro Fujita
Attachment
On Tue, Jul 30, 2019 at 6:00 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: > On Fri, Jul 19, 2019 at 10:44 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Jul 18, 2019 at 2:55 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: > > > I.e., partition_bounds_merge() is performed for each pair of input > > > partitioned relations for a join relation in try_partitionwise_join(). > > > Since partition_bounds_merge() would need a lot of CPU cycles, I don't > > > think this is acceptable; ISTM that some redesign is needed to avoid > > > this. I'm wondering that once we successfully merged partition bounds > > > from a pair of input partitioned relations for the join relation, by > > > using the merged partition bounds, we could get the lists of matching > > > to-be-joined partitions for subsequent pairs of input partitioned > > > relations for the join relation in a more efficient way than by > > > performing partition_bounds_merge() as proposed in the patch. > > > > I don't know whether partition_bounds_merge() is well-implemented; I > > haven't looked. > > My concern about that is list partitioning. In that case that > function calls partition_list_bounds_merge(), which generates the > partition bounds for a join relation between two given input > relations, by performing merge join for a pair of the datums arrays > from both the input relations. I had similar thoughts upon seeing that partition_bounds_merge() will be replacing the current way of determining if partition-wise join can occur; that it will make the handling of currently supported cases more expensive. The current way is to compare the PartitionBoundInfos of joining relations using partition_bounds_equal(), and if equal, simply join the pairs of matching partitions if the join quals permit doing so. There's no need to do anything extra to determine which partitions to join with each other, because it's already established. Likewise, partition_bounds_merge() shouldn't to have to anything extra in that case. That is, for the cases that are already supported, we should find a way to make partition_bounds_merge() only as expensive as just performing partition_bounds_equals(), or maybe just slightly more. Thanks, Amit
Amit-san, On Wed, Jul 31, 2019 at 2:47 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Tue, Jul 30, 2019 at 6:00 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: > > On Fri, Jul 19, 2019 at 10:44 PM Robert Haas <robertmhaas@gmail.com> wrote: > > > I don't know whether partition_bounds_merge() is well-implemented; I > > > haven't looked. > > > > My concern about that is list partitioning. In that case that > > function calls partition_list_bounds_merge(), which generates the > > partition bounds for a join relation between two given input > > relations, by performing merge join for a pair of the datums arrays > > from both the input relations. > > I had similar thoughts upon seeing that partition_bounds_merge() will > be replacing the current way of determining if partition-wise join can > occur; that it will make the handling of currently supported cases > more expensive. > > The current way is to compare the PartitionBoundInfos of joining > relations using partition_bounds_equal(), and if equal, simply join > the pairs of matching partitions if the join quals permit doing so. > There's no need to do anything extra to determine which partitions to > join with each other, because it's already established. Likewise, > partition_bounds_merge() shouldn't to have to anything extra in that > case. That is, for the cases that are already supported, we should > find a way to make partition_bounds_merge() only as expensive as just > performing partition_bounds_equals(), or maybe just slightly more. I 100% agree on that point. One thing that was unexpected to me is this: I wrote: > To see that, I did simple > tests using a list-partitioned table pt created with the attached, > which has 10 partitions, each with 1000 list values, so ndatums is > 10000. (The tests below are performed with > enable_partitionwise_join=on.) > > * 2-way self-join of pt: explain analyze select * from pt t0, pt t1 > where t0.a = t1.a; > - HEAD: > Planning Time: 1.731 ms > Execution Time: 15.159 ms > - Patched: > Planning Time: 1.884 ms > Execution Time: 15.127 ms IIUC, in this test, I think partition_bounds_equals() and partition_bounds_merge() have been performed only once in HEAD and the patched version respectively to plan the partitionwise join, so this might imply that the cost of the latter is just slightly more expensive than that of the former. I'm missing something, though. Anyway I'll continue to review this patch, so I'll move this to the next CF with the same status (Needs review). Best regards, Etsuro Fujita
On Tue, Jul 30, 2019 at 6:00 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: > On Fri, Jul 19, 2019 at 10:44 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Jul 18, 2019 at 2:55 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: > > > I.e., partition_bounds_merge() is performed for each pair of input > > > partitioned relations for a join relation in try_partitionwise_join(). > > > Since partition_bounds_merge() would need a lot of CPU cycles, I don't > > > think this is acceptable; ISTM that some redesign is needed to avoid > > > this. I'm wondering that once we successfully merged partition bounds > > > from a pair of input partitioned relations for the join relation, by > > > using the merged partition bounds, we could get the lists of matching > > > to-be-joined partitions for subsequent pairs of input partitioned > > > relations for the join relation in a more efficient way than by > > > performing partition_bounds_merge() as proposed in the patch. > > > > I don't know whether partition_bounds_merge() is well-implemented; I > > haven't looked. > > My concern about that is list partitioning. In that case that > function calls partition_list_bounds_merge(), which generates the > partition bounds for a join relation between two given input > relations, by performing merge join for a pair of the datums arrays > from both the input relations. Since the datums arrays contain all > non-null list values across all partitions, if the numbers of the list > values (ie, ndatums') are large, the merge join would require not a > few cycles, so it would be much expensive to perform the merge join > for each such pair when considering large N-way partitionwise joins of > list-partitioned tables with large ndatums. To see that, I did simple > tests using a list-partitioned table pt created with the attached, > which has 10 partitions, each with 1000 list values, so ndatums is > 10000. (The tests below are performed with > enable_partitionwise_join=on.) > > * 2-way self-join of pt: explain analyze select * from pt t0, pt t1 > where t0.a = t1.a; > - HEAD: > Planning Time: 1.731 ms > Execution Time: 15.159 ms > - Patched: > Planning Time: 1.884 ms > Execution Time: 15.127 ms > > * 4-way self-join of pt: explain analyze select * from pt t0, pt t1, > pt t2, pt t3 where t0.a = t1.a and t1.a = t2.a and t2.a = t3.a; > - HEAD: > Planning Time: 28.787 ms > Execution Time: 34.313 ms > - Patched: > Planning Time: 40.263 ms > Execution Time: 35.019 ms > > * 8-way self-join of pt: explain analyze select * from pt t0, pt t1, > pt t2, pt t3, pt t4, pt t5, pt t6, pt t7 where t0.a = t1.a and t1.a = > t2.a and t2.a = t3.a and t3.a = t4.a and t4.a = t5.a and t5.a = t6.a > and t6.a = t7.a; > - HEAD: > Planning Time: 2279.653 ms > Execution Time: 63.303 ms > - Patched: > Planning Time: 3834.751 ms > Execution Time: 62.949 ms > > Actually, these joins would not need the partition-matching algorithm > the patch adds; we could probably avoid this regression by modifying > the patch to plan these joins the same way as before, but ISTM that > these results imply that the cost of performing the merge join for > each such pair would not be negligible when considering large N-way > partitionwise joins mentioned above. Maybe I'm missing something, > though. > > > But in general I don't see an alternative to doing > > some kind of merging on each pair of input relations. That's just how > > planning works, and I don't see why it should need to be prohibitively > > expensive. I might be missing something, though; do you have an idea? > > Yes, I do; but I think I should think a little more about that. I gave more thought to this. My idea is to generate the list of matching partitions to be joined from the partbound info after creating it with partition_bounds_merge(), as I stated before. Let me explain using an example. Suppose that R, S and T are partitioned tables, that R=R(a,b,c) is partitioned on ranges of a into three partitions R1, R2 and R3, that S=S(a,b,c) is partitioned on ranges of a into three partitions S1, S2 and S3, and that T=T(a,b,c) is partitioned on ranges of a into three partitions T1, T2 and T3. Consider a 3-way join query: SELECT * FROM R, S, T WHERE R.a=S.a AND S.a=T.a; Suppose that when considering 2-way join R IJ S, partition_bounds_merge() successfully merges the partition bounds for R and S, and generates join pairs (R1, S1), (R2, S2) and (R3, S3), and that when considering 3-way join (R IJ S) IJ T, that function successfully merges the partition bounds for (R IJ S) and T, and generates join pairs ((R1 IJ S1), T1) and ((R2 IJ S2), T2). The question here is: do we really need to perform partition_bounds_merge() to generate the list of matching partitions to be joined for 3-way join R IJ (S IJ T), for example? I don't think so; because 1) we see from the 3-way join pairs ((R1 IJ S1), T1) and ((R2 IJ S2), T2) that Ri, Si and Ti (i=1,2) have boundaries overlapping with each other, which means that there would be (S1, T1) and (S2, T2) as 2-way join pairs for S IJ T, and 2) we have R IJ (S IJ T) = (R IJ S) IJ T = (R1 IJ S1) IJ T1 union (R2 IJ S2) IJ T2 = R1 IJ (S1 IJ T1) union R2 IJ (S2 IJ T2), which means that the list of matching partitions to be joined for R IJ (S IJ T) in question are (R1, (S1 IJ T1)) and (R2, (S2 IJ T2)) since we see from the equation that pairs from R and (S IJ T) other than that would not contribute to the join result. Does that make sense? Attached is a WIP patch implementing this created on top of the patch [1]. It seems that I performed the above tests on an assertion-enabled build. :( So I executed the tests one more time. Here are the results. * 2-way self-join of pt: explain analyze select * from pt t0, pt t1 where t0.a = t1.a; - HEAD: Planning Time: 0.969 ms Execution Time: 13.843 ms - with patch: Planning Time: 1.720 ms Execution Time: 14.393 ms - with patch plus attached: Planning Time: 1.630 ms Execution Time: 14.002 ms * 4-way self-join of pt: explain analyze select * from pt t0, pt t1, pt t2, pt t3 where t0.a = t1.a and t1.a = t2.a and t2.a = t3.a; - HEAD: Planning Time: 12.203 ms Execution Time: 31.784 ms - with patch: Planning Time: 32.102 ms Execution Time: 32.504 ms - with patch plus attached: Planning Time: 19.471 ms Execution Time: 32.582 ms * 8-way self-join of pt: explain analyze select * from pt t0, pt t1, pt t2, pt t3, pt t4, pt t5, pt t6, pt t7 where t0.a = t1.a and t1.a = t2.a and t2.a = t3.a and t3.a = t4.a and t4.a = t5.a and t5.a = t6.a and t6.a = t7.a; - HEAD: Planning Time: 948.131 ms Execution Time: 55.645 ms - with patch: Planning Time: 2939.813 ms Execution Time: 56.760 ms - with patch plus attached: Planning Time: 1108.076 ms Execution Time: 55.750 ms Note: the attached patch still uses the proposed partition matching algorithm for these queries. As I said before, these queries don't need that algorithm, so we could eliminate the planning overhead compared to HEAD, by planning these queries as before, perhaps, but I haven't modified the patch as such yet. Best regards, Etsuro Fujita [1] https://www.postgresql.org/message-id/CAPmGK177W%2BjNgpM5f_m-vdDKbEBu_%3D3CyPzFjkT_1nzf1Vqn%2BA%40mail.gmail.com
Attachment
On Fri, Aug 16, 2019 at 10:25 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: > It seems that I performed the above tests on an assertion-enabled > build. :( So I executed the tests one more time. Here are the > results. > > * 2-way self-join of pt: explain analyze select * from pt t0, pt t1 > where t0.a = t1.a; > - HEAD: > Planning Time: 0.969 ms > Execution Time: 13.843 ms > - with patch: > Planning Time: 1.720 ms > Execution Time: 14.393 ms > - with patch plus attached: > Planning Time: 1.630 ms > Execution Time: 14.002 ms > > * 4-way self-join of pt: explain analyze select * from pt t0, pt t1, > pt t2, pt t3 where t0.a = t1.a > > and t1.a = t2.a and t2.a = t3.a; > - HEAD: > Planning Time: 12.203 ms > Execution Time: 31.784 ms > - with patch: > Planning Time: 32.102 ms > Execution Time: 32.504 ms > - with patch plus attached: > Planning Time: 19.471 ms > Execution Time: 32.582 ms > > * 8-way self-join of pt: explain analyze select * from pt t0, pt t1, > pt t2, pt t3, pt t4, pt t5, pt t6, pt t7 where t0.a = t1.a and t1.a = > t2.a and t2.a = t3.a and t3.a = t4.a and t4.a = t5.a and t5.a = t6.a > and t6.a = t7.a; > - HEAD: > Planning Time: 948.131 ms > Execution Time: 55.645 ms > - with patch: > Planning Time: 2939.813 ms > Execution Time: 56.760 ms > - with patch plus attached: > Planning Time: 1108.076 ms > Execution Time: 55.750 ms > > Note: the attached patch still uses the proposed partition matching > algorithm for these queries. As I said before, these queries don't > need that algorithm, so we could eliminate the planning overhead > compared to HEAD, by planning these queries as before, perhaps, but I > haven't modified the patch as such yet. I modified the patch further as such. Attached is an updated version of the patch created on top of the patch in [1]. I did the tests again using the updated version of the patch. Here are the results: * 2-way self-join of pt: Planning Time: 1.043 ms Execution Time: 13.931 ms * 4-way self-join of pt: Planning Time: 12.499 ms Execution Time: 32.392 ms * 8-way self-join of pt: Planning Time: 968.412 ms Execution Time: 56.328 ms The planning time for each test case still increased slightly, but IMO I think that would be acceptable. To see the efficiency of the attached, I did another testing with test cases that really need the new partition-matching algorithm: * explain analyze select * from pt6 t6, pt7 t7 where t6.a = t7.a; - base patch in [1] Planning Time: 1.758 ms Execution Time: 13.977 ms - with attached Planning Time: 1.777 ms Execution Time: 13.959 ms * explain analyze select * from pt4 t4, pt5 t5, pt6 t6, pt7 t7 where t4.a = t5.a and t5.a = t6.a and t6.a = t7.a; - base patch in [1] Planning Time: 33.201 ms Execution Time: 32.480 ms - with attached Planning Time: 21.019 ms Execution Time: 32.777 ms * explain analyze select * from pt0 t0, pt1 t1, pt2 t2, pt3 t3, pt4 t4, pt5 t5, pt6 t6, pt7 t7 where t0.a = t1.a and t1.a = t2.a and t2.a = t3.a and t3.a = t4.a and t4.a = t5.a and t5.a = t6.a and t6.a = t7.a; - base patch in [1] Planning Time: 3060.000 ms Execution Time: 55.553 ms - with attached Planning Time: 1144.996 ms Execution Time: 56.233 ms where pt0, pt1, pt2, pt3, pt4, pt5, pt6 and pt7 are list partitioned tables that have slighly different list values. (The structure and list values of ptN are almost the same as that of pt used above, but ptN's N-th partition ptNpN has an extra list value that pt's N-th partition ptpN doesn't have.) If anyone is interested in this testing, I'll send a script file for producing these list partitioned tables. About the attached: * The attached patch modified try_partitionwise_join() so that we call partition_bounds_equal() then partition_bounds_merge() (if necessary) to create the partition bounds for the join rel. We don't support for merging the partition bounds for the hash-partitioning case, so this makes code to handle the hash-partitioning case in partition_bounds_merge() completely unnecessary, so I removed that entirely. * I removed this assertion in partition_bounds_merge(), because I think this is covered by two assertions above this. + Assert((*outer_parts == NIL || *inner_parts != NIL) && + (*inner_parts == NIL || *outer_parts != NIL)); * (I forgot to mention this in a previous email, but) I removed this bit of generate_matching_part_pairs(), because we already have the same check in try_partitionwise_join(), so this bit would be redundant IIUC. + switch (jointype) + { + case JOIN_INNER: + case JOIN_SEMI: + + /* + * An inner or semi join can not return any row when the + * matching partition on either side is missing. We should + * have eliminated all such cases while merging the bounds. + */ + Assert(part1 >= 0 && part2 >= 0); + break; + + case JOIN_LEFT: + case JOIN_ANTI: + Assert(part1 >= 0); + if (part2 < 0) + merged = false; + break; + + case JOIN_FULL: + if (part1 < 0 || part2 < 0) + merged = false; + break; + + default: + elog(ERROR, "unrecognized join type: %d", (int) jointype); + } * I added more comments. If there are no objections, I'll merge the attached with the base patch in [1]. Best regards, Etsuro Fujita [1] https://www.postgresql.org/message-id/CAPmGK177W%2BjNgpM5f_m-vdDKbEBu_%3D3CyPzFjkT_1nzf1Vqn%2BA%40mail.gmail.com
Attachment
Thank you Fujita San for the enhancement, will have a look.
Regards,
Amul
On Wed, Aug 28, 2019 at 3:52 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
On Fri, Aug 16, 2019 at 10:25 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> It seems that I performed the above tests on an assertion-enabled
> build. :( So I executed the tests one more time. Here are the
> results.
>
> * 2-way self-join of pt: explain analyze select * from pt t0, pt t1
> where t0.a = t1.a;
> - HEAD:
> Planning Time: 0.969 ms
> Execution Time: 13.843 ms
> - with patch:
> Planning Time: 1.720 ms
> Execution Time: 14.393 ms
> - with patch plus attached:
> Planning Time: 1.630 ms
> Execution Time: 14.002 ms
>
> * 4-way self-join of pt: explain analyze select * from pt t0, pt t1,
> pt t2, pt t3 where t0.a = t1.a
>
> and t1.a = t2.a and t2.a = t3.a;
> - HEAD:
> Planning Time: 12.203 ms
> Execution Time: 31.784 ms
> - with patch:
> Planning Time: 32.102 ms
> Execution Time: 32.504 ms
> - with patch plus attached:
> Planning Time: 19.471 ms
> Execution Time: 32.582 ms
>
> * 8-way self-join of pt: explain analyze select * from pt t0, pt t1,
> pt t2, pt t3, pt t4, pt t5, pt t6, pt t7 where t0.a = t1.a and t1.a =
> t2.a and t2.a = t3.a and t3.a = t4.a and t4.a = t5.a and t5.a = t6.a
> and t6.a = t7.a;
> - HEAD:
> Planning Time: 948.131 ms
> Execution Time: 55.645 ms
> - with patch:
> Planning Time: 2939.813 ms
> Execution Time: 56.760 ms
> - with patch plus attached:
> Planning Time: 1108.076 ms
> Execution Time: 55.750 ms
>
> Note: the attached patch still uses the proposed partition matching
> algorithm for these queries. As I said before, these queries don't
> need that algorithm, so we could eliminate the planning overhead
> compared to HEAD, by planning these queries as before, perhaps, but I
> haven't modified the patch as such yet.
I modified the patch further as such. Attached is an updated version
of the patch created on top of the patch in [1]. I did the tests
again using the updated version of the patch. Here are the results:
* 2-way self-join of pt:
Planning Time: 1.043 ms
Execution Time: 13.931 ms
* 4-way self-join of pt:
Planning Time: 12.499 ms
Execution Time: 32.392 ms
* 8-way self-join of pt:
Planning Time: 968.412 ms
Execution Time: 56.328 ms
The planning time for each test case still increased slightly, but IMO
I think that would be acceptable. To see the efficiency of the
attached, I did another testing with test cases that really need the
new partition-matching algorithm:
* explain analyze select * from pt6 t6, pt7 t7 where t6.a = t7.a;
- base patch in [1]
Planning Time: 1.758 ms
Execution Time: 13.977 ms
- with attached
Planning Time: 1.777 ms
Execution Time: 13.959 ms
* explain analyze select * from pt4 t4, pt5 t5, pt6 t6, pt7 t7 where
t4.a = t5.a and t5.a = t6.a and t6.a = t7.a;
- base patch in [1]
Planning Time: 33.201 ms
Execution Time: 32.480 ms
- with attached
Planning Time: 21.019 ms
Execution Time: 32.777 ms
* explain analyze select * from pt0 t0, pt1 t1, pt2 t2, pt3 t3, pt4
t4, pt5 t5, pt6 t6, pt7 t7 where t0.a = t1.a and t1.a = t2.a and t2.a
= t3.a and t3.a = t4.a and t4.a = t5.a and t5.a = t6.a and t6.a =
t7.a;
- base patch in [1]
Planning Time: 3060.000 ms
Execution Time: 55.553 ms
- with attached
Planning Time: 1144.996 ms
Execution Time: 56.233 ms
where pt0, pt1, pt2, pt3, pt4, pt5, pt6 and pt7 are list partitioned
tables that have slighly different list values. (The structure and
list values of ptN are almost the same as that of pt used above, but
ptN's N-th partition ptNpN has an extra list value that pt's N-th
partition ptpN doesn't have.) If anyone is interested in this
testing, I'll send a script file for producing these list partitioned
tables.
About the attached:
* The attached patch modified try_partitionwise_join() so that we call
partition_bounds_equal() then partition_bounds_merge() (if necessary)
to create the partition bounds for the join rel. We don't support for
merging the partition bounds for the hash-partitioning case, so this
makes code to handle the hash-partitioning case in
partition_bounds_merge() completely unnecessary, so I removed that
entirely.
* I removed this assertion in partition_bounds_merge(), because I
think this is covered by two assertions above this.
+ Assert((*outer_parts == NIL || *inner_parts != NIL) &&
+ (*inner_parts == NIL || *outer_parts != NIL));
* (I forgot to mention this in a previous email, but) I removed this
bit of generate_matching_part_pairs(), because we already have the
same check in try_partitionwise_join(), so this bit would be redundant
IIUC.
+ switch (jointype)
+ {
+ case JOIN_INNER:
+ case JOIN_SEMI:
+
+ /*
+ * An inner or semi join can not return any row when the
+ * matching partition on either side is missing. We should
+ * have eliminated all such cases while merging the bounds.
+ */
+ Assert(part1 >= 0 && part2 >= 0);
+ break;
+
+ case JOIN_LEFT:
+ case JOIN_ANTI:
+ Assert(part1 >= 0);
+ if (part2 < 0)
+ merged = false;
+ break;
+
+ case JOIN_FULL:
+ if (part1 < 0 || part2 < 0)
+ merged = false;
+ break;
+
+ default:
+ elog(ERROR, "unrecognized join type: %d", (int) jointype);
+ }
* I added more comments.
If there are no objections, I'll merge the attached with the base patch in [1].
Best regards,
Etsuro Fujita
[1] https://www.postgresql.org/message-id/CAPmGK177W%2BjNgpM5f_m-vdDKbEBu_%3D3CyPzFjkT_1nzf1Vqn%2BA%40mail.gmail.com
Hi Fujita San,
Please find my comments inline below:
On Wed, Aug 28, 2019 at 3:52 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
On Fri, Aug 16, 2019 at 10:25 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
>
[... skipped ..]
About the attached:
* The attached patch modified try_partitionwise_join() so that we call
partition_bounds_equal() then partition_bounds_merge() (if necessary)
to create the partition bounds for the join rel. We don't support for
merging the partition bounds for the hash-partitioning case, so this
makes code to handle the hash-partitioning case in
partition_bounds_merge() completely unnecessary, so I removed that
entirely.
Yes, that make sense.
On thinking further, a thought hits to me why we can't join two hash partitioned
table which has the same modulus and partition key specification, but some of
the partitions are missing from either partitioned table.
For e.g. here is a smaller case where foo has two partitions and bar has only one.
table which has the same modulus and partition key specification, but some of
the partitions are missing from either partitioned table.
For e.g. here is a smaller case where foo has two partitions and bar has only one.
CREATE TABLE foo(a int) PARTITION BY HASH(a);
CREATE TABLE foo_p0 PARTITION OF foo FOR VALUES WITH(modulus 2, remainder 0);
CREATE TABLE foo_p1 PARTITION OF foo FOR VALUES WITH(modulus 2, remainder 1);
CREATE TABLE bar(a int) PARTITION BY HASH(a); <-- missing partitions for REMAINDER 1
CREATE TABLE bar_p0 PARTITION OF bar FOR VALUES WITH(modulus 2, remainder 0);
CREATE TABLE foo_p0 PARTITION OF foo FOR VALUES WITH(modulus 2, remainder 0);
CREATE TABLE foo_p1 PARTITION OF foo FOR VALUES WITH(modulus 2, remainder 1);
CREATE TABLE bar(a int) PARTITION BY HASH(a); <-- missing partitions for REMAINDER 1
CREATE TABLE bar_p0 PARTITION OF bar FOR VALUES WITH(modulus 2, remainder 0);
Explain:
postgres=# explain select * from foo p1, bar p2 where p1.a = p2.a;
QUERY PLAN
---------------------------------------------------------------------------------
Merge Join (cost=590.35..1578.47 rows=65025 width=8)
Merge Cond: (p2.a = p1.a)
-> Sort (cost=179.78..186.16 rows=2550 width=4)
Sort Key: p2.a
-> Seq Scan on bar_p0 p2 (cost=0.00..35.50 rows=2550 width=4)
-> Sort (cost=410.57..423.32 rows=5100 width=4)
Sort Key: p1.a
-> Append (cost=0.00..96.50 rows=5100 width=4)
-> Seq Scan on foo_p0 p1 (cost=0.00..35.50 rows=2550 width=4)
-> Seq Scan on foo_p1 p1_1 (cost=0.00..35.50 rows=2550 width=4)
(10 rows)
QUERY PLAN
---------------------------------------------------------------------------------
Merge Join (cost=590.35..1578.47 rows=65025 width=8)
Merge Cond: (p2.a = p1.a)
-> Sort (cost=179.78..186.16 rows=2550 width=4)
Sort Key: p2.a
-> Seq Scan on bar_p0 p2 (cost=0.00..35.50 rows=2550 width=4)
-> Sort (cost=410.57..423.32 rows=5100 width=4)
Sort Key: p1.a
-> Append (cost=0.00..96.50 rows=5100 width=4)
-> Seq Scan on foo_p0 p1 (cost=0.00..35.50 rows=2550 width=4)
-> Seq Scan on foo_p1 p1_1 (cost=0.00..35.50 rows=2550 width=4)
(10 rows)
The partitions-wise join will be performed only if we fill the partition hole that
exists for the joining table i.e. adding partitions to bar table.postgres=# CREATE TABLE bar_p1 PARTITION OF bar FOR VALUES WITH(modulus 2, remainder 1);
CREATE TABLE
postgres=# explain select * from foo p1, bar p2 where p1.a = p2.a;
QUERY PLAN
---------------------------------------------------------------------------------
Append (cost=359.57..2045.11 rows=65024 width=8)
-> Merge Join (cost=359.57..860.00 rows=32512 width=8)
Merge Cond: (p1.a = p2.a)
-> Sort (cost=179.78..186.16 rows=2550 width=4)
Sort Key: p1.a
-> Seq Scan on foo_p0 p1 (cost=0.00..35.50 rows=2550 width=4)
-> Sort (cost=179.78..186.16 rows=2550 width=4)
Sort Key: p2.a
-> Seq Scan on bar_p0 p2 (cost=0.00..35.50 rows=2550 width=4)
-> Merge Join (cost=359.57..860.00 rows=32512 width=8)
Merge Cond: (p1_1.a = p2_1.a)
-> Sort (cost=179.78..186.16 rows=2550 width=4)
Sort Key: p1_1.a
-> Seq Scan on foo_p1 p1_1 (cost=0.00..35.50 rows=2550 width=4)
-> Sort (cost=179.78..186.16 rows=2550 width=4)
Sort Key: p2_1.a
-> Seq Scan on bar_p1 p2_1 (cost=0.00..35.50 rows=2550 width=4)
(17 rows)
CREATE TABLE
postgres=# explain select * from foo p1, bar p2 where p1.a = p2.a;
QUERY PLAN
---------------------------------------------------------------------------------
Append (cost=359.57..2045.11 rows=65024 width=8)
-> Merge Join (cost=359.57..860.00 rows=32512 width=8)
Merge Cond: (p1.a = p2.a)
-> Sort (cost=179.78..186.16 rows=2550 width=4)
Sort Key: p1.a
-> Seq Scan on foo_p0 p1 (cost=0.00..35.50 rows=2550 width=4)
-> Sort (cost=179.78..186.16 rows=2550 width=4)
Sort Key: p2.a
-> Seq Scan on bar_p0 p2 (cost=0.00..35.50 rows=2550 width=4)
-> Merge Join (cost=359.57..860.00 rows=32512 width=8)
Merge Cond: (p1_1.a = p2_1.a)
-> Sort (cost=179.78..186.16 rows=2550 width=4)
Sort Key: p1_1.a
-> Seq Scan on foo_p1 p1_1 (cost=0.00..35.50 rows=2550 width=4)
-> Sort (cost=179.78..186.16 rows=2550 width=4)
Sort Key: p2_1.a
-> Seq Scan on bar_p1 p2_1 (cost=0.00..35.50 rows=2550 width=4)
(17 rows)
It would have been nice if we could support this case, as we do allow partitions
hole for the other partition scheme, but there wouldn't be much objection ifwe don't want to add this support for now since there will be a lesser chance
that hash partitioned table has the hole, IMO.
* I removed this assertion in partition_bounds_merge(), because I
think this is covered by two assertions above this.
+ Assert((*outer_parts == NIL || *inner_parts != NIL) &&
+ (*inner_parts == NIL || *outer_parts != NIL));
* (I forgot to mention this in a previous email, but) I removed this
bit of generate_matching_part_pairs(), because we already have the
same check in try_partitionwise_join(), so this bit would be redundant
IIUC.
Looks good.
* I added more comments.
Thanks.
If there are no objections, I'll merge the attached with the base patch in [1].
The proposed enhancement in the patch is too good and the patch is pretty much
reasonable to merge into the main patch.
reasonable to merge into the main patch.
Here are the few cosmetic fixes for this path I think is needed. Feel free to
ignore if some of them do not make sense or too obvious.
ignore if some of them do not make sense or too obvious.
Note: left side number represents code line number of the patch.
118 + }
119 +
120 + /*
121 + * Try to create the partition bounds along with join pairs.
122 + */
123 + if (boundinfo == NULL)
124 + {
119 +
120 + /*
121 + * Try to create the partition bounds along with join pairs.
122 + */
123 + if (boundinfo == NULL)
124 + {
Can we add this block as else part of previous if condition checking equal partitions bound?
133 + Assert(list_length(parts1) == list_length(parts2));
134 + if (boundinfo == NULL)
135 + {
136 + joinrel->nparts = 0;
137 + return;
138 + }
139 + nparts = list_length(parts1);
Can we move the assert check below at line#139 in the patch i.e. after if block.
133 + Assert(list_length(parts1) == list_length(parts2));
134 + if (boundinfo == NULL)
135 + {
136 + joinrel->nparts = 0;
137 + return;
138 + }
139 + nparts = list_length(parts1);
Can we move the assert check below at line#139 in the patch i.e. after if block.
And the question is do we need to do that assert check since partition_bounds_merge()
does that just before returning, thoughts?
204 + RelOptInfo *child_rel1 = merged ? (RelOptInfo *) lfirst(lcr1) : rel1->part_rels[cnt_parts];
205 + RelOptInfo *child_rel2 = merged ? (RelOptInfo *) lfirst(lcr2) : rel2->part_rels[cnt_parts];
How about using lfirst_node instead of lfirst & casting explicitly?
Also, these lines crossing 80 column length which I think we need to fix. How about
doing the assignment as follow, just after the variable declaration part:
if (merged)
{
child_rel1 = lfirst_node(lcr1, RelOptInfo);
child_rel2 = lfirst_node(lcr2, RelOptInfo);
lcr1 = lnext(parts1, lcr1);
lcr2 = lnext(parts2, lcr2);
}
else
{
child_rel1 = rel1->part_rels[cnt_parts];
child_rel2 = rel2->part_rels[cnt_parts]
}
rel1_empty = (child_rel1 == NULL || IS_DUMMY_REL(child_rel1));
rel2_empty = (child_rel2 == NULL || IS_DUMMY_REL(child_rel2));
266 + * get_matching_part_pairs
267 + * Generate join pairs of partitions for the two inputs for the join rel
204 + RelOptInfo *child_rel1 = merged ? (RelOptInfo *) lfirst(lcr1) : rel1->part_rels[cnt_parts];
205 + RelOptInfo *child_rel2 = merged ? (RelOptInfo *) lfirst(lcr2) : rel2->part_rels[cnt_parts];
How about using lfirst_node instead of lfirst & casting explicitly?
Also, these lines crossing 80 column length which I think we need to fix. How about
doing the assignment as follow, just after the variable declaration part:
if (merged)
{
child_rel1 = lfirst_node(lcr1, RelOptInfo);
child_rel2 = lfirst_node(lcr2, RelOptInfo);
lcr1 = lnext(parts1, lcr1);
lcr2 = lnext(parts2, lcr2);
}
else
{
child_rel1 = rel1->part_rels[cnt_parts];
child_rel2 = rel2->part_rels[cnt_parts]
}
rel1_empty = (child_rel1 == NULL || IS_DUMMY_REL(child_rel1));
rel2_empty = (child_rel2 == NULL || IS_DUMMY_REL(child_rel2));
266 + * get_matching_part_pairs
267 + * Generate join pairs of partitions for the two inputs for the join rel
Can we rewrite this description as " Generate join pairs of partitions for the
join rel from the two inputs." OR "Generate join pairs of partitions for the
two inputs"
310 + Assert(bms_num_members(child_relids1) == bms_num_members(rel1->relids));
311 + /*
335 + Assert(bms_num_members(child_relids2) == bms_num_members(rel2->relids));
336 + /*
Need newline after assert statements.
Regards,
Amul
Re: [HACKERS] advanced partition matching algorithm forpartition-wise join
From
Alvaro Herrera
Date:
Fujita-san, amul, CFbot complains that Fujita-san submitted a patch that doesn't apply, which makes sense since the necessary previous patch was only referred to without being resubmitted. I suggest to always post all patches together with each resubmission so that it can be checked automatically by the cf bot: http://commitfest.cputube.org/etsuro-fujita.html I'm not clear on who the author of this patch is, now. Also, I'm not sure what the *status* is. Are we waiting for Fujita-san to review this patch? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Sep 4, 2019 at 2:40 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Fujita-san, amul,
CFbot complains that Fujita-san submitted a patch that doesn't apply,
which makes sense since the necessary previous patch was only referred
to without being resubmitted. I suggest to always post all patches
together with each resubmission so that it can be checked automatically
by the cf bot: http://commitfest.cputube.org/etsuro-fujita.html
Understood and sorry about that.
I'm not clear on who the author of this patch is, now. Also, I'm not
sure what the *status* is. Are we waiting for Fujita-san to review this
patch?
Yes, we are waiting for Fujita-san to review. Fujita-san has started a review
and proposed some enhancement which I reviewed in the last update.
I think soon Fujita-san might post the complete patch including his changes.
Regards,
Amul
Hi, On Thu, Sep 5, 2019 at 1:24 PM amul sul <sulamul@gmail.com> wrote: > On Wed, Sep 4, 2019 at 2:40 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: >> CFbot complains that Fujita-san submitted a patch that doesn't apply, >> which makes sense since the necessary previous patch was only referred >> to without being resubmitted. I suggest to always post all patches >> together with each resubmission so that it can be checked automatically >> by the cf bot: http://commitfest.cputube.org/etsuro-fujita.html > Understood and sorry about that. Sorry about that. >> I'm not clear on who the author of this patch is, now. Also, I'm not >> sure what the *status* is. Are we waiting for Fujita-san to review this >> patch? > Yes, we are waiting for Fujita-san to review. Fujita-san has started a review > and proposed some enhancement which I reviewed in the last update. > > I think soon Fujita-san might post the complete patch including his changes. I'm a bit busy with something else recently, but I'll do that ASAP. And I'll continue to review the other part of the patch. Best regards, Etsuro Fujita
Hi Amul, On Mon, Sep 2, 2019 at 2:08 PM amul sul <sulamul@gmail.com> wrote: > Please find my comments inline below: Thank you for the review! > On Wed, Aug 28, 2019 at 3:52 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: >> On Fri, Aug 16, 2019 at 10:25 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: > On thinking further, a thought hits to me why we can't join two hash partitioned > table which has the same modulus and partition key specification, but some of > the partitions are missing from either partitioned table. > > For e.g. here is a smaller case where foo has two partitions and bar has only one. > > CREATE TABLE foo(a int) PARTITION BY HASH(a); > CREATE TABLE foo_p0 PARTITION OF foo FOR VALUES WITH(modulus 2, remainder 0); > CREATE TABLE foo_p1 PARTITION OF foo FOR VALUES WITH(modulus 2, remainder 1); > > CREATE TABLE bar(a int) PARTITION BY HASH(a); <-- missing partitions for REMAINDER 1 > CREATE TABLE bar_p0 PARTITION OF bar FOR VALUES WITH(modulus 2, remainder 0); > > Explain: > postgres=# explain select * from foo p1, bar p2 where p1.a = p2.a; > QUERY PLAN > --------------------------------------------------------------------------------- > Merge Join (cost=590.35..1578.47 rows=65025 width=8) > Merge Cond: (p2.a = p1.a) > -> Sort (cost=179.78..186.16 rows=2550 width=4) > Sort Key: p2.a > -> Seq Scan on bar_p0 p2 (cost=0.00..35.50 rows=2550 width=4) > -> Sort (cost=410.57..423.32 rows=5100 width=4) > Sort Key: p1.a > -> Append (cost=0.00..96.50 rows=5100 width=4) > -> Seq Scan on foo_p0 p1 (cost=0.00..35.50 rows=2550 width=4) > -> Seq Scan on foo_p1 p1_1 (cost=0.00..35.50 rows=2550 width=4) > (10 rows) > > The partitions-wise join will be performed only if we fill the partition hole that > exists for the joining table i.e. adding partitions to bar table. > > postgres=# CREATE TABLE bar_p1 PARTITION OF bar FOR VALUES WITH(modulus 2, remainder 1); > CREATE TABLE > postgres=# explain select * from foo p1, bar p2 where p1.a = p2.a; > QUERY PLAN > --------------------------------------------------------------------------------- > Append (cost=359.57..2045.11 rows=65024 width=8) > -> Merge Join (cost=359.57..860.00 rows=32512 width=8) > Merge Cond: (p1.a = p2.a) > -> Sort (cost=179.78..186.16 rows=2550 width=4) > Sort Key: p1.a > -> Seq Scan on foo_p0 p1 (cost=0.00..35.50 rows=2550 width=4) > -> Sort (cost=179.78..186.16 rows=2550 width=4) > Sort Key: p2.a > -> Seq Scan on bar_p0 p2 (cost=0.00..35.50 rows=2550 width=4) > -> Merge Join (cost=359.57..860.00 rows=32512 width=8) > Merge Cond: (p1_1.a = p2_1.a) > -> Sort (cost=179.78..186.16 rows=2550 width=4) > Sort Key: p1_1.a > -> Seq Scan on foo_p1 p1_1 (cost=0.00..35.50 rows=2550 width=4) > -> Sort (cost=179.78..186.16 rows=2550 width=4) > Sort Key: p2_1.a > -> Seq Scan on bar_p1 p2_1 (cost=0.00..35.50 rows=2550 width=4) > (17 rows) > > It would have been nice if we could support this case, as we do allow partitions > hole for the other partition scheme, but there wouldn't be much objection if > we don't want to add this support for now since there will be a lesser chance > that hash partitioned table has the hole, IMO. I agree with you on that point. >> If there are no objections, I'll merge the attached with the base patch in [1]. > The proposed enhancement in the patch is too good and the patch is pretty much > reasonable to merge into the main patch. Done. Attached is a merged version of the patch. > Here are the few cosmetic fixes for this path I think is needed. Feel free to > ignore if some of them do not make sense or too obvious. > > Note: left side number represents code line number of the patch. > > 118 + } > 119 + > 120 + /* > 121 + * Try to create the partition bounds along with join pairs. > 122 + */ > 123 + if (boundinfo == NULL) > 124 + { > > Can we add this block as else part of previous if condition checking equal partitions bound? Done. > 133 + Assert(list_length(parts1) == list_length(parts2)); > 134 + if (boundinfo == NULL) > 135 + { > 136 + joinrel->nparts = 0; > 137 + return; > 138 + } > 139 + nparts = list_length(parts1); > And the question is do we need to do that assert check since partition_bounds_merge() > does that just before returning, thoughts? You are right; that assertion would be redundant, so I removed that. > 204 + RelOptInfo *child_rel1 = merged ? (RelOptInfo *) lfirst(lcr1) : rel1->part_rels[cnt_parts]; > 205 + RelOptInfo *child_rel2 = merged ? (RelOptInfo *) lfirst(lcr2) : rel2->part_rels[cnt_parts]; > > How about using lfirst_node instead of lfirst & casting explicitly? > > Also, these lines crossing 80 column length which I think we need to fix. How about > doing the assignment as follow, just after the variable declaration part: > > if (merged) > { > child_rel1 = lfirst_node(lcr1, RelOptInfo); > child_rel2 = lfirst_node(lcr2, RelOptInfo); > lcr1 = lnext(parts1, lcr1); > lcr2 = lnext(parts2, lcr2); > } > else > { > child_rel1 = rel1->part_rels[cnt_parts]; > child_rel2 = rel2->part_rels[cnt_parts] > } > > rel1_empty = (child_rel1 == NULL || IS_DUMMY_REL(child_rel1)); > rel2_empty = (child_rel2 == NULL || IS_DUMMY_REL(child_rel2)); Done. (I don't think the order of arguments for first_node() above is correct; the first argument for that should be RelOptInfo.) > 266 + * get_matching_part_pairs > 267 + * Generate join pairs of partitions for the two inputs for the join rel > > Can we rewrite this description as " Generate join pairs of partitions for the > join rel from the two inputs." OR "Generate join pairs of partitions for the > two inputs" Done. > 310 + Assert(bms_num_members(child_relids1) == bms_num_members(rel1->relids)); > 311 + /* > > 335 + Assert(bms_num_members(child_relids2) == bms_num_members(rel2->relids)); > 336 + /* > > Need newline after assert statements. Done. Other change: I guess this in partbounds.c would be leftovers from an older version of the patch, so I removed it. +/* + * Allocate and initialize partition maps. We maintain four maps, two maps + * for each joining relation. pmap[i] gives the partition from the other + * relation which would join with ith partition of the given relation. + * Partition i from the given relation will join with partition pmap[i] + * from the other relation to produce partition mmap[i] of the join (merged + * partition). + * + * pmap[i] = -1 indicates that ith partition of a given relation does not + * have a matching partition from the other relation. + * + * mmap[i] = -1 indicates that ith partition of a given relation does not + * contribute to the join result. That can happen only when the given + * relation is the inner relation and it doesn't have a matching partition + * from the outer relation, hence pmap[i] should be -1. + * + * In case of an outer join, every partition of the outer join will appear + * in the join result, and thus has mmap[i] set for all i. But it's not + * necessary that every partition on the outer side will have a matching + * partition on the inner side. In such a case, we end up with pmap[i] = -1 + * and mmap[i] != -1. + */ I will continue to review the rest of the patch. I am sorry for the long delay. Best regards, Etsuro Fujita
Attachment
On Wed, Sep 25, 2019 at 12:59 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: > I will continue to review the rest of the patch. I've been reviewing the rest of the patch. Here are my review comments: * map_and_merge_partitions() checks whether the two partitions from the outer and inner sides can be merged in two steps: 1) see if the partitions are mapped to each other (ie, partmap1->from == index2 && partmap2->from == index1), and 2) see if the merged partition indexes assigned are the same (ie, partmap1->to == partmap2->to) (or satisfy some other conditions), but the first step seems redundant to me because I think that if the merged partition indexes are the same, then the partitions would be guaranteed to be mapped to each other. Also, I noticed that that function can't handle some list-partitioning cases properly. Here is an example: CREATE TABLE plt1 (a int, b int, c text) PARTITION BY LIST(c); CREATE TABLE plt1_p1 PARTITION OF plt1 FOR VALUES IN ('0000', '0002'); CREATE TABLE plt1_p2 PARTITION OF plt1 FOR VALUES IN ('0003', '0004'); INSERT INTO plt1 SELECT i, i, to_char(i % 5, 'FM0000') FROM generate_series(0, 24) i WHERE i % 5 IN (0, 2, 3, 4); ANALYZE plt1; CREATE TABLE plt2 (a int, b int, c text) PARTITION BY LIST(c); CREATE TABLE plt2_p1 PARTITION OF plt2 FOR VALUES IN ('0001', '0002'); CREATE TABLE plt2_p2 PARTITION OF plt2 FOR VALUES IN ('0003', '0004'); INSERT INTO plt2 SELECT i, i, to_char(i % 5, 'FM0000') FROM generate_series(0, 24) i WHERE i % 5 IN (1, 2, 3, 4); ANALYZE plt2; EXPLAIN (COSTS OFF) SELECT t1.a, t1.c, t2.a, t2.c FROM plt1 t1 FULL JOIN plt2 t2 ON (t1.c = t2.c) WHERE COALESCE(t1.a, 0) % 5 != 3 AND COALESCE(t1.a, 0) % 5 != 4 ORDER BY t1.c, t1.a, t2.a; QUERY PLAN ------------------------------------------------------------------------------- ------ Sort Sort Key: t1.c, t1.a, t2.a -> Hash Full Join Hash Cond: (t1.c = t2.c) Filter: (((COALESCE(t1.a, 0) % 5) <> 3) AND ((COALESCE(t1.a, 0) % 5) <> 4)) -> Append -> Seq Scan on plt1_p1 t1 -> Seq Scan on plt1_p2 t1_1 -> Hash -> Append -> Seq Scan on plt2_p1 t2 -> Seq Scan on plt2_p2 t2_1 (12 rows) This should use partitionwise join by the new partition-matching algorithm but doesn't. The reason for that is because plt1_p1 and plt2_p1 are mapped to different merged partitions and thus considered not merge-able by map_and_merge_partitions() as-is. I might be missing something, but I don't think this is intentional, so I rewrote that function completely in the attached, which is a WIP patch created on top of the patch [1]. * In handle_missing_partition(), I noticed this: + else if (missing_side_inner) + { + /* + * If this partition has already been mapped (say because we + * found an overlapping range earlier), we know where does it + * fit in the join result. Nothing to do in that case. Else + * create a new merged partition. + */ + PartitionMap *extra_map = &maps_with_extra[extra_part]; + if (extra_map->to < 0) + { + extra_map->to = *next_index; + *next_index = *next_index + 1; + *merged_index = extra_map->to; + } + } As commented, that function skips setting *merged_index when the "extra_part" partition is already mapped to a merged partition. This would be correct for range partitioning, but not for list partitioning, I think. Here is an example: CREATE TABLE plt1 (a int, b int, c text) PARTITION BY LIST(c); CREATE TABLE plt1_p1 PARTITION OF plt1 FOR VALUES IN ('0000', '0001', '0002'); CREATE TABLE plt1_p2 PARTITION OF plt1 FOR VALUES IN ('0003', '0004'); INSERT INTO plt1 SELECT i, i, to_char(i % 5, 'FM0000') FROM generate_series(0, 24) i; ANALYZE plt1; CREATE TABLE plt2 (a int, b int, c text) PARTITION BY LIST(c); CREATE TABLE plt2_p1 PARTITION OF plt2 FOR VALUES IN ('0002'); CREATE TABLE plt2_p2 PARTITION OF plt2 FOR VALUES IN ('0003', '0004'); INSERT INTO plt2 SELECT i, i, to_char(i % 5, 'FM0000') FROM generate_series(0, 24) i WHERE i % 5 IN (2, 3, 4); ANALYZE plt2; CREATE TABLE plt3 (a int, b int, c text) PARTITION BY LIST(c); CREATE TABLE plt3_p1 PARTITION OF plt3 FOR VALUES IN ('0001'); CREATE TABLE plt3_p2 PARTITION OF plt3 FOR VALUES IN ('0003', '0004'); INSERT INTO plt3 SELECT i, i, to_char(i % 5, 'FM0000') FROM generate_series(0, 24) i WHERE i % 5 IN (1, 3, 4); ANALYZE plt3; EXPLAIN (COSTS OFF) SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM (plt1 t1 LEFT JOIN plt2 t2 ON (t1.c = t2.c)) FULL JOIN plt3 t3 ON (t1.c = t3.c) WHERE COALESCE(t1.a, 0) % 5 != 3 AND coalesce(t1.a, 0) % 5 != 4 ORDER BY t1.c, t1.a, t2.a, t3.a; QUERY PLAN ------------------------------------------------------------------------------- ------ Sort Sort Key: t1.c, t1.a, t2.a, t3.a -> Hash Full Join Hash Cond: (t1.c = t3.c) Filter: (((COALESCE(t1.a, 0) % 5) <> 3) AND ((COALESCE(t1.a, 0) % 5) <> 4)) -> Hash Left Join Hash Cond: (t1.c = t2.c) -> Append -> Seq Scan on plt1_p1 t1 -> Seq Scan on plt1_p2 t1_1 -> Hash -> Append -> Seq Scan on plt2_p1 t2 -> Seq Scan on plt2_p2 t2_1 -> Hash -> Append -> Seq Scan on plt3_p1 t3 -> Seq Scan on plt3_p2 t3_1 (18 rows) I think this should use 3-way partitionwise join by the new algorithm but doesn't. The reason for that is because when considering partitionwise join for plt1 and plt2, partition_list_bounds_merge() would incorrectly produce {'0000', '0002'} as the merged values for the join segment of plt1_p1 and plt2_p1, not {'0000', '0001', '0002'}, because that function would call handle_missing_partition() for the values '0000' and '0001' of plt1_p1, and it would set *merged_index correctly for '0000' but not for '0001' (keeping *merged_index=-1), due to the behavior mentioned above, resulting in the merged values for the join segment {'0000', '0002'}, not {'0000', '0001', '0002'}. This would cause to fail to merge the values for the join segment {'0000', '0002'} and the values for plt3_p1 {'0001'} when considering partitionwise join for the 2-way join and plt3. I fixed this as well in the attached. handle_missing_partition() can handle both cases where a datum is missing from the inner side and where a datum is missing from the outer side in a unified way, but I think that that makes the code pretty hard to read. Also, I think handle_missing_partition() is not caller-friendly because the caller needs to write something like this: + /* A range missing from the inner side. */ + bool missing_side_outer; + bool missing_side_inner; + /* + * For a FULL join, inner relation acts as both OUTER and INNER + * relation. For LEFT and ANTI join the inner relation acts as + * INNER relation. For INNER and SEMI join OUTER and INNER + * differentiation is immaterial. + */ + missing_side_inner = (jointype == JOIN_FULL || + jointype == JOIN_LEFT || + jointype == JOIN_ANTI); + missing_side_outer = (jointype == JOIN_FULL); + if (!handle_missing_partition(inner_maps, + outer_maps, + inner_default, + outer_part, + missing_side_outer, + missing_side_inner, + &next_index, + &default_index, + &merged_index)) + return NULL; So I'd like to propose to introduce separate functions like process_outer_partition() and process_inner_partition() in the attached, instead of handle_missing_partition(). (I added a fast path to these functions that if both outer/inner sides have the default partitions, give up on merging partitions. Also, I modified the caller sites of proposed functions so that they call these if necessary.) Other minor changes in the attached: * I modified partition_range_bounds_merge() and partition_list_bounds_merge() so that the order of arguments for them matches partition_bounds_merge(). Also, I did some cleanup for these functions, and removed this in these function because this merely re-checks the while condition above. + /* If we exhausted both the sides, we won't enter the loop. */ + Assert(!finished_inner || !finished_outer); * I modified merge_null_partitions() and merge_default_partitions() accordingly to the changes described above. Also, I did some cleanup and modified the caller sites of these functions so that they call these functions if necessary. * I also modified generate_matching_part_pairs() accordingly. * I added a creanup to free memory allocated by init_partition_map(). * We have this in a few places such as merge_default_partitions(): + extra_map->to = *next_index; + *next_index = *next_index + 1; + *merged_index = extra_map->to; To reduce code duplication, I made this into a separate function. * I think the return statement in the end of partition_range_merge() is useless, so I removed it. That is all for now. Will continue to review. Best regards, Etsuro Fujita [1] https://www.postgresql.org/message-id/CAPmGK14WHKckT1P6UJV2B63TZAxPyMn8iZJ99XF=AZuNhG6vow@mail.gmail.com
Attachment
On Wed, Oct 16, 2019 at 6:20 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
On Wed, Sep 25, 2019 at 12:59 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> I will continue to review the rest of the patch.
I've been reviewing the rest of the patch. Here are my review comments:
[....]
So I'd like to propose to introduce separate functions like
process_outer_partition() and process_inner_partition() in the
attached, instead of handle_missing_partition(). (I added a fast path
to these functions that if both outer/inner sides have the default
partitions, give up on merging partitions. Also, I modified the
caller sites of proposed functions so that they call these if
necessary.)
Agree -- process_outer_partition() & process_inner_partition() approach looks
much cleaner than before.
Here are the few comments:
much cleaner than before.
Here are the few comments:
Note that LHS numbers are the line numbers in your previously posted patch[1].
455 + if (inner_has_default ||
456 + jointype == JOIN_LEFT ||
457 + jointype == JOIN_ANTI ||
458 + jointype == JOIN_FULL)
459 + {
460 + if (!process_outer_partition(&outer_map,
512 + if (outer_has_default || jointype == JOIN_FULL)
513 + {
514 + if (!process_inner_partition(&outer_map,
How about adding these conditions to the else block of process_outer_partition()
& process_inner_partition() function respectively so that these functions can be
called unconditionally? Thoughts/Comments?
456 + jointype == JOIN_LEFT ||
457 + jointype == JOIN_ANTI ||
458 + jointype == JOIN_FULL)
459 + {
460 + if (!process_outer_partition(&outer_map,
512 + if (outer_has_default || jointype == JOIN_FULL)
513 + {
514 + if (!process_inner_partition(&outer_map,
How about adding these conditions to the else block of process_outer_partition()
& process_inner_partition() function respectively so that these functions can be
called unconditionally? Thoughts/Comments?
---
456 + jointype == JOIN_LEFT ||
457 + jointype == JOIN_ANTI ||
458 + jointype == JOIN_FULL)
457 + jointype == JOIN_ANTI ||
458 + jointype == JOIN_FULL)
Also, how about using IS_OUTER_JOIN() instead. But we need an assertion to
restrict JOIN_RIGHT or something.
restrict JOIN_RIGHT or something.
For the convenience, I did both aforesaid changes in the attached delta patch that
can be applied atop of your previously posted patch[1]. Kindly have a look & share
can be applied atop of your previously posted patch[1]. Kindly have a look & share
your thoughts, thanks.
--
1273 + * *next_index is incremented when creating a new merged partition associated
1274 + * with the given outer partition.
1275 + */
Correction: s/outer/inner
1274 + * with the given outer partition.
1275 + */
Correction: s/outer/inner
---
1338 + * In range partitioning, if the given outer partition is already
1339 + * merged (eg, because we found an overlapping range earlier), we know
1340 + * where it fits in the join result; nothing to do in that case. Else
1341 + * create a new merged partition.
Correction: s/outer/inner.
1339 + * merged (eg, because we found an overlapping range earlier), we know
1340 + * where it fits in the join result; nothing to do in that case. Else
1341 + * create a new merged partition.
Correction: s/outer/inner.
---
1712 /*
1713 * If the NULL partition was missing from the inner side of the join,
s/inner side/outer side
1713 * If the NULL partition was missing from the inner side of the join,
s/inner side/outer side
--
Regards,
Amul
Attachment
Hi Amul, On Fri, Oct 25, 2019 at 6:59 PM amul sul <sulamul@gmail.com> wrote: > On Wed, Oct 16, 2019 at 6:20 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: >> So I'd like to propose to introduce separate functions like >> process_outer_partition() and process_inner_partition() in the >> attached, instead of handle_missing_partition(). (I added a fast path >> to these functions that if both outer/inner sides have the default >> partitions, give up on merging partitions. Also, I modified the >> caller sites of proposed functions so that they call these if >> necessary.) > Agree -- process_outer_partition() & process_inner_partition() approach looks > much cleaner than before. > > Here are the few comments: Thanks for the review! > Note that LHS numbers are the line numbers in your previously posted patch[1]. > > 455 + if (inner_has_default || > 456 + jointype == JOIN_LEFT || > 457 + jointype == JOIN_ANTI || > 458 + jointype == JOIN_FULL) > 459 + { > 460 + if (!process_outer_partition(&outer_map, > > 512 + if (outer_has_default || jointype == JOIN_FULL) > 513 + { > 514 + if (!process_inner_partition(&outer_map, > > How about adding these conditions to the else block of process_outer_partition() > & process_inner_partition() function respectively so that these functions can be > called unconditionally? Thoughts/Comments? I'm not sure that's a good idea; these functions might be called many times, so I just thought it would be better to call these functions conditionally, to avoid useless function call overhead. > 456 + jointype == JOIN_LEFT || > 457 + jointype == JOIN_ANTI || > 458 + jointype == JOIN_FULL) > > Also, how about using IS_OUTER_JOIN() instead. But we need an assertion to > restrict JOIN_RIGHT or something. Seems like a good idea. > For the convenience, I did both aforesaid changes in the attached delta patch that > can be applied atop of your previously posted patch[1]. Kindly have a look & share > your thoughts, thanks. Thanks for the patch! > 1273 + * *next_index is incremented when creating a new merged partition associated > 1274 + * with the given outer partition. > 1275 + */ > > Correction: s/outer/inner > --- > > 1338 + * In range partitioning, if the given outer partition is already > 1339 + * merged (eg, because we found an overlapping range earlier), we know > 1340 + * where it fits in the join result; nothing to do in that case. Else > 1341 + * create a new merged partition. > > Correction: s/outer/inner. > --- > > 1712 /* > 1713 * If the NULL partition was missing from the inner side of the join, > > s/inner side/outer side > -- Good catch! Will fix. Best regards, Etsuro Fujita
On Tue, Oct 29, 2019 at 7:29 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: > On Fri, Oct 25, 2019 at 6:59 PM amul sul <sulamul@gmail.com> wrote: > > On Wed, Oct 16, 2019 at 6:20 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: > >> So I'd like to propose to introduce separate functions like > >> process_outer_partition() and process_inner_partition() in the > >> attached, instead of handle_missing_partition(). (I added a fast path > >> to these functions that if both outer/inner sides have the default > >> partitions, give up on merging partitions. Also, I modified the > >> caller sites of proposed functions so that they call these if > >> necessary.) > > > Agree -- process_outer_partition() & process_inner_partition() approach looks > > much cleaner than before. > > Note that LHS numbers are the line numbers in your previously posted patch[1]. > > > > 455 + if (inner_has_default || > > 456 + jointype == JOIN_LEFT || > > 457 + jointype == JOIN_ANTI || > > 458 + jointype == JOIN_FULL) > > 459 + { > > 460 + if (!process_outer_partition(&outer_map, > > > > 512 + if (outer_has_default || jointype == JOIN_FULL) > > 513 + { > > 514 + if (!process_inner_partition(&outer_map, > > > > How about adding these conditions to the else block of process_outer_partition() > > & process_inner_partition() function respectively so that these functions can be > > called unconditionally? Thoughts/Comments? > > I'm not sure that's a good idea; these functions might be called many > times, so I just thought it would be better to call these functions > conditionally, to avoid useless function call overhead. The overhead might be small, but isn't zero, so I still think that we should call these functions if necessary. > > 456 + jointype == JOIN_LEFT || > > 457 + jointype == JOIN_ANTI || > > 458 + jointype == JOIN_FULL) > > > > Also, how about using IS_OUTER_JOIN() instead. But we need an assertion to > > restrict JOIN_RIGHT or something. > > Seems like a good idea. Done. > > For the convenience, I did both aforesaid changes in the attached delta patch that > > can be applied atop of your previously posted patch[1]. Kindly have a look & share > > your thoughts, thanks. > > Thanks for the patch! > > > 1273 + * *next_index is incremented when creating a new merged partition associated > > 1274 + * with the given outer partition. > > 1275 + */ > > > > Correction: s/outer/inner > > --- > > > > 1338 + * In range partitioning, if the given outer partition is already > > 1339 + * merged (eg, because we found an overlapping range earlier), we know > > 1340 + * where it fits in the join result; nothing to do in that case. Else > > 1341 + * create a new merged partition. > > > > Correction: s/outer/inner. > > --- > > > > 1712 /* > > 1713 * If the NULL partition was missing from the inner side of the join, > > > > s/inner side/outer side > > -- > > Good catch! Will fix. Done. I also added some assertions to process_outer_partition() and process_inner_partition(), including ones as proposed in your patch. Attached is an updated version. If no objections, I'll merge this with the main patch [1]. Best regards, Etsuro Fujita [1] https://www.postgresql.org/message-id/CAPmGK14WHKckT1P6UJV2B63TZAxPyMn8iZJ99XF=AZuNhG6vow@mail.gmail.com
Attachment
On Thu, Oct 31, 2019 at 6:49 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: > Attached is an updated version. If no objections, I'll merge this > with the main patch [1]. Done. Attached is a new version of the patch. Other changes: in generate_matching_part_pairs(), I changed variable names to match other functions, simplified assertions, and adjusted/added comments a bit. Will continue to review. Best regards, Etsuro Fujita
Attachment
On Fri, Nov 1, 2019 at 3:58 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
On Thu, Oct 31, 2019 at 6:49 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> Attached is an updated version. If no objections, I'll merge this
> with the main patch [1].
Done. Attached is a new version of the patch.
Other changes: in generate_matching_part_pairs(), I changed variable
names to match other functions, simplified assertions, and
adjusted/added comments a bit.
Thanks for the update version.
A query and comments for v25:
583 + * The function returns NULL if we can not find the matching pair of
584 + * partitions. This happens if 1. multiple partitions on one side match with
585 + * one partition on the other side. 2. a given partition on the outer side
586 + * doesn't have a matching partition on the inner side. We can not support the
587 + * first case since we don't have a way to represent multiple partitions as a
588 + * single relation (RelOptInfo) and then perform join using the ganged
589 + * relation. We can not support the second case since the missing inner
590 + * partition needs to be represented as an empty relation and we don't have a
591 + * way to introduce empty relation during join planning after creating paths
592 + * for all the base relations.
593 + */
594 +PartitionBoundInfo
595 +partition_bounds_merge(int partnatts,
I think the second condition mention for partition_bounds_merge() looks
outdated, do you think we should remove that or am I missing something here?
584 + * partitions. This happens if 1. multiple partitions on one side match with
585 + * one partition on the other side. 2. a given partition on the outer side
586 + * doesn't have a matching partition on the inner side. We can not support the
587 + * first case since we don't have a way to represent multiple partitions as a
588 + * single relation (RelOptInfo) and then perform join using the ganged
589 + * relation. We can not support the second case since the missing inner
590 + * partition needs to be represented as an empty relation and we don't have a
591 + * way to introduce empty relation during join planning after creating paths
592 + * for all the base relations.
593 + */
594 +PartitionBoundInfo
595 +partition_bounds_merge(int partnatts,
I think the second condition mention for partition_bounds_merge() looks
outdated, do you think we should remove that or am I missing something here?
---
1768 +
1769 + /*
1770 + * If this is an outer join, the merged partition would act as the
1771 + * default partition of the join; record the index in *default_index
1772 + * if not done yet.
1773 + */
1774 + if (jointype == JOIN_LEFT || jointype == JOIN_ANTI ||
1775 + jointype == JOIN_FULL)
1776 + {
As decided need to replace this list by IS_OUTER_JOIN(jointype).
---
2020 + if (jointype == JOIN_LEFT || jointype == JOIN_FULL ||
2021 + jointype == JOIN_ANTI)
2022 + {
Same as the previous.
---
code coverage for the v25 patch. Please have a look at the attached 0002 & also
attach the coverage output with & without this patch, TWIMW.
0001 is the same v25 patch, reattaching to make CFbot happy.
Regards,
Amul
Attachment
Attached is the rebase atop the latest master head(a9056cc637f).
On Tue, Nov 5, 2019 at 6:44 PM amul sul <sulamul@gmail.com> wrote:
On Fri, Nov 1, 2019 at 3:58 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:On Thu, Oct 31, 2019 at 6:49 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> Attached is an updated version. If no objections, I'll merge this
> with the main patch [1].
Done. Attached is a new version of the patch.
Other changes: in generate_matching_part_pairs(), I changed variable
names to match other functions, simplified assertions, and
adjusted/added comments a bit.Thanks for the update version.A query and comments for v25:583 + * The function returns NULL if we can not find the matching pair of
584 + * partitions. This happens if 1. multiple partitions on one side match with
585 + * one partition on the other side. 2. a given partition on the outer side
586 + * doesn't have a matching partition on the inner side. We can not support the
587 + * first case since we don't have a way to represent multiple partitions as a
588 + * single relation (RelOptInfo) and then perform join using the ganged
589 + * relation. We can not support the second case since the missing inner
590 + * partition needs to be represented as an empty relation and we don't have a
591 + * way to introduce empty relation during join planning after creating paths
592 + * for all the base relations.
593 + */
594 +PartitionBoundInfo
595 +partition_bounds_merge(int partnatts,
I think the second condition mention for partition_bounds_merge() looks
outdated, do you think we should remove that or am I missing something here?---
1768 +
1769 + /*
1770 + * If this is an outer join, the merged partition would act as the
1771 + * default partition of the join; record the index in *default_index
1772 + * if not done yet.
1773 + */
1774 + if (jointype == JOIN_LEFT || jointype == JOIN_ANTI ||
1775 + jointype == JOIN_FULL)
1776 + {
As decided need to replace this list by IS_OUTER_JOIN(jointype).---
2020 + if (jointype == JOIN_LEFT || jointype == JOIN_FULL ||
2021 + jointype == JOIN_ANTI)
2022 + {
Same as the previous.---I tried a coverage testing and tried to adjust & add a few tests to improved the
code coverage for the v25 patch. Please have a look at the attached 0002 & also
attach the coverage output with & without this patch, TWIMW.0001 is the same v25 patch, reattaching to make CFbot happy.Regards,Amul
Attachment
Hi Amul, On Wed, Nov 6, 2019 at 3:12 PM amul sul <sulamul@gmail.com> wrote: > Attached is the rebase atop the latest master head(a9056cc637f). Thanks for that! > On Tue, Nov 5, 2019 at 6:44 PM amul sul <sulamul@gmail.com> wrote: >> On Fri, Nov 1, 2019 at 3:58 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: >>> Done. Attached is a new version of the patch. >> A query and comments for v25: >> >> 583 + * The function returns NULL if we can not find the matching pair of >> 584 + * partitions. This happens if 1. multiple partitions on one side match with >> 585 + * one partition on the other side. 2. a given partition on the outer side >> 586 + * doesn't have a matching partition on the inner side. We can not support the >> 587 + * first case since we don't have a way to represent multiple partitions as a >> 588 + * single relation (RelOptInfo) and then perform join using the ganged >> 589 + * relation. We can not support the second case since the missing inner >> 590 + * partition needs to be represented as an empty relation and we don't have a >> 591 + * way to introduce empty relation during join planning after creating paths >> 592 + * for all the base relations. >> 593 + */ >> 594 +PartitionBoundInfo >> 595 +partition_bounds_merge(int partnatts, >> >> I think the second condition mention for partition_bounds_merge() looks >> outdated, do you think we should remove that or am I missing something here? Yeah, I think so. In addition to that, the output arguments for that function *outer_pars and *inner_parts don't store partition indexes anymore, so I updated comments for that function totally. Does that make sense? >> 1768 + >> 1769 + /* >> 1770 + * If this is an outer join, the merged partition would act as the >> 1771 + * default partition of the join; record the index in *default_index >> 1772 + * if not done yet. >> 1773 + */ >> 1774 + if (jointype == JOIN_LEFT || jointype == JOIN_ANTI || >> 1775 + jointype == JOIN_FULL) >> 1776 + { >> >> As decided need to replace this list by IS_OUTER_JOIN(jointype). Done. >> 2020 + if (jointype == JOIN_LEFT || jointype == JOIN_FULL || >> 2021 + jointype == JOIN_ANTI) >> 2022 + { >> >> Same as the previous. Done. >> I tried a coverage testing and tried to adjust & add a few tests to improved the >> code coverage for the v25 patch. Please have a look at the attached 0002 & also >> attach the coverage output with & without this patch, TWIMW. Good idea! I merged that to the main patch after modifying that a bit so to reduce the outputs of SELECTs. Attached is an updated version of the main patch. Thanks for reviewing! I looked at partition_range_bounds_merge() more closely. Here are my comments: * When merging partition bounds from the outer/inner sides in the while loop of that function, we only moves to the next partition on one side when both partitions overlap (ie, overlap=true) and the upper bounds of partitions are not the same (ie, ub_cmpval<0 or ub_cmpval>0); but I think that's inefficient. Let me explain using an example: create table prt1 (a int, b int) partition by range (a); create table prt1_p1 partition of prt1 for values from (0) to (50); create table prt1_p2 partition of prt1 for values from (100) to (150); create table prt2 (a int, b int) partition by range (a); create table prt2_p1 partition of prt2 for values from (0) to (100); create table prt2_p2 partition of prt2 for values from (100) to (200); insert into prt1 select i, i from generate_series(0, 49) i; insert into prt1 select i, i from generate_series(100, 149) i; insert into prt2 select i, i from generate_series(0, 49) i; insert into prt2 select i, i from generate_series(100, 149) i; analyze prt1; analyze prt2; set enable_partitionwise_join to on; explain verbose select * from prt1 t1 full join prt2 t2 on (t1.a = t2.a); QUERY PLAN -------------------------------------------------------------------------------------- Append (cost=2.12..9.12 rows=100 width=16) -> Hash Full Join (cost=2.12..4.31 rows=50 width=16) Output: t1.a, t1.b, t2.a, t2.b Hash Cond: (t1.a = t2.a) -> Seq Scan on public.prt1_p1 t1 (cost=0.00..1.50 rows=50 width=8) Output: t1.a, t1.b -> Hash (cost=1.50..1.50 rows=50 width=8) Output: t2.a, t2.b -> Seq Scan on public.prt2_p1 t2 (cost=0.00..1.50 rows=50 width=8) Output: t2.a, t2.b -> Hash Full Join (cost=2.12..4.31 rows=50 width=16) Output: t1_1.a, t1_1.b, t2_1.a, t2_1.b Hash Cond: (t1_1.a = t2_1.a) -> Seq Scan on public.prt1_p2 t1_1 (cost=0.00..1.50 rows=50 width=8) Output: t1_1.a, t1_1.b -> Hash (cost=1.50..1.50 rows=50 width=8) Output: t2_1.a, t2_1.b -> Seq Scan on public.prt2_p2 t2_1 (cost=0.00..1.50 rows=50 width=8) Output: t2_1.a, t2_1.b (19 rows) For this query, the merging would proceed as follow: 1) In the first iteration of the while loop, we have overlap=true, so we merge prt1_p1 and prt2_p1 and move to the next partition on *only the outer side* (prt1_p2) as we have ub_cmpval<0. 2) In the second iteration, we have overlap=false and ub_cmpval>0, so we perform process_inner_partition() to prt2_p1 and move to the next partition on the inner side (prt2_p2). 3) In the third iteration, we have overlap=true, so we merge prt1_p2 and prt2_p2 and move to the next partition on *only the outer side* as we have ub_cmpval<0 (but can't as the outer side is exhausted). 4) In the forth iteration, we have overlap=false and ub_cmpval>0, so we perform process_inner_partition() to prt2_p2 and move to the next partition on the inner side (but can't as the inner side is exhausted). And we are done. We do this in the process_inner_partition() call in the second and forth iterations: /* * In range partitioning, if the given inner partition is already * merged (eg, because we found an overlapping range earlier), we know * where it fits in the join result; nothing to do in that case. Else * create a new merged partition. */ if (inner_map->merged_indexes[inner_index] >= 0) { if (strategy == PARTITION_STRATEGY_LIST) *merged_index = inner_map->merged_indexes[inner_index]; else { Assert(strategy == PARTITION_STRATEGY_RANGE); *merged_index = -1; } } What we do in that function is actually a no-op, so the second and forth iterations are done only to move to the next partition on the inner side, which I think is inefficient. To avoid that, why not move to *the next pair of partitions* in the first and third iterations like the attached? This needs an extra check to see if we can safely move to the next pair of partitions in the first and third iterations, but doesn't need meaningless iterations like the second and fourth iterations in the example. This change also makes the code in process_inner_partition() shown above unnecessary, so I removed it from that function (and process_outer_partition()). * I don't like this: + if ((lb_cmpval < 0 && inner_has_default) || + /* Non-overlapping range on the lower side of outer range. */ + (lb_cmpval > 0 && outer_has_default) || + /* Non-overlapping range on the lower side of inner range. */ + (ub_cmpval < 0 && outer_has_default) || + /* Non-overlapping range on the upper side of inner range. */ + (ub_cmpval > 0 && inner_has_default)) + /* Non-overlapping range on the upper side of outer range. */ + return NULL; because it looks like that eg, the first comment "Non-overlapping range on the lower side of outer range." explains the second condition "(lb_cmpval > 0 && outer_has_default)", which isn't intended, I think. Also, it seems a bit verbose to me to add a comment to each of the four cases. How about simplifying the comments and moving them to above the if block like the attached? * In partition_range_merge_next_lb(), do we really need this bit? + /* + * The lower bound is lower than the last upper bound, thus does not fit + * the bounds created so far and hence can not be merged with the existing + * bounds. + */ + if (cmpval < 0) + return false; I think that if we have such a lower bound, we would error out before calling that function. Also, I think it would be easier to understand to add the upper bound as well within that function. So I modified that function that way (and renamed it). I also added a bit of comments to that function. * Since this is called many times, I changed it to a macro: +static int32 +partition_range_bound_cmp(int partnatts, FmgrInfo *partsupfunc, + Oid *partcollations, PartitionRangeBound *bound1, + PartitionRangeBound *bound2) +{ + return partition_rbound_cmp(partnatts, partsupfunc, partcollations, + bound1->datums, bound1->kind, bound1->lower, + bound2); +} * This is just my taste, but how about renaming partition_range_bound_cmp() and partition_range_cmp() to more common? or readable? names eg, compare_range_bounds() and compare_range_partitions(), respectively? Also, how about renaming partition_range_merge() to eg, get_merged_range_bounds()? Please find attached the delta patch for that. Any feedback welcome! Best regards, Etsuro Fujita
Attachment
On Wed, Nov 13, 2019 at 9:46 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
Hi Amul,
On Wed, Nov 6, 2019 at 3:12 PM amul sul <sulamul@gmail.com> wrote:
> Attached is the rebase atop the latest master head(a9056cc637f).
Thanks for that!
> On Tue, Nov 5, 2019 at 6:44 PM amul sul <sulamul@gmail.com> wrote:
>> On Fri, Nov 1, 2019 at 3:58 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
>>> Done. Attached is a new version of the patch.
>> A query and comments for v25:
>>
>> 583 + * The function returns NULL if we can not find the matching pair of
>> 584 + * partitions. This happens if 1. multiple partitions on one side match with
>> 585 + * one partition on the other side. 2. a given partition on the outer side
>> 586 + * doesn't have a matching partition on the inner side. We can not support the
>> 587 + * first case since we don't have a way to represent multiple partitions as a
>> 588 + * single relation (RelOptInfo) and then perform join using the ganged
>> 589 + * relation. We can not support the second case since the missing inner
>> 590 + * partition needs to be represented as an empty relation and we don't have a
>> 591 + * way to introduce empty relation during join planning after creating paths
>> 592 + * for all the base relations.
>> 593 + */
>> 594 +PartitionBoundInfo
>> 595 +partition_bounds_merge(int partnatts,
>>
>> I think the second condition mention for partition_bounds_merge() looks
>> outdated, do you think we should remove that or am I missing something here?
Yeah, I think so. In addition to that, the output arguments for that
function *outer_pars and *inner_parts don't store partition indexes
anymore, so I updated comments for that function totally. Does that
make sense?
Thanks, that's looks better.
[....]
I looked at partition_range_bounds_merge() more closely. Here are my comments:
* When merging partition bounds from the outer/inner sides in the
while loop of that function, we only moves to the next partition on
one side when both partitions overlap (ie, overlap=true) and the upper
bounds of partitions are not the same (ie, ub_cmpval<0 or
ub_cmpval>0); but I think that's inefficient. Let me explain using an
example:
create table prt1 (a int, b int) partition by range (a);
create table prt1_p1 partition of prt1 for values from (0) to (50);
create table prt1_p2 partition of prt1 for values from (100) to (150);
create table prt2 (a int, b int) partition by range (a);
create table prt2_p1 partition of prt2 for values from (0) to (100);
create table prt2_p2 partition of prt2 for values from (100) to (200);
insert into prt1 select i, i from generate_series(0, 49) i;
insert into prt1 select i, i from generate_series(100, 149) i;
insert into prt2 select i, i from generate_series(0, 49) i;
insert into prt2 select i, i from generate_series(100, 149) i;
analyze prt1;
analyze prt2;
set enable_partitionwise_join to on;
explain verbose select * from prt1 t1 full join prt2 t2 on (t1.a = t2.a);
QUERY PLAN
--------------------------------------------------------------------------------------
Append (cost=2.12..9.12 rows=100 width=16)
-> Hash Full Join (cost=2.12..4.31 rows=50 width=16)
Output: t1.a, t1.b, t2.a, t2.b
Hash Cond: (t1.a = t2.a)
-> Seq Scan on public.prt1_p1 t1 (cost=0.00..1.50 rows=50 width=8)
Output: t1.a, t1.b
-> Hash (cost=1.50..1.50 rows=50 width=8)
Output: t2.a, t2.b
-> Seq Scan on public.prt2_p1 t2 (cost=0.00..1.50
rows=50 width=8)
Output: t2.a, t2.b
-> Hash Full Join (cost=2.12..4.31 rows=50 width=16)
Output: t1_1.a, t1_1.b, t2_1.a, t2_1.b
Hash Cond: (t1_1.a = t2_1.a)
-> Seq Scan on public.prt1_p2 t1_1 (cost=0.00..1.50 rows=50 width=8)
Output: t1_1.a, t1_1.b
-> Hash (cost=1.50..1.50 rows=50 width=8)
Output: t2_1.a, t2_1.b
-> Seq Scan on public.prt2_p2 t2_1 (cost=0.00..1.50
rows=50 width=8)
Output: t2_1.a, t2_1.b
(19 rows)
For this query, the merging would proceed as follow:
1) In the first iteration of the while loop, we have overlap=true, so
we merge prt1_p1 and prt2_p1 and move to the next partition on *only
the outer side* (prt1_p2) as we have ub_cmpval<0. 2) In the second
iteration, we have overlap=false and ub_cmpval>0, so we perform
process_inner_partition() to prt2_p1 and move to the next partition on
the inner side (prt2_p2). 3) In the third iteration, we have
overlap=true, so we merge prt1_p2 and prt2_p2 and move to the next
partition on *only the outer side* as we have ub_cmpval<0 (but can't
as the outer side is exhausted). 4) In the forth iteration, we have
overlap=false and ub_cmpval>0, so we perform process_inner_partition()
to prt2_p2 and move to the next partition on the inner side (but can't
as the inner side is exhausted). And we are done.
We do this in the process_inner_partition() call in the second and
forth iterations:
/*
* In range partitioning, if the given inner partition is already
* merged (eg, because we found an overlapping range earlier), we know
* where it fits in the join result; nothing to do in that case. Else
* create a new merged partition.
*/
if (inner_map->merged_indexes[inner_index] >= 0)
{
if (strategy == PARTITION_STRATEGY_LIST)
*merged_index = inner_map->merged_indexes[inner_index];
else
{
Assert(strategy == PARTITION_STRATEGY_RANGE);
*merged_index = -1;
}
}
What we do in that function is actually a no-op, so the second and
forth iterations are done only to move to the next partition on the
inner side, which I think is inefficient. To avoid that, why not move
to *the next pair of partitions* in the first and third iterations
like the attached? This needs an extra check to see if we can safely
move to the next pair of partitions in the first and third iterations,
but doesn't need meaningless iterations like the second and fourth
iterations in the example. This change also makes the code in
process_inner_partition() shown above unnecessary, so I removed it
from that function (and process_outer_partition()).
Make sense.
* I don't like this:
+ if ((lb_cmpval < 0 && inner_has_default) ||
+ /* Non-overlapping range on the lower side of outer range. */
+ (lb_cmpval > 0 && outer_has_default) ||
+ /* Non-overlapping range on the lower side of inner range. */
+ (ub_cmpval < 0 && outer_has_default) ||
+ /* Non-overlapping range on the upper side of inner range. */
+ (ub_cmpval > 0 && inner_has_default))
+ /* Non-overlapping range on the upper side of outer range. */
+ return NULL;
because it looks like that eg, the first comment "Non-overlapping
range on the lower side of outer range." explains the second condition
"(lb_cmpval > 0 && outer_has_default)", which isn't intended, I think.
Also, it seems a bit verbose to me to add a comment to each of the
four cases. How about simplifying the comments and moving them to
above the if block like the attached?
Check.
* In partition_range_merge_next_lb(), do we really need this bit?
+ /*
+ * The lower bound is lower than the last upper bound, thus does not fit
+ * the bounds created so far and hence can not be merged with the existing
+ * bounds.
+ */
+ if (cmpval < 0)
+ return false;
I think that if we have such a lower bound, we would error out before
calling that function. Also, I think it would be easier to understand
to add the upper bound as well within that function. So I modified
that function that way (and renamed it). I also added a bit of
comments to that function.
* Since this is called many times, I changed it to a macro:
+static int32
+partition_range_bound_cmp(int partnatts, FmgrInfo *partsupfunc,
+ Oid *partcollations, PartitionRangeBound *bound1,
+ PartitionRangeBound *bound2)
+{
+ return partition_rbound_cmp(partnatts, partsupfunc, partcollations,
+ bound1->datums, bound1->kind, bound1->lower,
+ bound2);
+}
Check.
* This is just my taste, but how about renaming
partition_range_bound_cmp() and partition_range_cmp() to more common?
or readable? names eg, compare_range_bounds() and
compare_range_partitions(), respectively? Also, how about renaming
partition_range_merge() to eg, get_merged_range_bounds()?
Check.
Please find attached the delta patch for that. Any feedback welcome!
Thank you Fujita san for the patch & the enhancements. I am fine with your
delta patch. I would like to share some thought for other code:
delta patch. I would like to share some thought for other code:
File: partbounds.c:
3298 static void
3299 get_merged_range_bounds(int partnatts, FmgrInfo *partsupfuncs,
3300 Oid *partcollations, JoinType jointype,
3301 PartitionRangeBound *left_lb,
3302 PartitionRangeBound *left_ub,
3303 PartitionRangeBound *right_lb,
3304 PartitionRangeBound *right_ub,
3305 PartitionRangeBound *merged_lb,
3306 PartitionRangeBound *merged_ub)
Can we rename these argument as inner_* & outer_* like we having for the
functions, like 0003 patch?
3299 get_merged_range_bounds(int partnatts, FmgrInfo *partsupfuncs,
3300 Oid *partcollations, JoinType jointype,
3301 PartitionRangeBound *left_lb,
3302 PartitionRangeBound *left_ub,
3303 PartitionRangeBound *right_lb,
3304 PartitionRangeBound *right_ub,
3305 PartitionRangeBound *merged_lb,
3306 PartitionRangeBound *merged_ub)
Can we rename these argument as inner_* & outer_* like we having for the
functions, like 0003 patch?
---
File: partbounds.c:
3322
3323 case JOIN_INNER:
3324 case JOIN_SEMI:
3325 if (compare_range_bounds(partnatts, partsupfuncs, partcollations,
3326 left_ub, right_ub) < 0)
3327 *merged_ub = *left_ub;
3328 else
3329 *merged_ub = *right_ub;
3330
3331 if (compare_range_bounds(partnatts, partsupfuncs, partcollations,
3332 left_lb, right_lb) > 0)
3333 *merged_lb = *left_lb;
3334 else
3335 *merged_lb = *right_lb;
3336 break;
3337
How about reusing ub_cmpval & lb_cmpval instead of calling
compare_range_bounds() inside get_merged_range_bounds(), like 0004 patch?
3323 case JOIN_INNER:
3324 case JOIN_SEMI:
3325 if (compare_range_bounds(partnatts, partsupfuncs, partcollations,
3326 left_ub, right_ub) < 0)
3327 *merged_ub = *left_ub;
3328 else
3329 *merged_ub = *right_ub;
3330
3331 if (compare_range_bounds(partnatts, partsupfuncs, partcollations,
3332 left_lb, right_lb) > 0)
3333 *merged_lb = *left_lb;
3334 else
3335 *merged_lb = *right_lb;
3336 break;
3337
How about reusing ub_cmpval & lb_cmpval instead of calling
compare_range_bounds() inside get_merged_range_bounds(), like 0004 patch?
--
Apart from this, I would like to propose 0005 cleanup patch where I have
rearranged function arguments & code to make sure the arguments & the code
rearranged function arguments & code to make sure the arguments & the code
related to lower bound should come first and then the upper bound.
Regards,
Amul
P.S: As usual, reattaching 0001 & 0002 patches.
Attachment
Hi Amul, On Fri, Nov 15, 2019 at 2:21 PM amul sul <sulamul@gmail.com> wrote: > Thank you Fujita san for the patch & the enhancements. I am fine with your > delta patch. OK, I'll merge the delta patch with the main one in the next version, if no objections from others. > I would like to share some thought for other code: > > File: partbounds.c: > 3298 static void > 3299 get_merged_range_bounds(int partnatts, FmgrInfo *partsupfuncs, > 3300 Oid *partcollations, JoinType jointype, > 3301 PartitionRangeBound *left_lb, > 3302 PartitionRangeBound *left_ub, > 3303 PartitionRangeBound *right_lb, > 3304 PartitionRangeBound *right_ub, > 3305 PartitionRangeBound *merged_lb, > 3306 PartitionRangeBound *merged_ub) > > Can we rename these argument as inner_* & outer_* like we having for the > functions, like 0003 patch? +1 (Actually, I too was thinking about that.) > File: partbounds.c: > 3322 > 3323 case JOIN_INNER: > 3324 case JOIN_SEMI: > 3325 if (compare_range_bounds(partnatts, partsupfuncs, partcollations, > 3326 left_ub, right_ub) < 0) > 3327 *merged_ub = *left_ub; > 3328 else > 3329 *merged_ub = *right_ub; > 3330 > 3331 if (compare_range_bounds(partnatts, partsupfuncs, partcollations, > 3332 left_lb, right_lb) > 0) > 3333 *merged_lb = *left_lb; > 3334 else > 3335 *merged_lb = *right_lb; > 3336 break; > 3337 > > How about reusing ub_cmpval & lb_cmpval instead of calling > compare_range_bounds() inside get_merged_range_bounds(), like 0004 patch? Good idea! So, +1 > Apart from this, I would like to propose 0005 cleanup patch where I have > rearranged function arguments & code to make sure the arguments & the code > related to lower bound should come first and then the upper bound. +1 I'll merge these changes as well into the main patch, if no objections of others. Thanks for the review and patches! Best regards, Etsuro Fujita
On Fri, Nov 15, 2019 at 6:10 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: > On Fri, Nov 15, 2019 at 2:21 PM amul sul <sulamul@gmail.com> wrote: > > Thank you Fujita san for the patch & the enhancements. I am fine with your > > delta patch. > > OK, I'll merge the delta patch with the main one in the next version, > if no objections from others. > > > I would like to share some thought for other code: > > > > File: partbounds.c: > > 3298 static void > > 3299 get_merged_range_bounds(int partnatts, FmgrInfo *partsupfuncs, > > 3300 Oid *partcollations, JoinType jointype, > > 3301 PartitionRangeBound *left_lb, > > 3302 PartitionRangeBound *left_ub, > > 3303 PartitionRangeBound *right_lb, > > 3304 PartitionRangeBound *right_ub, > > 3305 PartitionRangeBound *merged_lb, > > 3306 PartitionRangeBound *merged_ub) > > > > Can we rename these argument as inner_* & outer_* like we having for the > > functions, like 0003 patch? > > +1 (Actually, I too was thinking about that.) > > > File: partbounds.c: > > 3322 > > 3323 case JOIN_INNER: > > 3324 case JOIN_SEMI: > > 3325 if (compare_range_bounds(partnatts, partsupfuncs, partcollations, > > 3326 left_ub, right_ub) < 0) > > 3327 *merged_ub = *left_ub; > > 3328 else > > 3329 *merged_ub = *right_ub; > > 3330 > > 3331 if (compare_range_bounds(partnatts, partsupfuncs, partcollations, > > 3332 left_lb, right_lb) > 0) > > 3333 *merged_lb = *left_lb; > > 3334 else > > 3335 *merged_lb = *right_lb; > > 3336 break; > > 3337 > > > > How about reusing ub_cmpval & lb_cmpval instead of calling > > compare_range_bounds() inside get_merged_range_bounds(), like 0004 patch? > > Good idea! So, +1 > > > Apart from this, I would like to propose 0005 cleanup patch where I have > > rearranged function arguments & code to make sure the arguments & the code > > related to lower bound should come first and then the upper bound. > > +1 > > I'll merge these changes as well into the main patch, if no objections > of others. Done. I modified compare_range_partitions() as well to match its the variable names with others. Attached is a new version of the patch. I reviewed the rest of the partbounds.c changes. Here are my review comments: * I don't think this analysis is correct: +/* + * merge_null_partitions + * + * Merge NULL partitions, i.e. a partition that can hold NULL values for a lis\ t + * partitioned table, if any. Find the index of merged partition to which the + * NULL values would belong in the join result. If one joining relation has a + * NULL partition but not the other, try matching it with the default partitio\ n + * from the other relation since the default partition may have rows with NULL + * partition key. We can eliminate a NULL partition when it appears only on th\ e + * inner side of the join and the outer side doesn't have a default partition. + * + * When the equality operator used for join is strict, two NULL values will no\ t + * be considered as equal, and thus a NULL partition can be eliminated for an + * inner join. But we don't check the strictness operator here. + */ First of all, I think we can assume here that the equality operator is *strict*, because 1) have_partkey_equi_join(), which is executed before calling merge_null_partitions(), requires that the corresponding clause is mergejoinable, and 2) currently, we assume that mergejoinable operators are strict (see MJEvalOuterValues() and MJEvalInnerValues()). So I don't think it's needed to try merging a NULL partition on one side with the default partition on the other side as above. (merge_null_partitions() tries merging NULL partitions as well, but in some cases I don't think we need to do that, either.) So I rewrote merge_null_partitions() completely. Another change is that I eliminated the NULL partition of the joinrel in more cases. Attached is a patch (v28-0002-modify-merge_null_partitions.patch) for that created on top of the main patch. I might be missing something, though. Other changes in that patch: * I fixed memory leaks in partition_list_bounds_merge() and partition_range_bounds_merge(). * I noticed this in merge_default_partitions(): + Assert(outer_has_default && inner_has_default); + + *default_index = map_and_merge_partitions(outer_map, + inner_map, + outer_default, + inner_default, + next_index); + if (*default_index == -1) + return false; I think the merging should be done successfully, because of 1) this in process_outer_partition(): + if (inner_has_default) + { + Assert(inner_default >= 0); + + /* + * If the outer side has the default partition as well, we need to + * merge the default partitions (see merge_default_partitions()); give + * up on it. + */ + if (outer_has_default) + return false; and 2) this in process_inner_partition(): + if (outer_has_default) + { + Assert(outer_default >= 0); + + /* + * If the inner side has the default partition as well, we need to + * merge the default partitions (see merge_default_partitions()); give + * up on it. + */ + if (inner_has_default) + return false; So I removed the above if test (ie, *default_index == -1). I also modified that function a bit further, including comments. * I simplified process_outer_partition() and process_inner_partition() a bit, changing the APIs to match that of map_and_merge_partitions(). Also, I think this in these functions is a bit ugly; + /* Don't add this index to the list of merged indexes. */ + *merged_index = -1; so I removed it, and modified the condition on whether or not we add merged bounds to the lists in partition_list_bounds_merge() and partition_range_bounds_merge(), instead. That's it. Sorry for the delay. Best regards, Etsuro Fujita
Attachment
On Fri, Nov 22, 2019 at 10:08 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: > Done. I modified compare_range_partitions() as well to match its the > variable names with others. Attached is a new version of the patch. > > I reviewed the rest of the partbounds.c changes. Here are my review comments: > > * I don't think this analysis is correct: > > +/* > + * merge_null_partitions > + * > + * Merge NULL partitions, i.e. a partition that can hold NULL values for a lis\ > t > + * partitioned table, if any. Find the index of merged partition to which the > + * NULL values would belong in the join result. If one joining relation has a > + * NULL partition but not the other, try matching it with the default partitio\ > n > + * from the other relation since the default partition may have rows with NULL > + * partition key. We can eliminate a NULL partition when it appears only on th\ > e > + * inner side of the join and the outer side doesn't have a default partition. > + * > + * When the equality operator used for join is strict, two NULL values will no\ > t > + * be considered as equal, and thus a NULL partition can be eliminated for an > + * inner join. But we don't check the strictness operator here. > + */ > > First of all, I think we can assume here that the equality operator is > *strict*, because 1) have_partkey_equi_join(), which is executed > before calling merge_null_partitions(), requires that the > corresponding clause is mergejoinable, and 2) currently, we assume > that mergejoinable operators are strict (see MJEvalOuterValues() and > MJEvalInnerValues()). So I don't think it's needed to try merging a > NULL partition on one side with the default partition on the other > side as above. (merge_null_partitions() tries merging NULL partitions > as well, but in some cases I don't think we need to do that, either.) > So I rewrote merge_null_partitions() completely. Another change is > that I eliminated the NULL partition of the joinrel in more cases. > Attached is a patch (v28-0002-modify-merge_null_partitions.patch) for > that created on top of the main patch. I might be missing something, > though. > > Other changes in that patch: > > * I fixed memory leaks in partition_list_bounds_merge() and > partition_range_bounds_merge(). > > * I noticed this in merge_default_partitions(): > > + Assert(outer_has_default && inner_has_default); > + > + *default_index = map_and_merge_partitions(outer_map, > + inner_map, > + outer_default, > + inner_default, > + next_index); > + if (*default_index == -1) > + return false; > > I think the merging should be done successfully, because of 1) this in > process_outer_partition(): > > + if (inner_has_default) > + { > + Assert(inner_default >= 0); > + > + /* > + * If the outer side has the default partition as well, we need to > + * merge the default partitions (see merge_default_partitions()); give > + * up on it. > + */ > + if (outer_has_default) > + return false; > > and 2) this in process_inner_partition(): > > + if (outer_has_default) > + { > + Assert(outer_default >= 0); > + > + /* > + * If the inner side has the default partition as well, we need to > + * merge the default partitions (see merge_default_partitions()); give > + * up on it. > + */ > + if (inner_has_default) > + return false; > > So I removed the above if test (ie, *default_index == -1). I also > modified that function a bit further, including comments. > > * I simplified process_outer_partition() and process_inner_partition() > a bit, changing the APIs to match that of map_and_merge_partitions(). > Also, I think this in these functions is a bit ugly; > > + /* Don't add this index to the list of merged indexes. */ > + *merged_index = -1; > > so I removed it, and modified the condition on whether or not we add > merged bounds to the lists in partition_list_bounds_merge() and > partition_range_bounds_merge(), instead. Moved to the next CF. Best regards, Etsuro Fujita
Attached is the rebase version atop the latest master head(2d0fdfaccec).
Regards,
Amul
Attachment
On Mon, Dec 9, 2019 at 3:08 PM amul sul <sulamul@gmail.com> wrote:
Attached is the rebase version atop the latest master head(2d0fdfaccec).
Hi Fujita san,
I have been through your changes proposed in [1] -- the changes make sense to me &
I didn't see any unobvious behaviour in testing as well.
Regards,
Amul
Hi Amul, On Tue, Dec 10, 2019 at 3:49 PM amul sul <sulamul@gmail.com> wrote: > On Mon, Dec 9, 2019 at 3:08 PM amul sul <sulamul@gmail.com> wrote: >> Attached is the rebase version atop the latest master head(2d0fdfaccec). Thanks for that! > I have been through your changes proposed in [1] -- the changes make sense to me & > I didn't see any unobvious behaviour in testing as well. Thanks for reviewing! I'll merge the changes into the main patch, then. I don't see any issues in the latest version, but I think we need to polish the patch, so I'll do that. Best regards, Etsuro Fujita
On Tue, Dec 10, 2019 at 7:30 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: > I don't see any issues in the latest version, but I think we > need to polish the patch, so I'll do that. I noticed some issues. :-( I think we should address it before polishing the patch. One thing I noticed is: the patch heavily modifies the existing test cases in partition_join.sql to test the new partition-matching algorithm, but I think we should leave those test cases alone because we would handle the exiting test cases (except one negative test case) as before (see the try_partitionwise_join() change in the patch), so those test cases would be still needed to test that. Attached is a proposed patch for that (v30-0001-Improve-partition-matching-for-partitionwise-join.patch) that 1) avoids modifying the existing test cases and 2) adds a slightly modified version of the test cases proposed in the previous patch to test the new algorithm. Though I omitted some test cases that seem redundant to me and added a bit more test cases involving NULL partitions and/or default partitions. The elapsed time to run the partition_join.sql regression test increased from 741 ms (HEAD) to 1086 ms in my environment, but I think that would be acceptable. I fixed one white space issue, but other than that, no code/comment changes. Another thing I noticed while working on the above is: the patch fails to apply PWJ to this case: CREATE TABLE plt1_ad (a int, b int, c text) PARTITION BY LIST (c); CREATE TABLE plt1_ad_p1 PARTITION OF plt1_ad FOR VALUES IN ('0001', '0003'); CREATE TABLE plt1_ad_p2 PARTITION OF plt1_ad FOR VALUES IN ('0004', '0006'); CREATE TABLE plt1_ad_p3 PARTITION OF plt1_ad FOR VALUES IN ('0008', '0009'); CREATE TABLE plt1_ad_extra PARTITION OF plt1_ad FOR VALUES IN (NULL); INSERT INTO plt1_ad SELECT i, i, to_char(i % 10, 'FM0000') FROM generate_series(1, 299) i WHERE i % 10 NOT IN (0, 2, 5, 7); INSERT INTO plt1_ad VALUES (-1, -1, NULL); ANALYZE plt1_ad; CREATE TABLE plt2_ad (a int, b int, c text) PARTITION BY LIST (c); CREATE TABLE plt2_ad_p1 PARTITION OF plt2_ad FOR VALUES IN ('0002', '0003'); CREATE TABLE plt2_ad_p2 PARTITION OF plt2_ad FOR VALUES IN ('0004', '0006'); CREATE TABLE plt2_ad_p3 PARTITION OF plt2_ad FOR VALUES IN ('0007', '0009'); CREATE TABLE plt2_ad_extra PARTITION OF plt2_ad FOR VALUES IN (NULL); INSERT INTO plt2_ad SELECT i, i, to_char(i % 10, 'FM0000') FROM generate_series(1, 299) i WHERE i % 10 NOT IN (0, 1, 5, 8); INSERT INTO plt2_ad VALUES (-1, -1, NULL); ANALYZE plt2_ad; EXPLAIN (COSTS OFF) SELECT t1.a, t1.c, t2.a, t2.c FROM plt1_ad t1 LEFT JOIN plt2_ad t2 ON (t1.a = t2.a AND t1.c = t2.c) WHERE t1.b < 10 ORDER BY t1.a; QUERY PLAN -------------------------------------------------------- Sort Sort Key: t1.a -> Hash Right Join Hash Cond: ((t2.a = t1.a) AND (t2.c = t1.c)) -> Append -> Seq Scan on plt2_ad_p1 t2_1 -> Seq Scan on plt2_ad_p2 t2_2 -> Seq Scan on plt2_ad_p3 t2_3 -> Seq Scan on plt2_ad_extra t2_4 -> Hash -> Append -> Seq Scan on plt1_ad_p1 t1_1 Filter: (b < 10) -> Seq Scan on plt1_ad_p2 t1_2 Filter: (b < 10) -> Seq Scan on plt1_ad_p3 t1_3 Filter: (b < 10) -> Seq Scan on plt1_ad_extra t1_4 Filter: (b < 10) (19 rows) because merge_null_partitions() does not consider matching the NULL partitions from both sides, but matches the NULL partition on the plt1_ad side and a dummy partition, resulting in a non-PWJ plan (see [1]). I overlooked this case when modifying that function. :-( Another patch attached to fix this issue (v30-0002-Fix-handling-of-NULL-partitions.patch). (We would not need to fix this, if we could handle the case where a dummy partition is on the nullable side of an outer join [1], but we can't, so I think it would be a good idea at least for now to match the NULL partitions from both sides to do PWJ.) Best regards, Etsuro Fujita [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=7ad6498fd5a654de6e743814c36cf619a3b5ddb6
Attachment
> On Tue, Dec 10, 2019 at 7:30 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: >> I don't see any issues in the latest version, but I think we >> need to polish the patch, so I'll do that. > > I noticed some issues. :-( I think we should address it before > polishing the patch. One thing I noticed is: the patch heavily > modifies the existing test cases in partition_join.sql to test the new > partition-matching algorithm, but I think we should leave those test > cases alone because we would handle the exiting test cases (except one > negative test case) as before (see the try_partitionwise_join() > change in the patch), so those test cases would be still needed to > test that. Attached is a proposed patch for that > (v30-0001-Improve-partition-matching-for-partitionwise-join.patch) > that 1) avoids modifying the existing test cases and 2) adds a > slightly modified version of the test cases proposed in the previous > patch to test the new algorithm. Though I omitted some test cases > that seem redundant to me and added a bit more test cases involving > NULL partitions and/or default partitions. The elapsed time to run > the partition_join.sql regression test increased from 741 ms (HEAD) to > 1086 ms in my environment, but I think that would be acceptable. I > fixed one white space issue, but other than that, no code/comment > changes. > > Another thing I noticed while working on the above is: the patch fails > to apply PWJ to this case: > > CREATE TABLE plt1_ad (a int, b int, c text) PARTITION BY LIST (c); > CREATE TABLE plt1_ad_p1 PARTITION OF plt1_ad FOR VALUES IN ('0001', '0003'); > CREATE TABLE plt1_ad_p2 PARTITION OF plt1_ad FOR VALUES IN ('0004', '0006'); > CREATE TABLE plt1_ad_p3 PARTITION OF plt1_ad FOR VALUES IN ('0008', '0009'); > CREATE TABLE plt1_ad_extra PARTITION OF plt1_ad FOR VALUES IN (NULL); > INSERT INTO plt1_ad SELECT i, i, to_char(i % 10, 'FM0000') FROM > generate_series(1, 299) i WHERE i % 10 NOT IN (0, 2, 5, 7); > INSERT INTO plt1_ad VALUES (-1, -1, NULL); > ANALYZE plt1_ad; > CREATE TABLE plt2_ad (a int, b int, c text) PARTITION BY LIST (c); > CREATE TABLE plt2_ad_p1 PARTITION OF plt2_ad FOR VALUES IN ('0002', '0003'); > CREATE TABLE plt2_ad_p2 PARTITION OF plt2_ad FOR VALUES IN ('0004', '0006'); > CREATE TABLE plt2_ad_p3 PARTITION OF plt2_ad FOR VALUES IN ('0007', '0009'); > CREATE TABLE plt2_ad_extra PARTITION OF plt2_ad FOR VALUES IN (NULL); > INSERT INTO plt2_ad SELECT i, i, to_char(i % 10, 'FM0000') FROM > generate_series(1, 299) i WHERE i % 10 NOT IN (0, 1, 5, 8); > INSERT INTO plt2_ad VALUES (-1, -1, NULL); > ANALYZE plt2_ad; > > EXPLAIN (COSTS OFF) > SELECT t1.a, t1.c, t2.a, t2.c FROM plt1_ad t1 LEFT JOIN plt2_ad t2 ON > (t1.a = t2.a AND t1.c = t2.c) WHERE t1.b < 10 ORDER BY t1.a; > QUERY PLAN > -------------------------------------------------------- > Sort > Sort Key: t1.a > -> Hash Right Join > Hash Cond: ((t2.a = t1.a) AND (t2.c = t1.c)) > -> Append > -> Seq Scan on plt2_ad_p1 t2_1 > -> Seq Scan on plt2_ad_p2 t2_2 > -> Seq Scan on plt2_ad_p3 t2_3 > -> Seq Scan on plt2_ad_extra t2_4 > -> Hash > -> Append > -> Seq Scan on plt1_ad_p1 t1_1 > Filter: (b < 10) > -> Seq Scan on plt1_ad_p2 t1_2 > Filter: (b < 10) > -> Seq Scan on plt1_ad_p3 t1_3 > Filter: (b < 10) > -> Seq Scan on plt1_ad_extra t1_4 > Filter: (b < 10) > (19 rows) > > because merge_null_partitions() does not consider matching the NULL > partitions from both sides, but matches the NULL partition on the > plt1_ad side and a dummy partition, resulting in a non-PWJ plan (see > [1]). I overlooked this case when modifying that function. :-( > Another patch attached to fix this issue > (v30-0002-Fix-handling-of-NULL-partitions.patch). (We would not need > to fix this, if we could handle the case where a dummy partition is on > the nullable side of an outer join [1], but we can't, so I think it > would be a good idea at least for now to match the NULL partitions > from both sides to do PWJ.) > > Best regards, > Etsuro Fujita > > [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=7ad6498fd5a654de6e743814c36cf619a3b5ddb6 > > <v30-0001-Improve-partition-matching-for-partitionwise-join.patch><v30-0002-Fix-handling-of-NULL-partitions.patch> Fujita-san, With respect to these two patches: They apply, compile, and pass all the regression tests. The code looks reasonable. There is stray whitespace in v30-0002: src/backend/partitioning/partbounds.c:4557: space before tab in indent. + outer_null_unmerged = true; I have added tests checking correctness and showing some partition pruning limitations. Find three patches, attached. The v31-0001-… patch merely applies your patches as a starting point for the next two patches. It is your work, not mine. The v31-0002-… patch adds more regression tests for range partitioning. The commit message contains my comments about that. The v31-0003-… patch adds more regression tests for list partitioning, and again, the commit message contains my commentsabout that. I hope this review is helpful. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
Hi Mark, On Tue, Jan 28, 2020 at 1:39 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > There is stray whitespace in v30-0002: > > src/backend/partitioning/partbounds.c:4557: space before tab in indent. > + outer_null_unmerged = true; Good catch! > I have added tests checking correctness and showing some partition pruning limitations. Find three patches, attached. > > The v31-0001-… patch merely applies your patches as a starting point for the next two patches. It is your work, not mine. > > The v31-0002-… patch adds more regression tests for range partitioning. The commit message contains my comments aboutthat. > > The v31-0003-… patch adds more regression tests for list partitioning, and again, the commit message contains my commentsabout that. The PWJ behavior shown by the tests you added looks interesting! I'll dig into it more closely. Thanks for the patches and review! Best regards, Etsuro Fujita
On Wed, Jan 29, 2020 at 9:15 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: > On Tue, Jan 28, 2020 at 1:39 PM Mark Dilger > <mark.dilger@enterprisedb.com> wrote: > > I have added tests checking correctness and showing some partition pruning limitations. Find three patches, attached. > > > > The v31-0001-… patch merely applies your patches as a starting point for the next two patches. It is your work, notmine. > > > > The v31-0002-… patch adds more regression tests for range partitioning. The commit message contains my comments aboutthat. > > > > The v31-0003-… patch adds more regression tests for list partitioning, and again, the commit message contains my commentsabout that. > > I'll dig into it more closely. I tested the new test patches. The patches are applied to the partition_join.sql regression test cleanly, but the new tests failed in my environment (even with make check LANG=C). I think I should set the right locale for the testing. Which one did you use? Maybe I'm missing something, but let me comment on the new tests. This is the one proposed in the v31-0002 patch: +EXPLAIN (COSTS OFF) +SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a) WHERE t1.a IN ('äbç', 'ὀδυσσεύς'); + QUERY PLAN +------------------------------------------------------------------ + Hash Join + Hash Cond: (t2.a = t1.a) + -> Append + -> Seq Scan on beta_a t2_1 + -> Seq Scan on beta_b t2_2 + -> Seq Scan on beta_c t2_3 + -> Seq Scan on beta_d t2_4 + -> Seq Scan on beta_e t2_5 + -> Seq Scan on beta_f t2_6 + -> Seq Scan on beta_g t2_7 + -> Seq Scan on beta_h t2_8 + -> Seq Scan on beta_default t2_9 + -> Hash + -> Append + -> Seq Scan on alpha_e t1_1 + Filter: (a = ANY ('{äbç,ὀδυσσεύς}'::text[])) + -> Seq Scan on alpha_default t1_2 + Filter: (a = ANY ('{äbç,ὀδυσσεύς}'::text[])) +(18 rows) The commit message says: When joining with alpha.a = beta.a and alpha.a IN ('äbç', 'ὀδυσσεύς') the planner does the right thing for one side of the query, but hits all partitions for the other side, which it doesn't need to do. Yeah, I agree that the resulting plan is not efficient. The reason for that would be that the planner doesn't generate a qual on the outer side matching the ScalarArrayOpExpr qual "a = ANY ('{äbç,ὀδυσσεύς}'::text[])" on the inner side, which I think would be a restriction caused by the equivalence machinery not by the partitionwise join logic IIUC. I think the critique would be useful, so I don't object to adding this test case, but the critique would be more about query planning that is actually not related to the partitionwise join logic, so I'm not sure that the partition_join.sql regression test is the best place to add it. I guess that there would be much better places than partition_join.sql. (This is nitpicking; but another thing I noticed about this test case is that the join query contains only a single join condition "t1.a = t2.a", but the queried tables alpha and beta are range-partitioned by multiple columns a and b, so the query should have a join condition for each of the columns like "t1.a = t2.a AND t1.b = t2.b" if adding this as a test case for partitionwise join.) I feel almost the same to other test cases in the patch (and the v31-0003 patch), except this one proposed in the v31-0003 patch: +CREATE TABLE alpha (a TEXT) PARTITION BY LIST(a); +CREATE TABLE alpha_a PARTITION OF alpha FOR VALUES IN ('Turkiye', 'TURKIYE'); +CREATE TABLE alpha_b PARTITION OF alpha FOR VALUES IN ('b?t', 'BIT'); +CREATE TABLE alpha_c PARTITION OF alpha FOR VALUES IN ('abc', 'ABC'); +CREATE TABLE alpha_d PARTITION OF alpha FOR VALUES IN ('aaa', 'cote', 'Gotz'); +CREATE TABLE alpha_e PARTITION OF alpha FOR VALUES IN ('?δυσσε??', '?ΔΥΣΣΕ?Σ'); +CREATE TABLE alpha_f PARTITION OF alpha FOR VALUES IN ('を読み取り用', 'にオープンできませんでした', NULL); +CREATE TABLE alpha_default PARTITION OF alpha DEFAULT; +CREATE TABLE beta (a TEXT) PARTITION BY LIST(a); +CREATE TABLE beta_a PARTITION OF beta FOR VALUES IN ('Turkiye', 'cote', '?ΔΥΣΣΕ?Σ'); +CREATE TABLE beta_b PARTITION OF beta FOR VALUES IN ('b?t', 'TURKIYE'); +CREATE TABLE beta_c PARTITION OF beta FOR VALUES IN ('abc', 'を読み取り用', 'にオープンできませんでした'); +CREATE TABLE beta_d PARTITION OF beta FOR VALUES IN ('aaa', 'Gotz', 'BIT', '?δυσσε??', 'ABC', NULL); +CREATE TABLE beta_default PARTITION OF beta DEFAULT; +EXPLAIN (COSTS OFF) +SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a) WHERE t1.a IS NULL; + QUERY PLAN +------------------------------------------- + Hash Join + Hash Cond: (t2.a = t1.a) + -> Append + -> Seq Scan on beta_d t2_1 + -> Seq Scan on beta_b t2_2 + -> Seq Scan on beta_a t2_3 + -> Seq Scan on beta_c t2_4 + -> Seq Scan on beta_default t2_5 + -> Hash + -> Seq Scan on alpha_f t1 + Filter: (a IS NULL) +(11 rows) Which made me notice an issue in the partitionwise join logic: partition_bounds_merge() assumes that all partitions of given relations are always present; in other words, it doesn't consider cases where some of the partitions have been pruned entirely. :-( If that function considered the cases, the above query would use partitionwise join, because alpha_f only matches beta_c. I don't think this is a bug, but it causes the planner not only to fail to chose partitionwise join but to waste CPU cycles to process pruned partitions, so I'll propose to address it. Attached is a patch for that (the v31-0004 patch) created on top of the main patch (the v31-0001 patch), which is also attached. With the attached, the plan for the above query would be changed to something like this: EXPLAIN (COSTS OFF) SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a) WHERE t1.a IS NULL; QUERY PLAN ------------------------------ Nested Loop Join Filter: (t1.a = t2.a) -> Seq Scan on alpha_f t1 Filter: (a IS NULL) -> Seq Scan on beta_c t2 (5 rows) Thanks again, Mark! Best regards, Etsuro Fujita
Attachment
> On Feb 5, 2020, at 4:51 AM, Etsuro Fujita <etsuro.fujita@gmail.com> wrote: > > On Wed, Jan 29, 2020 at 9:15 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: >> On Tue, Jan 28, 2020 at 1:39 PM Mark Dilger >> <mark.dilger@enterprisedb.com> wrote: >>> I have added tests checking correctness and showing some partition pruning limitations. Find three patches, attached. >>> >>> The v31-0001-… patch merely applies your patches as a starting point for the next two patches. It is your work, notmine. >>> >>> The v31-0002-… patch adds more regression tests for range partitioning. The commit message contains my comments aboutthat. >>> >>> The v31-0003-… patch adds more regression tests for list partitioning, and again, the commit message contains my commentsabout that. >> >> I'll dig into it more closely. > > I tested the new test patches. The patches are applied to the > partition_join.sql regression test cleanly, but the new tests failed > in my environment (even with make check LANG=C). I think I should set > the right locale for the testing. Which one did you use? I did not set a locale in the tests. My environment has: LANG="en_US.UTF-8" LC_COLLATE="en_US.UTF-8" LC_CTYPE="en_US.UTF-8" LC_MESSAGES="en_US.UTF-8" LC_MONETARY="en_US.UTF-8" LC_NUMERIC="en_US.UTF-8" LC_TIME="en_US.UTF-8" LC_ALL= > Maybe I'm > missing something, but let me comment on the new tests. This is the > one proposed in the v31-0002 patch: > > +EXPLAIN (COSTS OFF) > +SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a) > WHERE t1.a IN ('äbç', 'ὀδυσσεύς'); > + QUERY PLAN > +------------------------------------------------------------------ > + Hash Join > + Hash Cond: (t2.a = t1.a) > + -> Append > + -> Seq Scan on beta_a t2_1 > + -> Seq Scan on beta_b t2_2 > + -> Seq Scan on beta_c t2_3 > + -> Seq Scan on beta_d t2_4 > + -> Seq Scan on beta_e t2_5 > + -> Seq Scan on beta_f t2_6 > + -> Seq Scan on beta_g t2_7 > + -> Seq Scan on beta_h t2_8 > + -> Seq Scan on beta_default t2_9 > + -> Hash > + -> Append > + -> Seq Scan on alpha_e t1_1 > + Filter: (a = ANY ('{äbç,ὀδυσσεύς}'::text[])) > + -> Seq Scan on alpha_default t1_2 > + Filter: (a = ANY ('{äbç,ὀδυσσεύς}'::text[])) > +(18 rows) > > The commit message says: > > When joining with > > alpha.a = beta.a and alpha.a IN ('äbç', 'ὀδυσσεύς') > > the planner does the right thing for one side of the query, but > hits all partitions for the other side, which it doesn't need to > do. > > Yeah, I agree that the resulting plan is not efficient. The reason > for that would be that the planner doesn't generate a qual on the > outer side matching the ScalarArrayOpExpr qual "a = ANY > ('{äbç,ὀδυσσεύς}'::text[])" on the inner side, which I think would be > a restriction caused by the equivalence machinery not by the > partitionwise join logic IIUC. It’s fine if this is beyond the scope of the patch. > I think the critique would be useful, > so I don't object to adding this test case, but the critique would be > more about query planning that is actually not related to the > partitionwise join logic, so I'm not sure that the partition_join.sql > regression test is the best place to add it. I guess that there would > be much better places than partition_join.sql. You don’t need to add the test anywhere. It’s enough for me that you looked at it and considered whether the partition-wisejoin patch should do anything differently in this case. Again, it sounds like this is beyond the scope ofthe patch. > (This is nitpicking; > but another thing I noticed about this test case is that the join > query contains only a single join condition "t1.a = t2.a", but the > queried tables alpha and beta are range-partitioned by multiple > columns a and b, so the query should have a join condition for each of > the columns like "t1.a = t2.a AND t1.b = t2.b" if adding this as a > test case for partitionwise join.) Well, it is important that partition-wise join does not break such queries. I added the column ‘b’ to the partitioning logicto verify that did not confuse the code in your patch. > I feel almost the same to other > test cases in the patch (and the v31-0003 patch), except this one > proposed in the v31-0003 patch: > > +CREATE TABLE alpha (a TEXT) PARTITION BY LIST(a); > +CREATE TABLE alpha_a PARTITION OF alpha FOR VALUES IN ('Turkiye', 'TURKIYE'); > +CREATE TABLE alpha_b PARTITION OF alpha FOR VALUES IN ('b?t', 'BIT'); > +CREATE TABLE alpha_c PARTITION OF alpha FOR VALUES IN ('abc', 'ABC'); > +CREATE TABLE alpha_d PARTITION OF alpha FOR VALUES IN ('aaa', 'cote', 'Gotz'); > +CREATE TABLE alpha_e PARTITION OF alpha FOR VALUES IN ('?δυσσε??', '?ΔΥΣΣΕ?Σ'); > +CREATE TABLE alpha_f PARTITION OF alpha FOR VALUES IN ('を読み取り用', > 'にオープンできませんでした', NULL); > +CREATE TABLE alpha_default PARTITION OF alpha DEFAULT; > +CREATE TABLE beta (a TEXT) PARTITION BY LIST(a); > +CREATE TABLE beta_a PARTITION OF beta FOR VALUES IN ('Turkiye', > 'cote', '?ΔΥΣΣΕ?Σ'); > +CREATE TABLE beta_b PARTITION OF beta FOR VALUES IN ('b?t', 'TURKIYE'); > +CREATE TABLE beta_c PARTITION OF beta FOR VALUES IN ('abc', 'を読み取り用', > 'にオープンできませんでした'); > +CREATE TABLE beta_d PARTITION OF beta FOR VALUES IN ('aaa', 'Gotz', > 'BIT', '?δυσσε??', 'ABC', NULL); > +CREATE TABLE beta_default PARTITION OF beta DEFAULT; > > +EXPLAIN (COSTS OFF) > +SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a) > WHERE t1.a IS NULL; > + QUERY PLAN > +------------------------------------------- > + Hash Join > + Hash Cond: (t2.a = t1.a) > + -> Append > + -> Seq Scan on beta_d t2_1 > + -> Seq Scan on beta_b t2_2 > + -> Seq Scan on beta_a t2_3 > + -> Seq Scan on beta_c t2_4 > + -> Seq Scan on beta_default t2_5 > + -> Hash > + -> Seq Scan on alpha_f t1 > + Filter: (a IS NULL) > +(11 rows) > > Which made me notice an issue in the partitionwise join logic: > partition_bounds_merge() assumes that all partitions of given > relations are always present; in other words, it doesn't consider > cases where some of the partitions have been pruned entirely. :-( If > that function considered the cases, the above query would use > partitionwise join, because alpha_f only matches beta_c. I don't > think this is a bug, but it causes the planner not only to fail to > chose partitionwise join but to waste CPU cycles to process pruned > partitions, so I'll propose to address it. Attached is a patch for > that (the v31-0004 patch) created on top of the main patch (the > v31-0001 patch), which is also attached. With the attached, the plan > for the above query would be changed to something like this: > > EXPLAIN (COSTS OFF) > SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a = > t2.a) WHERE t1.a IS NULL; > QUERY PLAN > ------------------------------ > Nested Loop > Join Filter: (t1.a = t2.a) > -> Seq Scan on alpha_f t1 > Filter: (a IS NULL) > -> Seq Scan on beta_c t2 > (5 rows) > > Thanks again, Mark! Thank you for working on this important patch! — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> On Feb 5, 2020, at 4:51 AM, Etsuro Fujita <etsuro.fujita@gmail.com> wrote: > > <v31-0001-Applying-Etsuro-Fujita-s-patches.patch><v31-0004-Consider-pruned-partitions.patch> The patches apply and pass all tests. A review of the patch vs. master looks reasonable. The partition_join.sql test has multiple levels of partitioning, but when your patch extends that test with “advanced partition-wisejoin”, none of the tables for the new section have multiple levels. I spent a little while reviewing the codeand inventing multiple level partitioning tests for advanced partition-wise join and did not encounter any problems. I don’t care whether you use this particular example, but do you want to have multiple level partitioning in thenew test section? CREATE TABLE alpha (a double precision, b double precision) PARTITION BY RANGE (a); CREATE TABLE alpha_neg PARTITION OF alpha FOR VALUES FROM ('-Infinity') TO (0) PARTITION BY RANGE (b); CREATE TABLE alpha_pos PARTITION OF alpha FOR VALUES FROM (0) TO ('Infinity') PARTITION BY RANGE (b); CREATE TABLE alpha_nan PARTITION OF alpha FOR VALUES FROM ('Infinity') TO ('NaN'); CREATE TABLE alpha_neg_neg PARTITION OF alpha_neg FOR VALUES FROM ('-Infinity') TO (0); CREATE TABLE alpha_neg_pos PARTITION OF alpha_neg FOR VALUES FROM (0) TO ('Infinity'); CREATE TABLE alpha_neg_nan PARTITION OF alpha_neg FOR VALUES FROM ('Infinity') TO ('NaN'); CREATE TABLE alpha_pos_neg PARTITION OF alpha_pos FOR VALUES FROM ('-Infinity') TO (0); CREATE TABLE alpha_pos_pos PARTITION OF alpha_pos FOR VALUES FROM (0) TO ('Infinity'); CREATE TABLE alpha_pos_nan PARTITION OF alpha_pos FOR VALUES FROM ('Infinity') TO ('NaN'); INSERT INTO alpha (a, b) (SELECT * FROM (VALUES (-1.0::float8), (0.0::float8), (1.0::float8), ('Infinity'::float8)) a, (VALUES (-1.0::float8), (0.0::float8), (1.0::float8), ('Infinity'::float8)) b ); ANALYZE alpha; ANALYZE alpha_neg; ANALYZE alpha_pos; ANALYZE alpha_nan; ANALYZE alpha_neg_neg; ANALYZE alpha_neg_pos; ANALYZE alpha_neg_nan; ANALYZE alpha_pos_neg; ANALYZE alpha_pos_pos; ANALYZE alpha_pos_nan; CREATE TABLE beta (a double precision, b double precision) PARTITION BY RANGE (a, b); CREATE TABLE beta_lo PARTITION OF beta FOR VALUES FROM (-5, -5) TO (0, 0); CREATE TABLE beta_me PARTITION OF beta FOR VALUES FROM (0, 0) TO (0, 5); CREATE TABLE beta_hi PARTITION OF beta FOR VALUES FROM (0, 5) TO (5, 5); INSERT INTO beta (a, b) (SELECT * FROM (VALUES (-1.0::float8), (0.0::float8), (1.0::float8)) a, (VALUES (-1.0::float8), (0.0::float8), (1.0::float8)) b ); ANALYZE beta; ANALYZE beta_lo; ANALYZE beta_me; ANALYZE beta_hi; EXPLAIN SELECT * FROM alpha INNER JOIN beta ON (alpha.a = beta.a AND alpha.b = beta.b) WHERE alpha.a = 1 AND beta.b = 1; QUERY PLAN ------------------------------------------------------------------------------- Nested Loop (cost=0.00..2.11 rows=1 width=32) -> Seq Scan on alpha_pos_pos alpha (cost=0.00..1.06 rows=1 width=16) Filter: ((b = '1'::double precision) AND (a = '1'::double precision)) -> Seq Scan on beta_hi beta (cost=0.00..1.04 rows=1 width=16) Filter: ((b = '1'::double precision) AND (a = '1'::double precision)) (5 rows) — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Feb 6, 2020 at 2:03 AM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > > On Feb 5, 2020, at 4:51 AM, Etsuro Fujita <etsuro.fujita@gmail.com> wrote: > >> On Tue, Jan 28, 2020 at 1:39 PM Mark Dilger > >> <mark.dilger@enterprisedb.com> wrote: > >>> I have added tests checking correctness and showing some partition pruning limitations. Find three patches, attached. > >>> > >>> The v31-0001-… patch merely applies your patches as a starting point for the next two patches. It is your work, notmine. > >>> > >>> The v31-0002-… patch adds more regression tests for range partitioning. The commit message contains my comments aboutthat. > >>> > >>> The v31-0003-… patch adds more regression tests for list partitioning, and again, the commit message contains my commentsabout that. > > I tested the new test patches. The patches are applied to the > > partition_join.sql regression test cleanly, but the new tests failed > > in my environment (even with make check LANG=C). I think I should set > > the right locale for the testing. Which one did you use? > > I did not set a locale in the tests. My environment has: > > LANG="en_US.UTF-8" > LC_COLLATE="en_US.UTF-8" > LC_CTYPE="en_US.UTF-8" > LC_MESSAGES="en_US.UTF-8" > LC_MONETARY="en_US.UTF-8" > LC_NUMERIC="en_US.UTF-8" > LC_TIME="en_US.UTF-8" > LC_ALL= Thanks for the information! > > Maybe I'm > > missing something, but let me comment on the new tests. This is the > > one proposed in the v31-0002 patch: > > > > +EXPLAIN (COSTS OFF) > > +SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a) > > WHERE t1.a IN ('äbç', 'ὀδυσσεύς'); > > + QUERY PLAN > > +------------------------------------------------------------------ > > + Hash Join > > + Hash Cond: (t2.a = t1.a) > > + -> Append > > + -> Seq Scan on beta_a t2_1 > > + -> Seq Scan on beta_b t2_2 > > + -> Seq Scan on beta_c t2_3 > > + -> Seq Scan on beta_d t2_4 > > + -> Seq Scan on beta_e t2_5 > > + -> Seq Scan on beta_f t2_6 > > + -> Seq Scan on beta_g t2_7 > > + -> Seq Scan on beta_h t2_8 > > + -> Seq Scan on beta_default t2_9 > > + -> Hash > > + -> Append > > + -> Seq Scan on alpha_e t1_1 > > + Filter: (a = ANY ('{äbç,ὀδυσσεύς}'::text[])) > > + -> Seq Scan on alpha_default t1_2 > > + Filter: (a = ANY ('{äbç,ὀδυσσεύς}'::text[])) > > +(18 rows) > > > > The commit message says: > > > > When joining with > > > > alpha.a = beta.a and alpha.a IN ('äbç', 'ὀδυσσεύς') > > > > the planner does the right thing for one side of the query, but > > hits all partitions for the other side, which it doesn't need to > > do. > > > > Yeah, I agree that the resulting plan is not efficient. The reason > > for that would be that the planner doesn't generate a qual on the > > outer side matching the ScalarArrayOpExpr qual "a = ANY > > ('{äbç,ὀδυσσεύς}'::text[])" on the inner side, which I think would be > > a restriction caused by the equivalence machinery not by the > > partitionwise join logic IIUC. > > It’s fine if this is beyond the scope of the patch. > > > I think the critique would be useful, > > so I don't object to adding this test case, but the critique would be > > more about query planning that is actually not related to the > > partitionwise join logic, so I'm not sure that the partition_join.sql > > regression test is the best place to add it. I guess that there would > > be much better places than partition_join.sql. > > You don’t need to add the test anywhere. It’s enough for me that you looked at it and considered whether the partition-wisejoin patch should do anything differently in this case. Again, it sounds like this is beyond the scope ofthe patch. OK > > (This is nitpicking; > > but another thing I noticed about this test case is that the join > > query contains only a single join condition "t1.a = t2.a", but the > > queried tables alpha and beta are range-partitioned by multiple > > columns a and b, so the query should have a join condition for each of > > the columns like "t1.a = t2.a AND t1.b = t2.b" if adding this as a > > test case for partitionwise join.) > > Well, it is important that partition-wise join does not break such queries. I added the column ‘b’ to the partitioninglogic to verify that did not confuse the code in your patch. OK, thanks for the testing! Best regards, Etsuro Fujita
On Thu, Feb 6, 2020 at 3:55 AM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > The patches apply and pass all tests. A review of the patch vs. master looks reasonable. Thanks for the review! > The partition_join.sql test has multiple levels of partitioning, but when your patch extends that test with “advanced partition-wisejoin”, none of the tables for the new section have multiple levels. I spent a little while reviewing the codeand inventing multiple level partitioning tests for advanced partition-wise join and did not encounter any problems. I don’t care whether you use this particular example, but do you want to have multiple level partitioning in thenew test section? Yes, I do. > CREATE TABLE alpha (a double precision, b double precision) PARTITION BY RANGE (a); > CREATE TABLE alpha_neg PARTITION OF alpha FOR VALUES FROM ('-Infinity') TO (0) PARTITION BY RANGE (b); > CREATE TABLE alpha_pos PARTITION OF alpha FOR VALUES FROM (0) TO ('Infinity') PARTITION BY RANGE (b); > CREATE TABLE alpha_nan PARTITION OF alpha FOR VALUES FROM ('Infinity') TO ('NaN'); > CREATE TABLE alpha_neg_neg PARTITION OF alpha_neg FOR VALUES FROM ('-Infinity') TO (0); > CREATE TABLE alpha_neg_pos PARTITION OF alpha_neg FOR VALUES FROM (0) TO ('Infinity'); > CREATE TABLE alpha_neg_nan PARTITION OF alpha_neg FOR VALUES FROM ('Infinity') TO ('NaN'); > CREATE TABLE alpha_pos_neg PARTITION OF alpha_pos FOR VALUES FROM ('-Infinity') TO (0); > CREATE TABLE alpha_pos_pos PARTITION OF alpha_pos FOR VALUES FROM (0) TO ('Infinity'); > CREATE TABLE alpha_pos_nan PARTITION OF alpha_pos FOR VALUES FROM ('Infinity') TO ('NaN'); > INSERT INTO alpha (a, b) > (SELECT * FROM > (VALUES (-1.0::float8), (0.0::float8), (1.0::float8), ('Infinity'::float8)) a, > (VALUES (-1.0::float8), (0.0::float8), (1.0::float8), ('Infinity'::float8)) b > ); > ANALYZE alpha; > ANALYZE alpha_neg; > ANALYZE alpha_pos; > ANALYZE alpha_nan; > ANALYZE alpha_neg_neg; > ANALYZE alpha_neg_pos; > ANALYZE alpha_neg_nan; > ANALYZE alpha_pos_neg; > ANALYZE alpha_pos_pos; > ANALYZE alpha_pos_nan; > CREATE TABLE beta (a double precision, b double precision) PARTITION BY RANGE (a, b); > CREATE TABLE beta_lo PARTITION OF beta FOR VALUES FROM (-5, -5) TO (0, 0); > CREATE TABLE beta_me PARTITION OF beta FOR VALUES FROM (0, 0) TO (0, 5); > CREATE TABLE beta_hi PARTITION OF beta FOR VALUES FROM (0, 5) TO (5, 5); > INSERT INTO beta (a, b) > (SELECT * FROM > (VALUES (-1.0::float8), (0.0::float8), (1.0::float8)) a, > (VALUES (-1.0::float8), (0.0::float8), (1.0::float8)) b > ); > ANALYZE beta; > ANALYZE beta_lo; > ANALYZE beta_me; > ANALYZE beta_hi; > EXPLAIN SELECT * FROM alpha INNER JOIN beta ON (alpha.a = beta.a AND alpha.b = beta.b) WHERE alpha.a = 1 AND beta.b = 1; > QUERY PLAN > ------------------------------------------------------------------------------- > Nested Loop (cost=0.00..2.11 rows=1 width=32) > -> Seq Scan on alpha_pos_pos alpha (cost=0.00..1.06 rows=1 width=16) > Filter: ((b = '1'::double precision) AND (a = '1'::double precision)) > -> Seq Scan on beta_hi beta (cost=0.00..1.04 rows=1 width=16) > Filter: ((b = '1'::double precision) AND (a = '1'::double precision)) > (5 rows) Hmm, I'm not sure this is a good test case for that, because this result would be due to partition pruning applied to each side of the join before considering partition-wise join; you could get the same result even with enable_partitionwise_join=off. I think it's important that the partition-wise join logic doesn't break this query, though. Best regards, Etsuro Fujita
On Fri, Feb 7, 2020 at 9:57 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: > On Thu, Feb 6, 2020 at 3:55 AM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > > The patches apply and pass all tests. A review of the patch vs. master looks reasonable. I've merged the patches. Attached is a new version of the patch. > > The partition_join.sql test has multiple levels of partitioning, but when your patch extends that test with “advancedpartition-wise join”, none of the tables for the new section have multiple levels. I spent a little while reviewingthe code and inventing multiple level partitioning tests for advanced partition-wise join and did not encounterany problems. I don’t care whether you use this particular example, but do you want to have multiple level partitioningin the new test section? > > Yes, I do. > > > CREATE TABLE alpha (a double precision, b double precision) PARTITION BY RANGE (a); > > CREATE TABLE alpha_neg PARTITION OF alpha FOR VALUES FROM ('-Infinity') TO (0) PARTITION BY RANGE (b); > > CREATE TABLE alpha_pos PARTITION OF alpha FOR VALUES FROM (0) TO ('Infinity') PARTITION BY RANGE (b); > > CREATE TABLE alpha_nan PARTITION OF alpha FOR VALUES FROM ('Infinity') TO ('NaN'); > > CREATE TABLE alpha_neg_neg PARTITION OF alpha_neg FOR VALUES FROM ('-Infinity') TO (0); > > CREATE TABLE alpha_neg_pos PARTITION OF alpha_neg FOR VALUES FROM (0) TO ('Infinity'); > > CREATE TABLE alpha_neg_nan PARTITION OF alpha_neg FOR VALUES FROM ('Infinity') TO ('NaN'); > > CREATE TABLE alpha_pos_neg PARTITION OF alpha_pos FOR VALUES FROM ('-Infinity') TO (0); > > CREATE TABLE alpha_pos_pos PARTITION OF alpha_pos FOR VALUES FROM (0) TO ('Infinity'); > > CREATE TABLE alpha_pos_nan PARTITION OF alpha_pos FOR VALUES FROM ('Infinity') TO ('NaN'); > > INSERT INTO alpha (a, b) > > (SELECT * FROM > > (VALUES (-1.0::float8), (0.0::float8), (1.0::float8), ('Infinity'::float8)) a, > > (VALUES (-1.0::float8), (0.0::float8), (1.0::float8), ('Infinity'::float8)) b > > ); > > ANALYZE alpha; > > ANALYZE alpha_neg; > > ANALYZE alpha_pos; > > ANALYZE alpha_nan; > > ANALYZE alpha_neg_neg; > > ANALYZE alpha_neg_pos; > > ANALYZE alpha_neg_nan; > > ANALYZE alpha_pos_neg; > > ANALYZE alpha_pos_pos; > > ANALYZE alpha_pos_nan; > > CREATE TABLE beta (a double precision, b double precision) PARTITION BY RANGE (a, b); > > CREATE TABLE beta_lo PARTITION OF beta FOR VALUES FROM (-5, -5) TO (0, 0); > > CREATE TABLE beta_me PARTITION OF beta FOR VALUES FROM (0, 0) TO (0, 5); > > CREATE TABLE beta_hi PARTITION OF beta FOR VALUES FROM (0, 5) TO (5, 5); > > INSERT INTO beta (a, b) > > (SELECT * FROM > > (VALUES (-1.0::float8), (0.0::float8), (1.0::float8)) a, > > (VALUES (-1.0::float8), (0.0::float8), (1.0::float8)) b > > ); > > ANALYZE beta; > > ANALYZE beta_lo; > > ANALYZE beta_me; > > ANALYZE beta_hi; > > EXPLAIN SELECT * FROM alpha INNER JOIN beta ON (alpha.a = beta.a AND alpha.b = beta.b) WHERE alpha.a = 1 AND beta.b =1; > > QUERY PLAN > > ------------------------------------------------------------------------------- > > Nested Loop (cost=0.00..2.11 rows=1 width=32) > > -> Seq Scan on alpha_pos_pos alpha (cost=0.00..1.06 rows=1 width=16) > > Filter: ((b = '1'::double precision) AND (a = '1'::double precision)) > > -> Seq Scan on beta_hi beta (cost=0.00..1.04 rows=1 width=16) > > Filter: ((b = '1'::double precision) AND (a = '1'::double precision)) > > (5 rows) > > Hmm, I'm not sure this is a good test case for that, because this > result would be due to partition pruning applied to each side of the > join before considering partition-wise join; you could get the same > result even with enable_partitionwise_join=off. I think it's > important that the partition-wise join logic doesn't break this query, > though. I think this would be beyond the scope of the patch, so I added different test cases that I think would be better as ones for multi-level partitioning. Thanks! Best regards, Etsuro Fujita
Attachment
Re: [HACKERS] advanced partition matching algorithm forpartition-wise join
From
Ashutosh Bapat
Date:
Hi Fujita-san,
I reviewed the patch. Except for the logic of matching the pairs of partitions from already merged partitions, I think the code changes are good. But there are several places where it needs further changes to comments. The attached patch has those. I have described some of them below.
+ * We can not perform partition-wise join if either of the joining
+ * relations is not partitioned.
We are consistently using partitionwise instead of partition-wise.
+ /*
+ * See if the partition bounds for inputs are exactly the same, in
+ * which case we don't need to work hard: partitions with the same
+ * partition indexes will form join pairs, and the join rel will have
+ * the same partition bounds as inputs; otherwise try to merge the
+ * partition bounds along with generating join pairs.
Phrase "joining relations" is better than "inputs", IMO. Updated in the
attached patch.
+ /*
+ * If the partition bounds for the join rel are not merged ones,
+ * inputs are guaranteed to have the same partition bounds, so
+ * partitions with the same partition indexes will form join pairs;
+ * else let get_matching_part_pairs() do the work.
+ */
+ if (joinrel->merged)
+ {
This condition in the comment is opposite to the condition being checked in
code, so looks confusing. BTW this comment is also repeated around line 1405.
See attached patch for correction.
+ /*
+ * If this segment of the join is empty, it means that this segment
"partition of the join" looks consistent with other usages than "segment of the
join". Modified in the attached patch.
+ /*
+ * Get a relids set of partition(s) involved in this join segment that
+ * are from the rel1 side.
+ */
+ child_relids1 = bms_intersect(child_joinrel->relids,
+ rel1->all_partrels);
The partition bounds are sorted by their values. Even for a list partitioned
table, the bounds are sorted by the least partition value. We do not allow
multiple paritions from one side to be joined with one partition on the other
and vice versa. All this together means that the partitions of the join
relation are formed by joining partitions from either side in the order of
their bounds. This means that the matching pairs of partitions can be found by
matching relids of partitions of join with those of the joining relation by
traversing partitions from all the three relations only once in the order they
appears in partition bounds of corresponding relations. If we use this
algorithm, we don't need all_partrels to be collected and also don't need to
search base or join relation. That, I think, will reduce the time and space
complexity of this logic. Am I missing something? As a side effect it won't
require any special handling for base and join relation.
+ /*
+ * Get a child rel for rel1 with the relids. Note that we should have
+ * the child rel even if rel1 is a join rel, because in that case the
+ * partitions specified in the relids would have matching/overlapping
+ * boundaries, so those partitions should be considered as ones to be
+ * joined even when planning partitionwise joins of rel1, meaning that
+ * the child rel would have been built by the time we get here.
+ */
+ if (rel1_is_simple)
This variable is used only in one place. So instead we should the expression
assigning the value to it. Changed in the attached patch.
- rel->nparts = 0;
+ rel->nparts = -1;
I think we need to add comments around various values that nparts can take. How
about like something attached.
+ case PARTITION_STRATEGY_HASH:
+ merged_bounds = NULL;
I think, we need to explain why we aren't merging hash partition bounds. AFAIU,
the reason is thus: When the modulo of used for partition mapping (i.e. maximum
number of has partitions) is same, their partition bounds are same and do not
need merging. If the maximum number of partitions is different for both the
joining relations, there's high probability that one partition on one side will
join with multiple partitions on the other side. So exact partition bounds
match will work in most of the cases. The cases otherwise are not so common to
spend the effort in coding and planning.
I have added this explanation in the patch. Don't know if it's there written
somewhere already.
+ if (part_index == -1)
+ return -1;
+ } while (is_dummy_partition(rel, part_index));
I understand why we are skipping NULL positions. I am not sure why are we are
skipping over RelOptInfos which exist but are marked as dummy; we can still
create a join pair with those partitions.
+/*
+ * get_merged_range_bounds
+ * Given the bounds of range partitions to be join, determine the range
s/join/joined/
There are more changes to comments, where I thought that the comments are
required or existing comments need more clarification. Please review the
attached patch. This patch is created on top of
v32-0001-Improve-partition-matching-for-partitionwise-join.
+ * relations is not partitioned.
We are consistently using partitionwise instead of partition-wise.
+ /*
+ * See if the partition bounds for inputs are exactly the same, in
+ * which case we don't need to work hard: partitions with the same
+ * partition indexes will form join pairs, and the join rel will have
+ * the same partition bounds as inputs; otherwise try to merge the
+ * partition bounds along with generating join pairs.
Phrase "joining relations" is better than "inputs", IMO. Updated in the
attached patch.
+ /*
+ * If the partition bounds for the join rel are not merged ones,
+ * inputs are guaranteed to have the same partition bounds, so
+ * partitions with the same partition indexes will form join pairs;
+ * else let get_matching_part_pairs() do the work.
+ */
+ if (joinrel->merged)
+ {
This condition in the comment is opposite to the condition being checked in
code, so looks confusing. BTW this comment is also repeated around line 1405.
See attached patch for correction.
+ /*
+ * If this segment of the join is empty, it means that this segment
"partition of the join" looks consistent with other usages than "segment of the
join". Modified in the attached patch.
+ /*
+ * Get a relids set of partition(s) involved in this join segment that
+ * are from the rel1 side.
+ */
+ child_relids1 = bms_intersect(child_joinrel->relids,
+ rel1->all_partrels);
The partition bounds are sorted by their values. Even for a list partitioned
table, the bounds are sorted by the least partition value. We do not allow
multiple paritions from one side to be joined with one partition on the other
and vice versa. All this together means that the partitions of the join
relation are formed by joining partitions from either side in the order of
their bounds. This means that the matching pairs of partitions can be found by
matching relids of partitions of join with those of the joining relation by
traversing partitions from all the three relations only once in the order they
appears in partition bounds of corresponding relations. If we use this
algorithm, we don't need all_partrels to be collected and also don't need to
search base or join relation. That, I think, will reduce the time and space
complexity of this logic. Am I missing something? As a side effect it won't
require any special handling for base and join relation.
+ /*
+ * Get a child rel for rel1 with the relids. Note that we should have
+ * the child rel even if rel1 is a join rel, because in that case the
+ * partitions specified in the relids would have matching/overlapping
+ * boundaries, so those partitions should be considered as ones to be
+ * joined even when planning partitionwise joins of rel1, meaning that
+ * the child rel would have been built by the time we get here.
+ */
+ if (rel1_is_simple)
This variable is used only in one place. So instead we should the expression
assigning the value to it. Changed in the attached patch.
- rel->nparts = 0;
+ rel->nparts = -1;
I think we need to add comments around various values that nparts can take. How
about like something attached.
+ case PARTITION_STRATEGY_HASH:
+ merged_bounds = NULL;
I think, we need to explain why we aren't merging hash partition bounds. AFAIU,
the reason is thus: When the modulo of used for partition mapping (i.e. maximum
number of has partitions) is same, their partition bounds are same and do not
need merging. If the maximum number of partitions is different for both the
joining relations, there's high probability that one partition on one side will
join with multiple partitions on the other side. So exact partition bounds
match will work in most of the cases. The cases otherwise are not so common to
spend the effort in coding and planning.
I have added this explanation in the patch. Don't know if it's there written
somewhere already.
+ if (part_index == -1)
+ return -1;
+ } while (is_dummy_partition(rel, part_index));
I understand why we are skipping NULL positions. I am not sure why are we are
skipping over RelOptInfos which exist but are marked as dummy; we can still
create a join pair with those partitions.
+/*
+ * get_merged_range_bounds
+ * Given the bounds of range partitions to be join, determine the range
s/join/joined/
There are more changes to comments, where I thought that the comments are
required or existing comments need more clarification. Please review the
attached patch. This patch is created on top of
v32-0001-Improve-partition-matching-for-partitionwise-join.
On Mon, Feb 10, 2020 at 5:14 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
On Fri, Feb 7, 2020 at 9:57 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> On Thu, Feb 6, 2020 at 3:55 AM Mark Dilger <mark.dilger@enterprisedb.com> wrote:
> > The patches apply and pass all tests. A review of the patch vs. master looks reasonable.
I've merged the patches. Attached is a new version of the patch.
> > The partition_join.sql test has multiple levels of partitioning, but when your patch extends that test with “advanced partition-wise join”, none of the tables for the new section have multiple levels. I spent a little while reviewing the code and inventing multiple level partitioning tests for advanced partition-wise join and did not encounter any problems. I don’t care whether you use this particular example, but do you want to have multiple level partitioning in the new test section?
>
> Yes, I do.
>
> > CREATE TABLE alpha (a double precision, b double precision) PARTITION BY RANGE (a);
> > CREATE TABLE alpha_neg PARTITION OF alpha FOR VALUES FROM ('-Infinity') TO (0) PARTITION BY RANGE (b);
> > CREATE TABLE alpha_pos PARTITION OF alpha FOR VALUES FROM (0) TO ('Infinity') PARTITION BY RANGE (b);
> > CREATE TABLE alpha_nan PARTITION OF alpha FOR VALUES FROM ('Infinity') TO ('NaN');
> > CREATE TABLE alpha_neg_neg PARTITION OF alpha_neg FOR VALUES FROM ('-Infinity') TO (0);
> > CREATE TABLE alpha_neg_pos PARTITION OF alpha_neg FOR VALUES FROM (0) TO ('Infinity');
> > CREATE TABLE alpha_neg_nan PARTITION OF alpha_neg FOR VALUES FROM ('Infinity') TO ('NaN');
> > CREATE TABLE alpha_pos_neg PARTITION OF alpha_pos FOR VALUES FROM ('-Infinity') TO (0);
> > CREATE TABLE alpha_pos_pos PARTITION OF alpha_pos FOR VALUES FROM (0) TO ('Infinity');
> > CREATE TABLE alpha_pos_nan PARTITION OF alpha_pos FOR VALUES FROM ('Infinity') TO ('NaN');
> > INSERT INTO alpha (a, b)
> > (SELECT * FROM
> > (VALUES (-1.0::float8), (0.0::float8), (1.0::float8), ('Infinity'::float8)) a,
> > (VALUES (-1.0::float8), (0.0::float8), (1.0::float8), ('Infinity'::float8)) b
> > );
> > ANALYZE alpha;
> > ANALYZE alpha_neg;
> > ANALYZE alpha_pos;
> > ANALYZE alpha_nan;
> > ANALYZE alpha_neg_neg;
> > ANALYZE alpha_neg_pos;
> > ANALYZE alpha_neg_nan;
> > ANALYZE alpha_pos_neg;
> > ANALYZE alpha_pos_pos;
> > ANALYZE alpha_pos_nan;
> > CREATE TABLE beta (a double precision, b double precision) PARTITION BY RANGE (a, b);
> > CREATE TABLE beta_lo PARTITION OF beta FOR VALUES FROM (-5, -5) TO (0, 0);
> > CREATE TABLE beta_me PARTITION OF beta FOR VALUES FROM (0, 0) TO (0, 5);
> > CREATE TABLE beta_hi PARTITION OF beta FOR VALUES FROM (0, 5) TO (5, 5);
> > INSERT INTO beta (a, b)
> > (SELECT * FROM
> > (VALUES (-1.0::float8), (0.0::float8), (1.0::float8)) a,
> > (VALUES (-1.0::float8), (0.0::float8), (1.0::float8)) b
> > );
> > ANALYZE beta;
> > ANALYZE beta_lo;
> > ANALYZE beta_me;
> > ANALYZE beta_hi;
> > EXPLAIN SELECT * FROM alpha INNER JOIN beta ON (alpha.a = beta.a AND alpha.b = beta.b) WHERE alpha.a = 1 AND beta.b = 1;
> > QUERY PLAN
> > -------------------------------------------------------------------------------
> > Nested Loop (cost=0.00..2.11 rows=1 width=32)
> > -> Seq Scan on alpha_pos_pos alpha (cost=0.00..1.06 rows=1 width=16)
> > Filter: ((b = '1'::double precision) AND (a = '1'::double precision))
> > -> Seq Scan on beta_hi beta (cost=0.00..1.04 rows=1 width=16)
> > Filter: ((b = '1'::double precision) AND (a = '1'::double precision))
> > (5 rows)
>
> Hmm, I'm not sure this is a good test case for that, because this
> result would be due to partition pruning applied to each side of the
> join before considering partition-wise join; you could get the same
> result even with enable_partitionwise_join=off. I think it's
> important that the partition-wise join logic doesn't break this query,
> though.
I think this would be beyond the scope of the patch, so I added
different test cases that I think would be better as ones for
multi-level partitioning.
Thanks!
Best regards,
Etsuro Fujita
--
--
Best Wishes,
Ashutosh Bapat
Attachment
Hi Ashutosh, On Wed, Mar 4, 2020 at 1:48 AM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote: > I reviewed the patch. Except for the logic of matching the pairs of partitions from already merged partitions, I thinkthe code changes are good. But there are several places where it needs further changes to comments. The attached patchhas those. I have described some of them below. Thanks for reviewing! > + * We can not perform partition-wise join if either of the joining > + * relations is not partitioned. > > We are consistently using partitionwise instead of partition-wise. Will fix. > + /* > + * See if the partition bounds for inputs are exactly the same, in > + * which case we don't need to work hard: partitions with the same > + * partition indexes will form join pairs, and the join rel will have > + * the same partition bounds as inputs; otherwise try to merge the > + * partition bounds along with generating join pairs. > > Phrase "joining relations" is better than "inputs", IMO. Updated in the > attached patch. "inputs" is used in many places in the planner performing join planning, so I'm not sure "joining relations" is better than "inputs". > + /* > + * If the partition bounds for the join rel are not merged ones, > + * inputs are guaranteed to have the same partition bounds, so > + * partitions with the same partition indexes will form join pairs; > + * else let get_matching_part_pairs() do the work. > + */ > + if (joinrel->merged) > + { > > This condition in the comment is opposite to the condition being checked in > code, so looks confusing. BTW this comment is also repeated around line 1405. > See attached patch for correction. OK, I'll revise the comments as proposed. > + /* > + * If this segment of the join is empty, it means that this segment > > "partition of the join" looks consistent with other usages than "segment of the > join". Actually, "segment" is used in the existing comments in the caller function try_partitionwise_join(), so I think "segment" is better here for consistency. > + /* > + * Get a relids set of partition(s) involved in this join segment that > + * are from the rel1 side. > + */ > + child_relids1 = bms_intersect(child_joinrel->relids, > + rel1->all_partrels); > > The partition bounds are sorted by their values. Even for a list partitioned > table, the bounds are sorted by the least partition value. We do not allow > multiple paritions from one side to be joined with one partition on the other > and vice versa. All this together means that the partitions of the join > relation are formed by joining partitions from either side in the order of > their bounds. This means that the matching pairs of partitions can be found by > matching relids of partitions of join with those of the joining relation by > traversing partitions from all the three relations only once in the order they > appears in partition bounds of corresponding relations. Consider this 2-way join for list partitioned tables: CREATE TABLE plt1_ad (a int, b int, c text) PARTITION BY LIST (c); CREATE TABLE plt1_ad_p1 PARTITION OF plt1_ad FOR VALUES IN ('0001', '0003'); CREATE TABLE plt1_ad_p2 PARTITION OF plt1_ad FOR VALUES IN ('0002'); INSERT INTO plt1_ad SELECT i, i, to_char(i % 10, 'FM0000') FROM generate_series(1, 100) i WHERE i % 10 in (1, 2, 3); ANALYZE plt1_ad; CREATE TABLE plt2_ad (a int, b int, c text) PARTITION BY LIST (c); CREATE TABLE plt2_ad_p1 PARTITION OF plt2_ad FOR VALUES IN ('0002', '0004'); CREATE TABLE plt2_ad_p2 PARTITION OF plt2_ad FOR VALUES IN ('0003'); INSERT INTO plt2_ad SELECT i, i, to_char(i % 10, 'FM0000') FROM generate_series(1, 100) i WHERE i % 10 IN (2, 3, 4); ANALYZE plt2_ad; EXPLAIN (COSTS OFF) SELECT t1.c, t1.a, t2.a FROM plt1_ad t1 INNER JOIN plt2_ad t2 ON (t1.c = t2.c) WHERE t1.a < 10 ORDER BY t1.c, t1.a, t2.a; QUERY PLAN ----------------------------------------------------- Sort Sort Key: t1.c, t1.a, t2.a -> Append -> Hash Join Hash Cond: (t2_1.c = t1_2.c) -> Seq Scan on plt2_ad_p1 t2_1 -> Hash -> Seq Scan on plt1_ad_p2 t1_2 Filter: (a < 10) -> Hash Join Hash Cond: (t2_2.c = t1_1.c) -> Seq Scan on plt2_ad_p2 t2_2 -> Hash -> Seq Scan on plt1_ad_p1 t1_1 Filter: (a < 10) (15 rows) As you can see from the EXPLAIN result, the first partition on the outer side matches the second partition on the inner side, and the second partition on the outer side matches the first partition on the inner side. How does the algorithm you proposed work e.g., when an N-way join for list partitioned tables contains this join as its lower join? > If we use this > algorithm, we don't need all_partrels to be collected and also don't need to > search base or join relation. That, I think, will reduce the time and space > complexity of this logic. Am I missing something? Relids is used for storing all_partrels, so the time/space cost of handling it would be small. Also, the cost of searing base relations would be small. The cost of searching join relations would be a bit large in some cases, but I thought that would be acceptable, compared with large overhead of performing other part of partitionwise join planning. > + if (rel1_is_simple) > > This variable is used only in one place. So instead we should the expression > assigning the value to it. Changed in the attached patch. I don't think that's a good idea, because this check is done repeatedly in a for loop. > - rel->nparts = 0; > + rel->nparts = -1; > > I think we need to add comments around various values that nparts can take. How > about like something attached. +1 > + case PARTITION_STRATEGY_HASH: > + merged_bounds = NULL; > > I think, we need to explain why we aren't merging hash partition bounds. AFAIU, > the reason is thus: When the modulo of used for partition mapping (i.e. maximum > number of has partitions) is same, their partition bounds are same and do not > need merging. I don't think that's always true; there are cases where the moduli are the same, but the partition bounds are not, because it's possible to only define partitions for some of the remainders. See the discussion in [1]. > If the maximum number of partitions is different for both the > joining relations, there's high probability that one partition on one side will > join with multiple partitions on the other side. So exact partition bounds > match will work in most of the cases. The cases otherwise are not so common to > spend the effort in coding and planning. > > I have added this explanation in the patch. I also think it would be great if we can perform generic partitionwise join for hash partitioned tables, so I'd like to propose to add something like this, instead: "Currently we support partitionwise join for hash partitioned tables only when the partition bounds for them exactly match, but later it might be worth the effort to relax the restriction." > + if (part_index == -1) > + return -1; > + } while (is_dummy_partition(rel, part_index)); > > I understand why we are skipping NULL positions. I am not sure why are we are > skipping over RelOptInfos which exist but are marked as dummy; we can still > create a join pair with those partitions. Yeah, but I think it's safe to skip over those partitions as well, because such a join pair can be created using merge_partition_with_dummy(). > +/* > + * get_merged_range_bounds > + * Given the bounds of range partitions to be join, determine the range > > s/join/joined/ Good catch! Will fix. > There are more changes to comments, where I thought that the comments are > required or existing comments need more clarification. Please review the > attached patch. This patch is created on top of > v32-0001-Improve-partition-matching-for-partitionwise-join. Thanks for the patch! I will review the patch ASAP. Sorry for the delay. Best regards, Etsuro Fujita [1] https://www.postgresql.org/message-id/CAAJ_b94tJTix3kR8uBjin-ruJ-7ojn-gAWJQRicbLqAttQTe1g%40mail.gmail.com
Re: [HACKERS] advanced partition matching algorithm forpartition-wise join
From
Ashutosh Bapat
Date:
On Tue, Mar 17, 2020 at 1:44 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: > > > + /* > > + * If this segment of the join is empty, it means that this segment > > > > "partition of the join" looks consistent with other usages than "segment of the > > join". > > Actually, "segment" is used in the existing comments in the caller > function try_partitionwise_join(), so I think "segment" is better here > for consistency. A segment can be any part of the join relation, not necessarily a partition. May be we should change the caller. > > > + /* > > + * Get a relids set of partition(s) involved in this join segment that > > + * are from the rel1 side. > > + */ > > + child_relids1 = bms_intersect(child_joinrel->relids, > > + rel1->all_partrels); > > > > The partition bounds are sorted by their values. Even for a list partitioned > > table, the bounds are sorted by the least partition value. We do not allow > > multiple paritions from one side to be joined with one partition on the other > > and vice versa. All this together means that the partitions of the join > > relation are formed by joining partitions from either side in the order of > > their bounds. This means that the matching pairs of partitions can be found by > > matching relids of partitions of join with those of the joining relation by > > traversing partitions from all the three relations only once in the order they > > appears in partition bounds of corresponding relations. > > Consider this 2-way join for list partitioned tables: > > CREATE TABLE plt1_ad (a int, b int, c text) PARTITION BY LIST (c); > CREATE TABLE plt1_ad_p1 PARTITION OF plt1_ad FOR VALUES IN ('0001', '0003'); > CREATE TABLE plt1_ad_p2 PARTITION OF plt1_ad FOR VALUES IN ('0002'); > INSERT INTO plt1_ad SELECT i, i, to_char(i % 10, 'FM0000') FROM > generate_series(1, 100) i WHERE i % 10 in (1, 2, 3); > ANALYZE plt1_ad; > CREATE TABLE plt2_ad (a int, b int, c text) PARTITION BY LIST (c); > CREATE TABLE plt2_ad_p1 PARTITION OF plt2_ad FOR VALUES IN ('0002', '0004'); > CREATE TABLE plt2_ad_p2 PARTITION OF plt2_ad FOR VALUES IN ('0003'); > INSERT INTO plt2_ad SELECT i, i, to_char(i % 10, 'FM0000') FROM > generate_series(1, 100) i WHERE i % 10 IN (2, 3, 4); > ANALYZE plt2_ad; > > EXPLAIN (COSTS OFF) > SELECT t1.c, t1.a, t2.a FROM plt1_ad t1 INNER JOIN plt2_ad t2 ON (t1.c > = t2.c) WHERE t1.a < 10 ORDER BY t1.c, t1.a, t2.a; > QUERY PLAN > ----------------------------------------------------- > Sort > Sort Key: t1.c, t1.a, t2.a > -> Append > -> Hash Join > Hash Cond: (t2_1.c = t1_2.c) > -> Seq Scan on plt2_ad_p1 t2_1 > -> Hash > -> Seq Scan on plt1_ad_p2 t1_2 > Filter: (a < 10) > -> Hash Join > Hash Cond: (t2_2.c = t1_1.c) > -> Seq Scan on plt2_ad_p2 t2_2 > -> Hash > -> Seq Scan on plt1_ad_p1 t1_1 > Filter: (a < 10) > (15 rows) > > As you can see from the EXPLAIN result, the first partition on the > outer side matches the second partition on the inner side, and the > second partition on the outer side matches the first partition on the > inner side. How does the algorithm you proposed work e.g., when an > N-way join for list partitioned tables contains this join as its lower > join? Hmm, this is a good example. I tried to work out the algorithm based on the bound ordering. The algorithm worked well when all the bounds on both the sides were included in the join. But it didn't work well, when some bounds vanished. In order to detect whether a bound has vanished, we need to either compare that bound with the bounds of join (an operation costlier than comparing bitmapset) or compare relids of all the partitions of the join. Either way it looks costlier than what you have right now. May be we could improve by keeping track of such lost bounds and corresponding partitions. But I didn't get time to work on that part. Anyway, even if such an algorithm exists, we will have to change just a single function. That could be done later I think. So we are good here right now. Thanks. > > + if (rel1_is_simple) > > > > This variable is used only in one place. So instead we should the expression > > assigning the value to it. Changed in the attached patch. > > I don't think that's a good idea, because this check is done > repeatedly in a for loop. Compiler's optimizer would anyway optimize it away. But anyway, I won't insist on this. > > > - rel->nparts = 0; > > + rel->nparts = -1; > > > > I think we need to add comments around various values that nparts can take. How > > about like something attached. > > +1 > > > + case PARTITION_STRATEGY_HASH: > > + merged_bounds = NULL; > > > > I think, we need to explain why we aren't merging hash partition bounds. AFAIU, > > the reason is thus: When the modulo of used for partition mapping (i.e. maximum > > number of has partitions) is same, their partition bounds are same and do not > > need merging. > > I don't think that's always true; there are cases where the moduli are > the same, but the partition bounds are not, because it's possible to > only define partitions for some of the remainders. See the discussion > in [1]. Hmm, but that case would be rare, IMO. It's an artifact of the way our hash partitioning syntax is, and does not reflect a real world scenario. > > > If the maximum number of partitions is different for both the > > joining relations, there's high probability that one partition on one side will > > join with multiple partitions on the other side. So exact partition bounds > > match will work in most of the cases. The cases otherwise are not so common to > > spend the effort in coding and planning. > > > > I have added this explanation in the patch. > > I also think it would be great if we can perform generic partitionwise > join for hash partitioned tables, so I'd like to propose to add > something like this, instead: "Currently we support partitionwise join > for hash partitioned tables only when the partition bounds for them > exactly match, but later it might be worth the effort to relax the > restriction." That's good too. But please include an explanation about the case when modulo/max no. of partitions itself differs. That case is not likely to get addressed in nearer future. -- Best Wishes, Ashutosh Bapat
Hi, On Mon, Mar 23, 2020 at 10:42 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote: > On Tue, Mar 17, 2020 at 1:44 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: > > > "partition of the join" looks consistent with other usages than "segment of the > > > join". > > > > Actually, "segment" is used in the existing comments in the caller > > function try_partitionwise_join(), so I think "segment" is better here > > for consistency. > > A segment can be any part of the join relation, not necessarily a > partition. May be we should change the caller. I don't think so. My first language is not English, though. I don't think this should be a blocker, so how about leaving this for another patch? > > > + /* > > > + * Get a relids set of partition(s) involved in this join segment that > > > + * are from the rel1 side. > > > + */ > > > + child_relids1 = bms_intersect(child_joinrel->relids, > > > + rel1->all_partrels); > > > > > > The partition bounds are sorted by their values. Even for a list partitioned > > > table, the bounds are sorted by the least partition value. We do not allow > > > multiple paritions from one side to be joined with one partition on the other > > > and vice versa. All this together means that the partitions of the join > > > relation are formed by joining partitions from either side in the order of > > > their bounds. This means that the matching pairs of partitions can be found by > > > matching relids of partitions of join with those of the joining relation by > > > traversing partitions from all the three relations only once in the order they > > > appears in partition bounds of corresponding relations. > > > > Consider this 2-way join for list partitioned tables: > Hmm, this is a good example. I tried to work out the algorithm based > on the bound ordering. The algorithm worked well when all the bounds > on both the sides were included in the join. But it didn't work well, > when some bounds vanished. In order to detect whether a bound has > vanished, we need to either compare that bound with the bounds of join > (an operation costlier than comparing bitmapset) or compare relids of > all the partitions of the join. Either way it looks costlier than what > you have right now. May be we could improve by keeping track of such > lost bounds and corresponding partitions. But I didn't get time to > work on that part. Anyway, even if such an algorithm exists, we will > have to change just a single function. That could be done later I > think. So we are good here right now. Thanks. OK, there is always room for improvement. > > > + if (rel1_is_simple) > > > > > > This variable is used only in one place. So instead we should the expression > > > assigning the value to it. Changed in the attached patch. > > > > I don't think that's a good idea, because this check is done > > repeatedly in a for loop. > > Compiler's optimizer would anyway optimize it away. But anyway, I > won't insist on this. OK > > > + case PARTITION_STRATEGY_HASH: > > > + merged_bounds = NULL; > > > > > > I think, we need to explain why we aren't merging hash partition bounds. AFAIU, > > > the reason is thus: When the modulo of used for partition mapping (i.e. maximum > > > number of has partitions) is same, their partition bounds are same and do not > > > need merging. > > > > I don't think that's always true; there are cases where the moduli are > > the same, but the partition bounds are not, because it's possible to > > only define partitions for some of the remainders. See the discussion > > in [1]. > > Hmm, but that case would be rare, IMO. It's an artifact of the way our > hash partitioning syntax is, and does not reflect a real world > scenario. I agree on that point. > > > If the maximum number of partitions is different for both the > > > joining relations, there's high probability that one partition on one side will > > > join with multiple partitions on the other side. So exact partition bounds > > > match will work in most of the cases. The cases otherwise are not so common to > > > spend the effort in coding and planning. > > > > > > I have added this explanation in the patch. > > I'd like to propose to add > > something like this, instead: "Currently we support partitionwise join > > for hash partitioned tables only when the partition bounds for them > > exactly match, but later it might be worth the effort to relax the > > restriction." > > That's good too. But please include an explanation about the case when > modulo/max no. of partitions itself differs. That case is not likely > to get addressed in nearer future. OK, I added comments, including that explanation. Please find attached a new version of the patch. On Tue, Mar 17, 2020 at 5:14 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: > On Wed, Mar 4, 2020 at 1:48 AM Ashutosh Bapat > <ashutosh.bapat.oss@gmail.com> wrote: > > + * We can not perform partition-wise join if either of the joining > > + * relations is not partitioned. > > > > We are consistently using partitionwise instead of partition-wise. > > Will fix. Done. > > + /* > > + * If the partition bounds for the join rel are not merged ones, > > + * inputs are guaranteed to have the same partition bounds, so > > + * partitions with the same partition indexes will form join pairs; > > + * else let get_matching_part_pairs() do the work. > > + */ > > + if (joinrel->merged) > > + { > > > > This condition in the comment is opposite to the condition being checked in > > code, so looks confusing. BTW this comment is also repeated around line 1405. > > See attached patch for correction. > > OK, I'll revise the comments as proposed. I updated this based on what you suggested. > > - rel->nparts = 0; > > + rel->nparts = -1; > > > > I think we need to add comments around various values that nparts can take. How > > about like something attached. > > +1 - int nparts; /* number of partitions */ + int nparts; /* number of partitions. + * 0 for a relation with no partitions, + * > 0 indicates actual number of partitions + * -1 for a relation whose number of partitions + * is not yet known. + */ I don't think the comment "0 for a relation with no partitions" is correct. I think 0 means the relation is considered unpartitioned, so I modified the comments as such. I also changed the format to match other places in pathnodes.h. > > +/* > > + * get_merged_range_bounds > > + * Given the bounds of range partitions to be join, determine the range > > > > s/join/joined/ > > Good catch! Will fix. Done. > > There are more changes to comments, where I thought that the comments are > > required or existing comments need more clarification. Please review the > > attached patch. > > I will review the patch ASAP. I looked into the patch. @@ -4093,8 +4103,13 @@ map_and_merge_partitions(PartitionMap *outer_map, PartitionMap *inner_map, - * Note that we will fix the larger index that have been added to - * the merged_indexes list so far in fix_merged_indexes(). + * Both the inner and outer partitions have an empty partition on + * the other side as their joining partner. But now that each of + * them has found a non-empty joining partner we should re-map + * those to a single partition in the join. We use lower of the + * two indexes to avoid any holes being created by re-mapping. + * Also, it keeps the partition index array in partition bounds + * roughly sorted. OK, but I modified this further. + * The caller thinks that the partition at the given index does not have a + * partition in the other relation or the joining partition is empty. In such a + * case we assign a temporary index (indicated by merged flag in the map) for + * the resulting partition in the join. In case the given partition finds a + * non-empty partner latter we will adjust the mapping again. Done. I modified this a bit, though. @@ -4590,6 +4620,13 @@ merge_default_partitions(PartitionMap *outer_map, + * We should have already given up if we found that both the inner and + * outer relations have default partitions and either of them had a + * partition without a matching non-default partition on the other + * side. See process_outer_partition() and process_inner_partition() + * for details. This seems to me a bit difficult to read. How about something simpler in the attached? I also updated comments in process_outer_partition() and process_inner_partition(), almost as proposed in your patch. I polished the patch further. Changes are: * I removed the arguments parttyplen and parttypbyval for partition_bounds_merge(), because we don't need them anymore. Also, I removed these assertions from that function: + Assert(merged_bounds || (*outer_parts == NIL && *inner_parts == NIL)); + + Assert(list_length(*outer_parts) == list_length(*inner_parts)); because the first assertion isn't correct as we don't allow non-NULL merged_bounds to have outer_parts=NIL and inner_parts=NIL, and because the second assertion is redundant as we have the same in partition_list_bounds_merge() and partition_range_bounds_merge(). I modified assertions in these functions a bit, though. * I renamed partition_list_bounds_merge() and partition_range_bounds_merge() to merge_list_bounds() and merge_range_bounds(), analogously to create_list_bounds() and create_range_bounds() in partition_bounds_create(). Does that make sense? * In the old versions of the patch, map_and_merge_partitions() checked whether it was possible to "map" one of given partitions to the other, but that function doesn't do that anymore. That function assumes that given partitions match each other, so I don't think the name is good anymore. How about renaming that function to merge_matching_partitions() or something like that? * I think the names of variables used in functions added to partbounds.c are inconsistent. For example, the names of the outer and inner partition indexes in partition_list_bounds_merge() are "o_index" and "i_index", while those in partition_range_bounds_merge() are "outer_part" and "inner_part". IMO such inconsistencies would make the code a bit difficult to read, so I modified the names of those variables to be more consistent between those functions. * I modified the code in merge_null_partitions() to match other functions added to partbounds.c, but no functional changes. Also, I think the names "outer_null_unmatched" and "inner_null_unmatched" are a bit misleading, because I think the reader might incorrectly think both the NULL partitions exist, but that isn't always true, so I renamed them to what I think would be better. * I added/tweaked comments a lot, fixing (probably my) typos and grammatical errors. * I did some cleanup, including fixing some format issues. * I sorted functions added to partbounds.c in a more logical order, and moved them to a more appropriate place in that source file. * I added/modified some regression tests. I added/modified comments a bit as well. One thing I should mention is a change to this: +-- 3-way join where not every pair of relations can do partitioned join +EXPLAIN (COSTS OFF) +SELECT t1.a, t2.a, t3.c FROM prt1_ad t1 RIGHT JOIN prt2_ad t2 ON (t1.a = t2.b) INNER JOIN prt1_ad t3 ON (t2.b = t3.a) WHERE t1.b = 0 ORDER BY t1.a, t2.a, t3.c; The RIGHT join will be transformed to an INNER join by outer join reduction IIUC, so this wouldn't be what was originally intended. I think this is my fault; I modified the original test case incorrectly when I updated all the regression tests for this feature [1]. Sorry for that. I fixed this issue. Best regards, Etsuro Fujita [1] https://www.postgresql.org/message-id/CAPmGK16LsKXX%3DYYzc-PNiY6aaYApg1Gmkc6A14dnJsrBBmgd0g%40mail.gmail.com
Attachment
Hi, I've started reviewing the patch a couple days ago. I haven't done any extensive testing, but I do have a bunch of initial comments that I can share now. 1) I wonder if this needs to update src/backend/optimizer/README, which does have a section about partitionwise joins. It seems formulated in a way that that probably covers even this more advanced algorithm, but maybe it should mention handling of default partitions etc.? There certainly needs to be some description of the algorithm somewhere, either in a README or before a suitable function. It doesn't have to be particularly detailed, a rough outline of the matching would be enough, so that readers don't have to rebuild the knowledge from pieces scattered around various comments. 2) Do we need another GUC enabling this more complex algorithm? In PG11 the partitionwise join is disabled by default, on the grounds that it's expensive and not worth it by default. How much more expensive is this? Maybe it makes sense to allow enabling only the "simple" approach? 3) This comment in try_partitionwise_join is now incorrect, because the condition may be true even for partitioned tables with (nparts == 0). /* Nothing to do, if the join relation is not partitioned. */ if (joinrel->part_scheme == NULL || joinrel->nparts == 0) return; Moreover, the condition used to be if (!IS_PARTITIONED_REL(joinrel)) return; which is way more readable. I think it's net negative to replace these "nice" macros with clear meaning with complex conditions. If needed, we can invent new macros. There are many other places where the patch replaces macros with less readable conditions. 4) I'm a bit puzzled how we could get here with non-partitioned rels? /* * We can not perform partitionwise join if either of the joining relations * is not partitioned. */ if (!IS_PARTITIONED_REL(rel1) || !IS_PARTITIONED_REL(rel2)) return; 5) I find the "merged" flag in RelOptInfo rather unclear, because it does not clearly indicate what was merged. Maybe something like partbounds_merged would be better? 6) The try_partitionwise_join function is getting a bit too long and harder to understand. The whole block in if (joinrel->nparts == -1) { ... } seems rather well isolated, so I propose to move it to a separate function responsible only for the merging. We can simply call it on the joinrel, and make it return right away if (joinrel->nparts == -1). 7) I'd suggest not to reference exact functions in comments unless abolutely necessary, because it's harder to maintain and it does not really explain purpose of the struct/code. E.g. consider this: /* Per-partitioned-relation data for merge_list_bounds()/merge_range_bounds() */ typedef struct PartitionMap { ... } Why does it matter where is the struct used? That's pretty trivial to find using 'git grep' or something. Instead the comment should explain the purpose of the struct. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi, three more comments after eye-balling the code for a bit longer. 1) The patch probably needs to tweak config.sgml which says this about the enable_partitionwise_join GUC: .. Partitionwise join currently applies only when the join conditions include all the partition keys, which must be of the same data type and have exactly matching sets of child partitions. .. Which is probably incorrect, because the point of this patch is not to require exact match of the partitions, right? 2) Do we really need the 'merged' flag in try_partitionwise_join? Can't we simply use the joinrel->merged flag directly? ISTM the we always start with joinrel->merged=false, and then only ever set it to true in some cases. I've tried doing that, per the attached 0002 patch. The regression tests seem to work fine. I noticed this because I've tried moving part of the function into a separate function, and removing the variable makes that simpler. The patch also does a couple additional minor tweaks. 3) I think the for nparts comment is somewhat misleading: int nparts; /* number of partitions; 0 = not partitioned; * -1 = not yet set */ which says that nparts=0 means not partitioned. But then there are conditions like this: /* Nothing to do, if the join relation is not partitioned. */ if (joinrel->part_scheme == NULL || joinrel->nparts == 0) return; which (ignoring the obsolete comment) seems to say we can have nparts==0 even for partitioned tables, no? Anyway, attached is the original patch (0001) and two patches with proposed changes. 0002 removes the "merged" flag as explained in (2), 0003 splits the try_partitionwise_join() function into two parts. I'm saying these changes have to happen and it's a bit crude (and it might be a bit of a bikeshedding). regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
Re: [HACKERS] advanced partition matching algorithm forpartition-wise join
From
Ashutosh Bapat
Date:
On Thu, 26 Mar 2020 at 05:47, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
Hi,
three more comments after eye-balling the code for a bit longer.
1) The patch probably needs to tweak config.sgml which says this about
the enable_partitionwise_join GUC:
.. Partitionwise join currently applies only when the join conditions
include all the partition keys, which must be of the same data type
and have exactly matching sets of child partitions. ..
Done. Actually this wasn't updated when partition pruning was introduced, which could cause a partitionwise join to be not used even when those conditions were met. Similarly when a query involved whole row reference. It's hard to explain all the conditions under which partitionwise join technique will be used. But I have tried to keep it easy to understand.
Which is probably incorrect, because the point of this patch is not to
require exact match of the partitions, right?
2) Do we really need the 'merged' flag in try_partitionwise_join? Can't
we simply use the joinrel->merged flag directly? ISTM the we always
start with joinrel->merged=false, and then only ever set it to true in
some cases. I've tried doing that, per the attached 0002 patch. The
regression tests seem to work fine.
Thanks. I just added a small prologue to compute_partition_bounds().
I noticed this because I've tried moving part of the function into a
separate function, and removing the variable makes that simpler.
The patch also does a couple additional minor tweaks.
3) I think the for nparts comment is somewhat misleading:
int nparts; /* number of partitions; 0 = not partitioned;
* -1 = not yet set */
which says that nparts=0 means not partitioned. But then there are
conditions like this:
/* Nothing to do, if the join relation is not partitioned. */
if (joinrel->part_scheme == NULL || joinrel->nparts == 0)
return;
which (ignoring the obsolete comment) seems to say we can have nparts==0
even for partitioned tables, no?
See my previous mail.
Anyway, attached is the original patch (0001) and two patches with
proposed changes. 0002 removes the "merged" flag as explained in (2),
0003 splits the try_partitionwise_join() function into two parts.
I'm saying these changes have to happen and it's a bit crude (and it
might be a bit of a bikeshedding).
I have added 0005 with the changes I described in this as well as the previous mail. 0004 is just some white space fixes.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Best Wishes,
Ashutosh
Attachment
Re: [HACKERS] advanced partition matching algorithm forpartition-wise join
From
Ashutosh Bapat
Date:
On Thu, 26 Mar 2020 at 00:35, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
Hi,
I've started reviewing the patch a couple days ago. I haven't done any
extensive testing, but I do have a bunch of initial comments that I can
share now.
1) I wonder if this needs to update src/backend/optimizer/README, which
does have a section about partitionwise joins. It seems formulated in a
way that that probably covers even this more advanced algorithm, but
maybe it should mention handling of default partitions etc.?
Done. Please check the wording. It might need some word smithy.
There certainly needs to be some description of the algorithm somewhere,
either in a README or before a suitable function. It doesn't have to be
particularly detailed, a rough outline of the matching would be enough,
so that readers don't have to rebuild the knowledge from pieces
scattered around various comments.
The algorithm for list and range partitioned tables is slightly different. So, I have added separate prologue to each list_merge_bounds() and range_merge_bounds(). Please check if that serves the purpose.
2) Do we need another GUC enabling this more complex algorithm? In PG11
the partitionwise join is disabled by default, on the grounds that it's
expensive and not worth it by default. How much more expensive is this?
Maybe it makes sense to allow enabling only the "simple" approach?
We have reduced the complexity of merging bounds quite a bit so this shouldn't be costly. Further more we handle the usual case of equal bounds quickly using the merge flag so most of the cases should be fine. It's only when two partitioned tables with same partition scheme are joined but do not have merge-able bounds that this algorithm would not yield useful result - but that would be rare in the field. enable_partitionwise_join = false should take care of such scenarios easily. I am not in favour of adding another GUC which we set to false by default and then take another few releases to make it true by default.
3) This comment in try_partitionwise_join is now incorrect, because the
condition may be true even for partitioned tables with (nparts == 0).
/* Nothing to do, if the join relation is not partitioned. */
if (joinrel->part_scheme == NULL || joinrel->nparts == 0)
return;
If part_scheme = NULL, npart should be 0, fixed that in build_joinrel_partition_info(). If partscheme != NULL but bounds can not be merged, nparts = 0. So this condition is correct. Encapsulated in a macro IS_JOINREL_NOT_PARTITITIONED(). and added comments for the macro. Given that the macro is used exactly at one place, it may not be necessary to define it but it looks *nice*.
Moreover, the condition used to be
if (!IS_PARTITIONED_REL(joinrel))
return;
which is way more readable. I think it's net negative to replace these
"nice" macros with clear meaning with complex conditions. If needed, we
can invent new macros. There are many other places where the patch
replaces macros with less readable conditions.
The only other place where we have replaced a *nice* macro is in build_joinrel_partition_info(). But I think it's a legit replacement. I have added a comment there.
4) I'm a bit puzzled how we could get here with non-partitioned rels?
/*
* We can not perform partitionwise join if either of the joining relations
* is not partitioned.
*/
if (!IS_PARTITIONED_REL(rel1) || !IS_PARTITIONED_REL(rel2))
return;
See the comment I have added in build_joinrel_partition_info(). Not all joining pairs for a given relation are partitioned.
5) I find the "merged" flag in RelOptInfo rather unclear, because it
does not clearly indicate what was merged. Maybe something like
partbounds_merged would be better?
Done.
6) The try_partitionwise_join function is getting a bit too long and
harder to understand. The whole block in
if (joinrel->nparts == -1)
{
...
}
seems rather well isolated, so I propose to move it to a separate
function responsible only for the merging. We can simply call it on the
joinrel, and make it return right away if (joinrel->nparts == -1).
Looks like you have already taken care of this one in one of your patches.
7) I'd suggest not to reference exact functions in comments unless
abolutely necessary, because it's harder to maintain and it does not
really explain purpose of the struct/code. E.g. consider this:
/* Per-partitioned-relation data for merge_list_bounds()/merge_range_bounds() */
typedef struct PartitionMap
{ ... }
Why does it matter where is the struct used? That's pretty trivial to
find using 'git grep' or something. Instead the comment should explain
the purpose of the struct.
Adjusted names and comments a bit.
Best Wishes,
Ashutosh
Hi Tomas and Ashutosh, On Thu, Apr 2, 2020 at 1:51 AM Ashutosh Bapat <ashutosh.bapat@2ndquadrant.com> wrote: > On Thu, 26 Mar 2020 at 05:47, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: >> three more comments after eye-balling the code for a bit longer. >> >> 1) The patch probably needs to tweak config.sgml which says this about >> the enable_partitionwise_join GUC: >> >> .. Partitionwise join currently applies only when the join conditions >> include all the partition keys, which must be of the same data type >> and have exactly matching sets of child partitions. .. > > > Done. Actually this wasn't updated when partition pruning was introduced, which could cause a partitionwise join to benot used even when those conditions were met. Similarly when a query involved whole row reference. It's hard to explainall the conditions under which partitionwise join technique will be used. But I have tried to keep it easy to understand. > >> >> >> Which is probably incorrect, because the point of this patch is not to >> require exact match of the partitions, right? >> >> 2) Do we really need the 'merged' flag in try_partitionwise_join? Can't >> we simply use the joinrel->merged flag directly? ISTM the we always >> start with joinrel->merged=false, and then only ever set it to true in >> some cases. I've tried doing that, per the attached 0002 patch. The >> regression tests seem to work fine. > > > Thanks. I just added a small prologue to compute_partition_bounds(). > >> >> >> I noticed this because I've tried moving part of the function into a >> separate function, and removing the variable makes that simpler. >> >> The patch also does a couple additional minor tweaks. >> >> 3) I think the for nparts comment is somewhat misleading: >> >> int nparts; /* number of partitions; 0 = not partitioned; >> * -1 = not yet set */ >> >> which says that nparts=0 means not partitioned. But then there are >> conditions like this: >> >> /* Nothing to do, if the join relation is not partitioned. */ >> if (joinrel->part_scheme == NULL || joinrel->nparts == 0) >> return; >> >> which (ignoring the obsolete comment) seems to say we can have nparts==0 >> even for partitioned tables, no? > > > See my previous mail. > >> >> >> Anyway, attached is the original patch (0001) and two patches with >> proposed changes. 0002 removes the "merged" flag as explained in (2), >> 0003 splits the try_partitionwise_join() function into two parts. >> >> I'm saying these changes have to happen and it's a bit crude (and it >> might be a bit of a bikeshedding). > > > I have added 0005 with the changes I described in this as well as the previous mail. 0004 is just some white space fixes. Thanks for the comments, Tomas! Thanks for the patch, Ashutosh! I will look at the patch. Best regards, Etsuro Fujita
Hi, On Thu, Apr 2, 2020 at 2:12 AM Ashutosh Bapat <ashutosh.bapat@2ndquadrant.com> wrote: > On Thu, 26 Mar 2020 at 00:35, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: >> I've started reviewing the patch a couple days ago. I haven't done any >> extensive testing, but I do have a bunch of initial comments that I can >> share now. >> >> 1) I wonder if this needs to update src/backend/optimizer/README, which >> does have a section about partitionwise joins. It seems formulated in a >> way that that probably covers even this more advanced algorithm, but >> maybe it should mention handling of default partitions etc.? > Done. Please check the wording. It might need some word smithy. You heavily changed the existing documentation about PWJ, but I don't think we really need to do so. Also, IMO I think the description about default partitions you added is needed in README. I think it would be better to put such a description in source files. How about something like the attached, instead? I wrote part of this based on the commit message in the original versions of the patch you posted. >> There certainly needs to be some description of the algorithm somewhere, >> either in a README or before a suitable function. It doesn't have to be >> particularly detailed, a rough outline of the matching would be enough, >> so that readers don't have to rebuild the knowledge from pieces >> scattered around various comments. > The algorithm for list and range partitioned tables is slightly different. So, I have added separate prologue to each list_merge_bounds()and range_merge_bounds(). Please check if that serves the purpose. Too detailed to me. In this: + * If there are multiple partitions from one side matching a given partition on + * the other side, the algorithm bails out since we do not have machinary for + * joining one partition with mulitple partitions. It might happen that any of + * the list items of a partition from the outer relation do not appear in the + * inner relation and there is no default partition in the inner relation. Such + * a partition from the outer side will have no matching partition on the inner + * side. The algorithm will bail out in such a case since we do not have a + * mechanism to perform a join with a non-existing relation. I don't think the last comment is correct; that would apply to the old versions of this function IIRC, but not to the latest version. How about something much simpler like the attached, instead? >> 2) Do we need another GUC enabling this more complex algorithm? In PG11 >> the partitionwise join is disabled by default, on the grounds that it's >> expensive and not worth it by default. How much more expensive is this? >> Maybe it makes sense to allow enabling only the "simple" approach? > We have reduced the complexity of merging bounds quite a bit so this shouldn't be costly. Further more we handle the usualcase of equal bounds quickly using the merge flag so most of the cases should be fine. It's only when two partitionedtables with same partition scheme are joined but do not have merge-able bounds that this algorithm would not yielduseful result - but that would be rare in the field. enable_partitionwise_join = false should take care of such scenarioseasily. I am not in favour of adding another GUC which we set to false by default and then take another few releasesto make it true by default. I agree with Ashutosh. >> 3) This comment in try_partitionwise_join is now incorrect, because the >> condition may be true even for partitioned tables with (nparts == 0). >> >> /* Nothing to do, if the join relation is not partitioned. */ >> if (joinrel->part_scheme == NULL || joinrel->nparts == 0) >> return; > If part_scheme = NULL, npart should be 0, fixed that in build_joinrel_partition_info(). If partscheme != NULL but boundscan not be merged, nparts = 0. So this condition is correct. Encapsulated in a macro IS_JOINREL_NOT_PARTITITIONED().and added comments for the macro. Given that the macro is used exactly at one place, it maynot be necessary to define it but it looks *nice*. I don't think it would be a good idea to add such a macro for only one place. >> Moreover, the condition used to be >> >> if (!IS_PARTITIONED_REL(joinrel)) >> return; >> >> which is way more readable. I think it's net negative to replace these >> "nice" macros with clear meaning with complex conditions. If needed, we >> can invent new macros. There are many other places where the patch >> replaces macros with less readable conditions. > The only other place where we have replaced a *nice* macro is in build_joinrel_partition_info(). But I think it's a legitreplacement. I have added a comment there. IIUC, this is the existing issue, so I think it would be better to leave this for another improvement patch. >> 6) The try_partitionwise_join function is getting a bit too long and >> harder to understand. The whole block in >> >> if (joinrel->nparts == -1) >> { >> ... >> } >> >> seems rather well isolated, so I propose to move it to a separate >> function responsible only for the merging. We can simply call it on the >> joinrel, and make it return right away if (joinrel->nparts == -1). That's a good idea, so +1. >> 7) I'd suggest not to reference exact functions in comments unless >> abolutely necessary, because it's harder to maintain and it does not >> really explain purpose of the struct/code. E.g. consider this: >> >> /* Per-partitioned-relation data for merge_list_bounds()/merge_range_bounds() */ >> typedef struct PartitionMap >> { ... } >> >> Why does it matter where is the struct used? That's pretty trivial to >> find using 'git grep' or something. Instead the comment should explain >> the purpose of the struct. > Adjusted names and comments a bit. I modified the comments a bit further. I don't think we need to change the name of a variable, so I kept it as is. On Thu, Apr 2, 2020 at 1:51 AM Ashutosh Bapat <ashutosh.bapat@2ndquadrant.com> wrote: > On Thu, 26 Mar 2020 at 05:47, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: >> three more comments after eye-balling the code for a bit longer. >> >> 1) The patch probably needs to tweak config.sgml which says this about >> the enable_partitionwise_join GUC: >> >> .. Partitionwise join currently applies only when the join conditions >> include all the partition keys, which must be of the same data type >> and have exactly matching sets of child partitions. .. > Done. Actually this wasn't updated when partition pruning was introduced, which could cause a partitionwise join to benot used even when those conditions were met. Similarly when a query involved whole row reference. It's hard to explainall the conditions under which partitionwise join technique will be used. But I have tried to keep it easy to understand. IMO I think your words "there is exactly one pair of matching partitions." is a bit misleading, because that sounds like that PWJ doesn't allow multiply-segmented join. How about s/exact matching/one-to-one matching/ in the existing documentation, instead? >> 2) Do we really need the 'merged' flag in try_partitionwise_join? Can't >> we simply use the joinrel->merged flag directly? ISTM the we always >> start with joinrel->merged=false, and then only ever set it to true in >> some cases. I've tried doing that, per the attached 0002 patch. The >> regression tests seem to work fine. I introduced the flag "merged", because I thought it would be verbose to write "joiners->merged" in many places, but I think removing that would make the code simpler, so +1 for that change. > Thanks. I just added a small prologue to compute_partition_bounds(). I tweaked that a bit. >> I noticed this because I've tried moving part of the function into a >> separate function, and removing the variable makes that simpler. >> >> The patch also does a couple additional minor tweaks. /* * Currently, this function is called only from try_partitionwise_join(), * so the join type should be INNER, LEFT, FULL, SEMI, or ANTI. + * + * XXX Maybe an assert would be more appropriate? Or maybe just + * bail out by returning NULL? Not sure. */ if (jointype != JOIN_INNER && jointype != JOIN_LEFT && jointype != JOIN_FULL && jointype != JOIN_SEMI && jointype != JOIN_ANTI) I kept this as originally proposed, but I agree that an assertion would be more appropriate, because it would save cycles a bit in a non-assertion-enabled build. I modified this as such. I modified the test below this as well. >> 3) I think the for nparts comment is somewhat misleading: >> >> int nparts; /* number of partitions; 0 = not partitioned; >> * -1 = not yet set */ >> >> which says that nparts=0 means not partitioned. But then there are >> conditions like this: >> >> /* Nothing to do, if the join relation is not partitioned. */ >> if (joinrel->part_scheme == NULL || joinrel->nparts == 0) >> return; >> >> which (ignoring the obsolete comment) seems to say we can have nparts==0 >> even for partitioned tables, no? Yeah, I think I was a bit hasty. I fixed this. Attached is the original patch (0001) and one patch (0002) with changes including those by Tomas and Ashutosh. In the 0002 patch I fixed my many typos as well. :-( Many of them were found by Justin Pryzby. One of them was found by Ashutosh. Thank you! >> Anyway, attached is the original patch (0001) and two patches with >> proposed changes. 0002 removes the "merged" flag as explained in (2), >> 0003 splits the try_partitionwise_join() function into two parts. Thanks for the patches, Tomas! > I have added 0005 with the changes I described in this as well as the previous mail. 0004 is just some white space fixes. Thanks for the patches, Ashutosh! I fixed the white space issues. Best regards, Etsuro Fujita
Attachment
On Sat, Apr 4, 2020 at 12:15 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: > Attached is the original patch (0001) and one patch (0002) with > changes including those by Tomas and Ashutosh. I merged the patches into one and rebased it against HEAD. Attached is a new version, in which I added the commit message as well. Does that make sense? If there are no objections, I’ll commit the patch. Best regards, Etsuro Fujita
Attachment
Re: [HACKERS] advanced partition matching algorithm forpartition-wise join
From
Ashutosh Bapat
Date:
On Fri, 3 Apr 2020 at 20:45, Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
Hi,
On Thu, Apr 2, 2020 at 2:12 AM Ashutosh Bapat
<ashutosh.bapat@2ndquadrant.com> wrote:
> On Thu, 26 Mar 2020 at 00:35, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> I've started reviewing the patch a couple days ago. I haven't done any
>> extensive testing, but I do have a bunch of initial comments that I can
>> share now.
>>
>> 1) I wonder if this needs to update src/backend/optimizer/README, which
>> does have a section about partitionwise joins. It seems formulated in a
>> way that that probably covers even this more advanced algorithm, but
>> maybe it should mention handling of default partitions etc.?
> Done. Please check the wording. It might need some word smithy.
You heavily changed the existing documentation about PWJ, but I don't
think we really need to do so. Also, IMO I think the description
about default partitions you added is needed in README. I think it
would be better to put such a description in source files. How about
something like the attached, instead? I wrote part of this based on
the commit message in the original versions of the patch you posted.
I corrected some grammar, typos. Broke longer sentences into smaller ones so that its easy to read and understand. As is the concept is hard to understand with all its limitations. Thanks for the example. Retained it.
You seem to have removed few comments that explained the algorithm in detail from build_joinrel_partition_info(). It would have been good to have those there. But I am ok not having them either.
But it will be good to have following addition I suggested in my patches to make sure nparts is set to 0 for an unpartitioned relation as per the comment on nparts in RelOptInfo.
@@ -1653,6 +1663,8 @@ build_joinrel_partition_info(RelOptInfo *joinrel, RelOptInfo *outer_rel,
jointype, restrictlist))
{
Assert(!IS_PARTITIONED_REL(joinrel));
+ /* Join is not partitioned. */
+ joinrel->nparts = 0;
return;
}
jointype, restrictlist))
{
Assert(!IS_PARTITIONED_REL(joinrel));
+ /* Join is not partitioned. */
+ joinrel->nparts = 0;
return;
}
>> There certainly needs to be some description of the algorithm somewhere,
>> either in a README or before a suitable function. It doesn't have to be
>> particularly detailed, a rough outline of the matching would be enough,
>> so that readers don't have to rebuild the knowledge from pieces
>> scattered around various comments.
> The algorithm for list and range partitioned tables is slightly different. So, I have added separate prologue to each list_merge_bounds() and range_merge_bounds(). Please check if that serves the purpose.
Too detailed to me. In this:
+ * If there are multiple partitions from one side matching a given partition on
+ * the other side, the algorithm bails out since we do not have machinary for
+ * joining one partition with mulitple partitions. It might happen that any of
+ * the list items of a partition from the outer relation do not appear in the
+ * inner relation and there is no default partition in the inner relation. Such
+ * a partition from the outer side will have no matching partition on the inner
+ * side. The algorithm will bail out in such a case since we do not have a
+ * mechanism to perform a join with a non-existing relation.
I don't think the last comment is correct; that would apply to the old
versions of this function IIRC, but not to the latest version. How
about something much simpler like the attached, instead?
I know that algorithm pretty well by now, so it suffices for me to say we use something similar to merge join, but may be for someone without that background a detailed explanation is useful. But this looks fine at the moment.
> Done. Actually this wasn't updated when partition pruning was introduced, which could cause a partitionwise join to be not used even when those conditions were met. Similarly when a query involved whole row reference. It's hard to explain all the conditions under which partitionwise join technique will be used. But I have tried to keep it easy to understand.
IMO I think your words "there is exactly one pair of matching
partitions." is a bit misleading, because that sounds like that PWJ
doesn't allow multiply-segmented join. How about s/exact
matching/one-to-one matching/ in the existing documentation, instead?
Good catch. That was really misleading. Looks good to me.
>> 3) I think the for nparts comment is somewhat misleading:
>>
>> int nparts; /* number of partitions; 0 = not partitioned;
>> * -1 = not yet set */
>>
>> which says that nparts=0 means not partitioned. But then there are
>> conditions like this:
>>
>> /* Nothing to do, if the join relation is not partitioned. */
>> if (joinrel->part_scheme == NULL || joinrel->nparts == 0)
>> return;
>>
>> which (ignoring the obsolete comment) seems to say we can have nparts==0
>> even for partitioned tables, no?
Yeah, I think I was a bit hasty. I fixed this.
For a non-join relation, nparts = 0 and nparts = -1 both have the same meaning. Although we never set nparts = 0 for a non-join relation? Otherwise, the comment looks good now.
Best Wishes,
Ashutosh
Re: [HACKERS] advanced partition matching algorithm forpartition-wise join
From
Ashutosh Bapat
Date:
Here are some changes suggested on top of v34 as per my previous mail.
These are mostly comment changes.
On Mon, 6 Apr 2020 at 13:59, Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
On Sat, Apr 4, 2020 at 12:15 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> Attached is the original patch (0001) and one patch (0002) with
> changes including those by Tomas and Ashutosh.
I merged the patches into one and rebased it against HEAD. Attached
is a new version, in which I added the commit message as well. Does
that make sense? If there are no objections, I’ll commit the patch.
I have no objections.
Best Wishes,
Ashutosh
Attachment
On Mon, Apr 06, 2020 at 05:28:52PM +0900, Etsuro Fujita wrote: >On Sat, Apr 4, 2020 at 12:15 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: >> Attached is the original patch (0001) and one patch (0002) with >> changes including those by Tomas and Ashutosh. > >I merged the patches into one and rebased it against HEAD. Attached >is a new version, in which I added the commit message as well. Does >that make sense? If there are no objections, I’ll commit the patch. > +1 regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi Tomas, On Wed, Apr 8, 2020 at 12:15 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > On Mon, Apr 06, 2020 at 05:28:52PM +0900, Etsuro Fujita wrote: > >On Sat, Apr 4, 2020 at 12:15 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: > >> Attached is the original patch (0001) and one patch (0002) with > >> changes including those by Tomas and Ashutosh. > > > >I merged the patches into one and rebased it against HEAD. Attached > >is a new version, in which I added the commit message as well. Does > >that make sense? If there are no objections, I’ll commit the patch. > +1 Great! It's midnight in Japan now, so I'll push the patch early morning. Best regards, Etsuro Fujita
On Wed, Apr 8, 2020 at 2:24 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: > On Wed, Apr 8, 2020 at 12:15 AM Tomas Vondra > <tomas.vondra@2ndquadrant.com> wrote: > > On Mon, Apr 06, 2020 at 05:28:52PM +0900, Etsuro Fujita wrote: > > >On Sat, Apr 4, 2020 at 12:15 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: > > >> Attached is the original patch (0001) and one patch (0002) with > > >> changes including those by Tomas and Ashutosh. > > > > > >I merged the patches into one and rebased it against HEAD. Attached > > >is a new version, in which I added the commit message as well. Does > > >that make sense? If there are no objections, I’ll commit the patch. > > > +1 > > Great! It's midnight in Japan now, so I'll push the patch early morning. Pushed after modifying some comments further, based on the suggestions of Ashutosh. Best regards, Etsuro Fujita
Re: [HACKERS] advanced partition matching algorithm forpartition-wise join
From
Ashutosh Bapat
Date:
Thanks a lot Fujita-san. Thanks Dmitry, Rajkumar, Amul, Mark, Robert, Antonin, Amit, Justin,Thomas and Tomas for all your help and review.
On Wed, 8 Apr 2020 at 07:07, Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
On Wed, Apr 8, 2020 at 2:24 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> On Wed, Apr 8, 2020 at 12:15 AM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> > On Mon, Apr 06, 2020 at 05:28:52PM +0900, Etsuro Fujita wrote:
> > >On Sat, Apr 4, 2020 at 12:15 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> > >> Attached is the original patch (0001) and one patch (0002) with
> > >> changes including those by Tomas and Ashutosh.
> > >
> > >I merged the patches into one and rebased it against HEAD. Attached
> > >is a new version, in which I added the commit message as well. Does
> > >that make sense? If there are no objections, I’ll commit the patch.
>
> > +1
>
> Great! It's midnight in Japan now, so I'll push the patch early morning.
Pushed after modifying some comments further, based on the suggestions
of Ashutosh.
Best regards,
Etsuro Fujita
Best Wishes,
Ashutosh
Hi, On Wed, Apr 8, 2020 at 7:07 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: > > Pushed after modifying some comments further, based on the suggestions > of Ashutosh. I'm getting the following warning during compilation. partbounds.c: In function ‘partition_bounds_merge’: partbounds.c:1024:21: warning: unused variable ‘inner_binfo’ [-Wunused-variable] PartitionBoundInfo inner_binfo = inner_rel->boundinfo; ^ For fixing the same, we can declare inner_binfo as PG_USED_FOR_ASSERTS_ONLY as it is not used for any other purpose. -- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com
Hi Kuntal, On Wed, Apr 8, 2020 at 4:30 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote: > I'm getting the following warning during compilation. > > partbounds.c: In function ‘partition_bounds_merge’: > partbounds.c:1024:21: warning: unused variable ‘inner_binfo’ [-Wunused-variable] > PartitionBoundInfo inner_binfo = inner_rel->boundinfo; > ^ > For fixing the same, we can declare inner_binfo as > PG_USED_FOR_ASSERTS_ONLY as it is not used for any other purpose. I'd propose to remove an assertion causing this (and the outer_binfo/inner_binfo variables) from partition_bounds_merge(), rather than doing so, because the assertion is redundant, as we have the same assertion in merge_list_bounds() and merge_range_bounds(). Please find attached a patch. Best regards, Etsuro Fujita
Attachment
Re: [HACKERS] advanced partition matching algorithm forpartition-wise join
From
Ashutosh Bapat
Date:
Thanks Kuntal for the report. Let me know if this patch works for you.
On Wed, 8 Apr 2020 at 13:00, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
Hi,
On Wed, Apr 8, 2020 at 7:07 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
>
> Pushed after modifying some comments further, based on the suggestions
> of Ashutosh.
I'm getting the following warning during compilation.
partbounds.c: In function ‘partition_bounds_merge’:
partbounds.c:1024:21: warning: unused variable ‘inner_binfo’ [-Wunused-variable]
PartitionBoundInfo inner_binfo = inner_rel->boundinfo;
^
For fixing the same, we can declare inner_binfo as
PG_USED_FOR_ASSERTS_ONLY as it is not used for any other purpose.
--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
Best Wishes,
Ashutosh
Attachment
Re: [HACKERS] advanced partition matching algorithm forpartition-wise join
From
Ashutosh Bapat
Date:
On Wed, 8 Apr 2020 at 15:42, Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
Hi Kuntal,
On Wed, Apr 8, 2020 at 4:30 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> I'm getting the following warning during compilation.
>
> partbounds.c: In function ‘partition_bounds_merge’:
> partbounds.c:1024:21: warning: unused variable ‘inner_binfo’ [-Wunused-variable]
> PartitionBoundInfo inner_binfo = inner_rel->boundinfo;
> ^
> For fixing the same, we can declare inner_binfo as
> PG_USED_FOR_ASSERTS_ONLY as it is not used for any other purpose.
I'd propose to remove an assertion causing this (and the
outer_binfo/inner_binfo variables) from partition_bounds_merge(),
rather than doing so, because the assertion is redundant, as we have
the same assertion in merge_list_bounds() and merge_range_bounds().
Please find attached a patch.
Oh, I didn't see this mail before sending my other mail.
I think it's better to have the assertion in all the three places and also in merge_hash_bounds() whenever that comes along. The assertion in merge_*_bounds() will be good to in case those functions are called from places other than partition_bounds_merge(). The assertion in partition_bounds_merge() will make sure that when the individual merge_*_bounds() functions are called based on one of the bounds both of the bounds have same strategy.
Best Wishes,
Ashutosh
Hello Ashutosh, Fujita, On Wed, Apr 8, 2020 at 3:49 PM Ashutosh Bapat <ashutosh.bapat@2ndquadrant.com> wrote: > On Wed, 8 Apr 2020 at 15:42, Etsuro Fujita <etsuro.fujita@gmail.com> wrote: >> On Wed, Apr 8, 2020 at 4:30 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote: >> > I'm getting the following warning during compilation. >> > >> > partbounds.c: In function ‘partition_bounds_merge’: >> > partbounds.c:1024:21: warning: unused variable ‘inner_binfo’ [-Wunused-variable] >> > PartitionBoundInfo inner_binfo = inner_rel->boundinfo; >> > ^ >> > For fixing the same, we can declare inner_binfo as >> > PG_USED_FOR_ASSERTS_ONLY as it is not used for any other purpose. >> >> I'd propose to remove an assertion causing this (and the >> outer_binfo/inner_binfo variables) from partition_bounds_merge(), >> rather than doing so, because the assertion is redundant, as we have >> the same assertion in merge_list_bounds() and merge_range_bounds(). >> Please find attached a patch. > > > I think it's better to have the assertion in all the three places and also in merge_hash_bounds() whenever that comes along.The assertion in merge_*_bounds() will be good to in case those functions are called from places other than partition_bounds_merge().The assertion in partition_bounds_merge() will make sure that when the individual merge_*_bounds()functions are called based on one of the bounds both of the bounds have same strategy. Both of your patches fix the problem. I don't have much exposure in this area to comment on whether we should keep/remove the assertion from the code. But, here is my opinion: The code structure looks like following: Assert(condition A); if (Condition B) merge_*_bounds(....); Inside merge_*_bounds(), you have both the above assert and the if condition as another assert: Assert(condition A and Condition B); And, merge_*_bounds() are called from only one place. So, something is redundant here and I'm inclined towards removal of the assert condition. Another thing I noticed: /* The partitioning strategies should be the same. */ Assert(outer_binfo->strategy == inner_binfo->strategy); The comment just reads the assertion aloud which looks unnecessary. -- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com
Hi, On Thu, Apr 9, 2020 at 12:06 AM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote: > Both of your patches fix the problem. I don't have much exposure in > this area to comment on whether we should keep/remove the assertion > from the code. But, here is my opinion: > > The code structure looks like following: > Assert(condition A); > if (Condition B) > merge_*_bounds(....); > > Inside merge_*_bounds(), you have both the above assert and the if > condition as another assert: > Assert(condition A and Condition B); > > And, merge_*_bounds() are called from only one place. So, something is > redundant here and I'm inclined towards removal of the assert > condition. Another thing I noticed: > > /* The partitioning strategies should be the same. */ > Assert(outer_binfo->strategy == inner_binfo->strategy); > > The comment just reads the assertion aloud which looks unnecessary. > Yeah, partition_bounds_merge() is currently called only from try_partitionwise_join(), which guarantees that the strategies are the same. The assertion cost would be cheap, but not zero, so I still think it would be better to remove the assertion from partition_bounds_merge(). Best regards, Etsuro Fujita
Etsuro Fujita <etsuro.fujita@gmail.com> writes: > Yeah, partition_bounds_merge() is currently called only from > try_partitionwise_join(), which guarantees that the strategies are the > same. The assertion cost would be cheap, but not zero, so I still > think it would be better to remove the assertion from > partition_bounds_merge(). FWIW, our general policy is that assertion costs should be ignored in any performance considerations. If you're concerned about performance you should be running a non-assert build, so it doesn't matter. (And certainly, there are lots of assertions in the backend that cost FAR more than this one.) The thing to evaluate an assertion on is how likely it is that it would catch a foreseeable sort of coding error in some future patch. Maybe this one carries its weight on that score or maybe it doesn't, but that's how to think about it. If there's only one caller and there's not likely to ever be more, then I tend to agree that you don't need the assertion. regards, tom lane
On Thu, Apr 9, 2020 at 2:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Etsuro Fujita <etsuro.fujita@gmail.com> writes: > > Yeah, partition_bounds_merge() is currently called only from > > try_partitionwise_join(), which guarantees that the strategies are the > > same. > If there's only one caller and there's not likely to ever be more, > then I tend to agree that you don't need the assertion. It seems unlikely that partition_bounds_merge() will be called from more places in the foreseeable future, so I'd still vote for removing the assertion. Best regards, Etsuro Fujita
Re: [HACKERS] advanced partition matching algorithm forpartition-wise join
From
Ashutosh Bapat
Date:
On Thu, Apr 9, 2020 at 12:03 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: > > On Thu, Apr 9, 2020 at 2:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Etsuro Fujita <etsuro.fujita@gmail.com> writes: > > > Yeah, partition_bounds_merge() is currently called only from > > > try_partitionwise_join(), which guarantees that the strategies are the > > > same. > > > If there's only one caller and there's not likely to ever be more, > > then I tend to agree that you don't need the assertion. > > It seems unlikely that partition_bounds_merge() will be called from > more places in the foreseeable future, so I'd still vote for removing > the assertion. When I wrote that function, I had UNION also in mind. A UNION across multiple partitioned relations will be partitioned if we can merge the partition bounds in a sensible manner. Of course the current structure of that function looks more purposed for join, but it's not difficult to convert it to be used for UNION as well. In that case those set of functions will have many more callers. So, I will vote to keep that assertion now that we have it there. -- Best Wishes, Ashutosh Bapat
On Thu, Apr 09, 2020 at 07:34:01PM +0530, Ashutosh Bapat wrote: >On Thu, Apr 9, 2020 at 12:03 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: >> >> On Thu, Apr 9, 2020 at 2:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> > Etsuro Fujita <etsuro.fujita@gmail.com> writes: >> > > Yeah, partition_bounds_merge() is currently called only from >> > > try_partitionwise_join(), which guarantees that the strategies are the >> > > same. >> >> > If there's only one caller and there's not likely to ever be more, >> > then I tend to agree that you don't need the assertion. >> >> It seems unlikely that partition_bounds_merge() will be called from >> more places in the foreseeable future, so I'd still vote for removing >> the assertion. > >When I wrote that function, I had UNION also in mind. A UNION across >multiple partitioned relations will be partitioned if we can merge the >partition bounds in a sensible manner. Of course the current structure >of that function looks more purposed for join, but it's not difficult >to convert it to be used for UNION as well. In that case those set of >functions will have many more callers. So, I will vote to keep that >assertion now that we have it there. Yeah. I really don't see why we should remove an assertion that enforces something useful, especially when it's just a plain comparions. Had it been some expensive assert, maybe. But how much slower does this make an assert-enabled build? 0.000000000001% or something like that? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Apr 9, 2020 at 10:04 AM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
On Thu, Apr 9, 2020 at 12:03 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
>
> On Thu, Apr 9, 2020 at 2:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Etsuro Fujita <etsuro.fujita@gmail.com> writes:
> > > Yeah, partition_bounds_merge() is currently called only from
> > > try_partitionwise_join(), which guarantees that the strategies are the
> > > same.
>
> > If there's only one caller and there's not likely to ever be more,
> > then I tend to agree that you don't need the assertion.
>
> It seems unlikely that partition_bounds_merge() will be called from
> more places in the foreseeable future, so I'd still vote for removing
> the assertion.
When I wrote that function, I had UNION also in mind. A UNION across
multiple partitioned relations will be partitioned if we can merge the
partition bounds in a sensible manner. Of course the current structure
of that function looks more purposed for join, but it's not difficult
to convert it to be used for UNION as well. In that case those set of
functions will have many more callers. So, I will vote to keep that
assertion now that we have it there.
In that case, we really should add the PG_USED_FOR_ASSERTS_ONLY to make the compiler happy.
Cheers,
Jeff
Re: [HACKERS] advanced partition matching algorithm forpartition-wise join
From
Ashutosh Bapat
Date:
On Fri, 10 Apr 2020 at 20:44, Jeff Janes <jeff.janes@gmail.com> wrote:
On Thu, Apr 9, 2020 at 10:04 AM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:On Thu, Apr 9, 2020 at 12:03 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
>
> On Thu, Apr 9, 2020 at 2:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Etsuro Fujita <etsuro.fujita@gmail.com> writes:
> > > Yeah, partition_bounds_merge() is currently called only from
> > > try_partitionwise_join(), which guarantees that the strategies are the
> > > same.
>
> > If there's only one caller and there's not likely to ever be more,
> > then I tend to agree that you don't need the assertion.
>
> It seems unlikely that partition_bounds_merge() will be called from
> more places in the foreseeable future, so I'd still vote for removing
> the assertion.
When I wrote that function, I had UNION also in mind. A UNION across
multiple partitioned relations will be partitioned if we can merge the
partition bounds in a sensible manner. Of course the current structure
of that function looks more purposed for join, but it's not difficult
to convert it to be used for UNION as well. In that case those set of
functions will have many more callers. So, I will vote to keep that
assertion now that we have it there.In that case, we really should add the PG_USED_FOR_ASSERTS_ONLY to make the compiler happy.
Attaching my patch again. It doesn't need PG_USED_FOR_ASSERTS_ONLY as well. Kuntal has confirmed that this fixes the warning for him.
Best Wishes,
Ashutosh
Ashutosh Bapat <ashutosh.bapat@2ndquadrant.com> writes: > On Fri, 10 Apr 2020 at 20:44, Jeff Janes <jeff.janes@gmail.com> wrote: >> In that case, we really should add the PG_USED_FOR_ASSERTS_ONLY to make >> the compiler happy. > Attaching my patch again. It doesn't need PG_USED_FOR_ASSERTS_ONLY as well. > Kuntal has confirmed that this fixes the warning for him. I see no patch here ... regards, tom lane
Re: [HACKERS] advanced partition matching algorithm forpartition-wise join
From
Ashutosh Bapat
Date:
On Fri, Apr 10, 2020 at 9:14 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Ashutosh Bapat <ashutosh.bapat@2ndquadrant.com> writes: > > On Fri, 10 Apr 2020 at 20:44, Jeff Janes <jeff.janes@gmail.com> wrote: > >> In that case, we really should add the PG_USED_FOR_ASSERTS_ONLY to make > >> the compiler happy. > > > Attaching my patch again. It doesn't need PG_USED_FOR_ASSERTS_ONLY as well. > > Kuntal has confirmed that this fixes the warning for him. > > I see no patch here ... > Sorry. Here it is -- Best Wishes, Ashutosh Bapat
Attachment
Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> writes: > On Fri, Apr 10, 2020 at 9:14 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I see no patch here ... > Sorry. Here it is LGTM, will push in a moment. regards, tom lane
On Sat, Apr 11, 2020 at 1:00 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> writes: > > On Fri, Apr 10, 2020 at 9:14 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> I see no patch here ... > > > Sorry. Here it is > > LGTM, will push in a moment. Thanks for taking care of this, Tom! Thanks for the patch, Ashutosh! Thanks for the report, Kuntal! Best regards, Etsuro Fujita
On Mon, Apr 6, 2020 at 8:43 PM Ashutosh Bapat <ashutosh.bapat@2ndquadrant.com> wrote: > On Fri, 3 Apr 2020 at 20:45, Etsuro Fujita <etsuro.fujita@gmail.com> wrote: > But it will be good to have following addition I suggested in my patches to make sure nparts is set to 0 for an unpartitionedrelation as per the comment on nparts in RelOptInfo. > @@ -1653,6 +1663,8 @@ build_joinrel_partition_info(RelOptInfo *joinrel, RelOptInfo *outer_rel, > jointype, restrictlist)) > { > Assert(!IS_PARTITIONED_REL(joinrel)); > + /* Join is not partitioned. */ > + joinrel->nparts = 0; > return; > } I didn't modified that function as proposed, because I thought that 1) there would be no need to do so, and that 2) it would be better to set joinrel->nparts only when we set joinrel->part_schema, for consistency. >> >> 3) I think the for nparts comment is somewhat misleading: >> >> >> >> int nparts; /* number of partitions; 0 = not partitioned; >> >> * -1 = not yet set */ >> >> >> >> which says that nparts=0 means not partitioned. But then there are >> >> conditions like this: >> >> >> >> /* Nothing to do, if the join relation is not partitioned. */ >> >> if (joinrel->part_scheme == NULL || joinrel->nparts == 0) >> >> return; >> >> >> >> which (ignoring the obsolete comment) seems to say we can have nparts==0 >> >> even for partitioned tables, no? >> >> Yeah, I think I was a bit hasty. I fixed this. > For a non-join relation, nparts = 0 and nparts = -1 both have the same meaning. Although we never set nparts = 0 for anon-join relation? I don't think so. Consider this: create table prt (a int, b int) partition by range (a); create table prt_p1 partition of prt for values from (0) to (250); create table prt_p2 partition of prt for values from (250) to (500); drop table prt_p1; drop table prt_p2; select count(*) from prt; For this query, we would have nparts=0 for the partitioned table prt. Thanks! Sorry for the delay. Best regards, Etsuro Fujita