Thread: [HACKERS] advanced partition matching algorithm for partition-wise join

[HACKERS] advanced partition matching algorithm for partition-wise join

From

Ashutosh Bapat

Date:

21 August 2017, 10:13:16

The patch-set in [1] supports partition-wise join when the partition bounds and
partition keys of the joining tables exactly match. The last two patches in the
last few patch-sets in that thread implement more
advanced partition matching code. In order to avoid mixing reviews for advanced
partition matching and the basic partition-wise join implementation, I am
starting a new thread to discuss the same. I am attaching the last two
patches from that patch set here.

The new partition matching algorithm handles following cases when a
given partition
on one side has at most one matching partition matching on the other side.

1. When the ranges of the joining tables do not match exactly E.g. partition
table t1 has partitions t1p1 (0 - 100), t1p2 (150 - 200) and partition table t2
has partitions t2p1 (0 - 50), t2p2 (100 - 175). In this case (t1p1, t2p1) and
(t1p2, t2p2) form the matching partition pairs, which can be joined. While
matching the pairs, we also compute the partition bounds for the resulting
join. An INNER join between t1 and t2 will have ranges (0 - 50) since no row
with 50 <= key < 100 from t1p1 is going to find a matching row in t2p1 and (150
- 175) since no row with 100 <= key < 150 from t2p2 is going to find a matching
row in t1p2 and no row with 175 <= key < 200 in t1p2 is going to find a
matching row in t1p2. A t1 LEFT join t2 on the other hand will have ranges same
as the outer relation i.e. t1, (0 - 100), (150 - 200) since all rows from t1
will be part of the join. Thus depending upon the type of join the partition
bounds of the resultant join relation change. Similarly for list partitioned
table, when the lists do not match exactly, the algorithm finds matching pairs
of partitions and the lists of resultant join relation. E.g. t1 has
partitions t1p1 ('a',
'b', 'c'), t1p2 ('e', 'f') and t2 has partitions t2p1 ('a', 'b'), t2p2 ('d',
'e', 'f'). In this case (t1p1, t2p1) and (t2p1, t2p2) form the matching
pairs which are joined. Inner join will have bounds ('a','b'), ('e', 'f') and
t1 LEFT JOIN t2 will have bounds same as t1.

2. When one or both side have at least one partition that does not have
matching partition on the other side. E.g. t1 has partitions t1p1 ('a','b'),
t1p2 ('c','d') and t2 has only one partition t2p1 ('a','b') OR t1 has
partitions t1p1 (0 - 100), t1p2 (100 - 200) and t2 has only one partition t2p1
(0 - 100). In this case as well different types of joins will have different
partition bounds for the result using similar rules described above.

3. A combination of 1 and 2 e.g. t1 has partitions t1p1('a','b','c'),
t1p2('d','e','f') and t2 has a single partition t2p1 ('a','b', 'z').

Algorithm
---------
The pairs of matching partitions and the partition bounds of the join are
calculated by an algorithm similar to merge join.

In such a join, it can be observed that every partition on either side,
contributes to at most one partition of the resultant join relation. Thus for
every partition on either side, we keep track of the partition of resultant
join (if any), which it contributes to. If multiple partitions from any of the
joining relations map to a single partition of the resultant join, we need to
gang those partitions together before joining the partition/s from the other
side. Since we do not have infrastructure for ganging multiple arbitrary
RelOptInfos together in a parent RelOptInfo, we do not support such a
partitionw-wise join right now. We stop merging the bounds immediately when we
detect such a case.

For list partitioned tables, we compare list values from both the sides,
starting with the lowest. If the two list values being compared match,
corresponding partitions from both sides form a pair of partitions to be
joined. We record this mapping and also include the list value in join bounds.
If the two list values do not match and the lower of those two comes from the
outer side of the join, we include it in the join bounds. We advance to the
next list value on side with the lower list value continuing the process of
merging till list values on at least one side are exhausted. If the remaining
values are from the outer side, we include those in the join partition bounds.
Every list value included in the join bounds, and its originating partition/s
are associated with appropriate partition of the resultant join. For more
details please see partition_list_bounds_merge() in the attached patch.

In case of range partitioned tables, we compare the ranges of the partitions in
increasing order of their bounds. If two ranges being compared overlap,
corresponding partitions from both sides form a pair of partitions to be
joined. We record this mapping and also include the merged range in the bounds
of resultant join. The overlapping ranges are merged based on the type of join
as described above. If either of the ranges completely precedes the other, and
it's on the outer side, we include that range in the bounds of resultant join.
We advance to the next range on the side with lower upper bound till ranges on
at least one side are exhausted. If the remaining ranges are from the outer
side, we include those in the partition bounds of resultant join. While
including a range in the partition bounds of the resultant join if its lower
bound precedes the upper bound of the last included range, it indicates that
multiple partitions on that side map to one partition on the other side, so we
bail out. Notice that in this method, we always include the ranges in the
partition bounds of the resultant join in the increasing order of their bounds.
Every range included in the join's partition bounds and it's corresponding
partition/s from joining relations are associated with appropriate partition of
the resultant join. For more details please see partition_range_bounds_merge()
in the attached patch.

The partitions from both sides (one partition from each side) which map to the
same partition of the resultant join are joined to form child-joins. The case
when an outer partition may not have a matching partition from the inner side
will be discussed in the next section. Except for the above algorithm to find
the pairs of matching partitions and calculating bounds of the resultant join,
the rest of the partition-wise join algorithm remains the same.

Unsupported case: When a partition from outer side doesn't have matching
partition on the inner side.
--------------------------------------------------------------------------
Consider a join t1 LEFT JOIN t2 where t1 has partitions t1p1 (0 - 100), t1p2
(100 - 200) and t2 has a single partition t2p1(0 - 100). The rows in t1p2 won't
have a matching row in t2 since there is no partition matching t1p2. The result
of the join will have rows in t1p2 with columns from t2 NULLed. In order to
execute this join as a partition-wise join, we need a dummy relation in place
of the missing partition, which we can join with t1p2. We need this placeholder
dummy relation (its targetlist, relids etc.), so that rest of the planner can
work with the resulting child-join.

We notice the missing partitions only while planning the join (during the
execution of make_one_rel()), by which time we have frozen the number of base
relations. Introducing a base relation during join planning is not supported
in current planner. Similarly, a partition can be missing from a partitioned
join relation, in which case we have to add a dummy join relation. This might
need adding corresponding base relations as well. I have not spent time looking
for what it takes to support these cases. For now the patch does not support
partition-wise join in such cases.

TODOs
-----------
1. Add tests for advanced partition matching algorithm
2. Improve code quality, commenting, function names etc.

[1] https://www.postgresql.org/message-id/CAFjFpRd9Vqh_=-Ldv-XqWY006d07TJ+VXuhXCbdj=P1jukYBrw@mail.gmail.com

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Rajkumar Raghuwanshi

Date:

28 August 2017, 10:14:15

On Mon, Aug 21, 2017 at 12:43 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:

TODOs
-----------
1. Add tests for advanced partition matching algorithm

Hi Ashutosh,

I have applied all partition-wise-join patches (v26) and tested feature. I have modified partition_join.sql file and added extra test cases to test partition matching.

Attaching WIP test case patch which as of now have some server crashes and a data corruptions issue which is commented in the file itself and need to be removed once issue got solved. Also some of queries is not picking or picking partition-wise-join as per expectation which may need some adjustment.

Attachment

advanced_partition_matching_test.patch

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Ashutosh Bapat

Date:

01 September 2017, 22:12:03

PFA the patches rebased on the latest sources. There are also fixes
for some of the crashes and bugs reported. I haven't yet included the
testcase patch in the main patchset.

On Mon, Aug 28, 2017 at 12:44 PM, Rajkumar Raghuwanshi
<rajkumar.raghuwanshi@enterprisedb.com> wrote:
> On Mon, Aug 21, 2017 at 12:43 PM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>>
>> TODOs
>> -----------
>> 1. Add tests for advanced partition matching algorithm
>
>
> Hi Ashutosh,
>
> I have applied all partition-wise-join patches (v26) and tested feature. I
> have modified partition_join.sql file and added extra test cases to test
> partition matching.
>
> Attaching WIP test case patch which as of now have some server crashes and a
> data corruptions issue which is commented in the file itself and need to be
> removed once issue got solved. Also some of queries is not picking or
> picking partition-wise-join as per expectation which may need some
> adjustment.
>
>
>
>
>
>



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Rajkumar Raghuwanshi

Date:

04 September 2017, 15:30:13

On Sat, Sep 2, 2017 at 12:42 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:

PFA the patches rebased on the latest sources. There are also fixes
for some of the crashes and bugs reported. I haven't yet included the
testcase patch in the main patchset.

On Mon, Aug 28, 2017 at 12:44 PM, Rajkumar Raghuwanshi
<rajkumar.raghuwanshi@enterprisedb.com> wrote:
> On Mon, Aug 21, 2017 at 12:43 PM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>>
>> TODOs
>> -----------
>> 1. Add tests for advanced partition matching algorithm
>
>
> Hi Ashutosh,
>
> I have applied all partition-wise-join patches (v26) and tested feature. I
> have modified partition_join.sql file and added extra test cases to test
> partition matching.
>
> Attaching WIP test case patch which as of now have some server crashes and a
> data corruptions issue which is commented in the file itself and need to be
> removed once issue got solved. Also some of queries is not picking or
> picking partition-wise-join as per expectation which may need some
> adjustment.
>

I have applied v27 patches and tested feature. Also tried to reduce regression diff with the

existing partition_join.sql by adding new partition instead of changing original partition bounds.

Attached WIP patch have a server crash and some wrong output which need to be fixed. I have

commented these issue in patch itself, Please take a look and let me know if it need more

changes.

Attachment

advance_partition_matching_test_v1.patch

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Ashutosh Bapat

Date:

05 September 2017, 14:04:18

>
> I have applied v27 patches and tested feature. Also tried to reduce
> regression diff with the
> existing partition_join.sql by adding new partition instead of changing
> original partition bounds.
>
> Attached WIP patch have a server crash and some wrong output which need to
> be fixed. I have
> commented these issue in patch itself, Please take a look and let me know if
> it need more
> changes.

I have fixed the issues which were marked as TODOs in the attached
patches. Also, I have included your test change patch in my series of
patches. Are there any other issues you have commented out?

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Rajkumar Raghuwanshi

Date:

05 September 2017, 14:55:03

On Tue, Sep 5, 2017 at 4:34 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:

I have fixed the issues which were marked as TODOs in the attached
patches. Also, I have included your test change patch in my series of
patches. Are there any other issues you have commented out?

Thanks Ashutosh, All commented issue got fixed. I am working on some combinations of N-way joins
to test partition matching, will send those as well once done.

Re: [HACKERS] advanced partition matching algorithm for partition-wise join

From

Antonin Houska

Date:

07 September 2017, 17:04:14

Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:

> I have fixed the issues which were marked as TODOs in the attached
> patches. Also, I have included your test change patch in my series of
> patches.

I've noticed that partition_bounds_merge() is called twice from
make_join_rel():
* build_join_rel -> build_joinrel_partition_info -> partition_bounds_merge
* try_partition_wise_join -> partition_bounds_merge

Is this intentional, or just a thinko?

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Ashutosh Bapat

Date:

11 September 2017, 11:08:21

On Thu, Sep 7, 2017 at 7:34 PM, Antonin Houska <ah@cybertec.at> wrote:
> Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
>
>> I have fixed the issues which were marked as TODOs in the attached
>> patches. Also, I have included your test change patch in my series of
>> patches.
>
> I've noticed that partition_bounds_merge() is called twice from
> make_join_rel():
>
>  * build_join_rel -> build_joinrel_partition_info -> partition_bounds_merge
>
>  * try_partition_wise_join -> partition_bounds_merge
>
> Is this intentional, or just a thinko?
>

This is expected. partition_bounds_merge() also returns the pairs of
matching partitions. So, we have to call that function for every pair
of joining relations.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Ashutosh Bapat

Date:

11 October 2017, 14:08:23

Here's updated patch set based on the basic partition-wise join
committed. The patchset applies on top of the patch to optimize the
case of dummy partitioned tables [1].

Right now, the advanced partition matching algorithm bails out when
either of the joining relations has a default partition.

[1] https://www.postgresql.org/message-id/CAFjFpRcPvT5ay9_p3e-k2Cwu4bW_rypON7ceJVWhsU3Uk4Nmmg@mail.gmail.com

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Robert Haas

Date:

12 October 2017, 19:16:59

On Wed, Oct 11, 2017 at 7:08 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> Here's updated patch set based on the basic partition-wise join
> committed. The patchset applies on top of the patch to optimize the
> case of dummy partitioned tables [1].
>
> Right now, the advanced partition matching algorithm bails out when
> either of the joining relations has a default partition.

So is that something you are going to fix?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Ashutosh Bapat

Date:

13 October 2017, 05:29:16

On Thu, Oct 12, 2017 at 9:46 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Oct 11, 2017 at 7:08 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> Here's updated patch set based on the basic partition-wise join
>> committed. The patchset applies on top of the patch to optimize the
>> case of dummy partitioned tables [1].
>>
>> Right now, the advanced partition matching algorithm bails out when
>> either of the joining relations has a default partition.
>
> So is that something you are going to fix?
>

Yes, if time permits. I had left the patch unattended while basic
partition-wise join was getting committed. Now that it's committed, I
rebased it. It still has TODOs and some work is required to improve
it. But for the patch to be really complete, we have to deal with the
problem of missing partitions described before. I am fine
collaborating if someone else wants to pick it up.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Ashutosh Bapat

Date:

03 December 2017, 14:23:36

On Fri, Oct 13, 2017 at 7:59 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> On Thu, Oct 12, 2017 at 9:46 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, Oct 11, 2017 at 7:08 AM, Ashutosh Bapat
>> <ashutosh.bapat@enterprisedb.com> wrote:
>>> Here's updated patch set based on the basic partition-wise join
>>> committed. The patchset applies on top of the patch to optimize the
>>> case of dummy partitioned tables [1].
>>>
>>> Right now, the advanced partition matching algorithm bails out when
>>> either of the joining relations has a default partition.
>>
>> So is that something you are going to fix?
>>
>
> Yes, if time permits. I had left the patch unattended while basic
> partition-wise join was getting committed. Now that it's committed, I
> rebased it. It still has TODOs and some work is required to improve
> it. But for the patch to be really complete, we have to deal with the
> problem of missing partitions described before. I am fine
> collaborating if someone else wants to pick it up.
>

Here's patchset which support advanced partition matching for
partition bounds with default partition. The patchset is rebased on
the latest head.

When a list value is present in one of the joining relations and not
the other, and the other relation has default partition, match (join)
the partition containing that list value with the default partition,
since the default partition may contain rows with that list value. If
the default partition happens to be on the outer side of the join, the
resulting join partition acts as a default partition as it will
contain all the values from the default partition. If the partition
containing the list value happens to be on the outer side of the join,
the resulting join partition is associated with the list value, since
no other partition key value from the default partition makes it to
the join result.

When a range is present (completely or partly) in one of the joining
relations and not the other, and the other relation has default
partition, match (join) the partition corresponding to that range with
the default partition. If the default partition happens to be on the
outer side of the join, the resulting join partition acts as a default
partition as it will contain all the values from the default
partition. If the non-partition corresponding to the range happens to
be on the outer side of the join, the resulting join partition is
associated with that range, since partition key values from the
default partition outside that range won't make it to the join result.

If both the relations have default partition, match (join) the default
partition with each other and deem the resulting join partition as
default partition. If one of the relations has default partition but
not the other, and the default partition happens to be on the outer
side of the join, all its rows will make it to the join.  Such a
default partition may get joined to a non-default partition from the
inner side, if inner side has a range missing in the outer side.

If any of the above causes multiple partitions from one side to match
with one or more partitions on the other side, we won't use
partition-wise join as discussed in the first mail of this thread.

I have tested the patches for two-way join, but haven't added any test
involving default partitions to the patch itself. It needs to be
tested for N-way join as well. So, for now I have kept the two patches
supporting the default partition in case of range and list resp.
separate. Also, some of the code duplication in partition matching
functions can be avoided using macros. I will merge those patches into
the main patch and add macros once they are tested appropriately.

For hash partitioned table, we haven't implemented the advanced
partition matching, since it would be rare that somebody has hash
partitioned tables with holes (even if they are allowed).

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment

pg_adv_dp_join_patches_v2.tar.gz

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Ashutosh Bapat

Date:

07 February 2018, 07:51:58

Here's a new patchset with following changes

1. Rebased on the latest head taking care of partition bound
comparison function changes
2. Refactored the code to avoid duplication.
3. There's an extensive test (provided by Rajkumar) set added, which
is not meant to be committed. That testset has testcases which crash
or reveal a bug. I will fix those crashes and add corresponding
testcases to partition_join.sql.

TODO
1. FIx crashes/bugs in the testcases.



On Sun, Dec 3, 2017 at 4:53 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> On Fri, Oct 13, 2017 at 7:59 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> On Thu, Oct 12, 2017 at 9:46 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> On Wed, Oct 11, 2017 at 7:08 AM, Ashutosh Bapat
>>> <ashutosh.bapat@enterprisedb.com> wrote:
>>>> Here's updated patch set based on the basic partition-wise join
>>>> committed. The patchset applies on top of the patch to optimize the
>>>> case of dummy partitioned tables [1].
>>>>
>>>> Right now, the advanced partition matching algorithm bails out when
>>>> either of the joining relations has a default partition.
>>>
>>> So is that something you are going to fix?
>>>
>>
>> Yes, if time permits. I had left the patch unattended while basic
>> partition-wise join was getting committed. Now that it's committed, I
>> rebased it. It still has TODOs and some work is required to improve
>> it. But for the patch to be really complete, we have to deal with the
>> problem of missing partitions described before. I am fine
>> collaborating if someone else wants to pick it up.
>>
>
> Here's patchset which support advanced partition matching for
> partition bounds with default partition. The patchset is rebased on
> the latest head.
>
> When a list value is present in one of the joining relations and not
> the other, and the other relation has default partition, match (join)
> the partition containing that list value with the default partition,
> since the default partition may contain rows with that list value. If
> the default partition happens to be on the outer side of the join, the
> resulting join partition acts as a default partition as it will
> contain all the values from the default partition. If the partition
> containing the list value happens to be on the outer side of the join,
> the resulting join partition is associated with the list value, since
> no other partition key value from the default partition makes it to
> the join result.
>
> When a range is present (completely or partly) in one of the joining
> relations and not the other, and the other relation has default
> partition, match (join) the partition corresponding to that range with
> the default partition. If the default partition happens to be on the
> outer side of the join, the resulting join partition acts as a default
> partition as it will contain all the values from the default
> partition. If the non-partition corresponding to the range happens to
> be on the outer side of the join, the resulting join partition is
> associated with that range, since partition key values from the
> default partition outside that range won't make it to the join result.
>
> If both the relations have default partition, match (join) the default
> partition with each other and deem the resulting join partition as
> default partition. If one of the relations has default partition but
> not the other, and the default partition happens to be on the outer
> side of the join, all its rows will make it to the join.  Such a
> default partition may get joined to a non-default partition from the
> inner side, if inner side has a range missing in the outer side.
>
> If any of the above causes multiple partitions from one side to match
> with one or more partitions on the other side, we won't use
> partition-wise join as discussed in the first mail of this thread.
>
> I have tested the patches for two-way join, but haven't added any test
> involving default partitions to the patch itself. It needs to be
> tested for N-way join as well. So, for now I have kept the two patches
> supporting the default partition in case of range and list resp.
> separate. Also, some of the code duplication in partition matching
> functions can be avoided using macros. I will merge those patches into
> the main patch and add macros once they are tested appropriately.
>
> For hash partitioned table, we haven't implemented the advanced
> partition matching, since it would be rare that somebody has hash
> partitioned tables with holes (even if they are allowed).
>
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment

pg_adv_dp_join_patches_v3.tar.gz

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Amit Langote

Date:

08 February 2018, 05:55:20

Hi Ashutosh.

On 2018/02/07 13:51, Ashutosh Bapat wrote:
> Here's a new patchset with following changes
> 
> 1. Rebased on the latest head taking care of partition bound
> comparison function changes

I was about to make these changes myself while revising the fast pruning
patch.  Instead, I decided to take a look at your patch and try to use it
in my tree.

I looked at the patch 0001 and noticed that git diff --check says:

src/backend/catalog/partition.c:2900: trailing whitespace.
+partition_rbound_datum_cmp(FmgrInfo *partsupfunc, Oid *partcollation,

Also, might be a good idea to write briefly about the new arguments in the
header comment.  Something like that they are PartitionKey elements.

Thanks,
Amit

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Amit Langote

Date:

08 February 2018, 08:11:07

On 2018/02/08 11:55, Amit Langote wrote:
> Hi Ashutosh.
> 
> On 2018/02/07 13:51, Ashutosh Bapat wrote:
>> Here's a new patchset with following changes
>>
>> 1. Rebased on the latest head taking care of partition bound
>> comparison function changes
> 
> I was about to make these changes myself while revising the fast pruning
> patch.  Instead, I decided to take a look at your patch and try to use it
> in my tree.

I also noticed that a later patch adds partsupfunc to PartitionScheme,
which the pruning patch needs too.  So, perhaps would be nice to take out
that portion of the patch.  That is, the changes to PartitionScheme struct
definition and those to find_partition_scheme().

Regarding the latter, wouldn't be nice to have a comment before the code
that does the copying about why we don't compare the partsupfunc field to
decide if we have a match or not.  I understand it's because the
partsupfunc array contains pointers, not OIDs.  But maybe, that's too
obvious to warrant a comment.

Thanks,
Amit

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Ashutosh Bapat

Date:

09 February 2018, 08:27:40

On Thu, Feb 8, 2018 at 8:25 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Hi Ashutosh.
>
> On 2018/02/07 13:51, Ashutosh Bapat wrote:
>> Here's a new patchset with following changes
>>
>> 1. Rebased on the latest head taking care of partition bound
>> comparison function changes
>
> I was about to make these changes myself while revising the fast pruning
> patch.  Instead, I decided to take a look at your patch and try to use it
> in my tree.
>
> I looked at the patch 0001 and noticed that git diff --check says:
>
> src/backend/catalog/partition.c:2900: trailing whitespace.
> +partition_rbound_datum_cmp(FmgrInfo *partsupfunc, Oid *partcollation,

Thanks. Fixed.

>
> Also, might be a good idea to write briefly about the new arguments in the
> header comment.  Something like that they are PartitionKey elements.

Here's updated patch set with those comments added.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment

pg_adv_dp_join_patches_v4.tar.gz

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Ashutosh Bapat

Date:

09 February 2018, 08:31:58

On Thu, Feb 8, 2018 at 10:41 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2018/02/08 11:55, Amit Langote wrote:
>> Hi Ashutosh.
>>
>> On 2018/02/07 13:51, Ashutosh Bapat wrote:
>>> Here's a new patchset with following changes
>>>
>>> 1. Rebased on the latest head taking care of partition bound
>>> comparison function changes
>>
>> I was about to make these changes myself while revising the fast pruning
>> patch.  Instead, I decided to take a look at your patch and try to use it
>> in my tree.
>
> I also noticed that a later patch adds partsupfunc to PartitionScheme,
> which the pruning patch needs too.  So, perhaps would be nice to take out
> that portion of the patch.  That is, the changes to PartitionScheme struct
> definition and those to find_partition_scheme().

I am not sure whether a patch with just that change and without any
changes to use that member will be acceptable. So leaving this aside.

>
> Regarding the latter, wouldn't be nice to have a comment before the code
> that does the copying about why we don't compare the partsupfunc field to
> decide if we have a match or not.  I understand it's because the
> partsupfunc array contains pointers, not OIDs.  But maybe, that's too
> obvious to warrant a comment.

It's because partsupfuncs should point to the information of the same
function when partopfamily matches and partopcintype matches. I would
have added an assertion for that with a comment, but with the pointer
that would be risky. Or we can just assert that the oids match.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Amit Langote

Date:

09 February 2018, 08:56:28

On 2018/02/09 14:31, Ashutosh Bapat wrote:
>> I also noticed that a later patch adds partsupfunc to PartitionScheme,
>> which the pruning patch needs too.  So, perhaps would be nice to take out
>> that portion of the patch.  That is, the changes to PartitionScheme struct
>> definition and those to find_partition_scheme().
> 
> I am not sure whether a patch with just that change and without any
> changes to use that member will be acceptable. So leaving this aside.

I asked, because with everything that I have now changed in the partition
pruning patch, one would need to pass these FmgrInfo pointers down to
partition bound searching functions from the optimizer.  If the changes to
add partsupfunc to the optimizer were taken out from your main patch, the
pruning patch could just start using it.  For now, I'm making those
changes part of the pruning patch.

Thanks,
Amit

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Ashutosh Bapat

Date:

09 February 2018, 09:05:38

On Fri, Feb 9, 2018 at 11:26 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2018/02/09 14:31, Ashutosh Bapat wrote:
>>> I also noticed that a later patch adds partsupfunc to PartitionScheme,
>>> which the pruning patch needs too.  So, perhaps would be nice to take out
>>> that portion of the patch.  That is, the changes to PartitionScheme struct
>>> definition and those to find_partition_scheme().
>>
>> I am not sure whether a patch with just that change and without any
>> changes to use that member will be acceptable. So leaving this aside.
>
> I asked, because with everything that I have now changed in the partition
> pruning patch, one would need to pass these FmgrInfo pointers down to
> partition bound searching functions from the optimizer.  If the changes to
> add partsupfunc to the optimizer were taken out from your main patch, the
> pruning patch could just start using it.  For now, I'm making those
> changes part of the pruning patch.

That's fine. Someone's patch will be committed first and the other
will just take out those changes. But I am open to separate those
changes into other patch if a committer feels so.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Amit Langote

Date:

15 February 2018, 12:11:20

Hi Ashutosh.

On 2018/02/09 14:27, Ashutosh Bapat wrote:
> Here's updated patch set with those comments added.

I looked at patches 0002 and 0003.

In 0002:

+ * In case of hash partition we store modulus and remainder in datums array

In case of hash partitioned table?

+ * which has the same data type irrespective of the number of partition keys
+ * and their data types. Hence we can compare the hash bound collection
without
+ * any partition key specific information.

"has the same data type" sounds like it means a Postgres data type,
whereas I think you mean that they are simple int32 values, so we don't
need any PartitionKey information to compare them.

In 0003:

A portion of code in both partition_range_bounds_merge(),
partition_list_bounds_merge(), and merge_null_partitions() has an extra
semi-colon at the end of a line starting with else if:

                   if (default_index < 0)
                        default_index = merged_index;
                    else if(default_index != merged_index);
                    {

which emits warnings like this:

partition.c: In function ‘partition_range_bounds_merge’:
partition.c:4192:11: warning: this ‘if’ clause does not guard...
[-Wmisleading-indentation]
      else if(default_index != merged_index);

           ^~
partition.c: In function ‘partition_list_bounds_merge’:
partition.c:4261:11: warning: this ‘if’ clause does not guard...
[-Wmisleading-indentation]
      else if(default_index != merged_index);
           ^~
Also, get this warning.

partition.c:3955:1: warning: ‘is_next_range_continuous’ defined but not
used [-Wunused-function]

I'm trying to understand the optimizer side of this patch.  In your commit
message, I read:

    This commit allows partition-wise join to be applied under
    following conditions

    1. the partition bounds of joining relations are such that rows from
    given partition on one side join can join with rows from maximum one
    partition on the other side i.e. bounds of a given partition on one
    side match/overlap with those of maximum one partition on the other
    side. If the mapping happens to be m:n where m > 1 or n > 1, we have
    to gang multiple partition relations together into a single relation.
    This means that we have to add simple relations during join
    processing, something which is not supported right now.  ALso, in such
    a case, different pairs of joining relations can produce different
    partition bounds for the same join relation, which again is not
    supported right now.


So, is the currently allowed case (partition bounds on both sides match
exactly) a special case of this new feature which tries to match
partitions in a more generalized manner?  I see that this patch removes
the partition_bound_equal(outer_rel->boundinfo, inner_rel->boundinfo)
check in build_joinrel_partition_info() in favor of reconciling any
differences in the representation of the partition bounds by calling
partition_bounds_merge() from try_partition_wise_join().

    2. For every partition on outer side that can contribute to the result
    of an OUTER side, there exists at least one (taken along with item 1,
    it means exactly one)  matching partition on the inner side. To
    support partition-wise join when the inner matching partition doesn't
    exist, we have to add a dummy base relation corresponding to the
    non-existent inner partition. We don't have support add base relations
    during join processing.

Sorry but I'm a bit confused by the last sentence; does it mean we're not
able to allow partition-wise join to happen in this case?  But this is in
the list of the new cases that the patch makes partition-wise join to
happen for.


Looking at the code changes under src/backend/optimizer:

+    else
+    {
+        Assert(partition_bounds_equal(part_scheme->partnatts,
+                                      part_scheme->parttyplen,
+                                      part_scheme->parttypbyval,
+                                      join_boundinfo, joinrel->boundinfo));

IIUC, this else block would run when try_partition_wise_join() is called
again for the same pair of relations.

+        /*
+         * Every pair of joining relations should result in the same number
+         * of child-joins.
+         */

Sorry if I'm misreading this, but does it mean: a given pair of joining
relations should always result in the same number of (set of, too?)
child-joins?

In the new comment in build_joinrel_partition_info():

+     * Because of restrictions in partition_bounds_merge(), not every pair of
+     * joining relation

joining relations


I will try to hop into partition_bounds_merge() from now...

Thanks,
Amit

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Ashutosh Bapat

Date:

16 February 2018, 08:14:32

On Thu, Feb 15, 2018 at 2:41 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Hi Ashutosh.
>
> On 2018/02/09 14:27, Ashutosh Bapat wrote:
>
> In 0002:
>
> + * In case of hash partition we store modulus and remainder in datums array
>
> In case of hash partitioned table?
>
> + * which has the same data type irrespective of the number of partition keys
> + * and their data types. Hence we can compare the hash bound collection
> without
> + * any partition key specific information.
>
> "has the same data type" sounds like it means a Postgres data type,
> whereas I think you mean that they are simple int32 values, so we don't
> need any PartitionKey information to compare them.

Modified this comment to read
+ * Hash partition bounds store modulus and remainder in datums array which are
+ * always integers irrespective of the number of partition keys and their data
+ * types. Hence we can compare the hash bound collection without any partition
+ * key specific information. Separating this logic in a function which does not
+ * require partition key specific information allows it be called from places
+ * where the partition key specific information is not completely available.

Let me know if that looks good.

>
> In 0003:
>
> A portion of code in both partition_range_bounds_merge(),
> partition_list_bounds_merge(), and merge_null_partitions() has an extra
> semi-colon at the end of a line starting with else if:
>
>                    if (default_index < 0)
>                         default_index = merged_index;
>                     else if(default_index != merged_index);
>                     {
>
> which emits warnings like this:
>
> partition.c: In function ‘partition_range_bounds_merge’:
> partition.c:4192:11: warning: this ‘if’ clause does not guard...
> [-Wmisleading-indentation]
>       else if(default_index != merged_index);
>
>            ^~
> partition.c: In function ‘partition_list_bounds_merge’:
> partition.c:4261:11: warning: this ‘if’ clause does not guard...
> [-Wmisleading-indentation]
>       else if(default_index != merged_index);
>

Thanks for catching those. They will cause bugs.

> Also, get this warning.
>
> partition.c:3955:1: warning: ‘is_next_range_continuous’ defined but not
> used [-Wunused-function]

I had added that function earlier, but it's no more useful in the
patch. But I have kept it there in case we need it again. I will
remove it in the final patch.

>
> I'm trying to understand the optimizer side of this patch.  In your commit
> message, I read:
>
>     This commit allows partition-wise join to be applied under
>     following conditions
>
>     1. the partition bounds of joining relations are such that rows from
>     given partition on one side join can join with rows from maximum one
>     partition on the other side i.e. bounds of a given partition on one
>     side match/overlap with those of maximum one partition on the other
>     side. If the mapping happens to be m:n where m > 1 or n > 1, we have
>     to gang multiple partition relations together into a single relation.
>     This means that we have to add simple relations during join
>     processing, something which is not supported right now.  ALso, in such
>     a case, different pairs of joining relations can produce different
>     partition bounds for the same join relation, which again is not
>     supported right now.
>
>
> So, is the currently allowed case (partition bounds on both sides match
> exactly) a special case of this new feature which tries to match
> partitions in a more generalized manner?

Right.

>
>     2. For every partition on outer side that can contribute to the result
>     of an OUTER side, there exists at least one (taken along with item 1,
>     it means exactly one)  matching partition on the inner side. To
>     support partition-wise join when the inner matching partition doesn't
>     exist, we have to add a dummy base relation corresponding to the
>     non-existent inner partition. We don't have support add base relations
>     during join processing.
>
> Sorry but I'm a bit confused by the last sentence; does it mean we're not
> able to allow partition-wise join to happen in this case?  But this is in
> the list of the new cases that the patch makes partition-wise join to
> happen for.
>

Consider two cases for partitioned tables t1(p1, p3, p4), t2 (p1, p2,
p3). Assume that the bounds of any pk across t1 and t2 are same when k
matches. So, p2 is missing from t1 and p4 from t2. This case will not
be handled by current partition-wise join. With these set of patches,
we will be able to handle some joins between t1 and t2. The matching
algorithm will output pairs as (p1, p1), (-, p2), (p3, p3), (p4, -).
In an inner join between t1 and t2, (-, p2), (p4, -) would not
contribute any result, so we can eliminate those. Thus t1 INNER JOIN
t2 is just (p1, p1), (p3, p3). t1 LEFT JOIN t2 will not have any
result from (-, p2), so we can eliminate it. If there would not have
been p4, we will be able to compute t1 LEFT JOIN t2. Such cases are
not handled right now.

Fixed some grammar and typos in the paragraph.

>
> Looking at the code changes under src/backend/optimizer:
>
> +    else
> +    {
> +        Assert(partition_bounds_equal(part_scheme->partnatts,
> +                                      part_scheme->parttyplen,
> +                                      part_scheme->parttypbyval,
> +                                      join_boundinfo, joinrel->boundinfo));
>
> IIUC, this else block would run when try_partition_wise_join() is called
> again for the same pair of relations.

Yes, it's all Asserts, so effectively, we run nothing in a binary
without asserts.

>
> +        /*
> +         * Every pair of joining relations should result in the same number
> +         * of child-joins.
> +         */
>
> Sorry if I'm misreading this, but does it mean: a given pair of joining
> relations should always result in the same number of (set of, too?)
> child-joins?

The assert there just check for number so the comment doesn't mention
set. But you are correct, it's the same set. We check it when we
actually deal with them by checking their relids.

>
> In the new comment in build_joinrel_partition_info():
>
> +     * Because of restrictions in partition_bounds_merge(), not every pair of
> +     * joining relation
>
> joining relations

Fixed.

>
>
> I will try to hop into partition_bounds_merge() from now...

Appreciate you taking time for review.

PFA updated version.


--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment

pg_adv_dp_join_patches_v5.tar.gz

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Robert Haas

Date:

23 February 2018, 17:05:12

On Fri, Feb 16, 2018 at 12:14 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> Appreciate you taking time for review.
>
> PFA updated version.

Committed 0001.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Ashutosh Bapat

Date:

26 February 2018, 13:03:21

On Fri, Feb 23, 2018 at 7:35 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Feb 16, 2018 at 12:14 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> Appreciate you taking time for review.
>>
>> PFA updated version.
>
> Committed 0001.

Thanks.

Here's patchset rebased on the latest head. I have fixed all the
crashes and bugs reported till now.

One of the crash was related to the targetlist of child-join relation.
In basic partition-wise join, either all pairs of joining relations
can use partition-wise join technique or none can use it. So, the
joining pair which is used to create targetlist for a child-join
corresponds to the pair of joining relations used to create targetlist
for the parent join. With the restrictions related to missing
partitions discussed upthread, this isn't true with advanced partition
matching. The joining pair which is used to create targetlist for a
child-join may not correspond to the pair of joining relations used to
create targetlist for the parent join. Since these two pairs are
different build_joinrel_tlist() arranges the targetlist entries in
different order for child-join and parent-join. But for an appendrel,
we expect the parent and child targetlists in sync. So fix is: instead
of constructing the child-join targetlist from joining relations, we
construct it by translating the parent join. The basic partition-wise
join commit had changed build_joinrel_tlist() to handle child-joins
and it had changed set_append_rel_size() to compute attr_needed for
child relations so that that information can be used to built
child-join's targetlist. Both of those changes are not need because of
translation. I have added this fix as a separate patch with those two
changes reverted. When we will lift up the restrictions related to
missing partitions (I am not sure when), we will get back to a state
when joining pair which creates targetlist for child-join corresponds
to the joining pair which creates targetlist for the parent join. And
thus we don't need translation, which consumes more memory. We can use
attrs_needed information. So, may be we should retain those two
changes instead of reverting those. Any suggestions?

The extra extensive testcase advance_partition_join_test still fails
because of plan changes caused by recent changes to the append
costing. I have verified that those plan changes are fine and the
queries do not have any bugs or crashes. But that testcase has many
many queries and the plans are not stable. Since that testcase is not
meant for committing, I haven't fixed the plan diffs right now.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment

pg_adv_dp_join_patches_v6.tar.gz

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Dmitry Dolgov

Date:

11 March 2018, 22:34:33

> On 26 February 2018 at 11:03, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
> On Fri, Feb 23, 2018 at 7:35 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Fri, Feb 16, 2018 at 12:14 AM, Ashutosh Bapat
>> <ashutosh.bapat@enterprisedb.com> wrote:
>>> Appreciate you taking time for review.
>>>
>>> PFA updated version.
>>
>> Committed 0001.
>
> Thanks.
>
> Here's patchset rebased on the latest head. I have fixed all the
> crashes and bugs reported till now.

Hi,

I've stumbled upon this patch and noticed, that it's not compiled anymore after
2af28e6033, where `parttypcoll` was renamed to `partcollation`.

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Ashutosh Bapat

Date:

12 March 2018, 08:00:54

On Mon, Mar 12, 2018 at 1:04 AM, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>> On 26 February 2018 at 11:03, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
>> On Fri, Feb 23, 2018 at 7:35 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> On Fri, Feb 16, 2018 at 12:14 AM, Ashutosh Bapat
>>> <ashutosh.bapat@enterprisedb.com> wrote:
>>>> Appreciate you taking time for review.
>>>>
>>>> PFA updated version.
>>>
>>> Committed 0001.
>>
>> Thanks.
>>
>> Here's patchset rebased on the latest head. I have fixed all the
>> crashes and bugs reported till now.
>
> Hi,
>
> I've stumbled upon this patch and noticed, that it's not compiled anymore after
> 2af28e6033, where `parttypcoll` was renamed to `partcollation`.

Thanks for the note. Here are rebased patches.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment

pg_adv_dp_join_patches_v7.tar.gz

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Dmitry Dolgov

Date:

22 March 2018, 02:02:02

> On 12 March 2018 at 06:00, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
> Thanks for the note. Here are rebased patches.

Since I started to look at this patch, I can share few random notes (although
it's not a complete review, I'm in the progress now), maybe it can be helpful.

In `partition_range_bounds_merge`

    + if (!merged)
    +     break;

is a bit redundant I think, because every time `merged` set to false it
followed by break.

At the end same function

    + if (merged)
    +     merged = merge_default_partitions(outer_bi, outer_pmap, outer_mmap,
    +         inner_bi, inner_pmap, inner_mmap,
    +         jointype, &next_index,
    +         &default_index);

Looks like I misunderstand something in the algorithm, but aren't default
partitions were already processed before e.g. here:

    + /*
    +  * Default partition on the outer side forms the default
    +  * partition of the join result.
    +  */

Also in here

    + /* Free any memory we used in this function. */
    + list_free(merged_datums);
    + list_free(merged_indexes);
    + pfree(outer_pmap);
    + pfree(inner_pmap);
    + pfree(outer_mmap);
    + pfree(inner_mmap);

I think `merged_kinds` is missing.

I've noticed that in some places `IS_PARTITIONED_REL` was replaced

    - if (!IS_PARTITIONED_REL(joinrel))
    + if (joinrel->part_scheme == NULL)

but I'm not quite follow why? Is it because `boundinfo` is not available
anymore at this point? If so, maybe it makes sense to update the commentary for
this macro and mention to not use for joinrel.

Also, as for me it would be helpful to have some commentary about this new
`partsupfunc`, what is exactly the purpose of it (but it's maybe just me
missing some obvious things from partitioning context)

    + FmgrInfo   *partsupfunc;

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Ashutosh Bapat

Date:

22 March 2018, 16:18:40

On Thu, Mar 22, 2018 at 4:32 AM, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>> On 12 March 2018 at 06:00, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
>> Thanks for the note. Here are rebased patches.
>
> Since I started to look at this patch, I can share few random notes (although
> it's not a complete review, I'm in the progress now), maybe it can be helpful.
>
> In `partition_range_bounds_merge`
>
>     + if (!merged)
>     +     break;
>
> is a bit redundant I think, because every time `merged` set to false it
> followed by break.

Yes, right now. May be I should turn it into Assert(merged); What do you think?

>
> At the end same function
>
>     + if (merged)
>     +     merged = merge_default_partitions(outer_bi, outer_pmap, outer_mmap,
>     +         inner_bi, inner_pmap, inner_mmap,
>     +         jointype, &next_index,
>     +         &default_index);
>
> Looks like I misunderstand something in the algorithm, but aren't default
> partitions were already processed before e.g. here:
>
>     + /*
>     +  * Default partition on the outer side forms the default
>     +  * partition of the join result.
>     +  */

The default partition handling in the loop handles the cases of
missing partitions as explained in a comment
        /*
         * If a range appears in one of the joining relations but not the other
         * (as a whole or part), the rows in the corresponding partition will
         * not have join partners in the other relation, unless the other
         * relation has a default partition.

But merge_default_partitions() tries to map the default partitions
from both the relations.

>
> Also in here
>
>     + /* Free any memory we used in this function. */
>     + list_free(merged_datums);
>     + list_free(merged_indexes);
>     + pfree(outer_pmap);
>     + pfree(inner_pmap);
>     + pfree(outer_mmap);
>     + pfree(inner_mmap);
>
> I think `merged_kinds` is missing.

Done.

>
> I've noticed that in some places `IS_PARTITIONED_REL` was replaced
>
>     - if (!IS_PARTITIONED_REL(joinrel))
>     + if (joinrel->part_scheme == NULL)
>
> but I'm not quite follow why? Is it because `boundinfo` is not available
> anymore at this point? If so, maybe it makes sense to update the commentary for
> this macro and mention to not use for joinrel.

This is done in try_partitionwise_join(). As explained in the comment
    /*
     * Get the list of matching partitions to be joined along with the
     * partition bounds of the join relation. Because of the restrictions
     * imposed by partition matching algorithm, not every pair of joining
     * relations for this join will be able to use partition-wise join. But all
     * those pairs which can use partition-wise join will produce the same
     * partition bounds for the join relation.
     */
boundinfo for the join relation is built in this function. So, we
don't have join relation's partitioning information fully set up yet.
So we can't use IS_PARTITIONED_REL() there. joinrel->part_scheme if
set tells that the joining relations have matching partition schemes
and thus the join relation can possibly use partition-wise join
technique. If it's not set, then we can't use partition-wise join.

But IS_PARTITIONED_REL() is still useful at a number of other places,
where it's known to encounter a RelOptInfo whose partitioning
properties are fully setup. So, I don't think we should change the
macro or the comments above it.

>
> Also, as for me it would be helpful to have some commentary about this new
> `partsupfunc`, what is exactly the purpose of it (but it's maybe just me
> missing some obvious things from partitioning context)
>
>     + FmgrInfo   *partsupfunc;

It's just copied from Relation::PartitionKey as is. It stores the
functions required to compare two partition key datums. Since we use
those functions frequently their FmgrInfo is cached.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Dmitry Dolgov

Date:

29 March 2018, 00:23:09

> On 22 March 2018 at 14:18, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
> On Thu, Mar 22, 2018 at 4:32 AM, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>>> On 12 March 2018 at 06:00, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
>>> Thanks for the note. Here are rebased patches.
>>
>> Since I started to look at this patch, I can share few random notes (although
>> it's not a complete review, I'm in the progress now), maybe it can be helpful.
>>
>> In `partition_range_bounds_merge`
>>
>>     + if (!merged)
>>     +     break;
>>
>> is a bit redundant I think, because every time `merged` set to false it
>> followed by break.
>
> Yes, right now. May be I should turn it into Assert(merged); What do you think?

Thank you for reply. Yes, that sounds good. But actually you also mentioned
another topic that bothers me about this patch. Different parts of the
algorithm implementation (at least for functions that build maps of matching
partitions) are quite dependent on each other in terms of shared state. At
first glance in `partition_range_bounds_merge` we have about a dozen of
variables of different mutability level, that affect the control flow:

    outer_lb_index
    inner_lb_index
    merged
    merged_index
    overlap
    merged_lb
    merged_ub
    finished_outer
    finished_inner
    ub_cmpval
    lb_cmpval
    inner_has_default
    outer_has_default
    jointype

It looks a bit too much for me, and would require commentaries like "if you
changed the logic here, also take a look there". But I'm not saying that I have
any specific suggestions how to change it (although I'll definitely try to do
so, at least to get some better understanding of the underlying algorithm).

>>
>> I've noticed that in some places `IS_PARTITIONED_REL` was replaced
>>
>>     - if (!IS_PARTITIONED_REL(joinrel))
>>     + if (joinrel->part_scheme == NULL)
>>
>> but I'm not quite follow why? Is it because `boundinfo` is not available
>> anymore at this point? If so, maybe it makes sense to update the commentary for
>> this macro and mention to not use for joinrel.
>
>
> This is done in try_partitionwise_join(). As explained in the comment
>
>      * Get the list of matching partitions to be joined along with the
>      * partition bounds of the join relation. Because of the restrictions
>      * imposed by partition matching algorithm, not every pair of joining
>      * relations for this join will be able to use partition-wise join. But all
>      * those pairs which can use partition-wise join will produce the same
>      * partition bounds for the join relation.
>
> boundinfo for the join relation is built in this function. So, we
> don't have join relation's partitioning information fully set up yet.
> So we can't use IS_PARTITIONED_REL() there. joinrel->part_scheme if
> set tells that the joining relations have matching partition schemes
> and thus the join relation can possibly use partition-wise join
> technique. If it's not set, then we can't use partition-wise join.
>
> But IS_PARTITIONED_REL() is still useful at a number of other places,
> where it's known to encounter a RelOptInfo whose partitioning
> properties are fully setup. So, I don't think we should change the
> macro or the comments above it.

Just to make myself clear, I wanted to suggest not to change the commentary for
`IS_PARTITIONED_REL` significantly, but just add a sentence that you need to
check if given relation is fully set up.

Also, few more random notes (mostly related to readability, since I found some
parts of the patch hard to read, but of course it's arguable).

```
    PartitionRangeBound outer_lb;
    PartitionRangeBound outer_ub;
    PartitionRangeBound inner_lb;
    PartitionRangeBound inner_ub;
    PartitionRangeBound *merged_lb = NULL;
    PartitionRangeBound *merged_ub = NULL;
```

Maybe it would be better to not repeat the type here? Let's say:

```
    PartitionRangeBound outer_lb,
                        outer_ub,
                        ...
```

It's just too long and distracting.

```
partition_range_bounds_merge(int partnatts, FmgrInfo *partsupfuncs,
                             Oid *partcollations, PartitionBoundInfo outer_bi,
                             int outer_nparts, PartitionBoundInfo inner_bi,
                             int inner_nparts, JoinType jointype,
                             List **outer_parts, List **inner_parts)
```

From what I see in `partition.c` there are a lot functions that accept
`partnatts`, `partcollations` only to pass it down to, e.g.
`partition_rbound_cmp`.
What do you think about introducing a data structure to keep these arguments,
and pass only an instance of this structure instead?

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Ashutosh Bapat

Date:

30 March 2018, 17:06:22

Thanks for your review comments. They are helpful.

On Thu, Mar 29, 2018 at 2:53 AM, Dmitry Dolgov <9erthalion6@gmail.com> wrote:

> But actually you also mentioned
> another topic that bothers me about this patch. Different parts of the
> algorithm implementation (at least for functions that build maps of matching
> partitions) are quite dependent on each other in terms of shared state. At
> first glance in `partition_range_bounds_merge` we have about a dozen of
> variables of different mutability level, that affect the control flow:
>
>     outer_lb_index
>     inner_lb_index
>     merged
>     merged_index
>     overlap
>     merged_lb
>     merged_ub
>     finished_outer
>     finished_inner
>     ub_cmpval
>     lb_cmpval
>     inner_has_default
>     outer_has_default
>     jointype
>
> It looks a bit too much for me, and would require commentaries like "if you
> changed the logic here, also take a look there". But I'm not saying that I have
> any specific suggestions how to change it (although I'll definitely try to do
> so, at least to get some better understanding of the underlying algorithm).

I am working on commenting portions of the code to make it more clear
and readable. Will update the patches with the comments soon, mostly
this Monday.

>
> Just to make myself clear, I wanted to suggest not to change the commentary for
> `IS_PARTITIONED_REL` significantly, but just add a sentence that you need to
> check if given relation is fully set up.

For that we have REL_HAS_ALL_PART_PROPS. IS_PARTITIONED_REL is really
checking whether a relation is partitioned and following comment makes
it clear why we have to check all those things. So, I still don't see
why we need to change the comment there.

 * It's not enough to test whether rel->part_scheme is set, because it might
 * be that the basic partitioning properties of the input relations matched
 * but the partition bounds did not.

> Also, few more random notes (mostly related to readability, since I found some
> parts of the patch hard to read, but of course it's arguable).
>
> ```
>     PartitionRangeBound outer_lb;
>     PartitionRangeBound outer_ub;
>     PartitionRangeBound inner_lb;
>     PartitionRangeBound inner_ub;
>     PartitionRangeBound *merged_lb = NULL;
>     PartitionRangeBound *merged_ub = NULL;
> ```
>
> Maybe it would be better to not repeat the type here? Let's say:
>
> ```
>     PartitionRangeBound outer_lb,
>                         outer_ub,

PostgreSQL code uses both the styles, but I like declarations with the
type is right there along side the variable. It also makes it easy to
delete the variable or modify its type, without causing unnecessary
diff on adjacent lines.

> ```
> partition_range_bounds_merge(int partnatts, FmgrInfo *partsupfuncs,
>                              Oid *partcollations, PartitionBoundInfo outer_bi,
>                              int outer_nparts, PartitionBoundInfo inner_bi,
>                              int inner_nparts, JoinType jointype,
>                              List **outer_parts, List **inner_parts)
> ```
>
> From what I see in `partition.c` there are a lot functions that accept
> `partnatts`, `partcollations` only to pass it down to, e.g.
> `partition_rbound_cmp`.
> What do you think about introducing a data structure to keep these arguments,
> and pass only an instance of this structure instead?

I have discussed this in an adjacent thread, but let me repeat it
here. Those members come from PartitionKey or PartitionScheme. If we
pack those in a structure and not include it in those two
structures,we will just create a structure before calling the function
and unpack those inside the comparison function e.g.
partition_rbound_cmp(). We could argue that that will benefit
intermediate functions like partition_range_bound_cmp(), which just
pass those values down, but there is all the possibility that its
future caller may not have that packing structure available readily.
So, I am inclined not to add a new structure just for this.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Ashutosh Bapat

Date:

03 April 2018, 16:34:45

On Fri, Mar 30, 2018 at 7:36 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
>
> I am working on commenting portions of the code to make it more clear
> and readable. Will update the patches with the comments soon, mostly
> this Monday.
>

Here's set of patches rebased on the latest head. I have added a lot
of comments and revised a lot of code to avoid duplication, causing a
net reduction in the number of lines.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment

pg_adv_dp_join_patches_v8.tar.gz

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Dmitry Dolgov

Date:

06 May 2018, 22:36:56

> On 3 April 2018 at 15:34, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
> On Fri, Mar 30, 2018 at 7:36 PM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>>
>> I am working on commenting portions of the code to make it more clear
>> and readable. Will update the patches with the comments soon, mostly
>> this Monday.
>>
>
> Here's set of patches rebased on the latest head. I have added a lot
> of comments and revised a lot of code to avoid duplication, causing a
> net reduction in the number of lines.

Thank you. Unfortunately, there are already few more conflicts since last time,
could you rebase again?

In the meantime, I'm a bit confused by `merge_null_partitions` and why
partitions with NULL values should be treated separately? It became even more
confusing when I looked at where this functions is used:

    /* If merge is unsuccessful, bail out without any further processing. */
    if (merged)
        merged = merge_null_partitions(outer_bi, outer_pmap, outer_mmap,
                                       inner_bi, inner_pmap, inner_mmap,
                                       jointype, &next_index, &null_index,
                                       &default_index);

Is this commentary correct?

Also, I don't see anywhere in PostgreSQL itself or in this patch a definition
of the term "NULL partition", can you add it, just to make things clear?

Another question, I see this pattern a lot here when there is a code like:

    if (!outer_has_something && inner_has_something)
    {
        // some logic
    }
    else if (outer_has_something && !inner_has_something)
    {
        // some logic symmetric to what we have above
    }

By symmetric I mean that the code is more or less the same, just "inner"
variables were changed to "outer" (and required types of joins are opposite).
Is it feasible to actually use completely the same logic, and just change
"inner"/"outer" variables instead?

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Ashutosh Bapat

Date:

14 May 2018, 14:14:23

Thanks Dmitry for your review.

On Mon, May 7, 2018 at 1:06 AM, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> Thank you. Unfortunately, there are already few more conflicts since last time,
> could you rebase again?

Done.

>
> In the meantime, I'm a bit confused by `merge_null_partitions` and why
> partitions with NULL values should be treated separately? It became even more
> confusing when I looked at where this functions is used:
>
>     /* If merge is unsuccessful, bail out without any further processing. */
>     if (merged)
>         merged = merge_null_partitions(outer_bi, outer_pmap, outer_mmap,
>                                        inner_bi, inner_pmap, inner_mmap,
>                                        jointype, &next_index, &null_index,
>                                        &default_index);
>
> Is this commentary correct?

I agree. It's misleading. Removed.

>
> Also, I don't see anywhere in PostgreSQL itself or in this patch a definition
> of the term "NULL partition", can you add it, just to make things clear?

By NULL partition, I mean a list partition which holds NULL values.
Added that clarification in the prologue of merge_null_partitions().

>
> Another question, I see this pattern a lot here when there is a code like:
>
>     if (!outer_has_something && inner_has_something)
>     {
>         // some logic
>     }
>     else if (outer_has_something && !inner_has_something)
>     {
>         // some logic symmetric to what we have above
>     }
>
> By symmetric I mean that the code is more or less the same, just "inner"
> variables were changed to "outer" (and required types of joins are opposite).
> Is it feasible to actually use completely the same logic, and just change
> "inner"/"outer" variables instead?

I have added handle_missing_partition() for the same purpose. But I
didn't use it in merge_null_partitions(), which I have done in the
attached patches. Now the number of such instances are just at two
places one in merge_default_partitions() and the other in
handle_missing_partition(). But they are different enough not to
extract into a common function.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment

pg_adv_dp_join_patches_v9.tar.gz

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Amit Langote

Date:

26 June 2018, 11:57:03

Hi Ashutosh,

On 2018/05/14 20:14, Ashutosh Bapat wrote:
> 0001-Hash-partition-bound-equality-refactoring.patch
> 0002-Targetlist-of-a-child-join-is-produced-by-translatin.patch
> 0003-Partition-wise-join-for-1-1-1-0-0-1-partition-matchi.patch
> 0004-Add-a-debug-message-to-notify-whether-partition-wise.patch
> 0005-Tests-for-0-1-1-1-and-1-0-partition-matching.patch
> 0006-Extra-extensive-tests-for-advanced-partition-matchin.patch

I noticed after *cleanly* applying 0001-0004 to today's HEAD that while
0005's test all pass, there are many failures in 0006's tests.  Maybe, you
need to adjust something in one of the patches or adjust test outputs.

Thanks,
Amit

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Ashutosh Bapat

Date:

26 June 2018, 12:02:21

On Tue, Jun 26, 2018 at 2:27 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Hi Ashutosh,
>
> On 2018/05/14 20:14, Ashutosh Bapat wrote:
>> 0001-Hash-partition-bound-equality-refactoring.patch
>> 0002-Targetlist-of-a-child-join-is-produced-by-translatin.patch
>> 0003-Partition-wise-join-for-1-1-1-0-0-1-partition-matchi.patch
>> 0004-Add-a-debug-message-to-notify-whether-partition-wise.patch
>> 0005-Tests-for-0-1-1-1-and-1-0-partition-matching.patch
>> 0006-Extra-extensive-tests-for-advanced-partition-matchin.patch
>
> I noticed after *cleanly* applying 0001-0004 to today's HEAD that while
> 0005's test all pass, there are many failures in 0006's tests.  Maybe, you
> need to adjust something in one of the patches or adjust test outputs.

If the failures are because of plan changes, it's expected. If those
are because of crashes or changed output, those need to be fixed. I
have kept that patch to notice any crashes or output changes, in which
case, I pull that test into 0005 test set after fixing the code. Once
we are near commit, I will remove that patch from the patchset.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Amit Langote

Date:

27 June 2018, 09:58:20

On 2018/06/26 18:02, Ashutosh Bapat wrote:
> On Tue, Jun 26, 2018 at 2:27 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Hi Ashutosh,
>>
>> On 2018/05/14 20:14, Ashutosh Bapat wrote:
>>> 0001-Hash-partition-bound-equality-refactoring.patch
>>> 0002-Targetlist-of-a-child-join-is-produced-by-translatin.patch
>>> 0003-Partition-wise-join-for-1-1-1-0-0-1-partition-matchi.patch
>>> 0004-Add-a-debug-message-to-notify-whether-partition-wise.patch
>>> 0005-Tests-for-0-1-1-1-and-1-0-partition-matching.patch
>>> 0006-Extra-extensive-tests-for-advanced-partition-matchin.patch
>>
>> I noticed after *cleanly* applying 0001-0004 to today's HEAD that while
>> 0005's test all pass, there are many failures in 0006's tests.  Maybe, you
>> need to adjust something in one of the patches or adjust test outputs.
> 
> If the failures are because of plan changes, it's expected. If those
> are because of crashes or changed output, those need to be fixed. I
> have kept that patch to notice any crashes or output changes, in which
> case, I pull that test into 0005 test set after fixing the code. Once
> we are near commit, I will remove that patch from the patchset.

Ah, okay.  I thought of reporting this because I felt the errors may have
to do with changes to the related code in HEAD between May 14 when you
last posted the patches and today that you may need to account for in you
patches.  For instance, there are many diffs like this:

***************
*** 90,132 ****
  -- left outer join, with whole-row reference
  EXPLAIN (COSTS OFF)
  SELECT t1, t2 FROM prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b WHERE t1.b =
0 ORDER BY t1.a, t1.b, t1.c, t2.a, t2.b, t2.c;
!                        QUERY PLAN
! --------------------------------------------------------
   Sort
     Sort Key: t1.a, t1.c, t2.a, t2.b, t2.c
!    ->  Result
!          ->  Append
!                ->  Hash Right Join
!                      Hash Cond: (t2.b = t1.a)
!                      ->  Seq Scan on prt2_p0 t2
!                      ->  Hash
!                            ->  Seq Scan on prt1_p0 t1
<snip>

--- 90,131 ----
  -- left outer join, with whole-row reference
  EXPLAIN (COSTS OFF)
  SELECT t1, t2 FROM prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b WHERE t1.b =
0 ORDER BY t1.a, t1.b, t1.c, t2.a, t2.b, t2.c;
!                     QUERY PLAN
! --------------------------------------------------
   Sort
     Sort Key: t1.a, t1.c, t2.a, t2.b, t2.c
!    ->  Append
!          ->  Hash Right Join
!                Hash Cond: (t2.b = t1.a)
!                ->  Seq Scan on prt2_p0 t2
!                ->  Hash
!                      ->  Seq Scan on prt1_p0 t1
!                            Filter: (b = 0)

Looks like the Result node on top of Append is no longer there after
applying your patch.

Thanks,
Amit

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Ashutosh Bapat

Date:

27 June 2018, 16:21:50

On Wed, Jun 27, 2018 at 12:28 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2018/06/26 18:02, Ashutosh Bapat wrote:
>> On Tue, Jun 26, 2018 at 2:27 PM, Amit Langote
>> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> Hi Ashutosh,
>>>
>>> On 2018/05/14 20:14, Ashutosh Bapat wrote:
>>>> 0001-Hash-partition-bound-equality-refactoring.patch
>>>> 0002-Targetlist-of-a-child-join-is-produced-by-translatin.patch
>>>> 0003-Partition-wise-join-for-1-1-1-0-0-1-partition-matchi.patch
>>>> 0004-Add-a-debug-message-to-notify-whether-partition-wise.patch
>>>> 0005-Tests-for-0-1-1-1-and-1-0-partition-matching.patch
>>>> 0006-Extra-extensive-tests-for-advanced-partition-matchin.patch
>>>
>>> I noticed after *cleanly* applying 0001-0004 to today's HEAD that while
>>> 0005's test all pass, there are many failures in 0006's tests.  Maybe, you
>>> need to adjust something in one of the patches or adjust test outputs.
>>
>> If the failures are because of plan changes, it's expected. If those
>> are because of crashes or changed output, those need to be fixed. I
>> have kept that patch to notice any crashes or output changes, in which
>> case, I pull that test into 0005 test set after fixing the code. Once
>> we are near commit, I will remove that patch from the patchset.
>
> Ah, okay.  I thought of reporting this because I felt the errors may have
> to do with changes to the related code in HEAD between May 14 when you
> last posted the patches and today that you may need to account for in you
> patches.  For instance, there are many diffs like this:
>
> ***************
> *** 90,132 ****
>   -- left outer join, with whole-row reference
>   EXPLAIN (COSTS OFF)
>   SELECT t1, t2 FROM prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b WHERE t1.b =
> 0 ORDER BY t1.a, t1.b, t1.c, t2.a, t2.b, t2.c;
> !                        QUERY PLAN
> ! --------------------------------------------------------
>    Sort
>      Sort Key: t1.a, t1.c, t2.a, t2.b, t2.c
> !    ->  Result
> !          ->  Append
> !                ->  Hash Right Join
> !                      Hash Cond: (t2.b = t1.a)
> !                      ->  Seq Scan on prt2_p0 t2
> !                      ->  Hash
> !                            ->  Seq Scan on prt1_p0 t1
> <snip>
>
> --- 90,131 ----
>   -- left outer join, with whole-row reference
>   EXPLAIN (COSTS OFF)
>   SELECT t1, t2 FROM prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b WHERE t1.b =
> 0 ORDER BY t1.a, t1.b, t1.c, t2.a, t2.b, t2.c;
> !                     QUERY PLAN
> ! --------------------------------------------------
>    Sort
>      Sort Key: t1.a, t1.c, t2.a, t2.b, t2.c
> !    ->  Append
> !          ->  Hash Right Join
> !                Hash Cond: (t2.b = t1.a)
> !                ->  Seq Scan on prt2_p0 t2
> !                ->  Hash
> !                      ->  Seq Scan on prt1_p0 t1
> !                            Filter: (b = 0)
>
> Looks like the Result node on top of Append is no longer there after
> applying your patch.

Yes. They are coming because of a commit which removed Result node on
top of an Append node. I don't remember exactly which.

I wouldn't worry about those diffs at this time. As I have mentioned
in earlier mails, the expected output from 0006 is quite large and is
not supposed to be committed. So, I don't see much value in fixing the
plans in that output.

Do you see that as a hindrance in reviewing the code changes and tests in 0005?

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Amit Langote

Date:

28 June 2018, 08:53:40

On 2018/06/27 22:21, Ashutosh Bapat wrote:
> On Wed, Jun 27, 2018 at 12:28 PM, Amit Langote
>> Ah, okay.  I thought of reporting this because I felt the errors may have
>> to do with changes to the related code in HEAD between May 14 when you
>> last posted the patches and today that you may need to account for in you
>> patches.  For instance, there are many diffs like this:
>>
>> Looks like the Result node on top of Append is no longer there after
>> applying your patch.
> 
> Yes. They are coming because of a commit which removed Result node on
> top of an Append node. I don't remember exactly which.
> 
> I wouldn't worry about those diffs at this time. As I have mentioned
> in earlier mails, the expected output from 0006 is quite large and is
> not supposed to be committed. So, I don't see much value in fixing the
> plans in that output.
> 
> Do you see that as a hindrance in reviewing the code changes and tests in 0005?

I think not.  I'll ignore 0006 for now and focus on other patches.

Thanks,
Amit

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Dmitry Dolgov

Date:

15 July 2018, 20:43:43

> On Thu, 28 Jun 2018 at 07:54, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>
> On 2018/06/27 22:21, Ashutosh Bapat wrote:
> > On Wed, Jun 27, 2018 at 12:28 PM, Amit Langote
> >> Ah, okay.  I thought of reporting this because I felt the errors may have
> >> to do with changes to the related code in HEAD between May 14 when you
> >> last posted the patches and today that you may need to account for in you
> >> patches.  For instance, there are many diffs like this:
> >>
> >> Looks like the Result node on top of Append is no longer there after
> >> applying your patch.
> >
> > Yes. They are coming because of a commit which removed Result node on
> > top of an Append node. I don't remember exactly which.
> >
> > I wouldn't worry about those diffs at this time. As I have mentioned
> > in earlier mails, the expected output from 0006 is quite large and is
> > not supposed to be committed. So, I don't see much value in fixing the
> > plans in that output.
> >
> > Do you see that as a hindrance in reviewing the code changes and tests in 0005?
>
> I think not.  I'll ignore 0006 for now and focus on other patches.

Hi,

Sorry for my irregular reviews. Unfortunately, the patch need to be rebased
again. In the meantime I have few more commentaries about range_bounds_merge:

* From what I see partition_range_bound_cmp is executed on the same bounds few
  times to find whether they are overlapping and during the range merging, is
  it necessary? Probably it would be enough just to compare current ranges once
  per iteration.

* Just to clarify - the iterating through all the partitions, is it the best
  way of finding matching ranges? Correct me if I'm wrong, but from what I see
  in the comments for PartitionBoundInfoData, all the ranges are arranged in
  increasing order - maybe then it's possible to use binary search for finding
  a matching range?

* I'm not sure why in this case partition wise join is not applied? Maybe I'm
  wrong, but I was under the impression that it should work here

=# \d+ test1;
                        Table "public.test1"
 Column |  Type   | Collation | Nullable | Default | Storage  |
--------+---------+-----------+----------+---------+----------+
 data   | jsonb   |           |          |         | extended |
 id     | integer |           |          |         | plain    |
Partition key: RANGE (id)
Indexes:
    "test1_idx" btree (id)
Partitions: test11 FOR VALUES FROM (0) TO (100),
            test12 FOR VALUES FROM (100) TO (200),
            test13 FOR VALUES FROM (200) TO (300)

=# \d+ test2;
                        Table "public.test2"
 Column |  Type   | Collation | Nullable | Default | Storage  |
--------+---------+-----------+----------+---------+----------+
 data   | jsonb   |           |          |         | extended |
 id     | integer |           |          |         | plain    |
Partition key: RANGE (id)
Indexes:
    "test2_idx" btree (id)
Partitions: test21 FOR VALUES FROM (10) TO (110),
            test22 FOR VALUES FROM (110) TO (210),
            test23 FOR VALUES FROM (210) TO (310)

=# set enable_partitionwise_join to true;

=# explain analyze select * from test1 t1 inner join test2 t2 using (id);
                                QUERY PLAN
-------------------------------------------------------------------------------
 Hash Join  (cost=3.25..6.56 rows=9 width=54)
 (actual time=0.082..0.105 rows=3 loops=1)
   Hash Cond: (t1.id = t2.id)
   ->  Append  (cost=0.00..3.18 rows=12 width=29)
               (actual time=0.026..0.047 rows=12 loops=1)
         ->  Seq Scan on test11 t1  (cost=0.00..1.05 rows=5 width=29)
                                    (actual time=0.025..0.028 rows=5 loops=1)
         ->  Seq Scan on test12 t1_1 (cost=0.00..1.04 rows=4 width=29)
                                     (actual time=0.006..0.0 07 rows=4 loops=1)
         ->  Seq Scan on test13 t1_2 (cost=0.00..1.03 rows=3 width=29)
                                     (actual time=0.005..0.0 06 rows=3 loops=1)
   ->  Hash  (cost=3.13..3.13 rows=9 width=29)
             (actual time=0.033..0.033 rows=9 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         ->  Append  (cost=0.00..3.13 rows=9 width=29)
                     (actual time=0.006..0.022 rows=9 loops=1)
               ->  Seq Scan on test21 t2
                   (cost=0.00..1.03 rows=3 width=29)
                   (actual time=0.005.  .0.008 rows=3 loops=1)
               ->  Seq Scan on test22 t2_1
                   (cost=0.00..1.03 rows=3 width=29)
                   (actual time=0.00 4..0.005 rows=3 loops=1)
               ->  Seq Scan on test23 t2_2
                   (cost=0.00..1.03 rows=3 width=29)
                   (actual time=0.00 3..0.004 rows=3 loops=1)
 Planning Time: 0.921 ms
 Execution Time: 0.261 ms
(14 rows)

=# set enable_partitionwise_join to false;
=# explain analyze select * from test1 t1 inner join test2 t2 using (id);
                                QUERY PLAN
-------------------------------------------------------------------------------
 Hash Join  (cost=3.25..6.56 rows=9 width=54)
            (actual time=0.073..0.095 rows=3 loops=1)
   Hash Cond: (t1.id = t2.id)
   ->  Append  (cost=0.00..3.18 rows=12 width=29)
               (actual time=0.022..0.041 rows=12 loops=1)
         ->  Seq Scan on test11 t1  (cost=0.00..1.05 rows=5 width=29)
                                    (actual time=0.021..0.024 rows=5 loops=1)
         ->  Seq Scan on test12 t1_1 (cost=0.00..1.04 rows=4 width=29)
                                     (actual time=0.006..0.0 07 rows=4 loops=1)
         ->  Seq Scan on test13 t1_2 (cost=0.00..1.03 rows=3 width=29)
                                     (actual time=0.004..0.0 05 rows=3 loops=1)
   ->  Hash  (cost=3.13..3.13 rows=9 width=29)
             (actual time=0.031..0.031 rows=9 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         ->  Append  (cost=0.00..3.13 rows=9 width=29)
                     (actual time=0.006..0.021 rows=9 loops=1)
               ->  Seq Scan on test21 t2
                   (cost=0.00..1.03 rows=3 width=29)
                   (actual time=0.005.  .0.008 rows=3 loops=1)
               ->  Seq Scan on test22 t2_1
                   (cost=0.00..1.03 rows=3 width=29)
                   (actual time=0.00 3..0.004 rows=3 loops=1)
               ->  Seq Scan on test23 t2_2
                   (cost=0.00..1.03 rows=3 width=29)
                   (actual time=0.00 3..0.004 rows=3 loops=1)
 Planning Time: 1.154 ms
 Execution Time: 0.201 ms
(14 rows)

My investigation shows that the merge function stops on the second iteration
because of this condition:

/*
 * Multiple partitions from one relation map to one partition from the
 * other relation. Partition-wise join does not handle this case right
 * now, since it requires ganging multiple partitions together (into
 * one RelOptInfo).
 */
merged_index = -1;

I'm confused, since there is only one-to-one mapping of partitions.

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Robert Haas

Date:

16 July 2018, 18:01:51

On Sun, Jul 15, 2018 at 1:43 PM, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> Partitions: test11 FOR VALUES FROM (0) TO (100),
>             test12 FOR VALUES FROM (100) TO (200),
>             test13 FOR VALUES FROM (200) TO (300)
>
> Partitions: test21 FOR VALUES FROM (10) TO (110),
>             test22 FOR VALUES FROM (110) TO (210),
>             test23 FOR VALUES FROM (210) TO (310)
>
> I'm confused, since there is only one-to-one mapping of partitions.

No, test21 would have to be joined to both test11 and test12, since
either could contain matching rows.  Also, test22 would have to be
joined to both test12 and test13.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Ashutosh Bapat

Date:

17 July 2018, 12:58:23

On Sun, Jul 15, 2018 at 11:13 PM, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>> On Thu, 28 Jun 2018 at 07:54, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>
>> On 2018/06/27 22:21, Ashutosh Bapat wrote:
>> > On Wed, Jun 27, 2018 at 12:28 PM, Amit Langote
>> >> Ah, okay.  I thought of reporting this because I felt the errors may have
>> >> to do with changes to the related code in HEAD between May 14 when you
>> >> last posted the patches and today that you may need to account for in you
>> >> patches.  For instance, there are many diffs like this:
>> >>
>> >> Looks like the Result node on top of Append is no longer there after
>> >> applying your patch.
>> >
>> > Yes. They are coming because of a commit which removed Result node on
>> > top of an Append node. I don't remember exactly which.
>> >
>> > I wouldn't worry about those diffs at this time. As I have mentioned
>> > in earlier mails, the expected output from 0006 is quite large and is
>> > not supposed to be committed. So, I don't see much value in fixing the
>> > plans in that output.
>> >
>> > Do you see that as a hindrance in reviewing the code changes and tests in 0005?
>>
>> I think not.  I'll ignore 0006 for now and focus on other patches.
>
> Hi,
>
> Sorry for my irregular reviews. Unfortunately, the patch need to be rebased
> again.

I rebased the patches without any problem on top of commit
commit f7cb2842bf47715133b40e4a503f35dbe60d1b72
Author: Peter Eisentraut <peter_e@gmx.net>
Date:   Mon Jul 16 13:35:41 2018 +0200

    Add plan_cache_mode setting

    This allows overriding the choice of custom or generic plan.

    Author: Pavel Stehule <pavel.stehule@gmail.com>
    Discussion:
https://www.postgresql.org/message-id/flat/CAFj8pRAGLaiEm8ur5DWEBo7qHRWTk9HxkuUAz00CZZtJj-LkCA%40mail.gmail.com

I didn't get any problem rebasing the patches. They compiled cleanly.
The tests added by commit 4513d3a4be0bb7d0141f8b7eaf669a55c08e41b0
failed since this patch-set changes the partitions of the tables used
in the test. Adjusted those tests accordingly.


> In the meantime I have few more commentaries about range_bounds_merge:
>
> * From what I see partition_range_bound_cmp is executed on the same bounds few
>   times to find whether they are overlapping and during the range merging, is
>   it necessary? Probably it would be enough just to compare current ranges once
>   per iteration.

The bounds that are compared in those cases are different. Any two
bounds are compared only once per iteration. Remember each range has
two bounds, thus there are total four comparisons possible. In case of
overlap, we do all four comparisons.

>
> * Just to clarify - the iterating through all the partitions, is it the best
>   way of finding matching ranges? Correct me if I'm wrong, but from what I see
>   in the comments for PartitionBoundInfoData, all the ranges are arranged in
>   increasing order - maybe then it's possible to use binary search for finding
>   a matching range?

The loop in partition_range_bounds_merge() runs as many times as the
maximum of number of datums from the given partition bounds. So the
complexity is O(n) where n is the maximum of number of datums. With
binary search the complexity will increase to O(nlogn). I might be
missing something here.

>
> * I'm not sure why in this case partition wise join is not applied? Maybe I'm
>   wrong, but I was under the impression that it should work here
>
> =# \d+ test1;
>                         Table "public.test1"
>  Column |  Type   | Collation | Nullable | Default | Storage  |
> --------+---------+-----------+----------+---------+----------+
>  data   | jsonb   |           |          |         | extended |
>  id     | integer |           |          |         | plain    |
> Partition key: RANGE (id)
> Indexes:
>     "test1_idx" btree (id)
> Partitions: test11 FOR VALUES FROM (0) TO (100),
>             test12 FOR VALUES FROM (100) TO (200),
>             test13 FOR VALUES FROM (200) TO (300)
>
> =# \d+ test2;
>                         Table "public.test2"
>  Column |  Type   | Collation | Nullable | Default | Storage  |
> --------+---------+-----------+----------+---------+----------+
>  data   | jsonb   |           |          |         | extended |
>  id     | integer |           |          |         | plain    |
> Partition key: RANGE (id)
> Indexes:
>     "test2_idx" btree (id)
> Partitions: test21 FOR VALUES FROM (10) TO (110),
>             test22 FOR VALUES FROM (110) TO (210),
>             test23 FOR VALUES FROM (210) TO (310)
>
>
> I'm confused, since there is only one-to-one mapping of partitions.

In this case, test21 overlaps test11 (10-100) and test12 (100-110),
test22 overlaps test12 (110-200) and test13(200-210). So, there is no
one-to-one mapping.

PFA rebased patches with tests fixed as described above.
-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment

pg_adv_dp_join_patches_v10.tar.gz

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Dmitry Dolgov

Date:

19 July 2018, 22:04:22

> On Tue, 17 Jul 2018 at 11:58, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
>
> On Sun, Jul 15, 2018 at 11:13 PM, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> >> On Thu, 28 Jun 2018 at 07:54, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> >>
> >> On 2018/06/27 22:21, Ashutosh Bapat wrote:
> >> > On Wed, Jun 27, 2018 at 12:28 PM, Amit Langote
> >> >> Ah, okay.  I thought of reporting this because I felt the errors may have
> >> >> to do with changes to the related code in HEAD between May 14 when you
> >> >> last posted the patches and today that you may need to account for in you
> >> >> patches.  For instance, there are many diffs like this:
> >> >>
> >> >> Looks like the Result node on top of Append is no longer there after
> >> >> applying your patch.
> >> >
> >> > Yes. They are coming because of a commit which removed Result node on
> >> > top of an Append node. I don't remember exactly which.
> >> >
> >> > I wouldn't worry about those diffs at this time. As I have mentioned
> >> > in earlier mails, the expected output from 0006 is quite large and is
> >> > not supposed to be committed. So, I don't see much value in fixing the
> >> > plans in that output.
> >> >
> >> > Do you see that as a hindrance in reviewing the code changes and tests in 0005?
> >>
> >> I think not.  I'll ignore 0006 for now and focus on other patches.
> >
> > Hi,
> >
> > Sorry for my irregular reviews. Unfortunately, the patch need to be rebased
> > again.
>
> I rebased the patches without any problem on top of commit

Thanks!

> > In the meantime I have few more commentaries about range_bounds_merge:
> >
> > * From what I see partition_range_bound_cmp is executed on the same bounds few
> >   times to find whether they are overlapping and during the range merging, is
> >   it necessary? Probably it would be enough just to compare current ranges once
> >   per iteration.
>
> The bounds that are compared in those cases are different. Any two
> bounds are compared only once per iteration. Remember each range has
> two bounds, thus there are total four comparisons possible. In case of
> overlap, we do all four comparisons.

Yep, indeed, now I see.

> > * Just to clarify - the iterating through all the partitions, is it the best
> >   way of finding matching ranges? Correct me if I'm wrong, but from what I see
> >   in the comments for PartitionBoundInfoData, all the ranges are arranged in
> >   increasing order - maybe then it's possible to use binary search for finding
> >   a matching range?
>
> The loop in partition_range_bounds_merge() runs as many times as the
> maximum of number of datums from the given partition bounds. So the
> complexity is O(n) where n is the maximum of number of datums. With
> binary search the complexity will increase to O(nlogn). I might be
> missing something here.

Now I'm confused even more. Correct me if I'm wrong, but here is what I see
right now:

* We're trying to solve the standard problem of finding overlapping intervals
  from two sets of intervals

* The current implementation implicitly compares every range from one side of a
  join with every range from another side, which is O(n^2).

Let's imagine we have two tables:

=# \d+ test1_overlap
   Table "public.test1_overlap"
 Column |  Type   | Collation | Nullable | Default | Storage  |
--------+---------+-----------+----------+---------+----------+
 id     | integer |           |          |         | plain    |
 data   | jsonb   |           |          |         | extended |
Partition key: RANGE (id)
Partitions: test11_overlap FOR VALUES FROM (200) TO (210),
            test12_overlap FOR VALUES FROM (220) TO (230),
            test13_overlap FOR VALUES FROM (240) TO (250)

=# \d+ test2_overlap
   Table "public.test2_overlap"
 Column |  Type   | Collation | Nullable | Default | Storage  |
--------+---------+-----------+----------+---------+----------+
 id     | integer |           |          |         | plain    |
 data   | jsonb   |           |          |         | extended |
Partition key: RANGE (id)
Partitions: test210_overlap FOR VALUES FROM (160) TO (170),
            test211_overlap FOR VALUES FROM (180) TO (190),
            test21_overlap FOR VALUES FROM (0) TO (10),
            test22_overlap FOR VALUES FROM (20) TO (30),
            test23_overlap FOR VALUES FROM (200) TO (210),
            test24_overlap FOR VALUES FROM (40) TO (50),
            test25_overlap FOR VALUES FROM (60) TO (70),
            test26_overlap FOR VALUES FROM (80) TO (90),
            test27_overlap FOR VALUES FROM (100) TO (110),
            test28_overlap FOR VALUES FROM (120) TO (130),
            test29_overlap FOR VALUES FROM (140) TO (150)

And the join:

select * from test1_overlap inner join test2_overlap using (id);

Partition wise join will work fine, but what will happen (I see this following
the code under gdb) is that inside the function partition_range_bounds_merge we
start from two partitions:

test11_overlap (200, 210) and test21_overlap (0, 10)

In the comparison loop we figure out that there is no overlap and go to the
ub_cmpval > 0 branch, because test11_overlap is higher that test21_overlap.
Inside this branch we think that we need to handle a missing partition case
(apparently, by mistake, but it works - in case if it's not a missing
partition, there are no any records to join with from a default one). Since in
this case there isn't any default partition, the result is merged = true. After
that partition_range_get_next_lb_index will move us to another partition pair:

test11_overlap (200, 210) and test22_overlap (20, 30)

And so on and so forth until we will reach the partition test23_overlap (200,
210), which would be actually what we're looking for. So basically as I said
above we will iterate over the all partitions, and we could skip some of them
using binary search.

> >
> > * I'm not sure why in this case partition wise join is not applied? Maybe I'm
> >   wrong, but I was under the impression that it should work here
> >
> > =# \d+ test1;
> >                         Table "public.test1"
> >  Column |  Type   | Collation | Nullable | Default | Storage  |
> > --------+---------+-----------+----------+---------+----------+
> >  data   | jsonb   |           |          |         | extended |
> >  id     | integer |           |          |         | plain    |
> > Partition key: RANGE (id)
> > Indexes:
> >     "test1_idx" btree (id)
> > Partitions: test11 FOR VALUES FROM (0) TO (100),
> >             test12 FOR VALUES FROM (100) TO (200),
> >             test13 FOR VALUES FROM (200) TO (300)
> >
> > =# \d+ test2;
> >                         Table "public.test2"
> >  Column |  Type   | Collation | Nullable | Default | Storage  |
> > --------+---------+-----------+----------+---------+----------+
> >  data   | jsonb   |           |          |         | extended |
> >  id     | integer |           |          |         | plain    |
> > Partition key: RANGE (id)
> > Indexes:
> >     "test2_idx" btree (id)
> > Partitions: test21 FOR VALUES FROM (10) TO (110),
> >             test22 FOR VALUES FROM (110) TO (210),
> >             test23 FOR VALUES FROM (210) TO (310)
> >
> >
> > I'm confused, since there is only one-to-one mapping of partitions.
>
> In this case, test21 overlaps test11 (10-100) and test12 (100-110),
> test22 overlaps test12 (110-200) and test13(200-210). So, there is no
> one-to-one mapping.

Yep, thanks for the explanation.

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Dmitry Dolgov

Date:

20 July 2018, 00:43:49

> On Thu, 19 Jul 2018 at 21:04, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > > * Just to clarify - the iterating through all the partitions, is it the best
> > >   way of finding matching ranges? Correct me if I'm wrong, but from what I see
> > >   in the comments for PartitionBoundInfoData, all the ranges are arranged in
> > >   increasing order - maybe then it's possible to use binary search for finding
> > >   a matching range?
> >
> > The loop in partition_range_bounds_merge() runs as many times as the
> > maximum of number of datums from the given partition bounds. So the
> > complexity is O(n) where n is the maximum of number of datums. With
> > binary search the complexity will increase to O(nlogn). I might be
> > missing something here.
>
> Now I'm confused even more. Correct me if I'm wrong, but here is what I see
> right now:
>
> * We're trying to solve the standard problem of finding overlapping intervals
>   from two sets of intervals
>
> * The current implementation implicitly compares every range from one side of a
>   join with every range from another side, which is O(n^2).

It's of course wrong, it's going to be O(max(m, n)) as you said, but
the point is still valid - if we have partitions A1, A2 from one side
and B1, ..., BN on another side, we can skip necessary the
partitions from B that are between e.g. A1 and A2 faster with
binary search.

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Ashutosh Bapat

Date:

23 July 2018, 11:38:44

On Fri, Jul 20, 2018 at 3:13 AM, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> It's of course wrong, it's going to be O(max(m, n)) as you said, but
> the point is still valid - if we have partitions A1, A2 from one side
> and B1, ..., BN on another side, we can skip necessary the
> partitions from B that are between e.g. A1 and A2 faster with
> binary search.

That's possible only when there is no default partition and the join
is an inner join. For an outer join, we need to include all the
partitions on the outer side, so we can't just skip over some
partitions. In case of a default partition, it can take place of a
missing partition, so we can't skip partitions using binary search.
The code right now works for all the cases and is O(n). I agree that
it can be optimized for certain cases, but
1. those cases are rare enough that we can ignore those right now. How
many times we would encounter the case you have quoted, for example?
Usually the ranges will be continuous only differing in the first or
last partition e.g time-series data.
2. The code is enough complex right now and it's also a lot. Making it
complicated further is not the worth the rare use cases. If we get
complaints from the field, we can certainly improve it in future. But
I would wait for those complaints before improving it further.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Dmitry Dolgov

Date:

26 July 2018, 18:07:32

> On Mon, 23 Jul 2018 at 10:38, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
>
> On Fri, Jul 20, 2018 at 3:13 AM, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> >
> > It's of course wrong, it's going to be O(max(m, n)) as you said, but
> > the point is still valid - if we have partitions A1, A2 from one side
> > and B1, ..., BN on another side, we can skip necessary the
> > partitions from B that are between e.g. A1 and A2 faster with
> > binary search.
>
> That's possible only when there is no default partition and the join
> is an inner join. For an outer join, we need to include all the
> partitions on the outer side, so we can't just skip over some
> partitions. In case of a default partition, it can take place of a
> missing partition, so we can't skip partitions using binary search.

But in this case I described above (when we have a partition A3 on one side,
and another matching partition B3 from other side, but separated by some other
partitions, e.g. B1, B2) it means that we will merge A3 with a default
partition from other side without actually needing that, am I right? In this
sense it's an overhead out of nothing.

> 1. those cases are rare enough that we can ignore those right now. How
> many times we would encounter the case you have quoted, for example?
> Usually the ranges will be continuous only differing in the first or
> last partition e.g time-series data.

Well, unfortunately, I don't have enough context to discuss whether it's rare
or not.

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Ashutosh Bapat

Date:

27 July 2018, 10:17:04

On Thu, Jul 26, 2018 at 8:37 PM, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>> On Mon, 23 Jul 2018 at 10:38, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
>>
>> On Fri, Jul 20, 2018 at 3:13 AM, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>> >
>> > It's of course wrong, it's going to be O(max(m, n)) as you said, but
>> > the point is still valid - if we have partitions A1, A2 from one side
>> > and B1, ..., BN on another side, we can skip necessary the
>> > partitions from B that are between e.g. A1 and A2 faster with
>> > binary search.
>>
>> That's possible only when there is no default partition and the join
>> is an inner join. For an outer join, we need to include all the
>> partitions on the outer side, so we can't just skip over some
>> partitions. In case of a default partition, it can take place of a
>> missing partition, so we can't skip partitions using binary search.
>
> But in this case I described above (when we have a partition A3 on one side,
> and another matching partition B3 from other side, but separated by some other
> partitions, e.g. B1, B2) it means that we will merge A3 with a default
> partition from other side without actually needing that, am I right? In this
> sense it's an overhead out of nothing.

No. We will join A3 with B3 since they have matching bounds. We will
compare B1 and B2's bounds with A3 (assuming that there are no bounds
before A3). Since they won't be compatible we will match default of A
to B1 and B2. That will of-course fail since we will try to match
multiple partitions to a single partition, but that's not the point of
your question I think. You are right that we could skip comparing A3
with B1 and B2 and directly land on B3. Any partitions skipped in
between will be matched with A's default partition. But as I have said
this would be rare and complicating the logic for a rare case doesn't
seem practical at this stage.

Apart from the complexity there's also a possibility that this
skipping will reduce the efficiency actually in normal cases. Consider
a case where A and B have exactly matching partitions. Current
partition matching algorithm compare any given range/bound only once
and we will have O(n) algorithm. If we use a binary search, however,
for every range comparison, it will be O(n log n) algorithm. There
will be unnecessary comparisons during binary search. The current
algorithm is always O(n), whereas binary search would be O(n log(n))
with a possibility of having sub-O(n) complexity in some cases. I
would go for an algorithm which has a consistent time complexity
always and which is efficient in normal cases, rather than worrying
about some cases which are not practical.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Robert Haas

Date:

27 July 2018, 21:13:07

On Fri, Jul 27, 2018 at 3:17 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> Apart from the complexity there's also a possibility that this
> skipping will reduce the efficiency actually in normal cases. Consider
> a case where A and B have exactly matching partitions. Current
> partition matching algorithm compare any given range/bound only once
> and we will have O(n) algorithm. If we use a binary search, however,
> for every range comparison, it will be O(n log n) algorithm. There
> will be unnecessary comparisons during binary search. The current
> algorithm is always O(n), whereas binary search would be O(n log(n))
> with a possibility of having sub-O(n) complexity in some cases. I
> would go for an algorithm which has a consistent time complexity
> always and which is efficient in normal cases, rather than worrying
> about some cases which are not practical.

Yeah, I think that's a good point.  The normal case here will be that
the partition bounds are equal, or that there are a few extra
partitions on one side that don't exist on the other.  We don't want
other cases to be crazily inefficient, but I suspect in practice that
if the partitioning bounds aren't pretty close to matching up exactly,
we're going to end up failing to be able to do a partition-wise join
at all.  It's not very likely that somebody happens to have a case
where they've partitioned two tables in randomly different ways, but
then they decide to join them anyway, but then it turns out that the
partition bounds happen to be compatible enough that this algorithm
works.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Dmitry Dolgov

Date:

23 August 2018, 13:31:47

> On Fri, 27 Jul 2018 at 20:13, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Jul 27, 2018 at 3:17 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
> > Apart from the complexity there's also a possibility that this
> > skipping will reduce the efficiency actually in normal cases. Consider
> > a case where A and B have exactly matching partitions. Current
> > partition matching algorithm compare any given range/bound only once
> > and we will have O(n) algorithm. If we use a binary search, however,
> > for every range comparison, it will be O(n log n) algorithm. There
> > will be unnecessary comparisons during binary search. The current
> > algorithm is always O(n), whereas binary search would be O(n log(n))
> > with a possibility of having sub-O(n) complexity in some cases. I
> > would go for an algorithm which has a consistent time complexity
> > always and which is efficient in normal cases, rather than worrying
> > about some cases which are not practical.
>
> Yeah, I think that's a good point.  The normal case here will be that
> the partition bounds are equal, or that there are a few extra
> partitions on one side that don't exist on the other.  We don't want
> other cases to be crazily inefficient, but I suspect in practice that
> if the partitioning bounds aren't pretty close to matching up exactly,
> we're going to end up failing to be able to do a partition-wise join
> at all.  It's not very likely that somebody happens to have a case
> where they've partitioned two tables in randomly different ways, but
> then they decide to join them anyway, but then it turns out that the
> partition bounds happen to be compatible enough that this algorithm
> works.

> On Mon, 23 Jul 2018 at 10:38, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
> 1. those cases are rare enough that we can ignore those right now. How
> many times we would encounter the case you have quoted, for example?
> Usually the ranges will be continuous only differing in the first or
> last partition e.g time-series data.

Ok, but what about list partitioning? I believe the area of application for it
can be more diverse than mostly just for time-series, and in the patch almost
the same approach is used to merge list partitions.

Other than that everything seems fine from functionality point of view, and so
far I couldn't find any situation that produces a wrong plan. Some of the joins
were not that intuitive from the first glance, but at the end everything was
according to the documentation.
Taking this into account I wonder if it's possible somehow to give any hints in
an explain result about why it wasn't possible to apply partition wise join? If
something wasn't clear for me I ended up looking at the value of "merged" flag,
and to figure out actual reason one needs to trace the entire algorithm.

Besides that I checked the performance in some simple cases, no problems on
this side (but I'll also do some checks for more complicated joins).

And I still have some complaints about readability, although I can imagine that
it's just me:

* Many functions carry some unrelated arguments just to pass them through,
  which obscures the purpose of a function.

* I want to suggest to introduce a new data structure for partitions mapping
  instead of just keeping them in arrays (was it discussed already before?).

* What is the reason that almost everywhere in the patch there is a naming
  convention "outer/inner" partition maps, and sometimes (e.g. in
  generate_matching_part_pairs) it's "map1/map2"?

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Ashutosh Bapat

Date:

29 August 2018, 10:32:11

On Thu, Aug 23, 2018 at 4:01 PM, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>> On Fri, 27 Jul 2018 at 20:13, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> On Fri, Jul 27, 2018 at 3:17 AM, Ashutosh Bapat
>> <ashutosh.bapat@enterprisedb.com> wrote:
>> > Apart from the complexity there's also a possibility that this
>> > skipping will reduce the efficiency actually in normal cases. Consider
>> > a case where A and B have exactly matching partitions. Current
>> > partition matching algorithm compare any given range/bound only once
>> > and we will have O(n) algorithm. If we use a binary search, however,
>> > for every range comparison, it will be O(n log n) algorithm. There
>> > will be unnecessary comparisons during binary search. The current
>> > algorithm is always O(n), whereas binary search would be O(n log(n))
>> > with a possibility of having sub-O(n) complexity in some cases. I
>> > would go for an algorithm which has a consistent time complexity
>> > always and which is efficient in normal cases, rather than worrying
>> > about some cases which are not practical.
>>
>> Yeah, I think that's a good point.  The normal case here will be that
>> the partition bounds are equal, or that there are a few extra
>> partitions on one side that don't exist on the other.  We don't want
>> other cases to be crazily inefficient, but I suspect in practice that
>> if the partitioning bounds aren't pretty close to matching up exactly,
>> we're going to end up failing to be able to do a partition-wise join
>> at all.  It's not very likely that somebody happens to have a case
>> where they've partitioned two tables in randomly different ways, but
>> then they decide to join them anyway, but then it turns out that the
>> partition bounds happen to be compatible enough that this algorithm
>> works.
>
>> On Mon, 23 Jul 2018 at 10:38, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
>> 1. those cases are rare enough that we can ignore those right now. How
>> many times we would encounter the case you have quoted, for example?
>> Usually the ranges will be continuous only differing in the first or
>> last partition e.g time-series data.
>
> Ok, but what about list partitioning? I believe the area of application for it
> can be more diverse than mostly just for time-series, and in the patch almost
> the same approach is used to merge list partitions.

Same arguments hold for list partitioning as well. For a list
partitioned table the bounds are ordered by list datums and not
partitions, so it will be rather odd to have large runs of mismatching
datums, the case where binary search fares, from one of side of join.
So, my following argument still holds true

---
>> > I would go for an algorithm which has a consistent time complexity
>> > always and which is efficient in normal cases, rather than worrying
>> > about some cases which are not practical.
---


>
> Other than that everything seems fine from functionality point of view, and so
> far I couldn't find any situation that produces a wrong plan. Some of the joins
> were not that intuitive from the first glance, but at the end everything was
> according to the documentation.

Thanks a lot for your tests and I am glad that you didn't find any failures.

> Taking this into account I wonder if it's possible somehow to give any hints in
> an explain result about why it wasn't possible to apply partition wise join? If
> something wasn't clear for me I ended up looking at the value of "merged" flag,
> and to figure out actual reason one needs to trace the entire algorithm.

That's kind of tricky in PostgreSQL. The optimizer doesn't always
report why a particular path was not chosen. Said that we could add
trace logs printing that information, however, the main difficulty is
reporting the relations being joined.See 0004 for example.

>
> Besides that I checked the performance in some simple cases, no problems on
> this side (but I'll also do some checks for more complicated joins).

Thanks a lot.

>
> And I still have some complaints about readability, although I can imagine that
> it's just me:
>
> * Many functions carry some unrelated arguments just to pass them through,
>   which obscures the purpose of a function.

Can you please provide some examples?

>
> * I want to suggest to introduce a new data structure for partitions mapping
>   instead of just keeping them in arrays (was it discussed already before?).

The best I could think of was a structure with just two arrays. So,
instead of encapsulating the arrays within a structure, I thought it
best to pass bare arrays around. If you have any other ideas, please
let me know.

>
> * What is the reason that almost everywhere in the patch there is a naming
>   convention "outer/inner" partition maps, and sometimes (e.g. in
>   generate_matching_part_pairs) it's "map1/map2"?

You will find that the optimizer in general uses outer/inner,
rel1/rel2 nomenclature interchangeably. This patch-set just inherits
that. But yes, we should do more to straighten it out.

I won't be working on this actively in the next commitfest. I will be
glad if somebody else wants to take this up. If there's nobody,
probably we should mark this entry as "returned with feedback" in the
next commitfest.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Dmitry Dolgov

Date:

30 August 2018, 11:53:32

> On Wed, 29 Aug 2018 at 09:32, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
>
> > * Many functions carry some unrelated arguments just to pass them through,
> >   which obscures the purpose of a function.
>
> Can you please provide some examples?

E.g this chain with partsupfunc & collations:

partition_range_bounds_merge
 -> partition_range_cmp
     -> partition_range_bound_cmp
 -> partition_range_merge_next_lb

I believe I already mentioned that before (although I'm not saying that I have
a solution right away, it just bothers me).

> > * I want to suggest to introduce a new data structure for partitions mapping
> >   instead of just keeping them in arrays (was it discussed already before?).
>
> The best I could think of was a structure with just two arrays. So,
> instead of encapsulating the arrays within a structure, I thought it
> best to pass bare arrays around. If you have any other ideas, please
> let me know.

Well, what I had in mind is something like a structure Mapping with fields from
& to.

> > * What is the reason that almost everywhere in the patch there is a naming
> >   convention "outer/inner" partition maps, and sometimes (e.g. in
> >   generate_matching_part_pairs) it's "map1/map2"?
>
> You will find that the optimizer in general uses outer/inner,
> rel1/rel2 nomenclature interchangeably. This patch-set just inherits
> that. But yes, we should do more to straighten it out.

Thanks, good to know.

> I won't be working on this actively in the next commitfest. I will be
> glad if somebody else wants to take this up. If there's nobody,
> probably we should mark this entry as "returned with feedback" in the
> next commitfest.

Since I'm more or less familiar with the code and I believe it's an interesting
feature, I can try to take over it for now if you don't mind (but without any
strong commitments to make it perfectly shining for the next CF).

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Ashutosh Bapat

Date:

31 August 2018, 09:23:56

On Thu, Aug 30, 2018 at 2:23 PM, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
>> I won't be working on this actively in the next commitfest. I will be
>> glad if somebody else wants to take this up. If there's nobody,
>> probably we should mark this entry as "returned with feedback" in the
>> next commitfest.
>
> Since I'm more or less familiar with the code and I believe it's an interesting
> feature, I can try to take over it for now if you don't mind (but without any
> strong commitments to make it perfectly shining for the next CF).

Please do. Thanks.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Dmitry Dolgov

Date:

12 September 2018, 23:14:52

> On Fri, 31 Aug 2018 at 08:23, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
>
> On Thu, Aug 30, 2018 at 2:23 PM, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> >
> >> I won't be working on this actively in the next commitfest. I will be
> >> glad if somebody else wants to take this up. If there's nobody,
> >> probably we should mark this entry as "returned with feedback" in the
> >> next commitfest.
> >
> > Since I'm more or less familiar with the code and I believe it's an interesting
> > feature, I can try to take over it for now if you don't mind (but without any
> > strong commitments to make it perfectly shining for the next CF).
>
> Please do. Thanks.

I've noticed that the patch is outdated already, so here is the rebased
version. I also removed the last part with the extra tests since it
was something
internal and merged the debug message into the implementation part. Ashutosh,
please let me know if you're not happy with these modifications.

Other than that I haven't changed anything yet, but hope this will come soon.
And this version is more to keep it updated for those people who may be
interested.

Attachment

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Ashutosh Bapat

Date:

17 September 2018, 12:20:21

On Thu, Sep 13, 2018 at 1:45 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

> On Fri, 31 Aug 2018 at 08:23, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
>
> On Thu, Aug 30, 2018 at 2:23 PM, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> >
> >> I won't be working on this actively in the next commitfest. I will be
> >> glad if somebody else wants to take this up. If there's nobody,
> >> probably we should mark this entry as "returned with feedback" in the
> >> next commitfest.
> >
> > Since I'm more or less familiar with the code and I believe it's an interesting
> > feature, I can try to take over it for now if you don't mind (but without any
> > strong commitments to make it perfectly shining for the next CF).
>
> Please do. Thanks.

I've noticed that the patch is outdated already, so here is the rebased
version. I also removed the last part with the extra tests since it
was something
internal

I am fine with that. It was never meant to be committed. I used to run those tests to make sure that any changes to the core logic do not break any working scenarios. Whenever I found a new failure in the extra tests which wasn't there in tests to be committed, I used to move that test from the first to the second. Over the time, the number of new failures in extra has reduced and recently I didn't see any extra failures. So, may be it's time for the extra tests to be dropped. I will suggest that keep the extra tests running from time to time and certainly around the time the feature gets committed.

and merged the debug message into the implementation part. Ashutosh,
please let me know if you're not happy with these modifications.

Robert Haas raised objections, and I agreed to those, about a similar debug message I had included in the basic partition-wise join patches. I think those reasons still apply, so you will need to remove the debug message before the patches get committed. Said that the debug message is a good debugging aid, so keeping it around till that time is a good idea.

Other than that I haven't changed anything yet, but hope this will come soon.
And this version is more to keep it updated for those people who may be
interested.

Thanks.

Best Wishes,

Ashutosh Bapat

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Dmitry Dolgov

Date:

25 October 2018, 21:19:09

> On Mon, 17 Sep 2018 at 11:20, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
>
> I am fine with that. It was never meant to be committed. I used to run those
> tests to make sure that any changes to the core logic do not break any
> working scenarios. Whenever I found a new failure in the extra tests which
> wasn't there in tests to be committed, I used to move that test from the
> first to the second. Over the time, the number of new failures in extra has
> reduced and recently I didn't see any extra failures. So, may be it's time
> for the extra tests to be dropped. I will suggest that keep the extra tests
> running from time to time and certainly around the time the feature gets
> committed.

Great, that's exactly what I'm doing right now - I keep these tests locally
to monitor any significant failures except any changes in plans, but without
including it into the patch series.

Since most of my complaints about the patch were related to code readability,
I modified it a bit to show more clearly what I have in mind. I haven't changed
anything about the functionality, just adjusted some parts of it:

* removed some unused arguments (looks like they were added for consistency in
  all higher level branches, but not all of them were actually in use)

* replaced representation for partition mapping (instead of int arrays there is
  a structure, that allows to replace 0/1 with more meaningful from/to)

* tried to make naming a bit more consistent - so, if a function doesn't
  explicitely say anything about outer/inner, we have partmap1/partmap2,
  otherwise outermap/innermap. I don't really like this style with
  partmap1/partmap2 or index1/index2, but it seems consistent with the rest of
  the code in partbounds.c

* removed some state representation flags, e.g. merged - instead, if there is a
  situation when we can't merge, functions will return NULL right away

* removed questionable debugging statement

Ashutosh, can you please take a look at it and confirm (or not) that you also
think these changes have improved readability a bit. If we're on the same page
about that, I'll continue in this direction.

Attachment

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Dmitry Dolgov

Date:

25 November 2018, 20:04:29

> On Thu, Oct 25, 2018 at 11:19 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> Since most of my complaints about the patch were related to code readability,
> I modified it a bit to show more clearly what I have in mind. I haven't changed
> anything about the functionality, just adjusted some parts of it:
>
> * removed some unused arguments (looks like they were added for consistency in
>   all higher level branches, but not all of them were actually in use)
>
> * replaced representation for partition mapping (instead of int arrays there is
>   a structure, that allows to replace 0/1 with more meaningful from/to)
>
> * tried to make naming a bit more consistent - so, if a function doesn't
>   explicitely say anything about outer/inner, we have partmap1/partmap2,
>   otherwise outermap/innermap. I don't really like this style with
>   partmap1/partmap2 or index1/index2, but it seems consistent with the rest of
>   the code in partbounds.c
>
> * removed some state representation flags, e.g. merged - instead, if there is a
>   situation when we can't merge, functions will return NULL right away
>
> * removed questionable debugging statement

I've noticed, that this patch set is outdated, so here is the rebased version.

Attachment

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Thomas Munro

Date:

25 November 2018, 21:06:36

On Mon, Nov 26, 2018 at 9:03 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> I've noticed, that this patch set is outdated, so here is the rebased version.

This turned red on cfbot because I turned on -Werror:

partbounds.c: In function ‘partition_bounds_merge’:
partbounds.c:3619:43: error: ‘outer_part’ may be used uninitialized in
this function [-Werror=maybe-uninitialized]
merged_index = map_and_merge_partitions(outer_pmap, outer_mmap,
^
partbounds.c:3475:6: note: ‘outer_part’ was declared here
int outer_part;
^
cc1: all warnings being treated as errors

--
Thomas Munro
http://www.enterprisedb.com

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Dmitry Dolgov

Date:

26 November 2018, 12:41:06

> On Sun, Nov 25, 2018 at 10:07 PM Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>
> This turned red on cfbot because I turned on -Werror:
>
> partbounds.c: In function ‘partition_bounds_merge’:
> partbounds.c:3619:43: error: ‘outer_part’ may be used uninitialized in
> this function [-Werror=maybe-uninitialized]
> merged_index = map_and_merge_partitions(outer_pmap, outer_mmap,
> ^
> partbounds.c:3475:6: note: ‘outer_part’ was declared here
> int outer_part;
> ^
> cc1: all warnings being treated as errors

Thanks, I've messed this up too - rebased a wrong branch, the correct one
doesn't have this code already. Sorry for that, and here is the proper version
of this patch.

Attachment

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Dmitry Dolgov

Date:

26 November 2018, 16:03:57

> On Mon, Nov 26, 2018 at 1:41 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> Thanks, I've messed this up too - rebased a wrong branch, the correct one
> doesn't have this code already. Sorry for that, and here is the proper version
> of this patch.

And one more version, because I haven't fixed the tests (sorry for the noise).

Attachment

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

amul sul

Date:

21 January 2019, 11:56:16

On Mon, Nov 26, 2018 at 9:33 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

> On Mon, Nov 26, 2018 at 1:41 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> Thanks, I've messed this up too - rebased a wrong branch, the correct one
> doesn't have this code already. Sorry for that, and here is the proper version
> of this patch.

And one more version, because I haven't fixed the tests (sorry for the noise).

0003 patch need a rebase.

Regards,

Amul

Re: [HACKERS] advanced partition matching algorithm for partition-wisejoin

From

Etsuro Fujita

Date:

21 January 2019, 12:09:38

(2019/01/21 20:56), amul sul wrote:
> On Mon, Nov 26, 2018 at 9:33 PM Dmitry Dolgov <9erthalion6@gmail.com
> <mailto:9erthalion6@gmail.com>> wrote:
>
>      > On Mon, Nov 26, 2018 at 1:41 PM Dmitry Dolgov
>     <9erthalion6@gmail.com <mailto:9erthalion6@gmail.com>> wrote:
>      >
>      > Thanks, I've messed this up too - rebased a wrong branch, the
>     correct one
>      > doesn't have this code already. Sorry for that, and here is the
>     proper version
>      > of this patch.
>
>     And one more version, because I haven't fixed the tests (sorry for
>     the noise).

> 0003 patch need a rebase.

Will do.

Thank you for letting us know!

Best regards,
Etsuro Fujita

Re: [HACKERS] advanced partition matching algorithm for partition-wisejoin

From

Etsuro Fujita

Date:

22 January 2019, 12:38:16

(2019/01/21 21:09), Etsuro Fujita wrote:
> (2019/01/21 20:56), amul sul wrote:
>> 0003 patch need a rebase.
>
> Will do.

While doing the rebase, I noticed that 0002 would not be necessary 
anymore; what was originally proposed by it on how to generate the 
tlists for child-joins in [1] is entirely included in commit 
7cfdc77023ad50731723e85c215a4127436ed09c, which I pushed.  I didn't pay 
attention to this thread, but that was pretty much the same as what I 
did in that commit (except the handling of PHVs to avoid extra work in 
that commit), so I think I should have mentioned about that in the 
commit message.  Sorry for that, Ashutosh.

Here is an updated version of the patchset.  As above, I don't think 
0002 is necessary; it added copy_pathtarget to build_child_join_rel, but 
I think it would be redundant because we do create_empty_pathtarget at 
the top of that function and then build_child_join_reltarget.  Also, it 
added an assertion to build_joinrel_tlist, but I don't think it's really 
necessary IMO.  So I removed it entirely from the patchset.  No changes 
other that.

Will continue to review.

Best regards,
Etsuro Fujita

[1] 
https://www.postgresql.org/message-id/CAFjFpResoxfp1rnV4Op9JOnG19VNEnjvjRN5DVd8QRHD%2BagTDw%40mail.gmail.com

Attachment

Re: [HACKERS] advanced partition matching algorithm for partition-wisejoin

From

Etsuro Fujita

Date:

31 January 2019, 11:53:15

(2019/01/22 21:38), Etsuro Fujita wrote:
> Will continue to review.

I rebased the patch set against the latest HEAD.  Attached is a new 
version.  I'll move this to the next CF, and continue to review it.

Best regards,
Etsuro Fujita

Attachment

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

amul sul

Date:

04 February 2019, 05:35:06

There are few whitespaces in 0002 patch that I have fixed in the attached version.

Rest of the patches are untouched.

Ill continue my review and testing.

Regards,

Amul

On Thu, Jan 31, 2019 at 5:26 PM Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp> wrote:

(2019/01/22 21:38), Etsuro Fujita wrote:
> Will continue to review.

I rebased the patch set against the latest HEAD. Attached is a new
version. I'll move this to the next CF, and continue to review it.

Best regards,
Etsuro Fujita

Attachment

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

amul sul

Date:

05 March 2019, 10:14:32

Attached is the rebased atop of the latest master head(35bc0ec7c8).

Regards,

Amul Sul

On Mon, Feb 4, 2019 at 11:05 AM amul sul <sulamul@gmail.com> wrote:

There are few whitespaces in 0002 patch that I have fixed in the attached version.
Rest of the patches are untouched.

Ill continue my review and testing.

Regards,
Amul

On Thu, Jan 31, 2019 at 5:26 PM Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp> wrote:
(2019/01/22 21:38), Etsuro Fujita wrote:
> Will continue to review.

I rebased the patch set against the latest HEAD. Attached is a new
version. I'll move this to the next CF, and continue to review it.

Best regards,
Etsuro Fujita

Attachment

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Rajkumar Raghuwanshi

Date:

07 March 2019, 06:24:04

On Tue, Mar 5, 2019 at 3:45 PM amul sul <sulamul@gmail.com> wrote:

Attached is the rebased atop of the latest master head(35bc0ec7c8).

thanks Amul, patches applied cleanly on PG head.

While testing this I got a server crash with below test case.

CREATE TABLE plt1 (a int, b int, c varchar) PARTITION BY LIST(c);
CREATE TABLE plt1_p1 PARTITION OF plt1 FOR VALUES IN ('0001','0002','0003');
CREATE TABLE plt1_p2 PARTITION OF plt1 FOR VALUES IN ('0004','0005','0006');
CREATE TABLE plt1_p3 PARTITION OF plt1 FOR VALUES IN (NULL,'0008','0009');
CREATE TABLE plt1_p4 PARTITION OF plt1 FOR VALUES IN ('0000','0010');
INSERT INTO plt1 SELECT i, i % 47, to_char(i % 17, 'FM0000') FROM generate_series(0, 500) i WHERE i % 17 NOT IN (7, 11, 12, 13, 14, 15,16);
INSERT INTO plt1 SELECT i, i % 47, case when i % 17 = 7 then NULL else to_char(i % 17, 'FM0000') end FROM generate_series(0, 500) i WHERE i % 17 IN (7,8,9);
ANALYSE plt1;

CREATE TABLE plt2 (a int, b int, c varchar) PARTITION BY LIST(c);
CREATE TABLE plt2_p1 PARTITION OF plt2 FOR VALUES IN ('0002','0003');
CREATE TABLE plt2_p2 PARTITION OF plt2 FOR VALUES IN ('0004','0005','0006');
CREATE TABLE plt2_p3 PARTITION OF plt2 FOR VALUES IN ('0007','0008','0009');
CREATE TABLE plt2_p4 PARTITION OF plt2 FOR VALUES IN ('0000',NULL,'0012');
INSERT INTO plt2 SELECT i, i % 47, to_char(i % 17, 'FM0000') FROM generate_series(0, 500) i WHERE i % 17 NOT IN (1, 10, 11, 13, 14, 15, 16);
INSERT INTO plt2 SELECT i, i % 47, case when i % 17 = 11 then NULL else to_char(i % 17, 'FM0000') end FROM generate_series(0, 500) i WHERE i % 17 IN (0,11,12);
ANALYZE plt2;

CREATE TABLE plt1_e (a int, b int, c text) PARTITION BY LIST(ltrim(c, 'A'));
CREATE TABLE plt1_e_p1 PARTITION OF plt1_e FOR VALUES IN ('0002', '0003');
CREATE TABLE plt1_e_p2 PARTITION OF plt1_e FOR VALUES IN ('0004', '0005', '0006');
CREATE TABLE plt1_e_p3 PARTITION OF plt1_e FOR VALUES IN ('0008', '0009');
CREATE TABLE plt1_e_p4 PARTITION OF plt1_e FOR VALUES IN ('0000');
INSERT INTO plt1_e SELECT i, i % 47, to_char(i % 17, 'FM0000') FROM generate_series(0, 500) i WHERE i % 17 NOT IN (1, 7, 10, 11, 12, 13, 14, 15, 16);
ANALYZE plt1_e;

EXPLAIN (COSTS OFF)
SELECT avg(t1.a), avg(t2.b), avg(t3.a + t3.b), t1.c, t2.c, t3.c FROM
plt1 t1, plt2 t2, plt1_e t3 WHERE t1.c = t2.c AND ltrim(t3.c, 'A') = t1.c GROUP BY t1.c, t2.c, t3.c ORDER BY t1.c, t2.c, t3.c;

server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

below is stack trace, looks like some indexes got corrupted, please take a look.

Core was generated by `postgres: edb postgres [local] EXPLAIN                   '.
Program terminated with signal 11, Segmentation fault.
#0 0x0000000000821aee in map_and_merge_partitions (partmaps1=0x2c1c8a8, partmaps2=0x2c1c8e0, index1=45540240, index2=0, next_index=0x7ffeebd43d3c) at partbounds.c:4162
4162        if (partmap1->from < 0 && partmap2->from < 0)
Missing separate debuginfos, use: debuginfo-install keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-65.el6.x86_64 libcom_err-1.41.12-24.el6.x86_64 libselinux-2.0.94-7.el6.x86_64 openssl-1.0.1e-57.el6.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0 0x0000000000821aee in map_and_merge_partitions (partmaps1=0x2c1c8a8, partmaps2=0x2c1c8e0, index1=45540240, index2=0, next_index=0x7ffeebd43d3c) at partbounds.c:4162
#1 0x00000000008226c3 in merge_null_partitions (outer_bi=0x2b6e338, inner_bi=0x2bf90b0, outer_maps=0x2c1c8a8, inner_maps=0x2c1c8e0, jointype=JOIN_INNER,
    next_index=0x7ffeebd43d3c, null_index=0x7ffeebd43d38, default_index=0x7ffeebd43d34) at partbounds.c:4610
#2 0x0000000000821726 in partition_list_bounds_merge (partsupfunc=0x2ba3548, partcollation=0x2ba34e8, outer_rel=0x2b6ce40, inner_rel=0x2bf8d28, outer_parts=0x7ffeebd43ed8,
    inner_parts=0x7ffeebd43ed0, jointype=JOIN_INNER) at partbounds.c:4031
#3 0x000000000081ff5d in partition_bounds_merge (partnatts=1, partsupfunc=0x2ba3548, partcollation=0x2ba34e8, outer_rel=0x2b6ce40, inner_rel=0x2bf8d28, jointype=JOIN_INNER,
    outer_parts=0x7ffeebd43ed8, inner_parts=0x7ffeebd43ed0) at partbounds.c:3053
#4 0x00000000007c610f in try_partitionwise_join (root=0x2be2a28, rel1=0x2b6ce40, rel2=0x2bf8d28, joinrel=0x2c1b0f0, parent_sjinfo=0x7ffeebd44010,
    parent_restrictlist=0x2c1c070) at joinrels.c:1370
#5 0x00000000007c5521 in populate_joinrel_with_paths (root=0x2be2a28, rel1=0x2b6ce40, rel2=0x2bf8d28, joinrel=0x2c1b0f0, sjinfo=0x7ffeebd44010, restrictlist=0x2c1c070)
    at joinrels.c:914
#6 0x00000000007c4f48 in make_join_rel (root=0x2be2a28, rel1=0x2b6ce40, rel2=0x2bf8d28) at joinrels.c:748
#7 0x00000000007c4514 in make_rels_by_clause_joins (root=0x2be2a28, old_rel=0x2b6ce40, other_rels=0x2bae4d8) at joinrels.c:294
#8 0x00000000007c41c8 in join_search_one_level (root=0x2be2a28, level=3) at joinrels.c:116
#9 0x00000000007abe59 in standard_join_search (root=0x2be2a28, levels_needed=3, initial_rels=0x2bae500) at allpaths.c:2716
#10 0x00000000007abdca in make_rel_from_joinlist (root=0x2be2a28, joinlist=0x2bfbae8) at allpaths.c:2647
#11 0x00000000007a86b0 in make_one_rel (root=0x2be2a28, joinlist=0x2bfbae8) at allpaths.c:227
#12 0x00000000007dada1 in query_planner (root=0x2be2a28, tlist=0x2ba01c8, qp_callback=0x7e0b25 <standard_qp_callback>, qp_extra=0x7ffeebd44390) at planmain.c:265
#13 0x00000000007ddf83 in grouping_planner (root=0x2be2a28, inheritance_update=false, tuple_fraction=0) at planner.c:1929
#14 0x00000000007dc5f5 in subquery_planner (glob=0x2be2990, parse=0x2c0e8c8, parent_root=0x0, hasRecursion=false, tuple_fraction=0) at planner.c:997
#15 0x00000000007db1b6 in standard_planner (parse=0x2c0e8c8, cursorOptions=256, boundParams=0x0) at planner.c:416
#16 0x00000000007daef7 in planner (parse=0x2c0e8c8, cursorOptions=256, boundParams=0x0) at planner.c:276
#17 0x00000000008e15c5 in pg_plan_query (querytree=0x2c0e8c8, cursorOptions=256, boundParams=0x0) at postgres.c:878
#18 0x00000000006562cc in ExplainOneQuery (query=0x2c0e8c8, cursorOptions=256, into=0x0, es=0x2c0e0a0,
    queryString=0x2aa24d8 "EXPLAIN (COSTS OFF)\nSELECT avg(t1.a), avg(t2.b), avg(t3.a + t3.b), t1.c, t2.c, t3.c FROM\nplt1 t1, plt2 t2, plt1_e t3 WHERE t1.c = t2.c AND ltrim(t3.c, 'A') = t1.c GROUP BY t1.c, t2.c, t3.c ORDER BY t1"..., params=0x0, queryEnv=0x0) at explain.c:364

Thanks & Regards,

Rajkumar Raghuwanshi

QMG, EnterpriseDB.

Regards,
Amul Sul

On Mon, Feb 4, 2019 at 11:05 AM amul sul <sulamul@gmail.com> wrote:
There are few whitespaces in 0002 patch that I have fixed in the attached version.
Rest of the patches are untouched.

Ill continue my review and testing.

Regards,
Amul

On Thu, Jan 31, 2019 at 5:26 PM Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp> wrote:
(2019/01/22 21:38), Etsuro Fujita wrote:
> Will continue to review.

I rebased the patch set against the latest HEAD. Attached is a new
version. I'll move this to the next CF, and continue to review it.

Best regards,
Etsuro Fujita

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

amul sul

Date:

07 March 2019, 07:32:55

Thanks Rajkumar,

I am looking into this.

Regards,

Amul

On Thu, Mar 7, 2019 at 11:54 AM Rajkumar Raghuwanshi <rajkumar.raghuwanshi@enterprisedb.com> wrote:

On Tue, Mar 5, 2019 at 3:45 PM amul sul <sulamul@gmail.com> wrote:
Attached is the rebased atop of the latest master head(35bc0ec7c8).
thanks Amul, patches applied cleanly on PG head.

While testing this I got a server crash with below test case.

CREATE TABLE plt1 (a int, b int, c varchar) PARTITION BY LIST(c);
CREATE TABLE plt1_p1 PARTITION OF plt1 FOR VALUES IN ('0001','0002','0003');
CREATE TABLE plt1_p2 PARTITION OF plt1 FOR VALUES IN ('0004','0005','0006');
CREATE TABLE plt1_p3 PARTITION OF plt1 FOR VALUES IN (NULL,'0008','0009');
CREATE TABLE plt1_p4 PARTITION OF plt1 FOR VALUES IN ('0000','0010');
INSERT INTO plt1 SELECT i, i % 47, to_char(i % 17, 'FM0000') FROM generate_series(0, 500) i WHERE i % 17 NOT IN (7, 11, 12, 13, 14, 15,16);
INSERT INTO plt1 SELECT i, i % 47, case when i % 17 = 7 then NULL else to_char(i % 17, 'FM0000') end FROM generate_series(0, 500) i WHERE i % 17 IN (7,8,9);
ANALYSE plt1;

CREATE TABLE plt2 (a int, b int, c varchar) PARTITION BY LIST(c);
CREATE TABLE plt2_p1 PARTITION OF plt2 FOR VALUES IN ('0002','0003');
CREATE TABLE plt2_p2 PARTITION OF plt2 FOR VALUES IN ('0004','0005','0006');
CREATE TABLE plt2_p3 PARTITION OF plt2 FOR VALUES IN ('0007','0008','0009');
CREATE TABLE plt2_p4 PARTITION OF plt2 FOR VALUES IN ('0000',NULL,'0012');
INSERT INTO plt2 SELECT i, i % 47, to_char(i % 17, 'FM0000') FROM generate_series(0, 500) i WHERE i % 17 NOT IN (1, 10, 11, 13, 14, 15, 16);
INSERT INTO plt2 SELECT i, i % 47, case when i % 17 = 11 then NULL else to_char(i % 17, 'FM0000') end FROM generate_series(0, 500) i WHERE i % 17 IN (0,11,12);
ANALYZE plt2;

CREATE TABLE plt1_e (a int, b int, c text) PARTITION BY LIST(ltrim(c, 'A'));
CREATE TABLE plt1_e_p1 PARTITION OF plt1_e FOR VALUES IN ('0002', '0003');
CREATE TABLE plt1_e_p2 PARTITION OF plt1_e FOR VALUES IN ('0004', '0005', '0006');
CREATE TABLE plt1_e_p3 PARTITION OF plt1_e FOR VALUES IN ('0008', '0009');
CREATE TABLE plt1_e_p4 PARTITION OF plt1_e FOR VALUES IN ('0000');
INSERT INTO plt1_e SELECT i, i % 47, to_char(i % 17, 'FM0000') FROM generate_series(0, 500) i WHERE i % 17 NOT IN (1, 7, 10, 11, 12, 13, 14, 15, 16);
ANALYZE plt1_e;

EXPLAIN (COSTS OFF)
SELECT avg(t1.a), avg(t2.b), avg(t3.a + t3.b), t1.c, t2.c, t3.c FROM
plt1 t1, plt2 t2, plt1_e t3 WHERE t1.c = t2.c AND ltrim(t3.c, 'A') = t1.c GROUP BY t1.c, t2.c, t3.c ORDER BY t1.c, t2.c, t3.c;
server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

below is stack trace, looks like some indexes got corrupted, please take a look.

Core was generated by `postgres: edb postgres [local] EXPLAIN                   '.
Program terminated with signal 11, Segmentation fault.
#0 0x0000000000821aee in map_and_merge_partitions (partmaps1=0x2c1c8a8, partmaps2=0x2c1c8e0, index1=45540240, index2=0, next_index=0x7ffeebd43d3c) at partbounds.c:4162
4162        if (partmap1->from < 0 && partmap2->from < 0)
Missing separate debuginfos, use: debuginfo-install keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-65.el6.x86_64 libcom_err-1.41.12-24.el6.x86_64 libselinux-2.0.94-7.el6.x86_64 openssl-1.0.1e-57.el6.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0 0x0000000000821aee in map_and_merge_partitions (partmaps1=0x2c1c8a8, partmaps2=0x2c1c8e0, index1=45540240, index2=0, next_index=0x7ffeebd43d3c) at partbounds.c:4162
#1 0x00000000008226c3 in merge_null_partitions (outer_bi=0x2b6e338, inner_bi=0x2bf90b0, outer_maps=0x2c1c8a8, inner_maps=0x2c1c8e0, jointype=JOIN_INNER,
    next_index=0x7ffeebd43d3c, null_index=0x7ffeebd43d38, default_index=0x7ffeebd43d34) at partbounds.c:4610
#2 0x0000000000821726 in partition_list_bounds_merge (partsupfunc=0x2ba3548, partcollation=0x2ba34e8, outer_rel=0x2b6ce40, inner_rel=0x2bf8d28, outer_parts=0x7ffeebd43ed8,
    inner_parts=0x7ffeebd43ed0, jointype=JOIN_INNER) at partbounds.c:4031
#3 0x000000000081ff5d in partition_bounds_merge (partnatts=1, partsupfunc=0x2ba3548, partcollation=0x2ba34e8, outer_rel=0x2b6ce40, inner_rel=0x2bf8d28, jointype=JOIN_INNER,
    outer_parts=0x7ffeebd43ed8, inner_parts=0x7ffeebd43ed0) at partbounds.c:3053
#4 0x00000000007c610f in try_partitionwise_join (root=0x2be2a28, rel1=0x2b6ce40, rel2=0x2bf8d28, joinrel=0x2c1b0f0, parent_sjinfo=0x7ffeebd44010,
    parent_restrictlist=0x2c1c070) at joinrels.c:1370
#5 0x00000000007c5521 in populate_joinrel_with_paths (root=0x2be2a28, rel1=0x2b6ce40, rel2=0x2bf8d28, joinrel=0x2c1b0f0, sjinfo=0x7ffeebd44010, restrictlist=0x2c1c070)
    at joinrels.c:914
#6 0x00000000007c4f48 in make_join_rel (root=0x2be2a28, rel1=0x2b6ce40, rel2=0x2bf8d28) at joinrels.c:748
#7 0x00000000007c4514 in make_rels_by_clause_joins (root=0x2be2a28, old_rel=0x2b6ce40, other_rels=0x2bae4d8) at joinrels.c:294
#8 0x00000000007c41c8 in join_search_one_level (root=0x2be2a28, level=3) at joinrels.c:116
#9 0x00000000007abe59 in standard_join_search (root=0x2be2a28, levels_needed=3, initial_rels=0x2bae500) at allpaths.c:2716
#10 0x00000000007abdca in make_rel_from_joinlist (root=0x2be2a28, joinlist=0x2bfbae8) at allpaths.c:2647
#11 0x00000000007a86b0 in make_one_rel (root=0x2be2a28, joinlist=0x2bfbae8) at allpaths.c:227
#12 0x00000000007dada1 in query_planner (root=0x2be2a28, tlist=0x2ba01c8, qp_callback=0x7e0b25 <standard_qp_callback>, qp_extra=0x7ffeebd44390) at planmain.c:265
#13 0x00000000007ddf83 in grouping_planner (root=0x2be2a28, inheritance_update=false, tuple_fraction=0) at planner.c:1929
#14 0x00000000007dc5f5 in subquery_planner (glob=0x2be2990, parse=0x2c0e8c8, parent_root=0x0, hasRecursion=false, tuple_fraction=0) at planner.c:997
#15 0x00000000007db1b6 in standard_planner (parse=0x2c0e8c8, cursorOptions=256, boundParams=0x0) at planner.c:416
#16 0x00000000007daef7 in planner (parse=0x2c0e8c8, cursorOptions=256, boundParams=0x0) at planner.c:276
#17 0x00000000008e15c5 in pg_plan_query (querytree=0x2c0e8c8, cursorOptions=256, boundParams=0x0) at postgres.c:878
#18 0x00000000006562cc in ExplainOneQuery (query=0x2c0e8c8, cursorOptions=256, into=0x0, es=0x2c0e0a0,
    queryString=0x2aa24d8 "EXPLAIN (COSTS OFF)\nSELECT avg(t1.a), avg(t2.b), avg(t3.a + t3.b), t1.c, t2.c, t3.c FROM\nplt1 t1, plt2 t2, plt1_e t3 WHERE t1.c = t2.c AND ltrim(t3.c, 'A') = t1.c GROUP BY t1.c, t2.c, t3.c ORDER BY t1"..., params=0x0, queryEnv=0x0) at explain.c:364

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB.

Regards,
Amul Sul

On Mon, Feb 4, 2019 at 11:05 AM amul sul <sulamul@gmail.com> wrote:
There are few whitespaces in 0002 patch that I have fixed in the attached version.
Rest of the patches are untouched.

Ill continue my review and testing.

Regards,
Amul

On Thu, Jan 31, 2019 at 5:26 PM Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp> wrote:
(2019/01/22 21:38), Etsuro Fujita wrote:
> Will continue to review.

I rebased the patch set against the latest HEAD. Attached is a new
version. I'll move this to the next CF, and continue to review it.

Best regards,
Etsuro Fujita

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

amul sul

Date:

07 March 2019, 14:50:10

On Thu, Mar 7, 2019 at 1:02 PM amul sul <sulamul@gmail.com> wrote:

Thanks Rajkumar,

I am looking into this.

The crash happens when none of the if-else branch of handle_missing_partition()

evaluates and returns merged_index unassigned.

Let me explain, in Rajkumar 's test case, the join type is JOIN_INNER. When

only outer rel has null partition, merge_null_partitions() function calls

handle_missing_partition() with missing_side_inner = false and

missing_side_outer = false argument value which fails to set merged_index.

In the attached patch, I tried to fix this case by setting merged_index

explicitly which fixes the reported crash.

Regards,

Amul

On Thu, Mar 7, 2019 at 11:54 AM Rajkumar Raghuwanshi <rajkumar.raghuwanshi@enterprisedb.com> wrote:

On Tue, Mar 5, 2019 at 3:45 PM amul sul <sulamul@gmail.com> wrote:
Attached is the rebased atop of the latest master head(35bc0ec7c8).
thanks Amul, patches applied cleanly on PG head.

While testing this I got a server crash with below test case.

CREATE TABLE plt1 (a int, b int, c varchar) PARTITION BY LIST(c);
CREATE TABLE plt1_p1 PARTITION OF plt1 FOR VALUES IN ('0001','0002','0003');
CREATE TABLE plt1_p2 PARTITION OF plt1 FOR VALUES IN ('0004','0005','0006');
CREATE TABLE plt1_p3 PARTITION OF plt1 FOR VALUES IN (NULL,'0008','0009');
CREATE TABLE plt1_p4 PARTITION OF plt1 FOR VALUES IN ('0000','0010');
INSERT INTO plt1 SELECT i, i % 47, to_char(i % 17, 'FM0000') FROM generate_series(0, 500) i WHERE i % 17 NOT IN (7, 11, 12, 13, 14, 15,16);
INSERT INTO plt1 SELECT i, i % 47, case when i % 17 = 7 then NULL else to_char(i % 17, 'FM0000') end FROM generate_series(0, 500) i WHERE i % 17 IN (7,8,9);
ANALYSE plt1;

CREATE TABLE plt2 (a int, b int, c varchar) PARTITION BY LIST(c);
CREATE TABLE plt2_p1 PARTITION OF plt2 FOR VALUES IN ('0002','0003');
CREATE TABLE plt2_p2 PARTITION OF plt2 FOR VALUES IN ('0004','0005','0006');
CREATE TABLE plt2_p3 PARTITION OF plt2 FOR VALUES IN ('0007','0008','0009');
CREATE TABLE plt2_p4 PARTITION OF plt2 FOR VALUES IN ('0000',NULL,'0012');
INSERT INTO plt2 SELECT i, i % 47, to_char(i % 17, 'FM0000') FROM generate_series(0, 500) i WHERE i % 17 NOT IN (1, 10, 11, 13, 14, 15, 16);
INSERT INTO plt2 SELECT i, i % 47, case when i % 17 = 11 then NULL else to_char(i % 17, 'FM0000') end FROM generate_series(0, 500) i WHERE i % 17 IN (0,11,12);
ANALYZE plt2;

CREATE TABLE plt1_e (a int, b int, c text) PARTITION BY LIST(ltrim(c, 'A'));
CREATE TABLE plt1_e_p1 PARTITION OF plt1_e FOR VALUES IN ('0002', '0003');
CREATE TABLE plt1_e_p2 PARTITION OF plt1_e FOR VALUES IN ('0004', '0005', '0006');
CREATE TABLE plt1_e_p3 PARTITION OF plt1_e FOR VALUES IN ('0008', '0009');
CREATE TABLE plt1_e_p4 PARTITION OF plt1_e FOR VALUES IN ('0000');
INSERT INTO plt1_e SELECT i, i % 47, to_char(i % 17, 'FM0000') FROM generate_series(0, 500) i WHERE i % 17 NOT IN (1, 7, 10, 11, 12, 13, 14, 15, 16);
ANALYZE plt1_e;

EXPLAIN (COSTS OFF)
SELECT avg(t1.a), avg(t2.b), avg(t3.a + t3.b), t1.c, t2.c, t3.c FROM
plt1 t1, plt2 t2, plt1_e t3 WHERE t1.c = t2.c AND ltrim(t3.c, 'A') = t1.c GROUP BY t1.c, t2.c, t3.c ORDER BY t1.c, t2.c, t3.c;
server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

below is stack trace, looks like some indexes got corrupted, please take a look.

Core was generated by `postgres: edb postgres [local] EXPLAIN                   '.
Program terminated with signal 11, Segmentation fault.
#0 0x0000000000821aee in map_and_merge_partitions (partmaps1=0x2c1c8a8, partmaps2=0x2c1c8e0, index1=45540240, index2=0, next_index=0x7ffeebd43d3c) at partbounds.c:4162
4162        if (partmap1->from < 0 && partmap2->from < 0)
Missing separate debuginfos, use: debuginfo-install keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-65.el6.x86_64 libcom_err-1.41.12-24.el6.x86_64 libselinux-2.0.94-7.el6.x86_64 openssl-1.0.1e-57.el6.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0 0x0000000000821aee in map_and_merge_partitions (partmaps1=0x2c1c8a8, partmaps2=0x2c1c8e0, index1=45540240, index2=0, next_index=0x7ffeebd43d3c) at partbounds.c:4162
#1 0x00000000008226c3 in merge_null_partitions (outer_bi=0x2b6e338, inner_bi=0x2bf90b0, outer_maps=0x2c1c8a8, inner_maps=0x2c1c8e0, jointype=JOIN_INNER,
    next_index=0x7ffeebd43d3c, null_index=0x7ffeebd43d38, default_index=0x7ffeebd43d34) at partbounds.c:4610
#2 0x0000000000821726 in partition_list_bounds_merge (partsupfunc=0x2ba3548, partcollation=0x2ba34e8, outer_rel=0x2b6ce40, inner_rel=0x2bf8d28, outer_parts=0x7ffeebd43ed8,
    inner_parts=0x7ffeebd43ed0, jointype=JOIN_INNER) at partbounds.c:4031
#3 0x000000000081ff5d in partition_bounds_merge (partnatts=1, partsupfunc=0x2ba3548, partcollation=0x2ba34e8, outer_rel=0x2b6ce40, inner_rel=0x2bf8d28, jointype=JOIN_INNER,
    outer_parts=0x7ffeebd43ed8, inner_parts=0x7ffeebd43ed0) at partbounds.c:3053
#4 0x00000000007c610f in try_partitionwise_join (root=0x2be2a28, rel1=0x2b6ce40, rel2=0x2bf8d28, joinrel=0x2c1b0f0, parent_sjinfo=0x7ffeebd44010,
    parent_restrictlist=0x2c1c070) at joinrels.c:1370
#5 0x00000000007c5521 in populate_joinrel_with_paths (root=0x2be2a28, rel1=0x2b6ce40, rel2=0x2bf8d28, joinrel=0x2c1b0f0, sjinfo=0x7ffeebd44010, restrictlist=0x2c1c070)
    at joinrels.c:914
#6 0x00000000007c4f48 in make_join_rel (root=0x2be2a28, rel1=0x2b6ce40, rel2=0x2bf8d28) at joinrels.c:748
#7 0x00000000007c4514 in make_rels_by_clause_joins (root=0x2be2a28, old_rel=0x2b6ce40, other_rels=0x2bae4d8) at joinrels.c:294
#8 0x00000000007c41c8 in join_search_one_level (root=0x2be2a28, level=3) at joinrels.c:116
#9 0x00000000007abe59 in standard_join_search (root=0x2be2a28, levels_needed=3, initial_rels=0x2bae500) at allpaths.c:2716
#10 0x00000000007abdca in make_rel_from_joinlist (root=0x2be2a28, joinlist=0x2bfbae8) at allpaths.c:2647
#11 0x00000000007a86b0 in make_one_rel (root=0x2be2a28, joinlist=0x2bfbae8) at allpaths.c:227
#12 0x00000000007dada1 in query_planner (root=0x2be2a28, tlist=0x2ba01c8, qp_callback=0x7e0b25 <standard_qp_callback>, qp_extra=0x7ffeebd44390) at planmain.c:265
#13 0x00000000007ddf83 in grouping_planner (root=0x2be2a28, inheritance_update=false, tuple_fraction=0) at planner.c:1929
#14 0x00000000007dc5f5 in subquery_planner (glob=0x2be2990, parse=0x2c0e8c8, parent_root=0x0, hasRecursion=false, tuple_fraction=0) at planner.c:997
#15 0x00000000007db1b6 in standard_planner (parse=0x2c0e8c8, cursorOptions=256, boundParams=0x0) at planner.c:416
#16 0x00000000007daef7 in planner (parse=0x2c0e8c8, cursorOptions=256, boundParams=0x0) at planner.c:276
#17 0x00000000008e15c5 in pg_plan_query (querytree=0x2c0e8c8, cursorOptions=256, boundParams=0x0) at postgres.c:878
#18 0x00000000006562cc in ExplainOneQuery (query=0x2c0e8c8, cursorOptions=256, into=0x0, es=0x2c0e0a0,
    queryString=0x2aa24d8 "EXPLAIN (COSTS OFF)\nSELECT avg(t1.a), avg(t2.b), avg(t3.a + t3.b), t1.c, t2.c, t3.c FROM\nplt1 t1, plt2 t2, plt1_e t3 WHERE t1.c = t2.c AND ltrim(t3.c, 'A') = t1.c GROUP BY t1.c, t2.c, t3.c ORDER BY t1"..., params=0x0, queryEnv=0x0) at explain.c:364

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB.

Regards,
Amul Sul

On Mon, Feb 4, 2019 at 11:05 AM amul sul <sulamul@gmail.com> wrote:
There are few whitespaces in 0002 patch that I have fixed in the attached version.
Rest of the patches are untouched.

Ill continue my review and testing.

Regards,
Amul

On Thu, Jan 31, 2019 at 5:26 PM Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp> wrote:
(2019/01/22 21:38), Etsuro Fujita wrote:
> Will continue to review.

I rebased the patch set against the latest HEAD. Attached is a new
version. I'll move this to the next CF, and continue to review it.

Best regards,
Etsuro Fujita

Attachment

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Rajkumar Raghuwanshi

Date:

08 March 2019, 08:42:59

On Thu, Mar 7, 2019 at 8:20 PM amul sul <sulamul@gmail.com> wrote:

On Thu, Mar 7, 2019 at 1:02 PM amul sul <sulamul@gmail.com> wrote:
Thanks Rajkumar,

I am looking into this.

The crash happens when none of the if-else branch of handle_missing_partition()
evaluates and returns merged_index unassigned.

Let me explain, in Rajkumar 's test case, the join type is JOIN_INNER. When
only outer rel has null partition, merge_null_partitions() function calls
handle_missing_partition() with missing_side_inner = false and
missing_side_outer = false argument value which fails to set merged_index.

In the attached patch, I tried to fix this case by setting merged_index
explicitly which fixes the reported crash.

Thanks Amul, with v20 patches, crash is fixed.

Regards,
Amul

On Thu, Mar 7, 2019 at 11:54 AM Rajkumar Raghuwanshi <rajkumar.raghuwanshi@enterprisedb.com> wrote:

On Tue, Mar 5, 2019 at 3:45 PM amul sul <sulamul@gmail.com> wrote:
Attached is the rebased atop of the latest master head(35bc0ec7c8).
thanks Amul, patches applied cleanly on PG head.

While testing this I got a server crash with below test case.

CREATE TABLE plt1 (a int, b int, c varchar) PARTITION BY LIST(c);
CREATE TABLE plt1_p1 PARTITION OF plt1 FOR VALUES IN ('0001','0002','0003');
CREATE TABLE plt1_p2 PARTITION OF plt1 FOR VALUES IN ('0004','0005','0006');
CREATE TABLE plt1_p3 PARTITION OF plt1 FOR VALUES IN (NULL,'0008','0009');
CREATE TABLE plt1_p4 PARTITION OF plt1 FOR VALUES IN ('0000','0010');
INSERT INTO plt1 SELECT i, i % 47, to_char(i % 17, 'FM0000') FROM generate_series(0, 500) i WHERE i % 17 NOT IN (7, 11, 12, 13, 14, 15,16);
INSERT INTO plt1 SELECT i, i % 47, case when i % 17 = 7 then NULL else to_char(i % 17, 'FM0000') end FROM generate_series(0, 500) i WHERE i % 17 IN (7,8,9);
ANALYSE plt1;

CREATE TABLE plt2 (a int, b int, c varchar) PARTITION BY LIST(c);
CREATE TABLE plt2_p1 PARTITION OF plt2 FOR VALUES IN ('0002','0003');
CREATE TABLE plt2_p2 PARTITION OF plt2 FOR VALUES IN ('0004','0005','0006');
CREATE TABLE plt2_p3 PARTITION OF plt2 FOR VALUES IN ('0007','0008','0009');
CREATE TABLE plt2_p4 PARTITION OF plt2 FOR VALUES IN ('0000',NULL,'0012');
INSERT INTO plt2 SELECT i, i % 47, to_char(i % 17, 'FM0000') FROM generate_series(0, 500) i WHERE i % 17 NOT IN (1, 10, 11, 13, 14, 15, 16);
INSERT INTO plt2 SELECT i, i % 47, case when i % 17 = 11 then NULL else to_char(i % 17, 'FM0000') end FROM generate_series(0, 500) i WHERE i % 17 IN (0,11,12);
ANALYZE plt2;

CREATE TABLE plt1_e (a int, b int, c text) PARTITION BY LIST(ltrim(c, 'A'));
CREATE TABLE plt1_e_p1 PARTITION OF plt1_e FOR VALUES IN ('0002', '0003');
CREATE TABLE plt1_e_p2 PARTITION OF plt1_e FOR VALUES IN ('0004', '0005', '0006');
CREATE TABLE plt1_e_p3 PARTITION OF plt1_e FOR VALUES IN ('0008', '0009');
CREATE TABLE plt1_e_p4 PARTITION OF plt1_e FOR VALUES IN ('0000');
INSERT INTO plt1_e SELECT i, i % 47, to_char(i % 17, 'FM0000') FROM generate_series(0, 500) i WHERE i % 17 NOT IN (1, 7, 10, 11, 12, 13, 14, 15, 16);
ANALYZE plt1_e;

EXPLAIN (COSTS OFF)
SELECT avg(t1.a), avg(t2.b), avg(t3.a + t3.b), t1.c, t2.c, t3.c FROM
plt1 t1, plt2 t2, plt1_e t3 WHERE t1.c = t2.c AND ltrim(t3.c, 'A') = t1.c GROUP BY t1.c, t2.c, t3.c ORDER BY t1.c, t2.c, t3.c;
server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

below is stack trace, looks like some indexes got corrupted, please take a look.

Core was generated by `postgres: edb postgres [local] EXPLAIN                   '.
Program terminated with signal 11, Segmentation fault.
#0 0x0000000000821aee in map_and_merge_partitions (partmaps1=0x2c1c8a8, partmaps2=0x2c1c8e0, index1=45540240, index2=0, next_index=0x7ffeebd43d3c) at partbounds.c:4162
4162        if (partmap1->from < 0 && partmap2->from < 0)
Missing separate debuginfos, use: debuginfo-install keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-65.el6.x86_64 libcom_err-1.41.12-24.el6.x86_64 libselinux-2.0.94-7.el6.x86_64 openssl-1.0.1e-57.el6.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0 0x0000000000821aee in map_and_merge_partitions (partmaps1=0x2c1c8a8, partmaps2=0x2c1c8e0, index1=45540240, index2=0, next_index=0x7ffeebd43d3c) at partbounds.c:4162
#1 0x00000000008226c3 in merge_null_partitions (outer_bi=0x2b6e338, inner_bi=0x2bf90b0, outer_maps=0x2c1c8a8, inner_maps=0x2c1c8e0, jointype=JOIN_INNER,
    next_index=0x7ffeebd43d3c, null_index=0x7ffeebd43d38, default_index=0x7ffeebd43d34) at partbounds.c:4610
#2 0x0000000000821726 in partition_list_bounds_merge (partsupfunc=0x2ba3548, partcollation=0x2ba34e8, outer_rel=0x2b6ce40, inner_rel=0x2bf8d28, outer_parts=0x7ffeebd43ed8,
    inner_parts=0x7ffeebd43ed0, jointype=JOIN_INNER) at partbounds.c:4031
#3 0x000000000081ff5d in partition_bounds_merge (partnatts=1, partsupfunc=0x2ba3548, partcollation=0x2ba34e8, outer_rel=0x2b6ce40, inner_rel=0x2bf8d28, jointype=JOIN_INNER,
    outer_parts=0x7ffeebd43ed8, inner_parts=0x7ffeebd43ed0) at partbounds.c:3053
#4 0x00000000007c610f in try_partitionwise_join (root=0x2be2a28, rel1=0x2b6ce40, rel2=0x2bf8d28, joinrel=0x2c1b0f0, parent_sjinfo=0x7ffeebd44010,
    parent_restrictlist=0x2c1c070) at joinrels.c:1370
#5 0x00000000007c5521 in populate_joinrel_with_paths (root=0x2be2a28, rel1=0x2b6ce40, rel2=0x2bf8d28, joinrel=0x2c1b0f0, sjinfo=0x7ffeebd44010, restrictlist=0x2c1c070)
    at joinrels.c:914
#6 0x00000000007c4f48 in make_join_rel (root=0x2be2a28, rel1=0x2b6ce40, rel2=0x2bf8d28) at joinrels.c:748
#7 0x00000000007c4514 in make_rels_by_clause_joins (root=0x2be2a28, old_rel=0x2b6ce40, other_rels=0x2bae4d8) at joinrels.c:294
#8 0x00000000007c41c8 in join_search_one_level (root=0x2be2a28, level=3) at joinrels.c:116
#9 0x00000000007abe59 in standard_join_search (root=0x2be2a28, levels_needed=3, initial_rels=0x2bae500) at allpaths.c:2716
#10 0x00000000007abdca in make_rel_from_joinlist (root=0x2be2a28, joinlist=0x2bfbae8) at allpaths.c:2647
#11 0x00000000007a86b0 in make_one_rel (root=0x2be2a28, joinlist=0x2bfbae8) at allpaths.c:227
#12 0x00000000007dada1 in query_planner (root=0x2be2a28, tlist=0x2ba01c8, qp_callback=0x7e0b25 <standard_qp_callback>, qp_extra=0x7ffeebd44390) at planmain.c:265
#13 0x00000000007ddf83 in grouping_planner (root=0x2be2a28, inheritance_update=false, tuple_fraction=0) at planner.c:1929
#14 0x00000000007dc5f5 in subquery_planner (glob=0x2be2990, parse=0x2c0e8c8, parent_root=0x0, hasRecursion=false, tuple_fraction=0) at planner.c:997
#15 0x00000000007db1b6 in standard_planner (parse=0x2c0e8c8, cursorOptions=256, boundParams=0x0) at planner.c:416
#16 0x00000000007daef7 in planner (parse=0x2c0e8c8, cursorOptions=256, boundParams=0x0) at planner.c:276
#17 0x00000000008e15c5 in pg_plan_query (querytree=0x2c0e8c8, cursorOptions=256, boundParams=0x0) at postgres.c:878
#18 0x00000000006562cc in ExplainOneQuery (query=0x2c0e8c8, cursorOptions=256, into=0x0, es=0x2c0e0a0,
    queryString=0x2aa24d8 "EXPLAIN (COSTS OFF)\nSELECT avg(t1.a), avg(t2.b), avg(t3.a + t3.b), t1.c, t2.c, t3.c FROM\nplt1 t1, plt2 t2, plt1_e t3 WHERE t1.c = t2.c AND ltrim(t3.c, 'A') = t1.c GROUP BY t1.c, t2.c, t3.c ORDER BY t1"..., params=0x0, queryEnv=0x0) at explain.c:364

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB.

Regards,
Amul Sul

On Mon, Feb 4, 2019 at 11:05 AM amul sul <sulamul@gmail.com> wrote:
There are few whitespaces in 0002 patch that I have fixed in the attached version.
Rest of the patches are untouched.

Ill continue my review and testing.

Regards,
Amul

On Thu, Jan 31, 2019 at 5:26 PM Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp> wrote:
(2019/01/22 21:38), Etsuro Fujita wrote:
> Will continue to review.

I rebased the patch set against the latest HEAD. Attached is a new
version. I'll move this to the next CF, and continue to review it.

Best regards,
Etsuro Fujita

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Ashutosh Bapat

Date:

11 March 2019, 02:58:54

On Thu, Mar 7, 2019 at 8:20 PM amul sul <sulamul@gmail.com> wrote:

On Thu, Mar 7, 2019 at 1:02 PM amul sul <sulamul@gmail.com> wrote:
Thanks Rajkumar,

I am looking into this.

The crash happens when none of the if-else branch of handle_missing_partition()
evaluates and returns merged_index unassigned.

Let me explain, in Rajkumar 's test case, the join type is JOIN_INNER. When
only outer rel has null partition, merge_null_partitions() function calls
handle_missing_partition() with missing_side_inner = false and
missing_side_outer = false

Both missing_side_ variables being false when the NULL partition is missing on the inner side looks suspicious. I guess from the variable names that the missing_side_inner should be true in this case.

argument value which fails to set merged_index.

In the attached patch, I tried to fix this case by setting merged_index
explicitly which fixes the reported crash.

I expect handle_missing_partition() to set the merged_index always. In your patches, I don't see that function in your patches is setting it explicitly. If we are setting merged_index explicitly somewhere else, other places may miss that explicit assignment. So it's better to move it inside this function.

Best Wishes,

Ashutosh Bapat

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

amul sul

Date:

11 March 2019, 05:09:46

On Mon, Mar 11, 2019 at 8:29 AM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:

On Thu, Mar 7, 2019 at 8:20 PM amul sul <sulamul@gmail.com> wrote:

On Thu, Mar 7, 2019 at 1:02 PM amul sul <sulamul@gmail.com> wrote:
Thanks Rajkumar,

I am looking into this.

The crash happens when none of the if-else branch of handle_missing_partition()
evaluates and returns merged_index unassigned.

Let me explain, in Rajkumar 's test case, the join type is JOIN_INNER. When
only outer rel has null partition, merge_null_partitions() function calls
handle_missing_partition() with missing_side_inner = false and
missing_side_outer = false

Both missing_side_ variables being false when the NULL partition is missing on the inner side looks suspicious. I guess from the variable names that the missing_side_inner should be true in this case.

All the places from where this handle_missing_partition() get called

have the following code to decide the value for missing_side_outer/_inner which

I yet to understand. Do you think this has some flaw?

* For a FULL join, inner relation acts as both OUTER and INNER

* relation. For LEFT and ANTI join the inner relation acts as

* INNER relation. For INNER and SEMI join OUTER and INNER

* differentiation is immaterial.

missing_side_inner = (jointype == JOIN_FULL ||

jointype == JOIN_LEFT ||

jointype == JOIN_ANTI);

missing_side_outer = (jointype == JOIN_FULL);

argument value which fails to set merged_index.

In the attached patch, I tried to fix this case by setting merged_index
explicitly which fixes the reported crash.

I expect handle_missing_partition() to set the merged_index always. In your patches, I don't see that function in your patches is setting it explicitly. If we are setting merged_index explicitly somewhere else, other places may miss that explicit assignment. So it's better to move it inside this function.

Ok, that can be fixed.

Similarly, I think merge_null_partitions should set null_index instead of

asserting when null partitions missing from both the side, make sense?

Regards,

Amul

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Ashutosh Bapat

Date:

11 March 2019, 16:43:51

On Mon, Mar 11, 2019 at 10:40 AM amul sul <sulamul@gmail.com> wrote:

All the places from where this handle_missing_partition() get called
have the following code to decide the value for missing_side_outer/_inner which
I yet to understand. Do you think this has some flaw?

/*
* For a FULL join, inner relation acts as both OUTER and INNER
* relation. For LEFT and ANTI join the inner relation acts as
* INNER relation. For INNER and SEMI join OUTER and INNER
* differentiation is immaterial.
*/
missing_side_inner = (jointype == JOIN_FULL ||
jointype == JOIN_LEFT ||
jointype == JOIN_ANTI);
missing_side_outer = (jointype == JOIN_FULL);

I was wrong, sorry. The comment says it all.

argument value which fails to set merged_index.

In the attached patch, I tried to fix this case by setting merged_index
explicitly which fixes the reported crash.

I expect handle_missing_partition() to set the merged_index always. In your patches, I don't see that function in your patches is setting it explicitly. If we are setting merged_index explicitly somewhere else, other places may miss that explicit assignment. So it's better to move it inside this function.

Ok, that can be fixed.

Similarly, I think merge_null_partitions should set null_index instead of
asserting when null partitions missing from both the side, make sense?

I think not. null_index, once set shouldn't change and hence does not change with each pair of partitions being matched. So, it makes sense to make sure that null_index remains invalid if none of the tables have null partition.

Best Wishes,

Ashutosh Bapat

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

amul sul

Date:

24 April 2019, 11:26:06

Attached version is rebase atop of the latest master head(fdc7efcc30), also

incorporates the Ashutosh's suggestion, thanks.

Regards,

Amul

On Mon, Mar 11, 2019 at 10:14 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:

On Mon, Mar 11, 2019 at 10:40 AM amul sul <sulamul@gmail.com> wrote:

All the places from where this handle_missing_partition() get called
have the following code to decide the value for missing_side_outer/_inner which
I yet to understand. Do you think this has some flaw?

/*
* For a FULL join, inner relation acts as both OUTER and INNER
* relation. For LEFT and ANTI join the inner relation acts as
* INNER relation. For INNER and SEMI join OUTER and INNER
* differentiation is immaterial.
*/
missing_side_inner = (jointype == JOIN_FULL ||
jointype == JOIN_LEFT ||
jointype == JOIN_ANTI);
missing_side_outer = (jointype == JOIN_FULL);

I was wrong, sorry. The comment says it all.

argument value which fails to set merged_index.

In the attached patch, I tried to fix this case by setting merged_index
explicitly which fixes the reported crash.

I expect handle_missing_partition() to set the merged_index always. In your patches, I don't see that function in your patches is setting it explicitly. If we are setting merged_index explicitly somewhere else, other places may miss that explicit assignment. So it's better to move it inside this function.

Ok, that can be fixed.

Similarly, I think merge_null_partitions should set null_index instead of
asserting when null partitions missing from both the side, make sense?

I think not. null_index, once set shouldn't change and hence does not change with each pair of partitions being matched. So, it makes sense to make sure that null_index remains invalid if none of the tables have null partition.

--
Best Wishes,
Ashutosh Bapat

Attachment

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Rajkumar Raghuwanshi

Date:

26 April 2019, 10:29:31

On Wed, Apr 24, 2019 at 4:56 PM amul sul <sulamul@gmail.com> wrote:

Attached version is rebase atop of the latest master head(fdc7efcc30), also
incorporates the Ashutosh's suggestion, thanks.

Thanks for rebase patch, patches applied cleanly on PG head.

I did some crash testing with extra test case [0006 patch] [1] and found no more issue.

Thanks & Regards,

Rajkumar Raghuwanshi

QMG, EnterpriseDB.

[1] https://www.postgresql.org/message-id/CAFjFpReKuV_4LRRfdy80BqX8oZfwbo%2BHWLQNv3CsJ5iGPSyTfA%40mail.gmail.com

Regards,
Amul

On Mon, Mar 11, 2019 at 10:14 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:

On Mon, Mar 11, 2019 at 10:40 AM amul sul <sulamul@gmail.com> wrote:

All the places from where this handle_missing_partition() get called
have the following code to decide the value for missing_side_outer/_inner which
I yet to understand. Do you think this has some flaw?

/*
* For a FULL join, inner relation acts as both OUTER and INNER
* relation. For LEFT and ANTI join the inner relation acts as
* INNER relation. For INNER and SEMI join OUTER and INNER
* differentiation is immaterial.
*/
missing_side_inner = (jointype == JOIN_FULL ||
jointype == JOIN_LEFT ||
jointype == JOIN_ANTI);
missing_side_outer = (jointype == JOIN_FULL);

I was wrong, sorry. The comment says it all.

argument value which fails to set merged_index.

In the attached patch, I tried to fix this case by setting merged_index
explicitly which fixes the reported crash.

I expect handle_missing_partition() to set the merged_index always. In your patches, I don't see that function in your patches is setting it explicitly. If we are setting merged_index explicitly somewhere else, other places may miss that explicit assignment. So it's better to move it inside this function.

Ok, that can be fixed.

Similarly, I think merge_null_partitions should set null_index instead of
asserting when null partitions missing from both the side, make sense?

I think not. null_index, once set shouldn't change and hence does not change with each pair of partitions being matched. So, it makes sense to make sure that null_index remains invalid if none of the tables have null partition.

--
Best Wishes,
Ashutosh Bapat

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Amit Langote

Date:

14 May 2019, 04:23:36

Hi Amul, Ashutosh,

On 2019/04/24 20:26, amul sul wrote:
> Attached version is rebase atop of the latest master head(fdc7efcc30), also
> incorporates the Ashutosh's suggestion, thanks.

Reading the commit message of 0002 and after testing to confirm, I
understand that the patch doesn't handle OUTER joins where the nullable
side is missing some partitions.  The reason given is that join planning
would have to add base relations corresponding to missing partitions on
the nullable side, which we can't do.  While working on partition pruning
refactoring recently (committed as 428b260f87e), we faced a similar
situation in that pruned partitions are like missing partitions, because
they're not added to the PlannerInfo anymore, whereas before that commit,
they'd be added and marked dummy afterwards.  Earlier versions of my patch
had code to add dummy base relations for such pruned partitions, because
partitionwise join expected pairs of matched partitions to be valid base
relations, because that's how things were when partitionwise joins feature
was committed.  Join path generated in this case would have a
constant-false Result path (an empty relation) for the nullable side.  Tom
strongly objected to that idea saying that such join paths are kind of
silly [1], even outside the context of partitionwise join.  He suggested
that we abandon partitionwise join in such cases, because having to build
a dummy base relation for pruned partitions only to generate silly-looking
paths would be an ugly kludge.  I guess the same argument applies to the
case where the nullable side is missing some partitions, so the right way
to support partitionwise join case in that case wouldn't be to figure out
how joinrels.c could add dummy base relations.

He did mention that cases where the nullable side is provably empty can be
handled by simply returning the path of the non-nullable side with
suitable projection path added on top to emit NULLs for the columns of the
nullable-side.  If we teach populate_joinrel_with_paths() and underlings
about that, then we can allow partitionwise join even in the case where
the nullable side has some partitions missing.

Thanks,
Amit

[1] https://www.postgresql.org/message-id/25035.1553905052%40sss.pgh.pa.us

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Amit Langote

Date:

14 May 2019, 04:29:33

On 2019/05/14 13:23, Amit Langote wrote:
> Tom
> strongly objected to that idea saying that such join paths are kind of
> silly [1], even outside the context of partitionwise join.  He suggested
> that we abandon partitionwise join in such cases, because having to build
> a dummy base relation for pruned partitions only to generate silly-looking
> paths would be an ugly kludge.

I forgot to mention that he even committed a patch to disable
partitionwise joins in such cases, which was also applied to v11 branch.

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=d70c147fa217c4bae32ac1afb86ab42d98b36fdf

Note that there were also other reasons for committing, beside what I
described in my previous email.

Thanks,
Amit

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Ashutosh Bapat

Date:

14 May 2019, 14:11:39

On Tue, May 14, 2019 at 10:00 AM Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2019/05/14 13:23, Amit Langote wrote:
> Tom
> strongly objected to that idea saying that such join paths are kind of
> silly [1], even outside the context of partitionwise join. He suggested
> that we abandon partitionwise join in such cases, because having to build
> a dummy base relation for pruned partitions only to generate silly-looking
> paths would be an ugly kludge.

I forgot to mention that he even committed a patch to disable
partitionwise joins in such cases, which was also applied to v11 branch.

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=d70c147fa217c4bae32ac1afb86ab42d98b36fdf

Note that there were also other reasons for committing, beside what I
described in my previous email.

I haven't seen the actual commit, but we could use these patches to enable partition-wise join when partitions are pruned. For that the partition descriptor of the pruned partition table should be arranged as if those partitions are missing in the table itself. However, we will still need code to handle the cases when the partitions are missing on the nullable side. Tom mentioned the idea of using just projection to produce join tuples with rows on the outer side appended with null columns from the nullable side. If we can implement that, we can remove the restrictions in this patch.

Best Wishes,

Ashutosh Bapat

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Robert Haas

Date:

17 May 2019, 12:20:04

On Tue, May 14, 2019 at 12:24 AM Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> He did mention that cases where the nullable side is provably empty can be
> handled by simply returning the path of the non-nullable side with
> suitable projection path added on top to emit NULLs for the columns of the
> nullable-side.  If we teach populate_joinrel_with_paths() and underlings
> about that, then we can allow partitionwise join even in the case where
> the nullable side has some partitions missing.

Yes, I think that would be a good approach to pursue.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Thomas Munro

Date:

01 July 2019, 09:49:13

On Sat, May 18, 2019 at 12:20 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, May 14, 2019 at 12:24 AM Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> > He did mention that cases where the nullable side is provably empty can be
> > handled by simply returning the path of the non-nullable side with
> > suitable projection path added on top to emit NULLs for the columns of the
> > nullable-side.  If we teach populate_joinrel_with_paths() and underlings
> > about that, then we can allow partitionwise join even in the case where
> > the nullable side has some partitions missing.
>
> Yes, I think that would be a good approach to pursue.

Hi Ashutosh, Amul, Fujita-san,

Could we please have a fresh rebase for the new Commitfest?

Thanks,

-- 
Thomas Munro
https://enterprisedb.com

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Etsuro Fujita

Date:

01 July 2019, 11:16:30

On Mon, Jul 1, 2019 at 6:50 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Sat, May 18, 2019 at 12:20 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > On Tue, May 14, 2019 at 12:24 AM Amit Langote
> > <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> > > He did mention that cases where the nullable side is provably empty can be
> > > handled by simply returning the path of the non-nullable side with
> > > suitable projection path added on top to emit NULLs for the columns of the
> > > nullable-side.  If we teach populate_joinrel_with_paths() and underlings
> > > about that, then we can allow partitionwise join even in the case where
> > > the nullable side has some partitions missing.
> >
> > Yes, I think that would be a good approach to pursue.
>
> Hi Ashutosh, Amul, Fujita-san,
>
> Could we please have a fresh rebase for the new Commitfest?

Will do unless Ashutosh, Amul, or anyone wants to.

Thanks!

Best regards,
Etsuro Fujita

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

amul sul

Date:

02 July 2019, 04:47:17

Attached version is rebase atop of the latest master head(c74d49d41c), thanks.

Regards,
Amul

On Mon, Jul 1, 2019 at 4:46 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:

On Mon, Jul 1, 2019 at 6:50 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Sat, May 18, 2019 at 12:20 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > On Tue, May 14, 2019 at 12:24 AM Amit Langote
> > <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> > > He did mention that cases where the nullable side is provably empty can be
> > > handled by simply returning the path of the non-nullable side with
> > > suitable projection path added on top to emit NULLs for the columns of the
> > > nullable-side. If we teach populate_joinrel_with_paths() and underlings
> > > about that, then we can allow partitionwise join even in the case where
> > > the nullable side has some partitions missing.
> >
> > Yes, I think that would be a good approach to pursue.
>
> Hi Ashutosh, Amul, Fujita-san,
>
> Could we please have a fresh rebase for the new Commitfest?

Will do unless Ashutosh, Amul, or anyone wants to.

Thanks!

Best regards,
Etsuro Fujita

Attachment

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Etsuro Fujita

Date:

03 July 2019, 06:44:16

On Tue, Jul 2, 2019 at 1:47 PM amul sul <sulamul@gmail.com> wrote:
> Attached version is rebase atop of the latest master head(c74d49d41c), thanks.

Thanks!  Will review.

Best regards,
Etsuro Fujita

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Etsuro Fujita

Date:

08 July 2019, 11:33:32

On Wed, Jul 3, 2019 at 3:44 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> On Tue, Jul 2, 2019 at 1:47 PM amul sul <sulamul@gmail.com> wrote:
> > Attached version is rebase atop of the latest master head(c74d49d41c), thanks.
>
> Thanks!  Will review.

I started reviewing this.  Here is my initial review comments:

* 0001-Hash-partition-bound-equality-refactoring-v22.patch

First of all, I agree with your view on hash partitioning:

"3. For hash partitioned tables however, we support partition-wise join
only when the bounds exactly match. For hash partitioning it's unusual
to have missing partitions and hence generic partition matching is not
required."

which is cited from the commit message for the main patch
"0002-Partition-wise-join-for-1-1-1-0-0-1-partition-matchi-v22.patch".
(I think it would be better if we can extend the partition matching to
the hash-partitioning case where there are missing partitions in
future, though.)  However, I don't think it's a good idea to do this
refactoring, because that would lead to duplicating the code to check
whether two given hash bound collections are equal in
partition_bounds_equal() and partition_hash_bounds_merge() that will
be added by the main patch, after all.  To avoid that, how about
calling partition_bounds_equal() from partition_hash_bounds_merge() in
the main patch, like the attached?  I also did some cleanup for
partition_hash_bounds_merge().  This change makes the refactoring
patch unnecessary, so I dropped it.  Also, I included the
regression-test patch
"0003-Tests-for-0-1-1-1-and-1-0-partition-matching-v22.patch" in the
attached, because make check didn't succeed without the
regression-test patch.

That's it for now.  I'll review the remaining parts (ie,
"0002-Partition-wise-join-for-1-1-1-0-0-1-partition-matchi-v22.patch"
and "0003-Tests-for-0-1-1-1-and-1-0-partition-matching-v22.patch")
closely next.

Best regards,
Etsuro Fujita

Attachment

Improve-partition-matching-for-partitionwise-joins-v23.patch

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Etsuro Fujita

Date:

18 July 2019, 06:55:36

On Mon, Jul 8, 2019 at 8:33 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> I'll review the remaining parts (ie,
> "0002-Partition-wise-join-for-1-1-1-0-0-1-partition-matchi-v22.patch"
> and "0003-Tests-for-0-1-1-1-and-1-0-partition-matching-v22.patch")
> closely next.

I've been reviewing the main patch
"0002-Partition-wise-join-for-1-1-1-0-0-1-partition-matchi-v22.patch",
but it's pretty large and complicated, so I'm still at the first pass,
and so maybe I'm missing something, but let me comment a few things on
the patch.  First of all, I think the patch's problem setting, which
is stated in the commit message, is reasonable:

    Earlier version of partition-wise join implementation allowed
    partition-wise join between two relations with exactly same partition
    bounds. This commit allows partition-wise join to be applied under
    following conditions

    1. the partition bounds of joining relations are such that rows from
    given partition on one side join can join with rows from maximum one
    partition on the other side i.e. bounds of a given partition on one
    side match/overlap with those of maximum one partition on the other
    side.

And I agree with the patch's approach to this that tries to find such
a partition mapping between two input partitioned relations for a join
relation, by trying to match their partition bounds, which is
implemented in a new function partition_bounds_merge(), in general.
But one thing that I'm concerned about most at this point is this in
try_partitionwise_join():

    /*
-    * Since we allow partitionwise join only when the partition bounds of the
-    * joining relations exactly match, the partition bounds of the join
-    * should match those of the joining relations.
+    * Get the list of matching partitions to be joined along with the
+    * partition bounds of the join relation. Because of the restrictions
+    * imposed by partition matching algorithm, not every pair of joining
+    * relations for this join will be able to use partition-wise join. But all
+    * those pairs which can use partition-wise join will produce the same
+    * partition bounds for the join relation.
     */
-   Assert(partition_bounds_equal(joinrel->part_scheme->partnatts,
-                                 joinrel->part_scheme->parttyplen,
-                                 joinrel->part_scheme->parttypbyval,
-                                 joinrel->boundinfo, rel1->boundinfo));
-   Assert(partition_bounds_equal(joinrel->part_scheme->partnatts,
-                                 joinrel->part_scheme->parttyplen,
-                                 joinrel->part_scheme->parttypbyval,
-                                 joinrel->boundinfo, rel2->boundinfo));
+   join_boundinfo = partition_bounds_merge(part_scheme->partnatts,
+                                           part_scheme->partsupfunc,
+                                           part_scheme->partcollation,
+                                           rel1, rel2,
+                                           parent_sjinfo->jointype,
+                                           &parts1, &parts2);
+
+   if (join_boundinfo == NULL)
+       return;

I.e., partition_bounds_merge() is performed for each pair of input
partitioned relations for a join relation in try_partitionwise_join().
Since partition_bounds_merge() would need a lot of CPU cycles, I don't
think this is acceptable; ISTM that some redesign is needed to avoid
this.  I'm wondering that once we successfully merged partition bounds
from a pair of input partitioned relations for the join relation, by
using the merged partition bounds, we could get the lists of matching
to-be-joined partitions for subsequent pairs of input partitioned
relations for the join relation in a more efficient way than by
performing partition_bounds_merge() as proposed in the patch.

Best regards,
Etsuro Fujita

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

amul sul

Date:

19 July 2019, 11:09:19

On Mon, Jul 8, 2019 at 5:03 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:

On Wed, Jul 3, 2019 at 3:44 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> On Tue, Jul 2, 2019 at 1:47 PM amul sul <sulamul@gmail.com> wrote:
> > Attached version is rebase atop of the latest master head(c74d49d41c), thanks.
>
> Thanks! Will review.

I started reviewing this. Here is my initial review comments:

* 0001-Hash-partition-bound-equality-refactoring-v22.patch

First of all, I agree with your view on hash partitioning:

"3. For hash partitioned tables however, we support partition-wise join
only when the bounds exactly match. For hash partitioning it's unusual
to have missing partitions and hence generic partition matching is not
required."

which is cited from the commit message for the main patch
"0002-Partition-wise-join-for-1-1-1-0-0-1-partition-matchi-v22.patch".
(I think it would be better if we can extend the partition matching to
the hash-partitioning case where there are missing partitions in
future, though.) However, I don't think it's a good idea to do this
refactoring, because that would lead to duplicating the code to check
whether two given hash bound collections are equal in
partition_bounds_equal() and partition_hash_bounds_merge() that will
be added by the main patch, after all. To avoid that, how about
calling partition_bounds_equal() from partition_hash_bounds_merge() in
the main patch, like the attached?

Agree, your changes look good to me, thanks for working on it.

Regards,

Amul

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Robert Haas

Date:

19 July 2019, 13:43:54

On Thu, Jul 18, 2019 at 2:55 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> I.e., partition_bounds_merge() is performed for each pair of input
> partitioned relations for a join relation in try_partitionwise_join().
> Since partition_bounds_merge() would need a lot of CPU cycles, I don't
> think this is acceptable; ISTM that some redesign is needed to avoid
> this.  I'm wondering that once we successfully merged partition bounds
> from a pair of input partitioned relations for the join relation, by
> using the merged partition bounds, we could get the lists of matching
> to-be-joined partitions for subsequent pairs of input partitioned
> relations for the join relation in a more efficient way than by
> performing partition_bounds_merge() as proposed in the patch.

I don't know whether partition_bounds_merge() is well-implemented; I
haven't looked. But in general I don't see an alternative to doing
some kind of merging on each pair of input relations. That's just how
planning works, and I don't see why it should need to be prohibitively
expensive.  I might be missing something, though; do you have an idea?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Etsuro Fujita

Date:

30 July 2019, 07:57:55

On Fri, Jul 19, 2019 at 8:09 PM amul sul <sulamul@gmail.com> wrote:
> On Mon, Jul 8, 2019 at 5:03 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
>> I started reviewing this.  Here is my initial review comments:
>>
>> * 0001-Hash-partition-bound-equality-refactoring-v22.patch

>> However, I don't think it's a good idea to do this
>> refactoring, because that would lead to duplicating the code to check
>> whether two given hash bound collections are equal in
>> partition_bounds_equal() and partition_hash_bounds_merge() that will
>> be added by the main patch, after all.  To avoid that, how about
>> calling partition_bounds_equal() from partition_hash_bounds_merge() in
>> the main patch, like the attached?

> Agree, your changes look good to me, thanks for working on it.

Cool!  Thanks for reviewing!

Sorry for the delay.  I was busy with something else recently.

Best regards,
Etsuro Fujita

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Etsuro Fujita

Date:

30 July 2019, 09:00:33

On Fri, Jul 19, 2019 at 10:44 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jul 18, 2019 at 2:55 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> > I.e., partition_bounds_merge() is performed for each pair of input
> > partitioned relations for a join relation in try_partitionwise_join().
> > Since partition_bounds_merge() would need a lot of CPU cycles, I don't
> > think this is acceptable; ISTM that some redesign is needed to avoid
> > this.  I'm wondering that once we successfully merged partition bounds
> > from a pair of input partitioned relations for the join relation, by
> > using the merged partition bounds, we could get the lists of matching
> > to-be-joined partitions for subsequent pairs of input partitioned
> > relations for the join relation in a more efficient way than by
> > performing partition_bounds_merge() as proposed in the patch.
>
> I don't know whether partition_bounds_merge() is well-implemented; I
> haven't looked.

My concern about that is list partitioning.  In that case that
function calls partition_list_bounds_merge(), which generates the
partition bounds for a join relation between two given input
relations, by performing merge join for a pair of the datums arrays
from both the input relations.  Since the datums arrays contain all
non-null list values across all partitions, if the numbers of the list
values (ie, ndatums') are large, the merge join would require not a
few cycles, so it would be much expensive to perform the merge join
for each such pair when considering large N-way partitionwise joins of
list-partitioned tables with large ndatums.  To see that, I did simple
tests using a list-partitioned table pt created with the attached,
which has 10 partitions, each with 1000 list values, so ndatums is
10000.  (The tests below are performed with
enable_partitionwise_join=on.)

* 2-way self-join of pt: explain analyze select * from pt t0, pt t1
where t0.a = t1.a;
 - HEAD:
 Planning Time: 1.731 ms
 Execution Time: 15.159 ms
 - Patched:
 Planning Time: 1.884 ms
 Execution Time: 15.127 ms

* 4-way self-join of pt: explain analyze select * from pt t0, pt t1,
pt t2, pt t3 where t0.a = t1.a and t1.a = t2.a and t2.a = t3.a;
 - HEAD:
 Planning Time: 28.787 ms
 Execution Time: 34.313 ms
 - Patched:
 Planning Time: 40.263 ms
 Execution Time: 35.019 ms

* 8-way self-join of pt: explain analyze select * from pt t0, pt t1,
pt t2, pt t3, pt t4, pt t5, pt t6, pt t7 where t0.a = t1.a and t1.a =
t2.a and t2.a = t3.a and t3.a = t4.a and t4.a = t5.a and t5.a = t6.a
and t6.a = t7.a;
 - HEAD:
 Planning Time: 2279.653 ms
 Execution Time: 63.303 ms
 - Patched:
 Planning Time: 3834.751 ms
 Execution Time: 62.949 ms

Actually, these joins would not need the partition-matching algorithm
the patch adds; we could probably avoid this regression by modifying
the patch to plan these joins the same way as before, but ISTM that
these results imply that the cost of performing the merge join for
each such pair would not be negligible when considering large N-way
partitionwise joins mentioned above.  Maybe I'm missing something,
though.

> But in general I don't see an alternative to doing
> some kind of merging on each pair of input relations. That's just how
> planning works, and I don't see why it should need to be prohibitively
> expensive.  I might be missing something, though; do you have an idea?

Yes, I do; but I think I should think a little more about that.

Sorry for the delay.

Best regards,
Etsuro Fujita

Attachment

list_parted2.sql

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Amit Langote

Date:

31 July 2019, 05:47:41

On Tue, Jul 30, 2019 at 6:00 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> On Fri, Jul 19, 2019 at 10:44 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > On Thu, Jul 18, 2019 at 2:55 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> > > I.e., partition_bounds_merge() is performed for each pair of input
> > > partitioned relations for a join relation in try_partitionwise_join().
> > > Since partition_bounds_merge() would need a lot of CPU cycles, I don't
> > > think this is acceptable; ISTM that some redesign is needed to avoid
> > > this.  I'm wondering that once we successfully merged partition bounds
> > > from a pair of input partitioned relations for the join relation, by
> > > using the merged partition bounds, we could get the lists of matching
> > > to-be-joined partitions for subsequent pairs of input partitioned
> > > relations for the join relation in a more efficient way than by
> > > performing partition_bounds_merge() as proposed in the patch.
> >
> > I don't know whether partition_bounds_merge() is well-implemented; I
> > haven't looked.
>
> My concern about that is list partitioning.  In that case that
> function calls partition_list_bounds_merge(), which generates the
> partition bounds for a join relation between two given input
> relations, by performing merge join for a pair of the datums arrays
> from both the input relations.

I had similar thoughts upon seeing that partition_bounds_merge() will
be replacing the current way of determining if partition-wise join can
occur; that it will make the handling of currently supported cases
more expensive.

The current way is to compare the PartitionBoundInfos of joining
relations using partition_bounds_equal(), and if equal, simply join
the pairs of matching partitions if the join quals permit doing so.
There's no need to do anything extra to determine which partitions to
join with each other, because it's already established.  Likewise,
partition_bounds_merge() shouldn't to have to anything extra in that
case.  That is, for the cases that are already supported, we should
find a way to make partition_bounds_merge() only as expensive as just
performing partition_bounds_equals(), or maybe just slightly more.

Thanks,
Amit

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Etsuro Fujita

Date:

01 August 2019, 10:36:05

Amit-san,

On Wed, Jul 31, 2019 at 2:47 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Tue, Jul 30, 2019 at 6:00 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> > On Fri, Jul 19, 2019 at 10:44 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > > I don't know whether partition_bounds_merge() is well-implemented; I
> > > haven't looked.
> >
> > My concern about that is list partitioning.  In that case that
> > function calls partition_list_bounds_merge(), which generates the
> > partition bounds for a join relation between two given input
> > relations, by performing merge join for a pair of the datums arrays
> > from both the input relations.
>
> I had similar thoughts upon seeing that partition_bounds_merge() will
> be replacing the current way of determining if partition-wise join can
> occur; that it will make the handling of currently supported cases
> more expensive.
>
> The current way is to compare the PartitionBoundInfos of joining
> relations using partition_bounds_equal(), and if equal, simply join
> the pairs of matching partitions if the join quals permit doing so.
> There's no need to do anything extra to determine which partitions to
> join with each other, because it's already established.  Likewise,
> partition_bounds_merge() shouldn't to have to anything extra in that
> case.  That is, for the cases that are already supported, we should
> find a way to make partition_bounds_merge() only as expensive as just
> performing partition_bounds_equals(), or maybe just slightly more.

I 100% agree on that point.

One thing that was unexpected to me is this:

I wrote:
> To see that, I did simple
> tests using a list-partitioned table pt created with the attached,
> which has 10 partitions, each with 1000 list values, so ndatums is
> 10000.  (The tests below are performed with
> enable_partitionwise_join=on.)
>
> * 2-way self-join of pt: explain analyze select * from pt t0, pt t1
> where t0.a = t1.a;
>  - HEAD:
>  Planning Time: 1.731 ms
>  Execution Time: 15.159 ms
>  - Patched:
>  Planning Time: 1.884 ms
>  Execution Time: 15.127 ms

IIUC, in this test, I think partition_bounds_equals() and
partition_bounds_merge() have been performed only once in HEAD and the
patched version respectively to plan the partitionwise join, so this
might imply that the cost of the latter is just slightly more
expensive than that of the former.  I'm missing something, though.

Anyway I'll continue to review this patch, so I'll move this to the
next CF with the same status (Needs review).

Best regards,
Etsuro Fujita

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Etsuro Fujita

Date:

16 August 2019, 13:25:10

On Tue, Jul 30, 2019 at 6:00 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> On Fri, Jul 19, 2019 at 10:44 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > On Thu, Jul 18, 2019 at 2:55 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> > > I.e., partition_bounds_merge() is performed for each pair of input
> > > partitioned relations for a join relation in try_partitionwise_join().
> > > Since partition_bounds_merge() would need a lot of CPU cycles, I don't
> > > think this is acceptable; ISTM that some redesign is needed to avoid
> > > this.  I'm wondering that once we successfully merged partition bounds
> > > from a pair of input partitioned relations for the join relation, by
> > > using the merged partition bounds, we could get the lists of matching
> > > to-be-joined partitions for subsequent pairs of input partitioned
> > > relations for the join relation in a more efficient way than by
> > > performing partition_bounds_merge() as proposed in the patch.
> >
> > I don't know whether partition_bounds_merge() is well-implemented; I
> > haven't looked.
>
> My concern about that is list partitioning.  In that case that
> function calls partition_list_bounds_merge(), which generates the
> partition bounds for a join relation between two given input
> relations, by performing merge join for a pair of the datums arrays
> from both the input relations.  Since the datums arrays contain all
> non-null list values across all partitions, if the numbers of the list
> values (ie, ndatums') are large, the merge join would require not a
> few cycles, so it would be much expensive to perform the merge join
> for each such pair when considering large N-way partitionwise joins of
> list-partitioned tables with large ndatums.  To see that, I did simple
> tests using a list-partitioned table pt created with the attached,
> which has 10 partitions, each with 1000 list values, so ndatums is
> 10000.  (The tests below are performed with
> enable_partitionwise_join=on.)
>
> * 2-way self-join of pt: explain analyze select * from pt t0, pt t1
> where t0.a = t1.a;
>  - HEAD:
>  Planning Time: 1.731 ms
>  Execution Time: 15.159 ms
>  - Patched:
>  Planning Time: 1.884 ms
>  Execution Time: 15.127 ms
>
> * 4-way self-join of pt: explain analyze select * from pt t0, pt t1,
> pt t2, pt t3 where t0.a = t1.a and t1.a = t2.a and t2.a = t3.a;
>  - HEAD:
>  Planning Time: 28.787 ms
>  Execution Time: 34.313 ms
>  - Patched:
>  Planning Time: 40.263 ms
>  Execution Time: 35.019 ms
>
> * 8-way self-join of pt: explain analyze select * from pt t0, pt t1,
> pt t2, pt t3, pt t4, pt t5, pt t6, pt t7 where t0.a = t1.a and t1.a =
> t2.a and t2.a = t3.a and t3.a = t4.a and t4.a = t5.a and t5.a = t6.a
> and t6.a = t7.a;
>  - HEAD:
>  Planning Time: 2279.653 ms
>  Execution Time: 63.303 ms
>  - Patched:
>  Planning Time: 3834.751 ms
>  Execution Time: 62.949 ms
>
> Actually, these joins would not need the partition-matching algorithm
> the patch adds; we could probably avoid this regression by modifying
> the patch to plan these joins the same way as before, but ISTM that
> these results imply that the cost of performing the merge join for
> each such pair would not be negligible when considering large N-way
> partitionwise joins mentioned above.  Maybe I'm missing something,
> though.
>
> > But in general I don't see an alternative to doing
> > some kind of merging on each pair of input relations. That's just how
> > planning works, and I don't see why it should need to be prohibitively
> > expensive.  I might be missing something, though; do you have an idea?
>
> Yes, I do; but I think I should think a little more about that.

I gave more thought to this.  My idea is to generate the list of
matching partitions to be joined from the partbound info after
creating it with partition_bounds_merge(), as I stated before.  Let me
explain using an example.  Suppose that R, S and T are partitioned
tables, that R=R(a,b,c) is partitioned on ranges of a into three
partitions R1, R2 and R3, that S=S(a,b,c) is partitioned on ranges of
a into three partitions S1, S2 and S3, and that T=T(a,b,c) is
partitioned on ranges of a into three partitions T1, T2 and T3.
Consider a 3-way join query: SELECT * FROM R, S, T WHERE R.a=S.a AND
S.a=T.a;  Suppose that when considering 2-way join R IJ S,
partition_bounds_merge() successfully merges the partition bounds for
R and S, and generates join pairs (R1, S1), (R2, S2) and (R3, S3), and
that when considering 3-way join (R IJ S) IJ T, that function
successfully merges the partition bounds for (R IJ S) and T, and
generates join pairs ((R1 IJ S1), T1) and ((R2 IJ S2), T2).  The
question here is: do we really need to perform
partition_bounds_merge() to generate the list of matching partitions
to be joined for 3-way join R IJ (S IJ T), for example?  I don't think
so; because 1) we see from the 3-way join pairs ((R1 IJ S1), T1) and
((R2 IJ S2), T2) that Ri, Si and Ti (i=1,2) have boundaries
overlapping with each other, which means that there would be (S1, T1)
and (S2, T2) as 2-way join pairs for S IJ T, and 2) we have R IJ (S IJ
T) = (R IJ S) IJ T = (R1 IJ S1) IJ T1 union (R2 IJ S2) IJ T2 = R1 IJ
(S1 IJ T1) union R2 IJ (S2 IJ T2), which means that the list of
matching partitions to be joined for R IJ (S IJ T) in question are
(R1, (S1 IJ T1)) and (R2, (S2 IJ T2)) since we see from the equation
that pairs from R and (S IJ T) other than that would not contribute to
the join result.  Does that make sense?  Attached is a WIP patch
implementing this created on top of the patch [1].

It seems that I performed the above tests on an assertion-enabled
build.  :(  So I executed the tests one more time.  Here are the
results.

* 2-way self-join of pt: explain analyze select * from pt t0, pt t1
where t0.a = t1.a;
 - HEAD:
 Planning Time: 0.969 ms
 Execution Time: 13.843 ms
 - with patch:
 Planning Time: 1.720 ms
 Execution Time: 14.393 ms
 - with patch plus attached:
 Planning Time: 1.630 ms
 Execution Time: 14.002 ms

* 4-way self-join of pt: explain analyze select * from pt t0, pt t1,
pt t2, pt t3 where t0.a = t1.a

and t1.a = t2.a and t2.a = t3.a;
 - HEAD:
 Planning Time: 12.203 ms
 Execution Time: 31.784 ms
 - with patch:
 Planning Time: 32.102 ms
 Execution Time: 32.504 ms
 - with patch plus attached:
 Planning Time: 19.471 ms
 Execution Time: 32.582 ms

* 8-way self-join of pt: explain analyze select * from pt t0, pt t1,
pt t2, pt t3, pt t4, pt t5, pt t6, pt t7 where t0.a = t1.a and t1.a =
t2.a and t2.a = t3.a and t3.a = t4.a and t4.a = t5.a and t5.a = t6.a
and t6.a = t7.a;
 - HEAD:
 Planning Time: 948.131 ms
 Execution Time: 55.645 ms
 - with patch:
 Planning Time: 2939.813 ms
 Execution Time: 56.760 ms
 - with patch plus attached:
 Planning Time: 1108.076 ms
 Execution Time: 55.750 ms

Note: the attached patch still uses the proposed partition matching
algorithm for these queries.  As I said before, these queries don't
need that algorithm, so we could eliminate the planning overhead
compared to HEAD, by planning these queries as before, perhaps, but I
haven't modified the patch as such yet.

Best regards,
Etsuro Fujita

[1] https://www.postgresql.org/message-id/CAPmGK177W%2BjNgpM5f_m-vdDKbEBu_%3D3CyPzFjkT_1nzf1Vqn%2BA%40mail.gmail.com

Attachment

Modify-partition-matching-algorithm-WIP.patch

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Etsuro Fujita

Date:

28 August 2019, 10:22:04

On Fri, Aug 16, 2019 at 10:25 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> It seems that I performed the above tests on an assertion-enabled
> build.  :(  So I executed the tests one more time.  Here are the
> results.
>
> * 2-way self-join of pt: explain analyze select * from pt t0, pt t1
> where t0.a = t1.a;
>  - HEAD:
>  Planning Time: 0.969 ms
>  Execution Time: 13.843 ms
>  - with patch:
>  Planning Time: 1.720 ms
>  Execution Time: 14.393 ms
>  - with patch plus attached:
>  Planning Time: 1.630 ms
>  Execution Time: 14.002 ms
>
> * 4-way self-join of pt: explain analyze select * from pt t0, pt t1,
> pt t2, pt t3 where t0.a = t1.a
>
> and t1.a = t2.a and t2.a = t3.a;
>  - HEAD:
>  Planning Time: 12.203 ms
>  Execution Time: 31.784 ms
>  - with patch:
>  Planning Time: 32.102 ms
>  Execution Time: 32.504 ms
>  - with patch plus attached:
>  Planning Time: 19.471 ms
>  Execution Time: 32.582 ms
>
> * 8-way self-join of pt: explain analyze select * from pt t0, pt t1,
> pt t2, pt t3, pt t4, pt t5, pt t6, pt t7 where t0.a = t1.a and t1.a =
> t2.a and t2.a = t3.a and t3.a = t4.a and t4.a = t5.a and t5.a = t6.a
> and t6.a = t7.a;
>  - HEAD:
>  Planning Time: 948.131 ms
>  Execution Time: 55.645 ms
>  - with patch:
>  Planning Time: 2939.813 ms
>  Execution Time: 56.760 ms
>  - with patch plus attached:
>  Planning Time: 1108.076 ms
>  Execution Time: 55.750 ms
>
> Note: the attached patch still uses the proposed partition matching
> algorithm for these queries.  As I said before, these queries don't
> need that algorithm, so we could eliminate the planning overhead
> compared to HEAD, by planning these queries as before, perhaps, but I
> haven't modified the patch as such yet.

I modified the patch further as such.  Attached is an updated version
of the patch created on top of the patch in [1].  I did the tests
again using the updated version of the patch.  Here are the results:

* 2-way self-join of pt:
 Planning Time: 1.043 ms
 Execution Time: 13.931 ms

* 4-way self-join of pt:
 Planning Time: 12.499 ms
 Execution Time: 32.392 ms

* 8-way self-join of pt:
 Planning Time: 968.412 ms
 Execution Time: 56.328 ms

The planning time for each test case still increased slightly, but IMO
I think that would be acceptable.  To see the efficiency of the
attached, I did another testing with test cases that really need the
new partition-matching algorithm:

* explain analyze select * from pt6 t6, pt7 t7 where t6.a = t7.a;
 - base patch in [1]
 Planning Time: 1.758 ms
 Execution Time: 13.977 ms
 - with attached
 Planning Time: 1.777 ms
 Execution Time: 13.959 ms

* explain analyze select * from pt4 t4, pt5 t5, pt6 t6, pt7 t7 where
t4.a = t5.a and t5.a = t6.a and t6.a = t7.a;
 - base patch in [1]
 Planning Time: 33.201 ms
 Execution Time: 32.480 ms
 - with attached
 Planning Time: 21.019 ms
 Execution Time: 32.777 ms

* explain analyze select * from pt0 t0, pt1 t1, pt2 t2, pt3 t3, pt4
t4, pt5 t5, pt6 t6, pt7 t7 where t0.a = t1.a and t1.a = t2.a and t2.a
= t3.a and t3.a = t4.a and t4.a = t5.a and t5.a = t6.a and t6.a =
t7.a;
 - base patch in [1]
 Planning Time: 3060.000 ms
 Execution Time: 55.553 ms
 - with attached
 Planning Time: 1144.996 ms
 Execution Time: 56.233 ms

where pt0, pt1, pt2, pt3, pt4, pt5, pt6 and pt7 are list partitioned
tables that have slighly different list values.  (The structure and
list values of ptN are almost the same as that of pt used above, but
ptN's N-th partition ptNpN has an extra list value that pt's N-th
partition ptpN doesn't have.)  If anyone is interested in this
testing, I'll send a script file for producing these list partitioned
tables.

About the attached:

* The attached patch modified try_partitionwise_join() so that we call
partition_bounds_equal() then partition_bounds_merge() (if necessary)
to create the partition bounds for the join rel.  We don't support for
merging the partition bounds for the hash-partitioning case, so this
makes code to handle the hash-partitioning case in
partition_bounds_merge() completely unnecessary, so I removed that
entirely.

* I removed this assertion in partition_bounds_merge(), because I
think this is covered by two assertions above this.

+   Assert((*outer_parts == NIL || *inner_parts != NIL) &&
+          (*inner_parts == NIL || *outer_parts != NIL));

* (I forgot to mention this in a previous email, but) I removed this
bit of generate_matching_part_pairs(), because we already have the
same check in try_partitionwise_join(), so this bit would be redundant
IIUC.

+       switch (jointype)
+       {
+           case JOIN_INNER:
+           case JOIN_SEMI:
+
+               /*
+                * An inner or semi join can not return any row when the
+                * matching partition on either side is missing. We should
+                * have eliminated all such cases while merging the bounds.
+                */
+               Assert(part1 >= 0 && part2 >= 0);
+               break;
+
+           case JOIN_LEFT:
+           case JOIN_ANTI:
+               Assert(part1 >= 0);
+               if (part2 < 0)
+                   merged = false;
+               break;
+
+           case JOIN_FULL:
+               if (part1 < 0 || part2 < 0)
+                   merged = false;
+               break;
+
+           default:
+               elog(ERROR, "unrecognized join type: %d", (int) jointype);
+       }

* I added more comments.

If there are no objections, I'll merge the attached with the base patch in [1].

Best regards,
Etsuro Fujita

[1] https://www.postgresql.org/message-id/CAPmGK177W%2BjNgpM5f_m-vdDKbEBu_%3D3CyPzFjkT_1nzf1Vqn%2BA%40mail.gmail.com

Attachment

Modify-partition-matching-algorithm-1.patch

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

amul sul

Date:

28 August 2019, 10:26:49

Thank you Fujita San for the enhancement, will have a look.

Regards,

Amul

On Wed, Aug 28, 2019 at 3:52 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:

On Fri, Aug 16, 2019 at 10:25 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> It seems that I performed the above tests on an assertion-enabled
> build. :( So I executed the tests one more time. Here are the
> results.
>
> * 2-way self-join of pt: explain analyze select * from pt t0, pt t1
> where t0.a = t1.a;
> - HEAD:
> Planning Time: 0.969 ms
> Execution Time: 13.843 ms
> - with patch:
> Planning Time: 1.720 ms
> Execution Time: 14.393 ms
> - with patch plus attached:
> Planning Time: 1.630 ms
> Execution Time: 14.002 ms
>
> * 4-way self-join of pt: explain analyze select * from pt t0, pt t1,
> pt t2, pt t3 where t0.a = t1.a
>
> and t1.a = t2.a and t2.a = t3.a;
> - HEAD:
> Planning Time: 12.203 ms
> Execution Time: 31.784 ms
> - with patch:
> Planning Time: 32.102 ms
> Execution Time: 32.504 ms
> - with patch plus attached:
> Planning Time: 19.471 ms
> Execution Time: 32.582 ms
>
> * 8-way self-join of pt: explain analyze select * from pt t0, pt t1,
> pt t2, pt t3, pt t4, pt t5, pt t6, pt t7 where t0.a = t1.a and t1.a =
> t2.a and t2.a = t3.a and t3.a = t4.a and t4.a = t5.a and t5.a = t6.a
> and t6.a = t7.a;
> - HEAD:
> Planning Time: 948.131 ms
> Execution Time: 55.645 ms
> - with patch:
> Planning Time: 2939.813 ms
> Execution Time: 56.760 ms
> - with patch plus attached:
> Planning Time: 1108.076 ms
> Execution Time: 55.750 ms
>
> Note: the attached patch still uses the proposed partition matching
> algorithm for these queries. As I said before, these queries don't
> need that algorithm, so we could eliminate the planning overhead
> compared to HEAD, by planning these queries as before, perhaps, but I
> haven't modified the patch as such yet.

I modified the patch further as such. Attached is an updated version
of the patch created on top of the patch in [1]. I did the tests
again using the updated version of the patch. Here are the results:

* 2-way self-join of pt:
Planning Time: 1.043 ms
Execution Time: 13.931 ms

* 4-way self-join of pt:
Planning Time: 12.499 ms
Execution Time: 32.392 ms

* 8-way self-join of pt:
Planning Time: 968.412 ms
Execution Time: 56.328 ms

The planning time for each test case still increased slightly, but IMO
I think that would be acceptable. To see the efficiency of the
attached, I did another testing with test cases that really need the
new partition-matching algorithm:

* explain analyze select * from pt6 t6, pt7 t7 where t6.a = t7.a;
- base patch in [1]
Planning Time: 1.758 ms
Execution Time: 13.977 ms
- with attached
Planning Time: 1.777 ms
Execution Time: 13.959 ms

* explain analyze select * from pt4 t4, pt5 t5, pt6 t6, pt7 t7 where
t4.a = t5.a and t5.a = t6.a and t6.a = t7.a;
- base patch in [1]
Planning Time: 33.201 ms
Execution Time: 32.480 ms
- with attached
Planning Time: 21.019 ms
Execution Time: 32.777 ms

* explain analyze select * from pt0 t0, pt1 t1, pt2 t2, pt3 t3, pt4
t4, pt5 t5, pt6 t6, pt7 t7 where t0.a = t1.a and t1.a = t2.a and t2.a
= t3.a and t3.a = t4.a and t4.a = t5.a and t5.a = t6.a and t6.a =
t7.a;
- base patch in [1]
Planning Time: 3060.000 ms
Execution Time: 55.553 ms
- with attached
Planning Time: 1144.996 ms
Execution Time: 56.233 ms

where pt0, pt1, pt2, pt3, pt4, pt5, pt6 and pt7 are list partitioned
tables that have slighly different list values. (The structure and
list values of ptN are almost the same as that of pt used above, but
ptN's N-th partition ptNpN has an extra list value that pt's N-th
partition ptpN doesn't have.) If anyone is interested in this
testing, I'll send a script file for producing these list partitioned
tables.

About the attached:

* The attached patch modified try_partitionwise_join() so that we call
partition_bounds_equal() then partition_bounds_merge() (if necessary)
to create the partition bounds for the join rel. We don't support for
merging the partition bounds for the hash-partitioning case, so this
makes code to handle the hash-partitioning case in
partition_bounds_merge() completely unnecessary, so I removed that
entirely.

* I removed this assertion in partition_bounds_merge(), because I
think this is covered by two assertions above this.

+ Assert((*outer_parts == NIL || *inner_parts != NIL) &&
+ (*inner_parts == NIL || *outer_parts != NIL));

* (I forgot to mention this in a previous email, but) I removed this
bit of generate_matching_part_pairs(), because we already have the
same check in try_partitionwise_join(), so this bit would be redundant
IIUC.

+ switch (jointype)
+ {
+ case JOIN_INNER:
+ case JOIN_SEMI:
+
+ /*
+ * An inner or semi join can not return any row when the
+ * matching partition on either side is missing. We should
+ * have eliminated all such cases while merging the bounds.
+ */
+ Assert(part1 >= 0 && part2 >= 0);
+ break;
+
+ case JOIN_LEFT:
+ case JOIN_ANTI:
+ Assert(part1 >= 0);
+ if (part2 < 0)
+ merged = false;
+ break;
+
+ case JOIN_FULL:
+ if (part1 < 0 || part2 < 0)
+ merged = false;
+ break;
+
+ default:
+ elog(ERROR, "unrecognized join type: %d", (int) jointype);
+ }

* I added more comments.

If there are no objections, I'll merge the attached with the base patch in [1].

Best regards,
Etsuro Fujita

[1] https://www.postgresql.org/message-id/CAPmGK177W%2BjNgpM5f_m-vdDKbEBu_%3D3CyPzFjkT_1nzf1Vqn%2BA%40mail.gmail.com

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

amul sul

Date:

02 September 2019, 05:08:02

Hi Fujita San,

Please find my comments inline below:

On Wed, Aug 28, 2019 at 3:52 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:

On Fri, Aug 16, 2019 at 10:25 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
>

[... skipped ..]

About the attached:

* The attached patch modified try_partitionwise_join() so that we call
partition_bounds_equal() then partition_bounds_merge() (if necessary)
to create the partition bounds for the join rel. We don't support for
merging the partition bounds for the hash-partitioning case, so this
makes code to handle the hash-partitioning case in
partition_bounds_merge() completely unnecessary, so I removed that
entirely.

Yes, that make sense.

On thinking further, a thought hits to me why we can't join two hash partitioned
table which has the same modulus and partition key specification, but some of
the partitions are missing from either partitioned table.

For e.g. here is a smaller case where foo has two partitions and bar has only one.

CREATE TABLE foo(a int) PARTITION BY HASH(a);
CREATE TABLE foo_p0 PARTITION OF foo FOR VALUES WITH(modulus 2, remainder 0);
CREATE TABLE foo_p1 PARTITION OF foo FOR VALUES WITH(modulus 2, remainder 1);

CREATE TABLE bar(a int) PARTITION BY HASH(a); <-- missing partitions for REMAINDER 1
CREATE TABLE bar_p0 PARTITION OF bar FOR VALUES WITH(modulus 2, remainder 0);

Explain:

postgres=# explain select * from foo p1, bar p2 where p1.a = p2.a;
QUERY PLAN
---------------------------------------------------------------------------------
Merge Join (cost=590.35..1578.47 rows=65025 width=8)
Merge Cond: (p2.a = p1.a)
-> Sort (cost=179.78..186.16 rows=2550 width=4)
Sort Key: p2.a
-> Seq Scan on bar_p0 p2 (cost=0.00..35.50 rows=2550 width=4)
-> Sort (cost=410.57..423.32 rows=5100 width=4)
Sort Key: p1.a
-> Append (cost=0.00..96.50 rows=5100 width=4)
-> Seq Scan on foo_p0 p1 (cost=0.00..35.50 rows=2550 width=4)
-> Seq Scan on foo_p1 p1_1 (cost=0.00..35.50 rows=2550 width=4)
(10 rows)

The partitions-wise join will be performed only if we fill the partition hole that

exists for the joining table i.e. adding partitions to bar table.

postgres=# CREATE TABLE bar_p1 PARTITION OF bar FOR VALUES WITH(modulus 2, remainder 1);
CREATE TABLE
postgres=# explain select * from foo p1, bar p2 where p1.a = p2.a;
QUERY PLAN
---------------------------------------------------------------------------------
Append (cost=359.57..2045.11 rows=65024 width=8)
-> Merge Join (cost=359.57..860.00 rows=32512 width=8)
Merge Cond: (p1.a = p2.a)
-> Sort (cost=179.78..186.16 rows=2550 width=4)
Sort Key: p1.a
-> Seq Scan on foo_p0 p1 (cost=0.00..35.50 rows=2550 width=4)
-> Sort (cost=179.78..186.16 rows=2550 width=4)
Sort Key: p2.a
-> Seq Scan on bar_p0 p2 (cost=0.00..35.50 rows=2550 width=4)
-> Merge Join (cost=359.57..860.00 rows=32512 width=8)
Merge Cond: (p1_1.a = p2_1.a)
-> Sort (cost=179.78..186.16 rows=2550 width=4)
Sort Key: p1_1.a
-> Seq Scan on foo_p1 p1_1 (cost=0.00..35.50 rows=2550 width=4)
-> Sort (cost=179.78..186.16 rows=2550 width=4)
Sort Key: p2_1.a
-> Seq Scan on bar_p1 p2_1 (cost=0.00..35.50 rows=2550 width=4)
(17 rows)

It would have been nice if we could support this case, as we do allow partitions

hole for the other partition scheme, but there wouldn't be much objection if
we don't want to add this support for now since there will be a lesser chance
that hash partitioned table has the hole, IMO.

* I removed this assertion in partition_bounds_merge(), because I
think this is covered by two assertions above this.

+ Assert((*outer_parts == NIL || *inner_parts != NIL) &&
+ (*inner_parts == NIL || *outer_parts != NIL));

* (I forgot to mention this in a previous email, but) I removed this
bit of generate_matching_part_pairs(), because we already have the
same check in try_partitionwise_join(), so this bit would be redundant
IIUC.

Looks good.

* I added more comments.

Thanks.

If there are no objections, I'll merge the attached with the base patch in [1].

The proposed enhancement in the patch is too good and the patch is pretty much
reasonable to merge into the main patch.

Here are the few cosmetic fixes for this path I think is needed. Feel free to
ignore if some of them do not make sense or too obvious.

Note: left side number represents code line number of the patch.

118 + }
119 +
120 + /*
121 + * Try to create the partition bounds along with join pairs.
122 + */
123 + if (boundinfo == NULL)
124 + {

Can we add this block as else part of previous if condition checking equal partitions bound?

133 + Assert(list_length(parts1) == list_length(parts2));
134 + if (boundinfo == NULL)
135 + {
136 + joinrel->nparts = 0;
137 + return;
138 + }
139 + nparts = list_length(parts1);

Can we move the assert check below at line#139 in the patch i.e. after if block.

And the question is do we need to do that assert check since partition_bounds_merge()

does that just before returning, thoughts?

204 + RelOptInfo *child_rel1 = merged ? (RelOptInfo *) lfirst(lcr1) : rel1->part_rels[cnt_parts];
205 + RelOptInfo *child_rel2 = merged ? (RelOptInfo *) lfirst(lcr2) : rel2->part_rels[cnt_parts];

How about using lfirst_node instead of lfirst & casting explicitly?

Also, these lines crossing 80 column length which I think we need to fix. How about
doing the assignment as follow, just after the variable declaration part:

if (merged)
{
child_rel1 = lfirst_node(lcr1, RelOptInfo);
child_rel2 = lfirst_node(lcr2, RelOptInfo);
lcr1 = lnext(parts1, lcr1);
lcr2 = lnext(parts2, lcr2);
}
else
{
child_rel1 = rel1->part_rels[cnt_parts];
child_rel2 = rel2->part_rels[cnt_parts]
}

rel1_empty = (child_rel1 == NULL || IS_DUMMY_REL(child_rel1));
rel2_empty = (child_rel2 == NULL || IS_DUMMY_REL(child_rel2));

266 + * get_matching_part_pairs
267 + * Generate join pairs of partitions for the two inputs for the join rel

Can we rewrite this description as " Generate join pairs of partitions for the
join rel from the two inputs." OR "Generate join pairs of partitions for the
two inputs"

310 + Assert(bms_num_members(child_relids1) == bms_num_members(rel1->relids));
311 + /*

335 + Assert(bms_num_members(child_relids2) == bms_num_members(rel2->relids));
336 + /*

Need newline after assert statements.

Regards,

Amul

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Alvaro Herrera

Date:

03 September 2019, 21:10:50

Fujita-san, amul,

CFbot complains that Fujita-san submitted a patch that doesn't apply,
which makes sense since the necessary previous patch was only referred
to without being resubmitted.  I suggest to always post all patches
together with each resubmission so that it can be checked automatically
by the cf bot: http://commitfest.cputube.org/etsuro-fujita.html

I'm not clear on who the author of this patch is, now.  Also, I'm not
sure what the *status* is.  Are we waiting for Fujita-san to review this
patch?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

amul sul

Date:

05 September 2019, 04:23:28

On Wed, Sep 4, 2019 at 2:40 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

Fujita-san, amul,

CFbot complains that Fujita-san submitted a patch that doesn't apply,
which makes sense since the necessary previous patch was only referred
to without being resubmitted. I suggest to always post all patches
together with each resubmission so that it can be checked automatically
by the cf bot: http://commitfest.cputube.org/etsuro-fujita.html

Understood and sorry about that.

I'm not clear on who the author of this patch is, now. Also, I'm not
sure what the *status* is. Are we waiting for Fujita-san to review this
patch?

Yes, we are waiting for Fujita-san to review. Fujita-san has started a review

and proposed some enhancement which I reviewed in the last update.

I think soon Fujita-san might post the complete patch including his changes.

Regards,

Amul

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Etsuro Fujita

Date:

05 September 2019, 07:10:44

Hi,

On Thu, Sep 5, 2019 at 1:24 PM amul sul <sulamul@gmail.com> wrote:
> On Wed, Sep 4, 2019 at 2:40 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>> CFbot complains that Fujita-san submitted a patch that doesn't apply,
>> which makes sense since the necessary previous patch was only referred
>> to without being resubmitted.  I suggest to always post all patches
>> together with each resubmission so that it can be checked automatically
>> by the cf bot: http://commitfest.cputube.org/etsuro-fujita.html

> Understood and sorry about that.

Sorry about that.

>> I'm not clear on who the author of this patch is, now.  Also, I'm not
>> sure what the *status* is.  Are we waiting for Fujita-san to review this
>> patch?

> Yes, we are waiting for Fujita-san to review.  Fujita-san has started a review
> and proposed some enhancement which I reviewed in the last update.
>
> I think soon Fujita-san might post the complete patch including his changes.

I'm a bit busy with something else recently, but I'll do that ASAP.
And I'll continue to review the other part of the patch.

Best regards,
Etsuro Fujita

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Etsuro Fujita

Date:

24 September 2019, 15:59:57

Hi Amul,

On Mon, Sep 2, 2019 at 2:08 PM amul sul <sulamul@gmail.com> wrote:
> Please find my comments inline below:

Thank you for the review!

> On Wed, Aug 28, 2019 at 3:52 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
>> On Fri, Aug 16, 2019 at 10:25 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:

> On thinking further, a thought hits to me why we can't join two hash partitioned
> table which has the same modulus and partition key specification, but some of
> the partitions are missing from either partitioned table.
>
> For e.g. here is a smaller case where foo has two partitions and bar has only one.
>
> CREATE TABLE foo(a int) PARTITION BY HASH(a);
> CREATE TABLE foo_p0 PARTITION OF foo FOR VALUES WITH(modulus 2, remainder 0);
> CREATE TABLE foo_p1 PARTITION OF foo FOR VALUES WITH(modulus 2, remainder 1);
>
> CREATE TABLE bar(a int) PARTITION BY HASH(a);  <-- missing partitions for REMAINDER 1
> CREATE TABLE bar_p0 PARTITION OF bar FOR VALUES WITH(modulus 2, remainder 0);
>
> Explain:
> postgres=# explain select * from foo p1, bar p2 where p1.a = p2.a;
>                                    QUERY PLAN
> ---------------------------------------------------------------------------------
>  Merge Join  (cost=590.35..1578.47 rows=65025 width=8)
>    Merge Cond: (p2.a = p1.a)
>    ->  Sort  (cost=179.78..186.16 rows=2550 width=4)
>          Sort Key: p2.a
>          ->  Seq Scan on bar_p0 p2  (cost=0.00..35.50 rows=2550 width=4)
>    ->  Sort  (cost=410.57..423.32 rows=5100 width=4)
>          Sort Key: p1.a
>          ->  Append  (cost=0.00..96.50 rows=5100 width=4)
>                ->  Seq Scan on foo_p0 p1  (cost=0.00..35.50 rows=2550 width=4)
>                ->  Seq Scan on foo_p1 p1_1  (cost=0.00..35.50 rows=2550 width=4)
> (10 rows)
>
> The partitions-wise join will be performed only if we fill the partition hole that
> exists for the joining table i.e. adding partitions to bar table.
>
> postgres=# CREATE TABLE bar_p1 PARTITION OF bar FOR VALUES WITH(modulus 2, remainder 1);
> CREATE TABLE
> postgres=# explain select * from foo p1, bar p2 where p1.a = p2.a;
>                                    QUERY PLAN
> ---------------------------------------------------------------------------------
>  Append  (cost=359.57..2045.11 rows=65024 width=8)
>    ->  Merge Join  (cost=359.57..860.00 rows=32512 width=8)
>          Merge Cond: (p1.a = p2.a)
>          ->  Sort  (cost=179.78..186.16 rows=2550 width=4)
>                Sort Key: p1.a
>                ->  Seq Scan on foo_p0 p1  (cost=0.00..35.50 rows=2550 width=4)
>          ->  Sort  (cost=179.78..186.16 rows=2550 width=4)
>                Sort Key: p2.a
>                ->  Seq Scan on bar_p0 p2  (cost=0.00..35.50 rows=2550 width=4)
>    ->  Merge Join  (cost=359.57..860.00 rows=32512 width=8)
>          Merge Cond: (p1_1.a = p2_1.a)
>          ->  Sort  (cost=179.78..186.16 rows=2550 width=4)
>                Sort Key: p1_1.a
>                ->  Seq Scan on foo_p1 p1_1  (cost=0.00..35.50 rows=2550 width=4)
>          ->  Sort  (cost=179.78..186.16 rows=2550 width=4)
>                Sort Key: p2_1.a
>                ->  Seq Scan on bar_p1 p2_1  (cost=0.00..35.50 rows=2550 width=4)
> (17 rows)
>
> It would have been nice if we could support this case, as we do allow partitions
> hole for the other partition scheme, but there wouldn't be much objection if
> we don't want to add this support for now since there will be a lesser chance
> that hash partitioned table has the hole, IMO.

I agree with you on that point.

>> If there are no objections, I'll merge the attached with the base patch in [1].

> The proposed enhancement in the patch is too good and the patch is pretty much
> reasonable to merge into the main patch.

Done.  Attached is a merged version of the patch.

> Here are the few cosmetic fixes for this path I think is needed. Feel free to
> ignore if some of them do not make sense or too obvious.
>
> Note: left side number represents code line number of the patch.
>
> 118 +       }
> 119 +
> 120 +       /*
> 121 +        * Try to create the partition bounds along with join pairs.
> 122 +        */
> 123 +       if (boundinfo == NULL)
> 124 +       {
>
> Can we add this block as else part of previous if condition checking equal partitions bound?

Done.

> 133 +           Assert(list_length(parts1) == list_length(parts2));
> 134 +           if (boundinfo == NULL)
> 135 +           {
> 136 +               joinrel->nparts = 0;
> 137 +               return;
> 138 +           }
> 139 +           nparts = list_length(parts1);

> And the question is do we need to do that assert check since partition_bounds_merge()
> does that just before returning, thoughts?

You are right; that assertion would be redundant, so I removed that.

> 204 +       RelOptInfo *child_rel1 = merged ? (RelOptInfo *) lfirst(lcr1) : rel1->part_rels[cnt_parts];
> 205 +       RelOptInfo *child_rel2 = merged ? (RelOptInfo *) lfirst(lcr2) : rel2->part_rels[cnt_parts];
>
> How about using lfirst_node instead of lfirst & casting explicitly?
>
> Also, these lines crossing 80 column length which I think we need to fix. How about
> doing the assignment as follow, just after the variable declaration part:
>
>         if (merged)
>         {
>             child_rel1 = lfirst_node(lcr1, RelOptInfo);
>             child_rel2 = lfirst_node(lcr2, RelOptInfo);
>             lcr1 = lnext(parts1, lcr1);
>             lcr2 = lnext(parts2, lcr2);
>         }
>         else
>         {
>             child_rel1 = rel1->part_rels[cnt_parts];
>             child_rel2 =  rel2->part_rels[cnt_parts]
>         }
>
>         rel1_empty = (child_rel1 == NULL || IS_DUMMY_REL(child_rel1));
>         rel2_empty = (child_rel2 == NULL || IS_DUMMY_REL(child_rel2));

Done.  (I don't think the order of arguments for first_node() above is
correct; the first argument for that should be RelOptInfo.)

> 266 + * get_matching_part_pairs
> 267 + *     Generate join pairs of partitions for the two inputs for the join rel
>
> Can we rewrite this description as  " Generate join pairs of partitions for the
> join rel from the two inputs." OR  "Generate join pairs of partitions for the
> two inputs"

Done.

> 310 +       Assert(bms_num_members(child_relids1) == bms_num_members(rel1->relids));
> 311 +       /*
>
> 335 +       Assert(bms_num_members(child_relids2) == bms_num_members(rel2->relids));
> 336 +       /*
>
> Need newline after assert statements.

Done.

Other change: I guess this in partbounds.c would be leftovers from an
older version of the patch, so I removed it.

+/*
+ * Allocate and initialize partition maps. We maintain four maps, two maps
+ * for each joining relation. pmap[i] gives the partition from the other
+ * relation which would join with ith partition of the given relation.
+ * Partition i from the given relation will join with partition pmap[i]
+ * from the other relation to produce partition mmap[i] of the join (merged
+ * partition).
+ *
+ * pmap[i] = -1 indicates that ith partition of a given relation does not
+ * have a matching partition from the other relation.
+ *
+ * mmap[i] = -1 indicates that ith partition of a given relation does not
+ * contribute to the join result. That can happen only when the given
+ * relation is the inner relation and it doesn't have a matching partition
+ * from the outer relation, hence pmap[i] should be -1.
+ *
+ * In case of an outer join, every partition of the outer join will appear
+ * in the join result, and thus has mmap[i] set for all i. But it's not
+ * necessary that every partition on the outer side will have a matching
+ * partition on the inner side. In such a case, we end up with pmap[i] = -1
+ * and mmap[i] != -1.
+ */

I will continue to review the rest of the patch.

I am sorry for the long delay.

Best regards,
Etsuro Fujita

Attachment

Improve-partition-matching-for-partitionwise-joins-v24.patch

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Etsuro Fujita

Date:

16 October 2019, 12:50:17

On Wed, Sep 25, 2019 at 12:59 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> I will continue to review the rest of the patch.

I've been reviewing the rest of the patch.  Here are my review comments:

* map_and_merge_partitions() checks whether the two partitions from
the outer and inner sides can be merged in two steps: 1) see if the
partitions are mapped to each other (ie, partmap1->from == index2 &&
partmap2->from == index1), and 2) see if the merged partition indexes
assigned are the same (ie, partmap1->to == partmap2->to) (or satisfy
some other conditions), but the first step seems redundant to me
because I think that if the merged partition indexes are the same,
then the partitions would be guaranteed to be mapped to each other.
Also, I noticed that that function can't handle some list-partitioning
cases properly.  Here is an example:

CREATE TABLE plt1 (a int, b int, c text) PARTITION BY LIST(c);
CREATE TABLE plt1_p1 PARTITION OF plt1 FOR VALUES IN ('0000', '0002');
CREATE TABLE plt1_p2 PARTITION OF plt1 FOR VALUES IN ('0003', '0004');
INSERT INTO plt1 SELECT i, i, to_char(i % 5, 'FM0000') FROM
generate_series(0, 24) i WHERE i % 5 IN (0, 2, 3, 4);
ANALYZE plt1;
CREATE TABLE plt2 (a int, b int, c text) PARTITION BY LIST(c);
CREATE TABLE plt2_p1 PARTITION OF plt2 FOR VALUES IN ('0001', '0002');
CREATE TABLE plt2_p2 PARTITION OF plt2 FOR VALUES IN ('0003', '0004');
INSERT INTO plt2 SELECT i, i, to_char(i % 5, 'FM0000') FROM
generate_series(0, 24) i WHERE i % 5 IN (1, 2, 3, 4);
ANALYZE plt2;
EXPLAIN (COSTS OFF)
SELECT t1.a, t1.c, t2.a, t2.c FROM plt1 t1 FULL JOIN plt2 t2 ON (t1.c
= t2.c) WHERE COALESCE(t1.a, 0) % 5 != 3 AND COALESCE(t1.a, 0) % 5 !=
4 ORDER BY t1.c, t1.a, t2.a;
                                     QUERY PLAN

-------------------------------------------------------------------------------
------
 Sort
   Sort Key: t1.c, t1.a, t2.a
   ->  Hash Full Join
         Hash Cond: (t1.c = t2.c)
         Filter: (((COALESCE(t1.a, 0) % 5) <> 3) AND ((COALESCE(t1.a,
0) % 5) <> 4))
         ->  Append
               ->  Seq Scan on plt1_p1 t1
               ->  Seq Scan on plt1_p2 t1_1
         ->  Hash
               ->  Append
                     ->  Seq Scan on plt2_p1 t2
                     ->  Seq Scan on plt2_p2 t2_1
(12 rows)

This should use partitionwise join by the new partition-matching
algorithm but doesn't.  The reason for that is because plt1_p1 and
plt2_p1 are mapped to different merged partitions and thus considered
not merge-able by map_and_merge_partitions() as-is.  I might be
missing something, but I don't think this is intentional, so I rewrote
that function completely in the attached, which is a WIP patch created
on top of the patch [1].

* In handle_missing_partition(), I noticed this:

+   else if (missing_side_inner)
+   {
+       /*
+        * If this partition has already been mapped (say because we
+        * found an overlapping range earlier), we know where does it
+        * fit in the join result. Nothing to do in that case. Else
+        * create a new merged partition.
+        */
+       PartitionMap *extra_map = &maps_with_extra[extra_part];
+       if (extra_map->to < 0)
+       {
+           extra_map->to = *next_index;
+           *next_index = *next_index + 1;
+           *merged_index = extra_map->to;
+       }
+   }

As commented, that function skips setting *merged_index when the
"extra_part" partition is already mapped to a merged partition.  This
would be correct for range partitioning, but not for list
partitioning, I think.  Here is an example:

CREATE TABLE plt1 (a int, b int, c text) PARTITION BY LIST(c);
CREATE TABLE plt1_p1 PARTITION OF plt1 FOR VALUES IN ('0000', '0001', '0002');
CREATE TABLE plt1_p2 PARTITION OF plt1 FOR VALUES IN ('0003', '0004');
INSERT INTO plt1 SELECT i, i, to_char(i % 5, 'FM0000') FROM
generate_series(0, 24) i;
ANALYZE plt1;
CREATE TABLE plt2 (a int, b int, c text) PARTITION BY LIST(c);
CREATE TABLE plt2_p1 PARTITION OF plt2 FOR VALUES IN ('0002');
CREATE TABLE plt2_p2 PARTITION OF plt2 FOR VALUES IN ('0003', '0004');
INSERT INTO plt2 SELECT i, i, to_char(i % 5, 'FM0000') FROM
generate_series(0, 24) i WHERE i % 5 IN (2, 3, 4);
ANALYZE plt2;
CREATE TABLE plt3 (a int, b int, c text) PARTITION BY LIST(c);
CREATE TABLE plt3_p1 PARTITION OF plt3 FOR VALUES IN ('0001');
CREATE TABLE plt3_p2 PARTITION OF plt3 FOR VALUES IN ('0003', '0004');
INSERT INTO plt3 SELECT i, i, to_char(i % 5, 'FM0000') FROM
generate_series(0, 24) i WHERE i % 5 IN (1, 3, 4);
ANALYZE plt3;
EXPLAIN (COSTS OFF)
SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM (plt1 t1 LEFT JOIN plt2
t2 ON (t1.c = t2.c)) FULL JOIN plt3 t3 ON (t1.c = t3.c) WHERE
COALESCE(t1.a, 0) % 5 != 3 AND coalesce(t1.a, 0) % 5 != 4 ORDER BY
t1.c, t1.a, t2.a, t3.a;
                                     QUERY PLAN

-------------------------------------------------------------------------------
------
 Sort
   Sort Key: t1.c, t1.a, t2.a, t3.a
   ->  Hash Full Join
         Hash Cond: (t1.c = t3.c)
         Filter: (((COALESCE(t1.a, 0) % 5) <> 3) AND ((COALESCE(t1.a,
0) % 5) <> 4))
         ->  Hash Left Join
               Hash Cond: (t1.c = t2.c)
               ->  Append
                     ->  Seq Scan on plt1_p1 t1
                     ->  Seq Scan on plt1_p2 t1_1
               ->  Hash
                     ->  Append
                           ->  Seq Scan on plt2_p1 t2
                           ->  Seq Scan on plt2_p2 t2_1
         ->  Hash
               ->  Append
                     ->  Seq Scan on plt3_p1 t3
                     ->  Seq Scan on plt3_p2 t3_1
(18 rows)

I think this should use 3-way partitionwise join by the new algorithm
but doesn't.  The reason for that is because when considering
partitionwise join for plt1 and plt2, partition_list_bounds_merge()
would incorrectly produce {'0000', '0002'} as the merged values for
the join segment of plt1_p1 and plt2_p1, not {'0000', '0001', '0002'},
because that function would call handle_missing_partition() for the
values '0000' and '0001' of plt1_p1, and it would set *merged_index
correctly for '0000' but not for '0001' (keeping *merged_index=-1),
due to the behavior mentioned above, resulting in the merged values
for the join segment {'0000', '0002'}, not {'0000', '0001', '0002'}.
This would cause to fail to merge the values for the join segment
{'0000', '0002'} and the values for plt3_p1 {'0001'} when considering
partitionwise join for the 2-way join and plt3.  I fixed this as well
in the attached.  handle_missing_partition() can handle both cases
where a datum is missing from the inner side and where a datum is
missing from the outer side in a unified way, but I think that that
makes the code pretty hard to read.  Also, I think
handle_missing_partition() is not caller-friendly because the caller
needs to write something like this:

+               /* A range missing from the inner side. */
+               bool        missing_side_outer;
+               bool        missing_side_inner;

+               /*
+                * For a FULL join, inner relation acts as both OUTER and INNER
+                * relation.  For LEFT and ANTI join the inner relation acts as
+                * INNER relation. For INNER and SEMI join OUTER and INNER
+                * differentiation is immaterial.
+                */
+               missing_side_inner = (jointype == JOIN_FULL ||
+                                     jointype == JOIN_LEFT ||
+                                     jointype == JOIN_ANTI);
+               missing_side_outer = (jointype == JOIN_FULL);
+               if (!handle_missing_partition(inner_maps,
+                                             outer_maps,
+                                             inner_default,
+                                             outer_part,
+                                             missing_side_outer,
+                                             missing_side_inner,
+                                             &next_index,
+                                             &default_index,
+                                             &merged_index))
+                   return NULL;

So I'd like to propose to introduce separate functions like
process_outer_partition() and process_inner_partition() in the
attached, instead of handle_missing_partition().  (I added a fast path
to these functions that if both outer/inner sides have the default
partitions, give up on merging partitions.  Also, I modified the
caller sites of proposed functions so that they call these if
necessary.)

Other minor changes in the attached:

* I modified partition_range_bounds_merge() and
partition_list_bounds_merge() so that the order of arguments for them
matches partition_bounds_merge().  Also, I did some cleanup for these
functions, and removed this in these function because this merely
re-checks the while condition above.

+       /* If we exhausted both the sides, we won't enter the loop. */
+       Assert(!finished_inner || !finished_outer);

* I modified merge_null_partitions() and merge_default_partitions()
accordingly to the changes described above.  Also, I did some cleanup
and modified the caller sites of these functions so that they call
these functions if necessary.

* I also modified generate_matching_part_pairs() accordingly.

* I added a creanup to free memory allocated by init_partition_map().

* We have this in a few places such as merge_default_partitions():

+           extra_map->to = *next_index;
+           *next_index = *next_index + 1;
+           *merged_index = extra_map->to;

To reduce code duplication, I made this into a separate function.

* I think the return statement in the end of partition_range_merge()
is useless, so I removed it.

That is all for now.  Will continue to review.

Best regards,
Etsuro Fujita

[1] https://www.postgresql.org/message-id/CAPmGK14WHKckT1P6UJV2B63TZAxPyMn8iZJ99XF=AZuNhG6vow@mail.gmail.com

Attachment

modify-partbounds-changes-1.patch

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

amul sul

Date:

25 October 2019, 09:59:03

On Wed, Oct 16, 2019 at 6:20 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:

On Wed, Sep 25, 2019 at 12:59 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> I will continue to review the rest of the patch.

I've been reviewing the rest of the patch. Here are my review comments:

[....]

So I'd like to propose to introduce separate functions like
process_outer_partition() and process_inner_partition() in the
attached, instead of handle_missing_partition(). (I added a fast path
to these functions that if both outer/inner sides have the default
partitions, give up on merging partitions. Also, I modified the
caller sites of proposed functions so that they call these if
necessary.)

Agree -- process_outer_partition() & process_inner_partition() approach looks
much cleaner than before.

Here are the few comments:

Note that LHS numbers are the line numbers in your previously posted patch[1].

455 + if (inner_has_default ||
456 + jointype == JOIN_LEFT ||
457 + jointype == JOIN_ANTI ||
458 + jointype == JOIN_FULL)
459 + {
460 + if (!process_outer_partition(&outer_map,

512 + if (outer_has_default || jointype == JOIN_FULL)
513 + {
514 + if (!process_inner_partition(&outer_map,

How about adding these conditions to the else block of process_outer_partition()
& process_inner_partition() function respectively so that these functions can be
called unconditionally? Thoughts/Comments?

---

456 + jointype == JOIN_LEFT ||
457 + jointype == JOIN_ANTI ||
458 + jointype == JOIN_FULL)

Also, how about using IS_OUTER_JOIN() instead. But we need an assertion to
restrict JOIN_RIGHT or something.

For the convenience, I did both aforesaid changes in the attached delta patch that
can be applied atop of your previously posted patch[1]. Kindly have a look & share

your thoughts, thanks.

1273 + * *next_index is incremented when creating a new merged partition associated
1274 + * with the given outer partition.
1275 + */

Correction: s/outer/inner

---

1338 + * In range partitioning, if the given outer partition is already
1339 + * merged (eg, because we found an overlapping range earlier), we know
1340 + * where it fits in the join result; nothing to do in that case. Else
1341 + * create a new merged partition.

Correction: s/outer/inner.

---

1712 /*
1713 * If the NULL partition was missing from the inner side of the join,

s/inner side/outer side

Regards,

Amul

1] https://postgr.es/m/CAPmGK145V8DNCNQ2gQBgNE3QqoJGjsmK5WMwaA3FMirNM723KQ%40mail.gmail.com

Attachment

delta.patch

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Etsuro Fujita

Date:

29 October 2019, 10:29:32

Hi Amul,

On Fri, Oct 25, 2019 at 6:59 PM amul sul <sulamul@gmail.com> wrote:
> On Wed, Oct 16, 2019 at 6:20 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
>> So I'd like to propose to introduce separate functions like
>> process_outer_partition() and process_inner_partition() in the
>> attached, instead of handle_missing_partition().  (I added a fast path
>> to these functions that if both outer/inner sides have the default
>> partitions, give up on merging partitions.  Also, I modified the
>> caller sites of proposed functions so that they call these if
>> necessary.)

> Agree -- process_outer_partition() & process_inner_partition() approach looks
> much cleaner than before.
>
> Here are the few comments:

Thanks for the review!

> Note that LHS numbers are the line numbers in your previously posted patch[1].
>
>  455 +               if (inner_has_default ||
>  456 +                   jointype == JOIN_LEFT ||
>  457 +                   jointype == JOIN_ANTI ||
>  458 +                   jointype == JOIN_FULL)
>  459 +               {
>  460 +                   if (!process_outer_partition(&outer_map,
>
>  512 +               if (outer_has_default || jointype == JOIN_FULL)
>  513 +               {
>  514 +                   if (!process_inner_partition(&outer_map,
>
> How about adding these conditions to the else block of process_outer_partition()
> & process_inner_partition() function respectively so that these functions can be
> called unconditionally?  Thoughts/Comments?

I'm not sure that's a good idea; these functions might be called many
times, so I just thought it would be better to call these functions
conditionally, to avoid useless function call overhead.

>  456 +                   jointype == JOIN_LEFT ||
>  457 +                   jointype == JOIN_ANTI ||
>  458 +                   jointype == JOIN_FULL)
>
> Also, how about using IS_OUTER_JOIN() instead. But we need an assertion to
> restrict JOIN_RIGHT or something.

Seems like a good idea.

> For the convenience, I did both aforesaid changes in the attached delta patch that
> can be applied atop of your previously posted patch[1].  Kindly have a look & share
> your thoughts, thanks.

Thanks for the patch!

> 1273 + * *next_index is incremented when creating a new merged partition associated
> 1274 + * with the given outer partition.
> 1275 + */
>
> Correction: s/outer/inner
> ---
>
> 1338 +        * In range partitioning, if the given outer partition is already
> 1339 +        * merged (eg, because we found an overlapping range earlier), we know
> 1340 +        * where it fits in the join result; nothing to do in that case.  Else
> 1341 +        * create a new merged partition.
>
> Correction: s/outer/inner.
> ---
>
> 1712         /*
> 1713          * If the NULL partition was missing from the inner side of the join,
>
> s/inner side/outer side
> --

Good catch!  Will fix.

Best regards,
Etsuro Fujita

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Etsuro Fujita

Date:

31 October 2019, 09:49:26

On Tue, Oct 29, 2019 at 7:29 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> On Fri, Oct 25, 2019 at 6:59 PM amul sul <sulamul@gmail.com> wrote:
> > On Wed, Oct 16, 2019 at 6:20 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> >> So I'd like to propose to introduce separate functions like
> >> process_outer_partition() and process_inner_partition() in the
> >> attached, instead of handle_missing_partition().  (I added a fast path
> >> to these functions that if both outer/inner sides have the default
> >> partitions, give up on merging partitions.  Also, I modified the
> >> caller sites of proposed functions so that they call these if
> >> necessary.)
>
> > Agree -- process_outer_partition() & process_inner_partition() approach looks
> > much cleaner than before.

> > Note that LHS numbers are the line numbers in your previously posted patch[1].
> >
> >  455 +               if (inner_has_default ||
> >  456 +                   jointype == JOIN_LEFT ||
> >  457 +                   jointype == JOIN_ANTI ||
> >  458 +                   jointype == JOIN_FULL)
> >  459 +               {
> >  460 +                   if (!process_outer_partition(&outer_map,
> >
> >  512 +               if (outer_has_default || jointype == JOIN_FULL)
> >  513 +               {
> >  514 +                   if (!process_inner_partition(&outer_map,
> >
> > How about adding these conditions to the else block of process_outer_partition()
> > & process_inner_partition() function respectively so that these functions can be
> > called unconditionally?  Thoughts/Comments?
>
> I'm not sure that's a good idea; these functions might be called many
> times, so I just thought it would be better to call these functions
> conditionally, to avoid useless function call overhead.

The overhead might be small, but isn't zero, so I still think that we
should call these functions if necessary.

> >  456 +                   jointype == JOIN_LEFT ||
> >  457 +                   jointype == JOIN_ANTI ||
> >  458 +                   jointype == JOIN_FULL)
> >
> > Also, how about using IS_OUTER_JOIN() instead. But we need an assertion to
> > restrict JOIN_RIGHT or something.
>
> Seems like a good idea.

Done.

> > For the convenience, I did both aforesaid changes in the attached delta patch that
> > can be applied atop of your previously posted patch[1].  Kindly have a look & share
> > your thoughts, thanks.
>
> Thanks for the patch!
>
> > 1273 + * *next_index is incremented when creating a new merged partition associated
> > 1274 + * with the given outer partition.
> > 1275 + */
> >
> > Correction: s/outer/inner
> > ---
> >
> > 1338 +        * In range partitioning, if the given outer partition is already
> > 1339 +        * merged (eg, because we found an overlapping range earlier), we know
> > 1340 +        * where it fits in the join result; nothing to do in that case.  Else
> > 1341 +        * create a new merged partition.
> >
> > Correction: s/outer/inner.
> > ---
> >
> > 1712         /*
> > 1713          * If the NULL partition was missing from the inner side of the join,
> >
> > s/inner side/outer side
> > --
>
> Good catch!  Will fix.

Done.

I also added some assertions to process_outer_partition() and
process_inner_partition(), including ones as proposed in your patch.
Attached is an updated version.  If no objections, I'll merge this
with the main patch [1].

Best regards,
Etsuro Fujita

[1] https://www.postgresql.org/message-id/CAPmGK14WHKckT1P6UJV2B63TZAxPyMn8iZJ99XF=AZuNhG6vow@mail.gmail.com

Attachment

modify-partbounds-changes-2.patch

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Etsuro Fujita

Date:

01 November 2019, 10:28:29

On Thu, Oct 31, 2019 at 6:49 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> Attached is an updated version.  If no objections, I'll merge this
> with the main patch [1].

Done.  Attached is a new version of the patch.

Other changes: in generate_matching_part_pairs(), I changed variable
names to match other functions, simplified assertions, and
adjusted/added comments a bit.

Will continue to review.

Best regards,
Etsuro Fujita

Attachment

Improve-partition-matching-for-partitionwise-joins-v25.patch

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

amul sul

Date:

05 November 2019, 13:14:50

On Fri, Nov 1, 2019 at 3:58 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:

On Thu, Oct 31, 2019 at 6:49 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> Attached is an updated version. If no objections, I'll merge this
> with the main patch [1].

Done. Attached is a new version of the patch.

Other changes: in generate_matching_part_pairs(), I changed variable
names to match other functions, simplified assertions, and
adjusted/added comments a bit.

Thanks for the update version.

A query and comments for v25:

583 + * The function returns NULL if we can not find the matching pair of
584 + * partitions. This happens if 1. multiple partitions on one side match with
585 + * one partition on the other side. 2. a given partition on the outer side
586 + * doesn't have a matching partition on the inner side. We can not support the
587 + * first case since we don't have a way to represent multiple partitions as a
588 + * single relation (RelOptInfo) and then perform join using the ganged
589 + * relation. We can not support the second case since the missing inner
590 + * partition needs to be represented as an empty relation and we don't have a
591 + * way to introduce empty relation during join planning after creating paths
592 + * for all the base relations.
593 + */
594 +PartitionBoundInfo
595 +partition_bounds_merge(int partnatts,

I think the second condition mention for partition_bounds_merge() looks
outdated, do you think we should remove that or am I missing something here?

---

1768 +
1769 + /*
1770 + * If this is an outer join, the merged partition would act as the
1771 + * default partition of the join; record the index in *default_index
1772 + * if not done yet.
1773 + */
1774 + if (jointype == JOIN_LEFT || jointype == JOIN_ANTI ||
1775 + jointype == JOIN_FULL)
1776 + {

As decided need to replace this list by IS_OUTER_JOIN(jointype).

---

2020 + if (jointype == JOIN_LEFT || jointype == JOIN_FULL ||
2021 + jointype == JOIN_ANTI)
2022 + {

Same as the previous.

---

I tried a coverage testing and tried to adjust & add a few tests to improved the
code coverage for the v25 patch. Please have a look at the attached 0002 & also
attach the coverage output with & without this patch, TWIMW.

0001 is the same v25 patch, reattaching to make CFbot happy.

Regards,

Amul

Attachment

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

amul sul

Date:

06 November 2019, 06:11:47

Attached is the rebase atop the latest master head(a9056cc637f).

On Tue, Nov 5, 2019 at 6:44 PM amul sul <sulamul@gmail.com> wrote:

On Fri, Nov 1, 2019 at 3:58 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
On Thu, Oct 31, 2019 at 6:49 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> Attached is an updated version. If no objections, I'll merge this
> with the main patch [1].

Done. Attached is a new version of the patch.

Other changes: in generate_matching_part_pairs(), I changed variable
names to match other functions, simplified assertions, and
adjusted/added comments a bit.

Thanks for the update version.

A query and comments for v25:

583 + * The function returns NULL if we can not find the matching pair of
584 + * partitions. This happens if 1. multiple partitions on one side match with
585 + * one partition on the other side. 2. a given partition on the outer side
586 + * doesn't have a matching partition on the inner side. We can not support the
587 + * first case since we don't have a way to represent multiple partitions as a
588 + * single relation (RelOptInfo) and then perform join using the ganged
589 + * relation. We can not support the second case since the missing inner
590 + * partition needs to be represented as an empty relation and we don't have a
591 + * way to introduce empty relation during join planning after creating paths
592 + * for all the base relations.
593 + */
594 +PartitionBoundInfo
595 +partition_bounds_merge(int partnatts,

I think the second condition mention for partition_bounds_merge() looks
outdated, do you think we should remove that or am I missing something here?
---

1768 +
1769 + /*
1770 + * If this is an outer join, the merged partition would act as the
1771 + * default partition of the join; record the index in *default_index
1772 + * if not done yet.
1773 + */
1774 + if (jointype == JOIN_LEFT || jointype == JOIN_ANTI ||
1775 + jointype == JOIN_FULL)
1776 + {

As decided need to replace this list by IS_OUTER_JOIN(jointype).
---

2020 + if (jointype == JOIN_LEFT || jointype == JOIN_FULL ||
2021 + jointype == JOIN_ANTI)
2022 + {

Same as the previous.
---

I tried a coverage testing and tried to adjust & add a few tests to improved the
code coverage for the v25 patch. Please have a look at the attached 0002 & also
attach the coverage output with & without this patch, TWIMW.

0001 is the same v25 patch, reattaching to make CFbot happy.

Regards,
Amul

Attachment

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Etsuro Fujita

Date:

13 November 2019, 04:16:45

Hi Amul,

On Wed, Nov 6, 2019 at 3:12 PM amul sul <sulamul@gmail.com> wrote:
> Attached is the rebase atop the latest master head(a9056cc637f).

Thanks for that!

> On Tue, Nov 5, 2019 at 6:44 PM amul sul <sulamul@gmail.com> wrote:
>> On Fri, Nov 1, 2019 at 3:58 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
>>> Done.  Attached is a new version of the patch.

>> A query and comments for v25:
>>
>> 583 + * The function returns NULL if we can not find the matching pair of
>>  584 + * partitions. This happens if 1. multiple partitions on one side match with
>>  585 + * one partition on the other side. 2. a given partition on the outer side
>>  586 + * doesn't have a matching partition on the inner side. We can not support the
>>  587 + * first case since we don't have a way to represent multiple partitions as a
>>  588 + * single relation (RelOptInfo) and then perform join using the ganged
>>  589 + * relation. We can not support the second case since the missing inner
>>  590 + * partition needs to be represented as an empty relation and we don't have a
>>  591 + * way to introduce empty relation during join planning after creating paths
>>  592 + * for all the base relations.
>>  593 + */
>>  594 +PartitionBoundInfo
>>  595 +partition_bounds_merge(int partnatts,
>>
>> I think the second condition mention for partition_bounds_merge() looks
>> outdated, do you think we should remove that or am I missing something here?

Yeah, I think so.  In addition to that, the output arguments for that
function *outer_pars and *inner_parts don't store partition indexes
anymore, so I updated comments for that function totally.  Does that
make sense?

>> 1768 +
>> 1769 +       /*
>> 1770 +        * If this is an outer join, the merged partition would act as the
>> 1771 +        * default partition of the join; record the index in *default_index
>> 1772 +        * if not done yet.
>> 1773 +        */
>> 1774 +       if (jointype == JOIN_LEFT || jointype == JOIN_ANTI ||
>> 1775 +           jointype == JOIN_FULL)
>> 1776 +       {
>>
>> As decided need to replace this list by IS_OUTER_JOIN(jointype).

Done.

>> 2020 +       if (jointype == JOIN_LEFT || jointype == JOIN_FULL ||
>> 2021 +           jointype == JOIN_ANTI)
>> 2022 +       {
>>
>> Same as the previous.

Done.

>> I tried a coverage testing and tried to adjust & add a few tests to improved the
>> code coverage for the v25 patch. Please have a look at the attached 0002 & also
>> attach the coverage output with & without this patch, TWIMW.

Good idea!  I merged that to the main patch after modifying that a bit
so to reduce the outputs of SELECTs.

Attached is an updated version of the main patch.  Thanks for reviewing!

I looked at partition_range_bounds_merge() more closely.  Here are my comments:

* When merging partition bounds from the outer/inner sides in the
while loop of that function, we only moves to the next partition on
one side when both partitions overlap (ie, overlap=true) and the upper
bounds of partitions are not the same (ie, ub_cmpval<0 or
ub_cmpval>0); but I think that's inefficient.  Let me explain using an
example:

create table prt1 (a int, b int) partition by range (a);
create table prt1_p1 partition of prt1 for values from (0) to (50);
create table prt1_p2 partition of prt1 for values from (100) to (150);
create table prt2 (a int, b int) partition by range (a);
create table prt2_p1 partition of prt2 for values from (0) to (100);
create table prt2_p2 partition of prt2 for values from (100) to (200);
insert into prt1 select i, i from generate_series(0, 49) i;
insert into prt1 select i, i from generate_series(100, 149) i;
insert into prt2 select i, i from generate_series(0, 49) i;
insert into prt2 select i, i from generate_series(100, 149) i;
analyze prt1;
analyze prt2;
set enable_partitionwise_join to on;
explain verbose select * from prt1 t1 full join prt2 t2 on (t1.a = t2.a);
                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Append  (cost=2.12..9.12 rows=100 width=16)
   ->  Hash Full Join  (cost=2.12..4.31 rows=50 width=16)
         Output: t1.a, t1.b, t2.a, t2.b
         Hash Cond: (t1.a = t2.a)
         ->  Seq Scan on public.prt1_p1 t1  (cost=0.00..1.50 rows=50 width=8)
               Output: t1.a, t1.b
         ->  Hash  (cost=1.50..1.50 rows=50 width=8)
               Output: t2.a, t2.b
               ->  Seq Scan on public.prt2_p1 t2  (cost=0.00..1.50
rows=50 width=8)
                     Output: t2.a, t2.b
   ->  Hash Full Join  (cost=2.12..4.31 rows=50 width=16)
         Output: t1_1.a, t1_1.b, t2_1.a, t2_1.b
         Hash Cond: (t1_1.a = t2_1.a)
         ->  Seq Scan on public.prt1_p2 t1_1  (cost=0.00..1.50 rows=50 width=8)
               Output: t1_1.a, t1_1.b
         ->  Hash  (cost=1.50..1.50 rows=50 width=8)
               Output: t2_1.a, t2_1.b
               ->  Seq Scan on public.prt2_p2 t2_1  (cost=0.00..1.50
rows=50 width=8)
                     Output: t2_1.a, t2_1.b
(19 rows)

For this query, the merging would proceed as follow:

1) In the first iteration of the while loop, we have overlap=true, so
we merge prt1_p1 and prt2_p1 and move to the next partition on *only
the outer side* (prt1_p2) as we have ub_cmpval<0.  2) In the second
iteration, we have overlap=false and ub_cmpval>0, so we perform
process_inner_partition() to prt2_p1 and move to the next partition on
the inner side (prt2_p2).  3) In the third iteration, we have
overlap=true, so we merge prt1_p2 and prt2_p2 and move to the next
partition on *only the outer side* as we have ub_cmpval<0 (but can't
as the outer side is exhausted).  4) In the forth iteration, we have
overlap=false and ub_cmpval>0, so we perform process_inner_partition()
to prt2_p2 and move to the next partition on the inner side (but can't
as the inner side is exhausted).  And we are done.

We do this in the process_inner_partition() call in the second and
forth iterations:

        /*
         * In range partitioning, if the given inner partition is already
         * merged (eg, because we found an overlapping range earlier), we know
         * where it fits in the join result; nothing to do in that case.  Else
         * create a new merged partition.
         */
        if (inner_map->merged_indexes[inner_index] >= 0)
        {
            if (strategy == PARTITION_STRATEGY_LIST)
                *merged_index = inner_map->merged_indexes[inner_index];
            else
            {
                Assert(strategy == PARTITION_STRATEGY_RANGE);
                *merged_index = -1;
            }
        }

What we do in that function is actually a no-op, so the second and
forth iterations are done only to move to the next partition on the
inner side, which I think is inefficient.  To avoid that, why not move
to *the next pair of partitions* in the first and third iterations
like the attached?  This needs an extra check to see if we can safely
move to the next pair of partitions in the first and third iterations,
but doesn't need meaningless iterations like the second and fourth
iterations in the example.  This change also makes the code in
process_inner_partition() shown above unnecessary, so I removed it
from that function (and process_outer_partition()).

* I don't like this:

+           if ((lb_cmpval < 0 && inner_has_default) ||
+               /* Non-overlapping range on the lower side of outer range. */
+               (lb_cmpval > 0 && outer_has_default) ||
+               /* Non-overlapping range on the lower side of inner range. */
+               (ub_cmpval < 0 && outer_has_default) ||
+               /* Non-overlapping range on the upper side of inner range. */
+               (ub_cmpval > 0 && inner_has_default))
+               /* Non-overlapping range on the upper side of outer range. */
+               return NULL;

because it looks like that eg, the first comment "Non-overlapping
range on the lower side of outer range." explains the second condition
"(lb_cmpval > 0 && outer_has_default)", which isn't intended, I think.
Also, it seems a bit verbose to me to add a comment to each of the
four cases.  How about simplifying the comments and moving them to
above the if block like the attached?

* In partition_range_merge_next_lb(), do we really need this bit?

+   /*
+    * The lower bound is lower than the last upper bound, thus does not fit
+    * the bounds created so far and hence can not be merged with the existing
+    * bounds.
+    */
+   if (cmpval < 0)
+       return false;

I think that if we have such a lower bound, we would error out before
calling that function.  Also, I think it would be easier to understand
to add the upper bound as well within that function.  So I modified
that function that way (and renamed it).  I also added a bit of
comments to that function.

* Since this is called many times, I changed it to a macro:

+static int32
+partition_range_bound_cmp(int partnatts, FmgrInfo *partsupfunc,
+                         Oid *partcollations, PartitionRangeBound *bound1,
+                         PartitionRangeBound *bound2)
+{
+   return partition_rbound_cmp(partnatts, partsupfunc, partcollations,
+                               bound1->datums, bound1->kind, bound1->lower,
+                               bound2);
+}

* This is just my taste, but how about renaming
partition_range_bound_cmp() and partition_range_cmp() to more common?
or readable? names eg, compare_range_bounds() and
compare_range_partitions(), respectively?  Also, how about renaming
partition_range_merge() to eg, get_merged_range_bounds()?

Please find attached the delta patch for that.  Any feedback welcome!

Best regards,
Etsuro Fujita

Attachment

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

amul sul

Date:

15 November 2019, 05:20:39

On Wed, Nov 13, 2019 at 9:46 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:

Hi Amul,

On Wed, Nov 6, 2019 at 3:12 PM amul sul <sulamul@gmail.com> wrote:
> Attached is the rebase atop the latest master head(a9056cc637f).

Thanks for that!

> On Tue, Nov 5, 2019 at 6:44 PM amul sul <sulamul@gmail.com> wrote:
>> On Fri, Nov 1, 2019 at 3:58 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
>>> Done. Attached is a new version of the patch.

>> A query and comments for v25:
>>
>> 583 + * The function returns NULL if we can not find the matching pair of
>> 584 + * partitions. This happens if 1. multiple partitions on one side match with
>> 585 + * one partition on the other side. 2. a given partition on the outer side
>> 586 + * doesn't have a matching partition on the inner side. We can not support the
>> 587 + * first case since we don't have a way to represent multiple partitions as a
>> 588 + * single relation (RelOptInfo) and then perform join using the ganged
>> 589 + * relation. We can not support the second case since the missing inner
>> 590 + * partition needs to be represented as an empty relation and we don't have a
>> 591 + * way to introduce empty relation during join planning after creating paths
>> 592 + * for all the base relations.
>> 593 + */
>> 594 +PartitionBoundInfo
>> 595 +partition_bounds_merge(int partnatts,
>>
>> I think the second condition mention for partition_bounds_merge() looks
>> outdated, do you think we should remove that or am I missing something here?

Yeah, I think so. In addition to that, the output arguments for that
function *outer_pars and *inner_parts don't store partition indexes
anymore, so I updated comments for that function totally. Does that
make sense?

Thanks, that's looks better.

[....]

I looked at partition_range_bounds_merge() more closely. Here are my comments:

* When merging partition bounds from the outer/inner sides in the
while loop of that function, we only moves to the next partition on
one side when both partitions overlap (ie, overlap=true) and the upper
bounds of partitions are not the same (ie, ub_cmpval<0 or
ub_cmpval>0); but I think that's inefficient. Let me explain using an
example:

create table prt1 (a int, b int) partition by range (a);
create table prt1_p1 partition of prt1 for values from (0) to (50);
create table prt1_p2 partition of prt1 for values from (100) to (150);
create table prt2 (a int, b int) partition by range (a);
create table prt2_p1 partition of prt2 for values from (0) to (100);
create table prt2_p2 partition of prt2 for values from (100) to (200);
insert into prt1 select i, i from generate_series(0, 49) i;
insert into prt1 select i, i from generate_series(100, 149) i;
insert into prt2 select i, i from generate_series(0, 49) i;
insert into prt2 select i, i from generate_series(100, 149) i;
analyze prt1;
analyze prt2;
set enable_partitionwise_join to on;
explain verbose select * from prt1 t1 full join prt2 t2 on (t1.a = t2.a);
QUERY PLAN
--------------------------------------------------------------------------------------
Append (cost=2.12..9.12 rows=100 width=16)
-> Hash Full Join (cost=2.12..4.31 rows=50 width=16)
Output: t1.a, t1.b, t2.a, t2.b
Hash Cond: (t1.a = t2.a)
-> Seq Scan on public.prt1_p1 t1 (cost=0.00..1.50 rows=50 width=8)
Output: t1.a, t1.b
-> Hash (cost=1.50..1.50 rows=50 width=8)
Output: t2.a, t2.b
-> Seq Scan on public.prt2_p1 t2 (cost=0.00..1.50
rows=50 width=8)
Output: t2.a, t2.b
-> Hash Full Join (cost=2.12..4.31 rows=50 width=16)
Output: t1_1.a, t1_1.b, t2_1.a, t2_1.b
Hash Cond: (t1_1.a = t2_1.a)
-> Seq Scan on public.prt1_p2 t1_1 (cost=0.00..1.50 rows=50 width=8)
Output: t1_1.a, t1_1.b
-> Hash (cost=1.50..1.50 rows=50 width=8)
Output: t2_1.a, t2_1.b
-> Seq Scan on public.prt2_p2 t2_1 (cost=0.00..1.50
rows=50 width=8)
Output: t2_1.a, t2_1.b
(19 rows)

For this query, the merging would proceed as follow:

1) In the first iteration of the while loop, we have overlap=true, so
we merge prt1_p1 and prt2_p1 and move to the next partition on *only
the outer side* (prt1_p2) as we have ub_cmpval<0. 2) In the second
iteration, we have overlap=false and ub_cmpval>0, so we perform
process_inner_partition() to prt2_p1 and move to the next partition on
the inner side (prt2_p2). 3) In the third iteration, we have
overlap=true, so we merge prt1_p2 and prt2_p2 and move to the next
partition on *only the outer side* as we have ub_cmpval<0 (but can't
as the outer side is exhausted). 4) In the forth iteration, we have
overlap=false and ub_cmpval>0, so we perform process_inner_partition()
to prt2_p2 and move to the next partition on the inner side (but can't
as the inner side is exhausted). And we are done.

We do this in the process_inner_partition() call in the second and
forth iterations:

/*
* In range partitioning, if the given inner partition is already
* merged (eg, because we found an overlapping range earlier), we know
* where it fits in the join result; nothing to do in that case. Else
* create a new merged partition.
*/
if (inner_map->merged_indexes[inner_index] >= 0)
{
if (strategy == PARTITION_STRATEGY_LIST)
*merged_index = inner_map->merged_indexes[inner_index];
else
{
Assert(strategy == PARTITION_STRATEGY_RANGE);
*merged_index = -1;
}
}

What we do in that function is actually a no-op, so the second and
forth iterations are done only to move to the next partition on the
inner side, which I think is inefficient. To avoid that, why not move
to *the next pair of partitions* in the first and third iterations
like the attached? This needs an extra check to see if we can safely
move to the next pair of partitions in the first and third iterations,
but doesn't need meaningless iterations like the second and fourth
iterations in the example. This change also makes the code in
process_inner_partition() shown above unnecessary, so I removed it
from that function (and process_outer_partition()).

Make sense.

* I don't like this:

+ if ((lb_cmpval < 0 && inner_has_default) ||
+ /* Non-overlapping range on the lower side of outer range. */
+ (lb_cmpval > 0 && outer_has_default) ||
+ /* Non-overlapping range on the lower side of inner range. */
+ (ub_cmpval < 0 && outer_has_default) ||
+ /* Non-overlapping range on the upper side of inner range. */
+ (ub_cmpval > 0 && inner_has_default))
+ /* Non-overlapping range on the upper side of outer range. */
+ return NULL;

because it looks like that eg, the first comment "Non-overlapping
range on the lower side of outer range." explains the second condition
"(lb_cmpval > 0 && outer_has_default)", which isn't intended, I think.
Also, it seems a bit verbose to me to add a comment to each of the
four cases. How about simplifying the comments and moving them to
above the if block like the attached?

Check.

* In partition_range_merge_next_lb(), do we really need this bit?

+ /*
+ * The lower bound is lower than the last upper bound, thus does not fit
+ * the bounds created so far and hence can not be merged with the existing
+ * bounds.
+ */
+ if (cmpval < 0)
+ return false;

I think that if we have such a lower bound, we would error out before
calling that function. Also, I think it would be easier to understand
to add the upper bound as well within that function. So I modified
that function that way (and renamed it). I also added a bit of
comments to that function.

* Since this is called many times, I changed it to a macro:

+static int32
+partition_range_bound_cmp(int partnatts, FmgrInfo *partsupfunc,
+ Oid *partcollations, PartitionRangeBound *bound1,
+ PartitionRangeBound *bound2)
+{
+ return partition_rbound_cmp(partnatts, partsupfunc, partcollations,
+ bound1->datums, bound1->kind, bound1->lower,
+ bound2);
+}

Check.

* This is just my taste, but how about renaming
partition_range_bound_cmp() and partition_range_cmp() to more common?
or readable? names eg, compare_range_bounds() and
compare_range_partitions(), respectively? Also, how about renaming
partition_range_merge() to eg, get_merged_range_bounds()?

Check.

Please find attached the delta patch for that. Any feedback welcome!

Thank you Fujita san for the patch & the enhancements. I am fine with your
delta patch. I would like to share some thought for other code:

File: partbounds.c:

3298 static void
3299 get_merged_range_bounds(int partnatts, FmgrInfo *partsupfuncs,
3300 Oid *partcollations, JoinType jointype,
3301 PartitionRangeBound *left_lb,
3302 PartitionRangeBound *left_ub,
3303 PartitionRangeBound *right_lb,
3304 PartitionRangeBound *right_ub,
3305 PartitionRangeBound *merged_lb,
3306 PartitionRangeBound *merged_ub)

Can we rename these argument as inner_* & outer_* like we having for the
functions, like 0003 patch?

---

File: partbounds.c:

3322
3323 case JOIN_INNER:
3324 case JOIN_SEMI:
3325 if (compare_range_bounds(partnatts, partsupfuncs, partcollations,
3326 left_ub, right_ub) < 0)
3327 *merged_ub = *left_ub;
3328 else
3329 *merged_ub = *right_ub;
3330
3331 if (compare_range_bounds(partnatts, partsupfuncs, partcollations,
3332 left_lb, right_lb) > 0)
3333 *merged_lb = *left_lb;
3334 else
3335 *merged_lb = *right_lb;
3336 break;
3337

How about reusing ub_cmpval & lb_cmpval instead of calling
compare_range_bounds() inside get_merged_range_bounds(), like 0004 patch?

Apart from this, I would like to propose 0005 cleanup patch where I have
rearranged function arguments & code to make sure the arguments & the code

related to lower bound should come first and then the upper bound.

Regards,

Amul

P.S: As usual, reattaching 0001 & 0002 patches.

Attachment

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Etsuro Fujita

Date:

15 November 2019, 09:10:25

Hi Amul,

On Fri, Nov 15, 2019 at 2:21 PM amul sul <sulamul@gmail.com> wrote:
> Thank you Fujita san for the patch & the enhancements.  I am fine with your
> delta patch.

OK, I'll merge the delta patch with the main one in the next version,
if no objections from others.

> I would like to share some thought for other code:
>
> File: partbounds.c:
> 3298 static void
> 3299 get_merged_range_bounds(int partnatts, FmgrInfo *partsupfuncs,
> 3300                         Oid *partcollations, JoinType jointype,
> 3301                         PartitionRangeBound *left_lb,
> 3302                         PartitionRangeBound *left_ub,
> 3303                         PartitionRangeBound *right_lb,
> 3304                         PartitionRangeBound *right_ub,
> 3305                         PartitionRangeBound *merged_lb,
> 3306                         PartitionRangeBound *merged_ub)
>
> Can we rename these argument as inner_*  & outer_*  like we having for the
> functions, like 0003 patch?

+1  (Actually, I too was thinking about that.)

> File: partbounds.c:
> 3322
> 3323         case JOIN_INNER:
> 3324         case JOIN_SEMI:
> 3325             if (compare_range_bounds(partnatts, partsupfuncs, partcollations,
> 3326                                      left_ub, right_ub) < 0)
> 3327                 *merged_ub = *left_ub;
> 3328             else
> 3329                 *merged_ub = *right_ub;
> 3330
> 3331             if (compare_range_bounds(partnatts, partsupfuncs, partcollations,
> 3332                                      left_lb, right_lb) > 0)
> 3333                 *merged_lb = *left_lb;
> 3334             else
> 3335                 *merged_lb = *right_lb;
> 3336             break;
> 3337
>
> How about reusing ub_cmpval & lb_cmpval  instead of calling
> compare_range_bounds() inside get_merged_range_bounds(), like 0004 patch?

Good idea!  So, +1

> Apart from this, I would like to propose 0005 cleanup patch where I have
> rearranged function arguments & code to make sure the arguments & the code
> related to lower bound should come first and then the upper bound.

+1

I'll merge these changes as well into the main patch, if no objections
of others.

Thanks for the review and patches!

Best regards,
Etsuro Fujita

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Etsuro Fujita

Date:

22 November 2019, 13:08:15

On Fri, Nov 15, 2019 at 6:10 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> On Fri, Nov 15, 2019 at 2:21 PM amul sul <sulamul@gmail.com> wrote:
> > Thank you Fujita san for the patch & the enhancements.  I am fine with your
> > delta patch.
>
> OK, I'll merge the delta patch with the main one in the next version,
> if no objections from others.
>
> > I would like to share some thought for other code:
> >
> > File: partbounds.c:
> > 3298 static void
> > 3299 get_merged_range_bounds(int partnatts, FmgrInfo *partsupfuncs,
> > 3300                         Oid *partcollations, JoinType jointype,
> > 3301                         PartitionRangeBound *left_lb,
> > 3302                         PartitionRangeBound *left_ub,
> > 3303                         PartitionRangeBound *right_lb,
> > 3304                         PartitionRangeBound *right_ub,
> > 3305                         PartitionRangeBound *merged_lb,
> > 3306                         PartitionRangeBound *merged_ub)
> >
> > Can we rename these argument as inner_*  & outer_*  like we having for the
> > functions, like 0003 patch?
>
> +1  (Actually, I too was thinking about that.)
>
> > File: partbounds.c:
> > 3322
> > 3323         case JOIN_INNER:
> > 3324         case JOIN_SEMI:
> > 3325             if (compare_range_bounds(partnatts, partsupfuncs, partcollations,
> > 3326                                      left_ub, right_ub) < 0)
> > 3327                 *merged_ub = *left_ub;
> > 3328             else
> > 3329                 *merged_ub = *right_ub;
> > 3330
> > 3331             if (compare_range_bounds(partnatts, partsupfuncs, partcollations,
> > 3332                                      left_lb, right_lb) > 0)
> > 3333                 *merged_lb = *left_lb;
> > 3334             else
> > 3335                 *merged_lb = *right_lb;
> > 3336             break;
> > 3337
> >
> > How about reusing ub_cmpval & lb_cmpval  instead of calling
> > compare_range_bounds() inside get_merged_range_bounds(), like 0004 patch?
>
> Good idea!  So, +1
>
> > Apart from this, I would like to propose 0005 cleanup patch where I have
> > rearranged function arguments & code to make sure the arguments & the code
> > related to lower bound should come first and then the upper bound.
>
> +1
>
> I'll merge these changes as well into the main patch, if no objections
> of others.

Done.  I modified compare_range_partitions() as well to match its the
variable names with others.  Attached is a new version of the patch.

I reviewed the rest of the partbounds.c changes.  Here are my review comments:

* I don't think this analysis is correct:

+/*
+ * merge_null_partitions
+ *
+ * Merge NULL partitions, i.e. a partition that can hold NULL values for a lis\
t
+ * partitioned table, if any. Find the index of merged partition to which the
+ * NULL values would belong in the join result. If one joining relation has a
+ * NULL partition but not the other, try matching it with the default partitio\
n
+ * from the other relation since the default partition may have rows with NULL
+ * partition key. We can eliminate a NULL partition when it appears only on th\
e
+ * inner side of the join and the outer side doesn't have a default partition.
+ *
+ * When the equality operator used for join is strict, two NULL values will no\
t
+ * be considered as equal, and thus a NULL partition can be eliminated for an
+ * inner join. But we don't check the strictness operator here.
+ */

First of all, I think we can assume here that the equality operator is
*strict*, because 1) have_partkey_equi_join(), which is executed
before calling merge_null_partitions(), requires that the
corresponding clause is mergejoinable, and 2) currently, we assume
that mergejoinable operators are strict (see MJEvalOuterValues() and
MJEvalInnerValues()).  So I don't think it's needed to try merging a
NULL partition on one side with the default partition on the other
side as above.  (merge_null_partitions() tries merging NULL partitions
as well, but in some cases I don't think we need to do that, either.)
So I rewrote merge_null_partitions() completely.  Another change is
that I eliminated the NULL partition of the joinrel in more cases.
Attached is a patch (v28-0002-modify-merge_null_partitions.patch) for
that created on top of the main patch.  I might be missing something,
though.

Other changes in that patch:

* I fixed memory leaks in partition_list_bounds_merge() and
partition_range_bounds_merge().

* I noticed this in merge_default_partitions():

+       Assert(outer_has_default && inner_has_default);
+
+       *default_index = map_and_merge_partitions(outer_map,
+                                                 inner_map,
+                                                 outer_default,
+                                                 inner_default,
+                                                 next_index);
+       if (*default_index == -1)
+           return false;

I think the merging should be done successfully, because of 1) this in
process_outer_partition():

+   if (inner_has_default)
+   {
+       Assert(inner_default >= 0);
+
+       /*
+        * If the outer side has the default partition as well, we need to
+        * merge the default partitions (see merge_default_partitions()); give
+        * up on it.
+        */
+       if (outer_has_default)
+           return false;

and 2) this in process_inner_partition():

+   if (outer_has_default)
+   {
+       Assert(outer_default >= 0);
+
+       /*
+        * If the inner side has the default partition as well, we need to
+        * merge the default partitions (see merge_default_partitions()); give
+        * up on it.
+        */
+       if (inner_has_default)
+           return false;

So I removed the above if test (ie, *default_index == -1).  I also
modified that function a bit further, including comments.

* I simplified process_outer_partition() and process_inner_partition()
a bit, changing the APIs to match that of map_and_merge_partitions().
Also, I think this in these functions is a bit ugly;

+           /* Don't add this index to the list of merged indexes. */
+           *merged_index = -1;

so I removed it, and modified the condition on whether or not we add
merged bounds to the lists in partition_list_bounds_merge() and
partition_range_bounds_merge(), instead.

That's it.  Sorry for the delay.

Best regards,
Etsuro Fujita

Attachment

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Etsuro Fujita

Date:

29 November 2019, 03:45:35

On Fri, Nov 22, 2019 at 10:08 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> Done.  I modified compare_range_partitions() as well to match its the
> variable names with others.  Attached is a new version of the patch.
>
> I reviewed the rest of the partbounds.c changes.  Here are my review comments:
>
> * I don't think this analysis is correct:
>
> +/*
> + * merge_null_partitions
> + *
> + * Merge NULL partitions, i.e. a partition that can hold NULL values for a lis\
> t
> + * partitioned table, if any. Find the index of merged partition to which the
> + * NULL values would belong in the join result. If one joining relation has a
> + * NULL partition but not the other, try matching it with the default partitio\
> n
> + * from the other relation since the default partition may have rows with NULL
> + * partition key. We can eliminate a NULL partition when it appears only on th\
> e
> + * inner side of the join and the outer side doesn't have a default partition.
> + *
> + * When the equality operator used for join is strict, two NULL values will no\
> t
> + * be considered as equal, and thus a NULL partition can be eliminated for an
> + * inner join. But we don't check the strictness operator here.
> + */
>
> First of all, I think we can assume here that the equality operator is
> *strict*, because 1) have_partkey_equi_join(), which is executed
> before calling merge_null_partitions(), requires that the
> corresponding clause is mergejoinable, and 2) currently, we assume
> that mergejoinable operators are strict (see MJEvalOuterValues() and
> MJEvalInnerValues()).  So I don't think it's needed to try merging a
> NULL partition on one side with the default partition on the other
> side as above.  (merge_null_partitions() tries merging NULL partitions
> as well, but in some cases I don't think we need to do that, either.)
> So I rewrote merge_null_partitions() completely.  Another change is
> that I eliminated the NULL partition of the joinrel in more cases.
> Attached is a patch (v28-0002-modify-merge_null_partitions.patch) for
> that created on top of the main patch.  I might be missing something,
> though.
>
> Other changes in that patch:
>
> * I fixed memory leaks in partition_list_bounds_merge() and
> partition_range_bounds_merge().
>
> * I noticed this in merge_default_partitions():
>
> +       Assert(outer_has_default && inner_has_default);
> +
> +       *default_index = map_and_merge_partitions(outer_map,
> +                                                 inner_map,
> +                                                 outer_default,
> +                                                 inner_default,
> +                                                 next_index);
> +       if (*default_index == -1)
> +           return false;
>
> I think the merging should be done successfully, because of 1) this in
> process_outer_partition():
>
> +   if (inner_has_default)
> +   {
> +       Assert(inner_default >= 0);
> +
> +       /*
> +        * If the outer side has the default partition as well, we need to
> +        * merge the default partitions (see merge_default_partitions()); give
> +        * up on it.
> +        */
> +       if (outer_has_default)
> +           return false;
>
> and 2) this in process_inner_partition():
>
> +   if (outer_has_default)
> +   {
> +       Assert(outer_default >= 0);
> +
> +       /*
> +        * If the inner side has the default partition as well, we need to
> +        * merge the default partitions (see merge_default_partitions()); give
> +        * up on it.
> +        */
> +       if (inner_has_default)
> +           return false;
>
> So I removed the above if test (ie, *default_index == -1).  I also
> modified that function a bit further, including comments.
>
> * I simplified process_outer_partition() and process_inner_partition()
> a bit, changing the APIs to match that of map_and_merge_partitions().
> Also, I think this in these functions is a bit ugly;
>
> +           /* Don't add this index to the list of merged indexes. */
> +           *merged_index = -1;
>
> so I removed it, and modified the condition on whether or not we add
> merged bounds to the lists in partition_list_bounds_merge() and
> partition_range_bounds_merge(), instead.

Moved to the next CF.

Best regards,
Etsuro Fujita

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

amul sul

Date:

09 December 2019, 09:38:53

Attached is the rebase version atop the latest master head(2d0fdfaccec).

Regards,

Amul

Attachment

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

amul sul

Date:

10 December 2019, 06:49:08

On Mon, Dec 9, 2019 at 3:08 PM amul sul <sulamul@gmail.com> wrote:

Attached is the rebase version atop the latest master head(2d0fdfaccec).

Hi Fujita san,

I have been through your changes proposed in [1] -- the changes make sense to me &

I didn't see any unobvious behaviour in testing as well.

Regards,

Amul

1] https://postgr.es/m/CAPmGK15kXh76EnRn=B64u=+qLxZOKWROd4uMjMBnWcsZPdQ_mw@mail.gmail.com

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Etsuro Fujita

Date:

10 December 2019, 10:30:34

Hi Amul,

On Tue, Dec 10, 2019 at 3:49 PM amul sul <sulamul@gmail.com> wrote:
> On Mon, Dec 9, 2019 at 3:08 PM amul sul <sulamul@gmail.com> wrote:
>> Attached is the rebase version atop the latest master head(2d0fdfaccec).

Thanks for that!

> I have been through your changes proposed in [1] -- the changes make sense to me &
> I didn't see any unobvious behaviour in testing as well.

Thanks for reviewing!  I'll merge the changes into the main patch,
then.  I don't see any issues in the latest version, but I think we
need to polish the patch, so I'll do that.

Best regards,
Etsuro Fujita

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Etsuro Fujita

Date:

15 January 2020, 14:41:00

On Tue, Dec 10, 2019 at 7:30 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> I don't see any issues in the latest version, but I think we
> need to polish the patch, so I'll do that.

I noticed some issues.  :-(  I think we should address it before
polishing the patch.  One thing I noticed is: the patch heavily
modifies the existing test cases in partition_join.sql to test the new
partition-matching algorithm, but I think we should leave those test
cases alone because we would handle the exiting test cases (except one
negative test case) as before (see the  try_partitionwise_join()
change in the patch), so those test cases would be still needed to
test that.  Attached is a proposed patch for that
(v30-0001-Improve-partition-matching-for-partitionwise-join.patch)
that 1) avoids modifying the existing test cases and 2) adds a
slightly modified version of the test cases proposed in the previous
patch to test the new algorithm.  Though I omitted some test cases
that seem redundant to me and added a bit more test cases involving
NULL partitions and/or default partitions.  The elapsed time to run
the partition_join.sql regression test increased from 741 ms (HEAD) to
1086 ms in my environment, but I think that would be acceptable.  I
fixed one white space issue, but other than that, no code/comment
changes.

Another thing I noticed while working on the above is: the patch fails
to apply PWJ to this case:

CREATE TABLE plt1_ad (a int, b int, c text) PARTITION BY LIST (c);
CREATE TABLE plt1_ad_p1 PARTITION OF plt1_ad FOR VALUES IN ('0001', '0003');
CREATE TABLE plt1_ad_p2 PARTITION OF plt1_ad FOR VALUES IN ('0004', '0006');
CREATE TABLE plt1_ad_p3 PARTITION OF plt1_ad FOR VALUES IN ('0008', '0009');
CREATE TABLE plt1_ad_extra PARTITION OF plt1_ad FOR VALUES IN (NULL);
INSERT INTO plt1_ad SELECT i, i, to_char(i % 10, 'FM0000') FROM
generate_series(1, 299) i WHERE i % 10 NOT IN (0, 2, 5, 7);
INSERT INTO plt1_ad VALUES (-1, -1, NULL);
ANALYZE plt1_ad;
CREATE TABLE plt2_ad (a int, b int, c text) PARTITION BY LIST (c);
CREATE TABLE plt2_ad_p1 PARTITION OF plt2_ad FOR VALUES IN ('0002', '0003');
CREATE TABLE plt2_ad_p2 PARTITION OF plt2_ad FOR VALUES IN ('0004', '0006');
CREATE TABLE plt2_ad_p3 PARTITION OF plt2_ad FOR VALUES IN ('0007', '0009');
CREATE TABLE plt2_ad_extra PARTITION OF plt2_ad FOR VALUES IN (NULL);
INSERT INTO plt2_ad SELECT i, i, to_char(i % 10, 'FM0000') FROM
generate_series(1, 299) i WHERE i % 10 NOT IN (0, 1, 5, 8);
INSERT INTO plt2_ad VALUES (-1, -1, NULL);
ANALYZE plt2_ad;

EXPLAIN (COSTS OFF)
SELECT t1.a, t1.c, t2.a, t2.c FROM plt1_ad t1 LEFT JOIN plt2_ad t2 ON
(t1.a = t2.a AND t1.c = t2.c) WHERE t1.b < 10 ORDER BY t1.a;
                       QUERY PLAN
--------------------------------------------------------
 Sort
   Sort Key: t1.a
   ->  Hash Right Join
         Hash Cond: ((t2.a = t1.a) AND (t2.c = t1.c))
         ->  Append
               ->  Seq Scan on plt2_ad_p1 t2_1
               ->  Seq Scan on plt2_ad_p2 t2_2
               ->  Seq Scan on plt2_ad_p3 t2_3
               ->  Seq Scan on plt2_ad_extra t2_4
         ->  Hash
               ->  Append
                     ->  Seq Scan on plt1_ad_p1 t1_1
                           Filter: (b < 10)
                     ->  Seq Scan on plt1_ad_p2 t1_2
                           Filter: (b < 10)
                     ->  Seq Scan on plt1_ad_p3 t1_3
                           Filter: (b < 10)
                     ->  Seq Scan on plt1_ad_extra t1_4
                           Filter: (b < 10)
(19 rows)

because merge_null_partitions() does not consider matching the NULL
partitions from both sides, but matches the NULL partition on the
plt1_ad side and a dummy partition, resulting in a non-PWJ plan (see
[1]).  I overlooked this case when modifying that function.  :-(
Another patch attached to fix this issue
(v30-0002-Fix-handling-of-NULL-partitions.patch).  (We would not need
to fix this, if we could handle the case where a dummy partition is on
the nullable side of an outer join [1], but we can't, so I think it
would be a good idea at least for now to match the NULL partitions
from both sides to do PWJ.)

Best regards,
Etsuro Fujita

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=7ad6498fd5a654de6e743814c36cf619a3b5ddb6

Attachment

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Mark Dilger

Date:

28 January 2020, 04:39:03

> On Tue, Dec 10, 2019 at 7:30 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
>> I don't see any issues in the latest version, but I think we
>> need to polish the patch, so I'll do that.
>
> I noticed some issues.  :-(  I think we should address it before
> polishing the patch.  One thing I noticed is: the patch heavily
> modifies the existing test cases in partition_join.sql to test the new
> partition-matching algorithm, but I think we should leave those test
> cases alone because we would handle the exiting test cases (except one
> negative test case) as before (see the  try_partitionwise_join()
> change in the patch), so those test cases would be still needed to
> test that.  Attached is a proposed patch for that
> (v30-0001-Improve-partition-matching-for-partitionwise-join.patch)
> that 1) avoids modifying the existing test cases and 2) adds a
> slightly modified version of the test cases proposed in the previous
> patch to test the new algorithm.  Though I omitted some test cases
> that seem redundant to me and added a bit more test cases involving
> NULL partitions and/or default partitions.  The elapsed time to run
> the partition_join.sql regression test increased from 741 ms (HEAD) to
> 1086 ms in my environment, but I think that would be acceptable.  I
> fixed one white space issue, but other than that, no code/comment
> changes.
>
> Another thing I noticed while working on the above is: the patch fails
> to apply PWJ to this case:
>
> CREATE TABLE plt1_ad (a int, b int, c text) PARTITION BY LIST (c);
> CREATE TABLE plt1_ad_p1 PARTITION OF plt1_ad FOR VALUES IN ('0001', '0003');
> CREATE TABLE plt1_ad_p2 PARTITION OF plt1_ad FOR VALUES IN ('0004', '0006');
> CREATE TABLE plt1_ad_p3 PARTITION OF plt1_ad FOR VALUES IN ('0008', '0009');
> CREATE TABLE plt1_ad_extra PARTITION OF plt1_ad FOR VALUES IN (NULL);
> INSERT INTO plt1_ad SELECT i, i, to_char(i % 10, 'FM0000') FROM
> generate_series(1, 299) i WHERE i % 10 NOT IN (0, 2, 5, 7);
> INSERT INTO plt1_ad VALUES (-1, -1, NULL);
> ANALYZE plt1_ad;
> CREATE TABLE plt2_ad (a int, b int, c text) PARTITION BY LIST (c);
> CREATE TABLE plt2_ad_p1 PARTITION OF plt2_ad FOR VALUES IN ('0002', '0003');
> CREATE TABLE plt2_ad_p2 PARTITION OF plt2_ad FOR VALUES IN ('0004', '0006');
> CREATE TABLE plt2_ad_p3 PARTITION OF plt2_ad FOR VALUES IN ('0007', '0009');
> CREATE TABLE plt2_ad_extra PARTITION OF plt2_ad FOR VALUES IN (NULL);
> INSERT INTO plt2_ad SELECT i, i, to_char(i % 10, 'FM0000') FROM
> generate_series(1, 299) i WHERE i % 10 NOT IN (0, 1, 5, 8);
> INSERT INTO plt2_ad VALUES (-1, -1, NULL);
> ANALYZE plt2_ad;
>
> EXPLAIN (COSTS OFF)
> SELECT t1.a, t1.c, t2.a, t2.c FROM plt1_ad t1 LEFT JOIN plt2_ad t2 ON
> (t1.a = t2.a AND t1.c = t2.c) WHERE t1.b < 10 ORDER BY t1.a;
>                       QUERY PLAN
> --------------------------------------------------------
> Sort
>   Sort Key: t1.a
>   ->  Hash Right Join
>         Hash Cond: ((t2.a = t1.a) AND (t2.c = t1.c))
>         ->  Append
>               ->  Seq Scan on plt2_ad_p1 t2_1
>               ->  Seq Scan on plt2_ad_p2 t2_2
>               ->  Seq Scan on plt2_ad_p3 t2_3
>               ->  Seq Scan on plt2_ad_extra t2_4
>         ->  Hash
>               ->  Append
>                     ->  Seq Scan on plt1_ad_p1 t1_1
>                           Filter: (b < 10)
>                     ->  Seq Scan on plt1_ad_p2 t1_2
>                           Filter: (b < 10)
>                     ->  Seq Scan on plt1_ad_p3 t1_3
>                           Filter: (b < 10)
>                     ->  Seq Scan on plt1_ad_extra t1_4
>                           Filter: (b < 10)
> (19 rows)
>
> because merge_null_partitions() does not consider matching the NULL
> partitions from both sides, but matches the NULL partition on the
> plt1_ad side and a dummy partition, resulting in a non-PWJ plan (see
> [1]).  I overlooked this case when modifying that function.  :-(
> Another patch attached to fix this issue
> (v30-0002-Fix-handling-of-NULL-partitions.patch).  (We would not need
> to fix this, if we could handle the case where a dummy partition is on
> the nullable side of an outer join [1], but we can't, so I think it
> would be a good idea at least for now to match the NULL partitions
> from both sides to do PWJ.)
>
> Best regards,
> Etsuro Fujita
>
> [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=7ad6498fd5a654de6e743814c36cf619a3b5ddb6
>
> <v30-0001-Improve-partition-matching-for-partitionwise-join.patch><v30-0002-Fix-handling-of-NULL-partitions.patch>

Fujita-san,

With respect to these two patches:  They apply, compile, and pass all the regression tests.  The code looks reasonable.

There is stray whitespace in v30-0002:

src/backend/partitioning/partbounds.c:4557: space before tab in indent.
+             outer_null_unmerged = true;

I have added tests checking correctness and showing some partition pruning limitations.  Find three patches, attached.

The v31-0001-… patch merely applies your patches as a starting point for the next two patches.  It is your work, not
mine.

The v31-0002-… patch adds more regression tests for range partitioning.  The commit message contains my comments about
that.

The v31-0003-… patch adds more regression tests for list partitioning, and again, the commit message contains my
commentsabout that. 

I hope this review is helpful.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Etsuro Fujita

Date:

29 January 2020, 12:15:17

Hi Mark,

On Tue, Jan 28, 2020 at 1:39 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
> There is stray whitespace in v30-0002:
>
> src/backend/partitioning/partbounds.c:4557: space before tab in indent.
> +                       outer_null_unmerged = true;

Good catch!

> I have added tests checking correctness and showing some partition pruning limitations.  Find three patches,
attached.
>
> The v31-0001-… patch merely applies your patches as a starting point for the next two patches.  It is your work, not
mine.
>
> The v31-0002-… patch adds more regression tests for range partitioning.  The commit message contains my comments
aboutthat. 
>
> The v31-0003-… patch adds more regression tests for list partitioning, and again, the commit message contains my
commentsabout that. 

The PWJ behavior shown by the tests you added looks interesting!  I'll
dig into it more closely.  Thanks for the patches and review!

Best regards,
Etsuro Fujita

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Etsuro Fujita

Date:

05 February 2020, 12:51:13

On Wed, Jan 29, 2020 at 9:15 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> On Tue, Jan 28, 2020 at 1:39 PM Mark Dilger
> <mark.dilger@enterprisedb.com> wrote:
> > I have added tests checking correctness and showing some partition pruning limitations.  Find three patches,
attached.
> >
> > The v31-0001-… patch merely applies your patches as a starting point for the next two patches.  It is your work,
notmine. 
> >
> > The v31-0002-… patch adds more regression tests for range partitioning.  The commit message contains my comments
aboutthat. 
> >
> > The v31-0003-… patch adds more regression tests for list partitioning, and again, the commit message contains my
commentsabout that. 
>
> I'll dig into it more closely.

I tested the new test patches.  The patches are applied to the
partition_join.sql regression test cleanly, but the new tests failed
in my environment (even with make check LANG=C).  I think I should set
the right locale for the testing.  Which one did you use?  Maybe I'm
missing something, but let me comment on the new tests.  This is the
one proposed in the v31-0002 patch:

+EXPLAIN (COSTS OFF)
+SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a)
WHERE t1.a IN ('äbç', 'ὀδυσσεύς');
+                            QUERY PLAN
+------------------------------------------------------------------
+ Hash Join
+   Hash Cond: (t2.a = t1.a)
+   ->  Append
+         ->  Seq Scan on beta_a t2_1
+         ->  Seq Scan on beta_b t2_2
+         ->  Seq Scan on beta_c t2_3
+         ->  Seq Scan on beta_d t2_4
+         ->  Seq Scan on beta_e t2_5
+         ->  Seq Scan on beta_f t2_6
+         ->  Seq Scan on beta_g t2_7
+         ->  Seq Scan on beta_h t2_8
+         ->  Seq Scan on beta_default t2_9
+   ->  Hash
+         ->  Append
+               ->  Seq Scan on alpha_e t1_1
+                     Filter: (a = ANY ('{äbç,ὀδυσσεύς}'::text[]))
+               ->  Seq Scan on alpha_default t1_2
+                     Filter: (a = ANY ('{äbç,ὀδυσσεύς}'::text[]))
+(18 rows)

The commit message says:

    When joining with

        alpha.a = beta.a and alpha.a IN ('äbç', 'ὀδυσσεύς')

    the planner does the right thing for one side of the query, but
    hits all partitions for the other side, which it doesn't need to
    do.

Yeah, I agree that the resulting plan is not efficient.  The reason
for that would be that the planner doesn't generate a qual on the
outer side matching the ScalarArrayOpExpr qual "a = ANY
('{äbç,ὀδυσσεύς}'::text[])" on the inner side, which I think would be
a restriction caused by the equivalence machinery not by the
partitionwise join logic IIUC.  I think the critique would be useful,
so I don't object to adding this test case, but the critique would be
more about query planning that is actually not related to the
partitionwise join logic, so I'm not sure that the partition_join.sql
regression test is the best place to add it.  I guess that there would
be much better places than partition_join.sql.  (This is nitpicking;
but another thing I noticed about this test case is that the join
query contains only a single join condition "t1.a = t2.a", but the
queried tables alpha and beta are range-partitioned by multiple
columns a and b, so the query should have a join condition for each of
the columns like "t1.a = t2.a AND t1.b = t2.b" if adding this as a
test case for partitionwise join.)  I feel almost the same to other
test cases in the patch (and the v31-0003 patch), except this one
proposed in the v31-0003 patch:

+CREATE TABLE alpha (a TEXT) PARTITION BY LIST(a);
+CREATE TABLE alpha_a PARTITION OF alpha FOR VALUES IN ('Turkiye', 'TURKIYE');
+CREATE TABLE alpha_b PARTITION OF alpha FOR VALUES IN ('b?t', 'BIT');
+CREATE TABLE alpha_c PARTITION OF alpha FOR VALUES IN ('abc', 'ABC');
+CREATE TABLE alpha_d PARTITION OF alpha FOR VALUES IN ('aaa', 'cote', 'Gotz');
+CREATE TABLE alpha_e PARTITION OF alpha FOR VALUES IN ('?δυσσε??', '?ΔΥΣΣΕ?Σ');
+CREATE TABLE alpha_f PARTITION OF alpha FOR VALUES IN ('を読み取り用',
'にオープンできませんでした', NULL);
+CREATE TABLE alpha_default PARTITION OF alpha DEFAULT;
+CREATE TABLE beta (a TEXT) PARTITION BY LIST(a);
+CREATE TABLE beta_a PARTITION OF beta FOR VALUES IN ('Turkiye',
'cote', '?ΔΥΣΣΕ?Σ');
+CREATE TABLE beta_b PARTITION OF beta FOR VALUES IN ('b?t', 'TURKIYE');
+CREATE TABLE beta_c PARTITION OF beta FOR VALUES IN ('abc', 'を読み取り用',
'にオープンできませんでした');
+CREATE TABLE beta_d PARTITION OF beta FOR VALUES IN ('aaa', 'Gotz',
'BIT', '?δυσσε??', 'ABC', NULL);
+CREATE TABLE beta_default PARTITION OF beta DEFAULT;

+EXPLAIN (COSTS OFF)
+SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a)
WHERE t1.a IS NULL;
+                QUERY PLAN
+-------------------------------------------
+ Hash Join
+   Hash Cond: (t2.a = t1.a)
+   ->  Append
+         ->  Seq Scan on beta_d t2_1
+         ->  Seq Scan on beta_b t2_2
+         ->  Seq Scan on beta_a t2_3
+         ->  Seq Scan on beta_c t2_4
+         ->  Seq Scan on beta_default t2_5
+   ->  Hash
+         ->  Seq Scan on alpha_f t1
+               Filter: (a IS NULL)
+(11 rows)

Which made me notice an issue in the partitionwise join logic:
partition_bounds_merge() assumes that all partitions of given
relations are always present; in other words, it doesn't consider
cases where some of the partitions have been pruned entirely.  :-(  If
that function considered the cases, the above query would use
partitionwise join, because alpha_f only matches beta_c.  I don't
think this is a bug, but it causes the planner not only to fail to
chose partitionwise join but to waste CPU cycles to process pruned
partitions, so I'll propose to address it.  Attached is a patch for
that (the v31-0004 patch) created on top of the main patch (the
v31-0001 patch), which is also attached.  With the attached, the plan
for the above query would be changed to something like this:

    EXPLAIN (COSTS OFF)
    SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a =
t2.a) WHERE t1.a IS NULL;
              QUERY PLAN
    ------------------------------
     Nested Loop
       Join Filter: (t1.a = t2.a)
       ->  Seq Scan on alpha_f t1
             Filter: (a IS NULL)
       ->  Seq Scan on beta_c t2
    (5 rows)

Thanks again, Mark!

Best regards,
Etsuro Fujita

Attachment

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Mark Dilger

Date:

05 February 2020, 17:03:43


> On Feb 5, 2020, at 4:51 AM, Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
>
> On Wed, Jan 29, 2020 at 9:15 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
>> On Tue, Jan 28, 2020 at 1:39 PM Mark Dilger
>> <mark.dilger@enterprisedb.com> wrote:
>>> I have added tests checking correctness and showing some partition pruning limitations.  Find three patches,
attached.
>>>
>>> The v31-0001-… patch merely applies your patches as a starting point for the next two patches.  It is your work,
notmine. 
>>>
>>> The v31-0002-… patch adds more regression tests for range partitioning.  The commit message contains my comments
aboutthat. 
>>>
>>> The v31-0003-… patch adds more regression tests for list partitioning, and again, the commit message contains my
commentsabout that. 
>>
>> I'll dig into it more closely.
>
> I tested the new test patches.  The patches are applied to the
> partition_join.sql regression test cleanly, but the new tests failed
> in my environment (even with make check LANG=C).  I think I should set
> the right locale for the testing.  Which one did you use?

I did not set a locale in the tests.  My environment has:

LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=


>  Maybe I'm
> missing something, but let me comment on the new tests.  This is the
> one proposed in the v31-0002 patch:
>
> +EXPLAIN (COSTS OFF)
> +SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a)
> WHERE t1.a IN ('äbç', 'ὀδυσσεύς');
> +                            QUERY PLAN
> +------------------------------------------------------------------
> + Hash Join
> +   Hash Cond: (t2.a = t1.a)
> +   ->  Append
> +         ->  Seq Scan on beta_a t2_1
> +         ->  Seq Scan on beta_b t2_2
> +         ->  Seq Scan on beta_c t2_3
> +         ->  Seq Scan on beta_d t2_4
> +         ->  Seq Scan on beta_e t2_5
> +         ->  Seq Scan on beta_f t2_6
> +         ->  Seq Scan on beta_g t2_7
> +         ->  Seq Scan on beta_h t2_8
> +         ->  Seq Scan on beta_default t2_9
> +   ->  Hash
> +         ->  Append
> +               ->  Seq Scan on alpha_e t1_1
> +                     Filter: (a = ANY ('{äbç,ὀδυσσεύς}'::text[]))
> +               ->  Seq Scan on alpha_default t1_2
> +                     Filter: (a = ANY ('{äbç,ὀδυσσεύς}'::text[]))
> +(18 rows)
>
> The commit message says:
>
>    When joining with
>
>        alpha.a = beta.a and alpha.a IN ('äbç', 'ὀδυσσεύς')
>
>    the planner does the right thing for one side of the query, but
>    hits all partitions for the other side, which it doesn't need to
>    do.
>
> Yeah, I agree that the resulting plan is not efficient.  The reason
> for that would be that the planner doesn't generate a qual on the
> outer side matching the ScalarArrayOpExpr qual "a = ANY
> ('{äbç,ὀδυσσεύς}'::text[])" on the inner side, which I think would be
> a restriction caused by the equivalence machinery not by the
> partitionwise join logic IIUC.

It’s fine if this is beyond the scope of the patch.

>  I think the critique would be useful,
> so I don't object to adding this test case, but the critique would be
> more about query planning that is actually not related to the
> partitionwise join logic, so I'm not sure that the partition_join.sql
> regression test is the best place to add it.  I guess that there would
> be much better places than partition_join.sql.

You don’t need to add the test anywhere.  It’s enough for me that you looked at it and considered whether the
partition-wisejoin patch should do anything differently in this case.  Again, it sounds like this is beyond the scope
ofthe patch. 

> (This is nitpicking;
> but another thing I noticed about this test case is that the join
> query contains only a single join condition "t1.a = t2.a", but the
> queried tables alpha and beta are range-partitioned by multiple
> columns a and b, so the query should have a join condition for each of
> the columns like "t1.a = t2.a AND t1.b = t2.b" if adding this as a
> test case for partitionwise join.)

Well, it is important that partition-wise join does not break such queries.  I added the column ‘b’ to the partitioning
logicto verify that did not confuse the code in your patch. 

> I feel almost the same to other
> test cases in the patch (and the v31-0003 patch), except this one
> proposed in the v31-0003 patch:
>
> +CREATE TABLE alpha (a TEXT) PARTITION BY LIST(a);
> +CREATE TABLE alpha_a PARTITION OF alpha FOR VALUES IN ('Turkiye', 'TURKIYE');
> +CREATE TABLE alpha_b PARTITION OF alpha FOR VALUES IN ('b?t', 'BIT');
> +CREATE TABLE alpha_c PARTITION OF alpha FOR VALUES IN ('abc', 'ABC');
> +CREATE TABLE alpha_d PARTITION OF alpha FOR VALUES IN ('aaa', 'cote', 'Gotz');
> +CREATE TABLE alpha_e PARTITION OF alpha FOR VALUES IN ('?δυσσε??', '?ΔΥΣΣΕ?Σ');
> +CREATE TABLE alpha_f PARTITION OF alpha FOR VALUES IN ('を読み取り用',
> 'にオープンできませんでした', NULL);
> +CREATE TABLE alpha_default PARTITION OF alpha DEFAULT;
> +CREATE TABLE beta (a TEXT) PARTITION BY LIST(a);
> +CREATE TABLE beta_a PARTITION OF beta FOR VALUES IN ('Turkiye',
> 'cote', '?ΔΥΣΣΕ?Σ');
> +CREATE TABLE beta_b PARTITION OF beta FOR VALUES IN ('b?t', 'TURKIYE');
> +CREATE TABLE beta_c PARTITION OF beta FOR VALUES IN ('abc', 'を読み取り用',
> 'にオープンできませんでした');
> +CREATE TABLE beta_d PARTITION OF beta FOR VALUES IN ('aaa', 'Gotz',
> 'BIT', '?δυσσε??', 'ABC', NULL);
> +CREATE TABLE beta_default PARTITION OF beta DEFAULT;
>
> +EXPLAIN (COSTS OFF)
> +SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a)
> WHERE t1.a IS NULL;
> +                QUERY PLAN
> +-------------------------------------------
> + Hash Join
> +   Hash Cond: (t2.a = t1.a)
> +   ->  Append
> +         ->  Seq Scan on beta_d t2_1
> +         ->  Seq Scan on beta_b t2_2
> +         ->  Seq Scan on beta_a t2_3
> +         ->  Seq Scan on beta_c t2_4
> +         ->  Seq Scan on beta_default t2_5
> +   ->  Hash
> +         ->  Seq Scan on alpha_f t1
> +               Filter: (a IS NULL)
> +(11 rows)
>
> Which made me notice an issue in the partitionwise join logic:
> partition_bounds_merge() assumes that all partitions of given
> relations are always present; in other words, it doesn't consider
> cases where some of the partitions have been pruned entirely.  :-(  If
> that function considered the cases, the above query would use
> partitionwise join, because alpha_f only matches beta_c.  I don't
> think this is a bug, but it causes the planner not only to fail to
> chose partitionwise join but to waste CPU cycles to process pruned
> partitions, so I'll propose to address it.  Attached is a patch for
> that (the v31-0004 patch) created on top of the main patch (the
> v31-0001 patch), which is also attached.  With the attached, the plan
> for the above query would be changed to something like this:
>
>    EXPLAIN (COSTS OFF)
>    SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a =
> t2.a) WHERE t1.a IS NULL;
>              QUERY PLAN
>    ------------------------------
>     Nested Loop
>       Join Filter: (t1.a = t2.a)
>       ->  Seq Scan on alpha_f t1
>             Filter: (a IS NULL)
>       ->  Seq Scan on beta_c t2
>    (5 rows)
>
> Thanks again, Mark!

Thank you for working on this important patch!

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Mark Dilger

Date:

05 February 2020, 18:55:32


> On Feb 5, 2020, at 4:51 AM, Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
>
> <v31-0001-Applying-Etsuro-Fujita-s-patches.patch><v31-0004-Consider-pruned-partitions.patch>


The patches apply and pass all tests.  A review of the patch vs. master looks reasonable.

The partition_join.sql test has multiple levels of partitioning, but when your patch extends that test with “advanced
partition-wisejoin”, none of the tables for the new section have multiple levels.  I spent a little while reviewing the
codeand inventing multiple level partitioning tests for advanced partition-wise join and did not encounter any
problems. I don’t care whether you use this particular example, but do you want to have multiple level partitioning in
thenew test section? 

CREATE TABLE alpha (a double precision, b double precision) PARTITION BY RANGE (a);
CREATE TABLE alpha_neg PARTITION OF alpha FOR VALUES FROM ('-Infinity') TO (0) PARTITION BY RANGE (b);
CREATE TABLE alpha_pos PARTITION OF alpha FOR VALUES FROM (0) TO ('Infinity') PARTITION BY RANGE (b);
CREATE TABLE alpha_nan PARTITION OF alpha FOR VALUES FROM ('Infinity') TO ('NaN');
CREATE TABLE alpha_neg_neg PARTITION OF alpha_neg FOR VALUES FROM ('-Infinity') TO (0);
CREATE TABLE alpha_neg_pos PARTITION OF alpha_neg FOR VALUES FROM (0) TO ('Infinity');
CREATE TABLE alpha_neg_nan PARTITION OF alpha_neg FOR VALUES FROM ('Infinity') TO ('NaN');
CREATE TABLE alpha_pos_neg PARTITION OF alpha_pos FOR VALUES FROM ('-Infinity') TO (0);
CREATE TABLE alpha_pos_pos PARTITION OF alpha_pos FOR VALUES FROM (0) TO ('Infinity');
CREATE TABLE alpha_pos_nan PARTITION OF alpha_pos FOR VALUES FROM ('Infinity') TO ('NaN');
INSERT INTO alpha (a, b)
    (SELECT * FROM
        (VALUES (-1.0::float8), (0.0::float8), (1.0::float8), ('Infinity'::float8)) a,
        (VALUES (-1.0::float8), (0.0::float8), (1.0::float8), ('Infinity'::float8)) b
    );
ANALYZE alpha;
ANALYZE alpha_neg;
ANALYZE alpha_pos;
ANALYZE alpha_nan;
ANALYZE alpha_neg_neg;
ANALYZE alpha_neg_pos;
ANALYZE alpha_neg_nan;
ANALYZE alpha_pos_neg;
ANALYZE alpha_pos_pos;
ANALYZE alpha_pos_nan;
CREATE TABLE beta (a double precision, b double precision) PARTITION BY RANGE (a, b);
CREATE TABLE beta_lo PARTITION OF beta FOR VALUES FROM (-5, -5) TO (0, 0);
CREATE TABLE beta_me PARTITION OF beta FOR VALUES FROM (0, 0) TO (0, 5);
CREATE TABLE beta_hi PARTITION OF beta FOR VALUES FROM (0, 5) TO (5, 5);
INSERT INTO beta (a, b)
    (SELECT * FROM
        (VALUES (-1.0::float8), (0.0::float8), (1.0::float8)) a,
        (VALUES (-1.0::float8), (0.0::float8), (1.0::float8)) b
    );
ANALYZE beta;
ANALYZE beta_lo;
ANALYZE beta_me;
ANALYZE beta_hi;
EXPLAIN SELECT * FROM alpha INNER JOIN beta ON (alpha.a = beta.a AND alpha.b = beta.b) WHERE alpha.a = 1 AND beta.b =
1;
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..2.11 rows=1 width=32)
   ->  Seq Scan on alpha_pos_pos alpha  (cost=0.00..1.06 rows=1 width=16)
         Filter: ((b = '1'::double precision) AND (a = '1'::double precision))
   ->  Seq Scan on beta_hi beta  (cost=0.00..1.04 rows=1 width=16)
         Filter: ((b = '1'::double precision) AND (a = '1'::double precision))
(5 rows)




—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Etsuro Fujita

Date:

07 February 2020, 12:53:24

On Thu, Feb 6, 2020 at 2:03 AM Mark Dilger <mark.dilger@enterprisedb.com> wrote:
> > On Feb 5, 2020, at 4:51 AM, Etsuro Fujita <etsuro.fujita@gmail.com> wrote:

> >> On Tue, Jan 28, 2020 at 1:39 PM Mark Dilger
> >> <mark.dilger@enterprisedb.com> wrote:
> >>> I have added tests checking correctness and showing some partition pruning limitations.  Find three patches,
attached.
> >>>
> >>> The v31-0001-… patch merely applies your patches as a starting point for the next two patches.  It is your work,
notmine. 
> >>>
> >>> The v31-0002-… patch adds more regression tests for range partitioning.  The commit message contains my comments
aboutthat. 
> >>>
> >>> The v31-0003-… patch adds more regression tests for list partitioning, and again, the commit message contains my
commentsabout that. 

> > I tested the new test patches.  The patches are applied to the
> > partition_join.sql regression test cleanly, but the new tests failed
> > in my environment (even with make check LANG=C).  I think I should set
> > the right locale for the testing.  Which one did you use?
>
> I did not set a locale in the tests.  My environment has:
>
> LANG="en_US.UTF-8"
> LC_COLLATE="en_US.UTF-8"
> LC_CTYPE="en_US.UTF-8"
> LC_MESSAGES="en_US.UTF-8"
> LC_MONETARY="en_US.UTF-8"
> LC_NUMERIC="en_US.UTF-8"
> LC_TIME="en_US.UTF-8"
> LC_ALL=

Thanks for the information!

> >  Maybe I'm
> > missing something, but let me comment on the new tests.  This is the
> > one proposed in the v31-0002 patch:
> >
> > +EXPLAIN (COSTS OFF)
> > +SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a)
> > WHERE t1.a IN ('äbç', 'ὀδυσσεύς');
> > +                            QUERY PLAN
> > +------------------------------------------------------------------
> > + Hash Join
> > +   Hash Cond: (t2.a = t1.a)
> > +   ->  Append
> > +         ->  Seq Scan on beta_a t2_1
> > +         ->  Seq Scan on beta_b t2_2
> > +         ->  Seq Scan on beta_c t2_3
> > +         ->  Seq Scan on beta_d t2_4
> > +         ->  Seq Scan on beta_e t2_5
> > +         ->  Seq Scan on beta_f t2_6
> > +         ->  Seq Scan on beta_g t2_7
> > +         ->  Seq Scan on beta_h t2_8
> > +         ->  Seq Scan on beta_default t2_9
> > +   ->  Hash
> > +         ->  Append
> > +               ->  Seq Scan on alpha_e t1_1
> > +                     Filter: (a = ANY ('{äbç,ὀδυσσεύς}'::text[]))
> > +               ->  Seq Scan on alpha_default t1_2
> > +                     Filter: (a = ANY ('{äbç,ὀδυσσεύς}'::text[]))
> > +(18 rows)
> >
> > The commit message says:
> >
> >    When joining with
> >
> >        alpha.a = beta.a and alpha.a IN ('äbç', 'ὀδυσσεύς')
> >
> >    the planner does the right thing for one side of the query, but
> >    hits all partitions for the other side, which it doesn't need to
> >    do.
> >
> > Yeah, I agree that the resulting plan is not efficient.  The reason
> > for that would be that the planner doesn't generate a qual on the
> > outer side matching the ScalarArrayOpExpr qual "a = ANY
> > ('{äbç,ὀδυσσεύς}'::text[])" on the inner side, which I think would be
> > a restriction caused by the equivalence machinery not by the
> > partitionwise join logic IIUC.
>
> It’s fine if this is beyond the scope of the patch.
>
> >  I think the critique would be useful,
> > so I don't object to adding this test case, but the critique would be
> > more about query planning that is actually not related to the
> > partitionwise join logic, so I'm not sure that the partition_join.sql
> > regression test is the best place to add it.  I guess that there would
> > be much better places than partition_join.sql.
>
> You don’t need to add the test anywhere.  It’s enough for me that you looked at it and considered whether the
partition-wisejoin patch should do anything differently in this case.  Again, it sounds like this is beyond the scope
ofthe patch. 

OK

> > (This is nitpicking;
> > but another thing I noticed about this test case is that the join
> > query contains only a single join condition "t1.a = t2.a", but the
> > queried tables alpha and beta are range-partitioned by multiple
> > columns a and b, so the query should have a join condition for each of
> > the columns like "t1.a = t2.a AND t1.b = t2.b" if adding this as a
> > test case for partitionwise join.)
>
> Well, it is important that partition-wise join does not break such queries.  I added the column ‘b’ to the
partitioninglogic to verify that did not confuse the code in your patch. 

OK, thanks for the testing!

Best regards,
Etsuro Fujita

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Etsuro Fujita

Date:

07 February 2020, 12:57:21

On Thu, Feb 6, 2020 at 3:55 AM Mark Dilger <mark.dilger@enterprisedb.com> wrote:
> The patches apply and pass all tests.  A review of the patch vs. master looks reasonable.

Thanks for the review!

> The partition_join.sql test has multiple levels of partitioning, but when your patch extends that test with “advanced
partition-wisejoin”, none of the tables for the new section have multiple levels.  I spent a little while reviewing the
codeand inventing multiple level partitioning tests for advanced partition-wise join and did not encounter any
problems. I don’t care whether you use this particular example, but do you want to have multiple level partitioning in
thenew test section? 

Yes, I do.

> CREATE TABLE alpha (a double precision, b double precision) PARTITION BY RANGE (a);
> CREATE TABLE alpha_neg PARTITION OF alpha FOR VALUES FROM ('-Infinity') TO (0) PARTITION BY RANGE (b);
> CREATE TABLE alpha_pos PARTITION OF alpha FOR VALUES FROM (0) TO ('Infinity') PARTITION BY RANGE (b);
> CREATE TABLE alpha_nan PARTITION OF alpha FOR VALUES FROM ('Infinity') TO ('NaN');
> CREATE TABLE alpha_neg_neg PARTITION OF alpha_neg FOR VALUES FROM ('-Infinity') TO (0);
> CREATE TABLE alpha_neg_pos PARTITION OF alpha_neg FOR VALUES FROM (0) TO ('Infinity');
> CREATE TABLE alpha_neg_nan PARTITION OF alpha_neg FOR VALUES FROM ('Infinity') TO ('NaN');
> CREATE TABLE alpha_pos_neg PARTITION OF alpha_pos FOR VALUES FROM ('-Infinity') TO (0);
> CREATE TABLE alpha_pos_pos PARTITION OF alpha_pos FOR VALUES FROM (0) TO ('Infinity');
> CREATE TABLE alpha_pos_nan PARTITION OF alpha_pos FOR VALUES FROM ('Infinity') TO ('NaN');
> INSERT INTO alpha (a, b)
>     (SELECT * FROM
>         (VALUES (-1.0::float8), (0.0::float8), (1.0::float8), ('Infinity'::float8)) a,
>         (VALUES (-1.0::float8), (0.0::float8), (1.0::float8), ('Infinity'::float8)) b
>     );
> ANALYZE alpha;
> ANALYZE alpha_neg;
> ANALYZE alpha_pos;
> ANALYZE alpha_nan;
> ANALYZE alpha_neg_neg;
> ANALYZE alpha_neg_pos;
> ANALYZE alpha_neg_nan;
> ANALYZE alpha_pos_neg;
> ANALYZE alpha_pos_pos;
> ANALYZE alpha_pos_nan;
> CREATE TABLE beta (a double precision, b double precision) PARTITION BY RANGE (a, b);
> CREATE TABLE beta_lo PARTITION OF beta FOR VALUES FROM (-5, -5) TO (0, 0);
> CREATE TABLE beta_me PARTITION OF beta FOR VALUES FROM (0, 0) TO (0, 5);
> CREATE TABLE beta_hi PARTITION OF beta FOR VALUES FROM (0, 5) TO (5, 5);
> INSERT INTO beta (a, b)
>     (SELECT * FROM
>         (VALUES (-1.0::float8), (0.0::float8), (1.0::float8)) a,
>         (VALUES (-1.0::float8), (0.0::float8), (1.0::float8)) b
>     );
> ANALYZE beta;
> ANALYZE beta_lo;
> ANALYZE beta_me;
> ANALYZE beta_hi;
> EXPLAIN SELECT * FROM alpha INNER JOIN beta ON (alpha.a = beta.a AND alpha.b = beta.b) WHERE alpha.a = 1 AND beta.b =
1;
>                                   QUERY PLAN
> -------------------------------------------------------------------------------
>  Nested Loop  (cost=0.00..2.11 rows=1 width=32)
>    ->  Seq Scan on alpha_pos_pos alpha  (cost=0.00..1.06 rows=1 width=16)
>          Filter: ((b = '1'::double precision) AND (a = '1'::double precision))
>    ->  Seq Scan on beta_hi beta  (cost=0.00..1.04 rows=1 width=16)
>          Filter: ((b = '1'::double precision) AND (a = '1'::double precision))
> (5 rows)

Hmm, I'm not sure this is a good test case for that, because this
result would be due to partition pruning applied to each side of the
join before considering partition-wise join; you could get the same
result even with enable_partitionwise_join=off.  I think it's
important that the partition-wise join logic doesn't break this query,
though.

Best regards,
Etsuro Fujita

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Etsuro Fujita

Date:

10 February 2020, 11:46:01

On Fri, Feb 7, 2020 at 9:57 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> On Thu, Feb 6, 2020 at 3:55 AM Mark Dilger <mark.dilger@enterprisedb.com> wrote:
> > The patches apply and pass all tests.  A review of the patch vs. master looks reasonable.

I've merged the patches.  Attached is a new version of the patch.

> > The partition_join.sql test has multiple levels of partitioning, but when your patch extends that test with
“advancedpartition-wise join”, none of the tables for the new section have multiple levels.  I spent a little while
reviewingthe code and inventing multiple level partitioning tests for advanced partition-wise join and did not
encounterany problems.  I don’t care whether you use this particular example, but do you want to have multiple level
partitioningin the new test section? 
>
> Yes, I do.
>
> > CREATE TABLE alpha (a double precision, b double precision) PARTITION BY RANGE (a);
> > CREATE TABLE alpha_neg PARTITION OF alpha FOR VALUES FROM ('-Infinity') TO (0) PARTITION BY RANGE (b);
> > CREATE TABLE alpha_pos PARTITION OF alpha FOR VALUES FROM (0) TO ('Infinity') PARTITION BY RANGE (b);
> > CREATE TABLE alpha_nan PARTITION OF alpha FOR VALUES FROM ('Infinity') TO ('NaN');
> > CREATE TABLE alpha_neg_neg PARTITION OF alpha_neg FOR VALUES FROM ('-Infinity') TO (0);
> > CREATE TABLE alpha_neg_pos PARTITION OF alpha_neg FOR VALUES FROM (0) TO ('Infinity');
> > CREATE TABLE alpha_neg_nan PARTITION OF alpha_neg FOR VALUES FROM ('Infinity') TO ('NaN');
> > CREATE TABLE alpha_pos_neg PARTITION OF alpha_pos FOR VALUES FROM ('-Infinity') TO (0);
> > CREATE TABLE alpha_pos_pos PARTITION OF alpha_pos FOR VALUES FROM (0) TO ('Infinity');
> > CREATE TABLE alpha_pos_nan PARTITION OF alpha_pos FOR VALUES FROM ('Infinity') TO ('NaN');
> > INSERT INTO alpha (a, b)
> >     (SELECT * FROM
> >         (VALUES (-1.0::float8), (0.0::float8), (1.0::float8), ('Infinity'::float8)) a,
> >         (VALUES (-1.0::float8), (0.0::float8), (1.0::float8), ('Infinity'::float8)) b
> >     );
> > ANALYZE alpha;
> > ANALYZE alpha_neg;
> > ANALYZE alpha_pos;
> > ANALYZE alpha_nan;
> > ANALYZE alpha_neg_neg;
> > ANALYZE alpha_neg_pos;
> > ANALYZE alpha_neg_nan;
> > ANALYZE alpha_pos_neg;
> > ANALYZE alpha_pos_pos;
> > ANALYZE alpha_pos_nan;
> > CREATE TABLE beta (a double precision, b double precision) PARTITION BY RANGE (a, b);
> > CREATE TABLE beta_lo PARTITION OF beta FOR VALUES FROM (-5, -5) TO (0, 0);
> > CREATE TABLE beta_me PARTITION OF beta FOR VALUES FROM (0, 0) TO (0, 5);
> > CREATE TABLE beta_hi PARTITION OF beta FOR VALUES FROM (0, 5) TO (5, 5);
> > INSERT INTO beta (a, b)
> >     (SELECT * FROM
> >         (VALUES (-1.0::float8), (0.0::float8), (1.0::float8)) a,
> >         (VALUES (-1.0::float8), (0.0::float8), (1.0::float8)) b
> >     );
> > ANALYZE beta;
> > ANALYZE beta_lo;
> > ANALYZE beta_me;
> > ANALYZE beta_hi;
> > EXPLAIN SELECT * FROM alpha INNER JOIN beta ON (alpha.a = beta.a AND alpha.b = beta.b) WHERE alpha.a = 1 AND beta.b
=1; 
> >                                   QUERY PLAN
> > -------------------------------------------------------------------------------
> >  Nested Loop  (cost=0.00..2.11 rows=1 width=32)
> >    ->  Seq Scan on alpha_pos_pos alpha  (cost=0.00..1.06 rows=1 width=16)
> >          Filter: ((b = '1'::double precision) AND (a = '1'::double precision))
> >    ->  Seq Scan on beta_hi beta  (cost=0.00..1.04 rows=1 width=16)
> >          Filter: ((b = '1'::double precision) AND (a = '1'::double precision))
> > (5 rows)
>
> Hmm, I'm not sure this is a good test case for that, because this
> result would be due to partition pruning applied to each side of the
> join before considering partition-wise join; you could get the same
> result even with enable_partitionwise_join=off.  I think it's
> important that the partition-wise join logic doesn't break this query,
> though.

I think this would be beyond the scope of the patch, so I added
different test cases that I think would be better as ones for
multi-level partitioning.

Thanks!

Best regards,
Etsuro Fujita

Attachment

v32-0001-Improve-partition-matching-for-partitionwise-join.patch

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Ashutosh Bapat

Date:

03 March 2020, 16:48:40

Hi Fujita-san,

I reviewed the patch. Except for the logic of matching the pairs of partitions from already merged partitions, I think the code changes are good. But there are several places where it needs further changes to comments. The attached patch has those. I have described some of them below.

+ * We can not perform partition-wise join if either of the joining
+ * relations is not partitioned.

We are consistently using partitionwise instead of partition-wise.

+ /*
+ * See if the partition bounds for inputs are exactly the same, in
+ * which case we don't need to work hard: partitions with the same
+ * partition indexes will form join pairs, and the join rel will have
+ * the same partition bounds as inputs; otherwise try to merge the
+ * partition bounds along with generating join pairs.

Phrase "joining relations" is better than "inputs", IMO. Updated in the
attached patch.

+ /*
+ * If the partition bounds for the join rel are not merged ones,
+ * inputs are guaranteed to have the same partition bounds, so
+ * partitions with the same partition indexes will form join pairs;
+ * else let get_matching_part_pairs() do the work.
+ */
+ if (joinrel->merged)
+ {

This condition in the comment is opposite to the condition being checked in
code, so looks confusing. BTW this comment is also repeated around line 1405.
See attached patch for correction.

+ /*
+ * If this segment of the join is empty, it means that this segment

"partition of the join" looks consistent with other usages than "segment of the
join". Modified in the attached patch.

+ /*
+ * Get a relids set of partition(s) involved in this join segment that
+ * are from the rel1 side.
+ */
+ child_relids1 = bms_intersect(child_joinrel->relids,
+ rel1->all_partrels);

The partition bounds are sorted by their values. Even for a list partitioned
table, the bounds are sorted by the least partition value. We do not allow
multiple paritions from one side to be joined with one partition on the other
and vice versa. All this together means that the partitions of the join
relation are formed by joining partitions from either side in the order of
their bounds. This means that the matching pairs of partitions can be found by
matching relids of partitions of join with those of the joining relation by
traversing partitions from all the three relations only once in the order they
appears in partition bounds of corresponding relations. If we use this
algorithm, we don't need all_partrels to be collected and also don't need to
search base or join relation. That, I think, will reduce the time and space
complexity of this logic. Am I missing something? As a side effect it won't
require any special handling for base and join relation.

+ /*
+ * Get a child rel for rel1 with the relids. Note that we should have
+ * the child rel even if rel1 is a join rel, because in that case the
+ * partitions specified in the relids would have matching/overlapping
+ * boundaries, so those partitions should be considered as ones to be
+ * joined even when planning partitionwise joins of rel1, meaning that
+ * the child rel would have been built by the time we get here.
+ */
+ if (rel1_is_simple)

This variable is used only in one place. So instead we should the expression
assigning the value to it. Changed in the attached patch.

- rel->nparts = 0;
+ rel->nparts = -1;

I think we need to add comments around various values that nparts can take. How
about like something attached.

+ case PARTITION_STRATEGY_HASH:
+ merged_bounds = NULL;

I think, we need to explain why we aren't merging hash partition bounds. AFAIU,
the reason is thus: When the modulo of used for partition mapping (i.e. maximum
number of has partitions) is same, their partition bounds are same and do not
need merging. If the maximum number of partitions is different for both the
joining relations, there's high probability that one partition on one side will
join with multiple partitions on the other side. So exact partition bounds
match will work in most of the cases. The cases otherwise are not so common to
spend the effort in coding and planning.

I have added this explanation in the patch. Don't know if it's there written
somewhere already.

+ if (part_index == -1)
+ return -1;
+ } while (is_dummy_partition(rel, part_index));

I understand why we are skipping NULL positions. I am not sure why are we are
skipping over RelOptInfos which exist but are marked as dummy; we can still
create a join pair with those partitions.

+/*
+ * get_merged_range_bounds
+ * Given the bounds of range partitions to be join, determine the range

s/join/joined/

There are more changes to comments, where I thought that the comments are
required or existing comments need more clarification. Please review the
attached patch. This patch is created on top of
v32-0001-Improve-partition-matching-for-partitionwise-join.

On Mon, Feb 10, 2020 at 5:14 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:

On Fri, Feb 7, 2020 at 9:57 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> On Thu, Feb 6, 2020 at 3:55 AM Mark Dilger <mark.dilger@enterprisedb.com> wrote:
> > The patches apply and pass all tests. A review of the patch vs. master looks reasonable.

I've merged the patches. Attached is a new version of the patch.

> > The partition_join.sql test has multiple levels of partitioning, but when your patch extends that test with “advanced partition-wise join”, none of the tables for the new section have multiple levels. I spent a little while reviewing the code and inventing multiple level partitioning tests for advanced partition-wise join and did not encounter any problems. I don’t care whether you use this particular example, but do you want to have multiple level partitioning in the new test section?
>
> Yes, I do.
>
> > CREATE TABLE alpha (a double precision, b double precision) PARTITION BY RANGE (a);
> > CREATE TABLE alpha_neg PARTITION OF alpha FOR VALUES FROM ('-Infinity') TO (0) PARTITION BY RANGE (b);
> > CREATE TABLE alpha_pos PARTITION OF alpha FOR VALUES FROM (0) TO ('Infinity') PARTITION BY RANGE (b);
> > CREATE TABLE alpha_nan PARTITION OF alpha FOR VALUES FROM ('Infinity') TO ('NaN');
> > CREATE TABLE alpha_neg_neg PARTITION OF alpha_neg FOR VALUES FROM ('-Infinity') TO (0);
> > CREATE TABLE alpha_neg_pos PARTITION OF alpha_neg FOR VALUES FROM (0) TO ('Infinity');
> > CREATE TABLE alpha_neg_nan PARTITION OF alpha_neg FOR VALUES FROM ('Infinity') TO ('NaN');
> > CREATE TABLE alpha_pos_neg PARTITION OF alpha_pos FOR VALUES FROM ('-Infinity') TO (0);
> > CREATE TABLE alpha_pos_pos PARTITION OF alpha_pos FOR VALUES FROM (0) TO ('Infinity');
> > CREATE TABLE alpha_pos_nan PARTITION OF alpha_pos FOR VALUES FROM ('Infinity') TO ('NaN');
> > INSERT INTO alpha (a, b)
> > (SELECT * FROM
> > (VALUES (-1.0::float8), (0.0::float8), (1.0::float8), ('Infinity'::float8)) a,
> > (VALUES (-1.0::float8), (0.0::float8), (1.0::float8), ('Infinity'::float8)) b
> > );
> > ANALYZE alpha;
> > ANALYZE alpha_neg;
> > ANALYZE alpha_pos;
> > ANALYZE alpha_nan;
> > ANALYZE alpha_neg_neg;
> > ANALYZE alpha_neg_pos;
> > ANALYZE alpha_neg_nan;
> > ANALYZE alpha_pos_neg;
> > ANALYZE alpha_pos_pos;
> > ANALYZE alpha_pos_nan;
> > CREATE TABLE beta (a double precision, b double precision) PARTITION BY RANGE (a, b);
> > CREATE TABLE beta_lo PARTITION OF beta FOR VALUES FROM (-5, -5) TO (0, 0);
> > CREATE TABLE beta_me PARTITION OF beta FOR VALUES FROM (0, 0) TO (0, 5);
> > CREATE TABLE beta_hi PARTITION OF beta FOR VALUES FROM (0, 5) TO (5, 5);
> > INSERT INTO beta (a, b)
> > (SELECT * FROM
> > (VALUES (-1.0::float8), (0.0::float8), (1.0::float8)) a,
> > (VALUES (-1.0::float8), (0.0::float8), (1.0::float8)) b
> > );
> > ANALYZE beta;
> > ANALYZE beta_lo;
> > ANALYZE beta_me;
> > ANALYZE beta_hi;
> > EXPLAIN SELECT * FROM alpha INNER JOIN beta ON (alpha.a = beta.a AND alpha.b = beta.b) WHERE alpha.a = 1 AND beta.b = 1;
> > QUERY PLAN
> > -------------------------------------------------------------------------------
> > Nested Loop (cost=0.00..2.11 rows=1 width=32)
> > -> Seq Scan on alpha_pos_pos alpha (cost=0.00..1.06 rows=1 width=16)
> > Filter: ((b = '1'::double precision) AND (a = '1'::double precision))
> > -> Seq Scan on beta_hi beta (cost=0.00..1.04 rows=1 width=16)
> > Filter: ((b = '1'::double precision) AND (a = '1'::double precision))
> > (5 rows)
>
> Hmm, I'm not sure this is a good test case for that, because this
> result would be due to partition pruning applied to each side of the
> join before considering partition-wise join; you could get the same
> result even with enable_partitionwise_join=off. I think it's
> important that the partition-wise join logic doesn't break this query,
> though.

I think this would be beyond the scope of the patch, so I added
different test cases that I think would be better as ones for
multi-level partitioning.

Thanks!

Best regards,
Etsuro Fujita

Best Wishes,

Ashutosh Bapat

Attachment

changes_on_top_of_v32_0001.patch

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Etsuro Fujita

Date:

17 March 2020, 08:14:39

Hi Ashutosh,

On Wed, Mar 4, 2020 at 1:48 AM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
> I reviewed the patch. Except for the logic of matching the pairs of partitions from already merged partitions, I
thinkthe code changes are good. But there are several places where it needs further changes to comments. The attached
patchhas those. I have described some of them below. 

Thanks for reviewing!

> + * We can not perform partition-wise join if either of the joining
> + * relations is not partitioned.
>
> We are consistently using partitionwise instead of partition-wise.

Will fix.

> + /*
> + * See if the partition bounds for inputs are exactly the same, in
> + * which case we don't need to work hard: partitions with the same
> + * partition indexes will form join pairs, and the join rel will have
> + * the same partition bounds as inputs; otherwise try to merge the
> + * partition bounds along with generating join pairs.
>
> Phrase "joining relations" is better than "inputs", IMO. Updated in the
> attached patch.

"inputs" is used in many places in the planner performing join
planning, so I'm not sure "joining relations" is better than "inputs".

> + /*
> + * If the partition bounds for the join rel are not merged ones,
> + * inputs are guaranteed to have the same partition bounds, so
> + * partitions with the same partition indexes will form join pairs;
> + * else let get_matching_part_pairs() do the work.
> + */
> + if (joinrel->merged)
> + {
>
> This condition in the comment is opposite to the condition being checked in
> code, so looks confusing. BTW this comment is also repeated around line 1405.
> See attached patch for correction.

OK, I'll revise the comments as proposed.

> + /*
> + * If this segment of the join is empty, it means that this segment
>
> "partition of the join" looks consistent with other usages than "segment of the
> join".

Actually, "segment" is used in the existing comments in the caller
function try_partitionwise_join(), so I think "segment" is better here
for consistency.

> + /*
> + * Get a relids set of partition(s) involved in this join segment that
> + * are from the rel1 side.
> + */
> + child_relids1 = bms_intersect(child_joinrel->relids,
> +  rel1->all_partrels);
>
> The partition bounds are sorted by their values. Even for a list partitioned
> table, the bounds are sorted by the least partition value. We do not allow
> multiple paritions from one side to be joined with one partition on the other
> and vice versa. All this together means that the partitions of the join
> relation are formed by joining partitions from either side in the order of
> their bounds. This means that the matching pairs of partitions can be found by
> matching relids of partitions of join with those of the joining relation by
> traversing partitions from all the three relations only once in the order they
> appears in partition bounds of corresponding relations.

Consider this 2-way join for list partitioned tables:

CREATE TABLE plt1_ad (a int, b int, c text) PARTITION BY LIST (c);
CREATE TABLE plt1_ad_p1 PARTITION OF plt1_ad FOR VALUES IN ('0001', '0003');
CREATE TABLE plt1_ad_p2 PARTITION OF plt1_ad FOR VALUES IN ('0002');
INSERT INTO plt1_ad SELECT i, i, to_char(i % 10, 'FM0000') FROM
generate_series(1, 100) i WHERE i % 10 in (1, 2, 3);
ANALYZE plt1_ad;
CREATE TABLE plt2_ad (a int, b int, c text) PARTITION BY LIST (c);
CREATE TABLE plt2_ad_p1 PARTITION OF plt2_ad FOR VALUES IN ('0002', '0004');
CREATE TABLE plt2_ad_p2 PARTITION OF plt2_ad FOR VALUES IN ('0003');
INSERT INTO plt2_ad SELECT i, i, to_char(i % 10, 'FM0000') FROM
generate_series(1, 100) i WHERE i % 10 IN (2, 3, 4);
ANALYZE plt2_ad;

EXPLAIN (COSTS OFF)
SELECT t1.c, t1.a, t2.a FROM plt1_ad t1 INNER JOIN plt2_ad t2 ON (t1.c
= t2.c) WHERE t1.a < 10  ORDER BY t1.c, t1.a, t2.a;
                     QUERY PLAN
-----------------------------------------------------
 Sort
   Sort Key: t1.c, t1.a, t2.a
   ->  Append
         ->  Hash Join
               Hash Cond: (t2_1.c = t1_2.c)
               ->  Seq Scan on plt2_ad_p1 t2_1
               ->  Hash
                     ->  Seq Scan on plt1_ad_p2 t1_2
                           Filter: (a < 10)
         ->  Hash Join
               Hash Cond: (t2_2.c = t1_1.c)
               ->  Seq Scan on plt2_ad_p2 t2_2
               ->  Hash
                     ->  Seq Scan on plt1_ad_p1 t1_1
                           Filter: (a < 10)
(15 rows)

As you can see from the EXPLAIN result, the first partition on the
outer side matches the second partition on the inner side, and the
second partition on the outer side matches the first partition on the
inner side.  How does the algorithm you proposed work e.g., when an
N-way join for list partitioned tables contains this join as its lower
join?

> If we use this
> algorithm, we don't need all_partrels to be collected and also don't need to
> search base or join relation. That, I think, will reduce the time and space
> complexity of this logic. Am I missing something?

Relids is used for storing all_partrels, so the time/space cost of
handling it would be small.  Also, the cost of searing base relations
would be small.  The cost of searching join relations would be a bit
large in some cases, but I thought that would be acceptable, compared
with large overhead of performing other part of partitionwise join
planning.

> + if (rel1_is_simple)
>
> This variable is used only in one place. So instead we should the expression
> assigning the value to it. Changed in the attached patch.

I don't think that's a good idea, because this check is done
repeatedly in a for loop.

> - rel->nparts = 0;
> + rel->nparts = -1;
>
> I think we need to add comments around various values that nparts can take. How
> about like something attached.

+1

> + case PARTITION_STRATEGY_HASH:
> + merged_bounds = NULL;
>
> I think, we need to explain why we aren't merging hash partition bounds. AFAIU,
> the reason is thus: When the modulo of used for partition mapping (i.e. maximum
> number of has partitions) is same, their partition bounds are same and do not
> need merging.

I don't think that's always true; there are cases where the moduli are
the same, but the partition bounds are not, because it's possible to
only define partitions for some of the remainders.  See the discussion
in [1].

> If the maximum number of partitions is different for both the
> joining relations, there's high probability that one partition on one side will
> join with multiple partitions on the other side. So exact partition bounds
> match will work in most of the cases. The cases otherwise are not so common to
> spend the effort in coding and planning.
>
> I have added this explanation in the patch.

I also think it would be great if we can perform generic partitionwise
join for hash partitioned tables, so I'd like to propose to add
something like this, instead: "Currently we support partitionwise join
for hash partitioned tables only when the partition bounds for them
exactly match, but later it might be worth the effort to relax the
restriction."

> + if (part_index == -1)
> + return -1;
> + } while (is_dummy_partition(rel, part_index));
>
> I understand why we are skipping NULL positions. I am not sure why are we are
> skipping over RelOptInfos which exist but are marked as dummy; we can still
> create a join pair with those partitions.

Yeah, but I think it's safe to skip over those partitions as well,
because such a join pair can be created using
merge_partition_with_dummy().

> +/*
> + * get_merged_range_bounds
> + * Given the bounds of range partitions to be join, determine the range
>
> s/join/joined/

Good catch!  Will fix.

> There are more changes to comments, where I thought that the comments are
> required or existing comments need more clarification. Please review the
> attached patch. This patch is created on top of
> v32-0001-Improve-partition-matching-for-partitionwise-join.

Thanks for the patch!  I will review the patch ASAP.

Sorry for the delay.

Best regards,
Etsuro Fujita

[1] https://www.postgresql.org/message-id/CAAJ_b94tJTix3kR8uBjin-ruJ-7ojn-gAWJQRicbLqAttQTe1g%40mail.gmail.com

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Ashutosh Bapat

Date:

23 March 2020, 13:41:54

On Tue, Mar 17, 2020 at 1:44 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
>
> > + /*
> > + * If this segment of the join is empty, it means that this segment
> >
> > "partition of the join" looks consistent with other usages than "segment of the
> > join".
>
> Actually, "segment" is used in the existing comments in the caller
> function try_partitionwise_join(), so I think "segment" is better here
> for consistency.

A segment can be any part of the join relation, not necessarily a
partition. May be we should change the caller.

>
> > + /*
> > + * Get a relids set of partition(s) involved in this join segment that
> > + * are from the rel1 side.
> > + */
> > + child_relids1 = bms_intersect(child_joinrel->relids,
> > +  rel1->all_partrels);
> >
> > The partition bounds are sorted by their values. Even for a list partitioned
> > table, the bounds are sorted by the least partition value. We do not allow
> > multiple paritions from one side to be joined with one partition on the other
> > and vice versa. All this together means that the partitions of the join
> > relation are formed by joining partitions from either side in the order of
> > their bounds. This means that the matching pairs of partitions can be found by
> > matching relids of partitions of join with those of the joining relation by
> > traversing partitions from all the three relations only once in the order they
> > appears in partition bounds of corresponding relations.
>
> Consider this 2-way join for list partitioned tables:
>
> CREATE TABLE plt1_ad (a int, b int, c text) PARTITION BY LIST (c);
> CREATE TABLE plt1_ad_p1 PARTITION OF plt1_ad FOR VALUES IN ('0001', '0003');
> CREATE TABLE plt1_ad_p2 PARTITION OF plt1_ad FOR VALUES IN ('0002');
> INSERT INTO plt1_ad SELECT i, i, to_char(i % 10, 'FM0000') FROM
> generate_series(1, 100) i WHERE i % 10 in (1, 2, 3);
> ANALYZE plt1_ad;
> CREATE TABLE plt2_ad (a int, b int, c text) PARTITION BY LIST (c);
> CREATE TABLE plt2_ad_p1 PARTITION OF plt2_ad FOR VALUES IN ('0002', '0004');
> CREATE TABLE plt2_ad_p2 PARTITION OF plt2_ad FOR VALUES IN ('0003');
> INSERT INTO plt2_ad SELECT i, i, to_char(i % 10, 'FM0000') FROM
> generate_series(1, 100) i WHERE i % 10 IN (2, 3, 4);
> ANALYZE plt2_ad;
>
> EXPLAIN (COSTS OFF)
> SELECT t1.c, t1.a, t2.a FROM plt1_ad t1 INNER JOIN plt2_ad t2 ON (t1.c
> = t2.c) WHERE t1.a < 10  ORDER BY t1.c, t1.a, t2.a;
>                      QUERY PLAN
> -----------------------------------------------------
>  Sort
>    Sort Key: t1.c, t1.a, t2.a
>    ->  Append
>          ->  Hash Join
>                Hash Cond: (t2_1.c = t1_2.c)
>                ->  Seq Scan on plt2_ad_p1 t2_1
>                ->  Hash
>                      ->  Seq Scan on plt1_ad_p2 t1_2
>                            Filter: (a < 10)
>          ->  Hash Join
>                Hash Cond: (t2_2.c = t1_1.c)
>                ->  Seq Scan on plt2_ad_p2 t2_2
>                ->  Hash
>                      ->  Seq Scan on plt1_ad_p1 t1_1
>                            Filter: (a < 10)
> (15 rows)
>
> As you can see from the EXPLAIN result, the first partition on the
> outer side matches the second partition on the inner side, and the
> second partition on the outer side matches the first partition on the
> inner side.  How does the algorithm you proposed work e.g., when an
> N-way join for list partitioned tables contains this join as its lower
> join?

Hmm, this is a good example. I tried to work out the algorithm based
on the bound ordering. The algorithm worked well when all the bounds
on both the sides were included in the join. But it didn't work well,
when some bounds vanished. In order to detect whether a bound has
vanished, we need to either compare that bound with the bounds of join
(an operation costlier than comparing bitmapset) or compare relids of
all the partitions of the join. Either way it looks costlier than what
you have right now. May be we could improve by keeping track of such
lost bounds and corresponding partitions. But I didn't get time to
work on that part. Anyway, even if such an algorithm exists, we will
have to change just a single function. That could be done later I
think. So we are good here right now. Thanks.


> > + if (rel1_is_simple)
> >
> > This variable is used only in one place. So instead we should the expression
> > assigning the value to it. Changed in the attached patch.
>
> I don't think that's a good idea, because this check is done
> repeatedly in a for loop.

Compiler's optimizer would anyway optimize it away. But anyway, I
won't insist on this.

>
> > - rel->nparts = 0;
> > + rel->nparts = -1;
> >
> > I think we need to add comments around various values that nparts can take. How
> > about like something attached.
>
> +1
>
> > + case PARTITION_STRATEGY_HASH:
> > + merged_bounds = NULL;
> >
> > I think, we need to explain why we aren't merging hash partition bounds. AFAIU,
> > the reason is thus: When the modulo of used for partition mapping (i.e. maximum
> > number of has partitions) is same, their partition bounds are same and do not
> > need merging.
>
> I don't think that's always true; there are cases where the moduli are
> the same, but the partition bounds are not, because it's possible to
> only define partitions for some of the remainders.  See the discussion
> in [1].

Hmm, but that case would be rare, IMO. It's an artifact of the way our
hash partitioning syntax is, and does not reflect a real world
scenario.

>
> > If the maximum number of partitions is different for both the
> > joining relations, there's high probability that one partition on one side will
> > join with multiple partitions on the other side. So exact partition bounds
> > match will work in most of the cases. The cases otherwise are not so common to
> > spend the effort in coding and planning.
> >
> > I have added this explanation in the patch.
>
> I also think it would be great if we can perform generic partitionwise
> join for hash partitioned tables, so I'd like to propose to add
> something like this, instead: "Currently we support partitionwise join
> for hash partitioned tables only when the partition bounds for them
> exactly match, but later it might be worth the effort to relax the
> restriction."

That's good too. But please include an explanation about the case when
modulo/max no. of partitions itself differs. That case is not likely
to get addressed in nearer future.

-- 
Best Wishes,
Ashutosh Bapat

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Etsuro Fujita

Date:

24 March 2020, 18:03:11

Hi,

On Mon, Mar 23, 2020 at 10:42 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
> On Tue, Mar 17, 2020 at 1:44 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:

> > > "partition of the join" looks consistent with other usages than "segment of the
> > > join".
> >
> > Actually, "segment" is used in the existing comments in the caller
> > function try_partitionwise_join(), so I think "segment" is better here
> > for consistency.
>
> A segment can be any part of the join relation, not necessarily a
> partition. May be we should change the caller.

I don't think so.  My first language is not English, though.  I don't
think this should be a blocker, so how about leaving this for another
patch?

> > > + /*
> > > + * Get a relids set of partition(s) involved in this join segment that
> > > + * are from the rel1 side.
> > > + */
> > > + child_relids1 = bms_intersect(child_joinrel->relids,
> > > +  rel1->all_partrels);
> > >
> > > The partition bounds are sorted by their values. Even for a list partitioned
> > > table, the bounds are sorted by the least partition value. We do not allow
> > > multiple paritions from one side to be joined with one partition on the other
> > > and vice versa. All this together means that the partitions of the join
> > > relation are formed by joining partitions from either side in the order of
> > > their bounds. This means that the matching pairs of partitions can be found by
> > > matching relids of partitions of join with those of the joining relation by
> > > traversing partitions from all the three relations only once in the order they
> > > appears in partition bounds of corresponding relations.
> >
> > Consider this 2-way join for list partitioned tables:

> Hmm, this is a good example. I tried to work out the algorithm based
> on the bound ordering. The algorithm worked well when all the bounds
> on both the sides were included in the join. But it didn't work well,
> when some bounds vanished. In order to detect whether a bound has
> vanished, we need to either compare that bound with the bounds of join
> (an operation costlier than comparing bitmapset) or compare relids of
> all the partitions of the join. Either way it looks costlier than what
> you have right now. May be we could improve by keeping track of such
> lost bounds and corresponding partitions. But I didn't get time to
> work on that part. Anyway, even if such an algorithm exists, we will
> have to change just a single function. That could be done later I
> think. So we are good here right now. Thanks.

OK, there is always room for improvement.

> > > + if (rel1_is_simple)
> > >
> > > This variable is used only in one place. So instead we should the expression
> > > assigning the value to it. Changed in the attached patch.
> >
> > I don't think that's a good idea, because this check is done
> > repeatedly in a for loop.
>
> Compiler's optimizer would anyway optimize it away. But anyway, I
> won't insist on this.

OK

> > > + case PARTITION_STRATEGY_HASH:
> > > + merged_bounds = NULL;
> > >
> > > I think, we need to explain why we aren't merging hash partition bounds. AFAIU,
> > > the reason is thus: When the modulo of used for partition mapping (i.e. maximum
> > > number of has partitions) is same, their partition bounds are same and do not
> > > need merging.
> >
> > I don't think that's always true; there are cases where the moduli are
> > the same, but the partition bounds are not, because it's possible to
> > only define partitions for some of the remainders.  See the discussion
> > in [1].
>
> Hmm, but that case would be rare, IMO. It's an artifact of the way our
> hash partitioning syntax is, and does not reflect a real world
> scenario.

I agree on that point.

> > > If the maximum number of partitions is different for both the
> > > joining relations, there's high probability that one partition on one side will
> > > join with multiple partitions on the other side. So exact partition bounds
> > > match will work in most of the cases. The cases otherwise are not so common to
> > > spend the effort in coding and planning.
> > >
> > > I have added this explanation in the patch.

> > I'd like to propose to add
> > something like this, instead: "Currently we support partitionwise join
> > for hash partitioned tables only when the partition bounds for them
> > exactly match, but later it might be worth the effort to relax the
> > restriction."
>
> That's good too. But please include an explanation about the case when
> modulo/max no. of partitions itself differs. That case is not likely
> to get addressed in nearer future.

OK, I added comments, including that explanation.  Please find
attached a new version of the patch.

On Tue, Mar 17, 2020 at 5:14 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> On Wed, Mar 4, 2020 at 1:48 AM Ashutosh Bapat
> <ashutosh.bapat.oss@gmail.com> wrote:
> > + * We can not perform partition-wise join if either of the joining
> > + * relations is not partitioned.
> >
> > We are consistently using partitionwise instead of partition-wise.
>
> Will fix.

Done.

> > + /*
> > + * If the partition bounds for the join rel are not merged ones,
> > + * inputs are guaranteed to have the same partition bounds, so
> > + * partitions with the same partition indexes will form join pairs;
> > + * else let get_matching_part_pairs() do the work.
> > + */
> > + if (joinrel->merged)
> > + {
> >
> > This condition in the comment is opposite to the condition being checked in
> > code, so looks confusing. BTW this comment is also repeated around line 1405.
> > See attached patch for correction.
>
> OK, I'll revise the comments as proposed.

I updated this based on what you suggested.

> > - rel->nparts = 0;
> > + rel->nparts = -1;
> >
> > I think we need to add comments around various values that nparts can take. How
> > about like something attached.
>
> +1

-   int         nparts;         /* number of partitions */
+   int         nparts;         /* number of partitions.
+                                * 0 for a relation with no partitions,
+                                * > 0 indicates actual number of partitions
+                                * -1 for a relation whose number of partitions
+                                *    is not yet known.
+                                */

I don't think the comment "0 for a relation with no partitions" is
correct.  I think 0 means the relation is considered unpartitioned, so
I modified the comments as such.  I also changed the format to match
other places in pathnodes.h.

> > +/*
> > + * get_merged_range_bounds
> > + * Given the bounds of range partitions to be join, determine the range
> >
> > s/join/joined/
>
> Good catch!  Will fix.

Done.

> > There are more changes to comments, where I thought that the comments are
> > required or existing comments need more clarification. Please review the
> > attached patch.
>
> I will review the patch ASAP.

I looked into the patch.

@@ -4093,8 +4103,13 @@ map_and_merge_partitions(PartitionMap
*outer_map, PartitionMap *inner_map,

-            * Note that we will fix the larger index that have been added to
-            * the merged_indexes list so far in fix_merged_indexes().
+            * Both the inner and outer partitions have an empty partition on
+            * the other side as their joining partner. But now that each of
+            * them has found a non-empty joining partner we should re-map
+            * those to a single partition in the join. We use lower of the
+            * two indexes to avoid any holes being created by re-mapping.
+            * Also, it keeps the partition index array in partition bounds
+            * roughly sorted.

OK, but I modified this further.

+ * The caller thinks that the partition at the given index does not have a
+ * partition in the other relation or the joining partition is empty. In such a
+ * case we assign a temporary index (indicated by merged flag in the map) for
+ * the resulting partition in the join. In case the given partition finds a
+ * non-empty partner latter we will adjust the mapping again.

Done.  I modified this a bit, though.

@@ -4590,6 +4620,13 @@ merge_default_partitions(PartitionMap *outer_map,

+        * We should have already given up if we found that both the inner and
+        * outer relations have default partitions and either of them had a
+        * partition without a matching non-default partition on the other
+        * side. See process_outer_partition() and process_inner_partition()
+        * for details.

This seems to me a bit difficult to read.  How about something simpler
in the attached?

I also updated comments in process_outer_partition() and
process_inner_partition(), almost as proposed in your patch.

I polished the patch further.  Changes are:

* I removed the arguments parttyplen and parttypbyval for
partition_bounds_merge(), because we don't need them anymore.  Also, I
removed these assertions from that function:

+   Assert(merged_bounds || (*outer_parts == NIL && *inner_parts == NIL));
+
+   Assert(list_length(*outer_parts) == list_length(*inner_parts));

because the first assertion isn't correct as we don't allow non-NULL
merged_bounds to have outer_parts=NIL and inner_parts=NIL, and because
the second assertion is redundant as we have the same in
partition_list_bounds_merge() and partition_range_bounds_merge().  I
modified assertions in these functions a bit, though.

* I renamed partition_list_bounds_merge() and
partition_range_bounds_merge() to merge_list_bounds() and
merge_range_bounds(), analogously to create_list_bounds() and
create_range_bounds() in partition_bounds_create().  Does that make
sense?

* In the old versions of the patch, map_and_merge_partitions() checked
whether it was possible to "map" one of given partitions to the other,
but that function doesn't do that anymore.  That function assumes that
given partitions match each other, so I don't think the name is good
anymore.  How about renaming that function to
merge_matching_partitions() or something like that?

* I think the names of variables used in functions added to
partbounds.c are inconsistent.  For example, the names of the outer
and inner partition indexes in partition_list_bounds_merge() are
"o_index" and "i_index", while those in partition_range_bounds_merge()
are "outer_part" and "inner_part".  IMO such inconsistencies would
make the code a bit difficult to read, so I modified the names of
those variables to be more consistent between those functions.

* I modified the code in merge_null_partitions() to match other
functions added to partbounds.c, but no functional changes.  Also, I
think the names "outer_null_unmatched" and "inner_null_unmatched" are
a bit misleading, because I think the reader might incorrectly think
both the NULL partitions exist, but that isn't always true, so I
renamed them to what I think would be better.

* I added/tweaked comments a lot, fixing (probably my) typos and
grammatical errors.

* I did some cleanup, including fixing some format issues.

* I sorted functions added to partbounds.c in a more logical order,
and moved them to a more appropriate place in that source file.

* I added/modified some regression tests.  I added/modified comments a
bit as well.  One thing I should mention is a change to this:

+-- 3-way join where not every pair of relations can do partitioned
join
+EXPLAIN (COSTS OFF)
+SELECT t1.a, t2.a, t3.c FROM prt1_ad t1 RIGHT JOIN prt2_ad t2 ON
(t1.a = t2.b) INNER JOIN prt1_ad t3 ON (t2.b = t3.a) WHERE t1.b = 0
ORDER BY t1.a, t2.a, t3.c;

The RIGHT join will be transformed to an INNER join by outer join
reduction IIUC, so this wouldn't be what was originally intended.  I
think this is my fault; I modified the original test case incorrectly
when I updated all the regression tests for this feature [1].  Sorry
for that.  I fixed this issue.

Best regards,
Etsuro Fujita

[1] https://www.postgresql.org/message-id/CAPmGK16LsKXX%3DYYzc-PNiY6aaYApg1Gmkc6A14dnJsrBBmgd0g%40mail.gmail.com

Attachment

v33-0001-Improve-partition-matching-for-partitionwise-join.patch

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Tomas Vondra

Date:

25 March 2020, 19:05:13

Hi,

I've started reviewing the patch a couple days ago. I haven't done any
extensive testing, but I do have a bunch of initial comments that I can
share now.

1) I wonder if this needs to update src/backend/optimizer/README, which
does have a section about partitionwise joins. It seems formulated in a
way that that probably covers even this more advanced algorithm, but
maybe it should mention handling of default partitions etc.?

There certainly needs to be some description of the algorithm somewhere,
either in a README or before a suitable function. It doesn't have to be
particularly detailed, a rough outline of the matching would be enough,
so that readers don't have to rebuild the knowledge from pieces
scattered around various comments.

2) Do we need another GUC enabling this more complex algorithm? In PG11
the partitionwise join is disabled by default, on the grounds that it's
expensive and not worth it by default. How much more expensive is this?
Maybe it makes sense to allow enabling only the "simple" approach?

3) This comment in try_partitionwise_join is now incorrect, because the
condition may be true even for partitioned tables with (nparts == 0).

     /* Nothing to do, if the join relation is not partitioned. */
     if (joinrel->part_scheme == NULL || joinrel->nparts == 0)
         return;

Moreover, the condition used to be 

     if (!IS_PARTITIONED_REL(joinrel))
         return;

which is way more readable. I think it's net negative to replace these
"nice" macros with clear meaning with complex conditions. If needed, we
can invent new macros. There are many other places where the patch
replaces macros with less readable conditions.


4) I'm a bit puzzled how we could get here with non-partitioned rels?

     /*
      * We can not perform partitionwise join if either of the joining relations
      * is not partitioned.
      */
     if (!IS_PARTITIONED_REL(rel1) || !IS_PARTITIONED_REL(rel2))
         return;

5) I find the "merged" flag in RelOptInfo rather unclear, because it
does not clearly indicate what was merged. Maybe something like
partbounds_merged would be better?

6) The try_partitionwise_join function is getting a bit too long and
harder to understand. The whole block in

     if (joinrel->nparts == -1)
     {
         ...
     }

seems rather well isolated, so I propose to move it to a separate
function responsible only for the merging. We can simply call it on the
joinrel, and make it return right away if (joinrel->nparts == -1).

7) I'd suggest not to reference exact functions in comments unless
abolutely necessary, because it's harder to maintain and it does not
really explain purpose of the struct/code. E.g. consider this:

     /* Per-partitioned-relation data for merge_list_bounds()/merge_range_bounds() */
     typedef struct PartitionMap
     { ... }

Why does it matter where is the struct used? That's pretty trivial to
find using 'git grep' or something. Instead the comment should explain
the purpose of the struct.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Tomas Vondra

Date:

26 March 2020, 00:17:17

Hi,

three more comments after eye-balling the code for a bit longer.

1) The patch probably needs to tweak config.sgml which says this about
the enable_partitionwise_join GUC:

   .. Partitionwise join currently applies only when the join conditions
   include all the partition keys, which must be of the same data type
   and have exactly matching sets of child partitions. ..

Which is probably incorrect, because the point of this patch is not to
require exact match of the partitions, right?

2) Do we really need the 'merged' flag in try_partitionwise_join? Can't
we simply use the joinrel->merged flag directly? ISTM the we always
start with joinrel->merged=false, and then only ever set it to true in
some cases. I've tried doing that, per the attached 0002 patch. The
regression tests seem to work fine.

I noticed this because I've tried moving part of the function into a
separate function, and removing the variable makes that simpler.

The patch also does a couple additional minor tweaks.

3) I think the for nparts comment is somewhat misleading:

   int nparts;  /* number of partitions; 0 = not partitioned;
                 * -1 = not yet set */

which says that nparts=0 means not partitioned. But then there are
conditions like this:

     /* Nothing to do, if the join relation is not partitioned. */
     if (joinrel->part_scheme == NULL || joinrel->nparts == 0)
         return;

which (ignoring the obsolete comment) seems to say we can have nparts==0
even for partitioned tables, no?

Anyway, attached is the original patch (0001) and two patches with
proposed changes. 0002 removes the "merged" flag as explained in (2),
0003 splits the try_partitionwise_join() function into two parts.

I'm saying these changes have to happen and it's a bit crude (and it
might be a bit of a bikeshedding).


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Ashutosh Bapat

Date:

01 April 2020, 16:51:17

On Thu, 26 Mar 2020 at 05:47, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

Hi,

three more comments after eye-balling the code for a bit longer.

1) The patch probably needs to tweak config.sgml which says this about
the enable_partitionwise_join GUC:

.. Partitionwise join currently applies only when the join conditions
include all the partition keys, which must be of the same data type
and have exactly matching sets of child partitions. ..

Done. Actually this wasn't updated when partition pruning was introduced, which could cause a partitionwise join to be not used even when those conditions were met. Similarly when a query involved whole row reference. It's hard to explain all the conditions under which partitionwise join technique will be used. But I have tried to keep it easy to understand.

Which is probably incorrect, because the point of this patch is not to
require exact match of the partitions, right?

2) Do we really need the 'merged' flag in try_partitionwise_join? Can't
we simply use the joinrel->merged flag directly? ISTM the we always
start with joinrel->merged=false, and then only ever set it to true in
some cases. I've tried doing that, per the attached 0002 patch. The
regression tests seem to work fine.

Thanks. I just added a small prologue to compute_partition_bounds().

I noticed this because I've tried moving part of the function into a
separate function, and removing the variable makes that simpler.

The patch also does a couple additional minor tweaks.

3) I think the for nparts comment is somewhat misleading:

int nparts; /* number of partitions; 0 = not partitioned;
* -1 = not yet set */

which says that nparts=0 means not partitioned. But then there are
conditions like this:

/* Nothing to do, if the join relation is not partitioned. */
if (joinrel->part_scheme == NULL || joinrel->nparts == 0)
return;

which (ignoring the obsolete comment) seems to say we can have nparts==0
even for partitioned tables, no?

See my previous mail.

Anyway, attached is the original patch (0001) and two patches with
proposed changes. 0002 removes the "merged" flag as explained in (2),
0003 splits the try_partitionwise_join() function into two parts.

I'm saying these changes have to happen and it's a bit crude (and it
might be a bit of a bikeshedding).

I have added 0005 with the changes I described in this as well as the previous mail. 0004 is just some white space fixes.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Best Wishes,

Ashutosh

Attachment

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Ashutosh Bapat

Date:

01 April 2020, 17:12:01

On Thu, 26 Mar 2020 at 00:35, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

Hi,

I've started reviewing the patch a couple days ago. I haven't done any
extensive testing, but I do have a bunch of initial comments that I can
share now.

1) I wonder if this needs to update src/backend/optimizer/README, which
does have a section about partitionwise joins. It seems formulated in a
way that that probably covers even this more advanced algorithm, but
maybe it should mention handling of default partitions etc.?

Done. Please check the wording. It might need some word smithy.

There certainly needs to be some description of the algorithm somewhere,
either in a README or before a suitable function. It doesn't have to be
particularly detailed, a rough outline of the matching would be enough,
so that readers don't have to rebuild the knowledge from pieces
scattered around various comments.

The algorithm for list and range partitioned tables is slightly different. So, I have added separate prologue to each list_merge_bounds() and range_merge_bounds(). Please check if that serves the purpose.

2) Do we need another GUC enabling this more complex algorithm? In PG11
the partitionwise join is disabled by default, on the grounds that it's
expensive and not worth it by default. How much more expensive is this?
Maybe it makes sense to allow enabling only the "simple" approach?

We have reduced the complexity of merging bounds quite a bit so this shouldn't be costly. Further more we handle the usual case of equal bounds quickly using the merge flag so most of the cases should be fine. It's only when two partitioned tables with same partition scheme are joined but do not have merge-able bounds that this algorithm would not yield useful result - but that would be rare in the field. enable_partitionwise_join = false should take care of such scenarios easily. I am not in favour of adding another GUC which we set to false by default and then take another few releases to make it true by default.

3) This comment in try_partitionwise_join is now incorrect, because the
condition may be true even for partitioned tables with (nparts == 0).

/* Nothing to do, if the join relation is not partitioned. */
if (joinrel->part_scheme == NULL || joinrel->nparts == 0)
return;

If part_scheme = NULL, npart should be 0, fixed that in build_joinrel_partition_info(). If partscheme != NULL but bounds can not be merged, nparts = 0. So this condition is correct. Encapsulated in a macro IS_JOINREL_NOT_PARTITITIONED(). and added comments for the macro. Given that the macro is used exactly at one place, it may not be necessary to define it but it looks *nice*.

Moreover, the condition used to be

if (!IS_PARTITIONED_REL(joinrel))
return;

which is way more readable. I think it's net negative to replace these
"nice" macros with clear meaning with complex conditions. If needed, we
can invent new macros. There are many other places where the patch
replaces macros with less readable conditions.

The only other place where we have replaced a *nice* macro is in build_joinrel_partition_info(). But I think it's a legit replacement. I have added a comment there.

4) I'm a bit puzzled how we could get here with non-partitioned rels?

/*
* We can not perform partitionwise join if either of the joining relations
* is not partitioned.
*/
if (!IS_PARTITIONED_REL(rel1) || !IS_PARTITIONED_REL(rel2))
return;

See the comment I have added in build_joinrel_partition_info(). Not all joining pairs for a given relation are partitioned.

5) I find the "merged" flag in RelOptInfo rather unclear, because it
does not clearly indicate what was merged. Maybe something like
partbounds_merged would be better?

Done.

6) The try_partitionwise_join function is getting a bit too long and
harder to understand. The whole block in

if (joinrel->nparts == -1)
{
...
}

seems rather well isolated, so I propose to move it to a separate
function responsible only for the merging. We can simply call it on the
joinrel, and make it return right away if (joinrel->nparts == -1).

Looks like you have already taken care of this one in one of your patches.

7) I'd suggest not to reference exact functions in comments unless
abolutely necessary, because it's harder to maintain and it does not
really explain purpose of the struct/code. E.g. consider this:

/* Per-partitioned-relation data for merge_list_bounds()/merge_range_bounds() */
typedef struct PartitionMap
{ ... }

Why does it matter where is the struct used? That's pretty trivial to
find using 'git grep' or something. Instead the comment should explain
the purpose of the struct.

Adjusted names and comments a bit.

Best Wishes,

Ashutosh

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Etsuro Fujita

Date:

02 April 2020, 03:08:57

Hi Tomas and Ashutosh,

On Thu, Apr 2, 2020 at 1:51 AM Ashutosh Bapat
<ashutosh.bapat@2ndquadrant.com> wrote:
> On Thu, 26 Mar 2020 at 05:47, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

>> three more comments after eye-balling the code for a bit longer.
>>
>> 1) The patch probably needs to tweak config.sgml which says this about
>> the enable_partitionwise_join GUC:
>>
>>    .. Partitionwise join currently applies only when the join conditions
>>    include all the partition keys, which must be of the same data type
>>    and have exactly matching sets of child partitions. ..
>
>
> Done. Actually this wasn't updated when partition pruning was introduced, which could cause a partitionwise join to
benot used even when those conditions were met. Similarly when a query involved whole row reference. It's hard to
explainall the conditions under which partitionwise join technique will be used. But I have tried to keep it easy to
understand.
>
>>
>>
>> Which is probably incorrect, because the point of this patch is not to
>> require exact match of the partitions, right?
>>
>> 2) Do we really need the 'merged' flag in try_partitionwise_join? Can't
>> we simply use the joinrel->merged flag directly? ISTM the we always
>> start with joinrel->merged=false, and then only ever set it to true in
>> some cases. I've tried doing that, per the attached 0002 patch. The
>> regression tests seem to work fine.
>
>
> Thanks. I just added a small prologue to compute_partition_bounds().
>
>>
>>
>> I noticed this because I've tried moving part of the function into a
>> separate function, and removing the variable makes that simpler.
>>
>> The patch also does a couple additional minor tweaks.
>>
>> 3) I think the for nparts comment is somewhat misleading:
>>
>>    int nparts;  /* number of partitions; 0 = not partitioned;
>>                  * -1 = not yet set */
>>
>> which says that nparts=0 means not partitioned. But then there are
>> conditions like this:
>>
>>      /* Nothing to do, if the join relation is not partitioned. */
>>      if (joinrel->part_scheme == NULL || joinrel->nparts == 0)
>>          return;
>>
>> which (ignoring the obsolete comment) seems to say we can have nparts==0
>> even for partitioned tables, no?
>
>
> See my previous mail.
>
>>
>>
>> Anyway, attached is the original patch (0001) and two patches with
>> proposed changes. 0002 removes the "merged" flag as explained in (2),
>> 0003 splits the try_partitionwise_join() function into two parts.
>>
>> I'm saying these changes have to happen and it's a bit crude (and it
>> might be a bit of a bikeshedding).
>
>
> I have added 0005 with the changes I described in this as well as the previous mail. 0004 is just some white space
fixes.

Thanks for the comments, Tomas!  Thanks for the patch, Ashutosh!  I
will look at the patch.

Best regards,
Etsuro Fujita

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Etsuro Fujita

Date:

03 April 2020, 15:15:36

Hi,

On Thu, Apr 2, 2020 at 2:12 AM Ashutosh Bapat
<ashutosh.bapat@2ndquadrant.com> wrote:
> On Thu, 26 Mar 2020 at 00:35, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> I've started reviewing the patch a couple days ago. I haven't done any
>> extensive testing, but I do have a bunch of initial comments that I can
>> share now.
>>
>> 1) I wonder if this needs to update src/backend/optimizer/README, which
>> does have a section about partitionwise joins. It seems formulated in a
>> way that that probably covers even this more advanced algorithm, but
>> maybe it should mention handling of default partitions etc.?

> Done. Please check the wording. It might need some word smithy.

You heavily changed the existing documentation about PWJ, but I don't
think we really need to do so.  Also, IMO I think the description
about default partitions you added is needed in README.  I think it
would be better to put such a description in source files.  How about
something like the attached, instead?  I wrote part of this based on
the commit message in the original versions of the patch you posted.

>> There certainly needs to be some description of the algorithm somewhere,
>> either in a README or before a suitable function. It doesn't have to be
>> particularly detailed, a rough outline of the matching would be enough,
>> so that readers don't have to rebuild the knowledge from pieces
>> scattered around various comments.

> The algorithm for list and range partitioned tables is slightly different. So, I have added separate prologue to each
list_merge_bounds()and range_merge_bounds(). Please check if that serves the purpose. 

Too detailed to me.  In this:

+ * If there are multiple partitions from one side matching a given partition on
+ * the other side, the algorithm bails out since we do not have machinary for
+ * joining one partition with mulitple partitions. It might happen that any of
+ * the list items of a partition from the outer relation do not appear in the
+ * inner relation and there is no default partition in the inner relation. Such
+ * a partition from the outer side will have no matching partition on the inner
+ * side. The algorithm will bail out in such a case since we do not have a
+ * mechanism to perform a join with a non-existing relation.

I don't think the last comment is correct; that would apply to the old
versions of this function IIRC, but not to the latest version.  How
about something much simpler like the attached, instead?

>> 2) Do we need another GUC enabling this more complex algorithm? In PG11
>> the partitionwise join is disabled by default, on the grounds that it's
>> expensive and not worth it by default. How much more expensive is this?
>> Maybe it makes sense to allow enabling only the "simple" approach?

> We have reduced the complexity of merging bounds quite a bit so this shouldn't be costly. Further more we handle the
usualcase of equal bounds quickly using the merge flag so most of the cases should be fine. It's only when two
partitionedtables with same partition scheme are joined but do not have merge-able bounds that this algorithm would not
yielduseful result - but that would be rare in the field. enable_partitionwise_join = false should take care of such
scenarioseasily. I am not in favour of adding another GUC which we set to false by default and then take another few
releasesto make it true by default. 

I agree with Ashutosh.

>> 3) This comment in try_partitionwise_join is now incorrect, because the
>> condition may be true even for partitioned tables with (nparts == 0).
>>
>>      /* Nothing to do, if the join relation is not partitioned. */
>>      if (joinrel->part_scheme == NULL || joinrel->nparts == 0)
>>          return;

> If part_scheme = NULL, npart should be 0, fixed that in build_joinrel_partition_info(). If partscheme != NULL but
boundscan not be merged, nparts = 0. So this condition is correct. Encapsulated in a macro
IS_JOINREL_NOT_PARTITITIONED().and added comments for the macro. Given that the macro is used exactly at one place, it
maynot be necessary to define it but it looks *nice*. 

I don't think it would be a good idea to add such a macro for only one place.

>> Moreover, the condition used to be
>>
>>      if (!IS_PARTITIONED_REL(joinrel))
>>          return;
>>
>> which is way more readable. I think it's net negative to replace these
>> "nice" macros with clear meaning with complex conditions. If needed, we
>> can invent new macros. There are many other places where the patch
>> replaces macros with less readable conditions.

> The only other place where we have replaced a *nice* macro is in build_joinrel_partition_info(). But I think it's a
legitreplacement. I have added a comment there. 

IIUC, this is the existing issue, so I think it would be better to
leave this for another improvement patch.

>> 6) The try_partitionwise_join function is getting a bit too long and
>> harder to understand. The whole block in
>>
>>      if (joinrel->nparts == -1)
>>      {
>>          ...
>>      }
>>
>> seems rather well isolated, so I propose to move it to a separate
>> function responsible only for the merging. We can simply call it on the
>> joinrel, and make it return right away if (joinrel->nparts == -1).

That's a good idea, so +1.

>> 7) I'd suggest not to reference exact functions in comments unless
>> abolutely necessary, because it's harder to maintain and it does not
>> really explain purpose of the struct/code. E.g. consider this:
>>
>>      /* Per-partitioned-relation data for merge_list_bounds()/merge_range_bounds() */
>>      typedef struct PartitionMap
>>      { ... }
>>
>> Why does it matter where is the struct used? That's pretty trivial to
>> find using 'git grep' or something. Instead the comment should explain
>> the purpose of the struct.

> Adjusted names and comments a bit.

I modified the comments a bit further.  I don't think we need to
change the name of a variable, so I kept it as is.

On Thu, Apr 2, 2020 at 1:51 AM Ashutosh Bapat
<ashutosh.bapat@2ndquadrant.com> wrote:
> On Thu, 26 Mar 2020 at 05:47, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> three more comments after eye-balling the code for a bit longer.
>>
>> 1) The patch probably needs to tweak config.sgml which says this about
>> the enable_partitionwise_join GUC:
>>
>>    .. Partitionwise join currently applies only when the join conditions
>>    include all the partition keys, which must be of the same data type
>>    and have exactly matching sets of child partitions. ..

> Done. Actually this wasn't updated when partition pruning was introduced, which could cause a partitionwise join to
benot used even when those conditions were met. Similarly when a query involved whole row reference. It's hard to
explainall the conditions under which partitionwise join technique will be used. But I have tried to keep it easy to
understand.

IMO I think your words "there is exactly one pair of matching
partitions." is a bit misleading, because that sounds like that PWJ
doesn't allow multiply-segmented join.  How about s/exact
matching/one-to-one matching/ in the existing documentation, instead?

>> 2) Do we really need the 'merged' flag in try_partitionwise_join? Can't
>> we simply use the joinrel->merged flag directly? ISTM the we always
>> start with joinrel->merged=false, and then only ever set it to true in
>> some cases. I've tried doing that, per the attached 0002 patch. The
>> regression tests seem to work fine.

I introduced the flag "merged", because I thought it would be verbose
to write "joiners->merged" in many places, but I think removing that
would make the code simpler, so +1 for that change.

> Thanks. I just added a small prologue to compute_partition_bounds().

I tweaked that a bit.

>> I noticed this because I've tried moving part of the function into a
>> separate function, and removing the variable makes that simpler.
>>
>> The patch also does a couple additional minor tweaks.

    /*
     * Currently, this function is called only from
try_partitionwise_join(),
     * so the join type should be INNER, LEFT, FULL, SEMI, or ANTI.
+    *
+    * XXX Maybe an assert would be more appropriate? Or maybe just
+    * bail out by returning NULL? Not sure.
     */
    if (jointype != JOIN_INNER && jointype != JOIN_LEFT &&
        jointype != JOIN_FULL && jointype != JOIN_SEMI &&
        jointype != JOIN_ANTI)

I kept this as originally proposed, but I agree that an assertion
would be more appropriate, because it would save cycles a bit in a
non-assertion-enabled build.  I modified this as such.  I modified the
test below this as well.

>> 3) I think the for nparts comment is somewhat misleading:
>>
>>    int nparts;  /* number of partitions; 0 = not partitioned;
>>                  * -1 = not yet set */
>>
>> which says that nparts=0 means not partitioned. But then there are
>> conditions like this:
>>
>>      /* Nothing to do, if the join relation is not partitioned. */
>>      if (joinrel->part_scheme == NULL || joinrel->nparts == 0)
>>          return;
>>
>> which (ignoring the obsolete comment) seems to say we can have nparts==0
>> even for partitioned tables, no?

Yeah, I think I was a bit hasty.  I fixed this.

Attached is the original patch (0001) and one patch (0002) with
changes including those by Tomas and Ashutosh.  In the 0002 patch I
fixed my many typos as well.  :-(  Many of them were found by Justin
Pryzby.  One of them was found by Ashutosh.  Thank you!

>> Anyway, attached is the original patch (0001) and two patches with
>> proposed changes. 0002 removes the "merged" flag as explained in (2),
>> 0003 splits the try_partitionwise_join() function into two parts.

Thanks for the patches, Tomas!

> I have added 0005 with the changes I described in this as well as the previous mail. 0004 is just some white space
fixes.

Thanks for the patches, Ashutosh!  I fixed the white space issues.

Best regards,
Etsuro Fujita

Attachment

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Etsuro Fujita

Date:

06 April 2020, 08:28:52

On Sat, Apr 4, 2020 at 12:15 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> Attached is the original patch (0001) and one patch (0002) with
> changes including those by Tomas and Ashutosh.

I merged the patches into one and rebased it against HEAD.  Attached
is a new version, in which I added the commit message as well.  Does
that make sense?  If there are no objections, I’ll commit the patch.

Best regards,
Etsuro Fujita

Attachment

v34-0001-Allow-partitionwise-joins-in-more-cases.patch

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Ashutosh Bapat

Date:

06 April 2020, 11:43:15

On Fri, 3 Apr 2020 at 20:45, Etsuro Fujita <etsuro.fujita@gmail.com> wrote:

Hi,

On Thu, Apr 2, 2020 at 2:12 AM Ashutosh Bapat
<ashutosh.bapat@2ndquadrant.com> wrote:
> On Thu, 26 Mar 2020 at 00:35, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> I've started reviewing the patch a couple days ago. I haven't done any
>> extensive testing, but I do have a bunch of initial comments that I can
>> share now.
>>
>> 1) I wonder if this needs to update src/backend/optimizer/README, which
>> does have a section about partitionwise joins. It seems formulated in a
>> way that that probably covers even this more advanced algorithm, but
>> maybe it should mention handling of default partitions etc.?

> Done. Please check the wording. It might need some word smithy.

You heavily changed the existing documentation about PWJ, but I don't
think we really need to do so. Also, IMO I think the description
about default partitions you added is needed in README. I think it
would be better to put such a description in source files. How about
something like the attached, instead? I wrote part of this based on
the commit message in the original versions of the patch you posted.

I corrected some grammar, typos. Broke longer sentences into smaller ones so that its easy to read and understand. As is the concept is hard to understand with all its limitations. Thanks for the example. Retained it.

You seem to have removed few comments that explained the algorithm in detail from build_joinrel_partition_info(). It would have been good to have those there. But I am ok not having them either.

But it will be good to have following addition I suggested in my patches to make sure nparts is set to 0 for an unpartitioned relation as per the comment on nparts in RelOptInfo.

@@ -1653,6 +1663,8 @@ build_joinrel_partition_info(RelOptInfo *joinrel, RelOptInfo *outer_rel,
jointype, restrictlist))
{
Assert(!IS_PARTITIONED_REL(joinrel));
+ /* Join is not partitioned. */
+ joinrel->nparts = 0;
return;
}

>> There certainly needs to be some description of the algorithm somewhere,
>> either in a README or before a suitable function. It doesn't have to be
>> particularly detailed, a rough outline of the matching would be enough,
>> so that readers don't have to rebuild the knowledge from pieces
>> scattered around various comments.

> The algorithm for list and range partitioned tables is slightly different. So, I have added separate prologue to each list_merge_bounds() and range_merge_bounds(). Please check if that serves the purpose.

Too detailed to me. In this:

+ * If there are multiple partitions from one side matching a given partition on
+ * the other side, the algorithm bails out since we do not have machinary for
+ * joining one partition with mulitple partitions. It might happen that any of
+ * the list items of a partition from the outer relation do not appear in the
+ * inner relation and there is no default partition in the inner relation. Such
+ * a partition from the outer side will have no matching partition on the inner
+ * side. The algorithm will bail out in such a case since we do not have a
+ * mechanism to perform a join with a non-existing relation.

I don't think the last comment is correct; that would apply to the old
versions of this function IIRC, but not to the latest version. How
about something much simpler like the attached, instead?

I know that algorithm pretty well by now, so it suffices for me to say we use something similar to merge join, but may be for someone without that background a detailed explanation is useful. But this looks fine at the moment.

> Done. Actually this wasn't updated when partition pruning was introduced, which could cause a partitionwise join to be not used even when those conditions were met. Similarly when a query involved whole row reference. It's hard to explain all the conditions under which partitionwise join technique will be used. But I have tried to keep it easy to understand.

IMO I think your words "there is exactly one pair of matching
partitions." is a bit misleading, because that sounds like that PWJ
doesn't allow multiply-segmented join. How about s/exact
matching/one-to-one matching/ in the existing documentation, instead?

Good catch. That was really misleading. Looks good to me.

>> 3) I think the for nparts comment is somewhat misleading:
>>
>> int nparts; /* number of partitions; 0 = not partitioned;
>> * -1 = not yet set */
>>
>> which says that nparts=0 means not partitioned. But then there are
>> conditions like this:
>>
>> /* Nothing to do, if the join relation is not partitioned. */
>> if (joinrel->part_scheme == NULL || joinrel->nparts == 0)
>> return;
>>
>> which (ignoring the obsolete comment) seems to say we can have nparts==0
>> even for partitioned tables, no?

Yeah, I think I was a bit hasty. I fixed this.

For a non-join relation, nparts = 0 and nparts = -1 both have the same meaning. Although we never set nparts = 0 for a non-join relation? Otherwise, the comment looks good now.

Best Wishes,

Ashutosh

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Ashutosh Bapat

Date:

06 April 2020, 11:45:34

Here are some changes suggested on top of v34 as per my previous mail.

These are mostly comment changes.

On Mon, 6 Apr 2020 at 13:59, Etsuro Fujita <etsuro.fujita@gmail.com> wrote:

On Sat, Apr 4, 2020 at 12:15 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> Attached is the original patch (0001) and one patch (0002) with
> changes including those by Tomas and Ashutosh.

I merged the patches into one and rebased it against HEAD. Attached
is a new version, in which I added the commit message as well. Does
that make sense? If there are no objections, I’ll commit the patch.

I have no objections.

Best Wishes,

Ashutosh

Attachment

changes_over_v34.patch

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Tomas Vondra

Date:

07 April 2020, 15:15:21

On Mon, Apr 06, 2020 at 05:28:52PM +0900, Etsuro Fujita wrote:
>On Sat, Apr 4, 2020 at 12:15 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
>> Attached is the original patch (0001) and one patch (0002) with
>> changes including those by Tomas and Ashutosh.
>
>I merged the patches into one and rebased it against HEAD.  Attached
>is a new version, in which I added the commit message as well.  Does
>that make sense?  If there are no objections, I’ll commit the patch.
>

+1


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Etsuro Fujita

Date:

07 April 2020, 17:24:28

Hi Tomas,

On Wed, Apr 8, 2020 at 12:15 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On Mon, Apr 06, 2020 at 05:28:52PM +0900, Etsuro Fujita wrote:
> >On Sat, Apr 4, 2020 at 12:15 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> >> Attached is the original patch (0001) and one patch (0002) with
> >> changes including those by Tomas and Ashutosh.
> >
> >I merged the patches into one and rebased it against HEAD.  Attached
> >is a new version, in which I added the commit message as well.  Does
> >that make sense?  If there are no objections, I’ll commit the patch.

> +1

Great!  It's midnight in Japan now, so I'll push the patch early morning.

Best regards,
Etsuro Fujita

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Etsuro Fujita

Date:

08 April 2020, 01:36:58

On Wed, Apr 8, 2020 at 2:24 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> On Wed, Apr 8, 2020 at 12:15 AM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> > On Mon, Apr 06, 2020 at 05:28:52PM +0900, Etsuro Fujita wrote:
> > >On Sat, Apr 4, 2020 at 12:15 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> > >> Attached is the original patch (0001) and one patch (0002) with
> > >> changes including those by Tomas and Ashutosh.
> > >
> > >I merged the patches into one and rebased it against HEAD.  Attached
> > >is a new version, in which I added the commit message as well.  Does
> > >that make sense?  If there are no objections, I’ll commit the patch.
>
> > +1
>
> Great!  It's midnight in Japan now, so I'll push the patch early morning.

Pushed after modifying some comments further, based on the suggestions
of Ashutosh.

Best regards,
Etsuro Fujita

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Ashutosh Bapat

Date:

08 April 2020, 05:09:18

Thanks a lot Fujita-san. Thanks Dmitry, Rajkumar, Amul, Mark, Robert, Antonin, Amit, Justin,Thomas and Tomas for all your help and review.

On Wed, 8 Apr 2020 at 07:07, Etsuro Fujita <etsuro.fujita@gmail.com> wrote:

On Wed, Apr 8, 2020 at 2:24 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> On Wed, Apr 8, 2020 at 12:15 AM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> > On Mon, Apr 06, 2020 at 05:28:52PM +0900, Etsuro Fujita wrote:
> > >On Sat, Apr 4, 2020 at 12:15 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> > >> Attached is the original patch (0001) and one patch (0002) with
> > >> changes including those by Tomas and Ashutosh.
> > >
> > >I merged the patches into one and rebased it against HEAD. Attached
> > >is a new version, in which I added the commit message as well. Does
> > >that make sense? If there are no objections, I’ll commit the patch.
>
> > +1
>
> Great! It's midnight in Japan now, so I'll push the patch early morning.

Pushed after modifying some comments further, based on the suggestions
of Ashutosh.

Best regards,
Etsuro Fujita

Best Wishes,

Ashutosh

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Kuntal Ghosh

Date:

08 April 2020, 07:30:14

Hi,

On Wed, Apr 8, 2020 at 7:07 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
>
> Pushed after modifying some comments further, based on the suggestions
> of Ashutosh.
I'm getting the following warning during compilation.

partbounds.c: In function ‘partition_bounds_merge’:
partbounds.c:1024:21: warning: unused variable ‘inner_binfo’ [-Wunused-variable]
  PartitionBoundInfo inner_binfo = inner_rel->boundinfo;
                     ^
For fixing the same, we can declare inner_binfo as
PG_USED_FOR_ASSERTS_ONLY as it is not used for any other purpose.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Etsuro Fujita

Date:

08 April 2020, 10:12:09

Hi Kuntal,

On Wed, Apr 8, 2020 at 4:30 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> I'm getting the following warning during compilation.
>
> partbounds.c: In function ‘partition_bounds_merge’:
> partbounds.c:1024:21: warning: unused variable ‘inner_binfo’ [-Wunused-variable]
>   PartitionBoundInfo inner_binfo = inner_rel->boundinfo;
>                      ^
> For fixing the same, we can declare inner_binfo as
> PG_USED_FOR_ASSERTS_ONLY as it is not used for any other purpose.

I'd propose to remove an assertion causing this (and  the
outer_binfo/inner_binfo variables) from partition_bounds_merge(),
rather than doing so, because the assertion is redundant, as we have
the same assertion in merge_list_bounds() and merge_range_bounds().
Please find attached a patch.

Best regards,
Etsuro Fujita

Attachment

fix-compiler-warning.patch

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Ashutosh Bapat

Date:

08 April 2020, 10:14:28

Thanks Kuntal for the report. Let me know if this patch works for you.

On Wed, 8 Apr 2020 at 13:00, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:

Hi,

On Wed, Apr 8, 2020 at 7:07 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
>
> Pushed after modifying some comments further, based on the suggestions
> of Ashutosh.
I'm getting the following warning during compilation.

partbounds.c: In function ‘partition_bounds_merge’:
partbounds.c:1024:21: warning: unused variable ‘inner_binfo’ [-Wunused-variable]
PartitionBoundInfo inner_binfo = inner_rel->boundinfo;
^
For fixing the same, we can declare inner_binfo as
PG_USED_FOR_ASSERTS_ONLY as it is not used for any other purpose.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

Best Wishes,

Ashutosh

Attachment

fix_warning_in_adv_pwj.patch

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Ashutosh Bapat

Date:

08 April 2020, 10:18:53

On Wed, 8 Apr 2020 at 15:42, Etsuro Fujita <etsuro.fujita@gmail.com> wrote:

Hi Kuntal,

On Wed, Apr 8, 2020 at 4:30 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> I'm getting the following warning during compilation.
>
> partbounds.c: In function ‘partition_bounds_merge’:
> partbounds.c:1024:21: warning: unused variable ‘inner_binfo’ [-Wunused-variable]
> PartitionBoundInfo inner_binfo = inner_rel->boundinfo;
> ^
> For fixing the same, we can declare inner_binfo as
> PG_USED_FOR_ASSERTS_ONLY as it is not used for any other purpose.

I'd propose to remove an assertion causing this (and the
outer_binfo/inner_binfo variables) from partition_bounds_merge(),
rather than doing so, because the assertion is redundant, as we have
the same assertion in merge_list_bounds() and merge_range_bounds().
Please find attached a patch.

Oh, I didn't see this mail before sending my other mail.

I think it's better to have the assertion in all the three places and also in merge_hash_bounds() whenever that comes along. The assertion in merge_*_bounds() will be good to in case those functions are called from places other than partition_bounds_merge(). The assertion in partition_bounds_merge() will make sure that when the individual merge_*_bounds() functions are called based on one of the bounds both of the bounds have same strategy.

Best Wishes,

Ashutosh

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Kuntal Ghosh

Date:

08 April 2020, 15:06:19

Hello Ashutosh, Fujita,

On Wed, Apr 8, 2020 at 3:49 PM Ashutosh Bapat
<ashutosh.bapat@2ndquadrant.com> wrote:
> On Wed, 8 Apr 2020 at 15:42, Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
>> On Wed, Apr 8, 2020 at 4:30 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>> > I'm getting the following warning during compilation.
>> >
>> > partbounds.c: In function ‘partition_bounds_merge’:
>> > partbounds.c:1024:21: warning: unused variable ‘inner_binfo’ [-Wunused-variable]
>> >   PartitionBoundInfo inner_binfo = inner_rel->boundinfo;
>> >                      ^
>> > For fixing the same, we can declare inner_binfo as
>> > PG_USED_FOR_ASSERTS_ONLY as it is not used for any other purpose.
>>
>> I'd propose to remove an assertion causing this (and  the
>> outer_binfo/inner_binfo variables) from partition_bounds_merge(),
>> rather than doing so, because the assertion is redundant, as we have
>> the same assertion in merge_list_bounds() and merge_range_bounds().
>> Please find attached a patch.
>
>
> I think it's better to have the assertion in all the three places and also in merge_hash_bounds() whenever that comes
along.The assertion in merge_*_bounds() will be good to in case those functions are called from places other than
partition_bounds_merge().The assertion in partition_bounds_merge() will make sure that when the individual
merge_*_bounds()functions are called based on one of the bounds both of the bounds have same strategy. 

Both of your patches fix the problem. I don't have much exposure in
this area to comment on whether we should keep/remove the assertion
from the code. But, here is my opinion:

The code structure looks like following:
Assert(condition A);
if (Condition B)
    merge_*_bounds(....);

Inside merge_*_bounds(), you have both the above assert and the if
condition as another assert:
Assert(condition A and Condition B);

And, merge_*_bounds() are called from only one place. So, something is
redundant here and I'm inclined towards removal of the assert
condition. Another thing I noticed:

/* The partitioning strategies should be the same. */
Assert(outer_binfo->strategy == inner_binfo->strategy);

The comment just reads the assertion aloud which looks unnecessary.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Etsuro Fujita

Date:

09 April 2020, 04:59:22

Hi,

On Thu, Apr 9, 2020 at 12:06 AM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> Both of your patches fix the problem. I don't have much exposure in
> this area to comment on whether we should keep/remove the assertion
> from the code. But, here is my opinion:
>
> The code structure looks like following:
> Assert(condition A);
> if (Condition B)
>     merge_*_bounds(....);
>
> Inside merge_*_bounds(), you have both the above assert and the if
> condition as another assert:
> Assert(condition A and Condition B);
>
> And, merge_*_bounds() are called from only one place. So, something is
> redundant here and I'm inclined towards removal of the assert
> condition. Another thing I noticed:
>
> /* The partitioning strategies should be the same. */
> Assert(outer_binfo->strategy == inner_binfo->strategy);
>
> The comment just reads the assertion aloud which looks unnecessary.
>

Yeah, partition_bounds_merge() is currently called only from
try_partitionwise_join(), which guarantees that the strategies are the
same.  The assertion cost would be cheap, but not zero, so I still
think it would be better to remove the assertion from
partition_bounds_merge().

Best regards,
Etsuro Fujita

Re: [HACKERS] advanced partition matching algorithm for partition-wise join

From

Tom Lane

Date:

09 April 2020, 05:36:42

Etsuro Fujita <etsuro.fujita@gmail.com> writes:
> Yeah, partition_bounds_merge() is currently called only from
> try_partitionwise_join(), which guarantees that the strategies are the
> same.  The assertion cost would be cheap, but not zero, so I still
> think it would be better to remove the assertion from
> partition_bounds_merge().

FWIW, our general policy is that assertion costs should be ignored
in any performance considerations.  If you're concerned about
performance you should be running a non-assert build, so it doesn't
matter.  (And certainly, there are lots of assertions in the backend
that cost FAR more than this one.)  The thing to evaluate an assertion
on is how likely it is that it would catch a foreseeable sort of coding
error in some future patch.  Maybe this one carries its weight on that
score or maybe it doesn't, but that's how to think about it.

If there's only one caller and there's not likely to ever be more,
then I tend to agree that you don't need the assertion.

            regards, tom lane

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Etsuro Fujita

Date:

09 April 2020, 06:33:29

On Thu, Apr 9, 2020 at 2:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Etsuro Fujita <etsuro.fujita@gmail.com> writes:
> > Yeah, partition_bounds_merge() is currently called only from
> > try_partitionwise_join(), which guarantees that the strategies are the
> > same.

> If there's only one caller and there's not likely to ever be more,
> then I tend to agree that you don't need the assertion.

It seems unlikely that partition_bounds_merge() will be called from
more places in the foreseeable future, so I'd still vote for removing
the assertion.

Best regards,
Etsuro Fujita

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Ashutosh Bapat

Date:

09 April 2020, 14:04:01

On Thu, Apr 9, 2020 at 12:03 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
>
> On Thu, Apr 9, 2020 at 2:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Etsuro Fujita <etsuro.fujita@gmail.com> writes:
> > > Yeah, partition_bounds_merge() is currently called only from
> > > try_partitionwise_join(), which guarantees that the strategies are the
> > > same.
>
> > If there's only one caller and there's not likely to ever be more,
> > then I tend to agree that you don't need the assertion.
>
> It seems unlikely that partition_bounds_merge() will be called from
> more places in the foreseeable future, so I'd still vote for removing
> the assertion.

When I wrote that function, I had UNION also in mind. A UNION across
multiple partitioned relations will be partitioned if we can merge the
partition bounds in a sensible manner. Of course the current structure
of that function looks more purposed for join, but it's not difficult
to convert it to be used for UNION as well. In that case those set of
functions will have many more callers. So, I will vote to keep that
assertion now that we have it there.
-- 
Best Wishes,
Ashutosh Bapat

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Tomas Vondra

Date:

09 April 2020, 14:41:00

On Thu, Apr 09, 2020 at 07:34:01PM +0530, Ashutosh Bapat wrote:
>On Thu, Apr 9, 2020 at 12:03 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
>>
>> On Thu, Apr 9, 2020 at 2:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> > Etsuro Fujita <etsuro.fujita@gmail.com> writes:
>> > > Yeah, partition_bounds_merge() is currently called only from
>> > > try_partitionwise_join(), which guarantees that the strategies are the
>> > > same.
>>
>> > If there's only one caller and there's not likely to ever be more,
>> > then I tend to agree that you don't need the assertion.
>>
>> It seems unlikely that partition_bounds_merge() will be called from
>> more places in the foreseeable future, so I'd still vote for removing
>> the assertion.
>
>When I wrote that function, I had UNION also in mind. A UNION across
>multiple partitioned relations will be partitioned if we can merge the
>partition bounds in a sensible manner. Of course the current structure
>of that function looks more purposed for join, but it's not difficult
>to convert it to be used for UNION as well. In that case those set of
>functions will have many more callers. So, I will vote to keep that
>assertion now that we have it there.

Yeah. I really don't see why we should remove an assertion that enforces
something useful, especially when it's just a plain comparions. Had it
been some expensive assert, maybe. But how much slower does this make
an assert-enabled build? 0.000000000001% or something like that?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Jeff Janes

Date:

10 April 2020, 15:14:22

On Thu, Apr 9, 2020 at 10:04 AM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:

On Thu, Apr 9, 2020 at 12:03 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
>
> On Thu, Apr 9, 2020 at 2:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Etsuro Fujita <etsuro.fujita@gmail.com> writes:
> > > Yeah, partition_bounds_merge() is currently called only from
> > > try_partitionwise_join(), which guarantees that the strategies are the
> > > same.
>
> > If there's only one caller and there's not likely to ever be more,
> > then I tend to agree that you don't need the assertion.
>
> It seems unlikely that partition_bounds_merge() will be called from
> more places in the foreseeable future, so I'd still vote for removing
> the assertion.

When I wrote that function, I had UNION also in mind. A UNION across
multiple partitioned relations will be partitioned if we can merge the
partition bounds in a sensible manner. Of course the current structure
of that function looks more purposed for join, but it's not difficult
to convert it to be used for UNION as well. In that case those set of
functions will have many more callers. So, I will vote to keep that
assertion now that we have it there.

In that case, we really should add the PG_USED_FOR_ASSERTS_ONLY to make the compiler happy.

Cheers,

Jeff

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Ashutosh Bapat

Date:

10 April 2020, 15:20:37

On Fri, 10 Apr 2020 at 20:44, Jeff Janes <jeff.janes@gmail.com> wrote:

On Thu, Apr 9, 2020 at 10:04 AM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
On Thu, Apr 9, 2020 at 12:03 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
>
> On Thu, Apr 9, 2020 at 2:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Etsuro Fujita <etsuro.fujita@gmail.com> writes:
> > > Yeah, partition_bounds_merge() is currently called only from
> > > try_partitionwise_join(), which guarantees that the strategies are the
> > > same.
>
> > If there's only one caller and there's not likely to ever be more,
> > then I tend to agree that you don't need the assertion.
>
> It seems unlikely that partition_bounds_merge() will be called from
> more places in the foreseeable future, so I'd still vote for removing
> the assertion.

When I wrote that function, I had UNION also in mind. A UNION across
multiple partitioned relations will be partitioned if we can merge the
partition bounds in a sensible manner. Of course the current structure
of that function looks more purposed for join, but it's not difficult
to convert it to be used for UNION as well. In that case those set of
functions will have many more callers. So, I will vote to keep that
assertion now that we have it there.

In that case, we really should add the PG_USED_FOR_ASSERTS_ONLY to make the compiler happy.

Attaching my patch again. It doesn't need PG_USED_FOR_ASSERTS_ONLY as well. Kuntal has confirmed that this fixes the warning for him.

Best Wishes,

Ashutosh

Re: [HACKERS] advanced partition matching algorithm for partition-wise join

From

Tom Lane

Date:

10 April 2020, 15:44:35

Ashutosh Bapat <ashutosh.bapat@2ndquadrant.com> writes:
> On Fri, 10 Apr 2020 at 20:44, Jeff Janes <jeff.janes@gmail.com> wrote:
>> In that case, we really should add the PG_USED_FOR_ASSERTS_ONLY to make
>> the compiler happy.

> Attaching my patch again. It doesn't need PG_USED_FOR_ASSERTS_ONLY as well.
> Kuntal has confirmed that this fixes the warning for him.

I see no patch here ...

            regards, tom lane

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Ashutosh Bapat

Date:

10 April 2020, 15:51:36

On Fri, Apr 10, 2020 at 9:14 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Ashutosh Bapat <ashutosh.bapat@2ndquadrant.com> writes:
> > On Fri, 10 Apr 2020 at 20:44, Jeff Janes <jeff.janes@gmail.com> wrote:
> >> In that case, we really should add the PG_USED_FOR_ASSERTS_ONLY to make
> >> the compiler happy.
>
> > Attaching my patch again. It doesn't need PG_USED_FOR_ASSERTS_ONLY as well.
> > Kuntal has confirmed that this fixes the warning for him.
>
> I see no patch here ...
>

Sorry. Here it is


-- 
Best Wishes,
Ashutosh Bapat

Attachment

fix_warning_in_adv_pwj.patch

Re: [HACKERS] advanced partition matching algorithm for partition-wise join

From

Tom Lane

Date:

10 April 2020, 15:59:58

Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> writes:
> On Fri, Apr 10, 2020 at 9:14 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I see no patch here ...

> Sorry. Here it is

LGTM, will push in a moment.

            regards, tom lane

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Etsuro Fujita

Date:

11 April 2020, 04:02:11

On Sat, Apr 11, 2020 at 1:00 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> writes:
> > On Fri, Apr 10, 2020 at 9:14 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> I see no patch here ...
>
> > Sorry. Here it is
>
> LGTM, will push in a moment.

Thanks for taking care of this, Tom!  Thanks for the patch, Ashutosh!
Thanks for the report, Kuntal!

Best regards,
Etsuro Fujita

Re: [HACKERS] advanced partition matching algorithm forpartition-wise join

From

Etsuro Fujita

Date:

16 April 2020, 09:20:53

On Mon, Apr 6, 2020 at 8:43 PM Ashutosh Bapat
<ashutosh.bapat@2ndquadrant.com> wrote:
> On Fri, 3 Apr 2020 at 20:45, Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> But it will be good to have following addition I suggested in my patches to make sure nparts is set to 0 for an
unpartitionedrelation as per the comment on nparts in RelOptInfo.
 
> @@ -1653,6 +1663,8 @@ build_joinrel_partition_info(RelOptInfo *joinrel, RelOptInfo *outer_rel,
>                                 jointype, restrictlist))
>     {
>         Assert(!IS_PARTITIONED_REL(joinrel));
> +       /* Join is not partitioned. */
> +       joinrel->nparts = 0;
>         return;
>     }

I didn't modified that function as proposed, because I thought that 1)
there would be no need to do so, and that 2) it would be better to set
joinrel->nparts only when we set joinrel->part_schema, for
consistency.

>> >> 3) I think the for nparts comment is somewhat misleading:
>> >>
>> >>    int nparts;  /* number of partitions; 0 = not partitioned;
>> >>                  * -1 = not yet set */
>> >>
>> >> which says that nparts=0 means not partitioned. But then there are
>> >> conditions like this:
>> >>
>> >>      /* Nothing to do, if the join relation is not partitioned. */
>> >>      if (joinrel->part_scheme == NULL || joinrel->nparts == 0)
>> >>          return;
>> >>
>> >> which (ignoring the obsolete comment) seems to say we can have nparts==0
>> >> even for partitioned tables, no?
>>
>> Yeah, I think I was a bit hasty.  I fixed this.

> For a non-join relation, nparts = 0 and nparts = -1 both have the same meaning. Although we never set nparts = 0 for
anon-join relation?
 

I don't think so.  Consider this:

create table prt (a int, b int) partition by range (a);
create table prt_p1 partition of prt for values from (0) to (250);
create table prt_p2 partition of prt for values from (250) to (500);
drop table prt_p1;
drop table prt_p2;
select count(*) from prt;

For this query, we would have nparts=0 for the partitioned table prt.

Thanks!  Sorry for the delay.

Best regards,
Etsuro Fujita