Re: [HACKERS] advanced partition matching algorithm forpartition-wise join - Mailing list pgsql-hackers

From Etsuro Fujita
Subject Re: [HACKERS] advanced partition matching algorithm forpartition-wise join
Date
Msg-id CAPmGK14WHKckT1P6UJV2B63TZAxPyMn8iZJ99XF=AZuNhG6vow@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] advanced partition matching algorithm forpartition-wise join  (amul sul <sulamul@gmail.com>)
Responses Re: [HACKERS] advanced partition matching algorithm forpartition-wise join
List pgsql-hackers
Hi Amul,

On Mon, Sep 2, 2019 at 2:08 PM amul sul <sulamul@gmail.com> wrote:
> Please find my comments inline below:

Thank you for the review!

> On Wed, Aug 28, 2019 at 3:52 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
>> On Fri, Aug 16, 2019 at 10:25 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:

> On thinking further, a thought hits to me why we can't join two hash partitioned
> table which has the same modulus and partition key specification, but some of
> the partitions are missing from either partitioned table.
>
> For e.g. here is a smaller case where foo has two partitions and bar has only one.
>
> CREATE TABLE foo(a int) PARTITION BY HASH(a);
> CREATE TABLE foo_p0 PARTITION OF foo FOR VALUES WITH(modulus 2, remainder 0);
> CREATE TABLE foo_p1 PARTITION OF foo FOR VALUES WITH(modulus 2, remainder 1);
>
> CREATE TABLE bar(a int) PARTITION BY HASH(a);  <-- missing partitions for REMAINDER 1
> CREATE TABLE bar_p0 PARTITION OF bar FOR VALUES WITH(modulus 2, remainder 0);
>
> Explain:
> postgres=# explain select * from foo p1, bar p2 where p1.a = p2.a;
>                                    QUERY PLAN
> ---------------------------------------------------------------------------------
>  Merge Join  (cost=590.35..1578.47 rows=65025 width=8)
>    Merge Cond: (p2.a = p1.a)
>    ->  Sort  (cost=179.78..186.16 rows=2550 width=4)
>          Sort Key: p2.a
>          ->  Seq Scan on bar_p0 p2  (cost=0.00..35.50 rows=2550 width=4)
>    ->  Sort  (cost=410.57..423.32 rows=5100 width=4)
>          Sort Key: p1.a
>          ->  Append  (cost=0.00..96.50 rows=5100 width=4)
>                ->  Seq Scan on foo_p0 p1  (cost=0.00..35.50 rows=2550 width=4)
>                ->  Seq Scan on foo_p1 p1_1  (cost=0.00..35.50 rows=2550 width=4)
> (10 rows)
>
> The partitions-wise join will be performed only if we fill the partition hole that
> exists for the joining table i.e. adding partitions to bar table.
>
> postgres=# CREATE TABLE bar_p1 PARTITION OF bar FOR VALUES WITH(modulus 2, remainder 1);
> CREATE TABLE
> postgres=# explain select * from foo p1, bar p2 where p1.a = p2.a;
>                                    QUERY PLAN
> ---------------------------------------------------------------------------------
>  Append  (cost=359.57..2045.11 rows=65024 width=8)
>    ->  Merge Join  (cost=359.57..860.00 rows=32512 width=8)
>          Merge Cond: (p1.a = p2.a)
>          ->  Sort  (cost=179.78..186.16 rows=2550 width=4)
>                Sort Key: p1.a
>                ->  Seq Scan on foo_p0 p1  (cost=0.00..35.50 rows=2550 width=4)
>          ->  Sort  (cost=179.78..186.16 rows=2550 width=4)
>                Sort Key: p2.a
>                ->  Seq Scan on bar_p0 p2  (cost=0.00..35.50 rows=2550 width=4)
>    ->  Merge Join  (cost=359.57..860.00 rows=32512 width=8)
>          Merge Cond: (p1_1.a = p2_1.a)
>          ->  Sort  (cost=179.78..186.16 rows=2550 width=4)
>                Sort Key: p1_1.a
>                ->  Seq Scan on foo_p1 p1_1  (cost=0.00..35.50 rows=2550 width=4)
>          ->  Sort  (cost=179.78..186.16 rows=2550 width=4)
>                Sort Key: p2_1.a
>                ->  Seq Scan on bar_p1 p2_1  (cost=0.00..35.50 rows=2550 width=4)
> (17 rows)
>
> It would have been nice if we could support this case, as we do allow partitions
> hole for the other partition scheme, but there wouldn't be much objection if
> we don't want to add this support for now since there will be a lesser chance
> that hash partitioned table has the hole, IMO.

I agree with you on that point.

>> If there are no objections, I'll merge the attached with the base patch in [1].

> The proposed enhancement in the patch is too good and the patch is pretty much
> reasonable to merge into the main patch.

Done.  Attached is a merged version of the patch.

> Here are the few cosmetic fixes for this path I think is needed. Feel free to
> ignore if some of them do not make sense or too obvious.
>
> Note: left side number represents code line number of the patch.
>
> 118 +       }
> 119 +
> 120 +       /*
> 121 +        * Try to create the partition bounds along with join pairs.
> 122 +        */
> 123 +       if (boundinfo == NULL)
> 124 +       {
>
> Can we add this block as else part of previous if condition checking equal partitions bound?

Done.

> 133 +           Assert(list_length(parts1) == list_length(parts2));
> 134 +           if (boundinfo == NULL)
> 135 +           {
> 136 +               joinrel->nparts = 0;
> 137 +               return;
> 138 +           }
> 139 +           nparts = list_length(parts1);

> And the question is do we need to do that assert check since partition_bounds_merge()
> does that just before returning, thoughts?

You are right; that assertion would be redundant, so I removed that.

> 204 +       RelOptInfo *child_rel1 = merged ? (RelOptInfo *) lfirst(lcr1) : rel1->part_rels[cnt_parts];
> 205 +       RelOptInfo *child_rel2 = merged ? (RelOptInfo *) lfirst(lcr2) : rel2->part_rels[cnt_parts];
>
> How about using lfirst_node instead of lfirst & casting explicitly?
>
> Also, these lines crossing 80 column length which I think we need to fix. How about
> doing the assignment as follow, just after the variable declaration part:
>
>         if (merged)
>         {
>             child_rel1 = lfirst_node(lcr1, RelOptInfo);
>             child_rel2 = lfirst_node(lcr2, RelOptInfo);
>             lcr1 = lnext(parts1, lcr1);
>             lcr2 = lnext(parts2, lcr2);
>         }
>         else
>         {
>             child_rel1 = rel1->part_rels[cnt_parts];
>             child_rel2 =  rel2->part_rels[cnt_parts]
>         }
>
>         rel1_empty = (child_rel1 == NULL || IS_DUMMY_REL(child_rel1));
>         rel2_empty = (child_rel2 == NULL || IS_DUMMY_REL(child_rel2));

Done.  (I don't think the order of arguments for first_node() above is
correct; the first argument for that should be RelOptInfo.)

> 266 + * get_matching_part_pairs
> 267 + *     Generate join pairs of partitions for the two inputs for the join rel
>
> Can we rewrite this description as  " Generate join pairs of partitions for the
> join rel from the two inputs." OR  "Generate join pairs of partitions for the
> two inputs"

Done.

> 310 +       Assert(bms_num_members(child_relids1) == bms_num_members(rel1->relids));
> 311 +       /*
>
> 335 +       Assert(bms_num_members(child_relids2) == bms_num_members(rel2->relids));
> 336 +       /*
>
> Need newline after assert statements.

Done.

Other change: I guess this in partbounds.c would be leftovers from an
older version of the patch, so I removed it.

+/*
+ * Allocate and initialize partition maps. We maintain four maps, two maps
+ * for each joining relation. pmap[i] gives the partition from the other
+ * relation which would join with ith partition of the given relation.
+ * Partition i from the given relation will join with partition pmap[i]
+ * from the other relation to produce partition mmap[i] of the join (merged
+ * partition).
+ *
+ * pmap[i] = -1 indicates that ith partition of a given relation does not
+ * have a matching partition from the other relation.
+ *
+ * mmap[i] = -1 indicates that ith partition of a given relation does not
+ * contribute to the join result. That can happen only when the given
+ * relation is the inner relation and it doesn't have a matching partition
+ * from the outer relation, hence pmap[i] should be -1.
+ *
+ * In case of an outer join, every partition of the outer join will appear
+ * in the join result, and thus has mmap[i] set for all i. But it's not
+ * necessary that every partition on the outer side will have a matching
+ * partition on the inner side. In such a case, we end up with pmap[i] = -1
+ * and mmap[i] != -1.
+ */

I will continue to review the rest of the patch.

I am sorry for the long delay.

Best regards,
Etsuro Fujita

Attachment

pgsql-hackers by date:

Previous
From: vignesh C
Date:
Subject: Re: dropdb --force
Next
From: Robert Haas
Date:
Subject: Re: Unwanted expression simplification in PG12b2