Re: [HACKERS] advanced partition matching algorithm forpartition-wise join - Mailing list pgsql-hackers

From Ashutosh Bapat
Subject Re: [HACKERS] advanced partition matching algorithm forpartition-wise join
Date
Msg-id CAExHW5uZAzhNYj9hhgoKj1d3mqJZH6yGqvY=ecBFk9W=7zL9yA@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] advanced partition matching algorithm forpartition-wise join  (Etsuro Fujita <etsuro.fujita@gmail.com>)
Responses Re: [HACKERS] advanced partition matching algorithm forpartition-wise join  (Etsuro Fujita <etsuro.fujita@gmail.com>)
List pgsql-hackers
Hi Fujita-san,
I reviewed the patch. Except for the logic of matching the pairs of partitions from already merged partitions, I think the code changes are good. But there are several places where it needs further changes to comments. The attached patch has those. I have described some of them below.
+ * We can not perform partition-wise join if either of the joining
+ * relations is not partitioned.

We are consistently using partitionwise instead of partition-wise.

+ /*
+ * See if the partition bounds for inputs are exactly the same, in
+ * which case we don't need to work hard: partitions with the same
+ * partition indexes will form join pairs, and the join rel will have
+ * the same partition bounds as inputs; otherwise try to merge the
+ * partition bounds along with generating join pairs.

Phrase "joining relations" is better than "inputs", IMO. Updated in the
attached patch.

+ /*
+ * If the partition bounds for the join rel are not merged ones,
+ * inputs are guaranteed to have the same partition bounds, so
+ * partitions with the same partition indexes will form join pairs;
+ * else let get_matching_part_pairs() do the work.
+ */
+ if (joinrel->merged)
+ {

This condition in the comment is opposite to the condition being checked in
code, so looks confusing. BTW this comment is also repeated around line 1405.
See attached patch for correction.

+ /*
+ * If this segment of the join is empty, it means that this segment

"partition of the join" looks consistent with other usages than "segment of the
join". Modified in the attached patch.

+ /*
+ * Get a relids set of partition(s) involved in this join segment that
+ * are from the rel1 side.
+ */
+ child_relids1 = bms_intersect(child_joinrel->relids,
+  rel1->all_partrels);

The partition bounds are sorted by their values. Even for a list partitioned
table, the bounds are sorted by the least partition value. We do not allow
multiple paritions from one side to be joined with one partition on the other
and vice versa. All this together means that the partitions of the join
relation are formed by joining partitions from either side in the order of
their bounds. This means that the matching pairs of partitions can be found by
matching relids of partitions of join with those of the joining relation by
traversing partitions from all the three relations only once in the order they
appears in partition bounds of corresponding relations. If we use this
algorithm, we don't need all_partrels to be collected and also don't need to
search base or join relation. That, I think, will reduce the time and space
complexity of this logic. Am I missing something? As a side effect it won't
require any special handling for base and join relation.

+ /*
+ * Get a child rel for rel1 with the relids.  Note that we should have
+ * the child rel even if rel1 is a join rel, because in that case the
+ * partitions specified in the relids would have matching/overlapping
+ * boundaries, so those partitions should be considered as ones to be
+ * joined even when planning partitionwise joins of rel1, meaning that
+ * the child rel would have been built by the time we get here.
+ */
+ if (rel1_is_simple)

This variable is used only in one place. So instead we should the expression
assigning the value to it. Changed in the attached patch.

- rel->nparts = 0;
+ rel->nparts = -1;

I think we need to add comments around various values that nparts can take. How
about like something attached.

+ case PARTITION_STRATEGY_HASH:
+ merged_bounds = NULL;

I think, we need to explain why we aren't merging hash partition bounds. AFAIU,
the reason is thus: When the modulo of used for partition mapping (i.e. maximum
number of has partitions) is same, their partition bounds are same and do not
need merging. If the maximum number of partitions is different for both the
joining relations, there's high probability that one partition on one side will
join with multiple partitions on the other side. So exact partition bounds
match will work in most of the cases. The cases otherwise are not so common to
spend the effort in coding and planning.

I have added this explanation in the patch. Don't know if it's there written
somewhere already.

+ if (part_index == -1)
+ return -1;
+ } while (is_dummy_partition(rel, part_index));

I understand why we are skipping NULL positions. I am not sure why are we are
skipping over RelOptInfos which exist but are marked as dummy; we can still
create a join pair with those partitions.

+/*
+ * get_merged_range_bounds
+ * Given the bounds of range partitions to be join, determine the range

s/join/joined/

There are more changes to comments, where I thought that the comments are
required or existing comments need more clarification. Please review the
attached patch. This patch is created on top of
v32-0001-Improve-partition-matching-for-partitionwise-join.

On Mon, Feb 10, 2020 at 5:14 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
On Fri, Feb 7, 2020 at 9:57 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> On Thu, Feb 6, 2020 at 3:55 AM Mark Dilger <mark.dilger@enterprisedb.com> wrote:
> > The patches apply and pass all tests.  A review of the patch vs. master looks reasonable.

I've merged the patches.  Attached is a new version of the patch.

> > The partition_join.sql test has multiple levels of partitioning, but when your patch extends that test with “advanced partition-wise join”, none of the tables for the new section have multiple levels.  I spent a little while reviewing the code and inventing multiple level partitioning tests for advanced partition-wise join and did not encounter any problems.  I don’t care whether you use this particular example, but do you want to have multiple level partitioning in the new test section?
>
> Yes, I do.
>
> > CREATE TABLE alpha (a double precision, b double precision) PARTITION BY RANGE (a);
> > CREATE TABLE alpha_neg PARTITION OF alpha FOR VALUES FROM ('-Infinity') TO (0) PARTITION BY RANGE (b);
> > CREATE TABLE alpha_pos PARTITION OF alpha FOR VALUES FROM (0) TO ('Infinity') PARTITION BY RANGE (b);
> > CREATE TABLE alpha_nan PARTITION OF alpha FOR VALUES FROM ('Infinity') TO ('NaN');
> > CREATE TABLE alpha_neg_neg PARTITION OF alpha_neg FOR VALUES FROM ('-Infinity') TO (0);
> > CREATE TABLE alpha_neg_pos PARTITION OF alpha_neg FOR VALUES FROM (0) TO ('Infinity');
> > CREATE TABLE alpha_neg_nan PARTITION OF alpha_neg FOR VALUES FROM ('Infinity') TO ('NaN');
> > CREATE TABLE alpha_pos_neg PARTITION OF alpha_pos FOR VALUES FROM ('-Infinity') TO (0);
> > CREATE TABLE alpha_pos_pos PARTITION OF alpha_pos FOR VALUES FROM (0) TO ('Infinity');
> > CREATE TABLE alpha_pos_nan PARTITION OF alpha_pos FOR VALUES FROM ('Infinity') TO ('NaN');
> > INSERT INTO alpha (a, b)
> >     (SELECT * FROM
> >         (VALUES (-1.0::float8), (0.0::float8), (1.0::float8), ('Infinity'::float8)) a,
> >         (VALUES (-1.0::float8), (0.0::float8), (1.0::float8), ('Infinity'::float8)) b
> >     );
> > ANALYZE alpha;
> > ANALYZE alpha_neg;
> > ANALYZE alpha_pos;
> > ANALYZE alpha_nan;
> > ANALYZE alpha_neg_neg;
> > ANALYZE alpha_neg_pos;
> > ANALYZE alpha_neg_nan;
> > ANALYZE alpha_pos_neg;
> > ANALYZE alpha_pos_pos;
> > ANALYZE alpha_pos_nan;
> > CREATE TABLE beta (a double precision, b double precision) PARTITION BY RANGE (a, b);
> > CREATE TABLE beta_lo PARTITION OF beta FOR VALUES FROM (-5, -5) TO (0, 0);
> > CREATE TABLE beta_me PARTITION OF beta FOR VALUES FROM (0, 0) TO (0, 5);
> > CREATE TABLE beta_hi PARTITION OF beta FOR VALUES FROM (0, 5) TO (5, 5);
> > INSERT INTO beta (a, b)
> >     (SELECT * FROM
> >         (VALUES (-1.0::float8), (0.0::float8), (1.0::float8)) a,
> >         (VALUES (-1.0::float8), (0.0::float8), (1.0::float8)) b
> >     );
> > ANALYZE beta;
> > ANALYZE beta_lo;
> > ANALYZE beta_me;
> > ANALYZE beta_hi;
> > EXPLAIN SELECT * FROM alpha INNER JOIN beta ON (alpha.a = beta.a AND alpha.b = beta.b) WHERE alpha.a = 1 AND beta.b = 1;
> >                                   QUERY PLAN
> > -------------------------------------------------------------------------------
> >  Nested Loop  (cost=0.00..2.11 rows=1 width=32)
> >    ->  Seq Scan on alpha_pos_pos alpha  (cost=0.00..1.06 rows=1 width=16)
> >          Filter: ((b = '1'::double precision) AND (a = '1'::double precision))
> >    ->  Seq Scan on beta_hi beta  (cost=0.00..1.04 rows=1 width=16)
> >          Filter: ((b = '1'::double precision) AND (a = '1'::double precision))
> > (5 rows)
>
> Hmm, I'm not sure this is a good test case for that, because this
> result would be due to partition pruning applied to each side of the
> join before considering partition-wise join; you could get the same
> result even with enable_partitionwise_join=off.  I think it's
> important that the partition-wise join logic doesn't break this query,
> though.

I think this would be beyond the scope of the patch, so I added
different test cases that I think would be better as ones for
multi-level partitioning.

Thanks!

Best regards,
Etsuro Fujita


--
--
Best Wishes,
Ashutosh Bapat
Attachment

pgsql-hackers by date:

Previous
From: Jeff Davis
Date:
Subject: Re: Add LogicalTapeSetExtend() to logtape.c
Next
From: Mark Wong
Date:
Subject: Re: [GSoC 2020] Questions About Performance Farm Benchmarks andWebsite