Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables - Mailing list pgsql-hackers

From Ashutosh Bapat
Subject Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
Date
Msg-id CAFjFpRcXN+3qZxHXzEsiApS6kLytf6C6VxkzpzPekXDM2Lq-5A@mail.gmail.com
Whole thread Raw
In response to Partition-wise join for join between (declaratively) partitioned tables  (Ashutosh Bapat <ashutosh.bapat@enterprisedb.com>)
List pgsql-hackers
On Thu, Apr 6, 2017 at 6:37 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Apr 5, 2017 at 2:42 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> Only inner join conditions have equivalence classes associated with
>> those. Outer join conditions create single element equivalence
>> classes. So, we can not associate equivalence classes as they are with
>> partition scheme. If we could do that, it makes life much easier since
>> checking whether equi-join between all partition keys exist, is simply
>> looking up equivalence classes that cover joining relations and find
>> em_member corresponding to partition keys.
>
> OK.
>
>> It looks like we should only keep strategy, partnatts, partopfamily
>> and parttypcoll in PartitionScheme. A partition-wise join between two
>> relations would be possible if all those match.
>
> Yes, I think so. Conceivably you could even exclude partnatts and
> strategy, since there's nothing preventing a partitionwise join
> between a list-partitioned table and a range-partitioned table, or
> between a table range-partitioned on (a) and another range-partitioned
> on (a, b), but there is probably not much benefit in trying to cover
> such cases.  I think it's reasonable to tell users that this is only
> going to work when the partitioning strategy is the same and the join
> conditions include all of the partitioning columns on both sides.
>
>> There's a relevant comment in 0006, build_joinrel_partition_info()
>> (probably that name needs to change, but I will do that once we have
>> settled on design)
>> +   /*
>> +    * Construct partition keys for the join.
>> +    *
>> +    * An INNER join between two partitioned relations is partition by key
>> +    * expressions from both the relations. For tables A and B
>> partitioned by a and b
>> +    * respectively, (A INNER JOIN B ON A.a = B.b) is partitioned by both A.a
>> +    * and B.b.
>> +    *
>> +    * An OUTER join like (A LEFT JOIN B ON A.a = B.b) may produce rows with
>> +    * B.b NULL. These rows may not fit the partitioning conditions imposed on
>> +    * B.b. Hence, strictly speaking, the join is not partitioned by B.b.
>> +    * Strictly speaking, partition keys of an OUTER join should include
>> +    * partition key expressions from the OUTER side only. Consider a join like
>> +    * (A LEFT JOIN B on (A.a = B.b) LEFT JOIN C ON B.b = C.c. If we do not
>> +    * include B.b as partition key expression for (AB), it prohibits us from
>> +    * using partition-wise join when joining (AB) with C as there is no
>> +    * equi-join between partition keys of joining relations. But two NULL
>> +    * values are never equal and no two rows from mis-matching partitions can
>> +    * join. Hence it's safe to include B.b as partition key expression for
>> +    * (AB), even though rows in (AB) are not strictly partitioned by B.b.
>> +    */
>>
>> I think that also needs to be reviewed carefully.
>
> The following passage from src/backend/optimizer/README seems highly relevant:
>
> ===
> The planner's treatment of outer join reordering is based on the following
> identities:
>
> 1.      (A leftjoin B on (Pab)) innerjoin C on (Pac)
>         = (A innerjoin C on (Pac)) leftjoin B on (Pab)
>
> where Pac is a predicate referencing A and C, etc (in this case, clearly
> Pac cannot reference B, or the transformation is nonsensical).
>
> 2.      (A leftjoin B on (Pab)) leftjoin C on (Pac)
>         = (A leftjoin C on (Pac)) leftjoin B on (Pab)
>
> 3.      (A leftjoin B on (Pab)) leftjoin C on (Pbc)
>         = A leftjoin (B leftjoin C on (Pbc)) on (Pab)
>
> Identity 3 only holds if predicate Pbc must fail for all-null B rows
> (that is, Pbc is strict for at least one column of B).  If Pbc is not
> strict, the first form might produce some rows with nonnull C columns
> where the second form would make those entries null.
> ===
>
> In other words, I think your statement that null is never equal to
> null is a bit imprecise.  Somebody could certainly create an operator
> that is named "=" which returns true in that case, and then they could
> say, hey, two nulls are equal (when you use that operator).  The
> argument needs to be made in terms of the formal properties of the
> operator.  The relevant logic is in have_partkey_equi_join:
>
> +               /* Skip clauses which are not equality conditions. */
> +               if (rinfo->hashjoinoperator == InvalidOid &&
> !rinfo->mergeopfamilies)
> +                       continue;
>
> Actually, I think the hashjoinoperator test is formally and
> practically unnecessary here; lower down there is a test to see
> whether the partitioning scheme's operator family is a member of
> rinfo->mergeopfamilies, which will certainly fail if we got through
> this test with rinfo->mergeopfamilies == NIL just on the strength of
> rinfo->hashjoinoperator != InvalidOid.  So you can just bail out if
> rinfo->mergeopfamilies == NIL.  But the underlying point here is that
> the only thing you really know about the function is that it's got to
> be a strategy-3 operator in some btree opclass; if that guarantees
> strictness, then so be it -- but I wasn't able to find anything in the
> code or documentation off-hand that supports that contention, so we
> might need to think a bit more about why (or if) this is guaranteed to
> be true.

I need more time to think about this. Will get back to this soon.

>
>> Partition-wise joins
>> may be happy including partition keys from all sides, but
>> partition-wise aggregates may not be, esp. when pushing complete
>> aggregation down to partitions. In that case, rows with NULL partition
>> key, which falls on nullable side of join, will be spread across
>> multiple partitions. Proabably, we should separate nullable and
>> non-nullable partition key expressions.
>
> I don't think I understand quite what you're getting at here.  Can you
> spell this out in more detail?  To push an aggregate down to
> partitions, you need the grouping key to match the applicable
> partition key, and the partition key shouldn't allow nulls in more
> than one place.  Now I think your point may be that outer join
> semantics could let them creep in there, e.g. SELECT b.x, sum(a.y)
> FROM a LEFT JOIN b ON a.x = b.x GROUP BY 1 -- which would indeed be a
> good test case for partitionwise aggregate.  I'd be inclined to think
> that we should just give up on partitionwise aggregate in such cases;
> it's not worth trying to optimize such a weird query, at least IMHO.
> (Does this sort of case ever happen with joins?  I think not, as long
> as the join operator is strict.)

Yes, this is the case, I am thinking about. No, it doesn't happen with join.

>
> I spent some time thinking about this patch set today and I don't see
> that there's much point in committing any more of this to v10.  I
> think that 0001 and 0002 are probably committable or very close at
> this point.  However, 0001 is adding more complexity than I think is
> warranted until we're actually ready to commit the feature that uses
> it, and 0002 is so small that committing isn't really going to smooth
> future development much.  0003-0009 are essentially all one big patch
> that will have to be committed together.

Ok. Thanks.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



pgsql-hackers by date:

Previous
From: Craig Ringer
Date:
Subject: Re: [HACKERS] SCRAM authentication, take three
Next
From: David Rowley
Date:
Subject: Re: [HACKERS] Compiler warning in costsize.c