Home > mailing lists

Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables - Mailing list pgsql-hackers

From	Ashutosh Bapat
Subject	Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
Date	May 18, 2017 11:38:40
Msg-id	CAFjFpReKnTXPkX68B+vdGC8F0eJ50TYqOzUte02=G8KF5x3A0g@mail.gmail.com Whole thread
In response to	Partition-wise join for join between (declaratively) partitioned tables (Ashutosh Bapat <ashutosh.bapat@enterprisedb.com>)
Responses	Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
List	pgsql-hackers

Tree view

On Thu, Apr 6, 2017 at 6:37 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
>> There's a relevant comment in 0006, build_joinrel_partition_info()
>> (probably that name needs to change, but I will do that once we have
>> settled on design)
>> +   /*
>> +    * Construct partition keys for the join.
>> +    *
>> +    * An INNER join between two partitioned relations is partition by key
>> +    * expressions from both the relations. For tables A and B
>> partitioned by a and b
>> +    * respectively, (A INNER JOIN B ON A.a = B.b) is partitioned by both A.a
>> +    * and B.b.
>> +    *
>> +    * An OUTER join like (A LEFT JOIN B ON A.a = B.b) may produce rows with
>> +    * B.b NULL. These rows may not fit the partitioning conditions imposed on
>> +    * B.b. Hence, strictly speaking, the join is not partitioned by B.b.
>> +    * Strictly speaking, partition keys of an OUTER join should include
>> +    * partition key expressions from the OUTER side only. Consider a join like
>> +    * (A LEFT JOIN B on (A.a = B.b) LEFT JOIN C ON B.b = C.c. If we do not
>> +    * include B.b as partition key expression for (AB), it prohibits us from
>> +    * using partition-wise join when joining (AB) with C as there is no
>> +    * equi-join between partition keys of joining relations. But two NULL
>> +    * values are never equal and no two rows from mis-matching partitions can
>> +    * join. Hence it's safe to include B.b as partition key expression for
>> +    * (AB), even though rows in (AB) are not strictly partitioned by B.b.
>> +    */
>>
>> I think that also needs to be reviewed carefully.
>
> The following passage from src/backend/optimizer/README seems highly relevant:
>
> ===
> The planner's treatment of outer join reordering is based on the following
> identities:
>
> 1.      (A leftjoin B on (Pab)) innerjoin C on (Pac)
>         = (A innerjoin C on (Pac)) leftjoin B on (Pab)
>
> where Pac is a predicate referencing A and C, etc (in this case, clearly
> Pac cannot reference B, or the transformation is nonsensical).
>
> 2.      (A leftjoin B on (Pab)) leftjoin C on (Pac)
>         = (A leftjoin C on (Pac)) leftjoin B on (Pab)
>
> 3.      (A leftjoin B on (Pab)) leftjoin C on (Pbc)
>         = A leftjoin (B leftjoin C on (Pbc)) on (Pab)
>
> Identity 3 only holds if predicate Pbc must fail for all-null B rows
> (that is, Pbc is strict for at least one column of B).  If Pbc is not
> strict, the first form might produce some rows with nonnull C columns
> where the second form would make those entries null.
> ===
>
> In other words, I think your statement that null is never equal to
> null is a bit imprecise.  Somebody could certainly create an operator
> that is named "=" which returns true in that case, and then they could
> say, hey, two nulls are equal (when you use that operator).  The
> argument needs to be made in terms of the formal properties of the
> operator.

[.. some portion clipped .. ]

> The relevant logic is in have_partkey_equi_join:
>
> +               /* Skip clauses which are not equality conditions. */
> +               if (rinfo->hashjoinoperator == InvalidOid &&
> !rinfo->mergeopfamilies)
> +                       continue;
>
> Actually, I think the hashjoinoperator test is formally and
> practically unnecessary here; lower down there is a test to see
> whether the partitioning scheme's operator family is a member of
> rinfo->mergeopfamilies, which will certainly fail if we got through
> this test with rinfo->mergeopfamilies == NIL just on the strength of
> rinfo->hashjoinoperator != InvalidOid.  So you can just bail out if
> rinfo->mergeopfamilies == NIL.  But the underlying point here is that
> the only thing you really know about the function is that it's got to
> be a strategy-3 operator in some btree opclass; if that guarantees
> strictness, then so be it -- but I wasn't able to find anything in the
> code or documentation off-hand that supports that contention, so we
> might need to think a bit more about why (or if) this is guaranteed to
> be true.
>
>> Partition-wise joins
>> may be happy including partition keys from all sides, but
>> partition-wise aggregates may not be, esp. when pushing complete
>> aggregation down to partitions. In that case, rows with NULL partition
>> key, which falls on nullable side of join, will be spread across
>> multiple partitions. Proabably, we should separate nullable and
>> non-nullable partition key expressions.
>
> I don't think I understand quite what you're getting at here.  Can you
> spell this out in more detail?  To push an aggregate down to
> partitions, you need the grouping key to match the applicable
> partition key, and the partition key shouldn't allow nulls in more
> than one place.  Now I think your point may be that outer join
> semantics could let them creep in there, e.g. SELECT b.x, sum(a.y)
> FROM a LEFT JOIN b ON a.x = b.x GROUP BY 1 -- which would indeed be a
> good test case for partitionwise aggregate.  I'd be inclined to think
> that we should just give up on partitionwise aggregate in such cases;
> it's not worth trying to optimize such a weird query, at least IMHO.
> (Does this sort of case ever happen with joins?  I think not, as long
> as the join operator is strict.)
>

I am revisiting NULL equality in the context of merging partition
bounds. In [1] paragraphs following

--
Do not write expression = NULL because NULL is not “equal to” NULL.
(The null value represents an unknown value, and it is not known
whether two unknown values are equal.)

--

seem to indicate that an equality operator should never return true
for two NULL values since it would never know whether two NULL
(unknown) values are same or not. In a paragraph above, Robert stated
that

> In other words, I think your statement that null is never equal to
> null is a bit imprecise.  Somebody could certainly create an operator
> that is named "=" which returns true in that case, and then they could
> say, hey, two nulls are equal (when you use that operator).  The
> argument needs to be made in terms of the formal properties of the
> operator.

But in case a user has written an = operator which returns true for
two NULL values, per description in [1], that comparison operator is
flawed and
using that operator is going to result in SQL-standard-incompliant
behaviour. I have tried to preserve all the relevant portions of
discussion in this mail. Am I missing something?

[1] https://www.postgresql.org/docs/devel/static/functions-comparison.html

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

pgsql-hackers by date:

From: David Rowley
Date: 18 May 2017, 11:28:56
Subject: [HACKERS] Regression in join selectivity estimations when using foreign keys

From: Konstantin Knizhnik
Date: 18 May 2017, 11:57:57
Subject: Re: [HACKERS] Cached plans and statement generalization

Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables - Mailing list pgsql-hackers

Previous

Next