Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables - Mailing list pgsql-hackers
From | Ashutosh Bapat |
---|---|
Subject | Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables |
Date | |
Msg-id | CAFjFpReKnTXPkX68B+vdGC8F0eJ50TYqOzUte02=G8KF5x3A0g@mail.gmail.com Whole thread Raw |
In response to | Partition-wise join for join between (declaratively) partitioned tables (Ashutosh Bapat <ashutosh.bapat@enterprisedb.com>) |
Responses |
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
|
List | pgsql-hackers |
On Thu, Apr 6, 2017 at 6:37 AM, Robert Haas <robertmhaas@gmail.com> wrote: > >> There's a relevant comment in 0006, build_joinrel_partition_info() >> (probably that name needs to change, but I will do that once we have >> settled on design) >> + /* >> + * Construct partition keys for the join. >> + * >> + * An INNER join between two partitioned relations is partition by key >> + * expressions from both the relations. For tables A and B >> partitioned by a and b >> + * respectively, (A INNER JOIN B ON A.a = B.b) is partitioned by both A.a >> + * and B.b. >> + * >> + * An OUTER join like (A LEFT JOIN B ON A.a = B.b) may produce rows with >> + * B.b NULL. These rows may not fit the partitioning conditions imposed on >> + * B.b. Hence, strictly speaking, the join is not partitioned by B.b. >> + * Strictly speaking, partition keys of an OUTER join should include >> + * partition key expressions from the OUTER side only. Consider a join like >> + * (A LEFT JOIN B on (A.a = B.b) LEFT JOIN C ON B.b = C.c. If we do not >> + * include B.b as partition key expression for (AB), it prohibits us from >> + * using partition-wise join when joining (AB) with C as there is no >> + * equi-join between partition keys of joining relations. But two NULL >> + * values are never equal and no two rows from mis-matching partitions can >> + * join. Hence it's safe to include B.b as partition key expression for >> + * (AB), even though rows in (AB) are not strictly partitioned by B.b. >> + */ >> >> I think that also needs to be reviewed carefully. > > The following passage from src/backend/optimizer/README seems highly relevant: > > === > The planner's treatment of outer join reordering is based on the following > identities: > > 1. (A leftjoin B on (Pab)) innerjoin C on (Pac) > = (A innerjoin C on (Pac)) leftjoin B on (Pab) > > where Pac is a predicate referencing A and C, etc (in this case, clearly > Pac cannot reference B, or the transformation is nonsensical). > > 2. (A leftjoin B on (Pab)) leftjoin C on (Pac) > = (A leftjoin C on (Pac)) leftjoin B on (Pab) > > 3. (A leftjoin B on (Pab)) leftjoin C on (Pbc) > = A leftjoin (B leftjoin C on (Pbc)) on (Pab) > > Identity 3 only holds if predicate Pbc must fail for all-null B rows > (that is, Pbc is strict for at least one column of B). If Pbc is not > strict, the first form might produce some rows with nonnull C columns > where the second form would make those entries null. > === > > In other words, I think your statement that null is never equal to > null is a bit imprecise. Somebody could certainly create an operator > that is named "=" which returns true in that case, and then they could > say, hey, two nulls are equal (when you use that operator). The > argument needs to be made in terms of the formal properties of the > operator. [.. some portion clipped .. ] > The relevant logic is in have_partkey_equi_join: > > + /* Skip clauses which are not equality conditions. */ > + if (rinfo->hashjoinoperator == InvalidOid && > !rinfo->mergeopfamilies) > + continue; > > Actually, I think the hashjoinoperator test is formally and > practically unnecessary here; lower down there is a test to see > whether the partitioning scheme's operator family is a member of > rinfo->mergeopfamilies, which will certainly fail if we got through > this test with rinfo->mergeopfamilies == NIL just on the strength of > rinfo->hashjoinoperator != InvalidOid. So you can just bail out if > rinfo->mergeopfamilies == NIL. But the underlying point here is that > the only thing you really know about the function is that it's got to > be a strategy-3 operator in some btree opclass; if that guarantees > strictness, then so be it -- but I wasn't able to find anything in the > code or documentation off-hand that supports that contention, so we > might need to think a bit more about why (or if) this is guaranteed to > be true. > >> Partition-wise joins >> may be happy including partition keys from all sides, but >> partition-wise aggregates may not be, esp. when pushing complete >> aggregation down to partitions. In that case, rows with NULL partition >> key, which falls on nullable side of join, will be spread across >> multiple partitions. Proabably, we should separate nullable and >> non-nullable partition key expressions. > > I don't think I understand quite what you're getting at here. Can you > spell this out in more detail? To push an aggregate down to > partitions, you need the grouping key to match the applicable > partition key, and the partition key shouldn't allow nulls in more > than one place. Now I think your point may be that outer join > semantics could let them creep in there, e.g. SELECT b.x, sum(a.y) > FROM a LEFT JOIN b ON a.x = b.x GROUP BY 1 -- which would indeed be a > good test case for partitionwise aggregate. I'd be inclined to think > that we should just give up on partitionwise aggregate in such cases; > it's not worth trying to optimize such a weird query, at least IMHO. > (Does this sort of case ever happen with joins? I think not, as long > as the join operator is strict.) > I am revisiting NULL equality in the context of merging partition bounds. In [1] paragraphs following -- Do not write expression = NULL because NULL is not “equal to” NULL. (The null value represents an unknown value, and it is not known whether two unknown values are equal.) -- seem to indicate that an equality operator should never return true for two NULL values since it would never know whether two NULL (unknown) values are same or not. In a paragraph above, Robert stated that > In other words, I think your statement that null is never equal to > null is a bit imprecise. Somebody could certainly create an operator > that is named "=" which returns true in that case, and then they could > say, hey, two nulls are equal (when you use that operator). The > argument needs to be made in terms of the formal properties of the > operator. But in case a user has written an = operator which returns true for two NULL values, per description in [1], that comparison operator is flawed and using that operator is going to result in SQL-standard-incompliant behaviour. I have tried to preserve all the relevant portions of discussion in this mail. Am I missing something? [1] https://www.postgresql.org/docs/devel/static/functions-comparison.html -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
pgsql-hackers by date: