Thread: d25ea01275 and partitionwise join
Hi Tom, I think an assumption of d25ea01275 breaks partitionwise join. Sorry it took me a while to report it. In https://postgr.es/m/8168.1560446056@sss.pgh.pa.us, Tom wrote: > I poked into this and found the cause. For the sample query, we have > an EquivalenceClass containing the expression > COALESCE(COALESCE(Var_1_1, Var_2_1), Var_3_1) > where each of the Vars belongs to an appendrel parent. > add_child_rel_equivalences() needs to add expressions representing the > transform of that to each child relation. That is, if the children > of table 1 are A1 and A2, of table 2 are B1 and B2, and of table 3 > are C1 and C2, what we'd like to add are the expressions > COALESCE(COALESCE(Var_A1_1, Var_2_1), Var_3_1) > COALESCE(COALESCE(Var_A2_1, Var_2_1), Var_3_1) > COALESCE(COALESCE(Var_1_1, Var_B1_1), Var_3_1) > COALESCE(COALESCE(Var_1_1, Var_B2_1), Var_3_1) > COALESCE(COALESCE(Var_1_1, Var_2_1), Var_C1_1) > COALESCE(COALESCE(Var_1_1, Var_2_1), Var_C2_1) > However, what it's actually producing is additional combinations for > each appendrel after the first, because each call also mutates the > previously-added child expressions. So in this example we also get > COALESCE(COALESCE(Var_A1_1, Var_B1_1), Var_3_1) > COALESCE(COALESCE(Var_A2_1, Var_B2_1), Var_3_1) > COALESCE(COALESCE(Var_A1_1, Var_2_1), Var_C1_1) > COALESCE(COALESCE(Var_A2_1, Var_2_1), Var_C2_1) > COALESCE(COALESCE(Var_A1_1, Var_B1_1), Var_C1_1) > COALESCE(COALESCE(Var_A2_1, Var_B2_1), Var_C2_1) > With two appendrels involved, that's O(N^2) expressions; with > three appendrels, more like O(N^3). > > This is by no means specific to FULL JOINs; you could get the same > behavior with join clauses like "WHERE t1.a + t2.b + t3.c = t4.d". > > These extra expressions don't have any use, since we're not > going to join the children directly to each other. ...unless partition wise join thinks they can be joined. Partition wise join can't handle 3-way full joins today, but only because it's broken itself when trying to match a full join clause to the partition key due to one side being a COALESCE expression. Consider this example query: -- p is defined as: -- create table p (a int) partition by list (a); -- create table p1 partition of p for values in (1); -- create table p2 partition of p for values in (2); explain select * from p t1 full outer join p t2 using (a) full outer join p t3 using (a) full outer join p t4 using (a) order by 1; QUERY PLAN ───────────────────────────────────────────────────────────────────────────────────────────────────────────────── Sort (cost=16416733.32..16628145.85 rows=84565012 width=4) Sort Key: (COALESCE(COALESCE(COALESCE(t1.a, t2.a), t3.a), t4.a)) -> Merge Full Join (cost=536957.40..1813748.77 rows=84565012 width=4) Merge Cond: (t4.a = (COALESCE(COALESCE(t1.a, t2.a), t3.a))) -> Sort (cost=410.57..423.32 rows=5100 width=4) Sort Key: t4.a -> Append (cost=0.00..96.50 rows=5100 width=4) -> Seq Scan on p1 t4 (cost=0.00..35.50 rows=2550 width=4) -> Seq Scan on p2 t4_1 (cost=0.00..35.50 rows=2550 width=4) -> Materialize (cost=536546.83..553128.21 rows=3316275 width=12) -> Sort (cost=536546.83..544837.52 rows=3316275 width=12) Sort Key: (COALESCE(COALESCE(t1.a, t2.a), t3.a)) -> Merge Full Join (cost=14254.85..64024.48 rows=3316275 width=12) Merge Cond: (t3.a = (COALESCE(t1.a, t2.a))) -> Sort (cost=410.57..423.32 rows=5100 width=4) Sort Key: t3.a -> Append (cost=0.00..96.50 rows=5100 width=4) -> Seq Scan on p1 t3 (cost=0.00..35.50 rows=2550 width=4) -> Seq Scan on p2 t3_1 (cost=0.00..35.50 rows=2550 width=4) -> Sort (cost=13844.29..14169.41 rows=130050 width=8) Sort Key: (COALESCE(t1.a, t2.a)) -> Merge Full Join (cost=821.13..2797.38 rows=130050 width=8) Merge Cond: (t1.a = t2.a) -> Sort (cost=410.57..423.32 rows=5100 width=4) Sort Key: t1.a -> Append (cost=0.00..96.50 rows=5100 width=4) -> Seq Scan on p1 t1 (cost=0.00..35.50 rows=2550 width=4) -> Seq Scan on p2 t1_1 (cost=0.00..35.50 rows=2550 width=4) -> Sort (cost=410.57..423.32 rows=5100 width=4) Sort Key: t2.a -> Append (cost=0.00..96.50 rows=5100 width=4) -> Seq Scan on p1 t2 (cost=0.00..35.50 rows=2550 width=4) -> Seq Scan on p2 t2_1 (cost=0.00..35.50 rows=2550 width=4) -- turn on enable_partitionwise_join set enable_partitionwise_join to on; explain select * from p t1 full outer join p t2 using (a) full outer join p t3 using (a) full outer join p t4 using (a) order by 1; QUERY PLAN ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── Sort (cost=16385259.94..16596672.47 rows=84565012 width=4) Sort Key: (COALESCE(COALESCE(COALESCE(t1.a, t2.a), t3.a), t4.a)) -> Merge Full Join (cost=505484.02..1782275.39 rows=84565012 width=4) Merge Cond: (t4.a = (COALESCE(COALESCE(t1.a, t2.a), t3.a))) -> Sort (cost=410.57..423.32 rows=5100 width=4) Sort Key: t4.a -> Append (cost=0.00..96.50 rows=5100 width=4) -> Seq Scan on p1 t4 (cost=0.00..35.50 rows=2550 width=4) -> Seq Scan on p2 t4_1 (cost=0.00..35.50 rows=2550 width=4) -> Materialize (cost=505073.45..521654.83 rows=3316275 width=12) -> Sort (cost=505073.45..513364.14 rows=3316275 width=12) Sort Key: (COALESCE(COALESCE(t1.a, t2.a), t3.a)) -> Merge Full Join (cost=7653.92..32551.10 rows=3316275 width=12) Merge Cond: (t3.a = (COALESCE(t1.a, t2.a))) -> Sort (cost=410.57..423.32 rows=5100 width=4) Sort Key: t3.a -> Append (cost=0.00..96.50 rows=5100 width=4) -> Seq Scan on p1 t3 (cost=0.00..35.50 rows=2550 width=4) -> Seq Scan on p2 t3_1 (cost=0.00..35.50 rows=2550 width=4) -> Sort (cost=7243.35..7405.91 rows=65024 width=8) Sort Key: (COALESCE(t1.a, t2.a)) -> Result (cost=359.57..2045.11 rows=65024 width=8) -> Append (cost=359.57..2045.11 rows=65024 width=8) -> Merge Full Join (cost=359.57..860.00 rows=32512 width=8) Merge Cond: (t1.a = t2.a) -> Sort (cost=179.78..186.16 rows=2550 width=4) Sort Key: t1.a -> Seq Scan on p1 t1 (cost=0.00..35.50 rows=2550 width=4) -> Sort (cost=179.78..186.16 rows=2550 width=4) Sort Key: t2.a -> Seq Scan on p1 t2 (cost=0.00..35.50 rows=2550 width=4) -> Merge Full Join (cost=359.57..860.00 rows=32512 width=8) Merge Cond: (t1_1.a = t2_1.a) -> Sort (cost=179.78..186.16 rows=2550 width=4) Sort Key: t1_1.a -> Seq Scan on p2 t1_1 (cost=0.00..35.50 rows=2550 width=4) -> Sort (cost=179.78..186.16 rows=2550 width=4) Sort Key: t2_1.a -> Seq Scan on p2 t2_1 (cost=0.00..35.50 rows=2550 width=4) See how it only managed to use partition wise join up to 2-way join, but gives up at 3-way join and higher, because the join condition looks like this: t3.a = (COALESCE(t1.a, t2.a). When building the join relation (t1, t2, t3) between (t3) and (t1, t2), it fails to see that COALESCE(t1.a, t2.a) actually matches the partition key of (t1, t2). When I fix the code that does the matching and run with merge joins disabled, I can get a plan where the whole 4-way join is partitioned: explain select * from p t1 full outer join p t2 using (a) full outer join p t3 using (a) full outer join p t4 using (a) order by 1; QUERY PLAN ───────────────────────────────────────────────────────────────────────────────────────────────────── Gather Merge (cost=831480.11..1859235.87 rows=8808720 width=4) Workers Planned: 2 -> Sort (cost=830480.09..841490.99 rows=4404360 width=4) Sort Key: (COALESCE(COALESCE(COALESCE(t1.a, t2.a), t3.a), t4.a)) -> Parallel Append (cost=202.12..224012.93 rows=4404360 width=4) -> Hash Full Join (cost=202.12..201991.13 rows=5285232 width=4) Hash Cond: (COALESCE(COALESCE(t1.a, t2.a), t3.a) = t4.a) -> Hash Full Join (cost=134.75..15904.32 rows=414528 width=12) Hash Cond: (COALESCE(t1.a, t2.a) = t3.a) -> Hash Full Join (cost=67.38..1247.18 rows=32512 width=8) Hash Cond: (t1.a = t2.a) -> Seq Scan on p1 t1 (cost=0.00..35.50 rows=2550 width=4) -> Hash (cost=35.50..35.50 rows=2550 width=4) -> Seq Scan on p1 t2 (cost=0.00..35.50 rows=2550 width=4) -> Hash (cost=35.50..35.50 rows=2550 width=4) -> Seq Scan on p1 t3 (cost=0.00..35.50 rows=2550 width=4) -> Hash (cost=35.50..35.50 rows=2550 width=4) -> Seq Scan on p1 t4 (cost=0.00..35.50 rows=2550 width=4) -> Hash Full Join (cost=202.12..201991.13 rows=5285232 width=4) Hash Cond: (COALESCE(COALESCE(t1_1.a, t2_1.a), t3_1.a) = t4_1.a) -> Hash Full Join (cost=134.75..15904.32 rows=414528 width=12) Hash Cond: (COALESCE(t1_1.a, t2_1.a) = t3_1.a) -> Hash Full Join (cost=67.38..1247.18 rows=32512 width=8) Hash Cond: (t1_1.a = t2_1.a) -> Seq Scan on p2 t1_1 (cost=0.00..35.50 rows=2550 width=4) -> Hash (cost=35.50..35.50 rows=2550 width=4) -> Seq Scan on p2 t2_1 (cost=0.00..35.50 rows=2550 width=4) -> Hash (cost=35.50..35.50 rows=2550 width=4) -> Seq Scan on p2 t3_1 (cost=0.00..35.50 rows=2550 width=4) -> Hash (cost=35.50..35.50 rows=2550 width=4) -> Seq Scan on p2 t4_1 (cost=0.00..35.50 rows=2550 width=4) (31 rows) But with merge joins enabled: explain select * from p t1 full outer join p t2 using (a) full outer join p t3 using (a) full outer join p t4 using (a) order by 1; ERROR: could not find pathkey item to sort That's because, there's no child COALESCE(t1_1.a, t2_1.a) expression in the EC that contains COALESCE(t1.a, t2.a), where t1_1 and t2_1 represent the 1st partition of t1 and t2, resp. The problem is that add_child_rel_equivalences(), as of d25ea01275, only adds the following child expressions of COALESCE(t1.a, t2.a): -- when translating t1 COALESCE(t1_1.a, t2.a) COALESCE(t1_2.a, t2.a) -- when translating t2 COALESCE(t1.a, t2_1.a) COALESCE(t1.a, t2_2.a) whereas previously, the following would be added too when translating t2: COALESCE(t1_1.a, t2_1.a) COALESCE(t1_1.a, t2_2.a) COALESCE(t1_2.a, t2_1.a) COALESCE(t1_2.a, t2_2.a) Note that of those, only COALESCE(t1_1.a, t2_1.a) and COALESCE(t1_2.a, t2_2.a) are interesting, because partition wise join will only ever consider pairs (t1_1, t2_1) and (t1_2, t2_2) to be joined. We can get the needed child expressions and still avoid the combinatorial explosion in the size of resulting EC members list if we taught add_child_rel_equivalences() to only translate ECs that the input parent relation is capable of producing. So, COALESCE(t1.a, t2.a) will not be translated if the input relation is only (t1) or (t2), that is, when called from set_append_rel_size(). Instead it would be translated if it's passed the joinrel (t1, t2). IOW, teach build_child_join_rel() to call add_child_rel_equivalences(), which I've tried to implement in the attached. I have attached two patches. 0001 - fix partitionwise join to work correctly with n-way joins of which some are full joins (+ cosmetic improvements around the code that was touched) 0002 - fix to translate multi-relation EC members correctly Thanks, Amit
Attachment
On Tue, Jul 2, 2019 at 6:29 PM Amit Langote <amitlangote09@gmail.com> wrote: > 0001 - fix partitionwise join to work correctly with n-way joins of > which some are full joins (+ cosmetic improvements around the code > that was touched) Here are my comments about the cosmetic improvements: they seem pretty large to me, so I'd make a separate patch for that. In addition, I'd move have_partkey_equi_join() and match_expr_to_partition_keys() to relnode.c, because these functions are only used in that file. Best regards, Etsuro Fujita
Fujita-san, Thanks for looking at this. On Tue, Jul 16, 2019 at 8:22 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: > > On Tue, Jul 2, 2019 at 6:29 PM Amit Langote <amitlangote09@gmail.com> wrote: > > 0001 - fix partitionwise join to work correctly with n-way joins of > > which some are full joins (+ cosmetic improvements around the code > > that was touched) > > Here are my comments about the cosmetic improvements: they seem pretty > large to me, so I'd make a separate patch for that. OK, my bad that I added so many cosmetic changes into a patch that is meant to fix the main issue. Just to clarify, I'm proposing these cosmetic improvements to better clarify the terminological separation between nullable and non-nullable partition keys, which I found a bit hard to understand as is. I've broken the patch into two: 0001 contains only cosmetic changes and 0002 the fix for handling full joins properly. Would you rather that be reversed? > In addition, I'd > move have_partkey_equi_join() and match_expr_to_partition_keys() to > relnode.c, because these functions are only used in that file. I hadn't noticed that. Makes sense to move them to relnode.c, which is implemented in 0001. Thanks, Amit
Attachment
On Thu, Jul 18, 2019 at 11:18 AM Amit Langote <amitlangote09@gmail.com> wrote: > On Tue, Jul 16, 2019 at 8:22 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: > > On Tue, Jul 2, 2019 at 6:29 PM Amit Langote <amitlangote09@gmail.com> wrote: > > > 0001 - fix partitionwise join to work correctly with n-way joins of > > > which some are full joins (+ cosmetic improvements around the code > > > that was touched) > > > > Here are my comments about the cosmetic improvements: they seem pretty > > large to me, so I'd make a separate patch for that. > > OK, my bad that I added so many cosmetic changes into a patch that is > meant to fix the main issue. Just to clarify, I'm proposing these > cosmetic improvements to better clarify the terminological separation > between nullable and non-nullable partition keys, which I found a bit > hard to understand as is. OK, thanks for the explanation! > I've broken the patch into two: 0001 contains only cosmetic changes > and 0002 the fix for handling full joins properly. Would you rather > that be reversed? I like this order. > > In addition, I'd > > move have_partkey_equi_join() and match_expr_to_partition_keys() to > > relnode.c, because these functions are only used in that file. > > I hadn't noticed that. Makes sense to move them to relnode.c, which > is implemented in 0001. Thanks for including that! Will review. Best regards, Etsuro Fujita
Fujita-san, On Thu, Jul 18, 2019 at 8:10 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: > > On Thu, Jul 18, 2019 at 11:18 AM Amit Langote <amitlangote09@gmail.com> wrote: > > On Tue, Jul 16, 2019 at 8:22 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: > > > On Tue, Jul 2, 2019 at 6:29 PM Amit Langote <amitlangote09@gmail.com> wrote: > > > > 0001 - fix partitionwise join to work correctly with n-way joins of > > > > which some are full joins (+ cosmetic improvements around the code > > > > that was touched) > > > > > > Here are my comments about the cosmetic improvements: they seem pretty > > > large to me, so I'd make a separate patch for that. > > > > OK, my bad that I added so many cosmetic changes into a patch that is > > meant to fix the main issue. Just to clarify, I'm proposing these > > cosmetic improvements to better clarify the terminological separation > > between nullable and non-nullable partition keys, which I found a bit > > hard to understand as is. > > OK, thanks for the explanation! > > > I've broken the patch into two: 0001 contains only cosmetic changes > > and 0002 the fix for handling full joins properly. Would you rather > > that be reversed? > > I like this order. > > > > In addition, I'd > > > move have_partkey_equi_join() and match_expr_to_partition_keys() to > > > relnode.c, because these functions are only used in that file. > > > > I hadn't noticed that. Makes sense to move them to relnode.c, which > > is implemented in 0001. > > Thanks for including that! Will review. To avoid losing track of this, I've added this to November CF. https://commitfest.postgresql.org/25/2278/ I know there is one more patch beside the partitionwise join fix, but I've set the title to suggest that this is related mainly to partitionwise joins. Thanks, Amit
Hi Amit,
On Wed, Sep 4, 2019 at 10:01 AM Amit Langote <amitlangote09@gmail.com> wrote:
Fujita-san,
To avoid losing track of this, I've added this to November CF.
https://commitfest.postgresql.org/25/2278/
I know there is one more patch beside the partitionwise join fix, but
I've set the title to suggest that this is related mainly to
partitionwise joins.
Thank you for working on this. Currently partitionwise join does not
take COALESCE expr into consideration when matching to partition keys.
This is a problem.
BTW, a rebase is needed for the patch set.
take COALESCE expr into consideration when matching to partition keys.
This is a problem.
BTW, a rebase is needed for the patch set.
Thanks
Richard
Hi Amit,
On Wed, Sep 4, 2019 at 3:30 PM Richard Guo <riguo@pivotal.io> wrote:
Hi Amit,On Wed, Sep 4, 2019 at 10:01 AM Amit Langote <amitlangote09@gmail.com> wrote:Fujita-san,
To avoid losing track of this, I've added this to November CF.
https://commitfest.postgresql.org/25/2278/
I know there is one more patch beside the partitionwise join fix, but
I've set the title to suggest that this is related mainly to
partitionwise joins.Thank you for working on this. Currently partitionwise join does not
take COALESCE expr into consideration when matching to partition keys.
This is a problem.
BTW, a rebase is needed for the patch set.
processed in match_join_arg_to_partition_keys().
If there is a COALESCE expr with first arg being non-partition key expr
and second arg being partition key, the patch would match it to the
partition key, which may result in wrong results in some cases.
For instance, consider the partition table below:
create table p (k int, val int) partition by range(k);
create table p_1 partition of p for values from (1) to (10);
create table p_2 partition of p for values from (10) to (100);
So with patch v2-0002, the following query will be planned with
partitionwise join.
# explain (costs off)
select * from (p as t1 full join p as t2 on t1.k = t2.k) as t12(k1,val1,k2,val2)
full join p as t3 on COALESCE(t12.val1, t12.k1) = t3.k;
QUERY PLAN
----------------------------------------------------------
Append
-> Hash Full Join
Hash Cond: (COALESCE(t1.val, t1.k) = t3.k)
-> Hash Full Join
Hash Cond: (t1.k = t2.k)
-> Seq Scan on p_1 t1
-> Hash
-> Seq Scan on p_1 t2
-> Hash
-> Seq Scan on p_1 t3
-> Hash Full Join
Hash Cond: (COALESCE(t1_1.val, t1_1.k) = t3_1.k)
-> Hash Full Join
Hash Cond: (t1_1.k = t2_1.k)
-> Seq Scan on p_2 t1_1
-> Hash
-> Seq Scan on p_2 t2_1
-> Hash
-> Seq Scan on p_2 t3_1
(19 rows)
But as t1.val is not a partition key, actually we cannot use
partitionwise join here.
If we insert below data into the table, we will get wrong results for
the query above.
insert into p select 5,15;
insert into p select 15,5;
Thanks
Richard
Hello Richard, On Wed, Sep 4, 2019 at 4:30 PM Richard Guo <riguo@pivotal.io> wrote: > > Hi Amit, > > On Wed, Sep 4, 2019 at 10:01 AM Amit Langote <amitlangote09@gmail.com> wrote: >> >> Fujita-san, >> >> To avoid losing track of this, I've added this to November CF. >> >> https://commitfest.postgresql.org/25/2278/ >> >> I know there is one more patch beside the partitionwise join fix, but >> I've set the title to suggest that this is related mainly to >> partitionwise joins. > > > Thank you for working on this. Currently partitionwise join does not > take COALESCE expr into consideration when matching to partition keys. > This is a problem. > > BTW, a rebase is needed for the patch set. Thanks a lot for looking at this. I tried rebasing today and found that adopting this patch to the following recent commit to equivalence processing code would take some time that I don't currently have. commit 3373c7155350cf6fcd51dd090f29e1332901e329 Author: David Rowley <drowley@postgresql.org> Date: Sun Jul 21 17:30:58 2019 +1200 Speed up finding EquivalenceClasses for a given set of rels I will come back to this in a couple of weeks, along with addressing your other comments. Thanks, Amit
Hi Richard, Thanks a lot for taking a close look at the patch and sorry about the delay. On Wed, Sep 4, 2019 at 5:29 PM Richard Guo <riguo@pivotal.io> wrote: >> On Wed, Sep 4, 2019 at 10:01 AM Amit Langote <amitlangote09@gmail.com> wrote: > I'm reviewing v2-0002 and I have concern about how COALESCE expr is > processed in match_join_arg_to_partition_keys(). > > If there is a COALESCE expr with first arg being non-partition key expr > and second arg being partition key, the patch would match it to the > partition key, which may result in wrong results in some cases. > > For instance, consider the partition table below: > > create table p (k int, val int) partition by range(k); > create table p_1 partition of p for values from (1) to (10); > create table p_2 partition of p for values from (10) to (100); > > So with patch v2-0002, the following query will be planned with > partitionwise join. > > # explain (costs off) > select * from (p as t1 full join p as t2 on t1.k = t2.k) as t12(k1,val1,k2,val2) > full join p as t3 on COALESCE(t12.val1, t12.k1) = t3.k; > QUERY PLAN > ---------------------------------------------------------- > Append > -> Hash Full Join > Hash Cond: (COALESCE(t1.val, t1.k) = t3.k) > -> Hash Full Join > Hash Cond: (t1.k = t2.k) > -> Seq Scan on p_1 t1 > -> Hash > -> Seq Scan on p_1 t2 > -> Hash > -> Seq Scan on p_1 t3 > -> Hash Full Join > Hash Cond: (COALESCE(t1_1.val, t1_1.k) = t3_1.k) > -> Hash Full Join > Hash Cond: (t1_1.k = t2_1.k) > -> Seq Scan on p_2 t1_1 > -> Hash > -> Seq Scan on p_2 t2_1 > -> Hash > -> Seq Scan on p_2 t3_1 > (19 rows) > > But as t1.val is not a partition key, actually we cannot use > partitionwise join here. > > If we insert below data into the table, we will get wrong results for > the query above. > > insert into p select 5,15; > insert into p select 15,5; Good catch! It's quite wrong to use COALESCE(t12.val1, t12.k1) = t3.k for partitionwise join as the COALESCE expression might as well output the value of val1 which doesn't conform to partitioning. I've fixed match_join_arg_to_partition_keys() to catch that case and fail. Added a test case as well. Please find attached updated patches. Thanks, Amit
Attachment
On Thu, Sep 19, 2019 at 4:15 PM Amit Langote <amitlangote09@gmail.com> wrote:
Hi Richard,
Thanks a lot for taking a close look at the patch and sorry about the delay.
On Wed, Sep 4, 2019 at 5:29 PM Richard Guo <riguo@pivotal.io> wrote:
>> On Wed, Sep 4, 2019 at 10:01 AM Amit Langote <amitlangote09@gmail.com> wrote:
> I'm reviewing v2-0002 and I have concern about how COALESCE expr is
> processed in match_join_arg_to_partition_keys().
>
> If there is a COALESCE expr with first arg being non-partition key expr
> and second arg being partition key, the patch would match it to the
> partition key, which may result in wrong results in some cases.
>
> For instance, consider the partition table below:
>
> create table p (k int, val int) partition by range(k);
> create table p_1 partition of p for values from (1) to (10);
> create table p_2 partition of p for values from (10) to (100);
>
> So with patch v2-0002, the following query will be planned with
> partitionwise join.
>
> # explain (costs off)
> select * from (p as t1 full join p as t2 on t1.k = t2.k) as t12(k1,val1,k2,val2)
> full join p as t3 on COALESCE(t12.val1, t12.k1) = t3.k;
> QUERY PLAN
> ----------------------------------------------------------
> Append
> -> Hash Full Join
> Hash Cond: (COALESCE(t1.val, t1.k) = t3.k)
> -> Hash Full Join
> Hash Cond: (t1.k = t2.k)
> -> Seq Scan on p_1 t1
> -> Hash
> -> Seq Scan on p_1 t2
> -> Hash
> -> Seq Scan on p_1 t3
> -> Hash Full Join
> Hash Cond: (COALESCE(t1_1.val, t1_1.k) = t3_1.k)
> -> Hash Full Join
> Hash Cond: (t1_1.k = t2_1.k)
> -> Seq Scan on p_2 t1_1
> -> Hash
> -> Seq Scan on p_2 t2_1
> -> Hash
> -> Seq Scan on p_2 t3_1
> (19 rows)
>
> But as t1.val is not a partition key, actually we cannot use
> partitionwise join here.
>
> If we insert below data into the table, we will get wrong results for
> the query above.
>
> insert into p select 5,15;
> insert into p select 15,5;
Good catch! It's quite wrong to use COALESCE(t12.val1, t12.k1) = t3.k
for partitionwise join as the COALESCE expression might as well output
the value of val1 which doesn't conform to partitioning.
I've fixed match_join_arg_to_partition_keys() to catch that case and
fail. Added a test case as well.
Please find attached updated patches.
Thank you for the fix. Will review.
Thanks
Richard
On Thu, Sep 19, 2019 at 05:15:37PM +0900, Amit Langote wrote: > Please find attached updated patches. Tom pointed me to this thread, since we hit it in 12.0 https://www.postgresql.org/message-id/flat/16802.1570989962%40sss.pgh.pa.us#070f6675a11dff17760b1cfccf1c038d I can't say much about the patch; there's a little typo: "The nullability of inner relation keys prevents them to" ..should say "prevent them from". In order to compile it against REL12, I tried to cherry-pick this one: 3373c715: Speed up finding EquivalenceClasses for a given set of rels But then it crashes in check-world (possibly due to misapplied hunks). -- Justin Pryzby System Administrator Telsasoft +1-952-707-8581
On Sun, Oct 13, 2019 at 03:02:17PM -0500, Justin Pryzby wrote: > On Thu, Sep 19, 2019 at 05:15:37PM +0900, Amit Langote wrote: > > Please find attached updated patches. > > Tom pointed me to this thread, since we hit it in 12.0 > https://www.postgresql.org/message-id/flat/16802.1570989962%40sss.pgh.pa.us#070f6675a11dff17760b1cfccf1c038d > > I can't say much about the patch; there's a little typo: > "The nullability of inner relation keys prevents them to" > ..should say "prevent them from". > > In order to compile it against REL12, I tried to cherry-pick this one: > 3373c715: Speed up finding EquivalenceClasses for a given set of rels > > But then it crashes in check-world (possibly due to misapplied hunks). I did it again paying more attention and got it to pass. The PWJ + FULL JOIN query seems ok now. But I'll leave PWJ disabled until I can look more closely. $ PGOPTIONS='-c max_parallel_workers_per_gather=0 -c enable_mergejoin=off -c enable_hashagg=off -c enable_partitionwise_join=on'psql postgres -f tmp/sql-2019-10-11.1 SET Nested Loop (cost=80106964.13..131163200.28 rows=2226681567 width=6) Join Filter: ((s.site_location = ''::text) OR ((s.site_office)::integer = ((COALESCE(t1.site_id, t2.site_id))::integer))) -> Group (cost=80106964.13..80837945.46 rows=22491733 width=12) Group Key: (COALESCE(t1.start_time, t2.start_time)), ((COALESCE(t1.site_id, t2.site_id))::integer) -> Merge Append (cost=80106964.13..80613028.13 rows=22491733 width=12) Sort Key: (COALESCE(t1.start_time, t2.start_time)), ((COALESCE(t1.site_id, t2.site_id))::integer) -> Group (cost=25494496.54..25633699.28 rows=11136219 width=12) Group Key: (COALESCE(t1.start_time, t2.start_time)), ((COALESCE(t1.site_id, t2.site_id))::integer) -> Sort (cost=25494496.54..25522337.09 rows=11136219 width=12) Sort Key: (COALESCE(t1.start_time, t2.start_time)), ((COALESCE(t1.site_id, t2.site_id))::integer) -> Hash Full Join (cost=28608.75..24191071.36 rows=11136219 width=12) Hash Cond: ((t1.start_time = t2.start_time) AND (t1.site_id = t2.site_id)) Filter: ((COALESCE(t1.start_time, t2.start_time) >= '2019-10-01 00:00:00'::timestamp withouttime zone) AND (COALESCE(t1.start_time, t2.start_time) < '2019-10-01 01:00:00'::timestamp without time zone)) -> Seq Scan on t1 (cost=0.00..14495.10 rows=940910 width=10) -> Hash (cost=14495.10..14495.10 rows=940910 width=10) -> Seq Scan on t1 t2 (cost=0.00..14495.10 rows=940910 width=10) -> Group (cost=54612467.58..54754411.51 rows=11355514 width=12) Group Key: (COALESCE(t1_1.start_time, t2_1.start_time)), ((COALESCE(t1_1.site_id, t2_1.site_id))::integer) -> Sort (cost=54612467.58..54640856.37 rows=11355514 width=12) Sort Key: (COALESCE(t1_1.start_time, t2_1.start_time)), ((COALESCE(t1_1.site_id, t2_1.site_id))::integer) -> Hash Full Join (cost=28608.75..53281777.94 rows=11355514 width=12) Hash Cond: ((t1_1.start_time = t2_1.start_time) AND (t1_1.site_id = t2_1.site_id)) Filter: ((COALESCE(t1_1.start_time, t2_1.start_time) >= '2019-10-01 00:00:00'::timestampwithout time zone) AND (COALESCE(t1_1.start_time, t2_1.start_time) < '2019-10-01 01:00:00'::timestampwithout time zone)) -> Seq Scan on t2 t1_1 (cost=0.00..14495.10 rows=940910 width=10) -> Hash (cost=14495.10..14495.10 rows=940910 width=10) -> Seq Scan on t2 t2_1 (cost=0.00..14495.10 rows=940910 width=10) -> Materialize (cost=0.00..2.48 rows=99 width=6) -> Seq Scan on s (cost=0.00..1.99 rows=99 width=6) -- Justin Pryzby System Administrator Telsasoft +1-952-707-8581
Hi Justin, On Mon, Oct 14, 2019 at 5:02 AM Justin Pryzby <pryzby@telsasoft.com> wrote: > > On Thu, Sep 19, 2019 at 05:15:37PM +0900, Amit Langote wrote: > > Please find attached updated patches. > > Tom pointed me to this thread, since we hit it in 12.0 > https://www.postgresql.org/message-id/flat/16802.1570989962%40sss.pgh.pa.us#070f6675a11dff17760b1cfccf1c038d > > I can't say much about the patch; there's a little typo: > "The nullability of inner relation keys prevents them to" > ..should say "prevent them from". Thanks, will fix. Regards, Amit
Amit Langote <amitlangote09@gmail.com> writes: > On Mon, Oct 14, 2019 at 5:02 AM Justin Pryzby <pryzby@telsasoft.com> wrote: >> I can't say much about the patch; there's a little typo: >> "The nullability of inner relation keys prevents them to" >> ..should say "prevent them from". > Thanks, will fix. Just to leave a breadcrumb in this thread --- the planner failure induced by d25ea01275 has been fixed in 529ebb20a. The difficulty with multiway full joins that Amit started this thread with remains open, but I imagine the posted patches will need rebasing over 529ebb20a. regards, tom lane
On Wed, Nov 6, 2019 at 2:00 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Amit Langote <amitlangote09@gmail.com> writes: > > On Mon, Oct 14, 2019 at 5:02 AM Justin Pryzby <pryzby@telsasoft.com> wrote: > >> I can't say much about the patch; there's a little typo: > >> "The nullability of inner relation keys prevents them to" > >> ..should say "prevent them from". > > > Thanks, will fix. > > Just to leave a breadcrumb in this thread --- the planner failure > induced by d25ea01275 has been fixed in 529ebb20a. The difficulty > with multiway full joins that Amit started this thread with remains > open, but I imagine the posted patches will need rebasing over > 529ebb20a. Here are the rebased patches. I've divided the patch containing only cosmetic improvements into two patches, of which the latter only moves around code. Patch 0003 implements the actual fix to the problem with multiway joins. Thanks, Amit
Attachment
Amit Langote <amitlangote09@gmail.com> writes: > On Wed, Nov 6, 2019 at 2:00 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Just to leave a breadcrumb in this thread --- the planner failure >> induced by d25ea01275 has been fixed in 529ebb20a. The difficulty >> with multiway full joins that Amit started this thread with remains >> open, but I imagine the posted patches will need rebasing over >> 529ebb20a. > Here are the rebased patches. The cfbot shows these patches as failing regression tests. I think it is just cosmetic fallout from 6ef77cf46 having changed EXPLAIN's choices of table alias names; but please look closer to confirm, and post updated patches. regards, tom lane
On Sat, Feb 29, 2020 at 8:18 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Amit Langote <amitlangote09@gmail.com> writes: > > On Wed, Nov 6, 2019 at 2:00 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> Just to leave a breadcrumb in this thread --- the planner failure > >> induced by d25ea01275 has been fixed in 529ebb20a. The difficulty > >> with multiway full joins that Amit started this thread with remains > >> open, but I imagine the posted patches will need rebasing over > >> 529ebb20a. > > > Here are the rebased patches. > > The cfbot shows these patches as failing regression tests. I think > it is just cosmetic fallout from 6ef77cf46 having changed EXPLAIN's > choices of table alias names; but please look closer to confirm, > and post updated patches. Thanks for notifying. Checked and indeed fallout from 6ef77cf46 seems to be the reason a test is failing. Updated patches attached. Thanks, Amit
Attachment
Amit Langote <amitlangote09@gmail.com> writes: > Updated patches attached. I looked through these and committed 0001+0002, with some further comment-polishing. However, I have no faith at all in 0003. It is blithely digging through COALESCE expressions with no concern for whether they came from full joins or not, or whether the other values being coalesced to might completely change the semantics. Digging through PlaceHolderVars scares me even more; what's that got to do with the problem, anyway? So while this might fix the complained-of issue of failing to use a partitionwise join, I think it wouldn't be hard to create examples that it would incorrectly turn into partitionwise joins. I wonder whether it'd be feasible to fix the problem by going in the other direction; that is, while constructing the nullable_partexprs lists for a full join, add synthesized COALESCE() expressions for the output columns (by wrapping COALESCE around copies of the input rels' partition expressions), and then not need to do anything special in match_expr_to_partition_keys. We'd still need to convince ourselves that this did the right thing and not any of the wrong things, but I think it might be easier to prove it that way. regards, tom lane
On Sat, Apr 4, 2020 at 6:13 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Amit Langote <amitlangote09@gmail.com> writes: > > Updated patches attached. > > I looked through these and committed 0001+0002, with some further > comment-polishing. However, I have no faith at all in 0003. Thanks for the review. > It is > blithely digging through COALESCE expressions with no concern for > whether they came from full joins or not, or whether the other values > being coalesced to might completely change the semantics. Digging > through PlaceHolderVars scares me even more; what's that got to do > with the problem, anyway? So while this might fix the complained-of > issue of failing to use a partitionwise join, I think it wouldn't be > hard to create examples that it would incorrectly turn into > partitionwise joins. > > I wonder whether it'd be feasible to fix the problem by going in the > other direction; that is, while constructing the nullable_partexprs > lists for a full join, add synthesized COALESCE() expressions for the > output columns (by wrapping COALESCE around copies of the input rels' > partition expressions), and then not need to do anything special in > match_expr_to_partition_keys. We'd still need to convince ourselves > that this did the right thing and not any of the wrong things, but > I think it might be easier to prove it that way. Okay, I tried that in the updated patch. I didn't try to come up with examples that might break it though. -- Thank you, Amit Langote EnterpriseDB: http://www.enterprisedb.com
Attachment
Amit Langote <amitlangote09@gmail.com> writes: > Okay, I tried that in the updated patch. I didn't try to come up with > examples that might break it though. I looked through this. * Wasn't excited about inventing makeCoalesceExpr(); the fact that it only had two potential call sites seemed to make it not worth the trouble. Plus, as defined, it could not handle the general case of COALESCE, which can have N arguments; so that seemed like a recipe for confusion. * I think your logic for building the coalesce combinations was just wrong. We need combinations of left-hand inputs with right-hand inputs, not left-hand with left-hand or right-hand with right-hand. Also the nesting should already have happened in the inputs, we don't need to try to generate it here. The looping code was pretty messy as well. * I don't think using partopcintype is necessarily right; that could be a polymorphic type, for instance. Safer to copy the type of the input expressions. Likely we could have got away with using partcollation, but for consistency I copied that as well. * You really need to update the data structure definitional comments when you make a change like this. * I did not like putting a test case that requires enable_partitionwise_aggregate in the partition_join test; that seems misplaced. But it's not necessary to the test, is it? * I did not follow the point of your second test case. The WHERE constraint on p1.a allows the planner to strength-reduce the joins, which is why there's no full join in that explain result, but then we aren't going to get to this code at all. I repaired the above in the attached. I'm actually sort of pleasantly surprised that this worked; I was not sure that building COALESCEs like this would provide the result we needed. But it seems okay -- it does fix the behavior in this 3-way test case, as well as the 4-way join you showed at the top of the thread. It's fairly dependent on the fact that the planner won't rearrange full joins; otherwise, the COALESCE structures we build here might not match those made at parse time. But that's not likely to change anytime soon; and this is hardly the only place that would break, so I'm not sweating about it. (I have some vague ideas about getting rid of the COALESCEs as part of the Var redefinition I've been muttering about, anyway; there might be a cleaner fix for this if that happens.) Anyway, I think this is probably OK for now. Given that the nullable_partexprs lists are only used in one place, it's pretty hard to see how it would break anything. regards, tom lane diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c index af1fb48..e1cc11c 100644 --- a/src/backend/optimizer/util/relnode.c +++ b/src/backend/optimizer/util/relnode.c @@ -17,6 +17,7 @@ #include <limits.h> #include "miscadmin.h" +#include "nodes/nodeFuncs.h" #include "optimizer/appendinfo.h" #include "optimizer/clauses.h" #include "optimizer/cost.h" @@ -1890,7 +1891,8 @@ set_joinrel_partition_key_exprs(RelOptInfo *joinrel, RelOptInfo *outer_rel, RelOptInfo *inner_rel, JoinType jointype) { - int partnatts = joinrel->part_scheme->partnatts; + PartitionScheme part_scheme = joinrel->part_scheme; + int partnatts = part_scheme->partnatts; joinrel->partexprs = (List **) palloc0(sizeof(List *) * partnatts); joinrel->nullable_partexprs = @@ -1899,7 +1901,8 @@ set_joinrel_partition_key_exprs(RelOptInfo *joinrel, /* * The joinrel's partition expressions are the same as those of the input * rels, but we must properly classify them as nullable or not in the - * joinrel's output. + * joinrel's output. (Also, we add some more partition expressions if + * it's a FULL JOIN.) */ for (int cnt = 0; cnt < partnatts; cnt++) { @@ -1910,6 +1913,7 @@ set_joinrel_partition_key_exprs(RelOptInfo *joinrel, const List *inner_null_expr = inner_rel->nullable_partexprs[cnt]; List *partexpr = NIL; List *nullable_partexpr = NIL; + ListCell *lc; switch (jointype) { @@ -1969,6 +1973,31 @@ set_joinrel_partition_key_exprs(RelOptInfo *joinrel, outer_null_expr); nullable_partexpr = list_concat(nullable_partexpr, inner_null_expr); + + /* + * Also add CoalesceExprs corresponding to each possible + * full-join output variable (that is, left side coalesced to + * right side), so that we can match equijoin expressions + * using those variables. We assume no type coercions are + * needed to make the join outputs. + */ + foreach(lc, list_concat_copy(outer_expr, outer_null_expr)) + { + Node *larg = (Node *) lfirst(lc); + ListCell *lc2; + + foreach(lc2, list_concat_copy(inner_expr, inner_null_expr)) + { + Node *rarg = (Node *) lfirst(lc2); + CoalesceExpr *c = makeNode(CoalesceExpr); + + c->coalescetype = exprType(larg); + c->coalescecollid = exprCollation(larg); + c->args = list_make2(larg, rarg); + c->location = -1; + nullable_partexpr = lappend(nullable_partexpr, c); + } + } break; default: diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h index 469c686..39c7b2f 100644 --- a/src/include/nodes/pathnodes.h +++ b/src/include/nodes/pathnodes.h @@ -613,6 +613,9 @@ typedef struct PartitionSchemeData *PartitionScheme; * that expression goes in the partexprs[i] list if the base relation * is not nullable by this join or any lower outer join, or in the * nullable_partexprs[i] list if the base relation is nullable. + * Furthermore, FULL JOINs add extra nullable_partexprs expressions + * corresponding to COALESCE expressions of the left and right join columns, + * to simplify matching join clauses to those lists. *---------- */ typedef enum RelOptKind diff --git a/src/test/regress/expected/partition_join.out b/src/test/regress/expected/partition_join.out index b3fbe47..cd60b6a 100644 --- a/src/test/regress/expected/partition_join.out +++ b/src/test/regress/expected/partition_join.out @@ -750,6 +750,55 @@ SELECT t1.a, t1.c, t2.b, t2.c, t3.a + t3.b, t3.c FROM (prt1 t1 LEFT JOIN prt2 t2 550 | 0550 | | | 1100 | 0 (12 rows) +-- +-- 3-way full join +-- +EXPLAIN (COSTS OFF) +SELECT COUNT(*) FROM prt1 FULL JOIN prt2 p2(b,a,c) USING(a,b) FULL JOIN prt2 p3(b,a,c) USING (a, b) + WHERE a BETWEEN 490 AND 510; + QUERY PLAN +----------------------------------------------------------------------------------------------------------------------------------------- + Aggregate + -> Append + -> Hash Full Join + Hash Cond: ((COALESCE(prt1_1.a, p2_1.a) = p3_1.a) AND (COALESCE(prt1_1.b, p2_1.b) = p3_1.b)) + Filter: ((COALESCE(COALESCE(prt1_1.a, p2_1.a), p3_1.a) >= 490) AND (COALESCE(COALESCE(prt1_1.a, p2_1.a),p3_1.a) <= 510)) + -> Hash Full Join + Hash Cond: ((prt1_1.a = p2_1.a) AND (prt1_1.b = p2_1.b)) + -> Seq Scan on prt1_p1 prt1_1 + -> Hash + -> Seq Scan on prt2_p1 p2_1 + -> Hash + -> Seq Scan on prt2_p1 p3_1 + -> Hash Full Join + Hash Cond: ((COALESCE(prt1_2.a, p2_2.a) = p3_2.a) AND (COALESCE(prt1_2.b, p2_2.b) = p3_2.b)) + Filter: ((COALESCE(COALESCE(prt1_2.a, p2_2.a), p3_2.a) >= 490) AND (COALESCE(COALESCE(prt1_2.a, p2_2.a),p3_2.a) <= 510)) + -> Hash Full Join + Hash Cond: ((prt1_2.a = p2_2.a) AND (prt1_2.b = p2_2.b)) + -> Seq Scan on prt1_p2 prt1_2 + -> Hash + -> Seq Scan on prt2_p2 p2_2 + -> Hash + -> Seq Scan on prt2_p2 p3_2 + -> Hash Full Join + Hash Cond: ((COALESCE(prt1_3.a, p2_3.a) = p3_3.a) AND (COALESCE(prt1_3.b, p2_3.b) = p3_3.b)) + Filter: ((COALESCE(COALESCE(prt1_3.a, p2_3.a), p3_3.a) >= 490) AND (COALESCE(COALESCE(prt1_3.a, p2_3.a),p3_3.a) <= 510)) + -> Hash Full Join + Hash Cond: ((prt1_3.a = p2_3.a) AND (prt1_3.b = p2_3.b)) + -> Seq Scan on prt1_p3 prt1_3 + -> Hash + -> Seq Scan on prt2_p3 p2_3 + -> Hash + -> Seq Scan on prt2_p3 p3_3 +(32 rows) + +SELECT COUNT(*) FROM prt1 FULL JOIN prt2 p2(b,a,c) USING(a,b) FULL JOIN prt2 p3(b,a,c) USING (a, b) + WHERE a BETWEEN 490 AND 510; + count +------- + 14 +(1 row) + -- Cases with non-nullable expressions in subquery results; -- make sure these go to null as expected EXPLAIN (COSTS OFF) diff --git a/src/test/regress/sql/partition_join.sql b/src/test/regress/sql/partition_join.sql index 575ba7b..6184bbd 100644 --- a/src/test/regress/sql/partition_join.sql +++ b/src/test/regress/sql/partition_join.sql @@ -145,6 +145,15 @@ EXPLAIN (COSTS OFF) SELECT t1.a, t1.c, t2.b, t2.c, t3.a + t3.b, t3.c FROM (prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b) RIGHT JOIN prt1_e t3 ON(t1.a = (t3.a + t3.b)/2) WHERE t3.c = 0 ORDER BY t1.a, t2.b, t3.a + t3.b; SELECT t1.a, t1.c, t2.b, t2.c, t3.a + t3.b, t3.c FROM (prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b) RIGHT JOIN prt1_e t3 ON(t1.a = (t3.a + t3.b)/2) WHERE t3.c = 0 ORDER BY t1.a, t2.b, t3.a + t3.b; +-- +-- 3-way full join +-- +EXPLAIN (COSTS OFF) +SELECT COUNT(*) FROM prt1 FULL JOIN prt2 p2(b,a,c) USING(a,b) FULL JOIN prt2 p3(b,a,c) USING (a, b) + WHERE a BETWEEN 490 AND 510; +SELECT COUNT(*) FROM prt1 FULL JOIN prt2 p2(b,a,c) USING(a,b) FULL JOIN prt2 p3(b,a,c) USING (a, b) + WHERE a BETWEEN 490 AND 510; + -- Cases with non-nullable expressions in subquery results; -- make sure these go to null as expected EXPLAIN (COSTS OFF)
On Mon, Apr 6, 2020 at 7:29 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Amit Langote <amitlangote09@gmail.com> writes: > > Okay, I tried that in the updated patch. I didn't try to come up with > > examples that might break it though. > > I looked through this. Thank you. > * I think your logic for building the coalesce combinations was just > wrong. We need combinations of left-hand inputs with right-hand inputs, > not left-hand with left-hand or right-hand with right-hand. Also the > nesting should already have happened in the inputs, we don't need to > try to generate it here. The looping code was pretty messy as well. It didn't occur to me that that many Coalesce combinations would be necessary given the component rel combinations possible. > * I don't think using partopcintype is necessarily right; that could be > a polymorphic type, for instance. Safer to copy the type of the input > expressions. Likely we could have got away with using partcollation, > but for consistency I copied that as well. Ah, seeing set_baserel_partition_key_exprs(), I suppose they will come from parttypid and parttypcoll of the base partitioned tables, which seems fine. > * You really need to update the data structure definitional comments > when you make a change like this. Sorry, I should have. > * I did not like putting a test case that requires > enable_partitionwise_aggregate in the partition_join test; that seems > misplaced. But it's not necessary to the test, is it? Earlier in the discussion (which turned into a separate discussion), there were test cases where partition-level grouping would fail with errors in setrefs.c, but I think that was fixed last year by 529ebb20aaa5. Agree that it has nothing to do with the problem being solved here. > * I did not follow the point of your second test case. The WHERE > constraint on p1.a allows the planner to strength-reduce the joins, > which is why there's no full join in that explain result, but then > we aren't going to get to this code at all. Oops, I thought I copy-pasted 4-way full join test not this one, but evidently didn't. > I repaired the above in the attached. > > I'm actually sort of pleasantly surprised that this worked; I was > not sure that building COALESCEs like this would provide the result > we needed. But it seems okay -- it does fix the behavior in this > 3-way test case, as well as the 4-way join you showed at the top > of the thread. It's fairly dependent on the fact that the planner > won't rearrange full joins; otherwise, the COALESCE structures we > build here might not match those made at parse time. But that's > not likely to change anytime soon; and this is hardly the only > place that would break, so I'm not sweating about it. (I have > some vague ideas about getting rid of the COALESCEs as part of > the Var redefinition I've been muttering about, anyway; there > might be a cleaner fix for this if that happens.) > > Anyway, I think this is probably OK for now. Given that the > nullable_partexprs lists are only used in one place, it's pretty > hard to see how it would break anything. Makes sense. -- Thank you, Amit Langote EnterpriseDB: http://www.enterprisedb.com
Amit Langote <amitlangote09@gmail.com> writes: > On Mon, Apr 6, 2020 at 7:29 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> * I think your logic for building the coalesce combinations was just >> wrong. We need combinations of left-hand inputs with right-hand inputs, >> not left-hand with left-hand or right-hand with right-hand. Also the >> nesting should already have happened in the inputs, we don't need to >> try to generate it here. The looping code was pretty messy as well. > It didn't occur to me that that many Coalesce combinations would be > necessary given the component rel combinations possible. Well, we don't of course: we only need the one pair that corresponds to the COALESCE structures the parser would have generated. But we aren't sure which one that is. I thought about looking through the full join RTE's joinaliasvars list for COALESCE items instead of doing it like this, but there is a problem: I don't believe that that data structure gets maintained after flatten_join_alias_vars(). So it might contain out-of-date expressions that don't match what we need them to match. Possibly there will be a cleaner answer here if I succeed in redesigning the Var data structure to account for outer joins better. >> * I did not follow the point of your second test case. The WHERE >> constraint on p1.a allows the planner to strength-reduce the joins, >> which is why there's no full join in that explain result, but then >> we aren't going to get to this code at all. > Oops, I thought I copy-pasted 4-way full join test not this one, but > evidently didn't. Have you got such a query at hand? I wondered whether we shouldn't use a 4-way rather than 3-way test case; it'd offer more assurance that nesting of these things works. regards, tom lane
On Mon, Apr 6, 2020 at 11:09 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Amit Langote <amitlangote09@gmail.com> writes: > > Oops, I thought I copy-pasted 4-way full join test not this one, but > > evidently didn't. > > Have you got such a query at hand? I wondered whether we shouldn't > use a 4-way rather than 3-way test case; it'd offer more assurance > that nesting of these things works. Hmm, I just did: -SELECT COUNT(*) FROM prt1 FULL JOIN prt2 p2(b,a,c) USING(a,b) FULL JOIN prt2 p3(b,a,c) USING (a, b) +SELECT COUNT(*) FROM prt1 FULL JOIN prt2 p2(b,a,c) USING(a,b) FULL JOIN prt2 p3(b,a,c) USING (a, b) FULL JOIN prt1 p4 (a,b,c) USING (a, b) which does succeed in using partitionwise join. Please see attached delta that applies on your v7 if that is what you'd rather have. -- Thank you, Amit Langote EnterpriseDB: http://www.enterprisedb.com
Attachment
Amit Langote <amitlangote09@gmail.com> writes: > which does succeed in using partitionwise join. Please see attached > delta that applies on your v7 if that is what you'd rather have. I figured these queries were cheap enough that we could afford to run both. With that and some revision of the comments (per attached), I was feeling like we were ready to go. However, re-reading the thread, one of Richard's comments struck me as still relevant. If you try, say, create table p (k int, val int) partition by range(k); create table p_1 partition of p for values from (1) to (10); create table p_2 partition of p for values from (10) to (100); set enable_partitionwise_join = 1; explain select * from (p as t1 full join p as t2 on t1.k = t2.k) as t12(k1,val1,k2,val2) full join p as t3 on COALESCE(t12.k1, t12.k2) = t3.k; this patch will give you a partitioned join, with a different plan than you get without enable_partitionwise_join. This is scary, because it's not immediately obvious that the transformation is correct. I *think* that it might be all right, because although what we are matching to is a user-written COALESCE() not an actual FULL JOIN USING column, it has to behave in somewhat the same way. In particular, by construction it must be a coalesce of some representation of the matching partition columns of the full join's inputs. So, even though it might go to null in different cases than an actual USING variable would do, it does not break the ability to partition the join. However, I have not spent a whole lot of time thinking about partitionwise joins, so rather than go ahead and commit I am going to toss that point back out for community consideration. At the very least, what I'd written in the comment needs a lot more defense than it has now. Thoughts? regards, tom lane diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c index af1fb48..d190b4b 100644 --- a/src/backend/optimizer/util/relnode.c +++ b/src/backend/optimizer/util/relnode.c @@ -17,6 +17,7 @@ #include <limits.h> #include "miscadmin.h" +#include "nodes/nodeFuncs.h" #include "optimizer/appendinfo.h" #include "optimizer/clauses.h" #include "optimizer/cost.h" @@ -1890,7 +1891,8 @@ set_joinrel_partition_key_exprs(RelOptInfo *joinrel, RelOptInfo *outer_rel, RelOptInfo *inner_rel, JoinType jointype) { - int partnatts = joinrel->part_scheme->partnatts; + PartitionScheme part_scheme = joinrel->part_scheme; + int partnatts = part_scheme->partnatts; joinrel->partexprs = (List **) palloc0(sizeof(List *) * partnatts); joinrel->nullable_partexprs = @@ -1899,7 +1901,8 @@ set_joinrel_partition_key_exprs(RelOptInfo *joinrel, /* * The joinrel's partition expressions are the same as those of the input * rels, but we must properly classify them as nullable or not in the - * joinrel's output. + * joinrel's output. (Also, we add some more partition expressions if + * it's a FULL JOIN.) */ for (int cnt = 0; cnt < partnatts; cnt++) { @@ -1910,6 +1913,7 @@ set_joinrel_partition_key_exprs(RelOptInfo *joinrel, const List *inner_null_expr = inner_rel->nullable_partexprs[cnt]; List *partexpr = NIL; List *nullable_partexpr = NIL; + ListCell *lc; switch (jointype) { @@ -1969,6 +1973,38 @@ set_joinrel_partition_key_exprs(RelOptInfo *joinrel, outer_null_expr); nullable_partexpr = list_concat(nullable_partexpr, inner_null_expr); + + /* + * Also add CoalesceExprs corresponding to each possible + * full-join output variable (that is, left side coalesced to + * right side), so that we can match equijoin expressions + * using those variables. We really only need these for + * columns merged by JOIN USING, and only with the pairs of + * input items that correspond to the data structures that + * parse analysis would build for such variables. But it's + * hard to tell which those are, so just make all the pairs. + * Extra items in the nullable_partexprs list won't cause big + * problems. We assume no type coercions are needed to make + * the coalesce expressions, since columns of different types + * won't have gotten classified as the same PartitionScheme. + */ + foreach(lc, list_concat_copy(outer_expr, outer_null_expr)) + { + Node *larg = (Node *) lfirst(lc); + ListCell *lc2; + + foreach(lc2, list_concat_copy(inner_expr, inner_null_expr)) + { + Node *rarg = (Node *) lfirst(lc2); + CoalesceExpr *c = makeNode(CoalesceExpr); + + c->coalescetype = exprType(larg); + c->coalescecollid = exprCollation(larg); + c->args = list_make2(larg, rarg); + c->location = -1; + nullable_partexpr = lappend(nullable_partexpr, c); + } + } break; default: diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h index 469c686..39c7b2f 100644 --- a/src/include/nodes/pathnodes.h +++ b/src/include/nodes/pathnodes.h @@ -613,6 +613,9 @@ typedef struct PartitionSchemeData *PartitionScheme; * that expression goes in the partexprs[i] list if the base relation * is not nullable by this join or any lower outer join, or in the * nullable_partexprs[i] list if the base relation is nullable. + * Furthermore, FULL JOINs add extra nullable_partexprs expressions + * corresponding to COALESCE expressions of the left and right join columns, + * to simplify matching join clauses to those lists. *---------- */ typedef enum RelOptKind diff --git a/src/test/regress/expected/partition_join.out b/src/test/regress/expected/partition_join.out index b3fbe47..a35e8e3 100644 --- a/src/test/regress/expected/partition_join.out +++ b/src/test/regress/expected/partition_join.out @@ -750,6 +750,116 @@ SELECT t1.a, t1.c, t2.b, t2.c, t3.a + t3.b, t3.c FROM (prt1 t1 LEFT JOIN prt2 t2 550 | 0550 | | | 1100 | 0 (12 rows) +-- +-- 3-way full join +-- +EXPLAIN (COSTS OFF) +SELECT COUNT(*) FROM prt1 FULL JOIN prt2 p2(b,a,c) USING(a,b) FULL JOIN prt2 p3(b,a,c) USING (a, b) + WHERE a BETWEEN 490 AND 510; + QUERY PLAN +----------------------------------------------------------------------------------------------------------------------------------------- + Aggregate + -> Append + -> Hash Full Join + Hash Cond: ((COALESCE(prt1_1.a, p2_1.a) = p3_1.a) AND (COALESCE(prt1_1.b, p2_1.b) = p3_1.b)) + Filter: ((COALESCE(COALESCE(prt1_1.a, p2_1.a), p3_1.a) >= 490) AND (COALESCE(COALESCE(prt1_1.a, p2_1.a),p3_1.a) <= 510)) + -> Hash Full Join + Hash Cond: ((prt1_1.a = p2_1.a) AND (prt1_1.b = p2_1.b)) + -> Seq Scan on prt1_p1 prt1_1 + -> Hash + -> Seq Scan on prt2_p1 p2_1 + -> Hash + -> Seq Scan on prt2_p1 p3_1 + -> Hash Full Join + Hash Cond: ((COALESCE(prt1_2.a, p2_2.a) = p3_2.a) AND (COALESCE(prt1_2.b, p2_2.b) = p3_2.b)) + Filter: ((COALESCE(COALESCE(prt1_2.a, p2_2.a), p3_2.a) >= 490) AND (COALESCE(COALESCE(prt1_2.a, p2_2.a),p3_2.a) <= 510)) + -> Hash Full Join + Hash Cond: ((prt1_2.a = p2_2.a) AND (prt1_2.b = p2_2.b)) + -> Seq Scan on prt1_p2 prt1_2 + -> Hash + -> Seq Scan on prt2_p2 p2_2 + -> Hash + -> Seq Scan on prt2_p2 p3_2 + -> Hash Full Join + Hash Cond: ((COALESCE(prt1_3.a, p2_3.a) = p3_3.a) AND (COALESCE(prt1_3.b, p2_3.b) = p3_3.b)) + Filter: ((COALESCE(COALESCE(prt1_3.a, p2_3.a), p3_3.a) >= 490) AND (COALESCE(COALESCE(prt1_3.a, p2_3.a),p3_3.a) <= 510)) + -> Hash Full Join + Hash Cond: ((prt1_3.a = p2_3.a) AND (prt1_3.b = p2_3.b)) + -> Seq Scan on prt1_p3 prt1_3 + -> Hash + -> Seq Scan on prt2_p3 p2_3 + -> Hash + -> Seq Scan on prt2_p3 p3_3 +(32 rows) + +SELECT COUNT(*) FROM prt1 FULL JOIN prt2 p2(b,a,c) USING(a,b) FULL JOIN prt2 p3(b,a,c) USING (a, b) + WHERE a BETWEEN 490 AND 510; + count +------- + 14 +(1 row) + +-- +-- 4-way full join +-- +EXPLAIN (COSTS OFF) +SELECT COUNT(*) FROM prt1 FULL JOIN prt2 p2(b,a,c) USING(a,b) FULL JOIN prt2 p3(b,a,c) USING (a, b) FULL JOIN prt1 p4 (a,b,c)USING (a, b) + WHERE a BETWEEN 490 AND 510; + QUERY PLAN +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- + Aggregate + -> Append + -> Hash Full Join + Hash Cond: ((COALESCE(COALESCE(prt1_1.a, p2_1.a), p3_1.a) = p4_1.a) AND (COALESCE(COALESCE(prt1_1.b, p2_1.b),p3_1.b) = p4_1.b)) + Filter: ((COALESCE(COALESCE(COALESCE(prt1_1.a, p2_1.a), p3_1.a), p4_1.a) >= 490) AND (COALESCE(COALESCE(COALESCE(prt1_1.a,p2_1.a), p3_1.a), p4_1.a) <= 510)) + -> Hash Full Join + Hash Cond: ((COALESCE(prt1_1.a, p2_1.a) = p3_1.a) AND (COALESCE(prt1_1.b, p2_1.b) = p3_1.b)) + -> Hash Full Join + Hash Cond: ((prt1_1.a = p2_1.a) AND (prt1_1.b = p2_1.b)) + -> Seq Scan on prt1_p1 prt1_1 + -> Hash + -> Seq Scan on prt2_p1 p2_1 + -> Hash + -> Seq Scan on prt2_p1 p3_1 + -> Hash + -> Seq Scan on prt1_p1 p4_1 + -> Hash Full Join + Hash Cond: ((COALESCE(COALESCE(prt1_2.a, p2_2.a), p3_2.a) = p4_2.a) AND (COALESCE(COALESCE(prt1_2.b, p2_2.b),p3_2.b) = p4_2.b)) + Filter: ((COALESCE(COALESCE(COALESCE(prt1_2.a, p2_2.a), p3_2.a), p4_2.a) >= 490) AND (COALESCE(COALESCE(COALESCE(prt1_2.a,p2_2.a), p3_2.a), p4_2.a) <= 510)) + -> Hash Full Join + Hash Cond: ((COALESCE(prt1_2.a, p2_2.a) = p3_2.a) AND (COALESCE(prt1_2.b, p2_2.b) = p3_2.b)) + -> Hash Full Join + Hash Cond: ((prt1_2.a = p2_2.a) AND (prt1_2.b = p2_2.b)) + -> Seq Scan on prt1_p2 prt1_2 + -> Hash + -> Seq Scan on prt2_p2 p2_2 + -> Hash + -> Seq Scan on prt2_p2 p3_2 + -> Hash + -> Seq Scan on prt1_p2 p4_2 + -> Hash Full Join + Hash Cond: ((COALESCE(COALESCE(prt1_3.a, p2_3.a), p3_3.a) = p4_3.a) AND (COALESCE(COALESCE(prt1_3.b, p2_3.b),p3_3.b) = p4_3.b)) + Filter: ((COALESCE(COALESCE(COALESCE(prt1_3.a, p2_3.a), p3_3.a), p4_3.a) >= 490) AND (COALESCE(COALESCE(COALESCE(prt1_3.a,p2_3.a), p3_3.a), p4_3.a) <= 510)) + -> Hash Full Join + Hash Cond: ((COALESCE(prt1_3.a, p2_3.a) = p3_3.a) AND (COALESCE(prt1_3.b, p2_3.b) = p3_3.b)) + -> Hash Full Join + Hash Cond: ((prt1_3.a = p2_3.a) AND (prt1_3.b = p2_3.b)) + -> Seq Scan on prt1_p3 prt1_3 + -> Hash + -> Seq Scan on prt2_p3 p2_3 + -> Hash + -> Seq Scan on prt2_p3 p3_3 + -> Hash + -> Seq Scan on prt1_p3 p4_3 +(44 rows) + +SELECT COUNT(*) FROM prt1 FULL JOIN prt2 p2(b,a,c) USING(a,b) FULL JOIN prt2 p3(b,a,c) USING (a, b) FULL JOIN prt1 p4 (a,b,c)USING (a, b) + WHERE a BETWEEN 490 AND 510; + count +------- + 14 +(1 row) + -- Cases with non-nullable expressions in subquery results; -- make sure these go to null as expected EXPLAIN (COSTS OFF) diff --git a/src/test/regress/sql/partition_join.sql b/src/test/regress/sql/partition_join.sql index 575ba7b..dad1e07 100644 --- a/src/test/regress/sql/partition_join.sql +++ b/src/test/regress/sql/partition_join.sql @@ -145,6 +145,24 @@ EXPLAIN (COSTS OFF) SELECT t1.a, t1.c, t2.b, t2.c, t3.a + t3.b, t3.c FROM (prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b) RIGHT JOIN prt1_e t3 ON(t1.a = (t3.a + t3.b)/2) WHERE t3.c = 0 ORDER BY t1.a, t2.b, t3.a + t3.b; SELECT t1.a, t1.c, t2.b, t2.c, t3.a + t3.b, t3.c FROM (prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b) RIGHT JOIN prt1_e t3 ON(t1.a = (t3.a + t3.b)/2) WHERE t3.c = 0 ORDER BY t1.a, t2.b, t3.a + t3.b; +-- +-- 3-way full join +-- +EXPLAIN (COSTS OFF) +SELECT COUNT(*) FROM prt1 FULL JOIN prt2 p2(b,a,c) USING(a,b) FULL JOIN prt2 p3(b,a,c) USING (a, b) + WHERE a BETWEEN 490 AND 510; +SELECT COUNT(*) FROM prt1 FULL JOIN prt2 p2(b,a,c) USING(a,b) FULL JOIN prt2 p3(b,a,c) USING (a, b) + WHERE a BETWEEN 490 AND 510; + +-- +-- 4-way full join +-- +EXPLAIN (COSTS OFF) +SELECT COUNT(*) FROM prt1 FULL JOIN prt2 p2(b,a,c) USING(a,b) FULL JOIN prt2 p3(b,a,c) USING (a, b) FULL JOIN prt1 p4 (a,b,c)USING (a, b) + WHERE a BETWEEN 490 AND 510; +SELECT COUNT(*) FROM prt1 FULL JOIN prt2 p2(b,a,c) USING(a,b) FULL JOIN prt2 p3(b,a,c) USING (a, b) FULL JOIN prt1 p4 (a,b,c)USING (a, b) + WHERE a BETWEEN 490 AND 510; + -- Cases with non-nullable expressions in subquery results; -- make sure these go to null as expected EXPLAIN (COSTS OFF)
On Tue, Apr 7, 2020 at 2:41 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Amit Langote <amitlangote09@gmail.com> writes: > > which does succeed in using partitionwise join. Please see attached > > delta that applies on your v7 if that is what you'd rather have. > > I figured these queries were cheap enough that we could afford to run > both. With that and some revision of the comments (per attached), > I was feeling like we were ready to go. Looks good to me. > However, re-reading the thread, > one of Richard's comments struck me as still relevant. If you try, say, > > create table p (k int, val int) partition by range(k); > create table p_1 partition of p for values from (1) to (10); > create table p_2 partition of p for values from (10) to (100); > > set enable_partitionwise_join = 1; > > explain > select * from (p as t1 full join p as t2 on t1.k = t2.k) as t12(k1,val1,k2,val2) > full join p as t3 on COALESCE(t12.k1, t12.k2) = t3.k; > > this patch will give you a partitioned join, with a different plan > than you get without enable_partitionwise_join. This is scary, > because it's not immediately obvious that the transformation is > correct. > > I *think* that it might be all right, because although what we > are matching to is a user-written COALESCE() not an actual > FULL JOIN USING column, it has to behave in somewhat the same > way. In particular, by construction it must be a coalesce of > some representation of the matching partition columns of the > full join's inputs. So, even though it might go to null in > different cases than an actual USING variable would do, it > does not break the ability to partition the join. Seems fine to me too. Maybe users should avoid writing it by hand if possible anyway, because even slight variation in the way it's written will affect this: set enable_partitionwise_join = 1; -- order of coalesce() arguments reversed explain (costs off) select * from (p as t1 full join p as t2 on t1.k = t2.k) as t12(k1,val1,k2,val2) full join p as t3 on COALESCE(t12.k2, t12.k1) = t3.k; QUERY PLAN ---------------------------------------------- Hash Full Join Hash Cond: (COALESCE(t2.k, t1.k) = t3.k) -> Append -> Hash Full Join Hash Cond: (t1_1.k = t2_1.k) -> Seq Scan on p_1 t1_1 -> Hash -> Seq Scan on p_1 t2_1 -> Hash Full Join Hash Cond: (t1_2.k = t2_2.k) -> Seq Scan on p_2 t1_2 -> Hash -> Seq Scan on p_2 t2_2 -> Hash -> Append -> Seq Scan on p_1 t3_1 -> Seq Scan on p_2 t3_2 (17 rows) > However, I have not spent a whole lot of time thinking about > partitionwise joins, so rather than go ahead and commit I am > going to toss that point back out for community consideration. Agreed. > At the very least, what I'd written in the comment needs a > lot more defense than it has now. Sorry, which comment are you referring to? -- Thank you, Amit Langote EnterpriseDB: http://www.enterprisedb.com
Amit Langote <amitlangote09@gmail.com> writes: > On Tue, Apr 7, 2020 at 2:41 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I *think* that it might be all right, because although what we >> are matching to is a user-written COALESCE() not an actual >> FULL JOIN USING column, it has to behave in somewhat the same >> way. In particular, by construction it must be a coalesce of >> some representation of the matching partition columns of the >> full join's inputs. So, even though it might go to null in >> different cases than an actual USING variable would do, it >> does not break the ability to partition the join. > Seems fine to me too. Maybe users should avoid writing it by hand if > possible anyway, because even slight variation in the way it's written > will affect this: I'm not particularly concerned about users intentionally trying to trigger this behavior. I just want to be sure that if someone accidentally does so, we don't produce a wrong plan. I waited till after the "advanced partitionwise join" patch went in because that seemed more important (plus I wondered a bit if that would subsume this). But this patch seems to still work, and the other thing doesn't fix the problem, so pushed. regards, tom lane
On Wed, Apr 8, 2020 at 11:17 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > But this patch seems to still work, > and the other thing doesn't fix the problem, so pushed. Thanks for working on this! Best regards, Etsuro Fujita
On Wed, Apr 8, 2020 at 11:17 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Amit Langote <amitlangote09@gmail.com> writes: > > On Tue, Apr 7, 2020 at 2:41 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> I *think* that it might be all right, because although what we > >> are matching to is a user-written COALESCE() not an actual > >> FULL JOIN USING column, it has to behave in somewhat the same > >> way. In particular, by construction it must be a coalesce of > >> some representation of the matching partition columns of the > >> full join's inputs. So, even though it might go to null in > >> different cases than an actual USING variable would do, it > >> does not break the ability to partition the join. > > > Seems fine to me too. Maybe users should avoid writing it by hand if > > possible anyway, because even slight variation in the way it's written > > will affect this: > > I'm not particularly concerned about users intentionally trying to trigger > this behavior. I just want to be sure that if someone accidentally does > so, we don't produce a wrong plan. > > I waited till after the "advanced partitionwise join" patch went > in because that seemed more important (plus I wondered a bit if > that would subsume this). But this patch seems to still work, > and the other thing doesn't fix the problem, so pushed. Thank you for your time on this. -- Amit Langote EnterpriseDB: http://www.enterprisedb.com