Thread: [MASSMAIL]apply_scanjoin_target_to_paths and partitionwise join
Hi All,
Per below code and comment in apply_scanjoin_target_to_paths(), the function zaps all the paths of a partitioned relation.
/*
* If the rel is partitioned, we want to drop its existing paths and
* generate new ones. This function would still be correct if we kept the
* existing paths: we'd modify them to generate the correct target above
* the partitioning Append, and then they'd compete on cost with paths
* generating the target below the Append
* If the rel is partitioned, we want to drop its existing paths and
* generate new ones. This function would still be correct if we kept the
* existing paths: we'd modify them to generate the correct target above
* the partitioning Append, and then they'd compete on cost with paths
* generating the target below the Append
... snip ...
*/
if (rel_is_partitioned)
rel->pathlist = NIL;
if (rel_is_partitioned)
rel->pathlist = NIL;
Later the function adjusts the targets of paths in child relations and constructs Append paths from them. That works for simple partitioned relations but not for join between partitioned relations. When enable_partitionwise_join is true, the joinrel representing a join between partitioned relations may have join paths joining append paths and Append paths containing child join paths. Once we zap the pathlist, the only paths that can be computed again are the Append paths. If the optimal path, before applying the new target, was a join of append paths it will be lost forever. This will result in choosing a suboptimal Append path.
We have one such query in our regression set.
SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM (plt1_adv t1 LEFT JOIN plt2_adv t2 ON (t1.c = t2.c)) FULL JOIN plt3_adv t3 ON (t1.c = t3.c) WHERE coalesce(t1.a, 0 ) % 5 != 3 AND coalesce(t1.a, 0) % 5 != 4 ORDER BY t1.c, t1.a, t2.a, t3.a;
For this query, the cheapest Append of Joins path has cost 24.97..25.57 and the cheapest Join of Appends path has cost 21.29..21.81. The latter should be chosen even though enable_partitionwise_join is ON. But this function chooses the first.
The solution is to zap the pathlists only for simple partitioned relations like the attached patch.
With this patch above query does not choose non-partitionwise join path and partition_join test fails. That's expected. But we need to replace that query with some query which uses partitionwise join while maintaining the conditions of the test as explained in the comment above that query. I have tried a few variations but without success. Suggestions welcome.
The problem is reproducible on PG 15. The patch is based on 15_STABLE branch. But the problem exists in recent branches as well.
--
Best Wishes,
Ashutosh Bapat
Attachment
On Thu, Apr 11, 2024 at 12:07 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
Hi All,Per below code and comment in apply_scanjoin_target_to_paths(), the function zaps all the paths of a partitioned relation./*
* If the rel is partitioned, we want to drop its existing paths and
* generate new ones. This function would still be correct if we kept the
* existing paths: we'd modify them to generate the correct target above
* the partitioning Append, and then they'd compete on cost with paths
* generating the target below the Append... snip ...*/
if (rel_is_partitioned)
rel->pathlist = NIL;Later the function adjusts the targets of paths in child relations and constructs Append paths from them. That works for simple partitioned relations but not for join between partitioned relations. When enable_partitionwise_join is true, the joinrel representing a join between partitioned relations may have join paths joining append paths and Append paths containing child join paths. Once we zap the pathlist, the only paths that can be computed again are the Append paths. If the optimal path, before applying the new target, was a join of append paths it will be lost forever. This will result in choosing a suboptimal Append path.We have one such query in our regression set.SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM (plt1_adv t1 LEFT JOIN plt2_adv t2 ON (t1.c = t2.c)) FULL JOIN plt3_adv t3 ON (t1.c = t3.c) WHERE coalesce(t1.a, 0 ) % 5 != 3 AND coalesce(t1.a, 0) % 5 != 4 ORDER BY t1.c, t1.a, t2.a, t3.a;For this query, the cheapest Append of Joins path has cost 24.97..25.57 and the cheapest Join of Appends path has cost 21.29..21.81. The latter should be chosen even though enable_partitionwise_join is ON. But this function chooses the first.The solution is to zap the pathlists only for simple partitioned relations like the attached patch.With this patch above query does not choose non-partitionwise join path and partition_join test fails. That's expected. But we need to replace that query with some query which uses partitionwise join while maintaining the conditions of the test as explained in the comment above that query. I have tried a few variations but without success. Suggestions welcome.The problem is reproducible on PG 15. The patch is based on 15_STABLE branch. But the problem exists in recent branches as well.
After sending email I found another thread where this problem was discussed [1]. Tom has the same explanation as mine however he favoured not doing anything about it. Since the rationale was based on the cost and not actual performance, it was fine at that time. At EDB we have a customer case where partitionwise join plan is worse than non-partitionwise join plan. The suboptimal plan is chosen because of the above code. I think we should fix the problem.
--
Best Wishes,
Ashutosh Bapat
Here's patch with
On Thu, Apr 11, 2024 at 12:24 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
On Thu, Apr 11, 2024 at 12:07 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:Hi All,Per below code and comment in apply_scanjoin_target_to_paths(), the function zaps all the paths of a partitioned relation./*
* If the rel is partitioned, we want to drop its existing paths and
* generate new ones. This function would still be correct if we kept the
* existing paths: we'd modify them to generate the correct target above
* the partitioning Append, and then they'd compete on cost with paths
* generating the target below the Append... snip ...*/
if (rel_is_partitioned)
rel->pathlist = NIL;Later the function adjusts the targets of paths in child relations and constructs Append paths from them. That works for simple partitioned relations but not for join between partitioned relations. When enable_partitionwise_join is true, the joinrel representing a join between partitioned relations may have join paths joining append paths and Append paths containing child join paths. Once we zap the pathlist, the only paths that can be computed again are the Append paths. If the optimal path, before applying the new target, was a join of append paths it will be lost forever. This will result in choosing a suboptimal Append path.We have one such query in our regression set.SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM (plt1_adv t1 LEFT JOIN plt2_adv t2 ON (t1.c = t2.c)) FULL JOIN plt3_adv t3 ON (t1.c = t3.c) WHERE coalesce(t1.a, 0 ) % 5 != 3 AND coalesce(t1.a, 0) % 5 != 4 ORDER BY t1.c, t1.a, t2.a, t3.a;For this query, the cheapest Append of Joins path has cost 24.97..25.57 and the cheapest Join of Appends path has cost 21.29..21.81. The latter should be chosen even though enable_partitionwise_join is ON. But this function chooses the first.The solution is to zap the pathlists only for simple partitioned relations like the attached patch.With this patch above query does not choose non-partitionwise join path and partition_join test fails. That's expected. But we need to replace that query with some query which uses partitionwise join while maintaining the conditions of the test as explained in the comment above that query. I have tried a few variations but without success. Suggestions welcome.
Found a replacement for that query by using a 2-way join instead of 3-way join. The query still executes the referenced code in process_outer_partition() as mentioned in the comment. I did think about removing the original query. But it is the only example in our regression tests where partitionwise join is more costly than non-partitionwise join. So I have left it as is in the test. I am fine if others think that we should remove it.
Best Wishes,
Ashutosh Bapat
Attachment
Hi Ashutosh & hackers, On Mon, Apr 15, 2024 at 9:00 AM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote: > > Here's patch with > [..] > Adding to the next commitfest but better to consider this for the next set of minor releases. 1. The patch does not pass cfbot - https://cirrus-ci.com/task/5486258451906560 on master due to test failure "not ok 206 + partition_join" 2. Without the patch applied, the result of the meson test on master was clean (no failures , so master is fine). After applying patch there were expected some hunk failures (as the patch was created for 15_STABLE): patching file src/backend/optimizer/plan/planner.c Hunk #1 succeeded at 7567 (offset 468 lines). Hunk #2 succeeded at 7593 (offset 468 lines). patching file src/test/regress/expected/partition_join.out Hunk #1 succeeded at 4777 (offset 56 lines). Hunk #2 succeeded at 4867 (offset 56 lines). patching file src/test/regress/sql/partition_join.sql Hunk #1 succeeded at 1136 (offset 1 line). 3. Without patch there is performance regression/bug on master (cost is higher with enable_partitionwise_join=on that without it): data preparation: -- Test the process_outer_partition() code path CREATE TABLE plt1_adv (a int, b int, c text) PARTITION BY LIST (c); CREATE TABLE plt1_adv_p1 PARTITION OF plt1_adv FOR VALUES IN ('0000', '0001', '0002'); CREATE TABLE plt1_adv_p2 PARTITION OF plt1_adv FOR VALUES IN ('0003', '0004'); INSERT INTO plt1_adv SELECT i, i, to_char(i % 5, 'FM0000') FROM generate_series(0, 24) i; ANALYZE plt1_adv; CREATE TABLE plt2_adv (a int, b int, c text) PARTITION BY LIST (c); CREATE TABLE plt2_adv_p1 PARTITION OF plt2_adv FOR VALUES IN ('0002'); CREATE TABLE plt2_adv_p2 PARTITION OF plt2_adv FOR VALUES IN ('0003', '0004'); INSERT INTO plt2_adv SELECT i, i, to_char(i % 5, 'FM0000') FROM generate_series(0, 24) i WHERE i % 5 IN (2, 3, 4); ANALYZE plt2_adv; CREATE TABLE plt3_adv (a int, b int, c text) PARTITION BY LIST (c); CREATE TABLE plt3_adv_p1 PARTITION OF plt3_adv FOR VALUES IN ('0001'); CREATE TABLE plt3_adv_p2 PARTITION OF plt3_adv FOR VALUES IN ('0003', '0004'); INSERT INTO plt3_adv SELECT i, i, to_char(i % 5, 'FM0000') FROM generate_series(0, 24) i WHERE i % 5 IN (1, 3, 4); ANALYZE plt3_adv; off: EXPLAIN SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM (plt1_adv t1 LEFT JOIN plt2_adv t2 ON (t1.c = t2.c)) FULL JOIN plt3_adv t3 ON (t1.c = t3.c) WHERE coalesce(t1.a, 0) % 5 != 3 AND coalesce(t1.a, 0) % 5 != 4 ORDER BY t1.c, t1.a, t2.a, t3.a; QUERY PLAN ----------------------------------------------------------------------------------------------- Sort (cost=22.02..22.58 rows=223 width=27) Sort Key: t1.c, t1.a, t2.a, t3.a -> Hash Full Join (cost=4.83..13.33 rows=223 width=27) [..] with enable_partitionwise_join=ON (see the jump from cost 22.02 -> 27.65): EXPLAIN SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM (plt1_adv t1 LEFT JOIN plt2_adv t2 ON (t1.c = t2.c)) FULL JOIN plt3_adv t3 ON (t1.c = t3.c) WHERE coalesce(t1.a, 0) % 5 != 3 AND coalesce(t1.a, 0) % 5 != 4 ORDER BY t1.c, t1.a, t2.a, t3.a; QUERY PLAN ----------------------------------------------------------------------------------------------- Sort (cost=27.65..28.37 rows=289 width=27) Sort Key: t1.c, t1.a, t2.a, t3.a -> Append (cost=2.23..15.83 rows=289 width=27) -> Hash Full Join (cost=2.23..4.81 rows=41 width=27) [..] -> Hash Full Join (cost=2.45..9.57 rows=248 width=27) [..] However with the patch applied the plan with minimal cost is always chosen ("22"): explain SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM (plt1_adv t1 LEFT JOIN plt2_adv t2 ON (t1.c = t2.c)) FULL JOIN plt3_adv t3 ON (t1.c = t3.c) WHERE coalesce(t1.a, 0 ) % 5 != 3 AND coalesce(t1.a, 0) % 5 != 4 ORDER BY t1.c, t1.a, t2.a, t3.a; QUERY PLAN ----------------------------------------------------------------------------------------------- Sort (cost=22.02..22.58 rows=223 width=27) Sort Key: t1.c, t1.a, t2.a, t3.a -> Hash Full Join (cost=4.83..13.33 rows=223 width=27) [..] set enable_partitionwise_join to on; explain SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM (plt1_adv t1 LEFT JOIN plt2_adv t2 ON (t1.c = t2.c)) FULL JOIN plt3_adv t3 ON (t1.c = t3.c) WHERE coalesce(t1.a, 0 ) % 5 != 3 AND coalesce(t1.a, 0) % 5 != 4 ORDER BY t1.c, t1.a, t2.a, t3.a; QUERY PLAN ----------------------------------------------------------------------------------------------- Sort (cost=22.02..22.58 rows=223 width=27) Sort Key: t1.c, t1.a, t2.a, t3.a -> Hash Full Join (cost=4.83..13.33 rows=223 width=27) [..] with the patch applied, the minimal cost (with toggle on or off) the cost always stays the minimal from the available ones. We cannot provide a reproducer for real performance regression, but for the affected customer it took 530+s (with enable_partitionwise_join=on) and without that GUC it it was ~23s. 4. meson test ends up with failures like below: 4/290 postgresql:regress / regress/regress ERROR 32.67s 6/290 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade ERROR 56.96s 35/290 postgresql:recovery / recovery/027_stream_regress ERROR 40.20s (all due to "regression tests pass" failures) the partition_join.sql is failing for test 206, so for this: -- partitionwise join with fractional paths CREATE TABLE fract_t (id BIGINT, PRIMARY KEY (id)) PARTITION BY RANGE (id); CREATE TABLE fract_t0 PARTITION OF fract_t FOR VALUES FROM ('0') TO ('1000'); CREATE TABLE fract_t1 PARTITION OF fract_t FOR VALUES FROM ('1000') TO ('2000'); -- insert data INSERT INTO fract_t (id) (SELECT generate_series(0, 1999)); ANALYZE fract_t; -- verify plan; nested index only scans SET max_parallel_workers_per_gather = 0; SET enable_partitionwise_join = on; the testsuite was expecting the below with enable_partitionwise_join = on; EXPLAIN (COSTS OFF) SELECT x.id, y.id FROM fract_t x LEFT JOIN fract_t y USING (id) ORDER BY x.id ASC LIMIT 10; QUERY PLAN ----------------------------------------------------------------------- Limit -> Merge Append Sort Key: x.id -> Merge Left Join Merge Cond: (x_1.id = y_1.id) -> Index Only Scan using fract_t0_pkey on fract_t0 x_1 -> Index Only Scan using fract_t0_pkey on fract_t0 y_1 -> Merge Left Join Merge Cond: (x_2.id = y_2.id) -> Index Only Scan using fract_t1_pkey on fract_t1 x_2 -> Index Only Scan using fract_t1_pkey on fract_t1 y_2 but actually with patch it gets this (here with costs): EXPLAIN (COSTS) SELECT x.id, y.id FROM fract_t x LEFT JOIN fract_t y USING (id) ORDER BY x.id ASC LIMIT 10; QUERY PLAN ------------------------------------------------------------------------------------------------------------- Limit (cost=1.10..2.21 rows=10 width=16) -> Merge Left Join (cost=1.10..223.10 rows=2000 width=16) Merge Cond: (x.id = y.id) -> Append (cost=0.55..96.55 rows=2000 width=8) [..] -> Append (cost=0.55..96.55 rows=2000 width=8) [..] if you run it without patch and again with enable_partitionwise_join=on: EXPLAIN SELECT x.id, y.id FROM fract_t x LEFT JOIN fract_t y USING (id) ORDER BY x.id ASC LIMIT 10; QUERY PLAN ------------------------------------------------------------------------------------------------------------- Limit (cost=1.11..2.22 rows=10 width=16) -> Merge Append (cost=1.11..223.11 rows=2000 width=16) Sort Key: x.id -> Merge Left Join (cost=0.55..101.55 rows=1000 width=16) [..] -> Merge Left Join (cost=0.55..101.55 rows=1000 width=16) [..] So with the patch that SQL does not use partitionwise join as it finds it more optimal to stick to a plan with cost of "1.10..2.21" instead of "1.11..2.22" (w/ partition_join), nitpicking but still a failure technically. Perhaps it could be even removed? (it's pretty close to noise?). -J.
On Mon, May 6, 2024 at 4:26 PM Jakub Wartak <jakub.wartak@enterprisedb.com> wrote:
Hi Ashutosh & hackers,
On Mon, Apr 15, 2024 at 9:00 AM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> Here's patch with
>
[..]
> Adding to the next commitfest but better to consider this for the next set of minor releases.
1. The patch does not pass cfbot -
https://cirrus-ci.com/task/5486258451906560 on master due to test
failure "not ok 206 + partition_join"
So I need to create a patch for master first. I thought CFBot somehow knew that the patch was created for PG 15. :)
2. Without the patch applied, the result of the meson test on master
was clean (no failures , so master is fine). After applying patch
there were expected some hunk failures (as the patch was created for
15_STABLE):
patching file src/backend/optimizer/plan/planner.c
Hunk #1 succeeded at 7567 (offset 468 lines).
Hunk #2 succeeded at 7593 (offset 468 lines).
patching file src/test/regress/expected/partition_join.out
Hunk #1 succeeded at 4777 (offset 56 lines).
Hunk #2 succeeded at 4867 (offset 56 lines).
patching file src/test/regress/sql/partition_join.sql
Hunk #1 succeeded at 1136 (offset 1 line).
3. Without patch there is performance regression/bug on master (cost
is higher with enable_partitionwise_join=on that without it):
data preparation:
-- Test the process_outer_partition() code path
CREATE TABLE plt1_adv (a int, b int, c text) PARTITION BY LIST (c);
CREATE TABLE plt1_adv_p1 PARTITION OF plt1_adv FOR VALUES IN ('0000',
'0001', '0002');
CREATE TABLE plt1_adv_p2 PARTITION OF plt1_adv FOR VALUES IN ('0003', '0004');
INSERT INTO plt1_adv SELECT i, i, to_char(i % 5, 'FM0000') FROM
generate_series(0, 24) i;
ANALYZE plt1_adv;
CREATE TABLE plt2_adv (a int, b int, c text) PARTITION BY LIST (c);
CREATE TABLE plt2_adv_p1 PARTITION OF plt2_adv FOR VALUES IN ('0002');
CREATE TABLE plt2_adv_p2 PARTITION OF plt2_adv FOR VALUES IN ('0003', '0004');
INSERT INTO plt2_adv SELECT i, i, to_char(i % 5, 'FM0000') FROM
generate_series(0, 24) i WHERE i % 5 IN (2, 3, 4);
ANALYZE plt2_adv;
CREATE TABLE plt3_adv (a int, b int, c text) PARTITION BY LIST (c);
CREATE TABLE plt3_adv_p1 PARTITION OF plt3_adv FOR VALUES IN ('0001');
CREATE TABLE plt3_adv_p2 PARTITION OF plt3_adv FOR VALUES IN ('0003', '0004');
INSERT INTO plt3_adv SELECT i, i, to_char(i % 5, 'FM0000') FROM
generate_series(0, 24) i WHERE i % 5 IN (1, 3, 4);
ANALYZE plt3_adv;
off:
EXPLAIN SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM (plt1_adv t1
LEFT JOIN plt2_adv t2 ON (t1.c = t2.c)) FULL JOIN plt3_adv t3 ON (t1.c
= t3.c) WHERE coalesce(t1.a, 0) % 5 != 3 AND coalesce(t1.a, 0) % 5 !=
4 ORDER BY t1.c, t1.a, t2.a, t3.a;
QUERY PLAN
-----------------------------------------------------------------------------------------------
Sort (cost=22.02..22.58 rows=223 width=27)
Sort Key: t1.c, t1.a, t2.a, t3.a
-> Hash Full Join (cost=4.83..13.33 rows=223 width=27)
[..]
with enable_partitionwise_join=ON (see the jump from cost 22.02 -> 27.65):
EXPLAIN SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM (plt1_adv t1
LEFT JOIN plt2_adv t2 ON (t1.c = t2.c)) FULL JOIN plt3_adv t3 ON (t1.c
= t3.c) WHERE coalesce(t1.a, 0) % 5 != 3 AND coalesce(t1.a, 0) % 5 !=
4 ORDER BY t1.c, t1.a, t2.a, t3.a;
QUERY PLAN
-----------------------------------------------------------------------------------------------
Sort (cost=27.65..28.37 rows=289 width=27)
Sort Key: t1.c, t1.a, t2.a, t3.a
-> Append (cost=2.23..15.83 rows=289 width=27)
-> Hash Full Join (cost=2.23..4.81 rows=41 width=27)
[..]
-> Hash Full Join (cost=2.45..9.57 rows=248 width=27)
[..]
However with the patch applied the plan with minimal cost is always
chosen ("22"):
explain SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM (plt1_adv t1 LEFT JOIN
plt2_adv t2 ON (t1.c = t2.c)) FULL JOIN plt3_adv t3 ON (t1.c = t3.c) WHERE
coalesce(t1.a, 0 ) % 5 != 3 AND coalesce(t1.a, 0) % 5 != 4 ORDER BY
t1.c, t1.a, t2.a, t3.a;
QUERY PLAN
-----------------------------------------------------------------------------------------------
Sort (cost=22.02..22.58 rows=223 width=27)
Sort Key: t1.c, t1.a, t2.a, t3.a
-> Hash Full Join (cost=4.83..13.33 rows=223 width=27)
[..]
set enable_partitionwise_join to on;
explain SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM (plt1_adv t1 LEFT JOIN
plt2_adv t2 ON (t1.c = t2.c)) FULL JOIN plt3_adv t3 ON (t1.c = t3.c) WHERE
coalesce(t1.a, 0 ) % 5 != 3 AND coalesce(t1.a, 0) % 5 != 4 ORDER BY
t1.c, t1.a, t2.a, t3.a;
QUERY PLAN
-----------------------------------------------------------------------------------------------
Sort (cost=22.02..22.58 rows=223 width=27)
Sort Key: t1.c, t1.a, t2.a, t3.a
-> Hash Full Join (cost=4.83..13.33 rows=223 width=27)
[..]
with the patch applied, the minimal cost (with toggle on or off) the
cost always stays the minimal from the available ones. We cannot
provide a reproducer for real performance regression, but for the
affected customer it took 530+s (with enable_partitionwise_join=on)
and without that GUC it it was ~23s.
Thanks for providing actual timing. That's a huge difference.
4. meson test ends up with failures like below:
4/290 postgresql:regress / regress/regress
ERROR 32.67s
6/290 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade
ERROR 56.96s
35/290 postgresql:recovery / recovery/027_stream_regress
ERROR 40.20s
(all due to "regression tests pass" failures)
the partition_join.sql is failing for test 206, so for this:
-- partitionwise join with fractional paths
CREATE TABLE fract_t (id BIGINT, PRIMARY KEY (id)) PARTITION BY RANGE (id);
CREATE TABLE fract_t0 PARTITION OF fract_t FOR VALUES FROM ('0') TO ('1000');
CREATE TABLE fract_t1 PARTITION OF fract_t FOR VALUES FROM ('1000') TO ('2000');
-- insert data
INSERT INTO fract_t (id) (SELECT generate_series(0, 1999));
ANALYZE fract_t;
-- verify plan; nested index only scans
SET max_parallel_workers_per_gather = 0;
SET enable_partitionwise_join = on;
the testsuite was expecting the below with enable_partitionwise_join = on;
EXPLAIN (COSTS OFF)
SELECT x.id, y.id FROM fract_t x LEFT JOIN fract_t y USING (id) ORDER
BY x.id ASC LIMIT 10;
QUERY PLAN
-----------------------------------------------------------------------
Limit
-> Merge Append
Sort Key: x.id
-> Merge Left Join
Merge Cond: (x_1.id = y_1.id)
-> Index Only Scan using fract_t0_pkey on fract_t0 x_1
-> Index Only Scan using fract_t0_pkey on fract_t0 y_1
-> Merge Left Join
Merge Cond: (x_2.id = y_2.id)
-> Index Only Scan using fract_t1_pkey on fract_t1 x_2
-> Index Only Scan using fract_t1_pkey on fract_t1 y_2
but actually with patch it gets this (here with costs):
EXPLAIN (COSTS) SELECT x.id, y.id FROM fract_t x LEFT JOIN fract_t y
USING (id) ORDER BY x.id ASC LIMIT 10;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------
Limit (cost=1.10..2.21 rows=10 width=16)
-> Merge Left Join (cost=1.10..223.10 rows=2000 width=16)
Merge Cond: (x.id = y.id)
-> Append (cost=0.55..96.55 rows=2000 width=8)
[..]
-> Append (cost=0.55..96.55 rows=2000 width=8)
[..]
if you run it without patch and again with enable_partitionwise_join=on:
EXPLAIN SELECT x.id, y.id FROM fract_t x LEFT JOIN fract_t y USING
(id) ORDER BY x.id ASC LIMIT 10;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------
Limit (cost=1.11..2.22 rows=10 width=16)
-> Merge Append (cost=1.11..223.11 rows=2000 width=16)
Sort Key: x.id
-> Merge Left Join (cost=0.55..101.55 rows=1000 width=16)
[..]
-> Merge Left Join (cost=0.55..101.55 rows=1000 width=16)
[..]
So with the patch that SQL does not use partitionwise join as it finds
it more optimal to stick to a plan with cost of "1.10..2.21" instead
of "1.11..2.22" (w/ partition_join), nitpicking but still a failure
technically. Perhaps it could be even removed? (it's pretty close to
noise?).
I think we need to replace the failing query with something which uses partitionwise join even with the patch.
I will take a look at this after returning from a two week long vacation, unless someone else is interested in fixing this before that.
Best Wishes,
Ashutosh Bapat
On Mon, May 6, 2024 at 6:28 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
On Mon, May 6, 2024 at 4:26 PM Jakub Wartak <jakub.wartak@enterprisedb.com> wrote:Hi Ashutosh & hackers,
On Mon, Apr 15, 2024 at 9:00 AM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> Here's patch with
>
[..]
> Adding to the next commitfest but better to consider this for the next set of minor releases.
1. The patch does not pass cfbot -
https://cirrus-ci.com/task/5486258451906560 on master due to test
failure "not ok 206 + partition_join"So I need to create a patch for master first. I thought CFBot somehow knew that the patch was created for PG 15. :)
PFA patch for master. That should fix CfBot.
4. meson test ends up with failures like below:
4/290 postgresql:regress / regress/regress
ERROR 32.67s
6/290 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade
ERROR 56.96s
35/290 postgresql:recovery / recovery/027_stream_regress
ERROR 40.20s
(all due to "regression tests pass" failures)
[...]
So with the patch that SQL does not use partitionwise join as it finds
it more optimal to stick to a plan with cost of "1.10..2.21" instead
of "1.11..2.22" (w/ partition_join), nitpicking but still a failure
technically. Perhaps it could be even removed? (it's pretty close to
noise?).
The test was added by 6b94e7a6da2f1c6df1a42efe64251f32a444d174 and later modified by 3c569049b7b502bb4952483d19ce622ff0af5fd6. The modification just avoided eliminating the join, so that change can be ignored. 6b94e7a6da2f1c6df1a42efe64251f32a444d174 added the tests to test fractional paths being considered when creating ordered append paths. Reading the commit message, I was expecting a test which did not use a join as well and also which used inheritance. But it seems that the queries added by that commit, test all the required scenarios and hence we see two queries involving join between partitioned tables. As the comment there says the intention is to verify index only scans and not exactly partitionwise join. So just fixing the expected output of one query looks fine. The other query will test partitionwise join and fractional paths anyway. I am including Tomas, Arne and Zhihong, who worked on the first commit, to comment on expected output changes.
I will create patches for the back-branches once the patch for master is in a committable state.
Best Wishes,
Ashutosh Bapat
Attachment
Hi Ashutosh, thanks for bringing this to my attention. I'll first share a few thoughts about the change and respond regarding the test below. I clearly understand your intention with this patch. It's an issue I run into from time to time. I did some testing with some benchmark sets back with pg 14. I did the following: I planned with and without the partitionwise join GUC (explain) and took the one with the lower cost to execute the query. Interestingly, even discounting the overhead and additional planning time, the option with the lower cost turned out to be slower on our benchmark set back then. The median query with disabled GUC was quicker, but on average that was not the case. The observation is one, I'd generally describe as "The more options you consider, the more ways we have to be horribly wrong. More options for the planner are a great way to uncover the various shortcomings of it." That might be specific to the benchmark I was working with at the time. But that made me drop the issue back then. That is ofc no valid reason not to go in the direction of making the planner to consider more options. :) Maybe we can discuss that in person next week? On 2024-05-22 07:57, Ashutosh Bapat wrote: > On Mon, May 6, 2024 at 6:28 PM Ashutosh Bapat > <ashutosh.bapat.oss@gmail.com> wrote: >>> 4. meson test ends up with failures like below: >>> >>> 4/290 postgresql:regress / regress/regress >>> ERROR 32.67s >>> 6/290 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade >>> ERROR 56.96s >>> 35/290 postgresql:recovery / recovery/027_stream_regress >>> ERROR 40.20s >>> >>> (all due to "regression tests pass" failures) >>> [...] > >>> So with the patch that SQL does not use partitionwise join as it >>> finds >>> it more optimal to stick to a plan with cost of "1.10..2.21" >>> instead >>> of "1.11..2.22" (w/ partition_join), nitpicking but still a >>> failure >>> technically. Perhaps it could be even removed? (it's pretty close >>> to >>> noise?). > > The test was added by 6b94e7a6da2f1c6df1a42efe64251f32a444d174 and > later modified by 3c569049b7b502bb4952483d19ce622ff0af5fd6. The > modification just avoided eliminating the join, so that change can be > ignored. 6b94e7a6da2f1c6df1a42efe64251f32a444d174 added the tests to > test fractional paths being considered when creating ordered append > paths. Reading the commit message, I was expecting a test which did > not use a join as well and also which used inheritance. But it seems > that the queries added by that commit, test all the required scenarios > and hence we see two queries involving join between partitioned > tables. As the comment there says the intention is to verify index > only scans and not exactly partitionwise join. So just fixing the > expected output of one query looks fine. The other query will test > partitionwise join and fractional paths anyway. I am including Tomas, > Arne and Zhihong, who worked on the first commit, to comment on > expected output changes. The test was put there to make sure a fractional join is considered in the case that a partitionwise join is considered. Because that wasn't the case before. The important part for my use case back then was that we do Merge Join(s) at all. The test result after your patch still confirms that. If we simply modify the test as such, we no longer confirm, whether the code path introduced in 6b94e7a6da2f1c6df1a42efe64251f32a444d174 is still working. Maybe it's worthwhile to add something like create index on fract_t0 ((id*id)); EXPLAIN (COSTS OFF) SELECT * FROM fract_t x JOIN fract_t y USING (id) ORDER BY id * id DESC LIMIT 10; QUERY PLAN ------------------------------------------------------------------------------- Limit -> Merge Append Sort Key: ((x.id * x.id)) DESC -> Nested Loop -> Index Scan Backward using fract_t0_expr_idx on fract_t0 x_1 -> Index Only Scan using fract_t0_pkey on fract_t0 y_1 Index Cond: (id = x_1.id) -> Sort Sort Key: ((x_2.id * x_2.id)) DESC -> Hash Join Hash Cond: (x_2.id = y_2.id) -> Seq Scan on fract_t1 x_2 -> Hash -> Seq Scan on fract_t1 y_2 I am not sure, whether it's worth the extra test cycles on every animal, but since we are not creating an extra table it might be ok. I don't have a very strong feeling about the above test case. > I will create patches for the back-branches once the patch for master > is in a committable state. I am not sure, whether it's really a bug. I personally wouldn't be brave enough to back patch this. I don't want to deal with complaining end users. Suddenly their optimizer, which always had horrible estimates, was actually able to do harmful stuff with them. Only due to a minor version upgrade. I think that's a bad idea to backpatch something with complex performance implications. Especially since they might even be based on potentially inaccurate data... > > -- > > Best Wishes, > Ashutosh Bapat All the best Arne
On Fri, May 24, 2024 at 2:02 PM <arne.roland@malkut.net> wrote: > I am not sure, whether it's really a bug. I personally wouldn't be brave > enough to back patch this. I don't want to deal with complaining end > users. Suddenly their optimizer, which always had horrible estimates, > was actually able to do harmful stuff with them. Only due to a minor > version upgrade. I think that's a bad idea to backpatch something with > complex performance implications. Especially since they might even be > based on potentially inaccurate data... +1. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, May 24, 2024 at 11:02 AM <arne.roland@malkut.net> wrote:
Hi Ashutosh,
thanks for bringing this to my attention. I'll first share a few
thoughts about the change and respond regarding the test below.
I clearly understand your intention with this patch. It's an issue I run
into from time to time.
I did some testing with some benchmark sets back with pg 14. I did the
following: I planned with and without the partitionwise join GUC
(explain) and took the one with the lower cost to execute the query.
Interestingly, even discounting the overhead and additional planning
time, the option with the lower cost turned out to be slower on our
benchmark set back then. The median query with disabled GUC was quicker,
but on average that was not the case. The observation is one, I'd
generally describe as "The more options you consider, the more ways we
have to be horribly wrong. More options for the planner are a great way
to uncover the various shortcomings of it."
That might be specific to the benchmark I was working with at the time.
But that made me drop the issue back then. That is ofc no valid reason
not to go in the direction of making the planner to consider more
options. :)
In summary, you are suggesting that partitionwise join performs better than plain join even if the latter one has lower cost. Hence fixing this issue has never become a priority for you. Am I right?
Plans with lower costs being slower is not new for optimizer. Partitionwise join just adds another case.
Maybe we can discuss that in person next week?
Sure.
On 2024-05-22 07:57, Ashutosh Bapat wrote:
>
> The test was added by 6b94e7a6da2f1c6df1a42efe64251f32a444d174 and
> later modified by 3c569049b7b502bb4952483d19ce622ff0af5fd6. The
> modification just avoided eliminating the join, so that change can be
> ignored. 6b94e7a6da2f1c6df1a42efe64251f32a444d174 added the tests to
> test fractional paths being considered when creating ordered append
> paths. Reading the commit message, I was expecting a test which did
> not use a join as well and also which used inheritance. But it seems
> that the queries added by that commit, test all the required scenarios
> and hence we see two queries involving join between partitioned
> tables. As the comment there says the intention is to verify index
> only scans and not exactly partitionwise join. So just fixing the
> expected output of one query looks fine. The other query will test
> partitionwise join and fractional paths anyway. I am including Tomas,
> Arne and Zhihong, who worked on the first commit, to comment on
> expected output changes.
The test was put there to make sure a fractional join is considered in
the case that a partitionwise join is considered. Because that wasn't
the case before.
The important part for my use case back then was that we do Merge
Join(s) at all. The test result after your patch still confirms that.
If we simply modify the test as such, we no longer confirm, whether the
code path introduced in 6b94e7a6da2f1c6df1a42efe64251f32a444d174 is
still working.
Maybe it's worthwhile to add something like
create index on fract_t0 ((id*id));
EXPLAIN (COSTS OFF)
SELECT * FROM fract_t x JOIN fract_t y USING (id) ORDER BY id * id DESC
LIMIT 10;
QUERY PLAN
-------------------------------------------------------------------------------
Limit
-> Merge Append
Sort Key: ((x.id * x.id)) DESC
-> Nested Loop
-> Index Scan Backward using fract_t0_expr_idx on
fract_t0 x_1
-> Index Only Scan using fract_t0_pkey on fract_t0 y_1
Index Cond: (id = x_1.id)
-> Sort
Sort Key: ((x_2.id * x_2.id)) DESC
-> Hash Join
Hash Cond: (x_2.id = y_2.id)
-> Seq Scan on fract_t1 x_2
-> Hash
-> Seq Scan on fract_t1 y_2
I am not sure, whether it's worth the extra test cycles on every animal,
but since we are not creating an extra table it might be ok.
I don't have a very strong feeling about the above test case.
My patch removes redundant enable_partitionwise_join = on since that's done very early in the test. Apart from that it does not change the test. So if the expected output change is fine with you, I think we should leave the test as is. Plan outputs are sometimes fragile and thus make expected outputs flaky.
> I will create patches for the back-branches once the patch for master
> is in a committable state.
I am not sure, whether it's really a bug. I personally wouldn't be brave
enough to back patch this. I don't want to deal with complaining end
users. Suddenly their optimizer, which always had horrible estimates,
was actually able to do harmful stuff with them. Only due to a minor
version upgrade. I think that's a bad idea to backpatch something with
complex performance implications. Especially since they might even be
based on potentially inaccurate data...
Since it's a thinko I considered it as a bug. But I agree that it has the potential to disturb plans after upgrade and thus upset users. So I am fine if we don't backpatch.
Best Wishes,
Ashutosh Bapat
Hi Ashutosh! On 2024-05-27 14:17, Ashutosh Bapat wrote: > On Fri, May 24, 2024 at 11:02 AM <arne.roland@malkut.net> wrote: > >> Hi Ashutosh, >> >> thanks for bringing this to my attention. I'll first share a few >> thoughts about the change and respond regarding the test below. >> >> I clearly understand your intention with this patch. It's an issue I >> run >> into from time to time. >> >> I did some testing with some benchmark sets back with pg 14. I did >> the >> following: I planned with and without the partitionwise join GUC >> (explain) and took the one with the lower cost to execute the query. >> >> Interestingly, even discounting the overhead and additional planning >> >> time, the option with the lower cost turned out to be slower on our >> benchmark set back then. The median query with disabled GUC was >> quicker, >> but on average that was not the case. The observation is one, I'd >> generally describe as "The more options you consider, the more ways >> we >> have to be horribly wrong. More options for the planner are a great >> way >> to uncover the various shortcomings of it." >> >> That might be specific to the benchmark I was working with at the >> time. >> But that made me drop the issue back then. That is ofc no valid >> reason >> not to go in the direction of making the planner to consider more >> options. :) > > In summary, you are suggesting that partitionwise join performs better > than plain join even if the latter one has lower cost. Hence fixing > this issue has never become a priority for you. Am I right? > > Plans with lower costs being slower is not new for optimizer. > Partitionwise join just adds another case. Sorry for my confusing long text. I will try to recap my points concisely. 1. I think the order by pk frac limit plans had just to similar performance behaviour for me to bother. But afaics the main point of your proposal is not related to frac plans at all. 2. We can't expect the optimizers to simply yield better results by being given more options to be wrong. (Let me give a simple example: This patch makes our lack of cross table cross column statistics worse. We give it more opportunity to pick something horrible. 3. I dislike, that this patch makes much harder to debug, why no partitionwise join is chosen. > >> Maybe we can discuss that in person next week? > > Sure. > >> On 2024-05-22 07:57, Ashutosh Bapat wrote: >>> >>> The test was added by 6b94e7a6da2f1c6df1a42efe64251f32a444d174 and >>> later modified by 3c569049b7b502bb4952483d19ce622ff0af5fd6. The >>> modification just avoided eliminating the join, so that change can >> be >>> ignored. 6b94e7a6da2f1c6df1a42efe64251f32a444d174 added the tests >> to >>> test fractional paths being considered when creating ordered >> append >>> paths. Reading the commit message, I was expecting a test which >> did >>> not use a join as well and also which used inheritance. But it >> seems >>> that the queries added by that commit, test all the required >> scenarios >>> and hence we see two queries involving join between partitioned >>> tables. As the comment there says the intention is to verify index >>> only scans and not exactly partitionwise join. So just fixing the >>> expected output of one query looks fine. The other query will test >>> partitionwise join and fractional paths anyway. I am including >> Tomas, >>> Arne and Zhihong, who worked on the first commit, to comment on >>> expected output changes. >> >> The test was put there to make sure a fractional join is considered >> in >> the case that a partitionwise join is considered. Because that >> wasn't >> the case before. >> >> The important part for my use case back then was that we do Merge >> Join(s) at all. The test result after your patch still confirms >> that. >> >> If we simply modify the test as such, we no longer confirm, whether >> the >> code path introduced in 6b94e7a6da2f1c6df1a42efe64251f32a444d174 is >> still working. >> >> Maybe it's worthwhile to add something like >> >> create index on fract_t0 ((id*id)); >> >> EXPLAIN (COSTS OFF) >> SELECT * FROM fract_t x JOIN fract_t y USING (id) ORDER BY id * id >> DESC >> LIMIT 10; >> QUERY PLAN >> > ------------------------------------------------------------------------------- >> Limit >> -> Merge Append >> Sort Key: ((x.id [1] * x.id [1])) DESC >> -> Nested Loop >> -> Index Scan Backward using fract_t0_expr_idx on >> fract_t0 x_1 >> -> Index Only Scan using fract_t0_pkey on fract_t0 >> y_1 >> Index Cond: (id = x_1.id [2]) >> -> Sort >> Sort Key: ((x_2.id [3] * x_2.id [3])) DESC >> -> Hash Join >> Hash Cond: (x_2.id [3] = y_2.id [4]) >> -> Seq Scan on fract_t1 x_2 >> -> Hash >> -> Seq Scan on fract_t1 y_2 >> >> I am not sure, whether it's worth the extra test cycles on every >> animal, >> but since we are not creating an extra table it might be ok. >> I don't have a very strong feeling about the above test case. > > My patch removes redundant enable_partitionwise_join = on since that's > done very early in the test. Apart from that it does not change the > test. So if the expected output change is fine with you, I think we > should leave the test as is. Plan outputs are sometimes fragile and > thus make expected outputs flaky. If at all, we can add to that. That would indeed give us more code test coverage. I will refrain from commenting further, since that discussion would get completely disconnected from the patch at hand. > >>> I will create patches for the back-branches once the patch for >> master >>> is in a committable state. >> >> I am not sure, whether it's really a bug. I personally wouldn't be >> brave >> enough to back patch this. I don't want to deal with complaining end >> >> users. Suddenly their optimizer, which always had horrible >> estimates, >> was actually able to do harmful stuff with them. Only due to a minor >> >> version upgrade. I think that's a bad idea to backpatch something >> with >> complex performance implications. Especially since they might even >> be >> based on potentially inaccurate data... > > Since it's a thinko I considered it as a bug. But I agree that it has > the potential to disturb plans after upgrade and thus upset users. So > I am fine if we don't backpatch. > > -- > > Best Wishes, > Ashutosh Bapat > > > Links: > ------ > [1] http://x.id > [2] http://x_1.id > [3] http://x_2.id > [4] http://y_2.id All the best Arne
On Tue, May 28, 2024 at 7:13 AM <arne.roland@malkut.net> wrote:
1. I think the order by pk frac limit plans had just to similar
performance behaviour for me to bother.
But afaics the main point of your proposal is not related to frac plans
at all.
Right.
2. We can't expect the optimizers to simply yield better results by
being given more options to be wrong. (Let me give a simple example:
This patch makes our lack of cross table cross column statistics worse.
We give it more opportunity to pick something horrible.
I don't see the connection between cross column statistics and this bug I am fixing. Can you please elaborate?
3. I dislike, that this patch makes much harder to debug, why no
partitionwise join is chosen.
Can you please elaborate more? How does my change make debugging harder?
Best Wishes,
Ashutosh Bapat
On Wed, May 22, 2024 at 3:57 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote: > I will create patches for the back-branches once the patch for master is in a committable state. AFAIU, this patch prevents apply_scanjoin_target_to_paths() from discarding old paths of partitioned joinrels. Therefore, we can retain non-partitionwise join paths if the cheapest path happens to be among them. One concern from me is that if the cheapest path of a joinrel is a partitionwise join path, following this approach could lead to undesirable cross-platform plan variations, as detailed in the original comment. Is there a specific query that demonstrates benefits from this change? I'm curious about scenarios where a partitionwise join runs slower than a non-partitionwise join. Thanks Richard
On Wed, Jul 24, 2024 at 9:42 AM Richard Guo <guofenglinux@gmail.com> wrote: > > On Wed, May 22, 2024 at 3:57 PM Ashutosh Bapat > <ashutosh.bapat.oss@gmail.com> wrote: > > I will create patches for the back-branches once the patch for master is in a committable state. > > AFAIU, this patch prevents apply_scanjoin_target_to_paths() from > discarding old paths of partitioned joinrels. Therefore, we can > retain non-partitionwise join paths if the cheapest path happens to be > among them. Right. Thanks for the summary. > > One concern from me is that if the cheapest path of a joinrel is a > partitionwise join path, following this approach could lead to > undesirable cross-platform plan variations, as detailed in the > original comment. I read through the email thread [3] referenced in the commit (1d338584062b3e53b738f987ecb0d2b67745232a) which added that comment. The change is mentioned in [4] first. Please notice that this change is unrelated to the bug that started the thread. [5], [6] talk about the costs of projection path above Append vs project path below Append. But I don't see any example of any cross-platform plan variations. I also do not see an example in that thread where such a plan variation results in bad performance. If the costs of partitionwise and non-partitionwise join paths are so close to each other that platform specific arithmetic can swing it one way or the other, possibly their performance is going to be comparable. Without an example query it's hard to assess this possibility or address the concern, especially when we have examples of the behaviour otherwise. > > Is there a specific query that demonstrates benefits from this change? > I'm curious about scenarios where a partitionwise join runs slower > than a non-partitionwise join. [1] provides a testcase where a nonpartitionwise join is better than partitionwise join. This testcase is derived from a bug reported by an EDB customer. [2] is another bug report on psql-bugs. [1] https://www.postgresql.org/message-id/CAKZiRmyaFFvxyEYGG_hu0F-EVEcqcnveH23MULhW6UY_jwykGw%40mail.gmail.com [2] https://www.postgresql.org/message-id/flat/786.1565541557%40sss.pgh.pa.us#9d50e1b375201f29bbf17072d75569e3 [3] https://www.postgresql.org/message-id/flat/15669-02fb3296cca26203%40postgresql.org [4] https://www.postgresql.org/message-id/20477.1551819776%40sss.pgh.pa.us [5] https://www.postgresql.org/message-id/15350.1551973953%40sss.pgh.pa.us [6] https://www.postgresql.org/message-id/24357.1551984010%40sss.pgh.pa.us -- Best Wishes, Ashutosh Bapat
On 24/7/2024 15:22, Ashutosh Bapat wrote: > On Wed, Jul 24, 2024 at 9:42 AM Richard Guo <guofenglinux@gmail.com> wrote: >> Is there a specific query that demonstrates benefits from this change? >> I'm curious about scenarios where a partitionwise join runs slower >> than a non-partitionwise join. > > [1] provides a testcase where a nonpartitionwise join is better than > partitionwise join. This testcase is derived from a bug reported by an > EDB customer. [2] is another bug report on psql-bugs. I haven't passed through the patch yet, but can this issue affect the decision on what to push down to foreign servers: a whole join or just a scan of two partitions? If the patch is related to the pushdown decision, I'd say it is quite an annoying problem for me. From time to time, I see cases where JOIN produces more tuples than both partitions have in total - in this case, it would be better to transfer tables' tuples to the main instance before joining them. -- regards, Andrei Lepikhov
On Tue, Oct 1, 2024 at 3:22 AM Andrei Lepikhov <lepihov@gmail.com> wrote: > > On 24/7/2024 15:22, Ashutosh Bapat wrote: > > On Wed, Jul 24, 2024 at 9:42 AM Richard Guo <guofenglinux@gmail.com> wrote: > >> Is there a specific query that demonstrates benefits from this change? > >> I'm curious about scenarios where a partitionwise join runs slower > >> than a non-partitionwise join. > > > > [1] provides a testcase where a nonpartitionwise join is better than > > partitionwise join. This testcase is derived from a bug reported by an > > EDB customer. [2] is another bug report on psql-bugs. > I haven't passed through the patch yet, but can this issue affect the > decision on what to push down to foreign servers: a whole join or just a > scan of two partitions? > If the patch is related to the pushdown decision, I'd say it is quite an > annoying problem for me. From time to time, I see cases where JOIN > produces more tuples than both partitions have in total - in this case, > it would be better to transfer tables' tuples to the main instance > before joining them. Sorry for replying late. I somehow didn't notice this. A join between partitions is pushed down if only partitionwise join is chosen and a join between partitions won't be pushed down if partitionwise join is not chosen. Hence this bug affects pushdown as well. The CF entry shows as waiting for author. But that isn't the right status. Will change it to needs review. I think we need a consensus as to whether we want to fix this bug or not. Since this bug doesn't affect me anymore, I will just withdraw this CF entry if there is no interest. -- Best Wishes, Ashutosh Bapat
On Thu, Jan 2, 2025 at 4:41 AM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote: > A join between partitions is pushed down if only partitionwise join is > chosen and a join between partitions won't be pushed down if > partitionwise join is not chosen. Hence this bug affects pushdown as > well. > > The CF entry shows as waiting for author. But that isn't the right > status. Will change it to needs review. I think we need a consensus as > to whether we want to fix this bug or not. Since this bug doesn't > affect me anymore, I will just withdraw this CF entry if there is no > interest. I think it's unhelpful that you keep calling this a "bug" when the behavior is clearly deliberate. Whether it's the *right* behavior is debatable, but it's not *accidental* behavior. I don't actually have a clear understanding of why we need this. In https://www.postgresql.org/message-id/CAKZiRmyaFFvxyEYGG_hu0F-EVEcqcnveH23MULhW6UY_jwykGw%40mail.gmail.com Jakub says that an EDB customer experienced a case where the partitionwise plan took 530+s and the non-partitionwise plan took 23s, but unfortunately there's no public test case, and in the examples shared publicly, either the partionwise plan is actually slower but is mistakenly estimated to be faster, or the two are extremely close to the same speed so it doesn't really matter. So the customer scenario (which is not public) is justification for a code-change, but the publicly-posted examples, as far as I can see, are not. And what confuses me is -- what could that test case possibly look like? I mean, how can it be less efficient to perform N small joins than 1 big join? For instance, suppose we're doing a merge join between A and B (partitions A_1..A_N and B_1..B_N) and we have to sort all the data. With a partitionwise join, we have to do 2N sorts of partitions of some size, let's say K. The cost should be O(2N * K lg K). If we instead do two really big sorts, the cost is now O(2 * (NK) lg (NK)), which is more. If we do a hash join, the cost should be about the same either way, because probing a hash table is roughly constant time. However, if we do N small hash joins, the hash table is a lot more likely to fit in memory -- and if the big hash table does not fit in memory and the small hash tables do, we should win big. Finally, let's say we're doing a nested loop. If the inner side of the nested loop is unparameterized, then the cost of the non-partitionwise nested loop is O(N^2 * K^2), while the cost of the partitionwise nested loop is O(N * K^2), which is a huge win. If the inner side is parameterized, then the partitionwise plan involves scanning one partition for matching values for each outer row, whereas the non-partitionwise plan involves scanning every partition for matching values for each outer row, which is again clearly worse. I'm obviously missing something here, because I'm sure Jakub is quite right when he says that this actually happened and actually hosed an EDB customer. But I don't understand HOW it happened, and I think if we're going to change the code we really ought to understand that and write some code comments about it. In general, I think that it's very reasonable to expect that a bunch of small joins will beat one big join, which is why the code does what it currently does. -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > I'm obviously missing something here, because I'm sure Jakub is quite > right when he says that this actually happened and actually hosed an > EDB customer. But I don't understand HOW it happened, and I think if > we're going to change the code we really ought to understand that and > write some code comments about it. In general, I think that it's very > reasonable to expect that a bunch of small joins will beat one big > join, which is why the code does what it currently does. I am wondering if the problem is not that the plan is slower, it's that for some reason the planner took a lot longer to create it. It's very plausible that partitionwise planning takes longer, and maybe we have some corner cases where the time is O(N^2) or worse. However, this is pure speculation without a test case, and any proposed fix would be even more speculative. I concur with your bottom line: we should insist on a public test case before deciding what to do about it. regards, tom lane
On Thu, Jan 2, 2025 at 3:58 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I am wondering if the problem is not that the plan is slower, it's > that for some reason the planner took a lot longer to create it. > It's very plausible that partitionwise planning takes longer, and > maybe we have some corner cases where the time is O(N^2) or worse. That doesn't seem like a totally unreasonable speculation, but it seems a little surprising that retaining the non-partitionwise paths would fix it. True, that might let us discard a bunch of partitionwise paths more quickly than would otherwise be possible, but I wouldn't expect that to have an impact as dramatic as what Jakub alleged. The thing I thought about was whether there might be some weird effects with lots of empty partitions; or maybe with some other property of the path like say sort keys or parallelism. For example if we couldn't generate a partitionwise path with sort keys as good as the non-partitionwise path had, or if we couldn't generate a parallel partitionwise path but we could generate a parallel non-partitionwise path. As far as I knew neither of those things are real problems, but if they were then I believe they could pretty easily explain a large regression. > However, this is pure speculation without a test case, and any > proposed fix would be even more speculative. I concur with your > bottom line: we should insist on a public test case before deciding > what to do about it. Yeah. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Jan 3, 2025 at 3:02 AM Robert Haas <robertmhaas@gmail.com> wrote: > > I think it's unhelpful that you keep calling this a "bug" when the > behavior is clearly deliberate. Whether it's the *right* behavior is > debatable, but it's not *accidental* behavior. > Ok, let's call it "not right" behaviour :). Let me further expand on the explanation in my first email [1]. After the planner has added all possible paths, apply_scanjoin_target_to_paths(), which should be just adjusting their targets, zaps them all. That looks weird. But the comment explains why its doing so. -- quote comment * This function would still be correct if we kept the * existing paths: we'd modify them to generate the correct target above * the partitioning Append, and then they'd compete on cost with paths * generating the target below the Append. However, in our current cost * model the latter way is always the same or cheaper cost, so modifying * the existing paths would just be useless work. Moreover, when the cost * is the same, varying roundoff errors might sometimes allow an existing * path to be picked, resulting in undesirable cross-platform plan * variations. -- The comment mentions only "partitioning Append" and not "Join" paths. A simple partitioned relation's pathlist contains only append paths but a partitioned join relation's pathlist contains join paths as well as append (of join) paths. What the comment says is true for both Append of scans as well as Append of joins, but not for join paths joining append paths - non-partition wise paths. In such paths we should be adjusting only the target of the join path and not that of the paths under the Append. The comment does not mention Join over append at all. The code discards both join as well as append paths but rebuilds only append paths. There is no comment in the function explaining why we omit join of append paths. So the code does not seem intentional to me as far as partitionwise joins are concerned. Losing a costwise optimal path doesn't seem to be something that should happen while adjusting path targets (unless while adjusting path targets we find a lower cost path, but we aren't keeping the old paths around for comparison.) > On Thu, Jan 2, 2025 at 3:58 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I am wondering if the problem is not that the plan is slower, it's > > that for some reason the planner took a lot longer to create it. > > It's very plausible that partitionwise planning takes longer, and > > maybe we have some corner cases where the time is O(N^2) or worse. > > That doesn't seem like a totally unreasonable speculation, but it > seems a little surprising that retaining the non-partitionwise paths > would fix it. True, that might let us discard a bunch of partitionwise > paths more quickly than would otherwise be possible, but I wouldn't > expect that to have an impact as dramatic as what Jakub alleged. The > thing I thought about was whether there might be some weird effects > with lots of empty partitions; or maybe with some other property of > the path like say sort keys or parallelism. For example if we couldn't > generate a partitionwise path with sort keys as good as the > non-partitionwise path had, or if we couldn't generate a parallel > partitionwise path but we could generate a parallel non-partitionwise > path. As far as I knew neither of those things are real problems, but > if they were then I believe they could pretty easily explain a large > regression. > > > However, this is pure speculation without a test case, and any > > proposed fix would be even more speculative. I concur with your > > bottom line: we should insist on a public test case before deciding > > what to do about it. > That's a valid ask. AFAIR, it's a quite tricky scenario. Both Jakub and myself have tried but could not reproduce the issue. Let me mark this CF entry as returned with feedback and resurrect it when we have a reproduction. [1] https://www.postgresql.org/message-id/CAExHW5toze58+jL-454J3ty11sqJyU13Sz5rJPQZDmASwZgWiA@mail.gmail.com -- Best Wishes, Ashutosh Bapat