Thread: BUG #17335: Duplicate result rows in Gather node when "Suppress Append ... that have a single child" applied
From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      17335
Logged by:          Yura Sokolov
Email address:      y.sokolov@postgrespro.ru
PostgreSQL version: 14.1
Operating system:   any
Description:

Our customer faced a query that returns duplicate rows if parallel
execution is enabled (max_parallel_workers_per_gather > 0) after an
11->12 upgrade.

Investigation shows the bug was introduced in
8edd0e79460b414b1d971895312e549e95e12e4f
"Suppress Append and MergeAppend plan nodes that have a single child."
The bug is present in all versions since 12 beta, and it is reproduced
on the master branch as well.

Reproduction:
https://gist.github.com/funny-falcon/45fa9460507c015390db99a7efd18a75

The dump contains the definitions of two tables (with 3 and 2 fields),
an index, and a data dump large enough to trigger parallel plan
execution. An `analyze public.testi3, public.testi4` line is also
included.

https://gist.github.com/funny-falcon/45fa9460507c015390db99a7efd18a75/raw/06a55e57bc6c460ea8800600b1b597c1ed216259/repro_dump.sql.zst

Query:

    SELECT count(1)
    FROM (
        SELECT fk, id
        FROM testi3
        WHERE flagged
        UNION ALL
        SELECT fk, id
        from testi3
        WHERE false
    ) as t
    LEFT OUTER JOIN testi4
    USING (fk);

With `set max_parallel_workers_per_gather = 0;` it returns 44, which is
correct.
With `set max_parallel_workers_per_gather = 2;` it returns 88, which is
wrong.

Sometimes (for example, when you try to debug it) it returns 44 even
with max_parallel_workers_per_gather = 2. It seems this happens when a
worker returns the first row fast enough that the Gather node doesn't
start to execute the plan by itself.

This is confirmed by `set parallel_leader_participation = false`, which
forces the correct result. `set enable_parallel_append = false` also
forces the correct result.

The original plan included a Gather Merge node and
max_parallel_workers_per_gather = 1. The reproduction code uses Gather
and max_parallel_workers_per_gather = 2 just because it was easier to
get the bad plan that way.
Plan before change 8edd0e79460b414b1d971895312e549e95e12e4f (returns correct result):

 Finalize Aggregate  (cost=25003.16..25003.17 rows=1 width=8) (actual time=435.473..435.473 rows=1 loops=1)
   ->  Gather  (cost=25003.05..25003.16 rows=1 width=8) (actual time=434.646..456.889 rows=2 loops=1)
         Workers Planned: 1
         Workers Launched: 1
         ->  Partial Aggregate  (cost=24003.05..24003.06 rows=1 width=8) (actual time=433.000..433.000 rows=1 loops=2)
               ->  Parallel Hash Left Join  (cost=21701.47..24002.86 rows=74 width=0) (actual time=312.576..432.981 rows=22 loops=2)
                     Hash Cond: ("*SELECT* 1".fk = testi4.fk)
                     ->  Parallel Append  (cost=0.42..9.18 rows=24 width=4) (actual time=0.021..0.112 rows=16 loops=2)
                           ->  Subquery Scan on "*SELECT* 1"  (cost=0.42..9.11 rows=24 width=4) (actual time=0.041..0.216 rows=33 loops=1)
                                 ->  Index Scan using testi3_ix on testi3  (cost=0.42..8.87 rows=24 width=8) (actual time=0.039..0.207 rows=33 loops=1)
                                       Index Cond: (flagged = true)
                     ->  Parallel Hash  (cost=12084.91..12084.91 rows=586091 width=4) (actual time=305.473..305.473 rows=703309 loops=2)
                           Buckets: 131072  Batches: 32  Memory Usage: 2784kB
                           ->  Parallel Seq Scan on testi4  (cost=0.00..12084.91 rows=586091 width=4) (actual time=0.032..138.958 rows=703309 loops=2)

Plan after that change (returns wrong result):

 Finalize Aggregate  (cost=25003.87..25003.88 rows=1 width=8) (actual time=652.062..652.062 rows=1 loops=1)
   ->  Gather  (cost=25003.75..25003.86 rows=1 width=8) (actual time=651.482..668.718 rows=2 loops=1)
         Workers Planned: 1
         Workers Launched: 1
         ->  Partial Aggregate  (cost=24003.75..24003.76 rows=1 width=8) (actual time=649.150..649.150 rows=1 loops=2)
               ->  Parallel Hash Left Join  (cost=21701.47..24003.49 rows=106 width=0) (actual time=461.751..649.111 rows=44 loops=2)
                     Hash Cond: ("*SELECT* 1".fk = testi4.fk)
                     ->  Subquery Scan on "*SELECT* 1"  (cost=0.42..9.51 rows=36 width=4) (actual time=0.050..0.229 rows=33 loops=2)
                           ->  Index Scan using testi3_ix on testi3  (cost=0.42..9.15 rows=36 width=8) (actual time=0.048..0.214 rows=33 loops=2)
                                 Index Cond: (flagged = true)
                     ->  Parallel Hash  (cost=12084.91..12084.91 rows=586091 width=4) (actual time=444.931..444.931 rows=703309 loops=2)
                           Buckets: 131072  Batches: 32  Memory Usage: 2784kB
                           ->  Parallel Seq Scan on testi4  (cost=0.00..12084.91 rows=586091 width=4) (actual time=0.045..202.913 rows=703309 loops=2)

------
regards,
Yura
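[Editor's note] The two plans show the mechanism: before the change, the Parallel Append dispatched the single non-parallel-aware Subquery Scan to exactly one participant (rows=33 loops=1); after the change, the bare Subquery Scan runs in every participant (rows=33 loops=2), so each input row reaches the join twice. A toy Python model of that difference (not PostgreSQL code; names are purely illustrative):

```python
# Toy model: a parallel-aware scan divides its rows among participants,
# while a non-parallel-aware scan returns every row in each participant.

def parallel_aware_scan(rows, participant, nparticipants):
    # Each participant claims a disjoint slice of the input.
    return rows[participant::nparticipants]

def non_parallel_aware_scan(rows, participant, nparticipants):
    # Each participant runs the whole scan from scratch.
    return list(rows)

def gather(scan, rows, nparticipants=2):
    # Model a Gather node with a leader plus one worker.
    out = []
    for p in range(nparticipants):
        out.extend(scan(rows, p, nparticipants))
    return out

rows = list(range(44))
print(len(gather(parallel_aware_scan, rows)))      # 44: correct
print(len(gather(non_parallel_aware_scan, rows)))  # 88: duplicated
```

Running the non-parallel-aware scan under two participants doubles the row count, exactly the 44 vs 88 symptom reported above.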
Re: BUG #17335: Duplicate result rows in Gather node when "Suppress Append ... that have a single child" applied
From
Yura Sokolov
Date:
On Mon, 13 Dec 2021 at 14:45 +0000, PG Bug reporting form wrote:
> Our customer faced a query that returns duplicate rows if parallel
> execution is enabled (max_parallel_workers_per_gather > 0) after an
> 11->12 upgrade.
> [...]
> With `set max_parallel_workers_per_gather = 0;` it returns 44, which
> is correct.
> With `set max_parallel_workers_per_gather = 2;` it returns 88, which
> is wrong.
> [...]
I found a simple fix:

--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1284,14 +1284,17 @@ create_append_path(PlannerInfo *root,
 	if (list_length(pathnode->subpaths) == 1)
 	{
 		Path	   *child = (Path *) linitial(pathnode->subpaths);
 
 		pathnode->path.rows = child->rows;
 		pathnode->path.startup_cost = child->startup_cost;
 		pathnode->path.total_cost = child->total_cost;
 		pathnode->path.pathkeys = child->pathkeys;
+		pathnode->path.parallel_aware = child->parallel_aware;
+		pathnode->path.parallel_safe = child->parallel_safe;
+		pathnode->path.parallel_workers = child->parallel_workers;
 	}
 	else
 	{
 		pathnode->path.pathkeys = NIL;	/* unsorted if more than 1 subpath */
 		cost_append(pathnode);
 	}

With this fix the Gather node deduces `Single Copy: true`, and the leader doesn't participate in query execution.

It seems to be a correct change: if the Append node will be removed, it should take parallel awareness/safety from its single child.

With this change, plans with a Gather Merge node fail on an assert in cost_gather_merge:

	/*
	 * Add one to the number of workers to account for the leader.  This might
	 * be overgenerous since the leader will do less work than other workers
	 * in typical cases, but we'll go with it for now.
	 */
	Assert(path->num_workers > 0);
	N = (double) path->num_workers + 1;
	logN = LOG2(N);

Commenting out this assert doesn't seem to break anything and doesn't seem to mess up the cost calculation (i.e. it remains logically correct). To preserve an assert, I've changed it to `Assert(path->num_workers >= 0);`.

With these changes, queries from the reproduction code return the correct 44 rows.

# explain analyze SELECT id, val
  FROM (
      SELECT fk, id
      FROM testi3
      WHERE flagged
      UNION ALL
      SELECT fk, id
      from testi3
      WHERE false
  ) as t
  LEFT OUTER JOIN testi4
  USING (fk);
                                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=22701.47..25015.79 rows=130 width=8) (actual time=1268.071..1686.918 rows=44 loops=1)
   Workers Planned: 1
   Workers Launched: 1
   Single Copy: true
   ->  Parallel Hash Left Join  (cost=21701.47..24002.79 rows=130 width=8) (actual time=1242.936..1634.421 rows=44 loops=1)
         Hash Cond: ("*SELECT* 1".fk = testi4.fk)
         ->  Subquery Scan on "*SELECT* 1"  (cost=0.42..9.09 rows=24 width=8) (actual time=0.049..0.166 rows=33 loops=1)
               ->  Index Scan using testi3_ix on testi3  (cost=0.42..8.85 rows=24 width=8) (actual time=0.047..0.151 rows=33 loops=1)
                     Index Cond: (flagged = true)
         ->  Parallel Hash  (cost=12084.91..12084.91 rows=586091 width=8) (actual time=1238.029..1238.029 rows=1406618 loops=1)
               Buckets: 131072  Batches: 32  Memory Usage: 2784kB
               ->  Parallel Seq Scan on testi4  (cost=0.00..12084.91 rows=586091 width=8) (actual time=0.070..540.652 rows=1406618 loops=1)

Interestingly, the Gather Merge node doesn't run workers:

# explain analyze SELECT id, val
  FROM (
      SELECT fk, id
      FROM testi3
      WHERE flagged
      UNION ALL
      SELECT fk, id
      from testi3
      WHERE false
  ) as t
  LEFT OUTER JOIN testi4
  USING (fk) ORDER BY val;
                                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
 Gather Merge  (cost=25007.36..25007.68 rows=0 width=8) (actual time=909.975..909.984 rows=44 loops=1)
   Workers Planned: 0
   Workers Launched: 0
   ->  Sort  (cost=24007.36..24007.68 rows=130 width=8) (actual time=909.948..909.952 rows=44 loops=1)
         Sort Key: testi4.val
         Sort Method: quicksort  Memory: 27kB
         ->  Parallel Hash Left Join  (cost=21701.47..24002.79 rows=130 width=8) (actual time=757.099..909.779 rows=44 loops=1)
               Hash Cond: ("*SELECT* 1".fk = testi4.fk)
               ->  Subquery Scan on "*SELECT* 1"  (cost=0.42..9.09 rows=24 width=8) (actual time=0.224..0.576 rows=33 loops=1)
                     ->  Index Scan using testi3_ix on testi3  (cost=0.42..8.85 rows=24 width=8) (actual time=0.202..0.543 rows=33 loops=1)
                           Index Cond: (flagged = true)
               ->  Parallel Hash  (cost=12084.91..12084.91 rows=586091 width=8) (actual time=755.983..755.983 rows=1406618 loops=1)
                     Buckets: 131072  Batches: 32  Memory Usage: 2730kB
                     ->  Parallel Seq Scan on testi4  (cost=0.00..12084.91 rows=586091 width=8) (actual time=0.052..330.983 rows=1406618 loops=1)

So these changes could be taken as a quick valid fix.

A further improvement would be to remove the Append node earlier, to detect that there is no need for parallelism when the single Append child doesn't support it.

--------
regards,
Sokolov Yura
Postgres Professional
y.sokolov@postgrespro.ru
funny.falcon@gmail.com
Re: BUG #17335: Duplicate result rows in Gather node when "Suppress Append ... that have a single child" applied
From
Yura Sokolov
Date:
On Fri, 17 Dec 2021 at 09:00 +0300, Yura Sokolov wrote:
> On Mon, 13 Dec 2021 at 14:45 +0000, PG Bug reporting form wrote:
> > Our customer faced a query that returns duplicate rows if parallel
> > execution is enabled (max_parallel_workers_per_gather > 0) after an
> > 11->12 upgrade.
> > [...]
> [...]
> So these changes could be taken as a quick valid fix.
>
> A further improvement would be to remove the Append node earlier, to
> detect that there is no need for parallelism when the single Append
> child doesn't support it.

I've attached these changes as a patch.

--------
regards,
Sokolov Yura
Postgres Professional
y.sokolov@postgrespro.ru
funny.falcon@gmail.com
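[Editor's note] The relaxed `Assert(path->num_workers >= 0)` proposed earlier in the thread keeps the cost arithmetic in cost_gather_merge well defined: with zero workers, N = 0 + 1 = 1 and LOG2(N) = 0, so the heap-comparison cost term simply drops out. A quick check of just those two quoted lines, transcribed into plain Python:

```python
import math

def gather_merge_n_logn(num_workers):
    """Mirror the two lines quoted from cost_gather_merge:
    N = num_workers + 1 (workers plus the leader), logN = LOG2(N)."""
    assert num_workers >= 0  # the relaxed assertion
    N = float(num_workers) + 1.0
    logN = math.log2(N)
    return N, logN

print(gather_merge_n_logn(0))  # (1.0, 0.0): comparison term vanishes
print(gather_merge_n_logn(1))  # (2.0, 1.0)
```

This supports the observation that commenting out (or relaxing) the assert leaves the cost calculation logically correct for the num_workers = 0 case.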
Attachment
Re: BUG #17335: Duplicate result rows in Gather node when "Suppress Append ... that have a single child" applied
From
Yura Sokolov
Date:
Sent to hackers with better reproduction code https://postgr.es/m/flat/b59605fecb20ba9ea94e70ab60098c237c870628.camel@postgrespro.ru Published to commitfest: https://commitfest.postgresql.org/36/3486/ В Пн, 20/12/2021 в 12:07 +0300, Yura Sokolov пишет: > В Пт, 17/12/2021 в 09:00 +0300, Yura Sokolov пишет: > > В Пн, 13/12/2021 в 14:45 +0000, PG Bug reporting form пишет: > > > The following bug has been logged on the website: > > > > > > Bug reference: 17335 > > > Logged by: Yura Sokolov > > > Email address: y.sokolov@postgrespro.ru > > > PostgreSQL version: 14.1 > > > Operating system: any > > > Description: > > > > > > Our customer faced a query that returns duplicate rows if parallel execution > > > is enabled > > > (max_parallel_workers_per_gather > 0) after 11->12 upgrade. > > > > > > Investigation shows bug were introduced in > > > 8edd0e79460b414b1d971895312e549e95e12e4f > > > "Suppress Append and MergeAppend plan nodes that have a single child." . > > > Bug is present in all versions since 12 beta, and it is reproduced in master > > > branch as well. > > > > > > Reproduction. > > > https://gist.github.com/funny-falcon/45fa9460507c015390db99a7efd18a75 > > > > > > Dump contains definition of two tables (with 3 and 2 fields), index and data > > > dump > > > large enough to trigger parallel plan execution. > > > Also `analyze public.testi3, public.testi4` line is added. > > > > > > https://gist.github.com/funny-falcon/45fa9460507c015390db99a7efd18a75/raw/06a55e57bc6c460ea8800600b1b597c1ed216259/repro_dump.sql.zst > > > > > > Query: > > > > > > SELECT count(1) > > > FROM ( > > > SELECT fk, id > > > FROM testi3 > > > WHERE flagged > > > UNION ALL > > > SELECT fk, id > > > from testi3 > > > WHERE false > > > ) as t > > > LEFT OUTER JOIN testi4 > > > USING (fk); > > > > > > When `set max_parallel_workers_per_gather = 0;` then it returns 44 which is > > > correct. > > > When `set max_parallel_workers_per_gather = 2;` then it returns 88 which is > > > wrong. 
> > > > > > Sometimes (for example, when you try to debug it) it returns 44 even > > > with max_parallel_workers_per_gather = 2. > > > It seems that it happens when worker returns first row fast enough so > > > "Gather" node doesn't start to execute plan by itself. > > > > > > It is confirmed by `set parallel_leader_participation = false`, which forces > > > correct result. > > > `set enable_parallel_append = false` also forces correct result. > > > > > > Original plan included Gather Merge node and max_parallel_workers_per_gather > > > = 1 . > > > Reproduction code uses Gather instead and max_parallel_workers_per_gather = > > > 2 just because it were easier to get bad plan. > > > > > > Plan before change 8edd0e79460b414b1d971895312e549e95e12e4f (returns correct > > > result) > > В Пн, 13/12/2021 в 14:45 +0000, PG Bug reporting form пишет: > > > The following bug has been logged on the website: > > > > > > Bug reference: 17335 > > > Logged by: Yura Sokolov > > > Email address: y.sokolov@postgrespro.ru > > > PostgreSQL version: 14.1 > > > Operating system: any > > > Description: > > > > > > Our customer faced a query that returns duplicate rows if parallel execution > > > is enabled > > > (max_parallel_workers_per_gather > 0) after 11->12 upgrade. > > > > > > Investigation shows bug were introduced in > > > 8edd0e79460b414b1d971895312e549e95e12e4f > > > "Suppress Append and MergeAppend plan nodes that have a single child." . > > > Bug is present in all versions since 12 beta, and it is reproduced in master > > > branch as well. > > > > > > Reproduction. > > > https://gist.github.com/funny-falcon/45fa9460507c015390db99a7efd18a75 > > > > > > Dump contains definition of two tables (with 3 and 2 fields), index and data > > > dump > > > large enough to trigger parallel plan execution. > > > Also `analyze public.testi3, public.testi4` line is added. 
> > > > > > https://gist.github.com/funny-falcon/45fa9460507c015390db99a7efd18a75/raw/06a55e57bc6c460ea8800600b1b597c1ed216259/repro_dump.sql.zst > > > > > > Query: > > > > > > SELECT count(1) > > > FROM ( > > > SELECT fk, id > > > FROM testi3 > > > WHERE flagged > > > UNION ALL > > > SELECT fk, id > > > from testi3 > > > WHERE false > > > ) as t > > > LEFT OUTER JOIN testi4 > > > USING (fk); > > > > > > When `set max_parallel_workers_per_gather = 0;` then it returns 44 which is > > > correct. > > > When `set max_parallel_workers_per_gather = 2;` then it returns 88 which is > > > wrong. > > > > > > Sometimes (for example, when you try to debug it) it returns 44 even > > > with max_parallel_workers_per_gather = 2. > > > It seems that it happens when worker returns first row fast enough so > > > "Gather" node doesn't start to execute plan by itself. > > > > > > It is confirmed by `set parallel_leader_participation = false`, which forces > > > correct result. > > > `set enable_parallel_append = false` also forces correct result. > > > > > > Original plan included Gather Merge node and max_parallel_workers_per_gather > > > = 1 . > > > Reproduction code uses Gather instead and max_parallel_workers_per_gather = > > > 2 just because it were easier to get bad plan. 
> > >
> > > Plan before change 8edd0e79460b414b1d971895312e549e95e12e4f (returns
> > > correct result):
> > >
> > > Finalize Aggregate  (cost=25003.16..25003.17 rows=1 width=8) (actual time=435.473..435.473 rows=1 loops=1)
> > >   ->  Gather  (cost=25003.05..25003.16 rows=1 width=8) (actual time=434.646..456.889 rows=2 loops=1)
> > >         Workers Planned: 1
> > >         Workers Launched: 1
> > >         ->  Partial Aggregate  (cost=24003.05..24003.06 rows=1 width=8) (actual time=433.000..433.000 rows=1 loops=2)
> > >               ->  Parallel Hash Left Join  (cost=21701.47..24002.86 rows=74 width=0) (actual time=312.576..432.981 rows=22 loops=2)
> > >                     Hash Cond: ("*SELECT* 1".fk = testi4.fk)
> > >                     ->  Parallel Append  (cost=0.42..9.18 rows=24 width=4) (actual time=0.021..0.112 rows=16 loops=2)
> > >                           ->  Subquery Scan on "*SELECT* 1"  (cost=0.42..9.11 rows=24 width=4) (actual time=0.041..0.216 rows=33 loops=1)
> > >                                 ->  Index Scan using testi3_ix on testi3  (cost=0.42..8.87 rows=24 width=8) (actual time=0.039..0.207 rows=33 loops=1)
> > >                                       Index Cond: (flagged = true)
> > >                     ->  Parallel Hash  (cost=12084.91..12084.91 rows=586091 width=4) (actual time=305.473..305.473 rows=703309 loops=2)
> > >                           Buckets: 131072  Batches: 32  Memory Usage: 2784kB
> > >                           ->  Parallel Seq Scan on testi4  (cost=0.00..12084.91 rows=586091 width=4) (actual time=0.032..138.958 rows=703309 loops=2)
> > >
> > > Plan after that change (returns wrong result):
> > >
> > > Finalize Aggregate  (cost=25003.87..25003.88 rows=1 width=8) (actual time=652.062..652.062 rows=1 loops=1)
> > >   ->  Gather  (cost=25003.75..25003.86 rows=1 width=8) (actual time=651.482..668.718 rows=2 loops=1)
> > >         Workers Planned: 1
> > >         Workers Launched: 1
> > >         ->  Partial Aggregate  (cost=24003.75..24003.76 rows=1 width=8) (actual time=649.150..649.150 rows=1 loops=2)
> > >               ->  Parallel Hash Left Join  (cost=21701.47..24003.49 rows=106 width=0) (actual time=461.751..649.111 rows=44 loops=2)
> > >                     Hash Cond: ("*SELECT* 1".fk = testi4.fk)
> > >                     ->  Subquery Scan on "*SELECT* 1"  (cost=0.42..9.51 rows=36 width=4) (actual time=0.050..0.229 rows=33 loops=2)
> > >                           ->  Index Scan using testi3_ix on testi3  (cost=0.42..9.15 rows=36 width=8) (actual time=0.048..0.214 rows=33 loops=2)
> > >                                 Index Cond: (flagged = true)
> > >                     ->  Parallel Hash  (cost=12084.91..12084.91 rows=586091 width=4) (actual time=444.931..444.931 rows=703309 loops=2)
> > >                           Buckets: 131072  Batches: 32  Memory Usage: 2784kB
> > >                           ->  Parallel Seq Scan on testi4  (cost=0.00..12084.91 rows=586091 width=4) (actual time=0.045..202.913 rows=703309 loops=2)
> >
> > I found a simple fix:
> >
> > --- a/src/backend/optimizer/util/pathnode.c
> > +++ b/src/backend/optimizer/util/pathnode.c
> > @@ -1284,14 +1284,17 @@ create_append_path(PlannerInfo *root,
> >      if (list_length(pathnode->subpaths) == 1)
> >      {
> >          Path       *child = (Path *) linitial(pathnode->subpaths);
> >
> >          pathnode->path.rows = child->rows;
> >          pathnode->path.startup_cost = child->startup_cost;
> >          pathnode->path.total_cost = child->total_cost;
> >          pathnode->path.pathkeys = child->pathkeys;
> > +        pathnode->path.parallel_aware = child->parallel_aware;
> > +        pathnode->path.parallel_safe = child->parallel_safe;
> > +        pathnode->path.parallel_workers = child->parallel_workers;
> >      }
> >      else
> >      {
> >          pathnode->path.pathkeys = NIL;  /* unsorted if more than 1 subpath */
> >          cost_append(pathnode);
> >      }
> >
> > With this fix the Gather node deduces `Single Copy: true`, and the leader
> > doesn't participate in query execution.
> >
> > It seems to be a correct change: if the Append node will be removed, it
> > should take parallel awareness/safety from its single child.
> >
> > With this change a Gather Merge plan fails on an assert in
> > cost_gather_merge:
> >
> >     /*
> >      * Add one to the number of workers to account for the leader.
> >      * This might be overgenerous since the leader will do less work than
> >      * other workers in typical cases, but we'll go with it for now.
> >      */
> >     Assert(path->num_workers > 0);
> >     N = (double) path->num_workers + 1;
> >     logN = LOG2(N);
> >
> > Commenting this assert out doesn't seem to break anything and doesn't
> > seem to mess up the cost calculation (i.e. it remains logically correct).
> >
> > To preserve the assert I've changed it to `Assert(path->num_workers >= 0);`.
> >
> > With these changes the queries from the reproduction code return the
> > correct 44 rows.
> >
> > # explain analyze SELECT id, val
> >   FROM (
> >       SELECT fk, id
> >       FROM testi3
> >       WHERE flagged
> >       UNION ALL
> >       SELECT fk, id
> >       FROM testi3
> >       WHERE false
> >   ) AS t
> >   LEFT OUTER JOIN testi4
> >   USING (fk);
> >                                  QUERY PLAN
> > --------------------------------------------------------------------------------------------------------------------------------------------
> >  Gather  (cost=22701.47..25015.79 rows=130 width=8) (actual time=1268.071..1686.918 rows=44 loops=1)
> >    Workers Planned: 1
> >    Workers Launched: 1
> >    Single Copy: true
> >    ->  Parallel Hash Left Join  (cost=21701.47..24002.79 rows=130 width=8) (actual time=1242.936..1634.421 rows=44 loops=1)
> >          Hash Cond: ("*SELECT* 1".fk = testi4.fk)
> >          ->  Subquery Scan on "*SELECT* 1"  (cost=0.42..9.09 rows=24 width=8) (actual time=0.049..0.166 rows=33 loops=1)
> >                ->  Index Scan using testi3_ix on testi3  (cost=0.42..8.85 rows=24 width=8) (actual time=0.047..0.151 rows=33 loops=1)
> >                      Index Cond: (flagged = true)
> >          ->  Parallel Hash  (cost=12084.91..12084.91 rows=586091 width=8) (actual time=1238.029..1238.029 rows=1406618 loops=1)
> >                Buckets: 131072  Batches: 32  Memory Usage: 2784kB
> >                ->  Parallel Seq Scan on testi4  (cost=0.00..12084.91 rows=586091 width=8) (actual time=0.070..540.652 rows=1406618 loops=1)
> >
> > Interestingly, the Gather Merge node doesn't run workers:
> >
> > # explain analyze SELECT id, val
> >   FROM (
> >       SELECT fk, id
> >       FROM testi3
> >       WHERE
> >       flagged
> >       UNION ALL
> >       SELECT fk, id
> >       FROM testi3
> >       WHERE false
> >   ) AS t
> >   LEFT OUTER JOIN testi4
> >   USING (fk) ORDER BY val;
> >                                     QUERY PLAN
> > --------------------------------------------------------------------------------------------------------------------------------------------------
> >  Gather Merge  (cost=25007.36..25007.68 rows=0 width=8) (actual time=909.975..909.984 rows=44 loops=1)
> >    Workers Planned: 0
> >    Workers Launched: 0
> >    ->  Sort  (cost=24007.36..24007.68 rows=130 width=8) (actual time=909.948..909.952 rows=44 loops=1)
> >          Sort Key: testi4.val
> >          Sort Method: quicksort  Memory: 27kB
> >          ->  Parallel Hash Left Join  (cost=21701.47..24002.79 rows=130 width=8) (actual time=757.099..909.779 rows=44 loops=1)
> >                Hash Cond: ("*SELECT* 1".fk = testi4.fk)
> >                ->  Subquery Scan on "*SELECT* 1"  (cost=0.42..9.09 rows=24 width=8) (actual time=0.224..0.576 rows=33 loops=1)
> >                      ->  Index Scan using testi3_ix on testi3  (cost=0.42..8.85 rows=24 width=8) (actual time=0.202..0.543 rows=33 loops=1)
> >                            Index Cond: (flagged = true)
> >                ->  Parallel Hash  (cost=12084.91..12084.91 rows=586091 width=8) (actual time=755.983..755.983 rows=1406618 loops=1)
> >                      Buckets: 131072  Batches: 32  Memory Usage: 2730kB
> >                      ->  Parallel Seq Scan on testi4  (cost=0.00..12084.91 rows=586091 width=8) (actual time=0.052..330.983 rows=1406618 loops=1)
> >
> > So these changes could be taken as a quick valid fix.
> >
> > A further improvement would be to remove the Append node earlier, so as
> > to detect that there is no need for parallelism when the single Append
> > child doesn't support it.
> >
> > I've attached these changes as a patch.
>
> --------
> regards,
> Sokolov Yura
> Postgres Professional
> y.sokolov@postgrespro.ru
> funny.falcon@gmail.com