BUG #17335: Duplicate result rows in Gather node when "Suppress Append ... that have a single child" applied - Mailing list pgsql-bugs

From PG Bug reporting form
Subject BUG #17335: Duplicate result rows in Gather node when "Suppress Append ... that have a single child" applied
Date
Msg-id 17335-4dc92e1aea3a78af@postgresql.org
Whole thread Raw
Responses Re: BUG #17335: Duplicate result rows in Gather node when "Suppress Append ... that have a single child" applied
List pgsql-bugs
The following bug has been logged on the website:

Bug reference:      17335
Logged by:          Yura Sokolov
Email address:      y.sokolov@postgrespro.ru
PostgreSQL version: 14.1
Operating system:   any
Description:

Our customer faced a query that returns duplicate rows if parallel execution
is enabled
(max_parallel_workers_per_gather > 0) after 11->12 upgrade.

Investigation shows bug were introduced in
8edd0e79460b414b1d971895312e549e95e12e4f
"Suppress Append and MergeAppend plan nodes that have a single child." .
Bug is present in all versions since 12 beta, and it is reproduced in master
branch as well.

Reproduction.
https://gist.github.com/funny-falcon/45fa9460507c015390db99a7efd18a75

Dump contains definition of two tables (with 3 and 2 fields), index and data
dump
large enough to trigger parallel plan execution.
Also `analyze public.testi3, public.testi4` line is added.


https://gist.github.com/funny-falcon/45fa9460507c015390db99a7efd18a75/raw/06a55e57bc6c460ea8800600b1b597c1ed216259/repro_dump.sql.zst

Query:

    SELECT count(1)
    FROM (
      SELECT fk, id
      FROM testi3
      WHERE flagged
      UNION ALL
      SELECT fk, id
      from testi3
      WHERE false
    ) as t
    LEFT OUTER JOIN testi4
    USING (fk);

When `set max_parallel_workers_per_gather = 0;` then it returns 44 which is
correct.
When `set max_parallel_workers_per_gather = 2;` then it returns 88 which is
wrong.

Sometimes (for example, when you try to debug it) it returns 44 even
with max_parallel_workers_per_gather = 2.
It seems that it happens when worker returns first row fast enough so
"Gather" node doesn't start to execute plan by itself.

It is confirmed by `set parallel_leader_participation = false`, which forces
correct result.
`set enable_parallel_append = false` also forces correct result.

Original plan included Gather Merge node and max_parallel_workers_per_gather
= 1 .
Reproduction code uses Gather instead and max_parallel_workers_per_gather =
2 just because it were easier to get bad plan.

Plan before change 8edd0e79460b414b1d971895312e549e95e12e4f (returns correct
result)

 Finalize Aggregate  (cost=25003.16..25003.17 rows=1 width=8) (actual
time=435.473..435.473 rows=1 loops=1)
    
   ->  Gather  (cost=25003.05..25003.16 rows=1 width=8) (actual
time=434.646..456.889 rows=2 loops=1)
          
         Workers Planned: 1

         Workers Launched: 1

         ->  Partial Aggregate  (cost=24003.05..24003.06 rows=1 width=8)
(actual time=433.000..433.000 rows=1 loops=2)
               ->  Parallel Hash Left Join  (cost=21701.47..24002.86 rows=74
width=0) (actual time=312.576..432.981 rows=22 loops=2)
                     Hash Cond: ("*SELECT* 1".fk = testi4.fk)

                     ->  Parallel Append  (cost=0.42..9.18 rows=24 width=4)
(actual time=0.021..0.112 rows=16 loops=2)
                           ->  Subquery Scan on "*SELECT* 1"
(cost=0.42..9.11 rows=24 width=4) (actual time=0.041..0.216 rows=33 loops=1)
            
                                 ->  Index Scan using testi3_ix on testi3
(cost=0.42..8.87 rows=24 width=8) (actual time=0.039..0.207 rows=33
loops=1)
                                       Index Cond: (flagged = true)
                                                                          
                     ->  Parallel Hash  (cost=12084.91..12084.91 rows=586091
width=4) (actual time=305.473..305.473 rows=703309 loops=2)               
                           Buckets: 131072  Batches: 32  Memory Usage:
2784kB
   
                           ->  Parallel Seq Scan on testi4
(cost=0.00..12084.91 rows=586091 width=4) (actual time=0.032..138.958
rows=703309 loops=2)

Plan after that change (returns wrong result)

 Finalize Aggregate  (cost=25003.87..25003.88 rows=1 width=8) (actual
time=652.062..652.062 rows=1 loops=1)

       
   ->  Gather  (cost=25003.75..25003.86 rows=1 width=8) (actual
time=651.482..668.718 rows=2 loops=1)

             
         Workers Planned: 1                              
         Workers Launched: 1
                                      
         ->  Partial Aggregate  (cost=24003.75..24003.76 rows=1 width=8)
(actual time=649.150..649.150 rows=1 loops=2)

    
               ->  Parallel Hash Left Join  (cost=21701.47..24003.49
rows=106 width=0) (actual time=461.751..649.111 rows=44 loops=2)

        
                     Hash Cond: ("*SELECT* 1".fk = testi4.fk)
                                      
                     ->  Subquery Scan on "*SELECT* 1"  (cost=0.42..9.51
rows=36 width=4) (actual time=0.050..0.229 rows=33 loops=2)

    
                           ->  Index Scan using testi3_ix on testi3
(cost=0.42..9.15 rows=36 width=8) (actual time=0.048..0.214 rows=33 loops=2)

        
                                 Index Cond: (flagged = true)
                                      
                     ->  Parallel Hash  (cost=12084.91..12084.91 rows=586091
width=4) (actual time=444.931..444.931 rows=703309 loops=2)


                           Buckets: 131072  Batches: 32  Memory Usage:
2784kB

      
                           ->  Parallel Seq Scan on testi4
(cost=0.00..12084.91 rows=586091 width=4) (actual time=0.045..202.913
rows=703309 loops=2)

------

regards,
Yura


pgsql-bugs by date:

Previous
From: Alexander Korotkov
Date:
Subject: Re: BUG #17300: Server crashes on deserializing text multirange
Next
From: Andres Freund
Date:
Subject: Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum