Re: Fix for parallel BTree initialization bug - Mailing list pgsql-hackers

From Jameson, Hunter 'James'
Subject Re: Fix for parallel BTree initialization bug
Date
Msg-id D1CDB3C9-1BCE-41E1-8988-49349652BFE2@amazon.com
In response to Fix for parallel BTree initialization bug  ("Jameson, Hunter 'James'" <hunjmes@amazon.com>)
List pgsql-hackers
Answers inline below. Sorry for the formatting; I am still trying to get corporate email to work nicely with this
mailing list. Thanks.

On 9/9/20, 9:22 PM, "Justin Pryzby" <pryzby@telsasoft.com> wrote:

    On Tue, Sep 08, 2020 at 06:25:03PM +0000, Jameson, Hunter 'James' wrote:
    > Hi, I ran across a small (but annoying) bug in initializing parallel BTree scans, which causes the
    > parallel-scan state machine to get confused. The fix is one line; the description is a bit longer--

    What postgres version was this ?

We have observed this bug on PostgreSQL versions 11.x and 10.x. I don't believe it occurs in PostgreSQL versions 9.x,
because 9.x does not have parallel BTree scan.

    > Before, function _bt_first() would exit immediately if the specified scan keys could never be satisfied--without
    > notifying other parallel workers, if any, that the scan key was done. This moved that particular worker to a
    > scan key beyond what was in the shared parallel-query state, so that it would later try to read in
    > "InvalidBlockNumber", without recognizing it as a special sentinel value.
    >
    > The basic bug is that the BTree parallel query state machine assumes that a worker process is working on a
    > key <= the global key--a worker process can be behind (i.e., hasn't finished its work on a previous key), but
    > never ahead. By allowing the first worker to move on to the next scan key, in this one case, without notifying
    > other workers, the global key ends up < the first worker's local key.
    >
    > Symptoms of the bug are: on R/O, we get an error saying we can't extend the index relation, while on an R/W we
    > just extend the index relation by 1 block.

    What's the exact error ?  Are you able to provide a backtrace ?

I am not able to provide a full backtrace, unfortunately, but the relevant part appears to be:

  ReadBuffer (... blockNum=blockNum@entry=4294967295)
  _bt_getbuf (... blkno=4294967295 ...)
  _bt_readnextpage (... blkno=4294967295 ... )
  _bt_steppage (...)
  _bt_next (...)
  btgettuple (...)
  index_getnext_tid (...)
  index_getnext (...)
  IndexNext (...)

Notice that _bt_steppage() is passing InvalidBlockNumber to ReadBuffer(). That is the bug.
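
For reference, the 4294967295 in the backtrace is 0xFFFFFFFF, which is the value PostgreSQL uses for
InvalidBlockNumber. The following is only a minimal standalone sketch: the typedef and macro are re-declared
locally to mirror storage/block.h rather than included from the PostgreSQL headers, and is_valid_block() is a
hypothetical helper, not a PostgreSQL function.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins mirroring the definitions in PostgreSQL's storage/block.h. */
typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/*
 * Hypothetical helper (an assumption for illustration): returns true only
 * for block numbers that are not the sentinel, i.e. ones that would be
 * safe to hand to a buffer read.
 */
static bool
is_valid_block(BlockNumber blkno)
{
    return blkno != InvalidBlockNumber;
}
```

Calling is_valid_block(4294967295u) returns false, which is the check that the failing code path effectively
skipped before reaching ReadBuffer().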

    > To reproduce, you need a query that:
    >
    > 1. Executes parallel BTree index scan;
    > 2. Has an IN-list of size > 1;

    Do you mean you have an index on col1 and a query condition like: col1 IN (a,b,c...) ?

Something like that, yes.

    > 3. Has an additional index filter that makes it impossible to satisfy the
    >     first IN-list condition.

    .. AND col1::text||'foo' = '';
    I think you mean that the "impossible" condition makes it so that a btree
    worker exits early.

Specifically, on that worker, _bt_first() sees !so->qual_ok and just returns "false". That is the bug. The fix is that
the worker must also call _bt_parallel_done(scan), as is done everywhere else in _bt_first() where it returns "false".
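
As a rough illustration of why the missing notification matters, here is a toy model of the "local key <= global
key" invariant. It is not the nbtree code: every type and function name below (SharedScan, Worker, toy_*) is made
up for this sketch, and it assumes the shared state only moves to the next key after some worker has reported the
current key as done.

```c
#include <stdbool.h>

/* Toy model only; these are not PostgreSQL structures. */
typedef struct SharedScan
{
    int  global_key;    /* array key the shared state is currently on */
    bool key_done;      /* a worker has reported the current key finished */
} SharedScan;

typedef struct Worker
{
    int  local_key;     /* array key this worker is currently on */
} Worker;

/* Stand-in for _bt_parallel_done(): report the current key as finished. */
static void
toy_parallel_done(SharedScan *shared, const Worker *w)
{
    if (w->local_key >= shared->global_key)
        shared->key_done = true;
}

/*
 * Stand-in for advancing to the next array key. The worker always moves on;
 * the shared state moves on only if the current key was reported done.
 */
static void
toy_advance_key(SharedScan *shared, Worker *w)
{
    w->local_key++;
    if (shared->key_done)
    {
        shared->global_key++;
        shared->key_done = false;
    }
}

/*
 * Stand-in for _bt_first() on a key that can never be satisfied (!qual_ok).
 * With notify == false this models the old behavior; with notify == true
 * it models the proposed one-line fix.
 */
static bool
toy_first_unsatisfiable(SharedScan *shared, Worker *w, bool notify)
{
    if (notify)
        toy_parallel_done(shared, w);
    return false;
}

/* The invariant: a worker may be behind the global key, but never ahead. */
static bool
toy_invariant_holds(const SharedScan *shared, const Worker *w)
{
    return w->local_key <= shared->global_key;
}
```

Starting from key 0 on both sides, calling toy_first_unsatisfiable() with notify == false and then
toy_advance_key() leaves the worker on key 1 while the shared state is still on key 0, so the invariant is broken.
With notify == true, both advance together and the invariant holds.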
 

    > (We encountered such a query, and therefore the bug, on a production instance.)

    Could you send the "shape" of the query or its plan, obfuscated and redacted as
    need be ?

Plan is something like:

Finalize GroupAggregate  ... (... loops=1)
   Group Key: (...)
   ->  Gather Merge  ... (... loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial GroupAggregate  ... (... loops=3)
               Group Key: (...)
               ->  Sort  ... (... loops=3)
                     Sort Key: (...)
                     Sort Method: quicksort  ...
                     ->  Nested Loop ...  (... loops=3)
                           ->  Parallel Index Scan using ... (... loops=3)
                                 Index Cond: (((f ->> 't') >= ... ) AND ((f ->> 't') < ...) AND (((f -> 'c') ->> 't') = ANY(...)) AND (((f -> 'c') ->> 't') = ...))
                                 Filter: (CASE WHEN ... END IS NOT NULL)
                                 Rows Removed by Filter: ...
                           ->  Index Only Scan using ... (... rows=1 loops=...)
                                 Index Cond: (a = b)
                                 Heap Fetches: ...

    --
    Justin

James
--
James Hunter, Amazon Web Services (AWS)



