Re: [BUG] Error in BRIN summarization - Mailing list pgsql-hackers

From Anastasia Lubennikova
Subject Re: [BUG] Error in BRIN summarization
Msg-id d3498a6d-bd6d-83f3-8823-547339560f49@postgrespro.ru
In response to Re: [BUG] Error in BRIN summarization  (Anastasia Lubennikova <a.lubennikova@postgrespro.ru>)
List pgsql-hackers
On 30.07.2020 16:40, Anastasia Lubennikova wrote:
> While testing this fix, Alexander Lakhin spotted another problem.
>
> After a few runs, it will fail with "ERROR: corrupted BRIN index: 
> inconsistent range map"
>
> The problem is caused by a race in page locking in 
> brinGetTupleForHeapBlock [1]:
>
> (1) bitmapscan locks revmap->rm_currBuf and finds the address of the 
> tuple on a regular page "page", then unlocks revmap->rm_currBuf
> (2) in another transaction desummarize locks both revmap->rm_currBuf 
> and "page", cleans up the tuple and unlocks both buffers
> (1) bitmapscan locks the buffer containing "page", attempts to access 
> the tuple and fails to find it
>
>
> At first, I tried to fix it by holding the lock on revmap->rm_currBuf 
> until we locked the regular page, but it causes a deadlock with 
> brinsummarize(). It can be easily reproduced with the same test as above.
> Is there any rule about the order of locking revmap and regular pages 
> in brin? I haven't found anything in README.
>
> As an alternative, we can leave the locks as is and add a recheck 
> before throwing an error.
>
Here are the updated patches for both problems.

1) brin_summarize_fix_REL_12_v2 fixes
"failed to find parent tuple for heap-only tuple at (50661,130) in table 
"tbl""

This patch makes sure that we only access initialized entries of the 
root_offsets[] array, and collects the array again if an entry is 
missing. One recheck is enough here, since concurrent pruning is not 
possible.
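
To illustrate the idea (this is a toy model, not the code from the 
patch), the check on root_offsets[] can be sketched as below. ToyPage, 
collect_root_offsets() and lookup_root_offset() are hypothetical names 
made up for this sketch; INVALID_OFFSET is assumed to play the role of 
an uninitialized entry.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_OFFSETS    8
#define INVALID_OFFSET 0    /* assumed marker for an uninitialized entry */

/*
 * Hypothetical stand-in for a heap page: root_of[i] is the root line
 * pointer of the HOT chain that contains offset i + 1.  Only the first
 * nitems entries exist on the page.
 */
typedef struct
{
    int            nitems;
    unsigned short root_of[MAX_OFFSETS];
} ToyPage;

/* Fill root_offsets[] for the items currently on the page. */
static void
collect_root_offsets(const ToyPage *page, unsigned short *root_offsets)
{
    for (int i = 0; i < MAX_OFFSETS; i++)
        root_offsets[i] = (i < page->nitems) ? page->root_of[i]
                                             : INVALID_OFFSET;
}

/*
 * Return the root offset for offnum.  If the cached entry was never
 * initialized, collect the array again exactly once.  The result is
 * still INVALID_OFFSET if the item is really missing.
 */
static unsigned short
lookup_root_offset(const ToyPage *page, unsigned short *root_offsets,
                   unsigned short offnum)
{
    assert(offnum >= 1 && offnum <= MAX_OFFSETS);

    if (root_offsets[offnum - 1] == INVALID_OFFSET)
        collect_root_offsets(page, root_offsets);   /* single recheck */

    return root_offsets[offnum - 1];
}
```

In this model, an item added to the page after the first collection is 
found on the second pass, while a lookup of an item that does not exist 
still returns INVALID_OFFSET and can be reported as an error.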

2) brin_pagelock_fix_REL_12_v1.patch fixes
"ERROR: corrupted BRIN index: inconsistent range map"

This patch adds a recheck, as suggested in my previous message.
I am not sure that one recheck is enough to eliminate the race 
completely, but I can no longer reproduce the problem.
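
A simplified model of such a recheck is sketched below. It is not the 
patch itself: ToyIndex, lookup_with_recheck() and desummarize() are 
hypothetical names, locking is not modeled, and the optional hook only 
simulates a concurrent desummarize running between the read of the 
range map and the read of the regular page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NSLOTS       4
#define INVALID_SLOT (-1)

/* Hypothetical model: one range map entry pointing into one regular page. */
typedef struct
{
    int  revmap_entry;          /* slot the range map points to */
    bool slot_in_use[NSLOTS];   /* is there a tuple in each slot? */
} ToyIndex;

typedef enum
{
    LOOKUP_FOUND,
    LOOKUP_UNSUMMARIZED,
    LOOKUP_CORRUPTED
} LookupResult;

/* Hook that simulates another backend running between the two reads. */
typedef void (*RaceHook) (ToyIndex *idx);

/* Simulated desummarize: remove the tuple and clear the range map entry. */
static void
desummarize(ToyIndex *idx)
{
    if (idx->revmap_entry != INVALID_SLOT)
        idx->slot_in_use[idx->revmap_entry] = false;
    idx->revmap_entry = INVALID_SLOT;
}

static LookupResult
lookup_with_recheck(ToyIndex *idx, RaceHook between_reads)
{
    int slot = idx->revmap_entry;   /* read the range map */
    int current;

    if (slot == INVALID_SLOT)
        return LOOKUP_UNSUMMARIZED;
    assert(slot >= 0 && slot < NSLOTS);

    if (between_reads != NULL)
        between_reads(idx);         /* window in which the race can happen */

    if (idx->slot_in_use[slot])     /* read the regular page */
        return LOOKUP_FOUND;

    /* Tuple not found: recheck the range map once before complaining. */
    current = idx->revmap_entry;
    if (current == slot)
        return LOOKUP_CORRUPTED;    /* map unchanged, tuple really missing */
    if (current == INVALID_SLOT)
        return LOOKUP_UNSUMMARIZED; /* range was desummarized concurrently */
    return idx->slot_in_use[current] ? LOOKUP_FOUND : LOOKUP_CORRUPTED;
}
```

The model reports corruption only when the range map still points at the 
slot where the tuple is missing. Whether a single recheck covers every 
interleaving in the real code is exactly the open question above.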

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

