Re: LWLock deadlock in brinRevmapDesummarizeRange - Mailing list pgsql-hackers

From Arseniy Mukhin
Subject Re: LWLock deadlock in brinRevmapDesummarizeRange
Date
Msg-id 156d52ac-c502-4dc8-baa5-64b6d1b32e74@gmail.com
Whole thread Raw
In response to Re: LWLock deadlock in brinRevmapDesummarizeRange  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
List pgsql-hackers
Hello Tomas and Alvaro,

Came across this old thread while searching for information about brin,
and decided to try to fix it.

On 22.02.2023 17:41, Alvaro Herrera wrote:
 > On 2023-Feb-22, Tomas Vondra wrote:
 >
 >>> ... and in there, I wrote that we would first write the brin tuple in
 >>> the regular page, unlock that, and then lock the revmap for the update,
 >>> without holding lock on the data page.  I don't remember why we do it
 >>> differently now, but maybe the fix is just to release the regular page
 >>> lock before locking the revmap page?  One very important change is that
 >>> in previous versions the revmap used a separate fork, and we had to
 >>> introduce an "evacuation protocol" when we integrated the revmap into
 >>> the main fork, which may have changed the locking considerations.
 >>
 >> What would happen if two processes built the summary concurrently? How
 >> would they find the other tuple, so that we don't end up with two BRIN
 >> tuples for the same range?
 >
 > Well, the revmap can only keep track of one tuple per range; if two
 > processes build summary tuples, and each tries to insert its tuple in a
 > regular page, that part may succeed; but then only one of them is going
 > to successfully register the summary tuple in the revmap: when the other
 > goes to do the same, it would find that a CTID is already present.
 >
 > ... Looking at the code (brinSetHeapBlockItemptr), I think what happens
 > here is that the second process would overwrite the TID with its own.
 > Not sure if it would work to see whether the item is empty and bail out
 > if it's not.
 >

AFAIS it's impossible to have two processes summarizing or desummarizing 
ranges simultaneously,
because you need ShareUpdateExclusiveLock to do it. Usual inserts / 
updates can't lead
to summarization, they affects only those ranges that are summarized 
already.

 >>> Another point: to desummarize a range, just unlinking the entry from
 >>> revmap should suffice, from the POV of other index scanners.  Maybe we
 >>> can simplify the whole procedure to: lock revmap, remove entry, 
remember
 >>> page number, unlock revmap; lock regular page, delete entry, unlock.
 >>> Then there are no two locks held at the same time during desummarize.
 >>
 >> Perhaps, as long as it doesn't confuse anything else.
 >
 > Well, I don't have the details fresh in mind, but I think it shouldn't,
 > because the only way to reach a regular tuple is coming from the revmap;
 > and we reuse "items" (lines) in a regular page only when they are
 > empty (so vacuuming should also be OK).
 >

I think it would work, but there's one thing to handle in this case:
You may have a concurrent update that moved the index tuple to another 
reg page and wants to update
the revmap page with a new pointer. This may happen after the revmap 
entry is deleted,
so you can have kind of lost update. But this seems to be easy to detect:
when you try to delete the index tuple, you'll see an unused item.
This means that a concurrent update moved index tuple and updated revmap 
page and you need to retry the desummarization.

Also if you don't hold locks on both revmap page and regular pages, you 
will have
to split wal record and rework recovery for desummarization. And 
probably in case of crash between revmap update
and regular page update you can have dangling index tuple, so it seems 
that having one wal record for update simplifies
recovery.


I think there is a straightforward way to implement desummarization and 
fix the deadlock.
Desummarization is actually very similar to usual update, so we can just 
use update flow
(like brininsert or second part of summarize_range where you need to 
merge summary with placeholder tuple):
- lock revmap page
- save pointer to the index tuple
- unlock revmap page
- lock regular page
- check if pointer is still valid (blkno is as expected, NORMAL tuple etc.)
   if we see that pointer is not valid anymore -> unlock reg page + retry
- lock revmap page
- update reg page and revmap page

First five steps have been implemented already in 
brinGetTupleForHeapBlock, so we can use it (like every update actually 
do, if I see correctly).

Please find the attached patch implementing the above idea.

I tried Tomas's stress tests with applied patch and didn't have deadlocks.

Also there is a another way how issue can be reproduced:
apply deadlock-sleep.patch (it just adds 15 sec sleep in brin_doupdate 
and few logs)
execute deadlock.sql
execute deadlock2.sql from another client (there are 15 seconds to do it 
while first process is sleeping).


Best regards,
Arseniy Mukhin
Attachment

pgsql-hackers by date:

Previous
From: Sami Imseih
Date:
Subject: Re: Proposal - Allow extensions to set a Plan Identifier
Next
From: Daniel Gustafsson
Date:
Subject: Re: Enhancing Memory Context Statistics Reporting