Thread: LWLock deadlock in brinRevmapDesummarizeRange
Hi, While finalizing some fixes in BRIN, I decided to stress-test the relevant part of the code to check if I missed something. Imagine a simple script that builds BRIN indexes on random data, does random changes and cross-checks the results with/without the index. But instead of I almost immediately ran into a LWLock deadlock :-( I've managed to reproduce this on PG13+, but I believe it's there since the brinRevmapDesummarizeRange was introduced in PG10. I just haven't tried on pre-13 releases. The stress-test-2.sh script (attached .tgz) builds a table, fills it with random data and then runs a mix of updates and (de)summarization DDL of random fraction of the index. The lockup is usually triggered within a couple minutes, but might take longer (I guess it depends on parameters used to generate the random data, so it may take a couple runs to hit the right combination). The root cause is that brin_doupdate and brinRevmapDesummarizeRange end up locking buffers in different orders. Attached is also a .patch that adds a bunch of LOG messages for buffer locking in BRIN code (it's for PG13, but should work on newer releases too). Here's a fairly typical example of the interaction between brin_doupdate and brinRevmapDesummarizeRange: brin_doupdate (from UPDATE query): LOG: brin_doupdate: samepage 0 LOG: number of LWLocks held: 0 LOG: brin_getinsertbuffer: locking 898 lock 0x7f9a99a5af64 LOG: brin_getinsertbuffer: buffer locked LOG: brin_getinsertbuffer B: locking 899 lock 0x7f9a99a5afa4 LOG: brin_getinsertbuffer B: buffer locked LOG: number of LWLocks held: 2 LOG: lock 0 => 0x7f9a99a5af64 LOG: lock 1 => 0x7f9a99a5afa4 LOG: brin_doupdate: locking buffer for update LOG: brinLockRevmapPageForUpdate: locking 158 lock 0x7f9a99a4f664 brinRevmapDesummarizeRange (from brin_desummarize_range): LOG: starting brinRevmapDesummarizeRange LOG: number of LWLocks held: 0 LOG: brinLockRevmapPageForUpdate: locking 158 lock 0x7f9a99a4f664 LOG: brinLockRevmapPageForUpdate: buffer locked LOG: number of LWLocks held: 1 LOG: lock 0 => 0x7f9a99a4f664 LOG: brinRevmapDesummarizeRange: locking 898 lock 0x7f9a99a5af64 So, brin_doupdate starts with no LWLocks, and locks buffers 898, 899 (through getinsertbuffer). And then tries to lock 158. Meanwhile brinRevmapDesummarizeRange locks 158 first, and then tries to lock 898. So, a LWLock deadlock :-( I've now seen a bunch of these traces, with only minor differences. For example brinRevmapDesummarizeRange might gets stuck on the second buffer locked by getinsertbuffer (not the first one like here). I don't have a great idea how to fix this - I guess we need to ensure the buffers are locked in the same order, but that seems tricky. Obviously, people don't call brin_desummarize_range() very often, which likely explains the lack of reports. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On 2023-Feb-22, Tomas Vondra wrote: > But instead of I almost immediately ran into a LWLock deadlock :-( Ouch. > I've managed to reproduce this on PG13+, but I believe it's there since > the brinRevmapDesummarizeRange was introduced in PG10. I just haven't > tried on pre-13 releases. Hmm, I think that might just be an "easy" way to hit it, but the problem is actually older than that, since AFAICS brin_doupdate is careless regarding locking order of revmap page vs. regular page. Sadly, the README doesn't cover locking considerations. I had that in a file called 'minmax-proposal' in version 16 of the patch here https://postgr.es/m/20140820225133.GB6343@eldon.alvh.no-ip.org but by version 18 (when 'minmax' became BRIN) I seem to have removed that file and replaced it with the README and apparently I didn't copy this material over. ... and in there, I wrote that we would first write the brin tuple in the regular page, unlock that, and then lock the revmap for the update, without holding lock on the data page. I don't remember why we do it differently now, but maybe the fix is just to release the regular page lock before locking the revmap page? One very important change is that in previous versions the revmap used a separate fork, and we had to introduce an "evacuation protocol" when we integrated the revmap into the main fork, which may have changed the locking considerations. Another point: to desummarize a range, just unlinking the entry from revmap should suffice, from the POV of other index scanners. Maybe we can simplify the whole procedure to: lock revmap, remove entry, remember page number, unlock revmap; lock regular page, delete entry, unlock. Then there are no two locks held at the same time during desummarize. This comes from v16: + Locking considerations + ---------------------- + + To read the TID during an index scan, we follow this protocol: + + * read revmap page + * obtain share lock on the revmap buffer + * read the TID + * obtain share lock on buffer of main fork + * LockTuple the TID (using the index as relation). A shared lock is + sufficient. We need the LockTuple to prevent VACUUM from recycling + the index tuple; see below. + * release revmap buffer lock + * read the index tuple + * release the tuple lock + * release main fork buffer lock + + + To update the summary tuple for a page range, we use this protocol: + + * insert a new index tuple somewhere in the main fork; note its TID + * read revmap page + * obtain exclusive lock on revmap buffer + * write the TID + * release lock + + This ensures no concurrent reader can obtain a partially-written TID. + Note we don't need a tuple lock here. Concurrent scans don't have to + worry about whether they got the old or new index tuple: if they get the + old one, the tighter values are okay from a correctness standpoint because + due to MVCC they can't possibly see the just-inserted heap tuples anyway. + + [vacuum stuff elided] -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Escucha y olvidarás; ve y recordarás; haz y entenderás" (Confucio)
On 2/22/23 12:35, Alvaro Herrera wrote: > On 2023-Feb-22, Tomas Vondra wrote: > >> But instead of I almost immediately ran into a LWLock deadlock :-( > > Ouch. > >> I've managed to reproduce this on PG13+, but I believe it's there since >> the brinRevmapDesummarizeRange was introduced in PG10. I just haven't >> tried on pre-13 releases. > > Hmm, I think that might just be an "easy" way to hit it, but the problem > is actually older than that, since AFAICS brin_doupdate is careless > regarding locking order of revmap page vs. regular page. > That's certainly possible, although I ran a lot of BRIN stress tests and it only started failing after I added the desummarization. Although, the tests are "randomized" like this: UPDATE t SET a = '...' WHERE random() < 0.05; which is fairly sequential. Maybe reordering the CTIDs a bit would hit additional deadlocks, I'll probably give it a try. OTOH that'd probably be much more likely to be hit by users, and I don't recall any such reports. > Sadly, the README doesn't cover locking considerations. I had that in a > file called 'minmax-proposal' in version 16 of the patch here > https://postgr.es/m/20140820225133.GB6343@eldon.alvh.no-ip.org > but by version 18 (when 'minmax' became BRIN) I seem to have removed > that file and replaced it with the README and apparently I didn't copy > this material over. > Yeah :-( There's a couple more things that are missing in the README, like what oi_regular_nulls mean. > ... and in there, I wrote that we would first write the brin tuple in > the regular page, unlock that, and then lock the revmap for the update, > without holding lock on the data page. I don't remember why we do it > differently now, but maybe the fix is just to release the regular page > lock before locking the revmap page? One very important change is that > in previous versions the revmap used a separate fork, and we had to > introduce an "evacuation protocol" when we integrated the revmap into > the main fork, which may have changed the locking considerations. > What would happen if two processes built the summary concurrently? How would they find the other tuple, so that we don't end up with two BRIN tuples for the same range? > Another point: to desummarize a range, just unlinking the entry from > revmap should suffice, from the POV of other index scanners. Maybe we > can simplify the whole procedure to: lock revmap, remove entry, remember > page number, unlock revmap; lock regular page, delete entry, unlock. > Then there are no two locks held at the same time during desummarize. > Perhaps, as long as it doesn't confuse anything else. > This comes from v16: > I don't follow - what do you mean by v16? I don't see anything like that anywhere in the repository. > + Locking considerations > + ---------------------- > + > + To read the TID during an index scan, we follow this protocol: > + > + * read revmap page > + * obtain share lock on the revmap buffer > + * read the TID > + * obtain share lock on buffer of main fork > + * LockTuple the TID (using the index as relation). A shared lock is > + sufficient. We need the LockTuple to prevent VACUUM from recycling > + the index tuple; see below. > + * release revmap buffer lock > + * read the index tuple > + * release the tuple lock > + * release main fork buffer lock > + > + > + To update the summary tuple for a page range, we use this protocol: > + > + * insert a new index tuple somewhere in the main fork; note its TID > + * read revmap page > + * obtain exclusive lock on revmap buffer > + * write the TID > + * release lock > + > + This ensures no concurrent reader can obtain a partially-written TID. > + Note we don't need a tuple lock here. Concurrent scans don't have to > + worry about whether they got the old or new index tuple: if they get the > + old one, the tighter values are okay from a correctness standpoint because > + due to MVCC they can't possibly see the just-inserted heap tuples anyway. > + > + [vacuum stuff elided] > regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2023-Feb-22, Tomas Vondra wrote: > > ... and in there, I wrote that we would first write the brin tuple in > > the regular page, unlock that, and then lock the revmap for the update, > > without holding lock on the data page. I don't remember why we do it > > differently now, but maybe the fix is just to release the regular page > > lock before locking the revmap page? One very important change is that > > in previous versions the revmap used a separate fork, and we had to > > introduce an "evacuation protocol" when we integrated the revmap into > > the main fork, which may have changed the locking considerations. > > What would happen if two processes built the summary concurrently? How > would they find the other tuple, so that we don't end up with two BRIN > tuples for the same range? Well, the revmap can only keep track of one tuple per range; if two processes build summary tuples, and each tries to insert its tuple in a regular page, that part may succeed; but then only one of them is going to successfully register the summary tuple in the revmap: when the other goes to do the same, it would find that a CTID is already present. ... Looking at the code (brinSetHeapBlockItemptr), I think what happens here is that the second process would overwrite the TID with its own. Not sure if it would work to see whether the item is empty and bail out if it's not. But in any case, it seems to me that the update of the regular page is pretty much independent of the update of the revmap. > > Another point: to desummarize a range, just unlinking the entry from > > revmap should suffice, from the POV of other index scanners. Maybe we > > can simplify the whole procedure to: lock revmap, remove entry, remember > > page number, unlock revmap; lock regular page, delete entry, unlock. > > Then there are no two locks held at the same time during desummarize. > > Perhaps, as long as it doesn't confuse anything else. Well, I don't have the details fresh in mind, but I think it shouldn't, because the only way to reach a regular tuple is coming from the revmap; and we reuse "items" (lines) in a regular page only when they are empty (so vacuuming should also be OK). > > This comes from v16: > > I don't follow - what do you mean by v16? I don't see anything like that > anywhere in the repository. I meant the minmax-proposal file in patch v16, the one that I linked to. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "You're _really_ hosed if the person doing the hiring doesn't understand relational systems: you end up with a whole raft of programmers, none of whom has had a Date with the clue stick." (Andrew Sullivan)