Re: Inadequate thought about buffer locking during hot standby replay - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: Inadequate thought about buffer locking during hot standby replay |
Date | |
Msg-id | 8807.1352676263@sss.pgh.pa.us Whole thread Raw |
In response to | Re: Inadequate thought about buffer locking during hot standby replay (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: Inadequate thought about buffer locking during hot standby replay
Re: Inadequate thought about buffer locking during hot standby replay Re: Inadequate thought about buffer locking during hot standby replay |
List | pgsql-hackers |
Attached is a complete draft patch against HEAD for this issue. I found a depressingly large number of pre-existing bugs: Practically all WAL record types that touch multiple pages have some bug of this type. In addition to btree_xlog_split, I found that heap_xlog_update, ginRedoDeletePage, spgRedoAddLeaf, spgRedoMoveLeafs, spgRedoAddNode, spgRedoSplitTuple, and spgRedoPickSplit fail to hold locks as required to make their updates safe for concurrent queries. (I'm not totally sure about ginRedoDeletePage, but the original action definitely locks the pages simultaneously, and it's not clear that it's safe not to.) Most of these are okay in cases without any full-page images, but could fail if the wrong subset of the pages-to-be-touched were processed by RestoreBkpBlocks. Some had bugs even without that :-( Worse than that, GIST WAL replay seems to be a total disaster from this standpoint, and I'm not convinced that it's even fixable without incompatible WAL changes. There are several cases where index insertion operations generate more than one WAL record while holding exclusive lock on buffer(s). If the lock continuity is actually necessary to prevent concurrent queries from seeing inconsistent index states, then we have a big problem, because WAL replay isn't designed to hold any buffer lock for more than one WAL action. This needs to be looked at closer by somebody who's more up on the GIST code than I am. The attached patch doesn't try very hard to make the GIST functions safe, it just transposes them to the new API. I also found that spgRedoAddNode could fail to update the parent downlink at all, if the parent tuple is in the same page as either the old or new split tuple --- it would get fooled by the LSN having been advanced already. This is an index corruption bug not just a transient-failure problem. Also, ginHeapTupleFastInsert's "merge lists" case fails to mark the old tail page as a candidate for a full-page image; in the worst case this could result in torn-page corruption. I already pointed out the inconsistency in heap_xlog_freeze about whether a cleanup lock is needed. As is, this patch uses a cleanup lock, but I suspect that a regular lock is sufficient --- comments? Also, gistRedoPageDeleteRecord seems to be dead code? I don't see anything that emits XLOG_GIST_PAGE_DELETE. regards, tom lane #application/octet-stream [restore-backup-blocks-individually-2.patch.gz] /home/tgl/pgsql/restore-backup-blocks-individually-2.patch.gz
pgsql-hackers by date: