Home > mailing lists

Re: Inadequate thought about buffer locking during hot standby replay - Mailing list pgsql-hackers

From	Tom Lane
Subject	Re: Inadequate thought about buffer locking during hot standby replay
Date	November 11, 2012 23:26:01
Msg-id	8859.1352676357@sss.pgh.pa.us Whole thread Raw
In response to	Re: Inadequate thought about buffer locking during hot standby replay (Tom Lane <tgl@sss.pgh.pa.us>)
Responses	Re: Inadequate thought about buffer locking during hot standby replay
List	pgsql-hackers

Tree view

[ this time *with* the patch ]

Attached is a complete draft patch against HEAD for this issue.  I found
a depressingly large number of pre-existing bugs:

Practically all WAL record types that touch multiple pages have some
bug of this type.  In addition to btree_xlog_split, I found that
heap_xlog_update, ginRedoDeletePage, spgRedoAddLeaf, spgRedoMoveLeafs,
spgRedoAddNode, spgRedoSplitTuple, and spgRedoPickSplit fail to hold
locks as required to make their updates safe for concurrent queries.
(I'm not totally sure about ginRedoDeletePage, but the original action
definitely locks the pages simultaneously, and it's not clear that it's
safe not to.)  Most of these are okay in cases without any full-page
images, but could fail if the wrong subset of the pages-to-be-touched
were processed by RestoreBkpBlocks.  Some had bugs even without that :-(

Worse than that, GIST WAL replay seems to be a total disaster from this
standpoint, and I'm not convinced that it's even fixable without
incompatible WAL changes.  There are several cases where index
insertion operations generate more than one WAL record while holding
exclusive lock on buffer(s).  If the lock continuity is actually
necessary to prevent concurrent queries from seeing inconsistent index
states, then we have a big problem, because WAL replay isn't designed to
hold any buffer lock for more than one WAL action.  This needs to be
looked at closer by somebody who's more up on the GIST code than I am.
The attached patch doesn't try very hard to make the GIST functions
safe, it just transposes them to the new API.

I also found that spgRedoAddNode could fail to update the parent
downlink at all, if the parent tuple is in the same page as either the
old or new split tuple --- it would get fooled by the LSN having been
advanced already.  This is an index corruption bug not just a
transient-failure problem.

Also, ginHeapTupleFastInsert's "merge lists" case fails to mark the old
tail page as a candidate for a full-page image; in the worst case this
could result in torn-page corruption.

I already pointed out the inconsistency in heap_xlog_freeze about
whether a cleanup lock is needed.  As is, this patch uses a cleanup
lock, but I suspect that a regular lock is sufficient --- comments?

Also, gistRedoPageDeleteRecord seems to be dead code?  I don't see
anything that emits XLOG_GIST_PAGE_DELETE.

            regards, tom lane

Attachment

msg-13774-2.bin

pgsql-hackers by date:

From: Tom Lane
Date: 11 November 2012, 23:24:28
Subject: Re: Inadequate thought about buffer locking during hot standby replay

From: Tomas Vondra
Date: 11 November 2012, 23:34:45
Subject: Re: review: pgbench - aggregation of info written into log

Re: Inadequate thought about buffer locking during hot standby replay - Mailing list pgsql-hackers

Attachment

Previous

Next