Re: Hot standby queries see transient all-zeros pages - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: Hot standby queries see transient all-zeros pages |
Date | |
Msg-id | yrqjqqiqiyhb4pwknw2vphrfhmtwmdn7xiqeux32q65lsox3h7@dzktq4py4lvw Whole thread Raw |
In response to | Hot standby queries see transient all-zeros pages (Noah Misch <noah@leadboat.com>) |
Responses |
Re: Hot standby queries see transient all-zeros pages
Re: Hot standby queries see transient all-zeros pages |
List | pgsql-hackers |
Hi, Just found this thread because I was looking for discussions for some behaviour of XLogReadBufferExtended()... Afaics we didn't do anything about this issue? I just tried to repro this, without success so far. But it very well might be sufficiently timing dependent that it just happens to not trigger on my machine. I'm actually not sure what'd trigger a problematic pattern in 027_stream_regress.pl, there's not a whole lot of reading while replaying records. On 2024-05-12 10:16:58 -0700, Noah Misch wrote: > XLogReadBufferForRedoExtended() precedes RestoreBlockImage() with > RBM_ZERO_AND_LOCK. Per src/backend/storage/buffer/README: > > Once one has determined that a tuple is interesting (visible to the current > transaction) one may drop the content lock, yet continue to access the > tuple's data for as long as one holds the buffer pin. > > The use of RBM_ZERO_AND_LOCK is incompatible with that. See a similar > argument at https://postgr.es/m/flat/5101.1328219790@sss.pgh.pa.us that led > me to the cause. Adding a 10ms sleep just after RBM_ZERO_AND_LOCK, I got 2 > failures in 7 runs of 027_stream_regress.pl, at Assert(ItemIdIsNormal(lpp)) > in heapgettup_pagemode(). In the core file, lpp pointed into an all-zeros > page. RestoreBkpBlocks() had been doing RBM_ZERO years before hot standby > existed, but it wasn't a bug until queries could run concurrently. Yep, this does sound like a problem. I looked around to see if other uses of RBM_ZERO_AND_LOCK are problematic, but so far I didn't find anything. Most paths that lead to RBM_ZERO_AND_LOCK being used should be safe, due to those pages not having valid contents beforehand when in an consistent state. E.g. the RBM_ZERO_AND_LOCK that originate from XLogInitBufferForRedo should only ever target pages that have been empty (when consistent) and thus shouldn't have anybody pointing into an unlocked buffer. Hm. Leaving RBM_ZERO_AND_LOCK aside, is it actually always safe to do RestoreBlockImage() into a buffer that currently is pinned? Not sure if there's actually all that much guarantee what transient state one can read when reading a page concurrently to a memcpy(). I suspect it's practically rare to see a problem, but one could imagine an memcpy implementation that uses non-temporal writes, which afaict would leave you open to seeing quite random states when reading concurrently, as the cache coherence protocol doesn't protect anymore. > I suspect the fix is to add a ReadBufferMode specified as, "If the block is > already in shared_buffers, do RBM_NORMAL and exclusive-lock the buffer. > Otherwise, do RBM_ZERO_AND_LOCK." I think that should work. At least in the current code it looks near trivial to implement, although the branch differences are going to be annoying. As usual the hardest part would probably be the naming. Maybe RBM_ZERO_ON_MISS_LOCK? RBM_LOCK_ZERO_ON_MISS? RBM_DWIM? > Like RBM_ZERO_AND_LOCK, it avoids wasting disk reads > on data we discard. Are there other strategies to consider? I guess we could just use RBM_ZERO_AND_CLEANUP_LOCK for this path. But that seems like it has too many obvious downsides. Greetings, Andres Freund
pgsql-hackers by date: