Thread: Hot standby queries see transient all-zeros pages
XLogReadBufferForRedoExtended() precedes RestoreBlockImage() with
RBM_ZERO_AND_LOCK.  Per src/backend/storage/buffer/README:

  Once one has determined that a tuple is interesting (visible to the current
  transaction) one may drop the content lock, yet continue to access the
  tuple's data for as long as one holds the buffer pin.

The use of RBM_ZERO_AND_LOCK is incompatible with that.  See a similar
argument at https://postgr.es/m/flat/5101.1328219790@sss.pgh.pa.us that led
me to the cause.  Adding a 10ms sleep just after RBM_ZERO_AND_LOCK, I got 2
failures in 7 runs of 027_stream_regress.pl, at Assert(ItemIdIsNormal(lpp))
in heapgettup_pagemode().  In the core file, lpp pointed into an all-zeros
page.  RestoreBkpBlocks() had been doing RBM_ZERO years before hot standby
existed, but it wasn't a bug until queries could run concurrently.

I suspect the fix is to add a ReadBufferMode specified as, "If the block is
already in shared_buffers, do RBM_NORMAL and exclusive-lock the buffer.
Otherwise, do RBM_ZERO_AND_LOCK."  That avoids RBM_NORMAL for a block past
the current end of the file.  Like RBM_ZERO_AND_LOCK, it avoids wasting disk
reads on data we discard.  Are there other strategies to consider?

I got here from a Windows CI failure,
https://cirrus-ci.com/task/6247605141766144.  That involved patched code,
but adding the sleep suffices on Linux, with today's git master:

--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -388,6 +388,8 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 		*buf = XLogReadBufferExtended(rlocator, forknum, blkno,
 									  get_cleanup_lock ? RBM_ZERO_AND_CLEANUP_LOCK : RBM_ZERO_AND_LOCK,
 									  prefetch_buffer);
+		if (!get_cleanup_lock)
+			pg_usleep(10 * 1000);
 		page = BufferGetPage(*buf);
 		if (!RestoreBlockImage(record, block_id, page))
 			ereport(ERROR,
Hi,

Just found this thread because I was looking for discussions of some
behaviour of XLogReadBufferExtended()...

Afaics we didn't do anything about this issue?

I just tried to repro this, without success so far.  But it very well might
be sufficiently timing dependent that it just happens not to trigger on my
machine.  I'm actually not sure what would trigger a problematic pattern in
027_stream_regress.pl; there's not a whole lot of reading while replaying
records.

On 2024-05-12 10:16:58 -0700, Noah Misch wrote:
> XLogReadBufferForRedoExtended() precedes RestoreBlockImage() with
> RBM_ZERO_AND_LOCK.  Per src/backend/storage/buffer/README:
>
>   Once one has determined that a tuple is interesting (visible to the current
>   transaction) one may drop the content lock, yet continue to access the
>   tuple's data for as long as one holds the buffer pin.
>
> The use of RBM_ZERO_AND_LOCK is incompatible with that.  See a similar
> argument at https://postgr.es/m/flat/5101.1328219790@sss.pgh.pa.us that led
> me to the cause.  Adding a 10ms sleep just after RBM_ZERO_AND_LOCK, I got 2
> failures in 7 runs of 027_stream_regress.pl, at Assert(ItemIdIsNormal(lpp))
> in heapgettup_pagemode().  In the core file, lpp pointed into an all-zeros
> page.  RestoreBkpBlocks() had been doing RBM_ZERO years before hot standby
> existed, but it wasn't a bug until queries could run concurrently.

Yep, this does sound like a problem.

I looked around to see if other uses of RBM_ZERO_AND_LOCK are problematic,
but so far I didn't find anything.  Most paths that lead to RBM_ZERO_AND_LOCK
being used should be safe, due to those pages not having valid contents
beforehand when in a consistent state.  E.g. the RBM_ZERO_AND_LOCK calls that
originate from XLogInitBufferForRedo should only ever target pages that have
been empty (when consistent) and thus shouldn't have anybody pointing into an
unlocked buffer.

Hm.
Leaving RBM_ZERO_AND_LOCK aside, is it actually always safe to do
RestoreBlockImage() into a buffer that currently is pinned?  I'm not sure
there's actually all that much guarantee about what transient state one can
read when reading a page concurrently with a memcpy().  I suspect it's
practically rare to see a problem, but one could imagine a memcpy()
implementation that uses non-temporal writes, which afaict would leave you
open to seeing quite random states when reading concurrently, as the cache
coherence protocol doesn't protect you anymore.

> I suspect the fix is to add a ReadBufferMode specified as, "If the block is
> already in shared_buffers, do RBM_NORMAL and exclusive-lock the buffer.
> Otherwise, do RBM_ZERO_AND_LOCK."

I think that should work.  At least in the current code it looks near
trivial to implement, although the branch differences are going to be
annoying.

As usual the hardest part would probably be the naming.  Maybe
RBM_ZERO_ON_MISS_LOCK?  RBM_LOCK_ZERO_ON_MISS?  RBM_DWIM?

> Like RBM_ZERO_AND_LOCK, it avoids wasting disk reads
> on data we discard.  Are there other strategies to consider?

I guess we could just use RBM_ZERO_AND_CLEANUP_LOCK for this path.  But that
seems like it has too many obvious downsides.

Greetings,

Andres Freund
On Fri, Dec 13, 2024 at 05:41:15PM -0500, Andres Freund wrote:
> Afaics we didn't do anything about this issue?

postgr.es/c/e656657 fixed this.  I thought this was longstanding, but it
turned out to have started on 2024-04-02.

> Hm.  Leaving RBM_ZERO_AND_LOCK aside, is it actually always safe to do
> RestoreBlockImage() into a buffer that currently is pinned?  I'm not sure
> there's actually all that much guarantee about what transient state one can
> read when reading a page concurrently with a memcpy().  I suspect it's
> practically rare to see a problem, but one could imagine a memcpy()
> implementation that uses non-temporal writes, which afaict would leave you
> open to seeing quite random states when reading concurrently, as the cache
> coherence protocol doesn't protect you anymore.

I wondered about that, too.  I didn't dig too deep.
https://pubs.opengroup.org/onlinepubs/9799919799/functions/memcpy.html and
https://en.cppreference.com/w/cpp/string/byte/memcpy were both silent about
the topic.

> On 2024-05-12 10:16:58 -0700, Noah Misch wrote:
> > I suspect the fix is to add a ReadBufferMode specified as, "If the block is
> > already in shared_buffers, do RBM_NORMAL and exclusive-lock the buffer.
> > Otherwise, do RBM_ZERO_AND_LOCK."
>
> I think that should work.  At least in the current code it looks near
> trivial to implement, although the branch differences are going to be
> annoying.
>
> As usual the hardest part would probably be the naming.  Maybe
> RBM_ZERO_ON_MISS_LOCK?  RBM_LOCK_ZERO_ON_MISS?  RBM_DWIM?

It turned out RBM_ZERO_AND_LOCK long worked that way, and
postgr.es/c/e656657 just had to restore that longstanding behavior.  The
existing comment "Don't read from disk, caller will initialize." does allude
to this (but I didn't originally catch the subtle point).

If RBM_ZERO_AND_LOCK hadn't existed so long, I'd rename it.  Perhaps it
deserves a rename anyway?  Of those, I'd pick RBM_ZERO_ON_MISS_LOCK.
I also considered RBM_RECENT_OR_ZERO, borrowing a term from ReadRecentBuffer().
On Sat, Dec 14, 2024 at 11:41 AM Andres Freund <andres@anarazel.de> wrote:
> On 2024-05-12 10:16:58 -0700, Noah Misch wrote:
> > The use of RBM_ZERO_AND_LOCK is incompatible with that.  See a similar
> > argument at https://postgr.es/m/flat/5101.1328219790@sss.pgh.pa.us that led
> > me to the cause.  Adding a 10ms sleep just after RBM_ZERO_AND_LOCK, I got 2
> > failures in 7 runs of 027_stream_regress.pl, at Assert(ItemIdIsNormal(lpp))
> > in heapgettup_pagemode().  In the core file, lpp pointed into an all-zeros
> > page.  RestoreBkpBlocks() had been doing RBM_ZERO years before hot standby
> > existed, but it wasn't a bug until queries could run concurrently.
>
> Yep, this does sound like a problem.

This was a bug of mine that was fixed later in another thread, which Noah
might have mistaken for intended-but-wrong behaviour, because the name is
weird IMHO.

https://www.postgresql.org/message-id/flat/7ed10231-ce47-03d5-d3f9-4aea0dc7d5a4%40gmail.com

> Hm.  Leaving RBM_ZERO_AND_LOCK aside, is it actually always safe to do
> RestoreBlockImage() into a buffer that currently is pinned?  I'm not sure
> there's actually all that much guarantee about what transient state one can
> read when reading a page concurrently with a memcpy().  I suspect it's
> practically rare to see a problem, but one could imagine a memcpy()
> implementation that uses non-temporal writes, which afaict would leave you
> open to seeing quite random states when reading concurrently, as the cache
> coherence protocol doesn't protect you anymore.

I also found that strange, while working on commit e656657f.
I observed and pondered the behaviour you mentioned while testing that:
you're scanning a page holding only a pin, which on a primary would allow
others to add more (invisible to you) tuples that you shouldn't attempt to
access (based on visibility checks performed while you held the lock).  In
recovery the exact same thing is possible, except that if it happens to
include an FPI then it *will* overwrite the tuples that you are scanning...
with the exact same ones-and-zeros.  But I couldn't think of anything
specifically wrong with it, ie anything that might break.  Well, maybe it's
not exactly the same ones-and-zeros in some hint-bit scenario, but if that
is considered acceptable on a primary too (well, maybe you're getting rid of
it), what's the difference?  Oh... that's a plain store, and you're thinking
that memcpy() might be allowed to use some weirder cache-coherency modes for
speed... that sounds likely to break a lot of stuff; is that a real thing?!

> > I suspect the fix is to add a ReadBufferMode specified as, "If the block is
> > already in shared_buffers, do RBM_NORMAL and exclusive-lock the buffer.
> > Otherwise, do RBM_ZERO_AND_LOCK."
>
> I think that should work.  At least in the current code it looks near
> trivial to implement, although the branch differences are going to be
> annoying.
>
> As usual the hardest part would probably be the naming.  Maybe
> RBM_ZERO_ON_MISS_LOCK?  RBM_LOCK_ZERO_ON_MISS?  RBM_DWIM?

That's already how RBM_ZERO_AND_LOCK works, and you are the third victim of
its confusing name, counting me (that's how I broke it in the vectored
stuff) and Noah, IIUC what happened here.
Hi,

On 2024-12-13 16:38:05 -0800, Noah Misch wrote:
> On Fri, Dec 13, 2024 at 05:41:15PM -0500, Andres Freund wrote:
> > Afaics we didn't do anything about this issue?
>
> postgr.es/c/e656657 fixed this.  I thought this was longstanding, but it
> turned out to have started on 2024-04-02.

Ah, that makes sense.  I was a bit surprised to find this thread without any
replies...

> > Hm.  Leaving RBM_ZERO_AND_LOCK aside, is it actually always safe to do
> > RestoreBlockImage() into a buffer that currently is pinned?  I'm not sure
> > there's actually all that much guarantee about what transient state one can
> > read when reading a page concurrently with a memcpy().  I suspect it's
> > practically rare to see a problem, but one could imagine a memcpy()
> > implementation that uses non-temporal writes, which afaict would leave you
> > open to seeing quite random states when reading concurrently, as the cache
> > coherence protocol doesn't protect you anymore.
>
> I wondered about that, too.  I didn't dig too deep.
> https://pubs.opengroup.org/onlinepubs/9799919799/functions/memcpy.html and
> https://en.cppreference.com/w/cpp/string/byte/memcpy were both silent about
> the topic.

Hm.  Perhaps it'd be worth having a small stress test in the test suite
that'd make problems like this more apparent.  Even if it's not a problem
with current libcs, who knows what happens down the line.

> > On 2024-05-12 10:16:58 -0700, Noah Misch wrote:
> > > I suspect the fix is to add a ReadBufferMode specified as, "If the block is
> > > already in shared_buffers, do RBM_NORMAL and exclusive-lock the buffer.
> > > Otherwise, do RBM_ZERO_AND_LOCK."
> >
> > I think that should work.  At least in the current code it looks near
> > trivial to implement, although the branch differences are going to be
> > annoying.
> >
> > As usual the hardest part would probably be the naming.  Maybe
> > RBM_ZERO_ON_MISS_LOCK?  RBM_LOCK_ZERO_ON_MISS?  RBM_DWIM?
>
> It turned out RBM_ZERO_AND_LOCK long worked that way, and
> postgr.es/c/e656657 just had to restore that longstanding behavior.  The
> existing comment "Don't read from disk, caller will initialize." does
> allude to this (but I didn't originally catch the subtle point).
>
> If RBM_ZERO_AND_LOCK hadn't existed so long, I'd rename it.  Perhaps it
> deserves a rename anyway?  Of those, I'd pick RBM_ZERO_ON_MISS_LOCK.  I
> also considered RBM_RECENT_OR_ZERO, borrowing a term from
> ReadRecentBuffer().

At least we could make the documentation for the mode in the enum clearer...

Greetings,

Andres Freund
Hi,

On 2024-12-14 13:41:40 +1300, Thomas Munro wrote:
> On Sat, Dec 14, 2024 at 11:41 AM Andres Freund <andres@anarazel.de> wrote:
> > On 2024-05-12 10:16:58 -0700, Noah Misch wrote:
> > > The use of RBM_ZERO_AND_LOCK is incompatible with that.  See a similar
> > > argument at https://postgr.es/m/flat/5101.1328219790@sss.pgh.pa.us that led
> > > me to the cause.  Adding a 10ms sleep just after RBM_ZERO_AND_LOCK, I got 2
> > > failures in 7 runs of 027_stream_regress.pl, at Assert(ItemIdIsNormal(lpp))
> > > in heapgettup_pagemode().  In the core file, lpp pointed into an all-zeros
> > > page.  RestoreBkpBlocks() had been doing RBM_ZERO years before hot standby
> > > existed, but it wasn't a bug until queries could run concurrently.
> >
> > Yep, this does sound like a problem.
>
> This was a bug of mine that was fixed later in another thread, which Noah
> might have mistaken for intended-but-wrong behaviour, because the name is
> weird IMHO.
>
> https://www.postgresql.org/message-id/flat/7ed10231-ce47-03d5-d3f9-4aea0dc7d5a4%40gmail.com

The name is weird...

> > Hm.  Leaving RBM_ZERO_AND_LOCK aside, is it actually always safe to do
> > RestoreBlockImage() into a buffer that currently is pinned?  I'm not sure
> > there's actually all that much guarantee about what transient state one can
> > read when reading a page concurrently with a memcpy().  I suspect it's
> > practically rare to see a problem, but one could imagine a memcpy()
> > implementation that uses non-temporal writes, which afaict would leave you
> > open to seeing quite random states when reading concurrently, as the cache
> > coherence protocol doesn't protect you anymore.
>
> I also found that strange, while working on commit e656657f.
> I observed and pondered the behaviour you mentioned while testing that:
> you're scanning a page holding only a pin, which on a primary would allow
> others to add more (invisible to you) tuples that you shouldn't attempt to
> access (based on visibility checks performed while you held the lock).  In
> recovery the exact same thing is possible, except that if it happens to
> include an FPI then it *will* overwrite the tuples that you are
> scanning... with the exact same ones-and-zeros.  But I couldn't think of
> anything specifically wrong with it, ie anything that might break.  Well,
> maybe it's not exactly the same ones-and-zeros in some hint-bit scenario,
> but if that is considered acceptable on a primary too (well, maybe you're
> getting rid of it), what's the difference?

FWIW, much more than hint bits can change outside of recovery - you're not
holding a lock on the tuple, so xmin/xmax and non-hint fields in infomask*
can change due to update/delete.

> Oh... that's a plain store, and you're thinking that memcpy() might be
> allowed to use some weirder cache-coherency modes for speed... that sounds
> likely to break a lot of stuff; is that a real thing?!

It would have to synchronize at the *end* of the memcpy.  But not
necessarily before.

I think our pages are probably too small to make it likely that a memcpy
with such a path would use it.  I'd assume the logic would be something
along the lines of "if the to-be-copied size is bigger than L3*2, use
non-temporal stores to read/write the data, as we won't have any cache
hits".

> > > I suspect the fix is to add a ReadBufferMode specified as, "If the block is
> > > already in shared_buffers, do RBM_NORMAL and exclusive-lock the buffer.
> > > Otherwise, do RBM_ZERO_AND_LOCK."
> >
> > I think that should work.  At least in the current code it looks near
> > trivial to implement, although the branch differences are going to be
> > annoying.
> > As usual the hardest part would probably be the naming.
> > Maybe
> > RBM_ZERO_ON_MISS_LOCK?  RBM_LOCK_ZERO_ON_MISS?  RBM_DWIM?
>
> That's already how RBM_ZERO_AND_LOCK works, and you are the third victim
> of its confusing name, counting me (that's how I broke it in the vectored
> stuff) and Noah, IIUC what happened here.

I think I knew at some point that it didn't work that way, but then got
swept up with worry upon reading Noah's thread, while already three layers
deep in something else :)

Greetings,

Andres Freund
On Fri, Dec 13, 2024 at 09:34:21PM -0500, Andres Freund wrote:
> On 2024-12-13 16:38:05 -0800, Noah Misch wrote:
> > On Fri, Dec 13, 2024 at 05:41:15PM -0500, Andres Freund wrote:
> > > Hm.  Leaving RBM_ZERO_AND_LOCK aside, is it actually always safe to do
> > > RestoreBlockImage() into a buffer that currently is pinned?  I'm not sure
> > > there's actually all that much guarantee about what transient state one
> > > can read when reading a page concurrently with a memcpy().  I suspect
> > > it's practically rare to see a problem, but one could imagine a memcpy()
> > > implementation that uses non-temporal writes, which afaict would leave
> > > you open to seeing quite random states when reading concurrently, as the
> > > cache coherence protocol doesn't protect you anymore.
> >
> > I wondered about that, too.  I didn't dig too deep.
> > https://pubs.opengroup.org/onlinepubs/9799919799/functions/memcpy.html and
> > https://en.cppreference.com/w/cpp/string/byte/memcpy were both silent
> > about the topic.
>
> Hm.  Perhaps it'd be worth having a small stress test in the test suite
> that'd make problems like this more apparent.  Even if it's not a problem
> with current libcs, who knows what happens down the line.

Agreed.

> > > On 2024-05-12 10:16:58 -0700, Noah Misch wrote:
> > > > I suspect the fix is to add a ReadBufferMode specified as, "If the
> > > > block is already in shared_buffers, do RBM_NORMAL and exclusive-lock
> > > > the buffer.  Otherwise, do RBM_ZERO_AND_LOCK."
> > >
> > > I think that should work.  At least in the current code it looks near
> > > trivial to implement, although the branch differences are going to be
> > > annoying.
> > >
> > > As usual the hardest part would probably be the naming.  Maybe
> > > RBM_ZERO_ON_MISS_LOCK?  RBM_LOCK_ZERO_ON_MISS?  RBM_DWIM?
> >
> > It turned out RBM_ZERO_AND_LOCK long worked that way, and
> > postgr.es/c/e656657 just had to restore that longstanding behavior.  The
> > existing comment "Don't read from disk, caller will initialize."
> > does allude to this (but I didn't originally catch the subtle point).
> >
> > If RBM_ZERO_AND_LOCK hadn't existed so long, I'd rename it.  Perhaps it
> > deserves a rename anyway?  Of those, I'd pick RBM_ZERO_ON_MISS_LOCK.  I
> > also considered RBM_RECENT_OR_ZERO, borrowing a term from
> > ReadRecentBuffer().
>
> At least we could make the documentation for the mode in the enum
> clearer...

Agreed.  Maybe "No I/O; get existing buffer, else zero-fill; caller
initializes."