Re: Hot standby queries see transient all-zeros pages - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: Hot standby queries see transient all-zeros pages |
Date | |
Msg-id | mhjyuguex77h5env7vtgdwb5pnaqikyx5xgwjthcmzuazlowgn@rcjvqvjizk45 Whole thread Raw |
In response to | Re: Hot standby queries see transient all-zeros pages (Thomas Munro <thomas.munro@gmail.com>) |
List | pgsql-hackers |
Hi, On 2024-12-14 13:41:40 +1300, Thomas Munro wrote: > On Sat, Dec 14, 2024 at 11:41 AM Andres Freund <andres@anarazel.de> wrote: > > On 2024-05-12 10:16:58 -0700, Noah Misch wrote: > > > The use of RBM_ZERO_AND_LOCK is incompatible with that. See a similar > > > argument at https://postgr.es/m/flat/5101.1328219790@sss.pgh.pa.us that led > > > me to the cause. Adding a 10ms sleep just after RBM_ZERO_AND_LOCK, I got 2 > > > failures in 7 runs of 027_stream_regress.pl, at Assert(ItemIdIsNormal(lpp)) > > > in heapgettup_pagemode(). In the core file, lpp pointed into an all-zeros > > > page. RestoreBkpBlocks() had been doing RBM_ZERO years before hot standby > > > existed, but it wasn't a bug until queries could run concurrently. > > > > Yep, this does sound like a problem. > > This was a bug of mine that was fixed later in another thread, which > Noah might have mistaken for intended-but-wrong behaviour, because the > name is weird IMHO. > > https://www.postgresql.org/message-id/flat/7ed10231-ce47-03d5-d3f9-4aea0dc7d5a4%40gmail.com The name is weird... > > Hm. Leaving RBM_ZERO_AND_LOCK aside, is it actually always safe to do > > RestoreBlockImage() into a buffer that currently is pinned? Not sure if > > there's actually all that much guarantee what transient state one can read > > when reading a page concurrently to a memcpy(). I suspect it's practically > > rare to see a problem, but one could imagine an memcpy implementation that > > uses non-temporal writes, which afaict would leave you open to seeing quite > > random states when reading concurrently, as the cache coherence protocol > > doesn't protect anymore. > > I also found that strange, while working on commit e656657f. I > observed and pondered the behaviour you mentioned while testing that: > you're scanning a page holding only a pin, which on a primary would > allow others to add more (invisible to you) tuples that you shouldn't > attempt to access (based on visibility checks performed while you held > the lock), and in recovery the exact same thing is possible, except > that if it happens to include a FPI then it *will* overwrite the > tuples that you are scanning... with the exact same ones-and-zeros. > But I couldn't think of anything specifically wrong with it, ie that > might break. Well, maybe it's not exactly the same ones-and-zeros in > some hint-bit scenario, but if that is considered acceptable on a > primary too (well maybe you're getting rid of it), but what's the > difference? FWIW, much more than hint bits can change outside of recovery - you're not holding a lock on the tuple, so xmin/xmax and non-hint fields in infomask* can change due to update/delete. > Oh... that's a plain store and you're thinking that memcpy() might be > allowed to use some weirder cache coherency modes for speed, ... sounds > likely to break a lot of stuff, is that a real thing?! It would have to synchronize at the *end* of the memcpy. But not necessarily before. I think our pages are probably too small to make it likely that a memcpy with such a path would use it. I'd assume the logic would be something along the lines of "If the to-be-copied size is bigger than L3*2, use non-temporal stores to read/write the data, as we won't have any cache hits". > > > I suspect the fix is to add a ReadBufferMode specified as, "If the block is > > > already in shared_buffers, do RBM_NORMAL and exclusive-lock the buffer. > > > Otherwise, do RBM_ZERO_AND_LOCK." > > > > I think that should work. At least in the current code it looks near trivial > > to implement, although the branch differences are going to be annoying. > > As usual the hardest part would probably be the naming. Maybe > > RBM_ZERO_ON_MISS_LOCK? RBM_LOCK_ZERO_ON_MISS? RBM_DWIM? > > That's already how RBM_ZERO_AND_LOCK works, and you are the third > victim of its confusing name counting me (that's how I broke it in the > vectored stuff) and Noah IIUC what happened here. I think I knew at some point that it didn't work that way, but then got swept up with worry upon reading Noah's thread, while already three layers deep in something else :) Greetings, Andres Freund
pgsql-hackers by date: