Re: Hot standby queries see transient all-zeros pages - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Hot standby queries see transient all-zeros pages
Date
Msg-id mhjyuguex77h5env7vtgdwb5pnaqikyx5xgwjthcmzuazlowgn@rcjvqvjizk45
Whole thread Raw
In response to Re: Hot standby queries see transient all-zeros pages  (Thomas Munro <thomas.munro@gmail.com>)
List pgsql-hackers
Hi,

On 2024-12-14 13:41:40 +1300, Thomas Munro wrote:
> On Sat, Dec 14, 2024 at 11:41 AM Andres Freund <andres@anarazel.de> wrote:
> > On 2024-05-12 10:16:58 -0700, Noah Misch wrote:
> > > The use of RBM_ZERO_AND_LOCK is incompatible with that.  See a similar
> > > argument at https://postgr.es/m/flat/5101.1328219790@sss.pgh.pa.us that led
> > > me to the cause.  Adding a 10ms sleep just after RBM_ZERO_AND_LOCK, I got 2
> > > failures in 7 runs of 027_stream_regress.pl, at Assert(ItemIdIsNormal(lpp))
> > > in heapgettup_pagemode().  In the core file, lpp pointed into an all-zeros
> > > page.  RestoreBkpBlocks() had been doing RBM_ZERO years before hot standby
> > > existed, but it wasn't a bug until queries could run concurrently.
> >
> > Yep, this does sound like a problem.
> 
> This was a bug of mine that was fixed later in another thread, which
> Noah might have mistaken for intended-but-wrong behaviour, because the
> name is weird IMHO.
> 
> https://www.postgresql.org/message-id/flat/7ed10231-ce47-03d5-d3f9-4aea0dc7d5a4%40gmail.com

The name is weird...


> > Hm. Leaving RBM_ZERO_AND_LOCK aside, is it actually always safe to do
> > RestoreBlockImage() into a buffer that currently is pinned? Not sure if
> > there's actually all that much guarantee what transient state one can read
> > when reading a page concurrently to a memcpy(). I suspect it's practically
> > rare to see a problem, but one could imagine an memcpy implementation that
> > uses non-temporal writes, which afaict would leave you open to seeing quite
> > random states when reading concurrently, as the cache coherence protocol
> > doesn't protect anymore.
> 
> I also found that strange, while working on commit e656657f.  I
> observed and pondered the behaviour you mentioned while testing that:
> you're scanning a page holding only a pin, which on a primary would
> allow others to add more (invisible to you) tuples that you shouldn't
> attempt to access (based on visibility checks performed while you held
> the lock), and in recovery the exact same thing is possible, except
> that if it happens to include a FPI then it *will* overwrite the
> tuples that you are scanning... with the exact same ones-and-zeros.
> But I couldn't think of anything specifically wrong with it, ie that
> might break.  Well, maybe it's not exactly the same ones-and-zeros in
> some hint-bit scenario, but if that is considered acceptable on a
> primary too (well maybe you're getting rid of it), but what's the
> difference?

FWIW, much more than hint bits can change outside of recovery - you're not
holding a lock on the tuple, so xmin/xmax and non-hint fields in infomask* can
change due to update/delete.


> Oh... that's a plain store and you're thinking that memcpy() might be
> allowed to use some weirder cache coherency modes for speed, ... sounds
> likely to break a lot of stuff, is that a real thing?!

It would have to synchronize at the *end* of the memcpy. But not necessarily
before. I think our pages are probably too small to make it likely that a
memcpy with such a path would use it. I'd assume the logic would be something
along the lines of "If the to-be-copied size is bigger than L3*2, use
non-temporal stores to read/write the data, as we won't have any cache hits".


> > > I suspect the fix is to add a ReadBufferMode specified as, "If the block is
> > > already in shared_buffers, do RBM_NORMAL and exclusive-lock the buffer.
> > > Otherwise, do RBM_ZERO_AND_LOCK."
> >
> > I think that should work. At least in the current code it looks near trivial
> > to implement, although the branch differences are going to be annoying.
> > As usual the hardest part would probably be the naming. Maybe
> > RBM_ZERO_ON_MISS_LOCK? RBM_LOCK_ZERO_ON_MISS? RBM_DWIM?
> 
> That's already how RBM_ZERO_AND_LOCK works, and you are the third
> victim of its confusing name counting me (that's how I broke it in the
> vectored stuff) and Noah IIUC what happened here.

I think I knew at some point that it didn't work that way, but then got swept
up with worry upon reading Noah's thread, while already three layers deep in
something else :)

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: Hot standby queries see transient all-zeros pages
Next
From: John Naylor
Date:
Subject: Re: typo in a comment of restrictinfo.c