Thread: Hot standby queries see transient all-zeros pages
XLogReadBufferForRedoExtended() precedes RestoreBlockImage() with
RBM_ZERO_AND_LOCK.  Per src/backend/storage/buffer/README:

  Once one has determined that a tuple is interesting (visible to the current
  transaction) one may drop the content lock, yet continue to access the
  tuple's data for as long as one holds the buffer pin.

The use of RBM_ZERO_AND_LOCK is incompatible with that.  See a similar
argument at https://postgr.es/m/flat/5101.1328219790@sss.pgh.pa.us that led
me to the cause.  Adding a 10ms sleep just after RBM_ZERO_AND_LOCK, I got 2
failures in 7 runs of 027_stream_regress.pl, at Assert(ItemIdIsNormal(lpp))
in heapgettup_pagemode().  In the core file, lpp pointed into an all-zeros
page.  RestoreBkpBlocks() had been doing RBM_ZERO years before hot standby
existed, but it wasn't a bug until queries could run concurrently.

I suspect the fix is to add a ReadBufferMode specified as, "If the block is
already in shared_buffers, do RBM_NORMAL and exclusive-lock the buffer.
Otherwise, do RBM_ZERO_AND_LOCK."  That avoids RBM_NORMAL for a block past
the current end of the file.  Like RBM_ZERO_AND_LOCK, it avoids wasting disk
reads on data we discard.  Are there other strategies to consider?

I got here from a Windows CI failure,
https://cirrus-ci.com/task/6247605141766144.  That involved patched code,
but adding the sleep suffices on Linux, with today's git master:

--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -388,6 +388,8 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 		*buf = XLogReadBufferExtended(rlocator, forknum, blkno,
 									  get_cleanup_lock ? RBM_ZERO_AND_CLEANUP_LOCK : RBM_ZERO_AND_LOCK,
 									  prefetch_buffer);
+		if (!get_cleanup_lock)
+			pg_usleep(10 * 1000);
 		page = BufferGetPage(*buf);
 		if (!RestoreBlockImage(record, block_id, page))
 			ereport(ERROR,
Hi,

Just found this thread because I was looking for discussions of some
behaviour of XLogReadBufferExtended()...

Afaics we didn't do anything about this issue?

I just tried to repro this, without success so far.  But it very well might
be sufficiently timing dependent that it just happens not to trigger on my
machine.  I'm actually not sure what would trigger a problematic pattern in
027_stream_regress.pl; there's not a whole lot of reading while replaying
records.

On 2024-05-12 10:16:58 -0700, Noah Misch wrote:
> XLogReadBufferForRedoExtended() precedes RestoreBlockImage() with
> RBM_ZERO_AND_LOCK.  Per src/backend/storage/buffer/README:
>
>   Once one has determined that a tuple is interesting (visible to the current
>   transaction) one may drop the content lock, yet continue to access the
>   tuple's data for as long as one holds the buffer pin.
>
> The use of RBM_ZERO_AND_LOCK is incompatible with that.  See a similar
> argument at https://postgr.es/m/flat/5101.1328219790@sss.pgh.pa.us that led
> me to the cause.  Adding a 10ms sleep just after RBM_ZERO_AND_LOCK, I got 2
> failures in 7 runs of 027_stream_regress.pl, at Assert(ItemIdIsNormal(lpp))
> in heapgettup_pagemode().  In the core file, lpp pointed into an all-zeros
> page.  RestoreBkpBlocks() had been doing RBM_ZERO years before hot standby
> existed, but it wasn't a bug until queries could run concurrently.

Yep, this does sound like a problem.

I looked around to see if other uses of RBM_ZERO_AND_LOCK are problematic,
but so far I didn't find anything.  Most paths that lead to RBM_ZERO_AND_LOCK
being used should be safe, due to those pages not having valid contents
beforehand when in a consistent state.  E.g. the RBM_ZERO_AND_LOCK calls that
originate from XLogInitBufferForRedo should only ever target pages that have
been empty (when consistent) and thus shouldn't have anybody pointing into an
unlocked buffer.

Hm.
Leaving RBM_ZERO_AND_LOCK aside, is it actually always safe to do
RestoreBlockImage() into a buffer that currently is pinned?  I'm not sure
there's actually all that much guarantee about what transient state one can
read when reading a page concurrently with a memcpy().  I suspect it's
practically rare to see a problem, but one could imagine a memcpy()
implementation that uses non-temporal writes, which afaict would leave you
open to seeing quite random states when reading concurrently, as the cache
coherence protocol doesn't protect you anymore.

> I suspect the fix is to add a ReadBufferMode specified as, "If the block is
> already in shared_buffers, do RBM_NORMAL and exclusive-lock the buffer.
> Otherwise, do RBM_ZERO_AND_LOCK."

I think that should work.  At least in the current code it looks near
trivial to implement, although the branch differences are going to be
annoying.

As usual the hardest part would probably be the naming.  Maybe
RBM_ZERO_ON_MISS_LOCK?  RBM_LOCK_ZERO_ON_MISS?  RBM_DWIM?

> Like RBM_ZERO_AND_LOCK, it avoids wasting disk reads
> on data we discard.  Are there other strategies to consider?

I guess we could just use RBM_ZERO_AND_CLEANUP_LOCK for this path.  But that
seems like it has too many obvious downsides.

Greetings,

Andres Freund
On Fri, Dec 13, 2024 at 05:41:15PM -0500, Andres Freund wrote:
> Afaics we didn't do anything about this issue?

postgr.es/c/e656657 fixed this.  I thought this was longstanding, but it
turned out to have started on 2024-04-02.

> Hm.  Leaving RBM_ZERO_AND_LOCK aside, is it actually always safe to do
> RestoreBlockImage() into a buffer that currently is pinned?  I'm not sure
> there's actually all that much guarantee about what transient state one can
> read when reading a page concurrently with a memcpy().  I suspect it's
> practically rare to see a problem, but one could imagine a memcpy()
> implementation that uses non-temporal writes, which afaict would leave you
> open to seeing quite random states when reading concurrently, as the cache
> coherence protocol doesn't protect you anymore.

I wondered about that, too.  I didn't dig too deep.
https://pubs.opengroup.org/onlinepubs/9799919799/functions/memcpy.html and
https://en.cppreference.com/w/cpp/string/byte/memcpy were both silent about
the topic.

> On 2024-05-12 10:16:58 -0700, Noah Misch wrote:
> > I suspect the fix is to add a ReadBufferMode specified as, "If the block is
> > already in shared_buffers, do RBM_NORMAL and exclusive-lock the buffer.
> > Otherwise, do RBM_ZERO_AND_LOCK."
>
> I think that should work.  At least in the current code it looks near
> trivial to implement, although the branch differences are going to be
> annoying.
>
> As usual the hardest part would probably be the naming.  Maybe
> RBM_ZERO_ON_MISS_LOCK?  RBM_LOCK_ZERO_ON_MISS?  RBM_DWIM?

It turned out RBM_ZERO_AND_LOCK long worked that way, and
postgr.es/c/e656657 just had to restore that longstanding behavior.  The
existing comment "Don't read from disk, caller will initialize." does allude
to this (but I didn't originally catch the subtle point).

If RBM_ZERO_AND_LOCK hadn't existed so long, I'd rename it.  Perhaps it
deserves a rename anyway?  Of those, I'd pick RBM_ZERO_ON_MISS_LOCK.
I also considered RBM_RECENT_OR_ZERO, borrowing a term from ReadRecentBuffer().
On Sat, Dec 14, 2024 at 11:41 AM Andres Freund <andres@anarazel.de> wrote:
> On 2024-05-12 10:16:58 -0700, Noah Misch wrote:
> > The use of RBM_ZERO_AND_LOCK is incompatible with that.  See a similar
> > argument at https://postgr.es/m/flat/5101.1328219790@sss.pgh.pa.us that led
> > me to the cause.  Adding a 10ms sleep just after RBM_ZERO_AND_LOCK, I got 2
> > failures in 7 runs of 027_stream_regress.pl, at Assert(ItemIdIsNormal(lpp))
> > in heapgettup_pagemode().  In the core file, lpp pointed into an all-zeros
> > page.  RestoreBkpBlocks() had been doing RBM_ZERO years before hot standby
> > existed, but it wasn't a bug until queries could run concurrently.
>
> Yep, this does sound like a problem.

This was a bug of mine that was fixed later in another thread, which Noah
might have mistaken for intended-but-wrong behaviour, because the name is
weird IMHO.

https://www.postgresql.org/message-id/flat/7ed10231-ce47-03d5-d3f9-4aea0dc7d5a4%40gmail.com

> Hm.  Leaving RBM_ZERO_AND_LOCK aside, is it actually always safe to do
> RestoreBlockImage() into a buffer that currently is pinned?  I'm not sure
> there's actually all that much guarantee about what transient state one can
> read when reading a page concurrently with a memcpy().  I suspect it's
> practically rare to see a problem, but one could imagine a memcpy()
> implementation that uses non-temporal writes, which afaict would leave you
> open to seeing quite random states when reading concurrently, as the cache
> coherence protocol doesn't protect you anymore.

I also found that strange, while working on commit e656657f.
I observed and pondered the behaviour you mentioned while testing that:
you're scanning a page holding only a pin, which on a primary would allow
others to add more (invisible to you) tuples that you shouldn't attempt to
access (based on visibility checks performed while you held the lock).  In
recovery the exact same thing is possible, except that if it happens to
include an FPI then it *will* overwrite the tuples that you are scanning...
with the exact same ones-and-zeros.  But I couldn't think of anything
specifically wrong with it, ie anything that might break.  Well, maybe it's
not exactly the same ones-and-zeros in some hint-bit scenario, but if that
is considered acceptable on a primary too (well, maybe you're getting rid of
it), what's the difference?  Oh... that's a plain store, and you're thinking
that memcpy() might be allowed to use some weirder cache-coherency modes for
speed... that sounds likely to break a lot of stuff; is that a real thing?!

> > I suspect the fix is to add a ReadBufferMode specified as, "If the block is
> > already in shared_buffers, do RBM_NORMAL and exclusive-lock the buffer.
> > Otherwise, do RBM_ZERO_AND_LOCK."
>
> I think that should work.  At least in the current code it looks near
> trivial to implement, although the branch differences are going to be
> annoying.
>
> As usual the hardest part would probably be the naming.  Maybe
> RBM_ZERO_ON_MISS_LOCK?  RBM_LOCK_ZERO_ON_MISS?  RBM_DWIM?

That's already how RBM_ZERO_AND_LOCK works, and you are the third victim of
its confusing name, counting me (that's how I broke it in the vectored
stuff) and Noah, IIUC what happened here.
Hi,

On 2024-12-13 16:38:05 -0800, Noah Misch wrote:
> On Fri, Dec 13, 2024 at 05:41:15PM -0500, Andres Freund wrote:
> > Afaics we didn't do anything about this issue?
>
> postgr.es/c/e656657 fixed this.  I thought this was longstanding, but it
> turned out to have started on 2024-04-02.

Ah, that makes sense.  I was a bit surprised to find this thread without any
replies...

> > Hm.  Leaving RBM_ZERO_AND_LOCK aside, is it actually always safe to do
> > RestoreBlockImage() into a buffer that currently is pinned?  I'm not sure
> > there's actually all that much guarantee about what transient state one can
> > read when reading a page concurrently with a memcpy().  I suspect it's
> > practically rare to see a problem, but one could imagine a memcpy()
> > implementation that uses non-temporal writes, which afaict would leave you
> > open to seeing quite random states when reading concurrently, as the cache
> > coherence protocol doesn't protect you anymore.
>
> I wondered about that, too.  I didn't dig too deep.
> https://pubs.opengroup.org/onlinepubs/9799919799/functions/memcpy.html and
> https://en.cppreference.com/w/cpp/string/byte/memcpy were both silent about
> the topic.

Hm.  Perhaps it'd be worth having a small stress test in the test suite
that'd make problems like this more apparent.  Even if it's not a problem
with current libcs, who knows what happens down the line.

> > On 2024-05-12 10:16:58 -0700, Noah Misch wrote:
> > > I suspect the fix is to add a ReadBufferMode specified as, "If the block is
> > > already in shared_buffers, do RBM_NORMAL and exclusive-lock the buffer.
> > > Otherwise, do RBM_ZERO_AND_LOCK."
> >
> > I think that should work.  At least in the current code it looks near
> > trivial to implement, although the branch differences are going to be
> > annoying.
> >
> > As usual the hardest part would probably be the naming.  Maybe
> > RBM_ZERO_ON_MISS_LOCK?  RBM_LOCK_ZERO_ON_MISS?  RBM_DWIM?
>
> It turned out RBM_ZERO_AND_LOCK long worked that way, and
> postgr.es/c/e656657 just had to restore that longstanding behavior.  The
> existing comment "Don't read from disk, caller will initialize." does
> allude to this (but I didn't originally catch the subtle point).
>
> If RBM_ZERO_AND_LOCK hadn't existed so long, I'd rename it.  Perhaps it
> deserves a rename anyway?  Of those, I'd pick RBM_ZERO_ON_MISS_LOCK.  I
> also considered RBM_RECENT_OR_ZERO, borrowing a term from
> ReadRecentBuffer().

At least we could make the documentation for the mode in the enum clearer...

Greetings,

Andres Freund
Hi,

On 2024-12-14 13:41:40 +1300, Thomas Munro wrote:
> On Sat, Dec 14, 2024 at 11:41 AM Andres Freund <andres@anarazel.de> wrote:
> > On 2024-05-12 10:16:58 -0700, Noah Misch wrote:
> > > The use of RBM_ZERO_AND_LOCK is incompatible with that.  See a similar
> > > argument at https://postgr.es/m/flat/5101.1328219790@sss.pgh.pa.us that led
> > > me to the cause.  Adding a 10ms sleep just after RBM_ZERO_AND_LOCK, I got 2
> > > failures in 7 runs of 027_stream_regress.pl, at Assert(ItemIdIsNormal(lpp))
> > > in heapgettup_pagemode().  In the core file, lpp pointed into an all-zeros
> > > page.  RestoreBkpBlocks() had been doing RBM_ZERO years before hot standby
> > > existed, but it wasn't a bug until queries could run concurrently.
> >
> > Yep, this does sound like a problem.
>
> This was a bug of mine that was fixed later in another thread, which Noah
> might have mistaken for intended-but-wrong behaviour, because the name is
> weird IMHO.
>
> https://www.postgresql.org/message-id/flat/7ed10231-ce47-03d5-d3f9-4aea0dc7d5a4%40gmail.com

The name is weird...

> > Hm.  Leaving RBM_ZERO_AND_LOCK aside, is it actually always safe to do
> > RestoreBlockImage() into a buffer that currently is pinned?  I'm not sure
> > there's actually all that much guarantee about what transient state one can
> > read when reading a page concurrently with a memcpy().  I suspect it's
> > practically rare to see a problem, but one could imagine a memcpy()
> > implementation that uses non-temporal writes, which afaict would leave you
> > open to seeing quite random states when reading concurrently, as the cache
> > coherence protocol doesn't protect you anymore.
>
> I also found that strange, while working on commit e656657f.
> I observed and pondered the behaviour you mentioned while testing that:
> you're scanning a page holding only a pin, which on a primary would allow
> others to add more (invisible to you) tuples that you shouldn't attempt to
> access (based on visibility checks performed while you held the lock).  In
> recovery the exact same thing is possible, except that if it happens to
> include an FPI then it *will* overwrite the tuples that you are
> scanning... with the exact same ones-and-zeros.  But I couldn't think of
> anything specifically wrong with it, ie anything that might break.  Well,
> maybe it's not exactly the same ones-and-zeros in some hint-bit scenario,
> but if that is considered acceptable on a primary too (well, maybe you're
> getting rid of it), what's the difference?

FWIW, much more than hint bits can change outside of recovery - you're not
holding a lock on the tuple, so xmin/xmax and non-hint fields in infomask*
can change due to update/delete.

> Oh... that's a plain store, and you're thinking that memcpy() might be
> allowed to use some weirder cache-coherency modes for speed... that sounds
> likely to break a lot of stuff; is that a real thing?!

It would have to synchronize at the *end* of the memcpy.  But not
necessarily before.

I think our pages are probably too small to make it likely that a memcpy
with such a path would use it.  I'd assume the logic would be something
along the lines of "if the to-be-copied size is bigger than L3*2, use
non-temporal stores to read/write the data, as we won't have any cache
hits".

> > > I suspect the fix is to add a ReadBufferMode specified as, "If the block is
> > > already in shared_buffers, do RBM_NORMAL and exclusive-lock the buffer.
> > > Otherwise, do RBM_ZERO_AND_LOCK."
> >
> > I think that should work.  At least in the current code it looks near
> > trivial to implement, although the branch differences are going to be
> > annoying.
> > As usual the hardest part would probably be the naming.
> > Maybe
> > RBM_ZERO_ON_MISS_LOCK?  RBM_LOCK_ZERO_ON_MISS?  RBM_DWIM?
>
> That's already how RBM_ZERO_AND_LOCK works, and you are the third victim
> of its confusing name, counting me (that's how I broke it in the vectored
> stuff) and Noah, IIUC what happened here.

I think I knew at some point that it didn't work that way, but then got
swept up with worry upon reading Noah's thread, while already three layers
deep in something else :)

Greetings,

Andres Freund
On Fri, Dec 13, 2024 at 09:34:21PM -0500, Andres Freund wrote:
> On 2024-12-13 16:38:05 -0800, Noah Misch wrote:
> > On Fri, Dec 13, 2024 at 05:41:15PM -0500, Andres Freund wrote:
> > > Hm.  Leaving RBM_ZERO_AND_LOCK aside, is it actually always safe to do
> > > RestoreBlockImage() into a buffer that currently is pinned?  I'm not sure
> > > there's actually all that much guarantee about what transient state one
> > > can read when reading a page concurrently with a memcpy().  I suspect
> > > it's practically rare to see a problem, but one could imagine a memcpy()
> > > implementation that uses non-temporal writes, which afaict would leave
> > > you open to seeing quite random states when reading concurrently, as the
> > > cache coherence protocol doesn't protect you anymore.
> >
> > I wondered about that, too.  I didn't dig too deep.
> > https://pubs.opengroup.org/onlinepubs/9799919799/functions/memcpy.html and
> > https://en.cppreference.com/w/cpp/string/byte/memcpy were both silent
> > about the topic.
>
> Hm.  Perhaps it'd be worth having a small stress test in the test suite
> that'd make problems like this more apparent.  Even if it's not a problem
> with current libcs, who knows what happens down the line.

Agreed.

> > > On 2024-05-12 10:16:58 -0700, Noah Misch wrote:
> > > > I suspect the fix is to add a ReadBufferMode specified as, "If the
> > > > block is already in shared_buffers, do RBM_NORMAL and exclusive-lock
> > > > the buffer.  Otherwise, do RBM_ZERO_AND_LOCK."
> > >
> > > I think that should work.  At least in the current code it looks near
> > > trivial to implement, although the branch differences are going to be
> > > annoying.
> > >
> > > As usual the hardest part would probably be the naming.  Maybe
> > > RBM_ZERO_ON_MISS_LOCK?  RBM_LOCK_ZERO_ON_MISS?  RBM_DWIM?
> >
> > It turned out RBM_ZERO_AND_LOCK long worked that way, and
> > postgr.es/c/e656657 just had to restore that longstanding behavior.  The
> > existing comment "Don't read from disk, caller will initialize."
> > does allude to this (but I didn't originally catch the subtle point).
> >
> > If RBM_ZERO_AND_LOCK hadn't existed so long, I'd rename it.  Perhaps it
> > deserves a rename anyway?  Of those, I'd pick RBM_ZERO_ON_MISS_LOCK.  I
> > also considered RBM_RECENT_OR_ZERO, borrowing a term from
> > ReadRecentBuffer().
>
> At least we could make the documentation for the mode in the enum
> clearer...

Agreed.  Maybe "No I/O; get existing buffer, else zero-fill; caller
initializes."