Re: logical decoding and replication of sequences - Mailing list pgsql-hackers

From Noah Misch
Subject Re: logical decoding and replication of sequences
Date
Msg-id 20220807191205.GA16699@rfd.leadboat.com
In response to Re: logical decoding and replication of sequences  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List pgsql-hackers
On Sun, Aug 07, 2022 at 03:18:52PM +0200, Tomas Vondra wrote:
> On 8/7/22 02:36, Noah Misch wrote:
> > On Thu, Apr 07, 2022 at 08:34:50PM +0200, Tomas Vondra wrote:
> >> I've pushed a revert of all the commits related to this - decoding of
> >> sequences and test_decoding / built-in replication changes.
> > 
> > Two July buildfarm runs failed with PANIC during standby promotion:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2022-07-19%2004%3A13%3A18
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2022-07-31%2011%3A33%3A13
> > 
> > The attached patch hacks things so an ordinary x86_64 GNU/Linux machine
> > reproduces this consistently.  "git bisect" then traced the regression to the
> > above revert commit (2c7ea57e56ca5f668c32d4266e0a3e45b455bef5).  The pg_ctl
> > test suite passes under this hack in all supported branches, and it passed on
> > v15 until that revert.  Would you investigate?
> > 
> > The buildfarm animal uses keep_error_builds.  From kept data directories, I
> > deduced these events:
> > 
> > - After the base backup, auto-analyze ran on the primary and wrote WAL.
> > - Standby streamed and wrote up to 0/301FFF.
> > - Standby received the promote signal.  Terminated streaming.  WAL page at 0/302000 remained all-zeros.
> > - Somehow, end-of-recovery became a PANIC.
> 
> I think it'd be really bizarre if this were due to the revert, as that
> simply undoes minor WAL changes (and none of this should affect what
> happens at a WAL page boundary, etc.). It just restores WAL as it was
> before 0da92dc, nothing particularly complicated. I did go through all
> of the changes again and I haven't spotted anything particularly
> suspicious, but I'll give it another try tomorrow.
> 
> However, I did try bisecting this using the attached patch, and that
> does not suggest the issue is in the revert commit. It actually fails
> all the way back to 5dc0418fab2, and it starts working on 9553b4115f1.
> 
>     ...
>     6392f2a0968 Try to silence "-Wmissing-braces" complaints in ...
>  => 5dc0418fab2 Prefetch data referenced by the WAL, take II.
>     9553b4115f1 Fix warning introduced in 5c279a6d350.
>     ...
> 
> This is merely 10 commits before the revert, and it seems way more
> related to WAL. Also, adding this to the two nodes in 003_standby.pl
> makes the issue go away, it seems:
> 
>   $node_standby->append_conf('postgresql.conf',
>       qq(recovery_prefetch = off));
> 
> I'd bet it's about WAL prefetching, not the revert, and the bisect was a
> bit incorrect, because the commits are close and the failures happen to
> be rare. (Presumably you first did the bisect and then wrote the patch
> that reproduces this, right?)

No.  I wrote the patch, then used the patch to drive the bisect.  With ten
iterations, commit 2c7ea57 passes 0/10, while 2c7ea57^ passes 10/10.  I've now
tried recovery_prefetch=off.  With that, the test passes 10/10 at 2c7ea57.
Given your observation of a failure at 5dc0418fab2, I agree with your
conclusion.  Whatever the role of 2c7ea57 in exposing the failure on my
machine, a root cause in WAL prefetching looks more likely.
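
For the archives, I read your suggestion as the following change applied to
both nodes in 003_standby.pl; a minimal sketch only, and I'm assuming the
other node's variable name is $node_primary (adjust if the test uses
something else):

  # Disable WAL prefetching on both nodes.  Only the standby actually
  # performs recovery, but this mirrors the "two nodes" suggestion above.
  # ($node_primary is an assumed variable name, not verified against the test.)
  $node_primary->append_conf('postgresql.conf',
      qq(recovery_prefetch = off));
  $node_standby->append_conf('postgresql.conf',
      qq(recovery_prefetch = off));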

> Adding Thomas Munro to the thread; he's the WAL prefetching expert ;-)



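
P.S. In case anyone wants to repeat the pass/fail counts, the loop amounts to
something like the sketch below.  This is illustrative only (it assumes a
configured in-tree build and uses make's exit status as the pass criterion),
not the exact script I ran:

  # Run the pg_ctl TAP suite ten times and report how many runs pass.
  my $passes = 0;
  for my $i (1 .. 10) {
      $passes++ if system('make', '-C', 'src/bin/pg_ctl', 'check') == 0;
  }
  print "passed $passes/10 iterations\n";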