On 8/7/22 02:36, Noah Misch wrote:
> On Thu, Apr 07, 2022 at 08:34:50PM +0200, Tomas Vondra wrote:
>> I've pushed a revert af all the commits related to this - decoding of
>> sequences and test_decoding / built-in replication changes.
>
> Two July buildfarm runs failed with PANIC during standby promotion:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2022-07-19%2004%3A13%3A18
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2022-07-31%2011%3A33%3A13
>
> The attached patch hacks things so an ordinary x86_64 GNU/Linux machine
> reproduces this consistently. "git bisect" then traced the regression to the
> above revert commit (2c7ea57e56ca5f668c32d4266e0a3e45b455bef5). The pg_ctl
> test suite passes under this hack in all supported branches, and it passed on
> v15 until that revert. Would you investigate?
>
> The buildfarm animal uses keep_error_builds. From kept data directories, I
> deduced these events:
>
> - After the base backup, auto-analyze ran on the primary and wrote WAL.
> - Standby streamed and wrote up to 0/301FFF.
> - Standby received the promote signal. Terminated streaming. WAL page at 0/302000 remained all-zeros.
> - Somehow, end-of-recovery became a PANIC.
>
I think it'd be really bizarre if this was due to the revert, as that
simply undoes minor WAL changes (and none of this should affect what
happens at WAL page boundary etc.). It just restores WAL as it was
before 0da92dc, nothing particularly complicated. I did go through all
of the changes again and I haven't spotted anything particularly
suspicious, but I'll give it another try tomorrow.
However, I did try bisecting this using the attached patch, and that
does not suggest the issue is in the revert commit. It actually fails
all the way back to 5dc0418fab2, and it starts working on 9553b4115f1.
...
6392f2a0968 Try to silence "-Wmissing-braces" complaints in ...
=> 5dc0418fab2 Prefetch data referenced by the WAL, take II.
9553b4115f1 Fix warning introduced in 5c279a6d350.
...
This is merely 10 commits before the revert, and it seems way more
related to WAL. Also, adding this to the two nodes in 003_standby.pl
makes the issue go away, it seems:
$node_standby->append_conf('postgresql.conf',
qq(recovery_prefetch = off));
I'd bet it's about WAL prefetching, not the revert, and the bisect was a
bit incorrect, because the commits are close and the failures happen to
be rare. (Presumably you first did the bisect and then wrote the patch
that reproduces this, right?)
Adding Thomas Munro to the thread, he's the WAL prefetching expert ;-)
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company