Thomas Munro <thomas.munro@gmail.com> writes:
> Hmm, one thing I'm still unclear on: did this problem really start
> with 6051857fc/ed52c3707? My initial email in this thread lists
> similar failures going back further, doesn't it? (And what's tern
> doing mixed up in this mess?)
Well, those earlier ones may be commit_ts failures, but a lot of
them show different-looking symptoms, e.g. pg_ctl failures.
tern's failure at
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2021-07-17+10%3A57%3A49
does look similar, but we can see in its log that the standby
*did* notice the primary disconnection and then reconnect:
2021-07-17 16:29:08.248 UTC [17498380:2] LOG: replication terminated by primary server
2021-07-17 16:29:08.248 UTC [17498380:3] DETAIL: End of WAL reached on timeline 1 at 0/30378F8.
2021-07-17 16:29:08.248 UTC [17498380:4] FATAL: could not send end-of-streaming message to primary: no COPY in progress
2021-07-17 16:29:08.248 UTC [25166230:5] LOG: invalid record length at 0/30378F8: wanted 24, got 0
2021-07-17 16:29:08.350 UTC [16318578:1] FATAL: could not connect to the primary server: server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.
2021-07-17 16:29:36.369 UTC [7077918:1] FATAL: could not connect to the primary server: FATAL: the database system is starting up
2021-07-17 16:29:36.380 UTC [11338028:1] FATAL: could not connect to the primary server: FATAL: the database system is starting up
...
2021-07-17 16:29:36.881 UTC [17367092:1] LOG: started streaming WAL from primary at 0/3000000 on timeline 1
So I'm not sure what happened there, but it's not an instance
of this problem. One thing that looks a bit suspicious is
this in the primary's log:
2021-07-17 16:26:47.832 UTC [12386550:1] LOG: using stale statistics instead of current ones because stats collector is not responding
which makes me wonder if the timeout is down to out-of-date
stats data. The loop in 002_standby.pl doesn't appear to
depend on the stats collector:
my $primary_lsn =
  $primary->safe_psql('postgres', 'select pg_current_wal_lsn()');
$standby->poll_query_until('postgres',
	qq{SELECT '$primary_lsn'::pg_lsn <= pg_last_wal_replay_lsn()})
  or die "standby never caught up";
but maybe I'm missing the connection.
Apropos of that, it's worth noting that wait_for_catchup *is*
dependent on up-to-date stats, and here's a recent run where
it sure looks like the timeout cause is an AWOL stats collector:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2022-01-10%2004%3A51%3A34
I wonder if we should refactor wait_for_catchup to probe the
standby directly instead of relying on the upstream's view.
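Something along these lines, perhaps (an untested sketch, not a patch;
the sub name wait_for_replay_catchup is invented here for
illustration): grab the primary's current LSN and then poll the
standby's own pg_last_wal_replay_lsn(), much as 002_standby.pl already
does, so that no stats-collector data is involved:

	# Untested sketch of a standby-side catchup wait (replay case only).
	sub wait_for_replay_catchup
	{
		my ($primary, $standby) = @_;

		# Ask the primary where its WAL currently ends ...
		my $primary_lsn =
		  $primary->safe_psql('postgres', 'SELECT pg_current_wal_lsn()');
		# ... then poll the standby until it has replayed at least that far.
		$standby->poll_query_until('postgres',
			qq{SELECT '$primary_lsn'::pg_lsn <= pg_last_wal_replay_lsn()})
		  or die "standby never caught up to $primary_lsn";
		return;
	}

That only covers the replay case, of course; wait_for_catchup's other
modes (sent/write/flush) would presumably need separate handling.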
regards, tom lane