Re: Why is src/test/modules/committs/t/002_standby.pl flaky? - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Why is src/test/modules/committs/t/002_standby.pl flaky?
Msg-id 2411751.1641849927@sss.pgh.pa.us
In response to Re: Why is src/test/modules/committs/t/002_standby.pl flaky?  (Thomas Munro <thomas.munro@gmail.com>)
Responses Re: Why is src/test/modules/committs/t/002_standby.pl flaky?
List pgsql-hackers
Thomas Munro <thomas.munro@gmail.com> writes:
> Hmm, one thing I'm still unclear on: did this problem really start
> with 6051857fc/ed52c3707?  My initial email in this thread lists
> similar failures going back further, doesn't it?  (And what's tern
> doing mixed up in this mess?)

Well, those earlier ones may be committs failures, but a lot of
them contain different-looking symptoms, eg pg_ctl failures.

tern's failure at
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2021-07-17+10%3A57%3A49
does look similar, but we can see in its log that the standby
*did* notice the primary disconnection and then reconnect:

2021-07-17 16:29:08.248 UTC [17498380:2] LOG:  replication terminated by primary server
2021-07-17 16:29:08.248 UTC [17498380:3] DETAIL:  End of WAL reached on timeline 1 at 0/30378F8.
2021-07-17 16:29:08.248 UTC [17498380:4] FATAL:  could not send end-of-streaming message to primary: no COPY in progress
2021-07-17 16:29:08.248 UTC [25166230:5] LOG:  invalid record length at 0/30378F8: wanted 24, got 0
2021-07-17 16:29:08.350 UTC [16318578:1] FATAL:  could not connect to the primary server: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
2021-07-17 16:29:36.369 UTC [7077918:1] FATAL:  could not connect to the primary server: FATAL:  the database system is starting up
2021-07-17 16:29:36.380 UTC [11338028:1] FATAL:  could not connect to the primary server: FATAL:  the database system is starting up
...
2021-07-17 16:29:36.881 UTC [17367092:1] LOG:  started streaming WAL from primary at 0/3000000 on timeline 1

So I'm not sure what happened there, but it's not an instance
of this problem.  One thing that looks a bit suspicious is
this in the primary's log:

2021-07-17 16:26:47.832 UTC [12386550:1] LOG:  using stale statistics instead of current ones because stats collector is not responding

which makes me wonder if the timeout is down to out-of-date
pg_stats data.  The loop in 002_standby.pl doesn't appear to
depend on the stats collector:

my $primary_lsn =
  $primary->safe_psql('postgres', 'select pg_current_wal_lsn()');
$standby->poll_query_until('postgres',
    qq{SELECT '$primary_lsn'::pg_lsn <= pg_last_wal_replay_lsn()})
  or die "standby never caught up";

but maybe I'm missing the connection.

Apropos of that, it's worth noting that wait_for_catchup *is*
dependent on up-to-date stats, and here's a recent run where
it sure looks like the timeout cause is AWOL stats collector:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2022-01-10%2004%3A51%3A34

I wonder if we should refactor wait_for_catchup to probe the
standby directly instead of relying on the upstream's view.
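Such a refactoring might look like the loop already used in 002_standby.pl, wrapped into a helper. This is only a sketch of the idea, not the actual wait_for_catchup in PostgreSQL::Test::Cluster (which supports other modes and LSN columns); the helper name is hypothetical, and it assumes the two cluster objects come from the standard TAP test infrastructure.

```perl
# Hypothetical sketch: wait for a standby to replay up to the primary's
# current WAL position by querying the standby directly, avoiding any
# dependency on the stats collector feeding pg_stat_replication.
sub wait_for_replay_catchup
{
	my ($primary, $standby) = @_;

	# Capture the primary's current write position once, up front.
	my $primary_lsn =
	  $primary->safe_psql('postgres', 'SELECT pg_current_wal_lsn()');

	# Poll the standby until its replay position reaches that LSN.
	$standby->poll_query_until('postgres',
		qq{SELECT '$primary_lsn'::pg_lsn <= pg_last_wal_replay_lsn()})
	  or die "standby never caught up to $primary_lsn";
}
```

Since pg_last_wal_replay_lsn() is computed directly from the standby's recovery state, a loop like this cannot be fooled by a stuck stats collector on the primary.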

            regards, tom lane


