For about a week, eelpout has been failing the pg_basebackup test
more often than not, but only in the 9.5 and 9.6 branches:
https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=eelpout&br=REL9_6_STABLE
https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=eelpout&br=REL9_5_STABLE
The failures all look pretty alike:
# Running: pg_basebackup -D /home/tmunro/build-farm/buildroot/REL9_6_STABLE/pgsql.build/src/bin/pg_basebackup/tmp_check/tmp_test_jJOm/backupxs -X stream
pg_basebackup: could not send copy-end packet: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
pg_basebackup: child process exited with exit code 1
not ok 44 - pg_basebackup -X stream runs
What shows up in the postmaster log is
2020-12-02 09:04:53.064 NZDT [29536:1] [unknown] LOG: connection received: host=[local]
2020-12-02 09:04:53.065 NZDT [29536:2] [unknown] LOG: replication connection authorized: user=tmunro
2020-12-02 09:04:53.175 NZDT [29537:1] [unknown] LOG: connection received: host=[local]
2020-12-02 09:04:53.178 NZDT [29537:2] [unknown] LOG: replication connection authorized: user=tmunro
2020-12-02 09:05:42.860 NZDT [29502:2] LOG: using stale statistics instead of current ones because stats collector is not responding
2020-12-02 09:05:53.074 NZDT [29542:1] LOG: using stale statistics instead of current ones because stats collector is not responding
2020-12-02 09:05:53.183 NZDT [29537:3] pg_basebackup LOG: terminating walsender process due to replication timeout
2020-12-02 09:05:53.183 NZDT [29537:4] pg_basebackup LOG: disconnection: session time: 0:01:00.008 user=tmunro database= host=[local]
2020-12-02 09:06:33.996 NZDT [29536:3] pg_basebackup LOG: disconnection: session time: 0:01:40.933 user=tmunro database= host=[local]
The "using stale statistics" gripes seem to be from autovacuum, so they
may be unrelated to the problem; but they suggest that the system
is under very heavy load, or else that there's some kernel-level issue.
Note however that some of the failures don't have those messages, and
I also see those messages in some runs that didn't fail.
Perhaps this is just a question of the machine being too slow to complete
the test, in which case we ought to raise wal_sender_timeout. But it's
weird that it would've started to fail just now, because I don't really
see any changes in those branches that would explain a week-old change
in the test runtime.
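
For what it's worth, the timing in the log above is consistent with the
timeout theory: the walsender with PID 29537 is killed almost exactly 60
seconds after its connection was authorized, and 60 seconds is the default
wal_sender_timeout. A rough sketch for checking that interval in other
failing runs follows. The function names and the regular expression are
illustrative only, and the pattern assumes the timestamp and [pid:line]
prefix layout shown in the excerpt above.

```python
import re
from datetime import datetime

# Two lines copied from the postmaster log excerpt above (walsender PID 29537).
LOG = """\
2020-12-02 09:04:53.178 NZDT [29537:2] [unknown] LOG: replication connection authorized: user=tmunro
2020-12-02 09:05:53.183 NZDT [29537:3] pg_basebackup LOG: terminating walsender process due to replication timeout
"""

# Assumed prefix layout: timestamp, time zone, [pid:line], optional app name.
LINE_RE = re.compile(
    r"^(\d{4}-\d\d-\d\d \d\d:\d\d:\d\d\.\d{3}) \S+ \[(\d+):\d+\] .*?LOG: (.*)$"
)


def parse(log):
    """Return (timestamp, pid, message) for each line matching the prefix."""
    events = []
    for line in log.splitlines():
        m = LINE_RE.match(line)
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
            events.append((ts, int(m.group(2)), m.group(3)))
    return events


def seconds_until_timeout(log):
    """Map each timed-out walsender PID to seconds elapsed since authorization."""
    authorized = {}
    elapsed = {}
    for ts, pid, msg in parse(log):
        if msg.startswith("replication connection authorized"):
            authorized[pid] = ts
        elif "replication timeout" in msg and pid in authorized:
            elapsed[pid] = (ts - authorized[pid]).total_seconds()
    return elapsed


if __name__ == "__main__":
    for pid, secs in seconds_until_timeout(LOG).items():
        print(f"walsender {pid} timed out {secs:.3f} s after authorization")
```

An interval that sits right at 60 seconds in every failing run would point
at the timeout setting rather than at a crash, though it would not by
itself explain why the client went quiet for that long.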
Any thoughts?
regards, tom lane