Recent eelpout failures on 9.x branches - Mailing list pgsql-hackers

From Tom Lane
Subject Recent eelpout failures on 9.x branches
Msg-id 1530182.1606862173@sss.pgh.pa.us
List pgsql-hackers
For about a week, eelpout has been failing the pg_basebackup test
more often than not, but only in the 9.5 and 9.6 branches:

https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=eelpout&br=REL9_6_STABLE
https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=eelpout&br=REL9_5_STABLE

The failures all look pretty alike:

# Running: pg_basebackup -D /home/tmunro/build-farm/buildroot/REL9_6_STABLE/pgsql.build/src/bin/pg_basebackup/tmp_check/tmp_test_jJOm/backupxs -X stream
pg_basebackup: could not send copy-end packet: server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
pg_basebackup: child process exited with exit code 1
not ok 44 - pg_basebackup -X stream runs

What shows up in the postmaster log is

2020-12-02 09:04:53.064 NZDT [29536:1] [unknown] LOG:  connection received: host=[local]
2020-12-02 09:04:53.065 NZDT [29536:2] [unknown] LOG:  replication connection authorized: user=tmunro
2020-12-02 09:04:53.175 NZDT [29537:1] [unknown] LOG:  connection received: host=[local]
2020-12-02 09:04:53.178 NZDT [29537:2] [unknown] LOG:  replication connection authorized: user=tmunro
2020-12-02 09:05:42.860 NZDT [29502:2] LOG:  using stale statistics instead of current ones because stats collector is not responding
2020-12-02 09:05:53.074 NZDT [29542:1] LOG:  using stale statistics instead of current ones because stats collector is not responding
2020-12-02 09:05:53.183 NZDT [29537:3] pg_basebackup LOG:  terminating walsender process due to replication timeout
2020-12-02 09:05:53.183 NZDT [29537:4] pg_basebackup LOG:  disconnection: session time: 0:01:00.008 user=tmunro database= host=[local]
2020-12-02 09:06:33.996 NZDT [29536:3] pg_basebackup LOG:  disconnection: session time: 0:01:40.933 user=tmunro database= host=[local]

The "using stale statistics" gripes seem to be from autovacuum, so they
may be unrelated to the problem; but they suggest that the system
is under very heavy load, or else that there's some kernel-level issue.
Note however that some of the failures don't have those messages, and
I also see those messages in some runs that didn't fail.

Perhaps this is just a question of the machine being too slow to complete
the test, in which case we ought to raise wal_sender_timeout.  But it's
weird that it would've started to fail just now, because I don't really
see any changes in those branches that would explain a week-old change
in the test runtime.
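
The timestamps in the postmaster log fit the timeout theory: the walsender
(PID 29537) was terminated about 60 seconds after its connection was
received, in line with the "session time: 0:01:00.008" it reported. A minimal
Python sketch of that arithmetic follows. It assumes the test ran with the
stock wal_sender_timeout default of 60s, since the log does not show the
configured value, and the helper names are made up for illustration.

```python
from datetime import datetime

# Two lines copied from the postmaster log above, both for PID 29537.
LOG_LINES = [
    "2020-12-02 09:04:53.175 NZDT [29537:1] [unknown] LOG:  connection received: host=[local]",
    "2020-12-02 09:05:53.183 NZDT [29537:3] pg_basebackup LOG:  terminating walsender process due to replication timeout",
]

# Assumption: the test cluster uses the default wal_sender_timeout (60s).
ASSUMED_TIMEOUT_S = 60.0


def parse_ts(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS.mmm' timestamp of a log line."""
    return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S.%f")


def elapsed_seconds(first, last):
    """Seconds between the timestamps of two log lines."""
    return (parse_ts(last) - parse_ts(first)).total_seconds()


elapsed = elapsed_seconds(LOG_LINES[0], LOG_LINES[1])
hit_timeout = elapsed >= ASSUMED_TIMEOUT_S
print(f"walsender lived {elapsed:.3f}s; reached assumed timeout: {hit_timeout}")
```

If the assumption holds, the walsender was killed essentially the moment the
timeout expired, which points to a stalled or starved connection rather than
a crash.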

Any thoughts?

            regards, tom lane
