Thread: Recent eelpout failures on 9.x branches

Recent eelpout failures on 9.x branches

From: Tom Lane
For about a week, eelpout has been failing the pg_basebackup test
more often than not, but only in the 9.5 and 9.6 branches:

https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=eelpout&br=REL9_6_STABLE
https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=eelpout&br=REL9_5_STABLE

The failures all look pretty alike:

# Running: pg_basebackup -D /home/tmunro/build-farm/buildroot/REL9_6_STABLE/pgsql.build/src/bin/pg_basebackup/tmp_check/tmp_test_jJOm/backupxs -X stream
pg_basebackup: could not send copy-end packet: server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
pg_basebackup: child process exited with exit code 1
not ok 44 - pg_basebackup -X stream runs

What shows up in the postmaster log is

2020-12-02 09:04:53.064 NZDT [29536:1] [unknown] LOG:  connection received: host=[local]
2020-12-02 09:04:53.065 NZDT [29536:2] [unknown] LOG:  replication connection authorized: user=tmunro
2020-12-02 09:04:53.175 NZDT [29537:1] [unknown] LOG:  connection received: host=[local]
2020-12-02 09:04:53.178 NZDT [29537:2] [unknown] LOG:  replication connection authorized: user=tmunro
2020-12-02 09:05:42.860 NZDT [29502:2] LOG:  using stale statistics instead of current ones because stats collector is not responding
2020-12-02 09:05:53.074 NZDT [29542:1] LOG:  using stale statistics instead of current ones because stats collector is not responding
2020-12-02 09:05:53.183 NZDT [29537:3] pg_basebackup LOG:  terminating walsender process due to replication timeout
2020-12-02 09:05:53.183 NZDT [29537:4] pg_basebackup LOG:  disconnection: session time: 0:01:00.008 user=tmunro database= host=[local]
2020-12-02 09:06:33.996 NZDT [29536:3] pg_basebackup LOG:  disconnection: session time: 0:01:40.933 user=tmunro database= host=[local]

The "using stale statistics" gripes seem to be from autovacuum, so they
may be unrelated to the problem; but they suggest that the system
is under very heavy load, or else that there's some kernel-level issue.
Note however that some of the failures don't have those messages, and
I also see those messages in some runs that didn't fail.

Perhaps this is just a question of the machine being too slow to complete
the test, in which case we ought to raise wal_sender_timeout.  But it's
weird that it would've started to fail just now, because I don't really
see any changes in those branches that would explain a week-old change
in the test runtime.
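
For reference, such an override would go in the animal's build-farm.conf,
using the extra_config mechanism the buildfarm client already provides.
A minimal sketch, with a purely illustrative value (the shipped default for
wal_sender_timeout is 60s, which matches the 0:01:00 session time above):

    extra_config => {
        DEFAULT => [
            # Illustrative value only, not a recommendation; anything well
            # past the test's worst-case runtime on slow hardware would do.
            "wal_sender_timeout = '3600s'",
        ],
    },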

Any thoughts?

            regards, tom lane



Re: Recent eelpout failures on 9.x branches

From: Thomas Munro
On Wed, Dec 2, 2020 at 11:36 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Perhaps this is just a question of the machine being too slow to complete
> the test, in which case we ought to raise wal_sender_timeout.  But it's
> weird that it would've started to fail just now, because I don't really
> see any changes in those branches that would explain a week-old change
> in the test runtime.

Unfortunately, eelpout got kicked off the nice shiny ARM server it was
running on, so last week I moved it to a rack-mounted Raspberry Pi.  It
seems to be totally I/O-starved, causing some timeouts to be reached,
and I'm looking into fixing that by adding fast storage.  This may
take a few days.  Sorry for the noise.



Re: Recent eelpout failures on 9.x branches

From: Tom Lane
Thomas Munro <thomas.munro@gmail.com> writes:
> Unfortunately, eelpout got kicked off the nice shiny ARM server it was
> running on, so last week I moved it to a rack-mounted Raspberry Pi.  It
> seems to be totally I/O-starved, causing some timeouts to be reached,
> and I'm looking into fixing that by adding fast storage.  This may
> take a few days.  Sorry for the noise.

Ah-hah.  Now that I look, eelpout is very clearly slower overall
than it was a couple weeks ago, so all is explained.

It might still be reasonable to raise wal_sender_timeout in the
buildfarm environment, though.  We usually try to make sure that
buildfarm timeouts border on ridiculous, not just because of
underpowered critters but also for cases like CLOBBER_CACHE_ALWAYS
animals.

I'm also wondering a bit why the issue isn't affecting the newer
branches.  It's certainly not because we made the test shorter ...

            regards, tom lane



Re: Recent eelpout failures on 9.x branches

From: Noah Misch
On Tue, Dec 01, 2020 at 06:07:17PM -0500, Tom Lane wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > Unfortunately, eelpout got kicked off the nice shiny ARM server it was
> > running on, so last week I moved it to a rack-mounted Raspberry Pi.  It
> > seems to be totally I/O-starved, causing some timeouts to be reached,
> > and I'm looking into fixing that by adding fast storage.  This may
> > take a few days.  Sorry for the noise.
> 
> Ah-hah.  Now that I look, eelpout is very clearly slower overall
> than it was a couple weeks ago, so all is explained.
> 
> It might still be reasonable to raise wal_sender_timeout in the
> buildfarm environment, though.  We usually try to make sure that
> buildfarm timeouts border on ridiculous, not just because of
> underpowered critters but also for cases like CLOBBER_CACHE_ALWAYS
> animals.

My buildfarm animals override these:

    extra_config => {
        DEFAULT => [
            "authentication_timeout = '600s'",
            "wal_receiver_timeout = '18000s'",
            "wal_sender_timeout = '18000s'",
        ],
    },
    build_env => {
        PGCTLTIMEOUT => 18000,
    },

Each of those timeouts caused failures before I changed it.  For animals fast
enough to make them irrelevant, I've not yet encountered a disadvantage.

> I'm also wondering a bit why the issue isn't affecting the newer
> branches.  It's certainly not because we made the test shorter ...

That is peculiar.



Re: Recent eelpout failures on 9.x branches

From: Thomas Munro
On Wed, Dec 2, 2020 at 12:07 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm also wondering a bit why the issue isn't affecting the newer
> branches.  It's certainly not because we made the test shorter ...

I looked at htop while it was building the 9.x branches and saw
pg_basebackup sitting in D state waiting for glacial I/O.  Perhaps
this was improved by the --no-sync option that got sprinkled around
the place in later releases?
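
For illustration, and only as a paraphrase rather than the exact test code:
in the newer branches the TAP test's pg_basebackup invocations carry
--no-sync, which tells pg_basebackup not to wait for the backup to be
flushed to disk, whereas the option simply does not exist before v10, so
the 9.5/9.6 tests cannot use it.  Roughly:

    # Later branches (sketch; PostgresNode::command_ok runs the command
    # against the test server):
    $node->command_ok(
        [ 'pg_basebackup', '-D', "$tempdir/backupxs", '-X', 'stream',
          '--no-sync' ],    # skip flushing the backup to disk (v10+ only)
        'pg_basebackup -X stream runs');

    # 9.5/9.6 have to run the same step without it:
    $node->command_ok(
        [ 'pg_basebackup', '-D', "$tempdir/backupxs", '-X', 'stream' ],
        'pg_basebackup -X stream runs');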