Thread: Recent eelpout failures on 9.x branches
For about a week, eelpout has been failing the pg_basebackup test more often than not, but only in the 9.5 and 9.6 branches:

https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=eelpout&br=REL9_6_STABLE
https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=eelpout&br=REL9_5_STABLE

The failures all look pretty alike:

# Running: pg_basebackup -D /home/tmunro/build-farm/buildroot/REL9_6_STABLE/pgsql.build/src/bin/pg_basebackup/tmp_check/tmp_test_jJOm/backupxs -X stream
pg_basebackup: could not send copy-end packet: server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.
pg_basebackup: child process exited with exit code 1
not ok 44 - pg_basebackup -X stream runs

What shows up in the postmaster log is

2020-12-02 09:04:53.064 NZDT [29536:1] [unknown] LOG: connection received: host=[local]
2020-12-02 09:04:53.065 NZDT [29536:2] [unknown] LOG: replication connection authorized: user=tmunro
2020-12-02 09:04:53.175 NZDT [29537:1] [unknown] LOG: connection received: host=[local]
2020-12-02 09:04:53.178 NZDT [29537:2] [unknown] LOG: replication connection authorized: user=tmunro
2020-12-02 09:05:42.860 NZDT [29502:2] LOG: using stale statistics instead of current ones because stats collector is not responding
2020-12-02 09:05:53.074 NZDT [29542:1] LOG: using stale statistics instead of current ones because stats collector is not responding
2020-12-02 09:05:53.183 NZDT [29537:3] pg_basebackup LOG: terminating walsender process due to replication timeout
2020-12-02 09:05:53.183 NZDT [29537:4] pg_basebackup LOG: disconnection: session time: 0:01:00.008 user=tmunro database= host=[local]
2020-12-02 09:06:33.996 NZDT [29536:3] pg_basebackup LOG: disconnection: session time: 0:01:40.933 user=tmunro database= host=[local]

The "using stale statistics" gripes seem to be from autovacuum, so they may be unrelated to the problem; but they suggest that the system is under very heavy load, or else that there's some kernel-level issue. Note however that some of the failures don't have those messages, and I also see those messages in some runs that didn't fail.

Perhaps this is just a question of the machine being too slow to complete the test, in which case we ought to raise wal_sender_timeout. But it's weird that it would've started to fail just now, because I don't really see any changes in those branches that would explain a week-old change in the test runtime.

Any thoughts?

			regards, tom lane
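[A minimal sketch of the kind of override being floated here, assuming it goes into the test cluster's postgresql.conf; the 600s value is an illustrative guess, not a figure from this thread:

    # Illustrative value only: give slow machines more headroom before
    # the walsender declares a replication connection dead (default 60s).
    wal_sender_timeout = '600s'
]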
On Wed, Dec 2, 2020 at 11:36 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Perhaps this is just a question of the machine being too slow to complete
> the test, in which case we ought to raise wal_sender_timeout. But it's
> weird that it would've started to fail just now, because I don't really
> see any changes in those branches that would explain a week-old change
> in the test runtime.

Unfortunately, eelpout got kicked off the nice shiny ARM server it was running on, so last week I moved it to a rack-mounted Raspberry Pi. It seems to be totally I/O starved, causing some timeouts to be reached, and I'm looking into fixing that by adding fast storage. This may take a few days. Sorry for the noise.
Thomas Munro <thomas.munro@gmail.com> writes:
> Unfortunately, eelpout got kicked off the nice shiny ARM server it was
> running on, so last week I moved it to a rack-mounted Raspberry Pi. It
> seems to be totally I/O starved, causing some timeouts to be reached,
> and I'm looking into fixing that by adding fast storage. This may
> take a few days. Sorry for the noise.

Ah-hah. Now that I look, eelpout is very clearly slower overall than it was a couple of weeks ago, so all is explained.

It might still be reasonable to raise wal_sender_timeout in the buildfarm environment, though. We usually try to make sure that buildfarm timeouts border on the ridiculous, not just because of underpowered critters but also for cases like CLOBBER_CACHE_ALWAYS animals.

I'm also wondering a bit why the issue isn't affecting the newer branches. It's certainly not because we made the test shorter ...

			regards, tom lane
On Tue, Dec 01, 2020 at 06:07:17PM -0500, Tom Lane wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > Unfortunately, eelpout got kicked off the nice shiny ARM server it was
> > running on, so last week I moved it to a rack-mounted Raspberry Pi. It
> > seems to be totally I/O starved, causing some timeouts to be reached,
> > and I'm looking into fixing that by adding fast storage. This may
> > take a few days. Sorry for the noise.
>
> Ah-hah. Now that I look, eelpout is very clearly slower overall
> than it was a couple of weeks ago, so all is explained.
>
> It might still be reasonable to raise wal_sender_timeout in the
> buildfarm environment, though. We usually try to make sure that
> buildfarm timeouts border on the ridiculous, not just because of
> underpowered critters but also for cases like CLOBBER_CACHE_ALWAYS
> animals.

My buildfarm animals override these:

extra_config => {
    DEFAULT => [
        "authentication_timeout = '600s'",
        "wal_receiver_timeout = '18000s'",
        "wal_sender_timeout = '18000s'",
    ],
},
build_env => {
    PGCTLTIMEOUT => 18000,
},

Each of those timeouts caused failures before I changed it. For animals fast enough to make them irrelevant, I've not yet encountered a disadvantage.

> I'm also wondering a bit why the issue isn't affecting the newer
> branches. It's certainly not because we made the test shorter ...

That is peculiar.
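[As a sanity check that overrides like these actually took effect, the settings can be read back from the running cluster; a hypothetical invocation:

    psql -X -c "SHOW wal_sender_timeout"
    psql -X -c "SHOW wal_receiver_timeout"
]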
On Wed, Dec 2, 2020 at 12:07 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm also wondering a bit why the issue isn't affecting the newer
> branches. It's certainly not because we made the test shorter ...

I looked at htop while it was building the 9.x branches and saw pg_basebackup sitting in D state waiting for glacial I/O. Perhaps this was improved by the --no-sync option that got sprinkled around the place in later releases?
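[For reference, on branches where the option exists, --no-sync makes pg_basebackup return without waiting for the written backup files to be flushed to disk; an illustrative invocation, with a hypothetical target directory:

    pg_basebackup -D /tmp/backupxs -X stream --no-sync
]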