Thread: pg_rewind tap test unstable
Hi, for something between 10% and 20% of the devel builds for apt.postgresql.org (which happen every 6h if there's a git change, so it happens every few days), I'm seeing this: make[2]: Entering directory '/tmp/buildd/postgresql-9.6-9.6~~devel~20150728.0405/build/src/bin/pg_rewind' rm -rf /tmp/buildd/postgresql-9.6-9.6~~devel~20150728.0405/build/../src/bin/pg_rewind/tmp_check/log cd /tmp/buildd/postgresql-9.6-9.6~~devel~20150728.0405/build/../src/bin/pg_rewind && TESTDIR='/tmp/buildd/postgresql-9.6-9.6~~devel~20150728.0405/build/src/bin/pg_rewind' PATH="/tmp/buildd/postgresql-9.6-9.6~~devel~20150728.0405/build/tmp_install/usr/lib/postgresql/9.6/bin:$PATH" LD_LIBRARY_PATH="/tmp/buildd/postgresql-9.6-9.6~~devel~20150728.0405/build/tmp_install/usr/lib/x86_64-linux-gnu" PGPORT='65432'top_builddir='/tmp/buildd/postgresql-9.6-9.6~~devel~20150728.0405/build/src/bin/pg_rewind/../../..' prove -I/tmp/buildd/postgresql-9.6-9.6~~devel~20150728.0405/build/../src/test/perl/ --verbose t/*.pl # Failed test 'tail-copy: query result matches' # at RewindTest.pm line 131. # got: '1 # ' # expected: '10001 # ' # Looks like you failed 1 test of 8. t/001_basic.pl ....... 1..8 ok 1 - pg_rewind local ok 2 - table content: query result matches ok 3 - truncation: query result matches not ok 4 - tail-copy: query result matches ok 5 - pg_rewind remote ok 6 - table content: query result matches ok 7 - truncation: query result matches ok 8 - tail-copy: query result matches Dubious, test returned 1 (wstat 256, 0x100) Failed 1/8 subtests I don't have the older logs available, but from memory, the subtest failing and the two numbers mentioned are always the same. Christoph -- cb@df7cb.de | http://www.df7cb.de/
On Tue, Jul 28, 2015 at 5:57 PM, Christoph Berg <myon@debian.org> wrote: > for something between 10% and 20% of the devel builds for apt.postgresql.org > (which happen every 6h if there's a git change, so it happens every few days), > I'm seeing this: > Dubious, test returned 1 (wstat 256, 0x100) > Failed 1/8 subtests > > I don't have the older logs available, but from memory, the subtest > failing and the two numbers mentioned are always the same. There should be some output logs in src/bin/pg_rewind/tmp_check/log/*? Could you attach them here if you have them? That would be helpful to understand what is happening. -- Michael
Re: Michael Paquier 2015-07-28 <CAB7nPqQCpGy3u7CMfo8sQQUoBSFmEieOhuEsLxwyCC64j3GWxQ@mail.gmail.com> > On Tue, Jul 28, 2015 at 5:57 PM, Christoph Berg <myon@debian.org> wrote: > > for something between 10% and 20% of the devel builds for apt.postgresql.org > > (which happen every 6h if there's a git change, so it happens every few days), > > I'm seeing this: > > Dubious, test returned 1 (wstat 256, 0x100) > > Failed 1/8 subtests > > > > I don't have the older logs available, but from memory, the subtest > > failing and the two numbers mentioned are always the same. > > There should be some output logs in src/bin/pg_rewind/tmp_check/log/*? > Could you attach them here if you have them? That would be helpful to > understand what is happening. It took me a few attempts to tell the build environment to save a copy on failure and not shred everything right away. So here we go: Christoph -- cb@df7cb.de | http://www.df7cb.de/
Attachment
On Mon, Aug 3, 2015 at 5:35 PM, Christoph Berg <myon@debian.org> wrote: > Re: Michael Paquier 2015-07-28 <CAB7nPqQCpGy3u7CMfo8sQQUoBSFmEieOhuEsLxwyCC64j3GWxQ@mail.gmail.com> >> On Tue, Jul 28, 2015 at 5:57 PM, Christoph Berg <myon@debian.org> wrote: >> > for something between 10% and 20% of the devel builds for apt.postgresql.org >> > (which happen every 6h if there's a git change, so it happens every few days), >> > I'm seeing this: >> > Dubious, test returned 1 (wstat 256, 0x100) >> > Failed 1/8 subtests >> > >> > I don't have the older logs available, but from memory, the subtest >> > failing and the two numbers mentioned are always the same. >> >> There should be some output logs in src/bin/pg_rewind/tmp_check/log/*? >> Could you attach them here if you have them? That would be helpful to >> understand what is happening. > > It took me a few attempts to tell the build environment to save a copy > on failure and not shred everything right away. So here we go: In test case 2, the failure happens to be that the standby did not have the time to replicate the database beforepromotion that has been created on the master. One possible explanation for this failure is that the standby has been promoted before all the WAL needed for the tests has been replayed, hence we had better be sure that the replay_location of the standby matches pg_current_xlog_location() before promotion. On the buildfarm, hamster, the legendary slowest machine of the whole set, does not complain about that. Is your environment that heavy loaded when you run the tests? Perhaps the attached patch helps? -- Michael
Attachment
On Tue, Aug 04, 2015 at 02:21:16PM +0900, Michael Paquier wrote: > >> On Tue, Jul 28, 2015 at 5:57 PM, Christoph Berg <myon@debian.org> wrote: > >> > for something between 10% and 20% of the devel builds for apt.postgresql.org > >> > (which happen every 6h if there's a git change, so it happens every few days), > >> > I'm seeing this: > In test case 2, the failure happens to be that the standby did not > have the time to replicate the database beforepromotion that has been > created on the master. One possible explanation for this failure is > that the standby has been promoted before all the WAL needed for the > tests has been replayed, hence we had better be sure that the > replay_location of the standby matches pg_current_xlog_location() > before promotion. > Perhaps the attached patch helps? Thanks. In light of your diagnosis, I can reliably reproduce the failure by injecting a sleep into XLogSendPhysical(). Your patch fixes the problem, but it adds wal_receiver_status_interval (= 10s) stalls, doubling src/bin/pg_rewind/t/001_basic.pl runtime on a fast system. (The standby applies the final WAL quickly, then sleeps for wal_receiver_status_interval before notifying the master.) The standby will apply any written, unapplied WAL during promotion. Therefore, I plan to commit the attached performance-neutral variant of your patch.
Attachment
On Mon, Sep 7, 2015 at 1:16 PM, Noah Misch <noah@leadboat.com> wrote: > On Tue, Aug 04, 2015 at 02:21:16PM +0900, Michael Paquier wrote: >> >> On Tue, Jul 28, 2015 at 5:57 PM, Christoph Berg <myon@debian.org> wrote: >> >> > for something between 10% and 20% of the devel builds for apt.postgresql.org >> >> > (which happen every 6h if there's a git change, so it happens every few days), >> >> > I'm seeing this: > >> In test case 2, the failure happens to be that the standby did not >> have the time to replicate the database beforepromotion that has been >> created on the master. One possible explanation for this failure is >> that the standby has been promoted before all the WAL needed for the >> tests has been replayed, hence we had better be sure that the >> replay_location of the standby matches pg_current_xlog_location() >> before promotion. > >> Perhaps the attached patch helps? > > Thanks. In light of your diagnosis, I can reliably reproduce the failure by > injecting a sleep into XLogSendPhysical(). Your patch fixes the problem, but > it adds wal_receiver_status_interval (= 10s) stalls, doubling > src/bin/pg_rewind/t/001_basic.pl runtime on a fast system. (The standby > applies the final WAL quickly, then sleeps for wal_receiver_status_interval > before notifying the master.) Indeed, thanks for double-checking. > The standby will apply any written, unapplied > WAL during promotion. Therefore, I plan to commit the attached > performance-neutral variant of your patch. Explaining the use of write_location. This looks fine to me. Thanks again. -- Michael