Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders - Mailing list pgsql-hackers

From Mark Dilger
Subject Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders
Date
Msg-id 3177FAD5-3600-4C8B-B338-FB108F60455F@gmail.com
Whole thread Raw
In response to Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
> On Apr 22, 2017, at 11:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I wrote:
>> Whoa.  This just turned into a much larger can of worms than I expected.
>> How can it be that processes are getting assertion crashes and yet the
>> test framework reports success anyway?  That's impossibly
>> broken/unacceptable.
>
> I poked into this on my laptop, where I'm able to reproduce both assertion
> failures.  The reason the one in subtrans.c is not being detected by the
> test framework is that it happens late in the run and the test doesn't
> do anything more with that server than shut it down.  "pg_ctl stop"
> actually notices there's a problem; for instance, the log on tern for
> this step shows
>
> ### Stopping node "master" using mode immediate
> # Running: pg_ctl -D /home/nm/farm/gcc32/HEAD/pgsql.build/src/test/recovery/tmp_check/data_master_xHGA/pgdata -m
immediatestop 
> pg_ctl: PID file
"/home/nm/farm/gcc32/HEAD/pgsql.build/src/test/recovery/tmp_check/data_master_xHGA/pgdata/postmaster.pid"does not exist 
> Is server running?
> # No postmaster PID
>
> However, PostgresNode::stop blithely ignores the failure exit from
> pg_ctl.  I think it should use system_or_bail() not just system_log(),
> so that we'll notice if the server is not running when we expect it
> to be.  It's conceivable that there should also be a "stop_if_running"
> method that allows the case where the server is not running, but so
> far as I can find, no existing TAP test needs such a behavior --- and
> it certainly shouldn't be the default.
>
> The walsender.c crash is harder to detect because the postmaster very
> nicely recovers and restarts its children.  That's great for robustness,
> but it sucks for testing.  I think that we ought to disable that in
> TAP tests, which we can do by having PostgresNode::init include
> "restart_after_crash = off" in the postgresql.conf modifications it
> applies.  Again, it's conceivable that some tests would not want that,
> but there is no existing test that seems to need crash recovery on,
> and I don't see a good argument for it being default for test purposes.
>
> As best I've been able to tell, the reason why the walsender.c crash
> is detected when it's detected is that the TAP script chances to try
> to connect while the recovery is happening:
>
> connection error: 'psql: FATAL:  the database system is in recovery mode'
> while running 'psql -XAtq -d port=57718 host=/tmp/_uP8FKEynq dbname='postgres' -f - -v ON_ERROR_STOP=1' at
/home/nm/farm/gcc32/HEAD/pgsql.build/src/test/recovery/../../../src/test/perl/PostgresNode.pmline 1173. 
>
> The window for that is not terribly wide, explaining why the detection
> is unreliable even when the crash does occur.
>
> In short then, I propose the attached patch to make these cases fail
> more reliably.  We might extend this later to allow the old behaviors
> to be explicitly opted-into, but we don't seem to need that today.
>
>             regards, tom lane

I pulled fresh sources with your latest commit, 7d68f2281a4b56834c8e5648fc7da0b73b674c45,
and the tests consistently fail (5 out of 5 attempts) for me on my laptop with:

/Applications/Xcode.app/Contents/Developer/usr/bin/make -C recovery check
for extra in contrib/test_decoding; do /Applications/Xcode.app/Contents/Developer/usr/bin/make -C '../../..'/$extra
DESTDIR='/Users/mark/vanilla/bitoctetlength'/tmp_installinstall
>>'/Users/mark/vanilla/bitoctetlength'/tmp_install/log/install.log|| exit; done 
rm -rf /Users/mark/vanilla/bitoctetlength/src/test/recovery/tmp_check/log
cd . && TESTDIR='/Users/mark/vanilla/bitoctetlength/src/test/recovery'
PATH="/Users/mark/vanilla/bitoctetlength/tmp_install/usr/local/pgsql/bin:$PATH"
DYLD_LIBRARY_PATH="/Users/mark/vanilla/bitoctetlength/tmp_install/usr/local/pgsql/lib"PGPORT='65432'
PG_REGRESS='/Users/mark/vanilla/bitoctetlength/src/test/recovery/../../../src/test/regress/pg_regress'prove -I
../../../src/test/perl/-I .  t/*.pl 
t/001_stream_rep.pl .................. ok
t/002_archiving.pl ................... ok
t/003_recovery_targets.pl ............ ok
t/004_timeline_switch.pl ............. Bailout called.  Further testing stopped:  system pg_ctl failed
FAILED--Further testing stopped: system pg_ctl failed
make[2]: *** [check] Error 255
make[1]: *** [check-recovery-recurse] Error 2
make: *** [check-world-src/test-recurse] Error 2

The first time, I had some local changes, but I've stashed them and am still getting the error each of
these next four times.  I'm compiling as follows:

make distclean; make clean; ./configure --enable-cassert --enable-tap-tests --enable-depend && make -j4 && make
check-world

Are the errors expected now?

mark


pgsql-hackers by date:

Previous
From: Bruce Momjian
Date:
Subject: [HACKERS] Statistics "dependency"
Next
From: Tom Lane
Date:
Subject: Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders