Thread: Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders

Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders

From: Alvaro Herrera
Simon Riggs wrote:
> Replication lag tracking for walsenders
> 
> Adds write_lag, flush_lag and replay_lag cols to pg_stat_replication.

Did anyone notice that this seems to be causing buildfarm member 'tern'
to fail the recovery check?  See here:

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=tern&dt=2017-04-21%2012%3A48%3A09&stg=recovery-check
which has
TRAP: FailedAssertion("!(lsn >= prev.lsn)", File: "walsender.c", Line: 3331)

Line 3331 was added by this commit.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Simon Riggs wrote:
>> Replication lag tracking for walsenders
>>
>> Adds write_lag, flush_lag and replay_lag cols to pg_stat_replication.

> Did anyone notice that this seems to be causing buildfarm member 'tern'
> to fail the recovery check?  See here:

> https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=tern&dt=2017-04-21%2012%3A48%3A09&stg=recovery-check
> which has
> TRAP: FailedAssertion("!(lsn >= prev.lsn)", File: "walsender.c", Line: 3331)

> Line 3331 was added by this commit.

Note that while that commit was some time back, tern has only just started
running recovery-check, following its update to the latest buildfarm
script.  It looks like it's run that test four times and failed twice,
so far.  So, not 100% reproducible, but there's something rotten there.
Timing-dependent, maybe?

Some excavation in the buildfarm database says that the coverage for
the recovery-check test has been mighty darn thin up until just recently.
These are all the reports we have:

pgbfprod=> select sysname, min(snapshot) as oldest, count(*) from build_status_log where log_stage = 'recovery-check.log' group by 1 order by 2;
 sysname  |       oldest        | count
----------+---------------------+-------
 hamster  | 2016-03-01 02:34:26 |   182
 crake    | 2017-04-09 01:58:15 |    80
 nightjar | 2017-04-11 15:54:34 |    52
 longfin  | 2017-04-19 16:29:39 |     9
 hornet   | 2017-04-20 14:12:08 |     4
 mandrill | 2017-04-20 14:14:08 |     4
 sungazer | 2017-04-20 14:16:08 |     4
 tern     | 2017-04-20 14:18:08 |     4
 prion    | 2017-04-20 14:23:05 |     8
 jacana   | 2017-04-20 15:00:17 |     3
(10 rows)

So, other than hamster which is certainly going to have its own spin
on the timing question, we have next to no track record for this test.
I wouldn't bet that this issue is unique to tern; more likely, that's
just the first critter to show an intermittent issue.
        regards, tom lane



Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders

From: Andres Freund
On 2017-04-21 17:04:08 -0400, Tom Lane wrote:
> Some excavation in the buildfarm database says that the coverage for
> the recovery-check test has been mighty darn thin up until just
> recently.

Hm, not good. Just enabled it for most of my animals (there seems to be
no point in having it on the C89 animal).  Very curious what it'll do to
skink's runtime :/.  I'll add a few more cores to the machine, otherwise
it'll probably wreak havoc.

Greetings,

Andres Freund



Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders

From: Thomas Munro
On Sat, Apr 22, 2017 at 9:04 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>> Simon Riggs wrote:
>>> Replication lag tracking for walsenders
>>>
>>> Adds write_lag, flush_lag and replay_lag cols to pg_stat_replication.
>
>> Did anyone notice that this seems to be causing buildfarm member 'tern'
>> to fail the recovery check?  See here:
>
>> https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=tern&dt=2017-04-21%2012%3A48%3A09&stg=recovery-check
>> which has
>> TRAP: FailedAssertion("!(lsn >= prev.lsn)", File: "walsender.c", Line: 3331)
>
>> Line 3331 was added by this commit.
>
> Note that while that commit was some time back, tern has only just started
> running recovery-check, following its update to the latest buildfarm
> script.  It looks like it's run that test four times and failed twice,
> so far.  So, not 100% reproducible, but there's something rotten there.
> Timing-dependent, maybe?

Thanks.  I'm away from my computer right now but will investigate this
and send a fix later today.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders

From: Simon Riggs
On 22 April 2017 at 06:45, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

> Thanks.  I'm away from my computer right now but will investigate this
> and send a fix later today.

Thanks. I'll review later today.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders

From: Thomas Munro
On Sat, Apr 22, 2017 at 9:13 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 22 April 2017 at 06:45, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>
>> Thanks.  I'm away from my computer right now but will investigate this
>> and send a fix later today.
>
> Thanks. I'll review later today.

The assertion fails reliably for me, because standby2's reported write
LSN jumps backwards after the timeline changes: for example I see
3020000 then 3028470 then 3020000 followed by a normal progression.
Surprisingly, 004_timeline_switch.pl reports success anyway.  I'm not
sure why the test fails sometimes on tern, but you can see that even
when it passed on tern the assertion had failed.

Here is a fix for the assertion failure.

-- 
Thomas Munro
http://www.enterprisedb.com

Thomas Munro <thomas.munro@enterprisedb.com> writes:
> The assertion fails reliably for me, because standby2's reported write
> LSN jumps backwards after the timeline changes: for example I see
> 3020000 then 3028470 then 3020000 followed by a normal progression.
> Surprisingly, 004_timeline_switch.pl reports success anyway.  I'm not
> sure why the test fails sometimes on tern, but you can see that even
> when it passed on tern the assertion had failed.

Whoa.  This just turned into a much larger can of worms than I expected.
How can it be that processes are getting assertion crashes and yet the
test framework reports success anyway?  That's impossibly
broken/unacceptable.

Looking closer at the tern report we started the thread with, there
are actually TWO assertion trap reports, the one Alvaro noted and
another one in 009_twophase_master.log:

TRAP: FailedAssertion("!(*ptr == ((TransactionId) 0) || (*ptr == parent && overwriteOK))", File: "subtrans.c", Line:
92)

When I run the recovery test on my own machine, it reports success
(quite reliably, I tried a bunch of times yesterday), but now that
I know to look:

$ grep TRAP tmp_check/log/*
tmp_check/log/009_twophase_master.log:TRAP: FailedAssertion("!(*ptr == ((TransactionId) 0) || (*ptr == parent && overwriteOK))", File: "subtrans.c", Line: 92)

So we now have three problems not just one:

* How is it that the TAP tests aren't noticing the failure?  This one,
to my mind, is a code-red situation, as it basically invalidates every
TAP test we've ever run.

* If Thomas's explanation for the timeline-switch assertion is correct,
why isn't it reproducible everywhere?

* What's with that second TRAP?

> Here is a fix for the assertion failure.

As for this patch itself, is it reasonable to try to assert that the
timeline has in fact changed?
        regards, tom lane



I wrote:
> Whoa.  This just turned into a much larger can of worms than I expected.
> How can it be that processes are getting assertion crashes and yet the
> test framework reports success anyway?  That's impossibly
> broken/unacceptable.

Taking a quick census of other buildfarm machines that are known to be
running the recovery test, it appears that most (not all) are seeing
one or both traps.  But the test is reporting success anyway, everywhere
except on Noah's 32-bit AIX critters.
        regards, tom lane



I wrote:
> Taking a quick census of other buildfarm machines that are known to be
> running the recovery test, it appears that most (not all) are seeing
> one or both traps.  But the test is reporting success anyway, everywhere
> except on Noah's 32-bit AIX critters.

Or, to be a bit more scientific, let's dig into the buildfarm database.
A couple more critters have started running the recovery test since
yesterday; these are the latest reports we have:

pgbfprod=> select sysname, max(snapshot) as newest, count(*) from build_status_log where log_stage = 'recovery-check.log' group by 1 order by 2;
    sysname    |       newest        | count
---------------+---------------------+-------
 hamster       | 2016-09-24 16:00:07 |   182
 jacana        | 2017-04-20 21:00:20 |     3
 skink         | 2017-04-22 05:00:01 |     2
 sungazer      | 2017-04-22 06:07:17 |     7
 tern          | 2017-04-22 06:38:09 |     8
 hornet        | 2017-04-22 06:41:12 |     7
 mandrill      | 2017-04-22 08:44:09 |     8
 nightjar      | 2017-04-22 13:54:24 |    55
 longfin       | 2017-04-22 14:29:17 |    13
 calliphoridae | 2017-04-22 14:30:01 |     4
 piculet       | 2017-04-22 14:30:01 |     3
 culicidae     | 2017-04-22 14:30:01 |     5
 francolin     | 2017-04-22 14:30:01 |     3
 prion         | 2017-04-22 14:33:05 |    12
 crake         | 2017-04-22 14:37:21 |    86
(15 rows)

Grepping those specific reports for "TRAP" yields
    sysname    |                                            l
---------------+---------------------------------------------------------------------------------------------------
 jacana        | TRAP: FailedAssertion("!(*ptr == ((TransactionId) 0) || (*ptr == parent && overwriteOK))", File: "c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/../pgsql/src/backend/access/transam/subtrans.c", Line: 92)
 sungazer      | TRAP: FailedAssertion("!(lsn >= prev.lsn)", File: "walsender.c", Line: 3331)
 sungazer      | TRAP: FailedAssertion("!(*ptr == ((TransactionId) 0) || (*ptr == parent && overwriteOK))", File: "subtrans.c", Line: 92)
 tern          | TRAP: FailedAssertion("!(lsn >= prev.lsn)", File: "walsender.c", Line: 3331)
 tern          | TRAP: FailedAssertion("!(*ptr == ((TransactionId) 0) || (*ptr == parent && overwriteOK))", File: "subtrans.c", Line: 92)
 hornet        | TRAP: FailedAssertion("!(lsn >= prev.lsn)", File: "walsender.c", Line: 3331)
 hornet        | TRAP: FailedAssertion("!(*ptr == ((TransactionId) 0) || (*ptr == parent && overwriteOK))", File: "subtrans.c", Line: 92)
 mandrill      | TRAP: FailedAssertion("!(lsn >= prev.lsn)", File: "walsender.c", Line: 3331)
 mandrill      | TRAP: FailedAssertion("!(*ptr == ((TransactionId) 0) || (*ptr == parent && overwriteOK))", File: "subtrans.c", Line: 92)
 nightjar      | TRAP: FailedAssertion("!(lsn >= prev.lsn)", File: "/pgbuild/root/HEAD/pgsql.build/../pgsql/src/backend/replication/walsender.c", Line: 3331)
 nightjar      | TRAP: FailedAssertion("!(*ptr == ((TransactionId) 0) || (*ptr == parent && overwriteOK))", File: "/pgbuild/root/HEAD/pgsql.build/../pgsql/src/backend/access/transam/subtrans.c", Line: 92)
 longfin       | TRAP: FailedAssertion("!(lsn >= prev.lsn)", File: "walsender.c", Line: 3331)
 longfin       | TRAP: FailedAssertion("!(*ptr == ((TransactionId) 0) || (*ptr == parent && overwriteOK))", File: "subtrans.c", Line: 92)
 calliphoridae | TRAP: FailedAssertion("!(*ptr == ((TransactionId) 0) || (*ptr == parent && overwriteOK))", File: "/home/andres/build/buildfarm-calliphoridae/HEAD/pgsql.build/../pgsql/src/backend/access/transam/subtrans.c", Line: 92)
 piculet       | TRAP: FailedAssertion("!(*ptr == ((TransactionId) 0) || (*ptr == parent && overwriteOK))", File: "/home/andres/build/buildfarm-piculet/HEAD/pgsql.build/../pgsql/src/backend/access/transam/subtrans.c", Line: 92)
 culicidae     | TRAP: FailedAssertion("!(*ptr == ((TransactionId) 0) || (*ptr == parent && overwriteOK))", File: "/home/andres/build/buildfarm-culicidae/HEAD/pgsql.build/../pgsql/src/backend/access/transam/subtrans.c", Line: 92)
 francolin     | TRAP: FailedAssertion("!(*ptr == ((TransactionId) 0) || (*ptr == parent && overwriteOK))", File: "/home/andres/build/buildfarm-francolin/HEAD/pgsql.build/../pgsql/src/backend/access/transam/subtrans.c", Line: 92)
 prion         | TRAP: FailedAssertion("!(*ptr == ((TransactionId) 0) || (*ptr == parent && overwriteOK))", File: "/home/ec2-user/bf/root/HEAD/pgsql.build/../pgsql/src/backend/access/transam/subtrans.c", Line: 92)
(18 rows)

So 6 of 15 critters are getting the walsender.c assertion,
and those six plus six more are seeing the subtrans.c one,
and three are seeing neither one.  There's probably a pattern
to that, don't know what it is.

(Actually, it looks like hamster stopped running this test
a long time ago, so whatever is in its last report is probably
not very relevant.  So more like 12 of 14 critters are seeing
one or both traps.)
        regards, tom lane



I wrote:
> So 6 of 15 critters are getting the walsender.c assertion,
> and those six plus six more are seeing the subtrans.c one,
> and three are seeing neither one.  There's probably a pattern
> to that, don't know what it is.

Ah, got it: skink *is* seeing the subtrans.c assertion, but not
the other.  (I'd forgotten to limit the query to the HEAD branch,
and it was looking at 9.6 for skink.)  hamster, as stated, isn't
giving us a recent report; and crake is running the test but it
doesn't use --enable-cassert, hence no TRAP.

So actually, every single buildfarm member that could be reporting
the subtrans.c crash is doing so.  But only about half of them are
reporting the walsender.c crash.
        regards, tom lane



I wrote:
> Whoa.  This just turned into a much larger can of worms than I expected.
> How can it be that processes are getting assertion crashes and yet the
> test framework reports success anyway?  That's impossibly
> broken/unacceptable.

I poked into this on my laptop, where I'm able to reproduce both assertion
failures.  The reason the one in subtrans.c is not being detected by the
test framework is that it happens late in the run and the test doesn't
do anything more with that server than shut it down.  "pg_ctl stop"
actually notices there's a problem; for instance, the log on tern for
this step shows

### Stopping node "master" using mode immediate
# Running: pg_ctl -D /home/nm/farm/gcc32/HEAD/pgsql.build/src/test/recovery/tmp_check/data_master_xHGA/pgdata -m immediate stop
pg_ctl: PID file "/home/nm/farm/gcc32/HEAD/pgsql.build/src/test/recovery/tmp_check/data_master_xHGA/pgdata/postmaster.pid" does not exist
Is server running?
# No postmaster PID

However, PostgresNode::stop blithely ignores the failure exit from
pg_ctl.  I think it should use system_or_bail() not just system_log(),
so that we'll notice if the server is not running when we expect it
to be.  It's conceivable that there should also be a "stop_if_running"
method that allows the case where the server is not running, but so
far as I can find, no existing TAP test needs such a behavior --- and
it certainly shouldn't be the default.

The walsender.c crash is harder to detect because the postmaster very
nicely recovers and restarts its children.  That's great for robustness,
but it sucks for testing.  I think that we ought to disable that in
TAP tests, which we can do by having PostgresNode::init include
"restart_after_crash = off" in the postgresql.conf modifications it
applies.  Again, it's conceivable that some tests would not want that,
but there is no existing test that seems to need crash recovery on,
and I don't see a good argument for it being default for test purposes.

As best I've been able to tell, the reason why the walsender.c crash
is detected when it's detected is that the TAP script chances to try
to connect while the recovery is happening:

connection error: 'psql: FATAL:  the database system is in recovery mode'
while running 'psql -XAtq -d port=57718 host=/tmp/_uP8FKEynq dbname='postgres' -f - -v ON_ERROR_STOP=1' at
/home/nm/farm/gcc32/HEAD/pgsql.build/src/test/recovery/../../../src/test/perl/PostgresNode.pm line 1173.

The window for that is not terribly wide, explaining why the detection
is unreliable even when the crash does occur.

In short then, I propose the attached patch to make these cases fail
more reliably.  We might extend this later to allow the old behaviors
to be explicitly opted-into, but we don't seem to need that today.

            regards, tom lane

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index cb84f1f..f8a0581 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -402,6 +402,7 @@ sub init
     open my $conf, '>>', "$pgdata/postgresql.conf";
     print $conf "\n# Added by PostgresNode.pm\n";
     print $conf "fsync = off\n";
+    print $conf "restart_after_crash = off\n";
     print $conf "log_line_prefix = '%m [%p] %q%a '\n";
     print $conf "log_statement = all\n";
     print $conf "port = $port\n";
@@ -675,7 +676,7 @@ sub stop
     $mode = 'fast' unless defined $mode;
     return unless defined $self->{_pid};
     print "### Stopping node \"$name\" using mode $mode\n";
-    TestLib::system_log('pg_ctl', '-D', $pgdata, '-m', $mode, 'stop');
+    TestLib::system_or_bail('pg_ctl', '-D', $pgdata, '-m', $mode, 'stop');
     $self->{_pid} = undef;
     $self->_update_pid;
 }


Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders

From: Thomas Munro
On Sun, Apr 23, 2017 at 3:41 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
>> The assertion fails reliably for me, because standby2's reported write
>> LSN jumps backwards after the timeline changes: for example I see
>> 3020000 then 3028470 then 3020000 followed by a normal progression.
>> Surprisingly, 004_timeline_switch.pl reports success anyway.  I'm not
>> sure why the test fails sometimes on tern, but you can see that even
>> when it passed on tern the assertion had failed.
>
> Whoa.  This just turned into a much larger can of worms than I expected.
> How can it be that processes are getting assertion crashes and yet the
> test framework reports success anyway?  That's impossibly
> broken/unacceptable.

Agreed, thanks for fixing that.

> Looking closer at the tern report we started the thread with, there
> are actually TWO assertion trap reports, the one Alvaro noted and
> another one in 009_twophase_master.log:
>
> TRAP: FailedAssertion("!(*ptr == ((TransactionId) 0) || (*ptr == parent && overwriteOK))", File: "subtrans.c", Line: 92)

I see you started another thread for that one.  I admit I spent a
couple of hours trying to figure this out before I saw your email, but
I was looking at the wrong bit of git history and didn't spot that
it's likely a 7 year old problem.  So this is a good result for these
TAP tests, despite teething difficulties with, erm, "pass" vs "fail"
and the fact that 009_twophase.pl was bombing from the moment it was
committed.  Hoping to use this framework in future work.

>> Here is a fix for the assertion failure.
>
> As for this patch itself, is it reasonable to try to assert that the
> timeline has in fact changed?

The protocol doesn't include the timeline in reply messages, so it's
not clear how the upstream server would know what timeline the standby
thinks it's dealing with in any given reply message.  The sending
server has its own idea of the current timeline but it's not in sync
with the stream of incoming replies.
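
For illustration only, here is a rough Perl sketch of the standby status update ('r') message layout as described in the streaming replication protocol documentation: three LSNs, a send timestamp and a reply-requested flag, and no timeline field anywhere.  (This is a hand-rolled sketch, not code from the tree.)

use strict;
use warnings;

# Standby status update ('r'): write/flush/apply LSNs as 64-bit big-endian
# integers, a timestamp (microseconds since 2000-01-01), and a
# reply-requested byte.  Note the absence of any timeline identifier.
sub make_standby_reply
{
    my ($write_lsn, $flush_lsn, $apply_lsn, $now_us, $request_reply) = @_;
    return pack('a1 Q> Q> Q> Q> C',
        'r', $write_lsn, $flush_lsn, $apply_lsn, $now_us, $request_reply);
}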

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders

From: Mark Dilger
> On Apr 22, 2017, at 11:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I wrote:
>> Whoa.  This just turned into a much larger can of worms than I expected.
>> How can it be that processes are getting assertion crashes and yet the
>> test framework reports success anyway?  That's impossibly
>> broken/unacceptable.
>
> I poked into this on my laptop, where I'm able to reproduce both assertion
> failures.  The reason the one in subtrans.c is not being detected by the
> test framework is that it happens late in the run and the test doesn't
> do anything more with that server than shut it down.  "pg_ctl stop"
> actually notices there's a problem; for instance, the log on tern for
> this step shows
>
> ### Stopping node "master" using mode immediate
> # Running: pg_ctl -D /home/nm/farm/gcc32/HEAD/pgsql.build/src/test/recovery/tmp_check/data_master_xHGA/pgdata -m immediate stop
> pg_ctl: PID file "/home/nm/farm/gcc32/HEAD/pgsql.build/src/test/recovery/tmp_check/data_master_xHGA/pgdata/postmaster.pid" does not exist
> Is server running?
> # No postmaster PID
>
> However, PostgresNode::stop blithely ignores the failure exit from
> pg_ctl.  I think it should use system_or_bail() not just system_log(),
> so that we'll notice if the server is not running when we expect it
> to be.  It's conceivable that there should also be a "stop_if_running"
> method that allows the case where the server is not running, but so
> far as I can find, no existing TAP test needs such a behavior --- and
> it certainly shouldn't be the default.
>
> The walsender.c crash is harder to detect because the postmaster very
> nicely recovers and restarts its children.  That's great for robustness,
> but it sucks for testing.  I think that we ought to disable that in
> TAP tests, which we can do by having PostgresNode::init include
> "restart_after_crash = off" in the postgresql.conf modifications it
> applies.  Again, it's conceivable that some tests would not want that,
> but there is no existing test that seems to need crash recovery on,
> and I don't see a good argument for it being default for test purposes.
>
> As best I've been able to tell, the reason why the walsender.c crash
> is detected when it's detected is that the TAP script chances to try
> to connect while the recovery is happening:
>
> connection error: 'psql: FATAL:  the database system is in recovery mode'
> while running 'psql -XAtq -d port=57718 host=/tmp/_uP8FKEynq dbname='postgres' -f - -v ON_ERROR_STOP=1' at
> /home/nm/farm/gcc32/HEAD/pgsql.build/src/test/recovery/../../../src/test/perl/PostgresNode.pm line 1173.
>
> The window for that is not terribly wide, explaining why the detection
> is unreliable even when the crash does occur.
>
> In short then, I propose the attached patch to make these cases fail
> more reliably.  We might extend this later to allow the old behaviors
> to be explicitly opted-into, but we don't seem to need that today.
>
>             regards, tom lane

I pulled fresh sources with your latest commit, 7d68f2281a4b56834c8e5648fc7da0b73b674c45,
and the tests consistently fail (5 out of 5 attempts) for me on my laptop with:

/Applications/Xcode.app/Contents/Developer/usr/bin/make -C recovery check
for extra in contrib/test_decoding; do /Applications/Xcode.app/Contents/Developer/usr/bin/make -C '../../..'/$extra DESTDIR='/Users/mark/vanilla/bitoctetlength'/tmp_install install >>'/Users/mark/vanilla/bitoctetlength'/tmp_install/log/install.log || exit; done
rm -rf /Users/mark/vanilla/bitoctetlength/src/test/recovery/tmp_check/log
cd . && TESTDIR='/Users/mark/vanilla/bitoctetlength/src/test/recovery' PATH="/Users/mark/vanilla/bitoctetlength/tmp_install/usr/local/pgsql/bin:$PATH" DYLD_LIBRARY_PATH="/Users/mark/vanilla/bitoctetlength/tmp_install/usr/local/pgsql/lib" PGPORT='65432' PG_REGRESS='/Users/mark/vanilla/bitoctetlength/src/test/recovery/../../../src/test/regress/pg_regress' prove -I ../../../src/test/perl/ -I .  t/*.pl
t/001_stream_rep.pl .................. ok
t/002_archiving.pl ................... ok
t/003_recovery_targets.pl ............ ok
t/004_timeline_switch.pl ............. Bailout called.  Further testing stopped:  system pg_ctl failed
FAILED--Further testing stopped: system pg_ctl failed
make[2]: *** [check] Error 255
make[1]: *** [check-recovery-recurse] Error 2
make: *** [check-world-src/test-recurse] Error 2

The first time, I had some local changes, but I've stashed them and am still getting the error each of
these next four times.  I'm compiling as follows:

make distclean; make clean; ./configure --enable-cassert --enable-tap-tests --enable-depend && make -j4 && make check-world

Are the errors expected now?

mark


Mark Dilger <hornschnorter@gmail.com> writes:
>> On Apr 22, 2017, at 11:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> In short then, I propose the attached patch to make these cases fail
>> more reliably.  We might extend this later to allow the old behaviors
>> to be explicitly opted-into, but we don't seem to need that today.

> I pulled fresh sources with your latest commit, 7d68f2281a4b56834c8e5648fc7da0b73b674c45,
> and the tests consistently fail (5 out of 5 attempts) for me on my laptop with:

> Are the errors expected now?

Yes, as of right now, if HEAD *doesn't* fail the src/test/recovery tests
for you in an assert-enabled build, that would be surprising.  The next
step is to actually fix the bugs ... but the point of commit 7d68f2281
was to make these failures reproducible, which they were not before.
        regards, tom lane



Thomas Munro <thomas.munro@enterprisedb.com> writes:
> On Sun, Apr 23, 2017 at 3:41 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> As for this patch itself, is it reasonable to try to assert that the
>> timeline has in fact changed?

> The protocol doesn't include the timeline in reply messages, so it's
> not clear how the upstream server would know what timeline the standby
> thinks it's dealing with in any given reply message.  The sending
> server has its own idea of the current timeline but it's not in sync
> with the stream of incoming replies.

Fair enough.  But I'd still like an explanation of why only about
half of the population is showing a failure here.  Seems like every
machine should be seeing the LSN as moving backwards in this test.
So (a) why aren't they all failing, and (b) should we change the
test to make sure every platform sees that happening?
        regards, tom lane



Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders

From: Simon Riggs
On 22 April 2017 at 16:41, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
>> The assertion fails reliably for me, because standby2's reported write
>> LSN jumps backwards after the timeline changes: for example I see
>> 3020000 then 3028470 then 3020000 followed by a normal progression.
>> Surprisingly, 004_timeline_switch.pl reports success anyway.  I'm not
>> sure why the test fails sometimes on tern, but you can see that even
>> when it passed on tern the assertion had failed.
>
> Whoa.  This just turned into a much larger can of worms than I expected.
> How can it be that processes are getting assertion crashes and yet the
> test framework reports success anyway?  That's impossibly
> broken/unacceptable.

Agreed, thanks for fixing.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders

From: Thomas Munro
On Sun, Apr 23, 2017 at 6:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
>> On Sun, Apr 23, 2017 at 3:41 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> As for this patch itself, is it reasonable to try to assert that the
>>> timeline has in fact changed?
>
>> The protocol doesn't include the timeline in reply messages, so it's
>> not clear how the upstream server would know what timeline the standby
>> thinks it's dealing with in any given reply message.  The sending
>> server has its own idea of the current timeline but it's not in sync
>> with the stream of incoming replies.
>
> Fair enough.  But I'd still like an explanation of why only about
> half of the population is showing a failure here.  Seems like every
> machine should be seeing the LSN as moving backwards in this test.
> So (a) why aren't they all failing, and (b) should we change the
> test to make sure every platform sees that happening?

Every machine sees the LSN moving backwards, but the code path that
had the assertion is only reached if it decides to interpolate, which is
timing dependent: there needs to be a future sample in the lag
tracking buffer, which I guess is not the case in those runs.  For
example, "jacana" is one of the animals that reported the assertion
failure for 009_twophase.pl but not the one for
004_timeline_switch.pl, in this run:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2017-04-20%2021%3A00%3A20

This standby2 log fragment corresponds to the LSN zigzag manoeuvre:

2017-04-20 22:04:51.744 GMT [1984] LOG:  fetching timeline history file for timeline 2 from primary server
2017-04-20 22:04:51.744 GMT [1984] LOG:  started streaming WAL from primary at 0/3000000 on timeline 1
2017-04-20 22:04:51.744 GMT [1984] LOG:  replication terminated by primary server
2017-04-20 22:04:51.744 GMT [1984] DETAIL:  End of WAL reached on timeline 1 at 0/3028D50.
2017-04-20 22:04:51.744 GMT [1708] LOG:  new target timeline is 2
2017-04-20 22:04:51.744 GMT [1984] LOG:  restarted WAL streaming at 0/3000000 on timeline 2

It looks approximately the same on my development machine where I got
the assertion failure until commit 84638808:

2017-04-24 09:04:46.563 NZST [95164] LOG:  fetching timeline history file for timeline 2 from primary server
2017-04-24 09:04:46.563 NZST [95164] LOG:  started streaming WAL from primary at 0/3000000 on timeline 1
2017-04-24 09:04:46.564 NZST [95164] LOG:  replication terminated by primary server
2017-04-24 09:04:46.564 NZST [95164] DETAIL:  End of WAL reached on timeline 1 at 0/3028470.
2017-04-24 09:04:46.565 NZST [95160] LOG:  new target timeline is 2
2017-04-24 09:04:46.565 NZST [95164] LOG:  restarted WAL streaming at 0/3000000 on timeline 2
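
Purely for illustration (this is not the walsender.c lag tracker, just a sketch of the interpolation idea mentioned above, assuming two buffered samples that bracket the reported LSN):

use strict;
use warnings;

# Estimate the send time of $lsn by linear interpolation between two
# buffered samples, each a hashref of the form { lsn => ..., time => ... }.
sub interpolate_send_time
{
    my ($lsn, $prev, $next) = @_;
    my $fraction = ($lsn - $prev->{lsn}) / ($next->{lsn} - $prev->{lsn});
    return $prev->{time} + $fraction * ($next->{time} - $prev->{time});
}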

-- 
Thomas Munro
http://www.enterprisedb.com



Thomas Munro <thomas.munro@enterprisedb.com> writes:
> On Sun, Apr 23, 2017 at 6:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Fair enough.  But I'd still like an explanation of why only about
>> half of the population is showing a failure here.  Seems like every
>> machine should be seeing the LSN as moving backwards in this test.
>> So (a) why aren't they all failing, and (b) should we change the
>> test to make sure every platform sees that happening?

> Every machine sees the LSN moving backwards, but the code path that
>> had the assertion is only reached if it decides to interpolate, which is
> timing dependent: there needs to be a future sample in the lag
> tracking buffer, which I guess is not the case in those runs.

I'm dissatisfied with this explanation because if it's just timing,
it doesn't seem very likely that some machines would reproduce the
failure every single time while others never would.  Maybe that can be
blamed on kernel scheduler vagaries + different numbers of cores, but
I can't escape the feeling that there's something here we've not
fully understood.

While chasing after this earlier today, I turned on some debug logging
and noted that the standby's reports look like

2017-04-23 15:46:46.206 EDT [34829] LOG:  database system is ready to accept read only connections
2017-04-23 15:46:46.212 EDT [34834] LOG:  fetching timeline history file for timeline 2 from primary server
2017-04-23 15:46:46.212 EDT [34834] LOG:  started streaming WAL from primary at 0/3000000 on timeline 1
2017-04-23 15:46:46.213 EDT [34834] LOG:  sending write 0/3020000 flush 0/3028470 apply 0/3028470
2017-04-23 15:46:46.214 EDT [34834] LOG:  replication terminated by primary server
2017-04-23 15:46:46.214 EDT [34834] DETAIL:  End of WAL reached on timeline 1 at 0/3028470.
2017-04-23 15:46:46.214 EDT [34834] LOG:  sending write 0/3028470 flush 0/3028470 apply 0/3028470
2017-04-23 15:46:46.214 EDT [34830] LOG:  new target timeline is 2
2017-04-23 15:46:46.214 EDT [34834] LOG:  restarted WAL streaming at 0/3000000 on timeline 2
2017-04-23 15:46:46.228 EDT [34834] LOG:  sending write 0/3020000 flush 0/3028470 apply 0/3028470

So you're right that the standby's reported "write" position can go
backward, but it seems pretty darn odd that the flush and apply
positions didn't go backward too.  Is there a bug there?

I remain of the opinion that if we can't tell from the transmitted
data whether a timeline switch has caused the position to go backward,
then that's a protocol shortcoming that ought to be fixed.
        regards, tom lane



Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders

From: Mark Dilger
> On Apr 23, 2017, at 7:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
>> On Sun, Apr 23, 2017 at 6:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Fair enough.  But I'd still like an explanation of why only about
>>> half of the population is showing a failure here.  Seems like every
>>> machine should be seeing the LSN as moving backwards in this test.
>>> So (a) why aren't they all failing, and (b) should we change the
>>> test to make sure every platform sees that happening?
>
>> Every machine sees the LSN moving backwards, but the code path that
>> had the assertion is only reached if it decides to interpolate, which is
>> timing dependent: there needs to be a future sample in the lag
>> tracking buffer, which I guess is not the case in those runs.
>
> I'm dissatisfied with this explanation because if it's just timing,
> it doesn't seem very likely that some machines would reproduce the
> failure every single time while others never would.  Maybe that can be
> blamed on kernel scheduler vagaries + different numbers of cores, but
> I can't escape the feeling that there's something here we've not
> fully understood.
>
> While chasing after this earlier today, I turned on some debug logging
> and noted that the standby's reports look like
>
> 2017-04-23 15:46:46.206 EDT [34829] LOG:  database system is ready to accept read only connections
> 2017-04-23 15:46:46.212 EDT [34834] LOG:  fetching timeline history file for timeline 2 from primary server
> 2017-04-23 15:46:46.212 EDT [34834] LOG:  started streaming WAL from primary at 0/3000000 on timeline 1
> 2017-04-23 15:46:46.213 EDT [34834] LOG:  sending write 0/3020000 flush 0/3028470 apply 0/3028470
> 2017-04-23 15:46:46.214 EDT [34834] LOG:  replication terminated by primary server
> 2017-04-23 15:46:46.214 EDT [34834] DETAIL:  End of WAL reached on timeline 1 at 0/3028470.
> 2017-04-23 15:46:46.214 EDT [34834] LOG:  sending write 0/3028470 flush 0/3028470 apply 0/3028470
> 2017-04-23 15:46:46.214 EDT [34830] LOG:  new target timeline is 2
> 2017-04-23 15:46:46.214 EDT [34834] LOG:  restarted WAL streaming at 0/3000000 on timeline 2
> 2017-04-23 15:46:46.228 EDT [34834] LOG:  sending write 0/3020000 flush 0/3028470 apply 0/3028470
>
> So you're right that the standby's reported "write" position can go
> backward, but it seems pretty darn odd that the flush and apply
> positions didn't go backward too.  Is there a bug there?
>
> I remain of the opinion that if we can't tell from the transmitted
> data whether a timeline switch has caused the position to go backward,
> then that's a protocol shortcoming that ought to be fixed.

The recent fix in 546c13e11b29a5408b9d6a6e3cca301380b47f7f has local variable overwriteOK
assigned but not used in twophase.c RecoverPreparedTransactions(void).  I'm not sure if that's
future-proofing or an oversight.  It seems to be used in other functions.  Just FYI.

Mark Dilger




Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders

From: Andres Freund
On 2017-04-24 15:41:25 -0700, Mark Dilger wrote:
> The recent fix in 546c13e11b29a5408b9d6a6e3cca301380b47f7f has local variable overwriteOK
> assigned but not used in twophase.c RecoverPreparedTransactions(void).  I'm not sure if that's
> future-proofing or an oversight.  It seems to be used in other functions.  Just FYI.

That's just a temporary workaround while the problem's being analyzed.

- Andres



Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders

From: Thomas Munro
On Mon, Apr 24, 2017 at 2:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
>> Every machine sees the LSN moving backwards, but the code path that
>> had the assertion is only reached if it decides to interpolate, which is
>> timing dependent: there needs to be a future sample in the lag
>> tracking buffer, which I guess is not the case in those runs.
>
> I'm dissatisfied with this explanation because if it's just timing,
> it doesn't seem very likely that some machines would reproduce the
> failure every single time while others never would.  Maybe that can be
> blamed on kernel scheduler vagaries + different numbers of cores, but
> I can't escape the feeling that there's something here we've not
> fully understood.

Ahh.  It is timing-dependent, but what I said before wasn't right.
The write LSN always jumps back to the start of the current segment,
but some machines don't report that upstream because of timing
effects.

I found a machine that doesn't reproduce the earlier assertion failure
(or, rather, an elog I put into that code path where the assertion
was).  It's a VirtualBox VM running Debian 8 on amd64, and it doesn't
matter whether I tell it to use 1 or 4 CPU cores.

After the timeline change, standby2 restarts WAL streaming from
0/3000000, and standby1 sends it two messages: one 128KB message, and
then another smaller one.  On this virtual machine, the inner message
processing loop in walreceiver.c reliably manages to read both
messages from the socket one after the other, taking it all the way to
0/3028470 (end of WAL).  Then it calls XLogWalRcvSendReply, so the
position reported to the upstream server never goes backwards.

On my other machines it reliably reads the first message, runs out of
data on the socket so it falls out of the loop, and calls
XLogWalRcvSendReply to report the intermediate location 0/3020000.
Then the socket's soon ready again, we go around the outer loop again
and finally report 0/3028470.
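
A toy simulation of that difference, purely illustrative (the LSNs are the ones from this test; each inner list stands for what one pass over the "socket" happens to return before it runs dry):

use strict;
use warnings;

# One reply is sent each time the inner message-processing loop runs out of
# data, so whether the intermediate 0/3020000 position ever gets reported
# upstream depends on how the messages happen to arrive.
sub replay_reads
{
    my @reports;
    for my $batch (@_)
    {
        my $write_lsn;
        $write_lsn = $_ for @$batch;                 # inner loop: consume what's available
        push @reports, sprintf('0/%X', $write_lsn);  # outer loop: report once the socket runs dry
    }
    return @reports;
}

print join(', ', replay_reads([ 0x3020000, 0x3028470 ])), "\n";    # both read in one pass: 0/3028470
print join(', ', replay_reads([ 0x3020000 ], [ 0x3028470 ])), "\n";  # split across passes: 0/3020000, 0/3028470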

> [...]
>
> So you're right that the standby's reported "write" position can go
> backward, but it seems pretty darn odd that the flush and apply
> positions didn't go backward too.  Is there a bug there?

Flush doesn't go backwards because XLogWalRcvFlush does nothing unless
Flush < Write.  I am not sure whether it constitutes a bug that we
rewrite the start of the current segment without flushing it.
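
Sketched in Perl for illustration (not the actual walreceiver.c code), the guard amounts to:

use strict;
use warnings;

# The flush position only ever advances toward the write position; it is
# never moved backwards when the write position rewinds.
sub advance_flush
{
    my ($write_lsn, $flush_lsn) = @_;
    return $flush_lsn if $flush_lsn >= $write_lsn;   # nothing new to flush
    # (the real code would flush the received WAL here)
    return $write_lsn;
}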

A timeline change isn't actually necessary for the
rewind-to-start-of-segment behaviour, by the way: we always do that at
the start of streaming, as described in RequestXLogStreaming.  But
it's only a problem for timeline changes, because in other cases
it's talking to a fresh walsender that doesn't have any state that
would have triggered the (now removed) assertion.  I think the change
that was committed in 84638808 was therefore correct.

> I remain of the opinion that if we can't tell from the transmitted
> data whether a timeline switch has caused the position to go backward,
> then that's a protocol shortcoming that ought to be fixed.

Another approach might have been to teach the switch case for
T_StartReplicationCmd in exec_replication_command to clear all
LagTracker state, since that is the point at which the remote
walreceiver has requested a new stream and all subsequent messages
will refer to that stream.

-- 
Thomas Munro
http://www.enterprisedb.com