Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders
Date
Msg-id CAEepm=1J1PxBjUNthkjc__mZLgO4T-huK6tSoSAdCz+vuy2Y5A@mail.gmail.com
In response to Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Sun, Apr 23, 2017 at 3:41 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
>> The assertion fails reliably for me, because standby2's reported write
>> LSN jumps backwards after the timeline changes: for example I see
>> 3020000 then 3028470 then 3020000 followed by a normal progression.
>> Surprisingly, 004_timeline_switch.pl reports success anyway.  I'm not
>> sure why the test fails sometimes on tern, but you can see that even
>> when it passed on tern the assertion had failed.
>
> Whoa.  This just turned into a much larger can of worms than I expected.
> How can it be that processes are getting assertion crashes and yet the
> test framework reports success anyway?  That's impossibly
> broken/unacceptable.

Agreed, thanks for fixing that.

> Looking closer at the tern report we started the thread with, there
> are actually TWO assertion trap reports, the one Alvaro noted and
> another one in 009_twophase_master.log:
>
> TRAP: FailedAssertion("!(*ptr == ((TransactionId) 0) || (*ptr == parent && overwriteOK))", File: "subtrans.c", Line: 92)

I see you started another thread for that one.  I admit I spent a
couple of hours trying to figure this out before I saw your email, but
I was looking at the wrong bit of git history and didn't spot that
it's likely a 7-year-old problem.  So this is a good result for these
TAP tests, despite teething difficulties with, erm, "pass" vs "fail"
and the fact that 009_twophase.pl was bombing from the moment it was
committed.  Hoping to use this framework in future work.

>> Here is a fix for the assertion failure.
>
> As for this patch itself, is it reasonable to try to assert that the
> timeline has in fact changed?

The protocol doesn't include the timeline in reply messages, so it's
not clear how the upstream server would know what timeline the standby
thinks it's dealing with in any given reply message.  The sending
server has its own idea of the current timeline but it's not in sync
with the stream of incoming replies.

-- 
Thomas Munro
http://www.enterprisedb.com


