Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders
Date
Msg-id CAEepm=1KQyN+jd+r9X9L2D1zQhH+yofaoUQzVpTBhKm4NG90Yw@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Sun, Apr 23, 2017 at 6:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
>> On Sun, Apr 23, 2017 at 3:41 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> As for this patch itself, is it reasonable to try to assert that the
>>> timeline has in fact changed?
>
>> The protocol doesn't include the timeline in reply messages, so it's
>> not clear how the upstream server would know what timeline the standby
>> thinks it's dealing with in any given reply message.  The sending
>> server has its own idea of the current timeline but it's not in sync
>> with the stream of incoming replies.
>
> Fair enough.  But I'd still like an explanation of why only about
> half of the population is showing a failure here.  Seems like every
> machine should be seeing the LSN as moving backwards in this test.
> So (a) why aren't they all failing, and (b) should we change the
> test to make sure every platform sees that happening?

Every machine sees the LSN moving backwards, but the code path that
had the assertion only reached if it decides to interpolate, which is
timing dependent: there needs to be a future sample in the lag
tracking buffer, which I guess is not the case in those runs.  For
example, "jacana" is one of the animals that reported the assertion
failure for 009_twophase.pl but not the one for
004_timeline_switch.pl, in this run:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2017-04-20%2021%3A00%3A20

This standby2 log fragment corresponds to the LSN zigzag manoeuvre:

2017-04-20 22:04:51.744 GMT [1984] LOG:  fetching timeline history
file for timeline 2 from primary server
2017-04-20 22:04:51.744 GMT [1984] LOG:  started streaming WAL from
primary at 0/3000000 on timeline 1
2017-04-20 22:04:51.744 GMT [1984] LOG:  replication terminated by
primary server
2017-04-20 22:04:51.744 GMT [1984] DETAIL:  End of WAL reached on
timeline 1 at 0/3028D50.
2017-04-20 22:04:51.744 GMT [1708] LOG:  new target timeline is 2
2017-04-20 22:04:51.744 GMT [1984] LOG:  restarted WAL streaming at
0/3000000 on timeline 2

It looks approximately the same on my development machine where I got
the assertion failure until commit 84638808:

2017-04-24 09:04:46.563 NZST [95164] LOG:  fetching timeline history
file for timeline 2 from primary server
2017-04-24 09:04:46.563 NZST [95164] LOG:  started streaming WAL from
primary at 0/3000000 on timeline 1
2017-04-24 09:04:46.564 NZST [95164] LOG:  replication terminated by
primary server
2017-04-24 09:04:46.564 NZST [95164] DETAIL:  End of WAL reached on
timeline 1 at 0/3028470.
2017-04-24 09:04:46.565 NZST [95160] LOG:  new target timeline is 2
2017-04-24 09:04:46.565 NZST [95164] LOG:  restarted WAL streaming at
0/3000000 on timeline 2

-- 
Thomas Munro
http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: Andrew Dunstan
Date:
Subject: [HACKERS] TAP tests - installcheck vs check
Next
From: Andres Freund
Date:
Subject: Re: [HACKERS] valgrind error in subscription code