Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders - Mailing list pgsql-hackers

From: Thomas Munro
Subject: Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders
Msg-id: CAEepm=18jxXeMJF8A6GkozC-ssjmJRqSN_86hun68i+N7ga9nA@mail.gmail.com
In response to: Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers

On Mon, Apr 24, 2017 at 2:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
>> Every machine sees the LSN moving backwards, but the code path that
>> had the assertion is only reached if it decides to interpolate, which
>> is timing-dependent: there needs to be a future sample in the lag
>> tracking buffer, which I guess is not the case in those runs.
>
> I'm dissatisfied with this explanation because if it's just timing,
> it doesn't seem very likely that some machines would reproduce the
> failure every single time while others never would.  Maybe that can be
> blamed on kernel scheduler vagaries + different numbers of cores, but
> I can't escape the feeling that there's something here we've not
> fully understood.

Ahh.  It is timing-dependent, but what I said before wasn't right.
The write LSN always jumps back to the start of the current segment,
but some machines don't report that upstream because of timing
effects.

I found a machine that doesn't reproduce the earlier assertion failure
(or, rather, an elog I put into that code path where the assertion
was).  It's a VirtualBox VM running Debian 8 on amd64, and it doesn't
matter whether I tell it to use 1 or 4 CPU cores.

After the timeline change, standby2 restarts WAL streaming from
0/3000000, and standby1 sends it two messages: one 128KB message, and
then another smaller one.  On this virtual machine, the inner message
processing loop in walreceiver.c reliably manages to read both
messages from the socket one after the other, taking it all the way to
0/3028470 (end of WAL).  Then it calls XLogWalRcvSendReply, so the
position reported to the upstream server never goes backwards.
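
For context, the relevant loop in walreceiver.c has roughly this shape
(a paraphrased sketch; error handling, timeout resets and some other
calls are omitted, so this is not a literal excerpt):

    /* Paraphrased from the message-processing loop in WalReceiverMain() */
    len = walrcv_receive(wrconn, &buf, &wait_fd);
    if (len != 0)
    {
        /* Process this message and anything else readable without blocking. */
        for (;;)
        {
            if (len > 0)
                XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1);
            else
                break;  /* len == 0: socket drained; len < 0: end of stream */
            len = walrcv_receive(wrconn, &buf, &wait_fd);
        }

        /* Report the positions we reached back to the walsender. */
        XLogWalRcvSendReply(false, false);
        XLogWalRcvFlush(false);
    }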

On my other machines it reliably reads the first message, runs out of
data on the socket so it falls out of the loop, and calls
XLogWalRcvSendReply to report the intermediate location 0/3020000.
Then the socket soon becomes ready again, so we go around the outer
loop once more and finally report 0/3028470.

> [...]
>
> So you're right that the standby's reported "write" position can go
> backward, but it seems pretty darn odd that the flush and apply
> positions didn't go backward too.  Is there a bug there?

Flush doesn't go backwards because XLogWalRcvFlush does nothing unless
Flush < Write.  I am not sure whether it constitutes a bug that we
rewrite the start of the current segment without flushing it.
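
The guard looks roughly like this (paraphrased from XLogWalRcvFlush in
walreceiver.c; the shared-memory updates and reply are omitted):

    /* Paraphrased: a write position that moved backwards never triggers a flush. */
    static void
    XLogWalRcvFlush(bool dying)
    {
        if (LogstreamResult.Flush < LogstreamResult.Write)
        {
            issue_xlog_fsync(recvFile, recvSegNo);
            LogstreamResult.Flush = LogstreamResult.Write;
            /* ... update shared state, wake the startup process, maybe reply ... */
        }
    }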

A timeline change isn't actually necessary for the
rewind-to-start-of-segment behaviour, by the way: we always do that at
the start of streaming, as described in RequestXLogStreaming.  But
it's only a problem for timeline changes, because in other cases
it's talking to a fresh walsender that doesn't have any state that
would have triggered the (now removed) assertion.  I think the change
that was committed in 84638808 was therefore correct.
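
For reference, the rounding in RequestXLogStreaming (xlog.c) is roughly:

    /*
     * Paraphrased from RequestXLogStreaming(): streaming always starts at a
     * segment boundary, so a (re)start can move the write position backwards.
     */
    if (recptr % XLogSegSize != 0)
        recptr -= recptr % XLogSegSize;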

> I remain of the opinion that if we can't tell from the transmitted
> data whether a timeline switch has caused the position to go backward,
> then that's a protocol shortcoming that ought to be fixed.

Another approach might have been to teach the switch case for
T_StartReplicationCmd in exec_replication_command to clear all
LagTracker state, since that is the point at which the remote
walreceiver has requested a new stream and all subsequent messages
will refer to that stream.
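
Something along these lines in exec_replication_command, say (purely a
hypothetical sketch, not a proposed patch; it assumes the lag tracking
state is a walsender-local LagTracker struct that can simply be zeroed):

    case T_StartReplicationCmd:
        {
            StartReplicationCmd *cmd = (StartReplicationCmd *) cmd_node;

            if (cmd->kind == REPLICATION_KIND_PHYSICAL)
            {
                /*
                 * Hypothetical: forget samples recorded for the previous
                 * stream, so they can't be interpolated against the new one.
                 */
                memset(&LagTracker, 0, sizeof(LagTracker));
                StartReplication(cmd);
            }
            else
                StartLogicalReplication(cmd);
            break;
        }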

-- 
Thomas Munro
http://www.enterprisedb.com


