Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders - Mailing list pgsql-hackers
From | Thomas Munro |
---|---|
Subject | Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders |
Date | |
Msg-id | CAEepm=18jxXeMJF8A6GkozC-ssjmJRqSN_86hun68i+N7ga9nA@mail.gmail.com |
In response to | Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders (Tom Lane <tgl@sss.pgh.pa.us>) |
List | pgsql-hackers |
On Mon, Apr 24, 2017 at 2:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
>> Every machine sees the LSN moving backwards, but the code path that
>> had the assertion is only reached if it decides to interpolate, which is
>> timing dependent: there needs to be a future sample in the lag
>> tracking buffer, which I guess is not the case in those runs.
>
> I'm dissatisfied with this explanation because if it's just timing,
> it doesn't seem very likely that some machines would reproduce the
> failure every single time while others never would. Maybe that can be
> blamed on kernel scheduler vagaries + different numbers of cores, but
> I can't escape the feeling that there's something here we've not
> fully understood.

Ahh. It is timing-dependent, but what I said before wasn't right. The write LSN always jumps back to the start of the current segment, but some machines don't report that upstream because of timing effects.

I found a machine that doesn't reproduce the earlier assertion failure (or, rather, an elog I put into that code path where the assertion was). It's a VirtualBox VM running Debian 8 on amd64, and it doesn't matter whether I tell it to use 1 or 4 CPU cores.

After the timeline change, standby2 restarts WAL streaming from 0/3000000, and standby1 sends it two messages: one 128KB message, and then another smaller one. On this virtual machine, the inner message processing loop in walreceiver.c reliably manages to read both messages from the socket one after the other, taking it all the way to 0/3028470 (end of WAL). Then it calls XLogWalRcvSendReply, so the position reported to the upstream server never goes backwards. On my other machines it reliably reads the first message, runs out of data on the socket so it falls out of the loop, and calls XLogWalRcvSendReply to report the intermediate location 0/3020000. Then the socket is soon ready again, we go around the outer loop once more and finally report 0/3028470.

> [...]
>
> So you're right that the standby's reported "write" position can go
> backward, but it seems pretty darn odd that the flush and apply
> positions didn't go backward too. Is there a bug there?

Flush doesn't go backwards because XLogWalRcvFlush does nothing unless Flush < Write. I am not sure whether it constitutes a bug that we rewrite the start of the current segment without flushing it.

A timeline change isn't actually necessary for the rewind-to-start-of-segment behaviour, by the way: we always do that at the start of streaming, as described in RequestXLogStreaming. But it's only a problem for timeline changes, because in other cases we're talking to a fresh walsender that doesn't have any state that would have triggered the (now removed) assertion. I think the change that was committed in 84638808 was therefore correct.

> I remain of the opinion that if we can't tell from the transmitted
> data whether a timeline switch has caused the position to go backward,
> then that's a protocol shortcoming that ought to be fixed.

Another approach might have been to teach the switch case for T_StartReplicationCmd in exec_replication_command to clear all LagTracker state, since that is the point at which the remote walreceiver has requested a new stream and all subsequent messages will refer to that stream.

--
Thomas Munro
http://www.enterprisedb.com
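
[Editorial addendum] To make the timing effect described above concrete, here is a minimal standalone sketch. It is not PostgreSQL source: the walreceiver's inner loop is reduced to a counter, and the two message end-LSNs (0/3020000 and 0/3028470) are simply the values from the scenario in the message. It only shows how the number of messages readable per wakeup decides whether the intermediate location 0/3020000 ever gets reported upstream.

```c
/*
 * Hypothetical sketch, not PostgreSQL source.  Models a walreceiver-style
 * inner loop that keeps consuming whatever is on the socket and only
 * reports progress once the socket runs dry.
 */
#include <stdio.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* End LSNs of the two messages standby1 sends after the timeline switch. */
static const XLogRecPtr message_end[] = {0x3020000, 0x3028470};
static const int nmessages = 2;

static void
simulate(int msgs_per_wakeup)
{
    XLogRecPtr write_ptr = 0;
    int        next = 0;

    printf("socket delivers %d message(s) per wakeup:\n", msgs_per_wakeup);

    while (next < nmessages)
    {
        int avail = msgs_per_wakeup;

        /* inner loop: keep processing while the socket still has data */
        while (avail-- > 0 && next < nmessages)
            write_ptr = message_end[next++];    /* "write" the message's WAL */

        /*
         * Out of data: fall out of the inner loop and report progress, as
         * XLogWalRcvSendReply does.  Whether this looks like a backwards
         * move on the primary depends on what had been reported before the
         * stream restarted at 0/3000000.
         */
        printf("  reply: write = %X/%X\n",
               (unsigned) (write_ptr >> 32), (unsigned) write_ptr);
    }
}

int
main(void)
{
    simulate(1);    /* most machines: intermediate 0/3020000 gets reported */
    simulate(2);    /* the VirtualBox VM: only the final 0/3028470 is sent */
    return 0;
}
```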
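
A similarly hedged sketch of the guard mentioned above ("XLogWalRcvFlush does nothing unless Flush < Write"). The function body here is a stand-in, not the real flush-and-fsync code, and the LSNs are example values; it just illustrates why rewinding the write position to the segment start never drags the flush position backwards.

```c
/* Hypothetical sketch, not PostgreSQL source. */
#include <stdio.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

static struct
{
    XLogRecPtr Write;       /* how far WAL has been written */
    XLogRecPtr Flush;       /* how far it has been flushed (and reported) */
} LogstreamResult;

static void
flush_if_needed(void)
{
    /* The flush position only ever moves forward because of this guard. */
    if (LogstreamResult.Flush < LogstreamResult.Write)
        LogstreamResult.Flush = LogstreamResult.Write;  /* fsync + reply here */
}

static void
report(const char *when)
{
    printf("%-16s write = %X/%X, flush = %X/%X\n", when,
           (unsigned) (LogstreamResult.Write >> 32),
           (unsigned) LogstreamResult.Write,
           (unsigned) (LogstreamResult.Flush >> 32),
           (unsigned) LogstreamResult.Flush);
}

int
main(void)
{
    LogstreamResult.Write = 0x3028470;  /* end of WAL before streaming restarts */
    flush_if_needed();
    report("before restart:");

    LogstreamResult.Write = 0x3000000;  /* rewound to the segment start */
    flush_if_needed();                  /* guard is false: Flush stays put */
    report("after rewind:");

    return 0;
}
```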