
From Andres Freund
Subject Bogus recovery conflicts due to outdated XLogReceiptTime
Msg-id ymyivl5bw7qwd3bhduxmv5kytsl6q4nstjtledw5l6nr7sdwcw@nl7egmss5l2m
List pgsql-hackers
Hi,

While testing something independent, I had a replication setup with pgbench
r/w running on the primary and pgbench r/o on the standby, with a
replication slot set up and hot_standby_feedback=on.

When I saw recovery conflicts, I got worried that we screwed up something in
the recent changes around pruning horizons. But that wasn't it.  All the
conflicts I was seeing were due to buffer pin conflicts.

As the replica was keeping up perfectly, with replay lag well below
max_standby_streaming_delay, that should never have triggered conflicts.

Investigating this was not helped by the fact that we don't log *anything*
from the startup process when
  LockBufferForCleanup()
  -> ResolveRecoveryConflictWithBufferPin()
enters the "We're already behind" path to SendRecoveryConflictWithBufferPin().
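
For reference, roughly the kind of logging I mean, sketched against the
"already behind" branch in ResolveRecoveryConflictWithBufferPin() (the
elog() is hypothetical, stock postgres is silent here):

    ltime = GetStandbyLimitTime();

    if (GetCurrentTimestamp() >= ltime && ltime != 0)
    {
        /*
         * We're already behind, so clear a path as quickly as possible.
         * Hypothetical logging - nothing like this exists today.
         */
        elog(LOG, "buffer pin conflict: standby limit time %s already passed",
             timestamptz_to_str(ltime));
        SendRecoveryConflictWithBufferPin(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
    }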


I added some logging and found that GetStandbyLimitTime() thought we were
quite far behind, even though the actual replay lag was < 1s.


I think the cause of these conflicts is very old - basically, the basis on
which we compute GetStandbyLimitTime() is bogus when streaming:

    /*
     * Walreceiver is active, so see if new data has arrived.
     *
     * We only advance XLogReceiptTime when we obtain fresh
     * WAL from walreceiver and observe that we had already
     * processed everything before the most recent "chunk"
     * that it flushed to disk.  In steady state where we are
     * keeping up with the incoming data, XLogReceiptTime will
     * be updated on each cycle. When we are behind,
     * XLogReceiptTime will not advance, so the grace time
     * allotted to conflicting queries will decrease.
     */
    if (RecPtr < flushedUpto)
        havedata = true;
    else
    {
        XLogRecPtr    latestChunkStart;

        flushedUpto = GetWalRcvFlushRecPtr(&latestChunkStart, &receiveTLI);
        if (RecPtr < flushedUpto && receiveTLI == curFileTLI)
        {
            havedata = true;
            if (latestChunkStart <= RecPtr)
            {
                XLogReceiptTime = GetCurrentTimestamp();
                SetCurrentChunkStartTime(XLogReceiptTime);
            }
        }
        else
        havedata = false;
    }


I think the logic explained in the comment doesn't make much sense (though
it's not trivial to come up with something better, see below). If the
standby is keeping up, the primary is busy, and walsender & walreceiver are
keeping up, it's entirely normal for the walreceiver to have received one or
more additional "chunks" from the primary.  If the primary is committing
frequently and the standby is keeping up with the incoming data, walsender
will send the data out in small chunks, making it very likely that the
latest chunk is newer than what we just processed.
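
To make that concrete, here's a hypothetical steady-state sequence (made-up
LSNs) in which the standby is fully caught up, yet XLogReceiptTime never
advances:

    RecPtr (next record to replay)              = 0/5000
    walreceiver flushes chunk A: 0/5000-0/5400    (latestChunkStart = 0/5000)
    walreceiver flushes chunk B: 0/5400-0/5800    (latestChunkStart = 0/5400)
    startup refreshes: flushedUpto = 0/5800, latestChunkStart = 0/5400
    RecPtr < flushedUpto                        => havedata = true
    latestChunkStart (0/5400) > RecPtr (0/5000) => XLogReceiptTime not updated

Two small chunks arriving between two wakeups of the startup process are
enough for the "latestChunkStart <= RecPtr" test to fail.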


I've seen now-rtime in GetStandbyLimitTime() be many minutes, even though the
actual replay lag was always under 2s, and most of the time < 0.5s.




Tracing through the history (the code has moved around a lot), this seems to
have originated in:


commit e76c1a0f4d2127f11c72c02b3d73a5dcb4517173
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   2010-07-03 20:43:58 +0000

    Replace max_standby_delay with two parameters, max_standby_archive_delay and
    max_standby_streaming_delay, and revise the implementation to avoid assuming
    that timestamps found in WAL records can meaningfully be compared to clock
    time on the standby server.  Instead, the delay limits are compared to the
    elapsed time since we last obtained a new WAL segment from archive or since
    we were last "caught up" to WAL data arriving via streaming replication.
    This avoids problems with clock skew between primary and standby, as well
    as other corner cases that the original coding would misbehave in, such
    as the primary server having significant idle time between transactions.
    Per my complaint some time ago and considerable ensuing discussion.

    Do some desultory editing on the hot standby documentation, too.




I think I've observed this problem in production setups, i.e. recovery
conflicts being triggered despite everything seemingly keeping up, with no
"source" of recovery conflicts apparent in the log.  I just never had enough
data to figure out what's going on.


It really doesn't help that XLogReceiptTime is basically unobservable in
stock postgres. Recovery doesn't log that it thinks we are behind (or how
far behind), and it doesn't even log that it triggers a recovery conflict in
that case (the cancellation itself is logged, though).



Now, how to fix that:

It doesn't seem trivial. As the commit message cited above explains, we don't
want to rely on the timestamps from the primary, so we can't just use
XLogRecoveryCtl->recoveryLastXTime or such.  And we can't just always update
XLogReceiptTime in WaitForWALToBecomeAvailable() (or even more so in
walreceiver whenever we receive data), because that'd lead to XLogReceiptTime
being advanced way too aggressively.

I wonder if a decent minimal improvement would be to update XLogReceiptTime
whenever we actually had to refresh flushedUpto from the walreceiver and the
new position is >= RecPtr.  While that doesn't entirely avoid updating
XLogReceiptTime too aggressively, I think it may bound it sufficiently,
because if we are behind, we don't need to call GetWalRcvFlushRecPtr() very
often - we only need to do so again after we have replayed all the WAL that
was available last time.  It'd be a lot more accurate than what we do today,
because it'd avoid the "latestChunkStart <= RecPtr" condition, which is
often unreachable today.
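
In terms of the code quoted above, the shape of that change would be
something like this (untested sketch):

    if (RecPtr < flushedUpto)
        havedata = true;
    else
    {
        flushedUpto = GetWalRcvFlushRecPtr(NULL, &receiveTLI);
        if (RecPtr < flushedUpto && receiveTLI == curFileTLI)
        {
            havedata = true;

            /*
             * We had to refresh flushedUpto to cover RecPtr, i.e. we
             * were not sitting on a backlog of already-received WAL,
             * so count this as fresh receipt - without the
             * "latestChunkStart <= RecPtr" test that is rarely true
             * under a busy primary.
             */
            XLogReceiptTime = GetCurrentTimestamp();
            SetCurrentChunkStartTime(XLogReceiptTime);
        }
        else
            havedata = false;
    }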

I guess we could also use recoveryLastXTime as an additional lower bound for
the apply delay? That way it'd "help" avoid a too-stale XLogReceiptTime if
the clocks are in sync, but not hurt if they are out of sync?
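
Made concrete, GetStandbyLimitTime() could hypothetically clamp the receipt
time like this (GetLatestXTime() returns recoveryLastXTime):

    GetXLogReceiptTime(&rtime, &fromStream);

    /*
     * Hypothetical: use the timestamp of the last replayed commit/abort
     * record as an additional lower bound, so a stale XLogReceiptTime
     * doesn't eat into the grace time when the clocks are in sync.
     */
    xtime = GetLatestXTime();
    if (xtime != 0 && xtime > rtime)
        rtime = xtime;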


A better solution would probably be a proper "LSN range -> received time"
mapping.  In walsender we have LagTracker, which keeps a ringbuffer of LSN to
time mappings and computes lags for write/flush/replay.  Perhaps we could
generalize that?
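
A very rough sketch of what that could look like (hypothetical names,
ignoring the shared memory / locking plumbing it would actually need):

    typedef struct WalReceiptSample
    {
        XLogRecPtr  lsn;    /* WAL received up to this point... */
        TimestampTz time;   /* ...at this standby-local time */
    } WalReceiptSample;

    #define RECEIPT_SAMPLES 1024

    static WalReceiptSample samples[RECEIPT_SAMPLES];
    static int  write_head = 0;

    /* walreceiver: record the receipt time after each flush */
    static void
    ReceiptSampleAdd(XLogRecPtr flushed_lsn, TimestampTz now)
    {
        samples[write_head].lsn = flushed_lsn;
        samples[write_head].time = now;
        write_head = (write_head + 1) % RECEIPT_SAMPLES;
    }

    /*
     * startup: the receipt time of the WAL at replay position 'recptr'
     * is the time of the oldest sample with lsn >= recptr.
     */
    static TimestampTz
    ReceiptTimeForLSN(XLogRecPtr recptr)
    {
        TimestampTz result = 0;

        for (int off = 1; off <= RECEIPT_SAMPLES; off++)
        {
            WalReceiptSample *s =
                &samples[(write_head - off + RECEIPT_SAMPLES) % RECEIPT_SAMPLES];

            if (s->time == 0 || s->lsn < recptr)
                break;          /* older samples don't cover recptr */
            result = s->time;   /* oldest-so-far sample covering recptr */
        }
        return result;
    }

GetStandbyLimitTime() would then compute the grace time from that, rather
than from a single global XLogReceiptTime.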




There's an easy way to reproduce this issue: configure the standby with
recovery_min_apply_delay = 1s and run a workload on the primary that commits
more than once a second.

If you then configure walreceiver to log debug2 messages (or change the log
level in ProcessWalSndrMessage(), which makes the volume easier to deal
with), you can see the "replication apply delay" in those messages creep up
very rapidly (as we never hit the "latest chunk" logic anymore).
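
Concretely, assuming a stock pgbench setup, something like:

    # standby postgresql.conf
    recovery_min_apply_delay = '1s'
    log_min_messages = debug2      # to see walreceiver's apply-delay messages

    # primary: anything committing more than once a second, e.g.
    pgbench -c 4 -T 600 postgres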


Greetings,

Andres Freund


