Re: archive status ".ready" files may be created too early - Mailing list pgsql-hackers

From Bossart, Nathan
Subject Re: archive status ".ready" files may be created too early
Msg-id DA71434B-7340-4984-9B91-F085BC47A778@amazon.com
In response to Re: archive status ".ready" files may be created too early  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
Responses Re: archive status ".ready" files may be created too early
Re: archive status ".ready" files may be created too early
List pgsql-hackers
On 7/30/21, 11:34 AM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote:
> Hmm ... I'm not sure we're prepared to backpatch this kind of change.
> It seems a bit too disruptive to how replay works.  I think we
> should be focusing solely on patch 0001 to surgically fix the precise
> bug you see.  Does patch 0002 exist because you think that a system with
> only 0001 will not correctly deal with a crash at the right time?

Yes, that was what I was worried about.  However, I just performed a
variety of tests with just 0001 applied, and I am beginning to suspect
my concerns were unfounded.  With wal_buffers set very high,
synchronous_commit set to off, and a long sleep at the end of
XLogWrite(), I can reliably cause the archive status files to lag far
behind the current open WAL segment.  However, even if I crash at this
time, the .ready files are created when the server restarts (albeit
out of order).  This appears to be due to the call to
XLogArchiveCheckDone() in RemoveOldXlogFiles().  Therefore, we can
likely abandon 0002.

> Now, the reason I'm looking at this patch series is that we're seeing a
> related problem with walsender/walreceiver, which apparently are capable
> of creating a file in the replica that ends up not existing in the
> primary after a crash, for a reason closely related to what you
> describe for WAL archival.  I'm not sure what is going on just yet, so
> I'm not going to try and explain because I'm likely to get it wrong.

I've suspected that this is due to the use of the flushed location for
the send pointer, which AFAICT needn't align with a WAL record
boundary.

                /*
                 * Streaming the current timeline on a primary.
                 *
                 * Attempt to send all data that's already been written out and
                 * fsync'd to disk.  We cannot go further than what's been written out
                 * given the current implementation of WALRead().  And in any case
                 * it's unsafe to send WAL that is not securely down to disk on the
                 * primary: if the primary subsequently crashes and restarts, standbys
                 * must not have applied any WAL that got lost on the primary.
                 */
                SendRqstPtr = GetFlushRecPtr();
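
To make the concern concrete, here is a toy sketch (not PostgreSQL code) of
how a flush position can land in the middle of a record.  It assumes records
are laid out back to back starting at offset 0 and ignores page headers and
alignment padding, which real WAL has.  The names ToyRecPtr,
toy_is_record_boundary() and toy_last_boundary_at_or_before() are
hypothetical and exist only for this illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a WAL position; not the real XLogRecPtr. */
typedef uint64_t ToyRecPtr;

/*
 * Return true if ptr coincides with the start or end of a record, given
 * the lengths of records laid out back to back from offset 0.
 */
static bool
toy_is_record_boundary(const uint32_t *rec_lens, size_t nrecs, ToyRecPtr ptr)
{
    ToyRecPtr   pos = 0;

    if (ptr == 0)
        return true;

    for (size_t i = 0; i < nrecs; i++)
    {
        pos += rec_lens[i];
        if (pos == ptr)
            return true;
        if (pos > ptr)
            return false;       /* ptr falls inside record i */
    }
    return false;               /* ptr is past the last record */
}

/*
 * Return the largest record boundary that is <= ptr.  A sender that only
 * wanted to ship whole records could clamp its send pointer to this value.
 */
static ToyRecPtr
toy_last_boundary_at_or_before(const uint32_t *rec_lens, size_t nrecs,
                               ToyRecPtr ptr)
{
    ToyRecPtr   pos = 0;

    for (size_t i = 0; i < nrecs; i++)
    {
        if (pos + rec_lens[i] > ptr)
            break;
        pos += rec_lens[i];
    }
    return pos;
}
```

With record lengths {40, 100, 24}, a flush position of 96 lies inside the
second record, so toy_is_record_boundary() returns false and the last whole
record ends at 40.  Whether clamping like this is the right fix for the
walsender case is a separate question; the sketch only shows why a flushed
location and a record boundary can differ.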

Nathan

