Re: Incorrect handling of OOM in WAL replay leading to data loss - Mailing list pgsql-hackers

From Kyotaro Horiguchi
Subject Re: Incorrect handling of OOM in WAL replay leading to data loss
Date
Msg-id 20230810.144751.1045770009719107322.horikyota.ntt@gmail.com
In response to Re: Incorrect handling of OOM in WAL replay leading to data loss  (Michael Paquier <michael@paquier.xyz>)
Responses Re: Incorrect handling of OOM in WAL replay leading to data loss
List pgsql-hackers
At Thu, 10 Aug 2023 13:33:48 +0900, Michael Paquier <michael@paquier.xyz> wrote in 
> On Thu, Aug 10, 2023 at 10:15:40AM +0900, Kyotaro Horiguchi wrote:
> > At Thu, 10 Aug 2023 10:00:17 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> >> We have treated every kind of broken data as end-of-recovery, whether
> >> it is an incorrect rm_id, a bad prev link, or an excessively large
> >> record length due to corruption. This patch changes the behavior only
> >> for the last one. If you think there can't be non-zero broken data, we
> >> should stop recovery on any non-zero incorrect data. That would be
> >> quite a big change in our recovery policy.
> 
> Well, per se the report that led to this thread.  We may lose data and
> finish with corrupted pages.  I was planning to reply to the other
> thread [1] and the patch of Noah anyway, because we have to fix the
> detection of OOM vs corrupted records in the allocation path anyway.

Does this mean we will address the distinction between an OOM and a
corrupt total record length later on? If that's the case, should we
modify that behavior right now?

> The infra introduced by 0001 and something like 0002 that allows the
> startup process to take a different path depending on the type of
> error are still needed to avoid a too early end-of-recovery, though.

Agreed.
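
To make that concrete, here is a rough sketch of the kind of
classification such infrastructure could hand to the startup process.
The names below are made up for illustration; they are not taken from
the actual 0001/0002 patches.

#include "postgres.h"

/*
 * Illustrative only: a reader-level result code so that the startup
 * process can tell a decoding failure apart from an allocation failure.
 */
typedef enum XLogReadSketchResult
{
    XLREAD_SKETCH_OK,           /* record decoded successfully */
    XLREAD_SKETCH_BAD_DATA,     /* validation failed: candidate end of WAL */
    XLREAD_SKETCH_OOM           /* allocation failed: must not end recovery */
} XLogReadSketchResult;

static void
handle_read_result(XLogReadSketchResult result)
{
    switch (result)
    {
        case XLREAD_SKETCH_OK:
            /* hand the record over to the redo routine */
            break;
        case XLREAD_SKETCH_BAD_DATA:
            /* traditional behavior: treat this as the end of recovery */
            break;
        case XLREAD_SKETCH_OOM:
            /* crash recovery: stopping is safer than silently losing WAL */
            ereport(FATAL,
                    (errmsg("out of memory while reading WAL records")));
            break;
    }
}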

> >> 4) Wrap up recovery then continue to normal operation.
> >> 
> >> This is the traditional behavior for corrupt WAL data.
> 
> Yeah, we've been doing a pretty bad job in classifying the errors that
> can happen doing WAL replay in crash recovery, because we assume that
> all the WAL still in pg_wal/ is correct.  That's not an easy problem,

I'm not quite sure what "correct" means here. I believe xlogreader runs
its various checks precisely because the data may be incorrect, and
since WAL can break for various reasons, during crash recovery we
continue for as long as each incoming record remains valid. The problem
raised here is that we cannot distinguish an OOM from a corrupted total
record length field. One reason is that we validate the data only after
part of it has been loaded, but in these cases we cannot load all of the
record's bytes into memory in the first place.
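
To spell that out, here is a sketch of the allocation step.
allocate_for_record() and its result codes are invented names for
illustration, not the actual xlogreader code.

#include "postgres.h"
#include "access/xlogrecord.h"
#include "utils/memutils.h"

typedef enum
{
    ALLOC_SKETCH_OK,            /* buffer allocated */
    ALLOC_SKETCH_BAD_LENGTH,    /* xl_tot_len failed a cheap sanity check */
    ALLOC_SKETCH_OOM            /* length looked plausible, palloc failed */
} AllocSketchResult;

static AllocSketchResult
allocate_for_record(uint32 xl_tot_len, char **buf)
{
    /* a record can never be shorter than its own fixed-size header */
    if (xl_tot_len < SizeOfXLogRecord)
        return ALLOC_SKETCH_BAD_LENGTH;

    /* ask for memory without erroring out, so we can report the failure */
    *buf = (char *) palloc_extended(xl_tot_len, MCXT_ALLOC_NO_OOM);
    if (*buf == NULL)
    {
        /*
         * The catch: a garbage xl_tot_len (say 0xfffffff0 from a torn
         * page) also lands here, because the length check above is the
         * only validation we have been able to run so far.  Without
         * stronger validation of the header we cannot tell whether this
         * is a genuine OOM.
         */
        return ALLOC_SKETCH_OOM;
    }

    return ALLOC_SKETCH_OK;
}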

> because the record CRC works for all the contents of the record, and
> we look at the record header before that.  Another idea may be to have
> an extra CRC only for the header itself, that can be used in isolation
> as one of the checks in XLogReaderValidatePageHeader().

Sounds reasonable. By using a CRC to protect the header part and
allocating a fixed-length buffer for it, I believe we would be adopting
a standard approach and could identify an OOM and other kinds of header
errors with a good degree of certainty.
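
As a rough illustration of that idea: the struct and the xl_header_crc
field below are hypothetical; today's XLogRecord only carries xl_crc,
which covers the whole record and can only be verified after the record
has been fully buffered.

#include "postgres.h"
#include "access/xlogrecord.h"
#include "port/pg_crc32c.h"

/* Hypothetical record header with a CRC over the fixed-size part only. */
typedef struct SketchRecordHeader
{
    uint32          xl_tot_len;     /* total length of the record */
    TransactionId   xl_xid;         /* xact id */
    XLogRecPtr      xl_prev;        /* pointer to previous record */
    uint8           xl_info;        /* flag bits */
    RmgrId          xl_rmid;        /* resource manager */
    pg_crc32c       xl_header_crc;  /* hypothetical: CRC of the fields above */
} SketchRecordHeader;

/*
 * Check the header in isolation, before trying to buffer the record body.
 * If this passes, a later allocation failure for xl_tot_len bytes is very
 * likely a genuine OOM rather than a garbage length.
 */
static bool
sketch_header_crc_is_valid(const SketchRecordHeader *hdr)
{
    pg_crc32c   crc;

    INIT_CRC32C(crc);
    COMP_CRC32C(crc, hdr, offsetof(SketchRecordHeader, xl_header_crc));
    FIN_CRC32C(crc);

    return EQ_CRC32C(crc, hdr->xl_header_crc);
}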

> >> I'm not entirely certain, but if you were to ask me which is more
> >> probable during recovery - encountering a correct record that's too
> >> lengthy for the server to buffer or stumbling upon a corrupt byte
> >> sequence - I'd bet on the latter.
> 
> I don't really believe in chance when it comes to computer science,
> facts and a correct detection of such facts are better :)

Even now, it seems we're balancing the risk of potential data loss
against the risk of being unable to start the server. What I meant is
that, if I had to choose, I'd lean slightly towards avoiding the latter,
in other words towards keeping the current behavior.

> > ... of course this refers to crash recovery. For replication, we
> > should keep retrying the current record until the operator commands
> > promotion.
> 
> Are you referring to a retry if there is a standby.signal?  I am a
> bit confused by this sentence, because we could do a crash recovery,
> then switch to archive recovery.  So, I guess that you mean that on
> OOM we should retry to retrieve WAL from the local pg_wal/ even in the
> case where we are in the crash recovery phase, *before* switching to

Apologies for the confusion. What I was thinking is that an OOM is more
likely to occur on a replication downstream than during server startup.
I also felt that, in the latter case, such a memory-starved environment
probably wouldn't let the server reach stable normal operation anyway.

> archive recovery and a different source, right?  I think that this
> argument can go two ways, because it could be more helpful for some to
> see a FATAL when we are still in crash recovery, even if there is a
> standby.signal.  It does not seem to me that we have a clear
> definition about what to do in which case, either.  Now we just fail
> and hope for the best when doing crash recovery.

Agreed.

> >> I'm not sure how often users encounter corrupt WAL data, but I believe
> >> they should have the option to terminate recovery and then switch to
> >> normal operation.
> >> 
> >> What if we introduced an option to increase the timeline whenever
> >> recovery hits a data error? If that option is disabled, the server
> >> stops when recovery detects incorrect data, except in the case of an
> >> OOM; an OOM causes the record to be retried.
> 
> I guess that it depends on how much responsiveness one may want.
> Forcing a failure on OOM is at least something that users would be
> immediately able to act on when we don't run a standby but just
> recover from a crash, while a standby would do what it is designed to
> do, aka continue to replay what it can see.  One issue with the
> wait-and-continue is that a standby may loop continuously on OOM,
> which could be also bad if there's a replication slot retaining WAL on
> the primary.  Perhaps that's just OK to keep doing that for a
> standby.  At least this makes the discussion easier for the sake of
> this thread: just consider the case of crash recovery when we don't
> have a standby.

Yeah, I'm with you on focusing on crash recovery cases; that's what I
intended. Sorry for any confusion.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


