Re: Incorrect handling of OOM in WAL replay leading to data loss - Mailing list pgsql-hackers

From Michael Paquier
Subject Re: Incorrect handling of OOM in WAL replay leading to data loss
Date
Msg-id ZNR8q8nevVK6IIgR@paquier.xyz
In response to Re: Incorrect handling of OOM in WAL replay leading to data loss  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Responses Re: Incorrect handling of OOM in WAL replay leading to data loss
List pgsql-hackers
On Thu, Aug 10, 2023 at 02:47:51PM +0900, Kyotaro Horiguchi wrote:
> At Thu, 10 Aug 2023 13:33:48 +0900, Michael Paquier <michael@paquier.xyz> wrote in
>> On Thu, Aug 10, 2023 at 10:15:40AM +0900, Kyotaro Horiguchi wrote:
>> > At Thu, 10 Aug 2023 10:00:17 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
>> >> We have treated every kind of broken data as end-of-recovery, like an
>> >> incorrect rm_id or prev link, including an excessively large record
>> >> length due to corruption. This patch is going to change the behavior
>> >> only for the last one. If you think there can't be non-zero broken
>> >> data, we should stop recovery from proceeding after any non-zero
>> >> incorrect data. That seems to be quite a big change in our recovery
>> >> policy.
>>
>> Well, that's exactly the report that led to this thread.  We may lose
>> data and finish with corrupted pages.  I was planning to reply to the
>> other thread [1] and Noah's patch anyway, because we have to fix the
>> detection of OOM vs corrupted records in the allocation path.
>
> Does this mean we will address the distinction between an OOM and a
> corrupt total record length later on? If that's the case, should we
> modify that behavior right now?

My apologies if I sounded unclear here.  It seems to me that we should
wrap up the patch on [1] first, and get it backpatched.  At least that
makes for fewer conflicts when 0001 gets merged into HEAD once we are
able to set a proper error code.  (I was looking at it, actually.)
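
To be clear about what I mean by a proper error code, I have in mind
something of the following shape in the allocation path.  This is only
a sketch: the enum, the function name and the sanity cap are
illustrative, not existing code.

/*
 * Sketch only: make the record-buffer allocation report *why* it
 * failed instead of collapsing every failure into "invalid record".
 * All names here are hypothetical.
 */
typedef enum
{
    RECORDBUF_OK,           /* buffer allocated */
    RECORDBUF_OOM,          /* allocation failed for a sane length */
    RECORDBUF_BAD_LENGTH    /* length larger than any plausible record */
} RecordBufStatus;

static RecordBufStatus
allocate_recordbuf_checked(XLogReaderState *state, uint32 reclength)
{
    /* arbitrary sanity cap on the total record length, for illustration */
    if (reclength > (uint32) 0x3fffffff)
        return RECORDBUF_BAD_LENGTH;

    state->readRecordBuf =
        palloc_extended(reclength, MCXT_ALLOC_NO_OOM);
    if (state->readRecordBuf == NULL)
        return RECORDBUF_OOM;   /* genuine out-of-memory */

    state->readRecordBufSize = reclength;
    return RECORDBUF_OK;
}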

>>>> 4) Wrap up recovery, then continue to normal operation.
>>>>
>>>> This is the traditional behavior for corrupt WAL data.
>>
>> Yeah, we've been doing a pretty bad job of classifying the errors that
>> can happen during WAL replay in crash recovery, because we assume that
>> all the WAL still in pg_wal/ is correct.  That's not an easy problem,
>
> I'm not quite sure what "correct" means here. I believe xlogreader
> runs various checks precisely because the data may be incorrect. Given
> that the data can break for various reasons, during crash recovery we
> continue as long as the incoming WAL records remain consistently
> valid. The problem raised here is that we can't distinguish an OOM
> from a corrupted total record length field. One reason is that we
> check the data only after part of it has been loaded, but in such
> cases we can't load all the bytes of the record into memory.

Yep.  Correct means that we end recovery in a consistent state, not
earlier than we should.
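
To spell out the ambiguity, the failure today looks roughly like this.
This is a simplified fragment written from memory, not the exact
xlogreader code:

/*
 * Fragment of the record-read path, simplified for illustration.
 * xl_tot_len comes straight from the on-disk header and may be
 * garbage; the record CRC can only be verified once the whole record
 * is in memory, so the length cannot be validated before allocating
 * for it.
 */
uint32  total_len = record->xl_tot_len;     /* possibly corrupted */

char   *buf = palloc_extended(total_len, MCXT_ALLOC_NO_OOM);

if (buf == NULL)
{
    /*
     * A genuine out-of-memory condition and an absurdly large,
     * corrupted length both land here, and both are currently treated
     * as an invalid record, i.e. end of recovery.
     */
    report_invalid_record(state, "record length %u: out of memory",
                          total_len);
}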

>> because the record CRC covers all the contents of the record, and
>> we look at the record header before that.  Another idea may be to have
>> an extra CRC only for the header itself, which could be used in
>> isolation as one of the checks in XLogReaderValidatePageHeader().
>
> Sounds reasonable. By using a CRC to protect the header part and
> allocating a fixed-length buffer for it, I believe we're adopting a
> standard approach and can identify OOM and other kinds of header
> errors with a good degree of certainty.

Not something that I'd like to cover in this patch set, though.  This
is a problem on its own.
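
For the archives, the shape I was thinking about would be something
like the following.  It is purely hypothetical: xl_header_crc does not
exist, and adding it would mean a WAL format change:

/*
 * Hypothetical check: validate the fixed-size record header with its
 * own CRC before trusting xl_tot_len.  xl_header_crc is an invented
 * field, for illustration only.
 */
static bool
XLogRecordHeaderIsSane(XLogRecord *record)
{
    pg_crc32c   crc;

    INIT_CRC32C(crc);
    /* CRC all header fields up to the hypothetical header CRC */
    COMP_CRC32C(crc, (char *) record,
                offsetof(XLogRecord, xl_header_crc));
    FIN_CRC32C(crc);

    if (!EQ_CRC32C(crc, record->xl_header_crc))
        return false;       /* header corrupted, do not trust xl_tot_len */

    return true;            /* xl_tot_len is safe to use for allocation */
}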

>>>> I'm not entirely certain, but if you were to ask me which is more
>>>> probable during recovery - encountering a correct record that's too
>>>> lengthy for the server to buffer or stumbling upon a corrupt byte
>>>> sequence - I'd bet on the latter.
>>
>> I don't really believe in chance when it comes to computer science;
>> facts, and a correct detection of such facts, are better :)
>
> Even now, it seems like we're balancing the risk of potential data
> loss against the potential inability to start the server. What I
> meant is that, if I had to choose, I'd lean slightly towards avoiding
> the latter, in other words, keeping the current behavior.

On OOM, this means data loss and silent corruption.  A failure has the
merit of telling someone that something is wrong, at least, and that
they'd better look at it rather than hope for the best.
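
Said differently, what I would like crash recovery to do is something
like this (a rough sketch; the status names are illustrative and do
not necessarily match the reader API):

/*
 * Rough sketch of the decision I have in mind, assuming the reader
 * can report a distinct status for allocation failures.
 */
switch (read_status)
{
    case READ_RECORD_OK:
        /* replay the record as usual */
        break;

    case READ_RECORD_OOM:
        /*
         * Running out of memory does not prove that we reached the
         * end of WAL.  Failing hard avoids ending recovery too early
         * and silently losing committed data.
         */
        ereport(FATAL,
                (errcode(ERRCODE_OUT_OF_MEMORY),
                 errmsg("out of memory while reading WAL record")));
        break;

    case READ_RECORD_INVALID:
        /* genuinely broken data: traditional end-of-recovery path */
        break;
}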

>>> ... of course this refers to crash recovery. For replication, we
>>> should keep retrying the current record until the operator commands
>>> promotion.
>>
>> Are you referring to a retry if there is a standby.signal?  I am a
>> bit confused by this sentence, because we could do a crash recovery,
>> then switch to archive recovery.  So, I guess that you mean that on
>> OOM we should retry retrieving WAL from the local pg_wal/ even in the
>> case where we are in the crash recovery phase, *before* switching to
>
> Apologies for the confusion. What I was thinking is that an OOM is
> more likely to occur in replication downstreams than during server
> startup. I also felt that, in the latter case, such a challenging
> environment probably wouldn't let the server enter stable normal
> operation.

It depends on what the user does with the host running the cluster.
Both could be impacted.

>>>> I'm not sure how often users encounter corrupt WAL data, but I believe
>>>> they should have the option to terminate recovery and then switch to
>>>> normal operation.
>>>>
>>>> What if we introduced an option to increase the timeline whenever
>>>> recovery hits a data error? If that option is disabled, the server
>>>> stops when recovery detects incorrect data, except in the case of an
>>>> OOM. An OOM causes a record retry.
>>
>> I guess that it depends on how much responsiveness one may want.
>> Forcing a failure on OOM is at least something that users would be
>> immediately able to act on when we don't run a standby but just
>> recover from a crash, while a standby would do what it is designed to
>> do, aka continue to replay what it can see.  One issue with the
>> wait-and-continue is that a standby may loop continuously on OOM,
>> which could also be bad if there's a replication slot retaining WAL on
>> the primary.  Perhaps it's just OK to keep doing that for a
>> standby.  At least this makes the discussion easier for the sake of
>> this thread: just consider the case of crash recovery when we don't
>> have a standby.
>
> Yeah, I'm with you on focusing on crash recovery cases; that's what I
> intended. Sorry for any confusion.

Okay, so we're on the same page here: keep standbys as they are and
do something for the crash recovery case.
--
Michael
