Re: Incorrect handling of OOM in WAL replay leading to data loss - Mailing list pgsql-hackers

From Kyotaro Horiguchi
Subject Re: Incorrect handling of OOM in WAL replay leading to data loss
Date
Msg-id 20230810.100017.2120346592058471531.horikyota.ntt@gmail.com
In response to Re: Incorrect handling of OOM in WAL replay leading to data loss  (Michael Paquier <michael@paquier.xyz>)
Responses Re: Incorrect handling of OOM in WAL replay leading to data loss
List pgsql-hackers
At Wed, 9 Aug 2023 17:44:49 +0900, Michael Paquier <michael@paquier.xyz> wrote in 
> > While it's a kind of bug in total, we encountered a case where an
> > excessively large xl_tot_len actually came from a corrupted
> > record. [1]
> 
> Right, I remember this one.  I think that Thomas was pretty much right
> that this could be caused because of a lack of zeroing in the WAL
> pages.

Until now we have treated every kind of broken data as end-of-recovery:
an incorrect rm_id, a bad prev link, and also an excessively large
record length caused by corruption. This patch changes the behavior
only for the last of these. If you think non-zero broken data cannot
occur, then we should refuse to proceed with recovery past any
non-zero incorrect data, not just this one case. That seems to be
quite a big change in our recovery policy.

> There are a few options on the table, only doable once the WAL reader
> provides the error state to the startup process:
> 1) Retry a few times and FATAL.
> 2) Just FATAL immediately and don't wait.
> 3) Retry and hope for the best that the host calms down.

4) Wrap up recovery then continue to normal operation.

This is the traditional behavior for corrupt WAL data.

> I have not seen this issue being much of an issue in the field, so
> perhaps option 2 with the structure of 0002 and a FATAL when we catch
> XLOG_READER_OOM in the switch would be enough.  At least that's enough
> for the cases we've seen.  I'll think a bit more about it, as well.
> 
> Yeah, agreed.  That's orthogonal to the issue reported by Ethan,
> unfortunately, where he was able to trigger the issue of this thread
> by manipulating the sizing of a host after producing a record larger
> than what the host could afford after the resizing :/

I'm not entirely certain, but if you were to ask me which is more
probable during recovery - encountering a correct record that's too
lengthy for the server to buffer or stumbling upon a corrupt byte
sequence - I'd bet on the latter.

I'm not sure how often users encounter corrupt WAL data, but I believe
they should have the option to terminate recovery and then switch to
normal operation.

What if we introduced an option to increase the timeline whenever
recovery hits a data error? If that option is disabled, the server
stops when recovery detects incorrect data, except in the case of an
OOM. An OOM causes the record to be retried.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


