Re: Incorrect handling of OOM in WAL replay leading to data loss - Mailing list pgsql-hackers

From Kyotaro Horiguchi
Subject Re: Incorrect handling of OOM in WAL replay leading to data loss
Date
Msg-id 20230810.101540.739766630476144825.horikyota.ntt@gmail.com
Whole thread Raw
In response to Re: Incorrect handling of OOM in WAL replay leading to data loss  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Responses Re: Incorrect handling of OOM in WAL replay leading to data loss
List pgsql-hackers
At Thu, 10 Aug 2023 10:00:17 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> At Wed, 9 Aug 2023 17:44:49 +0900, Michael Paquier <michael@paquier.xyz> wrote in 
> > > While it's a kind of bug in total, we encountered a case where an
> > > excessively large xl_tot_len actually came from a corrupted
> > > record. [1]
> > 
> > Right, I remember this one.  I think that Thomas was pretty much right
> > that this could be caused because of a lack of zeroing in the WAL
> > pages.
> 
> We have treated every kind of broken data as end-of-recovery, like
> incorrect rm_id or prev link including excessively large record length
> due to corruption. This patch is going to change the behavior only for
> the last one. If you think there can't be non-zero broken data, we
> should inhibit proceeding recovery after all non-zero incorrect
> data. This seems to be a quite big change in our recovery policy.
> 
> > There are a few options on the table, only doable once the WAL reader
> > provider the error state to the startup process:
> > 1) Retry a few times and FATAL.
> > 2) Just FATAL immediately and don't wait.
> > 3) Retry and hope for the best that the host calms down.
> 
> 4) Wrap up recovery then continue to normal operation.
> 
> This is the traditional behavior for currupt WAL data.
> 
> > I have not seeing this issue being much of an issue in the field, so
> > perhaps option 2 with the structure of 0002 and a FATAL when we catch
> > XLOG_READER_OOM in the switch would be enough.  At least that's enough
> > for the cases we've seen.  I'll think a bit more about it, as well.
> > 
> > Yeah, agreed.  That's orthogonal to the issue reported by Ethan,
> > unfortunately, where he was able to trigger the issue of this thread
> > by manipulating the sizing of a host after producing a record larger
> > than what the host could afford after the resizing :/
> 
> I'm not entirely certain, but if you were to ask me which is more
> probable during recovery - encountering a correct record that's too
> lengthy for the server to buffer or stumbling upon a corrupt byte
> sequence - I'd bet on the latter.

... of course this refers to crash recovery. For replication, we
should keep retrying the current record until the operator commands
promotion.

> I'm not sure how often users encounter currupt WAL data, but I believe
> they should have the option to terminate recovery and then switch to
> normal operation.
> 
> What if we introduced an option to increase the timeline whenever
> recovery hits data error? If that option is disabled, the server stops
> when recovery detects an incorrect data, except in the case of an
> OOM. OOM cause record retry.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



pgsql-hackers by date:

Previous
From: Masahiko Sawada
Date:
Subject: Re: [PoC] pg_upgrade: allow to upgrade publisher node
Next
From: "Jonathan S. Katz"
Date:
Subject: Re: 2023-08-10 release announcement draft