Re: Incorrect handling of OOM in WAL replay leading to data loss - Mailing list pgsql-hackers
From: Kyotaro Horiguchi
Subject: Re: Incorrect handling of OOM in WAL replay leading to data loss
Date:
Msg-id: 20230808.174403.2484257348989772.horikyota.ntt@gmail.com
In response to: Re: Incorrect handling of OOM in WAL replay leading to data loss (Michael Paquier <michael@paquier.xyz>)
Responses: Re: Incorrect handling of OOM in WAL replay leading to data loss
List: pgsql-hackers
At Tue, 8 Aug 2023 16:29:49 +0900, Michael Paquier <michael@paquier.xyz> wrote in
> On Wed, Aug 02, 2023 at 01:16:02PM +0900, Kyotaro Horiguchi wrote:
> > I believe this approach is sufficient to determine whether the error
> > is OOM or not. If total_len is corrupted and has an excessively large
> > value, it's highly unlikely that all subsequent pages for that length
> > will be consistent.
> >
> > Do you have any thoughts on this?
>
> This could be more flexible IMO, and actually in some cases
> errormsg_fatal may be eaten if using the WAL prefetcher as the error
> message is reported with the caller of XLogPrefetcherReadRecord(), no?

Right. The goal of my PoC was to detect OOM accurately, or at least
sufficiently so. We need to pass the "error code" separately, along
with the message, to make it work with the prefetcher. We could
enclose errormsg and errorcode in a struct.

> Anything that has been discussed on this thread now involves a change
> in XLogReaderState that induces an ABI breakage. For HEAD, we are
> likely going in this direction, but if we are going to bite the bullet
> we'd better be a bit more aggressive with the integration and report
> an error code side-by-side with the error message returned by
> XLogPrefetcherReadRecord(), XLogReadRecord() and XLogNextRecord() so
> as all of the callers can decide what they want to do on an invalid
> record or just an OOM.

Sounds reasonable.

> Attached is the idea of infrastructure I have in mind, as of 0001,
> where this adds an error code to report_invalid_record(). For now
> this includes three error codes appended to the error messages
> generated that can be expanded if need be: no error, OOM and invalid
> data.
> The invalid data part may need to be much more verbose, and
> could be improved to make this stuff "less scary" as the other thread
> proposes, but what I have here would be enough to trigger a different
> decision in the startup process if a record cannot be fetched on OOM
> or if there's a different reason behind that.

Agreed. This clarifies the basis for decisions at the upper layer
(ReadRecord) and adds flexibility.

> 0002 is an example of a decision that can be taken in WAL replay if we
> see an OOM, based on the error code received. One argument is that we
> may want to improve emode_for_corrupt_record() so as it reacts better
> on OOM, upgrading the emode wanted, but this needs more analysis
> depending on the code path involved.
>
> 0003 is my previous trick to inject an OOM failure at replay. Reusing
> the previous script, this would be enough to prevent an early redo
> creating a loss of data.
>
> Note that we have a few other things going in the tree. As one
> example, pg_walinspect would consider an OOM as the end of WAL. Not
> critical, still slightly incorrect as the end of WAL may not have been
> reached yet, so it can report some incorrect information depending on
> what the WAL reader faces. This could be improved with the additions
> of 0001.
>
> Thoughts or comments?

I like the overall direction. Though, I'm considering enclosing the
errormsg and errorcode in a struct.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center