Re: Incorrect handling of OOM in WAL replay leading to data loss - Mailing list pgsql-hackers
From: Michael Paquier
Subject: Re: Incorrect handling of OOM in WAL replay leading to data loss
Date:
Msg-id: ZNR8q8nevVK6IIgR@paquier.xyz
In response to: Re: Incorrect handling of OOM in WAL replay leading to data loss (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Responses: Re: Incorrect handling of OOM in WAL replay leading to data loss
List: pgsql-hackers
On Thu, Aug 10, 2023 at 02:47:51PM +0900, Kyotaro Horiguchi wrote:
> At Thu, 10 Aug 2023 13:33:48 +0900, Michael Paquier <michael@paquier.xyz> wrote in
>> On Thu, Aug 10, 2023 at 10:15:40AM +0900, Kyotaro Horiguchi wrote:
>> > At Thu, 10 Aug 2023 10:00:17 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
>> >> We have treated every kind of broken data as end-of-recovery, like
>> >> an incorrect rm_id or prev link, including an excessively large
>> >> record length due to corruption.  This patch is going to change the
>> >> behavior only for the last one.  If you think there can't be
>> >> non-zero broken data, we should inhibit proceeding with recovery
>> >> after all non-zero incorrect data.  This seems to be a quite big
>> >> change in our recovery policy.
>>
>> Well, per se the report that led to this thread.  We may lose data and
>> finish with corrupted pages.  I was planning to reply to the other
>> thread [1] and the patch of Noah anyway, because we have to fix the
>> detection of OOM vs corrupted records in the allocation path anyway.
>
> Does this mean we will address the distinction between an OOM and a
> corrupt total record length later on?  If that's the case, should we
> modify that behavior right now?

My apologies if I sounded unclear here.  It seems to me that we should
wrap the patch on [1] first, and get it backpatched.  At least that
makes for fewer conflicts when 0001 gets merged for HEAD once we are
able to set a proper error code.  (Was looking at it, actually.)

>>>> 4) Wrap up recovery then continue to normal operation.
>>>>
>>>> This is the traditional behavior for corrupt WAL data.
>>
>> Yeah, we've been doing a pretty bad job in classifying the errors that
>> can happen doing WAL replay in crash recovery, because we assume that
>> all the WAL still in pg_wal/ is correct.  That's not an easy problem,
>
> I'm not quite sure what "correct" means here.  I believe xlogreader
> runs various checks since the data may be incorrect.  Given that it can
> break for various reasons, during crash recovery we continue as long
> as the incoming WAL record remains consistently valid.  The problem
> raised here is that we can't distinctly identify an OOM from a
> corrupted total record length field.  One reason is that we check the
> data after a part is loaded, but we can't load all the bytes from the
> record into memory in such cases.

Yep.  Correct means that we end recovery in a consistent state, not
earlier than we should.

>> because the record CRC works for all the contents of the record, and
>> we look at the record header before that.  Another idea may be to have
>> an extra CRC only for the header itself, that can be used in isolation
>> as one of the checks in XLogReaderValidatePageHeader().
>
> Sounds reasonable.  By using CRC to protect the header part and
> allocating a fixed-length buffer for it, I believe we're adopting a
> standard approach and can identify OOM and other kinds of header
> errors with a good degree of certainty.

Not something that I'd like to cover in this patch set, though.  This
is a problem on its own.
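To make the distinction being discussed a bit more concrete, here is a
minimal, self-contained sketch; this is not the actual xlogreader code,
and the limits, type names, and helper are invented for illustration
only.  The idea is to sanity-check the header-advertised total length
first, and to treat an allocation failure as a genuine out-of-memory
condition only when that length looked plausible.

    /*
     * Rough sketch only: not the actual xlogreader code.  Names, limits
     * and the allocator are simplified placeholders for illustration.
     */
    #include <stdint.h>
    #include <stdlib.h>

    #define SKETCH_MIN_RECORD_LEN   24                   /* assumed minimal header size */
    #define SKETCH_MAX_RECORD_LEN   (1024u * 1024 * 1024) /* assumed sane upper bound */

    typedef enum SketchReadStatus
    {
        SKETCH_READ_OK,       /* buffer allocated, keep reading the record */
        SKETCH_READ_BAD_DATA, /* length is implausible: treat as corrupt data */
        SKETCH_READ_OOM       /* length looked sane: report OOM, do not end recovery */
    } SketchReadStatus;

    /*
     * Validate the header-advertised total length before trusting an
     * allocation failure.  A garbage length is classified as corruption,
     * while failing to allocate a plausible length surfaces as OOM
     * instead of being silently treated as the end of WAL.
     */
    static SketchReadStatus
    sketch_allocate_record(uint32_t tot_len, char **buf)
    {
        if (tot_len < SKETCH_MIN_RECORD_LEN || tot_len > SKETCH_MAX_RECORD_LEN)
            return SKETCH_READ_BAD_DATA;

        *buf = malloc(tot_len);
        if (*buf == NULL)
            return SKETCH_READ_OOM;

        return SKETCH_READ_OK;
    }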
>>>> I'm not entirely certain, but if you were to ask me which is more
>>>> probable during recovery - encountering a correct record that's too
>>>> lengthy for the server to buffer or stumbling upon a corrupt byte
>>>> sequence - I'd bet on the latter.
>>
>> I don't really believe in chance when it comes to computer science,
>> facts and a correct detection of such facts are better :)
>
> Even now, it seems like we're balancing the risk of potential data
> loss against the potential inability to start the server.  I meant
> that if I had to choose, I'd lean slightly towards avoiding the
> latter, in other words, keeping the current behavior.

On OOM, this means data loss and silent corruption.  A failure has the
merit of telling someone that something is wrong, at least, and that
they'd better look at it rather than hope for the best.

>>> ... of course this refers to crash recovery.  For replication, we
>>> should keep retrying the current record until the operator commands
>>> promotion.
>>
>> Are you referring to a retry if there is a standby.signal?  I am a
>> bit confused by this sentence, because we could do a crash recovery,
>> then switch to archive recovery.  So, I guess that you mean that on
>> OOM we should retry to retrieve WAL from the local pg_wal/ even in the
>> case where we are in the crash recovery phase, *before* switching to
>
> Apologies for the confusion.  What I was thinking is that an OOM is
> more likely to occur in replication downstreams than during server
> startup.  I also felt, for the latter case, that such a challenging
> environment probably wouldn't let the server enter stable normal
> operation.

It depends on what the user does with the host running the cluster.
Both could be impacted.

>>>> I'm not sure how often users encounter corrupt WAL data, but I
>>>> believe they should have the option to terminate recovery and then
>>>> switch to normal operation.
>>>>
>>>> What if we introduced an option to increase the timeline whenever
>>>> recovery hits a data error?  If that option is disabled, the server
>>>> stops when recovery detects incorrect data, except in the case of an
>>>> OOM.  An OOM causes a record retry.
>>
>> I guess that it depends on how much responsiveness one may want.
>> Forcing a failure on OOM is at least something that users would be
>> immediately able to act on when we don't run a standby but just
>> recover from a crash, while a standby would do what it is designed to
>> do, aka continue to replay what it can see.  One issue with the
>> wait-and-continue is that a standby may loop continuously on OOM,
>> which could also be bad if there's a replication slot retaining WAL on
>> the primary.  Perhaps that's just OK to keep doing for a standby.  At
>> least this makes the discussion easier for the sake of this thread:
>> just consider the case of crash recovery when we don't have a standby.
>
> Yeah, I'm with you on focusing on crash recovery cases; that's what I
> intended.  Sorry for any confusion.

Okay, so we're on the same page here, keeping standbys as they are and
doing something for the crash recovery case.
--
Michael
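As a footnote, the policy converged on above could be sketched roughly
as follows.  The flag name and the error handling below are invented
placeholders, not the server's actual flags or error APIs: a standby
keeps waiting and retrying the current record on OOM, while crash
recovery fails hard rather than ending recovery early.

    /*
     * Sketch only: "in_standby_mode" and the reporting below are
     * placeholders, not PostgreSQL's actual flags or error APIs.
     */
    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    static void
    sketch_handle_record_oom(bool in_standby_mode)
    {
        if (in_standby_mode)
        {
            /* Standby: wait a bit and let the caller retry the same record. */
            sleep(1);
            return;
        }

        /*
         * Crash recovery: stop with an explicit error.  Treating the
         * failure as end-of-WAL would end recovery too early and
         * silently lose committed data.
         */
        fprintf(stderr, "FATAL: out of memory while reading a WAL record\n");
        exit(1);
    }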