Thread: standby promotion can create unreadable WAL
My colleague Dilip Kumar and I have discovered what I believe to be a bug in the recently-added "overwrite contrecord" stuff. I'm not sure whether or not this bug has any serious consequences. I think that there may be a scenario where it does, but I'm not sure about that.

Suppose you have a primary and a standby, and the standby is promoted after reading a partial WAL record. The attached script, which was written by Dilip and slightly modified by me, creates this scenario by setting up an archiving-only standby, writing a record that crosses a segment boundary, and then promoting the standby. If you then try to run pg_waldump on the WAL on timeline 2, it goes boom:

[rhaas pg_wal]$ pg_waldump 000000020000000000000004 000000020000000000000005 2>&1 | tail -n4
rmgr: Heap    len (rec/tot): 1959/ 1959, tx: 728, lsn: 0/04FFE7B0, prev 0/04FFDFF0, desc: INSERT off 4 flags 0x00, blkref #0: rel 1663/5/16384 blk 2132
rmgr: Heap    len (rec/tot): 1959/ 1959, tx: 728, lsn: 0/04FFEF58, prev 0/04FFE7B0, desc: INSERT+INIT off 1 flags 0x00, blkref #0: rel 1663/5/16384 blk 2133
rmgr: Heap    len (rec/tot): 1959/ 1959, tx: 728, lsn: 0/04FFF700, prev 0/04FFEF58, desc: INSERT off 2 flags 0x00, blkref #0: rel 1663/5/16384 blk 2133
pg_waldump: error: error in WAL record at 0/4FFF700: invalid record length at 0/4FFFEA8: wanted 24, got 0

What's happening here is that the last WAL segment from timeline 1, which is 000000010000000000000004, gets copied over to the new timeline up to the point where the last complete record on that timeline ends, namely 0/4FFFEA8. I think that the first record on the new timeline should be written starting at that LSN, but that's not what happens. Instead, the rest of that WAL segment remains zeroed, and the first WAL record on the new timeline is written at the beginning of the next segment:

[rhaas pg_wal]$ pg_waldump 000000020000000000000005 2>&1 | head -n4
rmgr: XLOG    len (rec/tot):  42/  42, tx: 0, lsn: 0/05000028, prev 0/04FFF700, desc: OVERWRITE_CONTRECORD lsn 0/4FFFEA8; time 2022-08-22 13:49:22.874435 EDT
rmgr: XLOG    len (rec/tot): 114/ 114, tx: 0, lsn: 0/05000058, prev 0/05000028, desc: CHECKPOINT_SHUTDOWN redo 0/5000058; tli 2; prev tli 1; fpw true; xid 0:729; oid 24576; multi 1; offset 0; oldest xid 719 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown
rmgr: XLOG    len (rec/tot):  30/  30, tx: 0, lsn: 0/050000D0, prev 0/05000058, desc: NEXTOID 32768
rmgr: Storage len (rec/tot):  42/  42, tx: 0, lsn: 0/050000F0, prev 0/050000D0, desc: CREATE base/5/24576

Nothing that uses xlogreader is going to be able to bridge the gap between file #4 and file #5. In this case it doesn't matter very much, because we immediately write a checkpoint record into file #5, so if we crash we won't try to replay file #4 anyway. However, if anything did try to look at file #4 it would get confused. Maybe that can happen if this is a streaming standby, where we only write an end-of-recovery record upon promotion, rather than a checkpoint, or maybe if there are cascading standbys someone could try to actually use the 000000020000000000000004 file for something. I'm not sure. But unless I'm missing something, that file is bogus, and our only hope of not having problems is that perhaps no one will ever look at it.

I think that the cause of this problem is this code right here:

    /*
     * Actually, if WAL ended in an incomplete record, skip the parts that
     * made it through and start writing after the portion that persisted.
     * (It's critical to first write an OVERWRITE_CONTRECORD message, which
     * we'll do as soon as we're open for writing new WAL.)
     */
    if (!XLogRecPtrIsInvalid(missingContrecPtr))
    {
        Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
        EndOfLog = missingContrecPtr;
    }

It seems to me that this if-statement should also test that the TLI has not changed, i.e. if (newTLI == endOfRecoveryInfo->lastRecTLI && !XLogRecPtrIsInvalid(missingContrecPtr)). If the TLI hasn't changed, then everything the comment says is correct and I think that what the code does is also correct. However, if the TLI *has* changed, then I think we must not advance EndOfLog here, because the WAL that was copied from the old timeline to the new timeline ends at the point in the file corresponding to the value of EndOfLog just before executing this code. When this code then moves EndOfLog forward to the beginning of the next segment, it leaves the unused portion of the previous segment as all zeroes, which creates the problem described above.

(Incidentally, there's also a bug in pg_waldump here: it's reporting the wrong LSN as the source of the error. 0/4FFF700 is not the record that's busted, as shown by the fact that it was successfully decoded and shown in the output. The relevant code in pg_waldump should be using EndRecPtr instead of ReadRecPtr to report the error. If it did, it would complain about 0/4FFFEA8, which is where the problem really is. This is of the same vintage as the bug fixed by d9fbb8862959912c5266364059c0abeda0c93bbf, though in that case the issue was reporting all errors using the start LSN of the first of several records read no matter where the error actually happened, whereas in this case the error is reported using the start LSN of the previous record instead of the current one.)

Thoughts?

-- 
Robert Haas
EDB: http://www.enterprisedb.com
Attachment
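As an illustration of the pg_waldump change Robert describes in the parenthetical above, here is a minimal sketch of the error report at the end of pg_waldump's decoding loop. The surrounding code and the exact identifiers (xlogreader_state, errormsg) are assumptions for illustration and may differ between branches:

    /*
     * Hypothetical sketch only: report the LSN at which decoding actually
     * failed (EndRecPtr) rather than the start of the last successfully
     * decoded record (ReadRecPtr).
     */
    if (errormsg)
        pg_fatal("error in WAL record at %X/%X: %s",
                 LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),   /* was ReadRecPtr */
                 errormsg);

With a change of this shape, the failure above would be reported at 0/4FFFEA8 rather than at 0/4FFF700.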
On Mon, Aug 22, 2022 at 02:36:36PM -0400, Robert Haas wrote:
> (Incidentally, there's also a bug in pg_waldump here: it's reporting
> the wrong LSN as the source of the error. 0/4FFF700 is not the record
> that's busted, as shown by the fact that it was successfully decoded
> and shown in the output. The relevant code in pg_waldump should be
> using EndRecPtr instead of ReadRecPtr to report the error. If it did,
> it would complain about 0/4FFFEA8, which is where the problem really
> is. This is of the same vintage as the bug fixed by
> d9fbb8862959912c5266364059c0abeda0c93bbf, though in that case the
> issue was reporting all errors using the start LSN of the first of
> several records read no matter where the error actually happened,
> whereas in this case the error is using the start LSN of the previous
> record instead of the current one.)

There was some previous discussion on this [0] [1].

[0] https://postgr.es/m/2B4510B2-3D70-4990-BFE3-0FE64041C08A%40amazon.com
[1] https://postgr.es/m/20220127.100738.1985658263632578184.horikyota.ntt%40gmail.com

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Mon, Aug 22, 2022 at 10:38 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> There was some previous discussion on this [0] [1].
>
> [0] https://postgr.es/m/2B4510B2-3D70-4990-BFE3-0FE64041C08A%40amazon.com
> [1] https://postgr.es/m/20220127.100738.1985658263632578184.horikyota.ntt%40gmail.com

Thanks. It seems like there are various doubts on those threads about whether EndRecPtr is really the right thing, but I'm not able to understand what the problem is exactly. Certainly, in the common case, it is, and as far as I can tell from looking at the code, it's what we're intended to use. So I would like to see some concrete evidence of it being wrong before we conclude that we need to do anything other than a trivial change.

But the main issue for this thread is that we seem to be generating invalid WAL. That seems like something we'd better get fixed.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Aug 23, 2022 at 12:06 AM Robert Haas <robertmhaas@gmail.com> wrote:
> Nothing that uses xlogreader is going to be able to bridge the gap
> between file #4 and file #5. In this case it doesn't matter very much,
> because we immediately write a checkpoint record into file #5, so if
> we crash we won't try to replay file #4 anyway. However, if anything
> did try to look at file #4 it would get confused. Maybe that can
> happen if this is a streaming standby, where we only write an
> end-of-recovery record upon promotion, rather than a checkpoint, or
> maybe if there are cascading standbys someone could try to actually
> use the 000000020000000000000004 file for something. I'm not sure. But
> unless I'm missing something, that file is bogus, and our only hope of
> not having problems is that perhaps no one will ever look at it.

Yeah, this analysis looks correct to me.

> I think that the cause of this problem is this code right here:
>
>     /*
>      * Actually, if WAL ended in an incomplete record, skip the parts that
>      * made it through and start writing after the portion that persisted.
>      * (It's critical to first write an OVERWRITE_CONTRECORD message, which
>      * we'll do as soon as we're open for writing new WAL.)
>      */
>     if (!XLogRecPtrIsInvalid(missingContrecPtr))
>     {
>         Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
>         EndOfLog = missingContrecPtr;
>     }

Yeah, this statement as well as the other statement that creates the overwrite contrecord. After changing these two lines the problem is fixed for me, although I haven't yet thought through whether it is safe in all cases. I agree that after a timeline change we are pointing to the end of the last valid record, so we can start writing the next record from that point onward. But I think we need to think hard about whether this breaks any case for which the overwrite contrecord was actually introduced.

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7602fc8..3d38613 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5491,7 +5491,7 @@ StartupXLOG(void)
      * (It's critical to first write an OVERWRITE_CONTRECORD message, which
      * we'll do as soon as we're open for writing new WAL.)
      */
-    if (!XLogRecPtrIsInvalid(missingContrecPtr))
+    if (newTLI == endOfRecoveryInfo->lastRecTLI && !XLogRecPtrIsInvalid(missingContrecPtr))
     {
         Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
         EndOfLog = missingContrecPtr;
@@ -5589,7 +5589,7 @@ StartupXLOG(void)
     LocalSetXLogInsertAllowed();
 
     /* If necessary, write overwrite-contrecord before doing anything else */
-    if (!XLogRecPtrIsInvalid(abortedRecPtr))
+    if (newTLI == endOfRecoveryInfo->lastRecTLI && !XLogRecPtrIsInvalid(abortedRecPtr))
     {
         Assert(!XLogRecPtrIsInvalid(missingContrecPtr));
         CreateOverwriteContrecordRecord(abortedRecPtr, missingContrecPtr, newTLI);

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Nice find!

At Wed, 24 Aug 2022 11:09:44 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in
> On Tue, Aug 23, 2022 at 12:06 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> > Nothing that uses xlogreader is going to be able to bridge the gap
> > between file #4 and file #5. In this case it doesn't matter very much,
> > because we immediately write a checkpoint record into file #5, so if
> > we crash we won't try to replay file #4 anyway. However, if anything
> > did try to look at file #4 it would get confused. Maybe that can
> > happen if this is a streaming standby, where we only write an
> > end-of-recovery record upon promotion, rather than a checkpoint, or
> > maybe if there are cascading standbys someone could try to actually
> > use the 000000020000000000000004 file for something. I'm not sure. But
> > unless I'm missing something, that file is bogus, and our only hope of
> > not having problems is that perhaps no one will ever look at it.
>
> Yeah, this analysis looks correct to me.

(I didn't reproduce the case but understand what is happening.) Me, too.

There are two ways to deal with this, I think. One is to start writing new records from abortedContRecPtr as if it did not exist. Another is to copy the WAL file up to missingContRecPtr. Since the first segment of the new timeline doesn't need to be identical to the last one of the previous timeline, I think the former way is cleaner.

XLogInitNewTimeline or thereabouts seems to be the place for the fix to me. Would clearing abortedRecPtr and missingContrecPtr just before the call to findNewestTimeLine work?

====
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 87b243e0d4..27e01153e7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5396,6 +5396,13 @@ StartupXLOG(void)
      */
     XLogInitNewTimeline(EndOfLogTLI, EndOfLog, newTLI);
 
+    /*
+     * EndOfLog doesn't cover aborted contrecord even if the last record
+     * was that, then the next timeline starts writing from there. Forget
+     * about aborted and missing contrecords even if any.
+     */
+    abortedRecPtr = missingContrecPtr = InvalidXLogRecPtr;
+
     /*
      * Remove the signal files out of the way, so that we don't
      * accidentally re-enter archive recovery mode in a subsequent crash.
====

> > I think that the cause of this problem is this code right here:
> >
> >     /*
> >      * Actually, if WAL ended in an incomplete record, skip the parts that
> >      * made it through and start writing after the portion that persisted.
> >      * (It's critical to first write an OVERWRITE_CONTRECORD message, which
> >      * we'll do as soon as we're open for writing new WAL.)
> >      */
> >     if (!XLogRecPtrIsInvalid(missingContrecPtr))
> >     {
> >         Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
> >         EndOfLog = missingContrecPtr;
> >     }
>
> Yeah, this statement as well as the other statement that creates the
> overwrite contrecord. After changing these two lines the problem is
> fixed for me, although I haven't yet thought through whether it is
> safe in all cases. I agree that after a timeline change we are
> pointing to the end of the last valid record, so we can start writing
> the next record from that point onward. But I think we need to think
> hard about whether this breaks any case for which the overwrite
> contrecord was actually introduced.
>
> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
> index 7602fc8..3d38613 100644
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -5491,7 +5491,7 @@ StartupXLOG(void)
>       * (It's critical to first write an OVERWRITE_CONTRECORD message, which
>       * we'll do as soon as we're open for writing new WAL.)
>       */
> -    if (!XLogRecPtrIsInvalid(missingContrecPtr))
> +    if (newTLI == endOfRecoveryInfo->lastRecTLI && !XLogRecPtrIsInvalid(missingContrecPtr))
>      {
>          Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
>          EndOfLog = missingContrecPtr;
> @@ -5589,7 +5589,7 @@ StartupXLOG(void)
>      LocalSetXLogInsertAllowed();
>
>      /* If necessary, write overwrite-contrecord before doing anything else */
> -    if (!XLogRecPtrIsInvalid(abortedRecPtr))
> +    if (newTLI == endOfRecoveryInfo->lastRecTLI && !XLogRecPtrIsInvalid(abortedRecPtr))
>      {
>          Assert(!XLogRecPtrIsInvalid(missingContrecPtr));
>          CreateOverwriteContrecordRecord(abortedRecPtr, missingContrecPtr, newTLI);

This also seems to work, of course.

# However, I haven't managed to reproduce that, yet...

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
On Wed, Aug 24, 2022 at 4:40 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> Me, too. There are two ways to deal with this, I think. One is to start
> writing new records from abortedContRecPtr as if it did not exist.
> Another is to copy the WAL file up to missingContRecPtr. Since the
> first segment of the new timeline doesn't need to be identical to the
> last one of the previous timeline, I think the former way is cleaner.

I agree, mostly because that gets us back to the way all of this worked before the contrecord stuff went in. This case wasn't broken then, because the breakage had to do with it being unsafe to back up and rewrite WAL that might have already been shipped someplace, and that's not an issue when we're first creating a totally new timeline. It seems safer to me to go back to the way this worked before the fix went in than to change over to a new system.

Honestly, in a vacuum, I might prefer to get rid of this thing where the WAL segment gets copied over from the old timeline to the new, and just always switch TLIs at segment boundaries. And while we're at it, I'd also like TLIs to be 64-bit random numbers instead of integers assigned in ascending order. But those kinds of design changes seem best left for a future master-only development effort. Here, we need to back-patch the fix, and should try to just unbreak what's currently broken.

> XLogInitNewTimeline or thereabouts seems to be the place for the fix
> to me. Would clearing abortedRecPtr and missingContrecPtr just before
> the call to findNewestTimeLine work?

Hmm, yeah, that seems like a good approach.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Aug 23, 2022 at 12:06 AM Robert Haas <robertmhaas@gmail.com> wrote:
> However, if anything
> did try to look at file #4 it would get confused. Maybe that can
> happen if this is a streaming standby, where we only write an
> end-of-recovery record upon promotion, rather than a checkpoint, or
> maybe if there are cascading standbys someone could try to actually
> use the 000000020000000000000004 file for something. I'm not sure. But
> unless I'm missing something, that file is bogus, and our only hope of
> not having problems is that perhaps no one will ever look at it.

I tried this in streaming mode, but it seems that in streaming mode we will never create this bogus file, because of this check [1]. So if StandbyMode is true, we never set "abortedRecPtr" and "missingContrecPtr", which means we will never create the 0-filled gap in the WAL file that we are discussing in this thread.

Do we need to set them? I feel we don't. Why? Because on this thread we are also discussing that if the timeline switches then we don't need to create that 0-filled gap, and that is the actual problem we are discussing here. And we know that if we are coming out of standby mode we will always switch the timeline, so we don't create that 0-filled gap. OTOH, if we are coming out of archive recovery we will also switch the timeline, so in that case we do not need it either. So practically we need to 0-fill past that partial record only when we are doing crash recovery; is that understanding correct? If so, then we can simply avoid setting these variables if ArchiveRecoveryRequested is true. So in the check below [1], instead of (!StandbyMode), we can just put (!ArchiveRecoveryRequested), and then we don't need any other fix. Am I missing anything?

[1]
ReadRecord()
{
    ...
    record = XLogPrefetcherReadRecord(xlogprefetcher, &errormsg);
    if (record == NULL)
    {
        /*
         * When not in standby mode we find that WAL ends in an incomplete
         * record, keep track of that record. After recovery is done, we'll
         * write a record to indicate to downstream WAL readers that that
         * portion is to be ignored.
         */
        if (!StandbyMode &&
            !XLogRecPtrIsInvalid(xlogreader->abortedRecPtr))
        {
            abortedRecPtr = xlogreader->abortedRecPtr;
            missingContrecPtr = xlogreader->missingContrecPtr;
        }

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
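For clarity, a sketch of the change Dilip is describing, expressed as a patch against the snippet quoted in [1]. The file path and hunk context are approximations shown only for illustration (in the back branches this code lives in xlog.c rather than xlogrecovery.c), and per the later messages in this thread the committed version also reworks the comment and adds an assertion; this fragment illustrates only the condition change:

--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ ReadRecord(): hunk context approximate @@
         /*
          * When not in standby mode we find that WAL ends in an incomplete
          * record, keep track of that record. After recovery is done, we'll
          * write a record to indicate to downstream WAL readers that that
          * portion is to be ignored.
          */
-        if (!StandbyMode &&
+        if (!ArchiveRecoveryRequested &&
             !XLogRecPtrIsInvalid(xlogreader->abortedRecPtr))
         {
             abortedRecPtr = xlogreader->abortedRecPtr;
             missingContrecPtr = xlogreader->missingContrecPtr;
         }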
On Fri, Aug 26, 2022 at 8:44 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> ArchiveRecoveryRequested is true. So in the check below [1], instead of
> (!StandbyMode), we can just put (!ArchiveRecoveryRequested), and then
> we don't need any other fix. Am I missing anything?
>
> [1]
> ReadRecord()
> {
>     ...
>     record = XLogPrefetcherReadRecord(xlogprefetcher, &errormsg);
>     if (record == NULL)
>     {
>         /*
>          * When not in standby mode we find that WAL ends in an incomplete
>          * record, keep track of that record. After recovery is done, we'll
>          * write a record to indicate to downstream WAL readers that that
>          * portion is to be ignored.
>          */
>         if (!StandbyMode &&
>             !XLogRecPtrIsInvalid(xlogreader->abortedRecPtr))
>         {
>             abortedRecPtr = xlogreader->abortedRecPtr;
>             missingContrecPtr = xlogreader->missingContrecPtr;
>         }

I agree. Testing StandbyMode here seems bogus. I thought initially that the test should perhaps be for InArchiveRecovery rather than ArchiveRecoveryRequested, but I see that the code which switches to a new timeline cares about ArchiveRecoveryRequested, so I think that is the correct thing to test here as well.

Concretely, I propose the following patch.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
Attachment
On 2022-Aug-26, Robert Haas wrote:

> I agree. Testing StandbyMode here seems bogus. I thought initially
> that the test should perhaps be for InArchiveRecovery rather than
> ArchiveRecoveryRequested, but I see that the code which switches to a
> new timeline cares about ArchiveRecoveryRequested, so I think that is
> the correct thing to test here as well.

Yeah, I think you had already established elsewhere that testing StandbyMode was the wrong thing to do. Testing ArchiveRecoveryRequested here seems quite odd at first, but given the copying behavior, I agree that it seems a correct thing to do.

There's a small typo in the comment: "When find that". I suppose that was meant to be "When we find that".

You end that para with "and thus we should not do this", but that sounds like it wouldn't matter if we did. Maybe "and thus doing this would be wrong, so skip it." or something like that. (Perhaps be even more specific and say "if we did this, we would later create an overwrite record in the wrong place, breaking everything")

-- 
Álvaro Herrera
PostgreSQL Developer — https://www.EnterpriseDB.com/
On Fri, Aug 26, 2022 at 10:06 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> There's a small typo in the comment: "When find that". I suppose that
> was meant to be "When we find that". You end that para with "and thus
> we should not do this", but that sounds like it wouldn't matter if we
> did. Maybe "and thus doing this would be wrong, so skip it." or
> something like that. (Perhaps be even more specific and say "if we did
> this, we would later create an overwrite record in the wrong place,
> breaking everything")

I think that saying that someone should not do something implies pretty clearly that it would be bad if they did. But I have no problem with your more specific language, and as a general rule, it's good to be specific, so let's use that.

v2 attached. Thanks for chiming in.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
Attachment
> I agree. Testing StandbyMode here seems bogus. I thought initially
> that the test should perhaps be for InArchiveRecovery rather than
> ArchiveRecoveryRequested, but I see that the code which switches to a
> new timeline cares about ArchiveRecoveryRequested, so I think that is
> the correct thing to test here as well.
>
> Concretely, I propose the following patch.

This patch looks similar to the change suggested in https://www.postgresql.org/message-id/FB0DEA0B-E14E-43A0-811F-C1AE93D00FF3%40amazon.com to deal with panics after promoting a standby.

The difference is the patch tests !ArchiveRecoveryRequested instead of !StandbyModeRequested as proposed in the mentioned thread.

Thanks

-- 
Sami Imseih
Amazon Web Services
On Fri, Aug 26, 2022 at 11:59 AM Imseih (AWS), Sami <simseih@amazon.com> wrote:
> > I agree. Testing StandbyMode here seems bogus. I thought initially
> > that the test should perhaps be for InArchiveRecovery rather than
> > ArchiveRecoveryRequested, but I see that the code which switches to a
> > new timeline cares about ArchiveRecoveryRequested, so I think that is
> > the correct thing to test here as well.
> >
> > Concretely, I propose the following patch.
>
> This patch looks similar to the change suggested in
> https://www.postgresql.org/message-id/FB0DEA0B-E14E-43A0-811F-C1AE93D00FF3%40amazon.com
> to deal with panics after promoting a standby.
>
> The difference is the patch tests !ArchiveRecoveryRequested instead
> of !StandbyModeRequested as proposed in the mentioned thread.

OK, I didn't realize this bug had been independently discovered and it looks like I was even involved in the previous discussion. I just totally forgot about it.

I think, however, that your fix is wrong and this one is right. Fundamentally, the server is either in normal running, or crash recovery, or archive recovery. Standby mode is just an optional behavior of archive recovery, controlling whether or not we keep retrying once the end of WAL is reached. But there's no reason why the server should put the contrecord at a different location when recovery ends depending on that retry behavior. The only thing that matters is whether we're going to switch timelines.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
> I think, however, that your fix is wrong and this one is right.
> Fundamentally, the server is either in normal running, or crash
> recovery, or archive recovery. Standby mode is just an optional
> behavior of archive recovery

Good point. Thanks for clarifying my understanding.

Thanks

-- 
Sami Imseih
Amazon Web Services
On Fri, Aug 26, 2022 at 7:53 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Aug 26, 2022 at 10:06 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > There's a small typo in the comment: "When find that". I suppose that
> > was meant to be "When we find that". You end that para with "and thus
> > we should not do this", but that sounds like it wouldn't matter if we
> > did. Maybe "and thus doing this would be wrong, so skip it." or
> > something like that. (Perhaps be even more specific and say "if we did
> > this, we would later create an overwrite record in the wrong place,
> > breaking everything")
>
> I think that saying that someone should not do something implies
> pretty clearly that it would be bad if they did. But I have no problem
> with your more specific language, and as a general rule, it's good to
> be specific, so let's use that.
>
> v2 attached.

The patch LGTM; it will apply on master and v15. PFA a patch for the back branches.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment
At Sun, 28 Aug 2022 10:16:21 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in
> On Fri, Aug 26, 2022 at 7:53 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > v2 attached.
>
> The patch LGTM; it will apply on master and v15. PFA a patch for the
> back branches.

StandbyMode is obviously wrong. On the other hand, I thought that !ArchiveRecoveryRequested was somewhat wrong too, as I stated in the thread pointed to above. On second thought, I changed my mind: it is right. The cause of my confusion is that I somehow thought that, after an aborted contrecord is found, archive recovery continues from the aborted-contrecord record. However, that assumption is wrong. The next redo starts from the beginning of the aborted contrecord, so we should forget about the old missing/aborted contrecord info when archive recovery is requested.

In the end, the point is that we need to set the global variables only when XLogPrefetcherReadRecord() (or XLogReadRecord()) returns NULL and we return it to the caller. Is it worth doing a small refactoring like the attached? If not, I'm fine with the proposed patch including the added assertion.

# I haven't reproduced the issue of the OP in the other thread yet, and
# also haven't found how to reproduce this issue, though..

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Attachment
At Mon, 29 Aug 2022 13:13:52 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> we return it to the caller. Is it worth doing a small refactoring
> like the attached? If not, I'm fine with the proposed patch including
> the added assertion.

Mmm. That seems wrong. So forget about that. The proposed patch looks fine to me.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
On Fri, Aug 26, 2022 at 6:14 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Aug 23, 2022 at 12:06 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> > However, if anything
> > did try to look at file #4 it would get confused. Maybe that can
> > happen if this is a streaming standby, where we only write an
> > end-of-recovery record upon promotion, rather than a checkpoint, or
> > maybe if there are cascading standbys someone could try to actually
> > use the 000000020000000000000004 file for something. I'm not sure. But
> > unless I'm missing something, that file is bogus, and our only hope of
> > not having problems is that perhaps no one will ever look at it.

I tried to see the problem with a cascading standby; basically the setup is like below:

pgprimary -> pgstandby (archive only) -> pgcascade (streaming + archive)

The second node has to be archive-only because this 0-filled gap is created in archive-only mode. With that, I have noticed that when the cascading standby hits that 0-filled gap it reports the same error we saw with pg_waldump [1], and it keeps waiting forever on that file. I have attached a test case, but the timing is not handled perfectly in this test, so before the cascading standby is set up some of the WAL files get removed by pgstandby; I just put a direct return in RemoveOldXlogFiles() to test this [2]. And this problem is resolved with the patch given by Robert upthread.

[1]
2022-08-25 16:21:26.413 IST [18235] LOG:  invalid record length at 0/FFFFEA8: wanted 24, got 0

[2]
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index eb5115f..990a879 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3558,6 +3558,7 @@ RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr lastredoptr, XLogRecPtr endptr,
     XLogSegNo   endlogSegNo;
     XLogSegNo   recycleSegNo;
 
+    return;
     /* Initialize info about where to try to recycle to */
     XLByteToSeg(endptr, endlogSegNo, wal_segment_size);
     recycleSegNo = XLOGfileslop(lastredoptr);

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment
On Mon, Aug 29, 2022 at 12:21 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > Mmm. That seems wrong. So forget about that. The proposed patch looks > fine to me. Thanks for thinking it over. Committed and back-patched as far as v10, since that's the oldest supported release. -- Robert Haas EDB: http://www.enterprisedb.com