standby promotion can create unreadable WAL - Mailing list pgsql-hackers
| From | Robert Haas |
|---|---|
| Subject | standby promotion can create unreadable WAL |
| Date | |
| Msg-id | CA+TgmoY0Lri=fCueg7m_2R_bSspUb1F8OFycEGaHNJw_EUW-=Q@mail.gmail.com |
| Responses | Re: standby promotion can create unreadable WAL (3 replies) |
| List | pgsql-hackers |
My colleague Dilip Kumar and I have discovered what I believe to be a bug in the recently-added "overwrite contrecord" stuff. I'm not sure whether this bug has any serious consequences; I think there may be a scenario where it does, but I'm not sure about that.

Suppose you have a primary and a standby, and the standby is promoted after reading a partial WAL record. The attached script, which was written by Dilip and slightly modified by me, creates this scenario by setting up an archiving-only standby, writing a record that crosses a segment boundary, and then promoting the standby. If you then try to run pg_waldump on the WAL on timeline 2, it goes boom:

[rhaas pg_wal]$ pg_waldump 000000020000000000000004 000000020000000000000005 2>&1 | tail -n4
rmgr: Heap len (rec/tot): 1959/ 1959, tx: 728, lsn: 0/04FFE7B0, prev 0/04FFDFF0, desc: INSERT off 4 flags 0x00, blkref #0: rel 1663/5/16384 blk 2132
rmgr: Heap len (rec/tot): 1959/ 1959, tx: 728, lsn: 0/04FFEF58, prev 0/04FFE7B0, desc: INSERT+INIT off 1 flags 0x00, blkref #0: rel 1663/5/16384 blk 2133
rmgr: Heap len (rec/tot): 1959/ 1959, tx: 728, lsn: 0/04FFF700, prev 0/04FFEF58, desc: INSERT off 2 flags 0x00, blkref #0: rel 1663/5/16384 blk 2133
pg_waldump: error: error in WAL record at 0/4FFF700: invalid record length at 0/4FFFEA8: wanted 24, got 0

What's happening here is that the last WAL segment from timeline 1, which is 000000010000000000000004, gets copied over to the new timeline up to the point where the last complete record on that timeline ends, namely 0/4FFFEA8. I think the first record on the new timeline should be written starting at that LSN, but that's not what happens. Instead, the rest of that WAL segment remains zeroed, and the first WAL record on the new timeline is written at the beginning of the next segment:

[rhaas pg_wal]$ pg_waldump 000000020000000000000005 2>&1 | head -n4
rmgr: XLOG len (rec/tot): 42/ 42, tx: 0, lsn: 0/05000028, prev 0/04FFF700, desc: OVERWRITE_CONTRECORD lsn 0/4FFFEA8; time 2022-08-22 13:49:22.874435 EDT
rmgr: XLOG len (rec/tot): 114/ 114, tx: 0, lsn: 0/05000058, prev 0/05000028, desc: CHECKPOINT_SHUTDOWN redo 0/5000058; tli 2; prev tli 1; fpw true; xid 0:729; oid 24576; multi 1; offset 0; oldest xid 719 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown
rmgr: XLOG len (rec/tot): 30/ 30, tx: 0, lsn: 0/050000D0, prev 0/05000058, desc: NEXTOID 32768
rmgr: Storage len (rec/tot): 42/ 42, tx: 0, lsn: 0/050000F0, prev 0/050000D0, desc: CREATE base/5/24576

Nothing that uses xlogreader is going to be able to bridge the gap between file #4 and file #5. In this case it doesn't matter very much, because we immediately write a checkpoint record into file #5, so if we crash we won't try to replay file #4 anyway. However, if anything did try to look at file #4, it would get confused. Maybe that can happen if this is a streaming standby, where we only write an end-of-recovery record upon promotion rather than a checkpoint, or maybe if there are cascading standbys someone could try to actually use the 000000020000000000000004 file for something. I'm not sure. But unless I'm missing something, that file is bogus, and our only hope of not having problems is that perhaps no one will ever look at it.

I think that the cause of this problem is this code right here:

    /*
     * Actually, if WAL ended in an incomplete record, skip the parts that
     * made it through and start writing after the portion that persisted.
     * (It's critical to first write an OVERWRITE_CONTRECORD message, which
     * we'll do as soon as we're open for writing new WAL.)
     */
    if (!XLogRecPtrIsInvalid(missingContrecPtr))
    {
        Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
        EndOfLog = missingContrecPtr;
    }

It seems to me that this if-statement should also test that the TLI has not changed, i.e. if (newTLI == endOfRecoveryInfo->lastRecTLI && !XLogRecPtrIsInvalid(missingContrecPtr)). If the TLI hasn't changed, then everything the comment says is correct, and I think that what the code does is also correct. However, if the TLI *has* changed, then I think we must not advance EndOfLog here, because the WAL that was copied from the old timeline to the new timeline ends at the point in the file corresponding to the value of EndOfLog just before executing this code. When this code then moves EndOfLog forward to the beginning of the next segment, it leaves the unused portion of the previous segment as all zeroes, which creates the problem described above.
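Spelled out, that adjusted test would look something like this (a sketch of the idea only, using the variable names visible in the code quoted above, not a tested patch):

    /*
     * Sketch: only skip over the missing contrecord when we are continuing
     * on the same timeline. If the TLI changed at promotion, leave EndOfLog
     * alone so that new WAL starts right after the last complete record
     * that was copied onto the new timeline.
     */
    if (newTLI == endOfRecoveryInfo->lastRecTLI &&
        !XLogRecPtrIsInvalid(missingContrecPtr))
    {
        Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
        EndOfLog = missingContrecPtr;
    }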
(Incidentally, there's also a bug in pg_waldump here: it's reporting the wrong LSN as the source of the error. 0/4FFF700 is not the record that's busted, as shown by the fact that it was successfully decoded and shown in the output. The relevant code in pg_waldump should be using EndRecPtr instead of ReadRecPtr to report the error. If it did, it would complain about 0/4FFFEA8, which is where the problem really is. This is of the same vintage as the bug fixed by d9fbb8862959912c5266364059c0abeda0c93bbf, though in that case the issue was that all errors were reported using the start LSN of the first of several records read, no matter where the error actually happened, whereas here the error is reported using the start LSN of the previous record instead of the current one.)
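To illustrate the pg_waldump side of this, the error report at the end of its main loop would change roughly as follows (again just a sketch, assuming the usual xlogreader_state and errormsg variables in pg_waldump.c, not a tested patch):

    /*
     * Sketch: report the error at the LSN where decoding actually failed
     * (EndRecPtr) rather than at the start of the previously decoded
     * record (ReadRecPtr).
     */
    if (errormsg)
        pg_fatal("error in WAL record at %X/%X: %s",
                 LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
                 errormsg);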
Thoughts?

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachment