Thread: BUG #8686: Standby could not restart.
The following bug has been logged on the website:

Bug reference:      8686
Logged by:          Tomonari Katsumata
Email address:      katsumata.tomonari@po.ntts.co.jp
PostgreSQL version: 9.2.6
Operating system:   Red Hat Enterprise Linux 6.2 x86_64
Description:

Hi,

I'm testing whether a standby can restart under asynchronous replication.

The test case is very simple:
(A) start asynchronous replication
(B) stop the standby with "-m f"
(C) start the standby

I repeated (B) and (C) periodically.

In almost all cases it worked fine,
but one time the standby could not start because of a PANIC.

---- Log -----
[Standby] 2013-12-09 10:42:30 JST LOG: 00000: database system was shut down in recovery at 2013-12-09 10:42:29 JST
[Standby] 2013-12-09 10:42:30 JST LOCATION: StartupXLOG, xlog.c:6273
cp: cannot stat `../arc/00000002.history': No such file or directory
[Standby] 2013-12-09 10:42:30 JST LOG: 00000: entering standby mode
[Standby] 2013-12-09 10:42:30 JST LOCATION: StartupXLOG, xlog.c:6359
cp: cannot stat `../arc/000000010000000100000008': No such file or directory
[Standby] 2013-12-09 10:42:30 JST LOG: 00000: consistent recovery state reached at 1/8E7F110
[Standby] 2013-12-09 10:42:30 JST LOCATION: CheckRecoveryConsistency, xlog.c:7366
[Standby] 2013-12-09 10:42:30 JST LOG: 00000: restored log file "000000010000000100000007" from archive
[Standby] 2013-12-09 10:42:30 JST LOCATION: RestoreArchivedFile, xlog.c:3273
[Standby] 2013-12-09 10:42:30 JST LOG: 00000: redo starts at 1/783B230
[Standby] 2013-12-09 10:42:30 JST LOCATION: StartupXLOG, xlog.c:6827
[Standby] 2013-12-09 10:42:30 JST WARNING: 01000: page 1365 of relation base/16384/16388 does not exist
[Standby] 2013-12-09 10:42:30 JST CONTEXT: xlog redo hot_update: rel 1663/16384/16388; tid 1365/152; new 1365/153
[Standby] 2013-12-09 10:42:30 JST LOCATION: report_invalid_page, xlogutils.c:66
[Standby] 2013-12-09 10:42:30 JST PANIC: XX000: WAL contains references to invalid pages
[Standby] 2013-12-09 10:42:30 JST CONTEXT: xlog redo hot_update: rel 1663/16384/16388; tid 1365/152; new 1365/153
[Standby] 2013-12-09 10:42:30 JST LOCATION: log_invalid_page, xlogutils.c:91
[Standby] 2013-12-09 10:42:30 JST LOG: 00000: startup process (PID 12560) was terminated by signal 6: Aborted
[Standby] 2013-12-09 10:42:30 JST LOCATION: LogChildExit, postmaster.c:2895
[Standby] 2013-12-09 10:42:30 JST LOG: 00000: terminating any other active server processes
[Standby] 2013-12-09 10:42:30 JST LOCATION: HandleChildCrash, postmaster.c:2682
[Standby] 2013-12-09 11:10:12 JST LOG: 00000: database system was interrupted while in recovery at log time 2013-12-09 10:32:14 JST
[Standby] 2013-12-09 11:10:12 JST HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
[Standby] 2013-12-09 11:10:12 JST LOCATION: StartupXLOG, xlog.c:6289
---------

I tried to fix this problem.
At first, I suspected that the recovery state reached "consistent" before
redo started, and then I checked pg_control and the related WAL.
The WAL sequence is like below.

WAL--(a)--(b)--(c)--(d)--(e)-->
================================================
(a) Latest checkpoint's REDO location   1/783B230
(b) hot_update                          1/7842010
(c) truncate                            1/8E7E5C8
(d) Latest checkpoint location          1/8E7F0B0
(e) Minimum recovery ending location    1/8E7F110
================================================

From these things, I found that it happens with the following scenario.
----------
(1) standby starting
(2) seeking checkpoint location 1/8E7F0B0 because backup_label is not present
(3) reachedConsistency is set to true at 1/8E7F110 in CheckRecoveryConsistency
(4) redo starts from 1/783B230
(5) PANIC at 1/7842010 because reachedConsistency is already set and the
    record operates on a block whose truncation at (c) has not been replayed yet
----------

At step (2), EndRecPtr is set to 1/8E7F110 (just past the checkpoint record
at 1/8E7F0B0), so reachedConsistency is set to true at step (3).

I think there is no need to advance EndRecPtr while seeking the checkpoint
location. I revised it that way and it worked fine.

I think this is a very, very narrow gate, but it can happen.

I'm not sure whether this problem happens with synchronous replication or
other versions of PostgreSQL (i.e. 9.3/9.1/9.0), but at least we need to
fix it.

regards,
----------------
Tomonari Katsumata
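To make the race concrete, here is a minimal, self-contained C sketch of the
comparison that misfires. The XLogRecPtr typedef, the LSN() macro and the
check_consistency() helper are simplified stand-ins invented for this
illustration, not the actual xlog.c code; only the LSN values come from the
report above.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified stand-in for the real XLogRecPtr: a single 64-bit byte
 * position, printed as the familiar hi/lo pair. */
typedef uint64_t XLogRecPtr;
#define LSN(hi, lo)  (((XLogRecPtr) (hi) << 32) | (lo))

static bool reachedConsistency = false;

/* Model of the test in CheckRecoveryConsistency(): recovery is declared
 * consistent once the replay position has passed minRecoveryPoint.  The
 * real function checks several more conditions. */
static void
check_consistency(XLogRecPtr lastReplayedEndRecPtr, XLogRecPtr minRecoveryPoint)
{
    if (!reachedConsistency && lastReplayedEndRecPtr >= minRecoveryPoint)
    {
        reachedConsistency = true;
        printf("consistent recovery state reached at %X/%X\n",
               (unsigned int) (lastReplayedEndRecPtr >> 32),
               (unsigned int) lastReplayedEndRecPtr);
    }
}

int
main(void)
{
    XLogRecPtr  minRecoveryPoint = LSN(1, 0x8E7F110);   /* (e) */
    XLogRecPtr  endOfCheckpointRec = LSN(1, 0x8E7F110); /* EndRecPtr after reading (d) */
    XLogRecPtr  redoPoint = LSN(1, 0x783B230);          /* (a), where replay really starts */

    /* Buggy initialization: the replay position is seeded from EndRecPtr of
     * the checkpoint record, so consistency is reported before any redo. */
    check_consistency(endOfCheckpointRec, minRecoveryPoint);

    /* Seeding with the REDO point instead keeps the standby "inconsistent"
     * until replay has actually caught up with minRecoveryPoint. */
    reachedConsistency = false;
    check_consistency(redoPoint, minRecoveryPoint);
    if (!reachedConsistency)
        printf("still inconsistent at %X/%X, as expected\n",
               (unsigned int) (redoPoint >> 32), (unsigned int) redoPoint);
    return 0;
}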
Hi,

This is a patch against REL9_2_STABLE(*).
(*) 0c07ef1ad2ba73a6a050f3e1a19267f961a43586

regards,
---------------
Tomonari Katsumata

(2013/12/19 11:57), katsumata.tomonari@po.ntts.co.jp wrote:
> [...]
Attachment
On 12/19/2013 04:57 AM, katsumata.tomonari@po.ntts.co.jp wrote:
> At first, I suspected that the recovery state reached "consistent" before
> redo started, and then I checked pg_control and the related WAL.
> [...]
> At step (2), EndRecPtr is set to 1/8E7F110 (just past the checkpoint record
> at 1/8E7F0B0), so reachedConsistency is set to true at step (3).

Yep. Thanks for a good explanation.

> I think there is no need to advance EndRecPtr while seeking the checkpoint
> location. I revised it that way and it worked fine.

Hmm. There's this comment in StartupXLOG, after reading the checkpoint
record but before reading the first record at the REDO point:

> /*
>  * Initialize shared replayEndRecPtr, lastReplayedEndRecPtr, and
>  * recoveryLastXTime.
>  *
>  * This is slightly confusing if we're starting from an online
>  * checkpoint; we've just read and replayed the checkpoint record, but
>  * we're going to start replay from its redo pointer, which precedes
>  * the location of the checkpoint record itself. So even though the
>  * last record we've replayed is indeed ReadRecPtr, we haven't
>  * replayed all the preceding records yet. That's OK for the current
>  * use of these variables.
>  */
> SpinLockAcquire(&xlogctl->info_lck);
> xlogctl->replayEndRecPtr = ReadRecPtr;
> xlogctl->lastReplayedEndRecPtr = EndRecPtr;
> xlogctl->recoveryLastXTime = 0;
> xlogctl->currentChunkStartTime = 0;
> xlogctl->recoveryPause = false;
> SpinLockRelease(&xlogctl->info_lck);

I think we need to fix that confusion. Your patch will do it by not setting
EndRecPtr yet; that fixes the bug, but leaves those variables in a slightly
strange state; I'm not sure what EndRecPtr points to in that case (0?), but
ReadRecPtr would be set, I guess.

Perhaps we should reset replayEndRecPtr and lastReplayedEndRecPtr to the
REDO point here, instead of ReadRecPtr/EndRecPtr.

- Heikki
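To make the suggestion concrete, the initialization quoted above could be
changed along the lines of the sketch below. This is an illustration only,
not the patch that was eventually committed; checkPointLoc (the location of
the checkpoint record read earlier in StartupXLOG) and the XLByteLT macro
are assumed to be in scope as in 9.2.

SpinLockAcquire(&xlogctl->info_lck);
if (XLByteLT(checkPoint.redo, checkPointLoc))
{
    /* Online checkpoint: replay will restart at the REDO point, which
     * precedes the checkpoint record we just read, so seed both pointers
     * there instead of at the checkpoint record. */
    xlogctl->replayEndRecPtr = checkPoint.redo;
    xlogctl->lastReplayedEndRecPtr = checkPoint.redo;
}
else
{
    /* Shutdown checkpoint: nothing precedes it, keep the old behaviour. */
    xlogctl->replayEndRecPtr = ReadRecPtr;
    xlogctl->lastReplayedEndRecPtr = EndRecPtr;
}
xlogctl->recoveryLastXTime = 0;
xlogctl->currentChunkStartTime = 0;
xlogctl->recoveryPause = false;
SpinLockRelease(&xlogctl->info_lck);

With that, lastReplayedEndRecPtr starts at 1/783B230 in the failing case,
well before the minimum recovery ending location 1/8E7F110, so
CheckRecoveryConsistency() no longer fires before redo has begun.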
Hi Heikki,
Thanks for your confirmation and comments.
> /*
>  * Initialize shared replayEndRecPtr, lastReplayedEndRecPtr, and
>  * recoveryLastXTime.
>  *
>  * This is slightly confusing if we're starting from an online
>  * checkpoint; we've just read and replayed the checkpoint record, but
>  * we're going to start replay from its redo pointer, which precedes
>  * the location of the checkpoint record itself. So even though the
>  * last record we've replayed is indeed ReadRecPtr, we haven't
>  * replayed all the preceding records yet. That's OK for the current
>  * use of these variables.
>  */
> SpinLockAcquire(&xlogctl->info_lck);
> xlogctl->replayEndRecPtr = ReadRecPtr;
> xlogctl->lastReplayedEndRecPtr = EndRecPtr;
> xlogctl->recoveryLastXTime = 0;
> xlogctl->currentChunkStartTime = 0;
> xlogctl->recoveryPause = false;
> SpinLockRelease(&xlogctl->info_lck);
>
> I think we need to fix that confusion. Your patch will do it by not setting
> EndRecPtr yet; that fixes the bug, but leaves those variables in a slightly
> strange state; I'm not sure what EndRecPtr points to in that case (0?), but
> ReadRecPtr would be set, I guess.
Yes, the values were set like below.
  ReadRecPtr: 1/8E7F0B0
  EndRecPtr:  0/0
> Perhaps we should reset replayEndRecPtr and lastReplayedEndRecPtr to the
> REDO point here, instead of ReadRecPtr/EndRecPtr.
I made another patch.
I added a ReadRecord call to make sure the record at the REDO location is
present or not. A similar thing is done when we use backup_label.

Because ReadRecord returns the record it has already read,
I set the ReadRecPtr of that record to EndRecPtr,
and I also set record->xl_prev to ReadRecPtr.
As you said, it also worked fine.

I'm not sure whether we should do the same thing for crash recovery, but
for now I added this step only when archive recovery is needed.
Please see attached patch.
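For readers without the attachment, the description above amounts to
something like the sketch below in StartupXLOG. The guard, the assignment
direction and the surrounding names (InArchiveRecovery, checkPointLoc,
record) are inferred from the description and from the existing 9.2 code
paths; this is not the attached patch itself.

/*
 * Sketch only: in the online-checkpoint case, fetch the record at the REDO
 * point, much as the backup_label path does, so that startup fails early if
 * it is missing.  ReadRecord() leaves ReadRecPtr at the REDO point and
 * EndRecPtr just past that record; re-seed them as described above so the
 * later shared-pointer initialization does not claim that anything beyond
 * the REDO point has already been replayed.
 */
if (InArchiveRecovery && XLByteLT(checkPoint.redo, checkPointLoc))
{
    record = ReadRecord(&(checkPoint.redo), PANIC, false);
    EndRecPtr = ReadRecPtr;          /* end of "replayed" WAL = the REDO point */
    ReadRecPtr = record->xl_prev;    /* the record before it, per the mail above */
}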
regards,
---------------------
Tomonari Katsumata
Attachment
On 12/23/2013 08:15 AM, Tomonari Katsumata wrote:
> [...]
>
> I made another patch.
> I added a ReadRecord call to make sure the record at the REDO location is
> present or not. A similar thing is done when we use backup_label.
>
> Because ReadRecord returns the record it has already read,
> I set the ReadRecPtr of that record to EndRecPtr,
> and I also set record->xl_prev to ReadRecPtr.
> As you said, it also worked fine.
>
> I'm not sure whether we should do the same thing for crash recovery, but
> for now I added this step only when archive recovery is needed.
>
> Please see attached patch.

Hmm. That would still initialize lastReplayedEndRecPtr to the checkpoint
record when you do crash recovery (or begin archive recovery from a
filesystem snapshot, where you perform crash recovery before starting to
read the archive). I'm not very comfortable with that, even though I don't
see an immediate problem with it.

I also noticed that CheckRecoveryConsistency() compares backupEndPoint with
EndRecPtr, but minRecoveryPoint with lastReplayedEndRecPtr. That's correct
as the code stands, but it's an accident waiting to happen: if you called
CheckRecoveryConsistency() after reading a record with ReadRecord(), but
before fully replaying it, it might conclude that it has reached the backup
end location one record too early. And it's inconsistent, anyway.

I propose the attached patch. I haven't tested it, but I think it's a
slightly more robust fix.

- Heikki
Attachment
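For context, here is a paraphrase of the two comparisons being discussed. It
is heavily simplified: the real CheckRecoveryConsistency() checks further
conditions (backupEndRequired, backupStartPoint, and so on), and
reachedBackupEnd below stands in for the real bookkeeping that happens when
the backup end location is passed. EndRecPtr advances as soon as a record
has been read, while lastReplayedEndRecPtr advances only after the record
has been replayed, which is why comparing both targets against the replayed
position is the more robust choice.

/* As the 9.2 code stands (per the discussion above): the backup-end test
 * uses the "read" position, the consistency test uses the "replayed" one. */
if (XLByteLE(ControlFile->backupEndPoint, EndRecPtr))
    reachedBackupEnd = true;
if (XLByteLE(minRecoveryPoint, XLogCtl->lastReplayedEndRecPtr))
    reachedConsistency = true;

/* More robust variant, along the lines proposed: both tests look only at
 * WAL that has actually been replayed. */
if (XLByteLE(ControlFile->backupEndPoint, XLogCtl->lastReplayedEndRecPtr))
    reachedBackupEnd = true;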
Hi Heikki,

Sorry for the slow response, and thank you for the new patch!

I tried your patch, but it was not against REL9_2_STABLE,
so I created a patch for REL9_2_STABLE(*) from yours.
(*) against 8aa6912b8fec3400441c365bde6a1030295de749

It worked fine!
If there is no problem, please commit this patch to REL9_2_STABLE as well.

NOTE:
I think your original patch has a typo: we should use 'checkPoint', not
'checkpoint'.
-----------------------------------------------------------
<original>
24:+ xlogctl->replayEndRecPtr = checkpoint.redo;
27:+ xlogctl->lastReplayedEndRecPtr = checkpoint.redo;
<fix>
24:+ xlogctl->replayEndRecPtr = checkPoint.redo;
27:+ xlogctl->lastReplayedEndRecPtr = checkPoint.redo;
-----------------------------------------------------------

regards,
-------------
Tomonari Katsumata

(2014/01/06 23:38), Heikki Linnakangas wrote:
> [...]
Attachment
On 01/08/2014 10:41 AM, Tomonari Katsumata wrote:
> Hi Heikki,
>
> Sorry for the slow response, and thank you for the new patch!
>
> I tried your patch, but it was not against REL9_2_STABLE,
> so I created a patch for REL9_2_STABLE(*) from yours.
> (*) against 8aa6912b8fec3400441c365bde6a1030295de749
>
> It worked fine!
> If there is no problem, please commit this patch to REL9_2_STABLE as well.

Thanks! Committed to all stable branches 9.0 and above.

- Heikki