Thread: [BUG?] lag of minRecoveryPoint in archive recovery
Hello, I have a problem with PostgreSQL 9.2 with Pacemaker.

The HA standby sometimes fails to start under normal operation. Testing with a bare replication pair showed that the standby fails startup recovery under the operation sequence shown below. 9.3dev is affected too, but 9.1 does not show this problem. The problem became apparent through the invalid-page check during xlog replay, but 9.1 potentially has the same glitch.

After investigation, the lag of minRecoveryPoint behind EndRecPtr in the redo loop seems to be the cause. The lag brings about repeated redoing of unrepeatable xlog sequences such as XLOG_HEAP2_VISIBLE -> SMGR_TRUNCATE on the same page. So I applied the same workaround as in xact_redo_commit_internal to smgr_redo. While doing this, I noticed that CheckRecoveryConsistency() in the redo apply loop should come after redoing the record, so I moved it.

The attached patch (for 9.3dev) fixes this failure, also works for 9.2, and seems to do no harm (and maybe some good) on 9.1. What do you think about this? The details follow.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Operation sequence to cause the problem:
=======
Mst$export PGDATA=$HOME/pgdatam
Mst$rm -rf $PGDATA
Mst$initdb
Mst$vi $PGDATA/postgresql.conf
| wal_level = hot_standby
| checkpoint_segments = 300
| checkpoint_timeout = 1h
| archive_mode = on
| archive_command = 'cp %p <ARCHIVE_DIR>/%f'
| synchronous_standby_names = '*'
| hot_standby = on
| log_checkpoints = on
Mst$vi $PGDATA/pg_hba.conf
| local replication horiguti trust
Mst$pg_ctl start
Stb$export PGDATA=$HOME/pgdatas
Stb$export PGPORT=5433
Stb$rm -rf $PGDATA
Stb$pg_basebackup -X s -F p -D $PGDATA -h /tmp -p 5432
Stb$vi $PGDATA/postgresql.conf
| checkpoint_segments = 1
| checkpoint_timeout = 30s
Stb$vi $PGDATA/recovery.conf
| standby_mode=yes
| primary_conninfo='host=/tmp port=5432'
| restore_command='cp <ARCHIVE_DIR>/%f%p'
Stb$pg_ctl start
Mst$createdb
Mst$psql
Mst_psql=# create table t (a int);
Mst_psql=# insert into t values (1);
Mst_psql=# checkpoint;
Mst_psql=# vacuum;
!!! Wait for the completion of a restart point on the standby !!!
Mst_psql=# delete from t;
Mst_psql=# vacuum;
Stb$pg_ctl stop -m i
Stb$vi $PGDATA/recovery.conf
| restore_command='cp <ARCHIVE_DIR>/%f %p'
Stb$pg_ctl start
LOG: database system was interrupted while in recovery at log time 2012-12-05 09:55:18 JST
HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
LOG: entering standby mode
LOG: redo starts at 0/A020890
WARNING: page 0 of relation base/16384/16385 does not exist
PANIC: WAL contains references to invalid pages
LOG: startup process (PID 18422) was terminated by signal 6: Aborted
LOG: terminating any other active server processes
======

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 09d4dff..cf73b58 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5759,9 +5759,6 @@ StartupXLOG(void)
 			/* Handle interrupt signals of startup process */
 			HandleStartupProcInterrupts();
 
-			/* Allow read-only connections if we're consistent now */
-			CheckRecoveryConsistency();
-
 			/*
 			 * Have we reached our recovery target?
 			 */
@@ -5833,6 +5830,9 @@ StartupXLOG(void)
 			/* Pop the error context stack */
 			error_context_stack = errcallback.previous;
 
+			/* Allow read-only connections if we're consistent now */
+			CheckRecoveryConsistency();
+
 			if (!XLogRecPtrIsInvalid(ControlFile->backupStartPoint) &&
 				XLByteLE(ControlFile->backupEndPoint, EndRecPtr))
 			{

diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 993bc49..d34ab65 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -519,6 +519,12 @@ smgr_redo(XLogRecPtr lsn, XLogRecord *record)
 		visibilitymap_truncate(rel, xlrec->blkno);
 
 		FreeFakeRelcacheEntry(rel);
+
+		/*
+		 * Xlogs before this record are unrepeatable, so advance
+		 * minRecoveryPoint to here.
+		 */
+		XLogFlush(lsn);
 	}
 	else
 		elog(PANIC, "smgr_redo: unknown op code %u", info);
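As a sketch of why the XLOG_HEAP2_VISIBLE -> SMGR_TRUNCATE sequence described above is unrepeatable, here is a toy replay model (Python; purely illustrative, not PostgreSQL code — the record semantics are heavily simplified):

```python
def replay(records, pages):
    """Toy redo loop. XLOG_HEAP2_VISIBLE touches page 0 of the relation;
    SMGR_TRUNCATE removes that page. Redoing the pair a second time
    (because minRecoveryPoint lagged behind what was already applied)
    makes the VISIBLE record reference a page that no longer exists."""
    for kind in records:
        if kind == "XLOG_HEAP2_VISIBLE":
            if 0 not in pages:
                raise RuntimeError("WAL contains references to invalid pages")
            pages[0] = "all-visible"
        elif kind == "SMGR_TRUNCATE":
            pages.pop(0, None)

seq = ["XLOG_HEAP2_VISIBLE", "SMGR_TRUNCATE"]
pages = {0: "heap data"}
replay(seq, pages)            # first pass succeeds; page 0 is truncated away

try:                          # restart with a lagging minRecoveryPoint:
    replay(seq, pages)        # the same records are redone from scratch
    panicked = False
except RuntimeError:
    panicked = True           # corresponds to the PANIC in the log above
assert panicked
```

This is only a model of the record ordering; in the real server the invalid-page reference is detected by the xlog invalid-page check and escalates to PANIC once the database is considered consistent.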
On Thursday, December 06, 2012 9:35 AM Kyotaro HORIGUCHI wrote:
> Hello, I have a problem with PostgreSQL 9.2 with Pacemaker.
>
> The HA standby sometimes fails to start under normal operation.
> Testing with a bare replication pair showed that the standby fails
> startup recovery under the operation sequence shown below. 9.3dev is
> affected too, but 9.1 does not show this problem. The problem became
> apparent through the invalid-page check during xlog replay, but 9.1
> potentially has the same glitch.
>
> After investigation, the lag of minRecoveryPoint behind EndRecPtr in
> the redo loop seems to be the cause. The lag brings about repeated
> redoing of unrepeatable xlog sequences such as XLOG_HEAP2_VISIBLE ->
> SMGR_TRUNCATE on the same page. So I applied the same workaround as
> in xact_redo_commit_internal to smgr_redo. While doing this, I
> noticed that CheckRecoveryConsistency() in the redo apply loop should
> come after redoing the record, so I moved it.

I think moving CheckRecoveryConsistency() after the redo apply might cause a problem. Currently it is done before recoveryStopsHere(), which allows connections on the hot standby. If recovery pauses or stops due to that function, connections might not be allowed, because CheckRecoveryConsistency() has not been called.

With Regards,
Amit Kapila.
On Thu, Dec 6, 2012 at 8:39 PM, Amit Kapila <amit.kapila@huawei.com> wrote:
> On Thursday, December 06, 2012 9:35 AM Kyotaro HORIGUCHI wrote:
>> [...]
>
> I think moving CheckRecoveryConsistency() after the redo apply might
> cause a problem. Currently it is done before recoveryStopsHere(),
> which allows connections on the hot standby. If recovery pauses or
> stops due to that function, connections might not be allowed, because
> CheckRecoveryConsistency() has not been called.

Yes, so we should just add the CheckRecoveryConsistency() call after rm_redo rather than moving it? This issue is related to the old discussion:
http://archives.postgresql.org/pgsql-bugs/2012-09/msg00101.php

Regards,

-- 
Fujii Masao
On Thu, Dec 6, 2012 at 1:04 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
> index 993bc49..d34ab65 100644
> --- a/src/backend/catalog/storage.c
> +++ b/src/backend/catalog/storage.c
> @@ -519,6 +519,12 @@ smgr_redo(XLogRecPtr lsn, XLogRecord *record)
>  		visibilitymap_truncate(rel, xlrec->blkno);
> 
>  		FreeFakeRelcacheEntry(rel);
> +
> +		/*
> +		 * Xlogs before this record are unrepeatable, so advance
> +		 * minRecoveryPoint to here.
> +		 */
> +		XLogFlush(lsn);
>  	}
>  	else
>  		elog(PANIC, "smgr_redo: unknown op code %u", info);

I think that minRecoveryPoint should be updated before the data-file changes, so XLogFlush() should be called before smgrtruncate(). No?

Regards,

-- 
Fujii Masao
Thank you.

> I think moving CheckRecoveryConsistency() after the redo apply might
> cause a problem. Currently it is done before recoveryStopsHere(),
> which allows connections on the hot standby. If recovery pauses or
> stops due to that function, connections might not be allowed, because
> CheckRecoveryConsistency() has not been called.

It depends on the precise meaning of minRecoveryPoint. I've not found an explicit explanation for it.

Currently minRecoveryPoint is updated only in XLogFlush; other updates of it seem to reset it to InvalidXLogRecPtr. XLogFlush seems to be called AFTER the core redo operation has been done, at least in xact_redo_commit_internal. This suggests to me that the meaning of minRecoveryPoint is: "just AFTER applying the XLog record at the current LSN, the database files can be assumed to be consistent."

At Mon, 10 Dec 2012 00:36:31 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwG4W5QZ7+LJimg8xxuevwz0bYniHmZLZmWf0j6kBiuRCg@mail.gmail.com>
> Yes, so we should just add the CheckRecoveryConsistency() call after
> rm_redo rather than moving it? This issue is related to the old discussion:
> http://archives.postgresql.org/pgsql-bugs/2012-09/msg00101.php

Since I've not clearly understood the problem with missing it before redo, and it also seems to have no harm on performance, I have no objection to placing it both before and after redo.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Hello,

> I think that minRecoveryPoint should be updated before the data-file
> changes, so XLogFlush() should be called before smgrtruncate(). No?

Mmm.. As far as I saw in xact_redo_commit_internal, XLogFlush seems to be done AFTER the substantial redo operation. Is that wrong? If so, I suppose xact_redo_commit_internal should be fixed in the same way.

At Mon, 10 Dec 2012 00:41:34 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwEUpEvK=s0xRFinWnoasjhy3poD3HiE4namoWHK0a-d_w@mail.gmail.com>
> On Thu, Dec 6, 2012 at 1:04 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> [patch quoted above]
>
> I think that minRecoveryPoint should be updated before the data-file
> changes, so XLogFlush() should be called before smgrtruncate(). No?

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
On Monday, December 10, 2012 7:16 AM Kyotaro HORIGUCHI wrote:
> Thank you.
>
> It depends on the precise meaning of minRecoveryPoint. I've not found
> an explicit explanation for it.
>
> Currently minRecoveryPoint is updated only in XLogFlush; other
> updates of it seem to reset it to InvalidXLogRecPtr. XLogFlush seems
> to be called AFTER the core redo operation has been done, at least in
> xact_redo_commit_internal. This suggests to me that the meaning of
> minRecoveryPoint is: "just AFTER applying the XLog record at the
> current LSN, the database files can be assumed to be consistent."
>
> At Mon, 10 Dec 2012 00:36:31 +0900, Fujii Masao <masao.fujii@gmail.com>
> wrote in <CAHGQGwG4W5QZ7+LJimg8xxuevwz0bYniHmZLZmWf0j6kBiuRCg@mail.gmail.com>
>> Yes, so we should just add the CheckRecoveryConsistency() call after
>> rm_redo rather than moving it? This issue is related to the old discussion:
>> http://archives.postgresql.org/pgsql-bugs/2012-09/msg00101.php
>
> Since I've not clearly understood the problem with missing it before
> redo, and it also seems to have no harm on performance, I have no
> objection to placing it both before and after redo.

I have tried that way as well, but it didn't completely resolve the problem reported in the above link, because the defect arises on the first ReadRecord(). To resolve the defect mentioned in the link by Fujii Masao, we actually need to check, and try to reset backupStartPoint, before each ReadRecord(). The reason is that ReadRecord() can go and start waiting for records from the master.
So if backupStartPoint is not reset and CheckRecoveryConsistency() is not called, the standby can keep waiting for records from the master and no connections will be allowed on it. I have modified XLogPageRead(...) so that before it calls WaitForWALToBecomeAvailable(), the following code runs:

if (!XLogRecPtrIsInvalid(ControlFile->backupEndPoint) &&
    XLByteLE(ControlFile->backupEndPoint, EndRecPtr))
{
    /*
     * We have reached the end of the base backup, the point where
     * the minimum recovery point in pg_control indicates. The
     * data on disk is now consistent. Reset backupStartPoint
     * and backupEndPoint.
     */
    elog(DEBUG1, "end of backup reached");

    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);

    MemSet(&ControlFile->backupStartPoint, 0, sizeof(XLogRecPtr));
    MemSet(&ControlFile->backupEndPoint, 0, sizeof(XLogRecPtr));
    ControlFile->backupEndRequired = false;
    UpdateControlFile();

    LWLockRelease(ControlFileLock);
}

CheckRecoveryConsistency();

This completely resolved the problem reported in the above link for me.

With Regards,
Amit Kapila.
On 10.12.2012 03:52, Kyotaro HORIGUCHI wrote:
>> I think that minRecoveryPoint should be updated before the data-file
>> changes, so XLogFlush() should be called before smgrtruncate(). No?
>
> Mmm.. As far as I saw in xact_redo_commit_internal, XLogFlush seems
> to be done AFTER the substantial redo operation. Is that wrong? If
> so, I suppose xact_redo_commit_internal should be fixed in the same
> way.

So, two options:

1. Redo the truncation, then XLogFlush(). There's a window where the original problem could still occur: the file is truncated, and you crash before XLogFlush() finishes.

2. XLogFlush(), then redo the truncation. If the truncation fails for some reason (disk failure?) and you have already updated minRecoveryPoint, you cannot start up until you somehow fix the problem so that the truncation can succeed, and then finish recovery.

Both options have their merits. The window in 1 is very small, and in 2 the probability that an unlink() or truncation fails is very small as well. We've chosen 1 in xact_redo_commit_internal(), but I don't think that was a carefully made decision; it just imitates what happens in the corresponding non-redo commit path.

In xact_redo_commit_internal(), it makes sense to do XLogFlush() afterwards for CREATE DATABASE and CREATE TABLESPACE. Those create files, and if you e.g. run out of disk space, or hit a non-existent path, you don't want minRecoveryPoint to be updated yet; otherwise you can no longer recover to the point just before the creation of those files. But in the case of DROP DATABASE, you have the exact same situation as with truncation: if you have already deleted some files, you must not be able to stop recovery at a point before that.

So I'd say we should update minRecoveryPoint first, then truncate/delete. But we should still keep the XLogFlush() at the end of xact_redo_commit_internal(), for the case where files/directories are created. Patch attached.

- Heikki
Attachment
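The trade-off between the two orderings discussed above can be condensed into a toy crash simulation (Python; purely illustrative, not PostgreSQL code — the LSNs and step names are invented). A "crash" between the two steps shows which invariant each ordering can violate:

```python
def replay_truncate(order, crash_between):
    """Simulate redoing a truncation record whose flush point is LSN 300.

    order = "redo_then_flush" (option 1) or "flush_then_redo" (option 2).
    Returns (min_recovery_point, file_truncated) as left on disk.
    """
    min_recovery_point = 200   # state before this record
    file_truncated = False

    steps = (["truncate", "flush"] if order == "redo_then_flush"
             else ["flush", "truncate"])
    for i, step in enumerate(steps):
        if step == "truncate":
            file_truncated = True          # the unrepeatable data-file change
        else:
            min_recovery_point = 300       # XLogFlush() advances minRecoveryPoint
        if crash_between and i == 0:
            break                          # crash before the second step
    return min_recovery_point, file_truncated

# Option 1: the crash window leaves the file truncated while
# minRecoveryPoint still permits declaring consistency before the
# truncation record is re-replayed (the original bug's window).
mrp, truncated = replay_truncate("redo_then_flush", crash_between=True)
assert truncated and mrp == 200

# Option 2: minRecoveryPoint is advanced first; if the truncation then
# fails, recovery can no longer stop before the (unapplied) truncation.
mrp, truncated = replay_truncate("flush_then_redo", crash_between=True)
assert not truncated and mrp == 300
```

Without a crash, both orderings converge on the same end state; the difference is only which invariant can be violated in the window between the two steps.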
On 10.12.2012 13:50, Heikki Linnakangas wrote:
> So I'd say we should update minRecoveryPoint first, then
> truncate/delete. But we should still keep the XLogFlush() at the end of
> xact_redo_commit_internal(), for the case where files/directories are
> created. Patch attached.

Committed and backpatched that. Attached is a script I used to reproduce this problem, going back to 8.4.

- Heikki
Attachment
On Tue, Dec 11, 2012 at 1:33 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> On 10.12.2012 13:50, Heikki Linnakangas wrote:
>> So I'd say we should update minRecoveryPoint first, then
>> truncate/delete. But we should still keep the XLogFlush() at the end of
>> xact_redo_commit_internal(), for the case where files/directories are
>> created. Patch attached.

Sounds reasonable.

> Committed and backpatched that. Attached is a script I used to reproduce
> this problem, going back to 8.4.

Thanks! Unfortunately, I could reproduce the problem even after that commit. Attached is the script I used to reproduce it.

The cause is that CheckRecoveryConsistency() is called before rm_redo(), as Horiguchi-san suggested upthread. Imagine the case where minRecoveryPoint is set to the location of the XLOG_SMGR_TRUNCATE record. When restarting the server with that minRecoveryPoint, the following happens, and then a PANIC occurs:

1. The XLOG_SMGR_TRUNCATE record is read.
2. CheckRecoveryConsistency() is called, and the database is marked as consistent since we've reached minRecoveryPoint (i.e., the location of XLOG_SMGR_TRUNCATE).
3. The XLOG_SMGR_TRUNCATE record is replayed, and an invalid page is found.
4. Since the database has already been marked as consistent, the invalid page leads to a PANIC.

Regards,

-- 
Fujii Masao
Attachment
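The four-step failure scenario above can be condensed into a small simulation (Python; illustrative only — the LSNs are invented and the replay loop is heavily simplified) showing why a consistency check based on the last record *read* fires one record too early:

```python
def first_consistent_point(records, min_recovery_point, compare="read"):
    """Toy replay loop; return the index of the record during whose
    iteration the database would be marked consistent, or None.

    records: list of EndRecPtr values (end LSN of each WAL record).
    compare="read": check against EndRecPtr, the last record *read*
                    (the buggy behaviour).
    compare="replayed": check against the end of the last record
                    actually *replayed* (the fix).
    """
    last_replayed = 0
    for i, end_rec_ptr in enumerate(records):
        # At this point record i has been *read* but not yet replayed.
        check_lsn = end_rec_ptr if compare == "read" else last_replayed
        if check_lsn >= min_recovery_point:
            return i                    # CheckRecoveryConsistency() fires here
        last_replayed = end_rec_ptr     # rm_redo() done; record i replayed
    return None

records = [100, 200, 300]               # 300 = end of XLOG_SMGR_TRUNCATE
# Buggy: marked consistent while the TRUNCATE (index 2) is still
# unreplayed, so the invalid page found during its replay is a PANIC.
assert first_consistent_point(records, 300, compare="read") == 2
# Fixed: consistency is only declared after the TRUNCATE was replayed;
# within these three records, that means never.
assert first_consistent_point(records, 300, compare="replayed") is None
```

With the fixed comparison, consistency would be declared on the iteration for the *next* record after the truncation — which is exactly why the placement of the check relative to rm_redo also matters, as discussed later in the thread.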
Hello, I've also found that this does not fix the problem.

>>> So I'd say we should update minRecoveryPoint first, then
>>> truncate/delete. But we should still keep the XLogFlush() at the end of
>>> xact_redo_commit_internal(), for the case where files/directories are
>>> created. Patch attached.
>
> Sounds reasonable.

It makes perfect sense.

>> Committed and backpatched that. Attached is a script I used to reproduce
>> this problem, going back to 8.4.
>
> Thanks!
>
> Unfortunately, I could reproduce the problem even after that commit.
> Attached is the script I used to reproduce it.

Me too.

> The cause is that CheckRecoveryConsistency() is called before rm_redo(),
> as Horiguchi-san suggested upthread. Imagine the case where
> minRecoveryPoint is set to the location of the XLOG_SMGR_TRUNCATE
> record. When restarting the server with that minRecoveryPoint,
> the following happens, and then a PANIC occurs:
>
> 1. The XLOG_SMGR_TRUNCATE record is read.
> 2. CheckRecoveryConsistency() is called, and the database is marked as
>    consistent since we've reached minRecoveryPoint (i.e., the location
>    of XLOG_SMGR_TRUNCATE).
> 3. The XLOG_SMGR_TRUNCATE record is replayed, and an invalid page is
>    found.
> 4. Since the database has already been marked as consistent, the
>    invalid page leads to a PANIC.

Exactly.

In smgr_redo, EndRecPtr, which points past the SMGR_TRUNCATE record to the next one, is used as the new minRecoveryPoint. On the other hand, during the second startup of the standby, CheckRecoveryConsistency checks for consistency with XLByteLE(minRecoveryPoint, EndRecPtr), which becomes true just BEFORE the SMGR_TRUNCATE record is applied. So reachedConsistency becomes true just before the SMGR_TRUNCATE record is applied. Bang!

I said I had no objection to placing CheckRecoveryConsistency both before and after rm_redo in my previous message, but that was wrong. Given how the minRecoveryPoint value is set, it should be placed after rm_redo, from the point of view of when the database should be considered consistent. Indeed, simply moving CheckRecoveryConsistency after rm_redo resulted in a successful startup, though this ignores whatever reason there may be for it to come before, which is unknown to me.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
On 11.12.2012 08:07, Kyotaro HORIGUCHI wrote:
>> The cause is that CheckRecoveryConsistency() is called before rm_redo(),
>> as Horiguchi-san suggested upthread. Imagine the case where
>> minRecoveryPoint is set to the location of the XLOG_SMGR_TRUNCATE
>> record. When restarting the server with that minRecoveryPoint,
>> the following happens, and then a PANIC occurs:
>>
>> 1. The XLOG_SMGR_TRUNCATE record is read.
>> 2. CheckRecoveryConsistency() is called, and the database is marked as
>>    consistent since we've reached minRecoveryPoint (i.e., the location
>>    of XLOG_SMGR_TRUNCATE).
>> 3. The XLOG_SMGR_TRUNCATE record is replayed, and an invalid page is
>>    found.
>> 4. Since the database has already been marked as consistent, the
>>    invalid page leads to a PANIC.
>
> Exactly.
>
> In smgr_redo, EndRecPtr, which points past the SMGR_TRUNCATE record
> to the next one, is used as the new minRecoveryPoint. On the other
> hand, during the second startup of the standby,
> CheckRecoveryConsistency checks for consistency with
> XLByteLE(minRecoveryPoint, EndRecPtr), which becomes true just
> BEFORE the SMGR_TRUNCATE record is applied. So reachedConsistency
> becomes true just before the SMGR_TRUNCATE record is applied. Bang!

Ah, I see. I thought moving CheckRecoveryConsistency was just a nice-to-have thing, so that we'd notice that we're consistent earlier, so I didn't pay attention to that part.

I think the real issue here is that CheckRecoveryConsistency should be comparing minRecoveryPoint against recoveryLastRecPtr, not EndRecPtr as it currently does. EndRecPtr points to the end of the last record *read*, while recoveryLastRecPtr points to the end of the last record *replayed*. The latter is what CheckRecoveryConsistency should be concerned with.

BTW, it occurs to me that we have two variables in the shared struct that are almost the same thing: recoveryLastRecPtr and replayEndRecPtr. The former points to the end of the last record successfully replayed, while replayEndRecPtr points to the end of the last record successfully replayed, or being replayed at the moment. I find that confusing, so I suggest that we rename recoveryLastRecPtr to make that more clear. Included in the attached patch.

- Heikki
Attachment
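The distinction between the two shared pointers can be sketched like this (Python; illustrative only — these are not the actual xlog.c structures, and the names follow the rename proposed above):

```python
class RecoveryState:
    """Toy model of the two shared LSN pointers:

    replayEndRecPtr       - end of the record being (or last) replayed;
                            advanced *before* rm_redo starts.
    lastReplayedEndRecPtr - end of the last record *fully* replayed
                            (the renamed recoveryLastRecPtr); advanced
                            only *after* rm_redo returns.
    """
    def __init__(self):
        self.replayEndRecPtr = 0
        self.lastReplayedEndRecPtr = 0

    def begin_replay(self, end_rec_ptr):
        self.replayEndRecPtr = end_rec_ptr      # record read, replay starting

    def finish_replay(self):
        self.lastReplayedEndRecPtr = self.replayEndRecPtr  # rm_redo done

state = RecoveryState()
state.begin_replay(300)
# Mid-replay the two pointers disagree; only lastReplayedEndRecPtr is a
# safe basis for CheckRecoveryConsistency(), since the record at 300 may
# still leave the data files inconsistent until replay completes.
assert state.replayEndRecPtr == 300 and state.lastReplayedEndRecPtr == 0
state.finish_replay()
assert state.lastReplayedEndRecPtr == 300
```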
On Tue, Dec 11, 2012 at 6:14 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> On 11.12.2012 08:07, Kyotaro HORIGUCHI wrote:
>> [...]
>
> Ah, I see. I thought moving CheckRecoveryConsistency was just a
> nice-to-have thing, so that we'd notice that we're consistent earlier,
> so I didn't pay attention to that part.
>
> I think the real issue here is that CheckRecoveryConsistency should be
> comparing minRecoveryPoint against recoveryLastRecPtr, not EndRecPtr as
> it currently does. EndRecPtr points to the end of the last record
> *read*, while recoveryLastRecPtr points to the end of the last record
> *replayed*. The latter is what CheckRecoveryConsistency should be
> concerned with.
>
> BTW, it occurs to me that we have two variables in the shared struct
> that are almost the same thing: recoveryLastRecPtr and replayEndRecPtr.
> The former points to the end of the last record successfully replayed,
> while replayEndRecPtr points to the end of the last record successfully
> replayed, or being replayed at the moment. I find that confusing, so I
> suggest that we rename recoveryLastRecPtr to make that more clear.
> Included in the attached patch.

The patch looks good. I confirmed that the PANIC error didn't happen with the patch.

One comment, on xlog.c:

> * Initialize shared replayEndRecPtr, recoveryLastRecPtr, and

s/recoveryLastRecPtr/lastReplayedEndRecPtr

Regards,

-- 
Fujii Masao
On 11.12.2012 17:04, Fujii Masao wrote:
> [...]
>
> The patch looks good. I confirmed that the PANIC error didn't happen
> with the patch.

Thanks. After thinking about this more, I think it's still not quite right. Now that we fix the check in CheckRecoveryConsistency, we have to move the call to CheckRecoveryConsistency to after the rm_redo call. Otherwise you don't enter hot standby mode after replaying the last record, the one ending at minRecoveryPoint; you have to wait for one more record to be read (but not replayed) before CheckRecoveryConsistency gets called and we let backends in.

As the code stands, the bad placement of CheckRecoveryConsistency is compensated by the bug that we open up for hot standby one record too early. The net result is that we open up for hot standby just before replaying the last record that makes us consistent. The window for someone to see the inconsistent state is really small: postmaster has to receive the signal and start accepting connections, and you have to connect and run a query, all before the startup process gets around to replaying the record it has already read into memory.

- Heikki
On 11.12.2012 17:04, Fujii Masao wrote:
> The patch looks good. I confirmed that the PANIC error didn't happen
> with the patch.

Ok, committed, and also moved the CheckRecoveryConsistency call. Please take a look to double-check that I didn't miss anything.

- Heikki
On Wed, Dec 12, 2012 at 2:03 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> On 11.12.2012 17:04, Fujii Masao wrote:
>> The patch looks good. I confirmed that the PANIC error didn't happen
>> with the patch.
>
> Ok, committed, and also moved the CheckRecoveryConsistency call. Please
> take a look to double-check that I didn't miss anything.

Looks good to me. Thanks a lot!

Regards,

-- 
Fujii Masao