Thread: [BUG?] lag of minRecoveryPoint in archive recovery
Hello, I have a problem with PostgreSQL 9.2 with Pacemaker.

The HA standby sometimes fails to start under normal operation. Testing with a bare replication pair showed that the standby fails startup recovery under the operation sequence shown below. 9.3dev is affected too, but 9.1 does not show this problem. The problem became apparent through the invalid-page check during xlog replay, but 9.1 potentially has the same glitch.

After investigation, the lag of minRecoveryPoint behind EndRecPtr in the redo loop seems to be the cause. The lag brings about repeated redoing of unrepeatable xlog sequences such as XLOG_HEAP2_VISIBLE -> SMGR_TRUNCATE on the same page. So I applied the same workaround as in xact_redo_commit_internal to smgr_redo. While doing this, I noticed that CheckRecoveryConsistency() in the redo apply loop should come after redoing the record, so I moved it.

The attached patch (for 9.3dev) fixes this failure, also works for 9.2, and seems to do no harm (and maybe some good) on 9.1. What do you think about this? The details follow.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Operation sequence to cause the problem:
=======
Mst$export PGDATA=$HOME/pgdatam
Mst$rm -rf $PGDATA
Mst$initdb
Mst$vi $PGDATA/postgresql.conf
| wal_level = hot_standby
| checkpoint_segments = 300
| checkpoint_timeout = 1h
| archive_mode = on
| archive_command = 'cp %p <ARCHIVE_DIR>/%f'
| synchronous_standby_names = '*'
| hot_standby = on
| log_checkpoints = on
Mst$vi $PGDATA/pg_hba.conf
| local replication horiguti trust
Mst$pg_ctl start
Stb$export PGDATA=$HOME/pgdatas
Stb$export PGPORT=5433
Stb$rm -rf $PGDATA
Stb$pg_basebackup -X s -F p -D $PGDATA -h /tmp -p 5432
Stb$vi $PGDATA/postgresql.conf
| checkpoint_segments = 1
| checkpoint_timeout = 30s
Stb$vi $PGDATA/recovery.conf
| standby_mode=yes
| primary_conninfo='host=/tmp port=5432'
| restore_command='cp <ARCHIVE_DIR>/%f%p'
Stb$pg_ctl start
Mst$createdb
Mst$psql
Mst_psql=# create table t (a int);
Mst_psql=# insert into t values (1);
Mst_psql=# checkpoint;
Mst_psql=# vacuum;
!!! Wait for the completion of a restart point on the standby !!!
Mst_psql=# delete from t;
Mst_psql=# vacuum;
Stb$pg_ctl stop -m i
Stb$vi $PGDATA/recovery.conf
| restore_command='cp <ARCHIVE_DIR>/%f %p'
Stb$pg_ctl start
LOG: database system was interrupted while in recovery at log time 2012-12-05 09:55:18 JST
HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
LOG: entering standby mode
LOG: redo starts at 0/A020890
WARNING: page 0 of relation base/16384/16385 does not exist
PANIC: WAL contains references to invalid pages
LOG: startup process (PID 18422) was terminated by signal 6: Aborted
LOG: terminating any other active server processes
======

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 09d4dff..cf73b58 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5759,9 +5759,6 @@ StartupXLOG(void)
 			/* Handle interrupt signals of startup process */
 			HandleStartupProcInterrupts();
 
-			/* Allow read-only connections if we're consistent now */
-			CheckRecoveryConsistency();
-
 			/*
 			 * Have we reached our recovery target?
 			 */
@@ -5833,6 +5830,9 @@ StartupXLOG(void)
 			/* Pop the error context stack */
 			error_context_stack = errcallback.previous;
 
+			/* Allow read-only connections if we're consistent now */
+			CheckRecoveryConsistency();
+
 			if (!XLogRecPtrIsInvalid(ControlFile->backupStartPoint) &&
 				XLByteLE(ControlFile->backupEndPoint, EndRecPtr))
 			{

diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 993bc49..d34ab65 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -519,6 +519,12 @@ smgr_redo(XLogRecPtr lsn, XLogRecord *record)
 		visibilitymap_truncate(rel, xlrec->blkno);
 
 		FreeFakeRelcacheEntry(rel);
+
+		/*
+		 * Xlogs before this record are unrepeatable, so advance
+		 * minRecoveryPoint to here.
+		 */
+		XLogFlush(lsn);
 	}
 	else
 		elog(PANIC, "smgr_redo: unknown op code %u", info);
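As a sketch of why the XLOG_HEAP2_VISIBLE -> SMGR_TRUNCATE sequence described above is unrepeatable, here is a toy replay model (Python; purely illustrative, not PostgreSQL code — the record semantics are heavily simplified):

```python
def replay(records, pages):
    """Toy redo loop. XLOG_HEAP2_VISIBLE touches page 0 of the relation;
    SMGR_TRUNCATE removes that page. Redoing the pair a second time
    (because minRecoveryPoint lagged behind what was already applied)
    makes the VISIBLE record reference a page that no longer exists."""
    for kind in records:
        if kind == "XLOG_HEAP2_VISIBLE":
            if 0 not in pages:
                raise RuntimeError("WAL contains references to invalid pages")
            pages[0] = "all-visible"
        elif kind == "SMGR_TRUNCATE":
            pages.pop(0, None)

seq = ["XLOG_HEAP2_VISIBLE", "SMGR_TRUNCATE"]
pages = {0: "heap data"}
replay(seq, pages)            # first pass succeeds; page 0 is truncated away

try:                          # restart with a lagging minRecoveryPoint:
    replay(seq, pages)        # the same records are redone from scratch
    panicked = False
except RuntimeError:
    panicked = True           # corresponds to the PANIC in the log above
assert panicked
```

This is only a model of the record ordering; in the real server the invalid-page reference is detected by the xlog invalid-page check and escalates to PANIC once the database is considered consistent.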
On Thursday, December 06, 2012 9:35 AM Kyotaro HORIGUCHI wrote:
> Hello, I have a problem with PostgreSQL 9.2 with Pacemaker.
>
> The HA standby sometimes fails to start under normal operation.
> Testing with a bare replication pair showed that the standby fails
> startup recovery under the operation sequence shown below. 9.3dev is
> affected too, but 9.1 does not show this problem. The problem became
> apparent through the invalid-page check during xlog replay, but 9.1
> potentially has the same glitch.
>
> After investigation, the lag of minRecoveryPoint behind EndRecPtr in
> the redo loop seems to be the cause. The lag brings about repeated
> redoing of unrepeatable xlog sequences such as XLOG_HEAP2_VISIBLE ->
> SMGR_TRUNCATE on the same page. So I applied the same workaround as
> in xact_redo_commit_internal to smgr_redo. While doing this, I
> noticed that CheckRecoveryConsistency() in the redo apply loop should
> come after redoing the record, so I moved it.

I think moving CheckRecoveryConsistency() after the redo apply might cause a problem. Currently it is done before recoveryStopsHere(), which allows connections on the hot standby. If recovery pauses or stops due to that function, connections might not be allowed, because CheckRecoveryConsistency() has not been called.

With Regards,
Amit Kapila.
On Thu, Dec 6, 2012 at 8:39 PM, Amit Kapila <amit.kapila@huawei.com> wrote:
> On Thursday, December 06, 2012 9:35 AM Kyotaro HORIGUCHI wrote:
>> [...]
>
> I think moving CheckRecoveryConsistency() after the redo apply might
> cause a problem. Currently it is done before recoveryStopsHere(),
> which allows connections on the hot standby. If recovery pauses or
> stops due to that function, connections might not be allowed, because
> CheckRecoveryConsistency() has not been called.

Yes, so we should just add the CheckRecoveryConsistency() call after rm_redo rather than moving it? This issue is related to the old discussion:
http://archives.postgresql.org/pgsql-bugs/2012-09/msg00101.php

Regards,

-- 
Fujii Masao
On Thu, Dec 6, 2012 at 1:04 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
> index 993bc49..d34ab65 100644
> --- a/src/backend/catalog/storage.c
> +++ b/src/backend/catalog/storage.c
> @@ -519,6 +519,12 @@ smgr_redo(XLogRecPtr lsn, XLogRecord *record)
>  		visibilitymap_truncate(rel, xlrec->blkno);
> 
>  		FreeFakeRelcacheEntry(rel);
> +
> +		/*
> +		 * Xlogs before this record are unrepeatable, so advance
> +		 * minRecoveryPoint to here.
> +		 */
> +		XLogFlush(lsn);
>  	}
>  	else
>  		elog(PANIC, "smgr_redo: unknown op code %u", info);

I think that minRecoveryPoint should be updated before the data-file changes, so XLogFlush() should be called before smgrtruncate(). No?

Regards,

-- 
Fujii Masao
Thank you.

> I think moving CheckRecoveryConsistency() after the redo apply might
> cause a problem. Currently it is done before recoveryStopsHere(),
> which allows connections on the hot standby. If recovery pauses or
> stops due to that function, connections might not be allowed, because
> CheckRecoveryConsistency() has not been called.

It depends on the precise meaning of minRecoveryPoint. I've not found an explicit explanation for it.

Currently minRecoveryPoint is updated only in XLogFlush; other updates of it seem to reset it to InvalidXLogRecPtr. XLogFlush seems to be called AFTER the core redo operation has been done, at least in xact_redo_commit_internal. This suggests to me that the meaning of minRecoveryPoint is: "just AFTER applying the XLog record at the current LSN, the database files can be assumed to be consistent."

At Mon, 10 Dec 2012 00:36:31 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwG4W5QZ7+LJimg8xxuevwz0bYniHmZLZmWf0j6kBiuRCg@mail.gmail.com>
> Yes, so we should just add the CheckRecoveryConsistency() call after
> rm_redo rather than moving it? This issue is related to the old discussion:
> http://archives.postgresql.org/pgsql-bugs/2012-09/msg00101.php

Since I've not clearly understood the problem with missing it before redo, and it also seems to have no harm on performance, I have no objection to placing it both before and after redo.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Hello,

> I think that minRecoveryPoint should be updated before the data-file
> changes, so XLogFlush() should be called before smgrtruncate(). No?

Mmm.. As far as I saw in xact_redo_commit_internal, XLogFlush seems to be done AFTER the substantial redo operation. Is that wrong? If so, I suppose xact_redo_commit_internal should be fixed in the same way.

At Mon, 10 Dec 2012 00:41:34 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwEUpEvK=s0xRFinWnoasjhy3poD3HiE4namoWHK0a-d_w@mail.gmail.com>
> On Thu, Dec 6, 2012 at 1:04 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> [patch quoted above]
>
> I think that minRecoveryPoint should be updated before the data-file
> changes, so XLogFlush() should be called before smgrtruncate(). No?

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
On Monday, December 10, 2012 7:16 AM Kyotaro HORIGUCHI wrote:
> Thank you.
>
> It depends on the precise meaning of minRecoveryPoint. I've not found
> an explicit explanation for it.
>
> Currently minRecoveryPoint is updated only in XLogFlush; other
> updates of it seem to reset it to InvalidXLogRecPtr. XLogFlush seems
> to be called AFTER the core redo operation has been done, at least in
> xact_redo_commit_internal. This suggests to me that the meaning of
> minRecoveryPoint is: "just AFTER applying the XLog record at the
> current LSN, the database files can be assumed to be consistent."
>
> At Mon, 10 Dec 2012 00:36:31 +0900, Fujii Masao <masao.fujii@gmail.com>
> wrote in <CAHGQGwG4W5QZ7+LJimg8xxuevwz0bYniHmZLZmWf0j6kBiuRCg@mail.gmail.com>
>> Yes, so we should just add the CheckRecoveryConsistency() call after
>> rm_redo rather than moving it? This issue is related to the old discussion:
>> http://archives.postgresql.org/pgsql-bugs/2012-09/msg00101.php
>
> Since I've not clearly understood the problem with missing it before
> redo, and it also seems to have no harm on performance, I have no
> objection to placing it both before and after redo.

I have tried that way as well, but it didn't completely resolve the problem reported in the above link, because the defect arises on the first ReadRecord(). To resolve the defect mentioned in the link by Fujii Masao, we actually need to check, and try to reset backupStartPoint, before each ReadRecord(). The reason is that ReadRecord() can go and start waiting for records from the master.
So if backupStartPoint is not reset and CheckRecoveryConsistency() is not called, the standby can keep waiting for records from the master and no connections will be allowed on it. I have modified XLogPageRead(...) so that before it calls WaitForWALToBecomeAvailable(), the following code runs:

if (!XLogRecPtrIsInvalid(ControlFile->backupEndPoint) &&
    XLByteLE(ControlFile->backupEndPoint, EndRecPtr))
{
    /*
     * We have reached the end of the base backup, the point where
     * the minimum recovery point in pg_control indicates. The
     * data on disk is now consistent. Reset backupStartPoint
     * and backupEndPoint.
     */
    elog(DEBUG1, "end of backup reached");

    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);

    MemSet(&ControlFile->backupStartPoint, 0, sizeof(XLogRecPtr));
    MemSet(&ControlFile->backupEndPoint, 0, sizeof(XLogRecPtr));
    ControlFile->backupEndRequired = false;
    UpdateControlFile();

    LWLockRelease(ControlFileLock);
}

CheckRecoveryConsistency();

This completely resolved the problem reported in the above link for me.

With Regards,
Amit Kapila.
On 10.12.2012 03:52, Kyotaro HORIGUCHI wrote:
>> I think that minRecoveryPoint should be updated before the data-file
>> changes, so XLogFlush() should be called before smgrtruncate(). No?
>
> Mmm.. As far as I saw in xact_redo_commit_internal, XLogFlush seems
> to be done AFTER the substantial redo operation. Is that wrong? If
> so, I suppose xact_redo_commit_internal should be fixed in the same
> way.

So, two options:

1. Redo the truncation, then XLogFlush(). There's a window where the original problem could still occur: the file is truncated, and you crash before XLogFlush() finishes.

2. XLogFlush(), then redo the truncation. If the truncation fails for some reason (disk failure?) and you have already updated minRecoveryPoint, you cannot start up until you somehow fix the problem so that the truncation can succeed, and then finish recovery.

Both options have their merits. The window in 1 is very small, and in 2 the probability that an unlink() or truncation fails is very small as well. We've chosen 1 in xact_redo_commit_internal(), but I don't think that was a carefully made decision; it just imitates what happens in the corresponding non-redo commit path.

In xact_redo_commit_internal(), it makes sense to do XLogFlush() afterwards for CREATE DATABASE and CREATE TABLESPACE. Those create files, and if you e.g. run out of disk space, or hit a non-existent path, you don't want minRecoveryPoint to be updated yet; otherwise you can no longer recover to the point just before the creation of those files. But in the case of DROP DATABASE, you have the exact same situation as with truncation: if you have already deleted some files, you must not be able to stop recovery at a point before that.

So I'd say we should update minRecoveryPoint first, then truncate/delete. But we should still keep the XLogFlush() at the end of xact_redo_commit_internal(), for the case where files/directories are created. Patch attached.

- Heikki
Attachment
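The trade-off between the two orderings discussed above can be condensed into a toy crash simulation (Python; purely illustrative, not PostgreSQL code — the LSNs and step names are invented). A "crash" between the two steps shows which invariant each ordering can violate:

```python
def replay_truncate(order, crash_between):
    """Simulate redoing a truncation record whose flush point is LSN 300.

    order = "redo_then_flush" (option 1) or "flush_then_redo" (option 2).
    Returns (min_recovery_point, file_truncated) as left on disk.
    """
    min_recovery_point = 200   # state before this record
    file_truncated = False

    steps = (["truncate", "flush"] if order == "redo_then_flush"
             else ["flush", "truncate"])
    for i, step in enumerate(steps):
        if step == "truncate":
            file_truncated = True          # the unrepeatable data-file change
        else:
            min_recovery_point = 300       # XLogFlush() advances minRecoveryPoint
        if crash_between and i == 0:
            break                          # crash before the second step
    return min_recovery_point, file_truncated

# Option 1: the crash window leaves the file truncated while
# minRecoveryPoint still permits declaring consistency before the
# truncation record is re-replayed (the original bug's window).
mrp, truncated = replay_truncate("redo_then_flush", crash_between=True)
assert truncated and mrp == 200

# Option 2: minRecoveryPoint is advanced first; if the truncation then
# fails, recovery can no longer stop before the (unapplied) truncation.
mrp, truncated = replay_truncate("flush_then_redo", crash_between=True)
assert not truncated and mrp == 300
```

Without a crash, both orderings converge on the same end state; the difference is only which invariant can be violated in the window between the two steps.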
On 10.12.2012 13:50, Heikki Linnakangas wrote:
> So I'd say we should update minRecoveryPoint first, then
> truncate/delete. But we should still keep the XLogFlush() at the end of
> xact_redo_commit_internal(), for the case where files/directories are
> created. Patch attached.

Committed and backpatched that. Attached is a script I used to reproduce this problem, going back to 8.4.

- Heikki
Attachment
On Tue, Dec 11, 2012 at 1:33 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> On 10.12.2012 13:50, Heikki Linnakangas wrote:
>> So I'd say we should update minRecoveryPoint first, then
>> truncate/delete. But we should still keep the XLogFlush() at the end of
>> xact_redo_commit_internal(), for the case where files/directories are
>> created. Patch attached.

Sounds reasonable.

> Committed and backpatched that. Attached is a script I used to reproduce
> this problem, going back to 8.4.

Thanks! Unfortunately, I could reproduce the problem even after that commit. Attached is the script I used to reproduce it.

The cause is that CheckRecoveryConsistency() is called before rm_redo(), as Horiguchi-san suggested upthread. Imagine the case where minRecoveryPoint is set to the location of the XLOG_SMGR_TRUNCATE record. When restarting the server with that minRecoveryPoint, the following happens, and then a PANIC occurs:

1. The XLOG_SMGR_TRUNCATE record is read.
2. CheckRecoveryConsistency() is called, and the database is marked as consistent since we've reached minRecoveryPoint (i.e., the location of XLOG_SMGR_TRUNCATE).
3. The XLOG_SMGR_TRUNCATE record is replayed, and an invalid page is found.
4. Since the database has already been marked as consistent, the invalid page leads to a PANIC.

Regards,

-- 
Fujii Masao
Attachment
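The four-step failure scenario above can be condensed into a small simulation (Python; illustrative only — the LSNs are invented and the replay loop is heavily simplified) showing why a consistency check based on the last record *read* fires one record too early:

```python
def first_consistent_point(records, min_recovery_point, compare="read"):
    """Toy replay loop; return the index of the record during whose
    iteration the database would be marked consistent, or None.

    records: list of EndRecPtr values (end LSN of each WAL record).
    compare="read": check against EndRecPtr, the last record *read*
                    (the buggy behaviour).
    compare="replayed": check against the end of the last record
                    actually *replayed* (the fix).
    """
    last_replayed = 0
    for i, end_rec_ptr in enumerate(records):
        # At this point record i has been *read* but not yet replayed.
        check_lsn = end_rec_ptr if compare == "read" else last_replayed
        if check_lsn >= min_recovery_point:
            return i                    # CheckRecoveryConsistency() fires here
        last_replayed = end_rec_ptr     # rm_redo() done; record i replayed
    return None

records = [100, 200, 300]               # 300 = end of XLOG_SMGR_TRUNCATE
# Buggy: marked consistent while the TRUNCATE (index 2) is still
# unreplayed, so the invalid page found during its replay is a PANIC.
assert first_consistent_point(records, 300, compare="read") == 2
# Fixed: consistency is only declared after the TRUNCATE was replayed;
# within these three records, that means never.
assert first_consistent_point(records, 300, compare="replayed") is None
```

With the fixed comparison, consistency would be declared on the iteration for the *next* record after the truncation — which is exactly why the placement of the check relative to rm_redo also matters, as discussed later in the thread.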
Hello, I've also found that this does not fix the problem.

>>> So I'd say we should update minRecoveryPoint first, then
>>> truncate/delete. But we should still keep the XLogFlush() at the end of
>>> xact_redo_commit_internal(), for the case where files/directories are
>>> created. Patch attached.
>
> Sounds reasonable.

It makes perfect sense.

>> Committed and backpatched that. Attached is a script I used to reproduce
>> this problem, going back to 8.4.
>
> Thanks!
>
> Unfortunately, I could reproduce the problem even after that commit.
> Attached is the script I used to reproduce it.

Me too.

> The cause is that CheckRecoveryConsistency() is called before rm_redo(),
> as Horiguchi-san suggested upthread. Imagine the case where
> minRecoveryPoint is set to the location of the XLOG_SMGR_TRUNCATE
> record. When restarting the server with that minRecoveryPoint,
> the following happens, and then a PANIC occurs:
>
> 1. The XLOG_SMGR_TRUNCATE record is read.
> 2. CheckRecoveryConsistency() is called, and the database is marked as
>    consistent since we've reached minRecoveryPoint (i.e., the location
>    of XLOG_SMGR_TRUNCATE).
> 3. The XLOG_SMGR_TRUNCATE record is replayed, and an invalid page is
>    found.
> 4. Since the database has already been marked as consistent, the
>    invalid page leads to a PANIC.

Exactly.

In smgr_redo, EndRecPtr, which points past the SMGR_TRUNCATE record to the next one, is used as the new minRecoveryPoint. On the other hand, during the second startup of the standby, CheckRecoveryConsistency checks for consistency with XLByteLE(minRecoveryPoint, EndRecPtr), which becomes true just BEFORE the SMGR_TRUNCATE record is applied. So reachedConsistency becomes true just before the SMGR_TRUNCATE record is applied. Bang!

I said I had no objection to placing CheckRecoveryConsistency both before and after rm_redo in my previous message, but that was wrong. Given how the minRecoveryPoint value is set, it should be placed after rm_redo, from the point of view of when the database should be considered consistent. Indeed, simply moving CheckRecoveryConsistency after rm_redo resulted in a successful startup, though this ignores whatever reason there may be for it to come before, which is unknown to me.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
On 11.12.2012 08:07, Kyotaro HORIGUCHI wrote:
>> The cause is that CheckRecoveryConsistency() is called before rm_redo(),
>> as Horiguchi-san suggested upthread. Imagine the case where
>> minRecoveryPoint is set to the location of the XLOG_SMGR_TRUNCATE
>> record. When restarting the server with that minRecoveryPoint,
>> the following happens, and then a PANIC occurs:
>>
>> 1. The XLOG_SMGR_TRUNCATE record is read.
>> 2. CheckRecoveryConsistency() is called, and the database is marked as
>>    consistent since we've reached minRecoveryPoint (i.e., the location
>>    of XLOG_SMGR_TRUNCATE).
>> 3. The XLOG_SMGR_TRUNCATE record is replayed, and an invalid page is
>>    found.
>> 4. Since the database has already been marked as consistent, the
>>    invalid page leads to a PANIC.
>
> Exactly.
>
> In smgr_redo, EndRecPtr, which points past the SMGR_TRUNCATE record
> to the next one, is used as the new minRecoveryPoint. On the other
> hand, during the second startup of the standby,
> CheckRecoveryConsistency checks for consistency with
> XLByteLE(minRecoveryPoint, EndRecPtr), which becomes true just
> BEFORE the SMGR_TRUNCATE record is applied. So reachedConsistency
> becomes true just before the SMGR_TRUNCATE record is applied. Bang!

Ah, I see. I thought moving CheckRecoveryConsistency was just a nice-to-have thing, so that we'd notice that we're consistent earlier, so I didn't pay attention to that part.

I think the real issue here is that CheckRecoveryConsistency should be comparing minRecoveryPoint against recoveryLastRecPtr, not EndRecPtr as it currently does. EndRecPtr points to the end of the last record *read*, while recoveryLastRecPtr points to the end of the last record *replayed*. The latter is what CheckRecoveryConsistency should be concerned with.

BTW, it occurs to me that we have two variables in the shared struct that are almost the same thing: recoveryLastRecPtr and replayEndRecPtr. The former points to the end of the last record successfully replayed, while replayEndRecPtr points to the end of the last record successfully replayed, or being replayed at the moment. I find that confusing, so I suggest that we rename recoveryLastRecPtr to make that more clear. Included in the attached patch.

- Heikki
Attachment
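The distinction between the two shared pointers can be sketched like this (Python; illustrative only — these are not the actual xlog.c structures, and the names follow the rename proposed above):

```python
class RecoveryState:
    """Toy model of the two shared LSN pointers:

    replayEndRecPtr       - end of the record being (or last) replayed;
                            advanced *before* rm_redo starts.
    lastReplayedEndRecPtr - end of the last record *fully* replayed
                            (the renamed recoveryLastRecPtr); advanced
                            only *after* rm_redo returns.
    """
    def __init__(self):
        self.replayEndRecPtr = 0
        self.lastReplayedEndRecPtr = 0

    def begin_replay(self, end_rec_ptr):
        self.replayEndRecPtr = end_rec_ptr      # record read, replay starting

    def finish_replay(self):
        self.lastReplayedEndRecPtr = self.replayEndRecPtr  # rm_redo done

state = RecoveryState()
state.begin_replay(300)
# Mid-replay the two pointers disagree; only lastReplayedEndRecPtr is a
# safe basis for CheckRecoveryConsistency(), since the record at 300 may
# still leave the data files inconsistent until replay completes.
assert state.replayEndRecPtr == 300 and state.lastReplayedEndRecPtr == 0
state.finish_replay()
assert state.lastReplayedEndRecPtr == 300
```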
On Tue, Dec 11, 2012 at 6:14 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> On 11.12.2012 08:07, Kyotaro HORIGUCHI wrote:
>> [...]
>
> Ah, I see. I thought moving CheckRecoveryConsistency was just a
> nice-to-have thing, so that we'd notice that we're consistent earlier,
> so I didn't pay attention to that part.
>
> I think the real issue here is that CheckRecoveryConsistency should be
> comparing minRecoveryPoint against recoveryLastRecPtr, not EndRecPtr as
> it currently does. EndRecPtr points to the end of the last record
> *read*, while recoveryLastRecPtr points to the end of the last record
> *replayed*. The latter is what CheckRecoveryConsistency should be
> concerned with.
>
> BTW, it occurs to me that we have two variables in the shared struct
> that are almost the same thing: recoveryLastRecPtr and replayEndRecPtr.
> The former points to the end of the last record successfully replayed,
> while replayEndRecPtr points to the end of the last record successfully
> replayed, or being replayed at the moment. I find that confusing, so I
> suggest that we rename recoveryLastRecPtr to make that more clear.
> Included in the attached patch.

The patch looks good. I confirmed that the PANIC error didn't happen with the patch.

One comment, on xlog.c:

> * Initialize shared replayEndRecPtr, recoveryLastRecPtr, and

s/recoveryLastRecPtr/lastReplayedEndRecPtr

Regards,

-- 
Fujii Masao
On 11.12.2012 17:04, Fujii Masao wrote:
> [...]
>
> The patch looks good. I confirmed that the PANIC error didn't happen
> with the patch.

Thanks. After thinking about this more, I think it's still not quite right. Now that we fix the check in CheckRecoveryConsistency, we have to move the call to CheckRecoveryConsistency to after the rm_redo call. Otherwise you don't enter hot standby mode after replaying the last record, the one ending at minRecoveryPoint; you have to wait for one more record to be read (but not replayed) before CheckRecoveryConsistency gets called and we let backends in.

As the code stands, the bad placement of CheckRecoveryConsistency is compensated by the bug that we open up for hot standby one record too early. The net result is that we open up for hot standby just before replaying the last record that makes us consistent. The window for someone to see the inconsistent state is really small: postmaster has to receive the signal and start accepting connections, and you have to connect and run a query, all before the startup process gets around to replaying the record it has already read into memory.

- Heikki
On 11.12.2012 17:04, Fujii Masao wrote:
> The patch looks good. I confirmed that the PANIC error didn't happen
> with the patch.

Ok, committed, and also moved the CheckRecoveryConsistency call. Please take a look to double-check that I didn't miss anything.

- Heikki
On Wed, Dec 12, 2012 at 2:03 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> On 11.12.2012 17:04, Fujii Masao wrote:
>> The patch looks good. I confirmed that the PANIC error didn't happen
>> with the patch.
>
> Ok, committed, and also moved the CheckRecoveryConsistency call. Please
> take a look to double-check that I didn't miss anything.

Looks good to me. Thanks a lot!

Regards,

-- 
Fujii Masao