Thread: BUG #8686: Standby could not restart.

BUG #8686: Standby could not restart.

From: katsumata.tomonari@po.ntts.co.jp
The following bug has been logged on the website:

Bug reference:      8686
Logged by:          Tomonari Katsumata
Email address:      katsumata.tomonari@po.ntts.co.jp
PostgreSQL version: 9.2.6
Operating system:   Red Hat Enterprise Linux 6.2 x86_64
Description:

Hi,


I'm testing whether a standby can restart under asynchronous replication.


The test case is very simple:
(A) start asynchronous replication
(B) stop the standby with "-m f"
(C) start the standby


I repeated (B) and (C) periodically.


In most cases it worked fine,
but one time the standby could not start because of a PANIC.
---- Log -----
[Standby] 2013-12-09 10:42:30 JST LOG:  00000: database system was shut down
in recovery at 2013-12-09 10:42:29 JST
[Standby] 2013-12-09 10:42:30 JST LOCATION:  StartupXLOG, xlog.c:6273
cp: cannot stat `../arc/00000002.history': No such file or directory
[Standby] 2013-12-09 10:42:30 JST LOG:  00000: entering standby mode
[Standby] 2013-12-09 10:42:30 JST LOCATION:  StartupXLOG, xlog.c:6359
cp: cannot stat `../arc/000000010000000100000008': No such file or
directory
[Standby] 2013-12-09 10:42:30 JST LOG:  00000: consistent recovery state
reached at 1/8E7F110
[Standby] 2013-12-09 10:42:30 JST LOCATION:  CheckRecoveryConsistency,
xlog.c:7366
[Standby] 2013-12-09 10:42:30 JST LOG:  00000: restored log file
"000000010000000100000007" from archive
[Standby] 2013-12-09 10:42:30 JST LOCATION:  RestoreArchivedFile,
xlog.c:3273
[Standby] 2013-12-09 10:42:30 JST LOG:  00000: redo starts at 1/783B230
[Standby] 2013-12-09 10:42:30 JST LOCATION:  StartupXLOG, xlog.c:6827
[Standby] 2013-12-09 10:42:30 JST WARNING:  01000: page 1365 of relation
base/16384/16388 does not exist
[Standby] 2013-12-09 10:42:30 JST CONTEXT:  xlog redo hot_update: rel
1663/16384/16388; tid 1365/152; new 1365/153
[Standby] 2013-12-09 10:42:30 JST LOCATION:  report_invalid_page,
xlogutils.c:66
[Standby] 2013-12-09 10:42:30 JST PANIC:  XX000: WAL contains references to
invalid pages
[Standby] 2013-12-09 10:42:30 JST CONTEXT:  xlog redo hot_update: rel
1663/16384/16388; tid 1365/152; new 1365/153
[Standby] 2013-12-09 10:42:30 JST LOCATION:  log_invalid_page,
xlogutils.c:91
[Standby] 2013-12-09 10:42:30 JST LOG:  00000: startup process (PID 12560)
was terminated by signal 6: Aborted
[Standby] 2013-12-09 10:42:30 JST LOCATION:  LogChildExit,
postmaster.c:2895
[Standby] 2013-12-09 10:42:30 JST LOG:  00000: terminating any other active
server processes
[Standby] 2013-12-09 10:42:30 JST LOCATION:  HandleChildCrash,
postmaster.c:2682
[Standby] 2013-12-09 11:10:12 JST LOG:  00000: database system was
interrupted while in recovery at log time 2013-12-09 10:32:14 JST
[Standby] 2013-12-09 11:10:12 JST HINT:  If this has occurred more than once
some data might be corrupted and you might need to choose an earlier
recovery target.
[Standby] 2013-12-09 11:10:12 JST LOCATION:  StartupXLOG, xlog.c:6289
---------


I tried to fix this problem.
At first, I suspected that the recovery state was reported "consistent" before
redo started. I then checked pg_control and the related WAL.
The WAL sequence is as follows.


WAL--(a)--(b)--(c)--(d)--(e)-->
================================================
(a) Latest checkpoint's REDO location   1/783B230
(b) hot_update                          1/7842010
(c) truncate                            1/8E7E5C8
(d) Latest checkpoint location          1/8E7F0B0
(e) Minimum recovery ending location    1/8E7F110
================================================


From these observations, I found that it happened with this scenario.
----------
(1) standby starting
(2) seeking the checkpoint location 1/8E7F0B0 because backup_label is
absent
(3) reachedConsistency is set to true at 1/8E7F110 in
CheckRecoveryConsistency
(4) redo starts from 1/783B230
(5) PANIC at 1/7842010 because reachedConsistency has already been set and
redo touches a block that will only be truncated later, at (c)
----------


At step (2), EndRecPtr is set to 1/8E7F110 (just past 1/8E7F0B0),
so reachedConsistency is set to true at step (3).
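
For illustration, here is a small self-contained C model of the situation
(this is NOT PostgreSQL source; the XLogRecPtr struct and XLByteLE helper only
mimic the 9.2-era definitions, and the LSNs are the ones from this report).
It shows why the consistency check in CheckRecoveryConsistency() fires
immediately once lastReplayedEndRecPtr has been initialized to 1/8E7F110, even
though redo has not replayed anything yet:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Toy stand-ins for the 9.2-era definitions; not the real headers. */
    typedef struct
    {
        uint32_t    xlogid;
        uint32_t    xrecoff;
    } XLogRecPtr;

    static bool
    XLByteLE(XLogRecPtr a, XLogRecPtr b)
    {
        return a.xlogid < b.xlogid ||
            (a.xlogid == b.xlogid && a.xrecoff <= b.xrecoff);
    }

    int
    main(void)
    {
        /* LSNs from this report */
        XLogRecPtr  minRecoveryPoint = {0x1, 0x08E7F110};       /* (e) */
        XLogRecPtr  lastReplayedEndRecPtr = {0x1, 0x08E7F110};  /* EndRecPtr right
                                                                  * after reading the
                                                                  * checkpoint record */
        bool        reachedConsistency = false;

        /* Same shape as the test in CheckRecoveryConsistency() */
        if (!reachedConsistency &&
            XLByteLE(minRecoveryPoint, lastReplayedEndRecPtr))
        {
            reachedConsistency = true;
            printf("consistent recovery state reached at %X/%X\n",
                   minRecoveryPoint.xlogid, minRecoveryPoint.xrecoff);
        }

        /* ...but redo will only start at 1/783B230, far behind this point. */
        return 0;
    }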


I think there is no need to advance EndRecPtr while seeking the checkpoint
location.
I revised the code accordingly and it worked fine.


I think this is a very narrow window, but it can happen.


I'm not sure whether this problem happens with synchronous replication or
with other versions of PostgreSQL (i.e. 9.3/9.1/9.0),
but at least we need to fix it.


regards,
----------------
Tomonari Katsumata

Re: BUG #8686: Standby could not restart.

From: Tomonari Katsumata
Hi,

This is a patch against REL9_2_STABLE(*)
(*)0c07ef1ad2ba73a6a050f3e1a19267f961a43586

regards,
---------------
Tomonari Katsumata



Attachment

Re: BUG #8686: Standby could not restart.

From: Heikki Linnakangas
On 12/19/2013 04:57 AM, katsumata.tomonari@po.ntts.co.jp wrote:
> At first, I suspected that the recovery state was reported "consistent" before
> redo started. I then checked pg_control and the related WAL.
> The WAL sequence is as follows.
>
>
> WAL--(a)--(b)--(c)--(d)--(e)-->
> ================================================
> (a) Latest checkpoint's REDO location   1/783B230
> (b) hot_update                          1/7842010
> (c) truncate                            1/8E7E5C8
> (d) Latest checkpoint location          1/8E7F0B0
> (e) Minimum recovery ending location    1/8E7F110
> ================================================
>
>
> From these observations, I found that it happened with this scenario.
> ----------
> (1) standby starting
> (2) seeking the checkpoint location 1/8E7F0B0 because backup_label is
> absent
> (3) reachedConsistency is set to true at 1/8E7F110 in
> CheckRecoveryConsistency
> (4) redo starts from 1/783B230
> (5) PANIC at 1/7842010 because reachedConsistency has already been set and
> redo touches a block that will only be truncated later, at (c)
> ----------
>
> At step (2), EndRecPtr is set to 1/8E7F110 (just past 1/8E7F0B0),
> so reachedConsistency is set to true at step (3).

Yep. Thanks for a good explanation.

> I think there is no need to advance EndRecPtr while seeking the checkpoint
> location.
> I revised the code accordingly and it worked fine.

Hmm. There's this comment in StartupXLOG, after reading the checkpoint
record, but before reading the first record at REDO point:


>         /*
>          * Initialize shared replayEndRecPtr, lastReplayedEndRecPtr, and
>          * recoveryLastXTime.
>          *
>          * This is slightly confusing if we're starting from an online
>          * checkpoint; we've just read and replayed the checkpoint record, but
>          * we're going to start replay from its redo pointer, which precedes
>          * the location of the checkpoint record itself. So even though the
>          * last record we've replayed is indeed ReadRecPtr, we haven't
>          * replayed all the preceding records yet. That's OK for the current
>          * use of these variables.
>          */
>         SpinLockAcquire(&xlogctl->info_lck);
>         xlogctl->replayEndRecPtr = ReadRecPtr;
>         xlogctl->lastReplayedEndRecPtr = EndRecPtr;
>         xlogctl->recoveryLastXTime = 0;
>         xlogctl->currentChunkStartTime = 0;
>         xlogctl->recoveryPause = false;
>         SpinLockRelease(&xlogctl->info_lck);

I think we need to fix that confusion. Your patch will do it by not
setting EndRecPtr yet; that fixes the bug, but leaves those variables in
a slightly strange state; I'm not sure what EndRecPtr points to in that
case (0 ?), but ReadRecPtr would be set I guess.

Perhaps we should reset replayEndRecPtr and lastReplayedEndRecPtr to the
REDO point here, instead of ReadRecPtr/EndRecPtr.
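
Concretely, that would mean initializing the shared variables roughly like this
(just a sketch of the idea against the block quoted above, not an actual patch):

        SpinLockAcquire(&xlogctl->info_lck);
        xlogctl->replayEndRecPtr = checkPoint.redo;        /* REDO point, not ReadRecPtr */
        xlogctl->lastReplayedEndRecPtr = checkPoint.redo;  /* REDO point, not EndRecPtr */
        xlogctl->recoveryLastXTime = 0;
        xlogctl->currentChunkStartTime = 0;
        xlogctl->recoveryPause = false;
        SpinLockRelease(&xlogctl->info_lck);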

- Heikki

Re: BUG #8686: Standby could not restart.

From: Tomonari Katsumata
Hi Heikki,
Thanks for your confirmation and comments.


>                 /*
>                  * Initialize shared replayEndRecPtr, lastReplayedEndRecPtr, and
>                  * recoveryLastXTime.
>                  *
>                  * This is slightly confusing if we're starting from an online
>                  * checkpoint; we've just read and replayed the checkpoint record, but
>                  * we're going to start replay from its redo pointer, which precedes
>                  * the location of the checkpoint record itself. So even though the
>                  * last record we've replayed is indeed ReadRecPtr, we haven't
>                  * replayed all the preceding records yet. That's OK for the current
>                  * use of these variables.
>                  */
>                 SpinLockAcquire(&xlogctl->info_lck);
>                 xlogctl->replayEndRecPtr = ReadRecPtr;
>                 xlogctl->lastReplayedEndRecPtr = EndRecPtr;
>                 xlogctl->recoveryLastXTime = 0;
>                 xlogctl->currentChunkStartTime = 0;
>                 xlogctl->recoveryPause = false;
>                 SpinLockRelease(&xlogctl->info_lck);
>
> I think we need to fix that confusion. Your patch will do it by not setting
> EndRecPtr yet; that fixes the bug, but leaves those variables in a slightly
> strange state; I'm not sure what EndRecPtr points to in that case (0 ?), but
> ReadRecPtr would be set I guess.

Yes, the values were set like below.
ReadRecPtr: 1/8E7F0B0
EndRecPtr:  0/0

> Perhaps we should reset replayEndRecPtr and lastReplayedEndRecPtr to the
> REDO point here, instead of ReadRecPtr/EndRecPtr.

 
 I made another patch.
I added a ReadRecord call to check whether the record at the REDO location is
actually present. A similar check is done when we use backup_label.

Because that ReadRecord returns a record that has already been read,
I set the record's ReadRecPtr as the new EndRecPtr,
and record->xl_prev as the new ReadRecPtr.
As you said, it also worked fine.

I'm not sure whether we should do the same thing for crash recovery, but for now
I added the processing only where archive recovery is needed.

Please see the attached patch.

regards,
---------------------
Tomonari Katsumata
Attachment

Re: BUG #8686: Standby could not restart.

From: Heikki Linnakangas
On 12/23/2013 08:15 AM, Tomonari Katsumata wrote:
> I made another patch.
> I added a ReadRecord call to check whether the record at the REDO location is
> actually present. A similar check is done when we use backup_label.
>
> Because that ReadRecord returns a record that has already been read,
> I set the record's ReadRecPtr as the new EndRecPtr,
> and record->xl_prev as the new ReadRecPtr.
> As you said, it also worked fine.
>
> I'm not sure whether we should do the same thing for crash recovery, but for now
> I added the processing only where archive recovery is needed.
>
> Please see the attached patch.

Hmm. That would still initialize lastReplayedEndRecPtr to the checkpoint
record, when you do crash recovery (or begin archive recovery from a
filesystem snapshot, where you perform crash recovery before starting to
read the archive). I'm not very comfortable with that, even though I
don't see an immediate problem with it.

I also noticed that CheckRecoveryConsistency() compares backupEndPoint
with EndRecPtr, but minRecoveryPoint with lastReplayedEndRecPtr. That's
correct as the code stands, but it's an accident waiting to happen: if
you called CheckRecoveryConsistency() after reading a record with
ReadRecord(), but before fully replaying it, it might conclude that it's
reached the backup end location one record too early. And it's
inconsistent, anyway.

I propose the attached patch. I haven't tested it, but I think it's a
slightly more robust fix.

- Heikki

Attachment

Re: BUG #8686: Standby could not restart.

From: Tomonari Katsumata
Hi Heikki,

Sorry for the slow response, and thank you for the new patch!

I've tried your patch, but it was not against REL9_2_STABLE,
so I created a patch for REL9_2_STABLE(*) from yours.
(*) against 8aa6912b8fec3400441c365bde6a1030295de749

It worked fine!
If there are no problems, please commit this patch to REL9_2_STABLE as well.

NOTE:
I think your original patch has a typo:
we should use 'checkPoint', not 'checkpoint'.
-----------------------------------------------------------
<original>
24:+ xlogctl->replayEndRecPtr = checkpoint.redo;
27:+ xlogctl->lastReplayedEndRecPtr = checkpoint.redo;

<fix>
24:+ xlogctl->replayEndRecPtr = checkPoint.redo;
27:+ xlogctl->lastReplayedEndRecPtr = checkPoint.redo;
-----------------------------------------------------------

regards,
-------------
Tomonari Katsumata

Attachment

Re: BUG #8686: Standby could not restart.

From: Heikki Linnakangas
On 01/08/2014 10:41 AM, Tomonari Katsumata wrote:
> Hi Heikki,
>
> Sorry for the slow response, and thank you for the new patch!
>
> I've tried your patch, but it was not against REL9_2_STABLE,
> so I created a patch for REL9_2_STABLE(*) from yours.
> (*) against 8aa6912b8fec3400441c365bde6a1030295de749
>
> It worked fine!
> If there are no problems, please commit this patch to REL9_2_STABLE as well.

Thanks! Committed to all stable branches 9.0 and above.

- Heikki