BUG #8686: Standby could not restart. - Mailing list pgsql-bugs

From katsumata.tomonari@po.ntts.co.jp
Subject BUG #8686: Standby could not restart.
Date
Msg-id E1VtTni-00082E-Jv@wrigleys.postgresql.org
Whole thread Raw
Responses Re: BUG #8686: Standby could not restart.  (Tomonari Katsumata <katsumata.tomonari@po.ntts.co.jp>)
Re: BUG #8686: Standby could not restart.  (Heikki Linnakangas <hlinnakangas@vmware.com>)
List pgsql-bugs
The following bug has been logged on the website:

Bug reference:      8686
Logged by:          Tomonari Katsumata
Email address:      katsumata.tomonari@po.ntts.co.jp
PostgreSQL version: 9.2.6
Operating system:   Red Hat Enterprise Linux 6.2 x86_64
Description:

Hi,


I'm testing whether a standby could restart with asynchronous replication.


The testcase is very simple like below.
(A) start asyncronous replication
(B) stop standby with "-m f"
(C) start standby


I tried (B) and (C) periodically.


In almost cases, it worked fine.
But standby could not start one time because of PANIC.
---- Log -----
[Standby] 2013-12-09 10:42:30 JST LOG:  00000: database system was shut down
in recovery at 2013-12-09 10:42:29 JST
[Standby] 2013-12-09 10:42:30 JST LOCATION:  StartupXLOG, xlog.c:6273
cp: cannot stat `../arc/00000002.history': No such file or directory
[Standby] 2013-12-09 10:42:30 JST LOG:  00000: entering standby mode
[Standby] 2013-12-09 10:42:30 JST LOCATION:  StartupXLOG, xlog.c:6359
cp: cannot stat `../arc/000000010000000100000008': No such file or
directory
[Standby] 2013-12-09 10:42:30 JST LOG:  00000: consistent recovery state
reached at 1/8E7F110
[Standby] 2013-12-09 10:42:30 JST LOCATION:  CheckRecoveryConsistency,
xlog.c:7366
[Standby] 2013-12-09 10:42:30 JST LOG:  00000: restored log file
"000000010000000100000007" from archive
[Standby] 2013-12-09 10:42:30 JST LOCATION:  RestoreArchivedFile,
xlog.c:3273
[Standby] 2013-12-09 10:42:30 JST LOG:  00000: redo starts at 1/783B230
[Standby] 2013-12-09 10:42:30 JST LOCATION:  StartupXLOG, xlog.c:6827
[Standby] 2013-12-09 10:42:30 JST WARNING:  01000: page 1365 of relation
base/16384/16388 does not exist
[Standby] 2013-12-09 10:42:30 JST CONTEXT:  xlog redo hot_update: rel
1663/16384/16388; tid 1365/152; new 1365/153
[Standby] 2013-12-09 10:42:30 JST LOCATION:  report_invalid_page,
xlogutils.c:66
[Standby] 2013-12-09 10:42:30 JST PANIC:  XX000: WAL contains references to
invalid pages
[Standby] 2013-12-09 10:42:30 JST CONTEXT:  xlog redo hot_update: rel
1663/16384/16388; tid 1365/152; new 1365/153
[Standby] 2013-12-09 10:42:30 JST LOCATION:  log_invalid_page,
xlogutils.c:91
[Standby] 2013-12-09 10:42:30 JST LOG:  00000: startup process (PID 12560)
was terminated by signal 6: Aborted
[Standby] 2013-12-09 10:42:30 JST LOCATION:  LogChildExit,
postmaster.c:2895
[Standby] 2013-12-09 10:42:30 JST LOG:  00000: terminating any other active
server processes
[Standby] 2013-12-09 10:42:30 JST LOCATION:  HandleChildCrash,
postmaster.c:2682
[Standby] 2013-12-09 11:10:12 JST LOG:  00000: database system was
interrupted while in recovery at log time 2013-12-09 10:32:14 JST
[Standby] 2013-12-09 11:10:12 JST HINT:  If this has occurred more than once
some data might be corrupted and you might need to choose an earlier
recovery target.
[Standby] 2013-12-09 11:10:12 JST LOCATION:  StartupXLOG, xlog.c:6289
---------


I tried to fix this problem.
At first, I doubted the recovery state reached "consistent" before redo
starts.
And then I checked pg_control and related WAL.
The WAL sequence is like below.


WAL--(a)--(b)--(c)--(d)--(e)-->
================================================
(a) Latest checkpoint's REDO location
1/783B230


(b) hot_update
1/7842010


(c) truncate
1/8E7E5C8


(d) Latest checkpoint location
1/8E7F0B0


(e) Minimum recovery ending location
1/8E7F110
================================================


>From these things, I found it has happened with this scenario.
----------
(1) standby starting
(2) seeking checkpoint location 1/8E7F0B0 because backup_label is not
absecnt
(3) reachedConsistency is set to true at 1/8E7F110 in
CheckRecoveryConsistent
(4) redo start from 1/783B230
(5) PANIC at 1/7842010 because reachedConsistency has set already and
operating against a block which will be truncated at (c).
----------


At step(2), EndRecPtr is set to 1/8E7F110(next to 1/8E7F0B0),
so reachedConsistency is set to true at step(3).


I think it's not need to increase EndRecPtr while seeking checkpoint
location.
I tried to revise it and this worked fine.


I think this is very very narrow gate, but it could happen.


I'm not sure this problem happen with synchronous replication or
another version of PostgreSQL(ie. 9.3/9.1/9.0),
but at least we need to fix it.


regards,
----------------
Tomonari Katsumata

pgsql-bugs by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: Update with subselect sometimes returns wrong result
Next
From: Tomonari Katsumata
Date:
Subject: Re: BUG #8686: Standby could not restart.