Re: BUG #8686: Standby could not restart. - Mailing list pgsql-bugs

From Tomonari Katsumata
Subject Re: BUG #8686: Standby could not restart.
Date
Msg-id 52B26197.9080208@po.ntts.co.jp
Whole thread Raw
In response to BUG #8686: Standby could not restart.  (katsumata.tomonari@po.ntts.co.jp)
List pgsql-bugs
Hi,

This is a patch against REL9_2_STABLE(*)
(*)0c07ef1ad2ba73a6a050f3e1a19267f961a43586

regards,
---------------
Tomonari Katsumata


(2013/12/19 11:57), katsumata.tomonari@po.ntts.co.jp wrote:
 > The following bug has been logged on the website:
 >
 > Bug reference:      8686
 > Logged by:          Tomonari Katsumata
 > Email address:      katsumata.tomonari@po.ntts.co.jp
 > PostgreSQL version: 9.2.6
 > Operating system:   Red Hat Enterprise Linux 6.2 x86_64
 > Description:
 >
 > Hi,
 >
 >
 > I'm testing whether a standby could restart with asynchronous
replication.
 >
 >
 > The testcase is very simple like below.
 > (A) start asyncronous replication
 > (B) stop standby with "-m f"
 > (C) start standby
 >
 >
 > I tried (B) and (C) periodically.
 >
 >
 > In almost cases, it worked fine.
 > But standby could not start one time because of PANIC.
 > ---- Log -----
 > [Standby] 2013-12-09 10:42:30 JST LOG:  00000: database system was
shut down
 > in recovery at 2013-12-09 10:42:29 JST
 > [Standby] 2013-12-09 10:42:30 JST LOCATION:  StartupXLOG, xlog.c:6273
 > cp: cannot stat `../arc/00000002.history': No such file or directory
 > [Standby] 2013-12-09 10:42:30 JST LOG:  00000: entering standby mode
 > [Standby] 2013-12-09 10:42:30 JST LOCATION:  StartupXLOG, xlog.c:6359
 > cp: cannot stat `../arc/000000010000000100000008': No such file or
 > directory
 > [Standby] 2013-12-09 10:42:30 JST LOG:  00000: consistent recovery state
 > reached at 1/8E7F110
 > [Standby] 2013-12-09 10:42:30 JST LOCATION: CheckRecoveryConsistency,
 > xlog.c:7366
 > [Standby] 2013-12-09 10:42:30 JST LOG:  00000: restored log file
 > "000000010000000100000007" from archive
 > [Standby] 2013-12-09 10:42:30 JST LOCATION: RestoreArchivedFile,
 > xlog.c:3273
 > [Standby] 2013-12-09 10:42:30 JST LOG:  00000: redo starts at 1/783B230
 > [Standby] 2013-12-09 10:42:30 JST LOCATION:  StartupXLOG, xlog.c:6827
 > [Standby] 2013-12-09 10:42:30 JST WARNING:  01000: page 1365 of relation
 > base/16384/16388 does not exist
 > [Standby] 2013-12-09 10:42:30 JST CONTEXT:  xlog redo hot_update: rel
 > 1663/16384/16388; tid 1365/152; new 1365/153
 > [Standby] 2013-12-09 10:42:30 JST LOCATION: report_invalid_page,
 > xlogutils.c:66
 > [Standby] 2013-12-09 10:42:30 JST PANIC:  XX000: WAL contains
references to
 > invalid pages
 > [Standby] 2013-12-09 10:42:30 JST CONTEXT:  xlog redo hot_update: rel
 > 1663/16384/16388; tid 1365/152; new 1365/153
 > [Standby] 2013-12-09 10:42:30 JST LOCATION:  log_invalid_page,
 > xlogutils.c:91
 > [Standby] 2013-12-09 10:42:30 JST LOG:  00000: startup process (PID
12560)
 > was terminated by signal 6: Aborted
 > [Standby] 2013-12-09 10:42:30 JST LOCATION:  LogChildExit,
 > postmaster.c:2895
 > [Standby] 2013-12-09 10:42:30 JST LOG:  00000: terminating any other
active
 > server processes
 > [Standby] 2013-12-09 10:42:30 JST LOCATION:  HandleChildCrash,
 > postmaster.c:2682
 > [Standby] 2013-12-09 11:10:12 JST LOG:  00000: database system was
 > interrupted while in recovery at log time 2013-12-09 10:32:14 JST
 > [Standby] 2013-12-09 11:10:12 JST HINT:  If this has occurred more
than once
 > some data might be corrupted and you might need to choose an earlier
 > recovery target.
 > [Standby] 2013-12-09 11:10:12 JST LOCATION:  StartupXLOG, xlog.c:6289
 > ---------
 >
 >
 > I tried to fix this problem.
 > At first, I doubted the recovery state reached "consistent" before redo
 > starts.
 > And then I checked pg_control and related WAL.
 > The WAL sequence is like below.
 >
 >
 > WAL--(a)--(b)--(c)--(d)--(e)-->
 > ================================================
 > (a) Latest checkpoint's REDO location
 > 1/783B230
 >
 >
 > (b) hot_update
 > 1/7842010
 >
 >
 > (c) truncate
 > 1/8E7E5C8
 >
 >
 > (d) Latest checkpoint location
 > 1/8E7F0B0
 >
 >
 > (e) Minimum recovery ending location
 > 1/8E7F110
 > ================================================
 >
 >
 > >From these things, I found it has happened with this scenario.
 > ----------
 > (1) standby starting
 > (2) seeking checkpoint location 1/8E7F0B0 because backup_label is not
 > absecnt
 > (3) reachedConsistency is set to true at 1/8E7F110 in
 > CheckRecoveryConsistent
 > (4) redo start from 1/783B230
 > (5) PANIC at 1/7842010 because reachedConsistency has set already and
 > operating against a block which will be truncated at (c).
 > ----------
 >
 >
 > At step(2), EndRecPtr is set to 1/8E7F110(next to 1/8E7F0B0),
 > so reachedConsistency is set to true at step(3).
 >
 >
 > I think it's not need to increase EndRecPtr while seeking checkpoint
 > location.
 > I tried to revise it and this worked fine.
 >
 >
 > I think this is very very narrow gate, but it could happen.
 >
 >
 > I'm not sure this problem happen with synchronous replication or
 > another version of PostgreSQL(ie. 9.3/9.1/9.0),
 > but at least we need to fix it.
 >
 >
 > regards,
 > ----------------
 > Tomonari Katsumata
 >
 >
 >
 >


Attachment

pgsql-bugs by date:

Previous
From: katsumata.tomonari@po.ntts.co.jp
Date:
Subject: BUG #8686: Standby could not restart.
Next
From: Alvaro Herrera
Date:
Subject: Re: Update with subselect sometimes returns wrong result