Hot Standby has PANIC: WAL contains references to invalid pages - Mailing list pgsql-general

From Michael Harris
Subject Hot Standby has PANIC: WAL contains references to invalid pages
Date
Msg-id 30BC62DC16C7B842A8446ED8EB2F0439067D1C@ESGSCMB105.ericsson.se
Responses Re: Hot Standby has PANIC: WAL contains references to invalid pages  (Hari Babu <haribabu.kommi@huawei.com>)
List pgsql-general
Hi All,

We are having a thorny problem I'm hoping someone will be able to help with.

We have a pair of machines set up as an active / hot standby pair. The database they contain is quite large - approx. 9TB. They were working fine on 9.1, and we recently upgraded the active DB to 9.2.1.

After upgrading the active DB, we re-mirrored the standby (using pg_basebackup) and started it up. It began replaying the WAL files as expected.

After a few hours this happened:

WARNING:  page 1 of relation pg_tblspc/16408/PG_9.2_201204301/16409/1123460086 is uninitialized
CONTEXT:  xlog redo vacuum: rel 16408/16409/1123460086; blk 4411, lastBlockVacuumed 0
PANIC:  WAL contains references to invalid pages
CONTEXT:  xlog redo vacuum: rel 16408/16409/1123460086; blk 4411, lastBlockVacuumed 0
LOG:  startup process (PID 24195) was terminated by signal 6: Aborted
LOG:  terminating any other active server processes

We tried starting it up again, and the same thing happened.

After some googling and re-reading the release notes, we noticed the mention in the 9.2.1 release notes about the potential for corrupted visibility maps, so as per the recommendation we did a full VACUUM of the whole database (with vacuum_freeze_table_age set to zero), then re-mirrored the standby again.
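For readers who want to repeat that step, here is a minimal sketch of one way to run a database-wide VACUUM with vacuum_freeze_table_age forced to zero. It is not necessarily how the vacuum described above was run. The function name freeze_vacuum_command is made up for illustration; the sketch assumes the standard vacuumdb client is on the PATH and relies on libpq's PGOPTIONS environment variable to set the parameter for the session. The setting is passed through the environment rather than in a multi-statement string because VACUUM cannot run inside a multi-command string.

```python
import os


def freeze_vacuum_command(base_env=None):
    """Build (argv, env) for a cluster-wide VACUUM with freezing forced.

    Hypothetical helper: it only prepares the command, it does not run it.
    """
    env = dict(os.environ if base_env is None else base_env)
    setting = "-c vacuum_freeze_table_age=0"
    # Append to any PGOPTIONS the caller already has instead of replacing it.
    existing = env.get("PGOPTIONS", "").strip()
    env["PGOPTIONS"] = (existing + " " + setting).strip()
    # vacuumdb --all vacuums every database in the cluster.
    argv = ["vacuumdb", "--all"]
    return argv, env


# To actually run it on the primary (connection settings such as PGHOST
# and PGUSER are assumed to be present in the environment):
#   import subprocess
#   argv, env = freeze_vacuum_command()
#   subprocess.check_call(argv, env=env)
```

On a database of this size the run will take a long time, so it is worth starting it from a session that will not be interrupted.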

After re-mirroring was completed we started the standby again. Strangely, it reached consistency after only 33 WAL files. Since the base backup took 5 days to complete, this does not seem right to me. Anyway, WAL recovery continued, with occasional warnings like this:

[2013-02-04 10:30:51 EST]  13546@  WARNING:  xlog min recovery request 1A13A/9BC425A0 is past current point 19F1E/725043E8
[2013-02-04 10:30:51 EST]  13546@  CONTEXT:  writing block 0 of relation pg_tblspc/16408/PG_9.2_201204301/16409/12525_vm
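To get a feel for how far apart the two positions in that warning are, each LSN of the form hi/lo can be read as a 64-bit number (hi is the upper 32 bits, lo the lower 32 bits). The sketch below does just that. The function names are made up for illustration, and treating the result as a plain byte distance is an approximation: as far as I know, servers before 9.3 skip the last 16MB segment of each 4GB logical WAL file, so the true amount of WAL is slightly smaller than the figure computed here.

```python
def lsn_to_int(lsn):
    """Convert an LSN string such as '1A13A/9BC425A0' to an integer."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def lsn_gap_gib(ahead, behind):
    """Approximate distance between two LSNs in whole GiB."""
    return (lsn_to_int(ahead) - lsn_to_int(behind)) // 2**30


if __name__ == "__main__":
    requested = "1A13A/9BC425A0"  # "xlog min recovery request" from the warning
    current = "19F1E/725043E8"    # "current point" from the warning
    print(lsn_gap_gib(requested, current), "GiB")
```

For the two values in the warning this comes to about 2160 GiB, so the page being written carries an LSN a little over 2 TiB ahead of where replay currently is.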

After a few hours, this happened:

[2013-02-04 13:43:24 EST]  13538@  WARNING:  page 1248 of relation pg_tblspc/16408/PG_9.2_201204301/16409/1128746393 does not exist
[2013-02-04 13:43:24 EST]  13538@  CONTEXT:  xlog redo visible: rel 16408/16409/1128746393; blk 1248
[2013-02-04 13:43:24 EST]  13538@  PANIC:  WAL contains references to invalid pages
[2013-02-04 13:43:24 EST]  13538@  CONTEXT:  xlog redo visible: rel 16408/16409/1128746393; blk 1248
[2013-02-04 13:43:25 EST]  13532@  LOG:  startup process (PID 13538) was terminated by signal 6: Aborted
[2013-02-04 13:43:25 EST]  13532@  LOG:  terminating any other active server processes

This looks similar to the first case, but with a different context. We thought that perhaps an index had become corrupted (apparently also a possibility with the bug mentioned above); however, the file mentioned belongs to a normal table, not an index. And 'redo visible' sounds like it might be to do with the visibility map?
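The table-versus-index check mentioned above can be repeated along the following lines. This is only a sketch under stated assumptions, not necessarily how the check was done here: the helper names are hypothetical, the three numbers in "rel a/b/c" are taken to be tablespace OID, database OID and relfilenode, and the generated query has to be run on the primary while connected to the database with that OID. Mapped system catalogs store 0 in pg_class.relfilenode, so they would not be found this way.

```python
import re

# Matches the "rel <tablespace>/<database>/<relfilenode>; blk <n>" part of a
# redo CONTEXT line as it appears in the logs above.
REDO_REL = re.compile(r"rel (\d+)/(\d+)/(\d+); blk (\d+)")


def parse_redo_context(line):
    """Return the relation reference found in a CONTEXT line, or None."""
    m = REDO_REL.search(line)
    if m is None:
        return None
    spc, db, node, blk = (int(g) for g in m.groups())
    return {
        "tablespace_oid": spc,
        "database_oid": db,
        "relfilenode": node,
        "block": blk,
    }


def lookup_sql(ref):
    """SQL that names the relation; relkind 'r' is a table, 'i' an index."""
    return (
        "SELECT oid::regclass AS relation, relkind "
        "FROM pg_class WHERE relfilenode = %d;" % ref["relfilenode"]
    )


if __name__ == "__main__":
    line = "CONTEXT:  xlog redo visible: rel 16408/16409/1128746393; blk 1248"
    print(lookup_sql(parse_redo_context(line)))
```

The database to connect to can be found with SELECT datname FROM pg_database WHERE oid = 16409.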

We restarted it again with debugging cranked up. It didn't reveal anything more interesting. We then upgraded the standby to 9.2.2 and started it again. Again no dice. In each case it fails at exactly the same point with the same error.

Any ideas for a next troubleshooting step?

Regards // Mike
