Thread: Hot Standby has PANIC: WAL contains references to invalid pages

Hot Standby has PANIC: WAL contains references to invalid pages

From
Michael Harris
Date:
Hi All,

We are having a thorny problem I'm hoping someone will be able to help with.

We have a pair of machines set up as an active / hot standby pair. The
database they contain is quite large - approx. 9TB. They were working fine
on 9.1, and we recently upgraded the active DB to 9.2.1.

After upgrading the active DB, we re-mirrored the standby (using
pg_basebackup) and started it up. It began replaying the WAL files as
expected.

After a few hours this happened:

WARNING:  page 1 of relation pg_tblspc/16408/PG_9.2_201204301/16409/1123460086 is uninitialized
CONTEXT:  xlog redo vacuum: rel 16408/16409/1123460086; blk 4411, lastBlockVacuumed 0
PANIC:  WAL contains references to invalid pages
CONTEXT:  xlog redo vacuum: rel 16408/16409/1123460086; blk 4411, lastBlockVacuumed 0
LOG:  startup process (PID 24195) was terminated by signal 6: Aborted
LOG:  terminating any other active server processes

We tried starting it up again, the same thing happened.

After some googling and re-reading the release notes, we noticed the
mention in the 9.2.1 release notes about the potential for corrupted
visibility maps, so as per the recommendation we did a full VACUUM of the
whole database (with vacuum_freeze_table_age set to zero), then re-mirrored
the standby again.
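
For readers repeating that step, here is a minimal sketch of one way to run a
database-wide VACUUM with vacuum_freeze_table_age forced to zero. It is not
the command used in this thread (that was not posted). It assumes the stock
vacuumdb client is on the PATH and relies on libpq's PGOPTIONS environment
variable to pass the setting to each session; the helper name
vacuum_all_command is hypothetical.

```python
import os
import shlex


def vacuum_all_command(base_env=None):
    """Return (env, argv) for vacuuming every database with
    vacuum_freeze_table_age set to 0 for the vacuuming sessions.

    Hypothetical helper: it only builds the command, it does not run it.
    """
    env = dict(os.environ if base_env is None else base_env)
    # libpq reads PGOPTIONS and hands the options to the backend at connect time.
    env["PGOPTIONS"] = "-c vacuum_freeze_table_age=0"
    argv = ["vacuumdb", "--all", "--verbose"]
    return env, argv


env, argv = vacuum_all_command({})
# Show the equivalent shell invocation instead of executing it.
print("PGOPTIONS=" + shlex.quote(env["PGOPTIONS"]), " ".join(argv))
```

To actually run it, the pair could be passed to subprocess.run(argv, env=env,
check=True) on the primary.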

After re-mirroring was completed we started the standby again. Strangely it
reached consistency after only 33 WAL files - since the base backup took 5
days to complete this does not seem right to me. Anyway, WAL recovery
continued, with occasional warnings like this:

[2013-02-04 10:30:51 EST]  13546@  WARNING:  xlog min recovery request 1A13A/9BC425A0 is past current point 19F1E/725043E8
[2013-02-04 10:30:51 EST]  13546@  CONTEXT:  writing block 0 of relation pg_tblspc/16408/PG_9.2_201204301/16409/12525_vm
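
To get a feel for how far apart the two WAL locations in that warning are, the
sketch below treats each "X/Y" location as a 64-bit number (high field shifted
left 32 bits, plus the low field). This is an approximation: the helper names
are hypothetical, and on 9.2 and earlier the true byte distance should be
slightly smaller, since (as far as I recall) one 16 MB segment per logical log
file was left unused.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert an 'X/Y' WAL location (two hex fields) to a single integer."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def lsn_gap_bytes(ahead: str, behind: str) -> int:
    """Approximate number of bytes by which `ahead` is past `behind`."""
    return lsn_to_int(ahead) - lsn_to_int(behind)


# Values copied from the WARNING line above.
requested = "1A13A/9BC425A0"
current = "19F1E/725043E8"

gap = lsn_gap_bytes(requested, current)
print(f"approximate gap: {gap} bytes ({gap / 2**40:.2f} TiB)")
```

A gap on the order of terabytes means the requested minimum recovery point is
far ahead of what the standby had replayed at that moment.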

After a few hours, this happened:

[2013-02-04 13:43:24 EST]  13538@  WARNING:  page 1248 of relation pg_tblspc/16408/PG_9.2_201204301/16409/1128746393 does not exist
[2013-02-04 13:43:24 EST]  13538@  CONTEXT:  xlog redo visible: rel 16408/16409/1128746393; blk 1248
[2013-02-04 13:43:24 EST]  13538@  PANIC:  WAL contains references to invalid pages
[2013-02-04 13:43:24 EST]  13538@  CONTEXT:  xlog redo visible: rel 16408/16409/1128746393; blk 1248
[2013-02-04 13:43:25 EST]  13532@  LOG:  startup process (PID 13538) was terminated by signal 6: Aborted
[2013-02-04 13:43:25 EST]  13532@  LOG:  terminating any other active server processes

This looks similar to the first case, but with a different context. We
thought that perhaps an index had become corrupted (apparently also a
possibility with the bug mentioned above); however, the file mentioned
belongs to a normal table, not an index. And 'redo visible' sounds like it
might be to do with the visibility map?
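
For anyone wanting to make the same table-or-index check, here is a small
sketch that pulls the identifiers out of a redo CONTEXT line and builds a
pg_class lookup. The three numbers after "rel" are the tablespace OID,
database OID and relfilenode. The function names are hypothetical, and the
query has to be run on the primary while connected to the database with that
OID; some catalog relations store relfilenode 0 in pg_class and will not be
found this way.

```python
import re

# Matches e.g. "xlog redo visible: rel 16408/16409/1128746393; blk 1248"
CONTEXT_RE = re.compile(
    r"xlog redo (?P<op>\w+): rel (?P<spc>\d+)/(?P<db>\d+)/(?P<rel>\d+); "
    r"blk (?P<blk>\d+)"
)


def parse_redo_context(line: str) -> dict:
    """Extract the operation, relation identifiers and block number."""
    m = CONTEXT_RE.search(line)
    if m is None:
        raise ValueError("not a recognised redo CONTEXT line")
    return {
        "op": m.group("op"),
        "tablespace_oid": int(m.group("spc")),
        "database_oid": int(m.group("db")),
        "relfilenode": int(m.group("rel")),
        "block": int(m.group("blk")),
    }


def lookup_sql(relfilenode: int) -> str:
    """SQL that reports the relation name and kind ('r' table, 'i' index)."""
    return (
        "SELECT relname, relkind FROM pg_class "
        f"WHERE relfilenode = {relfilenode};"
    )


info = parse_redo_context(
    "CONTEXT:  xlog redo visible: rel 16408/16409/1128746393; blk 1248"
)
print(info)
print(lookup_sql(info["relfilenode"]))
```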

We restarted it again with debugging cranked up. It didn't reveal anything
more interesting. We then upgraded the standby to 9.2.2 and started it
again. Again no dice. In each case it fails at exactly the same point with
the same error.

Any ideas for a next troubleshooting step?

Regards // Mike

Re: Hot Standby has PANIC: WAL contains references to invalid pages

From
Hari Babu
Date:
On Tuesday, February 05, 2013 6:05 AM Michael Harris wrote:
>Any ideas for a next troubleshooting step?

This may be the issue discussed in the thread "[BUG?] lag of
minRecoveryPont in archive recovery", which was fixed recently.
Please check the following link for more details. It may help.

http://www.postgresql.org/message-id/20121206.130458.170549097.horiguchi.kyotaro@lab.ntt.co.jp

Regards,
Hari babu.

Re: Hot Standby has PANIC: WAL contains references to invalid pages

From
Michael Harris
Date:
Hi Hari,

Thanks for the tip. We tried applying that patch, however the error
recurred exactly as before.

Regards // Mike



Re: Hot Standby has PANIC: WAL contains references to invalid pages

From
amutu
Date:
Maybe pg_basebackup can't handle such a big database. Try rsync,
pg_start_backup, rsync, pg_stop_backup; it always works fine for us. Our
instance is about 2TB and we use pg 9.1.x.

jov
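
The sequence suggested above can be sketched as a dry-run plan. This is only
an illustration under stated assumptions: the paths, host name, backup label
and the helper rsync_backup_plan are hypothetical, psql is assumed to connect
to the primary by default, and pg_start_backup / pg_stop_backup are the
function names used by the 9.x releases discussed here. Tablespaces stored
outside the data directory would need their own rsync passes, and the WAL
generated during the backup must still reach the standby.

```python
import shlex


def rsync_backup_plan(primary_data, standby_target, label="standby_rebuild"):
    """Return the commands for: rsync, pg_start_backup, rsync, pg_stop_backup.

    Hypothetical helper: it only builds the command list, it runs nothing.
    """
    rsync = [
        "rsync", "-a", "--delete",
        "--exclude=postmaster.pid",       # never copy the primary's pid file
        primary_data.rstrip("/") + "/",   # trailing slash: copy the contents
        standby_target,
    ]
    return [
        rsync,                            # first pass: bulk copy, may be inconsistent
        ["psql", "-c", f"SELECT pg_start_backup('{label}');"],
        rsync,                            # second pass: only the differences
        ["psql", "-c", "SELECT pg_stop_backup();"],
    ]


# Example paths and host name are placeholders, not values from this thread.
plan = rsync_backup_plan("/var/lib/pgsql/9.2/data", "standby:/var/lib/pgsql/9.2/data/")
for cmd in plan:
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

The first rsync pass shortens the window between pg_start_backup and
pg_stop_backup, because the second pass then only transfers what changed.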

Re: Hot Standby has PANIC: WAL contains references to invalid pages

From
Magnus Hagander
Date:
On Thu, Feb 7, 2013 at 7:39 AM, amutu <zhao6014@gmail.com> wrote:
> maybe pg_basebackup can`t handle such big database.try
> rsync,pg_start_backup,rsync,pg_stop_backup,it always works fine for us.our
> instance is about 2TB and we use pg9.1.x.

It really should handle that without problem, but sure, it might be
worth trying that one. If you can show that the problem is in
pg_basebackup, that's a very clear bug (either in pg_basebackup or in
the backend supporting code), so that would be good to know.

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

Re: Hot Standby has PANIC: WAL contains references to invalid pages

From
Michael Harris
Date:
Hi,

We suspect the problem is not in that area, because we used pg_basebackup
on the same database pair under 9.1 and did not have any such problems.

Looking at the context of the crashes, they seem to relate to handling of
visibility maps during WAL replay. Going by the 9.2 release notes that is
an area that was changed to allow index-only scans in 9.2.

Also, we can see that 9.2.3 has been released now and has a number of fixes
relating to WAL replay, so we have decided to try again using that. We will
scrub the standby and make a fresh copy using pg_basebackup. If that
doesn't work then we may try using rsync instead.

We'll let you all know the result.

Regards // Mike


Re: Hot Standby has PANIC: WAL contains references to invalid pages

From
Michael Harris
Date:
Hi,

>> Also, we can see that 9.2.3 has been released now and has a number of
>> fixes relating to WAL replay, so we have decided to try again using that.
>> We will scrub the standby and make a fresh copy using pg_basebackup. If
>> that doesn't work then we may try using rsync instead.

I am pleased to be able to report that the problem seems to be fixed after
upgrading to 9.2.3.

We upgraded the standby server only to 9.2.3, rebuilt the standby using
pg_basebackup, and then started it up. It replayed all the outstanding WAL
files with no problems.

Regards // Mike