Thread: BUG #7883: "PANIC: WAL contains references to invalid pages" on replica recovery

BUG #7883: "PANIC: WAL contains references to invalid pages" on replica recovery

From
maciek@heroku.com
Date:
The following bug has been logged on the website:

Bug reference:      7883
Logged by:          Maciek Sakrejda
Email address:      maciek@heroku.com
PostgreSQL version: 9.1.8
Operating system:   Ubuntu 12.04 64-bit
Description:        =


We ran into a customer database giving us the error above when replicating
from 9.1.7 to 9.1.8 and attempting to fail over to the 9.1.8. I noticed
several fixes to WAL replay in 9.1.8--could this be a factor in this case?
We're trying again with a fresh replica; hopefully that will just work. Logs
from the incident are below.

Thanks,
Maciek

Feb 15 00:49:12 wal_e.worker.s3_worker INFO     MSG: could not locate object
while performing wal restore#012        DETAIL: The absolute URI that could
not be located is
s3://wal-e-[redacted]/wal-e-backups/timeline-0e0b390f-cb3f-4192-8cdb-fced4d=
54b0a2/wal_005/000000010000003C00000094.lzo.#012
       HINT: This can be normal when Postgres is trying to detect what
timelines are available during restoration.
Feb 15 00:49:12 [1300-1]  [COPPER] LOG:  invalid magic number 0000 in log
file 60, segment 148, offset 0
Feb 15 00:49:13 wal_e.worker.s3_worker INFO     MSG: could not locate object
while performing wal restore#012        DETAIL: The absolute URI that could
not be located is
s3://wal-e-[redacted]/wal-e-backups/timeline-0e0b390f-cb3f-4192-8cdb-fced4d=
54b0a2/wal_005/000000010000003C00000094.lzo.#012
       HINT: This can be normal when Postgres is trying to detect what
timelines are available during restoration.
Feb 15 00:49:13 [1301-1]  [COPPER] LOG:  redo done at 3C/930000B0
Feb 15 00:49:13 [1302-1]  [COPPER] LOG:  last completed transaction was at
log time 2013-02-14 22:35:05.338681+00
Feb 15 00:49:15 wal_e.worker.s3_worker INFO     MSG: completed download and
decompression#012        DETAIL: Downloaded and decompressed
"s3://wal-e-[redacted]/wal-e-backups/timeline-0e0b390f-cb3f-4192-8cdb-fced4=
d54b0a2/wal_005/000000010000003C00000093.lzo"
to "pg_xlog/RECOVERYXLOG"
Feb 15 00:49:15 [1303-1]  [COPPER] LOG:  restored log file
"000000010000003C00000093" from archive
Feb 15 00:49:15 wal_e.worker.s3_worker INFO     MSG: could not locate object
while performing wal restore#012        DETAIL: The absolute URI that could
not be located is
s3://wal-e-[redacted]/wal-e-backups/timeline-0e0b390f-cb3f-4192-8cdb-fced4d=
54b0a2/wal_005/00000002.history.lzo.#012
       HINT: This can be normal when Postgres is trying to detect what
timelines are available during restoration.
Feb 15 00:49:16 [1304-1]  [COPPER] LOG:  selected new timeline ID: 2
Feb 15 00:49:16 wal_e.worker.s3_worker INFO     MSG: could not locate object
while performing wal restore#012        DETAIL: The absolute URI that could
not be located is
s3://wal-e-[redacted]/wal-e-backups/timeline-0e0b390f-cb3f-4192-8cdb-fced4d=
54b0a2/wal_005/00000001.history.lzo.#012
       HINT: This can be normal when Postgres is trying to detect what
timelines are available during restoration.
Feb 15 00:49:16 [1305-1]  [COPPER] LOG:  archive recovery complete
Feb 15 00:49:16 [1306-1]  [COPPER] WARNING:  page 37956 of relation
base/16385/16430 was uninitialized
Feb 15 00:49:16 [1307-1]  [COPPER] PANIC:  WAL contains references to
invalid pages
Feb 15 00:49:17 [3-1]  [COPPER] LOG:  startup process (PID 7) was terminated
by signal 6: Aborted
Feb 15 00:49:17 [4-1]  [COPPER] LOG:  terminating any other active server
processes
Feb 15 00:49:17 [37-1] collectd [COPPER] WARNING:  terminating connection
because of crash of another server process
Feb 15 00:49:17 [37-2] collectd [COPPER] DETAIL:  The postmaster has
commanded this server process to roll back the current transaction and exit,
because another server process exited abnormally and possibly corrupted
shared memory.
Feb 15 00:49:17 [37-3] collectd [COPPER] HINT:  In a moment you should be
able to reconnect to the database and repeat your command.

Re: BUG #7883: "PANIC: WAL contains references to invalid pages" on replica recovery

From
Heikki Linnakangas
Date:
On 15.02.2013 03:49, maciek@heroku.com wrote:
> The following bug has been logged on the website:
>
> Bug reference:      7883
> Logged by:          Maciek Sakrejda
> Email address:      maciek@heroku.com
> PostgreSQL version: 9.1.8
> Operating system:   Ubuntu 12.04 64-bit
> Description:
>
> We ran into a customer database giving us the error above when replicating
> from 9.1.7 to 9.1.8 and attempting to fail over to the 9.1.8. I noticed
> several fixes to WAL replay in 9.1.8--could this be a factor in this case?
> ...
> Feb 15 00:49:16 [1305-1]  [COPPER] LOG:  archive recovery complete
> Feb 15 00:49:16 [1306-1]  [COPPER] WARNING:  page 37956 of relation
> base/16385/16430 was uninitialized
> Feb 15 00:49:16 [1307-1]  [COPPER] PANIC:  WAL contains references to
> invalid pages

Hmm, that sure looks like the same issue Kyotaro HORIGUCHI reported
(http://www.postgresql.org/message-id/20121206.130458.170549097.horiguchi.kyotaro@lab.ntt.co.jp),
but that was fixed in 9.1.8. Maybe there's some corner case where it's
still not working.

Did you keep a copy of the WAL involved? Any chance of running xlogdump
on it, and posting the results, or just sending over the WAL files so I
could take a look?

- Heikki

Re: BUG #7883: "PANIC: WAL contains references to invalid pages" on replica recovery

From
Maciek Sakrejda
Date:
On Fri, Feb 15, 2013 at 7:10 AM, Heikki Linnakangas <hlinnakangas@vmware.co=
m
> wrote:

> Hmm, that sure looks like the same issue Kyotaro HORIGUCHI reported (
> http://www.postgresql.org/message-id/20121206.130458.170549097.horiguchi.=
kyotaro@lab.ntt.co.jp),
> but that was fixed in 9.1.8. Maybe there's some corner case where it's
> still not working.
>

We reproduce it on a separate replica with the same base backup. Tried a
new base backup; that worked.


> Did you keep a copy of the WAL involved? Any chance of running xlogdump o=
n
> it, and posting the results, or just sending over the WAL files so I coul=
d
> take a look?
>

Yes, we have the WAL. I can't share it as it's customer data. I tried
setting up xlogdump (this, yes? https://github.com/snaga/xlogdump ), but
ran into problems building it (tag REL0_5_2):

root@host:/home/heroku/xlogdump# PATH=3D/usr/lib/postgresql/9.1/bin:$PATH
USE_PGXS=3D1 make
gcc -g -O2 -fstack-protector --param=3Dssp-buffer-size=3D4 -Wformat
-Wformat-security -Werror=3Dformat-security -fPIC -Wall -Wmissing-prototype=
s
-Wpointer-arith -Wdeclaration-after-statement -Wendif-labels
-Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=3Dstandar=
d
-g -I/usr/include/postgresql -DDATADIR=3D\"/usr/share/postgresql/9.1\" -I.
-I. -I/usr/include/postgresql/9.1/server -I/usr/include/postgresql/internal
-D_FORTIFY_SOURCE=3D2 -DLINUX_OOM_ADJ=3D0 -D_GNU_SOURCE -I/usr/include/libx=
ml2
-I/usr/include/tcl8.5  -c -o xlogdump.o xlogdump.c
xlogdump.c: In function =91ReadRecord=92:
xlogdump.c:321:7: warning: format =91%u=92 expects argument of type =91unsi=
gned
int=92, but argument 3 has type =91long unsigned int=92 [-Wformat]
xlogdump.c: At top level:
xlogdump.c:865:1: error: conflicting types for =91ExceptionalCondition=92
/usr/include/postgresql/postgres.h:686:13: note: previous declaration of
=91ExceptionalCondition=92 was here
make: *** [xlogdump.o] Error 1
On Fri, Feb 15, 2013 at 10:57 AM, Maciek Sakrejda <maciek@heroku.com> wrote=
:

> On Fri, Feb 15, 2013 at 7:10 AM, Heikki Linnakangas <
> hlinnakangas@vmware.com> wrote:
>
>> Hmm, that sure looks like the same issue Kyotaro HORIGUCHI reported (
>> http://www.postgresql.org/message-id/20121206.130458.170549097.horiguchi=
.kyotaro@lab.ntt.co.jp),
>> but that was fixed in 9.1.8. Maybe there's some corner case where it's
>> still not working.
>>
>
> We reproduce it on a separate replica with the same base backup. Tried a
> new base backup; that worked.
>
>
>> Did you keep a copy of the WAL involved? Any chance of running xlogdump
>> on it, and posting the results, or just sending over the WAL files so I
>> could take a look?
>>
>
> Yes, we have the WAL. I can't share it as it's customer data. I tried
> setting up xlogdump (this, yes? https://github.com/snaga/xlogdump ), but
> ran into problems building it (tag REL0_5_2):
>
> root@host:/home/heroku/xlogdump# PATH=3D/usr/lib/postgresql/9.1/bin:$PATH
> USE_PGXS=3D1 make
> gcc -g -O2 -fstack-protector --param=3Dssp-buffer-size=3D4 -Wformat
> -Wformat-security -Werror=3Dformat-security -fPIC -Wall -Wmissing-prototy=
pes
> -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels
> -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=3Dstand=
ard
> -g -I/usr/include/postgresql -DDATADIR=3D\"/usr/share/postgresql/9.1\" -I=
.
> -I. -I/usr/include/postgresql/9.1/server -I/usr/include/postgresql/intern=
al
> -D_FORTIFY_SOURCE=3D2 -DLINUX_OOM_ADJ=3D0 -D_GNU_SOURCE -I/usr/include/li=
bxml2
> -I/usr/include/tcl8.5  -c -o xlogdump.o xlogdump.c
> xlogdump.c: In function =91ReadRecord=92:
> xlogdump.c:321:7: warning: format =91%u=92 expects argument of type =91un=
signed
> int=92, but argument 3 has type =91long unsigned int=92 [-Wformat]
> xlogdump.c: At top level:
> xlogdump.c:865:1: error: conflicting types for =91ExceptionalCondition=92
> /usr/include/postgresql/postgres.h:686:13: note: previous declaration of
> =91ExceptionalCondition=92 was here
> make: *** [xlogdump.o] Error 1
>
>
Can you try master on xlogdump instead of the release? I have fixed a few
bugs there a year ago that don't seem to be in the release, and will fix
more if necessary.

--=20
fdr

Re: BUG #7883: "PANIC: WAL contains references to invalid pages" on replica recovery

From
Maciek Sakrejda
Date:
On Fri, Feb 15, 2013 at 3:15 PM, Daniel Farina <daniel@heroku.com> wrote:

> Can you try master on xlogdump instead of the release? I have fixed a few
> bugs there a year ago that don't seem to be in the release, and will fix
> more if necessary.
>

Had tried that, too--complained about a missing pg_crc_tables.h
On Fri, Feb 15, 2013 at 3:18 PM, Maciek Sakrejda <maciek@heroku.com> wrote:

> On Fri, Feb 15, 2013 at 3:15 PM, Daniel Farina <daniel@heroku.com> wrote:
>
>> Can you try master on xlogdump instead of the release? I have fixed a few
>> bugs there a year ago that don't seem to be in the release, and will fix
>> more if necessary.
>>
>
> Had tried that, too--complained about a missing pg_crc_tables.h
>

I'm poking at this and somehow it seems like the wrong PG_VERSION_NUM is
being slurped in -- even though the gcc invocation would suggest it is
properly targeting 9.1 -- which is causing xlogdump to have a cow with
versions < 9.2.

I think the problem is related to this:

/usr/include/postgresql/9.1/server/fmgr.h: PG_VERSION_NUM / 100, \
/usr/include/postgresql/9.1/server/pg_config.h:#define PG_VERSION_NUM 90108
/usr/include/postgresql/pg_config.h:#define PG_VERSION_NUM 90203 <-- this
one
/usr/include/postgresql/9.2/server/fmgr.h: PG_VERSION_NUM / 100, \
/usr/include/postgresql/9.2/server/pg_config.h:#define PG_VERSION_NUM 90203

And USE_PGXS=1 make yields something like....
gcc -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat
-Werror=format-security -fPIC -pie -Wall -Wmissing-prototypes
-Wpointer-arith -Wdeclaration-after-statement -Wendif-labels
-Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard
-g -DVERSION_STR=\""0.6devel"\" -I. -I/usr/include/postgresql
-DDATADIR=\"/usr/share/postgresql/9.1\" -I. -I.
-I/usr/include/postgresql/9.1/server -I/usr/include/postgresql/internal
-D_FORTIFY_SOURCE=2 -DLINUX_OOM_ADJ=0 -D_GNU_SOURCE -I/usr/include/libxml2
 -I/usr/include/tcl8.5  -c -o strlcpy.o strlcpy.c

Where, notably, both the 9.1 and /usr/include/postgresql directories are in
place.  Hence, which ever pg_config gets loaded in first decides what
PG_VERSION_NUM is compiling against...that's no good...

Why is it there...because:

dpkg -S /usr/include/postgresql/pg_config.h
libpq-dev: /usr/include/postgresql/pg_config.h

I guess that means Ubuntu (and probably Debian?) libpq-dev breaks
PG_VERSION_NUM for PGXS=1.

--
fdr

Re: BUG #7883: "PANIC: WAL contains references to invalid pages" on replica recovery

From
Maciek Sakrejda
Date:
Purged and reinstalled the dev packages (only reinstalled 9.1) and I still
get the same error. The pg_config.h in /usr/include/postgresql/ looks legit:

heroku@ip-10-99-18-64:~/xlogdump$ grep PG_VERSION
/usr/include/postgresql/pg_config.h
#define PG_VERSION "9.1.8"
#define PG_VERSION_NUM 90108
#define PG_VERSION_STR "PostgreSQL 9.1.8 on x86_64-unknown-linux-gnu,
compiled by gcc-4.6.real (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3, 64-bit"

Any ideas?

Re: BUG #7883: "PANIC: WAL contains references to invalid pages" on replica recovery

From
Heikki Linnakangas
Date:
On 16.02.2013 01:49, Daniel Farina wrote:
> I guess that means Ubuntu (and probably Debian?) libpq-dev breaks
> PG_VERSION_NUM for PGXS=1.

That obviously needs to be fixed in debian. Meanwhile, Maciek, I'd
suggest that you build PostgreSQL from sources, install it to some
temporary location, and then build xlogdump against that.

- Heikki

Re: BUG #7883: "PANIC: WAL contains references to invalid pages" on replica recovery

From
Maciek Sakrejda
Date:
On Mon, Feb 18, 2013 at 12:57 AM, Heikki Linnakangas <
hlinnakangas@vmware.com> wrote:

> On 16.02.2013 01:49, Daniel Farina wrote:
>
>> I guess that means Ubuntu (and probably Debian?) libpq-dev breaks
>> PG_VERSION_NUM for PGXS=1.
>>
>
> That obviously needs to be fixed in debian. Meanwhile, Maciek, I'd suggest
> that you build PostgreSQL from sources, install it to some temporary
> location, and then build xlogdump against that.
>

That worked, thanks. I have a working xlogdump. Any pointers as to what I
should look for? This is the contents of the pg_xlog directory:

total 49160
-rw------- 1 udrehggpif7kft postgres 16777216 Feb 15 00:00
000000010000003C00000093
-rw------- 1 udrehggpif7kft postgres 16777216 Feb 15 00:47
000000010000003C00000094
-rw------- 1 udrehggpif7kft postgres 16777216 Feb 15 00:49
000000020000003C00000093
-rw------- 1 udrehggpif7kft postgres       56 Feb 15 00:49 00000002.history
drwx------ 2 udrehggpif7kft postgres     4096 Feb 15 00:49 archive_status


This is what's in 00000002.history:

1    000000010000003C00000093    no recovery target specified

Re: BUG #7883: "PANIC: WAL contains references to invalid pages" on replica recovery

From
Heikki Linnakangas
Date:
On 19.02.2013 00:19, Maciek Sakrejda wrote:
> On Mon, Feb 18, 2013 at 12:57 AM, Heikki Linnakangas<
> hlinnakangas@vmware.com>  wrote:
>
>> On 16.02.2013 01:49, Daniel Farina wrote:
>>
>>> I guess that means Ubuntu (and probably Debian?) libpq-dev breaks
>>> PG_VERSION_NUM for PGXS=1.
>>>
>>
>> That obviously needs to be fixed in debian. Meanwhile, Maciek, I'd suggest
>> that you build PostgreSQL from sources, install it to some temporary
>> location, and then build xlogdump against that.
>
> That worked, thanks. I have a working xlogdump. Any pointers as to what I
> should look for? This is the contents of the pg_xlog directory:
>
> total 49160
> -rw------- 1 udrehggpif7kft postgres 16777216 Feb 15 00:00
> 000000010000003C00000093
> -rw------- 1 udrehggpif7kft postgres 16777216 Feb 15 00:47
> 000000010000003C00000094
> -rw------- 1 udrehggpif7kft postgres 16777216 Feb 15 00:49
> 000000020000003C00000093
> -rw------- 1 udrehggpif7kft postgres       56 Feb 15 00:49 00000002.history
> drwx------ 2 udrehggpif7kft postgres     4096 Feb 15 00:49 archive_status

I'd like to see the contents of the WAL, starting from the last
checkpoint, up to the point where failover happened. In particular, any
actions on the relation base/16385/16430, which caused the error.
pg_controldata output on the base backup would also interesting, as well
as the contents of backup_label file.

How long did the standby run between the base backup and the failover?
How many WAL segments?

One more thing you could try to narrow down the error: restore from the
base backup, and let it run up to the point of failover, but shut it
down just before the failover with "pg_ctl stop -m fast". That should
create a restartpoint, at the latest checkpoint record. Then restart,
and perform failover. If it still throws the same error, we know that
the WAL record that touched the page that doesn't exist was after the
last checkpoint.

- Heikki

Re: BUG #7883: "PANIC: WAL contains references to invalid pages" on replica recovery

From
Maciek Sakrejda
Date:
On Thu, Feb 21, 2013 at 4:04 AM, Heikki Linnakangas <hlinnakangas@vmware.com
> wrote:

> I'd like to see the contents of the WAL, starting from the last
> checkpoint, up to the point where failover happened. In particular, any
> actions on the relation base/16385/16430, which caused the error.
> pg_controldata output on the base backup would also interesting, as well as
> the contents of backup_label file.
>
> How long did the standby run between the base backup and the failover? How
> many WAL segments?
>
> One more thing you could try to narrow down the error: restore from the
> base backup, and let it run up to the point of failover, but shut it down
> just before the failover with "pg_ctl stop -m fast". That should create a
> restartpoint, at the latest checkpoint record. Then restart, and perform
> failover. If it still throws the same error, we know that the WAL record
> that touched the page that doesn't exist was after the last checkpoint.
>

Unfortunately, it looks like we lost the bad wal segments and necessary
base backup due to our archiving mechanism. We don't yet have a principled
way of saving systems for forensics. I thought I had manually accounted for
everything to keep this "on ice" but I missed a step and the system was
archived. I apologize. I'll see if I can add something for us to better
support this.

For what it's worth the failover was done at 2013-02-14 23:55:44 +0000 and
the base backup used was dated 2013-02-15 00:49:22 +0000.

I'll follow up in case we run into this again.