Thread: BUG #7883: "PANIC: WAL contains references to invalid pages" on replica recovery
BUG #7883: "PANIC: WAL contains references to invalid pages" on replica recovery
From
maciek@heroku.com
Date:
The following bug has been logged on the website: Bug reference: 7883 Logged by: Maciek Sakrejda Email address: maciek@heroku.com PostgreSQL version: 9.1.8 Operating system: Ubuntu 12.04 64-bit Description: = We ran into a customer database giving us the error above when replicating from 9.1.7 to 9.1.8 and attempting to fail over to the 9.1.8. I noticed several fixes to WAL replay in 9.1.8--could this be a factor in this case? We're trying again with a fresh replica; hopefully that will just work. Logs from the incident are below. Thanks, Maciek Feb 15 00:49:12 wal_e.worker.s3_worker INFO MSG: could not locate object while performing wal restore#012 DETAIL: The absolute URI that could not be located is s3://wal-e-[redacted]/wal-e-backups/timeline-0e0b390f-cb3f-4192-8cdb-fced4d= 54b0a2/wal_005/000000010000003C00000094.lzo.#012 HINT: This can be normal when Postgres is trying to detect what timelines are available during restoration. Feb 15 00:49:12 [1300-1] [COPPER] LOG: invalid magic number 0000 in log file 60, segment 148, offset 0 Feb 15 00:49:13 wal_e.worker.s3_worker INFO MSG: could not locate object while performing wal restore#012 DETAIL: The absolute URI that could not be located is s3://wal-e-[redacted]/wal-e-backups/timeline-0e0b390f-cb3f-4192-8cdb-fced4d= 54b0a2/wal_005/000000010000003C00000094.lzo.#012 HINT: This can be normal when Postgres is trying to detect what timelines are available during restoration. Feb 15 00:49:13 [1301-1] [COPPER] LOG: redo done at 3C/930000B0 Feb 15 00:49:13 [1302-1] [COPPER] LOG: last completed transaction was at log time 2013-02-14 22:35:05.338681+00 Feb 15 00:49:15 wal_e.worker.s3_worker INFO MSG: completed download and decompression#012 DETAIL: Downloaded and decompressed "s3://wal-e-[redacted]/wal-e-backups/timeline-0e0b390f-cb3f-4192-8cdb-fced4= d54b0a2/wal_005/000000010000003C00000093.lzo" to "pg_xlog/RECOVERYXLOG" Feb 15 00:49:15 [1303-1] [COPPER] LOG: restored log file "000000010000003C00000093" from archive Feb 15 00:49:15 wal_e.worker.s3_worker INFO MSG: could not locate object while performing wal restore#012 DETAIL: The absolute URI that could not be located is s3://wal-e-[redacted]/wal-e-backups/timeline-0e0b390f-cb3f-4192-8cdb-fced4d= 54b0a2/wal_005/00000002.history.lzo.#012 HINT: This can be normal when Postgres is trying to detect what timelines are available during restoration. Feb 15 00:49:16 [1304-1] [COPPER] LOG: selected new timeline ID: 2 Feb 15 00:49:16 wal_e.worker.s3_worker INFO MSG: could not locate object while performing wal restore#012 DETAIL: The absolute URI that could not be located is s3://wal-e-[redacted]/wal-e-backups/timeline-0e0b390f-cb3f-4192-8cdb-fced4d= 54b0a2/wal_005/00000001.history.lzo.#012 HINT: This can be normal when Postgres is trying to detect what timelines are available during restoration. Feb 15 00:49:16 [1305-1] [COPPER] LOG: archive recovery complete Feb 15 00:49:16 [1306-1] [COPPER] WARNING: page 37956 of relation base/16385/16430 was uninitialized Feb 15 00:49:16 [1307-1] [COPPER] PANIC: WAL contains references to invalid pages Feb 15 00:49:17 [3-1] [COPPER] LOG: startup process (PID 7) was terminated by signal 6: Aborted Feb 15 00:49:17 [4-1] [COPPER] LOG: terminating any other active server processes Feb 15 00:49:17 [37-1] collectd [COPPER] WARNING: terminating connection because of crash of another server process Feb 15 00:49:17 [37-2] collectd [COPPER] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. Feb 15 00:49:17 [37-3] collectd [COPPER] HINT: In a moment you should be able to reconnect to the database and repeat your command.
Re: BUG #7883: "PANIC: WAL contains references to invalid pages" on replica recovery
From
Heikki Linnakangas
Date:
On 15.02.2013 03:49, maciek@heroku.com wrote: > The following bug has been logged on the website: > > Bug reference: 7883 > Logged by: Maciek Sakrejda > Email address: maciek@heroku.com > PostgreSQL version: 9.1.8 > Operating system: Ubuntu 12.04 64-bit > Description: > > We ran into a customer database giving us the error above when replicating > from 9.1.7 to 9.1.8 and attempting to fail over to the 9.1.8. I noticed > several fixes to WAL replay in 9.1.8--could this be a factor in this case? > ... > Feb 15 00:49:16 [1305-1] [COPPER] LOG: archive recovery complete > Feb 15 00:49:16 [1306-1] [COPPER] WARNING: page 37956 of relation > base/16385/16430 was uninitialized > Feb 15 00:49:16 [1307-1] [COPPER] PANIC: WAL contains references to > invalid pages Hmm, that sure looks like the same issue Kyotaro HORIGUCHI reported (http://www.postgresql.org/message-id/20121206.130458.170549097.horiguchi.kyotaro@lab.ntt.co.jp), but that was fixed in 9.1.8. Maybe there's some corner case where it's still not working. Did you keep a copy of the WAL involved? Any chance of running xlogdump on it, and posting the results, or just sending over the WAL files so I could take a look? - Heikki
Re: BUG #7883: "PANIC: WAL contains references to invalid pages" on replica recovery
From
Maciek Sakrejda
Date:
On Fri, Feb 15, 2013 at 7:10 AM, Heikki Linnakangas <hlinnakangas@vmware.co= m > wrote: > Hmm, that sure looks like the same issue Kyotaro HORIGUCHI reported ( > http://www.postgresql.org/message-id/20121206.130458.170549097.horiguchi.= kyotaro@lab.ntt.co.jp), > but that was fixed in 9.1.8. Maybe there's some corner case where it's > still not working. > We reproduce it on a separate replica with the same base backup. Tried a new base backup; that worked. > Did you keep a copy of the WAL involved? Any chance of running xlogdump o= n > it, and posting the results, or just sending over the WAL files so I coul= d > take a look? > Yes, we have the WAL. I can't share it as it's customer data. I tried setting up xlogdump (this, yes? https://github.com/snaga/xlogdump ), but ran into problems building it (tag REL0_5_2): root@host:/home/heroku/xlogdump# PATH=3D/usr/lib/postgresql/9.1/bin:$PATH USE_PGXS=3D1 make gcc -g -O2 -fstack-protector --param=3Dssp-buffer-size=3D4 -Wformat -Wformat-security -Werror=3Dformat-security -fPIC -Wall -Wmissing-prototype= s -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=3Dstandar= d -g -I/usr/include/postgresql -DDATADIR=3D\"/usr/share/postgresql/9.1\" -I. -I. -I/usr/include/postgresql/9.1/server -I/usr/include/postgresql/internal -D_FORTIFY_SOURCE=3D2 -DLINUX_OOM_ADJ=3D0 -D_GNU_SOURCE -I/usr/include/libx= ml2 -I/usr/include/tcl8.5 -c -o xlogdump.o xlogdump.c xlogdump.c: In function =91ReadRecord=92: xlogdump.c:321:7: warning: format =91%u=92 expects argument of type =91unsi= gned int=92, but argument 3 has type =91long unsigned int=92 [-Wformat] xlogdump.c: At top level: xlogdump.c:865:1: error: conflicting types for =91ExceptionalCondition=92 /usr/include/postgresql/postgres.h:686:13: note: previous declaration of =91ExceptionalCondition=92 was here make: *** [xlogdump.o] Error 1
Re: BUG #7883: "PANIC: WAL contains references to invalid pages" on replica recovery
From
Daniel Farina
Date:
On Fri, Feb 15, 2013 at 10:57 AM, Maciek Sakrejda <maciek@heroku.com> wrote= : > On Fri, Feb 15, 2013 at 7:10 AM, Heikki Linnakangas < > hlinnakangas@vmware.com> wrote: > >> Hmm, that sure looks like the same issue Kyotaro HORIGUCHI reported ( >> http://www.postgresql.org/message-id/20121206.130458.170549097.horiguchi= .kyotaro@lab.ntt.co.jp), >> but that was fixed in 9.1.8. Maybe there's some corner case where it's >> still not working. >> > > We reproduce it on a separate replica with the same base backup. Tried a > new base backup; that worked. > > >> Did you keep a copy of the WAL involved? Any chance of running xlogdump >> on it, and posting the results, or just sending over the WAL files so I >> could take a look? >> > > Yes, we have the WAL. I can't share it as it's customer data. I tried > setting up xlogdump (this, yes? https://github.com/snaga/xlogdump ), but > ran into problems building it (tag REL0_5_2): > > root@host:/home/heroku/xlogdump# PATH=3D/usr/lib/postgresql/9.1/bin:$PATH > USE_PGXS=3D1 make > gcc -g -O2 -fstack-protector --param=3Dssp-buffer-size=3D4 -Wformat > -Wformat-security -Werror=3Dformat-security -fPIC -Wall -Wmissing-prototy= pes > -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels > -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=3Dstand= ard > -g -I/usr/include/postgresql -DDATADIR=3D\"/usr/share/postgresql/9.1\" -I= . > -I. -I/usr/include/postgresql/9.1/server -I/usr/include/postgresql/intern= al > -D_FORTIFY_SOURCE=3D2 -DLINUX_OOM_ADJ=3D0 -D_GNU_SOURCE -I/usr/include/li= bxml2 > -I/usr/include/tcl8.5 -c -o xlogdump.o xlogdump.c > xlogdump.c: In function =91ReadRecord=92: > xlogdump.c:321:7: warning: format =91%u=92 expects argument of type =91un= signed > int=92, but argument 3 has type =91long unsigned int=92 [-Wformat] > xlogdump.c: At top level: > xlogdump.c:865:1: error: conflicting types for =91ExceptionalCondition=92 > /usr/include/postgresql/postgres.h:686:13: note: previous declaration of > =91ExceptionalCondition=92 was here > make: *** [xlogdump.o] Error 1 > > Can you try master on xlogdump instead of the release? I have fixed a few bugs there a year ago that don't seem to be in the release, and will fix more if necessary. --=20 fdr
Re: BUG #7883: "PANIC: WAL contains references to invalid pages" on replica recovery
From
Maciek Sakrejda
Date:
On Fri, Feb 15, 2013 at 3:15 PM, Daniel Farina <daniel@heroku.com> wrote: > Can you try master on xlogdump instead of the release? I have fixed a few > bugs there a year ago that don't seem to be in the release, and will fix > more if necessary. > Had tried that, too--complained about a missing pg_crc_tables.h
Re: BUG #7883: "PANIC: WAL contains references to invalid pages" on replica recovery
From
Daniel Farina
Date:
On Fri, Feb 15, 2013 at 3:18 PM, Maciek Sakrejda <maciek@heroku.com> wrote: > On Fri, Feb 15, 2013 at 3:15 PM, Daniel Farina <daniel@heroku.com> wrote: > >> Can you try master on xlogdump instead of the release? I have fixed a few >> bugs there a year ago that don't seem to be in the release, and will fix >> more if necessary. >> > > Had tried that, too--complained about a missing pg_crc_tables.h > I'm poking at this and somehow it seems like the wrong PG_VERSION_NUM is being slurped in -- even though the gcc invocation would suggest it is properly targeting 9.1 -- which is causing xlogdump to have a cow with versions < 9.2. I think the problem is related to this: /usr/include/postgresql/9.1/server/fmgr.h: PG_VERSION_NUM / 100, \ /usr/include/postgresql/9.1/server/pg_config.h:#define PG_VERSION_NUM 90108 /usr/include/postgresql/pg_config.h:#define PG_VERSION_NUM 90203 <-- this one /usr/include/postgresql/9.2/server/fmgr.h: PG_VERSION_NUM / 100, \ /usr/include/postgresql/9.2/server/pg_config.h:#define PG_VERSION_NUM 90203 And USE_PGXS=1 make yields something like.... gcc -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security -fPIC -pie -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -DVERSION_STR=\""0.6devel"\" -I. -I/usr/include/postgresql -DDATADIR=\"/usr/share/postgresql/9.1\" -I. -I. -I/usr/include/postgresql/9.1/server -I/usr/include/postgresql/internal -D_FORTIFY_SOURCE=2 -DLINUX_OOM_ADJ=0 -D_GNU_SOURCE -I/usr/include/libxml2 -I/usr/include/tcl8.5 -c -o strlcpy.o strlcpy.c Where, notably, both the 9.1 and /usr/include/postgresql directories are in place. Hence, which ever pg_config gets loaded in first decides what PG_VERSION_NUM is compiling against...that's no good... Why is it there...because: dpkg -S /usr/include/postgresql/pg_config.h libpq-dev: /usr/include/postgresql/pg_config.h I guess that means Ubuntu (and probably Debian?) libpq-dev breaks PG_VERSION_NUM for PGXS=1. -- fdr
Re: BUG #7883: "PANIC: WAL contains references to invalid pages" on replica recovery
From
Maciek Sakrejda
Date:
Purged and reinstalled the dev packages (only reinstalled 9.1) and I still get the same error. The pg_config.h in /usr/include/postgresql/ looks legit: heroku@ip-10-99-18-64:~/xlogdump$ grep PG_VERSION /usr/include/postgresql/pg_config.h #define PG_VERSION "9.1.8" #define PG_VERSION_NUM 90108 #define PG_VERSION_STR "PostgreSQL 9.1.8 on x86_64-unknown-linux-gnu, compiled by gcc-4.6.real (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3, 64-bit" Any ideas?
Re: BUG #7883: "PANIC: WAL contains references to invalid pages" on replica recovery
From
Heikki Linnakangas
Date:
On 16.02.2013 01:49, Daniel Farina wrote: > I guess that means Ubuntu (and probably Debian?) libpq-dev breaks > PG_VERSION_NUM for PGXS=1. That obviously needs to be fixed in debian. Meanwhile, Maciek, I'd suggest that you build PostgreSQL from sources, install it to some temporary location, and then build xlogdump against that. - Heikki
Re: BUG #7883: "PANIC: WAL contains references to invalid pages" on replica recovery
From
Maciek Sakrejda
Date:
On Mon, Feb 18, 2013 at 12:57 AM, Heikki Linnakangas < hlinnakangas@vmware.com> wrote: > On 16.02.2013 01:49, Daniel Farina wrote: > >> I guess that means Ubuntu (and probably Debian?) libpq-dev breaks >> PG_VERSION_NUM for PGXS=1. >> > > That obviously needs to be fixed in debian. Meanwhile, Maciek, I'd suggest > that you build PostgreSQL from sources, install it to some temporary > location, and then build xlogdump against that. > That worked, thanks. I have a working xlogdump. Any pointers as to what I should look for? This is the contents of the pg_xlog directory: total 49160 -rw------- 1 udrehggpif7kft postgres 16777216 Feb 15 00:00 000000010000003C00000093 -rw------- 1 udrehggpif7kft postgres 16777216 Feb 15 00:47 000000010000003C00000094 -rw------- 1 udrehggpif7kft postgres 16777216 Feb 15 00:49 000000020000003C00000093 -rw------- 1 udrehggpif7kft postgres 56 Feb 15 00:49 00000002.history drwx------ 2 udrehggpif7kft postgres 4096 Feb 15 00:49 archive_status This is what's in 00000002.history: 1 000000010000003C00000093 no recovery target specified
Re: BUG #7883: "PANIC: WAL contains references to invalid pages" on replica recovery
From
Heikki Linnakangas
Date:
On 19.02.2013 00:19, Maciek Sakrejda wrote: > On Mon, Feb 18, 2013 at 12:57 AM, Heikki Linnakangas< > hlinnakangas@vmware.com> wrote: > >> On 16.02.2013 01:49, Daniel Farina wrote: >> >>> I guess that means Ubuntu (and probably Debian?) libpq-dev breaks >>> PG_VERSION_NUM for PGXS=1. >>> >> >> That obviously needs to be fixed in debian. Meanwhile, Maciek, I'd suggest >> that you build PostgreSQL from sources, install it to some temporary >> location, and then build xlogdump against that. > > That worked, thanks. I have a working xlogdump. Any pointers as to what I > should look for? This is the contents of the pg_xlog directory: > > total 49160 > -rw------- 1 udrehggpif7kft postgres 16777216 Feb 15 00:00 > 000000010000003C00000093 > -rw------- 1 udrehggpif7kft postgres 16777216 Feb 15 00:47 > 000000010000003C00000094 > -rw------- 1 udrehggpif7kft postgres 16777216 Feb 15 00:49 > 000000020000003C00000093 > -rw------- 1 udrehggpif7kft postgres 56 Feb 15 00:49 00000002.history > drwx------ 2 udrehggpif7kft postgres 4096 Feb 15 00:49 archive_status I'd like to see the contents of the WAL, starting from the last checkpoint, up to the point where failover happened. In particular, any actions on the relation base/16385/16430, which caused the error. pg_controldata output on the base backup would also interesting, as well as the contents of backup_label file. How long did the standby run between the base backup and the failover? How many WAL segments? One more thing you could try to narrow down the error: restore from the base backup, and let it run up to the point of failover, but shut it down just before the failover with "pg_ctl stop -m fast". That should create a restartpoint, at the latest checkpoint record. Then restart, and perform failover. If it still throws the same error, we know that the WAL record that touched the page that doesn't exist was after the last checkpoint. - Heikki
Re: BUG #7883: "PANIC: WAL contains references to invalid pages" on replica recovery
From
Maciek Sakrejda
Date:
On Thu, Feb 21, 2013 at 4:04 AM, Heikki Linnakangas <hlinnakangas@vmware.com > wrote: > I'd like to see the contents of the WAL, starting from the last > checkpoint, up to the point where failover happened. In particular, any > actions on the relation base/16385/16430, which caused the error. > pg_controldata output on the base backup would also interesting, as well as > the contents of backup_label file. > > How long did the standby run between the base backup and the failover? How > many WAL segments? > > One more thing you could try to narrow down the error: restore from the > base backup, and let it run up to the point of failover, but shut it down > just before the failover with "pg_ctl stop -m fast". That should create a > restartpoint, at the latest checkpoint record. Then restart, and perform > failover. If it still throws the same error, we know that the WAL record > that touched the page that doesn't exist was after the last checkpoint. > Unfortunately, it looks like we lost the bad wal segments and necessary base backup due to our archiving mechanism. We don't yet have a principled way of saving systems for forensics. I thought I had manually accounted for everything to keep this "on ice" but I missed a step and the system was archived. I apologize. I'll see if I can add something for us to better support this. For what it's worth the failover was done at 2013-02-14 23:55:44 +0000 and the base backup used was dated 2013-02-15 00:49:22 +0000. I'll follow up in case we run into this again.