Thread: page 1 of relation global/11787 was uninitialized

page 1 of relation global/11787 was uninitialized

From
"Stephen R. van den Berg"
Date:
Just today one of my systems experienced a kernel panic, and halted abruptly.
Running Linux 3.1.9, PostgreSQL 9.0.4 (Debian 9.0.4-1+b1, to be precise).

The system was moderately active, i.e. about one commit per minute.
It is not a large problem if the last few commits would be gone.

Now, in restarting the system, I get this:

-------------------------
LOG: database system was interrupted while in recovery at 2013-04-09 18:07:45 CEST
HINT: This probably means that some data is corrupted and you will have to use the last backup for recovery.
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  consistent recovery state reached at D/B0BCE118
LOG:  redo starts at D/B0BAB734
LOG:  invalid record length at D/B0BAE010
LOG:  redo done at D/B0BADFC4
LOG:  last completed transaction was at log time 2013-04-09 14:50:29.743986+02
WARNING:  page 1 of relation global/11787 was uninitialized
PANIC:  WAL contains references to invalid pages
LOG:  startup process (PID 30827) was terminated by signal 6: Aborted
LOG:  aborting startup due to startup process failure
-------------------------

Looking at global/11787, doesn't reveal any obvious corruption.
The server was running with:synchronous_commit = offfull_page_writes = off

to maximise performance, since the data is not 100% critical, but I would
like to recover the data up to some point in the past (an hour ago is fine).

Any suggestions?
Restarting PostgreSQL several times, results in identical messages.
-- 
Stephen.



Re: page 1 of relation global/11787 was uninitialized

From
"Joshua D. Drake"
Date:
On 04/09/2013 09:21 AM, Stephen R. van den Berg wrote:

> -------------------------
>
> Looking at global/11787, doesn't reveal any obvious corruption.
> The server was running with:
>   synchronous_commit = off
>   full_page_writes = off

full_page_writes = off is the problem.
From the docs:

Turning this parameter off speeds normal operation, but might lead to 
either unrecoverable data corruption, or silent data corruption, after a 
system failure. The risks are similar to turning off fsync, though 
smaller, and it should be turned off only based on the same 
circumstances recommended for that parameter.

http://www.postgresql.org/docs/9.0/static/runtime-config-wal.html#GUC-FULL-PAGE-WRITES

Sincerely,

Joshua D. Drake


-- 
Command Prompt, Inc. - http://www.commandprompt.com/
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC
@cmdpromptinc - 509-416-6579



Re: page 1 of relation global/11787 was uninitialized

From
Andres Freund
Date:
On 2013-04-09 18:21:20 +0200, Stephen R. van den Berg wrote:
> Just today one of my systems experienced a kernel panic, and halted abruptly.
> Running Linux 3.1.9, PostgreSQL 9.0.4 (Debian 9.0.4-1+b1, to be precise).

Thats an absolutely outdated version of 9.0. You shouldn't be running
this in production.

On 2013-04-09 09:27:52 -0700, Joshua D. Drake wrote:
> 
> On 04/09/2013 09:21 AM, Stephen R. van den Berg wrote:
> 
> >-------------------------
> >
> >Looking at global/11787, doesn't reveal any obvious corruption.

> >The server was running with:
> >  synchronous_commit = off
> >  full_page_writes = off
> 
> full_page_writes = off is the problem.

Yea, and it can cause very hard to recover corruption, its not that you
only may loose some of the last transactions, in contrast to
synchronous_commit=off where you can loose the last transactions but
which never should cause corruption.

> From the docs:
> 
> Turning this parameter off speeds normal operation, but might lead to either
> unrecoverable data corruption, or silent data corruption, after a system
> failure. The risks are similar to turning off fsync, though smaller, and it
> should be turned off only based on the same circumstances recommended for
> that parameter.
> 
> http://www.postgresql.org/docs/9.0/static/runtime-config-wal.html#GUC-FULL-PAGE-WRITES

That was my first thought as well, but whilst it sure can cause
corruption, I can't immediately see how it should be responsible for
this error. That seems to indicate another problem.

Stephen, could you check how big global/11787 exactly is? Too bad we
don't know what that relfilenode corresponds to and we can't easily find
out what it maps to.

Afaik we don't have any debugging utility to dump the pg_filenode.map
contents?

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: page 1 of relation global/11787 was uninitialized

From
Tom Lane
Date:
"Stephen R. van den Berg" <srb@cuci.nl> writes:
> Just today one of my systems experienced a kernel panic, and halted abruptly.
> Running Linux 3.1.9, PostgreSQL 9.0.4 (Debian 9.0.4-1+b1, to be precise).

It's conceivable that updating to something more current than 9.0.4
would get you out of this --- we've fixed quite a number of WAL replay
bugs in the last two years.
        regards, tom lane



Re: page 1 of relation global/11787 was uninitialized

From
"Stephen R. van den Berg"
Date:
Tom Lane wrote:
>"Stephen R. van den Berg" <srb@cuci.nl> writes:
>> Just today one of my systems experienced a kernel panic, and halted abruptly.
>> Running Linux 3.1.9, PostgreSQL 9.0.4 (Debian 9.0.4-1+b1, to be precise).

>It's conceivable that updating to something more current than 9.0.4
>would get you out of this --- we've fixed quite a number of WAL replay
>bugs in the last two years.

I see that there is a 9.0.13, but that would be a source upgrade, since Debian
doesn't provide later than 9.0.4 AFAICS.
I suppose upgrading to 9.1.9 is not recommended in this state, or is it?
-- 
Stephen.

"Reality is merely an illusion,albeit a very persistent one."  --  Albert Einstein



Re: page 1 of relation global/11787 was uninitialized

From
"Stephen R. van den Berg"
Date:
Joshua D. Drake wrote:
>full_page_writes = off is the problem.

>Turning this parameter off speeds normal operation, but might lead to
>either unrecoverable data corruption, or silent data corruption,
>after a system failure. The risks are similar to turning off fsync,
>though smaller, and it should be turned off only based on the same
>circumstances recommended for that parameter.

Well, silent data corruption would have been "ok" (to a certain extent).
Unrecoverable data corruption would be ok too, if it would only imply
losing some of the more recent parts of the data (at least for this database).
-- 
Stephen.

"Reality is merely an illusion,albeit a very persistent one."  --  Albert Einstein



Re: page 1 of relation global/11787 was uninitialized

From
Andres Freund
Date:
On 2013-04-09 19:18:57 +0200, Stephen R. van den Berg wrote:
> Tom Lane wrote:
> >"Stephen R. van den Berg" <srb@cuci.nl> writes:
> >> Just today one of my systems experienced a kernel panic, and halted abruptly.
> >> Running Linux 3.1.9, PostgreSQL 9.0.4 (Debian 9.0.4-1+b1, to be precise).
> 
> >It's conceivable that updating to something more current than 9.0.4
> >would get you out of this --- we've fixed quite a number of WAL replay
> >bugs in the last two years.
> 
> I see that there is a 9.0.13, but that would be a source upgrade, since Debian
> doesn't provide later than 9.0.4 AFAICS.

Afaik debian has never shipped 9.0 in any stable release at all. You can
use the packages provided by the postgres community though, they are
compatible:
http://wiki.postgresql.org/wiki/Apt

> I suppose upgrading to 9.1.9 is not recommended in this state, or is
> it?

You can at least try, don't think it will make more problems than you
already have.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: page 1 of relation global/11787 was uninitialized

From
"Stephen R. van den Berg"
Date:
Andres Freund wrote:
>On 2013-04-09 18:21:20 +0200, Stephen R. van den Berg wrote:
>Thats an absolutely outdated version of 9.0. You shouldn't be running
>this in production.

Yes, well, it's one of those things.  It got installed with Debian
and the automatic upgrade didn't upgrade it further, they switched to
9.1 without an automatic upgrade path.

>Yea, and it can cause very hard to recover corruption, its not that you
>only may loose some of the last transactions, in contrast to
>synchronous_commit=off where you can loose the last transactions but
>which never should cause corruption.

Ok, I'll avoid this option in the future, and just run with
synchronous_commit=off for non-critical databases.

>Stephen, could you check how big global/11787 exactly is? Too bad we
>don't know what that relfilenode corresponds to and we can't easily find
>out what it maps to.

This is the filelist of the whole global directory (there is only one active
database on this server), the 11787 file is 16KB exactly:

-rw------- 1 postgres postgres  8192 May 30  2011 11601
-rw------- 1 postgres postgres 24576 May 30  2011 11601_fsm
-rw------- 1 postgres postgres  8192 May 30  2011 11601_vm
-rw------- 1 postgres postgres 16384 May 30  2011 11603
-rw------- 1 postgres postgres 16384 May 30  2011 11604
-rw------- 1 postgres postgres     0 May 30  2011 11770
-rw------- 1 postgres postgres     0 May 30  2011 11772
-rw------- 1 postgres postgres  8192 May 30  2011 11774
-rw------- 1 postgres postgres  8192 May 30  2011 11775
-rw------- 1 postgres postgres  8192 May 30  2011 11776
-rw------- 1 postgres postgres 24576 May 30  2011 11776_fsm
-rw------- 1 postgres postgres  8192 May 30  2011 11776_vm
-rw------- 1 postgres postgres 16384 May 30  2011 11778
-rw------- 1 postgres postgres 16384 May 30  2011 11779
-rw------- 1 postgres postgres  8192 May 30  2011 11780
-rw------- 1 postgres postgres 24576 May 30  2011 11780_fsm
-rw------- 1 postgres postgres  8192 May 30  2011 11780_vm
-rw------- 1 postgres postgres 16384 May 30  2011 11782
-rw------- 1 postgres postgres  8192 May 30  2011 11783
-rw------- 1 postgres postgres 16384 May 30  2011 11785
-rw------- 1 postgres postgres 16384 May 30  2011 11786
-rw------- 1 postgres postgres 16384 Apr  9 18:08 11787
-rw------- 1 postgres postgres 24576 Apr  9 13:09 11787_fsm
-rw------- 1 postgres postgres  8192 Apr  9 14:58 11787_vm
-rw------- 1 postgres postgres 40960 Apr  9 14:51 11789
-rw------- 1 postgres postgres 24576 Nov 25  2011 11789_fsm
-rw------- 1 postgres postgres 16384 Apr  9 14:51 11790
-rw------- 1 postgres postgres  8192 May 30  2011 11791
-rw------- 1 postgres postgres 24576 May 30  2011 11791_fsm
-rw------- 1 postgres postgres  8192 May 30  2011 11791_vm
-rw------- 1 postgres postgres     0 May 30  2011 11793
-rw------- 1 postgres postgres  8192 May 30  2011 11795
-rw------- 1 postgres postgres 16384 May 30  2011 11796
-rw------- 1 postgres postgres  8192 May 30  2011 11867
-rw------- 1 postgres postgres     0 May 30  2011 11869
-rw------- 1 postgres postgres  8192 May 30  2011 11871
-rw------- 1 postgres postgres 16384 May 30  2011 11872
-rw------- 1 postgres postgres 16384 May 30  2011 11873
-- 
Stephen.



Re: page 1 of relation global/11787 was uninitialized

From
Tom Lane
Date:
"Stephen R. van den Berg" <srb@cuci.nl> writes:
> Tom Lane wrote:
>> It's conceivable that updating to something more current than 9.0.4
>> would get you out of this --- we've fixed quite a number of WAL replay
>> bugs in the last two years.

> I see that there is a 9.0.13, but that would be a source upgrade, since Debian
> doesn't provide later than 9.0.4 AFAICS.

That's a bit hard to believe, especially given that 9.0.13 fixes a
rather major security bug.  Any packager that isn't offering something
more current than 9.0.4 is *seriously* misfeasant.

> I suppose upgrading to 9.1.9 is not recommended in this state, or is it?

You can't -- only a 9.0.x server will even try to start up against this
DB.
        regards, tom lane



Re: page 1 of relation global/11787 was uninitialized

From
Andres Freund
Date:
On 2013-04-09 13:34:41 -0400, Tom Lane wrote:
> "Stephen R. van den Berg" <srb@cuci.nl> writes:
> > Tom Lane wrote:
> >> It's conceivable that updating to something more current than 9.0.4
> >> would get you out of this --- we've fixed quite a number of WAL replay
> >> bugs in the last two years.
> 
> > I see that there is a 9.0.13, but that would be a source upgrade, since Debian
> > doesn't provide later than 9.0.4 AFAICS.
> 
> That's a bit hard to believe, especially given that 9.0.13 fixes a
> rather major security bug.  Any packager that isn't offering something
> more current than 9.0.4 is *seriously* misfeasant.

Debian simply has never shipped postgres 9.0 at all outside of its
unstable distribution. Afaik it didn't even enter testing once.. Its
pretty normal to drop support for packages from there imo.

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: page 1 of relation global/11787 was uninitialized

From
Andres Freund
Date:
On 2013-04-09 19:30:38 +0200, Andres Freund wrote:
> On 2013-04-09 19:18:57 +0200, Stephen R. van den Berg wrote:
> > Tom Lane wrote:
> > >"Stephen R. van den Berg" <srb@cuci.nl> writes:
> > >> Just today one of my systems experienced a kernel panic, and halted abruptly.
> > >> Running Linux 3.1.9, PostgreSQL 9.0.4 (Debian 9.0.4-1+b1, to be precise).
> > 
> > >It's conceivable that updating to something more current than 9.0.4
> > >would get you out of this --- we've fixed quite a number of WAL replay
> > >bugs in the last two years.
> > 
> > I see that there is a 9.0.13, but that would be a source upgrade, since Debian
> > doesn't provide later than 9.0.4 AFAICS.
> 
> Afaik debian has never shipped 9.0 in any stable release at all. You can
> use the packages provided by the postgres community though, they are
> compatible:
> http://wiki.postgresql.org/wiki/Apt
> 
> > I suppose upgrading to 9.1.9 is not recommended in this state, or is
> > it?
> 
> You can at least try, don't think it will make more problems than you
> already have.

Uh, I didn't read careful enough. I thought you were talking about
upgrading to the newest 9.0 release which *does* make sense.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: page 1 of relation global/11787 was uninitialized

From
"Stephen R. van den Berg"
Date:
Tom Lane wrote:
>"Stephen R. van den Berg" <srb@cuci.nl> writes:
>> Tom Lane wrote:
>>> It's conceivable that updating to something more current than 9.0.4
>>> would get you out of this --- we've fixed quite a number of WAL replay
>>> bugs in the last two years.

>> I see that there is a 9.0.13, but that would be a source upgrade, since Debian
>> doesn't provide later than 9.0.4 AFAICS.

>That's a bit hard to believe, especially given that 9.0.13 fixes a
>rather major security bug.  Any packager that isn't offering something
>more current than 9.0.4 is *seriously* misfeasant.

Well, it might be such that this was only in unstable at some point in time,
and by the time they wanted to include it in testing, they skipped 9.0
and went for 9.1 instead.

>> I suppose upgrading to 9.1.9 is not recommended in this state, or is it?

>You can't -- only a 9.0.x server will even try to start up against this
>DB.

Well, there is something like pg_upgradecluster, which probably will
attempt a binary conversion of the database, but I guess that in this
state that is hazardous at best.  Besides, the standard tool doesn't
seem to like tablespaces (and I'm using one).

I'll look into trying a 9.0.13 first.
-- 
Stephen.

"Reality is merely an illusion,albeit a very persistent one."  --  Albert Einstein



Re: page 1 of relation global/11787 was uninitialized

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> Afaik we don't have any debugging utility to dump the pg_filenode.map
> contents?

Hardly need one ... od -t d4 $PGDATA/global/pg_filenode.map
is readable enough, though it does leave you still having to
map the numeric OIDs back to names.  The OIDs will be stable though.
        regards, tom lane



Success (Re: page 1 of relation global/11787 was uninitialized)

From
"Stephen R. van den Berg"
Date:
Thanks, all of you, for all good advice and suggestions.

Andres Freund wrote:
>Afaik debian has never shipped 9.0 in any stable release at all. You can
>use the packages provided by the postgres community though, they are
>compatible:
>http://wiki.postgresql.org/wiki/Apt

I used this to upgrade to 9.0.13, which indeed is fully compatible with
Debian.  It started up like a charm.  I can upgrade it to 9.1.9 now
(while I'm at it).

For reference, the startup log now looked like this:
LOG:  database system was interrupted while in recovery at 2013-04-09 19:45:49 CESTFATAL:  the database system is
startingupHINT:  This probably means that some data is corrupted and you will have to use the last backup for
recovery.LOG: incomplete startup packetLOG:  database system was not properly shut down; automatic recovery in
progressLOG: redo starts at D/B0BAB734LOG:  record with zero length at D/B0D3D71CLOG:  redo done at D/B0D3D6E0LOG:
lastcompleted transaction was at log time 2013-04-09 15:02:24.848947+02LOG:  checkpoint starting: end-of-recovery
immediateLOG: checkpoint complete: wrote 423 buffers (11.8%); 0 transaction log file(s) added, 0 removed, 0 recycled;
write=0.003s, sync=11.419 s, total=11.601 s
 

So, Tom, thanks for fixing that WAL recovery code :-).
-- 
Stephen.



Re: Success (Re: page 1 of relation global/11787 was uninitialized)

From
Florian Pflug
Date:
On Apr9, 2013, at 19:56 , "Stephen R. van den Berg" <srb@cuci.nl> wrote:
> Andres Freund wrote:
>> Afaik debian has never shipped 9.0 in any stable release at all. You can
>> use the packages provided by the postgres community though, they are
>> compatible:
>> http://wiki.postgresql.org/wiki/Apt
> 
> I used this to upgrade to 9.0.13, which indeed is fully compatible with
> Debian.  It started up like a charm.  I can upgrade it to 9.1.9 now
> (while I'm at it).

Since you've been running with full_page_writes=off, I suggest you update
to 9.1 by dumping and reloading your data (i.e. pg_dumpall) instead of
via binary upgrades (i.e. pg_upgrade).

That way, you can be sure that your data is consistent (from postgres'
POV anyway) after the upgrade. pg_upgrade OTOH will only notice
inconsistencies in the system tables, since the data files of user tables
and indices are simply copied unchanged.

best regards,
Florian Pflug





Re: Success (Re: page 1 of relation global/11787 was uninitialized)

From
"Stephen R. van den Berg"
Date:
Florian Pflug wrote:
>> I used this to upgrade to 9.0.13, which indeed is fully compatible with
>> Debian.  It started up like a charm.  I can upgrade it to 9.1.9 now
>> (while I'm at it).

>Since you've been running with full_page_writes=off, I suggest you update
>to 9.1 by dumping and reloading your data (i.e. pg_dumpall) instead of
>via binary upgrades (i.e. pg_upgrade).

I need to do that anyway, since the pg_upgradecluster script of Debian
(or is it a native PostgreSQL script?) does not support tablespaces.

>That way, you can be sure that your data is consistent (from postgres'
>POV anyway) after the upgrade. pg_upgrade OTOH will only notice
>inconsistencies in the system tables, since the data files of user tables
>and indices are simply copied unchanged.

Good point.  Thanks.
-- 
Stephen.



Re: page 1 of relation global/11787 was uninitialized

From
Albe Laurenz
Date:
Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
>> Afaik we don't have any debugging utility to dump the pg_filenode.map
>> contents?
>
> Hardly need one ... od -t d4 $PGDATA/global/pg_filenode.map
> is readable enough, though it does leave you still having to
> map the numeric OIDs back to names.  The OIDs will be stable though.

Shouldn't that be "od -t u4 $PGDATA/global/pg_filenode.map"?

Yours,
Laurenz Albe