Re: Allow WAL information to recover corrupted pg_controldata - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Allow WAL information to recover corrupted pg_controldata
Date
Msg-id CA+TgmoYacp03KD8w4i19G10QfqVXsSE8S0Df86FX77frHPP_uQ@mail.gmail.com
Whole thread Raw
In response to Re: Allow WAL information to recover corrupted pg_controldata  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Allow WAL information to recover corrupted pg_controldata
List pgsql-hackers
On Tue, Jun 19, 2012 at 1:43 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Amit Kapila <amit.kapila@huawei.com> writes:
>>> AFAIR you can create pg_control from scratch already with pg_resetxlog.
>>> The hard part is coming up with values for the counters, such as the
>>> next WAL location.  Some of them such as next OID are pretty harmless
>>> if you don't guess right, but I'm worried that wrong next WAL could
>>> make things worse not better.
>
>> I believe if WAL files are proper as mentioned in Alvaro's mail, the
>> purposed logic should generate correct values.
>
> I've got a problem with the assumption that, when pg_control is trash,
> megabytes or gigabytes of WAL can still be relied on completely.
>
> I'm almost inclined to suggest that we not get next-LSN from WAL, but
> by scanning all the pages in the main data store and computing the max
> observed LSN.  This is clearly not very attractive from a performance
> standpoint, but it would avoid the obvious failure mode where you lost
> some recent WAL segments along with pg_control.

I think it could be useful to have a tool that scans all the blocks
and computes that value, but I'd want it to just print the value out
and let me decide what to do about it.  There are cases where you
don't necessarily want to clobber pg_control, but you do have future
LSNs in your data file pages.  This can be either because the disk ate
your WAL, or because you didn't create recovery.conf, or because your
disk corrupted the LSNs on the data file pages.  I'd want a tool that
could be either run on an individual file, or recursively on a
directory.

In terms of the TODO item, I haven't yet heard anyone clearly state "I
wanted to use pg_controldata but it couldn't because X so therefore we
need this patch".  Alvaro mentioned the case where pg_control is
missing altogether, but:

[rhaas pgsql]$ rm ~/pgdata/global/pg_control
[rhaas pgsql]$ postgres
postgres: could not find the database system
Expected to find it in the directory "/Users/rhaas/pgdata",
but could not open file "/Users/rhaas/pgdata/global/pg_control": No
such file or directory
[rhaas pgsql]$ pg_resetxlog ~/pgdata
pg_resetxlog: could not open file "global/pg_control" for reading: No
such file or directory
If you are sure the data directory path is correct, execute touch global/pg_control
and try again.
[rhaas pgsql]$ touch ~/pgdata/global/pg_control
[rhaas pgsql]$ pg_resetxlog ~/pgdata
pg_resetxlog: pg_control exists but is broken or unknown version; ignoring it
Guessed pg_control values:

First log file ID after reset:        0
First log file segment after reset:   69
pg_control version number:            922
Catalog version number:               201206141
Database system identifier:           5755831325641078488
Latest checkpoint's TimeLineID:       1
Latest checkpoint's full_page_writes: off
Latest checkpoint's NextXID:          0/3
Latest checkpoint's NextOID:          10000
Latest checkpoint's NextMultiXactId:  1
Latest checkpoint's NextMultiOffset:  0
Latest checkpoint's oldestXID:        3
Latest checkpoint's oldestXID's DB:   0
Latest checkpoint's oldestActiveXID:  0
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value

If these values seem acceptable, use -f to force reset.
[rhaas pgsql]$ pg_resetxlog -f ~/pgdata
pg_resetxlog: pg_control exists but is broken or unknown version; ignoring it
Transaction log reset
[rhaas pgsql]$ postgres
LOG:  database system was shut down at 2012-06-19 15:25:28 EDT
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started

So I still don't understand what problem we're solving here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


pgsql-hackers by date:

Previous
From: Peter Geoghegan
Date:
Subject: Re: use of int4/int32 in C code
Next
From: Robert Haas
Date:
Subject: Re: pl/perl and utf-8 in sql_ascii databases