Re: emergency outage requiring database restart - Mailing list pgsql-hackers

From Merlin Moncure
Subject Re: emergency outage requiring database restart
Date
Msg-id CAHyXU0zzgA6x0jOceWOFxoUqN=riEd6nijetH-ktzzwti5Rz5g@mail.gmail.com
In response to Re: emergency outage requiring database restart  (Merlin Moncure <mmoncure@gmail.com>)
Responses Re: emergency outage requiring database restart  (Alvaro Herrera <alvherre@2ndquadrant.com>)
List pgsql-hackers
On Fri, Oct 21, 2016 at 1:37 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Fri, Oct 21, 2016 at 8:03 AM, Kevin Grittner <kgrittn@gmail.com> wrote:
>> On Tue, Oct 18, 2016 at 8:45 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
>>
>>> Most or all the damage seemed to be to the system catalogs with
>>> at least two critical tables dropped or inaccessible in some
>>> fashion.  A lot of the OIDs seemed to be pointing at the wrong
>>> thing.
>>
>> While the oid in pg_class often matches the filename, that is not
>> true after some operations (like CLUSTER or VACUUM FULL).  It is
>> the relfilenode column that is the definitive link to the file.
>
> no such operations happened.  In the first instance at least one table
> dropped from the system catalogs.   I have a hunch that the heap is
> fine (supported by the size of the database on disk).   At this
> precise moment I'm restoring the database to another fileserver in
> order to do some forensic analysis, also in the hopes of getting the
> second database online in order to expedite recovery.
>
> ah -- done. :-)  deleting the init file didn't help, but starting up
> single user allowed the start up to gracefully fail with a FATAL cache
> lookup.

OK, I have some good (very good, in the specific case of yours truly)
news to report.  Working from a filesystem-level copy on a test server,
I was able to relfilenode-swap one of the critical tables into place
over the relfilenode of the stored backup.  Not being able to look up
which file to copy from, I figured out the source node by judging the
file size and using the 'strings' utility.  Data recovery for that table
at least appears to be 100%.
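
For anyone wanting to repeat the "size plus strings" search, here is a
rough sketch of the idea in Python.  It is not the exact procedure used
above: the function names, the size window, and the marker strings are
all hypothetical, and it assumes you know a few short text values that
were stored uncompressed in the lost table.

```python
import os
import re

# Runs of 4+ printable ASCII bytes, roughly what strings(1) reports by default.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")


def extract_strings(path, limit_bytes=1 << 20):
    """Return printable ASCII runs found in the first limit_bytes of a file."""
    with open(path, "rb") as f:
        data = f.read(limit_bytes)
    return [m.group().decode("ascii") for m in PRINTABLE_RUN.finditer(data)]


def find_candidates(db_dir, min_size, max_size, markers):
    """List (filename, size) for files in db_dir whose size falls within
    [min_size, max_size] and whose contents include every marker string.

    db_dir is assumed to be a copy of one database's directory; markers
    are values you expect to find in the table being hunted for.
    """
    hits = []
    for name in sorted(os.listdir(db_dir)):
        path = os.path.join(db_dir, name)
        if not os.path.isfile(path):
            continue
        size = os.path.getsize(path)
        if not (min_size <= size <= max_size):
            continue
        text = "\n".join(extract_strings(path))
        if all(marker in text for marker in markers):
            hits.append((name, size))
    return hits


# Hypothetical usage against a copied data directory:
#   find_candidates("/path/to/copy/base/<db oid>", 50 << 20, 80 << 20,
#                   ["some known value"])
```

Run it only against a copy, never the live data directory.  It scans
just the first megabyte of each file, so raise limit_bytes if the known
values might sit deeper in the table.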

For those following along, this simple process is only likely to work
easily if the table contains only system types: no user types, enums,
composites, etc., since those are assigned a new OID on each restore.

merlin


