Thread: Database Corruption - last chance recovery options?

Database Corruption - last chance recovery options?

From
Michael Best
Date:
Had some database corruption problems today.  Since they came on the
heels of making some minor database changes yesterday, they may or may
not be related to that.  CentOS 4.x, PostgreSQL 8.1.4.

I modified the following settings and then issued a reload.  I hadn't
raised kernel.shmmax to allow for these bigger memory settings, but the
database continued to run fine.

shared_buffers = 10000
work_mem = 2048
autovacuum = on                        # enable autovacuum subprocess?
autovacuum_naptime = 60                # time between autovacuum runs, in secs
autovacuum_vacuum_threshold = 1000     # min # of tuple updates before vacuum
autovacuum_analyze_threshold = 500     # min # of tuple updates before analyze
autovacuum_vacuum_scale_factor = 0.4   # fraction of rel size before vacuum
autovacuum_analyze_scale_factor = 0.2  # fraction of rel size before analyze
autovacuum_vacuum_cost_delay = -1      # default vacuum cost delay for autovacuum (-1 means use vacuum_cost_delay)
autovacuum_vacuum_cost_limit = -1      # default vacuum cost limit for autovacuum (-1 means use vacuum_cost_limit)
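
For context, shared_buffers = 10000 works out to roughly 10000 x 8 kB of
shared memory, plus overhead, and whether the kernel will grant a segment
that large depends on kernel.shmmax.  A minimal way to compare the two
before restarting might look like this (the limit shown is only illustrative):

# current kernel limit on a single shared memory segment, in bytes
$ sysctl kernel.shmmax
kernel.shmmax = 33554432

# rough size of the segment the new shared_buffers would need
$ echo $((10000 * 8192))
81920000

If the second number is already bigger than the limit, a restart with the
new settings will fail with the same kind of shmget error that shows up
later in this thread.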

After reloading I ran vacuumdb -a -z a number of times, and the runs
completed successfully.

Sometime after the vacuum or reload, one of our clients started to have
database problems.  The other databases in the same PostgreSQL
installation on the server don't seem to be affected.

When I finally got the error report in the morning the database was in
this state:

$ psql dbname

dbname=# \dt
ERROR:  cache lookup failed for relation 20884

Doing a select * from pg_tables seemed to indicate that some of the
tables were no longer in the database; some other tables were also
inaccessible.
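
When an error names a relation OID like that, one way to tell whether the
pg_class row itself is gone or merely unreachable through a damaged catalog
index is to look it up with index scans disabled, so the query has to read
the heap directly.  A sketch, using the OID from the error above:

$ psql dbname
dbname=# SET enable_indexscan = off;
dbname=# SET enable_bitmapscan = off;
dbname=# SELECT oid, relname, relkind FROM pg_class WHERE oid = 20884;

If the row shows up here but \dt still fails, the damage is more likely in
the system indexes, which would be consistent with the REINDEX SYSTEM below
helping.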

I made a backup, and then some functionality was restored by issuing a
REINDEX SYSTEM dbname.
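
For anyone following along, that command is run from an ordinary superuser
session connected to the affected database; a minimal sketch (after taking
a file-level copy of the data directory first):

$ psql dbname
dbname=# REINDEX SYSTEM dbname;

If the catalog indexes are too badly damaged for a normal session to get
that far, the documentation also describes starting a standalone backend
with -P (ignore system indexes) and reindexing from there.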

Using the "broken database" pg_dump on all tables in pg_table gives this
for some tables:

pg_dump: SQL command failed
pg_dump: Error message from server: ERROR:  cache lookup failed for relation 20871
pg_dump: The command was: SELECT tableoid, oid, conname, pg_catalog.pg_get_constraintdef(oid) as condef FROM pg_catalog.pg_constraint WHERE conrelid = '20876'::pg_catalog.oid AND contype = 'f'
pg_dump: SQL command failed
pg_dump: Error message from server: ERROR:  relation "public.auth_groups_permissions" does not exist
pg_dump: The command was: LOCK TABLE public.auth_groups_permissions IN ACCESS SHARE MODE
pg_dump: SQL command failed
pg_dump: Error message from server: ERROR:  relation with OID 21186 does not exist
pg_dump: The command was: LOCK TABLE public.ght_ght_shippingorders IN ACCESS SHARE MODE

Our backups failed, of course, and we have made a good attempt at
recovery, which we are willing to accept as the best we can do.
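
In the same situation, one salvage pass that sometimes helps is dumping each
table separately, so a single broken table doesn't abort the entire dump.
A rough sketch (the schema filter and file names are just illustrative):

$ psql -At -c "SELECT tablename FROM pg_tables WHERE schemaname = 'public'" dbname > tables.txt
$ while read t; do pg_dump -t "$t" dbname > "salvage_$t.sql" || echo "FAILED: $t"; done < tables.txt

Whatever dumps cleanly can be restored into a fresh database, and the
FAILED list shows which tables need some other recovery attempt.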

As a last chance to get some of the data back, I would be interested to
know whether there is any way to read through the raw database files to
recover some more data from them.
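
On the raw-file question: each table's heap is an ordinary file under the
data directory, at base/<database OID>/<relfilenode>, so as long as the
table's pg_class row is still readable you can at least locate the file and
copy it somewhere safe.  A sketch, with the path and table name purely
illustrative:

$ psql -At -c "SELECT oid FROM pg_database WHERE datname = 'dbname'" dbname
$ psql -At -c "SELECT relfilenode FROM pg_class WHERE relname = 'sometable'" dbname
# substitute the two numbers into the path; tables over 1 GB have .1, .2, ... segments
$ ls -l /var/lib/pgsql/data/base/<database_oid>/<relfilenode>*

Tools along the lines of pg_filedump can then pick tuple data out of such
files, but that's very much a last resort compared to anything pg_dump can
still extract.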

I'm also curious whether the settings change or the reload caused this
problem, or perhaps the vacuum or autovacuum is what triggered it.  I'll
file a bug report if it turns out to be repeatable.

Oh, and make backups.

-Mike

Re: Database Corruption - last chance recovery options?

From
"Thomas F. O'Connell"
Date:
On Jan 4, 2007, at 11:24 PM, Michael Best wrote:

> When I finally got the error report in the morning the database was
> in this state:
>
> $ psql dbname
>
> dbname=# \dt
> ERROR:  cache lookup failed for relation 20884

Do you have your error logs, and were there any relevant errors in
them preceding your inability to get a table listing via psql?

--
Thomas F. O'Connell

optimizing modern web applications
: for search engines, for usability, and for performance :

http://o.ptimized.com/
615-260-0005

Re: Database Corruption - last chance recovery options?

From
Michael Best
Date:
Thomas F. O'Connell wrote:
>
> On Jan 4, 2007, at 11:24 PM, Michael Best wrote:
>
>> When I finally got the error report in the morning the database was in
>> this state:
>>
>> $ psql dbname
>>
>> dbname=# \dt
>> ERROR:  cache lookup failed for relation 20884
>
> Do you have your error logs, and were there any relevant errors in them
> preceding your inability to get a table listing via psql?

From pgstartup.log I have the following, but I think these were from
when I actually restarted the database the next day after finding out
about the corruption.

FATAL:  could not create shared memory segment: Invalid argument
DETAIL:  Failed system call was shmget(key=5432001, size=86441984, 03600).

Set your memory requirement too high in postgresql.conf, reload instead
of restarting the database, it silently fails sometime later?  Just a
theory; as far as I know, those errors were probably just from failed
startup attempts when I was rebooting the server after seeing the data
corruption.
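
For reference, that shmget failure is the kernel refusing to create an
86441984-byte segment because it exceeds kernel.shmmax.  On a CentOS 4 box
the usual fix is something like this, run as root (the value here is just
an example comfortably above the requested size):

# raise the per-segment limit for the running kernel
$ sysctl -w kernel.shmmax=134217728

# and make it survive reboots
$ echo "kernel.shmmax = 134217728" >> /etc/sysctl.conf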

From Wednesday, when the problems started occurring, I have a SIGHUP,
which I assume is the reload, and then a number of errors such as:
LOG:  received SIGHUP, reloading configuration files
LOG:  transaction ID wrap limit is 1074563238, limited by database
"template1"

The last line is repeated 34 times before I start seeing things like:
ERROR:  relation "auth_users" does not exist

On Thursday, after the corruption, there are more "cache lookup failed"
log entries and more "relation does not exist" errors, as well as some
lines that say "dbname" instead of "template1":

LOG:  transaction ID wrap limit is 1074563466, limited by database "dbname"

Interspersed with all these entries are NOTICEs from ongoing development
on a different database on the same server (I should talk to them about
insanely long names): 236 of these NOTICEs on the day of the problems
and about 638 since last Saturday.

NOTICE:  identifier
"docmanager_document__document_type__workflow__workflowstatehasaction"
will be truncated to
"docmanager_document__document_type__workflow__workflowstatehasa"
NOTICE:  identifier
"m2m_docmanager_documenttype__workflow__workflowstatehasaction__permitted_
roles" will be truncated to
"m2m_docmanager_documenttype__workflow__workflowstatehasaction__"

That's all I have.  No timestamps on the logs means I have to guess at
some of the timing as well.

-Mike

Re: Database Corruption - last chance recovery options?

From
Tom Lane
Date:
Michael Best <mbest@pendragon.org> writes:
> Set your memory requirement too high in postgresql.conf, reload instead
> of restarting the database, it silently fails sometime later?

Yeah, wouldn't surprise me, since the reload is going to ignore any
changes related to resizing shared memory.  I think that 8.2 might warn
you that it was ignoring the un-applyable changes, but the warnings
would only go to the postmaster log, where they're easily missed :-(

            regards, tom lane
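
A practical corollary of that: SHOW reports the value the server is
actually running with, not what postgresql.conf currently says, so checking
it right after a reload is a quick way to notice that a change such as
shared_buffers is still waiting on a restart.  A minimal check:

$ psql -c "SHOW shared_buffers" dbname

If it still reports the old value after the reload, the edit hasn't taken
effect and will only be applied (or rejected by the kernel) at the next
restart.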

Re: Database Corruption - last chance recovery options?

From
"Thomas F. O'Connell"
Date:
On Jan 5, 2007, at 10:01 PM, Tom Lane wrote:

> Michael Best <mbest@pendragon.org> writes:
>> Set your memory requirement too high in postgresql.conf, reload
>> instead
>> of restarting the database, it silently fails sometime later?
>
> Yeah, wouldn't surprise me, since the reload is going to ignore any
> changes related to resizing shared memory.  I think that 8.2 might
> warn
> you that it was ignoring the un-applyable changes, but the warnings
> would only go to the postmaster log, where they're easily missed :-(

Wait, now I'm curious. If a change in postgresql.conf that requires a
restart doesn't take effect on reload, then how could a related
failure manifest at all, regardless of when?

--
Thomas F. O'Connell

optimizing modern web applications
: for search engines, for usability, and for performance :

http://o.ptimized.com/
615-260-0005

Re: Database Corruption - last chance recovery options?

From
Tom Lane
Date:
"Thomas F. O'Connell" <tf@o.ptimized.com> writes:
>> Michael Best <mbest@pendragon.org> writes:
>>> Set your memory requirement too high in postgresql.conf, reload
>>> instead of restarting the database, it silently fails sometime later?

> Wait, now I'm curious. If a change in postgresql.conf that requires a
> restart doesn't take effect on reload, then how could a related
> failure manifest at all, regardless of when?

The point is that when you eventually shut down and restart the
postmaster, it'd fail then ... perhaps long after you've forgotten
about your configuration changes, so it might be a bit mystifying.

            regards, tom lane