Thread: ERROR: catalog is missing 9 attribute(s) for relid 10297

ERROR: catalog is missing 9 attribute(s) for relid 10297

From
"O'Shea, Brendan"
Date:
We have run into a situation where our postgresql 8.2.5 database appears
to be corrupt and we are no longer able to run pg_dump.  We don't know
what is causing the data corruption issues and unfortunately this is one
of those cases where we cannot reliably reproduce the problem (so no
test case can be provided).

The error message output from running pg_dump in verbose mode is:


pg_dump: reading schemas
pg_dump: SQL command failed
pg_dump: Error message from server: ERROR:  catalog is missing 9
attribute(s) for relid 10297
pg_dump: The command was: SELECT tableoid, oid, nspname, (SELECT rolname
FROM pg_catalog.pg_roles WHERE oid = nspowner) as rolname, nspacl FROM
pg_namespace
pg_dump: *** aborted because of error


Querying 'pg_catalog.pg_roles' directly also produces the error.  The
pg_roles object is simply a view of pg_authid that blanks out the
password field.  Querying the pg_authid table directly does not produce
the error.

Our server setup is:
Linux 2.4
Postgresql 8.2.5

Our application is fairly transaction heavy, about 50-200 updates per
second on a few small tables, and we have around 40 or so client
connections querying the database.

Checking the postgresql server log file did not provide any additional
useful information beyond what pg_dump provided.

The first time we ran into this problem on 2007-12-16 we decided that
since the database system objects appeared to be corrupt we wiped the
server clean and started over.  We also checked for hardware issues but
none were found.  We checked for (disk errors, memory errors, fs errors)
as well as (overheating drives, intermittent drive slowness not reported
by the kernel, drive read prediction failures, errors that only surface
under extreme load).  We were not able to test (overheating CPU, bad
power supply (bad voltage)).

A few weeks later on Dec 25th the same exact error cropped up again
"catalog is missing 9 attribute(s) for relid 10297".

My questions are:

1) Does this error message generally indicate that the database system
objects are corrupt?
2) To resolve this type of issue what would be the recommended course of
action?  In our case we decided that the database was a lost cause and
we started over (unfortunately only to run into the same issue 2 weeks
later).
3) Does anyone have any suggestions for additional logging or monitoring
we could turn on to try and determine the root cause of this issue?


Thanks for any help,
Brendan

Re: ERROR: catalog is missing 9 attribute(s) for relid 10297

From
Tom Lane
Date:
"O'Shea, Brendan" <boshea@akamai.com> writes:
> pg_dump: reading schemas
> pg_dump: SQL command failed
> pg_dump: Error message from server: ERROR:  catalog is missing 9
> attribute(s) for relid 10297

Ugh.  Does it work if you do

    export PGOPTIONS="--ignore_system_indexes=1"

first?  If so, the problem would evidently be corruption of the indexes
for pg_attribute, specifically pg_attribute_relid_attnum_index.  It
seems pretty suspicious that lightning would strike twice in the same
place ...

            regards, tom lane

Re: ERROR: catalog is missing 9 attribute(s) for relid 10297

From
Tom Lane
Date:
"O'Shea, Brendan" <boshea@akamai.com> writes:
>> "Lane, Tom" <tgl@sss.pgh.pa.us> writes:
>> Ugh.  Does it work if you do
>> export PGOPTIONS="--ignore_system_indexes=1"

> I tried that, but unfortunately pg_dump still fails to run and the error
> message is identical to previous attempts.

Huh.  So it's not index corruption then.  Table corruption is still a
possibility but it seems likely that you'd be getting other errors
during the seqscan that looks for the rows.  This suggests that the rows
actually disappeared from pg_attribute, which is a bit hard to credit.
The only mechanisms I can think of are that VACUUM decided they were
dead or something physically truncated the table.  The latter would
probably have zapped a lot of other rows though.

Have you checked to see if pg_roles is the only relation with this
problem?  Try
select c.relname from pg_class c left join pg_attribute a
on a.attrelid = c.oid and a.attnum > 0
group by c.oid,c.relname,c.relnatts having count(*) != c.relnatts;

> Any suggestions for additional logging we might turn on to help
> determine the cause of this issue?

Maybe VACUUM VERBOSE on pg_attribute?  Although you'd have to get lucky
enough to catch the time that it zapped the rows, if that's what the
problem is.

            regards, tom lane

Re: ERROR: catalog is missing 9 attribute(s) for relid 10297

From
"O'Shea, Brendan"
Date:
"Lane, Tom" <tgl@sss.pgh.pa.us> writes:>
> "O'Shea, Brendan" <boshea@akamai.com> writes:
> >> "Lane, Tom" <tgl@sss.pgh.pa.us> writes:
> >> Ugh.  Does it work if you do
> >> export PGOPTIONS="--ignore_system_indexes=1"
>
> > I tried that, but unfortunately pg_dump still fails to run
> and the error
> > message is identical to previous attempts.
>
> Huh.  So it's not index corruption then.  Table corruption is still a
> possibility but it seems likely that you'd be getting other errors
> during the seqscan that looks for the rows.  This suggests
> that the rows
> actually disappeared from pg_attribute, which is a bit hard to credit.
> The only mechanisms I can think of are that VACUUM decided they were
> dead or something physically truncated the table.  The latter would
> probably have zapped a lot of other rows though.
>

The auto vacuum daemon is running pretty frequently on the database
since one of our tables is very heavily updated, about 50-200 per
second, and the rows themselves are large, about 1k in size on average
and about 7000 rows in the table.  We also run a nightly 'vacuum full'
operation on this table since it can sometimes grow fairly large (around
50-100MB or more) if there are any long running transactions that last
20-30 minutes or more.  Not sure if this is related but thought it was
worth a mention.

> Have you checked to see if pg_roles is the only relation with this
> problem?  Try
> select c.relname from pg_class c left join pg_attribute a
> on a.attrelid = c.oid and a.attnum > 0
> group by c.oid,c.relname,c.relnatts having count(*) != c.relnatts;
>

This SQL produces the output:
"pg_roles"
"pg_group"
"pg_rules"
"pg_user"
"pg_shadow"

Querying each table produces the error:
"pg_roles"  => ERROR:  catalog is missing 9 attribute(s) for relid 10297
"pg_group"  => ERROR:  catalog is missing 3 attribute(s) for relid 10303
"pg_rules"  => ERROR:  invalid attribute number 0 for pg_rules
"pg_user"   => ERROR:  catalog is missing 8 attribute(s) for relid 10306
"pg_shadow" => ERROR:  catalog is missing 8 attribute(s) for relid 10300

> > Any suggestions for additional logging we might turn on to help
> > determine the cause of this issue?
>
> Maybe VACUUM VERBOSE on pg_attribute?  Although you'd have to
> get lucky
> enough to catch the time that it zapped the rows, if that's what the
> problem is.
>

So are you saying that on our newly installed system we could
periodically run "VACUUM VERBOSE pg_attribute", append the output to a
log file, and then after the catalog error starts happening again we
could go to this log file to look for potential clues?

Running "VACUUM VERBOSE pg_attribute" now produces:

INFO:  vacuuming "pg_catalog.pg_attribute"
INFO:  scanned index "pg_attribute_relid_attnam_index" to remove 30 row
versions
DETAIL:  CPU 0.00s/0.00u sec elapsed 0.03 sec.
INFO:  scanned index "pg_attribute_relid_attnum_index" to remove 30 row
versions
DETAIL:  CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  "pg_attribute": removed 30 row versions in 1 pages
DETAIL:  CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  index "pg_attribute_relid_attnam_index" now contains 2670 row
versions in 48 pages
DETAIL:  30 index row versions were removed.
11 index pages have been deleted, 11 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  index "pg_attribute_relid_attnum_index" now contains 2670 row
versions in 13 pages
DETAIL:  30 index row versions were removed.
2 index pages have been deleted, 2 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  "pg_attribute": found 30 removable, 2670 nonremovable row
versions in 62 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 965 unused item pointers.
20 pages contain useful free space.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.04 sec.

Query returned successfully with no result in 500 ms.


Thanks for your help,
Brendan

Re: ERROR: catalog is missing 9 attribute(s) for relid 10297

From
"O'Shea, Brendan"
Date:
> "Lane, Tom" <tgl@sss.pgh.pa.us> writes:
>
> Ugh.  Does it work if you do
>
>     export PGOPTIONS="--ignore_system_indexes=1"
>
> first?  If so, the problem would evidently be corruption of
> the indexes for pg_attribute, specifically
> pg_attribute_relid_attnum_index.  It seems pretty suspicious
> that lightning would strike twice in the same place ...
>
>             regards, tom lane
>

I tried that, but unfortunately pg_dump still fails to run and the error
message is identical to previous attempts.

Any suggestions for additional logging we might turn on to help
determine the cause of this issue?  Our current logging setup uses the
default postgresql.conf setup with the following changes to the logging
section:

  log_line_prefix='%t %p %u %h '
  log_connections=on

Thanks,
Brendan