Thread: VACUUM PANIC: corrupted item pointer

VACUUM PANIC: corrupted item pointer

From
André Volpato
Date:
Hi,

We got a corrupted database server, and I need some advice on what else
can I do.

Specs:
- Postgres 8.3.3 @ Linux dbserver4 2.6.24-etchnhalf.1-686-bigmem
- 4 GB RAM
- PGDATA with 700GB (70GB used), software RAID-0 with 3 sata-2
(blame it on me, ugh...)

So, the steps till now:

1 - Vacuum analyze fails this night:
2008-10-17 02:36:30 BRST: PANIC:  corrupted item pointer: 44
2008-10-17 02:36:30 BRST: STATEMENT:  VACUUM ANALYZE;

2 - We found the corrupted table, but the data was ok:
# vacuumdb -vz matriz -t total
PANIC:  corrupted item pointer: 44

# psql matriz
matriz# select * from total;
OK

3. Tried postgres single
# pg_ctl stop
# postgres --single -D /postgres matriz
matriz# vacuum analyze
PANIC:  corrupted item pointer: 44

4. reindex
# postgres --single -D /postgres matriz
matriz# reindexdb
OK

5. Again, vacuum single
# postgres --single -D /postgres matriz
matriz# vacuum analyze
PANIC:  corrupted item pointer: 44

6. Tried to connect again, but...
# postgres --single -D /postgres matriz
I got that message when Postgres cant create the shared memory segment,
as if the OS has not enough shmmax. Weird.
# pg_ctl start
Again, same message complaining aboyt shared memory.

7. Reboot

8. Postgres is up, but...
# psql matriz
psql: FATAL:  could not read block 0 of relation 1664/0/2676: read only
0 of 8192 bytes

9. Reset xlog
# pg_resetxlog
OK

10. Same error
# psql matriz
psql: FATAL:  could not read block 0 of relation 1664/0/2676: read only
0 of 8192 bytes

Right now this is the message I got every time I try to connect, even in
single mode.
We tried to fsck the raid partition, but with no luck.

Hope there is something else I can do...

--

[]´s, ACV



Re: VACUUM PANIC: corrupted item pointer

From
André Volpato
Date:
André Volpato escreveu:
> 10. Same error
> # psql matriz
> psql: FATAL:  could not read block 0 of relation 1664/0/2676: read
> only 0 of 8192 bytes
>
> Right now this is the message I got every time I try to connect, even
> in single mode.
> We tried to fsck the raid partition, but with no luck.
>
> Hope there is something else I can do...
>

Digging a little bit more, I found that global/2676 has zero bytes, so
we restore global/ from backup.
Now the only databases we can log in the ones wich hasn´t been vacuumed
regulary, e.g. "postgres":

$ psql matriz
psql: FATAL:  cache lookup failed for database 87421438
$ psql volpato
psql: FATAL:  cache lookup failed for database 94680947
$ psql postgres
# \d
             List of relations
 Schema |      Name      | Type |  Owner
--------+----------------+------+----------
 public | pg_buffercache | view | postgres

Not sure if pg_buffercache should be there anyway ...

Ideas?

--

[]´s, ACV



Re: VACUUM PANIC: corrupted item pointer

From
Tom Lane
Date:
=?ISO-8859-1?Q?Andr=E9_Volpato?= <andre.volpato@ecomtecnologia.com.br> writes:
> 8. Postgres is up, but...
> # psql matriz
> psql: FATAL:  could not read block 0 of relation 1664/0/2676: read only
> 0 of 8192 bytes

Looks like your raid array is dying under you :-( ...

2676 is pg_authid_rolname_index, so you could probably get around that
particular problem by reindexing pg_authid in standalone mode.  (You
might need to use the -P option too, not sure.)  But the "corrupted item
pointer" complaint is heap corruption and no amount of reindexing will
fix it.

Since you seem to be able to select from the broken table without
problem, I'd suggest just copying its data into a new table and dropping
the broken one (or truncate it and copy the data back).

            regards, tom lane

Re: VACUUM PANIC: corrupted item pointer

From
André Volpato
Date:
Tom Lane escreveu:
> =?ISO-8859-1?Q?Andr=E9_Volpato?= <andre.volpato@ecomtecnologia.com.br> writes:
>
>> 8. Postgres is up, but...
>> # psql matriz
>> psql: FATAL:  could not read block 0 of relation 1664/0/2676: read only
>> 0 of 8192 bytes
>>
>
> Looks like your raid array is dying under you :-( ...
We hope it last more 3 weeks, new servers are coming...


> 2676 is pg_authid_rolname_index, so you could probably get around that
> particular problem by reindexing pg_authid in standalone mode.  (You
> might need to use the -P option too, not sure.)  But the "corrupted item
> pointer" complaint is heap corruption and no amount of reindexing will
> fix it.
> Since you seem to be able to select from the broken table without
> problem, I'd suggest just copying its data into a new table and dropping
> the broken one (or truncate it and copy the data back).
>

Not anymore... I cant even connect to any database since reboot (#7).
I tryed to restore global/ directory from backup, and I got
"cache lookup failed for database 87421438".

Now we ran initdb again, and start recreating dbs with backup data.


>             regards, tom lane
>

--

[]´s, ACV