Thread: Corrupted index

Corrupted index

From
Akash Garg
Date:
During a vacuum, I ran into this error:
vacuumdb: vacuuming of database "friend" failed: ERROR:  invalid page
header in block 41661 of relation "friend_pkey"

I've read the posts on this newsgroup and it's clear that I have to
REINDEX to fix this.  The bigger question is -- why did this happen in
the first place?  I'm using open_sync as my WAL_SYNC_METHOD.  Is it
better to use fsync?

Running Suse 9.0 on a Quad Opteron, with direct attached storage.  The
RAID card is set to write through all of the writes, so I don't think
this is a case where the RAID card is lying about a commit.

-Akash

Re: Corrupted index

From
Tom Lane
Date:
Akash Garg <akash.garg@gmail.com> writes:
> During a vacuum, I ran into this error:
> vacuumdb: vacuuming of database "friend" failed: ERROR:  invalid page
> header in block 41661 of relation "friend_pkey"

> I've read the posts on this newsgroup and it's clear that I have to
> REINDEX to fix this.  The bigger question is -- why did this happen in
> the first place?

Good question.  If you haven't REINDEXed yet, it would be worthwhile
to get a hex dump (use "od -x") of that page of the index first.
Speculation with no data to look at is a bit pointless ...

            regards, tom lane

Re: Corrupted index

From
Tom Lane
Date:
Akash Garg <akash.garg@gmail.com> writes:
> I'll try running the od command -- I'm just a little confused on where
> I run it.  I tried running od -x 41661 but that doesn't yield any
> results.  I'm assuming I have to run this command on the actual index
> file itself -- how do I do this?

See contrib/oid2name, or read the docs at
http://developer.postgresql.org/docs/postgres/storage.html
to learn how to figure out which file is the index.  (In pre-8.0 PG,
ignore what those docs say about tablespaces, but the rest of the info
goes back a ways.)  The short answer is you need the database's OID
from pg_database, and the index's relfilenode from pg_class.

            regards, tom lane

Re: Corrupted index

From
Akash Garg
Date:
Tom,
I'll try running the od command -- I'm just a little confused on where
I run it.  I tried running od -x 41661 but that doesn't yield any
results.  I'm assuming I have to run this command on the actual index
file itself -- how do I do this?

Thanks,
Akash

On 6/23/05, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Akash Garg <akash.garg@gmail.com> writes:
> > During a vacuum, I ran into this error:
> > vacuumdb: vacuuming of database "friend" failed: ERROR:  invalid page
> > header in block 41661 of relation "friend_pkey"
>
> > I've read the posts on this newsgroup and it's clear that I have to
> > REINDEX to fix this.  The bigger question is -- why did this happen in
> > the first place?
>
> Good question.  If you haven't REINDEXed yet, it would be worthwhile
> to get a hex dump (use "od -x") of that page of the index first.
> Speculation with no data to look at is a bit pointless ...
>
>                        regards, tom lane
>

Re: Corrupted index

From
Akash Garg
Date:
I used oid2name to find the index files:
168807081
168807081.1
168807081.2
168807081.3
168807081.4

Now how do I run the od command to find the block in question?

Thanks,
Akash

On 6/23/05, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Akash Garg <akash.garg@gmail.com> writes:
> > I'll try running the od command -- I'm just a little confused on where
> > I run it.  I tried running od -x 41661 but that doesn't yield any
> > results.  I'm assuming I have to run this command on the actual index
> > file itself -- how do I do this?
>
> See contrib/oid2name, or read the docs at
> http://developer.postgresql.org/docs/postgres/storage.html
> to learn how to figure out which file is the index.  (In pre-8.0 PG,
> ignore what those docs say about tablespaces, but the rest of the info
> goes back a ways.)  The short answer is you need the database's OID
> from pg_database, and the index's relfilenode from pg_class.
>
>                        regards, tom lane
>

Re: Corrupted index

From
Tom Lane
Date:
Akash Garg <akash.garg@gmail.com> writes:
> I used oid2name to find the index files:
> 168807081
> 168807081.1
> 168807081.2
> 168807081.3
> 168807081.4

> Now how do I run the od command to find the block in question?

Rather than doing the math by hand, let dd do it:

dd bs=8k skip=41661 count=1 </path/to/indexfile | od -x

This assumes your PG was built with the default 8k block size, but
if it wasn't you'd presumably know that ...

            regards, tom lane

Re: Corrupted index

From
Tom Lane
Date:
Akash Garg <akash.garg@gmail.com> writes:
> Ok, I ran that command on the index files -- they are attached.

I'm a bit confused --- you mean you extracted block 41661 from each of
the index's segments?  If so, only the first one is actually relevant
here.

> I notice that in file2, file2 and file3, I notice a pattern like this:
> 0002660 8980 0020 8970 0020 8960 0020 8950 0020
> 0002700 8940 0020 8930 0020 8920 0020 8910 0020
> 0002720 0900 0020 0000 0000 0000 0000 0000 0000
> 0002740 0000 0000 0000 0000 0000 0000 0000 0000
> *
> 0004400 000f 611c 003e 0010 70d9 0b69 0000 0000
> 0004420 000f 5f97 0003 0010 70d8 0b69 0000 0000
> 0004440 000f 6118 0077 0010 70d7 0b69 0000 0000
> 0004460 000f 6118 0075 0010 70d6 0b69 0000 0000

That's what it should look like --- "*" is od's notation for "more of
the same", in this case lines containing all zeroes.  Those pages look
exactly like what I'd expect a PG index page to look like.

The file1 extract, however, is pure text and not PG data of any kind.
The first part of it looks like

00000000: 732c 205c 725c 6e77 6861 7420 646f 6573  s, \r\nwhat does
00000010: 2061 2068 756d 616e 6974 6172 6961 6e20   a humanitarian
00000020: 6561 743f 205c 725c 6e5c 725c 6e53 6f6d  eat? \r\n\r\nSom
00000030: 6574 696d 6573 2c20 4920 7468 696e 6b20  etimes, I think
00000040: 616c 6c20 7468 6520 666f 6c6b 7320 7768  all the folks wh
00000050: 6f20 6772 6577 2075 7020 7370 6561 6b69  o grew up speaki
00000060: 6e67 2045 6e67 6c69 7368 205c 725c 6e73  ng English \r\ns
00000070: 686f 756c 6420 6265 2063 6f6d 6d69 7474  hould be committ
00000080: 6564 2074 6f20 616e 2061 7379 6c75 6d20  ed to an asylum
00000090: 666f 7220 7468 6520 7665 7262 616c 6c79  for the verbally
000000a0: 2069 6e73 616e 652e 205c 725c 6e5c 725c   insane. \r\n\r\
000000b0: 6e49 6e20 7768 6174 206f 7468 6572 206c  nIn what other l

and it goes downhill from there (something out of a spam folder maybe?)

What it looks like to me is that a page of an entirely unrelated file
got dropped into the Postgres index.  This suggests either a disk drive
error (writing someplace else than it was commanded to) or a kernel
bug (writing the wrong buffer).  I'd suggest some disk-drive testing
as well as checking into bug fixes for your kernel.  I'm pretty well
convinced that this wasn't Postgres' fault.

            regards, tom lane