Thread: corrupted item pointer:???

corrupted item pointer:???

From

"hubert depesz lubaczewski"

Date:

13 April 2006, 06:35:57

hi.
basic information:
machine: desktop computer, with sata hard drive, no bad blocks. 2g ram. AMD Sempron(tm) Processor 2600+
system: linux debian testing, using 2.6.11 kernel.
postgresql: 8.1.3 compiled by hand using:
./configure \
        --prefix=/home/pgdba/work \
        --without-debug \
        --disable-debug \
        --with-pgport=5810 \
        --with-tcl \
        --with-perl \
        --with-python \
        --without-krb4 \
        --without-krb5 \
        --without-pam \
        --without-rendezvous \
        --with-openssl \
        --with-readline \
        --with-zlib \
        --with-gnu-ld

version() -> PostgreSQL 8.1.3 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.3.5 (Debian 1:3.3.5-13)

on this machine i have copy of 40G database from production servers.
i was testing migration to hstore ( http://www.sai.msu.su/~megera/postgres/gist/hstore/README.hstore)

at some point of the tests it failed saying corrupted item pointer.

migration was done using:
LOOP:
   select data from test_table where primary_key = constant;
   select data from secondary_table where unique_key = constant;
   update test_table set hstore_field = some_value where primary_key = constant;
REPEAT FOR ALL items;
this loop was made in external code (perl, using dbi + dbd::pg).
we did about 37 such loops every second.
everything went fine.
then (every 30 minutes) i ran vacuum to reclaim space from test_table.
during one of these vacuums it panicked, showing me the aforementioned error and killing all connections.

is there any way i can check what went wrong?
i don't need to recover the data.
i just need to know whether the problem is hstore-related, hardware-related or just a random thing happening because of nothing.

i tried:
hand-vacuum
vacuum analyze
vacuum full analyze
reindex the table
another vacuum
none of this worked.
i still get the corrupted item pointer message.

any clues on how can i check what went wrong?

depesz

Re: corrupted item pointer:???

From

Richard Huxton

Date:

13 April 2006, 07:25:42

hubert depesz lubaczewski wrote:
> hi.
> basic information:
> machine: desktop computer, with sata hard drive, no bad blocks. 2g
> ram.AMD Sempron(tm) Processor 2600+
> system: linux debian testing, using 2.6.11 kernel.
> postgresql: 8.1.3 compiled by hand using:
...
>
> version() -> PostgreSQL 8.1.3 on i686-pc-linux-gnu, compiled by GCC gcc
> (GCC) 3.3.5 (Debian 1:3.3.5-13)

OK - nothing unusual there.

> on this machine i have copy of 40G database from production servers.
> i was testing migration to hstore (
> http://www.sai.msu.su/~megera/postgres/gist/hstore/README.hstore)
>
> at some point of the tests it failed saying corrupted item pointer.

> is there any way i can check what went wrong?
> i don't need to recover the data.
> i just need to know whether the problem is hstore-related,
> hardware-related or just random thing happening because of nothing.

Hmm - I believe that means a data/index block was corrupted.

Have you seen any crashes, or hardware-related errors in your logs?

What are your config settings, particularly the first three here:
   http://www.postgresql.org/docs/8.1/static/runtime-config-wal.html
fsync, wal_sync_method, full_page_writes

--
   Richard Huxton
   Archonet Ltd

Re: corrupted item pointer:???

From

"hubert depesz lubaczewski"

Date:

13 April 2006, 07:42:39

On 4/13/06, Richard Huxton <dev@archonet.com> wrote:

> Hmm - I believe that means a data/index block was corrupted.

indices were recreated (reindex table), so i think this is a data-related problem.

> Have you seen any crashes, or hardware-related errors in your logs?

nope. uptime is over 40 days.

the machine is not used for anything else so i can't tell anything, but i didn't see any problems with it.

> What are your config settings, particularly the first three here:
> http://www.postgresql.org/docs/8.1/static/runtime-config-wal.html
> fsync, wal_sync_method, full_page_writes

sure:
irr=# show fsync;
fsync
-------
on
(1 row)

irr=# show wal_sync_method;
wal_sync_method
-----------------
fdatasync
(1 row)

irr=# show full_page_writes;
full_page_writes
------------------
on
(1 row)

depesz

Re: corrupted item pointer:???

From

Richard Huxton

Date:

13 April 2006, 11:44:13

hubert depesz lubaczewski wrote:
> On 4/13/06, Richard Huxton <dev@archonet.com> wrote:
>> Hmm - I believe that means a data/index block was corrupted.
>
> indices were recreated (reindex table), so i think this is data related
> problem.
>
> Have you seen any crashes, or hardware-related errors in your logs?
>
> nope. uptime is over 40 days.
>
> the machine is not used for anything else so i can't tell anything, but i
> didn't see any problems with it.
>
>
> What are your config settings, particularly the first three here:
>>    http://www.postgresql.org/docs/8.1/static/runtime-config-wal.html
>> fsync, wal_sync_method, full_page_writes
>>
>
> sure:
>  fsync
> -------
>  on

>  wal_sync_method
> -----------------
>  fdatasync

>  full_page_writes
> ------------------
>  on

All looks fine. Can you isolate the row(s) in question that seem to be
the problem? Then we can have a look at the system columns.
http://www.postgresql.org/docs/8.1/static/ddl-system-columns.html

--
   Richard Huxton
   Archonet Ltd

Re: corrupted item pointer:???

From

Tom Lane

Date:

13 April 2006, 11:47:29

"hubert depesz lubaczewski" <depesz@gmail.com> writes:
> On 4/13/06, Richard Huxton <dev@archonet.com> wrote:
>> Hmm - I believe that means a data/index block was corrupted.

> indices were recreated (reindex table), so i think this is data related
> problem.

AFAICS, the only non-index-related occurrence of that error message
is in PageRepairFragmentation, which is invoked by VACUUM.  I'd say
it indicates a real problem and you shouldn't ignore it.  You might
try using pg_filedump or some such to examine the table and see if
there's anything obvious about what happened to the corrupted page.

            regards, tom lane

Re: corrupted item pointer:???

From

"hubert depesz lubaczewski"

Date:

13 April 2006, 14:28:39

On 4/13/06, Richard Huxton <dev@archonet.com> wrote:

> All looks fine. Can you isolate the row(s) in question that seem to be
> the problem? Then we can have a look at the system columns.
> http://www.postgresql.org/docs/8.1/static/ddl-system-columns.html

i started a test to find it. as soon as i get it (probably tomorrow) i will mail it to the list.

hubert

Re: corrupted item pointer:???

From

"hubert depesz lubaczewski"

Date:

13 April 2006, 14:31:10

On 4/13/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> "hubert depesz lubaczewski" <depesz@gmail.com> writes:
>> On 4/13/06, Richard Huxton <dev@archonet.com> wrote:
>>> Hmm - I believe that means a data/index block was corrupted.
>> indices were recreated (reindex table), so i think this is data related
>> problem.
> AFAICS, the only non-index-related occurrence of that error message
> is in PageRepairFragmentation, which is invoked by VACUUM. I'd say
> it indicates a real problem and you shouldn't ignore it. You might
> try using pg_filedump or some such to examine the table and see if
> there's anything obvious about what happened to the corrupted page.

i'm not familiar with this utility.
i can of course find it using google, but how do i check what is wrong?
i am even willing to upload the dump file, but with 4 million records in the table, it is going to be rather large...
pg_relation_size says that the table is about 3g in size

depesz

Re: corrupted item pointer:???

From

Tom Lane

Date:

13 April 2006, 15:11:09

"hubert depesz lubaczewski" <depesz@gmail.com> writes:
> On 4/13/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> AFAICS, the only non-index-related occurrence of that error message
>> is in PageRepairFragmentation, which is invoked by VACUUM.  I'd say
>> it indicates a real problem and you shouldn't ignore it.  You might
>> try using pg_filedump or some such to examine the table and see if
>> there's anything obvious about what happened to the corrupted page.

> i'm not familiar with this utility.

http://sources.redhat.com/rhdb/

> i can of course find it using google, but how do i check what is wrong?

pg_filedump will complain about a bad item pointer (looks like the
message will be something about "Error: Item contents extend beyond block")

> i am even willing to upload the dump file, but with 4 million records in
> table, it is going to be rather large...

I don't think we want to see the whole thing!  But "pg_filedump -i -f"
output would be interesting for the specific block(s) that pg_filedump
reports errors for.

            regards, tom lane

Re: corrupted item pointer:???

From

"hubert depesz lubaczewski"

Date:

14 April 2006, 05:50:03

On 4/13/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> "hubert depesz lubaczewski" <depesz@gmail.com> writes:
>> On 4/13/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> AFAICS, the only non-index-related occurrence of that error message
>>> is in PageRepairFragmentation, which is invoked by VACUUM.  I'd say
>>> it indicates a real problem and you shouldn't ignore it.  You might
>>> try using pg_filedump or some such to examine the table and see if
>>> there's anything obvious about what happened to the corrupted page.
>> i'm not familiar with this utility.
> http://sources.redhat.com/rhdb/
>> i can of course find it using google, but how do i check what is wrong?
> pg_filedump will complain about a bad item pointer (looks like the
> message will be something about "Error: Item contents extend beyond block")
>> i am even willing to upload the dump file, but with 4 million records in
>> table, it is going to be rather large...
> I don't think we want to see the whole thing!  But "pg_filedump -i -f"
> output would be interesting for the specific block(s) that pg_filedump
> reports errors for.

if i understand correctly, i have to run pg_filedump on the table's <relfilenode> file, check the output for errors, and then run pg_filedump -i -f on the problematic blocks.
if that's ok - i'm running it.
as soon as i have some info - i'll let you know.

depesz

Re: corrupted item pointer:???

From

"hubert depesz lubaczewski"

Date:

14 April 2006, 05:56:25

On 4/13/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> pg_filedump will complain about a bad item pointer (looks like the
> message will be something about "Error: Item contents extend beyond block")

the problematic table spans 3 files (18026, 18026.1 and 18026.2).
i made pg_filedump _FILE_ > ~/_FILE_.dump
it went fine
grep -i error ~/*.dump also didn't show anything.
the dumps are quite large:
pgdba@lab02:~$ ls -l *.dump
-rw-r--r-- 1 pgdba pgdba 154631630 2006-04-14 18:03 18026.1.dump
-rw-r--r-- 1 pgdba pgdba 108808017 2006-04-14 18:03 18026.2.dump
-rw-r--r-- 1 pgdba pgdba 161625849 2006-04-14 18:01 18026.dump

what else can i look in it for?

best regards

hubert
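
Since the table is split across several segment files, a block number (for example the one shown in a row's ctid) has to be mapped to a file before the matching part of a dump can be inspected. The sketch below shows that arithmetic. It assumes the default build settings of 8192-byte blocks and 1 GB segments (131072 blocks per file); the names locate_block, segment_file_name and BlockLocation are made up for this illustration and are not part of PostgreSQL or pg_filedump.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define BLOCK_SIZE         8192u    /* assumption: default BLCKSZ */
#define BLOCKS_PER_SEGMENT 131072u  /* assumption: default 1 GB segment / 8 kB blocks */

/* Where a table-wide block number lives on disk (hypothetical helper type). */
typedef struct {
    unsigned      segment;          /* 0 -> "18026", 1 -> "18026.1", ... */
    unsigned      block_in_segment; /* block index inside that file */
    unsigned long byte_offset;      /* byte offset of the block inside that file */
} BlockLocation;

/* Split a table-wide block number into segment number and position in the segment. */
BlockLocation locate_block(unsigned block_number)
{
    BlockLocation loc;

    loc.segment = block_number / BLOCKS_PER_SEGMENT;
    loc.block_in_segment = block_number % BLOCKS_PER_SEGMENT;
    loc.byte_offset = (unsigned long) loc.block_in_segment * BLOCK_SIZE;
    return loc;
}

/* Build the segment file name: the first segment has no numeric suffix. */
int segment_file_name(char *buf, size_t len, unsigned relfilenode, unsigned segment)
{
    assert(buf != NULL && len > 0);
    if (segment == 0)
        return snprintf(buf, len, "%u", relfilenode);
    return snprintf(buf, len, "%u.%u", relfilenode, segment);
}
```

With these assumptions, block 200000 of relfilenode 18026 would be block 68928 of the file 18026.1. If the server was built with a different block or segment size, the two constants have to be changed accordingly.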

Re: corrupted item pointer:???

From

Tom Lane

Date:

14 April 2006, 12:40:50

"hubert depesz lubaczewski" <depesz@gmail.com> writes:
> i made pg_filedump _FILE_ > ~/_FILE_.dump
> it went fine
> grep -i error ~/*.dump also didn't show anything.

Oh, that's interesting.  Looking more closely, the test in
PageRepairFragmentation()

                if (itemidptr->itemoff < (int) pd_upper ||
                    itemidptr->itemoff >= (int) pd_special)
                    ereport(ERROR,
                            (errcode(ERRCODE_DATA_CORRUPTED),
                             errmsg("corrupted item pointer: %u",
                                    itemidptr->itemoff)));

is slightly tighter than what pg_filedump does:

      // Make sure the item can physically fit on this block before
      // formatting
      if ((itemOffset + itemSize > blockSize) ||
          (itemOffset + itemSize > bytesToFormat))
        printf ("  Error: Item contents extend beyond block.\n"
            "         BlockSize<%d> Bytes Read<%d> Item Start<%d>.\n",
            blockSize, bytesToFormat, itemOffset + itemSize);

I'm guessing that the lack of a check for itemOffset < pd_upper is why
pg_filedump is failing to notice anything wrong.  Do you want to add one
and try again?

            regards, tom lane
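
As a minimal sketch of what such an added check could look like, the function below combines the pg_filedump-style "extends beyond block" test with the two bounds that PageRepairFragmentation() tests. It is a standalone illustration, not a patch against pg_filedump: the names check_item and ItemCheck are invented here, and the caller is assumed to have already read pd_upper and pd_special from the page header and the offset and length from the line pointer, and to skip line pointers that are not in use.

```c
#include <assert.h>

/* Outcome of checking one line pointer against its page header. */
typedef enum {
    ITEM_OK = 0,
    ITEM_BEYOND_BLOCK,  /* item would run past the end of the block */
    ITEM_BELOW_UPPER,   /* offset points below pd_upper (header, line pointers, free space) */
    ITEM_IN_SPECIAL     /* offset points at or past pd_special */
} ItemCheck;

/*
 * Check one item's offset and size.
 * The first test mirrors the block-size half of the pg_filedump condition
 * quoted above; the other two mirror the PageRepairFragmentation() condition.
 */
ItemCheck check_item(unsigned item_offset, unsigned item_size,
                     unsigned pd_upper, unsigned pd_special,
                     unsigned block_size)
{
    assert(block_size > 0);

    if (item_offset + item_size > block_size)
        return ITEM_BEYOND_BLOCK;
    if (item_offset < pd_upper)
        return ITEM_BELOW_UPPER;
    if (item_offset >= pd_special)
        return ITEM_IN_SPECIAL;
    return ITEM_OK;
}
```

An item at offset 100 on a page with pd_upper = 7000 passes the first test but is reported as ITEM_BELOW_UPPER, which is the kind of case the looser check would miss.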

Re: corrupted item pointer:???

From

"hubert depesz lubaczewski"

Date:

14 April 2006, 15:38:10

On 4/14/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> I'm guessing that the lack of a check for itemOffset < pd_upper is why
> pg_filedump is failing to notice anything wrong. Do you want to add one
> and try again?

sure. but could you please tell me what to change? c is not my favourite language and i'd rather not damage something else while trying to change it myself.

hubert