Thread: corrupted item pointer:???
hi.
basic information:
machine: desktop computer, with sata hard drive, no bad blocks. 2g ram. AMD Sempron(tm) Processor 2600+
system: linux debian testing, using 2.6.11 kernel.
postgresql: 8.1.3 compiled by hand using:
./configure \
--prefix=/home/pgdba/work \
--without-debug \
--disable-debug \
--with-pgport=5810 \
--with-tcl \
--with-perl \
--with-python \
--without-krb4 \
--without-krb5 \
--without-pam \
--without-rendezvous \
--with-openssl \
--with-readline \
--with-zlib \
--with-gnu-ld
version() -> PostgreSQL 8.1.3 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.3.5 (Debian 1:3.3.5-13)
on this machine i have a copy of a 40G database from production servers.
i was testing migration to hstore (http://www.sai.msu.su/~megera/postgres/gist/hstore/README.hstore)
at some point in the tests it failed saying corrupted item pointer.
migration was done using:
LOOP:
select data from test_table where primary_key = constant;
select data from secondary_table where unique_key = constant;
update test_table set hstore_field = some_value where primary_key = constant
REPEAT FOR ALL items;
this loop was made in external code (perl, using dbi + dbd::pg).
we did about 37 such loops every second.
everything went fine.
then (every 30 minutes) i ran vacuum to reclaim space from test_table.
during one of these vacuums it panicked, showing the aforementioned error and killing all connections.
is there any way i can check what went wrong?
i don't need to recover the data.
i just need to know whether the problem is hstore-related, hardware-related or just random thing happening because of nothing.
i tried:
hand-vacuum
vacuum analyze
vacuum full analyze
reindex the table
another vacuum
none of this worked.
i still get the corrupted item pointer message.
any clues on how i can check what went wrong?
depesz
hubert depesz lubaczewski wrote:
> hi.
> basic information:
> machine: desktop computer, with sata hard drive, no bad blocks. 2g
> ram. AMD Sempron(tm) Processor 2600+
> system: linux debian testing, using 2.6.11 kernel.
> postgresql: 8.1.3 compiled by hand using:
...
>
> version() -> PostgreSQL 8.1.3 on i686-pc-linux-gnu, compiled by GCC gcc
> (GCC) 3.3.5 (Debian 1:3.3.5-13)

OK - nothing unusual there.

> on this machine i have a copy of a 40G database from production servers.
> i was testing migration to hstore (
> http://www.sai.msu.su/~megera/postgres/gist/hstore/README.hstore)
>
> at some point in the tests it failed saying corrupted item pointer.

> is there any way i can check what went wrong?
> i don't need to recover the data.
> i just need to know whether the problem is hstore-related,
> hardware-related or just random thing happening because of nothing.

Hmm - I believe that means a data/index block was corrupted. Have you seen any crashes, or hardware-related errors in your logs?

What are your config settings, particularly the first three here:
http://www.postgresql.org/docs/8.1/static/runtime-config-wal.html
fsync, wal_sync_method, full_page_writes

--
Richard Huxton
Archonet Ltd
On 4/13/06, Richard Huxton <dev@archonet.com> wrote:
> Hmm - I believe that means a data/index block was corrupted.

indices were recreated (reindex table), so i think this is data related problem.

> Have you seen any crashes, or hardware-related errors in your logs?

nope. uptime is over 40 days.
the machine is not used for anything else so i can't tell anything, but i didn't see any problems with it.

> What are your config settings, particularly the first three here:
> http://www.postgresql.org/docs/8.1/static/runtime-config-wal.html
> fsync, wal_sync_method, full_page_writes

sure:
irr=# show fsync;
fsync
-------
on
(1 row)
irr=# show wal_sync_method;
wal_sync_method
-----------------
fdatasync
(1 row)
irr=# show full_page_writes;
full_page_writes
------------------
on
(1 row)
depesz
hubert depesz lubaczewski wrote:
> On 4/13/06, Richard Huxton <dev@archonet.com> wrote:
>> Hmm - I believe that means a data/index block was corrupted.
>
> indices were recreated (reindex table), so i think this is data related
> problem.
>
> Have you seen any crashes, or hardware-related errors in your logs?
>
> nope. uptime is over 40 days.
>
> the machine is not used for anything else so i can't tell anything, but i
> didn't see any problems with it.
>
>
> What are your config settings, particularly the first three here:
>> http://www.postgresql.org/docs/8.1/static/runtime-config-wal.html
>> fsync, wal_sync_method, full_page_writes
>>
>
> sure:
> fsync
> -------
> on
> wal_sync_method
> -----------------
> fdatasync
> full_page_writes
> ------------------
> on

All looks fine. Can you isolate the row(s) in question that seem to be the problem? Then we can have a look at the system columns.
http://www.postgresql.org/docs/8.1/static/ddl-system-columns.html

--
Richard Huxton
Archonet Ltd

"hubert depesz lubaczewski" <depesz@gmail.com> writes:
> On 4/13/06, Richard Huxton <dev@archonet.com> wrote:
>> Hmm - I believe that means a data/index block was corrupted.
> indices were recreated (reindex table), so i think this is data related
> problem.

AFAICS, the only non-index-related occurrence of that error message
is in PageRepairFragmentation, which is invoked by VACUUM. I'd say
it indicates a real problem and you shouldn't ignore it. You might
try using pg_filedump or some such to examine the table and see if
there's anything obvious about what happened to the corrupted page.

regards, tom lane
On 4/13/06, Richard Huxton <dev@archonet.com> wrote:
> All looks fine. Can you isolate the row(s) in question that seem to be
> the problem? Then we can have a look at the system columns.
> http://www.postgresql.org/docs/8.1/static/ddl-system-columns.html

i ran the test to find it. as soon as i get it (probably tomorrow) i will mail it to the list.
hubert
On 4/13/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "hubert depesz lubaczewski" <depesz@gmail.com> writes:
>> On 4/13/06, Richard Huxton <dev@archonet.com> wrote:
>>> Hmm - I believe that means a data/index block was corrupted.
>> indices were recreated (reindex table), so i think this is data related
>> problem.
> AFAICS, the only non-index-related occurrence of that error message
> is in PageRepairFragmentation, which is invoked by VACUUM. I'd say
> it indicates a real problem and you shouldn't ignore it. You might
> try using pg_filedump or some such to examine the table and see if
> there's anything obvious about what happened to the corrupted page.

i'm not familiar with this utility.
i can of course find it using google, but how do i check what is wrong?
i am even willing to upload the dump file, but with 4 million records in table, it is going to be rather large...
pg_relation_size says that the table is about 3g in size.
depesz

"hubert depesz lubaczewski" <depesz@gmail.com> writes:
> On 4/13/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> AFAICS, the only non-index-related occurrence of that error message
>> is in PageRepairFragmentation, which is invoked by VACUUM. I'd say
>> it indicates a real problem and you shouldn't ignore it. You might
>> try using pg_filedump or some such to examine the table and see if
>> there's anything obvious about what happened to the corrupted page.
> i'm not familiar with this utility.

http://sources.redhat.com/rhdb/

> i can of course find it using google, but how do i check what is wrong?

pg_filedump will complain about a bad item pointer (looks like the
message will be something about "Error: Item contents extend beyond block")

> i am even willing to upload the dump file, but with 4 million records in
> table, it is going to be rather large...

I don't think we want to see the whole thing! But "pg_filedump -i -f"
output would be interesting for the specific block(s) that pg_filedump
reports errors for.

regards, tom lane
On 4/13/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "hubert depesz lubaczewski" <depesz@gmail.com> writes:
>> On 4/13/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> AFAICS, the only non-index-related occurrence of that error message
>>> is in PageRepairFragmentation, which is invoked by VACUUM. I'd say
>>> it indicates a real problem and you shouldn't ignore it. You might
>>> try using pg_filedump or some such to examine the table and see if
>>> there's anything obvious about what happened to the corrupted page.
>> i'm not familiar with this utility.
> http://sources.redhat.com/rhdb/
>> i can of course find it using google, but how do i check what is wrong?
> pg_filedump will complain about a bad item pointer (looks like the
> message will be something about "Error: Item contents extend beyond block")
>> i am even willing to upload the dump file, but with 4 million records in
>> table, it is going to be rather large...
> I don't think we want to see the whole thing! But "pg_filedump -i -f"
> output would be interesting for the specific block(s) that pg_filedump
> reports errors for.

if i understand correctly, i have to run pg_filedump on the <relfilenode> of the table, check the output for errors, and run pg_filedump -i -f on the problematic blocks.
if that's ok - i'm running it.
as soon as i have some info - i'll let you know.
depesz
On 4/13/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> pg_filedump will complain about a bad item pointer (looks like the
> message will be something about "Error: Item contents extend beyond block")

the problematic table spans 3 files (18026, 18026.1 and 18026.2).
for each of them i ran pg_filedump _FILE_ > ~/_FILE_.dump
it went fine.
grep -i error ~/*.dump also didn't show anything.
the dumps are quite large:
pgdba@lab02:~$ ls -l *.dump
-rw-r--r-- 1 pgdba pgdba 154631630 2006-04-14 18:03 18026.1.dump
-rw-r--r-- 1 pgdba pgdba 108808017 2006-04-14 18:03 18026.2.dump
-rw-r--r-- 1 pgdba pgdba 161625849 2006-04-14 18:01 18026.dump
what else can i look for in them?
best regards
hubert
"hubert depesz lubaczewski" <depesz@gmail.com> writes:
> i ran pg_filedump _FILE_ > ~/_FILE_.dump
> it went fine
> grep -i error ~/*.dump also didn't show anything.

Oh, that's interesting. Looking more closely, the test in
PageRepairFragmentation()

    if (itemidptr->itemoff < (int) pd_upper ||
        itemidptr->itemoff >= (int) pd_special)
        ereport(ERROR,
                (errcode(ERRCODE_DATA_CORRUPTED),
                 errmsg("corrupted item pointer: %u",
                        itemidptr->itemoff)));

is slightly tighter than what pg_filedump does:

    // Make sure the item can physically fit on this block before
    // formatting
    if ((itemOffset + itemSize > blockSize) ||
        (itemOffset + itemSize > bytesToFormat))
        printf (" Error: Item contents extend beyond block.\n"
                " BlockSize<%d> Bytes Read<%d> Item Start<%d>.\n",
                blockSize, bytesToFormat, itemOffset + itemSize);

I'm guessing that the lack of a check for itemOffset < pd_upper is why
pg_filedump is failing to notice anything wrong. Do you want to add one
and try again?

regards, tom lane
On 4/14/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm guessing that the lack of a check for itemOffset < pd_upper is why
> pg_filedump is failing to notice anything wrong. Do you want to add one
> and try again?

sure. but could you please tell me what to change? c is not my favourite language and i'd rather not damage something else while trying to change it myself.
hubert