Thread: Please help with this error message

Please help with this error message

From
"Chris Smith"
Date:
I posted about this earlier and got no response, but now it's happening
again.  Can someone please help?

We're getting the following error message from PostgreSQL 7.3.2:

    ERROR:  unexpected chunk number 1 (expected 0) for toast value 77579

This error occurs whenever I attempt to SELECT the contents of a bytea field
from this particular record.  It has also occurred in the past, and appears
to happen somewhat randomly among records.  It does not appear to be related
to the content of the data inserted into the database, because we've been
able to retrieve the same actual contents successfully in other records.
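
For context on the message itself: a large bytea value is stored out of line as a series of numbered chunks, and the server expects to read them back as 0, 1, 2, and so on. The sketch below is a simplified Python model of that consistency check, not PostgreSQL's actual C code; the function name and the in-memory list of chunks are illustrative assumptions.

```python
def reassemble_toast_value(value_id, chunks):
    """Join (chunk_seq, data) pairs, insisting on the sequence 0, 1, 2, ..."""
    parts = []
    expected = 0
    for chunk_seq, data in chunks:
        if chunk_seq != expected:
            # Same wording as the server error quoted above.
            raise ValueError(
                "unexpected chunk number %d (expected %d) for toast value %d"
                % (chunk_seq, expected, value_id)
            )
        parts.append(data)
        expected += 1
    return b"".join(parts)
```

Under this model, a value whose chunk 0 is missing or unreadable fails on the first chunk returned, which matches "unexpected chunk number 1 (expected 0)".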

The INSERT and UPDATE to add the record and set the contents field work
fine.

Does anyone have any ideas on what I can try to solve this problem?  I can
put in a number of retries, I suppose, but that seems pretty kludgy.

--
www.designacourse.com
The Easiest Way to Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation


Re: Please help with this error message

From
Dennis Gearon
Date:
In general, I have seen no one complain about data integrity on this list unless
their hardware was bad. Search the archives at:

    http://marc.theaimsgroup.com/

under databases, postgres, for disk test and ram test software. Then test.

Chris Smith wrote:
> I posted about this earlier and got no response, but now it's happening
> again.  Can someone please help?
>
> We're getting the following error message from PostgreSQL 7.3.2:
>
>     ERROR:  unexpected chunk number 1 (expected 0) for toast value 77579
>
> This error occurs whenever I attempt to SELECT the contents of a bytea field
> from this particular record.  It has also occurred in the past, and appears
> to happen somewhat randomly among records.  It does not appear to be related
> to the content of the data inserted into the database, because we've been
> able to retrieve the same actual contents successfully in other records.
>
> The INSERT and UPDATE to add the record and set the contents field work
> fine.
>
> Does anyone have any ideas on what I can try to solve this problem?  I can
> put in a number of retries, I suppose, but that seems pretty kludgy.
>
> --
> www.designacourse.com
> The Easiest Way to Train Anyone... Anywhere.
>
> Chris Smith - Lead Software Developer/Technical Trainer
> MindIQ Corporation
>
>


Re: Please help with this error message

From
Dennis Gearon
Date:
http://marc.theaimsgroup.com/?l=postgresql-general&w=2&r=1&s=disk+memory+test&q=b

Chris Smith wrote:
>>In general, I have seen no one complain about data integrity on this list
>>unless their hardware was bad. Search the archives at:
>>
>>http://marc.theaimsgroup.com/
>>
>>under databases, postgres, for disk test and ram test software. Then test.
>
>
> Hmm.  That seems unlikely since the problem occurs on at least three
> different servers.  Nevertheless, I would be willing to test... but I can't
> find the tests you're referring to in the archives above.
>
> --
> www.designacourse.com
> The Easiest Way to Train Anyone... Anywhere.
>
> Chris Smith - Lead Software Developer/Technical Trainer
> MindIQ Corporation
>
>
>


Re: Please help with this error message

From
"Chris Smith"
Date:
Okay, after looking through several pages of those search results and
finding nothing that resembles a link to a disk or memory test, I gave up.
Again, since this happens on multiple servers, it's unlikely to be a
hardware problem in the first place.  Anyone have other ideas?

--
www.designacourse.com
The Easiest Way to Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation


Re: Please help with this error message

From
"Chris Smith"
Date:
Okay,

After the suggestion of possible hardware problems, we've decided to move
the database to yet another server in the hopes that this will cause the
problem to go away.  Trouble is, we've still a corrupted record.  I
successfully deleted one of the corrupted records; the other, when I attempt
to delete it, responds with:

PANIC:  open of /data/dac/pg_clog/09C5 failed: No such file or directory
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

I then get a psql prompt of "!>", and am no longer connected to the
database.  Any suggestions on how to delete this record?  Does this mean
that the data is corrupted more than I think?
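
One way to read the PANIC: the pg_clog file name encodes which range of transaction IDs the file covers. The arithmetic below is a sketch that assumes the default build constants (8192-byte blocks, two status bits per transaction, 32 pages per segment file); the constant and function names are made up for illustration.

```python
BLOCK_SIZE = 8192        # assumed default block size in bytes
XACTS_PER_BYTE = 4       # two commit-status bits per transaction
PAGES_PER_SEGMENT = 32   # assumed pages per pg_clog segment file

XACTS_PER_SEGMENT = BLOCK_SIZE * XACTS_PER_BYTE * PAGES_PER_SEGMENT


def clog_segment_xid_range(segment_name):
    """Return (first_xid, last_xid) covered by a pg_clog segment such as '09C5'."""
    segment_no = int(segment_name, 16)  # file names are hexadecimal
    first = segment_no * XACTS_PER_SEGMENT
    return first, first + XACTS_PER_SEGMENT - 1


first_xid, last_xid = clog_segment_xid_range("09C5")
print(first_xid, last_xid)
```

Under those assumptions, 09C5 covers transaction IDs of roughly 2.6 billion. If the cluster has not run anywhere near that many transactions, the transaction ID stored in the row header is probably garbage rather than a real one, which would point to damage in the row itself.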

--
www.designacourse.com
The Easiest Way to Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation


Re: Please help with this error message

From
Andrew Sullivan
Date:
On Wed, Mar 26, 2003 at 10:19:24AM -0700, Chris Smith wrote:
> Okay, after looking through several pages of those search results and
> finding nothing that resembles a link to a disk or memory test, I gave up.

What hardware are you using?  If you're using x86 hardware,
memtest86 is your best bet.  Linux has badblocks for disk checking;
other systems do this in other ways (with VxFS, you get notified
automatically by Solaris, for instance).

> Again, since this happens on multiple servers, it's unlikely to be a
> hardware problem in the first place.  Anyone have other ideas?

I don't think that's true.  If you got bad data in the first time,
you'll probably have it in all the systems.  SELECT might turn it up
when COPY wouldn't, because AFAIK COPY just spits out what it finds.

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110


Re: Please help with this error message

From
Tom Lane
Date:
"Chris Smith" <cdsmith@twu.net> writes:
> I posted about this earlier and got no response, but now it's happening
> again.  Can someone please help?
> We're getting the following error message from PostgreSQL 7.3.2:
>     ERROR:  unexpected chunk number 1 (expected 0) for toast value 77579

Can you provide a test case that makes it happen?  I'm prepared to
believe there's a bug here, but the simple fact of the error message
is certainly not enough to find the bug :-(

            regards, tom lane


Re: Please help with this error message

From
"scott.marlowe"
Date:
On Wed, 26 Mar 2003, Chris Smith wrote:

> Okay, after looking through several pages of those search results and
> finding nothing that resembles a link to a disk or memory test, I gave up.
> Again, since this happens on multiple servers, it's unlikely to be a
> hardware problem in the first place.  Anyone have other ideas?

Until you prove your hardware is good, you can't be sure it's not the
problem, and therefore it's not worth your time to try and fix until
you're sure.  And just because more than one machine does it does not mean
it's not hardware.  We had a supplier for a while who sent out machines,
probably 75% of which had single-bit memory failures.  It took months
of my time to test all the boxes we'd already bought and find all the bad
memory, and we had to start testing all machines coming in the door.

http://www.memtest86.com/
On Linux, use badblocks or the -c switch with mkfs to check a hard drive.


Re: Please help with this error message

From
"Chris Smith"
Date:
> Can you provide a test case that makes it happen?  I'm prepared to
> believe there's a bug here, but the simple fact of the error message
> is certainly not enough to find the bug :-(

Thanks, Tom.  I didn't intend for this to be a bug report (if I had, I'd
have sent it to the bugs list instead.)  I just wanted to know if others had
ideas.  Looks like I'll try to get these memory checks in place, and then
see what happens.

I have been trying to get a test case put together.  Unfortunately, I
haven't been able to do so.  On our production systems, it generally takes
about three days of hitting the database pretty hard for this to turn up
once... and I don't have anything close to the production system available
for testing.  I think it will be difficult to reproduce the problem in a
reasonable length of time.

I have put together an application that generates about the same
distribution of sizes for the bytea field, and does a continual series of
insert/read sequences to ensure that they show up right... that's making an
assumption, though, that the record is getting corrupted right from the
get-go, rather than by other actions later on; perhaps a bad one, so I'm
thinking of rewriting it to have a thread do inserts, and another keep
randomly reading the inserted fields out of order.  Maybe that will make it
more reproducible.
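
A rough sketch of the two-thread version described above, written in Python. It is not the actual test application: `InMemoryStore` is a hypothetical stand-in for the table with the bytea column (a real run would wrap a PostgreSQL driver behind the same `insert`/`read` methods), and the record count and size limits are arbitrary assumptions.

```python
import hashlib
import os
import random
import threading


class InMemoryStore:
    """Hypothetical stand-in for the table holding the bytea field."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()

    def insert(self, key, payload):
        with self._lock:
            self._rows[key] = payload

    def read(self, key):
        with self._lock:
            return self._rows[key]


def run_stress(store, n_records=200, max_size=65536, seed=0):
    """One thread inserts payloads; another re-reads inserted rows at random."""
    expected = {}      # key -> SHA-1 digest of the payload that was written
    bad_keys = set()   # keys whose read failed or returned different bytes
    lock = threading.Lock()
    writer_done = threading.Event()

    def verify(key):
        with lock:
            digest = expected[key]
        try:
            ok = hashlib.sha1(store.read(key)).hexdigest() == digest
        except Exception:
            ok = False  # a read error counts as a failure too
        if not ok:
            with lock:
                bad_keys.add(key)

    def writer():
        rng = random.Random(seed)
        for key in range(n_records):
            payload = os.urandom(rng.randint(1, max_size))
            store.insert(key, payload)
            with lock:
                expected[key] = hashlib.sha1(payload).hexdigest()
        writer_done.set()

    def reader():
        rng = random.Random(seed + 1)
        while not writer_done.is_set():
            with lock:
                keys = list(expected)
            if keys:
                verify(rng.choice(keys))

    threads = [threading.Thread(target=writer), threading.Thread(target=reader)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    for key in range(n_records):  # final pass over every record
        verify(key)
    return sorted(bad_keys)
```

`run_stress(InMemoryStore())` returns the list of keys that failed verification, which should be empty for a healthy store. Comparing digests rather than keeping every payload in memory keeps the checker itself small.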

I also have obtained permission to shut down a production system tonight and
leave my test app running overnight, with actual production traffic pointing
to the failover server.  That would provide extra data for diagnosing as a
hardware issue (if I can reproduce easily on production but not at all
elsewhere, for example) and give a more powerful machine, which would allow
more testing to happen in less time.  I'll let you know if I turn up
anything that looks suspicious by tomorrow morning.

--
www.designacourse.com
The Easiest Way to Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation


Re: Please help with this error message

From
"Chris Smith"
Date:
> What hardware are you using?  If you're using x86 hardware,
> memtest86 is your best bet.  Linux has badblocks for disk checking;
> other systems do this in other ways (with VxFS, you get notified
> automatically by Solaris, for instance).

Thanks.  I'll request that this be done, and come back with more whining if
it doesn't turn up errors.

Even if it does turn up errors, are there any thoughts on how to delete the
errant record, as indicated in my last response?  We'd like to not lose
*all* of our data, and we know that only one record is bad at this point.

--
www.designacourse.com
The Easiest Way to Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation


Re: Please help with this error message

From
"Chris Smith"
Date:
> In general, I have seen no one complain about data integrity on this list
> unless their hardware was bad. Search the archives at:
>
> http://marc.theaimsgroup.com/
>
> under databases, postgres, for disk test and ram test software. Then test.

Hmm.  That seems unlikely since the problem occurs on at least three
different servers.  Nevertheless, I would be willing to test... but I can't
find the tests you're referring to in the archives above.

--
www.designacourse.com
The Easiest Way to Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation


Re: Please help with this error message

From
Joseph Shraibman
Date:
Chris Smith wrote:
> Okay,
>
> After the suggestion of possible hardware problems, we've decided to move
> the database to yet another server in the hopes that this will cause the
> problem to go away.  Trouble is, we've still a corrupted record.

Did you move it via dump/restore or did you copy the data files?


Re: Please help with this error message

From
"Chris Smith"
Date:
>
> Did you move it via dump/restore or did you copy the data files?

pg_dump doesn't work.  Our SA copied the files, and all testing seems to
indicate that the database is working just the same on the new server as the
old location.

In any case, that's not exactly related to the problem with the corrupted
records, since they are occurring on the original server as well.

--
www.designacourse.com
The Easiest Way to Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation


Re: Please help with this error message

From
Andrew Sullivan
Date:
On Fri, Mar 28, 2003 at 03:48:00PM -0700, Chris Smith wrote:

> pg_dump doesn't work.  Our SA copied the files, and all testing seems to
> indicate that the database is working just the same on the new server as the
> old location.

Even more, then, I suspect the original server.  If bad data made it
in such that you can't COPY it out, then it'll be bad everywhere you
go.

What you need to do is figure out what table it's on and what
record(s) it is.  I think you might want to try playing with
pg_filedump to get some ideas.  If all else fails, you can get the
data on either side of it by SORTing from the top of the table and
the bottom.  You can then select all that into a file, and put it
into a freshly-created table.  Not ideal, but it'll get you there.
Others will undoubtedly have more elegant solutions.
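
The "from the top and from the bottom" idea can be sketched roughly as follows, in Python. `read_row` is a hypothetical callable that fetches one row by key and raises if the row cannot be read; in practice it would wrap a SELECT against the damaged table. This is only an outline of the approach, not a tested recovery procedure.

```python
def salvage_from_both_ends(keys, read_row):
    """Read rows upward and downward in key order until each pass hits a failure.

    Returns (salvaged, suspect): the rows that could be read, and the keys
    left in the gap between the two passes.
    """
    ordered = sorted(keys)
    salvaged = {}

    lo = 0
    while lo < len(ordered):          # pass from the top
        try:
            salvaged[ordered[lo]] = read_row(ordered[lo])
        except Exception:
            break
        lo += 1

    hi = len(ordered) - 1
    while hi >= lo:                   # pass from the bottom
        try:
            salvaged[ordered[hi]] = read_row(ordered[hi])
        except Exception:
            break
        hi -= 1

    suspect = ordered[lo:hi + 1]
    return salvaged, suspect
```

With a single bad record, `suspect` holds just that key and everything else lands in `salvaged`, ready to be loaded into a freshly created table. With several bad records, the gap also contains any good rows lying between them, and those would need a second look.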

By the way, the potential for inserting bad records is one reason we
always specify ECC RAM for database machines.

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110