Thread: Please help with this error message
I posted about this earlier and got no response, but now it's happening again. Can someone please help?

We're getting the following error message from PostgreSQL 7.3.2:

ERROR: unexpected chunk number 1 (expected 0) for toast value 77579

This error occurs whenever I attempt to SELECT the contents of a bytea field from this particular record. It has also occurred in the past, and appears to happen somewhat randomly among records. It does not appear to be related to the content of the data inserted into the database, because we've been able to retrieve the same actual contents successfully in other records.

The INSERT and UPDATE to add the record and set the contents field work fine.

Does anyone have any ideas on what I can try to solve this problem? I can put in a number of retries, I suppose, but that seems pretty kludgy.

--
www.designacourse.com
The Easiest Way to Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation
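One way to pin down exactly which rows trigger the error is to read the bytea column back one row at a time and note the keys that fail. The sketch below is illustrative only; the database, table, and column names (mydb, course_files, id, contents) are placeholders, not the actual schema:

#!/bin/bash
# Walk the table row by row and report every row whose bytea value
# cannot be read back.  All names here are placeholders.
DB=mydb
for id in $(psql -At -d "$DB" -c "SELECT id FROM course_files ORDER BY id"); do
    if ! psql -At -d "$DB" \
         -c "SELECT length(contents) FROM course_files WHERE id = $id" \
         >/dev/null 2>&1; then
        echo "row id=$id raises an error when its contents field is read"
    fi
done

psql exits non-zero when the backend reports an error, so only the unreadable rows are printed.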
In general, I have seen no one complain about data integrity on this list unless their hardware was bad. Search the archives at:

http://marc.theaimsgroup.com/

under databases, postgres, for disk test and ram test software. Then test.

Chris Smith wrote:
> We're getting the following error message from PostgreSQL 7.3.2:
>
> ERROR: unexpected chunk number 1 (expected 0) for toast value 77579
>
> This error occurs whenever I attempt to SELECT the contents of a bytea field
> from this particular record. It has also occurred in the past, and appears
> to happen somewhat randomly among records. [...]
http://marc.theaimsgroup.com/?l=postgresql-general&w=2&r=1&s=disk+memory+test&q=b

Chris Smith wrote:
> Hmm. That seems unlikely since the problem occurs on at least three
> different servers. Nevertheless, I would be willing to test... but I can't
> find the tests you're referring to in the archives above.
Okay, after looking through several pages of those search results and finding nothing that resembles a link to a disk or memory test, I gave up. Again, since this happens on multiple servers, it's unlikely to be a hardware problem in the first place. Anyone have other ideas?

--
www.designacourse.com
The Easiest Way to Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation
Okay,

After the suggestion of possible hardware problems, we've decided to move the database to yet another server in the hopes that this will cause the problem to go away. Trouble is, we still have a corrupted record.

I successfully deleted one of the corrupted records; the other, when I attempt to delete it, responds with:

PANIC: open of /data/dac/pg_clog/09C5 failed: No such file or directory
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

I then get a psql prompt of "!>", and am no longer connected to the database.

Any suggestions on how to delete this record? Does this mean that the data is corrupted more than I think?

--
www.designacourse.com
The Easiest Way to Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation
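A workaround that has been suggested for this kind of PANIC, offered here only as a heavily hedged sketch: the message means the row's transaction-ID fields point at a pg_clog segment that does not exist, which usually means the tuple header itself is garbage. Creating a zero-filled dummy segment lets the transaction-status lookup succeed so the row can be cleaned out. Back up the data directory at the file level first; the path below simply echoes the one in the error message, and the table name in the VACUUM step is a placeholder.

# pg_clog segments are 256 kB; the zero-filled file stands in for the
# missing segment named in the PANIC message (path taken from that message).
dd if=/dev/zero of=/data/dac/pg_clog/09C5 bs=256k count=1
# With an all-zero segment the bogus row's inserting transaction reads as
# never committed, so a VACUUM of the table should remove the row;
# delete the dummy file again afterwards.  Table name is a placeholder.
vacuumdb -d mydb -t course_files

If the VACUUM or DELETE then complains about yet another missing segment, the tuple header is almost certainly trashed, and salvaging the readable rows into a fresh table (as suggested later in the thread) is the safer route.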
On Wed, Mar 26, 2003 at 10:19:24AM -0700, Chris Smith wrote:
> Okay, after looking through several pages of those search results and
> finding nothing that resembles a link to a disk or memory test, I gave up.

What hardware are you using? If you're using x86 hardware, memtest86 is your best bet. Linux has badblocks for disk checking; other systems do this in other ways (with VxFS, you get notified automatically by Solaris, for instance).

> Again, since this happens on multiple servers, it's unlikely to be a
> hardware problem in the first place. Anyone have other ideas?

I don't think that's true. If you got bad data in the first time, you'll probably have it in all the systems. SELECT might turn it up when COPY wouldn't, because AFAIK COPY just spits out what it finds.

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                             Toronto, Ontario Canada
<andrew@libertyrms.info>                M2P 2A8
                                        +1 416 646 3304 x110
"Chris Smith" <cdsmith@twu.net> writes: > I posted about this earlier and got no response, but now it's happening > again. Can someone please help? > We're getting the following error message from PostgreSQL 7.3.2: > ERROR: unexpected chunk number 1 (expected 0) for toast value 77579 Can you provide a test case that makes it happen? I'm prepared to believe there's a bug here, but the simple fact of the error message is certainly not enough to find the bug :-( regards, tom lane
On Wed, 26 Mar 2003, Chris Smith wrote:
> Okay, after looking through several pages of those search results and
> finding nothing that resembles a link to a disk or memory test, I gave up.
> Again, since this happens on multiple servers, it's unlikely to be a
> hardware problem in the first place. Anyone have other ideas?

Until you prove your hardware is good, you can't be sure it isn't the problem, and it's not worth your time chasing other fixes until you are sure. And just because more than one machine does it does not mean it's not hardware. We had a supplier for a while who sent out machines of which probably 75% had single-bit memory failures. It took months of my time to test all the boxes we'd already bought and find all the bad memory, and we had to start testing all machines coming in the door.

http://www.memtest86.com/

In Linux, use badblocks or the -c switch with mkfs to check a hard drive.
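For reference, the disk checks mentioned here look roughly like this; the device and filesystem names are examples only, and memtest86 itself is booted from its own floppy or CD rather than run from a shell:

# Non-destructive read test of a (preferably unmounted) partition:
badblocks -sv /dev/hda2
# Or have mkfs scan for bad blocks while creating a new filesystem
# (this destroys any existing data on the partition):
mkfs -t ext3 -c /dev/hda2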
> Can you provide a test case that makes it happen? I'm prepared to
> believe there's a bug here, but the simple fact of the error message
> is certainly not enough to find the bug :-(

Thanks, Tom. I didn't intend for this to be a bug report (if I had, I'd have sent it to the bugs list instead). I just wanted to know if others had ideas. Looks like I'll try to get these memory checks in place, and then see what happens.

I have been trying to get a test case put together. Unfortunately, I haven't been able to do so. On our production systems, it generally takes about three days of hitting the database pretty hard for this to turn up once... and I don't have anything close to the production system available for testing. I think it will be difficult to reproduce the problem in a reasonable length of time.

I have put together an application that generates about the same distribution of sizes for the bytea field, and does a continual series of insert/read sequences to ensure that they show up right. That's making an assumption, though, that the record is getting corrupted right from the get-go, rather than by other actions later on; perhaps a bad one, so I'm thinking of rewriting it to have one thread do inserts while another keeps randomly reading the inserted fields out of order. Maybe that will make it more reproducible.

I also have obtained permission to shut down a production system tonight and leave my test app running overnight, with actual production traffic pointing to the failover server. That would provide extra data for diagnosing this as a hardware issue (if I can reproduce it easily on production but not at all elsewhere, for example) and give a more powerful machine, which would allow more testing to happen in less time. I'll let you know if I turn up anything that looks suspicious by tomorrow morning.

--
www.designacourse.com
The Easiest Way to Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation
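A stripped-down sketch of that insert-while-reading idea can be put together with two shell loops; the table, payload sizes, and database name below are made up for illustration and do not reflect the production schema:

#!/bin/bash
# One loop inserts random-length bytea values (large enough to be TOASTed),
# another reads rows back in random order and logs any backend errors.
# Assumed table: CREATE TABLE toast_test (id serial, contents bytea);
DB=mydb

insert_loop() {
    while true; do
        kb=$(( (RANDOM % 100) + 1 ))   # roughly 1 kB to 100 kB per value
        psql -q -d "$DB" -c \
          "INSERT INTO toast_test (contents) VALUES (decode(repeat('deadbeef', $kb * 256), 'hex'))"
    done
}

read_loop() {
    while true; do
        psql -At -d "$DB" -c \
          "SELECT length(contents) FROM toast_test ORDER BY random() LIMIT 20" \
          >/dev/null 2>>toast_test_errors.log
    done
}

insert_loop &
read_loop &
wait

Any "unexpected chunk number" errors raised during the reads end up in toast_test_errors.log, which makes an overnight run easy to check in the morning.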
> What hardware are you using? If you're using x86 hardware,
> memtest86 is your best bet. Linux has badblocks for disk checking;
> other systems do this in other ways (with VxFS, you get notified
> automatically by Solaris, for instance).

Thanks. I'll request that this be done, and come back with more whining if it doesn't turn up errors.

Even if it does turn up errors, are there any thoughts on how to delete the errant record, as indicated in my last response? We'd like to not lose *all* of our data, and we know that only one record is bad at this point.

--
www.designacourse.com
The Easiest Way to Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation
Chris Smith wrote:
> After the suggestion of possible hardware problems, we've decided to move
> the database to yet another server in the hopes that this will cause the
> problem to go away. Trouble is, we still have a corrupted record.

Did you move it via dump/restore or did you copy the data files?
> Did you move it via dump/restore or did you copy the data files?

pg_dump doesn't work. Our SA copied the files, and all testing seems to indicate that the database is working just the same on the new server as it did in the old location. In any case, that's not exactly related to the problem with the corrupted records, since they are occurring on the original server as well.

--
www.designacourse.com
The Easiest Way to Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation
On Fri, Mar 28, 2003 at 03:48:00PM -0700, Chris Smith wrote:
> pg_dump doesn't work. Our SA copied the files, and all testing seems to
> indicate that the database is working just the same on the new server as
> it did in the old location.

Even more, then, I suspect the original server. If bad data made it in such that you can't COPY it out, then it'll be bad everywhere you go.

What you need to do is figure out what table it's in and what record(s) it is. I think you might want to try playing with pg_filedump to get some ideas. If all else fails, you can get the data on either side of it by sorting from the top of the table and from the bottom. You can then select all of that into a file, and put it into a freshly created table. Not ideal, but it'll get you there. Others will undoubtedly have more elegant solutions.

By the way, the potential for inserting bad records is one reason we always specify ECC RAM for database machines.

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                             Toronto, Ontario Canada
<andrew@libertyrms.info>                M2P 2A8
                                        +1 416 646 3304 x110
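To make the "data on either side of it" idea concrete, one possible shape for the salvage, assuming the bad row's key has already been identified; the database name, table name, and the id value are placeholders for illustration:

# Copy every readable row into a fresh table, skipping the corrupt one,
# then dump just the salvaged copy.  Names and the key value are placeholders.
psql -d mydb -c "CREATE TABLE course_files_salvaged AS SELECT * FROM course_files WHERE id <> 12345"
pg_dump -t course_files_salvaged mydb > course_files_salvaged.sql

Excluding the bad key in the WHERE clause matters: the TOASTed bytea value is only fetched for rows that pass the filter, so the copy never touches the broken chunk.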