Thread: unable to dump database, toast errors
Greetings, I'm running postgresql-7.2.1 on Redhat-7.2, and when attempting to perform a dump with the command: pg_dump -Ft -b -o DB_NAME > /tmp/db.tar it chugs away for a few minutes, and then dies with the following error: pg_dump: dumping out the contents of table artifact_file pg_dump: ERROR: unexpected chunk number 0 (expected 1) for toast value 7685119 pg_dump: lost synchronization with server, resetting connection pg_dump: SQL command to dump the contents of table "artifact_file" failed: PQendcopy() failed. pg_dump: Error message from server: pg_dump: The command was: COPY "artifact_file" WITH OIDS TO stdout; pg_dump: *** aborted because of error I've tried running vacuumdb as follows: vacuumdb -z -v -a -f and its come up clean every time, so i'm really at a loss on how to resolve this. If it matters, the artifact_file table has 782 rows, and is comprised of text and integer datatype columns. Any suggestions would be greatly appreciated. thanks!
On 04/04/03 14:32, Tom Lane wrote: >> i'm positive that there is a much more recent release. this server is >> sadly, very poorly maintained by the customer. do you feel that the >> segfaults generated in psql are blocking any possible solution to the >> pg_dump failures, or can you/I work on a solution for that indpendently of >> resolving the segfaults? > > Assuming that I've correctly diagnosed the segfault, it will only occur > if you try to print an extremely wide value in psql. It's pretty easy > to avoid doing so, though. For example, you could query for > LENGTH(bin_data) or a SUBSTRING() of it, instead of printing the whole > field value. AFAIR, 7.2 does not have any optimizations that would > skip fetching the whole toasted value when it is the argument of one of > these functions, so "select length(bin_data) ..." should be just as > useful for checking for brokenness as "select bin_data ...". Hi Tom, Following your suggestion, i pinned down the following: sfee=# select LENGTH(bin_data) from artifact_file LIMIT 1 OFFSET 693; ERROR: missing chunk number 1 for toast value 7685119 However, what is very odd, is that I can view the entire contents of that row's particular bin_data column if i do a normal select: select bin_data from artifact_file LIMIT 1 OFFSET 693; and it completes with no errors. After more testing, it appears that the 'select LENGTH(bin_data) ...' for any offset greater than or equal to 693 dies with the same 'missing chunk...' error. any suggestions on where to go from here? thanks -Lonni
Lonni Friedman wrote: > > On 04/04/03 14:32, Tom Lane wrote: > >> i'm positive that there is a much more recent release. this server is > >> sadly, very poorly maintained by the customer. do you feel that the > >> segfaults generated in psql are blocking any possible solution to the > >> pg_dump failures, or can you/I work on a solution for that indpendently of > >> resolving the segfaults? > > > > Assuming that I've correctly diagnosed the segfault, it will only occur > > if you try to print an extremely wide value in psql. It's pretty easy > > to avoid doing so, though. For example, you could query for > > LENGTH(bin_data) or a SUBSTRING() of it, instead of printing the whole > > field value. AFAIR, 7.2 does not have any optimizations that would > > skip fetching the whole toasted value when it is the argument of one of > > these functions, so "select length(bin_data) ..." should be just as > > useful for checking for brokenness as "select bin_data ...". > > Hi Tom, > Following your suggestion, i pinned down the following: > > sfee=# select LENGTH(bin_data) from artifact_file LIMIT 1 OFFSET 693; > ERROR: missing chunk number 1 for toast value 7685119 > > However, what is very odd, is that I can view the entire contents of > that row's particular bin_data column if i do a normal select: > select bin_data from artifact_file LIMIT 1 OFFSET 693; > > and it completes with no errors. After more testing, it appears that > the 'select LENGTH(bin_data) ...' for any offset greater than or equal > to 693 dies with the same 'missing chunk...' error. > > any suggestions on where to go from here? From that I would assume that the row at offset 693 is totally intact but the one at 694 is the one damaged. LIMIT reads one more row internally than you asked for. Get the key of the row at 694 (and others you might find) and continue the work using those keys. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
On 04/06/03 14:50, Jan Wieck wrote: > Lonni Friedman wrote: >> On 04/04/03 14:32, Tom Lane wrote: >> >> i'm positive that there is a much more recent release. this server is >> >> sadly, very poorly maintained by the customer. do you feel that the >> >> segfaults generated in psql are blocking any possible solution to the >> >> pg_dump failures, or can you/I work on a solution for that indpendently of >> >> resolving the segfaults? >> > >> > Assuming that I've correctly diagnosed the segfault, it will only occur >> > if you try to print an extremely wide value in psql. It's pretty easy >> > to avoid doing so, though. For example, you could query for >> > LENGTH(bin_data) or a SUBSTRING() of it, instead of printing the whole >> > field value. AFAIR, 7.2 does not have any optimizations that would >> > skip fetching the whole toasted value when it is the argument of one of >> > these functions, so "select length(bin_data) ..." should be just as >> > useful for checking for brokenness as "select bin_data ...". >> >> Hi Tom, >> Following your suggestion, i pinned down the following: >> >> sfee=# select LENGTH(bin_data) from artifact_file LIMIT 1 OFFSET 693; >> ERROR: missing chunk number 1 for toast value 7685119 >> >> However, what is very odd, is that I can view the entire contents of >> that row's particular bin_data column if i do a normal select: >> select bin_data from artifact_file LIMIT 1 OFFSET 693; >> >> and it completes with no errors. After more testing, it appears that >> the 'select LENGTH(bin_data) ...' for any offset greater than or equal >> to 693 dies with the same 'missing chunk...' error. >> >> any suggestions on where to go from here? > >From that I would assume that the row at offset 693 is totally intact > but the one at 694 is the one damaged. LIMIT reads one more row > internally than you asked for. > > Get the key of the row at 694 (and others you might find) and continue > the work using those keys. I'm not sure that I understand what it is that I need to do now: sfee=# select bin_data from artifact_file LIMIT 1 OFFSET 694; ERROR: missing chunk number 1 for toast value 7685119 sfee=# select LENGTH(bin_data) from artifact_file LIMIT 1 OFFSET 694; ERROR: missing chunk number 1 for toast value 7685119 How would i get the key of row 694?
Lonni Friedman wrote: > > On 04/06/03 14:50, Jan Wieck wrote: > >>From that I would assume that the row at offset 693 is totally intact > > but the one at 694 is the one damaged. LIMIT reads one more row > > internally than you asked for. > > > > Get the key of the row at 694 (and others you might find) and continue > > the work using those keys. > > I'm not sure that I understand what it is that I need to do now: > > sfee=# select bin_data from artifact_file LIMIT 1 OFFSET 694; > ERROR: missing chunk number 1 for toast value 7685119 > sfee=# select LENGTH(bin_data) from artifact_file LIMIT 1 OFFSET 694; > ERROR: missing chunk number 1 for toast value 7685119 > > How would i get the key of row 694? does that table have any OTHER fields than the one that is knowingly corrupted? Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
On 04/06/03 17:28, Jan Wieck wrote: > Lonni Friedman wrote: >> >> On 04/06/03 14:50, Jan Wieck wrote: >> >>From that I would assume that the row at offset 693 is totally intact >> > but the one at 694 is the one damaged. LIMIT reads one more row >> > internally than you asked for. >> > >> > Get the key of the row at 694 (and others you might find) and continue >> > the work using those keys. >> >> I'm not sure that I understand what it is that I need to do now: >> >> sfee=# select bin_data from artifact_file LIMIT 1 OFFSET 694; >> ERROR: missing chunk number 1 for toast value 7685119 >> sfee=# select LENGTH(bin_data) from artifact_file LIMIT 1 OFFSET 694; >> ERROR: missing chunk number 1 for toast value 7685119 >> >> How would i get the key of row 694? > > does that table have any OTHER fields than the one that is knowingly > corrupted? yes, its got several other integer & text datatype fields.
Lonni Friedman wrote: > yes, its got several other integer & text datatype fields. Lonni, just to let you know, we have all the time in the world and after all it is your problem. So if you slow down the process by giving the informaiton one drop at a time, we can certainly play that game. Just don't blame us that it takes that long then. Is there any of those other columns that could identify a single row uniquely? If so, that could work as a key for this purpose. If you have that, let's call it <keycolumn>. Then do a select <keycolumn> from artifact_file limit 1 offset 694. The key you've got now will be most likely the row containing the corrupted toast value. To check that it is the only one, let's say you got <keyvalue> from the query above. Now do a select sum(length(bin_data)) from artifact_file where <keycolumn> != <keyvalue> If that completes without an error, you found a way to identify the damaged row without using LIMIT. I think Tom already explained what can be done from here. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
On Mon, 07 Apr 2003 07:34:47 -0400, Jan Wieck <JanWieck@Yahoo.com> wrote: >Is there any of those other columns that could identify a single row >uniquely? If so, that could work as a key for this purpose. > >If you have that, let's call it <keycolumn>. Lonni, let me add that if there is no such *user* column, you can always use the system column ctid: SELECT ctid FROM artifact_file LIMIT 1 OFFSET 694; ctid -------- (42,7) Now check whether this is really your broken tuple: SELECT length(bin_data) FROM artifact_file WHERE ctid = '(42,7)'; -- should give an ERROR ... and that all other tuples are ok as suggested by Jan: SELECT sum(length(bin_data)) FROM artifact_file WHERE NOT ctid = '(42,7)'; -- should work Note that you have to use 'NOT ctid =', because operator '!=' is not defined for datatype tid. Servus Manfred
On Mon, 7 Apr 2003, Manfred Koizar wrote: > On Mon, 07 Apr 2003 07:34:47 -0400, Jan Wieck <JanWieck@Yahoo.com> > wrote: > >Is there any of those other columns that could identify a single row > >uniquely? If so, that could work as a key for this purpose. > > > >If you have that, let's call it <keycolumn>. > > Lonni, > > let me add that if there is no such *user* column, you can always use > the system column ctid: > > SELECT ctid FROM artifact_file LIMIT 1 OFFSET 694; > > ctid > -------- > (42,7) > > Now check whether this is really your broken tuple: > > SELECT length(bin_data) FROM artifact_file > WHERE ctid = '(42,7)'; -- should give an ERROR > > ... and that all other tuples are ok as suggested by Jan: > > SELECT sum(length(bin_data)) FROM artifact_file > WHERE NOT ctid = '(42,7)'; -- should work > > Note that you have to use 'NOT ctid =', because operator '!=' is not > defined for datatype tid. thanks for your reply. # select ctid from artifact_file LIMIT 1 OFFSET 694; ctid --------- (16,33) # select LENGTH(bin_data) from artifact_file where ctid='(16,33)'; ERROR: missing chunk number 1 for toast value 7685119 # select sum(length(bin_data)) from artifact_file where NOT ctid='(16,33)'; sum ----------- 293963428 So, where do i go from here? Is it now just a matter of nullifying bin_data for that particular row, or is it more involved? thanks. -Lonni
Lonni J Friedman wrote: > > On Mon, 7 Apr 2003, Manfred Koizar wrote: > > On Mon, 07 Apr 2003 07:34:47 -0400, Jan Wieck <JanWieck@Yahoo.com> > > wrote: > > [...] > > # select ctid from artifact_file LIMIT 1 OFFSET 694; > ctid > --------- > (16,33) > > # select LENGTH(bin_data) from artifact_file where ctid='(16,33)'; > ERROR: missing chunk number 1 for toast value 7685119 > > # select sum(length(bin_data)) from artifact_file where NOT > ctid='(16,33)'; > sum > ----------- > 293963428 > > So, where do i go from here? Is it now just a matter of nullifying > bin_data for that particular row, or is it more involved? thanks. I think an UPDATE artifact_file SET bin_data = '' WHERE ctid = '(16,33)'; should make a SELECT sum(length(bin_data)) FROM artifact_file; work again. If that is the case, dump the whole database system with pg_dumpall. If and only if that is okay too, save the current pg_hba.conf and postgres.conf files, remove the whole PGDATA directory, run initdb, restore the dump and finally restore the pg_hba.conf and postgres.conf files. Another question is how did it come to the corrupted TOAST table. Have you run test programs that check for bad memory modules? Have you checked your harddisks for badblocks? Are you running an overclocked system? So basically, can you trust your hardware as far as you can throw it? Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #