Thread: ERROR: missing chunk number 0 for toast value 382548694 in pg_toast_847386

ERROR: missing chunk number 0 for toast value 382548694 in pg_toast_847386

From

Michael Harris

Date:

27 February 2011, 23:40:15

Hi,

We have a PG 8.4 database approx 5TB in size.

We were recently testing our restore procedure against our latest dump. The dumps are taken using the Continuous
Archivingmethod with base dumps taken using tar. Our tar script is set up to ignore missing/modified files but should
stopon all other errors. 

We are testing the restore on a completely separate machine of a similar spec to the server on which the dumps are
made.

On our first attempt we were using PG 8.4.0. We struck problems on restore with the error 'Unexpected timeline ID 0 in
logfile'. After some googling we identified that this was a known bug in PG 8.4.0 so we upgraded the target machine to
8.4.7and began the recovery again. 

On the second attempt the recovery went through smoothly. However, after starting the application again we started to
getthe following error: 

ERROR:  missing chunk number 0 for toast value 382548694 in pg_toast_847386

The table it is trying to access is one in which our application stores various schema information that is used each
timethe application starts up. 

As far as we know the base dump is good and the DB did not complain about any of the WAL files. We are now nervous that
ourbackups are no good, so we want to get to the bottom of it. What can be the reason that these backups are not
restoringproperly? 

Thanks in advance,

Regards
Mike Harris

Re: ERROR: missing chunk number 0 for toast value 382548694 in pg_toast_847386

From

Vibhor Kumar

Date:

28 February 2011, 03:06:01

On Feb 28, 2011, at 5:55 AM, Michael Harris wrote:

> ERROR:  missing chunk number 0 for toast value 382548694 in pg_toast_847386



This seems more like a corrupted toast table.

Did you try to reindex the pg_toast_847386?
REINDEX table pg_toast.pg_toast_847386;
VACUUM ANALYZE <tablename>;


Thanks & Regards,
Vibhor Kumar
vibhor.kumar@enterprisedb.com
Blog:http://vibhork.blogspot.com

Re: ERROR: missing chunk number 0 for toast value 382548694 in pg_toast_847386

From

Michael Harris

Date:

28 February 2011, 03:16:58

>> ERROR:  missing chunk number 0 for toast value 382548694 in
>> pg_toast_847386
>
> This seems more like a corrupted toast table.
>
> Did you try to reindex the pg_toast_847386?
> REINDEX table pg_toast.pg_toast_847386;
> VACUUM ANALYZE <tablename>;

Hi Vibhor,

Thanks for the suggestion.

We didn't try that yet, even though we did see others recommending this as a solution for similar corruptions in the
past. 

The main reason we have not gone down that path that after getting this error we do not have any confidence in the
integrityof the rest of the database after performing the restore - maybe there are many more tables with corruption in
them.

We are mainly concerned that our backups are valid. The fault does not seem to be present on the original database,
onlyafter restoring from the base backup + PITR. 

Regards
Mike Harris

Re: ERROR: missing chunk number 0 for toast value 382548694 in pg_toast_847386

From

Vibhor Kumar

Date:

28 February 2011, 03:24:22

On Feb 28, 2011, at 9:46 AM, Michael Harris wrote:

> The main reason we have not gone down that path that after getting this error we do not have any confidence in the
integrityof the rest of the database after performing the restore - maybe there are many more tables with corruption in
them.
May be or may not be.

Did you find anything suspicious in pg_log?
Please share recovery.conf information.


> We are mainly concerned that our backups are valid. The fault does not seem to be present on the original database,
onlyafter restoring from the base backup + PITR. 


Did you verify base backup (Restoring without PITR) ?

Thanks & Regards,
Vibhor Kumar
vibhor.kumar@enterprisedb.com
Blog:http://vibhork.blogspot.com

Re: ERROR: missing chunk number 0 for toast value 382548694 in pg_toast_847386

From

Michael Harris

Date:

28 February 2011, 03:44:15

Hi Vibhor,

>> Did you find anything suspicious in pg_log?

We've been through it all and did not see anything we didn't expect.

>> Please share recovery.conf information.

We did interrupt the restore a few times. The initial recovery.conf file contained only:

restore_command = 'gunzip -c /mnt/dbsbackup/pg_xlog/%f.gz > %p'

Later we decided to replace the recovery command with a wrapper script that would allow us to leave the restore going
unattendedover the weekend, and complete up until the latest WAL file on the original database (which is still
running).We changed the recovery command to: 

restore_command = '/var/lib/pgsql/data/db_restore_dm %f %p'

where the script db_restore_dm contained:

#!/usr/bin/perl

use strict;

my ($pg_f, $pg_p) = @ARGV;
exit 1 if $pg_f eq '00000001.history';

my $xlogBackupFile = "/mnt/dbsbackup/pg_xlog/$pg_f.gz";

while (! -f $xlogBackupFile and !$triggered) {
        sleep 2;
}

while (1) {
  system("gunzip -c $xlogBackupFile > $pg_p");
  last if ($? >> 8 == 0);
  sleep 2;
}

We were concerned that shutting down / starting up while recovery is ongoing might cause some problems, but the pg
documentationindicates this should be OK and we saw no cause for concern in the pg logs. 

> Did you verify base backup (Restoring without PITR) ?

I guess you mean did we restore it up until consistency reached? No so far we have continued restoration until we
reachedthe last WAL file made by the original database. 

Regards // Mike


-----Original Message-----
From: Vibhor Kumar [mailto:vibhor.kumar@enterprisedb.com]
Sent: Monday, 28 February 2011 3:25 PM
To: Michael Harris
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] ERROR: missing chunk number 0 for toast value 382548694 in pg_toast_847386


On Feb 28, 2011, at 9:46 AM, Michael Harris wrote:

> The main reason we have not gone down that path that after getting this error we do not have any confidence in the
integrityof the rest of the database after performing the restore - maybe there are many more tables with corruption in
them.
May be or may not be.

Did you find anything suspicious in pg_log?
Please share recovery.conf information.


> We are mainly concerned that our backups are valid. The fault does not seem to be present on the original database,
onlyafter restoring from the base backup + PITR. 


Did you verify base backup (Restoring without PITR) ?

Thanks & Regards,
Vibhor Kumar
vibhor.kumar@enterprisedb.com
Blog:http://vibhork.blogspot.com

Re: ERROR: missing chunk number 0 for toast value 382548694 in pg_toast_847386

From

Vibhor Kumar

Date:

28 February 2011, 04:56:40

On Feb 28, 2011, at 10:13 AM, Michael Harris wrote:

>>>
>>> Did you find anything suspicious in pg_log?
>
> We've been through it all and did not see anything we didn't expect.
>
>>> Please share recovery.conf information.
>
> We did interrupt the restore a few times. The initial recovery.conf file contained only:
>
> restore_command = 'gunzip -c /mnt/dbsbackup/pg_xlog/%f.gz > %p'
>
> Later we decided to replace the recovery command with a wrapper script that would allow us to leave the restore going
unattendedover the weekend, and complete up until the latest WAL file on the original database (which is still
running).We changed the recovery command to: 
>
> restore_command = '/var/lib/pgsql/data/db_restore_dm %f %p'
>
> where the script db_restore_dm contained:
>
> #!/usr/bin/perl
>
> use strict;
>
> my ($pg_f, $pg_p) = @ARGV;
> exit 1 if $pg_f eq '00000001.history';
>
> my $xlogBackupFile = "/mnt/dbsbackup/pg_xlog/$pg_f.gz";
>
> while (! -f $xlogBackupFile and !$triggered) {
>        sleep 2;
> }
>
> while (1) {
>  system("gunzip -c $xlogBackupFile > $pg_p");
>  last if ($? >> 8 == 0);
>  sleep 2;
> }

Not sure about above wrapper function. However, if you can share some information from pg_log when you have started the
restorewith backup_label information. 

Try following steps:
1. Untar all the gzipped WAL File in One Location
2. Use Following restore command:
    cp <WAL Location>/%f %p


> We were concerned that shutting down / starting up while recovery is ongoing might cause some problems, but the pg
documentationindicates this should be OK and we saw no cause for concern in the pg logs. 

What options have you used for shutting down?

Thanks & Regards,
Vibhor Kumar
vibhor.kumar@enterprisedb.com
Blog:http://vibhork.blogspot.com

Re: ERROR: missing chunk number 0 for toast value 382548694 in pg_toast_847386

From

Michael Harris

Date:

28 February 2011, 05:27:21

Hi Vibhor,

>> Not sure about above wrapper function. However, if you can share some
>> information from pg_log when you have started the restore with
>> backup_label information.

Here it is at the beginning:
[2011-02-25 09:40:11 EST] LOG:  database system was interrupted; last known up at 2011-02-01 01:04:12 EST
[2011-02-25 09:40:11 EST] LOG:  starting archive recovery
[2011-02-25 09:40:11 EST] LOG:  restore_command = 'gunzip -c /mnt/dbsbackup/pg_xlog/%f.gz > %p'
gunzip: /mnt/dbsbackup/pg_xlog/00000001.history.gz: No such file or directory
[2011-02-25 09:40:12 EST] LOG:  restored log file "0000000100006F720000006F.00024AE0.backup" from archive
[2011-02-25 09:40:12 EST] LOG:  restored log file "0000000100006F7200000094" from archive
[2011-02-25 09:40:12 EST] LOG:  restored log file "0000000100006F720000006F" from archive
[2011-02-25 09:40:12 EST] LOG:  automatic recovery in progress
[2011-02-25 09:40:12 EST] LOG:  redo starts at 6F72/6F024AE0, consistency will be reached at 70B1/B75C9AF0
[2011-02-25 09:40:14 EST] LOG:  restored log file "0000000100006F7200000070" from archive
[2011-02-25 09:40:14 EST] LOG:  restored log file "0000000100006F7200000071" from archive
[2011-02-25 09:40:15 EST] LOG:  restored log file "0000000100006F7200000072" from archive

...etc...

We did reach consistency eventually:

[2011-02-25 21:29:28 EST] LOG:  restored log file "00000001000070B1000000B5" from archive
[2011-02-25 21:29:29 EST] LOG:  restored log file "00000001000070B1000000B6" from archive
[2011-02-25 21:29:29 EST] LOG:  restored log file "00000001000070B1000000B7" from archive
[2011-02-25 21:29:29 EST] LOG:  consistent recovery state reached
[2011-02-25 21:29:29 EST] LOG:  restored log file "00000001000070B1000000B8" from archive
[2011-02-25 21:29:30 EST] LOG:  restored log file "00000001000070B1000000B9" from archive
...

>> Try following steps:
>> 1. Untar all the gzipped WAL File in One Location
>> 2. Use Following restore command:
>>     cp <WAL Location>/%f %p

Unfortunately it is not practical to do this, since there are many terabytes of WAL files. They are available on an NFS
mountedvolume which is accessible from the target machine. 

What we can do (and will do if no better suggestions come forward) is to restore again using only the PITR files needed
toachieve consistency, ie. up to 70B1/B75C9AF0, a much smaller quantity of WAL files. 

>> We were concerned that shutting down / starting up while recovery is
>> ongoing might cause some problems, but the pg documentation indicates
>> this should be OK and we saw no cause for concern in the pg logs.
> What options have you used for shutting down?

We are using Centos and used the supplied init scripts - the actual command is: service postgresql stop, which
translatesto: 

    pg_ctl stop -D '$PGDATA' -s -m fast

Regards
Mike Harris