Thread: Corrupt Incrementally Updated Backup: missing pg_clog file
Hi all, I have a problem with a corrupt backup, fortunately I was only testing so I did not loose any data. Unfortunetely what I did is to follow the backup guidelines in the documentation, which I thought should work reliably. Here are the details: I am running a postgreSQL 8.4 database on a Debian Squeeze system. For Backups I am using the warm standby and "Incrementally Updated Backup" method as described in chapter 24.4 of the documentation. So my Setup is as follows: Server 1 (Main): PostgreSQL 8.4 Database with archive_mode enabled shipping WAL files to a NFS drive. Size of database is about 370 GB and growing. Server 2 (Replica): PostgreSQL 8.4 Database in recovery mode. Using pg_standby in recovery.conf and getting WAL files from Server 1 NFS drive. All this works fine and runs without errors. The replica is backed up once a week using rsync, a full backup runs about 10 hours, so I also keep at least 24h of WAL files to make sure I have a consistent backup. The backup process also runs fine without errors, only the time (10h) it takes is quite long, so I decided to test the backup: 1) Restored the full backup to a test directory (var/lib/postgresql/8.4/test) 2) Copied the configuration of the main server to /etc/postgresql/8.4/test/ Altered port number, paths and turned off archive mode in postgresql.conf. 3) Added a recovery.conf in the test servers data dir. Recovering from my backed up WAL files: restore_command = 'cp /var/postgresql-wal-test/%f "%p"' 4) Started the test server (pg_ctlcluster 8.4 test start) 5) Waited until recovery was done (everything worked fine until then) 2012-09-25 08:26:41 UTC LOG: database system is ready to accept connections 2012-09-25 08:26:41 UTC LOG: autovacuum launcher started 6) Connected via psql to the database and tried a \d to see my tables which did *not* work! Here is the output: 2012-09-25 08:27:03 UTC ERROR: could not access status of transaction 500185903 2012-09-25 08:27:03 UTC DETAIL: Could not open file "pg_clog/01DD": No such file or directory. Also trying to SELECT data from the database tables failed with the same error. The backup is corrupt. So my question is, what went wrong: Obviously as the rsync started it copied everything from the pg_clog (which at this point was until pg_clog/01DC) and then went on for another 10+ hours backing up all the rest of the database. At the time the backup ended, the database content changed but the newer clog files did not go into the backup. When restoring the backup and starting the server, the recovery process started at a point where pg_clog was at state 01DE or even further and thus the data from 01DD was missing. So what I do from now, is an extra daily backup of my clog directory to make sure to have working backups. This is not documented in the postgreSQL documentation, and since the result in not doing so can be quite severe I think you should consider this in future PostgreSQL documentation versions. Regards, Juergen Additional information: The EXACT PostgreSQL version you are running: PostgreSQL 8.4.13 on i486-pc-linux-gnu, compiled by GCC gcc-4.4.real (Debian 4.4.5-8) 4.4.5, 32-bit How you installed PostgreSQL From Linux distro package management: Debian/Aptitude If so, what repository? deb http://ftp.at.debian.org/debian/ squeeze main non-free contrib deb-src http://ftp.at.debian.org/debian/ squeeze main non-free contrib deb http://security.debian.org/ squeeze/updates main contrib non-free deb-src http://security.debian.org/ squeeze/updates main contrib non-free deb http://ftp.debian.org/debian squeeze-updates main contrib non-free deb-src http://ftp.debian.org/debian squeeze-updates main contrib non-free Changes made to the settings in the postgresql.conf file: name | current_setting ----------------------+------------------------------------------------- version | PostgreSQL 8.4.13 on i486-pc-linux-gnu, compiled by GCC gcc-4.4.real (Debian 4.4.5-8) 4.4.5, 32-bit archive_command | cp -i %p /var/postgres-wal/%f </dev/null && cp -i %p /var/postgres-wal/bak/%f </dev/null && gzip /var/postgres-wal/bak/%f archive_mode | on archive_timeout | 0 client_encoding | utf8 effective_cache_size | 1000MB lc_collate | en_US.UTF-8 lc_ctype | en_US.UTF-8 listen_addresses | * log_line_prefix | %t maintenance_work_mem | 256MB max_connections | 100 max_stack_depth | 2MB password_encryption | on port | 5432 server_encoding | UTF8 shared_buffers | 650MB ssl | on synchronous_commit | off TimeZone | UTC work_mem | 40MB Operating system and version Linux distro and version: Debian 6.0.6 (squeeze) Kernel details: Linux wegc203094 2.6.32-5-686 #1 SMP Sun May 6 04:01:19 UTC 2012 i686 GNU/Linux What program you're using to connect to PostgreSQL: psql and phpPgAdmin -- | Juergen Fuchsberger | Wegener Center for Climate and Global Change | Karl-Franzens-University Graz | Leechgasse 25, A-8010 Graz | phone: +43-316-380-8438 | fax: +43-316-380-9830 | eMail: juergen.fuchsberger@uni-graz.at | web: www.wegcenter.at
Hi Jürgen, am 31.10.2012 um 15:24 schrieb Jürgen Fuchsberger <juergen.fuchsberger@uni-graz.at>: > The replica is backed up once a week using rsync, a full backup runs about 10 hours, so I also keep at least 24h of WALfiles to make sure I have a consistent backup. > > The backup process also runs fine without errors, only the time (10h) it takes is quite long, so I decided to test thebackup: [...] > The backup is corrupt. So my question is, what went wrong: > Obviously as the rsync started it copied everything from the pg_clog (which at this point was until pg_clog/01DC) and thenwent on for another 10+ hours backing up all the rest of the database. At the time the backup ended, the database contentchanged but the newer clog files did not go into the backup. > When restoring the backup and starting the server, the recovery process started at a point where pg_clog was at state 01DEor even further and thus the data from 01DD was missing. It sounds as if the standby server is running while you take its backup. I'm not sure it is possible to make a consistentbackup this way if this is the case. I would stop the standby, make the backup and then continue. Regards, Ralf
Attachment
On Wed, Oct 31, 2012 at 7:24 AM, Jürgen Fuchsberger <juergen.fuchsberger@uni-graz.at> wrote: > Hi all, > > I have a problem with a corrupt backup, fortunately I was only testing so I > did not loose any data. Unfortunetely what I did is to follow the backup > guidelines in the documentation, which I thought should work reliably. Here > are the details: > > I am running a postgreSQL 8.4 database on a Debian Squeeze system. For > Backups I am using the warm standby and "Incrementally Updated Backup" > method as described in chapter 24.4 of the documentation. So my Setup is as > follows: > > Server 1 (Main): PostgreSQL 8.4 Database with archive_mode enabled shipping > WAL files to a NFS drive. Size of database is about 370 GB and growing. > > Server 2 (Replica): PostgreSQL 8.4 Database in recovery mode. Using > pg_standby in recovery.conf and getting WAL files from Server 1 NFS drive. > > All this works fine and runs without errors. > > The replica is backed up once a week using rsync, You can not safely backup a running server using rsync, except as described in the documentation. For this purpose, it does not matter that the running server is a warm standby. > a full backup runs about > 10 hours, so I also keep at least 24h of WAL files to make sure I have a > consistent backup. > > The backup process also runs fine without errors, only the time (10h) it > takes is quite long, so I decided to test the backup: Since the backup process could not be restored, then it was *not* running fine. It was good that you tested this, but I don't understand your reasoning behind it. You test a backup to make sure it works, but because conducting one is slow. .... > The backup is corrupt. So my question is, what went wrong: > Obviously as the rsync started it copied everything from the pg_clog (which > at this point was until pg_clog/01DC) and then went on for another 10+ hours > backing up all the rest of the database. At the time the backup ended, the > database content changed but the newer clog files did not go into the > backup. > When restoring the backup and starting the server, the recovery process > started at a point where pg_clog was at state 01DE or even further and thus > the data from 01DD was missing. And heaven knows what else is missing. > > So what I do from now, is an extra daily backup of my clog directory to make > sure to have working backups. I think you are inventing ever more clever ways to destroy your data. Doing this is unlikely to protect your data, but will simply corrupt it in ways that are harder to detect. > > Changes made to the settings in the postgresql.conf file: > name | current_setting > ----------------------+------------------------------------------------- > version | PostgreSQL 8.4.13 on i486-pc-linux-gnu, compiled by > GCC gcc-4.4.real (Debian 4.4.5-8) 4.4.5, 32-bit > archive_command | cp -i %p /var/postgres-wal/%f </dev/null && cp -i %p > /var/postgres-wal/bak/%f </dev/null && gzip /var/postgres-wal/bak/%f I don't think that cp -i ... </dev/null does what you want on Debian, in that it returns 0 even if it failed due to refusal to overwrite. Cheers, Jeff
Jeff Janes wrote: > On Wed, Oct 31, 2012 at 7:24 AM, Jürgen Fuchsberger <juergen.fuchsberger@uni-graz.at> wrote: > > I have a problem with a corrupt backup, fortunately I was only testing so I > > did not loose any data. Unfortunetely what I did is to follow the backup > > guidelines in the documentation, which I thought should work reliably. Here > > are the details: > > > > I am running a postgreSQL 8.4 database on a Debian Squeeze system. For > > Backups I am using the warm standby and "Incrementally Updated Backup" > > method as described in chapter 24.4 of the documentation. So my Setup is as > > follows: > > > > Server 1 (Main): PostgreSQL 8.4 Database with archive_mode enabled shipping > > WAL files to a NFS drive. Size of database is about 370 GB and growing. > > > > Server 2 (Replica): PostgreSQL 8.4 Database in recovery mode. Using > > pg_standby in recovery.conf and getting WAL files from Server 1 NFS drive. > > > > All this works fine and runs without errors. > > > > The replica is backed up once a week using rsync, > > You can not safely backup a running server using rsync, except as > described in the documentation. > > For this purpose, it does not matter that the running server is a warm standby. I case that was ambiguous: You cannot take a correct backup of a log shipping standby server in 8.4. Yours, Laurenz Albe