Thread: Corrupt Incrementally Updated Backup: missing pg_clog file

Corrupt Incrementally Updated Backup: missing pg_clog file

From
Jürgen Fuchsberger
Date:
Hi all,

I have a problem with a corrupt backup, fortunately I was only testing
so I did not loose any data. Unfortunetely what I did is to follow the
backup guidelines in the documentation, which I thought should work
reliably. Here are the details:

I am running a postgreSQL 8.4 database on a Debian Squeeze system. For
Backups I am using the warm standby and "Incrementally Updated Backup"
method as described in chapter 24.4 of the documentation. So my Setup is
as follows:

Server 1 (Main): PostgreSQL 8.4 Database with archive_mode enabled
shipping WAL files to a NFS drive. Size of database is about 370 GB and
growing.

Server 2 (Replica): PostgreSQL 8.4 Database in recovery mode. Using
pg_standby in recovery.conf and getting WAL files from Server 1 NFS drive.

All this works fine and runs without errors.

The replica is backed up once a week using rsync, a full backup runs
about 10 hours, so I also keep at least 24h of WAL files to make sure I
have a consistent backup.

The backup process also runs fine without errors, only the time (10h) it
takes is quite long, so I decided to test the backup:

1) Restored the full backup to a test directory
(var/lib/postgresql/8.4/test)

2) Copied the configuration of the main server to
/etc/postgresql/8.4/test/
Altered port number, paths and turned off archive mode in postgresql.conf.

3) Added a recovery.conf in the test servers data dir. Recovering from
my backed up WAL files:
restore_command = 'cp /var/postgresql-wal-test/%f "%p"'

4) Started the test server (pg_ctlcluster 8.4 test start)

5) Waited until recovery was done (everything worked fine until then)
2012-09-25 08:26:41 UTC LOG:  database system is ready to accept connections
2012-09-25 08:26:41 UTC LOG:  autovacuum launcher started

6) Connected via psql to the database and tried a \d to see my tables
which did *not* work!
Here is the output:
2012-09-25 08:27:03 UTC ERROR:  could not access status of transaction
500185903
2012-09-25 08:27:03 UTC DETAIL:  Could not open file "pg_clog/01DD": No
such file or directory.

Also trying to SELECT data from the database tables failed with the same
error.
The backup is corrupt. So my question is, what went wrong:
Obviously as the rsync started it copied everything from the pg_clog
(which at this point was until pg_clog/01DC) and then went on for
another 10+ hours backing up all the rest of the database. At the time
the backup ended, the database content changed but the newer clog files
did not go into the backup.
When restoring the backup and starting the server, the recovery process
started at a point where pg_clog was at state 01DE or even further and
thus the data from 01DD was missing.

So what I do from now, is an extra daily backup of my clog directory to
make sure to have working backups. This is not documented in the
postgreSQL documentation, and since the result in not doing so can be
quite severe I think you should consider this in future PostgreSQL
documentation versions.

Regards,
Juergen

Additional information:

The EXACT PostgreSQL version you are running:
PostgreSQL 8.4.13 on i486-pc-linux-gnu, compiled by GCC gcc-4.4.real
(Debian 4.4.5-8) 4.4.5, 32-bit

How you installed PostgreSQL

 From Linux distro package management: Debian/Aptitude
If so, what repository?
deb http://ftp.at.debian.org/debian/ squeeze main non-free contrib
deb-src http://ftp.at.debian.org/debian/ squeeze main non-free contrib
deb http://security.debian.org/ squeeze/updates main contrib non-free
deb-src http://security.debian.org/ squeeze/updates main contrib non-free
deb http://ftp.debian.org/debian squeeze-updates main contrib non-free
deb-src http://ftp.debian.org/debian squeeze-updates main contrib non-free

Changes made to the settings in the postgresql.conf file:
name                  |  current_setting
----------------------+-------------------------------------------------
  version              | PostgreSQL 8.4.13 on i486-pc-linux-gnu,
compiled by GCC gcc-4.4.real (Debian 4.4.5-8) 4.4.5, 32-bit
  archive_command      | cp -i %p /var/postgres-wal/%f </dev/null && cp
-i %p /var/postgres-wal/bak/%f </dev/null && gzip /var/postgres-wal/bak/%f
  archive_mode         | on
  archive_timeout      | 0
  client_encoding      | utf8
  effective_cache_size | 1000MB
  lc_collate           | en_US.UTF-8
  lc_ctype             | en_US.UTF-8
  listen_addresses     | *
  log_line_prefix      | %t
  maintenance_work_mem | 256MB
  max_connections      | 100
  max_stack_depth      | 2MB
  password_encryption  | on
  port                 | 5432
  server_encoding      | UTF8
  shared_buffers       | 650MB
  ssl                  | on
  synchronous_commit   | off
  TimeZone             | UTC
  work_mem             | 40MB


Operating system and version
             Linux distro and version:
Debian 6.0.6 (squeeze)
             Kernel details:
Linux wegc203094 2.6.32-5-686 #1 SMP Sun May 6 04:01:19 UTC 2012 i686
GNU/Linux

What program you're using to connect to PostgreSQL:
psql and phpPgAdmin

--
| Juergen Fuchsberger
| Wegener Center for Climate and Global Change
| Karl-Franzens-University Graz
| Leechgasse 25, A-8010 Graz
| phone: +43-316-380-8438
|   fax: +43-316-380-9830
| eMail: juergen.fuchsberger@uni-graz.at
|   web: www.wegcenter.at


Re: Corrupt Incrementally Updated Backup: missing pg_clog file

From
Ralf Schuchardt
Date:
Hi Jürgen,


am 31.10.2012 um 15:24 schrieb Jürgen Fuchsberger <juergen.fuchsberger@uni-graz.at>:

> The replica is backed up once a week using rsync, a full backup runs about 10 hours, so I also keep at least 24h of
WALfiles to make sure I have a consistent backup. 
>
> The backup process also runs fine without errors, only the time (10h) it takes is quite long, so I decided to test
thebackup: 

[...]

> The backup is corrupt. So my question is, what went wrong:
> Obviously as the rsync started it copied everything from the pg_clog (which at this point was until pg_clog/01DC) and
thenwent on for another 10+ hours backing up all the rest of the database. At the time the backup ended, the database
contentchanged but the newer clog files did not go into the backup. 
> When restoring the backup and starting the server, the recovery process started at a point where pg_clog was at state
01DEor even further and thus the data from 01DD was missing. 

It sounds as if the standby server is running while you take its backup. I'm not sure it is possible to make a
consistentbackup this way if this is the case. 
I would stop the standby, make the backup and then continue.


Regards,
Ralf


Attachment

Re: Corrupt Incrementally Updated Backup: missing pg_clog file

From
Jeff Janes
Date:
On Wed, Oct 31, 2012 at 7:24 AM, Jürgen Fuchsberger
<juergen.fuchsberger@uni-graz.at> wrote:
> Hi all,
>
> I have a problem with a corrupt backup, fortunately I was only testing so I
> did not loose any data. Unfortunetely what I did is to follow the backup
> guidelines in the documentation, which I thought should work reliably. Here
> are the details:
>
> I am running a postgreSQL 8.4 database on a Debian Squeeze system. For
> Backups I am using the warm standby and "Incrementally Updated Backup"
> method as described in chapter 24.4 of the documentation. So my Setup is as
> follows:
>
> Server 1 (Main): PostgreSQL 8.4 Database with archive_mode enabled shipping
> WAL files to a NFS drive. Size of database is about 370 GB and growing.
>
> Server 2 (Replica): PostgreSQL 8.4 Database in recovery mode. Using
> pg_standby in recovery.conf and getting WAL files from Server 1 NFS drive.
>
> All this works fine and runs without errors.
>
> The replica is backed up once a week using rsync,

You can not safely backup a running server using rsync, except as
described in the documentation.

For this purpose, it does not matter that the running server is a warm standby.

> a full backup runs about
> 10 hours, so I also keep at least 24h of WAL files to make sure I have a
> consistent backup.
>
> The backup process also runs fine without errors, only the time (10h) it
> takes is quite long, so I decided to test the backup:

Since the backup process could not be restored, then it was *not*
running fine.  It was good that you tested this, but I don't
understand your reasoning behind it.  You test a backup to make sure
it works, but because conducting one is slow.

....

> The backup is corrupt. So my question is, what went wrong:
> Obviously as the rsync started it copied everything from the pg_clog (which
> at this point was until pg_clog/01DC) and then went on for another 10+ hours
> backing up all the rest of the database. At the time the backup ended, the
> database content changed but the newer clog files did not go into the
> backup.
> When restoring the backup and starting the server, the recovery process
> started at a point where pg_clog was at state 01DE or even further and thus
> the data from 01DD was missing.

And heaven knows what else is missing.

>
> So what I do from now, is an extra daily backup of my clog directory to make
> sure to have working backups.

I think you are inventing ever more clever ways to destroy your data.
Doing this is unlikely to protect your data, but will simply corrupt
it in ways that are harder to detect.

>
> Changes made to the settings in the postgresql.conf file:
> name                  |  current_setting
> ----------------------+-------------------------------------------------
>  version              | PostgreSQL 8.4.13 on i486-pc-linux-gnu, compiled by
> GCC gcc-4.4.real (Debian 4.4.5-8) 4.4.5, 32-bit

>  archive_command      | cp -i %p /var/postgres-wal/%f </dev/null && cp -i %p
> /var/postgres-wal/bak/%f </dev/null && gzip /var/postgres-wal/bak/%f

I don't think that cp -i ... </dev/null does what you want on Debian,
in that it returns 0 even if it failed due to refusal to overwrite.

Cheers,

Jeff


Re: Corrupt Incrementally Updated Backup: missing pg_clog file

From
"Albe Laurenz"
Date:
Jeff Janes wrote:
> On Wed, Oct 31, 2012 at 7:24 AM, Jürgen Fuchsberger <juergen.fuchsberger@uni-graz.at> wrote:
> > I have a problem with a corrupt backup, fortunately I was only testing so I
> > did not loose any data. Unfortunetely what I did is to follow the backup
> > guidelines in the documentation, which I thought should work reliably. Here
> > are the details:
> >
> > I am running a postgreSQL 8.4 database on a Debian Squeeze system. For
> > Backups I am using the warm standby and "Incrementally Updated Backup"
> > method as described in chapter 24.4 of the documentation. So my Setup is as
> > follows:
> >
> > Server 1 (Main): PostgreSQL 8.4 Database with archive_mode enabled shipping
> > WAL files to a NFS drive. Size of database is about 370 GB and growing.
> >
> > Server 2 (Replica): PostgreSQL 8.4 Database in recovery mode. Using
> > pg_standby in recovery.conf and getting WAL files from Server 1 NFS drive.
> >
> > All this works fine and runs without errors.
> >
> > The replica is backed up once a week using rsync,
>
> You can not safely backup a running server using rsync, except as
> described in the documentation.
>
> For this purpose, it does not matter that the running server is a warm standby.

I case that was ambiguous:
You cannot take a correct backup of a log shipping standby
server in 8.4.

Yours,
Laurenz Albe