base backup requirements for PITR - Mailing list pgsql-admin

From Andrew W. Gibbs
Subject base backup requirements for PITR
Date
Msg-id 20121129015217.GA9975@raptor.commandosoftware.com
Whole thread Raw
Responses Re: base backup requirements for PITR
List pgsql-admin
Postgres Admin Wizards,

I am trying to manage an 8.4 database whose ingest rate is
sufficiently high that doing base backups on it causes it to fall
unacceptably behind.  To deal with this, I want to do WAL shipping to
another server that is in warm standby mode, and then on that server
perform the base backups in order to effect log change accumulation
without hosing the master server.

Once upon a time, months ago, before the load on the system made
creating a base backup completely untenable, I configured an
archive_command, did a pg_start_backup, kicked off a tar operation on
the data directory, and after the tar operation's completion ran
pg_stop_backup.  Last week, I was finally allocated a new server on
which to create a warm standby, so I unpacked the tarball I created,
installed a recovery.conf with a restore_command, and started the
server up.  It has been methodically chewing through the WAL segments,
and a little over a week later it has gotten through nearly all three
months' worth of WAL segments (I know, miserable, and hopefully never
an experience I will repeat).  I take this to mean that I did things
correctly and that shortly I will end up with a warm standby server,
on the assumption that if I had botched the creation of the base
backup things would have bombed out very quickly.

That said, I'm wondering whether I created the base backup with
reliably correct procedures or merely was lucky.  It's not exactly
clear to me when I turned on the warm standby server how it knew where
in the WAL to start processing data.  As I understand things, when
Postgres experiences a hard crash, it can rely on the contents of
global/pg_control to provide the value of the last checkpoint's REDO
location, from which it can start replaying WAL records.  But what
happens if you're in the middle of creating a base backup and Postgres
forces a checkpoint, possibly causing global/pg_control to be changed
at some indeterminate point during tar's execution?  And are the
concerns different for the scenario in which you're running
pg_start_backup/tar/pg_stop_backup on a master server and the scenario
in which you're just running tar on a warm standby server (because you
can't actually log into a warm standby server)?  In other words, is
pg_start_backup/pg_stop_backup doing something magical do give the
eventual recovery process a heads up about what to do?

When I first started puzzling through this, I figured that all I had
to do was run pg_controldata on the warm standby, parse out the latest
checkpoint's REDO address, calculate the corresponding file, and then
assume that I could purge my WAL archive of any files preceding that.
But then I started to worry about whether upon startup of a server in
recovery mode using a tarball that I had generated from a warm standby
Postgres would know where to start in the WAL when invoking my
restore_command.  Looking through emails from the archives spanning
the past decade, I've seen at least one saying that I need to make
sure that global/pg_control is archived first and one that says it
needs to be archived last when I'm creating a tarball on a warm
standby server, and I really can't tell what the requirement is, if
any, on 8.4, and I also can't tell, if there is such a requirement for
a warm standby server, whether such a requirement is obviated on a
master server by virtue of things that pg_start_backup/pg_stop_backup
might be doing.

So, can someone clarify the following?

1) When Postgres starts up in recovery mode (i.e. with a
recovery.conf), how does it know where in the WAL to start replaying
records?

2) Beyond just running a tar command and making sure that I keep
around the required WAL files, what constraints are there on creating
a tarball for the base backup?  Do I need to do anything in particular
to ensure that the resultant warm standby server, upon startup, knows
where in the WAL to start replay?  Is the point at which
global/pg_control is picked up by tar critical to the success of a
server in recovery mode coming online in a consistent state, or is
Postgres doing something else involving log sequence numbers stamped
onto data files to figure out what to do?

Many thanks for your guidance...

  -- AWG


pgsql-admin by date:

Previous
From: Steve Crawford
Date:
Subject: Re: Fwd: Monitoring Replication on Master/Slave Postgres(9.1)
Next
From: Gabriel Muñoz
Date:
Subject: Re: Fwd: Monitoring Replication on Master/Slave Postgres(9.1)