Thread: backup_label during crash recovery: do we know how to solve it?
Reviving a thread that has hit its second birthday: http://archives.postgresql.org/pgsql-hackers/2009-11/msg00024.php In our case not being able to restart Postgres when it has been taken down in the middle of a base backup is starting to manifest as a serious source of downtime: basically, any backend crash or machine restart will cause postgres not to start without human intervention. The message delivered is sufficiently scary and indirect enough (because of the plausible scenarios that could cause corruption if postgres were to make a decision automatically in the most general case) that it's not all that attractive to train a general operator rotation to assess what to do, as it involves reading and then, effectively, ignoring some awfully scary error messages and removing the backup label file. Even if the error messages weren't scary (itself a problem if one comes to the wrong conclusion as a result), the time spent digging around under short notice to confirm what's going on is a high pole in the tent for improving uptime for us, taking an extra five to ten minutes per common encounter. Our problem is compounded by having a lot of databases that take base backups at attenuated rates in an unattended way, and therefore a human who may have been woken up from a sound sleep will have to figure out what was going on before they've reached consciousness, rather than a person with prior knowledge of having started a backup. Also, fairly unremarkable databases can take so long to back up that they may well have a greater than 20% chance of encountering this problem at any particular time: 20% of a day is less than 5 hours per day taken to do on-line backups. Basically, we -- and anyone else with unattended physical backup schemes -- are punished rather severely by the current design. This issue has some more recent related incarnations, even if for different reasons: http://archives.postgresql.org/pgsql-hackers/2011-01/msg00764.php Because backup_label "coming or going?" confusion in Postgres can have serious consequences, I wanted to post to the list first to solicit a minimal design to solve this problem. If it's fairly small in its mechanics then it may yet be feasible for the January CF. -- fdr
On Tue, Nov 29, 2011 at 9:10 PM, Daniel Farina <daniel@heroku.com> wrote: > Reviving a thread that has hit its second birthday: > > http://archives.postgresql.org/pgsql-hackers/2009-11/msg00024.php > > In our case not being able to restart Postgres when it has been taken > down in the middle of a base backup is starting to manifest as a > serious source of downtime: basically, any backend crash or machine > restart will cause postgres not to start without human intervention. > The message delivered is sufficiently scary and indirect enough > (because of the plausible scenarios that could cause corruption if > postgres were to make a decision automatically in the most general > case) that it's not all that attractive to train a general operator > rotation to assess what to do, as it involves reading and then, > effectively, ignoring some awfully scary error messages and removing > the backup label file. Even if the error messages weren't scary > (itself a problem if one comes to the wrong conclusion as a result), > the time spent digging around under short notice to confirm what's > going on is a high pole in the tent for improving uptime for us, > taking an extra five to ten minutes per common encounter. > > Our problem is compounded by having a lot of databases that take base > backups at attenuated rates in an unattended way, and therefore a > human who may have been woken up from a sound sleep will have to > figure out what was going on before they've reached consciousness, > rather than a person with prior knowledge of having started a backup. > Also, fairly unremarkable databases can take so long to back up that > they may well have a greater than 20% chance of encountering this > problem at any particular time: 20% of a day is less than 5 hours per > day taken to do on-line backups. Basically, we -- and anyone else > with unattended physical backup schemes -- are punished rather > severely by the current design. > > This issue has some more recent related incarnations, even if for > different reasons: > > http://archives.postgresql.org/pgsql-hackers/2011-01/msg00764.php > > Because backup_label "coming or going?" confusion in Postgres can have > serious consequences, I wanted to post to the list first to solicit a > minimal design to solve this problem. If it's fairly small in its > mechanics then it may yet be feasible for the January CF. In some ways, I feel like this problem is unsolvable by definition. If a backup is designed to be an exact copy of the data directory taken between pg_start_backup() and pg_stop_backup(), then by definition you can't distinguish between the original and the copy. That's what a copy *is*. Now, we could fix this by requiring an additional step when creating a backup. For example, we could create backup_label.not_really on the master and require the person taking the backup to rename it to backup_label on the slave before starting the postmaster. This could be an optional behavior, to preserve backward compatibility. Now the slave *isn't* an exact copy of the master, so PostgreSQL can distinguish. But it seems that this could also be worked around outside the database. We don't have built-in clusterware, so there must be something in the external environment that knows which server is supposed to be the master and which is supposed to be the standby. So, if you're on the master, remove the backup label file before starting the postmaster. If you're on the standby, don't. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
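For concreteness, the "handle it in the external environment" workaround Robert describes could look roughly like the start wrapper below. This is only a sketch: the node_role file, its contents, and the wrapper itself are illustrative assumptions, not anything PostgreSQL provides or the thread specifies.

#!/usr/bin/env python
# Sketch of deciding backup_label's fate outside the database: the
# provisioning layer already knows whether this node is the original
# cluster or a restored copy, so a start wrapper can act before the
# postmaster ever sees the file.  ROLE_FILE is hypothetical.
import os
import subprocess

PGDATA = os.environ.get('PGDATA', '/var/lib/postgresql/data')
ROLE_FILE = os.path.join(PGDATA, 'node_role')  # written by external clusterware

def start_postgres():
    label = os.path.join(PGDATA, 'backup_label')
    role = 'master'
    if os.path.exists(ROLE_FILE):
        role = open(ROLE_FILE).read().strip()
    if role == 'master' and os.path.exists(label):
        # A crash mid-backup left backup_label behind on the original
        # cluster; plain crash recovery is what is wanted, so drop it.
        os.remove(label)
    # On a restored copy, backup_label is kept so the startup process
    # replays WAL from the checkpoint recorded in it.
    subprocess.check_call(['pg_ctl', '-w', '-D', PGDATA, 'start'])

if __name__ == '__main__':
    start_postgres()

Nothing in it is PostgreSQL-specific machinery; it is exactly the sort of ad-hoc, per-site protocol the rest of the thread argues should be codified.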
On Thu, Dec 1, 2011 at 3:47 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Nov 29, 2011 at 9:10 PM, Daniel Farina <daniel@heroku.com> wrote: >> Reviving a thread that has hit its second birthday: >> >> http://archives.postgresql.org/pgsql-hackers/2009-11/msg00024.php >> >> In our case not being able to restart Postgres when it has been taken >> down in the middle of a base backup is starting to manifest as a >> serious source of downtime: basically, any backend crash or machine >> restart will cause postgres not to start without human intervention. >> The message delivered is sufficiently scary and indirect enough >> (because of the plausible scenarios that could cause corruption if >> postgres were to make a decision automatically in the most general >> case) that it's not all that attractive to train a general operator >> rotation to assess what to do, as it involves reading and then, >> effectively, ignoring some awfully scary error messages and removing >> the backup label file. Even if the error messages weren't scary >> (itself a problem if one comes to the wrong conclusion as a result), >> the time spent digging around under short notice to confirm what's >> going on is a high pole in the tent for improving uptime for us, >> taking an extra five to ten minutes per common encounter. >> >> Our problem is compounded by having a lot of databases that take base >> backups at attenuated rates in an unattended way, and therefore a >> human who may have been woken up from a sound sleep will have to >> figure out what was going on before they've reached consciousness, >> rather than a person with prior knowledge of having started a backup. >> Also, fairly unremarkable databases can take so long to back up that >> they may well have a greater than 20% chance of encountering this >> problem at any particular time: 20% of a day is less than 5 hours per >> day taken to do on-line backups. Basically, we -- and anyone else >> with unattended physical backup schemes -- are punished rather >> severely by the current design. >> >> This issue has some more recent related incarnations, even if for >> different reasons: >> >> http://archives.postgresql.org/pgsql-hackers/2011-01/msg00764.php >> >> Because backup_label "coming or going?" confusion in Postgres can have >> serious consequences, I wanted to post to the list first to solicit a >> minimal design to solve this problem. If it's fairly small in its >> mechanics then it may yet be feasible for the January CF. > > In some ways, I feel like this problem is unsolvable by definition. > If a backup is designed to be an exact copy of the data directory > taken between pg_start_backup() and pg_stop_backup(), then by > definition you can't distinguish between the original and the copy. > That's what a copy *is*. > > Now, we could fix this by requiring an additional step when creating a > backup. For example, we could create backup_label.not_really on the > master and require the person taking the backup to rename it to > backup_label on the slave before starting the postmaster. This could > be an optional behavior, to preserve backward compatibility. Now the > slave *isn't* an exact copy of the master, so PostgreSQL can > distinguish. I actually think such a protocol should be chosen. As is I cannot say "yeah, restarting postgres is always designed to work" in the presence of backups. Prior suggestions -- I think rejected -- were to use the recovery.conf as such a sentinel file suggesting "I am restoring, not being backed up". 
> But it seems that this could also be worked around outside the > database. We don't have built-in clusterware, so there must be > something in the external environment that knows which server is > supposed to be the master and which is supposed to be the standby. > So, if you're on the master, remove the backup label file before > starting the postmaster. If you're on the standby, don't. Fundamentally this is true, but taking a backup should not make database restart a non-automatic process. By some definition one could adapt their processes to remove backup_label at all these times, but I think this should be codified; I cannot think of any convincing reason to have that much freedom (or homework, depending how you look at it) to write one's own protocol for this from scratch. From an arm's length view, a database that cannot do a clean or non-clean restart at any time regardless of the existence of a concurrent on-line backup has a clear defect. Here's a protocol: have pg_start_backup() write a file that just means "backing up". Restarts are OK, because that's all it means, it has no meaning to a recovery/restoration process. When one wishes to restore, one must touch a file -- not unlike the trigger file in recovery.conf (some have argued in the past this *should* be recovery.conf, except perhaps for its tendency to be moved to recovery.done) to have that behavior occur. How does that sound? All fundamentally possible right now, but the cause of slivers in my and other people's sides for years. -- fdr
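A minimal sketch of the startup decision that protocol implies, written in Python only for readability; the file names in_backup and restore_trigger are placeholders, not anything PostgreSQL looks for today.

# Rough sketch of the proposed behavior: a sentinel written by
# pg_start_backup() only records "a base backup was in flight", while an
# explicit, operator-created trigger file is what asks for restoration
# from a backup.  Both file names are hypothetical.
import os

def choose_startup_mode(pgdata):
    in_backup = os.path.exists(os.path.join(pgdata, 'in_backup'))
    restore_requested = os.path.exists(os.path.join(pgdata, 'restore_trigger'))

    if restore_requested:
        # The operator (or restore tooling) explicitly said "this is a
        # restored copy": replay WAL from the start-of-backup checkpoint,
        # as the presence of backup_label causes today.
        return 'archive recovery from base backup'
    if in_backup:
        # The sentinel only means a backup was in progress on the original
        # cluster, so an unclean shutdown here is ordinary crash recovery
        # and the postmaster can start unattended.
        return 'normal crash recovery'
    return 'normal crash recovery'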
On Fri, Dec 2, 2011 at 6:25 PM, Daniel Farina <daniel@heroku.com> wrote: > Here's a protocol: have pg_start_backup() write a file that just means > "backing up". Restarts are OK, because that's all it means, it has no > meaning to a recovery/restoration process. > > When one wishes to restore, one must touch a file -- not unlike the > trigger file in recovery.conf (some have argued in the past this > *should* be recovery.conf, except perhaps for its tendency to be moved > to recovery.done) to have that behavior occur. It certainly doesn't seem to me that you need TWO files. If you create a file on the master, then you just need to remove it from the backup. But I think the use of such a new protocol should be optional; it's easy to provide backward-compatibility here. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 03.12.2011 01:25, Daniel Farina wrote: > Here's a protocol: have pg_start_backup() write a file that just means > "backing up". Restarts are OK, because that's all it means, it has no > meaning to a recovery/restoration process. > > When one wishes to restore, one must touch a file -- not unlike the > trigger file in recovery.conf (some have argued in the past this > *should* be recovery.conf, except perhaps for its tendency to be moved > to recovery.done) to have that behavior occur. At the moment, if the situation is ambiguous, the system assumes that you're restoring from a backup. What your suggestion amounts to is to reverse that assumption, and assume instead that you're doing crash recovery on a system where a backup was being taken. In that case, if you take a backup with pg_base_backup(), and fail to archive the WAL files correctly, or forget to create a recovery.conf file, the database will happily start up from the backup, but is in fact corrupt. That is not good either. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Sat, Dec 3, 2011 at 8:04 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > At the moment, if the situation is ambiguous, the system assumes that you're > restoring from a backup. What your suggestion amounts to is to reverse that > assumption, and assume instead that you're doing crash recovery on a system > where a backup was being taken. In that case, if you take a backup with > pg_base_backup(), and fail to archive the WAL files correctly, or forget to > create a recovery.conf file, the database will happily start up from the > backup, but is in fact corrupt. That is not good either. Sorry for the long delay in getting around to writing a response, but I do think there is, in practice, a way around this conundrum, whose fundamental goal is to make sure that the backup is not, in actuality, a full binary copy of the database. A workaround that has a much smaller restart-hole is to move the backup_label in and out of the database directory after having copied it to the archive and before calling stop_backup. How about this revised protocol (names and adjustments welcome), to enable a less-terrible approach? Not only is that workaround incorrect (it has a small window where the system will not be able to restart), but it's pretty inconvenient. New concepts: pg_prepare_backup: readies postgres for backing up. Saves the backup_label content in volatile memory. The next start_backup will write that volatile information to disk, and the information within can be used to compute a "backup-key" "backup-key": a subset of the backup label, all it needs (as far as I know) might be the database-id and then the WAL position (timeline, seg, offset) the backup is starting at. Protocol: 1. select pg_prepare_backup(); (Backup process remembers that backup-key is in progress (say, writes it to /backup-keys/%k) 2. select pg_start_backup(); (perform copying) 3. select pg_stop_backup(); 4. backup process can optionally clear its state remembering the backup-key (rm /backup-keys/%k) A crash at each point would be resolved this way: Before step 1: Nothing has happened, so normal crash recovery. Before step 2: (same, as it doesn't involve a state transition in postgres) Before step 3: when the crash occurs and postgres starts up, postgres asks the external software if a backup was in progress, say via a "backup-in-progress command". It is responsible for looking at /backup-keys/%k and saying "yes, it was". The database can then do normal crash recovery. The backup can even be continuing through this time, I think. Before step 4: The archiver may leak the backup-key. Because backup-keys using the information I defined earlier have an ordering, it should be possible to reap these if necessary at intervals. Fundamentally, the way this approach gets around the 'physical copy' conundrum is asking the archiver software to remember something well out of the way of the database directory on the system that is being backed up. The main usability gain is that there will be a standardized way to have postgres check to see if it was doing a backup (and thus should use normal crash recovery) regardless of how it's started, rather than idiosyncratic hacks around, say, upstart scripts on Ubuntu or pg_ctl for what is a common need. What do you think? I think this may even be backwards compatible, because if one doesn't call pg_prepare_backup then one can fall back to that upon calling pg_start_backup.
The "backup in progress" command is additive, and doesn't change anything for systems that do not have it defined. -- fdr
On 30.12.2011 02:40, Daniel Farina wrote: > How about this revised protocol (names and adjustments welcome), to > enable a less-terrible approach? Not only is that workaround > incorrect (it has a small window where the system will not be able to > restart), but it's pretty inconvenient. > > New concepts: > > pg_prepare_backup: readies postgres for backing up. Saves the > backup_label content in volatile memory. The next start_backup will > write that volatile information to disk, and the information within > can be used to compute a "backup-key" > > "backup-key": a subset of the backup label, all it needs (as far as I > know) might be the database-id and then the WAL position (timeline, > seg, offset) the backup is starting at. > > Protocol: > > 1. select pg_prepare_backup(); > (Backup process remembers that backup-key is in progress (say, writes > it to /backup-keys/%k) > 2. select pg_start_backup(); > (perform copying) > 3. select pg_stop_backup(); > 4. backup process can optionally clear its state remembering the > backup-key (rm /backup-keys/%k) > > A crash at each point would be resolved this way: > > Before step 1: Nothing has happened, so normal crash recovery. > > Before step 2: (same, as it doesn't involve a state transition in postgres) > > Before step 3: when the crash occurs and postgres starts up, postgres > asks the external software if a backup was in progress, say via a > "backup-in-progress command". It is responsible for looking at > /backup-keys/%k and saying "yes, it was". The database can then do > normal crash recovery. The backup can even be continuing through this > time, I think. > > Before step 4: The archiver may leak the backup-key. Because > backup-keys using the information I defined earlier have an ordering, > it should be possible to reap these if necessary at intervals. > > Fundamentally, the way this approach gets around the 'physical copy' > conundrum is asking the archiver software to remember something well > out of the way of the database directory on the system that is being > backed up. That's awfully complicated. If we're going to require co-operation from the backup/archiving software, we might as well just change the procedure so that backup_label is not stored in the data directory, but returned by pg_start/stop_backup(), and the caller is responsible for placing it in the backed up copy of the data directory (or provide a new version of them to retain backwards compatibility). That would be a lot simpler. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
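To show how little the caller has to do under that scheme, here is a sketch assuming a hypothetical variant of pg_start_backup() that returns the label text instead of writing it into the live data directory; the function name is made up.

# Sketch of a backup client under Heikki's suggestion.  The function
# pg_start_backup_with_label() is hypothetical: assumed to behave like
# pg_start_backup() but to return the backup_label contents rather than
# creating the file inside the running cluster's data directory.
import os
import psycopg2

def take_backup(dsn, backup_dir, copy_data_directory):
    conn = psycopg2.connect(dsn)
    conn.autocommit = True
    cur = conn.cursor()

    cur.execute("SELECT pg_start_backup_with_label('nightly', true)")  # hypothetical
    label_contents = cur.fetchone()[0]

    copy_data_directory(backup_dir)    # the copy has no backup_label yet

    cur.execute("SELECT pg_stop_backup()")

    # The caller, not the server, places the label into the *copy*, so the
    # original cluster never carries a file that could confuse a restart.
    with open(os.path.join(backup_dir, 'backup_label'), 'w') as f:
        f.write(label_contents)

The original cluster could then crash and restart at any point during the backup without a label file getting in the way, which is exactly the property the thread is after.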
On Sun, Jan 1, 2012 at 14:18, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > On 30.12.2011 02:40, Daniel Farina wrote: >> >> How about this revised protocol (names and adjustments welcome), to >> enable a less-terrible approach? Not only is that workaround >> incorrect (it has a small window where the system will not be able to >> restart), but it's pretty inconvenient. >> >> New concepts: >> >> pg_prepare_backup: readies postgres for backing up. Saves the >> backup_label content in volatile memory. The next start_backup will >> write that volatile information to disk, and the information within >> can be used to compute a "backup-key" >> >> "backup-key": a subset of the backup label, all it needs (as far as I >> know) might be the database-id and then the WAL position (timeline, >> seg, offset) the backup is starting at. >> >> Protocol: >> >> 1. select pg_prepare_backup(); >> (Backup process remembers that backup-key is in progress (say, writes >> it to /backup-keys/%k) >> 2. select pg_start_backup(); >> (perform copying) >> 3. select pg_stop_backup(); >> 4. backup process can optionally clear its state remembering the >> backup-key (rm /backup-keys/%k) >> >> A crash at each point would be resolved this way: >> >> Before step 1: Nothing has happened, so normal crash recovery. >> >> Before step 2: (same, as it doesn't involve a state transition in >> postgres) >> >> Before step 3: when the crash occurs and postgres starts up, postgres >> asks the external software if a backup was in progress, say via a >> "backup-in-progress command". It is responsible for looking at >> /backup-keys/%k and saying "yes, it was". The database can then do >> normal crash recovery. The backup can even be continuing through this >> time, I think. >> >> Before step 4: The archiver may leak the backup-key. Because >> backup-keys using the information I defined earlier have an ordering, >> it should be possible to reap these if necessary at intervals. >> >> Fundamentally, the way this approach gets around the 'physical copy' >> conundrum is asking the archiver software to remember something well >> out of the way of the database directory on the system that is being >> backed up. > > > That's awfully complicated. If we're going to require co-operation from the > backup/archiving software, we might as well just change the procedure so > that backup_label is not stored in the data directory, but returned by > pg_start/stop_backup(), and the caller is responsible for placing it in the > backed up copy of the data directory (or provide a new version of them to > retain backwards compatibility). That would be a lot simpler. +1 for Heikki's suggestion. It seems like "really fragile" vs "very straightforward". It also doesn't affect backups taken through pg_basebackup - but I guess you have good reasons for not being able to use that? -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
On Sun, Jan 1, 2012 at 5:18 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > That's awfully complicated. If we're going to require co-operation from the > backup/archiving software, we might as well just change the procedure so > that backup_label is not stored in the data directory, but returned by > pg_start/stop_backup(), and the caller is responsible for placing it in the > backed up copy of the data directory (or provide a new version of them to > retain backwards compatibility). That would be a lot simpler. That's also entirely acceptable. In fact, I like it. +1 -- fdr
On Sun, Jan 1, 2012 at 6:13 AM, Magnus Hagander <magnus@hagander.net> wrote: > It also doesn't affect backups taken through pg_basebackup - but I > guess you have good reasons for not being able to use that? Parallel archiving/de-archiving and segmentation of the backup into pieces and rate limiting are the most clear gaps. I don't know if there are performance implications either, but I do pass all my bytes through unoptimized Python right now -- not exactly a speed demon. The approach I use is: * Scan the directory tree immediately after pg_start_backup, taking notes of existent files and sizes * Split those files into volumes, none of which can exceed 1.5GB. These volumes are all disjoint * When creating the tar file, set the header for a tar member to have as many bytes as recorded in the first pass. If the file has been truncated, pad with zeros (this is also the behavior of GNU Tar). If it grew, only read the number of bytes recorded. * Generate and compress these tar files in parallel * All the while, the rate of reading files is subject to optional rate limiting As important is the fact that each volume can be downloaded and decompressed in a pipeline (no on-disk transformations to de-archive) with a tunable amount of concurrency, as the tar files do not overlap for any file, and no file needs to span two tar files thanks to Postgres's refusal to deal in files too large for old platforms. -- fdr
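The fixed-size tar member is the least obvious piece of that list. A sketch of it using Python's tarfile module, with volume assembly, parallelism, and rate limiting omitted and all names invented for illustration:

# Pin each tar member to the size observed at scan time, so volume
# boundaries computed up front stay valid while PostgreSQL keeps writing:
# files that shrank are zero-padded (as GNU tar does), files that grew
# are clamped to the recorded size.
import tarfile

class PinnedSizeReader(object):
    """Yield exactly `size` bytes from `path`, padding with NULs if the
    file has been truncated since it was scanned."""
    def __init__(self, path, size):
        self.f = open(path, 'rb')
        self.remaining = size

    def read(self, n=-1):
        if self.remaining <= 0:
            return b''
        if n < 0 or n > self.remaining:
            n = self.remaining
        data = self.f.read(n)
        if len(data) < n:
            data += b'\x00' * (n - len(data))   # file shrank underneath us
        self.remaining -= len(data)
        return data

def add_member(tar, path, scanned_size):
    info = tarfile.TarInfo(name=path.lstrip('/'))
    info.size = scanned_size                    # size recorded in the first pass
    tar.addfile(info, PinnedSizeReader(path, scanned_size))

def archive_volume(volume, out_path):
    # `volume` is a list of (path, scanned_size) pairs chosen so their
    # total stays under the 1.5GB volume cap.
    with tarfile.open(out_path, 'w') as tar:
        for path, scanned_size in volume:
            add_member(tar, path, scanned_size)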
On Sun, Jan 1, 2012 at 23:09, Daniel Farina <daniel@heroku.com> wrote: > On Sun, Jan 1, 2012 at 6:13 AM, Magnus Hagander <magnus@hagander.net> wrote: >> It also doesn't affect backups taken through pg_basebackup - but I >> guess you have good reasons for not being able to use that? > > Parallel archiving/de-archiving and segmentation of the backup into > pieces and rate limiting are the most clear gaps. I don't know if > there are performance implications either, but I do pass all my bytes > through unoptimized Python right now -- not exactly a speed demon. > > The approach I use is: > > * Scan the directory tree immediately after pg_start_backup, taking > notes of existent files and sizes > * Split those files into volumes, none of which can exceed 1.5GB. > These volumes are all disjoint > * When creating the tar file, set the header for a tar member to have > as many bytes as recorded in the first pass. If the file has been > truncated, pad with zeros (this is also the behavior of GNU Tar). If > it grew, only read the number of bytes recorded. > * Generate and compress these tar files in parallel > * All the while, the rate of reading files is subject to optional rate limiting Well, that certainly goes into enough detail to agree that no, that can't be done with only minor modifications to pg_basebackup. Nor could you get around it by having your python program talk directly to the walsender backend. But you probably already considered that :D > As important is the fact that each volume can be downloaded and > decompressed in a pipeline (no on-disk transformations to de-archive) > with a tunable amount of concurrency, as the tar files do not > overlap for any file, and no file needs to span two tar files thanks > to Postgres's refusal to deal in files too large for old platforms. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Magnus Hagander <magnus@hagander.net> writes: > Well, that certainly goes into enough detail to agree that no, that > can't be done with only minor modifications to pg_basebackup. Nor > could you get around it by having your python program talk directly to > the walsender backend. But you probably already considered that :D It sounds like a nice project to extend the pg_basebackup client side prototype I did after Heikki's talk at pgcon 2010, though: https://github.com/dimitri/pg_basebackup This “hack” allows taking a base backup from a libpq connection without any special protocol, so it works from 8.4 onward, and by replacing a simple enough WITH RECURSIVE query it could be made to support earlier versions also. It's already doing things in parallel, with a first pass on the server to get the list of files; you would have to enhance how the splitting is decided, then add your other features. Well, maybe that's already what you did, without using that code… or maybe you want to use the rsync abilities rather than libpq. Well… A limitation of this prototype is its failure to ensure that all the WAL produced in the meantime is actually archived in parallel with running the base backup, which is only important if you don't already have some archiving in place. I guess that the dedicated subprocess doing the WAL copy could be enhanced to talk the WAL streaming protocol when the server is recent enough. Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support