Thread: backup_label during crash recovery: do we know how to solve it?
Reviving a thread that has hit its second birthday: http://archives.postgresql.org/pgsql-hackers/2009-11/msg00024.php In our case not being able to restart Postgres when it has been taken down in the middle of a base backup is starting to manifest as a serious source of downtime: basically, any backend crash or machine restart will cause postgres not to start without human intervention. The message delivered is sufficiently scary and indirect enough (because of the plausible scenarios that could cause corruption if postgres were to make a decision automatically in the most general case) that it's not all that attractive to train a general operator rotation to assess what to do, as it involves reading and then, effectively, ignoring some awfully scary error messages and removing the backup label file. Even if the error messages weren't scary (itself a problem if one comes to the wrong conclusion as a result), the time spent digging around under short notice to confirm what's going on is a high pole in the tent for improving uptime for us, taking an extra five to ten minutes per common encounter. Our problem is compounded by having a lot of databases that take base backups at attenuated rates in an unattended way, and therefore a human who may have been woken up from a sound sleep will have to figure out what was going on before they've reached consciousness, rather than a person with prior knowledge of having started a backup. Also, fairly unremarkable databases can take so long to back up that they may well have a greater than 20% chance of encountering this problem at any particular time: 20% of a day is less than 5 hours per day taken to do on-line backups. Basically, we -- and anyone else with unattended physical backup schemes -- are punished rather severely by the current design. This issue has some more recent related incarnations, even if for different reasons: http://archives.postgresql.org/pgsql-hackers/2011-01/msg00764.php Because backup_label "coming or going?" confusion in Postgres can have serious consequences, I wanted to post to the list first to solicit a minimal design to solve this problem. If it's fairly small in its mechanics then it may yet be feasible for the January CF. -- fdr
On Tue, Nov 29, 2011 at 9:10 PM, Daniel Farina <daniel@heroku.com> wrote: > Reviving a thread that has hit its second birthday: > > http://archives.postgresql.org/pgsql-hackers/2009-11/msg00024.php > > In our case not being able to restart Postgres when it has been taken > down in the middle of a base backup is starting to manifest as a > serious source of downtime: basically, any backend crash or machine > restart will cause postgres not to start without human intervention. > The message delivered is sufficiently scary and indirect enough > (because of the plausible scenarios that could cause corruption if > postgres were to make a decision automatically in the most general > case) that it's not all that attractive to train a general operator > rotation to assess what to do, as it involves reading and then, > effectively, ignoring some awfully scary error messages and removing > the backup label file. Even if the error messages weren't scary > (itself a problem if one comes to the wrong conclusion as a result), > the time spent digging around under short notice to confirm what's > going on is a high pole in the tent for improving uptime for us, > taking an extra five to ten minutes per common encounter. > > Our problem is compounded by having a lot of databases that take base > backups at attenuated rates in an unattended way, and therefore a > human who may have been woken up from a sound sleep will have to > figure out what was going on before they've reached consciousness, > rather than a person with prior knowledge of having started a backup. > Also, fairly unremarkable databases can take so long to back up that > they may well have a greater than 20% chance of encountering this > problem at any particular time: 20% of a day is less than 5 hours per > day taken to do on-line backups. Basically, we -- and anyone else > with unattended physical backup schemes -- are punished rather > severely by the current design. > > This issue has some more recent related incarnations, even if for > different reasons: > > http://archives.postgresql.org/pgsql-hackers/2011-01/msg00764.php > > Because backup_label "coming or going?" confusion in Postgres can have > serious consequences, I wanted to post to the list first to solicit a > minimal design to solve this problem. If it's fairly small in its > mechanics then it may yet be feasible for the January CF. In some ways, I feel like this problem is unsolvable by definition. If a backup is designed to be an exact copy of the data directory taken between pg_start_backup() and pg_stop_backup(), then by definition you can't distinguish between the original and the copy. That's what a copy *is*. Now, we could fix this by requiring an additional step when creating a backup. For example, we could create backup_label.not_really on the master and require the person taking the backup to rename it to backup_label on the slave before starting the postmaster. This could be an optional behavior, to preserve backward compatibility. Now the slave *isn't* an exact copy of the master, so PostgreSQL can distinguish. But it seems that this could also be worked around outside the database. We don't have built-in clusterware, so there must be something in the external environment that knows which server is supposed to be the master and which is supposed to be the standby. So, if you're on the master, remove the backup label file before starting the postmaster. If you're on the standby, don't. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
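For concreteness, the "handle it in the external environment" workaround Robert describes could look roughly like the start wrapper below. This is only a sketch: the node_role file, its contents, and the wrapper itself are illustrative assumptions, not anything PostgreSQL provides or the thread specifies.

#!/usr/bin/env python
# Sketch of deciding backup_label's fate outside the database: the
# provisioning layer already knows whether this node is the original
# cluster or a restored copy, so a start wrapper can act before the
# postmaster ever sees the file.  ROLE_FILE is hypothetical.
import os
import subprocess

PGDATA = os.environ.get('PGDATA', '/var/lib/postgresql/data')
ROLE_FILE = os.path.join(PGDATA, 'node_role')  # written by external clusterware

def start_postgres():
    label = os.path.join(PGDATA, 'backup_label')
    role = 'master'
    if os.path.exists(ROLE_FILE):
        role = open(ROLE_FILE).read().strip()
    if role == 'master' and os.path.exists(label):
        # A crash mid-backup left backup_label behind on the original
        # cluster; plain crash recovery is what is wanted, so drop it.
        os.remove(label)
    # On a restored copy, backup_label is kept so the startup process
    # replays WAL from the checkpoint recorded in it.
    subprocess.check_call(['pg_ctl', '-w', '-D', PGDATA, 'start'])

if __name__ == '__main__':
    start_postgres()

Nothing in it is PostgreSQL-specific machinery; it is exactly the sort of ad-hoc, per-site protocol the rest of the thread argues should be codified.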
On Thu, Dec 1, 2011 at 3:47 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Nov 29, 2011 at 9:10 PM, Daniel Farina <daniel@heroku.com> wrote: >> Reviving a thread that has hit its second birthday: >> >> http://archives.postgresql.org/pgsql-hackers/2009-11/msg00024.php >> >> In our case not being able to restart Postgres when it has been taken >> down in the middle of a base backup is starting to manifest as a >> serious source of downtime: basically, any backend crash or machine >> restart will cause postgres not to start without human intervention. >> The message delivered is sufficiently scary and indirect enough >> (because of the plausible scenarios that could cause corruption if >> postgres were to make a decision automatically in the most general >> case) that it's not all that attractive to train a general operator >> rotation to assess what to do, as it involves reading and then, >> effectively, ignoring some awfully scary error messages and removing >> the backup label file. Even if the error messages weren't scary >> (itself a problem if one comes to the wrong conclusion as a result), >> the time spent digging around under short notice to confirm what's >> going on is a high pole in the tent for improving uptime for us, >> taking an extra five to ten minutes per common encounter. >> >> Our problem is compounded by having a lot of databases that take base >> backups at attenuated rates in an unattended way, and therefore a >> human who may have been woken up from a sound sleep will have to >> figure out what was going on before they've reached consciousness, >> rather than a person with prior knowledge of having started a backup. >> Also, fairly unremarkable databases can take so long to back up that >> they may well have a greater than 20% chance of encountering this >> problem at any particular time: 20% of a day is less than 5 hours per >> day taken to do on-line backups. Basically, we -- and anyone else >> with unattended physical backup schemes -- are punished rather >> severely by the current design. >> >> This issue has some more recent related incarnations, even if for >> different reasons: >> >> http://archives.postgresql.org/pgsql-hackers/2011-01/msg00764.php >> >> Because backup_label "coming or going?" confusion in Postgres can have >> serious consequences, I wanted to post to the list first to solicit a >> minimal design to solve this problem. If it's fairly small in its >> mechanics then it may yet be feasible for the January CF. > > In some ways, I feel like this problem is unsolvable by definition. > If a backup is designed to be an exact copy of the data directory > taken between pg_start_backup() and pg_stop_backup(), then by > definition you can't distinguish between the original and the copy. > That's what a copy *is*. > > Now, we could fix this by requiring an additional step when creating a > backup. For example, we could create backup_label.not_really on the > master and require the person taking the backup to rename it to > backup_label on the slave before starting the postmaster. This could > be an optional behavior, to preserve backward compatibility. Now the > slave *isn't* an exact copy of the master, so PostgreSQL can > distinguish. I actually think such a protocol should be chosen. As is I cannot say "yeah, restarting postgres is always designed to work" in the presence of backups. Prior suggestions -- I think rejected -- were to use the recovery.conf as such a sentinel file suggesting "I am restoring, not being backed up". 
> But it seems that this could also be worked around outside the > database. We don't have built-in clusterware, so there must be > something in the external environment that knows which server is > supposed to be the master and which is supposed to be the standby. > So, if you're on the master, remove the backup label file before > starting the postmaster. If you're on the standby, don't. Fundamentally this is true, but taking a backup should not make database restart a non-automatic process. By some definition one could adapt their processes to remove backup_label at all these times, but I think this should be codified; I cannot think of any convincing reason to have that much freedom (or homework, depending how you look at it) to write one's own protocol for this from scratch. From an arm's length view, a database that cannot do a clean or non-clean restart at any time regardless of the existence of a concurrent on-line backup has a clear defect. Here's a protocol: have pg_start_backup() write a file that just means "backing up". Restarts are OK, because that's all it means, it has no meaning to a recovery/restoration process. When one wishes to restore, one must touch a file -- not unlike the trigger file in recovery.conf (some have argued in the past this *should* be recovery.conf, except perhaps for its tendency to be moved to recovery.done) to have that behavior occur. How does that sound? All fundamentally possible right now, but the cause of slivers in my and other people's sides for years. -- fdr
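A minimal sketch of the startup decision that protocol implies, written in Python only for readability; the file names in_backup and restore_trigger are placeholders, not anything PostgreSQL looks for today.

# Rough sketch of the proposed behavior: a sentinel written by
# pg_start_backup() only records "a base backup was in flight", while an
# explicit, operator-created trigger file is what asks for restoration
# from a backup.  Both file names are hypothetical.
import os

def choose_startup_mode(pgdata):
    in_backup = os.path.exists(os.path.join(pgdata, 'in_backup'))
    restore_requested = os.path.exists(os.path.join(pgdata, 'restore_trigger'))

    if restore_requested:
        # The operator (or restore tooling) explicitly said "this is a
        # restored copy": replay WAL from the start-of-backup checkpoint,
        # as the presence of backup_label causes today.
        return 'archive recovery from base backup'
    if in_backup:
        # The sentinel only means a backup was in progress on the original
        # cluster, so an unclean shutdown here is ordinary crash recovery
        # and the postmaster can start unattended.
        return 'normal crash recovery'
    return 'normal crash recovery'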
On Fri, Dec 2, 2011 at 6:25 PM, Daniel Farina <daniel@heroku.com> wrote: > Here's a protocol: have pg_start_backup() write a file that just means > "backing up". Restarts are OK, because that's all it means, it has no > meaning to a recovery/restoration process. > > When one wishes to restore, one must touch a file -- not unlike the > trigger file in recovery.conf (some have argued in the past this > *should* be recovery.conf, except perhaps for its tendency to be moved > to recovery.done) to have that behavior occur. It certainly doesn't seem to me that you need TWO files. If you create a file on the master, then you just need to remove it from the backup. But I think the use of such a new protocol should be optional; it's easy to provide backward-compatibility here. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 03.12.2011 01:25, Daniel Farina wrote: > Here's a protocol: have pg_start_backup() write a file that just means > "backing up". Restarts are OK, because that's all it means, it has no > meaning to a recovery/restoration process. > > When one wishes to restore, one must touch a file -- not unlike the > trigger file in recovery.conf (some have argued in the past this > *should* be recovery.conf, except perhaps for its tendency to be moved > to recovery.done) to have that behavior occur. At the moment, if the situation is ambiguous, the system assumes that you're restoring from a backup. What your suggestion amounts to is to reverse that assumption, and assume instead that you're doing crash recovery on a system where a backup was being taken. In that case, if you take a backup with pg_base_backup(), and fail to archive the WAL files correctly, or forget to create a recovery.conf file, the database will happily start up from the backup, but is in fact corrupt. That is not good either. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Sat, Dec 3, 2011 at 8:04 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > At the moment, if the situation is ambiguous, the system assumes that you're > restoring from a backup. What your suggestion amounts to is to reverse that > assumption, and assume instead that you're doing crash recovery on a system > where a backup was being taken. In that case, if you take a backup with > pg_base_backup(), and fail to archive the WAL files correctly, or forget to > create a recovery.conf file, the database will happily start up from the > backup, but is in fact corrupt. That is not good either. Sorry for the long delay in getting around to writing a response, but I do think there is, in practice, a way around this conundrum, whose fundamental goal is to make sure that the backup is not, in actuality, a full binary copy of the database. A workaround that has a much smaller restart-hole is to move the backup_label in and out of the database directory after having copied it to the archive and before calling stop_backup. How about this revised protocol (names and adjustments welcome), to enable a less-terrible approach? Not only is that workaround incorrect (it has a small window where the system will not be able to restart), but it's pretty inconvenient. New concepts: pg_prepare_backup: readies postgres for backing up. Saves the backup_label content in volatile memory. The next start_backup will write that volatile information to disk, and the information within can be used to compute a "backup-key" "backup-key": a subset of the backup label, all it needs (as far as I know) might be the database-id and then the WAL position (timeline, seg, offset) the backup is starting at. Protocol: 1. select pg_prepare_backup(); (Backup process remembers that backup-key is in progress (say, writes it to /backup-keys/%k) 2. select pg_start_backup(); (perform copying) 3. select pg_stop_backup(); 4. backup process can optionally clear its state remembering the backup-key (rm /backup-keys/%k) A crash at each point would be resolved this way: Before step 1: Nothing has happened, so normal crash recovery. Before step 2: (same, as it doesn't involve a state transition in postgres) Before step 3: when the crash occurs and postgres starts up, postgres asks the external software if a backup was in progress, say via a "backup-in-progress command". It is responsible for looking at /backup-keys/%k and saying "yes, it was". The database can then do normal crash recovery. The backup can even be continuing through this time, I think. Before step 4: The archiver may leak the backup-key. Because backup-keys using the information I defined earlier have an ordering, it should be possible to reap these if necessary at intervals. Fundamentally, the way this approach gets around the 'physical copy' conundrum is asking the archiver software to remember something well out of the way of the database directory on the system that is being backed up. The main usability gain is that there will be a standardized way to have postgres check to see if it was doing a backup (and thus should use normal crash recovery) regardless of how it's started, rather than idiosyncratic hacks around, say, upstart scripts on Ubuntu or pg_ctl for what is a common need. What do you think? I think this may even be backwards compatible, because if one doesn't call pg_prepare_backup then one can fall back to that upon calling pg_start_backup.
The "backup in progress" command is additive, and doesn't change anything for systems that do not have it defined. -- fdr
On 30.12.2011 02:40, Daniel Farina wrote: > How about this revised protocol (names and adjustments welcome), to > enable a less-terrible approach? Not only is that workaround > incorrect (it has a small window where the system will not be able to > restart), but it's pretty inconvenient. > > New concepts: > > pg_prepare_backup: readies postgres for backing up. Saves the > backup_label content in volatile memory. The next start_backup will > write that volatile information to disk, and the information within > can be used to compute a "backup-key" > > "backup-key": a subset of the backup label, all it needs (as far as I > know) might be the database-id and then the WAL position (timeline, > seg, offset) the backup is starting at. > > Protocol: > > 1. select pg_prepare_backup(); > (Backup process remembers that backup-key is in progress (say, writes > it to /backup-keys/%k) > 2. select pg_start_backup(); > (perform copying) > 3. select pg_stop_backup(); > 4. backup process can optionally clear its state remembering the > backup-key (rm /backup-keys/%k) > > A crash at each point would be resolved this way: > > Before step 1: Nothing has happened, so normal crash recovery. > > Before step 2: (same, as it doesn't involve a state transition in postgres) > > Before step 3: when the crash occurs and postgres starts up, postgres > asks the external software if a backup was in progress, say via a > "backup-in-progress command". It is responsible for looking at > /backup-keys/%k and saying "yes, it was". The database can then do > normal crash recovery. The backup can even be continuing through this > time, I think. > > Before step 4: The archiver may leak the backup-key. Because > backup-keys using the information I defined earlier have an ordering, > it should be possible to reap these if necessary at intervals. > > Fundamentally, the way this approach gets around the 'physical copy' > conundrum is asking the archiver software to remember something well > out of the way of the database directory on the system that is being > backed up. That's awfully complicated. If we're going to require co-operation from the backup/archiving software, we might as well just change the procedure so that backup_label is not stored in the data directory, but returned by pg_start/stop_backup(), and the caller is responsible for placing it in the backed up copy of the data directory (or provide a new version of them to retain backwards compatibility). That would be a lot simpler. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
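To show how little the caller has to do under that scheme, here is a sketch assuming a hypothetical variant of pg_start_backup() that returns the label text instead of writing it into the live data directory; the function name is made up.

# Sketch of a backup client under Heikki's suggestion.  The function
# pg_start_backup_with_label() is hypothetical: assumed to behave like
# pg_start_backup() but to return the backup_label contents rather than
# creating the file inside the running cluster's data directory.
import os
import psycopg2

def take_backup(dsn, backup_dir, copy_data_directory):
    conn = psycopg2.connect(dsn)
    conn.autocommit = True
    cur = conn.cursor()

    cur.execute("SELECT pg_start_backup_with_label('nightly', true)")  # hypothetical
    label_contents = cur.fetchone()[0]

    copy_data_directory(backup_dir)    # the copy has no backup_label yet

    cur.execute("SELECT pg_stop_backup()")

    # The caller, not the server, places the label into the *copy*, so the
    # original cluster never carries a file that could confuse a restart.
    with open(os.path.join(backup_dir, 'backup_label'), 'w') as f:
        f.write(label_contents)

The original cluster could then crash and restart at any point during the backup without a label file getting in the way, which is exactly the property the thread is after.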
On Sun, Jan 1, 2012 at 14:18, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > On 30.12.2011 02:40, Daniel Farina wrote: >> >> How about this revised protocol (names and adjustments welcome), to >> enable a less-terrible approach? Not only is that workaround >> incorrect (it has a small window where the system will not be able to >> restart), but it's pretty inconvenient. >> >> New concepts: >> >> pg_prepare_backup: readies postgres for backing up. Saves the >> backup_label content in volatile memory. The next start_backup will >> write that volatile information to disk, and the information within >> can be used to compute a "backup-key" >> >> "backup-key": a subset of the backup label, all it needs (as far as I >> know) might be the database-id and then the WAL position (timeline, >> seg, offset) the backup is starting at. >> >> Protocol: >> >> 1. select pg_prepare_backup(); >> (Backup process remembers that backup-key is in progress (say, writes >> it to /backup-keys/%k) >> 2. select pg_start_backup(); >> (perform copying) >> 3. select pg_stop_backup(); >> 4. backup process can optionally clear its state remembering the >> backup-key (rm /backup-keys/%k) >> >> A crash at each point would be resolved this way: >> >> Before step 1: Nothing has happened, so normal crash recovery. >> >> Before step 2: (same, as it doesn't involve a state transition in >> postgres) >> >> Before step 3: when the crash occurs and postgres starts up, postgres >> asks the external software if a backup was in progress, say via a >> "backup-in-progress command". It is responsible for looking at >> /backup-keys/%k and saying "yes, it was". The database can then do >> normal crash recovery. The backup can even be continuing through this >> time, I think. >> >> Before step 4: The archiver may leak the backup-key. Because >> backup-keys using the information I defined earlier have an ordering, >> it should be possible to reap these if necessary at intervals. >> >> Fundamentally, the way this approach gets around the 'physical copy' >> conundrum is asking the archiver software to remember something well >> out of the way of the database directory on the system that is being >> backed up. > > > That's awfully complicated. If we're going to require co-operation from the > backup/archiving software, we might as well just change the procedure so > that backup_label is not stored in the data directory, but returned by > pg_start/stop_backup(), and the caller is responsible for placing it in the > backed up copy of the data directory (or provide a new version of them to > retain backwards compatibility). That would be a lot simpler. +1 for Heikki's suggestion. It seems like "really fragile" vs "very straightforward". It also doesn't affect backups taken through pg_basebackup - but I guess you have good reasons for not being able to use that? -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
On Sun, Jan 1, 2012 at 5:18 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > That's awfully complicated. If we're going to require co-operation from the > backup/archiving software, we might as well just change the procedure so > that backup_label is not stored in the data directory, but returned by > pg_start/stop_backup(), and the caller is responsible for placing it in the > backed up copy of the data directory (or provide a new version of them to > retain backwards compatibility). That would be a lot simpler. That's also entirely acceptable. In fact, I like it. +1 -- fdr
On Sun, Jan 1, 2012 at 6:13 AM, Magnus Hagander <magnus@hagander.net> wrote: > It also doesn't affect backups taken through pg_basebackup - but I > guess you have good reasons for not being able to use that? Parallel archiving/de-archiving and segmentation of the backup into pieces and rate limiting are the most clear gaps. I don't know if there are performance implications either, but I do pass all my bytes through unoptimized Python right now -- not exactly a speed demon. The approach I use is: * Scan the directory tree immediately after pg_start_backup, taking notes of existent files and sizes * Split those files into volumes, none of which can exceed 1.5GB. These volumes are all disjoint * When creating the tar file, set the header for a tar member to have as many bytes as recorded in the first pass. If the file has been truncated, pad with zeros (this is also the behavior of GNU Tar). If it grew, only read the number of bytes recorded. * Generate and compress these tar files in parallel * All the while, the rate of reading files is subject to optional rate limiting As important is the fact that each volume can be downloaded and decompressed in a pipeline (no on-disk transformations to de-archive) with a tunable amount of concurrency, as the tar files do not overlap for any file, and no file needs to span two tar files thanks to Postgres's refusal to deal in files too large for old platforms. -- fdr
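The fixed-size tar member is the least obvious piece of that list. A sketch of it using Python's tarfile module, with volume assembly, parallelism, and rate limiting omitted and all names invented for illustration:

# Pin each tar member to the size observed at scan time, so volume
# boundaries computed up front stay valid while PostgreSQL keeps writing:
# files that shrank are zero-padded (as GNU tar does), files that grew
# are clamped to the recorded size.
import tarfile

class PinnedSizeReader(object):
    """Yield exactly `size` bytes from `path`, padding with NULs if the
    file has been truncated since it was scanned."""
    def __init__(self, path, size):
        self.f = open(path, 'rb')
        self.remaining = size

    def read(self, n=-1):
        if self.remaining <= 0:
            return b''
        if n < 0 or n > self.remaining:
            n = self.remaining
        data = self.f.read(n)
        if len(data) < n:
            data += b'\x00' * (n - len(data))   # file shrank underneath us
        self.remaining -= len(data)
        return data

def add_member(tar, path, scanned_size):
    info = tarfile.TarInfo(name=path.lstrip('/'))
    info.size = scanned_size                    # size recorded in the first pass
    tar.addfile(info, PinnedSizeReader(path, scanned_size))

def archive_volume(volume, out_path):
    # `volume` is a list of (path, scanned_size) pairs chosen so their
    # total stays under the 1.5GB volume cap.
    with tarfile.open(out_path, 'w') as tar:
        for path, scanned_size in volume:
            add_member(tar, path, scanned_size)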
On Sun, Jan 1, 2012 at 23:09, Daniel Farina <daniel@heroku.com> wrote: > On Sun, Jan 1, 2012 at 6:13 AM, Magnus Hagander <magnus@hagander.net> wrote: >> It also doesn't affect backups taken through pg_basebackup - but I >> guess you have good reasons for not being able to use that? > > Parallel archiving/de-archiving and segmentation of the backup into > pieces and rate limiting are the most clear gaps. I don't know if > there are performance implications either, but I do pass all my bytes > through unoptimized Python right now -- not exactly a speed demon. > > The approach I use is: > > * Scan the directory tree immediately after pg_start_backup, taking > notes of existent files and sizes > * Split those files into volumes, none of which can exceed 1.5GB. > These volumes are all disjoint > * When creating the tar file, set the header for a tar member to have > as many bytes as recorded in the first pass. If the file has been > truncated, pad with zeros (this is also the behavior of GNU Tar). If > it grew, only read the number of bytes recorded. > * Generate and compress these tar files in parallel > * All the while, the rate of reading files is subject to optional rate limiting Well, that certainly goes into enough detail to agree that no, that can't be done with only minor modifications to pg_basebackup. Nor could you get around it by having your python program talk directly to the walsender backend. But you probably already considered that :D > As important is the fact that each volume can be downloaded and > decompressed in a pipeline (no on-disk transformations to de-archive) > with a tunable amount of concurrency, as the tar files do not > overlap for any file, and no file needs to span two tar files thanks > to Postgres's refusal to deal in files too large for old platforms. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Magnus Hagander <magnus@hagander.net> writes: > Well, that certainly goes into enough detail to agree that no, that > can't be done with only minor modifications to pg_basebackup. Nor > could you get around it by having your python program talk directly to > the walsender backend. But you probably already considered that :D It sounds like a nice project to extend the pg_basebackup client side prototype I did after Heikki's talk at pgcon 2010, though: https://github.com/dimitri/pg_basebackup This “hack” allows taking a base backup from a libpq connection without any special protocol, so it works from 8.4 onward, and by replacing a simple enough WITH RECURSIVE query it could be made to support earlier versions also. It's already doing things in parallel, with a first pass on the server to get the list of files; you would have to enhance how the splitting is decided, then add your other features. Well, maybe that's already what you did, without using that code… or maybe you want to use the rsync abilities rather than libpq. Well… A limitation of this prototype is its failure to ensure that all the WAL produced in the meantime is actually archived in parallel with running the base backup, which is only important if you don't already have some archiving in place. I guess that the dedicated subprocess doing the WAL copy could be enhanced to talk the WAL streaming protocol when the server is recent enough. Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support