Thread: BUG #4796: Recovery followed by backup creates unrecoverable WAL-file
The following bug has been logged online:

Bug reference: 4796
Logged by: Mikael Krantz
Email address: mk@zigamorph.se
PostgreSQL version: 8.3.7-0lenny1
Operating system: Linux (debian lenny)
Description: Recovery followed by backup creates unrecoverable WAL-file
Details:

If you perform a recovery from a file system level backup, postgres will switch to a new timeline, but the first WAL file of the new timeline will contain the previous timeline's ID.

If you start a backup immediately after recovery has completed, the start of the backup will be in this bad WAL file. This makes the backup unrecoverable, as it will fail with an error similar to:

LOG: unexpected timeline ID 54 in log file 4, segment 236, offset 0
LOG: invalid checkpoint record
PANIC: could not locate required checkpoint record
HINT: If you are not restoring from a backup, try removing the file "/var/lib/postgresql/8.3/main/backup_label".

How to reproduce:

* restore from backup
* SELECT pg_start_backup('label');
* take a new backup
* SELECT pg_stop_backup();
* copy the relevant WAL-files
* try to restore the backup

It is also visible in the first WAL-file of a new timeline:

# od -t x4 /var/lib/postgresql/8.3/main/pg_xlog/0000003D0000000500000001 |head -1
0000000 0002d062 0000003c 00000005 01000000

The timeline tag 0000003c is in a file named 0000003D, which causes it to be unrecoverable.

Workaround: Wait for or force an xlog switch before pg_start_backup.

Possibly a simple fix would be to make pg_start_backup force this switch automatically.
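The od check in the report can be automated. The following is a minimal sketch, not PostgreSQL code: the function names are hypothetical, and the page header layout (16-bit magic, 16-bit info flags, 32-bit timeline ID, then a 32-bit log ID and 32-bit offset, little-endian) is an assumption inferred from the four words of od output shown in the report.

```python
import struct


def parse_segment_name(name):
    """Split a 24-hex-digit WAL segment file name into (timeline, log, segment)."""
    if len(name) != 24:
        raise ValueError("expected a 24-character WAL segment name")
    return int(name[0:8], 16), int(name[8:16], 16), int(name[16:24], 16)


def first_page_timeline(header_bytes):
    """Return the timeline ID stored in the first page header of a segment.

    Assumed layout (inferred from the od output in the report, not from the
    PostgreSQL headers): uint16 magic, uint16 info, uint32 timeline,
    uint32 log id, uint32 offset, all little-endian.
    """
    _magic, _info, tli, _log_id, _offset = struct.unpack("<HHIII", header_bytes[:16])
    return tli


def timeline_mismatch(segment_name, header_bytes):
    """True when the file name and the first page header disagree on the timeline."""
    name_tli, _log, _seg = parse_segment_name(segment_name)
    return name_tli != first_page_timeline(header_bytes)


# The four 32-bit words printed by od in the report, repacked as raw bytes.
sample = struct.pack("<IIII", 0x0002D062, 0x0000003C, 0x00000005, 0x01000000)
print(timeline_mismatch("0000003D0000000500000001", sample))  # prints True
```

For a real segment, the first 16 bytes of the file would be passed in place of `sample`. A mismatch by itself is what the report observes right after recovery; it only becomes a problem when the matching history file is unavailable at restore time.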
Re: BUG #4796: Recovery followed by backup creates unrecoverable WAL-file
From
Heikki Linnakangas
Date:
Mikael Krantz wrote: > If you perform a recovery form a file system level backup postgres will > switch to a new timeline but the first WAL-log in with the new timeline will > contain the previous timeline. > > If you start a backup immediately after recovery have completed the start of > the backup will be in this bad WAL file. This makes the backup unrecoverable > as it will fail with an error similar to: > > LOG: unexpected timeline ID 54 in log file 4, segment 236, offset 0 > LOG: invalid checkpoint record > PANIC: could not locate required checkpoint record > HINT: If you are not restoring from a backup, try removing the file > "/var/lib/postgresql/8.3/main/backup_label". > > > How to reproduce: > > * restore from backup > * SELECT pg_start_backup('label'); > * take a new backup > * SELECT pg_stop_backup(); > * copy the relevant WAL-files > * try to restore the backup I failed to reproduce this. Is it possible that the history file went missing in the process? That's needed to recover WAL files from timelines other than the latest one. You should only get that "unexpected timeline ID" message if the history file doesn't contain a line for that timeline ID. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
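Heikki's point about the history file can be pictured with a small model. This is a simplified sketch under stated assumptions, not the server's implementation: the function names are hypothetical, and the history file is reduced to a plain list of ancestor timeline IDs (or None when the file cannot be found).

```python
def expected_timelines(target_tli, history_parents=None):
    """Timelines whose WAL pages may appear while recovering target_tli.

    history_parents is the list of ancestor timeline IDs named in the
    target timeline's history file, or None when that file is missing.
    """
    parents = history_parents or []
    return [target_tli] + list(parents)


def page_is_acceptable(page_tli, target_tli, history_parents=None):
    """False models the "unexpected timeline ID" complaint from the report."""
    return page_tli in expected_timelines(target_tli, history_parents)


# Timeline 0x3D was branched from 0x3C, and its first segment still
# starts with pages stamped 0x3C (as in the od output of the report).
print(page_is_acceptable(0x3C, 0x3D, history_parents=[0x3C]))  # prints True
print(page_is_acceptable(0x3C, 0x3D, history_parents=None))    # prints False
```

The second call corresponds to the situation discussed in this thread: the page carries an older timeline ID and nothing tells recovery that this timeline is an ancestor.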
On Wed, May 6, 2009 at 6:26 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
>> How to reproduce:
>>
>> * restore from backup
>> * SELECT pg_start_backup('label');
>> * take a new backup
>> * SELECT pg_stop_backup();
>> * copy the relevant WAL-files
>> * try to restore the backup
>
> I failed to reproduce this. Is it possible that the history file went
> missing in the process? That's needed to recover WAL files from timelines
> other than the latest one. You should only get that "unexpected timeline ID"
> message if the history file doesn't contain a line for that timeline ID.

Yes that's true. The history file is not included in the backup. It is archived before the backup starts and is not included in the range specified in the backup file (e.g: 0000003B00000004000000FC.00000020.backup).

Doesn't this mean that the range of log-files in the backup file is incorrect? If the first WAL-file in the range contains records referring to earlier timelines, I will have to back up the .history-file of that timeline in addition to the WAL-files explicitly required for the backup. Or force a switch of log-files before starting the backup, as I'm currently doing.

The reason I stumbled onto this is that I've set up an automatic test that sets up a warm standby, fails over, sets up a new warm server and so on. This causes me to take new base backups very soon after a finished recovery process.

/M
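The workaround Mikael mentions (forcing a log switch before starting the backup) could be scripted roughly as follows. This is a hedged sketch: `run_base_backup`, `execute_sql` and `copy_data_directory` are hypothetical names, `execute_sql` is assumed to behave like a DB-API `cursor.execute` with `%s` placeholders, and `pg_switch_xlog()` is the switch function as named in the 8.2 and 8.3 releases discussed here.

```python
def run_base_backup(execute_sql, copy_data_directory, label):
    """Take a file-system level backup, forcing a WAL segment switch first.

    execute_sql(sql, params) is assumed to work like a DB-API cursor.execute;
    copy_data_directory() is a caller-supplied callable that copies the files.
    """
    # Workaround from the thread: start the backup in a fresh segment so that
    # it does not begin in a file carrying the previous timeline's ID.
    execute_sql("SELECT pg_switch_xlog()")
    execute_sql("SELECT pg_start_backup(%s)", (label,))
    try:
        copy_data_directory()
    finally:
        # Always leave backup mode, even if the copy step fails.
        execute_sql("SELECT pg_stop_backup()")
```

With a driver such as psycopg2 one would pass `cursor.execute` as `execute_sql`; the copy step is left to the caller because the thread does not prescribe a particular tool.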
Re: BUG #4796: Recovery followed by backup creates unrecoverable WAL-file
From
Heikki Linnakangas
Date:
Mikael Krantz wrote: > On Wed, May 6, 2009 at 6:26 PM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >>> How to reproduce: >>> >>> * restore from backup >>> * SELECT pg_start_backup('label'); >>> * take a new backup >>> * SELECT pg_stop_backup(); >>> * copy the relevant WAL-files >>> * try to restore the backup >> I failed to reproduce this. Is it possible that the history file went >> missing in the process? That's needed to recover WAL files from timelines >> other than the latest one. You should only get that "unexpected timeline ID" >> message if the history file doesn't contain a line for that timeline ID. > > Yes that's true. The history file is not included in the backup. It is > archived before the backup starts and is not included in the > range specified in the backup file (e.g: > 0000003B00000004000000FC.00000020.backup). > > Doesn't this mean that the range of log-files in the backup file is > incorrect? If the first WAL-file in the range contain records > referring to earlier timelines I will have to backup the .history-file > of that timeline in addition to the WAL-files explicitly required for > the backup. Or force a switch of log-files before starting the backup > as I'm currently doing. Yeah, I think you're right. If you omit pg_xlog from the base backup, as we recommend in the manual, and clear the old files from the archive too, then you won't have the old history file around. I'll make pg_start_backup() to request xlog switch before the checkpoint as you suggested. That's an easy fix that can be easily back-patched. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: BUG #4796: Recovery followed by backup creates unrecoverable WAL-file
From
Heikki Linnakangas
Date:
I wrote:
> I'll make pg_start_backup() to request xlog switch before the checkpoint
> as you suggested. That's an easy fix that can be easily back-patched.

Done. I only back-patched it down to 8.2, because earlier versions didn't have pg_switch_xlog(). They would've required more invasive changes which don't seem worth the effort.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Thu, 2009-05-07 at 12:15 +0300, Heikki Linnakangas wrote:
> Yeah, I think you're right. If you omit pg_xlog from the base backup,
> as we recommend in the manual, and clear the old files from the
> archive too, then you won't have the old history file around.

Sorry about this, but I don't agree with that fix and think it needs more discussion, at the very least. (I'm also not sure why this fix needs to be applied with such haste, even taking priority over other unapplied patches.)

The error seems to come from deleting the history file from the archive, rather than from the sequence of actions.

A more useful thing might be to do an xlog switch before we do the shutdown checkpoint at end of recovery. That gives the same sequence of actions without modifying the existing sequence of activities for backups, which is delicate enough for me to not want to touch it.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Re: BUG #4796: Recovery followed by backup creates unrecoverable WAL-file
From
Heikki Linnakangas
Date:
Simon Riggs wrote: > On Thu, 2009-05-07 at 12:15 +0300, Heikki Linnakangas wrote: > >> Yeah, I think you're right. If you omit pg_xlog from the base backup, >> as we recommend in the manual, and clear the old files from the >> archive too, then you won't have the old history file around. > > ... > A more useful thing might be to do an xlog switch before we do the > shutdown checkpoint at end of recovery. That gives the same sequence of > actions without modifying the existing sequence of activities for > backups, which is delicate enough for me to not want to touch it. Hmm, yeah should work as well. I find the recovery sequence to be even more delicate, though, than pg_start_backup(). I think you'd need to write the XLOG switch record using the old timeline ID, as we currently require that the timeline changes only at a shutdown checkpoint record. That's not hard, but does make me a bit nervous. The advantage of that over switching xlog segment in pg_start_backup() would be that you would go through fewer XLOG segments if you took backups often. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Re: BUG #4796: Recovery followed by backup creates unrecoverable WAL-file
From
Simon Riggs
Date:
On Thu, 2009-05-07 at 17:54 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Thu, 2009-05-07 at 12:15 +0300, Heikki Linnakangas wrote:
> >
> >> Yeah, I think you're right. If you omit pg_xlog from the base backup,
> >> as we recommend in the manual, and clear the old files from the
> >> archive too, then you won't have the old history file around.
> >
> > ...
> > A more useful thing might be to do an xlog switch before we do the
> > shutdown checkpoint at end of recovery. That gives the same sequence of
> > actions without modifying the existing sequence of activities for
> > backups, which is delicate enough for me to not want to touch it.
>
> Hmm, yeah should work as well. I find the recovery sequence to be even
> more delicate, though, than pg_start_backup(). I think you'd need to
> write the XLOG switch record using the old timeline ID, as we currently
> require that the timeline changes only at a shutdown checkpoint record.
> That's not hard, but does make me a bit nervous.
>
> The advantage of that over switching xlog segment in pg_start_backup()
> would be that you would go through fewer XLOG segments if you took
> backups often.

Yes, you're right about the delicacy of all of this so both suggestions sound kludgey - the problem is to do with timelines not with sequencing of checkpoints and log switches. The problem is Mikael deleted the history file and he shouldn't have done that. We need some explicit protection for when that occurs, I feel, to avoid it breaking again in the future with various changes we have planned.

If the history file is so important, we shouldn't only store it in the archive. We should keep a copy locally as well and refer to it if the archived copy is missing.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Re: [HACKERS] Re: BUG #4796: Recovery followed by backup creates unrecoverable WAL-file
From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Thu, 2009-05-07 at 17:54 +0300, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> A more useful thing might be to do an xlog switch before we do the
>>> shutdown checkpoint at end of recovery. That gives the same sequence of
>>> actions without modifying the existing sequence of activities for
>>> backups, which is delicate enough for me to not want to touch it.
>>
>> Hmm, yeah should work as well. I find the recovery sequence to be even
>> more delicate, though, than pg_start_backup(). I think you'd need to
>> write the XLOG switch record using the old timeline ID, as we currently
>> require that the timeline changes only at a shutdown checkpoint record.
>> That's not hard, but does make me a bit nervous.
>
> Yes, you're right about the delicacy of all of this so both suggestions
> sound kludgey - the problem is to do with timelines not with sequencing
> of checkpoints and log switches. The problem is Mikael deleted the
> history file and he shouldn't have done that.

I don't see any user error here. What he did was:

1. Restore from backup A
2. Clear old WAL archive
3. pg_start_backup() + tar all but pg_xlog + pg_stop_backup();
4. Restore new backup B

There's no history file in the archive because it was cleared in step 2. There's nothing wrong with that; you only need to retain WAL files from the point that you call pg_start_backup(). There's no history file either in the tar, because pg_xlog was not tarred as we recommend in the manual.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Re: BUG #4796: Recovery followed by backup creates unrecoverable WAL-file
From
Simon Riggs
Date:
On Thu, 2009-05-07 at 18:57 +0300, Heikki Linnakangas wrote:
> I don't see any user error here.

Just observing that the error occurs because we rely on a file being there when we haven't even documented that it needs to be there for it to work. File deletion with %r from the archive would not have removed that file at that point. We should have an explicit statement about which files can be deleted from the archive and which should not be, but in general it is dangerous to remove files that have not been explicitly described as removable.

Playing with the order of events seems fragile and I would prefer a more explicit solution. Recording the timeline history permanently with each server would be a sensible and useful thing (IIRC DB2 does this).

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Re: [HACKERS] Re: BUG #4796: Recovery followed by backup creates unrecoverable WAL-file
From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Thu, 2009-05-07 at 18:57 +0300, Heikki Linnakangas wrote:
>> I don't see any user error here.
>
> Just observing that the error occurs because we rely on a file being
> there when we haven't even documented that it needs to be there for it
> to work. File deletion with %r from the archive would not have removed
> that file at that point. We should have an explicit statement about
> which files can be deleted from the archive and which should not be, but
> in general it is dangerous to remove files that have not been explicitly
> described as removable.

When you create a new base backup, you shouldn't need any files archived before starting the backup. You might not even have had archiving enabled before that, or you might change archive_command to archive into a new location before starting the backup.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Re: BUG #4796: Recovery followed by backup creates unrecoverable WAL-file
From
Fujii Masao
Date:
Hi, On Fri, May 8, 2009 at 2:42 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > When you create a new base backup, you shouldn't need any files archived > before starting the backup. If so, this fix is not enough, since findNewestTimeLine() is still based on the premise that *all* the history files exist. So, as Simon says, we should clearly say that a history file must not be deleted from the archive. Or, we should create a new solution. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: [HACKERS] Re: BUG #4796: Recovery followed by backup creates unrecoverable WAL-file
From
Heikki Linnakangas
Date:
Fujii Masao wrote:
> Hi,
>
> On Fri, May 8, 2009 at 2:42 AM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>> When you create a new base backup, you shouldn't need any files archived
>> before starting the backup.
>
> If so, this fix is not enough, since findNewestTimeLine() is
> still based on the premise that *all* the history files exist.
> So, as Simon says, we should clearly say that a history file
> must not be deleted from the archive. Or, we should create
> a new solution.

The probe in findNewestTimeLine() is initialized to recovery target timeline + 1. It doesn't require history files for any old timelines to be present. The purpose of findNewestTimeLine() is to ensure that if you, e.g., recover to a point in time in timeline 5, and there are already WAL files for timelines 6 and 7 in the archive, we pick a unique timeline ID.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Re: BUG #4796: Recovery followed by backup creates unrecoverable WAL-file
From
Simon Riggs
Date:
On Fri, 2009-05-15 at 20:11 +0900, Fujii Masao wrote:
> Hi,
>
> On Fri, May 8, 2009 at 2:42 AM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
> > When you create a new base backup, you shouldn't need any files archived
> > before starting the backup.
>
> If so, this fix is not enough, since findNewestTimeLine() is
> still based on the premise that *all* the history files exist.
> So, as Simon says, we should clearly say that a history file
> must not be deleted from the archive. Or, we should create
> a new solution.

I will feel safer if we keep history files in the main data directory (somehow), not just send them to the archive. The history files together describe the provenance of the current database and I think it takes almost no space to record that, so it seems like a good idea to keep them.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Re: [HACKERS] Re: BUG #4796: Recovery followed by backup creates unrecoverable WAL-file
From
Fujii Masao
Date:
Hi, On Fri, May 15, 2009 at 8:20 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > The probe in findNewestTimeLine() initialized to recovery target timeline + > 1. It doesn't require history files for any old timelines to be present. What if recovery_target_timeline = 'latest'? The unexpected (not latest) recovery target timeline might be chosen when some timeline history files don't exist. > The > purpose of findNewestTimeLine() is to ensure that if you e.g recover to a > point in time in timeline 5, and there's already WAL files for timelines 6 > and 7 in the archive, we pick a unique timeline id. When only the history file for timeline 6 is deleted, timeline 6 would be assigned as the newest one *again* at the end of archive recovery. Is this safe? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
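The probe being discussed, and the gap case Fujii raises, can be modelled in a few lines. This is a sketch under an assumption that Fujii's question implies but the thread does not spell out: the probe starts at the recovery target timeline + 1 and stops at the first timeline for which no history file is found. The function name and the `history_exists` callback are hypothetical stand-ins, not the server's code.

```python
def find_newest_timeline(start_tli, history_exists):
    """Probe upward from start_tli + 1 while history files keep existing.

    history_exists(tli) is a caller-supplied predicate standing in for
    "is there a history file for this timeline in the archive?".
    Assumption: the probe gives up at the first missing history file.
    """
    newest = start_tli
    probe = start_tli + 1
    while history_exists(probe):
        newest = probe
        probe += 1
    return newest


# History files for timelines 6 and 7 are both present: the next timeline is 8.
archive = {6, 7}
print(find_newest_timeline(5, archive.__contains__) + 1)  # prints 8

# Only the history file for timeline 6 was deleted: the probe stops at once,
# so timeline 6 would be handed out again even though timeline 7 exists.
archive = {7}
print(find_newest_timeline(5, archive.__contains__) + 1)  # prints 6
```

Under that assumption, a gap in the sequence of history files hides every later timeline from the probe, which is the case the question is about.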
Re: [HACKERS] Re: BUG #4796: Recovery followed by backup creates unrecoverable WAL-file
From
Simon Riggs
Date:
On Fri, 2009-05-15 at 20:38 +0900, Fujii Masao wrote:

> On Fri, May 15, 2009 at 8:20 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
> > The probe in findNewestTimeLine() initialized to recovery target timeline +
> > 1. It doesn't require history files for any old timelines to be present.
>
> What if recovery_target_timeline = 'latest'? The unexpected (not latest)
> recovery target timeline might be chosen when some timeline history
> files don't exist.
>
> > The
> > purpose of findNewestTimeLine() is to ensure that if you e.g recover to a
> > point in time in timeline 5, and there's already WAL files for timelines 6
> > and 7 in the archive, we pick a unique timeline id.
>
> When only the history file for timeline 6 is deleted, timeline 6 would be
> assigned as the newest one *again* at the end of archive recovery.
> Is this safe?

Yeh, those cases screw us up. I'm sure we can think of others, I had time to analyse things in more detail. I'd be happier with the general assessment that "it's unsafe to keep history files in the archive".

My suggestion is that we keep history files in a new directory under the data directory. That way they get copied as part of the base backup, rather than sent off to the archive where DBAs can have mad moments and delete all, or worse, just some of them. Implementation for this proposal is really easy and safe for where we are now: we just access the appropriate local directory. Call it pg_history or pg_timeline etc.. Not under pg_xlog!

There is no particular reason to send history files to the archive, since new ones are only ever generated at the end of an archive recovery.

Now that we increment the timeline more often this is a more visible problem than previously.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Re: [HACKERS] Re: BUG #4796: Recovery followed by backup creates unrecoverable WAL-file
From
Simon Riggs
Date:
On Fri, 2009-05-15 at 12:56 +0100, Simon Riggs wrote:

> There is no particular reason to send history files to the archive,
> since new ones are only ever generated at the end of an archive
> recovery.

It also clears up a long standing confusion between backup history files and timeline history files. The backup history file(s) do need to go to the archive, whereas the timeline file(s) do not.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Re: [HACKERS] Re: BUG #4796: Recovery followed by backup creates unrecoverable WAL-file
From
Heikki Linnakangas
Date:
Fujii Masao wrote: > On Fri, May 15, 2009 at 8:20 PM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >> The probe in findNewestTimeLine() initialized to recovery target timeline + >> 1. It doesn't require history files for any old timelines to be present. > > What if recovery_target_timeline = 'latest'? The unexpected (not latest) > recovery target timeline might be chosen when some timeline history > files don't exist. > >> The >> purpose of findNewestTimeLine() is to ensure that if you e.g recover to a >> point in time in timeline 5, and there's already WAL files for timelines 6 >> and 7 in the archive, we pick a unique timeline id. > > When only the history file for timeline 6 is deleted, timeline 6 would be > assigned as the newest one *again* at the end of archive recovery. > Is this safe? If you delete history file and all the WAL for timeline 6, yeah, nothing stops it from being reused. It will work just fine, as if it never existed. If you still have the history file and WAL for the old timeline 6 lying around somewhere else like an older offsite backup, it's easy for the administrator to get confused, but there isn't much we can do about that. Simon's idea of keeping a copy of all the history files in the data directory wouldn't help here. In fact, I think we already never delete history files in the server, it's just that if you omit the pg_xlog directory in the base backup they won't be included. But even if they are included in the base backup, that wouldn't help in this scenario because the base backup still wouldn't contain the history files for the later timelines. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Re: BUG #4796: Recovery followed by backup creates unrecoverable WAL-file
From
Simon Riggs
Date:
On Fri, 2009-05-15 at 14:56 +0300, Heikki Linnakangas wrote:

> Simon's idea of keeping a copy of all the history files in the data
> directory wouldn't help here. In fact, I think we already never delete
> history files in the server, it's just that if you omit the pg_xlog
> directory in the base backup they won't be included. But even if they
> are included in the base backup, that wouldn't help in this scenario
> because the base backup still wouldn't contain the history files for the
> later timelines.

You're right there. That still leaves the problem that we need to know the later history, even if we don't use it.

> If you delete history file and all the WAL for timeline 6, yeah, nothing
> stops it from being reused. It will work just fine, as if it never
> existed. If you still have the history file and WAL for the old timeline
> 6 lying around somewhere else like an older offsite backup, it's easy
> for the administrator to get confused, but there isn't much we can do
> about that.

ehem, "It will work fine" isn't correct, as Fujii-san observes.

Let's document that timeline files should not be deleted from the archive iff there exists a base backup made during a lower numbered timeline.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Re: [HACKERS] Re: BUG #4796: Recovery followed by backup creates unrecoverable WAL-file
From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Fri, 2009-05-15 at 20:38 +0900, Fujii Masao wrote:
>
>> On Fri, May 15, 2009 at 8:20 PM, Heikki Linnakangas
>> <heikki.linnakangas@enterprisedb.com> wrote:
>>> The probe in findNewestTimeLine() initialized to recovery target timeline +
>>> 1. It doesn't require history files for any old timelines to be present.
>> What if recovery_target_timeline = 'latest'? The unexpected (not latest)
>> recovery target timeline might be chosen when some timeline history
>> files don't exist.
>>
>>> The
>>> purpose of findNewestTimeLine() is to ensure that if you e.g recover to a
>>> point in time in timeline 5, and there's already WAL files for timelines 6
>>> and 7 in the archive, we pick a unique timeline id.
>> When only the history file for timeline 6 is deleted, timeline 6 would be
>> assigned as the newest one *again* at the end of archive recovery.
>> Is this safe?
>
> Yeh, those cases screw us up. I'm sure we can think of others, I had
> time to analyse things in more detail. I'd be happier with the general
> assessment that "it's unsafe to keep history files in the archive".
>
> My suggestion is that we keep history files in a new directory under the
> data directory. That way they get copied as part of the base backup,
> rather than sent off to the archive where DBAs can have mad moments and
> delete all, or worse, just some of them. Implementation for this
> proposal is really easy and safe for where we are now: we just access
> the appropriate local directory. Call it pg_history or pg_timeline etc..
> Not under pg_xlog!
>
> There is no particular reason to send history files to the archive,
> since new ones are only ever generated at the end of an archive
> recovery.

Consider this:

1. Take base backup, on timeline 1. Archive to directory X.
2. Disaster.
3. Restore from base backup and the archive. Timeline ID is incremented to 2. Keep archiving to directory X.
4. Another disaster.
5. Restore again from the base backup and archive. Timeline ID is incremented to 3.

If the history files are not in the archive, where is the restore at step 5 going to get the history file for timeline 2? You certainly need the history files in the archive.

The history files should be considered as part of the WAL data. They need to be archived together with the WAL segments. When you take a new base backup, you no longer need the history files for old timelines, just like you don't need old WAL.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Re: BUG #4796: Recovery followed by backup creates unrecoverable WAL-file
From
Heikki Linnakangas
Date:
Mikael Krantz wrote: > On Fri, May 15, 2009 at 2:22 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >> Let's document that timeline files should not be deleted from the >> archive iff there exists a base backup made during a lower numbered >> timeline. > > Or made during a higher numbered timeline which happens to start in a > WAL-file containing records from a lower numbered timeline... That was the original issue you ran into. That has now been fixed by forcing an xlog switch at pg_start_backup(), so that you can't start a backup in a WAL file that contains records from a lower numbered timeline. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Re: BUG #4796: Recovery followed by backup creates unrecoverable WAL-file
From
Mikael Krantz
Date:
On Fri, May 15, 2009 at 2:22 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > Let's document that timeline files should not be deleted from the > archive iff there exists a base backup made during a lower numbered > timeline. Or made during a higher numbered timeline which happens to start in a WAL-file containing records from a lower numbered timeline... /M
Re: [HACKERS] Re: BUG #4796: Recovery followed by backup creates unrecoverable WAL-file
From
Heikki Linnakangas
Date:
Simon Riggs wrote: > ehem, "It will work fine" isn't correct, as Fujii-san observes. What exactly are the steps required to run into that problem? I fail to see what the problem is. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Re: BUG #4796: Recovery followed by backup creates unrecoverable WAL-file
From
Heikki Linnakangas
Date:
Simon Riggs wrote: > On Fri, 2009-05-15 at 12:56 +0100, Simon Riggs wrote: > >> There is no particular reason to send history files to the archive, >> since new ones are only ever generated at the end of an archive >> recovery. > > It also clears up a long standing confusion between backup history files > and timeline history files. The backup history file(s) do need to go to > the archive, whereas the timeline file(s) do not. (blush). Umm, and what is the distinction again? I thought they're the same thing.. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Re: BUG #4796: Recovery followed by backup creates unrecoverable WAL-file
From
Fujii Masao
Date:
Hi, On Fri, May 15, 2009 at 9:22 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >> If you delete history file and all the WAL for timeline 6, yeah, nothing >> stops it from being reused. It will work just fine, as if it never >> existed. If you still have the history file and WAL for the old timeline >> 6 lying around somewhere else like an older offsite backup, it's easy >> for the administrator to get confused, but there isn't much we can do >> about that. > > ehem, "It will work fine" isn't correct, as Fujii-san observes. Yes. In the case which I described, 6 is treated as timeline newer than 7. At least, this is against the current premise that timeline IDs must be in increasing sequence. > Let's document that timeline files should not be deleted from the > archive iff there exists a base backup made during a lower numbered > timeline. Agreed. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: [HACKERS] Re: BUG #4796: Recovery followed by backup creates unrecoverable WAL-file
From
Simon Riggs
Date:
On Fri, 2009-05-15 at 15:41 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Fri, 2009-05-15 at 12:56 +0100, Simon Riggs wrote:
> >
> >> There is no particular reason to send history files to the archive,
> >> since new ones are only ever generated at the end of an archive
> >> recovery.
> >
> > It also clears up a long standing confusion between backup history files
> > and timeline history files. The backup history file(s) do need to go to
> > the archive, whereas the timeline file(s) do not.
>
> (blush). Umm, and what is the distinction again? I thought they're the
> same thing..

Some additional code refers to "backup history" files when it means backup label files, which are then easily confused with the timeline history files.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Re: [HACKERS] Re: BUG #4796: Recovery followed by backup creates unrecoverable WAL-file
From
Mikael Krantz
Date:
On Fri, May 15, 2009 at 2:26 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > That was the original issue you ran into. That has now been fixed by forcing > an xlog switch at pg_start_backup(), so that you can't start a backup in a > WAL file that contains records from a lower numbered timeline. Ah, sorry. /M
Re: [HACKERS] Re: BUG #4796: Recovery followed by backup creates unrecoverable WAL-file
From
Simon Riggs
Date:
On Fri, 2009-05-15 at 15:34 +0200, Mikael Krantz wrote:
> On Fri, May 15, 2009 at 2:26 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
> > That was the original issue you ran into. That has now been fixed by forcing
> > an xlog switch at pg_start_backup(), so that you can't start a backup in a
> > WAL file that contains records from a lower numbered timeline.
>
> Ah, sorry.

No worries. Thanks for reporting the original bug and for staying involved while we think of how to handle the problems it highlights.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Re: [HACKERS] Re: BUG #4796: Recovery followed by backup creates unrecoverable WAL-file
From
Fujii Masao
Date:
Hi, On Fri, May 15, 2009 at 8:56 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > Fujii Masao wrote: >> >> On Fri, May 15, 2009 at 8:20 PM, Heikki Linnakangas >> <heikki.linnakangas@enterprisedb.com> wrote: >>> >>> The probe in findNewestTimeLine() initialized to recovery target timeline >>> + >>> 1. It doesn't require history files for any old timelines to be present. >> >> What if recovery_target_timeline = 'latest'? The unexpected (not latest) >> recovery target timeline might be chosen when some timeline history >> files don't exist. >> >>> The >>> purpose of findNewestTimeLine() is to ensure that if you e.g recover to a >>> point in time in timeline 5, and there's already WAL files for timelines >>> 6 >>> and 7 in the archive, we pick a unique timeline id. >> >> When only the history file for timeline 6 is deleted, timeline 6 would be >> assigned as the newest one *again* at the end of archive recovery. >> Is this safe? > > If you delete history file and all the WAL for timeline 6, yeah, nothing > stops it from being reused. It will work just fine, as if it never existed. > If you still have the history file and WAL for the old timeline 6 lying > around somewhere else like an older offsite backup, it's easy for the > administrator to get confused, but there isn't much we can do about that. OK, I probably understood your point. The timeline history files whose timeline ID is larger than that of an oldest backup must not be deleted from the archive. On the other hand, the smaller or equal one can be deleted. Not all history files are necessary. So, if we don't keep older backup, we probably can delete all files in the archive before pg_start_backup(). Is my understanding right? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
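The timeline-reuse scenario discussed above can be illustrated with a small model. This is only a sketch of the probing behaviour as described in this thread (start at the recovery target timeline + 1 and keep going while a history file exists); the function names and the use of a set of timeline IDs in place of real history files are assumptions made for illustration, not the server's actual code.

```python
def find_newest_timeline(start_tli, existing_history_tlis):
    """Model of the probe: walk upward from start_tli + 1 while a
    history file exists, and return the last timeline that was found.

    existing_history_tlis is a hypothetical stand-in for "history files
    visible locally or in the archive", given as a set of timeline IDs.
    """
    newest = start_tli
    probe = start_tli + 1
    while probe in existing_history_tlis:
        newest = probe
        probe += 1
    return newest


def next_timeline_id(recovery_target_tli, existing_history_tlis):
    """The new timeline is one past the newest timeline the probe found."""
    return find_newest_timeline(recovery_target_tli, existing_history_tlis) + 1


# Recover on timeline 5 while history files for 6 and 7 exist: a fresh ID is picked.
print(next_timeline_id(5, {6, 7}))  # 8
# Same recovery after only the history file for 6 was deleted: 6 is handed out again.
print(next_timeline_id(5, {7}))     # 6
```

In this model the probe stops at the first gap, which is why removing a single history file in the middle is enough for an already-used timeline ID to be assigned again.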
Re: [HACKERS] Re: BUG #4796: Recovery followed by backup creates unrecoverable WAL-file
From
Simon Riggs
Date:
On Fri, 2009-05-15 at 22:56 +0900, Fujii Masao wrote:
> OK, I probably understood your point. The timeline history files whose timeline ID is larger than that of an oldest backup must not be deleted from the archive. On the other hand, the smaller or equal one can be deleted. Not all history files are necessary. So, if we don't keep older backup, we probably can delete all files in the archive before pg_start_backup().
> Is my understanding right?

Heikki is right in one sense: if you do pg_start_backup() then for *that* backup you do not need earlier files.

However, as you have pointed out, if you have *multiple* backups then deleting history files may cause problems with an earlier backup.

It's standard practice to have >1 backup, so there is potential for error, and at a minimum we must document that.

Rather than explaining the problem and the rules by which we can work out exactly which history files to keep, I think it is safer to say that we must keep all history files.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Re: [HACKERS] Re: BUG #4796: Recovery followed by backup creates unrecoverable WAL-file
From
Andrew Dunstan
Date:
Simon Riggs wrote: > On Fri, 2009-05-15 at 22:56 +0900, Fujii Masao wrote: > > >> OK, I probably understood your point. The timeline history files whose >> timeline ID is larger than that of an oldest backup must not be deleted >> from the archive. On the other hand, the smaller or equal one can be >> deleted. Not all history files are necessary. So, if we don't keep older >> backup, we probably can delete all files in the archive before >> pg_start_backup(). >> Is my understanding right? >> > > Heikki is right in one sense: if you do pg_start_backup() then for > *that* backup you do not need earlier files. > > However, as you have pointed out, if you have *multiple* backups then > deleting history files may cause problems with an earlier backup. > > It's standard practice to have >1 backup, so there is potential for > error and minimum is we must document that. > > Rather than explaining the problem and the rules by which we can work > out exactly which history files to keep, I think it is safer to say that > we must keep all history files. > > This whole area is unfortunately way too fragile. We need some way of managing these facilities that hides a lot of these details and is therefore less likely to produce shot feet, IMNSHO. I get very nervous every time I have to touch it. cheers andrew
Re: [HACKERS] Re: BUG #4796: Recovery followed by backup creates unrecoverable WAL-file
From
Heikki Linnakangas
Date:
Simon Riggs wrote: > On Fri, 2009-05-15 at 22:56 +0900, Fujii Masao wrote: > >> OK, I probably understood your point. The timeline history files whose >> timeline ID is larger than that of an oldest backup must not be deleted >> from the archive. On the other hand, the smaller or equal one can be >> deleted. Not all history files are necessary. So, if we don't keep older >> backup, we probably can delete all files in the archive before >> pg_start_backup(). >> Is my understanding right? > > Heikki is right in one sense: if you do pg_start_backup() then for > *that* backup you do not need earlier files. > > However, as you have pointed out, if you have *multiple* backups then > deleting history files may cause problems with an earlier backup. Yes, just as deleting old WAL files. > It's standard practice to have >1 backup, so there is potential for > error and minimum is we must document that. > > Rather than explaining the problem and the rules by which we can work > out exactly which history files to keep, I think it is safer to say that > we must keep all history files. The rule for determining which history files need to be retained is the same as for WAL files. Anything archived before pg_start_backup() was called for the oldest backup you still want to be able to restore can be deleted. And the alphabetical sorting property works with history files as well, you can call pg_xlogfile_name(pg_start_backup()) and delete anything < the return value from the archive. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
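The alphabetical-sorting rule described above can be sketched as follows. This is a hedged illustration, not a supported cleanup tool: `files_safe_to_delete` and its `keep_history` flag are hypothetical names, the archive is modelled as a plain list of file names, and the cutoff is assumed to be the segment name obtained from pg_xlogfile_name(pg_start_backup()) for the oldest backup that must remain restorable. The flag defaults to the more conservative advice given elsewhere in this thread, which is to keep every timeline history file.

```python
def files_safe_to_delete(archive_names, cutoff_segment, keep_history=True):
    """Return the archive file names that sort strictly before cutoff_segment.

    archive_names  -- names of the files in the WAL archive (hypothetical input)
    cutoff_segment -- WAL file name the oldest retained backup started in
    keep_history   -- if True, never propose *.history files for deletion
    """
    # Plain string comparison is the "alphabetical sorting property".
    candidates = sorted(name for name in archive_names if name < cutoff_segment)
    if keep_history:
        candidates = [n for n in candidates if not n.endswith(".history")]
    return candidates


archive = [
    "00000002.history",
    "000000020000000000000007",
    "00000003.history",
    "000000030000000000000008",
    "000000030000000000000009",
]
cutoff = "000000030000000000000009"

# Conservative mode: only old WAL segments are proposed.
print(files_safe_to_delete(archive, cutoff))
# Pure alphabetical rule: history files that sort before the cutoff are included too.
print(files_safe_to_delete(archive, cutoff, keep_history=False))
```

Note that under the pure alphabetical rule a name such as 00000003.history sorts before every segment of timeline 3, because "." compares lower than any hex digit, so it is listed together with the older segments.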
Re: [HACKERS] Re: BUG #4796: Recovery followed by backup creates unrecoverable WAL-file
From
Simon Riggs
Date:
On Fri, 2009-05-15 at 17:19 +0300, Heikki Linnakangas wrote:
> Yes, just as deleting old WAL files.

So what you're saying is because it's possible to blow your left foot off, we're not concerned about blowing your right foot off either.

We've asked for some additional docs. What would be the objection to that?

And you guys wonder why I get frustrated trying to fix things.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Re: [HACKERS] Re: BUG #4796: Recovery followed by backup creates unrecoverable WAL-file
From
Heikki Linnakangas
Date:
Fujii Masao wrote: > On Fri, May 15, 2009 at 8:56 PM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >> Fujii Masao wrote: >>> When only the history file for timeline 6 is deleted, timeline 6 would be >>> assigned as the newest one *again* at the end of archive recovery. >>> Is this safe? >> If you delete history file and all the WAL for timeline 6, yeah, nothing >> stops it from being reused. It will work just fine, as if it never existed. >> If you still have the history file and WAL for the old timeline 6 lying >> around somewhere else like an older offsite backup, it's easy for the >> administrator to get confused, but there isn't much we can do about that. > > OK, I probably understood your point. The timeline history files whose > timeline ID is larger than that of an oldest backup must not be deleted > from the archive. On the other hand, the smaller or equal one can be > deleted. Not all history files are necessary. So, if we don't keep older > backup, we probably can delete all files in the archive before > pg_start_backup(). > Is my understanding right? Yes, that's correct. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Re: BUG #4796: Recovery followed by backup creates unrecoverable WAL-file
From
Heikki Linnakangas
Date:
Simon Riggs wrote: > On Fri, 2009-05-15 at 17:19 +0300, Heikki Linnakangas wrote: > >> Yes, just as deleting old WAL files. > > So what you're saying is because it's possible to blow your left foot > off, we're not concerned about blowing your right foot off either. I don't get it. What are the left and right foot in that metaphor referring to? > We've asked for some additional docs. What would be the objection to > that? I'm certainly not opposed to improving docs. And I agree with Andrew's sentiment that easier-to-use tools to manage PITR archives would be very helpful. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Re: BUG #4796: Recovery followed by backup creates unrecoverable WAL-file
From
Simon Riggs
Date:
On Fri, 2009-05-15 at 17:39 +0300, Heikki Linnakangas wrote:
> > We've asked for some additional docs. What would be the objection to that?
>
> I'm certainly not opposed to improving docs.

OK, so will you update the docs as requested?

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Re: [HACKERS] Re: BUG #4796: Recovery followed by backup creates unrecoverable WAL-file
From
Simon Riggs
Date:
On Fri, 2009-05-15 at 10:17 -0400, Andrew Dunstan wrote:
> This whole area is unfortunately way too fragile. We need some way of managing these facilities that hides a lot of these details and is therefore less likely to produce shot feet, IMNSHO. I get very nervous every time I have to touch it.

I think it is complex, though that is because we now support a huge number of use cases and options, to the benefit of many users. In fact, more than I would like, but this is a group project.

Not sure why you say it's fragile; there have been very few bugs considering the wide user base and those that have occurred have had fixes submitted for them quickly. Yes, we require you to actually read the docs, rather than open up psql and play, but this is business critical stuff.

Realistically, we have more developers on this part of the code now than any other. That's one reason for all the debate.

No problem in receiving feedback, just want to be able to understand it sufficiently well to be able to enhance it.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Re: [HACKERS] Re: BUG #4796: Recovery followed by backup creates unrecoverable WAL-file
From
Andrew Dunstan
Date:
Simon Riggs wrote: > On Fri, 2009-05-15 at 10:17 -0400, Andrew Dunstan wrote: > > >> This whole area is unfortunately way too fragile. We need some way of >> managing these facilities that hides a lot of these details and is >> therefore less likely to produce shot feet, IMNSHO. I get very nervous >> every time I have to touch it. >> > > I think it is complex, though that is because we now support a huge > number of use cases and options, to the benefit of many users. In fact, > more than I would like, but this is a group project. > > Not sure why you say it's fragile; there have been very few bugs > considering the wide user base and those that have occurred have had > fixes submitted for them quickly. Yes, we require you to actually read > the docs, rather than open up psql and play, but this is business > critical stuff. > > Realistically, we have more developers on this part of the code now than > any other. That's one reason for all the debate. > > No problem in receiving feedback, just want to be able to understand it > sufficiently well to be able to enhance it. > > I don't mean that it has bugs. I mean that it's far too easy to get it wrong and far too hard to get it right. I have reduced my uses to a couple of cases where I have worked out, with some trial and error, recipes that I follow. If I find these facilities complex to use, and I make virtually 100% of my living working with Postgres, what are more ordinary users going to say? That's why I think we need at the very least some tools for supporting the most common use cases, and hiding the messy details. And no, I haven't even begun to think of what such tools might look like. cheers andrew
Re: [HACKERS] Re: BUG #4796: Recovery followed by backup creates unrecoverable WAL-file
From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Fri, 2009-05-15 at 17:39 +0300, Heikki Linnakangas wrote:
>
>>> We've asked for some additional docs. What would be the objection to that?
>> I'm certainly not opposed to improving docs.
>
> OK, so will you update the docs as requested?

Well, we already have this in the docs:

> Each time a new timeline is created, PostgreSQL creates a "timeline history" file that shows which timeline it branched off from and when. These history files are necessary to allow the system to pick the right WAL segment files when recovering from an archive that contains multiple timelines. Therefore, they are archived into the WAL archive area just like WAL segment files. The history files are just small text files, so it's cheap and appropriate to keep them around indefinitely (unlike the segment files which are large). You can, if you like, add comments to a history file to make your own notes about how and why this particular timeline came to be. Such comments will be especially valuable when you have a thicket of different timelines as a result of experimentation.

What exactly do you want to change? Patch, please.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Re: BUG #4796: Recovery followed by backup creates unrecoverable WAL-file
From
Simon Riggs
Date:
On Fri, 2009-05-15 at 11:19 -0400, Andrew Dunstan wrote:
> I don't mean that it has bugs. I mean that it's far too easy to get it wrong and far too hard to get it right. I have reduced my uses to a couple of cases where I have worked out, with some trial and error, recipes that I follow. If I find these facilities complex to use, and I make virtually 100% of my living working with Postgres, what are more ordinary users going to say? That's why I think we need at the very least some tools for supporting the most common use cases, and hiding the messy details.

I've never had a private comment complaining about the facilities in a general way except from you and Josh Drake, though obviously I field bugs and questions from users frequently. I regularly get emails saying thanks, easy to use, much easier to manage than any other form of replication. Most frequent comment is "I was told it was really hard, but I see now that it is easy to understand and use". People with HA or backup experience from other databases usually have no problem understanding the concepts or the implementation.

> And no, I haven't even begun to think of what such tools might look like.

That's OK. Wanting it to be different is the first step. I want to improve it as well, though without removing features.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Re: [HACKERS] Re: BUG #4796: Recovery followed by backup creates unrecoverable WAL-file
From
Simon Riggs
Date:
On Fri, 2009-05-15 at 18:46 +0300, Heikki Linnakangas wrote:
> Well, we already have this in the docs:
>
> > Each time a new timeline is created, PostgreSQL creates a "timeline history" file that shows which timeline it branched off from and when. These history files are necessary to allow the system to pick the right WAL segment files when recovering from an archive that contains multiple timelines. Therefore, they are archived into the WAL archive area just like WAL segment files. The history files are just small text files, so it's cheap and appropriate to keep them around indefinitely (unlike the segment files which are large). You can, if you like, add comments to a history file to make your own notes about how and why this particular timeline came to be. Such comments will be especially valuable when you have a thicket of different timelines as a result of experimentation.
>
> What exactly do you want to change? Patch, please.

I find this exchange between us quite strange. The discussion on this thread has been fairly clear. Fujii-san and myself have both asked for it to be documented that history files should not be deleted.

The above section says it's "appropriate to keep them around indefinitely".

What it doesn't say is if you delete them then you can experience problems in certain circumstances, so we advise strongly not to do this. It would be even better if there was a section on removing files from the archive.

Do I really need to write a patch to say that, have you formally review it, then change the wording to what you would have written in the first place and then commit? Really? How many years do all of us have to work together before we develop an efficient process for trivial changes such as this?

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Re: [HACKERS] Re: BUG #4796: Recovery followed by backup creates unrecoverable WAL-file
From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Fri, 2009-05-15 at 18:46 +0300, Heikki Linnakangas wrote:
>> What exactly do you want to change? Patch, please.
>
> I find this exchange between us quite strange. The discussion on this thread has been fairly clear. Fujii-san and myself have both asked for it to be documented that history files should not be deleted.
>
> The above section says it's "appropriate to keep them around indefinitely".
>
> What it doesn't say is if you delete them then you can experience problems in certain circumstances, so we advise strongly not to do this. It would be even better if there was a section on removing files from the archive.

Well, then again it does also say "These history files are necessary to allow the system to pick the right WAL segment files when recovering from an archive that contains multiple timelines." Necessary says "do not delete" to me.

> Do I really need to write a patch to say that, have you formally review it, then change the wording to what you would have written in the first place and then commit? Really?

Yes. It's not a trivial change for me, you're much better at writing documentation than I am. And it's still not 100% clear to me what you're having in mind.

> How many years do all of us have to work together before we develop an efficient process for trivial changes such as this?

It sure is a pain at times :-)

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Re: BUG #4796: Recovery followed by backup creates unrecoverable WAL-file
From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > Simon Riggs wrote: >> Do I really need to write a patch to say that, have you formally review >> it, then change the wording to what you would have written in the first >> place and then commit? Really? > Yes. It's not a trivial change for me, you're much better at writing > documentation than I am. And it's still not 100% clear to me what you're > having in mind. I didn't read this thread earlier, but now that I have, it seems to be making a mountain out of a molehill. The original complaint seems to have neglected the fact that existsTimeLineHistory() will pull history files back from an archive. Therefore, you can only get into trouble if you archive the WAL segment files for a timeline and fail to keep the associated history file in the same place. It is entirely false that you've got to keep the history files on the live server. I've got no objection to clarifying the documentation's rather offhand statement about this, but let's clarify it correctly. regards, tom lane
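The failure condition described above (WAL segments for a timeline archived without the matching history file kept in the same place) can be checked for with a short script. This is a sketch under stated assumptions: the archive is modelled as a list of file names, segment names are taken to be 24 upper-case hex digits whose first 8 digits are the timeline ID, timeline 1 is assumed to have no history file, and `timelines_missing_history` is a hypothetical helper rather than an existing tool.

```python
import re

# Assumed naming scheme: 24 hex digits, the first 8 being the timeline ID.
WAL_SEGMENT = re.compile(r"^[0-9A-F]{24}$")


def timelines_missing_history(archive_names):
    """Return timeline IDs that have WAL segments in the archive
    but no matching <timeline>.history file next to them."""
    names = set(archive_names)
    timelines = {int(n[:8], 16) for n in names if WAL_SEGMENT.match(n)}
    return sorted(
        tli
        for tli in timelines
        if tli > 1 and f"{tli:08X}.history" not in names  # timeline 1 assumed to have none
    )


archive = [
    "000000010000000000000001",
    "000000020000000000000002",
    "00000002.history",
    "000000030000000000000003",  # no 00000003.history alongside it
]
print(timelines_missing_history(archive))  # [3]
```

An empty result means every timeline with archived segments also has its history file in the archive, which is the situation in which the server can pull the history files back during recovery.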
Re: [HACKERS] Re: BUG #4796: Recovery followed by backup creates unrecoverable WAL-file
From
Simon Riggs
Date:
On Fri, 2009-05-15 at 18:03 -0400, Tom Lane wrote:
> I didn't read this thread earlier, but now that I have, it seems to be making a mountain out of a molehill.

We've discussed a complex issue to pursue other nascent bugs. It's confused all of us at some point, but seems we're thru that now.

Why do you think the issue on this thread has become a mountain? I don't see anything other than a docs improvement coming out of it. (The last thread on pg_standby *was* a mountain IMHO, but that has nothing to do with this, other than the usual suspects being involved).

> It is entirely false that you've got to keep the history files on the live server.

There was a similar suggestion that was already clearly dropped, after discussion.

I (still) think that keeping the history files that have been used to build the current timeline would be an important documentary record for DBAs, especially since we encourage people to add their own notes to them. The safest place for them would be in the data directory. Keeping them there would be a minor new feature, not any kind of bug fix.

> I've got no objection to clarifying the documentation's rather offhand statement about this,

Cool

> but let's clarify it correctly.

Of course.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support