Thread: Verified fix for Bug 4137
The following patch has been independently verified as fixing bug 4137. The problem was that at the very start of archive recovery the %r parameter in restore_command could be set to a filename later than the currently requested filename (%f). This could lead to early truncation of the archived WAL files and would cause warm standby replication to fail soon afterwards, in certain specific circumstances. Fix applied to both core server in generating correct %r filenames and also to pg_standby to prevent acceptance of out-of-sequence filenames. Request review and commit. No port specific details. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Attachment
Simon Riggs wrote: > The problem was that at the very start of archive recovery the %r > parameter in restore_command could be set to a filename later than the > currently requested filename (%f). This could lead to early truncation > of the archived WAL files and would cause warm standby replication to > fail soon afterwards, in certain specific circumstances. > > Fix applied to both core server in generating correct %r filenames and > also to pg_standby to prevent acceptance of out-of-sequence filenames. So the core problem is that we use ControlFile->checkPointCopy.redo in RestoreArchivedFile to determine the safe truncation point, but when there's a backup label file, that's still coming from pg_control file, which is wrong. The patch fixes that by determining the safe truncation point as Min(checkPointCopy.redo, xlogfname), where xlogfname is the xlog file being restored. That depends on the assumption that everything before the first file we (try to) restore is safe to truncate. IOW, we never try to restore file B first, and then A, where A < B. I'm not totally convinced that's a safe assumption. As an example, consider doing an archive recovery, but without a backup label, and the latest checkpoint record is broken. We would try to read the latest (broken) checkpoint record first, and call RestoreArchivedFile to get the xlog file containing that. But because that record is broken, we fall back to using the previous checkpoint, and will need the xlog file where the previous checkpoint record is in. That's a pretty contrived example, but the point is that assumption feels fragile. At the very least it should be noted explicitly in the comments. A less fragile approach would be to use something dummy, like "000000000000000000000000" as the truncation point until we've successfully read the checkpoint/restartpoint record and started the replay. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, 2008-05-06 at 12:02 +0100, Heikki Linnakangas wrote: > Simon Riggs wrote: > > The problem was that at the very start of archive recovery the %r > > parameter in restore_command could be set to a filename later than the > > currently requested filename (%f). This could lead to early truncation > > of the archived WAL files and would cause warm standby replication to > > fail soon afterwards, in certain specific circumstances. > > > > Fix applied to both core server in generating correct %r filenames and > > also to pg_standby to prevent acceptance of out-of-sequence filenames. > > So the core problem is that we use ControlFile->checkPointCopy.redo in > RestoreArchivedFile to determine the safe truncation point, but when > there's a backup label file, that's still coming from pg_control file, > which is wrong. > > The patch fixes that by determining the safe truncation point as > Min(checkPointCopy.redo, xlogfname), where xlogfname is the xlog file > being restored. That depends on the assumption that everything before > the first file we (try to) restore is safe to truncate. IOW, we never > try to restore file B first, and then A, where A < B. > > I'm not totally convinced that's a safe assumption. As an example, > consider doing an archive recovery, but without a backup label, and the > latest checkpoint record is broken. We would try to read the latest > (broken) checkpoint record first, and call RestoreArchivedFile to get > the xlog file containing that. But because that record is broken, we > fall back to using the previous checkpoint, and will need the xlog file > where the previous checkpoint record is in. > > That's a pretty contrived example, but the point is that assumption > feels fragile. At the very least it should be noted explicitly in the > comments. A less fragile approach would be to use something dummy, like > "000000000000000000000000" as the truncation point until we've > successfully read the checkpoint/restartpoint record and started the replay. Your comments are interesting and I'll think through that some more to see if I can see if that leads to bugs. Will talk more later. The only valid starting place, in all cases, is the checkpoint written by pg_start_backup(). The user doesn't need to have been archiving files prior to that so could legitimately have had a null archive_command. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
On Tue, 2008-05-06 at 12:02 +0100, Heikki Linnakangas wrote: > Simon Riggs wrote: > > The problem was that at the very start of archive recovery the %r > > parameter in restore_command could be set to a filename later than the > > currently requested filename (%f). This could lead to early truncation > > of the archived WAL files and would cause warm standby replication to > > fail soon afterwards, in certain specific circumstances. > > > > Fix applied to both core server in generating correct %r filenames and > > also to pg_standby to prevent acceptance of out-of-sequence filenames. > > So the core problem is that we use ControlFile->checkPointCopy.redo in > RestoreArchivedFile to determine the safe truncation point, but when > there's a backup label file, that's still coming from pg_control file, > which is wrong. > > The patch fixes that by determining the safe truncation point as > Min(checkPointCopy.redo, xlogfname), where xlogfname is the xlog file > being restored. That depends on the assumption that everything before > the first file we (try to) restore is safe to truncate. IOW, we never > try to restore file B first, and then A, where A < B. > > I'm not totally convinced that's a safe assumption. As an example, > consider doing an archive recovery, but without a backup label, and the > latest checkpoint record is broken. We would try to read the latest > (broken) checkpoint record first, and call RestoreArchivedFile to get > the xlog file containing that. But because that record is broken, we > fall back to using the previous checkpoint, and will need the xlog file > where the previous checkpoint record is in. > That's a pretty contrived example, but the point is that assumption > feels fragile. At the very least it should be noted explicitly in the > comments. A less fragile approach would be to use something dummy, like > "000000000000000000000000" as the truncation point until we've > successfully read the checkpoint/restartpoint record and started the replay. Thanks for thinking about this. Falling back to the secondary checkpoint implies we have a corrupted or absent WAL file, so making recovery startup work correctly won't avoid the need to re-run the base backup. We'll end with an unrecoverable error in either case, so it doesn't seem worth attempting to improve this in the way you suggest. I think we should completely prevent access to secondary checkpoints during archive recovery, because if the primary checkpoint isn't present or is corrupt we aren't ever going to get passed it to get to the pg_stop_backup() point. Hence an archive recovery can never be valid in that case. I'll do a separate patch for that because they are unrelated issues. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Simon Riggs wrote: > Falling back to the secondary checkpoint implies we have a corrupted or > absent WAL file, so making recovery startup work correctly won't avoid > the need to re-run the base backup. We'll end with an unrecoverable > error in either case, so it doesn't seem worth attempting to improve > this in the way you suggest. That's true whenever you have to fall back to a secondary checkpoint, but we still try to get the database up. One could argue that we shouldn't, of course. Anyway, the point is that the patch relies on a non-obvious assumption. Even if the secondary checkpoint issue is a non-issue, it's not obvious (to me at least) that there isn't other similar scenarios. And someone might inadvertently break the assumption in a future patch, because it's not an obvious one; calling ReadRecord looks very innocent. We shouldn't introduce an assumption like that when we don't have to. > I think we should completely prevent access to secondary checkpoints > during archive recovery, because if the primary checkpoint isn't present > or is corrupt we aren't ever going to get passed it to get to the > pg_stop_backup() point. Hence an archive recovery can never be valid in > that case. I'll do a separate patch for that because they are unrelated > issues. Well, we already don't use the secondary checkpoint if a backup label file is present. And you can take a base backup without pg_start_backup()/pg_stop_backup() if you shut down the system first (a cold backup). -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, 2008-05-06 at 15:00 +0100, Heikki Linnakangas wrote: > Simon Riggs wrote: > > Falling back to the secondary checkpoint implies we have a corrupted or > > absent WAL file, so making recovery startup work correctly won't avoid > > the need to re-run the base backup. We'll end with an unrecoverable > > error in either case, so it doesn't seem worth attempting to improve > > this in the way you suggest. > > That's true whenever you have to fall back to a secondary checkpoint, > but we still try to get the database up. One could argue that we > shouldn't, of course. I wouldn't. If we're doing a crash recovery we need that. > Anyway, the point is that the patch relies on a non-obvious assumption. > Even if the secondary checkpoint issue is a non-issue, it's not obvious > (to me at least) that there isn't other similar scenarios. And someone > might inadvertently break the assumption in a future patch, because it's > not an obvious one; calling ReadRecord looks very innocent. We shouldn't > introduce an assumption like that when we don't have to. We were already assuming archive files were "OK to delete, if before". The whole of recovery already relies heavily on the alphabetic sorting property of WAL and associated filenames. Those filenames have been specifically documented as maintaining that sorted order for that reason. If somebody wanted to recover files in non-sorted order, then yes I would expect a few things to break - this aspect wouldn't be the most critical thing though. The patch adds only what we already knew: you should never expect to delete a file *after* one that is being requested. So I see no reason to remove or change anything from the patch as it stands. I'd be happy to add comments to say * We use the alphanumeric sorting property of the filenames to decide * which ones are earlier as we have done elsewhere in xlog.c I do want everything to work in a clear way, but I honestly can't see any reason to add more code, especially in the absence of any problem. > > I think we should completely prevent access to secondary checkpoints > > during archive recovery, because if the primary checkpoint isn't present > > or is corrupt we aren't ever going to get passed it to get to the > > pg_stop_backup() point. Hence an archive recovery can never be valid in > > that case. I'll do a separate patch for that because they are unrelated > > issues. > > Well, we already don't use the secondary checkpoint if a backup label > file is present. Yep, looking at the wrong section of code. I wondered why I hadn't plugged that gap previously. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Simon Riggs wrote: > We were already assuming archive files were "OK to delete, if before". > The whole of recovery already relies heavily on the alphabetic sorting > property of WAL and associated filenames. Those filenames have been > specifically documented as maintaining that sorted order for that > reason. If somebody wanted to recover files in non-sorted order, then > yes I would expect a few things to break - this aspect wouldn't be the > most critical thing though. I didn't suggest that alphabetical sorting property is a new assumption; it sure isn't. The new assumption is that you never call ReadRecord() for record 0002 before you call it for record 0001 (before initializing the checkPointCopy field from the checkpoint record, anyway). I can imagine a future patch to do xlog file prefetching, for example, that breaks that assumption. Or falling back to the previous checkpoint as discussed. Or maybe you screwed up your recovery.conf, and try to use WAL files that belong to a different installation. The database won't start up, of course, because the checkpoint record isn't in the right place, but the damage has already been done because the recovery command deleted some files it shouldn't have. Granted, none of those are particularly likely, but we should be extra careful when deleting files. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, 2008-05-06 at 17:52 +0100, Heikki Linnakangas wrote: > Simon Riggs wrote: > > We were already assuming archive files were "OK to delete, if before". > > The whole of recovery already relies heavily on the alphabetic sorting > > property of WAL and associated filenames. Those filenames have been > > specifically documented as maintaining that sorted order for that > > reason. If somebody wanted to recover files in non-sorted order, then > > yes I would expect a few things to break - this aspect wouldn't be the > > most critical thing though. > > I didn't suggest that alphabetical sorting property is a new assumption; > it sure isn't. The new assumption is that you never call ReadRecord() > for record 0002 before you call it for record 0001 (before initializing > the checkPointCopy field from the checkpoint record, anyway). I've not done anything with the ordering of calls to ReadRecord(). There is no such assumption introduced here. The patch prevents an erroneous call to delete files. The earlier code just assumed such a call would never occur, which was wrong, hence the patch. There are no new assumptions introduced into the code by this. I very much respect your opinion in all things, most especially on matters of code, more so now you are a committer. I would be willing to make changes to any patch on your say-so, though I can't see how any of your comments improve this patch in this specific case only. I don't doubt that you will find flaws in many of my future patches. > I can imagine a future patch to do xlog file prefetching, for example, > that breaks that assumption. Maybe, but I think its for such a patch to consider in full the consequences of doing that, not for me to do so in advance. The current assumption is filename ordered recovery, there is definitely more than one part of the code that relies significantly on that point. I think any prefetcher would be advised to stay within one file at a time. Anything else will effect other behaviours in ways that would need careful planning to avoid unintended consequences. > Or falling back to the previous checkpoint > as discussed. Or maybe you screwed up your recovery.conf, and try to use > WAL files that belong to a different installation. The database won't > start up, of course, because the checkpoint record isn't in the right > place, but the damage has already been done because the recovery command > deleted some files it shouldn't have. > > Granted, none of those are particularly likely, but we should be extra > careful when deleting files. With respect, that is exactly what the patch does, though it is certainly better for being tested by your comments. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Simon Riggs wrote: > On Tue, 2008-05-06 at 17:52 +0100, Heikki Linnakangas wrote: >> I didn't suggest that alphabetical sorting property is a new assumption; >> it sure isn't. The new assumption is that you never call ReadRecord() >> for record 0002 before you call it for record 0001 (before initializing >> the checkPointCopy field from the checkpoint record, anyway). > > I've not done anything with the ordering of calls to ReadRecord(). There > is no such assumption introduced here. Hmm, I see that I was inaccurate when I said the patch introduces that assumption, sorry about the confusion. It's more like the code is currently completely oblivious of the issue, and the patch makes it better but doesn't fully fix it. The code in CVS is broken, as we now know, because it assumes that we can delete all files < checkPointCopy.redo, which is not true when checkPointCopy.redo has not yet been initialized from the backup label file, and points to a location that's ahead of the real safe truncation point. The patch makes it better, and the assumption is now that it's safe to truncate everything < min(checkPointCopy.redo, xlog file we're reading). That still seems too fragile to me, because we assume that after you've asked for xlog record X, we never need anything earlier then that. In fact, what will happen if the checkpoint record's redo pointer points to an earlier xlog file: 1. The location of the checkpoint record is read by read_backup_label(). Let's say that it's 0005. 2. ReadCheckpointRecord() is called for 0005. The restore command is called because that xlog file is not present. The safe truncation point is determined to be 0005, as that's what we're reading. Everything before that is truncated 3. The redo pointer in the checkpoint record points to 0003. That's where we should start the recovery. Oops :-( I haven't tested this, so, am I missing something that makes that not happen? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, 2008-05-06 at 21:51 +0100, Heikki Linnakangas wrote: > In fact, what will happen if the checkpoint record's redo pointer points > to an earlier xlog file: > > 1. The location of the checkpoint record is read by read_backup_label(). > Let's say that it's 0005. > 2. ReadCheckpointRecord() is called for 0005. The restore command is > called because that xlog file is not present. The safe truncation point > is determined to be 0005, as that's what we're reading. Everything > before that is truncated > 3. The redo pointer in the checkpoint record points to 0003. That's > where we should start the recovery. Oops :-( Yes, this case could be a problem, if the records are in different files. It's the files that matter, not the records themselves though. I've extended the patch without introducing another new status variable, which was my original concern with what you suggested previously. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Attachment
Simon Riggs wrote: > I've extended the patch without introducing another new status variable, > which was my original concern with what you suggested previously. Ok, that'll work. Committed, thanks. I changed the sanity check that xlogfname > restore point into an Assert, though, because that's a sign that something's wrong. I noted that there's no reason why the truncation point calculation had to be moved outside the case-statement, but it does look better that way so I did apply that change. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Simon Riggs wrote: > if (restartWALFileName) > { > + /* > + * Don't do cleanup if the restartWALFileName provided > + * is later than the xlog file requested. This is an error > + * and we must not remove these files from archive. > + * This shouldn't happen, but better safe than sorry. > + */ > + if (strcmp(restartWALFileName, nextWALFileName) > 0) > + return false; > + > strcpy(exclusiveCleanupFileName, restartWALFileName); > return true; > } I committed this sanity check into pg_standy, though it really shouldn't happen, but it just occurred to me that the most likely reason for that to happen is probably that the user has screwed up his restore_command line, mixing up the %p and %r arguments. Should we make that an error instead of just not doing the cleanup? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Fri, 2008-05-09 at 15:37 +0100, Heikki Linnakangas wrote: > Simon Riggs wrote: > > if (restartWALFileName) > > { > > + /* > > + * Don't do cleanup if the restartWALFileName provided > > + * is later than the xlog file requested. This is an error > > + * and we must not remove these files from archive. > > + * This shouldn't happen, but better safe than sorry. > > + */ > > + if (strcmp(restartWALFileName, nextWALFileName) > 0) > > + return false; > > + > > strcpy(exclusiveCleanupFileName, restartWALFileName); > > return true; > > } > > I committed this sanity check into pg_standy, though it really shouldn't > happen, but it just occurred to me that the most likely reason for that > to happen is probably that the user has screwed up his restore_command > line, mixing up the %p and %r arguments. Should we make that an error > instead of just not doing the cleanup? You can't explicitly throw a pgsql error in pg_standby, so the best we can do is get the file requested if it exists. If the file is the wrong one then recovery will throw the error. As long as we didn't delete anything when that happens we can just correct the mistake and retry. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
"Heikki Linnakangas" <heikki@enterprisedb.com> writes: > Ok, that'll work. Committed, thanks. I changed the sanity check that > xlogfname > restore point into an Assert, though, because that's a sign > that something's wrong. Shouldn't that Assert allow the equality case? regards, tom lane
Tom Lane wrote: > "Heikki Linnakangas" <heikki@enterprisedb.com> writes: >> Ok, that'll work. Committed, thanks. I changed the sanity check that >> xlogfname > restore point into an Assert, though, because that's a sign >> that something's wrong. > > Shouldn't that Assert allow the equality case? Yes. Thanks. Hmm. I can't actually think of a scenario where that would happen, but it was definitely an oversight on my part. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com