Thread: Verified fix for Bug 4137

Verified fix for Bug 4137

From
Simon Riggs
Date:
The following patch has been independently verified as fixing bug 4137.

The problem was that at the very start of archive recovery the %r
parameter in restore_command could be set to a filename later than the
currently requested filename (%f). This could lead to early truncation
of the archived WAL files and would cause warm standby replication to
fail soon afterwards, in certain specific circumstances.

Fix applied to both core server in generating correct %r filenames and
also to pg_standby to prevent acceptance of out-of-sequence filenames.

Request review and commit. No port specific details.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com

Attachment

Re: Verified fix for Bug 4137

From
"Heikki Linnakangas"
Date:
Simon Riggs wrote:
> The problem was that at the very start of archive recovery the %r
> parameter in restore_command could be set to a filename later than the
> currently requested filename (%f). This could lead to early truncation
> of the archived WAL files and would cause warm standby replication to
> fail soon afterwards, in certain specific circumstances.
>
> Fix applied to both core server in generating correct %r filenames and
> also to pg_standby to prevent acceptance of out-of-sequence filenames.

So the core problem is that we use ControlFile->checkPointCopy.redo in
RestoreArchivedFile to determine the safe truncation point, but when
there's a backup label file, that's still coming from pg_control file,
which is wrong.

The patch fixes that by determining the safe truncation point as
Min(checkPointCopy.redo, xlogfname), where xlogfname is the xlog file
being restored. That depends on the assumption that everything before
the first file we (try to) restore is safe to truncate. IOW, we never
try to restore file B first, and then A, where A < B.

I'm not totally convinced that's a safe assumption. As an example,
consider doing an archive recovery, but without a backup label, and the
latest checkpoint record is broken. We would try to read the latest
(broken) checkpoint record first, and call RestoreArchivedFile to get
the xlog file containing that. But because that record is broken, we
fall back to using the previous checkpoint, and will need the xlog file
where the previous checkpoint record is in.

That's a pretty contrived example, but the point is that assumption
feels fragile. At the very least it should be noted explicitly in the
comments. A less fragile approach would be to use something dummy, like
"000000000000000000000000" as the truncation point until we've
successfully read the checkpoint/restartpoint record and started the replay.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Verified fix for Bug 4137

From
Simon Riggs
Date:
On Tue, 2008-05-06 at 12:02 +0100, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > The problem was that at the very start of archive recovery the %r
> > parameter in restore_command could be set to a filename later than the
> > currently requested filename (%f). This could lead to early truncation
> > of the archived WAL files and would cause warm standby replication to
> > fail soon afterwards, in certain specific circumstances.
> >
> > Fix applied to both core server in generating correct %r filenames and
> > also to pg_standby to prevent acceptance of out-of-sequence filenames.
>
> So the core problem is that we use ControlFile->checkPointCopy.redo in
> RestoreArchivedFile to determine the safe truncation point, but when
> there's a backup label file, that's still coming from pg_control file,
> which is wrong.
>
> The patch fixes that by determining the safe truncation point as
> Min(checkPointCopy.redo, xlogfname), where xlogfname is the xlog file
> being restored. That depends on the assumption that everything before
> the first file we (try to) restore is safe to truncate. IOW, we never
> try to restore file B first, and then A, where A < B.
>
> I'm not totally convinced that's a safe assumption. As an example,
> consider doing an archive recovery, but without a backup label, and the
> latest checkpoint record is broken. We would try to read the latest
> (broken) checkpoint record first, and call RestoreArchivedFile to get
> the xlog file containing that. But because that record is broken, we
> fall back to using the previous checkpoint, and will need the xlog file
> where the previous checkpoint record is in.
>
> That's a pretty contrived example, but the point is that assumption
> feels fragile. At the very least it should be noted explicitly in the
> comments. A less fragile approach would be to use something dummy, like
> "000000000000000000000000" as the truncation point until we've
> successfully read the checkpoint/restartpoint record and started the replay.

Your comments are interesting and I'll think through that some more to
see if I can see if that leads to bugs. Will talk more later.

The only valid starting place, in all cases, is the checkpoint written
by pg_start_backup(). The user doesn't need to have been archiving files
prior to that so could legitimately have had a null archive_command.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: Verified fix for Bug 4137

From
Simon Riggs
Date:
On Tue, 2008-05-06 at 12:02 +0100, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > The problem was that at the very start of archive recovery the %r
> > parameter in restore_command could be set to a filename later than the
> > currently requested filename (%f). This could lead to early truncation
> > of the archived WAL files and would cause warm standby replication to
> > fail soon afterwards, in certain specific circumstances.
> >
> > Fix applied to both core server in generating correct %r filenames and
> > also to pg_standby to prevent acceptance of out-of-sequence filenames.
>
> So the core problem is that we use ControlFile->checkPointCopy.redo in
> RestoreArchivedFile to determine the safe truncation point, but when
> there's a backup label file, that's still coming from pg_control file,
> which is wrong.
>
> The patch fixes that by determining the safe truncation point as
> Min(checkPointCopy.redo, xlogfname), where xlogfname is the xlog file
> being restored. That depends on the assumption that everything before
> the first file we (try to) restore is safe to truncate. IOW, we never
> try to restore file B first, and then A, where A < B.
>
> I'm not totally convinced that's a safe assumption. As an example,
> consider doing an archive recovery, but without a backup label, and the
> latest checkpoint record is broken. We would try to read the latest
> (broken) checkpoint record first, and call RestoreArchivedFile to get
> the xlog file containing that. But because that record is broken, we
> fall back to using the previous checkpoint, and will need the xlog file
> where the previous checkpoint record is in.

> That's a pretty contrived example, but the point is that assumption
> feels fragile. At the very least it should be noted explicitly in the
> comments. A less fragile approach would be to use something dummy, like
> "000000000000000000000000" as the truncation point until we've
> successfully read the checkpoint/restartpoint record and started the replay.

Thanks for thinking about this.

Falling back to the secondary checkpoint implies we have a corrupted or
absent WAL file, so making recovery startup work correctly won't avoid
the need to re-run the base backup. We'll end with an unrecoverable
error in either case, so it doesn't seem worth attempting to improve
this in the way you suggest.

I think we should completely prevent access to secondary checkpoints
during archive recovery, because if the primary checkpoint isn't present
or is corrupt we aren't ever going to get passed it to get to the
pg_stop_backup() point. Hence an archive recovery can never be valid in
that case. I'll do a separate patch for that because they are unrelated
issues.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: Verified fix for Bug 4137

From
"Heikki Linnakangas"
Date:
Simon Riggs wrote:
> Falling back to the secondary checkpoint implies we have a corrupted or
> absent WAL file, so making recovery startup work correctly won't avoid
> the need to re-run the base backup. We'll end with an unrecoverable
> error in either case, so it doesn't seem worth attempting to improve
> this in the way you suggest.

That's true whenever you have to fall back to a secondary checkpoint,
but we still try to get the database up. One could argue that we
shouldn't, of course.

Anyway, the point is that the patch relies on a non-obvious assumption.
Even if the secondary checkpoint issue is a non-issue, it's not obvious
(to me at least) that there isn't other similar scenarios. And someone
might inadvertently break the assumption in a future patch, because it's
not an obvious one; calling ReadRecord looks very innocent. We shouldn't
introduce an assumption like that when we don't have to.

> I think we should completely prevent access to secondary checkpoints
> during archive recovery, because if the primary checkpoint isn't present
> or is corrupt we aren't ever going to get passed it to get to the
> pg_stop_backup() point. Hence an archive recovery can never be valid in
> that case. I'll do a separate patch for that because they are unrelated
> issues.

Well, we already don't use the secondary checkpoint if a backup label
file is present. And you can take a base backup without
pg_start_backup()/pg_stop_backup() if you shut down the system first (a
cold backup).

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Verified fix for Bug 4137

From
Simon Riggs
Date:
On Tue, 2008-05-06 at 15:00 +0100, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > Falling back to the secondary checkpoint implies we have a corrupted or
> > absent WAL file, so making recovery startup work correctly won't avoid
> > the need to re-run the base backup. We'll end with an unrecoverable
> > error in either case, so it doesn't seem worth attempting to improve
> > this in the way you suggest.
>
> That's true whenever you have to fall back to a secondary checkpoint,
> but we still try to get the database up. One could argue that we
> shouldn't, of course.

I wouldn't. If we're doing a crash recovery we need that.

> Anyway, the point is that the patch relies on a non-obvious assumption.
> Even if the secondary checkpoint issue is a non-issue, it's not obvious
> (to me at least) that there isn't other similar scenarios. And someone
> might inadvertently break the assumption in a future patch, because it's
> not an obvious one; calling ReadRecord looks very innocent. We shouldn't
> introduce an assumption like that when we don't have to.

We were already assuming archive files were "OK to delete, if before".
The whole of recovery already relies heavily on the alphabetic sorting
property of WAL and associated filenames. Those filenames have been
specifically documented as maintaining that sorted order for that
reason. If somebody wanted to recover files in non-sorted order, then
yes I would expect a few things to break - this aspect wouldn't be the
most critical thing though.

The patch adds only what we already knew: you should never expect to
delete a file *after* one that is being requested. So I see no reason to
remove or change anything from the patch as it stands.

I'd be happy to add comments to say

 * We use the alphanumeric sorting property of the filenames to decide
 * which ones are earlier

as we have done elsewhere in xlog.c

I do want everything to work in a clear way, but I honestly can't see
any reason to add more code, especially in the absence of any problem.

> > I think we should completely prevent access to secondary checkpoints
> > during archive recovery, because if the primary checkpoint isn't present
> > or is corrupt we aren't ever going to get passed it to get to the
> > pg_stop_backup() point. Hence an archive recovery can never be valid in
> > that case. I'll do a separate patch for that because they are unrelated
> > issues.
>
> Well, we already don't use the secondary checkpoint if a backup label
> file is present.

Yep, looking at the wrong section of code. I wondered why I hadn't
plugged that gap previously.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: Verified fix for Bug 4137

From
"Heikki Linnakangas"
Date:
Simon Riggs wrote:
> We were already assuming archive files were "OK to delete, if before".
> The whole of recovery already relies heavily on the alphabetic sorting
> property of WAL and associated filenames. Those filenames have been
> specifically documented as maintaining that sorted order for that
> reason. If somebody wanted to recover files in non-sorted order, then
> yes I would expect a few things to break - this aspect wouldn't be the
> most critical thing though.

I didn't suggest that alphabetical sorting property is a new assumption;
it sure isn't. The new assumption is that you never call ReadRecord()
for record 0002 before you call it for record 0001 (before initializing
the checkPointCopy field from the checkpoint record, anyway).

I can imagine a future patch to do xlog file prefetching, for example,
that breaks that assumption. Or falling back to the previous checkpoint
as discussed. Or maybe you screwed up your recovery.conf, and try to use
WAL files that belong to a different installation. The database won't
start up, of course, because the checkpoint record isn't in the right
place, but the damage has already been done because the recovery command
deleted some files it shouldn't have.

Granted, none of those are particularly likely, but we should be extra
careful when deleting files.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Verified fix for Bug 4137

From
Simon Riggs
Date:
On Tue, 2008-05-06 at 17:52 +0100, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > We were already assuming archive files were "OK to delete, if before".
> > The whole of recovery already relies heavily on the alphabetic sorting
> > property of WAL and associated filenames. Those filenames have been
> > specifically documented as maintaining that sorted order for that
> > reason. If somebody wanted to recover files in non-sorted order, then
> > yes I would expect a few things to break - this aspect wouldn't be the
> > most critical thing though.
>
> I didn't suggest that alphabetical sorting property is a new assumption;
> it sure isn't. The new assumption is that you never call ReadRecord()
> for record 0002 before you call it for record 0001 (before initializing
> the checkPointCopy field from the checkpoint record, anyway).

I've not done anything with the ordering of calls to ReadRecord(). There
is no such assumption introduced here.

The patch prevents an erroneous call to delete files. The earlier code
just assumed such a call would never occur, which was wrong, hence the
patch. There are no new assumptions introduced into the code by this.

I very much respect your opinion in all things, most especially on
matters of code, more so now you are a committer. I would be willing to
make changes to any patch on your say-so, though I can't see how any of
your comments improve this patch in this specific case only. I don't
doubt that you will find flaws in many of my future patches.

> I can imagine a future patch to do xlog file prefetching, for example,
> that breaks that assumption.

Maybe, but I think its for such a patch to consider in full the
consequences of doing that, not for me to do so in advance. The current
assumption is filename ordered recovery, there is definitely more than
one part of the code that relies significantly on that point.

I think any prefetcher would be advised to stay within one file at a
time. Anything else will effect other behaviours in ways that would need
careful planning to avoid unintended consequences.

> Or falling back to the previous checkpoint
> as discussed. Or maybe you screwed up your recovery.conf, and try to use
> WAL files that belong to a different installation. The database won't
> start up, of course, because the checkpoint record isn't in the right
> place, but the damage has already been done because the recovery command
> deleted some files it shouldn't have.
>
> Granted, none of those are particularly likely, but we should be extra
> careful when deleting files.

With respect, that is exactly what the patch does, though it is
certainly better for being tested by your comments.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: Verified fix for Bug 4137

From
"Heikki Linnakangas"
Date:
Simon Riggs wrote:
> On Tue, 2008-05-06 at 17:52 +0100, Heikki Linnakangas wrote:
>> I didn't suggest that alphabetical sorting property is a new assumption;
>> it sure isn't. The new assumption is that you never call ReadRecord()
>> for record 0002 before you call it for record 0001 (before initializing
>> the checkPointCopy field from the checkpoint record, anyway).
>
> I've not done anything with the ordering of calls to ReadRecord(). There
> is no such assumption introduced here.

Hmm, I see that I was inaccurate when I said the patch introduces that
assumption, sorry about the confusion. It's more like the code is
currently completely oblivious of the issue, and the patch makes it
better but doesn't fully fix it.

The code in CVS is broken, as we now know, because it assumes that we
can delete all files < checkPointCopy.redo, which is not true when
checkPointCopy.redo has not yet been initialized from the backup label
file, and points to a location that's ahead of the real safe truncation
point. The patch makes it better, and the assumption is now that it's
safe to truncate everything < min(checkPointCopy.redo, xlog file we're
reading). That still seems too fragile to me, because we assume that
after you've asked for xlog record X, we never need anything earlier
then that.

In fact, what will happen if the checkpoint record's redo pointer points
to an earlier xlog file:

1. The location of the checkpoint record is read by read_backup_label().
Let's say that it's 0005.
2. ReadCheckpointRecord() is called for 0005. The restore command is
called because that xlog file is not present. The safe truncation point
is determined to be 0005, as that's what we're reading. Everything
before that is truncated
3. The redo pointer in the checkpoint record points to 0003. That's
where we should start the recovery. Oops :-(

I haven't tested this, so, am I missing something that makes that not
happen?

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Verified fix for Bug 4137

From
Simon Riggs
Date:
On Tue, 2008-05-06 at 21:51 +0100, Heikki Linnakangas wrote:

> In fact, what will happen if the checkpoint record's redo pointer points
> to an earlier xlog file:
>
> 1. The location of the checkpoint record is read by read_backup_label().
> Let's say that it's 0005.
> 2. ReadCheckpointRecord() is called for 0005. The restore command is
> called because that xlog file is not present. The safe truncation point
> is determined to be 0005, as that's what we're reading. Everything
> before that is truncated
> 3. The redo pointer in the checkpoint record points to 0003. That's
> where we should start the recovery. Oops :-(

Yes, this case could be a problem, if the records are in different
files. It's the files that matter, not the records themselves though.

I've extended the patch without introducing another new status variable,
which was my original concern with what you suggested previously.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com

Attachment

Re: Verified fix for Bug 4137

From
"Heikki Linnakangas"
Date:
Simon Riggs wrote:
> I've extended the patch without introducing another new status variable,
> which was my original concern with what you suggested previously.

Ok, that'll work. Committed, thanks. I changed the sanity check that
xlogfname > restore point into an Assert, though, because that's a sign
that something's wrong.

I noted that there's no reason why the truncation point calculation had
to be moved outside the case-statement, but it does look better that way
so I did apply that change.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Verified fix for Bug 4137

From
"Heikki Linnakangas"
Date:
Simon Riggs wrote:
>       if (restartWALFileName)
>       {
> +         /*
> +          * Don't do cleanup if the restartWALFileName provided
> +          * is later than the xlog file requested. This is an error
> +          * and we must not remove these files from archive.
> +          * This shouldn't happen, but better safe than sorry.
> +          */
> +         if (strcmp(restartWALFileName, nextWALFileName) > 0)
> +             return false;
> +
>           strcpy(exclusiveCleanupFileName, restartWALFileName);
>           return true;
>       }

I committed this sanity check into pg_standy, though it really shouldn't
happen, but it just occurred to me that the most likely reason for that
to happen is probably that the user has screwed up his restore_command
line, mixing up the  %p and %r arguments. Should we make that an error
instead of just not doing the cleanup?

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Verified fix for Bug 4137

From
Simon Riggs
Date:
On Fri, 2008-05-09 at 15:37 +0100, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> >       if (restartWALFileName)
> >       {
> > +         /*
> > +          * Don't do cleanup if the restartWALFileName provided
> > +          * is later than the xlog file requested. This is an error
> > +          * and we must not remove these files from archive.
> > +          * This shouldn't happen, but better safe than sorry.
> > +          */
> > +         if (strcmp(restartWALFileName, nextWALFileName) > 0)
> > +             return false;
> > +
> >           strcpy(exclusiveCleanupFileName, restartWALFileName);
> >           return true;
> >       }
>
> I committed this sanity check into pg_standy, though it really shouldn't
> happen, but it just occurred to me that the most likely reason for that
> to happen is probably that the user has screwed up his restore_command
> line, mixing up the  %p and %r arguments. Should we make that an error
> instead of just not doing the cleanup?

You can't explicitly throw a pgsql error in pg_standby, so the best we
can do is get the file requested if it exists. If the file is the wrong
one then recovery will throw the error. As long as we didn't delete
anything when that happens we can just correct the mistake and retry.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: Verified fix for Bug 4137

From
Tom Lane
Date:
"Heikki Linnakangas" <heikki@enterprisedb.com> writes:
> Ok, that'll work. Committed, thanks. I changed the sanity check that
> xlogfname > restore point into an Assert, though, because that's a sign
> that something's wrong.

Shouldn't that Assert allow the equality case?

            regards, tom lane

Re: Verified fix for Bug 4137

From
"Heikki Linnakangas"
Date:
Tom Lane wrote:
> "Heikki Linnakangas" <heikki@enterprisedb.com> writes:
>> Ok, that'll work. Committed, thanks. I changed the sanity check that
>> xlogfname > restore point into an Assert, though, because that's a sign
>> that something's wrong.
>
> Shouldn't that Assert allow the equality case?

Yes. Thanks.

Hmm. I can't actually think of a scenario where that would happen, but
it was definitely an oversight on my part.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com