Thread: Recovery bug

Recovery bug

From: Jeff Davis
This occurs on the master branch, but also pretty much everywhere else.

To reproduce:

First, it requires a little setup to make sure things happen in the
right (wrong?) order. In postgresql.conf I set:
  archive_mode = on
  archive_command = 'echo -n archiving: %f... && while [ ! -f /tmp/a ];
do sleep 1; done && rm /tmp/a && echo done'

The purpose of the archive command is to control what files get archived
and when (they aren't actually archived, but that doesn't matter in this
case because I'll never set restore_command). Also, set
checkpoint_completion_target high enough (0.5 worked for me).

Session1:
  create table a( i int );
  insert into a select generate_series(1,1000000);
  select pg_start_backup('b1'); -- this command takes a while due
                                -- to checkpoint_completion_target

Session2 (quickly before the start_backup() returns):
  insert into a select generate_series(1,1000000);

Now, the last insert should have forced a checkpoint (assuming
checkpoint_segments=3), which enables pg_start_backup() to complete.

Cat backup_label to make sure that the checkpoint for the backup
occurred across two WAL segments (for me it's 00...5 and 00...6). If
not, try again, or somehow force the pg_start_backup checkpoint to be
spread out over more concurrent activity.
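The relevant lines of backup_label are START WAL LOCATION (which names the segment holding the redo record) and CHECKPOINT LOCATION; the repro needs them to land in different segments. A small sketch for pulling the segment name out of the label (the sample text below mimics the real label format; in practice you would point this at $PGDATA/backup_label):

```shell
# Extract the WAL segment named on the START WAL LOCATION line.
start_wal_file() {
  sed -n 's/^START WAL LOCATION:.*(file \([0-9A-F]*\)).*/\1/p' "$1"
}

# Demonstrate on a stand-in label file:
label=$(mktemp)
printf 'START WAL LOCATION: 0/5000020 (file 000000010000000000000005)\nCHECKPOINT LOCATION: 0/6D17110\n' > "$label"
start_wal_file "$label"   # prints 000000010000000000000005
rm -f "$label"
```

If the printed segment differs from the one containing CHECKPOINT LOCATION, the backup checkpoint spans two segments as required.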

Now, do "touch /tmp/a" to step through the archiving process one file at
a time until 00...5 is archived but 00...6 is _not_ archived. Then issue
a manual CHECKPOINT. This should result in 00...6 and beyond existing in
pg_xlog but nothing before it.
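To confirm that state, it's enough to check the oldest WAL segment remaining in pg_xlog (a sketch; the 24-hex-digit pattern matches ordinary segment file names, and the demo directory below is a stand-in for $PGDATA/pg_xlog):

```shell
# Report the oldest WAL segment file in a directory.
oldest_wal() {
  ls "$1" | grep -E '^[0-9A-F]{24}$' | sort | head -n 1
}

# Demo on a stand-in directory shaped like pg_xlog after the checkpoint:
d=$(mktemp -d)
touch "$d"/000000010000000000000006 "$d"/000000010000000000000007
oldest_wal "$d"   # prints 000000010000000000000006
rm -rf "$d"
```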

Send a SIGQUIT to the postmaster to simulate a crash. When you bring it
back up, it thinks it is recovering from a backup, so it reads
backup_label. The checkpoint for the backup label is in 00...6, so it
reads that just fine. But then it tries to read the WAL starting at the
redo location from that checkpoint, which is in 00...5 and it doesn't
exist and PANICs.
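For the crash itself, pg_ctl's "immediate" mode sends SIGQUIT for you; doing it by hand means reading the postmaster PID from the first line of postmaster.pid. A sketch ($PGDATA is assumed to point at the test cluster):

```shell
# Read the postmaster PID from the first line of postmaster.pid.
postmaster_pid() {
  head -n 1 "$1/postmaster.pid"
}

# Either:
#   pg_ctl -D "$PGDATA" stop -m immediate
# or by hand:
#   kill -QUIT "$(postmaster_pid "$PGDATA")"
```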

Ordinarily you might say that this is just confusion over whether it's
recovering from a backup or not, and you just need to remove
backup_label and try again. But that doesn't work: at this point
StartupXLOG has already done two things:
  1. renamed the backup file to .old
  2. updated the control file

So the state has been permanently changed such that it thinks the redo
location is before the earliest WAL you actually have. I see a control
file that looks something like:
  Latest checkpoint location:           0/6D17110
  Prior checkpoint location:            0/8D4DEF0
  Latest checkpoint's REDO location:    0/5000020

(and no more backup_label). That makes no sense, of course. If I try to
restart the postmaster again, it updates the control file again to
something even more surprising:
  Latest checkpoint location:           0/6D17110
  Prior checkpoint location:            0/6D17110
  Latest checkpoint's REDO location:    0/5000020

and PANICs again. Starting the postmaster a third time causes another
PANIC, of course.
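The inconsistency is visible by comparing the locations numerically: WAL locations are "hi/lo" hex pairs, and here the REDO location precedes even the "prior" checkpoint. A sketch of the comparison (lsn_lt succeeds when its first argument precedes its second):

```shell
# Compare two WAL locations of the form "hi/lo" (hex).
lsn_lt() {
  a_hi=$((0x${1%/*})); a_lo=$((0x${1#*/}))
  b_hi=$((0x${2%/*})); b_lo=$((0x${2#*/}))
  [ "$a_hi" -lt "$b_hi" ] || { [ "$a_hi" -eq "$b_hi" ] && [ "$a_lo" -lt "$b_lo" ]; }
}

lsn_lt 0/5000020 0/6D17110 && echo "REDO precedes the latest checkpoint"
```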

There might be an argument that if you crash in the middle of a backup,
you should examine the situation before restarting. But there are two
big problems with that:
  1. a backend crash will cause an automatic restart, corrupting
pg_control automatically as well
  2. after the restart happens it makes permanent changes that can't be
undone (by mere mortals, anyway), removes important information about
what actually happened, and generally leaves you with little to work
with.

I don't have a fix yet, because I think it requires a little discussion.
For instance, it seems to be dangerous to assume that we're starting up
from a backup with access to the archive when it might have been a crash
of the primary system. This is obviously wrong in the case of an
automatic restart, or one with no restore_command. Fixing this issue
might also remove the annoying "If you are not restoring from a backup,
try removing..." PANIC error message.

Also, in general we should do more logging during recovery, at least the
first stages, indicating what WAL segments it's looking for to get
started, why it thinks it needs that segment (from backup or control
data), etc. Ideally we would verify that the necessary files exist (at
least the initial ones) before making permanent changes. It was pretty
painful trying to work backwards on this problem from the final
controldata (where checkpoint and prior checkpoint are the same, and
redo is before both), a crash, a PANIC, a backup_label.old, and not much
else.

Regards,
    Jeff Davis

Re: Recovery bug

From: Jeff Davis
On Fri, 2010-10-15 at 15:58 -0700, Jeff Davis wrote:
> I don't have a fix yet, because I think it requires a little discussion.
> For instance, it seems to be dangerous to assume that we're starting up
> from a backup with access to the archive when it might have been a crash
> of the primary system. This is obviously wrong in the case of an
> automatic restart, or one with no restore_command. Fixing this issue
> might also remove the annoying "If you are not restoring from a backup,
> try removing..." PANIC error message.
>
> Also, in general we should do more logging during recovery, at least the
> first stages, indicating what WAL segments it's looking for to get
> started, why it thinks it needs that segment (from backup or control
> data), etc. Ideally we would verify that the necessary files exist (at
> least the initial ones) before making permanent changes. It was pretty
> painful trying to work backwards on this problem from the final
> controldata (where checkpoint and prior checkpoint are the same, and
> redo is before both), a crash, a PANIC, a backup_label.old, and not much
> else.
>

Here's a proposed fix. I didn't solve the problem of determining whether
we really are restoring a backup, or if there's just a backup_label file
left around.

I did two things:
  1. If reading a checkpoint from the backup_label location, verify that
the REDO location for that checkpoint exists in addition to the
checkpoint itself. If not, elog with a FATAL immediately.
  2. Change the error that happens when the checkpoint location
referenced in the backup_label doesn't exist to a FATAL. If it can
happen due to a normal crash, a FATAL seems more appropriate than a
PANIC.
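The gist of check (1) can be sketched from the shell, assuming the default 16MB segment size: map the REDO location from backup_label to its segment file name, and require that file to exist before recovery makes any permanent changes.

```shell
# Compute the WAL segment file name holding a "hi/lo" location.
# $1 = timeline ID (hex), $2 = WAL location; assumes 16MB segments.
redo_segment() {
  hi=$((0x${2%/*})); lo=$((0x${2#*/}))
  printf '%08X%08X%08X\n' "$((0x$1))" "$hi" "$((lo / 16777216))"
}

redo_segment 1 0/5000020   # prints 000000010000000000000005
```

In the repro above, that is exactly the 00...5 segment that had already been archived and removed, so the check would fail fast with a FATAL instead of letting StartupXLOG corrupt the control file.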

The benefit of this patch is that it won't continue on, corrupting the
pg_controldata along the way. And it also tells the administrator
exactly what's going on and how to correct it, rather than leaving them
with a PANIC and bogus controldata after they crashed in the middle of a
backup.

I still think it would be nice if postgres knew whether it was restoring
a backup or recovering from a crash, otherwise it's hard to
automatically recover from failures. I thought about using the presence
of recoveryRestoreCommand or PrimaryConnInfo to determine that. But it
seemed potentially dangerous if the person restoring a backup simply
forgot to set those, and then it tries restoring from the controldata
instead (which is unsafe to do during a backup).

Comments?

Regards,
    Jeff Davis

Attachment

Re: Recovery bug

From: Fujii Masao
>> Send a SIGQUIT to the postmaster to simulate a crash. When you bring it
>> back up, it thinks it is recovering from a backup, so it reads
>> backup_label. The checkpoint for the backup label is in 00...6, so it
>> reads that just fine. But then it tries to read the WAL starting at the
>> redo location from that checkpoint, which is in 00...5 and it doesn't
>> exist and PANICs.
>>
>> Ordinarily you might say that this is just confusion over whether it's
>> recovering from a backup or not, and you just need to remove
>> backup_label and try again. But that doesn't work: at this point
>> StartupXLOG has already done two things:
>>  1. renamed the backup file to .old
>>  2. updated the control file

Good catch!

> I still think it would be nice if postgres knew whether it was restoring
> a backup or recovering from a crash, otherwise it's hard to
> automatically recover from failures. I thought about using the presence
> of recoveryRestoreCommand or PrimaryConnInfo to determine that. But it
> seemed potentially dangerous if the person restoring a backup simply
> forgot to set those, and then it tries restoring from the controldata
> instead (which is unsafe to do during a backup).

Yep, to automatically delete backup_label and continue recovery seems to be
dangerous. How about just emitting FATAL error when neither restore_command
nor primary_conninfo is supplied and backup_label exists? This seems to be
simpler than your proposed patch (i.e., check whether REDO location exists).

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Re: Recovery bug

From: Jeff Davis
On Mon, 2010-10-18 at 17:02 +0900, Fujii Masao wrote:
> Yep, to automatically delete backup_label and continue recovery seems to be
> dangerous. How about just emitting FATAL error when neither restore_command
> nor primary_conninfo is supplied and backup_label exists? This seems to be
> simpler than your proposed patch (i.e., check whether REDO location exists).
>

Either is fine with me.

Do users have any expectation that they can restore a backup without
using recovery.conf by merely having the WAL segments in pg_xlog?

Regards,
    Jeff Davis

Re: Recovery bug

From: Robert Haas
On Mon, Oct 18, 2010 at 2:07 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Mon, 2010-10-18 at 17:02 +0900, Fujii Masao wrote:
>> Yep, to automatically delete backup_label and continue recovery seems to be
>> dangerous. How about just emitting FATAL error when neither restore_command
>> nor primary_conninfo is supplied and backup_label exists? This seems to be
>> simpler than your proposed patch (i.e., check whether REDO location exists).
>>
>
> Either is fine with me.
>
> Do users have any expectation that they can restore a backup without
> using recovery.conf by merely having the WAL segments in pg_xlog?

I would expect that to work.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Recovery bug

From: Jeff Davis
On Mon, 2010-10-18 at 17:51 -0400, Robert Haas wrote:
> On Mon, Oct 18, 2010 at 2:07 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> > On Mon, 2010-10-18 at 17:02 +0900, Fujii Masao wrote:
> >> Yep, to automatically delete backup_label and continue recovery seems to be
> >> dangerous. How about just emitting FATAL error when neither restore_command
> >> nor primary_conninfo is supplied and backup_label exists? This seems to be
> >> simpler than your proposed patch (i.e., check whether REDO location exists).
> >>
> >
> > Either is fine with me.
> >
> > Do users have any expectation that they can restore a backup without
> > using recovery.conf by merely having the WAL segments in pg_xlog?
>
> I would expect that to work.
>

If that's the expectation, I believe my original fix has less impact.
It's essentially a sanity check to make sure that the first segment
needed is available before proceeding.

Regards,
    Jeff Davis

Re: Recovery bug

From: Fujii Masao
On Tue, Oct 19, 2010 at 7:00 AM, Jeff Davis <pgsql@j-davis.com> wrote:
>> > Do users have any expectation that they can restore a backup without
>> > using recovery.conf by merely having the WAL segments in pg_xlog?
>>
>> I would expect that to work.

What's the use case?

> If that's the expectation, I believe my original fix has less impact.
> It's essentially a sanity check to make sure that the first segment
> needed is available before proceeding.

If that's really useful for some cases, I agree with your fix.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Re: Recovery bug

From: Heikki Linnakangas
On 19.10.2010 08:51, Fujii Masao wrote:
> On Tue, Oct 19, 2010 at 7:00 AM, Jeff Davis<pgsql@j-davis.com>  wrote:
>>>> Do users have any expectation that they can restore a backup without
>>>> using recovery.conf by merely having the WAL segments in pg_xlog?
>>>
>>> I would expect that to work.
>
> What's the use case?

Creating a stand-alone backup tarball that contains both the base backup
and all WAL files required to restore.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Recovery bug

From: Fujii Masao
On Tue, Oct 19, 2010 at 5:18 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> On 19.10.2010 08:51, Fujii Masao wrote:
>>
>> On Tue, Oct 19, 2010 at 7:00 AM, Jeff Davis<pgsql@j-davis.com> wrote:
>>>>>
>>>>> Do users have any expectation that they can restore a backup without
>>>>> using recovery.conf by merely having the WAL segments in pg_xlog?
>>>>
>>>> I would expect that to work.
>>
>> What's the use case?
>
> Creating a stand-alone backup tarball that contains both the base backup
> and all WAL files required to restore.

Oh, I had forgotten that case. Thanks. I'd drop my proposal.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Re: Recovery bug

From: Heikki Linnakangas
On 18.10.2010 01:48, Jeff Davis wrote:
> On Fri, 2010-10-15 at 15:58 -0700, Jeff Davis wrote:
>> I don't have a fix yet, because I think it requires a little discussion.
>> For instance, it seems to be dangerous to assume that we're starting up
>> from a backup with access to the archive when it might have been a crash
>> of the primary system. This is obviously wrong in the case of an
>> automatic restart, or one with no restore_command. Fixing this issue
>> might also remove the annoying "If you are not restoring from a backup,
>> try removing..." PANIC error message.
>>
>> Also, in general we should do more logging during recovery, at least the
>> first stages, indicating what WAL segments it's looking for to get
>> started, why it thinks it needs that segment (from backup or control
>> data), etc. Ideally we would verify that the necessary files exist (at
>> least the initial ones) before making permanent changes. It was pretty
>> painful trying to work backwards on this problem from the final
>> controldata (where checkpoint and prior checkpoint are the same, and
>> redo is before both), a crash, a PANIC, a backup_label.old, and not much
>> else.
>>
>
> Here's a proposed fix. I didn't solve the problem of determining whether
> we really are restoring a backup, or if there's just a backup_label file
> left around.
>
> I did two things:
>    1. If reading a checkpoint from the backup_label location, verify that
> the REDO location for that checkpoint exists in addition to the
> checkpoint itself. If not, elog with a FATAL immediately.

Makes sense. I wonder if we could just move the rename() after reading
the checkpoint record?

>    2. Change the error that happens when the checkpoint location
> referenced in the backup_label doesn't exist to a FATAL. If it can
> happen due to a normal crash, a FATAL seems more appropriate than a
> PANIC.

I guess, although it's really not appropriate that the database doesn't
recover after a crash during a base backup.

> I still think it would be nice if postgres knew whether it was restoring
> a backup or recovering from a crash, otherwise it's hard to
> automatically recover from failures. I thought about using the presence
> of recoveryRestoreCommand or PrimaryConnInfo to determine that. But it
> seemed potentially dangerous if the person restoring a backup simply
> forgot to set those, and then it tries restoring from the controldata
> instead (which is unsafe to do during a backup).

Right, that's not good either.

One alternative is to not remove any WAL files during a base backup. The
obvious downside is that if the backup takes a long time, you run out of
disk space.

The fundamental problem is that by definition, a base backup is
completely indistinguishable from the data directory in the original
server. Or is it? We recommend that you exclude the files under pg_xlog
from the backup. So we could create a "pg_xlog/just_kidding" file along
with backup_label. When starting recovery, if just_kidding exists, we
can assume that we're doing crash recovery and ignore backup_label.

Excluding pg_xlog is just a recommendation at the moment, though, so we
would need a big warning in the docs. And some way to enforce that
just_kidding is not included in the backup would be nice, maybe we could
remove read-permission from it?
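A sketch of what that safeguard might look like (nothing here is an implemented feature; "just_kidding" is the hypothetical marker name from the discussion, and the demo directory stands in for $PGDATA/pg_xlog): create the marker with no read permission, so a file-level copy cannot silently pick it up.

```shell
# Create an unreadable, empty marker file in a pg_xlog-like directory.
d=$(mktemp -d)                         # stand-in for $PGDATA/pg_xlog
install -m 0000 /dev/null "$d/just_kidding"
stat -c '%a' "$d/just_kidding"         # prints 0
rm -rf "$d"
```

A copy tool that hits the marker would at least fail noisily or produce an unreadable file, rather than a convincing fake.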

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Recovery bug

From: Jeff Davis
On Tue, 2010-10-19 at 12:26 +0300, Heikki Linnakangas wrote:
> >    1. If reading a checkpoint from the backup_label location, verify that
> > the REDO location for that checkpoint exists in addition to the
> > checkpoint itself. If not, elog with a FATAL immediately.
>
> Makes sense. I wonder if we could just move the rename() after reading
> the checkpoint record?

I assume you mean "after reading the record at the REDO location"? And
you'd need to make sure that any changes to the control data were after
reading the record at the REDO location, as well.

I, too, thought about juggling the order around, but there are quite a
few global variables so it seemed fairly fragile. I'm open to
suggestion, however.

> >    2. Change the error that happens when the checkpoint location
> > referenced in the backup_label doesn't exist to a FATAL. If it can
> > happen due to a normal crash, a FATAL seems more appropriate than a
> > PANIC.
>
> I guess, although it's really not appropriate that the database doesn't
> recover after a crash during a base backup.

Agreed. I'm a little concerned that any fix for that might be intrusive
enough that we didn't want to back-patch it though.

> One alternative is to not remove any WAL files during a base backup. The
> obvious downside is that if the backup takes a long time, you run out of
> disk space.

That doesn't seem very appealing.

> The fundamental problem is that by definition, a base backup is
> completely indistinguishable from the data directory in the original
> server. Or is it? We recommend that you exclude the files under pg_xlog
> from the backup. So we could create a "pg_xlog/just_kidding" file along
> with backup_label. When starting recovery, if just_kidding exists, we
> can assume that we're doing crash recovery and ignore backup_label.

I had some similar ideas, but I rejected them because they added room
for error into the process. One of the problems is that assuming that
we're doing normal recovery when we're doing backup recovery is more
dangerous than the other way around (setting aside this particular bug I
found, anyway). That's because the control data might start the WAL
replay too _late_, which is worse than starting it too early.

> Excluding pg_xlog is just a recommendation at the moment, though, so we
> would need a big warning in the docs. And some way to enforce that
> just_kidding is not included in the backup would be nice, maybe we could
> remove read-permission from it?

Hmm, removing the read bit would add some confidence into the process. I
like that idea better than just assuming that the user won't copy it.

I think I like this direction most, because it doesn't leave us
guessing. If the file is there then we assume normal recovery. If we
encounter recovery.conf we throw FATAL. If we encounter backup_label we
can simply remove it (perhaps with a warning that there was a crash in
the middle of a backup).

However, I do still have some lingering doubts. One is that the file
probably needs to have some content, just in case a backup tool blindly
creates the directory entry and then silently ignores the fact that it
couldn't copy the contents. Another concern is: what if they are using
some kind of filesystem mirroring tool that doesn't take consistent
snapshots (do such tools exist?)? Are there other system-level tools
that people might be using that are safe now, but would be dangerous if
they somehow got a copy of the file?

Regards,
    Jeff Davis

Re: Recovery bug

From: Jeff Davis
On Tue, 2010-10-19 at 09:51 -0700, Jeff Davis wrote:
> On Tue, 2010-10-19 at 12:26 +0300, Heikki Linnakangas wrote:
> > Excluding pg_xlog is just a recommendation at the moment, though, so we
> > would need a big warning in the docs. And some way to enforce that
> > just_kidding is not included in the backup would be nice, maybe we could
> > remove read-permission from it?
>
> Hmm, removing the read bit would add some confidence into the process. I
> like that idea better than just assuming that the user won't copy it.
>
> I think I like this direction most, because it doesn't leave us
> guessing. If the file is there then we assume normal recovery. If we
> encounter recovery.conf we throw FATAL. If we encounter backup_label we
> can simply remove it (perhaps with a warning that there was a crash in
> the middle of a backup).
>

On second thought, this doesn't sound backpatch-friendly. We should
probably put a simpler fix in first and back-patch it. Then we can do
something like your proposal for 9.1. What do you think of my proposed
patch?

Regards,
    Jeff Davis

Re: Recovery bug

From: Robert Haas
On Tue, Oct 19, 2010 at 5:26 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> The fundamental problem is that by definition, a base backup is completely
> indistinguishable from the data directory in the original server. Or is it?
> We recommend that you exclude the files under pg_xlog from the backup. So we
> could create a "pg_xlog/just_kidding" file along with backup_label. When
> starting recovery, if just_kidding exists, we can assume that we're doing
> crash recovery and ignore backup_label.

I think you'll find that it's completely impossible to make this work
reliably given all of the ways people may choose to make a backup.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Recovery bug

From: Heikki Linnakangas
On 19.10.2010 22:40, Jeff Davis wrote:
> On Tue, 2010-10-19 at 09:51 -0700, Jeff Davis wrote:
>> On Tue, 2010-10-19 at 12:26 +0300, Heikki Linnakangas wrote:
>>> Excluding pg_xlog is just a recommendation at the moment, though, so we
>>> would need a big warning in the docs. And some way to enforce that
>>> just_kidding is not included in the backup would be nice, maybe we could
>>> remove read-permission from it?
>>
>> Hmm, removing the read bit would add some confidence into the process. I
>> like that idea better than just assuming that the user won't copy it.
>>
>> I think I like this direction most, because it doesn't leave us
>> guessing. If the file is there then we assume normal recovery. If we
>> encounter recovery.conf we throw FATAL. If we encounter backup_label we
>> can simply remove it (perhaps with a warning that there was a crash in
>> the middle of a backup).
>
> On second thought, this doesn't sound backpatch-friendly. We should
> probably put a simpler fix in first and back-patch it. Then we can do
> something like your proposal for 9.1. What do you think of my proposed
> patch?

Yes, let's go ahead with your original patch.

It seems we should use ReadRecord instead of the lower-level
XLogPageRead function. One difference is that ReadRecord performs a
bunch of sanity checks on the record, while XLogPageRead just reads the
raw page. Extra sanity checking before removing backup_label seems like
a good idea. Another difference is that in standby-mode, ReadRecord will
retry until it succeeds. A standby server should keep retrying, even the
very first record, until it succeeds, otherwise we have a change in
behavior.

So I'm thinking of the attached patch. I'll give this some testing on
older branches and commit.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Attachment

Re: Recovery bug

From: Jeff Davis
On Mon, 2010-10-25 at 14:44 +0300, Heikki Linnakangas wrote:
> It seems we should use ReadRecord instead of the lower-level
> XLogPageRead function. One difference is that ReadRecord performs a
> bunch of sanity checks on the record, while XLogPageRead just reads the
> raw page. Extra sanity checking before removing backup_label seems like
> a good idea. Another difference is that in standby-mode, ReadRecord will
> retry until it succeeds. A standby server should keep retrying, even the
> very first record, until it succeeds, otherwise we have a change in
> behavior.

The reason I didn't use ReadRecord is because it sets a global variable
to point to the next location in the log, so that subsequent calls can
just pass NULL for the location.

It looks like the patch leaves the global variable pointing just after
the redo location rather than the checkpoint. I haven't tested your
patch yet, but it looks like some of the following code depends on
ReadRecord(NULL,...) fetching the record right after the checkpoint
record; so I think something else is required if you want to use
ReadRecord.

Regards,
    Jeff Davis

Re: Recovery bug

From: Heikki Linnakangas
On 25.10.2010 19:04, Jeff Davis wrote:
> On Mon, 2010-10-25 at 14:44 +0300, Heikki Linnakangas wrote:
>> It seems we should use ReadRecord instead of the lower-level
>> XLogPageRead function. One difference is that ReadRecord performs a
>> bunch of sanity checks on the record, while XLogPageRead just reads the
>> raw page. Extra sanity checking before removing backup_label seems like
>> a good idea. Another difference is that in standby-mode, ReadRecord will
>> retry until it succeeds. A standby server should keep retrying, even the
>> very first record, until it succeeds, otherwise we have a change in
>> behavior.
>
> The reason I didn't use ReadRecord is because it sets a global variable
> to point to the next location in the log, so that subsequent calls can
> just pass NULL for the location.

True. XLogPageRead is new in 9.0, however. We'll have to use ReadRecord
or invent something new for back-branches anyway.

> It looks like the patch leaves the global variable pointing just after
> the redo location rather than the checkpoint. I haven't tested your
> patch yet, but it looks like some of the following code depends on
> ReadRecord(NULL,...) fetching the record right after the checkpoint
> record; so I think something else is required if you want to use
> ReadRecord.

Hmm, the next call to ReadRecord is this:

>         /*
>          * Find the first record that logically follows the checkpoint --- it
>          * might physically precede it, though.
>          */
>         if (XLByteLT(checkPoint.redo, RecPtr))
>         {
>             /* back up to find the record */
>             record = ReadRecord(&(checkPoint.redo), PANIC, false);
>         }
>         else
>         {
>             /* just have to read next record after CheckPoint */
>             record = ReadRecord(NULL, LOG, false);
>         }

In the first case, the location is given explicitly. In the second case,
the redo pointer equals the checkpoint record, so the current position
is correct even with the patch. It makes me slightly nervous, though.
It's correct today, but if someone adds code between the backup_label
check and this that assumes that the current position is the checkpoint
record, it'll fail. Then again, any new ReadRecord call in such added
code would also break the assumption in the above block that the current
position is the checkpoint record.

In the case that the redo pointer is the same as the checkpoint record,
we don't need to re-fetch the checkpoint record. I've added a test for
that in the attached patch.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Attachment

Re: Recovery bug

From: Heikki Linnakangas
On 26.10.2010 10:48, Heikki Linnakangas wrote:
> On 25.10.2010 19:04, Jeff Davis wrote:
>> On Mon, 2010-10-25 at 14:44 +0300, Heikki Linnakangas wrote:
>>> It seems we should use ReadRecord instead of the lower-level
>>> XLogPageRead function. One difference is that ReadRecord performs a
>>> bunch of sanity checks on the record, while XLogPageRead just reads the
>>> raw page. Extra sanity checking before removing backup_label seems like
>>> a good idea. Another difference is that in standby-mode, ReadRecord will
>>> retry until it succeeds. A standby server should keep retrying, even the
>>> very first record, until it succeeds, otherwise we have a change in
>>> behavior.
>>
>> The reason I didn't use ReadRecord is because it sets a global variable
>> to point to the next location in the log, so that subsequent calls can
>> just pass NULL for the location.
>
> True. XLogPageRead is new in 9.0, however. We'll have to use ReadRecord
> or invent something new for back-branches anyway.

Ok, committed the patch I posted earlier today using ReadRecord.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Recovery bug

From: Jeff Davis
On Tue, 2010-10-26 at 10:48 +0300, Heikki Linnakangas wrote:
> > The reason I didn't use ReadRecord is because it sets a global variable
> > to point to the next location in the log, so that subsequent calls can
> > just pass NULL for the location.
>
> True. XLogPageRead is new in 9.0, however. We'll have to use ReadRecord
> or invent something new for back-branches anyway.
>
> > It looks like the patch leaves the global variable pointing just after
> > the redo location rather than the checkpoint. I haven't tested your
> > patch yet, but it looks like some of the following code depends on
> > ReadRecord(NULL,...) fetching the record right after the checkpoint
> > record; so I think something else is required if you want to use
> > ReadRecord.
>
> Hmm, the next call to ReadRecord is this:
>
> >         /*
> >          * Find the first record that logically follows the checkpoint --- it
> >          * might physically precede it, though.
> >          */
> >         if (XLByteLT(checkPoint.redo, RecPtr))
> >         {
> >             /* back up to find the record */
> >             record = ReadRecord(&(checkPoint.redo), PANIC, false);
> >         }
> >         else
> >         {
> >             /* just have to read next record after CheckPoint */
> >             record = ReadRecord(NULL, LOG, false);
> >         }
>
> In the first case, the location is given explicitly. In the second case,
> the redo pointer equals the checkpoint record, so the current position
> is correct even with the patch. It makes me slightly nervous, though.
> It's correct today, but if someone adds code between the backup_label
> check and this that assumes that the current position is the checkpoint
> record, it'll fail. Then again, any new ReadRecord call in such added
> code would also break the assumption in the above block that the current
> position is the checkpoint record.
>
> In the case that the redo pointer is the same as the checkpoint record,
> we don't need to re-fetch the checkpoint record. I've added a test for
> that in the attached patch.

There is a problem with this patch. ReadRecord() not only modifies
global variables, it also modifies the location pointed to by "record",
which is later used to set "wasShutdown". How about if we only set
"wasShutdown" if there is no backup_label (because the checkpoint for
pg_start_backup() will never be a shutdown checkpoint anyway)?

Trivial patch attached.

It's not easy to reproduce a real problem, but the easiest way that I
found is by creating a commit record at the REDO location. Put a sleep()
in CheckPointGuts() to give you time before the checkpoint completes.

Session1:
  CREATE TABLE foo(i int);
  BEGIN;
  INSERT INTO foo VALUES(1);

Session2:
  SELECT pg_start_backup('foo');

Session1 (before checkpoint for pg_start_backup() completes):
  COMMIT;

Then, perform backup, stop backup, shutdown the server. Then try to
restore the backup, and you'll get a PANIC.

This seems like a pretty serious issue to me (backups could appear
unrecoverable), so please consider this before the next patch-level
release so that the bad fix doesn't go out to the world. Also, you might
want to double-check that there aren't other side effects that we're
still missing.

Regards,
    Jeff Davis


Attachment

Re: Recovery bug

From: Heikki Linnakangas
On 11.11.2010 02:20, Jeff Davis wrote:
> There is a problem with this patch. ReadRecord() not only modifies
> global variables, it also modifies the location pointed to by "record",
> which is later used to set "wasShutdown". How about if we only set
> "wasShutdown" if there is no backup_label (because the checkpoint for
> pg_start_backup() will never be a shutdown checkpoint anyway)?

Oh, good catch.

> Trivial patch attached.

Hmm, you're assuming that the checkpoint specified in the backup label
is never a shutdown checkpoint. I guess that's true, but I'd rather do
this (patch attached).

(StartupXLOG is quite a mess...)

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Attachment

Re: Recovery bug

From: Jeff Davis
On Thu, 2010-11-11 at 18:20 +0200, Heikki Linnakangas wrote:
> On 11.11.2010 02:20, Jeff Davis wrote:
> > There is a problem with this patch. ReadRecord() not only modifies
> > global variables, it also modifies the location pointed to by "record",
> > which is later used to set "wasShutdown". How about if we only set
> > "wasShutdown" if there is no backup_label (because the checkpoint for
> > pg_start_backup() will never be a shutdown checkpoint anyway)?
>
> Oh, good catch.
>
> > Trivial patch attached.
>
> Hmm, you're assuming that the checkpoint specified in the backup label
> is never a shutdown checkpoint. I guess that's true, but I'd rather do
> this (patch attached).
>

Looks good to me. Thanks!

> (StartupXLOG is quite a mess...)

Agreed.

Regards,
    Jeff Davis