Thread: Hot standby, recovery infrastructure
Regarding this comment:

> + /*
> +  * Prior to 8.4 we wrote a Shutdown Checkpoint at the end of recovery.
> +  * This could add minutes to the startup time, so we want bgwriter
> +  * to perform it. This then frees the Startup process to complete so we can
> +  * allow transactions and WAL inserts. We still write a checkpoint, but
> +  * it will be an online checkpoint. Online checkpoints have a redo
> +  * location that can be prior to the actual checkpoint record. So we want
> +  * to derive that redo location *before* we let anybody else write WAL,
> +  * otherwise we might miss some WAL records if we crash.
> +  */

Does this describe a failure case or something that would cause corruption? The tone of the message implies so, but I don't see anything wrong with deriving the redo location for the first checkpoint the usual way.

I believe the case of "missing some WAL records" refers to the possibility that someone connects to the database and makes a WAL-logged change before the first checkpoint starts. But if we then crash before the checkpoint finishes, we'll start crash recovery from the previous restartpoint/checkpoint as usual, and replay that WAL record as well. And if the first checkpoint finishes, the redo ptr of that checkpoint is after that WAL record, and those changes are safely on disk.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
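Heikki's argument can be sketched as a toy model (illustrative only, not PostgreSQL code, and the names are made up for the example): WAL is an append-only log, each completed checkpoint records a redo pointer, and crash recovery replays everything from the redo pointer of the last completed checkpoint.

```python
# Toy model: a WAL record written before the first post-recovery checkpoint
# is either replayed (crash before the checkpoint completes) or already
# covered by the new checkpoint's redo pointer (checkpoint completes).

class WAL:
    def __init__(self):
        self.records = []        # append-only log; LSN = list index
        self.checkpoints = []    # completed checkpoints: (redo_ptr, ckpt_ptr)

    def append(self, payload):
        self.records.append(payload)
        return len(self.records) - 1

    def begin_checkpoint(self):
        # Online checkpoint: the redo pointer is taken when the checkpoint
        # *starts*; the checkpoint record itself is written later.
        return len(self.records)

    def complete_checkpoint(self, redo_ptr):
        ckpt_ptr = self.append(("CHECKPOINT", redo_ptr))
        self.checkpoints.append((redo_ptr, ckpt_ptr))

    def crash_recover(self):
        # Replay from the redo pointer of the last completed checkpoint.
        redo_ptr = self.checkpoints[-1][0] if self.checkpoints else 0
        return self.records[redo_ptr:]

wal = WAL()
wal.complete_checkpoint(wal.begin_checkpoint())   # previous checkpoint

# A backend writes a record before the first post-recovery checkpoint starts:
wal.append(("INSERT", "t1"))

# Case 1: crash before the next checkpoint finishes -> recovery starts at
# the previous checkpoint's redo pointer and replays the INSERT.
assert ("INSERT", "t1") in wal.crash_recover()

# Case 2: the next checkpoint starts after the record and completes ->
# the redo pointer is past the INSERT, whose effects were flushed to disk.
redo = wal.begin_checkpoint()
wal.complete_checkpoint(redo)
assert ("INSERT", "t1") not in wal.crash_recover()
```

In both cases the record's effects survive a crash, which is the crux of the question: why would deriving the redo location the usual way lose anything?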
On Tue, 2009-01-27 at 15:59 +0200, Heikki Linnakangas wrote:

> Regarding this comment:
>
> > + /*
> > +  * Prior to 8.4 we wrote a Shutdown Checkpoint at the end of recovery.
> > +  * This could add minutes to the startup time, so we want bgwriter
> > +  * to perform it. This then frees the Startup process to complete so we can
> > +  * allow transactions and WAL inserts. We still write a checkpoint, but
> > +  * it will be an online checkpoint. Online checkpoints have a redo
> > +  * location that can be prior to the actual checkpoint record. So we want
> > +  * to derive that redo location *before* we let anybody else write WAL,
> > +  * otherwise we might miss some WAL records if we crash.
> > +  */
>
> Does this describe a failure case or something that would cause
> corruption? The tone of the message implies so, but I don't see anything
> wrong with deriving the redo location for the first checkpoint the usual
> way.
>
> I believe the case of "missing some WAL records" refers to the
> possibility that someone connects to the database and makes a WAL-logged
> change before the first checkpoint starts. But if we then crash before
> the checkpoint finishes, we'll start crash recovery from the previous
> restartpoint/checkpoint as usual, and replay that WAL record as well.
> And if the first checkpoint finishes, the redo ptr of that checkpoint is
> after that WAL record,

Sorry, this is another one of those "yes, I thought that at first" moments.

> and those changes are safely on disk.

They may not be. They might have happened after BufferSync marks all dirty buffers BM_CHECKPOINT_NEEDED and yet before we write the physical checkpoint record.

The idea of the checkpoint is to confirm that recovery is complete and to make sure the starting point for crash recovery isn't somewhere in the archive. We must record the logical start before we allow any changes to be written, otherwise we might miss the intermediate changes.

Just think standard-online-checkpoint and it all fits.
-- 
Simon Riggs        www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
> On Tue, 2009-01-27 at 15:59 +0200, Heikki Linnakangas wrote:
>> Regarding this comment:
>>
>>> + /*
>>> +  * Prior to 8.4 we wrote a Shutdown Checkpoint at the end of recovery.
>>> +  * This could add minutes to the startup time, so we want bgwriter
>>> +  * to perform it. This then frees the Startup process to complete so we can
>>> +  * allow transactions and WAL inserts. We still write a checkpoint, but
>>> +  * it will be an online checkpoint. Online checkpoints have a redo
>>> +  * location that can be prior to the actual checkpoint record. So we want
>>> +  * to derive that redo location *before* we let anybody else write WAL,
>>> +  * otherwise we might miss some WAL records if we crash.
>>> +  */
>>
>> Does this describe a failure case or something that would cause
>> corruption? The tone of the message implies so, but I don't see anything
>> wrong with deriving the redo location for the first checkpoint the usual
>> way.
>>
>> I believe the case of "missing some WAL records" refers to the
>> possibility that someone connects to the database and makes a WAL-logged
>> change before the first checkpoint starts. But if we then crash before
>> the checkpoint finishes, we'll start crash recovery from the previous
>> restartpoint/checkpoint as usual, and replay that WAL record as well.
>> And if the first checkpoint finishes, the redo ptr of that checkpoint is
>> after that WAL record,
>
> Sorry, this is another one of those "yes, I thought that at first"
> moments.
>
>> and those changes are safely on disk.
>
> They may not be. They might have happened after BufferSync marks all
> dirty buffers BM_CHECKPOINT_NEEDED and yet before we write the physical
> checkpoint record.

If the WAL record is written after BufferSync has started, the redo pointer is < that WAL record. If we crash, recovery will start at the redo pointer, and will replay that WAL record.

> The idea of the checkpoint is to confirm that recovery is complete and
> to make sure the starting point for crash recovery isn't somewhere in
> the archive.

Ah, that's what you're worrying about. But I don't see that being an issue here. If we remove the special handling of the first checkpoint, the redo pointer will always be >= what the patch does now. So it can't be in the archive any more than it is now.

Hmm, I think we have a small issue if the last WAL segment restored from the archive is an incomplete one:

0. Archive recovery runs as usual.
1. Recovery hits the end of valid WAL, in the last incomplete WAL segment.
2. Checkpoint is started (let's assume that the redo pointer is the end of valid WAL, as the patch does now).
3. The startup process ends, and the system is opened for connections.
4. A new WAL record is written (and flushed) to the last incomplete WAL segment.
5. Crash.

In crash recovery after that, recovery.conf is still in place (thanks to the changes in the patch, it's not renamed until the first checkpoint is finished). We always prefer the files from the archive, even if the same file already exists in pg_xlog, so we'll overwrite it with the version from the archive that doesn't contain the new WAL record written in step 4.

I think we need to keep the renaming of recovery.conf unchanged.

> Just think standard-online-checkpoint and it all fits.

Exactly that made me wonder why the first checkpoint needs to be any different.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
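The five-step scenario can be sketched as a toy model (illustrative only, not PostgreSQL code; the segment name and record labels are invented, and the model deliberately ignores the timeline mechanism that comes up later in the thread, which is what prevents this in practice):

```python
# Toy model of the scenario: while recovery.conf is still in place, crash
# recovery prefers the archived copy of a segment over the local one, so a
# record flushed locally after the system opened up (step 4) is shadowed.

SEG = "000000010000000000000005"                 # hypothetical last segment

archive = {SEG: ["rec1", "rec2"]}                # incomplete archived copy
pg_xlog = {SEG: ["rec1", "rec2"]}                # local copy, same contents

# Step 4: a new record is written and flushed to the local segment.
pg_xlog[SEG].append("rec3")

def read_segment(name, recovery_conf_present):
    # Archive recovery prefers the archived copy over the local file.
    if recovery_conf_present and name in archive:
        return archive[name]
    return pg_xlog[name]

# Step 5: crash with recovery.conf still in place -> the archived copy
# shadows the local segment, and "rec3" is never replayed.
assert "rec3" not in read_segment(SEG, recovery_conf_present=True)

# If recovery.conf had been renamed before opening up, the local copy
# (including "rec3") would be used instead.
assert "rec3" in read_segment(SEG, recovery_conf_present=False)
```

The second assertion is the behaviour the pre-patch renaming of recovery.conf preserved.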
On Tue, 2009-01-27 at 17:50 +0200, Heikki Linnakangas wrote:

> Hmm, I think we have a small issue if the last WAL segment restored from
> the archive is an incomplete one:

All normal archive recoveries have "complete" WAL files, since an xlog switch record jumps past missing entries at the end of the file. So I was just about to say "how can it be an incomplete one?", to which the answer is file damage of some kind causing a failed CRC check.

So yes, it is possible, and no, the patch doesn't work that way yet. It's enough to give WALMgr problems at failover, certainly.

-- 
Simon Riggs        www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Tue, 2009-01-27 at 17:50 +0200, Heikki Linnakangas wrote:

> > Just think standard-online-checkpoint and it all fits.
>
> Exactly that made me wonder why the first checkpoint needs to be any
> different.

Not really following you, to be honest, assuming this was a separate point from the other part of the mail.

-- 
Simon Riggs        www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
> On Tue, 2009-01-27 at 17:50 +0200, Heikki Linnakangas wrote:
>
>> Hmm, I think we have a small issue if the last WAL segment restored from
>> the archive is an incomplete one:
>
> All normal archive recoveries have "complete" WAL files, since an xlog
> switch record jumps past missing entries at the end of the file. So I
> was just about to say "how can it be an incomplete one?", to which the
> answer is file damage of some kind causing a failed CRC check.
>
> So yes, it is possible, and no, the patch doesn't work that way yet.

Actually, the code already deals with that possibility. That's the motivation for the timelines: the new WAL record will be generated with a new timelineid. We never append to an existing WAL file after archive recovery.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
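Why the timeline switch closes the hole can be sketched in a toy model (illustrative only, not PostgreSQL code): WAL segment file names begin with the timeline ID, and archive recovery ends by switching to a new timeline, so post-recovery records land in a file the old timeline's archive can never shadow.

```python
# Toy model: segment names are TIMELINE + LOG + SEG, each 8 hex digits.
# After archive recovery the server continues on a new timeline, so even
# at the same WAL position it writes into a differently-named file.

def segment_name(timeline, log, seg):
    return "%08X%08X%08X" % (timeline, log, seg)

old = segment_name(1, 0, 5)   # last (possibly incomplete) archived segment
new = segment_name(2, 0, 5)   # same position, new timeline after recovery

assert old == "000000010000000000000005"
assert new == "000000020000000000000005"

# The archived old-timeline file can never overwrite the new records,
# because they live in a different file.
assert old != new
```

So even if crash recovery re-restores the old timeline's segment from the archive, the records written after recovery are untouched.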
On Tue, 2009-01-27 at 20:11 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Tue, 2009-01-27 at 17:50 +0200, Heikki Linnakangas wrote:
> >
> >> Hmm, I think we have a small issue if the last WAL segment restored from
> >> the archive is an incomplete one:
> >
> > All normal archive recoveries have "complete" WAL files, since an xlog
> > switch record jumps past missing entries at the end of the file. So I
> > was just about to say "how can it be an incomplete one?", to which the
> > answer is file damage of some kind causing a failed CRC check.
> >
> > So yes, it is possible, and no, the patch doesn't work that way yet.
>
> Actually, the code already deals with that possibility. That's the
> motivation for the timelines: the new WAL record will be generated with
> a new timelineid. We never append to an existing WAL file after archive
> recovery.

Now I'm confused. You're disagreeing with what you just said?

-- 
Simon Riggs        www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
> On Tue, 2009-01-27 at 20:11 +0200, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> On Tue, 2009-01-27 at 17:50 +0200, Heikki Linnakangas wrote:
>>>
>>>> Hmm, I think we have a small issue if the last WAL segment restored from
>>>> the archive is an incomplete one:
>>>
>>> All normal archive recoveries have "complete" WAL files, since an xlog
>>> switch record jumps past missing entries at the end of the file. So I
>>> was just about to say "how can it be an incomplete one?", to which the
>>> answer is file damage of some kind causing a failed CRC check.
>>>
>>> So yes, it is possible, and no, the patch doesn't work that way yet.
>>
>> Actually, the code already deals with that possibility. That's the
>> motivation for the timelines: the new WAL record will be generated with
>> a new timelineid. We never append to an existing WAL file after archive
>> recovery.
>
> Now I'm confused. You're disagreeing with what you just said?

Right. I thought I had found a problem, but then I found out that it's not a problem. Aka false alarm.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com