Re: [PATCHES] Infrastructure changes for recovery - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: [PATCHES] Infrastructure changes for recovery
Date
Msg-id 1222691596.4445.1188.camel@ebony.2ndQuadrant
In response to Re: [PATCHES] Infrastructure changes for recovery  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: [PATCHES] Infrastructure changes for recovery
List pgsql-hackers
On Sun, 2008-09-28 at 21:16 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> >> It does nothing AFAICS for the
> >> problem that when restarting archive recovery from a restartpoint,
> >> it's not clear when it is safe to start letting in backends.  You need
> >> to get past the highest LSN that has made it out to disk, and there is
> >> no good way to know what that is.
>
> > AFAICS when we set minRecoveryLoc we *never* unset it. It's recorded in
> > the controlfile, so whenever we restart we can see that it has been set
> > previously and now we are beyond it.
>
> Right ...
>
> > So if we crash during recovery and
> > then restart *after* we reached minRecoveryLoc then we resume in safe
> > mode almost immediately.
>
> Wrong.

OK, I see where you're coming from now. I agree a solution is needed.

> What minRecoveryLoc is is an upper bound for the LSNs that might be
> on-disk in the filesystem backup that an archive recovery starts from.
> (Defined as such, it never changes during a restartpoint crash/restart.)
> Once you pass that, the on-disk state as modified by any dirty buffers
> inside the recovery process represents a consistent database state.
> However, the on-disk state alone is not guaranteed consistent.  As you
> flush some (not all) of your shared buffers you enter other
> not-certainly-consistent on-disk states.  If we crash in such a state,
> we know how to use the last restartpoint plus WAL replay to recover to
> another state in which disk + dirty buffers are consistent.  However,
> we reach such a state only when we have read WAL to beyond the highest
> LSN that has reached disk --- and in recovery mode there is no clean
> way to determine what that was.
>
> Perhaps a solution is to make XLogFlush not be a no-op in recovery mode,
> but have it scribble a highest-LSN somewhere on stable storage (maybe
> scribble on pg_control itself, or maybe better someplace else).  I'm
> not totally sure about that.  But I am sure that doing nothing will
> be unreliable.

No need to write highest LSN to disk constantly...

If we restart from a restartpoint then initially the current apply LSN
will probably be earlier than the latest on-disk LSN, as you say. But
once we have completed the next restartpoint *after* the one recorded in
pg_control, we are guaranteed that the two LSNs are the same, since
otherwise we would have restarted at a later point.

That kinda works, but the problem is that restartpoints are time based,
not log based. We need them to be deterministic for us to rely upon them
in the above way. If we crash and then replay we can only be certain we
are safe when we have found a restartpoint that the previous recovery
will definitely have reached.

So we must have log-based restartpoints, using either a constant LSN
offset, or a parameter like checkpoint_segments. But if it is changeable
then it needs to be written into the control file, so we don't make a
mistake about it.
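
To make "log-based" concrete, here is a minimal sketch of a trigger that
depends only on WAL positions. It is an illustration under assumptions, not
existing backend code: the flat 64-bit `Lsn` type, the function name
`restartpoint_due` and the `segment_size` argument are all hypothetical, and
`restart_segments` is the proposed (not yet existing) setting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flat 64-bit WAL position, used here only for illustration. */
typedef uint64_t Lsn;

/*
 * Decide, from WAL positions alone, whether the checkpoint record just
 * replayed should become a restartpoint.  No clock is consulted, so a
 * second recovery run over the same WAL makes the same decision.
 */
static bool
restartpoint_due(Lsn last_restartpoint, Lsn checkpoint_record,
                 uint32_t restart_segments, uint64_t segment_size)
{
    assert(restart_segments > 0 && segment_size > 0);
    assert(checkpoint_record >= last_restartpoint);

    return checkpoint_record - last_restartpoint >=
        (uint64_t) restart_segments * segment_size;
}
```

With 16MB segments and restart_segments = 3, a checkpoint record found
exactly 48MB past the last restartpoint triggers a new one, while a record
one byte earlier does not. The answer is the same on every replay, which is
the property we need.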

So we need to:
* add an extra test to delay safe point if required
* write restart_segments value to control file
* force a restartpoint on first valid checkpoint WAL record after we
have passed restart_segments worth of log
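
The extra test in the first item could look roughly like the sketch below.
Everything in it is an assumption made for illustration: the
`RecoveryState` struct, its field names and `safe_to_admit_backends` are
hypothetical, and LSNs are again treated as flat 64-bit values.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flat 64-bit WAL position, used here only for illustration. */
typedef uint64_t Lsn;

/* Hypothetical summary of what recovery knows at a given moment. */
typedef struct RecoveryState
{
    Lsn  min_recovery_loc;            /* upper bound of LSNs in the base backup */
    Lsn  restart_lsn;                 /* restartpoint found in the control file */
    bool restarted_from_restartpoint; /* did this run begin at a restartpoint? */
    Lsn  last_completed_restartpoint; /* newest restartpoint completed so far */
    Lsn  replay_lsn;                  /* current apply position */
} RecoveryState;

/*
 * Return true once it should be safe to let backends in.
 * Rule 1: replay must have passed minRecoveryLoc.
 * Rule 2: if this run restarted from a restartpoint, delay the safe point
 * until a later (deterministic) restartpoint has been completed.
 */
static bool
safe_to_admit_backends(const RecoveryState *rs)
{
    if (rs->replay_lsn < rs->min_recovery_loc)
        return false;

    if (rs->restarted_from_restartpoint &&
        rs->last_completed_restartpoint <= rs->restart_lsn)
        return false;

    return true;
}
```

Rule 2 only works if the restartpoints are log-based, since it relies on
this run reaching a restartpoint that the previous run would also have
taken at the same WAL position.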

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support

