Re: PITR potentially broken in 9.2 - Mailing list pgsql-bugs

From Simon Riggs
Subject Re: PITR potentially broken in 9.2
Date
Msg-id CA+U5nMKOsUWKP++QH=0nMff27t-KbLTEz560CzXz_upn-k1LDQ@mail.gmail.com
Whole thread Raw
In response to Re: PITR potentially broken in 9.2  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: PITR potentially broken in 9.2
Re: PITR potentially broken in 9.2
List pgsql-bugs
On 5 December 2012 02:27, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
>>> But the key is, the database was not actually consistent at that
>>> point, and so opening hot standby was a dangerous thing to do.
>>>
>>> The bug that allowed the database to open early (the original topic if
>>> this email chain) was masking this secondary issue.
>
>> Could you check whether the attached patch fixes the behaviour?
>
> Yeah, I had just come to the same conclusion: the bug is not with
> Heikki's fix, but with the pause logic.  The comment says that it
> shouldn't pause unless users can connect to un-pause, but the actual
> implementation of that test is several bricks shy of a load.

OK, I've had a look at this now.

Andres correctly identified the bug above, which was that the backup
end is noted by the XLOG_BACKUP_END record. recoveryStopsHere() was
not changed when that record type was added, so it ignores the
situation that we are waiting for end of backup and yet it stop
anyway. So I concur with Jeff that there is a bug and think that
Andres has provided the clue to a fix. I'll patch that now. Aboriginal
bug extends back to 9.0.

Pause has got nothing at all to do with this, even if there are other
problems there.

> Your patch is better, but it's still missing a bet: what we need to be
> sure of is not merely that we *could* have told the postmaster to start
> hot standby, but that we *actually have done so*.  Otherwise, it's
> flow-of-control-dependent whether it works or not; we could fail if the
> main loop hasn't gotten to CheckRecoveryConsistency since the conditions
> became true.  Looking at the code in CheckRecoveryConsistency, testing
> LocalHotStandbyActive seems appropriate.
>
> I also thought it was pretty dangerous that this absolutely critical
> test was not placed in recoveryPausesHere() itself, rather than relying
> on the call sites to remember to do it.
>
> So the upshot is that I propose a patch more like the attached.

I've reworked pause logic along the lines you suggest. Attached here
for further discussion.

> This is not a regression because the pause logic is broken this same
> way since 9.1.  So I no longer think that we need a rewrap.

I think we do need a rewrap, since the bug is not in pause logic.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

pgsql-bugs by date:

Previous
From: "Duggirala, Manikanth (TCS)"
Date:
Subject: PostgreSQL v8.1.11 compatibility with OS 2008 R2
Next
From: Tom Lane
Date:
Subject: Re: PITR potentially broken in 9.2