Thread: Pause/Resume feature for Hot Standby

Pause/Resume feature for Hot Standby

From

Simon Riggs

Date:

04 May 2010, 05:14:41

In the original patch I had Pause/Resume feature for controlling
recovery during Hot Standby. It was removed for lack of time.

With all the discussion around the HS UI, it would be something that
could be back very easily.

I would like to do it as a recovery control plugin. The plugin would be
passed 3 pieces of information and be called before each WAL record was
applied:

* current WAL pointer
* xid - 0 if not a commit/abort
* timestamp - if available
* boolean flag indicating whether it's a record type that conflicts

**No user data would be passed to the plugin**, so no need to revisit
the discussions around WAL plugins etc.. The plugin has the benefit of
providing a whole range of possible control options, as well as being
minimal performance overhead.

This would allow initially allow
* Pause
* Resume
and would go into 9.0 as a contrib module, included with the plugin
patch.

Later we would be able to add on such things as
* Pause for a delay
* Seek to a particular xid commit record
* Seek to a particular WAL pointer and stop

This would be particularly helpful in designing an automated test suite,
since we can recheck the snapshot after each commit to verify it matches
the primary.

-- Simon Riggs           www.2ndQuadrant.com

Re: Pause/Resume feature for Hot Standby

From

Dimitri Fontaine

Date:

04 May 2010, 05:39:32

Simon Riggs <simon@2ndQuadrant.com> writes:

> In the original patch I had Pause/Resume feature for controlling
> recovery during Hot Standby. It was removed for lack of time.
>
> With all the discussion around the HS UI, it would be something that
> could be back very easily.

Please!

Manual control over recovery is the best solution ever proposed for
giving the user explicit control over the trade-off between HA and slave
queries.

It would allow us to say that by default, conflict favors WAL recovery
no matter what. If you want to ensure your queries won't get canceled,
pause the recovery, run your report, resume the recovery.

I understand that automated and flexible conflict resolution still is
needed or wanted even with this UI, but that would allow a much more
crude automated tool to be acceptable.  Specifically, it could only
target short queries on the standby, for long running queries you don't
want to get cancelled, pause the recovery.

Regards,
-- 
dim

Re: Pause/Resume feature for Hot Standby

From

Robert Haas

Date:

04 May 2010, 10:36:39

On Tue, May 4, 2010 at 4:02 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> In the original patch I had Pause/Resume feature for controlling
> recovery during Hot Standby. It was removed for lack of time.

Well, it's not like we have more time now than we did then. I think
we need to postpone this discussion to 9.1. If we're going to start
accepting patches for new features, then why should we accept only
patches for HS/SR? I have two patches already in the queue that I'd
like to see committed and if I thought that there was a chance of
getting anything further done for 9.0, there'd be several more. Many
other people have patches waiting also, or are holding off development
because we are in feature freeze right now. Hot Standby is a great
feature, but, I don't see any reason to say that we're going to allow
new feature development just for HS but not for anything else.

I also think that worrying about fine-tuning HS at this point is a bit
like complaining that the jump suits of the crew of the Space Shuttle
Challenger were not made of 100% recyclable materials. Just yesterday
we had a report of an HS server getting into a state where it failed
to shut down properly; and I believe that we never fully resolved the
issue of occasional extremely-long spikes in HS response time, either.Heikki just fixed a bug our btree recovery code
whichis apparently

new to 9.0 since he did not backpatch it. I think that getting into a
discussion of pausing and resuming recovery, or even the parallel
discussion on max_standby_delay, are fiddling with things that,
granted, are probably not ideal, and yes, we should improve them in a
future release, but they're not what we should be worrying about right
now. What I think we SHOULD be worried about right now - VERY worried
- is stabilizing the existing Hot Standby code to the point where it
won't be an embarrassment to us when we ship it. The rate at which
we're finding new problems even with the small number of people who
test alpha releases and nightly snapshots suggests to me that we're
not there yet.

...Robert

Re: Pause/Resume feature for Hot Standby

From

Simon Riggs

Date:

04 May 2010, 12:28:18

On Tue, 2010-05-04 at 09:36 -0400, Robert Haas wrote:
> On Tue, May 4, 2010 at 4:02 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > In the original patch I had Pause/Resume feature for controlling
> > recovery during Hot Standby. It was removed for lack of time.
> 
> Well, it's not like we have more time now than we did then.  I think
> we need to postpone this discussion to 9.1.  If we're going to start
> accepting patches for new features, then why should we accept only
> patches for HS/SR? 

Robert,

This is clearly a response to issues raised about HS and not a new
feature. It's also proposed in the most minimal way possible with
respect for the current state of release. Why is you think I want to go
to beta less quickly than anyone else? I have many other items to work
on in the new release also, none of them have been even discussed, again
out of respect for the timing and the process.

> I also think that worrying about fine-tuning HS at this point is a bit
> like complaining that the jump suits of the crew of the Space Shuttle
> Challenger were not made of 100% recyclable materials.  Just yesterday
> we had a report of an HS server getting into a state where it failed
> to shut down properly; and I believe that we never fully resolved the
> issue of occasional extremely-long spikes in HS response time, either.
>  Heikki just fixed a bug our btree recovery code which is apparently
> new to 9.0 since he did not backpatch it.  I think that getting into a
> discussion of pausing and resuming recovery, or even the parallel
> discussion on max_standby_delay, are fiddling with things that,
> granted, are probably not ideal, and yes, we should improve them in a
> future release, but they're not what we should be worrying about right
> now.  What I think we SHOULD be worried about right now - VERY worried
> - is stabilizing the existing Hot Standby code to the point where it
> won't be an embarrassment to us when we ship it.  The rate at which
> we're finding new problems even with the small number of people who
> test alpha releases and nightly snapshots suggests to me that we're
> not there yet.

There hasn't been anything more than a minor bug in weeks, so not really
sure how you arrive at that the idea the code needs "stabilising". 

But even if you think we need "stabilising", how do you propose I do
that? What exact action?

When people complain, I propose solutions. If you then object that the
proposed solution is actually a new feature, that leaves us in a
deadlock.

There is no evidence that Erik's strange performance has anything to do
with HS; it hasn't been seen elsewhere and he didn't respond to
questions about the test setup to provide background. The profile didn't
fit any software problem I can see.

-- Simon Riggs           www.2ndQuadrant.com

Re: Pause/Resume feature for Hot Standby

From

Tom Lane

Date:

04 May 2010, 12:29:02

Simon Riggs <simon@2ndQuadrant.com> writes:
> In the original patch I had Pause/Resume feature for controlling
> recovery during Hot Standby. It was removed for lack of time.

> With all the discussion around the HS UI, it would be something that
> could be back very easily.

Sure.  In 9.1.  You have enough bugs to fix that you have *no* business
thinking about adding features for 9.0, even if that were permissible
under the ground rules for beta.  Pretending that it's a contrib module
is just a transparent end-run around that.
        regards, tom lane

Re: Pause/Resume feature for Hot Standby

From

Simon Riggs

Date:

04 May 2010, 12:32:15

On Tue, 2010-05-04 at 11:12 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > In the original patch I had Pause/Resume feature for controlling
> > recovery during Hot Standby. It was removed for lack of time.
> 
> > With all the discussion around the HS UI, it would be something that
> > could be back very easily.
> 
> Sure.  In 9.1.  You have enough bugs to fix that you have *no* business
> thinking about adding features for 9.0, even if that were permissible
> under the ground rules for beta.  Pretending that it's a contrib module
> is just a transparent end-run around that.

As stated, this was proposed as a response to your gripes elsewhere. 

If people gripe, I propose a solution. I'm happy if you say No to the
proposed solution, but let's not pretend I'm breaking rules all the time
when I do.

What bugs do I have to fix? I am not aware of any.

-- Simon Riggs           www.2ndQuadrant.com

Re: Pause/Resume feature for Hot Standby

From

Robert Haas

Date:

04 May 2010, 12:35:45

On Tue, May 4, 2010 at 11:10 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> This is clearly a response to issues raised about HS and not a new
> feature.

I don't find that clear at all.  In fact, I find the exact opposition
position to be clear.

> It's also proposed in the most minimal way possible with
> respect for the current state of release. Why is you think I want to go
> to beta less quickly than anyone else?

We're already in beta.  I said nothing about when you want to go to
beta or do anything else.

> There hasn't been anything more than a minor bug in weeks, so not really
> sure how you arrive at that the idea the code needs "stabilising".

I don't agree that there hasn't been anything more than a minor bug in
weeks.  I arrive at the idea that the code needs stabilizing on the
basis of the fact that we keep finding new bugs.

> When people complain, I propose solutions. If you then object that the
> proposed solution is actually a new feature, that leaves us in a
> deadlock.

Not really.  You're entitled to say what you think we should do and I
am entitled to say what I think we should do.  I think we should wait
for 9.1.

...Robert

Re: Pause/Resume feature for Hot Standby

From

Tom Lane

Date:

04 May 2010, 14:24:12

Simon Riggs <simon@2ndQuadrant.com> writes:
> There hasn't been anything more than a minor bug in weeks, so not really
> sure how you arrive at that the idea the code needs "stabilising". 

Simon, if you don't think the code needs stabilizing, you need to think
again.

* max_standby_delay logic is broken, as per other thread.

* handle_standby_sig_alarm is broken to the point of needing to be
thrown away; you can NOT do that kind of thing in an interrupt handler.

* RecordKnownAssignedTransactionIds is changing ShmemVariableCache->nextXid
without any kind of lock (in general, I suspect all the xlog replay code
needs to be revisited to see if it's skipping locks on shared data
structures that are now potentially going to be examined by backends)

* Use of StandbyTransactionIdIsPrepared seems awfully dubious: why are
we trusting the standby's pg_twophase files more than data from the WAL
log, *especially* before we have reached consistency?  Not to mention
that that's a horridly expensive operation (filesystem access) being
invoked while holding ProcArrayLock.

* Why is ExtendCLOG/ExtendSUBTRANS done in RecordKnownAssignedTransactionIds?
It's inappropriate from a modularity standpoint, and it also seems completely
wrong that it won't get done if standbyState < STANDBY_SNAPSHOT_PENDING.
nextXID manipulation there seems equally bogus not to mention unlocked.

* snapshotOldestActiveXid is bogus (I complained about this
already, you have not fixed it)

* LogStandbySnapshot is merest fantasy: no guarantee that either the XIDs
list or the locks list will be consistent with the point in WAL where it
will get inserted.  What's worse, locking things down enough to guarantee
consistency would be horrid for performance, or maybe even deadlock-inducing.
Could lose both ways: list might contain an XID whose commit/abort went
to WAL before the snapshot did, or list might be missing an XID started
just after snap was taken, The latter case could possibly be dealt with
via nextXid filtering, but that doesn't fix the former case, and anyway
we have both ends of the same problem for locks.

That's just what I found in a day or so of code reading, and I haven't
read anything like all of the HS patches.  You need to stop thinking
about adding features and start thinking about making what's in there
bulletproof.  If you happen to have an idle moment when you're not
fixing known problems, re-read some code.
        regards, tom lane

Re: Pause/Resume feature for Hot Standby

From

Simon Riggs

Date:

04 May 2010, 14:47:05

On Tue, 2010-05-04 at 13:23 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > There hasn't been anything more than a minor bug in weeks, so not really
> > sure how you arrive at that the idea the code needs "stabilising". 
> 
> Simon, if you don't think the code needs stabilizing, you need to think
> again.

This list is entirely new to me. I can't fix problems you haven't even
raised before, can I? Why have you been saving that list?? No way are
these "known problems".

> That's just what I found in a day or so of code reading, and I haven't
> read anything like all of the HS patches.  You need to stop thinking
> about adding features and start thinking about making what's in there
> bulletproof.  If you happen to have an idle moment when you're not
> fixing known problems, re-read some code.

Nobody is adding new features. Stop barracking me for something that's
not even happening, especially if you persuade yourself you should be
angry about it. I care as much about beta as anyone else.

Yes, I'll go read your list. Thank you for your review.

-- Simon Riggs           www.2ndQuadrant.com

Re: Pause/Resume feature for Hot Standby

From

Simon Riggs

Date:

04 May 2010, 17:07:16

On Tue, 2010-05-04 at 13:23 -0400, Tom Lane wrote:

> * max_standby_delay logic is broken, as per other thread.

Proposed fix submitted,

> * handle_standby_sig_alarm is broken to the point of needing to be
> thrown away; you can NOT do that kind of thing in an interrupt handler.

This was modelled very closely on handle_sig_alarm() and was reviewed by
other hackers. I'm not great on that, as you know, so if you can explain
what it is I can't do, and how that differs from handle_sig_alarm
running the deadlock detector in the same way, then I'll work on it some
more.

> * RecordKnownAssignedTransactionIds is changing ShmemVariableCache->nextXid
> without any kind of lock (in general, I suspect all the xlog replay code
> needs to be revisited to see if it's skipping locks on shared data
> structures that are now potentially going to be examined by backends)

There is only one writer and this a single integer value, so any reads
are atomic. This is not being used as a memory barrier, so our earlier
discussion about weak-memory ordering doesn't apply.

The only other reader is bgwriter.

I'm happy to add additional locking if you think its really needed.

> * Use of StandbyTransactionIdIsPrepared seems awfully dubious: why are
> we trusting the standby's pg_twophase files more than data from the WAL
> log, *especially* before we have reached consistency?  

StandbyTransactionIdIsPrepared() is only called in two places, both of
which relate to pruning the KnownAssignedXids array. Pruning only occurs
when the WAL log specifically does not contain the information we need,
which only occurs when those hypothetical FATAL errors come along. In
that case we rely upon the pg_twophase files.

Both of those call points happen in ProcArrayApplyRecoveryInfo() which
does get called before we are consistent, though we can change that if
you see a problem. At this point, I don't see an issue.

> Not to mention
> that that's a horridly expensive operation (filesystem access) being
> invoked while holding ProcArrayLock.

I just optimised that in the recent patch you committed. It isn't a high
cost item any longer now that we are able to prune KnownAssignedXids()
from the left, since pruning will typically not test more than one xid.

> * Why is ExtendCLOG/ExtendSUBTRANS done in RecordKnownAssignedTransactionIds?

Heikki placed them there, so I left that coding, since it does work.
RecordKnown..() is supposed to be the logical equivalent of assigning an
xid, so it seemed logical. Happy to move wherever you see fit.

> It's inappropriate from a modularity standpoint, and it also seems completely
> wrong that it won't get done if standbyState < STANDBY_SNAPSHOT_PENDING.

Yes, that looks like a logic error and will be fixed. However, its
trapped later by clog code to zero new blocks, so in practice there is
no bug.

> nextXID manipulation there seems equally bogus not to mention unlocked.

Traced the code, looks fine to me. Yes, unlocked.

> * snapshotOldestActiveXid is bogus (I complained about this
> already, you have not fixed it)

I understood you were fixing it, as raised during your recent review of
the KAX patch. Will fix.

> * LogStandbySnapshot is merest fantasy: no guarantee that either the XIDs
> list or the locks list will be consistent with the point in WAL where it
> will get inserted.  What's worse, locking things down enough to guarantee
> consistency would be horrid for performance, or maybe even deadlock-inducing.
> Could lose both ways: list might contain an XID whose commit/abort went
> to WAL before the snapshot did, or list might be missing an XID started
> just after snap was taken, The latter case could possibly be dealt with
> via nextXid filtering, but that doesn't fix the former case, and anyway
> we have both ends of the same problem for locks.

That was recoded by Heikki and I left it as written, though I checked
it, considered it correct and take responsibility for it. Will review
further and report back.

Thanks for the review.

-- Simon Riggs           www.2ndQuadrant.com

LogStandbySnapshot (was another thread)

From

Simon Riggs

Date:

04 May 2010, 20:42:30

On Tue, 2010-05-04 at 13:23 -0400, Tom Lane wrote:

> * LogStandbySnapshot is merest fantasy: no guarantee that either the
> XIDs list or the locks list will be consistent with the point in WAL
> where it will get inserted.  What's worse, locking things down enough
> to guarantee consistency would be horrid for performance, or maybe
> even deadlock-inducing. Could lose both ways: list might contain an
> XID whose commit/abort went to WAL before the snapshot did, or list
> might be missing an XID started just after snap was taken, The latter
> case could possibly be dealt with via nextXid filtering, but that
> doesn't fix the former case, and anyway we have both ends of the same
> problem for locks.

This was the only serious complaint on your list, so lets address it.

Clearly we don't want to lock everything down, for all the reasons you
say. That creates a gap between when data is derived and when data
logged to WAL.

LogStandbySnapshot() occurs during online checkpoints on or after the
logical checkpoint location and before the physical checkpoint location.

We start recovery from a checkpoint, so we have a starting point in WAL
for our processing. The time sequence on the primary of these related
events is

Logical Checkpoint location
newxids/commits/locks "Before1"
AccessExclusiveLocks derived
newxids/commits/locks "Before2"
AccessExclusiveLocks WAL record inserted
newxids/commits/locks "After1"
RunningXact derived
newxids/commits/locks "After2"
RunningXact WAL record inserted

though when we read them back from WAL, they will be in this order, and
we cannot tell the difference between events at Before 1 & 2 or After 1
& 2.

Logical Checkpoint location <= STANDBY_INITIALIZED
newxids/commits/locks "Before1"
newxids/commits/locks "Before2"
AccessExclusiveLocks WAL record
newxids/commits/locks "After1"
newxids/commits/locks "After2"
RunningXact WAL record <= STANDBY_SNAPSHOT_READY

We're looking for a consistent point. We don't know what the exact
time-synchronised point is on master, so we have to use an exact point
in WAL and work from there. We need to understand that the serialization
of events in the log can be slightly different to how they occurred on
the primary, but that doesn't change anything important.

So to get a set of xids + locks that are consistent at the moment the
RunningXact WAL record is read we need to 

1. Begin processing incoming changes from the time we are
STANDBY_INITIALIZED, though forgive any errors for removals of missing
items until we hit STANDBY_SNAPSHOT_READY
a) locks - we ignore missing locks in StandbyReleaseLocks()
b) xids - we ignore missing xids in KnownAssignedXidsRemove()

2. Any transaction commits/aborts from the time we are
STANDBY_INITIALIZED, through to STANDBY_SNAPSHOT_READY need to be saved,
so that we can remove them again from the snapshot state. That is
because events might otherwise exist in the standby that will never be
removed from snapshot. We do this by simple test whether the related xid
has already completed.
a) locks - we ignore locks for already completed xids in
StandbyAcquireAccessExclusiveLock()
b) xids - we ignore already completed xids in
ProcArrayApplyRecoveryInfo()

We currently do all of the above. So it looks correct to me.

-- Simon Riggs           www.2ndQuadrant.com

Re: LogStandbySnapshot (was another thread)

From

Heikki Linnakangas

Date:

05 May 2010, 03:12:36

Simon Riggs wrote:
> On Tue, 2010-05-04 at 13:23 -0400, Tom Lane wrote:
> 
>> * LogStandbySnapshot is merest fantasy: no guarantee that either the
>> XIDs list or the locks list will be consistent with the point in WAL
>> where it will get inserted.  What's worse, locking things down enough
>> to guarantee consistency would be horrid for performance, or maybe
>> even deadlock-inducing. Could lose both ways: list might contain an
>> XID whose commit/abort went to WAL before the snapshot did, or list
>> might be missing an XID started just after snap was taken, The latter
>> case could possibly be dealt with via nextXid filtering, but that
>> doesn't fix the former case, and anyway we have both ends of the same
>> problem for locks.
> 
> This was the only serious complaint on your list, so lets address it.
> 
> Clearly we don't want to lock everything down, for all the reasons you
> say. That creates a gap between when data is derived and when data
> logged to WAL.

Right. This was discussed first in August:
http://archives.postgresql.org/message-id/4A8CE561.4000302@enterprisedb.com.

I concur that the idea is that we deal at replay with the fact that the
snapshot lags behind. At replay, any locks/XIDs in the snapshot that
have already been committed/aborted are ignored. For any locks/XIDs
taken just after the snapshot was taken, the replay will see the other
WAL records with that information.

We need to add comments explaining all that.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com

Re: LogStandbySnapshot (was another thread)

From

Simon Riggs

Date:

06 May 2010, 05:04:30

On Wed, 2010-05-05 at 09:12 +0300, Heikki Linnakangas wrote:

> I concur that the idea is that we deal at replay with the fact that the
> snapshot lags behind. At replay, any locks/XIDs in the snapshot that
> have already been committed/aborted are ignored. For any locks/XIDs
> taken just after the snapshot was taken, the replay will see the other
> WAL records with that information.
>
> We need to add comments explaining all that.

The attached comments are proposed.

Reviewing this information again to propose a fix for the two minor
other bugs pointed out by Tom show that they are both related and need
one combined fix that would work like this:

Currently we handle the state STANDBY_INITIALIZED incorrectly. We need
to run RecordKnownAssignedXids() during this mode, so that we both
extend the clog and record known xids. That means that two other callers
of RecordKnownAssignedXids also need to call it at that time.

In ProcArrayApplyRecoveryInfo() we run KnownAssignedXidsAdd(), though
this will fail if there are existing xids in there, now it is sorted. So
we need to: run KnownAssignedXidsRemovePreceding(latestObservedXid) to
remove extraneous xids, then extract any xids that remain and add them
to the ones arriving with the running xacts record. We then need to sort
the combined array and re-insert into KnownAssignedXids.

Previously, I had imagined that the gap between the logical checkpoint
and the physical checkpoint was small. With spread checkpoints this
isn't the case any longer. So I propose adding a special WAL record that
is inserted during LogStandbySnapshot() immediately before
GetRunningTransactionLocks(), so that we minimise the time window
between deriving snapshot data and recording it in WAL.

Those changes are not especially invasive.

--
 Simon Riggs           www.2ndQuadrant.com

Attachment

hs_init_doc.patch