Re: max_standby_delay considered harmful - Mailing list pgsql-hackers

From Robert Haas
Subject Re: max_standby_delay considered harmful
Date
Msg-id AANLkTinigyOHRex2G2WrJKHlVZXPMpTTrjhg_WCbAeaM@mail.gmail.com
In response to Re: max_standby_delay considered harmful  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: max_standby_delay considered harmful
List pgsql-hackers
On Mon, May 3, 2010 at 3:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Mon, May 3, 2010 at 11:37 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I'm inclined to think that we should throw away all this logic and just
>>> have the slave cancel competing queries if the replay process waits
>>> more than max_standby_delay seconds to acquire a lock.
>
>> What if we somehow get into a situation where the replay process is
>> waiting for a lock over and over and over again, because it keeps
>> killing conflicting processes but something restarts them and they
>> take locks over again?
>
> They won't be able to take locks "over again", because the lock manager
> won't allow requests to pass a pending previous request, except in
> very limited circumstances that shouldn't hold here.  They'll queue
> up behind the replay process's lock request, not in front of it.
> (If that isn't the case, it needs to be fixed, quite independently
> of this concern.)

Well, the new backends needn't try to take "the same" locks as the
existing backends - the point is that in the worst case this proposal
means waiting max_standby_delay for EACH replayed WAL record that
requires taking a lock.  And that might be a LONG time.
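
To put a rough number on that worst case, here is a minimal sketch in Python. It is not PostgreSQL code: the function name and the example figures are made up for illustration, and it assumes every conflicting record waits the full max_standby_delay before the competing queries are cancelled.

```python
def worst_case_added_lag(num_lock_conflicts: int, max_standby_delay_s: float) -> float:
    """Upper bound on extra replay lag, in seconds.

    Assumes (hypothetically) that each lock-requiring record waits the
    full max_standby_delay_s before the conflicting queries are cancelled.
    """
    return num_lock_conflicts * max_standby_delay_s


if __name__ == "__main__":
    # Illustrative numbers only: 100 conflicting records, 30-second limit.
    lag = worst_case_added_lag(100, 30.0)
    print(f"worst-case added lag: {lag:.0f} s ({lag / 60:.0f} min)")
```

The bound grows linearly with the number of conflicts, so a per-lock timeout alone puts no fixed cap on how far the standby falls behind.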

One idea I had while thinking this over was to bound the maximum
amount of unapplied WAL rather than the absolute amount of time lag.
Now, that's a little fruity, because your WAL volume might fluctuate
considerably, so you wouldn't really know how far the slave was behind
the master chronologically.  However, it would avoid all the time skew
issues, and it would also more accurately model the idea of a bound on
recovery time should we need to promote the standby to master, so
maybe it works out to a win.  You could still end up stuck
semi-permanently behind, but never by more than N segments.
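
As a sketch of what a WAL-volume bound might look like, the check below compares the amount of received-but-unapplied WAL against a limit expressed in segments. Everything here is an assumption for illustration: the function name, the max_lag_segments parameter, and the idea of treating WAL positions as plain byte offsets. The 16 MB figure is the default WAL segment size.

```python
WAL_SEGMENT_BYTES = 16 * 1024 * 1024  # default WAL segment size (16 MB)


def exceeds_wal_lag_bound(received_pos: int, replayed_pos: int,
                          max_lag_segments: int) -> bool:
    """Return True when unapplied WAL exceeds the configured bound.

    received_pos and replayed_pos are WAL positions expressed as byte
    offsets; max_lag_segments is a hypothetical setting, not an existing
    PostgreSQL parameter.
    """
    unapplied_bytes = received_pos - replayed_pos
    return unapplied_bytes > max_lag_segments * WAL_SEGMENT_BYTES


if __name__ == "__main__":
    # Standby has received 5 segments' worth of WAL but replayed only 1.
    received = 5 * WAL_SEGMENT_BYTES
    replayed = 1 * WAL_SEGMENT_BYTES
    print(exceeds_wal_lag_bound(received, replayed, max_lag_segments=3))
```

Because the comparison uses only WAL positions, it needs no timestamps from the master, which is how it sidesteps the clock-skew problem.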

Stephen's idea of a mode where we wait up to max_standby_delay for a
lock but then kill everything in our path until we've caught up again
is another possible way of approaching this problem, although it may
lead to "kill storms".  Some of that may be inevitable, though: a
bound on WAL lag has the same issue - if the primary is generating WAL
faster than the standby can apply it, the standby will eventually
decide to slaughter everything in its path.
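
The wait-then-catch-up mode can be sketched as a small state machine. This is only one reading of the idea, with hypothetical class and method names: wait on a conflict until max_standby_delay is exceeded, then cancel every conflicting query immediately until no unapplied WAL remains.

```python
from dataclasses import dataclass


@dataclass
class CatchUpPolicy:
    """Hypothetical model of the wait-then-kill-until-caught-up mode."""
    max_standby_delay_s: float
    catching_up: bool = False

    def on_lock_conflict(self, waited_s: float) -> str:
        """Decide what replay does about a conflict it has waited waited_s on."""
        if self.catching_up:
            return "cancel"          # already behind: no grace period
        if waited_s >= self.max_standby_delay_s:
            self.catching_up = True  # timeout hit: enter catch-up mode
            return "cancel"
        return "wait"

    def on_replay_progress(self, unapplied_bytes: int) -> None:
        """Leave catch-up mode once all received WAL has been applied."""
        if unapplied_bytes == 0:
            self.catching_up = False


if __name__ == "__main__":
    policy = CatchUpPolicy(max_standby_delay_s=30.0)
    print(policy.on_lock_conflict(5.0))   # still within the delay
    print(policy.on_lock_conflict(30.0))  # delay exhausted
    print(policy.on_lock_conflict(0.0))   # catch-up mode: immediate cancel
```

The third call shows where a "kill storm" would come from: while catching_up is set, every new conflict is cancelled with no waiting at all.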

...Robert

