On Fri, 2008-12-19 at 18:59 +0000, Gregory Stark wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
>
> > The error message ought to be "snapshot too old", which could raise a
> > chuckle, so I called it something else.
> >
> > The point you raise is a good one and I think we should publish a list
> > of retryable error messages. I contemplated once proposing a special log
> > level for a retryable error, but not quite a good idea.
>
> I'm a bit concerned about the idea of killing off queries to allow WAL to
> proceed. While I have nothing against that being an option I think we should
> be aiming to make it not necessary for correctness and not the default. By
> default I think WAL replay should stick to stalling WAL replay and only resort
> to killing queries if the user specifically requests it.

Increasing the waiting time increases the failover time and thus
decreases the value of the standby as an HA system. Others value high
availability more highly than you do, so we agreed to provide an option
that allows the maximum waiting time to be set.

max_standby_delay is set in recovery.conf, with a range from 0 (wait
forever) to 2,000,000 secs, settable in milliseconds. So think of it
like a deadlock detector for recovery apply.
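
As a rough illustration of that "deadlock detector" behaviour, here is a
minimal sketch. It is not PostgreSQL source code: the function name, the
return values and the use of seconds are assumptions made purely for
illustration. Only the parameter name max_standby_delay and the meaning
of 0 (wait forever) come from the description above.

```python
# Hypothetical sketch of the timeout decision, not actual PostgreSQL code.
WAIT_FOREVER = 0  # per the description above, 0 means "wait forever"


def resolve_conflict(waited_secs, max_standby_delay):
    """Decide what WAL replay does while a standby query blocks it.

    Returns "wait" to keep stalling replay, or "cancel_query" once the
    configured delay has been used up. Both labels are illustrative.
    """
    if max_standby_delay == WAIT_FOREVER:
        return "wait"  # never cancel queries; replay stalls indefinitely
    if waited_secs < max_standby_delay:
        return "wait"  # still within the allowed delay
    return "cancel_query"  # delay exhausted; replay takes priority
```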

Also, there is a set of functions to control the way recovery proceeds,
much as you might control an MP3 player (start, stop, pause). There are
also functions to pause at specific xids, pause at a specific time, or
pause at the next cleanup record. That allows you to set
max_standby_delay lower and then freeze the server for longer to run a
long query if required. It also allows you to do PITR by trial and error
rather than with one-shot, specify-in-advance settings. There is a
function to manually end recovery at a useful place if desired.

I hope your needs and wishes are catered for by that?
(I have a Plan B in case we need it during wider user testing, as
explained up thread.)

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support