Re: Re: [COMMITTERS] pgsql: Send new protocol keepalive messages to standby servers. - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Re: [COMMITTERS] pgsql: Send new protocol keepalive messages to standby servers.
Date
Msg-id CA+TgmoZb1xW3fcG4mAS5ApuM_Ki3ey+Y5C+aSgNKX4r9HMrqdw@mail.gmail.com
Whole thread Raw
In response to Re: Re: [COMMITTERS] pgsql: Send new protocol keepalive messages to standby servers.  (Fujii Masao <masao.fujii@gmail.com>)
Responses Re: Re: [COMMITTERS] pgsql: Send new protocol keepalive messages to standby servers.
List pgsql-hackers
On Wed, May 30, 2012 at 12:17 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> OTOH, I wonder whether we really need to send keepalive messages
> periodically to calculate a network latency. ISTM we don't unless a network
> latency varies from situation to situation so frequently and we'd like to
> monitor that in almost real time.

I didn't look at this patch too carefully when it was committed.
Looking at it more carefully now, it looks to me like this patch does
two different things.  One is to add a function called
GetReplicationApplyDelay(), which returns the number of milliseconds
since replay was fully caught up.  So if you were last caught up 5
minutes ago and you have replayed 4 minutes and 50 seconds worth of
WAL during that time, this function will return 5 minutes, not 10
seconds.  That is not what I would call "apply delay", which I would
define as how far behind you are NOW, not how long it's been since you
weren't behind at all.

The second thing it does is add a function called
GetReplicationTransferLatency().  The return value of this function is
the difference between the slave's clock at the time the last master
keepalive was processed and the master's clock at the time that
keepalive was generated.  I think that in practice, unless network
time synchronization is in use, this is mostly going to be computing
the clock skew between the master and the slave. If time
synchronization is in use, then as you say it'll be a very jittery
measure of master-slave network latency, which can be monitored
perfectly well from outside PG.

Now, measuring time skew is potentially a useful thing to do, if we
believe that this will actually give us an accurate measurement of
what the time skew is, because there are a whole series of things that
people want to do which involve subtracting a slave timestamp from a
master timestamp.  Tom has persistently rebuffed all such proposals on
the grounds that there might be time skew, so in theory we could make
those things possible by having a way to measure time skew, which this
does.  Here's what we do: given a slave timestamp, add the estimated
time skew to find an equivalent master timestamp, and then subtract.
Using a method of this type would allow us to compute a *real* apply
delay.  Woohoo!  Unfortunately, if time synchronization IS in use,
then the system clocks are probably already synchronized three to six
orders of magnitude more precisely than what this method can measure,
so the effect of using GetReplicationTransferLatency() to adjust slave
timestamps will be to massively reduce the accuracy of such
calculations.  However, I've thus far been unable to convince anyone
that this is a bad idea, so maybe this is where we're gonna end up.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


pgsql-hackers by date:

Previous
From: Merlin Moncure
Date:
Subject: Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
Next
From: Kohei KaiGai
Date:
Subject: Re: [RFC] Interface of Row Level Security