Re: [HACKERS] Measuring replay lag - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: [HACKERS] Measuring replay lag
Msg-id CAEepm=3oz5_NPeF0d_sYaRedD+S4HJkCSODvNX=rd4GaiYg5ug@mail.gmail.com
In response to Re: [HACKERS] Measuring replay lag  (Simon Riggs <simon@2ndquadrant.com>)
Responses Re: [HACKERS] Measuring replay lag
List pgsql-hackers
On Fri, Feb 17, 2017 at 12:45 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Feeling happier about this for now at least.

Thanks!

> I think we need to document how this works more in README or header
> comments. That way I can review it against what it aims to do rather
> than what I think it might do.

I have added a bunch of new comments to explain this in the -v2 patch
(see reply to Abhijit).  Please let me know if you think I need to add
still more.  I'm especially interested in your feedback on the block
of comments above the line:

+   LagTrackerWrite(SendRqstPtr, GetCurrentTimestamp());

Specifically, your feedback on the sufficiency of this (LSN, time)
pair + filtering out repeat LSNs as an approximation of the time this
LSN was flushed.
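
To make the "(LSN, time) pair + filtering out repeat LSNs" idea concrete,
here is a minimal sketch of one way such a tracker could behave.  It is
not the code from the patch: every name below (SketchTracker,
sketch_write, sketch_read, SKETCH_BUFFER_SIZE) is hypothetical, the LSN
and timestamp types are plain integer stand-ins, and dropping samples
when the buffer is full is an assumption made for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BUFFER_SIZE 8        /* hypothetical; kept small for illustration */

typedef uint64_t SketchLsn;         /* stand-in for an LSN */
typedef int64_t SketchTimeUs;       /* stand-in for a timestamp, in microseconds */

typedef struct
{
    SketchLsn    lsn;
    SketchTimeUs time;
} SketchSample;

typedef struct
{
    SketchSample samples[SKETCH_BUFFER_SIZE];
    int          head;              /* next slot to write */
    int          tail;              /* oldest sample not yet consumed */
    bool         have_last;         /* has any LSN been recorded yet? */
    SketchLsn    last_lsn;          /* most recently recorded LSN */
} SketchTracker;

void
sketch_init(SketchTracker *t)
{
    assert(t != NULL);
    t->head = 0;
    t->tail = 0;
    t->have_last = false;
    t->last_lsn = 0;
}

/*
 * Record that 'lsn' was observed at time 'now'.  Returns false if the
 * sample was skipped, either because the LSN repeats the previous one
 * (so the earliest timestamp for that LSN is kept) or because the
 * buffer is full (an assumption of this sketch).
 */
bool
sketch_write(SketchTracker *t, SketchLsn lsn, SketchTimeUs now)
{
    int next;

    assert(t != NULL);
    if (t->have_last && lsn == t->last_lsn)
        return false;               /* repeat LSN: filter it out */

    next = (t->head + 1) % SKETCH_BUFFER_SIZE;
    if (next == t->tail)
        return false;               /* buffer full: drop the sample */

    t->samples[t->head].lsn = lsn;
    t->samples[t->head].time = now;
    t->head = next;
    t->have_last = true;
    t->last_lsn = lsn;
    return true;
}

/*
 * The standby reports at time 'now' that it has reached 'lsn'.  Consume
 * every sample at or before that LSN and return the elapsed time since
 * the newest one consumed, or -1 if no sample has been reached yet.
 */
SketchTimeUs
sketch_read(SketchTracker *t, SketchLsn lsn, SketchTimeUs now)
{
    SketchTimeUs recorded = -1;

    assert(t != NULL);
    while (t->tail != t->head && t->samples[t->tail].lsn <= lsn)
    {
        recorded = t->samples[t->tail].time;
        t->tail = (t->tail + 1) % SKETCH_BUFFER_SIZE;
    }
    return (recorded < 0) ? -1 : now - recorded;
}
```

In this sketch, writing LSN 100 at t=1000 and then again at t=2000
keeps only the first timestamp, so a standby reporting LSN 100 at
t=4000 yields a lag of 3000.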

> e.g. We need to document what replay_lag represents. Does it include
> write_lag and flush_lag, or is it the time since the flush_lag. i.e.
> do I add all 3 together to get the full lag, or would that cause me to
> double count?

I have included full descriptions of exactly what the 3 times
represent in the user documentation in the -v2 patch.

> How sensitive is this? Does the lag spike quickly and then disappear
> again quickly? If we're sampling this every N seconds, will we get a
> realistic viewpoint or just a random sample?

In my testing it seems to move fairly smoothly so I think sampling
every N seconds would be quite effective and would not be 'noisy'.
The main time it jumps quickly is at the end of a large data load,
when a slow standby finally reaches the end of its backlog; you see it
climb slowly up and up while the faster primary is busy generating WAL
too fast for it to apply, but then if the primary goes idle the
standby eventually catches up.  The high lag number sometimes lingers
for a bit and then pops down to a low number when new WAL arrives that
can be applied quickly.  It seems like a very accurate depiction of
what is really happening so I like that.  I would love to hear other
opinions and feedback/testing experiences!

> Should we smooth the
> value, or present peak info?

Hmm.  Well, it might be interesting to do online exponential moving
averages, similar to the three numbers Unix systems present for load.
On the other hand, I'm amazed no one has complained that I'm making
pg_stat_replication ridiculously wide already, and users/monitoring
systems could easily do that kind of thing themselves, and the number
doesn't seem to be jumpy/noisy/in need of smoothing.  Same would go for
logging over time; seems like an external monitoring tool's bailiwick.
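
For anyone who does want smoothing on the monitoring side, an online
exponential moving average needs only one stored value per series.  The
following is a minimal sketch, not part of the patch: LagEma,
lag_ema_init and lag_ema_update are hypothetical names, the samples are
assumed to be lag readings in milliseconds taken at a fixed polling
interval, and the smoothing factor alpha is left for the caller to
choose.

```c
#include <assert.h>
#include <stddef.h>

typedef struct
{
    double value;       /* current smoothed lag */
    int    seeded;      /* 0 until the first sample arrives */
} LagEma;

void
lag_ema_init(LagEma *e)
{
    assert(e != NULL);
    e->value = 0.0;
    e->seeded = 0;
}

/*
 * Fold one new lag sample into the average and return the new value.
 * 'alpha' in (0, 1] controls how quickly old samples are forgotten:
 * values near 1 track the raw number, values near 0 smooth heavily.
 */
double
lag_ema_update(LagEma *e, double sample_ms, double alpha)
{
    assert(e != NULL);
    assert(alpha > 0.0 && alpha <= 1.0);

    if (!e->seeded)
    {
        /* Seed with the first sample rather than averaging against zero. */
        e->value = sample_ms;
        e->seeded = 1;
    }
    else
        e->value += alpha * (sample_ms - e->value);

    return e->value;
}
```

Keeping three LagEma instances with different alpha values would give
something resembling the three load-average figures mentioned above.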

-- 
Thomas Munro
http://www.enterprisedb.com


