Re: [REVIEW] pg_last_xact_insert_timestamp - Mailing list pgsql-hackers

From Greg Smith
Subject Re: [REVIEW] pg_last_xact_insert_timestamp
Date
Msg-id 4EE350B3.2040801@2ndQuadrant.com
Whole thread Raw
In response to Re: [REVIEW] pg_last_xact_insert_timestamp  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: [REVIEW] pg_last_xact_insert_timestamp
Re: [REVIEW] pg_last_xact_insert_timestamp
List pgsql-hackers
On 10/02/2011 07:10 AM, Robert Haas wrote:
> Your proposals involve sending additional information from the master to
> the slave, but the slave already knows both its WAL position and the
> timestamp of the transaction it has most recently replayed, because
> the startup process on the slave tracks that information and publishes
> it in shared memory.  On the master, however, only the WAL position is
> centrally tracked; the transaction timestamps are not.
>    

This seems to be the question that was never really answered well enough 
to satisfy anyone, so let's rewind to here for a bit.  I wasn't 
following this closely until now, so I just did my own review from 
scratch against the latest patch.  I found a few issues, and I don't 
think all of them have been vented here yet:

-It adds overhead at every commit, even for people who aren't using it.  
Probably not enough to matter, but it's yet another thing going through 
the often maligned as too heavy pgstat system, often.

-In order to measure lag this way, you need access to both the master 
and the standby.  Yuck, dblink or application code doing timestamp math, 
either idea makes me shudder.  It would be nice to answer "how many 
seconds of lag do have?" directly from the standby.  Ideally I would 
like both the master and standby to know those numbers.

-After the server is restarted, you get a hole in the monitoring data 
until the first transaction is committed or aborted.  The way the 
existing pg_last_xact_replay_timestamp side of this computation goes 
NULL for some unpredictable period after restart is going to drive 
monitoring systems batty.  Building this new facility similarly will 
force everyone who writes a lag monitor to special case for that 
condition on both sides.  Sure, that's less user hostile than the status 
quo, but it isn't going to help PostgreSQL's battered reputation in the 
area of monitoring either.

-The transaction ID advancing is not a very good proxy for real-world 
lag.  You can have a very large amount of writing in between commits.  
The existing lag monitoring possibilities focus on WAL position instead, 
which is better correlated with the sort of activity that causes lag.  
Making one measurement of lag WAL position based (the existing ones) and 
another based on transaction advance (this proposal) is bound to raise 
some "which of these is the correct lag?" questions, when the two 
diverge.  Large data loading operations come to mind as a not unusual at 
all situation where this would happen.

I'm normally a fan of building the simplest thing that will do something 
useful, and this patch succeeds there.  But the best data to collect 
needs to have a timestamp that keeps moving forward in a way that 
correlates reasonably well to the system WAL load.  Using the 
transaction ID doesn't do that.  Simon did some hand waving around 
sending a timestamp every checkpoint.  That would allow the standby to 
compute its own lag, limit overhead to something much lighter than 
per-transaction, and better track write volume.  There could still be a 
bigger than normal discontinuity after server restart, if the server was 
down a while, but at least there wouldn't ever be a point where the 
value was returned by the master but was NULL.

But as Simon mentioned in passing, it will bloat the WAL streams for 
everyone.  Here's the as yet uncoded mini-proposal that seems to have 
slid by uncommented upon:

"We can send regular special messages from WALSender to WALReceiver that 
do not form part of the WAL stream, so we don't bulk
up WAL archives. (i.e. don't use "w" messages)."

Here's my understanding of how this would work.  Each time a 
commit/abort record appears in the WAL, that updates XLogCtl with the 
associated timestamp.  If WALReceiver received regular updates 
containing the master's clock timestamp and stored them 
similarly--either via regular streaming or the heartbeat--it could 
compute lag with the same resolution as this patch aims to do, for the 
price of two spinlocks:  just subtract the two timestamps.  No overhead 
on the master, and lag can be directly computed and queried from each 
standby.  If you want to feed that data back to the master so it can 
appear in pg_stat_replication in both places, send it periodically via 
the same channel sync rep and standby feedback use.  I believe that will 
be cheap in many cases, since it can piggyback on messages that will 
still be quite small relative to minimum packet size on most networks.  
(Exception for compressed/encrypted networks where packets aren't 
discrete in this way doesn't seem that relevant, presuming that if 
you're doing one of those I would think this overhead is the least of 
your worries)

Now, there's still one problem here.  This doesn't address the "lots of 
write volume but no commit" problem any better than the proposed patch 
does.  Maybe there's some fancy way to inject it along with the received 
WAL on the standby, I'm out of brain power to think through that right 
now.  One way to force this to go away is to add a "time update" WAL 
record type here, one that only appears in some of these unusual 
situations.  My first idea for keeping that overhead small is to put the 
injection logic for it at WAL file switch time.  If you haven't 
committed a transaction during that entire 16MB of writes, start the new 
WAL segment with a little time update record.  That would ensure you 
never do too many writes before the WAL replay clock is updated, while 
avoiding this record type altogether during commit-heavy periods.

The WAL sender and receiver pair should be able to work out how to 
handle the other corner case, where neither WAL advance nor transactions 
occured.  You don't want lag to keep increasing forever there.  If the 
transaction ID hasn't moved forward, the heartbeat can still update the 
time.  I think if you do that, all you'd need to special case is that if 
master XID=standby replay XID, lag time should be forced to 0 instead of 
being its usual value (master timestamp - last standby commit/abort record).

As for "what do you return if asked for lag before the first data with a 
timestamp appears?" problem, there are surely still cases where that 
happens in this approach.  I'm less concerned about that if there's only 
a single function involved though.  The part that worries me is the high 
probability people are going to do NULL math wrong if they have to 
combine two values from different servers, and not catch it during 
development.  If I had a penny for every NULL handling mistake ever made 
in that category, I'd be writing this from my island lair instead of 
Baltimore.  I'm more comfortable saying "this lag interval might be 
NULL"; if that's the only exception people have to worry about, that 
doesn't stress me as much.

Last disclaimer:  if this argument is persuasive enough to punt 
pg_last_xact_insert_timestamp for now, in favor of a new specification 
along the lines I've outlined here, I have no intention of letting 
myself or Simon end up blocking progress on that.  This one is important 
enough for 9.2 that if there's not another patch offering the same 
feature sitting in the queue by the end of the month, I'll ask someone 
else here to sit on this problem a while.  Probably Jaime, because he's 
suffering with this problem as much as I am.  We maintain code in repmgr 
to do this job with brute force:  it saves a history table to translate 
XID->timestamps and works out lag from there.  Window function query + 
constantly churning monitoring table=high overhead.  It would really be 
great to EOL that whole messy section as deprecated starting in 9.2.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



pgsql-hackers by date:

Previous
From: Gabriele Bartolini
Date:
Subject: Re: [PATCH] Support for foreign keys with arrays
Next
From: Greg Smith
Date:
Subject: Re: pg_stat_statements with query tree based normalization