> I'd happily write a patch to handle all that if I thought it would be
> accepted. I fear that the whole approach will be considered a bit too
> hackish and get rejected on that basis though. Not really sure of a
> "right" way to handle this though. Anything better is going to be more
> complicated because it requires passing more information into the
> archiver, with little gain for that work beyond improving the quality of
> this diagnostic routine. And I think most people would find what I
> described above useful enough.
Yeah, I think we should focus right now on "what monitoring can we get
into this version without holding up release?" Your proposal sounds
like a good one in that respect.
In future versions, I think we'll want a host of granular data, including:
* amount of *time* since last successful archive (this would be a good
trigger for alerts)
* number of failed archive attempts
* number of archive files awaiting processing (presumably monitored by
the slave)
* last archive file processed by the slave, and when
* for HS: frequency and length of conflict delays in log processing, as
a stat
* for HS: number of query cancels due to write/lock conflicts from the
master, as a stat
However, *all* of the above can wait for the next version, especially
since by then we'll have user feedback from the field on required
monitoring. If we try to nail this all down now, not only will it delay
the release, but we'll get it wrong and have to re-do it anyway. Release
early and often, y'know?
I think it's key to keep our data as granular and low-level as possible;
with good low-level data, people can write good tools, but if we
over-summarize, they can't. Also, it would be nice to have all of our
archiving stuff grouped into something like pg_stat_archive rather than
being a bunch of disconnected functions.
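To make the "granular data, tools on top" idea concrete, here is a rough
sketch in Python. Everything in it is hypothetical: the ArchiveStats fields,
the function names, and the thresholds are assumptions for illustration, not
an existing PostgreSQL interface. It only shows how a monitoring tool could
derive the "time since last successful archive" alert from raw values if
they were exposed together in one place.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class ArchiveStats:
    """Hypothetical raw values a grouped archiving view might expose."""
    last_archived_file: Optional[str]       # name of last file archived OK
    last_archived_time: Optional[datetime]  # when that archive succeeded
    failed_attempts: int                    # number of failed archive attempts
    files_waiting: int                      # archive files awaiting processing


def seconds_since_last_archive(stats: ArchiveStats,
                               now: datetime) -> Optional[float]:
    """Derive the age of the last successful archive; None if there is none."""
    if stats.last_archived_time is None:
        return None
    return (now - stats.last_archived_time).total_seconds()


def archive_alerts(stats: ArchiveStats,
                   now: datetime,
                   max_age: timedelta = timedelta(minutes=10),
                   max_waiting: int = 50) -> List[str]:
    """Return alert messages; the thresholds are arbitrary example values."""
    alerts = []
    age = seconds_since_last_archive(stats, now)
    if age is None:
        alerts.append("no successful archive recorded")
    elif age > max_age.total_seconds():
        alerts.append(f"last successful archive was {int(age)}s ago")
    if stats.files_waiting > max_waiting:
        alerts.append(f"{stats.files_waiting} archive files awaiting processing")
    return alerts
```

The point of the sketch is that the alert policy (how old is too old, how
many waiting files is too many) lives in the tool, not in the server; the
server only needs to report the raw timestamps and counters.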
--Josh Berkus