Publish checkpoint timing and sync files summary data to pg_stat_bgwriter - Mailing list pgsql-hackers

One of the most useful bits of feedback on how well checkpoint I/O is
going is the amount of time taken to sync files to disk.  Right now the
only way to get that is to parse the logs.  The attached patch publishes
the most useful three bits of data you could only get from
log_checkpoints before out to pg_stat_bgwriter.  Easiest to just show an
example:

$ createdb pgbench
$ psql -c "select pg_stat_reset_shared('bgwriter')"
$ pgbench -i -s 10 pgbench
$ psql -c "checkpoint"

$ tail $PGLOG
LOG:  checkpoint complete: wrote 590 buffers (14.4%); 0 transaction log
file(s) added, 0 removed, 0 recycled; write=0.002 s, sync=0.117 s,
total=1.746 s; sync files=31, longest=0.104 s, average=0.003 s

$ psql -c "select write_ms,sync_ms,sync_files from pg_stat_bgwriter"
  write_ms | sync_ms | sync_files
----------+---------+------------
         2 |     117 |         31

Like the other checkpoint statistics, these are updated in one lump when
the checkpoint finishes.

This includes rough documentation and the code isn't very complicated; a
cat version bump is required.  Some open concerns I still have about this:

-The total amount of time in the sync phase is not the same as the total
of all sync wait times.  I have published the total elapsed sync time.
The total of the times spent waiting on individual sync calls is also
available at the point when the above data is collected, and it could be
published as well.  I don't think that's necessary yet, since they are
almost identical right now.  But any feature that staggers sync calls
around might make that distinction important.

-The total checkpoint time is larger than the total of write+sync.  You
can see that here:  2ms write + 117ms sync = 119ms; total checkpoint
time is actually 1746ms though.  While I don't normally find the total
useful from a tuning perspective, it may be sensible to publish it just
to further reduce any perceived need for log scraping here.

-None of the debug1 level information about individual sync calls can be
published this way usefully.  I'd like that to come out somewhere
eventually, but really what's needed there is a performance event
logging history collector, not pg_stats.  I'm content that this patch
gets the big picture data out; it's sufficient to let you rule out
checkpoint problems if they aren't happening.  If you only have to drill
into the logs when long times start showing up in this data, that's not
so bad to live with.

-I can't tell for sure if this is working properly when log_checkpoints
is off.  This now collects checkpoint end time data in all cases,
whereas before it ignored that work if log_checkpoints was off.  I think
I will have to fake the logging in this area, make it log some things
even when log_checkpoints is off, to confirm it gives the right
results.  I haven't done that yet.

I would like to be able to tell people they need never turn on
log_checkpoints if they monitor pg_stat_bgwriter instead, because that
sets a good precedent for what direction the database is going.  It
would be nice for pg_stat_bgwriter's format to settle down for a long
period too.  In addition to my small list of concerns here, there seem
to be an increasing number of write path feature patches floating around
here lately.  That all works against this being stable.  The main
consumer of this data I use regularly is the pg_stat_bgwriter graphs
that Munin produces, and my expectation is that myself and/or Magnus
will make sure it's compatible with any changes made here for 9.2.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


Attachment

pgsql-hackers by date:

Previous
From: Jaime Casanova
Date:
Subject: Re: patch : Allow toast tables to be moved to a different tablespace
Next
From: Greg Smith
Date:
Subject: Re: Publish checkpoint timing and sync files summary data to pg_stat_bgwriter