Log levels for checkpoint/bgwriter monitoring - Mailing list pgsql-hackers

From Greg Smith
Subject Log levels for checkpoint/bgwriter monitoring
Date
Msg-id Pine.GSO.4.64.0702192155530.20153@westnet.com
Whole thread Raw
In response to Re: Multiple Storage per Tablespace, or Volumes  (Andrew Sullivan <ajs@crankycanuck.ca>)
Responses Re: Log levels for checkpoint/bgwriter monitoring  (Robert Treat <xzilla@users.sourceforge.net>)
Re: Log levels for checkpoint/bgwriter monitoring  ("Jim C. Nasby" <jim@nasby.net>)
Re: Log levels for checkpoint/bgwriter monitoring  (Bruce Momjian <bruce@momjian.us>)
List pgsql-hackers
I have a WIP patch that adds the main detail I have found I need to 
properly tune checkpoint and background writer activity.  I think it's 
almost ready to submit (you can see the current patch against 8.2 at 
http://www.westnet.com/~gsmith/content/postgresql/patch-checkpoint.txt ) 
after making it a bit more human-readable.  But I've realized that along 
with that, I need some guidance in regards to what log level is 
appropriate for this information.

An example works better than explaining what the patch does:

2007-02-19 21:53:24.602 EST - DEBUG:  checkpoint required (wrote 
checkpoint_segments)
2007-02-19 21:53:24.685 EST - DEBUG:  checkpoint starting
2007-02-19 21:53:24.705 EST - DEBUG:  checkpoint flushing buffer pool
2007-02-19 21:53:24.985 EST - DEBUG:  checkpoint database fsync starting
2007-02-19 21:53:42.725 EST - DEBUG:  checkpoint database fsync complete
2007-02-19 21:53:42.726 EST - DEBUG:  checkpoint buffer flush dirty=8034 
write=279956 us sync=17739974 us

Remember that "Load distributed checkpoint" discussion back in December? 
You can see exactly how bad the problem is on your system with this log 
style (this is from a pgbench run where it's postively awful--it really 
does take over 17 seconds for the fsync to execute, and there are clients 
that are hung the whole time waiting for it).

I also instrumented the background writer.  You get messages like this:

2007-02-19 21:58:54.328 EST - DEBUG:  BGWriter Scan All - Written = 5/5 
Unscanned = 23/54

This shows that we wrote (5) the maximum pages we were allowed to write 
(5) while failing to scan almost half (23) of the buffers we meant to look 
at (54).  By taking a look at this output while the system is under load, 
I found I was able to do bgwriter optimization that used to take me days 
of frustrating testing in hours.  I've been waiting for a good guide to 
bgwriter tuning since 8.1 came out.  Once you have this, combined with 
knowing how many buffers were dirty at checkpoint time because the 
bgwriter didn't get to them in time (the number you want to minimize), the 
tuning guide practically writes itself.

So my question is...what log level should all this go at?  Right now, I 
have the background writer stuff adjusting its level dynamically based on 
what happened; it logs at DEBUG2 if it hits the write limit (which should 
be a rare event once you're tuned properly), DEBUG3 for writes that 
scanned everything they were supposed to, and DEBUG4 if it scanned but 
didn't find anything to write.  The source of checkpoint information logs 
at DEBUG1, the fsync/write info at DEBUG2.

I'd like to move some of these up.  On my system, I even have many of the 
checkpoint messages logged at INFO (the source of the checkpoint and the 
total write time line).  It's a bit chatty, but when you get some weird 
system pause issue it makes it easy to figure out if checkpoints were to 
blame.  Given how useful I feel some of these messages are to system 
tuning, and to explaining what currently appears as inexplicable pauses, I 
don't want them to be buried at DEBUG levels where people are unlikely to 
ever see them (I think some people may be concerned about turning on 
things labeled DEBUG at all).  I am aware that I am too deep into this to 
have an unbiased opinion at this point though, which is why I ask for 
feedback on how to proceed here.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


pgsql-hackers by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: pgsql: Fix for plpython functions; return true/false for boolean,
Next
From: Tom Lane
Date:
Subject: Re: [PATCHES] WIP patch - INSERT-able log statements