Re: Log levels for checkpoint/bgwriter monitoring

From: Greg Smith
Subject: Re: Log levels for checkpoint/bgwriter monitoring
Msg-id: Pine.GSO.4.64.0703121958540.27778@westnet.com
In response to: Re: Log levels for checkpoint/bgwriter monitoring (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers

On Fri, 9 Mar 2007, Tom Lane wrote:

> It strikes me that the patch would be more useful if it produced a 
> histogram of the observed usage_counts

I don't have anything worth releasing yet, but I did code a first rev of 
this today.  The results are quite instructive and well worth looking at.
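
To make the tallying concrete, here's a minimal standalone sketch of how 
such a histogram gets built and printed.  The ToyBuffer struct and the 
sample data in main() are made up for illustration; the actual patch reads 
the dirty flag and usage_count from the live buffer headers instead:

#include <stdio.h>

#define MAX_USAGE_COUNT 5          /* cap for the "5+" bucket */

/* Toy stand-in for a buffer header; the real code looks at the server's
   buffer descriptors (dirty flag plus usage_count) instead. */
typedef struct
{
    int is_dirty;
    int usage_count;
} ToyBuffer;

static void
report_dirty_usage_histogram(const ToyBuffer *pool, int nbuffers)
{
    int histogram[MAX_USAGE_COUNT + 1] = {0};
    int dirty = 0;
    int i;

    for (i = 0; i < nbuffers; i++)
    {
        if (!pool[i].is_dirty)
            continue;
        dirty++;
        if (pool[i].usage_count >= MAX_USAGE_COUNT)
            histogram[MAX_USAGE_COUNT]++;       /* the "5+" bucket */
        else
            histogram[pool[i].usage_count]++;
    }

    printf("dirty buffer usage count histogram:\n");
    for (i = 0; i <= MAX_USAGE_COUNT; i++)
        printf("%d%s=%.1f%% ", i, (i == MAX_USAGE_COUNT) ? "+" : "",
               dirty ? 100.0 * histogram[i] / dirty : 0.0);
    printf("\n");
}

int
main(void)
{
    /* A few made-up buffers, just to exercise the report */
    ToyBuffer pool[] = {
        {1, 0}, {1, 2}, {1, 5}, {1, 5}, {0, 3}, {1, 4}, {0, 0}, {1, 1}
    };

    report_dirty_usage_histogram(pool, (int) (sizeof(pool) / sizeof(pool[0])));
    return 0;
}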

The main server I work on has shared_buffers=60000 for pgbench testing and 
the background writers turned way up (in hopes of reducing checkpoint 
times; that's another topic entirely).  I run pgbench with s=100 (~1.6GB 
database).  Here's what I get as statistics on the buffer pool after a 
scan when the server is "happy", from a run with 20 clients:

writes=38.3MB (8.2%) pinned+used=38.3MB (8.2%)
dirty buffer usage count histogram:
0=0.1% 1=0.3% 2=26% 3=17% 4=21% 5+=36%
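
(Those MB figures are relative to the full 60000-buffer pool: 60000 x 8K is 
about 469MB, and 38.3MB of that is the 8.2% shown.)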

Basically, everything that's left dirty is also heavily used; as I noted 
before, when I inspect with pg_buffercache these are mostly index blocks. 
I should probably generate a second histogram of the clean buffers for 
comparison.  The all scan is writing pages out as soon as they're dirtied, 
so the LRU background writer is lucky if it can find anything to do.  (Note 
that I don't rely on the LRU writer more because that causes a significant 
lull in writes after a checkpoint.  By the time it gets going again it's 
harder to catch up, and delaying PostgreSQL writes also aggravates issues 
with the way Linux caches writes.)

What I found really interesting was comparing that against a "sad" section 
where performance breaks down.  This is from a minute later:

writes=441.6MB (94.2%) pinned+used=356.2MB (76.0%)
dirty buffer usage count histogram:
0=18.7% 1=26.4% 2=31% 3=11% 4=9% 5+=4%

Note how the whole buffer distribution has shifted toward lower usage 
counts.  The breakdown seems to involve evicting the index blocks to make 
room for the recently dirtied data when it can't be written out fast enough. 
As the index blocks go poof, things slow further, and into the vicious 
circle you go.  Eventually you can end up blocked on a combination of 
buffer evictions and disk seeks for uncached data that are fighting with 
the writes.

The bgwriter change this suggested to me is defining a triage behavior 
where the all scan switches to acting like the LRU one:

-Each sample period, note what % of the usage_count=0 buffers are dirty
-If that percentage is above a tunable threshold, switch to writing only 
usage_count=0 buffers until it drops back below the threshold.

On my system a first guess for that tunable would be 2-5%, based on the 
values I see on either side of the sad periods.  No doubt some systems 
would set it much higher; I haven't tested my system at home yet to get a 
guideline for a more typical PC.
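
Here's a minimal standalone sketch of that check, taking the wording above 
literally (what fraction of the usage_count=0 buffers are dirty).  The 
function name and the sample counts are made up; in the real thing the 
counts would come from the bgwriter's own sampling:

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical triage check: given one sample period's count of
   usage_count=0 buffers and how many of those were dirty, decide whether
   the all scan should fall back to LRU-style behavior (write only
   usage_count=0 buffers).  dirty_threshold is the proposed tunable,
   something like 0.02-0.05 on my test server. */
static bool
all_scan_should_triage(int zero_usage_buffers, int zero_usage_dirty,
                       double dirty_threshold)
{
    double dirty_fraction;

    if (zero_usage_buffers == 0)
        return false;           /* nothing at usage_count=0 this period */

    dirty_fraction = (double) zero_usage_dirty / zero_usage_buffers;
    return dirty_fraction > dirty_threshold;
}

int
main(void)
{
    /* Made-up samples from either side of a "sad" period */
    printf("happy: triage=%d\n",
           (int) all_scan_should_triage(5000, 60, 0.03));
    printf("sad:   triage=%d\n",
           (int) all_scan_should_triage(5000, 900, 0.03));
    return 0;
}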

As for why this behavior matters so much to me, I actually have a 
prototype auto-tuning background writer design that was hung up on this 
particular problem.  It notes how often you write out max_pages, uses that 
to model the average percentage of the buffer cache you're traversing each 
pass, then runs a couple of weighted-average feedback loops to aim for a 
target seconds/sweep.
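
To give a flavor of the feedback-loop part, here's a minimal standalone 
sketch along those lines.  The names, the smoothing constant, and the 
adjustment rule are illustrative guesses, not the prototype's actual code:

#include <stdio.h>

/* Hypothetical sketch of one weighted-average feedback loop, in the
   spirit of the prototype described above.  Idea: track what fraction of
   the buffer cache each bgwriter pass actually covers (which you can
   infer from how often the pass stops at its max-pages cap), smooth it
   with a weighted average, and scale the per-round write quota so that a
   full sweep lands near the target number of seconds. */
typedef struct
{
    double avg_fraction;        /* smoothed fraction of cache per pass */
    double round_secs;          /* seconds between bgwriter rounds */
    double target_sweep_secs;   /* desired seconds per full sweep */
    int    pages_per_round;     /* current write quota per round */
} SweepTuner;

static void
tune_pages_per_round(SweepTuner *t, double observed_fraction)
{
    double wanted_fraction;
    double scale;

    /* Weighted average: mostly history, a little of the new sample */
    t->avg_fraction = 0.9 * t->avg_fraction + 0.1 * observed_fraction;

    /* Fraction of the cache each round must cover to finish a sweep in
       target_sweep_secs */
    wanted_fraction = t->round_secs / t->target_sweep_secs;

    if (t->avg_fraction <= 0.0)
        return;                 /* no useful observation yet */

    /* Nudge the quota toward the target, never more than doubling it */
    scale = wanted_fraction / t->avg_fraction;
    if (scale > 2.0)
        scale = 2.0;
    t->pages_per_round = (int) (t->pages_per_round * scale + 0.5);
    if (t->pages_per_round < 1)
        t->pages_per_round = 1;
}

int
main(void)
{
    /* 200ms rounds, aiming for a two-minute sweep */
    SweepTuner t = {0.001, 0.2, 120.0, 100};
    double samples[] = {0.0012, 0.0015, 0.0017, 0.0017, 0.0016};
    int    i;

    for (i = 0; i < 5; i++)
    {
        tune_pages_per_round(&t, samples[i]);
        printf("avg fraction %.4f -> %d pages/round\n",
               t.avg_fraction, t.pages_per_round);
    }
    return 0;
}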

The problem was that it went berserk when the whole buffer cache was dirty 
(I hope someone appreciates that I've been calling this "Felix Unger on 
crack mode" in my notes).  I needed some way to prioritize which buffers 
to concentrate on when that happens, and so far the above has been a good 
first-cut way to help with that.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

