Final background writer cleanup for 8.3 - Mailing list pgsql-hackers

From: Greg Smith
Subject: Final background writer cleanup for 8.3
Msg-id: Pine.GSO.4.64.0708232004540.15073@westnet.com
List: pgsql-hackers
In the interest of closing work on what's officially titled the "Automatic 
adjustment of bgwriter_lru_maxpages" patch, I wanted to summarize where I 
think this is at, what I'm working on right now, and see if feedback from 
that changes how I submit my final attempt for a useful patch in this area 
this week.  Hopefully there are enough free eyes to stare at this now to 
wrap up a plan for what to do that makes sense and still fits in the 8.3 
schedule.  I'd hate to see this pushed off to 8.4 without making some 
forward progress here after the amount of work done already, particularly 
when the odds aren't good that I'll still be working with this code by then.

Let me start with a summary of the conclusions I've reached based on my 
own tests and the set that Heikki did last month (last results at 
http://community.enterprisedb.com/bgwriter/ ); Heikki will hopefully chime 
in if he disagrees with how I'm characterizing things:

1) In the current configuration, if you have a large setting for 
bgwriter_lru_percent and/or a small setting for bgwriter_delay, that can 
be extremely wasteful because the background writer will consume 
CPU/locking resources scanning the buffer pool needlessly.  This problem 
should go away.

2) Having backends write their own buffers out does not significantly 
degrade performance, as those turn into cached OS writes which generally 
execute fast enough to not be a large drag on the backend.

3) Any attempt to scan significantly ahead of the current strategy point 
will result in some amount of premature writes that decreases overall 
efficiency in cases where the buffer is touched again before it gets 
re-used.  The more in advance you go, the worse this inefficiency is. 
The most efficient way for many workloads is to just let the backends do 
all the writes.

4) Tom observed that there's no reason to ever scan the same section of 
the pool more than once, because anything that changes a buffer's status 
will always make it un-reusable until the strategy point has passed over 
it.  But because of (3), this does not mean that one should drive forward 
constantly trying to lap the buffer pool and catch up with the strategy 
point.
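
To make the mechanics behind (4) concrete, here is a rough C sketch of the 
test involved (hypothetical names, not the actual bufmgr structures): the 
cleaner only cares about buffers the clock sweep could hand out 
immediately, and any access pins the buffer or bumps its usage count, 
which only the sweep decrements again.

#include <stdbool.h>

/* Sketch only: a touched buffer fails this test until the strategy
 * point has swept past it and decayed its usage count back to zero. */
typedef struct
{
    int  refcount;     /* pins currently held by backends */
    int  usage_count;  /* bumped on access, decayed by the clock sweep */
    bool is_dirty;     /* needs a write before it can be reused */
} BufferDescSketch;

static bool
is_cleaner_candidate(const BufferDescSketch *buf)
{
    return buf->refcount == 0 && buf->usage_count == 0 && buf->is_dirty;
}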

5) There hasn't been any definitive proof that the background writer is 
helpful at all in the context of 8.3.  However, yanking it out altogether 
may be premature, as there are some theorized ways that it may be helpful 
in real-world situations with more intermittent workloads than are 
generally encountered in a benchmarking situation.  I personally feel there 
is some potential for the BGW to become more useful in the context of the 
8.4 release if it starts doing things like adding pages it expects to be 
recycled soon onto the free list, which could improve backend efficiency 
quite a bit compared to the current situation where each backend normally 
runs its own scan.  But that's a bit too big to fit into 8.3, I think.

What I'm aiming for here is to have the BGW do as little work as possible, 
as efficiently as possible, but not remove it altogether.  (2) suggests 
that this approach won't decrease performance compared to the current 8.2 
situation, where I've seen evidence some are over-tuning to have a very 
aggressive BGW scan an enormous amount of the pool each time because they 
have resources to burn.  Keeping a generally self-tuning background writer 
that errs on the lazy side in the codebase satisfies (5).  Here is what the 
patch I'm testing right now does to try to balance all this out:

A) Counters are added to pg_stat_bgwriter that show how many buffers were 
written by the backends and by the background writer, how many times 
bgwriter_lru_maxpages was hit, and the total number of buffers allocated. 
This at least allows monitoring what's going on as people run their own 
experiments.  Heikki's results included data collected using the earlier 
version of this patch I put together (which now conflicts with HEAD; I 
have an updated one).
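
For illustration, the bookkeeping amounts to a handful of counters along 
these lines (the field names here are just my shorthand for the list 
above, not necessarily the final view columns):

/* Illustrative shorthand for the counters described above; not
 * necessarily the final pg_stat_bgwriter column names. */
typedef struct
{
    long buffers_backend;   /* buffers written by the backends themselves */
    long buffers_clean;     /* buffers written by the LRU cleaner */
    long maxwritten_clean;  /* cycles cut short by bgwriter_lru_maxpages */
    long buffers_alloc;     /* total buffer allocations requested */
} BgWriterCountersSketch;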

B) bgwriter_lru_percent is removed as a tunable.  This eliminates (1). 
The idea of scanning a fixed percentage doesn't ever make sense given the 
observations above; we scan until we accomplish the cleaning mission 
instead.

C) bgwriter_lru_maxpages stays as an absolute maximum number of pages that 
can be written in one sweep each bgwriter_delay.  This allows easily 
turning the writer off altogether by setting it to 0, or limiting how 
active it tries to be in situations where (3) is a concern.  Admins can 
monitor how often the max is hit in pg_stat_bgwriter and consider raising 
it (or lowering the delay) if it proves to be too limiting.  I think the 
default needs to be bumped to something more like 100 rather than the 
current tiny value before the stock configuration can be considered 
"self-tuning" at all.

D) The strategy code gets a "passes" count added to it that serves as a 
sort of high-order int for how many times the buffer cache has been looked 
over in its entirety.
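
A sketch of the idea (made-up names, not the patch itself): fold the 
passes count and the buffer index into one monotonically increasing 
position, so two scan locations can be compared even after the sweep has 
wrapped around the pool.

#include <stdbool.h>
#include <stdint.h>

#define NBUFFERS_EXAMPLE 16384   /* stand-in for the shared_buffers size */

static uint64_t
scan_position(uint32_t passes, uint32_t buf_id)
{
    return (uint64_t) passes * NBUFFERS_EXAMPLE + buf_id;
}

/* The cleaner is still ahead of the strategy point if its position is
 * the larger of the two. */
static bool
cleaner_is_ahead(uint32_t cleaner_passes, uint32_t cleaner_buf,
                 uint32_t strategy_passes, uint32_t strategy_buf)
{
    return scan_position(cleaner_passes, cleaner_buf) >
           scan_position(strategy_passes, strategy_buf);
}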

E) When the background writer starts the LRU cleaner, it checks whether 
the strategy point has passed the spot it last cleaned up to, using the 
passes+buf_id "pointer".  If so, it just starts cleaning from the strategy 
point as it always has.  But if it's still ahead, it just continues from 
there, thus implementing the core of (4)'s insight.  It estimates how many 
buffers are probably clean in the space between the strategy point and 
where it's starting, based on how far ahead it is combined with historical 
data about how many buffers are scanned on average per reusable buffer 
found (the exact computation of this number is the main thing I'm still 
fiddling with).
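
Roughly what that decision looks like, as a sketch with hypothetical names 
and an illustrative estimation formula (this is the part I'm still 
fiddling with, so don't read the exact math as final):

#include <stdint.h>

typedef struct
{
    uint64_t next_to_clean;      /* combined passes+buf_id position where
                                  * the cleaner stopped last cycle */
    double   scans_per_reusable; /* historical buffers scanned per
                                  * reusable buffer found */
} CleanerStateSketch;

static uint64_t
choose_start_point(const CleanerStateSketch *state, uint64_t strategy_pos,
                   int *est_clean_in_gap)
{
    if (strategy_pos >= state->next_to_clean)
    {
        /* The strategy point lapped us; restart right behind it. */
        *est_clean_in_gap = 0;
        return strategy_pos;
    }

    /* Still ahead: the gap was scanned last time and left reusable, so
     * credit an estimate of it against this cycle's cleaning target. */
    *est_clean_in_gap = 0;
    if (state->scans_per_reusable > 0)
        *est_clean_in_gap = (int)
            ((double) (state->next_to_clean - strategy_pos) /
             state->scans_per_reusable);
    return state->next_to_clean;
}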

F) A moving average of buffer allocations is used to predict how many 
clean buffers are expected to be needed in the next delay cycle.  The 
original patch from Itagaki doubled the recent allocations to pad this 
out; (3) suggests that's too much.
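
Something along these lines, where the smoothing factor is just an example 
value rather than anything from the patch:

static double smoothed_alloc = 0.0;

static int
estimate_upcoming_allocs(int allocs_since_last_cycle)
{
    const double alpha = 0.25;   /* hypothetical smoothing factor */

    /* Exponential moving average of per-cycle buffer allocations. */
    smoothed_alloc += alpha * ((double) allocs_since_last_cycle - smoothed_alloc);
    return (int) (smoothed_alloc + 0.5);   /* round to whole buffers */
}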

G) Scan the buffer pool until either:
   --Enough reusable buffers have been located or written out to fill the
     upcoming allocation need, taking into account the estimate from (E);
     this is the normal expected way the scan will terminate.
   --We've written bgwriter_lru_maxpages.
   --We "lap" and catch the strategy point.
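
Put together, the cleaning cycle is roughly this skeleton (placeholder 
helpers and names, nothing from the actual patch); reusable_needed here 
would be the allocation forecast from (F) minus the estimate from (E):

#include <stdbool.h>

extern int  advance_next_to_clean(void);            /* next buf_id, wrapping */
extern bool reached_strategy_point(int buf_id);     /* lapped the sweep? */
extern bool buffer_is_reusable_clean(int buf_id);   /* clean and unpinned */
extern bool write_buffer_if_cleanable(int buf_id);  /* true if written out */

static void
lru_clean_cycle(int reusable_needed, int bgwriter_lru_maxpages)
{
    int reusable_found = 0;
    int written = 0;

    while (reusable_found < reusable_needed)         /* goal met: stop */
    {
        int buf_id = advance_next_to_clean();

        if (reached_strategy_point(buf_id))          /* lapped the pool: stop */
            break;

        if (buffer_is_reusable_clean(buf_id))
            reusable_found++;
        else if (write_buffer_if_cleanable(buf_id))
        {
            reusable_found++;
            if (++written >= bgwriter_lru_maxpages)  /* hit the cap: stop */
                break;
        }
    }
}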
 

In addition to removing a tunable and making the remaining two less 
critical, one of my hopes here is that the more efficient way this scheme 
operates will allow using much smaller values for bgwriter_delay than have 
been practical in the current codebase, which may ultimately have its own 
value.
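
As a back-of-the-envelope illustration of how the two remaining knobs 
interact (using the proposed default of 100 pages from (C) and the stock 
200ms delay), the ceiling on the cleaner's write rate is easy to compute:

#include <stdio.h>

int
main(void)
{
    int    bgwriter_lru_maxpages = 100;  /* proposed default from (C) */
    int    bgwriter_delay_ms = 200;      /* stock delay */
    double buffers_per_sec = bgwriter_lru_maxpages *
                             (1000.0 / bgwriter_delay_ms);
    double mb_per_sec = buffers_per_sec * 8192 / (1024.0 * 1024.0);

    /* 100 pages every 200ms is 500 buffers/s, about 3.9 MB/s with 8K blocks. */
    printf("cleaner write ceiling: %.0f buffers/s (~%.1f MB/s)\n",
           buffers_per_sec, mb_per_sec);
    return 0;
}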

That's what I've got working here now; I still need some more tweaking and 
testing before I'm done with the code, but there's not much left.  The main 
problem I foresee is that this approach is moderately complicated, adding a 
lot of new code and regular+static variables, for something that's not 
really proven to be valuable.  I will not be surprised if my patch is 
rejected on that basis.  That's why I wanted to get the big picture 
painted in this message while I finish up the work necessary to submit it, 
'cause if the whole idea is doomed anyway I might as well stop now.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

