Re: Automatic adjustment of bgwriter_lru_maxpages - Mailing list pgsql-patches

From Greg Smith
Subject Re: Automatic adjustment of bgwriter_lru_maxpages
Date 13 May 2007
Msg-id Pine.GSO.4.64.0705131234470.23574@westnet.com
In response to Re: Automatic adjustment of bgwriter_lru_maxpages  (Heikki Linnakangas <heikki@enterprisedb.com>)
List pgsql-patches
On Sun, 13 May 2007, Heikki Linnakangas wrote:

> StrategyReportWrite increments numClientWrites without holding the
> BufFreeListLock, that's a race condition. The terminology needs some
> adjustment; clients don't write buffers, backends do.

That was another piece of debugging code I moved into the main path
without thinking too hard about it, good catch.  I have a
documentation/naming patch I've started on that revises a lot of the
pg_stat_bgwriter names to be more consistent and easier to understand (as
well as re-ordering the view); the underlying code is still fluid enough
that I was trying to nail that down first.

> That algorithm seems decent, but I wonder why the simple fudge factor
> wasn't good enough? I would've thought that a 2x or even bigger fudge
> factor would still be only a tiny fraction of shared_buffers, and
> wouldn't really affect performance.

I like the way the smoothing evens out the I/O rates.  I saw occasional
spots where the buffer allocations drop to 0 for a few intervals while
something everybody else is waiting on is going on, and I didn't want all
LRU cleanup to come to a halt just because there's a fraction of a second
where nothing happened in the middle of a very busy period.
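
To give the flavor of what I mean by smoothing, here's a sketch of the
sort of weighted moving average involved; the 16-sample weighting is
made up for illustration and isn't the exact code from the patch:

   /*
    * Sketch only:  fold the newest allocation count into a running
    * estimate so one idle 200ms interval doesn't drag the LRU
    * cleaning target straight to zero.
    */
   #define SMOOTHING_SAMPLES 16

   static int
   smoothed_alloc_estimate(int recent_alloc, int prev_estimate)
   {
       return (prev_estimate * (SMOOTHING_SAMPLES - 1) + recent_alloc) /
              SMOOTHING_SAMPLES;
   }

If allocations briefly drop to 0, the estimate only decays by 1/16 per
interval instead of collapsing to nothing.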

As for why not overestimate, if you get into a situation where the buffer
cache is very dirty with much of the data being recently used (I normally
see this with bulk UPDATEs on indexed tables), you can end up scanning
many buffers for each one you find that can be written out.  In this kind
of situation, deciding that you actually need to write out twice as many
just because you don't trust your estimate is very inefficient.

I was able to simulate most of the bad behavior I look for with the
pgbench schema using "update accounts set abalance=abalance+1;".  To throw
some sample numbers out, on my test server I was just doing final work on
last night, I was seeing peaks of about 600-1200 buffers allocated per
200ms interval doing that simple UPDATE with shared_buffers=32768.

Let's call it 2% of the pool.  If 50% of the pool is either dirty or can't
be reused yet, that means I'll average having to scan 2%/50%=4% of the
pool to find enough buffers to reuse per interval.  I wouldn't describe
that as a tiny fraction, and doubling it is not an insignificant load
increase.  I'd like to be able to increase the LRU percentage scanned
without being concerned that I'm wasting resources because of this
situation.

The fact that this problem exists is what got me digging into the
background writer code in the first place, because it's way worse on my
production server than this example suggests.  The buffer cache is bigger,
but the ability of the server to dirty it under heavy load is far better.
Returning to the theme discussed in the -hackers thread I referenced:
you can't try to make the background writer LRU do all the writes without
exposing yourself to issues like this, because it doesn't touch the usage
counts.  Therefore it's vulnerable to breakdowns if your buffer pool
shifts toward dirty and non-reusable.

Having the background writer run amok when reusable buffers are rare can
really pull down the performance of the other backends (as well as delay
checkpoints), both in terms of CPU usage and locking issues.  I don't feel
it's a good idea to try and push it too hard unless some of these
underlying issues are fixed first; I'd rather err on the side of letting
it do less rather than more than it has to.

> The normal way to return multiple values is to pass a pointer as an
> argument, though that can get ugly as well if there's a lot of return
> values.

I'm open to better suggestions, but after tinkering with this interface
for over a month now--including pointers and enums--this is the first
implementation I was happy with.

There are four things I eventually need returned here, to support the
fully automatic BGW tuning. My 1st implementation passed in pointers, and
in addition to being ugly I found consistently checking for null pointers
and data consistency a drag, both from the coding and the overhead
perspective.

> What combinations of the flags are valid? Would an enum be better?

And my 2nd generation code used an enum.  There are five possible return
code states:

CLEAN + REUSABLE + !WRITTEN
CLEAN + !REUSABLE + !WRITTEN
!CLEAN + !REUSABLE + WRITTEN (all-scan only)
!CLEAN + !REUSABLE + !WRITTEN (rejected by skip)
!CLEAN + REUSABLE + WRITTEN

!CLEAN + REUSABLE + !WRITTEN isn't possible (all paths will write dirty
reusable buffers)
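
For reference, the flag form is just a few bit definitions along these
lines (the values and the CLEAN/WRITTEN names here are only for
illustration):

   #define BUF_CLEAN     0x01    /* buffer didn't need writing */
   #define BUF_REUSABLE  0x02    /* not pinned and usage_count was 0 */
   #define BUF_WRITTEN   0x04    /* we flushed it to disk */

and each of the five states above is just an OR of some subset of those
bits.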

I found the enum-based code more confusing, both reading it and making
sure it was correct when writing it, than the current form.  Right now I
have lines like:

  if (buffer_state & BUF_REUSABLE)

With an enum this has to be something like

   if (buffer_state == BUF_CLEAN_REUSABLE ||
       buffer_state == BUF_REUSABLE_WRITTEN)

And that was a pain all around; I kept having to stare at the table above
to make sure the code was correct.  Also, in order to pass back full
usage_count information I was back to either pointers or bitshifting
anyway.  While this particular patch doesn't need the usage count, the
later ones I'm working on do, and I'd like to get this interface complete
while it's being tinkered with anyway.
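
The bitshifting approach would look roughly like this, building on the
flag bits sketched above; the shift amount and the helper name are made
up for illustration:

   #define BUF_USAGE_SHIFT 8    /* flag bits stay below this */

   /* pack the status flags plus the buffer's usage_count into one int */
   static int
   pack_buffer_state(int flags, int usage_count)
   {
       return flags | (usage_count << BUF_USAGE_SHIFT);
   }

   /* callers unpack with the same constants */
   usage_count = buffer_state >> BUF_USAGE_SHIFT;
   reusable = (buffer_state & BUF_REUSABLE) != 0;

That keeps the single-int return while still carrying the usage count
the later patches need.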

> Or how about moving the checks for dirty and pinned buffers from
> SyncOneBuffer to the callers?

There are 3 callers to SyncOneBuffer, and almost all the code is shared
between them.  Trying to push even just the dirty/pinned stuff back into
the callers would end up being a cut and paste job that would duplicate
many lines.  That's on top of the fact that the buffer is cleanly
locked/unlocked all in one section of code right now, and I didn't see how
to move any parts of that to the callers without disrupting that clean
interface.
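
To show what I mean about the locking, the shared section is shaped
roughly like this (paraphrased, not a literal excerpt, and the skip
parameter name is just illustrative):

   LockBufHdr(bufHdr);

   if (bufHdr->refcount == 0 && bufHdr->usage_count == 0)
       result |= BUF_REUSABLE;
   else if (skip_recently_used)
   {
       /* caller only wants reusable buffers, so give up early */
       UnlockBufHdr(bufHdr);
       return result;
   }

   if (!(bufHdr->flags & BM_VALID) || !(bufHdr->flags & BM_DIRTY))
   {
       /* already clean, nothing to write */
       UnlockBufHdr(bufHdr);
       return result;
   }

   /* pin while the header lock is still held, then write it out */
   PinBuffer_Locked(bufHdr);
   FlushBuffer(bufHdr, NULL);
   ...

Pushing the dirty/pinned checks out to the callers means either taking
the header lock twice or repeating that dance in three places.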

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
