Thread: Just-in-time Background Writer Patch+Test Results

Just-in-time Background Writer Patch+Test Results

From
Greg Smith
Date:
Tom gets credit for naming the attached patch, which is my latest attempt to 
finalize what has been called the "Automatic adjustment of 
bgwriter_lru_maxpages" patch for 8.3; that's not what it does anymore but 
that's where it started.

Background on testing
---------------------

I decided to use pgbench for running my tests.  The scripting framework to 
collect all that data and usefully summarize it is now available as 
pgbench-tools-0.2 at 
http://www.westnet.com/~gsmith/content/postgresql/pgbench-tools.htm

I hope to expand and actually document use of pgbench-tools in the future but 
didn't want to hold the rest of this up on that work.  That page includes basic 
information about what my testing environment was and why I felt this was an 
appropriate way to test background writer efficiency.

Quite a bit of raw data for all of the test sets summarized here is at 
http://www.westnet.com/~gsmith/content/bgwriter/

The patches attached to this message are also available at: 
http://www.westnet.com/~gsmith/content/postgresql/buf-alloc-2.patch 
http://www.westnet.com/~gsmith/content/postgresql/jit-cleaner.patch
(This is my second attempt to send this message, don't know why the 
earlier one failed; using gzip'd patches for this one and hopefully there 
won't be a dupe)

Baseline test results
---------------------

The first patch to apply attached to this message is the latest buf-alloc-2 
that adds counters to pgstat_bgwriter for everything the background writer is 
doing. Here's what we get out of the standard 8.3 background writer before and 
after applying that patch, at various settings:
                info                | set | tps  | cleaner_pct
------------------------------------+-----+------+-------------
 HEAD nobgwriter                    |   5 |  994 |
 HEAD+buf-alloc-2 nobgwriter        |   6 | 1012 |           0
 HEAD+buf-alloc-2 LRU=0.5%/500      |  16 |  974 |       15.94
 HEAD+buf-alloc-2 LRU=5%/500        |  19 |  983 |       98.47
 HEAD+buf-alloc-2 LRU=10%/500       |   7 |  997 |       99.95

cleaner_pct is what percentage of the writes the BGW LRU cleaner did relative 
to a total that includes the client backend writes; writes done by checkpoints 
are not included in this summary computation; it just shows the balance of 
backend vs. BGW writes.

The /500 means bgwriter_lru_maxpages=500, which I already knew was about as 
many pages as this server ever dirties in a 200ms cycle.  Without the 
buf-alloc-2 patch I don't get statistics on the LRU cleaner, so I include that 
number as a baseline just to suggest that the buf-alloc-2 patch itself isn't 
pulling down results.

Here we see that in order to get most of the writes to happen via the LRU 
cleaner rather than having the backends handle them, you'd need to play with 
the settings until the bgwriter_lru_percent was somewhere between 5% and 10%, 
and it seems obvious that doing this doesn't improve the TPS results.  The 
margin of error here is big enough that I consider all these basically the same 
performance.  The question then is how to get this high level of writes by the 
background writer automatically, without having to know what percentage to 
scan; I wanted to remove bgwriter_lru_percent, while still keeping 
bgwriter_lru_maxpages strictly as a way to throttle overall BGW activity.

First JIT Implementation
------------------------

The method I described in my last message on this topic ( 
http://archives.postgresql.org/pgsql-hackers/2007-08/msg00887.php ) implemented 
a weighted moving average of how many pages were allocated, and based on 
feedback from that I improved the code to allow a multiplier factor on top of 
that.  Here's the summary of those results:
                info                | set | tps  | cleaner_pct
------------------------------------+-----+------+-------------
 jit cleaner multiplier=1.0/500     |   9 |  981 |        94.3
 jit cleaner multiplier=2.0/500     |   8 | 1005 |       99.78
 jit multiplier=1.0/100             |  10 |  985 |       68.14

That's pretty good.  As long as maxpages is set intelligently, it gets most of 
the writes even with the multiplier of 1.0, and cranking it up to the 2.0 
suggested by the original Itagaki Takahiro patch gets nearly all of them. 
Again, there's really no performance change here in throughput by any of this.
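To make the approach concrete, here is a minimal sketch of that estimate; the 
identifiers are illustrative rather than the patch's exact ones, and the real 
code in bufmgr.c does considerably more bookkeeping:

/* Sketch of the JIT cleaning target:  a weighted moving average of recent
 * buffer allocations, scaled by a multiplier and capped by maxpages.
 * Names are illustrative, not the patch's exact identifiers. */
static float smoothed_alloc = 0.0;          /* moving average of allocations per cycle */
static const int smoothing_samples = 16;    /* samples folded into that average */

static int
jit_clean_target(int recent_alloc, float multiplier, int lru_maxpages)
{
    int     target;

    /* fold this cycle's allocation count into the weighted average */
    smoothed_alloc += ((float) recent_alloc - smoothed_alloc) / smoothing_samples;

    /* aim to have a bit more than one cycle's worth of reusable buffers clean */
    target = (int) (smoothed_alloc * multiplier);

    /* bgwriter_lru_maxpages still throttles the total work done per cycle */
    if (target > lru_maxpages)
        target = lru_maxpages;

    return target;
}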

Coping with idle periods
------------------------

While I was basically happy with these results, the data Kevin Grittner 
submitted in response to my last call for commentary left me concerned. While 
the JIT approach works fine as long as your system is active, it does 
absolutely nothing if the system is idle.  I noticed that a lot of the writes 
that were being done by the client backends were after idle periods where the 
JIT writer just didn't react fast enough during the ramp-up.  For example, if 
the system went from idle for a while to full-speed just as the 200ms sleep 
started, by the time the BGW woke up again the backends could have needed to 
write many buffers already themselves.

Ideally, idle periods should be used to slowly trickle dirty pages out, so that 
there are fewer of them hanging around when a checkpoint shows up or so that 
reusable pages are already available. The question then is how fast to go about 
that trickle.  Heikki's background writer tests and my own suggest that if you 
make the rate during quiet periods too high, you'll clog the underlying buffers 
with some writes that end up being duplicated and lower overall efficiency. 
But all of those tests had the background writer going at a constant and 
relatively high speed.

I wanted to keep the ability to scan the entire buffer cache, using the latest 
idea of never looking at the same buffer twice, but to do that slowly when idle 
and using the JIT rate otherwise.  This is sort of a hybrid of the old LRU 
cleaner behavior (scan a fixed %) at a low speed with the new approach (scan 
based on allocations, however many of them there are).  I started with the old 
default of 0.5% used by bgwriter_lru_percent (a tunable already removed by the 
patch at this point) with logic to tack that onto the JIT intelligently and got 
these results:
                info                | set | tps  | cleaner_pct
------------------------------------+-----+------+-------------
 jit multiplier=1.0 min scan=0.5%   |  13 |  882 |         100
 jit multiplier=1.5 min scan=0.5%   |  12 |  871 |         100
 jit multiplier=2.0 min scan=0.5%   |  11 |  910 |         100
 jit multiplier=1.0 min scan=0.25%  |  14 |  982 |       98.34

It's nice to see fully 100% of the buffers written by the cleaner with the 
hybrid approach; I feel that validates my idea that just a bit more work needs 
to be done during idle periods to completely fix the issue with it not reacting 
fast enough during the idle/full speed transition.  But look at the drop in 
TPS.  While I'm willing to say a couple of percent change isn't significant in 
a pgbench result, those <900 results are clearly bad. This is crossing that 
line where inefficient writes are being done.  I'm happier with the result 
using the smaller min scan=0.25% even though it doesn't quite get every write 
that way.

Making percentage independent of delay
--------------------------------------

But a new problem here is that if you lower bgwriter_delay, the minimum scan 
percentage needs to drop too, and my goal was to reduce the number of tunables 
people need to tinker with.  Assuming you're not stopped by the maxpages 
parameter, with the default delay=200ms a scan that hits 0.5% each time will 
scan 5*0.5%=2.5% of the buffer cache per second, which means it will take 40 
seconds to scan the entire pool.  Using 0.25% means 80 seconds between scans. 
I improved the overall algorithm a bit and decided to set this parameter in an 
alternate way:  by how long it should take to creep its way through the entire 
buffer cache if the JIT code is idle.  I decided I liked 120 seconds as the 
value for that parameter, which is a slower rate than any of the above but 
still a reasonable one for a typical application.  Here's what the results 
look like using that approach:
                info                | set | tps  | cleaner_pct
------------------------------------+-----+------+-------------
 jit multiplier=1.0 scan_whole=120s |  18 |  970 |       99.99
 jit multiplier=1.5 scan_whole=120s |  15 |  995 |       99.93
 jit multiplier=2.0 scan_whole=120s |  17 |  981 |       99.98

Now here are results I'm happy with.  The TPS results are almost unchanged from 
where we started, with minimal inefficient writes, but almost all the 
writes are being done by the cleaner process.  The results appear much less 
sensitive to what you set the multiplier to.  And unless you use an unreasonably 
low value for maxpages (which will quickly become obvious if you monitor 
pg_stat_bgwriter and look for maxwritten_clean increasing fast), you'll get a 
complete scan of the buffer cache within 2 minutes even if there's no system 
activity.  But once that's done, until more buffers are allocated the code 
won't even look at the buffer cache again (as opposed to the current code, 
which is always looking at buffers and acquiring locks even if nothing is going 
on).
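As a rough illustration of the rate conversion and the hybrid behavior 
described above, here is a sketch; the identifiers and the example buffer 
count are assumptions, not the patch's exact code:

/* Sketch only:  convert scan_whole_pool_seconds into a minimum number of
 * buffers to look at per cycle, and combine it with the JIT estimate. */
static int
cleaner_scan_target(int jit_target, int nbuffers, int bgwriter_delay_ms,
                    float scan_whole_pool_seconds)
{
    /* how many bgwriter cycles fit into one full pass over the pool */
    float   cycles_per_pass = scan_whole_pool_seconds * 1000.0f / bgwriter_delay_ms;

    /* e.g. 131072 buffers, delay=200ms, 120s:  131072 / 600 is about 219 buffers */
    int     min_scan = (int) (nbuffers / cycles_per_pass) + 1;

    /* when the system is busy the JIT estimate dominates; when it goes idle
     * the slow background pass takes over */
    return (jit_target > min_scan) ? jit_target : min_scan;
}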

I think I can safely say there is a level of intelligence going into what the 
LRU background writer does with this patch that has never been applied to this 
problem before.  There have been a lot of good ideas thrown out in this area, 
but it took a hybrid approach that included and carefully balanced all of them 
to actually get results that I felt were usable. What I don't know is whether 
that will also be true for other testers.

Patch review
------------

The attached jit-cleaner.patch implements this approach, and if you just want 
to look at the main code involved without having to apply the patch you can 
browse the BgBufferSync function in bufmgr.c starting around line 1120 at 
http://www.westnet.com/~gsmith/content/postgresql/bufmgr.c

Lots of internal debugging information is dumped into the logs if you toggle on 
#define BGW_DEBUG.  A gross summary of the two most important things that 
show what the code is doing is logged at DEBUG1 (but should probably be pushed 
lower before committing).

This code is as good as you're going to get from me before the 8.3 close. I 
could do some small rewriting and certainly can document all this further as 
part of getting this patch moved toward committed, but I'm out of resources to 
do too much more here.  Along with the big question of whether this whole idea 
is worth pursuing at all as part of 8.3, here are the remaining smaller 
questions about my specific code where review feedback would be valuable:

-The way I'm getting the passes number back from the freelist.c strategy code 
seems like it will eventually overflow the long I'm using for the intermediate 
results when I execute statements like this:

strategy_position=(long)strategy_passes * NBuffers + strategy_buf_id;

I'm not sure if the code would be better if I were to use a 64-bit integer for 
strategy_position instead, or if I should just rewrite the code to separate out 
the passes multiplication--which will make it less elegant to read but should 
make overflow issues go away.
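To make the 64-bit option concrete, a sketch of that alternative might look 
like this (illustration only, not a recommendation):

#include <stdint.h>

/* With a 64-bit accumulator the passes * NBuffers product cannot overflow,
 * even with very large buffer pools and long uptimes. */
static int64_t
strategy_position64(uint32_t strategy_passes, int strategy_buf_id, int nbuffers)
{
    return (int64_t) strategy_passes * nbuffers + strategy_buf_id;
}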

-Heikki didn't like the way I pass information back from SyncOneBuffer back to 
the background writer.  The bitmask approach I'm using has added flexibility to 
writing more intelligent background writers in the future.  In the past I have 
written more complicated ones than any of the approaches mentioned here, using 
things like the usage_count information returned, but the simpler 
implementation here ignores that.  I could simplify this interface if I 
had to, but I like what I've done as a solid structure for future coding as 
it's written right now.
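For reference, a bitmask return of the kind described might look roughly like 
the following; the flag names are placeholders rather than the patch's exact 
identifiers:

/* Hypothetical flag values for a SyncOneBuffer-style bitmask result. */
#define BUF_WRITTEN     0x01    /* buffer was dirty and has been written out */
#define BUF_REUSABLE    0x02    /* usage_count == 0 and unpinned, so reusable */

/* the cleaning loop can then tally both pieces of information separately */
static void
tally_sync_result(int sync_flags, int *num_written, int *num_reusable)
{
    if (sync_flags & BUF_WRITTEN)
        (*num_written)++;
    if (sync_flags & BUF_REUSABLE)
        (*num_reusable)++;
}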

-There are two magic constants in the code:
    int         smoothing_samples = 16;
    float       scan_whole_pool_seconds = 120.0;

I believe I've done enough testing recently and in the past to say these are 
reasonable numbers for most installations, and high-throughput systems are 
going to care more about tuning the multiplier GUC than either of these.  In 
the interest of having fewer knobs people can fool with and break, I personally 
don't feel like these constants need to be exposed for tuning purposes; they 
don't have a significant impact on how the underlying model works.  Determining 
whether these should be exposed as GUC tunables is certainly an open question 
though.

-I bumped the default for bgwriter_lru_maxpages to 100 so that typical low-end 
systems should get an automatically tuning LRU background writer out of the box 
in 8.3.  This is a big change from the 5 that was used in the older releases. 
If you keep everything at the defaults this represents a maximum theoretical 
write rate for the BGW of 4MB/s, which isn't very much relative to modern 
hardware.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Just-in-time Background Writer Patch+Test Results

From
"Kevin Grittner"
Date:
>>> On Wed, Sep 5, 2007 at 10:31 PM, in message
<Pine.GSO.4.64.0709052324020.25284@westnet.com>, Greg Smith
<gsmith@gregsmith.com> wrote:
>
> -There are two magic constants in the code:
>
>      int         smoothing_samples = 16;
>      float       scan_whole_pool_seconds = 120.0;
>

> I personally
> don't feel like these constants need to be exposed for tuning purposes;

> Determining
> whether these should be exposed as GUC tunables is certainly an open
> question though.
If you exposed the scan_whole_pool_seconds as a tunable GUC, that would
allay all of my concerns about this patch.  Basically, our problems were
resolved by getting all dirty buffers out to the OS cache within two
seconds; any longer than that and the OS cache didn't reach its trigger
point for pushing out to the controller cache in time to prevent the glut
which locks everything up.  I also suspect that this interval kept the OS
cache more aware of frequently updated pages, so that it could avoid
unnecessary physical writes under its own logic.
While I'm hoping that the new checkpoint techniques will be a better
solution, I can't count on that without significant testing in our
environment, and I really want a fall-back.  The metric you emphasized was
the percentage of PostgreSQL writes to the OS cache which were handled by
the background writer, which doesn't necessarily correspond to a solution
to the glut, which is based on the peak number of total writes presented
to the controller by the OS within a small window of time.
-Kevin




Re: Just-in-time Background Writer Patch+Test Results

From
Greg Smith
Date:
On Thu, 6 Sep 2007, Kevin Grittner wrote:

> If you exposed the scan_whole_pool_seconds as a tunable GUC, that would
> allay all of my concerns about this patch.  Basically, our problems were
> resolved by getting all dirty buffers out to the OS cache within two
> seconds

Unfortunately it wouldn't make my concerns about your system go away or 
I'd have recommended exposing it specifically to address your situation. 
I have been staring carefully at your configuration recently, and I would 
wager that you could turn off the LRU writer altogether and still meet 
your requirements in 8.2.  Here's what you've got right now:

> shared_buffers = 160MB (=20000 buffers)
> bgwriter_lru_percent = 20.0
> bgwriter_lru_maxpages = 200
> bgwriter_all_percent = 10.0
> bgwriter_all_maxpages = 600

With the default delay of 200ms, this has the LRU-writer scanning the 
whole pool every 1 second, while the all-writer scans every two 
seconds--assuming they don't hit the write limits.  If some event were to 
dirty the whole pool in 200ms, it might take as much as 6.7 seconds to 
write everything out (20000 / 600 * 200 ms) via the all-scan.  The 
all-scan is already gone in 8.3.  Your LRU scan will take much longer than 
that to clear everything out: at least 20 seconds (20000 / 200 * 200ms) to 
clear a fully dirty cache.
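That arithmetic can be captured in one small helper, shown here only to make 
the bound explicit (not code from any patch):

/* Time to push a fully dirty pool through a maxpages-limited writer:
 * 20000 / 600 * 0.2s is about 6.7s for the all-scan,
 * 20000 / 200 * 0.2s = 20s for the LRU scan. */
static double
seconds_to_flush_pool(int nbuffers, int maxpages_per_cycle, int delay_ms)
{
    double  cycles = (double) nbuffers / maxpages_per_cycle;

    return cycles * delay_ms / 1000.0;
}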

But in fact, it's impossible to even bound how long it will take before 
the LRU writer (which is the only part this new patch tries to improve) 
gets around to writing even a single dirty buffer no matter what 
bgwriter_lru_percent (8.2) or scan_whole_pool_seconds (JIT patch) is set 
to.

There's a second low-level issue involved here.  When a page becomes 
dirty, that implies it was also recently used, which means the LRU writer 
won't touch it.  That page can't be written out by the LRU writer until an 
entire pass has been made over the shared_buffer pool while looking for 
buffers to allocate for new activity.  When the allocation clock-sweep 
passes over the newly dirtied buffer again, its usage count will drop by 
one and it will no longer be considered recently used.  At that point the 
LRU writer can write it out.  So unless there is other allocation activity 
going on, the scan_whole_pool_seconds mechanism will never provide the 
bound on time to scan and write everything you hope it will.

And if there are other allocations going on, the much more powerful JIT 
mechanism will scan the whole pool plenty fast if you bump the already 
exposed multiplier tunable up.  In my tests where the buffer cache was 
filled with mostly dirty buffers that couldn't be re-used (something 
relatively easy to trigger with pgbench tests), I've actually watched the 
new code scan >90% of the buffer cache looking for those few reusable 
buffers in the pool in a single invocation.  This would be like setting 
bgwriter_lru_percent=90.0 in the old configuration, but it only gets that 
aggressive when the distribution of pages in the buffer cache demands it, 
and when it has reason to believe going that fast will be helpful.

The completely understandable line of thinking that led to your request 
here is one of my concerns with exposing scan_whole_pool_seconds as a 
tunable.  It may suggest to people that if they set the number very low, 
it will assure all dirty buffers will be scanned and written within that 
time bound.  That's certainly not the case; both the maxpages and the 
usage count information will actually drive the speed that mechanism plods 
through the buffer cache.  It really isn't useful for scanning fast.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Just-in-time Background Writer Patch+Test Results

From
Decibel!
Date:
On Thu, Sep 06, 2007 at 09:20:31AM -0500, Kevin Grittner wrote:
> >>> On Wed, Sep 5, 2007 at 10:31 PM, in message
> <Pine.GSO.4.64.0709052324020.25284@westnet.com>, Greg Smith
> <gsmith@gregsmith.com> wrote:
> >
> > -There are two magic constants in the code:
> >
> >      int         smoothing_samples = 16;
> >      float       scan_whole_pool_seconds = 120.0;
> >
>
> > I personally
> > don't feel like these constants need to be exposed for tuning purposes;
>
> > Determining
> > whether these should be exposed as GUC tunables is certainly an open
> > question though.
>
> If you exposed the scan_whole_pool_seconds as a tunable GUC, that would
> allay all of my concerns about this patch.  Basically, our problems were

I like the idea of not having that as a GUC, but I'm doubtful that it
can be hard-coded like that. What if checkpoint_timeout is set to 120?
Or 60? Or 2000?

I don't know that there should be a direct correlation, but ISTM that
scan_whole_pool_seconds should take checkpoint intervals into account
somehow.
--
Decibel!, aka Jim Nasby                        decibel@decibel.org
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

Re: Just-in-time Background Writer Patch+Test Results

From
"Kevin Grittner"
Date:
>>> On Thu, Sep 6, 2007 at 11:27 AM, in message
<Pine.GSO.4.64.0709061121020.14491@westnet.com>, Greg Smith
<gsmith@gregsmith.com> wrote:
> On Thu, 6 Sep 2007, Kevin Grittner wrote:
>
> I have been staring carefully at your configuration recently, and I would
> wager that you could turn off the LRU writer altogether and still meet
> your requirements in 8.2.
I totally agree that it is of minor benefit compared to the all-writer,
if it even matters at all.  I knew that when I chose the settings.
> Here's what you've got right now:
>
>> shared_buffers = 160MB (=20000 buffers)
>> bgwriter_lru_percent = 20.0
>> bgwriter_lru_maxpages = 200
>> bgwriter_all_percent = 10.0
>> bgwriter_all_maxpages = 600
>
> With the default delay of 200ms, this has the LRU-writer scanning the
> whole pool every 1 second,
Whoa!  Apparently I've totally misread the documentation.  I thought that
the bgwriter_lru_percent was scanned from the lru end each time; I would
not expect that it would ever get beyond the oldest 10%.  I put that in
just as a guard to keep the backends from having to wait for the OS write.
I've always doubted whether it was helping, but "it wasn't broke"....
> while the all-writer scans every two
> seconds--assuming they don't hit the write limits.  If some event were to
> dirty the whole pool in 200ms, it might take as much as 6.7 seconds to
> write everything out (20000 / 600 * 200 ms) via the all-scan.
Right.  Since the file system didn't seem to be able to accept writes
faster than 800 PostgreSQL pages per second, and I wanted to leave a
LITTLE slack, I set that limit.  We don't seem to hit it, as far as I can
tell.  In fact, the output rate would be naturally fairly smooth, if not
for the "hold all dirty pages until the last possible moment, then write
them all to the OS and fsync" approach.
> There's a second low-level issue involved here.  When a page becomes
> dirty, that implies it was also recently used, which means the LRU writer
> won't touch it.  That page can't be written out by the LRU writer until an
> entire pass has been made over the shared_buffer pool while looking for
> buffers to allocate for new activity.  When the allocation clock-sweep
> passes over the newly dirtied buffer again, its usage count will drop by
> one and it will no longer be considered recently used.  At that point the
> LRU writer can write it out.
How low does the count have to go, or does it track the count when it
becomes dirty and look for a decrease?
> So unless there is other allocation activity
> going on, the scan_whole_pool_seconds mechanism will never provide the
> bound on time to scan and write everything you hope it will.
That may not be an issue for the environment where this has been a problem
for us -- the web hits are coming in at a pretty good rate 24/7.  (We have
a couple dozen large companies scanning data through HTTP SOAP requests
all the time.)  This should keep us reading new pages, which covers this,
yes?
> where the buffer cache was
> filled with mostly dirty buffers that couldn't be re-used
That would be the condition that would be the killer with a synchronous
checkpoint if the OS cache has already had some dirty pages trickled out.
If we can hit this condition in our web database, either the load
distributed checkpoint will save us, or we can't use 8.3.  Period.
> The completely understandable line of thinking that led to your request
> here is one of my concerns with exposing scan_whole_pool_seconds as a
> tunable.  It may suggest to people that if they set the number very low,
> it will assure all dirty buffers will be scanned and written within that
> time bound.  That's certainly not the case; both the maxpages and the
> usage count information will actually drive the speed that mechanism plods
> through the buffer cache.  It really isn't useful for scanning fast.
I'm not clear on the benefit of not writing the recently accessed dirty
pages when there are no less recently used dirty pages.  I do trust the OS
to not write them before they age out in that cache, and the OS cache
doesn't start writing dirty pages from its cache until they reach a
certain percentage of the cache space, so I'd just as soon let the OS know
that the MRU dirty pages are there, so it knows that it's time to start
working on the LRU pages in its cache.
-Kevin



Re: Just-in-time Background Writer Patch+Test Results

From
Tom Lane
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> On Thu, Sep 6, 2007 at 11:27 AM, in message
> <Pine.GSO.4.64.0709061121020.14491@westnet.com>, Greg Smith
> <gsmith@gregsmith.com> wrote: 
>> With the default delay of 200ms, this has the LRU-writer scanning the 
>> whole pool every 1 second,
>  
> Whoa!  Apparently I've totally misread the documentation.  I thought that
> the bgwriter_lru_percent was scanned from the lru end each time; I would
> not expect that it would ever get beyond the oldest 10%.

I believe you're correct and Greg got this wrong.  I won't draw any
conclusions about whether the LRU stuff is actually doing you any good
though.
        regards, tom lane


Re: Just-in-time Background Writer Patch+Test Results

From
Greg Smith
Date:
On Thu, 6 Sep 2007, Kevin Grittner wrote:

> I thought that the bgwriter_lru_percent was scanned from the lru end 
> each time; I would not expect that it would ever get beyond the oldest 
> 10%.

You're correct; I stated that badly.  What I should have said is that your 
LRU writer could potentially scan the pool as fast as once per second if 
there were enough allocations going on.

> How low does the count have to go, or does it track the count when it
> becomes dirty and look for a decrease?

The usage count has to be 0 before a page can be re-used for a new 
allocation, and the LRU background writer only writes out potentially 
reusable pages that are dirty.  So the count has to be 0 before it will 
write it.
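In other words, simplified into a sketch that ignores the locking and pinning 
details the real bufmgr.c code has to handle:

#include <stdbool.h>

/* The LRU cleaner only writes buffers the clock sweep could hand out next:
 * unpinned, usage count already at zero, and dirty. */
static bool
lru_cleaner_would_write(int usage_count, int refcount, bool is_dirty)
{
    return usage_count == 0 && refcount == 0 && is_dirty;
}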

> This should keep us reading new pages, which covers this, yes?

One would hope.  Your whole arrangement of shared_buffers, 
checkpoint_segments, and related parameters will need to be reconsidered 
for 8.3; you've got a delicately balanced arrangement for your 8.2 setup 
right now that's working for you, but just translating it straight to 8.3 
won't get you what you want.  I'll get back to the message you already 
sent on that subject when I get enough time to address it fully.

> I'm not clear on the benefit of not writing the recently accessed dirty
> pages when there are no less recently used dirty pages.

This presumes PostgreSQL has some notion of the balance of recently 
accessed vs. not accessed dirty pages, which it does not.  Buffers get 
updated individually, and there's no mechanism summarizing what's in 
there; you have to scan the buffer cache yourself to figure that out.  I 
do some of that in this new patch, tracking things like how many buffers 
are scanned on average to find reusable ones.

Many months ago, I wrote a very complicated re-implementation of the 
all-scan portion of the background writer that tracked the usage count of 
everything it looked at, kept statistics about how many pages were dirty 
at each usage count, then targeted how high of a usage count could be 
written given some information about what I/O rate you felt your devices 
could sustain.  This did exactly what you're asking for here:  wrote 
whatever dirty pages were around starting with the ones that hadn't been 
recently used, then worked its way up to pages with a higher usage count 
if the recently used ones were all clean.

As far as I've been able to tell, and from Heikki's test results, the load 
distributed checkpoint was a better answer to this problem.  Rather than 
constantly fight to get pages with high usage counts out all the time, 
just spread the checkpoint out instead and deal with them only then.  I 
gave up on that branch of code while he removed the all-scan writer 
altogether as part of committing LDC.  I suspect the path I was following 
was exactly what you think you'd like to have, but it seems that it's not 
actually needed.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Just-in-time Background Writer Patch+Test Results

From
Simon Riggs
Date:
On Wed, 2007-09-05 at 23:31 -0400, Greg Smith wrote:

> Tom gets credit for naming the attached patch, which is my latest attempt to 
> finalize what has been called the "Automatic adjustment of 
> bgwriter_lru_maxpages" patch for 8.3; that's not what it does anymore but 
> that's where it started.

This is a big undertaking, so well done for going for it.

> I decided to use pgbench for running my tests.  The scripting framework to 
> collect all that data and usefully summarize it is now available as 
> pgbench-tools-0.2 at 
> http://www.westnet.com/~gsmith/content/postgresql/pgbench-tools.htm

For me, the main role of the bgwriter is to avoid dirty writes in
backends. The purpose of doing that is to improve the response time
distribution as perceived by users. I think that is what we should be
measuring, perhaps in a simple way such as calculating the 90th
percentile of the response time distribution. It is notoriously difficult to
derive much meaning from test results by looking only at the "headline
numbers", especially tps.

Looking at the tps also tempts us to run a test which maxes out the
server, an area we already know and expect the bgwriter to be unhelpful
in.

If I run a server at or below 70% capacity, what settings of the
bgwriter help maintain my response time distribution?

> Coping with idle periods
> ------------------------
> 
> While I was basically happy with these results, the data Kevin Grittner 
> submitted in response to my last call for commentary left me concerned. While 
> the JIT approach works fine as long as your system is active, it does 
> absolutely nothing if the system is idle.  I noticed that a lot of the writes 
> that were being done by the client backends were after idle periods where the 
> JIT writer just didn't react fast enough during the ramp-up.  For example, if 
> the system went from idle for a while to full-speed just as the 200ms sleep 
> started, by the time the BGW woke up again the backends could have needed to 
> write many buffers already themselves.

You've hit the nail on the head there. I can't see how you can do
anything sensible when the bgwriter keeps going to sleep for long
periods.

The bgwriter's activity curve should ideally be the same shape as a
critically damped harmonic oscillator. It should wake up, lots of
writing if needed, then trail off over time. The only way to do that
seems to be to vary the sleep automatically, or make short sleeps.

For me, the bgwriter should sleep for at most 10ms at a time. If it has
nothing to do it can go straight back to sleep again. Trying to set that
time is fairly difficult, so it would be better not to have to set it at
all.

If you've changed bgwriter so it doesn't scan if no blocks have been
allocated, I don't see any reason to keep the _delay parameter at all.

> I think I can safely say there is a level of intelligence going into what the 
> LRU background writer does with this patch that has never been applied to this 
> problem before.  There have been a lot of good ideas thrown out in this area, 
> but it took a hybrid approach that included and carefully balanced all of them 
> to actually get results that I felt were usable. What I don't know is whether 
> that will also be true for other testers.

I get the feeling that what we have here is better than what we had
before, but I guess I'm a bit disappointed we still have 3 magic
parameters, or 5 if you count your hard-coded ones also.

There's still no formal way to tune these. As long as we have *any*
magic parameters, we need a way to tune them in the field, or they are
useless. At the very least we need a plan for how people will report results
during Beta. That means we need a log_bgwriter (better name, please...)
parameter that provides information to assist with tuning. At the very
least we need this to be present during Beta, if not beyond.

--  Simon Riggs 2ndQuadrant  http://www.2ndQuadrant.com



Re: Just-in-time Background Writer Patch+Test Results

From
Greg Smith
Date:
On Fri, 7 Sep 2007, Simon Riggs wrote:

> I think that is what we should be measuring, perhaps in a simple way 
> such as calculating the 90th percentile of the response time 
> distribution.

I do track the 90th percentile numbers, but in these pgbench tests where 
I'm writing as fast as possible they're actually useless--in many cases 
they're *smaller* than the average response, because there are enough 
cases where there is a really, really long wait that they skew the average 
up really hard.  Take a look at any of the individual test graphs and 
you'll see what I mean.

> Looking at the tps also tempts us to run a test which maxes out the
> server, an area we already know and expect the bgwriter to be unhelpful
> in.

I tried to turn that around and make my thinking be that if I built a 
bgwriter that did most of the writes without badly impacting the measure 
we know and expect it to be unhelpful in, that would be more likely to 
yield a robust design.  It kept me out of areas where I might have built 
something that had to be disclaimed with "don't run this when the server 
is maxed out".

> For me, the bgwriter should sleep for at most 10ms at a time. If it has
> nothing to do it can go straight back to sleep again. Trying to set that
> time is fairly difficult, so it would be better not to have to set it at
> all.

I wanted to get this patch out there so people could start thinking about 
what I'd done and consider whether this still fit into the 8.3 timeline. 
What I'm doing myself right now is running tests with a much lower setting 
for the delay time--am testing 20ms right now.  I personally would be 
happy saying it's 10ms and that's it.  Is anyone using a time lower than 
that right now?  I seem to recall that 10ms was also the shortest interval 
Heikki used in his tests as well.

> I get the feeling that what we have here is better than what we had
> before, but I guess I'm a bit disappointed we still have 3 magic
> parameters, or 5 if you count your hard-coded ones also.

I may be able to eliminate more of them, but I didn't want to take them 
out before beta.  If it can be demonstrated that some of these parameters 
can be set to specific values and still work across a wider range of 
applications than what I've tested, then there's certainly room to fix 
some of these, which actually makes some things easier.  For example, I'd 
be more confident fixing the weighted average smoothing period to a 
specific number if I knew the delay was fixed, and there's two parameters 
gone.  And the multiplier is begging to be eliminated, just need some more 
data to confirm that's true.

> There's still no formal way to tune these. As long as we have *any*
> magic parameters, we need a way to tune them in the field, or they are
> useless. At very least we need a plan for how people will report results
> during Beta. That means we need a log_bgwriter (better name, please...)
> parameter that provides information to assist with tuning.

Once I got past the "does it work?" stage, I've been doing all the tuning 
work using a before/after snapshot of pg_stat_bgwriter data during a 
representative snapshot of activity and looking at the delta.  Been a 
while since I actually looked into the logs for anything.  It's very 
straightforward to put together a formal tuning plan using the data in 
there, particularly compared to the impossibility of creating such a 
plan in the current code.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Just-in-time Background Writer Patch+Test Results

From
Simon Riggs
Date:
On Fri, 2007-09-07 at 11:48 -0400, Greg Smith wrote:
> On Fri, 7 Sep 2007, Simon Riggs wrote:
> 
> > I think that is what we should be measuring, perhaps in a simple way 
> > such as calculating the 90th percentile of the response time 
> > distribution.
> 
> I do track the 90th percentile numbers, but in these pgbench tests where 
> I'm writing as fast as possible they're actually useless--in many cases 
> they're *smaller* than the average response, because there are enough 
> cases where there is a really, really long wait that they skew the average 
> up really hard.  Take a look at any of the individual test graphs and 
> you'll see what I mean.

I've looked at the graphs now, but I'm not any wiser, I'm very sorry to
say. We need something like a frequency distribution curve, not just the
actual times. Bottom line is we need a good way to visualise the
detailed effects of the patch.

I think we should do some more basic tests to see where those outliers
come from. We need to establish a clear link between number of dirty
writes and response time. If there is one, which we all believe, then it
is worth minimising those with these techniques. We might just be
chasing the wrong thing.

Perhaps output the number of dirty blocks written on the same line as
the output of log_min_duration_statement so that we can correlate
response time to dirty-block-writes on that statement.

For me, we can enter Beta while this is still partially in the air. We
won't be able to get this right without lots of other feedback. So I
think we should concentrate now on making sure we've got the logging in
place so we can check whether your patch works when its out there. I'd
say lets include what you've done and then see how it works during Beta.
We've been trying to get this right for years now, so we have to allow
some slack to make sure we get this right. We can reduce or strip out
logging once we go RC.

--  Simon Riggs 2ndQuadrant  http://www.2ndQuadrant.com



Re: Just-in-time Background Writer Patch+Test Results

From
Greg Smith
Date:
On Fri, 7 Sep 2007, Simon Riggs wrote:

> I think we should do some more basic tests to see where those outliers 
> come from. We need to establish a clear link between number of dirty 
> writes and response time.

With the test I'm running, which is specifically designed to aggravate 
this behavior, the outliers on my system come from how Linux buffers 
writes.  I can adjust them a bit by playing with the parameters as 
described at http://www.westnet.com/~gsmith/content/linux-pdflush.htm but 
on the hardware I've got here (single 7200RPM disk for database, another 
for WAL) they don't move much.  Once /proc/meminfo shows enough Dirty 
memory that pdflush starts blocking writes, game over; you're looking at 
multi-second delays before my plain old IDE disks clear enough debris out 
to start responding to new requests even with the Areca controller I'm 
using.

> Perhaps output the number of dirty blocks written on the same line as
> the output of log_min_duration_statement so that we can correlate
> response time to dirty-block-writes on that statement.

On Linux at least, I'd expect this won't reveal much.  There, the 
interesting correlation is with how much dirty data is in the underlying 
OS buffer cache.  And exactly how that plays into things is a bit strange 
sometimes.  If you go back to Heikki's DBT2 tests with the background 
writer schemes he tested, he got frustrated enough with that disconnect 
that he wrote a little test program just to map out the underlying 
weirdness: 
http://archives.postgresql.org/pgsql-hackers/2007-07/msg00261.php

I've confirmed his results on my system and done some improvements to that 
program myself, but pushed further work on it to the side to finish up the 
main background writer task instead.  I may circle back to that.  I'd 
really like to run all this on another OS as well (I have Solaris 10 on my 
server box but not fully setup yet), but I can only volunteer so much time 
to work on all this right now.

If there's anything that needs to be looked at more carefully during tests 
in this area, it's getting more data about just what the underlying OS is 
doing while all this is going on.  Just the output from vmstat/iostat is 
very informative.  Those using DBT2 for their tests get some nice graphs 
of this already.  I've done some pgbench-based tests that included that 
before that were very enlightening but sadly that system isn't available 
to me anymore.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Just-in-time Background Writer Patch+Test Results

From
Greg Smith
Date:
On Fri, 7 Sep 2007, Simon Riggs wrote:

> For me, the bgwriter should sleep for at most 10ms at a time.

Here are the results I got when I pushed the time down significantly from 
the defaults, with some of the earlier results for comparison:
                      info                      | set | tps  | cleaner_pct
------------------------------------------------+-----+------+-------------
 jit multiplier=2.0 scan_whole=120s delay=200ms |  17 |  981 |       99.98
 jit multiplier=1.0 scan_whole=120s delay=200ms |  18 |  970 |       99.99
 jit multiplier=1.0 scan_whole=120s delay=20ms  |  20 |  956 |       92.34
 jit multiplier=2.0 scan_whole=120s delay=20ms  |  21 |  967 |       99.94
 jit multiplier=1.5 scan_whole=120s delay=10ms  |  22 |  944 |       97.91
 jit multiplier=2.0 scan_whole=120s delay=10ms  |  23 |  981 |        99.7

It seems I have to push the multiplier higher to get good results when 
using a much lower interval, which was expected, but the fundamentals all 
scale down to running much faster the way I'd hoped.

I'm tempted to make the default 10ms, adjust some of the other constants 
just a bit to optimize better for that time scale:  make the default 
multiplier 2.0, increase the weighted average sample period, and perhaps 
reduce scan_whole a bit because that's barely doing anything at 10ms.  If 
no one discovers any problems with working that way during beta, then 
consider locking them in for the RC.  That would leave just the multiplier 
and maxpages as the exposed tunables, and it's very easy to tune maxpages 
just by watching pg_stat_bgwriter.  This would obviously be a very 
aggressive plan--it would be eliminating GUCs and reducing flexibility for 
people in the field, aiming instead at making this more automatic for the 
average case.

If anyone has a reason why they feel the bgwriter_delay needs to be a 
tunable or why the rate might need to run even faster than 10ms, now would 
be a good time to say why.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Just-in-time Background Writer Patch+Test Results

From
Tom Lane
Date:
Greg Smith <gsmith@gregsmith.com> writes:
> If anyone has a reason why they feel the bgwriter_delay needs to be a 
> tunable or why the rate might need to run even faster than 10ms, now would 
> be a good time to say why.

You'd be hard-wiring the thing to wake up 100 times per second?  Doesn't
sound like a good plan from here.  Keep in mind that not everyone wants
their machine to be dedicated to Postgres, and some people even would
like their CPU to go to sleep now and again.

I've already gotten flak about the current default of 200ms:
https://bugzilla.redhat.com/show_bug.cgi?id=252129
I can't imagine that folk with those types of goals will tolerate
an un-tunable 10ms cycle.

In fact, given the numbers you show here, I'd say you should leave the
default cycle time at 200ms.  The 10ms value is eating way more CPU and
producing absolutely no measured benefit relative to 200ms...
        regards, tom lane


Re: Just-in-time Background Writer Patch+Test Results

From
Greg Smith
Date:
On Sat, 8 Sep 2007, Tom Lane wrote:

> I've already gotten flak about the current default of 200ms: 
> https://bugzilla.redhat.com/show_bug.cgi?id=252129
> I can't imagine that folk with those types of goals will tolerate an 
> un-tunable 10ms cycle.

That's the counter-example I was looking for showing why lowering the default 
is unacceptable.  Scratch bgwriter_delay off the list of things that might be 
fixed to a specific value.

Will return to the drawing board to figure out a way to incorporate what 
I've learned about running at 10ms into a tuning plan that still works 
fine at 200ms or higher.  The good news as far as I'm concerned is that I 
haven't had to adjust the code so far, just tweak the existing knobs.

> In fact, given the numbers you show here, I'd say you should leave the 
> default cycle time at 200ms.  The 10ms value is eating way more CPU and 
> producing absolutely no measured benefit relative to 200ms...

My server is a bit underpowered to run at 10ms and gain anything when 
doing a stress test like this; I was content that it didn't degrade 
performance significantly, that was the best I could hope for.  I would 
expect the class of systems that Simon and Heikki are working with could 
show significant benefit from running the BGW that often.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Just-in-time Background Writer Patch+Test Results

From
Tom Lane
Date:
Greg Smith <gsmith@gregsmith.com> writes:
> On Sat, 8 Sep 2007, Tom Lane wrote:
>> In fact, given the numbers you show here, I'd say you should leave the 
>> default cycle time at 200ms.  The 10ms value is eating way more CPU and 
>> producing absolutely no measured benefit relative to 200ms...

> My server is a bit underpowered to run at 10ms and gain anything when 
> doing a stress test like this; I was content that it didn't degrade 
> performance significantly, that was the best I could hope for.  I would 
> expect the class of systems that Simon and Heikki are working with could 
> show significant benefit from running the BGW that often.

Quite possibly.  So it sounds like we still need to expose
bgwriter_delay as a tunable.

It might be interesting to consider making the delay auto-tune: if you
wake up and find nothing (much) to do, sleep longer the next time,
conversely shorten the delay when work picks up.  Something for 8.4,
though, at this point.
        regards, tom lane


Re: Just-in-time Background Writer Patch+Test Results

From
Gregory Stark
Date:
"Greg Smith" <gsmith@gregsmith.com> writes:

> On Sat, 8 Sep 2007, Tom Lane wrote:
>
>> I've already gotten flak about the current default of 200ms:
>> https://bugzilla.redhat.com/show_bug.cgi?id=252129
>> I can't imagine that folk with those types of goals will tolerate an
>> un-tunable 10ms cycle.
>
> That's the counter-example for why lowering the default is unacceptable I was
> looking for.  Scratch bgwriter_delay off the list of things that might be fixed
> to a specific value.

Ok, time for the obligatory contrarian voice here. It's all well and good to
aim to eliminate GUC variables but I don't think it's productive to do so by
simply hard-wiring them. 

Firstly that doesn't really make life any easier than simply finding good
defaults and documenting that DBAs probably shouldn't be bothering to tweak
them.

Secondly it's unlikely to work. The variables under consideration may have
reasonable defaults but they're not likely to have defaults will work in every
case. This example is pretty typical. There aren't many variables that will
have a reasonable default which will work for both an interactive desktop
where Postgres is running in the background and Sun's 1000+ process
benchmarks.

What I think is more likely to work is looking for ways to make these
variables auto-tuning. That eliminates the knob not by just hiding it away and
declaring it doesn't exist but by architecting the system so that there really
is no knob that might need tweaking.

Perhaps what would work better here is having a semaphore which bgwriter
sleeps on which backends wake up whenever the clock sweep hand completes a
cycle. Or gets within a certain fraction of a cycle of catching up.

Or perhaps bgwriter shouldn't be adjusting the number of pages it processes at
all and instead it should only be adjusting the sleep time. So it would always
process a full cycle for example but adjust the sleep time based on what
percentage of the cycle the backends used up in the last sleep time.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com


Re: Just-in-time Background Writer Patch+Test Results

From
Greg Smith
Date:
On Sat, 8 Sep 2007, Tom Lane wrote:

> It might be interesting to consider making the delay auto-tune: if you
> wake up and find nothing (much) to do, sleep longer the next time,
> conversely shorten the delay when work picks up.  Something for 8.4,
> though, at this point.

I have a couple of pages of notes on how to tune the delay automatically. 
The tricky part is applications that go from 0 to full speed with little 
warning; the first few seconds of the stock market open come to mind. 
What I was working toward was considering what you set the delay to as a 
steady-state value, and then the delay cranks downward as activity levels 
go up.  As activity dies off, it slowly returns to the default again.

But I realized that I needed to get all this other stuff working, all the 
statistics counters exposed usefully, and then collect a lot more data 
before I could implement that plan.  Definitely something that might fit 
into 8.4, completely impossible for 8.3.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Just-in-time Background Writer Patch+Test Results

From
Greg Smith
Date:
On Thu, 6 Sep 2007, Decibel! wrote:

> I don't know that there should be a direct correlation, but ISTM that
> scan_whole_pool_seconds should take checkpoint intervals into account
> somehow.

Any direct correlation is weak at this point.  The LRU cleaner has a small 
impact on checkpoints, in that it's writing out buffers that may make the 
checkpoint quicker.  But this particular write trickling mechanism is not 
aimed directly at flushing the whole pool; it's more about smoothing out 
idle periods a bit.

Also, computing the checkpoint interval is itself tricky.  Heikki had to 
put some work into getting something that took into account both the 
timeout and segments mechanisms to gauge progress, and I'm not sure I can 
directly re-use that because it's really only doing that while the 
checkpoint is active.  I'm not saying it's a bad idea to have the expected 
interval as an input to the model, just that it's not obvious to me how to 
do it and whether it would really help.

> I like the idea of not having that as a GUC, but I'm doubtful that it
> can be hard-coded like that. What if checkpoint_timeout is set to 120?
> Or 60? Or 2000?

Someone using 60 or 120 has checkpoint problems way bigger than the LRU 
cleaner can be expected to help with.  How fast the reusable buffers it 
can write are pushed out is the least of their problems.  Also, I'd expect 
that the only cases using such a low value for a good reason are doing so 
because they have enormous amounts of activity on their system, and in 
that case the primary JIT mechanism should dominate how the LRU cleaner 
treats them.  scan_whole_pool_seconds doesn't do anything if the primary 
mechanism was already planning to scan more buffers than it aims for.

Someone who has very infrequent checkpoints and therefore low activity, 
like your 2000 case, can expect that the LRU cleaner will lap and catch up 
to the strategy point about 2 minutes after any activity and then follow 
directly behind it with the way I've set this up.  If that's cleaning the 
buffer cache too aggressively, I think those in that situation would be 
better served by constraining the maxpages parameter; that's directly 
adjusting what I'd expect their real issue is, how fast pages can flush to 
disk, rather than the secondary one of how fast the pool is being scanned.

I picked 2 minutes for that value because it's as slow as I can make it 
and still serve its purpose, while not feeling to me like it's too fast 
for a relatively idle system even if someone set maxpages=1000.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Just-in-time Background Writer Patch+Test Results

From
Alvaro Herrera
Date:
Greg Smith wrote:
> On Sat, 8 Sep 2007, Tom Lane wrote:
>
>> It might be interesting to consider making the delay auto-tune: if you
>> wake up and find nothing (much) to do, sleep longer the next time,
>> conversely shorten the delay when work picks up.  Something for 8.4,
>> though, at this point.
>
> I have a couple of pages of notes on how to tune the delay automatically. 
> The tricky part are applications that go from 0 to full speed with little 
> warning; the first few seconds of the stock market open come to mind.

Maybe have the backends send a signal to bgwriter when they see it
sleeping and are overwhelmed by work.  That way, bgwriter can sleep for
a few seconds, safe in the knowledge that somebody else will wake it up
if needed sooner.  The way backends would detect that bgwriter is
sleeping is that bgwriter would keep an atomic flag in shared memory,
and it gets set only if it's going to sleep for long (so if it's going
to sleep for (say) 100ms or less, it doesn't set the flag, so the
backends won't signal it).  In order to avoid a huge amount of signals
when all backends suddenly start working at the same instant, have the
signal itself be sent only by the first backend that manages to
LWLockConditionalAcquire a lwlock that's only used for that purpose.
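A rough sketch of that scheme, with every name here hypothetical (none of this
exists in the backend) and the usual lwlock.h and signal plumbing assumed:

typedef struct
{
    volatile bool   bgwriter_in_long_sleep;     /* set by bgwriter before a long nap */
    pid_t           bgwriter_pid;
} BgWriterWakeupShmem;

static void
maybe_wake_bgwriter(BgWriterWakeupShmem *shmem, LWLockId wakeup_lock)
{
    if (!shmem->bgwriter_in_long_sleep)
        return;                     /* it will wake up soon anyway; don't signal */

    /* only the first backend to win this lock sends the signal, so a sudden
     * burst of activity does not turn into a storm of signals */
    if (LWLockConditionalAcquire(wakeup_lock, LW_EXCLUSIVE))
    {
        shmem->bgwriter_in_long_sleep = false;
        kill(shmem->bgwriter_pid, SIGUSR1);
        LWLockRelease(wakeup_lock);
    }
}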

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Just-in-time Background Writer Patch+Test Results

From
Greg Smith
Date:
On Sat, 8 Sep 2007, Greg Smith wrote:

> Here's the results I got when I pushed the time down significantly from the 
> defaults
>                     info                      | set | tps  | cleaner_pct
> -----------------------------------------------+-----+------+-------------
> jit multiplier=1.0 scan_whole=120s delay=20ms |  20 |  956 |       92.34
> jit multiplier=2.0 scan_whole=120s delay=20ms |  21 |  967 |       99.94
>
> jit multiplier=1.5 scan_whole=120s delay=10ms |  22 |  944 |       97.91
> jit multiplier=2.0 scan_whole=120s delay=10ms |  23 |  981 |        99.7
> It seems I have to push the multiplier higher to get good results when using 
> a much lower interval

Since I'm not exactly overwhelmed processing field reports, I've continued 
this line of investigation myself...increasing the multiplier to 3.0 got 
me another nine on the buffers written by the LRU BGW without a 
significant change in performance:
                     info                      | set | tps  | cleaner_pct
-----------------------------------------------+-----+------+-------------
jit multiplier=3.0 scan_whole=120s delay=10ms  |  24 |  967 | 99.95

After thinking for a bit about why the 10ms case wasn't working so well 
without a big multiplier, I realized that the default moving average 
smoothing makes the sample period cover such a short span of 
time (10ms * 16 = 160ms) that it's unlikely to include a typical pause that 
one might want to smooth over.  My initial thinking was to increase the 
period of the smoothing so that it's of similar length to the default case 
even when the interval goes down, but that didn't really improve anything 
(note that the 16 case here is the default setup with just the delay at 
10ms, which was a missing piece of data from the above as well--I only 
tested with larger multipliers above at 10ms):
                     info                      | set | tps  | cleaner_pct
-----------------------------------------------+-----+------+-------------
 jit multiplier=1.0 delay=10ms smoothing=16    |  27 |  982 |        89.4
 jit multiplier=1.0 delay=10ms smoothing=64    |  26 |  946 |       89.55
 jit multiplier=1.0 delay=10ms smoothing=320   |  25 |  970 |       89.53

What I realized is that after rounding the number of buffers to an 
integer, dividing a very short period of activity by the smoothing 
constant was resulting in the smoothing value usually dropping to 0 and 
not doing much.  This made me wonder how much the weighted average 
smoothing was really doing in the default case.  I put that code in months 
ago and I hadn't looked recently at its effectiveness.  Here's a 
comparison:
                     info                      | set | tps  | cleaner_pct
-----------------------------------------------+-----+------+-------------
 jit multiplier=1.0 delay=200ms smoothing=16   |  18 |  970 |       99.99
 jit multiplier=1.0 delay=200ms smoothing=off  |  28 |  957 |       97.16

All this data supports my suggestion that the exact value of the smoothing 
period constant isn't really a critical one.  It appears moderately 
helpful to have that logic on in some cases and the default value doesn't 
seem to hurt the cases where I'd expect it to be the least effective. 
Tuning the multiplier is much more powerful and useful than ever touching 
this constant.  I could probably even pull the smoothing logic out 
altogether, at the cost of increasing the burden of correctly tuning the 
multiplier on the administrator.  So far it looks like it's reasonable 
instead to leave it as an untunable to help the default configuration, and 
I'll just add a documentation note that if you decrease the interval 
you'll probably have to increase the multiplier.
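
To make the rounding issue concrete, here's a toy standalone model of the 
smoothing behavior.  This is not the patch code itself and the allocation 
counts are made up; it only exists to show why truncating the correction 
term to an integer matters when the per-cycle counts are small:

#include <stdio.h>

/*
 * Toy model only, not the patch source.  With a 10ms delay the number of
 * buffers allocated per cycle is small, so the integer form of
 * (recent - smoothed) / smoothing_samples usually truncates to zero and
 * the smoothed value barely reacts; a float accumulator moves every cycle.
 * The recent_allocs numbers below are invented for illustration.
 */
int
main(void)
{
    int     recent_allocs[] = {3, 5, 2, 40, 4, 3, 6, 2, 5, 3};
    int     cycles = sizeof(recent_allocs) / sizeof(recent_allocs[0]);
    int     smoothing_samples = 16;
    int     smoothed_int = 0;
    float   smoothed_float = 0.0f;

    for (int i = 0; i < cycles; i++)
    {
        int     recent = recent_allocs[i];

        /* integer version: the correction term truncates to 0 most cycles */
        smoothed_int += (recent - smoothed_int) / smoothing_samples;

        /* float version: every cycle nudges the average toward recent */
        smoothed_float += ((float) recent - smoothed_float) / smoothing_samples;

        printf("cycle %2d: recent=%2d  smoothed_int=%d  smoothed_float=%.2f\n",
               i, recent, smoothed_int, smoothed_float);
    }
    return 0;
}

At the default 200ms delay the per-cycle counts are an order of magnitude 
larger, so the truncation matters much less there.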

After going through this, the extra data gives more useful baselines to do 
a similar sensitivity analysis of the other item that's untunable in the 
current patch:
    float       scan_whole_pool_seconds = 120.0;

But I'll be travelling for the next week and won't have time to look into 
that myself until I get back.
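
For anyone who wants to poke at that before I get back: the intent of the 
constant (as the name suggests) is to put a floor under the cleaner's scan 
rate so the whole buffer pool still gets looked at within 
scan_whole_pool_seconds even when few buffers are being allocated.  Here's 
a back-of-the-envelope sketch of how that floor works out; this is an 
illustration of the idea rather than the patch source, and the 
shared_buffers and delay values are just examples:

#include <stdio.h>

/*
 * Illustration only, not the patch source.  NBuffers and the delay are
 * example values; the point is just how the constant turns into a
 * minimum number of buffers scanned per cycle.
 */
int
main(void)
{
    int     NBuffers = 32768;               /* e.g. shared_buffers = 256MB */
    int     bgwriter_delay_ms = 200;
    float   scan_whole_pool_seconds = 120.0f;

    /* how many cleaner cycles fit into the scan-the-whole-pool window */
    float   cycles = scan_whole_pool_seconds * 1000.0f / bgwriter_delay_ms;

    /* floor on buffers examined per cycle */
    int     min_scan_per_cycle = (int) (NBuffers / cycles);

    printf("look at >= %d buffers every %dms cycle\n",
           min_scan_per_cycle, bgwriter_delay_ms);
    return 0;
}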

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Just-in-time Background Writer Patch+Test Results

From
Greg Smith
Date:
It was suggested to me today that I should clarify how others should be 
able to test this patch themselves by writing a sort of performance 
reviewer's guide; that information has been scattered among material 
covering development.  That's what you'll find below.  Let me know if any 
of it seems confusing and I'll try to clarify.  I'll be checking my mail 
and responding intermittently while I'm away, just won't be able to run any 
tests myself until next week.

The latest version of the background writer code that I've been reporting 
on is attached to the first message in this thread:

http://archives.postgresql.org/pgsql-hackers/2007-09/msg00214.php

I haven't found any reason so far to update that code; the existing 
exposed tunables still appear sufficient for all the situations I've 
encountered.

Track Buffer Allocations and Cleaner Efficiency
-----------------------------------------------

First you apply the patch inside buf-alloc-2.patch.gz , which adds several 
entries to pg_stat_bgwriter; it applied cleanly to HEAD at the point when 
I generated it.  I'd suggest testing that one to collect baseline 
information with the current background writer, and to confirm that the 
overhead of tracking the buffer allocations by itself doesn't cause a 
performance hit, before applying the second patch.  I keep two clusters 
going on the same port, one with just buf-alloc-2, one with both patches, 
to be able to make such comparisons, only having one active at a time. 
You'll need to run initdb to create a database with the new stats in it 
after applying the patch.

What I've been doing to test the effectiveness of any LRU background 
writer method using this patch is to take a before/after snapshot of 
pg_stat_bgwriter.  Then I compute the delta during the test run in order 
to figure what percentage of buffers were written by the background writer 
vs. the client backends; that's the number I'm reporting as cleaner_pct in 
my tests.  Here is an example of how to compute that against all 
transactions in pg_stat_bgwriter:

select round(buffers_clean * 10000 / (buffers_backend + buffers_clean)) / 
100 as cleaner_pct from pg_stat_bgwriter;

You should also monitor maxwritten_clean to make sure you've set 
bgwriter_lru_maxpages high enough that it's not limiting writes.  You can 
always turn the background writer off by setting maxpages to 0 (it's the 
only way to do so after applying the below patch).

For reference, the exact code I'm using to save the deltas and compute 
everything is available within pgbench-tools-0.2 at 
http://www.westnet.com/~gsmith/content/postgresql/pgbench-tools.htm

The code inside the benchwarmer script uses a table called test_bgwriter 
(schema in init/resultdb.sql), populates it before the test, then computes 
the delta afterwards.  bufsummary.sql generates the results I've been 
putting in my messages.  I assume there's a cleaner way to compute just 
these numbers by resetting the statistics before the test instead, but 
that didn't fit into what I was working towards.

New Background Writer Logic
---------------------------

The second patch in jit-cleaner.patch.gz applies on top of buf-alloc-2. 
It modifies the LRU background writer with the just-in-time logic as I 
described in the message the patches were attached to.  The main tunable 
there is bgwriter_lru_multiplier, which replaces bgwriter_lru_percent. 
The effective range seems to be 1.0 to 3.0.  You can take an existing 8.3 
postgresql.conf, rename bgwriter_lru_percent to bgwriter_lru_multiplier, 
adjust the value to be in the right range, and then it will work with this 
patched version.

For comparing the patched vs. original BGW behavior, I've taken to keeping 
definitions for both variables in a common postgresql.conf, and then I 
just comment/uncomment the one I need based on which version I'm running:

bgwriter_lru_multiplier = 1.0
#bgwriter_lru_percent = 5

The main thing I've noticed so far is that as you decrease bgwriter_delay 
from the default of 200ms, the multiplier has needed to be larger to 
maintain the same cleaner percentage in my tests.
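
To clarify what the multiplier is doing for reviewers who haven't read 
the algorithm description in the original message, here's a rough sketch 
of the per-cycle decision.  This is a simplification for illustration, not 
the actual patch source, and the names are approximate:

#include <stdio.h>

/*
 * Rough sketch only, not the patch source.  smoothed_alloc is the
 * (smoothed) number of buffers client backends allocated during the last
 * bgwriter_delay cycle; reusable_ahead is how many already-clean buffers
 * sit ahead of the clock sweep point.
 */
static int
buffers_to_clean(float smoothed_alloc, double lru_multiplier,
                 int reusable_ahead, int lru_maxpages)
{
    int     upcoming_alloc_est = (int) (smoothed_alloc * lru_multiplier);
    int     to_clean = upcoming_alloc_est - reusable_ahead;

    if (to_clean < 0)
        to_clean = 0;
    if (to_clean > lru_maxpages)
        to_clean = lru_maxpages;        /* maxpages still acts as a throttle */
    return to_clean;
}

int
main(void)
{
    /* e.g. ~50 allocations per 200ms cycle, multiplier 2.0, 30 clean
     * buffers already ahead of the sweep, bgwriter_lru_maxpages = 500 */
    printf("clean %d buffers this cycle\n",
           buffers_to_clean(50.0f, 2.0, 30, 500));
    return 0;
}

Because smoothed_alloc is a per-cycle count, dropping bgwriter_delay from 
200ms to 10ms makes each cycle's count roughly 20x smaller and noisier, 
which is presumably why a larger multiplier is needed to keep a comparable 
margin ahead of the backends.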

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Just-in-time Background Writer Patch+Test Results

From
Tom Lane
Date:
Greg Smith <gsmith@gregsmith.com> writes:
> Tom gets credit for naming the attached patch, which is my latest attempt to 
> finalize what has been called the "Automatic adjustment of 
> bgwriter_lru_maxpages" patch for 8.3; that's not what it does anymore but 
> that's where it started.

I've applied this patch with some revisions.

> -The way I'm getting the passes number back from the freelist.c
> strategy code seems like it will eventually overflow

Yup ... I rewrote that.  I also revised the collection of backend-write
count events, which didn't seem to me to be something the freelist.c
code should have anything to do with.  It turns out that we can count
them with essentially no overhead by attaching the counter to
the existing fsync-request reporting machinery.

> -Heikki didn't like the way I pass information back from SyncOneBuffer
> back to the background writer.

I didn't either --- it was too complicated and not actually doing
anything useful.  I simplified it down to the two bits that were being
used.  We can always add more as needed, but since this routine isn't
even exported, I see no need to make it do more than the known callers
need it to do.

I did some marginal tweaking to the way you were doing the moving
averages --- in particular, use a float to avoid strange roundoff
behavior and force the smoothed_alloc average up when a new peak
occurs, instead of only letting it affect the behavior for one
cycle.
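
In other words, the update now behaves roughly like this (a simplified 
sketch, not the committed code verbatim):

/*
 * Keep the average as a float, and when the latest per-cycle allocation
 * count exceeds the average, jump the average straight up to it rather
 * than letting the peak bleed in over many cycles.
 */
float
update_smoothed_alloc(float smoothed_alloc, int recent_alloc,
                      int smoothing_samples)
{
    if (smoothed_alloc <= (float) recent_alloc)
        smoothed_alloc = (float) recent_alloc;          /* new peak: snap up */
    else
        smoothed_alloc += ((float) recent_alloc - smoothed_alloc) /
            smoothing_samples;                          /* otherwise decay */
    return smoothed_alloc;
}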

Also, I set the default value of bgwriter_lru_multiplier to 2.0,
as 1.0 seemed to be leaving too many writes to the backends in my
testing.  That's something we can play with during beta when we'll
have more testing resources available.

I did some other cleanup in BgBufferSync too, like trying to reduce
the chattiness of the debug output, but I don't believe I made any
fundamental change in your algorithm.

Nice work --- thanks for seeing it through!
        regards, tom lane


Re: Just-in-time Background Writer Patch+Test Results

From
Greg Smith
Date:
On Tue, 25 Sep 2007, Tom Lane wrote:

>> -Heikki didn't like the way I pass information back from SyncOneBuffer
>> back to the background writer.
> I didn't either --- it was too complicated and not actually doing
> anything useful.

I suspect someone (possibly me) may want to put back some of that same 
additional complication in the future, but I'm fine with it not being 
there yet.  The main thing I wanted accomplished was changing the return 
to a bitmask of some sort and that's there now; adding more data to that 
interface later is at least easier now.
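
For anyone else reading along, the change means the routine now reports 
what happened via independent bits rather than a single state code, 
something along these lines.  This is a toy illustration with a stand-in 
stub, and the flag names are illustrative rather than necessarily the ones 
in the committed code:

#include <stdio.h>

/*
 * Toy illustration only.  The point is that callers test bits
 * independently, so more information can be added to the return value
 * later without breaking existing callers.
 */
#define BUF_WRITTEN     0x01    /* this call wrote the buffer out */
#define BUF_REUSABLE    0x02    /* buffer can be reused by the clock sweep */

static int
sync_one_buffer_stub(int buf_id)
{
    int     flags = BUF_REUSABLE;

    /* pretend the even-numbered buffers were dirty and got written */
    if (buf_id % 2 == 0)
        flags |= BUF_WRITTEN;
    return flags;
}

int
main(void)
{
    int     written = 0, reusable = 0;

    for (int buf_id = 0; buf_id < 10; buf_id++)
    {
        int     flags = sync_one_buffer_stub(buf_id);

        if (flags & BUF_WRITTEN)
            written++;
        if (flags & BUF_REUSABLE)
            reusable++;
    }
    printf("written=%d reusable=%d\n", written, reusable);
    return 0;
}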

> Also, I set the default value of bgwriter_lru_multiplier to 2.0,
> as 1.0 seemed to be leaving too many writes to the backends in my
> testing.

The data I've collected since originally submitting the patch agrees that 
2.0 is probably a better default as well.

I should have time to take an initial stab this week at updating the 
documentation to reflect what's now been committed, and to see how this 
stacks on top of HOT running pgbench on my test system.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD