Just-in-time Background Writer Patch+Test Results - Mailing list pgsql-hackers

From Greg Smith
Subject Just-in-time Background Writer Patch+Test Results
Date
Msg-id Pine.GSO.4.64.0709052324020.25284@westnet.com
Tom gets credit for naming the attached patch, which is my latest attempt to 
finalize what has been called the "Automatic adjustment of 
bgwriter_lru_maxpages" patch for 8.3; that's not what it does anymore but 
that's where it started.

Background on testing
---------------------

I decided to use pgbench for running my tests.  The scripting framework to 
collect all that data and usefully summarize it is now available as 
pgbench-tools-0.2 at 
http://www.westnet.com/~gsmith/content/postgresql/pgbench-tools.htm

I hope to expand and actually document use of pgbench-tools in the future but 
didn't want to hold the rest of this up on that work.  That page includes basic 
information about what my testing environment was and why I felt this was an 
appropriate way to test background writer efficiency.

Quite a bit of raw data for all of the test sets summarized here is at 
http://www.westnet.com/~gsmith/content/bgwriter/

The patches attached to this message are also available at: 
http://www.westnet.com/~gsmith/content/postgresql/buf-alloc-2.patch 
http://www.westnet.com/~gsmith/content/postgresql/jit-cleaner.patch
(This is my second attempt to send this message; I don't know why the 
earlier one failed.  I'm using gzip'd patches for this one, and hopefully 
there won't be a dupe.)

Baseline test results
---------------------

The first patch to apply, attached to this message, is the latest buf-alloc-2, 
which adds counters to pgstat_bgwriter for everything the background writer is 
doing. Here's what we get out of the standard 8.3 background writer before and 
after applying that patch, at various settings:
                info                | set | tps  | cleaner_pct
------------------------------------+-----+------+-------------
 HEAD nobgwriter                    |   5 |  994 |
 HEAD+buf-alloc-2 nobgwriter        |   6 | 1012 |           0
 HEAD+buf-alloc-2 LRU=0.5%/500      |  16 |  974 |       15.94
 HEAD+buf-alloc-2 LRU=5%/500        |  19 |  983 |       98.47
 HEAD+buf-alloc-2 LRU=10%/500       |   7 |  997 |       99.95

cleaner_pct is the percentage of writes done by the BGW LRU cleaner relative 
to a total that also includes the client backend writes; writes done by 
checkpoints are not included in this summary computation, so it just shows the 
balance of backend vs. BGW writes.
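
Concretely, using illustrative names for the counters buf-alloc-2 adds (the 
exact pg_stat_bgwriter column names may differ from these):

    cleaner_pct = 100 * writes_by_cleaner / (writes_by_cleaner + writes_by_backends)

So the 98.47 for the LRU=5%/500 run above means the backends did under 2% of 
the non-checkpoint writes themselves.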

The /500 means bgwriter_lru_maxpages=500, which I already knew was about as 
many pages as this server ever dirties in a 200ms cycle.  Without the 
buf-alloc-2 patch I don't get statistics on the LRU cleaner; I include that 
number as a baseline just to suggest that the buf-alloc-2 patch itself isn't 
pulling down results.

Here we see that in order to get most of the writes to happen via the LRU 
cleaner rather than having the backends handle them, you'd need to play with 
the settings until the bgwriter_lru_percent was somewhere between 5% and 10%, 
and it seems obvious that doing this doesn't improve the TPS results.  The 
margin of error here is big enough that I consider all these basically the same 
performance.  The question then is how to get this high level of writes by the 
background writer automatically, without having to know what percentage to 
scan; I wanted to remove bgwriter_lru_percent, while still keeping 
bgwriter_lru_maxpages strictly as a way to throttle overall BGW activity.

First JIT Implementation
------------------------

The method I described in my last message on this topic ( 
http://archives.postgresql.org/pgsql-hackers/2007-08/msg00887.php ) implemented 
a weighted moving average of how many pages were allocated, and based on 
feedback from that I improved the code to allow a multiplier factor on top of 
that.  Here's the summary of those results:
                info                | set | tps  | cleaner_pct
------------------------------------+-----+------+-------------
 jit cleaner multiplier=1.0/500     |   9 |  981 |        94.3
 jit cleaner multiplier=2.0/500     |   8 | 1005 |       99.78
 jit multiplier=1.0/100             |  10 |  985 |       68.14

That's pretty good.  As long as maxpages is set intelligently, it gets most of 
the writes even with the multiplier of 1.0, and cranking it up to the 2.0 
suggested by the original Itagaki Takahiro patch gets nearly all of them. 
Again, there's really no performance change here in throughput by any of this.
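
To make the mechanism concrete, here is a minimal sketch of the moving-average 
logic, not the patch code itself; the function name and SMOOTHING_SAMPLES 
constant are illustrative, and the real version lives in BgBufferSync in 
bufmgr.c:

    #define SMOOTHING_SAMPLES 16

    static float smoothed_alloc = 0;    /* moving average of allocations per cycle */

    /* Estimate how many buffers the LRU cleaner should try to write
     * this cycle, given how many were allocated since the last one. */
    static int
    jit_cleaner_target(int recent_alloc, float multiplier, int maxpages)
    {
        int     upcoming_est;

        /* weighted moving average: the newest sample gets 1/16 of the weight */
        smoothed_alloc += ((float) recent_alloc - smoothed_alloc) / SMOOTHING_SAMPLES;

        /* pad the estimate by the multiplier to stay ahead of demand */
        upcoming_est = (int) (smoothed_alloc * multiplier);

        /* bgwriter_lru_maxpages still throttles overall BGW activity */
        return (upcoming_est > maxpages) ? maxpages : upcoming_est;
    }

Raising the multiplier just scales how much slack the cleaner keeps between 
its write rate and recent demand.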

Coping with idle periods
------------------------

While I was basically happy with these results, the data Kevin Grittner 
submitted in response to my last call for commentary left me concerned. While 
the JIT approach works fine as long as your system is active, it does 
absolutely nothing if the system is idle.  I noticed that a lot of the writes 
that were being done by the client backends were after idle periods where the 
JIT writer just didn't react fast enough during the ramp-up.  For example, if 
the system went from idle for a while to full-speed just as the 200ms sleep 
started, by the time the BGW woke up again the backends could already have 
needed to write many buffers themselves.

Ideally, idle periods should be used to slowly trickle dirty pages out, so that 
there are fewer of them hanging around when a checkpoint shows up, and so that 
reusable pages are already available. The question then is how fast to go about 
that trickle.  Heikki's background writer tests and my own suggest that if you
make the rate during quiet periods too high, you'll clog the underlying buffers 
with some writes that end up being duplicated and lower overall efficiency. 
But all of those tests had the background writer going at a constant and 
relatively high speed.

I wanted to keep the ability to scan the entire buffer cache, using the latest 
idea of never looking at the same buffer twice, but to do that slowly when idle 
and using the JIT rate otherwise.  This is sort of a hybrid of the old LRU 
cleaner behavior (scan a fixed %) at a low speed with the new approach (scan 
based on allocations, however many of them there are).  I started with the old 
default of 0.5% used by bgwriter_lru_percent (a tunable already removed by the 
patch at this point) with logic to tack that onto the JIT intelligently and got 
these results:
                info                | set | tps  | cleaner_pct
------------------------------------+-----+------+-------------
 jit multiplier=1.0 min scan=0.5%   |  13 |  882 |         100
 jit multiplier=1.5 min scan=0.5%   |  12 |  871 |         100
 jit multiplier=2.0 min scan=0.5%   |  11 |  910 |         100
 jit multiplier=1.0 min scan=0.25%  |  14 |  982 |       98.34

It's nice to see fully 100% of the buffers written by the cleaner with the 
hybrid approach; I feel that validates my idea that just a bit more work needs 
to be done during idle periods to completely fix the issue with it not reacting 
fast enough during the idle/full speed transition.  But look at the drop in 
TPS.  While I'm willing to say a couple of percent change isn't significant in 
a pgbench result, those <900 results are clearly bad. This is crossing that 
line where inefficient writes are being done.  I'm happier with the result 
using the smaller min scan=0.25% even though it doesn't quite get every write 
that way.
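
Expressed in terms of the earlier sketch, the hybrid reduces to a floor 
applied to the JIT estimate; min_scan_per_round here is a hypothetical name 
for the per-round equivalent of the minimum scan percentage:

    /* Hybrid: scan at least the idle-trickle floor's worth of buffers
     * each round, and more whenever the JIT estimate asks for it. */
    scan_target = Max(jit_cleaner_target(recent_alloc, multiplier, maxpages),
                      min_scan_per_round);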

Making percentage independant of delay
--------------------------------------

But a new problem here is that if you lower bgwriter_delay, the minimum scan 
percentage needs to drop too, and my goal was to reduce the number of tunables 
people need to tinker with.  Assuming you're not stopped by the maxpages 
parameter, with the default delay=200ms a scan that hits 0.5% each time will 
scan 5*0.5%=2.5% of the buffer cache per second, which means it will take 40 
seconds to scan the entire pool.  Using 0.25% means 80 seconds between scans. I 
improved the overall algorithm a bit and decided to set this parameter an 
alternate way:  by how long it should take to creep its way through the entire 
buffer cache if the JIT code is idle.  I decided I liked 120 seconds as value 
for that parameter, which is a slower rate than any of the above but still a
reasonable one for a typical application.  Here's what the results look like 
using that approach:
                info                | set | tps  | cleaner_pct
------------------------------------+-----+------+-------------
 jit multiplier=1.0 scan_whole=120s |  18 |  970 |       99.99
 jit multiplier=1.5 scan_whole=120s |  15 |  995 |       99.93
 jit multiplier=2.0 scan_whole=120s |  17 |  981 |       99.98

Now here are results I'm happy with.  The TPS results are almost unchanged from 
where we started, with minimal inefficient writes, but almost all the 
writes are being done by the cleaner process.  The results appear much less 
sensitive to what you set the multiplier to.  And unless you use an unreasonably 
low value for maxpages (which will quickly become obvious if you monitor 
pg_stat_bgwriter and look for maxwritten_clean increasing fast), you'll get a 
complete scan of the buffer cache within 2 minutes even if there's no system 
activity.  But once that's done, until more buffers are allocated the code 
won't even look at the buffer cache again (as opposed to the current code, 
which is always looking at buffers and acquiring locks even if nothing is going 
on).
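
The arithmetic behind that 120 second figure is simple; here is a sketch of 
how scan_whole_pool_seconds turns into the min_scan_per_round floor from the 
earlier sketch, under assumed defaults (bgwriter_delay=200ms) and a 
hypothetical 128MB of shared_buffers (NBuffers=16384); the variable names are 
illustrative:

    /* 120s / 0.2s per round = 600 cleaner rounds per full sweep */
    int     rounds_per_sweep = (int) (scan_whole_pool_seconds * 1000 / bgwriter_delay);

    /* round up so a sweep always finishes within the target time;
     * with these numbers, 16384 / 600 rounds up to 28 buffers/round */
    int     min_scan_per_round = (NBuffers + rounds_per_sweep - 1) / rounds_per_sweep;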

I think I can safely say there is a level of intelligence going into what the 
LRU background writer does with this patch that has never been applied to this 
problem before.  There have been a lot of good ideas thrown out in this area, 
but it took a hybrid approach that included and carefully balanced all of them 
to actually get results that I felt were usable. What I don't know is whether 
that will also be true for other testers.

Patch review
------------

The attached jit-cleaner.patch implements this approach, and if you just want 
to look at the main code involved without having to apply the patch you can 
browse the BgBufferSync function in bufmgr.c starting around line 1120 at 
http://www.westnet.com/~gsmith/content/postgresql/bufmgr.c

There is lots of debugging of internals dumped into the logs if you toggle on 
#define BGW_DEBUG.  The gross summary of the two most important things that 
show what the code is doing is logged at DEBUG1 (but should probably be pushed 
lower before committing).

This code is as good as you're going to get from me before the 8.3 close. I 
could do some small rewriting and certainly can document all this further as 
part of getting this patch moved toward commit, but I'm out of resources to 
do too much more here.  Along with the big question of whether this whole idea 
is worth following at all as part of 8.3, here are the remaining small 
questions I feel review feedback would be valuable on related to my specific 
code:

-The way I'm getting the passes number back from the freelist.c strategy code 
seems like it will eventually overflow the long I'm using for the intermediate 
results when I execute statements like this:

strategy_position=(long)strategy_passes * NBuffers + strategy_buf_id;

I'm not sure if the code would be better if I were to use a 64-bit integer for 
strategy_position instead, or if I should just rewrite the code to separate out 
the passes multiplication--which will make it less elegant to read but should 
make overflow issues go away.
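
As a sketch of the first option, the intermediate math can simply be widened 
to 64 bits (uint64 being the usual PostgreSQL typedef):

    /* strategy_passes * NBuffers can exceed 2^31 over a long uptime with
     * a big pool; widening before the multiply avoids the wraparound */
    uint64  strategy_position = (uint64) strategy_passes * NBuffers + strategy_buf_id;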

-Heikki didn't like the way I pass information from SyncOneBuffer back to 
the background writer.  The bitmask approach I'm using adds flexibility for 
writing more intelligent background writers in the future. In the past I have 
written more complicated ones than any of the approaches mentioned here, using 
things like the usage_count information returned, but the simpler 
implementation here ignores that.  I could simplify this interface if I 
had to, but I like what I've done as a solid structure for future coding as 
it's written right now.
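
For anyone who hasn't opened the patch, the interface in question has this 
general shape (the flag names and values are illustrative, not necessarily 
what the patch defines):

    /* SyncOneBuffer returns an int bitmask rather than a boolean, so the
     * caller can count "wrote a dirty page" and "page is reusable"
     * separately, and future cleaners can add more detail */
    #define BUF_WRITTEN    0x01    /* buffer was dirty and was written out */
    #define BUF_REUSABLE   0x02    /* buffer was clean with usage_count 0 */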

-There are two magic constants in the code:
    int         smoothing_samples = 16;
    float       scan_whole_pool_seconds = 120.0;

I believe I've done enough testing recently and in the past to say these are 
reasonable numbers for most installations, and high-throughput systems are 
going to care more about tuning the multiplier GUC than either of these.  In 
the interest of having fewer knobs people can fool with and break, I personally 
don't feel like these constants need to be exposed for tuning purposes; they 
don't have a significant impact on how the underlying model works.  Determining 
whether these should be exposed as GUC tunables is certainly an open question 
though.

-I bumped the default for bgwriter_lru_maxpages to 100 so that typical low-end 
systems should get an automatically tuning LRU background writer out of the box 
in 8.3.  This is a big change from the 5 that was used in the older releases. 
If you keep everything at the defaults this represents a maximum theoretical 
write rate for the BGW of 4MB/s, which isn't very much relative to modern 
hardware.
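
That figure follows directly from the defaults (8KB pages, 200ms delay):

    100 pages/round * 8KB/page * 5 rounds/second = 4000KB/s, or roughly 4MB/s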

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
