Re: Bgwriter strategies - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Bgwriter strategies
Date
Msg-id 468E1190.8050902@enterprisedb.com
In response to Re: Bgwriter strategies  (Greg Smith <gsmith@gregsmith.com>)
Responses Re: Bgwriter strategies  (Greg Smith <gsmith@gregsmith.com>)
Re: Bgwriter strategies  ("Simon Riggs" <simon@2ndquadrant.com>)
List pgsql-hackers
Greg Smith wrote:
> On Thu, 5 Jul 2007, Heikki Linnakangas wrote:
> 
>> It looks like Tom's idea is not a winner; it leads to more writes than 
>> necessary.
> 
> What I came away with as the core of Tom's idea is that the cleaning/LRU 
> writer shouldn't ever scan the same section of the buffer cache twice, 
> because anything that resulted in a new dirty buffer will be unwritable 
> by it until the clock sweep passes over it.  I never took that to mean 
> that idea necessarily had to be implemented as "trying to aggressively 
> keep all pages with usage_count=0 clean".
> 
> I've been making slow progress on this myself, and the question I've 
> been trying to answer is whether this fundamental idea really matters or 
> not. One clear benefit that alternate implementation should allow is 
> setting a lower value for the interval without being as concerned that 
> you're wasting resources by doing so, which I've found to be a problem 
> with the current implementation--it will consume a lot of CPU scanning 
> the same section right now if you lower that too much.

Yes; ignoring the CPU overhead of scanning the same section over and 
over again, Tom's proposal is the same as setting both bgwriter_lru_* 
settings all the way up to the max. In fact I ran a DBT-2 test like 
that as well, and the # of writes was indeed the same, just with much 
higher CPU usage. It's clear that scanning the same section over and 
over again has been a waste of time in previous releases.

As a further data point, I constructed a smaller test case that performs 
random DELETEs on a table using an index. I varied the # of 
shared_buffers, and ran the test with the bgwriter either disabled or 
tuned all the way up to the maximum. Here are the results:
 shared_buffers | writes | writes |   writes_ratio
----------------+--------+--------+-------------------
 2560           |  86936 |  88023 |  1.01250345081439
 5120           |  81207 |  84551 |  1.04117871612053
 7680           |  75367 |  80603 |  1.06947337694216
 10240          |  69772 |  74533 |  1.06823654187926
 12800          |  64281 |  69237 |  1.07709898725907
 15360          |  58515 |  64735 |  1.10629753054772
 17920          |  53231 |  58635 |  1.10151979109917
 20480          |  48128 |  54403 |  1.13038148271277
 23040          |  43087 |  49949 |  1.15925917330053
 25600          |  39062 |  46477 |  1.1898264297783
 28160          |  35391 |  43739 |  1.23587917832217
 30720          |  32713 |  37480 |  1.14572188426619
 33280          |  31634 |  31677 |  1.00135929695897
 35840          |  31668 |  31717 |  1.00154730327144
 38400          |  31696 |  31693 | 0.999905350832913
 40960          |  31685 |  31730 |  1.00142023039293
 43520          |  31694 |  31650 | 0.998611724616647
 46080          |  31661 |  31650 | 0.999652569407157

The first writes column is the # of writes with the bgwriter disabled; 
the second is with the aggressive bgwriter. The table size is 33334 
pages, so once shared_buffers exceeds that, the table fits in cache and 
the bgwriter strategy makes no difference.


> As far as your results, first off I'm really glad to see someone else 
> comparing checkpoint/backend/bgwriter writes the same way I've been 
> doing, so I finally have someone else's results to compare against.  I 
> expect that 
> the optimal approach here is a hybrid one that structures scanning the 
> buffer cache the new way Tom suggests, but limits the number of writes 
> to "just enough".  I happen to be fond of the "just enough" computation 
> based on a weighted moving average I wrote before, but there's certainly 
> room for multiple implementations of that part of the code to evolve.

We need to get the requirements straight.

One goal of the bgwriter is clearly to keep just enough buffers clean 
in front of the clock hand that backends don't need to do writes 
themselves until the next bgwriter iteration. But no more than that; 
otherwise we might end up doing more writes than necessary if some of 
the buffers are redirtied.
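
In pseudo-C, the "just enough" part might look like this (a sketch 
under assumed names, not a worked-out patch): estimate the upcoming 
demand from what we just saw, and only top the clean-ahead distance up 
to that estimate:

/*
 * recent_allocs: buffer allocations since the last bgwriter round
 * clean_ahead:   clean, usage_count=0 buffers already in front of
 *                the clock hand
 * Returns how many buffers this round should try to write.
 */
static int
writes_needed(int recent_allocs, int clean_ahead)
{
    int target = recent_allocs;    /* expect a similar next round */

    return (target > clean_ahead) ? target - clean_ahead : 0;
}

Anything written beyond that target is a wasted write whenever the 
buffer gets redirtied before the clock hand reaches it.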

To deal with bursty workloads, for example a batch of 2 GB worth of 
inserts coming in every 10 minutes, it seems we want to keep doing a 
little bit of cleaning even when the system is idle, to prepare for the 
next burst. The idea is to smooth out the physical I/O bursts: if we 
don't clean the dirty buffers left over from the previous burst during 
the idle period, the I/O system will be bottlenecked during the bursts 
and sit idle otherwise.

To strike a balance between cleaning buffers ahead of possible bursts 
in the future and not doing unnecessary I/O when no such bursts come, I 
think a reasonable strategy is to write buffers with usage_count=0 at a 
slow pace when there are no buffer allocations happening.
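
A sketch of that, with an arbitrary trickle quota just for 
illustration:

#define IDLE_TRICKLE_WRITES 5   /* made-up constant, slow idle pace */

static int
round_quota(int recent_allocs, int demand_quota)
{
    if (recent_allocs == 0)
        return IDLE_TRICKLE_WRITES;  /* trickle while idle */
    return demand_quota;             /* demand-driven otherwise */
}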

To smooth out the small variations in a relatively steady workload, the 
weighted moving average sounds good.
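
Something along these lines, say (the 0.8/0.2 weights are just an 
example and would need tuning):

static double smoothed_allocs = 0.0;

static double
update_alloc_estimate(int recent_allocs)
{
    /* exponentially weighted moving average of per-round allocations */
    smoothed_allocs = 0.8 * smoothed_allocs + 0.2 * (double) recent_allocs;
    return smoothed_allocs;
}

That keeps one quiet or busy round from swinging the cleaning target 
wildly.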




-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

