Thread: Final background writer cleanup for 8.3

Final background writer cleanup for 8.3

From
Greg Smith
Date:
In the interest of closing work on what's officially titled the "Automatic 
adjustment of bgwriter_lru_maxpages" patch, I wanted to summarize where I 
think this is at, what I'm working on right now, and see if feedback from 
that changes how I submit my final attempt for a useful patch in this area 
this week.  Hopefully there are enough free eyes to stare at this now to 
wrap up a plan for what to do that makes sense and still fits in the 8.3 
schedule.  I'd hate to see this pushed off to 8.4 without making some 
forward progress here after the amount of work done already, particularly 
when odds aren't good I'll still be working with this code by then.

Let me start with a summary of the conclusions I've reached based on my 
own tests and the set that Heikki did last month (last results at 
http://community.enterprisedb.com/bgwriter/ ); Heikki will hopefully chime 
in if he disagrees with how I'm characterizing things:

1) In the current configuration, if you have a large setting for 
bgwriter_lru_percent and/or a small setting for bgwriter_delay, that can 
be extremely wasteful because the background writer will consume 
CPU/locking resources scanning the buffer pool needlessly.  This problem 
should go away.

2) Having backends write their own buffers out does not significantly 
degrade performance, as those turn into cached OS writes which generally 
execute fast enough to not be a large drag on the backend.

3) Any attempt to scan significantly ahead of the current strategy point 
will result in some amount of premature writes that decreases overall 
efficiency in cases where the buffer is touched again before it gets 
re-used.  The more in advance you go, the worse this inefficiency is. 
The most efficient way for many workloads is to just let the backends do 
all the writes.

4) Tom observed that there's no reason to ever scan the same section of 
the pool more than once, because anything that changes a buffer's status 
will always make it un-reusable until the strategy point has passed over 
it.  But because of (3), this does not mean that one should drive forward 
constantly trying to lap the buffer pool and catch up with the strategy 
point.

5) There hasn't been any definitive proof that the background writer is 
helpful at all in the context of 8.3.  However, yanking it out altogether 
may be premature, as there are some theorized ways that it may be helpful 
in real-world situations with more intermittent workloads than are 
generally encountered in a benchmarking situation.  I personally feel there 
is some potential for the BGW to become more useful in the context of the 
8.4 release if it starts doing things like adding pages it expects to be 
recycled soon onto the free list, which could improve backend efficiency 
quite a bit compared to the current situation where each backend is 
normally running their own scan.  But that's a bit too big to fit into 8.3 
I think.

What I'm aiming for here is to have the BGW do as little work as possible, 
as efficiently as possible, but not remove it altogether.  (2) suggests 
that this approach won't decrease performance compared to the current 8.2 
situation, where I've seen evidence some are over-tuning to have a very 
aggressive BGW scan an enormous amount of the pool each time because they 
have resources to burn.  Keeping a generally self-tuning background writer 
that errs on the lazy side in the codebase satisfies (5).  Here is 
what the patch I'm testing right now does to try and balance all this out:

A) Counters are added to pg_stat_bgwriter that show how many buffers were 
written by the backends, by the background writer, how many times 
bgwriter_lru_maxpages was hit, and the total number of buffers allocated. 
This at least allows monitoring what's going on as people run their own 
experiments.  Heikki's results included data using the earlier version of 
this patch I assembled (which now conflicts with HEAD, I have an 
updated one).

B) bgwriter_lru_percent is removed as a tunable.  This eliminates (1). 
The idea of scanning a fixed percentage doesn't ever make sense given the 
observations above; we scan until we accomplish the cleaning mission 
instead.

C) bgwriter_lru_maxpages stays as an absolute maximum number of pages that 
can be written in one sweep each bgwriter_delay.  This allows easily 
turning the writer off altogether by setting it to 0, or limiting how 
active it tries to be in situations where (3) is a concern. Admins can 
monitor the amount that the max is hit in pg_stat_bgwriter and consider 
raising it (or lowering the delay) if it proves to be too limiting. I 
think the default needs to be bumped to something more like 100 rather 
than the current tiny one before the stock configuration can be considered 
"self-tuning" at all.

D) The strategy code gets a "passes" count added to it that serves as a 
sort of high-order int for how many times the buffer cache has been looked 
over in its entirety.

E) When the background writer starts the LRU cleaner, it checks if the 
strategy point has passed where it last cleaned up to, using the 
passes+buf_id "pointer". If so, it just starts cleaning from the strategy 
point as it always has.  But if it's still ahead it just continues from 
there, thus implementing the core of (4)'s insight.  It estimates how many 
buffers are probably clean in the space between the strategy point and 
where it's starting at, based on how far ahead it is combined with 
historical data about how many buffers are scanned on average per reusable 
buffer found (the exact computation of this number is the main thing I'm 
still fiddling with).
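
To make (D) and (E) concrete, here is roughly the bookkeeping involved; this 
is just an illustrative sketch with made-up names, not code lifted from the 
patch:

#include <stdbool.h>
#include <stdint.h>

typedef struct
{
    uint32_t passes;    /* completed laps over the whole buffer pool */
    int      buf_id;    /* position within the current lap */
} SweepPosition;

/* true if "a" is at or past "b" in clock-sweep order */
static bool
sweep_pos_reached(SweepPosition a, SweepPosition b)
{
    if (a.passes != b.passes)
        return a.passes > b.passes;
    return a.buf_id >= b.buf_id;
}

/*
 * At the start of an LRU-cleaner round: if the strategy point has caught up
 * with where we stopped last time, restart from the strategy point as the
 * code always has; otherwise keep going from where we left off.
 */
static SweepPosition
choose_cleaner_start(SweepPosition strategy, SweepPosition last_cleaned_to)
{
    if (sweep_pos_reached(strategy, last_cleaned_to))
        return strategy;
    return last_cleaned_to;
}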

F) A moving average of buffer allocations is used to predict how many 
clean buffers are expected to be needed in the next delay cycle.  The 
original patch from Itagaki doubled the recent allocations to pad this 
out; (3) suggests that's too much.

G) Scan the buffer pool until one of the following happens:
--Enough reusable buffers have been located or written out to fill the 
upcoming allocation need, taking into account the estimate from (E); this 
is the normal expected way the scan will terminate.
--We've written bgwriter_lru_maxpages
--We "lap" and catch the strategy point
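
Stripped of all the locking and statistics bookkeeping, the control flow in 
(G) looks something like the following.  This is only a sketch using the 
SweepPosition struct from the earlier sketch; the helper functions are 
stand-ins invented for illustration, not functions from the actual patch:

extern bool buffer_is_reusable(int buf_id);     /* clean, unpinned, usage 0 */
extern bool buffer_is_cleanable(int buf_id);    /* dirty but writable now */
extern void write_buffer(int buf_id);
extern void advance_position(SweepPosition *pos);  /* bumps passes on wrap */

static void
lru_cleaner_round(SweepPosition *pos, SweepPosition strategy,
                  int upcoming_alloc_need, int clean_ahead_estimate,
                  int bgwriter_lru_maxpages)
{
    int reusable = clean_ahead_estimate;  /* credit from the estimate in (E) */
    int written = 0;

    while (reusable < upcoming_alloc_need)          /* normal exit: need met */
    {
        if (written >= bgwriter_lru_maxpages)
            break;                              /* hit the per-round write cap */
        if (pos->passes > strategy.passes && pos->buf_id >= strategy.buf_id)
            break;                              /* lapped the strategy point */

        if (buffer_is_reusable(pos->buf_id))
            reusable++;
        else if (buffer_is_cleanable(pos->buf_id))
        {
            write_buffer(pos->buf_id);
            written++;
            reusable++;
        }
        advance_position(pos);
    }
}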
 

In addition to removing a tunable and making the remaining two less 
critical, one of my hopes here is that the more efficient way this scheme 
operates will allow using much smaller values for bgwriter_delay than have 
been practical in the current codebase, which may ultimately have its own 
value.

That's what I've got working here now, still need some more tweaking and 
testing before I'm done with the code but there's not much left.  The main 
problem I foresee is that this approach is moderately complicated, adding a 
lot of new code and regular+static variables, for something that's not 
really proven to be valuable.  I will not be surprised if my patch is 
rejected on that basis.  That's why I wanted to get the big picture 
painted in this message while I finish up the work necessary to submit it, 
'cause if the whole idea is doomed anyway I might as well stop now.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Final background writer cleanup for 8.3

From
Tom Lane
Date:
Greg Smith <gsmith@gregsmith.com> writes:
> In the interest of closing work on what's officially titled the "Automatic 
> adjustment of bgwriter_lru_maxpages" patch, I wanted to summarize where I 
> think this is at ...

> 2) Having backends write their own buffers out does not significantly 
> degrade performance, as those turn into cached OS writes which generally 
> execute fast enough to not be a large drag on the backend.

[ itch... ]  That assumption scares the heck out of me.  It is doubtless
true in a lightly loaded system, but once the kernel is under any kind
of memory pressure I think it's completely wrong.  I think designing the
system around this assumption will lead to something that performs great
as long as you're not pushing it hard.

However, your actual specific proposals do not seem to rely on this
assumption extensively, so I wonder why you are emphasizing it.

The only parts of your specific proposals that I find a bit dubious are

> ... It estimates how many 
> buffers are probably clean in the space between the strategy point and 
> where it's starting at, based on how far ahead it is combined with 
> historical data about how many buffers are scanned on average per reusable 
> buffer found (the exact computation of this number is the main thing I'm 
> still fiddling with).

If you're still fiddling with it then you probably aren't going to get
it right in the next few days.  Perhaps you should think about whether
this can be left out entirely for 8.3 and revisited later.

> F) A moving average of buffer allocations is used to predict how many 
> clean buffers are expected to be needed in the next delay cycle.  The 
> original patch from Itagaki doubled the recent allocations to pad this 
> out; (3) suggests that's too much.

Maybe you need to put back the eliminated tuning parameter in the form
of the scaling factor to be used here.  I don't like 1.0, mainly because
I don't believe your assumption (2).  I'm willing to concede that 2.0
might be too much, but I don't know where in between is the sweet spot.

Also, we might need a tuning parameter for the reaction speed of the
moving average --- what are you using for that?
        regards, tom lane


Re: Final background writer cleanup for 8.3

From
Greg Smith
Date:
On Thu, 23 Aug 2007, Tom Lane wrote:

> It is doubtless true in a lightly loaded system, but once the kernel is 
> under any kind of memory pressure I think it's completely wrong.

The fact that so many tests I've done or seen get maximum throughput in 
terms of straight TPS with the background writer turned completely off is 
why I stated that so explicitly.  I understand what you're saying in terms 
of memory pressure, all I'm suggesting is that the empirical tests suggest 
the current background writer even with moderate improvements doesn't 
necessarily help when you get there.  If writes are blocking, whether the 
background writer does them slightly ahead of time or whether the backend 
does them itself doesn't seem to matter very much.  On a heavily loaded 
system, your throughput is bottlenecked at the disk either way--and 
therefore it's all the more important in those cases to never do a write 
until you absolutely have to, lest it be wasted.

> If you're still fiddling with it then you probably aren't going to get
> it right in the next few days.

The implementation is fine most of the time, I've just found some corner 
cases in testing I'd like to improve stability on (mainly how best to 
handle when no buffers were allocated during the previous period, some 
small concerns about the first pass over the pool).  What I'm thinking of 
doing is taking a couple of my assumptions/techniques and turning them 
into things that can be turned on or off with #DEFINE, that way the parts 
of the code that people don't like are easy to identify and pull out. 
I've already done that with one section.
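
For instance, something like this, where the symbol and names are invented 
just to show the mechanism rather than taken from the patch:

#define BGW_USE_CLEAN_AHEAD_ESTIMATE 1

#if BGW_USE_CLEAN_AHEAD_ESTIMATE
    reusable = estimate_clean_ahead(strategy_point, cleaner_position);
#else
    reusable = 0;   /* take no credit for buffers already scanned ahead */
#endif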

> Maybe you need to put back the eliminated tuning parameter in the form
> of the scaling factor to be used here.  I don't like 1.0, mainly because
> I don't believe your assumption (2).  I'm willing to concede that 2.0
> might be too much, but I don't know where in between is the sweet spot.

That would be easy to implement and add some flexibility, so I'll do that. 
bgwriter_lru_percent becomes bgwriter_lru_multiplier, possibly to be 
renamed later if someone comes up with a snappier name.

> Also, we might need a tuning parameter for the reaction speed of the
> moving average --- what are you using for that?

It's hard-coded at 16 samples.  Seemed stable around 10-20; picked 16 so 
maybe it will optimize usefully to a bit shift.  On the reaction side, 
it actually reacts faster than that--if the most recent allocation is 
greater than the average, it uses that instead.  The number of samples has 
more of an impact on the trailing side, and accordingly isn't that 
critical.
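
In other words, the smoothing amounts to something like this sketch; it's 
illustrative only, written in terms of the multiplier discussed above rather 
than being the exact code in the patch:

#define ALLOC_AVG_SAMPLES 16   /* power of two, so the divide could be a shift */

static double smoothed_alloc = 0;

/*
 * Called once per bgwriter_delay with the number of buffers allocated since
 * the previous call; returns the cleaning target for the upcoming cycle.
 */
static int
predict_upcoming_allocs(int recent_alloc, double lru_multiplier)
{
    double basis;

    /* trailing average over roughly the last 16 cycles */
    smoothed_alloc += ((double) recent_alloc - smoothed_alloc) / ALLOC_AVG_SAMPLES;

    /* react immediately if the most recent demand exceeds the average */
    basis = (recent_alloc > smoothed_alloc) ? (double) recent_alloc : smoothed_alloc;

    return (int) (basis * lru_multiplier);
}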

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Final background writer cleanup for 8.3

From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> Greg Smith <gsmith@gregsmith.com> writes:
>> In the interest of closing work on what's officially titled the "Automatic 
>> adjustment of bgwriter_lru_maxpages" patch, I wanted to summarize where I 
>> think this is at ...
>
>> 2) Having backends write their own buffers out does not significantly 
>> degrade performance, as those turn into cached OS writes which generally 
>> execute fast enough to not be a large drag on the backend.
>
> [ itch... ]  That assumption scares the heck out of me.  It is doubtless
> true in a lightly loaded system, but once the kernel is under any kind
> of memory pressure I think it's completely wrong.  I think designing the
> system around this assumption will lead to something that performs great
> as long as you're not pushing it hard.

I think Heikki's experiments showed it wasn't true for at least some kinds of
heavy loads. However I would expect it to depend heavily on just what kind of
load the machine is under. At least if it's busy writing then I would expect
it to throttle writes. Perhaps in TPCC there are enough reads to throttle the
write rate to something the kernel can buffer. 

> If you're still fiddling with it then you probably aren't going to get
> it right in the next few days.  Perhaps you should think about whether
> this can be left out entirely for 8.3 and revisited later.

How does all of this relate to your epiphany that we should just have bgwriter
be a full clock sweep ahead clock hand without retracing its steps?


--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com


Re: Final background writer cleanup for 8.3

From
"Heikki Linnakangas"
Date:
Gregory Stark wrote:
> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
> 
>> Greg Smith <gsmith@gregsmith.com> writes:
>>> In the interest of closing work on what's officially titled the "Automatic 
>>> adjustment of bgwriter_lru_maxpages" patch, I wanted to summarize where I 
>>> think this is at ...
>>> 2) Having backends write their own buffers out does not significantly 
>>> degrade performance, as those turn into cached OS writes which generally 
>>> execute fast enough to not be a large drag on the backend.
>> [ itch... ]  That assumption scares the heck out of me.  It is doubtless
>> true in a lightly loaded system, but once the kernel is under any kind
>> of memory pressure I think it's completely wrong.  I think designing the
>> system around this assumption will lead to something that performs great
>> as long as you're not pushing it hard.
> 
> I think Heikki's experiments showed it wasn't true for at least some kinds of
> heavy loads. However I would expect it to depend heavily on just what kind of
> load the machine is under. At least if it's busy writing then I would expect
> it to throttle writes. Perhaps in TPCC there are enough reads to throttle the
> write rate to something the kernel can buffer. 

I ran a bunch of DBT-2 in different configurations, as well as simple
single-threaded tests like random DELETEs on a table with index, steady
rate of INSERTs to a table with no indexes, and bursts of INSERTs with
different bursts sizes and delays between them. I tried the tests with
different bgwriter settings, including turning it off and with the patch
applied, and with different shared_buffers settings.

I was not able to find a test where turning bgwriter on performed better
than turning it off.

If anyone out there has a repeatable test case where bgwriter does help,
I'm all ears. The theory of moving the writes out of the critical path
does sound reasonable, so I'm sure there is a test case to demonstrate the
effect, but it seems to be pretty darn hard to find.

The cold, rational side of me says we need a test case to show the
benefit, or if one can't be found, we should remove bgwriter altogether.
The emotional side of me tells me we can't go that far. A reasonable
compromise would be to apply the autotuning patch on the grounds that it
removes a GUC variable that's next to impossible to tune right, even
though we can't show a performance benefit compared to bgwriter=off. And
it definitely makes sense not to restart the scan from the clock sweep
hand on each bgwriter round; as Tom pointed out, it's a waste of time.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Final background writer cleanup for 8.3

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> How does all of this relate to your epiphany that we should just have
> bgwriter be a full clock sweep ahead clock hand without retracing its
> steps?

Well, it's still clearly silly for the bgwriter to rescan buffers it's
already cleaned.  But I think we've established that the "keep a lap
ahead" idea goes too far, because it writes dirty buffers speculatively,
long before they actually are needed, and there's just too much chance
of the writes being wasted due to re-dirtying.  When proposing that
idea I had supposed that wasted writes wouldn't hurt much, but that's
evidently wrong.

Heikki makes a good point nearby that if you are not disk write
bottlenecked then it's perfectly OK for backends to issue writes,
as it'll just result in a transfer to kernel cache space, and no actual
wait for I/O.  And if you *are* write-bottlenecked, then the last thing
you want is any wasted writes.  So a fairly conservative strategy that
does bgwrites only "just in time" seems like what we need to aim at.

I think the moving-average-of-requests idea, with a user-adjustable
scaling factor, is the best we have at the moment.
        regards, tom lane


Re: Final background writer cleanup for 8.3

From
Greg Smith
Date:
On Fri, 24 Aug 2007, Tom Lane wrote:

> Heikki makes a good point nearby that if you are not disk write 
> bottlenecked then it's perfectly OK for backends to issue writes, as 
> it'll just result in a transfer to kernel cache space, and no actual 
> wait for I/O.  And if you *are* write-bottlenecked, then the last thing 
> you want is any wasted writes.

Which is the same thing I was saying in my last message, so I'm content 
we're all on the same page here now--and that the contents of that page 
are now clear in the archives for when this comes up again.

> So a fairly conservative strategy that does bgwrites only "just in time" 
> seems like what we need to aim at.

And that's exactly what I've been building.  Feedback and general feeling 
that I'm doing the right thing appreciated, am returning to the code with 
scaling factor as a new tunable but plan otherwise unchanged.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Final background writer cleanup for 8.3

From
"Kevin Grittner"
Date:
>>> On Fri, Aug 24, 2007 at  7:41 AM, in message
<46CED1EF.8010707@enterprisedb.com>, "Heikki Linnakangas"
<heikki@enterprisedb.com> wrote:
> I was not able to find a test where turning bgwriter on performed better
> than turning it off.
Any tests which focus just on throughput don't address the problems which
caused us so much grief.  What we need is some sort of test which generates
a moderate write load in the background, while paying attention to the
response time of a large number of read-only queries.  The total load should
not be enough to saturate the I/O bandwidth overall if applied evenly.
The problem which the background writer has solved for us is that we have
three layers of caching (PostgreSQL, OS, and RAID controller), each with its
own delay before writing; when something like fsync triggers a cascade from
one cache to the next, the write burst bottlenecks the I/O, and reads exceed
acceptable response times.  The two approaches which seem to prevent this
problem are to disable all OS delays in writing dirty pages, or to minimize
the delays in PostgreSQL writing dirty pages.
Throughput is not everything.  Response time matters.
> If anyone out there has a repeatable test case where bgwriter does help,
> I'm all ears.
All we have is a production system where PostgreSQL failed to perform at a
level acceptable to the users without it.
> The cold, rational side of me says we need a test case to show the
> benefit, or if one can't be found, we should remove bgwriter altogether.
I would be fine with that if I could configure the back end to always write a
dirty page to the OS when it is written to shared memory.  That would allow
Linux and XFS to do their job in a timely manner, and avoid this problem.
I know we're doing more in 8.3 to move this from the OS's realm into
PostgreSQL code, but until I have a chance to test that, I want to make sure
that what has been proven to work for us is not broken.
-Kevin



Re: Final background writer cleanup for 8.3

From
Tom Lane
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Any tests which focus just on throughput don't address the problems which
> caused us so much grief.

This is a good point: a steady-state load is either going to be in the
regime where you're not write-bottlenecked, or the one where you are;
and either way the bgwriter isn't going to look like it helps much.

The real use of the bgwriter, perhaps, is to smooth out a varying load
so that you don't get pushed into the write-bottlenecked mode during
spikes.  We've already had to rethink the details of how we made that
happen with respect to preventing checkpoints from causing I/O spikes.
Maybe LRU buffer flushes need a rethink too.

Right at the moment I'm still comfortable with what Greg is doing, but
there's an argument here for a more aggressive scaling factor on
number-of-buffers-to-write than he thinks.  Still, as long as we have a
GUC variable in there, tuning should be possible.
        regards, tom lane


Re: Final background writer cleanup for 8.3

From
Greg Smith
Date:
On Fri, 24 Aug 2007, Kevin Grittner wrote:

> I would be fine with that if I could configure the back end to always write a
> dirty page to the OS when it is written to shared memory.  That would allow
> Linux and XFS to do their job in a timely manner, and avoid this problem.

You should take a look at the "io storm on checkpoints" thread on the 
pgsql-performance@postgresql.org started by Dmitry Potapov on 8/22 if you 
aren't on that list.  He was running into the same problem as you (and me 
and lots of other people) and had an interesting resolution based on 
tuning the Linux kernel so that it basically stopped caching writes. 
What you suggest here would be particularly inefficient because of how 
much extra I/O would happen on the index blocks involved in the active 
tables.

> I know we're doing more in 8.3 to move this from the OS's realm into
> PostgreSQL code, but until I have a chance to test that, I want to make sure
> that what has been proven to work for us is not broken.

The background writer code that's in 8.2 can be configured as a big 
sledgehammer that happens to help in this area while doing large amounts 
of collateral damage via writing things prematurely.  Some of the people 
involved in the 8.3 code rewrite and testing were having the same problem 
as you on a similar scale--I recall Greg Stark commenting that he had a 
system that was freezing for a full 30 seconds the way yours was.

I would be extremely surprised to find that the code that's already in 8.3 
isn't a big improvement over what you're doing now based on how much it 
has helped others running into this issue.  And much of the code that 
you're relying on now to help with the problem (the all-scan portion of 
the BGW) has already been removed as part of that.

Switching to my Agent Smith voice:  "No Kevin, your old background writer 
is already dead". You'd have to produce some really unexpected and 
compelling results during the beta period for it to get put back again. 
The work I'm still doing here is very much fine-tuning in comparison to 
what's already been committed into 8.3.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Final background writer cleanup for 8.3

From
"Kevin Grittner"
Date:
>>> On Fri, Aug 24, 2007 at  5:47 PM, in message
<Pine.GSO.4.64.0708241807500.28499@westnet.com>, Greg Smith
<gsmith@gregsmith.com> wrote:
> On Fri, 24 Aug 2007, Kevin Grittner wrote:
>
>> I would be fine with that if I could configure the back end to always write
> a
>> dirty page to the OS when it is written to shared memory.  That would allow
>> Linux and XFS to do their job in a timely manner, and avoid this problem.
>
> You should take a look at the "io storm on checkpoints" thread on the
> pgsql-performance@postgresql.org started by Dmitry Potapov on 8/22 if you
> aren't on that list.  He was running into the same problem as you (and me
> and lots of other people) and had an interesting resolution based on
> turning the Linux kernel so that it basically stopped caching writes.
I saw it.  I think that I'd rather have a write-through cache in PostgreSQL
than give up OS caching entirely.  The problem seems to be caused by the
cascade from one cache to the next, so I can easily believe that disabling
the delay on either one solves the problem.
> What you suggest here would be particularly inefficient because of how
> much extra I/O would happen on the index blocks involved in the active
> tables.
I've certainly seen that assertion on these lists often.  I don't think I've
yet seen any evidence that it's true.  When I made the background writer
more aggressive, there was no discernible increase in disk writes at the OS
level (much less from controller cache to the drives).  This may not be true
with some of the benchmark software, but in our environment there tends to
be a lot of activity on a single court case, and then they're done with it.
(I spent some time looking at this to tune our heuristics for generating
messages on our interfaces to business partners.)
>> I know we're doing more in 8.3 to move this from the OS's realm into
>> PostgreSQL code, but until I have a chance to test that, I want to make sure
>> that what has been proven to work for us is not broken.
>
> The background writer code that's in 8.2 can be configured as a big
> sledgehammer that happens to help in this area while doing large amounts
> of collateral damage via writing things prematurely.
Again -- to the OS cache, where it sits and accumulates other changes until
the page settles.
> I would be extremely surprised to find that the code that's already in 8.3
> isn't a big improvement over what you're doing now based on how much it
> has helped others running into this issue.
I'm certainly hoping that it will be.  I'm not moving to it for production
until I've established that as a fact, however.
> And much of the code that
> you're relying on now to help with the problem (the all-scan portion of
> the BGW) has already been removed as part of that.
>
> Switching to my Agent Smith voice:  "No Kevin, your old background writer
> is already dead". You'd have to produce some really unexpected and
> compelling results during the beta period for it to get put back again.
If I fail to get resources approved to test during beta, this could become
an issue later, when we do get around to testing it.  (There's exactly zero
chance of us moving to something which so radically changes a problem area
for us without serious testing.)
For what it's worth, the background writer settings I'm using weren't
arrived at entirely randomly.  I monitored I/O during episodes of the
database freezing up, and looked at how many writes per second were going
through.  I then reasoned that there was no good reason NOT to push data out
from PostgreSQL to the OS at that speed.  I split the writes between the LRU
and full cache aspects of the background writer, with heavier weight given
to getting all dirty pages pushed out to the OS cache so that they could
start to age through the OS timers.  (While the raw numbers totaled to the
peak write load, I figured I was actually allowing some slack, since there
was the percentage limit and the two scans would often cover the same
ground, not to mention the assumption that the interval was a sleep time
from the end of one run to the start of the next.)  Since it was a
production system, I made incremental changes each day, and each day the
problem became less severe.  At the point where I finally set it to my
calculated numbers, we stopped seeing the problem.
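To illustrate the kind of arithmetic I mean with round numbers (these are
made-up figures, not our actual settings): if monitoring showed write bursts
of roughly 2,000 8K pages per second, then with bgwriter_delay = 200ms the
writer runs 5 rounds per second, so the two scans together need to cover
about 400 pages per round -- say bgwriter_lru_maxpages = 100 plus
bgwriter_all_maxpages = 300, with the percentage limits set high enough that
they don't constrain those maximums.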
I'm not entirely convinced that it's a sound assumption that we should
always try to keep some dirty buffers in the cache on the off chance that
we might be smarter than the OS/FS/RAID controller algorithms about when to
write them. That said, the 8.3 changes sound as though they are likely to
reduce the problems with I/O-related freezes.
Is it my imagination, or are we coming pretty close to the point where we
could accommodate the oft-requested feature of dealing directly with a raw
volume, rather than going through the file system at all?
-Kevin



Re: Final background writer cleanup for 8.3

From
Greg Smith
Date:
On Sat, 25 Aug 2007, Kevin Grittner wrote:

> in our environment there tends to be a lot of activity on a singe court 
> case, and then they're done with it.

I submitted a patch to 8.3 that lets contrib/pg_buffercache show the 
usage_count data for each of the buffers.  It's actually pretty tiny; you 
might consider applying just that patch to your 8.2 production system and 
installing the module (as an add-in, it's easy enough to back out).  See 
http://archives.postgresql.org/pgsql-patches/2007-03/msg00555.php

With that patch in place, try a query like

select usagecount,count(*),isdirty from pg_buffercache group by
   isdirty,usagecount order by isdirty,usagecount;

That lets you estimate how much waste would be involved for your 
particular data if you wrote it out early--the more high usage_count 
blocks in there cache, the worse the potential waste.  With the tests I 
was running, the hot index blocks were pegged at the maximum count allowed 
(5) and they were taking up around 20% of the buffer cache.  If those were 
written out every time they were touched, it would be a bad scene.

It sounds like your system has a lot of data where the usage_count would 
be much lower on average, which would explain why you've been so 
successful with resolving it using the background writer.  That's a 
slightly easier problem to solve than the one I've been banging on.

> I'm not moving to it for production until I've established that as a 
> fact, however.

And you'd be crazy to do otherwise.

> I'm not entirely convinced that it's a sound assumption that we should
> always try to keep some dirty buffers in the cache on the off chance that
> we might be smarter than the OS/FS/RAID controller algorithms about when to
> write them.

All I can say is that every time someone had tried to tune the code toward 
writing that much more proactively, the results haven't seemed like an 
improvement.  I wouldn't characterize it as an assumption--it's a theory 
that seems to hold every time it's tested.  At least on the kind of Linux 
systems people put into production right now (which often have relatively 
old kernels), the OS is not as smart as everyone would like it to be in 
this area.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Final background writer cleanup for 8.3

From
Gregory Stark
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:

> Is it my imagination, or are we coming pretty close to the point where we
> could accommodate the oft-requested feature of dealing directly with a raw
> volume, rather than going through the file system at all?

Or O_DIRECT.

I think the answer is that we've built enough intelligence that it's feasible
from the memory management side.

However there's another side to that problem. a) you would either need to have
multiple bgwriters or have bgwriter use aio since having only one would
serialize your i/o which would be a big hit to i/o bandwidth. b) you need some
solution to handle preemptively reading ahead for sequential reads.

I don't think we're terribly far off from being able to do it. The traditional
response has always been that our time is better spent doing database stuff
rather than reimplementing what the OS people are doing better. And also that
the OS has more information about the hardware and so can schedule I/O more
efficiently.

However there's also a strong counter-argument that we have more information
about what we're intending to use the data for and how urgent any given i/o
is.

I'm not sure how that balancing act ends. I have a hunch but I guess it would
take experiments to get a real answer. And the answer might be very different
on different OSes and hardware configurations.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com


Re: Final background writer cleanup for 8.3

From
"Kevin Grittner"
Date:
>>> On Sun, Aug 26, 2007 at 12:51 AM, in message
<Pine.GSO.4.64.0708260115400.14470@westnet.com>, Greg Smith
<gsmith@gregsmith.com> wrote:
> On Sat, 25 Aug 2007, Kevin Grittner wrote:
>
>> in our environment there tends to be a lot of activity on a singe court
>> case, and then they're done with it.
>
> I submitted a patch to 8.3 that lets contrib/pg_buffercache show the
> usage_count data for each of the buffers.  It's actually pretty tiny; you
> might consider applying just that patch to your 8.2 production system and
> installing the module (as an add-in, it's easy enough to back out).  See
> http://archives.postgresql.org/pgsql-patches/2007-03/msg00555.php
>
> With that patch in place, try a query like
>
> select usagecount,count(*),isdirty from pg_buffercache group by
>    isdirty,usagecount order by isdirty,usagecount;
>
> That lets you estimate how much waste would be involved for your
> particular data if you wrote it out early--the more high usage_count
> blocks in there cache, the worse the potential waste.  With the tests I
> was running, the hot index blocks were pegged at the maximum count allowed
> (5) and they were taking up around 20% of the buffer cache.  If those were
> written out every time they were touched, it would be a bad scene.
Just to be sure that I understand, are you saying it would be a bad scene if
the physical writes happened, or that the overhead of pushing them out to
the OS would be crippling?
Anyway, I've installed this on the machine that I proposed using for the
tests.  It is our older generation of central servers, soon to be put to
some less critical use as we bring the newest generation on line and the
current "new" machines fall back to secondary roles in our central server
pool.  It is currently a replication target for the 72 county-based circuit
court systems, but is just there for ad hoc queries against statewide data;
there's no web load present.
Running the suggested query a few times, with the samples separated by a few
seconds each, I got the following.  (The Sunday afternoon replication load
is unusual in that there will be very few users entering any data, just a
trickle of input from our law enforcement interfaces, but a lot of the
county middle tiers will have noticed that there is idle time and that it
has been more than 23 hours since the start of the last synchronization of
county data against the central copies, and so will be doing massive selects
to look for and report any "drift".)  I'll check again during normal weekday
load.

usagecount | count | isdirty
------------+-------+---------
          0 |  8711 | f
          1 |  9394 | f
          2 |  1188 | f
          3 |   869 | f
          4 |   160 | f
          5 |   157 | f
            |     1 |
(7 rows)

usagecount | count | isdirty
------------+-------+---------
          0 |  9033 | f
          1 |  8849 | f
          2 |  1623 | f
          3 |   619 | f
          4 |   181 | f
          5 |   175 | f
(6 rows)

usagecount | count | isdirty
------------+-------+---------
          0 |  9093 | f
          1 |  6702 | f
          2 |  2267 | f
          3 |   602 | f
          4 |   428 | f
          5 |  1388 | f
(6 rows)

usagecount | count | isdirty
------------+-------+---------
          0 |  6556 | f
          1 |  7188 | f
          2 |  3648 | f
          3 |  2074 | f
          4 |   720 | f
          5 |   293 | f
            |     1 |
(7 rows)

usagecount | count | isdirty
------------+-------+---------
          0 |  6569 | f
          1 |  7855 | f
          2 |  3942 | f
          3 |  1181 | f
          4 |   532 | f
          5 |   401 | f
(6 rows)
I also ran the query mentioned in the cited email about 100 times, with 52
instead of 32.  (I guess I have a bigger screen.)  It would gradually go
from entirely -1 values to mostly -2 with a few -1, then gradually back to
all -1.  Repeatedly.  I never saw anything other than -1 or -2.  Of course
this is with our aggressive background writer settings.
This contrib module seems pretty safe, patch and all.  Does anyone think
there is significant risk to slipping it into the 8.2.4 database where we
have massive public exposure on the web site handling 2 million hits per
day?
By the way, Greg, lest my concerns about this be misinterpreted -- I do
really appreciate the effort you've put into analyzing this and tuning the
background writer.  I just want to be very cautious here, and I do get
downright alarmed at some of the posts which seem to deny the reality of the
problems which many have experienced with write spikes choking off reads to
the point of significant user impact.  I also think we need to somehow
develop a set of tests which report maximum response time on (what should
be) fast queries while the database is under different loads, so that those
of us for whom reliable response time is more important than maximum overall
throughput are protected from performance regressions.
-Kevin



Re: Final background writer cleanup for 8.3

From
Greg Smith
Date:
On Sun, 26 Aug 2007, Kevin Grittner wrote:

> usagecount | count | isdirty
> ------------+-------+---------
>          0 |  8711 | f
>          1 |  9394 | f
>          2 |  1188 | f
>          3 |   869 | f
>          4 |   160 | f
>          5 |   157 | f

Here's a typical sample from your set.  Notice how you've got very few 
buffers with a high usage count.  This is a situation the background 
writer is good at working with.  Either the old or new work-in-progress 
LRU writer can aggressively pound away at any of the buffers with a 0 
usage count shortly after they get dirty, and that won't be inefficient 
because there aren't large numbers of other clients using them.

Compare against this other sample:

> usagecount | count | isdirty
> ------------+-------+---------
>          0 |  9093 | f
>          1 |  6702 | f
>          2 |  2267 | f
>          3 |   602 | f
>          4 |   428 | f
>          5 |  1388 | f

Notice that you have a much larger number of buffers where the usage count 
is 4 or 5.  The all-scan part of the 8.2 background writer will waste a 
lot of writes when you have a profile that's more like this.  If there 
have been 4+ client backends touching the buffer recently, you'd be crazy 
to write it out right now if you could instead be focusing on banging out 
the ones where the usage count is 0.  The 8.2 background writer would 
write them out anyway, which meant that when you hit a checkpoint both the 
OS and the controller cache were filled with such buffers before you even 
started writing the checkpoint data.  The new setup in 8.3 only worries 
about the high usage count buffers when you hit a checkpoint, at which 
point it streams them out over a longer, adjustable period (as not to 
spike the I/O more than necessary and block your readers) than the 8.2 
design, which just dumped them all immediately.

> Just to be sure that I understand, are you saying it would be a bad scene if
> the physical writes happened, or that the overhead of pushing them out to
> the OS would be crippling?

If you have a lot of buffers where the usage_count data was high, it would 
be problematic to write them out every time they were touched; odds are 
good somebody else is going to dirty them again soon enough so why bother. 
On your workload, that doesn't seem to be the case.  But that is the 
situation on some other test workloads, and balancing for that situation 
has been central to the parts of the redesign I've been injecting 
suggestions into.  One of the systems I was tormented by had the 
usagecount of 5 for >20% of the buffers in the cache under heavy load, and 
had a physical write been executed every time one of those was touched 
that would have been crippling (even if the OS was smart to cache and 
therefore make redundant some of the writes, which is behavior I would 
prefer not to rely on).

> This contrib module seems pretty safe, patch and all.  Does anyone think
> there is significant risk to slipping it into the 8.2.4 database where we
> have massive public exposure on the web site handling 2 million hits per
> day?

I think it's fairly safe, and my patch was pretty small; just exposing 
some data that nobody had been looking at before.  Think how much easier 
your life would have been when doing your earlier tuning if you were 
looking at the data in these terms.  Just be aware that running the query 
is itself intensive and causes its own tiny hiccup in throughput every 
time it executes, so you may want to consider this more of a snapshot you 
run periodically to learn more about your data rather than something you 
do very regularly.

> I also think we need to somehow develop a set of tests which report 
> maximum response time on (what should be) fast queries while the 
> database is under different loads, so that those of us for whom reliable 
> response time is more important than maximum overall throughput are 
> protected from performance regressions.

My guess is that the DBT2 tests that Heikki has been running are more 
complicated than you think they are; there are response time guarantee 
requirements in there as well as the throughput numbers.  The tests that I 
run (which I haven't been publishing yet but will be with the final patch 
soon) also report worst-case and 90-th percentile latency numbers as well 
as TPS.  A "regression" that improved TPS at the expense of those two 
would not be considered an improvement by anyone involved here.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Final background writer cleanup for 8.3

From
"Kevin Grittner"
Date:
>>> On Sun, Aug 26, 2007 at  4:16 PM, in message
<Pine.GSO.4.64.0708261637030.3811@westnet.com>, Greg Smith
<gsmith@gregsmith.com> wrote:
> On Sun, 26 Aug 2007, Kevin Grittner wrote:
>
>> usagecount | count | isdirty
>> ------------+-------+---------
>>          0 |  9093 | f
>>          1 |  6702 | f
>>          2 |  2267 | f
>>          3 |   602 | f
>>          4 |   428 | f
>>          5 |  1388 | f
>
> Notice that you have a much larger number of buffers where the usage count
> is 4 or 5.  The all-scan part of the 8.2 background writer will waste a
> lot of writes when you have a profile that's more like this.  If there
> have been 4+ client backends touching the buffer recently, you'd be crazy
> to write it out right now if you could instead be focusing on banging out
> the ones where the usage count is 0.
Seems to me I'd be crazy to be writing out anything.  Nothing's dirty.
In fact, I ran a simple query to count dirty pages once per second for a
minute and had three samples show any pages dirty.  The highest count was 5.
Again, this was Sunday afternoon, which is not traditionally a busy time for
the courts.  I'll try to get some more meaningful numbers tomorrow.

> One of the systems I was tormented by had the
> usagecount of 5 for >20% of the buffers in the cache under heavy load, and
> had a physical write been executed every time one of those was touched
> that would have been crippling (even if the OS was smart enough to cache and
> therefore make redundant some of the writes, which is behavior I would
> prefer not to rely on).
Why is that?
> The tests that I
> run (which I haven't been publishing yet but will be with the final patch
> soon) also report worst-case and 90-th percentile latency numbers as well
> as TPS.  A "regression" that improved TPS at the expense of those two
> would not be considered an improvement by anyone involved here.
Have you been able to create a test case which exposes the write-spike
problem under 8.2.4?
By the way, the 90th percentile metric isn't one I'll care a lot about.
In our environment any single instance of a "fast" query running slow is
considered a problem, and my job is to keep those users happy.
-Kevin



Re: Final background writer cleanup for 8.3

From
Gregory Stark
Date:
"Greg Smith" <gsmith@gregsmith.com> writes:

> On Sun, 26 Aug 2007, Kevin Grittner wrote:
>
>> I also think we need to somehow develop a set of tests which report maximum
>> response time on (what should be) fast queries while the database is under
>> different loads, so that those of us for whom reliable response time is more
>> important than maximum overall throughput are protected from performance
>> regressions.
>
> My guess is that the DBT2 tests that Heikki has been running are more
> complicated than you think they are; there are response time guarantee
> requirements in there as well as the throughput numbers.  The tests that I run
> (which I haven't been publishing yet but will be with the final patch soon)
> also report worst-case and 90-th percentile latency numbers as well as TPS.  A
> "regression" that improved TPS at the expense of those two would not be
> considered an improvement by anyone involved here.

TPCC requires that the 90th percentile response time be under 5s for most
transactions. It also requires that the average be less than the 90th
percentile which helps rule out circumstances where the longest 10% response
times are *much* longer than 5s.

However in practice neither of those requirements really rule out some pretty
bad behaviour as long as it's rare enough. Before the distributed checkpoint
patch went in we were finding 60s of zero activity at every checkpoint. But
there were so few transactions affected that in the big picture it didn't
impact the 90th percentile. It didn't even affect the 95th percentile. I think
you had to look at the 99th percentile before it even began to impact the
results.

I can't really imagine a web site operator being happy if he was told that
only 1% of users' clicks resulted in a browser timeout...

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com


Re: Final background writer cleanup for 8.3

From
"Kevin Grittner"
Date:
>>> On Sun, Aug 26, 2007 at  7:35 PM, in message
<46D1D601.EE98.0025.0@wicourts.gov>, "Kevin Grittner"
<Kevin.Grittner@wicourts.gov> wrote:
>>>> On Sun, Aug 26, 2007 at  4:16 PM, in message
> <Pine.GSO.4.64.0708261637030.3811@westnet.com>, Greg Smith
> <gsmith@gregsmith.com> wrote:
> I'll try to get some more meaningful numbers tomorrow.
Well, I ran the query against the production web server 40 times, and the
highest number I got for usagecount 5 dirty pages was in this sample:

usagecount | count | isdirty
------------+-------+---------
          0 |  7358 | f
          1 |  7428 | f
          2 |  1938 | f
          3 |  1311 | f
          4 |  1066 | f
          5 |  1097 | f
          1 |    87 | t
          2 |    62 | t
          3 |    31 | t
          4 |    11 | t
          5 |    86 | t
            |     5 |
(12 rows)

Most samples looked something like this:

usagecount | count | isdirty
------------+-------+---------
          0 |  7981 | f
          1 |  6584 | f
          2 |  1975 | f
          3 |  1063 | f
          4 |  1366 | f
          5 |  1294 | f
          0 |     5 | t
          1 |    83 | t
          2 |    60 | t
          3 |    19 | t
          4 |    21 | t
          5 |    28 | t
            |     1 |
(13 rows)

The system can comfortably write out about 4,000 pages per second as long as
the write cache doesn't get swamped, so in the worst case I caught it had
69 ms worth of work to do, if they were all physical writes (which, of
course, is highly unlikely).
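(That works out from the worst sample above: 87 + 62 + 31 + 11 + 86 = 277
dirty buffers, and 277 / 4,000 pages per second is about 69 ms.)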
From shortly afterwards, possibly of interest:
postgres@ATHENA:~> vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff    cache   si   so    bi    bo   in    cs us sy id wa
 2  3     20 402248      0 10538028    0    0     0     1    1     2 21  4 55 19
 2  4     20 403116      0 10538028    0    0  5180   384 2233  9599 24  5 50 21
 3  6     20 402868      0 10532888    0    0  4844   512 2841 14054 44  6 31 19
 7 10     20 397908      0 10534944    0    0  6768   465 2674 11995 40  6 26 28
 4 15     20 398016      0 10534944    0    0  3344  4703 2297 10578 34  7 13 46
 0 22     20 405456      0 10534944    0    0  2464  4192 1785  6167 20  3 21 56
14 19     20 401852      0 10538028    0    0  3680  4704 2474 11779 29  5 12 54
17 13     20 401728      0 10532888    0    0  5504  1945 2554 21490 35  8 10 47
 3 10     20 408176      0 10530832    0    0 11380   553 3907 15463 67 13  5 15
 4  4     20 405572      0 10535972    0    0  8708   981 2904 12051 26  7 34 33
 1  5     20 403588      0 10535972    0    0  5924   464 2589 12194 26  5 45 23
 4  7     20 410780      0 10529804    0    0  6284  1163 2674 11830 33  8 35 24
 3 13     20 402596      0 10526720    0    0  2424  6598 2441 10332 40  7 11 42
 7 16     20 400736      0 10528776    0    0  3928  6784 2453  9852 26  6 26 42
19 14     20 405308      0 10524664    0    0  2272  4708 2208  8583 27  5 19 49
 9 17     20 404580      0 10527748    0    0  7156  3560 3185 13203 55 11  3 32
 1 11     20 406192      0 10531860    0    0  5112  3647 2758 11362 31  6 26 37
 3 13     20 404464      0 10531860    0    0  4856  3426 2342 11077 24  5 35 36
 2 13     20 403968      0 10530832    0    0  5308  4634 2762 15778 34  7 22 36
 4 12     20 403472      0 10534944    0    0  2996  3766 2090  9331 20  4 34 42
 0  5     20 412648      0 10522608    0    0  2364  5187 1816  5194 18  5 56 22
 4 13     20 415376      0 10519524    0    0  2836  6172 1929  5075 25  6 26 43
27 16     20 413880      0 10522608    0    0  7892  2340 3325 19769 52  8 10 30
 7  7     20 402340      0 10530832    0    0  7600   712 3511 16486 45  8 20 26
 4  9     20 403704      0 10531860    0    0  7708   830 3133 16164 43 11 22 24
 5  6     20 408416      0 10529804    0    0  6900   814 2703 10806 31  7 39 24
 8  6     20 401844      0 10532888    0    0  6884   632 2993 13792 37  7 29 27
13  3     20 398868      0 10534944    0    0  7732   744 3443 14580 63  9  8 19
 5  6     20 403580      0 10533916    0    0  6724   623 2905 11937 37  7 34 22
 3  7     20 400728      0 10529804    0    0  6924   712 2746 12085 35  7 37 21
 0  7     20 408664      0 10526720    0    0  6536   344 2562 10555 27  6 44 24
 5  1     20 407796      0 10527748    0    0  4628  1000 2653 13092 41  7 37 15
 7  9     20 400480      0 10529804    0    0  3364   744 2326 11198 35  7 40 18
 3  4     20 406384      0 10531860    0    0  4044   904 2998 14055 60  9 16 14
18  5     20 397976      0 10525692    0    0  6000   671 3082 14058 55 10 15 20
11  6     20 410996      0 10528776    0    0  4828  3498 2768 13027 38  7 28 27
 1  3     20 406416      0 10531860    0    0  4140   616 2496 11980 33  6 43 17
This box is a little beefier than the proposed test box, with eight 3 GHz
Xeon MP CPUs and 12 GB of RAM.  Other than telling PostgreSQL about the
extra RAM in the effective_cache_size GUC, this box has the same
postgresql.conf.  Other than cranking up the background writer settings,
this is the same box and configuration that stalled so badly that we were
bombarded with user complaints.
-Kevin



Re: Final background writer cleanup for 8.3

From
Jan Wieck
Date:
On 8/24/2007 1:17 AM, Greg Smith wrote:
> On Thu, 23 Aug 2007, Tom Lane wrote:
> 
>> It is doubtless true in a lightly loaded system, but once the kernel is 
>> under any kind of memory pressure I think it's completely wrong.
> 
> The fact that so many tests I've done or seen get maximum throughput in 
> terms of straight TPS with the background writer turned completely off is 
> why I stated that so explicitly.  I understand what you're saying in terms 
> of memory pressure, all I'm suggesting is that the empirical tests suggest 
> the current background writer even with moderate improvements doesn't 
> necessarily help when you get there.  If writes are blocking, whether the 
> background writer does them slightly ahead of time or whether the backend 
> does them itself doesn't seem to matter very much.  On a heavily loaded 
> system, your throughput is bottlenecked at the disk either way--and 
> therefore it's all the more important in those cases to never do a write 
> until you absolutely have to, lest it be wasted.

Have you used something that, like a properly implemented TPC benchmark, 
simulates users that go through cycles of think times instead of 
hammering SUT interactions at the maximum possible rate allowed by the 
network latency? And do your tests consider any completed transaction a 
good transaction, or are they like TPC benchmarks, which require the 
majority of transactions to complete in a certain maximum response time?

Those tests will show you that inflicting an IO storm at checkpoint time 
will delay processing enough to get a significant increase in the number 
of concurrent transactions by giving the "users" time enough to come out 
of their thinking time. That spike in active transactions increases 
pressure on CPU, memory and IO ... and eventually leads to the situation 
where users submit new transactions at a higher rate than you currently 
can commit ... which is where you enter the spiral of death.

Observing that very symptom during my TPC-W tests several years ago was 
what led to developing the background writer in the first place. Can 
your tests demonstrate improvements for this kind of (typical web 
application) load profile?


Jan

-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #


Re: Final background writer cleanup for 8.3

From
Jan Wieck
Date:
On 8/24/2007 8:41 AM, Heikki Linnakangas wrote:
> If anyone out there has a repeatable test case where bgwriter does help,
> I'm all ears. The theory of moving the writes out of the critical path
> does sound reasonable, so I'm sure there is test case to demonstrate the
> effect, but it seems to be pretty darn hard to find.

One could try to dust off this TPC-W benchmark.
    http://pgfoundry.org/projects/tpc-w-php/

Again, the original theory for the bgwriter wasn't moving writes out of 
the critical path, but smoothing response times that tended to go 
completely down the toilet during checkpointing, causing all the users 
to wake up and overload the system entirely.

It is well known that any kind of bgwriter configuration other than OFF 
does increase the total IO cost. But you will find that everyone who has 
SLA's that define maximum response times will happily increase the IO 
bandwidth to give an aggressively configured bgwriter room to work.


Jan

-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #


Re: Final background writer cleanup for 8.3

From
Greg Smith
Date:
On Fri, 31 Aug 2007, Jan Wieck wrote:

> Again, the original theory for the bgwriter wasn't moving writes out of the 
> critical path, but smoothing response times that tended to go completely down 
> the toilet during checkpointing, causing all the users to wake up and 
> overload the system entirely.

As far as I'm concerned, that function of the background writer has been 
replaced by the load distributed checkpoint features now controlled by 
checkpoint_completion_target, which is believed to be a better solution in 
several respects.  I've been trying to motivate people happily using the 
current background writer to confirm or deny that during beta, while 
there's still time to put the all-scan portion that was removed back 
again.

The open issue I'm working on is whether the LRU cleaner running in 
advance of the Strategy point is still a worthwhile addition on top of 
that.

My own tests with pgbench that I'm busy wrapping up today haven't provided 
many strong conclusions here; the raw data is now on-line at 
http://www.westnet.com/~gsmith/content/bgwriter/ , and I'm working on 
summarizing it usefully and bundling the toolchain I used to run all 
those.  I'll take a look at whether TPC-W provides a helpfully different 
view here, because as far as I'm aware that's a test neither Heikki nor I 
have tried yet to investigate this area.

> It is well known that any kind of bgwriter configuration other than OFF does 
> increase the total IO cost. But you will find that everyone who has SLAs 
> that define maximum response times will happily increase the IO bandwidth to 
> give an aggressively configured bgwriter room to work.

The old background writer couldn't be configured to be aggressive enough 
to satisfy some SLAs because of interactions with the underlying operating 
system write caches.  It actually made things worse in some situations, 
because at the point when you hit a checkpoint, the OS/disk controller 
caches were already filled to capacity with writes of active pages, many 
of which were now being written again.  Had you just left the background 
writer off, those caches would have had less data in them and been better 
able to absorb the storm of writes that comes with a checkpoint.  This is 
particularly true in the situation where you have a large caching disk 
controller that might chew through a GB worth of shared_buffers almost 
instantly if it were mostly clean when the checkpoint storm begins; but if 
the background writer has been busy pounding at it, then it's already full 
of data at checkpoint time.

We just talked about this for a bit at Bruce's back in July; the hardware 
you did your development against and what people are deploying nowadays 
are so different that the entire character of the problem has changed. 
The ability of the processors and memory to create dirty pages has gone up 
by at least one order of magnitude, and the sophistication of the disk 
controller on a high-end PostgreSQL server is pretty high now; the speed 
of the underlying disks hasn't kept pace, and that gap has been making 
this particular problem worse every year.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Final background writer cleanup for 8.3

From
Josh Berkus
Date:
Greg,

> As far as I'm concerned, that function of the background writer has been
> replaced by the load distributed checkpoint features now controlled by
> checkpoint_completion_target, which is believed to be a better solution
> in several respects.  I've been trying to motivate people happily using
> the current background writer to confirm or deny that during beta, while
> there's still time to put the all-scan portion that was removed back
> again.

In about 200 benchmark test runs, I don't feel like we ever came up with a 
set of bgwriter settings we'd happily recommend to others.  So it's hard 
for me to tell whether this is true or not.

> The open issue I'm working on is whether the LRU cleaner running in
> advance of the Strategy point is still a worthwhile addition on top of
> that.
>
> My own tests with pgbench that I'm busy wrapping up today haven't
> provided many strong conclusions here; the raw data is now on-line at
> http://www.westnet.com/~gsmith/content/bgwriter/ , and I'm working on
> summarizing it usefully and bundling the toolchain I used to run all
> those.  I'll take a look at whether TPC-W provides a helpfully different
> view here, because as far as I'm aware that's a test neither Heikki nor I
> have tried yet to investigate this area.

Can you send me the current version of the patch, plus some bgwriter 
settings to try with it, so we can throw it on some of the Sun benchmarks?

-- 
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco


Re: Final background writer cleanup for 8.3

From
Greg Smith
Date:
On Tue, 4 Sep 2007, Josh Berkus wrote:

> In about 200 benchmark test runs, I don't feel like we ever came up with a
> set of bgwriter settings we'd happily recommend to others.  So it's hard
> for me to tell whether this is true or not.

Are you talking about 200 runs with 8.2.4 or 8.3?  If you've collected a 
bunch of 8.3 data, that's something I haven't been able to do; if what 
you're saying is that you never found settings with 8.2.4 that you'd 
recommend, that's consistent with what I was saying.

> Can you send me the current version of the patch, plus some bgwriter
> settings to try with it, so we can throw it on some of the Sun benchmarks?

I'm in the middle of wrapping this up today and will send out a patch for 
everyone to try shortly.  The tests are done, the patch is done for now, 
and I'm just writing the results up and making my tests reproducible.  I 
had some unexpected inspiration the other day that dragged things out, but 
with useful improvements.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Final background writer cleanup for 8.3

From
Josh Berkus
Date:
Greg,

> Are you talking about 200 runs with 8.2.4 or 8.3?  

8.2.4.

-- 
Josh Berkus
PostgreSQL @ Sun
San Francisco


Re: Final background writer cleanup for 8.3

From
Greg Smith
Date:
On Wed, 5 Sep 2007, Josh Berkus wrote:

>> Are you talking about 200 runs with 8.2.4 or 8.3?
> 8.2.4.

Right, then we're in agreement here.  I did something like 4000 small test 
runs with dozens of settings under various 8.2.X releases and my 
conclusion was that in the general case, it just didn't work at reducing 
checkpoint spikes the way it was supposed to.  Your statement that you 
never found a "set of bgwriter settings we'd happily recommend to others" 
was also the case for me.

While there certainly are some cases where we've heard about people whose 
workloads were such that the background writer worked successfully for 
them, I consider those lucky rather than normal.  I'd like those people to 
test 8.3 because I'd hate to see the changes made to improve the general 
case cause a regression for them.

You are certainly spot-on that this causes a bit of a problem for testing 
8.3 in beta, because if you come from a world-view where the 8.2.4 
background writer was never successful, it's hard to figure out a starting 
point for comparing it to the one in 8.3.  Maybe I'll spark some ideas 
when I get the rest of my data out here soon.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Final background writer cleanup for 8.3

From
"Kevin Grittner"
Date:
>>> On Wed, Sep 5, 2007 at  1:54 PM, in message
<Pine.GSO.4.64.0709051443300.17248@westnet.com>, Greg Smith
<gsmith@gregsmith.com> wrote:
> On Wed, 5 Sep 2007, Josh Berkus wrote:
>
> While there certainly are some cases where we've heard about people whose
> workloads were such that the background writer worked successfully for
> them, I consider those lucky rather than normal.  I'd like those people to
> test 8.3 because I'd hate to see the changes made to improve the general
> case cause a regression for them.

Being one of the lucky ones, I'm still hopeful that I'll be able to do
these tests.  I think I know how to tailor the load so that we see the
problem often enough to get useful benchmarks (we tended to see the
problem a few times per day in actual 24/7 production).

My plan would be to run 8.2.4 with the background writer turned off to
establish a baseline.  I think that any test, to be meaningful, would need
to run for several hours, with the first half hour discarded as just being
enough to establish the testing state.

Then I would test our aggressive background writer settings under 8.2.4 to
confirm that those settings do handle the problem in this test
environment.

Then I would test the new background writer with synchronous commits under
the 8.3 beta, using various settings.  The 0.5, 0.7 and 0.9 settings you
recommended for a test are how far from the LRU end of the cache to look
for dirty pages to write, correct?  Is there any upper bound, as long as I
keep it below 1?  Are the current shared memory and the 1 GB you suggested
enough of a spread for these tests?  (At several hours per test in order
to get meaningful results, I don't want to get into too many permutations.)

Finally, I would try the new checkpoint techniques, with and without the
new background writer.  Any suggestions on where to set the knobs for
those runs?

I'm inclined to think that it would be interesting to try the benchmarks
with the backend writing any dirty page through to the OS at the same time
it is written to the PostgreSQL cache, as a reference point at the
opposite extreme from having the cache hold onto dirty pages for as long
as possible before sharing them with the OS.  Do you see any value in
getting actual numbers for that?

> this causes a bit of a problem for testing
> 8.3 in beta, because if you come from a world-view where the 8.2.4
> background writer was never successful, it's hard to figure out a starting
> point for comparing it to the one in 8.3.

In terms of comparing the new technique to the old, one would approximate
the new technique by turning off the "all" scan and setting the lru scan
percentage to 50% or more, right?  (I mean, obviously there would be more
CPU time used as it scanned through clean pages repeatedly, but it would
be a rough analogy otherwise, yes?)
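
For concreteness, the sort of 8.2.4 postgresql.conf settings I have in
mind look roughly like this (the numbers are just placeholders to
illustrate the idea, not values I'm committed to):

    # disable the "all" scan entirely
    bgwriter_all_percent   = 0
    bgwriter_all_maxpages  = 0
    # make the LRU scan cover a much larger slice of the pool each round
    bgwriter_lru_percent   = 50.0
    bgwriter_lru_maxpages  = 500
    bgwriter_delay         = 200ms
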
-Kevin



Testing 8.3 LDC vs. 8.2.4 with aggressive BGW

From
Greg Smith
Date:
Renaming the old thread to more appropriately address the topic:

On Wed, 5 Sep 2007, Kevin Grittner wrote:

> Then I would test the new background writer with synchronous commits under
> the 8.3 beta, using various settings.  The 0.5, 0.7 and 0.9 settings you
> recommended for a test are how far from the LRU end of the cache to look
> for dirty pages to write, correct?

This is alluding to the suggestions I gave at 
http://archives.postgresql.org/pgsql-hackers/2007-08/msg00755.php

checkpoint_completion_target has nothing to do with the LRU, so let's step 
back to fundamentals and talk about what it actually does.  The official 
documentation is at 
http://developer.postgresql.org/pgdocs/postgres/wal-configuration.html

As you generate transactions, Postgres puts data into the WAL.  The WAL is 
organized into segments that are typically 16MB each.  Periodically, the 
system hits a checkpoint where the WAL data up to a certain point is 
guaranteed to have been applied to the database, at which point the old 
WAL files aren't needed anymore and can be reused.  These checkpoints are 
generally caused by one of two things happening:

1) checkpoint_segments worth of WAL files have been written
2) more than checkpoint_timeout seconds have passed since the last 
checkpoint

The system doesn't stop working while the checkpoint is happening; it just 
keeps creating new WAL files.  As long as each checkpoint finishes before 
the next one is required, performance should be fine.
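
(In postgresql.conf terms, those two triggers correspond to the
checkpoint_segments and checkpoint_timeout settings; the values below are
just what a stock configuration looks like, not a recommendation:)

    checkpoint_segments = 3        # each segment normally holds 16MB of WAL
    checkpoint_timeout  = 5min     # force a checkpoint at least this often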

In the 8.2 model, processing the checkpoint occurs as fast as data can be 
written to disk.  In 8.3, the writes can be spread out instead.  What 
checkpoint_completion_target does is suggest how far along the system 
should aim to have finished the current checkpoint relative to when the 
next one is expected.

For example, your current system has checkpoint_segments=10.  Assume that 
you have checkpoint_timeout set to a large number such that the 
checkpoints are typically being driven by the number of segments being 
filled (so you get a checkpoint every 10 WAL segments, period).  If 
checkpoint_completion_target was set to 0.5, the expectation is that the 
writes for the currently executing checkpoint would be finished about the 
time that 0.5*10=5 segments of new WAL data had been written.  If you set 
it to 0.9 instead, you'd expect the checkpoint to finish just about 
when the 9th WAL segment is being written out, which is cutting things a 
bit tight; somewhere around there is the safe upper limit for that 
parameter.

Now, checkpoint_segments=10 is a pretty low setting, but I'm guessing that 
on your current system that's forcing very regular checkpoints, which 
makes each individual checkpoint have less work to do and therefore 
reduces the impact of the spikes you're trying to avoid.  With LDC and 
checkpoint_completion_target, you can make that number much bigger (I 
suggested 50), which means you'll only have 1/5 as many checkpoints 
causing I/O spikes, and each of those checkpoints will have 5X as long to 
potentially spread the writes over.  The main cost is that it will take 
longer to recover if your database crashes, which hopefully is a rare 
event.
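
To make that concrete, the sort of settings I have in mind for your 8.3
test box look roughly like this (illustrative values only; adjust them
based on how frequent the checkpoints turn out to be):

    checkpoint_segments          = 50     # ~1/5 as many segment-driven checkpoints
    checkpoint_timeout           = 30min  # high enough that segments, not the
                                          # clock, normally drive checkpoints
    checkpoint_completion_target = 0.5    # try 0.7 and 0.9 in later runs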

Having far fewer checkpoints is obviously a win for your situation, but the 
open question is whether this fashion of spreading them out will reduce 
the I/O spike as effectively as the all-scan background writer in 8.2 has 
been working for you.  This is one aspect that makes your comparison a 
bit tricky.  It's possible that by increasing the segments enough, you'll 
get into a situation where you don't see (m)any of them during your 
testing run of 8.3.  You should try to collect some data on how regularly 
checkpoints are happening during early testing to get an idea whether this 
is a possibility.  The usual approach is to set checkpoint_warning to a 
really high number (like the maximum of 3600) and then you'll get a 
harmless note in the logs every time one happens, and that will show you 
how frequently they're happening.  It's kind of important to have an idea 
how many checkpoints to expect during each test run so you can put 
together a fair comparison; as you increase checkpoint_segments, you need 
to adopt a mindset that asks "how many sluggish transactions am I seeing 
per checkpoint?", not how many total per test run.

I have a backport of some of the pg_stat_bgwriter features added in 8.3 
that can be applied to 8.2, which might be helpful for monitoring your 
test benchmarking server (it is most certainly *not* suitable to go onto 
the real one).  It's at 
http://www.westnet.com/~gsmith/content/postgresql/perfmon82.htm if you 
want to take a look; I put that together specifically to allow easier 
comparisons of 8.2 and 8.3 in this area.
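
On the 8.3 side the view is built in; sampling it along these lines before 
and after each test run shows who is actually doing the writes (the 
backported version exposes similar counters, though the exact names there 
may not match):

    SELECT checkpoints_timed, checkpoints_req,
           buffers_checkpoint, buffers_clean,
           maxwritten_clean, buffers_backend, buffers_alloc
    FROM pg_stat_bgwriter;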

> Are the current shared memory and the 1 GB you suggested enough of a 
> spread for these tests?  (At several hours per test in order to get 
> meaningful results, I don't want to get into too many permutations.)

Having a much larger shared_buffers setting should allow you to keep more 
data in memory usefully, which may lead to an overall performance gain due 
to improved efficiency.  With your current configuration, I would guess 
that making the buffer cache bigger would increase the checkpoint spike 
problems, whereas that shouldn't be as much of an issue with 8.3 because of 
how the checkpoint can be spread out.  The hope here is that by letting 
PostgreSQL cache more and avoiding writes of popular buffers except at 
checkpoint time, your total I/O will be significantly lower with 8.3 
compared to how much an aggressive BGW will write in 8.2.  Right now, 
you've got a pretty low number of pages that accumulate a high usage 
count; that may change if you give the buffer cache a lot more room to 
work.

> Finally, I would try the new checkpoint techniques, with and without the
> new background writer.  Any suggestions on where to set the knobs for
> those runs?

This and your related question about simulating the new LRU behavior by 
"turning off the 'all' scan and setting the lru scan percentage to 50% or 
more" depend on what final form the LRU background writer ends up in. 
Certainly you should consider using higher values for the percentage and 
maxpages parameters with 8.3 in its current form, because the all scan is 
no longer doing the majority of the work.  If some form of my JIT BGW 
patch gets applied before beta, you'll still want to increase maxpages but 
won't have to play with the percentage anymore; you might try adjusting 
the multiplier setting instead.
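
As a starting point with the current 8.3 code, I'd try something in this 
neighborhood (placeholder numbers again, not a recommendation; the right 
values depend on how your buffer allocations behave):

    bgwriter_delay        = 200ms
    bgwriter_lru_percent  = 10.0   # scan much further ahead than the default
    bgwriter_lru_maxpages = 500    # raise the per-round cap on writes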

> I'm inclined to think that it would be interesting to try the benchmarks 
> with the backend writing any dirty page through to the OS at the same 
> time it is written to the PostgreSQL cache, as a reference point at 
> the opposite extreme from having the cache hold onto dirty pages for as 
> long as possible before sharing them with the OS.  Do you see any value 
> in getting actual numbers for that?

It might be an interesting curiosity to see how this works for you, but 
I'm not sure of its value to the community at large.  The configuration 
trend for larger systems seems to be pretty clear at this point:  use 
large values for shared_buffers and checkpoint_segments.  Minimize total 
I/O in the background writer by not writing more than you have to; only 
consider writing buffers that are going to be reused in the near future, 
and let everything else get written out only at checkpoint time.  I 
consider the fact that you've gotten good results in the past with a 
radically different configuration than what's considered normal best 
practice, a configuration that works around problems in 8.2, an 
interesting data point.  I don't see any reason anyone would jump from 
there to expecting that turning the PostgreSQL cache into what's 
essentially a write-through one, the way you describe here, will be 
helpful in most cases, and I'm not sure how you would do it anyway.
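
Pulling those trends together, the general shape of the configuration I'd 
expect on a larger 8.3 system looks something like this (the sizes are 
purely illustrative and need to be matched to the hardware):

    shared_buffers               = 1GB    # let PostgreSQL itself cache more
    checkpoint_segments          = 50
    checkpoint_completion_target = 0.9    # spread the checkpoint writes out
    # background writer kept lazy: just the LRU cleaner, tuned as above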

What I would encourage you to take a look at while you're doing these
experiments is radically lowering the Linux dirty_background_ratio tunable 
(perhaps even to 0) to see what that does for you.  From what I've seen in 
the past, the caching there is more likely to be the root of your problem. 
Hopefully LDC will address your issue so that you don't have to adjust 
this at all, because lowering it will reduce efficiency considerably; but 
it may be the most straightforward way to get the more timely I/O path 
you're obviously looking for.
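
On Linux that's a quick experiment to run, something like this as root 
(the sysctl name assumes a reasonably recent 2.6 kernel):

    # check the current value, then drop it so background writeback starts
    # almost as soon as dirty data shows up
    cat /proc/sys/vm/dirty_background_ratio
    echo 1 > /proc/sys/vm/dirty_background_ratio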

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD