
From Greg Smith
Subject Testing 8.3 LDC vs. 8.2.4 with aggressive BGW
Msg-id Pine.GSO.4.64.0709071630180.10175@westnet.com
In response to Re: Final background writer cleanup for 8.3  ("Kevin Grittner" <Kevin.Grittner@wicourts.gov>)
List pgsql-hackers
Renaming the old thread to more appropriately address the topic:

On Wed, 5 Sep 2007, Kevin Grittner wrote:

> Then I would test the new background writer with synchronous commits under
> the 8.3 beta, using various settings.  The 0.5, 0.7 and 0.9 settings you
> recommended for a test are how far from the LRU end of the cache to look
> for dirty pages to write, correct?

This is alluding to the suggestions I gave at 
http://archives.postgresql.org/pgsql-hackers/2007-08/msg00755.php

checkpoint_completion_target has nothing to do with the LRU, so let's step 
back to fundamentals and talk about what it actually does.  The official 
documentation is at 
http://developer.postgresql.org/pgdocs/postgres/wal-configuration.html

As you generate transactions, Postgres puts data into the WAL.  The WAL is 
organized into segments that are typically 16MB each.  Periodically, the 
system hits a checkpoint where the WAL data up to a certain point is 
guaranteed to have been applied to the database, at which point the old 
WAL files aren't needed anymore and can be reused.  These checkpoints are 
generally caused by one of two things happening:

1) checkpoint_segments worth of WAL files have been written
2) more than checkpoint_timeout seconds have passed since the last 
checkpoint

The system doesn't stop working while the checkpoint is happening; it just 
keeps creating new WAL files.  As long as the checkpoint finishes before 
the next one is required, performance should be fine.
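
To put numbers on that, these are the two postgresql.conf settings 
involved; the values shown are just the stock defaults as I remember them, 
not recommendations:

   checkpoint_segments = 3       # checkpoint after ~3 segments (16MB each) of new WAL
   checkpoint_timeout = 5min     # or after 5 minutes, whichever comes first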

In the 8.2 model, processing the checkpoint occurs as fast as data can be 
written to disk.  In 8.3, the writes can be spread out instead.  What 
checkpoint_completion_target does is suggest how far along the system 
should aim to have finished the current checkpoint relative to when the 
next one is expected.

For example, your current system has checkpoint_segments=10.  Assume that 
you have checkpoint_timeout set to a large number such that the 
checkpoints are typically being driven by the number of segments being 
filled (so you get a checkpoint every 10 WAL segments, period).  If 
checkpoint_completion_target was set to 0.5, the expectation is that the 
writes for the currently executing checkpoint would be finished about the 
time that 0.5*10=5 segments of new WAL data had been written.  If you set 
it to 0.9 instead, you'd expect the checkpoint is finishing just about 
when the 9th WAL segment is being written out, which is cutting things a 
bit tight; somewhere around there is the safe upper limit for that 
parameter.

Now, checkpoint_segments=10 is a pretty low setting, but I'm guessing that 
on your current system that's forcing very regular checkpoints, which 
makes each individual checkpoint have less work to do and therefore 
reduces the impact of the spikes you're trying to avoid.  With LDC and 
checkpoint_completion_target, you can make that number much bigger (I 
suggested 50), which means you'll only have 1/5 as many checkpoints 
causing I/O spikes, and each of those checkpoints will have 5X as long to 
potentially spread the writes over.  The main cost is that it will take 
longer to recover if your database crashes, which hopefully is a rare 
event.
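
In postgresql.conf terms, the kind of 8.3 test setup I'm describing looks 
roughly like this; treat the numbers as starting points to vary rather than 
recommendations, and the timeout is only there to keep time-driven 
checkpoints from getting in the way:

   checkpoint_segments = 50             # fewer, larger checkpoints
   checkpoint_completion_target = 0.5   # also worth trying 0.7 and 0.9
   checkpoint_timeout = 30min           # large enough that segments drive the checkpoints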

Having far fewer checkpoints is obviously a win for your situation, but the 
open question is whether this fashion of spreading them out will reduce 
the I/O spike as effectively as the all-scan background writer in 8.2 has 
been working for you.  This is one aspect that makes your comparison a 
bit tricky.  It's possible that by increasing the segments enough, you'll 
get into a situation where you don't see (m)any of them during your 
testing run of 8.3.  You should try and collect some data on how regularly 
checkpoints are happening during early testing to get an idea if this is a 
possibility.  The usual approach is to set checkpoint_warning to a really 
high number (like the maximum of 3600) and then you'll get a harmless note 
in the logs every time one happens, and that will show you how frequently 
they're happening.  It's kind of important to have an idea how many 
checkpoints you can expect during each test run to put together a fair 
comparison; as you increase checkpoint_segments, you need to start asking 
"how many sluggish transactions am I seeing per checkpoint?" rather than 
how many total per test run.
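
In postgresql.conf that's just:

   checkpoint_warning = 3600   # complain about any checkpoints less than an hour apart

Each checkpoint then leaves a line in the logs along the lines of 
"checkpoints are occurring too frequently (N seconds apart)", and counting 
those per test run (and noting the intervals) tells you what you need here.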

I have a backport to 8.2 of some of the pg_stat_bgwriter features added in 
8.3 that might be helpful for monitoring your benchmarking server (it is 
most certainly *not* suitable to go onto the real one); take a look at 
http://www.westnet.com/~gsmith/content/postgresql/perfmon82.htm 
I put that together specifically to allow easier comparisons of 8.2 and 
8.3 in this area.
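
On 8.3 itself, the simplest way to use that view is to snapshot it before 
and after each test run and diff the counters; the backport exposes roughly 
the same information, though not necessarily every one of these columns.  A 
minimal sketch:

   SELECT checkpoints_timed, checkpoints_req,
          buffers_checkpoint, buffers_clean, buffers_backend, buffers_alloc
     FROM pg_stat_bgwriter;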

> Are the current shared memory and the 1 GB you suggested enough of a 
> spread for these tests?  (At several hours per test in order to get 
> meaningful results, I don't want to get into too many permutations.)

Having a much larger shared_buffers setting should allow you to keep more 
data in memory usefully, which may lead to an overall performance gain due 
to improved efficiency.  With your current configuration, I would guess 
that making the buffer cache bigger would increase the checkpoint spike 
problems, where that shouldn't be as much of a problem with 8.3 because of 
how the checkpoint can be spread out.  The hope here is that by letting 
PostgreSQL cache more and avoiding writes of popular buffers except at 
checkpoint time, your total I/O will be significantly lower with 8.3 
compared to how much an aggressive BGW will write in 8.2.  Right now, 
you've got a pretty low number of pages that accumulate a high usage 
count; that may change if you give the buffer cache a lot more room to 
work.
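
If you want to watch how the usage count distribution shifts as you grow 
shared_buffers, contrib/pg_buffercache is the easy way to do it, assuming 
you have a version of it new enough to expose the usagecount column; for 
example:

   SELECT usagecount, count(*) AS buffers,
          sum(CASE WHEN isdirty THEN 1 ELSE 0 END) AS dirty
     FROM pg_buffercache
    GROUP BY usagecount
    ORDER BY usagecount;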

> Finally, I would try the new checkpoint techniques, with and without the
> new background writer.  Any suggestions on where to set the knobs for
> those runs?

This and your related question about simulating the new LRU behavior by 
"turning off the 'all' scan and setting the lru scan percentage to 50% or 
more" depend on what final form the LRU background writer ends up in. 
With the form 8.3 is currently in, you should certainly consider using 
higher values for the percentage and maxpages parameters, because the all 
scan is no longer doing the majority of the work.  If some form of my JIT 
BGW patch gets applied before beta, you'll still want to increase maxpages, 
but you won't have to play with the percentage anymore; you might try 
adjusting the multiplier setting instead.
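
As a rough sketch of the knobs involved (names as they stand in the current 
8.3 tree, with values that are only illustrations, not recommendations):

   bgwriter_lru_maxpages = 500      # allow more writes per round than the default
   bgwriter_lru_percent = 10.0      # scan further ahead of the clock sweep each round
   # bgwriter_lru_multiplier = 2.0  # only exists if the JIT patch is applied; tune this
   #                                # instead of the percentage in that case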

> I'm inclined to think that it would be interesting to try the benchmarks 
> with the backend writing any dirty page through to the OS at the same 
> time they are written to the PostgreSQL cache, as a reference point at 
> the opposite extreme from having the cache hold onto dirty pages for as 
> long as possible before sharing them with the OS.  Do you see any value 
> in getting actual numbers for that?

It might be an interesting curiosity to see how this works for you, but 
I'm not sure of its value to the community at large.  The configuration 
trend for larger systems seems to be pretty clear at this point:  use 
large values for shared_buffers and checkpoint_segments.  Minimize total 
I/O in the background writer by not writing more than you have to: only 
consider regularly writing buffers that are going to be reused in the near 
future, and let everything else be written out at checkpoint time.  I 
consider it an interesting data point that you've gotten good results in 
the past from a radically different configuration than what's considered 
normal best practice, one that works around problems in 8.2.  But I don't 
see any reason anyone would jump from there to expecting that turning the 
PostgreSQL cache into what's essentially a write-through one, the way you 
describe here, will be helpful in most cases, and I'm not sure how you 
would do it anyway.

What I would encourage you to take a look at while you're doing these
experiments is radically lowering the Linux dirty_background_ratio tunable 
(perhaps even to 0) to see what that does for you.  From what I've seen in 
the past, the caching there is more likely to be the root of your problem. 
Hopefully LDC will address your issue such that you don't have to adjust 
this, because it will lower efficiency considerably, but it may be the 
most straightforward way to get the more timely I/O path you're obviously 
looking for.
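
For reference, that's a kernel tunable rather than a PostgreSQL one; on a 
typical Linux system something like this (as root) changes it for the 
current boot, with /etc/sysctl.conf the place to make it permanent:

   sysctl vm.dirty_background_ratio        # show the current value
   sysctl -w vm.dirty_background_ratio=1   # drop it drastically; 0 is the extreme case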

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

