Checkpoint tuning on 8.2.4 - Mailing list pgsql-performance

From Greg Smith
Subject Checkpoint tuning on 8.2.4
Date
Msg-id Pine.GSO.4.64.0806060137130.16835@westnet.com
Whole thread Raw
Responses Re: Checkpoint tuning on 8.2.4
List pgsql-performance
One of the things to hit my mailbox this week is from someone who is
frustrated not only by their database server but by issues sending
messages to this list; I'm forwarding to here for them, please reply to
all so they get a copy.

Here's the basic server information:

> PostgreSQL version - 8.2.4, RHEL4 Linux
> 64-bit, 8 cpu(s), 16GB memory, raid 5 storage.
> The tuning objective is to optimize the PostgreSQL database to handle
> both reads and writes.  The database receives continuous inserts,
> updates and deletes on tables with 140+ million records.

The primary problem they're having are really awful checkpoint spikes,
which is how I got conne^H^Hvinced into helping out here.  I belive this
is hardware RAID with a caching controller.

First off, the bad news nobody ever wants to hear:  you can't really make
this problem go completely away in many situations with 8.2, whereas the
new spread checkpoint feature in 8.3 is aimed specifically at this
problem.
http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm goes
over all that, along with introducing some of the ideas I'll toss in below
about how to optimize for 8.2.  A related paper I did talks about reducing
how much memory Linux caches for you when writing heavily, that might be
appropriate here as well:
http://www.westnet.com/~gsmith/content/linux-pdflush.htm

The other obvious suggestion is that RAID5 is known to be poor at heavy
write performance, which makes it really the wrong choice here as well.
What this system really wants to have done to it is to be reconfigured
with RAID10 and PostgreSQL 8.3 instead.  But since as always that's
impractical for now, let's take a look at the postgresql.settings to see
what might be improved immediately:

max_connections = 128
shared_buffers = 400
temp_buffers = 1000
effective_cache_size = 50000
random_page_cost = 2.5

They've experimented with lowering shared_buffers so much here because it
helps the problem, but 400 is going a bit too far.  You should be able to
get at to least a few thousand for that setting without making the problem
much worse, and that will help lower general I/O that might block the
checkpoint work a bit.  Something like 5000 to 20000 would be my guess for
a good setting here.

sort_mem = 4194304
vacuum_mem = 2097152
work_mem = 4194304
maintenance_work_mem = 256000

There is no sort_mem or vacuum_mem in 8.2 anymore, so those can be
deleted:  replaced by work_mem and maintenance_work_mem.  The values for
all the active *_mem settings here are on the low side for a system with
16GB of RAM.  If we re-cast these with more useful units this is obvious:

work_mem = 4MB
maintenance_work_mem = 256KB

Try work_mem=16MB and maintenance_work_mem=256MB instead as starting
values.  work_mem could go a lot higher, but you have to have to be
careful to consider how many connections are involved because this is a
per-session parameter.

effective_cache_size is wildly low here; something >8GB is likely more
accurate.  While not directly causing checkpoint issues, getting better
plans can lower overall system I/O through more efficient use of available
resources and therefore leave more bandwidth for the writes.

bgwriter_lru_percent = 70
bgwriter_lru_maxpages = 800
bgwriter_all_percent = 50
bgwriter_all_maxpages = 800

Ah, the delicate scent of someone on IRC suggesting "oh, checkpoints
spikes are taken care of by the background writer, just make that more
aggressive and they'll go away".  These values are crazy big, and the only
reason they work at all is that with shared_buffers=400 and 8 CPUs you can
afford to scan them every single time and nobody cares.  The settings
Kevin Grittner settled on that I mentioned in the 8.2->8.3 paper are about
as aggressive as I've ever seen work well in the real world:

> bgwriter_delay = 200
> bgwriter_lru_percent = 20.0
> bgwriter_lru_maxpages = 200
> bgwriter_all_percent = 10.0
> bgwriter_all_maxpages = 600

I personally will often just turn the background writer off all together
by setting both maxpages parameters to zero, and wait for the surprised
looks as the checkpoint spikes get smaller.  The 8.2 BGW just isn't
effective in modern systems with gigabytes of RAM.  It writes the same
blocks over and over into the gigantic OS cache, in a way that competes
inefficiently for I/O resources with how buffers are naturally evicted
anyway when you use the kind of low shared_buffers settings that are a
must on 8.2.

fsync = off

Well, this is asking for trouble.  The first time your server crashes, I
hope you're feeling lucky.  I think this system is setup so that it can
easily be replaced if there's a problem, so this may not be a huge
problem, but it is dangerous to turn fsync off.

checkpoint_segments = 40
checkpoint_timeout = 300
checkpoint_warning = 15

setting checkpoint_segments to 40 is likely too large for an 8.2 system
that's writing heavily.  That keeps the number of checkpoints down, so you
get less spikes, but each one of them will be much larger.  Something in
the 5-20 range is likely more appropriate here.

vacuum_cost_delay = 750
autovacuum = true
autovacuum_naptime = 3600
autovacuum_vacuum_threshold = 1000
autovacuum_analyze_threshold = 500
autovacuum_vacuum_scale_factor = 0.4
autovacuum_analyze_scale_factor = 0.2
autovacuum_vacuum_cost_delay = -1
autovacuum_vacuum_cost_limit = -1
max_fsm_pages = 5000000
max_fsm_relations = 2000

Now, when I was on the phone about this system, I recall hearing that
they've fallen into that ugly trap where they are forced to reload this
database altogether regularly to get performance to stay at a reasonable
level.  That's usually a vacuum problem, and yet another reason to upgrade
to 8.3 so you get the improved autovacuum there.  Vacuum tuning isn't
really my bag, and I'm out of time here tonight; anybody else want to make
some suggestions on what might be changed here based on what I've shared
about the system?

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

pgsql-performance by date:

Previous
From: Dan Harris
Date:
Subject: Re: query performance question
Next
From: Volkan YAZICI
Date:
Subject: Optimizing AGE()