Checkpoint tuning on 8.2.4 - Mailing list pgsql-performance
From | Greg Smith |
---|---|
Subject | Checkpoint tuning on 8.2.4 |
Date | |
Msg-id | Pine.GSO.4.64.0806060137130.16835@westnet.com |
Responses | Re: Checkpoint tuning on 8.2.4 |
List | pgsql-performance |
One of the things to hit my mailbox this week is from someone who is frustrated not only by their database server but by issues sending messages to this list; I'm forwarding it here for them, please reply to all so they get a copy. Here's the basic server information:

> PostgreSQL version - 8.2.4, RHEL4 Linux
> 64-bit, 8 cpu(s), 16GB memory, raid 5 storage.
> The tuning objective is to optimize the PostgreSQL database to handle
> both reads and writes. The database receives continuous inserts,
> updates and deletes on tables with 140+ million records.

The primary problem they're having is really awful checkpoint spikes, which is how I got conne^H^Hvinced into helping out here. I believe this is hardware RAID with a caching controller.

First off, the bad news nobody ever wants to hear: you can't really make this problem go completely away in many situations with 8.2, whereas the new spread checkpoint feature in 8.3 is aimed specifically at this problem. http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm goes over all that, along with introducing some of the ideas I'll toss in below about how to optimize for 8.2. A related paper I did talks about reducing how much memory Linux caches for you when writing heavily, which might be appropriate here as well: http://www.westnet.com/~gsmith/content/linux-pdflush.htm

The other obvious suggestion is that RAID5 is known to be poor at heavy write performance, which makes it really the wrong choice here as well. What this system really wants is to be reconfigured with RAID10 and PostgreSQL 8.3 instead. But since, as always, that's impractical for now, let's take a look at the postgresql.conf settings to see what might be improved immediately:

max_connections = 128
shared_buffers = 400
temp_buffers = 1000
effective_cache_size = 50000
random_page_cost = 2.5

They've experimented with lowering shared_buffers so much here because it helps the problem, but 400 is going a bit too far. You should be able to get to at least a few thousand for that setting without making the problem much worse, and that will help lower general I/O that might block the checkpoint work a bit. Something like 5000 to 20000 would be my guess for a good setting here.

sort_mem = 4194304
vacuum_mem = 2097152
work_mem = 4194304
maintenance_work_mem = 256000

There is no sort_mem or vacuum_mem in 8.2 anymore; they were replaced by work_mem and maintenance_work_mem, so those two lines can be deleted. The values for the active *_mem settings here are on the low side for a system with 16GB of RAM. If we re-cast these in more useful units this is obvious:

work_mem = 4MB
maintenance_work_mem = 256KB

Try work_mem=16MB and maintenance_work_mem=256MB instead as starting values. work_mem could go a lot higher, but you have to be careful to consider how many connections are involved, because this is a per-session parameter.

effective_cache_size is wildly low here; something >8GB is likely more accurate. While it doesn't cause checkpoint issues directly, getting better plans lowers overall system I/O through more efficient use of available resources, and therefore leaves more bandwidth for the writes.
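To pull those memory-related suggestions together, a rough sketch of where I'd start experimenting on this box is below; treat the exact numbers as guesses to measure against, not values I've tested here:

  shared_buffers = 10000            # ~80MB; somewhere in the 5000-20000 range
  work_mem = 16MB                   # per sort/hash and per session; with 128 connections
                                    # this can theoretically add up to 2GB or more at once
  maintenance_work_mem = 256MB
  effective_cache_size = 8GB        # a planner hint, not an allocation

On the operating system side, the write caching issue that pdflush paper covers comes down to the kernel's dirty-memory thresholds. Just to illustrate the kind of change involved (the specific percentages here are assumptions to experiment with, not recommendations for this particular box):

  # /etc/sysctl.conf on RHEL4: start background writeback sooner and cap
  # how much dirty data can accumulate in 16GB of RAM; apply with sysctl -p
  vm.dirty_background_ratio = 5
  vm.dirty_ratio = 10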
bgwriter_lru_percent = 70
bgwriter_lru_maxpages = 800
bgwriter_all_percent = 50
bgwriter_all_maxpages = 800

Ah, the delicate scent of someone on IRC suggesting "oh, checkpoint spikes are taken care of by the background writer, just make that more aggressive and they'll go away". These values are crazy big, and the only reason they work at all is that with shared_buffers=400 and 8 CPUs you can afford to scan all of them every single time and nobody cares. The settings Kevin Grittner settled on, which I mentioned in the 8.2->8.3 paper, are about as aggressive as I've ever seen work well in the real world:

> bgwriter_delay = 200
> bgwriter_lru_percent = 20.0
> bgwriter_lru_maxpages = 200
> bgwriter_all_percent = 10.0
> bgwriter_all_maxpages = 600

I personally will often just turn the background writer off altogether by setting both maxpages parameters to zero, then wait for the surprised looks as the checkpoint spikes get smaller. The 8.2 BGW just isn't effective on modern systems with gigabytes of RAM. It writes the same blocks over and over into the gigantic OS cache, in a way that competes inefficiently for I/O resources with how buffers are naturally evicted anyway when you use the kind of low shared_buffers settings that are a must on 8.2.

fsync = off

Well, this is asking for trouble. The first time your server crashes, I hope you're feeling lucky. I think this system is set up so that it can easily be replaced if there's a problem, so this may not be a huge issue, but it is dangerous to turn fsync off.

checkpoint_segments = 40
checkpoint_timeout = 300
checkpoint_warning = 15

Setting checkpoint_segments to 40 is likely too large for an 8.2 system that's writing heavily. That keeps the number of checkpoints down, so you get fewer spikes, but each one of them will be much larger. Something in the 5-20 range is likely more appropriate here.

vacuum_cost_delay = 750
autovacuum = true
autovacuum_naptime = 3600
autovacuum_vacuum_threshold = 1000
autovacuum_analyze_threshold = 500
autovacuum_vacuum_scale_factor = 0.4
autovacuum_analyze_scale_factor = 0.2
autovacuum_vacuum_cost_delay = -1
autovacuum_vacuum_cost_limit = -1
max_fsm_pages = 5000000
max_fsm_relations = 2000

Now, when I was on the phone about this system, I recall hearing that they've fallen into that ugly trap where they're forced to reload the database altogether regularly to get performance to stay at a reasonable level. That's usually a vacuum problem, and yet another reason to upgrade to 8.3 so you get the improved autovacuum there.
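Pulling the checkpoint and background writer suggestions above into one place, the first experiment I'd run looks something like the below; again, these are starting guesses to measure against the current behavior, not values I've validated on this system:

  bgwriter_lru_maxpages = 0         # turn the 8.2 background writer off...
  bgwriter_all_maxpages = 0         # ...and see whether the spikes actually shrink
  checkpoint_segments = 10          # somewhere in the 5-20 range
  fsync = on                        # unless losing the database after a crash really is acceptable

As a first look at whether autovacuum is keeping up with the heavily updated tables, a query along these lines against the statistics views shows which relations are churning and when they were last vacuumed (just a sketch; it relies on row-level stats being on, which they have to be for autovacuum to do anything):

  SELECT relname, n_tup_ins, n_tup_upd, n_tup_del,
         last_vacuum, last_autovacuum
    FROM pg_stat_user_tables
   ORDER BY n_tup_upd + n_tup_del DESC
   LIMIT 10;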
Vacuum tuning isn't really my bag, and I'm out of time here tonight; anybody else want to make some suggestions on what might be changed here based on what I've shared about the system?

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD