Thread: Analyse without locking?
Dear All,

I'm wondering whether vacuum/analyse (notably by the autovacuum daemon) is responsible for some deadlocks/dropouts I'm seeing.

One particular table gets hit about 5 times a second (for single-row updates and inserts) plus associated index changes. This is a very light load for the hardware; we have 7 CPU cores idling, and very little disk activity. The query normally runs in about 20 ms.

However, the query must always respond within 200 ms, or userspace gets nasty errors. [We're routing books on a sorter machine, and the book misses its exit opportunity.] Although this is a low load, it's a bit like a heartbeat.

The question is, could the autovacuum daemon (running either in vacuum or in analyse mode) be taking out locks on this table that sometimes cause the query response time to go way up (exceeding 10 seconds)?

I think I've set up autovacuum to do "little and often", using
	autovacuum_vacuum_cost_delay = 20ms
	autovacuum_vacuum_cost_limit = 20
but I'm not sure this is doing exactly what I think it is. In particular, the system-wide I/O (and CPU) limit of autovacuum is negligible, but it's possible that queries may be waiting on locks.

In particular, I want to make sure that the autovacuum daemon never holds any lock for more than about 50 ms at a time (or will release it immediately if something else wants it).

Or am I barking up the wrong tree entirely?

Thanks,

Richard
On Thu, Nov 26, 2009 at 4:20 PM, Richard Neill <rn214@cam.ac.uk> wrote:
> Dear All,
>
> I'm wondering whether vacuum/analyse (notably by the autovacuum daemon)
> is responsible for some deadlocks/dropouts I'm seeing.
>
> One particular table gets hit about 5 times a second (for single row
> updates and inserts) + associated index changes. This is a very light
> load for the hardware; we have 7 CPU cores idling, and very little disk
> activity. The query normally runs in about 20 ms.
>
> However, the query must always respond within 200ms, or userspace gets
> nasty errors. [we're routing books on a sorter machine, and the book
> misses its exit opportunity]. Although this is a low load, it's a bit
> like a heartbeat.
>
> The question is, could the autovacuum daemon (running either in vacuum
> or in analyse mode) be taking out locks on this table that sometimes
> cause the query response time to go way up (exceeding 10 seconds)?
>
> I think I've set up autovacuum to do "little and often", using
>	autovacuum_vacuum_cost_delay = 20ms
>	autovacuum_vacuum_cost_limit = 20
Those are basically thresholds, so in essence you are forcing your autovacuum to be active pretty often. And from what I can read here, you are looking for completely the opposite behaviour, unless you think the statistical image of your table will be completely invalid after 20 modifications to it, which I am sure is not true.
--
GJ
Richard Neill <rn214@cam.ac.uk> writes:
> I'm wondering whether vacuum/analyse (notably by the autovacuum daemon)
> is responsible for some deadlocks/dropouts I'm seeing.
> One particular table gets hit about 5 times a second (for single row
> updates and inserts) + associated index changes. This is a very light
> load for the hardware; we have 7 CPU cores idling, and very little disk
> activity. The query normally runs in about 20 ms.
> However, the query must always respond within 200ms, or userspace gets
> nasty errors. [we're routing books on a sorter machine, and the book
> misses its exit opportunity]. Although this is a low load, it's a bit
> like a heartbeat.
> The question is, could the autovacuum daemon (running either in vacuum
> or in analyse mode) be taking out locks on this table that sometimes
> cause the query response time to go way up (exceeding 10 seconds)?

Hmm. Autovacuum does sometimes take an exclusive lock. It is supposed to release it "on demand", but if I recall the details correctly, that could involve a delay of about deadlock_timeout, or 1s by default. It would be reasonable to reduce deadlock_timeout to 100ms to ensure your external constraint is met.

Delays of up to 10s would not be explained by that, though. Do you have usage spikes of other types? I wonder in particular if you've got checkpoints smoothed out enough.

			regards, tom lane
On Thursday 26 November 2009 17:20:35 Richard Neill wrote:
> Dear All,
>
> I'm wondering whether vacuum/analyse (notably by the autovacuum daemon)
> is responsible for some deadlocks/dropouts I'm seeing.
>
> One particular table gets hit about 5 times a second (for single row
> updates and inserts) + associated index changes. This is a very light
> load for the hardware; we have 7 CPU cores idling, and very little disk
> activity. The query normally runs in about 20 ms.
>
> However, the query must always respond within 200ms, or userspace gets
> nasty errors. [we're routing books on a sorter machine, and the book
> misses its exit opportunity]. Although this is a low load, it's a bit
> like a heartbeat.
>
> The question is, could the autovacuum daemon (running either in vacuum
> or in analyse mode) be taking out locks on this table that sometimes
> cause the query response time to go way up (exceeding 10 seconds)?
>
> I think I've set up autovacuum to do "little and often", using
>	autovacuum_vacuum_cost_delay = 20ms
>	autovacuum_vacuum_cost_limit = 20
> but I'm not sure this is doing exactly what I think it is. In
> particular, the system-wide I/O (and CPU) limit of autovacuum is
> negligible, but it's possible that queries may be waiting on locks.
>
> In particular, I want to make sure that the autovacuum daemon never
> holds any lock for more than about 50ms at a time. (or will release it
> immediately if something else wants it)
>
> Or am I barking up the wrong tree entirely?

I would suggest enabling log_lock_waits and setting deadlock_timeout to a low value - that should give you more information.

Andres
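[Editor's note: a postgresql.conf sketch of what Andres suggests; the GUC is spelled log_lock_waits, and the 100ms figure is Tom's suggestion from earlier in the thread, shown for illustration rather than as a tested recommendation.]

```
log_lock_waits = on        # log a message whenever a lock wait exceeds deadlock_timeout
deadlock_timeout = 100ms   # lock-wait logging threshold; also shortens autovacuum's
                           # "release on demand" latency that Tom describes
```

After editing postgresql.conf, reload the server for these to take effect.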
Richard Neill wrote:
> Or am I barking up the wrong tree entirely?

If you haven't already tuned checkpoint behavior, it's more likely that's causing a dropout than autovacuum. See the checkpoint_segments section of http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server for an intro.

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com
Greg Smith wrote:
> Richard Neill wrote:
>> Or am I barking up the wrong tree entirely?
> If you haven't already tuned checkpoint behavior, it's more likely
> that's causing a dropout than autovacuum. See the checkpoint_segments
> section of http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server
> for an intro.

Thanks - I did that already - it's currently
	checkpoint_segments = 64

Now, I understand that increasing checkpoint_segments is generally a good thing (subject to some limit), but doesn't that just mean that instead of, say, a 1-second outage every minute, it's a 10-second outage every 10 minutes?

Also, correct me if I'm wrong, but mere selects shouldn't cause any addition to the WAL. I'd expect that a simple row insert might require perhaps 1 kB of disk writes(*), in which case we're looking at only a few kB/sec at most of writes in normal use.

Is it possible (or even sensible) to do a manual vacuum analyze with nice/ionice?

Richard

(*) A typical write should be about 80 bytes of data, in terms of how much is actually being stored. I'm using the engineers' "rule of 10" approximation to call that 1 kB, based on indexes and incomplete pages.
Richard Neill <rn214@cam.ac.uk> writes:
> Now, I understand that increasing checkpoint_segments is generally a
> good thing (subject to some limit), but doesn't that just mean that
> instead of say a 1 second outage every minute, it's a 10 second outage
> every 10 minutes?

In recent PG versions you can spread the checkpoint I/O out over a period of time, so it shouldn't be an "outage" at all, just background load. Other things being equal, a longer checkpoint cycle is better since it improves the odds of being able to coalesce multiple changes to the same page into a single write. The limiting factor is your threshold of pain on how much WAL-replay work would be needed to recover after a crash.

> Is it possible (or even sensible) to do a manual vacuum analyze with
> nice/ionice?

There's no support for that in PG. You could try manually renice'ing the backend that's running your VACUUM, but I'm not sure how well it would work; there are a number of reasons why it might be counterproductive. Fooling with the vacuum_cost_delay parameters is the recommended way to make a vacuum run slower and use less of the machine.

			regards, tom lane
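[Editor's note: from 8.4 the cost-delay knobs can, if I recall correctly, also be set per table via storage parameters, so one busy table can be throttled differently from the rest. A sketch - the table name is hypothetical and the values are illustrative, not recommendations:]

```
-- autovacuum_vacuum_cost_delay is given in milliseconds here
ALTER TABLE book_events SET (autovacuum_vacuum_cost_delay = 10,
                             autovacuum_vacuum_cost_limit = 500);
```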
Richard Neill wrote:
> Now, I understand that increasing checkpoint_segments is generally a
> good thing (subject to some limit), but doesn't that just mean that
> instead of say a 1 second outage every minute, it's a 10 second outage
> every 10 minutes?

That was the case in versions before 8.3. Now the I/O is spread out over most of the next checkpoint's time period, so what actually happens is that all the I/O that happens over 10 minutes will be spread out over the next five minutes of time. With the defaults, there's so little time between checkpoints under heavy writes that the spreading doesn't have enough room to work, leading to higher write bursts.

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com
Thanks for your explanations.

Tom Lane wrote:
> Richard Neill <rn214@cam.ac.uk> writes:
>> Now, I understand that increasing checkpoint_segments is generally a
>> good thing (subject to some limit), but doesn't that just mean that
>> instead of say a 1 second outage every minute, it's a 10 second outage
>> every 10 minutes?
>
> In recent PG versions you can spread the checkpoint I/O out over a
> period of time, so it shouldn't be an "outage" at all, just background
> load. Other things being equal, a longer checkpoint cycle is better
> since it improves the odds of being able to coalesce multiple changes
> to the same page into a single write. The limiting factor is your
> threshold of pain on how much WAL-replay work would be needed to recover
> after a crash.

That makes sense. I think that 64 is sane - it means crash-recovery takes less than 1 minute, yet we aren't seeing the warning that checkpoints are too frequent.

>> Is it possible (or even sensible) to do a manual vacuum analyze with
>> nice/ionice?
>
> There's no support for that in PG. You could try manually renice'ing
> the backend that's running your VACUUM but I'm not sure how well it
> would work; there are a number of reasons why it might be
> counterproductive. Fooling with the vacuum_cost_delay parameters is the
> recommended way to make a vacuum run slower and use less of the machine.

I see why it might not work well - priority inversion etc. What I was trying to achieve is to say that vacuum can have all the spare idle CPU/IO that's available, but must *immediately* back off when something else needs the CPU/IO/locks. For example,

	nice -n 20 yes > /dev/null
	ionice -c 3 dd if=/dev/zero > tmp.del

will both get quite a lot of work done on a medium-loaded system (try this on your own laptop), but have zero impact on other processes.

On the other hand, changing vacuum_cost_delay means that vacuum runs slowly even if the CPU is otherwise idle, yet it still impacts on the responsiveness of some queries.

Richard
Dear All,

I'm still puzzled by this one - it looks like it's causing about 5% of queries to rise in duration from ~300 ms to 2-6 seconds.

On the other hand, the system never seems to be I/O bound. (We have at least 25 MB/sec of write bandwidth, and use a small fraction of that normally.)

Here's the typical checkpoint log:

2009-12-03 06:21:21 GMT LOG: checkpoint complete: wrote 12400 buffers (2.2%); 0 transaction log file(s) added, 0 removed, 12 recycled; write=149.883 s, sync=5.143 s, total=155.040 s

We're using 8.4.1, on ext4 with SSD. Is it possible that something exotic is occurring to do with write barriers (on by default in ext4, and we haven't changed this)?

Perhaps a low-priority IO process for writing the previous WAL to disk is blocking a high-priority transaction (which is trying to write to the new WAL). If the latter is trying to sync, could the large amount of lower-priority IO be getting in the way thanks to write barriers? If so, can I safely turn off write barriers?

Thanks,

Richard

P.S. Should I rename this thread?

Richard Neill wrote:
> Dear All,
>
> It definitely looks checkpoint-related - the checkpoint timeout is set
> to 5 minutes, and here is a graph of our response time (in ms) over a 1
> hour period. The query is pretty much identical each time.
>
> Any ideas what I could do to make checkpoints not hurt performance like
> this?
>
> Thanks,
>
> Richard
>
> Tom Lane wrote:
>> Richard Neill <rn214@cam.ac.uk> writes:
>>> Now, I understand that increasing checkpoint_segments is generally a
>>> good thing (subject to some limit), but doesn't that just mean that
>>> instead of say a 1 second outage every minute, it's a 10 second
>>> outage every 10 minutes?
>>
>> In recent PG versions you can spread the checkpoint I/O out over a
>> period of time, so it shouldn't be an "outage" at all, just background
>> load. Other things being equal, a longer checkpoint cycle is better
>> since it improves the odds of being able to coalesce multiple changes
>> to the same page into a single write. The limiting factor is your
>> threshold of pain on how much WAL-replay work would be needed to recover
>> after a crash.
Richard Neill wrote:
> On the other hand, the system never seems to be I/O bound. (we have at
> least 25 MB/sec of write bandwidth, and use a small fraction of that
> normally).

I would bet that if you sampled vmstat or iostat every single second, you'd discover there's a large burst in write speed for the same few seconds that queries are stuck. If you're averaging out the data over a 5-second or longer period, you'll never see it - the spike will get lost in the average. You just can't monitor checkpoint spikes unless you're watching I/O with an extremely tight time resolution. Watching the "Writeback" figure in /proc/meminfo is helpful too; that is where I normally see everything jammed up.

> Here's the typical checkpoint logs:
> 2009-12-03 06:21:21 GMT LOG: checkpoint complete: wrote 12400 buffers
> (2.2%); 0 transaction log file(s) added, 0 removed, 12 recycled;
> write=149.883 s, sync=5.143 s, total=155.040 s

See that "sync" number there? That's your problem; while that sync operation is going on, everybody else is grinding to a halt waiting for it. Not a coincidence that the duration is about the same amount of time that your queries are getting stuck. This example shows 12400 buffers = 97MB of total data written. Since those writes are pretty random I/O, it's easily possible to get stuck for a few seconds waiting for that much data to make it out to disk. You only gave the write phase a couple of minutes to spread things out over; meanwhile, Linux may not even bother starting to write things out until 30 seconds into that, so the effective time between when writes to disk start and when the matching sync happens on your system is extremely small. That's not good - you have to give that several minutes of breathing room if you want to avoid checkpoint spikes.

> We're using 8.4.1, on ext4 with SSD. Is it possible that something
> exotic is occurring to do with write barriers (on by default in ext4,
> and we haven't changed this).
> Perhaps a low priority IO process for writing the previous WAL to disk
> is blocking a high-priority transaction (which is trying to write to
> the new WAL). If the latter is trying to sync, could the large amount
> of lower priority IO be getting in the way thanks to write barriers?
> If so, can I safely turn off write barriers?

Linux is pretty dumb in general here. fsync operations will usually end up writing out way more of the OS buffer cache than they need to. And the write cache can get quite big before pdflush decides it should actually do some work; the whole thing is optimized for throughput rather than latency. I don't really trust barriers at all, so I don't know if there's some specific tuning you can do with those to improve things. Your whole system is bleeding-edge craziness IMHO - SSD, ext4, write barriers, all stuff that just doesn't work reliably yet as far as I'm concerned. ...but that's not what you want to hear.

What I can suggest that should help is increasing checkpoint_segments (>32), checkpoint_timeout (>=10 minutes), checkpoint_completion_target (0.9), and lowering the amount of writes Linux will cache before it gets more aggressive about flushing them. Those things will fight the root cause of the problem, by giving more time between the "write" and "sync" phases of the checkpoint. It's OK if "write" takes a long while; decreasing the "sync" number is the goal you need to keep your eye on.

I've written a couple of articles on this specific topic if you want more background on the underlying issues; it's kind of heavy reading:

http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm
http://www.westnet.com/~gsmith/content/linux-pdflush.htm
http://notemagnet.blogspot.com/2008/08/linux-write-cache-mystery.html

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com
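[Editor's note: a minimal sketch of the one-second sampling Greg recommends, assuming a Linux host with /proc/meminfo; run it alongside `vmstat 1` while the slow queries happen and look for a Writeback burst.]

```shell
#!/bin/sh
# Print a timestamped sample of Dirty and Writeback from /proc/meminfo
# once a second; a Writeback spike lasting a few seconds, coinciding
# with stuck queries, is the checkpoint sync phase Greg describes.
for i in 1 2 3 4 5; do
    printf '%s ' "$(date '+%H:%M:%S')"
    grep -E '^(Dirty|Writeback):' /proc/meminfo | tr '\n' ' '
    echo ''
    sleep 1
done
```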
On Sat, Nov 28, 2009 at 6:57 PM, Richard Neill <rn214@cam.ac.uk> wrote:
> Greg Smith wrote:
>> Richard Neill wrote:
>>> Or am I barking up the wrong tree entirely?
>> If you haven't already tuned checkpoint behavior, it's more likely
>> that's causing a dropout than autovacuum. See the checkpoint_segments
>> section of http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server
>> for an intro.
>
> Thanks - I did that already - it's currently
>	checkpoint_segments = 64
>
> Now, I understand that increasing checkpoint_segments is generally a good
> thing (subject to some limit), but doesn't that just mean that instead of
> say a 1 second outage every minute, it's a 10 second outage every 10
> minutes?
>
> Also, correct me if I'm wrong, but mere selects shouldn't cause any addition
> to the WAL. I'd expect that a simple row insert might require perhaps 1kB of
> disk writes(*), in which case we're looking at only a few kB/sec at most of
> writes in normal use.
>
> Is it possible (or even sensible) to do a manual vacuum analyze with
> nice/ionice?

This is the job of autovacuum_vacuum_cost_delay and vacuum_cost_delay.

About checkpoints, you could also try setting:
	synchronous_commit = off
Please note that you may lose some recently-committed transactions if the server crashes badly (but that shouldn't cause database corruption, unlike fsync = off).

If you are running on Linux, you could try to monitor (rrd is your friend) /proc/meminfo and specifically the "Dirty" field. Read your syslog to see if the checkpoint is a problem. Here is a sample of mine (cleaned):

checkpoint complete: wrote 3117 buffers (1.2%); 0 transaction log file(s) added, 0 removed, 3 recycled; write=280.213 s, sync=0.579 s, total=280.797 s

The more Dirty pages (/proc/meminfo), the longer your sync time. A high sync time can easily "lock" your server. To reduce the dirty pages, tune /proc/sys/vm/dirty_background_ratio; I have set it to "1" on my 32GB servers.

You should also be careful about all the other /proc/sys/vm/dirty_* settings, and specifically /proc/sys/vm/dirty_ratio: the maximum percentage of total memory that can be filled with dirty pages before processes are forced to write dirty buffers themselves during their time slice instead of being allowed to do more writes. Note that all processes are blocked for writes when this happens, not just the one that filled the write buffers.

About "ionice": it only works with the CFQ I/O scheduler, and CFQ is a very bad idea when using PostgreSQL.

--
Laurent "ker2x" Laborde
Sysadmin & DBA at http://www.over-blog.com/
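[Editor's note: Laurent's vm tuning, expressed as an /etc/sysctl.conf fragment; the value of 1 is what he reports using on his 32GB servers, so test it against your own RAM size, and review the other vm.dirty_* settings he mentions before copying this. Apply with `sysctl -p`.]

```
# Start background writeback when 1% of RAM is dirty, instead of the
# much larger default, so the checkpoint sync phase has less to flush.
vm.dirty_background_ratio = 1
```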
Greg Smith wrote:
> Richard Neill wrote:
>> Here's the typical checkpoint logs:
>> 2009-12-03 06:21:21 GMT LOG: checkpoint complete: wrote 12400 buffers
>> (2.2%); 0 transaction log file(s) added, 0 removed, 12 recycled;
>> write=149.883 s, sync=5.143 s, total=155.040 s
> See that "sync" number there? That's your problem; while that sync
> operation is going on, everybody else is grinding to a halt waiting for
> it. Not a coincidence that the duration is about the same amount of
> time that your queries are getting stuck. This example shows 12400
> buffers = 97MB of total data written. Since those writes are pretty
> random I/O, it's easily possible to get stuck for a few seconds waiting
> for that much data to make it out to disk. You only gave the write
> phase a couple of minutes to spread things out over; meanwhile, Linux
> may not even bother starting to write things out until 30 seconds into
> that, so the effective time between when writes to disk start and when
> the matching sync happens on your system is extremely small. That's not
> good--you have to give that several minutes of breathing room if you
> want to avoid checkpoint spikes.

I wonder how common this issue is? When we implemented spreading of the write phase, we had long discussions about spreading out the fsyncs too, but in the end it wasn't done. Perhaps it is time to revisit that now that 8.3 has been out for some time and people have experience with the load-distributed checkpoints.

I'm not sure how the spreading of the fsync()s should work - it's hard to estimate how long each fsync() is going to take, for example - but surely something would be better than nothing.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Heikki Linnakangas wrote:
> I wonder how common this issue is? When we implemented spreading of the
> write phase, we had long discussions about spreading out the fsyncs too,
> but in the end it wasn't done. Perhaps it is time to revisit that now
> that 8.3 has been out for some time and people have experience with the
> load-distributed checkpoints.

Circa 8.2, I ran into checkpoint problems all the time. With the spreading logic in 8.3, properly set up, the worst case is so improved that I usually find something else more pressing to tune, rather than worry about the exact details of the sync process. It seems to have hit the "good enough" point where it's hard to justify time for further improvements when there are other things to work on. I'd still like to see spread fsync happen one day; it just hasn't been a priority for any systems I've had to improve lately.

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com
Dear All,

Thanks for all your help so far. This page was particularly helpful:
	http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm
(Does the advice for 8.3 apply unchanged to 8.4?)

I'm still hitting issues with this though: sync is taking 7-10 seconds and I need to get it down to nearer 3. We're running a lightly-loaded system which has a realtime response requirement of 0.5 seconds all the time (with a few seconds permissible occasionally, but never exceeding 10).

So far, I've set checkpoint_segments to 128, timeout to 10min, and completion_target to 0.8. This helps, but not as much as I'd hoped. But I haven't touched any of the other WAL or BG writer settings.

Where should I look next? Should I be looking at the BG writer settings, or should I look at the Linux VM configuration (eg changing /proc/sys/vm/dirty_background_ratio from 5 to 1)? Or would it be most useful to try to move the WAL to a different disk?

Latest messages:

# tail -f /var/log/postgresql/postgresql-8.4-main.log | grep check
2009-12-08 09:12:00 GMT LOG: checkpoint starting: time
2009-12-08 09:20:09 GMT LOG: checkpoint complete: wrote 51151 buffers (8.9%); 0 transaction log file(s) added, 0 removed, 23 recycled; write=479.669 s, sync=9.852 s, total=489.553 s
2009-12-08 09:22:00 GMT LOG: checkpoint starting: time
2009-12-08 09:30:07 GMT LOG: checkpoint complete: wrote 45772 buffers (7.9%); 0 transaction log file(s) added, 0 removed, 24 recycled; write=479.706 s, sync=7.337 s, total=487.120 s
2009-12-08 09:32:00 GMT LOG: checkpoint starting: time
2009-12-08 09:40:09 GMT LOG: checkpoint complete: wrote 47043 buffers (8.2%); 0 transaction log file(s) added, 0 removed, 22 recycled; write=479.744 s, sync=9.300 s, total=489.122 s
2009-12-08 09:42:00 GMT LOG: checkpoint starting: time
2009-12-08 09:50:07 GMT LOG: checkpoint complete: wrote 48210 buffers (8.4%); 0 transaction log file(s) added, 0 removed, 23 recycled; write=479.689 s, sync=7.707 s, total=487.416 s

Thanks a lot,

Richard
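[Editor's note: since the "sync" figure is the number to watch, it can be pulled straight out of those log lines. A small sketch matching the 8.4 log format shown above; the log path is the one from the thread.]

```shell
#!/bin/sh
# Print one sync duration (in seconds) per "checkpoint complete" line,
# so the spikes are easy to eyeball or graph over time.
extract_sync() {
    grep 'checkpoint complete' | sed -n 's/.*sync=\([0-9.]*\) s.*/\1/p'
}
extract_sync < /var/log/postgresql/postgresql-8.4-main.log
```

On the log excerpt above, this prints 9.852, 7.337, 9.300 and 7.707, one per line.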
Richard Neill <rn214@cam.ac.uk> wrote:
> So far, I've set checkpoint_segments to 128, timeout to 10min, and
> completion_target to 0.8. This helps, but not as much as I'd
> hoped.
>
> But I haven't touched any of the other WAL or BG Writer settings.
>
> Where should I look next?

On our web servers, where we had similar issues, we seem to be doing OK using:

	bgwriter_lru_maxpages = 1000
	bgwriter_lru_multiplier = 4.0

The other thing which can help this problem is keeping shared_buffers smaller than most people recommend. We use 512MB on our larger web server and 256MB on other servers. (Be sure to test with your actual load, as this might or might not degrade overall performance.)

-Kevin
Richard Neill wrote:
> (does the advice for 8.3 apply unchanged to 8.4?)

Yes; no changes in this area for 8.4. The main performance-related things that changed between 8.3 and 8.4 are:

1) VACUUM free space management was reimplemented, so that the max_fsm_* parameters aren't needed anymore
2) default_statistics_target now starts at 100 instead of 10

> So far, I've set checkpoint_segments to 128, timeout to 10min, and
> completion_target to 0.8. This helps, but not as much as I'd hoped.

Good; if the problem is moving in the right direction, you're making progress.

> But I haven't touched any of the other WAL or BG Writer settings.
> Where should I look next?
> Should I be looking at the BG Writer settings,
> or should I look at the Linux VM configuration?
> (eg changing /proc/sys/vm/dirty_background_ratio from 5 to 1)

I would start by reducing dirty_background_ratio; as RAM sizes climb, this keeps becoming a bigger issue. The whole disk-flushing code finally got a major overhaul in the 2.6.32 Linux kernel; I'm hoping this whole class of problem was improved by the changes made.

Changes to the background writer behavior will probably not work as you'd expect. The first thing I'd try in your situation is turning it off altogether; it can be slightly counterproductive for reducing checkpoint issues if they're really bad, which yours are. If that goes in the wrong direction, experimenting with increasing the maximum pages and the multiplier might be useful, though I wouldn't bet on it helping. As Kevin already mentioned, reducing the size of the buffer cache can help too. That's worth trying once you've exhausted the other obvious possibilities.

> Or would it be most useful to try to move the WAL to a different disk?

On Linux, having the WAL on a separate disk can improve things much more than you might expect, simply because of how brain-dead the filesystem fsync implementation is. Reducing the seeks for WAL traffic can help a lot too.

If you've lowered Linux's caching, tried some BGW tweaks, and moved the WAL somewhere else, and latency is still high, you may be facing a hardware upgrade to improve things. Sometimes these problems just require more burst write throughput (regardless of how good average performance looks) and nothing else will substitute. Hopefully you'll find a tuning solution before that, though.

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com
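[Editor's note: the usual recipe for moving the WAL is a move plus a symlink, done with the cluster shut down. The sketch below runs the same steps against throwaway directories so it is safe to show; against a real cluster the two paths would be your data directory and a mount point on the separate disk.]

```shell
#!/bin/sh
# Demonstrate the pg_xlog relocation steps on stand-in directories.
# For a real cluster: stop the server first, then substitute the actual
# data directory and the dedicated WAL disk's mount point.
set -e
DATADIR=$(mktemp -d)     # stand-in for the real data directory
WALDISK=$(mktemp -d)     # stand-in for the separate disk's mount point
mkdir "$DATADIR/pg_xlog"
mv "$DATADIR/pg_xlog" "$WALDISK/pg_xlog"
ln -s "$WALDISK/pg_xlog" "$DATADIR/pg_xlog"
ls -ld "$DATADIR/pg_xlog"    # now a symlink pointing at the WAL disk
```

The server follows the symlink transparently, so all WAL traffic lands on the dedicated disk after restart.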
On Wednesday 09 December 2009 03:05:40 Greg Smith wrote:
> On Linux having the WAL on a separate disk can improve things much more
> than you might expect, simply because of how brain-dead the filesystem
> fsync implementation is. Reducing the seeks for WAL traffic can help a
> lot too.

Not using ext3's data=ordered already helps massively.

Andres