Thread: Huge iowait during checkpoint finish
Hello dear list members,

I have a strange problem with my new 8.4 deployment, one I never encountered on my previous 8.3 deployment: IOwait values are extremely high exactly when Postgres finishes a checkpoint. During the checkpoint itself (which is quite lengthy) IOwait is very low. Why does this happen, and how can I fix it?

Take a look at this chart: http://kpox.s3.amazonaws.com/cpu-day.png

The latest (rightmost) IOwait peak corresponds to this log entry:

2010-01-08 18:40:36 CET LOG: checkpoint complete: wrote 46357 buffers (35.4%); 0 transaction log file(s) added, 0 removed, 40 recycled; write=2502.800 s, sync=79.972 s, total=2583.184 s

The IOwait peak starts at about 18:39:30 and finishes at 18:40:30, so I think all the IOwait comes during the sync phase, which took 80 s according to the log entry.

Config details that might be related to the problem:

bgwriter_lru_maxpages = 0 # BG writer is off
checkpoint_segments = 45
checkpoint_timeout = 60min
checkpoint_completion_target = 0.9

I had mostly the same config with my 8.3 deployment, but the hardware is different: the disk is a software RAID-5 with 3 hard drives, and the operating system is Ubuntu 9.10 Server x64.

Thanks.
Anton.
Anton Belyaev wrote:
> I think all the IOwait comes during sync time, which is 80 s,
> according to the log entry.

I believe you are correctly diagnosing the issue. The "sync time" entry in the log was added specifically to make it easier to confirm that this problem exists on a given system.

> bgwriter_lru_maxpages = 0 # BG writer is off
> checkpoint_segments = 45
> checkpoint_timeout = 60min
> checkpoint_completion_target = 0.9

These are reasonable settings. You can look at pg_stat_bgwriter to get more statistics about your checkpoints; grab a snapshot of that now, another one later, and then compute the difference between the two. I've got an example of that at http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm

You should be aiming to have a checkpoint no more often than every 5 minutes, and on a write-heavy system shooting for closer to every 10 is probably more appropriate. Do you know how often they're happening on yours? Two pg_stat_bgwriter snapshots from a couple of hours apart, with a timestamp on each, can be used to figure that out.

> I had mostly the same config with my 8.3 deployment.
> But hardware is different:
> Disk is software RAID-5 with 3 hard drives.
> Operating system is Ubuntu 9.10 Server x64.

Does the new server have a lot more RAM than the 8.3 one? Some of the problems in this area get worse the more RAM you've got. Does the new server use ext4 while the old one used ext3?

Basically, you have a couple of standard issues here:

1) You're using RAID-5, which is not known for good write performance. Are you sure the disk array performs well on writes? If you didn't benchmark it, you can't be sure.

2) Linux is buffering a lot of writes that are only making it to disk at checkpoint time. This could be simply because of (1)--maybe the disk is always overloaded. But it's possible this is just due to excessive Linux buffering being lazy about the writes.
I wrote something about that topic at http://notemagnet.blogspot.com/2008/08/linux-write-cache-mystery.html that you might find interesting.

--
Greg Smith  2ndQuadrant  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com
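The snapshot-diff approach Greg describes can be sketched with a bit of shell. The psql invocation is shown as a comment, and the snapshot values below are made-up example numbers, not measurements from Anton's system:

```shell
#!/bin/sh
# Snapshot pg_stat_bgwriter twice, a couple of hours apart, e.g.:
#   psql -At -c "SELECT extract(epoch from now())::bigint,
#                       checkpoints_timed + checkpoints_req
#                  FROM pg_stat_bgwriter"
# Example snapshot values (epoch seconds | total checkpoint count):
snap1="1263000000|1042"
snap2="1263007200|1044"

# checkpoints per hour = delta(count) / delta(hours)
echo "$snap1 $snap2" | tr '|' ' ' | awk \
  '{ printf "%.1f checkpoints/hour\n", ($4 - $2) / (($3 - $1) / 3600) }'
# -> 1.0 checkpoints/hour
```

With real snapshots, anything much above 12 checkpoints/hour on a write-heavy system suggests checkpoint_segments is too low.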
On Fri, Jan 8, 2010 at 3:07 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Basically, you have a couple of standard issues here:
>
> 1) You're using RAID-5, which is not known for good write performance.
> Are you sure the disk array performs well on writes? And if you didn't
> benchmark it, you can't be sure.

This can be doubly bad if he's now moved to a set of disks that are properly obeying fsync but was on disks that were lying about it before.
Hello Greg,

Thanks for your extensive reply.

2010/1/9 Greg Smith <greg@2ndquadrant.com>:
> You should be aiming to have a checkpoint no more than every 5 minutes, and
> on a write-heavy system shooting for closer to every 10 is probably more
> appropriate. Do you know how often they're happening on yours?

Checkpoints happen about once an hour, sometimes a bit more often (every 30 minutes) during daily peaks.

> Does the new server have a lot more RAM than the 8.3 one? Some of the
> problems in this area get worse the more RAM you've got.

Yes, the new server has 12 GB while the old one has only 8 GB.

> Does the new server use ext4 while the old one used ext3?

Same ext3 filesystem.

> 1) You're using RAID-5, which is not known for good write performance. Are
> you sure the disk array performs well on writes? And if you didn't
> benchmark it, you can't be sure.

I did some dd benchmarks (according to http://www.westnet.com/~gsmith/content/postgresql/pg-disktesting.htm):
The old server with its "hardware" RAID-1 shows 60 MB/s on write.
The new server with software RAID-5 shows 85 MB/s on write.

> 2) Linux is buffering a lot of writes that are only making it to disk at
> checkpoint time. This could be simply because of (1)--maybe the disk is
> always overloaded. But it's possible this is just due to excessive Linux
> buffering being lazy about the writes. I wrote something about that topic
> at http://notemagnet.blogspot.com/2008/08/linux-write-cache-mystery.html

The old server has:
dirty_ratio = 10
dirty_background_ratio = 5

The new server had:
dirty_ratio = 20
dirty_background_ratio = 10

Putting all the tests and measurements above together: the server has more RAM, leaving Linux more room for write cache. During the dd test, Dirty in /proc/meminfo went up to 2 GB. RAID-5 is a bit faster (at least on sequential writes). The drives aren't overloaded, because their utilization during the lengthy checkpoint is low. The IOwait problems occur only during the final sync part of the checkpoint, and during this short period the drives are almost 100% utilized (according to sar -d 1).

I played a bit with setting dirty_background_ratio = 1, but somehow this had a negative effect, which is strange. I hoped it would spread the load from the 2 min sync period across the 1 hour checkpoint span, but it did not.

As a result, I still don't know where the real problem is. The drives aren't overloaded. The Linux cache is really mysterious, but modifying its parameters does not give the desired effect.

Thanks.
Anton.
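For anyone following along, the dirty-cache knobs and counters mentioned above can be inspected like this (the sysctl commands require root, so they are shown as comments):

```shell
#!/bin/sh
# Current Linux write-cache thresholds, as a percent of RAM:
cat /proc/sys/vm/dirty_ratio
cat /proc/sys/vm/dirty_background_ratio

# How much dirty data is currently waiting to be flushed:
grep '^Dirty:' /proc/meminfo

# To lower the thresholds at runtime (as root), so the kernel starts
# writing dirty pages out sooner instead of hoarding them until fsync:
#   sysctl -w vm.dirty_background_ratio=1
#   sysctl -w vm.dirty_ratio=5
```

Settings made with sysctl -w do not survive a reboot; they go in /etc/sysctl.conf to persist.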
2010/1/9 Scott Marlowe <scott.marlowe@gmail.com>:
> This can be doubly bad if he's now moved to a set of disks that are
> properly obeying fsync but was on disks that were lying about it
> before.

Scott, thanks for the interesting suggestion. How do I check this? The old RAID-1 has a "hardware" LSI controller, and I still have access to the old server.
On Mon, Jan 11, 2010 at 3:53 AM, Anton Belyaev <anton.belyaev@gmail.com> wrote:
> Scott, thanks for the interesting suggestion.
>
> And how do I check this?
> Old RAID-1 has "hardware" LSI controller.
> I still have access to old server.

The old RAID card likely had a battery backed cache, which would make the fsyncs much faster, as long as you hadn't run out of cache. While RAID-5 can have reasonable throughput sequentially, it has much higher random write costs (two reads and two writes for each random write). Likely those two things, the lack of cache and the more expensive writes, are why the new system seems so much slower. If you can shoehorn in one more drive, you could run RAID-10 and get much better performance.
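Scott's write-penalty arithmetic can be put into rough numbers. This is only a back-of-the-envelope sketch with an assumed per-drive random IOPS figure, ignoring controller and stripe-cache effects:

```shell
#!/bin/sh
# Rough random-write throughput comparison, assuming ~100 random IOPS
# per drive (a typical 7200rpm disk; illustrative figure only).
awk 'BEGIN {
  iops = 100
  # RAID-5, 3 drives: each random write costs 4 physical I/Os
  # (read old data + read old parity + write data + write parity).
  printf "RAID-5 (3 drives):  ~%d random writes/s\n", 3 * iops / 4
  # RAID-10, 4 drives: each write just goes to 2 mirrored drives.
  printf "RAID-10 (4 drives): ~%d random writes/s\n", 4 * iops / 2
}'
# -> RAID-5 (3 drives):  ~75 random writes/s
# -> RAID-10 (4 drives): ~200 random writes/s
```

The roughly 2.5x gap from one extra drive is why the RAID-10 suggestion keeps coming up in this thread.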
Scott Marlowe wrote:
> The old RAID card likely had a battery backed cache, which would make
> the fsyncs much faster, as long as you hadn't run out of cache.

To be super clear here: it's possible to see a 100:1 performance drop going from a system with a battery-backed write cache to one without. This is one of the three main weak spots of software RAID that still keeps hardware RAID vendors in business: it can't do anything to speed up the type of writes done during transaction commit and at checkpoint time. (The others are that it's hard to set up transparent failover after a failure in software RAID so that it always works at boot time, and that motherboard chipsets can easily lose their minds and take down the whole system when one drive goes bad.)

> If you can shoehorn in one more drive, you could run RAID-10 and get
> much better performance.

And throwing drives at the problem may not help. I've seen a system with a 48-disk software RAID-10 that only got 100 TPS when running a commit-heavy test, because it didn't have any way to cache writes usefully for that purpose.

If the old system had a write caching card and the new one doesn't, that's certainly your most likely suspect for the source of the slowdown. As for testing that specifically: since you still have the old system, you can look at the slides I've got for "Database Hardware Benchmarking" at http://www.westnet.com/~gsmith/content/postgresql/index.htm and use the sysbench example I show on P26 to measure the commit fsync rate. There's a video of that presentation where I explain a lot of the background in this area too.
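For reference, a sysbench fsync-rate test along the lines Greg describes looks roughly like this. Option names vary between sysbench versions (this is the 0.4-era syntax), so treat it as a sketch rather than a copy of the slide:

```shell
#!/bin/sh
# Create a small test file, then measure how many fsync-bound random
# writes per second the storage can sustain. A disk honoring fsync
# with no battery-backed cache manages roughly its rotation rate
# (~120/s for 7200rpm); a write cache, volatile or battery-backed,
# will report thousands.
sysbench --test=fileio --file-num=1 --file-total-size=16384 prepare
sysbench --test=fileio --file-num=1 --file-total-size=16384 \
  --file-test-mode=rndwr --file-fsync-all=on \
  --max-time=60 --max-requests=0 run
sysbench --test=fileio --file-num=1 --file-total-size=16384 cleanup
```

If the old server reports thousands of requests/second here and the new one reports around a hundred, that confirms the write-cache theory from this thread.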
On Mon, Jan 11, 2010 at 2:59 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> And throwing drives at the problem may not help. I've seen a system
> with a 48-disk software RAID-10 that only got 100 TPS when running a
> commit-heavy test, because it didn't have any way to cache writes
> usefully for that purpose.

A 4-disk RAID-10 will be about 4 to 8 times faster than a RAID-5 of 3 disks. It won't be as fast as a good-sized RAID-10 with HW caching, but it would be a big improvement.

> If the old system had a write caching card and the new one doesn't,
> that's certainly your most likely suspect for the source of the
> slowdown.

Agreed. If the new machine is limited to 3 disks, and any one is big enough to hold the db, I'd look at a two-disk mirror with a hot spare on a HW RAID controller with battery backed cache. If they can't get a HW RAID controller, then switching to SW RAID-1 would be a positive step.
Greg Smith wrote:
> If the old system had a write caching card, and the new one doesn't,
> that's certainly your most likely suspect for the source of the
> slowdown.

Note that it's even possible that the old system had a card with write caching enabled but *no* battery-backed cache. That's crazily dangerous, but it does get you the speed benefits of write caching without the need for a BBU. I only mention this to help explain a possible performance difference; you should NEVER do this if you care about your data.

--
Craig Ringer
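The same volatile-cache question applies to the individual SATA drives behind software RAID. One way to check and disable the drive-level write cache is hdparm (requires root, and the device name here is just an example):

```shell
#!/bin/sh
# Query whether the drive's volatile write cache is enabled;
# look for "write-caching =  1 (on)" in the output.
hdparm -W /dev/sda

# Turn it off, so that fsync really means "data is on the platter":
#   hdparm -W0 /dev/sda
# Like sysctl settings, this does not survive a reboot and has to be
# reapplied at boot (e.g. from an init script).
```

Disabling the drive cache typically costs a lot of commit throughput, which is exactly the trade-off the thread is describing; the safe way to get the speed back is a controller with a battery-backed cache.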