Thread: Huge iowait during checkpoint finish

Huge iowait during checkpoint finish

From: Anton Belyaev
Hello dear list members,

I have a strange problem with my new 8.4 deployment which I never
encountered on my previous 8.3 deployment.
IOwait values are extremely high exactly when Postgres finishes a checkpoint.
During the checkpoint itself (which is quite lengthy) IOwait is very low.
Why does this happen, and how can I fix it?

Take a look at this chart:

http://kpox.s3.amazonaws.com/cpu-day.png

The latest (rightmost) IOwait peak corresponds to this log entry:

2010-01-08 18:40:36 CET LOG:  checkpoint complete: wrote 46357 buffers
(35.4%); 0 transaction log file(s) added, 0 removed, 40 recycled;
write=2502.800 s, sync=79.972 s, total=2583.184 s

The IOwait peak starts at about 18:39:30 and finishes at 18:40:30.
I think all the IOwait comes during sync time, which is 80 s,
according to the log entry.

Config details that might be related to the problem:

bgwriter_lru_maxpages = 0 # BG writer is off
checkpoint_segments = 45
checkpoint_timeout = 60min
checkpoint_completion_target = 0.9

I had mostly the same config on my 8.3 deployment, but the hardware is different:
the disk is a software RAID-5 with 3 hard drives.

Operating system is Ubuntu 9.10 Server x64.

Thanks.
Anton.

Re: Huge iowait during checkpoint finish

From: Greg Smith
Anton Belyaev wrote:
> I think all the IOwait comes during sync time, which is 80 s,
> according to the log entry.
>

I believe you are correctly diagnosing the issue.  The "sync time" entry
in the log was added there specifically to make it easier to confirm
this problem you're having exists on a given system.

> bgwriter_lru_maxpages = 0 # BG writer is off
> checkpoint_segments = 45
> checkpoint_timeout = 60min
> checkpoint_completion_target = 0.9
>
These are reasonable settings.  You can look at pg_stat_bgwriter to get
more statistics about your checkpoints; grab a snapshot of that now,
another one later, and then compute the difference between the two.
I've got an example of that at
http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm

You should be aiming to have a checkpoint no more than every 5 minutes,
and on a write-heavy system shooting for closer to every 10 is probably
more appropriate.  Do you know how often they're happening on yours?
Two pg_stat_bgwriter snapshots from a couple of hours apart, with a
timestamp on each, can be used to figure that out.
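
For example, a rough sketch of taking such a snapshot from the shell (the
database name and output file here are just placeholders):

  psql -d yourdb -c "SELECT now() AS snapshot_time, * FROM pg_stat_bgwriter;" >> bgwriter_snapshots.txt

Run that now and again in a few hours; the increase in checkpoints_timed +
checkpoints_req divided by the elapsed hours tells you how many checkpoints
per hour you're getting.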

> I had mostly the same config with my 8.3 deployment.
> But hardware is different:
> Disk is software RAID-5 with 3 hard drives.
> Operating system is Ubuntu 9.10 Server x64.
>

Does the new server have a lot more RAM than the 8.3 one?  Some of the
problems in this area get worse the more RAM you've got.

Does the new server use ext4 while the old one used ext3?

Basically, you have a couple of standard issues here:

1) You're using RAID-5, which is not known for good write performance.
Are you sure the disk array performs well on writes?  And if you didn't
benchmark it, you can't be sure.

2) Linux is buffering a lot of writes that are only making it to disk at
checkpoint time.  This could be simply because of (1)--maybe the disk is
always overloaded.  But it's possible this is just due to excessive
Linux buffering being lazy about the writes.  I wrote something about
that topic at
http://notemagnet.blogspot.com/2008/08/linux-write-cache-mystery.html
you might find interesting.

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com


Re: Huge iowait during checkpoint finish

From: Scott Marlowe
On Fri, Jan 8, 2010 at 3:07 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Basically, you have a couple of standard issues here:
>
> 1) You're using RAID-5, which is not known for good write performance.  Are
> you sure the disk array performs well on writes?  And if you didn't
> benchmark it, you can't be sure.

This can be doubly bad if he's now moved to a set of disks that are
properly obeying fsync but was on disks that were lying about it
before.

Re: Huge iowait during checkpoint finish

From: Anton Belyaev
Hello Greg,

Thanks for your extensive reply.

2010/1/9 Greg Smith <greg@2ndquadrant.com>:
> Anton Belyaev wrote:
>>
>> I think all the IOwait comes during sync time, which is 80 s,
>> according to the log entry.
>>
>
> I believe you are correctly diagnosing the issue.  The "sync time" entry in
> the log was added there specifically to make it easier to confirm this
> problem you're having exists on a given system.
>
>> bgwriter_lru_maxpages = 0 # BG writer is off
>> checkpoint_segments = 45
>> checkpoint_timeout = 60min
>> checkpoint_completion_target = 0.9
>>
>
> These are reasonable settings.  You can look at pg_stat_bgwriter to get more
> statistics about your checkpoints; grab a snapshot of that now, another one
> later, and then compute the difference between the two.  I've got an example
> of that http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm
>
> You should be aiming to have a checkpoint no more than every 5 minutes, and
> on a write-heavy system shooting for closer to every 10 is probably more
> appropriate.  Do you know how often they're happening on yours?  Two
> pg_stat_bgwriter snapshots from a couple of hours apart, with a timestamp on
> each, can be used to figure that out.
>

Checkpoints happen about once an hour, sometimes a bit more often (every
30 minutes) during daily peaks.

>> I had mostly the same config with my 8.3 deployment.
>> But hardware is different:
>> Disk is software RAID-5 with 3 hard drives.
>> Operating system is Ubuntu 9.10 Server x64.
>>
>
> Does the new server have a lot more RAM than the 8.3 one?  Some of the
> problems in this area get worse the more RAM you've got.
>

Yes, the new server has 12 GB while the old one has only 8 GB.

> Does the new server use ext4 while the old one used ext3?
>

Same ext3 filesystem.

> Basically, you have a couple of standard issues here:
>
> 1) You're using RAID-5, which is not known for good write performance.  Are
> you sure the disk array performs well on writes?  And if you didn't
> benchmark it, you can't be sure.
>

I did some dd benchmarks (following
http://www.westnet.com/~gsmith/content/postgresql/pg-disktesting.htm):
The old server with its "hardware RAID-1" shows 60 MB/s on write.
The new server with software RAID-5 shows 85 MB/s on write.
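
(The write test was roughly of this form, with the file size about twice
the RAM so the OS cache cannot hide the writes; the exact count below is
illustrative rather than the one I actually used:

  time sh -c "dd if=/dev/zero of=/tmp/ddtest bs=8k count=3000000 && sync"

Throughput is then the file size divided by the elapsed time.)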

> 2) Linux is buffering a lot of writes that are only making it to disk at
> checkpoint time.  This could be simply because of (1)--maybe the disk is
> always overloaded.  But it's possible this is just due to excessive Linux
> buffering being lazy about the writes.  I wrote something about that topic
> at http://notemagnet.blogspot.com/2008/08/linux-write-cache-mystery.html you
> might find interesting.
>

Old server has
dirty_ratio = 10
dirty_background_ratio = 5

New server had
dirty_ratio = 20
dirty_background_ratio = 10

Summing up all the tests and measurements above:

The new server has more RAM, leaving Linux more room for write caching.
During the dd test, the Dirty value in /proc/meminfo went up to 2 GB.

RAID-5 is a bit faster (at least on sequential writes). The drives aren't
overloaded, because their utilization during the lengthy checkpoint write
phase is low. The IOwait problems occur only during the final sync part of
the checkpoint, and during that short period the drives are almost 100%
utilized (according to sar -d 1).
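
(I watched the dirty page count and the per-device utilization with
something along these lines:

  watch -n 5 'grep Dirty /proc/meminfo'
  sar -d 1

iostat -x 1 shows roughly the same per-device utilization figures.)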

I experimented a bit with setting dirty_background_ratio = 1, but somehow
this had a negative effect, which is strange. I hoped it would force the
load to be spread from the 2-minute sync period across the 1-hour
checkpoint span, but it did not.
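
For the record, I changed it on the fly with

  sysctl -w vm.dirty_background_ratio=1

(the value only survives a reboot if it is also added to /etc/sysctl.conf).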

As a result, I still don't know where the real problem is. The drives
aren't overloaded. The Linux cache is really mysterious, but modifying its
parameters does not give the desired effect.

Thanks.
Anton.

Re: Huge iowait during checkpoint finish

From: Anton Belyaev
2010/1/9 Scott Marlowe <scott.marlowe@gmail.com>:
> On Fri, Jan 8, 2010 at 3:07 PM, Greg Smith <greg@2ndquadrant.com> wrote:
>> Basically, you have a couple of standard issues here:
>>
>> 1) You're using RAID-5, which is not known for good write performance.  Are
>> you sure the disk array performs well on writes?  And if you didn't
>> benchmark it, you can't be sure.
>
> This can be doubly bad if he's now moved to a set of disks that are
> properly obeying fsync but was on disks that were lying about it
> before.
>

Scott, thanks for the interesting suggestion.

And how do I check this?
The old RAID-1 has a "hardware" LSI controller,
and I still have access to the old server.

Re: Huge iowait during checkpoint finish

From: Scott Marlowe
On Mon, Jan 11, 2010 at 3:53 AM, Anton Belyaev <anton.belyaev@gmail.com> wrote:
> 2010/1/9 Scott Marlowe <scott.marlowe@gmail.com>:
>> On Fri, Jan 8, 2010 at 3:07 PM, Greg Smith <greg@2ndquadrant.com> wrote:
>>> Basically, you have a couple of standard issues here:
>>>
>>> 1) You're using RAID-5, which is not known for good write performance.  Are
>>> you sure the disk array performs well on writes?  And if you didn't
>>> benchmark it, you can't be sure.
>>
>> This can be doubly bad if he's now moved to a set of disks that are
>> properly obeying fsync but was on disks that were lying about it
>> before.
>>
>
> Scott, thanks for the interesting suggestion.
>
> And how do I check this?
> Old RAID-1 has "hardware" LSI controller.
> I still have access to old server.

The old RAID card likely had a battery backed cache, which would make
the fsyncs much faster, as long as you hadn't run out of cache.

While RAID-5 can have reasonable sequential throughput, it has a much
higher random write cost (two reads and two writes for each random
write).

Likely those two things, the lack of cache and the more expensive writes,
are why the new system seems so much slower.
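
To put rough numbers on it (back-of-the-envelope, assuming something like
100 random IOPS per 7200 RPM drive): each small random write on RAID-5
becomes read old data + read old parity + write new data + write new
parity, i.e. 4 physical I/Os, so a 3-disk array with no write cache in
front of it tops out around 3 * 100 / 4 = 75 random writes per second.
That's the kind of rate the sync phase of a checkpoint slams into.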

If you can shoehorn one more drive, you could run RAID-10 and get much
better performance.

Re: Huge iowait during checkpoint finish

From: Greg Smith
Scott Marlowe wrote:
> On Mon, Jan 11, 2010 at 3:53 AM, Anton Belyaev <anton.belyaev@gmail.com> wrote:
>> Old RAID-1 has "hardware" LSI controller.
>> I still have access to old server.
>
> The old RAID card likely had a battery backed cache, which would make
> the fsyncs much faster, as long as you hadn't run out of cache.

To be super clear here: it's possible to see a 100:1 performance drop going from a system with a battery-backed write cache to one without.  This is one of the three main weak spots of software RAID that still keeps hardware RAID vendors in business: it can't do anything to speed up the type of writes done at transaction commit and at checkpoint time.  (The others are that it's hard to set up transparent failover after a failure in software RAID so that it always works at boot time, and that motherboard chipsets can easily lose their minds and take down the whole system when one drive goes bad.)

> If you can shoehorn one more drive, you could run RAID-10 and get much
> better performance.
And throwing drives at the problem may not help.  I've seen a system with a 48-disk software RAID-10 that only got 100 TPS when running a commit-heavy test, because it didn't have any way to cache writes usefully for that purpose.

If the old system had a write caching card, and the new one doesn't, that's certainly your most likely suspect for the source of the slowdown.  As for testing that specifically, since you still have the old system you can look at the slides I've got for "Database Hardware Benchmarking" at http://www.westnet.com/~gsmith/content/postgresql/index.htm and use the sysbench example I show on P26 to measure the commit fsync rate.  There's a video of that presentation where I explain a lot of the background in this area too.
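
If sysbench isn't handy, a crude stand-in that exercises the same code path
is a synchronous-write dd run on the database disk (GNU dd's oflag=dsync
forces a sync per block; the path and count here are only placeholders):

  dd if=/dev/zero of=/path/on/data/disk/synctest bs=8k count=1000 oflag=dsync
  rm /path/on/data/disk/synctest

1000 divided by the elapsed time gives a rough ceiling on commits per
second: a plain disk honoring the syncs is limited to roughly its rotation
rate (on the order of 100/second for 7200 RPM), while a battery-backed
write cache will report thousands.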

-- 
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com

Re: Huge iowait during checkpoint finish

From: Scott Marlowe
On Mon, Jan 11, 2010 at 2:59 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Scott Marlowe wrote:

>> If you can shoehorn one more drive, you could run RAID-10 and get much
>> better performance.
>
> And throwing drives at the problem may not help.  I've see a system with a
> 48 disk software RAID-10 that only got 100 TPS when running a commit-heavy
> test, because it didn't have any way to cache writes usefully for that
> purpose.

A 4 disk RAID-10 will be about 4 to 8 times faster than a RAID-5 of 3
disks.  It won't be as fast as a good sized RAID-10 with HW caching,
but it would be a big improvement.

> If the old system had a write caching card, and the new one doesn't, that's
> certainly your most likely suspect for the source of the slowdown.  As for

Agreed.  If the new machine is limited to 3 disks, and any one of them is
big enough to hold the db, I'd look at a two-disk mirror with a hot spare
on a HW RAID controller with battery backed cache.  If they can't get a
HW RAID controller, then switching to SW RAID-1 would be a positive
step.

Re: Huge iowait during checkpoint finish

From: Craig Ringer
Greg Smith wrote:

> If the old system had a write caching card, and the new one doesn't
> that's certainly your most likely suspect for the source of the
> slowdown.

Note that it's even possible that the old system had a card with write
caching enabled, but *no* battery backed cache. That's crazily
dangerous, but does get you the speed benefits of write caching without
the need for a BBU. I only mention this to help explain a possible
performance difference - you should NEVER do this if you care about your
data.
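
A rough way to check what the hardware is doing: on plain SATA drives in a
software array, hdparm will report (and can toggle) each drive's own write
cache, while for the LSI card you would have to look at its management
utility for the logical drive's cache policy. For example (sdX is whichever
member drive you want to inspect):

  hdparm -W /dev/sdX      # shows whether the drive's write cache is enabled
  hdparm -W 0 /dev/sdX    # disables it (safer for fsync, slower)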

--
Craig Ringer