Thread: Postgres, fsync and RAID controller with 100M of internal cache & dedicated battery
Postgres, fsync and RAID controller with 100M of internal cache & dedicated battery
From: "Dmitry Koterov"
Hello.
We are trying to use an HP CISS controller (Smart Array E200i) with internal cache memory (100 MB for write caching, with a built-in backup battery) together with Postgres. Typically, under heavy load, Postgres runs the checkpoint fsync very slowly:
checkpoint buffers dirty=16.8 MB (3.3%) write=24.3 ms sync=6243.3 ms
(If we turn off fsync with fsync=0, the speed increases greatly.) Unfortunately, it degrades the performance of the whole database during the checkpoint.
Here is the timing (in milliseconds) of a test transaction called multiple times concurrently (6 threads) with fsync turned ON:
40.4
44.4
37.4
44.0
42.7
41.8
218.1
254.2
101.0
42.2
42.4
41.0
39.5
(You can see the significant slowdown during the checkpoint.)
Here is dstat disc write activity log for that test:
0
0
284k
0
0
84k
0
0
276k
37M
208k
0
0
0
0
156k
0
0
0
0
I have written a small Perl script to check how slow fsync is on the Smart Array E200i controller. Theoretically, because of the write cache, fsync MUST cost nothing, but in practice it is not true:
# cd /mnt/c0d1p1/
# perl -e 'use Time::HiRes qw(gettimeofday tv_interval); system "sync"; open F, ">bulk"; print F ("a" x (1024 * 1024 * 20)); close F; $t0=[gettimeofday]; system "sync"; print ">>> fsync took " . tv_interval ( $t0, [gettimeofday]) . " s\n"; unlink "bulk"'
>>> fsync took 0.247033 s
You see, the 20 MB block was flushed in 0.25 s.
The question is: how can we solve this problem and make fsync run with no delay? It seems to me that the controller's internal write cache is not being used (strange, because all configuration options look fine), but how can we check that? Or maybe there is some other side effect?
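One thing worth noting about the Perl one-liner above: it times `system "sync"`, which flushes every dirty buffer in the OS page cache, not just the test file. Postgres instead calls fsync() on individual file descriptors at checkpoint time. A minimal Python re-implementation of the test that times a per-file fsync (the file name and 20 MB size simply mirror the Perl script; they are arbitrary choices) might look like this:

```python
import os
import time

PATH = "bulk"             # scratch file on the filesystem under test
SIZE = 20 * 1024 * 1024   # 20 MB, matching the Perl script above

# Write the data, then time only the flush of this one file's
# descriptor, rather than a system-wide sync of all dirty buffers.
fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
os.write(fd, b"a" * SIZE)

t0 = time.monotonic()
os.fsync(fd)              # closest analogue to what Postgres does per file
elapsed = time.monotonic() - t0

os.close(fd)
os.unlink(PATH)
print(">>> fsync took %.6f s" % elapsed)
```

Run from a directory on the array being tested, this gives a cleaner picture than `sync`, since other dirty data in the page cache cannot inflate the measurement.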
Re: Postgres, fsync and RAID controller with 100M of internal cache & dedicated battery
From: "Scott Marlowe"
On 8/22/07, Dmitry Koterov <dmitry@koterov.ru> wrote:
> Hello.
> You see, 50M block was fsynced for 0.25 s.
>
> The question is: how to solve this problem and make fsync run with no delay.
> Seems to me that controller's internal write cache is not used (strange,
> because all configuration options are fine), but how to check it? Or, maybe,
> there is another side-effect?

I would suggest that either the controller is NOT configured fine, OR there's some bug in how the OS is interacting with it.

What options are there for this RAID controller, and what are they set to? Specifically, the writeback / writethrough options for the cache; it might be that if it doesn't properly detect a battery backup module, it refuses to go into writeback mode.
Postgres, fsync and RAID controller with 100M of internal cache & dedicated battery
From: "Dmitry Koterov"
And here are the results of the built-in Postgres test script:
Simple write timing:
write 0.006355
Compare fsync times on write() and non-write() descriptor:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
write, fsync, close 0.233793
write, close, fsync 0.227444
Compare one o_sync write to two:
one 16k o_sync write 0.297093
two 8k o_sync writes 0.402803
Compare file sync methods with one 8k write:
(o_dsync unavailable)
write, fdatasync 0.228725
write, fsync, 0.223302
Compare file sync methods with 2 8k writes:
(o_dsync unavailable)
open o_sync, write 0.414954
write, fdatasync 0.335280
write, fsync, 0.327195
(Also, I tried to manually specify the open_sync method in postgresql.conf, but after that the Postgres database crashed completely. :-)
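The "write, fsync" vs. "write, fdatasync" comparison that test_fsync performs can be approximated in a few lines of Python, for readers who want to reproduce it without building the C tool. This is only a rough sketch, not the real test_fsync: fdatasync skips flushing file metadata (e.g. mtime) and so can be slightly cheaper, and it is not available on every platform, hence the fallback:

```python
import os
import time

# fdatasync is POSIX but not universal; fall back to fsync where absent.
fdatasync = getattr(os, "fdatasync", os.fsync)

def timed_flush(path, flush, nbytes=8192):
    """Write nbytes to a fresh file and time only the flush call."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    os.write(fd, b"x" * nbytes)
    t0 = time.monotonic()
    flush(fd)
    elapsed = time.monotonic() - t0
    os.close(fd)
    os.unlink(path)
    return elapsed

t_fsync = timed_flush("t_fsync.tmp", os.fsync)
t_fdatasync = timed_flush("t_fdatasync.tmp", fdatasync)
print("write, fsync     %.6f" % t_fsync)
print("write, fdatasync %.6f" % t_fdatasync)
```

On a controller whose write cache is working, both times for an 8 kB flush should be in the low milliseconds; quarter-second numbers like the ones above suggest the cache is not absorbing the flush.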
On 8/22/07, Dmitry Koterov <dmitry@koterov.ru> wrote:
All settings seem to be fine. The mode is writeback.
We temporarily (for tests only, on a test machine!) put pg_xlog on a RAM drive (to completely exclude xlog fsync from the statistics), but the slowdown during the checkpoint and the 5-10 second fsync are still there.
Here are some statistical data from the controller. Other report data is attached to the mail.
ACCELERATOR STATUS:
Logical Drive Disable Map: 0x00000000
Read Cache Size: 24 MBytes
Posted Write Size: 72 MBytes
Disable Flag: 0x00
Status: 0x00000001
Disable Code: 0x0000
Total Memory Size: 128 MBytes
Battery Count: 1
Battery Status: 0x0001
Parity Read Errors: 0000
Parity Write Errors: 0000
Error Log: N/A
Failed Batteries: 0x0000
Board Present: Yes
Accelerator Failure Map: 0x00000000
Max Error Log Entries: 12
NVRAM Load Status: 0x00
Memory Size Shift Factor: 0x0a
Non Battery Backed Memory: 0 MBytes
Memory State: 0x00

On 8/22/07, Scott Marlowe <scott.marlowe@gmail.com> wrote:
On 8/22/07, Dmitry Koterov <dmitry@koterov.ru> wrote:
> Hello.
> You see, 50M block was fsynced for 0.25 s.
>
> The question is: how to solve this problem and make fsync run with no delay.
> Seems to me that controller's internal write cache is not used (strange,
> because all configuration options are fine), but how to check it? Or, maybe,
> there is another side-effect?
I would suggest that either the controller is NOT configured fine, OR
there's some bug in how the OS is interacting with it.
What options are there for this RAID controller, and what are they set
to? Specifically, the writeback / writethru type options for the
cache, and it might be that if it doesn't properly detect a battery backup
module it refuses to go into writeback mode.
Re: Postgres, fsync and RAID controller with 100M of internal cache & dedicated battery
From: "Phoenix Kiula"
Hi,

On 23/08/07, Dmitry Koterov <dmitry@koterov.ru> wrote:
> And here are results of built-in Postgres test script:

Can you tell me how I can execute this script on my system? Where is this script?

Thanks!
Re: Postgres, fsync and RAID controller with 100M of internal cache & dedicated battery
From: "Dmitry Koterov"
This script is here:
postgresql-8.2.3\src\tools\fsync\test_fsync.c
On 8/22/07, Phoenix Kiula <phoenix.kiula@gmail.com> wrote:
Hi,
On 23/08/07, Dmitry Koterov <dmitry@koterov.ru> wrote:
> And here are results of built-in Postgres test script:
>
Can you tell me how I can execute this script on my system? Where is
this script?
Thanks!
Re: Postgres, fsync and RAID controller with 100M of internal cache & dedicated battery
From: Greg Smith
On Wed, 22 Aug 2007, Dmitry Koterov wrote:
> I have written a small perl script to check how slow is fsync for Smart
> Array E200i controller. Theoretically, because of write cache, fsync MUST
> cost nothing, but in practice it is not true

That theory is fundamentally flawed; you don't know what else is in the operating system write cache in front of what you're trying to fsync, and you also don't know exactly what's in the controller's cache when you start. For all you know, the controller might be filled with cached reads and refuse to kick all of them out. This is a complicated area where tests are much more useful than trying to predict the behavior.

You haven't mentioned any details yet about the operating system you're running on; Solaris? Guessing from the device name. There have been some comments passing by lately about the write caching behavior not being turned on by default in that operating system.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Re: Postgres, fsync and RAID controller with 100M of internal cache & dedicated battery
From: "Dmitry Koterov"
> I have written a small perl script to check how slow is fsync for Smart
> Array E200i controller. Theoretically, because of write cache, fsync MUST
> cost nothing, but in practice it is not true
That theory is fundamentally flawed; you don't know what else is in the
operating system write cache in front of what you're trying to fsync, and
you also don't know exactly what's in the controller's cache when you
start. For all you know, the controller might be filled with cached reads
and refuse to kick all of them out. This is a complicated area where
tests are much more useful than trying to predict the behavior.
Nobody else writes, nobody reads. The machine is for tests only; it is clean. I monitor dstat: for 5 minutes beforehand there is no disc activity. So I suppose that the controller cache is already flushed before I run the test.
You haven't mentioned any details yet about the operating system you're
running on; Solaris? Guessing from the device name. There have been some
comments passing by lately about the write caching behavior not being
turned on by default in that operating system.
Filesystem is ext2 (to reduce the journalling side-effects).
OS write caching was turned on, turned off, and also set to flush once per second (all these cases were tested; none had any effect).
The question is: MUST my test script report a near-zero fsync time or not, given that the controller has a large built-in write cache? If yes, something is wrong with the controller or drivers (how to diagnose it?). If no, why not?
There are a lot of discussions in this mailing list about fsync and battery-backed controllers; people say that a controller with built-in cache memory reduces the price of fsync to almost zero. I just want to achieve this.
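One back-of-envelope consideration on the "must fsync be free?" question: even if the battery-backed cache acknowledges every write instantly, the 20 MB of dirty data still has to travel from host RAM to the controller. Assuming an effective host-to-controller bandwidth of around 150 MB/s (a round number chosen for illustration, not a measured E200i figure), there is an unavoidable floor on the flush time:

```python
# Back-of-envelope lower bound: the 150 MB/s figure is an assumption,
# not an E200i spec. Even a cache that acks instantly cannot make the
# transfer itself free.
payload_mb = 20.0       # size written by the Perl test above
bus_mb_per_s = 150.0    # assumed effective host-to-controller bandwidth
min_transfer_s = payload_mb / bus_mb_per_s
print("lower bound on flush time: %.3f s" % min_transfer_s)  # ~0.13 s
```

Under that assumption, 0.25 s for a 20 MB burst is within a factor of two of the physical floor; the "fsync costs nothing" claim really applies to small flushes, such as Postgres's 8 kB WAL block writes.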
Re: Postgres, fsync and RAID controller with 100M of internal cache & dedicated battery
From: "Dmitry Koterov"
Also, the controller is configured to use 75% of its memory for write caching and 25% for read caching, so reads cannot flood writes.
Linux CentOS x86_64. A lot of memory, 8 processors.
Re: Postgres, fsync and RAID controller with 100M of internal cache & dedicated battery
From: Ron Johnson
On 08/22/07 17:45, Dmitry Koterov wrote:
> Also, the controller is configured to use 75% of its memory for write
> caching and 25% - for read caching. So reads cannot flood writes.

That seems to be a very extreme ratio. Most databases do *many* times more reads than writes.

--
Ron Johnson, Jr.
Jefferson LA USA
Give a man a fish, and he eats for a day. Hit him with a fish, and he goes away for good!
Re: Postgres, fsync and RAID controller with 100M of internal cache & dedicated battery
From: "Scott Marlowe"
On 8/22/07, Dmitry Koterov <dmitry@koterov.ru> wrote:
> Also, the controller is configured to use 75% of its memory for write
> caching and 25% - for read caching. So reads cannot flood writes.

128 Meg is a pretty small cache for a modern RAID controller. I wonder if this one is just a dog performer. Have you looked at things like the Areca or Escalade with 1G or more cache on them?
Re: Postgres, fsync and RAID controller with 100M of internal cache & dedicated battery
From: Greg Smith
On Wed, 22 Aug 2007, Ron Johnson wrote:
> That seems to be a very extreme ratio. Most databases do *many*
> times more reads than writes.

Yes, but the OS has a lot more memory to cache the reads for you, so you should be relying more heavily on it in cases like this where the card has a relatively small amount of memory. The main benefit of having a caching controller is fsync acceleration; the reads should pass right through the controller's cache and then stay in system RAM afterwards if they're needed again.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Re: Postgres, fsync and RAID controller with 100M of internal cache & dedicated battery
From: Greg Smith
On Wed, 22 Aug 2007, Dmitry Koterov wrote:
> We are trying to use HP CISS contoller (Smart Array E200i)

There have been multiple reports of problems with general performance issues specifically with the cciss Linux driver for other HP cards. The E200i isn't from the same series, but I wouldn't expect that their drivers have gotten much better. Wander through the thread at http://svr5.postgresql.org/pgsql-performance/2006-07/msg00257.php to see one example I recall from last year; there are more in the archives if you search around a bit.

> I have written a small perl script to check how slow is fsync for Smart
> Array E200i controller. Theoretically, because of write cache, fsync MUST
> cost nothing, but in practice it is not true:
>>>> fsync took 0.247033 s

For comparison's sake, your script run 20 times against my system with an Areca ARC-1210 card with 256MB of cache gives me the following minimum and maximum times (full details on my server config are at http://www.westnet.com/~gsmith/content/postgresql/serverinfo.htm ):

>>> fsync took 0.039676 s
>>> fsync took 0.041137 s

And here's what the last set of test_fsync results look like on my system:

Compare file sync methods with 2 8k writes:
open o_sync, write 0.099819
write, fdatasync 0.100054
write, fsync, 0.094009

So basically your card is running 3 (test_fsync) to 6 (your script) times slower than my Areca unit on these low-level tests. I don't know that it's possible to drive the fsync times completely to zero, but there's certainly a whole lot of improvement from where you are to what I'd expect from even a cheap caching controller like I'm using. I've got maybe $900 worth of hardware total in this box and it's way faster than yours in this area.

> (Also, I tried to manually specify open_sync method in postgresql.conf,
> but after that Postgres database had completely crashed. :-)

This is itself a sign there's something really strange going on. There's something wrong with your system, your card, or the OS/driver you're using if open_sync doesn't work under Linux; in fact, it should be faster in practice even if it looks a little slower on test_fsync.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Re: Postgres, fsync and RAID controller with 100M of internal cache & dedicated battery
From: Lincoln Yeoh
At 11:28 PM 8/22/2007, Dmitry Koterov wrote:
>Hello.
>
>We are trying to use HP CISS contoller (Smart Array E200i) with
>internal cache memory (100M for write caching, built-in power
>battery) together with Postgres. Typically under a heavy load
>Postgres runs checkpoint fsync very slow:
>
>checkpoint buffers dirty=16.8 MB (3.3%) write=24.3 ms sync=6243.3 ms
>
>(If we turn off fsync, the speed increases greatly, fsync=0.) And
>unfortunately it affects all the database productivity during the checkpoint.

It's likely your controller is not actually doing the write caching, or the write caching is still slow (I've seen RAID controllers that are slower than software RAID). Have you actually configured your controller to do write caching? I wouldn't be surprised if it's in a conservative setting, which means "write-through" rather than "write-back", even if there's a battery.

BTW, what happens if someone replaced a faulty battery-backed controller card on a "live" system with one from a "don't care test system" (identical hardware though) that was powered down abruptly because people didn't care? Would the new card proceed to trash the "live" system?

Probably not that important, but what are your mount options for the partition? Is the partition mounted noatime (or similar)?

Regards,
Link.
Re: Postgres, fsync and RAID controller with 100M of internal cache & dedicated battery
From: Greg Smith
On Fri, 24 Aug 2007, Lincoln Yeoh wrote:
> BTW, what happens if someone replaced a faulty battery backed controller card
> on a "live" system with one from a "don't care test system" (identical
> hardware tho) that was powered down abruptly because people didn't care?
> Would the new card proceed to trash the "live" system?

All the caching controllers I've examined this behavior on give each disk a unique ID, so if you connect new disks to them they wouldn't trash anything, because those writes will only go out to the original drives. What happens to the pending writes for the drives that aren't there anymore is kind of undefined, though; presumably they'll just be thrown away. I don't know if there are any cards that try to hang on to them in case the original disks are connected later.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD