Thread: Postgres, fsync and RAID controller with 100M of internal cache & dedicated battery

From: "Dmitry Koterov"

Hello.

We are trying to use an HP CISS controller (Smart Array E200i) with internal cache memory (100 MB for write caching, battery-backed) together with Postgres. Typically, under heavy load, Postgres runs the checkpoint fsync very slowly:

checkpoint buffers dirty=16.8 MB (3.3%) write=24.3 ms sync=6243.3 ms

(If we turn fsync off (fsync=0), the speed increases greatly.) Unfortunately, this hurts the performance of the whole database during the checkpoint.
Here are the timings (in milliseconds) of a test transaction called repeatedly from 6 concurrent threads with fsync turned ON:

40.4
44.4
37.4
44.0
42.7
41.8
218.1
254.2
101.0
42.2
42.4
41.0
39.5

(Note the significant slowdown during the checkpoint.)
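
(For what it's worth, a similar 6-thread test can be driven with contrib's pgbench, assuming the test transaction is saved in a custom script file; txn.sql and testdb here are placeholders:)

# pgbench -n -c 6 -t 1000 -f txn.sql testdb
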
Here is the dstat disk-write activity log for the same test:

   0
   0
 284k
   0
   0
  84k
   0
   0
 276k
  37M
  208k
   0
   0
   0
   0
 156k
   0
   0
   0
   0

I have written a small Perl script to check how slow fsync is on the Smart Array E200i controller. Theoretically, because of the write cache, fsync MUST cost nothing, but in practice this is not true:

# cd /mnt/c0d1p1/
# perl -e 'use Time::HiRes qw(gettimeofday tv_interval); system "sync"; open F, ">bulk"; print F ("a" x (1024 * 1024 * 20)); close F; $t0=[gettimeofday]; system "sync"; print ">>> fsync took " . tv_interval ( $t0, [gettimeofday]) . " s\n"; unlink "bulk"'
>>> fsync took 0.247033 s

As you can see, syncing a 20 MB block took 0.25 s.
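
(Note: sync(2) flushes everything the OS has buffered, not just this one file. A variant that times a true fsync(2) on the test file alone, assuming the core IO::Handle module is available, might look like this:)

# perl -e 'use Time::HiRes qw(gettimeofday tv_interval); use IO::Handle; open my $f, ">", "bulk" or die $!; print $f "a" x (1024 * 1024 * 20); $f->flush; $t0=[gettimeofday]; $f->sync or die "fsync: $!"; print ">>> fsync took " . tv_interval($t0, [gettimeofday]) . " s\n"; close $f; unlink "bulk"'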

The question is: how can we solve this problem and make fsync run with no delay? It seems to me that the controller's internal write cache is not being used (strange, because all the configuration options look fine), but how can we check that? Or is there perhaps another side effect at work?

On 8/22/07, Dmitry Koterov <dmitry@koterov.ru> wrote:
> Hello.
> As you can see, syncing a 20 MB block took 0.25 s.
>
> The question is: how can we solve this problem and make fsync run with no
> delay? It seems to me that the controller's internal write cache is not being
> used (strange, because all the configuration options look fine), but how can
> we check that? Or is there perhaps another side effect at work?

I would suggest that either the controller is NOT configured fine, OR
there's some bug in how the OS is interacting with it.

What options are there for this RAID controller, and what are they set
to?  Specifically, check the write-back / write-through options for the
cache; it may be that if the card doesn't properly detect its battery backup
module, it refuses to go into write-back mode.

From: "Dmitry Koterov"

And here are the results of the built-in Postgres test program:

Simple write timing:
        write                    0.006355

Compare fsync times on write() and non-write() descriptor:
(If the times are similar, fsync() can sync data written
 on a different descriptor.)
        write, fsync, close      0.233793
        write, close, fsync      0.227444

Compare one o_sync write to two:
        one 16k o_sync write     0.297093
        two 8k o_sync writes     0.402803

Compare file sync methods with one 8k write:

        (o_dsync unavailable)
        write, fdatasync         0.228725
        write, fsync,            0.223302

Compare file sync methods with 2 8k writes:
        (o_dsync unavailable)
        open o_sync, write       0.414954
        write, fdatasync         0.335280
        write, fsync,            0.327195

(Also, I tried to manually specify the open_sync method in postgresql.conf, but after that the Postgres database crashed completely. :-)
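
(For reference, wal_sync_method = open_sync makes Postgres open the WAL file with O_SYNC, so each write() returns only after the data has been handed to stable storage. The same mechanism can be probed directly; a sketch using the core Fcntl module, with "bulk" again as a scratch file:)

# perl -e 'use Fcntl qw(O_WRONLY O_CREAT O_SYNC); use Time::HiRes qw(gettimeofday tv_interval); sysopen(my $f, "bulk", O_WRONLY|O_CREAT|O_SYNC, 0644) or die $!; $t0=[gettimeofday]; syswrite($f, "a" x 8192) or die $!; print ">>> one 8k o_sync write took " . tv_interval($t0, [gettimeofday]) . " s\n"; close $f; unlink "bulk"'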



On 8/22/07, Dmitry Koterov <dmitry@koterov.ru> wrote:
All settings seem to be fine. The mode is write-back.

We temporarily (for tests only, on a test machine!) put pg_xlog on a RAM drive (to exclude xlog fsync from the statistics completely), but the slowdown during the checkpoint and the 5-10 second fsync are still there.

Here is some statistics output from the controller. The rest of the report is attached to the mail.

ACCELERATOR STATUS:
   Logical Drive Disable Map: 0x00000000
   Read Cache Size:           24 MBytes
   Posted Write Size:         72 MBytes
   Disable Flag:              0x00
   Status:                    0x00000001
   Disable Code:              0x0000
   Total Memory Size:         128 MBytes
   Battery Count:             1
   Battery Status:            0x0001
   Parity Read Errors:        0000
   Parity Write Errors:       0000
   Error Log:                 N/A
   Failed Batteries:          0x0000
   Board Present:             Yes
   Accelerator Failure Map:   0x00000000
   Max Error Log Entries:     12
   NVRAM Load Status:         0x00
   Memory Size Shift Factor:  0x0a
   Non Battery Backed Memory: 0 MBytes
   Memory State:              0x00



Hi,


On 23/08/07, Dmitry Koterov <dmitry@koterov.ru> wrote:
> And here are the results of the built-in Postgres test program:

Can you tell me how I can execute this script on my system? Where is
this script?

Thanks!

The program is here:
postgresql-8.2.3/src/tools/fsync/test_fsync.c
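
In a configured source tree it can be built and run from that directory, roughly:

# cd postgresql-8.2.3/src/tools/fsync
# make
# ./test_fsync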



On Wed, 22 Aug 2007, Dmitry Koterov wrote:

> I have written a small Perl script to check how slow fsync is on the Smart
> Array E200i controller. Theoretically, because of the write cache, fsync MUST
> cost nothing, but in practice this is not true

That theory is fundamentally flawed; you don't know what else is in the
operating system write cache in front of what you're trying to fsync, and
you also don't know exactly what's in the controller's cache when you
start.  For all you know, the controller might be filled with cached reads
and refuse to kick all of them out.  This is a complicated area where
tests are much more useful than trying to predict the behavior.

You haven't mentioned any details yet about the operating system you're
running; Solaris, guessing from the device name?  There have been some
comments going by lately about write caching not being turned on by
default in that operating system.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

> > I have written a small Perl script to check how slow fsync is on the Smart
> > Array E200i controller. Theoretically, because of the write cache, fsync MUST
> > cost nothing, but in practice this is not true
>
> That theory is fundamentally flawed; you don't know what else is in the
> operating system write cache in front of what you're trying to fsync, and
> you also don't know exactly what's in the controller's cache when you
> start.  For all you know, the controller might be filled with cached reads
> and refuse to kick all of them out.  This is a complicated area where
> tests are much more useful than trying to predict the behavior.

Nobody else writes, nobody reads. The machine is for tests only, and it is clean. I monitor dstat: for 5 minutes before the test there is no disk activity at all. So I assume that the controller cache has already been flushed before I run the test.

> You haven't mentioned any details yet about the operating system you're
> running; Solaris, guessing from the device name?  There have been some
> comments going by lately about write caching not being turned on by
> default in that operating system.
Linux CentOS x86_64. A lot of memory, 8 processors.
The filesystem is ext2 (to avoid journalling side effects).
OS write caching has been tried turned on, turned off, and set to flush once per second (all these cases were tested; none had any effect).
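
(For reference, the once-per-second flush setting corresponds to the pdflush sysctls on a 2.6 kernel; the first wakes the flush daemon every second, the second makes dirty pages eligible for writeback after one second. The values below are just examples:)

# sysctl -w vm.dirty_writeback_centisecs=100
# sysctl -w vm.dirty_expire_centisecs=100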

The question is: SHOULD my test script report a near-zero fsync time or not, given that the controller has a large built-in write cache? If yes, something is wrong with the controller or the drivers (how can I diagnose that?). If no, why not?
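
(One way to check from the OS side is HP's hpacucli utility, assuming it is installed; the detail output should show the cache ratio, battery status, and whether the write cache is actually enabled:)

# hpacucli ctrl all show config detail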

There are a lot of discussions on this mailing list about fsync and battery-backed controllers; people say that a controller with built-in cache memory reduces the cost of fsync to nearly zero. I just want to achieve this.


Also, the controller is configured to use 75% of its memory for write caching and 25% for read caching, so reads cannot crowd out writes.


On 08/22/07 17:45, Dmitry Koterov wrote:
> Also, the controller is configured to use 75% of its memory for write
> caching and 25% for read caching, so reads cannot crowd out writes.

That seems to be a very extreme ratio.  Most databases do *many*
times more reads than writes.

--
Ron Johnson, Jr.
Jefferson LA  USA

Give a man a fish, and he eats for a day.
Hit him with a fish, and he goes away for good!


On 8/22/07, Dmitry Koterov <dmitry@koterov.ru> wrote:
> Also, the controller is configured to use 75% of its memory for write
> caching and 25% for read caching, so reads cannot crowd out writes.

128 MB is a pretty small cache for a modern RAID controller.  I
wonder if this one is just a dog performer.

Have you looked at cards like the Areca or Escalade with 1 GB or more
of cache on them?

On Wed, 22 Aug 2007, Ron Johnson wrote:

> That seems to be a very extreme ratio.  Most databases do *many*
> times more reads than writes.

Yes, but the OS has a lot more memory for caching reads, so you
should rely more heavily on it in cases like this where the card has
a relatively small amount of memory.  The main benefit of having a
caching controller is fsync acceleration; reads should pass right
through the controller's cache and then stay in system RAM afterwards if
they're needed again.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

On Wed, 22 Aug 2007, Dmitry Koterov wrote:

> We are trying to use an HP CISS controller (Smart Array E200i)

There have been multiple reports of general performance problems
specifically with the cciss Linux driver for other HP cards.  The
E200i isn't from the same series, but I wouldn't expect that the drivers
have gotten much better.  Wander through the thread at
http://svr5.postgresql.org/pgsql-performance/2006-07/msg00257.php to see
one example I recall from last year; there are more in the archives if you
search around a bit.

> I have written a small Perl script to check how slow fsync is on the Smart
> Array E200i controller. Theoretically, because of the write cache, fsync MUST
> cost nothing, but in practice this is not true:
>>>> fsync took 0.247033 s

For comparison's sake, running your script 20 times against my system with an
Areca ARC-1210 card with 256 MB of cache gives me the following minimum
and maximum times (full details on my server config are at
http://www.westnet.com/~gsmith/content/postgresql/serverinfo.htm):

>>> fsync took 0.039676 s
>>> fsync took 0.041137 s

And here's what the last set of test_fsync results look like on my system:

Compare file sync methods with 2 8k writes:
         open o_sync, write       0.099819
         write, fdatasync         0.100054
         write, fsync,            0.094009

So basically your card is running 3x (test_fsync) to 6x (your script)
slower than my Areca unit on these low-level tests.  I don't know whether
it's possible to drive the fsync times completely to zero, but there's
certainly a whole lot of room for improvement between where you are and what
I'd expect from even a cheap caching controller like the one I'm using.  I've
got maybe $900 worth of hardware total in this box and it's way faster than
yours in this area.

> (Also, I tried to manually specify the open_sync method in postgresql.conf,
> but after that the Postgres database crashed completely. :-)

This is itself a sign that there's something really strange going on.  There's
something wrong with your system, your card, or the OS/driver you're using
if open_sync doesn't work under Linux; in fact, it should be faster in
practice even if it looks a little slower in test_fsync.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

At 11:28 PM 8/22/2007, Dmitry Koterov wrote:
>Hello.
>
>We are trying to use an HP CISS controller (Smart Array E200i) with
>internal cache memory (100 MB for write caching, battery-backed)
>together with Postgres. Typically, under heavy load, Postgres runs
>the checkpoint fsync very slowly:
>
>checkpoint buffers dirty=16.8 MB (3.3%) write=24.3 ms sync=6243.3 ms
>
>(If we turn fsync off (fsync=0), the speed increases greatly.)
>Unfortunately, this hurts the performance of the whole database during
>the checkpoint.
>Here are the timings (in milliseconds) of a test transaction called
>repeatedly from 6 concurrent threads with fsync turned ON:

It's likely that your controller is not actually doing write caching,
or that its write caching is still slow (I've seen RAID controllers
that are slower than software RAID).

Have you actually configured your controller to do write caching?
I wouldn't be surprised if it's in a conservative setting, meaning
"write-through" rather than "write-back", even if there's a battery.

BTW, what happens if someone replaces a faulty battery-backed
controller card on a "live" system with one from a "don't care" test
system (identical hardware, though) that was powered down abruptly
because people didn't care? Would the new card proceed to trash the
"live" system?

Probably not that important, but what are your mount options for the
partition? Is the partition mounted noatime (or similar)?
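
(For reference, noatime goes into /etc/fstab roughly like this; the device and mount point are taken from the test earlier in the thread:)

/dev/cciss/c0d1p1  /mnt/c0d1p1  ext2  defaults,noatime  1 2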

Regards,
Link.





On Fri, 24 Aug 2007, Lincoln Yeoh wrote:

> BTW, what happens if someone replaces a faulty battery-backed controller card
> on a "live" system with one from a "don't care" test system (identical
> hardware, though) that was powered down abruptly because people didn't care?
> Would the new card proceed to trash the "live" system?

All the caching controllers I've examined this behavior on give each disk
a unique ID, so if you connect new disks to them they won't trash
anything, because the pending writes will only go out to the original drives.
What happens to the pending writes for drives that aren't there
anymore is kind of undefined, though; presumably they'll just be thrown
away.  I don't know whether any cards try to hang on to them in
case the original disks are connected later.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD