Thread: Determine optimal fdatasync/fsync, O_SYNC/O_DSYNC options

Determine optimal fdatasync/fsync, O_SYNC/O_DSYNC options

From
mudfoot@rawbw.com
Date:
Hi, I'd like to help with the topic in the Subject: line.  It seems to be a
TODO item.  I've reviewed some threads discussing the matter, so I hope I've
acquired enough history concerning it.  I've taken an initial swipe at
figuring out how to optimize sync'ing methods.  It's based largely on
recommendations I've read on previous threads about fsync/O_SYNC and so on.
After reviewing, if anybody has recommendations on how to proceed then I'd
love to hear them.

Attached is a little program that basically does a bunch of sequential writes
to a file.  All of the sync'ing methods supported by PostgreSQL WAL can be
used.  Results are printed in microseconds.  Size and quantity of writes are
configurable.  The documentation is in the code (how to configure, build, run,
etc.).  I realize that this program doesn't reflect all of the possible
activities of a production database system, but I hope it's a step in the
right direction for this task.  I've used it to see differences in behavior
between the various sync'ing methods on various platforms.

Here's what I've found running the benchmark on some systems to which
I have access.  The differences in behavior between platforms are quite vast.

Summary first...

<halfjoke>
PostgreSQL should be run on an old Apple Macintosh attached to
its own Hitachi disk array with 2GB cache or so.  Use any sync method
except for fsync().
</halfjoke>

Anyway, there is *a lot* of variance in file synching behavior across
different hardware and O/S platforms.  It's probably not safe
to conclude much.  That said, here are some findings so far based on
tests I've run:

1.  under no circumstances do fsync() or fdatasync() seem to perform
better than opening files with O_SYNC or O_DSYNC
2.  where there are differences, opening files with O_SYNC or O_DSYNC
tends to be considerably faster.
3.  fsync() seems to be the slowest where there are differences.  And
O_DSYNC seems to be the fastest where results differ.
4.  the safest thing to assert at this point is that
Solaris systems ought to use the O_DSYNC method for WAL.
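
For anyone unfamiliar with the two families being compared: the open_* methods request synchronous behavior at open() time, while fsync()/fdatasync() flush explicitly after writing.  A minimal illustration, assuming POSIX (the function names are mine, not PostgreSQL's):

```c
/* The two families of sync methods being compared, in miniature.
 * Method A (open_datasync): request synchronous writes at open() time.
 * Method B (fdatasync): write normally, then flush explicitly.
 * Either way, data should be on stable storage before the call returns
 * (modulo any write caching in the drive or array). */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#ifndef O_DSYNC
#define O_DSYNC O_SYNC           /* fall back where O_DSYNC is missing */
#endif

void wal_write_open_datasync(const char *path, const char *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_DSYNC, 0600);
    if (write(fd, buf, len) < 0)  /* returns only after data is on disk */
        perror("write");
    close(fd);
}

void wal_write_fdatasync(const char *path, const char *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT, 0600);
    if (write(fd, buf, len) < 0)  /* may only reach the OS cache... */
        perror("write");
    fdatasync(fd);                /* ...so flush it explicitly */
    close(fd);
}
```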

-----------

Test system(s)

Athlon Linux:
AMD Athlon XP2000, 512MB RAM, single (5400 or 7200?) RPM 20GB IDE disk,
ReiserFS filesystem (3.x, I think)
SuSE Linux kernel 2.4.21-99

Mac Linux:
I don't know the specific model.  400MHz G3, 512MB, single IDE disk,
ext2 filesystem
Debian GNU/Linux 2.4.16-powerpc

HP Intel Linux:
ProLiant DL380 G3, 2 x 3GHz Xeon, 2GB RAM, SmartArray 5i 64MB cache,
2 x 15,000RPM 36GB U320 SCSI drives mirrored.  I'm not sure if
writes are cached or not.  There's no battery backup.
ext3 filesystem.
Red Hat Enterprise Linux 3.0 kernel based on 2.4.21

Dell Intel OpenBSD:
PowerEdge ?, single 1GHz PIII, 128MB RAM, single 7200RPM 80GB IDE disk,
ffs filesystem
OpenBSD 3.2 GENERIC kernel

SUN Ultra2:
Ultra2, 2 x 296MHz UltraSPARC II, 2GB RAM, 2 x 10,000RPM 18GB U160
SCSI drives mirrored with Solstice DiskSuite.  UFS filesystem.
Solaris 8.

SUN E4500 + HDS Thunder 9570v
E4500, 8 x 400MHz UltraSPARC II, 3GB RAM,
HDS Thunder 9570v, 2GB mirrored battery-backed cache, RAID5 with a
bunch of 146GB 10,000RPM FC drives.  LUN is on a single 2Gb FC fabric
connection.
Veritas filesystem (VxFS)
Solaris 8.

Test methodology:

All test runs were done with CHUNKSIZE 8 * 1024, CHUNKS 2 * 1024,
FILESIZE_MULTIPLIER 2, and SLEEP 5.  So a total of 16MB was sequentially
written for each benchmark.

Results are in microseconds.

PLATFORM:       Athlon Linux
buffered:       48220
fsync:          74854397
fdatasync:      75061357
open_sync:      73869239
open_datasync:  74748145
Notes:  System mostly idle.  Even during tests, top showed about 95%
idle.  Something's not right on this box.  All sync methods similarly
horrible on this system.

PLATFORM:       Mac Linux
buffered:       58912
fsync:          1539079
fdatasync:      769058
open_sync:      767094
open_datasync:  763074
Notes: system mostly idle.  fsync seems worst.  Otherwise, they seem
pretty equivalent.  This is the fastest system tested.

PLATFORM:       HP Intel Linux
buffered:       33026
fsync:          29330067
fdatasync:      28673880
open_sync:      8783417
open_datasync:  8747971
Notes: system idle.  O_SYNC and O_DSYNC methods seem to be a lot
better on this platform than fsync & fdatasync.

PLATFORM:       Dell Intel OpenBSD
buffered:       511890
fsync:          1769190
fdatasync:      --------
open_sync:      1748764
open_datasync:  1747433
Notes: system idle.  I couldn't locate fdatasync() on this box, so I
couldn't test it.  All sync methods seem equivalent and are very fast --
though still trail the old Mac.

PLATFORM:       SUN Ultra2
buffered:       1814824
fsync:          73954800
fdatasync:      52594532
open_sync:      34405585
open_datasync:  13883758
Notes:  system mostly idle, with occasional spikes of 1-10% utilization.
There appear to be substantial differences between the sync methods,
with O_DSYNC the best and fsync() the worst, and a notable gap between
the open_* and f*() methods.

PLATFORM:       SUN E4500 + HDS Thunder 9570v
buffered:       233947
fsync:          57802065
fdatasync:      56631013
open_sync:      2362207
open_datasync:  1976057
Notes:  host about 30% idle, but the array tested on was completely idle.
Something looks seriously wrong with fsync and fdatasync -- the write
cache seems to have no effect on them.  The write cache probably does
explain the roughly two seconds for the open_sync and open_datasync
methods.

--------------

Thanks for reading...I look forward to feedback, and hope to be helpful in
this effort!

Mark


Attachment

Re: Determine optimal fdatasync/fsync, O_SYNC/O_DSYNC options

From
Bruce Momjian
Date:
Have you seen /src/tools/fsync?

---------------------------------------------------------------------------

mudfoot@rawbw.com wrote:
> Hi, I'd like to help with the topic in the Subject: line.  It seems to be a
> TODO item.
> [...]

[ Attachment, skipping... ]

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: Determine optimal fdatasync/fsync, O_SYNC/O_DSYNC options

From
mudfoot@rawbw.com
Date:
Quoting Bruce Momjian <pgman@candle.pha.pa.us>:

>
> Have you seen /src/tools/fsync?
>

I have now.  Thanks.

Re: Determine optimal fdatasync/fsync, O_SYNC/O_DSYNC options

From
Gaetano Mendola
Date:
Bruce Momjian wrote:

> Have you seen /src/tools/fsync?
>

Now that the subject is open: why does Postgres choose fdatasync on
Linux?  I understand from other posts that open_sync is better than
fdatasync on this platform.

I chose open_sync myself, however.  Why not detect this parameter
during initdb?

Regards
Gaetano Mendola

These are my times:

kernel 2.4.9-e.24smp ( RAID SCSI ):

Simple write timing:
         write                    0.011544

Compare fsync times on write() and non-write() descriptor:
(If the times are similar, fsync() can sync data written
  on a different descriptor.)
         write, fsync, close      1.233312
         write, close, fsync      1.242086

Compare one o_sync write to two:
         one 16k o_sync write     0.517633
         two 8k o_sync writes     0.824603

Compare file sync methods with one 8k write:
         (o_dsync unavailable)
         open o_sync, write       0.438580
         write, fdatasync         1.239377
         write, fsync,            1.178017

Compare file sync methods with 2 8k writes:
         (o_dsync unavailable)
         open o_sync, write       0.818720
         write, fdatasync         1.395602
         write, fsync,            1.351214




kernel 2.4.22-1.2199.nptlsmp (single EIDE disk):

Simple write timing:
         write                    0.023697

Compare fsync times on write() and non-write() descriptor:
(If the times are similar, fsync() can sync data written
  on a different descriptor.)
         write, fsync, close      0.688765
         write, close, fsync      0.702166

Compare one o_sync write to two:
         one 16k o_sync write     0.498296
         two 8k o_sync writes     0.543956

Compare file sync methods with one 8k write:
         (o_dsync unavailable)
         open o_sync, write       0.259664
         write, fdatasync         0.971712
         write, fsync,            1.006096

Compare file sync methods with 2 8k writes:
         (o_dsync unavailable)
         open o_sync, write       0.536882
         write, fdatasync         1.160347
         write, fsync,            1.189699

Re: Determine optimal fdatasync/fsync, O_SYNC/O_DSYNC options

From
Josh Berkus
Date:
Gaetano,

> Now that the subject is open: why does Postgres choose fdatasync on
> Linux?  I understand from other posts that open_sync is better than
> fdatasync on this platform.

Not necessarily.  For example, here are my test results, on Linux 2.6.7,
writing to a ReiserFS mount on a software RAID 1 array of 2 IDE disks, on an
Athlon 1600MHz single-processor machine.  I ran the loop 10,000 times
instead of 1,000 because tests with 1,000 varied too much.

Simple write timing:
        write                    0.088701

Compare fsync times on write() and non-write() descriptor:
(If the times are similar, fsync() can sync data written
 on a different descriptor.)
        write, fsync, close      3.593958
        write, close, fsync      3.556978

Compare one o_sync write to two:
        one 16k o_sync write    42.951595
        two 8k o_sync writes    11.251389

Compare file sync methods with one 8k write:
        (o_dsync unavailable)
        open o_sync, write       6.807060
        write, fdatasync         7.207879
        write, fsync,            7.209087

Compare file sync methods with 2 8k writes:
        (o_dsync unavailable)
        open o_sync, write      13.120305
        write, fdatasync         7.583871
        write, fsync,            7.801748

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

Re: Determine optimal fdatasync/fsync, O_SYNC/O_DSYNC options

From
Tom Lane
Date:
Gaetano Mendola <mendola@bigfoot.com> writes:
> Now that the subject is open: why does Postgres choose fdatasync on
> Linux?  I understand from other posts that open_sync is better than
> fdatasync on this platform.

AFAIR, we've seen *one* test from *one* person alleging that.
And it was definitely not that way when we tested the behavior
originally, several releases back.  I'd like to see more evidence,
or better some indication that the Linux kernel changed algorithms,
before changing the default.

The tests that started this thread are pretty unconvincing in my eyes,
because they are comparing open_sync against code that fsyncs after each
one-block write.  Under those circumstances, *of course* fsync will lose
(or at least do no better), because it's forcing the same number of
writes through a same-or-less-efficient API.  The reason that this isn't
a trivial choice is that Postgres doesn't necessarily need to fsync
after every block of WAL.  In particular, when doing large transactions
there could be many blocks written between fsyncs, and in that case you
could come out ahead with fsync because the kernel would have more
freedom to schedule disk writes.

So, the only test I put a whole lot of faith in is testing your own
workload on your own Postgres server.  But if we want to set up a toy
test program to test this stuff, it's at least got to have an easily
adjustable (and preferably randomizable) distance between fsyncs.

Also, tests on IDE drives have zero credibility to start with, unless
you can convince me you know how to turn off write buffering on the
drive...

            regards, tom lane

Re: Determine optimal fdatasync/fsync, O_SYNC/O_DSYNC options

From
Bruce Momjian
Date:
Tom Lane wrote:
> The tests that started this thread are pretty unconvincing in my eyes,
> because they are comparing open_sync against code that fsyncs after each
> one-block write.  Under those circumstances, *of course* fsync will lose
> (or at least do no better), because it's forcing the same number of
> writes through a same-or-less-efficient API.  The reason that this isn't
> a trivial choice is that Postgres doesn't necessarily need to fsync
> after every block of WAL.  In particular, when doing large transactions
> there could be many blocks written between fsyncs, and in that case you
> could come out ahead with fsync because the kernel would have more
> freedom to schedule disk writes.

My guess is that the majority of queries do not fill more than one WAL
block.  Sure, some do, but in those cases the fsync is probably small
compared to the duration of the query.  If the majority of queries were
filling more than one block we would be checkpointing like crazy, and we
don't normally get reports about that.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: Determine optimal fdatasync/fsync, O_SYNC/O_DSYNC options

From
Gaetano Mendola
Date:

Tom Lane wrote:

| Gaetano Mendola <mendola@bigfoot.com> writes:
|
|>Now that the subject is open: why does Postgres choose fdatasync on
|>Linux?  I understand from other posts that open_sync is better than
|>fdatasync on this platform.
|
|
| AFAIR, we've seen *one* test from *one* person alleging that.
| And it was definitely not that way when we tested the behavior
| originally, several releases back.  I'd like to see more evidence,
| or better some indication that the Linux kernel changed algorithms,
| before changing the default.

I remember more than one person claiming that open_sync *apparently*
worked better than fdatasync; however, I trust you (it's 3:00 AM
here).


| The tests that started this thread are pretty unconvincing in my eyes,
| because they are comparing open_sync against code that fsyncs after each
| one-block write.  Under those circumstances, *of course* fsync will lose
| (or at least do no better), because it's forcing the same number of
| writes through a same-or-less-efficient API.
|
| The reason that this isn't a trivial choice is that Postgres doesn't
| necessarily need to fsync after every block of WAL.  In particular,
| when doing large transactions there could be many blocks written between
| fsyncs, and in that case you could come out ahead with fsync because the
| kernel would have more freedom to schedule disk writes.

Are you suggesting that Postgres should use more than one sync method,
choosing one or the other depending on the activity it is performing?

| So, the only test I put a whole lot of faith in is testing your own
| workload on your own Postgres server.  But if we want to set up a toy
| test program to test this stuff, it's at least got to have an easily
| adjustable (and preferably randomizable) distance between fsyncs.
|
| Also, tests on IDE drives have zero credibility to start with, unless
| you can convince me you know how to turn off write buffering on the
| drive...

I reported the IDE times just for information; my SAN, however, works
better with open_sync.  Can we trust the numbers given by tools/fsync?
I have seen some objections from you in the past, but I don't know
whether anything has been fixed since then.


Regards
Gaetano Mendola


Re: Determine optimal fdatasync/fsync, O_SYNC/O_DSYNC options

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> If we had a majority of queries filling more than one block we would
> be checkpointing like crazy and we don't normally get reports about
> that.

[ raised eyebrow... ]  And of course the 30-second-checkpoint-warning
stuff is a useless feature that no one ever exercises.

But your logic doesn't hold up anyway.  People may be doing large
transactions without necessarily doing them back-to-back-to-back;
there could be idle time in between.  For instance, I'd think an average
transaction size of 100 blocks would be more than enough to make fsync a
winner.  There are 2K blocks per WAL segment, so 20 of these would fit
in a segment.  With the default WAL parameters you could do sixty such
transactions per five minutes, or one every five seconds, without even
causing more-frequent-than-default checkpoints; and you could do two a
second without setting off the checkpoint-warning alarm.  The lack of
checkpoint complaints doesn't prove that this isn't a common real-world
load.

            regards, tom lane

Re: Determine optimal fdatasync/fsync, O_SYNC/O_DSYNC options

From
Vivek Khera
Date:
>>>>> "TL" == Tom Lane <tgl@sss.pgh.pa.us> writes:

TL> Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> If we had a majority of queries filling more than one block we would
>> be checkpointing like crazy and we don't normally get reports about
>> that.

TL> [ raised eyebrow... ]  And of course the 30-second-checkpoint-warning
TL> stuff is a useless feature that no one ever exercises.

Well, last year about this time I discovered in my testing that I was
checkpointing excessively; I found that the error message was
confusing, and Bruce cleaned it up.  So at least one person exercised
that feature, namely me. :-)

--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Vivek Khera, Ph.D.                Khera Communications, Inc.
Internet: khera@kciLink.com       Rockville, MD  +1-301-869-4449 x806
AIM: vivekkhera Y!: vivek_khera   http://www.khera.org/~vivek/