Thread: HFS+ pg_test_fsync performance
2014-04-15 0:32 GMT+02:00 Mel Llaguno <mllaguno@coverity.com>:
I was given anecdotal information that HFS+ performance under OSX makes it
unsuitable for production PG deployments, and that pg_test_fsync could be
used to measure its relative speed versus other operating systems (such as
Linux). In my performance lab, I have a number of similarly equipped Linux
hosts (Ubuntu 12.04 LTS Server 64-bit / 128 GB RAM / 2 OWC 6G Mercury
Extreme SSDs / 7200 rpm SATA3 HDD / 16 E5-series cores) that I used to
capture baseline Linux numbers. As we generally recommend that our
customers use SSDs (such as the Intel S3700 recommended for PG), I wanted
to perform a comparison. On these beefy machines I ran the following tests:
SSD:
# pg_test_fsync -f ./fsync.out -s 30
30 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.
Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync n/a
fdatasync 2259.652 ops/sec 443 usecs/op
fsync 1949.664 ops/sec 513 usecs/op
fsync_writethrough n/a
open_sync 2245.162 ops/sec 445 usecs/op
Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync n/a
fdatasync 2161.941 ops/sec 463 usecs/op
fsync 1891.894 ops/sec 529 usecs/op
fsync_writethrough n/a
open_sync 1118.826 ops/sec 894 usecs/op
Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB
in different write open_sync sizes.)
1 * 16kB open_sync write 2171.558 ops/sec 460 usecs/op
2 * 8kB open_sync writes 1126.490 ops/sec 888 usecs/op
4 * 4kB open_sync writes 569.594 ops/sec 1756 usecs/op
8 * 2kB open_sync writes 285.149 ops/sec 3507 usecs/op
16 * 1kB open_sync writes 142.528 ops/sec 7016 usecs/op
Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
write, fsync, close 1947.557 ops/sec 513 usecs/op
write, close, fsync 1951.082 ops/sec 513 usecs/op
Non-Sync'ed 8kB writes:
write 481296.909 ops/sec 2 usecs/op
HDD:
pg_test_fsync -f /tmp/fsync.out -s 30
30 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.
Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync n/a
fdatasync 105.783 ops/sec 9453 usecs/op
fsync 27.692 ops/sec 36111 usecs/op
fsync_writethrough n/a
open_sync 103.399 ops/sec 9671 usecs/op
Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync n/a
fdatasync 104.647 ops/sec 9556 usecs/op
fsync 27.223 ops/sec 36734 usecs/op
fsync_writethrough n/a
open_sync 55.839 ops/sec 17909 usecs/op
Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB
in different write open_sync sizes.)
1 * 16kB open_sync write 103.581 ops/sec 9654 usecs/op
2 * 8kB open_sync writes 55.207 ops/sec 18113 usecs/op
4 * 4kB open_sync writes 28.320 ops/sec 35311 usecs/op
8 * 2kB open_sync writes 14.581 ops/sec 68582 usecs/op
16 * 1kB open_sync writes 7.407 ops/sec 135003 usecs/op
Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
write, fsync, close 27.228 ops/sec 36727 usecs/op
write, close, fsync 27.108 ops/sec 36890 usecs/op
Non-Sync'ed 8kB writes:
write 466108.001 ops/sec 2 usecs/op
-------
So far, so good. Local HDD vs. SSD shows a significant difference in fsync
performance. Here are the corresponding fstab entries:
/dev/mapper/cim-base /opt/cim ext4 defaults,noatime,nodiratime,discard 0 2  (SSD)
/dev/mapper/p--app--lin-root / ext4 errors=remount-ro 0 1  (HDD)
I then ran pg_test_fsync on my OSX Mavericks machine (quad-core i7 /
Intel 520 SSD / 16 GB RAM) and got the following results:
# pg_test_fsync -s 30 -f ./fsync.out
30 seconds per test
Direct I/O is not supported on this platform.
Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync 8752.240 ops/sec 114 usecs/op
fdatasync 8556.469 ops/sec 117 usecs/op
fsync 8831.080 ops/sec 113 usecs/op
fsync_writethrough 735.362 ops/sec 1360 usecs/op
open_sync 8967.000 ops/sec 112 usecs/op
Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync 4256.906 ops/sec 235 usecs/op
fdatasync 7485.242 ops/sec 134 usecs/op
fsync 7335.658 ops/sec 136 usecs/op
fsync_writethrough 716.530 ops/sec 1396 usecs/op
open_sync 4303.408 ops/sec 232 usecs/op
Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB
in different write open_sync sizes.)
1 * 16kB open_sync write 7559.381 ops/sec 132 usecs/op
2 * 8kB open_sync writes 4537.573 ops/sec 220 usecs/op
4 * 4kB open_sync writes 2539.780 ops/sec 394 usecs/op
8 * 2kB open_sync writes 1307.499 ops/sec 765 usecs/op
16 * 1kB open_sync writes 659.985 ops/sec 1515 usecs/op
Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
write, fsync, close 9003.622 ops/sec 111 usecs/op
write, close, fsync 8035.427 ops/sec 124 usecs/op
Non-Sync'ed 8kB writes:
write 271112.074 ops/sec 4 usecs/op
-------
These results were unexpected and surprising. In almost every metric (with
the exception of the Non-Sync'ed 8kB writes), OSX Mavericks 10.9.2 using
HFS+ outperformed my Ubuntu servers. While the SSDs come from different
manufacturers, both use SandForce SF-2281 controllers.
Plausible explanations of the apparent disparity in fsync performance
would be welcome.
Thanks, Mel
P.S. One more thing: I found this article, which maps fsync mechanisms to
operating systems:
http://www.westnet.com/~gsmith/content/postgresql/TuningPGWAL.htm
This article suggests that both open_datasync and fdatasync are _not_
supported for OSX, but the pg_test_fsync results suggest otherwise.
My 2 cents:
The results are not surprising. In the Linux environment, the I/O calls of
pg_test_fsync use O_DIRECT (PG_O_DIRECT) together with the O_SYNC or O_DSYNC
flags, so in practice each write waits for an answer from the storage,
bypassing the cache in sync mode. On Mac OS X it does not do this: it uses
only the O_SYNC or O_DSYNC calls without O_DIRECT, so in practice it goes
through the filesystem cache even though it is requesting synchronous I/O.
Bye
Mat Dba
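For concreteness, a minimal C sketch of the open_datasync/open_sync case Mat
describes (this is not pg_test_fsync itself; the file name, iteration count,
and 8kB write size are illustrative assumptions). On Linux the O_DIRECT flag
is added, so every write has to be acknowledged by the device; on OSX the
same source builds without it, and the O_DSYNC writes can still be satisfied
by the filesystem cache:

#define _GNU_SOURCE                 /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

#define WRITE_SIZE 8192
#define ITERATIONS 2000

int main(void)
{
    int flags = O_WRONLY | O_CREAT | O_DSYNC;   /* like the open_datasync test */
#ifdef O_DIRECT
    flags |= O_DIRECT;                          /* Linux: bypass the filesystem cache */
#endif
    int fd = open("fsync_probe.out", flags, 0600);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, 4096, WRITE_SIZE) != 0)  /* O_DIRECT needs aligned buffers */
        return 1;
    memset(buf, 'x', WRITE_SIZE);

    struct timeval start, end;
    gettimeofday(&start, NULL);
    for (int i = 0; i < ITERATIONS; i++)
    {
        if (pwrite(fd, buf, WRITE_SIZE, 0) != WRITE_SIZE)
        { perror("pwrite"); return 1; }
    }
    gettimeofday(&end, NULL);

    double secs = (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec) / 1e6;
    printf("%d synchronous 8kB writes: %.0f ops/sec, %.0f usecs/op\n",
           ITERATIONS, ITERATIONS / secs, secs * 1e6 / ITERATIONS);
    close(fd);
    return 0;
}

OSX does have an approximate analogue of O_DIRECT (fcntl(fd, F_NOCACHE, 1)),
but as far as I can tell pg_test_fsync does not use it, which is consistent
with the "Direct I/O is not supported on this platform" line above.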
--------
Thanks for the explanation. Given that OSX always seems to use the filesystem cache, is there a way to measure fsync performance that is equivalent to Linux? Or will pg_test_fsync results always be inflated under OSX? The reason I ask is that we would like to make a case with a customer that PG performance on OSX/HFS+ would be suboptimal compared to Linux/EXT4 (or FreeBSD/UFS2, for that matter).
Thanks, Mel
Mel,

> I was given anecdotal information regarding HFS+ performance under OSX as
> being unsuitable for production PG deployments and that pg_test_fsync
> could be used to measure the relative speed versus other operating systems

You're welcome to identify your source of anecdotal evidence: me. And it's
based on experience and testing, although I'm not allowed to share the raw
results. Let's just say that I was part of two different projects where we
moved from using OSX on Apple hardware to using Linux on the *same* hardware
... because of intolerable performance on OSX. Switching to Linux more than
doubled our real write throughput.

> Compare file sync methods using one 8kB write:
> (in wal_sync_method preference order, except fdatasync
> is Linux's default)
> open_datasync 8752.240 ops/sec 114 usecs/op
> fdatasync 8556.469 ops/sec 117 usecs/op
> fsync 8831.080 ops/sec 113 usecs/op
============================================================================
> fsync_writethrough 735.362 ops/sec 1360 usecs/op
============================================================================
> open_sync 8967.000 ops/sec 112 usecs/op

fsync_writethrough is the *only* relevant stat above. For all of the other
fsync operations, OSX is "faking it": returning to the calling code without
ever flushing to disk. This would result in data corruption in the event of
an unexpected system shutdown. Both OSX and Windows do this, which is why we
*have* fsync_writethrough.

Mind you, I'm a little shocked that performance is still so bad on SSDs; I'd
attributed HFS's slow fsync mostly to waiting for a full disk rotation, but
apparently the primary cause is something else.

You'll notice that the speed of fsync_writethrough is about 1/4 of the
comparable speed on Linux. You can get similar performance on Linux by
putting your WAL on a ramdisk, but I don't recommend that for production.

But: things get worse. In addition to the very slow speed on real fsyncs,
HFS+ has very primitive IO scheduling for multiprocessor workloads; the
filesystem was designed for single-core machines (in 1995!) and has no
ability to interleave writes from multiple concurrent processes. This
results in "stuttering" as the IO system tries to service first one write
request, then another, and ends up stalling both. If you do, for example, a
serious ETL workload with parallelism on OSX, you'll see that IO throughput
describes a sawtooth from full speed to zero, being near-zero about half the
time.

So not only are fsyncs slower, real throughput for sustained writes on HFS+
is 50% or less of the hardware maximum in any real multi-user workload.

In order to test this, you'd need a workload which required loading and
sorting several tables larger than RAM, at least two in parallel.

In the words of the lead HFS+ developer, Don Brady: "Since we believed it
was only a stop-gap solution, we just went from 16 to 32 bits. Had we known
that it would still be in use 15 years later with multi-terabyte drives, we
probably would have done more design changes!"

HFS+ was written in about 6 months and is largely unimproved since its
release in 1995. Ext2 doesn't perform too well, either; the difference is
that Linux users have alternative filesystems available.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
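To make the fsync_writethrough point concrete, here is a minimal C sketch
(illustrative only; the file name and iteration count are arbitrary)
comparing a plain fsync() with the fcntl(F_FULLFSYNC) call that
fsync_writethrough issues on OSX to force the drive's write cache to be
flushed:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

#define WRITE_SIZE 8192
#define ITERATIONS 1000

static void time_method(int fd, const char *label, int writethrough)
{
    char buf[WRITE_SIZE];
    struct timeval start, end;

    memset(buf, 'x', sizeof(buf));
    gettimeofday(&start, NULL);
    for (int i = 0; i < ITERATIONS; i++)
    {
        if (pwrite(fd, buf, WRITE_SIZE, 0) != WRITE_SIZE)
            perror("pwrite");
#ifdef F_FULLFSYNC
        if (writethrough)
            fcntl(fd, F_FULLFSYNC, 0);  /* flush the device write cache (OSX) */
        else
#endif
            fsync(fd);                  /* may return once data reaches the drive's cache */
    }
    gettimeofday(&end, NULL);

    double secs = (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec) / 1e6;
    printf("%-22s %10.1f ops/sec\n", label, ITERATIONS / secs);
}

int main(void)
{
    int fd = open("flush_probe.out", O_WRONLY | O_CREAT, 0600);
    if (fd < 0) { perror("open"); return 1; }
    time_method(fd, "fsync", 0);
    time_method(fd, "fsync + F_FULLFSYNC", 1);
    close(fd);
    return 0;
}

On a Mavericks machine the second figure should fall to something like the
~735 ops/sec fsync_writethrough number reported above, while the plain fsync
figure stays inflated. On the PostgreSQL side, the corresponding setting is
wal_sync_method = fsync_writethrough in postgresql.conf, which is what gives
durable WAL flushes on OSX and Windows.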
Josh,

Thanks for the feedback. Given the prevalence of SSDs/VMs, it would be
useful to start collecting stats and tuning guidance for different operating
systems for things like fsync (and possibly bonnie++/dd). If the community
is interested, I've got a performance lab that I'd be willing to help run
tests on. Having this information would only improve our ability to support
our customers.

M.
On 04/23/2014 11:19 AM, Mel Llaguno wrote:
> If the community is interested, I've got a performance lab that I'd be
> willing to help run tests on.

That would be terrific. I'd also suggest running the performance test you
have for your production workload, and we could run some different sizes of
pgbench. I'd be particularly interested in the performance of ZFS tuning
options on Linux ...

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
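For reference, a pgbench run along these lines might look like the
following; the scale factor and client/thread counts are placeholders and
would need to be sized so the dataset comfortably exceeds RAM (pgbench data
is roughly 15 MB per scale unit, so scale 10000 is on the order of 150 GB,
larger than the 128 GB on the Linux hosts):

createdb bench
pgbench -i -s 10000 bench          # initialize; ~150 GB of data at this scale
pgbench -c 32 -j 8 -T 600 bench    # 32 clients, 8 worker threads, 10-minute run

Running the same commands on the OSX and Linux machines (and repeating with
wal_sync_method = fsync_writethrough on OSX) would give a sustained-write
comparison rather than just the single-process fsync latency that
pg_test_fsync measures.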