Thread: HFS+ pg_test_fsync performance
2014-04-15 0:32 GMT+02:00 Mel Llaguno <mllaguno@coverity.com>:
I was given anecdotal information that HFS+ performance under OSX makes it
unsuitable for production PG deployments, and that pg_test_fsync could be
used to measure its relative speed versus other operating systems (such as
Linux). In my performance lab, I have a number of similarly equipped Linux
hosts (Ubuntu 12.04 LTS Server 64-bit / 128 GB RAM / 2 OWC 6G Mercury
Extreme SSDs / 7200 rpm SATA3 HDD / 16 E5-series cores) that I used to
capture baseline Linux numbers. As we generally recommend that our
customers use SSDs (such as the Intel S3700 recommended for PG), I wanted
to perform a comparison. On these beefy machines I ran the following tests:
SSD:
# pg_test_fsync -f ./fsync.out -s 30
30 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.
Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync n/a
fdatasync 2259.652 ops/sec 443 usecs/op
fsync 1949.664 ops/sec 513 usecs/op
fsync_writethrough n/a
open_sync 2245.162 ops/sec 445 usecs/op
Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync n/a
fdatasync 2161.941 ops/sec 463 usecs/op
fsync 1891.894 ops/sec 529 usecs/op
fsync_writethrough n/a
open_sync 1118.826 ops/sec 894 usecs/op
Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB
in different write open_sync sizes.)
1 * 16kB open_sync write 2171.558 ops/sec 460 usecs/op
2 * 8kB open_sync writes 1126.490 ops/sec 888 usecs/op
4 * 4kB open_sync writes 569.594 ops/sec 1756 usecs/op
8 * 2kB open_sync writes 285.149 ops/sec 3507 usecs/op
16 * 1kB open_sync writes 142.528 ops/sec 7016 usecs/op
Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
write, fsync, close 1947.557 ops/sec 513 usecs/op
write, close, fsync 1951.082 ops/sec 513 usecs/op
Non-Sync'ed 8kB writes:
write 481296.909 ops/sec 2 usecs/op
HDD:
pg_test_fsync -f /tmp/fsync.out -s 30
30 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.
Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync n/a
fdatasync 105.783 ops/sec 9453 usecs/op
fsync 27.692 ops/sec 36111 usecs/op
fsync_writethrough n/a
open_sync 103.399 ops/sec 9671 usecs/op
Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync n/a
fdatasync 104.647 ops/sec 9556 usecs/op
fsync 27.223 ops/sec 36734 usecs/op
fsync_writethrough n/a
open_sync 55.839 ops/sec 17909 usecs/op
Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB
in different write open_sync sizes.)
1 * 16kB open_sync write 103.581 ops/sec 9654 usecs/op
2 * 8kB open_sync writes 55.207 ops/sec 18113 usecs/op
4 * 4kB open_sync writes 28.320 ops/sec 35311 usecs/op
8 * 2kB open_sync writes 14.581 ops/sec 68582 usecs/op
16 * 1kB open_sync writes 7.407 ops/sec 135003 usecs/op
Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
write, fsync, close 27.228 ops/sec 36727 usecs/op
write, close, fsync 27.108 ops/sec 36890 usecs/op
Non-Sync'ed 8kB writes:
write 466108.001 ops/sec 2 usecs/op
-------
So far, so good. Local HDD vs. SSD shows a significant difference in fsync
performance. Here are the corresponding fstab entries:
/dev/mapper/cim-base /opt/cim ext4 defaults,noatime,nodiratime,discard 0 2  (SSD)
/dev/mapper/p--app--lin-root / ext4 errors=remount-ro 0 1  (HDD)
I then ran pg_test_fsync on my OSX Mavericks machine (quad-core i7 /
Intel 520 SSD / 16 GB RAM) and got the following results:
# pg_test_fsync -s 30 -f ./fsync.out
30 seconds per test
Direct I/O is not supported on this platform.
Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync 8752.240 ops/sec 114 usecs/op
fdatasync 8556.469 ops/sec 117 usecs/op
fsync 8831.080 ops/sec 113 usecs/op
fsync_writethrough 735.362 ops/sec 1360 usecs/op
open_sync 8967.000 ops/sec 112 usecs/op
Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync 4256.906 ops/sec 235 usecs/op
fdatasync 7485.242 ops/sec 134 usecs/op
fsync 7335.658 ops/sec 136 usecs/op
fsync_writethrough 716.530 ops/sec 1396 usecs/op
open_sync 4303.408 ops/sec 232 usecs/op
Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB
in different write open_sync sizes.)
1 * 16kB open_sync write 7559.381 ops/sec 132 usecs/op
2 * 8kB open_sync writes 4537.573 ops/sec 220 usecs/op
4 * 4kB open_sync writes 2539.780 ops/sec 394 usecs/op
8 * 2kB open_sync writes 1307.499 ops/sec 765 usecs/op
16 * 1kB open_sync writes 659.985 ops/sec 1515 usecs/op
Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
write, fsync, close 9003.622 ops/sec 111 usecs/op
write, close, fsync 8035.427 ops/sec 124 usecs/op
Non-Sync'ed 8kB writes:
write 271112.074 ops/sec 4 usecs/op
-------
These results were unexpected and surprising. In almost every metric (with
the exception of the Non-Sync'ed 8kB writes), OSX Mavericks 10.9.2 using
HFS+ outperformed my Ubuntu servers. While the SSDs come from different
manufacturers, both use SandForce SF-2281 controllers.
Plausible explanations of the apparent disparity in fsync performance
would be welcome.
Thanks, Mel
P.S. One more thing: I found this article, which maps fsync mechanisms to
operating systems:
http://www.westnet.com/~gsmith/content/postgresql/TuningPGWAL.htm
This article suggests that both open_datasync and fdatasync are _not_
supported for OSX, but the pg_test_fsync results suggest otherwise.
My 2 cents:
The results are not surprising. In the Linux environment, the I/O calls of
pg_test_fsync use O_DIRECT (PG_O_DIRECT) together with the O_SYNC or O_DSYNC
flags, so in practice each write waits for an answer from the storage,
bypassing the cache in sync mode. On Mac OS X it does not do this: it uses
only the O_SYNC or O_DSYNC calls without O_DIRECT, so in practice it goes
through the filesystem cache even though it is requesting synchronous I/O.
Bye
Mat Dba
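For concreteness, a minimal C sketch of the open_datasync/open_sync case Mat
describes (this is not pg_test_fsync itself; the file name, iteration count,
and 8kB write size are illustrative assumptions). On Linux the O_DIRECT flag
is added, so every write has to be acknowledged by the device; on OSX the
same source builds without it, and the O_DSYNC writes can still be satisfied
by the filesystem cache:

#define _GNU_SOURCE                 /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

#define WRITE_SIZE 8192
#define ITERATIONS 2000

int main(void)
{
    int flags = O_WRONLY | O_CREAT | O_DSYNC;   /* like the open_datasync test */
#ifdef O_DIRECT
    flags |= O_DIRECT;                          /* Linux: bypass the filesystem cache */
#endif
    int fd = open("fsync_probe.out", flags, 0600);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, 4096, WRITE_SIZE) != 0)  /* O_DIRECT needs aligned buffers */
        return 1;
    memset(buf, 'x', WRITE_SIZE);

    struct timeval start, end;
    gettimeofday(&start, NULL);
    for (int i = 0; i < ITERATIONS; i++)
    {
        if (pwrite(fd, buf, WRITE_SIZE, 0) != WRITE_SIZE)
        { perror("pwrite"); return 1; }
    }
    gettimeofday(&end, NULL);

    double secs = (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec) / 1e6;
    printf("%d synchronous 8kB writes: %.0f ops/sec, %.0f usecs/op\n",
           ITERATIONS, ITERATIONS / secs, secs * 1e6 / ITERATIONS);
    close(fd);
    return 0;
}

OSX does have an approximate analogue of O_DIRECT (fcntl(fd, F_NOCACHE, 1)),
but as far as I can tell pg_test_fsync does not use it, which is consistent
with the "Direct I/O is not supported on this platform" line above.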
--------
Thanks for the explanation. Given that OSX always seems to use the filesystem cache, is there a way to measure fsync performance that is equivalent to Linux? Or will pg_test_fsync results always be inflated under OSX? The reason I ask is that we would like to make a case with a customer that PG performance on OSX/HFS+ would be suboptimal compared to Linux/EXT4 (or FreeBSD/UFS2, for that matter).
Thanks, Mel
Mel,

> I was given anecdotal information regarding HFS+ performance under OSX as
> being unsuitable for production PG deployments and that pg_test_fsync
> could be used to measure the relative speed versus other operating systems

You're welcome to identify your source of anecdotal evidence: me. And it's
based on experience and testing, although I'm not allowed to share the raw
results. Let's just say that I was part of two different projects where we
moved from using OSX on Apple hardware to using Linux on the *same* hardware
... because of intolerable performance on OSX. Switching to Linux more than
doubled our real write throughput.

> Compare file sync methods using one 8kB write:
> (in wal_sync_method preference order, except fdatasync
> is Linux's default)
> open_datasync 8752.240 ops/sec 114 usecs/op
> fdatasync 8556.469 ops/sec 117 usecs/op
> fsync 8831.080 ops/sec 113 usecs/op
============================================================================
> fsync_writethrough 735.362 ops/sec 1360 usecs/op
============================================================================
> open_sync 8967.000 ops/sec 112 usecs/op

fsync_writethrough is the *only* relevant stat above. For all of the other
fsync operations, OSX is "faking it": returning to the calling code without
ever flushing to disk. This would result in data corruption in the event of
an unexpected system shutdown. Both OSX and Windows do this, which is why we
*have* fsync_writethrough.

Mind you, I'm a little shocked that performance is still so bad on SSDs; I'd
attributed HFS's slow fsync mostly to waiting for a full disk rotation, but
apparently the primary cause is something else.

You'll notice that the speed of fsync_writethrough is about 1/4 of the
comparable speed on Linux. You can get similar performance on Linux by
putting your WAL on a ramdisk, but I don't recommend that for production.

But: things get worse. In addition to the very slow speed on real fsyncs,
HFS+ has very primitive IO scheduling for multiprocessor workloads; the
filesystem was designed for single-core machines (in 1995!) and has no
ability to interleave writes from multiple concurrent processes. This
results in "stuttering" as the IO system tries to service first one write
request, then another, and ends up stalling both. If you do, for example, a
serious ETL workload with parallelism on OSX, you'll see that IO throughput
describes a sawtooth from full speed to zero, being near-zero about half the
time.

So not only are fsyncs slower, real throughput for sustained writes on HFS+
is 50% or less of the hardware maximum in any real multi-user workload.

In order to test this, you'd need a workload which required loading and
sorting several tables larger than RAM, at least two in parallel.

In the words of the lead HFS+ developer, Don Brady: "Since we believed it
was only a stop-gap solution, we just went from 16 to 32 bits. Had we known
that it would still be in use 15 years later with multi-terabyte drives, we
probably would have done more design changes!"

HFS+ was written in about 6 months and is largely unimproved since its
release in 1995. Ext2 doesn't perform too well, either; the difference is
that Linux users have alternative filesystems available.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
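To make the fsync_writethrough point concrete, here is a minimal C sketch
(illustrative only; the file name and iteration count are arbitrary)
comparing a plain fsync() with the fcntl(F_FULLFSYNC) call that
fsync_writethrough issues on OSX to force the drive's write cache to be
flushed:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

#define WRITE_SIZE 8192
#define ITERATIONS 1000

static void time_method(int fd, const char *label, int writethrough)
{
    char buf[WRITE_SIZE];
    struct timeval start, end;

    memset(buf, 'x', sizeof(buf));
    gettimeofday(&start, NULL);
    for (int i = 0; i < ITERATIONS; i++)
    {
        if (pwrite(fd, buf, WRITE_SIZE, 0) != WRITE_SIZE)
            perror("pwrite");
#ifdef F_FULLFSYNC
        if (writethrough)
            fcntl(fd, F_FULLFSYNC, 0);  /* flush the device write cache (OSX) */
        else
#endif
            fsync(fd);                  /* may return once data reaches the drive's cache */
    }
    gettimeofday(&end, NULL);

    double secs = (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec) / 1e6;
    printf("%-22s %10.1f ops/sec\n", label, ITERATIONS / secs);
}

int main(void)
{
    int fd = open("flush_probe.out", O_WRONLY | O_CREAT, 0600);
    if (fd < 0) { perror("open"); return 1; }
    time_method(fd, "fsync", 0);
    time_method(fd, "fsync + F_FULLFSYNC", 1);
    close(fd);
    return 0;
}

On a Mavericks machine the second figure should fall to something like the
~735 ops/sec fsync_writethrough number reported above, while the plain fsync
figure stays inflated. On the PostgreSQL side, the corresponding setting is
wal_sync_method = fsync_writethrough in postgresql.conf, which is what gives
durable WAL flushes on OSX and Windows.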
Josh,

Thanks for the feedback. Given the prevalence of SSDs/VMs, it would be
useful to start collecting stats and tuning guidance for different operating
systems for things like fsync (and possibly bonnie++/dd). If the community
is interested, I've got a performance lab that I'd be willing to help run
tests on. Having this information would only improve our ability to support
our customers.

M.
On 04/23/2014 11:19 AM, Mel Llaguno wrote:
> If the community is interested, I've got a performance lab that I'd be
> willing to help run tests on.

That would be terrific. I'd also suggest running the performance test you
have for your production workload, and we could run some different sizes of
pgbench. I'd be particularly interested in the performance of ZFS tuning
options on Linux ...

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
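For reference, a pgbench run along these lines might look like the
following; the scale factor and client/thread counts are placeholders and
would need to be sized so the dataset comfortably exceeds RAM (pgbench data
is roughly 15 MB per scale unit, so scale 10000 is on the order of 150 GB,
larger than the 128 GB on the Linux hosts):

createdb bench
pgbench -i -s 10000 bench          # initialize; ~150 GB of data at this scale
pgbench -c 32 -j 8 -T 600 bench    # 32 clients, 8 worker threads, 10-minute run

Running the same commands on the OSX and Linux machines (and repeating with
wal_sync_method = fsync_writethrough on OSX) would give a sustained-write
comparison rather than just the single-process fsync latency that
pg_test_fsync measures.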