Thread: Changing default value of wal_sync_method to open_datasync on Linux


From
"Tsunakawa, Takayuki"
Date:
Hello,

I propose changing the default value of wal_sync_method from fdatasync to open_datasync on Linux.  The patch is
attached.  I feel this may be controversial, so I'd like to hear your opinions.

The reason for the change is better performance.  Robert Haas said open_datasync was much faster than fdatasync with
NVRAM in this thread:

https://www.postgresql.org/message-id/flat/C20D38E97BCB33DAD59E3A1@lab.ntt.co.jp#C20D38E97BCB33DAD59E3A1@lab.ntt.co.jp

pg_test_fsync shows higher figures for open_datasync:

[SSD on bare metal, ext4 volume mounted with noatime,nobarrier,data=ordered]
--------------------------------------------------
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                     50829.597 ops/sec      20 usecs/op
        fdatasync                         42094.381 ops/sec      24 usecs/op
        fsync                             42209.972 ops/sec      24 usecs/op
        fsync_writethrough                            n/a
        open_sync                         48669.605 ops/sec      21 usecs/op
--------------------------------------------------


[HDD on VM, ext4 volume mounted with noatime,nobarrier,data=writeback]
(the figures seem oddly high, though; this may be due to some VM configuration)
--------------------------------------------------
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                     34648.778 ops/sec      29 usecs/op
        fdatasync                         31570.947 ops/sec      32 usecs/op
        fsync                             27783.283 ops/sec      36 usecs/op
        fsync_writethrough                              n/a
        open_sync                         35238.866 ops/sec      28 usecs/op
--------------------------------------------------
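For reference, the pg_test_fsync comparison above can be approximated with a short script.  This is only a rough
sketch, not a substitute for pg_test_fsync; the iteration count and block size are illustrative assumptions, and the
numbers depend entirely on the filesystem and device backing the temporary file:

```python
# Rough approximation of pg_test_fsync's "one 8kB write" comparison:
# write()+fdatasync() after each write vs. writing through an O_DSYNC
# file descriptor (the open_datasync method). Illustrative sketch only.
import os
import tempfile
import time

BLOCK = b"\0" * 8192   # one 8kB WAL-page-sized write
ITERATIONS = 200       # arbitrary; pg_test_fsync runs for a fixed time instead

def ops_per_sec(extra_flags, sync_after_write):
    fd, path = tempfile.mkstemp()
    os.close(fd)
    fd = os.open(path, os.O_WRONLY | extra_flags)
    try:
        start = time.perf_counter()
        for _ in range(ITERATIONS):
            os.lseek(fd, 0, os.SEEK_SET)   # rewrite the same block, as WAL does
            os.write(fd, BLOCK)
            if sync_after_write:
                os.fdatasync(fd)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)
    return ITERATIONS / elapsed

fdatasync_rate = ops_per_sec(0, sync_after_write=True)
o_dsync_rate = ops_per_sec(os.O_DSYNC, sync_after_write=False)  # Linux-only flag
print(f"fdatasync:     {fdatasync_rate:10.1f} ops/sec")
print(f"open_datasync: {o_dsync_rate:10.1f} ops/sec")
```

Note that pg_test_fsync additionally opens with O_DIRECT for open_datasync where supported, which this sketch omits.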


pgbench shows only marginally better results, and the difference is within the error range.  The following is the
tps of the default read/write workload of pgbench.  I ran the test with all the tables and indexes preloaded with
pg_prewarm (except pgbench_history), and with no checkpoint occurring.  I ran a write workload before the benchmark
so that no new WAL file would be created during the benchmark run.

[SSD on bare metal, ext4 volume mounted with noatime,nobarrier,data=ordered]
--------------------------------------------------
                   1      2      3    avg
fdatasync      17610  17164  16678  17150
open_datasync  17847  17457  17958  17754 (+3%)

[HDD on VM, ext4 volume mounted with noatime,nobarrier,data=writeback]
(the figures seem oddly high, though; this may be due to some VM configuration)
--------------------------------------------------
                  1     2     3   avg
fdatasync      4911  5225  5198  5111
open_datasync  4996  5284  5317  5199 (+1%)


As the removed comment describes, when wal_sync_method is open_datasync (or open_sync), open() fails with errno=EINVAL
if the ext4 volume is mounted with data=journal.  That's because open() specifies O_DIRECT in that case.  I don't think
that's a problem in practice, because data=journal will not be used for performance, and for O_DIRECT to be used at all,
wal_level must be changed from its default of replica to minimal and max_wal_senders must be set to 0.
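The EINVAL failure mode can be probed directly.  The following sketch (my own illustration, not part of the patch)
checks whether a file on a given volume can be opened with O_DIRECT at all; on ext4 with data=journal, or on
filesystems such as tmpfs that lack O_DIRECT support, the open() fails with EINVAL, which is the same error
wal_sync_method=open_datasync would hit when O_DIRECT is in use:

```python
# Probe whether a directory's filesystem accepts O_DIRECT opens.
# os.O_DIRECT is Linux-specific; this sketch will not run on e.g. macOS.
import errno
import os
import tempfile

def o_direct_supported(directory):
    """Return True if a file in `directory` can be opened with O_DIRECT."""
    fd, path = tempfile.mkstemp(dir=directory)
    os.close(fd)
    try:
        probe = os.open(path, os.O_WRONLY | os.O_DIRECT)
    except OSError as e:
        if e.errno == errno.EINVAL:   # filesystem rejects O_DIRECT
            return False
        raise                          # some other, unexpected failure
    else:
        os.close(probe)
        return True
    finally:
        os.unlink(path)

print(o_direct_supported(tempfile.gettempdir()))
```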
 


Regards
Takayuki Tsunakawa


Attachment

Re: Changing default value of wal_sync_method to open_datasync on Linux

From
Andres Freund
Date:
Hi,

On 2018-02-20 00:27:47 +0000, Tsunakawa, Takayuki wrote:
> I propose changing the default value of wal_sync_method from fdatasync
> to open_datasync on Linux.  The patch is attached.  I'm feeling this
> may be controversial, so I'd like to hear your opinions.

Indeed. My past experience with open_datasync on linux shows it to be
slower by roughly an order of magnitude. Even if that would turn out not
to be the case anymore, I'm *extremely* hesitant to make such a change.

> [HDD on VM, ext4 volume mounted with noatime,nobarrier,data=writeback]
> (the figures seem oddly high, though; this may be due to some VM
> configuration)

These numbers clearly aren't reliable, there's absolutely no way an hdd
can properly do ~30k syncs/sec.  Until there's reliable numbers this
seems moot.

Greetings,

Andres Freund


Re: Changing default value of wal_sync_method to open_datasync on Linux

From
Mark Kirkwood
Date:
On 20/02/18 13:27, Tsunakawa, Takayuki wrote:

> Hello,
>
> I propose changing the default value of wal_sync_method from fdatasync to open_datasync on Linux.  The patch is
> attached.  I feel this may be controversial, so I'd like to hear your opinions.
>
> The reason for the change is better performance.  Robert Haas said open_datasync was much faster than fdatasync with
> NVRAM in this thread:
>
>
https://www.postgresql.org/message-id/flat/C20D38E97BCB33DAD59E3A1@lab.ntt.co.jp#C20D38E97BCB33DAD59E3A1@lab.ntt.co.jp
>
> pg_test_fsync shows higher figures for open_datasync:
>
> [SSD on bare metal, ext4 volume mounted with noatime,nobarrier,data=ordered]
> --------------------------------------------------
> 5 seconds per test
> O_DIRECT supported on this platform for open_datasync and open_sync.
>
> Compare file sync methods using one 8kB write:
> (in wal_sync_method preference order, except fdatasync is Linux's default)
>          open_datasync                     50829.597 ops/sec      20 usecs/op
>          fdatasync                         42094.381 ops/sec      24 usecs/op
>          fsync                             42209.972 ops/sec      24 usecs/op
>          fsync_writethrough                            n/a
>          open_sync                         48669.605 ops/sec      21 usecs/op
> --------------------------------------------------
>
>
> [HDD on VM, ext4 volume mounted with noatime,nobarrier,data=writeback]
> (the figures seem oddly high, though; this may be due to some VM configuration)
> --------------------------------------------------
> 5 seconds per test
> O_DIRECT supported on this platform for open_datasync and open_sync.
>
> Compare file sync methods using one 8kB write:
> (in wal_sync_method preference order, except fdatasync is Linux's default)
>          open_datasync                     34648.778 ops/sec      29 usecs/op
>          fdatasync                         31570.947 ops/sec      32 usecs/op
>          fsync                             27783.283 ops/sec      36 usecs/op
>          fsync_writethrough                              n/a
>          open_sync                         35238.866 ops/sec      28 usecs/op
> --------------------------------------------------
>
>
> pgbench shows only marginally better results, and the difference is within the error range.  The following is the
> tps of the default read/write workload of pgbench.  I ran the test with all the tables and indexes preloaded with
> pg_prewarm (except pgbench_history), and with no checkpoint occurring.  I ran a write workload before the benchmark
> so that no new WAL file would be created during the benchmark run.
>
> [SSD on bare metal, ext4 volume mounted with noatime,nobarrier,data=ordered]
> --------------------------------------------------
>                     1      2      3    avg
> fdatasync      17610  17164  16678  17150
> open_datasync  17847  17457  17958  17754 (+3%)
>
> [HDD on VM, ext4 volume mounted with noatime,nobarrier,data=writeback]
> (the figures seem oddly high, though; this may be due to some VM configuration)
> --------------------------------------------------
>                    1     2     3   avg
> fdatasync      4911  5225  5198  5111
> open_datasync  4996  5284  5317  5199 (+1%)
>
>
> As the removed comment describes, when wal_sync_method is open_datasync (or open_sync), open() fails with
> errno=EINVAL if the ext4 volume is mounted with data=journal.  That's because open() specifies O_DIRECT in that
> case.  I don't think that's a problem in practice, because data=journal will not be used for performance, and for
> O_DIRECT to be used at all, wal_level must be changed from its default of replica to minimal and max_wal_senders
> must be set to 0.
>
>

I think the use of 'nobarrier' is probably disabling most/all reliable
writing to the devices. What do the numbers look like if you remove this
option?

regards

Mark


RE: Changing default value of wal_sync_method to open_datasync on Linux

From
"Tsunakawa, Takayuki"
Date:
From: Andres Freund [mailto:andres@anarazel.de]
> Indeed. My past experience with open_datasync on linux shows it to be slower
> by roughly an order of magnitude. Even if that would turn out not to be
> the case anymore, I'm *extremely* hesitant to make such a change.

Thanks for the quick feedback.  An order of magnitude is surprising.  Can you share the environment (Linux distro
version, kernel version, filesystem, mount options, workload, etc.)?  Do you think of anything that explains the
degradation?  I think it is reasonable that open_datasync is faster than fdatasync because:

* Short transactions like pgbench's require fewer system calls: write()+fdatasync() vs. a single write().
* fdatasync() probably has to scan the page cache for dirty pages.

The above differences should be invisible on slow disks, but they will show up on faster storage.  I guess that's why
Robert said open_datasync was much faster on NVRAM.

The manual says that pg_test_fsync is a tool for selecting the wal_sync_method value, and it indicates open_datasync
is better.  Why is fdatasync the default value only on Linux?  As a user, I don't understand why PostgreSQL does this
special handling.  If the current behavior of choosing fdatasync by default is due to some deficiency in an old kernel
and/or filesystem, I think we can change the default so that most users don't have to change wal_sync_method.


Regards
Takayuki Tsunakawa






RE: Changing default value of wal_sync_method to open_datasync on Linux

From
"Tsunakawa, Takayuki"
Date:
From: Mark Kirkwood [mailto:mark.kirkwood@catalyst.net.nz]
> I think the use of 'nobarrier' is probably disabling most/all reliable
> writing to the devices. What do the numbers look like if you remove this
> option?

Disabling the filesystem barrier is a valid tuning method as the PG manual says:

https://www.postgresql.org/docs/devel/static/wal-reliability.html

[Excerpt]
Recent SATA drives (those following ATAPI-6 or later) offer a drive cache flush command (FLUSH CACHE EXT), while SCSI
drives have long supported a similar command, SYNCHRONIZE CACHE. These commands are not directly accessible to
PostgreSQL, but some file systems (e.g., ZFS, ext4) can use them to flush data to the platters on write-back-enabled
drives. Unfortunately, such file systems behave suboptimally when combined with battery-backup unit (BBU) disk
controllers. In such setups, the synchronize command forces all data from the controller cache to the disks,
eliminating much of the benefit of the BBU. You can run the pg_test_fsync program to see if you are affected. If you
are affected, the performance benefits of the BBU can be regained by turning off write barriers in the file system or
reconfiguring the disk controller, if that is an option. If write barriers are turned off, make sure the battery
remains functional; a faulty battery can potentially lead to data loss. Hopefully file system and disk controller
designers will eventually address this suboptimal behavior.


I removed the nobarrier mount option on the VM with HDD.  pgbench throughput decreased about 30% to 3550 tps, but the
relative difference between fdatasync and open_datasync is similar.  I cannot disable nobarrier on the bare metal
machine with SSD for now, as other developers and test teams are using it.



Regards
Takayuki Tsunakawa






Re: Changing default value of wal_sync_method to open_datasync on Linux

From
Andres Freund
Date:
On 2018-02-20 01:56:17 +0000, Tsunakawa, Takayuki wrote:
> Disabling the filesystem barrier is a valid tuning method as the PG manual says:

I don't think it says that:

> https://www.postgresql.org/docs/devel/static/wal-reliability.html
>
> [Excerpt]
> Recent SATA drives (those following ATAPI-6 or later) offer a drive cache flush command (FLUSH CACHE EXT), while
> SCSI drives have long supported a similar command, SYNCHRONIZE CACHE. These commands are not directly accessible to
> PostgreSQL, but some file systems (e.g., ZFS, ext4) can use them to flush data to the platters on write-back-enabled
> drives. Unfortunately, such file systems behave suboptimally when combined with battery-backup unit (BBU) disk
> controllers. In such setups, the synchronize command forces all data from the controller cache to the disks,
> eliminating much of the benefit of the BBU. You can run the pg_test_fsync program to see if you are affected. If you
> are affected, the performance benefits of the BBU can be regained by turning off write barriers in the file system
> or reconfiguring the disk controller, if that is an option. If write barriers are turned off, make sure the battery
> remains functional; a faulty battery can potentially lead to data loss. Hopefully file system and disk controller
> designers will eventually address this suboptimal behavior.

Note it's only valid if running with a BBU. In which case the
performance measurements you're doing aren't particularly meaningful
anyway, as you'd test BBU performance rather than disk performance.


Greetings,

Andres Freund


Re: Changing default value of wal_sync_method to open_datasync on Linux

From
Robert Haas
Date:
On Mon, Feb 19, 2018 at 7:27 PM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
> The reason for the change is better performance.  Robert Haas said open_datasync was much faster than fdatasync with
> NVRAM in this thread:

I also said it would be worse on spinning disks.

Also, Yoshimi Ichiyanagi did not find it to be true even on NVRAM.

Changing the default requires a lot more than one test result where a
non-default setting is better.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas
I also said it would be worse on spinning disks.

Also, Yoshimi Ichiyanagi did not find it to be true even on NVRAM.


Yes, let me withdraw this proposal.  I couldn't see any performance
difference even with an ext4 volume on PCIe flash memory.

Regards
MauMau