Potential Large Performance Gain in WAL synching - Mailing list pgsql-hackers
From      | Curtis Faith
Subject   | Potential Large Performance Gain in WAL synching
Date      |
Msg-id    | DMEEJMCDOJAKPPFACMPMCEBOCEAA.curtis@galtair.com
Responses | Re: Potential Large Performance Gain in WAL synching
List      | pgsql-hackers
I've been looking at the TODO list and at caching issues, and I think
there may be a way to greatly improve the performance of the WAL. I've
made the following assumptions based on my reading of the manual and the
WAL discussions in the archives since about November 2000:

1) The WAL is currently fsync'd before a commit succeeds. This is done
   to ensure that the D in ACID is satisfied.

2) The wait on fsync is the biggest time cost for inserts or updates.

3) fsync itself probably increases contention for i/o on the same file,
   since some OS file-system cache structures must be locked as part of
   fsync. Depending on the file system, this could be a significant
   choke on total i/o throughput.

The issue is that there must be a definite record for the log in durable
storage before one can be certain that a transaction has succeeded. I'm
not familiar with the exact WAL implementation in PostgreSQL, but I am
familiar with others, including ARIES II; either way, it comes down to
making sure that the write to the WAL log has positively been written to
disk.

So, why don't we open the WAL log files with O_DSYNC | O_APPEND and then
use aio_write for all log writes? A transaction would simply do all its
log writing using aio_write and block until its last log aio request has
completed, using aio_waitcomplete. The call to aio_waitcomplete won't
return until the log record has been written to the disk. Opening with
O_DSYNC ensures that when the i/o completes, the write has reached the
disk, and aio_write to files opened with O_APPEND ensures that writes
append in the order they are issued. Hence, when the aio_write for the
last log entry of a transaction completes, the transaction can be sure
that its log records are in durable storage (IDE problems aside).

It seems to me that this would:

1) Preserve the required D semantics.

2) Allow transactions to complete and do work while other threads are
   waiting on the completion of their log writes.

3) Obviate the need for commit_delay, since there is no blocking, and
   the file system and the disk controller can batch multiple writes to
   the log while the drive waits for the end of the log file to come
   under one of the heads.

Here are the relevant TODOs:

* Delay fsync() when other backends are about to commit too [fsync]
* Determine optimal commit_delay value
* Determine optimal fdatasync/fsync, O_SYNC/O_DSYNC options
* Allow multiple blocks to be written to WAL with one write()

Am I missing something?

Curtis Faith
Principal
Galt Capital, LLP

------------------------------------------------------------------
Galt Capital                          http://www.galtcapital.com
12 Wimmelskafts Gade
Post Office Box 7549                  voice: 340.776.0144
Charlotte Amalie, St. Thomas          fax:   340.776.0244
United States Virgin Islands 00801    cell:  340.643.5368