Re: Potential Large Performance Gain in WAL synching - Mailing list pgsql-hackers
From | Curtis Faith |
---|---|
Subject | Re: Potential Large Performance Gain in WAL synching |
Date | |
Msg-id | DMEEJMCDOJAKPPFACMPMCECOCEAA.curtis@galtair.com |
In response to | Re: Potential Large Performance Gain in WAL synching (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses | Re: Potential Large Performance Gain in WAL synching |
List | pgsql-hackers |
I wrote:

> > ... most file systems can't process fsync's
> > simultaneous with other writes, so those writes block because the file
> > system grabs its own internal locks.

Tom Lane replies:

> Oh? That would be a serious problem, but I've never heard that asserted
> before. Please provide some evidence.

Well, I'm basing this on past empirical testing and having read some man
pages that describe fsync under this exact scenario. I'll have to write a
test to prove this one way or another. I'll also try and look into the
linux/BSD source for the common file systems used for PostgreSQL.

> On a filesystem that does have that kind of problem, can't you avoid it
> just by using O_DSYNC on the WAL files? Then there's no need to call
> fsync() at all, except during checkpoints (which actually issue sync()
> not fsync(), anyway).

No, they're not exactly the same thing. Consider:

Process A | File System |
---|---|
Writes index buffer | .idling... |
Writes entry to log cache | . |
Writes another index buffer | . |
Writes another log entry | . |
Writes tuple buffer | . |
Writes another log entry | . |
Index scan | . |
Large table sort | . |
Writes tuple buffer | . |
Writes another log entry | . |
Writes | . |
Writes another index buffer | . |
Writes another log entry | . |
Writes another index buffer | . |
Writes another log entry | . |
Index scan | . |
Large table sort | . |
Commit | . |
File Write Log Entry | . |
.idling... | Write to cache |
File Write Log Entry | .idling... |
.idling... | Write to cache |
File Write Log Entry | .idling... |
.idling... | Write to cache |
File Write Log Entry | .idling... |
.idling... | Write to cache |
Write Commit Log Entry | .idling... |
.idling... | Write to cache |
Call fsync | .idling... |
.idling... | Write all buffers to device. |
.DONE. | |

In this case, Process A is waiting for all the buffers to write at the end
of the transaction.

With asynchronous I/O this becomes:

Process A | File System |
---|---|
Writes index buffer | .idling... |
Writes entry to log cache | Queue up write - move head to cylinder |
Writes another index buffer | Write log entry to media |
Writes another log entry | Immediate write to cylinder since head is still there. |
Writes tuple buffer | . |
Writes another log entry | Queue up write - move head to cylinder |
Index scan | .busy with scan... |
Large table sort | Write log entry to media |
Writes tuple buffer | . |
Writes another log entry | Queue up write - move head to cylinder |
Writes | . |
Writes another index buffer | Write log entry to media |
Writes another log entry | Queue up write - move head to cylinder |
Writes another index buffer | . |
Writes another log entry | Write log entry to media |
Index scan | . |
Large table sort | Write log entry to media |
Commit | . |
Write Commit Log Entry | Immediate write to cylinder since head is still there. |
.DONE. | |

Effectively the real work of writing the cache is done while the CPU for
the process is busy doing index scans, sorts, etc. With the WAL log on
another device and SCSI I/O the log writing should almost always be done
except for the final commit write.

> > Whether by threads or multiple processes, there is the same contention on
> > the file through multiple writers. The file system can decide to reorder
> > writes before they start but not after. If a write comes after a
> > fsync starts it will have to wait on that fsync.
>
> AFAICS we cannot allow the filesystem to reorder writes of WAL blocks,
> on safety grounds (we want to be sure we have a consistent WAL up to the
> end of what we've written). Even if we can allow some reordering when a
> single transaction puts out a large volume of WAL data, I fail to see
> where any large gain is going to come from. We're going to be issuing
> those writes sequentially and that ought to match the disk layout about
> as well as can be hoped anyway.

My comment was applying to reads and writes of other processes, not the
WAL log. In my original email, recall I mentioned using the O_APPEND open
flag which will ensure that all log entries are done sequentially.

> > Likewise a given process's writes can NEVER be reordered if they are
> > submitted synchronously, as is done in the calls to flush the log as
> > well as the dirty pages in the buffer in the current code.
>
> We do not fsync buffer pages; in fact a transaction commit doesn't write
> buffer pages at all. I think the above is just a misunderstanding of
> what's really happening. We have synchronous WAL writing, agreed, but
> we want that AFAICS. Data block writes are asynchronous (between
> checkpoints, anyway).

Hmm, I keep hearing that buffer block writes are asynchronous but I don't
read that in the code at all. There are simple "write" calls with files
that are not opened with O_NONBLOCK, so they'll be done synchronously. The
code for this is relatively straightforward (once you get past the storage
manager abstraction) so I don't see what I might be missing.

It's true that data blocks are not required to be written before the
transaction commits, so they are in some sense asynchronous to the
transactions. However, they still later on block the process that is
requesting a new block when it happens to be dirty, forcing a write of the
block in the cache.

It looks to me like BufferAlloc will simply result in a call to
BufferReplace -> smgrblindwrt -> write for md storage manager objects.

This means that a process will block while the write of dirty cache
buffers takes place.

I'm happy to be wrong on this but I don't see any hard evidence of asynch
file calls anywhere in the code. Unless I am missing something this is a
huuuuge problem.

> There is one thing in the current WAL code that I don't like: if the WAL
> buffers fill up then everybody who would like to make WAL entries is
> forced to wait while some space is freed, which means a write, which is
> synchronous if you are using O_DSYNC. It would be nice to have a
> background process whose only task is to issue write()s as soon as WAL
> pages are filled, thus reducing the probability that foreground
> processes have to wait for WAL writes (when they're not committing that
> is). But this could be done portably with one more postmaster child
> process; I see no real need to dabble in aio_write.

Hmm, well, another process writing the log would accomplish the same thing
but isn't that what a file system is? ISTM that aio_write is quite a bit
easier and higher performance? This is especially true for those OS's
which have KAIO support.

- Curtis