Thread: Potential Large Performance Gain in WAL synching
I've been looking at the TODO lists and caching issues and think there may be
a way to greatly improve the performance of the WAL. I've made the following
assumptions based on my reading of the manual and the WAL archives since
about November 2000:

1) WAL is currently fsync'd before commit succeeds. This is done to ensure
   that the D in ACID is satisfied.

2) The wait on fsync is the biggest time cost for inserts or updates.

3) fsync itself probably increases contention for file i/o on the same file,
   since some OS file system cache structures must be locked as part of
   fsync. Depending on the file system this could be a significant choke on
   total i/o throughput.

The issue is that there must be a definite record in durable storage for the
log before one can be certain that a transaction has succeeded. I'm not
familiar with the exact WAL implementation in PostgreSQL but am familiar
with others, including ARIES II; however, it seems that it comes down to
making sure that the write to the WAL log has been positively written to
disk.

So, why don't we use files opened with O_DSYNC | O_APPEND for the WAL log
and then use aio_write for all log writes? A transaction would simply do all
the log writing using aio_write and block until the last log aio request has
completed, using aio_waitcomplete. The call to aio_waitcomplete won't return
until the log record has been written to the disk. Opening with O_DSYNC
ensures that when i/o completes the write has been written to the disk, and
aio_write with O_APPEND-opened files ensures that writes append in the order
they are received; hence when the aio_write for the last log entry of a
transaction completes, the transaction can be sure that its log records are
in durable storage (IDE problems aside).

It seems to me that this would:

1) Preserve the required D semantics.

2) Allow transactions to complete and do work while other threads are
   waiting on the completion of the log write.

3) Obviate the need for commit_delay, since there is no blocking and the
   file system and the disk controller can put multiple writes to the log
   together as the drive is waiting for the end of the log file to come
   under one of the heads.

Here are the relevant TODOs:

  * Delay fsync() when other backends are about to commit too [fsync]
  * Determine optimal commit_delay value
  * Determine optimal fdatasync/fsync, O_SYNC/O_DSYNC options
  * Allow multiple blocks to be written to WAL with one write()

Am I missing something?

Curtis Faith
Principal
Galt Capital, LLP

------------------------------------------------------------------
Galt Capital                          http://www.galtcapital.com
12 Wimmelskafts Gade
Post Office Box 7549                  voice: 340.776.0144
Charlotte Amalie, St. Thomas          fax:   340.776.0244
United States Virgin Islands 00801    cell:  340.643.5368
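For concreteness, here is a minimal sketch of the call pattern this
proposes, assuming POSIX AIO; it substitutes the portable aio_suspend()
for FreeBSD's aio_waitcomplete(), the wal_* names are illustrative rather
than anything in PostgreSQL, and error handling is elided:

    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <string.h>

    static int wal_fd;

    /* Open the log so every completed write is durable (O_DSYNC) and
     * appended in queue order (O_APPEND). */
    void
    wal_open(const char *path)
    {
        wal_fd = open(path, O_WRONLY | O_APPEND | O_DSYNC);
    }

    /* Queue one log record; the backend does not block here. */
    void
    wal_write_async(struct aiocb *cb, const void *rec, size_t len)
    {
        memset(cb, 0, sizeof(*cb));
        cb->aio_fildes = wal_fd;
        cb->aio_buf = (volatile void *) rec;  /* must stay valid until done */
        cb->aio_nbytes = len;
        aio_write(cb);
    }

    /* At commit: block only until the final (commit) record's write has
     * completed -- with O_DSYNC, completion means "on the platter". */
    int
    wal_commit_wait(struct aiocb *commit_cb)
    {
        const struct aiocb *list[1] = { commit_cb };

        while (aio_error(commit_cb) == EINPROGRESS)
            aio_suspend(list, 1, NULL);
        return (aio_return(commit_cb) >= 0) ? 0 : -1;
    }

Only wal_commit_wait() blocks, and only on the transaction's own last
record; everything earlier has already been queued to the device.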
"Curtis Faith" <curtis@galtair.com> writes: > So, why don't we use files opened with O_DSYNC | O_APPEND for the WAL log > and then use aio_write for all log writes? We already offer an O_DSYNC option. It's not obvious to me what aio_write brings to the table (aside from loss of portability). You still have to wait for the final write to complete, no? > 2) Allow transactions to complete and do work while other threads are > waiting on the completion of the log write. I'm missing something. There is no useful work that a transaction can do between writing its commit record and reporting completion, is there? It has to wait for that record to hit disk. regards, tom lane
tom lane replies:
> "Curtis Faith" <curtis@galtair.com> writes:
> > So, why don't we use files opened with O_DSYNC | O_APPEND for the WAL
> > log and then use aio_write for all log writes?
>
> We already offer an O_DSYNC option. It's not obvious to me what
> aio_write brings to the table (aside from loss of portability).
> You still have to wait for the final write to complete, no?

Well, for starters, by the time the write which includes the commit log
entry is written, much of the log for the transaction will already be on
disk, or in a controller on its way.

I don't see any O_NONBLOCK or O_NDELAY references in the sources, so it
looks like the log writes are blocking. If I read correctly, XLogInsert
calls XLogWrite, which calls write, which blocks. If these assumptions are
correct, there should be some significant gain here, but I won't know how
much until I try to change it.

This issue only affects the speed of a given back-end's transaction
processing capability. The REAL issue, and the one that will greatly
affect total system throughput, is that of contention on the file locks.
Since fsync needs to obtain a write lock on the file descriptor, as do the
write calls which originate from XLogWrite as the writes are written to
the disk, other back-ends will block while another transaction is
committing if the log cache fills to the point where their XLogInsert
results in an XLogWrite call to flush the log cache. I'd guess this means
that one won't gain much by adding other back-end processes past three or
four if there are a lot of inserts or updates.

The method I propose does not result in any blocking because of writes,
other than the final commit's write, and it has the very significant
advantage of allowing other transactions (from other back-ends) to
continue until they enter commit (and block waiting for their final commit
write to complete).

> > 2) Allow transactions to complete and do work while other threads are
> > waiting on the completion of the log write.
>
> I'm missing something. There is no useful work that a transaction can
> do between writing its commit record and reporting completion, is there?
> It has to wait for that record to hit disk.

The key here is that a thread that has not committed, and therefore is not
blocking, can do work while "other threads" (should have said back-ends or
processes) are waiting on their commit writes.

- Curtis

P.S. If I am right in my assumptions about the way the current system
works, I'll bet the change would speed up inserts in Shridhar's huge
database test by at least a factor of two or three, perhaps even an order
of magnitude. :-)
Curtis Faith wrote:
> The method I propose does not result in any blocking because of writes
> other than the final commit's write and it has the very significant
> advantage of allowing other transactions (from other back-ends) to
> continue until they enter commit (and blocking waiting for their final
> commit write to complete).
>
> > > 2) Allow transactions to complete and do work while other threads
> > > are waiting on the completion of the log write.
> >
> > I'm missing something. There is no useful work that a transaction can
> > do between writing its commit record and reporting completion, is
> > there? It has to wait for that record to hit disk.
>
> The key here is that a thread that has not committed and therefore is
> not blocking can do work while "other threads" (should have said
> back-ends or processes) are waiting on their commit writes.

I may be missing something here, but other backends don't block while one
writes to WAL. Remember, we are process based, not thread based, so the
write() call only blocks the one session. If you had threads, and you did
a write() call that blocked other threads, I can see where your idea would
be good, and where async i/o becomes an advantage.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
"Curtis Faith" <curtis@galtair.com> writes: > The REAL issue and the one that will greatly affect total system > throughput is that of contention on the file locks. Since fsynch needs to > obtain a write lock on the file descriptor, as does the write calls which > originate from XLogWrite as the writes are written to the disk, other > back-ends will block while another transaction is committing if the > log cache fills to the point where their XLogInsert results in a > XLogWrite call to flush the log cache. But that's exactly *why* we have a log cache: to ensure we can buffer a reasonable amount of log data between XLogFlush calls. If the above scenario is really causing a problem, doesn't that just mean you need to increase wal_buffers? regards, tom lane
Bruce Momjian wrote:
> I may be missing something here, but other backends don't block while
> one writes to WAL.

I don't think they'll block until they get to the fsync or XLogWrite call
while another transaction is fsync'ing.

I'm no Unix filesystem expert, but I don't see how the OS can handle
multiple writes and fsyncs to the same file descriptors without blocking
other processes from writing at the same time. It may be that there are
some clever data structures they use, but I've not seen huge praise for
most of the file systems. A well-written file system could minimize this
contention, but I'll bet it's there with most of the ones that PostgreSQL
most commonly runs on.

I'll have to write a test and see if there really is a problem.

- Curtis
I wrote:
> > The REAL issue and the one that will greatly affect total system
> > throughput is that of contention on the file locks. Since fsync needs
> > to obtain a write lock on the file descriptor, as do the write calls
> > which originate from XLogWrite as the writes are written to the disk,
> > other back-ends will block while another transaction is committing if
> > the log cache fills to the point where their XLogInsert results in an
> > XLogWrite call to flush the log cache.

tom lane wrote:
> But that's exactly *why* we have a log cache: to ensure we can buffer a
> reasonable amount of log data between XLogFlush calls. If the above
> scenario is really causing a problem, doesn't that just mean you need
> to increase wal_buffers?

Well, in cases where there are a lot of small transactions, the contention
will not be on the XLogWrite calls from caches getting full but on the
XLogWrite calls from transaction commits, which will happen very
frequently. I think this will have a detrimental effect on very high
update frequency performance.

So while larger WAL caches will help in the case of cache flushing because
the cache is full, I don't think they will make any difference for the
potentially more common case of transaction commits.

- Curtis
Curtis Faith wrote:
> Bruce Momjian wrote:
> > I may be missing something here, but other backends don't block while
> > one writes to WAL.
>
> I don't think they'll block until they get to the fsync or XLogWrite
> call while another transaction is fsync'ing.
>
> I'm no Unix filesystem expert but I don't see how the OS can handle
> multiple writes and fsyncs to the same file descriptors without blocking
> other processes from writing at the same time. It may be that there are
> some clever data structures they use but I've not seen huge praise for
> most of the file systems. A well-written file system could minimize this
> contention but I'll bet it's there with most of the ones that PostgreSQL
> most commonly runs on.
>
> I'll have to write a test and see if there really is a problem.

Yes, I can see some contention, but what does aio solve?

--
  Bruce Momjian
I wrote:
> > I'm no Unix filesystem expert but I don't see how the OS can handle
> > multiple writes and fsyncs to the same file descriptors without
> > blocking other processes from writing at the same time. It may be that
> > there are some clever data structures they use but I've not seen huge
> > praise for most of the file systems. A well-written file system could
> > minimize this contention but I'll bet it's there with most of the ones
> > that PostgreSQL most commonly runs on.
> >
> > I'll have to write a test and see if there really is a problem.

Bruce Momjian wrote:
> Yes, I can see some contention, but what does aio solve?

Well, theoretically, aio lets the file system handle the writes without
requiring any locks to be held by the processes issuing those writes. The
disk i/o scheduler can therefore issue the writes using spinlocks or
something very fast, since it controls the timing of each of the actual
writes. In some systems this is handled by the kernel and can be very
fast.

I suspect that with large RAID controllers or intelligent disk systems
like EMC this is even more important, because they should be able to
handle a much higher level of concurrent i/o.

Now whether or not the common file systems handle this well, I can't say.
Take a look at some comments on how Oracle uses asynchronous I/O:

http://www.ixora.com.au/notes/redo_write_multiplexing.htm
http://www.ixora.com.au/notes/asynchronous_io.htm
http://www.ixora.com.au/notes/raw_asynchronous_io.htm

It seems that OS support for this will likely increase and that this issue
will become more and more important as users contemplate SMP systems or if
threading is added to certain PostgreSQL subsystems.

It might be easier for me to implement the change I propose and then see
what kind of difference it makes. I wanted to run the idea past this group
first. We can all postulate whether or not it will work but we won't know
unless we try it.

My real issue is one of what happens in the event that it does work. I've
had very good luck implementing this sort of thing for other systems, but
I don't yet know the range of i/o requests that PostgreSQL makes. Assuming
we can demonstrate no detrimental effects on system reliability, and that
the change is implemented in such a way that it can be turned on or off
easily, will a 50% or better increase in speed for updates justify the
sort of change I am proposing? 20%? 10%?

- Curtis
Curtis Faith wrote:
> > Yes, I can see some contention, but what does aio solve?
>
> Well, theoretically, aio lets the file system handle the writes without
> requiring any locks to be held by the processes issuing those writes.
> The disk i/o scheduler can therefore issue the writes using spinlocks or
> something very fast, since it controls the timing of each of the actual
> writes. In some systems this is handled by the kernel and can be very
> fast.

I am again confused. When we do write(), we don't have to lock anything,
do we? (Multiple processes can write() to the same file just fine.) We do
block the current process, but we have nothing else to do until we know it
is written/fsync'ed. Does aio more easily allow the kernel to order those
writes? Is that the issue? Well, certainly the kernel already orders the
writes. Just because we write() doesn't mean it goes to disk; only fsync()
or the kernel do that.

> I suspect that with large RAID controllers or intelligent disk systems
> like EMC this is even more important because they should be able to
> handle a much higher level of concurrent i/o.
>
> Now whether or not the common file systems handle this well, I can't
> say. Take a look at some comments on how Oracle uses asynchronous I/O:
>
> http://www.ixora.com.au/notes/redo_write_multiplexing.htm
> http://www.ixora.com.au/notes/asynchronous_io.htm
> http://www.ixora.com.au/notes/raw_asynchronous_io.htm

Yes, but Oracle is threaded, right? So, yes, they clearly could win with
it. I read the second URL and it said we could issue separate writes and
have them be done in an optimal order. However, we use the file system,
not raw devices, so don't we already have that in the kernel with fsync()?

> It seems that OS support for this will likely increase and that this
> issue will become more and more important as users contemplate SMP
> systems or if threading is added to certain PostgreSQL subsystems.

Probably. Having seen the Informix 5/7 debacle, I don't want to fall into
the trap where we add stuff that just makes things faster on SMP/threaded
systems when it makes our code _slower_ on single CPU systems, which is
exactly what Informix did in Informix 7, and we know how that ended (lost
customers, bought by IBM). I don't think that's going to happen to us, but
I thought I would mention it.

> Assuming we can demonstrate no detrimental effects on system reliability
> and that the change is implemented in such a way that it can be turned
> on or off easily, will a 50% or better increase in speed for updates
> justify the sort of change I am proposing? 20%? 10%?

Yea, let's see what boost we get, and the size of the patch, and we can
review it. It is certainly worth researching.

--
  Bruce Momjian
Bruce Momjian wrote:
> I am again confused. When we do write(), we don't have to lock anything,
> do we? (Multiple processes can write() to the same file just fine.) We
> do block the current process, but we have nothing else to do until we
> know it is written/fsync'ed. Does aio more easily allow the kernel to
> order those writes? Is that the issue? Well, certainly the kernel
> already orders the writes. Just because we write() doesn't mean it goes
> to disk; only fsync() or the kernel do that.

"We" don't have to lock anything, but most file systems can't process
fsync's simultaneously with other writes, so those writes block because
the file system grabs its own internal locks. The fsync call is more
contentious than typical writes because its duration is usually longer, so
it holds the locks longer, over more pages and structures.

That is the real issue: the contention caused by fsync'ing very
frequently, which blocks other writers and readers. For the buffer
manager, the blocking of readers is probably even more problematic when
the cache is a small percentage (say < 10% to 15%) of the total database
size, because most leaf node accesses will result in a read. Each of these
reads will have to wait on the fsync as well. Again, a very well-written
file system probably can minimize this, but I've not seen any.

Further comment on:

> We do block the current process, but we have nothing else to do
> until we know it is written/fsync'ed.

Writing out a bunch of calls at the end, after having consumed a lot of
CPU cycles, and then waiting is not as efficient as writing them out while
those CPU cycles are being used. We are currently wasting the time it
takes for a given process to write. The thinking probably has been that
this is no big deal because other processes, say B, C and D, can use the
CPU cycles while process A blocks. This is true UNLESS the other processes
are blocking on reads or writes caused by process A doing the final writes
and fsync.

> Yes, but Oracle is threaded, right? So, yes, they clearly could win with
> it. I read the second URL and it said we could issue separate writes
> and have them be done in an optimal order. However, we use the file
> system, not raw devices, so don't we already have that in the kernel
> with fsync()?

Whether by threads or multiple processes, there is the same contention on
the file through multiple writers. The file system can decide to reorder
writes before they start but not after. If a write comes after an fsync
starts, it will have to wait on that fsync. Likewise, a given process's
writes can NEVER be reordered if they are submitted synchronously, as is
done in the calls to flush the log as well as the dirty pages in the
buffer in the current code.

> Probably. Having seen the Informix 5/7 debacle, I don't want to fall
> into the trap where we add stuff that just makes things faster on
> SMP/threaded systems when it makes our code _slower_ on single CPU
> systems, which is exactly what Informix did in Informix 7, and we know
> how that ended (lost customers, bought by IBM). I don't think that's
> going to happen to us, but I thought I would mention it.

Yes, I hate "improvements" that make things worse for most people. Any
changes I'd contemplate would simply be another configuration-driven
optimization that could be turned off very easily.

- Curtis
"Curtis Faith" <curtis@galtair.com> writes: > ... most file systems can't process fsync's > simultaneous with other writes, so those writes block because the file > system grabs its own internal locks. Oh? That would be a serious problem, but I've never heard that asserted before. Please provide some evidence. On a filesystem that does have that kind of problem, can't you avoid it just by using O_DSYNC on the WAL files? Then there's no need to call fsync() at all, except during checkpoints (which actually issue sync() not fsync(), anyway). > Whether by threads or multiple processes, there is the same contention on > the file through multiple writers. The file system can decide to reorder > writes before they start but not after. If a write comes after a > fsync starts it will have to wait on that fsync. AFAICS we cannot allow the filesystem to reorder writes of WAL blocks, on safety grounds (we want to be sure we have a consistent WAL up to the end of what we've written). Even if we can allow some reordering when a single transaction puts out a large volume of WAL data, I fail to see where any large gain is going to come from. We're going to be issuing those writes sequentially and that ought to match the disk layout about as well as can be hoped anyway. > Likewise a given process's writes can NEVER be reordered if they are > submitted synchronously, as is done in the calls to flush the log as > well as the dirty pages in the buffer in the current code. We do not fsync buffer pages; in fact a transaction commit doesn't write buffer pages at all. I think the above is just a misunderstanding of what's really happening. We have synchronous WAL writing, agreed, but we want that AFAICS. Data block writes are asynchronous (between checkpoints, anyway). There is one thing in the current WAL code that I don't like: if the WAL buffers fill up then everybody who would like to make WAL entries is forced to wait while some space is freed, which means a write, which is synchronous if you are using O_DSYNC. It would be nice to have a background process whose only task is to issue write()s as soon as WAL pages are filled, thus reducing the probability that foreground processes have to wait for WAL writes (when they're not committing that is). But this could be done portably with one more postmaster child process; I see no real need to dabble in aio_write. regards, tom lane
> > ... most file systems can't process fsync's
> > simultaneous with other writes, so those writes block because the file
> > system grabs its own internal locks.
>
> Oh? That would be a serious problem, but I've never heard that asserted
> before. Please provide some evidence.
>
> On a filesystem that does have that kind of problem, can't you avoid it
> just by using O_DSYNC on the WAL files?

To make this competitive, the WAL writes would need to be improved to do
more than one block (up to 256k or 512k per write) with one write call (if
that much is to be written for this tx to be able to commit). This should
actually not be too difficult, since the WAL buffer is already contiguous
memory.

If that is done, then I bet O_DSYNC will beat any other config we
currently have. With this, a separate disk for WAL, and large
transactions, you should be able to see your disks hit the max IO figures
they are capable of :-)

Andreas
"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes: > To make this competitive, the WAL writes would need to be improved to > do more than one block (up to 256k or 512k per write) with one write call > (if that much is to be written for this tx to be able to commit). > This should actually not be too difficult since the WAL buffer is already > contiguous memory. Hmmm ... if you were willing to dedicate a half meg or meg of shared memory for WAL buffers, that's doable. I was originally thinking of having the (still hypothetical) background process wake up every time a WAL page was completed and available to write. But it could be set up so that there is some "slop", and it only wakes up when the number of writable pages exceeds N, for some N that's still well less than the number of buffers. Then it could write up to N sequential pages in a single write(). However, this would only be a win if you had few and large transactions. Any COMMIT will force a write of whatever we have so far, so the idea of writing hundreds of K per WAL write can only work if it's hundreds of K between commit records. Is that a common scenario? I doubt it. If you try to set it up that way, then it's more likely that what will happen is the background process seldom awakens at all, and each committer effectively becomes responsible for writing all the WAL traffic since the last commit. Wouldn't that lose compared to someone else having written the previous WAL pages in background? We could certainly build the code to support this, though, and then experiment with different values of N. If it turns out N==1 is best after all, I don't think we'd have wasted much code. regards, tom lane
I wrote:
> > ... most file systems can't process fsync's
> > simultaneous with other writes, so those writes block because the file
> > system grabs its own internal locks.

tom lane replies:
> Oh? That would be a serious problem, but I've never heard that asserted
> before. Please provide some evidence.

Well, I'm basing this on past empirical testing and having read some man
pages that describe fsync under this exact scenario. I'll have to write a
test to prove this one way or another. I'll also try to look into the
linux/BSD source for the common file systems used for PostgreSQL.

> On a filesystem that does have that kind of problem, can't you avoid it
> just by using O_DSYNC on the WAL files? Then there's no need to call
> fsync() at all, except during checkpoints (which actually issue sync()
> not fsync(), anyway).

No, they're not exactly the same thing. Consider:

    Process A                     File System
    ---------                     -----------
    Writes index buffer           .idling...
    Writes entry to log cache     .
    Writes another index buffer   .
    Writes another log entry      .
    Writes tuple buffer           .
    Writes another log entry      .
    Index scan                    .
    Large table sort              .
    Writes tuple buffer           .
    Writes another log entry      .
    Writes                        .
    Writes another index buffer   .
    Writes another log entry      .
    Writes another index buffer   .
    Writes another log entry      .
    Index scan                    .
    Large table sort              .
    Commit                        .
    File Write Log Entry          .
    .idling...                    Write to cache
    File Write Log Entry          .idling...
    .idling...                    Write to cache
    File Write Log Entry          .idling...
    .idling...                    Write to cache
    File Write Log Entry          .idling...
    .idling...                    Write to cache
    Write Commit Log Entry        .idling...
    .idling...                    Write to cache
    Call fsync                    .idling...
    .idling...                    Write all buffers to device.
    .DONE.

In this case, Process A is waiting for all the buffers to be written at
the end of the transaction. With asynchronous I/O this becomes:

    Process A                     File System
    ---------                     -----------
    Writes index buffer           .idling...
    Writes entry to log cache     Queue up write - move head to cylinder
    Writes another index buffer   Write log entry to media
    Writes another log entry      Immediate write to cylinder since head
                                  is still there
    Writes tuple buffer           .
    Writes another log entry      Queue up write - move head to cylinder
    Index scan                    .busy with scan...
    Large table sort              Write log entry to media
    Writes tuple buffer           .
    Writes another log entry      Queue up write - move head to cylinder
    Writes                        .
    Writes another index buffer   Write log entry to media
    Writes another log entry      Queue up write - move head to cylinder
    Writes another index buffer   .
    Writes another log entry      Write log entry to media
    Index scan                    .
    Large table sort              Write log entry to media
    Commit                        .
    Write Commit Log Entry        Immediate write to cylinder since head
                                  is still there
    .DONE.

Effectively the real work of writing the cache is done while the CPU for
the process is busy doing index scans, sorts, etc. With the WAL log on
another device and SCSI I/O, the log writing should almost always be done
except for the final commit write.

> > Whether by threads or multiple processes, there is the same contention
> > on the file through multiple writers. The file system can decide to
> > reorder writes before they start but not after. If a write comes after
> > a fsync starts it will have to wait on that fsync.
>
> AFAICS we cannot allow the filesystem to reorder writes of WAL blocks,
> on safety grounds (we want to be sure we have a consistent WAL up to the
> end of what we've written). Even if we can allow some reordering when a
> single transaction puts out a large volume of WAL data, I fail to see
> where any large gain is going to come from. We're going to be issuing
> those writes sequentially and that ought to match the disk layout about
> as well as can be hoped anyway.

My comment was applying to reads and writes of other processes, not the
WAL log. In my original email, recall I mentioned using the O_APPEND open
flag, which will ensure that all log entries are done sequentially.

> > Likewise a given process's writes can NEVER be reordered if they are
> > submitted synchronously, as is done in the calls to flush the log as
> > well as the dirty pages in the buffer in the current code.
>
> We do not fsync buffer pages; in fact a transaction commit doesn't write
> buffer pages at all. I think the above is just a misunderstanding of
> what's really happening. We have synchronous WAL writing, agreed, but
> we want that AFAICS. Data block writes are asynchronous (between
> checkpoints, anyway).

Hmm, I keep hearing that buffer block writes are asynchronous but I don't
read that in the code at all. There are simple "write" calls with files
that are not opened with O_NOBLOCK, so they'll be done synchronously. The
code for this is relatively straightforward (once you get past the storage
manager abstraction), so I don't see what I might be missing.

It's true that data blocks are not required to be written before the
transaction commits, so they are in some sense asynchronous to the
transactions. However, they still later on block the process that is
requesting a new block when a block in the cache happens to be dirty,
forcing a write of that block.

It looks to me like BufferAlloc will simply result in a call to
BufferReplace > smgrblindwrt > write for md storage manager objects. This
means that a process will block while the write of dirty cache buffers
takes place. I'm happy to be wrong on this but I don't see any hard
evidence of asynch file calls anywhere in the code. Unless I am missing
something, this is a huuuuge problem.

> There is one thing in the current WAL code that I don't like: if the WAL
> buffers fill up then everybody who would like to make WAL entries is
> forced to wait while some space is freed, which means a write, which is
> synchronous if you are using O_DSYNC. It would be nice to have a
> background process whose only task is to issue write()s as soon as WAL
> pages are filled, thus reducing the probability that foreground
> processes have to wait for WAL writes (when they're not committing that
> is). But this could be done portably with one more postmaster child
> process; I see no real need to dabble in aio_write.

Hmm, well, another process writing the log would accomplish the same
thing, but isn't that what a file system is? ISTM that aio_write is quite
a bit easier and higher performance. This is especially true for those
OS's which have KAIO support.

- Curtis
"Curtis Faith" <curtis@galtair.com> writes: > It looks to me like BufferAlloc will simply result in a call to > BufferReplace > smgrblindwrt > write for md storage manager objects. > > This means that a process will block while the write of dirty cache > buffers takes place. I think Tom was suggesting that when a buffer is written out, the write() call only pushes the data down into the filesystem's buffer -- which is free to then write the actual blocks to disk whenever it chooses to. In other words, the write() returns, the backend process can continue with what it was doing, and at some later time the blocks that we flushed from the Postgres buffer will actually be written to disk. So in some sense of the word, that I/O is asynchronous. Cheers, Neil -- Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC
fsync exclusive lock evidence WAS: Potential Large Performance Gain in WAL synching
From: "Curtis Faith"
After some research I still hold that fsync blocks, at least on FreeBSD.
Am I missing something? Here's the evidence:

Code from /usr/src/sys/syscalls/vfs_syscalls:

    int
    fsync(p, uap)
        struct proc *p;
        struct fsync_args /* { syscallarg(int) fd; } */ *uap;
    {
        register struct vnode *vp;
        struct file *fp;
        vm_object_t obj;
        int error;

        if ((error = getvnode(p->p_fd, SCARG(uap, fd), &fp)) != 0)
            return (error);
        vp = (struct vnode *) fp->f_data;
        vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, p);
        if (VOP_GETVOBJECT(vp, &obj) == 0)
            vm_object_page_clean(obj, 0, 0, 0);
        if ((error = VOP_FSYNC(vp, fp->f_cred, MNT_WAIT, p)) == 0 &&
            vp->v_mount && (vp->v_mount->mnt_flag & MNT_SOFTDEP) &&
            bioops.io_fsync)
            error = (*bioops.io_fsync)(vp);
        VOP_UNLOCK(vp, 0, p);
        return (error);
    }

Notice the calls to:

    vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, p);
    ...
    VOP_UNLOCK(vp, 0, p);

surrounding the call to VOP_FSYNC.

From the man pages for VOP_UNLOCK:

    HEADER STUFF .....

    int VOP_LOCK(struct vnode *vp, int flags, struct proc *p);
    int VOP_UNLOCK(struct vnode *vp, int flags, struct proc *p);
    int VOP_ISLOCKED(struct vnode *vp, struct proc *p);
    int vn_lock(struct vnode *vp, int flags, struct proc *p);

    DESCRIPTION
    These calls are used to serialize access to the filesystem, such as
    to prevent two writes to the same file from happening at the same
    time.

    The arguments are:

    vp      the vnode being locked or unlocked
    flags   One of the lock request types:

            LK_SHARED       Shared lock
            LK_EXCLUSIVE    Exclusive lock
            LK_UPGRADE      Shared-to-exclusive upgrade
            LK_EXCLUPGRADE  First shared-to-exclusive upgrade
            LK_DOWNGRADE    Exclusive-to-shared downgrade
            LK_RELEASE      Release any type of lock
            LK_DRAIN        Wait for all lock activity to end

            The lock type may be or'ed with these lock flags:

            LK_NOWAIT       Do not sleep to wait for lock
            LK_SLEEPFAIL    Sleep, then return failure
            LK_CANRECURSE   Allow recursive exclusive lock
            LK_REENABLE     Lock is to be reenabled after drain
            LK_NOPAUSE      No spinloop

            The lock type may be or'ed with these control flags:

            LK_INTERLOCK    Specify when the caller already has a simple
                            lock (VOP_LOCK will unlock the simple lock
                            after getting the lock)
            LK_RETRY        Retry until locked
            LK_NOOBJ        Don't create object

    p       process context to use for the locks

    Kernel code should use vn_lock() to lock a vnode rather than calling
    VOP_LOCK() directly.
> Hmmm ... if you were willing to dedicate a half meg or meg of shared
> memory for WAL buffers, that's doable.

Yup, configuring Informix to three 2 Mb buffers (LOGBUF 2048) here.

> However, this would only be a win if you had few and large transactions.
> Any COMMIT will force a write of whatever we have so far, so the idea of
> writing hundreds of K per WAL write can only work if it's hundreds of K
> between commit records. Is that a common scenario? I doubt it.

It should help most for data loading, or mass updating, yes.

Andreas
On Fri, 2002-10-04 at 18:03, Neil Conway wrote:
> "Curtis Faith" <curtis@galtair.com> writes:
> > It looks to me like BufferAlloc will simply result in a call to
> > BufferReplace > smgrblindwrt > write for md storage manager objects.
> >
> > This means that a process will block while the write of dirty cache
> > buffers takes place.
>
> I think Tom was suggesting that when a buffer is written out, the
> write() call only pushes the data down into the filesystem's buffer --
> which is free to then write the actual blocks to disk whenever it
> chooses to. In other words, the write() returns, the backend process
> can continue with what it was doing, and at some later time the blocks
> that we flushed from the Postgres buffer will actually be written to
> disk. So in some sense of the word, that I/O is asynchronous.

Isn't that true only as long as there is buffer space available? When
there isn't buffer space available, it seems the window for blocking comes
into play. So I guess you could say it is optimally asynchronous and worst
case synchronous. I think the worst case situation is the one he's trying
to address. At least that's how I interpret it.

Greg
Curtis Faith writes:
> I'm no Unix filesystem expert but I don't see how the OS can handle
> multiple writes and fsyncs to the same file descriptors without
> blocking other processes from writing at the same time.

Why not? Other than the necessary synchronisation for attributes such as
file size and modification times, multiple processes can readily write to
different areas of the same file at the "same" time. fsync() may not
return until after the buffers it schedules are written, but it doesn't
have to block subsequent writes to different buffers in the file either.
(Note too Tom Lane's responses about when fsync() is used and not used.)

> I'll have to write a test and see if there really is a problem.

Please do. I expect you'll find things aren't as bad as you fear.

In another posting, you write:
> Hmm, I keep hearing that buffer block writes are asynchronous but I
> don't read that in the code at all. There are simple "write" calls with
> files that are not opened with O_NOBLOCK, so they'll be done
> synchronously. The code for this is relatively straightforward (once
> you get past the storage manager abstraction) so I don't see what I
> might be missing.

There is a confusion of terminology here: the write() is synchronous from
the point of the application only in that the data is copied into kernel
buffers (or pages remapped, or whatever) before the system call returns.
For files opened with O_DSYNC the write() would wait for the data to be
written to disk. Thus O_DSYNC is "synchronous" I/O, but there is no
equivalently easy name for the regular "flush to disk after write()
returns" that the Unix kernel has done ~forever.

The asynchronous I/O that you mention ("aio") is a third thing, different
from both regular write() and write() with O_DSYNC. I understand that with
aio the data is not even transferred to the kernel before the aio_write()
call returns, but I've never programmed with aio and am not 100% sure how
it works.

Regards,

Giles
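A small sketch of the three behaviors distinguished here, assuming a POSIX
system; the completion check busy-polls only to keep the example short
(aio_suspend() would be used in practice), and offsets are left at zero:

    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    void
    three_io_styles(const char *path, const void *buf, size_t len)
    {
        /* 1. Plain write(): returns once the data is in kernel buffers;
         *    the kernel flushes it to disk at some later time. */
        int fd = open(path, O_WRONLY);
        write(fd, buf, len);

        /* 2. O_DSYNC write(): returns only after the data is on disk. */
        int fd_sync = open(path, O_WRONLY | O_DSYNC);
        write(fd_sync, buf, len);

        /* 3. POSIX aio: aio_write() returns before the data has even been
         *    handed to the kernel; completion is observed separately. */
        struct aiocb cb;
        memset(&cb, 0, sizeof(cb));
        cb.aio_fildes = fd;
        cb.aio_buf = (volatile void *) buf;
        cb.aio_nbytes = len;
        aio_write(&cb);
        while (aio_error(&cb) == EINPROGRESS)
            ;                        /* busy-poll for brevity only */

        close(fd);
        close(fd_sync);
    }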
Neil Conway <neilc@samurai.com> writes:
> "Curtis Faith" <curtis@galtair.com> writes:
> > It looks to me like BufferAlloc will simply result in a call to
> > BufferReplace > smgrblindwrt > write for md storage manager objects.
> >
> > This means that a process will block while the write of dirty cache
> > buffers takes place.
>
> I think Tom was suggesting that when a buffer is written out, the
> write() call only pushes the data down into the filesystem's buffer --
> which is free to then write the actual blocks to disk whenever it
> chooses to.

Exactly --- in all Unix systems that I know of, a write() is asynchronous
unless one takes special pains (like opening the file with O_SYNC).
Pushing the data from userspace to the kernel disk buffers does not count
as I/O in my mind.

I am quite concerned about Curtis' worries about fsync, though. There's
not any fundamental reason for fsync to block other operations, but that
doesn't mean that it's been implemented reasonably everywhere :-(. We need
to take a look at that.

			regards, tom lane
"Curtis Faith" <curtis@galtair.com> writes: > After some research I still hold that fsync blocks, at least on > FreeBSD. Am I missing something? > Here's the evidence: > [ much snipped ] > vp = (struct vnode *)fp->f_data; > vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, p); Hm, I take it a "vnode" is what's usually called an inode, ie the unique identification data for a specific disk file? This is kind of ugly in general terms but I'm not sure that it really hurts Postgres. In our present scheme, the only files we ever fsync() are WAL log files, not data files. And in normal operation there is only one WAL writer at a time, and *no* WAL readers. So an exclusive kernel-level lock on a WAL file while we fsync really shouldn't create any problem for us. (Unless this indirectly blocks other operations that I'm missing?) As I commented before, I think we could do with an extra process to issue WAL writes in places where they're not in the critical path for a foreground process. But that seems to be orthogonal from this issue. regards, tom lane
Tom Lane wrote:
> "Curtis Faith" <curtis@galtair.com> writes:
> > After some research I still hold that fsync blocks, at least on
> > FreeBSD. Am I missing something?
> >
> > Here's the evidence:
> > [ much snipped ]
> > 	vp = (struct vnode *) fp->f_data;
> > 	vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, p);
>
> Hm, I take it a "vnode" is what's usually called an inode, ie the unique
> identification data for a specific disk file?

Yes, a Virtual Inode. I think it is virtual because it is used for NFS,
where the handle really isn't an inode.

> This is kind of ugly in general terms but I'm not sure that it really
> hurts Postgres. In our present scheme, the only files we ever fsync()
> are WAL log files, not data files. And in normal operation there is
> only one WAL writer at a time, and *no* WAL readers. So an exclusive
> kernel-level lock on a WAL file while we fsync really shouldn't create
> any problem for us. (Unless this indirectly blocks other operations
> that I'm missing?)

I think the small issue is:

    proc1           proc2
    write
    fsync           write
                    fsync

Proc2 has to wait for the fsync, but the write is so short compared to the
fsync, I don't see an issue. Now, if someone would come up with code that
did only one fsync for the above case, that would be a big win.

--
  Bruce Momjian
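For what it's worth, a sketch of what "only one fsync for the above case"
could look like: a group commit in which whichever process flushes covers
every committer already queued behind it. The pthread primitives below are
stand-ins for PostgreSQL's shared memory and LWLocks, and the log
positions are plain longs for brevity:

    #include <pthread.h>
    #include <unistd.h>

    static pthread_mutex_t flush_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  flushed_cv = PTHREAD_COND_INITIALIZER;
    static long requested_lsn = 0;  /* highest commit record anyone needs */
    static long flushed_lsn   = 0;  /* known durable up to here */
    static int  flush_in_progress = 0;

    /* Called by each committer after its commit record is write()n. */
    void
    commit_flush(int wal_fd, long my_lsn)
    {
        pthread_mutex_lock(&flush_lock);
        if (my_lsn > requested_lsn)
            requested_lsn = my_lsn;

        while (flushed_lsn < my_lsn)
        {
            if (!flush_in_progress)
            {
                /* Become the flusher: one fsync covers every committer
                 * whose record was already in the file. */
                long target = requested_lsn;

                flush_in_progress = 1;
                pthread_mutex_unlock(&flush_lock);
                fsync(wal_fd);
                pthread_mutex_lock(&flush_lock);
                flush_in_progress = 0;
                if (target > flushed_lsn)
                    flushed_lsn = target;
                pthread_cond_broadcast(&flushed_cv);
            }
            else
            {
                /* Someone else is flushing; their fsync covers us too. */
                pthread_cond_wait(&flushed_cv, &flush_lock);
            }
        }
        pthread_mutex_unlock(&flush_lock);
    }

A committer arriving while a flush is in progress simply waits; when the
broadcast comes, it either finds its record covered or becomes the next
flusher, so two committers in Bruce's diagram share a single fsync.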
Proposed LogWriter Scheme, WAS: Potential Large Performance Gain in WAL synching
From: "Curtis Faith"
It appears the fsync problem is pervasive. Here's Linux 2.4.19's version
from fs/buffer.c:

    lock->    down(&inode->i_sem);
              ret = filemap_fdatasync(inode->i_mapping);
              err = file->f_op->fsync(file, dentry, 1);
              if (err && !ret)
                  ret = err;
              err = filemap_fdatawait(inode->i_mapping);
              if (err && !ret)
                  ret = err;
    unlock->  up(&inode->i_sem);

But this is probably not a big factor, as you outline below, because the
WALWriteLock is causing the same kind of contention.

tom lane wrote:
> This is kind of ugly in general terms but I'm not sure that it really
> hurts Postgres. In our present scheme, the only files we ever fsync()
> are WAL log files, not data files. And in normal operation there is
> only one WAL writer at a time, and *no* WAL readers. So an exclusive
> kernel-level lock on a WAL file while we fsync really shouldn't create
> any problem for us. (Unless this indirectly blocks other operations
> that I'm missing?)

I hope you're right, but I see some very similar contention problems in
the case of many small transactions because of the WALWriteLock.

Assume Transaction A, which writes a lot of buffers and XLog entries, so
the Commit forces a relatively lengthy fsync. Transactions B - E block not
on the kernel lock from fsync but on the WALWriteLock. When A finishes the
fsync and subsequently releases the WALWriteLock, B unblocks and gets the
WALWriteLock for the fsync of its flush. C blocks on the WALWriteLock
waiting to write its XLOG_XACT_COMMIT. B releases, and now C writes its
XLOG_XACT_COMMIT.

There now seems to be a lot of contention on the WALWriteLock. This is a
shame for a system that has no locking at the logical level and therefore
seems like it could be very, very fast and offer incredible concurrency.

> As I commented before, I think we could do with an extra process to
> issue WAL writes in places where they're not in the critical path for
> a foreground process. But that seems to be orthogonal from this issue.

It's only orthogonal to the fsync-specific contention issue. We now have
to worry about WALWriteLock semantics causing the same contention. Your
idea of a separate LogWriter process could very nicely solve this problem,
and accomplish a few other things at the same time, if we make a few
enhancements:

Back-end servers would not issue fsync calls. They would simply block
waiting until the LogWriter had written their record to the disk, i.e.
until the sync'd block # was greater than the block that contained the
XLOG_XACT_COMMIT record. The LogWriter could wake up committed back-ends
after its log write returns.

The log file would be opened O_DSYNC, O_APPEND every time. The LogWriter
would issue writes of the optimal size when enough data was present, or of
smaller chunks if enough time had elapsed since the last write.

The nice part is that the WALWriteLock semantics could be changed to allow
the LogWriter to write to disk while WALWriteLocks are acquired by
back-end servers. WALWriteLocks would only be held for the brief time
needed to copy the entries into the log buffer. The LogWriter would only
need to grab a lock to determine the current end of the log buffer. Since
it would be writing blocks that occur earlier in the cache than the
XLogInsert log writers, it won't need to grab a WALWriteLock before
writing the cache buffers.

Many transactions would commit on the same fsync (now really a write with
O_DSYNC) and we would get optimal write throughput for the log system.

This would handle all the issues I had, and it doesn't sound like a huge
change. In fact, it ends up being almost semantically identical to the
aio_write suggestion I made originally, except the LogWriter is doing the
background writing instead of the OS, and we don't have to worry about aio
implementations and portability.

- Curtis
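A sketch of the protocol as proposed; the shared positions and the
sleep/wakeup primitives are invented for illustration, and the buffer is
treated as linear rather than circular:

    #include <unistd.h>

    typedef struct WalShared
    {
        long insert_pos;    /* end of data copied into the WAL buffer */
        long synced_pos;    /* end of data known durable on disk */
    } WalShared;

    /* Hypothetical primitives standing in for real IPC. */
    extern void wait_for_logwriter(WalShared *ws);
    extern void wake_committed_backends(WalShared *ws);
    extern int  timeout_elapsed(void);

    /* Backend at commit: its record is already copied into the buffer
     * (brief lock); now just sleep until the LogWriter syncs past it. */
    void
    backend_commit_wait(WalShared *ws, long my_commit_pos)
    {
        while (ws->synced_pos < my_commit_pos)
            wait_for_logwriter(ws);
    }

    /* LogWriter: the file is opened O_DSYNC | O_APPEND, so a returned
     * write() means durable.  Write optimal-size chunks, or whatever is
     * pending once enough time has passed, then wake satisfied backends. */
    void
    logwriter_cycle(int wal_fd, WalShared *ws, char *buf, long optimal_size)
    {
        long pending = ws->insert_pos - ws->synced_pos;

        if (pending >= optimal_size || (pending > 0 && timeout_elapsed()))
        {
            long n = pending < optimal_size ? pending : optimal_size;

            write(wal_fd, buf + ws->synced_pos, (size_t) n);
            ws->synced_pos += n;
            wake_committed_backends(ws);
        }
    }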
tgl@sss.pgh.pa.us (Tom Lane) writes:

[snip]

> On a filesystem that does have that kind of problem, can't you avoid it
> just by using O_DSYNC on the WAL files? Then there's no need to call
> fsync() at all, except during checkpoints (which actually issue sync()
> not fsync(), anyway).

This comment on using sync() instead of fsync() makes me slightly worried,
since sync() doesn't in any way guarantee that all data is written
immediately. E.g. on *BSD with softupdates, it doesn't even guarantee that
data is written within some deterministic time as far as I know (*).

With a quick check of the code I found

    /*
     *  mdsync() -- Sync storage.
     */
    int
    mdsync()
    {
        sync();
        if (IsUnderPostmaster)
            sleep(2);
        sync();
        return SM_SUCCESS;
    }

which is ugly (imho) even if sync() starts an immediate and complete file
system flush (which I don't think it does with softupdates). It seems to
be used only by

    /* ------------------------------------------------
     *  FlushBufferPool
     *
     *  Flush all dirty blocks in buffer pool to disk
     *  at the checkpoint time
     * ------------------------------------------------
     */
    void
    FlushBufferPool(void)
    {
        BufferSync();
        smgrsync();     /* calls mdsync() */
    }

so the question that remains is what kinds of guarantees FlushBufferPool()
really expects and needs from smgrsync(). If smgrsync() is called to make
up for lack of fsync() calls in BufferSync(), I'm getting really worried
:-)

_
Mats Lofkvist
mal@algonet.se

(*) See for example
    http://groups.google.com/groups?th=bfc8a0dc5373ed6e
Curtis Faith wrote:
> Back-end servers would not issue fsync calls. They would simply block
> waiting until the LogWriter had written their record to the disk, i.e.
> until the sync'd block # was greater than the block that contained the
> XLOG_XACT_COMMIT record. The LogWriter could wake up committed
> back-ends after its log write returns.
>
> The log file would be opened O_DSYNC, O_APPEND every time. The LogWriter
> would issue writes of the optimal size when enough data was present or
> of smaller chunks if enough time had elapsed since the last write.

So every backend is going to wait around until its fsync gets done by the
LogWriter process? How is that a win? This is just another version of our
GUC parameters:

    #commit_delay = 0           # range 0-100000, in microseconds
    #commit_siblings = 5        # range 1-1000

which attempt to delay fsync if other backends are nearing commit. Pushing
things out to another process isn't a win; figuring out if someone else is
coming for commit is. Remember, write() is fast, fsync is slow.

--
  Bruce Momjian
Bruce Momjian kirjutas L, 05.10.2002 kell 13:49:
> Curtis Faith wrote:
> > Back-end servers would not issue fsync calls. They would simply block
> > waiting until the LogWriter had written their record to the disk, i.e.
> > until the sync'd block # was greater than the block that contained the
> > XLOG_XACT_COMMIT record. The LogWriter could wake up committed
> > back-ends after its log write returns.
> >
> > The log file would be opened O_DSYNC, O_APPEND every time. The
> > LogWriter would issue writes of the optimal size when enough data was
> > present or of smaller chunks if enough time had elapsed since the
> > last write.
>
> So every backend is going to wait around until its fsync gets done by
> the LogWriter process? How is that a win? This is just another version
> of our GUC parameters:
>
> 	#commit_delay = 0		# range 0-100000, in microseconds
> 	#commit_siblings = 5		# range 1-1000
>
> which attempt to delay fsync if other backends are nearing commit.
> Pushing things out to another process isn't a win; figuring out if
> someone else is coming for commit is.

Exactly. If I understand correctly what Curtis is proposing, you don't
have to figure it out under his scheme - you just issue a WALWait command
and the WAL-writing process notifies you when your transaction's WAL is in
safe storage.

If the other committer was able to get his WALWait in before the actual
write took place, it will be notified too; if not, it will be notified
about 1/166th sec later (for a 10K rpm disk), when its write is done on
the next rev of the disk platters.

The writer process should just issue a continuous stream of aio_write()s
while there are any waiters, and keep track of which waiters are safe to
continue - thus no guessing of who's gonna commit. If supported by the
platform, this should use zero-copy writes - it should be safe because WAL
is append-only.

-----------
Hannu
Re: Proposed LogWriter Scheme, WAS: Potential Large Performance Gain in WAL synching
From: "Curtis Faith"
Bruce Momjian wrote:
> So every backend is going to wait around until its fsync gets done by
> the LogWriter process? How is that a win? This is just another version
> of our GUC parameters:
>
> 	#commit_delay = 0		# range 0-100000, in microseconds
> 	#commit_siblings = 5		# range 1-1000
>
> which attempt to delay fsync if other backends are nearing commit.
> Pushing things out to another process isn't a win; figuring out if
> someone else is coming for commit is.

It's not the same at all. My proposal makes two extremely important
changes from a performance perspective.

1) WALWriteLocks are never held by processes for lengthy transactions,
only for long enough to copy the log entry into the buffer. This means
real work can be done by other processes while a transaction is waiting
for its commit to finish. I'm sure that blocking on XLogInsert because
another transaction is performing an fsync is extremely common with
frequent-update scenarios.

2) The log is written using optimal write sizes, which is much better than
a user-defined guess at the number of microseconds to delay the fsync. We
should be able to get the bottleneck to be the maximum write throughput of
the disk with the modifications to Tom Lane's scheme I proposed.

> Remember, write() is fast, fsync is slow.

Okay, it's clear I missed the point about Unix write earlier :-)

However, it's not just saving fsyncs that we need to worry about. It's the
unnecessary blocking of other processes that are simply trying to append
some log records in the course of whatever updating or inserting they are
doing. They may be a long way from commit. fsync being slow is the whole
reason for not wanting to have exclusive locks held for the duration of an
fsync.

On an SMP machine this change alone would probably speed things up by an
order of magnitude (assuming there aren't any other similar locks causing
the same problem).

- Curtis
Re: Proposed LogWriter Scheme, WAS: Potential Large Performance Gain in WAL synching
From: Tom Lane
"Curtis Faith" <curtis@galtair.com> writes: > Assume Transaction A which writes a lot of buffers and XLog entries, > so the Commit forces a relatively lengthy fsynch. > Transactions B - E block not on the kernel lock from fsync but on > the WALWriteLock. You are confusing WALWriteLock with WALInsertLock. A transaction-committing flush operation only holds the former. XLogInsert only needs the latter --- at least as long as it doesn't need to write. Thus, given adequate space in the WAL buffers, transactions B-E do not get blocked by someone else who is writing/syncing in order to commit. Now, as the code stands at the moment there is no event other than commit or full-buffers that prompts a write; that means that we are likely to run into the full-buffer case more often than is good for performance. But a background writer task would fix that. > Back-end servers would not issue fsync calls. They would simply block > waiting until the LogWriter had written their record to the disk, i.e. > until the sync'd block # was greater than the block that contained the > XLOG_XACT_COMMIT record. The LogWriter could wake up committed back- > ends after its log write returns. This will pessimize performance except in the case where WAL traffic is very heavy, because it means you don't commit until the block containing your commit record is filled. What if you are the only active backend? My view of this is that backends would wait for the background writer only when they encounter a full-buffer situation, or indirectly when they are trying to do a commit write and the background guy has the WALWriteLock. The latter serialization is unavoidable: in that scenario, the background guy is writing/flushing an earlier page of the WAL log, and we *must* have that down to disk before we can declare our transaction committed. So any scheme that tries to eliminate the serialization of WAL writes will fail. I do not, however, see any value in forcing all the WAL writes to be done by a single process; which is essentially what you're saying we should do. That just adds extra process-switch overhead that we don't really need. > The log file would be opened O_DSYNC, O_APPEND every time. Keep in mind that we support platforms without O_DSYNC. I am not sure whether there are any that don't have O_SYNC either, but I am fairly sure that we measured O_SYNC to be slower than fsync()s on some platforms. > The nice part is that the WALWriteLock semantics could be changed to > allow the LogWriter to write to disk while WALWriteLocks are acquired > by back-end servers. As I said, we already have that; you are confusing WALWriteLock with WALInsertLock. > Many transactions would commit on the same fsync (now really a write > with O_DSYNC) and we would get optimal write throughput for the log > system. How are you going to avoid pessimizing the few-transactions case? regards, tom lane
Re: Proposed LogWriter Scheme, WAS: Potential Large Performance Gain in WAL synching
From: Doug McNaught
Tom Lane <tgl@sss.pgh.pa.us> writes: > "Curtis Faith" <curtis@galtair.com> writes: > > The log file would be opened O_DSYNC, O_APPEND every time. > > Keep in mind that we support platforms without O_DSYNC. I am not > sure whether there are any that don't have O_SYNC either, but I am > fairly sure that we measured O_SYNC to be slower than fsync()s on > some platforms. And don't we preallocate WAL files anyway? So O_APPEND would be irrelevant? -Doug
Hannu Krosing <hannu@tm.ee> writes: > The writer process should just issue a continuous stream of > aio_write()'s while there are any waiters and keep track of which waiters > are safe to continue - thus no guessing of who's gonna commit. This recipe sounds like "eat I/O bandwidth whether we need it or not". It might be optimal in the case where activity is so heavy that we do actually need a WAL write on every disk revolution, but in any scenario where we're not maxing out the WAL disk's bandwidth, it will hurt performance. In particular, it would seriously degrade performance if the WAL file isn't on its own spindle but has to share bandwidth with data file access. What we really want, of course, is "write on every revolution where there's something worth writing" --- either we've filled a WAL block or there is a commit pending. But that just gets us back into the same swamp of how-do-you-guess-whether-more-commits-will-arrive-soon. I don't see how an extra process makes that problem any easier. BTW, it would seem to me that aio_write() buys nothing over plain write() in terms of ability to gang writes. If we issue the write at time T and it completes at T+X, we really know nothing about exactly when in that interval the data was read out of our WAL buffers. We cannot assume that commit records that were stored into the WAL buffer during that interval got written to disk. The only safe assumption is that only records that were in the buffer at time T are down to disk; and that means that late arrivals lose. You can't issue aio_write immediately after the previous one completes and expect that this optimizes performance --- you have to delay it as long as you possibly can in hopes that more commit records arrive. So it comes down to being the same problem. regards, tom lane
Mats Lofkvist <mal@algonet.se> writes: > [ mdsync is ugly and not completely reliable ] Yup, it is. Do you have a better solution? fsync is not the answer, since the checkpoint process has no way to know what files may have been touched since the last checkpoint ... and even if it could find that out, a string of retail fsync calls would kill performance, cf. Curtis Faith's complaint. In practice I am not sure there is a problem. The local man page for sync() says The writing, although scheduled, is not necessarily complete upon return from sync. Now if "scheduled" means "will occur before any subsequently-commanded write occurs" then we're fine. I don't know if that's true though ... regards, tom lane
Re: Use of sync() [was Re: Potential Large Performance Gain in WAL synching]
From: Doug McNaught
Tom Lane <tgl@sss.pgh.pa.us> writes: > In practice I am not sure there is a problem. The local man page for > sync() says > > The writing, although scheduled, is not necessarily complete upon > return from sync. > > Now if "scheduled" means "will occur before any subsequently-commanded > write occurs" then we're fine. I don't know if that's true though ... In my understanding, it means "all currently dirty blocks in the file cache are queued to the disk driver". The queued writes will eventually complete, but not necessarily before sync() returns. I don't think subsequent write()s will block, unless the system is low on buffers and has to wait until dirty blocks are freed by the driver. -Doug
Re: Proposed LogWriter Scheme, WAS: Potential Large Performance Gain in WAL synching
From: "Curtis Faith"
> You are confusing WALWriteLock with WALInsertLock. A > transaction-committing flush operation only holds the former. > XLogInsert only needs the latter --- at least as long as it > doesn't need to write. Well, that makes things better than I thought. We still end up with a disk write for each transaction though, and I don't see how this can ever get better than (Disk RPM)/60 transactions per second, since commit fsyncs are serialized: a 10,000 RPM drive, for example, caps out at about 166 serialized commits per second. Every fsync will have to wait almost a full revolution to reach the end of the log. As a practical matter then everyone will use commit_delay to improve this. > This will pessimize performance except in the case where WAL traffic > is very heavy, because it means you don't commit until the block > containing your commit record is filled. What if you are the only > active backend? We could handle this using a mechanism analogous to the current commit delay. If there are more than commit_siblings other processes running then do the write automatically after the commit_delay interval. This would make things no more pessimistic than the current implementation but provide the additional benefit of allowing the LogWriter to write in optimal sizes if there are many transactions. The commit_delay method won't be as good in many cases. Consider an update scenario where a larger commit delay gives better throughput. A given transaction will flush after the commit_delay interval (microseconds, per the GUC). The delay is very unlikely to result in a scenario where the dirty log buffers are the optimal size. As a practical matter I think this would tend to make the writes larger than they would otherwise have been, and this would unnecessarily delay the commit on the transaction. > I do not, however, see any > value in forcing all the WAL writes to be done by a single process, > which is essentially what you're saying we should do. That just adds > extra process-switch overhead that we don't really need. I don't think that an fsync will ever NOT cause the process to get switched out, so I don't see how another process doing the write would result in more overhead. The fsync'ing process will block on the fsync, so there will always be at least one process switch (probably many) while waiting for the fsync to complete, since we are talking many milliseconds for the fsync in every case. > > The log file would be opened O_DSYNC, O_APPEND every time. > > Keep in mind that we support platforms without O_DSYNC. I am not > sure whether there are any that don't have O_SYNC either, but I am > fairly sure that we measured O_SYNC to be slower than fsync()s on > some platforms. Well, there is no reason that the LogWriter couldn't be doing fsyncs instead of O_DSYNC writes in those cases. I'd leave this switchable using the current flags. Just change the semantics a bit. - Curtis
Doug McNaught <doug@wireboard.com> writes: > Tom Lane <tgl@sss.pgh.pa.us> writes: >> In practice I am not sure there is a problem. The local man page for >> sync() says >> >> The writing, although scheduled, is not necessarily complete upon >> return from sync. >> >> Now if "scheduled" means "will occur before any subsequently-commanded >> write occurs" then we're fine. I don't know if that's true though ... > In my understanding, it means "all currently dirty blocks in the file > cache are queued to the disk driver". The queued writes will > eventually complete, but not necessarily before sync() returns. I > don't think subsequent write()s will block, unless the system is low > on buffers and has to wait until dirty blocks are freed by the driver. We don't need later write()s to block. We only need them to not hit disk before the sync-queued writes hit disk. So I guess the question boils down to what "queued to the disk driver" means --- has the order of writes been determined at that point? regards, tom lane
> In particular, it would seriously degrade performance if the WAL file > isn't on its own spindle but has to share bandwidth with > data file access. If the OS is stupid I could see this happening. But if there are buffers and some sort of elevator algorithm the I/O won't happen at bad times. I agree with you though that writing for every single insert probably does not make sense. There should be some blocking of writes. The optimal size would have to be derived empirically. > What we really want, of course, is "write on every revolution where > there's something worth writing" --- either we've filled a WAL block > or there is a commit pending. But that just gets us back into the > same swamp of how-do-you-guess-whether-more-commits-will-arrive-soon. > I don't see how an extra process makes that problem any easier. The whole point of the extra process handling all the writes is so that it can write on every revolution, if there is something to write. It doesn't need to care if more commits will arrive soon. > BTW, it would seem to me that aio_write() buys nothing over plain write() > in terms of ability to gang writes. If we issue the write at time T > and it completes at T+X, we really know nothing about exactly when in > that interval the data was read out of our WAL buffers. We cannot > assume that commit records that were stored into the WAL buffer during > that interval got written to disk. Why would we need to make that assumption? The only thing we'd need to know is that a given write succeeded, meaning that commits before that write are done. The advantage to aio_write in this scenario is when writes cross track boundaries or when the head is in the wrong spot. If we write in reasonable blocks with aio_write the write might get to the disk before the head passes the location for the write. Consider a scenario where: Head is at file offset 10,000. Log contains blocks 12,000 - 12,500 ..time passes.. Head is now at 12,050 Commit occurs writing block 12,501 In the aio_write case the write would already have been done for blocks 12,000 to 12,050 and would be queued up for some additional blocks up to potentially 12,500. So the write for the commit could occur without an additional rotation delay. We are talking roughly 6 to 17 milliseconds of delay for this rotation on a single disk (one revolution at 10,000 down to 3,600 RPM). I don't know how often this happens in actual practice but it might occur as often as every other time. - Curtis
Curtis Faith wrote: > The advantage to aio_write in this scenario is when writes cross track > boundaries or when the head is in the wrong spot. If we write > in reasonable blocks with aio_write the write might get to the disk > before the head passes the location for the write. > > Consider a scenario where: > > Head is at file offset 10,000. > > Log contains blocks 12,000 - 12,500 > > ..time passes.. > > Head is now at 12,050 > > Commit occurs writing block 12,501 > > In the aio_write case the write would already have been done for blocks > 12,000 to 12,050 and would be queued up for some additional blocks up to > potentially 12,500. So the write for the commit could occur without an > additional rotation delay. We are talking roughly 6 to 17 milliseconds > of delay for this rotation on a single disk. I don't know how often this > happens in actual practice but it might occur as often as every other > time. So, you are saying that we may get back aio confirmation quicker than if we issued our own write/fsync because the OS was able to slip our flush to disk in as part of someone else's or a general fsync? I don't buy that because it is possible our write() gets in as part of someone else's fsync and our fsync becomes a no-op, meaning there aren't any dirty buffers for that file. Isn't that also possible? Also, remember the kernel doesn't know where the platter rotation is either. Only the SCSI drive can reorder the requests to match this. The OS can group based on head location, but it doesn't know much about the platter location, and it doesn't even know where the head is. Also, does aio return info when the data is in the kernel buffers or when it is actually on the disk? Simply, aio allows us to do the write and get notification when it is complete. I don't see how that helps us, and I don't see any other advantages to aio. To use aio, we need to find something that _can't_ be solved with more traditional Unix APIs, and I haven't seen that yet. This aio thing is getting out of hand. It's like we have a hammer, and everything looks like a nail, or a use for aio. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
> So, you are saying that we may get back aio confirmation quicker than if > we issued our own write/fsync because the OS was able to slip our flush > to disk in as part of someone else's or a general fsync? > > I don't buy that because it is possible our write() gets in as part of > someone else's fsync and our fsync becomes a no-op, meaning there aren't > any dirty buffers for that file. Isn't that also possible? Separate out the two concepts: 1) Writing of incomplete transactions at the block level by a background LogWriter. I think it doesn't matter whether the write is aio_write or write; writing blocks when we get them should provide the benefit I outlined. Waiting till fsync could miss the opportunity to write before the head passes the end of the last durable write, because the drive buffers might empty, causing up to a full rotation's delay. 2) aio_write vs. normal write. Since, as you and others have pointed out, aio_write and write are both asynchronous, the issue becomes whether the copies to the file system buffers happen synchronously or not. This is not a big difference but it seems to me that the OS might be able to avoid some context switches by grouping the copying in the case of aio_write. I've heard anecdotal reports that this is significantly faster for some things but I don't know for certain. > > Also, remember the kernel doesn't know where the platter rotation is > either. Only the SCSI drive can reorder the requests to match this. The > OS can group based on head location, but it doesn't know much about the > platter location, and it doesn't even know where the head is. The kernel doesn't need to know anything about platter rotation. It just needs to keep the disk write buffers full enough not to cause a rotational latency. It's not so much a matter of reordering as it is of getting the data into the SCSI drive before the head passes the last write's position. If the SCSI drive's buffers are kept full it can continue writing at its full throughput. If the writes stop and the buffers empty, it will need to wait up to a full rotation before it gets to the end of the log again. > Also, does aio return info when the data is in the kernel buffers or > when it is actually on the disk? > > Simply, aio allows us to do the write and get notification when it is > complete. I don't see how that helps us, and I don't see any other > advantages to aio. To use aio, we need to find something that _can't_ > be solved with more traditional Unix APIs, and I haven't seen that yet. > > This aio thing is getting out of hand. It's like we have a hammer, and > everything looks like a nail, or a use for aio. Yes, while I think it's probably worth doing and faster, it won't help as much as just keeping the drive buffers full, even if that's by using write calls. I still don't understand the opposition to aio_write. Could we just have the configuration setup determine whether one or the other is used? I don't see why we wouldn't use the faster calls if they were present and reliable on a given system. - Curtis
Curtis Faith wrote: > > So, you are saying that we may get back aio confirmation quicker than if > > we issued our own write/fsync because the OS was able to slip our flush > > to disk in as part of someone else's or a general fsync? > > > > I don't buy that because it is possible our write() gets in as part of > > someone else's fsync and our fsync becomes a no-op, meaning there aren't > > any dirty buffers for that file. Isn't that also possible? > > Separate out the two concepts: > > 1) Writing of incomplete transactions at the block level by a > background LogWriter. > > I think it doesn't matter whether the write is aio_write or > write; writing blocks when we get them should provide the benefit > I outlined. > > Waiting till fsync could miss the opportunity to write before the > head passes the end of the last durable write, because the drive > buffers might empty, causing up to a full rotation's delay. No question about that! The sooner we can get stuff to the WAL buffers, the more likely we will get some other transaction to do our fsync work. Any ideas on how we can do that? > 2) aio_write vs. normal write. > > Since, as you and others have pointed out, aio_write and write are both > asynchronous, the issue becomes whether the copies to the > file system buffers happen synchronously or not. > > This is not a big difference but it seems to me that the OS might be > able to avoid some context switches by grouping the copying in the case > of aio_write. I've heard anecdotal reports that this is significantly > faster for some things but I don't know for certain. I suppose it is possible, but because we spend so much time in fsync, we want to focus on that. People have recommended mmap of the WAL file, and that seems like a much more direct way to handle it than aio. However, we can't control when the stuff gets sent to disk with mmap'ed WAL, or should I say we can't write to it and withhold writes to the disk file with mmap, so we would need some intermediate step, and then again, it just becomes more steps, and extra steps slow things down too. > > This aio thing is getting out of hand. It's like we have a hammer, and > > everything looks like a nail, or a use for aio. > > Yes, while I think it's probably worth doing and faster, it won't help as > much as just keeping the drive buffers full, even if that's by using write > calls. > I still don't understand the opposition to aio_write. Could we just have > the configuration setup determine whether one or the other is used? I > don't see why we wouldn't use the faster calls if they were present and > reliable on a given system. We hesitate to add code relying on new features unless it is a significant win, and in the aio case, we would have different WAL disk write models for with/without aio, so it clearly could be two code paths, and with two code paths, we can't as easily improve or optimize. If we get a 2% boost out of some feature, but it later discourages us from adding a 5% optimization, it is a loss. And, in most cases, the 2% optimization is for a few platforms, while the 5% optimization is for all. This code is +15 years old, so we are looking way down the road, not just for today's hot feature. For example, Tom just improved DISTINCT by 25% by optimizing some of the sorting and function call handling. If we had more complex threaded sort code, that may not have been possible, or it may have been possible for him to optimize only one of the code paths.
I can't tell you how many aio/mmap/fancy feature discussions we have had, and we obviously discuss them, but in the end, they end up being of questionable value for the risk/complexity; but, we keep talking, hoping we are wrong or some good ideas come out of it. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
On Sat, 2002-10-05 at 20:32, Tom Lane wrote: > Hannu Krosing <hannu@tm.ee> writes: > > The writer process should just issue a continuous stream of > > aio_write()'s while there are any waiters and keep track of which waiters > > are safe to continue - thus no guessing of who's gonna commit. > > This recipe sounds like "eat I/O bandwidth whether we need it or not". > It might be optimal in the case where activity is so heavy that we > do actually need a WAL write on every disk revolution, but in any > scenario where we're not maxing out the WAL disk's bandwidth, it will > hurt performance. In particular, it would seriously degrade performance > if the WAL file isn't on its own spindle but has to share bandwidth with > data file access. > > What we really want, of course, is "write on every revolution where > there's something worth writing" --- either we've filled a WAL block > or there is a commit pending. That's what I meant by "while there are any waiters". > But that just gets us back into the > same swamp of how-do-you-guess-whether-more-commits-will-arrive-soon. > I don't see how an extra process makes that problem any easier. I still think that we could get gang writes automatically, if we just ask for an aio_write at the completion of each WAL file page and keep track of those that are written. We could also keep track of the write position inside the WAL page for: 1. the end of the last write() of each process, and 2. the WAL file's write position at each aio_write(). Then we can safely(?) assume that each backend needs only its own write()s to be on disk before it can consider the trx committed. If the fsync()-like request comes in at a time when the aio_write for that process's last position has completed, we can let that process continue without even a context switch. In the above scenario I assume that the kernel can do the right thing by doing multiple aio_write requests for the same page in one sweep and not doing one physical write for each aio_write. > BTW, it would seem to me that aio_write() buys nothing over plain write() > in terms of ability to gang writes. If we issue the write at time T > and it completes at T+X, we really know nothing about exactly when in > that interval the data was read out of our WAL buffers. Yes, most likely. If we do several writes of the same pages they will hit physical disk at the same physical write. > We cannot > assume that commit records that were stored into the WAL buffer during > that interval got written to disk. The only safe assumption is that > only records that were in the buffer at time T are down to disk; and > that means that late arrivals lose. I assume that if each commit record issues an aio_write, then all of those which actually reached the disk will be notified. IOW the first aio_write orders the write, but all the latecomers which arrive before the actual write will also get written and notified. > You can't issue aio_write > immediately after the previous one completes and expect that this > optimizes performance --- you have to delay it as long as you possibly > can in hopes that more commit records arrive.
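A sketch of the bookkeeping this implies (the type and field names are hypothetical, purely for illustration):

#include <sys/types.h>

#define MAX_BACKENDS 64   /* arbitrary bound for the sketch */

/* Hypothetical flush-tracking state, shared by all backends. */
typedef struct
{
    off_t backend_write_end[MAX_BACKENDS]; /* 1. end of each backend's last write()   */
    off_t aio_write_pos;                   /* 2. WAL position at the last aio_write() */
    off_t flushed_pos;                     /* highest position known durable          */
} WalFlushState;

/* A backend may report commit as soon as everything it inserted is
 * durable - no fsync() of its own and, with luck, no context switch. */
static int commit_is_durable(const WalFlushState *s, int backend)
{
    return s->flushed_pos >= s->backend_write_end[backend];
}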
I guess we have quite different cases for different hardware configurations - if we have a separate disk subsystem for WAL, we may want to keep the log flowing to disk as fast as it is ready, including writing the last, partial page as often as new writes to it are done - as we possibly can't write more than ~ 250 times/sec (with 15K drives, no battery RAM) we will always have at least two context switches between writes (for a 500Hz context switch clock), and many more if processes background themselves while waiting for small transactions to commit. > So it comes down to being the same problem. Or its solution ;) as instead of predicting we just write all the data in the log that is ready to be written. If we postpone writing, there will be hiccups when we suddenly discover that we need to write a whole lot of pages (fsync()) after idling the disk for some period. --------------- Hannu
> No question about that! The sooner we can get stuff to the WAL buffers, > the more likely we will get some other transaction to do our fsync work. > Any ideas on how we can do that? More like the sooner we get stuff out of the WAL buffers and into the disk's buffers, whether by write or aio_write. It doesn't do any good to have information in the XLog unless it gets written to the disk buffers before they empty. > We hesitate to add code relying on new features unless it is a > significant win, and in the aio case, we would have different WAL disk > write models for with/without aio, so it clearly could be two code > paths, and with two code paths, we can't as easily improve or optimize. > If we get a 2% boost out of some feature, but it later discourages us > from adding a 5% optimization, it is a loss. And, in most cases, the 2% > optimization is for a few platforms, while the 5% optimization is for > all. This code is +15 years old, so we are looking way down the road, > not just for today's hot feature. I'll just have to implement it and see if it's as easy and isolated as I think it might be, and whether it would allow the same algorithm for aio_write or write. > I can't tell you how many aio/mmap/fancy feature discussions we have > had, and we obviously discuss them, but in the end, they end up being of > questionable value for the risk/complexity; but, we keep talking, > hoping we are wrong or some good ideas come out of it. I'm all in favor of keeping clean designs. I'm very pleased with how easy PostgreSQL is to read and understand given how much it does. - Curtis
Curtis Faith wrote: > > No question about that! The sooner we can get stuff to the WAL buffers, > > the more likely we will get some other transaction to do our fsync work. > > Any ideas on how we can do that? > > More like the sooner we get stuff out of the WAL buffers and into the > disk's buffers, whether by write or aio_write. Does aio_write just write, or write _and_ fsync()? > It doesn't do any good to have information in the XLog unless it > gets written to the disk buffers before they empty. Just for clarification, we have two issues in this thread: (1) WAL memory buffers fill up, forcing WAL writes; (2) multiple commits at the same time force too many fsync's. I just wanted to throw that out. > > I can't tell you how many aio/mmap/fancy feature discussions we have > > had, and we obviously discuss them, but in the end, they end up being of > > questionable value for the risk/complexity; but, we keep talking, > > hoping we are wrong or some good ideas come out of it. > > I'm all in favor of keeping clean designs. I'm very pleased with how > easy PostgreSQL is to read and understand given how much it does. Glad you see the situation we are in. ;-) -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Hannu Krosing <hannu@tm.ee> writes: > Or its solution ;) as instead of predicting we just write all the data > in the log that is ready to be written. If we postpone writing, there will > be hiccups when we suddenly discover that we need to write a whole lot > of pages (fsync()) after idling the disk for some period. This part is exactly the same point that I've been proposing to solve with a background writer process. We don't need aio_write for that. The background writer can handle pushing completed WAL pages out to disk. The sticky part is trying to gang the writes for multiple transactions whose COMMIT records would fit into the same WAL page, and that WAL page isn't full yet. The rest of what you wrote seems like wishful thinking about how aio_write might behave :-(. I have no faith in it. regards, tom lane
It seems that the Hackers list isn't in the subscribe/unsubscribe list at http://developer.postgresql.org/mailsub.php - just an FYI. -Mitch Computers are like Air Conditioners, they don't work when you open Windows.
On Sun, 2002-10-06 at 04:03, Tom Lane wrote: > Hannu Krosing <hannu@tm.ee> writes: > > Or its solution ;) as instead of predicting we just write all the data > > in the log that is ready to be written. If we postpone writing, there will > > be hiccups when we suddenly discover that we need to write a whole lot > > of pages (fsync()) after idling the disk for some period. > > This part is exactly the same point that I've been proposing to solve > with a background writer process. We don't need aio_write for that. > The background writer can handle pushing completed WAL pages out to > disk. The sticky part is trying to gang the writes for multiple > transactions whose COMMIT records would fit into the same WAL page, > and that WAL page isn't full yet. I just hoped that the kernel could be used as the background writer process, and in the process also solve the multiple-commits-on-the-same-page problem. > The rest of what you wrote seems like wishful thinking about how > aio_write might behave :-(. I have no faith in it. Yeah, and the fact that there are several slightly different implementations of AIO even on Linux alone does not help. I have to test the SGI KAIO implementation for conformance with my wishful thinking ;) Perhaps you could ask around about AIO in RedHat Advanced Server (is it the same AIO as SGI, and how does it behave in the "multiple writes on the same page" case) since you may have better links to RedHat? -------------- Hannu
On Sat, 2002-10-05 at 14:46, Curtis Faith wrote: > > 2) aio_write vs. normal write. > > Since, as you and others have pointed out, aio_write and write are both > asynchronous, the issue becomes whether the copies to the > file system buffers happen synchronously or not. Actually, I believe that write will be *mostly* asynchronous while aio_write will always be asynchronous. In a buffer-poor environment, I believe write will degrade into a synchronous operation. In an ideal situation, I think they will prove to be on par with one another with a slight bias toward aio_write. In less than ideal situations where buffer space is at a premium, I think aio_write will get the leg up. > The kernel doesn't need to know anything about platter rotation. It > just needs to keep the disk write buffers full enough not to cause > a rotational latency. Which is why in a buffer-poor environment, aio_write is generally preferred, as the write is still queued even if the buffer is full. That means the process will be ready to begin placing writes into the buffer, all without having to wait. On the other hand, when using write, the process must wait. In a worst-case scenario, it seems that aio_write does get a win. I personally would at least like to see an aio implementation and would even be willing to help benchmark/validate any returns in performance. Surely if testing reflected a performance boost it would be considered for baseline inclusion? Greg
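To illustrate the distinction Greg is drawing, compare the two submission paths (a sketch only, error handling omitted; whether write() actually blocks depends entirely on the kernel's buffer state):

#include <aio.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Plain write(): if the kernel has no free buffer to copy into,
 * the caller sleeps until one is reclaimed. */
void submit_blocking(int fd, void *buf, size_t len)
{
    (void) write(fd, buf, len);  /* may block in a buffer-poor system */
}

/* aio_write(): the request is queued and the call returns at once;
 * the copy happens later, as buffer space becomes available. */
void submit_queued(int fd, struct aiocb *cb, void *buf, size_t len)
{
    memset(cb, 0, sizeof(*cb));
    cb->aio_fildes = fd;
    cb->aio_buf    = buf;
    cb->aio_nbytes = len;
    (void) aio_write(cb);        /* returns without waiting */
}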
Greg Copeland <greg@CopelandConsulting.Net> writes: > I personally would at least like to see an aio implementation and would > be willing to even help benchmark it to benchmark/validate any returns > in performance. Surely if testing reflected a performance boost it > would be considered for baseline inclusion? It'd be considered, but whether it'd be accepted would have to depend on the size of the performance boost, its portability (how many platforms/scenarios do you actually get a boost for), and the extent of bloat/uglification of the code. I can't personally get excited about something that only helps if your server is starved for RAM --- who runs servers that aren't fat on RAM anymore? But give it a shot if you like. Perhaps your analysis is pessimistic. regards, tom lane
On Sun, 2002-10-06 at 11:46, Tom Lane wrote: > I can't personally get excited about something that only helps if your > server is starved for RAM --- who runs servers that aren't fat on RAM > anymore? But give it a shot if you like. Perhaps your analysis is > pessimistic. I do suspect my analysis is somewhat pessimistic too, but to what degree I have no idea. You make a good case on your memory argument but please allow me to kick it around further. I don't find it far-fetched to imagine situations where people may commit large amounts of memory for the database yet marginally starve available memory for file system buffers. Especially so on heavily I/O bound systems or where sporadically other types of non-database file activity may occur. Now, while I continue to assure myself that it is not far-fetched, I honestly have no idea how often this type of situation will typically occur. Of course, that opens the door for simply adding more memory and/or slightly reducing the amount of memory available to the database (thus making it available elsewhere). Now, after all that's said and done, having something like aio in use would seemingly allow it to be somewhat more "self-tuning" from a potential performance perspective. Greg
Re: Use of sync() [was Re: Potential Large Performance Gain in WAL synching]
From: Doug McNaught
Tom Lane <tgl@sss.pgh.pa.us> writes: > Doug McNaught <doug@wireboard.com> writes: > > In my understanding, it means "all currently dirty blocks in the file > > cache are queued to the disk driver". The queued writes will > > eventually complete, but not necessarily before sync() returns. I > > don't think subsequent write()s will block, unless the system is low > > on buffers and has to wait until dirty blocks are freed by the driver. > > We don't need later write()s to block. We only need them to not hit > disk before the sync-queued writes hit disk. So I guess the question > boils down to what "queued to the disk driver" means --- has the order > of writes been determined at that point? It's certainly possible that new write(s) get put into the queue alongside old ones--I think the Linux block layer tries to do this when it can, for one. According to the manpage, Linux used to wait until everything was written to return from sync(), though I don't *think* it does anymore. But that's not mandated by the specs. So I don't think we can rely on such behavior (not reordering writes across a sync()), though it will probably happen in practice a lot of the time. AFAIK there isn't anything better than sync() + sleep() as far as the specs go. Yes, it kinda sucks. ;) -Doug
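For what it's worth, the portable fallback Doug alludes to is literally this (a sketch; the two-second figure is an arbitrary guess, and nothing in the specs guarantees it is long enough):

#include <unistd.h>

/* Checkpoint-style flush when per-file fsync isn't practical:
 * schedule every dirty buffer in the system, then give the driver
 * time to retire the queued writes. */
void checkpoint_sync_fallback(void)
{
    sync();     /* schedule all dirty blocks for writing */
    sleep(2);   /* hope the queued writes have hit disk  */
}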
On 6 Oct 2002, Greg Copeland wrote: > On Sat, 2002-10-05 at 14:46, Curtis Faith wrote: > > > > 2) aio_write vs. normal write. > > > > Since, as you and others have pointed out, aio_write and write are both > > asynchronous, the issue becomes whether the copies to the > > file system buffers happen synchronously or not. > > Actually, I believe that write will be *mostly* asynchronous while > aio_write will always be asynchronous. In a buffer-poor environment, I > believe write will degrade into a synchronous operation. In an ideal > situation, I think they will prove to be on par with one another with a > slight bias toward aio_write. In less than ideal situations where > buffer space is at a premium, I think aio_write will get the leg up. I browsed the web and came across this piece of text regarding a Linux-KAIO patch by Silicon Graphics... "The asynchronous I/O (AIO) facility implements interfaces defined by the POSIX standard, although it has not been through formal compliance certification. This version of AIO is implemented with support from kernel modifications, and hence will be called KAIO to distinguish it from AIO facilities available from newer versions of glibc/librt. Because of the kernel support, KAIO is able to perform split-phase I/O to maximize concurrency of I/O at the device. With split-phase I/O, the initiating request (such as an aio_read) truly queues the I/O at the device as the first phase of the I/O request; a second phase of the I/O request, performed as part of the I/O completion, propagates results of the request. The results may include the contents of the I/O buffer on a read, the number of bytes read or written, and any error status. Preliminary experience with KAIO has shown over 35% improvement in database performance tests. Unit tests (which only perform I/O) using KAIO and Raw I/O have been successful in achieving 93% saturation with 12 disks hung off 2 X 40 MB/s Ultra-Wide SCSI channels. We believe that these encouraging results are a direct result of implementing a significant part of KAIO in the kernel using split-phase I/O while avoiding or minimizing the use of any globally contended locks." Well... > In a worst-case scenario, it seems that aio_write does get a win. > > I personally would at least like to see an aio implementation and would > even be willing to help benchmark/validate any returns > in performance. Surely if testing reflected a performance boost it > would be considered for baseline inclusion?
On Mon, 2002-10-07 at 10:38, Antti Haapala wrote: > I browsed the web and came across this piece of text regarding a Linux-KAIO > patch by Silicon Graphics... > Ya, I have read this before. The problem here is that I'm not aware of which AIO implementation on Linux is the forerunner, nor do I have any idea how its implementation or performance details differ from those of other implementations on other platforms. I know there are at least two aio efforts underway for Linux. There could yet be others. Attempting to cite specifics that only pertain to Linux, and then only with a specific implementation which may or may not be in general use, is questionable. Because of this I simply left it as saying that I believe my analysis is pessimistic. Anyone have any idea if Red Hat's Advanced Server uses KAIO or what? > > Preliminary experience with KAIO has shown over 35% improvement in > database performance tests. Unit tests (which only perform I/O) using KAIO > and Raw I/O have been successful in achieving 93% saturation with 12 disks > hung off 2 X 40 MB/s Ultra-Wide SCSI channels. We believe that these > encouraging results are a direct result of implementing a significant > part of KAIO in the kernel using split-phase I/O while avoiding or > minimizing the use of any globally contended locks." The problem here is, I have no idea what they are comparing to (worst-case reads/writes, which we know PostgreSQL *mostly* isn't suffering from). If we assume that PostgreSQL's read/write operations are somewhat optimized (as it currently sounds like they are), I'd seriously doubt we'd see that big of a difference. On the other hand, I'm hoping that if an aio postgresql implementation does get done we'll see something like a 5%-10% performance boost. Even still, I have nothing to pin that on other than hope. If we do see a notable performance increase for Linux, I have no idea what it will do for other platforms. Then, there are all of the issues that Tom brought up about bloat/uglification and maintainability. So, while I certainly do keep those remarks in mind, I think it's best to simply encourage the effort (or something like it) and help determine where we really sit by means of empirical evidence. Greg
Greg Copeland <greg@CopelandConsulting.Net> writes: > Ya, I have read this before. The problem here is that I'm not aware of > which AIO implementation on Linux is the forerunner, nor do I have any > idea how its implementation or performance details differ from those of > other implementations on other platforms. The implementation of AIO in 2.5 is the one by Ben LaHaise (not SGI). Not sure what the performance is like -- although it's been merged into 2.5 already, so someone can do some benchmarking. Can anyone suggest a good test? Keep in mind that glibc has had a user-space implementation for a little while (although I'd guess the performance to be unimpressive), so AIO would not be *that* kernel-version specific. > Anyone have any idea if Red Hat's Advanced Server uses KAIO or what? RH AS uses Ben LaHaise's implementation of AIO, I believe. Cheers, Neil -- Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC
> On Sun, 2002-10-06 at 11:46, Tom Lane wrote: > > I can't personally get excited about something that only helps if your > > server is starved for RAM --- who runs servers that aren't fat on RAM > > anymore? But give it a shot if you like. Perhaps your analysis is > > pessimistic. > > <snipped> I don't find it far-fetched to > imagine situations where people may commit large amounts of memory for > the database yet marginally starve available memory for file system > buffers. Especially so on heavily I/O bound systems or where sporadically > other types of non-database file activity may occur. > > <snipped> Of course, that opens the door for simply adding more memory > and/or slightly reducing the amount of memory available to the database > (thus making it available elsewhere). Now, after all that's said and > done, having something like aio in use would seemingly allow it to be > somewhat more "self-tuning" from a potential performance perspective. Good points. Now for some surprising news (at least it surprised me). I researched the file system source on my system (FreeBSD 4.6) and found that the behavior was optimized for non-database access, to eliminate unnecessary writes when temp files are created and deleted rapidly. It was not optimized to get data to the disk in the most efficient manner. The syncer on FreeBSD appears to place dirtied filesystem buffers into work queues that range from 1 to SYNCER_MAXDELAY. Each second the syncer processes one of the queues and increments a counter syncer_delayno. On my system the setting for SYNCER_MAXDELAY is 32. So each second 1/32nd of the writes that were buffered are processed. If the syncer gets behind and the writes for a given second take more than one second to process, the syncer does not wait but begins processing the next queue. AFAICT this means that there is no opportunity to have writes combined by the disk, since they are processed in buckets based on the time the writes came in. Also, it seems very likely that many installations won't have enough buffers for 30 seconds' worth of changes, and that there would be some level of SYNCHRONOUS writing because of this delay and the syncer process getting backed up. This might happen once per second as the buffers get full and the syncer has not yet started for that second interval. Linux might handle this better. I saw some emails exchanged a year or so ago about starting writes immediately in a low-priority way, but I'm not sure if those patches got applied to the Linux kernel or not. The source I had access to seems to do something analogous to FreeBSD but using fixed percentages of the dirty blocks or a minimum number of blocks. They appear to be handled in LRU order, however. On-disk caches are much, much larger these days, so it seems that some way of getting the data out sooner would result in better write performance for the cache. My newer drive is a 10K RPM IBM Ultrastar SCSI and it has a 4M cache. I don't see these caches getting smaller over time, so not letting the disk see writes will become more and more of a performance drain. - Curtis
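A simplified model of the syncer behavior described above (illustrative only; the real logic lives in FreeBSD's kernel sources, and flush_bucket here is an invented stand-in):

#define SYNCER_MAXDELAY 32

extern void flush_bucket(int bucket);  /* invented: write out one bucket */

/* Dirty buffers are hashed into SYNCER_MAXDELAY time buckets; one
 * bucket is flushed per second, so a freshly dirtied buffer can sit
 * for up to ~30 seconds before the syncer pushes it to disk. */
static int syncer_delayno = 0;

void syncer_tick(void)
{
    flush_bucket(syncer_delayno);  /* 1/32nd of the buffered writes */
    syncer_delayno = (syncer_delayno + 1) % SYNCER_MAXDELAY;
}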
Curtis Faith wrote: > Good points. > > Now for some surprising news (at least it surprised me). > > I researched the file system source on my system (FreeBSD 4.6) and found > that the behavior was optimized for non-database access, to eliminate > unnecessary writes when temp files are created and deleted rapidly. It was > not optimized to get data to the disk in the most efficient manner. > > The syncer on FreeBSD appears to place dirtied filesystem buffers into > work queues that range from 1 to SYNCER_MAXDELAY. Each second the syncer > processes one of the queues and increments a counter syncer_delayno. > > On my system the setting for SYNCER_MAXDELAY is 32. So each second 1/32nd > of the writes that were buffered are processed. If the syncer gets behind > and the writes for a given second take more than one second to process, the > syncer does not wait but begins processing the next queue. > > AFAICT this means that there is no opportunity to have writes combined by > the disk, since they are processed in buckets based on the time the writes > came in. This is the trickle syncer. It prevents bursts of disk activity every 30 seconds. It is for non-fsync writes, of course, and I assume if the kernel buffers get low, it starts to flush faster. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
> This is the trickle syncer. It prevents bursts of disk activity every > 30 seconds. It is for non-fsync writes, of course, and I assume if the > kernel buffers get low, it starts to flush faster. AFAICT, the syncer only speeds up when virtual memory paging fills the buffers past a threshold, and even in that event the speedup is only a factor of two. I can't find any provision for speeding up flushing of the dirty buffers when they fill for normal file system writes, so I don't think that happens. - Curtis
On Mon, 2002-10-07 at 15:28, Bruce Momjian wrote: > This is the trickle syncer. It prevents bursts of disk activity every > 30 seconds. It is for non-fsync writes, of course, and I assume if the > kernel buffers get low, it starts to flush faster. Doesn't this also increase the likelihood that people will be running in a buffer-poor environment more frequently than I previously asserted, especially in very heavily I/O bound systems? Unless I'm mistaken, that opens the door for a general case of why an aio implementation should be looked into. Also, on a side note, IIRC, Linux kernel 2.5.x has a new priority elevator which is said to be MUCH better at saturating disks than ever before. Once 2.6 (or whatever its number will be) is released, it may not be as much of a problem as it seems to be for FreeBSD (I think that's the one you're using). Greg
On Mon, 2002-10-07 at 21:35, Neil Conway wrote: > Greg Copeland <greg@CopelandConsulting.Net> writes: > > Ya, I have read this before. The problem here is that I'm not aware of > > which AIO implementation on Linux is the forerunner, nor do I have any > > idea how its implementation or performance details differ from those of > > other implementations on other platforms. > > The implementation of AIO in 2.5 is the one by Ben LaHaise (not > SGI). Not sure what the performance is like -- although it's been > merged into 2.5 already, so someone can do some benchmarking. Can > anyone suggest a good test? What would be really interesting is to aio_write small chunks to the same 8k page from multiple threads/processes and then wait for the page to be written to disk. Then check how many backends get their wait back from the same write. The docs for the POSIX aio_xxx functions are at: http://www.opengroup.org/onlinepubs/007904975/functions/aio_write.html ---------------- Hannu
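A self-contained sketch of such a test might look like the following (assumes POSIX AIO is available, often linked with -lrt; the chunk size and count are arbitrary choices):

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define NCHUNKS 8
#define CHUNK   1024            /* 8 x 1 KB chunks land in one 8 KB page */

int main(void)
{
    int fd = open("aio_test.dat", O_WRONLY | O_CREAT | O_DSYNC, 0600);
    struct aiocb cb[NCHUNKS];
    const struct aiocb *list[NCHUNKS];
    static char buf[NCHUNKS][CHUNK];
    int i, done = 0;

    /* Queue several writes into the same 8 KB page, as several
     * backends would, and see how they complete. */
    for (i = 0; i < NCHUNKS; i++)
    {
        memset(&cb[i], 0, sizeof(cb[i]));
        memset(buf[i], 'a' + i, CHUNK);
        cb[i].aio_fildes = fd;
        cb[i].aio_buf    = buf[i];
        cb[i].aio_nbytes = CHUNK;
        cb[i].aio_offset = (off_t) i * CHUNK;
        aio_write(&cb[i]);
        list[i] = &cb[i];
    }

    /* Wait for all of them; the interesting question is whether the
     * kernel satisfies several requests with one physical write. */
    while (done < NCHUNKS)
    {
        aio_suspend(list, NCHUNKS, NULL);
        done = 0;
        for (i = 0; i < NCHUNKS; i++)
            if (aio_error(&cb[i]) != EINPROGRESS)
                done++;
    }
    printf("all %d chunks written\n", NCHUNKS);
    close(fd);
    return 0;
}

Timing the completion loop, and counting how many requests finish together, would show whether multiple aio_writes to one page get merged into a single physical write.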
Greg Copeland <greg@CopelandConsulting.Net> writes: > Doesn't this also increase the likelihood that people will be running in > a buffer-poor environment more frequently than I previously asserted, > especially in very heavily I/O bound systems? Unless I'm mistaken, that > opens the door for a general case of why an aio implementation should be > looked into. Well, at least for *this specific situation*, it doesn't really change anything -- since FreeBSD doesn't implement POSIX AIO as far as I know, we can't use that as an alternative. However, I'd suspect that the FreeBSD kernel allows for some way to tune the behavior of the syncer. If that's the case, we could do some research into what settings are more appropriate for FreeBSD, and recommend those in the docs. I don't run FreeBSD, however -- would someone like to volunteer to take a look at this? BTW Curtis, did you happen to check whether this behavior has been changed in FreeBSD 5.0? > Also, on a side note, IIRC, Linux kernel 2.5.x has a new priority > elevator which is said to be MUCH better at saturating disks than ever > before. Yeah, there are lots of new & interesting features for database systems in the new kernel -- I'm looking forward to when 2.6 is widely deployed... Cheers, Neil -- Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC
> Greg Copeland <greg@CopelandConsulting.Net> writes: > > Doesn't this also increase the likelihood that people will be > > running in a buffer-poor environment more frequently than I > > previously asserted, especially in very heavily I/O bound > > systems? Unless I'm mistaken, that opens the door for a > > general case of why an aio implementation should be looked into. Neil Conway replies: > Well, at least for *this specific situation*, it doesn't really change > anything -- since FreeBSD doesn't implement POSIX AIO as far as I > know, we can't use that as an alternative. I haven't tried it yet but there does seem to be an aio implementation that conforms to POSIX in FreeBSD 4.6.2. It's part of the kernel and can be found in: /usr/src/sys/kern/vfs_aio.c > However, I'd suspect that the FreeBSD kernel allows for some way to > tune the behavior of the syncer. If that's the case, we could do some > research into what settings are more appropriate for FreeBSD, and > recommend those in the docs. I don't run FreeBSD, however -- would > someone like to volunteer to take a look at this? I didn't see anything obvious in the docs but I still believe there's some way to tune it. I'll let everyone know if I find some better settings. > BTW Curtis, did you happen to check whether this behavior has been > changed in FreeBSD 5.0? I haven't checked but I will.
Curtis Faith wrote: > > This is the trickle syncer. It prevents bursts of disk activity every > > 30 seconds. It is for non-fsync writes, of course, and I assume if the > > kernel buffers get low, it starts to flush faster. > > AFAICT, the syncer only speeds up when virtual memory paging fills the > buffers past a threshold, and even in that event the speedup is only a > factor of two. > > I can't find any provision for speeding up flushing of the dirty buffers > when they fill for normal file system writes, so I don't think that > happens. So you think if I try to write a 1 gig file, it will write enough to fill up the buffers, then wait while the sync'er writes out a few blocks every second, free up some buffers, then write some more? Take a look at vfs_bio::getnewbuf() on *BSD and you will see that when it can't get a buffer, it will async write a dirty buffer to disk. As far as this AIO conversation is concerned, I want to see someone come up with some performance improvement that we can only do with AIO. Unless I see it, I am not interested in pursuing this thread. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
> So you think if I try to write a 1 gig file, it will write enough to > fill up the buffers, then wait while the sync'er writes out a few blocks > every second, free up some buffers, then write some more? > > Take a look at vfs_bio::getnewbuf() on *BSD and you will see that when > it can't get a buffer, it will async write a dirty buffer to disk. We've addressed this scenario before; if I recall, the point Greg made earlier is that buffers getting full means writes become synchronous. What I was trying to point out was that it is very likely that the buffers will fill even for large buffers, and that the writes are going to be driven out not by efficient ganging but by something approaching LRU flushing, with an occasional once-a-second slightly more efficient write of 1/32nd of the buffers. Once the buffers get full, all subsequent writes turn into synchronous writes, since even if the kernel writes asynchronously (meaning it can do other work), the writing process can't complete; it has to wait until the buffer has been flushed and is free for the copy. So the relatively poor implementation (for database inserts at least) of the syncer mechanism will cost a lot of performance if we get to this synchronous write mode due to a full buffer. It appears this scenario is much more likely than I had thought. Do you not think this is a potential performance problem to be explored? I'm only pursuing this as hard as I am because I feel like it's deja vu all over again. I've done this before and found a huge improvement (12X to 20X for bulk inserts). I'm not necessarily expecting that level of improvement here but my gut tells me there is more here than seems obvious on the surface. > As far as this AIO conversation is concerned, I want to see someone come > up with some performance improvement that we can only do with AIO. > Unless I see it, I am not interested in pursuing this thread. If I come up with something via aio that helps, I'd be more than happy if someone else points out a non-aio way to accomplish the same thing. I'm by no means married to any particular solution; I care about getting problems solved. And I'll stop trying to sell anyone on aio. - Curtis
"Curtis Faith" <curtis@galtair.com> writes: > Do you not think this is a potential performance problem to be explored? I agree that there's a problem if the kernel runs short of buffer space. I am not sure whether that's really an issue in practical situations, nor whether we can do much about it at the application level if it is --- but by all means look for solutions if you are concerned. (This is, BTW, one of the reasons for discouraging people from pushing Postgres' shared buffer cache up to a large fraction of total RAM; starving the kernel of disk buffers is just plain not a good idea.) regards, tom lane
Bruce, Are there remarks along these lines in the performance tuning section of the docs? Based on what's coming out of this, it would seem that stressing the importance of leaving a notable (rule of thumb here?) amount of memory for general OS/kernel needs is warranted. Greg On Tue, 2002-10-08 at 09:50, Tom Lane wrote: > (This is, BTW, one of the reasons for discouraging people from pushing > Postgres' shared buffer cache up to a large fraction of total RAM; > starving the kernel of disk buffers is just plain not a good idea.)