Thread: Analysis of ganged WAL writes
I do not think the situation for ganging of multiple commit-record writes is quite as dire as has been painted. There is a simple error in the current code that is easily corrected: in XLogFlush(), the wait to acquire WALWriteLock should occur before, not after, we try to acquire WALInsertLock and advance our local copy of the write request pointer. (To be exact, xlog.c lines 1255-1269 in CVS tip ought to be moved down to before line 1275, inside the "if" that tests whether we are going to call XLogWrite.)

Given that change, what will happen during heavy commit activity is like this:

1. Transaction A is ready to commit. It calls XLogInsert to insert its commit record into the WAL buffers (thereby transiently acquiring WALInsertLock) and then it calls XLogFlush to write and sync the log through the commit record. XLogFlush acquires WALWriteLock and begins issuing the needed I/O request(s).

2. Transaction B is ready to commit. It gets through XLogInsert and then blocks on WALWriteLock inside XLogFlush.

3. Transactions C, D, E likewise insert their commit records and then block on WALWriteLock.

4. Eventually, transaction A finishes its I/O, advances the "known flushed" pointer past its own commit record, and releases the WALWriteLock.

5. Transaction B now acquires WALWriteLock. Given the code change I recommend, it will choose to flush the WAL *through the last queued commit record as of this instant*, not the WAL endpoint as of when it started to wait. Therefore, this WAL write will handle all of the so-far-queued commits.

6. More transactions F, G, H, ... arrive to be committed. They will likewise insert their COMMIT records into the buffer and block on WALWriteLock.

7. When B finishes its write and releases WALWriteLock, it will have set the "known flushed" pointer past E's commit record. Therefore, transactions C, D, E will fall through their tests without calling XLogWrite at all. When F gets the lock, it will conclude that it should write the data queued up to that time, and so it will handle the commit records for G, H, etc. (The fact that lwlock.c will release waiters in order of arrival is important here --- we want C, D, E to get out of the queue before F decides it needs to write.)

It seems to me that this behavior will provide fairly effective ganging of COMMIT flushes under load. And it's self-tuning; no need to fiddle with weird parameters like commit_siblings. We automatically gang as many COMMITs as arrive during the time it takes to write and flush the previous gang of COMMITs.

Comments?

			regards, tom lane
I said:
> There is a simple error in the current code that is easily corrected: in XLogFlush(), the wait to acquire WALWriteLock should occur before, not after, we try to acquire WALInsertLock and advance our local copy of the write request pointer. (To be exact, xlog.c lines 1255-1269 in CVS tip ought to be moved down to before line 1275, inside the "if" that tests whether we are going to call XLogWrite.)

That patch was not quite right, as it didn't actually flush the later-arriving data. The correct patch is

*** src/backend/access/transam/xlog.c.orig	Thu Sep 26 18:58:33 2002
--- src/backend/access/transam/xlog.c	Sun Oct  6 18:45:57 2002
***************
*** 1252,1279 ****
  	/* done already? */
  	if (!XLByteLE(record, LogwrtResult.Flush))
  	{
- 		/* if something was added to log cache then try to flush this too */
- 		if (LWLockConditionalAcquire(WALInsertLock, LW_EXCLUSIVE))
- 		{
- 			XLogCtlInsert *Insert = &XLogCtl->Insert;
- 			uint32		freespace = INSERT_FREESPACE(Insert);
- 
- 			if (freespace < SizeOfXLogRecord)	/* buffer is full */
- 				WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
- 			else
- 			{
- 				WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
- 				WriteRqstPtr.xrecoff -= freespace;
- 			}
- 			LWLockRelease(WALInsertLock);
- 		}
  		/* now wait for the write lock */
  		LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
  		LogwrtResult = XLogCtl->Write.LogwrtResult;
  		if (!XLByteLE(record, LogwrtResult.Flush))
  		{
! 			WriteRqst.Write = WriteRqstPtr;
! 			WriteRqst.Flush = record;
  			XLogWrite(WriteRqst);
  		}
  		LWLockRelease(WALWriteLock);
--- 1252,1284 ----
  	/* done already? */
  	if (!XLByteLE(record, LogwrtResult.Flush))
  	{
  		/* now wait for the write lock */
  		LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
  		LogwrtResult = XLogCtl->Write.LogwrtResult;
  		if (!XLByteLE(record, LogwrtResult.Flush))
  		{
! 			/* try to write/flush later additions to XLOG as well */
! 			if (LWLockConditionalAcquire(WALInsertLock, LW_EXCLUSIVE))
! 			{
! 				XLogCtlInsert *Insert = &XLogCtl->Insert;
! 				uint32		freespace = INSERT_FREESPACE(Insert);
! 
! 				if (freespace < SizeOfXLogRecord)	/* buffer is full */
! 					WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
! 				else
! 				{
! 					WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
! 					WriteRqstPtr.xrecoff -= freespace;
! 				}
! 				LWLockRelease(WALInsertLock);
! 				WriteRqst.Write = WriteRqstPtr;
! 				WriteRqst.Flush = WriteRqstPtr;
! 			}
! 			else
! 			{
! 				WriteRqst.Write = WriteRqstPtr;
! 				WriteRqst.Flush = record;
! 			}
  			XLogWrite(WriteRqst);
  		}
  		LWLockRelease(WALWriteLock);

To test this, I made a modified version of pgbench in which each transaction consists of a simple
	insert into table_NNN values(0);
where each client thread has a separate insertion target table. This is about the simplest transaction I could think of that would generate a WAL record each time.

Running this modified pgbench with postmaster parameters
	postmaster -i -N 120 -B 1000 --wal_buffers=250
and all other configuration settings at default, CVS tip code gives me a pretty consistent 115-118 transactions per second for anywhere from 1 to 100 pgbench client threads. This is exactly what I expected, since the database (including WAL file) is on a 7200 RPM SCSI drive. The theoretical maximum rate of sync'd writes to the WAL file is therefore 120 per second (one per disk revolution), but we lose a little because once in awhile the disk has to seek to a data file.
Inserting the above patch, and keeping all else the same, I get:

$ mybench -c 1 -t 10000 bench1
number of clients: 1
number of transactions per client: 10000
number of transactions actually processed: 10000/10000
tps = 116.694205 (including connections establishing)
tps = 116.722648 (excluding connections establishing)

$ mybench -c 5 -t 2000 -S -n bench1
number of clients: 5
number of transactions per client: 2000
number of transactions actually processed: 10000/10000
tps = 282.808341 (including connections establishing)
tps = 283.656898 (excluding connections establishing)

$ mybench -c 10 -t 1000 bench1
number of clients: 10
number of transactions per client: 1000
number of transactions actually processed: 10000/10000
tps = 443.131083 (including connections establishing)
tps = 447.406534 (excluding connections establishing)

$ mybench -c 50 -t 200 bench1
number of clients: 50
number of transactions per client: 200
number of transactions actually processed: 10000/10000
tps = 416.154173 (including connections establishing)
tps = 436.748642 (excluding connections establishing)

$ mybench -c 100 -t 100 bench1
number of clients: 100
number of transactions per client: 100
number of transactions actually processed: 10000/10000
tps = 336.449110 (including connections establishing)
tps = 405.174237 (excluding connections establishing)

CPU loading goes from 80% idle at 1 client to 50% idle at 5 clients to <10% idle at 10 or more.

So this does seem to be a nice win, and unless I hear objections I will apply it ...

			regards, tom lane
On Sun, 2002-10-06 at 18:07, Tom Lane wrote:
>
> CPU loading goes from 80% idle at 1 client to 50% idle at 5 clients to <10% idle at 10 or more.
>
> So this does seem to be a nice win, and unless I hear objections I will apply it ...
>

Wow Tom! That's wonderful! On the other hand, maybe people needed the extra idle CPU time that was provided by the unpatched code. ;)

Greg
On Sun, 2002-10-06 at 19:35, Greg Copeland wrote:
> On Sun, 2002-10-06 at 18:07, Tom Lane wrote:
> >
> > CPU loading goes from 80% idle at 1 client to 50% idle at 5 clients to <10% idle at 10 or more.
> >
> > So this does seem to be a nice win, and unless I hear objections I will apply it ...
> >
>
> Wow Tom! That's wonderful! On the other hand, maybe people needed the extra idle CPU time that was provided by the unpatched code. ;)

Naw. Distributed.net finally got through RC5-64. Lots of CPU to spare now.

--
Rod Taylor
Tom Lane kirjutas E, 07.10.2002 kell 01:07:
>
> To test this, I made a modified version of pgbench in which each transaction consists of a simple
> 	insert into table_NNN values(0);
> where each client thread has a separate insertion target table. This is about the simplest transaction I could think of that would generate a WAL record each time.
>
> Running this modified pgbench with postmaster parameters
> 	postmaster -i -N 120 -B 1000 --wal_buffers=250
> and all other configuration settings at default, CVS tip code gives me a pretty consistent 115-118 transactions per second for anywhere from 1 to 100 pgbench client threads. This is exactly what I expected, since the database (including WAL file) is on a 7200 RPM SCSI drive. The theoretical maximum rate of sync'd writes to the WAL file is therefore 120 per second (one per disk revolution), but we lose a little because once in awhile the disk has to seek to a data file.
>
> Inserting the above patch, and keeping all else the same, I get:
>
> $ mybench -c 1 -t 10000 bench1
> number of clients: 1
> number of transactions per client: 10000
> number of transactions actually processed: 10000/10000
> tps = 116.694205 (including connections establishing)
> tps = 116.722648 (excluding connections establishing)
>
> $ mybench -c 5 -t 2000 -S -n bench1
> number of clients: 5
> number of transactions per client: 2000
> number of transactions actually processed: 10000/10000
> tps = 282.808341 (including connections establishing)
> tps = 283.656898 (excluding connections establishing)

In an ideal world this would be 5*120 = 600 tps. Have you any good ideas what holds it back for the other 300 tps? If it has CPU utilisation of only 50% then there must be still some moderate lock contention.

btw, what is the number for 1-5-10 clients with fsync off?

> $ mybench -c 10 -t 1000 bench1
> number of clients: 10
> number of transactions per client: 1000
> number of transactions actually processed: 10000/10000
> tps = 443.131083 (including connections establishing)
> tps = 447.406534 (excluding connections establishing)
>
> CPU loading goes from 80% idle at 1 client to 50% idle at 5 clients to <10% idle at 10 or more.
>
> So this does seem to be a nice win, and unless I hear objections I will apply it ...

3x speedup is not just nice, it's great ;)

--------------
Hannu
Hannu Krosing <hannu@tm.ee> writes:
> in an ideal world this would be 5*120 = 600 tps. Have you any good ideas what holds it back for the other 300 tps?

Well, recall that the CPU usage was about 20% in the single-client test. (The reason I needed a variant version of pgbench is that this machine is too slow to do more than 120 TPC-B transactions per second anyway.) That says that the best possible throughput on this test scenario is 5 transactions per disk rotation --- the CPU is just not capable of doing more. I am actually getting about 4 xact/rotation for 10 or more clients (in fact it seems to reach that plateau at 8 clients, and be close to it at 7).

I'm inclined to think that the fact that it's 4 not 5 is just a matter of "not quite there" --- there's some additional CPU overhead due to lock contention, etc, and any slowdown at all will cause it to miss making 5. The 20% CPU figure was approximate to begin with, anyway.

The other interesting question is why we're not able to saturate the machine with only 4 or 5 clients. I think pgbench itself is probably to blame for that: it can't keep all its backend threads constantly busy ... especially not when several of them report back transaction completion at essentially the same instant, as will happen under ganged-commit conditions. There will be intervals where multiple backends are waiting for pgbench to send a new command. That delay in starting a new command cycle is probably enough for them to "miss the bus" of getting included in the next commit write. That's just a guess though; I don't have tools that would let me see exactly what's happening.

Anyone else want to reproduce the test on a different system and see what it does?

> If it has CPU utilisation of only 50% then there must be still some moderate lock contention.

No, that's I/O wait I think, forced by the quantization of the number of transactions that get committed per rotation.

> btw, what is the number for 1-5-10 clients with fsync off?

About 640 tps at 1 and 5, trailing off to 615 at 10, and down to 450 at 100 clients (now that must be lock contention...)

			regards, tom lane
I wrote:
> That says that the best possible throughput on this test scenario is 5 transactions per disk rotation --- the CPU is just not capable of doing more. I am actually getting about 4 xact/rotation for 10 or more clients (in fact it seems to reach that plateau at 8 clients, and be close to it at 7).

After further thought I understand why it takes 8 clients to reach full throughput in this scenario. Assume that we have enough CPU oomph so that we can process four transactions, but not five, in the time needed for one revolution of the WAL disk. If we have five active clients then the behavior will be like this:

1. Backend A becomes ready to commit. It locks WALWriteLock and issues a write/flush that will only cover its own commit record. Assume that it has to wait one full disk revolution for the write to complete (this will be the steady-state situation).

2. While A is waiting, there is enough time for B, C, D, and E to run their transactions and become ready to commit. All eventually block on WALWriteLock.

3. When A finishes its write and releases WALWriteLock, B will acquire the lock and initiate a write that (with my patch) will cover C, D, and E's commit records as well as its own.

4. While B is waiting for the disk to spin, A receives a new transaction from its client, processes it, and becomes ready to commit. It blocks on WALWriteLock.

5. When B releases the lock, C, D, E acquire it and quickly fall through, seeing that they need do no work. Then A acquires the lock. GOTO step 1.

So with five active threads, we alternate between committing one transaction and four transactions on odd and even disk revolutions. It's pretty easy to see that with six or seven active threads, we will alternate between committing two or three transactions and committing four. Only when we get to eight threads do we have enough backends to ensure that four transactions are available to commit on every disk revolution. This must be so because the backends that are released at the end of any given disk revolution will not be able to participate in the next group commit, if there is already at least one backend ready to commit.

So this solution isn't perfect; it would still be nice to have a way to delay initiation of the WAL write until "just before" the disk is ready to accept it. I dunno any good way to do that, though.

I went ahead and committed the patch for 7.3, since it's simple and does offer some performance improvement. But maybe we can think of something better later on...

			regards, tom lane
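[The alternation described above can be reproduced with a toy model. The following is a minimal, hypothetical C simulation, not PostgreSQL code: it assumes the CPU readies at most four commits per disk revolution and that a WAL write is always in flight (which only holds once the client count exceeds that capacity, so read nothing into runs with fewer than 5 clients). ./gcsim 5 shows the 1/4 alternation, 6 and 7 show 2/4 and 3/4, and 8 settles at 4 commits per revolution.]

/*
 * gcsim.c -- toy model of group-commit behavior under the patch above.
 * Assumptions (illustrative, not measured): the CPU readies at most
 * CPU_PER_REV commits per revolution, a WAL write is always in flight,
 * and a backend that becomes ready during a revolution always misses
 * the write already in progress.
 *
 *   cc -o gcsim gcsim.c && ./gcsim 5
 */
#include <stdio.h>
#include <stdlib.h>

#define CPU_PER_REV 4
#define REVS        20

int
main(int argc, char **argv)
{
    int     nclients = (argc > 1) ? atoi(argv[1]) : 5;
    int     idle = nclients;    /* running a transaction, not yet ready */
    int     waiting = 0;        /* blocked on WALWriteLock */
    long    total = 0;
    int     rev;

    for (rev = 1; rev <= REVS; rev++)
    {
        /* the lock holder flushes everything queued at this instant */
        int     group = waiting;
        int     newly = (idle < CPU_PER_REV) ? idle : CPU_PER_REV;

        waiting = newly;        /* these arrive too late for the in-flight write */
        idle -= newly;
        if (rev > 4)            /* skip the start-up transient when averaging */
            total += group;
        printf("rev %2d: commits flushed = %d\n", rev, group);
        idle += group;          /* flushed backends start new transactions */
    }
    printf("steady-state average: %.2f commits/revolution\n",
           total / (double) (REVS - 4));
    return 0;
}

[With the default capacity of 4, the steady-state averages come out to 2.5, 3.0, 3.5, and 4.0 commits per revolution for 5, 6, 7, and 8 clients, which is the pattern the explanation above predicts.]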
Tom, first of all, excellent job improving the current algorithm. I'm glad you looked at the WALCommitLock code.

> This must be so because the backends that are released at the end of any given disk revolution will not be able to participate in the next group commit, if there is already at least one backend ready to commit.

This is the major reason for my original suggestion about using aio_write. The writes don't block each other and there is no need for a kernel-level exclusive locking call like fsync or fdatasync.

Even the theoretical limit you mention of one transaction per revolution per committing process seems like a significant bottleneck.

Is committing 1 and 4 transactions on every revolution good? It's certainly better than 1 per revolution. However, what if we could have done 3 transactions per process in the time it took for a single revolution? Then we are looking at (1 + 4) / 2 = 2.5 transactions per revolution versus the theoretical maximum of (3 * 5) = 15 transactions per revolution if we can figure out a way to do non-blocking writes that we can guarantee are on the disk platter so we can return from commit.

Setting aside whether or not aio is viable: do you not agree that eliminating the blocking would result in potentially a 6X improvement for the 5 process case?

> So this solution isn't perfect; it would still be nice to have a way to delay initiation of the WAL write until "just before" the disk is ready to accept it. I dunno any good way to do that, though.

I still think that it would be much faster to just keep writing the WAL log blocks when they fill up and have a separate process wake the committing process when the write completes. This would eliminate WAL writing as a bottleneck. I have yet to hear anyone say that this can't be done, only that we might not want to do it because the code might not be clean.

I'm generally only happy when I can finally remove a bottleneck completely, but speeding one up by 3X like you have done is pretty damn cool for a day or two's work.

- Curtis
"Curtis Faith" <curtis@galtair.com> writes: > Even the theoretical limit you mention of one transaction per revolution > per committing process seem like a significant bottleneck. Well, too bad. If you haven't gotten your commit record down to disk, then *you have not committed*. This is not negotiable. (If you think it is, then turn off fsync and quit worrying ;-)) An application that is willing to have multiple transactions in flight at the same time can open up multiple backend connections to issue those transactions, and thereby perhaps beat the theoretical limit. But for serial transactions, there is not anything we can do to beat that limit. (At least not with the log structure we have now. One could imagine dropping a commit record into the nearest one of multiple buckets that are carefully scattered around the disk. But exploiting that would take near-perfect knowledge about disk head positioning; it's even harder to solve than the problem we're considering now.) > I still think that it would be much faster to just keep writing the WAL log > blocks when they fill up and have a separate process wake the commiting > process when the write completes. This would eliminate WAL writing as a > bottleneck. You're failing to distinguish total throughput to the WAL drive from response time seen by any one transaction. Yes, a policy of writing each WAL block once when it fills would maximize potential throughput, but it would also mean a potentially very large delay for a transaction waiting to commit. The lower the system load, the worse the performance on that scale. The scheme we now have (with my recent patch) essentially says that the commit delay seen by any one transaction is at most two disk rotations. Unfortunately it's also at least one rotation :-(, except in the case where there is no contention, ie, no already-scheduled WAL write when the transaction reaches the commit stage. It would be nice to be able to say "at most one disk rotation" instead --- but I don't see how to do that in the absence of detailed information about disk head position. Something I was toying with this afternoon: assume we have a background process responsible for all WAL writes --- not only filled buffers, but the currently active buffer. It periodically checks to see if there are unwritten commit records in the active buffer, and if so schedules a write for them. If this could be done during each disk rotation, "just before" the disk reaches the active WAL log block, we'd have an ideal solution. And it would not be too hard for such a process to determine the right time: it could measure the drive rotational speed by observing the completion times of successive writes to the same sector, and it wouldn't take much logic to empirically find the latest time at which a write can be issued and have a good probability of hitting the disk on time. (At least, this would work pretty well given a dedicated WAL drive, else there'd be too much interference from other I/O requests.) However, this whole scheme falls down on the same problem we've run into before: user processes can't schedule themselves with millisecond accuracy. The writer process might be able to determine the ideal time to wake up and make the check, but it can't get the Unix kernel to dispatch it then, at least not on most Unixen. The typical scheduling slop is one time slice, which is comparable to if not more than the disk rotation time. 
ISTM aio_write only improves the picture if there's some magic in-kernel processing that makes this same kind of judgment as to when to issue the "ganged" write for real, and is able to do it on time because it's in the kernel. I haven't heard anything to make me think that that feature actually exists. AFAIK the kernel isn't much more enlightened about physical head positions than we are.

			regards, tom lane
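[The rotational-speed measurement described above is easy to prototype. Here is a hypothetical, standalone C sketch, not PostgreSQL code; the probe file name and 8K block size are arbitrary assumptions, and fdatasync may need to be fsync on platforms that lack it. It times repeated synchronous rewrites of the same block: on a dedicated drive with write caching disabled, each rewrite has to wait for the platter to come around again, so the intervals should cluster near one revolution time, about 8.3 ms for a 7200 RPM drive. A real background writer would then subtract a safety margin from the measured period to decide when to issue the next ganged write.]

/*
 * rotprobe.c -- rough sketch of measuring WAL-drive rotation time by
 * timing repeated synchronous writes to the same block.  Assumes a
 * dedicated drive that does not cache writes; the file name and block
 * size are arbitrary choices for illustration.
 *
 *   cc -o rotprobe rotprobe.c && ./rotprobe
 */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

int
main(void)
{
    char            buf[8192];
    struct timeval  t0, t1;
    int             i;
    int             fd = open("rotprobe.dat", O_RDWR | O_CREAT, 0600);

    if (fd < 0)
        return 1;
    memset(buf, 0, sizeof(buf));

    for (i = 0; i < 20; i++)
    {
        gettimeofday(&t0, NULL);
        /* rewrite the same sector and force it to disk */
        if (pwrite(fd, buf, sizeof(buf), 0) != (ssize_t) sizeof(buf))
            return 1;
        if (fdatasync(fd) < 0)
            return 1;
        gettimeofday(&t1, NULL);
        printf("write %2d: %.2f ms\n", i,
               (t1.tv_sec - t0.tv_sec) * 1000.0 +
               (t1.tv_usec - t0.tv_usec) / 1000.0);
    }
    close(fd);
    return 0;
}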
On Tue, 2002-10-08 at 00:12, Curtis Faith wrote:
> Tom, first of all, excellent job improving the current algorithm. I'm glad you looked at the WALCommitLock code.
>
> > This must be so because the backends that are released at the end of any given disk revolution will not be able to participate in the next group commit, if there is already at least one backend ready to commit.
>
> This is the major reason for my original suggestion about using aio_write. The writes don't block each other and there is no need for a kernel-level exclusive locking call like fsync or fdatasync.
>
> Even the theoretical limit you mention of one transaction per revolution per committing process seems like a significant bottleneck.
>
> Is committing 1 and 4 transactions on every revolution good? It's certainly better than 1 per revolution.

Of course committing all 5 at each rev would be better ;)

> However, what if we could have done 3 transactions per process in the time it took for a single revolution?

I may be missing something obvious, but I don't see a way to get more than 1 trx/process/revolution, as each previous transaction in that process must be written to disk before the next can start, and the only way it can be written to the disk is when the disk heads are on the right place and that happens exactly once per revolution.

In theory we could devise some clever page interleave scheme that would allow us to go like this: fill one page - write page to disk, commit trx's - fill the page in next 1/3 of rev - write next page to disk ..., but this will work only for some limited set of WAL page sizes.

It could be possible to get near 5 trx/rev for 5 backends if we do the following (A-E are backends from Tom's explanation):

1. write the page for A's trx to its proper pos P (where P is the page number)

2. if after sync for A returns we already have more transactions waiting for write()+sync() of the same page, immediately write the _same_ page to pos P+N (where N is a tunable parameter). If N is small enough then P+N will be on the same cylinder for most cases and thus will get transactions B-E also committed on the same rev.

3. make sure that the last version will also be written to its proper place before the end of log will overwrite P+N. (This may be tricky.)

4. When restoring from WAL, always check for a page at EndPos+N for a possible newer version of the last page.

This scheme requires page numbers + page versions to be stored in each page and could get us near 1 trx/backend/rev performance, but it's hard to tell if it is really useful in real life.

This could also possibly be extended to more than one "end page" and more than one "continuation end page copy" to get better than 1 trx/backend/rev.

-----------------
Hannu
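[Step 4 of the scheme above is the least obvious part, so here is a purely hypothetical C sketch of it. Nothing in it exists in PostgreSQL: the page header fields, the function name, and the fixed displacement N are invented for illustration. The only point is that recovery would read both candidate locations and prefer the copy carrying the higher version counter for the same page number.]

/*
 * Hypothetical sketch of step 4 above: at the end of WAL replay, the
 * final page may exist both at its proper offset P and, possibly newer,
 * at P + N.  None of these structures or fields exist in the real XLOG
 * page header; they are invented to illustrate the idea.
 */
#include <sys/types.h>
#include <unistd.h>

typedef struct HypotheticalPageHdr
{
    unsigned int    pageno;     /* which logical WAL page this is */
    unsigned int    version;    /* bumped on every rewrite of the page */
} HypotheticalPageHdr;

/* Returns 1 if replay should use the copy at P + N instead of the one at P. */
int
prefer_end_copy(int walfd, off_t P, off_t N)
{
    HypotheticalPageHdr at_P, at_P_plus_N;

    if (pread(walfd, &at_P, sizeof(at_P), P) != sizeof(at_P))
        return 0;               /* no page at P: nothing to compare */
    if (pread(walfd, &at_P_plus_N, sizeof(at_P_plus_N), P + N) != sizeof(at_P_plus_N))
        return 0;               /* no copy at P+N: stick with the page at P */

    return at_P_plus_N.pageno == at_P.pageno &&
           at_P_plus_N.version > at_P.version;
}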
On Tue, 2002-10-08 at 01:27, Tom Lane wrote:
>
> The scheme we now have (with my recent patch) essentially says that the commit delay seen by any one transaction is at most two disk rotations. Unfortunately it's also at least one rotation :-(, except in the case where there is no contention, ie, no already-scheduled WAL write when the transaction reaches the commit stage. It would be nice to be able to say "at most one disk rotation" instead --- but I don't see how to do that in the absence of detailed information about disk head position.
>
> Something I was toying with this afternoon: assume we have a background process responsible for all WAL writes --- not only filled buffers, but the currently active buffer. It periodically checks to see if there are unwritten commit records in the active buffer, and if so schedules a write for them. If this could be done during each disk rotation, "just before" the disk reaches the active WAL log block, we'd have an ideal solution. And it would not be too hard for such a process to determine the right time: it could measure the drive rotational speed by observing the completion times of successive writes to the same sector, and it wouldn't take much logic to empirically find the latest time at which a write can be issued and have a good probability of hitting the disk on time. (At least, this would work pretty well given a dedicated WAL drive, else there'd be too much interference from other I/O requests.)
>
> However, this whole scheme falls down on the same problem we've run into before: user processes can't schedule themselves with millisecond accuracy. The writer process might be able to determine the ideal time to wake up and make the check, but it can't get the Unix kernel to dispatch it then, at least not on most Unixen. The typical scheduling slop is one time slice, which is comparable to if not more than the disk rotation time.

Standard for Linux has been a 100Hz time slice, but it has been configurable for some time. The latest RedHat (8.0) is built with 500Hz, which makes about 4 slices/rev for 7200 rpm disks (2 for 15000 rpm).

> ISTM aio_write only improves the picture if there's some magic in-kernel processing that makes this same kind of judgment as to when to issue the "ganged" write for real, and is able to do it on time because it's in the kernel. I haven't heard anything to make me think that that feature actually exists. AFAIK the kernel isn't much more enlightened about physical head positions than we are.

At least for open source kernels it could be possible to
1. write a patch to the kernel, or
2. get the authors of kernel aio interested in doing it, or
3. use some real-time (RT) OS or mixed RT/conventional OS where some threads can be scheduled for hard RT. In an RT OS you are supposed to be able to do exactly what you describe.

I think that 2 and 3 could be "outsourced" (the respective developers talked into supporting it), as both KAIO and RT Linuxen/BSDs are probably also interested in high-profile applications, so they could boast that "using our stuff enabled the PostgreSQL database to run twice as fast".

Anyway, getting to near-hardware speeds for a database will need more specific support from the OS than web browsing or compiling does.

---------------
Hannu
> Well, too bad. If you haven't gotten your commit record down to disk, then *you have not committed*. This is not negotiable. (If you think it is, then turn off fsync and quit worrying ;-))

I've never disputed this, so if I seem to be suggesting that, I've been unclear. I'm just assuming that the disk can get a confirmation back to the INSERTing process in much less than one rotation. This would allow that process to start working again, perhaps in time to complete another transaction.

> An application that is willing to have multiple transactions in flight at the same time can open up multiple backend connections to issue those transactions, and thereby perhaps beat the theoretical limit. But for serial transactions, there is not anything we can do to beat that limit. (At least not with the log structure we have now. One could imagine dropping a commit record into the nearest one of multiple buckets that are carefully scattered around the disk. But exploiting that would take near-perfect knowledge about disk head positioning; it's even harder to solve than the problem we're considering now.)

Consider the following scenario (time measured in disk rotations):

Time 1.00 - Process A commits - causing aio_write to log and wait
Time 1.03 - aio completes for Process A's write - wakes process A
Time 1.05 - Process A starts another transaction
Time 1.08 - Process A commits
etc.

I agree that a process can't proceed from a commit until it receives confirmation of the write, but if the write has hit the disk before a full rotation then the process should be able to continue processing new transactions.

> You're failing to distinguish total throughput to the WAL drive from response time seen by any one transaction. Yes, a policy of writing each WAL block once when it fills would maximize potential throughput, but it would also mean a potentially very large delay for a transaction waiting to commit. The lower the system load, the worse the performance on that scale.

You are assuming fsync or fdatasync behavior, I am not. There would be no delay under the scenario I describe. The transaction would exit commit as soon as the confirmation of the write is received from the aio system. I would hope that with a decent aio implementation this would generally be much less than one rotation.

I think that the single transaction response time is very important, and that's one of the chief problems I sought to solve when I proposed aio_writes for logging in my original email many moons ago.

> ISTM aio_write only improves the picture if there's some magic in-kernel processing that makes this same kind of judgment as to when to issue the "ganged" write for real, and is able to do it on time because it's in the kernel. I haven't heard anything to make me think that that feature actually exists. AFAIK the kernel isn't much more enlightened about physical head positions than we are.

All aio_write has to do is pass the write off to the device as soon as aio_write gets it, bypassing the system buffers. The code on the disk's hardware is very good at knowing when the disk head is coming. IMHO, bypassing the kernel's less than enlightened writing system is the main point of using aio_write.

- Curtis
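[For readers unfamiliar with the POSIX AIO calls being debated, here is a minimal, hypothetical C sketch of the commit path described above. It is not PostgreSQL code, the function names and file name are invented, and it simply assumes that opening the log with O_DSYNC makes aio completion mean "synchronized I/O" -- whether the drive has truly put the bits on the platter at that point is exactly what the rest of the thread disputes. On Linux it would link with -lrt.]

/*
 * aiocommit.c -- hypothetical sketch of an aio_write-based commit
 * (not PostgreSQL code).  Assumes the WAL file is opened with O_DSYNC
 * so that completion implies synchronized I/O; whether that really
 * means "on the platter" depends on the drive.
 *
 *   cc -o aiocommit aiocommit.c -lrt && ./aiocommit
 */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Queue the commit record asynchronously.  The caller must not modify
 * 'record' (or reuse 'cb') until wait_for_commit() has returned.
 */
static int
start_commit_write(struct aiocb *cb, int walfd,
                   const void *record, size_t len, off_t offset)
{
    memset(cb, 0, sizeof(*cb));
    cb->aio_fildes = walfd;
    cb->aio_buf = (void *) record;      /* cast away const for the aiocb */
    cb->aio_nbytes = len;
    cb->aio_offset = offset;
    return aio_write(cb);               /* 0 if queued, -1 on error */
}

/*
 * Block until the queued write has completed; the backend could instead
 * go do other useful work and poll aio_error() later.
 */
static int
wait_for_commit(struct aiocb *cb)
{
    const struct aiocb *list[1];

    list[0] = cb;
    while (aio_error(cb) == EINPROGRESS)
        aio_suspend(list, 1, NULL);     /* sleep until completion */

    return (aio_return(cb) == (ssize_t) cb->aio_nbytes) ? 0 : -1;
}

int
main(void)
{
    struct aiocb    cb;
    const char      record[] = "fake commit record";
    int             walfd = open("aio_wal_test.dat",
                                 O_WRONLY | O_CREAT | O_DSYNC, 0600);

    if (walfd < 0)
        return 1;
    if (start_commit_write(&cb, walfd, record, sizeof(record), 0) != 0)
        return 1;
    /* ... the backend could run more of the transaction here ... */
    if (wait_for_commit(&cb) != 0)
        return 1;
    close(walfd);
    return 0;
}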
On Mon, 2002-10-07 at 16:06, Curtis Faith wrote:
> > Well, too bad. If you haven't gotten your commit record down to disk, then *you have not committed*. This is not negotiable. (If you think it is, then turn off fsync and quit worrying ;-))
>

At this point, I think we've come full circle. Can we all agree that this concept is a *potential* source of improvement in a variety of situations? If we can agree on that, perhaps we should move to the next stage in the process, validation?

How long do you think it would take to develop something worthy of testing? Do we have known test cases which will properly (in)validate the approach that everyone will agree to?

If the code is reasonably clean so as to pass the smell test and shows a notable performance boost, will it be seriously considered for inclusion? If so, I assume it would become a configure option (--with-aio)?

Regards,

Greg
Greg Copeland wrote:
<snip>
> If so, I assume it would become a configure option (--with-aio)?

Or maybe a GUC "use_aio" ?

:-)

Regards and best wishes,

Justin Clift

> Regards,
>
> Greg

--
"My grandfather once told me that there are two kinds of people: those who work and those who take the credit. He told me to try to be in the first group; there was less competition there."
   - Indira Gandhi
"Curtis Faith" <curtis@galtair.com> writes: >> Well, too bad. If you haven't gotten your commit record down to disk, >> then *you have not committed*. This is not negotiable. (If you think >> it is, then turn off fsync and quit worrying ;-)) > I've never disputed this, so if I seem to be suggesting that, I've beee > unclear. I'm just assuming that the disk can get a confirmation back to the > INSERTing process in much less than one rotation. You've spent way too much time working with lying IDE drives :-( Do you really trust a confirm-before-write drive to make that write if it loses power? I sure don't. If you do trust your drive to hold that data across a crash, then ISTM the whole problem goes away anyway, as writes will "complete" quite independently of disk rotation. My Linux box has no problem claiming that it's completing several thousand TPS with a single client ... and yes, fsync is on, but it's using an IDE drive, and I don't know how to disable confirm-before-write on that drive. (That's why I did these tests on my old slow HP hardware.) Basically, the ganging of commit writes happens inside the disk controller on a setup like that. You still don't need aio_write --- unless perhaps to reduce wastage of IDE bus bandwidth by repeated writes, but that doesn't seem to be a scarce resource in this context. regards, tom lane
Well, I was thinking that aio may not be available on all platforms, thus the conditional compile option. On the other hand, wouldn't you pretty much want it either on or off for all instances? I can see that it would be nice for testing though. ;)

Greg

On Mon, 2002-10-07 at 16:23, Justin Clift wrote:
> Greg Copeland wrote:
> <snip>
> > If so, I assume it would become a configure option (--with-aio)?
>
> Or maybe a GUC "use_aio" ?
>
> :-)
>
> Regards and best wishes,
>
> Justin Clift
>
> > Regards,
> >
> > Greg
>
> --
> "My grandfather once told me that there are two kinds of people: those who work and those who take the credit. He told me to try to be in the first group; there was less competition there."
>    - Indira Gandhi
> I may be missing something obvious, but I don't see a way to get more than 1 trx/process/revolution, as each previous transaction in that process must be written to disk before the next can start, and the only way it can be written to the disk is when the disk heads are on the right place and that happens exactly once per revolution.

Okay, consider the following scenario.

1) Process A commits when the platter is at 0 degrees.
2) There are enough XLog writes from other processes to fill 1/4 platter rotation worth of log or 90 degrees. The SCSI drive writes the XLog commit record and keeps writing other log entries as the head rotates.
3) Process A receives a confirmation of the write before the platter rotates 60 degrees.
4) Process A continues and adds another commit before the platter rotates to 90 degrees.

This should be very possible and more and more likely in the future as CPUs get faster and faster relative to disks.

I'm not suggesting this would happen all the time, just that it's possible and that an SMP machine with good CPUs and a fast I/O subsystem should be able to keep the log writing at close to I/O bandwidth limits.

The case of bulk inserts is one where I would expect that for simple tables we should be able to peg the disks given today's hardware and enough inserting processes.

- Curtis
Curtis Faith kirjutas T, 08.10.2002 kell 01:04:
> > I may be missing something obvious, but I don't see a way to get more than 1 trx/process/revolution, as each previous transaction in that process must be written to disk before the next can start, and the only way it can be written to the disk is when the disk heads are on the right place and that happens exactly once per revolution.
>
> Okay, consider the following scenario.
>
> 1) Process A commits when the platter is at 0 degrees.
> 2) There are enough XLog writes from other processes to fill 1/4 platter rotation worth of log or 90 degrees. The SCSI drive writes the XLog commit record and keeps writing other log entries as the head rotates.
> 3) Process A receives a confirmation of the write before the platter rotates 60 degrees.
> 4) Process A continues and adds another commit before the platter rotates to 90 degrees.

For this scheme to work there are _very_ specific conditions to be met (i.e. the number of writing processes and the size of WAL records have to meet very strict criteria). I'd suspect that this will not work for 95% of cases.

> This should be very possible and more and more likely in the future as CPUs get faster and faster relative to disks.

Your example of >1 trx/proc/rev will work _only_ if no more and no less than 1/4 of the platter is filled by _other_ log writers.

> I'm not suggesting this would happen all the time, just that it's possible and that an SMP machine with good CPUs and a fast I/O subsystem should be able to keep the log writing at close to I/O bandwidth limits.

Keeping log writing parallel and close to I/O bandwidth limits is the real key issue there.

Tom's test case of fast inserting of small records by a relatively small number of processes (forcing repeated writes of the same page) also seems like a border case.

> The case of bulk inserts is one where I would expect that for simple tables we should be able to peg the disks given today's hardware and enough inserting processes.

Bulk inserts should probably be chunked at a higher level by inserting several records inside a single transaction.

-----------
Hannu
> ISTM aio_write only improves the picture if there's some magic in-kernel processing that makes this same kind of judgment as to when to issue the "ganged" write for real, and is able to do it on time because it's in the kernel. I haven't heard anything to make me think that that feature actually exists. AFAIK the kernel isn't much more enlightened about physical head positions than we are.

Can the magic be that kaio directly writes from user space memory to the disk? Since in your case all transactions A-E want the same buffer written, the memory (not its content) will also be the same. This would automatically write the latest possible version of our WAL buffer to disk.

The problem I can see offhand is how the kaio system can tell which transaction can be safely notified of the write, or whether the programmer is actually responsible for not changing the buffer until notified of completion?

Andreas
On Tue, 2002-10-08 at 04:15, Zeugswetter Andreas SB SD wrote:
> Can the magic be that kaio directly writes from user space memory to the disk? Since in your case all transactions A-E want the same buffer written, the memory (not its content) will also be the same. This would automatically write the latest possible version of our WAL buffer to disk.
>

*Some* implementations allow for zero-copy aio. That is a savings. On heavily used systems, it can be a large savings.

> The problem I can see offhand is how the kaio system can tell which transaction can be safely notified of the write, or whether the programmer is actually responsible for not changing the buffer until notified of completion?

That's correct. The programmer can not change the buffer contents until notification has completed for that outstanding aio operation. To do otherwise results in undefined behavior. Since some systems do allow for zero-copy aio operations, requiring that the buffers not be modified once queued makes a lot of sense. Of course, even on systems that don't support zero-copy, changing the buffered data prior to write completion just seems like a bad idea to me.

Here's a quote from SGI's aio_write man page:

	If the buffer pointed to by aiocbp->aio_buf or the control block pointed to by aiocbp changes or becomes an illegal address prior to asynchronous I/O completion then the behavior is undefined. Simultaneous synchronous operations using the same aiocbp produce undefined results.

And on SunOS we have:

	The aiocbp argument points to an aiocb structure. If the buffer pointed to by aiocbp->aio_buf or the control block pointed to by aiocbp becomes an illegal address prior to asynchronous I/O completion, then the behavior is undefined.

and

	For any system action that changes the process memory space while an asynchronous I/O is outstanding to the address range being changed, the result of that action is undefined.

Greg
"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes: > Can the magic be, that kaio directly writes from user space memory to the > disk ? This makes more assumptions about the disk drive's behavior than I think are justified... > Since in your case all transactions A-E want the same buffer written, > the memory (not it's content) will also be the same. But no, it won't: the successive writes will ask to write different snapshots of the same buffer. > The problem I can see offhand is how the kaio system can tell which > transaction can be safely notified of the write, Yup, exactly. Whose snapshot made it down to (stable) disk storage? regards, tom lane
> Your example of >1 trx/proc/rev will work _only_ if no more and no less than 1/4 of the platter is filled by _other_ log writers.

Not really; if 1/2 the platter has been filled we'll still get one more commit in for a given rotation. If more than a rotation's worth of writing has occurred, that means we are bumping into the limit of disk I/O, and that is the limit that we can't do anything about without doing interleaved log files.

> > The case of bulk inserts is one where I would expect that for simple tables we should be able to peg the disks given today's hardware and enough inserting processes.
>
> Bulk inserts should probably be chunked at a higher level by inserting several records inside a single transaction.

Agreed, that's much more efficient. There are plenty of situations where the inserts and updates are ongoing rather than initial: Shridhar's real-world test or TPC benchmarks, for example.

- Curtis
> > Since in your case all transactions A-E want the same buffer written, the memory (not its content) will also be the same.
>
> But no, it won't: the successive writes will ask to write different snapshots of the same buffer.

Successive writes would write different NON-OVERLAPPING sections of the same log buffer. It wouldn't make sense to send three separate copies of the entire block. That could indeed cause problems.

If a separate log writing process was doing all the writing, it could pre-gang the writes. However, I'm not sure this is necessary. I'll test the simpler way first.

> > The problem I can see offhand is how the kaio system can tell which transaction can be safely notified of the write,
>
> Yup, exactly. Whose snapshot made it down to (stable) disk storage?

If you do as above, it can inform the transactions when the blocks get written to disk, since there are no inconsistent writes. If transactions A, B and C had commits in block 1023, and the aio system writes that block to the disk, it can notify all three that their transaction write is complete when that block (or partial block) is written to disk.

If transaction C's write didn't make it into the buffer, I've got to assume the aio system or the disk cache logic can handle realizing that it didn't queue that write and therefore not inform transaction C of a completion.

- Curtis
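[The bookkeeping described above amounts to a single "flushed through" pointer plus a wakeup. Here is a hypothetical, heavily simplified C sketch of that idea; it is not how PostgreSQL represents log positions (real XLOG pointers are XLogRecPtr pairs, and backends sleep on PostgreSQL's own IPC primitives rather than a pthread condition variable), so the flat 64-bit offset and pthreads here are only stand-ins for illustration.]

/*
 * flushwait.c -- sketch of the notification bookkeeping (not PostgreSQL
 * code): whenever a log write completes, advance one "flushed through"
 * pointer and wake every backend whose commit record lies at or below it.
 */
#include <pthread.h>
#include <stdint.h>

static pthread_mutex_t flush_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  flush_cond = PTHREAD_COND_INITIALIZER;
static uint64_t        flushed_through = 0;   /* log offset known to be on disk */

/* Called from the I/O completion path: bytes up to 'end' are now on disk. */
void
log_write_completed(uint64_t end)
{
    pthread_mutex_lock(&flush_lock);
    if (end > flushed_through)
        flushed_through = end;
    pthread_cond_broadcast(&flush_cond);      /* wake A, B, C, ... together */
    pthread_mutex_unlock(&flush_lock);
}

/* Called by a committing backend: sleep until my commit record is flushed. */
void
wait_for_flush(uint64_t my_commit_end)
{
    pthread_mutex_lock(&flush_lock);
    while (flushed_through < my_commit_end)
        pthread_cond_wait(&flush_cond, &flush_lock);
    pthread_mutex_unlock(&flush_lock);
}

[A write that only covered A and B but not C would leave flushed_through below C's commit end, so C simply keeps waiting, which is the property being claimed above.]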
"Curtis Faith" <curtis@galtair.com> writes: > Successive writes would write different NON-OVERLAPPING sections of the > same log buffer. It wouldn't make sense to send three separate copies of > the entire block. That could indeed cause problems. So you're going to undo the code's present property that all writes are block-sized? Aren't you worried about incurring page-in reads because the kernel can't know that we don't care about data beyond what we've written so far in the block? regards, tom lane
> "Curtis Faith" <curtis@galtair.com> writes: > > Successive writes would write different NON-OVERLAPPING sections of the > > same log buffer. It wouldn't make sense to send three separate > copies of > > the entire block. That could indeed cause problems. > > So you're going to undo the code's present property that all writes are > block-sized? Aren't you worried about incurring page-in reads because > the kernel can't know that we don't care about data beyond what we've > written so far in the block? Yes, I'll try undoing the current behavior. I'm not really worried about doing page-in reads because the disks internal buffers should contain most of the blocks surrounding the end of the log file. If the successive partial writes exceed a block (which they will in heavy use) then most of the time this won't be a problem anyway since the disk will gang the full blocks before writing. If the inserts are not coming fast enough to fill the log then the disks cache should contain the data from the last time that block (or the block before) was written. Disks have become pretty good at this sort of thing since writing sequentially is a very common scenario. It may not work, but one doesn't make significant progress without trying things that might not work. If it doesn't work, then I'll make certain that commit log records always fill the buffer they are written too, with variable length commit records and something to identify the size of the padding used to fill the rest of the block.
"Curtis Faith" <curtis@galtair.com> writes: > I'm not really worried about doing page-in reads because the disks internal > buffers should contain most of the blocks surrounding the end of the log > file. If the successive partial writes exceed a block (which they will in > heavy use) then most of the time this won't be a problem anyway since the > disk will gang the full blocks before writing. You seem to be willing to make quite a large number of assumptions about what the disk hardware will do or not do. I trust you're going to test your results on a wide range of hardware before claiming they have any general validity ... regards, tom lane
> "Curtis Faith" <curtis@galtair.com> writes: > > I'm not really worried about doing page-in reads because the > disks internal > > buffers should contain most of the blocks surrounding the end > of the log > > file. If the successive partial writes exceed a block (which > they will in > > heavy use) then most of the time this won't be a problem > anyway since the > > disk will gang the full blocks before writing. > > You seem to be willing to make quite a large number of assumptions about > what the disk hardware will do or not do. I trust you're going to test > your results on a wide range of hardware before claiming they have any > general validity ... Tom, I'm actually an empiricist. I trust nothing that I haven't tested or read the code for myself. I've found too many instances of bugs or poor implementations in the O/S to believe without testing. On the other hand, one has to make some assumptions in order to devise useful tests. I'm not necessarily expecting that I'll come up with something that will help everyone all the time. I'm just hoping that I can come up with something that will help those using modern hardware, most of the time. Even if it works, this will probably become one of those flags that need to be tested as part of the performance analysis for any given system. Or perhaps ideally, I'll come up with a LogDiskTester that simulates log output and determines the best settings to use for a given disk and O/S, optimized for size or speed, heavy inserts, etc.
> > Can the magic be that kaio directly writes from user space memory to the disk?
>
> This makes more assumptions about the disk drive's behavior than I think are justified...

No, no assumption about the drive, only about the kaio implementation, namely that it reads the user memory at the latest possible time (for the implementation).

> > Since in your case all transactions A-E want the same buffer written, the memory (not its content) will also be the same.
>
> But no, it won't: the successive writes will ask to write different snapshots of the same buffer.

That is what I meant, yes. I should have formulated it better: the memory location.

> > The problem I can see offhand is how the kaio system can tell which transaction can be safely notified of the write,
>
> Yup, exactly. Whose snapshot made it down to (stable) disk storage?

That is something the kaio implementation could probably know in our scenario, where we iirc only append new records inside the page (don't modify anything up front). Worst thing that can happen is that the kaio is too pessimistic about what is already on disk (memory already written, but the aio call not yet done).

Andreas
tgl@sss.pgh.pa.us (Tom Lane) writes:
[snip]
>
> So this does seem to be a nice win, and unless I hear objections I will apply it ...
>

It does indeed look like a great improvement, so is the fix going to be merged to the 7.3 branch or is it too late for that?

_
Mats Lofkvist
mal@algonet.se
Mats Lofkvist <mal@algonet.se> writes:
> It does indeed look like a great improvement, so is the fix going to be merged to the 7.3 branch or is it too late for that?

Yes, been there done that ...

			regards, tom lane