Thread: CommitDelay performance improvement
Looking at the XLOG stuff, I notice that we already have a field
(logRec) in the per-backend PROC structures that shows whether a
transaction is currently in progress with at least one change made
(ie at least one XLOG entry written).

It would be very easy to extend the existing code so that the commit
delay is not done unless there is at least one other backend with
nonzero logRec --- or, more generally, at least N other backends with
nonzero logRec. We cannot tell if any of them are actually nearing
their commits, but this seems better than just blindly waiting. Larger
values of N would presumably improve the odds that at least one of them
is nearing its commit.

A further refinement, still quite cheap to implement since the info is
in the PROC struct, would be to not count backends that are blocked
waiting for locks. These guys are less likely to be ready to commit
in the next few milliseconds than the guys who are actively running;
indeed they cannot commit until someone else has committed/aborted to
release the lock they need.

Comments? What should the threshold N be ... or do we need to make
that a tunable parameter?

regards, tom lane

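For illustration, the test could look something like the sketch below.
The helper name, the array walk, and the exact field checks are
hypothetical, not the actual source; CommitDelay stands for the existing
delay setting (in microseconds) and CommitSiblings for the proposed
threshold N.

    /*
     * Hypothetical sketch only -- names and field accesses here are
     * illustrative, not the actual source.  Count other backends that
     * have written at least one XLOG record and are not blocked
     * waiting for a lock.
     */
    static int
    CountOtherActiveBackends(void)
    {
        int     count = 0;
        int     i;

        for (i = 0; i < MaxBackends; i++)   /* hypothetical PROC array */
        {
            PROC   *proc = ProcTable[i];

            if (proc == NULL || proc == MyProc)
                continue;                   /* empty slot, or ourselves */
            if (proc->logRec.xrecoff == 0)
                continue;                   /* no XLOG entry written yet */
            if (proc->waitLock != NULL)
                continue;                   /* blocked waiting for a lock */
            count++;
        }
        return count;
    }

    /* ... then, just before flushing the commit record: */
    if (CommitDelay > 0 && CountOtherActiveBackends() >= CommitSiblings)
        usleep(CommitDelay);    /* delay, hoping to share someone's fsync */
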
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Why not just set a flag in there when someone nears commit and clear
> > when they are about to commit?
>
> Define "nearing commit", in such a way that you can specify where you
> plan to set that flag.

Is there significant time between entry of CommitTransaction() and the
fsync()? Maybe not.

-- Bruce Momjian <pgman@candle.pha.pa.us>  http://candle.pha.pa.us

> Looking at the XLOG stuff, I notice that we already have a field
> (logRec) in the per-backend PROC structures that shows whether a
> transaction is currently in progress with at least one change made
> (ie at least one XLOG entry written).
>
> It would be very easy to extend the existing code so that the commit
> delay is not done unless there is at least one other backend with
> nonzero logRec --- or, more generally, at least N other backends with
> nonzero logRec. We cannot tell if any of them are actually nearing
> their commits, but this seems better than just blindly waiting. Larger
> values of N would presumably improve the odds that at least one of them
> is nearing its commit.

Why not just set a flag in there when someone nears commit and clear
when they are about to commit?

-- Bruce Momjian <pgman@candle.pha.pa.us>  http://candle.pha.pa.us

Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Is there significant time between entry of CommitTransaction() and the
> fsync()? Maybe not.

I doubt it. No I/O anymore, anyway, unless the commit record happens to
overrun an xlog block boundary.

regards, tom lane

Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Why not just set a flag in there when someone nears commit and clear
> when they are about to commit?

Define "nearing commit", in such a way that you can specify where you
plan to set that flag.

regards, tom lane

> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Is there significant time between entry of CommitTransaction() and the
> > fsync()? Maybe not.
>
> I doubt it. No I/O anymore, anyway, unless the commit record happens to
> overrun an xlog block boundary.

That's what I was afraid of. Since we don't write the dirty blocks to
the kernel anymore, we don't really have much happening before someone
says they are about to commit. In the old days, we were write()'ing
those buffers, and we had some delay and kernel calls in there. Guess
that idea is dead.

-- Bruce Momjian <pgman@candle.pha.pa.us>  http://candle.pha.pa.us

On Fri, Feb 23, 2001 at 11:32:21AM -0500, Tom Lane wrote:
> A further refinement, still quite cheap to implement since the info is
> in the PROC struct, would be to not count backends that are blocked
> waiting for locks. These guys are less likely to be ready to commit
> in the next few milliseconds than the guys who are actively running;
> indeed they cannot commit until someone else has committed/aborted to
> release the lock they need.
>
> Comments? What should the threshold N be ... or do we need to make
> that a tunable parameter?

Once you make it tuneable, you're stuck with it. You can always add
a knob later, after somebody discovers a real need.

Nathan Myers <ncm@zembu.com>

> On Fri, Feb 23, 2001 at 11:32:21AM -0500, Tom Lane wrote:
> > A further refinement, still quite cheap to implement since the info is
> > in the PROC struct, would be to not count backends that are blocked
> > waiting for locks. These guys are less likely to be ready to commit
> > in the next few milliseconds than the guys who are actively running;
> > indeed they cannot commit until someone else has committed/aborted to
> > release the lock they need.
> >
> > Comments? What should the threshold N be ... or do we need to make
> > that a tunable parameter?
>
> Once you make it tuneable, you're stuck with it. You can always add
> a knob later, after somebody discovers a real need.

I wonder if Tom should implement it, but leave it at zero until people
can report that a non-zero value helps. We already have the parameter;
we can just make it smarter and let people test it.

-- Bruce Momjian <pgman@candle.pha.pa.us>  http://candle.pha.pa.us

ncm@zembu.com (Nathan Myers) writes:
>> Comments? What should the threshold N be ... or do we need to make
>> that a tunable parameter?

> Once you make it tuneable, you're stuck with it. You can always add
> a knob later, after somebody discovers a real need.

If we had a good idea what the default level should be, I'd be willing
to go without a knob. I'm thinking of a default of about 5 (ie, at
least 5 other active backends to trigger a commit delay) ... but I'm not
so confident of that that I think it needn't be tunable. It's really
dependent on your average and peak transaction lengths, and that's
going to vary across installations, so unless we want to try to make it
self-adjusting, a knob seems like a good idea.

A self-adjusting delay might well be a great idea, BTW, but I'm trying
to be conservative about how much complexity we should add right now.

regards, tom lane

> ncm@zembu.com (Nathan Myers) writes:
> >> Comments? What should the threshold N be ... or do we need to make
> >> that a tunable parameter?
>
> > Once you make it tuneable, you're stuck with it. You can always add
> > a knob later, after somebody discovers a real need.
>
> If we had a good idea what the default level should be, I'd be willing
> to go without a knob. I'm thinking of a default of about 5 (ie, at
> least 5 other active backends to trigger a commit delay) ... but I'm not
> so confident of that that I think it needn't be tunable. It's really
> dependent on your average and peak transaction lengths, and that's
> going to vary across installations, so unless we want to try to make it
> self-adjusting, a knob seems like a good idea.
>
> A self-adjusting delay might well be a great idea, BTW, but I'm trying
> to be conservative about how much complexity we should add right now.

OH, so you are saying N backends should have dirtied buffers before
doing the delay? Hmm, that seems almost untunable to me.

Let's suppose we decide to sleep. When we wake up, can we know that
someone else has fsync'ed for us? And if they have, should we be more
likely to fsync() in the future?

-- Bruce Momjian <pgman@candle.pha.pa.us>  http://candle.pha.pa.us

> > And if they have, should we be more
> > likely to fsync() in the future?

I meant more likely to sleep().

> You mean less likely. My thought for a self-adjusting delay was to
> ratchet the delay up a little every time it succeeds in avoiding an
> fsync, and down a little every time it fails to do so. No change when
> we don't delay at all (because of no other active backends). But
> testing this and making sure it behaves reasonably seems like more work
> than we should try to accomplish before 7.1.

It could be tough. Imagine the delay increasing to 3 seconds? Seems
there has to be an upper bound on the sleep. The more you delay, the
more likely you will be to find someone to fsync you.

Are we waking processes up after we have fsync()'ed them? If so, we can
keep increasing the delay.

-- Bruce Momjian <pgman@candle.pha.pa.us>  http://candle.pha.pa.us

Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> A self-adjusting delay might well be a great idea, BTW, but I'm trying
>> to be conservative about how much complexity we should add right now.

> OH, so you are saying N backends should have dirtied buffers before
> doing the delay? Hmm, that seems almost untunable to me.

> Let's suppose we decide to sleep. When we wake up, can we know that
> someone else has fsync'ed for us?

XLogFlush will find that it has nothing to do, so yes we can.

> And if they have, should we be more
> likely to fsync() in the future?

You mean less likely. My thought for a self-adjusting delay was to
ratchet the delay up a little every time it succeeds in avoiding an
fsync, and down a little every time it fails to do so. No change when
we don't delay at all (because of no other active backends). But
testing this and making sure it behaves reasonably seems like more work
than we should try to accomplish before 7.1.

regards, tom lane

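A minimal sketch of that ratchet, with invented names throughout (the
clamp anticipates Bruce's point above that the feedback loop needs an
upper bound, since a longer delay makes a shared fsync more likely and
would otherwise push the delay up without limit):

    /*
     * Hypothetical self-adjusting commit delay -- nothing like this is
     * in the source; it just illustrates the ratchet described above.
     */
    static long commit_delay_usec = 10000;  /* start at one 10 msec tick */

    #define DELAY_STEP_USEC   1000
    #define DELAY_MAX_USEC  100000          /* hard ceiling on the sleep */

    static void
    AdjustCommitDelay(bool fsync_was_avoided)
    {
        if (fsync_was_avoided)
            commit_delay_usec += DELAY_STEP_USEC;  /* delay paid off: raise */
        else
            commit_delay_usec -= DELAY_STEP_USEC;  /* wasted wait: lower */

        if (commit_delay_usec < 0)
            commit_delay_usec = 0;
        if (commit_delay_usec > DELAY_MAX_USEC)
            commit_delay_usec = DELAY_MAX_USEC;
    }

The caller would pass in fsync_was_avoided based on whether XLogFlush,
after the pre-commit sleep, found that some other backend's fsync had
already flushed our commit record.
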
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> It could be tough. Imagine the delay increasing to 3 seconds? Seems
> there has to be an upper bound on the sleep. The more you delay, the
> more likely you will be to find someone to fsync you.

Good point, and an excellent illustration of the fact that
self-adjusting algorithms aren't that easy to get right the first
time ;-)

> Are we waking processes up after we have fsync()'ed them?

Not at the moment. That would be another good mechanism to investigate
for 7.2; but right now there's no infrastructure that would allow a
backend to discover which other ones were sleeping for fsync.

regards, tom lane

On Fri, Feb 23, 2001 at 05:18:19PM -0500, Tom Lane wrote:
> ncm@zembu.com (Nathan Myers) writes:
> >> Comments? What should the threshold N be ... or do we need to make
> >> that a tunable parameter?
>
> > Once you make it tuneable, you're stuck with it. You can always add
> > a knob later, after somebody discovers a real need.
>
> If we had a good idea what the default level should be, I'd be willing
> to go without a knob. I'm thinking of a default of about 5 (ie, at
> least 5 other active backends to trigger a commit delay) ... but I'm not
> so confident of that that I think it needn't be tunable. It's really
> dependent on your average and peak transaction lengths, and that's
> going to vary across installations, so unless we want to try to make it
> self-adjusting, a knob seems like a good idea.
>
> A self-adjusting delay might well be a great idea, BTW, but I'm trying
> to be conservative about how much complexity we should add right now.

When thinking about tuning N, I like to consider the interesting
possible values for N:

 0: Ignore any other potential committers.
 1: The minimum possible responsiveness to other committers.
 5: Tom's guess for what might be a good choice.
10: Harry's guess.
~0: Always delay.

I would rather release with N=1 than with 0, because it actually
responds to conditions. What N might best be, >1, probably varies on
a lot of hard-to-guess parameters.

It seems to me that comparing various choices (and other, more
interesting, algorithms) to the N=1 case would be more productive
than comparing them to the N=0 case, so releasing at N=1 would yield
better statistics for actually tuning in 7.2.

Nathan Myers <ncm@zembu.com>

> When thinking about tuning N, I like to consider the interesting
> possible values for N:
>
>  0: Ignore any other potential committers.
>  1: The minimum possible responsiveness to other committers.
>  5: Tom's guess for what might be a good choice.
> 10: Harry's guess.
> ~0: Always delay.
>
> I would rather release with N=1 than with 0, because it actually
> responds to conditions. What N might best be, >1, probably varies on
> a lot of hard-to-guess parameters.
>
> It seems to me that comparing various choices (and other, more
> interesting, algorithms) to the N=1 case would be more productive
> than comparing them to the N=0 case, so releasing at N=1 would yield
> better statistics for actually tuning in 7.2.

We don't release code because it has better tuning opportunities for
later releases. What we can do is give people parameters where the
default is safe, and they can play and report to us.

-- Bruce Momjian <pgman@candle.pha.pa.us>  http://candle.pha.pa.us

> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > It could be tough. Imagine the delay increasing to 3 seconds? Seems
> > there has to be an upper bound on the sleep. The more you delay, the
> > more likely you will be to find someone to fsync you.
>
> Good point, and an excellent illustration of the fact that
> self-adjusting algorithms aren't that easy to get right the first
> time ;-)

I see. I am concerned that anything done to 7.1 at this point may
cause problems with performance under certain circumstances. Let's see
what the new code shows our testers.

> > Are we waking processes up after we have fsync()'ed them?
>
> Not at the moment. That would be another good mechanism to investigate
> for 7.2; but right now there's no infrastructure that would allow a
> backend to discover which other ones were sleeping for fsync.

Can we put the backends to sleep waiting for a lock, and have them wake
up later?

-- Bruce Momjian <pgman@candle.pha.pa.us>  http://candle.pha.pa.us

Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Can we put the backends to sleep waiting for a lock, and have them wake
> up later?

Locks don't have timeouts. There is no existing mechanism that will
serve this purpose; we'll have to create a new one.

regards, tom lane

On Fri, Feb 23, 2001 at 06:37:06PM -0500, Bruce Momjian wrote:
> > When thinking about tuning N, I like to consider the interesting
> > possible values for N:
> >
> >  0: Ignore any other potential committers.
> >  1: The minimum possible responsiveness to other committers.
> >  5: Tom's guess for what might be a good choice.
> > 10: Harry's guess.
> > ~0: Always delay.
> >
> > I would rather release with N=1 than with 0, because it actually
> > responds to conditions. What N might best be, >1, probably varies on
> > a lot of hard-to-guess parameters.
> >
> > It seems to me that comparing various choices (and other, more
> > interesting, algorithms) to the N=1 case would be more productive
> > than comparing them to the N=0 case, so releasing at N=1 would yield
> > better statistics for actually tuning in 7.2.
>
> We don't release code because it has better tuning opportunities for
> later releases. What we can do is give people parameters where the
> default is safe, and they can play and report to us.

Perhaps I misunderstood. I had perceived N=1 as a conservative choice
that was nevertheless preferable to N=0.

Nathan Myers <ncm@zembu.com>

> > > It seems to me that comparing various choices (and other, more
> > > interesting, algorithms) to the N=1 case would be more productive
> > > than comparing them to the N=0 case, so releasing at N=1 would yield
> > > better statistics for actually tuning in 7.2.
> >
> > We don't release code because it has better tuning opportunities for
> > later releases. What we can do is give people parameters where the
> > default is safe, and they can play and report to us.
>
> Perhaps I misunderstood. I had perceived N=1 as a conservative choice
> that was nevertheless preferable to N=0.

I think zero delay is the conservative choice at this point, unless we
hear otherwise from testers.

-- Bruce Momjian <pgman@candle.pha.pa.us>  http://candle.pha.pa.us

> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Can we put the backends to sleep waiting for a lock, and have them wake
> > up later?
>
> Locks don't have timeouts. There is no existing mechanism that will
> serve this purpose; we'll have to create a new one.

That is what I suspected. Having thought about it, we currently have a
few options:

1) let every backend fsync on its own
2) try to delay backends so they all fsync() at the same time
3) delay fsync until after commit

Items 2 and 3 attempt to bunch up fsyncs. Option 2 has backends waiting
to fsync() on the expectation that some other backend may commit soon.

Option 3 may turn out to be the best solution. No matter how smart we
make the code, we will never know for sure if someone is about to
commit and whether it is worth waiting. My idea would be to let
committing backends return "COMMIT" to the user, and set a need_fsync
flag that is guaranteed to cause an fsync within X milliseconds. This
way, if other backends commit in the next X milliseconds, they can all
use one fsync().

Now, I know many will complain that we are returning commit while not
having the stuff on the platter. But consider, we only lose data from
an OS crash or hardware failure. Do people who commit something, and
then the machine crashes 2 milliseconds after the commit, really expect
the data to be on the disk when they restart? Maybe they do, but it
seems the benefit of grouped fsyncs() is large enough that many will
say they would rather have this option.

This was my point long ago that we could offer sub-second reliability
with no-fsync performance if we just had some process running that
wrote dirty pages and fsynced every 20 milliseconds.

-- Bruce Momjian <pgman@candle.pha.pa.us>  http://candle.pha.pa.us

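To make the option-3 idea concrete, here is a rough sketch; every name
in it (RequestDeferredFsync, XLogFsync, FlusherMain, and the flusher
process itself) is invented for illustration -- no such mechanism
exists in the source:

    #include <unistd.h>

    /*
     * In shared memory: set by committing backends, cleared by a
     * hypothetical background flusher process.
     */
    volatile bool xlog_needs_fsync = false;

    /* Committing backend: return COMMIT immediately, just raise the flag. */
    void
    RequestDeferredFsync(void)
    {
        xlog_needs_fsync = true;    /* no I/O on the commit path */
    }

    /*
     * Main loop of the flusher: guarantees an fsync within the interval,
     * so every commit arriving in the same window shares one fsync().
     */
    void
    FlusherMain(long interval_msec)
    {
        for (;;)
        {
            usleep(interval_msec * 1000L);
            if (xlog_needs_fsync)
            {
                xlog_needs_fsync = false;
                XLogFsync();        /* hypothetical: one fsync per window */
            }
        }
    }
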
At 21:31 23/02/01 -0500, Bruce Momjian wrote:
> Now, I know many will complain that we are returning commit while not
> having the stuff on the platter.

You're definitely right there.

> Maybe they do, but it seems
> the benefit of grouped fsyncs() is large enough that many will say they
> would rather have this option.

I'd prefer to wait for a lock manager that supports timeouts and
contention notification.

-- Philip Warner, Albatross Consulting Pty. Ltd.  http://www.rhyme.com.au

> At 21:31 23/02/01 -0500, Bruce Momjian wrote:
> > Now, I know many will complain that we are returning commit while not
> > having the stuff on the platter.
>
> You're definitely right there.
>
> > Maybe they do, but it seems
> > the benefit of grouped fsyncs() is large enough that many will say they
> > would rather have this option.
>
> I'd prefer to wait for a lock manager that supports timeouts and
> contention notification.

I understand, and that would be fine if it were going to fix the
problem completely, but it isn't. It is just going to allow us more
flexibility at guessing who may be about to commit.

-- Bruce Momjian <pgman@candle.pha.pa.us>  http://candle.pha.pa.us

At 14:57 23/02/01 -0800, Nathan Myers wrote:
> When thinking about tuning N, I like to consider the interesting
> possible values for N:

It may have been much earlier in the debate, but has anyone checked to
see what the maximum possible gains might be - or is it self-evident to
people who know the code?

Would it be worth considering creating a test case with no flush in
RecordTransactionCommit, relying on checkpointing to flush? I realize
this is never an option in production, but is it possible to modify the
code in this way? It *should* give an upper limit on the gains that can
be made by flushing at the best possible time.

-- Philip Warner, Albatross Consulting Pty. Ltd.  http://www.rhyme.com.au

At 11:32 23/02/01 -0500, Tom Lane wrote:
> Looking at the XLOG stuff, I notice that we already have a field
> (logRec) in the per-backend PROC structures that shows whether a
> transaction is currently in progress with at least one change made
> (ie at least one XLOG entry written).

Would it be worth adding a field 'waiting for fsync since xxx', so the
second process can (a) log that it is expecting someone else to FSYNC
(for perf stats, if we want them), and (b) wait for (xxx + delta)
ms/us, etc?

-- Philip Warner, Albatross Consulting Pty. Ltd.  http://www.rhyme.com.au

At 23:14 23/02/01 -0500, Bruce Momjian wrote:
> There is one more thing. Even though the kernel says the data is on the
> platter, it still may not be there.

This is true, but it does not mean we should say 'the disk is slightly
unreliable, so we can be too'. Also, IIRC, the last time this was
discussed, someone commented that buying expensive disks and a UPS gets
you reliability (barring a direct lightning strike) - it had something
to do with write-ordering and hardware caches. In any case, I'd hate to
see DB design decisions based closely on hardware capability. At least
two of my customers use high performance ram disks for databases - do
these also suffer from 'flush is not really flush' problems?

> Basically, I am not sure how much we lose by doing the delay after
> returning COMMIT, and I know we gain quite a bit by enabling us to group
> fsync calls.

If included, this should be an option only, and not the default option.
In fact I'd quite like to see such a feature, although I'd not only do
a 'flush every X ms', but I'd also do a 'flush every X transactions' -
this way a DBA can say 'I don't mind losing the last 20 TXs in a
crash'. Bear in mind that on a fast system, 20ms is a lot of
transactions.

-- Philip Warner, Albatross Consulting Pty. Ltd.  http://www.rhyme.com.au

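For concreteness, that pair of knobs could gate the hypothetical
flusher sketched earlier along these lines (invented names; in a real
backend the counters would live in shared memory and need locking):

    /*
     * Hypothetical flush policy: flush when either the commit-count
     * budget or the elapsed-time budget since the last fsync runs out.
     */
    static int  unflushed_xacts = 0;    /* commits since last fsync */
    static long unflushed_usec = 0;     /* microseconds since last fsync */

    static bool
    time_to_flush(int max_xacts, long max_usec)
    {
        return unflushed_xacts >= max_xacts ||
               unflushed_usec >= max_usec;
    }
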
On Fri, Feb 23, 2001 at 09:05:20PM -0500, Bruce Momjian wrote:
> > > > It seems to me that comparing various choices (and other, more
> > > > interesting, algorithms) to the N=1 case would be more productive
> > > > than comparing them to the N=0 case, so releasing at N=1 would yield
> > > > better statistics for actually tuning in 7.2.
> > >
> > > We don't release code because it has better tuning opportunities for
> > > later releases. What we can do is give people parameters where the
> > > default is safe, and they can play and report to us.
> >
> > Perhaps I misunderstood. I had perceived N=1 as a conservative choice
> > that was nevertheless preferable to N=0.
>
> I think zero delay is the conservative choice at this point, unless we
> hear otherwise from testers.

I see, I had it backwards: N=0 corresponds to "always delay", and
N=infinity (~0) is "never delay", or what you call zero delay. N=1 is
not interesting. N=M/2 or N=sqrt(M) or N=log(M) might be interesting,
where M is the number of backends, or the number of backends with begun
transactions, or something. N=10 would be conservative (and maybe
pointless) just because it would hardly ever trigger a delay.

Nathan Myers <ncm@zembu.com>

> At 23:14 23/02/01 -0500, Bruce Momjian wrote:
> > There is one more thing. Even though the kernel says the data is on the
> > platter, it still may not be there.
>
> This is true, but it does not mean we should say 'the disk is slightly
> unreliable, so we can be too'. Also, IIRC, the last time this was
> discussed, someone commented that buying expensive disks and a UPS gets
> you reliability (barring a direct lightning strike) - it had something
> to do with write-ordering and hardware caches. In any case, I'd hate to
> see DB design decisions based closely on hardware capability. At least
> two of my customers use high performance ram disks for databases - do
> these also suffer from 'flush is not really flush' problems?

Well, I am saying we are being pretty rigid here when we may be on top
of a system that is not, meaning that our rigidity is buying us little.

> > Basically, I am not sure how much we lose by doing the delay after
> > returning COMMIT, and I know we gain quite a bit by enabling us to group
> > fsync calls.
>
> If included, this should be an option only, and not the default option.
> In fact I'd quite like to see such a feature, although I'd not only do
> a 'flush every X ms', but I'd also do a 'flush every X transactions' -
> this way a DBA can say 'I don't mind losing the last 20 TXs in a
> crash'. Bear in mind that on a fast system, 20ms is a lot of
> transactions.

Yes, I can see this as a good option for many users. My old complaint
was that we allowed only two very extreme options: fsync() all the
time, or fsync() never and recover from a crash.

-- Bruce Momjian <pgman@candle.pha.pa.us>  http://candle.pha.pa.us

> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > My idea would be to let committing backends return "COMMIT" to the user,
> > and set a need_fsync flag that is guaranteed to cause an fsync within X
> > milliseconds. This way, if other backends commit in the next X
> > milliseconds, they can all use one fsync().
>
> Guaranteed by what? We have no mechanism available to make an fsync
> happen while the backend is waiting for input.

We would need a separate binary that can look at shared memory and
fsync if someone requested it. Again, nothing for 7.1.X.

> > Now, I know many will complain that we are returning commit while not
> > having the stuff on the platter.
>
> I think that's unacceptable on its face. A remote client may take
> action on the basis that COMMIT was returned. If the server then
> crashes, the client is unlikely to realize this for some time (certainly
> at least one TCP timeout interval). It won't look like a "milliseconds
> later" situation to that client. In fact, the client might *never*
> realize there was a problem; what if it disconnects after getting the
> COMMIT?
>
> If the dbadmin thinks he doesn't need fsync before commit, he'll likely
> be running with fsync off anyway. For the ones who do think they need
> fsync, I don't believe that we get to rearrange the fsync to occur after
> commit.

I can see someone wanting some fsync, but not taking the hit. My
argument is that having this ability, there would be no need to turn
off fsync.

-- Bruce Momjian <pgman@candle.pha.pa.us>  http://candle.pha.pa.us

Bruce Momjian <pgman@candle.pha.pa.us> writes:
> My idea would be to let committing backends return "COMMIT" to the user,
> and set a need_fsync flag that is guaranteed to cause an fsync within X
> milliseconds. This way, if other backends commit in the next X
> milliseconds, they can all use one fsync().

Guaranteed by what? We have no mechanism available to make an fsync
happen while the backend is waiting for input.

> Now, I know many will complain that we are returning commit while not
> having the stuff on the platter.

I think that's unacceptable on its face. A remote client may take
action on the basis that COMMIT was returned. If the server then
crashes, the client is unlikely to realize this for some time (certainly
at least one TCP timeout interval). It won't look like a "milliseconds
later" situation to that client. In fact, the client might *never*
realize there was a problem; what if it disconnects after getting the
COMMIT?

If the dbadmin thinks he doesn't need fsync before commit, he'll likely
be running with fsync off anyway. For the ones who do think they need
fsync, I don't believe that we get to rearrange the fsync to occur after
commit.

regards, tom lane

Philip Warner <pjw@rhyme.com.au> writes:
> It may have been much earlier in the debate, but has anyone checked to
> see what the maximum possible gains might be - or is it self-evident to
> people who know the code?

fsync off provides an upper bound to the speed achievable from being
smarter about when to fsync ... I doubt that fsync-once-per-checkpoint
would be much different.

regards, tom lane

> At 21:31 23/02/01 -0500, Bruce Momjian wrote:
> > Now, I know many will complain that we are returning commit while not
> > having the stuff on the platter.
>
> You're definitely right there.
>
> > Maybe they do, but it seems
> > the benefit of grouped fsyncs() is large enough that many will say they
> > would rather have this option.
>
> I'd prefer to wait for a lock manager that supports timeouts and
> contention notification.

There is one more thing. Even though the kernel says the data is on
the platter, it still may not be there. Some OS's may return from
fsync when the data is _queued_ to the disk, rather than actually
waiting for the drive return code to say it completed. Second, some
disks report back that the data is on the disk when it is actually in
the disk memory buffer, not really on the disk.

Basically, I am not sure how much we lose by doing the delay after
returning COMMIT, and I know we gain quite a bit by enabling us to
group fsync calls.

-- Bruce Momjian <pgman@candle.pha.pa.us>  http://candle.pha.pa.us

Preliminary results from experimenting with an
N-transactions-must-be-running-to-cause-commit-delay heuristic are
attached. It seems to be a pretty definite win. I'm currently running
a more extensive set of cases on another machine for comparison.

The test case is pgbench, unmodified, but run at scale factor 10 to
reduce write contention on the 'branch' rows. Postmaster parameters
are -N 100 -B 1024 in all cases. The fsync-off (with, of course, no
commit delay either) case is shown for comparison.

"commit siblings" is the number of other backends that must be running
active (unblocked, at least one XLOG entry made) transactions before we
will do a precommit delay. commit delay=1 is effectively commit
delay=10000 (10 msec) on this hardware. Interestingly, it seems that
we can push the delay up to two or three clock ticks without
degradation, given positive N.

regards, tom lane

ncm@zembu.com (Nathan Myers) writes:
> I see, I had it backwards: N=0 corresponds to "always delay", and
> N=infinity (~0) is "never delay", or what you call zero delay. N=1 is
> not interesting. N=M/2 or N=sqrt(M) or N=log(M) might be interesting,
> where M is the number of backends, or the number of backends with begun
> transactions, or something. N=10 would be conservative (and maybe
> pointless) just because it would hardly ever trigger a delay.

Why is N=1 not interesting? That requires at least one other backend
to be in a transaction before you'll delay. That would seem to be
the minimum useful value --- N=0 (always delay) seems clearly to be
too stupid to be useful.

regards, tom lane

> Philip Warner <pjw@rhyme.com.au> writes:
> > It may have been much earlier in the debate, but has anyone checked to
> > see what the maximum possible gains might be - or is it self-evident to
> > people who know the code?
>
> fsync off provides an upper bound to the speed achievable from being
> smarter about when to fsync ... I doubt that fsync-once-per-checkpoint
> would be much different.

That was my point: people should be doing fsync once per checkpoint
rather than never.

-- Bruce Momjian <pgman@candle.pha.pa.us>  http://candle.pha.pa.us

On Sat, Feb 24, 2001 at 01:07:17AM -0500, Tom Lane wrote:
> ncm@zembu.com (Nathan Myers) writes:
> > I see, I had it backwards: N=0 corresponds to "always delay", and
> > N=infinity (~0) is "never delay", or what you call zero delay. N=1 is
> > not interesting. N=M/2 or N=sqrt(M) or N=log(M) might be interesting,
> > where M is the number of backends, or the number of backends with begun
> > transactions, or something. N=10 would be conservative (and maybe
> > pointless) just because it would hardly ever trigger a delay.
>
> Why is N=1 not interesting? That requires at least one other backend
> to be in a transaction before you'll delay. That would seem to be
> the minimum useful value --- N=0 (always delay) seems clearly to be
> too stupid to be useful.

N=1 seems arbitrarily aggressive. It assumes any open transaction will
commit within a few milliseconds; otherwise the delay is wasted. On a
fairly busy system, it seems to me to impose a strict upper limit on
transaction rate for any client, regardless of actual system I/O load.
(N=0 would impose that strict upper limit even for a single client.)

Delaying isn't free, because it means that the client can't turn around
and do even a cheap query for a while. In a sense, when you delay you
are charging the committer a tax to try to improve overall throughput.
If the delay lets you reduce I/O churn enough to increase the total
bandwidth, then it was worthwhile; if not, you just cut system
performance, and responsiveness to each client, for nothing.

The above suggests that maybe N should depend on recent disk I/O
activity, so you get a larger N (and thus less likely delay and more
certain payoff) for a more lightly-loaded system. On a system that has
maxed its I/O bandwidth, clients will suffer delays anyhow, so they
might as well suffer controlled delays that result in better total
throughput. On a lightly-loaded system there's no need, or payoff, for
such throttling. Can we measure disk system load by averaging the
times taken for fsyncs?

Nathan Myers <ncm@zembu.com>

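One way to get such a measurement is sketched below, with invented
names throughout: an exponential moving average of fsync() cost mapped
to a sibling threshold. The thresholds are arbitrary placeholders, not
tuned values.

    #include <stdbool.h>
    #include <sys/time.h>
    #include <unistd.h>

    static double fsync_avg_usec = 0.0;     /* smoothed fsync() cost */

    static void
    TimedFsync(int fd)
    {
        struct timeval t0, t1;
        double  elapsed;

        gettimeofday(&t0, NULL);
        fsync(fd);
        gettimeofday(&t1, NULL);

        elapsed = (t1.tv_sec - t0.tv_sec) * 1e6 +
                  (t1.tv_usec - t0.tv_usec);
        fsync_avg_usec = 0.9 * fsync_avg_usec + 0.1 * elapsed;
    }

    /*
     * Cheap fsyncs mean a lightly-loaded disk, so demand many siblings
     * (delay rarely); expensive fsyncs mean a saturated disk, so group
     * commits aggressively.
     */
    static int
    AdaptiveCommitSiblings(void)
    {
        return (fsync_avg_usec < 10000.0) ? 10 : 1;
    }
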
Attached are graphs from more thorough runs of pgbench with a commit
delay that occurs only when at least N other backends are running
active transactions.

My initial try at this proved to be too noisy to tell much. The noise
seems to be coming from WAL checkpoints that occur during a run and
push down the reported TPS value for the particular case that's
running. While we'd need to include WAL checkpoints to make an honest
performance comparison against another RDBMS, I think they are best
ignored for the purpose of figuring out what the commit-delay behavior
ought to be. Accordingly, I modified my test script to minimize the
occurrence of checkpoint activity during runs (see attached script).
There are still some data points that are unexpectedly low compared to
their neighbors; presumably these were affected by checkpoints or other
system activity.

It's not entirely clear what set of parameters is best, but it is
absolutely clear that a flat zero-commit-delay policy is NOT best.

The test conditions are postmaster options -N 100 -B 1024, pgbench
scale factor 10, pgbench -t (transactions per client) 100. (Hence the
results for a single client rely on only 100 transactions, and are
pretty noisy. The noise level should decrease as the number of clients
increases.)

Comments anyone?

regards, tom lane

#! /bin/sh

# Expected postmaster options: -N 100 -B 1024 -c checkpoint_timeout=1800
# Recommended pgbench setup: pgbench -i -s 10 bench

for del in 0 ; do
  for sib in 1 ; do
    for cli in 1 10 20 30 40 50 ; do
      echo "commit_delay = $del"
      echo "commit_siblings = $sib"
      psql -c "vacuum branches; vacuum tellers; delete from history; vacuum history; checkpoint;" bench
      PGOPTIONS="-c commit_delay=$del -c commit_siblings=$sib" \
        pgbench -c $cli -t 100 -n bench
    done
  done
done

for del in 10000 30000 50000 100000 ; do
  for sib in 1 5 10 20 ; do
    for cli in 1 10 20 30 40 50 ; do
      echo "commit_delay = $del"
      echo "commit_siblings = $sib"
      psql -c "vacuum branches; vacuum tellers; delete from history; vacuum history; checkpoint;" bench
      PGOPTIONS="-c commit_delay=$del -c commit_siblings=$sib" \
        pgbench -c $cli -t 100 -n bench
    done
  done
done

At 00:41 25/02/01 -0500, Tom Lane wrote:
> Comments anyone?

Don't suppose you could post the original data?

-- Philip Warner, Albatross Consulting Pty. Ltd.  http://www.rhyme.com.au

Philip Warner <pjw@rhyme.com.au> writes:
> Don't suppose you could post the original data?

Sure.

regards, tom lane

All runs: pgbench "TPC-B (sort of)", scaling factor 10, 100 transactions
per client, all transactions processed.

commit_delay  commit_siblings  clients  tps (incl. conn.)  tps (excl. conn.)
------------  ---------------  -------  -----------------  -----------------
           0                1        1          10.996953          11.051216
           0                1       10          17.779923          17.924390
           0                1       20          17.289815          17.429343
           0                1       30          17.292171          17.432905
           0                1       40          17.733478          17.913251
           0                1       50          18.325273          18.534556
       10000                1        1          10.449347          10.500278
       10000                1       10          17.865721          18.015078
       10000                1       20          17.980234          18.131986
       10000                1       30          18.858489          19.027436
       10000                1       40          19.320221          19.496999
       10000                1       50          19.440978          19.621221
       10000                5        1          11.298701          11.357102
       10000                5       10          19.722266          19.903373
       10000                5       20          19.042737          19.214042
       10000                5       30          19.013869          19.185863
       10000                5       40          20.081644          20.273612
       10000                5       50          20.379646          20.577183
       10000               10        1          10.896660          10.951360
       10000               10       10          19.506836          19.686328
       10000               10       20          18.801060          18.968530
       10000               10       30          19.855547          20.044110
       10000               10       40          20.557934          20.760724
       10000               10       50          20.278060          20.473699
       10000               20        1          11.098777          11.155340
       10000               20       10          18.638060          18.801436
       10000               20       20          19.815520          20.003053
       10000               20       30          20.034017          20.231631
       10000               20       40          20.676088          20.879088
       10000               20       50          20.692725          20.895842
       30000                1        1          11.160902          11.218247
       30000                1       10          18.831596          19.000649
       30000                1       20          20.239767          20.434566
       30000                1       30          20.686848          20.891519
       30000                1       40          21.014861          21.224443
       30000                1       50          21.315164          21.533027
       30000                5        1          11.384356          11.444286
       30000                5       10          18.614866          18.780395
       30000                5       20          20.462955          20.661262
       30000                5       30          20.769457          20.975243
       30000                5       40          19.280678          19.457795
       30000                5       50          20.852166          21.057769
       30000               10        1          11.129848          11.188346
       30000               10       10          19.154248          19.328718
       30000               10       20          19.487838          19.668323
       30000               10       30          20.387741          20.586291
       30000               10       40          21.187943          21.403037
       30000               10       50          20.870339          21.080454
       30000               20        1          11.119876          11.177152
       30000               20       10          18.987202          19.157841
       30000               20       20          19.771415          19.957555
       30000               20       30          20.277710          20.473996
       30000               20       40          20.736168          20.942539
       30000               20       50          18.894930          19.064049
       50000                1        1          11.006743          11.062485
       50000                1       10          18.240024          18.399169
       50000                1       20          19.817212          20.002657
       50000                1       30          20.260368          20.455821
       50000                1       40          20.928079          21.135532
       50000                1       50          21.216875          21.431381
       50000                5        1          11.362410          11.421545
       50000                5       10          18.879526          19.047014
       50000                5       20          20.100514          20.292700
       50000                5       30          20.108420          20.326053
       50000                5       40          20.876438          21.083252
       50000                5       50          20.929535          21.139167
       50000               10        1          11.037506          11.094671
       50000               10       10          16.197469          16.321687
       50000               10       20          19.408106          19.586455
       50000               10       30          20.628612          20.832682
       50000               10       40          20.687795          20.892172
       50000               10       50          21.072593          21.285268
       50000               20        1          11.114714          11.172162
       50000               20       10          19.558748          19.742513
       50000               20       20          18.631916          18.797678
       50000               20       30          19.825138          20.012726
       50000               20       40          20.088452          20.280854
       50000               20       50          20.297366          20.493717
      100000                1        1          15.439671          15.549962
      100000                1       10          19.693075          19.876400
      100000                1       20          18.946142          19.115107
      100000                1       30          18.454647          18.616867
      100000                1       40          20.280877          20.476160
      100000                1       50          20.500824          20.701014
      100000                5        1          10.952132          11.006296
      100000                5       10          17.366365          17.508544
      100000                5       20          19.543583          19.725347
      100000                5       30          (record truncated in the archive)
3000/3000 tps = 20.115157(including connections establishing) tps = 20.307981(excluding connections establishing) commit_delay = 100000 commit_siblings = 5 CHECKPOINT transaction type: TPC-B (sort of) scaling factor: 10 number of clients: 40 number of transactions per client: 100 number of transactions actually processed: 4000/4000 tps = 20.223466(including connections establishing) tps = 20.420063(excluding connections establishing) commit_delay = 100000 commit_siblings = 5 CHECKPOINT transaction type: TPC-B (sort of) scaling factor: 10 number of clients: 50 number of transactions per client: 100 number of transactions actually processed: 5000/5000 tps = 20.148971(including connections establishing) tps = 20.341425(excluding connections establishing) commit_delay = 100000 commit_siblings = 10 CHECKPOINT transaction type: TPC-B (sort of) scaling factor: 10 number of clients: 1 number of transactions per client: 100 number of transactions actually processed: 100/100 tps = 10.751800(including connections establishing) tps = 10.805719(excluding connections establishing) commit_delay = 100000 commit_siblings = 10 CHECKPOINT transaction type: TPC-B (sort of) scaling factor: 10 number of clients: 10 number of transactions per client: 100 number of transactions actually processed: 1000/1000 tps = 17.248793(including connections establishing) tps = 17.389532(excluding connections establishing) commit_delay = 100000 commit_siblings = 10 CHECKPOINT transaction type: TPC-B (sort of) scaling factor: 10 number of clients: 20 number of transactions per client: 100 number of transactions actually processed: 2000/2000 tps = 18.971746(including connections establishing) tps = 19.141706(excluding connections establishing) commit_delay = 100000 commit_siblings = 10 CHECKPOINT transaction type: TPC-B (sort of) scaling factor: 10 number of clients: 30 number of transactions per client: 100 number of transactions actually processed: 3000/3000 tps = 20.250238(including connections establishing) tps = 20.445726(excluding connections establishing) commit_delay = 100000 commit_siblings = 10 CHECKPOINT transaction type: TPC-B (sort of) scaling factor: 10 number of clients: 40 number of transactions per client: 100 number of transactions actually processed: 4000/4000 tps = 18.616027(including connections establishing) tps = 18.782432(excluding connections establishing) commit_delay = 100000 commit_siblings = 10 CHECKPOINT transaction type: TPC-B (sort of) scaling factor: 10 number of clients: 50 number of transactions per client: 100 number of transactions actually processed: 5000/5000 tps = 20.101571(including connections establishing) tps = 20.293550(excluding connections establishing) commit_delay = 100000 commit_siblings = 20 CHECKPOINT transaction type: TPC-B (sort of) scaling factor: 10 number of clients: 1 number of transactions per client: 100 number of transactions actually processed: 100/100 tps = 10.630630(including connections establishing) tps = 10.682598(excluding connections establishing) commit_delay = 100000 commit_siblings = 20 CHECKPOINT transaction type: TPC-B (sort of) scaling factor: 10 number of clients: 10 number of transactions per client: 100 number of transactions actually processed: 1000/1000 tps = 17.308711(including connections establishing) tps = 17.450166(excluding connections establishing) commit_delay = 100000 commit_siblings = 20 CHECKPOINT transaction type: TPC-B (sort of) scaling factor: 10 number of clients: 20 number of transactions per client: 100 number of transactions 
actually processed: 2000/2000 tps = 18.041733(including connections establishing) tps = 18.196939(excluding connections establishing) commit_delay = 100000 commit_siblings = 20 CHECKPOINT transaction type: TPC-B (sort of) scaling factor: 10 number of clients: 30 number of transactions per client: 100 number of transactions actually processed: 3000/3000 tps = 18.610682(including connections establishing) tps = 18.775963(excluding connections establishing) commit_delay = 100000 commit_siblings = 20 CHECKPOINT transaction type: TPC-B (sort of) scaling factor: 10 number of clients: 40 number of transactions per client: 100 number of transactions actually processed: 4000/4000 tps = 19.522874(including connections establishing) tps = 19.705095(excluding connections establishing) commit_delay = 100000 commit_siblings = 20 CHECKPOINT transaction type: TPC-B (sort of) scaling factor: 10 number of clients: 50 number of transactions per client: 100 number of transactions actually processed: 5000/5000 tps = 20.085380(including connections establishing) tps = 20.277826(excluding connections establishing)
On Sun, Feb 25, 2001 at 12:41:28AM -0500, Tom Lane wrote:
> Attached are graphs from more thorough runs of pgbench with a commit
> delay that occurs only when at least N other backends are running active
> transactions.
...
> It's not entirely clear what set of parameters is best, but it is
> absolutely clear that a flat zero-commit-delay policy is NOT best.
>
> The test conditions are postmaster options -N 100 -B 1024, pgbench scale
> factor 10, pgbench -t (transactions per client) 100.  (Hence the results
> for a single client rely on only 100 transactions, and are pretty noisy.
> The noise level should decrease as the number of clients increases.)

It's hard to interpret these results.  In particular, "delay 10k, sibs
20" (10k,20), or cyan-triangle, is almost the same as "delay 50k, sibs
1" (50k,1), or green X.  Those are pretty different parameters to get
such similar results.

The only really bad performers were (0), (10k,1), (100k,20).  The best
were (30k,1) and (30k,10), although (30k,5) also did well except at 40.
Why would 30k be a magic delay, regardless of siblings?  What happened
at 40?

At low loads, it seems (100k,1) (brown +) did best by far, which seems
very odd.  Even more odd, it did pretty well at very high loads but had
problems at intermediate loads.

Nathan Myers
ncm@zembu.com
At 00:42 25/02/01 -0800, Nathan Myers wrote:
>
> The only really bad performers were (0), (10k,1), (100k,20).  The best
> were (30k,1) and (30k,10), although (30k,5) also did well except at 40.
> Why would 30k be a magic delay, regardless of siblings?  What happened
> at 40?
>

I had assumed that 40 was one of the glitches - it would be good if Tom
(or someone else) could rerun the suite, to see if we see the same dip.

I agree that 30k looks like the magic delay, and probably 30k/5 would
be a good conservative choice.  But now that I think about the choice
of number, it must vary with the speed of the machine and the length of
the transactions; at 20 tps, each TX is completing in around 50 ms.
The delay probably needs to be set at a value related to the average TX
duration, and since that is not really a known figure, perhaps we
should go with 30% of TX duration, with a max of 100k.

Alternatively, can PG monitor the commits/second, then set the delay to
half of the average TX time (or 100 ms, whichever is smaller)?  Is this
too baroque?

----------------------------------------------------------------
Philip Warner
Albatross Consulting Pty. Ltd. (A.B.N. 75 008 659 498)
Tel: (+61) 0500 83 82 81      Fax: (+61) 0500 83 82 82
Http://www.rhyme.com.au
PGP key available upon request, and from pgp5.ai.mit.edu:11371
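For concreteness, Philip's adaptive rule could look roughly like this in
C.  The names, the smoothing constant, and the assumption that an
instantaneous commit rate is available at each commit are all
hypothetical; this is a sketch of the idea, not server code.

/*
 * Hypothetical sketch of an adaptive commit delay: keep an exponential
 * moving average of the commit rate and set the delay to half the implied
 * average transaction time, capped at 100000 microseconds (100 ms).
 */

#define MAX_COMMIT_DELAY_US 100000	/* Philip's proposed ceiling */

static double smoothed_tps = 0.0;

/* Call at each commit with a recent commits-per-second figure;
 * 0.05 is an arbitrary smoothing factor. */
static void
note_commit_rate(double instantaneous_tps)
{
	if (smoothed_tps == 0.0)
		smoothed_tps = instantaneous_tps;
	else
		smoothed_tps = 0.95 * smoothed_tps + 0.05 * instantaneous_tps;
}

/* Half the average transaction time, in microseconds, capped at 100 ms. */
static long
adaptive_commit_delay(void)
{
	long		delay_us;

	if (smoothed_tps <= 0.0)
		return 0;				/* no data yet: do not delay */
	delay_us = (long) (1e6 / smoothed_tps / 2.0);
	if (delay_us > MAX_COMMIT_DELAY_US)
		delay_us = MAX_COMMIT_DELAY_US;
	return delay_us;
}

At the roughly 20 tps seen in these runs this picks about 25 ms, close
to the 30 ms sweet spot visible in the graphs.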
> -----Original Message-----
> From: Tom Lane
>
> Attached are graphs from more thorough runs of pgbench with a commit
> delay that occurs only when at least N other backends are running
> active transactions.
>
> My initial try at this proved to be too noisy to tell much.  The noise
> seems to be coming from WAL checkpoints that occur during a run and
> push down the reported TPS value for the particular case that's
> running.  While we'd need to include WAL checkpoints to make an honest
> performance comparison against another RDBMS, I think they are best
> ignored for the purpose of figuring out what the commit-delay behavior
> ought to be.  Accordingly, I modified my test script to minimize the
> occurrence of checkpoint activity during runs (see attached script).
> There are still some data points that are unexpectedly low compared to
> their neighbors; presumably these were affected by checkpoints or
> other system activity.
>
> It's not entirely clear what set of parameters is best, but it is
> absolutely clear that a flat zero-commit-delay policy is NOT best.
>
> The test conditions are postmaster options -N 100 -B 1024, pgbench
> scale factor 10, pgbench -t (transactions per client) 100.  (Hence the
> results for a single client rely on only 100 transactions, and are
> pretty noisy.  The noise level should decrease as the number of
> clients increases.)
>
> Comments anyone?

How about the case with scaling factor 1?  That is, could your proposal
detect lock conflicts in reality?  If so, I agree with your proposal.

BTW, there seems to be a misunderstanding about CommitDelay: the delay
is a complete waste of time only when no other commit overlaps it.  And
if other backends can use the delay's CPU cycles, the delay is never a
total waste.

Regards,
Hiroshi Inoue

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> How about the case with scaling factor 1?  That is, could your
> proposal detect lock conflicts in reality?

The code is set up to not count backends that are waiting on locks.
That is, to do a commit delay there must be at least N other backends
that are in transactions, have written at least one XLOG entry in their
transaction (so it's not a read-only xact and will need to write a
commit record), and are not waiting on a lock.  Is that what you meant?

> BTW, there seems to be a misunderstanding about CommitDelay: the delay
> is a complete waste of time only when no other commit overlaps it.
> And if other backends can use the delay's CPU cycles, the delay is
> never a total waste.

Good point.  In fact, if we measure only the total throughput in
transactions per second, the commit delay will not appear to be hurting
performance no matter how long it is, so long as other backends are in
the RUN state for the whole delay.  This suggests that pgbench should
also measure the average transaction time seen by any one client.  Is
that a simple change?

			regards, tom lane
Philip Warner <pjw@rhyme.com.au> writes:
> At 00:42 25/02/01 -0800, Nathan Myers wrote:
>> The only really bad performers were (0), (10k,1), (100k,20).  The best
>> were (30k,1) and (30k,10), although (30k,5) also did well except at 40.
>> Why would 30k be a magic delay, regardless of siblings?  What happened
>> at 40?

> I had assumed that 40 was one of the glitches - it would be good if Tom
> (or someone else) could rerun the suite, to see if we see the same dip.

Yes, I assumed the same.  I posted the script; could someone else make
the same run?  We really need more than one test case ;-)

> I agree that 30k looks like the magic delay, and probably 30k/5 would
> be a good conservative choice.  But now that I think about the choice
> of number, it must vary with the speed of the machine and the length
> of the transactions; at 20 tps, each TX is completing in around 50 ms.

Yes, I think so too.  This machine is able to do about 40 pgbench
tr/sec single-client with fsync off, so the computational load is right
about 25 msec per transaction.  That's presumably why 30 msec looks
like a good delay number.

What interested me was that there doesn't seem to be a very sharp peak;
anything from 10 to 100 msec yields fairly comparable results.  This is
a good thing ... if there *were* a sharp peak at the average xact
length, tuning the delay parameter would be an impossible task in
real-world cases where the transactions aren't all alike.

On the data so far, I'm inclined to go with 10k/5 as the default, so as
not to risk wasting time with overly long delays on machines that are
faster than this one.  But we really need some data from other machines
before deciding.  It'd be nice to see some results with <10k delays
too, from a machine where the kernel supports better-than-10-msec delay
resolution.  Where's the Alpha contingent??

			regards, tom lane
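The gate Tom describes in his reply to Hiroshi above (count other
backends that are in a transaction, have written XLOG, and are not
waiting on a lock) can be sketched in C as follows.  The struct
definitions are stand-ins: the real shared PROC array, its locking, and
the field layout differ, so treat this as an illustration of the rule
rather than the committed code.

#include <stddef.h>

/* Stand-ins for the per-backend shared state discussed in the thread. */
typedef struct XLogRecPtr
{
	unsigned	xlogid;
	unsigned	xrecoff;
} XLogRecPtr;

typedef struct PROC
{
	XLogRecPtr	logRec;			/* zero until the xact writes its first
								 * XLOG record */
	void	   *waitLock;		/* non-NULL while blocked on a lock */
} PROC;

/*
 * Count the other backends that might plausibly commit soon: they are in
 * a transaction that has written at least one XLOG record (so a commit
 * record is coming) and they are not blocked waiting for a lock.
 */
static int
CountCommitSiblings(const PROC *procs, int nprocs, const PROC *me)
{
	int			count = 0;
	int			i;

	for (i = 0; i < nprocs; i++)
	{
		const PROC *p = &procs[i];

		if (p == me)
			continue;			/* do not count ourselves */
		if (p->logRec.xlogid == 0 && p->logRec.xrecoff == 0)
			continue;			/* idle or read-only transaction */
		if (p->waitLock != NULL)
			continue;			/* blocked; cannot commit until unblocked */
		count++;
	}
	return count;
}

The commit path would then sleep for commit_delay microseconds only when
this count reaches commit_siblings.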
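Tom's aside about delay resolution is easy to check.  The stand-alone
probe below (illustrative, not from the tree) requests a 1 ms select()
timeout repeatedly and reports the mean actual sleep; on a kernel with a
10 ms scheduler tick the answer comes back near 10-20 ms, which is why
commit_delay values below 10000 may all behave alike on such machines.

#include <stdio.h>
#include <sys/select.h>
#include <sys/time.h>

int
main(void)
{
	struct timeval start,
				end,
				delay;
	double		total_us = 0.0;
	int			i;
	const int	trials = 100;

	for (i = 0; i < trials; i++)
	{
		delay.tv_sec = 0;
		delay.tv_usec = 1000;	/* ask for a 1 ms sleep */
		gettimeofday(&start, NULL);
		select(0, NULL, NULL, NULL, &delay);
		gettimeofday(&end, NULL);
		total_us += (end.tv_sec - start.tv_sec) * 1e6 +
			(end.tv_usec - start.tv_usec);
	}
	printf("mean sleep for a 1 ms request: %.0f usec\n", total_us / trials);
	return 0;
}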
ncm@zembu.com (Nathan Myers) writes:
> At low loads, it seems (100k,1) (brown +) did best by far, which seems
> very odd.  Even more odd, it did pretty well at very high loads but
> had problems at intermediate loads.

In theory, all these variants should behave exactly the same for a
single client, since there will be no commit delay in any of 'em in
that case.  I'm inclined to write off the aberrant result for 100k/1 as
due to outside factors --- maybe the WAL file happened to be located in
a particularly convenient place on the disk during that run, or some
such.  Since there's only 100 transactions in that test, it wouldn't
take much to affect the result.

Likewise, the places where one mid-load datapoint is well below either
neighbor are probably due to outside factors --- either a background
WAL checkpoint or other activity on the machine, mail arrival for
instance.  I left the machine alone during the test, but I didn't
bother to shut down the usual system services.

My feeling is that this test run tells us that zero commit delay is
inferior to nonzero under these test conditions, but there's too much
noise to pick out one of the nonzero-delay parameter combinations as
being clearly better than the rest.  (BTW, I did repeat the zero-delay
series just to be sure it wasn't itself an outlier...)

			regards, tom lane
Tom Lane wrote:
>
> Philip Warner <pjw@rhyme.com.au> writes:
> > At 00:42 25/02/01 -0800, Nathan Myers wrote:
> >> The only really bad performers were (0), (10k,1), (100k,20).  The
> >> best were (30k,1) and (30k,10), although (30k,5) also did well
> >> except at 40.  Why would 30k be a magic delay, regardless of
> >> siblings?  What happened at 40?
>
> > I had assumed that 40 was one of the glitches - it would be good if
> > Tom (or someone else) could rerun the suite, to see if we see the
> > same dip.
>
> Yes, I assumed the same.  I posted the script; could someone else make
> the same run?  We really need more than one test case ;-)
>

I could find the script but seem to have missed your change about
commit_siblings.  Where could I get it?

Regards,
Hiroshi Inoue
Hiroshi Inoue <Inoue@tpf.co.jp> writes:
>> Yes, I assumed the same.  I posted the script; could someone else
>> make the same run?  We really need more than one test case ;-)

> I could find the script but seem to have missed your change about
> commit_siblings.  Where could I get it?

Er ... duh ... I didn't commit it yet.  Well, it's harmless enough as
long as commit_delay defaults to 0, so I'll go ahead and commit.

			regards, tom lane
>> I could find the script but seem to have missed your change about
>> commit_siblings.  Where could I get it?

> Er ... duh ... I didn't commit it yet.  Well, it's harmless enough as
> long as commit_delay defaults to 0, so I'll go ahead and commit.

In CVS now.  However, it might be well to wait to run tests until we
tweak pgbench to measure the average elapsed time for a transaction.
As you pointed out earlier today, overall TPS is not the only figure of
merit we need to worry about.

			regards, tom lane
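Measuring per-transaction latency in pgbench amounts to timestamping
each transaction and averaging.  A C sketch of that bookkeeping follows;
the names are illustrative, and this is not the patch that actually went
into pgbench.

#include <stdio.h>
#include <sys/time.h>

typedef struct
{
	double		total_us;		/* summed transaction latency */
	long		count;			/* transactions measured */
} LatencyStats;

/* Record the start of a transaction. */
static void
xact_begin(struct timeval *start)
{
	gettimeofday(start, NULL);
}

/* Record the end of a transaction and accumulate its elapsed time. */
static void
xact_end(const struct timeval *start, LatencyStats *stats)
{
	struct timeval now;

	gettimeofday(&now, NULL);
	stats->total_us += (now.tv_sec - start->tv_sec) * 1e6 +
		(now.tv_usec - start->tv_usec);
	stats->count++;
}

/* Report the mean latency alongside the usual TPS figures. */
static void
report_latency(const LatencyStats *stats)
{
	if (stats->count > 0)
		printf("average latency = %.3f ms over %ld transactions\n",
			   stats->total_us / stats->count / 1000.0, stats->count);
}

With this, a long commit_delay that preserves total TPS but stretches
every client's wait would show up directly, which is exactly the effect
being discussed.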
> > Basically, I am not sure how much we lose by doing the delay after
> > returning COMMIT, and I know we gain quite a bit by enabling us to
> > group fsync calls.
>
> If included, this should be an option only, and not the default
> option.

Sure, it should never become the default, because the "D" in ACID
pretty much forbids this kind of behaviour...

--
Dominique