Thread: Marginal performance improvement for fast-path locking
While debugging the recent fastpath lock unpleasantness, I noticed that
the code tends to use only the last few entries in the fpRelId[] arrays,
which seemed a bit surprising.  The reason of course is the way that
FastPathGrantRelationLock() is written: it remembers the *last* unused
slot while scanning the array.  This ends up wasting cycles in
FastPathUnGrantRelationLock(), as well as other places where we search
for an existing entry, since they'll generally have to iterate to the end
of the array to find it.  We should prefer to put entries near the front
of the array, not the back.  (Of course, if the array is about full then
it's going to be a wash, but in simple transactions we might only have a
few relations with fast-path locks.)

We could add an extra test in FastPathGrantRelationLock's loop to make
it remember the first unused slot rather than the last one, but that
would add some cycles there, partially negating any benefit.  Instead
I propose that we reverse the direction of the search loop, as attached.

I doubt this will actually make any easily-measurable difference, so I've
not attempted to benchmark it, but in principle we should be able to save
a few loop iterations in places where we're holding locks.  Comments?

			regards, tom lane

diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index fde2042..c1b2b42 100644
*** a/src/backend/storage/lmgr/lock.c
--- b/src/backend/storage/lmgr/lock.c
*************** LockReassignOwner(LOCALLOCK *locallock,
*** 2407,2417 ****
  static bool
  FastPathGrantRelationLock(Oid relid, LOCKMODE lockmode)
  {
! 	uint32		f;
! 	uint32		unused_slot = FP_LOCK_SLOTS_PER_BACKEND;
  
! 	/* Scan for existing entry for this relid, remembering empty slot. */
! 	for (f = 0; f < FP_LOCK_SLOTS_PER_BACKEND; f++)
  	{
  		if (FAST_PATH_GET_BITS(MyProc, f) == 0)
  			unused_slot = f;
--- 2407,2421 ----
  static bool
  FastPathGrantRelationLock(Oid relid, LOCKMODE lockmode)
  {
! 	int			f;
! 	int			unused_slot = -1;
  
! 	/*
! 	 * Scan for existing entry for this relid, remembering first empty slot.
! 	 * We prefer to assign a low-numbered slot so that later search loops,
! 	 * which search the array forwards, will find entries more quickly.
! 	 */
! 	for (f = FP_LOCK_SLOTS_PER_BACKEND - 1; f >= 0; f--)
  	{
  		if (FAST_PATH_GET_BITS(MyProc, f) == 0)
  			unused_slot = f;
*************** FastPathGrantRelationLock(Oid relid, LOC
*** 2423,2430 ****
  		}
  	}
  
! 	/* If no existing entry, use any empty slot. */
! 	if (unused_slot < FP_LOCK_SLOTS_PER_BACKEND)
  	{
  		MyProc->fpRelId[unused_slot] = relid;
  		FAST_PATH_SET_LOCKMODE(MyProc, unused_slot, lockmode);
--- 2427,2434 ----
  		}
  	}
  
! 	/* If no existing entry, use first empty slot. */
! 	if (unused_slot >= 0)
  	{
  		MyProc->fpRelId[unused_slot] = relid;
  		FAST_PATH_SET_LOCKMODE(MyProc, unused_slot, lockmode);
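To make the trick in the patch concrete, here is a minimal standalone
sketch, not the actual lock.c code: N_SLOTS, the plain Oid array, and
grant_slot() are stand-ins for FP_LOCK_SLOTS_PER_BACKEND, PGPROC's
fpRelId[], and FastPathGrantRelationLock().  It shows why a backward
scan naturally leaves the *lowest*-numbered empty slot in unused_slot:

	#include <stdio.h>

	#define N_SLOTS 16			/* stand-in for FP_LOCK_SLOTS_PER_BACKEND */

	typedef unsigned int Oid;

	static Oid	slots[N_SLOTS]; /* 0 = empty; stand-in for fpRelId[] */

	/* Grant a fast-path slot to relid; returns the slot index, or -1. */
	static int
	grant_slot(Oid relid)
	{
		int			f;
		int			unused_slot = -1;

		/*
		 * Scan backwards: the last empty slot we visit is the
		 * lowest-numbered one, so it wins without any extra comparison
		 * in the loop body.
		 */
		for (f = N_SLOTS - 1; f >= 0; f--)
		{
			if (slots[f] == 0)
				unused_slot = f;
			else if (slots[f] == relid)
				return f;		/* already have an entry for this relid */
		}

		if (unused_slot >= 0)
			slots[unused_slot] = relid;
		return unused_slot;
	}

	int
	main(void)
	{
		int			a = grant_slot(100);
		int			b = grant_slot(200);
		int			c = grant_slot(300);

		printf("%d %d %d\n", a, b, c);	/* prints "0 1 2", not "15 14 13" */
		return 0;
	}

Successive grants fill slots from the front, so forward searches such as
the one in FastPathUnGrantRelationLock() terminate after a few iterations
when only a few locks are held.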
On Fri, Nov 29, 2013 at 12:16 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> While debugging the recent fastpath lock unpleasantness, I noticed that
> the code tends to use only the last few entries in the fpRelId[] arrays,
> which seemed a bit surprising.  The reason of course is the way that
> FastPathGrantRelationLock() is written: it remembers the *last* unused
> slot while scanning the array.  This ends up wasting cycles in
> FastPathUnGrantRelationLock(), as well as other places where we search
> for an existing entry, since they'll generally have to iterate to the end
> of the array to find it.  We should prefer to put entries near the front
> of the array, not the back.  (Of course, if the array is about full then
> it's going to be a wash, but in simple transactions we might only have a
> few relations with fast-path locks.)
>
> We could add an extra test in FastPathGrantRelationLock's loop to make
> it remember the first unused slot rather than the last one, but that
> would add some cycles there, partially negating any benefit.  Instead
> I propose that we reverse the direction of the search loop, as attached.

Nice idea, but wouldn't making an extra array just to hold the hot
entries be a better idea?  I agree the added code would be more complex,
but we could potentially reduce the lookup time drastically, since most
searches would only have to look at the smaller array.

Regards,

Atri

--
Regards,

Atri
l'apprenant
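For concreteness, a minimal sketch of what the two-array idea might look
like; every name here (hot[], full[], the hot-array size) is invented for
illustration, and nothing like this exists in lock.c.  Note that every
grant and ungrant would have to keep both arrays in sync, which is the
complexity objected to downthread:

	#include <stdio.h>

	typedef unsigned int Oid;

	#define N_HOT	4			/* hypothetical hot-array size */
	#define N_SLOTS 16

	static Oid	hot[N_HOT];		/* most recently granted relids */
	static Oid	full[N_SLOTS];	/* all fast-path relids; 0 = empty */

	/* Returns 1 if relid holds a fast-path lock, else 0. */
	static int
	holds_fastpath(Oid relid)
	{
		int			f;

		/* Hoped-for common case: hit in the small hot array. */
		for (f = 0; f < N_HOT; f++)
			if (hot[f] == relid)
				return 1;

		/* Fallback: scan the full array. */
		for (f = 0; f < N_SLOTS; f++)
			if (full[f] == relid)
				return 1;

		return 0;
	}

	int
	main(void)
	{
		/* A grant would have to update both arrays consistently. */
		full[7] = 42;
		hot[0] = 42;

		printf("%d %d\n", holds_fastpath(42), holds_fastpath(99)); /* 1 0 */
		return 0;
	}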
Atri Sharma <atri.jiit@gmail.com> writes:
> On Fri, Nov 29, 2013 at 12:16 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> We could add an extra test in FastPathGrantRelationLock's loop to make
>> it remember the first unused slot rather than the last one, but that
>> would add some cycles there, partially negating any benefit.  Instead
>> I propose that we reverse the direction of the search loop, as attached.

> Nice idea, but wouldn't making an extra array just to hold the hot
> entries be a better idea?

Well, the array isn't so large that there's going to be value in
complicating the searches.  I did think about instituting a rule that
all valid entries must be consecutive at the front, but it's far from
clear that the extra logic needed to maintain that invariant would cost
less than what's saved.  The attraction of the patch I propose here is
that it's zero extra cost to get some savings.

One other thing we could do if we wanted to micro-optimize here would
be to fetch the fpLockBits value into a local register; the existing
coding most likely reads it out of the PGPROC again on every iteration.
You could further imagine coding the search loops like

	for (f = 0, bits = proc->fpLockBits; bits != 0; f++, bits >>= 3)
	{
		if ((bits & 7) != 0)
			do something with this slot;
	}

so that you'd fall out of the loop as soon as there were no later
occupied slots.  But, again, this is adding complexity and cycles for
hypothetical benefit, so it'd be nicer if we could measure some actual
speedup before doing that.

			regards, tom lane
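A self-contained sketch of that early-exit idea, with lock_bits and
N_SLOTS as illustrative stand-ins for MyProc->fpLockBits and
FP_LOCK_SLOTS_PER_BACKEND, assuming the actual 3-bits-per-slot encoding:

	#include <stdint.h>
	#include <stdio.h>

	#define N_SLOTS 16

	int
	main(void)
	{
		uint64_t	lock_bits = 0;	/* stand-in for MyProc->fpLockBits */
		uint64_t	bits;
		int			f;

		/* Mark slots 0 and 2 occupied (low bit of each 3-bit group). */
		lock_bits |= UINT64_C(1) << (3 * 0);
		lock_bits |= UINT64_C(1) << (3 * 2);

		/*
		 * Work on a local copy, shifting off one 3-bit group per
		 * iteration; once the copy hits zero, no higher-numbered slot
		 * is occupied and we fall out early (here after 3 iterations,
		 * not N_SLOTS).
		 */
		for (f = 0, bits = lock_bits; bits != 0; f++, bits >>= 3)
		{
			if ((bits & 7) != 0)
				printf("slot %d is occupied\n", f);
		}
		return 0;
	}

Besides the early exit, keeping the working value in a local also gives
the compiler license to hold it in a register rather than re-reading the
PGPROC field each time around the loop.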
On Thu, Nov 28, 2013 at 1:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> While debugging the recent fastpath lock unpleasantness, I noticed that
> the code tends to use only the last few entries in the fpRelId[] arrays,
> which seemed a bit surprising.  The reason of course is the way that
> FastPathGrantRelationLock() is written: it remembers the *last* unused
> slot while scanning the array.  This ends up wasting cycles in
> FastPathUnGrantRelationLock(), as well as other places where we search
> for an existing entry, since they'll generally have to iterate to the end
> of the array to find it.  We should prefer to put entries near the front
> of the array, not the back.  (Of course, if the array is about full then
> it's going to be a wash, but in simple transactions we might only have a
> few relations with fast-path locks.)
>
> We could add an extra test in FastPathGrantRelationLock's loop to make
> it remember the first unused slot rather than the last one, but that
> would add some cycles there, partially negating any benefit.  Instead
> I propose that we reverse the direction of the search loop, as attached.
>
> I doubt this will actually make any easily-measurable difference, so I've
> not attempted to benchmark it, but in principle we should be able to save
> a few loop iterations in places where we're holding locks.  Comments?

Well, the reason why the array is only 64 bytes in size is to make
sure that searching the whole thing is really fast.  We figure we're
going to have to do that often, so it needs to be cheap.  If it's not,
we're hosed already, I think.  Moreover, the whole point of the
backendLock is that holding your own costs very little, because other
backends should acquire it only rarely.  Shaving cycles off code that
runs while holding the partition locks is doubtless valuable; shaving
them here is much less so.

But I don't think it'll hurt anything.  If you can prove that it
actually helps something, I'll buy you a glass of wine.  :-)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
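As a sanity check of the 64-byte figure, assuming a 4-byte Oid and the
historical FP_LOCK_SLOTS_PER_BACKEND of 16, the whole array fits in one
typical cache line:

	#include <stdio.h>

	typedef unsigned int Oid;	/* 4 bytes on common platforms */

	#define FP_LOCK_SLOTS_PER_BACKEND 16

	int
	main(void)
	{
		Oid			fpRelId[FP_LOCK_SLOTS_PER_BACKEND];

		/* 16 slots * 4 bytes = 64 bytes: one typical cache line. */
		printf("fpRelId[] is %zu bytes\n", sizeof(fpRelId));
		return 0;
	}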
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Nov 28, 2013 at 1:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> We could add an extra test in FastPathGrantRelationLock's loop to make
>> it remember the first unused slot rather than the last one, but that
>> would add some cycles there, partially negating any benefit.  Instead
>> I propose that we reverse the direction of the search loop, as attached.

> Well, the reason why the array is only 64 bytes in size is to make
> sure that searching the whole thing is really fast.  We figure we're
> going to have to do that often, so it needs to be cheap.  If it's not,
> we're hosed already, I think.

I actually suspect the bitmask manipulations cost more than the touches
of fpRelId[].  I agree that there's no reason to think that this area
needs really tense micro-optimization, but if we can get some savings
for zero added cost/complexity, why not?

			regards, tom lane
On Thu, Nov 28, 2013 at 2:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I did think about instituting a rule that all valid entries must be
> consecutive at the front, but it's far from clear that the extra logic
> needed to maintain that invariant would cost less than what's saved.

FWIW, I considered that approach when initially developing the feature
and came to the same conclusion.  Now we could benchmark it...

> One other thing we could do if we wanted to micro-optimize here would
> be to fetch the fpLockBits value into a local register; the existing
> coding most likely reads it out of the PGPROC again on every iteration.
> You could further imagine coding the search loops like
>
> 	for (f = 0, bits = proc->fpLockBits; bits != 0; f++, bits >>= 3)
> 	{
> 		if ((bits & 7) != 0)
> 			do something with this slot;
> 	}
>
> so that you'd fall out of the loop as soon as there were no later
> occupied slots.

...and we could also benchmark this.  But I bet there are more fruitful
optimization targets elsewhere.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Nov 28, 2013 at 2:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Thu, Nov 28, 2013 at 1:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> We could add an extra test in FastPathGrantRelationLock's loop to make
>>> it remember the first unused slot rather than the last one, but that
>>> would add some cycles there, partially negating any benefit.  Instead
>>> I propose that we reverse the direction of the search loop, as attached.
>
>> Well, the reason why the array is only 64 bytes in size is to make
>> sure that searching the whole thing is really fast.  We figure we're
>> going to have to do that often, so it needs to be cheap.  If it's not,
>> we're hosed already, I think.
>
> I actually suspect the bitmask manipulations cost more than the touches
> of fpRelId[].  I agree that there's no reason to think that this area
> needs really tense micro-optimization, but if we can get some savings
> for zero added cost/complexity, why not?

Sure.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> But I don't think it'll hurt anything.  If you can prove that it
> actually helps something, I'll buy you a glass of wine.  :-)

FWIW, I did try benchmarking this using "pgbench -S -c 10 -M prepared".
If there's any difference, it's below the noise floor.  (I see about
0.2% or so run-to-run variation even without changing anything.)

			regards, tom lane
On Thu, Nov 28, 2013 at 3:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> But I don't think it'll hurt anything.  If you can prove that it
>> actually helps something, I'll buy you a glass of wine.  :-)
>
> FWIW, I did try benchmarking this using "pgbench -S -c 10 -M prepared".
> If there's any difference, it's below the noise floor.  (I see about
> 0.2% or so run-to-run variation even without changing anything.)

No wine for you.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company