Thread: Marginal performance improvement for fast-path locking

Marginal performance improvement for fast-path locking

From: Tom Lane
While debugging the recent fastpath lock unpleasantness, I noticed that
the code tends to use only the last few entries in the fpRelId[] arrays,
which seemed a bit surprising.  The reason of course is the way that
FastPathGrantRelationLock() is written: it remembers the *last* unused
slot while scanning the array.  This ends up wasting cycles in
FastPathUnGrantRelationLock(), as well as other places where we search
for an existing entry, since they'll generally have to iterate to the end
of the array to find it.  We should prefer to put entries near the front
of the array, not the back.  (Of course, if the array is about full then
it's going to be a wash, but in simple transactions we might only have a
few relations with fast-path locks.)

We could add an extra test in FastPathGrantRelationLock's loop to make
it remember the first unused slot rather than the last one, but that
would add some cycles there, partially negating any benefit.  Instead
I propose that we reverse the direction of the search loop, as attached.

I doubt this will actually make any easily-measurable difference, so I've
not attempted to benchmark it, but in principle we should be able to save
a few loop iterations in places where we're holding locks.  Comments?

            regards, tom lane

diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index fde2042..c1b2b42 100644
*** a/src/backend/storage/lmgr/lock.c
--- b/src/backend/storage/lmgr/lock.c
*************** LockReassignOwner(LOCALLOCK *locallock,
*** 2407,2417 ****
  static bool
  FastPathGrantRelationLock(Oid relid, LOCKMODE lockmode)
  {
!     uint32        f;
!     uint32        unused_slot = FP_LOCK_SLOTS_PER_BACKEND;

!     /* Scan for existing entry for this relid, remembering empty slot. */
!     for (f = 0; f < FP_LOCK_SLOTS_PER_BACKEND; f++)
      {
          if (FAST_PATH_GET_BITS(MyProc, f) == 0)
              unused_slot = f;
--- 2407,2421 ----
  static bool
  FastPathGrantRelationLock(Oid relid, LOCKMODE lockmode)
  {
!     int            f;
!     int            unused_slot = -1;

!     /*
!      * Scan for existing entry for this relid, remembering first empty slot.
!      * We prefer to assign a low-numbered slot so that later search loops,
!      * which search the array forwards, will find entries more quickly.
!      */
!     for (f = FP_LOCK_SLOTS_PER_BACKEND - 1; f >= 0; f--)
      {
          if (FAST_PATH_GET_BITS(MyProc, f) == 0)
              unused_slot = f;
*************** FastPathGrantRelationLock(Oid relid, LOC
*** 2423,2430 ****
          }
      }

!     /* If no existing entry, use any empty slot. */
!     if (unused_slot < FP_LOCK_SLOTS_PER_BACKEND)
      {
          MyProc->fpRelId[unused_slot] = relid;
          FAST_PATH_SET_LOCKMODE(MyProc, unused_slot, lockmode);
--- 2427,2434 ----
          }
      }

!     /* If no existing entry, use first empty slot. */
!     if (unused_slot >= 0)
      {
          MyProc->fpRelId[unused_slot] = relid;
          FAST_PATH_SET_LOCKMODE(MyProc, unused_slot, lockmode);

Re: Marginal performance improvement for fast-path locking

From: Atri Sharma
On Fri, Nov 29, 2013 at 12:16 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> While debugging the recent fastpath lock unpleasantness, I noticed that
> the code tends to use only the last few entries in the fpRelId[] arrays,
> which seemed a bit surprising.  The reason of course is the way that
> FastPathGrantRelationLock() is written: it remembers the *last* unused
> slot while scanning the array.  This ends up wasting cycles in
> FastPathUnGrantRelationLock(), as well as other places where we search
> for an existing entry, since they'll generally have to iterate to the end
> of the array to find it.  We should prefer to put entries near the front
> of the array, not the back.  (Of course, if the array is about full then
> it's going to be a wash, but in simple transactions we might only have a
> few relations with fast-path locks.)
>
> We could add an extra test in FastPathGrantRelationLock's loop to make
> it remember the first unused slot rather than the last one, but that
> would add some cycles there, partially negating any benefit.  Instead
> I propose that we reverse the direction of the search loop, as attached.

Nice idea, but wouldn't making an extra array just to hold the hot
entries be a better idea?  I agree the added code would be more complex,
but we could potentially reduce the lookup time drastically, since in
most cases only the smaller array would need to be searched.
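
Roughly the kind of structure I have in mind -- just a sketch, with a
made-up FastPathHotCache struct, helper name, and hot-array size, layered
over the existing fpRelId[] array and FAST_PATH_GET_BITS macro:

    /*
     * Sketch only: a small "hot" cache searched before the full fpRelId[]
     * array.  FP_HOT_SLOTS, the struct, and the field names are invented
     * for illustration; the fallback loop uses the existing array and macro.
     */
    #define FP_HOT_SLOTS 4

    typedef struct FastPathHotCache
    {
        Oid     hotRelId[FP_HOT_SLOTS]; /* recently used relation OIDs */
        int     hotSlot[FP_HOT_SLOTS];  /* their slots in fpRelId[] */
        int     nhot;                   /* number of valid hot entries */
    } FastPathHotCache;

    static int
    fast_path_hot_lookup(FastPathHotCache *cache, PGPROC *proc, Oid relid)
    {
        int     i;

        /* Check the small hot cache first. */
        for (i = 0; i < cache->nhot; i++)
        {
            if (cache->hotRelId[i] == relid)
                return cache->hotSlot[i];
        }

        /* Fall back to scanning the full array. */
        for (i = 0; i < FP_LOCK_SLOTS_PER_BACKEND; i++)
        {
            if (FAST_PATH_GET_BITS(proc, i) != 0 && proc->fpRelId[i] == relid)
                return i;
        }
        return -1;                      /* no fast-path entry */
    }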

Regards,

Atri


-- 
Regards,

Atri
l'apprenant



Re: Marginal performance improvement for fast-path locking

From: Tom Lane
Atri Sharma <atri.jiit@gmail.com> writes:
> On Fri, Nov 29, 2013 at 12:16 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> We could add an extra test in FastPathGrantRelationLock's loop to make
>> it remember the first unused slot rather than the last one, but that
>> would add some cycles there, partially negating any benefit.  Instead
>> I propose that we reverse the direction of the search loop, as attached.

> Nice idea, but wouldn't making an extra array just to hold the hot
> entries be a better idea?

Well, the array isn't so large that there's going to be value in
complicating the searches.

I did think about instituting a rule that all valid entries must be
consecutive at the front, but it's far from clear that the extra logic
needed to maintain that invariant would cost less than what's saved.
The attraction of the patch I propose here is that it's zero extra
cost to get some savings.
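
Just to illustrate the kind of extra logic that invariant would require
on the release path -- an untested sketch, where fast_path_move_bits()
and fast_path_clear_bits() are hypothetical helpers standing in for the
fpLockBits twiddling, and a real version would also have to cope with
slots holding more than one lock mode:

    /*
     * Untested sketch: keep occupied fast-path slots packed at the front
     * by moving the highest occupied slot into the hole left by a
     * released entry.
     */
    static void
    FastPathCompactAfterRelease(PGPROC *proc, int freed_slot)
    {
        int     last = -1;
        int     f;

        /* Find the highest slot that is still occupied. */
        for (f = FP_LOCK_SLOTS_PER_BACKEND - 1; f >= 0; f--)
        {
            if (FAST_PATH_GET_BITS(proc, f) != 0)
            {
                last = f;
                break;
            }
        }

        if (last > freed_slot)
        {
            /* Move the last occupied entry into the hole. */
            proc->fpRelId[freed_slot] = proc->fpRelId[last];
            fast_path_move_bits(proc, freed_slot, last);
            fast_path_clear_bits(proc, last);
        }
        else if (last == freed_slot)
            fast_path_clear_bits(proc, freed_slot);
    }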

One other thing we could do if we wanted to micro-optimize here would
be to fetch the fpLockBits value into a local register; the existing
coding most likely reads it out of the PGPROC again on every iteration.
You could further imagine coding the search loops like
    for (f = 0, bits = proc->fpLockBits; bits != 0; f++, bits >>= 3)
    {
        if ((bits & 7) != 0) do something with this slot;
    }

so that you'd fall out of the loop as soon as there were no later
occupied slots.  But, again, this is adding complexity and cycles
for hypothetical benefit, so it'd be nicer if we could measure
some actual speedup before doing that.
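
Spelled out a bit more fully, the early-exit idea might look about like
this -- still just a sketch, assuming fpLockBits is a 64-bit field
holding 3 bits per slot as in the loop above:

    static int
    fast_path_find_relid(PGPROC *proc, Oid relid)
    {
        uint64      bits = proc->fpLockBits;
        int         f;

        for (f = 0; bits != 0; f++, bits >>= 3)
        {
            /* Occupied slot whose relid matches? */
            if ((bits & 7) != 0 && proc->fpRelId[f] == relid)
                return f;
        }
        return -1;              /* no later occupied slots, give up */
    }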
        regards, tom lane



Re: Marginal performance improvement for fast-path locking

From: Robert Haas
On Thu, Nov 28, 2013 at 1:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> While debugging the recent fastpath lock unpleasantness, I noticed that
> the code tends to use only the last few entries in the fpRelId[] arrays,
> which seemed a bit surprising.  The reason of course is the way that
> FastPathGrantRelationLock() is written: it remembers the *last* unused
> slot while scanning the array.  This ends up wasting cycles in
> FastPathUnGrantRelationLock(), as well as other places where we search
> for an existing entry, since they'll generally have to iterate to the end
> of the array to find it.  We should prefer to put entries near the front
> of the array, not the back.  (Of course, if the array is about full then
> it's going to be a wash, but in simple transactions we might only have a
> few relations with fast-path locks.)
>
> We could add an extra test in FastPathGrantRelationLock's loop to make
> it remember the first unused slot rather than the last one, but that
> would add some cycles there, partially negating any benefit.  Instead
> I propose that we reverse the direction of the search loop, as attached.
>
> I doubt this will actually make any easily-measurable difference, so I've
> not attempted to benchmark it, but in principle we should be able to save
> a few loop iterations in places where we're holding locks.  Comments?

Well, the reason why the array is only 64 bytes in size is to make
sure that searching the whole thing is really fast.  We figure we're
going to have to do that often, so it needs to be cheap.  If it's not,
we're hosed already, I think.

Moreover, the whole point of the backendLock is that holding your own
costs very little, because other backends should acquire it only
rarely.  Shaving cycles off code that runs while holding the partition
locks is doubtless valuable; shaving them here is much less so.

But I don't think it'll hurt anything.  If you can prove that it
actually helps something, I'll buy you a glass of wine.  :-)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Marginal performance improvement for fast-path locking

From: Tom Lane
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Nov 28, 2013 at 1:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> We could add an extra test in FastPathGrantRelationLock's loop to make
>> it remember the first unused slot rather than the last one, but that
>> would add some cycles there, partially negating any benefit.  Instead
>> I propose that we reverse the direction of the search loop, as attached.

> Well, the reason why the array is only 64 bytes in size is to make
> sure that searching the whole thing is really fast.  We figure we're
> going to have to do that often, so it needs to be cheap.  If it's not,
> we're hosed already, I think.

I actually suspect the bitmask manipulations cost more than the touches
of fpRelId[].  I agree that there's no reason to think that this area
needs really tense micro-optimization, but if we can get some savings for
zero added cost/complexity, why not?
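
To be concrete about what I mean by "bitmask manipulations": the
per-slot extraction amounts to a shift and a mask along these lines -- a
simplified restatement of the 3-bits-per-slot layout, not the actual
macro text in lock.c:

    #define FP_BITS_PER_SLOT    3
    #define FP_SLOT_MASK        ((uint64) 7)

    #define FP_GET_SLOT_BITS(proc, n) \
        (((proc)->fpLockBits >> (FP_BITS_PER_SLOT * (n))) & FP_SLOT_MASK)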
        regards, tom lane



Re: Marginal performance improvement for fast-path locking

From: Robert Haas
On Thu, Nov 28, 2013 at 2:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I did think about instituting a rule that all valid entries must be
> consecutive at the front, but it's far from clear that the extra logic
> needed to maintain that invariant would cost less than what's saved.

FWIW, I considered that approach when initially developing the feature
and came to the same conclusion.  Now we could benchmark it...

> One other thing we could do if we wanted to micro-optimize here would
> be to fetch the fpLockBits value into a local register; the existing
> coding most likely reads it out of the PGPROC again on every iteration.
> You could further imagine coding the search loops like
>
>     for (f = 0, bits = proc->fpLockBits; bits != 0; f++, bits >>= 3)
>     {
>         if ((bits & 7) != 0) do something with this slot;
>     }
>
> so that you'd fall out of the loop as soon as there were no later
> occupied slots.

…and we could also benchmark this.  But I bet there are more fruitful
optimization targets elsewhere.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Marginal performance improvement for fast-path locking

From: Robert Haas
On Thu, Nov 28, 2013 at 2:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Thu, Nov 28, 2013 at 1:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> We could add an extra test in FastPathGrantRelationLock's loop to make
>>> it remember the first unused slot rather than the last one, but that
>>> would add some cycles there, partially negating any benefit.  Instead
>>> I propose that we reverse the direction of the search loop, as attached.
>
>> Well, the reason why the array is only 64 bytes in size is to make
>> sure that searching the whole thing is really fast.  We figure we're
>> going to have to do that often, so it needs to be cheap.  If it's not,
>> we're hosed already, I think.
>
> I actually suspect the bitmask manipulations cost more than the touches
> of fpRelId[].  I agree that there's no reason to think that this area
> needs really tense micro-optimization, but if we can get some savings for
> zero added cost/complexity, why not?

Sure.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Marginal performance improvement for fast-path locking

From: Tom Lane
Robert Haas <robertmhaas@gmail.com> writes:
> But I don't think it'll hurt anything.  If you can prove that it
> actually helps something, I'll buy you a glass of wine.  :-)

FWIW, I did try benchmarking this using "pgbench -S -c 10 -M prepared".
If there's any difference, it's below the noise floor.  (I see about
0.2% or so run-to-run variation even without changing anything.)
        regards, tom lane



Re: Marginal performance improvement for fast-path locking

From: Robert Haas
On Thu, Nov 28, 2013 at 3:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> But I don't think it'll hurt anything.  If you can prove that it
>> actually helps something, I'll buy you a glass of wine.  :-)
>
> FWIW, I did try benchmarking this using "pgbench -S -c 10 -M prepared".
> If there's any difference, it's below the noise floor.  (I see about
> 0.2% or so run-to-run variation even without changing anything.)

No wine for you.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company