Thread: Reducing contention for the LockMgrLock
We've suspected for a while that once we'd fixed the buffer manager's use of a single global BufMgrLock, the next contention hotspot would be the lock manager's LockMgrLock. I've now seen actual evidence of that in profiling pgbench: using a modified backend that counts LWLock-related wait operations, the LockMgrLock is responsible for an order of magnitude more blockages than the next highest LWLock:

PID 12971 lwlock LockMgrLock: shacq 0 exacq 50630 blk 3354
PID 12979 lwlock LockMgrLock: shacq 0 exacq 49706 blk 3323
PID 12976 lwlock LockMgrLock: shacq 0 exacq 50567 blk 3304
PID 12962 lwlock LockMgrLock: shacq 0 exacq 50635 blk 3278
PID 12974 lwlock LockMgrLock: shacq 0 exacq 50599 blk 3251
PID 12972 lwlock LockMgrLock: shacq 0 exacq 50204 blk 3243
PID 12973 lwlock LockMgrLock: shacq 0 exacq 50321 blk 3200
PID 12978 lwlock LockMgrLock: shacq 0 exacq 50266 blk 3177
PID 12977 lwlock LockMgrLock: shacq 0 exacq 50379 blk 3148
PID 12975 lwlock LockMgrLock: shacq 0 exacq 49790 blk 3124
PID 12971 lwlock WALInsertLock: shacq 0 exacq 24022 blk 408
PID 12972 lwlock WALInsertLock: shacq 0 exacq 24021 blk 393
PID 12976 lwlock WALInsertLock: shacq 0 exacq 24017 blk 390
PID 12977 lwlock WALInsertLock: shacq 0 exacq 24021 blk 388
PID 12973 lwlock WALInsertLock: shacq 0 exacq 24018 blk 379
PID 12962 lwlock WALInsertLock: shacq 0 exacq 24024 blk 377
PID 12974 lwlock WALInsertLock: shacq 0 exacq 24016 blk 367
PID 12975 lwlock WALInsertLock: shacq 0 exacq 24021 blk 366
PID 12978 lwlock WALInsertLock: shacq 0 exacq 24023 blk 354
PID 12979 lwlock WALInsertLock: shacq 0 exacq 24033 blk 321
PID 12973 lwlock ProcArrayLock: shacq 45214 exacq 6003 blk 241
PID 12971 lwlock ProcArrayLock: shacq 45355 exacq 6003 blk 225
(etc)

We had also seen evidence to this effect from OSDL:
http://archives.postgresql.org/pgsql-patches/2003-12/msg00365.php

So it seems it's time to start thinking about how to reduce contention for the LockMgrLock. There are no interesting read-only operations on the shared lock table, so there doesn't seem to be any traction to be gained by making some operations take just shared access to the LockMgrLock.

The best idea I've come up with after a bit of thought is to replace the shared lock table with N independent tables representing partitions of the lock space. Each lock would be assigned to one of these partitions based on, say, a hash of its LOCKTAG. I'm envisioning N of 16 or so to achieve (hopefully) about an order-of-magnitude reduction of contention. There would be a separate LWLock guarding each partition; the LWLock for a given partition would be considered to protect the LOCK objects assigned to that partition, all the PROCLOCK objects associated with each such LOCK, and the shared-memory hash tables holding these objects (each partition would need its own hash tables). A PGPROC's lock-related fields are only interesting when it is waiting for a lock, so we could say that the LWLock for the partition containing the lock it is waiting for must be held to examine/change these fields.

The per-PGPROC list of all PROCLOCKs belonging to that PGPROC is a bit tricky to handle since it necessarily spans partitions. We might be able to deal with this with suitable rules about when the list can be touched, but I've not worked this out in detail. Another possibility is to break this list apart into N lists, one per partition, but that would bloat the PGPROC array a bit, especially if we wanted larger N.
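As a rough sketch of the partition-selection idea just described (the struct, the hash, and all names below are placeholders, not actual PostgreSQL source): fold the lock tag's fields into a hash code and use it to pick one of N partitions, each of which owns an LWLock plus its own LOCK/PROCLOCK hash tables.

/*
 * Sketch only: map a lock tag to one of N partitions.  "LockTag" is a
 * stand-in for the real LOCKTAG, and the hash is an arbitrary placeholder.
 */
#include <stdint.h>
#include <stdio.h>

#define NUM_LOCK_PARTITIONS 16		/* the proposed N */

typedef struct LockTag
{
	uint32_t	dbOid;
	uint32_t	relOid;
} LockTag;

static int
LockTagToPartition(const LockTag *tag)
{
	uint32_t	h = tag->dbOid * 2654435761u ^ tag->relOid * 40503u;

	return (int) (h % NUM_LOCK_PARTITIONS);
}

int
main(void)
{
	LockTag		tag = {1, 16384};

	/* a backend touching this lock would take only this partition's LWLock */
	printf("partition = %d\n", LockTagToPartition(&tag));
	return 0;
}

Keeping N a power of two would let the final step be a cheap bit-mask rather than a modulo.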
The basic LockAcquire and LockRelease operations would only need to acquire the LWLock for the partition containing the lock they are interested in; this is what gives us the contention reduction. LockReleaseAll is also interesting from a performance point of view, since it executes at every transaction exit. If we divide PGPROC's PROCLOCK list into N lists then it will be very easy for LockReleaseAll to take only the partition locks it actually needs to release these locks; if not, we might have to resort to scanning the list N times, once while we hold the LWLock for each partition.

I think that CheckDeadLock will probably require taking all the partition LWLocks (as long as it does this in a predetermined order there is no risk of deadlock on the partition LWLocks). But one hopes this is not a performance-critical operation. Ditto for GetLockStatusData.

One objection I can see to this idea is that having N lock hash tables instead of one will eat a larger amount of shared memory in hashtable overhead. But the lock hashtables are fairly small relative to the shared buffer array (given typical configuration parameters) so this doesn't seem like a major problem.

Another objection is that LockReleaseAll will get slower (since it will certainly call LWLockAcquire/Release more times) and in situations that aren't heavily concurrent there won't be any compensating gain. I think this won't be a significant effect, but there's probably no way to tell for sure without actually writing the code and testing it.

While at it, I'm inclined to get rid of the current assumption that there are logically separate hash tables for different LockMethodIds. AFAICS all that's doing for us is creating a level of confusion; there's nothing on the horizon suggesting we'd ever actually make use of the flexibility.

Thoughts, better ideas?

regards, tom lane
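To illustrate the "predetermined order" remark for CheckDeadLock and GetLockStatusData, here is a hedged sketch that reuses the LWLockAcquire/LWLockRelease calls visible in the instrumentation patch later in this thread; firstPartitionLock and NUM_LOCK_PARTITIONS are assumed stand-ins, not committed definitions. Because every caller ascends through the partitions in the same order, two backends doing this concurrently can block each other but never deadlock on the partition LWLocks.

/* Sketch only: take every partition LWLock in a fixed ascending order.
 * "firstPartitionLock" is an assumed base lock id, not a real constant. */
#include "storage/lwlock.h"

#define NUM_LOCK_PARTITIONS 16

static void
AcquireAllPartitionLocks(LWLockId firstPartitionLock)
{
	int			i;

	for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
		LWLockAcquire(firstPartitionLock + i, LW_EXCLUSIVE);
}

static void
ReleaseAllPartitionLocks(LWLockId firstPartitionLock)
{
	int			i;

	/* release order doesn't matter for correctness; reverse is conventional */
	for (i = NUM_LOCK_PARTITIONS - 1; i >= 0; i--)
		LWLockRelease(firstPartitionLock + i);
}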
Tom,
This would also explain some things we've seen during benchmarking here at EnterpriseDB. I like your idea and, as I'm on my way out, will think about it a bit tonight.
Similarly, I don't see any forward-looking reason for keeping the separate hash tables used for the LockMethodIds. Or it may just be that I haven't looked closely enough at what the differences are.
-Jonah
On Wed, 2005-12-07 at 16:59 -0500, Tom Lane wrote:
> I've now seen actual evidence of that in
> profiling pgbench: using a modified backend that counts LWLock-related
> wait operations,
> So it seems it's time to start thinking about how to reduce contention
> for the LockMgrLock

You're right to be following up this thought.

My concern, longer term, is our ability to determine contention issues in an agreed way. I've long been thinking about wait-time measurement - I think it's the only way to proceed. There's always a next bottleneck, so I'd like to first agree the diagnostic probes so we can decide how to determine that. That way we can all work on solutions for various workloads, and prove that they work, in those cases.

My view would be that the LockMgrLock is not relevant for all workloads, but I want even more to be able to discuss whether it is, or is not, on an accepted basis before discussions begin.

Best Regards,
Simon Riggs
Tom Lane wrote:

Interesting proposal.

> LockReleaseAll is also interesting from a performance point of view,
> since it executes at every transaction exit. If we divide PGPROC's
> PROCLOCK list into N lists then it will be very easy for LockReleaseAll
> to take only the partition locks it actually needs to release these locks;
> if not, we might have to resort to scanning the list N times, once while
> we hold the LWLock for each partition.

On the other hand, each scan would be shorter than the previous one; and it's not necessary to hold each and every partition's LWLock, only the one found in the first entry of the list on each scan until the list is empty. So it's N scans only in the worst case of a PGPROC holding locks on all partitions.

> One objection I can see to this idea is that having N lock hash tables
> instead of one will eat a larger amount of shared memory in hashtable
> overhead. But the lock hashtables are fairly small relative to the
> shared buffer array (given typical configuration parameters) so this
> doesn't seem like a major problem.

Is hashtable overhead all that large? Each table could be made initially size-of-current-table/N entries. One problem is that currently the memory freed from a hashtable is not put back into shmem freespace, is it?

> While at it, I'm inclined to get rid of the current assumption that there
> are logically separate hash tables for different LockMethodIds. AFAICS all
> that's doing for us is creating a level of confusion; there's nothing on
> the horizon suggesting we'd ever actually make use of the flexibility.

Yeah, please.

-- 
Alvaro Herrera                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
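Alvaro's scan strategy, sketched in C for concreteness (placeholder types and stubbed lock calls, not PostgreSQL source): each outer pass locks only the partition holding the first remaining entry of the per-process list and strips out every entry belonging to that partition, so the list shrinks on every pass and only partitions actually represented in the list are ever locked.

#include <stddef.h>
#include <stdio.h>

typedef struct ProcLock			/* stand-in for PROCLOCK */
{
	int				partition;	/* partition this lock hashes into */
	struct ProcLock *next;		/* per-process list link */
} ProcLock;

/* Stubs standing in for acquiring/releasing a partition's LWLock and for
 * the real lock-release work; shown as prints/no-ops for illustration. */
static void PartitionLockAcquire(int p) { printf("acquire partition %d\n", p); }
static void PartitionLockRelease(int p) { printf("release partition %d\n", p); }
static void ReleaseOneLock(ProcLock *pl) { (void) pl; }

static void
LockReleaseAllSketch(ProcLock **listhead)
{
	/* Each outer pass locks exactly one partition: the one holding the
	 * first remaining entry.  Worst case is one pass per partition that
	 * actually appears in the list; best case is a single pass. */
	while (*listhead != NULL)
	{
		int			part = (*listhead)->partition;
		ProcLock  **link = listhead;

		PartitionLockAcquire(part);
		while (*link != NULL)
		{
			ProcLock   *pl = *link;

			if (pl->partition == part)
			{
				*link = pl->next;	/* unlink from the per-process list */
				ReleaseOneLock(pl);
			}
			else
				link = &pl->next;
		}
		PartitionLockRelease(part);
	}
}

int
main(void)
{
	ProcLock	c = {5, NULL}, b = {2, &c}, a = {5, &b};
	ProcLock   *head = &a;

	LockReleaseAllSketch(&head);	/* locks partitions 5 and 2 once each */
	return 0;
}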
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Is hashtable overhead all that large? Each table could be made
> initially size-of-current-table/N entries. One problem is that
> currently the memory freed from a hashtable is not put back into shmem
> freespace, is it?

Yeah; the problem is mainly that we'd have to allocate extra space to allow for unevenness of usage across the multiple hashtables. It's hard to judge how large the effect would be without testing, but I think that this problem would inhibit us from having dozens or hundreds of separate partitions.

A possible response is to try to improve dynahash.c to make its memory management more flexible, but I'd prefer not to get into that unless it becomes really necessary. A shared freespace pool would create a contention bottleneck of its own...

regards, tom lane
Simon Riggs <simon@2ndquadrant.com> writes:
> My view would be that the LockMgrLock is not relevant for all workloads,
> but I want even more to be able to discuss whether it is, or is not, on
> an accepted basis before discussions begin.

Certainly. I showed the evidence that it is currently a significant problem for pgbench-like workloads, but pgbench is of course not representative of everything.

My feeling about it is that different workloads are going to expose different weak spots, and so as long as a given test case isn't obviously artificial, whatever bottleneck it exposes is fair game to work on. pgbench seems reasonably representative of a class of applications with relatively short transactions, so I don't doubt that if pgbench has a problem with LockMgrLock contention, there are real-world cases out there that do too.

regards, tom lane
On Wed, 2005-12-07 at 22:46 -0500, Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > Is hashtable overhead all that large? Each table could be made
> > initially size-of-current-table/N entries. One problem is that
> > currently the memory freed from a hashtable is not put back into shmem
> > freespace, is it?
>
> Yeah; the problem is mainly that we'd have to allocate extra space to
> allow for unevenness of usage across the multiple hashtables. It's hard
> to judge how large the effect would be without testing, but I think that
> this problem would inhibit us from having dozens or hundreds of separate
> partitions.

The imbalance across partitions would be a major issue because of the difficulty of selecting a well-distributed partitioning key. If you use the LOCKTAG, then operations on the heaviest hit tables would go to the same partitions continually for LockRelation requests. The frequency of access per table usually drops off dramatically in rank order: look at TPC-B (pgbench) and TPC-C; my own experience would be that you seldom have as many even as 16 heavy hit tables. My guess would be that doing all of that would do little more than reduce contention to ~50%, and that this would show quickly diminishing returns for N > 4. Also, the more sharply defined your application profile, the worse this effect will be.

Having said that, I think this *is* the best way forward *if* we continue to accept the barrage of lock requests. So I've reopened the debate on the earlier thread:

[HACKERS] Reducing relation locking overhead

and am reviewing thoughts/requirements on that thread to avoid the necessity of altering the lock manager in this way.

pgbench is the right workload to expose this effect and measure worst case contention, so at least performance gains are easy to measure.

> A possible response is to try to improve dynahash.c to make its memory
> management more flexible, but I'd prefer not to get into that unless
> it becomes really necessary. A shared freespace pool would create a
> contention bottleneck of its own...

...but a less frequently accessed one.

Best Regards,
Simon Riggs
On Wed, 2005-12-07 at 22:53 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > My view would be that the LockMgrLock is not relevant for all workloads,
> > but I want even more to be able to discuss whether it is, or is not, on
> > an accepted basis before discussions begin.
>
> Certainly. I showed the evidence ...

The output you gave wasn't anything I recognize in the code. Assuming it's not already there, please can you share the code you are using to find the evidence, even if it's just privately in some form? You're looking at the number of spins to acquire each lock? Or some other measure of wait time on a lock?

I want to be in a position to run tests and then share the output with the project in an agreed form, then quickly move to action. You're right to put the burden of proof onto test results; I want to agree the measurements before we test.

Manfred's earlier patch provides very clear output for observing contention, including full summaries. Could we commit that, so we can all use this for analysis? Updated with the wait info.

Best Regards,
Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
> The output you gave wasn't anything I recognize in the code. Assuming
> it's not already there, please can you share the code you are using to find
> the evidence, even if it's just privately in some form?

See below. Also, the message I previously mentioned shows a different tack on the same theme:
http://archives.postgresql.org/pgsql-patches/2003-12/msg00365.php
although in the light of later events I think that keeping the counts in shared memory like that is a bad idea --- too likely to skew the results.

> You're looking at the number of spins to acquire each lock?

Number of semop waits.

> Manfred's earlier patch provides very clear output for observing
> contention, including full summaries. Could we commit that, so we can
> all use this for analysis? Updated with the wait info.

What patch would that be?

regards, tom lane

*** src/backend/storage/ipc/ipc.c.orig	Tue Nov 22 16:06:33 2005
--- src/backend/storage/ipc/ipc.c	Tue Nov 29 12:27:13 2005
***************
*** 125,130 ****
--- 125,132 ----
  {
  	elog(DEBUG3, "shmem_exit(%d)", code);
  
+ 	LWLockStats();
+ 
  	/*
  	 * call all the registered callbacks.
  	 *
*** src/backend/storage/lmgr/lwlock.c.orig	Tue Dec  6 18:08:33 2005
--- src/backend/storage/lmgr/lwlock.c	Tue Dec  6 18:16:05 2005
***************
*** 21,26 ****
--- 21,28 ----
   */
  #include "postgres.h"
  
+ #include <unistd.h>
+ 
  #include "access/clog.h"
  #include "access/multixact.h"
  #include "access/subtrans.h"
***************
*** 32,37 ****
--- 34,43 ----
  /* We use the ShmemLock spinlock to protect LWLockAssign */
  extern slock_t *ShmemLock;
  
+ static int	num_counts;
+ static int *sh_acquire_counts;
+ static int *ex_acquire_counts;
+ static int *block_counts;
  
  typedef struct LWLock
  {
***************
*** 209,214 ****
--- 215,226 ----
  	LWLockCounter = (int *) ((char *) LWLockArray - 2 * sizeof(int));
  	LWLockCounter[0] = (int) NumFixedLWLocks;
  	LWLockCounter[1] = numLocks;
+ 
+ 	/* local counter space */
+ 	num_counts = numLocks;
+ 	sh_acquire_counts = calloc(numLocks, sizeof(int));
+ 	ex_acquire_counts = calloc(numLocks, sizeof(int));
+ 	block_counts = calloc(numLocks, sizeof(int));
  }
***************
*** 257,262 ****
--- 269,278 ----
  	int			extraWaits = 0;
  
  	PRINT_LWDEBUG("LWLockAcquire", lockid, lock);
  
+ 	if (mode == LW_EXCLUSIVE)
+ 		ex_acquire_counts[lockid]++;
+ 	else
+ 		sh_acquire_counts[lockid]++;
  
  	/*
  	 * We can't wait if we haven't got a PGPROC.  This should only occur
***************
*** 328,333 ****
--- 344,351 ----
  		if (!mustwait)
  			break;				/* got the lock */
  
+ 		block_counts[lockid]++;
+ 
  		/*
  		 * Add myself to wait queue.
  		 *
***************
*** 598,601 ****
--- 616,640 ----
  		return true;
  	}
  	return false;
+ }
+ 
+ void
+ LWLockStats(void)
+ {
+ 	int			pid = getpid();
+ 	int			i;
+ 
+ 	LWLockAcquire(0, LW_EXCLUSIVE);
+ 
+ 	for (i = 0; i < num_counts; i++)
+ 	{
+ 		if (sh_acquire_counts[i] || ex_acquire_counts[i] || block_counts[i])
+ 		{
+ 			fprintf(stderr, "PID %d lwlock %d: shacq %u exacq %u blk %u\n",
+ 					pid, i, sh_acquire_counts[i], ex_acquire_counts[i],
+ 					block_counts[i]);
+ 		}
+ 	}
+ 
+ 	LWLockRelease(0);
  }
Simon Riggs <simon@2ndquadrant.com> writes:
> The imbalance across partitions would be a major issue because of the
> difficulty of selecting a well-distributed partitioning key. If you use
> the LOCKTAG, then operations on the heaviest hit tables would go to the
> same partitions continually for LockRelation requests. The frequency of
> access per table usually drops off dramatically in rank order: look at
> TPC-B (pgbench) and TPC-C; my own experience would be that you seldom
> have as many even as 16 heavy hit tables. My guess would be that doing
> all of that would do little more than reduce contention to ~50%, and
> that this would show quickly diminishing returns for N > 4. Also, the
> more sharply defined your application profile, the worse this effect
> will be.

That's a fair point, and reinforces my instinct that having a large number of partitions would be a losing game. But you are mistaken to think that the number of hot-spot tables is the only limit on the number of usable partitions. It's the number of their indexes that matters most. (The pgbench test is if anything probably understating the problem, because it has only a single index on each table.) In any case, even a factor of 2 or so reduction in direct conflicts should have a useful impact on the number of semop waits, because it's a nonlinear effect...

regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:
> That's a fair point, and reinforces my instinct that having a large
> number of partitions would be a losing game. But you are mistaken to
> think that the number of hot-spot tables is the only limit on the number
> of usable partitions. It's the number of their indexes that matters most.

Hm, so hypothetically an insert or update on a table with 4 indexes which have been split into 4 partitions would need to touch each partition? Would that defeat the benefits of the partitioning? Or enhance it?

Would it be better to ensure that the indexes of a single table ended up in the same partition? Or to ensure they're spread across partitions?

-- 
greg
Greg Stark <gsstark@mit.edu> writes:
> Hm, so hypothetically an insert or update on a table with 4 indexes which have
> been split into 4 partitions would need to touch each partition?

That would be the best case, actually, that each heavily-used lock ends up in a different partition. As Simon points out, we have no way to guarantee that.

> Would that defeat the benefits of the partitioning? Or enhance it?

It'd be what you'd want, because it would reduce the odds that two processes doing this concurrently would need to touch the same partition at the same time.

regards, tom lane
On Thu, 2005-12-08 at 10:26 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > The imbalance across partitions would be a major issue because of the
> > difficulty of selecting a well-distributed partitioning key. If you use
> > the LOCKTAG, then operations on the heaviest hit tables would go to the
> > same partitions continually for LockRelation requests. The frequency of
> > access per table usually drops off dramatically in rank order: look at
> > TPC-B (pgbench) and TPC-C; my own experience would be that you seldom
> > have as many even as 16 heavy hit tables. My guess would be that doing
> > all of that would do little more than reduce contention to ~50%, and
> > that this would show quickly diminishing returns for N > 4. Also, the
> > more sharply defined your application profile, the worse this effect
> > will be.
>
> That's a fair point, and reinforces my instinct that having a large
> number of partitions would be a losing game. But you are mistaken to
> think that the number of hot-spot tables is the only limit on the number
> of usable partitions. It's the number of their indexes that matters most.
> (The pgbench test is if anything probably understating the problem,
> because it has only a single index on each table.)

True. So either 16 partitions, or maybe 8, does sound about right then.

> In any case, even a
> factor of 2 or so reduction in direct conflicts should have a useful
> impact on the number of semop waits, because it's a nonlinear effect...

Nonlinear effects work both ways. Factor of 2 is great, but not enough to prevent this discussion becoming an annual event ;-) (Thinks: Oh, it already is...)

Best Regards,
Simon Riggs
On Thu, 2005-12-08 at 09:44 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > The output you gave wasn't anything I recognize in the code. Assuming
> > it's not already there, please can you share the code you are using to find
> > the evidence, even if it's just privately in some form?
>
> See below. Also, the message I previously mentioned shows a different
> tack on the same theme:
> http://archives.postgresql.org/pgsql-patches/2003-12/msg00365.php
> although in the light of later events I think that keeping the counts
> in shared memory like that is a bad idea --- too likely to skew the
> results.

Looks good, thanks. Like the shmem_exit solution. I'll use this in any testing, to allow easier discussion of results, and encourage all other testers to do the same.

> > You're looking at the number of spins to acquire each lock?
>
> Number of semop waits.

I wonder whether that is the thing to measure. That measure doesn't show how long each waiter waited. I would also want to see:
- queue length at wait time
- any variability over time

I was thinking of looking at the histogram of wait queue length frequency for the highly contested locks.

> > Manfred's earlier patch provides very clear output for observing
> > contention, including full summaries. Could we commit that, so we can
> > all use this for analysis? Updated with the wait info.
>
> What patch would that be?

Sorry, thought Manfred had written the earlier patch.

Best Regards,
Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
> On Thu, 2005-12-08 at 09:44 -0500, Tom Lane wrote:
>> Simon Riggs <simon@2ndquadrant.com> writes:
>>> You're looking at the number of spins to acquire each lock?
>>
>> Number of semop waits.
>
> I wonder whether that is the thing to measure. That measure doesn't show
> how long each waiter waited.

True, but what I am focusing on minimizing right now is the number of context swaps, and so number of semops seems an adequate proxy. I don't see any way to measure wait time without adding an intolerable amount of overhead (enough to change the behavior --- a true Heisenberg problem). Moreover, any high-semop-rate lock is going to have very short wait times: the time spent holding the lock has to be short by definition, else you couldn't get to the point of having a high rate of attempts to acquire the lock. So I don't expect that the wait-time curve would be very interesting even if we could measure it.

>>> Manfred's earlier patch provides very clear output for observing
>>> contention, including full summaries. Could we commit that, so we can
>>> all use this for analysis? Updated with the wait info.
>>
>> What patch would that be?
>
> Sorry, thought Manfred had written the earlier patch.

I still don't know what you are referring to.

regards, tom lane
I wrote:
> So it seems it's time to start thinking about how to reduce contention
> for the LockMgrLock.
> ...
> The best idea I've come up with after a bit of thought is to replace the
> shared lock table with N independent tables representing partitions of the
> lock space.

I've committed changes along this line. Testing with pgbench on a dual HT Xeon, I get numbers like this (for successive -c 10 -t 3000 runs after an -s 10 initialization):

Previous CVS HEAD:
tps = 1561.983651 (including connections establishing)
tps = 1510.301236 (including connections establishing)
tps = 1496.679616 (including connections establishing)

With 4 partitions:
tps = 1671.311892 (including connections establishing)
tps = 1620.093917 (including connections establishing)
tps = 1598.887515 (including connections establishing)

With 16 partitions:
tps = 1689.662504 (including connections establishing)
tps = 1595.530388 (including connections establishing)
tps = 1609.552501 (including connections establishing)

CPU idle percentage according to "top" is around 5% for the previous HEAD, and around 2% for either of the partition cases. I didn't see any dropoff in CS rate however --- seemed to be around 35K in all cases. The TPS rates for a single client are the same to within measurement noise, so it seems we're not paying too much for the extra LWLockAcquire/Release cycles during LockReleaseAll.

As you can see, there's not a lot of difference between the 4- and 16-partition numbers; this is probably because the OIDs assigned in pgbench's simplistic schema are such that the load is fairly evenly distributed across partitions in both cases. We need to test some other scenarios to see which size we should go with. (If you want to test, change NUM_LOCK_PARTITIONS in src/include/storage/lock.h, and be sure to recompile the whole backend because this affects the PGPROC struct.)

I spent some time looking at the lock acquire/conflict counts using the same patch mentioned previously, and got some moderately interesting numbers. A representative value of the per-process counts for the single LockMgrLock was

PID 12972 lwlock LockMgrLock: shacq 0 exacq 50204 blk 3243

In the old code, there were 15 predictable LockMgrLock acquisitions per pgbench transaction (for transaction and relation locks), or 45000 for the whole run; the majority of the other 5K acquisitions seem to be for RelationExtension locks, with a few hundred Tuple locks occurring due to update contention on rows of the "branches" table.
With 4 lock partitions, a typical process shows

PID 20471 lwlock 20: shacq 0 exacq 8809 blk 115
PID 20471 lwlock 21: shacq 0 exacq 10933 blk 245
PID 20471 lwlock 22: shacq 0 exacq 20267 blk 503
PID 20471 lwlock 23: shacq 0 exacq 17148 blk 404
TOTAL                        57157     1267

and with 16:

PID 13367 lwlock 20: shacq 0 exacq 679 blk 1
PID 13367 lwlock 21: shacq 0 exacq 648 blk 2
PID 13367 lwlock 22: shacq 0 exacq 665 blk 3
PID 13367 lwlock 23: shacq 0 exacq 12611 blk 262
PID 13367 lwlock 24: shacq 0 exacq 773 blk 3
PID 13367 lwlock 25: shacq 0 exacq 6715 blk 80
PID 13367 lwlock 26: shacq 0 exacq 781 blk 1
PID 13367 lwlock 27: shacq 0 exacq 6706 blk 89
PID 13367 lwlock 28: shacq 0 exacq 6507 blk 68
PID 13367 lwlock 29: shacq 0 exacq 731 blk 2
PID 13367 lwlock 30: shacq 0 exacq 9492 blk 170
PID 13367 lwlock 31: shacq 0 exacq 837 blk 3
PID 13367 lwlock 32: shacq 0 exacq 6530 blk 81
PID 13367 lwlock 33: shacq 0 exacq 717 blk 1
PID 13367 lwlock 34: shacq 0 exacq 6564 blk 74
PID 13367 lwlock 35: shacq 0 exacq 831 blk 0
TOTAL                        61787      840

The increase in the total number of acquisitions happens because LockReleaseAll needs to touch several partitions during each transaction commit. There are seven relations in the test (4 tables, 3 indexes) and you can clearly see which partitions their locks fell into during the 16-way test. (Transaction and tuple locks will be pretty evenly spread across all the partitions, because those locktags change constantly.)

We are getting a reduction in contention, as shown by the falling number of lock blockages, but we're paying for it with more lock acquisition cycles.

Bottom line is that this seems to have been a useful improvement, but it didn't get us as far as I'd hoped. Any thoughts on other things to try?

regards, tom lane
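For anyone who wants to experiment with the partition count as suggested above, a hypothetical sketch of why changing it forces a whole-backend rebuild (illustrative declarations only; the committed header and field names may well differ): the constant sizes a per-partition array inside PGPROC, so altering it changes sizeof(PGPROC) everywhere the struct is used.

/* Sketch, not the committed source: NUM_LOCK_PARTITIONS is the compile-time
 * knob, and a per-partition list-head array inside PGPROC is what turns a
 * change of the constant into a full recompile rather than a lock-manager-only
 * rebuild. */
#define NUM_LOCK_PARTITIONS 16

typedef struct ExamplePGPROC
{
	/* ... other PGPROC fields ... */
	void	   *myProcLockLists[NUM_LOCK_PARTITIONS];	/* placeholder list heads */
} ExamplePGPROC;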