Thread: Reducing contention for the LockMgrLock
We've suspected for a while that once we'd fixed the buffer manager's use of a single global BufMgrLock, the next contention hotspot would be the lock manager's LockMgrLock. I've now seen actual evidence of that in profiling pgbench: using a modified backend that counts LWLock-related wait operations, the LockMgrLock is responsible for an order of magnitude more blockages than the next highest LWLock:

PID 12971 lwlock LockMgrLock: shacq 0 exacq 50630 blk 3354
PID 12979 lwlock LockMgrLock: shacq 0 exacq 49706 blk 3323
PID 12976 lwlock LockMgrLock: shacq 0 exacq 50567 blk 3304
PID 12962 lwlock LockMgrLock: shacq 0 exacq 50635 blk 3278
PID 12974 lwlock LockMgrLock: shacq 0 exacq 50599 blk 3251
PID 12972 lwlock LockMgrLock: shacq 0 exacq 50204 blk 3243
PID 12973 lwlock LockMgrLock: shacq 0 exacq 50321 blk 3200
PID 12978 lwlock LockMgrLock: shacq 0 exacq 50266 blk 3177
PID 12977 lwlock LockMgrLock: shacq 0 exacq 50379 blk 3148
PID 12975 lwlock LockMgrLock: shacq 0 exacq 49790 blk 3124
PID 12971 lwlock WALInsertLock: shacq 0 exacq 24022 blk 408
PID 12972 lwlock WALInsertLock: shacq 0 exacq 24021 blk 393
PID 12976 lwlock WALInsertLock: shacq 0 exacq 24017 blk 390
PID 12977 lwlock WALInsertLock: shacq 0 exacq 24021 blk 388
PID 12973 lwlock WALInsertLock: shacq 0 exacq 24018 blk 379
PID 12962 lwlock WALInsertLock: shacq 0 exacq 24024 blk 377
PID 12974 lwlock WALInsertLock: shacq 0 exacq 24016 blk 367
PID 12975 lwlock WALInsertLock: shacq 0 exacq 24021 blk 366
PID 12978 lwlock WALInsertLock: shacq 0 exacq 24023 blk 354
PID 12979 lwlock WALInsertLock: shacq 0 exacq 24033 blk 321
PID 12973 lwlock ProcArrayLock: shacq 45214 exacq 6003 blk 241
PID 12971 lwlock ProcArrayLock: shacq 45355 exacq 6003 blk 225
(etc)

We had also seen evidence to this effect from OSDL:
http://archives.postgresql.org/pgsql-patches/2003-12/msg00365.php

So it seems it's time to start thinking about how to reduce contention for the LockMgrLock. There are no interesting read-only operations on the shared lock table, so there doesn't seem to be any traction to be gained by making some operations take just shared access to the LockMgrLock.

The best idea I've come up with after a bit of thought is to replace the shared lock table with N independent tables representing partitions of the lock space. Each lock would be assigned to one of these partitions based on, say, a hash of its LOCKTAG. I'm envisioning N of 16 or so to achieve (hopefully) about an order-of-magnitude reduction of contention. There would be a separate LWLock guarding each partition; the LWLock for a given partition would be considered to protect the LOCK objects assigned to that partition, all the PROCLOCK objects associated with each such LOCK, and the shared-memory hash tables holding these objects (each partition would need its own hash tables). A PGPROC's lock-related fields are only interesting when it is waiting for a lock, so we could say that the LWLock for the partition containing the lock it is waiting for must be held to examine/change these fields.

The per-PGPROC list of all PROCLOCKs belonging to that PGPROC is a bit tricky to handle since it necessarily spans partitions. We might be able to deal with this with suitable rules about when the list can be touched, but I've not worked this out in detail. Another possibility is to break this list apart into N lists, one per partition, but that would bloat the PGPROC array a bit, especially if we wanted larger N.
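As a rough sketch of the partition-selection idea just described (the struct, the hash, and all names below are placeholders, not actual PostgreSQL source): fold the lock tag's fields into a hash code and use it to pick one of N partitions, each of which owns an LWLock plus its own LOCK/PROCLOCK hash tables.

/*
 * Sketch only: map a lock tag to one of N partitions.  "LockTag" is a
 * stand-in for the real LOCKTAG, and the hash is an arbitrary placeholder.
 */
#include <stdint.h>
#include <stdio.h>

#define NUM_LOCK_PARTITIONS 16		/* the proposed N */

typedef struct LockTag
{
	uint32_t	dbOid;
	uint32_t	relOid;
} LockTag;

static int
LockTagToPartition(const LockTag *tag)
{
	uint32_t	h = tag->dbOid * 2654435761u ^ tag->relOid * 40503u;

	return (int) (h % NUM_LOCK_PARTITIONS);
}

int
main(void)
{
	LockTag		tag = {1, 16384};

	/* a backend touching this lock would take only this partition's LWLock */
	printf("partition = %d\n", LockTagToPartition(&tag));
	return 0;
}

Keeping N a power of two would let the final step be a cheap bit-mask rather than a modulo.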
The basic LockAcquire and LockRelease operations would only need to acquire the LWLock for the partition containing the lock they are interested in; this is what gives us the contention reduction. LockReleaseAll is also interesting from a performance point of view, since it executes at every transaction exit. If we divide PGPROC's PROCLOCK list into N lists then it will be very easy for LockReleaseAll to take only the partition locks it actually needs to release these locks; if not, we might have to resort to scanning the list N times, once while we hold the LWLock for each partition.

I think that CheckDeadLock will probably require taking all the partition LWLocks (as long as it does this in a predetermined order there is no risk of deadlock on the partition LWLocks). But one hopes this is not a performance-critical operation. Ditto for GetLockStatusData.

One objection I can see to this idea is that having N lock hash tables instead of one will eat a larger amount of shared memory in hashtable overhead. But the lock hashtables are fairly small relative to the shared buffer array (given typical configuration parameters) so this doesn't seem like a major problem.

Another objection is that LockReleaseAll will get slower (since it will certainly call LWLockAcquire/Release more times) and in situations that aren't heavily concurrent there won't be any compensating gain. I think this won't be a significant effect, but there's probably no way to tell for sure without actually writing the code and testing it.

While at it, I'm inclined to get rid of the current assumption that there are logically separate hash tables for different LockMethodIds. AFAICS all that's doing for us is creating a level of confusion; there's nothing on the horizon suggesting we'd ever actually make use of the flexibility.

Thoughts, better ideas?

regards, tom lane
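To illustrate the "predetermined order" remark for CheckDeadLock and GetLockStatusData, here is a hedged sketch that reuses the LWLockAcquire/LWLockRelease calls visible in the instrumentation patch later in this thread; firstPartitionLock and NUM_LOCK_PARTITIONS are assumed stand-ins, not committed definitions. Because every caller ascends through the partitions in the same order, two backends doing this concurrently can block each other but never deadlock on the partition LWLocks.

/* Sketch only: take every partition LWLock in a fixed ascending order.
 * "firstPartitionLock" is an assumed base lock id, not a real constant. */
#include "storage/lwlock.h"

#define NUM_LOCK_PARTITIONS 16

static void
AcquireAllPartitionLocks(LWLockId firstPartitionLock)
{
	int			i;

	for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
		LWLockAcquire(firstPartitionLock + i, LW_EXCLUSIVE);
}

static void
ReleaseAllPartitionLocks(LWLockId firstPartitionLock)
{
	int			i;

	/* release order doesn't matter for correctness; reverse is conventional */
	for (i = NUM_LOCK_PARTITIONS - 1; i >= 0; i--)
		LWLockRelease(firstPartitionLock + i);
}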
Tom,
This would also explain some things we've seen during benchmarking here at EnterpriseDB. I like your idea and, as I'm on my way out, will think about it a bit tonight.
Similarly, I don't see any forward-looking reason for keeping the separate hash tables used for the LockMethodIds. Or it may just be that I haven't looked closely enough at what the differences are.
-Jonah
On Wed, 2005-12-07 at 16:59 -0500, Tom Lane wrote:
> I've now seen actual evidence of that in
> profiling pgbench: using a modified backend that counts LWLock-related
> wait operations,
> So it seems it's time to start thinking about how to reduce contention
> for the LockMgrLock

You're right to be following up this thought.

My concern, longer term, is our ability to determine contention issues in an agreed way. I've long been thinking about wait-time measurement - I think it's the only way to proceed. There's always a next bottleneck, so I'd like to first agree the diagnostic probes so we can decide how to determine that. That way we can all work on solutions for various workloads, and prove that they work, in those cases.

My view would be that the LockMgrLock is not relevant for all workloads, but I want even more to be able to discuss whether it is, or is not, on an accepted basis before discussions begin.

Best Regards,
Simon Riggs
Tom Lane wrote:

Interesting proposal.

> LockReleaseAll is also interesting from a performance point of view,
> since it executes at every transaction exit. If we divide PGPROC's
> PROCLOCK list into N lists then it will be very easy for LockReleaseAll
> to take only the partition locks it actually needs to release these locks;
> if not, we might have to resort to scanning the list N times, once while
> we hold the LWLock for each partition.

On the other hand, each scan would be shorter than the previous one; and it's not necessary to hold each and every partition's LWLock, only the one found in the first entry of the list on each scan until the list is empty. So it's N scans only in the worst case of a PGPROC holding locks on all partitions.

> One objection I can see to this idea is that having N lock hash tables
> instead of one will eat a larger amount of shared memory in hashtable
> overhead. But the lock hashtables are fairly small relative to the
> shared buffer array (given typical configuration parameters) so this
> doesn't seem like a major problem.

Is hashtable overhead all that large? Each table could be made initially size-of-current-table/N entries. One problem is that currently the memory freed from a hashtable is not put back into shmem freespace, is it?

> While at it, I'm inclined to get rid of the current assumption that there
> are logically separate hash tables for different LockMethodIds. AFAICS all
> that's doing for us is creating a level of confusion; there's nothing on
> the horizon suggesting we'd ever actually make use of the flexibility.

Yeah, please.

-- 
Alvaro Herrera                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
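Alvaro's scan strategy, sketched in C for concreteness (placeholder types and stubbed lock calls, not PostgreSQL source): each outer pass locks only the partition holding the first remaining entry of the per-process list and strips out every entry belonging to that partition, so the list shrinks on every pass and only partitions actually represented in the list are ever locked.

#include <stddef.h>
#include <stdio.h>

typedef struct ProcLock			/* stand-in for PROCLOCK */
{
	int				partition;	/* partition this lock hashes into */
	struct ProcLock *next;		/* per-process list link */
} ProcLock;

/* Stubs standing in for acquiring/releasing a partition's LWLock and for
 * the real lock-release work; shown as prints/no-ops for illustration. */
static void PartitionLockAcquire(int p) { printf("acquire partition %d\n", p); }
static void PartitionLockRelease(int p) { printf("release partition %d\n", p); }
static void ReleaseOneLock(ProcLock *pl) { (void) pl; }

static void
LockReleaseAllSketch(ProcLock **listhead)
{
	/* Each outer pass locks exactly one partition: the one holding the
	 * first remaining entry.  Worst case is one pass per partition that
	 * actually appears in the list; best case is a single pass. */
	while (*listhead != NULL)
	{
		int			part = (*listhead)->partition;
		ProcLock  **link = listhead;

		PartitionLockAcquire(part);
		while (*link != NULL)
		{
			ProcLock   *pl = *link;

			if (pl->partition == part)
			{
				*link = pl->next;	/* unlink from the per-process list */
				ReleaseOneLock(pl);
			}
			else
				link = &pl->next;
		}
		PartitionLockRelease(part);
	}
}

int
main(void)
{
	ProcLock	c = {5, NULL}, b = {2, &c}, a = {5, &b};
	ProcLock   *head = &a;

	LockReleaseAllSketch(&head);	/* locks partitions 5 and 2 once each */
	return 0;
}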
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Is hashtable overhead all that large? Each table could be made
> initially size-of-current-table/N entries. One problem is that
> currently the memory freed from a hashtable is not put back into shmem
> freespace, is it?

Yeah; the problem is mainly that we'd have to allocate extra space to allow for unevenness of usage across the multiple hashtables. It's hard to judge how large the effect would be without testing, but I think that this problem would inhibit us from having dozens or hundreds of separate partitions.

A possible response is to try to improve dynahash.c to make its memory management more flexible, but I'd prefer not to get into that unless it becomes really necessary. A shared freespace pool would create a contention bottleneck of its own...

regards, tom lane
Simon Riggs <simon@2ndquadrant.com> writes:
> My view would be that the LockMgrLock is not relevant for all workloads,
> but I want even more to be able to discuss whether it is, or is not, on
> an accepted basis before discussions begin.

Certainly. I showed the evidence that it is currently a significant problem for pgbench-like workloads, but pgbench is of course not representative of everything.

My feeling about it is that different workloads are going to expose different weak spots, and so as long as a given test case isn't obviously artificial, whatever bottleneck it exposes is fair game to work on. pgbench seems reasonably representative of a class of applications with relatively short transactions, so I don't doubt that if pgbench has a problem with LockMgrLock contention, there are real-world cases out there that do too.

regards, tom lane
On Wed, 2005-12-07 at 22:46 -0500, Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > Is hashtable overhead all that large? Each table could be made
> > initially size-of-current-table/N entries. One problem is that
> > currently the memory freed from a hashtable is not put back into shmem
> > freespace, is it?
>
> Yeah; the problem is mainly that we'd have to allocate extra space to
> allow for unevenness of usage across the multiple hashtables. It's hard
> to judge how large the effect would be without testing, but I think that
> this problem would inhibit us from having dozens or hundreds of separate
> partitions.

The imbalance across partitions would be a major issue because of the difficulty of selecting a well-distributed partitioning key. If you use the LOCKTAG, then operations on the heaviest hit tables would go to the same partitions continually for LockRelation requests. The frequency of access per table usually drops off dramatically in rank order: look at TPC-B (pgbench) and TPC-C; my own experience would be that you seldom have as many even as 16 heavy hit tables. My guess would be that doing all of that would do little more than reduce contention to ~50%, and that this would show quickly diminishing returns for N > 4. Also, the more sharply defined your application profile, the worse this effect will be.

Having said that, I think this *is* the best way forward *if* we continue to accept the barrage of lock requests. So I've reopened the debate on the earlier thread:

[HACKERS] Reducing relation locking overhead

and am reviewing thoughts/requirements on that thread to avoid the necessity of altering the lock manager in this way.

pgbench is the right workload to expose this effect and measure worst case contention, so at least performance gains are easy to measure.

> A possible response is to try to improve dynahash.c to make its memory
> management more flexible, but I'd prefer not to get into that unless
> it becomes really necessary. A shared freespace pool would create a
> contention bottleneck of its own...

...but a less frequently accessed one.

Best Regards,
Simon Riggs
On Wed, 2005-12-07 at 22:53 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > My view would be that the LockMgrLock is not relevant for all workloads,
> > but I want even more to be able to discuss whether it is, or is not, on
> > an accepted basis before discussions begin.
>
> Certainly. I showed the evidence ...

The output you gave wasn't anything I recognize in the code. Assuming it's not already there, please can you share the code you are using to find the evidence, even if it's just privately in some form? You're looking at the number of spins to acquire each lock? Or some other measure of wait time on a lock?

I want to be in a position to run tests and then share the output with the project in an agreed form, then quickly move to action. You're right to put the burden of proof onto test results; I want to agree the measurements before we test.

Manfred's earlier patch provides very clear output for observing contention, including full summaries. Could we commit that, so we can all use this for analysis? Updated with the wait info.

Best Regards,
Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
> The output you gave wasn't anything I recognize in the code. Assuming
> it's not already there, please can you share the code you are using to find
> the evidence, even if it's just privately in some form?

See below. Also, the message I previously mentioned shows a different tack on the same theme:
http://archives.postgresql.org/pgsql-patches/2003-12/msg00365.php
although in the light of later events I think that keeping the counts in shared memory like that is a bad idea --- too likely to skew the results.

> You're looking at the number of spins to acquire each lock?

Number of semop waits.

> Manfred's earlier patch provides very clear output for observing
> contention, including full summaries. Could we commit that, so we can
> all use this for analysis? Updated with the wait info.

What patch would that be?

regards, tom lane

*** src/backend/storage/ipc/ipc.c.orig	Tue Nov 22 16:06:33 2005
--- src/backend/storage/ipc/ipc.c	Tue Nov 29 12:27:13 2005
***************
*** 125,130 ****
--- 125,132 ----
  {
  	elog(DEBUG3, "shmem_exit(%d)", code);
  
+ 	LWLockStats();
+ 
  	/*
  	 * call all the registered callbacks.
  	 *
*** src/backend/storage/lmgr/lwlock.c.orig	Tue Dec  6 18:08:33 2005
--- src/backend/storage/lmgr/lwlock.c	Tue Dec  6 18:16:05 2005
***************
*** 21,26 ****
--- 21,28 ----
   */
  #include "postgres.h"
  
+ #include <unistd.h>
+ 
  #include "access/clog.h"
  #include "access/multixact.h"
  #include "access/subtrans.h"
***************
*** 32,37 ****
--- 34,43 ----
  /* We use the ShmemLock spinlock to protect LWLockAssign */
  extern slock_t *ShmemLock;
  
+ static int	num_counts;
+ static int *sh_acquire_counts;
+ static int *ex_acquire_counts;
+ static int *block_counts;
  
  typedef struct LWLock
  {
***************
*** 209,214 ****
--- 215,226 ----
  	LWLockCounter = (int *) ((char *) LWLockArray - 2 * sizeof(int));
  	LWLockCounter[0] = (int) NumFixedLWLocks;
  	LWLockCounter[1] = numLocks;
+ 
+ 	/* local counter space */
+ 	num_counts = numLocks;
+ 	sh_acquire_counts = calloc(numLocks, sizeof(int));
+ 	ex_acquire_counts = calloc(numLocks, sizeof(int));
+ 	block_counts = calloc(numLocks, sizeof(int));
  }
***************
*** 257,262 ****
--- 269,278 ----
  	int			extraWaits = 0;
  
  	PRINT_LWDEBUG("LWLockAcquire", lockid, lock);
  
+ 	if (mode == LW_EXCLUSIVE)
+ 		ex_acquire_counts[lockid]++;
+ 	else
+ 		sh_acquire_counts[lockid]++;
  
  	/*
  	 * We can't wait if we haven't got a PGPROC.  This should only occur
***************
*** 328,333 ****
--- 344,351 ----
  		if (!mustwait)
  			break;				/* got the lock */
  
+ 		block_counts[lockid]++;
+ 
  		/*
  		 * Add myself to wait queue.
  		 *
***************
*** 598,601 ****
--- 616,640 ----
  		return true;
  	}
  	return false;
+ }
+ 
+ void
+ LWLockStats(void)
+ {
+ 	int			pid = getpid();
+ 	int			i;
+ 
+ 	LWLockAcquire(0, LW_EXCLUSIVE);
+ 
+ 	for (i = 0; i < num_counts; i++)
+ 	{
+ 		if (sh_acquire_counts[i] || ex_acquire_counts[i] || block_counts[i])
+ 		{
+ 			fprintf(stderr, "PID %d lwlock %d: shacq %u exacq %u blk %u\n",
+ 					pid, i, sh_acquire_counts[i], ex_acquire_counts[i],
+ 					block_counts[i]);
+ 		}
+ 	}
+ 
+ 	LWLockRelease(0);
  }
Simon Riggs <simon@2ndquadrant.com> writes:
> The imbalance across partitions would be a major issue because of the
> difficulty of selecting a well-distributed partitioning key. If you use
> the LOCKTAG, then operations on the heaviest hit tables would go to the
> same partitions continually for LockRelation requests. The frequency of
> access per table usually drops off dramatically in rank order: look at
> TPC-B (pgbench) and TPC-C; my own experience would be that you seldom
> have as many even as 16 heavy hit tables. My guess would be that doing
> all of that would do little more than reduce contention to ~50%, and
> that this would show quickly diminishing returns for N > 4. Also, the
> more sharply defined your application profile, the worse this effect
> will be.

That's a fair point, and reinforces my instinct that having a large number of partitions would be a losing game. But you are mistaken to think that the number of hot-spot tables is the only limit on the number of usable partitions. It's the number of their indexes that matters most. (The pgbench test is if anything probably understating the problem, because it has only a single index on each table.) In any case, even a factor of 2 or so reduction in direct conflicts should have a useful impact on the number of semop waits, because it's a nonlinear effect...

regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:
> That's a fair point, and reinforces my instinct that having a large
> number of partitions would be a losing game. But you are mistaken to
> think that the number of hot-spot tables is the only limit on the number
> of usable partitions. It's the number of their indexes that matters most.

Hm, so hypothetically an insert or update on a table with 4 indexes which have been split into 4 partitions would need to touch each partition? Would that defeat the benefits of the partitioning? Or enhance it?

Would it be better to ensure that the indexes of a single table ended up in the same partition? Or to ensure they're spread across partitions?

-- 
greg
Greg Stark <gsstark@mit.edu> writes:
> Hm, so hypothetically an insert or update on a table with 4 indexes which have
> been split into 4 partitions would need to touch each partition?

That would be the best case, actually, that each heavily-used lock ends up in a different partition. As Simon points out, we have no way to guarantee that.

> Would that defeat the benefits of the partitioning? Or enhance it?

It'd be what you'd want, because it would reduce the odds that two processes doing this concurrently would need to touch the same partition at the same time.

regards, tom lane
On Thu, 2005-12-08 at 10:26 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > The imbalance across partitions would be a major issue because of the
> > difficulty of selecting a well-distributed partitioning key. If you use
> > the LOCKTAG, then operations on the heaviest hit tables would go to the
> > same partitions continually for LockRelation requests. The frequency of
> > access per table usually drops off dramatically in rank order: look at
> > TPC-B (pgbench) and TPC-C; my own experience would be that you seldom
> > have as many even as 16 heavy hit tables. My guess would be that doing
> > all of that would do little more than reduce contention to ~50%, and
> > that this would show quickly diminishing returns for N > 4. Also, the
> > more sharply defined your application profile, the worse this effect
> > will be.
>
> That's a fair point, and reinforces my instinct that having a large
> number of partitions would be a losing game. But you are mistaken to
> think that the number of hot-spot tables is the only limit on the number
> of usable partitions. It's the number of their indexes that matters most.
> (The pgbench test is if anything probably understating the problem,
> because it has only a single index on each table.)

True. So either 16 partitions, or maybe 8, does sound about right then.

> In any case, even a
> factor of 2 or so reduction in direct conflicts should have a useful
> impact on the number of semop waits, because it's a nonlinear effect...

Nonlinear effects work both ways. Factor of 2 is great, but not enough to prevent this discussion becoming an annual event ;-) (Thinks: Oh, it already is...)

Best Regards,
Simon Riggs
On Thu, 2005-12-08 at 09:44 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > The output you gave wasn't anything I recognize in the code. Assuming
> > it's not already there, please can you share the code you are using to find
> > the evidence, even if it's just privately in some form?
>
> See below. Also, the message I previously mentioned shows a different
> tack on the same theme:
> http://archives.postgresql.org/pgsql-patches/2003-12/msg00365.php
> although in the light of later events I think that keeping the counts
> in shared memory like that is a bad idea --- too likely to skew the
> results.

Looks good, thanks. Like the shmem_exit solution. I'll use this in any testing, to allow easier discussion of results, and encourage all other testers to do the same.

> > You're looking at the number of spins to acquire each lock?
>
> Number of semop waits.

I wonder whether that is the thing to measure. That measure doesn't show how long each waiter waited. I would also want to see:
- queue length at wait time
- any variability over time

I was thinking of looking at the histogram of wait queue length frequency for the highly contested locks.

> > Manfred's earlier patch provides very clear output for observing
> > contention, including full summaries. Could we commit that, so we can
> > all use this for analysis? Updated with the wait info.
>
> What patch would that be?

Sorry, thought Manfred had written the earlier patch.

Best Regards,
Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
> On Thu, 2005-12-08 at 09:44 -0500, Tom Lane wrote:
>> Simon Riggs <simon@2ndquadrant.com> writes:
>>> You're looking at the number of spins to acquire each lock?
>>
>> Number of semop waits.
>
> I wonder whether that is the thing to measure. That measure doesn't show
> how long each waiter waited.

True, but what I am focusing on minimizing right now is the number of context swaps, and so number of semops seems an adequate proxy. I don't see any way to measure wait time without adding an intolerable amount of overhead (enough to change the behavior --- a true Heisenberg problem). Moreover, any high-semop-rate lock is going to have very short wait times: the time spent holding the lock has to be short by definition, else you couldn't get to the point of having a high rate of attempts to acquire the lock. So I don't expect that the wait-time curve would be very interesting even if we could measure it.

>>> Manfred's earlier patch provides very clear output for observing
>>> contention, including full summaries. Could we commit that, so we can
>>> all use this for analysis? Updated with the wait info.
>>
>> What patch would that be?
>
> Sorry, thought Manfred had written the earlier patch.

I still don't know what you are referring to.

regards, tom lane
I wrote:
> So it seems it's time to start thinking about how to reduce contention
> for the LockMgrLock.
> ...
> The best idea I've come up with after a bit of thought is to replace the
> shared lock table with N independent tables representing partitions of the
> lock space.

I've committed changes along this line. Testing with pgbench on a dual HT Xeon, I get numbers like this (for successive -c 10 -t 3000 runs after an -s 10 initialization):

Previous CVS HEAD:
tps = 1561.983651 (including connections establishing)
tps = 1510.301236 (including connections establishing)
tps = 1496.679616 (including connections establishing)

With 4 partitions:
tps = 1671.311892 (including connections establishing)
tps = 1620.093917 (including connections establishing)
tps = 1598.887515 (including connections establishing)

With 16 partitions:
tps = 1689.662504 (including connections establishing)
tps = 1595.530388 (including connections establishing)
tps = 1609.552501 (including connections establishing)

CPU idle percentage according to "top" is around 5% for the previous HEAD, and around 2% for either of the partition cases. I didn't see any dropoff in CS rate however --- seemed to be around 35K in all cases. The TPS rates for a single client are the same to within measurement noise, so it seems we're not paying too much for the extra LWLockAcquire/Release cycles during LockReleaseAll.

As you can see, there's not a lot of difference between the 4- and 16-partition numbers; this is probably because the OIDs assigned in pgbench's simplistic schema are such that the load is fairly evenly distributed across partitions in both cases. We need to test some other scenarios to see which size we should go with. (If you want to test, change NUM_LOCK_PARTITIONS in src/include/storage/lock.h, and be sure to recompile the whole backend because this affects the PGPROC struct.)

I spent some time looking at the lock acquire/conflict counts using the same patch mentioned previously, and got some moderately interesting numbers. A representative value of the per-process counts for the single LockMgrLock was

PID 12972 lwlock LockMgrLock: shacq 0 exacq 50204 blk 3243

In the old code, there were 15 predictable LockMgrLock acquisitions per pgbench transaction (for transaction and relation locks), or 45000 for the whole run; the majority of the other 5K acquisitions seem to be for RelationExtension locks, with a few hundred Tuple locks occurring due to update contention on rows of the "branches" table.
With 4 lock partitions, a typical process shows

PID 20471 lwlock 20: shacq 0 exacq 8809 blk 115
PID 20471 lwlock 21: shacq 0 exacq 10933 blk 245
PID 20471 lwlock 22: shacq 0 exacq 20267 blk 503
PID 20471 lwlock 23: shacq 0 exacq 17148 blk 404
TOTAL                        57157     1267

and with 16:

PID 13367 lwlock 20: shacq 0 exacq 679 blk 1
PID 13367 lwlock 21: shacq 0 exacq 648 blk 2
PID 13367 lwlock 22: shacq 0 exacq 665 blk 3
PID 13367 lwlock 23: shacq 0 exacq 12611 blk 262
PID 13367 lwlock 24: shacq 0 exacq 773 blk 3
PID 13367 lwlock 25: shacq 0 exacq 6715 blk 80
PID 13367 lwlock 26: shacq 0 exacq 781 blk 1
PID 13367 lwlock 27: shacq 0 exacq 6706 blk 89
PID 13367 lwlock 28: shacq 0 exacq 6507 blk 68
PID 13367 lwlock 29: shacq 0 exacq 731 blk 2
PID 13367 lwlock 30: shacq 0 exacq 9492 blk 170
PID 13367 lwlock 31: shacq 0 exacq 837 blk 3
PID 13367 lwlock 32: shacq 0 exacq 6530 blk 81
PID 13367 lwlock 33: shacq 0 exacq 717 blk 1
PID 13367 lwlock 34: shacq 0 exacq 6564 blk 74
PID 13367 lwlock 35: shacq 0 exacq 831 blk 0
TOTAL                        61787      840

The increase in the total number of acquisitions happens because LockReleaseAll needs to touch several partitions during each transaction commit. There are seven relations in the test (4 tables, 3 indexes) and you can clearly see which partitions their locks fell into during the 16-way test. (Transaction and tuple locks will be pretty evenly spread across all the partitions, because those locktags change constantly.)

We are getting a reduction in contention, as shown by the falling number of lock blockages, but we're paying for it with more lock acquisition cycles.

Bottom line is that this seems to have been a useful improvement, but it didn't get us as far as I'd hoped. Any thoughts on other things to try?

regards, tom lane
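For anyone who wants to experiment with the partition count as suggested above, a hypothetical sketch of why changing it forces a whole-backend rebuild (illustrative declarations only; the committed header and field names may well differ): the constant sizes a per-partition array inside PGPROC, so altering it changes sizeof(PGPROC) everywhere the struct is used.

/* Sketch, not the committed source: NUM_LOCK_PARTITIONS is the compile-time
 * knob, and a per-partition list-head array inside PGPROC is what turns a
 * change of the constant into a full recompile rather than a lock-manager-only
 * rebuild. */
#define NUM_LOCK_PARTITIONS 16

typedef struct ExamplePGPROC
{
	/* ... other PGPROC fields ... */
	void	   *myProcLockLists[NUM_LOCK_PARTITIONS];	/* placeholder list heads */
} ExamplePGPROC;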