Thread: Improve heavyweight locks instead of building new lock managers?
Hi,

I started this as a reply to https://www.postgresql.org/message-id/CAA4eK1JMgUfiTdAgr9k3nA4cdKvvruousKBg7FWTDNzQgBpOZA%40mail.gmail.com but the email seemed to morph into a distinct topic that I thought it best to separate out.

We've had a number of cases where heavyweight locks turned out to be too, well, heavy. And I don't mean cases where a fixed number of locks can do the job. The latest case of this is the above thread, where a separate lock manager just for relation extension was implemented.

My problem with that patch is that it just seems like the wrong direction architecturally to me. There's two main aspects to this:

1) It basically builds another, more lightweight but less capable, lock manager that can lock more objects than we can have distinct locks for. It is faster because it uses *one* hashtable without conflict handling, because it has fewer lock modes, and because it doesn't support detecting deadlocks. And probably some other things.

2) A lot of the contention around file extension comes from us doing multiple expensive things under one lock (determining current relation size, searching for a victim buffer, extending the file), and in tiny increments (growing a 1TB table by 8kb). This patch doesn't address that at all.

I'm only planning to address 1) in this thread, and will write a separate email about 2) as part of the above thread. So more on that later.

With regard to 1): To me this seems to go in the direction of having multiple bespoke lock managers with slightly different feature sets, different debugging / logging / monitoring support, with separate locking code each. That's bad for maintainability.

I think one crucial piece of analysis that I've not seen fully done is why a separate lock manager is actually faster.
An early (quite different) version of this patch yielded the following micro-benchmark: https://www.postgresql.org/message-id/CAD21AoD1NenjTmD%3D5ypOBo9%3DFRtAtWVxUcHqHxY3wNos_5Bb5w%40mail.gmail.com

On 2017-11-21 07:19:30 +0900, Masahiko Sawada wrote:
> Also I've done a micro-benchmark; calling LockRelationForExtension and
> UnlockRelationForExtension tightly in order to measure the number of
> lock/unlock cycles per second. The result is,
> PATCHED = 3.95892e+06 (cycles/sec)
> HEAD = 1.15284e+06 (cycles/sec)
> The patched is 3 times faster than current HEAD.

but I've not actually seen an analysis as to *why* precisely there's a threefold difference in cost, and whether the problem could instead be addressed by making the difference much smaller.

My guess would be that the difference, to a very large degree, comes from avoiding dynahash lookups. We currently have quite a few hashtable lookups:

* LockRelationForExtension
** LockAcquireExtended does HASH_ENTERs in LockMethodLocalHash
** SetupLockInTable does HASH_ENTER_NULL in LockMethodLockHash
** SetupLockInTable does HASH_ENTER_NULL in LockMethodProcLockHash

* UnlockRelationForExtension
** LockRelease does HASH_FIND in LockMethodLocalHash
** CleanUpLock does HASH_REMOVE in LockMethodProcLockHash
** CleanUpLock does HASH_REMOVE in LockMethodLockHash
** RemoveLocalLock does HASH_REMOVE in LockMethodLocalHash

It's pretty easy to believe that a lock mapping like:

+	relextlock = &RelExtLockArray[RelExtLockTargetTagToIndex(&tag)];

with

+typedef struct RelExtLockTag
+{
+	Oid		dbid;	/* InvalidOid for a shared relation */
+	Oid		relid;
+} RelExtLockTag;
...
+static inline uint32
+RelExtLockTargetTagToIndex(RelExtLockTag *locktag)
+{
+	return tag_hash(locktag, sizeof(RelExtLockTag)) % N_RELEXTLOCK_ENTS;
+}

and then just waiting till that specific hash entry becomes free, is cheaper than *7* dynahash lookups. Especially if doing so doesn't require any mapping locks.
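[Editor's illustration] The fixed-slot mapping above can be seen in miniature in the following self-contained sketch. The mixing function and table size here are stand-ins for PostgreSQL's tag_hash() and the patch's N_RELEXTLOCK_ENTS, so treat the details as illustrative only; the point is that the tag-to-slot mapping is a pure computation, with no hashtable probe and no mapping lock, at the cost of distinct tags occasionally sharing a slot:

```c
#include <assert.h>
#include <stdint.h>

#define N_RELEXTLOCK_ENTS 1024	/* illustrative table size */

typedef uint32_t Oid;

typedef struct RelExtLockTag
{
	Oid		dbid;	/* InvalidOid for a shared relation */
	Oid		relid;
} RelExtLockTag;

/* Illustrative stand-in for tag_hash(): any decent integer mix works. */
static inline uint32_t
mix_hash(const RelExtLockTag *tag)
{
	uint32_t	h = tag->dbid * 0x9e3779b1u ^ tag->relid * 0x85ebca77u;

	h ^= h >> 16;
	return h;
}

/*
 * Map a (dbid, relid) tag to a fixed slot. Colliding tags simply share a
 * slot and wait on each other; there is no conflict chain to walk.
 */
static inline uint32_t
RelExtLockTargetTagToIndex(const RelExtLockTag *tag)
{
	return mix_hash(tag) % N_RELEXTLOCK_ENTS;
}
```

The same tag always maps to the same slot, and every slot index is in range, which is all the scheme needs in order to skip the dynahash lookups entirely.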
But it's not at all clear to me whether we cannot sometimes / always avoid some of this overhead. Improving lock.c would be a much bigger win, without building separate infrastructure. E.g. we could:

- Have an extended LockAcquire API where the caller provides a stack variable for caching information until the LockRelease call, avoiding a separate LockMethodLocalHash lookup for the release.

- Instead of always implementing HASH_REMOVE as a completely fresh lookup in dynahash, we should provide an API for removing an object we *know* is in the hashtable at a specific pointer. We're obviously already relying on that address to stay constant, so we'd not lose flexibility. With this separate API we can avoid the bucket lookup, walking through each element in that bucket, and having to compare the hash key every time (which is quite expensive for a full LOCKTAG). For the biggest benefit we'd have to make the bucket list doubly linked, but even if we were to just go through from the start and just compare entries by pointer, we'd win big.

- We should try to figure out whether we can avoid needing the separate lock table for PROCLOCKs. As far as I can tell there's not a single reason for it to be a hashtable, rather than something like NUM_LOCK_PARTITIONS lists of free PROCLOCKs.

- Whenever we do a relation extension, we'd better already hold a normal relation lock. We don't actually need an entirely distinct type of lock for extensions, one that theoretically could support N extension locks for each relation. The fact that these are associated could be utilized in different ways:

  - We could define relation extension as a separate lock level only conflicting with itself (and perhaps treat AELs as including it). That'd avoid needing separate shared memory states in many cases.

    This might be too problematic, because extension locks don't have the same desired behaviour around group locking (and a small army of other issues).
  - We could keep a separate extension lock cached inside the relation lock. The first time a transaction needs to extend, it does the normal work, and after that stores the PROCLOCK in the LOCALLOCK (or something like that). Once extension is done, don't release the lock entirely, but instead just reduce the lock level to a new non-conflicting lock level.

    Alternatively we could implement something very similar outside of lock.c, e.g. by caching the LOCALLOCK* (possibly identified by an integer or such) in RelationData. That'd require enough support from lock.c to be able to make that really cheap.

The second big difference between lock.c and the proposed relation extension lock is that the latter doesn't have a "mapping" lock. It instead solves the mapping without a lock by having exactly one potential location for each lock, using atomic ops to manipulate the lock state, and deals with conflicts by waiting for the bucket to become free.

I don't think it's realistic to not use locks to protect the lock mapping hashtable, nor does it seem likely we can make the manipulation of individual lock states entirely atomic. But it very well seems possible to reduce the likelihood of contention: We could e.g. split the maintenance of the hashtable(s) from protecting the state of individual locks. The locks protecting the hashtable would just be held long enough to change a "pin count" of each lock, and then a per-LOCK lwlock would protect each heavyweight lock's state. We'd not need to take the partition locks in quite a few cases, because there's many locks in a loaded system that always have lockers.

There'd be a cost to that, needing more atomic ops in some cases, but I think it'd reduce contention tremendously, and it'd open a lot more optimization potential. It seems plausible that we could even work, as a followup, to not needing the partition locks for some lock releases (e.g. when there are other holders), and we might even be able to avoid it for acquisitions, by caching the LOCK inside the LOCALLOCK, and re-identifying the identity.

Thoughts?

Greetings,

Andres Freund
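[Editor's illustration] A minimal, single-process sketch of the pin-count split described above. All names are hypothetical, and the per-LOCK lwlock that would protect the lock's granted/waiting state is reduced to a plain field here; the only point being illustrated is that the partition lock would guard *existence* (the pin count) rather than the lock's state:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical heavyweight lock entry: the hashtable partition lock would
 * only be held long enough to adjust pin_count (keeping the entry alive in
 * the table); manipulating the lock's own state would be protected by a
 * per-entry lwlock instead, represented here by the plain 'granted' field.
 */
typedef struct HeavyLockEntry
{
	atomic_uint pin_count;	/* guards existence, not state */
	uint32_t	granted;	/* would be protected by a per-LOCK lwlock */
} HeavyLockEntry;

static void
pin_entry(HeavyLockEntry *e)
{
	atomic_fetch_add_explicit(&e->pin_count, 1, memory_order_relaxed);
}

/* Returns true if the caller dropped the last pin and may free the entry. */
static bool
unpin_entry(HeavyLockEntry *e)
{
	return atomic_fetch_sub_explicit(&e->pin_count, 1,
									 memory_order_acq_rel) == 1;
}
```

With this shape, a backend that already holds a pin never needs the partition lock to touch the lock's state, which is where the claimed contention reduction would come from.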
Hi,

Some of the discussions about improving the locking code, in particular the group locking / deadlock detector discussion with Robert, again made me look at lock.c. While looking at how much work it'd be to a) remove the PROCLOCK hashtable, b) move more of the group locking logic into lock.c rather than keeping the smarts in deadlock.c, I got sidetracked by all the verbose and hard-to-read SHM_QUEUE code.

Here's a *draft* patch series for removing all use of SHM_QUEUE, and subsequently removing SHM_QUEUE itself. There's a fair bit of polish needed, but I do think it makes the code considerably easier to read (particularly for predicate.c). The diffstat is nice too:

 src/include/lib/ilist.h                     | 132 +++++++++++++++++----
 src/include/replication/walsender_private.h |   3 +-
 src/include/storage/lock.h                  |  10 +-
 src/include/storage/predicate_internals.h   |  49 +++-----
 src/include/storage/proc.h                  |  18 +--
 src/include/storage/shmem.h                 |  22 ----
 src/backend/access/transam/twophase.c       |   4 +-
 src/backend/lib/ilist.c                     |   8 +-
 src/backend/replication/syncrep.c           |  89 ++++++--------
 src/backend/replication/walsender.c         |   2 +-
 src/backend/storage/ipc/Makefile            |   1 -
 src/backend/storage/ipc/shmqueue.c          | 190 ------------------------
 src/backend/storage/lmgr/deadlock.c         |  76 +++++-------
 src/backend/storage/lmgr/lock.c             | 129 ++++++------------
 src/backend/storage/lmgr/predicate.c        | 692 +++++++++++++++++++++++++++++++++++------------------------------------------------------------------------
 src/backend/storage/lmgr/proc.c             | 197 +++++++++++------------
 16 files changed, 569 insertions(+), 1053 deletions(-)

I don't want to invest a lot of time into this if there's not some agreement that this is a good direction to go into. So I'd appreciate a few cursory looks before spending more time.

Overview:

0001: Add additional prev/next & detached node functions to ilist.

I think the additional prev/next accessors are nice. I am less convinced by the 'detached' stuff, but it's used by some SHM_QUEUE users. I don't want to make the plain dlist_delete reset the node's prev/next pointers; it's not needed in the vast majority of cases...

0002: Use dlists instead of SHM_QUEUE for heavyweight locks.

I mostly removed the odd reliance on PGPROC.links needing to be the first struct member - seems odd. I think we should rename PROC_QUEUE.links; elsewhere that name is used for list membership nodes, so it's imo confusing/odd.

0003: Use dlist for syncrep queue.

This seems fairly simple to me.

0004: Use dlists for predicate locking.

Unfortunately pretty large. I think it's a huge improvement, but it's also subtle code. Wonder if there's something better to do here wrt OnConflict_CheckForSerializationFailure?

0005: Remove now-unused SHMQueue*.

0006: Remove PROC_QUEUE.size.

I'm not sure whether this is a good idea. I was looking at it primarily because I thought it'd allow us to remove PROC_QUEUE as a whole if we wanted to. But as PROC_QUEUE.size doesn't really seem to buy us much, we should perhaps just do something roughly like in the patch?

Greetings,

Andres Freund
Hi Andres,

On Thu, Feb 20, 2020 at 1:14 PM Andres Freund <andres@anarazel.de> wrote:
> Here's a *draft* patch series for removing all use of SHM_QUEUE, and
> subsequently removing SHM_QUEUE. There's a fair bit of polish needed,
> but I do think it makes the code considerably easier to read
> (particularly for predicate.c). The diffstat is nice too:
>
> 0005: Remove now unused SHMQueue*.
> 0006: Remove PROC_QUEUE.size.

Maybe you're aware, but there still seem to be places using these. In a LOCK_DEBUG build:

lock.c: In function ‘LOCK_PRINT’:
lock.c:320:20: error: ‘PROC_QUEUE’ {aka ‘const struct PROC_QUEUE’} has no member named ‘size’
   lock->waitProcs.size,
lock.c: In function ‘DumpLocks’:
lock.c:3906:2: error: unknown type name ‘SHM_QUEUE’; did you mean ‘SI_QUEUE’?

Fwiw, I for one am all for removing specialized data structures when more widely used data structures will do, especially if that specialization is no longer relevant.

Thanks,
Amit
On Thu, Feb 20, 2020 at 5:14 PM Andres Freund <andres@anarazel.de> wrote:
> 16 files changed, 569 insertions(+), 1053 deletions(-)

Nice! Some comments on 0001, 0003, 0004:

> Subject: [PATCH v1 1/6] Add additional prev/next & detached node functions to

+extern void dlist_check(const dlist_head *head);
+extern void slist_check(const slist_head *head);

I approve of the incidental constification in this patch.

+/*
+ * Like dlist_delete(), but also sets next/prev to NULL to signal not being in
+ * list.
+ */
+static inline void
+dlist_delete_thoroughly(dlist_node *node)
+{
+	node->prev->next = node->next;
+	node->next->prev = node->prev;
+	node->next = NULL;
+	node->prev = NULL;
+}

Instead of introducing this strange terminology, why not just have the callers do ...

    dlist_delete(node);
    dlist_node_init(node);

..., or perhaps supply dlist_delete_and_reinit(node) that does exactly that? That is, reuse the code and terminology.

+/*
+ * Check if node is detached. A node is only detached if it either has been
+ * initialized with dlist_init_node(), or deleted with
+ * dlist_delete_thoroughly().
+ */
+static inline bool
+dlist_node_is_detached(const dlist_node *node)
+{
+	Assert((node->next == NULL && node->prev == NULL) ||
+		   (node->next != NULL && node->prev != NULL));
+
+	return node->next == NULL;
+}

How about dlist_node_is_linked()? I don't like introducing random new verbs when we already have 'linked' in various places, and these things are, y'know, linked lists.

> Subject: [PATCH v1 3/6] Use dlist for syncrep queue.

LGTM.

> Subject: [PATCH v1 4/6] Use dlists for predicate locking.

+	dlist_foreach(iter, &unconstify(SERIALIZABLEXACT *, reader)->outConflicts)

Yuck... I suppose you could do this:

-	dlist_foreach(iter, &unconstify(SERIALIZABLEXACT *, reader)->outConflicts)
+	dlist_foreach_const(iter, &reader->outConflicts)

... given:

+/* Variant for when you have a pointer to const dlist_head. */
+#define dlist_foreach_const(iter, lhead) \
+	for (AssertVariableIsOfTypeMacro(iter, dlist_iter), \
+		 AssertVariableIsOfTypeMacro(lhead, const dlist_head *), \
+		 (iter).end = (dlist_node *) &(lhead)->head, \
+		 (iter).cur = (iter).end->next ? (iter).end->next : (iter).end; \
+		 (iter).cur != (iter).end; \
+		 (iter).cur = (iter).cur->next)

... or find a way to make dlist_foreach() handle that itself, which seems pretty reasonable given its remit to traverse lists without modifying them, though perhaps that would require a different iterator type...

Otherwise looks OK to me and passes various tests I threw at it.
Hi,

On 2020-02-21 12:40:06 +1300, Thomas Munro wrote:
> On Thu, Feb 20, 2020 at 5:14 PM Andres Freund <andres@anarazel.de> wrote:
> > 16 files changed, 569 insertions(+), 1053 deletions(-)
>
> Nice!

Thanks!

> Some comments on 0001, 0003, 0004:
>
> > Subject: [PATCH v1 1/6] Add additional prev/next & detached node functions to
>
> +extern void dlist_check(const dlist_head *head);
> +extern void slist_check(const slist_head *head);
>
> I approve of the incidental constification in this patch.

It was just necessary fallout :)

> +/*
> + * Like dlist_delete(), but also sets next/prev to NULL to signal not being in
> + * list.
> + */
> +static inline void
> +dlist_delete_thoroughly(dlist_node *node)
> +{
> +	node->prev->next = node->next;
> +	node->next->prev = node->prev;
> +	node->next = NULL;
> +	node->prev = NULL;
> +}
>
> Instead of introducing this strange terminology, why not just have the
> callers do ...
>
>     dlist_delete(node);
>     dlist_node_init(node);

There's quite a few callers in predicate.c - I actually did that first.

> ..., or perhaps supply dlist_delete_and_reinit(node) that does exactly
> that? That is, reuse the code and terminology.

Yea, that might be better, but see the paragraph below. I quite dislike adding any "empty node" state.

> +/*
> + * Check if node is detached. A node is only detached if it either has been
> + * initialized with dlist_init_node(), or deleted with
> + * dlist_delete_thoroughly().
> + */
> +static inline bool
> +dlist_node_is_detached(const dlist_node *node)
> +{
> +	Assert((node->next == NULL && node->prev == NULL) ||
> +		   (node->next != NULL && node->prev != NULL));
> +
> +	return node->next == NULL;
> +}
>
> How about dlist_node_is_linked()? I don't like introducing random new
> verbs when we already have 'linked' in various places, and these
> things are, y'know, linked lists.

Well, that name doesn't signal that you can't just delete a node and still have dlist_node_is_linked() work. I *want* it to sound "different".
We could of course make delete always do this, but I don't want to add that overhead unnecessarily.

> > Subject: [PATCH v1 4/6] Use dlists for predicate locking.
>
> +	dlist_foreach(iter, &unconstify(SERIALIZABLEXACT *, reader)->outConflicts)
>
> Yuck...

It doesn't seem *that* bad to me; at least it signals properly what we're doing, and only does so in one place.

> I suppose you could do this:
>
> -	dlist_foreach(iter, &unconstify(SERIALIZABLEXACT *, reader)->outConflicts)
> +	dlist_foreach_const(iter, &reader->outConflicts)

We'd need a different iterator type too, I think? Because the iterator itself can't be constant, but we'd want the elements themselves to be pointers to constants. const just isn't a very granular thing in C :(.

> ... given:
>
> +/* Variant for when you have a pointer to const dlist_head. */
> +#define dlist_foreach_const(iter, lhead) \
> +	for (AssertVariableIsOfTypeMacro(iter, dlist_iter), \
> +		 AssertVariableIsOfTypeMacro(lhead, const dlist_head *), \
> +		 (iter).end = (dlist_node *) &(lhead)->head, \
> +		 (iter).cur = (iter).end->next ? (iter).end->next : (iter).end; \
> +		 (iter).cur != (iter).end; \
> +		 (iter).cur = (iter).cur->next)
>
> ... or find a way to make dlist_foreach() handle that itself, which
> seems pretty reasonable given its remit to traverse lists without
> modifying them, though perhaps that would require a different iterator
> type...

I was trying that first, but I don't easily see how we can do so. Iterating over a non-constant list with dlist_foreach obviously still allows one to manipulate the list members. Thus dlist_iter.cur can't be a 'pointer to const'. Whereas that's arguably what'd be needed for a correct dlist_foreach() of a constant list?

We could just accept const pointers for dlist_foreach(), but then we'd *always* accept them, and we'd thus unconditionally have iter.cur as non-const. Would that be better?

> Otherwise looks OK to me and passes various tests I threw at it

Thanks!

Greetings,

Andres Freund
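[Editor's illustration] The detached-node tradeoff discussed in this exchange can be made concrete with a self-contained miniature of the dlist pieces involved (simplified stand-ins for ilist.h, not the actual PostgreSQL definitions): plain dlist_delete leaves the removed node's pointers dangling into the list, while the "thorough" variant pays two extra stores to make detachedness observable afterwards:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly linked list, ilist.h-style. */
typedef struct dlist_node
{
	struct dlist_node *prev;
	struct dlist_node *next;
} dlist_node;

typedef struct dlist_head
{
	dlist_node	head;		/* sentinel */
} dlist_head;

static void
dlist_init(dlist_head *h)
{
	h->head.prev = h->head.next = &h->head;
}

/* Mark a node as not being on any list. */
static void
dlist_node_init(dlist_node *n)
{
	n->prev = n->next = NULL;
}

static void
dlist_push_head(dlist_head *h, dlist_node *n)
{
	n->next = h->head.next;
	n->prev = &h->head;
	h->head.next->prev = n;
	h->head.next = n;
}

/* O(1) unlink given only the node pointer; prev/next are left dangling. */
static void
dlist_delete(dlist_node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
}

/* Unlink and additionally reset prev/next, so detachedness is observable. */
static void
dlist_delete_thoroughly(dlist_node *n)
{
	dlist_delete(n);
	n->prev = n->next = NULL;
}

static bool
dlist_node_is_detached(const dlist_node *n)
{
	return n->next == NULL;
}
```

Note that dlist_delete here is also the operation behind the HASH_REMOVE-by-pointer idea from earlier in the thread: given the element's address and a doubly linked bucket chain, removal needs no bucket lookup and no key comparison at all.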
On Mon, Feb 10, 2020 at 11:22 PM Andres Freund <andres@anarazel.de> wrote:
> To me this seems to go in the direction of having multiple bespoke lock
> managers with slightly different feature sets, different debugging /
> logging / monitoring support, with separate locking code each. That's
> bad for maintainability.

I think it would help a lot if the lock manager were not such a monolithic thing. For instance, suppose that instead of having the deadlock detector be part of the lock manager, it's a separate thing that integrates with the lock manager. Instead of only knowing about waits for heavyweight locks, it knows about whatever you want to tell it about. Then, for instance, you could possibly tell it that process A is waiting for a cleanup lock while process B holds a pin, possibly detecting a deadlock we can't notice today. There are likely other cases as well.

> E.g. we could:
>
> - Have an extended LockAcquire API where the caller provides a stack
>   variable for caching information until the LockRelease call, avoiding
>   separate LockMethodLocalHash lookup for release.

Maybe. Seems like it might make things a little clunky for the caller, though. If you just kept a stack of locks acquired and checked the lock being released against the top of the stack, you'd probably catch a large percentage of cases. Or, the caller could pass a flag indicating whether they intend to release the lock prior to the end of the transaction (which could be cross-checked by assertions). Any that you intend to release early go on a stack.

> - Instead of always implementing HASH_REMOVE as a completely fresh
>   lookup in dynahash, we should provide an API for removing an object we
>   *know* is in the hashtable at a specific pointer. We're obviously
>   already relying on that address to stay constant, so we'd not lose
>   flexibility. With this separate API we can avoid the bucket lookup,
>   walking through each element in that bucket, and having to compare the
>   hash key every time (which is quite expensive for a full LOCKTAG).

That makes a lot of sense, although I wonder if we should go further and replace dynahash entirely.

> - We should try to figure out whether we can avoid needing the separate
>   lock table for PROCLOCKs. As far as I can tell there's not a single
>   reason for it to be a hashtable, rather than something like
>   NUM_LOCK_PARTITIONS lists of free PROCLOCKs.

I agree that the PROCLOCK thing seems ugly and inefficient. I'm not sure that a bunch of lists is the best answer, though it might be. One case to consider is when a lock is initially acquired via the fast-path and then, because of a conflicting lock acquisition, transferred by *some other process* to the main lock table. If this occurs, the original locker must be able to find its PROCLOCK. It doesn't have to be crazy efficient because it shouldn't happen very often, but it shouldn't suck too much.

> - Whenever we do a relation extension, we better already hold a normal
>   relation lock. We don't actually need to have an entirely distinct
>   type of lock for extensions, that theoretically could support N
>   extension locks for each relation. The fact that these are associated
>   could be utilized in different ways:
>
>   - We could define relation extension as a separate lock level only
>     conflicting with itself (and perhaps treat AELs as including
>     it). That'd avoid needing separate shared memory states in many
>     cases.
>
>     This might be too problematic, because extension locks don't have
>     the same desired behaviour around group locking (and a small army of
>     other issues).

Yeah, I don't think that's likely to work out very nicely.

>   - We could keep a separate extension lock cached inside the relation
>     lock. The first time a transaction needs to extend, it does the
>     normal work, and after that stores the PROCLOCK in the LOCALLOCK (or
>     something like that). Once extension is done, don't release the lock
>     entirely, but instead just reduce the lock level to a new
>     non-conflicting lock level.
>
>     Alternatively we could implement something very similar outside of
>     lock.c, e.g. by caching the LOCALLOCK* (possibly identified by an
>     integer or such) in RelationData. That'd require enough support
>     from lock.c to be able to make that really cheap.

Not sure I quite see what you mean here.

> We could e.g. split the maintenance of the hashtable(s) from protecting
> the state of individual locks. The locks protecting the hashtable would
> just be held long enough to change a "pin count" of each lock, and then
> a per LOCK lwlock would protect each heavyweight lock's state. We'd not
> need to lock the partition locks for quite a few cases, because there's
> many locks in a loaded system that always have lockers. There'd be cost
> to that, needing more atomic ops in some cases, but I think it'd reduce
> contention tremendously, and it'd open a lot more optimization
> potential. It seems plausible that we could even work, as a followup,
> to not needing the partition locks for some lock releases (e.g. when
> there are other holders), and we might even be able to avoid it for
> acquisitions, by caching the LOCK inside LOCALLOCK, and re-identifying
> the identity.

I agree. This seems worth exploring. The idea of caching the probable location of the lock and re-pinning it to check whether it's the one you expected seems like it would avoid false sharing in a lot of practical cases.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
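[Editor's illustration] The "NUM_LOCK_PARTITIONS lists of free PROCLOCKs" idea debated above can be sketched in a few lines. This is a deliberately naive, single-process stand-in: in shared memory the links would be offsets rather than pointers, each partition's list would be protected by that partition's LWLock, and the stub struct carries none of a real PROCLOCK's fields:

```c
#include <assert.h>
#include <stddef.h>

#define NUM_LOCK_PARTITIONS 16	/* illustrative; mirrors the PG constant */

/* Hypothetical PROCLOCK stub with an embedded freelist link. */
typedef struct ProcLockStub
{
	struct ProcLockStub *next_free;
} ProcLockStub;

/* One LIFO freelist per lock partition; zero-initialized (all empty). */
static ProcLockStub *freelists[NUM_LOCK_PARTITIONS];

static void
free_proclock(ProcLockStub *p, int partition)
{
	p->next_free = freelists[partition];
	freelists[partition] = p;
}

/* Returns NULL when the partition's freelist is exhausted. */
static ProcLockStub *
alloc_proclock(int partition)
{
	ProcLockStub *p = freelists[partition];

	if (p)
		freelists[partition] = p->next_free;
	return p;
}
```

The point is that allocation and release are O(1) pushes and pops under the partition lock already held, with none of the hashing, bucket walking, or LOCKTAG comparison a dynahash-based PROCLOCK table has to do. Robert's fast-path-transfer case would still need some way for the original locker to find its PROCLOCK, which this sketch deliberately does not address.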
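[Editor's illustration] The "cache the probable location of the lock and re-identify it" idea that closes this exchange is essentially an optimistic-validation pattern. A minimal sketch, with entirely hypothetical names: each shared slot carries a generation counter bumped whenever the slot is recycled for a different lock, so a cached pointer can be validated without repeating the hashtable lookup:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical shared lock slot. In a real implementation the generation
 * bump and the validation read would need appropriate locking or atomics;
 * this single-threaded sketch only shows the validation logic.
 */
typedef struct SharedLockSlot
{
	uint64_t	generation;	/* bumped when the slot is reused */
	uint32_t	tag;		/* stand-in for a full LOCKTAG */
} SharedLockSlot;

/* Backend-local cache, e.g. what a LOCALLOCK might carry. */
typedef struct CachedLockRef
{
	SharedLockSlot *slot;
	uint64_t	generation;	/* generation observed when cached */
} CachedLockRef;

static void
cache_lock(CachedLockRef *ref, SharedLockSlot *slot)
{
	ref->slot = slot;
	ref->generation = slot->generation;
}

/* Returns true if the cached reference still identifies the same lock. */
static bool
cached_lock_valid(const CachedLockRef *ref)
{
	return ref->slot != NULL && ref->slot->generation == ref->generation;
}
```

On the fast path the backend revalidates and proceeds without touching the partition lock; only when validation fails does it fall back to the full mapping lookup.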
On Wed, Feb 19, 2020 at 11:14 PM Andres Freund <andres@anarazel.de> wrote:
> Here's a *draft* patch series for removing all use of SHM_QUEUE, and
> subsequently removing SHM_QUEUE.

+1 for that idea. But color me skeptical of what Thomas described as the "incidental constification".

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company