Thread: Improving LWLock wait events
Hi,

The current wait events are already pretty useful. But I think we could
make them more informative without adding real runtime overhead.

1) For lwlocks I think it'd be quite useful to show the mode of
acquisition in pg_stat_activity.wait_event_type, instead of just saying
'LWLock'. I think we should split PG_WAIT_LWLOCK into
PG_WAIT_LWLOCK_{EXCLUSIVE,SHARED,WAIT_UNTIL_FREE}, and report a different
wait_event_type based on the class. The fact that it'd break people
explicitly looking for LWLock in pg_stat_activity doesn't seem to
outweigh the benefits to me.

2) I think it's unhelpful that waits for WAL insertion locks to progress
show up as LWLock acquisitions. LWLockWaitForVar() feels like a distinct
enough operation that passing in a user-specified wait event is worth the
minuscule incremental overhead that'd add. I'd probably just make it a
different wait class, and have xlog.c compute the event based on the
number of the slot being waited for.

3) I have observed waking up other processes as part of a lock release to
be a significant performance factor. I would like to add a separate wait
event type for that. That'd be a near-trivial extension of 1).

I also think there's a 4), but I think the tradeoffs are a bit more
complicated:

4) For a few types of lwlock just knowing the tranche isn't sufficient.
E.g. knowing whether it's one or several different buffer mapping locks
being waited on is important to judge contention.

For wait events right now we use 1 byte for the class, 1 byte is unused,
and 2 bytes are used for event-specific information (the tranche in the
case of lwlocks). Seems like we could change the split to a 4-bit class
and leave 28 bits for the specific wait event? And in the lwlock case we
could make something like 4-bit class, 10-bit tranche, 20-bit
sub-tranche? 20 bits aren't enough to uniquely identify a lock for the
larger tranches (mostly buffer locks, I think), but I think it'd still be
enough to disambiguate.

The hardest part would be to know how to identify individual locks. The
easiest would probably be to just mask in part of the lwlock address
(e.g. shift it right by INTALIGN, and then mask the result into the
eventId). That seems a bit unsatisfying.

We could probably do a bit better: We could just store the information
about tranche / offset within tranche at LWLockInitialize() time, instead
of computing something just before waiting. While LWLock.tranche is only
16 bits right now, the following two bytes are currently padding...

That'd allow us to have proper numerical identification for nearly all
tranches, without needing to go back to the complexity of having tranches
specify base & stride.

Even more API churn around lwlock initialization isn't desirable :(, but
we could just add a LWLockInitializeIdentified() or such.

Greetings,

Andres Freund
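For concreteness, a minimal sketch of how (1) might look in lwlock.c. The
three PG_WAIT_LWLOCK_* constants, their values, and the extra mode
argument are assumptions for illustration, not an existing API; only
PG_WAIT_LWLOCK, LWLockMode and LWLockReportWaitStart() exist today:

/*
 * Hypothetical per-mode wait classes, continuing the existing PG_WAIT_*
 * numbering (values are illustrative only).
 */
#define PG_WAIT_LWLOCK_EXCLUSIVE        0x0B000000U
#define PG_WAIT_LWLOCK_SHARED           0x0C000000U
#define PG_WAIT_LWLOCK_WAIT_UNTIL_FREE  0x0D000000U

static inline void
LWLockReportWaitStart(LWLock *lock, LWLockMode mode)
{
    uint32      classId;

    /* Pick the wait class from the requested acquisition mode. */
    switch (mode)
    {
        case LW_EXCLUSIVE:
            classId = PG_WAIT_LWLOCK_EXCLUSIVE;
            break;
        case LW_SHARED:
            classId = PG_WAIT_LWLOCK_SHARED;
            break;
        default:                /* LW_WAIT_UNTIL_FREE */
            classId = PG_WAIT_LWLOCK_WAIT_UNTIL_FREE;
            break;
    }

    /* Event id stays the tranche, exactly as with PG_WAIT_LWLOCK today. */
    pgstat_report_wait_start(classId | lock->tranche);
}

Point (3) would then just add one more such class, presumably reported
around the wakeup loop in LWLockRelease().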
On Mon, 21 Dec 2020 at 05:27, Andres Freund <andres@anarazel.de> wrote:
> Hi,
> The current wait events are already pretty useful. But I think we could
> make them more informative without adding real runtime overhead.
All 1-3 sound pretty sensible to me.
> I also think there's a 4), but I think the tradeoffs are a bit more
> complicated:
> 4) For a few types of lwlock just knowing the tranche isn't
> sufficient. E.g. knowing whether it's one or several different buffer
> mapping locks being waited on is important to judge contention.
I've struggled with this quite a bit myself.
In particular, for tools that validate acquire-ordering safety, it's desirable to be able to identify a specific lock in a backend-independent way.
> The hardest part would be to know how to identify individual locks. The
> easiest would probably be to just mask in part of the lwlock address
> (e.g. shift it right by INTALIGN, and then mask the result into the
> eventId). That seems a bit unsatisfying.
It also won't work reliably for locks in DSM segments, since the lock can be mapped at a different address in different backends.
> We could probably do a bit better: We could just store the information about
> tranche / offset within tranche at LWLockInitialize() time, instead of
> computing something just before waiting. While LWLock.tranche is only
> 16 bits right now, the following two bytes are currently padding...
> That'd allow us to have proper numerical identification for nearly all
> tranches, without needing to go back to the complexity of having tranches
> specify base & stride.
That sounds appealing. It'd work for any lock in MainLWLockArray - all built-in individual LWLocks, LWTRANCHE_BUFFER_MAPPING, LWTRANCHE_LOCK_MANAGER, LWTRANCHE_PREDICATE_LOCK_MANAGER, and any lock allocated by RequestNamedLWLockTranche().
Some of the other tranches allocate locks in contiguous fixed blocks or in ways that would let them maintain a counter.
We'd need some kind of "unknown" placeholder value for LWLocks where that doesn't make sense, though - for example, most locks allocated by callers that make their own LWLockNewTrancheId() call, and locks in some of the built-in tranches that aren't allocated in MainLWLockArray.
So I suggest retaining the current LWLockInitialize() and making it a wrapper for LWLockInitializeWithIndex() or similar. Use 1-based indexing and keep 0 as the "unknown" value, or use 0-based indexing and reserve (max-1) as unknown.
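A rough sketch of that shape, assuming a hypothetical 16-bit index stored
in the two padding bytes after LWLock.tranche, with 0 reserved as the
"unknown" value (the new function name and the LWLock field are
placeholders, not a posted API):

/* Hypothetical: 0 means "this lock can't be individually identified". */
#define LWLOCK_INDEX_UNKNOWN    0

void
LWLockInitializeWithIndex(LWLock *lock, int tranche_id, uint16 idx)
{
    pg_atomic_init_u32(&lock->state, LW_FLAG_RELEASE_OK);
#ifdef LOCK_DEBUG
    pg_atomic_init_u32(&lock->nwaiters, 0);
#endif
    lock->tranche = tranche_id;
    lock->index = idx;          /* hypothetical new field in the padding */
    proclist_init(&lock->waiters);
}

/* The existing entry point stays source-compatible. */
void
LWLockInitialize(LWLock *lock, int tranche_id)
{
    LWLockInitializeWithIndex(lock, tranche_id, LWLOCK_INDEX_UNKNOWN);
}

Callers that know a lock's position within its tranche (e.g. the
MainLWLockArray setup) would pass it explicitly; everyone else keeps
calling LWLockInitialize() and reports "unknown" for free.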
On Wed, 23 Dec 2020 at 15:51, Craig Ringer <craig.ringer@enterprisedb.com> wrote:
> I've struggled with this quite a bit myself.
By the way, I sent in a patch to enhance the static tracepoints available for LWLocks. See https://www.postgresql.org/message-id/CAGRY4nxJo+-HCC2i5H93ttSZ4gZO-FSddCwvkb-qAfQ1zdXd1w@mail.gmail.com .
It'd benefit significantly from the sort of changes you mentioned in #4. For most purposes I've been able to just use the raw LWLock*, but having a nice neat (tranche, index) value would be ideal.
The trace patch has helped me identify some excessively long LWLock waits in tools I work on. I'll share another of the systemtap scripts I used with it soon.