Thread: Improving LWLock wait events
Hi,

The current wait events are already pretty useful. But I think we could
make them more informative without adding real runtime overhead.

1) For lwlocks I think it'd be quite useful to show the mode of
acquisition in pg_stat_activity.wait_event_type, instead of just saying
'LWLock'. I think we should split PG_WAIT_LWLOCK into
PG_WAIT_LWLOCK_{EXCLUSIVE,SHARED,WAIT_UNTIL_FREE}, and report a different
wait_event_type based on the class. The fact that it'd break people
explicitly looking for LWLock in pg_stat_activity doesn't seem to
outweigh the benefits to me.

2) I think it's unhelpful that waits for WAL insertion locks to progress
show up as LWLock acquisitions. LWLockWaitForVar() feels like a distinct
enough operation that passing in a user-specified wait event is worth the
minuscule incremental overhead that'd add. I'd probably just make it a
different wait class, and have xlog.c compute the event based on the
number of the slot being waited for.

3) I have observed waking up other processes as part of a lock release to
be a significant performance factor. I would like to add a separate wait
event type for that. That'd be a near-trivial extension of 1).

I also think there's a 4), but I think the tradeoffs are a bit more
complicated:

4) For a few types of lwlock just knowing the tranche isn't sufficient.
E.g. knowing whether it's one or several different buffer mapping locks
being waited on is important to judge contention.

For wait events right now we use 1 byte for the class, 1 byte is unused,
and 2 bytes are used for event-specific information (the tranche in the
case of lwlocks). Seems like we could change the split to a 4-bit class
and leave 28 bits for the specific wait event? And in the lwlock case we
could make something like 4-bit class, 10-bit tranche, 20-bit
sub-tranche? 20 bits aren't enough to uniquely identify a lock for the
larger tranches (mostly buffer locks, I think), but I think it'd still be
enough to disambiguate.

The hardest part would be to know how to identify individual locks. The
easiest would probably be to just mask in part of the lwlock address
(e.g. shift it right by INTALIGN, and then mask the result into the
eventId). That seems a bit unsatisfying.

We could probably do a bit better: We could just store the information
about tranche / offset within tranche at LWLockInitialize() time, instead
of computing something just before waiting. While LWLock.tranche is only
16 bits right now, the following two bytes are currently padding...

That'd allow us to have proper numerical identification for nearly all
tranches, without needing to go back to the complexity of having tranches
specify base & stride.

Even more API churn around lwlock initialization isn't desirable :(, but
we could just add a LWLockInitializeIdentified() or such.

Greetings,

Andres Freund
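For concreteness, a minimal sketch of how (1) might look in lwlock.c. The
three PG_WAIT_LWLOCK_* constants, their values, and the extra mode
argument are assumptions for illustration, not an existing API; only
PG_WAIT_LWLOCK, LWLockMode and LWLockReportWaitStart() exist today:

/*
 * Hypothetical per-mode wait classes, continuing the existing PG_WAIT_*
 * numbering (values are illustrative only).
 */
#define PG_WAIT_LWLOCK_EXCLUSIVE        0x0B000000U
#define PG_WAIT_LWLOCK_SHARED           0x0C000000U
#define PG_WAIT_LWLOCK_WAIT_UNTIL_FREE  0x0D000000U

static inline void
LWLockReportWaitStart(LWLock *lock, LWLockMode mode)
{
    uint32      classId;

    /* Pick the wait class from the requested acquisition mode. */
    switch (mode)
    {
        case LW_EXCLUSIVE:
            classId = PG_WAIT_LWLOCK_EXCLUSIVE;
            break;
        case LW_SHARED:
            classId = PG_WAIT_LWLOCK_SHARED;
            break;
        default:                /* LW_WAIT_UNTIL_FREE */
            classId = PG_WAIT_LWLOCK_WAIT_UNTIL_FREE;
            break;
    }

    /* Event id stays the tranche, exactly as with PG_WAIT_LWLOCK today. */
    pgstat_report_wait_start(classId | lock->tranche);
}

Point (3) would then just add one more such class, presumably reported
around the wakeup loop in LWLockRelease().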
On Mon, 21 Dec 2020 at 05:27, Andres Freund <andres@anarazel.de> wrote:
> Hi,
> The current wait events are already pretty useful. But I think we could
> make them more informative without adding real runtime overhead.
All 1-3 sound pretty sensible to me.
> I also think there's a 4), but I think the tradeoffs are a bit more
> complicated:
> 4) For a few types of lwlock just knowing the tranche isn't
> sufficient. E.g. knowing whether it's one or several different buffer
> mapping locks being waited on is important to judge contention.
I've struggled with this quite a bit myself.
In particular, for tools that validate acquire-ordering safety, it's desirable to be able to identify a specific lock in a backend-independent way.
> The hardest part would be to know how to identify individual locks. The
> easiest would probably be to just mask in part of the lwlock address
> (e.g. shift it right by INTALIGN, and then mask the result into the
> eventId). That seems a bit unsatisfying.
It also won't work reliably for locks in DSM segments, since the lock can be mapped at a different address in different backends.
> We could probably do a bit better: We could just store the information about
> tranche / offset within tranche at LWLockInitialize() time, instead of
> computing something just before waiting. While LWLock.tranche is only
> 16 bits right now, the following two bytes are currently padding...
> That'd allow us to have proper numerical identification for nearly all
> tranches, without needing to go back to the complexity of having tranches
> specify base & stride.
That sounds appealing. It'd work for any lock in MainLWLockArray - all built-in individual LWLocks, LWTRANCHE_BUFFER_MAPPING, LWTRANCHE_LOCK_MANAGER, LWTRANCHE_PREDICATE_LOCK_MANAGER, and any lock allocated by RequestNamedLWLockTranche().
Some of the other tranches allocate locks in contiguous fixed blocks or in ways that would let them maintain a counter.
We'd need some kind of "unknown" placeholder value for LWLocks where that doesn't make sense, though - for example, most locks allocated by callers that make their own LWLockNewTrancheId() call, and locks in some of the built-in tranches that aren't allocated in MainLWLockArray.
So I suggest retaining the current LWLockInitialize() and making it a wrapper for LWLockInitializeWithIndex() or similar. Use 1-based indexing and keep 0 as the "unknown" value, or use 0-based indexing and reserve (max-1) as unknown.
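A rough sketch of that shape, assuming a hypothetical 16-bit index stored
in the two padding bytes after LWLock.tranche, with 0 reserved as the
"unknown" value (the new function name and the LWLock field are
placeholders, not a posted API):

/* Hypothetical: 0 means "this lock can't be individually identified". */
#define LWLOCK_INDEX_UNKNOWN    0

void
LWLockInitializeWithIndex(LWLock *lock, int tranche_id, uint16 idx)
{
    pg_atomic_init_u32(&lock->state, LW_FLAG_RELEASE_OK);
#ifdef LOCK_DEBUG
    pg_atomic_init_u32(&lock->nwaiters, 0);
#endif
    lock->tranche = tranche_id;
    lock->index = idx;          /* hypothetical new field in the padding */
    proclist_init(&lock->waiters);
}

/* The existing entry point stays source-compatible. */
void
LWLockInitialize(LWLock *lock, int tranche_id)
{
    LWLockInitializeWithIndex(lock, tranche_id, LWLOCK_INDEX_UNKNOWN);
}

Callers that know a lock's position within its tranche (e.g. the
MainLWLockArray setup) would pass it explicitly; everyone else keeps
calling LWLockInitialize() and reports "unknown" for free.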
On Wed, 23 Dec 2020 at 15:51, Craig Ringer <craig.ringer@enterprisedb.com> wrote:
> I've struggled with this quite a bit myself.
By the way, I sent in a patch to enhance the static tracepoints available for LWLocks. See https://www.postgresql.org/message-id/CAGRY4nxJo+-HCC2i5H93ttSZ4gZO-FSddCwvkb-qAfQ1zdXd1w@mail.gmail.com .
It'd benefit significantly from the sort of changes you mentioned in #4. For most purposes I've been able to just use the raw LWLock*, but having a nice neat (tranche, index) value would be ideal.
The trace patch has helped me identify some excessively long LWLock waits in tools I work on. I'll share another of the systemtap scripts I used with it soon.