Re: [PATCH] Identify LWLocks in tracepoints - Mailing list pgsql-hackers

From Craig Ringer
Subject Re: [PATCH] Identify LWLocks in tracepoints
Msg-id CAGRY4nxK65MLTW6Z9aV3wGP0HESgY2dxOUZqtO8oODpHB96zqg@mail.gmail.com
In response to Re: [PATCH] Identify LWLocks in tracepoints  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On Wed, 14 Apr 2021 at 04:46, Andres Freund <andres@anarazel.de> wrote:
>
> On 2021-04-13 14:25:23 -0400, Robert Haas wrote:
> > On Mon, Apr 12, 2021 at 11:06 PM Andres Freund <andres@anarazel.de> wrote:
> > You could identify every lock by a tranche ID + an array offset + a
> > "tranche instance ID". But where would you store the tranche instance
> > ID to make it readily accessible, other than in the lock itself?
> > Andres wasn't thrilled about using even 2 bytes to identify the
> > LWLock, so he'll probably like having more bytes in there for that
> > kind of thing even less.
>
> I still don't like the two bytes, fwiw ;). Especially because it's 4
> bytes due to padding right now.

Aha, did I hear you say "there are two free bytes for me to shove
something marginally useful and irrelevant into"?

(*grin*)
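
(For anyone skimming the thread, this is the layout in question, roughly
as it stands in lwlock.h today; the uint16 tranche plus the alignment of
the 4-byte atomic that follows it is where two bytes turn into four.)

    /* Paraphrased from src/include/storage/lwlock.h (non-LOCK_DEBUG build);
     * check your branch, the details move around between versions. */
    typedef struct LWLock
    {
        uint16      tranche;      /* tranche ID */
        /* 2 bytes of padding here, so the atomic below is 4-byte aligned */
        pg_atomic_uint32 state;   /* shared/exclusive lock state plus flags */
        proclist_head waiters;    /* list of PGPROCs waiting on this lock */
    } LWLock;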

> I'd like to move the LWLock->waiters list to outside of the lwlock
> itself - at most TotalProcs LWLocks can be waited for, so we don't need
> millions of empty proclist_heads. That way we could also remove the
> proclist indirection - which shows up a fair bit in contended workloads.
>
> And if we had a separate "lwlocks being waited for" structure, we could
> also add more information to it if we wanted to...

Having the ability to observe LWLock waiters would be nice. But you're
right to constantly emphasise that LWLocks need to be very slim. We
don't want to turn them into near-heavyweight locks by saddling them
with overhead that's not strictly necessary. A simple pg_regress run
(with cassert, admittedly) takes 25,000,000 LWLocks and does 24,000
LWLock waits and 130,000 condvar waits, all in less than a minute.
OTOH, once someone's waiting we don't care nearly as much about
bookkeeping cost; it's the uncontended fast paths that are most
critical.
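
To illustrate one shape a separate "lwlocks being waited for" structure
could take (a pure strawman with invented names, not something I'm
proposing in this patch set): since a backend sleeps on at most one
LWLock at a time, the bookkeeping could live in an array with one slot
per backend rather than in every lock, and anything we hang off it is
only ever touched by a backend that's already about to sleep.

    /*
     * Strawman only: one slot per backend, populated just before sleeping
     * on an LWLock and cleared on wakeup.  Never touched on the
     * uncontended fast path, so extra detail here is nearly free.
     */
    typedef struct LWLockWaitInfo
    {
        LWLock     *lock;        /* lock being waited on, or NULL */
        LWLockMode  mode;        /* LW_EXCLUSIVE, LW_SHARED, LW_WAIT_UNTIL_FREE */
        TimestampTz wait_start;  /* room for extras: we're sleeping anyway */
    } LWLockWaitInfo;

    /* Hypothetical shared array, one entry per PGPROC, indexed by pgprocno */
    extern LWLockWaitInfo *LWLockWaitTable;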

> - We could make it work if we restricted MAX_BACKENDS to be 2**14 - but
>   while I personally think that's a sane upper limit, I already had a
>   hard time getting consensus to lower the limit to 2^18-1.

16384 backends is totally fine in sane real world deployments. But
it'll probably upset marketing people when OtherDatabaseVendor starts
shouting that they support 14 million connections, and postgres only
has 16k. Sigh.

The real answer here in the long term is probably to decouple executors
from connection state inside postgres. But while we're on that topic,
how about we convert the entire codebase to Rust while riding on a
flying rainbow unicorn? We're stuck with the 1:1 connection-to-executor
mapping for the foreseeable future.

> - Use a 64bit integer. Then we can easily fit MAX_BACKENDS lockers, as
>   well as an offset to one of MAX_BACKENDS "wait lists" into LWLock.

You know much more than me about the possible impacts of that on
layout and caching, but I gather that it's probably undesirable to
make LWLocks any bigger.
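
To spell out the arithmetic as I understand it (my back-of-envelope
reading of your point, so correct me if I've got it wrong):

    /*
     * Purely illustrative bit budget for packing a locker count plus a
     * reference to one of MAX_BACKENDS wait lists into the state word:
     *
     *   32-bit state, MAX_BACKENDS = 2^18 - 1:
     *       18 bits shared-locker count + 18 bits wait-list index
     *       = 36 bits, over budget before any flag bits.
     *
     *   32-bit state, MAX_BACKENDS = 2^14:
     *       14 + 14 = 28 bits, leaving 4 bits for flags: just fits.
     *
     *   64-bit state, MAX_BACKENDS = 2^18 - 1:
     *       18 + 18 = 36 bits, with room to spare for flags.
     */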

> I think it's quite useful for relatively simple things like analyzing
> the total amount of time spent in individual locks, without incurring
> much overhead when not doing so (for which you need to identify
> individual locks, otherwise your end - start time is going to be
> meaningless).

Yep.

That's why the removal of the lock offset is a bit frustrating,
because you can't identify an LWLock instance-wide by LWLock* due to
the possibility of different per-backend DSM base-address mappings.
Well, and ASLR on EXEC_BACKEND systems, but who cares about those?

The removal was for good reasons though, and it only affects LWLocks
in DSM; for everything else the LWLock* is good enough. If identifying
LWLocks in DSM ever matters enough to be worth solving, we can emit
trace events on DSM segment attach in each backend, and trace tools can
do the work of tracking which LWLocks live in DSM and converting their
addresses to offsets from a reference base address. Pg shouldn't have
to pay the price for that unless it's something a lot of people need.
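
If anyone ever does want to chase it, the server side only needs to hand
tools one piece of information per mapping. Something like this (entirely
hypothetical, invented probe name, not part of my patches) would be
enough for a tool to identify any DSM-resident lock as (segment handle,
offset from segment base), which is stable across backends no matter
where each backend happened to map the segment:

    /* Hypothetical: fire once per backend whenever a DSM segment is
     * attached, reporting its handle, mapped base address and length.
     * TRACE_POSTGRESQL_DSM_ATTACH is an invented probe name. */
    static void
    report_dsm_attach_for_tracing(dsm_segment *seg)
    {
        TRACE_POSTGRESQL_DSM_ATTACH(dsm_segment_handle(seg),
                                    dsm_segment_address(seg),
                                    dsm_segment_map_length(seg));
    }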

> And, slightly more advanced, for analyzing what the stack
> was when the lock was released - which then allows you to see what work
> you're blocked on, something I found hard to figure out otherwise.
>
> I found that that's mostly quite doable with dynamic probes though.

Yeah, it is.

That's part of why my patchset here doesn't try to do much to the
LWLock tracepoints - I didn't think much needed adding. The LWLock code
is fairly stable, not usually something you have to worry about in
production unless you're debugging badly behaved extensions, and it's
usually somewhat probeable with DWARF-based dynamic probes. However,
having the wait loop and the fast path in the same function is a
serious pain for dynamic probing; you can't tell the difference between
a fast-path acquire and an acquire after a wait without using probes on
function+offset or probing by source line. Both of those are fine for
dev work but useless in tool/library scripts.

I almost wonder if we should try moving the LWLock wait-loops out of
LWLockAcquire(), LWLockAcquireOrWait() and LWLockWaitForVar() anyway,
so the hot parts of those functions are smaller. That'd make dynamic
probing more convenient as a pleasant side effect. I imagine you've
already tried this, benchmarked and profiled it, and found it to be a
net loss, though, otherwise you surely would've done it as part of your
various (awesome) performance work.
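
Just to make the shape concrete (hand-waving, invented names, and
glossing over interrupt holdoff, error paths and the held_lwlocks
bookkeeping):

    /*
     * Sketch only: move the sleep/retry loop into a separate,
     * never-inlined function so the uncontended path stays tiny and,
     * incidentally, becomes a distinct symbol for dynamic probes.
     */
    static pg_noinline bool LWLockAcquireSlow(LWLock *lock, LWLockMode mode);

    bool
    LWLockAcquire(LWLock *lock, LWLockMode mode)
    {
        /* LWLockAttemptLock() returns true if we'd have to wait */
        if (!LWLockAttemptLock(lock, mode))
        {
            /* fast path: got the lock immediately, just record the hold */
            held_lwlocks[num_held_lwlocks].lock = lock;
            held_lwlocks[num_held_lwlocks++].mode = mode;
            return true;
        }

        /* contended path: wait loop lives out of line */
        return LWLockAcquireSlow(lock, mode);
    }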

Anyway, there are some other areas of postgres that are ridiculously
painful to instrument with dynamic probes, especially in a somewhat
version- and build-independent way. Tracking txn commit and abort
(both 2PC and normal xacts, with capture of commit LSNs), for example,
is painful with dynamic probing, and it's one of my top priority areas
to get decent tracepoints for - the current txn management tracepoints
are utterly worthless. But LWLocks are mostly fine; the only really big
missing piece is a tracepoint fired exactly once when a lock is
released, whichever release path is taken.

> > Like, I've seen problems for example we can see that there's a lot of
> > contention on SubtransSLRULock, or there's problems with
> > WALInsertLock. But I can't really see why I'd need to know which
> > WALInsertLock was experiencing contention.
>
> Well, but you might want to know what the task blocking you was
> doing. What to optimize might differ if the other task is e.g. a log
> switch (which acquires all insert locks), than if it's WAL writes by
> VACUUM.

That sort of thing is why I've been interested in IDing the LWLock.
That, and I work with extension code that probably abuses LWLocks a
bit, but that's not a problem core postgres should have to care about.

> > If we were speaking of buffer content locks, I suppose I can imagine
> > wanting more details, but it's not really the buffer number I'd want
> > to know. I'd want to know the database OID, the relfilenode, the fork
> > number, and the block number. You can argue that we should just expose
> > the buffer number and let the user sort out the rest with
> > dtrace/systemtap magic, but that makes it useless in practice to an
> > awful lot of people, including me.
>
> I have wondered if we ought to put some utilities for that in contrib or
> such. It's a lot easier to address something new with a decent starting
> point...

Long term that's exactly what I want to do.

I wrote some such tools with systemtap, but it's since become clear to
me that systemtap isn't going to get enough people on board; setup is
too much of a hassle. So I'm pivoting to bpf tools now, with the
intention of putting together a library of canned probes and example
scripts to help people get started.

I've written systemtap scripts that can track a transaction from
localxid allocation through real xid allocation, commit (or 2pc
prepare and commit prepared), logical decoding, reorder buffering,
output plugin processing, receipt by a logical downstream, downstream
xid assignment, downstream commit and replorigin advance,
downstream->upstream feedback, upstream slot advance, and upstream
confirmed flush/catalog_xmin advance. Whole-lifecycle tracking with
timing of each phase, across multiple processes and two postgres
instances. For now the two postgres instances must be on the same
host, but that could be dealt with too. The script reports the
application name and pid of the upstream session, the upstream
localxid, the upstream xid, the upstream commit LSN, the downstream xid
and the downstream commit LSN as it goes, and follows associations to
track the transaction through its lifecycle. (The current script is
written for BDR and pglogical, not in-core logical decoding, but the
principles are the same.)

The problem is that scripts like this are just too fragile right now.
Changes across Pg versions break the dynamic function probes they use,
though that can be adapted to, up to a point. The bigger problem is the
number of places where I have to insert statement (function+offset)
probes, which are far too fragile to make these sorts of scripts
generally useful. I have to fix them up whenever I want to use them, so
there's not much point trying to get other people to use them.

But it's hard to convince people of the value of static tracepoints,
which would make this sort of thing much easier to do in a stable way,
when they can't easily see hands-on examples of what's possible.
There's no "wow" factor. So I need to address the worst of the
difficult-to-probe sites and start sharing some examples that use the
resulting tracepoints.

I thought this would be low-hanging fruit to start with. Whoops!


