Re: [PATCH] Identify LWLocks in tracepoints - Mailing list pgsql-hackers
From: Craig Ringer
Subject: Re: [PATCH] Identify LWLocks in tracepoints
Msg-id: CAGRY4nxK65MLTW6Z9aV3wGP0HESgY2dxOUZqtO8oODpHB96zqg@mail.gmail.com
In response to: Re: [PATCH] Identify LWLocks in tracepoints (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
On Wed, 14 Apr 2021 at 04:46, Andres Freund <andres@anarazel.de> wrote:
>
> On 2021-04-13 14:25:23 -0400, Robert Haas wrote:
> > On Mon, Apr 12, 2021 at 11:06 PM Andres Freund <andres@anarazel.de> wrote:
> > You could identify every lock by a tranche ID + an array offset + a
> > "tranche instance ID". But where would you store the tranche instance
> > ID to make it readily accessible, other than in the lock itself?
> > Andres wasn't thrilled about using even 2 bytes to identify the
> > LWLock, so he'll probably like having more bytes in there for that
> > kind of thing even less.
>
> I still don't like the two bytes, fwiw ;). Especially because it's 4
> bytes due to padding right now.

Aha, did I hear you say "there are two free bytes for me to shove
something marginally useful and irrelevant into"?

(*grin*)

> I'd like to move the LWLock->waiters list to outside of the lwlock
> itself - at most TotalProcs LWLocks can be waited for, so we don't need
> millions of empty proclist_heads. That way we could also remove the
> proclist indirection - which shows up a fair bit in contended workloads.
>
> And if we had a separate "lwlocks being waited for" structure, we could
> also add more information to it if we wanted to...

Having the ability to observe LWLock waiters would be nice. But you're
right to constantly emphasise that LWLocks need to be very slim. We don't
want to turn them into near-heavyweight locks by saddling them with
overhead that isn't strictly necessary. A simple pg_regress run (with
cassert, admittedly) acquires 25,000,000 LWLocks but does only 24,000
LWLock waits and 130,000 condvar waits, all in less than a minute. OTOH,
once someone's waiting we don't care nearly as much about bookkeeping
cost; it's the uncontested fast paths that are most critical.

> - We could make it work if we restricted MAX_BACKENDS to be 2**14 - but
> while I personally think that's a sane upper limit, I already had a
> hard time getting consensus to lower the limit to 2^18-1.

16384 backends is totally fine in sane real-world deployments, but it'll
probably upset the marketing people when OtherDatabaseVendor starts
shouting that they support 14 million connections and postgres only has
16k. Sigh.

The real long-term answer probably has to be decoupling executors from
connection state inside postgres. But while we're on that topic, how about
we convert the entire codebase to Rust while riding on a flying rainbow
unicorn? We're stuck with the 1:1 connection-to-executor mapping for the
foreseeable future.

> - Use a 64bit integer. Then we can easily fit MAX_BACKENDS lockers, as
> well as an offset to one of MAX_BACKENDS "wait lists" into LWLock.

You know much more than I do about the possible impact of that on layout
and caching, but I gather it's probably undesirable to make LWLocks any
bigger.

> I think it's quite useful for relatively simple things like analyzing
> the total amount of time spent in individual locks, without incuring
> much overhead when not doing so (for which you need to identify
> individual locks, otherwise your end - start time is going to be
> meaningless).

Yep. That's why the removal of the lock offset is a bit frustrating: you
can't identify an LWLock instance-wide by its LWLock* because of the
possibility of different per-backend DSM base-address mappings. Well, and
ASLR on EXEC_BACKEND systems, but who cares about those? The removal was
for good reasons, though, and it only affects LWLocks in DSM; for
everything else the LWLock* is good enough.
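To illustrate the simple timing case: even aggregating only by the tranche
name, which the existing markers already pass as their first argument,
gets you a useful picture. Something roughly like this stap sketch -
assuming a --enable-dtrace build so the lwlock__wait__start and
lwlock__wait__done markers from src/backend/utils/probes.d are present,
and with a placeholder binary path:

global wait_start, wait_total

probe process("/usr/local/pgsql/bin/postgres").mark("lwlock__wait__start")
{
    # $arg1 is the tranche name, $arg2 the requested lock mode
    wait_start[tid()] = gettimeofday_us()
}

probe process("/usr/local/pgsql/bin/postgres").mark("lwlock__wait__done")
{
    t = wait_start[tid()]
    if (t) {
        wait_total[user_string($arg1)] += gettimeofday_us() - t
        delete wait_start[tid()]
    }
}

probe end
{
    # top 20 tranches by total time spent waiting
    foreach (tranche in wait_total- limit 20)
        printf("%-30s %12d us\n", tranche, wait_total[tranche])
}

Of course that tells you nothing about which lock inside a many-member
tranche (the buffer content locks, say) was the hot one, which is where
instance identification comes back in.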
If identifying individual LWLocks in DSM ever matters enough to be worth
solving, we can emit trace events on DSM mapping attach in each backend,
and trace tools can do the work of tracking which LWLocks live in DSM and
converting their addresses to a reference base address. Pg shouldn't have
to pay the price for that unless it's something a lot of people need.

> And, slightly more advanced, for analyzing what the stack
> was when the lock was released - which then allows you to see what work
> you're blocked on, something I found hard to figure out otherwise.
>
> I found that that's mostly quite doable with dynamic probes though.

Yeah, it is. That's part of why my patchset here doesn't try to do much to
the LWLock tracepoints - I didn't think it was necessary to add a lot. The
LWLock code is fairly stable, it's not usually something you have to worry
about in production unless you're debugging badly behaved extensions, and
it's usually somewhat probe-able with DWARF-based dynamic probes.

However, the way the wait loop and the fast path live in the same function
is a serious pain for dynamic probing: you can't tell the difference
between a fast-path acquire and an acquire after a wait without probing by
function+offset or by source line. Both of those are fine for dev work but
useless in tool/library scripts.

I almost wonder if we should try moving the LWLock wait loops out of
LWLockAcquire(), LWLockAcquireOrWait() and LWLockWaitForVar() anyway, so
the hot parts of those functions are smaller. That'd make dynamic probing
more convenient as a pleasant side effect. I imagine you must've tried
this, benchmarked and profiled it, though, and found it to be a net loss,
otherwise you surely would've done it as part of your various (awesome)
performance work.

Anyway, there are some other areas of postgres that are ridiculously
painful to instrument with dynamic probes, especially in a somewhat
version- and build-independent way. Tracking txn commit and abort
(including 2PC as well as normal xacts, with capture of commit LSNs) is
painful with dynamic probing, for example, and is one of my top-priority
areas to get some decent tracepoints for - the current txn management
tracepoints are utterly worthless. But LWLocks are mostly fine; the only
really big piece missing is a tracepoint fired exactly once when a lock is
released by any release path.

> > Like, I've seen problems for example we can see that there's a lot of
> > contention on SubtransSLRULock, or there's problems with
> > WALInsertLock. But I can't really see why I'd need to know which
> > WALInsertLock was experiencing contention.
>
> Well, but you might want to know what the task blocking you was
> doing. What to optimize might differ if the other task is e.g. a log
> switch (which acquires all insert locks), than if it's WAL writes by
> VACUUM.

That sort of thing is why I've been interested in IDing the LWLock. That,
and I work with extension code that probably abuses LWLocks a bit - but
that's not a problem core postgres should have to care about.
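As a sketch of that release-stack trick using nothing but the existing
markers, something like the following shows what the releasing tasks were
doing on the WAL insert locks. It assumes the tranche name is "WALInsert",
uses a placeholder binary path, and needs debuginfo available so the
user-space stacks symbolise (e.g. run stap with
-d /usr/local/pgsql/bin/postgres --ldd):

global release_stacks

probe process("/usr/local/pgsql/bin/postgres").mark("lwlock__release")
{
    # $arg1 is the tranche name; record the releasing backend's stack
    if (user_string($arg1) == "WALInsert")
        release_stacks[ubacktrace()]++
}

probe end
{
    # most frequent code paths that were holding a WAL insert lock
    foreach (stack in release_stacks- limit 5) {
        print_ustack(stack)
        printf("    -- %d releases\n\n", release_stacks[stack])
    }
}

It still can't tell you which of the insert locks was involved, but it
does answer the "log switch or WAL writes by VACUUM?" kind of question.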
> > If we were speaking of buffer content locks, I suppose I can imagine
> > wanting more details, but it's not really the buffer number I'd want
> > to know. I'd want to know the database OID, the relfilenode, the fork
> > number, and the block number. You can argue that we should just expose
> > the buffer number and let the user sort out the rest with
> > dtrace/systemtap magic, but that makes it useless in practice to an
> > awful lot of people, including me.
>
> I have wondered if we ought to put some utilities for that in contrib or
> such. It's a lot easier to address something new with a decent starting
> point...

Long term that's exactly what I want to do. I wrote some with systemtap,
but it's since become clear to me that systemtap isn't going to get enough
people on board - setup is a hassle. So I'm pivoting to the bpf tools now,
with the intention of putting together a library of canned probes and
example scripts to help people get started.

I've written systemtap scripts that can track a transaction from localxid
allocation through real xid allocation, commit (or 2PC prepare and commit
prepared), logical decoding, reorder buffering, output plugin processing,
receipt by a logical downstream, downstream xid assignment, downstream
commit and replorigin advance, downstream->upstream feedback, upstream
slot advance, and upstream confirmed flush / catalog_xmin advance. That's
whole-lifecycle tracking, with timing of each phase, across multiple
processes and two postgres instances. For now the two instances must be on
the same host, but that could be dealt with too. The script reports the
application name and pid of the upstream session, the upstream localxid,
the upstream xid, the upstream commit lsn, the downstream xid and the
downstream commit lsn as it goes, and follows the associations to track
the transaction through its lifecycle. (The current script is written for
BDR and pglogical, not in-core logical decoding, but the principles are
the same.)

The problem is that scripts like this are just too fragile right now.
Changes across Pg versions break the dynamic function probes they use,
though that can be adapted to somewhat. The bigger problem is the number
of places where I have to insert statement (function+offset) probes, which
are far too fragile to make these sorts of scripts generally useful. I
have to fix them up whenever I want to use them, so there's not much point
trying to get other people to use them.

But it's hard to convince people of the value of static tracepoints that
would make this sort of thing much easier to do in a more stable manner
when they can't easily see hands-on examples of what's possible. There's
no "wow" factor. So I need to address the worst of the difficult-to-probe
sites and start sharing some examples that use them.

I thought this would be low-hanging fruit to start with. Whoops!
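PS: to make the fragility point concrete, here's the contrast in
miniature. The first probe is the kind those lifecycle scripts are full of
today - the line number is purely illustrative, needs debuginfo, and has
to be re-checked against every build. The second stays stable across
versions on any --enable-dtrace build. Binary paths are placeholders
again:

# dynamic statement probe: breaks whenever the function's source shifts
# even slightly between versions or builds
probe process("/usr/local/pgsql/bin/postgres").statement("LWLockAcquire@lwlock.c:1234")
{
    printf("pid %d reached the wait path\n", pid())
}

# static marker: same logical event, no debuginfo needed, stable name
probe process("/usr/local/pgsql/bin/postgres").mark("lwlock__wait__start")
{
    printf("pid %d waiting on %s\n", pid(), user_string($arg1))
}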