Re: [PATCH] Identify LWLocks in tracepoints - Mailing list pgsql-hackers

From: Craig Ringer
Subject: Re: [PATCH] Identify LWLocks in tracepoints
Msg-id: CAGRY4ny7NGwvq1Ntb_dTaxD0i5OHz1nA2RbezxU9Suu8R+4RaQ@mail.gmail.com
In response to: Re: [PATCH] Identify LWLocks in tracepoints (Peter Eisentraut <peter.eisentraut@enterprisedb.com>)
Responses: Re: [PATCH] Identify LWLocks in tracepoints (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
On Mon, 22 Mar 2021 at 16:38, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:

> First, a problem: 0002 doesn't build on macOS, because uint64 has been
> used in the probe definitions. That needs to be handled like the other
> nonnative types in that file.

Will fix.

> All the probe changes and additions should be accompanied by
> documentation changes.

Agreed, will fix.

> The probes used to have an argument to identify the lock, which was
> removed by 3761fe3c20bb040b15f0e8da58d824631da00caa.

Huh. That's exactly the functionality I was looking for. Damn. I understand why Robert removed it, but its removal makes it much harder to identify an LWLock since it might fall in a DSM segment that could be mapped at different base addresses in different backends.

Robert's patch didn't replace the offset within the tranche with anything else to identify the lock. An LWLock* is imperfect due to ASLR and DSM, but it's better than nothing. In theory we could even remap LWLock pointers in trace tools if we had tracepoints on DSM attach and detach that reported each segment's size and base address too.
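
Purely as a sketch (these probes don't exist in any posted patch and the names are made up, though the dsm_segment fields used are real), the attach side might look like:

    /*
     * Hypothetical probe in dsm_attach(): report the segment handle plus its
     * backend-local mapping, so a trace tool can translate pointers that fall
     * inside the segment into stable (handle, offset) pairs. A matching probe
     * in dsm_detach() would drop the mapping again.
     */
    TRACE_POSTGRESQL_DSM_ATTACH(seg->handle,
                                seg->mapped_address,
                                seg->mapped_size);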

CC'ing Andres, as he expressed interest in being able to globally identify LWLocks too.
 
> The 0001 patch is
> essentially trying to reinstate that, which seems sensible.  Perhaps we
> should also use the argument order that used to be there.  It used to be
>
>     probe lwlock__acquire(const char *, int, LWLockMode);
>
> and now it would be
>
>     probe lwlock__acquire(const char *, LWLockMode, LWLock*, int);
>
> Also, do we need both the tranche name and the tranche id?

Reasons to have the name:

* There is no easy way to look up the tranche name by ID from outside the backend
* A tranche ID by itself is pretty much meaningless especially for dynamic tranches
* Any existing scripts will rely on the tranche name

So the tranche name is really required to generate useful output for any dynamic tranches, or simple and readable output from things like perf.

Reasons to have the tranche ID:

* The tranche name string is not guaranteed to be at the same address for a given tranche across backends in the presence of ASLR, even for built-in tranches. So tools need to read tranche names as user-space strings, which is much more expensive than consuming an int argument from the trace args. Storing and reporting maps of events keyed by tranche name (a string) in tools is also more expensive than keying them by tranche ID.
* When a trace tool or script wants to filter for one particular tranche, particularly a built-in tranche whose tranche ID is known, having the ID is much more useful and efficient.
* If we can avoid computing the tranche name, emitting just the tranche ID would be much faster.
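
To make that concrete: today the call site is roughly

    TRACE_POSTGRESQL_LWLOCK_ACQUIRE(T_NAME(lock), mode);

and with 0001 it would become something like this (using 0001's argument order purely for illustration, since the order is still being discussed):

    TRACE_POSTGRESQL_LWLOCK_ACQUIRE(T_NAME(lock), mode, lock, lock->tranche);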

It's annoying that we have to pay the cost of computing the tranche name though. It never used to matter, but now that T_NAME() expands to GetLWTrancheName() calls as of 29c3e2dd5a6 it's going to cost a little more on such a hot path. I might see if I can do a little comparison and see how much.
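
For reference, T_NAME() now boils down to roughly

    #define T_NAME(lock) GetLWTrancheName((lock)->tranche)

so every unguarded tracepoint's argument list pays for that function call even when nothing is attached.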

I could add TRACE_POSTGRESQL_<<tracepointname>>_ENABLED() guards since we do in fact build with SDT semaphore support. That adds a branch for each tracepoint, but they're already marshalling arguments and making a function call that does lots more than a single branch, so that seems pretty sensible. The main downside of using _ENABLED() USDT semaphore guards is that not all tools are guaranteed to understand or support them. So an older perf, for example, might simply fail to fire events on guarded probes. That seems OK to me, the onus should be on the probe tool to pay any costs, not on PostgreSQL. Despite that I don't want to mark the _ENABLED() guards unlikely(), since that'd increase the observer effect where probing LWLocks changes their timing and behaviour. Branch prediction should do a very good job in such cases without being forced.
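
Concretely, the guarded call would look something like this (a sketch only, using the same illustrative argument order as above):

    /* Only evaluate T_NAME() and the other args if a tracer is attached. */
    if (TRACE_POSTGRESQL_LWLOCK_ACQUIRE_ENABLED())
        TRACE_POSTGRESQL_LWLOCK_ACQUIRE(T_NAME(lock), mode,
                                        lock, lock->tranche);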

I wonder a little about the possible cache costs of the _ENABLED() macros though. Their data is in a separate ELF segment and separate .o, with no locality to the traced code. It might be worth checking that before proceeding; I guess it's even possible that the GetLWTrancheName() calls could be cheaper. Will see if I can run some checks and report back.
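
For anyone following along, the semaphore behind each _ENABLED() macro is a per-probe counter that the tracer bumps when it attaches, and it lives in its own section. What the generated header provides is roughly this (exact names and expansion vary with the dtrace/systemtap toolchain and our header post-processing):

    /* Per-probe semaphore, emitted into a separate ".probes" section. */
    extern unsigned short postgresql_lwlock__acquire_semaphore
        __attribute__((section(".probes")));

    #define TRACE_POSTGRESQL_LWLOCK_ACQUIRE_ENABLED() \
        (postgresql_lwlock__acquire_semaphore > 0)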

BTW, if you want some of the details on how userspace SDTs work, https://leezhenghui.github.io/linux/2019/03/05/exploring-usdt-on-linux.html is interesting and useful. It helps explain uprobes, ftrace, bcc, etc.

> Or maybe we
> don't need the name, or can record it differently, which might also
> address your other concern that it's too expensive to compute.  In any
> case, I think an argument order like
>
>     probe lwlock__acquire(const char *, int, LWLock*, LWLockMode);
>
> would make more sense.

OK.

> In 0004, you add a probe to record the application_name setting?  Would
> there be any value in making that a generic probe that can record any
> GUC change?

Yes, there would, but I didn't want to go and do that in the same patch, and a named probe on application_name is useful independently of having probes on every GUC.

There's value in having an easily targeted, specifically named probe for application_name, since it's of obvious interest and utility to probing and tracing tools. A probe specifically on application_name means a probing script doesn't have to fire an event for every GUC change, copy the GUC name string, strcmp() it to see whether it's the GUC of interest, and so on. So specific probes on "major" GUCs like this are IMO very useful.

(It'd be possible to instead generate probes for each GUC at compile time using the preprocessor and the DTRACE_ macros. But as noted above, that doesn't currently work properly in the same compilation unit in which a dtrace script-generated probes.h is included. I think it's probably nicer to have specific probes for GUCs of high interest, plus generic probes that capture all GUCs anyway.)
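
To be explicit about what I mean by the DTRACE_ macros: they're the generic helpers from <sys/sdt.h>, which let you emit a static probe inline without declaring it in probes.d. Illustrative only (the probe names are made up, and name/newval stand for the GUC's name and new value at whatever call site this lands in):

    #include <sys/sdt.h>

    /* Generic: fires on every GUC assignment, passing name and new value. */
    DTRACE_PROBE2(postgresql, guc__set, name, newval);

    /* Per-GUC: what a preprocessor-generated probe might boil down to. */
    DTRACE_PROBE1(postgresql, guc__set__application_name, newval);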

There are a TON of probes I want to add, and I have a tree full of them waiting to submit progressively. Yes, the ability to probe all GUCs is in there. So is detail on walsender, reorder buffer, and snapshot builder activity. Heavyweight lock SDTs. A probe that identifies the backend type at startup. SDT probe events emitted for every wait-event. Probes in elog.c to let probes observe error unwinding, capture error messages, etc. (Those can also be used with systemtap guru-mode scripts to do things like turn a particular elog(DEBUG) into a PANIC at runtime for diagnostic purposes.) Probes in shm_mq to observe message passing and blocking. A probe that fires whenever debug_query_string changes. Lots. But I can't submit them all at once, especially without some supporting use cases and scripts that other people can use so they can understand why these probes are useful.

So I figured I'd start here...
