Thread: [PATCH] Identify LWLocks in tracepoints
Hi all
The attached patch set follows on from the discussion in [1] "Add LWLock blocker(s) information" by adding the actual LWLock* and the numeric tranche ID to each LWLock related TRACE_POSTGRESQL_foo tracepoint.
This does not provide complete information on blockers, because it's not necessarily valid to compare any two LWLock* pointers between two process address spaces. The locks could be in DSM segments, and those DSM segments could be mapped at different addresses.
I wasn't able to work out a sensible way to map a LWLock* to any sort of (tranche-id, lock-index) because there's no requirement that locks in a tranche be contiguous or known individually to the lmgr.
Despite that, the patches improve the information available for LWLock analysis significantly.
Patch 1 fixes a bogus tracepoint where an lwlock__acquire event would be fired from LWLockWaitForVar, despite that function never actually acquiring the lock.
Patch 2 adds the tranche id and lock pointer for each trace hit. This makes it possible to differentiate between individual locks within a tranche, and (so long as they aren't tranches in a DSM segment) compare locks between processes. That means you can do lock-order analysis etc, which was not previously especially feasible. Traces also don't have to do userspace reads for the tranche name all the time, so the trace can run with lower overhead.
Patch 3 adds a single-path tracepoint for all lock acquires and releases, so you only have to probe the lwlock__acquired and lwlock__release events to see all acquires/releases, whether conditional or otherwise. It also adds start markers that can be used for timing wallclock duration of LWLock acquires/releases.
Patch 4 adds some comments on LWLock tranches to try to address some points I found confusing and hard to understand when investigating this topic.
Attachment
On Mon, 28 Dec 2020 at 20:09, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Hi Craig,
On Sat, Dec 19, 2020 at 2:00 PM Craig Ringer
<craig.ringer@enterprisedb.com> wrote:
>
> Hi all
>
> The attached patch set follows on from the discussion in [1] "Add LWLock blocker(s) information" by adding the actual LWLock* and the numeric tranche ID to each LWLock related TRACE_POSTGRESQL_foo tracepoint.
>
> This does not provide complete information on blockers, because it's not necessarily valid to compare any two LWLock* pointers between two process address spaces. The locks could be in DSM segments, and those DSM segments could be mapped at different addresses.
>
> I wasn't able to work out a sensible way to map a LWLock* to any sort of (tranche-id, lock-index) because there's no requirement that locks in a tranche be contiguous or known individually to the lmgr.
>
> Despite that, the patches improve the information available for LWLock analysis significantly.
>
> Patch 1 fixes a bogus tracepoint where an lwlock__acquire event would be fired from LWLockWaitForVar, despite that function never actually acquiring the lock.
>
> Patch 2 adds the tranche id and lock pointer for each trace hit. This makes it possible to differentiate between individual locks within a tranche, and (so long as they aren't tranches in a DSM segment) compare locks between processes. That means you can do lock-order analysis etc, which was not previously especially feasible. Traces also don't have to do userspace reads for the tranche name all the time, so the trace can run with lower overhead.
>
> Patch 3 adds a single-path tracepoint for all lock acquires and releases, so you only have to probe the lwlock__acquired and lwlock__release events to see all acquires/releases, whether conditional or otherwise. It also adds start markers that can be used for timing wallclock duration of LWLock acquires/releases.
>
> Patch 4 adds some comments on LWLock tranches to try to address some points I found confusing and hard to understand when investigating this topic.
>
You sent in your patch to pgsql-hackers on Dec 19, but you did not
post it to the next CommitFest[1]. If this was intentional, then you
need to take no action. However, if you want your patch to be
reviewed as part of the upcoming CommitFest, then you need to add it
yourself before 2021-01-01 AoE[2]. Thanks for your contributions.
Regards,
[1] https://commitfest.postgresql.org/31/
[2] https://en.wikipedia.org/wiki/Anywhere_on_Earth
Thanks.
CF entry created at https://commitfest.postgresql.org/32/2927/ . I don't think it's urgent and will have limited review time so I didn't try to wedge it into the current CF.
On Sat, 19 Dec 2020 at 13:00, Craig Ringer <craig.ringer@enterprisedb.com> wrote:
Hi all
The attached patch set follows on from the discussion in [1] "Add LWLock blocker(s) information" by adding the actual LWLock* and the numeric tranche ID to each LWLock related TRACE_POSTGRESQL_foo tracepoint.
I've attached a systemtap script that makes use of the information exported by the enhanced LWLock tracepoints. It offers something akin to dynamic -DLWLOCK_STATS with automatic statistical aggregation and some selective LWLOCK_DEBUG output.
The script isn't super pretty. I didn't try to separate event-data collection from results output, and there's some duplication in places. But it gives you an idea what's possible when we report lock pointers and tranche IDs to tracepoints and add entry/exit tracepoints.
Key features:
* Collect statistical aggregates on lwlock hold and wait durations across all processes. Stats are grouped by lockmode (shared or exclusive) and by tranche name, as well as rollup stats across all tranches.
* Report lock wait and hold durations for each process when that process exits. Again, reported by mode and tranche name.
* For long lock waits, print the waiter pid and the blocker pid, along with each process's backend type and application_name (if known), the acquire mode, and the acquire function.
The output is intended to be human-readable, but it'd be quite simple to convert it into raw tsv-style output suitable for ingestion into statistical postprocessing or graphing tools.
It should be fairly easy to break down the stats by acquire function if desired, so that LWLockAcquire(), LWLockWaitForVar(), and LWLockAcquireOrWait() are reported separately; they're combined in the current output.
Capturing the current query string is pretty simple if needed, but I didn't think it was likely to be especially useful.
Sample output for a pg_regress run is attached; an abridged version follows. The !!W!! lines are "waited a long time", the !!H!! lines are "held a long time". Each line then has the format [pid]:MyBackendType tranche_name wait_time_us (wait_time) in wait_func (application_name) => [blocker_pid] (blocker_application_name). If the blocker pid wasn't identified it won't be reported - I know how to fix that and will do so soon.
!!W!! [ 93030]:3 BufferContent 12993 (0m0.012993s) in lwlock__acquire__start (pg_regress/text)
!!W!! [ 93036]:3 LockManager 14540 (0m0.014540s) in lwlock__acquire__start (pg_regress/float8) => [ 93045] (pg_regress/regproc)
!!W!! [ 93035]:3 BufferContent 12608 (0m0.012608s) in lwlock__acquire__start (pg_regress/float4) => [ 93034] (pg_regress/oid)
!!W!! [ 93036]:3 LockManager 10301 (0m0.010301s) in lwlock__acquire__start (pg_regress/float8)
!!W!! [ 93043]:3 LockManager 10356 (0m0.010356s) in lwlock__acquire__start (pg_regress/pg_lsn)
!!H!! [ 93033]:3 BufferContent 20579 (0m0.020579s) (pg_regress/int8)
!!W!! [ 93027]:3 BufferContent 10766 (0m0.010766s) in lwlock__acquire__start (pg_regress/char) => [ 93037] (pg_regress/bit)
!!W!! [ 93036]:3 OidGen 12876 (0m0.012876s) in lwlock__acquire__start (pg_regress/float8)
...
Then the summary rollup at the end of the run. This can also be output periodically during the run. Abbreviated for highlights:
wait locks: all procs tranche mode count total avg variance min max
W LW_EXCLUSIVE (all) E 54185 14062734 259 1850265 1 44177
W LW_SHARED (all) S 3668 1116022 304 1527261 2 18642
held locks: all procs tranche mode count total avg variance min max
H LW_EXCLUSIVE (all) E 10438060 153077259 14 37035 1 195043
H LW_SHARED (all) S 14199902 65466934 4 5318 1 44030
all procs by tranche tranche mode count total avg variance min max
W tranche (all) S 3668 1116022 304 1527261 2 18642
W tranche (all) E 54185 14062734 259 1850265 1 44177
W tranche WALInsert E 9839 2393229 243 1180294 2 14209
W tranche BufferContent E 3012 1726543 573 3869409 2 28186
W tranche BufferContent S 1664 657855 395 2185694 2 18642
W tranche LockFastPath E 28314 6327801 223 1278053 1 26133
W tranche LockFastPath S 87 59401 682 3703217 19 9454
W tranche LockManager E 7223 2764392 382 2514863 2 44177
Hope this is interesting to someone.
Attachment
On Wed, 13 Jan 2021 at 19:19, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> On Sat, Dec 19, 2020 at 01:00:01PM +0800, Craig Ringer wrote:
>
> The attached patch set follows on from the discussion in [1] "Add LWLock
> blocker(s) information" by adding the actual LWLock* and the numeric
> tranche ID to each LWLock related TRACE_POSTGRESQL_foo tracepoint.
>
> This does not provide complete information on blockers, because it's not
> necessarily valid to compare any two LWLock* pointers between two process
> address spaces. The locks could be in DSM segments, and those DSM segments
> could be mapped at different addresses.
>
> I wasn't able to work out a sensible way to map a LWLock* to any sort of
> (tranche-id, lock-index) because there's no requirement that locks in a
> tranche be contiguous or known individually to the lmgr.
>
> Despite that, the patches improve the information available for LWLock
> analysis significantly.
Thanks for the patches, this could be indeed useful. I've looked through
and haven't noticed any issues with either the tracepoint extensions or
commentaries, except that I find it is not that clear how tranche_id
indicates a re-initialization here?
/* Re-initialization of individual LWLocks is not permitted */
Assert(tranche_id >= NUM_INDIVIDUAL_LWLOCKS || !IsUnderPostmaster);
There should be no reason for anything to call LWLockInitialize(...) on an individual LWLock, since they are all initialized during postmaster startup.
Doing so must be a bug.
But that's a trivial change that can be done separately.
> Patch 2 adds the tranche id and lock pointer for each trace hit. This makes
> it possible to differentiate between individual locks within a tranche, and
> (so long as they aren't tranches in a DSM segment) compare locks between
> processes. That means you can do lock-order analysis etc, which was not
> previously especially feasible.
I'm curious in which kind of situations lock-order analysis could be
helpful?
If code-path 1 does
LWLockAcquire(LockA, LW_EXCLUSIVE);
...
LWLockAcquire(LockB, LW_EXCLUSIVE);
and code-path 2 does:
LWLockAcquire(LockB, LW_EXCLUSIVE);
...
LWLockAcquire(LockA, LW_EXCLUSIVE);
then they're subject to deadlock. But you might not actually hit that often in test workloads if the timing required for the deadlock to occur is tight and/or occurs on infrequent operations.
It's not always easy to reason about or prove things about lock order when they're potentially nested deep within many layers of other calls and callbacks. Obviously something we try to avoid with LWLocks, but not impossible.
If you trace a workload and derive all possible nestings of lock acquire order, you can then prove things about whether there are any possible ordering conflicts and where they might arise.
A PoC to do so is on my TODO.
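To make the idea concrete, here's a purely illustrative sketch in C (not the PoC mentioned above, and independent of any particular trace tool) of the post-processing step: feed it, per process, the set of locks already held each time an lwlock__acquired event fires, and it flags any pair of locks that has been observed being taken in both orders.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Illustrative only: lock identity is whatever the trace captured (e.g. LWLock*). */
    typedef struct
    {
        const void *held;       /* lock already held by the process */
        const void *acquired;   /* lock acquired while "held" was held */
    } LockPair;

    #define MAX_PAIRS 65536

    static LockPair observed[MAX_PAIRS];
    static size_t   n_observed;

    static bool
    pair_observed(const void *held, const void *acquired)
    {
        for (size_t i = 0; i < n_observed; i++)
            if (observed[i].held == held && observed[i].acquired == acquired)
                return true;
        return false;
    }

    /*
     * Call once per lwlock__acquired trace event, passing the locks the same
     * process already holds.  Records each (held, acquired) ordering and
     * reports a potential deadlock whenever the reverse ordering was also seen.
     */
    void
    note_acquire(const void **held_locks, size_t n_held, const void *acquired)
    {
        for (size_t i = 0; i < n_held; i++)
        {
            if (!pair_observed(held_locks[i], acquired) && n_observed < MAX_PAIRS)
                observed[n_observed++] = (LockPair) {held_locks[i], acquired};

            if (pair_observed(acquired, held_locks[i]))
                fprintf(stderr, "possible lock-order conflict: %p <-> %p\n",
                        held_locks[i], acquired);
        }
    }

In practice the "held" set falls out of pairing acquire/release events per pid, and the lock identity would need to be the tranche ID plus something stable across address spaces (or the raw LWLock* for non-DSM tranches), per the caveats earlier in the thread.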
> Traces also don't have to do userspace reads for the tranche name all
> the time, so the trace can run with lower overhead.
This one is also interesting. Just for me to clarify, wouldn't there be
a bit of overhead anyway (due to switching from kernel context to user
space when a tracepoint was hit) that will mask name read overhead? Or
are there any available numbers about it?
I don't have numbers on that. Whether it matters will depend way too much on how you're using the probe points and collecting/consuming the data anyway.
It's a bit unfortunate (IMO) that we make a function call for each tracepoint invocation to get the tranche names. Ideally I'd prefer to be able to omit the tranche names lookups for these probes entirely for something as hot as LWLocks. But it's a bit of a pain to look up the tranche names from an external trace tool, so instead I'm inclined to see if we can enable systemtap's semaphores and only compute the tranche name if the target probe is actually enabled. But that'd be separate to this patch and require a build change in how systemtap support is compiled and linked.
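For illustration, a guarded call site might look something like this (a sketch only; TRACE_POSTGRESQL_LWLOCK_ACQUIRE and its _ENABLED() companion are the macros dtrace/systemtap generates from probes.d, and T_NAME() is the existing tranche-name macro):

    /* Sketch: only evaluate T_NAME() when a tracer is actually attached. */
    if (TRACE_POSTGRESQL_LWLOCK_ACQUIRE_ENABLED())
        TRACE_POSTGRESQL_LWLOCK_ACQUIRE(T_NAME(lock), mode);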
BTW, a user->kernel->user context switch only occurs when the trace tool's probes use kernel space - such as for perf based probes, or for systemtap's kernel-runtime probes. The same markers can be used by e.g. systemtap's "dyninst" runtime that runs entirely in userspace.
On Thu, 14 Jan 2021 at 15:56, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
On 2020-12-19 06:00, Craig Ringer wrote:
> Patch 1 fixes a bogus tracepoint where an lwlock__acquire event would be
> fired from LWLockWaitForVar, despite that function never actually
> acquiring the lock.
This was added in 68a2e52bbaf when LWLockWaitForVar() was first
introduced. It looks like a mistake to me too, but maybe Heikki wants
to comment.
I'm certain it's a copy/paste bug.
On 2021-01-14 09:39, Craig Ringer wrote:
> On Thu, 14 Jan 2021 at 15:56, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
>
> > On 2020-12-19 06:00, Craig Ringer wrote:
> > > Patch 1 fixes a bogus tracepoint where an lwlock__acquire event would be fired from LWLockWaitForVar, despite that function never actually acquiring the lock.
> >
> > This was added in 68a2e52bbaf when LWLockWaitForVar() was first introduced. It looks like a mistake to me too, but maybe Heikki wants to comment.
>
> I'm certain it's a copy/paste bug.

I have committed that patch.
On Thu, 11 Mar 2021 at 15:57, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
On 10.03.21 06:38, Craig Ringer wrote:
> On Wed, 3 Mar 2021 at 20:50, David Steele <david@pgmasters.net> wrote:
>
> On 1/22/21 6:02 AM, Peter Eisentraut wrote:
>
> This patch set no longer applies:
> http://cfbot.cputube.org/patch_32_2927.log.
>
> Can we get a rebase? Also marked Waiting on Author.
>
>
> Rebased as requested.
In patch 0001, why was the TRACE_POSTGRESQL_LWLOCK_RELEASE() call moved?
Is there some correctness issue? If so, we should explain that (at
least in the commit message, or as a separate patch).
If you want I can split it out, or drop that change. I thought it was sufficiently inconsequential, but you're right to check.
The current tracepoint TRACE_POSTGRESQL_LWLOCK_RELEASE really means "releaseD". It's appropriate to emit this as soon as the lock could be acquired by anything else. By deferring it until we'd processed the waitlist and woken other backends the window during which the lock was reported as "held" was longer than it truly was, and it was easy to see one backend acquire the lock while another still appeared to hold it.
It'd possibly make more sense to have a separate TRACE_POSTGRESQL_LWLOCK_RELEASING just before the `pg_atomic_sub_fetch_u32` call. But I didn't want to spam the tracepoints too hard, and there's always going to be some degree of overlap because tracing tools cannot intercept and act during the atomic swap, so they'll always see a slightly premature or slightly delayed release. This window should be as short as possible though, hence moving the tracepoint.
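Concretely, the intent is roughly the following placement (an abbreviated sketch of LWLockRelease(), paraphrased rather than the actual patch hunk):

    /* Release my hold on the lock; from here on other backends can take it. */
    oldstate = pg_atomic_sub_fetch_u32(&lock->state,
                                       mode == LW_EXCLUSIVE ? LW_VAL_EXCLUSIVE : LW_VAL_SHARED);

    /* Fire the "released" probe here, as close to the atomic update as we can get... */
    TRACE_POSTGRESQL_LWLOCK_RELEASE(T_NAME(lock));

    /* ...rather than only after we've also processed the waitlist and woken waiters. */
    if (check_waiters)
        LWLockWakeup(lock);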
Side note:
The main reason I didn't want to add more tracepoints than were strictly necessary is that Pg doesn't enable the systemtap semaphores feature, so right now we do a T_NAME(lock) evaluation each time we pass a tracepoint if --enable-dtrace is compiled in, whether or not anything is tracing. This was fine on pg11 where it was just:
#define T_NAME(lock) \
(LWLockTrancheArray[(lock)->tranche])
but since pg13 it instead expands to
GetLWTrancheName((lock)->tranche)
where GetLWTrancheName isn't especially trivial. We'll run that function every single time we pass any of these tracepoints and then discard the result, which is ... not ideal. That applies so long as Pg is compiled with --enable-dtrace. I've been meaning to look at enabling the systemtap semaphores feature in our build so these can be wrapped in unlikely(TRACE_POSTGRESQL_LWLOCK_RELEASE_ENABLED()) guards, but I wanted to wrap this patch set up first as there are some complexities around enabling the semaphores feature.
On 18.03.21 07:34, Craig Ringer wrote:
> > In patch 0001, why was the TRACE_POSTGRESQL_LWLOCK_RELEASE() call moved? Is there some correctness issue? If so, we should explain that (at least in the commit message, or as a separate patch).
>
> If you want I can split it out, or drop that change. I thought it was sufficiently inconsequential, but you're right to check.
>
> The current tracepoint TRACE_POSTGRESQL_LWLOCK_RELEASE really means "releaseD". It's appropriate to emit this as soon as the lock could be acquired by anything else. By deferring it until we'd processed the waitlist and woken other backends the window during which the lock was reported as "held" was longer than it truly was, and it was easy to see one backend acquire the lock while another still appeared to hold it.

From the archeology department: The TRACE_POSTGRESQL_LWLOCK_RELEASE probe was in the right place until PG 9.4, but was then moved by ab5194e6f617a9a9e7aadb3dd1cee948a42d0755, which was a major rewrite, so it seems the move might have been accidental. The documentation specifically states that the probe is triggered before waiters are woken up, which it specifically does not do at the moment. So this looks like a straight bug fix to me.
On Sat, 20 Mar 2021, 04:21 Peter Eisentraut, <peter.eisentraut@enterprisedb.com> wrote:
On 18.03.21 07:34, Craig Ringer wrote:
> The main reason I didn't want to add more tracepoints than were strictly
> necessary is that Pg doesn't enable the systemtap semaphores feature, so
> right now we do a T_NAME(lock) evaluation each time we pass a tracepoint
> if --enable-dtrace is compiled in, whether or not anything is tracing.
> This was fine on pg11 where it was just:
>
> #define T_NAME(lock) \
> (LWLockTrancheArray[(lock)->tranche])
>
> but since pg13 it instead expands to
>
> GetLWTrancheName((lock)->tranche)
>
> where GetLWTrancheName isn't especially trivial. We'll run that function
> every single time we pass any of these tracepoints and then discard the
> result, which is ... not ideal. That applies so long as Pg is compiled
> with --enable-dtrace. I've been meaning to look at enabling the
> systemtap semaphores feature in our build so these can be wrapped in
> unlikely(TRACE_POSTGRESQL_LWLOCK_RELEASE_ENABLED()) guards, but I wanted
> to wrap this patch set up first as there are some complexities around
> enabling the semaphores feature.
There is already support for that. See the documentation at the end of
this page:
https://www.postgresql.org/docs/devel/dynamic-trace.html#DEFINING-TRACE-POINTS
Pretty sure it won't work right now.
To use systemtap semaphores (the _ENABLED macros) you need to run dtrace -g to generate a probes.o then link that into postgres.
I don't think we do that. I'll double check soon.
On 19.03.21 21:06, Peter Eisentraut wrote:
> On 18.03.21 07:34, Craig Ringer wrote:
> > > In patch 0001, why was the TRACE_POSTGRESQL_LWLOCK_RELEASE() call moved? Is there some correctness issue? If so, we should explain that (at least in the commit message, or as a separate patch).
> >
> > If you want I can split it out, or drop that change. I thought it was sufficiently inconsequential, but you're right to check.
> >
> > The current tracepoint TRACE_POSTGRESQL_LWLOCK_RELEASE really means "releaseD". It's appropriate to emit this as soon as the lock could be acquired by anything else. By deferring it until we'd processed the waitlist and woken other backends the window during which the lock was reported as "held" was longer than it truly was, and it was easy to see one backend acquire the lock while another still appeared to hold it.
>
> From the archeology department: The TRACE_POSTGRESQL_LWLOCK_RELEASE probe was in the right place until PG 9.4, but was then moved by ab5194e6f617a9a9e7aadb3dd1cee948a42d0755, which was a major rewrite, so it seems the move might have been accidental. The documentation specifically states that the probe is triggered before waiters are woken up, which it specifically does not do at the moment. So this looks like a straight bug fix to me.

committed a fix for that
On 10.03.21 06:38, Craig Ringer wrote:
> On Wed, 3 Mar 2021 at 20:50, David Steele <david@pgmasters.net> wrote:
>
> > On 1/22/21 6:02 AM, Peter Eisentraut wrote:
> >
> > This patch set no longer applies: http://cfbot.cputube.org/patch_32_2927.log.
> >
> > Can we get a rebase? Also marked Waiting on Author.
>
> Rebased as requested.
>
> I'm still interested in whether Andres will be able to do anything about identifying LWLocks in a cross-backend manner. But this work doesn't really depend on that; it'd benefit from it, but would be easily adapted to it later if needed.

First, a problem: 0002 doesn't build on macOS, because uint64 has been used in the probe definitions. That needs to be handled like the other nonnative types in that file.

All the probe changes and additions should be accompanied by documentation changes.

The probes used to have an argument to identify the lock, which was removed by 3761fe3c20bb040b15f0e8da58d824631da00caa. The 0001 patch is essentially trying to reinstate that, which seems sensible. Perhaps we should also use the argument order that used to be there. It used to be

probe lwlock__acquire(const char *, int, LWLockMode);

and now it would be

probe lwlock__acquire(const char *, LWLockMode, LWLock*, int);

Also, do we need both the tranche name and the tranche id? Or maybe we don't need the name, or can record it differently, which might also address your other concern that it's too expensive to compute. In any case, I think an argument order like

probe lwlock__acquire(const char *, int, LWLock*, LWLockMode);

would make more sense.

In 0004, you add a probe to record the application_name setting? Would there be any value in making that a generic probe that can record any GUC change?
On Mon, 22 Mar 2021 at 17:00, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
On 20.03.21 01:29, Craig Ringer wrote:
> There is already support for that. See the documentation at the end of
> this page:
> https://www.postgresql.org/docs/devel/dynamic-trace.html#DEFINING-TRACE-POINTS
>
>
> Pretty sure it won't work right now.
>
> To use systemtap semaphores (the _ENABLED macros) you need to run dtrace
> -g to generate a probes.o then link that into postgres.
>
> I don't think we do that. I'll double check soon.
We do that. (It's -G.)
Huh. I could've sworn we didn't. My mistake, it's there in src/backend/Makefile .
In that case I'll amend the patch to use semaphore guards.
(On a side note, systemtap's semaphore support is actually a massive pain. The way it's implemented in <sys/sdt.h> means that a single compilation unit may not use both probes.d style markers produced by the dtrace script and use regular DTRACE_PROBE(providername,probename) preprocessor macros. If it attempts to do so, the DTRACE_PROBE macros will emit inline asm that tries to reference probename_semaphore symbols that will not exist, resulting in linker errors or runtime link errors. But that's really a systemtap problem. Core PostgreSQL doesn't use any explicit DTRACE_PROBE(...), STAP_PROBE(...) etc.)
On Mon, 22 Mar 2021 at 16:38, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
First, a problem: 0002 doesn't build on macOS, because uint64 has been
used in the probe definitions. That needs to be handled like the other
nonnative types in that file.
Will fix.
All the probe changes and additions should be accompanied by
documentation changes.
Agreed, will fix.
The probes used to have an argument to identify the lock, which was
removed by 3761fe3c20bb040b15f0e8da58d824631da00caa.
Huh. That's exactly the functionality I was looking for. Damn. I understand why Robert removed it, but its removal makes it much harder to identify an LWLock since it might fall in a DSM segment that could be mapped at different base addresses in different backends.
Robert's patch didn't replace the offset within tranche with anything else to identify the lock. A LWLock* is imperfect due to ASLR and DSM but it's better than nothing. In theory we could even remap them in trace tools if we had tracepoints on DSM attach and detach that showed their size and base address too.
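For example, purely hypothetically (these probes don't exist and aren't part of this patch set), firing something like the following at DSM map and unmap time would give a tool enough to translate a DSM-resident LWLock* into a (segment, offset) pair and back into another backend's mapping:

    /* Hypothetical probes; the segment fields named here are illustrative. */
    TRACE_POSTGRESQL_DSM_ATTACH(seg->handle, seg->mapped_address, seg->mapped_size);
    TRACE_POSTGRESQL_DSM_DETACH(seg->handle, seg->mapped_address, seg->mapped_size);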
CC'ing Andres, as he expressed interest in being able to globally identify LWLocks too.
The 0001 patch is
essentially trying to reinstate that, which seems sensible. Perhaps we
should also use the argument order that used to be there. It used to be
probe lwlock__acquire(const char *, int, LWLockMode);
and now it would be
probe lwlock__acquire(const char *, LWLockMode, LWLock*, int);
Also, do we need both the tranche name and the tranche id?
Reasons to have the name:
* There is no easy way to look up the tranche name by ID from outside the backend
* A tranche ID by itself is pretty much meaningless especially for dynamic tranches
* Any existing scripts will rely on the tranche name
So the tranche name is really required to generate useful output for any dynamic tranches, or simple and readable output from things like perf.
Reasons to have the tranche ID:
* The tranche name is not guaranteed to have the same address for a given value across backends in the presence of ASLR, even for built-in tranches. So tools need to read tranche names as user-space strings, which is much more expensive than consuming an int argument from the trace args. Storing and reporting maps of events by tranche name (string) in tools is also more expensive than having a tranche id.
* When the trace tool or script wants to filter for only one particular tranche,particularly when it's a built-in tranche where the tranche ID is known, having the ID is much more useful and efficient.
* If we can avoid computing the tranche name, emitting just the tranche ID would be much faster.
It's annoying that we have to pay the cost of computing the tranche name though. It never used to matter, but now that T_NAME() expands to GetLWTrancheName() calls as of 29c3e2dd5a6 it's going to cost a little more on such a hot path. I might see if I can do a little comparison and see how much.
I could add TRACE_POSTGRESQL_<<tracepointname>>_ENABLED() guards since we do in fact build with SDT semaphore support. That adds a branch for each tracepoint, but they're already marshalling arguments and making a function call that does lots more than a single branch, so that seems pretty sensible. The main downside of using _ENABLED() USDT semaphore guards is that not all tools are guaranteed to understand or support them. So an older perf, for example, might simply fail to fire events on guarded probes. That seems OK to me, the onus should be on the probe tool to pay any costs, not on PostgreSQL. Despite that I don't want to mark the _ENABLED() guards unlikely(), since that'd increase the observer effect where probing LWLocks changes their timing and behaviour. Branch prediction should do a very good job in such cases without being forced.
I wonder a little about the possible cache costs of the _ENABLED() macros though. Their data is in a separate ELF segment and separate .o, with no locality to the traced code. It might be worth checking that before proceeding; I guess it's even possible that the GetLWTrancheName() calls could be cheaper. Will see if I can run some checks and report back.
BTW, if you want some of the details on how userspace SDTs work, https://leezhenghui.github.io/linux/2019/03/05/exploring-usdt-on-linux.html is interesting and useful. It helps explain uprobes, ftrace, bcc, etc.
Or maybe we
don't need the name, or can record it differently, which might also
address your other concern that it's too expensive to compute. In any
case, I think an argument order like
probe lwlock__acquire(const char *, int, LWLock*, LWLockMode);
would make more sense.
OK.
In 0004, you add a probe to record the application_name setting? Would
there be any value in making that a generic probe that can record any
GUC change?
Yes, there would, but I didn't want to go and do that in the same patch, and a named probe on application_name is useful separately to having probes on any GUC.
There's value in having a probe with an easily targeted name that probes the application_name since it's of obvious interest and utility to probing and tracing tools. A probe specifically on application_name means a probing script doesn't have to fire an event for every GUC, copy the GUC name string, strcmp() it to see if it's the GUC of interest, etc. So specific probes on "major" GUCs like this are IMO very useful.
(It'd be possible to instead generate probes for each GUC at compile-time using the preprocessor and the DTRACE_ macros. But as noted above, that doesn't currently work properly in the same compilation unit that a dtrace script-generated probes.h is included in. I think it's probably nicer to have specific probes for GUCs of high interest, then generic probes that capture all GUCs anyway.)
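To illustrate the trade-off (the probe names here are hypothetical, not from the patch; "name" and "newval" stand for the GUC's name and new value at the call site): a dedicated marker can be attached to directly and fires only for the GUC of interest, while a generic marker fires for every GUC and leaves the copying and filtering to the tracer.

    /* Specific probe: no filtering needed in the trace tool. */
    TRACE_POSTGRESQL_GUC_APPLICATION_NAME_ASSIGNED(newval);

    /* Generic probe: fires for every GUC; the tool must read and compare the name. */
    TRACE_POSTGRESQL_GUC_ASSIGNED(name, newval);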
There are a TON of probes I want to add, and I have a tree full of them waiting to submit progressively. Yes, ability to probe all GUCs is in there. So is detail on walsender, reorder buffer, and snapshot builder activity. Heavyweight lock SDTs. A probe that identifies the backend type at startup. SDT probe events emitted for every wait-event. Probes in elog.c to let probes observe error unwinding, capture error messages, etc. (Those can also be used with systemtap guru mode scripts to do things like turn a particular elog(DEBUG) into a PANIC at runtime for diagnostic purposes). Probes in shm_mq to observe message passing and blocking. A probe that fires whenever debug_query_string changes. Lots. But I can't submit them all at once, especially without some supporting use cases and scripts that other people can use so they can understand why these probes are useful.
So I figured I'd start here...
Hi,

On 2021-04-12 14:31:32 +0800, Craig Ringer wrote:
> * There is no easy way to look up the tranche name by ID from outside the backend

But it's near trivial to add that.

> It's annoying that we have to pay the cost of computing the tranche name though. It never used to matter, but now that T_NAME() expands to GetLWTrancheName() calls as of 29c3e2dd5a6 it's going to cost a little more on such a hot path. I might see if I can do a little comparison and see how much. I could add TRACE_POSTGRESQL_<<tracepointname>>_ENABLED() guards since we do in fact build with SDT semaphore support. That adds a branch for each tracepoint, but they're already marshalling arguments and making a function call that does lots more than a single branch, so that seems pretty sensible.

I am against adding any overhead for this feature. I honestly think the probes we have right now in postgres do not provide a commensurate benefit.

> (It'd be possible to instead generate probes for each GUC at compile-time using the preprocessor and the DTRACE_ macros. But as noted above, that doesn't currently work properly in the same compilation unit that a dtrace script-generated probes.h is included in. I think it's probably nicer to have specific probes for GUCs of high interest, then generic probes that capture all GUCs anyway.)
>
> There are a TON of probes I want to add, and I have a tree full of them waiting to submit progressively. Yes, ability to probe all GUCs is in there. So is detail on walsender, reorder buffer, and snapshot builder activity. Heavyweight lock SDTs. A probe that identifies the backend type at startup. SDT probe events emitted for every wait-event. Probes in elog.c to let probes observe error unwinding, capture error messages, etc. [...] A probe that fires whenever debug_query_string changes. Lots. But I can't submit them all at once, especially without some supporting use cases and scripts that other people can use so they can understand why these probes are useful.

-1. This is not scalable. Adding static probes all over has both a runtime (L1I, branches, code optimization) and maintenance overhead.

> (Those can also be used with systemtap guru mode scripts to do things like turn a particular elog(DEBUG) into a PANIC at runtime for diagnostic purposes).

Yikes.

Greetings,

Andres Freund
On Tue, 13 Apr 2021 at 02:23, Andres Freund <andres@anarazel.de> wrote:

[I've changed the order of the quoted sections a little to prioritize the key stuff]

> On 2021-04-12 14:31:32 +0800, Craig Ringer wrote:
>
> > It's annoying that we have to pay the cost of computing the tranche name though. It never used to matter, but now that T_NAME() expands to GetLWTrancheName() calls as of 29c3e2dd5a6 it's going to cost a little more on such a hot path. I might see if I can do a little comparison and see how much. I could add TRACE_POSTGRESQL_<<tracepointname>>_ENABLED() guards since we do in fact build with SDT semaphore support. That adds a branch for each tracepoint, but they're already marshalling arguments and making a function call that does lots more than a single branch, so that seems pretty sensible.
>
> I am against adding any overhead for this feature. I honestly think the probes we have right now in postgres do not provide a commensurate benefit.

I agree that the probes we have now are nearly useless, if not entirely useless. The transaction management ones are misplaced and utterly worthless. The LWLock ones don't carry enough info to be much use and are incomplete. I doubt anybody uses any of them at all, or would even notice their absence.

In terms of overhead, what is in place right now is not free. It used to be very cheap, but since 29c3e2dd5a6 it's not. I'd like to reduce the current cost and improve functionality at the same time, so it's actually useful.

> > * There is no easy way to look up the tranche name by ID from outside the backend
>
> But it's near trivial to add that.

Really? We can expose a pg_catalog.lwlock_tranches view that lets you observe the current mappings for any given user backend I guess. But if I'm looking for performance issues caused by excessive LWLock contention or waits, LWLocks held too long, LWLock lock-ordering deadlocks, or the like, it's something I want to capture across the whole postgres instance. Each backend can have different tranche IDs (right?) and there's no way to know what a given non-built-in tranche ID means for any given backend without accessing backend-specific in-memory state. Including for non-user-accessible backends like bgworkers and auxprocs, where it's not possible to just query the state from a view directly.

So we'd be looking at some kind of shm based monstrosity. That doesn't sound appealing. Worse, there's no way to solve races with it - is a given tranche ID already allocated when you see it? If not, can you look it up from the backend before the backend exits/dies? For that matter, how do you do that, since the connection to the backend is likely under the control of an application, not your monitoring and diagnostic tooling. Some trace tools can poke backend memory directly, but it generally requires debuginfo, is fragile and Pg version specific, slow, and a real pain to use.

If we don't attach the LWLock names to the tracepoints in some way they're pretty worthless.

Again, I don't plan to add new costs here. I'm actually proposing to reduce an existing cost. And you can always build without `--enable-dtrace` and ... just not care.

Anyway - I'll do some `perf` runs shortly to quantify this:

* With/without tracepoints at all
* With/without names in tracepoints
* With/without tracepoint refcounting (_ENABLED() semaphores)

so as to rely less on handwaving.
> > (Those can also be used with systemtap guru mode scripts to do things like turn a particular elog(DEBUG) into a PANIC at runtime for diagnostic purposes).
>
> Yikes.

Well, it's not like it can happen by accident. You have to deliberately write a script that twiddles process memory, using a tool that requires special privileges. I recently had to prepare a custom build for a customer that converted an elog(DEBUG) into an elog(PANIC) in order to capture a core with much better diagnostic info for a complex, hard to reproduce and intermittent memory management issue. It would've been rather nice to be able to do so with a trace marker instead of a custom build.

> > There are a TON of probes I want to add, and I have a tree full of them waiting to submit progressively. Yes, ability to probe all GUCs is in there. So is detail on walsender, reorder buffer, and snapshot builder activity. Heavyweight lock SDTs. A probe that identifies the backend type at startup. SDT probe events emitted for every wait-event. Probes in elog.c to let probes observe error unwinding, capture error messages, etc. [...] A probe that fires whenever debug_query_string changes. Lots. But I can't submit them all at once, especially without some supporting use cases and scripts that other people can use so they can understand why these probes are useful.
>
> -1. This is not scalable. Adding static probes all over has both a runtime (L1I, branches, code optimization) and maintenance overhead.

Take a look at "sudo perf list".

    sched:sched_kthread_work_execute_end      [Tracepoint event]
    sched:sched_kthread_work_execute_start    [Tracepoint event]
    ...
    sched:sched_migrate_task                  [Tracepoint event]
    ...
    sched:sched_process_exec                  [Tracepoint event]
    ...
    sched:sched_process_fork                  [Tracepoint event]
    ...
    sched:sched_stat_iowait                   [Tracepoint event]
    ...
    sched:sched_stat_sleep                    [Tracepoint event]
    sched:sched_stat_wait                     [Tracepoint event]
    ...
    sched:sched_switch                        [Tracepoint event]
    ...
    sched:sched_wakeup                        [Tracepoint event]

The kernel is packed with extremely useful trace events, and for very good reasons. Some on very hot paths.

I do _not_ want to randomly add probes everywhere. I propose that they be added:

* Where they will meaningfully aid production diagnosis, complex testing, and/or development activity. Expose high level activity of key subsystems via trace markers, especially at the boundaries of IPCs or where logic otherwise passes between processes.
* Where it's not feasible to instead adjust code structure to make DWARF debuginfo based probing sufficient.
* Where there's no other sensible way to get useful information without excessive complexity and/or runtime cost, but it could be very important for understanding intermittent production issues or performance problems at scale in live systems.
* Where the execution path is not extremely hot - e.g. no static tracepoints in spinlocks or atomics.
* Where a DWARF debuginfo based probe cannot easily replace them, i.e. generally not placed on entry and exit of stable and well-known functions.

Re the code structure point above, we have lots of functions that return in multiple places, or where a single function can do many different things with different effects on system state. For example right now it's quite complex to place probes to definitively confirm the outcome of a given transaction and capture its commit record lsn.
Functions with many branches that each fiddle with system state, functions that test for the validity of some global and short-circuit return if invalid, etc. Functions that do long loops over big chunks of logic are hard too, e.g. ReorderBufferCommit.

I want to place probes where they will greatly simplify observation of important global system state that's not easily observed using traditional tools like gdb or logging.

When applied sensibly and moderately, trace markers are absolutely amazing for diagnostic and performance work. You can attach to them in production builds even without debuginfo and observe behaviour that would otherwise be impossible without complex fiddling around with multi-process gdb. This sort of capability is going to become more and more important as we become more parallel and can rely less on single-process gdb-style tracing. Diagnostics using logging is a blunt hammer that does not scale and is rarely viable for intermittent or hard to reproduce production issues.

I will always favour "native postgres" solutions where feasible - for example, I want to add some basic reorder buffer state to struct WalSnd and the pg_stat_replication views, and I want to expose some means to get a walsender to report details of its ReorderBuffer state. But some things are not very amenable to that. Either the runtime costs of having the facility available are too high (we're never going to have a pg_catalog.pg_lwlocks for good reasons) or it's too complicated to write and maintain. Especially where info is needed from many processes. That's where trace markers become valuable.

But right now what we have in Pg is worthless, and it seems almost nobody knows how to use the tools. I want to change that, but it's a bit of a catch-22. Making tooling easy to use benefits enormously from some more stable interfaces that don't break so much version-to-version, don't require deep code knowledge to understand, and work without debuginfo on production builds. But without some "oh, wow" tools, it's hard to convince anyone we should invest any effort in improving the infrastructure...

It's possible I'm beating a dead horse here. I find these tools amazingly useful, but they're currently made 10x harder than they need to be by the complexities of directly poking at postgres's complex and version-specific internal structure using debuginfo based probing. Things that should be simple, like determining the timings of a txn from xid assignment -> 2pc prepare -> 2pc commit prepared .... really aren't. Markers that report xid assignment, commit, rollback, etc, with the associated topxid would help immensely.
Hi,

On 2021-04-13 10:34:18 +0800, Craig Ringer wrote:
> > But it's near trivial to add that.
>
> Really?

Yes.

> Each backend can have different tranche IDs (right?)

No, they have to be the same in each. Note how the tranche ID is part of struct LWLock. Which is why LWLockNewTrancheId() has to acquire a lock etc.

> But if I'm looking for performance issues caused by excessive LWLock contention or waits, LWLocks held too long, [...] or the like, it's something I want to capture across the whole postgres instance.

Sure. Although I still don't really buy that static tracepoints are the best way to measure this kind of thing, given the delay introducing them and the cost of having them around. I think I pointed out https://postgr.es/m/20200813004233.hdsdfvufqrbdwzgr%40alap3.anarazel.de before.

> LWLock lock-ordering deadlocks

This seems unrelated to tracepoints to me.

> and there's no way to know what a given non-built-in tranche ID means for any given backend without accessing backend-specific in-memory state. Including for non-user-accessible backends like bgworkers and auxprocs, where it's not possible to just query the state from a view directly.

The only per-backend part is that some backends might not know the tranche name for dynamically registered tranches where the LWLockRegisterTranche() hasn't been executed in a backend. Which should pretty much never be an aux process or such. And even for bgworkers it seems like a pretty rare thing, because those need to be started by something...

It might be worth proposing a shared hashtable with tranche names and just reserving enough space for ~hundred entries...

> And you can always build without `--enable-dtrace` and ... just not care.

Practically speaking, distributions enable it, which then incurs the cost for everyone.

> Take a look at "sudo perf list".
>
> sched:sched_kthread_work_execute_end      [Tracepoint event]
> sched:sched_kthread_work_execute_start    [Tracepoint event]
> ...
> sched:sched_migrate_task                  [Tracepoint event]
> ...
> sched:sched_process_exec                  [Tracepoint event]
> ...
> sched:sched_process_fork                  [Tracepoint event]
> ...
> sched:sched_stat_iowait                   [Tracepoint event]
> ...
> sched:sched_stat_sleep                    [Tracepoint event]
> sched:sched_stat_wait                     [Tracepoint event]
> ...
> sched:sched_switch                        [Tracepoint event]
> ...
> sched:sched_wakeup                        [Tracepoint event]
>
> The kernel is packed with extremely useful trace events, and for very good reasons. Some on very hot paths.

IIRC those aren't really comparable - the kernel actually does modify the executable code to replace the tracepoints with nops.

Greetings,

Andres Freund
On Tue, 13 Apr 2021 at 11:06, Andres Freund <andres@anarazel.de> wrote:

> > Each backend can have different tranche IDs (right?)
>
> No, they have to be the same in each. Note how the tranche ID is part of struct LWLock. Which is why LWLockNewTrancheId() has to acquire a lock etc.

Ah. I misunderstood that at some point. That makes it potentially more sensible to skip reporting tranche names. Not great, because it's much less convenient to work with trace data full of internal ordinals that must be re-mapped in post-processing. But I'm generally OK with deferring runtime costs to tooling rather than the db itself so long as doing so is moderately practical.

In this case, I think we could likely get away with removing the tranche names from the tracepoints if we instead emit a trace event on each dynamic tranche registration that reports the tranche id -> name mapping. It still sucks for tools, since they have to scrape up the static tranche registrations from somewhere else, but ... it'd be tolerable.

> > The kernel is packed with extremely useful trace events, and for very good reasons. Some on very hot paths.
>
> IIRC those aren't really comparable - the kernel actually does modify the executable code to replace the tracepoints with nops.

Same with userspace static trace markers (USDTs).

A followup mail will contain a testcase and samples to demonstrate this.

> Although I still don't really buy that static tracepoints are the best way to measure this kind of thing, given the delay introducing them and the cost of having them around. I think I pointed out https://postgr.es/m/20200813004233.hdsdfvufqrbdwzgr%40alap3.anarazel.de before.

Yeah. Semaphores are something hot enough that I'd hesitate to touch them.

> > LWLock lock-ordering deadlocks
>
> This seems unrelated to tracepoints to me.

If I can observe which locks are acquired in which order by each proc, I can then detect excessive waits and report the stack of held locks of both procs and their order of acquisition. Since LWLocks shmem state doesn't AFAICS track any information on the lock holder(s) I don't see a way to do this in-process. It's not vital, it's just one of the use cases I have in mind. I suspect that any case where such deadlocks are possible represents a misuse of LWLocks anyway.

> > and there's no way to know what a given non-built-in tranche ID means for any given backend without accessing backend-specific in-memory state. Including for non-user-accessible backends like bgworkers and auxprocs, where it's not possible to just query the state from a view directly.
>
> The only per-backend part is that some backends might not know the tranche name for dynamically registered tranches where the LWLockRegisterTranche() hasn't been executed in a backend. Which should pretty much never be an aux process or such. And even for bgworkers it seems like a pretty rare thing, because those need to be started by something...
>
> It might be worth proposing a shared hashtable with tranche names and just reserving enough space for ~hundred entries...

Yeah, that'd probably work and be cheap enough not to really matter. Might even save us a chunk of memory by not turning CoW pages into private mappings for each backend during registration.

> > And you can always build without `--enable-dtrace` and ... just not care.
>
> Practically speaking, distributions enable it, which then incurs the cost for everyone.

Yep.
That's part of why I was so surprised to notice the GetLWTrancheName() function call in LWLock tracepoints. Nearly anywhere else it wouldn't matter at all, but LWLocks are hot enough that it just might matter for the no-wait fastpath.
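A minimal sketch of the registration-time idea mentioned above, assuming the existing LWLockRegisterTranche() signature and a hypothetical probe name:

    void
    LWLockRegisterTranche(int tranche_id, const char *tranche_name)
    {
        /* ... existing per-backend bookkeeping that records the name ... */

        /* Hypothetical: report the id -> name mapping once, at registration time,
         * so tools can resolve bare tranche IDs later without a name argument on
         * every lock event. */
        TRACE_POSTGRESQL_LWLOCK_TRANCHE_REGISTER(tranche_id, tranche_name);
    }

Built-in tranches would still need to come from a table the tool knows about, which is the "scrape up the static tranche registrations from somewhere else" part.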
On Tue, 13 Apr 2021 at 21:05, Craig Ringer <craig.ringer@enterprisedb.com> wrote:
> On Tue, 13 Apr 2021 at 11:06, Andres Freund <andres@anarazel.de> wrote:
> > IIRC those aren't really comparable - the kernel actually does modify
> > the executable code to replace the tracepoints with nops.
>
> Same with userspace static trace markers (USDTs).
>
> A followup mail will contain a testcase and samples to demonstrate this.

Demo follows, with source attached too. gcc 10.2, compiling with -O2, using dtrace and <sys/sdt.h> from systemtap 4.4.

Trivial empty function definition:

    __attribute__((noinline)) void no_args(void)
    {
        SDT_NOOP_NO_ARGS();
    }

Disassembly when SDT_NOOP_NO_ARGS is defined as

    #define SDT_NOOP_NO_ARGS()

is:

    <no_args>:
        retq

When built with a probes.d definition processed by the dtrace script instead, the disassembly becomes:

    <no_args>:
        nop
        retq

So ... yup, it's a nop.

Now, if we introduce semaphores, that changes.

    __attribute__((noinline)) void no_args(void)
    {
        if (SDT_NOOP_NO_ARGS_ENABLED())
            SDT_NOOP_NO_ARGS();
    }

disassembles to:

    <no_args>:
        cmpw   $0x0,0x2ec4(%rip)        # <sdt_noop_no_args_semaphore>
        jne    <no_args+0x10>
        retq
        nopl   0x0(%rax,%rax,1)
        nop
        retq

so the semaphore test is actually quite harmful and wasteful in this case. That's not surprising, since this SDT is a simple marker point. But what if we supply arguments to it?

It turns out that the disassembly is the same if args are passed, whether locals or globals, including globals assigned based on program input that can't be determined at compile time. Still just a nop.

If I pass a function call as an argument expression to a probe, e.g.

    __attribute__((noinline)) static int compute_probe_argument(void)
    {
        return 100;
    }

    void with_computed_arg(void)
    {
        SDT_NOOP_WITH_COMPUTED_ARG(compute_probe_argument());
    }

then the disassembly with SDTs is:

    <with_computed_arg>:
        callq  <compute_probe_argument>
        nop
        retq

so the function call isn't elided even if it's unused. That's somewhat expected. The same will be true if the arguments to a probe require pointer chasing or non-trivial marshalling.

If a semaphore guard is added, this becomes:

    <with_computed_arg>:
        cmpw   $0x0,0x2e2e(%rip)        # <sdt_noop_with_computed_arg_semaphore>
        jne    <with_computed_arg+0x10>
        retq
        nopl   0x0(%rax,%rax,1)
        callq  <compute_probe_argument>
        nop
        retq

so now the call to compute_probe_argument() is skipped unless the probe is enabled, but the function is longer and requires a test and jump.

If I dummy up a function that does some pointer chasing, without semaphores I get:

    <with_pointer_chasing>:
        mov    (%rdi),%rax
        mov    (%rax),%rax
        mov    (%rax),%rax
        nop
        retq

so the arguments are marshalled then ignored. With semaphores I get:

    <with_pointer_chasing>:
        cmpw   $0x0,0x2d90(%rip)        # <sdt_noop_with_pointer_chasing_semaphore>
        jne    <with_pointer_chasing+0x10>
        retq
        nopl   0x0(%rax,%rax,1)
        mov    (%rdi),%rax
        mov    (%rax),%rax
        mov    (%rax),%rax
        nop
        retq

so again the probe's argument marshalling is inline in the function body, but at the end, and skipped over.

Findings:

* A probe without arguments or with simple arguments is just a 'nop' instruction
* Probes that require function calls, pointer chasing, other expression evaluation etc may impose a fixed cost to collect up arguments even if the probe is disabled.
* SDT semaphores can avoid that cost but add a branch, so should probably be avoided unless preparing probe arguments is likely to be expensive.

Hideous but effective demo code attached.
Attachment
On Tue, 13 Apr 2021 at 21:40, Craig Ringer <craig.ringer@enterprisedb.com> wrote:
> Findings:
>
> * A probe without arguments or with simple arguments is just a 'nop' instruction
> * Probes that require function calls, pointer chasing, other
> expression evaluation etc may impose a fixed cost to collect up
> arguments even if the probe is disabled.
> * SDT semaphores can avoid that cost but add a branch, so should
> probably be avoided unless preparing probe arguments is likely to be
> expensive.

Back to the topic directly at hand.

Attached a disassembly of what LWLockAcquire looks like now on my current build of git master @ 5fe83adad9efd5e3929f0465b44e786dc23c7b55. This is an --enable-debug --enable-cassert --enable-dtrace build with -Og -ggdb3.

The three tracepoints in it are all of the form:

    movzwl 0x0(%r13),%edi
    call   0x801c49 <GetLWTrancheName>
    nop

so it's clear we're doing redundant calls to GetLWTrancheName(), as expected. Not ideal.

Now if I patch it to add the _ENABLED() guards on all the tracepoints, the probes look like this:

    0x0000000000803176 <+200>: cmpw   $0x0,0x462da8(%rip)   # 0xc65f26 <postgresql_lwlock__acquire_semaphore>
    0x000000000080317e <+208>: jne    0x80328b <LWLockAcquire+477>
    .... other interleaved code ...
    0x000000000080328b <+477>: movzwl 0x0(%r13),%edi
    0x0000000000803290 <+482>: call   0x801c49 <GetLWTrancheName>
    0x0000000000803295 <+487>: nop
    0x0000000000803296 <+488>: jmp    0x803184 <LWLockAcquire+214>

so we avoid the GetLWTrancheName() call at the cost of a test and possible branch, and a small expansion in total function size.

Without the semaphores, LWLockAcquire is 463 bytes. With them, it's 524 bytes, which is nothing to sneeze at for code that doesn't do anything 99.999% of the time, but we avoid a bunch of GetLWTrancheName() calls.

If I instead replace T_NAME() with NULL for all tracepoints in LWLockAcquire, the disassembly shows that the tracepoints now become a simple

    0x0000000000803176 <+200>: nop

which is pretty hard to be concerned about.

So at the very least we should be calling GetLWTrancheName() once at the start of the function if built with dtrace support and remembering the value, instead of calling it for each tracepoint. And omitting the tranche name looks like it might be sensible for the LWLock code. In most places it won't matter, but LWLocks are hot enough that it possibly might. A simple pg_regress run hits LWLockAcquire 25 million times, so that's 75 million calls to GetLWTrancheName().
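To make the two mitigations concrete, they would look something like this in C (a sketch only, not the exact committed form; the _ENABLED() macros are the ones dtrace/systemtap generate for each probe alongside the probe macro itself):

    /* Option 1: semaphore guard - only evaluate the arguments, including
     * the tranche name lookup, when a tracer is actually attached */
    if (TRACE_POSTGRESQL_LWLOCK_ACQUIRE_ENABLED())
        TRACE_POSTGRESQL_LWLOCK_ACQUIRE(T_NAME(lock), mode);

    /* Option 2: look the tranche name up once per LWLockAcquire() call
     * and reuse it for every tracepoint in the function */
    const char *tranche_name = T_NAME(lock);
    ...
    TRACE_POSTGRESQL_LWLOCK_WAIT_START(tranche_name, mode);
    ...
    TRACE_POSTGRESQL_LWLOCK_ACQUIRE(tranche_name, mode);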
On Mon, Apr 12, 2021 at 11:06 PM Andres Freund <andres@anarazel.de> wrote:
> No, they have to be the same in each. Note how the tranche ID is part of
> struct LWLock. Which is why LWLockNewTrancheId() has to acquire a lock
> etc.

More precisely, if a tranche ID is defined in multiple backends, it needs to be defined the same way in all of them. But it is possible to have an extension loaded into some backends and not others and have it define a tranche ID that other backends know nothing about.

Another point to note is that, originally, I had an idea that each tranche of lwlocks was situated in a single array somewhere in memory. Perhaps that was an array of something else, like buffer descriptors, and the lwlocks were just one element of the struct, or maybe it was an array specifically of LWLocks, but one way or the other, there was definitely one array that had all the LWLocks from that tranche in it. So before the commit in question -- 3761fe3c20bb040b15f0e8da58d824631da00caa -- T_ID() used to compute an offset for a lock within the tranche that was supposed to uniquely identify the lock. However, the whole idea of an array per tranche turns out to be broken by design.

Consider parallel query. You could, perhaps, arrange for all the LWLocks that a particular query needs to be in one tranche. And that's all fine. But what if there are multiple parallel contexts in existence at the same time? I think right now that may be impossible as a practical matter, since for example an SQL function that is called by a parallel query is supposed to run any SQL statements inside of it without parallelism. But, that's basically a policy decision. There's nothing in the parallel context machinery itself which prevents multiple parallel contexts from being active at the same time. And if that happens, then you'd have multiple arrays with the same tranche ID, so how do you identify the locks then? The pre-3761fe3c20bb040b15f0e8da58d824631da00caa data structure doesn't work because it has only one place to store an array base, but having multiple places to store an array base doesn't fix it either because now you've just given the same identifier to multiple locks.

You could maybe fix it by putting a limit on how many parallel contexts can be open at the same time, and then having N copies of each parallelism-related tranche. But that seems ugly and messy and a burden on extension authors and not really what anybody wants. You could try to identify locks by pointer addresses, but that's got security hazards and the addresses aren't portable across all the backends involved in the parallel query because of how DSM works, so it's not really that helpful in terms of matching stuff up. You could identify every lock by a tranche ID + an array offset + a "tranche instance ID". But where would you store the tranche instance ID to make it readily accessible, other than in the lock itself? Andres wasn't thrilled about using even 2 bytes to identify the LWLock, so he'll probably like having more bytes in there for that kind of thing even less. And to be honest I wouldn't blame him. We only need 12 bytes to implement the lock itself -- we can't justify having more than a couple of additional bytes for debugging purposes.

On a broader level, I agree that being able to find out what the system is doing is really important. But I'm also not entirely convinced that having really fine-grained information here to distinguish between one lock and another is the way to get there.
Personally, I've never run into a problem where I really needed to know anything more than the tranche name. Like, I've seen problems where, for example, we can see that there's a lot of contention on SubtransSLRULock, or that there are problems with WALInsertLock. But I can't really see why I'd need to know which WALInsertLock was experiencing contention.

If we were speaking of buffer content locks, I suppose I can imagine wanting more details, but it's not really the buffer number I'd want to know. I'd want to know the database OID, the relfilenode, the fork number, and the block number. You can argue that we should just expose the buffer number and let the user sort out the rest with dtrace/systemtap magic, but that makes it useless in practice to an awful lot of people, including me.

-- Robert Haas EDB: http://www.enterprisedb.com
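For concreteness, the identity being described here is what lives in the buffer tag of the BufferDesc that holds the content lock; a rough sketch of resolving it (PG13/14-era field names, locking ignored for brevity):

    BufferDesc *bufHdr = GetBufferDescriptor(buf_id);   /* buf_id is 0-based */

    /* database OID, relfilenode, fork and block for the page the lock protects */
    elog(DEBUG1, "db=%u rel=%u fork=%d block=%u",
         bufHdr->tag.rnode.dbNode,
         bufHdr->tag.rnode.relNode,
         bufHdr->tag.forkNum,
         bufHdr->tag.blockNum);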
Hi,

On 2021-04-13 14:25:23 -0400, Robert Haas wrote:
> On Mon, Apr 12, 2021 at 11:06 PM Andres Freund <andres@anarazel.de> wrote:
> You could identify every lock by a tranche ID + an array offset + a
> "tranche instance ID". But where would you store the tranche instance
> ID to make it readily accessible, other than in the lock itself?
> Andres wasn't thrilled about using even 2 bytes to identify the
> LWLock, so he'll probably like having more bytes in there for that
> kind of thing even less.

I still don't like the two bytes, fwiw ;). Especially because it's 4 bytes due to padding right now.

I'd like to move the LWLock->waiters list to outside of the lwlock itself - at most TotalProcs LWLocks can be waited for, so we don't need millions of empty proclist_heads. That way we could also remove the proclist indirection - which shows up a fair bit in contended workloads.

And if we had a separate "lwlocks being waited for" structure, we could also add more information to it if we wanted to...

The difficulty of course is having space to indicate which of these "waiting for" lists are being used - there's not enough space in ->state right now to represent that. Two possible approaches:

- We could make it work if we restricted MAX_BACKENDS to be 2**14 - but while I personally think that's a sane upper limit, I already had a hard time getting consensus to lower the limit to 2^18-1.

- Use a 64bit integer. Then we can easily fit MAX_BACKENDS lockers, as well as an offset to one of MAX_BACKENDS "wait lists" into LWLock.

It's not so much that I want to lower the overall memory usage (although it doesn't hurt). It's more about being able to fit more data into one cacheline together with the lwlock. E.g. being able to fit more into BufferDesc would be very useful.

A secondary benefit of such an approach would be that it makes it a lot easier to add efficient adaptive spinning on contended locks. I did experiment with that, and there's some considerable potential for performance benefits there. But for it to scale well we need something similar to "mcs locks", to avoid causing too much contention. And that pretty much requires some separate space to store wait information anyway.

With an 8 byte state we probably could also stash the tranche inside that...

> On a broader level, I agree that being able to find out what the
> system is doing is really important. But I'm also not entirely
> convinced that having really fine-grained information here to
> distinguish between one lock and another is the way to get there.
> Personally, I've never run into a problem where I really needed to
> know anything more than the tranche name.

I think it's quite useful for relatively simple things like analyzing the total amount of time spent in individual locks, without incurring much overhead when not doing so (for which you need to identify individual locks, otherwise your end - start time is going to be meaningless).

And, slightly more advanced, for analyzing what the stack was when the lock was released - which then allows you to see what work you're blocked on, something I found hard to figure out otherwise.

I found that that's mostly quite doable with dynamic probes though.

> Like, I've seen problems where, for example, we can see that there's a lot of
> contention on SubtransSLRULock, or that there are problems with
> WALInsertLock. But I can't really see why I'd need to know which
> WALInsertLock was experiencing contention.

Well, but you might want to know what the task blocking you was doing.
What to optimize might differ depending on whether the other task is e.g. a log switch (which acquires all insert locks) or WAL writes by VACUUM.

> If we were speaking of buffer content locks, I suppose I can imagine
> wanting more details, but it's not really the buffer number I'd want
> to know. I'd want to know the database OID, the relfilenode, the fork
> number, and the block number. You can argue that we should just expose
> the buffer number and let the user sort out the rest with
> dtrace/systemtap magic, but that makes it useless in practice to an
> awful lot of people, including me.

I have wondered if we ought to put some utilities for that in contrib or such. It's a lot easier to address something new with a decent starting point...

Greetings,

Andres Freund
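A rough sketch of the direction being floated above, purely to visualize it (the type and field names below are invented; this is not the current struct LWLock nor a worked-out proposal):

    /* One wait-list slot per backend suffices, since a backend waits on at
     * most one LWLock at a time. */
    typedef struct LWLockWaitSlot
    {
        proclist_head   waiters;    /* backends sleeping on the owning lock */
        /* room for extra instrumentation: holder, wait start time, ... */
    } LWLockWaitSlot;

    typedef struct LWLockSlim
    {
        /* shared/exclusive holder counts and flags, plus the index of the
         * LWLockWaitSlot currently in use, packed into one 64-bit atomic */
        pg_atomic_uint64 state;
    } LWLockSlim;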
On Wed, 14 Apr 2021 at 04:46, Andres Freund <andres@anarazel.de> wrote: > > On 2021-04-13 14:25:23 -0400, Robert Haas wrote: > > On Mon, Apr 12, 2021 at 11:06 PM Andres Freund <andres@anarazel.de> wrote: > > You could identify every lock by a tranche ID + an array offset + a > > "tranche instance ID". But where would you store the tranche instance > > ID to make it readily accessible, other than in the lock itself? > > Andres wasn't thrilled about using even 2 bytes to identify the > > LWLock, so he'll probably like having more bytes in there for that > > kind of thing even less. > > I still don't like the two bytes, fwiw ;). Especially because it's 4 > bytes due to padding right now. Aha, did I hear you say "there are two free bytes for me to shove something marginally useful and irrelevant into"? (*grin*) > I'd like to move the LWLock->waiters list to outside of the lwlock > itself - at most TotalProcs LWLocks can be waited for, so we don't need > millions of empty proclist_heads. That way we could also remove the > proclist indirection - which shows up a fair bit in contended workloads. > > And if we had a separate "lwlocks being waited for" structure, we could > also add more information to it if we wanted to... Having the ability to observe LWLock waiters would be nice. But you're right to constantly emphasise that LWLocks need to be very slim. We don't want to turn them into near-heavyweight locks by saddling them with overhead that's not strictly necessary. A simple pg_regress run (with cassert, admittedly) takes 25,000,000 LWLocks and does 24,000 LWLock waits and 130,000 condvar waits. All in less than a minute. OTOH, once someone's waiting we don't care nearly as much about bookkeeping cost, it's the un-contested fast paths that're most critical. > - We could make it work if we restricted MAX_BACKENDS to be 2**14 - but > while I personally think that's a sane upper limit, I already had a > hard time getting consensus to lower the limit to 2^18-1. 16384 backends is totally fine in sane real world deployments. But it'll probably upset marketing people when OtherDatabaseVendor starts shouting that they support 14 million connections, and postgres only has 16k. Sigh. The real answer here in the long term probably needs to be decoupling of executors from connection state inside postgres. But while we're on that topic, how about we convert the entire codebase to Rust while riding on a flying rainbow unicorn? We're stuck with the 1:1 connection to executor mapping for the foreseeable future. > - Use a 64bit integer. Then we can easily fit MAX_BACKENDS lockers, as > well as an offset to one of MAX_BACKENDS "wait lists" into LWLock. You know much more than me about the possible impacts of that on layout and caching, but I gather that it's probably undesirable to make LWLocks any bigger. > I think it's quite useful for relatively simple things like analyzing > the total amount of time spent in individual locks, without incuring > much overhead when not doing so (for which you need to identify > individual locks, otherwise your end - start time is going to be > meaningless). Yep. That's why the removal of the lock offset is a bit frustrating, because you can't identify an LWLock instance-wide by LWLock* due to the possibility of different per-backend DSM base-address mappings. Well, and ASLR on EXEC_BACKEND systems, but who cares about those? The removal was for good reasons though. And it only affects LWLocks in DSM, for everything else the LWLock* is good enough. 
If identifying LWLocks in DSM ever matters enough to bother to solve that problem, we can emit trace events on DSM mapping attach in each backend, and trace tools can do the work to track which LWLocks are in DSM and convert their addresses to a reference base address. Pg shouldn't have to pay the price for that unless it's something a lot of people need. > And, slightly more advanced, for analyzing what the stack > was when the lock was released - which then allows you to see what work > you're blocked on, something I found hard to figure out otherwise. > > I found that that's mostly quite doable with dynamic probes though. Yeah, it is. That's part of why my patchset here doesn't try to do a lot to LWLock tracepoints - I didn't think it was necessary to add a lot. The LWLock code is fairly stable, not usually something you have to worry about in production unless you're debugging badly behaved extensions, and usually somewhat probe-able with DWARF based dynamic probes. However, the way the wait-loop and fast-path are in the same function is a serious pain for dynamic probing; you can't tell the difference between a fast-path acquire and an acquire after a wait without using probes on function+offset or probing by source line. Both those are fine for dev work but useless in tool/library scripts. I almost wonder if we should test out moving the LWLock wait-loops out of LWLockAcquire(), LWLockAcquireOrWait() and LWLockWaitForVar() anyway, so the hot parts of the function are smaller. That'd make dynamic probing more convenient as a pleasant side effect. I imagine you must've tried this, benchmarked and profiled it, though, and found it to be a net loss, otherwise you surely would've done it as part of your various (awesome) performance work. Anyway, there are some other areas of postgres that are ridiculously painful to instrument with dynamic probes, especially in a somewhat version- and build-independent way. Tracking txn commit and abort (including 2PC and normal xacts, with capture of commit LSNs) is just painful with dynamic probing for example, and is one of my top priority areas to get some decent tracepoints for - the current txn management tracepoints are utterly worthless. But LWLocks are mostly fine, the only really big piece missing is a tracepoint fired exactly once when a lock is released by any release path. > > Like, I've seen problems for example we can see that there's a lot of > > contention on SubtransSLRULock, or there's problems with > > WALInsertLock. But I can't really see why I'd need to know which > > WALInsertLock was experiencing contention. > > Well, but you might want to know what the task blocking you was > doing. What to optimize might differ if the other task is e.g. a log > switch (which acquires all insert locks), than if it's WAL writes by > VACUUM. That sort of thing is why I've been interested in IDing the LWLock. That, and I work with extension code that probably abuses LWLocks a bit, but that's not a problem core postgres should have to care about. > > If we were speaking of buffer content locks, I suppose I can imagine > > wanting more details, but it's not really the buffer number I'd want > > to know. I'd want to know the database OID, the relfilenode, the fork > > number, and the block number. You can argue that we should just expose > > the buffer number and let the user sort out the rest with > > dtrace/systemtap magic, but that makes it useless in practice to an > > awful lot of people, including me. 
> > I have wondered if we ought to put some utilities for that in contrib or > such. It's a lot easier to address something new with a decent starting > point... Long term that's exactly what I want to do. I wrote some with systemtap, but it's since become clear to me that systemtap isn't going to get enough people on board. Setup is a hassle. So I'm trying to pivot over to bpf tools now, with the intention of getting together a library of canned probes and example scripts to help people get started. I've written systemtap scripts that can track a transaction from localxid allocation through real xid allocation, commit (or 2pc prepare and commit prepared), logical decoding, reorder buffering, output plugin processing, receipt by a logical downstream, downstream xid assignment, downstream commit and replorigin advance, downstream->upstream feedback, upstream slot advance, and upstream confirmed flush/catalog_xmin advance. Whole-lifecycle tracking with timing of each phase, across multiple processes and two postgres instances. For now the two postgres instances must be on the same host, but that could be dealt with too. The script reports the application name and pid of the upstream session, the upstream localxid, the upstream xid, upstream commit lsn, the downstream xid and the downstream commit lsn as it goes, and follows associations to track the transaction through its lifecycle. (The current script is written for BDR and pglogical, not in-core logical decoding, but the principles are the same). The problem is that scripts like this are just too fragile right now. Changes across Pg versions break the dynamic function probes they use, though that can be adapted to somewhat. The bigger problem is the number of places I have to insert statement (function+offset) probes, which are just too fragile to make these sorts of scripts generally useful. I have to fix them whenever I want to use them, so there's not much point trying to get people to use them. But it's hard to convince people of the value of static tracepoints that would make this sort of thing so much easier to do in a more stable manner when they can't easily see hands-on examples of what's possible. There's no "wow" factor. So I need to address the worst of the difficult-to-probe sites and start sharing some examples that use them. I thought this would be low-hanging fruit to start with. Whoops!
On Wed, 14 Apr 2021 at 02:25, Robert Haas <robertmhaas@gmail.com> wrote: > So before the commit in question -- > 3761fe3c20bb040b15f0e8da58d824631da00caa -- T_ID() used to compute an > offset for a lock within the tranche that was supposed to uniquely > identify the lock. However, the whole idea of an array per tranche > turns out to be broken by design. Yeah, I understand that. I'd really love it if a committer could add an explanatory comment or two in the area though. I'm happy to draft a comment patch if anyone agrees my suggestion is sensible. The key things I needed to know when studying the code were: * A LWLock* is always part of a tranche, but locks within a given tranche are not necessarily co-located in memory, cross referenced or associated in any way. * A LWLock tranche does not track how many LWLocks are in the tranche or where they are in memory. It only groups up LWLocks into categories and maps the tranche ID to a name. * Not all LWLocks are part of the main LWLock array; others can be embedded in shmem structs elsewhere, including in DSM segments. * LWLocks in DSM segments may not have the same address between different backends, because the DSM base address can vary, so a LWLock* cannot be reliably compared between backends unless you know it's in the main LWLock array or in static shmem. Having that info in lwlock.c near the tranche management code or the tranche and main lwlock arrays would've been very handy. > You could try to identify locks by pointer addresses, but that's got > security hazards and the addreses aren't portable across all the > backends involved in the parallel query because of how DSM works, so > it's not really that helpful in terms of matching stuff up. What I'm doing now is identifying them by LWLock* across backends. I keep track of DSM segment mappings in each backend inside the trace script and I relocate LWLock* pointers known to be inside DSM segments relative to a dummy base address so they're equal across backends. It's a real pain, but it works. The main downside is that the trace script has to observe the DSM attach; if it's started once a backend already has the DSM segment attached, it has no idea the LWLock is in a DSM segment or how to remap it. But that's not a serious issue. > On a broader level, I agree that being able to find out what the > system is doing is really important. But I'm also not entirely > convinced that having really fine-grained information here to > distinguish between one lock and another is the way to get there. At the start of this thread I would've disagreed with you. But yeah, you and Andres are right, because the costs outweigh the benefits here. I'm actually inclined to revise the patch I sent in order to *remove* the LWLock name from the tracepoint argument. At least for the fast-path tracepoints on start-of-acquire and end-of-acquire. I think it's probably OK to report it in the lock wait tracepoints, which is where it's most important to have anyway. So the tracepoint will always report the LWLock* and tranche ID, but it won't report the tranche name for the fast-path. I'll add trace events for tranche ID registration, so trace tools can either remember the tranche ID->name mappings from there, or capture them from lock wait events and remember them. Reasonable? That way we retain the most important trace functionality, but we reduce the overheads.
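A sketch of what the registration-side trace event could look like (the probe name below is invented for illustration and would need a matching probes.d entry):

    void
    LWLockRegisterTranche(int tranche_id, const char *tranche_name)
    {
        /* ... existing registration bookkeeping ... */

        /* hypothetical probe: lets tracers learn the tranche id -> name
         * mapping once, instead of reading the name on every lock event */
        TRACE_POSTGRESQL_LWLOCK_TRANCHE_REGISTER(tranche_id, tranche_name);
    }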
On Wed, 14 Apr 2021 at 10:41, Craig Ringer <craig.ringer@enterprisedb.com> wrote: > On Wed, 14 Apr 2021 at 02:25, Robert Haas <robertmhaas@gmail.com> wrote: > > You could try to identify locks by pointer addresses, but that's got > > security hazards and the addreses aren't portable across all the > > backends involved in the parallel query because of how DSM works, so > > it's not really that helpful in terms of matching stuff up. > > What I'm doing now is identifying them by LWLock* across backends. I > keep track of DSM segment mappings in each backend inside the trace > script and I relocate LWLock* pointers known to be inside DSM segments > relative to a dummy base address so they're equal across backends. BTW, one of the reasons I did this was to try to identify BDR and pglogical code that blocks or sleeps while holding a LWLock. I got stuck on that for other reasons, so it didn't go anywhere, but those issues are now resolved so I should probably return to it at some point. It'd be a nice thing to be able to run on postgres itself too.
On 12.04.21 07:46, Craig Ringer wrote: > > To use systemtap semaphores (the _ENABLED macros) you need to run > dtrace > > -g to generate a probes.o then link that into postgres. > > > > I don't think we do that. I'll double check soon. > > We do that. (It's -G.) > > > Huh. I could've sworn we didn't. My mistake, it's there in > src/backend/Makefile . > > In that case I'll amend the patch to use semaphore guards. This whole thread is now obviously moved to consideration for PG15, but I did add an open item about this particular issue (https://wiki.postgresql.org/wiki/PostgreSQL_14_Open_Items, search for "dtrace"). So if you could produce a separate patch that adds the _ENABLED guards targeting PG14 (and PG13), that would be helpful.
On Tue, Apr 13, 2021 at 4:46 PM Andres Freund <andres@anarazel.de> wrote: > I still don't like the two bytes, fwiw ;). Especially because it's 4 > bytes due to padding right now. I'm not surprised by that disclosure. But I think it's entirely worth it. Making wait states visible in pg_stat_activity isn't the most useful thing I've ever done to PostgreSQL, but it's far from the least useful. If we can get the same benefit with less overhead, that's great. > I'd like to move the LWLock->waiters list to outside of the lwlock > itself - at most TotalProcs LWLocks can be waited for, so we don't need > millions of empty proclist_heads. That way we could also remove the > proclist indirection - which shows up a fair bit in contended workloads. > > And if we had a separate "lwlocks being waited for" structure, we could > also add more information to it if we wanted to... > > The difficulty of course is having space to indicate which of these > "waiting for" lists are being used - there's not enough space in ->state > right now to represent that. Two possibile approaches: > > - We could make it work if we restricted MAX_BACKENDS to be 2**14 - but > while I personally think that's a sane upper limit, I already had a > hard time getting consensus to lower the limit to 2^18-1. > > - Use a 64bit integer. Then we can easily fit MAX_BACKENDS lockers, as > well as an offset to one of MAX_BACKENDS "wait lists" into LWLock. I'd rather not further reduce MAX_BACKENDS. I still think some day we're going to want to make that bigger again. Maybe not for a while, admittedly. But, do you need to fit this into "state"? If you just replaced "waiters" with a 32-bit integer, you'd save 4 bytes and have bits left over (and maybe restrict the tranche ID to 2^14 and squeeze that in too, as you mention). -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Apr 13, 2021 at 10:42 PM Craig Ringer <craig.ringer@enterprisedb.com> wrote: > I'd really love it if a committer could add an explanatory comment or > two in the area though. I'm happy to draft a comment patch if anyone > agrees my suggestion is sensible. The key things I needed to know when > studying the code were: > > * A LWLock* is always part of a tranche, but locks within a given > tranche are not necessarily co-located in memory, cross referenced or > associated in any way. > * A LWLock tranche does not track how many LWLocks are in the tranche > or where they are in memory. It only groups up LWLocks into categories > and maps the tranche ID to a name. > * Not all LWLocks are part of the main LWLock array; others can be > embedded in shmem structs elsewhere, including in DSM segments. > * LWLocks in DSM segments may not have the same address between > different backends, because the DSM base address can vary, so a > LWLock* cannot be reliably compared between backends unless you know > it's in the main LWLock array or in static shmem. > > Having that info in lwlock.c near the tranche management code or the > tranche and main lwlock arrays would've been very handy. I'm willing to review a comment patch along those lines. > I'm actually inclined to revise the patch I sent in order to *remove* > the LWLock name from the tracepoint argument. At least for the > fast-path tracepoints on start-of-acquire and end-of-acquire. I think > it's probably OK to report it in the lock wait tracepoints, which is > where it's most important to have anyway. So the tracepoint will > always report the LWLock* and tranche ID, but it won't report the > tranche name for the fast-path. I'll add trace events for tranche ID > registration, so trace tools can either remember the tranche ID->name > mappings from there, or capture them from lock wait events and > remember them. > > Reasonable? That way we retain the most important trace functionality, > but we reduce the overheads. Reducing the overheads is good, but I have no opinion on what's important for people doing tracing, because I am not one of those people. -- Robert Haas EDB: http://www.enterprisedb.com
On 14.04.21 15:20, Peter Eisentraut wrote: > On 12.04.21 07:46, Craig Ringer wrote: >> > To use systemtap semaphores (the _ENABLED macros) you need to run >> dtrace >> > -g to generate a probes.o then link that into postgres. >> > >> > I don't think we do that. I'll double check soon. >> >> We do that. (It's -G.) >> >> >> Huh. I could've sworn we didn't. My mistake, it's there in >> src/backend/Makefile . >> >> In that case I'll amend the patch to use semaphore guards. > > This whole thread is now obviously moved to consideration for PG15, but > I did add an open item about this particular issue > (https://wiki.postgresql.org/wiki/PostgreSQL_14_Open_Items, search for > "dtrace"). So if you could produce a separate patch that adds the > _ENABLED guards targeting PG14 (and PG13), that would be helpful. Here is a proposed patch for this.
Attachment
On Thu, 29 Apr 2021 at 15:31, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote: > > So if you could produce a separate patch that adds the > > _ENABLED guards targeting PG14 (and PG13), that would be helpful. > > Here is a proposed patch for this. LGTM. Applies and builds fine on master and (with default fuzz) on REL_13_STABLE. Works as expected. This does increase the size of LWLockAcquire() etc slightly but since it skips these function calls, and the semaphores are easily predicted, I don't have any doubt it's a net win. +1 for merge.
On Wed, 14 Apr 2021, 22:29 Robert Haas, <robertmhaas@gmail.com> wrote:
On Tue, Apr 13, 2021 at 10:42 PM Craig Ringer
<craig.ringer@enterprisedb.com> wrote:
> I'd really love it if a committer could add an explanatory comment or
> two in the area though. I'm happy to draft a comment patch if anyone
> agrees my suggestion is sensible. The key things I needed to know when
> studying the code were:
> [...]
I'm willing to review a comment patch along those lines.
Cool. I'll draft soon.
I've since noticed that some of the info is present, but it's in lwlock.h, whereas in Pg comment detail is more often than not in the .c file.
I prefer it in headers myself anyway, since it's more available to tools like doxygen. I'll add a few "see lwlock.h" hints and a short para about appropriate lwlock use to the .c file's comments, etc., and post on a separate thread soon.
> I'm actually inclined to revise the patch I sent in order to *remove*
> the LWLock name from the tracepoint argument.
Reducing the overheads is good, but I have no opinion on what's
important for people doing tracing, because I am not one of those
people.
Truthfully I'm not convinced anyone is "those people" right now. I don't think anyone is likely to be making serious use of them due to their limitations.
Certainly that'll be the case for the txn ones which are almost totally useless. They only track the localxid lifecycle, they don't track real txid allocation, WAL writing, commit (wal or shmem), etc.
> On Fri, Apr 30, 2021 at 11:23:56AM +0800, Craig Ringer wrote:
> On Wed, 14 Apr 2021, 22:29 Robert Haas, <robertmhaas@gmail.com> wrote:
> > > I'm actually inclined to revise the patch I sent in order to *remove*
> > > the LWLock name from the tracepoint argument.
> >
> > Reducing the overheads is good, but I have no opinion on what's
> > important for people doing tracing, because I am not one of those
> > people.
>
> Truthfully I'm not convinced anyone is "those people" right now. I don't
> think anyone is likely to be making serious use of them due to their
> limitations.

I would like to mention that tracepoints can be useful not only directly; they also:

* convey information, from the point of view of the developers who wrote the code, about what is important enough to trace.

* declare more stable tracing points within the code, which are somewhat more reliable between versions.

E.g. when writing bcc scripts one is also somewhat limited in the use of those tracepoints by the requirement to provide a specific pid, but one can still get a better understanding of what to look at (maybe using other methods).
On 30.04.21 05:22, Craig Ringer wrote: > On Thu, 29 Apr 2021 at 15:31, Peter Eisentraut > <peter.eisentraut@enterprisedb.com> wrote: >>> So if you could produce a separate patch that adds the >>> _ENABLED guards targeting PG14 (and PG13), that would be helpful. >> >> Here is a proposed patch for this. > > LGTM. > > Applies and builds fine on master and (with default fuzz) on > REL_13_STABLE. Works as expected. committed
Hi,

On 2021-05-03 21:06:30 +0200, Peter Eisentraut wrote:
> On 30.04.21 05:22, Craig Ringer wrote:
> > On Thu, 29 Apr 2021 at 15:31, Peter Eisentraut
> > <peter.eisentraut@enterprisedb.com> wrote:
> > > > So if you could produce a separate patch that adds the
> > > > _ENABLED guards targeting PG14 (and PG13), that would be helpful.
> > >
> > > Here is a proposed patch for this.
> >
> > LGTM.
> >
> > Applies and builds fine on master and (with default fuzz) on
> > REL_13_STABLE. Works as expected.
>
> committed

I'm now getting

/home/andres/src/postgresql/src/backend/storage/lmgr/lwlock.c: In function ‘LWLockAcquire’:
/home/andres/src/postgresql/src/backend/storage/lmgr/lwlock.c:1322:58: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
 1322 |   TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), mode);
/home/andres/src/postgresql/src/backend/storage/lmgr/lwlock.c:1345:57: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
 1345 |   TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), mode);
/home/andres/src/postgresql/src/backend/storage/lmgr/lwlock.c:1355:54: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
 1355 |   TRACE_POSTGRESQL_LWLOCK_ACQUIRE(T_NAME(lock), mode);
/home/andres/src/postgresql/src/backend/storage/lmgr/lwlock.c: In function ‘LWLockConditionalAcquire’:
/home/andres/src/postgresql/src/backend/storage/lmgr/lwlock.c:1407:64: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
 1407 |   TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE_FAIL(T_NAME(lock), mode);
/home/andres/src/postgresql/src/backend/storage/lmgr/lwlock.c:1415:59: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
 1415 |   TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE(T_NAME(lock), mode);
/home/andres/src/postgresql/src/backend/storage/lmgr/lwlock.c: In function ‘LWLockAcquireOrWait’:
/home/andres/src/postgresql/src/backend/storage/lmgr/lwlock.c:1488:59: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
 1488 |   TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), mode);
/home/andres/src/postgresql/src/backend/storage/lmgr/lwlock.c:1507:58: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
 1507 |   TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), mode);
/home/andres/src/postgresql/src/backend/storage/lmgr/lwlock.c:1538:68: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
 1538 |   TRACE_POSTGRESQL_LWLOCK_ACQUIRE_OR_WAIT_FAIL(T_NAME(lock), mode);
/home/andres/src/postgresql/src/backend/storage/lmgr/lwlock.c:1547:63: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
 1547 |   TRACE_POSTGRESQL_LWLOCK_ACQUIRE_OR_WAIT(T_NAME(lock), mode);
/home/andres/src/postgresql/src/backend/storage/lmgr/lwlock.c: In function ‘LWLockWaitForVar’:
/home/andres/src/postgresql/src/backend/storage/lmgr/lwlock.c:1708:66: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
 1708 |   TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), LW_EXCLUSIVE);
/home/andres/src/postgresql/src/backend/storage/lmgr/lwlock.c:1728:65: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
 1728 |   TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), LW_EXCLUSIVE);
/home/andres/src/postgresql/src/backend/storage/lmgr/lwlock.c: In function ‘LWLockRelease’:
/home/andres/src/postgresql/src/backend/storage/lmgr/lwlock.c:1855:48: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
 1855 |   TRACE_POSTGRESQL_LWLOCK_RELEASE(T_NAME(lock));

In a build without the trace stuff enabled.

Greetings,

Andres Freund
On Wed, 5 May 2021, 06:15 Andres Freund, <andres@anarazel.de> wrote:
Hi,
warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
1322 | TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), mode);
| ^
Odd that I didn't get that.
I'll send a patch to revise shortly.
On Wed, 5 May 2021 at 09:15, Craig Ringer <craig.ringer@enterprisedb.com> wrote: >> warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body] >> 1322 | TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), mode); >> | ^ > > Odd that I didn't get that. This compiler complaint is not due to the _ENABLED() test as such. It's due to the fact that we completely define out the TRACE_POSTGRESQL_ macros with src/backend/utils/Gen_dummy_probes.sed . While explicit braces could be added around each test, I suggest fixing Gen_dummy_probes.sed to emit the usual dummy statement instead. Patch attached.
Attachment
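To illustrate the shape of the change (the macro bodies below are illustrative; see the attached patch for the real edit): in a build without --enable-dtrace, the generated dummy probes.h defines each probe away entirely, so the new "if (..._ENABLED()) TRACE_...(...)" guards leave an empty if body. Emitting the usual do-nothing statement keeps the body non-empty:

    /* before: the probe macro expands to nothing, so the guarded call
     * becomes "if (0) ;", which is what -Wempty-body complains about */
    #define TRACE_POSTGRESQL_LWLOCK_WAIT_START(INT1, INT2)

    /* after: expand to the usual dummy statement instead */
    #define TRACE_POSTGRESQL_LWLOCK_WAIT_START(INT1, INT2) do {} while (0)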
On 05.05.21 00:15, Andres Freund wrote:
> I'm now getting
>
> /home/andres/src/postgresql/src/backend/storage/lmgr/lwlock.c: In function ‘LWLockAcquire’:
> /home/andres/src/postgresql/src/backend/storage/lmgr/lwlock.c:1322:58: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
>  1322 |   TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), mode);

For clarification, -Wempty-body is not part of the default warnings, right? And even if I turn it on explicitly, I don't get this warning. I read something that newer compilers don't warn in cases of macro expansion. What compiler are you using in this situation?
Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:
> On 05.05.21 00:15, Andres Freund wrote:
>> I'm now getting
>> /home/andres/src/postgresql/src/backend/storage/lmgr/lwlock.c: In function ‘LWLockAcquire’:
>> /home/andres/src/postgresql/src/backend/storage/lmgr/lwlock.c:1322:58: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
>> 1322 |   TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), mode);

> What compiler are you using in this situation?

All of these buildfarm members are now showing this warning:

    calliphoridae   gcc (Debian 10.1.0-6) 10.1.0
    culicidae       gcc (Debian 10.1.0-6) 10.1.0
    flaviventris    gcc (Debian 20200124-1) 10.0.1 20200124 (experimental)
    francolin       gcc (Debian 10.1.0-6) 10.1.0
    piculet         gcc (Debian 10.1.0-6) 10.1.0
    rorqual         gcc (Debian 10.1.0-6) 10.1.0
    serinus         gcc (Debian 20200124-1) 10.0.1 20200124 (experimental)
    skink           gcc (Debian 10.1.0-6) 10.1.0

so there's your answer. (I wonder why flaviventris and serinus are still using an "experimental" compiler version that is now behind mainstream.)

regards, tom lane
Hi,

On 2021-05-08 13:13:47 -0400, Tom Lane wrote:
> Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:
> > On 05.05.21 00:15, Andres Freund wrote:
> >> I'm now getting
> >> /home/andres/src/postgresql/src/backend/storage/lmgr/lwlock.c: In function ‘LWLockAcquire’:
> >> /home/andres/src/postgresql/src/backend/storage/lmgr/lwlock.c:1322:58: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
> >> 1322 |   TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), mode);

> > What compiler are you using in this situation?

gcc - I think the warning is pulled in via -Wextra. I think it's something sensible to warn about, too; it's easy to end up with misleading behaviour when statement-like macros are defined empty.

> All of these buildfarm members are now showing this warning:
>
> calliphoridae gcc (Debian 10.1.0-6) 10.1.0
> culicidae gcc (Debian 10.1.0-6) 10.1.0
> flaviventris gcc (Debian 20200124-1) 10.0.1 20200124 (experimental)
> francolin gcc (Debian 10.1.0-6) 10.1.0
> piculet gcc (Debian 10.1.0-6) 10.1.0
> rorqual gcc (Debian 10.1.0-6) 10.1.0
> serinus gcc (Debian 20200124-1) 10.0.1 20200124 (experimental)
> skink gcc (Debian 10.1.0-6) 10.1.0

I think those likely are all mine, so it's not too surprising. They all use something like

    CFLAGS => '-Og -ggdb -g3 -Wall -Wextra -Wno-unused-parameter -Wno-sign-compare -Wno-missing-field-initializers -fno-omit-frame-pointer',

> (I wonder why flaviventris and serinus are still using an "experimental"
> compiler version that is now behind mainstream.)

The upgrade script didn't install the newer version because it had to remove some conflicting packages... Should be fixed for runs starting now.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2021-05-08 13:13:47 -0400, Tom Lane wrote:
>> (I wonder why flaviventris and serinus are still using an "experimental"
>> compiler version that is now behind mainstream.)

> The upgrade script didn't install the newer version because it had to
> remove some conflicting packages... Should be fixed for runs starting
> now.

Looks like that didn't work ...

regards, tom lane
On 05.05.21 06:20, Craig Ringer wrote: > On Wed, 5 May 2021 at 09:15, Craig Ringer <craig.ringer@enterprisedb.com> wrote: > >>> warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body] >>> 1322 | TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), mode); >>> | ^ >> >> Odd that I didn't get that. > > This compiler complaint is not due to the _ENABLED() test as such. > It's due to the fact that we completely define out the > TRACE_POSTGRESQL_ macros with src/backend/utils/Gen_dummy_probes.sed . > > While explicit braces could be added around each test, I suggest > fixing Gen_dummy_probes.sed to emit the usual dummy statement instead. > Patch attached. Committed, with the Gen_dummy_probes.pl change added.
On 2021-05-09 19:51:13 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2021-05-08 13:13:47 -0400, Tom Lane wrote:
> >> (I wonder why flaviventris and serinus are still using an "experimental"
> >> compiler version that is now behind mainstream.)
>
> > The upgrade script didn't install the newer version because it had to
> > remove some conflicting packages... Should be fixed for runs starting
> > now.
>
> Looks like that didn't work ...

Looks like it did, but turned out to have some unintended side-effects :(.

The snapshot builds are now new:

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=flaviventris&dt=2021-05-10%2015%3A43%3A56&stg=configure

    configure:3966: ccache /usr/lib/gcc-snapshot/bin/gcc --version >&5
    gcc (Debian 20210421-1) 11.0.1 20210421 (prerelease) [gcc-11 revision fbb7739892e:d13ce34bd01:3756d99dab6a268d0d8a17583980a86f23f0595a]

But removing the aforementioned conflicting packages broke the installed old versions of gcc/clang.

I started to build the old versions of llvm manually, but that then hits the issue that at least 3.9 doesn't build with halfway modern versions of gcc/clang. So I gotta do it stepwise (i.e. go backwards, build llvm n-2 with n-1), will take a bit of time.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> Looks like it did, but turned out to have some unintended side-effects
> :(.
> The snapshot builds are now new:
> https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=flaviventris&dt=2021-05-10%2015%3A43%3A56&stg=configure
> configure:3966: ccache /usr/lib/gcc-snapshot/bin/gcc --version >&5
> gcc (Debian 20210421-1) 11.0.1 20210421 (prerelease) [gcc-11 revision fbb7739892e:d13ce34bd01:3756d99dab6a268d0d8a17583980a86f23f0595a]
> But removing the aforementioned conflicting packages broke the
> installed old versions of gcc/clang.
> I started to build the old versions of llvm manually, but that then hits
> the issue that at least 3.9 doesn't build with halfway modern versions
> of gcc/clang. So I gotta do it stepwise (i.e. go backwards, build llvm
> n-2 with n-1), will take a bit of time.

Ugh. Memo to self: don't rag on other people's buildfarm configurations right before a release deadline :-(. Sorry to cause you trouble.

regards, tom lane
Hi,

On 2021-05-10 12:14:46 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Looks like it did, but turned out to have some unintended side-effects
> > :(.
> > The snapshot builds are now new:
> > https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=flaviventris&dt=2021-05-10%2015%3A43%3A56&stg=configure
> > configure:3966: ccache /usr/lib/gcc-snapshot/bin/gcc --version >&5
> > gcc (Debian 20210421-1) 11.0.1 20210421 (prerelease) [gcc-11 revision fbb7739892e:d13ce34bd01:3756d99dab6a268d0d8a17583980a86f23f0595a]
> > But removing the aforementioned conflicting packages broke the
> > installed old versions of gcc/clang.
> > I started to build the old versions of llvm manually, but that then hits
> > the issue that at least 3.9 doesn't build with halfway modern versions
> > of gcc/clang. So I gotta do it stepwise (i.e. go backwards, build llvm
> > n-2 with n-1), will take a bit of time.
>
> Ugh. Memo to self: don't rag on other people's buildfarm configurations
> right before a release deadline :-(. Sorry to cause you trouble.

No worries - I knew that I'd have to do this at some point, even though I hadn't planned to do that today... I should have all of them green before end of today.

I found that I actually can build LLVM 3.9 directly, as clang-6 can still build it directly (whereas the oldest gcc still installed can't build it directly). So it's a bit less painful than I thought at first.

The tests for the 3.9 instances (phycodurus, dragonet) are running right now, and I'm fairly sure they'll pass (most of a --noreport --nostatus run passed). Going forward, building LLVM 4, 5, 6 now - the later versions take longer...

Greetings,

Andres Freund
Hi,

On 2021-05-10 09:46:02 -0700, Andres Freund wrote:
> No worries - I knew that I'd have to do this at some point, even though
> I hadn't planned to do that today... I should have all of them green
> before end of today.
>
> I found that I actually can build LLVM 3.9 directly, as clang-6 can
> still build it directly (whereas the oldest gcc still installed can't
> build it directly). So it's a bit less painful than I thought at first.
>
> The tests for the 3.9 instances (phycodurus, dragonet) are running right now,
> and I'm fairly sure they'll pass (most of a --noreport --nostatus run
> passed). Going forward, building LLVM 4, 5, 6 now - the later versions take
> longer...

Looks like it's all clear now. All but the results for 11 had cleared up by yesterday evening, and the rest came in ok overnight.

Greetings,

Andres Freund