Re: Configurable FP_LOCK_SLOTS_PER_BACKEND - Mailing list pgsql-hackers

From Matt Smiley
Subject Re: Configurable FP_LOCK_SLOTS_PER_BACKEND
Date
Msg-id CA+eRB3qhb1Nt5hDct3iO=vtry1kuMgF4uzOns9GOatSLc3vuSw@mail.gmail.com
In response to Re: Configurable FP_LOCK_SLOTS_PER_BACKEND  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List pgsql-hackers
Thank you Tomas!  I really appreciate your willingness to dig in here and help us out!  The rest of my replies are inline below.

On Thu, Aug 3, 2023 at 1:39 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> The analysis in the linked gitlab issue is pretty amazing. I wasn't
> planning to argue against the findings anyway, but plenty of data
> supporting the conclusions is good.

Thank you!  I totally agree, having supporting data is so helpful.

> I'm not an expert on locking, so some of the stuff I say may be
> trivially obvious - it's just me thinking about ...

Absolutely makes sense to check assumptions, etc.  Thanks for being open!  For what it's worth, I've also been working with Postgres for many years, and I love that it keeps teaching me new things, this topic being just the latest.

> I wonder what's the rough configuration of those systems, though. Both
> the hardware and PostgreSQL side. How many cores / connections, etc.?

Each of the postgres hosts had 96 vCPUs and at peak handled roughly 80 concurrently active connections.

For purposes of reproducing the pathology, I think we can do so with a single postgres instance.  We will need a high enough query rate to push the bottleneck to lock_manager lwlock contention.  The simplest way to do so is probably to give it a small dataset that fits easily in cache and run several concurrent client connections doing cheap single-row queries, each in its own transaction, against a target table that has many indexes, many partitions, or both.
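For illustration, a minimal sketch of that kind of setup might look like the following (the table and column names are placeholders, the counts are arbitrary, and the index-creation step uses psql's \gexec meta-command); the idea is just that each single-row SELECT has to lock the table plus all 17 of its indexes during planning, overflowing the 16 per-backend fastpath slots into the shared lock manager:

-- Hypothetical repro schema: a small table with enough indexes that every
-- planned SELECT acquires well over 16 relation locks.
CREATE TABLE fp_overflow_test (
    id  bigint PRIMARY KEY,
    c01 int, c02 int, c03 int, c04 int, c05 int, c06 int, c07 int, c08 int,
    c09 int, c10 int, c11 int, c12 int, c13 int, c14 int, c15 int, c16 int
);

-- One index per non-key column; psql's \gexec runs each generated statement.
SELECT format('CREATE INDEX ON fp_overflow_test (c%s);', lpad(i::text, 2, '0'))
FROM generate_series(1, 16) AS i \gexec

-- A small dataset that fits easily in cache (non-key columns stay NULL).
INSERT INTO fp_overflow_test (id)
SELECT i FROM generate_series(1, 10000) AS i;

ANALYZE fp_overflow_test;

Then drive it from several dozen concurrent client connections, each repeatedly running something like SELECT * FROM fp_overflow_test WHERE id = <random id>; in its own transaction (a custom pgbench script works well for this), and watch for lock_manager LWLock waits in pg_stat_activity.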

For context, here's a brief summary of the production environment where we first observed this pathology:
The writable primary postgres instance has several streaming replicas, used for read-only portions of the workload.  All of them run on equivalent hardware.  Most of the research focuses on the streaming replica postgres instances, although the same pathology has been observed on the writable primary node as well.  The general topology is thousands of client connections fanning down into several pgbouncer instances per Postgres instance.  From each Postgres instance's perspective, its workload generally has a daily peak of roughly 80 concurrently active backends supporting a throughput of 75K transactions per second, where most transactions run a single query.
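(As an aside, one way to sample the "concurrently active backends" figure above is a periodic query along these lines; this is just a sketch, not the exact query we used:)

SELECT count(*) AS active_backends
FROM pg_stat_activity
WHERE state = 'active'
  AND backend_type = 'client backend'
  AND pid <> pg_backend_pid();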

> Yes, I agree with that. Partitioning makes this issue worse, I guess.
> Schemas with indexes on every column are disturbingly common these days
> too, which hits the issue too ...

Agreed.
 
> Those are pretty great pieces of information. I wonder if some of the
> measurements may be affecting the observation (by consuming too much
> CPU, making the contention worse), but overall it seems convincing.

Yes, definitely overhead is a concern, glad you asked!

Here are my notes on the overhead for each bpftrace script: https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1357834678


Here are more generic benchmark results for uprobe overhead: https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1383

Briefly, we generally expect the instrumentation overhead to be roughly 1-2 microseconds per call to the instrumented instruction.  It partly depends on what we're doing in the instrumentation, but most of that overhead is just the interrupt handling to transfer control flow to/from the BPF code.

> Would it be difficult to sample just a small fraction of the calls? Say,
> 1%, to get good histograms/estimates with acceptable CPU usage.

That would be great, but since the overhead comes mostly from the control transfer, it wouldn't help to put sampling logic in the tracer itself.  The main way to mitigate that overhead is to choose instrumentation points where the call rate is tolerably low.  That's why the only scripts I ran for more than a few seconds at a time were the "low overhead" ones, which instrument only the stalled calls to LWLockAcquire.
 
> In any case, it's a great source of information to reproduce the issue
> and evaluate possible fixes.

Thanks, that's my hope!
 
> After looking at the code etc. I think the main trade-off here is going
> to be the cost of searching the fpRelId array. At the moment it's
> searched linearly, which is cheap for 16 locks. But at some point it'll
> become as expensive as updating the slowpath, and the question is when.
>
> I wonder if we could switch to a more elaborate strategy if the number
> of locks is high enough. Say, a hash table, or some hybrid approach.

Interesting idea!  I was hoping a linear search would stay cheap enough, but you're right: it's going to become too inefficient at some point.  It might make sense to start with just black-box timing or throughput measurements, because directly measuring that search duration may not be cheap.  To observe durations via BPF, we have to instrument two points (e.g. function entry and return, or more generally the instructions before and after the critical section we're observing).  For code called as frequently as LWLockAcquire, that overhead would be prohibitively expensive, so we might need to measure it natively with counters for each histogram bucket we care about.  Just thinking ahead; we don't need to deal with this yet, I guess.
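As a sketch of the kind of black-box measurement I mean (just an illustration, not a settled plan): run the benchmark at each candidate slot count and compare throughput by diffing the commit counters over a fixed interval, e.g.:

-- Take one sample, wait a fixed interval (say 60 seconds), take another;
-- throughput is the difference in xacts divided by the interval.
SELECT now() AS sampled_at,
       sum(xact_commit + xact_rollback) AS xacts
FROM pg_stat_database;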
 
> I understand. I have two concerns:
>
> 1) How would the users know they need to tune this / determine what's
> the right value, and what's the right value for their system.
>
> 2) Having to deal with misconfigured systems as people tend to blindly
> tune everything to 100x the default, because more is better :-(

These are very valid concerns.  Thanks for articulating them!

For point 1:
The advice might be to only increase the number of fastpath lock slots per backend if sampling pg_stat_activity frequently shows "lock_manager" wait_events affecting a significant percentage of your non-idle backends.  And increase it as little as possible, due to the extra overhead incurred when checking locks.  For calibration purposes, I polled pg_locks periodically, counting the number of slowpath locks whose lock mode was weak enough to qualify for fastpath if there had been enough fastpath slots available.  See the graph and SQL at the bottom of this note: https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1365595142.
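To make that concrete, rough sketches of both checks might look like the queries below (these are approximations, not the exact queries in the linked note, and on newer releases the wait_event name appears as "LockManager" rather than "lock_manager"):

-- How many active backends are currently stalled on the lock manager?
SELECT count(*) AS backends_waiting_on_lock_manager
FROM pg_stat_activity
WHERE state = 'active'
  AND wait_event_type = 'LWLock'
  AND wait_event = 'lock_manager';

-- How many granted relation locks took the slowpath even though their mode
-- was weak enough to have used a fastpath slot, had one been free?
SELECT count(*) AS slowpath_but_fastpath_eligible
FROM pg_locks
WHERE locktype = 'relation'
  AND NOT fastpath
  AND granted
  AND mode IN ('AccessShareLock', 'RowShareLock', 'RowExclusiveLock');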

For point 2:
Maybe we should put a ceiling on the allowed value.  Maybe 64 or 128 or 256?  It should probably depend on the cost of searching the fpRelId array, as you described earlier.
