Re: scalability bottlenecks with (many) partitions (and more) - Mailing list pgsql-hackers

From Jakub Wartak
Subject Re: scalability bottlenecks with (many) partitions (and more)
Date
Msg-id CAKZiRmyML5tE67n8SLbiU=kKbVROqb33NGT-=kNv1Vcv1dTuAA@mail.gmail.com
In response to Re: scalability bottlenecks with (many) partitions (and more)  (Tomas Vondra <tomas@vondra.me>)
Responses Re: scalability bottlenecks with (many) partitions (and more)
List pgsql-hackers
Hi Tomas!

On Tue, Sep 3, 2024 at 6:20 PM Tomas Vondra <tomas@vondra.me> wrote:
>
> On 9/3/24 17:06, Robert Haas wrote:
> > On Mon, Sep 2, 2024 at 1:46 PM Tomas Vondra <tomas@vondra.me> wrote:
> >> The one argument to not tie this to max_locks_per_transaction is the
> >> vastly different "per element" memory requirements. If you add one entry
> >> to max_locks_per_transaction, that adds LOCK which is a whopping 152B.
> >> OTOH one fast-path entry is ~5B, give or take. That's a pretty big
> >> difference, and if the locks fit into the shared lock table but
> >> you'd like to allow more fast-path locks, having to increase
> >> max_locks_per_transaction is not great - pretty wasteful.
> >>
> >> OTOH I'd really hate to just add another GUC and hope the users will
> >> magically know how to set it correctly. That's pretty unlikely, IMO. I
> >> myself wouldn't know what a good value is, I think.
> >>
> >> But say we add a GUC and set it to -1 by default, in which case it just
> >> inherits the max_locks_per_transaction value. And then also provide some
> >> basic metric about this fast-path cache, so that people can tune this?
> >
> > All things being equal, I would prefer not to add another GUC for
> > this, but we might need it.
> >
>
> Agreed.
>
> [..]
>
> So I think I'm OK with just tying this to max_locks_per_transaction.

For what it's worth, the SLRU configurability effort added 7 GUCs (with
3 of them scaling based on shared_buffers) just to give high-end users
some relief, so one new GUC here shouldn't be that big a deal. We could
extend the LWLock/lock_manager wait event docs to recommend
known-to-be-good values from this $thread (or ask users to benchmark it
themselves).
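
Just to illustrate the "known-to-be-good values" idea: if the fast-path
array ends up being sized from max_locks_per_transaction, the tuning
knob already exists. A sketch (the value is a made-up placeholder, not a
recommendation from this $thread):

  -- hypothetical value; benchmark before using
  ALTER SYSTEM SET max_locks_per_transaction = 1024;
  -- a postmaster restart is required for this to take effect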

> >> I think just knowing the "hit ratio" would be enough, i.e. counters for
> >> how often it fits into the fast-path array, and how often we had to
> >> promote it to the shared lock table would be enough, no?
> >
> > Yeah, probably. I mean, that won't tell you how big it needs to be,
> > but it will tell you whether it's big enough.
> >
>
> True, but that applies to all "cache hit ratio" metrics (like for our
> shared buffers). It'd be great to have something better, enough to tell
> you how large the cache needs to be. But we don't :-(

My $0.02: the originating case that triggered those patches actually
started with LWLock/lock_manager waits being the top #1 wait event. The
operator can cross-check (join) that with pg_locks, grouping by
fastpath and counting the rows where fastpath = 'f'. So, IMHO, we have
good observability in this case (a rare thing to say!)
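
Something along these lines (a rough sketch using only the stock
pg_stat_activity and pg_locks views; the LWLock shows up as
"lock_manager" on older releases and "LockManager" on newer ones):

  -- backends currently waiting on the lock manager LWLock
  SELECT wait_event_type, wait_event, count(*)
  FROM pg_stat_activity
  WHERE wait_event IN ('lock_manager', 'LockManager')
  GROUP BY 1, 2;

  -- how many held locks did / did not take the fast path
  SELECT fastpath, count(*)
  FROM pg_locks
  GROUP BY fastpath;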

> > I wonder if we should be looking at further improvements in the lock
> > manager of some kind. [..]
>
> Perhaps. I agree we'll probably need something more radical soon, not
> just changes that aim to fix some rare exceptional case (which may be
> annoying, but not particularly harmful for the complete workload).
>
> For example, if we did what you propose, that might help when very few
> transactions need a lot of locks. I don't mind saving memory in that
> case, ofc. but is it a problem if those rare cases are a bit slower?
> Shouldn't we focus more on cases where many locks are common? Because
> people are simply going to use partitioning, a lot of indexes, etc?
>
> So yeah, I agree we probably need a more fundamental rethink. I don't
> think we can just keep optimizing the current approach, there's a limit
> to how fast it can be.

Please help me understand: are you both discussing potential far-future
improvements instead of this one? My question really is: is this
patchset good enough, or are you considering some other new effort
instead?

BTW some other random questions:
Q1. I've been looking at
https://github.com/tvondra/pg-lock-scalability-results, and those
results shouldn't be used for further discussion anymore, since they
were produced with earlier patches (including
0003-Add-a-memory-pool-with-adaptive-rebalancing.patch) and have been
superseded by the benchmark data in this $thread, right?
Q2. Earlier attempts did include a mempool patch to get those nice
numbers (or was that jemalloc or glibc tuning?). So were the recent
results in [1] still collected with 0003, or have you switched
completely to glibc/jemalloc tuning?

-J.

[1] - https://www.postgresql.org/message-id/b8c43eda-0c3f-4cb4-809b-841fa5c40ada%40vondra.me


