Re: scalability bottlenecks with (many) partitions (and more) - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: scalability bottlenecks with (many) partitions (and more)
Date
Msg-id b264bc7e-48de-4fba-8b33-f5190aa62f2f@vondra.me
Whole thread Raw
In response to Re: scalability bottlenecks with (many) partitions (and more)  (Jakub Wartak <jakub.wartak@enterprisedb.com>)
Responses Re: scalability bottlenecks with (many) partitions (and more)
List pgsql-hackers
On 9/4/24 11:29, Jakub Wartak wrote:
> Hi Tomas!
> 
> On Tue, Sep 3, 2024 at 6:20 PM Tomas Vondra <tomas@vondra.me> wrote:
>>
>> On 9/3/24 17:06, Robert Haas wrote:
>>> On Mon, Sep 2, 2024 at 1:46 PM Tomas Vondra <tomas@vondra.me> wrote:
>>>> The one argument to not tie this to max_locks_per_transaction is the
>>>> vastly different "per element" memory requirements. If you add one entry
>>>> to max_locks_per_transaction, that adds LOCK which is a whopping 152B.
>>>> OTOH one fast-path entry is ~5B, give or take. That's a pretty big
>>>> difference, and it if the locks fit into the shared lock table, but
>>>> you'd like to allow more fast-path locks, having to increase
>>>> max_locks_per_transaction is not great - pretty wastefull.
>>>>
>>>> OTOH I'd really hate to just add another GUC and hope the users will
>>>> magically know how to set it correctly. That's pretty unlikely, IMO. I
>>>> myself wouldn't know what a good value is, I think.
>>>>
>>>> But say we add a GUC and set it to -1 by default, in which case it just
>>>> inherits the max_locks_per_transaction value. And then also provide some
>>>> basic metric about this fast-path cache, so that people can tune this?
>>>
>>> All things being equal, I would prefer not to add another GUC for
>>> this, but we might need it.
>>>
>>
>> Agreed.
>>
>> [..]
>>
>> So I think I'm OK with just tying this to max_locks_per_transaction.
> 
> If that matters then the SLRU configurability effort added 7 GUCs
> (with 3 scaling up based on shared_buffers) just to give high-end
> users some relief, so here 1 new shouldn't be that such a deal. We
> could add to the LWLock/lock_manager wait event docs to recommend just
> using known-to-be-good certain values from this $thread (or ask the
> user to benchmark it himself).
> 

TBH I'm skeptical we'll be able to tune those GUCs. Maybe it was the
right thing for the SLRU thread, I don't know - I haven't been following
that very closely. But my impression is that we often add a GUC when
we're not quite sure how to pick a good value. So we just shift the
responsibility to someone else, who however also doesn't know.

I'd very much prefer not to do that here. Of course, it's challenging
because we can't easily resize these arrays, so even if we had some nice
heuristics to calculate the "optimal" number of fast-path slots, what
would we do with it ...

>>>> I think just knowing the "hit ratio" would be enough, i.e. counters for
>>>> how often it fits into the fast-path array, and how often we had to
>>>> promote it to the shared lock table would be enough, no?
>>>
>>> Yeah, probably. I mean, that won't tell you how big it needs to be,
>>> but it will tell you whether it's big enough.
>>>
>>
>> True, but that applies to all "cache hit ratio" metrics (like for our
>> shared buffers). It'd be great to have something better, enough to tell
>> you how large the cache needs to be. But we don't :-(
> 
> My $0.02 cents: the originating case that triggered those patches,
> actually started with LWLock/lock_manager waits being the top#1. The
> operator can cross check (join) that with a group by pg_locks.fastpath
> (='f'), count(*). So, IMHO we have good observability in this case
> (rare thing to say!)
> 

That's a good point. So if you had to give some instructions to users
what to measure / monitor, and how to adjust the GUC based on that, what
would your instructions be?

>>> I wonder if we should be looking at further improvements in the lock
>>> manager of some kind. [..]
>>
>> Perhaps. I agree we'll probably need something more radical soon, not
>> just changes that aim to fix some rare exceptional case (which may be
>> annoying, but not particularly harmful for the complete workload).
>>
>> For example, if we did what you propose, that might help when very few
>> transactions need a lot of locks. I don't mind saving memory in that
>> case, ofc. but is it a problem if those rare cases are a bit slower?
>> Shouldn't we focus more on cases where many locks are common? Because
>> people are simply going to use partitioning, a lot of indexes, etc?
>>
>> So yeah, I agree we probably need a more fundamental rethink. I don't
>> think we can just keep optimizing the current approach, there's a limit
>> of fast it can be.
> 
> Please help me understand: so are You both discussing potential far
> future further improvements instead of this one ? My question is
> really about: is the patchset good enough or are you considering some
> other new effort instead?
> 

I think it was mostly a brainstorming about alternative / additional
improvements in locking. The proposed patch does not change the locking
in any fundamental way, it merely optimizes one piece - we still acquire
exactly the same set of locks, exactly the same way.

AFAICS there's an agreement the current approach has limits, and with
the growing number of partitions we're hitting them already. That may
need rethinking the fundamental approach, but I think that should not
block improvements to the current approach.

Not to mention there's no proposal for such "fundamental rework" yet.

> BTW some other random questions:
> Q1. I've been lurking into
> https://github.com/tvondra/pg-lock-scalability-results and those
> shouldn't be used anymore for further discussions, as they contained
> earlier patches (including
> 0003-Add-a-memory-pool-with-adaptive-rebalancing.patch) and they were
> replaced by benchmark data in this $thread, right?

The github results are still valid, I've only shared them 3 days ago. It
does test both the mempool and glibc tuning, to assess (and compare) the
benefits of that, but why would that make it obsolete?

By "results in this thread" I suppose you mean the couple numbers I
shared on September 2? Those were just very limited benchmarks to asses
if making the arrays variable-length (based on GUC) would make things
slower. And it doesn't, so the "full" github results still apply.

> Q2. Earlier attempts did contain a mempool patch to get those nice
> numbers (or was that jemalloc or glibc tuning). So were those recent
> results in [1] collected with still 0003 or you have switched
> completely to glibc/jemalloc tuning?
> 

The results pushed to github are all with glibc, and test four cases:

a) mempool patch not applied, no glibc tuning
b) mempool patch applied, no glibc tuning
c) mempool patch not applied, glibc tuning
d) mempool patch applied, glibc tuning

These are the 4 "column groups" in some of the pivot tables, to allow
comparing those cases. My interpretation of the results are

1) The mempool / glibc tuning have significant benefits, at least for
some workloads (where the locking patch alone does help much).

2) There's very little difference between the mempool / glibc tuning.
The mempool does seem to have a small advantage.

3) The mempool / glibc tuning is irrelevant for non-glibc systems (e.g.
for FreeBSD which I think uses jemalloc or something like that).

I think the mempool might be interesting and useful for other reasons
(e.g. I initially wrote it to enforce a per-backend memory limit), but
you can get mostly the same caching benefits by tuning the glibc parameters.

So I'm focusing on the locking stuff.


regards

-- 
Tomas Vondra



pgsql-hackers by date:

Previous
From: Thomas Munro
Date:
Subject: Re: Use streaming read API in ANALYZE
Next
From: jian he
Date:
Subject: Re: First draft of PG 17 release notes