Re: scalability bottlenecks with (many) partitions (and more) - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: scalability bottlenecks with (many) partitions (and more)
Date:
Msg-id: b264bc7e-48de-4fba-8b33-f5190aa62f2f@vondra.me
In response to: Re: scalability bottlenecks with (many) partitions (and more) (Jakub Wartak <jakub.wartak@enterprisedb.com>)
Responses: Re: scalability bottlenecks with (many) partitions (and more)
List: pgsql-hackers

On 9/4/24 11:29, Jakub Wartak wrote:
> Hi Tomas!
>
> On Tue, Sep 3, 2024 at 6:20 PM Tomas Vondra <tomas@vondra.me> wrote:
>>
>> On 9/3/24 17:06, Robert Haas wrote:
>>> On Mon, Sep 2, 2024 at 1:46 PM Tomas Vondra <tomas@vondra.me> wrote:
>>>> The one argument to not tie this to max_locks_per_transaction is
>>>> the vastly different "per element" memory requirements. If you add
>>>> one entry to max_locks_per_transaction, that adds a LOCK, which is
>>>> a whopping 152B. OTOH one fast-path entry is ~5B, give or take.
>>>> That's a pretty big difference, and if the locks fit into the
>>>> shared lock table but you'd like to allow more fast-path locks,
>>>> having to increase max_locks_per_transaction is not great - pretty
>>>> wasteful.
>>>>
>>>> OTOH I'd really hate to just add another GUC and hope the users
>>>> will magically know how to set it correctly. That's pretty
>>>> unlikely, IMO. I myself wouldn't know what a good value is, I
>>>> think.
>>>>
>>>> But say we add a GUC and set it to -1 by default, in which case it
>>>> just inherits the max_locks_per_transaction value. And then also
>>>> provide some basic metric about this fast-path cache, so that
>>>> people can tune this?
>>>
>>> All things being equal, I would prefer not to add another GUC for
>>> this, but we might need it.
>>>
>>
>> Agreed.
>>
>> [..]
>>
>> So I think I'm OK with just tying this to max_locks_per_transaction.
>
> If that matters, then the SLRU configurability effort added 7 GUCs
> (with 3 scaling up based on shared_buffers) just to give high-end
> users some relief, so 1 new one here shouldn't be such a big deal. We
> could add to the LWLock/lock_manager wait event docs a recommendation
> to just use certain known-to-be-good values from this $thread (or ask
> the user to benchmark it himself).
>

TBH I'm skeptical we'll be able to tune those GUCs. Maybe it was the
right thing for the SLRU thread, I don't know - I haven't been
following that very closely. But my impression is that we often add a
GUC when we're not quite sure how to pick a good value. So we just
shift the responsibility to someone else, who however also doesn't
know. I'd very much prefer not to do that here.

Of course, it's challenging because we can't easily resize these
arrays, so even if we had some nice heuristic to calculate the
"optimal" number of fast-path slots, what would we do with it ...

>>>> I think just knowing the "hit ratio" would be enough, i.e.
>>>> counters for how often it fits into the fast-path array, and how
>>>> often we had to promote it to the shared lock table, no?
>>>
>>> Yeah, probably. I mean, that won't tell you how big it needs to
>>> be, but it will tell you whether it's big enough.
>>>
>>
>> True, but that applies to all "cache hit ratio" metrics (like for
>> our shared buffers). It'd be great to have something better, enough
>> to tell you how large the cache needs to be. But we don't :-(
>
> My $0.02: the originating case that triggered those patches actually
> started with LWLock/lock_manager waits being the top #1. The
> operator can cross-check (join) that with a group by
> pg_locks.fastpath (='f'), count(*). So IMHO we have good
> observability in this case (rare thing to say!)
>

That's a good point. So if you had to give users some instructions on
what to measure / monitor, and how to adjust the GUC based on that,
what would your instructions be?
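For the record, the "hit ratio" counters discussed above could be as
simple as the following - a minimal sketch with hypothetical names,
not code from the actual patch:

    /*
     * Sketch of the two counters (hypothetical names, not actual
     * lock.c code): one bumped when the lock fits into the
     * per-backend fast-path array, one when it has to go through
     * the shared lock table instead.
     */
    #include <stdbool.h>
    #include <stdint.h>

    typedef struct FastPathLockStats
    {
        uint64_t    fp_hits;        /* stored in fast-path array */
        uint64_t    fp_overflows;   /* promoted to shared table */
    } FastPathLockStats;

    static FastPathLockStats fp_stats;

    /* called from the lock-acquire path, depending on the outcome */
    static inline void
    count_fastpath_attempt(bool used_fastpath)
    {
        if (used_fastpath)
            fp_stats.fp_hits++;
        else
            fp_stats.fp_overflows++;
    }

As Robert said, the ratio fp_hits / (fp_hits + fp_overflows) won't
tell you how large the array needs to be, but if it stays well below
1.0 the array is clearly too small for the workload.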
>>> I wonder if we should be looking at further improvements in the
>>> lock manager of some kind. [..]
>>
>> Perhaps. I agree we'll probably need something more radical soon,
>> not just changes that aim to fix some rare exceptional case (which
>> may be annoying, but not particularly harmful for the complete
>> workload).
>>
>> For example, if we did what you propose, that might help when very
>> few transactions need a lot of locks. I don't mind saving memory in
>> that case, ofc, but is it a problem if those rare cases are a bit
>> slower? Shouldn't we focus more on cases where many locks are
>> common? Because people are simply going to use partitioning, a lot
>> of indexes, etc.
>>
>> So yeah, I agree we probably need a more fundamental rethink. I
>> don't think we can just keep optimizing the current approach,
>> there's a limit to how fast it can be.
>
> Please help me understand: are you both discussing potential further
> improvements in the far future instead of this one? My question is
> really about: is the patchset good enough, or are you considering
> some other new effort instead?
>

I think it was mostly brainstorming about alternative / additional
improvements in locking. The proposed patch does not change the
locking in any fundamental way, it merely optimizes one piece - we
still acquire exactly the same set of locks, in exactly the same way.

AFAICS there's an agreement that the current approach has limits, and
with the growing number of partitions we're hitting them already.
That may require rethinking the fundamental approach, but I think
that should not block improvements to the current approach. Not to
mention there's no proposal for such a "fundamental rework" yet.

> BTW some other random questions:
> Q1. I've been lurking in
> https://github.com/tvondra/pg-lock-scalability-results - those
> results shouldn't be used anymore for further discussion, as they
> were based on earlier patches (including
> 0003-Add-a-memory-pool-with-adaptive-rebalancing.patch) and were
> replaced by the benchmark data in this $thread, right?

The github results are still valid, I only shared them 3 days ago.
They test both the mempool and the glibc tuning, to assess (and
compare) the benefits of that, but why would that make them obsolete?

By "results in this thread" I suppose you mean the couple of numbers
I shared on September 2? Those were just very limited benchmarks, to
assess whether making the arrays variable-length (based on a GUC)
would make things slower. And it doesn't, so the "full" github
results still apply.
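To be concrete about what "variable-length (based on a GUC)" means
here, the sizing logic is roughly the following - a simplified sketch
with illustrative constants, not the exact patch code:

    /*
     * Sketch: derive the fast-path capacity from
     * max_locks_per_transaction instead of adding a new GUC. Slots
     * come in groups of 16; the group count is rounded up to a
     * power of two (and capped), so a slot can be picked by simple
     * bit masking.
     */
    #define FP_SLOTS_PER_GROUP  16
    #define FP_MAX_GROUPS       1024

    static int
    fast_path_groups(int max_locks_per_transaction)
    {
        int     groups = 1;

        while (groups * FP_SLOTS_PER_GROUP < max_locks_per_transaction &&
               groups < FP_MAX_GROUPS)
            groups *= 2;

        return groups;
    }

With the default max_locks_per_transaction = 64 that yields 4 groups,
i.e. 64 fast-path slots per backend, instead of the 16 hard-coded
slots we have today.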
> Q2. Earlier attempts did contain a mempool patch to get those nice
> numbers (or was that jemalloc or glibc tuning). So were those recent
> results in [1] collected with 0003 still applied, or have you
> switched completely to glibc/jemalloc tuning?
>

The results pushed to github are all with glibc, and test four cases:

a) mempool patch not applied, no glibc tuning
b) mempool patch applied, no glibc tuning
c) mempool patch not applied, glibc tuning
d) mempool patch applied, glibc tuning

These are the 4 "column groups" in some of the pivot tables, to allow
comparing those cases. My interpretation of the results is:

1) The mempool / glibc tuning has significant benefits, at least for
some workloads (where the locking patch alone doesn't help much).

2) There's very little difference between the mempool and the glibc
tuning. The mempool does seem to have a small advantage.

3) The mempool / glibc tuning is irrelevant for non-glibc systems
(e.g. for FreeBSD, which I think uses jemalloc or something like
that).

I think the mempool might be interesting and useful for other reasons
(e.g. I initially wrote it to enforce a per-backend memory limit),
but you can get mostly the same caching benefits by tuning the glibc
parameters.
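The kind of glibc tuning I mean is roughly this - a sketch with
made-up values, not a recommendation:

    /*
     * Illustrative glibc malloc tuning (values are made up): the
     * point is to make glibc retain freed memory instead of
     * returning it to the kernel, which is roughly the caching
     * behavior the mempool patch provides by itself.
     */
    #include <malloc.h>

    static void
    tune_glibc_malloc(void)
    {
        /* keep extra free space at the top of the heap */
        mallopt(M_TOP_PAD, 64 * 1024 * 1024);

        /* don't trim the heap back to the kernel below this */
        mallopt(M_TRIM_THRESHOLD, 64 * 1024 * 1024);

        /* serve large allocations from the heap, not mmap() */
        mallopt(M_MMAP_THRESHOLD, 64 * 1024 * 1024);
    }

The same knobs are reachable without patching anything, via the
MALLOC_TOP_PAD_, MALLOC_TRIM_THRESHOLD_ and MALLOC_MMAP_THRESHOLD_
environment variables.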
So I'm focusing on the locking stuff.

regards

--
Tomas Vondra