Re: scalability bottlenecks with (many) partitions (and more) - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: scalability bottlenecks with (many) partitions (and more) |
Date | |
Msg-id | 0f27b64b-5bf3-4140-98b7-635e312e1796@vondra.me |
In response to | Re: scalability bottlenecks with (many) partitions (and more) (Jakub Wartak <jakub.wartak@enterprisedb.com>) |
Responses | Re: scalability bottlenecks with (many) partitions (and more) |
List | pgsql-hackers |
On 9/16/24 15:11, Jakub Wartak wrote:
> On Fri, Sep 13, 2024 at 1:45 AM Tomas Vondra <tomas@vondra.me> wrote:
>
>> [..]
>>
>> Anyway, at this point I'm quite happy with this improvement. I didn't have any clear plan when to commit this, but I'm considering doing so sometime next week, unless someone objects or asks for some additional benchmarks etc.
>
> Thank you very much for working on this :)
>
> The only fact that comes to my mind is that we could blow up L2 caches. Fun fact: if we are growing PGPROC by 6.3x, that's going to be like one or two 2MB huge pages more at the common max_connections=1000 on x86_64 (830kB -> ~5.1MB), and indeed:
>
> # without patch:
> postgres@hive:~$ /usr/pgsql18/bin/postgres -D /tmp/pg18 -C shared_memory_size_in_huge_pages
> 177
>
> # with patch:
> postgres@hive:~$ /usr/pgsql18/bin/postgres -D /tmp/pg18 -C shared_memory_size_in_huge_pages
> 178
>
> So, playing Devil's advocate, the worst situation that could possibly hurt (?) could be:
> * memory size of the PGPROC working set >> L2_cache (thus very high max_connections),
> * insane number of working sessions on CPU (sessions >> VCPUs) - sadly happens to some,
> * those sessions wouldn't have to be competing for the same Oids - just fetching this new big fpLockBits[] structure - so probing a lot for lots of Oids, but *NOT* having to use the futex() syscall [so not that syscall price],
> * no huge pages (to cause dTLB misses).
>
> Then maybe(?) one could observe further degradation of dTLB misses in the perf-stat counters under some microbenchmark, but measuring that requires isolated, physical hardware. Maybe that would actually be noise due to the overhead of the context switches themselves. Just trying to think out loud about what a big PGPROC could cause here. But this is already an unhealthy and non-steady state of the system, so IMHO we are good, unless someone comes up with a better (more evil) idea.
>

I've been thinking about such cases too, but I don't think it can really happen in practice, because:

- How likely is it that the sessions will need a lot of OIDs, but not the same ones? And why would it matter that the OIDs are not the same? I don't think it matters unless one of the sessions needs an exclusive lock, at which point the optimization doesn't really matter anyway.

- If having more fast-path slots means PGPROC doesn't fit into the L2 cache, would we fit into L2 without them? I don't think so - if there really are that many locks, we'd have to add them to the shared lock table, and there's a lot of other stuff to keep in memory (relcaches, ...).

This is pretty much one of the cases I focused on in my benchmarking, and I'm yet to see any regression.

>> I did look at the docs to see if anything needs updating, but I don't think so. The SGML docs only talk about fast-path locking at a fairly high level, not about how many slots we have etc.
>
> Well, the only thing I could think of was to add to the doc/src/sgml/config.sgml description of the "max_locks_per_transaction" GUC that it "is also used as an advisory for the number of groups used in the lock manager's fast-path implementation" (that is, without going into further discussion, as even the pg_locks discussion in doc/src/sgml/system-views.sgml simply uses that term).

Thanks, I'll consider mentioning this in max_locks_per_transaction. Also, I think there's a place calculating the amount of per-connection memory, so maybe that needs to be updated too.

regards

--
Tomas Vondra
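As a side note for anyone skimming the thread: the sizing idea discussed above (deriving the number of per-backend fast-path lock groups from max_locks_per_transaction, rounding up to a power of two and clamping to some maximum) can be sketched in a few lines of C. The sketch below is purely illustrative, assuming hypothetical names and constants (fast_path_groups, SLOTS_PER_GROUP, MAX_GROUPS, and the per-group memory estimate); it is not the actual patch code.

```c
/*
 * Illustrative sketch only -- not the committed implementation. It shows
 * the general heuristic discussed above: pick the smallest power-of-two
 * number of fixed-size fast-path groups whose slots cover
 * max_locks_per_transaction, clamped to an upper bound, and estimate the
 * resulting per-backend memory.
 */
#include <stdint.h>
#include <stdio.h>

#define SLOTS_PER_GROUP 16		/* hypothetical fast-path slots per group */
#define MAX_GROUPS		1024	/* hypothetical cap on group count */

/* Smallest power-of-two group count whose slots cover the requested locks. */
static int
fast_path_groups(int max_locks_per_transaction)
{
	int			groups = 1;

	while (groups < MAX_GROUPS &&
		   groups * SLOTS_PER_GROUP < max_locks_per_transaction)
		groups *= 2;

	return groups;
}

int
main(void)
{
	int			max_locks = 64;		/* default max_locks_per_transaction */
	int			groups = fast_path_groups(max_locks);

	/*
	 * Rough per-backend cost: one 64-bit "lock bits" word plus
	 * SLOTS_PER_GROUP relation OIDs (4 bytes each) per group.
	 */
	size_t		bytes = (size_t) groups *
		(sizeof(uint64_t) + SLOTS_PER_GROUP * sizeof(uint32_t));

	printf("max_locks_per_transaction=%d -> %d fast-path groups, ~%zu bytes/backend\n",
		   max_locks, groups, bytes);
	return 0;
}
```

With the default max_locks_per_transaction = 64 this sketch picks 4 groups (64 slots) and a few hundred bytes per backend, which is the sense in which the GUC acts as an advisory value for the fast-path sizing rather than a hard limit.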