Thread: scalability bottlenecks with (many) partitions (and more)
Hi,

I happened to investigate a query involving a partitioned table, which led me to a couple of bottlenecks severely affecting queries dealing with multiple partitions (or relations in general). After a while I came up with three WIP patches that improve the behavior by an order of magnitude, and not just in some extreme cases.

Consider a partitioned pgbench with 100 partitions, say:

    pgbench -i -s 100 --partitions 100 testdb

but let's modify pgbench_accounts a little bit:

    ALTER TABLE pgbench_accounts ADD COLUMN aid_parent INT;
    UPDATE pgbench_accounts SET aid_parent = aid;
    CREATE INDEX ON pgbench_accounts(aid_parent);
    VACUUM FULL pgbench_accounts;

which simply adds an "aid_parent" column that is not a partition key. And now let's run this query:

    SELECT * FROM pgbench_accounts pa JOIN pgbench_branches pb ON (pa.bid = pb.bid)
     WHERE pa.aid_parent = :aid

so pretty much the regular "pgbench -S", except that the condition is on a column that does not allow partition elimination. The plan looks like this:

                                  QUERY PLAN
    ----------------------------------------------------------------------
     Hash Join  (cost=1.52..34.41 rows=10 width=465)
       Hash Cond: (pa.bid = pb.bid)
       ->  Append  (cost=0.29..33.15 rows=10 width=101)
             ->  Index Scan using pgbench_accounts_1_aid_parent_idx on
                 pgbench_accounts_1 pa_1  (cost=0.29..3.31 rows=1 width=101)
                   Index Cond: (aid_parent = 3489734)
             ->  Index Scan using pgbench_accounts_2_aid_parent_idx on
                 pgbench_accounts_2 pa_2  (cost=0.29..3.31 rows=1 width=101)
                   Index Cond: (aid_parent = 3489734)
             ->  Index Scan using pgbench_accounts_3_aid_parent_idx on
                 pgbench_accounts_3 pa_3  (cost=0.29..3.31 rows=1 width=101)
                   Index Cond: (aid_parent = 3489734)
             ->  Index Scan using pgbench_accounts_4_aid_parent_idx on
                 pgbench_accounts_4 pa_4  (cost=0.29..3.31 rows=1 width=101)
                   Index Cond: (aid_parent = 3489734)
             ->  ...
       ->  Hash  (cost=1.10..1.10 rows=10 width=364)
             ->  Seq Scan on pgbench_branches pb  (cost=0.00..1.10 rows=10 width=364)

So yeah, scanning all 100 partitions. Not great, but no partitioning scheme is perfect for all queries.

Anyway, let's see how this works on a big AMD EPYC machine with 96/192 cores - with "-M simple" we get:

    parts       1       8      16      32      64      96     160     224
    -----------------------------------------------------------------------
        0   13877  105732  210890  410452  709509  844683 1050658 1163026
      100     653    3957    7120   12022   12707   11813   10349    9633
     1000      20     142     270     474     757     808     567     427

These are transactions per second, for different numbers of clients (the numbers in the header). With -M prepared the story doesn't change - the numbers are higher, but the overall behavior is pretty much the same.

Firstly, with no partitions (first row), the throughput grows by ~13k tps per client initially, then it gradually levels off - but it keeps growing all the time. With 100 or 1000 partitions, however, it peaks and then starts dropping again. Moreover, the throughput with 100 or 1000 partitions is just a tiny fraction of the non-partitioned value. The ratio is roughly equal to the number of partitions - for example with 96 clients, the ratio between 0 and 1000 partitions is 844683/808 = 1045. I could demonstrate the same behavior with fewer partitions - e.g. with 10 partitions you get a ~10x difference, and so on.

Another thing I'd mention is that this is not just about partitioning. Imagine a star schema with a fact table and dimensions - you'll get the same behavior depending on the number of dimensions you need to join with.
With "-M simple" you may get this, for example: dims 1 8 16 32 64 96 160 224 ---------------------------------------------------------------------- 1 11737 92925 183678 361497 636598 768956 958679 1042799 10 462 3558 7086 13889 25367 29503 25353 24030 100 4 31 61 122 231 292 292 288 So, similar story - significant slowdown as we're adding dimensions. Now, what could be causing this? Clearly, there's a bottleneck of some kind, and we're hitting it. Some of this may be simply due to execution doing more stuff (more index scans, more initialization, ...) but maybe not - one of the reasons why I started looking into this was not using all the CPU even for small scales - the CPU was maybe 60% utilized. So I started poking at things. The first thing that I thought about was locking, obviously. That's consistent with the limited CPU utilization (waiting on a lock = not running), and it's somewhat expected when using many partitions - we need to lock all of them, and if we have 100 or 1000 of them, that's potentially lot of locks. From past experiments I've known about two places where such bottleneck could be - NUM_LOCK_PARTITIONS and fast-path locking. So I decided to give it a try, increase these values and see what happens. For NUM_LOCK_PARTITIONS this is pretty simple (see 0001 patch). The LWLock table has 16 partitions by default - it's quite possible that on machine with many cores and/or many partitions, we can easily hit this. So I bumped this 4x to 64 partitions. For fast-path locking the changes are more complicated (see 0002). We allow keeping 16 relation locks right in PGPROC, and only when this gets full we promote them to the actual lock table. But with enough partitions we're guaranteed to fill these 16 slots, of course. But increasing the number of slots is not simple - firstly, the information is split between an array of 16 OIDs and UINT64 serving as a bitmap. Increasing the size of the OID array is simple, but it's harder for the auxiliary bitmap. But there's more problems - with more OIDs a simple linear search won't do. But a simple hash table is not a good idea too, because of poor locality and the need to delete stuff ... What I ended up doing is having a hash table of 16-element arrays. There are 64 "pieces", each essentially the (16 x OID + UINT64 bitmap) that we have now. Each OID is mapped to exactly one of these parts as if in a hash table, and in each of those 16-element parts we do exactly the same thing we do now (linear search, removal, etc.). This works great, the locality is great, etc. The one disadvantage is this makes PGPROC larger, but I did a lot of benchmarks and I haven't seen any regression that I could attribute to this. (More about this later.) Unfortunately, for the pgbench join this does not make much difference. But for the "star join" (with -M prepared) it does this: 1 8 16 32 64 96 160 224 ------------------------------------------------------------------------ master 21610 137450 247541 300902 270932 229692 191454 189233 patched 21664 151695 301451 594615 1036424 1211716 1480953 1656203 speedup 1.0 1.1 1.2 2.0 3.8 5.3 7.7 8.8 That's a pretty nice speedup, I think. However, why doesn't the partitioned join improve (at not very much)? 
Well, perf profile says stuff like this:

    9.16%  0.77%  postgres  [kernel.kallsyms]  [k] asm_exc_page_fault
           |
           --8.39%--asm_exc_page_fault
                     |
                     --7.52%--exc_page_fault
                               |
                               --7.13%--do_user_addr_fault
                                         |
                                         --6.64%--handle_mm_fault
                                                   |
                                                   --6.29%--__handle_mm_fault
                                                             |
                                                             |--2.17%--__mem_cgroup_charge
                                                             |          |
                                                             |          |--1.25%--charge_memcg
                                                             |          |          |
                                                             |          |          --0.57%-- ...
                                                             |          |
                                                             |          --0.67%-- ...
                                                             |
                                                             |--2.04%--vma_alloc_folio

After investigating this for a bit, I came to the conclusion this may be some sort of a scalability problem in glibc/malloc. I decided to try whether the "memory pool" patch (which I've mentioned in the memory limit thread as an alternative way to introduce backend-level accounting/limit) could serve as a backend-level malloc cache, and how that would work. So I cleaned up the PoC patch I already had (see 0003), and gave it a try. And with both patches applied, the results for the partitioned join with 100 partitions look like this:

    -M simple
                      1       8      16      32      64      96     160     224
    ------------------------------------------------------------------------
    master          653    3957    7120   12022   12707   11813   10349    9633
    both patches    954    7356   14580   28259   51552   65278   70607   69598
    speedup         1.5     1.9     2.0     2.4     4.1     5.5     6.8     7.2

    -M prepared
                      1       8      16      32      64      96     160     224
    ------------------------------------------------------------------------
    master         1639    8273   14138   14746   13446   14001   11129   10136
    both patches   4792   30102   62208  122157  220984  267763  315632  323567
    speedup         2.9     3.6     4.4     8.3    16.4    19.1    28.4    31.9

That's pretty nice, I think. And I've seen many such improvements, it's not a cherry-picked example. For the star join, the improvements are very similar. I'm attaching PDF files with a table visualizing results for these two benchmarks - there are results for different numbers of partitions/scales, and different builds (master, one or both of the patches). There's also a comparison to master, with a color scale "red = slower, green = faster" (but there's no red anywhere, not even for low client counts).

It's also interesting that with just the 0003 patch applied, the change is much smaller. It's as if the two bottlenecks (locking and malloc) are in balance - if you only address one of them, you don't get much. But if you address both, it flies.

FWIW where does the malloc overhead come from? For one, while we do have some caching of malloc-ed memory in memory contexts, that doesn't quite work cross-query, because we destroy the contexts at the end of the query. We attempt to cache the memory contexts too, but in this case that can't help because the allocations come from btbeginscan() where we do this:

    so = (BTScanOpaque) palloc(sizeof(BTScanOpaqueData));

and BTScanOpaqueData is ~27kB, which means it's an oversized chunk and thus always allocated using a separate malloc() call. Maybe we could break it into smaller/cacheable parts, but I haven't tried, and I doubt it's the only such allocation.

I don't want to get into too much detail about the memory pool, but I think it's something we should consider doing - I'm sure there's stuff to improve, but caching the malloc-ed memory can clearly be very beneficial. The basic idea is to have a cache that is "adaptive" (i.e. adjusts to caching blocks of the sizes needed by the workload) but also cheap. The patch is PoC/WIP and needs more work, but I think it works quite well. If anyone wants to take a look or have a chat at FOSDEM, for example, I'm available.
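To illustrate the basic idea, here's a minimal sketch of such a backend-level cache of freed blocks. This is not the actual 0003 patch (the real mempool is adaptive and hooks into the memory contexts); the names and constants are made up, and the caller passes the block size to the free function to keep it short:

    /*
     * Illustrative sketch only - not the actual 0003 mempool patch. A tiny
     * per-backend cache of freed malloc blocks, bucketed by power-of-two size
     * class, so that repeatedly allocating and freeing oversized chunks
     * (e.g. the ~27kB BTScanOpaqueData) doesn't hit malloc()/free() and the
     * sbrk/mmap machinery behind it on every query.
     */
    #include <stdlib.h>

    #define POOL_MIN_SIZE   1024    /* smallest class: 1kB */
    #define POOL_CLASSES    16      /* 1kB .. 32MB, power-of-two classes */
    #define POOL_MAX_CACHED 8       /* blocks kept per class (arbitrary) */

    typedef struct PoolBlock
    {
        struct PoolBlock *next;
    } PoolBlock;

    static PoolBlock *pool_freelist[POOL_CLASSES];
    static int        pool_nfree[POOL_CLASSES];

    /* map a request size to a power-of-two size class (last class = largest) */
    static int
    pool_class(size_t size)
    {
        size_t  sz = POOL_MIN_SIZE;
        int     c = 0;

        while (sz < size && c < POOL_CLASSES - 1)
        {
            sz <<= 1;
            c++;
        }
        return c;
    }

    static size_t
    pool_class_size(int c)
    {
        return (size_t) POOL_MIN_SIZE << c;
    }

    void *
    pool_alloc(size_t size)
    {
        int     c = pool_class(size);

        if (size > pool_class_size(c))
            return malloc(size);        /* larger than the largest class */

        if (pool_freelist[c] != NULL)
        {
            PoolBlock  *b = pool_freelist[c];

            pool_freelist[c] = b->next;
            pool_nfree[c]--;
            return b;                   /* cache hit - no malloc() */
        }

        /* cache miss - allocate the full class size, so the block is reusable */
        return malloc(pool_class_size(c));
    }

    void
    pool_free(void *ptr, size_t size)
    {
        int     c = pool_class(size);

        if (size > pool_class_size(c) || pool_nfree[c] >= POOL_MAX_CACHED)
        {
            free(ptr);                  /* too big or cache full - release it */
            return;
        }

        ((PoolBlock *) ptr)->next = pool_freelist[c];
        pool_freelist[c] = (PoolBlock *) ptr;
        pool_nfree[c]++;
    }

The real patch additionally adapts which sizes get cached and how many blocks to keep, based on what the workload actually allocates - the sketch only shows why reusing a cached block avoids the malloc() round trip entirely.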
FWIW I was wondering if this is a glibc-specific malloc bottleneck, so I tried running the benchmarks with LD_PRELOAD=jemalloc, and that improves the behavior a lot - it gets us maybe ~80% of the mempool benefits. Which is nice, as it confirms it's glibc-specific (I wonder if there's a way to tweak glibc to address this), and it also means systems using jemalloc (e.g. FreeBSD, right?) don't have this problem. But it also says the mempool has a ~20% benefit on top of jemalloc.

FWIW there's another bottleneck people may not realize, and that's the number of file descriptors. Once you get to >1000 relations, you can easily get into a situation like this:

    54.18%  0.48%  postgres  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
            |
            --53.70%--entry_SYSCALL_64_after_hwframe
                      |
                      --53.03%--do_syscall_64
                                |
                                |--28.29%--__x64_sys_openat
                                |          |
                                |          --28.14%--do_sys_openat2
                                |                    |
                                |                    |--23.14%--do_filp_open
                                |                    |          |
                                |                    |          --22.72%--path_openat

That's pretty bad - it means we're closing/opening file descriptors like crazy, because every query needs the files. If I increase the number of file descriptors (both in ulimit and max_files_per_process) to prevent this thrashing, I can increase the throughput ~5x. Of course, this is not a bottleneck that we can "fix" in code, it's simply a consequence of not having enough file descriptors etc. But I wonder if we might make it easier to monitor this, e.g. by tracking the fd cache hit ratio, or something like that ...

There's a more complete set of benchmarking scripts and results for these and other tests, in various formats (PDF, ODS, ...) at

    https://github.com/tvondra/scalability-patches

There are results from multiple machines - not just the big epyc machine, but also smaller intel machines (4C and 16C), and even two rpi5 (yes, it helps even on rpi5, quite a bit).

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
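As a rough illustration of the "fd cache hit ratio" idea mentioned above, the metric could be as simple as a pair of per-backend counters bumped wherever a file access goes through the virtual-FD cache, with the ratio exposed via a statistics view. The names below are hypothetical - this is not an existing PostgreSQL API, just a sketch of what such instrumentation could look like:

    /* Hypothetical sketch of an "fd cache hit ratio" metric - not existing code. */
    #include <stdbool.h>
    #include <stdint.h>

    static uint64_t vfd_cache_hits;     /* the virtual FD already had a kernel fd */
    static uint64_t vfd_cache_misses;   /* we had to open() the file again */

    /* call wherever a relation file access goes through the fd cache */
    static inline void
    count_vfd_access(bool had_open_fd)
    {
        if (had_open_fd)
            vfd_cache_hits++;
        else
            vfd_cache_misses++;
    }

    /* hit ratio in percent, e.g. to report in a statistics view */
    static inline double
    vfd_cache_hit_ratio(void)
    {
        uint64_t    total = vfd_cache_hits + vfd_cache_misses;

        return (total == 0) ? 100.0 : (100.0 * vfd_cache_hits) / (double) total;
    }

A low ratio would be a strong hint that ulimit / max_files_per_process is too low for the number of relations the workload touches.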
On Sunday 28 January 2024 at 22:57:02 CET, Tomas Vondra wrote:

Hi Tomas !

I'll comment on the glibc-malloc part as I studied that part last year, and proposed some things here:
https://www.postgresql.org/message-id/3424675.QJadu78ljV%40aivenlaptop

> FWIW where does the malloc overhead come from? For one, while we do have
> some caching of malloc-ed memory in memory contexts, that doesn't quite
> work cross-query, because we destroy the contexts at the end of the
> query. We attempt to cache the memory contexts too, but in this case
> that can't help because the allocations come from btbeginscan() where we
> do this:
>
> so = (BTScanOpaque) palloc(sizeof(BTScanOpaqueData));
>
> and BTScanOpaqueData is ~27kB, which means it's an oversized chunk and
> thus always allocated using a separate malloc() call. Maybe we could
> break it into smaller/cacheable parts, but I haven't tried, and I doubt
> it's the only such allocation.

Did you try running an strace on the process? That may give you some insights into what malloc is doing. A more sophisticated approach would be using stap and plugging it into the malloc probes, for example memory_sbrk_more and memory_sbrk_less.

An important part of glibc's malloc behaviour in that regard comes from the adjustment of the mmap and trim thresholds. By default, glibc adjusts them dynamically and you can poke into that using the memory_mallopt_free_dyn_thresholds probe.

> FWIW I was wondering if this is a glibc-specific malloc bottleneck, so I
> tried running the benchmarks with LD_PRELOAD=jemalloc, and that improves
> the behavior a lot - it gets us maybe ~80% of the mempool benefits.
> Which is nice, it confirms it's glibc-specific (I wonder if there's a
> way to tweak glibc to address this), and it also means systems using
> jemalloc (e.g. FreeBSD, right?) don't have this problem. But it also
> says the mempool has ~20% benefit on top of jemalloc.

GLIBC's malloc offers some tuning for this. In particular, setting either M_MMAP_THRESHOLD or M_TRIM_THRESHOLD will disable the unpredictable "auto adjustment" behaviour and allow you to control what it's doing.

By setting a bigger M_TRIM_THRESHOLD, one can make sure memory allocated using sbrk isn't freed as easily, and you don't run into a pattern of moving the sbrk pointer up and down repeatedly. The automatic trade-off between the mmap and trim thresholds is supposed to prevent that, but the way it is incremented means you can end up in a bad place depending on your particular allocation pattern.

Best regards,

--
Ronan Dunklau
On 1/29/24 09:53, Ronan Dunklau wrote: > Le dimanche 28 janvier 2024, 22:57:02 CET Tomas Vondra a écrit : > > Hi Tomas ! > > I'll comment on glibc-malloc part as I studied that part last year, and > proposed some things here: https://www.postgresql.org/message-id/ > 3424675.QJadu78ljV%40aivenlaptop > Thanks for reminding me. I'll re-read that thread. > >> FWIW where does the malloc overhead come from? For one, while we do have >> some caching of malloc-ed memory in memory contexts, that doesn't quite >> work cross-query, because we destroy the contexts at the end of the >> query. We attempt to cache the memory contexts too, but in this case >> that can't help because the allocations come from btbeginscan() where we >> do this: >> >> so = (BTScanOpaque) palloc(sizeof(BTScanOpaqueData)); >> >> and BTScanOpaqueData is ~27kB, which means it's an oversized chunk and >> thus always allocated using a separate malloc() call. Maybe we could >> break it into smaller/cacheable parts, but I haven't tried, and I doubt >>>>> it's the only such allocation. > > Did you try running an strace on the process ? That may give you some > hindsights into what malloc is doing. A more sophisticated approach would be > using stap and plugging it into the malloc probes, for example > memory_sbrk_more and memory_sbrk_less. > No, I haven't tried that. In my experience strace is pretty expensive, and if the issue is in glibc itself (before it does the syscalls), strace won't really tell us much. Not sure, ofc. > An important part of glibc's malloc behaviour in that regard comes from the > adjustment of the mmap and free threshold. By default, mmap adjusts them > dynamically and you can poke into that using the > memory_mallopt_free_dyn_thresholds probe. > Thanks, I'll take a look at that. >> >> FWIW I was wondering if this is a glibc-specific malloc bottleneck, so I >> tried running the benchmarks with LD_PRELOAD=jemalloc, and that improves >> the behavior a lot - it gets us maybe ~80% of the mempool benefits. >> Which is nice, it confirms it's glibc-specific (I wonder if there's a >> way to tweak glibc to address this), and it also means systems using >> jemalloc (e.g. FreeBSD, right?) don't have this problem. But it also >> says the mempool has ~20% benefit on top of jemalloc. > > GLIBC's malloc offers some tuning for this. In particular, setting either > M_MMAP_THRESHOLD or M_TRIM_THRESHOLD will disable the unpredictable "auto > adjustment" beheviour and allow you to control what it's doing. > > By setting a bigger M_TRIM_THRESHOLD, one can make sure memory allocated using > sbrk isn't freed as easily, and you don't run into a pattern of moving the > sbrk pointer up and down repeatedly. The automatic trade off between the mmap > and trim thresholds is supposed to prevent that, but the way it is incremented > means you can end in a bad place depending on your particular allocation > patttern. > So, what values would you recommend for these parameters? My concern is increasing those value would lead to (much) higher memory usage, with little control over it. With the mempool we keep more blocks, ofc, but we have control over freeing the memory. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Monday 29 January 2024 at 13:17:07 CET, Tomas Vondra wrote:
> > Did you try running an strace on the process? That may give you some
> > insights into what malloc is doing. A more sophisticated approach would
> > be using stap and plugging it into the malloc probes, for example
> > memory_sbrk_more and memory_sbrk_less.
>
> No, I haven't tried that. In my experience strace is pretty expensive,
> and if the issue is in glibc itself (before it does the syscalls),
> strace won't really tell us much. Not sure, ofc.

It would tell you how malloc actually performs your allocations, and how often they end up translated into syscalls. The main issue with glibc would be that it releases the memory too aggressively to the OS, IMO.

> > An important part of glibc's malloc behaviour in that regard comes from the
> > adjustment of the mmap and trim thresholds. By default, glibc adjusts them
> > dynamically and you can poke into that using the
> > memory_mallopt_free_dyn_thresholds probe.
>
> Thanks, I'll take a look at that.
>
> >> FWIW I was wondering if this is a glibc-specific malloc bottleneck, so I
> >> tried running the benchmarks with LD_PRELOAD=jemalloc, and that improves
> >> the behavior a lot - it gets us maybe ~80% of the mempool benefits.
> >> Which is nice, it confirms it's glibc-specific (I wonder if there's a
> >> way to tweak glibc to address this), and it also means systems using
> >> jemalloc (e.g. FreeBSD, right?) don't have this problem. But it also
> >> says the mempool has ~20% benefit on top of jemalloc.
> >
> > GLIBC's malloc offers some tuning for this. In particular, setting either
> > M_MMAP_THRESHOLD or M_TRIM_THRESHOLD will disable the unpredictable "auto
> > adjustment" behaviour and allow you to control what it's doing.
> >
> > By setting a bigger M_TRIM_THRESHOLD, one can make sure memory allocated
> > using sbrk isn't freed as easily, and you don't run into a pattern of
> > moving the sbrk pointer up and down repeatedly. The automatic trade-off
> > between the mmap and trim thresholds is supposed to prevent that, but the
> > way it is incremented means you can end up in a bad place depending on
> > your particular allocation pattern.
>
> So, what values would you recommend for these parameters?
>
> My concern is increasing those values would lead to (much) higher memory
> usage, with little control over it. With the mempool we keep more
> blocks, ofc, but we have control over freeing the memory.

Right now, depending on your workload (especially if you use connection pooling), you can end up with something like 32 or 64MB of dynamically adjusted trim-threshold which will never be released back.

The first heuristic I had in mind was to set it to work_mem, up to a "reasonable" limit I guess. One can argue that it is expected for a backend to use work_mem frequently, and as such it shouldn't be released back. By setting work_mem to a lower value, we could ask glibc at the same time to trim the excess kept memory. That could be useful when a long-lived connection is pooled, and sees a spike in memory usage only once. Currently that could well end up with 32MB "wasted" permanently, but tuning it ourselves could allow us to release it back.

Since it was last year I worked on this, I'm a bit fuzzy on the details but I hope this helps.
On 1/29/24 15:15, Ronan Dunklau wrote:
> On Monday 29 January 2024 at 13:17:07 CET, Tomas Vondra wrote:
>>> Did you try running an strace on the process? That may give you some
>>> insights into what malloc is doing. A more sophisticated approach would
>>> be using stap and plugging it into the malloc probes, for example
>>> memory_sbrk_more and memory_sbrk_less.
>>
>> No, I haven't tried that. In my experience strace is pretty expensive,
>> and if the issue is in glibc itself (before it does the syscalls),
>> strace won't really tell us much. Not sure, ofc.
>
> It would tell you how malloc actually performs your allocations, and how often
> they end up translated into syscalls. The main issue with glibc would be that
> it releases the memory too aggressively to the OS, IMO.
>
>>> An important part of glibc's malloc behaviour in that regard comes from the
>>> adjustment of the mmap and trim thresholds. By default, glibc adjusts them
>>> dynamically and you can poke into that using the
>>> memory_mallopt_free_dyn_thresholds probe.
>>
>> Thanks, I'll take a look at that.
>>
>>>> FWIW I was wondering if this is a glibc-specific malloc bottleneck, so I
>>>> tried running the benchmarks with LD_PRELOAD=jemalloc, and that improves
>>>> the behavior a lot - it gets us maybe ~80% of the mempool benefits.
>>>> Which is nice, it confirms it's glibc-specific (I wonder if there's a
>>>> way to tweak glibc to address this), and it also means systems using
>>>> jemalloc (e.g. FreeBSD, right?) don't have this problem. But it also
>>>> says the mempool has ~20% benefit on top of jemalloc.
>>>
>>> GLIBC's malloc offers some tuning for this. In particular, setting either
>>> M_MMAP_THRESHOLD or M_TRIM_THRESHOLD will disable the unpredictable "auto
>>> adjustment" behaviour and allow you to control what it's doing.
>>>
>>> By setting a bigger M_TRIM_THRESHOLD, one can make sure memory allocated
>>> using sbrk isn't freed as easily, and you don't run into a pattern of
>>> moving the sbrk pointer up and down repeatedly. The automatic trade-off
>>> between the mmap and trim thresholds is supposed to prevent that, but the
>>> way it is incremented means you can end up in a bad place depending on
>>> your particular allocation pattern.
>>
>> So, what values would you recommend for these parameters?
>>
>> My concern is increasing those values would lead to (much) higher memory
>> usage, with little control over it. With the mempool we keep more
>> blocks, ofc, but we have control over freeing the memory.
>
> Right now depending on your workload (especially if you use connection
> pooling) you can end up with something like 32 or 64MB of dynamically
> adjusted trim-threshold which will never be released back.

OK, so let's say I expect each backend to use ~90MB of memory (allocated at once through memory contexts). How would you set the two limits?

By default M_MMAP_THRESHOLD is set to 128kB, which means blocks larger than 128kB are mmap-ed and released immediately. But there are very few such allocations - the vast majority of blocks in the benchmark workloads are <= 8kB or ~27kB (those from btbeginscan). So I'm thinking about leaving M_MMAP_THRESHOLD as is, but increasing the M_TRIM_THRESHOLD value to a couple of MBs.

But I doubt that'll really help, because what I expect to happen is that we execute a query and it allocates all memory up to a high watermark of ~90MB. And then the query completes, and we release almost all of it. And even with the trim threshold set to e.g. 8MB we'll free almost all of it, no?
What we want to do is say - hey, we needed 90MB, and now we need 8MB. We could free 82MB, but maybe let's wait a bit and see if we need that memory again. And that's pretty much what the mempool does, but I don't see how to do that using the mmap options.

> The first heuristic I had in mind was to set it to work_mem, up to a
> "reasonable" limit I guess. One can argue that it is expected for a backend to
> use work_mem frequently, and as such it shouldn't be released back. By setting
> work_mem to a lower value, we could ask glibc at the same time to trim the
> excess kept memory. That could be useful when a long-lived connection is
> pooled, and sees a spike in memory usage only once. Currently that could well
> end up with 32MB "wasted" permanently but tuning it ourselves could allow us
> to release it back.

I'm not sure work_mem is a good parameter to drive this. It doesn't say how much memory we expect the backend to use - it's a per-operation limit, so it doesn't work particularly well with partitioning (e.g. with 100 partitions, we may get 100 nodes, which is completely unrelated to what work_mem says). A backend running the join query with 1000 partitions uses ~90MB (judging by data reported by the mempool), even with work_mem=4MB. So setting the trim limit to 4MB is pretty useless.

The mempool could tell us how much memory we need (but we could track this in some other way too, probably). And we could even adjust the mmap parameters regularly, based on the current workload. But then there's the problem that the mmap parameters don't tell us how much memory to keep, but rather how large chunks to release.

Let's say we want to keep the 90MB (to allocate the memory once and then reuse it). How would you do that? We could set M_TRIM_THRESHOLD to 100MB, but then it takes just a little bit of extra memory to release all the memory, or something.

> Since it was last year I worked on this, I'm a bit fuzzy on the details but I
> hope this helps.

Thanks for the feedback / insights!

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Monday 29 January 2024 at 15:59:04 CET, Tomas Vondra wrote:
> I'm not sure work_mem is a good parameter to drive this. It doesn't say
> how much memory we expect the backend to use - it's a per-operation
> limit, so it doesn't work particularly well with partitioning (e.g. with
> 100 partitions, we may get 100 nodes, which is completely unrelated to
> what work_mem says). A backend running the join query with 1000
> partitions uses ~90MB (judging by data reported by the mempool), even
> with work_mem=4MB. So setting the trim limit to 4MB is pretty useless.

I understand your point, I was basing my previous observations on what a backend typically does during the execution.

> The mempool could tell us how much memory we need (but we could track
> this in some other way too, probably). And we could even adjust the mmap
> parameters regularly, based on the current workload.
>
> But then there's the problem that the mmap parameters don't tell us how
> much memory to keep, but rather how large chunks to release.
>
> Let's say we want to keep the 90MB (to allocate the memory once and then
> reuse it). How would you do that? We could set M_TRIM_THRESHOLD to 100MB,
> but then it takes just a little bit of extra memory to release all the
> memory, or something.

For doing this you can set M_TOP_PAD using glibc malloc, which makes sure a certain amount of memory is always kept.

But the way the dynamic adjustment works makes it sort-of work like this anyway. M_MMAP_THRESHOLD and M_TRIM_THRESHOLD start with low values, meaning we don't expect to keep much memory around. So even "small" memory allocations will be served using mmap at first. Once mmaped memory is released, glibc considers it a benchmark for "normal" allocations that can be routinely freed, and adjusts the mmap threshold to the released mmaped region size, and the trim threshold to two times that.

It means that over time the two values will converge either to the maximum (32MB for M_MMAP_THRESHOLD, 64MB for M_TRIM_THRESHOLD) or to something big enough to accommodate your released memory, since anything bigger than half the trim threshold will be allocated using mmap.

Setting any of these parameters explicitly disables that.

But I'm not arguing against the mempool, just chiming in with glibc's malloc tuning possibilities :-)
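For reference, these knobs can be set per process with mallopt(3) from <malloc.h>, or with the MALLOC_*_ environment variables used later in this thread. A minimal sketch of what such an experiment could look like - the specific values are arbitrary examples rather than recommendations, and the whole thing is glibc-specific:

    /*
     * Experimental sketch: pin the glibc malloc thresholds instead of relying
     * on the dynamic adjustment (setting any of them disables it). The values
     * below are arbitrary examples, not tuning advice.
     */
    #include <malloc.h>     /* glibc-specific: mallopt(), M_* constants */

    static void
    tune_glibc_malloc(void)
    {
    #if defined(M_TRIM_THRESHOLD) && defined(M_TOP_PAD) && defined(M_MMAP_THRESHOLD)
        /* keep up to 64MB of free heap memory instead of trimming it eagerly */
        mallopt(M_TRIM_THRESHOLD, 64 * 1024 * 1024);

        /* always keep some extra headroom on top of the heap */
        mallopt(M_TOP_PAD, 8 * 1024 * 1024);

        /*
         * serve allocations below 1MB from the heap (and thus from the
         * retained free memory) rather than from fresh mmap() regions
         */
        mallopt(M_MMAP_THRESHOLD, 1024 * 1024);
    #endif
    }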
On 1/29/24 16:42, Ronan Dunklau wrote: > Le lundi 29 janvier 2024, 15:59:04 CET Tomas Vondra a écrit : >> I'm not sure work_mem is a good parameter to drive this. It doesn't say >> how much memory we expect the backend to use - it's a per-operation >> limit, so it doesn't work particularly well with partitioning (e.g. with >> 100 partitions, we may get 100 nodes, which is completely unrelated to >> what work_mem says). A backend running the join query with 1000 >> partitions uses ~90MB (judging by data reported by the mempool), even >> with work_mem=4MB. So setting the trim limit to 4MB is pretty useless. > > I understand your point, I was basing my previous observations on what a > backend typically does during the execution. > >> >> The mempool could tell us how much memory we need (but we could track >> this in some other way too, probably). And we could even adjust the mmap >> parameters regularly, based on current workload. >> >> But there's then there's the problem that the mmap parameters don't tell >> If we > > us how much memory to keep, but how large chunks to release. >> >> Let's say we want to keep the 90MB (to allocate the memory once and then >> reuse it). How would you do that? We could set MMAP_TRIM_TRESHOLD 100MB, >> but then it takes just a little bit of extra memory to release all the >> memory, or something. > > For doing this you can set M_TOP_PAD using glibc malloc. Which makes sure a > certain amount of memory is always kept. > > But the way the dynamic adjustment works makes it sort-of work like this. > MMAP_THRESHOLD and TRIM_THRESHOLD start with low values, meaning we don't > expect to keep much memory around. > > So even "small" memory allocations will be served using mmap at first. Once > mmaped memory is released, glibc's consider it a benchmark for "normal" > allocations that can be routinely freed, and adjusts mmap_threshold to the > released mmaped region size, and trim threshold to two times that. > > It means over time the two values will converge either to the max value (32MB > for MMAP_THRESHOLD, 64 for trim threshold) or to something big enough to > accomodate your released memory, since anything bigger than half trim > threshold will be allocated using mmap. > > Setting any parameter disable that. > Thanks. I gave this a try, and I started the tests with this setting: export MALLOC_TOP_PAD_=$((64*1024*1024)) export MALLOC_MMAP_THRESHOLD_=$((1024*1024)) export MALLOC_TRIM_THRESHOLD_=$((1024*1024)) which I believe means that: 1) we'll keep 64MB "extra" memory on top of heap, serving as a cache for future allocations 2) everything below 1MB (so most of the blocks we allocate for contexts) will be allocated on heap (hence from the cache) 3) we won't trim heap unless there's at least 1MB of free contiguous space (I wonder if this should be the same as MALLOC_TOP_PAD) Those are mostly arbitrary values / guesses, and I don't have complete results yet. But from the results I have it seems this has almost the same effect as the mempool thing - see the attached PDF, with results for the "partitioned join" benchmark. first column - "master" (17dev) with no patches, default glibc second column - 17dev + locking + mempool, default glibc third column - 17dev + locking, tuned glibc The color scale on the right is throughput comparison (third/second), as a percentage with e.g. 90% meaning tuned glibc is 10% slower than the mempool results. Most of the time it's slower but very close to 100%, sometimes it's a bit faster. So overall it's roughly the same. 
The color scales below the results are a comparison of each branch to master (without patches), i.e. a comparison to the current performance. It's almost the same, although the tuned glibc has a couple of regressions that the mempool does not have.

> But I'm not arguing against the mempool, just chiming in with glibc's malloc
> tuning possibilities :-)

Yeah. I think the main problem with the glibc parameters is that they're very implementation-specific and also static - the mempool is more adaptive, I think. But it's an interesting experiment.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, Jan 28, 2024 at 4:57 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > For NUM_LOCK_PARTITIONS this is pretty simple (see 0001 patch). The > LWLock table has 16 partitions by default - it's quite possible that on > machine with many cores and/or many partitions, we can easily hit this. > So I bumped this 4x to 64 partitions. I think this probably makes sense. I'm a little worried that we're just kicking the can down the road here where maybe we should be solving the problem in some more fundamental way, and I'm also a little worried that we might be reducing single-core performance. But it's probably fine. > What I ended up doing is having a hash table of 16-element arrays. There > are 64 "pieces", each essentially the (16 x OID + UINT64 bitmap) that we > have now. Each OID is mapped to exactly one of these parts as if in a > hash table, and in each of those 16-element parts we do exactly the same > thing we do now (linear search, removal, etc.). This works great, the > locality is great, etc. The one disadvantage is this makes PGPROC > larger, but I did a lot of benchmarks and I haven't seen any regression > that I could attribute to this. (More about this later.) I think this is a reasonable approach. Some comments: - FastPathLocalUseInitialized seems unnecessary to me; the contents of an uninitialized local variable are undefined, but an uninitialized global variable always starts out zeroed. - You need comments in various places, including here, where someone is certain to have questions about the algorithm and choice of constants: +#define FAST_PATH_LOCK_REL_GROUP(rel) (((uint64) (rel) * 7883 + 4481) % FP_LOCK_GROUPS_PER_BACKEND) When I originally coded up the fast-path locking stuff, I supposed that we couldn't make the number of slots too big because the algorithm requires a linear search of the whole array. But with this one trick (a partially-associative cache), there's no real reason that I can think of why you can't make the number of slots almost arbitrarily large. At some point you're going to use too much memory, and probably before that point you're going to make the cache big enough that it doesn't fit in the CPU cache of an individual core, at which point possibly it will stop working as well. But honestly ... I don't quite see why this approach couldn't be scaled quite far. Like, if we raised FP_LOCK_GROUPS_PER_BACKEND from your proposed value of 64 to say 65536, would that still perform well? I'm not saying we should do that, because that's probably a silly amount of memory to use for this, but the point is that when you start to have enough partitions that you run out of lock slots, performance is going to degrade, so you can imagine wanting to try to have enough lock groups to make that unlikely. Which leads me to wonder if there's any particular number of lock groups that is clearly "too many" or whether it's just about how much memory we want to use. -- Robert Haas EDB: http://www.enterprisedb.com
On 6/24/24 17:05, Robert Haas wrote: > On Sun, Jan 28, 2024 at 4:57 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> For NUM_LOCK_PARTITIONS this is pretty simple (see 0001 patch). The >> LWLock table has 16 partitions by default - it's quite possible that on >> machine with many cores and/or many partitions, we can easily hit this. >> So I bumped this 4x to 64 partitions. > > I think this probably makes sense. I'm a little worried that we're > just kicking the can down the road here where maybe we should be > solving the problem in some more fundamental way, and I'm also a > little worried that we might be reducing single-core performance. But > it's probably fine. > Yeah, I haven't seen this causing any regressions - the sensitive paths typically lock only one partition, so having more of them does not affect that. Or if it does, it's likely a reasonable trade off as it reduces the risk of lock contention. That being said, I don't recall benchmarking this patch in isolation, only with the other patches. Maybe I should do that ... >> What I ended up doing is having a hash table of 16-element arrays. There >> are 64 "pieces", each essentially the (16 x OID + UINT64 bitmap) that we >> have now. Each OID is mapped to exactly one of these parts as if in a >> hash table, and in each of those 16-element parts we do exactly the same >> thing we do now (linear search, removal, etc.). This works great, the >> locality is great, etc. The one disadvantage is this makes PGPROC >> larger, but I did a lot of benchmarks and I haven't seen any regression >> that I could attribute to this. (More about this later.) > > I think this is a reasonable approach. Some comments: > > - FastPathLocalUseInitialized seems unnecessary to me; the contents of > an uninitialized local variable are undefined, but an uninitialized > global variable always starts out zeroed. > OK. I didn't realize global variables start a zero. > - You need comments in various places, including here, where someone > is certain to have questions about the algorithm and choice of > constants: > > +#define FAST_PATH_LOCK_REL_GROUP(rel) (((uint64) (rel) * 7883 + 4481) > % FP_LOCK_GROUPS_PER_BACKEND) > Yeah, definitely needs comment explaining this. I admit those numbers are pretty arbitrary primes, to implement a trivial hash function. That was good enough for a PoC patch, but maybe for a "proper" version this should use a better hash function. It needs to be fast, and maybe it doesn't matter that much if it's not perfect. > When I originally coded up the fast-path locking stuff, I supposed > that we couldn't make the number of slots too big because the > algorithm requires a linear search of the whole array. But with this > one trick (a partially-associative cache), there's no real reason that > I can think of why you can't make the number of slots almost > arbitrarily large. At some point you're going to use too much memory, > and probably before that point you're going to make the cache big > enough that it doesn't fit in the CPU cache of an individual core, at > which point possibly it will stop working as well. But honestly ... I > don't quite see why this approach couldn't be scaled quite far. > I don't think I've heard the term "partially-associative cache" before, but now that I look at the approach again, it very much reminds me how set-associative caches work (e.g. with cachelines in CPU caches). It's a 16-way associative cache, assigning each entry into one of 16 slots. 
I must have been reading some papers in this area shortly before the PoC patch, and the idea came from there, probably. Which is good, because it means it's a well-understood and widely-used approach. > Like, if we raised FP_LOCK_GROUPS_PER_BACKEND from your proposed value > of 64 to say 65536, would that still perform well? I'm not saying we > should do that, because that's probably a silly amount of memory to > use for this, but the point is that when you start to have enough > partitions that you run out of lock slots, performance is going to > degrade, so you can imagine wanting to try to have enough lock groups > to make that unlikely. Which leads me to wonder if there's any > particular number of lock groups that is clearly "too many" or whether > it's just about how much memory we want to use. > That's an excellent question. I don't know. I agree 64 groups is pretty arbitrary, and having 1024 may not be enough even with a modest number of partitions. When I was thinking about using a higher value, my main concern was that it'd made the PGPROC entry larger. Each "fast-path" group is ~72B, so 64 groups is ~4.5kB, and that felt like quite a bit. But maybe it's fine and we could make it much larger - L3 caches tend to be many MBs these days, although AFAIK it's shared by threads running on the CPU. I'll see if I can do some more testing of this, and see if there's a value where it stops improving / starts degrading, etc. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
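To make the scheme easier to picture, here is a rough sketch of the grouped fast-path structure being discussed. The names and layout are illustrative only (the actual 0002 patch differs in details), but the group mapping reuses the placeholder hash quoted earlier, and each group is the same 16-slot array plus bitmap we have today (uint64 + 16 OIDs = ~72B per group):

    /*
     * Illustrative sketch of the grouped ("16-way set associative") fast-path
     * cache discussed above. Names and layout are hypothetical, not the actual
     * patch; the hash constants are the placeholder ones quoted earlier.
     */
    #include <stdint.h>

    #define FP_LOCK_SLOTS_PER_GROUP     16
    #define FP_LOCK_GROUPS_PER_BACKEND  64

    typedef struct FastPathGroup
    {
        uint64_t    lockBits;   /* per-slot lock-mode bits, as in the current scheme */
        uint32_t    relid[FP_LOCK_SLOTS_PER_GROUP];     /* 0 = empty slot */
    } FastPathGroup;

    typedef struct FastPathCache
    {
        FastPathGroup groups[FP_LOCK_GROUPS_PER_BACKEND];   /* lives in PGPROC */
    } FastPathCache;

    /* map a relation OID to its group - any cheap hash will do here */
    static inline int
    fp_lock_rel_group(uint32_t relid)
    {
        return (int) (((uint64_t) relid * 7883 + 4481) % FP_LOCK_GROUPS_PER_BACKEND);
    }

    /*
     * Linear search within the relation's group - exactly what the current
     * 16-slot array does, just confined to one group, so the per-lookup cost
     * stays constant no matter how many groups there are.
     */
    static int
    fp_lock_find_slot(const FastPathCache *fp, uint32_t relid)
    {
        const FastPathGroup *g = &fp->groups[fp_lock_rel_group(relid)];

        for (int i = 0; i < FP_LOCK_SLOTS_PER_GROUP; i++)
        {
            if (g->relid[i] == relid)
                return i;
        }
        return -1;              /* not held via the fast-path */
    }

This also shows why the number of groups can grow without making individual lookups more expensive - only the memory footprint (and cache locality) changes.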
On Tue, Jun 25, 2024 at 6:04 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > Yeah, definitely needs comment explaining this. > > I admit those numbers are pretty arbitrary primes, to implement a > trivial hash function. That was good enough for a PoC patch, but maybe > for a "proper" version this should use a better hash function. It needs > to be fast, and maybe it doesn't matter that much if it's not perfect. Right. My guess is that if we try too hard to make the hash function good, there will be a performance hit. Unlike, say, strings that come from the user, we have no reason to believe that relfilenumbers will have any particular structure or pattern to them, so a low-quality, fast function seems like a good trade-off to me. But I'm *far* from a hashing expert, so I'm prepared for someone who is to tell me that I'm full of garbage. > I don't think I've heard the term "partially-associative cache" before > That's an excellent question. I don't know. > > I agree 64 groups is pretty arbitrary, and having 1024 may not be enough > even with a modest number of partitions. When I was thinking about using > a higher value, my main concern was that it'd made the PGPROC entry > larger. Each "fast-path" group is ~72B, so 64 groups is ~4.5kB, and that > felt like quite a bit. > > But maybe it's fine and we could make it much larger - L3 caches tend to > be many MBs these days, although AFAIK it's shared by threads running on > the CPU. > > I'll see if I can do some more testing of this, and see if there's a > value where it stops improving / starts degrading, etc. Sounds good. -- Robert Haas EDB: http://www.enterprisedb.com
Hi, On 6/25/24 12:04, Tomas Vondra wrote: > > > On 6/24/24 17:05, Robert Haas wrote: >> On Sun, Jan 28, 2024 at 4:57 PM Tomas Vondra >> <tomas.vondra@enterprisedb.com> wrote: >>> For NUM_LOCK_PARTITIONS this is pretty simple (see 0001 patch). The >>> LWLock table has 16 partitions by default - it's quite possible that on >>> machine with many cores and/or many partitions, we can easily hit this. >>> So I bumped this 4x to 64 partitions. >> >> I think this probably makes sense. I'm a little worried that we're >> just kicking the can down the road here where maybe we should be >> solving the problem in some more fundamental way, and I'm also a >> little worried that we might be reducing single-core performance. But >> it's probably fine. >> > > Yeah, I haven't seen this causing any regressions - the sensitive paths > typically lock only one partition, so having more of them does not > affect that. Or if it does, it's likely a reasonable trade off as it > reduces the risk of lock contention. > > That being said, I don't recall benchmarking this patch in isolation, > only with the other patches. Maybe I should do that ... > >>> What I ended up doing is having a hash table of 16-element arrays. There >>> are 64 "pieces", each essentially the (16 x OID + UINT64 bitmap) that we >>> have now. Each OID is mapped to exactly one of these parts as if in a >>> hash table, and in each of those 16-element parts we do exactly the same >>> thing we do now (linear search, removal, etc.). This works great, the >>> locality is great, etc. The one disadvantage is this makes PGPROC >>> larger, but I did a lot of benchmarks and I haven't seen any regression >>> that I could attribute to this. (More about this later.) >> >> I think this is a reasonable approach. Some comments: >> >> - FastPathLocalUseInitialized seems unnecessary to me; the contents of >> an uninitialized local variable are undefined, but an uninitialized >> global variable always starts out zeroed. >> > > OK. I didn't realize global variables start a zero. > I haven't fixed this yet, but it's pretty clear the "init" is not really needed, because it did the memset() wrong: memset(FastPathLocalUseCounts, 0, sizeof(FastPathLocalUseInitialized)); This only resets one byte of the array, yet it still worked correctly. >> - You need comments in various places, including here, where someone >> is certain to have questions about the algorithm and choice of >> constants: >> >> +#define FAST_PATH_LOCK_REL_GROUP(rel) (((uint64) (rel) * 7883 + 4481) >> % FP_LOCK_GROUPS_PER_BACKEND) >> > > Yeah, definitely needs comment explaining this. > > I admit those numbers are pretty arbitrary primes, to implement a > trivial hash function. That was good enough for a PoC patch, but maybe > for a "proper" version this should use a better hash function. It needs > to be fast, and maybe it doesn't matter that much if it's not perfect. > >> When I originally coded up the fast-path locking stuff, I supposed >> that we couldn't make the number of slots too big because the >> algorithm requires a linear search of the whole array. But with this >> one trick (a partially-associative cache), there's no real reason that >> I can think of why you can't make the number of slots almost >> arbitrarily large. At some point you're going to use too much memory, >> and probably before that point you're going to make the cache big >> enough that it doesn't fit in the CPU cache of an individual core, at >> which point possibly it will stop working as well. But honestly ... 
>> I don't quite see why this approach couldn't be scaled quite far.
>
> I don't think I've heard the term "partially-associative cache" before,
> but now that I look at the approach again, it very much reminds me how
> set-associative caches work (e.g. with cachelines in CPU caches). It's a
> 16-way associative cache, assigning each entry into one of 16 slots.
>
> I must have been reading some papers in this area shortly before the PoC
> patch, and the idea came from there, probably. Which is good, because it
> means it's a well-understood and widely-used approach.
>
>> Like, if we raised FP_LOCK_GROUPS_PER_BACKEND from your proposed value
>> of 64 to say 65536, would that still perform well? I'm not saying we
>> should do that, because that's probably a silly amount of memory to
>> use for this, but the point is that when you start to have enough
>> partitions that you run out of lock slots, performance is going to
>> degrade, so you can imagine wanting to try to have enough lock groups
>> to make that unlikely. Which leads me to wonder if there's any
>> particular number of lock groups that is clearly "too many" or whether
>> it's just about how much memory we want to use.
>
> That's an excellent question. I don't know.
>
> I agree 64 groups is pretty arbitrary, and having 1024 may not be enough
> even with a modest number of partitions. When I was thinking about using
> a higher value, my main concern was that it'd made the PGPROC entry
> larger. Each "fast-path" group is ~72B, so 64 groups is ~4.5kB, and that
> felt like quite a bit.
>
> But maybe it's fine and we could make it much larger - L3 caches tend to
> be many MBs these days, although AFAIK it's shared by threads running on
> the CPU.
>
> I'll see if I can do some more testing of this, and see if there's a
> value where it stops improving / starts degrading, etc.

I finally got to do those experiments. The scripts and results (both raw and summarized) are too big to attach here; they're available at

    https://github.com/tvondra/scalability-tests

The initial patch used 64 groups (which means 1024 fast-path slots); I ran the tests with 0, 1, 8, 32, 128, 512 and 1024 groups (so up to 16k locks).

I thought about testing with ~64k groups, but I didn't go with the extreme value because I don't quite see the point. It would only matter for cases with a truly extreme number of partitions (64k groups is ~1M fast-path slots), and just creating enough partitions would take a lot of time. Moreover, with that many partitions we seem to have various other bottlenecks, and improving this alone does not make such setups really practical. And it's so slow that the benchmark results are somewhat bogus too. Because if we achieve 50 tps with 1000 partitions, does it really matter if a patch changes that to 25 or 100 tps? I doubt that, especially if going to 100 partitions gives you 2000 tps. Now imagine you have 10k or 100k partitions - how fast is that going to be? So I think stopping at 1024 groups is sensible, and if there are some inefficiencies / costs, I'd expect those to gradually show up even at those lower sizes.

But if you look at the results, for example from the "join" test:

    https://github.com/tvondra/scalability-tests/blob/main/join.pdf

there's no such negative effect. The table shows results for different combinations of parameters, with the first group of columns being on regular glibc, while the second one has glibc tuning (see [1] for details). And the values are for different numbers of fast-path groups (0 means the patch was not applied).
And the color scales show the impact of increasing the number of groups. So for example when a column for "32 groups" says 150%, it means going from 8 to 32 groups improved throughput to 1.5x. As usual, green is "good" and red is "bad".

But if you look at the tables, there's very little change - most of the values are close to 100%. This might seem a bit strange, considering the promise of these patches is to improve throughput, and "no change" is an absence of that. But that's because the charts illustrate the effect of changing the group count with the other parameters fixed. They never compare runs with/without the glibc tuning, and that's an important part of the improvement. Doing the pivot table a bit differently would still show a substantial 2-3x improvement.

There's a fair amount of noise - especially for the rpi5 machines (not the right hw for sensitive benchmarks), but also on some i5/xeon runs. I attribute this to only doing one short run (10s) for each combination of parameters. I'll do more runs next time.

Anyway, I think these results show a couple of things:

1) There's no systemic negative effect of increasing the number of groups. We could go with 32k or 64k groups, and it doesn't seem like there would be a problem.

2) But there's not much point in doing that, because we run into various other bottlenecks well before having that many locks. Judging by the results, it doesn't seem like going beyond 32 or 64 groups would give us much.

3) The memory allocation caching (be it the mempool patch, or the glibc tuning like in this test round) is a crucial piece for this. Not doing that means some tests get no improvement at all, or a much smaller one.

4) The increase of NUM_LOCK_PARTITIONS has very limited effect, or perhaps even no effect at all.

Based on this, my plan is to polish the patch adding fast-path groups, with either 32 or 64 groups, which seem to be reasonable values. Then in the future, if/when the other bottlenecks get addressed, we can rethink and increase this.

This however reminds me that all those machines are pretty small, which is good for showing it doesn't break existing/smaller systems, but the initial goal of the patch was to improve behavior on big boxes. I don't have access to the original box at the moment, so if someone could provide access to one of those big epyc/xeon machines with 100+ cores for a couple of days, that would be helpful.

That being said, I think it's pretty clear how serious the issue with memory allocation overhead can be, especially in cases when the existing memory context caching is ineffective (like for the btree palloc). I'm not sure what to do about that. The mempool patch shared in this thread does the trick, but it's fairly complex/invasive. I still like it, but maybe doing something with the glibc tuning would be enough - it's not as effective, but 80% of the improvement is better than no improvement.

regards

[1] https://www.postgresql.org/message-id/0da51f67-c88b-497e-bb89-d5139309eb9c@enterprisedb.com

--
Tomas Vondra
On Sun, Sep 1, 2024 at 3:30 PM Tomas Vondra <tomas@vondra.me> wrote: > I don't think that's possible with hard-coded size of the array - that > allocates the memory for everyone. We'd need to make it variable-length, > and while doing those benchmarks I think we actually already have a GUC > for that - max_locks_per_transaction tells us exactly what we need to > know, right? I mean, if I know I'll need ~1000 locks, why not to make > the fast-path array large enough for that? I really like this idea. I'm not sure about exactly how many fast path slots you should get for what value of max_locks_per_transaction, but coupling the two things together in some way sounds smart. > Of course, the consequence of this would be making PGPROC variable > length, or having to point to a memory allocated separately (I prefer > the latter option, I think). I haven't done any experiments, but it > seems fairly doable - of course, not sure if it might be more expensive > compared to compile-time constants. I agree that this is a potential problem but it sounds like the idea works well enough that we'd probably still come out quite far ahead even with a bit more overhead. -- Robert Haas EDB: http://www.enterprisedb.com
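For illustration, the coupling could be as simple as deriving the group count from the GUC at startup - a sketch with hypothetical names, rounding up to a power of two so the group hash can be a simple mask, and clamping it to keep the per-backend arrays bounded (the arrays would then be allocated separately and pointed to from PGPROC, per the preference stated above):

    /*
     * Sketch (hypothetical names): derive the number of fast-path groups from
     * max_locks_per_transaction at server startup instead of using a
     * compile-time constant. Not the actual patch - just the idea under
     * discussion.
     */
    #define FP_LOCK_SLOTS_PER_GROUP     16
    #define FP_LOCK_MAX_GROUPS          1024    /* arbitrary upper bound */

    static int
    fp_lock_groups_for_guc(int max_locks_per_transaction)
    {
        /* groups needed to hold that many locks via the fast-path */
        int     needed = (max_locks_per_transaction + FP_LOCK_SLOTS_PER_GROUP - 1)
                         / FP_LOCK_SLOTS_PER_GROUP;
        int     groups = 1;

        /*
         * round up to a power of two (so the group hash can use a mask),
         * clamped to keep the per-backend memory bounded
         */
        while (groups < needed && groups < FP_LOCK_MAX_GROUPS)
            groups <<= 1;

        return groups;
    }

With 16 slots per group and ~5B per entry, even a fairly large max_locks_per_transaction translates into a modest per-backend footprint, which is the point of tying the two together rather than adding yet another GUC.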
On Mon, Sep 2, 2024 at 1:46 PM Tomas Vondra <tomas@vondra.me> wrote:
> The one argument to not tie this to max_locks_per_transaction is the
> vastly different "per element" memory requirements. If you add one entry
> to max_locks_per_transaction, that adds LOCK which is a whopping 152B.
> OTOH one fast-path entry is ~5B, give or take. That's a pretty big
> difference, and if the locks fit into the shared lock table but you'd
> like to allow more fast-path locks, having to increase
> max_locks_per_transaction is not great - pretty wasteful.
>
> OTOH I'd really hate to just add another GUC and hope the users will
> magically know how to set it correctly. That's pretty unlikely, IMO. I
> myself wouldn't know what a good value is, I think.
>
> But say we add a GUC and set it to -1 by default, in which case it just
> inherits the max_locks_per_transaction value. And then also provide some
> basic metric about this fast-path cache, so that people can tune this?

All things being equal, I would prefer not to add another GUC for this, but we might need it.

Doing some worst case math, suppose somebody has max_connections=1000 (which is near the upper limit of what I'd consider a sane setting) and max_locks_per_transaction=10000 (ditto). The product is 10 million, so every 10 bytes of storage eats a gigabyte of RAM. Chewing up 15GB of RAM when you could have chewed up only 0.5GB certainly isn't too great. On the other hand, those values are kind of pushing the limits of what is actually sane. If you imagine max_locks_per_transaction=2000 rather than max_locks_per_transaction=10000, then it's only 3GB and that's hopefully not a lot on the hopefully-giant machine where you're running this.

> I think just knowing the "hit ratio" would be enough, i.e. counters for
> how often it fits into the fast-path array, and how often we had to
> promote it to the shared lock table would be enough, no?

Yeah, probably. I mean, that won't tell you how big it needs to be, but it will tell you whether it's big enough.

I wonder if we should be looking at further improvements in the lock manager of some kind. For instance, imagine if we allocated storage via DSM or DSA for cases where we need a really large number of Lock entries. The downside of that is that we might run out of memory for locks at runtime, which would perhaps suck, but you'd probably use significantly less memory on average. Or, maybe we need an even bigger rethink where we reconsider the idea that we take a separate lock for every single partition instead of having some kind of hierarchy-aware lock manager. I don't know. But this feels like very old, crufty tech. There's probably something more state of the art that we could or should be doing.

--
Robert Haas
EDB: http://www.enterprisedb.com
On 9/3/24 17:06, Robert Haas wrote: > On Mon, Sep 2, 2024 at 1:46 PM Tomas Vondra <tomas@vondra.me> wrote: >> The one argument to not tie this to max_locks_per_transaction is the >> vastly different "per element" memory requirements. If you add one entry >> to max_locks_per_transaction, that adds LOCK which is a whopping 152B. >> OTOH one fast-path entry is ~5B, give or take. That's a pretty big >> difference, and it if the locks fit into the shared lock table, but >> you'd like to allow more fast-path locks, having to increase >> max_locks_per_transaction is not great - pretty wastefull. >> >> OTOH I'd really hate to just add another GUC and hope the users will >> magically know how to set it correctly. That's pretty unlikely, IMO. I >> myself wouldn't know what a good value is, I think. >> >> But say we add a GUC and set it to -1 by default, in which case it just >> inherits the max_locks_per_transaction value. And then also provide some >> basic metric about this fast-path cache, so that people can tune this? > > All things being equal, I would prefer not to add another GUC for > this, but we might need it. > Agreed. > Doing some worst case math, suppose somebody has max_connections=1000 > (which is near the upper limit of what I'd consider a sane setting) > and max_locks_per_transaction=10000 (ditto). The product is 10 > million, so every 10 bytes of storage each a gigabyte of RAM. Chewing > up 15GB of RAM when you could have chewed up only 0.5GB certainly > isn't too great. On the other hand, those values are kind of pushing > the limits of what is actually sane. If you imagine > max_locks_per_transaction=2000 rather than > max_locks_per_connection=10000, then it's only 3GB and that's > hopefully not a lot on the hopefully-giant machine where you're > running this. > Yeah, although I don't quite follow the math. With 1000/10000 settings, why would that eat 15GB of RAM? I mean, that's 1.5GB, right? FWIW the actual cost is somewhat higher, because we seem to need ~400B for every lock (not just the 150B for the LOCK struct). At least based on a quick experiment. (Seems a bit high, right?). Anyway, I agree this might be acceptable. If your transactions use this many locks regularly, you probably need this setting anyway. If you only need this many locks occasionally (so that you can keep the locks/xact value low), it probably does not matter that much. And if you're running massively-partitioned table on a tiny box, well, I don't really think that's a particularly sane idea. So I think I'm OK with just tying this to max_locks_per_transaction. >> I think just knowing the "hit ratio" would be enough, i.e. counters for >> how often it fits into the fast-path array, and how often we had to >> promote it to the shared lock table would be enough, no? > > Yeah, probably. I mean, that won't tell you how big it needs to be, > but it will tell you whether it's big enough. > True, but that applies to all "cache hit ratio" metrics (like for our shared buffers). It'd be great to have something better, enough to tell you how large the cache needs to be. But we don't :-( > I wonder if we should be looking at further improvements in the lock > manager of some kind. For instance, imagine if we allocated storage > via DSM or DSA for cases where we need a really large number of Lock > entries. The downside of that is that we might run out of memory for > locks at runtime, which would perhaps suck, but you'd probably use > significantly less memory on average. 
> Or, maybe we need an even bigger > rethink where we reconsider the idea that we take a separate lock for > every single partition instead of having some kind of hierarchy-aware > lock manager. I don't know. But this feels like very old, crufty tech. > There's probably something more state of the art that we could or > should be doing. > Perhaps. I agree we'll probably need something more radical soon, not just changes that aim to fix some rare exceptional case (which may be annoying, but not particularly harmful for the complete workload). For example, if we did what you propose, that might help when very few transactions need a lot of locks. I don't mind saving memory in that case, ofc., but is it a problem if those rare cases are a bit slower? Shouldn't we focus more on cases where many locks are common? Because people are simply going to use partitioning, a lot of indexes, etc? So yeah, I agree we probably need a more fundamental rethink. I don't think we can just keep optimizing the current approach, there's a limit to how fast it can be. Whether it's not locking individual partitions, or not locking some indexes, ... I don't know. regards -- Tomas Vondra
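As a rough illustration of the sizing arithmetic in the two messages above, a standalone C sketch - the 152 B LOCK size and the ~400 B observed per-entry cost are the approximate figures quoted in this thread, not exact values:

#include <stdio.h>

int
main(void)
{
    /*
     * Back-of-envelope sizing of the shared lock table: roughly
     * max_connections * max_locks_per_transaction entries, with ~152 B
     * for the LOCK struct alone and ~400 B per entry once the PROCLOCKs
     * and dynahash overhead are included.  Approximations only.
     */
    long        max_connections = 1000;
    long        max_locks_per_xact = 10000;
    long        entries = max_connections * max_locks_per_xact;

    printf("entries: %ld\n", entries);
    printf("LOCK structs alone: %.1f GB\n", entries * 152.0 / 1e9);
    printf("observed total:     %.1f GB\n", entries * 400.0 / 1e9);
    return 0;
}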
Hi Tomas! On Tue, Sep 3, 2024 at 6:20 PM Tomas Vondra <tomas@vondra.me> wrote: > > On 9/3/24 17:06, Robert Haas wrote: > > On Mon, Sep 2, 2024 at 1:46 PM Tomas Vondra <tomas@vondra.me> wrote: > >> The one argument to not tie this to max_locks_per_transaction is the > >> vastly different "per element" memory requirements. If you add one entry > >> to max_locks_per_transaction, that adds LOCK which is a whopping 152B. > >> OTOH one fast-path entry is ~5B, give or take. That's a pretty big > >> difference, and it if the locks fit into the shared lock table, but > >> you'd like to allow more fast-path locks, having to increase > >> max_locks_per_transaction is not great - pretty wastefull. > >> > >> OTOH I'd really hate to just add another GUC and hope the users will > >> magically know how to set it correctly. That's pretty unlikely, IMO. I > >> myself wouldn't know what a good value is, I think. > >> > >> But say we add a GUC and set it to -1 by default, in which case it just > >> inherits the max_locks_per_transaction value. And then also provide some > >> basic metric about this fast-path cache, so that people can tune this? > > > > All things being equal, I would prefer not to add another GUC for > > this, but we might need it. > > > > Agreed. > > [..] > > So I think I'm OK with just tying this to max_locks_per_transaction. If that matters then the SLRU configurability effort added 7 GUCs (with 3 scaling up based on shared_buffers) just to give high-end users some relief, so here 1 new shouldn't be that such a deal. We could add to the LWLock/lock_manager wait event docs to recommend just using known-to-be-good certain values from this $thread (or ask the user to benchmark it himself). > >> I think just knowing the "hit ratio" would be enough, i.e. counters for > >> how often it fits into the fast-path array, and how often we had to > >> promote it to the shared lock table would be enough, no? > > > > Yeah, probably. I mean, that won't tell you how big it needs to be, > > but it will tell you whether it's big enough. > > > > True, but that applies to all "cache hit ratio" metrics (like for our > shared buffers). It'd be great to have something better, enough to tell > you how large the cache needs to be. But we don't :-( My $0.02 cents: the originating case that triggered those patches, actually started with LWLock/lock_manager waits being the top#1. The operator can cross check (join) that with a group by pg_locks.fastpath (='f'), count(*). So, IMHO we have good observability in this case (rare thing to say!) > > I wonder if we should be looking at further improvements in the lock > > manager of some kind. [..] > > Perhaps. I agree we'll probably need something more radical soon, not > just changes that aim to fix some rare exceptional case (which may be > annoying, but not particularly harmful for the complete workload). > > For example, if we did what you propose, that might help when very few > transactions need a lot of locks. I don't mind saving memory in that > case, ofc. but is it a problem if those rare cases are a bit slower? > Shouldn't we focus more on cases where many locks are common? Because > people are simply going to use partitioning, a lot of indexes, etc? > > So yeah, I agree we probably need a more fundamental rethink. I don't > think we can just keep optimizing the current approach, there's a limit > of fast it can be. Please help me understand: so are You both discussing potential far future further improvements instead of this one ? 
My question is really about: is the patchset good enough or are you considering some other new effort instead? BTW some other random questions: Q1. I've been lurking in https://github.com/tvondra/pg-lock-scalability-results and those shouldn't be used anymore for further discussions, as they contained earlier patches (including 0003-Add-a-memory-pool-with-adaptive-rebalancing.patch) and they were replaced by benchmark data in this $thread, right? Q2. Earlier attempts did contain a mempool patch to get those nice numbers (or was that jemalloc or glibc tuning). So were those recent results in [1] still collected with 0003, or have you switched completely to glibc/jemalloc tuning? -J. [1] - https://www.postgresql.org/message-id/b8c43eda-0c3f-4cb4-809b-841fa5c40ada%40vondra.me
On 9/4/24 11:29, Jakub Wartak wrote: > Hi Tomas! > > On Tue, Sep 3, 2024 at 6:20 PM Tomas Vondra <tomas@vondra.me> wrote: >> >> On 9/3/24 17:06, Robert Haas wrote: >>> On Mon, Sep 2, 2024 at 1:46 PM Tomas Vondra <tomas@vondra.me> wrote: >>>> The one argument to not tie this to max_locks_per_transaction is the >>>> vastly different "per element" memory requirements. If you add one entry >>>> to max_locks_per_transaction, that adds LOCK which is a whopping 152B. >>>> OTOH one fast-path entry is ~5B, give or take. That's a pretty big >>>> difference, and it if the locks fit into the shared lock table, but >>>> you'd like to allow more fast-path locks, having to increase >>>> max_locks_per_transaction is not great - pretty wastefull. >>>> >>>> OTOH I'd really hate to just add another GUC and hope the users will >>>> magically know how to set it correctly. That's pretty unlikely, IMO. I >>>> myself wouldn't know what a good value is, I think. >>>> >>>> But say we add a GUC and set it to -1 by default, in which case it just >>>> inherits the max_locks_per_transaction value. And then also provide some >>>> basic metric about this fast-path cache, so that people can tune this? >>> >>> All things being equal, I would prefer not to add another GUC for >>> this, but we might need it. >>> >> >> Agreed. >> >> [..] >> >> So I think I'm OK with just tying this to max_locks_per_transaction. > > If that matters then the SLRU configurability effort added 7 GUCs > (with 3 scaling up based on shared_buffers) just to give high-end > users some relief, so here 1 new shouldn't be that such a deal. We > could add to the LWLock/lock_manager wait event docs to recommend just > using known-to-be-good certain values from this $thread (or ask the > user to benchmark it himself). > TBH I'm skeptical we'll be able to tune those GUCs. Maybe it was the right thing for the SLRU thread, I don't know - I haven't been following that very closely. But my impression is that we often add a GUC when we're not quite sure how to pick a good value. So we just shift the responsibility to someone else, who however also doesn't know. I'd very much prefer not to do that here. Of course, it's challenging because we can't easily resize these arrays, so even if we had some nice heuristics to calculate the "optimal" number of fast-path slots, what would we do with it ... >>>> I think just knowing the "hit ratio" would be enough, i.e. counters for >>>> how often it fits into the fast-path array, and how often we had to >>>> promote it to the shared lock table would be enough, no? >>> >>> Yeah, probably. I mean, that won't tell you how big it needs to be, >>> but it will tell you whether it's big enough. >>> >> >> True, but that applies to all "cache hit ratio" metrics (like for our >> shared buffers). It'd be great to have something better, enough to tell >> you how large the cache needs to be. But we don't :-( > > My $0.02 cents: the originating case that triggered those patches, > actually started with LWLock/lock_manager waits being the top#1. The > operator can cross check (join) that with a group by pg_locks.fastpath > (='f'), count(*). So, IMHO we have good observability in this case > (rare thing to say!) > That's a good point. So if you had to give some instructions to users what to measure / monitor, and how to adjust the GUC based on that, what would your instructions be? >>> I wonder if we should be looking at further improvements in the lock >>> manager of some kind. [..] >> >> Perhaps. 
>> I agree we'll probably need something more radical soon, not >> just changes that aim to fix some rare exceptional case (which may be >> annoying, but not particularly harmful for the complete workload). >> >> For example, if we did what you propose, that might help when very few >> transactions need a lot of locks. I don't mind saving memory in that >> case, ofc. but is it a problem if those rare cases are a bit slower? >> Shouldn't we focus more on cases where many locks are common? Because >> people are simply going to use partitioning, a lot of indexes, etc? >> >> So yeah, I agree we probably need a more fundamental rethink. I don't >> think we can just keep optimizing the current approach, there's a limit >> of fast it can be. > > Please help me understand: so are You both discussing potential far > future further improvements instead of this one ? My question is > really about: is the patchset good enough or are you considering some > other new effort instead? > I think it was mostly a brainstorming about alternative / additional improvements in locking. The proposed patch does not change the locking in any fundamental way, it merely optimizes one piece - we still acquire exactly the same set of locks, exactly the same way. AFAICS there's an agreement the current approach has limits, and with the growing number of partitions we're hitting them already. That may need rethinking the fundamental approach, but I think that should not block improvements to the current approach. Not to mention there's no proposal for such "fundamental rework" yet. > BTW some other random questions: > Q1. I've been lurking into > https://github.com/tvondra/pg-lock-scalability-results and those > shouldn't be used anymore for further discussions, as they contained > earlier patches (including > 0003-Add-a-memory-pool-with-adaptive-rebalancing.patch) and they were > replaced by benchmark data in this $thread, right? The github results are still valid, I only shared them 3 days ago. It does test both the mempool and glibc tuning, to assess (and compare) the benefits of that, but why would that make it obsolete? By "results in this thread" I suppose you mean the couple numbers I shared on September 2? Those were just very limited benchmarks to assess if making the arrays variable-length (based on GUC) would make things slower. And it doesn't, so the "full" github results still apply. > Q2. Earlier attempts did contain a mempool patch to get those nice > numbers (or was that jemalloc or glibc tuning). So were those recent > results in [1] collected with still 0003 or you have switched > completely to glibc/jemalloc tuning? > The results pushed to github are all with glibc, and test four cases: a) mempool patch not applied, no glibc tuning b) mempool patch applied, no glibc tuning c) mempool patch not applied, glibc tuning d) mempool patch applied, glibc tuning These are the 4 "column groups" in some of the pivot tables, to allow comparing those cases. My interpretation of the results is: 1) The mempool / glibc tuning have significant benefits, at least for some workloads (where the locking patch alone doesn't help much). 2) There's very little difference between the mempool / glibc tuning. The mempool does seem to have a small advantage. 3) The mempool / glibc tuning is irrelevant for non-glibc systems (e.g. for FreeBSD which I think uses jemalloc or something like that). I think the mempool might be interesting and useful for other reasons (e.g.
I initially wrote it to enforce a per-backend memory limit), but you can get mostly the same caching benefits by tuning the glibc parameters. So I'm focusing on the locking stuff. regards -- Tomas Vondra
On Tue, 3 Sept 2024 at 18:20, Tomas Vondra <tomas@vondra.me> wrote: > FWIW the actual cost is somewhat higher, because we seem to need ~400B > for every lock (not just the 150B for the LOCK struct). We do indeed allocate two PROCLOCKs for every LOCK, and allocate those inside dynahash tables. That amounts to (152+2*64+3*16=) 328 bytes in dynahash elements, and (3 * 8-16) = 24-48 bytes for the dynahash buckets/segments, resulting in 352-376 bytes * NLOCKENTS() being used[^1]. Does that align with your usage numbers, or are they significantly larger? > At least based on a quick experiment. (Seems a bit high, right?). Yeah, that does seem high, thanks for nerd-sniping me. The 152 bytes of LOCK are mostly due to a combination of two MAX_LOCKMODES-sized int[]s that are used to keep track of the number of requested/granted locks of each level. As MAX_LOCKMODES = 10, these arrays use a total of 2*4*10=80 bytes, with the remaining 72 spent on tracking. MAX_BACKENDS sadly doesn't fit in int16, so we'll have to keep using int[]s, but that doesn't mean we can't improve this size: ISTM that MAX_LOCKMODES is 2 larger than it has to be: LOCKMODE=0 is NoLock, which is never used or counted in these shared structures, and the max lock mode supported by any of the supported lock methods is AccessExclusiveLock (8). We can thus reduce MAX_LOCKMODES to 8, reducing size of the LOCK struct by 16 bytes. If some struct- and field packing is OK, then we could further reduce the size of LOCK by an additional 8 bytes by resizing the LOCKMASK type from int to int16 (we only use the first MaxLockMode (8) + 1 bits), and then storing the grant/waitMask fields (now 4 bytes total) in the padding that's present at the end of the waitProcs struct. This would depend on dclist not writing in its padding space, but I couldn't find any user that did so, and most critically dclist_init doesn't scribble in the padding with memset. If these are both implemented, it would save 24 bytes, reducing the struct to 128 bytes. :) [^2] I also checked PROCLOCK: If it is worth further packing the struct, we should probably look at whether it's worth replacing the PGPROC* typed fields with ProcNumber -based ones, potentially in both PROCLOCK and PROCLOCKTAG. When combined with int16-typed LOCKMASKs, either one of these fields being replaced with ProcNumber would allow a reduction in size by one MAXALIGN quantum, reducing the struct to 56 bytes, the smallest I could get it to without ignoring type alignments. Further shmem savings can be achieved by reducing the "10% safety margin" added at the end of LockShmemSize, as I'm fairly sure the memory used in shared hashmaps doesn't exceed the estimated amount, and if it did then we should probably fix that part, rather than requesting that (up to) 10% overhead here. Alltogether that'd save 40 bytes/lock entry on size, and ~35 bytes/lock on "safety margin", for a saving of (up to) 19% of our current allocation. I'm not sure if these tricks would benefit with performance or even be a demerit, apart from smaller structs usually being better at fitting better in CPU caches. Kind regards, Matthias van de Meent Neon (https://neon.tech) [^1] NLOCKENTS() benefits from being a power of 2, or slightly below one, as it's rounded up to a power of 2 when dynahash decides its number of buckets to allocate. [^2] Sadly this 2-cachelines alignment is lost due to dynahash's HASHELEMENT prefix of elements. :(
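A simplified sketch of the packing idea described above, with stand-in types for LOCKTAG and the dlist/dclist headers; this is an illustration of the proposed layout, not the actual PostgreSQL definition, and the exact savings depend on alignment and on the dclist padding being usable as assumed:

#include <stdint.h>

/* Illustrative stand-ins for the real PostgreSQL types. */
typedef struct LockTag { uint32_t fields[4]; } LockTag;         /* 16 bytes */
typedef struct DListHead { void *head, *tail; } DListHead;      /* 16 bytes */

/*
 * Sketch of a packed LOCK, roughly following the proposal above:
 *  - MAX_LOCKMODES reduced from 10 to 8, shrinking each count array by 8 B;
 *  - LOCKMASK narrowed from int to uint16_t, with grantMask/waitMask tucked
 *    into what is currently padding after the wait-queue element count.
 */
#define MAX_LOCKMODES_PACKED 8

typedef struct LockPacked
{
    LockTag     tag;                              /* 16 bytes */
    DListHead   procLocks;                        /* 16 bytes */
    DListHead   waitProcsHead;                    /* 16 bytes */
    uint32_t    waitProcsCount;                   /*  4 bytes (dclist count) */
    uint16_t    grantMask;                        /*  2 bytes, was 4 */
    uint16_t    waitMask;                         /*  2 bytes, was 4 */
    int32_t     requested[MAX_LOCKMODES_PACKED];  /* 32 bytes, was 40 */
    int32_t     nRequested;                       /*  4 bytes */
    int32_t     granted[MAX_LOCKMODES_PACKED];    /* 32 bytes, was 40 */
    int32_t     nGranted;                         /*  4 bytes */
} LockPacked;   /* 128 bytes on a typical 64-bit ABI, vs. 152 today */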
On Wed, 4 Sept 2024 at 03:06, Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Sep 2, 2024 at 1:46 PM Tomas Vondra <tomas@vondra.me> wrote: > > But say we add a GUC and set it to -1 by default, in which case it just > > inherits the max_locks_per_transaction value. And then also provide some > > basic metric about this fast-path cache, so that people can tune this? > > All things being equal, I would prefer not to add another GUC for > this, but we might need it. I think driving the array size from max_locks_per_transaction is a good idea (rounded up to the next multiple of 16?). If someone comes along one day and shows us a compelling case where some backend needs more than its fair share of locks and performance is bad because of that, then maybe we can consider adding a GUC then. Certainly, it's much easier to add a GUC later if someone convinces us that it's a good idea than it is to add it now and try to take it away in the future if we realise it's not useful enough to keep. David
On 9/4/24 16:25, Matthias van de Meent wrote: > On Tue, 3 Sept 2024 at 18:20, Tomas Vondra <tomas@vondra.me> wrote: >> FWIW the actual cost is somewhat higher, because we seem to need ~400B >> for every lock (not just the 150B for the LOCK struct). > > We do indeed allocate two PROCLOCKs for every LOCK, and allocate those > inside dynahash tables. That amounts to (152+2*64+3*16=) 328 bytes in > dynahash elements, and (3 * 8-16) = 24-48 bytes for the dynahash > buckets/segments, resulting in 352-376 bytes * NLOCKENTS() being > used[^1]. Does that align with your usage numbers, or are they > significantly larger? > I see more like ~470B per lock. If I patch CalculateShmemSize to log the shmem allocated, I get this: max_connections=100 max_locks_per_transaction=1000 => 194264001 max_connections=100 max_locks_per_transaction=2000 => 241756967 and (((241756967-194264001)/100/1000)) = 474 Could be alignment of structs or something, not sure. >> At least based on a quick experiment. (Seems a bit high, right?). > > Yeah, that does seem high, thanks for nerd-sniping me. > > The 152 bytes of LOCK are mostly due to a combination of two > MAX_LOCKMODES-sized int[]s that are used to keep track of the number > of requested/granted locks of each level. As MAX_LOCKMODES = 10, these > arrays use a total of 2*4*10=80 bytes, with the remaining 72 spent on > tracking. MAX_BACKENDS sadly doesn't fit in int16, so we'll have to > keep using int[]s, but that doesn't mean we can't improve this size: > > ISTM that MAX_LOCKMODES is 2 larger than it has to be: LOCKMODE=0 is > NoLock, which is never used or counted in these shared structures, and > the max lock mode supported by any of the supported lock methods is > AccessExclusiveLock (8). We can thus reduce MAX_LOCKMODES to 8, > reducing size of the LOCK struct by 16 bytes. > > If some struct- and field packing is OK, then we could further reduce > the size of LOCK by an additional 8 bytes by resizing the LOCKMASK > type from int to int16 (we only use the first MaxLockMode (8) + 1 > bits), and then storing the grant/waitMask fields (now 4 bytes total) > in the padding that's present at the end of the waitProcs struct. This > would depend on dclist not writing in its padding space, but I > couldn't find any user that did so, and most critically dclist_init > doesn't scribble in the padding with memset. > > If these are both implemented, it would save 24 bytes, reducing the > struct to 128 bytes. :) [^2] > > I also checked PROCLOCK: If it is worth further packing the struct, we > should probably look at whether it's worth replacing the PGPROC* typed > fields with ProcNumber -based ones, potentially in both PROCLOCK and > PROCLOCKTAG. When combined with int16-typed LOCKMASKs, either one of > these fields being replaced with ProcNumber would allow a reduction in > size by one MAXALIGN quantum, reducing the struct to 56 bytes, the > smallest I could get it to without ignoring type alignments. > > Further shmem savings can be achieved by reducing the "10% safety > margin" added at the end of LockShmemSize, as I'm fairly sure the > memory used in shared hashmaps doesn't exceed the estimated amount, > and if it did then we should probably fix that part, rather than > requesting that (up to) 10% overhead here. > > Alltogether that'd save 40 bytes/lock entry on size, and ~35 > bytes/lock on "safety margin", for a saving of (up to) 19% of our > current allocation. 
I'm not sure if these tricks would benefit with > performance or even be a demerit, apart from smaller structs usually > being better at fitting better in CPU caches. > Not sure either, but it seems worth exploring. If you do an experimental patch for the LOCK size reduction, I can get some numbers. I'm not sure about the safety margins. 10% sure seems like quite a bit of memory (it might not have in the past, but as the instances are growing, that probably changed). regards -- Tomas Vondra
On 9/4/24 17:12, David Rowley wrote: > On Wed, 4 Sept 2024 at 03:06, Robert Haas <robertmhaas@gmail.com> wrote: >> >> On Mon, Sep 2, 2024 at 1:46 PM Tomas Vondra <tomas@vondra.me> wrote: >>> But say we add a GUC and set it to -1 by default, in which case it just >>> inherits the max_locks_per_transaction value. And then also provide some >>> basic metric about this fast-path cache, so that people can tune this? >> >> All things being equal, I would prefer not to add another GUC for >> this, but we might need it. > > I think driving the array size from max_locks_per_transaction is a > good idea (rounded up to the next multiple of 16?). Maybe, although I was thinking we'd just use the regular doubling, to get nice "round" numbers. It will likely overshoot a little bit (unless people set the GUC to exactly 2^N), but I don't think that's a problem. > If someone comes along one day and shows us a compelling case where > some backend needs more than its fair share of locks and performance > is bad because of that, then maybe we can consider adding a GUC then. > Certainly, it's much easier to add a GUC later if someone convinces > us that it's a good idea than it is to add it now and try to take it > away in the future if we realise it's not useful enough to keep. > Yeah, I agree with this. regards -- Tomas Vondra
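A sketch of what the "regular doubling" could look like, assuming 16 slots per group and a hypothetical upper bound on the number of groups; this illustrates the sizing rule being discussed, not the code that was ultimately committed:

#include <stdio.h>

#define FP_SLOTS_PER_GROUP  16      /* one "piece": 16 OIDs + a 64-bit bitmap */
#define FP_MAX_GROUPS       1024    /* hypothetical upper bound */

/*
 * Derive the number of fast-path groups per backend from
 * max_locks_per_transaction by doubling: start at one group and double
 * until the array can hold that many locks (or we hit the cap).
 * Overshoots to the next power of two, as discussed above.
 */
static int
fast_path_groups(int max_locks_per_xact)
{
    int         groups = 1;

    while (groups < FP_MAX_GROUPS &&
           groups * FP_SLOTS_PER_GROUP < max_locks_per_xact)
        groups *= 2;

    return groups;
}

int
main(void)
{
    printf("%d\n", fast_path_groups(64));   /* default GUC -> 4 groups, 64 slots */
    printf("%d\n", fast_path_groups(1000)); /* -> 64 groups, 1024 slots */
    return 0;
}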
On Tue, Sep 3, 2024 at 12:19 PM Tomas Vondra <tomas@vondra.me> wrote: > > Doing some worst case math, suppose somebody has max_connections=1000 > > (which is near the upper limit of what I'd consider a sane setting) > > and max_locks_per_transaction=10000 (ditto). The product is 10 > > million, so every 10 bytes of storage each a gigabyte of RAM. Chewing > > up 15GB of RAM when you could have chewed up only 0.5GB certainly > > isn't too great. On the other hand, those values are kind of pushing > > the limits of what is actually sane. If you imagine > > max_locks_per_transaction=2000 rather than > > max_locks_per_connection=10000, then it's only 3GB and that's > > hopefully not a lot on the hopefully-giant machine where you're > > running this. > > Yeah, although I don't quite follow the math. With 1000/10000 settings, > why would that eat 15GB of RAM? I mean, that's 1.5GB, right? Oh, right. > FWIW the actual cost is somewhat higher, because we seem to need ~400B > for every lock (not just the 150B for the LOCK struct). At least based > on a quick experiment. (Seems a bit high, right?). Hmm, yes, that's unpleasant. > Perhaps. I agree we'll probably need something more radical soon, not > just changes that aim to fix some rare exceptional case (which may be > annoying, but not particularly harmful for the complete workload). > > For example, if we did what you propose, that might help when very few > transactions need a lot of locks. I don't mind saving memory in that > case, ofc. but is it a problem if those rare cases are a bit slower? > Shouldn't we focus more on cases where many locks are common? Because > people are simply going to use partitioning, a lot of indexes, etc? > > So yeah, I agree we probably need a more fundamental rethink. I don't > think we can just keep optimizing the current approach, there's a limit > of fast it can be. Whether it's not locking individual partitions, or > not locking some indexes, ... I don't know. I don't know, either. We don't have to decide right now; it's just something to keep in mind. -- Robert Haas EDB: http://www.enterprisedb.com
On 9/4/24 13:15, Tomas Vondra wrote: > On 9/4/24 11:29, Jakub Wartak wrote: >> Hi Tomas! >> >> ... >> >> My $0.02 cents: the originating case that triggered those patches, >> actually started with LWLock/lock_manager waits being the top#1. The >> operator can cross check (join) that with a group by pg_locks.fastpath >> (='f'), count(*). So, IMHO we have good observability in this case >> (rare thing to say!) >> > > That's a good point. So if you had to give some instructions to users > what to measure / monitor, and how to adjust the GUC based on that, what > would your instructions be? > After thinking about this a bit more, I'm actually wondering if this source of information is sufficient. Firstly, it's just a snapshot of a single instance, and it's not quite trivial to get some summary for a longer time period (people would have to sample it often enough, etc.). Doable, but much less convenient than the cumulative counters. But for the sampling, doesn't this produce skewed data? Imagine you have a workload with very short queries (which is when fast-path matters), so you're likely to see the backend while it's obtaining the locks. If the fast-path locks are much faster to acquire (kinda the whole point), we're more likely to see the backend while it's obtaining the regular locks. Let's say the backend needs 1000 locks, and 500 of those fit into the fast-path array. We're likely to see the 500 fast-path locks already acquired, and a random fraction of the 500 non-fast-path locks. So in the end you'll see backends needing 500 fast-path locks and 250 regular locks. That doesn't seem terrible, but I guess the difference can be made even larger. regards -- Tomas Vondra
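To make the sampling-bias argument concrete, a small sketch computing the expected snapshot under the assumption that fast-path acquisition is much cheaper than going through the shared table; the 10x cost ratio is an arbitrary assumption for illustration:

#include <stdio.h>

/*
 * Expected lock counts seen by a random pg_locks snapshot, assuming the
 * backend acquires n_fast fast-path locks (cost t_fast each) and then
 * n_reg regular locks (cost t_reg each), and the snapshot time is
 * uniform over the whole acquisition phase.
 */
static void
expected_snapshot(double n_fast, double t_fast, double n_reg, double t_reg)
{
    double      total = n_fast * t_fast + n_reg * t_reg;
    double      p_fast = (n_fast * t_fast) / total; /* P(sampled in fast phase) */

    double      seen_fast = p_fast * (n_fast / 2.0) + (1.0 - p_fast) * n_fast;
    double      seen_reg = (1.0 - p_fast) * (n_reg / 2.0);

    printf("expect ~%.0f fast-path and ~%.0f regular locks in the snapshot\n",
           seen_fast, seen_reg);
}

int
main(void)
{
    /* 1000 locks, half fit the fast-path array, regular locks 10x costlier */
    expected_snapshot(500, 1.0, 500, 10.0);
    return 0;
}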
On Thu, Sep 5, 2024 at 7:33 PM Tomas Vondra <tomas@vondra.me> wrote: >>> My $0.02 cents: the originating case that triggered those patches, >>> actually started with LWLock/lock_manager waits being the top#1. The >>> operator can cross check (join) that with a group by pg_locks.fastpath >>> (='f'), count(*). So, IMHO we have good observability in this case >>> (rare thing to say!) >>> >> >> That's a good point. So if you had to give some instructions to users >> what to measure / monitor, and how to adjust the GUC based on that, what >> would your instructions be? >> > > After thinking about this a bit more, I'm actually wondering if this is > source of information is sufficient. Firstly, it's just a snapshot of a > single instance, and it's not quite trivial to get some summary for > longer time period (people would have to sample it often enough, etc.). > Doable, but much less convenient than the cumulative counters. OK, so answering previous question: Probably just monitor pg_stat_activty (group on wait events count(*)) with pg_locks with group by on per-pid and fastpath . Even simple observations with \watch 0.1 are good enough to confirm/deny the root-cause in practice even for short bursts while it is happening. While deploying monitoring for a longer time (with say sample of 1s), you sooner or later would get the __high water mark__ and possibly allow up to that many fastpaths as starting point as there are locks occuring for affected PIDs (or double the amount). > But for the sampling, doesn't this produce skewed data? Imagine you have > a workload with very short queries (which is when fast-path matters), so > you're likely to see the backend while it's obtaining the locks. If the > fast-path locks take much faster acquire (kinda the whole point), we're > more likely to see the backend while it's obtaining the regular locks. > > Let's say the backend needs 1000 locks, and 500 of those fit into the > fast-path array. We're likely to see the 500 fast-path locks already > acquired, and a random fraction of the 500 non-fast-path locks. So in > the end you'll se backends needing 500 fast-path locks and 250 regular > locks. That doesn't seem terrible, but I guess the difference can be > made even larger. ... it doesn't need to perfect data to act, right? We may just need information that it is happening (well we do). Maybe it's too pragmatic point of view, but wasting some bits of memory for this, but still being allowed to control it how much it allocates in the end -- is much better situation than today, without any control where we are wasting crazy CPU time on all those futex() syscalls and context switches Another angle is that if you see the SQL causing it, it is most often going to be attributed to partitioning and people ending up accessing way too many partitions (thousands) without proper partition pruning - sometimes it even triggers re-partitioning of the said tables. So maybe the realistic "fastpath sizing" should assume something that supports: a) usual number of tables in JOINs (just few of them are fine like today) -> ok b) interval 1 month partitions for let's say 5 years (12*5 = 60), joined to some other table like that gives like what, max 120? -> so if you have users doing SELECT * FROM such_table , they will already have set the max_locks_per_xact probably to something higher. c) HASH partitioning up to VCPU-that-are-in-the-wild count? (say 64 or 128? so it sounds same as above?) 
d) probably we should not care here at all if somebody wants daily partitioning across years with HASH (sub)partitions without partition pruning -> it has nothing to do with being "fast" anyway Judging from the current reports, people have configured max_locks_per_xact like this: ~75% have it at default (64), 10% has 1024, 5% has 128 and the rest (~10%) is having between 100..thousands, with extreme one-offs @ 25k (wild misconfiguration judging from the other specs). BTW: you probably need to register this $thread into CF for others to see too (it's not there?) -J.
On Fri, Sep 13, 2024 at 1:45 AM Tomas Vondra <tomas@vondra.me> wrote: > [..] > Anyway, at this point I'm quite happy with this improvement. I didn't > have any clear plan when to commit this, but I'm considering doing so > sometime next week, unless someone objects or asks for some additional > benchmarks etc. Thank you very much for working on this :) The only fact that comes to my mind is that we could blow up L2 caches. Fun fact, so if we are growing PGPROC by 6.3x, that's going to be like one or two 2MB huge pages more @ common max_connections=1000 x86_64 (830kB -> ~5.1MB), and indeed: # without patch: postgres@hive:~$ /usr/pgsql18/bin/postgres -D /tmp/pg18 -C shared_memory_size_in_huge_pages 177 # with patch: postgres@hive:~$ /usr/pgsql18/bin/postgres -D /tmp/pg18 -C shared_memory_size_in_huge_pages 178 So playing Devil's advocate , the worst situation that could possibly hurt (?) could be: * memory size of PGPROC working set >> L2_cache (thus very high max_connections), * insane number of working sessions on CPU (sessions >> VCPU) - sadly happens to some, * those sessions wouldn't have to be competing for the same Oids - just fetching this new big fpLockBits[] structure - so probing a lot for lots of Oids, but *NOT* having to use futex() syscall [so not that syscall price] * no huge pages (to cause dTLB misses) then maybe(?) one could observe further degradation of dTLB misses in the perf-stat counter under some microbenchmark, but measuring that requires isolated and physical hardware. Maybe that would be actually noise due to overhead of context-switches itself. Just trying to think out loud, what big PGPROC could cause here. But this is already an unhealthy and non-steady state of the system, so IMHO we are good, unless someone comes up with a better (more evil) idea. >> I did look at docs if anything needs updating, but I don't think so. The SGML docs only talk about fast-path locking at fairly high level, not about how many we have etc. Well the only thing I could think of was to add to the doc/src/sgml/config.sgml / "max_locks_per_transaction" GUC, that "it is also used as advisory for the number of groups used in lockmanager's fast-path implementation" (that is, without going into further discussion, as even pg_locks discussion doc/src/sgml/system-views.sgml simply uses that term). -J.
On 9/16/24 15:11, Jakub Wartak wrote: > On Fri, Sep 13, 2024 at 1:45 AM Tomas Vondra <tomas@vondra.me> wrote: > >> [..] > >> Anyway, at this point I'm quite happy with this improvement. I didn't >> have any clear plan when to commit this, but I'm considering doing so >> sometime next week, unless someone objects or asks for some additional >> benchmarks etc. > > Thank you very much for working on this :) > > The only fact that comes to my mind is that we could blow up L2 > caches. Fun fact, so if we are growing PGPROC by 6.3x, that's going to > be like one or two 2MB huge pages more @ common max_connections=1000 > x86_64 (830kB -> ~5.1MB), and indeed: > > # without patch: > postgres@hive:~$ /usr/pgsql18/bin/postgres -D /tmp/pg18 -C > shared_memory_size_in_huge_pages > 177 > > # with patch: > postgres@hive:~$ /usr/pgsql18/bin/postgres -D /tmp/pg18 -C > shared_memory_size_in_huge_pages > 178 > > So playing Devil's advocate , the worst situation that could possibly > hurt (?) could be: > * memory size of PGPROC working set >> L2_cache (thus very high > max_connections), > * insane number of working sessions on CPU (sessions >> VCPU) - sadly > happens to some, > * those sessions wouldn't have to be competing for the same Oids - > just fetching this new big fpLockBits[] structure - so probing a lot > for lots of Oids, but *NOT* having to use futex() syscall [so not that > syscall price] > * no huge pages (to cause dTLB misses) > > then maybe(?) one could observe further degradation of dTLB misses in > the perf-stat counter under some microbenchmark, but measuring that > requires isolated and physical hardware. Maybe that would be actually > noise due to overhead of context-switches itself. Just trying to think > out loud, what big PGPROC could cause here. But this is already an > unhealthy and non-steady state of the system, so IMHO we are good, > unless someone comes up with a better (more evil) idea. > I've been thinking about such cases too, but I don't think it can really happen in practice, because: - How likely is it that the sessions will need a lot of OIDs, but not the same ones? Also, why would it matter that the OIDs are not the same, I don't think it matters unless one of the sessions needs an exclusive lock, at which point the optimization doesn't really matter. - If having more fast-path slots means it doesn't fit into L2 cache, would we fit into L2 without it? I don't think so - if there really are that many locks, we'd have to add those into the shared lock table, and there's a lot of extra stuff to keep in memory (relcaches, ...). This is pretty much one of the cases I focused on in my benchmarking, and I'm yet to see any regression. >>> I did look at docs if anything needs updating, but I don't think so. The > SGML docs only talk about fast-path locking at fairly high level, not > about how many we have etc. > > Well the only thing I could think of was to add to the > doc/src/sgml/config.sgml / "max_locks_per_transaction" GUC, that "it > is also used as advisory for the number of groups used in > lockmanager's fast-path implementation" (that is, without going into > further discussion, as even pg_locks discussion > doc/src/sgml/system-views.sgml simply uses that term). > Thanks, I'll consider mentioning this in max_locks_per_transaction. Also, I think there's a place calculating the amount of per-connection memory, so maybe that needs to be updated too. regards -- Tomas Vondra
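For the L2-cache question above, a rough sketch of how the per-backend fast-path state grows with the number of groups; the per-slot sizes (a 4 B OID per slot plus a 64-bit bitmap per 16-slot group) are approximations, not the exact PGPROC layout:

#include <stdio.h>

/*
 * Approximate size of the per-backend fast-path state: each group holds
 * 16 relation OIDs (4 B each) plus a 64-bit lock-bits word.  Rough only;
 * the real PGPROC has additional fields and padding.
 */
static long
fast_path_bytes(int groups)
{
    return (long) groups * (16 * 4 + 8);
}

int
main(void)
{
    /* one group (the old fixed 16 slots) vs. 64 groups (1024 slots) */
    printf("1 group:   %ld bytes/backend\n", fast_path_bytes(1));
    printf("64 groups: %ld bytes/backend\n", fast_path_bytes(64));

    /* at max_connections = 1000, the 64-group case adds a few MB in total,
     * in the same ballpark as the figures quoted above */
    printf("64 groups * 1000 backends: ~%.1f MB\n",
           fast_path_bytes(64) * 1000 / 1e6);
    return 0;
}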
Hi, I've finally pushed this, after many rounds of careful testing to ensure no regressions, and polishing. All changes since the version shared on September 13 are only cosmetic - renaming a macro to keep it consistent with the other ones, clarifying a couple of comments, etc. Nothing major. I ended up squashing the two parts into a single commit. I thought about keeping the two steps, but it seemed pointless - the first part inflated the PGPROC struct, which I didn't want to commit, even if only as an intermediate WIP state. So far the buildfarm hasn't blown up, so let's hope it will stay that way. I just realized there's no CF entry for this - sorry about that :-( I started the thread a year ago to discuss an experimental patch, and it never made it to CFA. But there was a discussion spanning a year, so hopefully that's enough. regards -- Tomas Vondra
On Sat, 21 Sept 2024 at 21:33, Tomas Vondra <tomas@vondra.me> wrote: > I've finally pushed this, after many rounds of careful testing to ensure > no regressions, and polishing. All changes since the version shared on > September 13 are only cosmetic - renaming a macro to keep it consistent > with the other ones, clarifying a couple comments etc. Nothing major. Great work on this, I have seen multiple customers hitting fast path capacity related LockManager contention. They will certainly be glad to have a fix available when they eventually upgrade. Regretfully I did not find the time to participate in this discussion during development. But I did have some thoughts that I wanted to unload to the list, not as a criticism, but in case it turns out follow up work is needed. Driving the array sizing from max_locks_per_transaction seems like a good idea. The one major difference from the lock table is that while the lock table is partitioned dynamically between backends, the fast path array has a static size per backend. One case where that distinction matters is when only a fraction of backends try to lock large numbers of relations. This fraction will still fall back to main lock tables, but at least the contention should be limited by virtue of not having too many of those backends. The other case is when max_connections is much higher than the number of backends actually used. Then backends may be consuming well over max_locks_per_transaction without running into lock table capacity issues. In both cases users will have the simple workaround of just increasing the max_locks_per_transaction setting. Still, I'm sure they would be happier if things just worked without any tuning. So I tried to figure out some scheme to get dynamic allocation of fast path locks. The best data structure I came up with was to have a shared fast path lock array. Still partitioned as a 16-way associative cache, but indexed by hash(BackendId, RelationId). fpLockBits can be stuffed into the high byte of BackendId thanks to MAX_BACKENDS. Locking could be handled by one lock per way, or at least on cursory glance it shouldn't be too difficult to convert the whole fast path acquisition to be lock free. Either way, it feels like structuring the array this way could result in a large amount of false sharing of cache lines. Current static allocation means that each process needs to touch only a small set of cache lines only referenced by itself - quite probable to keep those lines in CPU local L2 in exclusive mode. In a shared array a larger number of cache lines are needed and they will be concurrently written to by other backends - lots of invalidation messages and cache line bouncing. I don't know how large this effect will be without doing a prototype and running it on a large machine with high core-to-core latencies. It would be possible to create a hybrid approach of a small local FP array servicing the majority of acquisitions with a larger shared victim cache for exceptional cases. But it doesn't feel like it is worth the complexity. At least not without seeing some example workloads where it would help. And even then, maybe using hierarchical locking to do less work is the better approach. Being optimistic, perhaps the current patch was enough to resolve the issue. -- Ants Aasma Senior Database Engineer www.cybertec-postgresql.com
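A sketch of the shared, 16-way associative fast-path array Ants describes, keyed by (backend, relation) with the lock bits packed into the same word as the backend id; the names, bucket count and hash below are illustrative assumptions, not a worked-out design:

#include <stdint.h>

#define FP_WAYS         16          /* 16-way associative, as today */
#define FP_NBUCKETS     (1 << 14)   /* illustrative bucket count */

/*
 * One entry of a hypothetical shared fast-path array.  MAX_BACKENDS fits
 * in 3 bytes, so the lock bits can live in the high byte of the same
 * 32-bit word as the backend id.
 */
typedef struct SharedFpEntry
{
    uint32_t    backend_and_bits;   /* low 24 bits: backend, high 8: lock bits */
    uint32_t    relid;              /* relation OID, 0 = unused */
} SharedFpEntry;

typedef struct SharedFpBucket
{
    SharedFpEntry entries[FP_WAYS]; /* plus a per-bucket lwlock, omitted here */
} SharedFpBucket;

/* Illustrative hash: mix backend id and relation OID into a bucket number. */
static inline uint32_t
shared_fp_bucket(uint32_t backend, uint32_t relid)
{
    uint64_t    h = (uint64_t) backend * 0x9E3779B97F4A7C15ULL ^ relid;

    return (uint32_t) (h % FP_NBUCKETS);
}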
On 9/22/24 10:50, Ants Aasma wrote: > On Sat, 21 Sept 2024 at 21:33, Tomas Vondra <tomas@vondra.me> wrote: >> I've finally pushed this, after many rounds of careful testing to ensure >> no regressions, and polishing. All changes since the version shared on >> September 13 are only cosmetic - renaming a macro to keep it consistent >> with the other ones, clarifying a couple comments etc. Nothing major. > > Great work on this, I have seen multiple customers hitting fast path > capacity related LockManager contention. They will certainly be glad > to have a fix available when they eventually upgrade. Regretfully I > did not find the time to participate in this discussion during > development. But I did have some thoughts that I wanted to unload to > the list, not as a criticism, but in case it turns out follow up work > is needed. > > Driving the array sizing from max_locks_per_transaction seems like a > good idea. The one major difference from the lock table is that while > the lock table is partitioned dynamically between backends, the fast > path array has a static size per backend. One case where that > distinction matters is when only a fraction of backends try to lock > large numbers of relations. This fraction will still fall back to main > lock tables, but at least the contention should be limited by virtue > of not having too many of those backends. The other case is when > max_connections is much higher than the number of backends actually > used. Then backends may be consuming well over > max_locks_per_transaction without running into lock table capacity > issues. > I agree. I don't think the case with a couple lock-hungry backends matters too much, because as you say there can't be too many of them, so the contention should not be too bad. At least that was my reasoning. Regarding the case with very high max_connection values - I doubt we want to optimize for that very much. Extremely high max_connection values are a clear anti-pattern (IMO), and if you choose to do that anyway, you simply have to accept that connections have costs. The memory for fast-path locking is one of those costs. I'm not against improving that, ofc, but I think we should only do that if it doesn't hurt the "reasonable" setups. > In both cases users will have the simple workaround of just increasing > the max_locks_per_transaction setting. Still, I'm sure they would be > happier if things just worked without any tuning. So I tried to figure > out some scheme to get dynamic allocation of fast path locks. > I agree with the premise that less tuning is better. Which is why we tied this to max_locks_per_transaction. > The best data structure I came up with was to have a shared fast path > lock array. Still partitioned as a 16-way associative cache, but > indexed by hash(BackendId, RelationId). fpLockBits can be stuffed into > the high byte of BackendId thanks to MAX_BACKENDS. Locking could be > handled by one lock per way, or at least on cursory glance it > shouldn't be too difficult to convert the whole fast path acquisition > to be lock free. > > Either way, it feels like structuring the array this way could result > in a large amount of false sharing of cache lines. Current static > allocation means that each process needs to touch only a small set of > cache lines only referenced by itself - quite probable to keep those > lines in CPU local L2 in exclusive mode. 
> In a shared array a larger > number of cache lines are needed and they will be concurrently written > to by other backends - lots of invalidation messages and cache line > bouncing. I don't know how large this effect will be without doing a > prototype and running it on a large machine with high core-to-core > latencies. > I don't have a very good intuition regarding cachelines. Ideally, the backends would access disjoint parts of the array, so there should not be a lot of false sharing. But maybe I'm wrong, hard to say without an experimental patch. > It would be possible to create a hybrid approach of a small local FP > array servicing the majority of acquisitions with a larger shared > victim cache for exceptional cases. But it doesn't feel like it is > worth the complexity. At least not without seeing some example > workloads where it would help. And even then, maybe using hierarchical > locking to do less work is the better approach. > Not sure. My intuition would be to keep this as simple as possible. Having a shared lock table and also a separate fast-path cache is already sufficiently complex, adding a cache for a cache seems a bit too much to me. > Being optimistic, perhaps the current patch was enough to resolve the issue. > It's an improvement. But if you want to give the shared fast-path cache a try, go ahead - if you write a patch, I promise to review it. regards -- Tomas Vondra
Tomas Vondra <tomas@vondra.me> writes: > I've finally pushed this, after many rounds of careful testing to ensure > no regressions, and polishing. Coverity is not terribly happy with this. "Assert(fpPtr = fpEndPtr);" is very clearly not doing what you presumably intended. The others look like overaggressive assertion checking. If you don't want those macros to assume that the argument is unsigned, you could force the issue, say with #define FAST_PATH_GROUP(index) \ - (AssertMacro(((index) >= 0) && ((index) < FP_LOCK_SLOTS_PER_BACKEND)), \ + (AssertMacro((uint32) (index) < FP_LOCK_SLOTS_PER_BACKEND), \ ((index) / FP_LOCK_SLOTS_PER_GROUP)) ________________________________________________________________________________________________________ *** CID 1619664: Incorrect expression (ASSERT_SIDE_EFFECT) /srv/coverity/git/pgsql-git/postgresql/src/backend/storage/lmgr/proc.c: 325 in InitProcGlobal() 319 pg_atomic_init_u32(&(proc->procArrayGroupNext), INVALID_PROC_NUMBER); 320 pg_atomic_init_u32(&(proc->clogGroupNext), INVALID_PROC_NUMBER); 321 pg_atomic_init_u64(&(proc->waitStart), 0); 322 } 323 324 /* Should have consumed exactly the expected amount of fast-path memory. */ >>> CID 1619664: Incorrect expression (ASSERT_SIDE_EFFECT) >>> Assignment "fpPtr = fpEndPtr" has a side effect. This code will work differently in a non-debug build. 325 Assert(fpPtr = fpEndPtr); 326 327 /* 328 * Save pointers to the blocks of PGPROC structures reserved for auxiliary 329 * processes and prepared transactions. 330 */ ________________________________________________________________________________________________________ *** CID 1619662: Integer handling issues (NO_EFFECT) /srv/coverity/git/pgsql-git/postgresql/src/backend/storage/lmgr/lock.c: 3731 in GetLockStatusData() 3725 3726 LWLockAcquire(&proc->fpInfoLock, LW_SHARED); 3727 3728 for (f = 0; f < FP_LOCK_SLOTS_PER_BACKEND; ++f) 3729 { 3730 LockInstanceData *instance; >>> CID 1619662: Integer handling issues (NO_EFFECT) >>> This greater-than-or-equal-to-zero comparison of an unsigned value is always true. "f >= 0U". 3731 uint32 lockbits = FAST_PATH_GET_BITS(proc, f); 3732 3733 /* Skip unallocated slots. */ 3734 if (!lockbits) 3735 continue; 3736 ________________________________________________________________________________________________________ *** CID 1619661: Integer handling issues (NO_EFFECT) /srv/coverity/git/pgsql-git/postgresql/src/backend/storage/lmgr/lock.c: 2696 in FastPathGrantRelationLock() 2690 uint32 group = FAST_PATH_REL_GROUP(relid); 2691 2692 /* Scan for existing entry for this relid, remembering empty slot. */ 2693 for (i = 0; i < FP_LOCK_SLOTS_PER_GROUP; i++) 2694 { 2695 /* index into the whole per-backend array */ >>> CID 1619661: Integer handling issues (NO_EFFECT) >>> This greater-than-or-equal-to-zero comparison of an unsigned value is always true. "group >= 0U". 
2696 uint32 f = FAST_PATH_SLOT(group, i); 2697 2698 if (FAST_PATH_GET_BITS(MyProc, f) == 0) 2699 unused_slot = f; 2700 else if (MyProc->fpRelId[f] == relid) 2701 { ________________________________________________________________________________________________________ *** CID 1619660: Integer handling issues (NO_EFFECT) /srv/coverity/git/pgsql-git/postgresql/src/backend/storage/lmgr/lock.c: 2813 in FastPathTransferRelationLocks() 2807 2808 for (j = 0; j < FP_LOCK_SLOTS_PER_GROUP; j++) 2809 { 2810 uint32 lockmode; 2811 2812 /* index into the whole per-backend array */ >>> CID 1619660: Integer handling issues (NO_EFFECT) >>> This greater-than-or-equal-to-zero comparison of an unsigned value is always true. "group >= 0U". 2813 uint32 f = FAST_PATH_SLOT(group, j); 2814 2815 /* Look for an allocated slot matching the given relid. */ 2816 if (relid != proc->fpRelId[f] || FAST_PATH_GET_BITS(proc, f) == 0) 2817 continue; 2818 ________________________________________________________________________________________________________ *** CID 1619659: Integer handling issues (NO_EFFECT) /srv/coverity/git/pgsql-git/postgresql/src/backend/storage/lmgr/lock.c: 3067 in GetLockConflicts() 3061 3062 for (j = 0; j < FP_LOCK_SLOTS_PER_GROUP; j++) 3063 { 3064 uint32 lockmask; 3065 3066 /* index into the whole per-backend array */ >>> CID 1619659: Integer handling issues (NO_EFFECT) >>> This greater-than-or-equal-to-zero comparison of an unsigned value is always true. "group >= 0U". 3067 uint32 f = FAST_PATH_SLOT(group, j); 3068 3069 /* Look for an allocated slot matching the given relid. */ 3070 if (relid != proc->fpRelId[f]) 3071 continue; 3072 lockmask = FAST_PATH_GET_BITS(proc, f); ________________________________________________________________________________________________________ *** CID 1619658: Integer handling issues (NO_EFFECT) /srv/coverity/git/pgsql-git/postgresql/src/backend/storage/lmgr/lock.c: 2739 in FastPathUnGrantRelationLock() 2733 uint32 group = FAST_PATH_REL_GROUP(relid); 2734 2735 FastPathLocalUseCounts[group] = 0; 2736 for (i = 0; i < FP_LOCK_SLOTS_PER_GROUP; i++) 2737 { 2738 /* index into the whole per-backend array */ >>> CID 1619658: Integer handling issues (NO_EFFECT) >>> This greater-than-or-equal-to-zero comparison of an unsigned value is always true. "group >= 0U". 2739 uint32 f = FAST_PATH_SLOT(group, i); 2740 2741 if (MyProc->fpRelId[f] == relid 2742 && FAST_PATH_CHECK_LOCKMODE(MyProc, f, lockmode)) 2743 { 2744 Assert(!result); ________________________________________________________________________________________________________ *** CID 1619657: Integer handling issues (NO_EFFECT) /srv/coverity/git/pgsql-git/postgresql/src/backend/storage/lmgr/lock.c: 2878 in FastPathGetRelationLockEntry() 2872 2873 for (i = 0; i < FP_LOCK_SLOTS_PER_GROUP; i++) 2874 { 2875 uint32 lockmode; 2876 2877 /* index into the whole per-backend array */ >>> CID 1619657: Integer handling issues (NO_EFFECT) >>> This greater-than-or-equal-to-zero comparison of an unsigned value is always true. "group >= 0U". 2878 uint32 f = FAST_PATH_SLOT(group, i); 2879 2880 /* Look for an allocated slot matching the given relid. */ 2881 if (relid != MyProc->fpRelId[f] || FAST_PATH_GET_BITS(MyProc, f) == 0) 2882 continue; 2883 regards, tom lane
On 9/22/24 17:45, Tom Lane wrote: > Tomas Vondra <tomas@vondra.me> writes: >> I've finally pushed this, after many rounds of careful testing to ensure >> no regressions, and polishing. > > Coverity is not terribly happy with this. "Assert(fpPtr = fpEndPtr);" > is very clearly not doing what you presumably intended. The others > look like overaggressive assertion checking. If you don't want those > macros to assume that the argument is unsigned, you could force the > issue, say with > > #define FAST_PATH_GROUP(index) \ > - (AssertMacro(((index) >= 0) && ((index) < FP_LOCK_SLOTS_PER_BACKEND)), \ > + (AssertMacro((uint32) (index) < FP_LOCK_SLOTS_PER_BACKEND), \ > ((index) / FP_LOCK_SLOTS_PER_GROUP)) > Ah, you're right. I'll fix those asserts tomorrow. The first is clearly wrong, of course. For the (x >= 0) asserts, doing it this way relies on negative values wrapping to large positive ones, correct? AFAIK it's guaranteed to be a very large value, so it can't accidentally be less than the slot count. regards -- Tomas Vondra
Tomas Vondra <tomas@vondra.me> writes: > On 9/22/24 17:45, Tom Lane wrote: >> #define FAST_PATH_GROUP(index) \ >> - (AssertMacro(((index) >= 0) && ((index) < FP_LOCK_SLOTS_PER_BACKEND)), \ >> + (AssertMacro((uint32) (index) < FP_LOCK_SLOTS_PER_BACKEND), \ >> ((index) / FP_LOCK_SLOTS_PER_GROUP)) > For the (x >= 0) asserts, doing it this way relies on negative values > wrapping to large positive ones, correct? AFAIK it's guaranteed to be a > very large value, so it can't accidentally be less than the slot count. Right, any negative value would wrap to something more than INT32_MAX. regards, tom lane
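A tiny illustration of the point: casting a negative index to uint32 wraps it to a value far above any sane slot count, so the single unsigned comparison rejects both negative and too-large indexes. The slot count below is just an example value:

#include <stdio.h>
#include <stdint.h>

#define FP_LOCK_SLOTS 1024      /* illustrative slot count */

int
main(void)
{
    int         idx = -1;

    /* -1 wraps to 4294967295, which is certainly >= FP_LOCK_SLOTS ... */
    printf("(uint32) -1 = %u\n", (uint32_t) idx);

    /* ... so this one check rejects negative and out-of-range indexes. */
    printf("in range: %d\n", (uint32_t) idx < FP_LOCK_SLOTS);
    return 0;
}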
On Mon, Sep 16, 2024 at 4:19 PM Tomas Vondra <tomas@vondra.me> wrote: > On 9/16/24 15:11, Jakub Wartak wrote: > > On Fri, Sep 13, 2024 at 1:45 AM Tomas Vondra <tomas@vondra.me> wrote: > > > >> [..] > > > >> Anyway, at this point I'm quite happy with this improvement. I didn't > >> have any clear plan when to commit this, but I'm considering doing so > >> sometime next week, unless someone objects or asks for some additional > >> benchmarks etc. > > > > Thank you very much for working on this :) > > > > The only fact that comes to my mind is that we could blow up L2 > > caches. Fun fact, so if we are growing PGPROC by 6.3x, that's going to > > be like one or two 2MB huge pages more @ common max_connections=1000 > > x86_64 (830kB -> ~5.1MB), and indeed: [..] > > then maybe(?) one could observe further degradation of dTLB misses in > > the perf-stat counter under some microbenchmark, but measuring that > > requires isolated and physical hardware. Maybe that would be actually > > noise due to overhead of context-switches itself. Just trying to think > > out loud, what big PGPROC could cause here. But this is already an > > unhealthy and non-steady state of the system, so IMHO we are good, > > unless someone comes up with a better (more evil) idea. > > > > I've been thinking about such cases too, but I don't think it can really > happen in practice, because: > > - How likely is it that the sessions will need a lot of OIDs, but not > the same ones? Also, why would it matter that the OIDs are not the same, > I don't think it matters unless one of the sessions needs an exclusive > lock, at which point the optimization doesn't really matter. > > - If having more fast-path slots means it doesn't fit into L2 cache, > would we fit into L2 without it? I don't think so - if there really are > that many locks, we'd have to add those into the shared lock table, and > there's a lot of extra stuff to keep in memory (relcaches, ...). > > This is pretty much one of the cases I focused on in my benchmarking, > and I'm yet to see any regression. Sorry for answering this so late. Just for context here: I was imagining a scenario with high max_connections about e.g. schema-based multi-tenancy and no partitioning (so all would be fine without this $thread/commit ; so under 16 (fast)locks would be taken). The OIDs need to be different to avoid contention: so that futex() does not end up really in syscall (just user-space part). My theory was that a much smaller PGPROC should be doing much less (data) cache-line fetches than with-the-patch. That hash() % prime , hits various parts of a larger array - so without patch should be quicker as it wouldn't be randomly hitting some larger array[], but it might be noise as you state. It was a theoretical attempt at crafting the worst possible conditions for the patch, so feel free to disregard as it already assumes some anti-pattern (big & all active max_connections). > > Well the only thing I could think of was to add to the > > doc/src/sgml/config.sgml / "max_locks_per_transaction" GUC, that "it > > is also used as advisory for the number of groups used in > > lockmanager's fast-path implementation" (that is, without going into > > further discussion, as even pg_locks discussion > > doc/src/sgml/system-views.sgml simply uses that term). > > > > Thanks, I'll consider mentioning this in max_locks_per_transaction. > Also, I think there's a place calculating the amount of per-connection > memory, so maybe that needs to be updated too. 
> I couldn't find it in current versions, but maybe that's helpful/reaffirming: - up to 9.2 there were exact formulas used, see "(1800 + 270 * max_locks_per_transaction) * max_connections" [1] , but that's long gone now. - if anything, then Andres might want to improve his blog entry [2] a little (my take is that it seems to be the most accurate and authoritative technical information that we have online) -J. [1] - https://www.postgresql.org/docs/9.2/kernel-resources.html [2] - https://blog.anarazel.de/2020/10/07/measuring-the-memory-overhead-of-a-postgres-connection/
On 9/23/24 01:06, Tom Lane wrote: > Tomas Vondra <tomas@vondra.me> writes: >> On 9/22/24 17:45, Tom Lane wrote: >>> #define FAST_PATH_GROUP(index) \ >>> - (AssertMacro(((index) >= 0) && ((index) < FP_LOCK_SLOTS_PER_BACKEND)), \ >>> + (AssertMacro((uint32) (index) < FP_LOCK_SLOTS_PER_BACKEND), \ >>> ((index) / FP_LOCK_SLOTS_PER_GROUP)) > >> For the (x >= 0) asserts, doing it this way relies on negative values >> wrapping to large positive ones, correct? AFAIK it's guaranteed to be a >> very large value, so it can't accidentally be less than the slot count. > > Right, any negative value would wrap to something more than > INT32_MAX. > Thanks. Pushed a fix for these issues, hopefully coverity will be happy. BTW is the coverity report accessible somewhere? I know someone mentioned that in the past, but I don't recall the details. Maybe we should have a list of all these resources, useful for committers, somewhere on the wiki? regards -- Tomas Vondra
Tomas Vondra <tomas@vondra.me> writes: > Thanks. Pushed a fix for these issues, hopefully coverity will be happy. Thanks. > BTW is the coverity report accessible somewhere? I know someone > mentioned that in the past, but I don't recall the details. Maybe we > should have a list of all these resources, useful for committers, > somewhere on the wiki? Currently those reports only go to the security team. Perhaps we should rethink that? regards, tom lane