Re: Configurable FP_LOCK_SLOTS_PER_BACKEND - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: Configurable FP_LOCK_SLOTS_PER_BACKEND
Msg-id: 116ef01e-942a-22c1-a2af-35bf69c1b07b@enterprisedb.com
In response to: Re: Configurable FP_LOCK_SLOTS_PER_BACKEND (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List: pgsql-hackers
On 8/6/23 16:44, Tomas Vondra wrote:
> ...
>
> I'm not sure I'll have time to hack on this soon, but if someone else
> wants to take a stab at it and produce a minimal patch, I might be able
> to run more tests on it.
>

Nah, I gave it another try, handling the bitmap in a different way, and
that happened to work sufficiently well. So here are some results and
also the bench script.

Note: I haven't reproduced the terrible regressions described in this
thread. Either I don't have a large enough system, the workload may not
be exactly right, or maybe it's due to the commit missing in older
branches (mentioned by Andres). Still, the findings seem interesting.

The workload is very simple - create a table "t" with a certain number
of partitions and indexes, add a certain number of rows (100, 10k or
1M), run "select count(*) from t", and measure throughput. There's also
a script collecting wait-event/lock info, but I haven't looked at that.

I did this for current master (17dev) and two patch versions:

master - current master, with 16 fast-path slots

v1 - increases the number of slots to 64, but switches to a single
array combining the bitmap and OIDs. I'm sure the original approach
(separate bitmap) can be made to work, and it's possible this is
responsible for a small regression in some runs (more about that in a
minute).

v2 - an attempt to address the small regressions in v1, which may be
due to having to search larger arrays. The code always walks the whole
array, even if we know there have never been that many entries, so this
version tracks the last used slot and stops the loops earlier. (A rough
sketch of both ideas is included after the results below.)

The attached PDF visualizes the results and the differences between
master and the two patches. It plots throughput against the number of
rows, the number of tables/indexes, and the number of concurrent
clients. The last two columns show throughput vs. master, with a simple
color scale: green - speedup (good), red - regression (bad).

Let's talk about the smallest data set (100 rows). The 10k test behaves
the same way, with smaller differences (as locking accounts for a
smaller part of the total duration), and on the 1M data set the patches
make almost no difference.

There's a pretty clear flip once we reach 16 partitions - on master the
throughput drops from 310k tps to 210k tps (for 32 clients, the machine
has 32 cores). With both patches, the drop is only to about 240k tps,
so a ~20% improvement compared to master.

The other interesting thing is the behavior with many clients:

               1      16      32      64      96     128
 master    17603  169132  204870  199390  199662  196774
 v1        17062  180475  234644  266372  267105  265885
 v2        18972  187294  242838  275064  275931  274332

So master "stagnates" or maybe even drops off, while with both patches
the throughput continues to grow beyond 32 clients. This is even more
obvious for 32 or 64 partitions - for 32, the results are:

               1      16      32      64      96     128
 master    11292   93783  111223  102580   95197   87800
 v1        12025  123503  168382  179952  180191  179846
 v2        12501  126438  174255  185435  185332  184938

That's a pretty massive improvement, IMO, and it would help OLTP
scalability. For 10k rows the pattern is the same, although the
differences are less significant. For 1M rows there's no speedup.

The bad news is that this seems to have a negative impact on cases with
few partitions, which would fit into 16 slots. That's not surprising,
as the code has to walk longer arrays, which probably affects caching
etc. So this would hurt systems that don't use that many relations -
not much, but still.
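To make the v1/v2 descriptions above a bit more concrete, here is a
minimal, standalone C sketch of the general idea. It is not the actual
patch and does not use the real PGPROC field names - the type and
function names (fp_slot, fp_locks, fp_insert, fp_find) are made up for
illustration. v1 roughly corresponds to keeping the per-slot lock-mode
bits next to the relation OID in one array (instead of a separate
bitmap), and v2 to remembering the highest slot ever used so lookups
don't have to scan all 64 entries:

```c
#include <stdint.h>
#include <stdbool.h>

#define FP_SLOTS 64                 /* v1: bumped from 16 to 64 */

/*
 * v1 idea: one entry holds both the lock-mode bits and the relation OID,
 * instead of a separate bitmap plus a separate OID array.
 */
typedef struct
{
    uint32_t relid;                 /* relation OID, 0 = slot unused */
    uint8_t  lockbits;              /* bitmask of fast-path lock modes */
} fp_slot;

typedef struct
{
    fp_slot slots[FP_SLOTS];
    int     last_used;              /* v2: number of slots ever used (high-water mark) */
} fp_locks;

/* Take a fast-path slot for a relation; false means fall back to the shared lock table. */
static bool
fp_insert(fp_locks *fp, uint32_t relid, uint8_t lockbit)
{
    for (int i = 0; i < FP_SLOTS; i++)
    {
        if (fp->slots[i].relid == 0 || fp->slots[i].relid == relid)
        {
            fp->slots[i].relid = relid;
            fp->slots[i].lockbits |= lockbit;
            if (i + 1 > fp->last_used)
                fp->last_used = i + 1;      /* v2: remember how far lookups need to scan */
            return true;
        }
    }
    return false;
}

/* Look up a relation's slot; v2 stops at last_used instead of walking all FP_SLOTS. */
static int
fp_find(const fp_locks *fp, uint32_t relid)
{
    for (int i = 0; i < fp->last_used; i++)
    {
        if (fp->slots[i].relid == relid)
            return i;
    }
    return -1;
}
```

The sketch glosses over everything that matters in the real code - the
slots live in PGPROC, are protected by the backend's fast-path lock,
and releasing a lock would either have to shrink last_used or simply
accept it as a high-water mark.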
The regression appears to be consistently ~3%, and v2 aimed to improve
on that - at least for the case with just 100 rows, where it even gains
~5% in a couple of cases. It's a bit strange, however, that v2 doesn't
really help the two larger cases.

Overall, I think this seems interesting - it's hard not to like
doubling the throughput in some cases. Yes, that's for 100 rows only,
and the real improvements are bound to be smaller, but it would help
short OLTP queries that only touch a couple of rows.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company