Re: scalability bottlenecks with (many) partitions (and more) - Mailing list pgsql-hackers
From:            Tomas Vondra
Subject:         Re: scalability bottlenecks with (many) partitions (and more)
Date:
Msg-id:          9266c8df-0df7-4605-ba5a-b204c5e52586@enterprisedb.com
In response to:  Re: scalability bottlenecks with (many) partitions (and more)  (Robert Haas <robertmhaas@gmail.com>)
Responses:       Re: scalability bottlenecks with (many) partitions (and more)
                 Re: scalability bottlenecks with (many) partitions (and more)
List:            pgsql-hackers
On 6/24/24 17:05, Robert Haas wrote:
> On Sun, Jan 28, 2024 at 4:57 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> For NUM_LOCK_PARTITIONS this is pretty simple (see 0001 patch). The
>> LWLock table has 16 partitions by default - it's quite possible that on
>> machine with many cores and/or many partitions, we can easily hit this.
>> So I bumped this 4x to 64 partitions.
>
> I think this probably makes sense. I'm a little worried that we're
> just kicking the can down the road here where maybe we should be
> solving the problem in some more fundamental way, and I'm also a
> little worried that we might be reducing single-core performance. But
> it's probably fine.
>

Yeah, I haven't seen this causing any regressions - the sensitive paths
typically lock only one partition, so having more of them does not affect
that. Or if it does, it's likely a reasonable trade-off, as it reduces the
risk of lock contention.

That being said, I don't recall benchmarking this patch in isolation, only
with the other patches. Maybe I should do that ...

>> What I ended up doing is having a hash table of 16-element arrays. There
>> are 64 "pieces", each essentially the (16 x OID + UINT64 bitmap) that we
>> have now. Each OID is mapped to exactly one of these parts as if in a
>> hash table, and in each of those 16-element parts we do exactly the same
>> thing we do now (linear search, removal, etc.). This works great, the
>> locality is great, etc. The one disadvantage is this makes PGPROC
>> larger, but I did a lot of benchmarks and I haven't seen any regression
>> that I could attribute to this. (More about this later.)
>
> I think this is a reasonable approach. Some comments:
>
> - FastPathLocalUseInitialized seems unnecessary to me; the contents of
> an uninitialized local variable are undefined, but an uninitialized
> global variable always starts out zeroed.
>

OK. I didn't realize global variables start out zeroed.

> - You need comments in various places, including here, where someone
> is certain to have questions about the algorithm and choice of
> constants:
>
> +#define FAST_PATH_LOCK_REL_GROUP(rel) (((uint64) (rel) * 7883 + 4481)
> % FP_LOCK_GROUPS_PER_BACKEND)
>

Yeah, this definitely needs a comment explaining it. I admit those numbers
are pretty arbitrary primes, used to implement a trivial hash function.
That was good enough for a PoC patch, but maybe a "proper" version should
use a better hash function. It needs to be fast, and maybe it doesn't
matter that much if it's not perfect.

> When I originally coded up the fast-path locking stuff, I supposed
> that we couldn't make the number of slots too big because the
> algorithm requires a linear search of the whole array. But with this
> one trick (a partially-associative cache), there's no real reason that
> I can think of why you can't make the number of slots almost
> arbitrarily large. At some point you're going to use too much memory,
> and probably before that point you're going to make the cache big
> enough that it doesn't fit in the CPU cache of an individual core, at
> which point possibly it will stop working as well. But honestly ... I
> don't quite see why this approach couldn't be scaled quite far.
>

I don't think I've heard the term "partially-associative cache" before,
but now that I look at the approach again, it very much reminds me of how
set-associative caches work (e.g. with cachelines in CPU caches). It's a
16-way associative cache, assigning each entry into one of 16 slots.

I must have been reading some papers in this area shortly before writing
the PoC patch, and the idea probably came from there. Which is good,
because it means it's a well-understood and widely-used approach.
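To make that concrete, here's a minimal standalone sketch of the idea. The
struct/function names and the single-bit "used" bitmap are just for
illustration (the real bitmap also packs lock modes, and the layout in
PGPROC differs); the constants simply mirror the FAST_PATH_LOCK_REL_GROUP
macro quoted above:

#include <stdint.h>
#include <stdio.h>

#define FP_LOCK_GROUPS_PER_BACKEND  64
#define FP_LOCK_SLOTS_PER_GROUP     16

/* trivial hash, same shape as the macro quoted above */
#define FP_LOCK_REL_GROUP(rel) \
    (((uint64_t) (rel) * 7883 + 4481) % FP_LOCK_GROUPS_PER_BACKEND)

typedef uint32_t Oid;

/* one "piece": 16 OIDs plus a bitmap saying which slots are occupied */
typedef struct FastPathGroup
{
    Oid         relid[FP_LOCK_SLOTS_PER_GROUP];
    uint64_t    used;           /* bit i set => slot i occupied */
} FastPathGroup;

static FastPathGroup groups[FP_LOCK_GROUPS_PER_BACKEND];

/* insert a relation OID; returns slot index, or -1 if the group is full */
static int
fp_insert(Oid relid)
{
    FastPathGroup *g = &groups[FP_LOCK_REL_GROUP(relid)];
    int         i;

    for (i = 0; i < FP_LOCK_SLOTS_PER_GROUP; i++)
    {
        if ((g->used & (UINT64_C(1) << i)) == 0)
        {
            g->relid[i] = relid;
            g->used |= UINT64_C(1) << i;
            return i;
        }
    }
    return -1;                  /* group full - fall back to shared lock table */
}

/* look up a relation OID; only its one 16-slot group is searched */
static int
fp_lookup(Oid relid)
{
    FastPathGroup *g = &groups[FP_LOCK_REL_GROUP(relid)];
    int         i;

    for (i = 0; i < FP_LOCK_SLOTS_PER_GROUP; i++)
    {
        if ((g->used & (UINT64_C(1) << i)) != 0 && g->relid[i] == relid)
            return i;
    }
    return -1;
}

int
main(void)
{
    fp_insert(16384);
    fp_insert(16385);

    printf("oid 16384 -> group %d, slot %d\n",
           (int) FP_LOCK_REL_GROUP(16384), fp_lookup(16384));
    printf("oid 16385 -> group %d, slot %d\n",
           (int) FP_LOCK_REL_GROUP(16385), fp_lookup(16385));

    return 0;
}

The point being that a lookup only ever touches the one 16-slot group the
OID hashes to, so the number of groups can grow without making the linear
search itself any longer.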
> Like, if we raised FP_LOCK_GROUPS_PER_BACKEND from your proposed value
> of 64 to say 65536, would that still perform well? I'm not saying we
> should do that, because that's probably a silly amount of memory to
> use for this, but the point is that when you start to have enough
> partitions that you run out of lock slots, performance is going to
> degrade, so you can imagine wanting to try to have enough lock groups
> to make that unlikely. Which leads me to wonder if there's any
> particular number of lock groups that is clearly "too many" or whether
> it's just about how much memory we want to use.
>

That's an excellent question. I don't know. I agree 64 groups is pretty
arbitrary, and the 1024 slots that gives us may not be enough even with a
modest number of partitions.

When I was thinking about using a higher value, my main concern was that
it'd make the PGPROC entry larger. Each "fast-path" group is ~72 B, so 64
groups is ~4.5 kB, and that felt like quite a bit. But maybe it's fine and
we could make it much larger - L3 caches tend to be many MBs these days,
although AFAIK the L3 is shared by all the threads running on the CPU.

I'll see if I can do some more testing of this, and see if there's a value
where it stops improving / starts degrading, etc.
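To put the question in concrete numbers, here's the back-of-the-envelope
arithmetic (just a throwaway program of mine, assuming the ~72 B per group
from above, i.e. 16 x 4 B OIDs plus one 8 B bitmap):

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    /* ~72 B per group: 16 OIDs (4 B each) plus one 8 B uint64 bitmap */
    const size_t    group_size = 16 * sizeof(uint32_t) + sizeof(uint64_t);
    const int       ngroups[] = {64, 1024, 65536};
    int             i;

    for (i = 0; i < 3; i++)
        printf("%6d groups = %7d slots = %9zu bytes per backend\n",
               ngroups[i], ngroups[i] * 16,
               (size_t) ngroups[i] * group_size);

    return 0;
}

That's ~4.5 kB per backend for 64 groups, ~72 kB for 1024 groups, and
~4.5 MB for the 65536 you mention - the first two fit comfortably in
per-core caches, the last one clearly doesn't, which is where I'd expect
it to stop helping (on top of the per-backend memory cost).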
regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company