Re: scalability bottlenecks with (many) partitions (and more) - Mailing list pgsql-hackers

From Robert Haas
Subject Re: scalability bottlenecks with (many) partitions (and more)
Date
Msg-id CA+TgmoabtPvTffKQiqKg9qOmyt-sYzjQW+TyFWf1CBN7M4hggQ@mail.gmail.com
In response to scalability bottlenecks with (many) partitions (and more)  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: scalability bottlenecks with (many) partitions (and more)
List pgsql-hackers
On Sun, Jan 28, 2024 at 4:57 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> For NUM_LOCK_PARTITIONS this is pretty simple (see 0001 patch). The
> LWLock table has 16 partitions by default - it's quite possible that on
> a machine with many cores and/or many partitions, we can easily hit this.
> So I bumped this 4x to 64 partitions.

I think this probably makes sense. I'm a little worried that we're
just kicking the can down the road here where maybe we should be
solving the problem in some more fundamental way, and I'm also a
little worried that we might be reducing single-core performance. But
it's probably fine.
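
For reference, the constant in question is derived from
LOG2_NUM_LOCK_PARTITIONS in src/include/storage/lwlock.h, so the 4x bump
amounts to something like this (a sketch against the current definitions):

#define LOG2_NUM_LOCK_PARTITIONS  6     /* was 4, i.e. 16 partitions */
#define NUM_LOCK_PARTITIONS  (1 << LOG2_NUM_LOCK_PARTITIONS)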

> What I ended up doing is having a hash table of 16-element arrays. There
> are 64 "pieces", each essentially the (16 x OID + UINT64 bitmap) that we
> have now. Each OID is mapped to exactly one of these parts as if in a
> hash table, and in each of those 16-element parts we do exactly the same
> thing we do now (linear search, removal, etc.). This works great, the
> locality is great, etc. The one disadvantage is this makes PGPROC
> larger, but I did a lot of benchmarks and I haven't seen any regression
> that I could attribute to this. (More about this later.)

I think this is a reasonable approach. Some comments:

- FastPathLocalUseInitialized seems unnecessary to me; the contents of
an uninitialized local variable are undefined, but an uninitialized
global variable always starts out zeroed.
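
A standalone illustration of the C rule in question, nothing to do with the
actual patch variables:

#include <stdbool.h>
#include <stdio.h>

static bool global_flags[16];       /* static storage duration: all zeroes */

int
main(void)
{
    bool    stack_flags[16];        /* automatic storage: indeterminate */

    printf("%d\n", (int) global_flags[0]);    /* always prints 0 */
    /* reading stack_flags[0] here would be undefined behavior */
    (void) stack_flags;
    return 0;
}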

- You need comments in various places, including here, where someone
is certain to have questions about the algorithm and choice of
constants:

+#define FAST_PATH_LOCK_REL_GROUP(rel) (((uint64) (rel) * 7883 + 4481) % FP_LOCK_GROUPS_PER_BACKEND)

When I originally coded up the fast-path locking stuff, I supposed
that we couldn't make the number of slots too big because the
algorithm requires a linear search of the whole array. But with this
one trick (a partially-associative cache), there's no real reason that
I can think of why you can't make the number of slots almost
arbitrarily large. At some point you're going to use too much memory,
and probably before that point you're going to make the cache big
enough that it doesn't fit in the CPU cache of an individual core, at
which point possibly it will stop working as well. But honestly ... I
don't quite see why this approach couldn't be scaled quite far.
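
To make the shape of the thing concrete, here's roughly how I understand the
lookup to work. This is an illustrative sketch only; the field and constant
names are mine, not necessarily the patch's:

#include <stdint.h>

typedef uint32_t Oid;               /* stand-in for the real typedef */

#define FP_LOCK_SLOTS_PER_GROUP     16
#define FP_LOCK_GROUPS_PER_BACKEND  64

/* Map a relation OID to exactly one group (the quoted #define). */
#define FAST_PATH_LOCK_REL_GROUP(rel) \
    (((uint64_t) (rel) * 7883 + 4481) % FP_LOCK_GROUPS_PER_BACKEND)

typedef struct FastPathGroup
{
    Oid      relid[FP_LOCK_SLOTS_PER_GROUP];   /* which relations are held */
    uint64_t lockbits;                         /* lock modes, a few bits per slot */
} FastPathGroup;

/* Only the one group the OID hashes to is ever scanned, linearly as today. */
static int
fp_find_slot(const FastPathGroup *groups, Oid relid)
{
    const FastPathGroup *group = &groups[FAST_PATH_LOCK_REL_GROUP(relid)];

    for (int i = 0; i < FP_LOCK_SLOTS_PER_GROUP; i++)
        if (group->relid[i] == relid)
            return i;
    return -1;                  /* relation not held via the fast path */
}

The linear search never touches more than 16 slots no matter how many groups
there are, which is why I don't see an algorithmic limit here, only a memory
one.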

Like, if we raised FP_LOCK_GROUPS_PER_BACKEND from your proposed value
of 64 to say 65536, would that still perform well? I'm not saying we
should do that, because that's probably a silly amount of memory to
use for this, but the point is that when you start to have enough
partitions that you run out of lock slots, performance is going to
degrade, so you can imagine wanting to try to have enough lock groups
to make that unlikely. Which leads me to wonder if there's any
particular number of lock groups that is clearly "too many" or whether
it's just about how much memory we want to use.
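
Back-of-the-envelope, assuming a group is 16 OIDs plus one uint64 bitmap
(16 * 4 + 8 = 72 bytes, ignoring padding):

    64 groups * 72 bytes  ~=  4.5 kB per backend   (the proposed value)
 65536 groups * 72 bytes  ~=  4.5 MB per backend   (the case above)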

--
Robert Haas
EDB: http://www.enterprisedb.com


