scalability bottlenecks with (many) partitions (and more)

From
Tomas Vondra
Date:
Hi,

I happened to investigate a query involving a partitioned table, which
led me to a couple of bottlenecks severely affecting queries dealing
with multiple partitions (or relations in general). After a while I came
up with three WIP patches that improve the behavior by an order of
magnitude, and not just in some extreme cases.


Consider a partitioned pgbench with 100 partitions, say:

   pgbench -i -s 100 --partitions 100 testdb

but let's modify the pgbench_accounts a little bit:

   ALTER TABLE pgbench_accounts ADD COLUMN aid_parent INT;
   UPDATE pgbench_accounts SET aid_parent = aid;
   CREATE INDEX ON pgbench_accounts(aid_parent);
   VACUUM FULL pgbench_accounts;

which simply adds an "aid_parent" column that is not a partition key. And
now let's run a query:

    SELECT * FROM pgbench_accounts pa JOIN pgbench_branches pb
        ON (pa.bid = pb.bid) WHERE pa.aid_parent = :aid

so pretty much the regular "pgbench -S", except on a column that
does not allow partition elimination. Now, the plan looks like this:

                                QUERY PLAN
----------------------------------------------------------------------
 Hash Join  (cost=1.52..34.41 rows=10 width=465)
   Hash Cond: (pa.bid = pb.bid)
   ->  Append  (cost=0.29..33.15 rows=10 width=101)
         ->  Index Scan using pgbench_accounts_1_aid_parent_idx on
pgbench_accounts_1 pa_1  (cost=0.29..3.31 rows=1 width=101)
               Index Cond: (aid_parent = 3489734)
         ->  Index Scan using pgbench_accounts_2_aid_parent_idx on
pgbench_accounts_2 pa_2  (cost=0.29..3.31 rows=1 width=101)
               Index Cond: (aid_parent = 3489734)
         ->  Index Scan using pgbench_accounts_3_aid_parent_idx on
pgbench_accounts_3 pa_3  (cost=0.29..3.31 rows=1 width=101)
               Index Cond: (aid_parent = 3489734)
         ->  Index Scan using pgbench_accounts_4_aid_parent_idx on
pgbench_accounts_4 pa_4  (cost=0.29..3.31 rows=1 width=101)
               Index Cond: (aid_parent = 3489734)
         -> ...
   ->  Hash  (cost=1.10..1.10 rows=10 width=364)
         ->  Seq Scan on pgbench_branches pb  (cost=0.00..1.10 rows=10
width=364)


So yeah, scanning all 100 partitions. Not great, but no partitioning
scheme is perfect for all queries. Anyway, let's see how this works on a
big AMD EPYC machine with 96/192 cores - with "-M simple" we get:

parts      1       8      16      32     64       96     160      224
-----------------------------------------------------------------------
0      13877  105732  210890  410452  709509  844683  1050658  1163026
100      653    3957    7120   12022   12707   11813    10349     9633
1000      20     142     270     474     757     808      567      427

These are transactions per second, for different number of clients
(numbers in the header). With -M prepared the story doesn't change - the
numbers are higher, but the overall behavior is pretty much the same.

Firstly, with no partitions (first row), the throughput grows by ~13k tps
per client initially, then gradually levels off. But it keeps growing the
whole time.

But with 100 or 1000 partitions, the throughput peaks and then starts
dropping. Moreover, the throughput with 100 or 1000 partitions is just
a tiny fraction of the non-partitioned value. The difference is roughly
equal to the number of partitions - for example with 96 clients, the
difference between 0 and 1000 partitions is 844683/808 = 1045.

I could demonstrate the same behavior with fewer partitions - e.g. with
10 partitions you get ~10x difference, and so on.

Another thing I'd mention is that this is not just about partitioning.
Imagine a star schema with a fact table and dimensions - you'll get the
same behavior depending on the number of dimensions you need to join
with. With "-M simple" you may get this, for example:

dims        1      8      16      32      64      96     160      224
----------------------------------------------------------------------
1       11737  92925  183678  361497  636598  768956  958679  1042799
10        462   3558    7086   13889   25367   29503   25353    24030
100         4     31      61     122     231     292     292      288

So, similar story - significant slowdown as we're adding dimensions.


Now, what could be causing this? Clearly, there's a bottleneck of some
kind, and we're hitting it. Some of this may be simply due to execution
doing more stuff (more index scans, more initialization, ...), but maybe
not - one of the reasons I started looking into this was that the backends
were not using all the CPU even at small scales - the CPU was maybe 60%
utilized.

So I started poking at things. The first thing that I thought about was
locking, obviously. That's consistent with the limited CPU utilization
(waiting on a lock = not running), and it's somewhat expected when using
many partitions - we need to lock all of them, and if we have 100 or
1000 of them, that's potentially a lot of locks.

From past experiments I knew about two places where such a bottleneck
could be - NUM_LOCK_PARTITIONS and fast-path locking. So I decided to
give it a try, increase these values and see what happens.

For NUM_LOCK_PARTITIONS this is pretty simple (see 0001 patch). The
LWLock table has 16 partitions by default - it's quite possible that on
machine with many cores and/or many partitions, we can easily hit this.
So I bumped this 4x to 64 partitions.

For fast-path locking the changes are more complicated (see 0002). We
allow keeping 16 relation locks right in PGPROC, and only when these slots
fill up do we promote the locks to the actual lock table. But with enough
partitions we're guaranteed to fill those 16 slots, of course. And
increasing the number of slots is not simple - firstly, the information
is split between an array of 16 OIDs and a uint64 serving as a bitmap.
Increasing the size of the OID array is simple, but it's harder for the
auxiliary bitmap. And there are more problems - with more OIDs a simple
linear search won't do. But a simple hash table is not a good idea either,
because of poor locality and the need to delete stuff ...

What I ended up doing is having a hash table of 16-element arrays. There
are 64 "pieces", each essentially the (16 x OID + UINT64 bitmap) that we
have now. Each OID is mapped to exactly one of these parts as if in a
hash table, and in each of those 16-element parts we do exactly the same
thing we do now (linear search, removal, etc.). This works great, the
locality is great, etc. The one disadvantage is this makes PGPROC
larger, but I did a lot of benchmarks and I haven't seen any regression
that I could attribute to this. (More about this later.)
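
To make that a bit more concrete, here's a trivial sketch of the lookup
side of the scheme (names, types and layout are illustrative only, not
the actual 0002 patch):

    #include <stdint.h>

    typedef uint32_t Oid;                   /* stand-in for PostgreSQL's Oid */

    #define FP_LOCK_SLOTS_PER_GROUP    16   /* same as the current 16 slots */
    #define FP_LOCK_GROUPS_PER_BACKEND 64   /* 64 groups = 1024 slots total */

    typedef struct FastPathGroup
    {
        uint64_t lockBits;                        /* lock-mode bits, as today */
        Oid      relid[FP_LOCK_SLOTS_PER_GROUP];  /* relation OIDs, as today */
    } FastPathGroup;

    /* each backend keeps an array of these groups (in its PGPROC) */
    static FastPathGroup fpGroups[FP_LOCK_GROUPS_PER_BACKEND];

    /* map a relation OID to exactly one group, using a trivial hash */
    #define FAST_PATH_REL_GROUP(relid) \
        (((uint64_t) (relid) * 7883 + 4481) % FP_LOCK_GROUPS_PER_BACKEND)

    /* within the group, do exactly what we do now - a linear search */
    static int
    fast_path_find_slot(Oid relid)
    {
        FastPathGroup *group = &fpGroups[FAST_PATH_REL_GROUP(relid)];
        int            i;

        for (i = 0; i < FP_LOCK_SLOTS_PER_GROUP; i++)
        {
            if (group->relid[i] == relid)
                return i;               /* found in the fast-path cache */
        }

        return -1;                      /* not cached, fall back to lock table */
    }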

Unfortunately, for the pgbench join this does not make much difference.
But for the "star join" (with -M prepared) it does this:

             1      8     16    32       64       96      160       224
------------------------------------------------------------------------
master   21610 137450 247541 300902   270932   229692   191454   189233
patched  21664 151695 301451 594615  1036424  1211716  1480953  1656203
speedup    1.0    1.1    1.2    2.0      3.8      5.3      7.7      8.8

That's a pretty nice speedup, I think.

However, why doesn't the partitioned join improve (at least not very
much)? Well, the perf profile says stuff like this:


9.16%    0.77%  postgres      [kernel.kallsyms]      [k] asm_exc_page_fault
     |      
      --8.39%--asm_exc_page_fault
            |      
             --7.52%--exc_page_fault
                    |      
                    --7.13%--do_user_addr_fault
                           |      
                            --6.64%--handle_mm_fault
                                  |      
                                   --6.29%--__handle_mm_fault
                                          |      
                                          |--2.17%--__mem_cgroup_charge
                                          |       |      
                                          |       |--1.25%--charge_memcg
                                          |       |       |      
                                          |       |        --0.57%-- ...
                                          |       |      
                                          |        --0.67%-- ...
                                          |      
                                          |--2.04%--vma_alloc_folio

After investigating this for a bit, I came to the conclusion this may be
some sort of a scalability problem in glibc/malloc. I decided to try
whether the "memory pool" patch (which I've mentioned in the memory limit
thread as an alternative way to introduce backend-level accounting/limits)
could serve as a backend-level malloc cache, and how that would work. So I
cleaned up the PoC patch I already had (see 0003), and gave it a try.

And with both patches applied, the results for the partitioned join with
100 partitions look like this:

-M simple

                1      8      16      32      64      96     160    224
------------------------------------------------------------------------
master        653   3957    7120   12022   12707   11813   10349   9633
both patches  954   7356   14580   28259   51552   65278   70607  69598
speedup       1.5    1.9     2.0     2.4     4.1     5.5     6.8    7.2


-M prepared

                1      8      16      32      64      96     160    224
------------------------------------------------------------------------
master       1639   8273   14138   14746   13446   14001   11129  10136
both patches 4792  30102   62208  122157  220984  267763  315632 323567
speedup       2.9    3.6     4.4     8.3    16.4    19.1    28.4   31.9


That's pretty nice, I think. And I've seen many such improvements, it's
not a cherry-picked example. For the star join, the improvements are
very similar.

I'm attaching PDF files with a table visualizing results for these two
benchmarks - there are results for different numbers of partitions/scales,
and different builds (master, one or both of the patches). There's also
a comparison to master, with color scale "red = slower, green = faster"
(but there's no red anywhere, not even for low client counts).

It's also interesting that with just the 0003 patch applied, the change
is much smaller. It's as if the two bottlenecks (locking and malloc) are
in balance - if you only address one of them, you don't get much. But if you
address both, it flies.

FWIW where does the malloc overhead come from? For one, while we do have
some caching of malloc-ed memory in memory contexts, that doesn't quite
work cross-query, because we destroy the contexts at the end of the
query. We attempt to cache the memory contexts too, but in this case
that can't help because the allocations come from btbeginscan() where we
do this:

    so = (BTScanOpaque) palloc(sizeof(BTScanOpaqueData));

and BTScanOpaqueData is ~27kB, which means it's an oversized chunk and
thus always allocated using a separate malloc() call. Maybe we could
break it into smaller/cacheable parts, but I haven't tried, and I doubt
it's the only such allocation.

I don't want to get into too much detail about the memory pool, but I
think it's something we should consider doing - I'm sure there's stuff
to improve, but caching the malloc-ed memory can clearly be very beneficial. The
basic idea is to have a cache that is "adaptive" (i.e. adjusts to
caching blocks of sizes needed by the workload) but also cheap. The
patch is PoC/WIP and needs more work, but I think it works quite well.
If anyone wants to take a look or have a chat at FOSDEM, for example,
I'm available.
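
To give a rough idea of what I mean by "adaptive but cheap", here's a
trivial sketch of the concept - per-size-class freelists sitting between
the memory contexts and malloc, with a cap on how much memory may be
retained (again, just an illustration, not the actual 0003 patch):

    #include <stdlib.h>

    #define MEMPOOL_MIN_BLOCK   1024                 /* smallest cached block */
    #define MEMPOOL_NUM_CLASSES 16                   /* power-of-two size classes */
    #define MEMPOOL_MAX_RETAIN  (64 * 1024 * 1024)   /* cap on cached memory */

    typedef struct MempoolBlock
    {
        struct MempoolBlock *next;
    } MempoolBlock;

    static MempoolBlock *freelists[MEMPOOL_NUM_CLASSES];
    static size_t retained_bytes = 0;

    /* round the request up to a power-of-two class, return the class index */
    static int
    mempool_class(size_t size, size_t *rounded)
    {
        size_t  sz = MEMPOOL_MIN_BLOCK;
        int     idx = 0;

        while (sz < size && idx < MEMPOOL_NUM_CLASSES)
        {
            sz <<= 1;
            idx++;
        }
        *rounded = sz;
        return idx;     /* idx == MEMPOOL_NUM_CLASSES means "too large to cache" */
    }

    void *
    mempool_alloc(size_t size)
    {
        size_t  rounded;
        int     idx = mempool_class(size, &rounded);

        if (idx >= MEMPOOL_NUM_CLASSES)
            return malloc(size);            /* too large to cache at all */

        if (freelists[idx] != NULL)         /* cache hit - reuse the block */
        {
            MempoolBlock *block = freelists[idx];

            freelists[idx] = block->next;
            retained_bytes -= rounded;
            return block;
        }

        return malloc(rounded);             /* cache miss - fall back to malloc */
    }

    void
    mempool_free(void *ptr, size_t size)
    {
        size_t  rounded;
        int     idx = mempool_class(size, &rounded);

        /* cache the block for reuse, unless it's too big or we hold too much */
        if (idx < MEMPOOL_NUM_CLASSES &&
            retained_bytes + rounded <= MEMPOOL_MAX_RETAIN)
        {
            MempoolBlock *block = (MempoolBlock *) ptr;

            block->next = freelists[idx];
            freelists[idx] = block;
            retained_bytes += rounded;
        }
        else
            free(ptr);
    }

The actual patch is more elaborate (it adjusts to the block sizes the
workload actually needs, as mentioned above), but the hot path boils down
to a freelist push/pop like this. The caller has to know the block size
on free, but the memory contexts track that anyway.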

FWIW I was wondering if this is a glibc-specific malloc bottleneck, so I
tried running the benchmarks with LD_PRELOAD=jemalloc, and that improves
the behavior a lot - it gets us maybe ~80% of the mempool benefits.
Which is nice, it confirms it's glibc-specific (I wonder if there's a
way to tweak glibc to address this), and it also means systems using
jemalloc (e.g. FreeBSD, right?) don't have this problem. But it also
says the mempool has ~20% benefit on top of jemalloc.

FWIW there's another bottleneck people may not realize, and that's the
number of file descriptors. Once you get to >1000 relations, you can
easily get into a situation like this:


54.18%    0.48%  postgres       [kernel.kallsyms]       [k]
entry_SYSCALL_64_after_hwframe
      |       
       --53.70%--entry_SYSCALL_64_after_hwframe
               |       
               --53.03%--do_syscall_64
                       |       
                       |--28.29%--__x64_sys_openat
                       |        |       
                       |         --28.14%--do_sys_openat2
                       |                |       
                       |                |--23.14%--do_filp_open
                       |                |        |       
                       |                |         --22.72%--path_openat


That's pretty bad, it means we're closing/opening file descriptors like
crazy, because every query needs the files. If I increase the number of
file descriptors (both in ulimit and max_files_per_process) to prevent
this thrashing, I can increase the throughput ~5x. Of course, this is not
a bottleneck that we can "fix" in code, it's simply a consequence of not
having enough file descriptors etc. But I wonder if we might make it
easier to monitor this, e.g. by tracking the fd cache hit ratio, or
something like that ...
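
Even something trivial might do - a pair of counters for "the file already
had an open descriptor" vs. "we had to (re)open it", and a way to expose
the ratio. Just to illustrate the idea (all the names here are made up):

    #include <stdint.h>

    /* hypothetical counters, just to illustrate the "fd cache hit ratio" idea */
    static uint64_t vfd_cache_hits;     /* file already had an open kernel fd */
    static uint64_t vfd_cache_misses;   /* had to open() it (evicting an LRU entry) */

    /* ratio we might expose in a system view, or print in a log line */
    static double
    vfd_cache_hit_ratio(void)
    {
        uint64_t total = vfd_cache_hits + vfd_cache_misses;

        return (total == 0) ? 0.0 : (double) vfd_cache_hits / (double) total;
    }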


There's a more complete set of benchmarking scripts and results for
these and other tests, in various formats (PDF, ODS, ...) at

    https://github.com/tvondra/scalability-patches

There are results from multiple machines - not just the big EPYC machine,
but also smaller Intel machines (4C and 16C), and even two rpi5 (yes, it
helps even on rpi5, quite a bit).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: scalability bottlenecks with (many) partitions (and more)

From
Ronan Dunklau
Date:
On Sunday 28 January 2024 at 22:57:02 CET, Tomas Vondra wrote:

Hi Tomas!

I'll comment on glibc-malloc part as I studied that part last year, and
proposed some things here: https://www.postgresql.org/message-id/
3424675.QJadu78ljV%40aivenlaptop


> FWIW where does the malloc overhead come from? For one, while we do have
> some caching of malloc-ed memory in memory contexts, that doesn't quite
> work cross-query, because we destroy the contexts at the end of the
> query. We attempt to cache the memory contexts too, but in this case
> that can't help because the allocations come from btbeginscan() where we
> do this:
>
>     so = (BTScanOpaque) palloc(sizeof(BTScanOpaqueData));
>
> and BTScanOpaqueData is ~27kB, which means it's an oversized chunk and
> thus always allocated using a separate malloc() call. Maybe we could
> break it into smaller/cacheable parts, but I haven't tried, and I doubt
> it's the only such allocation.

Did you try running an strace on the process? That may give you some
insights into what malloc is doing. A more sophisticated approach would be
using stap and plugging it into the malloc probes, for example
memory_sbrk_more and memory_sbrk_less.

An important part of glibc's malloc behaviour in that regard comes from the
adjustment of the mmap and free threshold. By default, mmap adjusts them
dynamically and you can poke into that using the
memory_mallopt_free_dyn_thresholds probe.

>
> FWIW I was wondering if this is a glibc-specific malloc bottleneck, so I
> tried running the benchmarks with LD_PRELOAD=jemalloc, and that improves
> the behavior a lot - it gets us maybe ~80% of the mempool benefits.
> Which is nice, it confirms it's glibc-specific (I wonder if there's a
> way to tweak glibc to address this), and it also means systems using
> jemalloc (e.g. FreeBSD, right?) don't have this problem. But it also
> says the mempool has ~20% benefit on top of jemalloc.

GLIBC's malloc offers some tuning for this. In particular, setting either
M_MMAP_THRESHOLD or M_TRIM_THRESHOLD will disable the unpredictable "auto
adjustment" behaviour and allow you to control what it's doing.

By setting a bigger M_TRIM_THRESHOLD, one can make sure memory allocated using
sbrk isn't freed as easily, and you don't run into a pattern of moving the
sbrk pointer up and down repeatedly. The automatic trade-off between the mmap
and trim thresholds is supposed to prevent that, but the way it is incremented
means you can end up in a bad place depending on your particular allocation
pattern.
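
For reference, a minimal sketch of pinning those thresholds programmatically
via glibc's mallopt() from <malloc.h> (the values are placeholders, not
recommendations; the MALLOC_MMAP_THRESHOLD_ / MALLOC_TRIM_THRESHOLD_
environment variables work the same way):

    #include <malloc.h>     /* glibc-specific: mallopt(), M_MMAP_THRESHOLD, ... */

    /*
     * Pin the thresholds instead of letting glibc auto-adjust them.
     * Setting either one disables the dynamic adjustment entirely.
     * The values below are placeholders, not tuning advice.
     */
    static void
    tune_glibc_malloc(void)
    {
        mallopt(M_TRIM_THRESHOLD, 16 * 1024 * 1024);  /* keep up to 16MB of free heap */
        mallopt(M_MMAP_THRESHOLD, 1024 * 1024);       /* mmap() only blocks >= 1MB */
    }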

Best regards,

--
Ronan Dunklau






Re: scalability bottlenecks with (many) partitions (and more)

From
Tomas Vondra
Date:
On 1/29/24 09:53, Ronan Dunklau wrote:
> On Sunday 28 January 2024 at 22:57:02 CET, Tomas Vondra wrote:
> 
> Hi Tomas !
> 
> I'll comment on glibc-malloc part as I studied that part last year, and 
> proposed some things here: https://www.postgresql.org/message-id/
> 3424675.QJadu78ljV%40aivenlaptop
> 

Thanks for reminding me. I'll re-read that thread.

> 
>> FWIW where does the malloc overhead come from? For one, while we do have
>> some caching of malloc-ed memory in memory contexts, that doesn't quite
>> work cross-query, because we destroy the contexts at the end of the
>> query. We attempt to cache the memory contexts too, but in this case
>> that can't help because the allocations come from btbeginscan() where we
>> do this:
>>
>>     so = (BTScanOpaque) palloc(sizeof(BTScanOpaqueData));
>>
>> and BTScanOpaqueData is ~27kB, which means it's an oversized chunk and
>> thus always allocated using a separate malloc() call. Maybe we could
>> break it into smaller/cacheable parts, but I haven't tried, and I doubt
>> it's the only such allocation.
> 
> Did you try running an strace on the process ? That may give you some 
> hindsights into what malloc is doing. A more sophisticated approach would be 
> using stap and plugging it into the malloc probes, for example 
> memory_sbrk_more and memory_sbrk_less. 
> 

No, I haven't tried that. In my experience strace is pretty expensive,
and if the issue is in glibc itself (before it does the syscalls),
strace won't really tell us much. Not sure, ofc.

> An important part of glibc's malloc behaviour in that regard comes from the 
> adjustment of the mmap and free threshold. By default, mmap adjusts them 
> dynamically and you can poke into that using the 
> memory_mallopt_free_dyn_thresholds probe.
> 

Thanks, I'll take a look at that.

>>
>> FWIW I was wondering if this is a glibc-specific malloc bottleneck, so I
>> tried running the benchmarks with LD_PRELOAD=jemalloc, and that improves
>> the behavior a lot - it gets us maybe ~80% of the mempool benefits.
>> Which is nice, it confirms it's glibc-specific (I wonder if there's a
>> way to tweak glibc to address this), and it also means systems using
>> jemalloc (e.g. FreeBSD, right?) don't have this problem. But it also
>> says the mempool has ~20% benefit on top of jemalloc.
> 
> GLIBC's malloc offers some tuning for this. In particular, setting either 
> M_MMAP_THRESHOLD or M_TRIM_THRESHOLD will disable the unpredictable "auto 
> adjustment" beheviour and allow you to control what it's doing. 
> 
> By setting a bigger M_TRIM_THRESHOLD, one can make sure memory allocated using 
> sbrk isn't freed as easily, and you don't run into a pattern of moving the 
> sbrk pointer up and down repeatedly. The automatic trade off between the mmap 
> and trim thresholds is supposed to prevent that, but the way it is incremented 
> means you can end in a bad place depending on your particular allocation 
> patttern.
> 

So, what values would you recommend for these parameters?

My concern is that increasing those values would lead to (much) higher memory
usage, with little control over it. With the mempool we keep more
blocks, ofc, but we have control over freeing the memory.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: scalability bottlenecks with (many) partitions (and more)

From
Ronan Dunklau
Date:
On Monday 29 January 2024 at 13:17:07 CET, Tomas Vondra wrote:
> > Did you try running an strace on the process ? That may give you some
> > hindsights into what malloc is doing. A more sophisticated approach would
> > be using stap and plugging it into the malloc probes, for example
> > memory_sbrk_more and memory_sbrk_less.
>
> No, I haven't tried that. In my experience strace is pretty expensive,
> and if the issue is in glibc itself (before it does the syscalls),
> strace won't really tell us much. Not sure, ofc.

It would tell you how malloc actually performs your allocations, and how often
they end up translated into syscalls. The main issue with glibc would be that
it releases the memory too aggressively to the OS, IMO.

>
> > An important part of glibc's malloc behaviour in that regard comes from
> > the
> > adjustment of the mmap and free threshold. By default, mmap adjusts them
> > dynamically and you can poke into that using the
> > memory_mallopt_free_dyn_thresholds probe.
>
> Thanks, I'll take a look at that.
>
> >> FWIW I was wondering if this is a glibc-specific malloc bottleneck, so I
> >> tried running the benchmarks with LD_PRELOAD=jemalloc, and that improves
> >> the behavior a lot - it gets us maybe ~80% of the mempool benefits.
> >> Which is nice, it confirms it's glibc-specific (I wonder if there's a
> >> way to tweak glibc to address this), and it also means systems using
> >> jemalloc (e.g. FreeBSD, right?) don't have this problem. But it also
> >> says the mempool has ~20% benefit on top of jemalloc.
> >
> > GLIBC's malloc offers some tuning for this. In particular, setting either
> > M_MMAP_THRESHOLD or M_TRIM_THRESHOLD will disable the unpredictable "auto
> > adjustment" beheviour and allow you to control what it's doing.
> >
> > By setting a bigger M_TRIM_THRESHOLD, one can make sure memory allocated
> > using sbrk isn't freed as easily, and you don't run into a pattern of
> > moving the sbrk pointer up and down repeatedly. The automatic trade off
> > between the mmap and trim thresholds is supposed to prevent that, but the
> > way it is incremented means you can end in a bad place depending on your
> > particular allocation patttern.
>
> So, what values would you recommend for these parameters?
>
> My concern is increasing those value would lead to (much) higher memory
> usage, with little control over it. With the mempool we keep more
> blocks, ofc, but we have control over freeing the memory.

Right now depending on your workload (especially if you use connection
pooling) you can end up with something like 32 or 64MB of dynamically adjusted
trim-threshold which will never be released back.

The first heuristic I had in mind was to set it to work_mem, up to a
"reasonable" limit I guess. One can argue that it is expected for a backend to
use work_mem frequently, and as such it shouldn't be released back. By setting
work_mem to a lower value, we could ask glibc at the same time to trim the
excess kept memory. That could be useful when a long-lived connection is
pooled, and sees a spike in memory usage only once. Currently that could well
end up with 32MB "wasted" permanently, but tuning it ourselves could allow us
to release it back.

Since it was last year that I worked on this, I'm a bit fuzzy on the details, but I
hope this helps.






Re: scalability bottlenecks with (many) partitions (and more)

From
Tomas Vondra
Date:

On 1/29/24 15:15, Ronan Dunklau wrote:
> On Monday 29 January 2024 at 13:17:07 CET, Tomas Vondra wrote:
>>> Did you try running an strace on the process ? That may give you some
>>> hindsights into what malloc is doing. A more sophisticated approach would
>>> be using stap and plugging it into the malloc probes, for example
>>> memory_sbrk_more and memory_sbrk_less.
>>
>> No, I haven't tried that. In my experience strace is pretty expensive,
>> and if the issue is in glibc itself (before it does the syscalls),
>> strace won't really tell us much. Not sure, ofc.
> 
> It would tell you how malloc actually performs your allocations, and how often 
> they end up translated into syscalls. The main issue with glibc would be that 
> it releases the memory too agressively to the OS, IMO.
> 
>>
>>> An important part of glibc's malloc behaviour in that regard comes from
>>> the
>>> adjustment of the mmap and free threshold. By default, mmap adjusts them
>>> dynamically and you can poke into that using the
>>> memory_mallopt_free_dyn_thresholds probe.
>>
>> Thanks, I'll take a look at that.
>>
>>>> FWIW I was wondering if this is a glibc-specific malloc bottleneck, so I
>>>> tried running the benchmarks with LD_PRELOAD=jemalloc, and that improves
>>>> the behavior a lot - it gets us maybe ~80% of the mempool benefits.
>>>> Which is nice, it confirms it's glibc-specific (I wonder if there's a
>>>> way to tweak glibc to address this), and it also means systems using
>>>> jemalloc (e.g. FreeBSD, right?) don't have this problem. But it also
>>>> says the mempool has ~20% benefit on top of jemalloc.
>>>
>>> GLIBC's malloc offers some tuning for this. In particular, setting either
>>> M_MMAP_THRESHOLD or M_TRIM_THRESHOLD will disable the unpredictable "auto
>>> adjustment" beheviour and allow you to control what it's doing.
>>>
>>> By setting a bigger M_TRIM_THRESHOLD, one can make sure memory allocated
>>> using sbrk isn't freed as easily, and you don't run into a pattern of
>>> moving the sbrk pointer up and down repeatedly. The automatic trade off
>>> between the mmap and trim thresholds is supposed to prevent that, but the
>>> way it is incremented means you can end in a bad place depending on your
>>> particular allocation patttern.
>>
>> So, what values would you recommend for these parameters?
>>
>> My concern is increasing those value would lead to (much) higher memory
>> usage, with little control over it. With the mempool we keep more
>> blocks, ofc, but we have control over freeing the memory.
> 
> Right now depending on your workload (especially if you use connection 
> pooling) you can end up with something like 32 or 64MB of dynamically adjusted 
> trim-threshold which will never be released back. 
> 

OK, so let's say I expect each backend to use ~90MB of memory (allocated
at once through memory contexts). How would you set the two thresholds? By
default M_MMAP_THRESHOLD is 128kB, which means blocks larger than 128kB are
mmap-ed and released back to the OS immediately.

But there are very few such allocations - the vast majority of blocks in the
benchmark workloads are <= 8kB or ~27kB (those from btbeginscan).

So I'm thinking about leaving M_MMAP_THRESHOLD as is, but increasing the
M_TRIM_THRESHOLD value to a couple of MB. But I doubt that'll really help,
because what I expect to happen is we execute a query and it allocates
all memory up to a high watermark of ~90MB. And then the query
completes, and we release almost all of it. And even with trim threshold
set to e.g. 8MB we'll free almost all of it, no?

What we want to do is say - hey, we needed 90MB, and now we need 8MB. We
could free 82MB, but maybe let's wait a bit and see if we need that
memory again. And that's pretty much what the mempool does, but I don't
see how to do that using the mmap options.

> The first heurstic I had in mind was to set it to work_mem, up to a 
> "reasonable" limit I guess. One can argue that it is expected for a backend to 
> use work_mem frequently, and as such it shouldn't be released back. By setting 
> work_mem to a lower value, we could ask glibc at the same time to trim the 
> excess kept memory. That could be useful when a long-lived connection is 
> pooled, and sees a spike in memory usage only once. Currently that could well 
> end up with 32MB "wasted" permanently but tuning it ourselves could allow us 
> to releaase it back. 
> 

I'm not sure work_mem is a good parameter to drive this. It doesn't say
how much memory we expect the backend to use - it's a per-operation
limit, so it doesn't work particularly well with partitioning (e.g. with
100 partitions, we may get 100 nodes, which is completely unrelated to
what work_mem says). A backend running the join query with 1000
partitions uses ~90MB (judging by data reported by the mempool), even
with work_mem=4MB. So setting the trim limit to 4MB is pretty useless.

The mempool could tell us how much memory we need (but we could track
this in some other way too, probably). And we could even adjust the mmap
parameters regularly, based on current workload.

But then there's the problem that the mmap parameters don't tell
us how much memory to keep, but how large chunks to release.

Let's say we want to keep the 90MB (to allocate the memory once and then
reuse it). How would you do that? We could set M_TRIM_THRESHOLD to 100MB,
but then it takes just a little bit of extra memory to release all the
memory, or something.

> Since it was last year I worked on this, I'm a bit fuzzy on the details but I 
> hope this helps.
> 

Thanks for the feedback / insights!


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: scalability bottlenecks with (many) partitions (and more)

From
Ronan Dunklau
Date:
On Monday 29 January 2024 at 15:59:04 CET, Tomas Vondra wrote:
> I'm not sure work_mem is a good parameter to drive this. It doesn't say
> how much memory we expect the backend to use - it's a per-operation
> limit, so it doesn't work particularly well with partitioning (e.g. with
> 100 partitions, we may get 100 nodes, which is completely unrelated to
> what work_mem says). A backend running the join query with 1000
> partitions uses ~90MB (judging by data reported by the mempool), even
> with work_mem=4MB. So setting the trim limit to 4MB is pretty useless.

I understand your point, I was basing my previous observations on what a
backend typically does during execution.

>
> The mempool could tell us how much memory we need (but we could track
> this in some other way too, probably). And we could even adjust the mmap
> parameters regularly, based on current workload.
>
> But then there's the problem that the mmap parameters don't tell
> us how much memory to keep, but how large chunks to release.
>
> Let's say we want to keep the 90MB (to allocate the memory once and then
> reuse it). How would you do that? We could set MMAP_TRIM_TRESHOLD 100MB,
> but then it takes just a little bit of extra memory to release all the
> memory, or something.

For doing this you can set M_TOP_PAD using glibc's malloc, which makes sure a
certain amount of memory is always kept.

But the way the dynamic adjustment works makes it sort-of work like this.
MMAP_THRESHOLD and TRIM_THRESHOLD start with low values, meaning we don't
expect to keep much memory around.

So even "small" memory allocations will be served using mmap at first. Once
mmaped memory is released, glibc considers it a benchmark for "normal"
allocations that can be routinely freed, and adjusts mmap_threshold to the
released mmaped region size, and trim threshold to two times that.

It means over time the two values will converge either to the max value (32MB
for MMAP_THRESHOLD, 64MB for trim threshold) or to something big enough to
accommodate your released memory, since anything bigger than half the trim
threshold will be allocated using mmap.

Setting either parameter explicitly disables that adjustment.
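
Roughly, the adjustment that happens when a mmap-ed block is freed looks
like this (a paraphrased sketch of the behaviour described above, not the
actual glibc source):

    #include <stdbool.h>
    #include <stddef.h>

    /*
     * Paraphrased sketch of glibc's dynamic threshold adjustment on free(),
     * not the actual code (see glibc's malloc/malloc.c for the real thing).
     */
    #define MMAP_THRESHOLD_MAX (32 * 1024 * 1024)

    static size_t mmap_threshold = 128 * 1024;   /* the 128kB default */
    static size_t trim_threshold = 128 * 1024;
    static bool   thresholds_pinned = false;     /* set once mallopt() is used */

    static void
    adjust_thresholds_on_free(size_t block_size, bool block_was_mmaped)
    {
        if (block_was_mmaped && !thresholds_pinned &&
            block_size > mmap_threshold && block_size <= MMAP_THRESHOLD_MAX)
        {
            mmap_threshold = block_size;            /* grows up to 32MB */
            trim_threshold = 2 * mmap_threshold;    /* i.e. up to 64MB */
        }
    }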

But I'm not arguing against the mempool, just chiming in with glibc's malloc
tuning possibilities :-)





Re: scalability bottlenecks with (many) partitions (and more)

From
Tomas Vondra
Date:

On 1/29/24 16:42, Ronan Dunklau wrote:
> On Monday 29 January 2024 at 15:59:04 CET, Tomas Vondra wrote:
>> I'm not sure work_mem is a good parameter to drive this. It doesn't say
>> how much memory we expect the backend to use - it's a per-operation
>> limit, so it doesn't work particularly well with partitioning (e.g. with
>> 100 partitions, we may get 100 nodes, which is completely unrelated to
>> what work_mem says). A backend running the join query with 1000
>> partitions uses ~90MB (judging by data reported by the mempool), even
>> with work_mem=4MB. So setting the trim limit to 4MB is pretty useless.
> 
> I understand your point,  I was basing my previous observations on what a 
> backend typically does during the execution.
> 
>>
>> The mempool could tell us how much memory we need (but we could track
>> this in some other way too, probably). And we could even adjust the mmap
>> parameters regularly, based on current workload.
>>
>> But then there's the problem that the mmap parameters don't tell
>> us how much memory to keep, but how large chunks to release.
>>
>> Let's say we want to keep the 90MB (to allocate the memory once and then
>> reuse it). How would you do that? We could set MMAP_TRIM_TRESHOLD 100MB,
>> but then it takes just a little bit of extra memory to release all the
>> memory, or something.
> 
> For doing this you can set M_TOP_PAD using glibc malloc. Which makes sure a 
> certain amount of memory is always kept. 
> 
> But the way the dynamic adjustment works makes it sort-of work like this. 
> MMAP_THRESHOLD and TRIM_THRESHOLD start with low values, meaning we don't 
> expect to keep much memory around. 
> 
> So even "small" memory allocations will be served using mmap at first. Once 
> mmaped memory is released, glibc's consider it a benchmark for "normal" 
> allocations that can be routinely freed, and adjusts mmap_threshold to the 
> released mmaped region size, and trim threshold to two times that. 
> 
> It means over time the two values will converge either to the max value (32MB 
> for MMAP_THRESHOLD, 64 for trim threshold) or to something big enough to 
> accomodate your released memory, since anything bigger than half trim 
> threshold will be allocated using mmap. 
> 
> Setting any parameter disable that.
> 

Thanks. I gave this a try, and I started the tests with these settings:

export MALLOC_TOP_PAD_=$((64*1024*1024))
export MALLOC_MMAP_THRESHOLD_=$((1024*1024))
export MALLOC_TRIM_THRESHOLD_=$((1024*1024))

which I believe means that:

1) we'll keep 64MB "extra" memory on top of heap, serving as a cache for
future allocations

2) everything below 1MB (so most of the blocks we allocate for contexts)
will be allocated on heap (hence from the cache)

3) we won't trim heap unless there's at least 1MB of free contiguous
space (I wonder if this should be the same as MALLOC_TOP_PAD)

Those are mostly arbitrary values / guesses, and I don't have complete
results yet. But from the results I have it seems this has almost the
same effect as the mempool thing - see the attached PDF, with results
for the "partitioned join" benchmark.

first column - "master" (17dev) with no patches, default glibc

second column - 17dev + locking + mempool, default glibc

third column - 17dev + locking, tuned glibc

The color scale on the right is throughput comparison (third/second), as
a percentage with e.g. 90% meaning tuned glibc is 10% slower than the
mempool results. Most of the time it's slower but very close to 100%,
sometimes it's a bit faster. So overall it's roughly the same.

The color scales below the results are a comparison of each branch to
master (without patches), i.e. a comparison to current performance.
It's almost the same, although the tuned glibc has a couple regressions
that the mempool does not have.

> But I'm not arguing against the mempool, just chiming in with glibc's malloc 
> tuning possibilities :-)
> 

Yeah. I think the main problem with the glibc parameters is that they're
very implementation-specific and also static - the mempool is more
adaptive, I think. But it's an interesting experiment.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: scalability bottlenecks with (many) partitions (and more)

From
Robert Haas
Date:
On Sun, Jan 28, 2024 at 4:57 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> For NUM_LOCK_PARTITIONS this is pretty simple (see 0001 patch). The
> LWLock table has 16 partitions by default - it's quite possible that on
> machine with many cores and/or many partitions, we can easily hit this.
> So I bumped this 4x to 64 partitions.

I think this probably makes sense. I'm a little worried that we're
just kicking the can down the road here where maybe we should be
solving the problem in some more fundamental way, and I'm also a
little worried that we might be reducing single-core performance. But
it's probably fine.

> What I ended up doing is having a hash table of 16-element arrays. There
> are 64 "pieces", each essentially the (16 x OID + UINT64 bitmap) that we
> have now. Each OID is mapped to exactly one of these parts as if in a
> hash table, and in each of those 16-element parts we do exactly the same
> thing we do now (linear search, removal, etc.). This works great, the
> locality is great, etc. The one disadvantage is this makes PGPROC
> larger, but I did a lot of benchmarks and I haven't seen any regression
> that I could attribute to this. (More about this later.)

I think this is a reasonable approach. Some comments:

- FastPathLocalUseInitialized seems unnecessary to me; the contents of
an uninitialized local variable are undefined, but an uninitialized
global variable always starts out zeroed.

- You need comments in various places, including here, where someone
is certain to have questions about the algorithm and choice of
constants:

+#define FAST_PATH_LOCK_REL_GROUP(rel) (((uint64) (rel) * 7883 + 4481)
% FP_LOCK_GROUPS_PER_BACKEND)

When I originally coded up the fast-path locking stuff, I supposed
that we couldn't make the number of slots too big because the
algorithm requires a linear search of the whole array. But with this
one trick (a partially-associative cache), there's no real reason that
I can think of why you can't make the number of slots almost
arbitrarily large. At some point you're going to use too much memory,
and probably before that point you're going to make the cache big
enough that it doesn't fit in the CPU cache of an individual core, at
which point possibly it will stop working as well. But honestly ... I
don't quite see why this approach couldn't be scaled quite far.

Like, if we raised FP_LOCK_GROUPS_PER_BACKEND from your proposed value
of 64 to say 65536, would that still perform well? I'm not saying we
should do that, because that's probably a silly amount of memory to
use for this, but the point is that when you start to have enough
partitions that you run out of lock slots, performance is going to
degrade, so you can imagine wanting to try to have enough lock groups
to make that unlikely. Which leads me to wonder if there's any
particular number of lock groups that is clearly "too many" or whether
it's just about how much memory we want to use.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: scalability bottlenecks with (many) partitions (and more)

From
Tomas Vondra
Date:

On 6/24/24 17:05, Robert Haas wrote:
> On Sun, Jan 28, 2024 at 4:57 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> For NUM_LOCK_PARTITIONS this is pretty simple (see 0001 patch). The
>> LWLock table has 16 partitions by default - it's quite possible that on
>> machine with many cores and/or many partitions, we can easily hit this.
>> So I bumped this 4x to 64 partitions.
> 
> I think this probably makes sense. I'm a little worried that we're
> just kicking the can down the road here where maybe we should be
> solving the problem in some more fundamental way, and I'm also a
> little worried that we might be reducing single-core performance. But
> it's probably fine.
> 

Yeah, I haven't seen this causing any regressions - the sensitive paths
typically lock only one partition, so having more of them does not
affect that. Or if it does, it's likely a reasonable trade off as it
reduces the risk of lock contention.

That being said, I don't recall benchmarking this patch in isolation,
only with the other patches. Maybe I should do that ...

>> What I ended up doing is having a hash table of 16-element arrays. There
>> are 64 "pieces", each essentially the (16 x OID + UINT64 bitmap) that we
>> have now. Each OID is mapped to exactly one of these parts as if in a
>> hash table, and in each of those 16-element parts we do exactly the same
>> thing we do now (linear search, removal, etc.). This works great, the
>> locality is great, etc. The one disadvantage is this makes PGPROC
>> larger, but I did a lot of benchmarks and I haven't seen any regression
>> that I could attribute to this. (More about this later.)
> 
> I think this is a reasonable approach. Some comments:
> 
> - FastPathLocalUseInitialized seems unnecessary to me; the contents of
> an uninitialized local variable are undefined, but an uninitialized
> global variable always starts out zeroed.
> 

OK. I didn't realize global variables start out zeroed.

> - You need comments in various places, including here, where someone
> is certain to have questions about the algorithm and choice of
> constants:
> 
> +#define FAST_PATH_LOCK_REL_GROUP(rel) (((uint64) (rel) * 7883 + 4481)
> % FP_LOCK_GROUPS_PER_BACKEND)
> 

Yeah, definitely needs comment explaining this.

I admit those numbers are pretty arbitrary primes, to implement a
trivial hash function. That was good enough for a PoC patch, but maybe
for a "proper" version this should use a better hash function. It needs
to be fast, and maybe it doesn't matter that much if it's not perfect.
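
For example, a single multiply-and-xorshift round (in the style of the
murmur3 64-bit finalizer) would still be only a handful of instructions -
just a sketch of the kind of thing I mean, not a concrete proposal:

    #include <stdint.h>

    /* sketch: a cheap mixing function instead of the (rel * 7883 + 4481) formula */
    static inline uint32_t
    fast_path_rel_group(uint32_t relid, uint32_t ngroups)
    {
        uint64_t h = relid;

        /* one multiply-xorshift round (constant from the murmur3 64-bit finalizer) */
        h *= UINT64_C(0xff51afd7ed558ccd);
        h ^= h >> 33;

        return (uint32_t) (h % ngroups);   /* or a mask, if ngroups is a power of two */
    }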

> When I originally coded up the fast-path locking stuff, I supposed
> that we couldn't make the number of slots too big because the
> algorithm requires a linear search of the whole array. But with this
> one trick (a partially-associative cache), there's no real reason that
> I can think of why you can't make the number of slots almost
> arbitrarily large. At some point you're going to use too much memory,
> and probably before that point you're going to make the cache big
> enough that it doesn't fit in the CPU cache of an individual core, at
> which point possibly it will stop working as well. But honestly ... I
> don't quite see why this approach couldn't be scaled quite far.
> 

I don't think I've heard the term "partially-associative cache" before,
but now that I look at the approach again, it very much reminds me of how
set-associative caches work (e.g. with cachelines in CPU caches). It's a
16-way set-associative cache, assigning each entry to one of 16 slots.

I must have been reading some papers in this area shortly before the PoC
patch, and the idea came from there, probably. Which is good, because it
means it's a well-understood and widely-used approach.

> Like, if we raised FP_LOCK_GROUPS_PER_BACKEND from your proposed value
> of 64 to say 65536, would that still perform well? I'm not saying we
> should do that, because that's probably a silly amount of memory to
> use for this, but the point is that when you start to have enough
> partitions that you run out of lock slots, performance is going to
> degrade, so you can imagine wanting to try to have enough lock groups
> to make that unlikely. Which leads me to wonder if there's any
> particular number of lock groups that is clearly "too many" or whether
> it's just about how much memory we want to use.
> 

That's an excellent question. I don't know.

I agree 64 groups is pretty arbitrary, and having 1024 may not be enough
even with a modest number of partitions. When I was thinking about using
a higher value, my main concern was that it'd make the PGPROC entry
larger. Each "fast-path" group is ~72B, so 64 groups is ~4.5kB, and that
felt like quite a bit.

But maybe it's fine and we could make it much larger - L3 caches tend to
be many MBs these days, although AFAIK it's shared by threads running on
the CPU.

I'll see if I can do some more testing of this, and see if there's a
value where it stops improving / starts degrading, etc.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: scalability bottlenecks with (many) partitions (and more)

From
Robert Haas
Date:
On Tue, Jun 25, 2024 at 6:04 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> Yeah, definitely needs comment explaining this.
>
> I admit those numbers are pretty arbitrary primes, to implement a
> trivial hash function. That was good enough for a PoC patch, but maybe
> for a "proper" version this should use a better hash function. It needs
> to be fast, and maybe it doesn't matter that much if it's not perfect.

Right. My guess is that if we try too hard to make the hash function
good, there will be a performance hit. Unlike, say, strings that come
from the user, we have no reason to believe that relfilenumbers will
have any particular structure or pattern to them, so a low-quality,
fast function seems like a good trade-off to me. But I'm *far* from a
hashing expert, so I'm prepared for someone who is to tell me that I'm
full of garbage.

> I don't think I've heard the term "partially-associative cache" before
> That's an excellent question. I don't know.
>
> I agree 64 groups is pretty arbitrary, and having 1024 may not be enough
> even with a modest number of partitions. When I was thinking about using
> a higher value, my main concern was that it'd made the PGPROC entry
> larger. Each "fast-path" group is ~72B, so 64 groups is ~4.5kB, and that
> felt like quite a bit.
>
> But maybe it's fine and we could make it much larger - L3 caches tend to
> be many MBs these days, although AFAIK it's shared by threads running on
> the CPU.
>
> I'll see if I can do some more testing of this, and see if there's a
> value where it stops improving / starts degrading, etc.

Sounds good.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: scalability bottlenecks with (many) partitions (and more)

From
Tomas Vondra
Date:
Hi,

On 6/25/24 12:04, Tomas Vondra wrote:
> 
> 
> On 6/24/24 17:05, Robert Haas wrote:
>> On Sun, Jan 28, 2024 at 4:57 PM Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>>> For NUM_LOCK_PARTITIONS this is pretty simple (see 0001 patch). The
>>> LWLock table has 16 partitions by default - it's quite possible that on
>>> machine with many cores and/or many partitions, we can easily hit this.
>>> So I bumped this 4x to 64 partitions.
>>
>> I think this probably makes sense. I'm a little worried that we're
>> just kicking the can down the road here where maybe we should be
>> solving the problem in some more fundamental way, and I'm also a
>> little worried that we might be reducing single-core performance. But
>> it's probably fine.
>>
> 
> Yeah, I haven't seen this causing any regressions - the sensitive paths
> typically lock only one partition, so having more of them does not
> affect that. Or if it does, it's likely a reasonable trade off as it
> reduces the risk of lock contention.
> 
> That being said, I don't recall benchmarking this patch in isolation,
> only with the other patches. Maybe I should do that ...
> 
>>> What I ended up doing is having a hash table of 16-element arrays. There
>>> are 64 "pieces", each essentially the (16 x OID + UINT64 bitmap) that we
>>> have now. Each OID is mapped to exactly one of these parts as if in a
>>> hash table, and in each of those 16-element parts we do exactly the same
>>> thing we do now (linear search, removal, etc.). This works great, the
>>> locality is great, etc. The one disadvantage is this makes PGPROC
>>> larger, but I did a lot of benchmarks and I haven't seen any regression
>>> that I could attribute to this. (More about this later.)
>>
>> I think this is a reasonable approach. Some comments:
>>
>> - FastPathLocalUseInitialized seems unnecessary to me; the contents of
>> an uninitialized local variable are undefined, but an uninitialized
>> global variable always starts out zeroed.
>>
> 
> OK. I didn't realize global variables start a zero.
> 

I haven't fixed this yet, but it's pretty clear the "init" is not really
needed - in fact, the existing code did the memset() wrong:

memset(FastPathLocalUseCounts, 0, sizeof(FastPathLocalUseInitialized));

This only resets one byte of the array, yet it still worked correctly.
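
(For the record, the intended call presumably should have been the
following, i.e. with sizeof of the counts array itself.)

memset(FastPathLocalUseCounts, 0, sizeof(FastPathLocalUseCounts));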

>> - You need comments in various places, including here, where someone
>> is certain to have questions about the algorithm and choice of
>> constants:
>>
>> +#define FAST_PATH_LOCK_REL_GROUP(rel) (((uint64) (rel) * 7883 + 4481)
>> % FP_LOCK_GROUPS_PER_BACKEND)
>>
> 
> Yeah, definitely needs comment explaining this.
> 
> I admit those numbers are pretty arbitrary primes, to implement a
> trivial hash function. That was good enough for a PoC patch, but maybe
> for a "proper" version this should use a better hash function. It needs
> to be fast, and maybe it doesn't matter that much if it's not perfect.
> 
>> When I originally coded up the fast-path locking stuff, I supposed
>> that we couldn't make the number of slots too big because the
>> algorithm requires a linear search of the whole array. But with this
>> one trick (a partially-associative cache), there's no real reason that
>> I can think of why you can't make the number of slots almost
>> arbitrarily large. At some point you're going to use too much memory,
>> and probably before that point you're going to make the cache big
>> enough that it doesn't fit in the CPU cache of an individual core, at
>> which point possibly it will stop working as well. But honestly ... I
>> don't quite see why this approach couldn't be scaled quite far.
>>
> 
> I don't think I've heard the term "partially-associative cache" before,
> but now that I look at the approach again, it very much reminds me how
> set-associative caches work (e.g. with cachelines in CPU caches). It's a
> 16-way associative cache, assigning each entry into one of 16 slots.
> 
> I must have been reading some papers in this area shortly before the PoC
> patch, and the idea came from there, probably. Which is good, because it
> means it's a well-understood and widely-used approach.
> 
>> Like, if we raised FP_LOCK_GROUPS_PER_BACKEND from your proposed value
>> of 64 to say 65536, would that still perform well? I'm not saying we
>> should do that, because that's probably a silly amount of memory to
>> use for this, but the point is that when you start to have enough
>> partitions that you run out of lock slots, performance is going to
>> degrade, so you can imagine wanting to try to have enough lock groups
>> to make that unlikely. Which leads me to wonder if there's any
>> particular number of lock groups that is clearly "too many" or whether
>> it's just about how much memory we want to use.
>>
> 
> That's an excellent question. I don't know.
> 
> I agree 64 groups is pretty arbitrary, and having 1024 may not be enough
> even with a modest number of partitions. When I was thinking about using
> a higher value, my main concern was that it'd made the PGPROC entry
> larger. Each "fast-path" group is ~72B, so 64 groups is ~4.5kB, and that
> felt like quite a bit.
> 
> But maybe it's fine and we could make it much larger - L3 caches tend to
> be many MBs these days, although AFAIK it's shared by threads running on
> the CPU.
> 
> I'll see if I can do some more testing of this, and see if there's a
> value where it stops improving / starts degrading, etc.
> 

I finally got to do those experiments. The scripts and results (both raw
and summarized) are too big to attach here, so they're available at

     https://github.com/tvondra/scalability-tests

The initial patch used 64 groups (which means 1024 fast-path slots). I ran
the tests with 0, 1, 8, 32, 128, 512 and 1024 groups (so up to 16k locks).
I thought
about testing with ~64k groups, but I didn't go with the extreme value 
because I don't quite see the point.

It would only matter for cases with a truly extreme number of partitions 
(64k groups is ~1M fast-path slots), and just creating enough partitions 
would take a lot of time. Moreover, with that many partitions we seem
to have various other bottlenecks, and improving this does not make it
really practical. And it's so slow the benchmark results are somewhat 
bogus too.

Because if we achieve 50 tps with 1000 partitions, does it really matter
whether a patch changes that to 25 or 100 tps? I doubt that, especially if
going
to 100 partitions gives you 2000 tps. Now imagine you have 10k or 100k 
partitions - how fast is that going to be?

So I think stopping at 1024 groups is sensible, and if there are some 
inefficiencies / costs, I'd expect those to gradually show up even at 
those lower sizes.

But if you look at the results, for example from the "join" test:

   https://github.com/tvondra/scalability-tests/blob/main/join.pdf

there's no such negative effect. The table shows results for different
combinations of parameters, with the first group of columns being on
regular glibc, while the second one has glibc tuning (see [1] for details).
And the values are for different numbers of fast-path groups (0 means the
patch was not applied).

And the color scale shows the impact of increasing the number of
groups. So for example when a column for "32 groups" says 150%, it means 
going from 8 to 32 groups improved throughput to 1.5x. As usual, green 
is "good" and red is "bad".

But if you look at the tables, there's very little change - most of the 
values are close to 100%. This might seem a bit strange, considering the 
promise of these patches is to improve throughput, and "no change" is an 
absence of that. But that's because the charts illustrate the effect of
changing the group count with the other parameters fixed. They never compare
runs with/without the glibc tuning, and that's an important part of the
improvement. Doing the pivot table a bit differently would still show a 
substantial 2-3x improvement.

There's a fair amount of noise - especially for the rpi5 machines (not 
the right hw for sensitive benchmarks), but also on some i5/xeon runs. I
attribute this to only doing one short run (10s) for each combination
of parameters. I'll do more runs next time.

Anyway, I think these results show a couple things:

1) There's no systemic negative effect of increasing the number of 
groups. We could go with 32k or 64k groups, and it doesn't seem like 
there would be a problem.

2) But there's not much point in doing that, because we run into various 
other bottlenecks well before having that many locks. Based on the results,
it doesn't seem like going beyond 32 or 64 groups would give us much.

3) The memory allocation caching (be it the mempool patch, or the glibc 
tuning like in this test round) is a crucial piece for this. Not doing 
that means some tests get no improvement at all, or a much smaller one.

4) The increase of NUM_LOCK_PARTITIONS has very limited effect, or 
perhaps even no effect at all.


Based on this, my plan is to polish the patch adding fast-path groups, 
with either 32 or 64 groups, which seem to be reasonable values. Then
in the future, if/when the other bottlenecks get addressed, we can 
rethink and increase this.

This however reminds me that all those machines are pretty small. Which 
is good for showing it doesn't break existing/smaller systems, but the 
initial goal of the patch was to improve behavior on big boxes. I don't 
have access to the original box at the moment, so if someone could 
provide access to one of those big EPYC/Xeon machines with 100+ cores
for a couple of days, that would be helpful.


That being said, I think it's pretty clear how serious the issue with 
memory allocation overhead can be, especially in cases when the existing 
memory context caching is ineffective (like for the btree palloc). I'm 
not sure what to do about that. The mempool patch shared in this thread 
does the trick, but it's fairly complex/invasive. I still like it, but maybe
doing something with the glibc tuning would be enough - it's not as 
effective, but 80% of the improvement is better than no improvement.



regards


[1] 
https://www.postgresql.org/message-id/0da51f67-c88b-497e-bb89-d5139309eb9c@enterprisedb.com

-- 
Tomas Vondra



Re: scalability bottlenecks with (many) partitions (and more)

From
Robert Haas
Date:
On Sun, Sep 1, 2024 at 3:30 PM Tomas Vondra <tomas@vondra.me> wrote:
> I don't think that's possible with hard-coded size of the array - that
> allocates the memory for everyone. We'd need to make it variable-length,
> and while doing those benchmarks I think we actually already have a GUC
> for that - max_locks_per_transaction tells us exactly what we need to
> know, right? I mean, if I know I'll need ~1000 locks, why not to make
> the fast-path array large enough for that?

I really like this idea. I'm not sure about exactly how many fast path
slots you should get for what value of max_locks_per_transaction, but
coupling the two things together in some way sounds smart.

> Of course, the consequence of this would be making PGPROC variable
> length, or having to point to a memory allocated separately (I prefer
> the latter option, I think). I haven't done any experiments, but it
> seems fairly doable - of course, not sure if it might be more expensive
> compared to compile-time constants.

I agree that this is a potential problem but it sounds like the idea
works well enough that we'd probably still come out quite far ahead
even with a bit more overhead.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: scalability bottlenecks with (many) partitions (and more)

From
Robert Haas
Date:
On Mon, Sep 2, 2024 at 1:46 PM Tomas Vondra <tomas@vondra.me> wrote:
> The one argument to not tie this to max_locks_per_transaction is the
> vastly different "per element" memory requirements. If you add one entry
> to max_locks_per_transaction, that adds LOCK which is a whopping 152B.
> OTOH one fast-path entry is ~5B, give or take. That's a pretty big
> difference, and it if the locks fit into the shared lock table, but
> you'd like to allow more fast-path locks, having to increase
> max_locks_per_transaction is not great - pretty wastefull.
>
> OTOH I'd really hate to just add another GUC and hope the users will
> magically know how to set it correctly. That's pretty unlikely, IMO. I
> myself wouldn't know what a good value is, I think.
>
> But say we add a GUC and set it to -1 by default, in which case it just
> inherits the max_locks_per_transaction value. And then also provide some
> basic metric about this fast-path cache, so that people can tune this?

All things being equal, I would prefer not to add another GUC for
this, but we might need it.

Doing some worst case math, suppose somebody has max_connections=1000
(which is near the upper limit of what I'd consider a sane setting)
and max_locks_per_transaction=10000 (ditto). The product is 10
million, so every 10 bytes of storage eats a gigabyte of RAM. Chewing
up 15GB of RAM when you could have chewed up only 0.5GB certainly
isn't too great. On the other hand, those values are kind of pushing
the limits of what is actually sane. If you imagine
max_locks_per_transaction=2000 rather than
max_locks_per_connection=10000, then it's only 3GB and that's
hopefully not a lot on the hopefully-giant machine where you're
running this.

> I think just knowing the "hit ratio" would be enough, i.e. counters for
> how often it fits into the fast-path array, and how often we had to
> promote it to the shared lock table would be enough, no?

Yeah, probably. I mean, that won't tell you how big it needs to be,
but it will tell you whether it's big enough.

I wonder if we should be looking at further improvements in the lock
manager of some kind. For instance, imagine if we allocated storage
via DSM or DSA for cases where we need a really large number of Lock
entries. The downside of that is that we might run out of memory for
locks at runtime, which would perhaps suck, but you'd probably use
significantly less memory on average. Or, maybe we need an even bigger
rethink where we reconsider the idea that we take a separate lock for
every single partition instead of having some kind of hierarchy-aware
lock manager. I don't know. But this feels like very old, crufty tech.
There's probably something more state of the art that we could or
should be doing.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: scalability bottlenecks with (many) partitions (and more)

From
Tomas Vondra
Date:
On 9/3/24 17:06, Robert Haas wrote:
> On Mon, Sep 2, 2024 at 1:46 PM Tomas Vondra <tomas@vondra.me> wrote:
>> The one argument to not tie this to max_locks_per_transaction is the
>> vastly different "per element" memory requirements. If you add one entry
>> to max_locks_per_transaction, that adds LOCK which is a whopping 152B.
>> OTOH one fast-path entry is ~5B, give or take. That's a pretty big
>> difference, and it if the locks fit into the shared lock table, but
>> you'd like to allow more fast-path locks, having to increase
>> max_locks_per_transaction is not great - pretty wastefull.
>>
>> OTOH I'd really hate to just add another GUC and hope the users will
>> magically know how to set it correctly. That's pretty unlikely, IMO. I
>> myself wouldn't know what a good value is, I think.
>>
>> But say we add a GUC and set it to -1 by default, in which case it just
>> inherits the max_locks_per_transaction value. And then also provide some
>> basic metric about this fast-path cache, so that people can tune this?
> 
> All things being equal, I would prefer not to add another GUC for
> this, but we might need it.
> 

Agreed.

> Doing some worst case math, suppose somebody has max_connections=1000
> (which is near the upper limit of what I'd consider a sane setting)
> and max_locks_per_transaction=10000 (ditto). The product is 10
> million, so every 10 bytes of storage each a gigabyte of RAM. Chewing
> up 15GB of RAM when you could have chewed up only 0.5GB certainly
> isn't too great. On the other hand, those values are kind of pushing
> the limits of what is actually sane. If you imagine
> max_locks_per_transaction=2000 rather than
> max_locks_per_connection=10000, then it's only 3GB and that's
> hopefully not a lot on the hopefully-giant machine where you're
> running this.
> 

Yeah, although I don't quite follow the math. With 1000/10000 settings,
why would that eat 15GB of RAM? I mean, that's 1.5GB, right?

FWIW the actual cost is somewhat higher, because we seem to need ~400B
for every lock (not just the 150B for the LOCK struct). At least based
on a quick experiment. (Seems a bit high, right?).

Anyway, I agree this might be acceptable. If your transactions use this
many locks regularly, you probably need this setting anyway. If you only
need this many locks occasionally (so that you can keep the locks/xact
value low), it probably does not matter that much.

And if you're running a massively-partitioned table on a tiny box, well, I 
don't really think that's a particularly sane idea.

So I think I'm OK with just tying this to max_locks_per_transaction.

>> I think just knowing the "hit ratio" would be enough, i.e. counters for
>> how often it fits into the fast-path array, and how often we had to
>> promote it to the shared lock table would be enough, no?
> 
> Yeah, probably. I mean, that won't tell you how big it needs to be,
> but it will tell you whether it's big enough.
> 

True, but that applies to all "cache hit ratio" metrics (like for our
shared buffers). It'd be great to have something better, enough to tell
you how large the cache needs to be. But we don't :-(

> I wonder if we should be looking at further improvements in the lock
> manager of some kind. For instance, imagine if we allocated storage
> via DSM or DSA for cases where we need a really large number of Lock
> entries. The downside of that is that we might run out of memory for
> locks at runtime, which would perhaps suck, but you'd probably use
> significantly less memory on average. Or, maybe we need an even bigger
> rethink where we reconsider the idea that we take a separate lock for
> every single partition instead of having some kind of hierarchy-aware
> lock manager. I don't know. But this feels like very old, crufty tech.
> There's probably something more state of the art that we could or
> should be doing.
> 

Perhaps. I agree we'll probably need something more radical soon, not
just changes that aim to fix some rare exceptional case (which may be
annoying, but not particularly harmful for the complete workload).

For example, if we did what you propose, that might help when very few
transactions need a lot of locks. I don't mind saving memory in that
case, ofc. but is it a problem if those rare cases are a bit slower?
Shouldn't we focus more on cases where many locks are common? Because
people are simply going to use partitioning, a lot of indexes, etc?

So yeah, I agree we probably need a more fundamental rethink. I don't
think we can just keep optimizing the current approach, there's a limit
of fast it can be. Whether it's not locking individual partitions, or
not locking some indexes, ... I don't know.


regards

-- 
Tomas Vondra



Re: scalability bottlenecks with (many) partitions (and more)

From
Jakub Wartak
Date:
Hi Tomas!

On Tue, Sep 3, 2024 at 6:20 PM Tomas Vondra <tomas@vondra.me> wrote:
>
> On 9/3/24 17:06, Robert Haas wrote:
> > On Mon, Sep 2, 2024 at 1:46 PM Tomas Vondra <tomas@vondra.me> wrote:
> >> The one argument to not tie this to max_locks_per_transaction is the
> >> vastly different "per element" memory requirements. If you add one entry
> >> to max_locks_per_transaction, that adds LOCK which is a whopping 152B.
> >> OTOH one fast-path entry is ~5B, give or take. That's a pretty big
> >> difference, and it if the locks fit into the shared lock table, but
> >> you'd like to allow more fast-path locks, having to increase
> >> max_locks_per_transaction is not great - pretty wastefull.
> >>
> >> OTOH I'd really hate to just add another GUC and hope the users will
> >> magically know how to set it correctly. That's pretty unlikely, IMO. I
> >> myself wouldn't know what a good value is, I think.
> >>
> >> But say we add a GUC and set it to -1 by default, in which case it just
> >> inherits the max_locks_per_transaction value. And then also provide some
> >> basic metric about this fast-path cache, so that people can tune this?
> >
> > All things being equal, I would prefer not to add another GUC for
> > this, but we might need it.
> >
>
> Agreed.
>
> [..]
>
> So I think I'm OK with just tying this to max_locks_per_transaction.

If that matters, then the SLRU configurability effort added 7 GUCs
(with 3 scaling up based on shared_buffers) just to give high-end
users some relief, so here 1 new GUC shouldn't be such a big deal. We
could add a note to the LWLock/lock_manager wait event docs recommending
certain known-to-be-good values from this $thread (or asking the
user to benchmark it himself).

> >> I think just knowing the "hit ratio" would be enough, i.e. counters for
> >> how often it fits into the fast-path array, and how often we had to
> >> promote it to the shared lock table would be enough, no?
> >
> > Yeah, probably. I mean, that won't tell you how big it needs to be,
> > but it will tell you whether it's big enough.
> >
>
> True, but that applies to all "cache hit ratio" metrics (like for our
> shared buffers). It'd be great to have something better, enough to tell
> you how large the cache needs to be. But we don't :-(

My $0.02: the originating case that triggered those patches actually 
started with LWLock/lock_manager waits being the top #1 wait event. The 
operator can cross-check (join) that with a group by on pg_locks.fastpath 
(='f'), count(*). So, IMHO we have good observability in this case 
(rare thing to say!)

> > I wonder if we should be looking at further improvements in the lock
> > manager of some kind. [..]
>
> Perhaps. I agree we'll probably need something more radical soon, not
> just changes that aim to fix some rare exceptional case (which may be
> annoying, but not particularly harmful for the complete workload).
>
> For example, if we did what you propose, that might help when very few
> transactions need a lot of locks. I don't mind saving memory in that
> case, ofc. but is it a problem if those rare cases are a bit slower?
> Shouldn't we focus more on cases where many locks are common? Because
> people are simply going to use partitioning, a lot of indexes, etc?
>
> So yeah, I agree we probably need a more fundamental rethink. I don't
> think we can just keep optimizing the current approach, there's a limit
> of fast it can be.

Please help me understand: are you both discussing potential far-future 
further improvements instead of this one? My question is really about: 
is the patchset good enough, or are you considering some other new 
effort instead?

BTW some other random questions:
Q1. I've been looking into
https://github.com/tvondra/pg-lock-scalability-results - those results
shouldn't be used anymore for further discussion, as they covered
earlier patches (including
0003-Add-a-memory-pool-with-adaptive-rebalancing.patch) and were
replaced by the benchmark data in this $thread, right?
Q2. Earlier attempts did include a mempool patch to get those nice
numbers (or was that jemalloc or glibc tuning?). So were those recent
results in [1] still collected with 0003, or have you switched
completely to glibc/jemalloc tuning?

-J.

[1] - https://www.postgresql.org/message-id/b8c43eda-0c3f-4cb4-809b-841fa5c40ada%40vondra.me



Re: scalability bottlenecks with (many) partitions (and more)

From
Tomas Vondra
Date:
On 9/4/24 11:29, Jakub Wartak wrote:
> Hi Tomas!
> 
> On Tue, Sep 3, 2024 at 6:20 PM Tomas Vondra <tomas@vondra.me> wrote:
>>
>> On 9/3/24 17:06, Robert Haas wrote:
>>> On Mon, Sep 2, 2024 at 1:46 PM Tomas Vondra <tomas@vondra.me> wrote:
>>>> The one argument to not tie this to max_locks_per_transaction is the
>>>> vastly different "per element" memory requirements. If you add one entry
>>>> to max_locks_per_transaction, that adds LOCK which is a whopping 152B.
>>>> OTOH one fast-path entry is ~5B, give or take. That's a pretty big
>>>> difference, and it if the locks fit into the shared lock table, but
>>>> you'd like to allow more fast-path locks, having to increase
>>>> max_locks_per_transaction is not great - pretty wastefull.
>>>>
>>>> OTOH I'd really hate to just add another GUC and hope the users will
>>>> magically know how to set it correctly. That's pretty unlikely, IMO. I
>>>> myself wouldn't know what a good value is, I think.
>>>>
>>>> But say we add a GUC and set it to -1 by default, in which case it just
>>>> inherits the max_locks_per_transaction value. And then also provide some
>>>> basic metric about this fast-path cache, so that people can tune this?
>>>
>>> All things being equal, I would prefer not to add another GUC for
>>> this, but we might need it.
>>>
>>
>> Agreed.
>>
>> [..]
>>
>> So I think I'm OK with just tying this to max_locks_per_transaction.
> 
> If that matters then the SLRU configurability effort added 7 GUCs
> (with 3 scaling up based on shared_buffers) just to give high-end
> users some relief, so here 1 new shouldn't be that such a deal. We
> could add to the LWLock/lock_manager wait event docs to recommend just
> using known-to-be-good certain values from this $thread (or ask the
> user to benchmark it himself).
> 

TBH I'm skeptical we'll be able to tune those GUCs. Maybe it was the
right thing for the SLRU thread, I don't know - I haven't been following
that very closely. But my impression is that we often add a GUC when
we're not quite sure how to pick a good value. So we just shift the
responsibility to someone else, who doesn't know either.

I'd very much prefer not to do that here. Of course, it's challenging
because we can't easily resize these arrays, so even if we had some nice
heuristics to calculate the "optimal" number of fast-path slots, what
would we do with it ...

>>>> I think just knowing the "hit ratio" would be enough, i.e. counters for
>>>> how often it fits into the fast-path array, and how often we had to
>>>> promote it to the shared lock table would be enough, no?
>>>
>>> Yeah, probably. I mean, that won't tell you how big it needs to be,
>>> but it will tell you whether it's big enough.
>>>
>>
>> True, but that applies to all "cache hit ratio" metrics (like for our
>> shared buffers). It'd be great to have something better, enough to tell
>> you how large the cache needs to be. But we don't :-(
> 
> My $0.02 cents: the originating case that triggered those patches,
> actually started with LWLock/lock_manager waits being the top#1. The
> operator can cross check (join) that with a group by pg_locks.fastpath
> (='f'), count(*). So, IMHO we have good observability in this case
> (rare thing to say!)
> 

That's a good point. So if you had to give users some instructions on 
what to measure / monitor, and how to adjust the GUC based on that, what 
would your instructions be?

>>> I wonder if we should be looking at further improvements in the lock
>>> manager of some kind. [..]
>>
>> Perhaps. I agree we'll probably need something more radical soon, not
>> just changes that aim to fix some rare exceptional case (which may be
>> annoying, but not particularly harmful for the complete workload).
>>
>> For example, if we did what you propose, that might help when very few
>> transactions need a lot of locks. I don't mind saving memory in that
>> case, ofc. but is it a problem if those rare cases are a bit slower?
>> Shouldn't we focus more on cases where many locks are common? Because
>> people are simply going to use partitioning, a lot of indexes, etc?
>>
>> So yeah, I agree we probably need a more fundamental rethink. I don't
>> think we can just keep optimizing the current approach, there's a limit
>> of fast it can be.
> 
> Please help me understand: so are You both discussing potential far
> future further improvements instead of this one ? My question is
> really about: is the patchset good enough or are you considering some
> other new effort instead?
> 

I think it was mostly brainstorming about alternative / additional 
improvements in locking. The proposed patch does not change the locking 
in any fundamental way; it merely optimizes one piece - we still acquire 
exactly the same set of locks, exactly the same way.

AFAICS there's an agreement the current approach has limits, and with
the growing number of partitions we're hitting them already. That may
need rethinking the fundamental approach, but I think that should not
block improvements to the current approach.

Not to mention there's no proposal for such "fundamental rework" yet.

> BTW some other random questions:
> Q1. I've been lurking into
> https://github.com/tvondra/pg-lock-scalability-results and those
> shouldn't be used anymore for further discussions, as they contained
> earlier patches (including
> 0003-Add-a-memory-pool-with-adaptive-rebalancing.patch) and they were
> replaced by benchmark data in this $thread, right?

The github results are still valid - I only shared them 3 days ago. They 
test both the mempool and glibc tuning, to assess (and compare) the 
benefits of each, but why would that make them obsolete?

By "results in this thread" I suppose you mean the couple numbers I
shared on September 2? Those were just very limited benchmarks to asses
if making the arrays variable-length (based on GUC) would make things
slower. And it doesn't, so the "full" github results still apply.

> Q2. Earlier attempts did contain a mempool patch to get those nice
> numbers (or was that jemalloc or glibc tuning). So were those recent
> results in [1] collected with still 0003 or you have switched
> completely to glibc/jemalloc tuning?
> 

The results pushed to github are all with glibc, and test four cases:

a) mempool patch not applied, no glibc tuning
b) mempool patch applied, no glibc tuning
c) mempool patch not applied, glibc tuning
d) mempool patch applied, glibc tuning

These are the 4 "column groups" in some of the pivot tables, to allow
comparing those cases. My interpretation of the results is:

1) The mempool / glibc tuning has significant benefits, at least for 
some workloads (where the locking patch alone doesn't help much).

2) There's very little difference between the mempool and the glibc tuning. 
The mempool does seem to have a small advantage.

3) The mempool / glibc tuning is irrelevant for non-glibc systems (e.g.
for FreeBSD which I think uses jemalloc or something like that).

I think the mempool might be interesting and useful for other reasons
(e.g. I initially wrote it to enforce a per-backend memory limit), but
you can get mostly the same caching benefits by tuning the glibc parameters.

So I'm focusing on the locking stuff.


regards

-- 
Tomas Vondra



Re: scalability bottlenecks with (many) partitions (and more)

From
Matthias van de Meent
Date:
On Tue, 3 Sept 2024 at 18:20, Tomas Vondra <tomas@vondra.me> wrote:
> FWIW the actual cost is somewhat higher, because we seem to need ~400B
> for every lock (not just the 150B for the LOCK struct).

We do indeed allocate two PROCLOCKs for every LOCK, and allocate those
inside dynahash tables. That amounts to (152+2*64+3*16=) 328 bytes in
dynahash elements, and (3 * 8-16) = 24-48 bytes for the dynahash
buckets/segments, resulting in 352-376 bytes * NLOCKENTS() being
used[^1]. Does that align with your usage numbers, or are they
significantly larger?

> At least based on a quick experiment. (Seems a bit high, right?).

Yeah, that does seem high, thanks for nerd-sniping me.

The 152 bytes of LOCK are mostly due to a combination of two
MAX_LOCKMODES-sized int[]s that are used to keep track of the number
of requested/granted locks of each level. As MAX_LOCKMODES = 10, these
arrays use a total of 2*4*10=80 bytes, with the remaining 72 spent on
tracking. MAX_BACKENDS sadly doesn't fit in int16, so we'll have to
keep using int[]s, but that doesn't mean we can't improve this size:

ISTM that MAX_LOCKMODES is 2 larger than it has to be: LOCKMODE=0 is
NoLock, which is never used or counted in these shared structures, and
the max lock mode supported by any of the supported lock methods is
AccessExclusiveLock (8). We can thus reduce MAX_LOCKMODES to 8,
reducing size of the LOCK struct by 16 bytes.

If some struct- and field packing is OK, then we could further reduce
the size of LOCK by an additional 8 bytes by resizing the LOCKMASK
type from int to int16 (we only use the first MaxLockMode (8) + 1
bits), and then storing the grant/waitMask fields (now 4 bytes total)
in the padding that's present at the end of the waitProcs struct. This
would depend on dclist not writing in its padding space, but I
couldn't find any user that did so, and most critically dclist_init
doesn't scribble in the padding with memset.

If these are both implemented, it would save 24 bytes, reducing the
struct to 128 bytes. :) [^2]
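
To make that a bit more concrete, one way to realize the padding trick
could look roughly like the sketch below (illustrative only, not a tested
patch - the union, the LOCKMASK16 name and the StaticAssert are mine, and
it assumes a 64-bit build where sizeof(dclist_head) is 24 with 20 bytes
of payload):

    /* Sketch only: overlay the two narrowed masks onto the trailing
     * padding of dclist_head, relying on the dclist code never writing
     * into its padding. */
    #include "postgres.h"
    #include "lib/ilist.h"

    typedef int16 LOCKMASK16;       /* hypothetical narrowed LOCKMASK */

    typedef union LockWaitQueue
    {
        dclist_head queue;          /* the wait queue itself */
        struct
        {
            char        used[sizeof(dlist_head) + sizeof(uint32)];
            LOCKMASK16  grantMask;  /* stored in dclist_head's padding */
            LOCKMASK16  waitMask;
        }           overlay;
    } LockWaitQueue;

    StaticAssertDecl(sizeof(LockWaitQueue) == sizeof(dclist_head),
                     "masks must fit into dclist_head padding");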

I also checked PROCLOCK: If it is worth further packing the struct, we
should probably look at whether it's worth replacing the PGPROC* typed
fields with ProcNumber -based ones, potentially in both PROCLOCK and
PROCLOCKTAG. When combined with int16-typed LOCKMASKs, either one of
these fields being replaced with ProcNumber would allow a reduction in
size by one MAXALIGN quantum, reducing the struct to 56 bytes, the
smallest I could get it to without ignoring type alignments.

Further shmem savings can be achieved by reducing the "10% safety
margin" added at the end of LockShmemSize, as I'm fairly sure the
memory used in shared hashmaps doesn't exceed the estimated amount,
and if it did then we should probably fix that part, rather than
requesting that (up to) 10% overhead here.

Altogether that'd save 40 bytes/lock entry on size, and ~35 
bytes/lock on "safety margin", for a saving of (up to) 19% of our 
current allocation. I'm not sure whether these tricks would benefit 
performance or even hurt it, apart from smaller structs usually 
being better at fitting in CPU caches.


Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[^1] NLOCKENTS() benefits from being a power of 2, or slightly below
one, as it's rounded up to a power of 2 when dynahash decides its
number of buckets to allocate.
[^2] Sadly this 2-cachelines alignment is lost due to dynahash's
HASHELEMENT prefix of elements. :(



Re: scalability bottlenecks with (many) partitions (and more)

From
David Rowley
Date:
On Wed, 4 Sept 2024 at 03:06, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Sep 2, 2024 at 1:46 PM Tomas Vondra <tomas@vondra.me> wrote:
> > But say we add a GUC and set it to -1 by default, in which case it just
> > inherits the max_locks_per_transaction value. And then also provide some
> > basic metric about this fast-path cache, so that people can tune this?
>
> All things being equal, I would prefer not to add another GUC for
> this, but we might need it.

I think driving the array size from max_locks_per_transaction is a
good idea (rounded up to the next multiple of 16?). If someone comes
along one day and shows us a compelling case where some backend needs
more than its fair share of locks and performance is bad because of
that, then maybe we can consider adding a GUC then. Certainly, it's
much easier to add a GUC later if someone convinces us that it's a
good idea than it is to add it now and try to take it away in the
future if we realise it's not useful enough to keep.

David



Re: scalability bottlenecks with (many) partitions (and more)

From
Tomas Vondra
Date:
On 9/4/24 16:25, Matthias van de Meent wrote:
> On Tue, 3 Sept 2024 at 18:20, Tomas Vondra <tomas@vondra.me> wrote:
>> FWIW the actual cost is somewhat higher, because we seem to need ~400B
>> for every lock (not just the 150B for the LOCK struct).
> 
> We do indeed allocate two PROCLOCKs for every LOCK, and allocate those
> inside dynahash tables. That amounts to (152+2*64+3*16=) 328 bytes in
> dynahash elements, and (3 * 8-16) = 24-48 bytes for the dynahash
> buckets/segments, resulting in 352-376 bytes * NLOCKENTS() being
> used[^1]. Does that align with your usage numbers, or are they
> significantly larger?
> 

I see more like ~470B per lock. If I patch CalculateShmemSize to log the
shmem allocated, I get this:

  max_connections=100 max_locks_per_transaction=1000 => 194264001
  max_connections=100 max_locks_per_transaction=2000 => 241756967

and (((241756967-194264001)/100/1000)) = 474

Could be alignment of structs or something, not sure.
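
For reference, that kind of instrumentation can be as simple as a
temporary elog at the end of CalculateShmemSize() in ipci.c (illustrative
only; it assumes the local Size variable there is called "size"):

    elog(LOG, "calculated shared memory size: %zu bytes", size);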

>> At least based on a quick experiment. (Seems a bit high, right?).
> 
> Yeah, that does seem high, thanks for nerd-sniping me.
> 
> The 152 bytes of LOCK are mostly due to a combination of two
> MAX_LOCKMODES-sized int[]s that are used to keep track of the number
> of requested/granted locks of each level. As MAX_LOCKMODES = 10, these
> arrays use a total of 2*4*10=80 bytes, with the remaining 72 spent on
> tracking. MAX_BACKENDS sadly doesn't fit in int16, so we'll have to
> keep using int[]s, but that doesn't mean we can't improve this size:
> 
> ISTM that MAX_LOCKMODES is 2 larger than it has to be: LOCKMODE=0 is
> NoLock, which is never used or counted in these shared structures, and
> the max lock mode supported by any of the supported lock methods is
> AccessExclusiveLock (8). We can thus reduce MAX_LOCKMODES to 8,
> reducing size of the LOCK struct by 16 bytes.
> 
> If some struct- and field packing is OK, then we could further reduce
> the size of LOCK by an additional 8 bytes by resizing the LOCKMASK
> type from int to int16 (we only use the first MaxLockMode (8) + 1
> bits), and then storing the grant/waitMask fields (now 4 bytes total)
> in the padding that's present at the end of the waitProcs struct. This
> would depend on dclist not writing in its padding space, but I
> couldn't find any user that did so, and most critically dclist_init
> doesn't scribble in the padding with memset.
> 
> If these are both implemented, it would save 24 bytes, reducing the
> struct to 128 bytes. :) [^2]
> 
> I also checked PROCLOCK: If it is worth further packing the struct, we
> should probably look at whether it's worth replacing the PGPROC* typed
> fields with ProcNumber -based ones, potentially in both PROCLOCK and
> PROCLOCKTAG. When combined with int16-typed LOCKMASKs, either one of
> these fields being replaced with ProcNumber would allow a reduction in
> size by one MAXALIGN quantum, reducing the struct to 56 bytes, the
> smallest I could get it to without ignoring type alignments.
> 
> Further shmem savings can be achieved by reducing the "10% safety
> margin" added at the end of LockShmemSize, as I'm fairly sure the
> memory used in shared hashmaps doesn't exceed the estimated amount,
> and if it did then we should probably fix that part, rather than
> requesting that (up to) 10% overhead here.
> 
> Alltogether that'd save 40 bytes/lock entry on size, and ~35
> bytes/lock on "safety margin", for a saving of (up to) 19% of our
> current allocation. I'm not sure if these tricks would benefit with
> performance or even be a demerit, apart from smaller structs usually
> being better at fitting better in CPU caches.
> 

Not sure either, but it seems worth exploring. If you do an experimental
patch for the LOCK size reduction, I can get some numbers.

I'm not sure about the safety margins. 10% sure seems like quite a bit 
of memory (it might not have been in the past, but as instances are 
growing, that has probably changed).


regards

-- 
Tomas Vondra



Re: scalability bottlenecks with (many) partitions (and more)

From
Tomas Vondra
Date:


On 9/4/24 17:12, David Rowley wrote:
> On Wed, 4 Sept 2024 at 03:06, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> On Mon, Sep 2, 2024 at 1:46 PM Tomas Vondra <tomas@vondra.me> wrote:
>>> But say we add a GUC and set it to -1 by default, in which case it just
>>> inherits the max_locks_per_transaction value. And then also provide some
>>> basic metric about this fast-path cache, so that people can tune this?
>>
>> All things being equal, I would prefer not to add another GUC for
>> this, but we might need it.
> 
> I think driving the array size from max_locks_per_transaction is a
> good idea (rounded up to the next multiple of 16?).

Maybe, although I was thinking we'd just use the regular doubling, to
get nice "round" numbers. It will likely overshoot a little bit (unless
people set the GUC to exactly 2^N), but I don't think that's a problem.
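
For illustration, the doubling could be something like this (a sketch
with made-up names, not the committed code):

    /* Sketch: derive the number of 16-slot fast-path groups from
     * max_locks_per_transaction by regular doubling, so the group count
     * is always a power of two. */
    #define FP_LOCK_SLOTS_PER_GROUP 16

    static int
    FastPathLockGroups(int max_locks_per_transaction)
    {
        int     groups = 1;

        while (groups * FP_LOCK_SLOTS_PER_GROUP < max_locks_per_transaction)
            groups *= 2;

        return groups;
    }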

> If someone comes along one day and shows us a compelling case where
> some backend needs more than its fair share of locks and performance
> is bad because of that, then maybe we can consider adding a GUC then.
> Certainly, it's much easier to add a GUC later if someone convinces
> us that it's a good idea than it is to add it now and try to take it
> away in the future if we realise it's not useful enough to keep.
> 

Yeah, I agree with this.



regards

-- 
Tomas Vondra



Re: scalability bottlenecks with (many) partitions (and more)

From
Robert Haas
Date:
On Tue, Sep 3, 2024 at 12:19 PM Tomas Vondra <tomas@vondra.me> wrote:
> > Doing some worst case math, suppose somebody has max_connections=1000
> > (which is near the upper limit of what I'd consider a sane setting)
> > and max_locks_per_transaction=10000 (ditto). The product is 10
> > million, so every 10 bytes of storage each a gigabyte of RAM. Chewing
> > up 15GB of RAM when you could have chewed up only 0.5GB certainly
> > isn't too great. On the other hand, those values are kind of pushing
> > the limits of what is actually sane. If you imagine
> > max_locks_per_transaction=2000 rather than
> > max_locks_per_connection=10000, then it's only 3GB and that's
> > hopefully not a lot on the hopefully-giant machine where you're
> > running this.
>
> Yeah, although I don't quite follow the math. With 1000/10000 settings,
> why would that eat 15GB of RAM? I mean, that's 1.5GB, right?

Oh, right.

> FWIW the actual cost is somewhat higher, because we seem to need ~400B
> for every lock (not just the 150B for the LOCK struct). At least based
> on a quick experiment. (Seems a bit high, right?).

Hmm, yes, that's unpleasant.

> Perhaps. I agree we'll probably need something more radical soon, not
> just changes that aim to fix some rare exceptional case (which may be
> annoying, but not particularly harmful for the complete workload).
>
> For example, if we did what you propose, that might help when very few
> transactions need a lot of locks. I don't mind saving memory in that
> case, ofc. but is it a problem if those rare cases are a bit slower?
> Shouldn't we focus more on cases where many locks are common? Because
> people are simply going to use partitioning, a lot of indexes, etc?
>
> So yeah, I agree we probably need a more fundamental rethink. I don't
> think we can just keep optimizing the current approach, there's a limit
> of fast it can be. Whether it's not locking individual partitions, or
> not locking some indexes, ... I don't know.

I don't know, either. We don't have to decide right now; it's just
something to keep in mind.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: scalability bottlenecks with (many) partitions (and more)

From
Tomas Vondra
Date:
On 9/4/24 13:15, Tomas Vondra wrote:
> On 9/4/24 11:29, Jakub Wartak wrote:
>> Hi Tomas!
>>
>> ...
>>
>> My $0.02 cents: the originating case that triggered those patches,
>> actually started with LWLock/lock_manager waits being the top#1. The
>> operator can cross check (join) that with a group by pg_locks.fastpath
>> (='f'), count(*). So, IMHO we have good observability in this case
>> (rare thing to say!)
>>
> 
> That's a good point. So if you had to give some instructions to users
> what to measure / monitor, and how to adjust the GUC based on that, what
> would your instructions be?
> 

After thinking about this a bit more, I'm actually wondering if this 
source of information is sufficient. Firstly, it's just a snapshot of a 
single instance, and it's not quite trivial to get a summary for a 
longer time period (people would have to sample it often enough, etc.). 
Doable, but much less convenient than the cumulative counters.

But for the sampling, doesn't this produce skewed data? Imagine you have
a workload with very short queries (which is when fast-path matters), so
you're likely to see the backend while it's obtaining the locks. If the
fast-path locks are much faster to acquire (kinda the whole point), we're 
more likely to see the backend while it's obtaining the regular locks.

Let's say the backend needs 1000 locks, and 500 of those fit into the
fast-path array. We're likely to see the 500 fast-path locks already
acquired, and a random fraction of the 500 non-fast-path locks. So in
the end you'll see backends needing 500 fast-path locks and 250 regular 
locks. That doesn't seem terrible, but I guess the difference can be
made even larger.


regards

-- 
Tomas Vondra



Re: scalability bottlenecks with (many) partitions (and more)

From
Jakub Wartak
Date:
On Thu, Sep 5, 2024 at 7:33 PM Tomas Vondra <tomas@vondra.me> wrote:

>>> My $0.02 cents: the originating case that triggered those patches,
>>> actually started with LWLock/lock_manager waits being the top#1. The
>>> operator can cross check (join) that with a group by pg_locks.fastpath
>>> (='f'), count(*). So, IMHO we have good observability in this case
>>> (rare thing to say!)
>>>
>>
>> That's a good point. So if you had to give some instructions to users
>> what to measure / monitor, and how to adjust the GUC based on that, what
>> would your instructions be?
>>
>
> After thinking about this a bit more, I'm actually wondering if this is
> source of information is sufficient. Firstly, it's just a snapshot of a
> single instance, and it's not quite trivial to get some summary for
> longer time period (people would have to sample it often enough, etc.).
> Doable, but much less convenient than the cumulative counters.

OK, so answering the previous question:

Probably just monitor pg_stat_activity (grouped by wait event, count(*))
together with pg_locks grouped by pid and fastpath. Even simple
observations with \watch 0.1 are good enough to confirm/deny the
root cause in practice, even for short bursts while it is happening.
When deploying monitoring for a longer time (with, say, a 1s sample),
you would sooner or later get the __high water mark__, and could allow
up to that many fast-path slots as a starting point - as many as there
are locks occurring for the affected PIDs (or double that amount).

> But for the sampling, doesn't this produce skewed data? Imagine you have
> a workload with very short queries (which is when fast-path matters), so
> you're likely to see the backend while it's obtaining the locks. If the
> fast-path locks take much faster acquire (kinda the whole point), we're
> more likely to see the backend while it's obtaining the regular locks.
>
> Let's say the backend needs 1000 locks, and 500 of those fit into the
> fast-path array. We're likely to see the 500 fast-path locks already
> acquired, and a random fraction of the 500 non-fast-path locks. So in
> the end you'll se backends needing 500 fast-path locks and 250 regular
> locks. That doesn't seem terrible, but I guess the difference can be
> made even larger.

... it doesn't need to be perfect data to act on, right? We may just need
the information that it is happening (well, we do). Maybe it's too
pragmatic a point of view, but wasting some bits of memory on this, while
still being able to control how much it allocates in the end, is a much
better situation than today, where we have no control and are wasting
crazy CPU time on all those futex() syscalls and context switches.

Another angle is that if you see the SQL causing it, it is most often
going to be attributed to partitioning and people ending up accessing
way too many partitions (thousands) without proper partition pruning -
sometimes it even triggers re-partitioning of the said tables. So
maybe the realistic "fastpath sizing" should assume something that
supports:
a) the usual number of tables in JOINs (just a few of them, fine like today) -> ok
b) 1-month interval partitions for, let's say, 5 years (12*5 = 60),
joined to some other table like that - that gives what, max 120? -> so
if you have users doing SELECT * FROM such_table, they will probably
have already set max_locks_per_xact to something higher.
c) HASH partitioning up to the vCPU counts seen in the wild? (say 64 or
128? so that sounds the same as above?)
d) we probably should not care here at all if somebody wants daily
partitioning across years with HASH (sub)partitions without partition
pruning -> it has nothing to do with being "fast" anyway

Judging from the current reports, people have configured
max_locks_per_xact like this: ~75% have it at the default (64), 10% have
1024, 5% have 128, and the rest (~10%) have between 100 and a few thousand,
with extreme one-offs @ 25k (a wild misconfiguration judging from the
other specs).

BTW: you probably need to register this $thread in the CF app for others to
see it too (it's not there?)

-J.



Re: scalability bottlenecks with (many) partitions (and more)

From
Jakub Wartak
Date:
On Fri, Sep 13, 2024 at 1:45 AM Tomas Vondra <tomas@vondra.me> wrote:

> [..]

> Anyway, at this point I'm quite happy with this improvement. I didn't
> have any clear plan when to commit this, but I'm considering doing so
> sometime next week, unless someone objects or asks for some additional
> benchmarks etc.

Thank you very much for working on this :)

The only thing that comes to my mind is that we could blow up the L2
caches. Fun fact: if we are growing PGPROC by 6.3x, that's going to
be like one or two more 2MB huge pages @ the common max_connections=1000
on x86_64 (830kB -> ~5.1MB), and indeed:

# without patch:
postgres@hive:~$ /usr/pgsql18/bin/postgres -D /tmp/pg18 -C
shared_memory_size_in_huge_pages
177

# with patch:
postgres@hive:~$ /usr/pgsql18/bin/postgres -D /tmp/pg18 -C
shared_memory_size_in_huge_pages
178

So playing Devil's advocate, the worst situation that could possibly
hurt (?) could be:
* memory size of PGPROC working set >> L2_cache (thus very high
max_connections),
* insane number of working sessions on CPU (sessions >> VCPU) - sadly
happens to some,
* those sessions wouldn't have to be competing for the same Oids -
just fetching this new big fpLockBits[] structure - so probing a lot
for lots of Oids, but *NOT* having to use futex() syscall [so not that
syscall price]
* no huge pages (to cause dTLB misses)

then maybe(?) one could observe further degradation of dTLB misses in
the perf-stat counter under some microbenchmark, but measuring that
requires isolated and physical hardware. Maybe that would actually be
noise due to the overhead of the context switches themselves. Just trying
to think out loud about what a big PGPROC could cause here. But this is already an
unhealthy and non-steady state of the system, so IMHO we are good,
unless someone comes up with a better (more evil) idea.

>> I did look at docs if anything needs updating, but I don't think so. The
>> SGML docs only talk about fast-path locking at fairly high level, not
>> about how many we have etc.

Well, the only thing I could think of was to add to the
doc/src/sgml/config.sgml description of the "max_locks_per_transaction"
GUC that "it is also used as an advisory value for the number of groups
used in the lock manager's fast-path implementation" (that is, without
going into further discussion, as even the pg_locks discussion in
doc/src/sgml/system-views.sgml simply uses that term).

-J.



Re: scalability bottlenecks with (many) partitions (and more)

From
Tomas Vondra
Date:

On 9/16/24 15:11, Jakub Wartak wrote:
> On Fri, Sep 13, 2024 at 1:45 AM Tomas Vondra <tomas@vondra.me> wrote:
> 
>> [..]
> 
>> Anyway, at this point I'm quite happy with this improvement. I didn't
>> have any clear plan when to commit this, but I'm considering doing so
>> sometime next week, unless someone objects or asks for some additional
>> benchmarks etc.
> 
> Thank you very much for working on this :)
> 
> The only fact that comes to my mind is that we could blow up L2
> caches. Fun fact, so if we are growing PGPROC by 6.3x, that's going to
> be like one or two 2MB huge pages more @ common max_connections=1000
> x86_64 (830kB -> ~5.1MB), and indeed:
> 
> # without patch:
> postgres@hive:~$ /usr/pgsql18/bin/postgres -D /tmp/pg18 -C
> shared_memory_size_in_huge_pages
> 177
> 
> # with patch:
> postgres@hive:~$ /usr/pgsql18/bin/postgres -D /tmp/pg18 -C
> shared_memory_size_in_huge_pages
> 178
> 
> So playing Devil's advocate , the worst situation that could possibly
> hurt (?) could be:
> * memory size of PGPROC working set >> L2_cache (thus very high
> max_connections),
> * insane number of working sessions on CPU (sessions >> VCPU) - sadly
> happens to some,
> * those sessions wouldn't have to be competing for the same Oids -
> just fetching this new big fpLockBits[] structure - so probing a lot
> for lots of Oids, but *NOT* having to use futex() syscall [so not that
> syscall price]
> * no huge pages (to cause dTLB misses)
> 
> then maybe(?) one could observe further degradation of dTLB misses in
> the perf-stat counter under some microbenchmark, but measuring that
> requires isolated and physical hardware. Maybe that would be actually
> noise due to overhead of context-switches itself. Just trying to think
> out loud, what big PGPROC could cause here. But this is already an
> unhealthy and non-steady state of the system, so IMHO we are good,
> unless someone comes up with a better (more evil) idea.
> 

I've been thinking about such cases too, but I don't think it can really
happen in practice, because:

- How likely is it that the sessions will need a lot of OIDs, but not
the same ones? Also, why would it matter that the OIDs are not the same?
I don't think it matters unless one of the sessions needs an exclusive
lock, at which point the optimization doesn't really matter.

- If having more fast-path slots means it doesn't fit into L2 cache,
would we fit into L2 without it? I don't think so - if there really are
that many locks, we'd have to add those into the shared lock table, and
there's a lot of extra stuff to keep in memory (relcaches, ...).

This is pretty much one of the cases I focused on in my benchmarking,
and I have yet to see any regression.


>>> I did look at docs if anything needs updating, but I don't think so. The
>>> SGML docs only talk about fast-path locking at fairly high level, not
>>> about how many we have etc.
> 
> Well the only thing I could think of was to add to the
> doc/src/sgml/config.sgml / "max_locks_per_transaction" GUC, that "it
> is also used as advisory for the number of groups used in
> lockmanager's fast-path implementation" (that is, without going into
> further discussion, as even pg_locks discussion
> doc/src/sgml/system-views.sgml simply uses that term).
> 

Thanks, I'll consider mentioning this in max_locks_per_transaction.
Also, I think there's a place calculating the amount of per-connection
memory, so maybe that needs to be updated too.


regards

-- 
Tomas Vondra



Re: scalability bottlenecks with (many) partitions (and more)

From
Tomas Vondra
Date:
Hi,

I've finally pushed this, after many rounds of careful testing to ensure
no regressions, and polishing. All changes since the version shared on
September 13 are only cosmetic - renaming a macro to keep it consistent
with the other ones, clarifying a couple comments etc. Nothing major.

I ended up squashing the two parts into a single commit. I thought about
keeping the two steps, but it seemed pointless - the first part inflated
the PGPROC struct, which I didn't want to commit, even if only as an 
intermediate WIP state.

So far the buildfarm hasn't blown up, so let's hope it stays that way.

I just realized there's no CF entry for this - sorry about that :-( I
started the thread a year ago to discuss an experimental patch, and it 
never made it to the CFA. But there was a discussion spanning a year, so 
hopefully that's enough.


regards

-- 
Tomas Vondra



Re: scalability bottlenecks with (many) partitions (and more)

From
Ants Aasma
Date:
On Sat, 21 Sept 2024 at 21:33, Tomas Vondra <tomas@vondra.me> wrote:
> I've finally pushed this, after many rounds of careful testing to ensure
> no regressions, and polishing. All changes since the version shared on
> September 13 are only cosmetic - renaming a macro to keep it consistent
> with the other ones, clarifying a couple comments etc. Nothing major.

Great work on this - I have seen multiple customers hitting fast-path
capacity related LockManager contention. They will certainly be glad
to have a fix available when they eventually upgrade. Regretfully I
did not find the time to participate in this discussion during
development. But I did have some thoughts that I wanted to unload to
the list, not as a criticism, but in case it turns out follow-up work
is needed.

Driving the array sizing from max_locks_per_transaction seems like a
good idea. The one major difference from the lock table is that while
the lock table is partitioned dynamically between backends, the fast
path array has a static size per backend. One case where that
distinction matters is when only a fraction of backends try to lock
large numbers of relations. This fraction will still fall back to main
lock tables, but at least the contention should be limited by virtue
of not having too many of those backends. The other case is when
max_connections is much higher than the number of backends actually
used. Then backends may be consuming well over
max_locks_per_transaction without running into lock table capacity
issues.

In both cases users will have the simple workaround of just increasing
the max_locks_per_transaction setting.  Still, I'm sure they would be
happier if things just worked without any tuning. So I tried to figure
out some scheme to get dynamic allocation of fast path locks.

The best data structure I came up with was to have a shared fast path
lock array. Still partitioned as a 16-way associative cache, but
indexed by hash(BackendId, RelationId). fpLockBits can be stuffed into
the high byte of BackendId thanks to MAX_BACKENDS. Locking could be
handled by one lock per way, or at least on cursory glance it
shouldn't be too difficult to convert the whole fast path acquisition
to be lock free.
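
For illustration, such a shared table might look roughly like the sketch
below (a hypothetical sketch with invented names and a stand-in hash, not
an actual proposal or patch; it assumes the usual PostgreSQL headers for
uint32 etc.):

    #define SFP_WAYS        16      /* 16-way associative, as above */
    #define SFP_BITS_SHIFT  24      /* MAX_BACKENDS fits in 18 bits, so the
                                     * top byte can carry the lock bits */

    typedef struct SharedFastPathSlot
    {
        uint32      backend_and_bits;   /* backend | (lockbits << SFP_BITS_SHIFT) */
        uint32      relid;              /* OID of the locked relation */
    } SharedFastPathSlot;

    typedef struct SharedFastPathSet
    {
        /* one LWLock per set (or per way) would live here */
        SharedFastPathSlot  slots[SFP_WAYS];
    } SharedFastPathSet;

    /* stand-in for hash_combine(hash_uint32(backend), hash_uint32(relid)) */
    static inline uint32
    shared_fp_set(uint32 backend, uint32 relid, uint32 nsets)
    {
        return (backend * 0x9E3779B9 ^ relid * 0x85EBCA6B) % nsets;
    }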

Either way, it feels like structuring the array this way could result
in a large amount of false sharing of cache lines. Current static
allocation means that each process needs to touch only a small set of
cache lines only referenced by itself - quite probable to keep those
lines in CPU local L2 in exclusive mode. In a shared array a larger
number of cache lines are needed and they will be concurrently written
to by other backends - lots of invalidation messages and cache line
bouncing. I don't know how large this effect will be without doing a
prototype and running it on a large machine with high core-to-core
latencies.

It would be possible to create a hybrid approach of a small local FP
array servicing the majority of acquisitions with a larger shared
victim cache for exceptional cases. But it doesn't feel like it is
worth the complexity. At least not without seeing some example
workloads where it would help. And even then, maybe using hierarchical
locking to do less work is the better approach.

Being optimistic, perhaps the current patch was enough to resolve the issue.

--
Ants Aasma
Senior Database Engineer
www.cybertec-postgresql.com



Re: scalability bottlenecks with (many) partitions (and more)

From
Tomas Vondra
Date:

On 9/22/24 10:50, Ants Aasma wrote:
> On Sat, 21 Sept 2024 at 21:33, Tomas Vondra <tomas@vondra.me> wrote:
>> I've finally pushed this, after many rounds of careful testing to ensure
>> no regressions, and polishing. All changes since the version shared on
>> September 13 are only cosmetic - renaming a macro to keep it consistent
>> with the other ones, clarifying a couple comments etc. Nothing major.
> 
> Great work on this, I have seen multiple customers hitting fast path
> capacity related LockManager contention. They will certainly be glad
> to have a fix available when they eventually upgrade. Regretfully I
> did not find the time to participate in this discussion during
> development. But I did have some thoughts that I wanted to unload to
> the list, not as a criticism, but in case it turns out follow up work
> is needed.
> 
> Driving the array sizing from max_locks_per_transaction seems like a
> good idea. The one major difference from the lock table is that while
> the lock table is partitioned dynamically between backends, the fast
> path array has a static size per backend. One case where that
> distinction matters is when only a fraction of backends try to lock
> large numbers of relations. This fraction will still fall back to main
> lock tables, but at least the contention should be limited by virtue
> of not having too many of those backends. The other case is when
> max_connections is much higher than the number of backends actually
> used. Then backends may be consuming well over
> max_locks_per_transaction without running into lock table capacity
> issues.
> 

I agree. I don't think the case with a couple lock-hungry backends
matters too much, because as you say there can't be too many of them, so
the contention should not be too bad. At least that was my reasoning.

Regarding the case with very high max_connections values - I doubt we 
want to optimize for that very much. Extremely high max_connections 
values are a clear anti-pattern (IMO), and if you choose to do that
anyway, you simply have to accept that connections have costs. The
memory for fast-path locking is one of those costs.

I'm not against improving that, ofc, but I think we should only do that
if it doesn't hurt the "reasonable" setups.

> In both cases users will have the simple workaround of just increasing
> the max_locks_per_transaction setting.  Still, I'm sure they would be
> happier if things just worked without any tuning. So I tried to figure
> out some scheme to get dynamic allocation of fast path locks.
> 

I agree with the premise that less tuning is better. Which is why we
tied this to max_locks_per_transaction.

> The best data structure I came up with was to have a shared fast path
> lock array. Still partitioned as a 16-way associative cache, but
> indexed by hash(BackendId, RelationId). fpLockBits can be stuffed into
> the high byte of BackendId thanks to MAX_BACKENDS. Locking could be
> handled by one lock per way, or at least on cursory glance it
> shouldn't be too difficult to convert the whole fast path acquisition
> to be lock free.
> 
> Either way, it feels like structuring the array this way could result
> in a large amount of false sharing of cache lines. Current static
> allocation means that each process needs to touch only a small set of
> cache lines only referenced by itself - quite probable to keep those
> lines in CPU local L2 in exclusive mode. In a shared array a larger
> number of cache lines are needed and they will be concurrently written
> to by other backends - lots of invalidation messages and cache line
> bouncing. I don't know how large this effect will be without doing a
> prototype and running it on a large machine with high core-to-core
> latencies.
> 

I don't have a very good intuition regarding cachelines. Ideally, the
backends would access disjoint parts of the array, so there should not 
be a lot of false sharing. But maybe I'm wrong, hard to say without an
experimental patch.

> It would be possible to create a hybrid approach of a small local FP
> array servicing the majority of acquisitions with a larger shared
> victim cache for exceptional cases. But it doesn't feel like it is
> worth the complexity. At least not without seeing some example
> workloads where it would help. And even then, maybe using hierarchical
> locking to do less work is the better approach.
> 

Not sure. My intuition would be to keep this as simple as possible. 
Having a shared lock table and also a separate fast-path cache is 
already sufficiently complex; adding a cache for a cache seems a bit too 
much to me.

> Being optimistic, perhaps the current patch was enough to resolve the issue.
> 

It's an improvement. But if you want to give the shared fast-path cache
a try, go ahead - if you write a patch, I promise to review it.


regards

-- 
Tomas Vondra



Re: scalability bottlenecks with (many) partitions (and more)

From
Tom Lane
Date:
Tomas Vondra <tomas@vondra.me> writes:
> I've finally pushed this, after many rounds of careful testing to ensure
> no regressions, and polishing.

Coverity is not terribly happy with this.  "Assert(fpPtr = fpEndPtr);"
is very clearly not doing what you presumably intended.  The others
look like overaggressive assertion checking.  If you don't want those
macros to assume that the argument is unsigned, you could force the
issue, say with

 #define FAST_PATH_GROUP(index)    \
-    (AssertMacro(((index) >= 0) && ((index) < FP_LOCK_SLOTS_PER_BACKEND)), \
+    (AssertMacro((uint32) (index) < FP_LOCK_SLOTS_PER_BACKEND), \
      ((index) / FP_LOCK_SLOTS_PER_GROUP))
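
(And the first complaint is presumably just a matter of comparison vs.
assignment in proc.c, i.e.

    Assert(fpPtr == fpEndPtr);

)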


________________________________________________________________________________________________________
*** CID 1619664:  Incorrect expression  (ASSERT_SIDE_EFFECT)
/srv/coverity/git/pgsql-git/postgresql/src/backend/storage/lmgr/proc.c: 325 in InitProcGlobal()
319             pg_atomic_init_u32(&(proc->procArrayGroupNext), INVALID_PROC_NUMBER);
320             pg_atomic_init_u32(&(proc->clogGroupNext), INVALID_PROC_NUMBER);
321             pg_atomic_init_u64(&(proc->waitStart), 0);
322         }
323
324         /* Should have consumed exactly the expected amount of fast-path memory. */
>>>     CID 1619664:  Incorrect expression  (ASSERT_SIDE_EFFECT)
>>>     Assignment "fpPtr = fpEndPtr" has a side effect.  This code will work differently in a non-debug build.
325         Assert(fpPtr = fpEndPtr);
326
327         /*
328          * Save pointers to the blocks of PGPROC structures reserved for auxiliary
329          * processes and prepared transactions.
330          */

________________________________________________________________________________________________________
*** CID 1619662:  Integer handling issues  (NO_EFFECT)
/srv/coverity/git/pgsql-git/postgresql/src/backend/storage/lmgr/lock.c: 3731 in GetLockStatusData()
3725
3726             LWLockAcquire(&proc->fpInfoLock, LW_SHARED);
3727
3728             for (f = 0; f < FP_LOCK_SLOTS_PER_BACKEND; ++f)
3729             {
3730                 LockInstanceData *instance;
>>>     CID 1619662:  Integer handling issues  (NO_EFFECT)
>>>     This greater-than-or-equal-to-zero comparison of an unsigned value is always true. "f >= 0U".
3731                 uint32        lockbits = FAST_PATH_GET_BITS(proc, f);
3732
3733                 /* Skip unallocated slots. */
3734                 if (!lockbits)
3735                     continue;
3736

________________________________________________________________________________________________________
*** CID 1619661:  Integer handling issues  (NO_EFFECT)
/srv/coverity/git/pgsql-git/postgresql/src/backend/storage/lmgr/lock.c: 2696 in FastPathGrantRelationLock()
2690         uint32        group = FAST_PATH_REL_GROUP(relid);
2691
2692         /* Scan for existing entry for this relid, remembering empty slot. */
2693         for (i = 0; i < FP_LOCK_SLOTS_PER_GROUP; i++)
2694         {
2695             /* index into the whole per-backend array */
>>>     CID 1619661:  Integer handling issues  (NO_EFFECT)
>>>     This greater-than-or-equal-to-zero comparison of an unsigned value is always true. "group >= 0U".
2696             uint32        f = FAST_PATH_SLOT(group, i);
2697
2698             if (FAST_PATH_GET_BITS(MyProc, f) == 0)
2699                 unused_slot = f;
2700             else if (MyProc->fpRelId[f] == relid)
2701             {

________________________________________________________________________________________________________
*** CID 1619660:  Integer handling issues  (NO_EFFECT)
/srv/coverity/git/pgsql-git/postgresql/src/backend/storage/lmgr/lock.c: 2813 in FastPathTransferRelationLocks()
2807
2808             for (j = 0; j < FP_LOCK_SLOTS_PER_GROUP; j++)
2809             {
2810                 uint32        lockmode;
2811
2812                 /* index into the whole per-backend array */
>>>     CID 1619660:  Integer handling issues  (NO_EFFECT)
>>>     This greater-than-or-equal-to-zero comparison of an unsigned value is always true. "group >= 0U".
2813                 uint32        f = FAST_PATH_SLOT(group, j);
2814
2815                 /* Look for an allocated slot matching the given relid. */
2816                 if (relid != proc->fpRelId[f] || FAST_PATH_GET_BITS(proc, f) == 0)
2817                     continue;
2818

________________________________________________________________________________________________________
*** CID 1619659:  Integer handling issues  (NO_EFFECT)
/srv/coverity/git/pgsql-git/postgresql/src/backend/storage/lmgr/lock.c: 3067 in GetLockConflicts()
3061
3062                 for (j = 0; j < FP_LOCK_SLOTS_PER_GROUP; j++)
3063                 {
3064                     uint32        lockmask;
3065
3066                     /* index into the whole per-backend array */
>>>     CID 1619659:  Integer handling issues  (NO_EFFECT)
>>>     This greater-than-or-equal-to-zero comparison of an unsigned value is always true. "group >= 0U".
3067                     uint32        f = FAST_PATH_SLOT(group, j);
3068
3069                     /* Look for an allocated slot matching the given relid. */
3070                     if (relid != proc->fpRelId[f])
3071                         continue;
3072                     lockmask = FAST_PATH_GET_BITS(proc, f);

________________________________________________________________________________________________________
*** CID 1619658:  Integer handling issues  (NO_EFFECT)
/srv/coverity/git/pgsql-git/postgresql/src/backend/storage/lmgr/lock.c: 2739 in FastPathUnGrantRelationLock()
2733         uint32        group = FAST_PATH_REL_GROUP(relid);
2734
2735         FastPathLocalUseCounts[group] = 0;
2736         for (i = 0; i < FP_LOCK_SLOTS_PER_GROUP; i++)
2737         {
2738             /* index into the whole per-backend array */
>>>     CID 1619658:  Integer handling issues  (NO_EFFECT)
>>>     This greater-than-or-equal-to-zero comparison of an unsigned value is always true. "group >= 0U".
2739             uint32        f = FAST_PATH_SLOT(group, i);
2740
2741             if (MyProc->fpRelId[f] == relid
2742                 && FAST_PATH_CHECK_LOCKMODE(MyProc, f, lockmode))
2743             {
2744                 Assert(!result);

________________________________________________________________________________________________________
*** CID 1619657:  Integer handling issues  (NO_EFFECT)
/srv/coverity/git/pgsql-git/postgresql/src/backend/storage/lmgr/lock.c: 2878 in FastPathGetRelationLockEntry()
2872
2873         for (i = 0; i < FP_LOCK_SLOTS_PER_GROUP; i++)
2874         {
2875             uint32        lockmode;
2876
2877             /* index into the whole per-backend array */
>>>     CID 1619657:  Integer handling issues  (NO_EFFECT)
>>>     This greater-than-or-equal-to-zero comparison of an unsigned value is always true. "group >= 0U".
2878             uint32        f = FAST_PATH_SLOT(group, i);
2879
2880             /* Look for an allocated slot matching the given relid. */
2881             if (relid != MyProc->fpRelId[f] || FAST_PATH_GET_BITS(MyProc, f) == 0)
2882                 continue;
2883

            regards, tom lane



Re: scalability bottlenecks with (many) partitions (and more)

From
Tomas Vondra
Date:
On 9/22/24 17:45, Tom Lane wrote:
> Tomas Vondra <tomas@vondra.me> writes:
>> I've finally pushed this, after many rounds of careful testing to ensure
>> no regressions, and polishing.
> 
> Coverity is not terribly happy with this.  "Assert(fpPtr = fpEndPtr);"
> is very clearly not doing what you presumably intended.  The others
> look like overaggressive assertion checking.  If you don't want those
> macros to assume that the argument is unsigned, you could force the
> issue, say with
> 
>  #define FAST_PATH_GROUP(index)    \
> -    (AssertMacro(((index) >= 0) && ((index) < FP_LOCK_SLOTS_PER_BACKEND)), \
> +    (AssertMacro((uint32) (index) < FP_LOCK_SLOTS_PER_BACKEND), \
>       ((index) / FP_LOCK_SLOTS_PER_GROUP))
> 

Ah, you're right. I'll fix those asserts tomorrow.

The first is clearly wrong, of course.

For the (x >= 0) asserts, doing it this way relies on negative values
wrapping to large positive ones, correct? AFAIK it's guaranteed to be a
very large value, so it can't accidentally be less than the slot count.
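
E.g. a quick standalone check of what I mean (not PostgreSQL code, just
the cast behavior, with a made-up slot count):

    #include <assert.h>
    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    #define SLOTS 1024              /* stand-in for FP_LOCK_SLOTS_PER_BACKEND */

    int
    main(void)
    {
        int     index = -1;

        /* the cast wraps modulo 2^32, so -1 becomes UINT32_MAX */
        printf("%" PRIu32 "\n", (uint32_t) index);  /* prints 4294967295 */

        /* hence the single unsigned comparison also rejects negative values */
        assert(!((uint32_t) index < SLOTS));

        return 0;
    }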


regards

-- 
Tomas Vondra



Re: scalability bottlenecks with (many) partitions (and more)

From
Tom Lane
Date:
Tomas Vondra <tomas@vondra.me> writes:
> On 9/22/24 17:45, Tom Lane wrote:
>> #define FAST_PATH_GROUP(index)    \
>> -    (AssertMacro(((index) >= 0) && ((index) < FP_LOCK_SLOTS_PER_BACKEND)), \
>> +    (AssertMacro((uint32) (index) < FP_LOCK_SLOTS_PER_BACKEND), \
>> ((index) / FP_LOCK_SLOTS_PER_GROUP))

> For the (x >= 0) asserts, doing it this way relies on negative values
> wrapping to large positive ones, correct? AFAIK it's guaranteed to be a
> very large value, so it can't accidentally be less than the slot count.

Right, any negative value would wrap to something more than
INT32_MAX.

            regards, tom lane



Re: scalability bottlenecks with (many) partitions (and more)

From
Jakub Wartak
Date:
On Mon, Sep 16, 2024 at 4:19 PM Tomas Vondra <tomas@vondra.me> wrote:
> On 9/16/24 15:11, Jakub Wartak wrote:
> > On Fri, Sep 13, 2024 at 1:45 AM Tomas Vondra <tomas@vondra.me> wrote:
> >
> >> [..]
> >
> >> Anyway, at this point I'm quite happy with this improvement. I didn't
> >> have any clear plan when to commit this, but I'm considering doing so
> >> sometime next week, unless someone objects or asks for some additional
> >> benchmarks etc.
> >
> > Thank you very much for working on this :)
> >
> > The only fact that comes to my mind is that we could blow up L2
> > caches. Fun fact, so if we are growing PGPROC by 6.3x, that's going to
> > be like one or two 2MB huge pages more @ common max_connections=1000
> > x86_64 (830kB -> ~5.1MB), and indeed:
[..]
> > then maybe(?) one could observe further degradation of dTLB misses in
> > the perf-stat counter under some microbenchmark, but measuring that
> > requires isolated and physical hardware. Maybe that would be actually
> > noise due to overhead of context-switches itself. Just trying to think
> > out loud, what big PGPROC could cause here. But this is already an
> > unhealthy and non-steady state of the system, so IMHO we are good,
> > unless someone comes up with a better (more evil) idea.
> >
>
> I've been thinking about such cases too, but I don't think it can really
> happen in practice, because:
>
> - How likely is it that the sessions will need a lot of OIDs, but not
> the same ones? Also, why would it matter that the OIDs are not the same,
> I don't think it matters unless one of the sessions needs an exclusive
> lock, at which point the optimization doesn't really matter.
>
> - If having more fast-path slots means it doesn't fit into L2 cache,
> would we fit into L2 without it? I don't think so - if there really are
> that many locks, we'd have to add those into the shared lock table, and
> there's a lot of extra stuff to keep in memory (relcaches, ...).
>
> This is pretty much one of the cases I focused on in my benchmarking,
> and I'm yet to see any regression.

Sorry for answering this so late. Just for context here: I was
imagining a scenario with high max_connections, e.g. schema-based
multi-tenancy with no partitioning (so everything would be fine without
this $thread/commit, with fewer than 16 fast-path locks taken). The
OIDs need to be different to avoid lock contention, so that futex()
stays in its user-space part and doesn't actually end up in a syscall.
My theory was that the much smaller PGPROC should need far fewer (data)
cache-line fetches than with the patch: the hash() % prime mapping hits
various parts of a larger array, so without the patch it should be
quicker, as it wouldn't be randomly touching that larger array[] - but
it might be noise, as you say. It was a theoretical attempt at crafting
the worst possible conditions for the patch, so feel free to disregard
it, as it already assumes an anti-pattern (large and fully active
max_connections).
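
Just to illustrate the kind of mapping I mean (a rough sketch only; the
prime, the group count and the names are illustrative, not the exact
values from the committed patch):

    #include <stdint.h>

    /* illustrative sizes, roughly mimicking a larger fast-path array */
    #define FP_GROUPS_ILLUSTRATIVE   16
    #define FP_SLOTS_PER_GROUP       16

    /*
     * Map a relation OID to a group by multiplying with a prime and taking
     * a modulo, so different OIDs land in different parts of the (now
     * larger) fpRelId[] array - which is what I suspected could cost extra
     * data cache-line fetches compared to the old single 16-slot array.
     */
    static inline uint32_t
    fp_group_for_relid(uint32_t relid)
    {
        return (uint32_t) (((uint64_t) relid * 49157) % FP_GROUPS_ILLUSTRATIVE);
    }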

> > Well the only thing I could think of was to add to the
> > doc/src/sgml/config.sgml / "max_locks_per_transaction" GUC, that "it
> > is also used as advisory for the number of groups used in
> > lockmanager's fast-path implementation" (that is, without going into
> > further discussion, as even pg_locks discussion
> > doc/src/sgml/system-views.sgml simply uses that term).
> >
>
> Thanks, I'll consider mentioning this in max_locks_per_transaction.
> Also, I think there's a place calculating the amount of per-connection
> memory, so maybe that needs to be updated too.
>

I couldn't find it in the current versions, but maybe this is
helpful/reaffirming:
- up to 9.2 the docs gave exact formulas, see "(1800 + 270 *
max_locks_per_transaction) * max_connections" [1] (a quick worked
example below) - but that's long gone now.
- if anything, Andres might want to improve his blog entry [2] a little
(my take is that it seems to be the most accurate and authoritative
technical information we have online).
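
As a back-of-the-envelope example of that old approximation, plugging
in the max_connections=1000 from above and the default
max_locks_per_transaction=64:

    (1800 + 270 * 64) * 1000 = 19,080,000 bytes, i.e. roughly 18 MiB

(only meant to give a sense of scale of what the 9.2-era docs estimated
for lock-related shared memory, IIRC).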

-J.

[1] - https://www.postgresql.org/docs/9.2/kernel-resources.html
[2] - https://blog.anarazel.de/2020/10/07/measuring-the-memory-overhead-of-a-postgres-connection/



Re: scalability bottlenecks with (many) partitions (and more)

From
Tomas Vondra
Date:
On 9/23/24 01:06, Tom Lane wrote:
> Tomas Vondra <tomas@vondra.me> writes:
>> On 9/22/24 17:45, Tom Lane wrote:
>>> #define FAST_PATH_GROUP(index)    \
>>> -    (AssertMacro(((index) >= 0) && ((index) < FP_LOCK_SLOTS_PER_BACKEND)), \
>>> +    (AssertMacro((uint32) (index) < FP_LOCK_SLOTS_PER_BACKEND), \
>>> ((index) / FP_LOCK_SLOTS_PER_GROUP))
> 
>> For the (x >= 0) asserts, doing it this way relies on negative values
>> wrapping to large positive ones, correct? AFAIK it's guaranteed to be a
>> very large value, so it can't accidentally be less than the slot count.
> 
> Right, any negative value would wrap to something more than
> INT32_MAX.
> 

Thanks. Pushed a fix for these issues, hopefully Coverity will be happy.

BTW is the Coverity report accessible somewhere? I know someone
mentioned that in the past, but I don't recall the details. Maybe we
should have a list of all these resources, useful for committers,
somewhere on the wiki?


regards

-- 
Tomas Vondra



Re: scalability bottlenecks with (many) partitions (and more)

From
Tom Lane
Date:
Tomas Vondra <tomas@vondra.me> writes:
> Thanks. Pushed a fix for these issues, hopefully coverity will be happy.

Thanks.

> BTW is the coverity report accessible somewhere? I know someone
> mentioned that in the past, but I don't recall the details. Maybe we
> should have a list of all these resources, useful for committers,
> somewhere on the wiki?

Currently those reports only go to the security team.  Perhaps
we should rethink that?

            regards, tom lane