Re: Configurable FP_LOCK_SLOTS_PER_BACKEND - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Configurable FP_LOCK_SLOTS_PER_BACKEND
Date
Msg-id 116ef01e-942a-22c1-a2af-35bf69c1b07b@enterprisedb.com
In response to Re: Configurable FP_LOCK_SLOTS_PER_BACKEND  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: Configurable FP_LOCK_SLOTS_PER_BACKEND
Re: Configurable FP_LOCK_SLOTS_PER_BACKEND
List pgsql-hackers
On 8/6/23 16:44, Tomas Vondra wrote:
> ...
>
> I'm not sure I'll have time to hack on this soon, but if someone else
> wants to take a stab at it and produce a minimal patch, I might be able
> to run more tests on it.
> 

Nah, I gave it another try, handling the bitmap in a different way, and
that happened to work sufficiently well. So here are some results and
also the bench script.

Note: I haven't reproduced the terrible regressions described in this
thread. Either I don't have a large enough system, the workload isn't
quite right, or maybe it's due to the commit missing in older branches
(mentioned by Andres). Still, the findings seem interesting.

The workload is very simple - create a table "t" with a certain number of
partitions and indexes, add a certain number of rows (100, 10k or 1M), do
"select count(*) from t", and measure throughput. There's also a script
collecting wait-event/lock info, but I haven't looked at that.

I did this for current master (17dev), with two patch versions.

master - current master, with 16 fast-path slots

v1 increases the number of slots to 64, and switches to a single array
combining the bitmap and the OIDs. I'm sure the original approach (a
separate bitmap) can be made to work, and it's possible this is
responsible for a small regression in some runs (more on that in a minute).
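
For illustration, here's a minimal sketch of what the "single array"
layout might look like. This is not the actual patch - the names
(FP_SLOTS, FastPathSlot, fpSlots, fast_path_find_slot) are made up - but
on master the same information is kept in PGPROC as a separate fpLockBits
bitmap plus an fpRelId[] array of OIDs, so the sketch simply merges the two:

    /*
     * Hypothetical sketch only, not the actual patch: one array entry per
     * fast-path slot, keeping the relation OID and its lock-mode bits
     * together (on master these live in PGPROC.fpRelId[] and the
     * PGPROC.fpLockBits bitmap, respectively).
     */
    #include "postgres.h"

    #define FP_SLOTS 64             /* 16 on current master */

    typedef struct FastPathSlot
    {
        Oid     relid;              /* InvalidOid if the slot is unused */
        uint8   lockbits;           /* one bit per weak lock mode held */
    } FastPathSlot;

    /* per-backend array; a PGPROC field in the real thing */
    static FastPathSlot fpSlots[FP_SLOTS];

    /* find the slot for a relation, or -1 if it's not fast-path locked */
    static int
    fast_path_find_slot(Oid relid)
    {
        for (int i = 0; i < FP_SLOTS; i++)
        {
            if (fpSlots[i].lockbits != 0 && fpSlots[i].relid == relid)
                return i;
        }
        return -1;
    }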

v2 was an attempt to address the small regressions in v1, which may be
due to having to search larger arrays. The code always walks the whole
array, even if we know there have never been that many entries. So this
tries to track the last used slot, and stops the loops earlier.
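
Again just a sketch of the idea, not the actual patch (fpLastUsed is a
hypothetical field): remember the highest slot index ever used, so the
lookup/release loops can stop there instead of scanning all 64 entries.

    /*
     * Hypothetical sketch of the v2 idea, building on the FastPathSlot
     * sketch above: fpLastUsed tracks the highest slot index ever used,
     * so loops don't have to visit slots that were never populated.
     */
    static int  fpLastUsed = -1;    /* no slot used yet */

    static int
    fast_path_find_slot_v2(Oid relid)
    {
        for (int i = 0; i <= fpLastUsed; i++)
        {
            if (fpSlots[i].lockbits != 0 && fpSlots[i].relid == relid)
                return i;
        }
        return -1;
    }

    /* when acquiring a lock into slot i: fpLastUsed = Max(fpLastUsed, i); */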

The attached PDF visualizes the results and the differences between master
and the two patches. It plots throughput against the number of rows, tables
and indexes, and concurrent clients.

The last two columns show throughput vs. master, with a simple color
scale: green - speedup (good), red - regression (bad).

Let's talk about the smallest data set (100 rows). The 10k test has the
same behavior, with smaller differences (as the locking accounts for a
smaller part of the total duration). On the 1M data set the patches make
almost no difference.

There's a pretty clear flip once we reach 16 partitions - on master the
throughput drops from 310k tps to 210k tps (for 32 clients; the machine
has 32 cores). With both patches the drop is only to about 240k tps, so
~20% improvement compared to master.

The other interesting thing is behavior with many clients:

              1     16     32     64     96    128
  master  17603 169132 204870 199390 199662 196774
  v1      17062 180475 234644 266372 267105 265885
  v2      18972 187294 242838 275064 275931 274332

So the master "stagnates" or maybe even drops off, while with both
patches the throughput continues to grow beyond 32 clients. This is even
more obvious for 32 or 64 partitions - for 32, the results are

              1     16     32     64     96    128
  master  11292  93783 111223 102580  95197  87800
  v1      12025 123503 168382 179952 180191 179846
  v2      12501 126438 174255 185435 185332 184938

That's a pretty massive improvement, IMO. Would help OLTP scalability.

For 10k rows the pattern is the same, although the differences are less
significant. For 1M rows there's no speedup.

The bad news is this seems to have a negative impact on cases with few
partitions, which would fit into the 16 slots. That's not surprising - the
code has to walk longer arrays, which probably affects caching etc. So this
would hurt systems that don't use that many relations - not much, but
still.

The regression appears to be consistently ~3%, and v2 aimed to improve
that - at least for the case with just 100 rows. It even gains ~5% in a
couple of cases. It's a bit strange, however, that v2 doesn't really help
the two larger cases.

Overall, I think this seems interesting - it's hard not to like doubling
the throughput in some cases. Yes, it's 100 rows only and the real
improvements are bound to be smaller, but it would still help short OLTP
queries that only process a couple of rows.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
