Thread: LWLocks by LockManager slowing large DB
Hello, apologies for the long post, but I want to make sure I’ve got enough details to describe the problem for y’all.
I’ve got a 64-core instance at GCP (Ubuntu 18.04, 240 GB RAM) running PG 13.2 and PostGIS 3.1.1, and we’re having trouble getting it to run more than 30 or so large queries at the same time accessing the same tables. With 60 threads, each thread runs at only ~30% CPU and there is no disk I/O or iowait (once the tables become cached).
Boiling the complex queries down to their simplest form, we test running 60 of this query simultaneously:
select count(*)
from travel_processing_v5.llc_zone z,
     parent_set10.usca_trip_points7 t
where t.year_num = 2019 and t.month_num = 9
  and st_intersects(t.lock_geom, z.s_geom)
  and st_intersects(t.lock_geom, z.e_geom);
llc_zone = 981 rows (568 kB on disk), with s_geom and e_geom both of datatype geometry(MultiPolygon, 2163)
usca_trip_points7 = 79 million rows (469 GB on disk), with lock_geom of datatype geometry(LineString, 2163)
(more detailed schema/stats can be provided if helpful)
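For orientation, the relevant columns boil down to roughly this (a simplified sketch based on the descriptions above; other columns are omitted and the year/month column types are assumed):

create table travel_processing_v5.llc_zone (
    -- ... other columns omitted ...
    s_geom geometry(MultiPolygon, 2163),
    e_geom geometry(MultiPolygon, 2163)
);

create table parent_set10.usca_trip_points7 (
    -- ... other columns omitted ...
    year_num  integer,
    month_num integer,
    lock_geom geometry(LineString, 2163)
);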
postgresql.conf is pretty normal for a large system like this (appropriate shared_buffers, work_mem, etc. – can be provided if helpful, too).
What I’m finding in pg_stat_activity while this is running is lots of backends with wait_event_type ‘LWLock’ and wait_event ‘LockManager’.
Rebuilding with ./configure CFLAGS="-fno-omit-frame-pointer" CPPFLAGS='-DLOCK_DEBUG' --prefix=/usr/local/pgsql_13debug --enable-dtrace and then setting trace_lwlocks yields lots of records looking like:
[39691] LOG: 39691: LWLockAcquire(LockManager 0x7fab2cc09d80): excl 0 shared 0 haswaiters 1 waiters 6 rOK 1
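For completeness, in such a LOCK_DEBUG build the tracing itself is just a developer GUC - a sketch, assuming a superuser session:

SET trace_lwlocks = on;  -- only exists when compiled with -DLOCK_DEBUG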
Does anyone have any advice on how to alleviate LockManager’s LWlock issue?
Thanks for any assistance!
---Paul
Paul Friedman
CTO
677 Harrison St | San Francisco, CA 94107
M: (650) 270-7676
E-mail: paul.friedman@streetlightdata.com
Hi,

On 2021-04-12 12:37:42 -0700, Paul Friedman wrote:
> Boiling the complex queries down to their simplest form, we test running 60
> of this query simultaneously:

How long does one execution of these queries take (on average)? The likely
bottlenecks are very different between running 60 concurrent queries that
each complete in 0.1ms and ones that take > 1s.

Could you show the results for a query like

SELECT state, backend_type, wait_event_type, wait_event, count(*)
FROM pg_stat_activity
GROUP BY 1, 2, 3, 4
ORDER BY count(*) DESC;

? Without knowing the proportion of LockManager wait events compared to the
rest it's hard to know what to make of it.

> Does anyone have any advice on how to alleviate LockManager’s LWlock issue?

It'd be useful to get a perf profile for this. Both for cpu time and for what
ends up blocking on kernel-level locks. E.g. something like

# cpu time
perf record --call-graph dwarf -F 500 -a sleep 5
perf report --no-children --sort comm,symbol

To send it to the list you can use something like
perf report --no-children --sort comm,symbol | head -n 500 > somefile

# kernel-level blocking on locks
perf record --call-graph dwarf -e syscalls:sys_enter_futex -a sleep 3
perf report --no-children --sort comm,symbol

Greetings,

Andres Freund
Thanks for the quick reply!
These queries take ~1hr and are the only thing running on the system (all 60 are launched at the same time and the tables/files are fully-primed into memory so iowaits are basically zero).
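(One way to prime a table into cache, for anyone trying to reproduce this, is pg_prewarm - a sketch, assuming the extension is installed, and not necessarily the exact method used here:

SELECT pg_prewarm('parent_set10.usca_trip_points7', 'read');  -- read into the OS cache
)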
Yes, that’s the same query I’ve been running to analyze the locks and this is the problem:
SELECT state, backend_type, wait_event_type, wait_event, count(*) FROM pg_stat_activity GROUP BY 1, 2, 3, 4 ORDER BY count(*) DESC;
 state  |         backend_type         | wait_event_type |     wait_event      | count
--------+------------------------------+-----------------+---------------------+-------
 active | client backend               | LWLock          | LockManager         |    28
 active | client backend               |                 |                     |    21
        | autovacuum launcher          | Activity        | AutoVacuumMain      |     1
        | logical replication launcher | Activity        | LogicalLauncherMain |     1
        | checkpointer                 | Activity        | CheckpointerMain    |     1
 idle   | client backend               | Client          | ClientRead          |     1
        | background writer            | Activity        | BgWriterMain        |     1
        | walwriter                    | Activity        | WalWriterMain       |     1
Thanks again for any advice you have.
---Paul
Hi,

On 2021-04-12 15:15:05 -0700, Paul Friedman wrote:
> Thanks again for any advice you have.

I think we'd need the perf profiles to be able to dig into this further.

It's odd that there is a meaningful amount of LockManager contention in your
case - assuming the stats you collected weren't just the first few
milliseconds of starting those 60 queries, there shouldn't be any additional
"heavyweight locks" taken given the duration of your queries.

The futex profile hopefully will tell us where that is coming from...

Greetings,

Andres Freund
Yes, I ran the query after a couple of minutes. Those are the steady-state
numbers. Also 'top' shows:

top - 22:44:26 up 12 days, 23:14,  5 users,  load average: 20.99, 21.35, 19.27
Tasks: 859 total,  26 running, 539 sleeping,   0 stopped,   0 zombie
%Cpu(s): 34.3 us,  1.6 sy,  0.0 ni, 64.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 24742353+total, 33723356 free, 73160656 used, 14053952+buff/cache
KiB Swap:        0 total,        0 free,        0 used. 17132937+avail Mem

  PID USER      PR  NI    VIRT   RES   SHR S  %CPU %MEM    TIME+ COMMAND
30070 postgres  20   0 41.232g 28608 18192 R  53.8  0.0 17:35.74 postgres
30087 postgres  20   0 41.233g 28408 18180 S  53.8  0.0 17:35.69 postgres
30055 postgres  20   0 41.233g 28492 18120 R  53.5  0.0 17:41.51 postgres

Note the postgres processes are only running at ~53% CPU with the system at
64% idle. The 1.7% system time seems indicative of the spinlocks blocking the
major processing.

Do you know what resource the LockManager might be blocking on/protecting
with these LWLocks?

Also, I didn't understand your comment about a 'futex profile', could you
point me in the right direction here?

Thanks.

---Paul
Hi,

On 2021-04-12 15:56:08 -0700, Paul Friedman wrote:
> Also, I didn't understand your comment about a 'futex profile', could you
> point me in the right direction here?

My earlier mail included a section about running a perf profile showing the
callers of the futex() system call, which in turn is what lwlocks end up
using on linux when the lock is contended.

Check the second half of:
https://www.postgresql.org/message-id/20210412215738.xytq33wlciljyva5%40alap3.anarazel.de

Greetings,

Andres Freund
Thanks for this, I read too quickly!

I've attached the 2 perf reports. From the 2nd one, I can see lots of time
waiting for TOAST table locks on the geometry column, but I definitely don't
fully understand the implications or why LockManager would be struggling
here.

Thanks for the continued help!

---Paul
Attachment
Thanks for this – these tools (and the raw selects on pg_stat_activity and pg_locks) all show the wait events coming from backends waiting on the LockManager LWLock.
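A select along these lines (a sketch of the kind of raw query meant here) shows, per active backend, how many lock-table entries it holds; the non-fastpath entries are the ones that go through the shared lock manager that this LWLock protects:

SELECT a.pid, a.wait_event_type, a.wait_event,
       count(*) FILTER (WHERE NOT l.fastpath) AS non_fastpath_locks,
       count(*) AS lock_entries
FROM pg_stat_activity a
JOIN pg_locks l ON l.pid = a.pid
WHERE a.backend_type = 'client backend'
GROUP BY 1, 2, 3
ORDER BY non_fastpath_locks DESC;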
---Paul
From: Nikolay Samokhvalov <samokhvalov@gmail.com>
Sent: Monday, April 12, 2021 4:34 PM
To: Andres Freund <andres@anarazel.de>
Cc: Paul Friedman <paul.friedman@streetlightdata.com>; pgsql-performance@lists.postgresql.org
Subject: Re: LWLocks by LockManager slowing large DB
On Mon, Apr 12, 2021 at 14:57 Andres Freund <andres@anarazel.de> wrote:
> Without knowing the proportion of LockManager wait events compared to
> the rest it's hard to know what to make of it.
These OSS tools can be useful to understand the proportion:
- pgCenter
- pg_wait_sampling (can be used together with POWA monitoring)
- pgsentinel
- PASH Viewer (good for visualization, integrates with pgsentinel)
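For example, with pg_wait_sampling (assuming it is in shared_preload_libraries and its pg_wait_sampling_profile view is available), something like this shows the overall proportions:

CREATE EXTENSION IF NOT EXISTS pg_wait_sampling;

SELECT event_type, event, sum(count) AS samples
FROM pg_wait_sampling_profile
GROUP BY 1, 2
ORDER BY samples DESC
LIMIT 10;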
Hi,

On 2021-04-13 09:33:48 -0700, Paul Friedman wrote:
> I've attached the 2 perf reports. From the 2nd one, I can see lots of
> time waiting for TOAST table locks on the geometry column, but I
> definitely don't fully understand the implications or why LockManager
> would be struggling here.

Oh, that is interesting. For toast tables we do not keep locks held for the
duration of the transaction, but release the lock as soon as one access is
done. It seems your workload is doing so many toast accesses that the
table / index level locking for toast tables gets to be the bottleneck.

It'd be interesting to know if the issue vanishes if you force the lock on
the toast table and its index to be acquired explicitly.

You can find the toast table names with something like:

SELECT reltoastrelid::regclass
FROM pg_class
WHERE oid IN ('travel_processing_v5.llc_zone'::regclass,
              'parent_set10.usca_trip_points7'::regclass);

That should give you two relation names looking like
"pg_toast.pg_toast_24576", just with a different number.

If you then change your workload to be (with adjusted OIDs of course):

BEGIN;
SELECT * FROM pg_toast.pg_toast_24576 LIMIT 0;
SELECT * FROM pg_toast.pg_toast_64454 LIMIT 0;
<your query>
COMMIT;

Does the scalability issue vanish?

Regards,

Andres
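A quick way to check whether the explicit toast locks are actually held for the whole transaction is to inspect pg_locks from a second session while the transaction above is open, for example:

SELECT l.pid, l.relation::regclass, l.mode, l.granted
FROM pg_locks l
JOIN pg_class c ON c.oid = l.relation
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = 'pg_toast';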
YES!!! This completely alleviates the bottleneck and I'm able to run the
queries full-throttle. Thank you SO much for your help+insight.

Interestingly, "lock pg_toast.pg_toast_2233612264 in ACCESS SHARE MODE;"
which should do the same thing returns an error:
ERROR: "pg_toast_2233612264" is not a table or view

Sounds like I should file this as a requested improvement?

Thanks again.

---Paul
Hi,

On 2021-04-13 11:47:06 -0700, Paul Friedman wrote:
> YES!!! This completely alleviates the bottleneck and I'm able to run the
> queries full-throttle. Thank you SO much for your help+insight.

Cool. And damn: I can't immediately think of a way to optimize this to not
require this kind of hack in the future.

> Interestingly, "lock pg_toast.pg_toast_2233612264 in ACCESS SHARE MODE;"
> which should do the same thing returns an error " ERROR:
> "pg_toast_2233612264" is not a table or view"
>
> Sounds like I should file this as a requested improvement?

The ability to lock a toast table? Yea, it might be worth doing that. I seem
to recall this being discussed not too long ago...

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2021-04-13 11:47:06 -0700, Paul Friedman wrote:
>> YES!!! This completely alleviates the bottleneck and I'm able to run the
>> queries full-throttle. Thank you SO much for your help+insight.

> Cool. And damn: I can't immediately think of a way to optimize this to
> not require this kind of hack in the future.

Maybe the same thing we do with user tables, ie not give up the lock
when we close a toast rel? As long as the internal lock counters
are 64-bit, we'd not have to worry about overflowing them.

			regards, tom lane
> For toast tables we do not keep locks held for the duration of the
> transaction, but release the lock as soon as one access is done.
...
> The ability to lock a toast table? Yea, it might be worth doing that. I
> seem to recall this being discussed not too long ago...

Actually, the requested improvement I was thinking of was to have the locks
on the toast table somehow have the same lifespan as the locks on the main
table, to avoid this problem to begin with.

---Paul
I wrote:
> Andres Freund <andres@anarazel.de> writes:
>> Cool. And damn: I can't immediately think of a way to optimize this to
>> not require this kind of hack in the future.

> Maybe the same thing we do with user tables, ie not give up the lock
> when we close a toast rel? As long as the internal lock counters
> are 64-bit, we'd not have to worry about overflowing them.

Like this? This passes check-world, modulo the one very-unsurprising
regression test change. I've not tried to do any performance testing.

			regards, tom lane

diff --git a/src/backend/access/common/detoast.c b/src/backend/access/common/detoast.c
index 545a6b8867..3826b4d93e 100644
--- a/src/backend/access/common/detoast.c
+++ b/src/backend/access/common/detoast.c
@@ -367,7 +367,7 @@ toast_fetch_datum(struct varlena *attr)
      * case.
      */

     /*
-     * Open the toast relation and its indexes
+     * Open the toast relation (but its indexes are dealt with by the AM)
      */
     toastrel = table_open(toast_pointer.va_toastrelid, AccessShareLock);

@@ -376,7 +376,7 @@ toast_fetch_datum(struct varlena *attr)
                              attrsize, 0, attrsize, result);

     /* Close toast table */
-    table_close(toastrel, AccessShareLock);
+    table_close(toastrel, NoLock);

     return result;
 }
@@ -457,7 +457,7 @@ toast_fetch_datum_slice(struct varlena *attr, int32 sliceoffset,
                              result);

     /* Close toast table */
-    table_close(toastrel, AccessShareLock);
+    table_close(toastrel, NoLock);

     return result;
 }
diff --git a/src/backend/access/common/toast_internals.c b/src/backend/access/common/toast_internals.c
index 730cd04a2d..f0d0251ce0 100644
--- a/src/backend/access/common/toast_internals.c
+++ b/src/backend/access/common/toast_internals.c
@@ -27,7 +27,7 @@
 #include "utils/rel.h"
 #include "utils/snapmgr.h"

-static bool toastrel_valueid_exists(Relation toastrel, Oid valueid);
+static bool toastrel_valueid_exists(Relation toastrel, Oid valueid, LOCKMODE lock);
 static bool toastid_valueid_exists(Oid toastrelid, Oid valueid);

 /* ----------
@@ -264,7 +264,8 @@ toast_save_datum(Relation rel, Datum value,
          * be reclaimed by VACUUM.
          */
         if (toastrel_valueid_exists(toastrel,
-                                    toast_pointer.va_valueid))
+                                    toast_pointer.va_valueid,
+                                    RowExclusiveLock))
         {
             /* Match, so short-circuit the data storage loop below */
             data_todo = 0;
@@ -359,8 +360,8 @@ toast_save_datum(Relation rel, Datum value,
     /*
      * Done - close toast relation and its indexes
      */
-    toast_close_indexes(toastidxs, num_indexes, RowExclusiveLock);
-    table_close(toastrel, RowExclusiveLock);
+    toast_close_indexes(toastidxs, num_indexes, NoLock);
+    table_close(toastrel, NoLock);

     /*
      * Create the TOAST pointer value that we'll return
@@ -440,8 +441,8 @@ toast_delete_datum(Relation rel, Datum value, bool is_speculative)
      * End scan and close relations
      */
     systable_endscan_ordered(toastscan);
-    toast_close_indexes(toastidxs, num_indexes, RowExclusiveLock);
-    table_close(toastrel, RowExclusiveLock);
+    toast_close_indexes(toastidxs, num_indexes, NoLock);
+    table_close(toastrel, NoLock);
 }

 /* ----------
@@ -453,7 +454,7 @@ toast_delete_datum(Relation rel, Datum value, bool is_speculative)
  * ----------
  */
 static bool
-toastrel_valueid_exists(Relation toastrel, Oid valueid)
+toastrel_valueid_exists(Relation toastrel, Oid valueid, LOCKMODE lock)
 {
     bool        result = false;
     ScanKeyData toastkey;
@@ -464,7 +465,7 @@ toastrel_valueid_exists(Relation toastrel, Oid valueid)

     /* Fetch a valid index relation */
     validIndex = toast_open_indexes(toastrel,
-                                    RowExclusiveLock,
+                                    lock,
                                     &toastidxs,
                                     &num_indexes);

@@ -489,7 +490,7 @@ toastrel_valueid_exists(Relation toastrel, Oid valueid)
     systable_endscan(toastscan);

     /* Clean up */
-    toast_close_indexes(toastidxs, num_indexes, RowExclusiveLock);
+    toast_close_indexes(toastidxs, num_indexes, NoLock);

     return result;
 }
@@ -508,9 +509,9 @@ toastid_valueid_exists(Oid toastrelid, Oid valueid)

     toastrel = table_open(toastrelid, AccessShareLock);

-    result = toastrel_valueid_exists(toastrel, valueid);
+    result = toastrel_valueid_exists(toastrel, valueid, AccessShareLock);

-    table_close(toastrel, AccessShareLock);
+    table_close(toastrel, NoLock);

     return result;
 }
diff --git a/src/backend/access/heap/heaptoast.c b/src/backend/access/heap/heaptoast.c
index 55bbe1d584..c3bfb97c62 100644
--- a/src/backend/access/heap/heaptoast.c
+++ b/src/backend/access/heap/heaptoast.c
@@ -789,5 +789,5 @@ heap_fetch_toast_slice(Relation toastrel, Oid valueid, int32 attrsize,

     /* End scan and close indexes. */
     systable_endscan_ordered(toastscan);
-    toast_close_indexes(toastidxs, num_indexes, AccessShareLock);
+    toast_close_indexes(toastidxs, num_indexes, NoLock);
 }
diff --git a/src/test/regress/expected/alter_table.out b/src/test/regress/expected/alter_table.out
index ec14b060a6..b7a007904e 100644
--- a/src/test/regress/expected/alter_table.out
+++ b/src/test/regress/expected/alter_table.out
@@ -2610,7 +2610,9 @@ select * from my_locks order by 1;
  relname   | max_lockmode
 -----------+--------------------------
  alterlock | ShareUpdateExclusiveLock
-(1 row)
+ pg_toast  | AccessShareLock
+ pg_toast  | AccessShareLock
+(3 rows)

 rollback;
 begin;
 alter table alterlock cluster on alterlock_pkey;
Andres Freund <andres@anarazel.de> writes:
> On 2021-04-13 19:16:46 -0400, Tom Lane wrote:
>> Like this? This passes check-world, modulo the one very-unsurprising
>> regression test change. I've not tried to do any performance testing.

> I wonder if there's a realistic chance it could create additional
> deadlocks that don't exist right now?

Not on user tables, because we'd always be holding at least as much of a
lock on the parent table. However ...

> Would it be a problem that we'd still release the locks on catalog
> tables early, but not on its toast table?

... hmm, not sure. I can't immediately think of a scenario where it'd be
problematic (or any more problematic than DDL on a catalog would be anyway).
But that doesn't mean there isn't one.

The concerns that had come to my mind were more along the lines of things
like pg_dump requiring a larger footprint in the shared lock table. We could
alleviate that by increasing the default value of max_locks_per_transaction,
perhaps.

			regards, tom lane
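For reference, the setting Tom mentions can be inspected and raised like this (illustrative value; the default is 64, and changing it requires a server restart):

SHOW max_locks_per_transaction;
ALTER SYSTEM SET max_locks_per_transaction = 128;
-- then restart the server for the change to take effect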