Hello, apologies for the long post, but I want to make sure I’ve got enough details to describe the problem for y’all.
I’ve got a 64-core (Ubuntu 18.04 – 240GB RAM running at GCP) instance running PG 13.2 and PostGIS 3.1.1 and we’re having troubles getting it to run more than 30 or so large queries at the same time accessing the same tables. With 60 threads, each thread is only running at ~30% CPU and no diskIO/IOWait (once the tables become cached).
Boiling the complex queries down to their simplest form, we test running 60 of this query simultaneously:
select
count(*)
from
travel_processing_v5.llc_zone z,
parent_set10.usca_trip_points7 t
where t.year_num = 2019 and t.month_num = 9
and st_intersects(t.lock_geom, z.s_geom)
and st_intersects(t.lock_geom, z.e_geom);
llc_zone = 981 rows (568k disk size) with s_geom and e_geom both of datatype geometry(Multipolygon, 2163)
usca_trip_points7 = 79 million rows (469G disk size) with t.lock_geom datatype geometry(Linestring, 2163)
(more detailed schema/stats can be provided if helpful)
postgresql.conf is pretty normal for a large system like this (with appropriate shared_buffer, work_mem, etc. – can be provided if helpful, too)
What I’m finding in pg_stat_activity when running this is lots of wait_events of type ‘LockManager’.
Rebuilding with CFLAGS=" -fno-omit-frame-pointer" --prefix=/usr/local/pgsql_13debug --enable-dtrace CPPFLAGS='-DLOCK_DEBUG' and then setting trace_lwlocks yields lots of records looking like:
[39691] LOG: 39691: LWLockAcquire(LockManager 0x7fab2cc09d80): excl 0 shared 0 haswaiters 1 waiters 6 rOK 1
Does anyone have any advice on how to alleviate LockManager’s LWlock issue?
Thanks for any assistance!
---Paul
Paul Friedman
CTO
677 Harrison St | San Francisco, CA 94107
M: (650) 270-7676
E-mail: paul.friedman@streetlightdata.com