From: Andrey Borodin
Subject: pg_stat_statements locking
Msg-id: 1AEEB240-9B68-44D5-8A29-8F9FDB22C801@yandex-team.ru
List: pgsql-hackers
Hi hackers!

Recently I observed a very peculiar incident.

=== Incident description ===

An ETL database had been operating fine for many months and was regularly updated. The workload was not changing much,
but since it was an ETL database, most queries were different all the time.
On the night of September 7th the database got stuck: no monitoring query could be executed. DBAs started to deal with
the incident, but there is not much you can do with a database service when you cannot execute a single query. According
to VM metrics, the VM was writing a lot to disk.
The on-call engineer was summoned. He observed a lot of backends stuck with a similar backtrace:
#5  LWLockAcquire()
#6  pgss_store()
#7  pgss_post_parse_analyze()
#8  parse_analyze()
#9  pg_analyze_and_rewrite()

After a restart, the problem reproduced within 50 minutes. But this time monitoring queries were operating, which showed
that all backends were stuck on the LWLock pg_stat_statements. It was impossible to disable pgss with SQL, so the
engineer edited postgresql.auto.conf and restarted the database. This resolved the incident.
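For reference, disabling pgss without SQL access amounts to an edit like this in postgresql.auto.conf followed by a
restart (a sketch; the exact line depends on which other libraries are preloaded):

    # pg_stat_statements removed from shared_preload_libraries
    shared_preload_libraries = ''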

Later, I worked on analyzing the incident. Enabling pgss again showed traces of the problem:
            Fri 09 Sep 2022 08:52:31 AM MSK (every 2s)

     usename     |        state        |     wait_event     | cnt
-----------------+---------------------+--------------------+-----
 content         | active              | DataFileRead       |   1
 content         | active              | pg_stat_statements |  42
 content         | idle in transaction | ClientRead         |   2
 pgwatch_monitor | active              | BufFileRead        |   1
 pgwatch_monitor | active              | [null]             |   5
 pgwatch_monitor | active              | pg_stat_statements |  85
 postgres        | active              | [null]             |   1
 repl            | active              | WalSenderMain      |   2
 [null]          | active              | [null]             |   2
 [null]          | active              | VacuumDelay        |   7
(10 rows)
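The snapshot above aggregates pg_stat_activity; a query along these lines produces that shape (a sketch, not
necessarily the exact monitoring query):

    select usename, state, wait_event, count(*) as cnt
      from pg_stat_activity
     group by usename, state, wait_event
     order by usename, state, wait_event;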

pgwatch was querying 60 databases every minute, and each call to pg_stat_statements() took approximately 3-4 seconds.
The backend holding the lock that everyone was queued behind looked like this in pg_stat_activity:

datid            | 127782
pid              | 4077900
usename          | pgwatch_monitor
application_name | pgwatch2 - 10.96.17.68
wait_event_type  | IO
wait_event       | BufFileWrite
state            | active
backend_xid      | [null]
backend_xmin     | 67048029

The contents of the pg_stat_statements view overran work_mem and were materialized in a tuplestore on disk. This is
what caused a lot of disk writes on a database that was not accepting any user queries.
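The problematic pattern, roughly (a simplified sketch of pg_stat_statements_internal(), not the verbatim source):

    /* The whole hashtable scan runs under a shared LWLock, and
     * tuplestore_putvalues() may spill the result set to disk once
     * it exceeds work_mem - all while the lock is still held. */
    LWLockAcquire(pgss->lock, LW_SHARED);
    hash_seq_init(&hash_seq, pgss_hash);
    while ((entry = hash_seq_search(&hash_seq)) != NULL)
    {
        /* ... fill values[] / nulls[] from the entry ... */
        tuplestore_putvalues(tupstore, tupdesc, values, nulls);
    }
    LWLockRelease(pgss->lock);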

=== End of incident description ===

TL;DR: the LWLock "pg_stat_statements" disabled all SQL queries.


=== How the problem develops ===
Prerequisite 1. pgwatch is querying pgss often.
Prerequisite 2. pgss becomes so big that the tuplestore is written to disk while the shared lock is held.
Prerequisite 3. Someone calls reset(), or pgss_store() needs an exclusive lock (e.g., to insert a new query).

Consequence. The exclusive lock request queues behind the long-held shared lock and prevents any further shared locks
from being taken.
Result. Any query that calls the pgss hooks hangs.


=== Reproduction for development purposes ===
0. Set up a database with pg_stat_statements.track = all.
1. Modify pg_stat_statements_internal() to wait for a long time under LWLockAcquire(pgss->lock, LW_SHARED).
2. select * from pg_stat_statements()
3. select pg_stat_statements_reset()

Now the database is bonked. Any query will hang until pg_stat_statements() finishes.
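For step 1, a minimal sketch of the kind of delay one could add (illustrative only):

    /* in pg_stat_statements_internal(), right after taking the lock */
    LWLockAcquire(pgss->lock, LW_SHARED);
    pg_usleep(60L * 1000000L);    /* hold the shared lock for a minute */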


=== How to fix ===
pgss uses an LWLock to protect its hashtable.
When the hashtable is reset, or a new user query is inserted in pgss_store(), an exclusive lock is taken.
When stats are updated for a known query, or pg_stat_statements is read, a shared lock is taken.

I propose to change the shared lock used for stats updates into a conditional shared lock.
It cannot be taken during reset() - and that's totally fine. It cannot be taken while a new query is being entered -
and I think it's OK to spill some stats in this case. Please find the patch attached.
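In sketch form, the idea is to replace the unconditional shared acquisition on the stats-update path with something
like this (the attached patch is authoritative; this just illustrates the conditional acquisition):

    /* in pgss_store(), when updating stats for a known query */
    if (!LWLockConditionalAcquire(pgss->lock, LW_SHARED))
        return;     /* lock is contended: skip this update, losing a bit of stats */
    /* ... update the entry's counters ... */
    LWLockRelease(pgss->lock);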

This patch prevents the disaster caused by the locking described here.


=== Other possible mitigation ===
The incident would not have occurred if the tuplestore had not spilled to its on-disk representation. But I don't see
how to reliably protect against this.

What do you think?


Thanks!


Best regards, Andrey Borodin.
