Thread: pg_stat_statements locking
Hi hackers!

Recently I observed a very peculiar incident.

=== Incident description ===

An ETL database had been operating fine for many months, regularly updated etc. The workload was not changing much, but since it was an ETL database, most of the queries were different all the time.

On the night of September 7th the database got stuck: no monitoring query could be executed. DBAs started to deal with the incident, but there is not much you can do with a database service when you cannot execute a single query. According to VM metrics, the VM was writing a lot to disk. The on-call engineer was summoned. He observed a lot of backends stuck with a similar backtrace:

#5 LWLockAcquire()
#6 pgss_store()
#7 pgss_post_parse_analyze()
#8 parse_analyze()
#9 pg_analyze_and_rewrite()

After a restart, the problem reproduced within 50 minutes. This time monitoring queries were operating, which showed that all backends were stuck on the LWLock pg_stat_statements. It was impossible to disable pgss with SQL, so the engineer altered auto.conf and restarted the database. This resolved the incident.

Later I was working on analyzing the incident. Enabling pgss again showed traces of the problem:

Fri 09 Sep 2022 08:52:31 AM MSK (every 2s)

     usename     |        state        |     wait_event     | cnt
-----------------+---------------------+--------------------+-----
 content         | active              | DataFileRead       |   1
 content         | active              | pg_stat_statements |  42
 content         | idle in transaction | ClientRead         |   2
 pgwatch_monitor | active              | BufFileRead        |   1
 pgwatch_monitor | active              | [null]             |   5
 pgwatch_monitor | active              | pg_stat_statements |  85
 postgres        | active              | [null]             |   1
 repl            | active              | WalSenderMain      |   2
 [null]          | active              | [null]             |   2
 [null]          | active              | VacuumDelay        |   7
(10 rows)

pgwatch was querying 60 databases every minute, and each call to pg_stat_statements() took approximately 3-4 seconds.

The backend holding the lock everyone was waiting on looked like this in pg_stat_activity:

datid            | 127782
pid              | 4077900
usename          | pgwatch_monitor
application_name | pgwatch2 - 10.96.17.68
wait_event_type  | IO
wait_event       | BufFileWrite
state            | active
backend_xid      | [null]
backend_xmin     | 67048029

The contents of the pg_stat_statements view overran work_mem and were materialized in a tuplestore on disk. This is what caused a lot of disk writes on a database that was not accepting any user query.

</Incident description>

TLDR: the LWLock "pg_stat_statements" disabled all SQL queries.

=== How the problem develops ===

Prerequisite 1. pgwatch is querying pgss often.
Prerequisite 2. pgss becomes big enough that the tuplestore is written to disk while the shared lock is held.
Prerequisite 3. Someone calls reset(), or pgss_store() needs the exclusive lock.
Consequence. The exclusive lock queues behind the long-held shared lock and prevents all subsequent shared locks from being taken.
Result. Any query calling the pgss hooks hangs.

=== Reproduction for development purposes ===

0. Set up a database with pg_stat_statements.track = all.
1. Modify pg_stat_statements_internal() to wait for a long time under LWLockAcquire(pgss->lock, LW_SHARED).
2. select * from pg_stat_statements()
3. select pg_stat_statements_reset()

Now the database is bonked. Any query will hang until pg_stat_statements() finishes.

=== How to fix ===

pgss uses an LWLock to protect its hashtable. When the hashtable is reset, or a new user query is inserted in pgss_store(), the exclusive lock is used. When stats are updated for a known query, or pg_stat_statements is read, the shared lock is used.

I propose to swap the shared lock for the stats update to a conditional shared lock. It cannot be taken during reset() - and that's totally fine.
It cannot be taken while a new query is being entered either - and I think it's OK to spill some stats in this case. PFA the patch.

This patch prevents the disaster incurred by the locking described here.

=== Other possible mitigation ===

The incident would not have occurred if the tuplestore had not converted into its on-disk representation. But I don't see how to reliably protect from this.

What do you think?

Thanks!

Best regards, Andrey Borodin.
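P.S. To make the proposal concrete, the core of the idea in pgss_store() boils down to roughly the following. This is a simplified sketch, not the actual patch; field names are abbreviated. The point is replacing the blocking shared acquisition with LWLockConditionalAcquire() and skipping the update when the lock cannot be taken right away:

    /* Sketch of the conditional shared lock for the stats-update path */
    if (!LWLockConditionalAcquire(pgss->lock, LW_SHARED))
        return;                 /* lock is contended: drop this one stats update */

    entry = (pgssEntry *) hash_search(pgss_hash, &key, HASH_FIND, NULL);
    if (entry)
    {
        /* update the counters of the known query, as today */
        SpinLockAcquire(&entry->mutex);
        entry->counters.calls[kind] += 1;
        entry->counters.total_time[kind] += total_time;
        SpinLockRelease(&entry->mutex);
    }

    LWLockRelease(pgss->lock);

The insert path for a brand-new entry would keep using the exclusive lock as it does now.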
Hi,

On Mon, Sep 12, 2022 at 11:52:28AM +0500, Andrey Borodin wrote:
>
> === How to fix ===
>
> pgss uses an LWLock to protect its hashtable. When the hashtable is reset,
> or a new user query is inserted in pgss_store(), the exclusive lock is used.
> When stats are updated for a known query, or pg_stat_statements is read, the
> shared lock is used.
>
> I propose to swap the shared lock for the stats update to a conditional
> shared lock. It cannot be taken during reset() - and that's totally fine. It
> cannot be taken while a new query is being entered either - and I think it's
> OK to spill some stats in this case. PFA the patch.
>
> This patch prevents the disaster incurred by the locking described here.

I'm not a fan of that patch as it now silently ignores entries if the lwlock
can't be acquired *immediately*, without any way to avoid that if your
configuration and/or workload doesn't lead to this problem, or any way to know
that entries were ignored.

> === Other possible mitigation ===
>
> The incident would not have occurred if the tuplestore had not converted into
> its on-disk representation. But I don't see how to reliably protect from this.

I'm not sure that's true.  If you need an exclusive lwlock it means that new
entries are added.  If that keeps happening it means that you will eventually
need to defragment the query text file, and this operation will certainly hold
an exclusive lwlock for a very long time.

I think that the better long term approach is to move pg_stat_statements and
the query texts to dynamic shared memory.  This should also help in this
scenario as dshash is partitioned, so you don't have a single lwlock for the
whole hash table.

And as discussed recently, see [1], we should make the stat collector
extensible to reuse it in extensions like pg_stat_statements to benefit from
all the other optimizations.

[1] https://www.postgresql.org/message-id/20220818195124.c7ipzf6c5v7vxymc@awork3.anarazel.de
> On 12 Sep 2022, at 13:40, Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> I'm not a fan of that patch as it now silently ignores entries if the lwlock
> can't be acquired *immediately*, without any way to avoid that if your
> configuration and/or workload doesn't lead to this problem, or any way to know
> that entries were ignored.

Practically, a workload with this configuration is uncommon. At least I could
not find any reports of such locking. But theoretically, all prerequisites of a
disaster are very common (a variety of queries + some QPS of the
pg_stat_statements view + small work_mem + an occasional reset() or GC).

Maybe it's only a problem of programs that use pgss. pgwatch is calling pgss on
every DB in the cluster; that's how a check once a minute became several
queries per second.

Personally, I'd prefer if I could configure a timeout to acquire the lock. That
timeout would denote the maximum delay that pgss can incur on the query. But we
would need to extend the LWLock API for this.

> On 12 Sep 2022, at 13:40, Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> I think that the better long term approach is to move pg_stat_statements and
> the query texts to dynamic shared memory.

BTW we don't even need dynamic memory. We need just shared memory, probably
pre-allocated. I agree that pgss must reside in main memory only, never on
disk.

But we will still have the possibility of long lock conflicts preventing
queries from completing. And the ability to configure a timeout for the pgss
hooks would be useful anyway.

Best regards, Andrey Borodin.
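P.S. What I have in mind at the call site is roughly the following. This is
only a sketch of the idea: LWLockAcquireWithTimeout() does not exist today, and
pg_stat_statements.lock_timeout (exposed here as pgss_lock_timeout_ms) is a
hypothetical GUC.

    /* Hypothetical bounded wait; both the function and the GUC are made up */
    if (!LWLockAcquireWithTimeout(pgss->lock, LW_SHARED, pgss_lock_timeout_ms))
        return;                 /* timed out: skip recording this query's stats */

    /* ... update the entry as usual ... */

    LWLockRelease(pgss->lock);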
On Mon, Sep 12, 2022 at 05:32:55PM +0500, Andrey Borodin wrote:
>
> > On 12 Sep 2022, at 13:40, Julien Rouhaud <rjuju123@gmail.com> wrote:
> >
> > I'm not a fan of that patch as it now silently ignores entries if the lwlock
> > can't be acquired *immediately*, without any way to avoid that if your
> > configuration and/or workload doesn't lead to this problem, or any way to know
> > that entries were ignored.
>
> Practically, a workload with this configuration is uncommon. At least I could
> not find any reports of such locking. But theoretically, all prerequisites of a
> disaster are very common (a variety of queries + some QPS of the
> pg_stat_statements view + small work_mem + an occasional reset() or GC).

Simply needing to evict entries regularly is already famous for being really
expensive.  See for instance [1].

> Maybe it's only a problem of programs that use pgss. pgwatch is calling pgss on
> every DB in the cluster; that's how a check once a minute became several
> queries per second.

Ah, I wasn't sure if that's what you meant in your original message.  Calling
pg_stat_statements *for every database* doesn't sound like a good idea.

Also ideally you shouldn't need to retrieve the query text every time.  There's
now pg_stat_statements_info.dealloc, so between that and the number of rows
returned you can easily know if there are new query texts that you never saw
yet, and cache those on the application side rather than retrieving them again
and again.

> Personally, I'd prefer if I could configure a timeout to acquire the lock. That
> timeout would denote the maximum delay that pgss can incur on the query. But we
> would need to extend the LWLock API for this.

Yeah, that's what I meant by "immediately" in my previous message.  That being
said I don't know if adding a timeout would be too expensive for the lwlock
infrastructure.

> > On 12 Sep 2022, at 13:40, Julien Rouhaud <rjuju123@gmail.com> wrote:
> >
> > I think that the better long term approach is to move pg_stat_statements and
> > the query texts to dynamic shared memory.
>
> BTW we don't even need dynamic memory. We need just shared memory, probably
> pre-allocated. I agree that pgss must reside in main memory only, never on
> disk.

We still need dynamic shared memory to get rid of the query text file, which is
a big problem on its own.  For the main hash table, relying on dshash could
allow increasing the maximum number of entries without a restart, which could
be cool if you're in a situation where you have a finite number of entries
that's higher than pg_stat_statements.max (like after creating a new role or
something).

> But we will still have the possibility of long lock conflicts preventing
> queries from completing. And the ability to configure a timeout for the pgss
> hooks would be useful anyway.

I didn't look thoroughly at the new pgstats infrastructure, but from what I saw
it should be able to alleviate most of the current problems.

[1] https://twitter.com/AndresFreundTec/status/1105585237772263424
> On 12 Sep 2022, at 18:18, Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> That being
> said I don't know if adding a timeout would be too expensive for the lwlock
> infrastructure.

The implementation itself is straightforward, but we need to add three
implementations of waiting for a semaphore with a timeout. POSIX has
sem_timedwait(). Windows' WaitForMultipleObjectsEx() has a timeout argument.
SysV has semtimedop(). With these we could add something like
LWLockAcquireWithTimeout().

Does adding all this stuff sound like a good tradeoff for a lock-safe
pg_stat_statements? If so - I'll start to implement this.

Best regards, Andrey Borodin.
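P.S. For the POSIX case the shape would be roughly as below. It is only a
sketch under the assumption that a timed counterpart to PGSemaphoreLock() gets
added; the function name and the millisecond argument are made up here, and the
Windows and SysV paths (WaitForMultipleObjectsEx() / semtimedop()) are omitted.

    #include <errno.h>
    #include <semaphore.h>
    #include <stdbool.h>
    #include <time.h>

    /* Sketch: wait on a POSIX semaphore for at most timeout_ms milliseconds. */
    static bool
    PGSemaphoreTimedLock(sem_t *sem, int timeout_ms)
    {
        struct timespec deadline;
        int         rc;

        /* sem_timedwait() takes an absolute CLOCK_REALTIME deadline */
        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_sec += timeout_ms / 1000;
        deadline.tv_nsec += (long) (timeout_ms % 1000) * 1000000L;
        if (deadline.tv_nsec >= 1000000000L)
        {
            deadline.tv_sec++;
            deadline.tv_nsec -= 1000000000L;
        }

        do
        {
            rc = sem_timedwait(sem, &deadline);
        } while (rc < 0 && errno == EINTR);

        return rc == 0;         /* false on ETIMEDOUT or any other failure */
    }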
Andrey Borodin <x4mmm@yandex-team.ru> writes:
>> On 12 Sep 2022, at 18:18, Julien Rouhaud <rjuju123@gmail.com> wrote:
>> That being
>> said I don't know if adding a timeout would be too expensive for the lwlock
>> infrastructure.

I want to object fiercely to loading down LWLock with anything like
timeouts.  It's supposed to be "lightweight".  If we get away from
that we're just going to find ourselves needing another lighter-weight
lock mechanism.

			regards, tom lane
On Mon, Sep 12, 2022 at 02:01:23PM -0400, Tom Lane wrote:
> Andrey Borodin <x4mmm@yandex-team.ru> writes:
> >> On 12 Sep 2022, at 18:18, Julien Rouhaud <rjuju123@gmail.com> wrote:
> >> That being
> >> said I don't know if adding a timeout would be too expensive for the lwlock
> >> infrastructure.
>
> I want to object fiercely to loading down LWLock with anything like
> timeouts.  It's supposed to be "lightweight".  If we get away from
> that we're just going to find ourselves needing another lighter-weight
> lock mechanism.

That's what I was thinking, so it looks like a show-stopper for the proposed
patch.
> On 12 Sep 2022, at 23:01, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Andrey Borodin <x4mmm@yandex-team.ru> writes:
>>> On 12 Sep 2022, at 18:18, Julien Rouhaud <rjuju123@gmail.com> wrote:
>>> That being
>>> said I don't know if adding a timeout would be too expensive for the lwlock
>>> infrastructure.
>
> I want to object fiercely to loading down LWLock with anything like
> timeouts.  It's supposed to be "lightweight".  If we get away from
> that we're just going to find ourselves needing another lighter-weight
> lock mechanism.

Thanks for clarifying this, Tom. I agree that spreading timeout-based
algorithms is not a good thing. And when you have a hammer, everything seems
like a nail, so it would be tempting to use timeouts here and there.

> On 12 Sep 2022, at 23:08, Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> That's what I was thinking, so it looks like a show-stopper for the proposed
> patch.

So, the only option to make things configurable is a switch between waiting
and waitless locks.

And the other way is refactoring towards a partitioned hashtable, namely
dshash. But I don't see how partitioned locking can save us from a locking
disaster. The problem is caused by reading the whole pgss view colliding with
reset() or GC. Both of these operations touch every partition - they will
conflict anyway, with the same result: a time-consuming read of each partition
will block the exclusive lock taken by reset(), and the queued exclusive lock
will prevent any reads from the hashtable.

Best regards, Andrey Borodin.
On Tue, Sep 13, 2022 at 10:38:13AM +0500, Andrey Borodin wrote:
>
> And the other way is refactoring towards a partitioned hashtable, namely
> dshash. But I don't see how partitioned locking can save us from a locking
> disaster. The problem is caused by reading the whole pgss view colliding with
> reset() or GC.

If you store the query texts in DSM, you won't have a query text file to
maintain and the GC problem will disappear.

> Both of these operations touch every partition - they will
> conflict anyway, with the same result: a time-consuming read of each partition
> will block the exclusive lock taken by reset(), and the queued exclusive lock
> will prevent any reads from the hashtable.

Conflicts would still be possible, just less likely and shorter-lived, as the
whole dshash is never locked globally, just one partition at a time (except
when the dshash is resized, but those locks aren't held for a long time and
it's not something frequent).

But the biggest improvements should be gained by reusing the pgstats
infrastructure.  I only had a glance at it so I don't know much about it, but
it has a per-backend hashtable to cache some information and avoid too many
accesses on the shared hash table, and a mechanism to accumulate entries and
do batch updates.
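To illustrate the partitioning point, an update of a single entry through
dshash could look roughly like this. It is only a hypothetical sketch
(pgss_dsh, pgssDshEntry and the counter fields are made-up names); what matters
is that only the partition containing the key gets locked, so a slow reader of
another partition doesn't block the update.

    /* Sketch: per-partition locking with dshash instead of one global LWLock */
    pgssDshEntry *entry;

    entry = dshash_find(pgss_dsh, &key, true); /* exclusive lock on one partition only */
    if (entry)
    {
        entry->calls += 1;
        entry->total_exec_time += total_time;
        dshash_release_lock(pgss_dsh, entry);  /* unlocks just that partition */
    }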