Thread: pg_stat_statements locking

pg_stat_statements locking

From: Andrey Borodin
Hi hackers!

Recently I observed a very peculiar incident.

=== Incident description ===

An ETL database had been operating fine for many months, regularly updated etc. The workload was not changing much, but since it was an ETL database, most queries were different all the time.
On the night of September 7th the database got stuck: no monitoring query could be executed. DBAs started to deal with the incident, but there is not much you can do with a database service when you cannot execute a single query. According to VM metrics, the VM was writing a lot to disk.
The on-call engineer was summoned. He observed a lot of backends stuck with a similar backtrace:
#5  LWLockAcquire()
#6  pgss_store()
#7  pgss_post_parse_analyze()
#8  parse_analyze()
#9  pg_analyze_and_rewrite()

After a restart, the problem reproduced within 50 minutes. But this time monitoring queries were working, which showed that all backends were stuck on the pg_stat_statements LWLock. It was impossible to disable pgss with SQL, so the engineer altered auto.conf and restarted the database. This resolved the incident.

Later I was analyzing the incident. Re-enabling pgss showed traces of the problem:
            Fri 09 Sep 2022 08:52:31 AM MSK (every 2s)

     usename     |        state        |     wait_event     | cnt
-----------------+---------------------+--------------------+-----
 content         | active              | DataFileRead       |   1
 content         | active              | pg_stat_statements |  42
 content         | idle in transaction | ClientRead         |   2
 pgwatch_monitor | active              | BufFileRead        |   1
 pgwatch_monitor | active              | [null]             |   5
 pgwatch_monitor | active              | pg_stat_statements |  85
 postgres        | active              | [null]             |   1
 repl            | active              | WalSenderMain      |   2
 [null]          | active              | [null]             |   2
 [null]          | active              | VacuumDelay        |   7
(10 rows)

pgwatch was querying 60 databases every minute, and each call to pg_stat_statements() took approximately 3-4 seconds.
The backend holding the grand lock looked like this in pg_stat_activity:

datid            | 127782
pid              | 4077900
usename          | pgwatch_monitor
application_name | pgwatch2 - 10.96.17.68
wait_event_type  | IO
wait_event       | BufFileWrite
state            | active
backend_xid      | [null]
backend_xmin     | 67048029

The contents of the pg_stat_statements view overran work_mem and were materialized in a tuplestore on disk. This is what caused a lot of disk writes on a database that was not accepting any user queries.

</Incident description>

TLDR: LWLock "pg_stat_statements" disabled all SQL queries.


=== How the problem develops ===
Prerequisite 1. pgwatch is querying pgss often.
Prerequisite 2. pgss becomes so big that the tuplestore is written to disk while the shared lock is held.
Prerequisite 3. Someone calls reset(), or pgss_store() needs an exclusive lock to insert a new query.

Consequence. The exclusive lock queues behind the long-held shared lock and prevents any further shared locks from being taken.
Result. Any query that goes through the pgss hooks hangs.


=== Reproduction for development purposes ===
0. Set up a database with pg_stat_statements.track = all.
1. Modify pg_stat_statements_internal() to wait for a long time under LWLockAcquire(pgss->lock, LW_SHARED) (see the sketch below).
2. select * from pg_stat_statements()
3. select pg_stat_statements_reset()

Now the database is bonked. Any query will hang until pg_stat_statements() finishes.
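
For step 1, something like the following is enough (just a sketch; the sleep duration is arbitrary and only meant for reproduction):

/* In pg_stat_statements_internal(), right after taking the shared lock. */
LWLockAcquire(pgss->lock, LW_SHARED);
pg_usleep(60L * 1000000L);	/* hold the shared lock for ~60 seconds */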


=== How to fix ===
pgss uses a single LWLock to protect its hashtable.
When the hashtable is reset or a new query is inserted in pgss_store(), an exclusive lock is used.
When stats are updated for a known query or the pg_stat_statements view is read, a shared lock is used.

I propose to swap the shared lock for the stats update to a conditional shared lock.
It cannot be taken during reset() - and that's totally fine. It cannot be taken while a new query is being entered - and I think it's OK to lose some stats in this case. PFA patch attached.

This patch prevents the disaster caused by the locking described here.
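
To illustrate the idea (just a sketch of the approach, not the attached patch verbatim), the shared-lock path in pgss_store() becomes roughly:

	/* Sketch: do not wait for the lock; skip these stats if it is busy. */
	if (!LWLockConditionalAcquire(pgss->lock, LW_SHARED))
		return;			/* reset() or a new-entry insert holds/awaits the lock */

	entry = (pgssEntry *) hash_search(pgss->hash, &key, HASH_FIND, NULL);

	/* ... update the entry's counters exactly as before ... */

	LWLockRelease(pgss->lock);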


=== Other possible mitigation ===
The incident would not have occurred if the tuplestore had not spilled to its on-disk representation. But I don't see how to reliably protect against this.

What do you think?


Thanks!


Best regards, Andrey Borodin.


Re: pg_stat_statements locking

From: Julien Rouhaud
Hi,

On Mon, Sep 12, 2022 at 11:52:28AM +0500, Andrey Borodin wrote:
>
> === How to fix ===
> pgss uses LWLock to protect hashtable.
> When the hashtable is reset or new user query is inserted in pgss_store() -
> exclusive lock is used.
> When stats are updated for known query or pg_stat_statements are read - shared lock is used.
>
> I propose to swap shared lock for stats update to conditional shared lock.
> It cannot be taken during reset() - and that's totally fine. It cannot be
> taken during entering new query - and I think it's OK to spill some stats in
> this case. PFA patch attached.
>
> This patch prevents the disaster caused by the locking described here.

I'm not a fan of that patch as it now silently ignores entries if the lwlock
can't be acquired *immediately*, without any way to avoid that if your
configuration and/or workload doesn't lead to this problem, or any way to know
that entries were ignored.

> === Other possible mitigation ===
> The incident would not occur if tuplestore did not convert into on-disk
> representation. But I don't see how to reliably protect from this.

I'm not sure that's true.  If you need an exclusive lwlock it means that new
entries are added.  If that keeps happening it means that you will eventually
need to defragment the query text file, and this operation will certainly hold
an exclusive lwlock for a very long time.
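
For context, the relevant part of the existing insert path in pgss_store() is roughly this (heavily simplified):

	/* Entry not found under the shared lock: promote to exclusive. */
	LWLockRelease(pgss->lock);
	LWLockAcquire(pgss->lock, LW_EXCLUSIVE);

	if (need_gc_qtexts())
		gc_qtexts();		/* rewrites the whole external query text file */

	entry = entry_alloc(&key, query_offset, query_len, encoding,
						jstate != NULL);	/* may evict entries (entry_dealloc) */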

I think that the better long term approach is to move pg_stat_statements and
the query texts to dynamic shared memory.  This should also help in this
scenario as dshash is partitioned, so you don't have a single lwlock for the
whole hash table.  And as discussed recently, see [1], we should make the stat
collector extensible to reuse it in extensions like pg_stat_statements to
benefit from all the other optimizations.

[1] https://www.postgresql.org/message-id/20220818195124.c7ipzf6c5v7vxymc@awork3.anarazel.de
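
Just to illustrate the partitioned locking (a rough sketch with hypothetical pgss-side names, not a proposal of the actual layout):

#include "postgres.h"
#include "lib/dshash.h"
#include "storage/lwlock.h"

/* Hypothetical key/entry layout, only to show per-partition locking. */
typedef struct pgssDsHashKey
{
	Oid			userid;
	Oid			dbid;
	uint64		queryid;
} pgssDsHashKey;

typedef struct pgssDsHashEntry
{
	pgssDsHashKey key;			/* hash key, must be first */
	/* counters would follow */
} pgssDsHashEntry;

static const dshash_parameters pgss_dsh_params = {
	sizeof(pgssDsHashKey),
	sizeof(pgssDsHashEntry),
	dshash_memcmp,
	dshash_memhash,
	LWTRANCHE_PGSTATS_HASH		/* a dedicated tranche would be registered instead */
};

/*
 * dshash_find() locks only the partition holding the key, so a slow reader
 * of one partition does not block updates to the other partitions.  The
 * caller must dshash_release_lock() the returned entry when done.
 */
static pgssDsHashEntry *
pgss_dsh_lookup(dshash_table *ht, const pgssDsHashKey *key)
{
	return (pgssDsHashEntry *) dshash_find(ht, key, false);	/* shared lock */
}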



Re: pg_stat_statements locking

From: Andrey Borodin

> On 12 Sep 2022, at 13:40, Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> I'm not a fan of that patch as it now silently ignores entries if the lwlock
> can't be acquired *immediately*, without any way to avoid that if your
> configuration and/or workload doesn't lead to this problem, or any way to know
> that entries were ignored.

Practically, a workload with this configuration is uncommon. At least I could not find any reports of such locking.
But theoretically, all the prerequisites for a disaster are very common (a variety of queries + some QPS against the pg_stat_statements view + small work_mem + an occasional reset() or GC).

Maybe it's only a problem of programs that use pgss. pgwatch is calling pgss on every DB in the cluster; that's how a check once a minute became several queries per second.

Personally, I'd prefer it if I could configure a timeout for acquiring the lock. That timeout would bound the maximum delay that pgss can incur on a query. But we would need to extend the LWLock API for this.



> On 12 Sep 2022, at 13:40, Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> I think that the better long term approach is to move pg_stat_statements and
> the query texts to dynamic shared memory.

BTW, we don't even need dynamic shared memory. We just need shared memory, probably pre-allocated.
I agree that pgss must reside in main memory only, never on disk.

But we will still have the possibility of long lock conflicts preventing queries from completing. And the ability to configure a timeout for the pgss hooks would be useful anyway.


Best regards, Andrey Borodin.


Re: pg_stat_statements locking

From: Julien Rouhaud
On Mon, Sep 12, 2022 at 05:32:55PM +0500, Andrey Borodin wrote:
>
> > On 12 Sep 2022, at 13:40, Julien Rouhaud <rjuju123@gmail.com> wrote:
> >
> > I'm not a fan of that patch as it now silently ignores entries if the lwlock
> > can't be acquired *immediately*, without any way to avoid that if your
> > configuration and/or workload doesn't lead to this problem, or any way to know
> > that entries were ignored.
>
> Practically, workload of this configuration is uncommon. At least I could not
> find any reports of such locking.  But theoretically, all prerequisites of a
> disaster is very common (variety of queries + some QPS of pg_stat_statements
> view + small work_mem + occasional reset() or GC).

Simply needing to evict entries regularly is already famous for being really
expensive.  See for instance [1].

> Maybe it's only a problem of programs that use pgss. pgwatch is calling pgss
> on every DB in the cluster, that's how check once in a minute became some
> queries per second.

Ah, I wasn't sure if that's what you meant in your original message.  Calling
pg_stat_statements *for every database* doesn't sound like a good idea.

Also ideally you shouldn't need to retrieve the query text every time.  There's
now pg_stat_statements_info.dealloc, so between that and the number of rows
returned you can easily know if there are new query texts that you never saw
yet and cache those on the application side rather than retrieving them again
and again.

> Personally, I'd prefer it if I could configure a timeout for acquiring the
> lock. That timeout would bound the maximum delay that pgss can incur on a
> query. But we would need to extend the LWLock API for this.

Yeah, that's what I meant by "immediately" in my previous message.  That being
said I don't know if adding a timeout would be too expensive for the lwlock
infrastructure.

> > On 12 Sep 2022, at 13:40, Julien Rouhaud <rjuju123@gmail.com> wrote:
> >
> > I think that the better long term approach is to move pg_stat_statements and
> > the query texts to dynamic shared memory.
>
> BTW we don't even need a dynamic memory. We need just a shared memory,
> probably pre-allocated.  I agree that pgss must reside in main memory only,
> never on disk.

We still need dynamic shared memory to get rid of the query text file, which is
a big problem on its own.  For the main hash table, relying on dshash could
allow increasing the maximum number of entries without a restart, which could
be cool if you're in a situation where you have a finite number of entries
that's higher than pg_stat_statements.max (like after creating a new role or
something).
>
> But we still will have a possibility of long lock conflicts preventing
> queries from completing. And the ability to configure pgss hooks timeout
> would be useful anyway.

I didn't look thoroughly at the new pgstats infrastructure, but from what I saw
it should be able to address most of the current problems.

[1] https://twitter.com/AndresFreundTec/status/1105585237772263424



Re: pg_stat_statements locking

From: Andrey Borodin

> On 12 Sep 2022, at 18:18, Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> That being
> said I don't know if adding a timeout would be too expensive for the lwlock
> infrastructure.

The implementation itself is straightforward, but we need to add 3 implementations of waiting on a semaphore with a timeout:
POSIX has sem_timedwait(), Windows' WaitForMultipleObjectsEx() has a timeout argument, and SysV has semtimedop().
That's what we need in order to add something like LWLockAcquireWithTimeout().
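
The API surface would be something like this (entirely hypothetical, nothing below exists today; pgss_lock_timeout_ms would be a new GUC):

/* Hypothetical addition to lwlock.h: returns false if the lock could not
 * be acquired within timeout_ms milliseconds. */
extern bool LWLockAcquireWithTimeout(LWLock *lock, LWLockMode mode,
									 int timeout_ms);

/* Hypothetical usage in pgss_store(): bound the delay pgss may add. */
if (!LWLockAcquireWithTimeout(pgss->lock, LW_SHARED, pgss_lock_timeout_ms))
	return;		/* give up on recording stats instead of blocking the query */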

Does adding all this stuff sound like a good tradeoff for a lock-safe pg_stat_statements? If so, I'll start implementing it.

Best regards, Andrey Borodin.


Re: pg_stat_statements locking

From: Tom Lane
Andrey Borodin <x4mmm@yandex-team.ru> writes:
>> On 12 Sep 2022, at 18:18, Julien Rouhaud <rjuju123@gmail.com> wrote:
>> That being
>> said I don't know if adding a timeout would be too expensive for the lwlock
>> infrastructure.

I want to object fiercely to loading down LWLock with anything like
timeouts.  It's supposed to be "lightweight".  If we get away from
that we're just going to find ourselves needing another lighter-weight
lock mechanism.

            regards, tom lane



Re: pg_stat_statements locking

From: Julien Rouhaud
On Mon, Sep 12, 2022 at 02:01:23PM -0400, Tom Lane wrote:
> Andrey Borodin <x4mmm@yandex-team.ru> writes:
> >> On 12 Sep 2022, at 18:18, Julien Rouhaud <rjuju123@gmail.com> wrote:
> >> That being
> >> said I don't know if adding a timeout would be too expensive for the lwlock
> >> infrastructure.
> 
> I want to object fiercely to loading down LWLock with anything like
> timeouts.  It's supposed to be "lightweight".  If we get away from
> that we're just going to find ourselves needing another lighter-weight
> lock mechanism.

That's what I was thinking, so it looks like a show-stopper for the proposed
patch.



Re: pg_stat_statements locking

From: Andrey Borodin

> On 12 Sep 2022, at 23:01, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Andrey Borodin <x4mmm@yandex-team.ru> writes:
>>> On 12 Sep 2022, at 18:18, Julien Rouhaud <rjuju123@gmail.com> wrote:
>>> That being
>>> said I don't know if adding a timeout would be too expensive for the lwlock
>>> infrastructure.
>
> I want to object fiercely to loading down LWLock with anything like
> timeouts.  It's supposed to be "lightweight".  If we get away from
> that we're just going to find ourselves needing another lighter-weight
> lock mechanism.

Thanks for clarifying this, Tom. I agree that spreading timeout-based algorithms around is not a good thing. And when you have a hammer, everything looks like a nail, so it would be tempting to use timeouts here and there.


> On 12 Sep 2022, at 23:08, Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> That's what I was thinking, so it looks like a show-stopper for the proposed
> patch.

So, the only option to make things configurable is a switch between waiting and waitless locks.

And the other way is refactoring towards a partitioned hashtable, namely dshash. But I don't see how partitioned locking can save us from a locking disaster. The problem is caused by a read of the whole pgss view colliding with reset() or GC. Both of these operations touch every partition, so they will conflict anyway, with the same result: a time-consuming read of each partition will block the exclusive lock needed by reset(), and the queued exclusive lock will block any reads from the hashtable.


Best regards, Andrey Borodin.


Re: pg_stat_statements locking

From: Julien Rouhaud
On Tue, Sep 13, 2022 at 10:38:13AM +0500, Andrey Borodin wrote:
> 
> And the other way is refactoring towards partitioned hashtable, namely
> dshash. But I don't see how partitioned locking can save us from a locking
> disaster. Problem is caused by reading all the pgss view colliding with
> reset() or GC.

If you store the query texts in DSM, you won't have a query text file to
maintain and the GC problem will disappear.

> Both this operations deal with each partition - they will
> conflict anyway, with the same result. Time-consuming read of each partition
> will prevent exclusive lock by reset(), and queued exclusive lock will
> prevent any reads from hashtable.

Conflicts would still be possible, just less likely and shorter, as the whole
dshash is never locked globally, just one partition at a time (except when the
dshash is resized, but those locks aren't held for long and resizing isn't
frequent).

But the biggest improvements should come from reusing the pgstats
infrastructure.  I've only had a glance at it, so I don't know much about it,
but it has a per-backend hashtable to cache some information and avoid too many
accesses to the shared hash table, and a mechanism to accumulate entries and do
batch updates.