Thread: Occasional lengthy locking causing stalling on commit

Occasional lengthy locking causing stalling on commit

From: Ben Hoskings
Date:
G'day all,

We have a postgres 11.9 instance serving 7-8k queries/sec. In general
it's humming along with no issues (p50 latency of ~1ms; p95 ~30ms).
Lately though, we've seen occasional locking that causes all commits
to the main database to be briefly blocked and the APIs that it backs
to stall.

This is what I know so far:
  * It's happened at varying times of day over the last 3 weeks:
around 10 significant cases (20-30 second stalls), plus many shorter
ones.
  * Each time, queries are blocked behind an AccessExclusiveLock on the
database (class 1262) that is itself blocked.
  * There's no change in DB throughput at the time.
  * No schema changes or any other queries are being made that would
involve an AccessExclusiveLock.
  * The blocked queries are all commit statements; our metrics show
dozens of traces each time that look perfectly normal except for a 20
second commit somewhere within them.
  * We make heavy use of listen/notify via que with ~80 notify events
per second across 16 listen channels. (https://github.com/que-rb/que)

Based on all that and the logs below, the locks in question appear to
be held while inserting NOTIFY messages on commit:
https://www.postgresql.org/message-id/3598.1363354686%40sss.pgh.pa.us

If that's the case, the real question is why a small number of those
commits are taking so much longer than expected. Any advice on how to
proceed from here would be much appreciated.
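
In case it helps, this is roughly what we run against pg_locks while a
stall is in progress to see the waiters and the blocking backend
(assuming the lock involved is the object lock that NOTIFY takes at
commit, i.e. classid 1262 = pg_database, objid 0; pg_blocking_pids()
is available on 11):

-- Waiters and holder of the object lock NOTIFY takes at commit time.
SELECT a.pid,
       l.granted,
       pg_blocking_pids(a.pid) AS blocked_by,
       a.wait_event_type,
       a.wait_event,
       a.state,
       left(a.query, 60)       AS query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.locktype = 'object'
  AND l.classid = 1262
  AND l.objid = 0
ORDER BY l.granted DESC, a.pid;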

Cheers
Ben Hoskings
---

The outage observed here started at ~00:35:22 and ended once the
blocked processes acquired their locks (at ~00:35:39-40 below).

2021-01-27 00:35:15.264 UTC [1554652]: [21-1] db=main,user=runtime
LOG: process 1554652 still waiting for ShareLock on transaction
3399854732 after 1000.091 ms
2021-01-27 00:35:15.264 UTC [1554652]: [22-1] db=main,user=runtime
DETAIL: Process holding the lock: 1542461. Wait queue: 1554652,
1542154, 1542496.
2021-01-27 00:35:15.264 UTC [1554652]: [23-1] db=main,user=runtime
CONTEXT: while rechecking updated tuple (#,#) in relation "devices"
2021-01-27 00:35:15.264 UTC [1554652]: [24-1] db=main,user=runtime
STATEMENT: UPDATE "devices" SET "last_successful_request_at" =
'2021-01-27 00:35:14.203471', "updated_at" = '2021-01-27
00:35:14.204966' WHERE "devices"."id" = #
2021-01-27 00:35:15.265 UTC [1542154]: [82-1] db=main,user=runtime
LOG: process 1542154 still waiting for ShareLock on transaction
3399854732 after 1000.098 ms
2021-01-27 00:35:15.265 UTC [1542154]: [83-1] db=main,user=runtime
DETAIL: Process holding the lock: 1542461. Wait queue: 1554652,
1542154, 1542496.
2021-01-27 00:35:15.265 UTC [1542154]: [84-1] db=main,user=runtime
CONTEXT: while rechecking updated tuple (#,#) in relation "devices"
2021-01-27 00:35:15.265 UTC [1542154]: [85-1] db=main,user=runtime
STATEMENT: UPDATE "devices" SET "last_successful_request_at" =
'2021-01-27 00:35:14.226635', "updated_at" = '2021-01-27
00:35:14.227932' WHERE "devices"."id" = #
2021-01-27 00:35:15.267 UTC [1542521]: [93-1] db=main,user=runtime
LOG: process 1542521 still waiting for ShareLock on transaction
3399854728 after 1000.067 ms
2021-01-27 00:35:15.267 UTC [1542521]: [94-1] db=main,user=runtime
DETAIL: Process holding the lock: 1547170. Wait queue: 1542514,
1542521.
2021-01-27 00:35:15.267 UTC [1542521]: [95-1] db=main,user=runtime
CONTEXT: while updating tuple (#,#) in relation "devices"
2021-01-27 00:35:15.267 UTC [1542521]: [96-1] db=main,user=runtime
STATEMENT: UPDATE "devices" SET "last_successful_request_at" =
'2021-01-27 00:35:14.257563', "updated_at" = '2021-01-27
00:35:14.258934' WHERE "devices"."id" = #
2021-01-27 00:35:15.278 UTC [1479762]: [60-1] db=main,user=runtime
LOG: process 1479762 still waiting for AccessExclusiveLock on object 0
of class 1262 of database 0 after 1000.129 ms
2021-01-27 00:35:15.278 UTC [1479762]: [61-1] db=main,user=runtime
DETAIL: Process holding the lock: 1518905. Wait queue: 1517976,
1518853, 1445675, 1518975, 1477989, 1518976, 1517975, 3018947,
1468183, 1477990, 1532581, 1474376, 1558501, 1477991, 1473039,
1468182, 1558509, 1477988, 1558516, 3019502, 1518852, 1473037,
1473038, 1547149, 1518909, 1479762, 1479765.
2021-01-27 00:35:15.290 UTC [1479765]: [51-1] db=main,user=runtime
LOG: process 1479765 still waiting for AccessExclusiveLock on object 0
of class 1262 of database 0 after 1000.110 ms
2021-01-27 00:35:15.290 UTC [1479765]: [52-1] db=main,user=runtime
DETAIL: Process holding the lock: 1518905. Wait queue: 1517976,
1518853, 1445675, 1518975, 1477989, 1518976, 1517975, 3018947,
1468183, 1477990, 1532581, 1474376, 1558501, 1477991, 1473039,
1468182, 1558509, 1477988, 1558516, 3019502, 1518852, 1473037,
1473038, 1547149, 1518909, 1479762, 1479765.
2021-01-27 00:35:15.290 UTC [1479765]: [53-1] db=main,user=runtime
STATEMENT: COMMIT
2021-01-27 00:35:15.293 UTC [1542496]: [130-1] db=main,user=runtime
LOG: process 1542496 still waiting for ShareLock on transaction
3399854732 after 1000.123 ms
2021-01-27 00:35:15.293 UTC [1542496]: [131-1] db=main,user=runtime
DETAIL: Process holding the lock: 1542461. Wait queue: 1554652,
1542154, 1542496.
2021-01-27 00:35:15.293 UTC [1542496]: [132-1] db=main,user=runtime
CONTEXT: while updating tuple (#,#) in relation "devices"
2021-01-27 00:35:15.293 UTC [1542496]: [133-1] db=main,user=runtime
STATEMENT: UPDATE "devices" SET "last_successful_request_at" =
'2021-01-27 00:35:14.283515', "updated_at" = '2021-01-27
00:35:14.284677' WHERE "devices"."id" = #
2021-01-27 00:35:24.507 UTC [32]: [10-1] db=,user= LOG: using stale
statistics instead of current ones because stats collector is not
responding
2021-01-27 00:35:34.698 UTC [1565612]: [1-1] db=,user= LOG: using
stale statistics instead of current ones because stats collector is
not responding
2021-01-27 00:35:34.702 UTC [32]: [11-1] db=,user= LOG: using stale
statistics instead of current ones because stats collector is not
responding
2021-01-27 00:35:35.263 UTC [3485525]: [17-1] db=main,user=datadog
LOG: using stale statistics instead of current ones because stats
collector is not responding
2021-01-27 00:35:35.263 UTC [3485525]: [18-1] db=main,user=datadog
STATEMENT: SELECT psd.datname, numbackends, xact_commit,
xact_rollback, blks_read, blks_hit, tup_returned, tup_fetched,
tup_inserted, tup_updated, tup_deleted, 2^31 - age(datfrozenxid) as
wraparound, deadlocks, temp_bytes, temp_files,
pg_database_size(psd.datname) as pg_database_size FROM
pg_stat_database psd JOIN pg_database pd ON psd.datname = pd.datname
WHERE psd.datname not ilike 'template%%%%' AND psd.datname not ilike
'rdsadmin' AND psd.datname not ilike 'azure_maintenance' AND
psd.datname not ilike 'postgres'
2021-01-27 00:35:39.293 UTC [1517976]: [40-1] db=main,user=runtime
LOG: process 1517976 acquired AccessExclusiveLock on object 0 of class
1262 of database 0 after 25247.913 ms
2021-01-27 00:35:39.348 UTC [1542521]: [97-1] db=main,user=runtime
LOG: process 1542521 acquired ShareLock on transaction 3399854728
after 25081.166 ms
2021-01-27 00:35:39.348 UTC [1542521]: [98-1] db=main,user=runtime
CONTEXT: while updating tuple (#,#) in relation "devices"
2021-01-27 00:35:39.348 UTC [1542521]: [99-1] db=main,user=runtime
STATEMENT: UPDATE "devices" SET "last_successful_request_at" =
'2021-01-27 00:35:14.257563', "updated_at" = '2021-01-27
00:35:14.258934' WHERE "devices"."id" = #
2021-01-27 00:35:39.348 UTC [1542514]: [88-1] db=main,user=runtime
LOG: process 1542514 acquired ShareLock on transaction 3399854728
after 25086.306 ms
2021-01-27 00:35:39.348 UTC [1542514]: [89-1] db=main,user=runtime
CONTEXT: while rechecking updated tuple (#,#) in relation "devices"
2021-01-27 00:35:39.348 UTC [1542514]: [90-1] db=main,user=runtime
STATEMENT: UPDATE "devices" SET "last_successful_request_at" =
'2021-01-27 00:35:14.225360', "updated_at" = '2021-01-27
00:35:14.226612' WHERE "devices"."id" = #
2021-01-27 00:35:39.348 UTC [1542496]: [134-1] db=main,user=runtime
LOG: process 1542496 acquired ShareLock on transaction 3399854732
after 25055.414 ms
2021-01-27 00:35:39.348 UTC [1542496]: [135-1] db=main,user=runtime
CONTEXT: while updating tuple (#,#) in relation "devices"
2021-01-27 00:35:39.348 UTC [1542496]: [136-1] db=main,user=runtime
STATEMENT: UPDATE "devices" SET "last_successful_request_at" =
'2021-01-27 00:35:14.283515', "updated_at" = '2021-01-27
00:35:14.284677' WHERE "devices"."id" = #
2021-01-27 00:35:39.348 UTC [1542154]: [86-1] db=main,user=runtime
LOG: process 1542154 acquired ShareLock on transaction 3399854732
after 25083.744 ms
2021-01-27 00:35:39.348 UTC [1542154]: [87-1] db=main,user=runtime
CONTEXT: while rechecking updated tuple (#,#) in relation "devices"
2021-01-27 00:35:39.348 UTC [1542154]: [88-1] db=main,user=runtime
STATEMENT: UPDATE "devices" SET "last_successful_request_at" =
'2021-01-27 00:35:14.226635', "updated_at" = '2021-01-27
00:35:14.227932' WHERE "devices"."id" = #
2021-01-27 00:35:39.348 UTC [1554652]: [25-1] db=main,user=runtime
LOG: process 1554652 acquired ShareLock on transaction 3399854732
after 25084.652 ms
2021-01-27 00:35:39.348 UTC [1554652]: [26-1] db=main,user=runtime
CONTEXT: while rechecking updated tuple (#,#) in relation "devices"
2021-01-27 00:35:39.348 UTC [1554652]: [27-1] db=main,user=runtime
STATEMENT: UPDATE "devices" SET "last_successful_request_at" =
'2021-01-27 00:35:14.203471', "updated_at" = '2021-01-27
00:35:14.204966' WHERE "devices"."id" = #
2021-01-27 00:35:39.445 UTC [1518853]: [62-1] db=main,user=runtime
LOG: process 1518853 acquired AccessExclusiveLock on object 0 of class
1262 of database 0 after 25399.688 ms
2021-01-27 00:35:39.597 UTC [1445675]: [65-1] db=main,user=runtime
LOG: process 1445675 acquired AccessExclusiveLock on object 0 of class
1262 of database 0 after 25550.833 ms
2021-01-27 00:35:39.751 UTC [1518975]: [57-1] db=main,user=runtime
LOG: process 1518975 acquired AccessExclusiveLock on object 0 of class
1262 of database 0 after 25704.969 ms
2021-01-27 00:35:39.822 UTC [1477989]: [47-1] db=main,user=runtime
LOG: process 1477989 acquired AccessExclusiveLock on object 0 of class
1262 of database 0 after 25775.611 ms
2021-01-27 00:35:39.912 UTC [1518976]: [42-1] db=main,user=runtime
LOG: process 1518976 acquired AccessExclusiveLock on object 0 of class
1262 of database 0 after 25864.312 ms
2021-01-27 00:35:40.040 UTC [1517975]: [55-1] db=main,user=runtime
LOG: process 1517975 acquired AccessExclusiveLock on object 0 of class
1262 of database 0 after 25991.996 ms
2021-01-27 00:35:40.155 UTC [3018947]: [1147-1] db=main,user=runtime
LOG: process 3018947 acquired AccessExclusiveLock on object 0 of class
1262 of database 0 after 26104.830 ms
2021-01-27 00:35:40.207 UTC [1468183]: [40-1] db=main,user=runtime
LOG: process 1468183 acquired AccessExclusiveLock on object 0 of class
1262 of database 0 after 26155.200 ms
2021-01-27 00:35:40.238 UTC [1477990]: [53-1] db=main,user=runtime
LOG: process 1477990 acquired AccessExclusiveLock on object 0 of class
1262 of database 0 after 26181.880 ms
2021-01-27 00:35:40.256 UTC [1532581]: [27-1] db=main,user=runtime
LOG: process 1532581 acquired AccessExclusiveLock on object 0 of class
1262 of database 0 after 26187.341 ms
2021-01-27 00:35:40.263 UTC [1474376]: [40-1] db=main,user=runtime
LOG: process 1474376 acquired AccessExclusiveLock on object 0 of class
1262 of database 0 after 26191.817 ms
2021-01-27 00:35:40.272 UTC [1558501]: [3-1] db=main,user=runtime LOG:
process 1558501 acquired AccessExclusiveLock on object 0 of class 1262
of database 0 after 26192.658 ms
2021-01-27 00:35:40.279 UTC [1477991]: [61-1] db=main,user=runtime
LOG: process 1477991 acquired AccessExclusiveLock on object 0 of class
1262 of database 0 after 26100.590 ms
2021-01-27 00:35:40.285 UTC [1473039]: [36-1] db=main,user=runtime
LOG: process 1473039 acquired AccessExclusiveLock on object 0 of class
1262 of database 0 after 26106.395 ms
2021-01-27 00:35:40.308 UTC [1468182]: [45-1] db=main,user=runtime
LOG: process 1468182 acquired AccessExclusiveLock on object 0 of class
1262 of database 0 after 26128.550 ms
2021-01-27 00:35:40.329 UTC [1558509]: [3-1] db=main,user=runtime LOG:
process 1558509 acquired AccessExclusiveLock on object 0 of class 1262
of database 0 after 26143.959 ms



Re: Occasional lengthy locking causing stalling on commit

From: Tom Lane
Date:
Ben Hoskings <ben@hoskings.net> writes:
> We have a postgres 11.9 instance serving 7-8k queries/sec. In general
> it's humming along with no issues (p50 latency of ~1ms; p95 ~30ms).
> Lately though, we've seen occasional locking that causes all commits
> to the main database to be briefly blocked and the APIs that it backs
> to stall.
> ...
>   * We make heavy use of listen/notify via que with ~80 notify events
> per second across 16 listen channels. (https://github.com/que-rb/que)

Possibly you'd benefit from updating to v13, which has the listen/notify
performance improvements Martijn referred to in the other thread.

It's also possible that the hangup is unrelated to that, being somewhere
later in commit processing than where the notify traffic gets dumped out.
If you could identify which process has the "database 0" lock during one
of these delays, and capture a stack trace from it, that would be
super-informative.  Not sure about a good way to do that though.
Maybe you could automate tailing the log for "DETAIL: Process holding the
lock: nnn." and gdb'ing that process.

            regards, tom lane



Re: Occasional lengthy locking causing stalling on commit

From: Ben Hoskings
Date:
On Sun, 31 Jan 2021 at 03:50, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Possibly you'd benefit from updating to v13, which has the listen/notify
> performance improvements Martijn referred to in the other thread.
>
> It's also possible that the hangup is unrelated to that, being somewhere
> later in commit processing than where the notify traffic gets dumped out.
> If you could identify which process has the "database 0" lock during one
> of these delays, and capture a stack trace from it, that would be
> super-informative.  Not sure about a good way to do that though.
> Maybe you could automate tailing the log for "DETAIL: Process holding the
> lock: nnn." and gdb'ing that process.
>
>                         regards, tom lane

Thanks for the reply Tom. We're running on Google Cloud SQL so we
won't be able to use gdb - unless we can convince Google to run it for
us :)

I wonder if there are any likely candidates that we could look into -
for example, is it possible it could be due to a batched index update
that we could alleviate with "fastupdate=off"?
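
In the meantime, the closest we can get without gdb is probably to
poll for whoever holds that object lock while a stall is in progress
and record what it's doing, something like:

-- Current holder of the class-1262 object lock (the one NOTIFY takes
-- at commit), with its state and query text.
SELECT a.pid, a.backend_start, a.xact_start, a.state,
       a.wait_event_type, a.wait_event, a.query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.locktype = 'object'
  AND l.classid = 1262
  AND l.objid = 0
  AND l.granted;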

Cheers
Ben



Re: Occasional lengthy locking causing stalling on commit

From: Tom Lane
Date:
Ben Hoskings <ben@hoskings.net> writes:
> I wonder if there are any likely candidates that we could look into -
> for example, is it possible it could be due to a batched index update
> that we could alleviate with "fastupdate=off"?

No.  This is late enough in the commit code path that we really
don't want to be running anything that would take a long time or
have a significant probability of failure.  So I'm a bit mystified
as to what it could be.  Ockham's razor suggests that it's the
notify processing itself, as you seem to be an outlier in how
heavily you're using that; but that's no sure thing.

One thing that just occurred to me is that you might find it
interesting to keep tabs on what's in the $PGDATA/pg_notify
directory.  Do the performance burps correspond to transitory
peaks in the amount of data there?  Or (grasping at straws here...)
wraparound of the file names back to 0000?
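
If poking at the directory itself is inconvenient, something along
these lines would give you a number to trend (the admin functions are
superuser-only by default, though they can be granted):

-- Rough size of the pending-notification queue on disk.
SELECT count(*) AS segments,
       sum((pg_stat_file('pg_notify/' || d.name)).size) AS bytes
FROM pg_ls_dir('pg_notify') AS d(name);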

            regards, tom lane



Re: Occasional lengthy locking causing stalling on commit

From: Ben Hoskings
Date:
On Mon, 1 Feb 2021 at 10:33, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> One thing that just occurred to me is that you might find it
> interesting to keep tabs on what's in the $PGDATA/pg_notify
> directory.  Do the performance burps correspond to transitory
> peaks in the amount of data there?  Or (grasping at straws here...)
> wraparound of the file names back to 0000?

We don't have filesystem access on Cloud SQL - the downside of the
managed route :)
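
The closest thing we can reach from SQL is probably the queue-usage
fraction, so we'll keep an eye on that around the stalls:

-- Fraction (0 to 1) of the async notification queue currently in use;
-- a rough stand-in for watching pg_notify/ directly.
SELECT pg_notification_queue_usage();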

It sounds like it might be time to bump the pg13 upgrade up the TODO list.

Cheers
Ben



Re: Occasional lengthy locking causing stalling on commit

From: Ben Hoskings
Date:
G'day all,

A quick follow-up on this issue for interest's sake. The stalling we
were seeing turned out to be a Cloud SQL issue and not related to our
listen/notify usage.

Cloud SQL has an automatic storage increase process that resizes the
underlying disk as required to account for cluster growth. As it turns
out, that process occasionally causes I/O to stall for a brief window.
https://cloud.google.com/sql/docs/postgres/instance-settings#automatic-storage-increase-2ndgen

The workaround supplied by Google is to manually provision slack
storage in larger increments to prevent the more frequent automatic
increases, which happen 25GB at a time on a large cluster.

We didn't make the connection because disk resize events are not
visible in any logs; Google Support found the issue by correlating the
timestamps of our observed outages with their internal logs.

Hopefully this is useful for someone else. Thanks again for your help,
Tom - your advice on listen/notify locking on commit was very useful
even though it turned out not to be the cause in this case.

Cheers
Ben

On Mon, 1 Feb 2021 at 12:33, Ben Hoskings <ben@hoskings.net> wrote:
>
> On Mon, 1 Feb 2021 at 10:33, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > One thing that just occurred to me is that you might find it
> > interesting to keep tabs on what's in the $PGDATA/pg_notify
> > directory.  Do the performance burps correspond to transitory
> > peaks in the amount of data there?  Or (grasping at straws here...)
> > wraparound of the file names back to 0000?
>
> We don't have filesystem access on Cloud SQL - the downside of the
> managed route :)
>
> It sounds like it might be time to bump the pg13 upgrade up the TODO list.
>
> Cheers
> Ben