G'day all,
We have a postgres 11.9 instance serving 7-8k queries/sec. In general
it's humming along with no issues (p50 patency of ~1ms; p95 ~30ms).
Lately though, we've seen occasional locking that causes all commits
to the main database to be briefly blocked and the APIs that it backs
to stall.
This is what I know so far:
* It's happened at varying times of day over the last 3 weeks.
Around 10 significant cases (20-30 second stalls) and more shorter
ones.
* Each time queries are blocked behind an AccessExclusiveLock on the
database (class 1262) that is itself blocked.
* There's no change in DB throughput at the time.
* No schema changes or any other queries are being made that would
involve an AccessExclusiveLock.
* The blocked queries are all commit statements; our metrics show
dozens of traces each time that look perfectly normal except for a 20
second commit somewhere within them.
* We make heavy use of listen/notify via que with ~80 notify events
per second across 16 listen channels. (https://github.com/que-rb/que)
Based on all that and the logs below, the locks in question appear to
be held while inserting NOTIFY messages on commit:
https://www.postgresql.org/message-id/3598.1363354686%40sss.pgh.pa.us
If that's the case, the real question is why a small number of those
commits are taking so much longer than expected. Any advice on how to
proceed from here much appreciated.
Cheers
Ben Hoskings
---
The observed outage here started at ~00:35:22 and ended when the
blocked locks were acquired.
2021-01-27 00:35:15.264 UTC [1554652]: [21-1] db=main,user=runtime
LOG: process 1554652 still waiting for ShareLock on transaction
3399854732 after 1000.091 ms
2021-01-27 00:35:15.264 UTC [1554652]: [22-1] db=main,user=runtime
DETAIL: Process holding the lock: 1542461. Wait queue: 1554652,
1542154, 1542496.
2021-01-27 00:35:15.264 UTC [1554652]: [23-1] db=main,user=runtime
CONTEXT: while rechecking updated tuple (#,#) in relation "devices"
2021-01-27 00:35:15.264 UTC [1554652]: [24-1] db=main,user=runtime
STATEMENT: UPDATE "devices" SET "last_successful_request_at" =
'2021-01-27 00:35:14.203471', "updated_at" = '2021-01-27
00:35:14.204966' WHERE "devices"."id" = #
2021-01-27 00:35:15.265 UTC [1542154]: [82-1] db=main,user=runtime
LOG: process 1542154 still waiting for ShareLock on transaction
3399854732 after 1000.098 ms
2021-01-27 00:35:15.265 UTC [1542154]: [83-1] db=main,user=runtime
DETAIL: Process holding the lock: 1542461. Wait queue: 1554652,
1542154, 1542496.
2021-01-27 00:35:15.265 UTC [1542154]: [84-1] db=main,user=runtime
CONTEXT: while rechecking updated tuple (#,#) in relation "devices"
2021-01-27 00:35:15.265 UTC [1542154]: [85-1] db=main,user=runtime
STATEMENT: UPDATE "devices" SET "last_successful_request_at" =
'2021-01-27 00:35:14.226635', "updated_at" = '2021-01-27
00:35:14.227932' WHERE "devices"."id" = #
2021-01-27 00:35:15.267 UTC [1542521]: [93-1] db=main,user=runtime
LOG: process 1542521 still waiting for ShareLock on transaction
3399854728 after 1000.067 ms
2021-01-27 00:35:15.267 UTC [1542521]: [94-1] db=main,user=runtime
DETAIL: Process holding the lock: 1547170. Wait queue: 1542514,
1542521.
2021-01-27 00:35:15.267 UTC [1542521]: [95-1] db=main,user=runtime
CONTEXT: while updating tuple (#,#) in relation "devices"
2021-01-27 00:35:15.267 UTC [1542521]: [96-1] db=main,user=runtime
STATEMENT: UPDATE "devices" SET "last_successful_request_at" =
'2021-01-27 00:35:14.257563', "updated_at" = '2021-01-27
00:35:14.258934' WHERE "devices"."id" = #
2021-01-27 00:35:15.278 UTC [1479762]: [60-1] db=main,user=runtime
LOG: process 1479762 still waiting for AccessExclusiveLock on object 0
of class 1262 of database 0 after 1000.129 ms
2021-01-27 00:35:15.278 UTC [1479762]: [61-1] db=main,user=runtime
DETAIL: Process holding the lock: 1518905. Wait queue: 1517976,
1518853, 1445675, 1518975, 1477989, 1518976, 1517975, 3018947,
1468183, 1477990, 1532581, 1474376, 1558501, 1477991, 1473039,
1468182, 1558509, 1477988, 1558516, 3019502, 1518852, 1473037,
1473038, 1547149, 1518909, 1479762, 1479765.
2021-01-27 00:35:15.290 UTC [1479765]: [51-1] db=main,user=runtime
LOG: process 1479765 still waiting for AccessExclusiveLock on object 0
of class 1262 of database 0 after 1000.110 ms
2021-01-27 00:35:15.290 UTC [1479765]: [52-1] db=main,user=runtime
DETAIL: Process holding the lock: 1518905. Wait queue: 1517976,
1518853, 1445675, 1518975, 1477989, 1518976, 1517975, 3018947,
1468183, 1477990, 1532581, 1474376, 1558501, 1477991, 1473039,
1468182, 1558509, 1477988, 1558516, 3019502, 1518852, 1473037,
1473038, 1547149, 1518909, 1479762, 1479765.
2021-01-27 00:35:15.290 UTC [1479765]: [53-1] db=main,user=runtime
STATEMENT: COMMIT
2021-01-27 00:35:15.293 UTC [1542496]: [130-1] db=main,user=runtime
LOG: process 1542496 still waiting for ShareLock on transaction
3399854732 after 1000.123 ms
2021-01-27 00:35:15.293 UTC [1542496]: [131-1] db=main,user=runtime
DETAIL: Process holding the lock: 1542461. Wait queue: 1554652,
1542154, 1542496.
2021-01-27 00:35:15.293 UTC [1542496]: [132-1] db=main,user=runtime
CONTEXT: while updating tuple (#,#) in relation "devices"
2021-01-27 00:35:15.293 UTC [1542496]: [133-1] db=main,user=runtime
STATEMENT: UPDATE "devices" SET "last_successful_request_at" =
'2021-01-27 00:35:14.283515', "updated_at" = '2021-01-27
00:35:14.284677' WHERE "devices"."id" = #
2021-01-27 00:35:24.507 UTC [32]: [10-1] db=,user= LOG: using stale
statistics instead of current ones because stats collector is not
responding
2021-01-27 00:35:34.698 UTC [1565612]: [1-1] db=,user= LOG: using
stale statistics instead of current ones because stats collector is
not responding
2021-01-27 00:35:34.702 UTC [32]: [11-1] db=,user= LOG: using stale
statistics instead of current ones because stats collector is not
responding
2021-01-27 00:35:35.263 UTC [3485525]: [17-1] db=main,user=datadog
LOG: using stale statistics instead of current ones because stats
collector is not responding
2021-01-27 00:35:35.263 UTC [3485525]: [18-1] db=main,user=datadog
STATEMENT: SELECT psd.datname, numbackends, xact_commit,
xact_rollback, blks_read, blks_hit, tup_returned, tup_fetched,
tup_inserted, tup_updated, tup_deleted, 2^31 - age(datfrozenxid) as
wraparound, deadlocks, temp_bytes, temp_files,
pg_database_size(psd.datname) as pg_database_size FROM
pg_stat_database psd JOIN pg_database pd ON psd.datname = pd.datname
WHERE psd.datname not ilike 'template%%%%' AND psd.datname not ilike
'rdsadmin' AND psd.datname not ilike 'azure_maintenance' AND
psd.datname not ilike 'postgres'
2021-01-27 00:35:39.293 UTC [1517976]: [40-1] db=main,user=runtime
LOG: process 1517976 acquired AccessExclusiveLock on object 0 of class
1262 of database 0 after 25247.913 ms
2021-01-27 00:35:39.348 UTC [1542521]: [97-1] db=main,user=runtime
LOG: process 1542521 acquired ShareLock on transaction 3399854728
after 25081.166 ms
2021-01-27 00:35:39.348 UTC [1542521]: [98-1] db=main,user=runtime
CONTEXT: while updating tuple (#,#) in relation "devices"
2021-01-27 00:35:39.348 UTC [1542521]: [99-1] db=main,user=runtime
STATEMENT: UPDATE "devices" SET "last_successful_request_at" =
'2021-01-27 00:35:14.257563', "updated_at" = '2021-01-27
00:35:14.258934' WHERE "devices"."id" = #
2021-01-27 00:35:39.348 UTC [1542514]: [88-1] db=main,user=runtime
LOG: process 1542514 acquired ShareLock on transaction 3399854728
after 25086.306 ms
2021-01-27 00:35:39.348 UTC [1542514]: [89-1] db=main,user=runtime
CONTEXT: while rechecking updated tuple (#,#) in relation "devices"
2021-01-27 00:35:39.348 UTC [1542514]: [90-1] db=main,user=runtime
STATEMENT: UPDATE "devices" SET "last_successful_request_at" =
'2021-01-27 00:35:14.225360', "updated_at" = '2021-01-27
00:35:14.226612' WHERE "devices"."id" = #
2021-01-27 00:35:39.348 UTC [1542496]: [134-1] db=main,user=runtime
LOG: process 1542496 acquired ShareLock on transaction 3399854732
after 25055.414 ms
2021-01-27 00:35:39.348 UTC [1542496]: [135-1] db=main,user=runtime
CONTEXT: while updating tuple (#,#) in relation "devices"
2021-01-27 00:35:39.348 UTC [1542496]: [136-1] db=main,user=runtime
STATEMENT: UPDATE "devices" SET "last_successful_request_at" =
'2021-01-27 00:35:14.283515', "updated_at" = '2021-01-27
00:35:14.284677' WHERE "devices"."id" = #
2021-01-27 00:35:39.348 UTC [1542154]: [86-1] db=main,user=runtime
LOG: process 1542154 acquired ShareLock on transaction 3399854732
after 25083.744 ms
2021-01-27 00:35:39.348 UTC [1542154]: [87-1] db=main,user=runtime
CONTEXT: while rechecking updated tuple (#,#) in relation "devices"
2021-01-27 00:35:39.348 UTC [1542154]: [88-1] db=main,user=runtime
STATEMENT: UPDATE "devices" SET "last_successful_request_at" =
'2021-01-27 00:35:14.226635', "updated_at" = '2021-01-27
00:35:14.227932' WHERE "devices"."id" = #
2021-01-27 00:35:39.348 UTC [1554652]: [25-1] db=main,user=runtime
LOG: process 1554652 acquired ShareLock on transaction 3399854732
after 25084.652 ms
2021-01-27 00:35:39.348 UTC [1554652]: [26-1] db=main,user=runtime
CONTEXT: while rechecking updated tuple (#,#) in relation "devices"
2021-01-27 00:35:39.348 UTC [1554652]: [27-1] db=main,user=runtime
STATEMENT: UPDATE "devices" SET "last_successful_request_at" =
'2021-01-27 00:35:14.203471', "updated_at" = '2021-01-27
00:35:14.204966' WHERE "devices"."id" = #
2021-01-27 00:35:39.445 UTC [1518853]: [62-1] db=main,user=runtime
LOG: process 1518853 acquired AccessExclusiveLock on object 0 of class
1262 of database 0 after 25399.688 ms
2021-01-27 00:35:39.597 UTC [1445675]: [65-1] db=main,user=runtime
LOG: process 1445675 acquired AccessExclusiveLock on object 0 of class
1262 of database 0 after 25550.833 ms
2021-01-27 00:35:39.751 UTC [1518975]: [57-1] db=main,user=runtime
LOG: process 1518975 acquired AccessExclusiveLock on object 0 of class
1262 of database 0 after 25704.969 ms
2021-01-27 00:35:39.822 UTC [1477989]: [47-1] db=main,user=runtime
LOG: process 1477989 acquired AccessExclusiveLock on object 0 of class
1262 of database 0 after 25775.611 ms
2021-01-27 00:35:39.912 UTC [1518976]: [42-1] db=main,user=runtime
LOG: process 1518976 acquired AccessExclusiveLock on object 0 of class
1262 of database 0 after 25864.312 ms
2021-01-27 00:35:40.040 UTC [1517975]: [55-1] db=main,user=runtime
LOG: process 1517975 acquired AccessExclusiveLock on object 0 of class
1262 of database 0 after 25991.996 ms
2021-01-27 00:35:40.155 UTC [3018947]: [1147-1] db=main,user=runtime
LOG: process 3018947 acquired AccessExclusiveLock on object 0 of class
1262 of database 0 after 26104.830 ms
2021-01-27 00:35:40.207 UTC [1468183]: [40-1] db=main,user=runtime
LOG: process 1468183 acquired AccessExclusiveLock on object 0 of class
1262 of database 0 after 26155.200 ms
2021-01-27 00:35:40.238 UTC [1477990]: [53-1] db=main,user=runtime
LOG: process 1477990 acquired AccessExclusiveLock on object 0 of class
1262 of database 0 after 26181.880 ms
2021-01-27 00:35:40.256 UTC [1532581]: [27-1] db=main,user=runtime
LOG: process 1532581 acquired AccessExclusiveLock on object 0 of class
1262 of database 0 after 26187.341 ms
2021-01-27 00:35:40.263 UTC [1474376]: [40-1] db=main,user=runtime
LOG: process 1474376 acquired AccessExclusiveLock on object 0 of class
1262 of database 0 after 26191.817 ms
2021-01-27 00:35:40.272 UTC [1558501]: [3-1] db=main,user=runtime LOG:
process 1558501 acquired AccessExclusiveLock on object 0 of class 1262
of database 0 after 26192.658 ms
2021-01-27 00:35:40.279 UTC [1477991]: [61-1] db=main,user=runtime
LOG: process 1477991 acquired AccessExclusiveLock on object 0 of class
1262 of database 0 after 26100.590 ms
2021-01-27 00:35:40.285 UTC [1473039]: [36-1] db=main,user=runtime
LOG: process 1473039 acquired AccessExclusiveLock on object 0 of class
1262 of database 0 after 26106.395 ms
2021-01-27 00:35:40.308 UTC [1468182]: [45-1] db=main,user=runtime
LOG: process 1468182 acquired AccessExclusiveLock on object 0 of class
1262 of database 0 after 26128.550 ms
2021-01-27 00:35:40.329 UTC [1558509]: [3-1] db=main,user=runtime LOG:
process 1558509 acquired AccessExclusiveLock on object 0 of class 1262
of database 0 after 26143.959 ms