From: Martijn van Oosterhout
Subject: Issue with COMMITs stuck on "AccessExclusiveLock on object 0 of class 1262 of database 0"
Msg-id: CADWG95t0j9zF0uwdcMH81KMnDsiTAVHxmBvgYqrRJcD-iLwQhw@mail.gmail.com
List: pgsql-general

Hi,

We're running 9.4.20 (yes, it's old) and having issues with the database occasionally getting stuck, with COMMITs in one database being blocked on COMMITs in a completely different database. I've been able to determine that "object 0 of class 1262 of database 0" relates to the LISTEN/NOTIFY machinery; see the message from Tom Lane[1]. There are not many references to this error online, mostly questions posted without answers.
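
For anyone else who hits this message: 1262 is the OID of the shared catalog pg_database, so "object 0 of class 1262 of database 0" is a single cluster-wide object lock rather than anything tied to one database. Assuming it shows up in pg_locks the way other object locks do, you can watch it like this:

    SELECT pid, locktype, classid, objid, granted
    FROM pg_locks
    WHERE locktype = 'object'
      AND classid = 1262   -- pg_database
      AND objid = 0;       -- the one global LISTEN/NOTIFY lock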

We have log_lock_waits enabled with deadlock_timeout at 1 second, and it mostly catches COMMITs. The logs are at the bottom of this email. What is most interesting is that the COMMITs in one database are blocked on COMMITs of a different database. This is quite unexpected for us, because we would expect the databases to work independently of each other. The AccessExclusiveLock above appears to be a global lock shared across all databases.
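
For reference, these are the (standard) settings that produce the "still waiting" lines below:

    # postgresql.conf
    log_lock_waits   = on    # log a message when a lock wait exceeds deadlock_timeout
    deadlock_timeout = 1s    # deadlock-check interval; doubles as the logging threshold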

A bit more context:
- The system has about 100 databases and several hundred connections in total (yes, we have connection poolers)
- We do about 8,000 queries per second
- We use LISTEN/NOTIFY quite a bit, probably 40-100 notifies per second (we don't have good stats on this yet).
- The wait queues in the logs below consist of many other processes, all attempting to COMMIT across many different databases; since we only log queries blocked for more than a second, most of them don't appear in the log. Usually it resolves itself after about 1.5 seconds, but that is enough to block hundreds of transactions across the cluster. We've seen wait queues of 500 processes on this single lock.
- The problem gets really bad when a session that already holds another lock blocks here, because then sessions that do nothing with NOTIFY get stuck as well (see the sketch after this list). Fortunately this doesn't happen very often, but when it does the system grinds to a halt for a few seconds.
- There are replication slaves, but they all use asynchronous replication
- The I/O subsystem isn't stressed and we don't see any memory pressure or high CPU usage.
- There are no warnings about "too many notifications"
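
As far as I can tell from the server source (async.c), any transaction that has issued a NOTIFY acquires this one global lock during COMMIT and holds it until the commit completes, which would explain the cross-database blocking. A minimal sketch (channel names made up; database names as in the logs below):

    -- Session 1, connected to database app_ha:
    BEGIN;
    NOTIFY chan_a, 'payload';  -- the notification is only queued locally here
    COMMIT;  -- takes the global lock (object 0 of class 1262), appends to the
             -- shared notification queue, holds the lock until commit completes

    -- Session 2, connected to database app_ek, at the same time:
    BEGIN;
    NOTIFY chan_b;
    COMMIT;  -- queues behind session 1 on the same global lock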

We're wondering if it's possible for a single process to send a lot of notifications and thus spend a lot of time doing that. Is there any way of measuring this?
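
One way to at least see who holds the lock at any instant might be to sample pg_locks joined to pg_stat_activity (a sketch, assuming the lock appears with locktype 'object' as above):

    SELECT l.pid, a.datname, a.state, a.query
    FROM pg_locks l
    JOIN pg_stat_activity a ON a.pid = l.pid
    WHERE l.locktype = 'object'
      AND l.classid = 1262
      AND l.objid = 0
      AND l.granted;

(pg_notification_queue_usage() would help gauge the queue depth, but it only exists from 9.6 onwards, so it's no use on our 9.4.)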

Can anyone shed light on what is going on here?


=== logs ===
2019-03-27 01:40:42 UTC [33226-1] [th.ha] app_ha@app_ha LOG:  process 33226 still waiting for AccessExclusiveLock on object 0 of class 1262 of database 0 after 1000.050 ms
2019-03-27 01:40:42 UTC [33226-2] [th.ha] app_ha@app_ha DETAIL:  Process holding the lock: 33162. Wait queue: 33226, 32565, 32443, 32452, 33923, 33900, 31701, 33827, 33844, 33427, 33359, 34012.
2019-03-27 01:40:42 UTC [33226-3] [th.ha] app_ha@app_ha STATEMENT:  COMMIT
2019-03-27 01:40:42 UTC [32565-1] [th.ek] app_ek@app_ek LOG:  process 32565 still waiting for AccessExclusiveLock on object 0 of class 1262 of database 0 after 1000.102 ms
2019-03-27 01:40:42 UTC [32565-2] [th.ek] app_ek@app_ek DETAIL:  Process holding the lock: 33162. Wait queue: 33226, 32565, 32443, 32452, 33923, 33900, 31701, 33827, 33844, 33427, 33359, 34012, 34005, 32054, 26820, 32788.
2019-03-27 01:40:42 UTC [32565-3] [th.ek] app_ek@app_ek STATEMENT:  COMMIT
2019-03-27 01:40:42 UTC [32443-1] [th.ne] app_ne@app_ne LOG:  process 32443 still waiting for AccessExclusiveLock on object 0 of class 1262 of database 0 after 1000.062 ms
2019-03-27 01:40:42 UTC [32443-2] [th.ne] app_ne@app_ne DETAIL:  Process holding the lock: 33162. Wait queue: 33226, 32565, 32443, 32452, 33923, 33900, 31701, 33827, 33844, 33427, 33359, 34012, 34005, 32054, 26820, 32788, 33587, 34062.
2019-03-27 01:40:42 UTC [32443-3] [th.ne] app_ne@app_ne STATEMENT:  COMMIT
2019-03-27 01:40:42 UTC [33784-1] [ev.ge] app_ge@app_ge LOG:  process 33784 still waiting for ExclusiveLock on advisory lock [7777729,1129598309,56397,2] after 1000.085 ms
2019-03-27 01:40:42 UTC [33784-2] [ev.ge] app_ge@app_ge DETAIL:  Process holding the lock: 32452. Wait queue: 33784.
2019-03-27 01:40:42 UTC [33784-3] [ev.ge] app_ge@app_ge STATEMENT:  SELECT pg_advisory_xact_lock(1129598309, 56397)
2019-03-27 01:40:42 UTC [32452-2] [ev.ge] app_ge@app_ge LOG:  process 32452 still waiting for AccessExclusiveLock on object 0 of class 1262 of database 0 after 1000.068 ms
2019-03-27 01:40:42 UTC [32452-3] [ev.ge] app_ge@app_ge DETAIL:  Process holding the lock: 33162. Wait queue: 33226, 32565, 32443, 32452, 33923, 33900, 31701, 33827, 33844, 33427, 33359, 34012, 34005, 32054, 26820, 32788, 33587, 34062, 32282, 34127.
2019-03-27 01:40:42 UTC [32452-4] [ev.ge] app_ge@app_ge STATEMENT:  COMMIT
2019-03-27 01:40:43 UTC [33923-1] [th.ei] app_ei@app_ei LOG:  process 33923 still waiting for AccessExclusiveLock on object 0 of class 1262 of database 0 after 1000.131 ms
2019-03-27 01:40:43 UTC [33923-2] [th.ei] app_ei@app_ei DETAIL:  Process holding the lock: 33162. Wait queue: 33226, 32565, 32443, 32452, 33923, 33900, 31701, 33827, 33844, 33427, 33359, 34012, 34005, 32054, 26820, 32788, 33587, 34062, 32282, 34127, 34129.
2019-03-27 01:40:43 UTC [33923-3] [th.ei] app_ei@app_ei STATEMENT:  COMMIT
2019-03-27 01:40:43 UTC [33900-1] [th.t6] app_t6@app_t6 LOG:  process 33900 still waiting for AccessExclusiveLock on object 0 of class 1262 of database 0 after 1000.080 ms
2019-03-27 01:40:43 UTC [33900-2] [th.t6] app_t6@app_t6 DETAIL:  Process holding the lock: 33162. Wait queue: 33226, 32565, 32443, 32452, 33923, 33900, 31701, 33827, 33844, 33427, 33359, 34012, 34005, 32054, 26820, 32788, 33587, 34062, 32282, 34127, 34129, 34080, 33161.
2019-03-27 01:40:43 UTC [33900-3] [th.t6] app_t6@app_t6 STATEMENT:  COMMIT
--
