Thread: Undetected Deadlock
Hello Experts,

I'm hoping you will be able to help me with a tricky issue.

We've recently updated our application to PG 14.1, and in the test instance we have started to see some alarming undetected deadlocks.

An example of what we have seen is:

 locktype | database |  relation  | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction |   pid   |      mode       | granted | fastpath |           waitstart           |      relation
----------+----------+------------+------+-------+------------+---------------+---------+-------+----------+--------------------+---------+-----------------+---------+----------+-------------------------------+---------------------
 relation |   529986 | 1842228045 |      |       |            |               |         |       |          | 165/1941408        | 2130531 | AccessShareLock | f       | f        | 2022-01-19 00:32:32.626152+01 | st.ctr_table_efr_oa
(1 row)

 locktype | database |  relation  | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction |   pid   |        mode         | granted | fastpath | waitstart |      relation
----------+----------+------------+------+-------+------------+---------------+---------+-------+----------+--------------------+---------+---------------------+---------+----------+-----------+---------------------
 relation |   529986 | 1842228045 |      |       |            |               |         |       |          | 75/2193719         | 2128603 | AccessExclusiveLock | t       | f        |           | st.ctr_table_efr_oa
(1 row)

 locktype | database |  relation  | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction |   pid   |        mode         | granted | fastpath |           waitstart           | relation
----------+----------+------------+------+-------+------------+---------------+---------+-------+----------+--------------------+---------+---------------------+---------+----------+-------------------------------+-----------
 relation |   529986 | 1842231489 |      |       |            |               |         |       |          | 75/2193719         | 2128603 | AccessExclusiveLock | f       | f        | 2022-01-19 00:32:32.924694+01 | st.tpd_oa
(1 row)

 locktype | database |  relation  | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction |   pid   |      mode       | granted | fastpath | waitstart | relation
----------+----------+------------+------+-------+------------+---------------+---------+-------+----------+--------------------+---------+-----------------+---------+----------+-----------+-----------
 relation |   529986 | 1842231489 |      |       |            |               |         |       |          | 165/1941408        | 2130531 | AccessShareLock | t       | f        |           | st.tpd_oa

So:
  pid 2130531 waits for an AccessShareLock on relation 1842228045, blocked by pid 2128603 which holds an AccessExclusiveLock
  pid 2128603 waits for an AccessExclusiveLock on relation 1842231489, blocked by pid 2130531 which holds an AccessShareLock

The queries being executed by these backends are:

   pid   |          query_start          |         state_change          | wait_event_type | wait_event | state  |                             query
---------+-------------------------------+-------------------------------+-----------------+------------+--------+---------------------------------------------------------------
 2128603 | 2022-01-19 00:32:32.924413+01 | 2022-01-19 00:32:32.924413+01 | Lock            | relation   | active | DROP TABLE st.tpd_oa_18929
 2130531 | 2022-01-19 00:32:32.625706+01 | 2022-01-19 00:32:32.625708+01 | Lock            | relation   | active | DELETE FROM st.ctr_table_efr_oa_19010 WHERE ropid = 44788868
(2 rows)

Both of these processes had been waiting for nearly 2 hours when I took the above printouts.
My understanding of deadlock detection is that once a backend has been waiting on a lock for deadlock_timeout (1 second by default), it runs a deadlock detection function, so this should have been detected straight away; somehow it was not. Are there any deadlock conditions that PostgreSQL is not able to detect?

Note that there were a lot of other processes also waiting on relation 1842231489 - could that be confusing the deadlock detection routine?

I am also confused about the locks which are being taken out by the DELETE query. Process 2130531 is trying to delete rows from a partition `st.ctr_table_efr_oa_19010`, and to do so it is trying to get an AccessShareLock on the parent table `st.ctr_table_efr_oa`. I don't really understand why that is, and in fact if I try to reproduce the deadlock manually by executing the relevant SQL in psql, the DELETE never tries to take out such a lock and no deadlock occurs.

What are the circumstances where deleting rows directly from a partition will cause it to take a lock on the partitioned table?

Any suggestions for debugging this issue would be most welcome.

Thanks in advance

Cheers
Mike
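For readers following the lock tables above, the same waits-for information can be pulled out in one query. The sketch below is illustrative only (it is not from the original report) and relies on pg_blocking_pids(), which is available from PostgreSQL 9.6 onward:

    -- List every backend that is waiting on a lock, together with the pids blocking it.
    SELECT a.pid,
           pg_blocking_pids(a.pid) AS blocked_by,
           a.wait_event_type,
           a.query
      FROM pg_stat_activity a
     WHERE cardinality(pg_blocking_pids(a.pid)) > 0;

A deadlock shows up as a cycle: each of the two pids appears in the other's blocked_by array.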
My apologies,

After posting this and looking at how it appears I realised that line wrapping makes the tables totally illegible.

Here they are again with all unnecessary columns removed and others shortened.

 locktype | database |  relation  |   pid   |        mode         | granted |           waitstart           |      relation
----------+----------+------------+---------+---------------------+---------+-------------------------------+---------------------
 relation |   529986 | 1842228045 | 2130531 | AccessShareLock     | f       | 2022-01-19 00:32:32.626152+01 | st.ctr_table_efr_oa
 relation |   529986 | 1842228045 | 2128603 | AccessExclusiveLock | t       |                               | st.ctr_table_efr_oa
 relation |   529986 | 1842231489 | 2128603 | AccessExclusiveLock | f       | 2022-01-19 00:32:32.924694+01 | st.tpd_oa
 relation |   529986 | 1842231489 | 2130531 | AccessShareLock     | t       |                               | st.tpd_oa

   pid   |    query_start     |    state_change    | wait_event_type | wait_event |                             query
---------+--------------------+--------------------+-----------------+------------+---------------------------------------------------------------
 2128603 | 00:32:32.924413+01 | 00:32:32.924413+01 | Lock            | relation   | DROP TABLE st.tpd_oa_18929
 2130531 | 00:32:32.625706+01 | 00:32:32.625708+01 | Lock            | relation   | DELETE FROM st.ctr_table_efr_oa_19010 WHERE ropid = 44788868

I hope this version is easier to read.

Cheers
Mike
> On Jan 24, 2022, at 10:02 PM, Michael Harris <harmic@gmail.com> wrote:
>
> I hope this version is easier to read.

It's definitely tidier, but in Mail.app on my Mac and in Thunderbird there is still a line wrap problem. I don't think that will prevent the experts from helping out. I'm not one of them.
On 2022-Jan-25, Michael Harris wrote:

> We've recently updated our application to PG 14.1, and in the test instance we
> have started to see some alarming undetected deadlocks.

This is indeed suspicious / worrisome / curious.

What version were you using previously?

I reformatted the result sets:

> An example of what we have seen is:
>
>  locktype | database |  relation  | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction |   pid   |      mode       | granted | fastpath |           waitstart           |      relation
> ----------+----------+------------+------+-------+------------+---------------+---------+-------+----------+--------------------+---------+-----------------+---------+----------+-------------------------------+---------------------
>  relation |   529986 | 1842228045 |      |       |            |               |         |       |          | 165/1941408        | 2130531 | AccessShareLock | f       | f        | 2022-01-19 00:32:32.626152+01 | st.ctr_table_efr_oa
> (1 row)
>
>  locktype | database |  relation  | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction |   pid   |        mode         | granted | fastpath | waitstart |      relation
> ----------+----------+------------+------+-------+------------+---------------+---------+-------+----------+--------------------+---------+---------------------+---------+----------+-----------+---------------------
>  relation |   529986 | 1842228045 |      |       |            |               |         |       |          | 75/2193719         | 2128603 | AccessExclusiveLock | t       | f        |           | st.ctr_table_efr_oa
> (1 row)
>
>  locktype | database |  relation  | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction |   pid   |        mode         | granted | fastpath |           waitstart           | relation
> ----------+----------+------------+------+-------+------------+---------------+---------+-------+----------+--------------------+---------+---------------------+---------+----------+-------------------------------+-----------
>  relation |   529986 | 1842231489 |      |       |            |               |         |       |          | 75/2193719         | 2128603 | AccessExclusiveLock | f       | f        | 2022-01-19 00:32:32.924694+01 | st.tpd_oa
> (1 row)
>
>  locktype | database |  relation  | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction |   pid   |      mode       | granted | fastpath | waitstart | relation
> ----------+----------+------------+------+-------+------------+---------------+---------+-------+----------+--------------------+---------+-----------------+---------+----------+-----------+-----------
>  relation |   529986 | 1842231489 |      |       |            |               |         |       |          | 165/1941408        | 2130531 | AccessShareLock | t       | f        |           | st.tpd_oa
>
> So:
>   pid 2130531 waits for an AccessShareLock on relation 1842228045, blocked by pid 2128603 which holds an AccessExclusiveLock
>   pid 2128603 waits for an AccessExclusiveLock on relation 1842231489, blocked by pid 2130531 which holds an AccessShareLock
>
> The queries being executed by these backends are:
>
>    pid   |          query_start          |         state_change          | wait_event_type | wait_event | state  |                             query
> ---------+-------------------------------+-------------------------------+-----------------+------------+--------+---------------------------------------------------------------
>  2128603 | 2022-01-19 00:32:32.924413+01 | 2022-01-19 00:32:32.924413+01 | Lock            | relation   | active | DROP TABLE st.tpd_oa_18929
>  2130531 | 2022-01-19 00:32:32.625706+01 | 2022-01-19 00:32:32.625708+01 | Lock            | relation   | active | DELETE FROM st.ctr_table_efr_oa_19010 WHERE ropid = 44788868
> (2 rows)

I know of no cases in which we fail to detect a deadlock. Perhaps you have indeed hit a bug.

> Note that there were a lot of other processes also waiting on relation
> 1842231489 - could that be confusing the deadlock detection routine?

It shouldn't.

> I am also confused about the locks which are being taken out by the
> DELETE query.

Maybe the lock is already taken before the DELETE is run; do you have any triggers, rules, constraints, or anything? If you have seen this several times already, maybe a way to investigate deeper is an exhaustive log capture of everything that these transactions do, from the point they begin until they become blocked (log_statement=all).

Perhaps you need to involve other concurrent transactions in order to cause the problem.

-- 
Álvaro Herrera        39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"Tiene valor aquel que admite que es un cobarde" (Fernandel)
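For what it's worth, one way to arrange that capture looks roughly like the sketch below; the role name app_user is a placeholder, not something from the thread:

    -- Log every statement server-wide until the problem reproduces ...
    ALTER SYSTEM SET log_statement = 'all';
    SELECT pg_reload_conf();

    -- ... or, to reduce noise, only for the application role (takes effect on new sessions).
    ALTER ROLE app_user SET log_statement = 'all';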
Hi Alvaro

Thanks for the feedback!

> What version were you using previously?

We were previously on 11.4. Another difference is that we were using inheritance based partitioning before, whereas now we are using declarative partitioning.

> Maybe the lock is already taken before the DELETE is run; do you have
> any triggers, rules, constraints, or anything?

There are no triggers, rules or constraints on the table involved in the DELETE (either the partition or the table that the partition is part of).

Even more confusingly, when I reproduce the SQL statements that should be triggering the deadlock, it does not happen: the DELETE does not attempt to take an AccessShareLock on the parent table, so it does not deadlock.

Is there any state associated with a transaction or a database connection that would alter the locks that get taken out for a DELETE on a partition? Or can other concurrent transactions in other processes cause more locks to be needed somehow?

> If you have seen this several times already, maybe a way to investigate deeper is an
> exhaustive log capture of everything that these transactions do

So far it has happened at least twice. There were a couple of other incidents that may well also have been caused by this, but not enough data was collected at the time to be sure.

A bit more detail: the two processes that are deadlocked here are one that is ingesting new data and one that is removing old data by dropping partitions. Even before we shifted to 14.1 and native partitioning, we did get deadlocks between these two processes every so often which we could not really prevent, so we adopted a retry approach for when they occur. However, we never had an undetected deadlock in the database.

Since going to 14.1 and native partitioning, we are seeing a lot more frequent deadlocks. It looks like the increase in frequency might be related to the extra lock taken by the DELETE that I mentioned above. However, most of the time the deadlock is detected and released by Postgres and the impact is minimal. Of course it is another story if it is not detected!

I have enabled `log_statement=all`, but the undetected deadlock hasn't happened again since. I can easily reproduce the deadlock itself, but not the undetected case.

Thanks again.

Cheers
Mike
I must be missing something. You mentioned dropping a partition, so is there an actual need for the delete? Could you detach concurrently and then drop the table or delete rows if needed?
https://www.postgresql.org/docs/14/sql-altertable.html#SQL-ALTERTABLE-DETACH-PARTITION
> I must be missing something. You mentioned dropping a partition, so is there an actual need for the delete? Could you
> detach concurrently and then drop the table or delete rows if needed?

The DELETE is part of a transaction performing data loading. Our application allows data to be overwritten. Before using a COPY to actually store the new data, we do a DELETE to make sure there are no existing records. We tested a number of sequences and found that the most efficient is to perform a DELETE first, even though in most cases there is nothing to delete, and then do the COPY (rather than, for example, checking whether the data exists first and only then deleting it).

The DROP is part of a separate transaction that is doing the data expiry. Note that it is not dropping the same partition; old partitions get dropped, new data is loaded into newer partitions.

Cheers
Mike
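A sketch of that loading pattern, using the partition and key from the earlier output (the real loader's statements and column list are not shown in the thread):

    BEGIN;
    -- Usually removes nothing, but guarantees the incoming rows do not already exist.
    DELETE FROM st.ctr_table_efr_oa_19010 WHERE ropid = 44788868;
    -- Bulk-load the replacement rows.
    COPY st.ctr_table_efr_oa_19010 FROM STDIN;
    COMMIT;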
There may be a bug, so perhaps still pursue reproducing the issue, but I would expect drop concurrently to resolve your issue with the two processes conflicting. Also, perhaps trying "insert, on conflict do update" could be more efficient than the copy, but obviously there are too many unknowns and variables for either to be a definite winner.
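As a rough illustration of that suggestion (hypothetical: it assumes a unique index on ropid and invents a payload column, neither of which is described in the thread):

    INSERT INTO st.ctr_table_efr_oa_19010 (ropid, payload)
    VALUES (44788868, '...')
    ON CONFLICT (ropid) DO UPDATE
        SET payload = EXCLUDED.payload;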
> but I would expect drop concurrently to resolve your issue with the two processes conflicting

I guess you mean ALTER TABLE ... DETACH PARTITION ... CONCURRENTLY? DROP TABLE does not seem to have a concurrent option.

Still, that does seem like a good option to try, thanks for drawing it to my attention.

Cheers
Mike
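Assuming st.tpd_oa_18929 is a partition of st.tpd_oa, as the lock output suggests, the expiry step could then look something like this sketch; note that ALTER TABLE ... DETACH PARTITION ... CONCURRENTLY cannot run inside a transaction block:

    -- The concurrent detach takes only SHARE UPDATE EXCLUSIVE on the parent, so ingestion is not blocked.
    ALTER TABLE st.tpd_oa DETACH PARTITION st.tpd_oa_18929 CONCURRENTLY;
    -- Once detached, dropping the table no longer needs any lock on st.tpd_oa itself.
    DROP TABLE st.tpd_oa_18929;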
Hi

The undetected deadlock occurred again today and I was able to collect some more info.

The presentation was very similar to the case I reported previously:
- One backend trying to do a DROP TABLE on a partition of a partitioned table, waiting for an AccessExclusiveLock on that table
- Another backend trying to do a DELETE on records within a partition of another table, waiting on an AccessShareLock which is already held by the first backend, and already holding a lock on the table that the first backend is waiting for
- A load of other backends also trying to do a DELETE

I was able to attach gdb to the backends and I discovered a few things.

First, taking a stack trace of the process doing the DELETE, it looks like this:

#0  0x00007fc82a6750bb in epoll_wait (epfd=13, events=0x18696f8, maxevents=maxevents@entry=1, timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00000000007d0c32 in WaitEventSetWaitBlock (nevents=1, occurred_events=<optimized out>, cur_timeout=-1, set=0x1869698) at latch.c:1450
#2  WaitEventSetWait (set=0x1869698, timeout=<optimized out>, occurred_events=<optimized out>, nevents=1, wait_event_info=<optimized out>) at latch.c:1396
#3  0x00000000007d0f94 in WaitLatch (latch=<optimized out>, wakeEvents=wakeEvents@entry=33, timeout=timeout@entry=0, wait_event_info=50331648) at latch.c:473
#4  0x00000000007eaea7 in ProcSleep (locallock=locallock@entry=0x24571d0, lockMethodTable=lockMethodTable@entry=0xdb2360 <default_lockmethod>) at proc.c:1361
#5  0x00000000007df419 in WaitOnLock (locallock=locallock@entry=0x24571d0, owner=owner@entry=0x18b80f8) at lock.c:1858
#6  0x00000000007e052c in LockAcquireExtended (locktag=locktag@entry=0x7ffced1d0b80, lockmode=lockmode@entry=1, sessionLock=sessionLock@entry=false, dontWait=dontWait@entry=false, reportMemoryError=reportMemoryError@entry=true, locallockp=locallockp@entry=0x7ffced1d0b78) at lock.c:1100
#7  0x00000000007ddb23 in LockRelationOid (relid=1842227607, lockmode=1) at lmgr.c:117
#8  0x000000000050dba5 in relation_open (relationId=1842227607, lockmode=lockmode@entry=1) at relation.c:56
#9  0x00000000008fc838 in generate_partition_qual (rel=0x56614e8) at partcache.c:361
#10 generate_partition_qual (rel=0x56614e8) at partcache.c:336
#11 0x00000000007502bf in set_baserel_partition_constraint (relation=relation@entry=0x56614e8, rel=<optimized out>, rel=<optimized out>) at plancat.c:2402
~snip~

The lock is being taken out by the function generate_partition_qual (in partcache.c) which has this comment:

    /*
     * Grab at least an AccessShareLock on the parent table.  Must do this
     * even if the partition has been partially detached, because transactions
     * concurrent with the detach might still be trying to use a partition
     * descriptor that includes it.
     */

This is happening during query planning rather than execution, where it is trying to fetch the partition bound expression. It explains why this lock is not always required (the result is cached in the relcache, and anyway it looks to me like this code path is called depending on factors such as the row estimates).

The second thing was that the DROP process, and all the DELETE processes, were all waiting inside epoll_wait, ultimately called from ProcSleep as shown above.
It is sitting in this section of code (proc.c line 1361):

        (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, 0,     <--- processes sitting here
                         PG_WAIT_LOCK | locallock->tag.lock.locktag_type);
        ResetLatch(MyLatch);

        /* check for deadlocks first, as that's probably log-worthy */
        if (got_deadlock_timeout)                                            <--- set breakpoint here
        {
            CheckDeadLock();
            got_deadlock_timeout = false;
        }

I set a breakpoint on the next line after ResetLatch is called, and it was never reached. I thought that could be due to GDB attaching to the process and disrupting the alarm signal that should be sent into the timeout handler, so I manually sent it a SIGALRM. At that point the breakpoint was triggered - but the flag got_deadlock_timeout is not set, so the deadlock check is not executed.

This led me to believe that the deadlock check timeout wasn't set properly at all.

To see if the deadlock would be detected if CheckDeadLock() got an opportunity to run, I manually set got_deadlock_timeout, and sure enough the deadlock was detected and released.

I concluded that the deadlock detection function is able to detect these deadlocks, but for some reason the 1 sec timeout is not being correctly scheduled, so it never runs. When it is artificially triggered, it does release the deadlock.

After the release, a new deadlock formed between the DROP process and another one of the waiting DELETE processes (not an entirely unexpected outcome). I will leave it in that state for as long as possible in case anyone can think of any other info that should be gathered. Luckily it is a test instance.

Sorry for the slightly long and detailed email - let me know if this should be moved to another forum.

Cheers
Mike
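One illustrative way to see whether planning the DELETE takes the parent-table lock in a given session is to plan it inside an open transaction and look at pg_locks from a second session. As noted above, whether the parent lock appears depends on whether the planner needs the partition constraint, so it will not show up every time:

    -- Session 1:
    BEGIN;
    EXPLAIN DELETE FROM st.ctr_table_efr_oa_19010 WHERE ropid = 44788868;

    -- Session 2 (replace 12345 with session 1's pid):
    SELECT relation::regclass AS rel, mode, granted
      FROM pg_locks
     WHERE pid = 12345 AND locktype = 'relation';

    -- Session 1:
    COMMIT;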
Hi again

Some good news. After some more debugging and reflection, I realized that the likely cause is one of our own libraries that gets loaded as part of some custom functions we are using.

Some of these functions trigger fetching of remote resources, for which a timeout is set using `alarm`. The function unfortunately does not re-establish any pre-existing interval timers after it is done, which leads to PostgreSQL missing its own expected alarm signal.

The reason that this was not affecting us on previous Postgres versions was this commit:

https://github.com/postgres/postgres/commit/09cf1d52267644cdbdb734294012cf1228745aaa#diff-b12a7ca3bf9c6a56745844c2670b0b28d2a4237741c395dda318c6cc3664ad4a

After this commit, once an alarm is missed, that backend never sets one again, so no timeouts of any kind will work. Therefore, the deadlock detector was never being run. Prior to that, the next time any timeout was set by the backend it would re-establish its timer.

We will of course fix our own code to prevent this issue, but I am a little concerned at the above commit as it reduces the robustness of Postgres in this situation. Perhaps I will raise it on the pgsql-hackers list.

Cheers
Mike
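All of PostgreSQL's timeouts (the deadlock check, statement_timeout, lock_timeout and so on) ride on the same interval timer, so a crude, illustrative way to probe whether a given backend has got into the state described above is the sketch below, run in a session that has already missed an alarm. On a healthy backend the sleep is cancelled after about a second; a backend whose timer was clobbered sleeps the full ten seconds:

    SET statement_timeout = '1s';
    SELECT pg_sleep(10);   -- should fail with "canceling statement due to statement timeout"
    RESET statement_timeout;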
On Thu, 3 Feb 2022 at 06:25, Michael Harris <harmic@gmail.com> wrote:
>
> Some good news. After some more debugging and reflection, I realized
> that the likely cause is one of our own libraries that gets loaded as
> part of some custom functions we are using.
>
> Some of these functions trigger fetching of remote resources, for
> which a timeout is set using `alarm`. The function unfortunately does
> not re-establish any pre-existing interval timers after it is done,
> which leads to PostgreSQL missing its own expected alarm signal.
>
> The reason that this was not affecting us on previous Postgres
> versions was this commit:
>
> https://github.com/postgres/postgres/commit/09cf1d52267644cdbdb734294012cf1228745aaa#diff-b12a7ca3bf9c6a56745844c2670b0b28d2a4237741c395dda318c6cc3664ad4a
>
> After this commit, once an alarm is missed, that backend never sets
> one again, so no timeouts of any kind will work. Therefore, the
> deadlock detector was never being run. Prior to that, the next time
> any timeout was set by the backend it would re-establish its timer.
>
> We will of course fix our own code to prevent this issue, but I am a
> little concerned at the above commit as it reduces the robustness of
> Postgres in this situation. Perhaps I will raise it on the
> pgsql-hackers list.

Hmm, so you turned off Postgres' alarms so they stopped working, and you're saying that is a robustness issue of Postgres? Yes, something broke and it would be nice to avoid that, but the responsibility for that lies in the user code that was called. Postgres can't know what kernel calls are made during a custom function.

Perhaps you could contribute a test case for this situation and a new call to check/reset any missing alarms? Or alternatively, contribute the function library that fetches remote resources, so that can become an optional part of Postgres, in contrib.

-- 
Simon Riggs                http://www.EnterpriseDB.com/
Simon Riggs <simon.riggs@enterprisedb.com> writes:
> On Thu, 3 Feb 2022 at 06:25, Michael Harris <harmic@gmail.com> wrote:
>> Some of these functions trigger fetching of remote resources, for
>> which a timeout is set using `alarm`. The function unfortunately does
>> not re-establish any pre-existing interval timers after it is done,
>> which leads to PostgreSQL missing its own expected alarm signal.
>>
>> The reason that this was not affecting us on previous Postgres
>> versions was this commit:
>>
>> https://github.com/postgres/postgres/commit/09cf1d52267644cdbdb734294012cf1228745aaa#diff-b12a7ca3bf9c6a56745844c2670b0b28d2a4237741c395dda318c6cc3664ad4a
>>
>> After this commit, once an alarm is missed, that backend never sets
>> one again, so no timeouts of any kind will work. Therefore, the
>> deadlock detector was never being run. Prior to that, the next time
>> any timeout was set by the backend it would re-establish its timer.
>>
>> We will of course fix our own code to prevent this issue, but I am a
>> little concerned at the above commit as it reduces the robustness of
>> Postgres in this situation. Perhaps I will raise it on the
>> pgsql-hackers list.

> Hmm, so you turned off Postgres' alarms so they stopped working, and
> you're saying that is a robustness issue of Postgres?

If Michael's analysis were accurate, I'd agree that there is a robustness issue, but I don't think there is. See timeout.c:220:

    /*
     * Get the time remaining till the nearest pending timeout.  If it is
     * negative, assume that we somehow missed an interrupt, and force
     * signal_pending off.  This gives us a chance to recover if the
     * kernel drops a timeout request for some reason.
     */
    nearest_timeout = active_timeouts[0]->fin_time;
    if (now > nearest_timeout)
    {
        signal_pending = false;

        /* force an interrupt as soon as possible */
        secs = 0;
        usecs = 1;
    }

Now admittedly we don't have a good way to test this stanza, but it should result in re-establishing the timer interrupt the next time any timeout.c API is invoked after a missed interrupt. I don't see anything more that we could or should do. We're not going to issue setitimer() after every user-defined function call.

			regards, tom lane
Michael Harris <harmic@gmail.com> writes:
>> If Michael's analysis were accurate, I'd agree that there is a robustness
>> issue, but I don't think there is. See timeout.c:220:

> Actually that only sets a new timer after the nearest timeout has expired.

Mmm, yeah, you're right: as long as we keep on canceling timeout requests before they are reached, this code won't notice a problem. We need to check against signal_due_at, not only the next timeout request, more or less as attached. Do you want to try this and see if it actually adds any robustness with your buggy code?

> I was thinking that to improve robustness, we could add a check for
> `now < signal_due_at` to schedule_alarm line 300:
>     if (signal_pending && now < signal_due_at && nearest_timeout >= signal_due_at)
>         return;

I don't think that merging this issue into that test is appropriate; the logic is already complicated and hard to explain. Also, it seems to me that some slop is needed, since as mentioned nearby, the kernel's opinion of when to fire the interrupt is likely to be a bit later than ours. I'm not wedded to the 10ms slop proposed below, but it seems like it's probably in the right ballpark.

> One other thing that struck me when reading this code: the variable
> signal_due_at is not declared as volatile sig_atomic_t, even though it
> is read from / written to by the signal handler. Maybe that could
> cause problems?

Hm. I'm not really convinced there's any issue, since signal_due_at is only examined when signal_pending is true. Still, it's probably better for all these variables to be volatile, so done.

> One other question: I've fixed our custom function, so that it
> correctly restores any interval timers that were running, but of
> course if our function takes longer than the remaining time on the
> postgres timer the signal will be delivered late. Beyond the fact that
> a deadlock or statement timeout will take longer than expected, are
> there any other negative consequences? Are any of the timeouts
> time-critical such that being delayed by a few seconds would cause a
> problem?

They're certainly not supposed to be that time-critical; any code that is expecting such a thing is going to have lots of problems already. PG is not a hard-real-time system ;-)

			regards, tom lane

diff --git a/src/backend/utils/misc/timeout.c b/src/backend/utils/misc/timeout.c
index ac57073033..f38c85df74 100644
--- a/src/backend/utils/misc/timeout.c
+++ b/src/backend/utils/misc/timeout.c
@@ -71,11 +71,12 @@ static volatile sig_atomic_t alarm_enabled = false;
 
 /*
  * State recording if and when we next expect the interrupt to fire.
+ * (signal_due_at is valid only when signal_pending is true.)
  * Note that the signal handler will unconditionally reset signal_pending to
  * false, so that can change asynchronously even when alarm_enabled is false.
  */
 static volatile sig_atomic_t signal_pending = false;
-static TimestampTz signal_due_at = 0;	/* valid only when signal_pending */
+static volatile TimestampTz signal_due_at = 0;
 
 
 /*****************************************************************************
@@ -217,10 +218,21 @@ schedule_alarm(TimestampTz now)
 
 		MemSet(&timeval, 0, sizeof(struct itimerval));
 
+		/*
+		 * If we think there's a signal pending, but current time is more than
+		 * 10ms past when the signal was due, then assume that the timeout
+		 * request got lost somehow and re-enable it.  (10ms corresponds to
+		 * the worst-case timeout granularity on modern systems.)  It won't
+		 * hurt us if the interrupt does manage to fire between now and when
+		 * we reach the setitimer() call.
+		 */
+		if (signal_pending && now > signal_due_at + 10 * 1000)
+			signal_pending = false;
+
 		/*
 		 * Get the time remaining till the nearest pending timeout.  If it is
 		 * negative, assume that we somehow missed an interrupt, and force
-		 * signal_pending off.  This gives us a chance to recover if the
+		 * signal_pending off.  This gives us another chance to recover if the
 		 * kernel drops a timeout request for some reason.
 		 */
 		nearest_timeout = active_timeouts[0]->fin_time;
Michael Harris <harmic@gmail.com> writes:
> On Mon, 7 Feb 2022 at 09:57, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Do you want to try this and see if it actually adds any robustness with your buggy code?

> Sorry for the delayed response, & thanks for the patch.

> I wasn't able to test with our actual application because it could
> take days for it to actually trigger the problem, so I tested it with
> a simulation, which you can find here:
> https://github.com/harmic/pg_almloss

> With that simulation I could attach gdb to the backend and see that
> signal_pending & signal_due_at were being reset in the expected way,
> even when a missed interrupt was triggered.

> I'm convinced your patch improves robustness under the scenario we saw.

Great, thanks for testing!

			regards, tom lane