Re: Undetected Deadlock - Mailing list pgsql-general

From Michael Harris
Subject Re: Undetected Deadlock
Date
Msg-id CADofcAXSn_oS3MOxS=1epiXirwETOB_BpBmM8w4=F3GaQe+cwA@mail.gmail.com
In response to Re: Undetected Deadlock  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
Hi Alvaro

Thanks for the feedback!

> What version were you using previously?

We were previously on 11.4. Another difference is that we were using
inheritance-based partitioning before, whereas now we are using
declarative partitioning.

> Maybe the lock is already taken before the DELETE is run; do you have
> any triggers, rules, constraints, or anything?

There are no triggers, rules or constraints on the table involved in
the DELETE (either the partition or the table that the partition is
part of).

Even more confusingly - when I reproduce the SQL statements that
should be triggering the deadlock, it does not happen: the DELETE does
not attempt to take an AccessShareLock on the parent table, so it does
not deadlock.

Is there any state associated with a transaction or a database
connection that would alter the locks taken out by a DELETE on a
partition? Or can other concurrent transactions in other processes
somehow cause more locks to be needed?
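
One way I've been trying to answer that is to run the DELETE in an
open transaction and inspect pg_locks for that backend before
committing; a minimal sketch (using the statement from the incident):

```sql
-- Run the DELETE but hold the transaction open so its locks are visible.
BEGIN;
DELETE FROM st.ctr_table_efr_oa_19010 WHERE ropid = 44788868;

-- List every lock this backend currently holds or awaits.
SELECT locktype, relation::regclass AS relation, mode, granted
FROM pg_locks
WHERE pid = pg_backend_pid();

ROLLBACK;  -- don't actually remove the rows
```

When I do this interactively, no AccessShareLock on the parent shows
up, which is why I suspect some extra transaction or session state is
involved in the failing case.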

> If you have seen this several times already, maybe a way to investigate deeper is an
> exhaustive log capture of everything that these transactions do

So far it has happened at least twice. There were a couple of other
incidents that may well also have been caused by this, but not enough
data was collected at the time to be sure.

A bit more detail: of the two deadlocked processes, one is ingesting
new data while the other is removing old data by dropping partitions.
Even before we shifted to 14.1 and native partitioning, we did get
occasional deadlocks between these two processes which we could not
really prevent, so we adopted a retry approach for when they occur.
However, we never had an undetected deadlock in the database.
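
For reference, a sketch of the kind of interleaving that deadlocks the
two jobs (partition names taken from the incident; the exact
interleaving is my assumption, not captured from the logs):

```sql
-- Session A (ingest):
BEGIN;
DELETE FROM st.ctr_table_efr_oa_19010 WHERE ropid = 44788868;
-- holds RowExclusiveLock on the partition

-- Session B (cleanup):
BEGIN;
DROP TABLE st.tpd_oa_18929;
-- needs AccessExclusiveLock; queues behind session A's locks

-- Session A, same transaction, then touches a relation session B
-- already holds exclusively => both sessions wait on each other.
-- Normally the deadlock detector fires after deadlock_timeout
-- (default 1s) and cancels one of them.
```

Normally this is exactly the case Postgres detects and breaks, which
is what makes the undetected occurrences so strange.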

Since going to 14.1 & native partitioning, we are seeing deadlocks
much more frequently. The increase in frequency might be related to
the extra lock taken by the DELETE that I mentioned above. However,
most of the time the deadlock is detected and released by Postgres and
the impact is minimal. It is another story when it is not detected!

I have enabled `log_statement=all`, but the undetected deadlock hasn't
happened again since. I can easily reproduce the deadlock itself, but
not the undetected case.
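
(For anyone following along, one way to enable it cluster-wide
without editing postgresql.conf by hand; it takes effect on reload,
and the log volume can be substantial:)

```sql
-- Requires superuser; written to postgresql.auto.conf.
ALTER SYSTEM SET log_statement = 'all';
SELECT pg_reload_conf();

-- Verify the running value:
SHOW log_statement;
```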

Thanks again.

Cheers
Mike

On Wed, 26 Jan 2022 at 10:11, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2022-Jan-25, Michael Harris wrote:
>
> > We've recently updated our application to PG 14.1, and in the test instance we
> > have started to see some alarming undetected deadlocks.
>
> This is indeed suspicious / worrisome / curious.
>
> What version were you using previously?
>
> I reformatted the result sets:
>
> > An example of what we have seen is:
> >
> >  locktype | database |  relation  | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction |   pid   |      mode       | granted | fastpath |           waitstart           |      relation
> > ----------+----------+------------+------+-------+------------+---------------+---------+-------+----------+--------------------+---------+-----------------+---------+----------+-------------------------------+---------------------
> >  relation |   529986 | 1842228045 |      |       |            |               |         |       |          | 165/1941408        | 2130531 | AccessShareLock | f       | f        | 2022-01-19 00:32:32.626152+01 | st.ctr_table_efr_oa
> > (1 row)
> >
> >  locktype | database |  relation  | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction |   pid   |        mode         | granted | fastpath | waitstart |      relation
> > ----------+----------+------------+------+-------+------------+---------------+---------+-------+----------+--------------------+---------+---------------------+---------+----------+-----------+---------------------
> >  relation |   529986 | 1842228045 |      |       |            |               |         |       |          | 75/2193719         | 2128603 | AccessExclusiveLock | t       | f        |           | st.ctr_table_efr_oa
> > (1 row)
> >
> >  locktype | database |  relation  | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction |   pid   |        mode         | granted | fastpath |           waitstart           | relation
> > ----------+----------+------------+------+-------+------------+---------------+---------+-------+----------+--------------------+---------+---------------------+---------+----------+-------------------------------+-----------
> >  relation |   529986 | 1842231489 |      |       |            |               |         |       |          | 75/2193719         | 2128603 | AccessExclusiveLock | f       | f        | 2022-01-19 00:32:32.924694+01 | st.tpd_oa
> > (1 row)
> >
> >  locktype | database |  relation  | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction |   pid   |      mode       | granted | fastpath | waitstart | relation
> > ----------+----------+------------+------+-------+------------+---------------+---------+-------+----------+--------------------+---------+-----------------+---------+----------+-----------+-----------
> >  relation |   529986 | 1842231489 |      |       |            |               |         |       |          | 165/1941408        | 2130531 | AccessShareLock | t       | f        |           | st.tpd_oa
> > (1 row)
> >
> > So:
> >   pid 2130531 waits for an AccessShareLock on relation 1842228045, blocked by pid 2128603 which holds an AccessExclusiveLock
> >   pid 2128603 waits for an AccessExclusiveLock on relation 1842231489, blocked by pid 2130531 which holds an AccessShareLock
> >
> > The queries being executed by these backends are:
> >
> >    pid   |          query_start          |         state_change          | wait_event_type | wait_event | state  |                            query
> > ---------+-------------------------------+-------------------------------+-----------------+------------+--------+--------------------------------------------------------------
> >  2128603 | 2022-01-19 00:32:32.924413+01 | 2022-01-19 00:32:32.924413+01 | Lock            | relation   | active | DROP TABLE st.tpd_oa_18929
> >  2130531 | 2022-01-19 00:32:32.625706+01 | 2022-01-19 00:32:32.625708+01 | Lock            | relation   | active | DELETE FROM st.ctr_table_efr_oa_19010 WHERE ropid = 44788868
> > (2 rows)
>
> I know of no cases in which we fail to detect a deadlock.  Perhaps you
> have indeed hit a bug.
>
> > Note that there were a lot of other processes also waiting on relation
> > 1842231489 - could that be confusing the deadlock detection routine?
>
> It shouldn't.
>
> > I am also confused about the locks which are being taken out by the
> > DELETE query.
>
> Maybe the lock is already taken before the DELETE is run; do you have
> any triggers, rules, constraints, or anything?  If you have seen this
> several times already, maybe a way to investigate deeper is an
> exhaustive log capture of everything that these transactions do, from
> the point they begin until they become blocked (log_statement=all).
>
> Perhaps you need to involve other concurrent transactions in order to
> cause the problem.
>
> --
> Álvaro Herrera           39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
> "Tiene valor aquel que admite que es un cobarde" (Fernandel)


