Re: DROP DATABASE deadlocks with logical replication worker in PG 15.1 - Mailing list pgsql-bugs

From Amit Kapila
Subject Re: DROP DATABASE deadlocks with logical replication worker in PG 15.1
Date
Msg-id CAA4eK1+s1B7-UD+JG_-ZVMag8DeufQ0oshS5RGaZ-LY9v7bVuQ@mail.gmail.com
In response to Re: DROP DATABASE deadlocks with logical replication worker in PG 15.1  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: DROP DATABASE deadlocks with logical replication worker in PG 15.1  (Andres Freund <andres@anarazel.de>)
List pgsql-bugs
On Mon, Jan 16, 2023 at 3:02 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sat, Jan 14, 2023 at 9:32 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > The problem is here:
> >
> > On 2023-01-13 20:53:49 +0530, Lakshmi Narayanan Sreethar wrote:
> > > #7  0x0000559cccbe1e71 in LogicalRepSyncTableStart
> > > (origin_startpos=0x7fffb26f7728) at
> > > /pg15.1/src/backend/replication/logical/tablesync.c:1353
> >
> > Because the logical rep code explicitly prevents interrupts:
> >
> >         /*
> >          * Create a new permanent logical decoding slot. This slot will be used
> >          * for the catchup phase after COPY is done, so tell it to use the
> >          * snapshot to make the final data consistent.
> >          *
> >          * Prevent cancel/die interrupts while creating slot here because it is
> >          * possible that before the server finishes this command, a concurrent
> >          * drop subscription happens which would complete without removing this
> >          * slot leading to a dangling slot on the server.
> >          */
> >         HOLD_INTERRUPTS();
> >         walrcv_create_slot(LogRepWorkerWalRcvConn,
> >                                            slotname, false /* permanent */ , false /* two_phase */ ,
> >                                            CRS_USE_SNAPSHOT, origin_startpos);
> >         RESUME_INTERRUPTS();
> >
> > Which is just completely entirely wrong. Independent of this issue even. Not
> > allowing termination for the duration of command executed over network?
> >
> > This is from:
> >
> > commit 6b67d72b604cb913e39324b81b61ab194d94cba0
> > Author: Amit Kapila <akapila@postgresql.org>
> > Date:   2021-03-17 08:15:12 +0530
> >
> >     Fix race condition in drop subscription's handling of tablesync slots.
> >
...
...
> >
> >
> > But this can't be the right fix.
> >
>
> I will look into this
>

As per my initial analysis, I added this code to hold/resume
interrupts during slot creation because of a test failure (in the
buildfarm) reported in email [1]. As the report and discussion in this
thread make clear, that was the wrong fix. Email [2] contains an
analysis of the race condition that leads to the test failure.
Thinking again about the failure, I feel we can instead change the
failing test (t/004_sync.pl) to either ensure that both walsenders
(corresponding to the sync worker and the apply worker) exit after the
subscription is dropped and before the remaining slots on the
publisher are checked, or make the test wait for the slot count to
reach zero. As mentioned in email [2], such a slot created on the
server will be dropped before it is persisted, so there doesn't appear
to be a problem in removing the hold/resume-interrupts change and
adjusting the test instead, but I will study this more and share my
findings in the next update.
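To illustrate the "wait for slots to become zero" idea: the real change
would go in the Perl TAP test t/004_sync.pl (likely via the standard
poll_query_until helper against "SELECT count(*) FROM
pg_replication_slots"), but the polling pattern can be sketched
generically like this (Python used purely for illustration; the
count_slots callback and timeouts are hypothetical stand-ins, not part
of any actual patch):

```python
import time

def wait_for_zero_slots(count_slots, timeout=30.0, interval=0.5):
    """Poll count_slots() until it returns 0 or the timeout expires.

    Sketch of the proposed test-side wait: after DROP SUBSCRIPTION,
    keep re-checking the publisher's slot count until both walsenders
    have exited and dropped their slots, instead of asserting on the
    slot count immediately (the race the buildfarm failure hit).
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if count_slots() == 0:
            return True
        time.sleep(interval)
    return False

# Stand-in for "SELECT count(*) FROM pg_replication_slots": two slots
# linger briefly after the drop, then disappear.
remaining = [2, 1, 0]
assert wait_for_zero_slots(lambda: remaining.pop(0),
                           timeout=5, interval=0) is True
```

The point of the design is that the assertion only fires after the
asynchronous cleanup has finished, rather than racing against it.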

[1] - https://www.postgresql.org/message-id/CA%2BhUKGJG9dWpw1cOQ2nzWU8PHjm%3DPTraB%2BKgE5648K9nTfwvxg%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAA4eK1JnDQF_B9vDj7WQz9uufbMfE5wo%3DmdG5fbHq9S7Unfhtg%40mail.gmail.com

-- 
With Regards,
Amit Kapila.


