Re: DROP TABLE can crash the replication sync worker - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: DROP TABLE can crash the replication sync worker
Date
Msg-id CAA4eK1JJ0=rL+LegmgKC3OyUbd3TLrNw7oY-Xq5fmCmR1JAkxg@mail.gmail.com
Whole thread Raw
In response to Re: DROP TABLE can crash the replication sync worker  (Peter Smith <smithpb2250@gmail.com>)
List pgsql-hackers
On Thu, Feb 4, 2021 at 5:31 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Wed, Feb 3, 2021 at 11:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Feb 3, 2021 at 2:53 PM Peter Smith <smithpb2250@gmail.com> wrote:
> > >
> > > Hi Hackers.
> > >
> > > As discovered in another thread [master] there is an *existing* bug in
> > > the PG HEAD code which can happen if a DROP TABLE is done at same time
> > > a replication tablesync worker is running.
> > >
> > > It seems the table's relid that the sync worker is using gets ripped
> > > out from underneath it and is invalidated by the DROP TABLE. Any
> > > subsequent use of that relid will go wrong.
> > >
> >
> > Where exactly did you pause the tablesync worker while dropping the
> > table? We acquire the lock on the table in LogicalRepSyncTableStart
> > and then keep it for the entire duration of tablesync worker so drop
> > table shouldn't be allowed.
> >
>
> I have a breakpoint set on LogicalRepSyncTableStart. The DROP TABLE is
> done while paused on that breakpoint, so no code of
> LogicalRepSyncTableStart has even executed yet.
>

Fair enough. So, you are hitting this problem in finish_sync_worker()
while logging the message because by that time the relation is
dropped. I think it is good to fix that but we don't want the patch
you have attached here, we can fix it locally in finish_sync_worker()
by constructing a different message (something like: "logical
replication table synchronization worker for subscription \"%s\" has
finished") when we can't get rel_name from rel id. This doesn't appear
to be as serious a problem as we were talking about in the patch
"Allow multiple xacts during table sync in logical replication." [1]
because there we don't hold the lock on the table for the entire
duration tablesync. So, even if we want to fix this problem, it would
be more appropriate for back-branches if we push the patch [1].

[1] - https://www.postgresql.org/message-id/CAHut%2BPtAKP1FoHbUEWN%2Ba%3D8Pg_njsJKc9Zoz05A_ewJSvjX2MQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.



pgsql-hackers by date:

Previous
From: Bharath Rupireddy
Date:
Subject: Re: Is Recovery actually paused?
Next
From: Greg Nancarrow
Date:
Subject: Re: Parallel INSERT (INTO ... SELECT ...)