Thread: DROP TABLE can crash the replication sync worker

DROP TABLE can crash the replication sync worker

From
Peter Smith
Date:
Hi Hackers.

As discovered in another thread [master] there is an *existing* bug in
the PG HEAD code which can happen if a DROP TABLE is done at same time
a replication tablesync worker is running.

It seems the table's relid that the sync worker is using gets ripped
out from underneath it and is invalidated by the DROP TABLE. Any
subsequent use of that relid will go wrong. In the particular test
case which found this, the result was a stack trace when a LOG message
tried to display the table name of the bad relid.

PSA the patch code to fix this. The patch disallows DROP TABLE while
any associated tablesync worker is still running. This fix was already
confirmed OK in the other thread [v25]

----
[master]
https://www.postgresql.org/message-id/CAHut%2BPtSO4WsZwx8z%3D%2BYp_OWpxFmmFi5WX6OmYJzULNa2NV89g%40mail.gmail.com
[v25] https://www.postgresql.org/message-id/CAHut%2BPtAKP1FoHbUEWN%2Ba%3D8Pg_njsJKc9Zoz05A_ewJSvjX2MQ%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachment

Re: DROP TABLE can crash the replication sync worker

From
Amit Kapila
Date:
On Wed, Feb 3, 2021 at 2:53 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Hi Hackers.
>
> As discovered in another thread [master] there is an *existing* bug in
> the PG HEAD code which can happen if a DROP TABLE is done at same time
> a replication tablesync worker is running.
>
> It seems the table's relid that the sync worker is using gets ripped
> out from underneath it and is invalidated by the DROP TABLE. Any
> subsequent use of that relid will go wrong.
>

Where exactly did you pause the tablesync worker while dropping the
table? We acquire the lock on the table in LogicalRepSyncTableStart
and then keep it for the entire duration of tablesync worker so drop
table shouldn't be allowed.

-- 
With Regards,
Amit Kapila.



Re: DROP TABLE can crash the replication sync worker

From
Peter Smith
Date:
On Wed, Feb 3, 2021 at 11:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Feb 3, 2021 at 2:53 PM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > Hi Hackers.
> >
> > As discovered in another thread [master] there is an *existing* bug in
> > the PG HEAD code which can happen if a DROP TABLE is done at same time
> > a replication tablesync worker is running.
> >
> > It seems the table's relid that the sync worker is using gets ripped
> > out from underneath it and is invalidated by the DROP TABLE. Any
> > subsequent use of that relid will go wrong.
> >
>
> Where exactly did you pause the tablesync worker while dropping the
> table? We acquire the lock on the table in LogicalRepSyncTableStart
> and then keep it for the entire duration of tablesync worker so drop
> table shouldn't be allowed.
>

I have a breakpoint set on LogicalRepSyncTableStart. The DROP TABLE is
done while paused on that breakpoint, so no code of
LogicalRepSyncTableStart has even executed yet.

----
Kind Regards,
Peter Smith.
Fujitsu Australia



Re: DROP TABLE can crash the replication sync worker

From
Amit Kapila
Date:
On Thu, Feb 4, 2021 at 5:31 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Wed, Feb 3, 2021 at 11:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Feb 3, 2021 at 2:53 PM Peter Smith <smithpb2250@gmail.com> wrote:
> > >
> > > Hi Hackers.
> > >
> > > As discovered in another thread [master] there is an *existing* bug in
> > > the PG HEAD code which can happen if a DROP TABLE is done at same time
> > > a replication tablesync worker is running.
> > >
> > > It seems the table's relid that the sync worker is using gets ripped
> > > out from underneath it and is invalidated by the DROP TABLE. Any
> > > subsequent use of that relid will go wrong.
> > >
> >
> > Where exactly did you pause the tablesync worker while dropping the
> > table? We acquire the lock on the table in LogicalRepSyncTableStart
> > and then keep it for the entire duration of tablesync worker so drop
> > table shouldn't be allowed.
> >
>
> I have a breakpoint set on LogicalRepSyncTableStart. The DROP TABLE is
> done while paused on that breakpoint, so no code of
> LogicalRepSyncTableStart has even executed yet.
>

Fair enough. So, you are hitting this problem in finish_sync_worker()
while logging the message because by that time the relation is
dropped. I think it is good to fix that but we don't want the patch
you have attached here, we can fix it locally in finish_sync_worker()
by constructing a different message (something like: "logical
replication table synchronization worker for subscription \"%s\" has
finished") when we can't get rel_name from rel id. This doesn't appear
to be as serious a problem as we were talking about in the patch
"Allow multiple xacts during table sync in logical replication." [1]
because there we don't hold the lock on the table for the entire
duration tablesync. So, even if we want to fix this problem, it would
be more appropriate for back-branches if we push the patch [1].

[1] - https://www.postgresql.org/message-id/CAHut%2BPtAKP1FoHbUEWN%2Ba%3D8Pg_njsJKc9Zoz05A_ewJSvjX2MQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.