Thread: DROP TABLE can crash the replication sync worker
Hi Hackers.

As discovered in another thread [master] there is an *existing* bug in
the PG HEAD code which can happen if a DROP TABLE is done at the same
time a replication tablesync worker is running.

It seems the table's relid that the sync worker is using gets ripped
out from underneath it and is invalidated by the DROP TABLE. Any
subsequent use of that relid will go wrong. In the particular test
case which found this, the result was a stack trace when a LOG message
tried to display the table name of the bad relid.

PSA the patch code to fix this. The patch disallows DROP TABLE while
any associated tablesync worker is still running. This fix was already
confirmed OK in the other thread [v25].

----
[master] https://www.postgresql.org/message-id/CAHut%2BPtSO4WsZwx8z%3D%2BYp_OWpxFmmFi5WX6OmYJzULNa2NV89g%40mail.gmail.com
[v25] https://www.postgresql.org/message-id/CAHut%2BPtAKP1FoHbUEWN%2Ba%3D8Pg_njsJKc9Zoz05A_ewJSvjX2MQ%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia
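For readers unfamiliar with the failure mode, here is a minimal sketch
of how a stale relid can crash a log message. It assumes the crash
path runs through get_rel_name(), which returns NULL for a relid whose
relation has been dropped; log_sync_done is a hypothetical function,
for illustration only, not code from the thread's patch:

/*
 * Hypothetical illustration (log_sync_done is not real PG code):
 * get_rel_name() returns NULL when the relation no longer exists,
 * and passing that NULL to a %s format specifier is undefined
 * behavior -- the kind of LOG call that produced the stack trace.
 */
#include "postgres.h"
#include "utils/lsyscache.h"

static void
log_sync_done(Oid relid, const char *subname)
{
	char	   *relname = get_rel_name(relid);	/* NULL after DROP TABLE */

	/* Unsafe if relname is NULL. */
	ereport(LOG,
			(errmsg("table synchronization for subscription \"%s\", table \"%s\" has finished",
					subname, relname)));
}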
On Wed, Feb 3, 2021 at 2:53 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Hi Hackers.
>
> As discovered in another thread [master] there is an *existing* bug in
> the PG HEAD code which can happen if a DROP TABLE is done at the same
> time a replication tablesync worker is running.
>
> It seems the table's relid that the sync worker is using gets ripped
> out from underneath it and is invalidated by the DROP TABLE. Any
> subsequent use of that relid will go wrong.
>

Where exactly did you pause the tablesync worker while dropping the
table? We acquire the lock on the table in LogicalRepSyncTableStart
and then keep it for the entire duration of the tablesync worker, so
DROP TABLE shouldn't be allowed.

--
With Regards,
Amit Kapila.
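A rough sketch of the locking principle being described here
(sync_table_guarded is hypothetical, not the actual tablesync code):
DROP TABLE must acquire AccessExclusiveLock, which conflicts with
every other lock mode, so even a weak lock held by the worker blocks
a concurrent drop for as long as it is held.

/*
 * Sketch under the stated assumption: any heavyweight lock on the
 * relation, even RowExclusiveLock, makes a concurrent DROP TABLE
 * wait until the lock is released.
 */
#include "postgres.h"
#include "access/table.h"

static void
sync_table_guarded(Oid relid)
{
	Relation	rel;

	/* While this lock is held, a concurrent DROP TABLE waits. */
	rel = table_open(relid, RowExclusiveLock);

	/* ... copy/synchronize the table contents here ... */

	/* Releasing the lock reopens the window for DROP TABLE. */
	table_close(rel, RowExclusiveLock);
}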
On Wed, Feb 3, 2021 at 11:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Feb 3, 2021 at 2:53 PM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > Hi Hackers.
> >
> > As discovered in another thread [master] there is an *existing* bug in
> > the PG HEAD code which can happen if a DROP TABLE is done at the same
> > time a replication tablesync worker is running.
> >
> > It seems the table's relid that the sync worker is using gets ripped
> > out from underneath it and is invalidated by the DROP TABLE. Any
> > subsequent use of that relid will go wrong.
> >
>
> Where exactly did you pause the tablesync worker while dropping the
> table? We acquire the lock on the table in LogicalRepSyncTableStart
> and then keep it for the entire duration of the tablesync worker, so
> DROP TABLE shouldn't be allowed.
>

I have a breakpoint set on LogicalRepSyncTableStart. The DROP TABLE is
done while paused on that breakpoint, so no code of
LogicalRepSyncTableStart has even executed yet.

----
Kind Regards,
Peter Smith.
Fujitsu Australia
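To make the window concrete, a rough timeline of the race as reported
(reconstructed from the messages above, not from logs):

tablesync worker                          other backend
----------------                          -------------
worker launched; relid already recorded
paused at breakpoint, before
  LogicalRepSyncTableStart runs
  (no lock on the table held yet)
                                          DROP TABLE succeeds
worker resumes; the recorded relid now
  points at a dropped relation, so later
  name lookups return NULL / fail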
On Thu, Feb 4, 2021 at 5:31 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Wed, Feb 3, 2021 at 11:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Feb 3, 2021 at 2:53 PM Peter Smith <smithpb2250@gmail.com> wrote:
> > >
> > > Hi Hackers.
> > >
> > > As discovered in another thread [master] there is an *existing* bug in
> > > the PG HEAD code which can happen if a DROP TABLE is done at the same
> > > time a replication tablesync worker is running.
> > >
> > > It seems the table's relid that the sync worker is using gets ripped
> > > out from underneath it and is invalidated by the DROP TABLE. Any
> > > subsequent use of that relid will go wrong.
> > >
> >
> > Where exactly did you pause the tablesync worker while dropping the
> > table? We acquire the lock on the table in LogicalRepSyncTableStart
> > and then keep it for the entire duration of the tablesync worker, so
> > DROP TABLE shouldn't be allowed.
> >
>
> I have a breakpoint set on LogicalRepSyncTableStart. The DROP TABLE is
> done while paused on that breakpoint, so no code of
> LogicalRepSyncTableStart has even executed yet.
>

Fair enough. So, you are hitting this problem in finish_sync_worker()
while logging the message, because by that time the relation has been
dropped. I think it is good to fix that, but we don't want the patch
you have attached here; we can fix it locally in finish_sync_worker()
by constructing a different message (something like: "logical
replication table synchronization worker for subscription \"%s\" has
finished") when we can't get the rel_name from the rel id.

This doesn't appear to be as serious a problem as we were talking
about in the patch "Allow multiple xacts during table sync in logical
replication" [1], because there we don't hold the lock on the table
for the entire duration of tablesync. So, even if we want to fix this
problem, it would be more appropriate for back-branches if we push the
patch [1].

[1] - https://www.postgresql.org/message-id/CAHut%2BPtAKP1FoHbUEWN%2Ba%3D8Pg_njsJKc9Zoz05A_ewJSvjX2MQ%40mail.gmail.com

--
With Regards,
Amit Kapila.
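A minimal sketch of the local fix being suggested, assuming the
general shape of finish_sync_worker() in
src/backend/replication/logical/tablesync.c; this is not the committed
patch, just an illustration of the fallback-message idea, with the
fallback text taken verbatim from the suggestion above:

/*
 * Sketch of the suggested fix (an assumption about its shape, not the
 * committed patch): fall back to a message without the table name when
 * the relation has already been dropped.  MyLogicalRepWorker and
 * MySubscription are the existing globals from worker_internal.h.
 */
#include "postgres.h"
#include "replication/worker_internal.h"
#include "utils/lsyscache.h"

static void
finish_sync_worker(void)
{
	char	   *relname = get_rel_name(MyLogicalRepWorker->relid);

	if (relname != NULL)
		ereport(LOG,
				(errmsg("logical replication table synchronization worker for subscription \"%s\", table \"%s\" has finished",
						MySubscription->name, relname)));
	else
		ereport(LOG,
				(errmsg("logical replication table synchronization worker for subscription \"%s\" has finished",
						MySubscription->name)));

	/* ... commit the transaction, update stats, and exit as before ... */
}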