Re: long-standing data loss bug in initial sync of logical replication - Mailing list pgsql-hackers
From | Nitin Motiani |
---|---|
Subject | Re: long-standing data loss bug in initial sync of logical replication |
Date | |
Msg-id | CAH5HC971JjQiUGyH1ZWhyFmVxrXGBLAbJUz4H5GtXv_Crdadow@mail.gmail.com Whole thread Raw |
In response to | Re: long-standing data loss bug in initial sync of logical replication (Nitin Motiani <nitinmotiani@google.com>) |
List | pgsql-hackers |
On Thu, Jul 18, 2024 at 3:30 PM Nitin Motiani <nitinmotiani@google.com> wrote: > > On Thu, Jul 18, 2024 at 3:25 PM Nitin Motiani <nitinmotiani@google.com> wrote: > > > > On Thu, Jul 18, 2024 at 3:05 PM Nitin Motiani <nitinmotiani@google.com> wrote: > > > > > > On Wed, Jul 17, 2024 at 5:25 PM vignesh C <vignesh21@gmail.com> wrote: > > > > > > > > I tested these scenarios, and as you expected, it throws an error for > > > > the create publication case: > > > > 2024-07-17 14:50:01.145 IST [481526] 481526 ERROR: could not receive > > > > data from WAL stream: ERROR: publication "pub1" does not exist > > > > CONTEXT: slot "sub1", output plugin "pgoutput", in the change > > > > callback, associated LSN 0/1510CD8 > > > > 2024-07-17 14:50:01.147 IST [481450] 481450 LOG: background worker > > > > "logical replication apply worker" (PID 481526) exited with exit code > > > > 1 > > > > > > > > The steps for this process are as follows: > > > > 1) Create tables in both the publisher and subscriber. > > > > 2) On the publisher: Create a replication slot. > > > > 3) On the subscriber: Create a subscription using the slot created by > > > > the publisher. > > > > 4) On the publisher: > > > > 4.a) Session 1: BEGIN; INSERT INTO T1; > > > > 4.b) Session 2: CREATE PUBLICATION FOR ALL TABLES > > > > 4.c) Session 1: COMMIT; > > > > > > > > Since we are throwing out a "publication does not exist" error, there > > > > is no inconsistency issue here. > > > > > > > > However, an issue persists with DROP ALL TABLES publication, where > > > > data continues to replicate even after the publication is dropped. > > > > This happens because the open transaction consumes the invalidation, > > > > causing the publications to be revalidated using old snapshot. As a > > > > result, both the open transactions and the subsequent transactions are > > > > getting replicated. > > > > > > > > We can reproduce this issue by following these steps in a logical > > > > replication setup with an "ALL TABLES" publication: > > > > On the publisher: > > > > Session 1: BEGIN; INSERT INTO T1 VALUES (val1); > > > > In another session on the publisher: > > > > Session 2: DROP PUBLICATION > > > > Back in Session 1 on the publisher: > > > > COMMIT; > > > > Finally, in Session 1 on the publisher: > > > > INSERT INTO T1 VALUES (val2); > > > > > > > > Even after dropping the publication, both val1 and val2 are still > > > > being replicated to the subscriber. This means that both the > > > > in-progress concurrent transaction and the subsequent transactions are > > > > being replicated. > > > > > > > > > > Hi, > > > > > > I tried the 'DROP PUBLICATION' command even for a publication with a > > > single table. And there also the data continues to get replicated. > > > > > > To test this, I did a similar experiment as the above but instead of > > > creating publication on all tables, I did it for one specific table. > > > > > > Here are the steps : > > > 1. Create table test_1 and test_2 on both the publisher and subscriber > > > instances. > > > 2. Create publication p for table test_1 on the publisher. > > > 3. Create a subscription s which subscribes to p. > > > 4. On the publisher > > > 4a) Session 1 : BEGIN; INSERT INTO test_1 VALUES(val1); > > > 4b) Session 2 : DROP PUBLICATION p; > > > 4c) Session 1 : Commit; > > > 5. On the publisher : INSERT INTO test_1 VALUES(val2); > > > > > > After these, when I check the subscriber, both val1 and val2 have been > > > replicated. I tried a few more inserts on publisher after this and > > > they all got replicated to the subscriber. Only after explicitly > > > creating a new publication p2 for test_1 on the publisher, the > > > replication stopped. Most likely because the create publication > > > command invalidated the cache. > > > > > > My guess is that this issue probably comes from the fact that > > > RemoveObjects in dropcmds.c doesn't do any special handling or > > > invalidation for the object drop command. > > > > > > > I checked further and I see that RemovePublicationById does do cache > > invalidation but it is only done in the scenario when the publication > > is on all tables. This is done without taking any locks. But for the > > other cases (eg. publication on one table), I don't see any cache > > invalidation in RemovePublicationById. That would explain why the > > replication kept happening for multiple transactions after the drop > > publication command in my example.. > > > > Sorry, I missed that for the individual table scenario, the > invalidation would happen in RemovePublicationRelById. That is > invalidating the cache for all relids. But this is also not taking any > locks. So that would explain why dropping the publication on a single > table doesn't invalidate the cache in an ongoing transaction. I'm not > sure why the replication kept happening even in subsequent > transactions. > > Either way I think the SRE lock should be taken for all relids in that > function also before the invalidations. > My apologies. I wasn't testing with the latest patch. I see this has already been done in the v6 patch file. Thanks & Regards Nitin Motiani Google
pgsql-hackers by date: