Re: long-standing data loss bug in initial sync of logical replication - Mailing list pgsql-hackers

From Nitin Motiani
Subject Re: long-standing data loss bug in initial sync of logical replication
Date
Msg-id CAH5HC971JjQiUGyH1ZWhyFmVxrXGBLAbJUz4H5GtXv_Crdadow@mail.gmail.com
Whole thread Raw
In response to Re: long-standing data loss bug in initial sync of logical replication  (Nitin Motiani <nitinmotiani@google.com>)
List pgsql-hackers
On Thu, Jul 18, 2024 at 3:30 PM Nitin Motiani <nitinmotiani@google.com> wrote:
>
> On Thu, Jul 18, 2024 at 3:25 PM Nitin Motiani <nitinmotiani@google.com> wrote:
> >
> > On Thu, Jul 18, 2024 at 3:05 PM Nitin Motiani <nitinmotiani@google.com> wrote:
> > >
> > > On Wed, Jul 17, 2024 at 5:25 PM vignesh C <vignesh21@gmail.com> wrote:
> > > >
> > > > I tested these scenarios, and as you expected, it throws an error for
> > > > the create publication case:
> > > > 2024-07-17 14:50:01.145 IST [481526] 481526  ERROR:  could not receive
> > > > data from WAL stream: ERROR:  publication "pub1" does not exist
> > > >         CONTEXT:  slot "sub1", output plugin "pgoutput", in the change
> > > > callback, associated LSN 0/1510CD8
> > > > 2024-07-17 14:50:01.147 IST [481450] 481450  LOG:  background worker
> > > > "logical replication apply worker" (PID 481526) exited with exit code
> > > > 1
> > > >
> > > > The steps for this process are as follows:
> > > > 1) Create tables in both the publisher and subscriber.
> > > > 2) On the publisher: Create a replication slot.
> > > > 3) On the subscriber: Create a subscription using the slot created by
> > > > the publisher.
> > > > 4) On the publisher:
> > > > 4.a) Session 1: BEGIN; INSERT INTO T1;
> > > > 4.b) Session 2: CREATE PUBLICATION FOR ALL TABLES
> > > > 4.c) Session 1: COMMIT;
> > > >
> > > > Since we are throwing out a "publication does not exist" error, there
> > > > is no inconsistency issue here.
> > > >
> > > > However, an issue persists with DROP ALL TABLES publication, where
> > > > data continues to replicate even after the publication is dropped.
> > > > This happens because the open transaction consumes the invalidation,
> > > > causing the publications to be revalidated using old snapshot. As a
> > > > result, both the open transactions and the subsequent transactions are
> > > > getting replicated.
> > > >
> > > > We can reproduce this issue by following these steps in a logical
> > > > replication setup with an "ALL TABLES" publication:
> > > > On the publisher:
> > > > Session 1: BEGIN; INSERT INTO T1 VALUES (val1);
> > > > In another session on the publisher:
> > > > Session 2: DROP PUBLICATION
> > > > Back in Session 1 on the publisher:
> > > > COMMIT;
> > > > Finally, in Session 1 on the publisher:
> > > > INSERT INTO T1 VALUES (val2);
> > > >
> > > > Even after dropping the publication, both val1 and val2 are still
> > > > being replicated to the subscriber. This means that both the
> > > > in-progress concurrent transaction and the subsequent transactions are
> > > > being replicated.
> > > >
> > >
> > > Hi,
> > >
> > > I tried the 'DROP PUBLICATION' command even for a publication with a
> > > single table. And there also the data continues to get replicated.
> > >
> > > To test this, I did a similar experiment as the above but instead of
> > > creating publication on all tables, I did it for one specific table.
> > >
> > > Here are the steps :
> > > 1. Create table test_1 and test_2 on both the publisher and subscriber
> > > instances.
> > > 2. Create publication p for table test_1 on the publisher.
> > > 3. Create a subscription s which subscribes to p.
> > > 4. On the publisher
> > > 4a) Session 1 : BEGIN; INSERT INTO test_1 VALUES(val1);
> > > 4b) Session 2 : DROP PUBLICATION p;
> > > 4c) Session 1 : Commit;
> > > 5. On the publisher : INSERT INTO test_1 VALUES(val2);
> > >
> > > After these, when I check the subscriber, both val1 and val2 have been
> > > replicated. I tried a few more inserts on publisher after this and
> > > they all got replicated to the subscriber. Only after explicitly
> > > creating a new publication p2 for test_1 on the publisher, the
> > > replication stopped. Most likely because the create publication
> > > command invalidated the cache.
> > >
> > > My guess is that this issue probably comes from the fact that
> > > RemoveObjects in dropcmds.c doesn't do any special handling or
> > > invalidation for the object drop command.
> > >
> >
> > I checked further and I see that RemovePublicationById does do cache
> > invalidation but it is only done in the scenario when the publication
> > is on all tables. This is done without taking any locks. But for the
> > other cases (eg. publication on one table), I don't see any cache
> > invalidation in RemovePublicationById. That would explain why the
> > replication kept happening for multiple transactions after the drop
> > publication command in my example..
> >
>
> Sorry, I missed that for the individual table scenario, the
> invalidation would happen in RemovePublicationRelById. That is
> invalidating the cache for all relids. But this is also not taking any
> locks. So that would explain why dropping the publication on a single
> table doesn't invalidate the cache in an ongoing transaction. I'm not
> sure why the replication kept happening even in subsequent
> transactions.
>
> Either way I think the SRE lock should be taken for all relids in that
> function also before the invalidations.
>

My apologies. I wasn't testing with the latest patch. I see this has
already been done in the v6 patch file.

Thanks & Regards
Nitin Motiani
Google



pgsql-hackers by date:

Previous
From: David Rowley
Date:
Subject: Re: Add mention of execution time memory for enable_partitionwise_* GUCs
Next
From: David Rowley
Date:
Subject: Re: Duplicate unique key values in inheritance tables