Re: BUG #18267: Logical replication bug: data is not synchronized after Alter Publication. - Mailing list pgsql-bugs

From Dilip Kumar
Subject Re: BUG #18267: Logical replication bug: data is not synchronized after Alter Publication.
Date
Msg-id CAFiTN-vsdWgthGJFOG74E94LAi5E5DmP0Ag616V62hftHq6Ldw@mail.gmail.com
In response to RE: BUG #18267: Logical replication bug: data is not synchronized after Alter Publication.  ("Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>)
Responses RE: BUG #18267: Logical replication bug: data is not synchronized after Alter Publication.  ("Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>)
List pgsql-bugs
On Fri, Jan 5, 2024 at 9:25 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Song,
>
> >
> > Hi hackers, I found that when inserting plenty of data into a table while
> > concurrently adding the table to a publication (through ALTER PUBLICATION),
> > the incremental data is likely not to be synchronized to the subscriber.
> > Here is my test method:
>
> Good catch.
>
> > 1. On publisher and subscriber, create table for test:
> > CREATE TABLE tab_1 (a int);
> >
> > 2. Setup logical replication:
> > on publisher:
> >      SELECT pg_create_logical_replication_slot('slot1', 'pgoutput', false, false);
> >      CREATE PUBLICATION tap_pub;
> > on subscriber:
> >      CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub
> >          WITH (enabled = true, create_slot = false, slot_name = 'slot1');
> >
> > 3. Perform Insert:
> >      for (my $i = 1; $i <= 1000; $i++) {
> >          $node_publisher->safe_psql('postgres',
> >              "INSERT INTO tab_1 SELECT generate_series(1, 1000)");
> >      }
> >      Each transaction contains 1000 insertions, and there are 1000
> > transactions in total.
> >
> > 4. While performing step 3, add table tab_1 to the publication:
> >      ALTER PUBLICATION tap_pub ADD TABLE tab_1;
> >      ALTER SUBSCRIPTION tap_sub REFRESH PUBLICATION;
>
> I could reproduce the failure. PSA the script.
>
> In the script, ALTER PUBLICATION was executed while doing the initial data sync.
> (The workload is almost the same as what the reporter posted, but the number of rows is reduced.)
>
> In total, 40000 tuples are inserted on the publisher. However, after some time, only 25000 tuples are replicated:
>
> ```
> publisher=# SELECT count(*) FROM tab_1 ;
>  count
> -------
>  40000
> (1 row)
>
> subscriber=# SELECT count(*) FROM tab_1 ;
>  count
> -------
>  25000
> (1 row)
> ```
>
> Is it same failure you saw?

With your attached script I was able to see this gap. I didn't dig
deeper, but from an initial investigation I could see that even after
ALTER PUBLICATION, pgoutput_change continues to see
'relentry->pubactions.pubinsert' as false, even after re-fetching the
relation entry following the invalidation. That suggests the
invalidation framework is working fine, but we are using an older
snapshot to fetch the entry. I did not debug further why it does not
get the updated snapshot that can see the change in publication,
because I assume Yutao Song has already analyzed that, per his first
email, so I will wait for his patch.
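To make the suspected race concrete, here is a toy model in plain Python
(not PostgreSQL code; the class and field names are invented). It shows how
a cache entry rebuilt between the invalidation and the snapshot change
keeps pubinsert = false, and stays cached afterwards:

```python
# Toy model of the suspected race: the relation-entry cache is invalidated
# immediately, but the snapshot used to rebuild the entry only advances when
# the decoder reaches the queued snapshot-change record.

class ToyDecoder:
    def __init__(self):
        # pg_publication_rel contents visible under each "snapshot": before
        # the ALTER PUBLICATION (0) tab_1 is not published; after it (1) it is.
        self.catalog = {0: set(), 1: {"tab_1"}}
        self.snapshot = 0   # snapshot currently used for catalog lookups
        self.cache = {}     # relname -> cached entry (like RelationSyncEntry)

    def get_entry(self, rel):
        # Rebuild the cache entry from the catalog when it is missing,
        # using whatever snapshot the decoder currently has.
        if rel not in self.cache:
            self.cache[rel] = {"pubinsert": rel in self.catalog[self.snapshot]}
        return self.cache[rel]

    def invalidate(self, rel):
        self.cache.pop(rel, None)

d = ToyDecoder()
d.get_entry("tab_1")                       # initial entry: pubinsert is False
d.invalidate("tab_1")                      # ALTER PUBLICATION invalidates at once
stale = d.get_entry("tab_1")["pubinsert"]  # refetch, but snapshot not advanced
d.snapshot = 1                             # snapshot change finally decoded...
fresh = d.get_entry("tab_1")["pubinsert"]  # ...but the stale entry is cached
d.invalidate("tab_1")
after = d.get_entry("tab_1")["pubinsert"]  # only a later invalidation helps
print(stale, fresh, after)                 # False False True
```

In this model the too-early rebuild is what pins pubinsert to false: once the
stale entry is cached again, advancing the snapshot alone changes nothing.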

> > The root cause of the problem is as follows:
> > pgoutput relies on the invalidation mechanism to validate publications. When
> > the walsender decodes an ALTER PUBLICATION transaction, catalog caches are
> > invalidated at once. Furthermore, since pg_publication_rel is modified,
> > snapshot changes are added to all transactions currently being decoded. For
> > the other transactions, the catalog caches have been invalidated, but it is
> > likely that the snapshot changes have not yet been decoded. In the pgoutput
> > implementation, these transactions query the system catalog pg_publication_rel
> > to determine whether to publish the changes they contain. In this case, the
> > catalog tuples are not found because the snapshot has not been updated. As a
> > result, the changes in those transactions are considered not to be published,
> > and subsequent data cannot be synchronized.
> >
> > I think it's necessary to add invalidations to other transactions after
> > adding a snapshot change to them.
> > Therefore, I submitted a patch for this bug.
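The ordering Song describes can be sketched with a small simulation (plain
Python, not the actual patch; the queue items are invented labels). A
per-transaction change queue is replayed once with the invalidation taking
effect before the snapshot change, and once queued after it, as proposed:

```python
# Catalog rows visible only once the snapshot change has been applied.
published_after_alter = {"tab_1"}

def decode(queue):
    """Replay a per-transaction change queue and report whether an INSERT
    on tab_1 would be published by the end of the transaction."""
    snapshot_rows = set()          # catalog rows this transaction can see
    entry = {"pubinsert": False}   # relation entry cached before the ALTER
    cached = True
    for change in queue:
        if change == "snapshot_change":
            snapshot_rows = published_after_alter
        elif change == "invalidation":
            cached = False         # force a refetch on next use
        elif change == "insert":
            if not cached:
                # Rebuild the entry using the currently visible catalog.
                entry = {"pubinsert": "tab_1" in snapshot_rows}
                cached = True
    return entry["pubinsert"]

# Buggy ordering: invalidation takes effect before the snapshot change,
# so the refetch misses the new pg_publication_rel row.
buggy = decode(["invalidation", "insert", "snapshot_change", "insert"])
# Proposed ordering: invalidation is queued after the snapshot change,
# so the refetch sees the row.
fixed = decode(["snapshot_change", "invalidation", "insert"])
print(buggy, fixed)  # False True
```

Under this toy model, only the second ordering lets the rebuilt entry see
the newly published table, matching the fix Song proposes.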
>
> I cannot see your attachment, but I found that the proposed patch in [1] can solve
> the issue. After applying 0001 + 0002 + 0003 (open relations as ShareRowExclusiveLock
> in OpenTableList), the data gap was removed. Thoughts?

I am not sure why 'open relations as ShareRowExclusiveLock' would help in
this case. Have you investigated that?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


