RE: BUG #18267: Logical replication bug: data is not synchronized after Alter Publication. - Mailing list pgsql-bugs

From: Hayato Kuroda (Fujitsu)
Subject: RE: BUG #18267: Logical replication bug: data is not synchronized after Alter Publication.
Date:
Msg-id: TY3PR01MB9889E9DB1AC80C2DEC3B37D6F5662@TY3PR01MB9889.jpnprd01.prod.outlook.com
In response to: BUG #18267: Logical replication bug: data is not synchronized after Alter Publication.  (PG Bug reporting form <noreply@postgresql.org>)
Responses: Re: BUG #18267: Logical replication bug: data is not synchronized after Alter Publication.
List: pgsql-bugs
Dear Song,

> 
> Hi hackers, I found that when inserting plenty of data into a table and
> adding the table to a publication (through ALTER PUBLICATION) at the same
> time, it is likely that the incremental data cannot be synchronized to the
> subscriber. Here is my test method:

Good catch.

> 1. On publisher and subscriber, create table for test:
> CREATE TABLE tab_1 (a int);
> 
> 2. Setup logical replication:
> on publisher:
>      SELECT pg_create_logical_replication_slot('slot1', 'pgoutput', false, false);
>      CREATE PUBLICATION tap_pub;
> on subscriber:
>      CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub
>          WITH (enabled = true, create_slot = false, slot_name = 'slot1');
> 
> 3. Perform inserts:
>      for (my $i = 1; $i <= 1000; $i++) {
>          $node_publisher->safe_psql('postgres',
>              "INSERT INTO tab_1 SELECT generate_series(1, 1000)");
>      }
>      Each transaction contains 1000 insertions, and there are 1000
>      transactions in total.
> 
> 4. While step 3 is running, add table tab_1 to the publication:
>      ALTER PUBLICATION tap_pub ADD TABLE tab_1
>      ALTER SUBSCRIPTION tap_sub REFRESH PUBLICATION

I could reproduce the failure. PSA the script.

In the script, ALTER PUBLICATION is executed while the initial data synchronization is in progress.
(The workload is almost the same as what you posted, but the number of rows is reduced.)
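
Roughly speaking, the workload is like the sketch below, written against the
PostgreSQL::Test::Cluster TAP framework. This is a sketch only: the node names,
the 40 x 1000 row split, and the iteration at which the ALTER statements are
issued are assumptions here, not values taken from the attached script.

```
use strict;
use warnings;
use PostgreSQL::Test::Cluster;
use PostgreSQL::Test::Utils;
use Test::More;

# The publisher needs a logical WAL level; node names are arbitrary.
my $node_publisher = PostgreSQL::Test::Cluster->new('publisher');
$node_publisher->init(allows_streaming => 'logical');
$node_publisher->start;

my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
$node_subscriber->init;
$node_subscriber->start;

my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';

# Step 1: create the test table on both nodes.
$node_publisher->safe_psql('postgres', "CREATE TABLE tab_1 (a int)");
$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_1 (a int)");

# Step 2: create the slot, an (empty) publication, and the subscription.
$node_publisher->safe_psql('postgres',
    "SELECT pg_create_logical_replication_slot('slot1', 'pgoutput', false, false)");
$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub");
$node_subscriber->safe_psql('postgres',
    "CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub "
      . "WITH (enabled = true, create_slot = false, slot_name = 'slot1')");

# Steps 3 and 4: insert 40 x 1000 rows and add the table to the publication
# partway through (40, 1000, and the midpoint 20 are arbitrary reductions).
for my $i (1 .. 40)
{
    $node_publisher->safe_psql('postgres',
        "INSERT INTO tab_1 SELECT generate_series(1, 1000)");

    if ($i == 20)
    {
        $node_publisher->safe_psql('postgres',
            "ALTER PUBLICATION tap_pub ADD TABLE tab_1");
        $node_subscriber->safe_psql('postgres',
            "ALTER SUBSCRIPTION tap_sub REFRESH PUBLICATION");
    }
}
```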

In total, 40000 tuples are inserted on the publisher. However, even after some time, only 25000 tuples are replicated:

```
publisher=# SELECT count(*) FROM tab_1 ;
 count 
-------
 40000
(1 row)

subscriber=# SELECT count(*) FROM tab_1 ;
 count 
-------
 25000
(1 row)
```

Is it the same failure that you saw?

> The root cause of the problem is as follows:
> pgoutput relies on the invalidation mechanism to validate publications. When
> the walsender decodes an ALTER PUBLICATION transaction, catalog caches are
> invalidated at once. Furthermore, since pg_publication_rel is modified,
> snapshot changes are added to all transactions currently being decoded. For
> the other transactions, the catalog caches have already been invalidated, but
> it is likely that the snapshot changes have not yet been decoded. In the
> pgoutput implementation, these transactions query the system catalog
> pg_publication_rel to determine whether to publish the changes they contain.
> In this case, the catalog tuples are not found because the snapshot has not
> been updated. As a result, the changes in those transactions are considered
> not to be published, and the subsequent data cannot be synchronized.
>
> I think it is necessary to add invalidations to the other transactions after
> adding a snapshot change to them.
> Therefore, I submitted a patch for this bug.

I cannot see your attachment, but I found that the proposed patch set in [1] can solve
the issue. After applying 0001 + 0002 + 0003 (which open relations with
ShareRowExclusiveLock in OpenTableList), the data gap was removed. Thoughts?
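
A check along the following lines, appended to the reproduction sketch above,
makes the gap (or its absence) visible. Again this is only a sketch: it reuses
the $node_publisher / $node_subscriber handles and the tap_sub subscription
name assumed there.

```
# Wait until the table sync of tab_1 (started by the REFRESH) has finished ...
$node_subscriber->poll_query_until('postgres',
    "SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's')")
  or die "timed out waiting for table synchronization";

# ... and until the apply worker has replayed everything the publisher sent.
$node_publisher->wait_for_catchup('tap_sub');

# With the fix in place, both sides should report the same row count.
my $pub_rows = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM tab_1");
my $sub_rows = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_1");
is($sub_rows, $pub_rows, 'no data gap after ALTER PUBLICATION ... ADD TABLE');

done_testing();
```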

[1]: https://www.postgresql.org/message-id/de52b282-1166-1180-45a2-8d8917ca74c6%40enterprisedb.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


