Issue with logical replication - Mailing list pgsql-bugs

From Kacey Holston
Subject Issue with logical replication
Date
Msg-id 8CE7E1D1-56E6-4836-8F17-235B6FFA4EBD@pgexperts.com
Whole thread Raw
Responses Re: Issue with logical replication  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-bugs
I ran into an issue of missing rows during an upgrade from  PostgreSQL 11 to PostgreSQL 13 using in core logical
replication.

## Procedure Used:

After copying over all the users and schema I ran this on the publisher:

"CREATE PUBLICATION upgrade11to13 FOR ALL TABLES;"

Then ran this on the subscriber:

"CREATE SUBSCRIPTION server13 CONNECTION ‘dbname=<DBNAME redacted> hostaddr=<ADDRESS redacted > user=<USER redacted >
password=<PASSWORDredacted > port=5432' PUBLICATION upgrade11to13;" 

The initial copy and replication appeared to run as expected. I saw all the tables in the ready state in
pg_subscription_rel.The tables pg_stat_repllication and pg_stat_subscription looked as expected as well. 

We then reversed the direction of replication by running this on the PostgreSQL 13 server:

"DROP SUBSCRIPTION server13;”

And this on the PostgreSQL 11 server:

"DROP PUBLICATION upgrade11to13;”

Back on the PostgreSQL 13 server:

"CREATE PUBLICATION reverse_to_11 FOR ALL TABLES;"

And on the PostgreSQL 11 server:

"CREATE SUBSCRIPTION oldserver11 CONNECTION 'dbname=<DBNAME redacted> hostaddr=<ADDRESS redacted > user=<USER redacted
>password=<PASSWORD redacted > port=5432' PUBLICATION reverse_to_11 with (copy_data=false);" 

At this point unexpected query results were returning and we discovered it was from missing rows in at least one table.
(Weunfortunately never checked additional tables).  

We were able to determine that all the rows missing were created on the publisher starting seven minutes after
replicationbegan and lasted for three hours.  

This table had a date created time stamp so we checked what time period the rows were missing from.  If you run a query
requestingall the records created during those three hours there is a 50k record discrepancy. If you run a query
requestingall the records created after those three hours all rows are in-tact. 

We did not retain the logs but at the time did not see any related errors.

## Second try:

We attempted to run this again.

We created a similar, new and fresh instance. Followed the same steps as above but this time created a replication slot
manually.

We received identical results. The records were missing from a handful of minutes after replication starts and last for
approximately3 hours. 

We were able to note that the data loss appeared to coincide approximately with the start of the table’s copy but the
dataloss ended before the table finished the copy. The table took about 12 hours to copy while the data loss was only
threehours. 

## Logs:

On the Subscriber, during the entire initial copy there were only four, spread out errors of the type:

"could not receive data from WAL stream: SSL SYSCALL error: EOF detected”

After these calls the next line in the server indicates a new copy and for different tables.

There were no related errors on the Publisher side.

## Server information

Publisher:

Ubuntu 16.04.7 LTS
GNU/Linux 4.4.0-1114-aws x86_64

Subscriber:

Ubuntu 20.04.2
GNU/Linux 5.4.0-1045-aws x86_64








pgsql-bugs by date:

Previous
From: Noah Misch
Date:
Subject: Re: CREATE INDEX CONCURRENTLY does not index prepared xact's data
Next
From: Amit Kapila
Date:
Subject: Re: Issue with logical replication