Data loss on logical replication, 12.12 to 14.5, ALTER SUBSCRIPTION - Mailing list pgsql-hackers

From Michail Nikolaev
Subject Data loss on logical replication, 12.12 to 14.5, ALTER SUBSCRIPTION
Date
Msg-id CANtu0ohgHM_7+M+3i2kXkZuLCsN_Dbio2pFfphrH+oOfzLfUhw@mail.gmail.com
Whole thread Raw
Responses Re: Data loss on logical replication, 12.12 to 14.5, ALTER SUBSCRIPTION  (Michail Nikolaev <michail.nikolaev@gmail.com>)
List pgsql-hackers
Hello.

Just a small story about small data-loss on logical replication.

We were logically replicating a 4 TB database from

> PostgreSQL 12.12 (Ubuntu 12.12-201-yandex.49163.d86383ed5b) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
7.5.0-3ubuntu1~18.04)7.5.0, 64-bit 

to

> PostgreSQL 14.5 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit

Database includes many tables, but there are A, B and C tables. Tables
A, B and are C changed in the same transaction (new tuples created in
B and C with corresponding update of A).

Table A was added to the PUBLICATION from the start. Once initial sync
was done, tables B and C were added to PUBLICATION with REFRESH on
subscription (to reduce the WAL collection on the source database).

So, we see in logs:

> -2022-12-13 13:19:55 UTC-63987bfb.2733-LOG: logical replication table synchronization worker for subscription
"cloud_production_main_sub_v4",table "A" has started 
> -2022-12-13 14:41:49 UTC-63987bfb.2733-LOG: logical replication table synchronization worker for subscription
"cloud_production_main_sub_v4",table "A" has finished 


> -2022-12-14 08:08:34 UTC-63998482.7d10-LOG: logical replication table synchronization worker for subscription
"cloud_production_main_sub_v4",table "B" has started 
> -2022-12-14 10:19:08 UTC-63998482.7d10-LOG: logical replication table synchronization worker for subscription
"cloud_production_main_sub_v4",table "B" has finished 


> -2022-12-14 10:37:47 UTC-6399a77b.1fc-LOG: logical replication table synchronization worker for subscription
"cloud_production_main_sub_v4",table "C" has started 
> -2022-12-14 10:48:46 UTC-6399a77b.1fc-LOG: logical replication table synchronization worker for subscription
"cloud_production_main_sub_v4",table "C" has finished 


Also, we had to reboot subscription server twice. Moreover, we have
plenty of messages like:


> -2022-12-13 15:53:30 UTC-639872fa.1-LOG: background worker "logical replication worker" (PID 47960) exited with exit
code1 
> -2022-12-13 21:04:31 UTC-6398e8df.4f7c-LOG: logical replication apply worker for subscription
"cloud_production_main_sub_v4"has started 
> -2022-12-14 10:19:22 UTC-6398e8df.4f7c-ERROR: could not receive data from WAL stream: SSL SYSCALL error: EOF detected

Additionally, our HA replica of subscriber was broken and recovered by
support… And logs like this:

> psql-2022-12-14 10:24:18 UTC-63999d2c.2020-WARNING: user requested cancel while waiting for synchronous replication
ack.
> The COMMIT record has already flushed to WAL locally and might not have been replicated to the standby. We must wait
here.

Furthermore, we were adding\removing another table D from publication
few times. So, it was a little bit messy story.

After all, we got streaming working for the whole database.

But after some time we realized we have lost 203 records for table B
created from

2022-12-14 09:21:25.705 to
2022-12-14 09:49:20.664 (after synchronization start, but before finish).

And the most tricky part here - A, B and C are changed in the same
transaction (related tuples). But tables A and C  - are fine, only few
records from B are lost.

We have compared all other tables record to record - only 203 records
from B are missing. We have restored the server from backup with
point-in-time-recovery (to exclude case with application or human
error) - the same results. Furthermore, we have tried different
indexes in search (to exclude issue with broken btree) - the same
results.

So, yes, we understand our replication story was not a classical happy
path even close. But the result feels a little bit scary.

Currently, I have access to database and logs - so, feel free to ask
for additional debugging information if you like.

Thanks a lot,
Michail.



pgsql-hackers by date:

Previous
From: Dilip Kumar
Date:
Subject: Re: Time delayed LR (WAS Re: logical replication restrictions)
Next
From: Michail Nikolaev
Date:
Subject: Re: Data loss on logical replication, 12.12 to 14.5, ALTER SUBSCRIPTION