Re: Data loss on logical replication, 12.12 to 14.5, ALTER SUBSCRIPTION - Mailing list pgsql-hackers

From Michail Nikolaev
Subject Re: Data loss on logical replication, 12.12 to 14.5, ALTER SUBSCRIPTION
Msg-id CANtu0ogdMKQ-qj7U8qdCzw+YhOcdTLoLRa5evrdahkrwjSDMiA@mail.gmail.com
In response to Re: Data loss on logical replication, 12.12 to 14.5, ALTER SUBSCRIPTION  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: Data loss on logical replication, 12.12 to 14.5, ALTER SUBSCRIPTION
List pgsql-hackers
Hello, Amit.

> The point which is not completely clear from your description is the
> timing of missing records. In one of your previous emails, you seem to
> have indicated that the data missed from Table B is from the time when
> the initial sync for Table B was in-progress, right? Also, from your
> description, it seems there is no error or restart that happened
> during the time of initial sync for Table B. Is that understanding
> correct?

Yes and yes.
* B sync started - 08:08:34
* lost records are created - 09:49:xx
* B initial sync finished - 10:19:08
* I/O error with WAL - 10:19:22
* SIGTERM - 10:35:20

"Finished" here is `logical replication table synchronization worker
for subscription "cloud_production_main_sub_v4", table "B" has
finished`.
As far as I know, this message refers to the completion of the initial COPY.
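
The ordering in the timeline above can be checked mechanically. This is only a minimal sketch: it assumes all events happened on the same day (the timeline gives no dates) and treats the lost records as spanning the whole 09:49 minute, since the seconds are not known. The variable and function names are illustrative, not taken from any log.

```python
from datetime import time

# Times of day copied from the timeline; the date is assumed to be the same for all.
events = {
    "b_sync_started": time(8, 8, 34),
    "b_sync_finished": time(10, 19, 8),
    "wal_io_error": time(10, 19, 22),
    "sigterm": time(10, 35, 20),
}

# The lost records are only known as 09:49:xx, so assume the full minute.
lost_from, lost_to = time(9, 49, 0), time(9, 49, 59)


def within(start, end, lo, hi):
    """Return True if the interval [lo, hi] lies inside [start, end]."""
    return start <= lo and hi <= end


# Were the lost records created while the initial sync of B was in progress?
lost_during_sync = within(
    events["b_sync_started"], events["b_sync_finished"], lost_from, lost_to
)

# Were they created before the first reported error?
lost_before_io_error = lost_to < events["wal_io_error"]

print("lost during initial sync of B:", lost_during_sync)
print("lost before WAL I/O error:", lost_before_io_error)
```

Under those assumptions, both checks hold: the lost records fall inside the sync window and precede the WAL I/O error and the SIGTERM.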

> I am not able to see how these steps can lead to the problem.

One idea I have: it may be related to the patch that forbids
canceling queries while waiting for synchronous replication
acknowledgement [1].
It is applied to the Postgres build in the cloud we were using [2]. We
started to see such errors at 10:24:18:

      `The COMMIT record has already flushed to WAL locally and might
not have been replicated to the standby. We must wait here.`

I wonder whether it could be some tricky race caused by the downtime of
the synchronous replica, with queries stuck waiting for the ACK forever?

> If the problem is reproducible at your end, you might want to increase LOG
> verbosity to DEBUG1 and see if there is additional information in the
> LOGs that can help or it would be really good if there is a
> self-sufficient test to reproduce it.

Unfortunately, it looks like it is really hard to reproduce.

Best regards,
Michail.

[1]:
https://www.postgresql.org/message-id/flat/CALj2ACU%3DnzEb_dEfoLqez5CLcwvx1GhkdfYRNX%2BA4NDRbjYdBg%40mail.gmail.com#8b7ffc8cdecb89de43c0701b4b6b5142
[2]:
https://www.postgresql.org/message-id/flat/CAAhFRxgcBy-UCvyJ1ZZ1UKf4Owrx4J2X1F4tN_FD%3Dfh5wZgdkw%40mail.gmail.com#9c71a85cb6009eb60d0361de82772a50


