Thread: Data loss on logical replication, 12.12 to 14.5, ALTER SUBSCRIPTION

Data loss on logical replication, 12.12 to 14.5, ALTER SUBSCRIPTION

From
Michail Nikolaev
Date:
Hello.

Just a small story about a small data loss on logical replication.

We were logically replicating a 4 TB database from

> PostgreSQL 12.12 (Ubuntu 12.12-201-yandex.49163.d86383ed5b) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit

to

> PostgreSQL 14.5 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit

The database includes many tables, but the relevant ones are A, B and C.
Tables A, B and C are changed in the same transaction (new tuples are
created in B and C with a corresponding update of A).

Table A was added to the PUBLICATION from the start. Once its initial sync
was done, tables B and C were added to the PUBLICATION, followed by a
REFRESH on the subscription (to reduce WAL accumulation on the source database).
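
The commands were roughly like this (a sketch; the publication name here is
illustrative, while the subscription name is the real one from the logs):

-- On the publisher: table A was published from the start
-- (other initially published tables omitted).
CREATE PUBLICATION cloud_production_main_pub_v4 FOR TABLE A;

-- Later, once the initial sync of A had finished:
ALTER PUBLICATION cloud_production_main_pub_v4 ADD TABLE B, C;

-- On the subscriber: pick up the newly published tables and start their
-- initial sync (copy_data defaults to true).
ALTER SUBSCRIPTION cloud_production_main_sub_v4 REFRESH PUBLICATION;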

So, we see in the logs:

> -2022-12-13 13:19:55 UTC-63987bfb.2733-LOG: logical replication table synchronization worker for subscription "cloud_production_main_sub_v4", table "A" has started
> -2022-12-13 14:41:49 UTC-63987bfb.2733-LOG: logical replication table synchronization worker for subscription "cloud_production_main_sub_v4", table "A" has finished

> -2022-12-14 08:08:34 UTC-63998482.7d10-LOG: logical replication table synchronization worker for subscription "cloud_production_main_sub_v4", table "B" has started
> -2022-12-14 10:19:08 UTC-63998482.7d10-LOG: logical replication table synchronization worker for subscription "cloud_production_main_sub_v4", table "B" has finished

> -2022-12-14 10:37:47 UTC-6399a77b.1fc-LOG: logical replication table synchronization worker for subscription "cloud_production_main_sub_v4", table "C" has started
> -2022-12-14 10:48:46 UTC-6399a77b.1fc-LOG: logical replication table synchronization worker for subscription "cloud_production_main_sub_v4", table "C" has finished


Also, we had to reboot the subscriber server twice. Moreover, we have
plenty of messages like:


> -2022-12-13 15:53:30 UTC-639872fa.1-LOG: background worker "logical replication worker" (PID 47960) exited with exit code 1
> -2022-12-13 21:04:31 UTC-6398e8df.4f7c-LOG: logical replication apply worker for subscription "cloud_production_main_sub_v4" has started
> -2022-12-14 10:19:22 UTC-6398e8df.4f7c-ERROR: could not receive data from WAL stream: SSL SYSCALL error: EOF detected

Additionally, our HA replica of the subscriber was broken and recovered by
support… And we see logs like this:

> psql-2022-12-14 10:24:18 UTC-63999d2c.2020-WARNING: user requested cancel while waiting for synchronous replication ack.
> The COMMIT record has already flushed to WAL locally and might not have been replicated to the standby. We must wait here.

Furthermore, we were adding/removing another table D from the publication
a few times. So, it was a bit of a messy story.

In the end, we got streaming working for the whole database.

But after some time we realized we had lost 203 records from table B,
created from

2022-12-14 09:21:25.705 to
2022-12-14 09:49:20.664 (after synchronization start, but before finish).

And the trickiest part here: A, B and C are changed in the same
transaction (related tuples). But tables A and C are fine; only a few
records from B are lost.

We have compared all other tables record by record: only 203 records
from B are missing. We have restored the server from a backup with
point-in-time recovery (to exclude the case of an application or human
error) and got the same results. Furthermore, we have tried different
indexes in the search (to exclude an issue with a broken btree), again
with the same results.
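
(The comparison itself was conceptually nothing fancier than the following
sketch; the names are illustrative, with b_remote standing for a
postgres_fdw foreign table pointing at the publisher's copy of B.)

-- ids present on the publisher but missing on the subscriber
SELECT id FROM b_remote
EXCEPT
SELECT id FROM b;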

So, yes, we understand our replication story was not even close to a
classical happy path. But the result feels a little bit scary.

Currently, I have access to the database and logs, so feel free to ask
for additional debugging information if you like.

Thanks a lot,
Michail.



Re: Data loss on logical replication, 12.12 to 14.5, ALTER SUBSCRIPTION

From
Michail Nikolaev
Date:
Hello again.

Just a small fix for:

> 2022-12-14 09:21:25.705 to
> 2022-12-14 09:49:20.664 (after synchronization start, but before finish).

Correct values are:

2022-12-14 09:49:31.340
2022-12-14 09:49:41.683

So, it looks like we lost about 10 seconds of WAL for one of the tables.



Re: Data loss on logical replication, 12.12 to 14.5, ALTER SUBSCRIPTION

From
Amit Kapila
Date:
On Mon, Dec 26, 2022 at 8:50 PM Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:
>
> Hello again.
>
> Just a small fix for:
>
> > 2022-12-14 09:21:25.705 to
> > 2022-12-14 09:49:20.664 (after synchronization start, but before finish).
>
> Correct values are:
>
> 2022-12-14 09:49:31.340
> 2022-12-14 09:49:41.683
>
> So, it looks like we lost about 10 seconds of WAL for one of the tables.
>


IIUC, this is the time when only table B's initial sync was
in-progress. Table A's initial sync was finished by that time, and for
Table C, it had not yet started. We have some logic for such
overlapping phases where we allow the apply worker to skip data which
will later be applied by the initial sync worker. During the time of
the initial sync of B, are there any other operations on that table
apart from the missing ones? You may want to see the LOGs of the
subscriber during the initial sync time for any errors or other
messages.
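
For example, the per-table sync state on the subscriber can be checked with
something roughly like this (reading pg_subscription_rel; adjust the
subscription name if needed):

-- srsubstate is e.g. 'd' (data copy in progress), 's' (synchronized),
-- 'r' (ready, i.e. normal streaming).
SELECT sr.srrelid::regclass AS table_name,
       sr.srsubstate,
       sr.srsublsn
FROM pg_subscription_rel sr
JOIN pg_subscription s ON s.oid = sr.srsubid
WHERE s.subname = 'cloud_production_main_sub_v4';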


-- 
With Regards,
Amit Kapila.



Re: Data loss on logical replication, 12.12 to 14.5, ALTER SUBSCRIPTION

From
Michail Nikolaev
Date:
Hello, Amit!

> IIUC, this is the time when only table B's initial sync was
> in-progress. Table A's initial sync was finished by that time, and for
> Table C, it had not yet started.
Yes, that is correct. C's sync had started too, but unsuccessfully (it
was restarted later, see below).

> During the time of
> the initial sync of B, are there any other operations on that table
> apart from the missing ones?
Yes, a lot of them. Tuples were created all the time without any pauses;
only the ~10s interval is gone.

> You may want to see the LOGs of the
> subscriber during the initial sync time for any errors or other
> messages.

Looks like I have found something interesting. Below are the relevant
entries, sorted by time, with some comments.

Restart of the logical apply worker the day before the issue:

        -2022-12-13 21:04:25 UTC-6398e8d9.4ec3-LOG: logical
replication apply worker for subscription
"cloud_production_main_sub_v4" has started
        -2022-12-13 21:04:26 UTC-6398e8d9.4ec3-ERROR: could not start
WAL streaming: FATAL: out of relcache_callback_list slots CONTEXT:
slot "cloud_production_main_sub_v4", output plugin "pgoutput", in the
startup callback ERROR: odyssey: c1747b31d0187: remote server
read/write error s721c052ace56: (null) SSL SYSCALL error: EOF detected
        -2022-12-13 21:04:26 UTC-639872fa.1-LOG: background worker
"logical replication worker" (PID 20163) exited with exit code 1
        -2022-12-13 21:04:31 UTC-6398e8df.4f7c-LOG: logical
replication apply worker for subscription
"cloud_production_main_sub_v4" has started

Start of B and C table initial sync (adding tables to the
subscription, table A already in streaming mode):

        -2022-12-14 08:08:34 UTC-63998482.7d10-LOG: logical
replication table synchronization worker for subscription
"cloud_production_main_sub_v4", table "B" has started
        -2022-12-14 08:08:34 UTC-63998482.7d13-LOG: logical
replication table synchronization worker for subscription
"cloud_production_main_sub_v4", table "C" has started

B is synchronized without any errors:

        -2022-12-14 10:19:08 UTC-63998482.7d10-LOG: logical
replication table synchronization worker for subscription
"cloud_production_main_sub_v4", table "B" has finished

After about 15 seconds, the replication worker is restarted because of
I/O issues:

        -2022-12-14 10:19:22 UTC-6398e8df.4f7c-ERROR: could not
receive data from WAL stream: SSL SYSCALL error: EOF detected
        -2022-12-14 10:19:22 UTC-6399a32a.6af3-LOG: logical
replication apply worker for subscription
"cloud_production_main_sub_v4" has started
        -2022-12-14 10:19:22 UTC-639872fa.1-LOG: background worker
"logical replication worker" (PID 20348) exited with exit code 1

Then a query cancellation (something about an insert into public.lsnmover):

        -2022-12-14 10:21:03 UTC-63999c6a.f25d-ERROR: canceling
statement due to user request
        -2022-12-14 10:21:39 UTC-639997f9.fd6e-ERROR: canceling
statement due to user request

After a short amount of time, synchronous replication seems to be
broken; we see tons of:

        -2022-12-14 10:24:18 UTC-63999d2c.2020-WARNING: shutdown
requested while waiting for synchronous replication ack.
        -2022-12-14 10:24:18 UTC-63999d2c.2020-WARNING: user requested
cancel while waiting for synchronous

After a few minutes, at 10:36:05, we initiated a database restart:

        -2022-12-14 10:35:20 UTC-63999c6a.f25d-ERROR: canceling
statement due to user request
        -2022-12-14 10:37:25 UTC-6399a765.1-LOG: pgms_stats: Finishing PG_init
        -2022-12-14 10:37:26 UTC-6399a765.1-LOG: listening on IPv4
address "0.0.0.0", port 5432
        -2022-12-14 10:37:26 UTC-6399a765.1-LOG: listening on IPv6
address "::", port 5432
        -2022-12-14 10:37:26 UTC-6399a765.1-LOG: starting PostgreSQL
14.4 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit
        -2022-12-14 10:37:26 UTC-6399a765.1-LOG: listening on Unix
socket "/tmp/.s.PGSQL.5432"
        -2022-12-14 10:37:26 UTC-6399a766.103-LOG: database system was
interrupted; last known up at 2022-12-14 10:36:47 UTC
        -2022-12-14 10:37:26 UTC-6399a766.105-FATAL: the database
system is starting up
        -2022-12-14 10:37:26 UTC-6399a766.107-FATAL: the database
system is starting up
        -2022-12-14 10:37:27 UTC-6399a767.108-FATAL: the database
system is starting up
        -2022-12-14 10:37:27 UTC-6399a767.109-FATAL: the database
system is starting up
        -2022-12-14 10:37:27 UTC-6399a767.10a-FATAL: the database
system is starting up
        -2022-12-14 10:37:27 UTC-6399a767.10b-FATAL: the database
system is starting up
        -2022-12-14 10:37:27 UTC-6399a767.113-FATAL: the database
system is starting up
        -2022-12-14 10:37:29 UTC-6399a766.103-LOG: recovered
replication state of node 4 to 185F6/CBF4BBB8
        -2022-12-14 10:37:29 UTC-6399a766.103-LOG: recovered
replication state of node 1 to 18605/FE229E10
        -2022-12-14 10:37:29 UTC-6399a766.103-LOG: recovered
replication state of node 6 to 185F8/86AEC880
        -2022-12-14 10:37:29 UTC-6399a766.103-LOG: database system was
not properly shut down; automatic recovery in progress
        -2022-12-14 10:37:29 UTC-6399a766.103-LOG: recovered
replication state of node 5 to 185F6/CBF4BBB8
        -2022-12-14 10:37:29 UTC-6399a766.103-LOG: redo starts at D76/557F4210
        -2022-12-14 10:37:34 UTC-6399a76e.14a-FATAL: the database
system is starting up
        -2022-12-14 10:37:34 UTC-6399a766.103-LOG: redo done at
D76/BDDBDAE0 system usage: CPU: user: 3.31 s, system: 1.36 s, elapsed:
4.68 s
        -2022-12-14 10:37:34 UTC-6399a766.103-LOG: checkpoint
starting: end-of-recovery immediate
        -2022-12-14 10:37:34 UTC-6399a76e.14b-FATAL: the database
system is starting up
        -2022-12-14 10:37:34 UTC-6399a76e.14c-FATAL: the database
system is starting up
        -2022-12-14 10:37:45 UTC-6399a779.190-FATAL: the database
system is starting up
        -2022-12-14 10:37:45 UTC-6399a766.103-LOG: checkpoint
complete: wrote 300326 buffers (3.6%);
        -2022-12-14 10:37:45 UTC-6399a779.191-FATAL: the database
system is starting up
        -2022-12-14 10:37:45 UTC-6399a779.199-LOG: pgms_stats:
bgworker: pid: 409.
        -2022-12-14 10:37:45 UTC-6399a779.19a-LOG: pg_qs: qs bgworker: pid: 410.
        -2022-12-14 10:37:45 UTC-6399a765.1-LOG: database system is
ready to accept connections
        -2022-12-14 10:37:45 UTC-6399a779.199-LOG: pgms_stats: qs
bgworker: bgworker started running with host database: cloud_sys
        -2022-12-14 10:37:45 UTC-6399a779.19a-LOG: pg_qs: qs bgworker:
bgworker started running with host database: cloud_sys
        -2022-12-14 10:37:45 UTC-6399a779.19d-LOG: logical replication
apply worker for subscription "cloud_production_main_sub_v4" has
started
        -2022-12-14 10:37:46 UTC-6399a779.19b-LOG: pg_cron scheduler started

Initial sync for C started from scratch (with some other tables):

        -2022-12-14 10:37:47 UTC-6399a77b.1f9-LOG: logical replication
table synchronization worker for subscription
"cloud_production_main_sub_v4", table "E" has started
        -2022-12-14 10:37:47 UTC-6399a77b.1fa-LOG: logical replication
table synchronization worker for subscription
"cloud_production_main_sub_v4", table "F" has started
        -2022-12-14 10:37:47 UTC-6399a77b.1fc-LOG: logical replication
table synchronization worker for subscription
"cloud_production_main_sub_v4", table "C" has started

And a few more flaps with WAL streaming later:

        -2022-12-14 10:50:20 UTC-6399a779.19d-ERROR: could not send
data to WAL stream: SSL SYSCALL error: EOF detected
        -2022-12-14 10:50:20 UTC-6399a765.1-LOG: background worker
"logical replication worker" (PID 413) exited with exit code 1

Probably a small part of the WAL was somehow skipped by the logical
worker in all that mess.

Best regards,
Michail.



Re: Data loss on logical replication, 12.12 to 14.5, ALTER SUBSCRIPTION

From
Amit Kapila
Date:
On Tue, Dec 27, 2022 at 5:49 PM Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:
>
>
> Probably a small part of the WAL was somehow skipped by the logical
> worker in all that mess.
>

None of these entries are from the point mentioned by you [1]
yesterday where you didn't find the corresponding data in the
subscriber. How did you identify that the entries corresponding to
that timing were missing?

[1] -
> 2022-12-14 09:49:31.340
> 2022-12-14 09:49:41.683

-- 
With Regards,
Amit Kapila.



Re: Data loss on logical replication, 12.12 to 14.5, ALTER SUBSCRIPTION

From
Michail Nikolaev
Date:
Hello.

> None of these entries are from the point mentioned by you [1]
> yesterday where you didn't find the corresponding data in the
> subscriber. How did you identify that the entries corresponding to
> that timing were missing?

Some of them are from before the interval, some after... But the source
database was generating a lot of WAL during logical replication;
some of these log entries are from the time AFTER the completion of the
initial sync of B but (probably) BEFORE table B finished catching up
(entering streaming mode).

Just to clarify, tables A, B and C are updated in the same
transaction, something like:

BEGIN;
UPDATE A SET x = x + 1 WHERE id = :id;
INSERT INTO B VALUES (current_time, :id);
INSERT INTO C VALUES (current_time, :id);
COMMIT;

Other (non-mentioned) tables are also included in this transaction, but
only B is missing a small amount of data.

So, in short, the story looks like this:

* initial sync of A (and other tables) started and completed; they are
in streaming mode
* B and C initial sync started (by altering the PUBLICATION and SUBSCRIPTION)
* B sync completed, but new changes are still being applied to the table
to catch up with the primary
* logical replication apply worker is restarted because of an I/O error on WAL receive
* Postgres killed
* Postgres restarted
* C initial sync restarted
* logical replication apply worker restarted a few times because of I/O
errors on WAL receive
* finally, every table is in streaming mode, but with a small gap in B

Thanks,
Michail.



Re: Data loss on logical replication, 12.12 to 14.5, ALTER SUBSCRIPTION

From
Amit Kapila
Date:
On Wed, Dec 28, 2022 at 4:52 PM Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:
>
> Hello.
>
> > None of these entries are from the point mentioned by you [1]
> > yesterday where you didn't find the corresponding data in the
> > subscriber. How did you identify that the entries corresponding to
> > that timing were missing?
>
> Some of them are from before the interval, some after... But the source
> database was generating a lot of WAL during logical replication;
> some of these log entries are from the time AFTER the completion of the
> initial sync of B but (probably) BEFORE table B finished catching up
> (entering streaming mode).
>
...
...
>
> So, in short, the story looks like this:
>
> * initial sync of A (and other tables) started and completed; they are
> in streaming mode
> * B and C initial sync started (by altering the PUBLICATION and SUBSCRIPTION)
> * B sync completed, but new changes are still being applied to the table
> to catch up with the primary
>

The point which is not completely clear from your description is the
timing of missing records. In one of your previous emails, you seem to
have indicated that the data missed from Table B is from the time when
the initial sync for Table B was in-progress, right? Also, from your
description, it seems there is no error or restart that happened
during the time of initial sync for Table B. Is that understanding
correct?

> * logical replication apply worker is restarted because of an I/O error on WAL receive
> * Postgres killed
> * Postgres restarted
> * C initial sync restarted
> * logical replication apply worker restarted a few times because of I/O
> errors on WAL receive
> * finally, every table is in streaming mode, but with a small gap in B
>

I am not able to see how these steps can lead to the problem. If the
problem is reproducible at your end, you might want to increase LOG
verbosity to DEBUG1 and see if there is additional information in the
LOGs that can help, or it would be really good if there is a
self-sufficient test to reproduce it.
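
For instance, something roughly like this on the subscriber would make the
apply and tablesync workers considerably more verbose (it can be reverted
the same way afterwards):

-- On the subscriber; revert later with ALTER SYSTEM RESET log_min_messages;
ALTER SYSTEM SET log_min_messages = 'debug1';
SELECT pg_reload_conf();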

-- 
With Regards,
Amit Kapila.



Re: Data loss on logical replication, 12.12 to 14.5, ALTER SUBSCRIPTION

From
Michail Nikolaev
Date:
Hello, Amit.

> The point which is not completely clear from your description is the
> timing of missing records. In one of your previous emails, you seem to
> have indicated that the data missed from Table B is from the time when
> the initial sync for Table B was in-progress, right? Also, from your
> description, it seems there is no error or restart that happened
> during the time of initial sync for Table B. Is that understanding
> correct?

Yes and yes.
* B sync started - 08:08:34
* lost records are created - 09:49:xx
* B initial sync finished - 10:19:08
* I/O error with WAL - 10:19:22
* SIGTERM - 10:35:20

"Finished" here is `logical replication table synchronization worker
for subscription "cloud_production_main_sub_v4", table "B" has
finished`.
As far as I know, it is about COPY command.
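
(For what it's worth, while a table's initial copy/catch-up is running, the
corresponding tablesync worker can be observed on the subscriber with
something like this:)

-- The main apply worker has relid = NULL; tablesync workers show the
-- table they are currently copying/catching up in relid.
SELECT subname, pid, relid::regclass AS syncing_table,
       received_lsn, latest_end_lsn
FROM pg_stat_subscription
WHERE subname = 'cloud_production_main_sub_v4';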

> I am not able to see how these steps can lead to the problem.

One idea I have here: it could be something related to the patch that
forbids canceling queries while waiting for synchronous
replication acknowledgement [1].
It is applied to Postgres in the cloud we were using [2]. We started
to see such errors at 10:24:18:

      `The COMMIT record has already flushed to WAL locally and might
not have been replicated to the standby. We must wait here.`

I wonder, could it be some tricky race because of the downtime of the
synchronous replica and queries stuck waiting for the ACK forever?

> If the problem is reproducible at your end, you might want to increase LOG
> verbosity to DEBUG1 and see if there is additional information in the
> LOGs that can help, or it would be really good if there is a
> self-sufficient test to reproduce it.

Unfortunately, it looks like it is really hard to reproduce.

Best regards,
Michail.

[1]:
https://www.postgresql.org/message-id/flat/CALj2ACU%3DnzEb_dEfoLqez5CLcwvx1GhkdfYRNX%2BA4NDRbjYdBg%40mail.gmail.com#8b7ffc8cdecb89de43c0701b4b6b5142
[2]:
https://www.postgresql.org/message-id/flat/CAAhFRxgcBy-UCvyJ1ZZ1UKf4Owrx4J2X1F4tN_FD%3Dfh5wZgdkw%40mail.gmail.com#9c71a85cb6009eb60d0361de82772a50



Re: Data loss on logical replication, 12.12 to 14.5, ALTER SUBSCRIPTION

From
Amit Kapila
Date:
On Tue, Jan 3, 2023 at 2:14 PM Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:
>
> > The point which is not completely clear from your description is the
> > timing of missing records. In one of your previous emails, you seem to
> > have indicated that the data missed from Table B is from the time when
> > the initial sync for Table B was in-progress, right? Also, from your
> > description, it seems there is no error or restart that happened
> > during the time of initial sync for Table B. Is that understanding
> > correct?
>
> Yes and yes.
> * B sync started - 08:08:34
> * lost records are created - 09:49:xx
> * B initial sync finished - 10:19:08
> * I/O error with WAL - 10:19:22
> * SIGTERM - 10:35:20
>
> "Finished" here is `logical replication table synchronization worker
> for subscription "cloud_production_main_sub_v4", table "B" has
> finished`.
> As far as I know, it is about the COPY command.
>
> > I am not able to see how these steps can lead to the problem.
>
> One idea I have here: it could be something related to the patch that
> forbids canceling queries while waiting for synchronous
> replication acknowledgement [1].
> It is applied to Postgres in the cloud we were using [2]. We started
> to see such errors at 10:24:18:
>
>       `The COMMIT record has already flushed to WAL locally and might
> not have been replicated to the standby. We must wait here.`
>

Does that by any chance mean you are using a non-community version of
Postgres which has some other changes?

> I wonder, could it be some tricky race because of the downtime of the
> synchronous replica and queries stuck waiting for the ACK forever?
>

It is possible but ideally, in that case, the client should request
such a transaction again.

-- 
With Regards,
Amit Kapila.



Re: Data loss on logical replication, 12.12 to 14.5, ALTER SUBSCRIPTION

From
Michail Nikolaev
Date:
> Does that by any chance mean you are using a non-community version of
> Postgres which has some other changes?

It is a managed Postgres service in a public cloud. Usually, such
providers apply some custom minor patches.
The only one I know about is the one that forbids canceling queries
while waiting for synchronous replication acknowledgement.

> It is possible but ideally, in that case, the client should request
> such a transaction again.

I am not sure I get you here.

I'll try to explain what I mean:

The patch I'm referring to does not allow canceling a query while it is
waiting for the ACK of the COMMIT message in the case of synchronous
replication.
If the synchronous standby is down, the query and connection are just
stuck until a server restart (or until the standby becomes available to
process the ACK).
Tuples changed by such a hanging transaction are not visible to other
transactions. It is all done to prevent seeing spurious tuples in the
case of a network split.
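
Schematically, it is something like this (a minimal illustration with a
hypothetical, initially empty table t; assume synchronous_standby_names
points at a standby that is currently down and the no-cancel patch is in
place):

-- session 1
BEGIN;
INSERT INTO t VALUES (1);
COMMIT;  -- the COMMIT record is flushed to local WAL, but the backend now
         -- waits for the standby's ACK; with the patch the wait cannot be
         -- cancelled, so the session hangs here

-- session 2, on the same server, at the same time
SELECT count(*) FROM t;  -- still 0: the hanging transaction's tuples are
                         -- not visible to other transactions yet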

So, it seems like we had such a situation during that story because of
our synchronous standby downtime (before the server restart).
My thought here is just about the possibility that such transactions
(waiting for the ACK of the COMMIT) are handled somehow incorrectly by
the logical replication engine.

Michail.



Re: Data loss on logical replication, 12.12 to 14.5, ALTER SUBSCRIPTION

From
Amit Kapila
Date:
On Tue, Jan 3, 2023 at 8:50 PM Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:
>
> > Does that by any chance mean you are using a non-community version of
> > Postgres which has some other changes?
>
> It is a managed Postgres service in a public cloud. Usually, such
> providers apply some custom minor patches.
> The only one I know about is the one that forbids canceling queries
> while waiting for synchronous replication acknowledgement.
>

Okay, but it would be better to know all the other changes they have made.

> > It is possible but ideally, in that case, the client should request
> > such a transaction again.
>
> I am not sure I get you here.
>
> I'll try to explain what I mean:
>
> The patch I'm referring to does not allow canceling a query while it is
> waiting for the ACK of the COMMIT message in the case of synchronous
> replication.
> If the synchronous standby is down, the query and connection are just
> stuck until a server restart (or until the standby becomes available to
> process the ACK).
> Tuples changed by such a hanging transaction are not visible to other
> transactions. It is all done to prevent seeing spurious tuples in the
> case of a network split.
>
> So, it seems like we had such a situation during that story because of
> our synchronous standby downtime (before the server restart).
> My thought here is just about the possibility that such transactions
> (waiting for the ACK of the COMMIT) are handled somehow incorrectly by
> the logical replication engine.
>

I understood this point yesterday, but we do have handling for such
cases. Say, if the subscriber is down during the time of such
synchronous transactions, then after the restart it will request to
restart the replication from a point which is prior to such
transactions. We ensure this via replication origins. See the docs [1]
for more information. Now, it is possible that there is a bug in that
mechanism, but it is difficult to find it without some hints from LOGs
or a reproducible test. It is also possible that there is another area
of the Postgres code that has a bug. But, OTOH, we can't rule out the
possibility that it is because of some features added by the managed
service unless you can reproduce it on a community Postgres build.

[1] - https://www.postgresql.org/docs/devel/replication-origins.html
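
For reference, the origin progress that shows up in your LOGs as
"recovered replication state of node ..." can also be inspected at any
time on the subscriber with something like this (remote_lsn is the
publisher LSN up to which changes from that origin have been applied;
for a subscription, the origin is typically named pg_<subscription OID>):

-- On the subscriber: current progress of each replication origin.
SELECT s.external_id AS origin_name,
       s.remote_lsn,
       s.local_lsn
FROM pg_replication_origin_status s;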

-- 
With Regards,
Amit Kapila.