Thread: Data loss on logical replication, 12.12 to 14.5, ALTER SUBSCRIPTION
Hello. Just a small story about a small data loss on logical replication.

We were logically replicating a 4 TB database from

> PostgreSQL 12.12 (Ubuntu 12.12-201-yandex.49163.d86383ed5b) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit

to

> PostgreSQL 14.5 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit

The database includes many tables; among them are tables A, B and C. Tables A, B and C are changed in the same transaction (new tuples are created in B and C with a corresponding update of A).

Table A was added to the PUBLICATION from the start. Once the initial sync was done, tables B and C were added to the PUBLICATION with a REFRESH on the subscription (to reduce the WAL accumulation on the source database).

So, we see in the logs:

> -2022-12-13 13:19:55 UTC-63987bfb.2733-LOG: logical replication table synchronization worker for subscription "cloud_production_main_sub_v4", table "A" has started
> -2022-12-13 14:41:49 UTC-63987bfb.2733-LOG: logical replication table synchronization worker for subscription "cloud_production_main_sub_v4", table "A" has finished
> -2022-12-14 08:08:34 UTC-63998482.7d10-LOG: logical replication table synchronization worker for subscription "cloud_production_main_sub_v4", table "B" has started
> -2022-12-14 10:19:08 UTC-63998482.7d10-LOG: logical replication table synchronization worker for subscription "cloud_production_main_sub_v4", table "B" has finished
> -2022-12-14 10:37:47 UTC-6399a77b.1fc-LOG: logical replication table synchronization worker for subscription "cloud_production_main_sub_v4", table "C" has started
> -2022-12-14 10:48:46 UTC-6399a77b.1fc-LOG: logical replication table synchronization worker for subscription "cloud_production_main_sub_v4", table "C" has finished

Also, we had to reboot the subscriber server twice.
Moreover, we have plenty of messages like:

> -2022-12-13 15:53:30 UTC-639872fa.1-LOG: background worker "logical replication worker" (PID 47960) exited with exit code 1
> -2022-12-13 21:04:31 UTC-6398e8df.4f7c-LOG: logical replication apply worker for subscription "cloud_production_main_sub_v4" has started
> -2022-12-14 10:19:22 UTC-6398e8df.4f7c-ERROR: could not receive data from WAL stream: SSL SYSCALL error: EOF detected

Additionally, our HA replica of the subscriber was broken and recovered by support... And logs like this:

> psql-2022-12-14 10:24:18 UTC-63999d2c.2020-WARNING: user requested cancel while waiting for synchronous replication ack.
> The COMMIT record has already flushed to WAL locally and might not have been replicated to the standby. We must wait here.

Furthermore, we were adding/removing another table D from the publication a few times. So, it was a little bit of a messy story. After all that, we got streaming working for the whole database.

But after some time we realized we had lost 203 records for table B, created from 2022-12-14 09:21:25.705 to 2022-12-14 09:49:20.664 (after the synchronization start, but before the finish). And the most tricky part here - A, B and C are changed in the same transaction (related tuples). But tables A and C are fine, only a few records from B are lost. We have compared all other tables record to record - only 203 records from B are missing.

We have restored the server from backup with point-in-time recovery (to exclude the case of an application or human error) - the same results. Furthermore, we have tried different indexes in the search (to exclude an issue with a broken btree) - the same results.

So, yes, we understand our replication story was not even close to a classical happy path. But the result feels a little bit scary. Currently, I have access to the database and logs - so, feel free to ask for additional debugging information if you like.

Thanks a lot,
Michail.
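The publish/refresh procedure described above looks roughly like this (a sketch; the publication name and connection string are illustrative, only the subscription name appears in the logs):

```sql
-- On the source (12.12): publish table A first.
CREATE PUBLICATION cloud_production_main_pub FOR TABLE "A";

-- On the destination (14.5): subscribe; initial sync of A starts.
CREATE SUBSCRIPTION cloud_production_main_sub_v4
    CONNECTION 'host=source dbname=main'
    PUBLICATION cloud_production_main_pub;

-- Later, once A is streaming, add B and C and refresh the
-- subscription, which starts their initial sync without re-copying A.
ALTER PUBLICATION cloud_production_main_pub ADD TABLE "B", "C";
ALTER SUBSCRIPTION cloud_production_main_sub_v4 REFRESH PUBLICATION;
```

Deferring ADD TABLE until after A's sync is what limits WAL retention on the source, since the tablesync slots for B and C are created only at REFRESH time.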
Hello again.

Just a small fix for:

> 2022-12-14 09:21:25.705 to
> 2022-12-14 09:49:20.664 (after synchronization start, but before finish).

Correct values are:

2022-12-14 09:49:31.340
2022-12-14 09:49:41.683

So, it looks like we lost about 10s of WAL for one of the tables.
On Mon, Dec 26, 2022 at 8:50 PM Michail Nikolaev <michail.nikolaev@gmail.com> wrote:
>
> Hello again.
>
> Just small a fix for:
>
> > 2022-12-14 09:21:25.705 to
> > 2022-12-14 09:49:20.664 (after synchronization start, but before finish).
>
> Correct values are:
>
> 2022-12-14 09:49:31.340
> 2022-12-14 09:49:41.683
>
> So, it looks like we lost about 10s of one of the tables WAL.
>

IIUC, this is the time when only table B's initial sync was in-progress. Table A's initial sync was finished by that time and for table C, it had not yet started. We have some logic for such overlapping phases where we allow the apply worker to skip data that will later be applied by the initial sync worker.

During the time of the initial sync of B, are there any other operations on that table apart from the missing ones? You may want to look at the LOGs of the subscriber during the initial sync time for any errors or other messages.

--
With Regards,
Amit Kapila.
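The per-table sync phases mentioned above can be observed on the subscriber through the pg_subscription_rel catalog (a sketch; roughly, i = initialize, d = data copy, s = synchronized, r = ready):

```sql
-- On the subscriber: per-table sync state for the subscription.
SELECT srrelid::regclass AS table_name,
       srsubstate,       -- i/d/s/r sync phase for this table
       srsublsn          -- LSN used for the sync/apply hand-off
FROM pg_subscription_rel
WHERE srsubid = (SELECT oid FROM pg_subscription
                 WHERE subname = 'cloud_production_main_sub_v4');
```

The apply worker skips changes for a table until its state reaches the point recorded in srsublsn, which is the overlap logic referred to above.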
Hello, Amit!

> IIUC, this is the time when only table B's initial sync was
> in-progress. Table A's initial sync was finished by that time and for
> Table C, it is yet not started.

Yes, it is correct. C was started too, but unsuccessfully (restarted after, see below).

> During the time of
> the initial sync of B, are there any other operations on that table
> apart from the missing ones?

Yes, a lot of them. Tuples were created all the time without any pauses, only the ~10s interval is gone.

> You may want to see the LOGs of
> subscribers during the initial sync time for any errors or other
> messages.

Looks like I have found something interesting. Below, sorted by time, with some comments.

Restart of the logical apply worker the day before the issue:

-2022-12-13 21:04:25 UTC-6398e8d9.4ec3-LOG: logical replication apply worker for subscription "cloud_production_main_sub_v4" has started
-2022-12-13 21:04:26 UTC-6398e8d9.4ec3-ERROR: could not start WAL streaming: FATAL: out of relcache_callback_list slots
CONTEXT: slot "cloud_production_main_sub_v4", output plugin "pgoutput", in the startup callback
ERROR: odyssey: c1747b31d0187: remote server read/write error s721c052ace56: (null) SSL SYSCALL error: EOF detected
-2022-12-13 21:04:26 UTC-639872fa.1-LOG: background worker "logical replication worker" (PID 20163) exited with exit code 1
-2022-12-13 21:04:31 UTC-6398e8df.4f7c-LOG: logical replication apply worker for subscription "cloud_production_main_sub_v4" has started

Start of the B and C table initial sync (adding tables to the subscription, table A already in streaming mode):

-2022-12-14 08:08:34 UTC-63998482.7d10-LOG: logical replication table synchronization worker for subscription "cloud_production_main_sub_v4", table "B" has started
-2022-12-14 08:08:34 UTC-63998482.7d13-LOG: logical replication table synchronization worker for subscription "cloud_production_main_sub_v4", table "C" has started

B is synchronized without any errors:

-2022-12-14 10:19:08 UTC-63998482.7d10-LOG: logical replication table synchronization worker for subscription "cloud_production_main_sub_v4", table "B" has finished

After about 15 seconds, the replication worker is restarted because of issues with I/O:

-2022-12-14 10:19:22 UTC-6398e8df.4f7c-ERROR: could not receive data from WAL stream: SSL SYSCALL error: EOF detected
-2022-12-14 10:19:22 UTC-6399a32a.6af3-LOG: logical replication apply worker for subscription "cloud_production_main_sub_v4" has started
-2022-12-14 10:19:22 UTC-639872fa.1-LOG: background worker "logical replication worker" (PID 20348) exited with exit code 1

Then a cancel of a query (something about an insert into public.lsnmover):

-2022-12-14 10:21:03 UTC-63999c6a.f25d-ERROR: canceling statement due to user request
-2022-12-14 10:21:39 UTC-639997f9.fd6e-ERROR: canceling statement due to user request

After a small amount of time, synchronous replication seems to be broken, we see tons of:

-2022-12-14 10:24:18 UTC-63999d2c.2020-WARNING: shutdown requested while waiting for synchronous replication ack.
-2022-12-14 10:24:18 UTC-63999d2c.2020-WARNING: user requested cancel while waiting for synchronous replication ack.

After a few minutes, at 10:36:05, we initiated a database restart:

-2022-12-14 10:35:20 UTC-63999c6a.f25d-ERROR: canceling statement due to user request
-2022-12-14 10:37:25 UTC-6399a765.1-LOG: pgms_stats: Finishing PG_init
-2022-12-14 10:37:26 UTC-6399a765.1-LOG: listening on IPv4 address "0.0.0.0", port 5432
-2022-12-14 10:37:26 UTC-6399a765.1-LOG: listening on IPv6 address "::", port 5432
-2022-12-14 10:37:26 UTC-6399a765.1-LOG: starting PostgreSQL 14.4 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit
-2022-12-14 10:37:26 UTC-6399a765.1-LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
-2022-12-14 10:37:26 UTC-6399a766.103-LOG: database system was interrupted; last known up at 2022-12-14 10:36:47 UTC
-2022-12-14 10:37:26 UTC-6399a766.105-FATAL: the database system is starting up
-2022-12-14 10:37:26 UTC-6399a766.107-FATAL: the database system is starting up
-2022-12-14 10:37:27 UTC-6399a767.108-FATAL: the database system is starting up
-2022-12-14 10:37:27 UTC-6399a767.109-FATAL: the database system is starting up
-2022-12-14 10:37:27 UTC-6399a767.10a-FATAL: the database system is starting up
-2022-12-14 10:37:27 UTC-6399a767.10b-FATAL: the database system is starting up
-2022-12-14 10:37:27 UTC-6399a767.113-FATAL: the database system is starting up
-2022-12-14 10:37:29 UTC-6399a766.103-LOG: recovered replication state of node 4 to 185F6/CBF4BBB8
-2022-12-14 10:37:29 UTC-6399a766.103-LOG: recovered replication state of node 1 to 18605/FE229E10
-2022-12-14 10:37:29 UTC-6399a766.103-LOG: recovered replication state of node 6 to 185F8/86AEC880
-2022-12-14 10:37:29 UTC-6399a766.103-LOG: database system was not properly shut down; automatic recovery in progress
-2022-12-14 10:37:29 UTC-6399a766.103-LOG: recovered replication state of node 5 to 185F6/CBF4BBB8
-2022-12-14 10:37:29 UTC-6399a766.103-LOG: redo starts at D76/557F4210
-2022-12-14 10:37:34 UTC-6399a76e.14a-FATAL: the database system is starting up
-2022-12-14 10:37:34 UTC-6399a766.103-LOG: redo done at D76/BDDBDAE0 system usage: CPU: user: 3.31 s, system: 1.36 s, elapsed: 4.68 s
-2022-12-14 10:37:34 UTC-6399a766.103-LOG: checkpoint starting: end-of-recovery immediate
-2022-12-14 10:37:34 UTC-6399a76e.14b-FATAL: the database system is starting up
-2022-12-14 10:37:34 UTC-6399a76e.14c-FATAL: the database system is starting up
-2022-12-14 10:37:45 UTC-6399a779.190-FATAL: the database system is starting up
-2022-12-14 10:37:45 UTC-6399a766.103-LOG: checkpoint complete: wrote 300326 buffers (3.6%);
-2022-12-14 10:37:45 UTC-6399a779.191-FATAL: the database system is starting up
-2022-12-14 10:37:45 UTC-6399a779.199-LOG: pgms_stats: bgworker: pid: 409.
-2022-12-14 10:37:45 UTC-6399a779.19a-LOG: pg_qs: qs bgworker: pid: 410.
-2022-12-14 10:37:45 UTC-6399a765.1-LOG: database system is ready to accept connections
-2022-12-14 10:37:45 UTC-6399a779.199-LOG: pgms_stats: qs bgworker: bgworker started running with host database: cloud_sys
-2022-12-14 10:37:45 UTC-6399a779.19a-LOG: pg_qs: qs bgworker: bgworker started running with host database: cloud_sys
-2022-12-14 10:37:45 UTC-6399a779.19d-LOG: logical replication apply worker for subscription "cloud_production_main_sub_v4" has started
-2022-12-14 10:37:46 UTC-6399a779.19b-LOG: pg_cron scheduler started

Initial sync for C started from scratch (with some other tables):

-2022-12-14 10:37:47 UTC-6399a77b.1f9-LOG: logical replication table synchronization worker for subscription "cloud_production_main_sub_v4", table "E" has started
-2022-12-14 10:37:47 UTC-6399a77b.1fa-LOG: logical replication table synchronization worker for subscription "cloud_production_main_sub_v4", table "F" has started
-2022-12-14 10:37:47 UTC-6399a77b.1fc-LOG: logical replication table synchronization worker for subscription "cloud_production_main_sub_v4", table "C" has started

And a few more flaps with WAL streaming later:
-2022-12-14 10:50:20 UTC-6399a779.19d-ERROR: could not send data to WAL stream: SSL SYSCALL error: EOF detected
-2022-12-14 10:50:20 UTC-6399a765.1-LOG: background worker "logical replication worker" (PID 413) exited with exit code 1

Probably a small part of the WAL was somehow skipped by the logical worker in all that mess.

Best regards,
Michail.
On Tue, Dec 27, 2022 at 5:49 PM Michail Nikolaev <michail.nikolaev@gmail.com> wrote:
>
> Probably a small part of WAL was somehow skipped by logical worker in
> all that mess.
>

None of these entries are from the point mentioned by you [1] yesterday where you didn't find the corresponding data in the subscriber. How did you identify that the entries corresponding to that timing were missing?

[1] -
> 2022-12-14 09:49:31.340
> 2022-12-14 09:49:41.683

--
With Regards,
Amit Kapila.
Hello.

> None of these entries are from the point mentioned by you [1]
> yesterday where you didn't find the corresponding data in the
> subscriber. How did you identify that the entries corresponding to
> that timing were missing?

Some of them are from before the interval, some after... But the source database was generating a lot of WAL during logical replication - some of these log entries are from the time AFTER completion of the initial sync of B but (probably) BEFORE B finished catching up (entering streaming mode).

Just to clarify, tables A, B and C are updated in the same transaction, something like:

BEGIN;
UPDATE A SET x = x + 1 WHERE id = :id;
INSERT INTO B VALUES (current_timestamp, :id);
INSERT INTO C VALUES (current_timestamp, :id);
COMMIT;

Other (non-mentioned) tables are also included in this transaction, but only B is missing a small amount of data.

So, shortly, the story looks like:

* initial sync of A (and other tables) started and completed, they are in streaming mode
* B and C initial sync started (by altering PUBLICATION and SUBSCRIPTION)
* B sync completed, but new changes are still being applied to the table to catch up with the primary
* the logical replication apply worker restarted because of an IO error on WAL receive
* Postgres killed
* Postgres restarted
* C initial sync restarted
* the logical replication apply worker restarted a few times because of IO errors on WAL receive
* finally every table is in streaming mode, but with a small gap in B

Thanks,
Michail.
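Because B and C receive a row for the same :id in every transaction, the record-to-record comparison described above can be sketched as a query like this on the subscriber (illustrative only; the real column names are assumptions):

```sql
-- Transactions whose row reached C but is missing from B
-- within the suspicious ~10s window.
SELECT c.id, c.created_at
FROM "C" c
LEFT JOIN "B" b ON b.id = c.id
WHERE b.id IS NULL
  AND c.created_at BETWEEN '2022-12-14 09:49:31.340'
                       AND '2022-12-14 09:49:41.683';
```

Any row returned here is a commit that was applied for C but lost for B, which is exactly the asymmetry that makes this case strange.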
On Wed, Dec 28, 2022 at 4:52 PM Michail Nikolaev <michail.nikolaev@gmail.com> wrote:
>
> Hello.
>
> > None of these entries are from the point mentioned by you [1]
> > yesterday where you didn't find the corresponding data in the
> > subscriber. How did you identify that the entries corresponding to
> > that timing were missing?
>
> Some of the before the interval, some after... But the source database
> was generating a lot of WAL during logical replication
> - some of these log entries from time AFTER completion of initial sync
> of B but (probably) BEFORE finishing B table catch up (entering
> streaming mode).
>
...
...
>
> So, shortly the story looks like:
>
> * initial sync of A (and other tables) started and completed, they are
> in streaming mode
> * B and C initial sync started (by altering PUBLICATION and SUBSCRIPTION)
> * B sync completed, but new changes are still applying to the tables
> to catch up primary
>

The point which is not completely clear from your description is the timing of the missing records. In one of your previous emails, you seem to have indicated that the data missing from table B is from the time when the initial sync for table B was in-progress, right? Also, from your description, it seems no error or restart happened during the time of the initial sync for table B. Is that understanding correct?

> * logical replication apply worker is restarted because IO error on WAL receive
> * Postgres killed
> * Postgres restarted
> * C initial sync restarted
> * logical replication apply worker few times restarted because IO
> error on WAL receive
> * finally every table in streaming mode but with small gap in B
>

I am not able to see how these steps can lead to the problem. If the problem is reproducible at your end, you might want to increase the LOG verbosity to DEBUG1 and see if there is additional information in the LOGs that can help, or it would be really good if there is a self-sufficient test to reproduce it.
--
With Regards,
Amit Kapila.
Hello, Amit.

> The point which is not completely clear from your description is the
> timing of missing records. In one of your previous emails, you seem to
> have indicated that the data missed from Table B is from the time when
> the initial sync for Table B was in-progress, right? Also, from your
> description, it seems there is no error or restart that happened
> during the time of initial sync for Table B. Is that understanding
> correct?

Yes and yes.

* B sync started - 08:08:34
* lost records were created - 09:49:xx
* B initial sync finished - 10:19:08
* I/O error with WAL - 10:19:22
* SIGTERM - 10:35:20

"Finished" here is `logical replication table synchronization worker for subscription "cloud_production_main_sub_v4", table "B" has finished`. As far as I know, it is about the COPY command.

> I am not able to see how these steps can lead to the problem.

One idea I have here - it could be something related to the patch forbidding the canceling of queries while waiting for the synchronous replication acknowledgement [1]. It is applied to Postgres in the cloud we were using [2]. We started to see such errors at 10:24:18:

`The COMMIT record has already flushed to WAL locally and might not have been replicated to the standby. We must wait here.`

I wonder, could it be some tricky race because of the downtime of the synchronous replica and queries stuck waiting for the ACK forever?

> If the problem is reproducible at your end, you might want to increase LOG
> verbosity to DEBUG1 and see if there is additional information in the
> LOGs that can help or it would be really good if there is a
> self-sufficient test to reproduce it.

Unfortunately, it looks like it is really hard to reproduce.

Best regards,
Michail.
[1]: https://www.postgresql.org/message-id/flat/CALj2ACU%3DnzEb_dEfoLqez5CLcwvx1GhkdfYRNX%2BA4NDRbjYdBg%40mail.gmail.com#8b7ffc8cdecb89de43c0701b4b6b5142
[2]: https://www.postgresql.org/message-id/flat/CAAhFRxgcBy-UCvyJ1ZZ1UKf4Owrx4J2X1F4tN_FD%3Dfh5wZgdkw%40mail.gmail.com#9c71a85cb6009eb60d0361de82772a50
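For reference, backends stuck waiting for the synchronous-replication ACK described above are visible on the primary via pg_stat_activity (a sketch; the SyncRep wait event is standard, but whether the patched build reports it identically is an assumption):

```sql
-- Sessions whose COMMIT is waiting for a synchronous standby ack.
SELECT pid, state, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE wait_event = 'SyncRep';
```

With the cancel-forbidding patch, such sessions would remain in this state until the standby acknowledges or the server restarts.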
On Tue, Jan 3, 2023 at 2:14 PM Michail Nikolaev <michail.nikolaev@gmail.com> wrote:
>
> > The point which is not completely clear from your description is the
> > timing of missing records. In one of your previous emails, you seem to
> > have indicated that the data missed from Table B is from the time when
> > the initial sync for Table B was in-progress, right? Also, from your
> > description, it seems there is no error or restart that happened
> > during the time of initial sync for Table B. Is that understanding
> > correct?
>
> Yes and yes.
>
> * B sync started - 08:08:34
> * lost records are created - 09:49:xx
> * B initial sync finished - 10:19:08
> * I/O error with WAL - 10:19:22
> * SIGTERM - 10:35:20
>
> "Finished" here is `logical replication table synchronization worker
> for subscription "cloud_production_main_sub_v4", table "B" has
> finished`.
> As far as I know, it is about COPY command.
>
> > I am not able to see how these steps can lead to the problem.
>
> One idea I have here - it is something related to the patch about
> forbidding of canceling queries while waiting for synchronous
> replication acknowledgement [1].
> It is applied to Postgres in the cloud we were using [2]. We started
> to see such errors in 10:24:18:
>
> `The COMMIT record has already flushed to WAL locally and might
> not have been replicated to the standby. We must wait here.`
>

Does that by any chance mean you are using a non-community version of Postgres which has some other changes?

> I wonder could it be some tricky race because of downtime of
> synchronous replica and queries stuck waiting for ACK forever?
>

It is possible, but ideally, in that case, the client should request such a transaction again.

--
With Regards,
Amit Kapila.
> Does that by any chance mean you are using a non-community version of
> Postgres which has some other changes?

It is a managed Postgres service in a general cloud. Usually, such providers apply some custom minor patches. The only one I know about is the one forbidding the canceling of queries while waiting for the synchronous replication acknowledgement.

> It is possible but ideally, in that case, the client should request
> such a transaction again.

I am not sure I get you here. I'll try to explain what I mean:

The patch I'm referring to does not allow canceling a query while it is waiting for the ACK for the COMMIT message in the case of synchronous replication. If the synchronous standby is down - the query and connection just get stuck until a server restart (or until the standby becomes available to process the ACK). Tuples changed by such a hanging transaction are not visible to other transactions. It is all done to prevent seeing spurious tuples in the case of a network split.

So, it seems like we had such a situation during that story because of our synchronous standby downtime (before the server restart). My thought here is just about the possibility that such transactions (waiting for the ACK for COMMIT) are handled somehow incorrectly by the logical replication engine.

Michail.
On Tue, Jan 3, 2023 at 8:50 PM Michail Nikolaev <michail.nikolaev@gmail.com> wrote:
>
> > Does that by any chance mean you are using a non-community version of
> > Postgres which has some other changes?
>
> It is a managed Postgres service in the general cloud. Usually, such
> providers apply some custom minor patches.
> The only one I know about - about forbidding of canceling queries
> while waiting for synchronous replication acknowledgement.
>

Okay, but it would be better to know all the other changes they have.

> > It is possible but ideally, in that case, the client should request
> > such a transaction again.
>
> I am not sure I get you here.
>
> I'll try to explain what I mean:
>
> The patch I'm referring to does not allow canceling a query while it
> waiting acknowledge for ACK for COMMIT message in case of synchronous
> replication.
> If synchronous standby is down - query and connection just stuck until
> server restart (or until standby become available to process ACK).
> Tuples changed by such a hanging transaction are not visible by other
> transactions. It is all done to prevent seeing spurious tuples in case
> of network split.
>
> So, it seems like we had such a situation during that story because of
> our synchronous standby downtime (before server restart).
> My thoughts just about the possibility of fact that such transactions
> (waiting for ACK for COMMIT) are handled somehow incorrectly by
> logical replication engine.
>

I understood this point yesterday, but we do have handling for such cases. Say, if the subscriber is down during the time of such synchronous transactions, then after the restart, it will request to restart the replication from a point which is prior to such transactions. We ensure this by replication origins. See the docs [1] for more information about the same. Now, it is possible that there is a bug in that mechanism, but it is difficult to find it without some hints from the LOGs or a reproducible test.
It is also possible that there is another area that has a bug in the Postgres code. But, OTOH, we can't rule out the possibility that it is because of some features added by the managed service unless you can reproduce it on a community Postgres build.

[1] - https://www.postgresql.org/docs/devel/replication-origins.html

--
With Regards,
Amit Kapila.
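The replication-origin progress that determines the restart point can be inspected on the subscriber (a sketch; the external_id values correspond to the "recovered replication state of node N" log lines seen after the crash):

```sql
-- On the subscriber: how far each origin (one per subscription/sync
-- worker) believes it has applied; apply restarts from remote_lsn
-- after a crash.
SELECT external_id, remote_lsn, local_lsn
FROM pg_replication_origin_status;
```

If remote_lsn here ever advanced past a commit that was never actually applied for table B, that would explain a permanent gap of exactly this kind.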