Re: Data loss on logical replication, 12.12 to 14.5, ALTER SUBSCRIPTION - Mailing list pgsql-hackers
| From | Michail Nikolaev |
|---|---|
| Subject | Re: Data loss on logical replication, 12.12 to 14.5, ALTER SUBSCRIPTION |
| Date | |
| Msg-id | CANtu0ohT4+4WC6JKxk2Udr-OkroDsf+8_=XgfS7Fjm6GDcNNUw@mail.gmail.com |
| In response to | Re: Data loss on logical replication, 12.12 to 14.5, ALTER SUBSCRIPTION (Amit Kapila <amit.kapila16@gmail.com>) |
| Responses | Re: Data loss on logical replication, 12.12 to 14.5, ALTER SUBSCRIPTION |
| List | pgsql-hackers |
Hello, Amit!

> IUC, this is the time when only table B's initial sync was
> in-progress. Table A's initial sync was finished by that time and for
> Table C, it is yet not started.

Yes, that is correct. C was started too, but unsuccessfully (it was restarted later, see below).

> During the time of
> the initial sync of B, are there any other operations on that table
> apart from the missing ones?

Yes, a lot of them. Tuples were being created all the time without any pauses; only the ~10s interval is missing.

> You may want to see the LOGs of
> subscribers during the initial sync time for any errors or other
> messages.

It looks like I have found something interesting. Below are the log entries, sorted by time, with some comments.

Restart of the logical apply worker the day before the issue:

-2022-12-13 21:04:25 UTC-6398e8d9.4ec3-LOG: logical replication apply worker for subscription "cloud_production_main_sub_v4" has started
-2022-12-13 21:04:26 UTC-6398e8d9.4ec3-ERROR: could not start WAL streaming: FATAL: out of relcache_callback_list slots
CONTEXT: slot "cloud_production_main_sub_v4", output plugin "pgoutput", in the startup callback
ERROR: odyssey: c1747b31d0187: remote server read/write error s721c052ace56: (null) SSL SYSCALL error: EOF detected
-2022-12-13 21:04:26 UTC-639872fa.1-LOG: background worker "logical replication worker" (PID 20163) exited with exit code 1
-2022-12-13 21:04:31 UTC-6398e8df.4f7c-LOG: logical replication apply worker for subscription "cloud_production_main_sub_v4" has started

Start of the initial sync for tables B and C (adding the tables to the subscription; table A is already in streaming mode):

-2022-12-14 08:08:34 UTC-63998482.7d10-LOG: logical replication table synchronization worker for subscription "cloud_production_main_sub_v4", table "B" has started
-2022-12-14 08:08:34 UTC-63998482.7d13-LOG: logical replication table synchronization worker for subscription "cloud_production_main_sub_v4", table "C" has started
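(For completeness: the per-table sync progress during such a window can be watched on the subscriber with something like the rough sketch below; srsubstate 'r' means the table has reached normal streaming, 'd' means the initial copy is in progress, 's' means it is synchronized and catching up.)

-- rough sketch: per-table initial sync state for the subscription
SELECT srrelid::regclass AS tbl, srsubstate, srsublsn
FROM pg_subscription_rel
WHERE srsubid = (SELECT oid FROM pg_subscription
                 WHERE subname = 'cloud_production_main_sub_v4');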
B is synchronized without any errors:

-2022-12-14 10:19:08 UTC-63998482.7d10-LOG: logical replication table synchronization worker for subscription "cloud_production_main_sub_v4", table "B" has finished

About 15 seconds later, the apply worker is restarted because of I/O issues:

-2022-12-14 10:19:22 UTC-6398e8df.4f7c-ERROR: could not receive data from WAL stream: SSL SYSCALL error: EOF detected
-2022-12-14 10:19:22 UTC-6399a32a.6af3-LOG: logical replication apply worker for subscription "cloud_production_main_sub_v4" has started
-2022-12-14 10:19:22 UTC-639872fa.1-LOG: background worker "logical replication worker" (PID 20348) exited with exit code 1

Then a query cancellation (something about an INSERT into public.lsnmover):

-2022-12-14 10:21:03 UTC-63999c6a.f25d-ERROR: canceling statement due to user request
-2022-12-14 10:21:39 UTC-639997f9.fd6e-ERROR: canceling statement due to user request

A short time later, synchronous replication seems to be broken; we see tons of:

-2022-12-14 10:24:18 UTC-63999d2c.2020-WARNING: shutdown requested while waiting for synchronous replication ack.
-2022-12-14 10:24:18 UTC-63999d2c.2020-WARNING: user requested cancel while waiting for synchronous

After a few minutes, at 10:36:05, we initiated a database restart:

-2022-12-14 10:35:20 UTC-63999c6a.f25d-ERROR: canceling statement due to user request
-2022-12-14 10:37:25 UTC-6399a765.1-LOG: pgms_stats: Finishing PG_init
-2022-12-14 10:37:26 UTC-6399a765.1-LOG: listening on IPv4 address "0.0.0.0", port 5432
-2022-12-14 10:37:26 UTC-6399a765.1-LOG: listening on IPv6 address "::", port 5432
-2022-12-14 10:37:26 UTC-6399a765.1-LOG: starting PostgreSQL 14.4 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit
-2022-12-14 10:37:26 UTC-6399a765.1-LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
-2022-12-14 10:37:26 UTC-6399a766.103-LOG: database system was interrupted; last known up at 2022-12-14 10:36:47 UTC
-2022-12-14 10:37:26 UTC-6399a766.105-FATAL: the database system is starting up
-2022-12-14 10:37:26 UTC-6399a766.107-FATAL: the database system is starting up
-2022-12-14 10:37:27 UTC-6399a767.108-FATAL: the database system is starting up
-2022-12-14 10:37:27 UTC-6399a767.109-FATAL: the database system is starting up
-2022-12-14 10:37:27 UTC-6399a767.10a-FATAL: the database system is starting up
-2022-12-14 10:37:27 UTC-6399a767.10b-FATAL: the database system is starting up
-2022-12-14 10:37:27 UTC-6399a767.113-FATAL: the database system is starting up
-2022-12-14 10:37:29 UTC-6399a766.103-LOG: recovered replication state of node 4 to 185F6/CBF4BBB8
-2022-12-14 10:37:29 UTC-6399a766.103-LOG: recovered replication state of node 1 to 18605/FE229E10
-2022-12-14 10:37:29 UTC-6399a766.103-LOG: recovered replication state of node 6 to 185F8/86AEC880
-2022-12-14 10:37:29 UTC-6399a766.103-LOG: database system was not properly shut down; automatic recovery in progress
-2022-12-14 10:37:29 UTC-6399a766.103-LOG: recovered replication state of node 5 to 185F6/CBF4BBB8
-2022-12-14 10:37:29 UTC-6399a766.103-LOG: redo starts at D76/557F4210
-2022-12-14 10:37:34 UTC-6399a76e.14a-FATAL: the database system is starting up
-2022-12-14 10:37:34 UTC-6399a766.103-LOG: redo done at D76/BDDBDAE0 system usage: CPU: user: 3.31 s, system: 1.36 s, elapsed: 4.68 s
-2022-12-14 10:37:34 UTC-6399a766.103-LOG: checkpoint starting: end-of-recovery immediate
-2022-12-14 10:37:34 UTC-6399a76e.14b-FATAL: the database system is starting up
-2022-12-14 10:37:34 UTC-6399a76e.14c-FATAL: the database system is starting up
-2022-12-14 10:37:45 UTC-6399a779.190-FATAL: the database system is starting up
-2022-12-14 10:37:45 UTC-6399a766.103-LOG: checkpoint complete: wrote 300326 buffers (3.6%);
-2022-12-14 10:37:45 UTC-6399a779.191-FATAL: the database system is starting up
-2022-12-14 10:37:45 UTC-6399a779.199-LOG: pgms_stats: bgworker: pid: 409.
-2022-12-14 10:37:45 UTC-6399a779.19a-LOG: pg_qs: qs bgworker: pid: 410.
-2022-12-14 10:37:45 UTC-6399a765.1-LOG: database system is ready to accept connections
-2022-12-14 10:37:45 UTC-6399a779.199-LOG: pgms_stats: qs bgworker: bgworker started running with host database: cloud_sys
-2022-12-14 10:37:45 UTC-6399a779.19a-LOG: pg_qs: qs bgworker: bgworker started running with host database: cloud_sys
-2022-12-14 10:37:45 UTC-6399a779.19d-LOG: logical replication apply worker for subscription "cloud_production_main_sub_v4" has started
-2022-12-14 10:37:46 UTC-6399a779.19b-LOG: pg_cron scheduler started
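(As far as I understand, the "recovered replication state of node N to ..." lines above are the replication origins being restored from their last checkpointed positions during crash recovery. A rough way to see which node id belongs to this subscription's apply worker and where it stands afterwards, assuming the origin follows the usual pg_<subscription oid> naming:)

-- rough sketch: map the subscription to its replication origin and its progress
SELECT ros.local_id, ros.external_id, ros.remote_lsn, ros.local_lsn
FROM pg_replication_origin_status ros
JOIN pg_subscription s ON ros.external_id = 'pg_' || s.oid
WHERE s.subname = 'cloud_production_main_sub_v4';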
The initial sync for C started from scratch (along with some other tables):

-2022-12-14 10:37:47 UTC-6399a77b.1f9-LOG: logical replication table synchronization worker for subscription "cloud_production_main_sub_v4", table "E" has started
-2022-12-14 10:37:47 UTC-6399a77b.1fa-LOG: logical replication table synchronization worker for subscription "cloud_production_main_sub_v4", table "F" has started
-2022-12-14 10:37:47 UTC-6399a77b.1fc-LOG: logical replication table synchronization worker for subscription "cloud_production_main_sub_v4", table "C" has started

And a few more WAL-streaming flaps later:

-2022-12-14 10:50:20 UTC-6399a779.19d-ERROR: could not send data to WAL stream: SSL SYSCALL error: EOF detected
-2022-12-14 10:50:20 UTC-6399a765.1-LOG: background worker "logical replication worker" (PID 413) exited with exit code 1

Probably a small part of the WAL was somehow skipped by the logical apply worker in all that mess.
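(If it helps to narrow that down, the subscriber-side origin position could be cross-checked against the publisher-side slot; the slot name matches the one in the startup-callback CONTEXT line earlier. Just a sketch, to be run on the publisher:)

-- rough sketch (publisher side): what the slot thinks the subscriber has
-- confirmed, and how much WAL it still retains
SELECT slot_name, active, restart_lsn, confirmed_flush_lsn
FROM pg_replication_slots
WHERE slot_name = 'cloud_production_main_sub_v4';

The WAL around the missing ~10s interval could then be inspected with pg_waldump, if the segments are still available.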
Best regards,
Michail.