From: Michail Nikolaev
Subject: Re: Data loss on logical replication, 12.12 to 14.5, ALTER SUBSCRIPTION
Date:
Msg-id: CANtu0ohT4+4WC6JKxk2Udr-OkroDsf+8_=XgfS7Fjm6GDcNNUw@mail.gmail.com
In response to: Re: Data loss on logical replication, 12.12 to 14.5, ALTER SUBSCRIPTION (Amit Kapila <amit.kapila16@gmail.com>)
Responses: Re: Data loss on logical replication, 12.12 to 14.5, ALTER SUBSCRIPTION (Amit Kapila <amit.kapila16@gmail.com>)
List: pgsql-hackers
Hello, Amit!

> IUC, this is the time when only table B's initial sync was
> in-progress. Table A's initial sync was finished by that time and for
> Table C, it is yet not started.
Yes, that is correct. The initial sync of C was also started, but it did not
succeed (it was restarted later, see below).
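
(For reference, the per-table sync state on the subscriber can be checked
with something like this; only the subscription name below is specific to
our setup:)

    -- Per-table sync state on the subscriber:
    -- 'i' = init, 'd' = data copy in progress, 's' = synchronized, 'r' = ready (streaming).
    SELECT s.subname,
           sr.srrelid::regclass AS table_name,
           sr.srsubstate,
           sr.srsublsn
    FROM pg_subscription_rel sr
    JOIN pg_subscription s ON s.oid = sr.srsubid
    WHERE s.subname = 'cloud_production_main_sub_v4'
    ORDER BY 2;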

> During the time of
> the initial sync of B, are there any other operations on that table
> apart from the missing ones?
Yes, a lot of them. Tuples were being created all the time without any
pauses; only an interval of about 10 seconds is missing.
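
(A rough sketch of how such a gap can be spotted, assuming the table has a
created_at timestamp column, which is hypothetical here, and using an
illustrative time window; run on both publisher and subscriber and compare
the per-second counts:)

    -- Hypothetical check: bucket rows per second around the suspected gap
    -- and compare publisher vs subscriber. Column name and time window are
    -- illustrative only.
    SELECT date_trunc('second', created_at) AS sec, count(*)
    FROM "B"
    WHERE created_at BETWEEN '2022-12-14 10:15:00+00' AND '2022-12-14 10:25:00+00'
    GROUP BY 1
    ORDER BY 1;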

> You may want to see the LOGs of
> subscribers during the initial sync time for any errors or other
> messages.

It looks like I have found something interesting. Below are the relevant log
entries, sorted by time, with some comments.

Restart of the logical apply worker the day before the issue:

        -2022-12-13 21:04:25 UTC-6398e8d9.4ec3-LOG: logical replication apply worker for subscription "cloud_production_main_sub_v4" has started
        -2022-12-13 21:04:26 UTC-6398e8d9.4ec3-ERROR: could not start WAL streaming: FATAL: out of relcache_callback_list slots CONTEXT: slot "cloud_production_main_sub_v4", output plugin "pgoutput", in the startup callback ERROR: odyssey: c1747b31d0187: remote server read/write error s721c052ace56: (null) SSL SYSCALL error: EOF detected
        -2022-12-13 21:04:26 UTC-639872fa.1-LOG: background worker "logical replication worker" (PID 20163) exited with exit code 1
        -2022-12-13 21:04:31 UTC-6398e8df.4f7c-LOG: logical replication apply worker for subscription "cloud_production_main_sub_v4" has started

Start of the initial sync for tables B and C (they were added to the
subscription, the presumed commands are sketched after this log excerpt;
table A was already in streaming mode):

        -2022-12-14 08:08:34 UTC-63998482.7d10-LOG: logical replication table synchronization worker for subscription "cloud_production_main_sub_v4", table "B" has started
        -2022-12-14 08:08:34 UTC-63998482.7d13-LOG: logical replication table synchronization worker for subscription "cloud_production_main_sub_v4", table "C" has started

B is synchronized without any errors:

        -2022-12-14 10:19:08 UTC-63998482.7d10-LOG: logical replication table synchronization worker for subscription "cloud_production_main_sub_v4", table "B" has finished

About 15 seconds later, the apply worker is restarted because of I/O issues:

        -2022-12-14 10:19:22 UTC-6398e8df.4f7c-ERROR: could not receive data from WAL stream: SSL SYSCALL error: EOF detected
        -2022-12-14 10:19:22 UTC-6399a32a.6af3-LOG: logical replication apply worker for subscription "cloud_production_main_sub_v4" has started
        -2022-12-14 10:19:22 UTC-639872fa.1-LOG: background worker "logical replication worker" (PID 20348) exited with exit code 1

Then a query is canceled (something related to an INSERT INTO public.lsnmover):

        -2022-12-14 10:21:03 UTC-63999c6a.f25d-ERROR: canceling statement due to user request
        -2022-12-14 10:21:39 UTC-639997f9.fd6e-ERROR: canceling statement due to user request

A short time later, synchronous replication seems to be broken; we see tons of:

        -2022-12-14 10:24:18 UTC-63999d2c.2020-WARNING: shutdown requested while waiting for synchronous replication ack.
        -2022-12-14 10:24:18 UTC-63999d2c.2020-WARNING: user requested cancel while waiting for synchronous
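
(I do not have the output from that exact moment, but for anyone looking
into this, the synchronous standby state on the primary can be inspected
with the stock views, e.g.:)

    -- On the primary: which walsenders are sync/async and how far along they are.
    SELECT application_name, state, sync_state,
           sent_lsn, flush_lsn, replay_lsn
    FROM pg_stat_replication;

    SHOW synchronous_standby_names;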

A few minutes later, at 10:36:05, we initiated a database restart:

        -2022-12-14 10:35:20 UTC-63999c6a.f25d-ERROR: canceling statement due to user request
        -2022-12-14 10:37:25 UTC-6399a765.1-LOG: pgms_stats: Finishing PG_init
        -2022-12-14 10:37:26 UTC-6399a765.1-LOG: listening on IPv4 address "0.0.0.0", port 5432
        -2022-12-14 10:37:26 UTC-6399a765.1-LOG: listening on IPv6 address "::", port 5432
        -2022-12-14 10:37:26 UTC-6399a765.1-LOG: starting PostgreSQL 14.4 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit
        -2022-12-14 10:37:26 UTC-6399a765.1-LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
        -2022-12-14 10:37:26 UTC-6399a766.103-LOG: database system was interrupted; last known up at 2022-12-14 10:36:47 UTC
        -2022-12-14 10:37:26 UTC-6399a766.105-FATAL: the database system is starting up
        -2022-12-14 10:37:26 UTC-6399a766.107-FATAL: the database system is starting up
        -2022-12-14 10:37:27 UTC-6399a767.108-FATAL: the database system is starting up
        -2022-12-14 10:37:27 UTC-6399a767.109-FATAL: the database system is starting up
        -2022-12-14 10:37:27 UTC-6399a767.10a-FATAL: the database system is starting up
        -2022-12-14 10:37:27 UTC-6399a767.10b-FATAL: the database system is starting up
        -2022-12-14 10:37:27 UTC-6399a767.113-FATAL: the database system is starting up
        -2022-12-14 10:37:29 UTC-6399a766.103-LOG: recovered replication state of node 4 to 185F6/CBF4BBB8
        -2022-12-14 10:37:29 UTC-6399a766.103-LOG: recovered replication state of node 1 to 18605/FE229E10
        -2022-12-14 10:37:29 UTC-6399a766.103-LOG: recovered replication state of node 6 to 185F8/86AEC880
        -2022-12-14 10:37:29 UTC-6399a766.103-LOG: database system was not properly shut down; automatic recovery in progress
        -2022-12-14 10:37:29 UTC-6399a766.103-LOG: recovered replication state of node 5 to 185F6/CBF4BBB8
        -2022-12-14 10:37:29 UTC-6399a766.103-LOG: redo starts at D76/557F4210
        -2022-12-14 10:37:34 UTC-6399a76e.14a-FATAL: the database system is starting up
        -2022-12-14 10:37:34 UTC-6399a766.103-LOG: redo done at D76/BDDBDAE0 system usage: CPU: user: 3.31 s, system: 1.36 s, elapsed: 4.68 s
        -2022-12-14 10:37:34 UTC-6399a766.103-LOG: checkpoint starting: end-of-recovery immediate
        -2022-12-14 10:37:34 UTC-6399a76e.14b-FATAL: the database system is starting up
        -2022-12-14 10:37:34 UTC-6399a76e.14c-FATAL: the database system is starting up
        -2022-12-14 10:37:45 UTC-6399a779.190-FATAL: the database system is starting up
        -2022-12-14 10:37:45 UTC-6399a766.103-LOG: checkpoint complete: wrote 300326 buffers (3.6%);
        -2022-12-14 10:37:45 UTC-6399a779.191-FATAL: the database system is starting up
        -2022-12-14 10:37:45 UTC-6399a779.199-LOG: pgms_stats: bgworker: pid: 409.
        -2022-12-14 10:37:45 UTC-6399a779.19a-LOG: pg_qs: qs bgworker: pid: 410.
        -2022-12-14 10:37:45 UTC-6399a765.1-LOG: database system is ready to accept connections
        -2022-12-14 10:37:45 UTC-6399a779.199-LOG: pgms_stats: qs bgworker: bgworker started running with host database: cloud_sys
        -2022-12-14 10:37:45 UTC-6399a779.19a-LOG: pg_qs: qs bgworker: bgworker started running with host database: cloud_sys
        -2022-12-14 10:37:45 UTC-6399a779.19d-LOG: logical replication apply worker for subscription "cloud_production_main_sub_v4" has started
        -2022-12-14 10:37:46 UTC-6399a779.19b-LOG: pg_cron scheduler started

The initial sync for C starts from scratch (together with some other tables):

        -2022-12-14 10:37:47 UTC-6399a77b.1f9-LOG: logical replication table synchronization worker for subscription "cloud_production_main_sub_v4", table "E" has started
        -2022-12-14 10:37:47 UTC-6399a77b.1fa-LOG: logical replication table synchronization worker for subscription "cloud_production_main_sub_v4", table "F" has started
        -2022-12-14 10:37:47 UTC-6399a77b.1fc-LOG: logical replication table synchronization worker for subscription "cloud_production_main_sub_v4", table "C" has started

And a few more WAL streaming flaps later:

        -2022-12-14 10:50:20 UTC-6399a779.19d-ERROR: could not send data to WAL stream: SSL SYSCALL error: EOF detected
        -2022-12-14 10:50:20 UTC-6399a765.1-LOG: background worker "logical replication worker" (PID 413) exited with exit code 1

Probably a small part of the WAL was somehow skipped by the logical apply
worker in all that mess.
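
(If it is useful, this is how the positions can be compared after the fact;
the slot name is assumed to be the default one, i.e. equal to the
subscription name:)

    -- On the subscriber: the last remote LSN each replication origin
    -- (one per subscription / table sync) is known to have applied.
    SELECT local_id, external_id, remote_lsn, local_lsn
    FROM pg_replication_origin_status;

    -- On the publisher: what the subscription's slot has confirmed as flushed.
    SELECT slot_name, restart_lsn, confirmed_flush_lsn
    FROM pg_replication_slots
    WHERE slot_name = 'cloud_production_main_sub_v4';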

Best regards,
Michail.


