Thread: Re: FATAL: could not send data to WAL stream: lost synchronization with server: got message type "0", length 892351284
Re: FATAL: could not send data to WAL stream: lost synchronization with server: got message type "0", length 892351284
From
Adrian Klaver
Date:
On 1/25/25 06:58, Дмитрий wrote: > Hello everyone! > > Please help me with the error: FATAL: could not send data to WAL stream: > lost synchronization with server: got message type "0", length 892351284. > Replication worked for more than a month and now there is such a > problem. Restarting PostgreSQL helps. > How can I fix this error? This is going to need more information: 1) What sort of replication? 2) Where are the two servers located relative to each other? 3) Has there been any software upgrades/network changes recently? > > Source: > PostgreSQL 15.6 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 > 20150623 (Red Hat 4.8.5-44), 64-bit > CentOS Linux release 7.8.2003 (Core) > Filesystem type ext4 > > Receiver: > PostgreSQL 15.10 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 11.5.0 > 20240719 (Red Hat 11.5.0-2), 64-bit > Red Hat Enterprise Linux release 9.5 (Plow) > Filesystem type xfs > > > Thank you. > -- Adrian Klaver adrian.klaver@aklaver.com
Re[2]: FATAL: could not send data to WAL stream: lost synchronization with server: got message type "0", length 892351284
From
Дмитрий
Date:
1) What sort of replication?
- Streaming replication
2) Where are the two servers located relative to each other?
- The servers are located in different data centers.
3) Has there been any software upgrades/network changes recently?
- I don't know any information about the software upgrades/network суббота, 25 января 2025г., 19:31 +03:00 от Adrian Klaver adrian.klaver@aklaver.com:
On 1/25/25 06:58, Дмитрий wrote:
> Hello everyone!
>
> Please help me with the error: FATAL: could not send data to WAL stream:
> lost synchronization with server: got message type "0", length 892351284.
> Replication worked for more than a month and now there is such a
> problem. Restarting PostgreSQL helps.
> How can I fix this error?
This is going to need more information:
1) What sort of replication?
2) Where are the two servers located relative to each other?
3) Has there been any software upgrades/network changes recently?
>
> Source:
> PostgreSQL 15.6 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5
> 20150623 (Red Hat 4.8.5-44), 64-bit
> CentOS Linux release 7.8.2003 (Core)
> Filesystem type ext4
>
> Receiver:
> PostgreSQL 15.10 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 11.5.0
> 20240719 (Red Hat 11.5.0-2), 64-bit
> Red Hat Enterprise Linux release 9.5 (Plow)
> Filesystem type xfs
>
>
> Thank you.
>
--
Adrian Klaver
adrian.klaver@aklaver.com
> Hello everyone!
>
> Please help me with the error: FATAL: could not send data to WAL stream:
> lost synchronization with server: got message type "0", length 892351284.
> Replication worked for more than a month and now there is such a
> problem. Restarting PostgreSQL helps.
> How can I fix this error?
This is going to need more information:
1) What sort of replication?
2) Where are the two servers located relative to each other?
3) Has there been any software upgrades/network changes recently?
>
> Source:
> PostgreSQL 15.6 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5
> 20150623 (Red Hat 4.8.5-44), 64-bit
> CentOS Linux release 7.8.2003 (Core)
> Filesystem type ext4
>
> Receiver:
> PostgreSQL 15.10 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 11.5.0
> 20240719 (Red Hat 11.5.0-2), 64-bit
> Red Hat Enterprise Linux release 9.5 (Plow)
> Filesystem type xfs
>
>
> Thank you.
>
--
Adrian Klaver
adrian.klaver@aklaver.com
Re: FATAL: could not send data to WAL stream: lost synchronization with server: got message type "0", length 892351284
From
Adrian Klaver
Date:
On 1/25/25 09:03, Дмитрий wrote: > 1) What sort of replication? > - Streaming replication > > 2) Where are the two servers located relative to each other? > - The servers are located in different data centers. > > 3) Has there been any software upgrades/network changes recently? > - I don't know any information about the software upgrades/network It would be a good thing to ask of those folks that do know. From the log attached to your initial post: 2025-01-25 17:28:01.930 MSK [1196013] LOG: starting PostgreSQL 15.10 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-2), 64-bit 2025-01-25 17:28:01.930 MSK [1196013] LOG: listening on IPv4 address "0.0.0.0", port 5432 2025-01-25 17:28:01.931 MSK [1196013] LOG: listening on Unix socket "/run/postgresql/.s.PGSQL.5432" 2025-01-25 17:28:01.932 MSK [1196013] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432" 2025-01-25 17:28:01.962 MSK [1196017] LOG: database system was shut down in recovery at 2025-01-25 17:28:01 MSK 2025-01-25 17:28:01.962 MSK [1196017] LOG: entering standby mode How was it shut down, on purpose or a hardware/software issue? Also do you have corresponding logs from primary? 2025-01-25 17:28:12.192 MSK [1196017] LOG: consistent recovery state reached at 1063C/D002DC68 2025-01-25 17:28:12.192 MSK [1196017] LOG: incorrect resource manager data checksum in record at 1063C/D002DC68 2025-01-25 17:28:12.192 MSK [1196013] LOG: database system is ready to accept read-only connections 2025-01-25 17:28:12.205 MSK [1196019] LOG: started streaming WAL from primary at 1063C/D0000000 on timeline 61 The recovery ended and the streaming started. Not sure if 'incorrect resource manager data checksum' is significant or not. 2025-01-25 17:29:08.452 MSK [1196015] LOG: recovery restart point at 1063C/DBC7E1D8 2025-01-25 17:29:08.452 MSK [1196015] DETAIL: Last completed transaction was at log time 2025-01-25 16:23:08.828548+03. 2025-01-25 17:29:24.553 MSK [1196015] LOG: restartpoint starting: wal 2025-01-25 17:29:24.553 MSK [1196015] DEBUG: performing replication slot checkpoint 2025-01-25 17:29:27.651 MSK [1196019] FATAL: could not send data to WAL stream: lost synchronization with server: got message type "0", length 892351284 2025-01-25 17:29:27.653 MSK [1196017] LOG: invalid magic number 3600 in log segment 0000003D0001063D000000F4, offset 212992 2025-01-25 17:29:27.653 MSK [1196017] LOG: invalid magic number 3600 in log segment 0000003D0001063D000000F4, offset 212992 2025-01-25 17:29:27.653 MSK [1196017] LOG: invalid magic number 3600 in log segment 0000003D0001063D000000F4, offset 212992 This is where things fall apart. What confuses me is: "could not send data to WAL stream: lost synchronization with server: got message type "0", length 892351284" If this is from the standby why is it sending data to the stream? Unless, is there cascading replication going on? 2025-01-25 17:30:01.887 MSK [1196013] LOG: received fast shutdown request 2025-01-25 17:30:01.888 MSK [1196013] LOG: aborting any active transactions Was that a manual intervention? 2025-01-25 17:30:02.157 MSK [1196015] LOG: shutting down 2025-01-25 17:30:02.181 MSK [1196013] LOG: database system is shut down 2025-01-25 17:30:02.182 MSK [1196014] DEBUG: logger shutting down So the server went from start up to shut down in ~2 minutes. From your original post: 'Restarting PostgreSQL helps.' Is that what is shown above or have you restarted since the above and the server is running? -- Adrian Klaver adrian.klaver@aklaver.com
Re: FATAL: could not send data to WAL stream: lost synchronization with server: got message type "0", length 892351284
From
Adrian Klaver
Date:
On 1/26/25 03:29, Дмитрий wrote: > "How was it shut down, on purpose or a hardware/software issue?" > - I reboot the receiver every 2 minutes on purpose. I determined this > time empirically, because replication breaks down approximately every > minute and a half. The reboot helps to advance the receiver. > > "Also do you have corresponding logs from primary?" > - Attached to this message. > > "Unless, is there cascading replication going on?" > - No, this is replication from the leader. The leader has its two > replicas and they are all in one data center. And the problematic > replica is needed to migrate to another data center. > > "Was that a manual intervention?" > - Yes, reboot on schedule, every two minutes. > > "Is that what is shown above or have you restarted since the above and > the server is running?" > - Sometimes replication works without problems for several hours. But > when a breakdown occurs, rebooting every two minutes helps to catch up > with this replica. 1) It would make life easier if the log line entry prefix timestamp was set to same precision on primary and standby. As of now it looks like the primary has %t (Time stamp without milliseconds) and the standby has %m (Time stamp with milliseconds) 2) From the logs. Primary: 2025-01-26 12:21:27 MSK [656]: [11-1] app=v-host-n1,user=replicator,db=[unknown],client=192.168.5.1 STATEMENT: START_REPLICATION SLOT "slot_migration_to_rcod" 106B6/52000000 TIMELINE 61 2025-01-26 12:21:27 MSK [656]: [12-1] app=v-host-n1,user=replicator,db=[unknown],client=192.168.5.1 LOG: disconnection: session time: 0:01:05.329 user=replicator database= host=192.168.5.1 port=58380 Standby: 2025-01-26 12:21:27.113 MSK [10824] FATAL: could not send data to WAL stream: lost synchronization with server: got message type "0", length 825373235 Do you know what is doing START_REPLICATION SLOT? > > Another interesting point. In addition to this replication, there are > two more, to the same data center. One replication had the same problem, > but a one-time restart helped to solve the problem, the replication is > still working normally. And the second replication does not have such > problems, it has been working since its launch, more than a month ago. > > -- > -- Adrian Klaver adrian.klaver@aklaver.com
Re[2]: FATAL: could not send data to WAL stream: lost synchronization with server: got message type "0", length 892351284
From
Дмитрий
Date:
Colleagues confirmed that the problem is with the network between data centers. Thank you!
воскресенье, 26 января 2025г., 20:33 +03:00 от Adrian Klaver adrian.klaver@aklaver.com:
On 1/26/25 03:29, Дмитрий wrote:
> "How was it shut down, on purpose or a hardware/software issue?"
> - I reboot the receiver every 2 minutes on purpose. I determined this
> time empirically, because replication breaks down approximately every
> minute and a half. The reboot helps to advance the receiver.
>
> "Also do you have corresponding logs from primary?"
> - Attached to this message.
>
> "Unless, is there cascading replication going on?"
> - No, this is replication from the leader. The leader has its two
> replicas and they are all in one data center. And the problematic
> replica is needed to migrate to another data center.
>
> "Was that a manual intervention?"
> - Yes, reboot on schedule, every two minutes.
>
> "Is that what is shown above or have you restarted since the above and
> the server is running?"
> - Sometimes replication works without problems for several hours. But
> when a breakdown occurs, rebooting every two minutes helps to catch up
> with this replica.
1) It would make life easier if the log line entry prefix timestamp was
set to same precision on primary and standby. As of now it looks like
the primary has %t (Time stamp without milliseconds) and the standby has
%m (Time stamp with milliseconds)
2) From the logs.
Primary:
2025-01-26 12:21:27 MSK [656]: [11-1]
app=v-host-n1,user=replicator,db=[unknown],client=192.168.5.1 STATEMENT:
START_REPLICATION SLOT "slot_migration_to_rcod" 106B6/52000000 TIMELINE 61
2025-01-26 12:21:27 MSK [656]: [12-1]
app=v-host-n1,user=replicator,db=[unknown],client=192.168.5.1 LOG:
disconnection: session time: 0:01:05.329 user=replicator database=
host=192.168.5.1 port=58380
Standby:
2025-01-26 12:21:27.113 MSK [10824] FATAL: could not send data to WAL
stream: lost synchronization with server: got message type "0", length
825373235
Do you know what is doing START_REPLICATION SLOT?
>
> Another interesting point. In addition to this replication, there are
> two more, to the same data center. One replication had the same problem,
> but a one-time restart helped to solve the problem, the replication is
> still working normally. And the second replication does not have such
> problems, it has been working since its launch, more than a month ago.
>
> --
>
--
Adrian Klaver
adrian.klaver@aklaver.com
> "How was it shut down, on purpose or a hardware/software issue?"
> - I reboot the receiver every 2 minutes on purpose. I determined this
> time empirically, because replication breaks down approximately every
> minute and a half. The reboot helps to advance the receiver.
>
> "Also do you have corresponding logs from primary?"
> - Attached to this message.
>
> "Unless, is there cascading replication going on?"
> - No, this is replication from the leader. The leader has its two
> replicas and they are all in one data center. And the problematic
> replica is needed to migrate to another data center.
>
> "Was that a manual intervention?"
> - Yes, reboot on schedule, every two minutes.
>
> "Is that what is shown above or have you restarted since the above and
> the server is running?"
> - Sometimes replication works without problems for several hours. But
> when a breakdown occurs, rebooting every two minutes helps to catch up
> with this replica.
1) It would make life easier if the log line entry prefix timestamp was
set to same precision on primary and standby. As of now it looks like
the primary has %t (Time stamp without milliseconds) and the standby has
%m (Time stamp with milliseconds)
2) From the logs.
Primary:
2025-01-26 12:21:27 MSK [656]: [11-1]
app=v-host-n1,user=replicator,db=[unknown],client=192.168.5.1 STATEMENT:
START_REPLICATION SLOT "slot_migration_to_rcod" 106B6/52000000 TIMELINE 61
2025-01-26 12:21:27 MSK [656]: [12-1]
app=v-host-n1,user=replicator,db=[unknown],client=192.168.5.1 LOG:
disconnection: session time: 0:01:05.329 user=replicator database=
host=192.168.5.1 port=58380
Standby:
2025-01-26 12:21:27.113 MSK [10824] FATAL: could not send data to WAL
stream: lost synchronization with server: got message type "0", length
825373235
Do you know what is doing START_REPLICATION SLOT?
>
> Another interesting point. In addition to this replication, there are
> two more, to the same data center. One replication had the same problem,
> but a one-time restart helped to solve the problem, the replication is
> still working normally. And the second replication does not have such
> problems, it has been working since its launch, more than a month ago.
>
> --
>
--
Adrian Klaver
adrian.klaver@aklaver.com