Thread: Re: FATAL: could not send data to WAL stream: lost synchronization with server: got message type "0", length 892351284

On 1/25/25 06:58, Дмитрий wrote:
> Hello everyone!
> 
> Please help me with the error: FATAL: could not send data to WAL stream: 
> lost synchronization with server: got message type "0", length 892351284.
> Replication worked for more than a month and now there is such a 
> problem. Restarting PostgreSQL helps.
> How can I fix this error?

This is going to need more information:

1) What sort of replication?

2) Where are the two servers located relative to each other?

3) Has there been any software upgrades/network changes recently?

> 
> Source:
> PostgreSQL 15.6 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 
> 20150623 (Red Hat 4.8.5-44), 64-bit
> CentOS Linux release 7.8.2003 (Core)
> Filesystem type ext4
> 
> Receiver:
> PostgreSQL 15.10 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 11.5.0 
> 20240719 (Red Hat 11.5.0-2), 64-bit
> Red Hat Enterprise Linux release 9.5 (Plow)
> Filesystem type xfs
> 
> 
> Thank you.
> 

-- 
Adrian Klaver
adrian.klaver@aklaver.com




1) What sort of replication?
- Streaming replication

2) Where are the two servers located relative to each other?
- The servers are located in different data centers.

3) Has there been any software upgrades/network changes recently?
- I don't know any information about the  software upgrades/network суббота, 25 января 2025г., 19:31 +03:00 от Adrian Klaver adrian.klaver@aklaver.com:

On 1/25/25 06:58, Дмитрий wrote:
> Hello everyone!
>
> Please help me with the error: FATAL: could not send data to WAL stream:
> lost synchronization with server: got message type "0", length 892351284.
> Replication worked for more than a month and now there is such a
> problem. Restarting PostgreSQL helps.
> How can I fix this error?
This is going to need more information:

1) What sort of replication?

2) Where are the two servers located relative to each other?

3) Has there been any software upgrades/network changes recently?

>
> Source:
> PostgreSQL 15.6 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5
> 20150623 (Red Hat 4.8.5-44), 64-bit
> CentOS Linux release 7.8.2003 (Core)
> Filesystem type ext4
>
> Receiver:
> PostgreSQL 15.10 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 11.5.0
> 20240719 (Red Hat 11.5.0-2), 64-bit
> Red Hat Enterprise Linux release 9.5 (Plow)
> Filesystem type xfs
>
>
> Thank you.
>

--
Adrian Klaver
adrian.klaver@aklaver.com
On 1/25/25 09:03, Дмитрий wrote:
> 1) What sort of replication?
> - Streaming replication
> 
> 2) Where are the two servers located relative to each other?
> - The servers are located in different data centers.
> 
> 3) Has there been any software upgrades/network changes recently?
> - I don't know any information about the  software upgrades/network

It would be a good thing to ask of those folks that do know.

 From the log attached to your initial post:

2025-01-25 17:28:01.930 MSK [1196013] LOG:  starting PostgreSQL 15.10 on 
x86_64-pc-linux-gnu, compiled by gcc (GCC) 11.5.0 20240719 (Red Hat 
11.5.0-2), 64-bit
2025-01-25 17:28:01.930 MSK [1196013] LOG:  listening on IPv4 address 
"0.0.0.0", port 5432
2025-01-25 17:28:01.931 MSK [1196013] LOG:  listening on Unix socket 
"/run/postgresql/.s.PGSQL.5432"
2025-01-25 17:28:01.932 MSK [1196013] LOG:  listening on Unix socket 
"/tmp/.s.PGSQL.5432"
2025-01-25 17:28:01.962 MSK [1196017] LOG:  database system was shut 
down in recovery at 2025-01-25 17:28:01 MSK
2025-01-25 17:28:01.962 MSK [1196017] LOG:  entering standby mode

How was it shut down, on purpose or a hardware/software issue?

Also do you have corresponding logs from primary?

2025-01-25 17:28:12.192 MSK [1196017] LOG:  consistent recovery state 
reached at 1063C/D002DC68
2025-01-25 17:28:12.192 MSK [1196017] LOG:  incorrect resource manager 
data checksum in record at 1063C/D002DC68
2025-01-25 17:28:12.192 MSK [1196013] LOG:  database system is ready to 
accept read-only connections
2025-01-25 17:28:12.205 MSK [1196019] LOG:  started streaming WAL from 
primary at 1063C/D0000000 on timeline 61

The recovery ended and the streaming started.

Not sure if 'incorrect resource manager data checksum' is significant or 
not.

2025-01-25 17:29:08.452 MSK [1196015] LOG:  recovery restart point at 
1063C/DBC7E1D8
2025-01-25 17:29:08.452 MSK [1196015] DETAIL:  Last completed 
transaction was at log time 2025-01-25 16:23:08.828548+03.
2025-01-25 17:29:24.553 MSK [1196015] LOG:  restartpoint starting: wal
2025-01-25 17:29:24.553 MSK [1196015] DEBUG:  performing replication 
slot checkpoint
2025-01-25 17:29:27.651 MSK [1196019] FATAL:  could not send data to WAL 
stream: lost synchronization with server: got message type "0", length 
892351284
2025-01-25 17:29:27.653 MSK [1196017] LOG:  invalid magic number 3600 in 
log segment 0000003D0001063D000000F4, offset 212992
2025-01-25 17:29:27.653 MSK [1196017] LOG:  invalid magic number 3600 in 
log segment 0000003D0001063D000000F4, offset 212992
2025-01-25 17:29:27.653 MSK [1196017] LOG:  invalid magic number 3600 in 
log segment 0000003D0001063D000000F4, offset 212992

This is where things fall apart. What confuses me is:

"could not send data to WAL stream: lost synchronization with server: 
got message type "0", length 892351284"

If this is from the standby why is it sending data to the stream?

Unless, is there cascading replication going on?


2025-01-25 17:30:01.887 MSK [1196013] LOG:  received fast shutdown request
2025-01-25 17:30:01.888 MSK [1196013] LOG:  aborting any active transactions

Was that a manual intervention?

2025-01-25 17:30:02.157 MSK [1196015] LOG:  shutting down
2025-01-25 17:30:02.181 MSK [1196013] LOG:  database system is shut down
2025-01-25 17:30:02.182 MSK [1196014] DEBUG:  logger shutting down


So the server went from start up to shut down in ~2 minutes.

 From your original post:

'Restarting PostgreSQL helps.'

Is that what is shown above or have you restarted since the above and 
the server is running?






-- 
Adrian Klaver
adrian.klaver@aklaver.com




On 1/26/25 03:29, Дмитрий wrote:
> "How was it shut down, on purpose or a hardware/software issue?"
> - I reboot the receiver every 2 minutes on purpose. I determined this 
> time empirically, because replication breaks down approximately every 
> minute and a half. The reboot helps to advance the receiver.
> 
> "Also do you have corresponding logs from primary?"
> - Attached to this message.
> 
> "Unless, is there cascading replication going on?"
> - No, this is replication from the leader. The leader has its two 
> replicas and they are all in one data center. And the problematic 
> replica is needed to migrate to another data center.
> 
> "Was that a manual intervention?"
> - Yes, reboot on schedule, every two minutes.
> 
> "Is that what is shown above or have you restarted since the above and
> the server is running?"
> - Sometimes replication works without problems for several hours. But 
> when a breakdown occurs, rebooting every two minutes helps to catch up 
> with this replica.

1) It would make life easier if the log line entry prefix timestamp was 
set to same precision on primary and standby. As of now it looks like 
the primary has %t (Time stamp without milliseconds) and the standby has
%m (Time stamp with milliseconds)

2) From the logs.

Primary:

2025-01-26 12:21:27 MSK [656]: [11-1] 
app=v-host-n1,user=replicator,db=[unknown],client=192.168.5.1 STATEMENT: 
  START_REPLICATION SLOT "slot_migration_to_rcod" 106B6/52000000 TIMELINE 61

2025-01-26 12:21:27 MSK [656]: [12-1] 
app=v-host-n1,user=replicator,db=[unknown],client=192.168.5.1 LOG: 
disconnection: session time: 0:01:05.329 user=replicator database= 
host=192.168.5.1 port=58380


Standby:

2025-01-26 12:21:27.113 MSK [10824] FATAL:  could not send data to WAL 
stream: lost synchronization with server: got message type "0", length 
825373235


Do you know what is doing START_REPLICATION SLOT?

> 
> Another interesting point. In addition to this replication, there are 
> two more, to the same data center. One replication had the same problem, 
> but a one-time restart helped to solve the problem, the replication is 
> still working normally. And the second replication does not have such 
> problems, it has been working since its launch, more than a month ago.
> 
> --
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com




Colleagues confirmed that the problem is with the network between data centers. Thank you!
воскресенье, 26 января 2025г., 20:33 +03:00 от Adrian Klaver adrian.klaver@aklaver.com:

On 1/26/25 03:29, Дмитрий wrote:
> "How was it shut down, on purpose or a hardware/software issue?"
> - I reboot the receiver every 2 minutes on purpose. I determined this
> time empirically, because replication breaks down approximately every
> minute and a half. The reboot helps to advance the receiver.
>
> "Also do you have corresponding logs from primary?"
> - Attached to this message.
>
> "Unless, is there cascading replication going on?"
> - No, this is replication from the leader. The leader has its two
> replicas and they are all in one data center. And the problematic
> replica is needed to migrate to another data center.
>
> "Was that a manual intervention?"
> - Yes, reboot on schedule, every two minutes.
>
> "Is that what is shown above or have you restarted since the above and
> the server is running?"
> - Sometimes replication works without problems for several hours. But
> when a breakdown occurs, rebooting every two minutes helps to catch up
> with this replica.
1) It would make life easier if the log line entry prefix timestamp was
set to same precision on primary and standby. As of now it looks like
the primary has %t (Time stamp without milliseconds) and the standby has
%m (Time stamp with milliseconds)

2) From the logs.

Primary:

2025-01-26 12:21:27 MSK [656]: [11-1]
app=v-host-n1,user=replicator,db=[unknown],client=192.168.5.1 STATEMENT:
  START_REPLICATION SLOT "slot_migration_to_rcod" 106B6/52000000 TIMELINE 61

2025-01-26 12:21:27 MSK [656]: [12-1]
app=v-host-n1,user=replicator,db=[unknown],client=192.168.5.1 LOG:
disconnection: session time: 0:01:05.329 user=replicator database=
host=192.168.5.1 port=58380


Standby:

2025-01-26 12:21:27.113 MSK [10824] FATAL: could not send data to WAL
stream: lost synchronization with server: got message type "0", length
825373235


Do you know what is doing START_REPLICATION SLOT?

>
> Another interesting point. In addition to this replication, there are
> two more, to the same data center. One replication had the same problem,
> but a one-time restart helped to solve the problem, the replication is
> still working normally. And the second replication does not have such
> problems, it has been working since its launch, more than a month ago.
>
> --
>


--
Adrian Klaver
adrian.klaver@aklaver.com