Home > mailing lists

Thread: walreceiver fails on asynchronous replica [SEC=UNOFFICIAL]

walreceiver fails on asynchronous replica [SEC=UNOFFICIAL]

From

Mark Schloss

Date:

23 February, 07:04:03

UNOFFICIAL

Hello,

We have the following setup in our DEV environment -

- Primary/Replica using asynchronous streaming replication

- Servers run Ubuntu

Linux 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 15:18:56 UTC 2023 x86_64 x86_64 GNU/Linux

- Postgres version

postgres (PostgreSQL) 13.13 (Ubuntu 13.13-1.pgdg22.04+1)

The replica has been running for many weeks and then the walreceiver failed with the following -

<2024-02-22 14:12:34.731 AEDT [565228]: [60136-1] user=,db= > DETAIL: Last completed transaction was at log time 2024-02-22 14:12:00.327411+11.

<2024-02-22 14:17:34.822 AEDT [565228]: [60137-1] user=,db= > LOG: restartpoint starting: time

<2024-02-22 14:17:34.935 AEDT [565228]: [60138-1] user=,db= > LOG: restartpoint complete: wrote 1 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.106 s, sync=0.002 s, total=0.114 s; sync files=1, longest=0.002 s, average=0.002 s; distance=4 kB,

estimate=14 kB

<2024-02-22 14:17:34.935 AEDT [565228]: [60139-1] user=,db= > LOG: recovery restart point at 6/B0ACB188

<2024-02-22 14:17:34.935 AEDT [565228]: [60140-1] user=,db= > DETAIL: Last completed transaction was at log time 2024-02-22 14:15:00.457984+11.

<2024-02-22 14:20:23.383 AEDT [565227]: [7-1] user=,db= > LOG: invalid record length at 6/B0ACC360: wanted 24, got 0

<2024-02-22 14:20:23.383 AEDT [565231]: [6-1] user=,db= > FATAL: terminating walreceiver process due to administrator command

<2024-02-22 14:20:23.385 AEDT [565227]: [8-1] user=,db= > LOG: invalid record length at 6/B0ACC360: wanted 24, got 0

<2024-02-22 14:20:23.385 AEDT [565227]: [9-1] user=,db= > LOG: invalid record length at 6/B0ACC360: wanted 24, got 0

<2024-02-22 14:20:23.385 AEDT [565227]: [10-1] user=,db= > LOG: invalid record length at 6/B0ACC360: wanted 24, got 0

<2024-02-22 14:20:28.390 AEDT [565227]: [11-1] user=,db= > LOG: invalid record length at 6/B0ACC360: wanted 24, got 0

<2024-02-22 14:20:33.394 AEDT [565227]: [12-1] user=,db= > LOG: invalid record length at 6/B0ACC360: wanted 24, got 0

<2024-02-22 14:20:38.394 AEDT [565227]: [13-1] user=,db= > LOG: invalid record length at 6/B0ACC360: wanted 24, got 0

<2024-02-22 14:20:43.398 AEDT [565227]: [14-1] user=,db= > LOG: invalid record length at 6/B0ACC360: wanted 24, got 0

<2024-02-22 14:20:48.402 AEDT [565227]: [15-1] user=,db= > LOG: invalid record length at 6/B0ACC360: wanted 24, got 0

<2024-02-22 14:20:53.406 AEDT [565227]: [16-1] user=,db= > LOG: invalid record length at 6/B0ACC360: wanted 24, got 0

The replica was restarted and the walreceiver immediately failed again with -

2024-02-23 07:50:05.591 AEDT [1957117]: [2-1] user=,db= > LOG: entering standby mode

<2024-02-23 07:50:05.607 AEDT [1957117]: [3-1] user=,db= > LOG: redo starts at 6/B0ACB188

<2024-02-23 07:50:05.607 AEDT [1957117]: [4-1] user=,db= > LOG: consistent recovery state reached at 6/B0ACC360

<2024-02-23 07:50:05.607 AEDT [1957117]: [5-1] user=,db= > LOG: invalid record length at 6/B0ACC360: wanted 24, got 0

<2024-02-23 07:50:05.609 AEDT [1957114]: [5-1] user=,db= > LOG: database system is ready to accept read only connections

<2024-02-23 07:50:05.637 AEDT [1957121]: [1-1] user=,db= > LOG: started streaming WAL from primary at 6/B0000000 on timeline 5

<2024-02-23 07:50:05.696 AEDT [1957117]: [6-1] user=,db= > LOG: invalid magic number 0000 in log segment 0000000500000006000000B0, offset 0

<2024-02-23 07:50:05.697 AEDT [1957121]: [2-1] user=,db= > FATAL: terminating walreceiver process due to administrator command

<2024-02-23 07:50:05.798 AEDT [1957117]: [7-1] user=,db= > LOG: invalid magic number 0000 in log segment 0000000500000006000000B0, offset 0

<2024-02-23 07:50:05.798 AEDT [1957117]: [8-1] user=,db= > LOG: invalid magic number 0000 in log segment 0000000500000006000000B0, offset 0

<2024-02-23 07:50:06.215 AEDT [1957125]: [1-1] user=[unknown],db=[unknown] > LOG: connection received: host=[local]

Running a pg_waldump against the WAL file on the primary server produces -

deveng >pg_waldump 0000000500000006000000B0

pg_waldump: fatal: WAL segment size must be a power of two between 1 MB and 1 GB, but the WAL file "0000000500000006000000B0" header specifies 0 bytes

The same error is returned when using the WAL in pg_wal or the WAL that has been archived locally on the primary server.

We also stream WAL to a barman backup server. Running pg_waldump against the WAL file on the barman server errors with -

pg_waldump: fatal: error in WAL record at 6/B0ACC328: invalid record length at 6/B0ACC360: wanted 24, got 0

The primary cluster Postgres log does not show any unusual messages at the time of the original walreceiver failure -

<2024-02-22 14:20:00.279 AEDT [2408502]: [2-1] user=crmportaluser,db=reportingentity > LOG: connection authorized: user=crmportaluser database=reportingentity SSL enabled (protocol=TLSv1.3, c

ipher=TLS_AES_256_GCM_SHA384, bits=256, compression=off)

<2024-02-22 14:20:23.330 AEDT [2408557]: [1-1] user=,db= > LOG: automatic analyze of table "reportingentity.crmportal.setting" system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s

<2024-02-22 14:20:23.396 AEDT [10152]: [3-1] user=repmgr,db=[unknown] > LOG: disconnection: session time: 2140:47:39.683 user=repmgr database= host=192.168.136.200 port=51680

<2024-02-22 14:20:29.479 AEDT [2306390]: [251-1] user=origamiuser,db=reportingentity > LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp2306390.124", size 1370040

Any help with this error is appreciated.

UNOFFICIAL

**********************************************************************

Please note that your email address is known to AUSTRAC for the

purposes of communicating with you. The information transmitted in

this e-mail is for the use of the intended recipient only and may

contain confidential and/or legally privileged material. If you have

received this information in error you must not disseminate, copy or

take any action on it and we request that you delete all copies of

this transmission together with attachments and notify the sender.

This footnote also confirms that this email message has been swept for

the presence of computer viruses.

**********************************************************************

Re: walreceiver fails on asynchronous replica [SEC=UNOFFICIAL]

From

Kyotaro Horiguchi

Date:

26 February, 11:26:51

At Fri, 23 Feb 2024 04:04:03 +0000, Mark Schloss <Mark.Schloss@austrac.gov.au> wrote in 
> <2024-02-23 07:50:05.637 AEDT [1957121]: [1-1] user=,db= > LOG:  started streaming WAL from primary at 6/B0000000 on
timeline5
 
> <2024-02-23 07:50:05.696 AEDT [1957117]: [6-1] user=,db= > LOG:  invalid magic number 0000 in log segment
0000000500000006000000B0,offset 0
 

This appears to suggest that the WAL file that the standby fetched was
zero-filled on the primary side, which cannot happen by a normal
operation. A preallocated WAL segment can be zero-filled but it cannot
be replicated under normal operations.

> <2024-02-22 14:20:23.383 AEDT [565231]: [6-1] user=,db= > FATAL:  terminating walreceiver process due to
administratorcommand
 

This may suggest a config reload with some parameter changes.

One possible scenario matching the log lines could be that someone
switched primary_conninfo to a newly-restored primary. However, if the
new primary had older data than the previously connected primary,
possibly leading to the situation where the segment 0..5..6..B0 on it
was a preallocated one that was filled with zeros, the standby could
end up fetching the zero-filled WAL segment (file) and might fail this
way. If this is the case, such operations should be avoided.

Unfortunately, due to the lack of a reproducer or detailed operations
that took place there, the best I can do now is to guess a possible
scenario as described above. I'm not sure how come the situation
actually arose.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center