Home > mailing lists
RE: walreceiver fails on asynchronous replica [EXTERNAL] [SEC=UNOFFICIAL] - Mailing list pgsql-general

From	Mark Schloss
Subject	RE: walreceiver fails on asynchronous replica [EXTERNAL] [SEC=UNOFFICIAL]
Date	February 28, 2024 03:23:37
Msg-id	7f2cb3a69fe74f9f889aceafcf3906e5@austrac.gov.au Whole thread Raw
Responses	Re: walreceiver fails on asynchronous replica [EXTERNAL] [SEC=UNOFFICIAL]
List	pgsql-general
Tree view
UNOFFICIAL
Thank you for your reply. I can confirm there were no changes made to the config of the replica.

We have had a further two occurrences of a walreceiver failing - each occurrence is in a different cluster, however,
bothclusters are on the same set of hardware. In each of the new cases we do not get the error where the primary wal
haszeroes. We are able to dump the WAL from each of the primary clusters and this shows the command the walreceiver
failedon was a commit. Is there any significance in the parameters in the commit record - 'inval msgs: catcache 21;
sync'.Details of each are shown below.   

--Cluster 1

The walreceiver in the replica failed as follows -

    <2024-02-26 07:00:18.475 AEDT [3067297]: [1-1] user=[unknown],db=[unknown] > LOG:  connection received:
host=[local]
    <2024-02-26 07:00:18.476 AEDT [3067297]: [2-1] user=postgres,db=postgres > LOG:  connection authorized:
user=postgresdatabase=postgres application_name=psql 
    <2024-02-26 07:00:18.480 AEDT [3067297]: [3-1] user=postgres,db=postgres > LOG:  disconnection: session time:
0:00:00.006user=postgres database=postgres host=[local] 
    <2024-02-26 11:31:27.174 AEDT [1971080]: [6-1] user=,db= > LOG:  invalid record length at 1/AC000D78: wanted 24,
got0 
    <2024-02-26 11:31:27.174 AEDT [1971084]: [2-1] user=,db= > FATAL:  terminating walreceiver process due to
administratorcommand 
    <2024-02-26 11:31:27.274 AEDT [1971080]: [7-1] user=,db= > LOG:  invalid record length at 1/AC000D78: wanted 24,
got0 
    <2024-02-26 11:31:27.274 AEDT [1971080]: [8-1] user=,db= > LOG:  invalid record length at 1/AC000D78: wanted 24,
got0 
    <2024-02-26 11:31:27.275 AEDT [1971080]: [9-1] user=,db= > LOG:  invalid record length at 1/AC000D78: wanted 24,
got0 
    <2024-02-26 11:31:32.279 AEDT [1971080]: [10-1] user=,db= > LOG:  invalid record length at 1/AC000D78: wanted 24,
got0 

A pg_waldump of the archived wal from the primary shows 1/AC000D78 is a commit -

    pg_waldump 0000000900000001000000AC
    <snip>
    <snip>
    rmgr: Standby     len (rec/tot):     54/    54, tx:          0, lsn: 1/AC000C60, prev 1/AC000BE8, desc:
RUNNING_XACTSnextXid 8395441 latestCompletedXid 8395439 oldestRunningXid 8395440; 1 xacts: 8395440 
    rmgr: Database    len (rec/tot):     42/    42, tx:    8395440, lsn: 1/AC000C98, prev 1/AC000C60, desc: CREATE copy
dir1663/1 to 1663/17149 
    rmgr: Standby     len (rec/tot):     54/    54, tx:          0, lsn: 1/AC000CC8, prev 1/AC000C98, desc:
RUNNING_XACTSnextXid 8395441 latestCompletedXid 8395439 oldestRunningXid 8395440; 1 xacts: 8395440 
    rmgr: XLOG        len (rec/tot):    114/   114, tx:          0, lsn: 1/AC000D00, prev 1/AC000CC8, desc:
CHECKPOINT_ONLINEredo 1/AC000CC8; tli 9; prev tli 9; fpw true; xid 0:8395441; oid 24713; multi 1; offset 0; oldest xid
478in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 8395440; online 
    rmgr: Transaction len (rec/tot):     66/    66, tx:    8395440, lsn: 1/AC000D78, prev 1/AC000D00, desc: COMMIT
2024-02-2611:31:25.927728 AEDT; inval msgs: catcache 21; sync 
    rmgr: XLOG        len (rec/tot):     49/  2037, tx:          0, lsn: 1/AC000DC0, prev 1/AC000D78, desc:
FPI_FOR_HINT, blkref #0: rel 1664/0/1262 blk 0 FPW 
    rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 1/AC0015B8, prev 1/AC000DC0, desc:
RUNNING_XACTSnextXid 8395441 latestCompletedXid 8395440 oldestRunningXid 8395441 
    rmgr: Storage     len (rec/tot):     42/    42, tx:          0, lsn: 1/AC0015F0, prev 1/AC0015B8, desc: CREATE
base/17149/17150
    rmgr: Heap        len (rec/tot):     54/  1518, tx:    8395441, lsn: 1/AC001620, prev 1/AC0015F0, desc: INSERT off
8flags 0x01, blkref #0: rel 1663/17149/1247 blk 9 FPW 
    rmgr: Btree       len (rec/tot):     53/  1013, tx:    8395441, lsn: 1/AC001C10, prev 1/AC001620, desc: INSERT_LEAF
off46, blkref #0: rel 1663/17149/2703 blk 2 FPW 
    rmgr: Btree       len (rec/tot):     53/  6113, tx:    8395441, lsn: 1/AC002020, prev 1/AC001C10, desc: INSERT_LEAF
off161, blkref #0: rel 1663/17149/2704 blk 4 FPW 
    <snip>
    <snip>

A pg_waldump of the wal that was streamed to our barman server shows -

    pg_waldump 0000000900000001000000AC
    <snip>
    <snip>
    rmgr: Standby     len (rec/tot):     54/    54, tx:          0, lsn: 1/AC000C60, prev 1/AC000BE8, desc:
RUNNING_XACTSnextXid 8395441 latestCompletedXid 8395439 oldestRunningXid 8395440; 1 xacts: 8395440 
    rmgr: Database    len (rec/tot):     42/    42, tx:    8395440, lsn: 1/AC000C98, prev 1/AC000C60, desc: CREATE copy
dir1663/1 to 1663/17149 
    rmgr: Standby     len (rec/tot):     54/    54, tx:          0, lsn: 1/AC000CC8, prev 1/AC000C98, desc:
RUNNING_XACTSnextXid 8395441 latestCompletedXid 8395439 oldestRunningXid 8395440; 1 xacts: 8395440 
    rmgr: XLOG        len (rec/tot):    114/   114, tx:          0, lsn: 1/AC000D00, prev 1/AC000CC8, desc:
CHECKPOINT_ONLINEredo 1/AC000CC8; tli 9; prev tli 9; fpw true; xid 0:8395441; oid 24713; multi 1; offset 0; oldest xid
478in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 8395440; online 
    pg_waldump: fatal: error in WAL record at 1/AC000D00: invalid record length at 1/AC000D78: wanted 24, got 0

--Cluster 2

The walreceiver in the replica failed as follows -

    <2024-02-26 07:01:06.727 AEDT [3068652]: [1-1] user=[unknown],db=[unknown] > LOG:  connection received:
host=[local]
    <2024-02-26 07:01:06.730 AEDT [3068652]: [2-1] user=postgres,db=postgres > LOG:  connection authorized:
user=postgresdatabase=postgres application_name=psql 
    <2024-02-26 07:01:06.734 AEDT [3068652]: [3-1] user=postgres,db=postgres > LOG:  disconnection: session time:
0:00:00.008user=postgres database=postgres host=[local] 
    <2024-02-26 07:01:06.861 AEDT [3068725]: [1-1] user=[unknown],db=[unknown] > LOG:  connection received:
host=[local]
    <2024-02-26 07:01:06.863 AEDT [3068725]: [2-1] user=postgres,db=postgres > LOG:  connection authorized:
user=postgresdatabase=postgres application_name=psql 
    <2024-02-26 07:01:06.867 AEDT [3068725]: [3-1] user=postgres,db=postgres > LOG:  disconnection: session time:
0:00:00.006user=postgres database=postgres host=[local] 
    <2024-02-26 11:54:08.280 AEDT [3701794]: [5-1] user=,db= > LOG:  invalid record length at 0/8000FF8: wanted 24, got
0
    <2024-02-26 11:54:08.280 AEDT [3701798]: [2-1] user=,db= > FATAL:  terminating walreceiver process due to
administratorcommand 
    <2024-02-26 11:54:08.381 AEDT [3701794]: [6-1] user=,db= > LOG:  invalid record length at 0/8000FF8: wanted 24, got
0
    <2024-02-26 11:54:08.381 AEDT [3701794]: [7-1] user=,db= > LOG:  invalid record length at 0/8000FF8: wanted 24, got
0
    <2024-02-26 11:54:08.381 AEDT [3701794]: [8-1] user=,db= > LOG:  invalid record length at 0/8000FF8: wanted 24, got
0
    <2024-02-26 11:54:13.385 AEDT [3701794]: [9-1] user=,db= > LOG:  invalid record length at 0/8000FF8: wanted 24, got
0

A pg_waldump of the archived wal from the primary shows 0/08000FF8 is a commit -

    pg_waldump 000000010000000000000008
    <snip>
    <snip>
    rmgr: Standby     len (rec/tot):     54/    54, tx:          0, lsn: 0/08000E30, prev 0/08000D68, desc:
RUNNING_XACTSnextXid 499 latestCompletedXid 497 oldestRunningXid 498; 1 xacts: 498 
    rmgr: XLOG        len (rec/tot):    114/   114, tx:          0, lsn: 0/08000E68, prev 0/08000E30, desc:
CHECKPOINT_ONLINEredo 0/8000E30; tli 1; prev tli 1; fpw true; xid 0:499; oid 24576; multi 1; offset 0; oldest xid 478
inDB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest ru 
nning xid 498; online
    rmgr: Standby     len (rec/tot):     54/    54, tx:          0, lsn: 0/08000EE0, prev 0/08000E68, desc:
RUNNING_XACTSnextXid 499 latestCompletedXid 497 oldestRunningXid 498; 1 xacts: 498 
    rmgr: Database    len (rec/tot):     42/    42, tx:        498, lsn: 0/08000F18, prev 0/08000EE0, desc: CREATE copy
dir1663/1 to 1663/16451 
    rmgr: Standby     len (rec/tot):     54/    54, tx:          0, lsn: 0/08000F48, prev 0/08000F18, desc:
RUNNING_XACTSnextXid 499 latestCompletedXid 497 oldestRunningXid 498; 1 xacts: 498 
    rmgr: XLOG        len (rec/tot):    114/   114, tx:          0, lsn: 0/08000F80, prev 0/08000F48, desc:
CHECKPOINT_ONLINEredo 0/8000F48; tli 1; prev tli 1; fpw true; xid 0:499; oid 24576; multi 1; offset 0; oldest xid 478
inDB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest ru 
nning xid 498; online
    rmgr: Transaction len (rec/tot):     66/    66, tx:        498, lsn: 0/08000FF8, prev 0/08000F80, desc: COMMIT
2024-02-2611:54:07.145087 AEDT; inval msgs: catcache 21; sync 
    rmgr: XLOG        len (rec/tot):     49/  1501, tx:          0, lsn: 0/08001040, prev 0/08000FF8, desc:
FPI_FOR_HINT, blkref #0: rel 1664/0/1262 blk 0 FPW 
    rmgr: Storage     len (rec/tot):     42/    42, tx:          0, lsn: 0/08001620, prev 0/08001040, desc: CREATE
base/16451/16452
    rmgr: Heap        len (rec/tot):     54/  1518, tx:        499, lsn: 0/08001650, prev 0/08001620, desc: INSERT off
8flags 0x01, blkref #0: rel 1663/16451/1247 blk 9 FPW 
    rmgr: Btree       len (rec/tot):     53/  1013, tx:        499, lsn: 0/08001C40, prev 0/08001650, desc: INSERT_LEAF
off46, blkref #0: rel 1663/16451/2703 blk 2 FPW 
    rmgr: Btree       len (rec/tot):     53/  6113, tx:        499, lsn: 0/08002050, prev 0/08001C40, desc: INSERT_LEAF
off161, blkref #0: rel 1663/16451/2704 blk 4 FPW 
    <snip>
    <snip>

A pg_waldump of the wal that was streamed to our barman server shows -

    pg_waldump 000000010000000000000008
    <snip>
    <snip>
    rmgr: XLOG        len (rec/tot):    114/   114, tx:          0, lsn: 0/08000E68, prev 0/08000E30, desc:
CHECKPOINT_ONLINEredo 0/8000E30; tli 1; prev tli 1; fpw true; xid 0:499; oid 24576; multi 1; offset 0; oldest xid 478
inDB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 498; online 
    rmgr: Standby     len (rec/tot):     54/    54, tx:          0, lsn: 0/08000EE0, prev 0/08000E68, desc:
RUNNING_XACTSnextXid 499 latestCompletedXid 497 oldestRunningXid 498; 1 xacts: 498 
    rmgr: Database    len (rec/tot):     42/    42, tx:        498, lsn: 0/08000F18, prev 0/08000EE0, desc: CREATE copy
dir1663/1 to 1663/16451 
    rmgr: Standby     len (rec/tot):     54/    54, tx:          0, lsn: 0/08000F48, prev 0/08000F18, desc:
RUNNING_XACTSnextXid 499 latestCompletedXid 497 oldestRunningXid 498; 1 xacts: 498 
    rmgr: XLOG        len (rec/tot):    114/   114, tx:          0, lsn: 0/08000F80, prev 0/08000F48, desc:
CHECKPOINT_ONLINEredo 0/8000F48; tli 1; prev tli 1; fpw true; xid 0:499; oid 24576; multi 1; offset 0; oldest xid 478
inDB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 498; online 
    pg_waldump: fatal: error in WAL record at 0/8000F80: invalid record length at 0/8000FF8: wanted 24, got 0


Some observations from these clusters

- Both primary clusters continued to write and close the WAL files (000000010000000000000008 and
0000000900000001000000AC)that the replicas were streaming at the time of the walreceiver failure 
- The walreceiver on each replica was terminated when attempting to write the commit transaction
- The walreceiver on the barman server did not fail but the WAL file does not contain the commit transaction
- The walreceiver on the barman server continued to receive streamed WAL for subsequent WAL files
- The commit (0/08000FF8 and 1/AC000D78) lsn did not get streamed to either the replicas or barman server

Thanks in advance.






UNOFFICIAL

-----Original Message-----
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Sent: Monday, 26 February 2024 7:27 PM
To: Mark Schloss <Mark.Schloss@austrac.gov.au>
Cc: pgsql-general@lists.postgresql.org
Subject: Re: walreceiver fails on asynchronous replica [SEC=UNOFFICIAL] [EXTERNAL]

At Fri, 23 Feb 2024 04:04:03 +0000, Mark Schloss <Mark.Schloss@austrac.gov.au> wrote in
> <2024-02-23 07:50:05.637 AEDT [1957121]: [1-1] user=,db= > LOG:  started streaming WAL from primary at 6/B0000000 on
timeline5 
> <2024-02-23 07:50:05.696 AEDT [1957117]: [6-1] user=,db= > LOG:  invalid magic number 0000 in log segment
0000000500000006000000B0,offset 0 

This appears to suggest that the WAL file that the standby fetched was
zero-filled on the primary side, which cannot happen by a normal
operation. A preallocated WAL segment can be zero-filled but it cannot
be replicated under normal operations.

> <2024-02-22 14:20:23.383 AEDT [565231]: [6-1] user=,db= > FATAL:  terminating walreceiver process due to
administratorcommand 

This may suggest a config reload with some parameter changes.

One possible scenario matching the log lines could be that someone
switched primary_conninfo to a newly-restored primary. However, if the
new primary had older data than the previously connected primary,
possibly leading to the situation where the segment 0..5..6..B0 on it
was a preallocated one that was filled with zeros, the standby could
end up fetching the zero-filled WAL segment (file) and might fail this
way. If this is the case, such operations should be avoided.

Unfortunately, due to the lack of a reproducer or detailed operations
that took place there, the best I can do now is to guess a possible
scenario as described above. I'm not sure how come the situation
actually arose.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

**********************************************************************
Please  note  that  your  email address  is known to  AUSTRAC  for the
purposes  of  communicating with you.  The information  transmitted in
this  e-mail is  for the  use of  the intended  recipient only and may
contain confidential and/or legally  privileged  material. If you have
received  this information  in error you must not disseminate, copy or
take  any  action on  it and we  request that you delete all copies of
this transmission together with attachments and notify the sender.

This footnote also confirms that this email message has been swept for
the presence of computer viruses.
**********************************************************************
pgsql-general by date:
From: Adrian Klaver
Date: 27 February 2024, 23:23:00
Subject: Re: Me again with an insert trigger problem
From: veem v
Date: 28 February 2024, 04:22:53
Subject: Re: Aligning grants and privileges
RE: walreceiver fails on asynchronous replica [EXTERNAL] [SEC=UNOFFICIAL] - Mailing list pgsql-general

Previous

Next