Thread: RE: walreceiver fails on asynchronous replica [EXTERNAL] [SEC=UNOFFICIAL]
UNOFFICIAL Thank you for your reply. I can confirm there were no changes made to the config of the replica. We have had a further two occurrences of a walreceiver failing - each occurrence is in a different cluster, however, bothclusters are on the same set of hardware. In each of the new cases we do not get the error where the primary wal haszeroes. We are able to dump the WAL from each of the primary clusters and this shows the command the walreceiver failedon was a commit. Is there any significance in the parameters in the commit record - 'inval msgs: catcache 21; sync'.Details of each are shown below. --Cluster 1 The walreceiver in the replica failed as follows - <2024-02-26 07:00:18.475 AEDT [3067297]: [1-1] user=[unknown],db=[unknown] > LOG: connection received: host=[local] <2024-02-26 07:00:18.476 AEDT [3067297]: [2-1] user=postgres,db=postgres > LOG: connection authorized: user=postgresdatabase=postgres application_name=psql <2024-02-26 07:00:18.480 AEDT [3067297]: [3-1] user=postgres,db=postgres > LOG: disconnection: session time: 0:00:00.006user=postgres database=postgres host=[local] <2024-02-26 11:31:27.174 AEDT [1971080]: [6-1] user=,db= > LOG: invalid record length at 1/AC000D78: wanted 24, got0 <2024-02-26 11:31:27.174 AEDT [1971084]: [2-1] user=,db= > FATAL: terminating walreceiver process due to administratorcommand <2024-02-26 11:31:27.274 AEDT [1971080]: [7-1] user=,db= > LOG: invalid record length at 1/AC000D78: wanted 24, got0 <2024-02-26 11:31:27.274 AEDT [1971080]: [8-1] user=,db= > LOG: invalid record length at 1/AC000D78: wanted 24, got0 <2024-02-26 11:31:27.275 AEDT [1971080]: [9-1] user=,db= > LOG: invalid record length at 1/AC000D78: wanted 24, got0 <2024-02-26 11:31:32.279 AEDT [1971080]: [10-1] user=,db= > LOG: invalid record length at 1/AC000D78: wanted 24, got0 A pg_waldump of the archived wal from the primary shows 1/AC000D78 is a commit - pg_waldump 0000000900000001000000AC <snip> <snip> rmgr: Standby len (rec/tot): 54/ 54, tx: 0, lsn: 1/AC000C60, prev 1/AC000BE8, desc: RUNNING_XACTSnextXid 8395441 latestCompletedXid 8395439 oldestRunningXid 8395440; 1 xacts: 8395440 rmgr: Database len (rec/tot): 42/ 42, tx: 8395440, lsn: 1/AC000C98, prev 1/AC000C60, desc: CREATE copy dir1663/1 to 1663/17149 rmgr: Standby len (rec/tot): 54/ 54, tx: 0, lsn: 1/AC000CC8, prev 1/AC000C98, desc: RUNNING_XACTSnextXid 8395441 latestCompletedXid 8395439 oldestRunningXid 8395440; 1 xacts: 8395440 rmgr: XLOG len (rec/tot): 114/ 114, tx: 0, lsn: 1/AC000D00, prev 1/AC000CC8, desc: CHECKPOINT_ONLINEredo 1/AC000CC8; tli 9; prev tli 9; fpw true; xid 0:8395441; oid 24713; multi 1; offset 0; oldest xid 478in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 8395440; online rmgr: Transaction len (rec/tot): 66/ 66, tx: 8395440, lsn: 1/AC000D78, prev 1/AC000D00, desc: COMMIT 2024-02-2611:31:25.927728 AEDT; inval msgs: catcache 21; sync rmgr: XLOG len (rec/tot): 49/ 2037, tx: 0, lsn: 1/AC000DC0, prev 1/AC000D78, desc: FPI_FOR_HINT, blkref #0: rel 1664/0/1262 blk 0 FPW rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 1/AC0015B8, prev 1/AC000DC0, desc: RUNNING_XACTSnextXid 8395441 latestCompletedXid 8395440 oldestRunningXid 8395441 rmgr: Storage len (rec/tot): 42/ 42, tx: 0, lsn: 1/AC0015F0, prev 1/AC0015B8, desc: CREATE base/17149/17150 rmgr: Heap len (rec/tot): 54/ 1518, tx: 8395441, lsn: 1/AC001620, prev 1/AC0015F0, desc: INSERT off 8flags 0x01, blkref #0: rel 1663/17149/1247 blk 9 FPW rmgr: Btree len (rec/tot): 53/ 1013, tx: 8395441, lsn: 1/AC001C10, prev 1/AC001620, desc: INSERT_LEAF off46, blkref #0: rel 1663/17149/2703 blk 2 FPW rmgr: Btree len (rec/tot): 53/ 6113, tx: 8395441, lsn: 1/AC002020, prev 1/AC001C10, desc: INSERT_LEAF off161, blkref #0: rel 1663/17149/2704 blk 4 FPW <snip> <snip> A pg_waldump of the wal that was streamed to our barman server shows - pg_waldump 0000000900000001000000AC <snip> <snip> rmgr: Standby len (rec/tot): 54/ 54, tx: 0, lsn: 1/AC000C60, prev 1/AC000BE8, desc: RUNNING_XACTSnextXid 8395441 latestCompletedXid 8395439 oldestRunningXid 8395440; 1 xacts: 8395440 rmgr: Database len (rec/tot): 42/ 42, tx: 8395440, lsn: 1/AC000C98, prev 1/AC000C60, desc: CREATE copy dir1663/1 to 1663/17149 rmgr: Standby len (rec/tot): 54/ 54, tx: 0, lsn: 1/AC000CC8, prev 1/AC000C98, desc: RUNNING_XACTSnextXid 8395441 latestCompletedXid 8395439 oldestRunningXid 8395440; 1 xacts: 8395440 rmgr: XLOG len (rec/tot): 114/ 114, tx: 0, lsn: 1/AC000D00, prev 1/AC000CC8, desc: CHECKPOINT_ONLINEredo 1/AC000CC8; tli 9; prev tli 9; fpw true; xid 0:8395441; oid 24713; multi 1; offset 0; oldest xid 478in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 8395440; online pg_waldump: fatal: error in WAL record at 1/AC000D00: invalid record length at 1/AC000D78: wanted 24, got 0 --Cluster 2 The walreceiver in the replica failed as follows - <2024-02-26 07:01:06.727 AEDT [3068652]: [1-1] user=[unknown],db=[unknown] > LOG: connection received: host=[local] <2024-02-26 07:01:06.730 AEDT [3068652]: [2-1] user=postgres,db=postgres > LOG: connection authorized: user=postgresdatabase=postgres application_name=psql <2024-02-26 07:01:06.734 AEDT [3068652]: [3-1] user=postgres,db=postgres > LOG: disconnection: session time: 0:00:00.008user=postgres database=postgres host=[local] <2024-02-26 07:01:06.861 AEDT [3068725]: [1-1] user=[unknown],db=[unknown] > LOG: connection received: host=[local] <2024-02-26 07:01:06.863 AEDT [3068725]: [2-1] user=postgres,db=postgres > LOG: connection authorized: user=postgresdatabase=postgres application_name=psql <2024-02-26 07:01:06.867 AEDT [3068725]: [3-1] user=postgres,db=postgres > LOG: disconnection: session time: 0:00:00.006user=postgres database=postgres host=[local] <2024-02-26 11:54:08.280 AEDT [3701794]: [5-1] user=,db= > LOG: invalid record length at 0/8000FF8: wanted 24, got 0 <2024-02-26 11:54:08.280 AEDT [3701798]: [2-1] user=,db= > FATAL: terminating walreceiver process due to administratorcommand <2024-02-26 11:54:08.381 AEDT [3701794]: [6-1] user=,db= > LOG: invalid record length at 0/8000FF8: wanted 24, got 0 <2024-02-26 11:54:08.381 AEDT [3701794]: [7-1] user=,db= > LOG: invalid record length at 0/8000FF8: wanted 24, got 0 <2024-02-26 11:54:08.381 AEDT [3701794]: [8-1] user=,db= > LOG: invalid record length at 0/8000FF8: wanted 24, got 0 <2024-02-26 11:54:13.385 AEDT [3701794]: [9-1] user=,db= > LOG: invalid record length at 0/8000FF8: wanted 24, got 0 A pg_waldump of the archived wal from the primary shows 0/08000FF8 is a commit - pg_waldump 000000010000000000000008 <snip> <snip> rmgr: Standby len (rec/tot): 54/ 54, tx: 0, lsn: 0/08000E30, prev 0/08000D68, desc: RUNNING_XACTSnextXid 499 latestCompletedXid 497 oldestRunningXid 498; 1 xacts: 498 rmgr: XLOG len (rec/tot): 114/ 114, tx: 0, lsn: 0/08000E68, prev 0/08000E30, desc: CHECKPOINT_ONLINEredo 0/8000E30; tli 1; prev tli 1; fpw true; xid 0:499; oid 24576; multi 1; offset 0; oldest xid 478 inDB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest ru nning xid 498; online rmgr: Standby len (rec/tot): 54/ 54, tx: 0, lsn: 0/08000EE0, prev 0/08000E68, desc: RUNNING_XACTSnextXid 499 latestCompletedXid 497 oldestRunningXid 498; 1 xacts: 498 rmgr: Database len (rec/tot): 42/ 42, tx: 498, lsn: 0/08000F18, prev 0/08000EE0, desc: CREATE copy dir1663/1 to 1663/16451 rmgr: Standby len (rec/tot): 54/ 54, tx: 0, lsn: 0/08000F48, prev 0/08000F18, desc: RUNNING_XACTSnextXid 499 latestCompletedXid 497 oldestRunningXid 498; 1 xacts: 498 rmgr: XLOG len (rec/tot): 114/ 114, tx: 0, lsn: 0/08000F80, prev 0/08000F48, desc: CHECKPOINT_ONLINEredo 0/8000F48; tli 1; prev tli 1; fpw true; xid 0:499; oid 24576; multi 1; offset 0; oldest xid 478 inDB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest ru nning xid 498; online rmgr: Transaction len (rec/tot): 66/ 66, tx: 498, lsn: 0/08000FF8, prev 0/08000F80, desc: COMMIT 2024-02-2611:54:07.145087 AEDT; inval msgs: catcache 21; sync rmgr: XLOG len (rec/tot): 49/ 1501, tx: 0, lsn: 0/08001040, prev 0/08000FF8, desc: FPI_FOR_HINT, blkref #0: rel 1664/0/1262 blk 0 FPW rmgr: Storage len (rec/tot): 42/ 42, tx: 0, lsn: 0/08001620, prev 0/08001040, desc: CREATE base/16451/16452 rmgr: Heap len (rec/tot): 54/ 1518, tx: 499, lsn: 0/08001650, prev 0/08001620, desc: INSERT off 8flags 0x01, blkref #0: rel 1663/16451/1247 blk 9 FPW rmgr: Btree len (rec/tot): 53/ 1013, tx: 499, lsn: 0/08001C40, prev 0/08001650, desc: INSERT_LEAF off46, blkref #0: rel 1663/16451/2703 blk 2 FPW rmgr: Btree len (rec/tot): 53/ 6113, tx: 499, lsn: 0/08002050, prev 0/08001C40, desc: INSERT_LEAF off161, blkref #0: rel 1663/16451/2704 blk 4 FPW <snip> <snip> A pg_waldump of the wal that was streamed to our barman server shows - pg_waldump 000000010000000000000008 <snip> <snip> rmgr: XLOG len (rec/tot): 114/ 114, tx: 0, lsn: 0/08000E68, prev 0/08000E30, desc: CHECKPOINT_ONLINEredo 0/8000E30; tli 1; prev tli 1; fpw true; xid 0:499; oid 24576; multi 1; offset 0; oldest xid 478 inDB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 498; online rmgr: Standby len (rec/tot): 54/ 54, tx: 0, lsn: 0/08000EE0, prev 0/08000E68, desc: RUNNING_XACTSnextXid 499 latestCompletedXid 497 oldestRunningXid 498; 1 xacts: 498 rmgr: Database len (rec/tot): 42/ 42, tx: 498, lsn: 0/08000F18, prev 0/08000EE0, desc: CREATE copy dir1663/1 to 1663/16451 rmgr: Standby len (rec/tot): 54/ 54, tx: 0, lsn: 0/08000F48, prev 0/08000F18, desc: RUNNING_XACTSnextXid 499 latestCompletedXid 497 oldestRunningXid 498; 1 xacts: 498 rmgr: XLOG len (rec/tot): 114/ 114, tx: 0, lsn: 0/08000F80, prev 0/08000F48, desc: CHECKPOINT_ONLINEredo 0/8000F48; tli 1; prev tli 1; fpw true; xid 0:499; oid 24576; multi 1; offset 0; oldest xid 478 inDB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 498; online pg_waldump: fatal: error in WAL record at 0/8000F80: invalid record length at 0/8000FF8: wanted 24, got 0 Some observations from these clusters - Both primary clusters continued to write and close the WAL files (000000010000000000000008 and 0000000900000001000000AC)that the replicas were streaming at the time of the walreceiver failure - The walreceiver on each replica was terminated when attempting to write the commit transaction - The walreceiver on the barman server did not fail but the WAL file does not contain the commit transaction - The walreceiver on the barman server continued to receive streamed WAL for subsequent WAL files - The commit (0/08000FF8 and 1/AC000D78) lsn did not get streamed to either the replicas or barman server Thanks in advance. UNOFFICIAL -----Original Message----- From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Sent: Monday, 26 February 2024 7:27 PM To: Mark Schloss <Mark.Schloss@austrac.gov.au> Cc: pgsql-general@lists.postgresql.org Subject: Re: walreceiver fails on asynchronous replica [SEC=UNOFFICIAL] [EXTERNAL] At Fri, 23 Feb 2024 04:04:03 +0000, Mark Schloss <Mark.Schloss@austrac.gov.au> wrote in > <2024-02-23 07:50:05.637 AEDT [1957121]: [1-1] user=,db= > LOG: started streaming WAL from primary at 6/B0000000 on timeline5 > <2024-02-23 07:50:05.696 AEDT [1957117]: [6-1] user=,db= > LOG: invalid magic number 0000 in log segment 0000000500000006000000B0,offset 0 This appears to suggest that the WAL file that the standby fetched was zero-filled on the primary side, which cannot happen by a normal operation. A preallocated WAL segment can be zero-filled but it cannot be replicated under normal operations. > <2024-02-22 14:20:23.383 AEDT [565231]: [6-1] user=,db= > FATAL: terminating walreceiver process due to administratorcommand This may suggest a config reload with some parameter changes. One possible scenario matching the log lines could be that someone switched primary_conninfo to a newly-restored primary. However, if the new primary had older data than the previously connected primary, possibly leading to the situation where the segment 0..5..6..B0 on it was a preallocated one that was filled with zeros, the standby could end up fetching the zero-filled WAL segment (file) and might fail this way. If this is the case, such operations should be avoided. Unfortunately, due to the lack of a reproducer or detailed operations that took place there, the best I can do now is to guess a possible scenario as described above. I'm not sure how come the situation actually arose. regards. -- Kyotaro Horiguchi NTT Open Source Software Center ********************************************************************** Please note that your email address is known to AUSTRAC for the purposes of communicating with you. The information transmitted in this e-mail is for the use of the intended recipient only and may contain confidential and/or legally privileged material. If you have received this information in error you must not disseminate, copy or take any action on it and we request that you delete all copies of this transmission together with attachments and notify the sender. This footnote also confirms that this email message has been swept for the presence of computer viruses. **********************************************************************
Re: walreceiver fails on asynchronous replica [EXTERNAL] [SEC=UNOFFICIAL]
From
Kyotaro Horiguchi
Date:
At Wed, 28 Feb 2024 03:23:37 +0000, Mark Schloss <Mark.Schloss@austrac.gov.au> wrote in > Thank you for your reply. I can confirm there were no changes made to the config of the replica. Okay. > Is there any significance in the parameters in the commit record - > 'inval msgs: catcache 21; sync'. I think it is not relevant. > - The walreceiver on the barman server did not fail but the WAL file does not contain the commit transaction I don't have detailed knowledge of barman's behavior, but it seems to be quite normal for barman to write out only on receiving a commit record. What I don't understand here is how those WAL files on the barman server are related to the failed replicas. From the information you provided, I guess that the replicas somehow obtain the currently-written WAL file from the barman server at a certain time through a route unknown to me, but you seem to know that. I think that relationship has not been explained here. Could you explain the routes and timings that WAL files are copied between the servers? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
UNOFFICIAL Hello, Thanks for looking at this. I think I complicated things by including barman. I was just wanting to point out each primarystreams to two locations - the walreceiver on the replica and the walreciver used by barman. We think the reason thebarman WAL-receiver didn't fail is because there is no apply of the WAL in Barman but the Standby is applying and it'sWAL-receiver got terminated, so the barman server can be taken out of this picture completely, they were just there asa by-product in trying to determine the effect. We are only interested in the killing of the standby wal-receiver andthat the pg_waldump showed the failing lsn was a commit. Thanks UNOFFICIAL -----Original Message----- From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Sent: Wednesday, 28 February 2024 5:09 PM To: Mark Schloss <Mark.Schloss@austrac.gov.au> Cc: pgsql-general@lists.postgresql.org Subject: Re: walreceiver fails on asynchronous replica [EXTERNAL] [SEC=UNOFFICIAL] At Wed, 28 Feb 2024 03:23:37 +0000, Mark Schloss <Mark.Schloss@austrac.gov.au> wrote in > Thank you for your reply. I can confirm there were no changes made to the config of the replica. Okay. > Is there any significance in the parameters in the commit record - > 'inval msgs: catcache 21; sync'. I think it is not relevant. > - The walreceiver on the barman server did not fail but the WAL file does not contain the commit transaction I don't have detailed knowledge of barman's behavior, but it seems to be quite normal for barman to write out only on receiving a commit record. What I don't understand here is how those WAL files on the barman server are related to the failed replicas. From the information you provided, I guess that the replicas somehow obtain the currently-written WAL file from the barman server at a certain time through a route unknown to me, but you seem to know that. I think that relationship has not been explained here. Could you explain the routes and timings that WAL files are copied between the servers? regards. -- Kyotaro Horiguchi NTT Open Source Software Center ********************************************************************** Please note that your email address is known to AUSTRAC for the purposes of communicating with you. The information transmitted in this e-mail is for the use of the intended recipient only and may contain confidential and/or legally privileged material. If you have received this information in error you must not disseminate, copy or take any action on it and we request that you delete all copies of this transmission together with attachments and notify the sender. This footnote also confirms that this email message has been swept for the presence of computer viruses. **********************************************************************
Re: walreceiver fails on asynchronous replica [EXTERNAL] [SEC=UNOFFICIAL]
From
Kyotaro Horiguchi
Date:
Hi, Mark. At Thu, 29 Feb 2024 04:39:49 +0000, Mark Schloss <Mark.Schloss@austrac.gov.au> wrote in > Thanks for looking at this. I think I complicated things by > including barman. I was just wanting to point out each primary > streams to two locations - the walreceiver on the replica and the > walreciver used by barman. We think the reason the barman > WAL-receiver didn't fail is because there is no apply of the WAL in > Barman but the Standby is applying and it's WAL-receiver got > terminated, so the barman server can be taken out of this picture > completely, they were just there as a by-product in trying to > determine the effect. We are only interested in the killing of the > standby wal-receiver and that the pg_waldump showed the failing lsn > was a commit. It seems that an issue raised in the -hackers thread [1] might be the same issue as yours. The discussion might be a help for you, although it's not clear what is happening yet. [1] https://www.postgresql.org/message-id/CAFh8B%3DmozC%2Be1wGJq0H%3D0O65goZju%2B6ab5AU7DEWCSUA2OtwDg%40mail.gmail.com regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Ccc. blowjobsblowjobsblowjobsblowjobsblowjobsblowjobsBuenos días . Por favor queria hacer un pedido de 2 salteñas de carne picante y 2 de pollo picante para ir a recoger a la 1 de la tarde por favor. Graciasblowjobhttps://whatsapp.com/channel/0029Va94dZyIXnlxXagThf0G
Lea también
sBuenos días . Por favor queria hacer un pedido de 2 salteñas de carne picante y 2 de pollo picante para ir a recoger a la 1 de la tarde por favor. GraciasBuenos días . Por favor queria hacer un pedido de 2 salteñas de carne picante y 2 de pollo picante para ir a recoger a la 1 de la tarde por favor. GraciasblowjobsBuenos días . Pblowjobsor Buenos días . Por favor queria hacer un pedido de 2 salteñas de carne picante y 2 de pollo picante para ir a recoger a la 1 de la tarde por favor. Graciasfavor queria hacer un pedido de 2 salteñas de carne picante y 2 de pollo picante para ir a recoger a la 1 de la tarde por favor. GraciasBuenos días . Por favor queria hacer un pedido de 2 salteñas de carne picante y 2 de pollo picante para ir a recoger a la 1 de la tarde por favor. GraciasBuenos días . Por favor queria hacer un pedido de 2 salteñas de carne picante y 2 de pollo picante para ir a recoger a la 1 de la tarde por favor. Gracias
El jue, 29 de feb de 2024, 21:37, Kyotaro Horiguchi <horikyota.ntt@gmail.com> escribió:
Hi, Mark.
At Thu, 29 Feb 2024 04:39:49 +0000, Mark Schloss <Mark.Schloss@austrac.gov.au> wrote in
> Thanks for looking at this. I think I complicated things by
> including barman. I was just wanting to point out each primary
> streams to two locations - the walreceiver on the replica and the
> walreciver used by barman. We think the reason the barman
> WAL-receiver didn't fail is because there is no apply of the WAL in
> Barman but the Standby is applying and it's WAL-receiver got
> terminated, so the barman server can be taken out of this picture
> completely, they were just there as a by-product in trying to
> determine the effect. We are only interested in the killing of the
> standby wal-receiver and that the pg_waldump showed the failing lsn
> was a commit.
It seems that an issue raised in the -hackers thread [1] might be the
same issue as yours. The discussion might be a help for you, although
it's not clear what is happening yet.
[1] https://www.postgresql.org/message-id/CAFh8B%3DmozC%2Be1wGJq0H%3D0O65goZju%2B6ab5AU7DEWCSUA2OtwDg%40mail.gmail.com
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center