Re: archive status ".ready" files may be created too early - Mailing list pgsql-hackers

From Andrey Borodin
Subject Re: archive status ".ready" files may be created too early
Date
Msg-id EE2EA7B8-36E5-4C46-BD39-099713C4BC50@yandex-team.ru
Whole thread Raw
In response to Re: archive status ".ready" files may be created too early  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Responses Re: archive status ".ready" files may be created too early  ("Bossart, Nathan" <bossartn@amazon.com>)
List pgsql-hackers
Hi!

I was looking to review something in CF. This seems like a thread of some interest to me.

Recently we had somewhat related incident. Do I understand correctly that this incident is related to the bug discussed
inthis thread? 

Primary instance was killed by OOM
[ 2020-11-12 15:27:03.732 MSK ,,,739,00000 ]:LOG:  server process (PID 40189) was terminated by signal 9: Killed
after recovery it archived some WAL segments.
[ 2020-11-12 15:27:31.477 MSK ,,,739,00000 ]:LOG:  database system is ready to accept connections
INFO: 2020/11/12 15:27:32.059541 FILE PATH: 0000000E0001C02F000000AF.br
INFO: 2020/11/12 15:27:32.114319 FILE PATH: 0000000E0001C02F000000B3.br

then PITR failed on another host
[ 2020-11-12 16:26:33.024 MSK ,,,51414,00000 ]:LOG:  restored log file "0000000E0001C02F000000B3" from archive
[ 2020-11-12 16:26:33.042 MSK ,,,51414,00000 ]:LOG:  invalid record length at 1C02F/B3FFF778: wanted 24, got 0
[ 2020-11-12 16:26:33.042 MSK ,,,51414,00000 ]:LOG:  invalid record length at 1C02F/B3FFF778: wanted 24, got 0

archived segment has some zeroes at the end
rmgr: XLOG        len (rec/tot):     51/  1634, tx:          0, lsn: 1C02F/B3FFF058, prev 1C02F/B3FFEFE8, desc:
FPI_FOR_HINT, blkref #0: rel 1663/14030/16384 blk 140 FPW 
rmgr: Heap        len (rec/tot):    129/   129, tx: 3890578935, lsn: 1C02F/B3FFF6C0, prev 1C02F/B3FFF058, desc:
HOT_UPDATEoff 34 xmax 3890578935 ; new off 35 xmax 0, blkref #0: rel 1663/14030/16384 blk 140 
rmgr: Transaction len (rec/tot):     46/    46, tx: 3890578935, lsn: 1C02F/B3FFF748, prev 1C02F/B3FFF6C0, desc: COMMIT
2020-11-1215:27:31.507363 MSK 
pg_waldump: FATAL:  error in WAL record at 1C02F/**B3FFF748**: invalid record length at 1C02F/**B3FFF778**: wanted 24,
got0 

Meanwhile next segment points to previous record at **B3FFF748**
postgres@man-odszl7u4361o8m3z:/tmp$ pg_waldump 0000000E0001C02F000000B4| head
rmgr: Heap        len (rec/tot):    129/   129, tx: 3890578936, lsn: 1C02F/B4000A68, prev 1C02F/**B3FFF778**, desc:
HOT_UPDATEoff 35 xmax 3890578936 ; new off 36 xmax 0, blkref #0: rel 1663/14030/16384 blk 140 
rmgr: Transaction len (rec/tot):     46/    46, tx: 3890578936, lsn: 1C02F/B4000AF0, prev 1C02F/B4000A68, desc: COMMIT
2020-11-1215:27:32.509443 MSK 

Best regards, Andrey Borodin.


pgsql-hackers by date:

Previous
From: Michail Nikolaev
Date:
Subject: Why latestRemovedXid|cuteoff_xid are always sent?
Next
From: Jeff Davis
Date:
Subject: Re: Reloptions for table access methods