Re: LOG: invalid record length at : wanted 24, got 0 - Mailing list pgsql-hackers

From Harinath Kanchu
Subject Re: LOG: invalid record length at : wanted 24, got 0
Date
Msg-id 9A769294-7DC1-408C-A634-A445A3BD8BCA@apple.com
Whole thread Raw
In response to Re: LOG: invalid record length at : wanted 24, got 0  (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
List pgsql-hackers
Thanks Bharath for your response,

You mentioned that
the connection to primary was lost, so you need to dive deep as to why
it got lost. If the connection was lost half-way through fetching the
WAL record, the standby may emit such a LOG message.

The connection was lost due to bad network. Currently we are okay with bad network, because applications in general has to always expect bad network in their design.

To simulate the bad network, we manually killed the wal-sender process in primary multiple times, this should be same as primary unable to send messages to standby due to bad network. And in these experiments, the standby is able to join the primary after checking WAL files in archive, within few seconds.

But, when the standby gets disconnected due to bad network, the standby is unable to join back, so we wanted to understand why this is happening.


I understand this problem and there's a proposed patch to help with
this - https://www.postgresql.org/message-id/CALj2ACVryN_PdFmQkbhga1VeW10VgQ4Lv9JXO=3nJkvZT8qgfA@mail.gmail.com.

It basically allows one to set a timeout as to how much duration the
standby can restore from archive before switching to stream.
Therefore, in your case, the standby doesn't have to wait for 1hr to
connect to primary, but it can connect before that.

Good to know about this patch, because we tried to do something similar to this.

But this patch will not solve our problem, because here the hot-standby is NOT downloading WAL files from the archive for very long time and wasting time, but,  the standby simply switching the sources from STREAM to ARCHIVE in a tight loop doing nothing.

In fact, the standby is trying to connect to the primary for streaming back after it fails to pull the latest WAL file from archive. But trying to stream from the primary fails again immediately.


To explain this situation better, I am adding more logs.

Note: the logs mentioned here includes the custom logging we enable for debugging purposes.



2023-02-09 19:52:30.909 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  Starting the state machine again from starting, changing current source to XLOG_FROM_ARCHIVE
2023-02-09 19:52:30.909 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  switched WAL source from stream to archive after failure
ERROR: 2023/02/09 19:52:31.616383 Archive '00000006000000020000001C' does not exist.
2023-02-09 19:52:31.618 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  Successfully read the WAL file using XLogFileReadAnyTLI from archive or existing pg_wal, returning true and exit.
2023-02-09 19:52:31.619 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  last source archive failed, and now switching to new source.
2023-02-09 19:52:31.619 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  Moving to XLOG_FROM_STREAM state and start walreceiver if necessary
2023-02-09 19:52:31.619 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  switched WAL source from archive to stream after failure with WalStreamingPreferred false
2023-02-09 19:52:31.619 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  changing the current file timeline to new tli.
2023-02-09 19:52:31.619 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  we have data from wal receiver.
2023-02-09 19:52:31.619 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  after streaming file, we dont have WAL file opened, so read now.
2023-02-09 19:52:31.619 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  Due to WAL receiver, we have WAL file present and opened and hence returning true
2023-02-09 19:52:31.619 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  invalid record length at 2/1C0324C8: wanted 24, got 0
2023-02-09 19:52:31.619 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  last source stream failed, and now switching to new source.
2023-02-09 19:52:31.619 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  Failure while streaming, we might have found invalid record in WAL streamed from master
2023-02-09 19:52:31.619 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  Stopping WAL receiver as we saw serious failure while streaming
INFO: 2023/02/09 19:52:31.796489 WAL: 00000007.history will be downloaded from archive
ERROR: 2023/02/09 19:52:32.330491 Archive '00000007.history' does not exist.
2023-02-09 19:52:32.333 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  We were requested to recover to latest timeline, but rescan is NOT needed.
2023-02-09 19:52:32.333 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  We decided to sleep before retry the state-machine
2023-02-09 19:52:35.910 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  Starting the state machine again from starting, changing current source to XLOG_FROM_ARCHIVE
2023-02-09 19:52:35.910 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  switched WAL source from stream to archive after failure
ERROR: 2023/02/09 19:52:36.607613 Archive '00000006000000020000001C' does not exist.
2023-02-09 19:52:36.610 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  Successfully read the WAL file using XLogFileReadAnyTLI from archive or existing pg_wal, returning true and exit.
2023-02-09 19:52:36.610 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  last source archive failed, and now switching to new source.
2023-02-09 19:52:36.610 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  Moving to XLOG_FROM_STREAM state and start walreceiver if necessary
2023-02-09 19:52:36.610 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  switched WAL source from archive to stream after failure with WalStreamingPreferred false
2023-02-09 19:52:36.610 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  changing the current file timeline to new tli.
2023-02-09 19:52:36.610 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  we have data from wal receiver.
2023-02-09 19:52:36.610 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  after streaming file, we dont have WAL file opened, so read now.
2023-02-09 19:52:36.610 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  Due to WAL receiver, we have WAL file present and opened and hence returning true
2023-02-09 19:52:36.610 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  invalid record length at 2/1C0324C8: wanted 24, got 0



So, we are wondering, why killing of wal-sender in primary abruptly seems okay but a few network failures are not okay for the standby to recover.

Thank you.

Best,
Harinath




pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: refactoring relation extension and BufferAlloc(), faster COPY
Next
From: "Ryo Yamaji (Fujitsu)"
Date:
Subject: The order of queues in row lock is changed (not FIFO)