Thread: LOG: invalid record length at : wanted 24, got 0

LOG: invalid record length at : wanted 24, got 0

From
Harinath Kanchu
Date:
Hello,

We are seeing an interesting STANDBY behavior that’s happening once every 3-4 days.

The standby suddenly disconnects from the primary and throws the error “LOG: invalid record length at <LSN>: wanted 24, got 0”.

It then tries to restore the WAL file from the archive. Due to low write activity on the primary, the WAL file will be switched and archived only after 1 hour.

So it gets stuck in a loop, switching the WAL source between STREAM and ARCHIVE without replicating from the primary.

This causes a write outage, since the standby is a synchronous standby.
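
(As a rough sketch, assuming the one-hour segment switch comes from archive_timeout on the primary, this is the kind of check/workaround we have in mind; pg_switch_wal() is only a stopgap:)

-- check the segment switch timeout on the primary
-- (assumption: this is where the 1 hour comes from)
SHOW archive_timeout;

-- force the current WAL segment to be switched, and hence archived, right away
SELECT pg_switch_wal();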

We set “wal_sync_method” to “fsync”, suspecting that the WAL file was not getting flushed correctly.

But this keeps happening even after switching it from “fdatasync” to “fsync”.

Restarting the STANDBY sometimes fixes this problem, but detecting it automatically is a big problem: the postgres standby process keeps running fine, while the WAL receiver process goes up and down continuously due to the switching of WAL sources.
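
As a side note, the flapping walreceiver is at least visible from SQL; a minimal check on the standby (a sketch using the standard pg_stat_wal_receiver view) could look like this:

-- on the standby: the row disappears, or stops advancing, whenever the walreceiver is down
SELECT pid, status, sender_host, flushed_lsn, last_msg_receipt_time
FROM pg_stat_wal_receiver;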


How can we fix this? Any suggestions regarding this will be appreciated.


Postgres Version: 13.6
OS: RHEL Linux


Thank you,


Best,
Harinath.


Re: LOG: invalid record length at : wanted 24, got 0

From
Bharath Rupireddy
Date:
On Wed, Mar 1, 2023 at 10:51 AM Harinath Kanchu <hkanchu@apple.com> wrote:
>
> Hello,
>
> We are seeing an interesting STANDBY behavior that’s happening once every 3-4 days.
>
> The standby suddenly disconnects from the primary and throws the error “LOG: invalid
> record length at <LSN>: wanted 24, got 0”.

Firstly, this isn't an error per se, especially for a standby, as it
can get/retry the same WAL record from other sources. It's a bit hard
to say anything further just by looking at this LOG message; one needs
to look at what's happening around the same time. You mentioned that
the connection to the primary was lost, so you need to dig into
why it got lost. If the connection was lost halfway through fetching
a WAL record, the standby may emit such a LOG message.

Secondly, you definitely need to understand why the connection to the
primary keeps getting lost - is it network disruption, parameter changes,
the primary going down, the standby going down, etc.?

> It then tries to restore the WAL file from the archive. Due to low write activity on
> the primary, the WAL file will be switched and archived only after 1 hour.
>
> So it gets stuck in a loop, switching the WAL source between STREAM and ARCHIVE without replicating from the primary.
>
> This causes a write outage, since the standby is a synchronous standby.

I understand this problem and there's a proposed patch to help with
this - https://www.postgresql.org/message-id/CALj2ACVryN_PdFmQkbhga1VeW10VgQ4Lv9JXO=3nJkvZT8qgfA@mail.gmail.com.

It basically lets one set a timeout for how long the standby keeps
restoring from the archive before switching to stream. Therefore, in
your case, the standby doesn't have to wait 1 hour to connect to the
primary; it can connect before that.

> We set “wal_sync_method” to “fsync”, suspecting that the WAL file was not getting flushed correctly.
>
> But this keeps happening even after switching it from “fdatasync” to “fsync”.

I don't think that's a problem, unless wal_sync_method is changed
to something else in between.

> Restarting the STANDBY sometimes fixes this problem, but detecting it automatically is a big
> problem: the postgres standby process keeps running fine, while the WAL receiver process goes
> up and down continuously due to the switching of WAL sources.

Yes, after the standby fails to connect to the primary, it switches to
the archive and stays there until it exhausts all the WAL from the
archive, and then switches back to stream. You can monitor the standby's
replication slot on the primary; if it's inactive, then one needs to jump
in. As mentioned above, there's an in-progress feature that helps in
these cases.
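
For example, a minimal check on the primary (a sketch, assuming the
standby uses a replication slot) could be:

-- is the standby's slot still active (i.e. being consumed)?
SELECT slot_name, active, restart_lsn FROM pg_replication_slots;

-- is a walsender connected, and is the standby still listed as synchronous?
SELECT application_name, state, sync_state, replay_lsn FROM pg_stat_replication;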

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: LOG: invalid record length at : wanted 24, got 0

From
Harinath Kanchu
Date:
Thanks Bharath for your response,

> You mentioned that the connection to the primary was lost, so you need to dig into
> why it got lost. If the connection was lost halfway through fetching a WAL record,
> the standby may emit such a LOG message.

The connection was lost due to a bad network. We are currently okay with the bad network itself, because applications in general have to expect bad networks in their design.

To simulate the bad network, we manually killed the wal-sender process on the primary multiple times; this should be the same as the primary being unable to send messages to the standby due to a bad network. In these experiments, the standby was able to rejoin the primary within a few seconds, after checking the WAL files in the archive.
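
(For reference, a roughly equivalent way to do this from SQL - a sketch, not exactly what we ran - is to terminate the walsender backends on the primary:)

-- on the primary: terminate all walsender backends, forcing the standby to reconnect
SELECT pg_terminate_backend(pid) FROM pg_stat_replication;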

But when the standby gets disconnected due to a bad network, it is unable to rejoin, so we want to understand why this is happening.


> I understand this problem and there's a proposed patch to help with
> this - https://www.postgresql.org/message-id/CALj2ACVryN_PdFmQkbhga1VeW10VgQ4Lv9JXO=3nJkvZT8qgfA@mail.gmail.com.
>
> It basically lets one set a timeout for how long the standby keeps
> restoring from the archive before switching to stream. Therefore, in
> your case, the standby doesn't have to wait 1 hour to connect to the
> primary; it can connect before that.

Good to know about this patch; we tried to do something similar ourselves.

But this patch will not solve our problem, because here the hot standby is NOT spending a long time downloading WAL files from the archive; instead, it is simply switching sources between STREAM and ARCHIVE in a tight loop, doing nothing.

In fact, the standby does try to connect to the primary for streaming again after it fails to pull the latest WAL file from the archive, but streaming from the primary fails again immediately.


To explain this situation better, I am adding more logs.

Note: the logs mentioned here include the custom logging we enabled for debugging purposes.



2023-02-09 19:52:30.909 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  Starting the state machine again from starting, changing current source to XLOG_FROM_ARCHIVE
2023-02-09 19:52:30.909 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  switched WAL source from stream to archive after failure
ERROR: 2023/02/09 19:52:31.616383 Archive '00000006000000020000001C' does not exist.
2023-02-09 19:52:31.618 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  Successfully read the WAL file using XLogFileReadAnyTLI from archive or existing pg_wal, returning true and exit.
2023-02-09 19:52:31.619 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  last source archive failed, and now switching to new source.
2023-02-09 19:52:31.619 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  Moving to XLOG_FROM_STREAM state and start walreceiver if necessary
2023-02-09 19:52:31.619 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  switched WAL source from archive to stream after failure with WalStreamingPreferred false
2023-02-09 19:52:31.619 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  changing the current file timeline to new tli.
2023-02-09 19:52:31.619 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  we have data from wal receiver.
2023-02-09 19:52:31.619 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  after streaming file, we dont have WAL file opened, so read now.
2023-02-09 19:52:31.619 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  Due to WAL receiver, we have WAL file present and opened and hence returning true
2023-02-09 19:52:31.619 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  invalid record length at 2/1C0324C8: wanted 24, got 0
2023-02-09 19:52:31.619 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  last source stream failed, and now switching to new source.
2023-02-09 19:52:31.619 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  Failure while streaming, we might have found invalid record in WAL streamed from master
2023-02-09 19:52:31.619 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  Stopping WAL receiver as we saw serious failure while streaming
INFO: 2023/02/09 19:52:31.796489 WAL: 00000007.history will be downloaded from archive
ERROR: 2023/02/09 19:52:32.330491 Archive '00000007.history' does not exist.
2023-02-09 19:52:32.333 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  We were requested to recover to latest timeline, but rescan is NOT needed.
2023-02-09 19:52:32.333 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  We decided to sleep before retry the state-machine
2023-02-09 19:52:35.910 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  Starting the state machine again from starting, changing current source to XLOG_FROM_ARCHIVE
2023-02-09 19:52:35.910 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  switched WAL source from stream to archive after failure
ERROR: 2023/02/09 19:52:36.607613 Archive '00000006000000020000001C' does not exist.
2023-02-09 19:52:36.610 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  Successfully read the WAL file using XLogFileReadAnyTLI from archive or existing pg_wal, returning true and exit.
2023-02-09 19:52:36.610 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  last source archive failed, and now switching to new source.
2023-02-09 19:52:36.610 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  Moving to XLOG_FROM_STREAM state and start walreceiver if necessary
2023-02-09 19:52:36.610 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  switched WAL source from archive to stream after failure with WalStreamingPreferred false
2023-02-09 19:52:36.610 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  changing the current file timeline to new tli.
2023-02-09 19:52:36.610 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  we have data from wal receiver.
2023-02-09 19:52:36.610 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  after streaming file, we dont have WAL file opened, so read now.
2023-02-09 19:52:36.610 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  Due to WAL receiver, we have WAL file present and opened and hence returning true
2023-02-09 19:52:36.610 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  invalid record length at 2/1C0324C8: wanted 24, got 0



So we are wondering why abruptly killing the wal-sender on the primary seems to be okay, but a few network failures are not okay for the standby to recover from.

Thank you.

Best,
Harinath