Re: LOG: invalid record length at <LSN>: wanted 24, got 0 - Mailing list pgsql-hackers

From Bharath Rupireddy
Subject Re: LOG: invalid record length at <LSN>: wanted 24, got 0
Date
Msg-id CALj2ACW+vHcgZntw_JHtWfkDx64JS3_eiQNoRgNynK-uGM2j5A@mail.gmail.com
In response to LOG: invalid record length at <LSN>: wanted 24, got 0  (Harinath Kanchu <hkanchu@apple.com>)
Responses Re: LOG: invalid record length at <LSN>: wanted 24, got 0
List pgsql-hackers
On Wed, Mar 1, 2023 at 10:51 AM Harinath Kanchu <hkanchu@apple.com> wrote:
>
> Hello,
>
> We are seeing an interesting STANDBY behavior, that’s happening once in 3-4 days.
>
> The standby suddenly disconnects from the primary, and it throws the error “LOG: invalid record length at <LSN>:
> wanted 24, got 0”.

Firstly, this isn't an error per se, especially on a standby, since it
can retry fetching the same WAL record from other sources. It's hard
to say anything further from this LOG message alone; one needs to look
at what else is happening around the same time. You mentioned that the
connection to the primary was lost, so you need to dig into why it got
lost. If the connection was lost halfway through fetching a WAL
record, the standby may emit such a LOG message.
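
To illustrate where "wanted 24, got 0" comes from: 24 bytes is the
size of the fixed WAL record header, and its first field is the
record's total length, so a header read from zero-filled or
not-yet-received WAL yields a length of 0. Below is a minimal Python
sketch of that check, assuming a little-endian build. The helper name
check_record_length is made up for illustration, and this is not the
server's actual code.

```python
import struct

# Assumed layout of the fixed WAL record header on a little-endian build:
# xl_tot_len (uint32), xl_xid (uint32), xl_prev (uint64),
# xl_info (uint8), xl_rmid (uint8), 2 padding bytes, xl_crc (uint32).
XLOG_RECORD_HEADER = struct.Struct("<IIQBB2xI")


def check_record_length(header_bytes):
    """Hypothetical helper: return None if the length field is plausible,
    otherwise a message shaped like the LOG line discussed here."""
    wanted = XLOG_RECORD_HEADER.size
    if len(header_bytes) < wanted:
        raise ValueError("need at least %d header bytes" % wanted)
    # The first unpacked field is xl_tot_len, the total record length.
    tot_len = XLOG_RECORD_HEADER.unpack(header_bytes[:wanted])[0]
    if tot_len < wanted:
        return "invalid record length: wanted %d, got %d" % (wanted, tot_len)
    return None
```

Passing 24 zero bytes to check_record_length() reproduces the "wanted
24, got 0" wording, which is what a standby would see when it reaches
the end of the WAL it has actually received.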

Secondly, you definitely need to understand why the connection to the
primary keeps getting lost: network disruption, parameter changes, the
primary going down, the standby going down, etc.

> And then it tries to restore the WAL file from the archive. Due to low write activity on primary, the WAL file will
> be switched and archived only after 1 hr.
>
> So, it stuck in a loop of switching the WAL sources from STREAM and ARCHIVE without replicating the primary.
>
> Due to this there will be write outage as the standby is synchronous standby.

I understand this problem and there's a proposed patch to help with
this - https://www.postgresql.org/message-id/CALj2ACVryN_PdFmQkbhga1VeW10VgQ4Lv9JXO=3nJkvZT8qgfA@mail.gmail.com.

It basically allows one to set a timeout on how long the standby keeps
restoring from the archive before switching to stream. Therefore, in
your case, the standby doesn't have to wait 1 hr to reconnect to the
primary; it can connect before that.
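
As a rough sketch of the idea (a simplified model of the behavior
described in this message, not the patch itself or the server's actual
state machine), the decision could look like the following. The names
WalSource, next_wal_source and retry_interval are hypothetical and
chosen only for illustration.

```python
from enum import Enum


class WalSource(Enum):
    ARCHIVE = "archive"
    STREAM = "stream"


def next_wal_source(current, stream_failed, archive_exhausted,
                    seconds_in_archive, retry_interval=None):
    """Pick the WAL source for the next attempt (simplified model).

    retry_interval is a hypothetical timeout in seconds; None models the
    behavior without the proposed feature.
    """
    if current is WalSource.STREAM:
        # A failed streaming attempt falls back to the archive.
        return WalSource.ARCHIVE if stream_failed else WalSource.STREAM
    # Currently restoring from the archive.
    if archive_exhausted:
        return WalSource.STREAM
    if retry_interval is not None and seconds_in_archive >= retry_interval:
        # With a timeout, retry streaming before the archive is exhausted.
        return WalSource.STREAM
    return WalSource.ARCHIVE
```

Without a timeout the model stays on the archive until it is exhausted;
with retry_interval=60 it goes back to stream after a minute.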

> We are using “wal_sync_method” as “fsync” assuming WAL file not getting flushed correctly.
>
> But this is happening even after making it as “fsync” instead of “fdatasync”.

I don't think that's the problem, as long as wal_sync_method isn't
being changed to something else in between.

> Restarting the STANDBY sometimes fixes this problem, but detecting this automatically is a big problem as the
> postgres standby process will be still running fine, but WAL RECEIVER process is up and down continuously due to
> switching of WAL sources.

Yes, after failing to connect to the primary, the standby switches to
archive and stays there until it exhausts all the WAL in the archive,
and only then switches to stream. You can monitor the standby's
replication slot on the primary; if it's inactive, then one needs to
jump in. As mentioned above, there's an in-progress feature that helps
in these cases.
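
A minimal sketch of such a check, assuming the standby streams through
a replication slot and that some client code (not shown) runs the query
on the primary and hands back (slot_name, active) rows.
pg_replication_slots is the standard system view; the helper name
inactive_slots is made up for illustration.

```python
# Query to run on the primary. Each row describes one replication slot;
# "active" is true while a walsender is currently using the slot.
INACTIVE_SLOTS_QUERY = "SELECT slot_name, active FROM pg_replication_slots"


def inactive_slots(rows):
    """Return the names of slots whose 'active' flag is false.

    rows is an iterable of (slot_name, active) tuples, as a database
    driver would typically return for INACTIVE_SLOTS_QUERY.
    """
    return [slot_name for slot_name, active in rows if not active]
```

Since the WAL receiver keeps going up and down in this scenario, an
alert would probably want the slot to be inactive across a few
consecutive polls rather than on a single sample.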

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


