Re: LOG: invalid record length at : wanted 24, got 0 - Mailing list pgsql-hackers
From | Bharath Rupireddy |
---|---|
Subject | Re: LOG: invalid record length at |
Date | |
Msg-id | CALj2ACW+vHcgZntw_JHtWfkDx64JS3_eiQNoRgNynK-uGM2j5A@mail.gmail.com |
In response to |
LOG: invalid record length at |
Responses |
Re: LOG: invalid record length at |
List | pgsql-hackers |
On Wed, Mar 1, 2023 at 10:51 AM Harinath Kanchu <hkanchu@apple.com> wrote:
>
> Hello,
>
> We are seeing an interesting STANDBY behavior, that’s happening once in 3-4 days.
>
> The standby suddenly disconnects from the primary, and it throws the error “LOG: invalid record length at <LSN>: wanted 24, got 0”.

Firstly, this isn't an error per se, especially for a standby, as it can get/retry the same WAL record from other sources. It's hard to say anything further just by looking at this LOG message; one needs to look at what's happening around the same time. You mentioned that the connection to the primary was lost, so you need to dig into why it got lost. If the connection was lost half-way through fetching the WAL record, the standby may emit such a LOG message.

Secondly, you definitely need to understand why the connection to the primary keeps getting lost: network disruption, parameter changes, the primary going down, the standby going down, etc.

> And then it tries to restore the WAL file from the archive. Due to low write activity on primary, the WAL file will be switched and archived only after 1 hr.
>
> So, it stuck in a loop of switching the WAL sources from STREAM and ARCHIVE without replicating the primary.
>
> Due to this there will be write outage as the standby is synchronous standby.

I understand this problem and there's a proposed patch to help with this - https://www.postgresql.org/message-id/CALj2ACVryN_PdFmQkbhga1VeW10VgQ4Lv9JXO=3nJkvZT8qgfA@mail.gmail.com. It basically allows one to set a timeout on how long the standby may keep restoring from the archive before switching to stream. Therefore, in your case, the standby doesn't have to wait for 1 hr to connect to the primary; it can connect before that.

> We are using “wal_sync_method” as “fsync” assuming WAL file not getting flushed correctly.
>
> But this is happening even after making it as “fsync” instead of “fdatasync”.

I don't think that's a problem, unless wal_sync_method is changed to something else in between.

> Restarting the STANDBY sometimes fixes this problem, but detecting this automatically is a big problem as the postgres standby process will be still running fine, but WAL RECEIVER process is up and down continuously due to switching of WAL sources.

Yes, after failing to connect to the primary, the standby switches to archive and stays there until it exhausts all the WAL from the archive, and only then switches to stream. You can monitor the standby's replication slot on the primary; if it's inactive, someone needs to step in. As mentioned above, there's an in-progress feature that helps in these cases.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
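
The slot-monitoring suggestion in the reply can be sketched roughly as follows. This is only an illustration, not something from the thread: it assumes the standby streams through a physical replication slot, that some client on the primary runs the query against the `pg_replication_slots` view periodically, and that the slot name `standby1` and the threshold of consecutive inactive polls are hypothetical values to be adapted. Counting consecutive polls rather than alerting on a single one is a design choice here, since the WAL receiver may come and go briefly.

```python
from typing import Iterable, Tuple

# Query to run on the primary with whatever client is in use.
# pg_replication_slots exposes slot_name, slot_type and active.
SLOT_QUERY = (
    "SELECT slot_name, active "
    "FROM pg_replication_slots "
    "WHERE slot_type = 'physical'"
)


class SlotWatcher:
    """Tracks how many polls in a row a slot has been seen inactive."""

    def __init__(self, slot_name: str, max_inactive_polls: int = 3) -> None:
        # slot_name and max_inactive_polls are hypothetical settings.
        self.slot_name = slot_name
        self.max_inactive_polls = max_inactive_polls
        self.inactive_polls = 0

    def observe(self, rows: Iterable[Tuple[str, bool]]) -> bool:
        """Feed one result set of (slot_name, active) rows.

        Returns True once the slot has been inactive (or missing)
        for max_inactive_polls consecutive polls.
        """
        active = dict(rows).get(self.slot_name)
        if active:
            self.inactive_polls = 0
            return False
        # A missing slot is treated the same as an inactive one.
        self.inactive_polls += 1
        return self.inactive_polls >= self.max_inactive_polls
```

A caller would execute `SLOT_QUERY` on the primary at a fixed interval, pass the fetched rows to `observe`, and raise an alert when it returns True.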