Re: Infinite loop in XLogPageRead() on standby - Mailing list pgsql-hackers
| From | Michael Paquier |
|---|---|
| Subject | Re: Infinite loop in XLogPageRead() on standby |
| Date | |
| Msg-id | ZeAQi1UMyyWkspiT@paquier.xyz |
| In response to | Infinite loop in XLogPageRead() on standby (Alexander Kukushkin <cyberdemn@gmail.com>) |
| Responses | Re: Infinite loop in XLogPageRead() on standby; Re: Infinite loop in XLogPageRead() on standby |
| List | pgsql-hackers |
On Wed, Feb 28, 2024 at 11:19:41AM +0100, Alexander Kukushkin wrote:
> I spent some time debugging an issue with a standby not being able to
> continue streaming after failover.
>
> The problem happens when a standby has received only the first part of a
> WAL record that spans multiple pages. In this case the promoted standby
> discards the first part of the WAL record and writes END_OF_RECOVERY
> instead. If, in addition to that, someone calls pg_switch_wal(), there is
> a chance that the SWITCH record will also fit on the page where the
> discarded part was sitting. As a result, the other standby (the one that
> wasn't promoted) will keep trying forever to decode the WAL record
> spanning multiple pages by reading the next page, which is filled with
> zero bytes. And this next page will never be written, because the new
> primary will be writing to a new WAL file after pg_switch_wal().

Wow.  Have you seen that in an actual production environment?

I was just trying your TAP test to see it looping on a single record as
you mentioned:

2024-02-29 12:57:44.884 JST [2555] LOG:  invalid magic number 0000 in WAL segment 000000020000000000000003, LSN 0/301A000, offset 106496
2024-02-29 12:57:44.884 JST [2555] LOG:  invalid magic number 0000 in WAL segment 000000020000000000000003, LSN 0/301A000, offset 106496

> Restart of the stuck standby fixes the problem, because it will first
> read the history file and therefore will never read the incomplete WAL
> file from the old timeline. That is, all major versions starting from v13
> are impacted (including the master branch), because they allow changing
> the primary_conninfo GUC with a reload.

Still, that's not nice at a large scale, because you would not know about
the problem until your monitoring tools raise alarms when some nodes in
your cluster setup start to lag behind.

> Please find attached the TAP test that reproduces the problem.

my $start_page = start_of_page($end_lsn);
my $wal_file = write_wal($primary, $TLI, $start_page,
	"\x00" x $WAL_BLOCK_SIZE);
# copy the file we just "hacked" to the archive
copy($wal_file, $primary->archive_dir);

So you are emulating a failure by filling with zeros the second page,
where the last emit_message() generated a record, and the page before
that includes the continuation record.  Then you abuse WAL archiving to
force the replay of the last record.  That's kind of cool.

> To be honest, I don't know yet how to fix it nicely. I am thinking about
> returning XLREAD_FAIL from XLogPageRead() if it suddenly switched to a new
> timeline while trying to read a page and if this page is invalid.

Hmm.  I suspect that you may be right about handling a TLI change when
reading a page.  There are a bunch of side cases with continuation records
and header validation around XLogReaderValidatePageHeader().  Perhaps you
have an idea of a patch to show your point?

Nit: in your test, it seems to me that you should not call set_standby_mode
and enable_restoring directly, and should instead rely on has_restoring
with the standby option included.
--
Michael
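A rough outline of the failure sequence described at the top of the thread,
using PostgreSQL::Test::Cluster methods; $standby1 and $standby2 are
illustrative node names, not code from the attached test:

# Assumed context: $standby1 and $standby2 both stream from the same primary,
# and the primary has just died in the middle of a multi-page WAL record.
$standby1->promote;    # the promoted standby discards the partial record
                       # and writes an END_OF_RECOVERY record instead
$standby1->safe_psql('postgres', 'SELECT pg_switch_wal();');
# If the SWITCH record fits on the page that held the discarded continuation,
# $standby2 keeps re-reading the following page, which stays zero-filled,
# and loops in XLogPageRead() until it is restarted.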
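And a minimal sketch of the setup the nit suggests, assuming the standby is
created from a base backup of the primary (node and backup names here are
illustrative):

use PostgreSQL::Test::Cluster;

my $primary = PostgreSQL::Test::Cluster->new('primary');
$primary->init(allows_streaming => 1, has_archiving => 1);
$primary->start;
$primary->backup('my_backup');

# has_restoring sets up restore_command for the new node, and standby => 1
# makes it a standby, so set_standby_mode()/enable_restoring() do not need
# to be called directly by the test.
my $standby = PostgreSQL::Test::Cluster->new('standby');
$standby->init_from_backup($primary, 'my_backup',
	has_restoring => 1, standby => 1);
$standby->start;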