Infinite loop in XLogPageRead() on standby - Mailing list pgsql-hackers

From Alexander Kukushkin
Subject Infinite loop in XLogPageRead() on standby
Date
Msg-id CAFh8B=mozC+e1wGJq0H=0O65goZju+6ab5AU7DEWCSUA2OtwDg@mail.gmail.com
Whole thread Raw
Responses Re: Infinite loop in XLogPageRead() on standby
List pgsql-hackers
Hello hackers,

I spent some time debugging an issue with standby not being able to continue streaming after failover.

The problem manifests itself by following messages in the log:
LOG:  received SIGHUP, reloading configuration files
LOG:  parameter "primary_conninfo" changed to "port=58669 host=/tmp/dn20WVmNqF"
LOG:  restored log file "000000010000000000000003" from archive
LOG:  invalid magic number 0000 in WAL segment 000000010000000000000003, LSN 0/301A000, offset 106496
LOG:  fetching timeline history file for timeline 2 from primary server
LOG:  started streaming WAL from primary at 0/3000000 on timeline 1
LOG:  replication terminated by primary server
DETAIL:  End of WAL reached on timeline 1 at 0/3019158.
FATAL:  terminating walreceiver process due to administrator command
LOG:  restored log file "00000002.history" from archive
LOG:  new target timeline is 2
LOG:  restored log file "000000020000000000000003" from archive
LOG:  invalid magic number 0000 in WAL segment 000000020000000000000003, LSN 0/301A000, offset 106496
LOG:  started streaming WAL from primary at 0/3000000 on timeline 2
LOG:  invalid magic number 0000 in WAL segment 000000020000000000000003, LSN 0/301A000, offset 106496
FATAL:  terminating walreceiver process due to administrator command
LOG:  waiting for WAL to become available at 0/301A04E
LOG:  restored log file "000000020000000000000003" from archive
LOG:  invalid magic number 0000 in WAL segment 000000020000000000000003, LSN 0/301A000, offset 106496
LOG:  invalid magic number 0000 in WAL segment 000000020000000000000003, LSN 0/301A000, offset 106496
LOG:  waiting for WAL to become available at 0/301A04E
LOG:  restored log file "000000020000000000000003" from archive
LOG:  invalid magic number 0000 in WAL segment 000000020000000000000003, LSN 0/301A000, offset 106496
LOG:  invalid magic number 0000 in WAL segment 000000020000000000000003, LSN 0/301A000, offset 106496

The problem happens when standbys received only the first part of the WAL record that spans multiple pages.
In this case the promoted standby discards the first part of the WAL record and writes END_OF_RECOVERY instead. If in addition to that someone will call pg_switch_wal(), then there are chances that SWITCH record will also fit to the page where the discarded part was settling, As a result the other standby (that wasn't promoted) will infinitely try making attempts to decode WAL record span on multiple pages by reading the next page, which is filled with zero bytes. And, this next page will never be written, because the new primary will be writing to the new WAL file after pg_switch_wal().

Restart of the stuck standby fixes the problem, because it will be first reading the history file and therefore will never read the incomplete WAL file from the old timeline. That is, all major versions starting from v13 are impacted (including the master branch), because they allow changing of primary_conninfo GUC with reload.

Please find attached the TAP test that reproduces the problem.

To be honest, I don't know yet how to fix it nicely. I am thinking about returning XLREAD_FAIL from XLogPageRead() if it suddenly switched to a new timeline while trying to read a page and if this page is invalid.

--
Regards,
--
Alexander Kukushkin
Attachment

pgsql-hackers by date:

Previous
From: jian he
Date:
Subject: Re: POC, WIP: OR-clause support for indexes
Next
From: Alena Rybakina
Date:
Subject: Re: POC, WIP: OR-clause support for indexes