Thread: Infinite loop in XLogPageRead() on standby
Hello hackers,
I spent some time debugging an issue with a standby not being able to continue streaming after failover.
The problem manifests itself with the following messages in the log:
LOG: received SIGHUP, reloading configuration files
LOG: parameter "primary_conninfo" changed to "port=58669 host=/tmp/dn20WVmNqF"
LOG: restored log file "000000010000000000000003" from archive
LOG: invalid magic number 0000 in WAL segment 000000010000000000000003, LSN 0/301A000, offset 106496
LOG: fetching timeline history file for timeline 2 from primary server
LOG: started streaming WAL from primary at 0/3000000 on timeline 1
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1 at 0/3019158.
FATAL: terminating walreceiver process due to administrator command
LOG: parameter "primary_conninfo" changed to "port=58669 host=/tmp/dn20WVmNqF"
LOG: restored log file "000000010000000000000003" from archive
LOG: invalid magic number 0000 in WAL segment 000000010000000000000003, LSN 0/301A000, offset 106496
LOG: fetching timeline history file for timeline 2 from primary server
LOG: started streaming WAL from primary at 0/3000000 on timeline 1
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1 at 0/3019158.
FATAL: terminating walreceiver process due to administrator command
LOG: restored log file "00000002.history" from archive
LOG: new target timeline is 2
LOG: restored log file "000000020000000000000003" from archive
LOG: invalid magic number 0000 in WAL segment 000000020000000000000003, LSN 0/301A000, offset 106496
LOG: started streaming WAL from primary at 0/3000000 on timeline 2
LOG: invalid magic number 0000 in WAL segment 000000020000000000000003, LSN 0/301A000, offset 106496
FATAL: terminating walreceiver process due to administrator command
LOG: waiting for WAL to become available at 0/301A04E
LOG: restored log file "000000020000000000000003" from archive
LOG: invalid magic number 0000 in WAL segment 000000020000000000000003, LSN 0/301A000, offset 106496
LOG: invalid magic number 0000 in WAL segment 000000020000000000000003, LSN 0/301A000, offset 106496
LOG: waiting for WAL to become available at 0/301A04E
LOG: restored log file "000000020000000000000003" from archive
LOG: invalid magic number 0000 in WAL segment 000000020000000000000003, LSN 0/301A000, offset 106496
LOG: invalid magic number 0000 in WAL segment 000000020000000000000003, LSN 0/301A000, offset 106496
LOG: new target timeline is 2
LOG: restored log file "000000020000000000000003" from archive
LOG: invalid magic number 0000 in WAL segment 000000020000000000000003, LSN 0/301A000, offset 106496
LOG: started streaming WAL from primary at 0/3000000 on timeline 2
LOG: invalid magic number 0000 in WAL segment 000000020000000000000003, LSN 0/301A000, offset 106496
FATAL: terminating walreceiver process due to administrator command
LOG: waiting for WAL to become available at 0/301A04E
LOG: restored log file "000000020000000000000003" from archive
LOG: invalid magic number 0000 in WAL segment 000000020000000000000003, LSN 0/301A000, offset 106496
LOG: invalid magic number 0000 in WAL segment 000000020000000000000003, LSN 0/301A000, offset 106496
LOG: waiting for WAL to become available at 0/301A04E
LOG: restored log file "000000020000000000000003" from archive
LOG: invalid magic number 0000 in WAL segment 000000020000000000000003, LSN 0/301A000, offset 106496
LOG: invalid magic number 0000 in WAL segment 000000020000000000000003, LSN 0/301A000, offset 106496
The problem happens when standbys have received only the first part of a WAL record that spans multiple pages.
In this case the promoted standby discards the first part of the WAL record and writes an END_OF_RECOVERY record instead. If, in addition to that, someone calls pg_switch_wal(), there is a chance that the SWITCH record will also fit on the page where the discarded part was sitting. As a result, the other standby (the one that wasn't promoted) will keep trying forever to decode the WAL record spanning multiple pages by reading the next page, which is filled with zero bytes. And this next page will never be written, because the new primary will be writing to a new WAL file after pg_switch_wal().
Restarting the stuck standby fixes the problem, because it first reads the history file and therefore never reads the incomplete WAL file from the old timeline. All major versions starting from v13 are impacted (including the master branch), because they allow changing the primary_conninfo GUC with a reload.
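For reference, the numbers in the messages above are self-consistent: assuming the default 8 kB WAL block size and 16 MB segments (an assumption about this particular setup), the LSN 0/301A04E the standby keeps waiting for lies on the page starting at 0/301A000, i.e. offset 106496 within segment number 3, which matches the "invalid magic number" messages. A tiny standalone sketch of that arithmetic:

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
    /* LSN 0/301A04E from "waiting for WAL to become available" above */
    uint64_t    lsn = 0x0301A04EULL;
    uint64_t    blcksz = 8192;              /* default XLOG_BLCKSZ */
    uint64_t    segsz = 16 * 1024 * 1024;   /* default wal_segment_size */

    uint64_t    page_start = lsn - (lsn % blcksz);  /* 0/301A000 */
    uint64_t    seg_offset = page_start % segsz;    /* 106496 */
    uint64_t    segno = page_start / segsz;         /* segment 3 */

    printf("page start 0/%llX, offset %llu, segno %llu\n",
           (unsigned long long) page_start,
           (unsigned long long) seg_offset,
           (unsigned long long) segno);
    return 0;
}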
Please find attached the TAP test that reproduces the problem.
To be honest, I don't know yet how to fix it nicely. I am thinking about returning XLREAD_FAIL from XLogPageRead() if it has switched to a new timeline while trying to read a page and that page turns out to be invalid.
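Purely to illustrate the idea (this is not a patch; tli_has_changed is a made-up placeholder for whatever condition ends up detecting the timeline switch), I mean something in XLogPageRead() along these lines:

    if (StandbyMode && tli_has_changed &&   /* placeholder condition */
        !XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
        return XLREAD_FAIL;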
--
Regards,
--
Alexander Kukushkin
On Wed, Feb 28, 2024 at 11:19:41AM +0100, Alexander Kukushkin wrote:
> I spent some time debugging an issue with standby not being able to
> continue streaming after failover.
>
> The problem happens when standbys received only the first part of the WAL
> record that spans multiple pages.
> In this case the promoted standby discards the first part of the WAL record
> and writes END_OF_RECOVERY instead. If in addition to that someone will
> call pg_switch_wal(), then there are chances that SWITCH record will also
> fit to the page where the discarded part was settling, As a result the
> other standby (that wasn't promoted) will infinitely try making attempts to
> decode WAL record span on multiple pages by reading the next page, which is
> filled with zero bytes. And, this next page will never be written, because
> the new primary will be writing to the new WAL file after pg_switch_wal().

Wow. Have you seen that in an actual production environment?

I was just trying your TAP test to see it looping on a single record as
you mentioned:

2024-02-29 12:57:44.884 JST [2555] LOG: invalid magic number 0000 in WAL segment 000000020000000000000003, LSN 0/301A000, offset 106496
2024-02-29 12:57:44.884 JST [2555] LOG: invalid magic number 0000 in WAL segment 000000020000000000000003, LSN 0/301A000, offset 106496

> Restart of the stuck standby fixes the problem, because it will be first
> reading the history file and therefore will never read the incomplete WAL
> file from the old timeline. That is, all major versions starting from v13
> are impacted (including the master branch), because they allow changing of
> primary_conninfo GUC with reload.

Still that's not nice at a large scale, because you would not know about
the problem until your monitoring tools raise alarms because some nodes
in your cluster setup decide to lag behind.

> Please find attached the TAP test that reproduces the problem.

my $start_page = start_of_page($end_lsn);
my $wal_file = write_wal($primary, $TLI, $start_page,
    "\x00" x $WAL_BLOCK_SIZE);
# copy the file we just "hacked" to the archive
copy($wal_file, $primary->archive_dir);

So you are emulating a failure by filling with zeros the second page
where the last emit_message() generated a record, and the page before
that includes the continuation record. Then abuse of WAL archiving to
force the replay of the last record. That's kind of cool.

> To be honest, I don't know yet how to fix it nicely. I am thinking about
> returning XLREAD_FAIL from XLogPageRead() if it suddenly switched to a new
> timeline while trying to read a page and if this page is invalid.

Hmm. I suspect that you may be right on a TLI change when reading a
page. There are a bunch of side cases with continuation records and
header validation around XLogReaderValidatePageHeader(). Perhaps you
have an idea of patch to show your point?

Nit. In your test, it seems to me that you should not call directly
set_standby_mode and enable_restoring, just rely on has_restoring with
the standby option included.
--
Michael
At Thu, 29 Feb 2024 14:05:15 +0900, Michael Paquier <michael@paquier.xyz> wrote in
> On Wed, Feb 28, 2024 at 11:19:41AM +0100, Alexander Kukushkin wrote:
> > I spent some time debugging an issue with standby not being able to
> > continue streaming after failover.
> >
> > The problem happens when standbys received only the first part of the WAL
> > record that spans multiple pages.
> > In this case the promoted standby discards the first part of the WAL record
> > and writes END_OF_RECOVERY instead. If in addition to that someone will
> > call pg_switch_wal(), then there are chances that SWITCH record will also
> > fit to the page where the discarded part was settling, As a result the
> > other standby (that wasn't promoted) will infinitely try making attempts to
> > decode WAL record span on multiple pages by reading the next page, which is
> > filled with zero bytes. And, this next page will never be written, because
> > the new primary will be writing to the new WAL file after pg_switch_wal().

In the first place, it's important to note that we do not guarantee
that an async standby can always switch its replication connection to
the old primary or another sibling standby. This is due to the
variations in replication lag among standbys. pg_rewind is required to
adjust such discrepancies.

I might be overlooking something, but I don't understand how this
occurs without purposefully tweaking WAL files. The repro script
pushes an incomplete WAL file to the archive as a non-partial
segment. This shouldn't happen in the real world.

In the repro script, the replication connection of the second standby
is switched from the old primary to the first standby after its
promotion. After the switching, replication is expected to continue
from the beginning of the last replayed segment. But with the script,
the second standby copies the intentionally broken file, which differs
from the data that should be received via streaming.

A similar problem to the issue here was seen at segment boundaries,
before we introduced the XLP_FIRST_IS_OVERWRITE_CONTRECORD flag, which
prevents overwriting a WAL file that is already archived. However, in
this case, the second standby won't see the broken record because it
cannot be in a non-partial segment in the archive, and the new primary
streams END_OF_RECOVERY instead of the broken record.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Hi Michael,
On Thu, 29 Feb 2024 at 06:05, Michael Paquier <michael@paquier.xyz> wrote:
> Wow. Have you seen that in an actual production environment?
Yes, we see it regularly, and it is reproducible in test environments as well.
> my $start_page = start_of_page($end_lsn);
> my $wal_file = write_wal($primary, $TLI, $start_page,
>     "\x00" x $WAL_BLOCK_SIZE);
> # copy the file we just "hacked" to the archive
> copy($wal_file, $primary->archive_dir);
>
> So you are emulating a failure by filling with zeros the second page
> where the last emit_message() generated a record, and the page before
> that includes the continuation record. Then abuse of WAL archiving to
> force the replay of the last record. That's kind of cool.
Right, doing it this way is easier than causing an artificial crash on the primary right after it has finished writing just one page.
> > To be honest, I don't know yet how to fix it nicely. I am thinking about
> > returning XLREAD_FAIL from XLogPageRead() if it suddenly switched to a new
> > timeline while trying to read a page and if this page is invalid.
>
> Hmm. I suspect that you may be right on a TLI change when reading a
> page. There are a bunch of side cases with continuation records and
> header validation around XLogReaderValidatePageHeader(). Perhaps you
> have an idea of patch to show your point?
Not yet, but hopefully I will get something done next week.
> Nit. In your test, it seems to me that you should not call directly
> set_standby_mode and enable_restoring, just rely on has_restoring with
> the standby option included.
Thanks, I'll look into it.
Regards,
--
Alexander Kukushkin
Hi Kyotaro,
On Thu, 29 Feb 2024 at 08:18, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> In the first place, it's important to note that we do not guarantee
> that an async standby can always switch its replication connection to
> the old primary or another sibling standby. This is due to the
> variations in replication lag among standbys. pg_rewind is required to
> adjust such discrepancies.
Sure, I know. But in this case the async standby received and flushed absolutely the same amount of WAL as the promoted one.
> I might be overlooking something, but I don't understand how this
> occurs without purposefully tweaking WAL files. The repro script
> pushes an incomplete WAL file to the archive as a non-partial
> segment. This shouldn't happen in the real world.
It easily happens if the primary crashes and the standbys didn't receive the next page with the continuation record.
> In the repro script, the replication connection of the second standby
> is switched from the old primary to the first standby after its
> promotion. After the switching, replication is expected to continue
> from the beginning of the last replayed segment.
Well, maybe, but apparently the standby is busy trying to decode a record that spans multiple pages, and it is just infinitely waiting for the next page to arrive. Also, the restart "fixes" the problem, because indeed it is reading the file from the beginning.
> But with the script,
> the second standby copies the intentionally broken file, which differs
> from the data that should be received via streaming.
As I already said, this is a simple way to emulate a primary crash while the standbys are receiving WAL.
It could easily happen that a record spanning multiple pages is not fully received and flushed.
--
Regards,
--
Alexander Kukushkin
On Thu, Feb 29, 2024 at 05:44:25PM +0100, Alexander Kukushkin wrote:
> On Thu, 29 Feb 2024 at 08:18, Kyotaro Horiguchi <horikyota.ntt@gmail.com>
> wrote:
>> In the first place, it's important to note that we do not guarantee
>> that an async standby can always switch its replication connection to
>> the old primary or another sibling standby. This is due to the
>> variations in replication lag among standbys. pg_rewind is required to
>> adjust such discrepancies.
>
> Sure, I know. But in this case the async standby received and flushed
> absolutely the same amount of WAL as the promoted one.

Ugh. If it can happen transparently to the user without the user
knowing directly about it, that does not sound good to me. I did not
look very closely at monitoring tools available out there, but if both
standbys flushed the same WAL locations a rewind should not be
required. It is not something that monitoring tools would be able to
detect because they just look at LSNs.

>> In the repro script, the replication connection of the second standby
>> is switched from the old primary to the first standby after its
>> promotion. After the switching, replication is expected to continue
>> from the beginning of the last replayed segment.
>
> Well, maybe, but apparently the standby is busy trying to decode a record
> that spans multiple pages, and it is just infinitely waiting for the next
> page to arrive. Also, the restart "fixes" the problem, because indeed it is
> reading the file from the beginning.

What happens if the continuation record spawns across multiple segment
file boundaries in this case? We would go back to the beginning of the
segment where the record spawning across multiple segments began, right?
(I may recall this part of contrecords incorrectly, feel free to correct
me if necessary.)

>> But with the script,
>> the second standby copies the intentionally broken file, which differs
>> from the data that should be received via streaming.
>
> As I already said, this is a simple way to emulate the primary crash while
> standbys receiving WAL.
> It could easily happen that the record spans on multiple pages is not fully
> received and flushed.

I think that's OK to do so at test level to force a test in the
backend, FWIW, because that's cheaper, and 039_end_of_wal.pl has proved
that this can be designed to be cheap and stable across the buildfarm
fleet.

For anything like that, the result had better have solid test coverage,
where perhaps we'd better refactor some of the routines of
039_end_of_wal.pl into a module to use them here, rather than duplicate
the code. The other test has a few assumptions with the calculation of
page boundaries, and I'd rather not duplicate that across the tree.

Seeing that Alexander is a maintainer of Patroni, which is very
probably used by his employer across a large set of PostgreSQL
instances, if he says that he's seen that in the field, that's good
enough for me to respond to the problem, especially if reconnecting a
standby to a promoted node where both flushed the same LSN. Now the
level of response also depends on the invasiveness of the change, and
we need a very careful evaluation here.
--
Michael
At Fri, 1 Mar 2024 08:17:04 +0900, Michael Paquier <michael@paquier.xyz> wrote in
> On Thu, Feb 29, 2024 at 05:44:25PM +0100, Alexander Kukushkin wrote:
> > On Thu, 29 Feb 2024 at 08:18, Kyotaro Horiguchi <horikyota.ntt@gmail.com>
> > wrote:
> >> In the first place, it's important to note that we do not guarantee
> >> that an async standby can always switch its replication connection to
> >> the old primary or another sibling standby. This is due to the
> >> variations in replication lag among standbys. pg_rewind is required to
> >> adjust such discrepancies.
> >
> > Sure, I know. But in this case the async standby received and flushed
> > absolutely the same amount of WAL as the promoted one.
>
> Ugh. If it can happen transparently to the user without the user
> knowing directly about it, that does not sound good to me. I did not
> look very closely at monitoring tools available out there, but if both
> standbys flushed the same WAL locations a rewind should not be
> required. It is not something that monitoring tools would be able to
> detect because they just look at LSNs.
>
> >> In the repro script, the replication connection of the second standby
> >> is switched from the old primary to the first standby after its
> >> promotion. After the switching, replication is expected to continue
> >> from the beginning of the last replayed segment.
> >
> > Well, maybe, but apparently the standby is busy trying to decode a record
> > that spans multiple pages, and it is just infinitely waiting for the next
> > page to arrive. Also, the restart "fixes" the problem, because indeed it is
> > reading the file from the beginning.
>
> What happens if the continuation record spawns across multiple segment
> files boundaries in this case? We would go back to the beginning of
> the segment where the record spawning across multiple segments began,
> right? (I may recall this part of contrecords incorrectly, feel free
> to correct me if necessary.)

After reading this, I came up with a possibility that walreceiver
recovers more quickly than the calling interval to
WaitForWALtoBecomeAvailable(). If walreceiver disconnects after a call
to the function WaitForWAL...(), and then somehow recovers the
connection before the next call, the function doesn't notice the
disconnection and returns XLREAD_SUCCESS wrongly. If this assumption
is correct, the correct fix might be for us to return XLREAD_FAIL when
reconnection happens after the last call to the WaitForWAL...()
function.

> >> But with the script,
> >> the second standby copies the intentionally broken file, which differs
> >> from the data that should be received via streaming.
> >
> > As I already said, this is a simple way to emulate the primary crash while
> > standbys receiving WAL.
> > It could easily happen that the record spans on multiple pages is not fully
> > received and flushed.
>
> I think that's OK to do so at test level to force a test in the
> backend, FWIW, because that's cheaper, and 039_end_of_wal.pl has
> proved that this can be designed to be cheap and stable across the
> buildfarm fleet.

Yeah, I agree that it clearly illustrates the final state after the
issue happened, but if my assumption above is correct, the test doesn't
manifest the real issue.

> For anything like that, the result had better have solid test
> coverage, where perhaps we'd better refactor some of the routines of
> 039_end_of_wal.pl into a module to use them here, rather than
> duplicate the code. The other test has a few assumptions with the
> calculation of page boundaries, and I'd rather not duplicate that
> across the tree.

I agree to the point.

> Seeing that Alexander is a maintainer of Patroni, which is very
> probably used by his employer across a large set of PostgreSQL
> instances, if he says that he's seen that in the field, that's good
> enough for me to respond to the problem, especially if reconnecting a
> standby to a promoted node where both flushed the same LSN. Now the
> level of response also depends on the invasiness of the change, and we
> need a very careful evaluation here.

I don't mean to say that we should respond with DNF to this "issue"
at all. I simply wanted to make the real issue clear.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
At Fri, 01 Mar 2024 10:29:12 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> After reading this, I came up with a possibility that walreceiver
> recovers more quickly than the calling interval to
> WaitForWALtoBecomeAvailable(). If walreceiver disconnects after a call
> to the function WaitForWAL...(), and then somehow recovers the
> connection before the next call, the function doesn't notice the
> disconnection and returns XLREAD_SUCCESS wrongly. If this assumption
> is correct, the correct fix might be for us to return XLREAD_FAIL when
> reconnection happens after the last call to the WaitForWAL...()
> function.

That's my stupid. The function performs reconnection by itself.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
At Fri, 01 Mar 2024 12:04:31 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> At Fri, 01 Mar 2024 10:29:12 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> > After reading this, I came up with a possibility that walreceiver
> > recovers more quickly than the calling interval to
> > WaitForWALtoBecomeAvailable(). If walreceiver disconnects after a call
> > to the function WaitForWAL...(), and then somehow recovers the
> > connection before the next call, the function doesn't notice the
> > disconnection and returns XLREAD_SUCCESS wrongly. If this assumption
> > is correct, the correct fix might be for us to return XLREAD_FAIL when
> > reconnection happens after the last call to the WaitForWAL...()
> > function.
>
> That's my stupid. The function performs reconnection by itself.

Anyway, our current policy here is to avoid record-rereads beyond
source switches. However, fixing this seems to require that source
switches cause record rereads unless some additional information is
available to know if replay LSN needs to back up.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
At Fri, 01 Mar 2024 12:37:55 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> Anyway, our current policy here is to avoid record-rereads beyond
> source switches. However, fixing this seems to require that source
> switches cause record rereads unless some additional information is
> available to know if replay LSN needs to back up.

It seems to me that the error messages are related to commit 0668719801.

XLogPageRead:
> * Check the page header immediately, so that we can retry immediately if
> * it's not valid. This may seem unnecessary, because ReadPageInternal()
> * validates the page header anyway, and would propagate the failure up to
> * ReadRecord(), which would retry. However, there's a corner case with
> * continuation records, if a record is split across two pages such that
> * we would need to read the two pages from different sources. For
...
> if (StandbyMode &&
>     !XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
> {
>     /*
>      * Emit this error right now then retry this page immediately. Use
>      * errmsg_internal() because the message was already translated.
>      */
>     if (xlogreader->errormsg_buf[0])
>         ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
>                 (errmsg_internal("%s", xlogreader->errormsg_buf)));

This code intends to prevent a page header error from causing a record
reread, when a record is required to be read from multiple sources. We
could restrict this to only fire at segment boundaries. At segment
boundaries, we won't let LSN back up by using XLP_FIRST_IS_CONTRECORD.

Having thought up to this point, I now believe that we should
completely prevent LSN from going back in any case. One drawback is
that the fix cannot be back-patched.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Hello Michael, Kyotaro,
Please find attached the patch fixing the problem and the updated TAP test that addresses the nit.
--
Regards,
--
Alexander Kukushkin
At Tue, 5 Mar 2024 09:36:44 +0100, Alexander Kukushkin <cyberdemn@gmail.com> wrote in
> Please find attached the patch fixing the problem and the updated TAP test
> that addresses Nit.

Record-level retries happen when the upper layer detects errors. In my
previous mail, I cited code that is intended to prevent this at segment
boundaries. However, the resulting code applies to all page boundaries,
since we judged that the difference doesn't significantly affect the
outcome.

> * Check the page header immediately, so that we can retry immediately if
> * it's not valid. This may seem unnecessary, because ReadPageInternal()
> * validates the page header anyway, and would propagate the failure up to

So, the following (tentative) change should also work.

xlogrecovery.c:
@@ -3460,8 +3490,10 @@ retry:
         * responsible for the validation.
         */
        if (StandbyMode &&
+           targetPagePtr % 0x100000 == 0 &&
            !XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
        {

Thus, I managed to reproduce precisely the same situation as you
described utilizing your script with modifications and some core
tweaks, and with the change above, I saw that the behavior was
fixed. However, for reasons unclear to me, it shows another issue, and
I am running out of time and need more caffeine. I'll continue
investigating this tomorrow.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Hi Kyotaro,
Oh, now I understand what you mean. Is the retry supposed to happen only when we are reading the very first page from the WAL file?
On Wed, 6 Mar 2024 at 09:57, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> xlogrecovery.c:
> @@ -3460,8 +3490,10 @@ retry:
>          * responsible for the validation.
>          */
>         if (StandbyMode &&
> +           targetPagePtr % 0x100000 == 0 &&
>             !XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
>         {
Hmm, I think you meant to use wal_segment_size, because 0x100000 is just 1MB. As a result, currently it works for you by accident.
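For the record, what I would expect there is something along these lines (an untested sketch, only to spell out the segment-boundary intent):

    if (StandbyMode &&
        XLogSegmentOffset(targetPagePtr, wal_segment_size) == 0 &&
        !XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))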
> Thus, I managed to reproduce precisely the same situation as you
> described utilizing your script with modifications and some core
> tweaks, and with the change above, I saw that the behavior was
> fixed. However, for reasons unclear to me, it shows another issue, and
> I am running out of time and need more caffeine. I'll continue
> investigating this tomorrow.
Thank you for spending your time on it!
--
Regards,
--
Alexander Kukushkin
At Wed, 6 Mar 2024 11:34:29 +0100, Alexander Kukushkin <cyberdemn@gmail.com> wrote in
> Hmm, I think you meant to use wal_segment_size, because 0x100000 is just
> 1MB. As a result, currently it works for you by accident.

Oh, I once saw the fix work, but seems not to be working after some
point. The new issue was a corruption of received WAL records on the
first standby, and it may be related to the setting.

> > Thus, I managed to reproduce precisely the same situation as you
> > described utilizing your script with modifications and some core
> > tweaks, and with the change above, I saw that the behavior was
> > fixed. However, for reasons unclear to me, it shows another issue, and
> > I am running out of time and need more caffeine. I'll continue
> > investigating this tomorrow.
>
> Thank you for spending your time on it!

You're welcome, but I apologize for the delay in the work.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Mon, Mar 11, 2024 at 04:43:32PM +0900, Kyotaro Horiguchi wrote:
> At Wed, 6 Mar 2024 11:34:29 +0100, Alexander Kukushkin <cyberdemn@gmail.com> wrote in
>> Thank you for spending your time on it!
>
> You're welcome, but I aplogize for the delay in the work..

Thanks for spending time on this. Everybody is busy with the last
commit fest, and the next minor release set is planned for May so there
should still be time even after the feature freeze:
https://www.postgresql.org/developer/roadmap/

I should be able to come back to this thread around the beginning of
April. If you are able to do some progress in-between, that would be
surely helpful.

Thanks,
--
Michael
At Mon, 11 Mar 2024 16:43:32 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> Oh, I once saw the fix work, but seems not to be working after some
> point. The new issue was a corruption of received WAL records on the
> first standby, and it may be related to the setting.

I identified the cause of the second issue. When I tried to replay the
issue, the second standby accidentally received the old timeline's
last page-spanning record till the end while the first standby was
promoting (but it had not been read by recovery). In addition to that,
on the second standby, there's a time window where the timeline
increased but the first segment of the new timeline is not available
yet. In this case, the second standby successfully reads the
page-spanning record in the old timeline even after the second standby
noticed that the timeline ID has been increased, thanks to the
robustness of XLogFileReadAnyTLI().

I think the primary change to XLogPageRead that I suggested is correct
(assuming the use of wal_segment_size instead of the
constant). However, still XLogFileReadAnyTLI() has a chance to read
the segment from the old timeline after the second standby notices a
timeline switch, leading to the second issue. The second issue was
fixed by preventing XLogFileReadAnyTLI from reading segments from
older timelines than those suggested by the latest timeline
history. (In other words, disabling the "AnyTLI" part.)

I recall that there was a discussion for commit 4bd0ad9e44, about the
objective of allowing reading segments from older timelines than the
timeline history suggests. In my faint memory, we concluded to
postpone making the decision to remove the feature due to uncertainty
about the objective. If there's no clear reason to continue using
XLogFileReadAnyTLI(), I suggest we stop its use and instead adopt
XLogFileReadOnTLHistory(), which reads segments that align precisely
with the timeline history.

Of course, regardless of the changes above, if recovery on the second
standby had reached the end of the page-spanning record before
redirection to the first standby, it would need pg_rewind to connect
to the first standby.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Hi Kyotaro,
On Wed, 13 Mar 2024 at 03:56, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> I identified the cause of the second issue. When I tried to replay the
> issue, the second standby accidentally received the old timeline's
> last page-spanning record till the end while the first standby was
> promoting (but it had not been read by recovery). In addition to that,
> on the second standby, there's a time window where the timeline
> increased but the first segment of the new timeline is not available
> yet. In this case, the second standby successfully reads the
> page-spanning record in the old timeline even after the second standby
> noticed that the timeline ID has been increased, thanks to the
> robustness of XLogFileReadAnyTLI().
Hmm, I don't think it could really be prevented.
There is always a chance that a standby that is not ahead of the other standbys gets promoted, for reasons like:
1. The HA configuration doesn't allow certain nodes to be promoted.
2. It is an async standby (its name isn't listed in synchronous_standby_names) and it was ahead of the promoted sync standby, so there is no data loss from the client's point of view.
> Of course, regardless of the changes above, if recovery on the second
> standby had reached the end of the page-spanning record before
> redirection to the first standby, it would need pg_rewind to connect
> to the first standby.
Correct; IMO pg_rewind is the right way of solving it.
Regards,
--
Alexander Kukushkin
On Wed, 13 Mar 2024 at 04:56, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> At Mon, 11 Mar 2024 16:43:32 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> > Oh, I once saw the fix work, but seems not to be working after some
> > point. The new issue was a corruption of received WAL records on the
> > first standby, and it may be related to the setting.
>
> I identified the cause of the second issue. When I tried to replay the
> issue, the second standby accidentally received the old timeline's
> last page-spanning record till the end while the first standby was
> promoting (but it had not been read by recovery). In addition to that,
> on the second standby, there's a time window where the timeline
> increased but the first segment of the new timeline is not available
> yet. In this case, the second standby successfully reads the
> page-spanning record in the old timeline even after the second standby
> noticed that the timeline ID has been increased, thanks to the
> robustness of XLogFileReadAnyTLI().
>
> I think the primary change to XLogPageRead that I suggested is correct
> (assuming the use of wal_segment_size instead of the
> constant). However, still XLogFileReadAnyTLI() has a chance to read
> the segment from the old timeline after the second standby notices a
> timeline switch, leading to the second issue. The second issue was
> fixed by preventing XLogFileReadAnyTLI from reading segments from
> older timelines than those suggested by the latest timeline
> history. (In other words, disabling the "AnyTLI" part).
>
> I recall that there was a discussion for commit 4bd0ad9e44, about the
> objective of allowing reading segments from older timelines than the
> timeline history suggests. In my faint memory, we concluded to
> postpone making the decision to remove the feature due to uncertainity
> about the objective. If there's no clear reason to continue using
> XLogFileReadAnyTLI(), I suggest we stop its use and instead adopt
> XLogFileReadOnTLHistory(), which reads segments that align precisely
> with the timeline history.

This sounds very similar to the problem described in [1]. And I think
both will be resolved by that change.

[1] https://postgr.es/m/CANwKhkMN3QwAcvuDZHb6wsvLRtkweBiYso-KLFykkQVWuQLcOw%40mail.gmail.com
Hi Michael and Kyotaro,
Now that beta1 has been released, I hope you are not as busy, and hence I would like to follow up on this problem.
Regards,
--
Alexander Kukushkin
On Tue, Jun 04, 2024 at 04:16:43PM +0200, Alexander Kukushkin wrote:
> Now that beta1 was released I hope you are not so busy and hence would like
> to follow up on this problem.

I am still working on something for the v18 cycle that I'd like to
present before the beginning of the next commit fest, so I am a bit
busy to get that out first. Fingers crossed to not have open items to
look at..

This thread is one of the things I have marked as an item to look at,
yes.
--
Michael