Thread: pg_rewind: warn when checkpoint hasn't happened after promotion
A few weeks back I sent a bug report [1] directly to the -bugs mailing list, and I haven't seen any activity on it (maybe this is because I emailed directly instead of using the form?), but I got some time to take a look and concluded that a first-level fix is pretty simple. A quick background refresher: after promoting a standby rewinding the former primary requires that a checkpoint have been completed on the new primary after promotion. This is correctly documented. However pg_rewind incorrectly reports to the user that a rewind isn't necessary because the source and target are on the same timeline. Specifically, this happens when the control file on the newly promoted server looks like: Latest checkpoint's TimeLineID: 4 Latest checkpoint's PrevTimeLineID: 4 ... Min recovery ending loc's timeline: 5 Attached is a patch that detects this condition and reports it as an error to the user. In the spirit of the new-ish "ensure shutdown" functionality I could imagine extending this to automatically issue a checkpoint when this situation is detected. I haven't started to code that up, however, wanting to first get buy-in on that. Thanks, James Coleman 1: https://www.postgresql.org/message-id/CAAaqYe8b2DBbooTprY4v=BiZEd9qBqVLq+FD9j617eQFjk1KvQ@mail.gmail.com
Attachment
On Sat, Jun 4, 2022 at 6:29 PM James Coleman <jtc331@gmail.com> wrote: > > A few weeks back I sent a bug report [1] directly to the -bugs mailing > list, and I haven't seen any activity on it (maybe this is because I > emailed directly instead of using the form?), but I got some time to > take a look and concluded that a first-level fix is pretty simple. > > A quick background refresher: after promoting a standby rewinding the > former primary requires that a checkpoint have been completed on the > new primary after promotion. This is correctly documented. However > pg_rewind incorrectly reports to the user that a rewind isn't > necessary because the source and target are on the same timeline. > > Specifically, this happens when the control file on the newly promoted > server looks like: > > Latest checkpoint's TimeLineID: 4 > Latest checkpoint's PrevTimeLineID: 4 > ... > Min recovery ending loc's timeline: 5 > > Attached is a patch that detects this condition and reports it as an > error to the user. > > In the spirit of the new-ish "ensure shutdown" functionality I could > imagine extending this to automatically issue a checkpoint when this > situation is detected. I haven't started to code that up, however, > wanting to first get buy-in on that. > > 1: https://www.postgresql.org/message-id/CAAaqYe8b2DBbooTprY4v=BiZEd9qBqVLq+FD9j617eQFjk1KvQ@mail.gmail.com Thanks. I had a quick look over the issue and patch - just a thought - can't we let pg_rewind issue a checkpoint on the new primary instead of erroring out, maybe optionally? It might sound too much, but helps pg_rewind to be self-reliant i.e. avoiding external actor to detect the error and issue checkpoint the new primary to be able to successfully run pg_rewind on the pld primary and repair it to use it as a new standby. Regards, Bharath Rupireddy.
At Sat, 4 Jun 2022 19:09:41 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in > On Sat, Jun 4, 2022 at 6:29 PM James Coleman <jtc331@gmail.com> wrote: > > > > A few weeks back I sent a bug report [1] directly to the -bugs mailing > > list, and I haven't seen any activity on it (maybe this is because I > > emailed directly instead of using the form?), but I got some time to > > take a look and concluded that a first-level fix is pretty simple. > > > > A quick background refresher: after promoting a standby rewinding the > > former primary requires that a checkpoint have been completed on the > > new primary after promotion. This is correctly documented. However > > pg_rewind incorrectly reports to the user that a rewind isn't > > necessary because the source and target are on the same timeline. ... > > Attached is a patch that detects this condition and reports it as an > > error to the user. I have some random thoughts on this. There could be a problem in the case of gracefully shutdowned old-primary, so I think it is worth doing something if it can be in a simple way. However, I don't think we can simply rely on minRecoveryPoint to detect that situation, since it won't be reset on a standby. A standby also still can be the upstream of a cascading standby. So, as discussed in the thread for the comment [2], what we can do here would be simply waiting for the timelineID to advance, maybe having a timeout. In a case of single-step replication set, a checkpoint request to the primary makes the end-of-recovery checkpoint fast. It won't work as expected in cascading replicas, but it might be acceptable. > > In the spirit of the new-ish "ensure shutdown" functionality I could > > imagine extending this to automatically issue a checkpoint when this > > situation is detected. I haven't started to code that up, however, > > wanting to first get buy-in on that. > > > > 1: https://www.postgresql.org/message-id/CAAaqYe8b2DBbooTprY4v=BiZEd9qBqVLq+FD9j617eQFjk1KvQ@mail.gmail.com > > Thanks. I had a quick look over the issue and patch - just a thought - > can't we let pg_rewind issue a checkpoint on the new primary instead > of erroring out, maybe optionally? It might sound too much, but helps > pg_rewind to be self-reliant i.e. avoiding external actor to detect > the error and issue checkpoint the new primary to be able to > successfully run pg_rewind on the pld primary and repair it to use it > as a new standby. At the time of the discussion [2] for the it was the hinderance that that requires superuser privileges. Now that has been narrowed down to the pg_checkpointer privileges. If we know that the timeline IDs are different, we don't need to wait for a checkpoint. It seems to me that the exit status is significant. pg_rewind exits with 1 when an invalid option is given. I don't think it is great if we report this state by the same code. I don't think we always want to request a non-spreading checkpoint. [2] https://www.postgresql.org/message-id/flat/CABUevEz5bpvbwVsYCaSMV80CBZ5-82nkMzbb%2BBu%3Dh1m%3DrLdn%3Dg%40mail.gmail.com regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Sat, Jun 4, 2022 at 9:39 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Sat, Jun 4, 2022 at 6:29 PM James Coleman <jtc331@gmail.com> wrote: > > > > A few weeks back I sent a bug report [1] directly to the -bugs mailing > > list, and I haven't seen any activity on it (maybe this is because I > > emailed directly instead of using the form?), but I got some time to > > take a look and concluded that a first-level fix is pretty simple. > > > > A quick background refresher: after promoting a standby rewinding the > > former primary requires that a checkpoint have been completed on the > > new primary after promotion. This is correctly documented. However > > pg_rewind incorrectly reports to the user that a rewind isn't > > necessary because the source and target are on the same timeline. > > > > Specifically, this happens when the control file on the newly promoted > > server looks like: > > > > Latest checkpoint's TimeLineID: 4 > > Latest checkpoint's PrevTimeLineID: 4 > > ... > > Min recovery ending loc's timeline: 5 > > > > Attached is a patch that detects this condition and reports it as an > > error to the user. > > > > In the spirit of the new-ish "ensure shutdown" functionality I could > > imagine extending this to automatically issue a checkpoint when this > > situation is detected. I haven't started to code that up, however, > > wanting to first get buy-in on that. > > > > 1: https://www.postgresql.org/message-id/CAAaqYe8b2DBbooTprY4v=BiZEd9qBqVLq+FD9j617eQFjk1KvQ@mail.gmail.com > > Thanks. I had a quick look over the issue and patch - just a thought - > can't we let pg_rewind issue a checkpoint on the new primary instead > of erroring out, maybe optionally? It might sound too much, but helps > pg_rewind to be self-reliant i.e. avoiding external actor to detect > the error and issue checkpoint the new primary to be able to > successfully run pg_rewind on the pld primary and repair it to use it > as a new standby. That's what I had suggested as a "further improvement" option in the last paragraph :) But I think agreement on this more basic solution would still be good (even if I add the automatic checkpointing in this thread); given we currently explicitly mis-inform the user of pg_rewind, I think this is a bug that should be considered for backpatching, and the simpler "fail if detected" patch is probably the only thing we could backpatch. Thanks for taking a look, James Coleman
On Mon, Jun 6, 2022 at 1:26 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Sat, 4 Jun 2022 19:09:41 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in > > On Sat, Jun 4, 2022 at 6:29 PM James Coleman <jtc331@gmail.com> wrote: > > > > > > A few weeks back I sent a bug report [1] directly to the -bugs mailing > > > list, and I haven't seen any activity on it (maybe this is because I > > > emailed directly instead of using the form?), but I got some time to > > > take a look and concluded that a first-level fix is pretty simple. > > > > > > A quick background refresher: after promoting a standby rewinding the > > > former primary requires that a checkpoint have been completed on the > > > new primary after promotion. This is correctly documented. However > > > pg_rewind incorrectly reports to the user that a rewind isn't > > > necessary because the source and target are on the same timeline. > ... > > > Attached is a patch that detects this condition and reports it as an > > > error to the user. > > I have some random thoughts on this. > > There could be a problem in the case of gracefully shutdowned > old-primary, so I think it is worth doing something if it can be in a > simple way. > > However, I don't think we can simply rely on minRecoveryPoint to > detect that situation, since it won't be reset on a standby. A standby > also still can be the upstream of a cascading standby. So, as > discussed in the thread for the comment [2], what we can do here would be > simply waiting for the timelineID to advance, maybe having a timeout. To confirm I'm following you correctly, you're envisioning a situation like: - Primary A - Replica B replicating from primary - Replica C replicating from replica B then on failover from A to B you end up with: - Primary B - Replica C replication from primary - [needs rewind] A and you try to rewind A from C as the source? > In a case of single-step replication set, a checkpoint request to the > primary makes the end-of-recovery checkpoint fast. It won't work as > expected in cascading replicas, but it might be acceptable. "Won't work as expected" because there's no way to guarantee replication is caught up or even advancing? > > > In the spirit of the new-ish "ensure shutdown" functionality I could > > > imagine extending this to automatically issue a checkpoint when this > > > situation is detected. I haven't started to code that up, however, > > > wanting to first get buy-in on that. > > > > > > 1: https://www.postgresql.org/message-id/CAAaqYe8b2DBbooTprY4v=BiZEd9qBqVLq+FD9j617eQFjk1KvQ@mail.gmail.com > > > > Thanks. I had a quick look over the issue and patch - just a thought - > > can't we let pg_rewind issue a checkpoint on the new primary instead > > of erroring out, maybe optionally? It might sound too much, but helps > > pg_rewind to be self-reliant i.e. avoiding external actor to detect > > the error and issue checkpoint the new primary to be able to > > successfully run pg_rewind on the pld primary and repair it to use it > > as a new standby. > > At the time of the discussion [2] for the it was the hinderance that > that requires superuser privileges. Now that has been narrowed down > to the pg_checkpointer privileges. > > If we know that the timeline IDs are different, we don't need to wait > for a checkpoint. Correct. > It seems to me that the exit status is significant. pg_rewind exits > with 1 when an invalid option is given. I don't think it is great if > we report this state by the same code. I'm happy to change that; I only chose "1" as a placeholder for "non-zero exit status". > I don't think we always want to request a non-spreading checkpoint. I'm not familiar with the terminology "non-spreading checkpoint". > [2] https://www.postgresql.org/message-id/flat/CABUevEz5bpvbwVsYCaSMV80CBZ5-82nkMzbb%2BBu%3Dh1m%3DrLdn%3Dg%40mail.gmail.com I read through that thread, and one interesting idea stuck out to me: making "tiimeline IDs are the same" an error exit status. On the one hand that makes a certain amount of sense because it's unexpected. But on the other hand there are entirely legitimate situations where upon failover the timeline IDs happen to match (e.g., for use it happens some percentage of the time naturally as we are using sync replication and failovers often involve STONITHing the original primary, so it's entirely possible that the promoted replica begins with exactly the same WAL ending LSN from the primary before it stopped). Thanks, James Coleman
At Mon, 6 Jun 2022 08:32:01 -0400, James Coleman <jtc331@gmail.com> wrote in > On Mon, Jun 6, 2022 at 1:26 AM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > > > At Sat, 4 Jun 2022 19:09:41 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in > > > On Sat, Jun 4, 2022 at 6:29 PM James Coleman <jtc331@gmail.com> wrote: > > > > > > > > A few weeks back I sent a bug report [1] directly to the -bugs mailing > > > > list, and I haven't seen any activity on it (maybe this is because I > > > > emailed directly instead of using the form?), but I got some time to > > > > take a look and concluded that a first-level fix is pretty simple. > > > > > > > > A quick background refresher: after promoting a standby rewinding the > > > > former primary requires that a checkpoint have been completed on the > > > > new primary after promotion. This is correctly documented. However > > > > pg_rewind incorrectly reports to the user that a rewind isn't > > > > necessary because the source and target are on the same timeline. > > ... > > > > Attached is a patch that detects this condition and reports it as an > > > > error to the user. > > > > I have some random thoughts on this. > > > > There could be a problem in the case of gracefully shutdowned > > old-primary, so I think it is worth doing something if it can be in a > > simple way. > > > > However, I don't think we can simply rely on minRecoveryPoint to > > detect that situation, since it won't be reset on a standby. A standby > > also still can be the upstream of a cascading standby. So, as > > discussed in the thread for the comment [2], what we can do here would be > > simply waiting for the timelineID to advance, maybe having a timeout. > > To confirm I'm following you correctly, you're envisioning a situation like: > > - Primary A > - Replica B replicating from primary > - Replica C replicating from replica B > > then on failover from A to B you end up with: > > - Primary B > - Replica C replication from primary > - [needs rewind] A > > and you try to rewind A from C as the source? Yes. I think it is a legit use case. That being said, like other points, it might be acceptable. > > In a case of single-step replication set, a checkpoint request to the > > primary makes the end-of-recovery checkpoint fast. It won't work as > > expected in cascading replicas, but it might be acceptable. > > "Won't work as expected" because there's no way to guarantee > replication is caught up or even advancing? Maybe no. I meant that restartpoints don't run more frequently than the intervals of checkpoint_timeout even if checkpint records come more frequently. > > > > In the spirit of the new-ish "ensure shutdown" functionality I could > > > > imagine extending this to automatically issue a checkpoint when this > > > > situation is detected. I haven't started to code that up, however, > > > > wanting to first get buy-in on that. > > > > > > > > 1: https://www.postgresql.org/message-id/CAAaqYe8b2DBbooTprY4v=BiZEd9qBqVLq+FD9j617eQFjk1KvQ@mail.gmail.com > > > > > > Thanks. I had a quick look over the issue and patch - just a thought - > > > can't we let pg_rewind issue a checkpoint on the new primary instead > > > of erroring out, maybe optionally? It might sound too much, but helps > > > pg_rewind to be self-reliant i.e. avoiding external actor to detect > > > the error and issue checkpoint the new primary to be able to > > > successfully run pg_rewind on the pld primary and repair it to use it > > > as a new standby. > > > > At the time of the discussion [2] for the it was the hinderance that > > that requires superuser privileges. Now that has been narrowed down > > to the pg_checkpointer privileges. > > > > If we know that the timeline IDs are different, we don't need to wait > > for a checkpoint. > > Correct. > > > It seems to me that the exit status is significant. pg_rewind exits > > with 1 when an invalid option is given. I don't think it is great if > > we report this state by the same code. > > I'm happy to change that; I only chose "1" as a placeholder for > "non-zero exit status". > > > I don't think we always want to request a non-spreading checkpoint. > > I'm not familiar with the terminology "non-spreading checkpoint". Does "immediate checkpoint" works? That is, a checkpoint that runs at full-speed (i.e. with no delays between writes). > > [2] https://www.postgresql.org/message-id/flat/CABUevEz5bpvbwVsYCaSMV80CBZ5-82nkMzbb%2BBu%3Dh1m%3DrLdn%3Dg%40mail.gmail.com > > I read through that thread, and one interesting idea stuck out to me: > making "tiimeline IDs are the same" an error exit status. On the one > hand that makes a certain amount of sense because it's unexpected. But > on the other hand there are entirely legitimate situations where upon > failover the timeline IDs happen to match (e.g., for use it happens > some percentage of the time naturally as we are using sync replication > and failovers often involve STONITHing the original primary, so it's > entirely possible that the promoted replica begins with exactly the > same WAL ending LSN from the primary before it stopped). Yes that is true for most cases unless old primary written some records that had not sent to the standby before its death. So if we don't inspect WAL records (on the target cluster), we should always run rewinding even in the STONITH-killed (or immediate-shutdown) cases. One possible way to detect promotion reliably is to look into timeline history files. It is written immediately at promotion even on standbys. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
At Tue, 07 Jun 2022 12:39:38 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > One possible way to detect promotion reliably is to look into timeline > history files. It is written immediately at promotion even on > standbys. The attached seems to work. It uses timeline history files to identify the source timeline. With this change pg_waldump no longer need to wait for end-of-recovery to finish. (It lacks doc part and test.. But I'm not sure how we can test this behavior.) regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Attachment
On Tue, Jun 07, 2022 at 12:39:38PM +0900, Kyotaro Horiguchi wrote: > At Mon, 6 Jun 2022 08:32:01 -0400, James Coleman <jtc331@gmail.com> wrote in >> To confirm I'm following you correctly, you're envisioning a situation like: >> >> - Primary A >> - Replica B replicating from primary >> - Replica C replicating from replica B >> >> then on failover from A to B you end up with: >> >> - Primary B >> - Replica C replication from primary >> - [needs rewind] A >> >> and you try to rewind A from C as the source? > > Yes. I think it is a legit use case. That being said, like other > points, it might be acceptable. This configuration is a case supported by pg_rewind, meaning that your patch to check after minRecoveryPointTLI would be confusing when using a standby as a source because the checkpoint needs to apply on its primary to allow the TLI of the standby to go up. If you want to provide to the user more context, a more meaningful way may be to rely on an extra check for ControlFileData.state, I guess, as a promoted cluster is marked as DB_IN_PRODUCTION before recoveryMinPoint is cleared by the first post-promotion checkpoint, with DB_IN_ARCHIVE_RECOVERY for a cascading standby. -- Michael
Attachment
At Tue, 7 Jun 2022 16:16:09 +0900, Michael Paquier <michael@paquier.xyz> wrote in > On Tue, Jun 07, 2022 at 12:39:38PM +0900, Kyotaro Horiguchi wrote: > > At Mon, 6 Jun 2022 08:32:01 -0400, James Coleman <jtc331@gmail.com> wrote in > >> To confirm I'm following you correctly, you're envisioning a situation like: > >> > >> - Primary A > >> - Replica B replicating from primary > >> - Replica C replicating from replica B > >> > >> then on failover from A to B you end up with: > >> > >> - Primary B > >> - Replica C replication from primary > >> - [needs rewind] A > >> > >> and you try to rewind A from C as the source? > > > > Yes. I think it is a legit use case. That being said, like other > > points, it might be acceptable. > > This configuration is a case supported by pg_rewind, meaning that your > patch to check after minRecoveryPointTLI would be confusing when using > a standby as a source because the checkpoint needs to apply on its > primary to allow the TLI of the standby to go up. If you want to Yeah, that what I meant. > provide to the user more context, a more meaningful way may be to rely > on an extra check for ControlFileData.state, I guess, as a promoted > cluster is marked as DB_IN_PRODUCTION before recoveryMinPoint is > cleared by the first post-promotion checkpoint, with > DB_IN_ARCHIVE_RECOVERY for a cascading standby. Right. However, IIUC, checkpoint LSN/TLI is not updated at the time. The point of the minRecoveryPoint check is to confirm that we can read the timeline ID of the promoted source cluster from checkPointCopy.ThisTimeLineID. But we cannot do that yet at the time the cluster state moves to DB_IN_PRODUCTION. And a standby is in DB_IN_ARCHIVE_RECOVERY since before the upstream promotes. It also doesn't signal the reliability of checkPointCopy.ThisTimeLineID.. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
I think this is a good improvement and also like the option (on pg_rewind) to potentially send checkpoints to the source.
References https://github.com/sorintlab/stolon/issues/601
And the fix https://github.com/sorintlab/stolon/pull/644
---- On Sat, 04 Jun 2022 05:59:12 -0700 James Coleman <jtc331@gmail.com> wrote ----
A few weeks back I sent a bug report [1] directly to the -bugs mailing
list, and I haven't seen any activity on it (maybe this is because I
emailed directly instead of using the form?), but I got some time to
take a look and concluded that a first-level fix is pretty simple.
A quick background refresher: after promoting a standby rewinding the
former primary requires that a checkpoint have been completed on the
new primary after promotion. This is correctly documented. However
pg_rewind incorrectly reports to the user that a rewind isn't
necessary because the source and target are on the same timeline.
Specifically, this happens when the control file on the newly promoted
server looks like:
Latest checkpoint's TimeLineID: 4
Latest checkpoint's PrevTimeLineID: 4
...
Min recovery ending loc's timeline: 5
Attached is a patch that detects this condition and reports it as an
error to the user.
In the spirit of the new-ish "ensure shutdown" functionality I could
imagine extending this to automatically issue a checkpoint when this
situation is detected. I haven't started to code that up, however,
wanting to first get buy-in on that.
Thanks,
James Coleman
1: https://www.postgresql.org/message-id/CAAaqYe8b2DBbooTprY4v=BiZEd9qBqVLq+FD9j617eQFjk1KvQ@mail.gmail.com
At Tue, 07 Jun 2022 16:05:47 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > At Tue, 07 Jun 2022 12:39:38 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > > One possible way to detect promotion reliably is to look into timeline > > history files. It is written immediately at promotion even on > > standbys. > > The attached seems to work. It uses timeline history files to identify > the source timeline. With this change pg_waldump no longer need to > wait for end-of-recovery to finish. > > (It lacks doc part and test.. But I'm not sure how we can test this > behavior.) This is a revised version. Revised getTimelineHistory()'s logic (refactored, and changed so that it doesn't pick-up the wrong history files). perform_rewind always identify endtli based on source's timeline history. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Attachment
At Wed, 08 Jun 2022 18:15:09 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > At Tue, 07 Jun 2022 16:05:47 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > > At Tue, 07 Jun 2022 12:39:38 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > > > One possible way to detect promotion reliably is to look into timeline > > > history files. It is written immediately at promotion even on > > > standbys. > > > > The attached seems to work. It uses timeline history files to identify > > the source timeline. With this change pg_waldump no longer need to > > wait for end-of-recovery to finish. > > > > (It lacks doc part and test.. But I'm not sure how we can test this > > behavior.) > > This is a revised version. > > Revised getTimelineHistory()'s logic (refactored, and changed so that > it doesn't pick-up the wrong history files). > > perform_rewind always identify endtli based on source's timeline > history. No need to "search" history file to identify it. The latest timeline must be that. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Attachment
On Sat, Jun 4, 2022 at 8:59 AM James Coleman <jtc331@gmail.com> wrote: > A quick background refresher: after promoting a standby rewinding the > former primary requires that a checkpoint have been completed on the > new primary after promotion. This is correctly documented. However > pg_rewind incorrectly reports to the user that a rewind isn't > necessary because the source and target are on the same timeline. Is there anything intrinsic to the mechanism of operation of pg_rewind that requires a timeline change, or could we just rewind within the same timeline to an earlier LSN? In other words, maybe we could just remove this limitation of pg_rewind, and then perhaps it wouldn't be necessary to determine what the new timeline is. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Jul 5, 2022 at 2:39 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Sat, Jun 4, 2022 at 8:59 AM James Coleman <jtc331@gmail.com> wrote: > > A quick background refresher: after promoting a standby rewinding the > > former primary requires that a checkpoint have been completed on the > > new primary after promotion. This is correctly documented. However > > pg_rewind incorrectly reports to the user that a rewind isn't > > necessary because the source and target are on the same timeline. > > Is there anything intrinsic to the mechanism of operation of pg_rewind > that requires a timeline change, or could we just rewind within the > same timeline to an earlier LSN? In other words, maybe we could just > remove this limitation of pg_rewind, and then perhaps it wouldn't be > necessary to determine what the new timeline is. I think (someone can correct me if I'm wrong) that in theory the mechanisms would support the source and target being on the same timeline, but in practice that presents problems since you'd not have an LSN you could detect as the divergence point. If we allowed passing "rewind to" point LSN value, then that (again, as far as I understand it) would work, but it's a different use case. Specifically I wouldn't want that option to need to be used for this particular case since in my example there is in fact a real divergence point that we should be detecting automatically. Thanks, James Coleman
Robert Haas <robertmhaas@gmail.com> writes: > Is there anything intrinsic to the mechanism of operation of pg_rewind > that requires a timeline change, or could we just rewind within the > same timeline to an earlier LSN? In other words, maybe we could just > remove this limitation of pg_rewind, and then perhaps it wouldn't be > necessary to determine what the new timeline is. That seems like a fairly bad idea. For example, if you've already archived some WAL segments past the rewind target, there will shortly be two versions of truth about what that part of the WAL space contains, and your archiver will either spit up or do probably-the-wrong-thing. regards, tom lane
On Tue, Jul 5, 2022 at 2:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > Is there anything intrinsic to the mechanism of operation of pg_rewind > > that requires a timeline change, or could we just rewind within the > > same timeline to an earlier LSN? In other words, maybe we could just > > remove this limitation of pg_rewind, and then perhaps it wouldn't be > > necessary to determine what the new timeline is. > > That seems like a fairly bad idea. For example, if you've already > archived some WAL segments past the rewind target, there will shortly > be two versions of truth about what that part of the WAL space contains, > and your archiver will either spit up or do probably-the-wrong-thing. Well, only if you void the warranty. If you rewind the ex-primary to the LSN where the new primary is replaying and tell it to start replaying from there and follow the new primary's subsequent switch onto a new timeline, there's no split-brain problem. -- Robert Haas EDB: http://www.enterprisedb.com
At Tue, 5 Jul 2022 14:46:13 -0400, James Coleman <jtc331@gmail.com> wrote in > On Tue, Jul 5, 2022 at 2:39 PM Robert Haas <robertmhaas@gmail.com> wrote: > > > > On Sat, Jun 4, 2022 at 8:59 AM James Coleman <jtc331@gmail.com> wrote: > > > A quick background refresher: after promoting a standby rewinding the > > > former primary requires that a checkpoint have been completed on the > > > new primary after promotion. This is correctly documented. However > > > pg_rewind incorrectly reports to the user that a rewind isn't > > > necessary because the source and target are on the same timeline. > > > > Is there anything intrinsic to the mechanism of operation of pg_rewind > > that requires a timeline change, or could we just rewind within the > > same timeline to an earlier LSN? In other words, maybe we could just > > remove this limitation of pg_rewind, and then perhaps it wouldn't be > > necessary to determine what the new timeline is. > > I think (someone can correct me if I'm wrong) that in theory the > mechanisms would support the source and target being on the same > timeline, but in practice that presents problems since you'd not have > an LSN you could detect as the divergence point. If we allowed passing > "rewind to" point LSN value, then that (again, as far as I understand > it) would work, but it's a different use case. Specifically I wouldn't > want that option to need to be used for this particular case since in > my example there is in fact a real divergence point that we should be > detecting automatically. The point of pg_rewind is finding diverging point then finding all blocks modified in the dead history (from the diverging point) and "replace" them with those of the live history. In that sense, to be exact, pg_rewind does not "rewind" a cluster. If no diverging point, the last LSN of the cluster getting behind (as target cluster?) is that and just no need to replace a block at all because no WAL exists (on the cluster being behind) after the last LSN. The issue here is pg_rewind looks into control file to determine the soruce timeline, because the control file is not updated until the first checkpoint ends after promotion finishes, even though file blocks are already diverged. Even in that case history file for the new timeline is already created, so searching for the latest history file works. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: pg_rewind: warn when checkpoint hasn't happened after promotion
From
kuroda.keisuke@nttcom.co.jp
Date:
Hi, hackers > The issue here is pg_rewind looks into control file to determine the > soruce timeline, because the control file is not updated until the > first checkpoint ends after promotion finishes, even though file > blocks are already diverged. > > Even in that case history file for the new timeline is already > created, so searching for the latest history file works. I think this change is a good one because if I want pg_rewind to run automatically after a promotion, I don't have to wait for the checkpoint to complete. The attached patch is Horiguchi-san's patch with additional tests. The tests are based on James's tests, "010_no_checkpoint_after_promotion.pl" tests that pg_rewind is successfully executed without running checkpoint after promote. Best Regards, Keisuke Kuroda NTT COMWARE
Attachment
On 16/11/2022 07:17, kuroda.keisuke@nttcom.co.jp wrote: >> The issue here is pg_rewind looks into control file to determine the >> soruce timeline, because the control file is not updated until the >> first checkpoint ends after promotion finishes, even though file >> blocks are already diverged. >> >> Even in that case history file for the new timeline is already >> created, so searching for the latest history file works. > > I think this change is a good one because if I want > pg_rewind to run automatically after a promotion, > I don't have to wait for the checkpoint to complete. > > The attached patch is Horiguchi-san's patch with > additional tests. The tests are based on James's tests, > "010_no_checkpoint_after_promotion.pl" tests that > pg_rewind is successfully executed without running > checkpoint after promote. I fixed this last week in commit 009eeee746, see thread [1]. I'm sorry I didn't notice this thread earlier. I didn't realize that we had a notice about this in the docs. I'll go and remove that. Thanks! - Heikki
Re: pg_rewind: warn when checkpoint hasn't happened after promotion
From
kuroda.keisuke@nttcom.co.jp
Date:
hi Heikki, Thanks to mail, and thanks also for the commit(0a0500207a) to fix the document. I'm glad the problem was solved. Best Regards, Keisuke Kuroda NTT COMWARE 2023-02-27 16:33 に Heikki Linnakangas さんは書きました: > On 16/11/2022 07:17, kuroda.keisuke@nttcom.co.jp wrote: > > I fixed this last week in commit 009eeee746, see thread [1]. I'm sorry > I didn't notice this thread earlier. > > I didn't realize that we had a notice about this in the docs. I'll go > and remove that. Thanks! > > - Heikki
On Mon, Feb 27, 2023 at 2:33 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > On 16/11/2022 07:17, kuroda.keisuke@nttcom.co.jp wrote: > >> The issue here is pg_rewind looks into control file to determine the > >> soruce timeline, because the control file is not updated until the > >> first checkpoint ends after promotion finishes, even though file > >> blocks are already diverged. > >> > >> Even in that case history file for the new timeline is already > >> created, so searching for the latest history file works. > > > > I think this change is a good one because if I want > > pg_rewind to run automatically after a promotion, > > I don't have to wait for the checkpoint to complete. > > > > The attached patch is Horiguchi-san's patch with > > additional tests. The tests are based on James's tests, > > "010_no_checkpoint_after_promotion.pl" tests that > > pg_rewind is successfully executed without running > > checkpoint after promote. > > I fixed this last week in commit 009eeee746, see thread [1]. I'm sorry I > didn't notice this thread earlier. > > I didn't realize that we had a notice about this in the docs. I'll go > and remove that. Thanks! > > - Heikki > Thanks; I think the missing [1] (for reference) is: https://www.postgresql.org/message-id/9f568c97-87fe-a716-bd39-65299b8a60f4%40iki.fi James