Thread: pg_rewind: warn when checkpoint hasn't happened after promotion

pg_rewind: warn when checkpoint hasn't happened after promotion

From

James Coleman

Date:

04 June 2022, 12:59:12

A few weeks back I sent a bug report [1] directly to the -bugs mailing
list, and I haven't seen any activity on it (maybe this is because I
emailed directly instead of using the form?), but I got some time to
take a look and concluded that a first-level fix is pretty simple.

A quick background refresher: after promoting a standby rewinding the
former primary requires that a checkpoint have been completed on the
new primary after promotion. This is correctly documented. However
pg_rewind incorrectly reports to the user that a rewind isn't
necessary because the source and target are on the same timeline.

Specifically, this happens when the control file on the newly promoted
server looks like:

    Latest checkpoint's TimeLineID:       4
    Latest checkpoint's PrevTimeLineID:   4
    ...
    Min recovery ending loc's timeline:   5

Attached is a patch that detects this condition and reports it as an
error to the user.

In the spirit of the new-ish "ensure shutdown" functionality I could
imagine extending this to automatically issue a checkpoint when this
situation is detected. I haven't started to code that up, however,
wanting to first get buy-in on that.

Thanks,
James Coleman

1: https://www.postgresql.org/message-id/CAAaqYe8b2DBbooTprY4v=BiZEd9qBqVLq+FD9j617eQFjk1KvQ@mail.gmail.com

Attachment

v1-0001-pg_rewind-warn-if-we-haven-t-checkpointed-after-p.patch

Re: pg_rewind: warn when checkpoint hasn't happened after promotion

From

Bharath Rupireddy

Date:

04 June 2022, 13:39:41

On Sat, Jun 4, 2022 at 6:29 PM James Coleman <jtc331@gmail.com> wrote:
>
> A few weeks back I sent a bug report [1] directly to the -bugs mailing
> list, and I haven't seen any activity on it (maybe this is because I
> emailed directly instead of using the form?), but I got some time to
> take a look and concluded that a first-level fix is pretty simple.
>
> A quick background refresher: after promoting a standby rewinding the
> former primary requires that a checkpoint have been completed on the
> new primary after promotion. This is correctly documented. However
> pg_rewind incorrectly reports to the user that a rewind isn't
> necessary because the source and target are on the same timeline.
>
> Specifically, this happens when the control file on the newly promoted
> server looks like:
>
>     Latest checkpoint's TimeLineID:       4
>     Latest checkpoint's PrevTimeLineID:   4
>     ...
>     Min recovery ending loc's timeline:   5
>
> Attached is a patch that detects this condition and reports it as an
> error to the user.
>
> In the spirit of the new-ish "ensure shutdown" functionality I could
> imagine extending this to automatically issue a checkpoint when this
> situation is detected. I haven't started to code that up, however,
> wanting to first get buy-in on that.
>
> 1: https://www.postgresql.org/message-id/CAAaqYe8b2DBbooTprY4v=BiZEd9qBqVLq+FD9j617eQFjk1KvQ@mail.gmail.com

Thanks. I had a quick look over the issue and patch - just a thought -
can't we let pg_rewind issue a checkpoint on the new primary instead
of erroring out, maybe optionally? It might sound too much, but helps
pg_rewind to be self-reliant i.e. avoiding external actor to detect
the error and issue checkpoint the new primary to be able to
successfully run pg_rewind on the pld primary and repair it to use it
as a new standby.

Regards,
Bharath Rupireddy.

Re: pg_rewind: warn when checkpoint hasn't happened after promotion

From

Kyotaro Horiguchi

Date:

06 June 2022, 05:26:02

At Sat, 4 Jun 2022 19:09:41 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in 
> On Sat, Jun 4, 2022 at 6:29 PM James Coleman <jtc331@gmail.com> wrote:
> >
> > A few weeks back I sent a bug report [1] directly to the -bugs mailing
> > list, and I haven't seen any activity on it (maybe this is because I
> > emailed directly instead of using the form?), but I got some time to
> > take a look and concluded that a first-level fix is pretty simple.
> >
> > A quick background refresher: after promoting a standby rewinding the
> > former primary requires that a checkpoint have been completed on the
> > new primary after promotion. This is correctly documented. However
> > pg_rewind incorrectly reports to the user that a rewind isn't
> > necessary because the source and target are on the same timeline.
...
> > Attached is a patch that detects this condition and reports it as an
> > error to the user.

I have some random thoughts on this.

There could be a problem in the case of gracefully shutdowned
old-primary, so I think it is worth doing something if it can be in a
simple way.

However, I don't think we can simply rely on minRecoveryPoint to
detect that situation, since it won't be reset on a standby. A standby
also still can be the upstream of a cascading standby.  So, as
discussed in the thread for the comment [2], what we can do here would be
simply waiting for the timelineID to advance, maybe having a timeout.

In a case of single-step replication set, a checkpoint request to the
primary makes the end-of-recovery checkpoint fast.  It won't work as
expected in cascading replicas, but it might be acceptable.

> > In the spirit of the new-ish "ensure shutdown" functionality I could
> > imagine extending this to automatically issue a checkpoint when this
> > situation is detected. I haven't started to code that up, however,
> > wanting to first get buy-in on that.
> >
> > 1: https://www.postgresql.org/message-id/CAAaqYe8b2DBbooTprY4v=BiZEd9qBqVLq+FD9j617eQFjk1KvQ@mail.gmail.com
> 
> Thanks. I had a quick look over the issue and patch - just a thought -
> can't we let pg_rewind issue a checkpoint on the new primary instead
> of erroring out, maybe optionally? It might sound too much, but helps
> pg_rewind to be self-reliant i.e. avoiding external actor to detect
> the error and issue checkpoint the new primary to be able to
> successfully run pg_rewind on the pld primary and repair it to use it
> as a new standby.

At the time of the discussion [2] for the it was the hinderance that
that requires superuser privileges.  Now that has been narrowed down
to the pg_checkpointer privileges.

If we know that the timeline IDs are different, we don't need to wait
for a checkpoint.

It seems to me that the exit status is significant. pg_rewind exits
with 1 when an invalid option is given. I don't think it is great if
we report this state by the same code.

I don't think we always want to request a non-spreading checkpoint.

[2]
https://www.postgresql.org/message-id/flat/CABUevEz5bpvbwVsYCaSMV80CBZ5-82nkMzbb%2BBu%3Dh1m%3DrLdn%3Dg%40mail.gmail.com

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: pg_rewind: warn when checkpoint hasn't happened after promotion

From

James Coleman

Date:

06 June 2022, 12:10:19

On Sat, Jun 4, 2022 at 9:39 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Sat, Jun 4, 2022 at 6:29 PM James Coleman <jtc331@gmail.com> wrote:
> >
> > A few weeks back I sent a bug report [1] directly to the -bugs mailing
> > list, and I haven't seen any activity on it (maybe this is because I
> > emailed directly instead of using the form?), but I got some time to
> > take a look and concluded that a first-level fix is pretty simple.
> >
> > A quick background refresher: after promoting a standby rewinding the
> > former primary requires that a checkpoint have been completed on the
> > new primary after promotion. This is correctly documented. However
> > pg_rewind incorrectly reports to the user that a rewind isn't
> > necessary because the source and target are on the same timeline.
> >
> > Specifically, this happens when the control file on the newly promoted
> > server looks like:
> >
> >     Latest checkpoint's TimeLineID:       4
> >     Latest checkpoint's PrevTimeLineID:   4
> >     ...
> >     Min recovery ending loc's timeline:   5
> >
> > Attached is a patch that detects this condition and reports it as an
> > error to the user.
> >
> > In the spirit of the new-ish "ensure shutdown" functionality I could
> > imagine extending this to automatically issue a checkpoint when this
> > situation is detected. I haven't started to code that up, however,
> > wanting to first get buy-in on that.
> >
> > 1: https://www.postgresql.org/message-id/CAAaqYe8b2DBbooTprY4v=BiZEd9qBqVLq+FD9j617eQFjk1KvQ@mail.gmail.com
>
> Thanks. I had a quick look over the issue and patch - just a thought -
> can't we let pg_rewind issue a checkpoint on the new primary instead
> of erroring out, maybe optionally? It might sound too much, but helps
> pg_rewind to be self-reliant i.e. avoiding external actor to detect
> the error and issue checkpoint the new primary to be able to
> successfully run pg_rewind on the pld primary and repair it to use it
> as a new standby.

That's what I had suggested as a "further improvement" option in the
last paragraph :)

But I think agreement on this more basic solution would still be good
(even if I add the automatic checkpointing in this thread); given we
currently explicitly mis-inform the user of pg_rewind, I think this is
a bug that should be considered for backpatching, and the simpler
"fail if detected" patch is probably the only thing we could
backpatch.

Thanks for taking a look,
James Coleman

Re: pg_rewind: warn when checkpoint hasn't happened after promotion

From

James Coleman

Date:

06 June 2022, 12:32:01

On Mon, Jun 6, 2022 at 1:26 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Sat, 4 Jun 2022 19:09:41 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in
> > On Sat, Jun 4, 2022 at 6:29 PM James Coleman <jtc331@gmail.com> wrote:
> > >
> > > A few weeks back I sent a bug report [1] directly to the -bugs mailing
> > > list, and I haven't seen any activity on it (maybe this is because I
> > > emailed directly instead of using the form?), but I got some time to
> > > take a look and concluded that a first-level fix is pretty simple.
> > >
> > > A quick background refresher: after promoting a standby rewinding the
> > > former primary requires that a checkpoint have been completed on the
> > > new primary after promotion. This is correctly documented. However
> > > pg_rewind incorrectly reports to the user that a rewind isn't
> > > necessary because the source and target are on the same timeline.
> ...
> > > Attached is a patch that detects this condition and reports it as an
> > > error to the user.
>
> I have some random thoughts on this.
>
> There could be a problem in the case of gracefully shutdowned
> old-primary, so I think it is worth doing something if it can be in a
> simple way.
>
> However, I don't think we can simply rely on minRecoveryPoint to
> detect that situation, since it won't be reset on a standby. A standby
> also still can be the upstream of a cascading standby.  So, as
> discussed in the thread for the comment [2], what we can do here would be
> simply waiting for the timelineID to advance, maybe having a timeout.

To confirm I'm following you correctly, you're envisioning a situation like:

- Primary A
- Replica B replicating from primary
- Replica C replicating from replica B

then on failover from A to B you end up with:

- Primary B
- Replica C replication from primary
- [needs rewind] A

and you try to rewind A from C as the source?

> In a case of single-step replication set, a checkpoint request to the
> primary makes the end-of-recovery checkpoint fast.  It won't work as
> expected in cascading replicas, but it might be acceptable.

"Won't work as expected" because there's no way to guarantee
replication is caught up or even advancing?

> > > In the spirit of the new-ish "ensure shutdown" functionality I could
> > > imagine extending this to automatically issue a checkpoint when this
> > > situation is detected. I haven't started to code that up, however,
> > > wanting to first get buy-in on that.
> > >
> > > 1: https://www.postgresql.org/message-id/CAAaqYe8b2DBbooTprY4v=BiZEd9qBqVLq+FD9j617eQFjk1KvQ@mail.gmail.com
> >
> > Thanks. I had a quick look over the issue and patch - just a thought -
> > can't we let pg_rewind issue a checkpoint on the new primary instead
> > of erroring out, maybe optionally? It might sound too much, but helps
> > pg_rewind to be self-reliant i.e. avoiding external actor to detect
> > the error and issue checkpoint the new primary to be able to
> > successfully run pg_rewind on the pld primary and repair it to use it
> > as a new standby.
>
> At the time of the discussion [2] for the it was the hinderance that
> that requires superuser privileges.  Now that has been narrowed down
> to the pg_checkpointer privileges.
>
> If we know that the timeline IDs are different, we don't need to wait
> for a checkpoint.

Correct.

> It seems to me that the exit status is significant. pg_rewind exits
> with 1 when an invalid option is given. I don't think it is great if
> we report this state by the same code.

I'm happy to change that; I only chose "1" as a placeholder for
"non-zero exit status".

> I don't think we always want to request a non-spreading checkpoint.

I'm not familiar with the terminology "non-spreading checkpoint".

> [2]
https://www.postgresql.org/message-id/flat/CABUevEz5bpvbwVsYCaSMV80CBZ5-82nkMzbb%2BBu%3Dh1m%3DrLdn%3Dg%40mail.gmail.com

I read through that thread, and one interesting idea stuck out to me:
making "tiimeline IDs are the same" an error exit status. On the one
hand that makes a certain amount of sense because it's unexpected. But
on the other hand there are entirely legitimate situations where upon
failover the timeline IDs happen to match (e.g., for use it happens
some percentage of the time naturally as we are using sync replication
and failovers often involve STONITHing the original primary, so it's
entirely possible that the promoted replica begins with exactly the
same WAL ending LSN from the primary before it stopped).

Thanks,
James Coleman

Re: pg_rewind: warn when checkpoint hasn't happened after promotion

From

Kyotaro Horiguchi

Date:

07 June 2022, 03:39:38

At Mon, 6 Jun 2022 08:32:01 -0400, James Coleman <jtc331@gmail.com> wrote in 
> On Mon, Jun 6, 2022 at 1:26 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> >
> > At Sat, 4 Jun 2022 19:09:41 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in
> > > On Sat, Jun 4, 2022 at 6:29 PM James Coleman <jtc331@gmail.com> wrote:
> > > >
> > > > A few weeks back I sent a bug report [1] directly to the -bugs mailing
> > > > list, and I haven't seen any activity on it (maybe this is because I
> > > > emailed directly instead of using the form?), but I got some time to
> > > > take a look and concluded that a first-level fix is pretty simple.
> > > >
> > > > A quick background refresher: after promoting a standby rewinding the
> > > > former primary requires that a checkpoint have been completed on the
> > > > new primary after promotion. This is correctly documented. However
> > > > pg_rewind incorrectly reports to the user that a rewind isn't
> > > > necessary because the source and target are on the same timeline.
> > ...
> > > > Attached is a patch that detects this condition and reports it as an
> > > > error to the user.
> >
> > I have some random thoughts on this.
> >
> > There could be a problem in the case of gracefully shutdowned
> > old-primary, so I think it is worth doing something if it can be in a
> > simple way.
> >
> > However, I don't think we can simply rely on minRecoveryPoint to
> > detect that situation, since it won't be reset on a standby. A standby
> > also still can be the upstream of a cascading standby.  So, as
> > discussed in the thread for the comment [2], what we can do here would be
> > simply waiting for the timelineID to advance, maybe having a timeout.
> 
> To confirm I'm following you correctly, you're envisioning a situation like:
> 
> - Primary A
> - Replica B replicating from primary
> - Replica C replicating from replica B
> 
> then on failover from A to B you end up with:
> 
> - Primary B
> - Replica C replication from primary
> - [needs rewind] A
> 
> and you try to rewind A from C as the source?

Yes. I think it is a legit use case.  That being said, like other
points, it might be acceptable.

> > In a case of single-step replication set, a checkpoint request to the
> > primary makes the end-of-recovery checkpoint fast.  It won't work as
> > expected in cascading replicas, but it might be acceptable.
> 
> "Won't work as expected" because there's no way to guarantee
> replication is caught up or even advancing?

Maybe no.  I meant that restartpoints don't run more frequently than
the intervals of checkpoint_timeout even if checkpint records come
more frequently.

> > > > In the spirit of the new-ish "ensure shutdown" functionality I could
> > > > imagine extending this to automatically issue a checkpoint when this
> > > > situation is detected. I haven't started to code that up, however,
> > > > wanting to first get buy-in on that.
> > > >
> > > > 1: https://www.postgresql.org/message-id/CAAaqYe8b2DBbooTprY4v=BiZEd9qBqVLq+FD9j617eQFjk1KvQ@mail.gmail.com
> > >
> > > Thanks. I had a quick look over the issue and patch - just a thought -
> > > can't we let pg_rewind issue a checkpoint on the new primary instead
> > > of erroring out, maybe optionally? It might sound too much, but helps
> > > pg_rewind to be self-reliant i.e. avoiding external actor to detect
> > > the error and issue checkpoint the new primary to be able to
> > > successfully run pg_rewind on the pld primary and repair it to use it
> > > as a new standby.
> >
> > At the time of the discussion [2] for the it was the hinderance that
> > that requires superuser privileges.  Now that has been narrowed down
> > to the pg_checkpointer privileges.
> >
> > If we know that the timeline IDs are different, we don't need to wait
> > for a checkpoint.
> 
> Correct.
> 
> > It seems to me that the exit status is significant. pg_rewind exits
> > with 1 when an invalid option is given. I don't think it is great if
> > we report this state by the same code.
> 
> I'm happy to change that; I only chose "1" as a placeholder for
> "non-zero exit status".
> 
> > I don't think we always want to request a non-spreading checkpoint.
> 
> I'm not familiar with the terminology "non-spreading checkpoint".

Does "immediate checkpoint" works?  That is, a checkpoint that runs at
full-speed (i.e. with no delays between writes).

> > [2]
https://www.postgresql.org/message-id/flat/CABUevEz5bpvbwVsYCaSMV80CBZ5-82nkMzbb%2BBu%3Dh1m%3DrLdn%3Dg%40mail.gmail.com
> 
> I read through that thread, and one interesting idea stuck out to me:
> making "tiimeline IDs are the same" an error exit status. On the one
> hand that makes a certain amount of sense because it's unexpected. But
> on the other hand there are entirely legitimate situations where upon
> failover the timeline IDs happen to match (e.g., for use it happens
> some percentage of the time naturally as we are using sync replication
> and failovers often involve STONITHing the original primary, so it's
> entirely possible that the promoted replica begins with exactly the
> same WAL ending LSN from the primary before it stopped).

Yes that is true for most cases unless old primary written some
records that had not sent to the standby before its death. So if we
don't inspect WAL records (on the target cluster), we should always
run rewinding even in the STONITH-killed (or immediate-shutdown)
cases.

One possible way to detect promotion reliably is to look into timeline
history files. It is written immediately at promotion even on
standbys.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: pg_rewind: warn when checkpoint hasn't happened after promotion

From

Kyotaro Horiguchi

Date:

07 June 2022, 07:05:47

At Tue, 07 Jun 2022 12:39:38 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> One possible way to detect promotion reliably is to look into timeline
> history files. It is written immediately at promotion even on
> standbys.

The attached seems to work. It uses timeline history files to identify
the source timeline.  With this change pg_waldump no longer need to
wait for end-of-recovery to finish.

(It lacks doc part and test.. But I'm not sure how we can test this
behavior.)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment

pg_rewind_fix.diff.txt

Re: pg_rewind: warn when checkpoint hasn't happened after promotion

From

Michael Paquier

Date:

07 June 2022, 07:16:09

On Tue, Jun 07, 2022 at 12:39:38PM +0900, Kyotaro Horiguchi wrote:
> At Mon, 6 Jun 2022 08:32:01 -0400, James Coleman <jtc331@gmail.com> wrote in
>> To confirm I'm following you correctly, you're envisioning a situation like:
>>
>> - Primary A
>> - Replica B replicating from primary
>> - Replica C replicating from replica B
>>
>> then on failover from A to B you end up with:
>>
>> - Primary B
>> - Replica C replication from primary
>> - [needs rewind] A
>>
>> and you try to rewind A from C as the source?
>
> Yes. I think it is a legit use case.  That being said, like other
> points, it might be acceptable.

This configuration is a case supported by pg_rewind, meaning that your
patch to check after minRecoveryPointTLI would be confusing when using
a standby as a source because the checkpoint needs to apply on its
primary to allow the TLI of the standby to go up.  If you want to
provide to the user more context, a more meaningful way may be to rely
on an extra check for ControlFileData.state, I guess, as a promoted
cluster is marked as DB_IN_PRODUCTION before recoveryMinPoint is
cleared by the first post-promotion checkpoint, with
DB_IN_ARCHIVE_RECOVERY for a cascading standby.
--
Michael

Attachment

signature.asc

Re: pg_rewind: warn when checkpoint hasn't happened after promotion

From

Kyotaro Horiguchi

Date:

07 June 2022, 07:54:01

At Tue, 7 Jun 2022 16:16:09 +0900, Michael Paquier <michael@paquier.xyz> wrote in 
> On Tue, Jun 07, 2022 at 12:39:38PM +0900, Kyotaro Horiguchi wrote:
> > At Mon, 6 Jun 2022 08:32:01 -0400, James Coleman <jtc331@gmail.com> wrote in 
> >> To confirm I'm following you correctly, you're envisioning a situation like:
> >> 
> >> - Primary A
> >> - Replica B replicating from primary
> >> - Replica C replicating from replica B
> >> 
> >> then on failover from A to B you end up with:
> >> 
> >> - Primary B
> >> - Replica C replication from primary
> >> - [needs rewind] A
> >> 
> >> and you try to rewind A from C as the source?
> > 
> > Yes. I think it is a legit use case.  That being said, like other
> > points, it might be acceptable.
> 
> This configuration is a case supported by pg_rewind, meaning that your
> patch to check after minRecoveryPointTLI would be confusing when using
> a standby as a source because the checkpoint needs to apply on its
> primary to allow the TLI of the standby to go up.  If you want to

Yeah, that what I meant.

> provide to the user more context, a more meaningful way may be to rely
> on an extra check for ControlFileData.state, I guess, as a promoted 
> cluster is marked as DB_IN_PRODUCTION before recoveryMinPoint is
> cleared by the first post-promotion checkpoint, with
> DB_IN_ARCHIVE_RECOVERY for a cascading standby.

Right. However, IIUC, checkpoint LSN/TLI is not updated at the
time. The point of the minRecoveryPoint check is to confirm that we
can read the timeline ID of the promoted source cluster from
checkPointCopy.ThisTimeLineID. But we cannot do that yet at the time
the cluster state moves to DB_IN_PRODUCTION.  And a standby is in
DB_IN_ARCHIVE_RECOVERY since before the upstream promotes. It also
doesn't signal the reliability of checkPointCopy.ThisTimeLineID..

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: pg_rewind: warn when checkpoint hasn't happened after promotion

From

vignesh ravichandran

Date:

07 June 2022, 14:41:11

I think this is a good improvement and also like the option (on pg_rewind) to potentially send checkpoints to the source.

Personal anecdote. I was using stolon and frequently failing over. For sometime the rewind was failing that it wasn't required. Only learnt that it's the checkpoint on the source which was missing.

References https://github.com/sorintlab/stolon/issues/601
And the fix https://github.com/sorintlab/stolon/pull/644

---- On Sat, 04 Jun 2022 05:59:12 -0700 James Coleman <jtc331@gmail.com> wrote ----

A few weeks back I sent a bug report [1] directly to the -bugs mailing
list, and I haven't seen any activity on it (maybe this is because I
emailed directly instead of using the form?), but I got some time to
take a look and concluded that a first-level fix is pretty simple.

A quick background refresher: after promoting a standby rewinding the
former primary requires that a checkpoint have been completed on the
new primary after promotion. This is correctly documented. However
pg_rewind incorrectly reports to the user that a rewind isn't
necessary because the source and target are on the same timeline.

Specifically, this happens when the control file on the newly promoted
server looks like:

Latest checkpoint's TimeLineID: 4
Latest checkpoint's PrevTimeLineID: 4
...
Min recovery ending loc's timeline: 5

Attached is a patch that detects this condition and reports it as an
error to the user.

In the spirit of the new-ish "ensure shutdown" functionality I could
imagine extending this to automatically issue a checkpoint when this
situation is detected. I haven't started to code that up, however,
wanting to first get buy-in on that.

Thanks,
James Coleman

1: https://www.postgresql.org/message-id/CAAaqYe8b2DBbooTprY4v=BiZEd9qBqVLq+FD9j617eQFjk1KvQ@mail.gmail.com

Re: pg_rewind: warn when checkpoint hasn't happened after promotion

From

Kyotaro Horiguchi

Date:

08 June 2022, 09:15:09

At Tue, 07 Jun 2022 16:05:47 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> At Tue, 07 Jun 2022 12:39:38 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> > One possible way to detect promotion reliably is to look into timeline
> > history files. It is written immediately at promotion even on
> > standbys.
> 
> The attached seems to work. It uses timeline history files to identify
> the source timeline.  With this change pg_waldump no longer need to
> wait for end-of-recovery to finish.
> 
> (It lacks doc part and test.. But I'm not sure how we can test this
> behavior.)

This is a revised version.

Revised getTimelineHistory()'s logic (refactored, and changed so that
it doesn't pick-up the wrong history files).

perform_rewind always identify endtli based on source's timeline
history.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment

pg_rewind_fix_2.diff.txt

Re: pg_rewind: warn when checkpoint hasn't happened after promotion

From

Kyotaro Horiguchi

Date:

08 June 2022, 09:36:04

At Wed, 08 Jun 2022 18:15:09 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> At Tue, 07 Jun 2022 16:05:47 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> > At Tue, 07 Jun 2022 12:39:38 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> > > One possible way to detect promotion reliably is to look into timeline
> > > history files. It is written immediately at promotion even on
> > > standbys.
> > 
> > The attached seems to work. It uses timeline history files to identify
> > the source timeline.  With this change pg_waldump no longer need to
> > wait for end-of-recovery to finish.
> > 
> > (It lacks doc part and test.. But I'm not sure how we can test this
> > behavior.)
> 
> This is a revised version.
> 
> Revised getTimelineHistory()'s logic (refactored, and changed so that
> it doesn't pick-up the wrong history files).
> 
> perform_rewind always identify endtli based on source's timeline
> history.

No need to "search" history file to identify it.  The latest timeline
must be that.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment

pg_rewind_fix_3.diff.txt

Re: pg_rewind: warn when checkpoint hasn't happened after promotion

From

Robert Haas

Date:

05 July 2022, 18:39:11

On Sat, Jun 4, 2022 at 8:59 AM James Coleman <jtc331@gmail.com> wrote:
> A quick background refresher: after promoting a standby rewinding the
> former primary requires that a checkpoint have been completed on the
> new primary after promotion. This is correctly documented. However
> pg_rewind incorrectly reports to the user that a rewind isn't
> necessary because the source and target are on the same timeline.

Is there anything intrinsic to the mechanism of operation of pg_rewind
that requires a timeline change, or could we just rewind within the
same timeline to an earlier LSN? In other words, maybe we could just
remove this limitation of pg_rewind, and then perhaps it wouldn't be
necessary to determine what the new timeline is.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: pg_rewind: warn when checkpoint hasn't happened after promotion

From

James Coleman

Date:

05 July 2022, 18:46:13

On Tue, Jul 5, 2022 at 2:39 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Sat, Jun 4, 2022 at 8:59 AM James Coleman <jtc331@gmail.com> wrote:
> > A quick background refresher: after promoting a standby rewinding the
> > former primary requires that a checkpoint have been completed on the
> > new primary after promotion. This is correctly documented. However
> > pg_rewind incorrectly reports to the user that a rewind isn't
> > necessary because the source and target are on the same timeline.
>
> Is there anything intrinsic to the mechanism of operation of pg_rewind
> that requires a timeline change, or could we just rewind within the
> same timeline to an earlier LSN? In other words, maybe we could just
> remove this limitation of pg_rewind, and then perhaps it wouldn't be
> necessary to determine what the new timeline is.

I think (someone can correct me if I'm wrong) that in theory the
mechanisms would support the source and target being on the same
timeline, but in practice that presents problems since you'd not have
an LSN you could detect as the divergence point. If we allowed passing
"rewind to" point LSN value, then that (again, as far as I understand
it) would work, but it's a different use case. Specifically I wouldn't
want that option to need to be used for this particular case since in
my example there is in fact a real divergence point that we should be
detecting automatically.

Thanks,
James Coleman

Re: pg_rewind: warn when checkpoint hasn't happened after promotion

From

Tom Lane

Date:

05 July 2022, 18:47:27

Robert Haas <robertmhaas@gmail.com> writes:
> Is there anything intrinsic to the mechanism of operation of pg_rewind
> that requires a timeline change, or could we just rewind within the
> same timeline to an earlier LSN? In other words, maybe we could just
> remove this limitation of pg_rewind, and then perhaps it wouldn't be
> necessary to determine what the new timeline is.

That seems like a fairly bad idea.  For example, if you've already
archived some WAL segments past the rewind target, there will shortly
be two versions of truth about what that part of the WAL space contains,
and your archiver will either spit up or do probably-the-wrong-thing.

            regards, tom lane

Re: pg_rewind: warn when checkpoint hasn't happened after promotion

From

Robert Haas

Date:

05 July 2022, 18:51:35

On Tue, Jul 5, 2022 at 2:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > Is there anything intrinsic to the mechanism of operation of pg_rewind
> > that requires a timeline change, or could we just rewind within the
> > same timeline to an earlier LSN? In other words, maybe we could just
> > remove this limitation of pg_rewind, and then perhaps it wouldn't be
> > necessary to determine what the new timeline is.
>
> That seems like a fairly bad idea.  For example, if you've already
> archived some WAL segments past the rewind target, there will shortly
> be two versions of truth about what that part of the WAL space contains,
> and your archiver will either spit up or do probably-the-wrong-thing.

Well, only if you void the warranty. If you rewind the ex-primary to
the LSN where the new primary is replaying and tell it to start
replaying from there and follow the new primary's subsequent switch
onto a new timeline, there's no split-brain problem.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: pg_rewind: warn when checkpoint hasn't happened after promotion

From

Kyotaro Horiguchi

Date:

06 July 2022, 02:38:42

At Tue, 5 Jul 2022 14:46:13 -0400, James Coleman <jtc331@gmail.com> wrote in 
> On Tue, Jul 5, 2022 at 2:39 PM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Sat, Jun 4, 2022 at 8:59 AM James Coleman <jtc331@gmail.com> wrote:
> > > A quick background refresher: after promoting a standby rewinding the
> > > former primary requires that a checkpoint have been completed on the
> > > new primary after promotion. This is correctly documented. However
> > > pg_rewind incorrectly reports to the user that a rewind isn't
> > > necessary because the source and target are on the same timeline.
> >
> > Is there anything intrinsic to the mechanism of operation of pg_rewind
> > that requires a timeline change, or could we just rewind within the
> > same timeline to an earlier LSN? In other words, maybe we could just
> > remove this limitation of pg_rewind, and then perhaps it wouldn't be
> > necessary to determine what the new timeline is.
>
> I think (someone can correct me if I'm wrong) that in theory the
> mechanisms would support the source and target being on the same
> timeline, but in practice that presents problems since you'd not have
> an LSN you could detect as the divergence point. If we allowed passing
> "rewind to" point LSN value, then that (again, as far as I understand
> it) would work, but it's a different use case. Specifically I wouldn't
> want that option to need to be used for this particular case since in
> my example there is in fact a real divergence point that we should be
> detecting automatically.

The point of pg_rewind is finding diverging point then finding all
blocks modified in the dead history (from the diverging point) and
"replace" them with those of the live history. In that sense, to be
exact, pg_rewind does not "rewind" a cluster.  If no diverging point,
the last LSN of the cluster getting behind (as target cluster?) is
that and just no need to replace a block at all because no WAL exists
(on the cluster being behind) after the last LSN.

The issue here is pg_rewind looks into control file to determine the
soruce timeline, because the control file is not updated until the
first checkpoint ends after promotion finishes, even though file
blocks are already diverged.

Even in that case history file for the new timeline is already
created, so searching for the latest history file works.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: pg_rewind: warn when checkpoint hasn't happened after promotion

From

kuroda.keisuke@nttcom.co.jp

Date:

16 November 2022, 05:17:59

Hi, hackers

> The issue here is pg_rewind looks into control file to determine the
> soruce timeline, because the control file is not updated until the
> first checkpoint ends after promotion finishes, even though file
> blocks are already diverged.
> 
> Even in that case history file for the new timeline is already
> created, so searching for the latest history file works.

I think this change is a good one because if I want
pg_rewind to run automatically after a promotion,
I don't have to wait for the checkpoint to complete.

The attached patch is Horiguchi-san's patch with
additional tests. The tests are based on James's tests,
"010_no_checkpoint_after_promotion.pl" tests that
pg_rewind is successfully executed without running
checkpoint after promote.

Best Regards,
Keisuke Kuroda
NTT COMWARE

Attachment

pg_rewind_fix_4.diff.patch

Re: pg_rewind: warn when checkpoint hasn't happened after promotion

From

Heikki Linnakangas

Date:

27 February 2023, 07:33:13

On 16/11/2022 07:17, kuroda.keisuke@nttcom.co.jp wrote:
>> The issue here is pg_rewind looks into control file to determine the
>> soruce timeline, because the control file is not updated until the
>> first checkpoint ends after promotion finishes, even though file
>> blocks are already diverged.
>>
>> Even in that case history file for the new timeline is already
>> created, so searching for the latest history file works.
> 
> I think this change is a good one because if I want
> pg_rewind to run automatically after a promotion,
> I don't have to wait for the checkpoint to complete.
> 
> The attached patch is Horiguchi-san's patch with
> additional tests. The tests are based on James's tests,
> "010_no_checkpoint_after_promotion.pl" tests that
> pg_rewind is successfully executed without running
> checkpoint after promote.

I fixed this last week in commit 009eeee746, see thread [1]. I'm sorry I 
didn't notice this thread earlier.

I didn't realize that we had a notice about this in the docs. I'll go 
and remove that. Thanks!

- Heikki

Re: pg_rewind: warn when checkpoint hasn't happened after promotion

From

kuroda.keisuke@nttcom.co.jp

Date:

28 February 2023, 02:07:52

hi Heikki,

Thanks to mail, and thanks also for the commit(0a0500207a)
to fix the document.
I'm glad the problem was solved.

Best Regards,
Keisuke Kuroda
NTT COMWARE

2023-02-27 16:33 に Heikki Linnakangas さんは書きました:
> On 16/11/2022 07:17, kuroda.keisuke@nttcom.co.jp wrote:
>
> I fixed this last week in commit 009eeee746, see thread [1]. I'm sorry
> I didn't notice this thread earlier.
>
> I didn't realize that we had a notice about this in the docs. I'll go
> and remove that. Thanks!
>
> - Heikki

Re: pg_rewind: warn when checkpoint hasn't happened after promotion

From

James Coleman

Date:

28 February 2023, 12:37:53

On Mon, Feb 27, 2023 at 2:33 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 16/11/2022 07:17, kuroda.keisuke@nttcom.co.jp wrote:
> >> The issue here is pg_rewind looks into control file to determine the
> >> soruce timeline, because the control file is not updated until the
> >> first checkpoint ends after promotion finishes, even though file
> >> blocks are already diverged.
> >>
> >> Even in that case history file for the new timeline is already
> >> created, so searching for the latest history file works.
> >
> > I think this change is a good one because if I want
> > pg_rewind to run automatically after a promotion,
> > I don't have to wait for the checkpoint to complete.
> >
> > The attached patch is Horiguchi-san's patch with
> > additional tests. The tests are based on James's tests,
> > "010_no_checkpoint_after_promotion.pl" tests that
> > pg_rewind is successfully executed without running
> > checkpoint after promote.
>
> I fixed this last week in commit 009eeee746, see thread [1]. I'm sorry I
> didn't notice this thread earlier.
>
> I didn't realize that we had a notice about this in the docs. I'll go
> and remove that. Thanks!
>
> - Heikki
>

Thanks; I think the missing [1] (for reference) is:
https://www.postgresql.org/message-id/9f568c97-87fe-a716-bd39-65299b8a60f4%40iki.fi

James