Re: pg_rewind: warn when checkpoint hasn't happened after promotion - Mailing list pgsql-hackers

From James Coleman
Subject Re: pg_rewind: warn when checkpoint hasn't happened after promotion
Date
Msg-id CAAaqYe9JxLvqqF=ZGfnqUsw+KBLUu_Rgf37+OtKdR49mhHLZGw@mail.gmail.com
Whole thread Raw
In response to Re: pg_rewind: warn when checkpoint hasn't happened after promotion  (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
List pgsql-hackers
On Sat, Jun 4, 2022 at 9:39 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Sat, Jun 4, 2022 at 6:29 PM James Coleman <jtc331@gmail.com> wrote:
> >
> > A few weeks back I sent a bug report [1] directly to the -bugs mailing
> > list, and I haven't seen any activity on it (maybe this is because I
> > emailed directly instead of using the form?), but I got some time to
> > take a look and concluded that a first-level fix is pretty simple.
> >
> > A quick background refresher: after promoting a standby rewinding the
> > former primary requires that a checkpoint have been completed on the
> > new primary after promotion. This is correctly documented. However
> > pg_rewind incorrectly reports to the user that a rewind isn't
> > necessary because the source and target are on the same timeline.
> >
> > Specifically, this happens when the control file on the newly promoted
> > server looks like:
> >
> >     Latest checkpoint's TimeLineID:       4
> >     Latest checkpoint's PrevTimeLineID:   4
> >     ...
> >     Min recovery ending loc's timeline:   5
> >
> > Attached is a patch that detects this condition and reports it as an
> > error to the user.
> >
> > In the spirit of the new-ish "ensure shutdown" functionality I could
> > imagine extending this to automatically issue a checkpoint when this
> > situation is detected. I haven't started to code that up, however,
> > wanting to first get buy-in on that.
> >
> > 1: https://www.postgresql.org/message-id/CAAaqYe8b2DBbooTprY4v=BiZEd9qBqVLq+FD9j617eQFjk1KvQ@mail.gmail.com
>
> Thanks. I had a quick look over the issue and patch - just a thought -
> can't we let pg_rewind issue a checkpoint on the new primary instead
> of erroring out, maybe optionally? It might sound too much, but helps
> pg_rewind to be self-reliant i.e. avoiding external actor to detect
> the error and issue checkpoint the new primary to be able to
> successfully run pg_rewind on the pld primary and repair it to use it
> as a new standby.

That's what I had suggested as a "further improvement" option in the
last paragraph :)

But I think agreement on this more basic solution would still be good
(even if I add the automatic checkpointing in this thread); given we
currently explicitly mis-inform the user of pg_rewind, I think this is
a bug that should be considered for backpatching, and the simpler
"fail if detected" patch is probably the only thing we could
backpatch.

Thanks for taking a look,
James Coleman



pgsql-hackers by date:

Previous
From: "kuroda.hayato@fujitsu.com"
Date:
Subject: RE: Multi-Master Logical Replication
Next
From: James Coleman
Date:
Subject: Re: pg_rewind: warn when checkpoint hasn't happened after promotion