On Sat, Jun 4, 2022 at 9:39 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Sat, Jun 4, 2022 at 6:29 PM James Coleman <jtc331@gmail.com> wrote:
> >
> > A few weeks back I sent a bug report [1] directly to the -bugs mailing
> > list, and I haven't seen any activity on it (maybe this is because I
> > emailed directly instead of using the form?), but I got some time to
> > take a look and concluded that a first-level fix is pretty simple.
> >
> > A quick background refresher: after promoting a standby rewinding the
> > former primary requires that a checkpoint have been completed on the
> > new primary after promotion. This is correctly documented. However
> > pg_rewind incorrectly reports to the user that a rewind isn't
> > necessary because the source and target are on the same timeline.
> >
> > Specifically, this happens when the control file on the newly promoted
> > server looks like:
> >
> > Latest checkpoint's TimeLineID: 4
> > Latest checkpoint's PrevTimeLineID: 4
> > ...
> > Min recovery ending loc's timeline: 5
> >
> > Attached is a patch that detects this condition and reports it as an
> > error to the user.
> >
> > In the spirit of the new-ish "ensure shutdown" functionality I could
> > imagine extending this to automatically issue a checkpoint when this
> > situation is detected. I haven't started to code that up, however,
> > wanting to first get buy-in on that.
> >
> > 1: https://www.postgresql.org/message-id/CAAaqYe8b2DBbooTprY4v=BiZEd9qBqVLq+FD9j617eQFjk1KvQ@mail.gmail.com
>
> Thanks. I had a quick look over the issue and patch - just a thought -
> can't we let pg_rewind issue a checkpoint on the new primary instead
> of erroring out, maybe optionally? It might sound too much, but helps
> pg_rewind to be self-reliant i.e. avoiding external actor to detect
> the error and issue checkpoint the new primary to be able to
> successfully run pg_rewind on the pld primary and repair it to use it
> as a new standby.
That's what I had suggested as a "further improvement" option in the
last paragraph :)
But I think agreement on this more basic solution would still be good
(even if I add the automatic checkpointing in this thread); given we
currently explicitly mis-inform the user of pg_rewind, I think this is
a bug that should be considered for backpatching, and the simpler
"fail if detected" patch is probably the only thing we could
backpatch.
Thanks for taking a look,
James Coleman