Thread: Should we remove a fallback promotion? take 2
Hi, We discussed the $SUBJECT six years ago at the following thread. https://postgr.es/m/CAHGQGwGYkF+CvpOMdxaO=+aNAzc1Oo9O4LqWo50MxpvFj+0VOw@mail.gmail.com Seems our consensus at that discussion was to leave a fallback promotion for a release or two for debugging purpose or as an emergency method because fast promotion might have some issues, and then to remove it later. Now, more than six years have already passed since that discussion. Is there still any reason to keep a fallback promotion? If nothing, I'd like to drop it from v13. Regards, -- Fujii Masao NTT DATA CORPORATION Advanced Platform Technology Group Research and Development Headquarters
On Thu, Mar 5, 2020 at 8:48 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > We discussed the $SUBJECT six years ago at the following thread. > https://postgr.es/m/CAHGQGwGYkF+CvpOMdxaO=+aNAzc1Oo9O4LqWo50MxpvFj+0VOw@mail.gmail.com > > Seems our consensus at that discussion was to leave a fallback > promotion for a release or two for debugging purpose or as > an emergency method because fast promotion might have > some issues, and then to remove it later. Now, more than six years > have already passed since that discussion. Is there still > any reason to keep a fallback promotion? If nothing, I'd like to > drop it from v13. Seems reasonable, but it would be better if people proposed these kinds of changes closer to the beginning of the release cycle rather than in the crush at the end. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Mar 05, 2020 at 09:40:54AM -0500, Robert Haas wrote: > Seems reasonable, but it would be better if people proposed these > kinds of changes closer to the beginning of the release cycle rather > than in the crush at the end. +1, to both points. -- Michael
Attachment
On 2020/03/06 10:40, Michael Paquier wrote: > On Thu, Mar 05, 2020 at 09:40:54AM -0500, Robert Haas wrote: >> Seems reasonable, but it would be better if people proposed these >> kinds of changes closer to the beginning of the release cycle rather >> than in the crush at the end. > > +1, to both points. Ok, I'm fine to do that in v14. Regards, -- Fujii Masao NTT DATA CORPORATION Advanced Platform Technology Group Research and Development Headquarters
On 2020-Mar-06, Michael Paquier wrote: > On Thu, Mar 05, 2020 at 09:40:54AM -0500, Robert Haas wrote: > > Seems reasonable, but it would be better if people proposed these > > kinds of changes closer to the beginning of the release cycle rather > > than in the crush at the end. > > +1, to both points. Why? Are you saying that there's some actual risk of breaking something? We're not even near beta or feature freeze yet. I'm not seeing the reason for the "please propose this sooner in the cycle" argument. It has already been proposed sooner -- seven years sooner. We're not waiting for users to complain anymore; clearly nobody cared. I think dragging things forever serves no purpose. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi, On 2020-03-06 16:33:18 -0300, Alvaro Herrera wrote: > On 2020-Mar-06, Michael Paquier wrote: > > On Thu, Mar 05, 2020 at 09:40:54AM -0500, Robert Haas wrote: > > > Seems reasonable, but it would be better if people proposed these > > > kinds of changes closer to the beginning of the release cycle rather > > > than in the crush at the end. > > > > +1, to both points. > > Why? Are you saying that there's some actual risk of breaking > something? We're not even near beta or feature freeze yet. > > I'm not seeing the reason for the "please propose this sooner in the > cycle" argument. It has already been proposed sooner -- seven years > sooner. We're not waiting for users to complain anymore; clearly nobody > cared. Yea. There are changes that are so invasive that it's useful to go very early, but in this case I'm not seeing it? +1 for removing non-fast promotions. FWIW, I find "fallback promotion" a confusing description. Btw, I'd really like to make the crash recovery environment more like the replication environment. I.e. have checkpointer, bgwriter running, and have an 'end-of-recovery' record instead of a checkpoint at the end. Greetings, Andres Freund
On 2020/03/10 6:56, Andres Freund wrote: > Hi, > > On 2020-03-06 16:33:18 -0300, Alvaro Herrera wrote: >> On 2020-Mar-06, Michael Paquier wrote: >>> On Thu, Mar 05, 2020 at 09:40:54AM -0500, Robert Haas wrote: >>>> Seems reasonable, but it would be better if people proposed these >>>> kinds of changes closer to the beginning of the release cycle rather >>>> than in the crush at the end. >>> >>> +1, to both points. >> >> Why? Are you saying that there's some actual risk of breaking >> something? We're not even near beta or feature freeze yet. >> >> I'm not seeing the reason for the "please propose this sooner in the >> cycle" argument. It has already been proposed sooner -- seven years >> sooner. We're not waiting for users to complain anymore; clearly nobody >> cared. > > Yea. There are changes that are so invasive that it's useful to go very > early, but in this case I'm not seeing it? > > +1 for removing non-fast promotions. Patch attached. I will add this into the first CF for v14. > FWIW, I find "fallback promotion" a confusing description. Yeah, so I changed the subject. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Attachment
Re: Remove non-fast promotion Re: Should we remove a fallbackpromotion? take 2
From
Michael Paquier
Date:
On Mon, Apr 20, 2020 at 03:26:16PM +0900, Fujii Masao wrote: > Patch attached. I will add this into the first CF for v14. Thanks! > - if (IsPromoteSignaled()) > + /* > + * In 9.1 and 9.2 the postmaster unlinked the promote file inside the > + * signal handler. It now leaves the file in place and lets the > + * Startup process do the unlink. > + */ > + if (IsPromoteSignaled() && stat(PROMOTE_SIGNAL_FILE, &stat_buf) == 0) > { > - /* > - * In 9.1 and 9.2 the postmaster unlinked the promote file inside the > - * signal handler. It now leaves the file in place and lets the > - * Startup process do the unlink. This allows Startup to know whether > - * it should create a full checkpoint before starting up (fallback > - * mode). Fast promotion takes precedence. > - */ > - if (stat(PROMOTE_SIGNAL_FILE, &stat_buf) == 0) > - { > - unlink(PROMOTE_SIGNAL_FILE); > - unlink(FALLBACK_PROMOTE_SIGNAL_FILE); > - fast_promote = true; > - } > - else if (stat(FALLBACK_PROMOTE_SIGNAL_FILE, &stat_buf) == 0) > - { > - unlink(FALLBACK_PROMOTE_SIGNAL_FILE); > - fast_promote = false; > - } > - > ereport(LOG, (errmsg("received promote request"))); > - > + unlink(PROMOTE_SIGNAL_FILE); On HEAD, this code means that it is possible to end recovery just by sending SIGUSR2 to the startup process. With your patch, this code now means that in order to finish recovery you need to send SIGUSR2 to the startup process *and* to create the promote signal file. Is that really what you want? -- Michael
Attachment
Re: Remove non-fast promotion Re: Should we remove a fallbackpromotion? take 2
From
Fujii Masao
Date:
On 2020/04/21 10:59, Michael Paquier wrote: > On Mon, Apr 20, 2020 at 03:26:16PM +0900, Fujii Masao wrote: >> Patch attached. I will add this into the first CF for v14. > > Thanks! > >> - if (IsPromoteSignaled()) >> + /* >> + * In 9.1 and 9.2 the postmaster unlinked the promote file inside the >> + * signal handler. It now leaves the file in place and lets the >> + * Startup process do the unlink. >> + */ >> + if (IsPromoteSignaled() && stat(PROMOTE_SIGNAL_FILE, &stat_buf) == 0) >> { >> - /* >> - * In 9.1 and 9.2 the postmaster unlinked the promote file inside the >> - * signal handler. It now leaves the file in place and lets the >> - * Startup process do the unlink. This allows Startup to know whether >> - * it should create a full checkpoint before starting up (fallback >> - * mode). Fast promotion takes precedence. >> - */ >> - if (stat(PROMOTE_SIGNAL_FILE, &stat_buf) == 0) >> - { >> - unlink(PROMOTE_SIGNAL_FILE); >> - unlink(FALLBACK_PROMOTE_SIGNAL_FILE); >> - fast_promote = true; >> - } >> - else if (stat(FALLBACK_PROMOTE_SIGNAL_FILE, &stat_buf) == 0) >> - { >> - unlink(FALLBACK_PROMOTE_SIGNAL_FILE); >> - fast_promote = false; >> - } >> - >> ereport(LOG, (errmsg("received promote request"))); >> - >> + unlink(PROMOTE_SIGNAL_FILE); Thanks for reviewing the patch! > On HEAD, this code means that it is possible to end recovery just by > sending SIGUSR2 to the startup process. Yes, in this case, non-fast promotion is triggered. > With your patch, this code > now means that in order to finish recovery you need to send SIGUSR2 to > the startup process *and* to create the promote signal file. Yes, but isn't this the same as the way to trigger fast promotion in HEAD? Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: Remove non-fast promotion Re: Should we remove a fallbackpromotion? take 2
From
Michael Paquier
Date:
On Tue, Apr 21, 2020 at 02:27:20PM +0900, Fujii Masao wrote: > On 2020/04/21 10:59, Michael Paquier wrote: >> With your patch, this code >> now means that in order to finish recovery you need to send SIGUSR2 to >> the startup process *and* to create the promote signal file. > > Yes, but isn't this the same as the way to trigger fast promotion in HEAD? Yep, but my point is that some users who have been relying only on SIGUSR2 sent to the startup process for a promotion may be surprised to see that doing the same operation does not trigger a promotion anymore. -- Michael
Attachment
Re: Remove non-fast promotion Re: Should we remove a fallbackpromotion? take 2
From
Fujii Masao
Date:
On 2020/04/21 14:54, Michael Paquier wrote: > On Tue, Apr 21, 2020 at 02:27:20PM +0900, Fujii Masao wrote: >> On 2020/04/21 10:59, Michael Paquier wrote: >>> With your patch, this code >>> now means that in order to finish recovery you need to send SIGUSR2 to >>> the startup process *and* to create the promote signal file. >> >> Yes, but isn't this the same as the way to trigger fast promotion in HEAD? > > Yep, but my point is that some users who have been relying only on > SIGUSR2 sent to the startup process for a promotion may be surprised > to see that doing the same operation does not trigger a promotion > anymore. Yeah, but that's not documented. So I don't think that we need to keep the backward-compatibility for that. Also in that case, non-fast promotion is triggered. Since my patch tries to remove non-fast promotion, it's intentional to prevent them from doing that. But you think that we should not drop that because there are still some users for that? Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: Remove non-fast promotion Re: Should we remove a fallbackpromotion? take 2
From
Michael Paquier
Date:
On Tue, Apr 21, 2020 at 03:29:54PM +0900, Fujii Masao wrote: > Yeah, but that's not documented. So I don't think that we need to keep > the backward-compatibility for that. > > Also in that case, non-fast promotion is triggered. Since my patch > tries to remove non-fast promotion, it's intentional to prevent them > from doing that. But you think that we should not drop that because > there are still some users for that? It would be good to ask around to folks maintaining HA solutions about that change at least, as there could be a point in still letting promotion to happen in this case, but switch silently to the fast path. -- Michael
Attachment
Re: Remove non-fast promotion Re: Should we remove a fallbackpromotion? take 2
From
Fujii Masao
Date:
On 2020/04/21 15:36, Michael Paquier wrote: > On Tue, Apr 21, 2020 at 03:29:54PM +0900, Fujii Masao wrote: >> Yeah, but that's not documented. So I don't think that we need to keep >> the backward-compatibility for that. >> >> Also in that case, non-fast promotion is triggered. Since my patch >> tries to remove non-fast promotion, it's intentional to prevent them >> from doing that. But you think that we should not drop that because >> there are still some users for that? > > It would be good to ask around to folks maintaining HA solutions about > that change at least, as there could be a point in still letting > promotion to happen in this case, but switch silently to the fast > path. *If* there are some HA solutions doing that, IMO that they should be changed so that the documented official way to trigger promotion (i.e., pg_ctl promote, pg_promote or trigger_file) is used instead. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: Remove non-fast promotion Re: Should we remove a fallbackpromotion? take 2
From
Kyotaro Horiguchi
Date:
At Tue, 21 Apr 2020 15:48:02 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in > > > On 2020/04/21 15:36, Michael Paquier wrote: > > On Tue, Apr 21, 2020 at 03:29:54PM +0900, Fujii Masao wrote: > >> Yeah, but that's not documented. So I don't think that we need to keep > >> the backward-compatibility for that. > >> > >> Also in that case, non-fast promotion is triggered. Since my patch > >> tries to remove non-fast promotion, it's intentional to prevent them > >> from doing that. But you think that we should not drop that because > >> there are still some users for that? > > It would be good to ask around to folks maintaining HA solutions about > > that change at least, as there could be a point in still letting > > promotion to happen in this case, but switch silently to the fast > > path. > > *If* there are some HA solutions doing that, IMO that they should be > *changed > so that the documented official way to trigger promotion (i.e., pg_ctl > promote, > pg_promote or trigger_file) is used instead. The difference between fast and non-fast promotions is far trivial than the difference between promotion happens or not. I think everyone cares about the new version actually promotes by the steps they have been doing, but few of them even notices the difference between the fast and non-fast. If those who are using non-fast promotion for a certain reason should notice the change of promotion behavior in release notes. This is similar to the change of the default waiting behvaior of pg_ctl at PG10. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Remove non-fast promotion Re: Should we remove a fallbackpromotion? take 2
From
Kyotaro Horiguchi
Date:
At Mon, 20 Apr 2020 15:26:16 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in > Patch attached. I will add this into the first CF for v14. - if (!fast_promoted) + if (!promoted) RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY | CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT); If we don't find the checkpoint record just before, we don't insert End-Of-Recovery record then run an immediate chekpoint. I think if we nuke the non-fast promotion, shouldn't we insert the EOR record even in that case? Or, as Andres suggested upthread, do we always insert it? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Remove non-fast promotion Re: Should we remove a fallbackpromotion? take 2
From
Fujii Masao
Date:
On 2020/04/21 17:15, Kyotaro Horiguchi wrote: > At Mon, 20 Apr 2020 15:26:16 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in >> Patch attached. I will add this into the first CF for v14. > > - if (!fast_promoted) > + if (!promoted) > RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY | > CHECKPOINT_IMMEDIATE | > CHECKPOINT_WAIT); > > If we don't find the checkpoint record just before, we don't insert > End-Of-Recovery record then run an immediate chekpoint. I think if we > nuke the non-fast promotion, shouldn't we insert the EOR record even > in that case? I'm not sure if that's safe. What if the server crashes before the checkpoint completes in that case? Since the last checkpoint record is not available, the subsequent crash recovery will fail. This would lead to that the server will never start up. Right? Currently ISTM that end-of-recovery-checkpoint is executed to avoid such trouble in that case. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: Remove non-fast promotion Re: Should we remove a fallbackpromotion? take 2
From
Jehan-Guillaume de Rorthais
Date:
Hello, On Tue, 21 Apr 2020 15:36:22 +0900 Michael Paquier <michael@paquier.xyz> wrote: > On Tue, Apr 21, 2020 at 03:29:54PM +0900, Fujii Masao wrote: > > Yeah, but that's not documented. So I don't think that we need to keep > > the backward-compatibility for that. > > > > Also in that case, non-fast promotion is triggered. Since my patch > > tries to remove non-fast promotion, it's intentional to prevent them > > from doing that. But you think that we should not drop that because > > there are still some users for that? > > It would be good to ask around to folks maintaining HA solutions about > that change at least, as there could be a point in still letting > promotion to happen in this case, but switch silently to the fast > path. FWIW, PAF relies on pg_ctl promote. No need for non-fast promotion. Regards,
Re: Remove non-fast promotion Re: Should we remove a fallbackpromotion? take 2
From
Alvaro Herrera
Date:
On 2020-Apr-21, Jehan-Guillaume de Rorthais wrote: > On Tue, 21 Apr 2020 15:36:22 +0900 > Michael Paquier <michael@paquier.xyz> wrote: > > > Also in that case, non-fast promotion is triggered. Since my patch > > > tries to remove non-fast promotion, it's intentional to prevent them > > > from doing that. But you think that we should not drop that because > > > there are still some users for that? > > > > It would be good to ask around to folks maintaining HA solutions about > > that change at least, as there could be a point in still letting > > promotion to happen in this case, but switch silently to the fast > > path. > > FWIW, PAF relies on pg_ctl promote. No need for non-fast promotion. AFAICT repmgr uses 'pg_ctl promote', and has since version 3.0 (released in mid 2015). It was only 3.3.2 (mid 2017) that supported Postgres 10, so it seems fairly safe to assume that the removal won't be a problem. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Remove non-fast promotion Re: Should we remove a fallbackpromotion? take 2
From
Alvaro Herrera
Date:
On 2020-Apr-20, Fujii Masao wrote: > + /* > + * In 9.1 and 9.2 the postmaster unlinked the promote file inside the > + * signal handler. It now leaves the file in place and lets the > + * Startup process do the unlink. > + */ > + if (IsPromoteSignaled() && stat(PROMOTE_SIGNAL_FILE, &stat_buf) == 0) > { > - /* > - * In 9.1 and 9.2 the postmaster unlinked the promote file inside the > - * signal handler. It now leaves the file in place and lets the > - * Startup process do the unlink. This allows Startup to know whether > - * it should create a full checkpoint before starting up (fallback > - * mode). Fast promotion takes precedence. > - */ It seems pointless to leave a very old comment that documents what the code no longer does. I thikn it would be better to reword it indicating what the code does do, ie. something like "Leave the signal file in place; it will be removed by the startup process when ..." -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Remove non-fast promotion Re: Should we remove a fallbackpromotion? take 2
From
Kyotaro Horiguchi
Date:
At Tue, 21 Apr 2020 22:08:56 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in > > > On 2020/04/21 17:15, Kyotaro Horiguchi wrote: > > At Mon, 20 Apr 2020 15:26:16 +0900, Fujii Masao > > <masao.fujii@oss.nttdata.com> wrote in > >> Patch attached. I will add this into the first CF for v14. > > - if (!fast_promoted) > > + if (!promoted) > > RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY | > > CHECKPOINT_IMMEDIATE | > > CHECKPOINT_WAIT); > > If we don't find the checkpoint record just before, we don't insert > > End-Of-Recovery record then run an immediate chekpoint. I think if we > > nuke the non-fast promotion, shouldn't we insert the EOR record even > > in that case? > > I'm not sure if that's safe. What if the server crashes before the > checkpoint > completes in that case? Since the last checkpoint record is not > available, > the subsequent crash recovery will fail. This would lead to that the > server > will never start up. Right? Currently ISTM that Yes, that's right. > end-of-recovery-checkpoint > is executed to avoid such trouble in that case. I meant that we always have EOR at the end of recovery. So in the missing latest checkpoint (and crash recovery) case, we insert EOR after the immediate checkpoint. That also means we no longer set CHECKPOINT_END_OF_RECOVERY to the checkpoint, too. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Remove non-fast promotion Re: Should we remove a fallbackpromotion? take 2
From
Ian Barwick
Date:
On 2020/04/22 6:53, Alvaro Herrera wrote: > On 2020-Apr-21, Jehan-Guillaume de Rorthais wrote: > >> On Tue, 21 Apr 2020 15:36:22 +0900 >> Michael Paquier <michael@paquier.xyz> wrote: > >>>> Also in that case, non-fast promotion is triggered. Since my patch >>>> tries to remove non-fast promotion, it's intentional to prevent them >>>> from doing that. But you think that we should not drop that because >>>> there are still some users for that? >>> >>> It would be good to ask around to folks maintaining HA solutions about >>> that change at least, as there could be a point in still letting >>> promotion to happen in this case, but switch silently to the fast >>> path. >> >> FWIW, PAF relies on pg_ctl promote. No need for non-fast promotion. > > AFAICT repmgr uses 'pg_ctl promote', and has since version 3.0 (released > in mid 2015). It was only 3.3.2 (mid 2017) that supported Postgres 10, > so it seems fairly safe to assume that the removal won't be a problem. Correct, repmgr uses "pg_ctl promote" or pg_promote() (if available), and won't be affected by this change. Regards Ian Barwick -- Ian Barwick https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Remove non-fast promotion Re: Should we remove a fallbackpromotion? take 2
From
Kyotaro Horiguchi
Date:
At Wed, 22 Apr 2020 10:28:07 +0900, Ian Barwick <ian.barwick@2ndquadrant.com> wrote in > On 2020/04/22 6:53, Alvaro Herrera wrote: > > On 2020-Apr-21, Jehan-Guillaume de Rorthais wrote: > > > >> On Tue, 21 Apr 2020 15:36:22 +0900 > >> Michael Paquier <michael@paquier.xyz> wrote: > > > >>>> Also in that case, non-fast promotion is triggered. Since my patch > >>>> tries to remove non-fast promotion, it's intentional to prevent them > >>>> from doing that. But you think that we should not drop that because > >>>> there are still some users for that? > >>> > >>> It would be good to ask around to folks maintaining HA solutions about > >>> that change at least, as there could be a point in still letting > >>> promotion to happen in this case, but switch silently to the fast > >>> path. > >> > >> FWIW, PAF relies on pg_ctl promote. No need for non-fast promotion. > > AFAICT repmgr uses 'pg_ctl promote', and has since version 3.0 > > (released > > in mid 2015). It was only 3.3.2 (mid 2017) that supported Postgres > > 10, > > so it seems fairly safe to assume that the removal won't be a problem. > > Correct, repmgr uses "pg_ctl promote" or pg_promote() (if available), > and > won't be affected by this change. For the record, the pgsql resource agent uses "pg_ctl promote" and working with fast-promote. Auxiliary tools for it is assuming fast-promote. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Remove non-fast promotion Re: Should we remove a fallbackpromotion? take 2
From
Fujii Masao
Date:
On 2020/04/22 6:57, Alvaro Herrera wrote: > On 2020-Apr-20, Fujii Masao wrote: > >> + /* >> + * In 9.1 and 9.2 the postmaster unlinked the promote file inside the >> + * signal handler. It now leaves the file in place and lets the >> + * Startup process do the unlink. >> + */ >> + if (IsPromoteSignaled() && stat(PROMOTE_SIGNAL_FILE, &stat_buf) == 0) >> { >> - /* >> - * In 9.1 and 9.2 the postmaster unlinked the promote file inside the >> - * signal handler. It now leaves the file in place and lets the >> - * Startup process do the unlink. This allows Startup to know whether >> - * it should create a full checkpoint before starting up (fallback >> - * mode). Fast promotion takes precedence. >> - */ > > It seems pointless to leave a very old comment that documents what the > code no longer does. I thikn it would be better to reword it indicating > what the code does do, ie. something like "Leave the signal file in > place; it will be removed by the startup process when ..." Agreed. And, while reading the related code, I thought that it's more proper to place this comment in CheckPromoteSignal() rather than CheckForStandbyTrigger(). Because CheckPromoteSignal() actually does what the comment says, i.e., leaves the promote signal file in place and lets the startup process do the unlink. Attached is the updated version of the patch. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Attachment
Re: Remove non-fast promotion Re: Should we remove a fallbackpromotion? take 2
From
Fujii Masao
Date:
On 2020/04/22 10:53, Kyotaro Horiguchi wrote: > At Wed, 22 Apr 2020 10:28:07 +0900, Ian Barwick <ian.barwick@2ndquadrant.com> wrote in >> On 2020/04/22 6:53, Alvaro Herrera wrote: >>> On 2020-Apr-21, Jehan-Guillaume de Rorthais wrote: >>> >>>> On Tue, 21 Apr 2020 15:36:22 +0900 >>>> Michael Paquier <michael@paquier.xyz> wrote: >>> >>>>>> Also in that case, non-fast promotion is triggered. Since my patch >>>>>> tries to remove non-fast promotion, it's intentional to prevent them >>>>>> from doing that. But you think that we should not drop that because >>>>>> there are still some users for that? >>>>> >>>>> It would be good to ask around to folks maintaining HA solutions about >>>>> that change at least, as there could be a point in still letting >>>>> promotion to happen in this case, but switch silently to the fast >>>>> path. >>>> >>>> FWIW, PAF relies on pg_ctl promote. No need for non-fast promotion. >>> AFAICT repmgr uses 'pg_ctl promote', and has since version 3.0 >>> (released >>> in mid 2015). It was only 3.3.2 (mid 2017) that supported Postgres >>> 10, >>> so it seems fairly safe to assume that the removal won't be a problem. >> >> Correct, repmgr uses "pg_ctl promote" or pg_promote() (if available), >> and >> won't be affected by this change. > > For the record, the pgsql resource agent uses "pg_ctl promote" and > working with fast-promote. Auxiliary tools for it is assuming > fast-promote. Thanks all for checking whether the change affects each HA solution! Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: Remove non-fast promotion Re: Should we remove a fallbackpromotion? take 2
From
Fujii Masao
Date:
On 2020/04/22 9:13, Kyotaro Horiguchi wrote: > At Tue, 21 Apr 2020 22:08:56 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in >> >> >> On 2020/04/21 17:15, Kyotaro Horiguchi wrote: >>> At Mon, 20 Apr 2020 15:26:16 +0900, Fujii Masao >>> <masao.fujii@oss.nttdata.com> wrote in >>>> Patch attached. I will add this into the first CF for v14. >>> - if (!fast_promoted) >>> + if (!promoted) >>> RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY | >>> CHECKPOINT_IMMEDIATE | >>> CHECKPOINT_WAIT); >>> If we don't find the checkpoint record just before, we don't insert >>> End-Of-Recovery record then run an immediate chekpoint. I think if we >>> nuke the non-fast promotion, shouldn't we insert the EOR record even >>> in that case? >> >> I'm not sure if that's safe. What if the server crashes before the >> checkpoint >> completes in that case? Since the last checkpoint record is not >> available, >> the subsequent crash recovery will fail. This would lead to that the >> server >> will never start up. Right? Currently ISTM that > > Yes, that's right. > >> end-of-recovery-checkpoint >> is executed to avoid such trouble in that case. > > I meant that we always have EOR at the end of recovery. So in the > missing latest checkpoint (and crash recovery) case, we insert EOR > after the immediate checkpoint. That also means we no longer set > CHECKPOINT_END_OF_RECOVERY to the checkpoint, too. Could you tell me what the benefit by this change is? Even with this change, the server still needs to wait for the checkpoint to complete before becoming the master and starting the service, unlike fast promotion. No? Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: Remove non-fast promotion Re: Should we remove a fallbackpromotion? take 2
From
Kyotaro Horiguchi
Date:
At Wed, 22 Apr 2020 11:51:42 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in > > I meant that we always have EOR at the end of recovery. So in the > > missing latest checkpoint (and crash recovery) case, we insert EOR > > after the immediate checkpoint. That also means we no longer set > > CHECKPOINT_END_OF_RECOVERY to the checkpoint, too. > > Could you tell me what the benefit by this change is? Even with this > change, > the server still needs to wait for the checkpoint to complete before > becoming the master and starting the service, unlike fast > promotion. No? There's no benefit of performance. It's just for simplicity by signalling end-of-recovery in a unified way. Long ago, we had only non-fast promotion, which is marked by CHECKPOINT_END_OF_RECOVERY. When we introduced fast-promotion, it is marked by the END_OF_RECOVERY record since checkpoint record is not inserted at the promotion time. However, we internally fall back to non-fast promotion when we need to make a checkpoint immediately. If we remove non-fast checkpoint, we don't need two means to signal end-of-recovery. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Remove non-fast promotion Re: Should we remove a fallbackpromotion? take 2
From
Jehan-Guillaume de Rorthais
Date:
On Wed, 22 Apr 2020 11:51:15 +0900 Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > On 2020/04/22 10:53, Kyotaro Horiguchi wrote: > [...] > [...] > [...] > [...] > [...] > [...] > [...] > [...] > [...] > [...] > [...] > > Thanks all for checking whether the change affects each HA solution! Unless I'm wrong, we don't have feedback from Patroni team. I did some quick grep and it seems to rely on "pg_ctl promote" as well. Moreover, the latest commit 80fbe9005 force a checkpoint right after the promote. So I suppose they don't use non-fast promote. I CC'ed Alexander Kukushkin to this discussion, so at least he is aware of this topic. Regards,
Re: Remove non-fast promotion Re: Should we remove a fallbackpromotion? take 2
From
Fujii Masao
Date:
On 2020/04/23 3:56, Jehan-Guillaume de Rorthais wrote: > On Wed, 22 Apr 2020 11:51:15 +0900 > Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > >> On 2020/04/22 10:53, Kyotaro Horiguchi wrote: >> [...] >> [...] >> [...] >> [...] >> [...] >> [...] >> [...] >> [...] >> [...] >> [...] >> [...] >> >> Thanks all for checking whether the change affects each HA solution! > > Unless I'm wrong, we don't have feedback from Patroni team. > > I did some quick grep and it seems to rely on "pg_ctl promote" as well. > Moreover, the latest commit 80fbe9005 force a checkpoint right after the > promote. So I suppose they don't use non-fast promote. Thanks for checking that! Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
The following review has been posted through the commitfest application: make installcheck-world: tested, passed Implements feature: not tested Spec compliant: not tested Documentation: not tested I've applied the v2 patch on the master branch. There some hunks, but the patch got applied. So, I ran make installcheck-worldand everything looks fine to me with this patch. Though, I do have a few suggestions in general: (1) I see two functions being used (a) CheckPromoteSignal and (b) IsPromoteSignaled in the code. Should these be combinedinto a single function and perhaps check for "promote_signaled" and the "PROMOTE_SIGNAL_FILE". Not sure if doingthis will break "sigusr1_handler" in postmaster.c though. (2) CheckPromoteSignal is checking for "PROMOTE_SIGNAL_FILE" file. So, perhaps, rather than calling stat on "PROMOTE_SIGNAL_FILE"in if statements, I would suggest to use CheckPromoteSignal function instead as it does nothing butstat on "PROMOTE_SIGNAL_FILE" (after applying your patch). The new status of this patch is: Waiting on Author
On 2020/06/03 3:38, Hamid Akhtar wrote: > The following review has been posted through the commitfest application: > make installcheck-world: tested, passed > Implements feature: not tested > Spec compliant: not tested > Documentation: not tested > > I've applied the v2 patch on the master branch. There some hunks, but the patch got applied. So, I ran make installcheck-worldand everything looks fine to me with this patch. Though, I do have a few suggestions in general: Thanks for the test and review! > (1) I see two functions being used (a) CheckPromoteSignal and (b) IsPromoteSignaled in the code. Should these be combinedinto a single function and perhaps check for "promote_signaled" and the "PROMOTE_SIGNAL_FILE". Not sure if doingthis will break "sigusr1_handler" in postmaster.c though. I don't think we can do that simply. CheckPromoteSignal() can be called by both postmaster and the startup process. OTOH, IsPromoteSignaled() accesses the flag that can be set only in the startup process' signal handler, i.e., it's intended to be called only by the startup process. > (2) CheckPromoteSignal is checking for "PROMOTE_SIGNAL_FILE" file. So, perhaps, rather than calling stat on "PROMOTE_SIGNAL_FILE"in if statements, I would suggest to use CheckPromoteSignal function instead as it does nothing butstat on "PROMOTE_SIGNAL_FILE" (after applying your patch). Yes, that's good idea. Attached is the updated version of the patch. I replaced that stat() with CheckPromoteSignal(). Also I replaced unlink(PROMOTE_SIGNAL_FILE) with RemovePromoteSignalFiles(). > The new status of this patch is: Waiting on Author I will change the status back to Needs Review. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Attachment
At Wed, 3 Jun 2020 09:43:17 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in > I will change the status back to Needs Review. record = ReadCheckpointRecord(xlogreader, checkPointLoc, 1, false); if (record != NULL) { - fast_promoted = true; + promoted = true; Even if we missed the last checkpoint record, we don't give up promotion and continue fall-back promotion but the variable "promoted" stays false. That is confusiong. How about changing it to fallback_promotion, or some names with more behavior-specific name like immediate_checkpoint_needed? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On 2020/06/03 12:06, Kyotaro Horiguchi wrote: > At Wed, 3 Jun 2020 09:43:17 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in >> I will change the status back to Needs Review. Thanks for the review! > record = ReadCheckpointRecord(xlogreader, checkPointLoc, 1, false); > if (record != NULL) > { > - fast_promoted = true; > + promoted = true; > > Even if we missed the last checkpoint record, we don't give up > promotion and continue fall-back promotion but the variable "promoted" > stays false. That is confusiong. > > How about changing it to fallback_promotion, or some names with more > behavior-specific name like immediate_checkpoint_needed? I like doEndOfRecoveryCkpt or something, but I have no strong opinion about that flag naming. So I'm ok with immediate_checkpoint_needed if others also like that, too. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Applying the patch to the current master branch throws 9 hunks. AFAICT, the patch is good otherwise.
On Wed, Jun 3, 2020 at 3:20 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
On 2020/06/03 12:06, Kyotaro Horiguchi wrote:
> At Wed, 3 Jun 2020 09:43:17 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in
>> I will change the status back to Needs Review.
Thanks for the review!
> record = ReadCheckpointRecord(xlogreader, checkPointLoc, 1, false);
> if (record != NULL)
> {
> - fast_promoted = true;
> + promoted = true;
>
> Even if we missed the last checkpoint record, we don't give up
> promotion and continue fall-back promotion but the variable "promoted"
> stays false. That is confusiong.
>
> How about changing it to fallback_promotion, or some names with more
> behavior-specific name like immediate_checkpoint_needed?
I like doEndOfRecoveryCkpt or something, but I have no strong opinion
about that flag naming. So I'm ok with immediate_checkpoint_needed
if others also like that, too.
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
ADDR: 10318 WHALLEY BLVD, Surrey, BC
CELL:+923335449950 EMAIL: mailto:hamid.akhtar@highgo.ca
URL : www.highgo.ca
ADDR: 10318 WHALLEY BLVD, Surrey, BC
CELL:+923335449950 EMAIL: mailto:hamid.akhtar@highgo.ca
SKYPE: engineeredvirus
On 2020/07/27 17:53, Hamid Akhtar wrote: > Applying the patch to the current master branch throws 9 hunks. AFAICT, the patch is good otherwise. So you think that the patch can be marked as Ready for Committer? Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
There have been no real objections on this patch and it seems to work. So, IMHO, the only thing that needs to be done is perhaps update the patch so that it applies clearly on the master branch.
On Mon, Jul 27, 2020 at 9:31 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
On 2020/07/27 17:53, Hamid Akhtar wrote:
> Applying the patch to the current master branch throws 9 hunks. AFAICT, the patch is good otherwise.
So you think that the patch can be marked as Ready for Committer?
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
ADDR: 10318 WHALLEY BLVD, Surrey, BC
CELL:+923335449950 EMAIL: mailto:hamid.akhtar@highgo.ca
URL : www.highgo.ca
ADDR: 10318 WHALLEY BLVD, Surrey, BC
CELL:+923335449950 EMAIL: mailto:hamid.akhtar@highgo.ca
SKYPE: engineeredvirus
On 2020/07/28 20:35, Hamid Akhtar wrote: > There have been no real objections on this patch and it seems to work. Thanks! So I pushed the patch. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION