Thread: BUG #13368: standby cluster immediately promotes after pg_basebackup from previously promoted master

The following bug has been logged on the website:

Bug reference:      13368
Logged by:          Feike Steenbergen
Email address:      feikesteenbergen@gmail.com
PostgreSQL version: 9.4.2
Operating system:   Debian 8.0 x86_64
Description:

We sometimes see a standby server promoting itself to master immediately.

Analysis shows us that the master still has a promote file in the PGDATA
directory. We assume the presence of the promote file (which is copied
by pg_basebackup) is triggering the promotion.

The master itself previously was a standby server. The promotion was done
using pg_ctl promote. Analysis of our logs show that we sent pg_ctl promote
twice to this cluster, this also is reflected in the server log,
"server promoting" shows up twice.

Some testing shows us that in some cases, when pg_ctl promote is called
multiple
times, a promote file is left in the PGDATA directory, even though the
cluster
has been succesfully promoted and is accepting read/write queries.

I tested it in the following manner:

- stop the cluster
- remove promote file if exists
- create recovery.conf containing only "standby_mode = on"
- ensure cluster is ready to accept read only queries
- send pg_ctl promote twice (simultaneously)
- see if the promote file still exists

I ran this test for a while:

111 times no promote file existed afterwards
 41 times a promote file still existed

A short excerpt from the running of the script in a loop:

        20150528120345 Promote file does not exists
        20150528120346 Promote file exists
        20150528120348 Promote file does not exists
        20150528120349 Promote file exists
        20150528120350 Promote file does not exists
        20150528120351 Promote file does not exists
        20150528120352 Promote file does not exists
        20150528120354 Promote file exists

We will try to workaround this issue by ensuring we do not send multiple
promote request using pg_ctl to the same cluster.

We judge this to be a bug, as we would assume the promote file to be
removed
after successful promotion of a cluster.

Version: PostgreSQL 9.4.2 on x86_64-unknown-linux-gnu, compiled by gcc
(Debian 4.9.2-10) 4.9.2, 64-bit
On Thu, May 28, 2015 at 7:07 PM,  <feikesteenbergen@gmail.com> wrote:
> The following bug has been logged on the website:
>
> Bug reference:      13368
> Logged by:          Feike Steenbergen
> Email address:      feikesteenbergen@gmail.com
> PostgreSQL version: 9.4.2
> Operating system:   Debian 8.0 x86_64
> Description:
>
> We sometimes see a standby server promoting itself to master immediately.
>
> Analysis shows us that the master still has a promote file in the PGDATA
> directory. We assume the presence of the promote file (which is copied
> by pg_basebackup) is triggering the promotion.

If there is a promote file in PGDATA when a standby starts up,
promotion will be triggered.

> The master itself previously was a standby server. The promotion was done
> using pg_ctl promote. Analysis of our logs show that we sent pg_ctl promote
> twice to this cluster, this also is reflected in the server log,
> "server promoting" shows up twice.

In this case promotion is triggered by CheckForStandbyTrigger(), where
the promote file is unlinked.

> Some testing shows us that in some cases, when pg_ctl promote is called
> multiple
> times, a promote file is left in the PGDATA directory, even though the
> cluster
> has been succesfully promoted and is accepting read/write queries.

This is not surprising, pg_ctl bases its analysis that a node needs to
be promoted if recovery.conf exists or not, and there is an interval
of time between which recovery.conf is removed and the promotion is
actually triggered, so you can create a promote file even after even
sending SIGUSR1 to the standby's postmaster

> We will try to workaround this issue by ensuring we do not send multiple
> promote request using pg_ctl to the same cluster.

Well, we could for example have the server switch promote to
promote_done in CheckForStandbyTrigger() and then unlink it when
recovery.conf is switched to .done. Opinions are welcome on the
matter.
--
Michael
On Mon, Jun 1, 2015 at 5:19 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Thu, May 28, 2015 at 7:07 PM,  <feikesteenbergen@gmail.com> wrote:
>> The following bug has been logged on the website:
>>
>> Bug reference:      13368
>> Logged by:          Feike Steenbergen
>> Email address:      feikesteenbergen@gmail.com
>> PostgreSQL version: 9.4.2
>> Operating system:   Debian 8.0 x86_64
>> Description:
>>
>> We sometimes see a standby server promoting itself to master immediately.
>>
>> Analysis shows us that the master still has a promote file in the PGDATA
>> directory. We assume the presence of the promote file (which is copied
>> by pg_basebackup) is triggering the promotion.
>
> If there is a promote file in PGDATA when a standby starts up,
> promotion will be triggered.
>
>> The master itself previously was a standby server. The promotion was done
>> using pg_ctl promote. Analysis of our logs show that we sent pg_ctl promote
>> twice to this cluster, this also is reflected in the server log,
>> "server promoting" shows up twice.
>
> In this case promotion is triggered by CheckForStandbyTrigger(), where
> the promote file is unlinked.
>
>> Some testing shows us that in some cases, when pg_ctl promote is called
>> multiple
>> times, a promote file is left in the PGDATA directory, even though the
>> cluster
>> has been succesfully promoted and is accepting read/write queries.
>
> This is not surprising, pg_ctl bases its analysis that a node needs to
> be promoted if recovery.conf exists or not, and there is an interval
> of time between which recovery.conf is removed and the promotion is
> actually triggered, so you can create a promote file even after even
> sending SIGUSR1 to the standby's postmaster
>
>> We will try to workaround this issue by ensuring we do not send multiple
>> promote request using pg_ctl to the same cluster.
>
> Well, we could for example have the server switch promote to
> promote_done in CheckForStandbyTrigger() and then unlink it when
> recovery.conf is switched to .done. Opinions are welcome on the
> matter.

Or we can just always remove the signal file at the end of recovery.
That filename switch seems unnecessary.

In addition to that change, we should make pg_basebackup skip
the signal file?

Regards,

--
Fujii Masao


On Wed, Jun 3, 2015 at 1:04 AM, Fujii Masao wrote:
On Mon, Jun 1, 2015 at 5:19 PM, Michael Paquier
>> Some testing shows us that in some cases, when pg_ctl promote is called
>> multiple
>> times, a promote file is left in the PGDATA directory, even though the
>> cluster
>> has been succesfully promoted and is accepting read/write queries.
>
> This is not surprising, pg_ctl bases its analysis that a node needs to
> be promoted if recovery.conf exists or not, and there is an interval
> of time between which recovery.conf is removed and the promotion is
> actually triggered, so you can create a promote file even after even
> sending SIGUSR1 to the standby's postmaster
>
>> We will try to workaround this issue by ensuring we do not send multiple
>> promote request using pg_ctl to the same cluster.
>
> Well, we could for example have the server switch promote to
> promote_done in CheckForStandbyTrigger() and then unlink it when
> recovery.conf is switched to .done. Opinions are welcome on the
> matter.

Or we can just always remove the signal file at the end of recovery.
That filename switch seems unnecessary.

Well, by doing so, in the event of a crash during recovery the promote signal file would be present in PGDATA, and this would enforce a promotion at the next startup of the node. I don't think that this is a good idea. In the case of a promoted node crash a user may want to look at his node back in a recovery state.

Also, this intermediate promote file, let's say promote.detected, would be useful for external tools to let them know that the promotion has been acknoledged (you can already know it if your tool knows that a promote has been triggered, that promote has been removed by the server and if recovery.conf is still present). That's not something you would want on back branches btw as this changes how promotion bevahes seen from an external point of view. But that would be a patch simple enough (got a WIP for people wondering).

An open question would be what to do with pg_ctl promote if a promote file already exists. I think that we should ignore the creation of the promote file but still kick the signal SIGUSR1.
 
In addition to that change, we should make pg_basebackup skip
the signal file?

Well, yes, and it we would be just fine for the case reported by Feike to just ignore promote and fallback_promote in a base backup, as the problem reported was about a standby that contained the signal promote file after pg_basebackup. And I think that we would be fine by doing that as well in the back-branches. trigger_file is not exposed out of xlog.c in the startup process, but I can live with the fact that it is not ignored.

In short, I guess that the patch attached would be fine.
Opinions?
--
Michael
Attachment
>> In addition to that change, we should make pg_basebackup skip
>> the signal file?
>
>
> Well, yes
> [...]
>
> In short, I guess that the patch attached would be fine.
> Opinions?

To me this patch seems to address the exact issue we reported and as
far as I can tell has no downside to it.
On Fri, Jun 5, 2015 at 3:01 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
>
>
> On Wed, Jun 3, 2015 at 1:04 AM, Fujii Masao wrote:
>>
>> On Mon, Jun 1, 2015 at 5:19 PM, Michael Paquier
>> >> Some testing shows us that in some cases, when pg_ctl promote is called
>> >> multiple
>> >> times, a promote file is left in the PGDATA directory, even though the
>> >> cluster
>> >> has been succesfully promoted and is accepting read/write queries.
>> >
>> > This is not surprising, pg_ctl bases its analysis that a node needs to
>> > be promoted if recovery.conf exists or not, and there is an interval
>> > of time between which recovery.conf is removed and the promotion is
>> > actually triggered, so you can create a promote file even after even
>> > sending SIGUSR1 to the standby's postmaster
>> >
>> >> We will try to workaround this issue by ensuring we do not send
>> >> multiple
>> >> promote request using pg_ctl to the same cluster.
>> >
>> > Well, we could for example have the server switch promote to
>> > promote_done in CheckForStandbyTrigger() and then unlink it when
>> > recovery.conf is switched to .done. Opinions are welcome on the
>> > matter.
>>
>> Or we can just always remove the signal file at the end of recovery.
>> That filename switch seems unnecessary.
>
>
> Well, by doing so, in the event of a crash during recovery the promote
> signal file would be present in PGDATA, and this would enforce a promotion
> at the next startup of the node. I don't think that this is a good idea. In
> the case of a promoted node crash a user may want to look at his node back
> in a recovery state.

You meant the case of crash which occurs before CheckForStandbyTrigger()
removes the signal file after pg_ctl promote is executed? If yes, even if
we rename the file to the intermediate one, the signal file would remain.

If we want to address the above corner case, we can additionally remove
the file always at the beginning of recovery. This idea can completely avoid
an unexpected promotion by the surviving signal file.

>
> Also, this intermediate promote file, let's say promote.detected, would be
> useful for external tools to let them know that the promotion has been
> acknoledged (you can already know it if your tool knows that a promote has
> been triggered, that promote has been removed by the server and if
> recovery.conf is still present). That's not something you would want on back
> branches btw as this changes how promotion bevahes seen from an external
> point of view. But that would be a patch simple enough (got a WIP for people
> wondering).
>
> An open question would be what to do with pg_ctl promote if a promote file
> already exists. I think that we should ignore the creation of the promote
> file but still kick the signal SIGUSR1.
>
>>
>> In addition to that change, we should make pg_basebackup skip
>> the signal file?
>
>
> Well, yes, and it we would be just fine for the case reported by Feike to
> just ignore promote and fallback_promote in a base backup, as the problem
> reported was about a standby that contained the signal promote file after
> pg_basebackup. And I think that we would be fine by doing that as well in
> the back-branches. trigger_file is not exposed out of xlog.c in the startup
> process, but I can live with the fact that it is not ignored.
>
> In short, I guess that the patch attached would be fine.
> Opinions?

I have no strong objection to that change, but it seems half-baked.
That is, that idea doesn't address the case where a base backup is
taken by other than pg_basebackup at all.

Regards,

--
Fujii Masao
On Fri, Jun 5, 2015 at 11:06 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Fri, Jun 5, 2015 at 3:01 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>>
>>
>> On Wed, Jun 3, 2015 at 1:04 AM, Fujii Masao wrote:
>>>
>>> On Mon, Jun 1, 2015 at 5:19 PM, Michael Paquier
>>> >> Some testing shows us that in some cases, when pg_ctl promote is called
>>> >> multiple
>>> >> times, a promote file is left in the PGDATA directory, even though the
>>> >> cluster
>>> >> has been succesfully promoted and is accepting read/write queries.
>>> >
>>> > This is not surprising, pg_ctl bases its analysis that a node needs to
>>> > be promoted if recovery.conf exists or not, and there is an interval
>>> > of time between which recovery.conf is removed and the promotion is
>>> > actually triggered, so you can create a promote file even after even
>>> > sending SIGUSR1 to the standby's postmaster
>>> >
>>> >> We will try to workaround this issue by ensuring we do not send
>>> >> multiple
>>> >> promote request using pg_ctl to the same cluster.
>>> >
>>> > Well, we could for example have the server switch promote to
>>> > promote_done in CheckForStandbyTrigger() and then unlink it when
>>> > recovery.conf is switched to .done. Opinions are welcome on the
>>> > matter.
>>>
>>> Or we can just always remove the signal file at the end of recovery.
>>> That filename switch seems unnecessary.
>>
>>
>> Well, by doing so, in the event of a crash during recovery the promote
>> signal file would be present in PGDATA, and this would enforce a promotion
>> at the next startup of the node. I don't think that this is a good idea. In
>> the case of a promoted node crash a user may want to look at his node back
>> in a recovery state.
>
> You meant the case of crash which occurs before CheckForStandbyTrigger()
> removes the signal file after pg_ctl promote is executed? If yes, even if
> we rename the file to the intermediate one, the signal file would remain.
>
> If we want to address the above corner case, we can additionally remove
> the file always at the beginning of recovery. This idea can completely avoid
> an unexpected promotion by the surviving signal file.

Then what about the case where a promote file is let by user on
purpose to trigger a promotion on restart?

>> Also, this intermediate promote file, let's say promote.detected, would be
>> useful for external tools to let them know that the promotion has been
>> acknoledged (you can already know it if your tool knows that a promote has
>> been triggered, that promote has been removed by the server and if
>> recovery.conf is still present). That's not something you would want on back
>> branches btw as this changes how promotion bevahes seen from an external
>> point of view. But that would be a patch simple enough (got a WIP for people
>> wondering).
>>
>> An open question would be what to do with pg_ctl promote if a promote file
>> already exists. I think that we should ignore the creation of the promote
>> file but still kick the signal SIGUSR1.
>>
>>>
>>> In addition to that change, we should make pg_basebackup skip
>>> the signal file?
>>
>>
>> Well, yes, and it we would be just fine for the case reported by Feike to
>> just ignore promote and fallback_promote in a base backup, as the problem
>> reported was about a standby that contained the signal promote file after
>> pg_basebackup. And I think that we would be fine by doing that as well in
>> the back-branches. trigger_file is not exposed out of xlog.c in the startup
>> process, but I can live with the fact that it is not ignored.
>>
>> In short, I guess that the patch attached would be fine.
>> Opinions?
>
> I have no strong objection to that change, but it seems half-baked.
> That is, that idea doesn't address the case where a base backup is
> taken by other than pg_basebackup at all.

That's the same problem with for example postmaster.pid,
postmaster.opts or similar when taking a FS-level backup.
--
Michael
On Sat, Jun 6, 2015 at 8:36 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Fri, Jun 5, 2015 at 11:06 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> On Fri, Jun 5, 2015 at 3:01 PM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>>>
>>>
>>> On Wed, Jun 3, 2015 at 1:04 AM, Fujii Masao wrote:
>>>>
>>>> On Mon, Jun 1, 2015 at 5:19 PM, Michael Paquier
>>>> >> Some testing shows us that in some cases, when pg_ctl promote is called
>>>> >> multiple
>>>> >> times, a promote file is left in the PGDATA directory, even though the
>>>> >> cluster
>>>> >> has been succesfully promoted and is accepting read/write queries.
>>>> >
>>>> > This is not surprising, pg_ctl bases its analysis that a node needs to
>>>> > be promoted if recovery.conf exists or not, and there is an interval
>>>> > of time between which recovery.conf is removed and the promotion is
>>>> > actually triggered, so you can create a promote file even after even
>>>> > sending SIGUSR1 to the standby's postmaster
>>>> >
>>>> >> We will try to workaround this issue by ensuring we do not send
>>>> >> multiple
>>>> >> promote request using pg_ctl to the same cluster.
>>>> >
>>>> > Well, we could for example have the server switch promote to
>>>> > promote_done in CheckForStandbyTrigger() and then unlink it when
>>>> > recovery.conf is switched to .done. Opinions are welcome on the
>>>> > matter.
>>>>
>>>> Or we can just always remove the signal file at the end of recovery.
>>>> That filename switch seems unnecessary.
>>>
>>>
>>> Well, by doing so, in the event of a crash during recovery the promote
>>> signal file would be present in PGDATA, and this would enforce a promotion
>>> at the next startup of the node. I don't think that this is a good idea. In
>>> the case of a promoted node crash a user may want to look at his node back
>>> in a recovery state.
>>
>> You meant the case of crash which occurs before CheckForStandbyTrigger()
>> removes the signal file after pg_ctl promote is executed? If yes, even if
>> we rename the file to the intermediate one, the signal file would remain.
>>
>> If we want to address the above corner case, we can additionally remove
>> the file always at the beginning of recovery. This idea can completely avoid
>> an unexpected promotion by the surviving signal file.
>
> Then what about the case where a promote file is let by user on
> purpose to trigger a promotion on restart?

Sounds like non-realistic case. If the promote file exists at
the beginning of recovery, the recovery might try to end before
reaching the consistency point, and then it would fail.
Why do people want to make the recovery fail in that way?

>>> Also, this intermediate promote file, let's say promote.detected, would be
>>> useful for external tools to let them know that the promotion has been
>>> acknoledged (you can already know it if your tool knows that a promote has
>>> been triggered, that promote has been removed by the server and if
>>> recovery.conf is still present). That's not something you would want on back
>>> branches btw as this changes how promotion bevahes seen from an external
>>> point of view. But that would be a patch simple enough (got a WIP for people
>>> wondering).
>>>
>>> An open question would be what to do with pg_ctl promote if a promote file
>>> already exists. I think that we should ignore the creation of the promote
>>> file but still kick the signal SIGUSR1.
>>>
>>>>
>>>> In addition to that change, we should make pg_basebackup skip
>>>> the signal file?
>>>
>>>
>>> Well, yes, and it we would be just fine for the case reported by Feike to
>>> just ignore promote and fallback_promote in a base backup, as the problem
>>> reported was about a standby that contained the signal promote file after
>>> pg_basebackup. And I think that we would be fine by doing that as well in
>>> the back-branches. trigger_file is not exposed out of xlog.c in the startup
>>> process, but I can live with the fact that it is not ignored.
>>>
>>> In short, I guess that the patch attached would be fine.
>>> Opinions?
>>
>> I have no strong objection to that change, but it seems half-baked.
>> That is, that idea doesn't address the case where a base backup is
>> taken by other than pg_basebackup at all.
>
> That's the same problem with for example postmaster.pid,
> postmaster.opts or similar when taking a FS-level backup.

Yes, but even if such bogus postmaster.pid exists in $PGDATA,
the server can handle it and restart almost harmlessly. But
the promote file not.

Regards,

--
Fujii Masao
On Fri, Jun 12, 2015 at 2:43 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Sat, Jun 6, 2015 at 8:36 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Fri, Jun 5, 2015 at 11:06 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>> On Fri, Jun 5, 2015 at 3:01 PM, Michael Paquier
>>> <michael.paquier@gmail.com> wrote:
>>>>
>>>>
>>>> On Wed, Jun 3, 2015 at 1:04 AM, Fujii Masao wrote:
>>>>>
>>>>> On Mon, Jun 1, 2015 at 5:19 PM, Michael Paquier
>>>>> >> Some testing shows us that in some cases, when pg_ctl promote is called
>>>>> >> multiple
>>>>> >> times, a promote file is left in the PGDATA directory, even though the
>>>>> >> cluster
>>>>> >> has been succesfully promoted and is accepting read/write queries.
>>>>> >
>>>>> > This is not surprising, pg_ctl bases its analysis that a node needs to
>>>>> > be promoted if recovery.conf exists or not, and there is an interval
>>>>> > of time between which recovery.conf is removed and the promotion is
>>>>> > actually triggered, so you can create a promote file even after even
>>>>> > sending SIGUSR1 to the standby's postmaster
>>>>> >
>>>>> >> We will try to workaround this issue by ensuring we do not send
>>>>> >> multiple
>>>>> >> promote request using pg_ctl to the same cluster.
>>>>> >
>>>>> > Well, we could for example have the server switch promote to
>>>>> > promote_done in CheckForStandbyTrigger() and then unlink it when
>>>>> > recovery.conf is switched to .done. Opinions are welcome on the
>>>>> > matter.
>>>>>
>>>>> Or we can just always remove the signal file at the end of recovery.
>>>>> That filename switch seems unnecessary.
>>>>
>>>>
>>>> Well, by doing so, in the event of a crash during recovery the promote
>>>> signal file would be present in PGDATA, and this would enforce a promotion
>>>> at the next startup of the node. I don't think that this is a good idea. In
>>>> the case of a promoted node crash a user may want to look at his node back
>>>> in a recovery state.
>>>
>>> You meant the case of crash which occurs before CheckForStandbyTrigger()
>>> removes the signal file after pg_ctl promote is executed? If yes, even if
>>> we rename the file to the intermediate one, the signal file would remain.
>>>
>>> If we want to address the above corner case, we can additionally remove
>>> the file always at the beginning of recovery. This idea can completely avoid
>>> an unexpected promotion by the surviving signal file.
>>
>> Then what about the case where a promote file is let by user on
>> purpose to trigger a promotion on restart?
>
> Sounds like non-realistic case. If the promote file exists at
> the beginning of recovery, the recovery might try to end before
> reaching the consistency point, and then it would fail.
> Why do people want to make the recovery fail in that way?

Hm. OK.

>>>> Also, this intermediate promote file, let's say promote.detected, would be
>>>> useful for external tools to let them know that the promotion has been
>>>> acknowledged (you can already know it if your tool knows that a promote has
>>>> been triggered, that promote has been removed by the server and if
>>>> recovery.conf is still present). That's not something you would want on back
>>>> branches btw as this changes how promotion behaves seen from an external
>>>> point of view. But that would be a patch simple enough (got a WIP for people
>>>> wondering).
>>>>
>>>> An open question would be what to do with pg_ctl promote if a promote file
>>>> already exists. I think that we should ignore the creation of the promote
>>>> file but still kick the signal SIGUSR1.
>>>>
>>>>>
>>>>> In addition to that change, we should make pg_basebackup skip
>>>>> the signal file?
>>>>
>>>>
>>>> Well, yes, and it we would be just fine for the case reported by Feike to
>>>> just ignore promote and fallback_promote in a base backup, as the problem
>>>> reported was about a standby that contained the signal promote file after
>>>> pg_basebackup. And I think that we would be fine by doing that as well in
>>>> the back-branches. trigger_file is not exposed out of xlog.c in the startup
>>>> process, but I can live with the fact that it is not ignored.
>>>>
>>>> In short, I guess that the patch attached would be fine.
>>>> Opinions?
>>>
>>> I have no strong objection to that change, but it seems half-baked.
>>> That is, that idea doesn't address the case where a base backup is
>>> taken by other than pg_basebackup at all.
>>
>> That's the same problem with for example postmaster.pid,
>> postmaster.opts or similar when taking a FS-level backup.
>
> Yes, but even if such bogus postmaster.pid exists in $PGDATA,
> the server can handle it and restart almost harmlessly. But
> the promote file not.

Then what about an idea like the attached. At the beginning of
StartupXLOG() before parsing recovery.conf we could check for the
existence of promotion trigger files and unlink them unconditionally.
--
Michael

Attachment
On Tue, Jun 16, 2015 at 4:42 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Fri, Jun 12, 2015 at 2:43 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> On Sat, Jun 6, 2015 at 8:36 AM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>>> On Fri, Jun 5, 2015 at 11:06 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>>> On Fri, Jun 5, 2015 at 3:01 PM, Michael Paquier
>>>> <michael.paquier@gmail.com> wrote:
>>>>>
>>>>>
>>>>> On Wed, Jun 3, 2015 at 1:04 AM, Fujii Masao wrote:
>>>>>>
>>>>>> On Mon, Jun 1, 2015 at 5:19 PM, Michael Paquier
>>>>>> >> Some testing shows us that in some cases, when pg_ctl promote is called
>>>>>> >> multiple
>>>>>> >> times, a promote file is left in the PGDATA directory, even though the
>>>>>> >> cluster
>>>>>> >> has been succesfully promoted and is accepting read/write queries.
>>>>>> >
>>>>>> > This is not surprising, pg_ctl bases its analysis that a node needs to
>>>>>> > be promoted if recovery.conf exists or not, and there is an interval
>>>>>> > of time between which recovery.conf is removed and the promotion is
>>>>>> > actually triggered, so you can create a promote file even after even
>>>>>> > sending SIGUSR1 to the standby's postmaster
>>>>>> >
>>>>>> >> We will try to workaround this issue by ensuring we do not send
>>>>>> >> multiple
>>>>>> >> promote request using pg_ctl to the same cluster.
>>>>>> >
>>>>>> > Well, we could for example have the server switch promote to
>>>>>> > promote_done in CheckForStandbyTrigger() and then unlink it when
>>>>>> > recovery.conf is switched to .done. Opinions are welcome on the
>>>>>> > matter.
>>>>>>
>>>>>> Or we can just always remove the signal file at the end of recovery.
>>>>>> That filename switch seems unnecessary.
>>>>>
>>>>>
>>>>> Well, by doing so, in the event of a crash during recovery the promote
>>>>> signal file would be present in PGDATA, and this would enforce a promotion
>>>>> at the next startup of the node. I don't think that this is a good idea. In
>>>>> the case of a promoted node crash a user may want to look at his node back
>>>>> in a recovery state.
>>>>
>>>> You meant the case of crash which occurs before CheckForStandbyTrigger()
>>>> removes the signal file after pg_ctl promote is executed? If yes, even if
>>>> we rename the file to the intermediate one, the signal file would remain.
>>>>
>>>> If we want to address the above corner case, we can additionally remove
>>>> the file always at the beginning of recovery. This idea can completely avoid
>>>> an unexpected promotion by the surviving signal file.
>>>
>>> Then what about the case where a promote file is let by user on
>>> purpose to trigger a promotion on restart?
>>
>> Sounds like non-realistic case. If the promote file exists at
>> the beginning of recovery, the recovery might try to end before
>> reaching the consistency point, and then it would fail.
>> Why do people want to make the recovery fail in that way?
>
> Hm. OK.
>
>>>>> Also, this intermediate promote file, let's say promote.detected, would be
>>>>> useful for external tools to let them know that the promotion has been
>>>>> acknowledged (you can already know it if your tool knows that a promote has
>>>>> been triggered, that promote has been removed by the server and if
>>>>> recovery.conf is still present). That's not something you would want on back
>>>>> branches btw as this changes how promotion behaves seen from an external
>>>>> point of view. But that would be a patch simple enough (got a WIP for people
>>>>> wondering).
>>>>>
>>>>> An open question would be what to do with pg_ctl promote if a promote file
>>>>> already exists. I think that we should ignore the creation of the promote
>>>>> file but still kick the signal SIGUSR1.
>>>>>
>>>>>>
>>>>>> In addition to that change, we should make pg_basebackup skip
>>>>>> the signal file?
>>>>>
>>>>>
>>>>> Well, yes, and it we would be just fine for the case reported by Feike to
>>>>> just ignore promote and fallback_promote in a base backup, as the problem
>>>>> reported was about a standby that contained the signal promote file after
>>>>> pg_basebackup. And I think that we would be fine by doing that as well in
>>>>> the back-branches. trigger_file is not exposed out of xlog.c in the startup
>>>>> process, but I can live with the fact that it is not ignored.
>>>>>
>>>>> In short, I guess that the patch attached would be fine.
>>>>> Opinions?
>>>>
>>>> I have no strong objection to that change, but it seems half-baked.
>>>> That is, that idea doesn't address the case where a base backup is
>>>> taken by other than pg_basebackup at all.
>>>
>>> That's the same problem with for example postmaster.pid,
>>> postmaster.opts or similar when taking a FS-level backup.
>>
>> Yes, but even if such bogus postmaster.pid exists in $PGDATA,
>> the server can handle it and restart almost harmlessly. But
>> the promote file not.
>
> Then what about an idea like the attached.

Thanks for the patch!

> At the beginning of
> StartupXLOG() before parsing recovery.conf we could check for the
> existence of promotion trigger files and unlink them unconditionally.

There seems to be still race condition: postmaster can receive SIGUSR1
before the startup process removes the promotion trigger file. Then
the variable promote_triggered can be set to true unexpectedly.

So, what about making postmaster remove the trigger file unconditionally
before the startup process is forked, instead? For example, just after
PostmasterMain() calls RemovePgTempFiles()?

Regards,

--
Fujii Masao
On Thu, Jul 2, 2015 at 10:00 PM, Fujii Masao wrote:
>> At the beginning of
>> StartupXLOG() before parsing recovery.conf we could check for the
>> existence of promotion trigger files and unlink them unconditionally.
>
> There seems to be still race condition: postmaster can receive SIGUSR1
> before the startup process removes the promotion trigger file. Then
> the variable promote_triggered can be set to true unexpectedly.

Ah, yes. That's indeed possible.

> So, what about making postmaster remove the trigger file unconditionally
> before the startup process is forked, instead? For example, just after
> PostmasterMain() calls RemovePgTempFiles()?

Sounds fine, no other processes are running at this point.
--
Michael

Attachment
On Fri, Jul 3, 2015 at 4:03 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Thu, Jul 2, 2015 at 10:00 PM, Fujii Masao wrote:
>>> At the beginning of
>>> StartupXLOG() before parsing recovery.conf we could check for the
>>> existence of promotion trigger files and unlink them unconditionally.
>>
>> There seems to be still race condition: postmaster can receive SIGUSR1
>> before the startup process removes the promotion trigger file. Then
>> the variable promote_triggered can be set to true unexpectedly.
>
> Ah, yes. That's indeed possible.
>
>> So, what about making postmaster remove the trigger file unconditionally
>> before the startup process is forked, instead? For example, just after
>> PostmasterMain() calls RemovePgTempFiles()?
>
> Sounds fine, no other processes are running at this point.

Thanks for updating the patch!

+                * Skip promote signal files. Those files may have
been created by
+                * pg_ctl to trigger a promotion but they do not
belong to a base
+                * backup.
+                */
+               if (strcmp(de->d_name, PROMOTE_SIGNAL_FILE) == 0 ||
+                       strcmp(de->d_name, FALLBACK_PROMOTE_SIGNAL_FILE) == 0)
+                       continue;

Do we really want this? This is not required because the files are
always removed at the server startup. Also the backup performance
gain by skipping them would be almost zero since their size is zero.
That is, no need to add this for performance. OTOH, it's harmless
to add this. But I don't want to enlarge the "skip file list" for
pg_basebackup beyond necessary. So I removed this from the patch.

BTW, if we really want this, not only pg_basebackup but also pg_rewind
would need to skip the files.

+        * Remove any trigger file able to trigger promotion, this is safe from
+        * race conditions involving signals as no other processes are running
+        * yet.
+        */
+       RemoveStandbyTrigger();

I changed the function name to RemovePromoteSignalFiles().

I added more detail comment about the race condition that we're
talking about, into here.

Updated version of the patch attached.

Regards,

--
Fujii Masao

Attachment
On Fri, Jul 3, 2015 at 9:26 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Fri, Jul 3, 2015 at 4:03 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Thu, Jul 2, 2015 at 10:00 PM, Fujii Masao wrote:
>>>> At the beginning of
>>>> StartupXLOG() before parsing recovery.conf we could check for the
>>>> existence of promotion trigger files and unlink them unconditionally.
>>>
>>> There seems to be still race condition: postmaster can receive SIGUSR1
>>> before the startup process removes the promotion trigger file. Then
>>> the variable promote_triggered can be set to true unexpectedly.
>>
>> Ah, yes. That's indeed possible.
>>
>>> So, what about making postmaster remove the trigger file unconditionally
>>> before the startup process is forked, instead? For example, just after
>>> PostmasterMain() calls RemovePgTempFiles()?
>>
>> Sounds fine, no other processes are running at this point.
>
> Thanks for updating the patch!
>
> +                * Skip promote signal files. Those files may have
> been created by
> +                * pg_ctl to trigger a promotion but they do not
> belong to a base
> +                * backup.
> +                */
> +               if (strcmp(de->d_name, PROMOTE_SIGNAL_FILE) == 0 ||
> +                       strcmp(de->d_name, FALLBACK_PROMOTE_SIGNAL_FILE) == 0)
> +                       continue;
>
> Do we really want this? This is not required because the files are
> always removed at the server startup. Also the backup performance
> gain by skipping them would be almost zero since their size is zero.
> That is, no need to add this for performance. OTOH, it's harmless
> to add this. But I don't want to enlarge the "skip file list" for
> pg_basebackup beyond necessary. So I removed this from the patch.
>
> BTW, if we really want this, not only pg_basebackup but also pg_rewind
> would need to skip the files.
>
> +        * Remove any trigger file able to trigger promotion, this is safe from
> +        * race conditions involving signals as no other processes are running
> +        * yet.
> +        */
> +       RemoveStandbyTrigger();
>
> I changed the function name to RemovePromoteSignalFiles().
>
> I added more detail comment about the race condition that we're
> talking about, into here.
>
> Updated version of the patch attached.

Pushed.

Regards,

--
Fujii Masao