Thread: could not stat promote trigger file leads to shutdown

could not stat promote trigger file leads to shutdown

From
Peter Eisentraut
Date:
I have seen the error

     could not stat promote trigger file "...": Permission denied

because of a misconfiguration (for example, setting promote_trigger_file 
to point into a directory to which you don't have appropriate read or 
execute access).

The problem is that because this happens in the startup process, the 
ERROR is turned into a FATAL and the whole instance shuts down.  That 
seems like a harsh penalty.  Would it be better to turn this ERROR into 
a WARNING?

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: could not stat promote trigger file leads to shutdown

From
Fujii Masao
Date:
On Thu, Nov 14, 2019 at 10:58 PM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
>
> I have seen the error
>
>      could not stat promote trigger file "...": Permission denied
>
> because of a misconfiguration (for example, setting promote_trigger_file
> to point into a directory to which you don't have appropriate read or
> execute access).
>
> The problem is that because this happens in the startup process, the
> ERROR is turned into a FATAL and the whole instance shuts down.  That
> seems like a harsh penalty.  Would it be better to turn this ERROR into
> a WARNING?

+1

Regards,

-- 
Fujii Masao



Re: could not stat promote trigger file leads to shutdown

From
Tom Lane
Date:
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> I have seen the error
>      could not stat promote trigger file "...": Permission denied
> because of a misconfiguration (for example, setting promote_trigger_file 
> to point into a directory to which you don't have appropriate read or 
> execute access).

> The problem is that because this happens in the startup process, the 
> ERROR is turned into a FATAL and the whole instance shuts down.  That 
> seems like a harsh penalty.  Would it be better to turn this ERROR into 
> a WARNING?

It is harsh, but I suspect if we just logged the complaint, we'd get
bug reports about "Postgres isn't reacting to my trigger file",
because people don't read the postmaster log unless forced to.
Is there some more-visible way to report the problem, short of
shutting down?

(BTW, from this perspective, WARNING is especially bad because it
might not get logged at all.  Better to use LOG.)

One thought is to try to detect the misconfiguration at postmaster
start --- better to fail at startup than sometime later.  But I'm
not sure how reliably we could do that.

            regards, tom lane



Re: could not stat promote trigger file leads to shutdown

From
Michael Paquier
Date:
On Thu, Nov 14, 2019 at 10:38:30AM -0500, Tom Lane wrote:
> It is harsh, but I suspect if we just logged the complaint, we'd get
> bug reports about "Postgres isn't reacting to my trigger file",
> because people don't read the postmaster log unless forced to.
> Is there some more-visible way to report the problem, short of
> shutting down?
>
> (BTW, from this perspective, WARNING is especially bad because it
> might not get logged at all.  Better to use LOG.)

Neither am I comfortable with that.

> One thought is to try to detect the misconfiguration at postmaster
> start --- better to fail at startup than sometime later.  But I'm
> not sure how reliably we could do that.

I think that we could do something close to the area where
RemovePromoteSignalFiles() gets called.  Why not simply checking the
path defined by PromoteTriggerFile() at startup time then?  I take it
that the only thing we should not complain about is stat() returning
ENOENT when looking at the promote file defined.
--
Michael

Attachment

Re: could not stat promote trigger file leads to shutdown

From
Fujii Masao
Date:
On Fri, Nov 15, 2019 at 10:49 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, Nov 14, 2019 at 10:38:30AM -0500, Tom Lane wrote:
> > It is harsh, but I suspect if we just logged the complaint, we'd get
> > bug reports about "Postgres isn't reacting to my trigger file",
> > because people don't read the postmaster log unless forced to.
> > Is there some more-visible way to report the problem, short of
> > shutting down?
> >
> > (BTW, from this perspective, WARNING is especially bad because it
> > might not get logged at all.  Better to use LOG.)
>
> Neither am I comfortable with that.

I always wonder why WARNING was defined that way.
I think that users usually pay attention to the word "WARNING"
rather than "LOG".

> > One thought is to try to detect the misconfiguration at postmaster
> > start --- better to fail at startup than sometime later.  But I'm
> > not sure how reliably we could do that.
>
> I think that we could do something close to the area where
> RemovePromoteSignalFiles() gets called.  Why not simply checking the
> path defined by PromoteTriggerFile() at startup time then?  I take it
> that the only thing we should not complain about is stat() returning
> ENOENT when looking at the promote file defined.

promote_trigger_file is declared as PGC_SIGHUP,
so such check would be necessary even while the standby is running.
Which can cause the server to fail after the startup.

Regards,

-- 
Fujii Masao



Re: could not stat promote trigger file leads to shutdown

From
Peter Eisentraut
Date:
On 2019-11-15 03:14, Fujii Masao wrote:
>>> One thought is to try to detect the misconfiguration at postmaster
>>> start --- better to fail at startup than sometime later.  But I'm
>>> not sure how reliably we could do that.
>> I think that we could do something close to the area where
>> RemovePromoteSignalFiles() gets called.  Why not simply checking the
>> path defined by PromoteTriggerFile() at startup time then?  I take it
>> that the only thing we should not complain about is stat() returning
>> ENOENT when looking at the promote file defined.
> promote_trigger_file is declared as PGC_SIGHUP,
> so such check would be necessary even while the standby is running.
> Which can cause the server to fail after the startup.

Let me illustrate a scenario in a more lively way:

Say you want to set up promote_trigger_file to point to a file outside 
of the data directory, maybe because you want to integrate it with some 
external tooling.  So you go into your configuration and set

     promote_trigger_file = '/srv/foobar/trigger'

and reload the server.  Everything is happy.  The fact that the 
directory /srv/foobar/ does not exist at this point is completely ignored.

Now you become root and run

     mkdir /srv/foobar

and, depending circumstances such as root's umask or the permissions of 
/srv, your PostgreSQL server crashes immediately.  That can't be good.

Also imagine the above steps being run by a configuration management system.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: could not stat promote trigger file leads to shutdown

From
Tom Lane
Date:
Fujii Masao <masao.fujii@gmail.com> writes:
> On Fri, Nov 15, 2019 at 10:49 AM Michael Paquier <michael@paquier.xyz> wrote:
>> On Thu, Nov 14, 2019 at 10:38:30AM -0500, Tom Lane wrote:
>>> (BTW, from this perspective, WARNING is especially bad because it
>>>> might not get logged at all.  Better to use LOG.)

>> Neither am I comfortable with that.

> I always wonder why WARNING was defined that way.
> I think that users usually pay attention to the word "WARNING"
> rather than "LOG".

The issue really is "what are we warning about".  The way things
are set up basically assumes that WARNING is for complaining about
user-issued commands that might not be doing what the user wants.
Which is a legitimate use-case, but it doesn't necessarily mean
something that's very important to put in the postmaster log.

What we're seeing, in these repeated proposals to use WARNING in
some background process that doesn't run user commands, is that
there is also a use-case for "more-significant-than-usual log
messages".  Maybe we need a new elevel category for that.
SYSTEM_WARNING or LOG_WARNING, perhaps?

>> I think that we could do something close to the area where
>> RemovePromoteSignalFiles() gets called.  Why not simply checking the
>> path defined by PromoteTriggerFile() at startup time then?  I take it
>> that the only thing we should not complain about is stat() returning
>> ENOENT when looking at the promote file defined.

> promote_trigger_file is declared as PGC_SIGHUP,
> so such check would be necessary even while the standby is running.
> Which can cause the server to fail after the startup.

No, it'd be just the same as any other GUC: if we make such a test
in the check hook, then we'd fail for a bad value at startup, but
at SIGHUP we'd just reject the new setting.  I think this might be
a workable answer to Peter's concern.

            regards, tom lane



Re: could not stat promote trigger file leads to shutdown

From
Sergei Kornilov
Date:
Hello

> Maybe we need a new elevel category for that.
> SYSTEM_WARNING or LOG_WARNING, perhaps?

I think a separate levels for user warnings and system warnings (and errors) would be great for log analytics. Error
dueto user typo in query is not the same as cache lookup error (for example).
 

regards, Sergei



Re: could not stat promote trigger file leads to shutdown

From
Tom Lane
Date:
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> Say you want to set up promote_trigger_file to point to a file outside 
> of the data directory, maybe because you want to integrate it with some 
> external tooling.  So you go into your configuration and set
>      promote_trigger_file = '/srv/foobar/trigger'
> and reload the server.  Everything is happy.  The fact that the 
> directory /srv/foobar/ does not exist at this point is completely ignored.
> Now you become root and run
>      mkdir /srv/foobar
> and, depending circumstances such as root's umask or the permissions of 
> /srv, your PostgreSQL server crashes immediately.  That can't be good.

No, it's not good, but the proposed fix of s/ERROR/LOG/ simply delays
the problem till later, ie when you try to promote the server nothing
happens.  That's not good either.  (To be clear: I'm not necessarily
against that change, I just don't think it's a sufficient response.)

If we add a GUC-check-hook test, then the problem of misconfiguration
is reduced to the previously unsolved problem that we have crappy
feedback for erroneous on-the-fly configuration changes.  So it's
still unsolved, but at least we've got one unsolved problem not two.

            regards, tom lane



Re: could not stat promote trigger file leads to shutdown

From
Alvaro Herrera
Date:
On 2019-Nov-15, Tom Lane wrote:

> If we add a GUC-check-hook test, then the problem of misconfiguration
> is reduced to the previously unsolved problem that we have crappy
> feedback for erroneous on-the-fly configuration changes.  So it's
> still unsolved, but at least we've got one unsolved problem not two.

I am now against this kind of behavior, because nowadays it is common
to have external orchestrating systems stopping and starting postmaster
on their own volition.

If this kind of misconfiguration causes postmaster refuse to start, it
can effectively become a service-shutdown scenario which requires the
DBA to go temporarily mad.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: could not stat promote trigger file leads to shutdown

From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> On 2019-Nov-15, Tom Lane wrote:
>> If we add a GUC-check-hook test, then the problem of misconfiguration
>> is reduced to the previously unsolved problem that we have crappy
>> feedback for erroneous on-the-fly configuration changes.  So it's
>> still unsolved, but at least we've got one unsolved problem not two.

> I am now against this kind of behavior, because nowadays it is common
> to have external orchestrating systems stopping and starting postmaster
> on their own volition.

> If this kind of misconfiguration causes postmaster refuse to start, it
> can effectively become a service-shutdown scenario which requires the
> DBA to go temporarily mad.

By that argument, postgresql.conf could contain complete garbage
and the postmaster should still start.  I'm not willing to say
that an "external orchestrating system" doesn't need to take
responsibility for putting valid info into the config file.

            regards, tom lane



Re: could not stat promote trigger file leads to shutdown

From
Peter Eisentraut
Date:
On 2019-11-15 19:23, Tom Lane wrote:
> Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
>> Say you want to set up promote_trigger_file to point to a file outside
>> of the data directory, maybe because you want to integrate it with some
>> external tooling.  So you go into your configuration and set
>>       promote_trigger_file = '/srv/foobar/trigger'
>> and reload the server.  Everything is happy.  The fact that the
>> directory /srv/foobar/ does not exist at this point is completely ignored.
>> Now you become root and run
>>       mkdir /srv/foobar
>> and, depending circumstances such as root's umask or the permissions of
>> /srv, your PostgreSQL server crashes immediately.  That can't be good.
> 
> No, it's not good, but the proposed fix of s/ERROR/LOG/ simply delays
> the problem till later, ie when you try to promote the server nothing
> happens.  That's not good either.  (To be clear: I'm not necessarily
> against that change, I just don't think it's a sufficient response.)
> 
> If we add a GUC-check-hook test, then the problem of misconfiguration
> is reduced to the previously unsolved problem that we have crappy
> feedback for erroneous on-the-fly configuration changes.  So it's
> still unsolved, but at least we've got one unsolved problem not two.

AFAICT, a GUC check hook wouldn't actually be able to address the 
specific scenario I described.  At the time the GUC is set, the 
containing the directory of the trigger file does not exist yet.  This 
is currently not an error.  The problem only happens if after the GUC is 
set the containing directory appears and is not readable.

I notice that we use LOG level if an SSL key or certificate file is not 
accessible on reload, so that seems somewhat similar.

We don't have any GUC check hooks on other file system location string 
settings that ensure accessibility or presence of the file.  Although I 
do notice that we use check_canonical_path() in some places and not 
others for mysterious and undocumented reasons.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: could not stat promote trigger file leads to shutdown

From
Tom Lane
Date:
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> On 2019-11-15 19:23, Tom Lane wrote:
>> If we add a GUC-check-hook test, then the problem of misconfiguration
>> is reduced to the previously unsolved problem that we have crappy
>> feedback for erroneous on-the-fly configuration changes.  So it's
>> still unsolved, but at least we've got one unsolved problem not two.

> AFAICT, a GUC check hook wouldn't actually be able to address the 
> specific scenario I described.  At the time the GUC is set, the 
> containing the directory of the trigger file does not exist yet.  This 
> is currently not an error.  The problem only happens if after the GUC is 
> set the containing directory appears and is not readable.

True, if the hook just consists of trying fopen() and checking the
errno.  Would it be feasible to insist that the containing directory
exist and be readable?  We have enough infrastructure that that
should only take a few lines of code, so the question is whether
or not that's a nicer behavior than we have now.

If we had this to do over, I'd argue that we misdesigned trigger
files: they should be required to exist always, and triggering
depends on file contents (eg empty vs. not) not existence.  That
would make it far easier to check for configuration mistakes
at startup.  But I suppose it's too late now.

> We don't have any GUC check hooks on other file system location string 
> settings that ensure accessibility or presence of the file.

Right, but I'm suggesting we should add that where appropriate.
Basically the complaint here is that the system lacks checks
that the given configuration settings are workable, and we ought
to add such.

> Although I 
> do notice that we use check_canonical_path() in some places and not 
> others for mysterious and undocumented reasons.

Probably only that some patch authors didn't know about it :-(

            regards, tom lane



Re: could not stat promote trigger file leads to shutdown

From
Peter Eisentraut
Date:
On 2019-11-20 16:21, Tom Lane wrote:
>> AFAICT, a GUC check hook wouldn't actually be able to address the
>> specific scenario I described.  At the time the GUC is set, the
>> containing the directory of the trigger file does not exist yet.  This
>> is currently not an error.  The problem only happens if after the GUC is
>> set the containing directory appears and is not readable.
> True, if the hook just consists of trying fopen() and checking the
> errno.  Would it be feasible to insist that the containing directory
> exist and be readable?  We have enough infrastructure that that
> should only take a few lines of code, so the question is whether
> or not that's a nicer behavior than we have now.

Is it possible to do this in a mostly bullet-proof way?  Just because 
the directory exists and looks pretty good otherwise, doesn't mean we 
can read a file created in it later in a way that doesn't fall afoul of 
the existing error checks.  There could be something like SELinux 
lurking, for example.

Maybe some initial checking would be useful, but I think we still need 
to downgrade the error check at use time a bit to not crash in the cases 
that we miss.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: could not stat promote trigger file leads to shutdown

From
Kyotaro Horiguchi
Date:
At Wed, 4 Dec 2019 11:52:33 +0100, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote in 
> On 2019-11-20 16:21, Tom Lane wrote:
> >> AFAICT, a GUC check hook wouldn't actually be able to address the
> >> specific scenario I described.  At the time the GUC is set, the
> >> containing the directory of the trigger file does not exist yet.  This
> >> is currently not an error.  The problem only happens if after the GUC
> >> is
> >> set the containing directory appears and is not readable.
> > True, if the hook just consists of trying fopen() and checking the
> > errno.  Would it be feasible to insist that the containing directory
> > exist and be readable?  We have enough infrastructure that that
> > should only take a few lines of code, so the question is whether
> > or not that's a nicer behavior than we have now.
> 
> Is it possible to do this in a mostly bullet-proof way?  Just because
> the directory exists and looks pretty good otherwise, doesn't mean we
> can read a file created in it later in a way that doesn't fall afoul
> of the existing error checks.  There could be something like SELinux
> lurking, for example.
> 
> Maybe some initial checking would be useful, but I think we still need
> to downgrade the error check at use time a bit to not crash in the
> cases that we miss.

+1. Any GUC variables that points to outer, or externally-modifiable
resources, including directories, files, commands can face that kind
of problem. For example a bogus value for archive_command doesn't
preveint server from starting. I understand that the reason is that we
don't have a reliable means to check-up the command before we actually
execute it, but server can (or should) continue running even if it
fails. I think this issue falls into that category.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: could not stat promote trigger file leads to shutdown

From
Michael Paquier
Date:
On Wed, Dec 04, 2019 at 11:52:33AM +0100, Peter Eisentraut wrote:
> Is it possible to do this in a mostly bullet-proof way?  Just because the
> directory exists and looks pretty good otherwise, doesn't mean we can read a
> file created in it later in a way that doesn't fall afoul of the existing
> error checks.  There could be something like SELinux lurking, for example.
>
> Maybe some initial checking would be useful, but I think we still need to
> downgrade the error check at use time a bit to not crash in the cases that
> we miss.

I got that thread in my backlog for some time, and was not able to
come back to it.  Reading it again the thread, it seems to me that
using a LOG would make the promote file handling more consistent with
what we do for the SSL context reload.  Still, one downside I can see
here is that this causes the backend to create a new LOG entry each
time the promote file is checked, aka each time we check if WAL is
available.  Couldn't that bloat be a problem?  During the SSL reload,
we only generate LOG entries for each backend on SIGHUP.
--
Michael

Attachment