Thread: Questions about pid file creation code

Questions about pid file creation code

From
Zdenek Kotala
Date:
I'm looking on pid file creation code (src/backend/utils/init/miscinit.c  - CreateLockFile) and I have couple of
questions:

1) Is there still some reason have negative value in postmaster.pid? It 
happens only if backend runs in single mode. But I think now is not 
necessary to use it. And there are some confusing messages about 
postgres/postmaster. See:

errhint("Is another postgres (PID %d) running in data directory \"%s\"?",
(int) other_pid, refName) :
errhint("Is another postmaster (PID %d) running in data directory \"%s\"?",
(int) other_pid, refName)) :

2) Why 100? What race condition should happen? This piece of code looks 
like kind of magic.

3) Why pid checking and cleanup is in postgres? I think it is role of 
pg_ctl or init scripts.


4) The following condition is buggy, because atoi function does not have 
defined result if parameter is not valid number. (OK in most 
implementation it really returns 0)
 if (other_pid <= 0)             elog(FATAL, "bogus data in lock file \"%s\": \"%s\"",                  filename,
buffer)

I think usage of strtol there should be better.

    Zdenek


Re: Questions about pid file creation code

From
Tom Lane
Date:
Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
> 1) Is there still some reason have negative value in postmaster.pid?

Just to distinguish postmasters from standalone backends in the error
messages.  I think that's still useful.

> 2) Why 100? What race condition should happen? This piece of code looks 
> like kind of magic.

There are at least two race cases identified in the comments in the
loop.

> 3) Why pid checking and cleanup is in postgres? I think it is role of 
> pg_ctl or init scripts.

Let's see, instead of one place in the postgres code we should do it in
N places in different init scripts, and just trust to luck that a
particular installation is using an init script that knows to do that?
I don't think so.  Besides, how is the init script going to remove it
again?  It won't still be running when the postmaster exits.

> 4) The following condition is buggy, because atoi function does not have 
> defined result if parameter is not valid number.

>   if (other_pid <= 0)

It's not actually trying to validate the syntax of the lock file, only
to make certain it doesn't trigger any unexpected behavior in kill().
I don't think I've yet seen any reports that suggest that more syntax
checking of the lock file would be a useful activity.
        regards, tom lane


Re: Questions about pid file creation code

From
Zdenek Kotala
Date:
Tom Lane wrote:
> Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
>> 1) Is there still some reason have negative value in postmaster.pid?
> 
> Just to distinguish postmasters from standalone backends in the error
> messages.  I think that's still useful.

I'm not sure what you mean. It is used only in CreatePidFile function 
and I think that if directory is locked by some process, I don't see any 
useful reason to know if it is postmaster or standalone backend.

(PS: Is standalone backend same as --single switch?)

>> 2) Why 100? What race condition should happen? This piece of code looks 
>> like kind of magic.
> 
> There are at least two race cases identified in the comments in the
> loop.

Yes there are. But it does not sense for me. If I want to open file and 
another process remove it, why I want to try created it again when 
another process going to do it?

There is only one reason and it is that user delete file manually from 
the system, but in this case I don't believe that administrator shot 
right time.

Or if it still have sense do it in this way I expect some sleep instead 
of some loop which depends on CPU speed.

>> 3) Why pid checking and cleanup is in postgres? I think it is role of 
>> pg_ctl or init scripts.
> 
> Let's see, instead of one place in the postgres code we should do it in
> N places in different init scripts, and just trust to luck that a
> particular installation is using an init script that knows to do that?
> I don't think so.  Besides, how is the init script going to remove it
> again?  It won't still be running when the postmaster exits.

I'm sorry, I meant why there is a pid cleanup which stays there after 
another postmaster crash. Many application only check OK there is some 
pid file -> exit. And rest is on start script or some other monitoring 
facility.

>> 4) The following condition is buggy, because atoi function does not have 
>> defined result if parameter is not valid number.
> 
>>   if (other_pid <= 0)
> 
> It's not actually trying to validate the syntax of the lock file, only
> to make certain it doesn't trigger any unexpected behavior in kill().

I not sure if we talk about same place. kill() is called after this if. 
If I miss that atoi need not return 0 if fails, then following condition 
is more accurate:
  if (other_pid == 0)


> I don't think I've yet seen any reports that suggest that more syntax
> checking of the lock file would be a useful activity.

Yes, I agree.
    Zdenek


Re: Questions about pid file creation code

From
Alvaro Herrera
Date:
Zdenek Kotala wrote:

> (PS: Is standalone backend same as --single switch?)

Yes.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Questions about pid file creation code

From
Tom Lane
Date:
Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
> Tom Lane wrote:
>> Just to distinguish postmasters from standalone backends in the error
>> messages.  I think that's still useful.

> I'm not sure what you mean. It is used only in CreatePidFile function 
> and I think that if directory is locked by some process, I don't see any 
> useful reason to know if it is postmaster or standalone backend.

You don't?  Consider the decisions the user needs to take upon seeing
the message --- should he kill that other process or not, and if so how?
Knowing whether it's a postmaster seems pretty important to me.

> Yes there are. But it does not sense for me. If I want to open file and 
> another process remove it, why I want to try created it again when 
> another process going to do it?

That could be the track of another postmaster just now shutting down.
There's no reason to fail to start in such a scenario.  The looping
logic is necessary anyway (to guard against races involving two
postmasters trying to start at the same time), so we might as well let
it handle this case too.

> I'm sorry, I meant why there is a pid cleanup which stays there after 
> another postmaster crash. Many application only check OK there is some 
> pid file -> exit. And rest is on start script or some other monitoring 
> facility.

The start script does not typically have the intelligence to get this
right, particularly not the is-shmem-still-in-use part.  If you check
the archives you will find many of us on record telling people who think
they should remove the pidfile in their start script that they're crazy.

>> It's not actually trying to validate the syntax of the lock file, only
>> to make certain it doesn't trigger any unexpected behavior in kill().

> I not sure if we talk about same place.

Yes, we are.  Read the kill(2) man page and note the special behaviors
for pid = 0 or -1.  The test is just trying to be darn certain we don't
invoke those behaviors.
        regards, tom lane


Re: Questions about pid file creation code

From
Zdenek Kotala
Date:
Tom Lane wrote:
> Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
>> Tom Lane wrote:
>>> Just to distinguish postmasters from standalone backends in the error
>>> messages.  I think that's still useful.
> 
>> I'm not sure what you mean. It is used only in CreatePidFile function 
>> and I think that if directory is locked by some process, I don't see any 
>> useful reason to know if it is postmaster or standalone backend.
> 
> You don't?  Consider the decisions the user needs to take upon seeing
> the message --- should he kill that other process or not, and if so how?
> Knowing whether it's a postmaster seems pretty important to me.

If somebody want to kill some process he must know what he want to do. 
How many postgres user know what is different between postmaster and 
postgres in error message?

And other problem. If another application (e.g. pg_migrator) want to 
lock this directory to prevent data corruption. How shall it do that? 
How big sense have this message in this case?

I suggest to remove this behavior and modify message.

>> Yes there are. But it does not sense for me. If I want to open file and 
>> another process remove it, why I want to try created it again when 
>> another process going to do it?
> 
> That could be the track of another postmaster just now shutting down.
> There's no reason to fail to start in such a scenario.  The looping
> logic is necessary anyway (to guard against races involving two
> postmasters trying to start at the same time), so we might as well let
> it handle this case too.

Ok. I now understand (I hope) what this loop try to handle. However, If 
one server go down and another go up there is only really small time 
piece between first open attempt and second one. I guess in this case we 
can say stop to the startup postmaster. For me it is better then make 
one hundred loops depend on cpu speed and recheck it again.  I think 
that in this case postgres doubled role of startup scripts.

There is also another issue which can occur. If you have two node with 
access to one shared filesystem. One node is for backup and somebody run 
postgres on second node. In this case postgres remove file and create 
own and two postgres on one dbcluster is not good idea. Good cluster 
solution protect this situation, but it can happen if somebody run it 
manually.

>> I'm sorry, I meant why there is a pid cleanup which stays there after 
>> another postmaster crash. Many application only check OK there is some 
>> pid file -> exit. And rest is on start script or some other monitoring 
>> facility.
> 
> The start script does not typically have the intelligence to get this
> right, particularly not the is-shmem-still-in-use part.  If you check
> the archives you will find many of us on record telling people who think
> they should remove the pidfile in their start script that they're crazy.

It is true, but question is what way is better. Keep all logic in 
postmaster or improve pg_ctl to share more information and keep 
responsibility on start scripts or monitoring tool which has more 
information about system as complex.

>>> It's not actually trying to validate the syntax of the lock file, only
>>> to make certain it doesn't trigger any unexpected behavior in kill().
> 
>> I not sure if we talk about same place.
> 
> Yes, we are.  Read the kill(2) man page and note the special behaviors
> for pid = 0 or -1.  The test is just trying to be darn certain we don't
> invoke those behaviors.

No we don't :-). I mean code few lines up after atoi().
with regards Zdenek




Re: Questions about pid file creation code

From
Tom Lane
Date:
Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
> Tom Lane wrote:
>> The start script does not typically have the intelligence to get this
>> right, particularly not the is-shmem-still-in-use part.  If you check
>> the archives you will find many of us on record telling people who think
>> they should remove the pidfile in their start script that they're crazy.

> It is true, but question is what way is better. Keep all logic in 
> postmaster or improve pg_ctl to share more information and keep 
> responsibility on start scripts or monitoring tool which has more 
> information about system as complex.

If you have conditions PG doesn't know about, you're free to test for
them in your start script.  I see no reason to change this code, however.
        regards, tom lane