Re: Unreliable "pg_ctl -w start" again - Mailing list pgsql-hackers

From MauMau
Subject Re: Unreliable "pg_ctl -w start" again
Date
Msg-id C81D34C8264145748A657C0954AAE9BB@maumau
In response to Re: Unreliable "pg_ctl -w start" again  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
From: "Tom Lane" <tgl@sss.pgh.pa.us>
> Well, feel free to increase that duration if you want.  The reason it's
> there is to not wait for a long time if the postmaster falls over
> instantly at startup, but in a non-interactive situation you might not
> care.

Yes, but just lengthening the wait duration causes an unnecessarily long wait 
when we run pg_ctl interactively. Therefore, the current wait approach is not 
correct.


>> How about inserting postmaster_is_alive() as below?
>
> Looks like complete nonsense to me, if the goal is to behave sanely when
> postmaster.pid hasn't been created yet.  Where do you think get_pgpid
> gets the PID from?

Yes, I understand that get_pgpid() gets the pid from postmaster.pid, which 
may be the pid of the previous postmaster that did not stop cleanly.
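
To make the point about a stale pid concrete, here is a minimal sketch of 
reading a pid from the first line of a pid file. read_pid_file() is a 
hypothetical stand-in, not the actual get_pgpid() source; it assumes the pid 
is the first token in the file and that a missing file should be reported as 0.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for get_pgpid(): return the pid stored on the first
 * line of the pid file, 0 if the file does not exist, or -1 if it cannot be
 * read or parsed.
 */
long
read_pid_file(const char *path)
{
    FILE       *f;
    long        pid;

    assert(path != NULL);

    f = fopen(path, "r");
    if (f == NULL)
        return (errno == ENOENT) ? 0 : -1;

    /* Only the first number matters here; later lines are ignored. */
    if (fscanf(f, "%ld", &pid) != 1)
        pid = -1;

    fclose(f);
    return pid;
}
```

Nothing in such a read tells the caller whether the pid belongs to a live 
postmaster or to one that died without removing the file, which is why a 
separate liveness check is needed.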

I think my simple fix makes sense to solve the problem as follows. Could you 
point out what might not be good?

1. The previous postmaster was terminated abruptly due to OS shutdown, 
machine failure, etc., leaving postmaster.pid behind.
2. Run "pg_ctl -w start" to start a new postmaster.
3. do_start() of pg_ctl reads the pid of the previously running postmaster 
from postmaster.pid. Call it pid-1 (old_pid in the code).
 old_pid = get_pgpid();

4. In any case, try to start postmaster by calling start_postmaster().
5. If postmaster.pid existed at step 3, it means one of the following:

(a) Previous postmaster did not stop cleanly and left postmaster.pid.
(b) Another postmaster was already running in the data directory before 
this invocation of pg_ctl -w start.

But we can't distinguish between them at this point. So we read 
postmaster.pid again to judge the situation. Call the result pid-2 (pid in 
the code).
 if (old_pid != 0)
 {
     pg_usleep(1000000);
     pid = get_pgpid();

6. If pid-1 != pid-2, it means that situation (a) applies and the newly 
started postmaster overwrote the old postmaster.pid. Then, try to connect to 
postmaster.

If pid-1 == pid-2, it means either of:

(a') Previous postmaster did not stop cleanly and left postmaster.pid. Newly 
started postmaster will complete startup, but hasn't overwritten 
postmaster.pid yet.
(b) Another postmaster was already running in the data directory before 
this invocation of pg_ctl -w start.

The current comparison logic cannot distinguish between them. In my case, 
situation (a') occurred, and pg_ctl mistakenly exited.
     if (pid == old_pid)
     {
         write_stderr(_("%s: could not start server\n"
                        "Examine the log output.\n"),
                      progname);
         exit(1);
     }
 

7. To distinguish between (a') and (b), check whether pid-1 is alive. If 
pid-1 is alive, it is situation (b). Otherwise, it is situation (a').
 if (pid == old_pid && postmaster_is_alive(old_pid))

However, the pid of the newly started postmaster might happen to match that 
of the old postmaster. To deal with that case, it may be better to check the 
modification timestamp of postmaster.pid in addition.
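
The decision in steps 5 to 7, including the timestamp idea, could be sketched 
roughly as follows. This is only an illustration, not a patch: StartState, 
classify_start() and process_is_alive() are hypothetical names, 
process_is_alive() merely stands in for postmaster_is_alive(), and the sketch 
assumes the caller has recorded the time just before start_postmaster() and 
obtained the modification time of postmaster.pid (for example with stat()).

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <time.h>

typedef enum
{
    PIDFILE_REWRITTEN,          /* (a): new postmaster wrote postmaster.pid */
    STARTUP_PENDING,            /* (a'): stale file, not overwritten yet */
    OTHER_POSTMASTER_RUNNING    /* (b): an earlier postmaster owns the file */
} StartState;

/* Stand-in for postmaster_is_alive(): signal 0 probes without delivering. */
bool
process_is_alive(pid_t pid)
{
    return pid > 0 && kill(pid, 0) == 0;
}

/*
 * old_pid: pid read before start_postmaster() (pid-1, must be nonzero)
 * pid: pid read after the one-second sleep (pid-2)
 * old_alive: result of the liveness check on old_pid
 * pidfile_mtime: modification time of postmaster.pid at the second read
 * start_time: time recorded just before start_postmaster()
 */
StartState
classify_start(pid_t old_pid, pid_t pid, bool old_alive,
               time_t pidfile_mtime, time_t start_time)
{
    assert(old_pid != 0);       /* mirrors the "if (old_pid != 0)" guard */

    if (pid != old_pid)
        return PIDFILE_REWRITTEN;

    /* Same pid, but the file is newer than our start attempt: pid reuse. */
    if (pidfile_mtime >= start_time)
        return PIDFILE_REWRITTEN;

    if (old_alive)
        return OTHER_POSTMASTER_RUNNING;

    return STARTUP_PENDING;
}
```

Only OTHER_POSTMASTER_RUNNING would lead to the "could not start server" 
exit. In the STARTUP_PENDING case pg_ctl would go on to the connection 
attempts. The timestamp comparison assumes that the clock and the file system 
timestamps are consistent to within their granularity.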

What do you think?


> If we had the postmaster's PID a priori, we could detect postmaster
> death directly instead of having to make assumptions about how long
> is reasonable to wait for the pidfile to appear.  The problem is that
> we don't want to write a complete replacement for the shell's command
> line parser and I/O redirection logic.  It doesn't look like a small
> project.

Yes, I understand this. I don't think we can replace all the work the shell 
does.


> (But maybe we could bypass that by doing a fork() and then having
> the child exec() the shell, telling it to exec postmaster in turn?)

Possibly. I hope this works. Then, we can pass unnamed pipe file descriptors 
to the postmaster via environment variables from pg_ctl's forked child.


> And of course Windows as usual makes things twice as hard, since we
> couldn't make such a change unless start_postmaster could return the
> proper PID in that case too.

Well, we can make start_postmaster() return the pid of the newly created 
postmaster. CreateProcess() sets the process handle in the structure passed 
to it. We can pass the process handle to WaitForSingleObject() to know 
whether postmaster is alive.

Regards
MauMau


