Re: Unreliable "pg_ctl -w start" again - Mailing list pgsql-hackers
From | MauMau |
---|---|
Subject | Re: Unreliable "pg_ctl -w start" again |
Date | |
Msg-id | 4F0633450B7A4741B348040C8831EC90@maumau |
In response to | Re: Unreliable "pg_ctl -w start" again (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses | Re: Unreliable "pg_ctl -w start" again |
List | pgsql-hackers |
From: "Tom Lane" <tgl@sss.pgh.pa.us>
> Well, feel free to increase that duration if you want. The reason it's
> there is to not wait for a long time if the postmaster falls over
> instantly at startup, but in a non-interactive situation you might not
> care.

Yes, just lengthening the wait duration causes an unnecessarily long wait when we run pg_ctl interactively. Therefore, the current wait approach is not correct.

>> How about inserting postmaster_is_alive() as below?
>
> Looks like complete nonsense to me, if the goal is to behave sanely when
> postmaster.pid hasn't been created yet. Where do you think get_pgpid
> gets the PID from?

Yes, I understand that get_pgpid() gets the pid from postmaster.pid, which may be the pid of the previous postmaster that did not stop cleanly.

I think my simple fix makes sense to solve the problem as follows. Could you point out what might not be good?

1. The previous postmaster was terminated abruptly due to OS shutdown, machine failure, etc., leaving postmaster.pid.

2. Run "pg_ctl -w start" to start a new postmaster.

3. do_start() of pg_ctl reads the pid of the previously running postmaster from postmaster.pid. Let it be pid-1 (old_pid in the code) here.

    old_pid = get_pgpid();

4. Anyway, try to start the postmaster by calling start_postmaster().

5. If postmaster.pid existed at step 3, it means either of:

(a) The previous postmaster did not stop cleanly and left postmaster.pid.
(b) Another postmaster is already running in the data directory (since before running pg_ctl -w start this time.)

But we can't distinguish between them. Then, we read postmaster.pid again to judge the situation. Let it be pid-2 (pid in the code).

    if (old_pid != 0)
    {
        pg_usleep(1000000);
        pid = get_pgpid();

6. If pid-1 != pid-2, it means that situation (a) applies and the newly started postmaster overwrote the old postmaster.pid. Then, try to connect to the postmaster.

If pid-1 == pid-2, it means either of:

(a') The previous postmaster did not stop cleanly and left postmaster.pid. The newly started postmaster will complete startup, but hasn't overwritten postmaster.pid yet.
(b) Another postmaster is already running in the data directory (since before running pg_ctl -w start this time.)

The current comparison logic cannot distinguish between them. In my problem situation, situation (a') happened, and pg_ctl mistakenly exited.

        if (pid == old_pid)
        {
            write_stderr(_("%s: could not start server\n"
                           "Examine the log output.\n"),
                         progname);
            exit(1);
        }

7. To distinguish between (a') and (b), check if pid-1 is alive. If pid-1 is alive, it means situation (b). Otherwise, it is situation (a').

        if (pid == old_pid && postmaster_is_alive(old_pid))

However, the pid of the newly started postmaster might match that of the old postmaster. To deal with that situation, it may be better to check the modification timestamp of postmaster.pid in addition.

What do you think?

> If we had the postmaster's PID a priori, we could detect postmaster
> death directly instead of having to make assumptions about how long
> is reasonable to wait for the pidfile to appear. The problem is that
> we don't want to write a complete replacement for the shell's command
> line parser and I/O redirection logic. It doesn't look like a small
> project.

Yes, I understand this. I don't think we can replace the shell's various work.

> (But maybe we could bypass that by doing a fork() and then having
> the child exec() the shell, telling it to exec postmaster in turn?)

Possibly. I hope this works. Then, we can pass unnamed pipe file descriptors to the postmaster via environment variables from pg_ctl's forked child.

> And of course Windows as usual makes things twice as hard, since we
> couldn't make such a change unless start_postmaster could return the
> proper PID in that case too.

Well, we can make start_postmaster() return the pid of the newly created postmaster. CreateProcess() sets the process handle in the structure passed to it. We can pass the process handle to WaitForSingleObject() to know whether the postmaster is alive.

Regards
MauMau
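
As an illustration of steps 5 to 7 in the message above, here is a minimal, self-contained sketch of the proposed decision logic. It is not the pg_ctl source: the enum, classify_start() and pid_is_alive() are hypothetical names introduced only for this example, pids are held as plain long values, and the liveness probe assumes a POSIX system where kill() with signal 0 only tests whether the process can be signalled.

```c
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical outcome labels for this sketch; not taken from pg_ctl. */
typedef enum
{
    START_NO_OLD_PIDFILE,   /* postmaster.pid did not exist before the start */
    START_NEW_PIDFILE,      /* (a): pid changed, the new postmaster rewrote the file */
    START_STALE_PIDFILE,    /* (a'): same pid, but that process is gone */
    START_ALREADY_RUNNING   /* (b): same pid, and that process is still alive */
} StartState;

/*
 * Assumed POSIX-only probe: signal 0 delivers nothing and merely reports
 * whether the process exists and may be signalled by the caller.
 */
bool
pid_is_alive(long pid)
{
    if (pid <= 0)
        return false;
    return kill((pid_t) pid, 0) == 0;
}

/*
 * old_pid: pid read from postmaster.pid before starting (0 if no file).
 * new_pid: pid read again after the short sleep.
 * old_pid_alive: result of probing old_pid, passed in so the decision
 * itself stays a pure function.
 */
StartState
classify_start(long old_pid, long new_pid, bool old_pid_alive)
{
    if (old_pid == 0)
        return START_NO_OLD_PIDFILE;
    if (new_pid != old_pid)
        return START_NEW_PIDFILE;
    return old_pid_alive ? START_ALREADY_RUNNING : START_STALE_PIDFILE;
}
```

A caller would report "could not start server" only for START_ALREADY_RUNNING and keep waiting in the other cases, for example classify_start(old_pid, pid, pid_is_alive(old_pid)). The sketch does not cover the pid-reuse case raised in the message; the timestamp check on postmaster.pid would have to be added on top of it.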