Thread: Reliably determining whether the server came up
I've been trying to work out a reliable script to determine, after pg_ctl start, that the server is done attempting to come up, and that it has either succeeded OR FAILED. This is for several hundred unattended appliance-type servers, currently on PG 8.0 but soon to be on 8.3. Haven't found anything in the archives.

I want to determine success/failure without time-outs, since: the db is restarted every time a server gets an upgrade, and it can get several upgrades in a batch, and the cpu/disk load during an upgrade is highly variable; a restart with no recovery may still require as much as a minute to get to 'ready'.

We also need to restart the server several hundred times in our in-house system tests.

So pg_ctl -w start is not an option, even if the timeout were configurable to under a minute.

The best I have that doesn't involve modifying pg_ctl is:

    # Hand-compute $NEXT_LOG from postgresql.conf
    # parameters (log_directory) and (log_filename).
    # Replace %S format with a '??' wildcard (yech).

    $ TEMP_LOG=/tmp/pg.$PGPORT.log
    $ touch $NEXT_LOG >$TEMP_LOG
    $ FROM=`awk 'END {print NR+1}' $NEXT_LOG`
    $ pg_ctl start -s -l $TEMP_LOG
    $ while tail -n +$FROM $NEXT_LOG | ! egrep -hw 'FATAL|PANIC|DETAIL|ready|shutting|^postmaster cannot' $TEMP_LOG -; do sleep 1; done

The nasty cases are when the server fails (exits) without being able to create its std log file (e.g. error in postgresql.conf).

So I'm down to patching start_postmaster in pg_ctl.c to use popen("... & echo $!") instead of system("... &"), then make test_postmaster_connection do a kill(pid, 0) if PQsetdbLogin fails.

Any suggestions appreciated.

--
Engineers think that equations approximate reality.
Physicists think that reality approximates the equations.
Mathematicians never make the connection.
Mischa Sandberg wrote:
> I've been trying to work out a reliable script to determine,
> after pg_ctl start, that the server is done attempting
> to come up, and that it has either succeeded OR FAILED.
> This is for several hundred unattended appliance-type servers,
> currently on PG 8.0 but soon to be on 8.3

Why don't you try to create a connection to a db on each server?

HH
Quoting "H. Hall" <hhall1001@reedyriver.com>:
> Why don't you try to create a connection to a db on each server?

Thanks, but that only tells me if the server is up at the time of trying to connect. What I'm trying to test is, did the server just start up and abort?

This is in a large cluster, and in (say) a box-by-box software upgrade, any time-outs are additive.
On Sat, 2008-11-15 at 12:29 -0800, Mischa Sandberg wrote:
> Quoting "H. Hall" <hhall1001@reedyriver.com>:
> > Why don't you try to create a connection to a db on each server?
>
> Thanks, but that only tells me if the server is up at the time of trying
> to connect.

Actually it doesn't. If you are using any standard library to connect and the server is not ready to accept connections, it will tell you when you connect. If the server failed to come up, you won't get a connection at all. If you are able to connect but not initiate a session, an appropriate response will be sent.

Joshua D. Drake
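The distinction drawn above (no connection at all versus a connection that is refused with a message) can be turned into a small classifier. This is only a sketch, assuming bash; `classify_probe` and its state names are hypothetical, and the only message fragments it recognizes are the two quoted elsewhere in this thread. Everything else is lumped into "unreachable".

```bash
#!/usr/bin/env bash
# Map one connection attempt to a coarse server state.
# $1 = exit status of the client, $2 = its stderr text.
# Only the two messages quoted in this thread are recognized; any other
# failure is reported as "unreachable" (an assumption, not an exhaustive list).
classify_probe() {
    local rc=$1 err=$2
    if [ "$rc" -eq 0 ]; then
        echo ready
        return
    fi
    case "$err" in
        *"the database system is starting up"*)   echo starting ;;
        *"the database system is shutting down"*) echo shutting_down ;;
        *)                                         echo unreachable ;;
    esac
}

# Possible use (not run here):
#   err=$(psql -l 2>&1 >/dev/null); classify_probe $? "$err"
```

A caller would still have to decide what to do with "unreachable", since by itself it cannot tell a server that has not started listening yet from one that has already exited.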
Joshua D. Drake wrote:
> Actually it doesn't. If you are using any standard library to connect and
> the server is not ready to accept connections, it will tell you when you
> connect. If the server failed to come up, you won't get a connection at
> all. If you are able to connect but not initiate a session, an
> appropriate response will be sent.

Exactly. :-) Also, once you take a look at your solution code a light bulb may go off. Hey! This code could also be used to test the health of my db servers during production! If I just execute it in a timer thread . . . Hmmm.

--cheers,
HH

--
H. Hall
ReedyRiver Group LLC
www.reedyriver.com
H. Hall wrote:
> Exactly. :-) Also, once you take a look at your solution code a light bulb
> may go off. Hey! This code could also be used to test the health of my db
> servers during production! If I just execute it in a timer thread . . . Hmmm.

Well, I'll look further at it. I originally did start with

    while pg_ctl status && ! psql -l; do nothing; done

The cases I've had to catch include:

- startup so slow that postmaster.pid has still not been created when the first pg_ctl status exits, returning 'no server'.
- a pg_xlog drive going sour (some low-end hardware is, well, crap), so pg_ctl status says server is up but connects get 'FATAL: the database system is shutting down' forever.

... and I'm guessing that other server failure states will produce other messages (with FATAL not always meaning real fatality).

--
Engineers think that equations approximate reality.
Physicists think that reality approximates the equations.
Mathematicians never make the connection.
Perhaps it's my "test for DB ready" that's the problem? This is the typical glitch I get ...

    + mkdir -p /persist/pgdata
    + chmod 700 /persist/pgdata
    + initdb -Atrust -Upmx -L/persist/pgsql/8.0.9/share >/dev/null
    + cp -p /persist/etc/p*g*.conf /persist/pgdata
    + chmod u+w /persist/home/mischa/pgdata/*.conf
    + pgconf dynamic_library_path=/persist/pgsql/8.0.9/lib
    + touch pg-Mon.log >/tmp/pg.5432.log
    + mkdir -p /persist/pglog
    + cd /persist/pglog
    + pg_ctl start -s -l /tmp/pg.5432.log -o '-p 5432 -B 500 -N 10'
    + while pg_ctl status && ! psql -l; do sleep 1; done >/dev/null 2>&1
    + createlang --pglib /persist/pgsql/8.0.9/lib -d template1 plpgsql
    createlang: could not connect to database template1: FATAL: the database system is starting up

--
Engineers think that equations approximate reality.
Physicists think that reality approximates the equations.
Mathematicians never make the connection.
Mischa Sandberg <mischa_sandberg@telus.net> writes:
> Perhaps it's my "test for DB ready" that's the problem?

> + while pg_ctl status && ! psql -l; do sleep 1; done >/dev/null 2>&1

I'd bet that the pg_ctl status part is failing. I get exit status 1 from it if there's no server running.

regards, tom lane
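To see why a failing status check ends that loop at once, the loop can be run against stand-ins. This is a sketch only, assuming bash: `status_cmd` and `probe_cmd` are hypothetical stubs for `pg_ctl status` and `psql -l`, and the iteration cap exists only so the demonstration terminates.

```bash
#!/usr/bin/env bash
# Stand-ins for `pg_ctl status` and `psql -l` (hypothetical stubs, not the real tools).
status_cmd() { return "$STATUS_RC"; }
probe_cmd()  { return "$PROBE_RC"; }

# Same shape as the loop in the thread, but it counts iterations and stops
# after three so the demonstration always finishes.
poll() {
    local n=0
    while status_cmd && ! probe_cmd; do
        n=$((n + 1))
        [ "$n" -ge 3 ] && break
    done
    echo "$n"
}

STATUS_RC=1 PROBE_RC=1; poll   # status fails: the loop never waits, prints 0
STATUS_RC=0 PROBE_RC=0; poll   # probe succeeds: the loop never waits, prints 0
STATUS_RC=0 PROBE_RC=1; poll   # status ok, probe failing: keeps waiting, prints 3
```

The first two cases are indistinguishable from outside the loop, so a script that continues after it cannot tell "ready" from "status check failed" without testing again.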
Quoting Tom Lane <tgl@sss.pgh.pa.us>:
> I'd bet that the pg_ctl status part is failing. I get exit status 1
> from it if there's no server running.

Yes, that was part of the problem with the original startup script; postmaster hadn't even gotten as far as writing postmaster.pid, I guess. But pg_ctl status returning 1 could also mean that the server had come up, hit a critical problem and exited. Hence my problem; this has to detect server failure, reliably, as well.

BTW the example with (start,status,psql,createlang) failing just happened, to my surprise, on my dev box -- fairly fast and lightly loaded. On loaded, unattended systems, it happened consistently.

............

In another vein, another place where there are consistent failures is in the sequence:

    createlang ... -d template1 plpgsql
    createdb $PGDATABASE
    <app>

The failure can happen on createdb ("template1 is busy") or on <app>; and most frequently on the systems with overloaded disks. My hacky response is to separate those steps with:

    psql -qc checkpoint template1

which consistently makes the problem go away; but what is the problem, exactly, that this is tripping over??

Anyway, thanks for the comments.

--
Engineers think that equations approximate reality.
Physicists think that reality approximates the equations.
Mathematicians never make the connection.
Mischa Sandberg <mischa_sandberg@telus.net> writes:
> Quoting Tom Lane <tgl@sss.pgh.pa.us>:
>> I'd bet that the pg_ctl status part is failing. I get exit status 1
>> from it if there's no server running.

> Yes, that was part of the problem with the original startup script;
> postmaster hadn't even gotten as far as writing postmaster.pid,
> I guess. But pg_ctl status returning 1 could also mean that the
> server had come up, hit a critical problem and exited. Hence my problem;
> this has to detect server failure, reliably, as well.

You could sleep for a second or so *before* you start looking for the pidfile.

> In another vein, another place where there are consistent
> failures is in the sequence:
>     createlang ... -d template1 plpgsql
>     createdb $PGDATABASE
>     <app>

This should be fixed in 8.3 and up. In older releases about all you can do is delay a second or so to let the old backend exit.

regards, tom lane
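On releases before 8.3, one variation on "delay a second or so" is to retry the failing step a few times with a pause between attempts, rather than always sleeping. Note that this is a retry loop swapped in for the plain delay suggested above, not something proposed in the thread. It is a sketch assuming bash, and the `retry` helper is hypothetical.

```bash
#!/usr/bin/env bash
# Run a command up to $1 times, sleeping $2 seconds between failed attempts.
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
retry() {
    local tries=$1 delay=$2 i
    shift 2
    for ((i = 1; i <= tries; i++)); do
        "$@" && return 0
        # No point sleeping after the final attempt.
        [ "$i" -lt "$tries" ] && sleep "$delay"
    done
    return 1
}

# Possible use (not run here):
#   createlang ... -d template1 plpgsql
#   retry 5 1 createdb "$PGDATABASE"
```

The attempt count is a bound on retries, not a guess at how long the old backend needs to exit; on a heavily loaded box it may still have to be raised.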
Quoting Tom Lane <tgl@sss.pgh.pa.us>:
> You could sleep for a second or so *before* you start looking for
> the pidfile.

The systems are under erratic load, due to concurrent cpu and diskio spikes around start-up time. 1-2 secs is not enough to be a guarantee :-(

Probably not explaining the issues well; caught between two constraints that aren't really pg's problem; and wide clusters with automated admin, variable hardware and spikes of db restarts are no doubt an oddball edge case.

There are workarounds; was hoping for something clean and obvious (to all but me). Switching back to tailing the log files and moving on. Thanks everyone.

--
Engineers think that equations approximate reality.
Physicists think that reality approximates the equations.
Mathematicians never make the connection.
Mischa Sandberg <mischa_sandberg@telus.net> writes:
>> You could sleep for a second or so *before* you start looking for
>> the pidfile.

> The systems are under erratic load, due to concurrent
> cpu and diskio spikes around start-up time.
> 1-2 secs is not enough to be a guarantee :-(

Well, forget pg_ctl and just start the postmaster directly, so that your script knows its PID. Then you could keep an eye on whether the PID still exists.

regards, tom lane
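A minimal sketch of that last suggestion, assuming bash and that the script itself launched the server in the background, so `$!` holds its PID. The helper name `wait_until_ready` is hypothetical, and the postmaster command line in the comment is an assumed example, not the poster's actual invocation.

```bash
#!/usr/bin/env bash
# Wait until a probe command succeeds or the watched process disappears.
# $1 = PID of the server process; remaining arguments = probe command.
# Returns 0 when the probe succeeds, 1 when the PID is gone first.
# There is deliberately no timeout: the loop ends only on one of those events.
wait_until_ready() {
    local pid=$1
    shift
    # kill -0 sends no signal; it only tests whether the PID can be signalled.
    while kill -0 "$pid" 2>/dev/null; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        sleep 1
    done
    return 1
}

# Possible use (not run here; the invocation and variables are assumptions):
#   postmaster -D "$PGDATA" -p "$PGPORT" >>"$TEMP_LOG" 2>&1 &
#   wait_until_ready $! psql -l || echo "server exited during startup" >&2
```

This separates "came up" from "started and died" without a timeout. It does not cover the case described earlier in the thread where the process stays alive but never accepts connections; that one still needs a log check or an upper bound.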