Thread: Reliably determining whether the server came up

Reliably determining whether the server came up

From

Mischa Sandberg

Date:

12 November 2008, 18:05:45

I've been trying to work out a reliable script to determine,
after pg_ctl start, that the server is done attempting
to come up, and that it has either succeeded OR FAILED.
This is for several hundred unattended appliance-type servers,
currently on PG 8.0 but soon to be on 8.3

Haven't found anything in the archives.
I want to determine success/failure without time-outs, since:
the db is restarted every time a server gets an upgrade,
and it can get several upgrades in a batch, and the cpu/disk load
during an upgrade is highly variable; a restart with no
recovery may still require as much as a minute to get to 'ready'.

We also need to restart the server several hundred times
in our in-house system tests.

So pg_ctl -w start is not an option, even if the timeout were
configurable to under a minute.

The best I have that doesn't involve modifying pg_ctl is:

# Hand-compute $NEXT_LOG from postgresql.conf
# parameters (log_directory) and (log_filename).
# Replace %S format with a '??' wildcard (yech).

$ TEMP_LOG=/tmp/pg.$PGPORT.log
$ touch $NEXT_LOG >$TEMP_LOG
$ FROM=`awk 'END {print NR+1}' $NEXT_LOG`
$ pg_ctl start -s -l $TEMP_LOG
$ while ! tail -n +$FROM $NEXT_LOG | egrep -hw \
'FATAL|PANIC|DETAIL|ready|shutting|^postmaster cannot' $TEMP_LOG -; do
sleep 1; done
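
The hand-computation of $NEXT_LOG mentioned in the comments above could be sketched roughly as follows. This is only an illustration: the function name next_log_glob is made up, and it assumes that log_filename contains only strftime escapes that date(1) understands. %S is swapped for '??' before expansion because the seconds value cannot be predicted.

```shell
# Hypothetical helper: turn a log_filename pattern into a shell glob
# for the log file the server would open right now.
# Assumption: the pattern uses only strftime escapes known to date(1).
next_log_glob() {
    # $1 = log_directory, $2 = log_filename pattern
    # Replace %S with ?? first, since the seconds are unpredictable.
    pattern=$(printf '%s\n' "$2" | sed 's/%S/??/g')
    printf '%s/%s\n' "$1" "$(date +"$pattern")"
}
```

For example, NEXT_LOG=$(next_log_glob /persist/pglog 'pg-%a.log') would yield something like /persist/pglog/pg-Mon.log; both arguments here are example values, not read from postgresql.conf. A log rotation between computing the name and starting the server would still defeat it.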

The nasty cases are when the server fails (exits)
without being able to create its std log file (e.g.
error in postgresql.conf).

So I'm down to patching start_postmaster in pg_ctl.c
to use popen("... & echo $!") instead of system("... &"),
then make test_postmaster_connection do a kill(0,pid)
if PQsetdbLogin fails.

Any suggestions appreciated.
--
Engineers think that equations approximate reality.
Physicists think that reality approximates the equations.
Mathematicians never make the connection.

Re: Reliably determining whether the server came up

From

"H. Hall"

Date:

15 November 2008, 14:50:18

Mischa Sandberg wrote:
> I've been trying to work out a reliable script to determine,
> after pg_ctl start, that the server is done attempting
> to come up, and that it has either succeeded OR FAILED.
> This is for several hundred unattended appliance-type servers,
> currently on PG 8.0 but soon to be on 8.3
>

Why don't you try to create a connection to a db on each server?

HH

Re: Reliably determining whether the server came up

From

Mischa Sandberg

Date:

15 November 2008, 16:29:11

Quoting "H. Hall" <hhall1001@reedyriver.com>:

>
> Mischa Sandberg wrote:
> > I've been trying to work out a reliable script to determine,
> > after pg_ctl start, that the server is done attempting
> > to come up, and that it has either succeeded OR FAILED.
> > This is for several hundred unattended appliance-type servers,
> > currently on PG 8.0 but soon to be on 8.3
> >
>
> Why don't you try to create a connection to a db on each server?

Thanks, but that only tells me if the server is up at the time of trying
to connect. What I'm trying to test is, did the server just start up and
abort? This is in a large cluster, and in (say) a box-by-box software
upgrade, any time-outs are additive.

Re: Reliably determining whether the server came up

From

"Joshua D. Drake"

Date:

15 November 2008, 16:32:01

On Sat, 2008-11-15 at 12:29 -0800, Mischa Sandberg wrote:
> Quoting "H. Hall" <hhall1001@reedyriver.com>:
>
> >
> > Mischa Sandberg wrote:
> > > I've been trying to work out a reliable script to determine,
> > > after pg_ctl start, that the server is done attempting
> > > to come up, and that it has either succeeded OR FAILED.
> > > This is for several hundred unattended appliance-type servers,
> > > currently on PG 8.0 but soon to be on 8.3
> > >
> >
> > Why don't you try to create a connection to a db on each server?
>
> Thanks, but that only tells me if the server is up at the time of trying
> to connect.

Actually it doesn't. If you are using any standard library to connect and
the server is not ready to accept connections, it will tell you when you
connect. If the server failed to come up, you won't get a connection at
all; if you are able to connect but not initiate a session, an
appropriate response will be sent.

Joshua D. Drake
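
One way to act on those responses from a script is to look at the error text of a failed connection attempt. The sketch below is an assumption-laden illustration: the function name classify_conn_error is invented, and it recognises only the two server messages that are quoted elsewhere in this thread, lumping everything else together.

```shell
# Hypothetical helper: classify the stderr text of a failed connection
# attempt. Only the two messages quoted in this thread are recognised;
# anything else is reported as "other".
classify_conn_error() {
    case "$1" in
        *"the database system is starting up"*)   echo starting ;;
        *"the database system is shutting down"*) echo stopping ;;
        *)                                        echo other ;;
    esac
}
# e.g.  err=$(psql -l 2>&1 >/dev/null) || classify_conn_error "$err"
```

A caller would presumably keep polling on "starting" and stop on anything else, though which states count as final is a policy decision the thread leaves open.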


Re: Reliably determining whether the server came up

From

"H. Hall"

Date:

16 November 2008, 08:02:42

Joshua D. Drake wrote:
> On Sat, 2008-11-15 at 12:29 -0800, Mischa Sandberg wrote:
>
>> Quoting "H. Hall" <hhall1001@reedyriver.com>:
>>
>>
>>> Mischa Sandberg wrote:
>>>
>>>> I've been trying to work out a reliable script to determine,
>>>> after pg_ctl start, that the server is done attempting
>>>> to come up, and that it has either succeeded OR FAILED.
>>>> This is for several hundred unattended appliance-type servers,
>>>> currently on PG 8.0 but soon to be on 8.3
>>>>
>>>>
>>> Why don't you try to create a connection to a db on each server?
>>>
>> Thanks, but that only tells me if the server is up at the time of trying
>> to connect.
>>
>
> Actually it doesn't. If you are using any standard library to connect and
> the server is not ready to accept connections, it will tell you when you
> connect. If the server failed to come up, you won't get a connection at
> all; if you are able to connect but not initiate a session, an
> appropriate response will be sent.
>
> Joshua D. Drake
>
>
Exactly. :-)
Also, once you take a look at your solution code a light bulb may go
off. Hey! This code could also be used to test the health of my db
servers during production!  If I just execute it in a timer thread . . .
Hmmm.
--cheers, HH



--
H. Hall
ReedyRiver Group LLC
www.reedyriver.com

Re: Reliably determining whether the server came up

From

Mischa Sandberg

Date:

18 November 2008, 01:38:54

Comment below:

Joshua D. Drake wrote:
> On Sat, 2008-11-15 at 12:29 -0800, Mischa Sandberg wrote:
>> Quoting "H. Hall" <hhall1001@reedyriver.com>:
>>> Mischa Sandberg wrote:
>>>
>>>> I've been trying to work out a reliable script to determine,
>>>> after pg_ctl start, that the server is done attempting
>>>> to come up, and that it has either succeeded OR FAILED.
>>>> This is for several hundred unattended appliance-type servers,
>>>> currently on PG 8.0 but soon to be on 8.3
>>>>
>>> Why don't you try to create a connection to a db on each server?
>> Thanks, but that only tells me if the server is up at the time of trying
>> to connect.
>
> Actually it doesn't. If you are using any standard library to connect and
> the server is not ready to accept connections, it will tell you when you
> connect. If the server failed to come up, you won't get a connection at
> all; if you are able to connect but not initiate a session, an
> appropriate response will be sent.
> Joshua D. Drake
> Exactly. :-)
> Also, once you take a look at your solution code a light bulb may go
> off. Hey! This code could also be used to test the health of my db
> servers during production!  If I just execute it in a timer thread . . .
> Hmmm.
> --cheers, HH

Well, I'll look further at it. I originally did start with

   while pg_ctl status && ! psql -l; do sleep 1; done

The cases I've had to catch include:

- startup so slow that postmaster.pid has still not been created when
the first pg_ctl status exits, returning 'no server'.

- a pg_xlog drive going sour (some low-end hardware is, well, crap),
  so pg_ctl status says server is up but connects get
  'FATAL:  the database system is shutting down' forever.

... and I'm guessing that other server failure states will produce
other messages (with FATAL not always meaning real fatality).

Re: Reliably determining whether the server came up

From

Mischa Sandberg

Date:

18 November 2008, 02:17:23

Perhaps it's my "test for DB ready" that's the problem?
This is the typical glitch I get ...

+ mkdir -p /persist/pgdata
+ chmod 700 /persist/pgdata
+ initdb -Atrust -Upmx -L/persist/pgsql/8.0.9/share >/dev/null
+ cp -p /persist/etc/p*g*.conf /persist/pgdata
+ chmod u+w /persist/home/mischa/pgdata/*.conf
+ pgconf dynamic_library_path=/persist/pgsql/8.0.9/lib
+ touch pg-Mon.log >/tmp/pg.5432.log
+ mkdir -p /persist/pglog
+ cd /persist/pglog
+ pg_ctl start -s -l /tmp/pg.5432.log -o '-p 5432 -B 500 -N 10'
+ while pg_ctl status && ! psql -l; do sleep 1; done >/dev/null 2>&1
+ createlang --pglib /persist/pgsql/8.0.9/lib -d template1 plpgsql
createlang: could not connect to database template1: FATAL:  the database system is starting up


Re: Reliably determining whether the server came up

From

Tom Lane

Date:

18 November 2008, 11:59:47

Mischa Sandberg <mischa_sandberg@telus.net> writes:
> Perhaps it's my "test for DB ready" that's the problem?

> + while pg_ctl status && ! psql -l; do sleep 1; done >/dev/null 2>&1

I'd bet that the pg_ctl status part is failing.  I get exit status 1
from it if there's no server running.

            regards, tom lane
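
To see why a failing pg_ctl status ends that loop at once, here is a toy version with stand-in functions (fake_status and fake_connect are invented for the illustration; they are not real commands). With &&, the loop condition is already false when the first command fails, so the body never runs and the script falls through as if the server were ready.

```shell
# Stand-ins for "pg_ctl status" and "psql -l"; both are hypothetical.
# STATUS_RC defaults to 1, i.e. "no server running (yet)".
fake_status()  { return "${STATUS_RC:-1}"; }
fake_connect() { return 1; }   # pretend every connection attempt fails

count_iterations() {
    n=0
    # Same shape as: while pg_ctl status && ! psql -l; do sleep 1; done
    while fake_status && ! fake_connect; do
        n=$((n + 1))
        [ "$n" -ge 3 ] && break   # safety cap for the demo only
    done
    echo "$n"
}
```

With the default STATUS_RC the function reports zero iterations; setting STATUS_RC=0 makes the loop spin until the demo cap is reached.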

Re: Reliably determining whether the server came up

From

Mischa Sandberg

Date:

18 November 2008, 12:47:36

Quoting Tom Lane <tgl@sss.pgh.pa.us>:

> Mischa Sandberg <mischa_sandberg@telus.net> writes:
> > Perhaps it's my "test for DB ready" that's the problem?
>
> > + while pg_ctl status && ! psql -l; do sleep 1; done >/dev/null 2>&1
>
> I'd bet that the pg_ctl status part is failing.  I get exit status 1
> from it if there's no server running.

Yes, that was part of the problem with the original startup script;
postmaster hadn't even gotten as far as writing postmaster.pid,
I guess. But pg_ctl status returning 1 could also mean that the
server had come up, hit a critical problem and exited. Hence my problem;
this has to detect server failure, reliably, as well.

BTW the example with (start,status,psql,createlang) failing just
happened, to my surprise, on my dev box -- fairly fast and lightly
loaded. On loaded, unattended systems, it happened consistently.
............
In another vein, another place where there are consistent
failures is in the sequence:

   createlang ... -d template1 plpgsql
   createdb $PGDATABASE
   <app>

The failure can happen on createdb ("template1 is busy")
or on <app>; and most frequently on the systems with overloaded disks.
My hacky response is to separate those steps with:

   psql -qc checkpoint template1

which consistently makes the problem go away; but what is
the problem, exactly, that this is tripping over??

Anyway, thanks for the comments.

Re: Reliably determining whether the server came up

From

Tom Lane

Date:

18 November 2008, 14:20:33

Mischa Sandberg <mischa_sandberg@telus.net> writes:
> Quoting Tom Lane <tgl@sss.pgh.pa.us>:
>> I'd bet that the pg_ctl status part is failing.  I get exit status 1
>> from it if there's no server running.

> Yes, that was part of the problem with the original startup script;
> postmaster hadn't even gotten as far as writing postmaster.pid,
> I guess. But pg_ctl status returning 1 could also mean that the
> server had come up, hit a critical problem and exited. Hence my problem;
> this has to detect server failure, reliably, as well.

You could sleep for a second or so *before* you start looking for the
pidfile.

> In another vein, another place where there are consistent
> failures is in the sequence:

>    createlang ... -d template1 plpgsql
>    createdb $PGDATABASE
>    <app>

This should be fixed in 8.3 and up.  In older releases about all you
can do is delay a second or so to let the old backend exit.

            regards, tom lane
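
On pre-8.3 releases, the "delay a second or so" advice can be wrapped in a small retry loop rather than a single fixed sleep. This is a sketch only: the helper name retry and its argument order are invented here, and the attempt limit in the example is arbitrary.

```shell
# Hypothetical helper: run a command until it succeeds, pausing between
# attempts. Usage: retry MAX_TRIES DELAY_SECONDS command [args...]
# Returns 0 on the first success, 1 once MAX_TRIES attempts have failed.
retry() {
    tries=$1 delay=$2
    shift 2
    attempt=1
    until "$@"; do
        [ "$attempt" -ge "$tries" ] && return 1
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    return 0
}
# e.g.  retry 5 1 createdb "$PGDATABASE"
```

This still relies on a bounded wait, so it trades the single guess for a few cheaper ones rather than removing the timing assumption altogether.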

Re: Reliably determining whether the server came up

From

Mischa Sandberg

Date:

18 November 2008, 15:07:06

Quoting Tom Lane <tgl@sss.pgh.pa.us>:

> Mischa Sandberg <mischa_sandberg@telus.net> writes:
> > Quoting Tom Lane <tgl@sss.pgh.pa.us>:
> >> I'd bet that the pg_ctl status part is failing.  I get exit status 1
> >> from it if there's no server running.
>
> > Yes, that was part of the problem with the original startup script;
> > postmaster hadn't even gotten as far as writing postmaster.pid,
> > I guess. But pg_ctl status returning 1 could also mean that the
> > server had come up, hit a critical problem and exited. Hence my problem;
> > this has to detect server failure, reliably, as well.
>
> You could sleep for a second or so *before* you start looking for the
> pidfile.

The systems are under erratic load, due to concurrent
cpu and diskio spikes around start-up time.
1-2 secs is not enough to be a guarantee :-(

Probably not explaining the issues well;
caught between two constraints that aren't really pg's problem;
and wide clusters with automated admin, variable hardware
and spikes of db restarts are no doubt an oddball edge case.
There are workarounds; was hoping for something
clean and obvious (to all but me).

Switching back to tailing the log files and moving on.
Thanks everyone.
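
For the log-tailing route, the decision step can be kept separate from the tailing itself. The following is a rough sketch, not the script actually used: scan_startup_log is an invented name, the "database system is ready" text is assumed to be what the server logs on success, and treating every other FATAL or PANIC line as a failure is a simplification.

```shell
# Hypothetical helper: read server log lines on stdin and report the
# first decisive one. Exit status 0 = ready, 1 = failed, 2 = undecided.
# Assumption: success is logged as "database system is ready ...".
scan_startup_log() {
    while IFS= read -r line; do
        case "$line" in
            # A client connected too early; not a server failure.
            *"the database system is starting up"*) ;;
            *"database system is ready"*)           return 0 ;;
            *FATAL*|*PANIC*)                        return 1 ;;
        esac
    done
    return 2
}
# e.g.  tail -n +"$FROM" $NEXT_LOG | scan_startup_log
```

An exit status of 2 means the log held nothing decisive yet, so the caller would sleep and feed it the log again.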

Re: Reliably determining whether the server came up

From

Tom Lane

Date:

18 November 2008, 15:44:50

Mischa Sandberg <mischa_sandberg@telus.net> writes:
>> You could sleep for a second or so *before* you start looking for
>> the pidfile.

> The systems are under erratic load, due to concurrent
> cpu and diskio spikes around start-up time.
> 1-2 secs is not enough to be a guarantee :-(

Well, forget pg_ctl and just start the postmaster directly, so that
your script knows its PID.  Then you could keep an eye on whether the
PID still exists.

            regards, tom lane
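
A minimal sketch of that approach, under some assumptions: the helper name wait_ready_or_dead is invented, kill -0 is used as the shell counterpart of the kill(pid, 0) check mentioned at the start of the thread, and the readiness test is whatever command the caller passes in.

```shell
# Hypothetical helper: wait until either the process with the given PID
# has gone away (return 1) or the readiness command succeeds (return 0).
# Usage: wait_ready_or_dead PID command [args...]
wait_ready_or_dead() {
    pid=$1
    shift
    # kill -0 sends no signal; it only tests whether the PID exists.
    while kill -0 "$pid" 2>/dev/null; do
        "$@" >/dev/null 2>&1 && return 0
        sleep 1
    done
    return 1
}
# e.g.  postmaster -D "$PGDATA" >>"$TEMP_LOG" 2>&1 &
#       wait_ready_or_dead $! psql -l
```

There is no timeout here: the loop ends only when the server answers or its process disappears. It does assume the PID is not reused in the meantime and that the shell has reaped the exited child, since a zombie still counts as existing.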