Thread: PG service restart failure (start getting ahead of stop?)

PG service restart failure (start getting ahead of stop?)

From
"George Pavlov"
Date:
We have a nightly restart of one PG database. Today it failed and I
can't seem to understand why and how to prevent this in the future (nor
can I reproduce the problem).

We have a line in a shell script that calls "/etc/init.d/postgresql
restart". In the shell script's log from this invocation I have:

  Stopping postgresql service: [FAILED]
  Starting postgresql service: [  OK  ]

The outcome of this was that the service was not running (despite the "[
OK  ]" on the second line). The query log from that run ends with:

  2007-04-23 03:03:59 PDT [23265] LOG:  received fast shutdown request
  2007-04-23 03:03:59 PDT [23265] LOG:  aborting any active transactions
  2007-04-23 03:03:59 PDT [26749] FATAL:  terminating connection due to
administrator command
  <... snipped more lines like the one above ...>
  2007-04-23 03:03:59 PDT [24090] LOG:  could not send data to client:
Broken pipe
  2007-04-23 03:03:59 PDT [26700] FATAL:  terminating connection due to
administrator command
  <... snipped more lines like the one above ...>
  2007-04-23 03:04:13 PDT [26820] FATAL:  the database system is
shutting down
  <... snipped more lines like the one above ...>
  2007-04-23 03:06:10 PDT [23269] LOG:  shutting down
  2007-04-23 03:06:10 PDT [23269] LOG:  database system is shut down
  2007-04-23 03:06:13 PDT [23267] LOG:  logger shutting down
  <end of log>

So it looks like the STOPPING of the service actually succeeded, albeit
it took a while (more than the usual sessions open?). The STARTING is
the one that actually failed (is that because the STOP was still in
process?).  The question is why -- in a RESTART situation
wouldn't/shouldn't the START part wait for the STOP part to complete
(regardless of how long it takes)?

Also what can we do to avoid this in the future? We can issue a separate
STOP and then a START X minutes later, but how long an X? It would seem
that a RESTART is really what I want...

TIA,

George

Re: PG service restart failure (start getting ahead of stop?)

From
Brent Wood
Date:
George Pavlov wrote:
> We have a nightly restart of one PG database. Today it failed and I
> can't seem to understand why and how to prevent this in the future (nor
> can I reproduce the problem).
>
>
If you get the PID of the psql server process then invoke a STOP, then
have a loop waiting for that PID to disappear, then run your START, it
may do what you want. Or someone may have a more elegant solution :-)

Brent Wood



> We have a line in a shell script that calls "/etc/init.d/postgresql
> restart". In the shell script's log from this invocation I have:
>
>   Stopping postgresql service: [FAILED]
>   Starting postgresql service: [  OK  ]
>
> The outcome of this was that the service was not running (despite the "[
> OK  ]" on the second line). The query log from that run ends with:
>
>   2007-04-23 03:03:59 PDT [23265] LOG:  received fast shutdown request
>   2007-04-23 03:03:59 PDT [23265] LOG:  aborting any active transactions
>   2007-04-23 03:03:59 PDT [26749] FATAL:  terminating connection due to
> administrator command
>   <... snipped more lines like the one above ...>
>   2007-04-23 03:03:59 PDT [24090] LOG:  could not send data to client:
> Broken pipe
>   2007-04-23 03:03:59 PDT [26700] FATAL:  terminating connection due to
> administrator command
>   <... snipped more lines like the one above ...>
>   2007-04-23 03:04:13 PDT [26820] FATAL:  the database system is
> shutting down
>   <... snipped more lines like the one above ...>
>   2007-04-23 03:06:10 PDT [23269] LOG:  shutting down
>   2007-04-23 03:06:10 PDT [23269] LOG:  database system is shut down
>   2007-04-23 03:06:13 PDT [23267] LOG:  logger shutting down
>   <end of log>
>
> So it looks like the STOPPING of the service actually succeeded, albeit
> it took a while (more than the usual sessions open?). The STARTING is
> the one that actually failed (is that because the STOP was still in
> process?).  The question is why -- in a RESTART situation
> wouldn't/shouldn't the START part wait for the STOP part to complete
> (regardless of how long it takes)?
>
> Also what can we do to avoid this in the future? We can issue a separate
> STOP and then a START X minutes later, but how long an X? It would seem
> that a RESTART is really what I want...
>
> TIA,
>
> George
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
>
>                http://www.postgresql.org/docs/faq
>


Re: PG service restart failure (start getting ahead of stop?)

From
Tom Lane
Date:
"George Pavlov" <gpavlov@mynewplace.com> writes:
> We have a nightly restart of one PG database.

Just out of curiosity, what for?  I can't imagine any really good reason
for just shutting down the postmaster and immediately restarting it.

> So it looks like the STOPPING of the service actually succeeded, albeit
> it took a while (more than the usual sessions open?). The STARTING is
> the one that actually failed (is that because the STOP was still in
> process?).  The question is why -- in a RESTART situation
> wouldn't/shouldn't the START part wait for the STOP part to complete
> (regardless of how long it takes)?

Well, this'd depend on the details of the postgres init script you're
using, which you gave no hint about (and yes, there are a *ton* of
different versions out there).  The one I'm currently shipping for Red
Hat would give up waiting after a minute, but it should report failure
not success in that case.

            regards, tom lane

Re: PG service restart failure (start getting ahead of stop?)

From
"George Pavlov"
Date:
> > So it looks like the STOPPING of the service actually
> > succeeded, albeit
> > it took a while (more than the usual sessions open?). The
> > STARTING is
> > the one that actually failed (is that because the STOP was still in
> > process?).  The question is why -- in a RESTART situation
> > wouldn't/shouldn't the START part wait for the STOP part to complete
> > (regardless of how long it takes)?
>
> Well, this'd depend on the details of the postgres init script you're
> using, which you gave no hint about (and yes, there are a *ton* of
> different versions out there).  The one I'm currently shipping for Red
> Hat would give up waiting after a minute, but it should report failure
> not success in that case.

oh, sorry, i did not realize there were many of them. this is the 8.1.8
redhat one; the (hopefully) identifying lines are:

 64 # Version 8.1 Devrim Gunduz <devrim@PostgreSQL.org>
 65 # Increased sleep time from 1 sec to 2 sec.
 66
 67 # PGVERSION is the full package version, e.g., 8.1.1
 68 # Note: the specfile ordinarily updates this during install
 69 PGVERSION=8.1.8

i have examined the stop() and start() and i think i understand why the
stop() reported a failure (it took to long), but i don't understand how
the start() could have reported success:

>   Stopping postgresql service: [FAILED]
>   Starting postgresql service: [  OK  ]

there was definitely no running DB after that (until someone manually
started it hours later).

> > We have a nightly restart of one PG database.
>
> Just out of curiosity, what for?  I can't imagine any really
> good reason
> for just shutting down the postmaster and immediately restarting it.

why have a "restart" option if there is never a reason for it? :-)
seriously, this is a good question, i think this was someone's idea of a
quick way to clear any remaining DB sessions in order to be able to drop
a database and restore a newer version (this is a reporting DB that gets
refreshed nightly with a dump from another DB). this seems a bit
misguided in that if you want to kill sessions you should just kill
sessions, (a la http://varlena.com/GeneralBits/29.html), right? what's
your opinion on the best way to do this?

thanks.

george

Re: PG service restart failure (start getting ahead of stop?)

From
Tom Lane
Date:
"George Pavlov" <gpavlov@mynewplace.com> writes:
> i have examined the stop() and start() and i think i understand why the
> stop() reported a failure (it took to long), but i don't understand how
> the start() could have reported success:

>> Stopping postgresql service: [FAILED]
>> Starting postgresql service: [  OK  ]

Well, that makes sense: if the shutdown took more than a minute then the
"stop" script action would give up waiting, and then the "start" action
would see the postmaster running and go away happy.  (It's a bit odd
that "service start" actions are supposed to treat "already running"
as OK, but I've been told that that's required by the Linux Standards
Base and I can't change it.)

The real question here is why'd it take so long to stop?  It should be
using "mode fast" which'd kick out active queries.

>> Just out of curiosity, what for?  I can't imagine any really
>> good reason
>> for just shutting down the postmaster and immediately restarting it.

> why have a "restart" option if there is never a reason for it? :-)

Well, there are times when you need it, like changing shared_buffers
or one of the other postmaster-start-time-only parameters.  But doing
it just as a routine action smacks of Microsoftish "you need to reboot
regularly" thinking...

> seriously, this is a good question, i think this was someone's idea of a
> quick way to clear any remaining DB sessions in order to be able to drop
> a database and restore a newer version (this is a reporting DB that gets
> refreshed nightly with a dump from another DB).

Hmm.  It's not exactly bulletproof --- what if someone reconnects to the
DB as soon as you restart?  But I guess it's a simple answer that might
not be worth improving on.

            regards, tom lane

Re: [QUARANTINE] Re: PG service restart failure (start getting ahead of stop?)

From
"George Pavlov"
Date:
> Well, that makes sense: if the shutdown took more than a
> minute then the
> "stop" script action would give up waiting, and then the
> "start" action
> would see the postmaster running and go away happy.  (It's a bit odd
> that "service start" actions are supposed to treat "already running"
> as OK, but I've been told that that's required by the Linux Standards
> Base and I can't change it.)

thanks, that's good to know. i never realized that's how it behaves.

> The real question here is why'd it take so long to stop?  It should be
> using "mode fast" which'd kick out active queries.

indeed a mystery -- from looking at the query log there didn't seem to
be ANY active queries at the time. seems that > 85% of the session IDs
(84 total) for which i got lines like "2007-04-23 03:05:48 PDT [26987]
FATAL:  the database system is shutting down" did not even have any
preceding query activity. hard to debug retroactively -- it's just
annoying that i don't understand what was different about this restart.

and, yes, i confirmed that it is using fast mode:

  212   $SU -l postgres -c "$PGENGINE/pg_ctl stop -D '$PGDATA' -s -m
fast" > /dev/null 2>&1 < /dev/null

george