Thread: PG service restart failure (start getting ahead of stop?)
We have a nightly restart of one PG database. Today it failed and I can't seem to understand why and how to prevent this in the future (nor can I reproduce the problem). We have a line in a shell script that calls "/etc/init.d/postgresql restart". In the shell script's log from this invocation I have: Stopping postgresql service: [FAILED] Starting postgresql service: [ OK ] The outcome of this was that the service was not running (despite the "[ OK ]" on the second line). The query log from that run ends with: 2007-04-23 03:03:59 PDT [23265] LOG: received fast shutdown request 2007-04-23 03:03:59 PDT [23265] LOG: aborting any active transactions 2007-04-23 03:03:59 PDT [26749] FATAL: terminating connection due to administrator command <... snipped more lines like the one above ...> 2007-04-23 03:03:59 PDT [24090] LOG: could not send data to client: Broken pipe 2007-04-23 03:03:59 PDT [26700] FATAL: terminating connection due to administrator command <... snipped more lines like the one above ...> 2007-04-23 03:04:13 PDT [26820] FATAL: the database system is shutting down <... snipped more lines like the one above ...> 2007-04-23 03:06:10 PDT [23269] LOG: shutting down 2007-04-23 03:06:10 PDT [23269] LOG: database system is shut down 2007-04-23 03:06:13 PDT [23267] LOG: logger shutting down <end of log> So it looks like the STOPPING of the service actually succeeded, albeit it took a while (more than the usual sessions open?). The STARTING is the one that actually failed (is that because the STOP was still in process?). The question is why -- in a RESTART situation wouldn't/shouldn't the START part wait for the STOP part to complete (regardless of how long it takes)? Also what can we do to avoid this in the future? We can issue a separate STOP and then a START X minutes later, but how long an X? It would seem that a RESTART is really what I want... TIA, George
George Pavlov wrote: > We have a nightly restart of one PG database. Today it failed and I > can't seem to understand why and how to prevent this in the future (nor > can I reproduce the problem). > > If you get the PID of the psql server process then invoke a STOP, then have a loop waiting for that PID to disappear, then run your START, it may do what you want. Or someone may have a more elegant solution :-) Brent Wood > We have a line in a shell script that calls "/etc/init.d/postgresql > restart". In the shell script's log from this invocation I have: > > Stopping postgresql service: [FAILED] > Starting postgresql service: [ OK ] > > The outcome of this was that the service was not running (despite the "[ > OK ]" on the second line). The query log from that run ends with: > > 2007-04-23 03:03:59 PDT [23265] LOG: received fast shutdown request > 2007-04-23 03:03:59 PDT [23265] LOG: aborting any active transactions > 2007-04-23 03:03:59 PDT [26749] FATAL: terminating connection due to > administrator command > <... snipped more lines like the one above ...> > 2007-04-23 03:03:59 PDT [24090] LOG: could not send data to client: > Broken pipe > 2007-04-23 03:03:59 PDT [26700] FATAL: terminating connection due to > administrator command > <... snipped more lines like the one above ...> > 2007-04-23 03:04:13 PDT [26820] FATAL: the database system is > shutting down > <... snipped more lines like the one above ...> > 2007-04-23 03:06:10 PDT [23269] LOG: shutting down > 2007-04-23 03:06:10 PDT [23269] LOG: database system is shut down > 2007-04-23 03:06:13 PDT [23267] LOG: logger shutting down > <end of log> > > So it looks like the STOPPING of the service actually succeeded, albeit > it took a while (more than the usual sessions open?). The STARTING is > the one that actually failed (is that because the STOP was still in > process?). The question is why -- in a RESTART situation > wouldn't/shouldn't the START part wait for the STOP part to complete > (regardless of how long it takes)? > > Also what can we do to avoid this in the future? We can issue a separate > STOP and then a START X minutes later, but how long an X? It would seem > that a RESTART is really what I want... > > TIA, > > George > > ---------------------------(end of broadcast)--------------------------- > TIP 3: Have you checked our extensive FAQ? > > http://www.postgresql.org/docs/faq >
"George Pavlov" <gpavlov@mynewplace.com> writes: > We have a nightly restart of one PG database. Just out of curiosity, what for? I can't imagine any really good reason for just shutting down the postmaster and immediately restarting it. > So it looks like the STOPPING of the service actually succeeded, albeit > it took a while (more than the usual sessions open?). The STARTING is > the one that actually failed (is that because the STOP was still in > process?). The question is why -- in a RESTART situation > wouldn't/shouldn't the START part wait for the STOP part to complete > (regardless of how long it takes)? Well, this'd depend on the details of the postgres init script you're using, which you gave no hint about (and yes, there are a *ton* of different versions out there). The one I'm currently shipping for Red Hat would give up waiting after a minute, but it should report failure not success in that case. regards, tom lane
> > So it looks like the STOPPING of the service actually > > succeeded, albeit > > it took a while (more than the usual sessions open?). The > > STARTING is > > the one that actually failed (is that because the STOP was still in > > process?). The question is why -- in a RESTART situation > > wouldn't/shouldn't the START part wait for the STOP part to complete > > (regardless of how long it takes)? > > Well, this'd depend on the details of the postgres init script you're > using, which you gave no hint about (and yes, there are a *ton* of > different versions out there). The one I'm currently shipping for Red > Hat would give up waiting after a minute, but it should report failure > not success in that case. oh, sorry, i did not realize there were many of them. this is the 8.1.8 redhat one; the (hopefully) identifying lines are: 64 # Version 8.1 Devrim Gunduz <devrim@PostgreSQL.org> 65 # Increased sleep time from 1 sec to 2 sec. 66 67 # PGVERSION is the full package version, e.g., 8.1.1 68 # Note: the specfile ordinarily updates this during install 69 PGVERSION=8.1.8 i have examined the stop() and start() and i think i understand why the stop() reported a failure (it took to long), but i don't understand how the start() could have reported success: > Stopping postgresql service: [FAILED] > Starting postgresql service: [ OK ] there was definitely no running DB after that (until someone manually started it hours later). > > We have a nightly restart of one PG database. > > Just out of curiosity, what for? I can't imagine any really > good reason > for just shutting down the postmaster and immediately restarting it. why have a "restart" option if there is never a reason for it? :-) seriously, this is a good question, i think this was someone's idea of a quick way to clear any remaining DB sessions in order to be able to drop a database and restore a newer version (this is a reporting DB that gets refreshed nightly with a dump from another DB). this seems a bit misguided in that if you want to kill sessions you should just kill sessions, (a la http://varlena.com/GeneralBits/29.html), right? what's your opinion on the best way to do this? thanks. george
"George Pavlov" <gpavlov@mynewplace.com> writes: > i have examined the stop() and start() and i think i understand why the > stop() reported a failure (it took to long), but i don't understand how > the start() could have reported success: >> Stopping postgresql service: [FAILED] >> Starting postgresql service: [ OK ] Well, that makes sense: if the shutdown took more than a minute then the "stop" script action would give up waiting, and then the "start" action would see the postmaster running and go away happy. (It's a bit odd that "service start" actions are supposed to treat "already running" as OK, but I've been told that that's required by the Linux Standards Base and I can't change it.) The real question here is why'd it take so long to stop? It should be using "mode fast" which'd kick out active queries. >> Just out of curiosity, what for? I can't imagine any really >> good reason >> for just shutting down the postmaster and immediately restarting it. > why have a "restart" option if there is never a reason for it? :-) Well, there are times when you need it, like changing shared_buffers or one of the other postmaster-start-time-only parameters. But doing it just as a routine action smacks of Microsoftish "you need to reboot regularly" thinking... > seriously, this is a good question, i think this was someone's idea of a > quick way to clear any remaining DB sessions in order to be able to drop > a database and restore a newer version (this is a reporting DB that gets > refreshed nightly with a dump from another DB). Hmm. It's not exactly bulletproof --- what if someone reconnects to the DB as soon as you restart? But I guess it's a simple answer that might not be worth improving on. regards, tom lane
Re: [QUARANTINE] Re: PG service restart failure (start getting ahead of stop?)
From
"George Pavlov"
Date:
> Well, that makes sense: if the shutdown took more than a > minute then the > "stop" script action would give up waiting, and then the > "start" action > would see the postmaster running and go away happy. (It's a bit odd > that "service start" actions are supposed to treat "already running" > as OK, but I've been told that that's required by the Linux Standards > Base and I can't change it.) thanks, that's good to know. i never realized that's how it behaves. > The real question here is why'd it take so long to stop? It should be > using "mode fast" which'd kick out active queries. indeed a mystery -- from looking at the query log there didn't seem to be ANY active queries at the time. seems that > 85% of the session IDs (84 total) for which i got lines like "2007-04-23 03:05:48 PDT [26987] FATAL: the database system is shutting down" did not even have any preceding query activity. hard to debug retroactively -- it's just annoying that i don't understand what was different about this restart. and, yes, i confirmed that it is using fast mode: 212 $SU -l postgres -c "$PGENGINE/pg_ctl stop -D '$PGDATA' -s -m fast" > /dev/null 2>&1 < /dev/null george