Thread: postmaster dead but backends still running?

postmaster dead but backends still running?

From

Charles Hornberger

Date:

17 June 2003, 15:50:52

I've got what looks like a really strange situation on my hands (or else
I've got a normal situation that I'm looking at strangely): It appears
that the main postmaster process is dead & gone, but I have a bunch of
backends still running.

I can't connect to the database server any more, but a bunch of old
persistent connections (which are about four days old and which I think
are being kept alive by my web server) are still up & running; at least
some of them are serving data to web pages.

To wit:

[rhodes] data/$ /usr/ucb/ps axuw | grep post
postgres  9238  0.2  1.4 8664 5104 ?        S   Jun 13  3:13
/its/software/bin/postmaster
postgres  9268  0.1  1.4 8672 5144 ?        S   Jun 13  3:26
/its/software/bin/postmaster
postgres  8920  0.1  0.6 2480 2024 pts/0    R 11:08:26  0:00 bash
postgres  9237  0.1  1.4 8664 5104 ?        S   Jun 13  3:01
/its/software/bin/postmaster
root      5411  0.0  0.4 1904 1448 ?        S   Jun 09  0:00
/software/stow/postfix-2.0.10/libexec/postfix/master
postfix   5413  0.0  0.4 1992 1528 ?        S   Jun 09  0:00 qmgr -l -t
fifo -u
postfix   8857  0.0  0.4 1960 1552 ?        S 11:03:14  0:00 pickup -l
-t fifo -u
postgres  9236  0.0  1.4 8664 5120 ?        S   Jun 13  3:12
/its/software/bin/postmaster
postgres  9243  0.0  1.5 8720 5584 ?        S   Jun 13  3:06
/its/software/bin/postmaster
postgres  9254  0.0  1.4 8656 5128 ?        S   Jun 13  3:22
/its/software/bin/postmaster
postgres  9278  0.0  1.4 8664 5192 ?        S   Jun 13  3:08
/its/software/bin/postmaster
postgres  9333  0.0  1.5 8672 5312 ?        S   Jun 13  3:33
/its/software/bin/postmaster
postgres  9379  0.0  1.4 8720 5176 ?        S   Jun 13  3:08
/its/software/bin/postmaster
postgres  9431  0.0  1.4 8672 5112 ?        S   Jun 13  3:18
/its/software/bin/postmaster
postgres  9877  0.0  0.0 2480    ? pts/0    R 11:47:15  0:00 bash

The file /var/pgsql/data/postmaster.pid claims that the postmaster's PID
is 27215; there's no process with that PID running on my system.

Whenever I try to create a new connection, it fail:

[rhodes] data/$ psql template1
psql: could not connect to server: No such file or directory
         Is the server running locally and accepting
         connections on Unix domain socket "/tmp/.s.PGSQL.5432"?
[rhodes] data/$ psql -h localhost template1
psql: could not connect to server: Connection refused
         Is the server running on host localhost and accepting
         TCP/IP connections on port 5432?

Any ideas on what I should do now? I'm running 7.3.2 on Solaris 7.

-Charlie

--
Charles Hornberger
Caltech
Division of the Humanities and Social Sciences
M/C 228-77
Tel (626) 395-3474

Re: postmaster dead but backends still running?

From

Charles Hornberger

Date:

17 June 2003, 17:36:16

Other things I perhaps ought to mention: Trying to stop the postmaster
using pg_ctl fails (unsurprisingly, since pg_ctl relies on
/var/pgsql/data/postmaster.pid, which contains a nonexistent PID); I
haven't tried to start a new postmaster yet, because the old backends
are hanging around.

Nor have I attempted to restart the web server, which might allow the
hanging-round backends to die by closing the old connections it's
holding to them. I'm tempted to go ahead and do this, though I'm not
sure whether I ought to until I've diagnosed what's going on right now.

In case it's relevant, I've gone back through the logs and discovered
that for the past week or so I've been occasionally running out of
connections (I was running w/ the default of 16) and getting 'FATAL:
Non-superuser connection limit exceeded errors' (about a dozen a day),
but I can't find any other related  messages in the logs.

If anyone has any suggestions, I'd really appreciate your input; I'm
getting a bit antsy since my production database server is basically
halfway down and users are wondering why their web pages don't work ...

-Charlie


Charles Hornberger wrote:
> I've got what looks like a really strange situation on my hands (or else
> I've got a normal situation that I'm looking at strangely): It appears
> that the main postmaster process is dead & gone, but I have a bunch of
> backends still running.
>
> I can't connect to the database server any more, but a bunch of old
> persistent connections (which are about four days old and which I think
> are being kept alive by my web server) are still up & running; at least
> some of them are serving data to web pages.
>
> To wit:
>
> [rhodes] data/$ /usr/ucb/ps axuw | grep post
> postgres  9238  0.2  1.4 8664 5104 ?        S   Jun 13  3:13
> /its/software/bin/postmaster
> postgres  9268  0.1  1.4 8672 5144 ?        S   Jun 13  3:26
> /its/software/bin/postmaster
> postgres  8920  0.1  0.6 2480 2024 pts/0    R 11:08:26  0:00 bash
> postgres  9237  0.1  1.4 8664 5104 ?        S   Jun 13  3:01
> /its/software/bin/postmaster
> root      5411  0.0  0.4 1904 1448 ?        S   Jun 09  0:00
> /software/stow/postfix-2.0.10/libexec/postfix/master
> postfix   5413  0.0  0.4 1992 1528 ?        S   Jun 09  0:00 qmgr -l -t
> fifo -u
> postfix   8857  0.0  0.4 1960 1552 ?        S 11:03:14  0:00 pickup -l
> -t fifo -u
> postgres  9236  0.0  1.4 8664 5120 ?        S   Jun 13  3:12
> /its/software/bin/postmaster
> postgres  9243  0.0  1.5 8720 5584 ?        S   Jun 13  3:06
> /its/software/bin/postmaster
> postgres  9254  0.0  1.4 8656 5128 ?        S   Jun 13  3:22
> /its/software/bin/postmaster
> postgres  9278  0.0  1.4 8664 5192 ?        S   Jun 13  3:08
> /its/software/bin/postmaster
> postgres  9333  0.0  1.5 8672 5312 ?        S   Jun 13  3:33
> /its/software/bin/postmaster
> postgres  9379  0.0  1.4 8720 5176 ?        S   Jun 13  3:08
> /its/software/bin/postmaster
> postgres  9431  0.0  1.4 8672 5112 ?        S   Jun 13  3:18
> /its/software/bin/postmaster
> postgres  9877  0.0  0.0 2480    ? pts/0    R 11:47:15  0:00 bash
>
> The file /var/pgsql/data/postmaster.pid claims that the postmaster's PID
> is 27215; there's no process with that PID running on my system.
>
> Whenever I try to create a new connection, it fail:
>
> [rhodes] data/$ psql template1
> psql: could not connect to server: No such file or directory
>         Is the server running locally and accepting
>         connections on Unix domain socket "/tmp/.s.PGSQL.5432"?
> [rhodes] data/$ psql -h localhost template1
> psql: could not connect to server: Connection refused
>         Is the server running on host localhost and accepting
>         TCP/IP connections on port 5432?
>
> Any ideas on what I should do now? I'm running 7.3.2 on Solaris 7.
>
> -Charlie
>

--
Charles Hornberger
Caltech
Division of the Humanities and Social Sciences
M/C 228-77
Tel (626) 395-3474

Re: postmaster dead but backends still running?

From

Tom Lane

Date:

17 June 2003, 20:39:19

Charles Hornberger <charlie@hss.caltech.edu> writes:
> Other things I perhaps ought to mention: Trying to stop the postmaster
> using pg_ctl fails (unsurprisingly, since pg_ctl relies on
> /var/pgsql/data/postmaster.pid, which contains a nonexistent PID); I
> haven't tried to start a new postmaster yet, because the old backends
> are hanging around.

In theory a new postmaster would detect the old backends and refuse to
start anyway.  I don't trust that interlock unreservedly though.  (But
please test it while you have the opportunity...)

> Nor have I attempted to restart the web server, which might allow the
> hanging-round backends to die by closing the old connections it's
> holding to them. I'm tempted to go ahead and do this, though I'm not
> sure whether I ought to until I've diagnosed what's going on right now.

You will need to close all the existing connections before the new
postmaster can be started.  I'd recommend doing so sooner instead of
later, because with no postmaster you aren't getting any checkpoints
done, and your WAL space is going to start ballooning.

As far as diagnosing the problem goes: if you have a postmaster log
file, look to see if the postmaster wrote an ERROR or FATAL message
before it exited.  (Finding it among all the backend-level messages
might be painful though.)  Also look in the directory the postmaster
was started in to see if there's a core file.  Save away any evidence
you can find before trying to start a new postmaster.

Because the postmaster doesn't actually do much, crashes are pretty
unusual.  I'm interested in whatever you can find.

            regards, tom lane

Re: postmaster dead but backends still running?

From

Charles Hornberger

Date:

19 June 2003, 14:26:37

On Tue, 17 Jun 2003, Tom Lane wrote:
> Charles Hornberger <charlie@hss.caltech.edu> writes:
> > Other things I perhaps ought to mention: Trying to stop the postmaster
> > using pg_ctl fails (unsurprisingly, since pg_ctl relies on
> > /var/pgsql/data/postmaster.pid, which contains a nonexistent PID); I
> > haven't tried to start a new postmaster yet, because the old backends
> > are hanging around.
>
> In theory a new postmaster would detect the old backends and refuse to
> start anyway.  I don't trust that interlock unreservedly though.  (But
> please test it while you have the opportunity...)

Unfortunately, our system administrator solved this before I got a chance
to test more. I don't know how he went about restarting the server,
although whatever he did doesn't appear to have hurt anything; would
it be interesting to know exactly what steps he took?

> > Nor have I attempted to restart the web server, which might allow the
> > hanging-round backends to die by closing the old connections it's
> > holding to them. I'm tempted to go ahead and do this, though I'm not
> > sure whether I ought to until I've diagnosed what's going on right now.
>
> You will need to close all the existing connections before the new
> postmaster can be started.  I'd recommend doing so sooner instead of
> later, because with no postmaster you aren't getting any checkpoints
> done, and your WAL space is going to start ballooning.
>
> As far as diagnosing the problem goes: if you have a postmaster log
> file, look to see if the postmaster wrote an ERROR or FATAL message
> before it exited.  (Finding it among all the backend-level messages
> might be painful though.)  Also look in the directory the postmaster
> was started in to see if there's a core file.  Save away any evidence
> you can find before trying to start a new postmaster.

Interestingly, there are no messages in the log file, and I can't find a
core file -- in short, there's no evidence whatsoever, at least not that
I can find. (Though I am probably a pretty rotten detective.)

However, I think I know the cause (though I haven't tested to see if this
indeed causes the postmaster to die): A few hours before I noticed that
the postmaster was dead, one of the sysadmins made a typo that caused an
NFS mount to become unavailable -- the very NFS mount that held the
postgres executable (all our Solaris boxes share the same executables). So
the theory is that the postmaster tried to fork() a process using a
non-existent executable, and died as a result. Does this make any sense?

-Charlie

> Because the postmaster doesn't actually do much, crashes are pretty
> unusual.  I'm interested in whatever you can find.
>
>             regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: if posting/reading through Usenet, please send an appropriate
>       subscribe-nomail command to majordomo@postgresql.org so that your
>       message can get through to the mailing list cleanly
>

Re: postmaster dead but backends still running?

From

Tom Lane

Date:

19 June 2003, 14:48:19

Charles Hornberger <charlie@hss.caltech.edu> writes:
> However, I think I know the cause (though I haven't tested to see if this
> indeed causes the postmaster to die): A few hours before I noticed that
> the postmaster was dead, one of the sysadmins made a typo that caused an
> NFS mount to become unavailable -- the very NFS mount that held the
> postgres executable (all our Solaris boxes share the same executables). So
> the theory is that the postmaster tried to fork() a process using a
> non-existent executable, and died as a result. Does this make any sense?

A fork() failure would not cause the postmaster to die (it's not
uncommon to see fork() failures due to resource limits, so this path is
really pretty well tested).  I'm not familiar enough with Solaris to know
whether other fatal error conditions might arise in this scenario.
(I know HPUX gets rather unhappy if you try to delete an executable file
or shared library that's in use by live processes...)  But the trouble
with that line of thought is that the postmaster and the backends are
all the same executable; if the postmaster went south because of loss of
the executable file, I'd expect the backends not to survive it either.
Unless maybe the backends weren't actually doing anything --- is it
possible that the connected clients had issued no commands in the whole
episode?

            regards, tom lane

Re: postmaster dead but backends still running?

From

Charles Hornberger

Date:

19 June 2003, 15:21:05

On Thu, 19 Jun 2003, Tom Lane wrote:

> Charles Hornberger <charlie@hss.caltech.edu> writes:
> > However, I think I know the cause (though I haven't tested to see if this
> > indeed causes the postmaster to die): A few hours before I noticed that
> > the postmaster was dead, one of the sysadmins made a typo that caused an
> > NFS mount to become unavailable -- the very NFS mount that held the
> > postgres executable (all our Solaris boxes share the same executables). So
> > the theory is that the postmaster tried to fork() a process using a
> > non-existent executable, and died as a result. Does this make any sense?
>
> A fork() failure would not cause the postmaster to die (it's not
> uncommon to see fork() failures due to resource limits, so this path is
> really pretty well tested).  I'm not familiar enough with Solaris to know
> whether other fatal error conditions might arise in this scenario.
> (I know HPUX gets rather unhappy if you try to delete an executable file
> or shared library that's in use by live processes...)  But the trouble
> with that line of thought is that the postmaster and the backends are
> all the same executable; if the postmaster went south because of loss of
> the executable file, I'd expect the backends not to survive it either.
> Unless maybe the backends weren't actually doing anything --- is it
> possible that the connected clients had issued no commands in the whole
> episode?
>

No, that's not possible; the backends that were up & running were
definitely in active use (serving a web site). Sorry I can't be of more
help -- I'm not familiar w/ Solaris, either. But if there are other places
I should look for error messages, I'm happy to go poking around ...

-Charlie

Need help/suggestions with backup/restore of database tables.

From

"Chris White (cjwhite)"

Date:

24 June 2003, 14:01:59

I have four tables in a database which reference all the large objects
in the database. I need to backup and restore these 4 tables and the
large objects independently of the other tables in the database.

pg_dump only allows me to dump the whole database, if I want to dump the
large objects. I know that on restore I can create a contents file from
the pg_dump file and selectively restore only the tables I need, but
this is not suitable because:

1) Restore data is streamed from a remote ftp server.
2) Backup file is large.
3) There is not enough physical storage to copy the restore file to disk
prior to restoring the database, so I can create the contents list
without reading file twice over ftp link.

Does anybody have any suggestions apart from writing my own custom
backup/restore?

Chris White