Thread: postmaster dead but backends still running?
I've got what looks like a really strange situation on my hands (or else I've got a normal situation that I'm looking at strangely): It appears that the main postmaster process is dead & gone, but I have a bunch of backends still running. I can't connect to the database server any more, but a bunch of old persistent connections (which are about four days old and which I think are being kept alive by my web server) are still up & running; at least some of them are serving data to web pages. To wit: [rhodes] data/$ /usr/ucb/ps axuw | grep post postgres 9238 0.2 1.4 8664 5104 ? S Jun 13 3:13 /its/software/bin/postmaster postgres 9268 0.1 1.4 8672 5144 ? S Jun 13 3:26 /its/software/bin/postmaster postgres 8920 0.1 0.6 2480 2024 pts/0 R 11:08:26 0:00 bash postgres 9237 0.1 1.4 8664 5104 ? S Jun 13 3:01 /its/software/bin/postmaster root 5411 0.0 0.4 1904 1448 ? S Jun 09 0:00 /software/stow/postfix-2.0.10/libexec/postfix/master postfix 5413 0.0 0.4 1992 1528 ? S Jun 09 0:00 qmgr -l -t fifo -u postfix 8857 0.0 0.4 1960 1552 ? S 11:03:14 0:00 pickup -l -t fifo -u postgres 9236 0.0 1.4 8664 5120 ? S Jun 13 3:12 /its/software/bin/postmaster postgres 9243 0.0 1.5 8720 5584 ? S Jun 13 3:06 /its/software/bin/postmaster postgres 9254 0.0 1.4 8656 5128 ? S Jun 13 3:22 /its/software/bin/postmaster postgres 9278 0.0 1.4 8664 5192 ? S Jun 13 3:08 /its/software/bin/postmaster postgres 9333 0.0 1.5 8672 5312 ? S Jun 13 3:33 /its/software/bin/postmaster postgres 9379 0.0 1.4 8720 5176 ? S Jun 13 3:08 /its/software/bin/postmaster postgres 9431 0.0 1.4 8672 5112 ? S Jun 13 3:18 /its/software/bin/postmaster postgres 9877 0.0 0.0 2480 ? pts/0 R 11:47:15 0:00 bash The file /var/pgsql/data/postmaster.pid claims that the postmaster's PID is 27215; there's no process with that PID running on my system. Whenever I try to create a new connection, it fail: [rhodes] data/$ psql template1 psql: could not connect to server: No such file or directory Is the server running locally and accepting connections on Unix domain socket "/tmp/.s.PGSQL.5432"? [rhodes] data/$ psql -h localhost template1 psql: could not connect to server: Connection refused Is the server running on host localhost and accepting TCP/IP connections on port 5432? Any ideas on what I should do now? I'm running 7.3.2 on Solaris 7. -Charlie -- Charles Hornberger Caltech Division of the Humanities and Social Sciences M/C 228-77 Tel (626) 395-3474
Other things I perhaps ought to mention: Trying to stop the postmaster using pg_ctl fails (unsurprisingly, since pg_ctl relies on /var/pgsql/data/postmaster.pid, which contains a nonexistent PID); I haven't tried to start a new postmaster yet, because the old backends are hanging around. Nor have I attempted to restart the web server, which might allow the hanging-round backends to die by closing the old connections it's holding to them. I'm tempted to go ahead and do this, though I'm not sure whether I ought to until I've diagnosed what's going on right now. In case it's relevant, I've gone back through the logs and discovered that for the past week or so I've been occasionally running out of connections (I was running w/ the default of 16) and getting 'FATAL: Non-superuser connection limit exceeded errors' (about a dozen a day), but I can't find any other related messages in the logs. If anyone has any suggestions, I'd really appreciate your input; I'm getting a bit antsy since my production database server is basically halfway down and users are wondering why their web pages don't work ... -Charlie Charles Hornberger wrote: > I've got what looks like a really strange situation on my hands (or else > I've got a normal situation that I'm looking at strangely): It appears > that the main postmaster process is dead & gone, but I have a bunch of > backends still running. > > I can't connect to the database server any more, but a bunch of old > persistent connections (which are about four days old and which I think > are being kept alive by my web server) are still up & running; at least > some of them are serving data to web pages. > > To wit: > > [rhodes] data/$ /usr/ucb/ps axuw | grep post > postgres 9238 0.2 1.4 8664 5104 ? S Jun 13 3:13 > /its/software/bin/postmaster > postgres 9268 0.1 1.4 8672 5144 ? S Jun 13 3:26 > /its/software/bin/postmaster > postgres 8920 0.1 0.6 2480 2024 pts/0 R 11:08:26 0:00 bash > postgres 9237 0.1 1.4 8664 5104 ? S Jun 13 3:01 > /its/software/bin/postmaster > root 5411 0.0 0.4 1904 1448 ? S Jun 09 0:00 > /software/stow/postfix-2.0.10/libexec/postfix/master > postfix 5413 0.0 0.4 1992 1528 ? S Jun 09 0:00 qmgr -l -t > fifo -u > postfix 8857 0.0 0.4 1960 1552 ? S 11:03:14 0:00 pickup -l > -t fifo -u > postgres 9236 0.0 1.4 8664 5120 ? S Jun 13 3:12 > /its/software/bin/postmaster > postgres 9243 0.0 1.5 8720 5584 ? S Jun 13 3:06 > /its/software/bin/postmaster > postgres 9254 0.0 1.4 8656 5128 ? S Jun 13 3:22 > /its/software/bin/postmaster > postgres 9278 0.0 1.4 8664 5192 ? S Jun 13 3:08 > /its/software/bin/postmaster > postgres 9333 0.0 1.5 8672 5312 ? S Jun 13 3:33 > /its/software/bin/postmaster > postgres 9379 0.0 1.4 8720 5176 ? S Jun 13 3:08 > /its/software/bin/postmaster > postgres 9431 0.0 1.4 8672 5112 ? S Jun 13 3:18 > /its/software/bin/postmaster > postgres 9877 0.0 0.0 2480 ? pts/0 R 11:47:15 0:00 bash > > The file /var/pgsql/data/postmaster.pid claims that the postmaster's PID > is 27215; there's no process with that PID running on my system. > > Whenever I try to create a new connection, it fail: > > [rhodes] data/$ psql template1 > psql: could not connect to server: No such file or directory > Is the server running locally and accepting > connections on Unix domain socket "/tmp/.s.PGSQL.5432"? > [rhodes] data/$ psql -h localhost template1 > psql: could not connect to server: Connection refused > Is the server running on host localhost and accepting > TCP/IP connections on port 5432? > > Any ideas on what I should do now? I'm running 7.3.2 on Solaris 7. > > -Charlie > -- Charles Hornberger Caltech Division of the Humanities and Social Sciences M/C 228-77 Tel (626) 395-3474
Charles Hornberger <charlie@hss.caltech.edu> writes: > Other things I perhaps ought to mention: Trying to stop the postmaster > using pg_ctl fails (unsurprisingly, since pg_ctl relies on > /var/pgsql/data/postmaster.pid, which contains a nonexistent PID); I > haven't tried to start a new postmaster yet, because the old backends > are hanging around. In theory a new postmaster would detect the old backends and refuse to start anyway. I don't trust that interlock unreservedly though. (But please test it while you have the opportunity...) > Nor have I attempted to restart the web server, which might allow the > hanging-round backends to die by closing the old connections it's > holding to them. I'm tempted to go ahead and do this, though I'm not > sure whether I ought to until I've diagnosed what's going on right now. You will need to close all the existing connections before the new postmaster can be started. I'd recommend doing so sooner instead of later, because with no postmaster you aren't getting any checkpoints done, and your WAL space is going to start ballooning. As far as diagnosing the problem goes: if you have a postmaster log file, look to see if the postmaster wrote an ERROR or FATAL message before it exited. (Finding it among all the backend-level messages might be painful though.) Also look in the directory the postmaster was started in to see if there's a core file. Save away any evidence you can find before trying to start a new postmaster. Because the postmaster doesn't actually do much, crashes are pretty unusual. I'm interested in whatever you can find. regards, tom lane
On Tue, 17 Jun 2003, Tom Lane wrote: > Charles Hornberger <charlie@hss.caltech.edu> writes: > > Other things I perhaps ought to mention: Trying to stop the postmaster > > using pg_ctl fails (unsurprisingly, since pg_ctl relies on > > /var/pgsql/data/postmaster.pid, which contains a nonexistent PID); I > > haven't tried to start a new postmaster yet, because the old backends > > are hanging around. > > In theory a new postmaster would detect the old backends and refuse to > start anyway. I don't trust that interlock unreservedly though. (But > please test it while you have the opportunity...) Unfortunately, our system administrator solved this before I got a chance to test more. I don't know how he went about restarting the server, although whatever he did doesn't appear to have hurt anything; would it be interesting to know exactly what steps he took? > > Nor have I attempted to restart the web server, which might allow the > > hanging-round backends to die by closing the old connections it's > > holding to them. I'm tempted to go ahead and do this, though I'm not > > sure whether I ought to until I've diagnosed what's going on right now. > > You will need to close all the existing connections before the new > postmaster can be started. I'd recommend doing so sooner instead of > later, because with no postmaster you aren't getting any checkpoints > done, and your WAL space is going to start ballooning. > > As far as diagnosing the problem goes: if you have a postmaster log > file, look to see if the postmaster wrote an ERROR or FATAL message > before it exited. (Finding it among all the backend-level messages > might be painful though.) Also look in the directory the postmaster > was started in to see if there's a core file. Save away any evidence > you can find before trying to start a new postmaster. Interestingly, there are no messages in the log file, and I can't find a core file -- in short, there's no evidence whatsoever, at least not that I can find. (Though I am probably a pretty rotten detective.) However, I think I know the cause (though I haven't tested to see if this indeed causes the postmaster to die): A few hours before I noticed that the postmaster was dead, one of the sysadmins made a typo that caused an NFS mount to become unavailable -- the very NFS mount that held the postgres executable (all our Solaris boxes share the same executables). So the theory is that the postmaster tried to fork() a process using a non-existent executable, and died as a result. Does this make any sense? -Charlie > Because the postmaster doesn't actually do much, crashes are pretty > unusual. I'm interested in whatever you can find. > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 3: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly >
Charles Hornberger <charlie@hss.caltech.edu> writes: > However, I think I know the cause (though I haven't tested to see if this > indeed causes the postmaster to die): A few hours before I noticed that > the postmaster was dead, one of the sysadmins made a typo that caused an > NFS mount to become unavailable -- the very NFS mount that held the > postgres executable (all our Solaris boxes share the same executables). So > the theory is that the postmaster tried to fork() a process using a > non-existent executable, and died as a result. Does this make any sense? A fork() failure would not cause the postmaster to die (it's not uncommon to see fork() failures due to resource limits, so this path is really pretty well tested). I'm not familiar enough with Solaris to know whether other fatal error conditions might arise in this scenario. (I know HPUX gets rather unhappy if you try to delete an executable file or shared library that's in use by live processes...) But the trouble with that line of thought is that the postmaster and the backends are all the same executable; if the postmaster went south because of loss of the executable file, I'd expect the backends not to survive it either. Unless maybe the backends weren't actually doing anything --- is it possible that the connected clients had issued no commands in the whole episode? regards, tom lane
On Thu, 19 Jun 2003, Tom Lane wrote: > Charles Hornberger <charlie@hss.caltech.edu> writes: > > However, I think I know the cause (though I haven't tested to see if this > > indeed causes the postmaster to die): A few hours before I noticed that > > the postmaster was dead, one of the sysadmins made a typo that caused an > > NFS mount to become unavailable -- the very NFS mount that held the > > postgres executable (all our Solaris boxes share the same executables). So > > the theory is that the postmaster tried to fork() a process using a > > non-existent executable, and died as a result. Does this make any sense? > > A fork() failure would not cause the postmaster to die (it's not > uncommon to see fork() failures due to resource limits, so this path is > really pretty well tested). I'm not familiar enough with Solaris to know > whether other fatal error conditions might arise in this scenario. > (I know HPUX gets rather unhappy if you try to delete an executable file > or shared library that's in use by live processes...) But the trouble > with that line of thought is that the postmaster and the backends are > all the same executable; if the postmaster went south because of loss of > the executable file, I'd expect the backends not to survive it either. > Unless maybe the backends weren't actually doing anything --- is it > possible that the connected clients had issued no commands in the whole > episode? > No, that's not possible; the backends that were up & running were definitely in active use (serving a web site). Sorry I can't be of more help -- I'm not familiar w/ Solaris, either. But if there are other places I should look for error messages, I'm happy to go poking around ... -Charlie
I have four tables in a database which reference all the large objects in the database. I need to backup and restore these 4 tables and the large objects independently of the other tables in the database. pg_dump only allows me to dump the whole database, if I want to dump the large objects. I know that on restore I can create a contents file from the pg_dump file and selectively restore only the tables I need, but this is not suitable because: 1) Restore data is streamed from a remote ftp server. 2) Backup file is large. 3) There is not enough physical storage to copy the restore file to disk prior to restoring the database, so I can create the contents list without reading file twice over ftp link. Does anybody have any suggestions apart from writing my own custom backup/restore? Chris White