Thread: postmaster blues after system restart
I'm experiencing an unusual situation on a recently reprovisioned server. It's a Debian server (testing/unstable) running a 2.6.13.3 kernel. postgres 8.0.4 was built from source. I'm using a slightly modified version of the tarball sample init script that includes a function to start pg_autovacuum automatically.

When I restart, everything seems to come up fine with the exception that postmaster starts in a state such that it doesn't seem to be accepting connections (either over UNIX or TCP/IP). As best I can tell, it is using the init script to start postgres because pg_autovacuum tries to start, too, and dies shortly after the box comes up because it, too, cannot connect to postgres.

The very odd thing is that if I use this same init script manually once the box is up to stop/start postgres, everything comes up roses.

So: is there anything unusual that could happen in terms of order of operations or pathing during the boot process that I am overlooking?

Also, is there any way to get more status out of a postmaster if one cannot connect to it? I am able to run pg_ctl status and pg_controldata, both of which return normally.

--
Thomas F. O'Connell
Co-Founder, Information Architect
Sitening, LLC
Open Source Solutions. Optimized Web Development.
http://www.sitening.com/
110 30th Avenue North, Suite 6
Nashville, TN 37203-6320
615-469-5150
615-469-5151 (fax)
On Thu, 13 Oct 2005, Thomas F. O'Connell wrote:

> The very odd thing is that if I use this same init script manually once the
> box is up to stop/start postgres, everything comes up roses.

You could post your init script for us to look at, but my first question is this: Is postgresql actually running after the box gets into multi-user mode? If not, I'd look at paths and permissions. Does anything show up in the logs to indicate why it might have failed? Is the script attempting to start it as the incorrect user perhaps?

--
Jeff Frost, Owner <jeff@frostconsultingllc.com>
Frost Consulting, LLC http://www.frostconsultingllc.com/
Phone: 650-780-7908 FAX: 650-649-1954
"Thomas F. O'Connell" <tfo@sitening.com> writes: > When I restart, everything seems to come up fine with the exception > that postmaster starts in a state such that it doesn't seem to be > accepting connections (either over UNIX or TCP/IP). As best I can > tell, it is using the init script to start postgres because > pg_autovacuum tries to start, too, and dies shortly after the box > comes up because it, too, cannot connect to postgres. hmmm ... maybe you need to start your DNS server first? > Also, is there any way to get more status out of a postmaster if one > cannot connect to it? One thing I'd look into is exactly what ports it's listening to --- try lsof and/or netstat for this. Also, have you looked at the postmaster log? regards, tom lane
On Oct 13, 2005, at 9:35 PM, Tom Lane wrote:

> "Thomas F. O'Connell" <tfo@sitening.com> writes:
>
>> When I restart, everything seems to come up fine with the exception
>> that postmaster starts in a state such that it doesn't seem to be
>> accepting connections (either over UNIX or TCP/IP). As best I can
>> tell, it is using the init script to start postgres because
>> pg_autovacuum tries to start, too, and dies shortly after the box
>> comes up because it, too, cannot connect to postgres.
>
> hmmm ... maybe you need to start your DNS server first?

I'll check the order in which services are started. But would the DNS server prevent UNIX socket connections?

>> Also, is there any way to get more status out of a postmaster if one
>> cannot connect to it?
>
> One thing I'd look into is exactly what ports it's listening to ---
> try lsof and/or netstat for this. Also, have you looked at the
> postmaster log?

I'll take a look at the ports. I was wondering about the best way to do that.

The postmaster log gives no evidence of anything out of the ordinary.

--
Thomas F. O'Connell
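For anyone following along without lsof or netstat to hand, the same question ("is anything listening, and on which transport?") can be approximated with a short Python probe. This is only a sketch: the helper names `can_connect_tcp` and `can_connect_unix` are made up for illustration, and port 5432 and /tmp/.s.PGSQL.5432 are the stock defaults, assumed here rather than taken from the machine under discussion. A successful connect only shows that something is listening, not that it is a healthy postmaster.

```python
import socket


def can_connect_tcp(host, port, timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def can_connect_unix(path, timeout=2.0):
    """Return True if something accepts connections on the Unix socket at path."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(path)
        return True
    except OSError:
        # Covers a missing socket file as well as a refused connection.
        return False
    finally:
        sock.close()


if __name__ == "__main__":
    # Assumed defaults; adjust to the local port and unix_socket_directory.
    print("tcp :", can_connect_tcp("127.0.0.1", 5432))
    print("unix:", can_connect_unix("/tmp/.s.PGSQL.5432"))
```

If the TCP probe succeeds while the Unix-socket probe fails, the postmaster itself is probably fine and the socket file is the thing to look at.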
The culprit that ended up leading to my original post was an NFS script that cleans out /tmp. It was running as the last thing in a given boot level, so it blew away the socket file in /tmp. Restarting postgres after boot recreated the file, so that explained the behavior discrepancy I was seeing.

--
Thomas F. O'Connell
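In hindsight, a direct check for the socket file would have pointed at the problem quickly. The following is a minimal sketch, not an official tool: `socket_file_state` is a hypothetical helper, and the /tmp directory and port 5432 defaults assume that unix_socket_directory and port were left at their stock values.

```python
import os
import stat


def socket_file_state(directory="/tmp", port=5432):
    """Return (path, state), where state is 'socket', 'missing' or 'not-a-socket'."""
    # The postmaster names its Unix socket file .s.PGSQL.<port>.
    path = os.path.join(directory, ".s.PGSQL.%d" % port)
    try:
        mode = os.lstat(path).st_mode
    except FileNotFoundError:
        return path, "missing"
    state = "socket" if stat.S_ISSOCK(mode) else "not-a-socket"
    return path, state
```

If pg_ctl status reports a running postmaster but this returns "missing", something removed the file after startup, which matches what happened here.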
On Fri, 2005-10-14 at 15:49, Thomas F. O'Connell wrote: > The culprit that ended up leading to my original post was an NFS > script that cleans out /tmp. It was running as the last thing in a > given boot level, so it blew away the socket file in /tmp. I'm sure you already know this, but wildly / randomly deleting things in /tmp is a bad idea...
On Fri, 2005-10-14 at 15:49, Thomas F. O'Connell wrote:

>> The culprit that ended up leading to my original post was an NFS
>> script that cleans out /tmp. It was running as the last thing in a
>> given boot level, so it blew away the socket file in /tmp.
>
> I'm sure you already know this, but wildly / randomly deleting
> things in /tmp is a bad idea...

Well, here's the story. It was a Debian box, and somehow, mountnfs.sh wound up at S99 in /etc/rc2.d. I'm not sure how that happened, but it raises some potential questions:

Converting this to a PostgreSQL on Debian question: is it a good admin practice to go ahead and set TMPTIME=-1 in /etc/default/rcS on Debian servers running postgres?

If the answer is "yes", then it becomes a more purely Debian question: what are the ramifications of not cleaning out /tmp at boot using initscripts?

More widely: are there other non-Debian-based distributions that have a similar facility for wiping /tmp at boot? Is changing the timing easy? Is the disabling mechanism the same?

Maybe not having seen too many prior instances of this particular issue is a good sign that no one is obliterating the contents of /tmp after postgres has started, but the fact that Debian has tools in initscripts that seek to clean out /tmp makes me think a mention of this in the PostgreSQL documentation might be a good idea (perhaps in 14.6: Post-Installation Setup?).

But I'd be curious to know the perspective of other Debian/PostgreSQL admins on where they think the issue really lies or even whether there is a perceived issue.

--
Thomas F. O'Connell
Well, after digging a bit deeper, I guess the conclusion is that the best practice is to trust package maintainers and follow their lead when installing from source. I notice, for instance, that contrib/start-scripts/linux recommends S98 for /etc/rc. That wouldn't've prevented the S99 mountnfs.sh foul-up, but it would've prevented it had mountnfs.sh been at its (apparent) default of S45.

And the problem that bit me looks only to have been an issue when used in conjunction with postgres when built from source on Debian, as Debian seems to prefer /var/run/postgresql for its socket directory, thereby avoiding the issue with /tmp altogether.

So either by following the guidance of contrib/start-scripts/linux or the postgresql package for Debian, the problem is alleviated.

But cleaning out /tmp seems to be a part of institutional practice on Linux, so it still seems like something about that deserves mention somewhere in the postgres documentation since the default for unix_socket_directory is /tmp.

--
Thomas F. O'Connell
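To make the ordering argument concrete: the S links in a runlevel directory such as /etc/rc2.d are run in lexical order of their names, so the two-digit sequence number decides who goes first. The sketch below is illustrative only. `start_order` and `runs_after` are hypothetical helpers, the link names (in particular `S98postgresql`) are assumed rather than taken from the machine in question, and everything is treated as living in a single directory for simplicity.

```python
def start_order(link_names):
    """Return the start ('S') links in the order init would run them."""
    return sorted(name for name in link_names if name.startswith("S"))


def runs_after(link_names, later, earlier):
    """True if `later` is started after `earlier` within the same directory."""
    order = start_order(link_names)
    return order.index(later) > order.index(earlier)


if __name__ == "__main__":
    # The foul-up: the /tmp-cleaning script sorts after the postgres start script.
    broken = ["S20ssh", "S98postgresql", "S99mountnfs.sh", "K20foo"]
    # The apparent default: it sorts well before postgres.
    default = ["S20ssh", "S45mountnfs.sh", "S98postgresql"]
    print(runs_after(broken, "S99mountnfs.sh", "S98postgresql"))
    print(runs_after(default, "S45mountnfs.sh", "S98postgresql"))
```

Any script that cleans /tmp needs to sort before the script that starts the postmaster; S99 versus S98 gets that backwards.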
"Thomas F. O'Connell" <tfo@sitening.com> writes: > But cleaning out /tmp seems to be a part of institutional practice on > Linux, I'm unconvinced of that. A quick test on Fedora Core 4 shows that random files in /tmp survive reboot, and any moment of thought would show why users would object to a blanket cleanout policy. I do see this in a quick grep of Fedora RC files: /etc/rc.sysinit:rm -f /tmp/.X*-lock /tmp/.lock.* /tmp/.gdm_socket /tmp/.s.PGSQL.* but this happens well before any of the /etc/rc.d files get to run. I think what you've got is a rogue, broken mountnfs.sh script. I don't even see any such script in my installation ... what is its provenance? regards, tom lane
On Oct 18, 2005, at 12:29 AM, Tom Lane wrote:

> "Thomas F. O'Connell" <tfo@sitening.com> writes:
>
>> But cleaning out /tmp seems to be a part of institutional practice on
>> Linux,
>
> I'm unconvinced of that. A quick test on Fedora Core 4 shows that
> random files in /tmp survive reboot, and any moment of thought would
> show why users would object to a blanket cleanout policy.
>
> I do see this in a quick grep of Fedora RC files:
>
> /etc/rc.sysinit:rm -f /tmp/.X*-lock /tmp/.lock.* /tmp/.gdm_socket /tmp/.s.PGSQL.*
>
> but this happens well before any of the /etc/rc.d files get to run.
>
> I think what you've got is a rogue, broken mountnfs.sh script. I don't
> even see any such script in my installation ... what is its provenance?
>
> regards, tom lane

Apparently, the rogue mountnfs.sh rcS setting came from here:

http://www.ida.liu.se/~TDDI05/labs/NFS%20-%20Network%20File%20Systems.pdf

“There is a bug in the version of UML that we use, that is triggered by mounting NFS volumes too early in the boot process. In Debian, the /etc/init.d/mountnfs.sh script is responsible for mounting NFS directories. You must reconfigure your system to mount NFS volumes at the latest possible moment. The following commands will do the job:

update-rc.d -f mountnfs.sh remove
update-rc.d mountnfs.sh mountnfs.sh start 99 2 .”

But in looking into expectations for /tmp, I'm also interested in the interpretation here:

http://www.pathname.com/fhs/pub/fhs-2.3.html#TMPTEMPORARYFILES

Does postgres just use /tmp because it will generally be known to exist and be writable? Is it generally expected that one should not actually use the default setting for unix_socket_directory, or is it more generally expected that /tmp will be a reliable repository for the socket file?

We've long since worked around this issue now; I'm just wondering whether anything would better help prevent the situation if approached from scratch again, whether for us or for other users. Looking for a missing socket file as a source of being unable to connect was certainly an interesting takeaway.

In the long run, maybe the user space requiring builds of postgres from source on Debian boxes requiring NFS and prioritizing Google hits over package and contrib defaults is sufficiently small that this becomes a non-issue... :P

--
Thomas F. O'Connell
On Tue, 2005-10-18 at 01:29 -0400, Tom Lane wrote:

> "Thomas F. O'Connell" <tfo@sitening.com> writes:
> > But cleaning out /tmp seems to be a part of institutional practice on
> > Linux,
>
> I'm unconvinced of that. A quick test on Fedora Core 4 shows that
> random files in /tmp survive reboot, and any moment of thought would
> show why users would object to a blanket cleanout policy.
...
> I think what you've got is a rogue, broken mountnfs.sh script. I don't
> even see any such script in my installation ... what is its provenance?

It's a standard part of Debian, from the package initscripts. The default policy is to clean out /tmp at boot. This can be varied by changing TMPTIME in /etc/default/rcS.

--
Oliver Elphick olly@lfix.co.uk
Isle of Wight http://www.lfix.co.uk/oliver
GPG: 1024D/A54310EA 92C8 39E7 280E 3631 3F0E 1EC0 5664 7A2F A543 10EA
========================================
Do you want to know God? http://www.lfix.co.uk/knowing_god.html
"Thomas F. O'Connell" <tfo@sitening.com> writes: > Does postgres just use /tmp because it will generally be known to > exist and be writable? I suppose that was the original motivation. We've had repeated troubles over the years with using /tmp --- for example, the code now goes to considerable lengths to update the socket's timestamp periodically so that it won't be seen as a target by scripts that clean out /tmp entries more than X minutes old. It's really not very feasible to change the standard default, though, because that would break too many clients (it's not very different from changing the default port number). There's also the small problem that there is no good alternative choice; there is no other fixed directory path that can be assumed writable by a non-privileged postmaster on every Unix. So I'm afraid we're stuck with /tmp as the default socket location. In any case, I don't think there's much doubt that unconditionally cleaning out /tmp *after* beginning to start daemons is simply broken. That script has no excuse whatever for thinking that there can't be some other process actually using /tmp at the time it's running. If you're going to have a forcible cleanout of /tmp during reboot, it has to happen before the /etc/rc.d scripts begin to run. regards, tom lane