Thread: postmaster blues after system restart

postmaster blues after system restart

From
"Thomas F. O'Connell"
Date:
I'm experiencing an unusual situation on a recently reprovisioned
server.

It's a Debian server (testing/unstable) running a 2.6.13.3 kernel.
postgres 8.0.4 was built from source. I'm using a slightly modified
version of the tarball sample init script that includes a function to
start pg_autovacuum automatically.

When I restart, everything seems to come up fine with the exception
that postmaster starts in a state such that it doesn't seem to be
accepting connections (either over UNIX or TCP/IP). As best I can
tell, it is using the init script to start postgres because
pg_autovacuum tries to start, too, and dies shortly after the box
comes up because it, too, cannot connect to postgres.

The very odd thing is that if I use this same init script manually
once the box is up to stop/start postgres, everything comes up roses.

So: is there anything unusual that could happen in terms of order of
operations or pathing during the boot process that I am overlooking?

Also, is there any way to get more status out of a postmaster if one
cannot connect to it? I am able to run pg_ctl status and
pg_controldata, both of which return normally.

--
Thomas F. O'Connell
Co-Founder, Information Architect
Sitening, LLC

Open Source Solutions. Optimized Web Development.

http://www.sitening.com/
110 30th Avenue North, Suite 6
Nashville, TN 37203-6320
615-469-5150
615-469-5151 (fax)


Re: postmaster blues after system restart

From
Jeff Frost
Date:
On Thu, 13 Oct 2005, Thomas F. O'Connell wrote:

> The very odd thing is that if I use this same init script manually once the
> box is up to stop/start postgres, everything comes up roses.

You could post your init script for us to look at, but my first question is
this: Is postgresql actually running after the box gets into multi user mode?
If not, I'd look at paths and permissions.  Does anything show up in the logs
to indicate why it might have failed?  Is the script attempting to start it
as the incorrect user perhaps?

--
Jeff Frost, Owner     <jeff@frostconsultingllc.com>
Frost Consulting, LLC     http://www.frostconsultingllc.com/
Phone: 650-780-7908    FAX: 650-649-1954

Re: postmaster blues after system restart

From
Tom Lane
Date:
"Thomas F. O'Connell" <tfo@sitening.com> writes:
> When I restart, everything seems to come up fine with the exception
> that postmaster starts in a state such that it doesn't seem to be
> accepting connections (either over UNIX or TCP/IP). As best I can
> tell, it is using the init script to start postgres because
> pg_autovacuum tries to start, too, and dies shortly after the box
> comes up because it, too, cannot connect to postgres.

hmmm ... maybe you need to start your DNS server first?

> Also, is there any way to get more status out of a postmaster if one
> cannot connect to it?

One thing I'd look into is exactly what ports it's listening to ---
try lsof and/or netstat for this.  Also, have you looked at the
postmaster log?

            regards, tom lane
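
A minimal sketch of the checks suggested above, assuming the default port (5432) and the default socket directory (/tmp). The function names are made up for illustration and are not part of PostgreSQL. The postmaster's UNIX socket is a file named .s.PGSQL.<port> in the socket directory, so if that file is missing, local connections fail even though the process is running.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not part of PostgreSQL).

# Print the path of the UNIX socket file for a given directory and port.
# Defaults are the assumed stock values: /tmp and 5432.
pg_socket_path() {
    dir=${1:-/tmp}
    port=${2:-5432}
    printf '%s/.s.PGSQL.%s\n' "$dir" "$port"
}

# Succeed only if that path exists and really is a socket.
pg_socket_present() {
    [ -S "$(pg_socket_path "$1" "$2")" ]
}

# Example use on a live box (not run here; output depends on the system):
#   pg_socket_present /tmp 5432 || echo "socket file is missing"
#   netstat -ln | grep 5432     # listening TCP and UNIX sockets
#   lsof -i :5432               # processes holding the TCP port
```

If pg_socket_present fails while pg_ctl status still reports a running postmaster, something removed the socket file after startup.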

Re: postmaster blues after system restart

From
"Thomas F. O'Connell"
Date:
On Oct 13, 2005, at 9:35 PM, Tom Lane wrote:

> "Thomas F. O'Connell" <tfo@sitening.com> writes:
>
>> When I restart, everything seems to come up fine with the exception
>> that postmaster starts in a state such that it doesn't seem to be
>> accepting connections (either over UNIX or TCP/IP). As best I can
>> tell, it is using the init script to start postgres because
>> pg_autovacuum tries to start, too, and dies shortly after the box
>> comes up because it, too, cannot connect to postgres.
>
> hmmm ... maybe you need to start your DNS server first?

I'll check the order in which services are started. But would the DNS
server prevent UNIX socket connections?

>> Also, is there any way to get more status out of a postmaster if one
>> cannot connect to it?
>
> One thing I'd look into is exactly what ports it's listening to ---
> try lsof and/or netstat for this.  Also, have you looked at the
> postmaster log?

I'll take a look at the ports. I was wondering about the best way to
do that.

The postmaster log gives no evidence of anything out of the ordinary.


Re: postmaster blues after system restart

From
"Thomas F. O'Connell"
Date:
The culprit that ended up leading to my original post was an NFS
script that cleans out /tmp. It was running as the last thing in a
given boot level, so it blew away the socket file in /tmp.

Restarting postgres after boot recreated the file, so that explained
the behavior discrepancy I was seeing.


Re: postmaster blues after system restart

From
Scott Marlowe
Date:
On Fri, 2005-10-14 at 15:49, Thomas F. O'Connell wrote:
> The culprit that ended up leading to my original post was an NFS
> script that cleans out /tmp. It was running as the last thing in a
> given boot level, so it blew away the socket file in /tmp.

I'm sure you already know this, but wildly / randomly deleting things in
/tmp is a bad idea...

Re: postmaster blues after system restart

From
"Thomas F. O'Connell"
Date:
On Fri, 2005-10-14 at 15:49, Thomas F. O'Connell wrote:

>> The culprit that ended up leading to my original post was an NFS
>> script that cleans out /tmp. It was running as the last thing in a
>> given boot level, so it blew away the socket file in /tmp.
>
> I'm sure you already know this, but wildly / randomly deleting things in
> /tmp is a bad idea...

Well, here's the story. It was a Debian box, and somehow, mountnfs.sh
wound up at S99 in /etc/rc2.d. I'm not sure how that happened, but it
raises some potential questions:

Converting this to a PostgreSQL on Debian question: is it a good
admin practice to go ahead and set TMPTIME=-1 in /etc/default/rcS on
Debian servers running postgres?

If the answer is "yes", then it becomes a more purely Debian
question: what are the ramifications of not cleaning out /tmp at boot
using initscripts?
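
For reference, a rough sketch of how that setting could be inspected. The meanings assigned to the values are an assumption based on my reading of rcS(5) (negative or "infinite" means never clean, 0 means remove everything at boot, N means remove files older than N days); check the man page on your release. The function name is hypothetical, and the parser only handles a plain unquoted TMPTIME=value line.

```shell
#!/bin/sh
# Hypothetical helper: describe what the TMPTIME line in an rcS-style
# file would mean. The value semantics below are assumed, not verified
# against every Debian release.
tmptime_policy() {
    # Take the last unquoted TMPTIME=... assignment in the file.
    value=$(sed -n 's/^TMPTIME=//p' "$1" | tail -n 1)
    case $value in
        '')                   echo "TMPTIME not set" ;;
        -*|infinite|infinity) echo "never clean /tmp at boot" ;;
        0)                    echo "remove everything in /tmp at boot" ;;
        *)                    echo "remove /tmp files older than $value days" ;;
    esac
}

# Example use (path as mentioned in this thread):
#   tmptime_policy /etc/default/rcS
```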

More widely: are there other non-Debian-based distributions that have
a similar facility for wiping /tmp at boot? Is changing the timing
easy? Is the disabling mechanism the same?

Maybe not having seen too many prior instances of this particular
issue is a good sign that no one is obliterating the contents of /tmp
after postgres has started, but the fact that Debian has tools in
initscripts that seek to clean out /tmp makes me think a mention
of this in the PostgreSQL documentation might be a good idea (perhaps
in 14.6: Post-Installation Setup?).

But I'd be curious to know the perspective of other Debian/PostgreSQL
admins on where they think the issue really lies or even whether
there is a perceived issue.

--
Thomas F. O'Connell


Re: postmaster blues after system restart

From
"Thomas F. O'Connell"
Date:
Well, after digging a bit deeper, I guess the conclusion is that the
best practice is to trust package maintainers and follow their lead
when installing from source.

I notice, for instance, that contrib/start-scripts/linux recommends
S98 for /etc/rc. That wouldn't've prevented the S99 mountnfs.sh
foul-up, but it would've prevented it had mountnfs.sh been at its
(apparent) default of S45.
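
The ordering argument can be checked mechanically. The sketch below assumes SysV-style rc directories in which S* links run in lexical order, and that the postgres link is named S<NN>postgresql; the function name and the link name are illustrative assumptions.

```shell
#!/bin/sh
# Hypothetical helper: list the start links in an rc directory that run
# after a given service. Shell globs expand in sorted order, which is
# the same order in which S* links are assumed to be run.
starts_after() {
    rcdir=$1
    service=$2
    seen=0
    for link in "$rcdir"/S*; do
        [ -e "$link" ] || continue      # skip an unmatched glob
        name=${link##*/}
        if [ "$seen" -eq 1 ]; then
            echo "$name"                # anything here runs after the service
        fi
        case $name in
            S[0-9][0-9]"$service") seen=1 ;;
        esac
    done
}

# Example use (directory as mentioned in this thread):
#   starts_after /etc/rc2.d postgresql
```

Anything this prints runs after postgres has created its socket, so a script in that list that cleans /tmp would reproduce the problem described here.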

And the problem that bit me looks only to have been an issue when
used in conjunction with postgres when built from source on Debian,
as Debian seems to prefer /var/run/postgresql for its socket
directory, thereby avoiding the issue with /tmp altogether.

So either by following the guidance of contrib/start-scripts/linux or
the postgresql package for Debian, the problem is alleviated.

But cleaning out /tmp seems to be a part of institutional practice on
Linux, so it still seems like something about that deserves mention
somewhere in the postgres documentation since the default for
unix_socket_directory is /tmp.


Re: postmaster blues after system restart

From
Tom Lane
Date:
"Thomas F. O'Connell" <tfo@sitening.com> writes:
> But cleaning out /tmp seems to be a part of institutional practice on
> Linux,

I'm unconvinced of that.  A quick test on Fedora Core 4 shows that
random files in /tmp survive reboot, and any moment of thought would
show why users would object to a blanket cleanout policy.

I do see this in a quick grep of Fedora RC files:

/etc/rc.sysinit:rm -f /tmp/.X*-lock /tmp/.lock.* /tmp/.gdm_socket /tmp/.s.PGSQL.*

but this happens well before any of the /etc/rc.d files get to run.

I think what you've got is a rogue, broken mountnfs.sh script.  I don't
even see any such script in my installation ... what is its provenance?

            regards, tom lane

Re: postmaster blues after system restart

From
"Thomas F. O'Connell"
Date:
On Oct 18, 2005, at 12:29 AM, Tom Lane wrote:

> "Thomas F. O'Connell" <tfo@sitening.com> writes:
>
>> But cleaning out /tmp seems to be a part of institutional practice on
>> Linux,
>
> I'm unconvinced of that.  A quick test on Fedora Core 4 shows that
> random files in /tmp survive reboot, and any moment of thought would
> show why users would object to a blanket cleanout policy.
>
> I do see this in a quick grep of Fedora RC files:
>
> /etc/rc.sysinit:rm -f /tmp/.X*-lock /tmp/.lock.* /tmp/.gdm_socket /tmp/.s.PGSQL.*
>
> but this happens well before any of the /etc/rc.d files get to run.
>
> I think what you've got is a rogue, broken mountnfs.sh script.  I don't
> even see any such script in my installation ... what is its provenance?
>
>             regards, tom lane

Apparently, the rogue mountnfs.sh rcS setting came from here:

http://www.ida.liu.se/~TDDI05/labs/NFS%20-%20Network%20File%20Systems.pdf

“There is a bug in the version of UML that we use, that is triggered
by mounting NFS volumes too early in the boot process. In Debian, the
/etc/init.d/mountnfs.sh script is responsible for mounting NFS
directories. You must reconfigure your system to mount NFS volumes at
the latest possible moment. The following commands will do the job:
update-rc.d -f mountnfs.sh remove
update-rc.d mountnfs.sh mountnfs.sh start 99 2 .”

But in looking into expectations for /tmp, I'm also interested in the
interpretation here:

http://www.pathname.com/fhs/pub/fhs-2.3.html#TMPTEMPORARYFILES

Does postgres just use /tmp because it will generally be known to
exist and be writable? Is it generally expected that one should not
actually use the default setting for unix_socket_directory, or is it
more generally expected that /tmp will be a reliable repository for
the socket file?

We've long since worked around this issue now; I'm just wondering
whether anything would better help prevent the situation if
approached from scratch again, whether for us or for other users.
Looking for a missing socket file as a source of being unable to
connect was certainly an interesting takeaway.

In the long run, maybe the population of users who build postgres
from source on Debian boxes, need NFS, and prioritize Google hits
over package and contrib defaults is sufficiently small that this
becomes a non-issue... :P


Re: postmaster blues after system restart

From
Oliver Elphick
Date:
On Tue, 2005-10-18 at 01:29 -0400, Tom Lane wrote:
> "Thomas F. O'Connell" <tfo@sitening.com> writes:
> > But cleaning out /tmp seems to be a part of institutional practice on
> > Linux,
>
> I'm unconvinced of that.  A quick test on Fedora Core 4 shows that
> random files in /tmp survive reboot, and any moment of thought would
> show why users would object to a blanket cleanout policy.
...
> I think what you've got is a rogue, broken mountnfs.sh script.  I don't
> even see any such script in my installation ... what is its provenance?

It's a standard part of Debian, from the package initscripts.

The default policy is to clean out /tmp at boot.  This can be varied by
changing TMPTIME in /etc/default/rcS.

--
Oliver Elphick                                          olly@lfix.co.uk
Isle of Wight                              http://www.lfix.co.uk/oliver
GPG: 1024D/A54310EA  92C8 39E7 280E 3631 3F0E  1EC0 5664 7A2F A543 10EA
                 ========================================
   Do you want to know God?   http://www.lfix.co.uk/knowing_god.html


Re: postmaster blues after system restart

From
Tom Lane
Date:
"Thomas F. O'Connell" <tfo@sitening.com> writes:
> Does postgres just use /tmp because it will generally be known to
> exist and be writable?

I suppose that was the original motivation.  We've had repeated troubles
over the years with using /tmp --- for example, the code now goes to
considerable lengths to update the socket's timestamp periodically so
that it won't be seen as a target by scripts that clean out /tmp entries
more than X minutes old.

It's really not very feasible to change the standard default, though,
because that would break too many clients (it's not very different from
changing the default port number).

There's also the small problem that there is no good alternative choice;
there is no other fixed directory path that can be assumed writable by a
non-privileged postmaster on every Unix.

So I'm afraid we're stuck with /tmp as the default socket location.

In any case, I don't think there's much doubt that unconditionally
cleaning out /tmp *after* beginning to start daemons is simply broken.
That script has no excuse whatever for thinking that there can't be some
other process actually using /tmp at the time it's running.  If you're
going to have a forcible cleanout of /tmp during reboot, it has to
happen before the /etc/rc.d scripts begin to run.

            regards, tom lane
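
To illustrate the kind of age-based cleanup mentioned above (removing /tmp entries more than X minutes old), here is a rough sketch of a selection step that leaves sockets alone regardless of age. It assumes a find that supports -mindepth, -maxdepth and -mmin (GNU find does); the function name is hypothetical, and this is not the logic of any particular distribution's script.

```shell
#!/bin/sh
# Hypothetical helper: list entries directly under a directory that were
# last modified more than N minutes ago, skipping sockets so a running
# daemon's socket file is never a candidate for removal.
stale_tmp_entries() {
    dir=$1
    minutes=$2
    find "$dir" -mindepth 1 -maxdepth 1 -mmin +"$minutes" ! -type s
}

# Example use (prints candidates only; nothing is deleted):
#   stale_tmp_entries /tmp 60
```

Listing candidates rather than deleting them keeps the sketch safe to try; a real cleanup would still need to run before any daemons start, as argued above.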