Thread: Warning: Don't delete those /tmp/.PGSQL.* files

Warning: Don't delete those /tmp/.PGSQL.* files

From
"Joel Burton"
Date:
This is part question, part short, sad tale.

Working on my database, I had a view that would lock up the
machine (eats all available memory, soon goes belly-up.) Turned out
to be a recursive view: view A asked a question of view B that
asked view A. [is it possible for pgsql to detect this? I worry about
my users doing this.] [and, yes, I should use kernel-level controls to
make sure that the postmaster process can't use all available
resources; but hey, it's a development machine. ]

Anyway, as I was tracking down this problem, I couldn't restart
PostgreSQL if the machine had crashed and I had a /tmp/.PGSQL.*
file in the temp directory; it assumed that the socket was in use.
So, I began restarting pgsql w/a  line like

rm -f /tmp/.PGSQL.* && postmaster -i >log 2>log &

Which works great. Except that I *kept* using this for two weeks
after the view problem (damn that bash up-arrow laziness!), and
yesterday, used it to restart PostgreSQL except (oops!) it was
already running.

Results: no database at all. All classes (tables/views/etc) returned
0 records (meaning that no tables showed up in psql's \d, since
pg_class returned nothing.)

I don't know enough about why -- the /tmp files appear to have a
length of 0, but pgsql seems to care a great deal about them.

[ I did have a very fresh pg_dumpall file--thank you, anacron--so I
lost about 30 minutes worth of work, but it would have been
everything if I never backed up. ]

My advice:

1) Use pg_dumpall.
2) Don't delete those /tmp files until you're *sure* you're out of Pg

Anyone know what *happened* and *why*? Was there anything I
could have done?

Thanks!

[ I do read these lists, but always appreciate a cc on responses so I
don't accidentally miss them. TIA. ]
--
Joel Burton, Director of Information Systems -*- jburton@scw.org
Support Center of Washington (www.scw.org)

Re: Warning: Don't delete those /tmp/.PGSQL.* files

From
Tom Lane
Date:
"Joel Burton" <jburton@scw.org> writes:
> Working on my database, I had a view that would lock up the
> machine (eats all available memory, soon goes belly-up.) Turned out
> to be a recursive view: view A asked a question of view B that
> asked view A. [is it possible for pgsql to detect this?

It should have been detected --- there is a check in the rewriter that's
supposed to error out after ten recursive rewrite calls.  Maybe that
logic is broken, or misses certain cases.  Could you exhibit the views
that caused this behavior for you?

> So, I began restarting pgsql w/a  line like

> rm -f /tmp/.PGSQL.* && postmaster -i >log 2>log &

> Which works great. Except that I *kept* using this for two weeks
> after the view problem (damn that bash up-arrow laziness!), and
> yesterday, used it to restart PostgreSQL except (oops!) it was
> already running.

> Results: no database at all. All classes (tables/views/etc) returned
> 0 records (meaning that no tables showed up in psql's \d, since
> pg_class returned nothing.)

Ugh.  The reason that removing the socket file allowed a second
postmaster to start up is that we use an advisory lock on the socket
file as the interlock that prevents two PMs on the same port number.
Remove the socket file, poof no interlock.

*However*, there is a second line of defense to prevent two postmasters
in the same directory, and I don't understand why that didn't trigger.
Unless you are running a version old enough to not have it.  What PG
version is this, anyway?

Assuming you got past both interlocks, the second postmaster would have
reinitialized Postgres' shared memory block for that database, which
would have been a Bad Thing(tm) ... but it would not have led to any
immediate damage to your on-disk files, AFAICS.  Was the database still
hosed after you stopped both postmasters and started a fresh one?  (Did
you even try that?)

This story does indicate that we need a less fragile interlock against
starting two postmasters on one database.  I have to admit that it
hadn't occurred to me that you could break the port-number interlock
so easily as that :-(.  But obviously you can, so we need a different
way of representing the interlock.  Hackers, any thoughts?

Note: I've narrowed followups to just pghackers, since that seems like
the right forum for discussing a better interlock mechanism.

            regards, tom lane

Re: Warning: Don't delete those /tmp/.PGSQL.* files

From
GH
Date:
On Sat, Nov 25, 2000 at 05:35:13PM -0500, some SMTP stream spewed forth:
*snip*
>
> > So, I began restarting pgsql w/a  line like
>
> > rm -f /tmp/.PGSQL.* && postmaster -i >log 2>log &
>
> > Which works great. Except that I *kept* using this for two weeks
> > after the view problem (damn that bash up-arrow laziness!), and
> > yesterday, used it to restart PostgreSQL except (oops!) it was
> > already running.
>
> > Results: no database at all. All classes (tables/views/etc) returned
> > 0 records (meaning that no tables showed up in psql's \d, since
> > pg_class returned nothing.)
>

*snip Tom's reply*
I have a situation vaguely related to this.
At some point Postgres was not shut down properly and now everytime at
startup I the error log gets something like:

---------
root% tail -f errlog
Waiting for postmaster starting up...DEBUG:  Data Base System is starting
up at Sat Nov 25 16:53:10 2000
DEBUG:  Data Base System was interrupted being in production at Sat Nov
25 16:35:27 2000
DEBUG:  Data Base System is in production state at Sat Nov 25 16:53:10
2000
FATAL 1:  ReleaseLruFile: No open files available to be closed
............................................................pg_ctl:
postmaster does not start up
---------

After that, all postgres processes die and the cycle begins again on
subsequent attempts to start postgres.
At one point I would receive some "Too many open files" (or similar)
error with postgres holding more than 750 file descriptors -- almost
entirely consisting of socket streams.
What is the significance of "ReleaseLruFile" and how can I repair this?

This is using FreeBSD 4.1-RELEASE and Postgres 7.0.2.

Thanks

gh


Re: Warning: Don't delete those /tmp/.PGSQL.* files

From
Tom Lane
Date:
GH <grasshacker@over-yonder.net> writes:
> FATAL 1:  ReleaseLruFile: No open files available to be closed
> ............................................................pg_ctl:
> postmaster does not start up

> After that, all postgres processes die and the cycle begins again on
> subsequent attempts to start postgres.
> At one point I would receive some "Too many open files" (or similar)
> error with postgres holding more than 750 file descriptors -- almost
> entirely consisting of socket streams.
> What is the significance of "ReleaseLruFile" and how can I repair this?

> This is using FreeBSD 4.1-RELEASE and Postgres 7.0.2.

7.0.3 will probably help --- the message is coming out of some
inappropriate error recovery code that we fixed in 7.0.3.

The underlying problem, however, is that you are running out of kernel
file table slots (ENFILE or EMFILE error return from open()).  Not
enough info here to tell why that's happening.

            regards, tom lane

Re: Warning: Don't delete those /tmp/.PGSQL.* files

From
GH
Date:
On Sat, Nov 25, 2000 at 06:40:12PM -0500, some SMTP stream spewed forth:
> GH <grasshacker@over-yonder.net> writes:
> > FATAL 1:  ReleaseLruFile: No open files available to be closed
> > ............................................................pg_ctl:
> > postmaster does not start up
>
> > After that, all postgres processes die and the cycle begins again on
> > subsequent attempts to start postgres.
> > At one point I would receive some "Too many open files" (or similar)
> > error with postgres holding more than 750 file descriptors -- almost
> > entirely consisting of socket streams.
> > What is the significance of "ReleaseLruFile" and how can I repair this?
>
> > This is using FreeBSD 4.1-RELEASE and Postgres 7.0.2.
>
> 7.0.3 will probably help --- the message is coming out of some
> inappropriate error recovery code that we fixed in 7.0.3.
>
> The underlying problem, however, is that you are running out of kernel
> file table slots (ENFILE or EMFILE error return from open()).  Not
> enough info here to tell why that's happening.

Well, through some research of my own I have discovered that the file
issue is somehow related to our startup script:
/usr/local/etc/rc.d/pgsql.sh.
I am not sure how familiar you are with FreeBSD's startup process, but
it will suffice to say that this script expects one of three arguments:
start, stop, or status -- apparently corresponding to the options of
pg_ctl.

When I start the postgres server manually, it runs relatively fine.
i.e.
# su -l pgsql /usr/local/pgsql/bin/pg_ctl -w start > /usr/local/pgsql/errlog 2>&1 &

Here is pgsql.sh:

#!/bin/sh

# $FreeBSD: ports/databases/postgresql7/files/pgsql.sh.tmpl,v 1.8
2000/05/25 09:35:25 andreas Exp $
#
# For postmaster startup options, edit $PGDATA/postmaster.opts.default
# Preinstalled options are -i -o "-F"

case $1 in
start)
    [ -d /usr/local/pgsql/lib ] && /sbin/ldconfig -m /usr/local/pgsql/lib
# Clean up by Matt
# This is a really bad idea, unless we are absolutely certain that there
# are no postgres processes running or that we feel like restoring
# from a recent backup. ;-) gh
    rm -f /tmp/.s.PGSQL*
    [ -x /usr/local/pgsql/bin/pg_ctl ] && {
        su -l pgsql  \
            /usr/local/pgsql/bin/pg_ctl -w start >
/usr/local/pgsql/errlog 2>&1 &
#           /usr/local/pgsql/bin/pg_ctl -w start -o "-B 64 -N 32" start >
/usr/local/pgsql/errlog 2>&1 &
        echo -n ' pgsql'
    }
    ;;

stop)
    [ -x /usr/local/pgsql/bin/pg_ctl ] && {
        su -l pgsql -c 'exec /usr/local/pgsql/bin/pg_ctl -w -m fast stop'
    }
    ;;

status)
    [ -x /usr/local/pgsql/bin/pg_ctl ] && {
        su -l pgsql -c 'exec /usr/local/pgsql/bin/pg_ctl status'
    }
    ;;

*)
    echo "usage: `basename $0` {start|stop|status}" >&2
    exit 64
    ;;
esac

EOF

running this script with "start" causes the postgres server to start,
run out of files, and then shutdown. Postgres is useable until it runs
out of files and shuts down.


Thanks.

gh

>
>             regards, tom lane

Re: Warning: Don't delete those /tmp/.PGSQL.* files

From
"Joel Burton"
Date:

On 25 Nov 2000, at 17:35, Tom Lane wrote:

> > So, I began restarting pgsql w/a  line like
>
> > rm -f /tmp/.PGSQL.* && postmaster -i >log 2>log &
>
> > Which works great. Except that I *kept* using this for two weeks
> > after the view problem (damn that bash up-arrow laziness!), and
> > yesterday, used it to restart PostgreSQL except (oops!) it was
> > already running.
>
> > Results: no database at all. All classes (tables/views/etc) returned
> > 0 records (meaning that no tables showed up in psql's \d, since
> > pg_class returned nothing.)
>
> Ugh.  The reason that removing the socket file allowed a second
> postmaster to start up is that we use an advisory lock on the socket
> file as the interlock that prevents two PMs on the same port number.
> Remove the socket file, poof no interlock.
>
> *However*, there is a second line of defense to prevent two
> postmasters in the same directory, and I don't understand why that
> didn't trigger. Unless you are running a version old enough to not
> have it.  What PG version is this, anyway?

7.1devel, from about 1 week ago.

> Assuming you got past both interlocks, the second postmaster would
> have reinitialized Postgres' shared memory block for that database,
> which would have been a Bad Thing(tm) ... but it would not have led to
> any immediate damage to your on-disk files, AFAICS.  Was the database
> still hosed after you stopped both postmasters and started a fresh
> one?  (Did you even try that?)

Yes, I stopped both, rebooted machine, restarted postmaster.
Rebooted machine, used just postgres, tried to vacuum, tried to
dump, etc. Always the same story.


--
Joel Burton, Director of Information Systems -*- jburton@scw.org
Support Center of Washington (www.scw.org)

Re: Warning: Don't delete those /tmp/.PGSQL.* files

From
Tom Lane
Date:
GH <grasshacker@over-yonder.net> writes:
> running this script with "start" causes the postgres server to start,
> run out of files, and then shutdown. Postgres is useable until it runs
> out of files and shuts down.

Continuing on that line of thought --- it seems like this must be an
indication of a file-descriptor leak somewhere.  That is, some bit of
code forgets to close a file it opened.  Cycle through that bit of code
enough times, and the kernel stops being willing to give you more file
descriptors.

If this is correct, we could probably identify the leak by knowing what
file is being opened multiple times.  Can you run 'lsof' or some similar
tool to check for duplicate descriptors being held open by the
postmaster?

I recall that we have fixed one or two leaks of this kind in the past,
but I don't recall details, nor which versions the fixes first appeared
in.

            regards, tom lane