Thread: problem with maintenance script and missing pg_clog files with pg 7.2.1

problem with maintenance script and missing pg_clog files with pg 7.2.1

From
Ben Roberts
Date:
Hi,

We are having problems with an installation of Postgres 7.2.1 on Debian
Linux 3.0 (woody).

The scheduled maintenance script throws the following error:

FATAL 2:  open of /2/var/lib/postgres/data/pg_clog/0002 failed: No such file
or directory
server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
connection to server was lost

We have run a disk check on the relevant partition and it appears to be
fine.

We have several postgres instances on other machines and we have never seen
this problem before. Does anyone have any ideas what the problem is/ what
the right way of dealing with this is?

Thanks for your help.

Ben

--
Ben Roberts
Sysadmin Team
Runtime Collective Ltd
http://www.runtime-collective.com

Re: problem with maintenance script and missing pg_clog files with pg 7.2.1

From
Martijn van Oosterhout
Date:
On Tue, Sep 24, 2002 at 03:49:48PM +0100, Ben Roberts wrote:
> Hi,
>
> We are having problems with an installation of Postgres 7.2.1 on Debian
> Linux 3.0 (woody).
>
> The scheduled maintenance script throws the following error:
>
> FATAL 2:  open of /2/var/lib/postgres/data/pg_clog/0002 failed: No such file
> or directory
> server closed the connection unexpectedly
>     This probably means the server terminated abnormally
>     before or while processing the request.
> connection to server was lost

Do you have a /2/var/lib/postgres/data/pg_clog/ directory?

--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> There are 10 kinds of people in the world, those that can do binary
> arithmetic and those that can't.

Re: problem with maintenance script and missing pg_clog files with pg 7.2.1

From
Tom Lane
Date:
Ben Roberts <benroberts@runtime-collective.com> writes:
> The scheduled maintenance script throws the following error:

> FATAL 2:  open of /2/var/lib/postgres/data/pg_clog/0002 failed: No such file
> or directory

Could we see the results of ls -l  /2/var/lib/postgres/data/pg_clog ?

It would seem that you have a tuple somewhere with a bogus transaction
number in it, which leads the tuple validity check code to try to access
a nonexistent segment of the transaction commit log (clog).  It's hard
to say more without an idea of what reasonable transaction numbers are
in your installation; so the names and sizes of the files in the clog
directory are interesting ...

            regards, tom lane

Re: problem with maintenance script and missing pg_clog files with pg 7.2.1

From
Ben Roberts
Date:
Tom, Martijn,

Thanks very much for your replies.

On Wed, Sep 25, 2002 at 12:55:00AM -0400, Tom Lane wrote:
> Ben Roberts <benroberts@runtime-collective.com> writes:
> > The scheduled maintenance script throws the following error:
>
> > FATAL 2:  open of /2/var/lib/postgres/data/pg_clog/0002 failed: No such file
> > or directory
>
> Could we see the results of ls -l  /2/var/lib/postgres/data/pg_clog ?

Yes, sorry. I should have included that with my original mail:

ns1:/2/var/lib/postgres# ls -l /2/var/lib/postgres/data/pg_clog
total 588
-rw-------    1 postgres postgres   262144 Sep 22 01:05 0004
-rw-------    1 postgres postgres   262144 Sep 25 01:05 0005
-rw-------    1 postgres postgres    65536 Sep 25 04:40 0006

Having done some investigation, it now appears to me as if this problem
could possibly stem from having moved the postgres data directory from
/var/lib/postgres to /2/var/lib/postgres. I say this because an ls in the
old directory doesn't look entirely healthy to me:

ns1:/2/var/lib/postgres# ls -l /var/lib/postgres
total 25
-rw-r--r--    1 postgres postgres     1768 Sep  6 04:39 >g?L&??l?X0[r??R???\
drwx------    6 postgres postgres     1024 Aug 22 09:29 data.old
-rw-r--r--    1 postgres postgres    16638 Sep  6 04:28 tfinance.db.lst
-rw-r--r--    1 postgres postgres     3371 Sep  6 04:39 ?????h???ex??+?Y0e??

It looks as if postgres is using both directories in some way.

thanks for your help,

Ben

Re: problem with maintenance script and missing pg_clog files with pg 7.2.1

From
Ben Roberts
Date:
Ben Roberts benroberts@runtime-collective.com said:

Tom, Martijn,

Oh dear, I feel this problem is related to us having not correctly
implemented the move to a new data directory.

I think the problem here is in postmaster.conf:

ns1:/etc/postgresql# cat postmaster.conf
# /etc/postgresql/postmaster.conf
#
# Copyright (c) Oliver Elphick 1997, 2001
# Part of the Debian package, postgresql. The Debian packaging is
# licensed under GPL v.2

<snip>

POSTGRES_HOME=`getent passwd postgres | awk -F: '{print $6}' | head -1`
if [ -z "$POSTGRES_HOME" ]
then
    POSTGRES_HOME=/var/lib/postgres
fi

# Where to find the PostgreSQL database files, including those that
# define PostgresSQL users and permissions.
POSTGRES_DATA=/2/var/lib/postgres/data

<snip>

As you can see postmaster looks for the value of POSTGRES_HOME directory in
/etc/passwd and if not found it defaults to the value of
/var/lib/postgres. Unfortunately the relevant entry in /etc/passwd is:

postgres:x:31:32:postgres:/var/lib/postgres:/bin/sh

                          ^^^^^^^^^^^^^^^^^
Instead of /2/var/lib/postgres.

ho hum....we seem to have forgotten to change the postgres home directory
when we made the move to the new disk. Possibly postgres is using the value
of POSTGRES_HOME for some things and POSTGRES_DATA for others, causing the
problems we are seeing?

It's not 100% certain that this is the cause of the problem, but it looks
very much like the smoking gun to me.

Or is just 'OK' to have separate directories for POSTGRES_HOME and
POSTGRES_DATA?

thanks,

Ben

Re: problem with maintenance script and missing pg_clog files with pg 7.2.1

From
Martijn van Oosterhout
Date:
On Wed, Sep 25, 2002 at 11:18:59AM +0100, Ben Roberts wrote:
> Ben Roberts benroberts@runtime-collective.com said:
>
> Tom, Martijn,
>
> Oh dear, I feel this problem is related to us having not correctly
> implemented the move to a new data directory.
>
> I think the problem here is in postmaster.conf:

<snip>

To be honest, I don't know. What I usually do is create a symlink from the
old location to the new location so that it looks like nothing has changed.
The you don't need to change any config files at all.

So something like:

ln -s /2/var/lib/postgres/data /var/lib/postgres/

OTOH, I can't see why you'd be having any problems. If the server can start,
it should have access to everything it needs...

--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> There are 10 kinds of people in the world, those that can do binary
> arithmetic and those that can't.

Re: problem with maintenance script and missing pg_clog

From
Oliver Elphick
Date:
On Wed, 2002-09-25 at 11:18, Ben Roberts wrote:
> As you can see postmaster looks for the value of POSTGRES_HOME directory in
> /etc/passwd and if not found it defaults to the value of
> /var/lib/postgres. Unfortunately the relevant entry in /etc/passwd is:
>
> postgres:x:31:32:postgres:/var/lib/postgres:/bin/sh
>
>                           ^^^^^^^^^^^^^^^^^
> Instead of /2/var/lib/postgres.
>
> ho hum....we seem to have forgotten to change the postgres home directory
> when we made the move to the new disk. Possibly postgres is using the value
> of POSTGRES_HOME for some things and POSTGRES_DATA for others, causing the
> problems we are seeing?
>
> It's not 100% certain that this is the cause of the problem, but it looks
> very much like the smoking gun to me.
>
> Or is just 'OK' to have separate directories for POSTGRES_HOME and
> POSTGRES_DATA?

Yes.

As delivered, the package (postgresql-client, on which postgresql
depends) installs /etc/postgresql/postgresql.env and sources that in the
startup script for postgres (one of ~postgres/.profile,
~postgres/.bash_profile, ~postgres/.login).  postgresql.env is used to
set up the environment for anyone using psql as well as for postgres.
It says:

    [ -r /etc/postgresql/postmaster.conf ] &&
            . /etc/postgresql/postmaster.conf

    PGDATA=${POSTGRES_DATA:-/var/lib/postgres/data}
    ...

POSTGRES_HOME is in fact only used by the install scripts and as the
directory to cd to before running pg_ctl to start the postmaster.  So
unless you have something in ~postgres/ that is giving PGDATA a
different value, I can't see any problem here.  Even if that were the
case, the postmaster would be started with one value or the other; I
don't see how it could use both at once.

--
Oliver Elphick                                Oliver.Elphick@lfix.co.uk
Isle of Wight, UK
http://www.lfix.co.uk/oliver
GPG: 1024D/3E1D0C1C: CA12 09E0 E8D5 8870 5839  932A 614D 4C34 3E1D 0C1C
                 ========================================
     "Who shall separate us from the love of Christ? shall
      tribulation, or distress, or persecution, or famine,
      or nakedness, or peril, or sword?...But in all these
      things we overwhelmingly conquer through Him who loved
      us."               Romans 8:35,37


Re: problem with maintenance script and missing pg_clog files with pg 7.2.1

From
Ben Roberts
Date:
On Wed, Sep 25, 2002 at 02:12:28PM +0100, Oliver Elphick wrote:

> On Wed, 2002-09-25 at 11:18, Ben Roberts wrote:
> > Or is just 'OK' to have separate directories for POSTGRES_HOME and
> > POSTGRES_DATA?
>
> Yes.
>
> As delivered, the package (postgresql-client, on which postgresql
> depends) installs /etc/postgresql/postgresql.env and sources that in the
> startup script for postgres (one of ~postgres/.profile,
> ~postgres/.bash_profile, ~postgres/.login).  postgresql.env is used to
> set up the environment for anyone using psql as well as for postgres.
> It says:
>
>     [ -r /etc/postgresql/postmaster.conf ] &&
>             . /etc/postgresql/postmaster.conf
>
>     PGDATA=${POSTGRES_DATA:-/var/lib/postgres/data}
>     ...
>
> POSTGRES_HOME is in fact only used by the install scripts and as the
> directory to cd to before running pg_ctl to start the postmaster.  So
> unless you have something in ~postgres/ that is giving PGDATA a
> different value, I can't see any problem here.  Even if that were the
> case, the postmaster would be started with one value or the other; I
> don't see how it could use both at once.

Thanks Oliver, that's very useful. Unfortunately in that case I am back to
square #1 in terms of my understanding of this problem.

I can't find anything else that is strange or unusaul about our set up, with
the exception that the DB has been set up to use UNICODE, but that shouldn't
be a problem, should it?

thanks,

Ben


Re: problem with maintenance script and missing pg_clog files with pg 7.2.1

From
Tom Lane
Date:
Oliver Elphick <olly@lfix.co.uk> writes:
> POSTGRES_HOME is in fact only used by the install scripts and as the
> directory to cd to before running pg_ctl to start the postmaster.  So
> unless you have something in ~postgres/ that is giving PGDATA a
> different value, I can't see any problem here.

Right, there really isn't any (necessary) connection between postgres'
home directory and $PGDATA.  The junk in the old directory seems odd,
but I doubt it's related to your troubles (especially since it
apparently is one level up from your old data directory).

Given this:

> > FATAL 2:  open of /2/var/lib/postgres/data/pg_clog/0002 failed: No such file
> > or directory

ns1:/2/var/lib/postgres# ls -l /2/var/lib/postgres/data/pg_clog
total 588
-rw-------    1 postgres postgres   262144 Sep 22 01:05 0004
-rw-------    1 postgres postgres   262144 Sep 25 01:05 0005
-rw-------    1 postgres postgres    65536 Sep 25 04:40 0006

I now have to revise my opinion somewhat.  The reports we'd been seeing
that seemed to be data corruption mostly showed attempted accesses to
clog segments with numbers far beyond the active segment(s).  Here
you've got a reference to an old segment that's already been removed.
This looks like it could be a software bug --- ie, failure to be sure
we'd nailed down the commit status of all old tuples before removing a
past clog segment.  We found one or two bugs before 7.2 release that
could lead to this symptom, but I'd hoped we'd caught them all.  Maybe
not.

I would be interested to look closely at the data and see if anything
can be learned.  Would you be willing to consider letting me have access
to your machine & database to study the problem?

            regards, tom lane