Thread: problem with maintenance script and missing pg_clog files with pg 7.2.1
Hi, We are having problems with an installation of Postgres 7.2.1 on Debian Linux 3.0 (woody). The scheduled maintenance script throws the following error: FATAL 2: open of /2/var/lib/postgres/data/pg_clog/0002 failed: No such file or directory server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. connection to server was lost We have run a disk check on the relevant partition and it appears to be fine. We have several postgres instances on other machines and we have never seen this problem before. Does anyone have any ideas what the problem is/ what the right way of dealing with this is? Thanks for your help. Ben -- Ben Roberts Sysadmin Team Runtime Collective Ltd http://www.runtime-collective.com
Re: problem with maintenance script and missing pg_clog files with pg 7.2.1
From
Martijn van Oosterhout
Date:
On Tue, Sep 24, 2002 at 03:49:48PM +0100, Ben Roberts wrote: > Hi, > > We are having problems with an installation of Postgres 7.2.1 on Debian > Linux 3.0 (woody). > > The scheduled maintenance script throws the following error: > > FATAL 2: open of /2/var/lib/postgres/data/pg_clog/0002 failed: No such file > or directory > server closed the connection unexpectedly > This probably means the server terminated abnormally > before or while processing the request. > connection to server was lost Do you have a /2/var/lib/postgres/data/pg_clog/ directory? -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > There are 10 kinds of people in the world, those that can do binary > arithmetic and those that can't.
Ben Roberts <benroberts@runtime-collective.com> writes: > The scheduled maintenance script throws the following error: > FATAL 2: open of /2/var/lib/postgres/data/pg_clog/0002 failed: No such file > or directory Could we see the results of ls -l /2/var/lib/postgres/data/pg_clog ? It would seem that you have a tuple somewhere with a bogus transaction number in it, which leads the tuple validity check code to try to access a nonexistent segment of the transaction commit log (clog). It's hard to say more without an idea of what reasonable transaction numbers are in your installation; so the names and sizes of the files in the clog directory are interesting ... regards, tom lane
Tom, Martijn, Thanks very much for your replies. On Wed, Sep 25, 2002 at 12:55:00AM -0400, Tom Lane wrote: > Ben Roberts <benroberts@runtime-collective.com> writes: > > The scheduled maintenance script throws the following error: > > > FATAL 2: open of /2/var/lib/postgres/data/pg_clog/0002 failed: No such file > > or directory > > Could we see the results of ls -l /2/var/lib/postgres/data/pg_clog ? Yes, sorry. I should have included that with my original mail: ns1:/2/var/lib/postgres# ls -l /2/var/lib/postgres/data/pg_clog total 588 -rw------- 1 postgres postgres 262144 Sep 22 01:05 0004 -rw------- 1 postgres postgres 262144 Sep 25 01:05 0005 -rw------- 1 postgres postgres 65536 Sep 25 04:40 0006 Having done some investigation, it now appears to me as if this problem could possibly stem from having moved the postgres data directory from /var/lib/postgres to /2/var/lib/postgres. I say this because an ls in the old directory doesn't look entirely healthy to me: ns1:/2/var/lib/postgres# ls -l /var/lib/postgres total 25 -rw-r--r-- 1 postgres postgres 1768 Sep 6 04:39 >g?L&??l?X0[r??R???\ drwx------ 6 postgres postgres 1024 Aug 22 09:29 data.old -rw-r--r-- 1 postgres postgres 16638 Sep 6 04:28 tfinance.db.lst -rw-r--r-- 1 postgres postgres 3371 Sep 6 04:39 ?????h???ex??+?Y0e?? It looks as if postgres is using both directories in some way. thanks for your help, Ben
Ben Roberts benroberts@runtime-collective.com said: Tom, Martijn, Oh dear, I feel this problem is related to us having not correctly implemented the move to a new data directory. I think the problem here is in postmaster.conf: ns1:/etc/postgresql# cat postmaster.conf # /etc/postgresql/postmaster.conf # # Copyright (c) Oliver Elphick 1997, 2001 # Part of the Debian package, postgresql. The Debian packaging is # licensed under GPL v.2 <snip> POSTGRES_HOME=`getent passwd postgres | awk -F: '{print $6}' | head -1` if [ -z "$POSTGRES_HOME" ] then POSTGRES_HOME=/var/lib/postgres fi # Where to find the PostgreSQL database files, including those that # define PostgresSQL users and permissions. POSTGRES_DATA=/2/var/lib/postgres/data <snip> As you can see postmaster looks for the value of POSTGRES_HOME directory in /etc/passwd and if not found it defaults to the value of /var/lib/postgres. Unfortunately the relevant entry in /etc/passwd is: postgres:x:31:32:postgres:/var/lib/postgres:/bin/sh ^^^^^^^^^^^^^^^^^ Instead of /2/var/lib/postgres. ho hum....we seem to have forgotten to change the postgres home directory when we made the move to the new disk. Possibly postgres is using the value of POSTGRES_HOME for some things and POSTGRES_DATA for others, causing the problems we are seeing? It's not 100% certain that this is the cause of the problem, but it looks very much like the smoking gun to me. Or is just 'OK' to have separate directories for POSTGRES_HOME and POSTGRES_DATA? thanks, Ben
Re: problem with maintenance script and missing pg_clog files with pg 7.2.1
From
Martijn van Oosterhout
Date:
On Wed, Sep 25, 2002 at 11:18:59AM +0100, Ben Roberts wrote: > Ben Roberts benroberts@runtime-collective.com said: > > Tom, Martijn, > > Oh dear, I feel this problem is related to us having not correctly > implemented the move to a new data directory. > > I think the problem here is in postmaster.conf: <snip> To be honest, I don't know. What I usually do is create a symlink from the old location to the new location so that it looks like nothing has changed. The you don't need to change any config files at all. So something like: ln -s /2/var/lib/postgres/data /var/lib/postgres/ OTOH, I can't see why you'd be having any problems. If the server can start, it should have access to everything it needs... -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > There are 10 kinds of people in the world, those that can do binary > arithmetic and those that can't.
On Wed, 2002-09-25 at 11:18, Ben Roberts wrote: > As you can see postmaster looks for the value of POSTGRES_HOME directory in > /etc/passwd and if not found it defaults to the value of > /var/lib/postgres. Unfortunately the relevant entry in /etc/passwd is: > > postgres:x:31:32:postgres:/var/lib/postgres:/bin/sh > > ^^^^^^^^^^^^^^^^^ > Instead of /2/var/lib/postgres. > > ho hum....we seem to have forgotten to change the postgres home directory > when we made the move to the new disk. Possibly postgres is using the value > of POSTGRES_HOME for some things and POSTGRES_DATA for others, causing the > problems we are seeing? > > It's not 100% certain that this is the cause of the problem, but it looks > very much like the smoking gun to me. > > Or is just 'OK' to have separate directories for POSTGRES_HOME and > POSTGRES_DATA? Yes. As delivered, the package (postgresql-client, on which postgresql depends) installs /etc/postgresql/postgresql.env and sources that in the startup script for postgres (one of ~postgres/.profile, ~postgres/.bash_profile, ~postgres/.login). postgresql.env is used to set up the environment for anyone using psql as well as for postgres. It says: [ -r /etc/postgresql/postmaster.conf ] && . /etc/postgresql/postmaster.conf PGDATA=${POSTGRES_DATA:-/var/lib/postgres/data} ... POSTGRES_HOME is in fact only used by the install scripts and as the directory to cd to before running pg_ctl to start the postmaster. So unless you have something in ~postgres/ that is giving PGDATA a different value, I can't see any problem here. Even if that were the case, the postmaster would be started with one value or the other; I don't see how it could use both at once. -- Oliver Elphick Oliver.Elphick@lfix.co.uk Isle of Wight, UK http://www.lfix.co.uk/oliver GPG: 1024D/3E1D0C1C: CA12 09E0 E8D5 8870 5839 932A 614D 4C34 3E1D 0C1C ======================================== "Who shall separate us from the love of Christ? shall tribulation, or distress, or persecution, or famine, or nakedness, or peril, or sword?...But in all these things we overwhelmingly conquer through Him who loved us." Romans 8:35,37
On Wed, Sep 25, 2002 at 02:12:28PM +0100, Oliver Elphick wrote: > On Wed, 2002-09-25 at 11:18, Ben Roberts wrote: > > Or is just 'OK' to have separate directories for POSTGRES_HOME and > > POSTGRES_DATA? > > Yes. > > As delivered, the package (postgresql-client, on which postgresql > depends) installs /etc/postgresql/postgresql.env and sources that in the > startup script for postgres (one of ~postgres/.profile, > ~postgres/.bash_profile, ~postgres/.login). postgresql.env is used to > set up the environment for anyone using psql as well as for postgres. > It says: > > [ -r /etc/postgresql/postmaster.conf ] && > . /etc/postgresql/postmaster.conf > > PGDATA=${POSTGRES_DATA:-/var/lib/postgres/data} > ... > > POSTGRES_HOME is in fact only used by the install scripts and as the > directory to cd to before running pg_ctl to start the postmaster. So > unless you have something in ~postgres/ that is giving PGDATA a > different value, I can't see any problem here. Even if that were the > case, the postmaster would be started with one value or the other; I > don't see how it could use both at once. Thanks Oliver, that's very useful. Unfortunately in that case I am back to square #1 in terms of my understanding of this problem. I can't find anything else that is strange or unusaul about our set up, with the exception that the DB has been set up to use UNICODE, but that shouldn't be a problem, should it? thanks, Ben
Oliver Elphick <olly@lfix.co.uk> writes: > POSTGRES_HOME is in fact only used by the install scripts and as the > directory to cd to before running pg_ctl to start the postmaster. So > unless you have something in ~postgres/ that is giving PGDATA a > different value, I can't see any problem here. Right, there really isn't any (necessary) connection between postgres' home directory and $PGDATA. The junk in the old directory seems odd, but I doubt it's related to your troubles (especially since it apparently is one level up from your old data directory). Given this: > > FATAL 2: open of /2/var/lib/postgres/data/pg_clog/0002 failed: No such file > > or directory ns1:/2/var/lib/postgres# ls -l /2/var/lib/postgres/data/pg_clog total 588 -rw------- 1 postgres postgres 262144 Sep 22 01:05 0004 -rw------- 1 postgres postgres 262144 Sep 25 01:05 0005 -rw------- 1 postgres postgres 65536 Sep 25 04:40 0006 I now have to revise my opinion somewhat. The reports we'd been seeing that seemed to be data corruption mostly showed attempted accesses to clog segments with numbers far beyond the active segment(s). Here you've got a reference to an old segment that's already been removed. This looks like it could be a software bug --- ie, failure to be sure we'd nailed down the commit status of all old tuples before removing a past clog segment. We found one or two bugs before 7.2 release that could lead to this symptom, but I'd hoped we'd caught them all. Maybe not. I would be interested to look closely at the data and see if anything can be learned. Would you be willing to consider letting me have access to your machine & database to study the problem? regards, tom lane