Thread: FATAL 2: open of /var/lib/pgsql/data/pg_clog/0EE3 failed: No such file or directory
From: Dmitry Tkach
Hi, everybody!

I was getting these errors (see subject) from time to time on 7.2.1 when trying to analyze tables... trying to repeat the statement would usually work. I then saw somebody else post in the list, mentioning the same problem, and Tom Lane's reply to it suggesting to upgrade to 7.2.4, where that was supposedly fixed.

I have upgraded recently, and got the same problem again yesterday... It seems to have become even worse, as it was happening consistently, even when I tried to repeat the query (with 7.2.1, the second time always used to work, as far as I remember)... it seems to have gone away after I restarted the server manually.

Was I wrong to deduce from Tom's message that this should be fixed in 7.2.4? Or is this news to you guys too?

Any idea what is causing this, and/or how it can be avoided?

Thanks a lot!

Dima
On Fri, 18 Jul 2003, Dmitry Tkach wrote:
> I was getting these errors (see subject) from time to time on 7.2.1 when
> trying to analyze tables... trying to repeat the statement would usually
> work. I then saw somebody else post in the list, mentioning the same
> problem, and Tom Lane's reply to it suggesting to upgrade to 7.2.4,
> where that was supposedly fixed.
>
> I have upgraded recently, and got the same problem again yesterday...
> It seems to have become even worse, as it was happening consistently,
> even when I tried to repeat the query (with 7.2.1, the second time
> always used to work, as far as I remember)... it seems to have gone
> away after I restarted the server manually.
>
> Any idea what is causing this, and/or how it can be avoided?

Did you dump and reload your database? It may well be that 7.2.1 planted logic bombs in the base directory that 7.2.4 can't fix now. I.e. some data structure got munged and now 7.2.4 is busy complaining about it.
scott.marlowe wrote:
> did you dump and reload your database?

Nope... It's a 200 Gig database... If it was that simple to dump and reload it, I would have upgraded to 7.3, not 7.2.4 to begin with :-(

> It may well be that 7.2.1 planted logic bombs in the base directory
> that 7.2.4 can't fix now. I.e. some data structure got munged and now
> 7.2.4 is busy complaining about it.

Why doesn't it complain after I restart it then?

Dima
On Fri, 18 Jul 2003, Dmitry Tkach wrote:
> scott.marlowe wrote:
> > did you dump and reload your database?
>
> Nope... It's a 200 Gig database... If it was that simple to dump and
> reload it, I would have upgraded to 7.3, not 7.2.4 to begin with :-(
>
> > It may well be that 7.2.1 planted logic bombs in the base directory
> > that 7.2.4 can't fix now. I.e. some data structure got munged and
> > now 7.2.4 is busy complaining about it.
>
> Why doesn't it complain after I restart it then?

Why should it? If the problem is corrupted data in an index / table / system table etc., you won't see an error until it accesses the table/index etc. that's causing the problem.
scott.marlowe wrote:
> Why should it? If the problem is corrupted data in an index / table /
> system table etc., you won't see an error until it accesses the
> table/index etc. that's causing the problem.

Right... So:

1) I do: analyze mytable; ... it crashes.
2) I do it again: analyze mytable; ... it crashes.
3) I restart the server manually, and try again: analyze mytable; ... it *works*
4) I let it run for a while, then try again: analyze mytable; ... it crashes.

So, it looks like, if there is some data structure or a catalog screwed up, it is not screwed up by 7.2.1 earlier, it is *being* screwed up somewhere between #3 and #4 above...

Dima
--- Dmitry Tkach <dmitry@openratings.com> wrote:
> So, it looks like, if there is some data structure or a catalog screwed
> up, it is not screwed up by 7.2.1 earlier, it is *being* screwed up
> somewhere between #3 and #4 above...

I *think* (ignorance showing here) that "analyze" only samples data, so the bad data will not necessarily be touched on a given pass.

I went through the problem too (with version 7.2.1), and went through the archives pretty thoroughly. The problem is caused by a spurious xid value in some record somewhere. The theories offered on the cause of this were:

* hardware problems
* an unknown bug in 7.2, fixed around 7.2.3(?)

The second theory looked likely, because there was a rash of reports for versions in the 7.2.0 - 7.2.1 range, but none for 7.2.4 (until now). The bug theory was never investigated, because no-one was able to supply a reproducible test case, and the problem seemed to be fixed by an upgrade anyway.

So the dump and restore should have fixed the problem for you, because xid values are not dumped. And if you had a bad xid value in your database, the dump should have failed anyway. All very mysterious.

Search the archives for plenty more on this.
Dmitry Tkach <dmitry@openratings.com> writes:
> 3) I restart the server manually, and try again: analyze mytable; ... it *works*
> 4) I let it run for a while, then try again: analyze mytable; ... it crashes.

Proves nothing, since ANALYZE only touches a random sample of the rows. If you get that behavior with VACUUM, or a full-table SELECT (say, "SELECT count(*) FROM foo"), then it'd be interesting.

What range of file names do you actually have in pg_clog/, anyway?

			regards, tom lane
Tom Lane wrote:
> Proves nothing, since ANALYZE only touches a random sample of the rows.

Ok, I understand... Thanks.

> If you get that behavior with VACUUM, or a full-table SELECT (say,
> "SELECT count(*) FROM foo"), then it'd be interesting.

I never got it with select - only with vacuum and/or analyze... Are you suggesting it should/could happen with select, or was that just meant to be an example of a full table scan? I just did select count(*) from that table, and it worked...

> What range of file names do you actually have in pg_clog/, anyway?

Well... *today* there seem to be files between 0000 and 00EC. Is that range supposed to stay the same, or does it vary? ... because that problem I had happened yesterday, and I have restarted the server since then...

Thanks!

Dima
Dmitry Tkach <dmitry@openratings.com> writes:
> Well... *today* there seem to be files between 0000 and 00EC.
> Is that range supposed to stay the same or does it vary?

It will vary, but not quickly --- each file represents 1 million transactions.

If the problem is erratic with VACUUM or SELECT COUNT(*), then the only speculation I have is flaky hardware: you must be reading different xids from the table at different times.

			regards, tom lane
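Tom's "1 million transactions per file" figure lets you map a segment name back to a transaction-id range: the segment name is just the xid divided by the per-segment capacity, written as four hex digits. A sketch of the arithmetic (the exact constant is an assumption based on the 7.2-era 256 KB clog segments):

```python
# Assumed: 256 KB segment * 4 transactions per byte = 1,048,576 xids.
XACTS_PER_SEGMENT = 1048576

def segment_for_xid(xid):
    """Clog segment file name that would hold this transaction's status."""
    return format(xid // XACTS_PER_SEGMENT, "04X")

def xid_range_for_segment(name):
    """Inclusive (low, high) xid range covered by a segment file."""
    seg = int(name, 16)
    return seg * XACTS_PER_SEGMENT, (seg + 1) * XACTS_PER_SEGMENT - 1

# The failing segment 0EE3 (hex 0EE3 = decimal 3811) maps to xids near
# 4 billion -- far beyond the 0000..00EC range actually on disk, which
# is consistent with a corrupted xid rather than a missing clog file.
lo, hi = xid_range_for_segment("0EE3")
```

Running the same check on the other reported segments (0980, 02B0) gives similarly out-of-range xids, which is what you would expect from random bit flips rather than a single damaged tuple.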
Tom Lane wrote:
> It will vary, but not quickly --- each file represents 1 million
> transactions.
>
> If the problem is erratic with VACUUM or SELECT COUNT(*), then the
> only speculation I have is flaky hardware: you must be reading different
> xids from the table at different times.

Oops... I just got it again - right after doing that select count(*), which *worked*, I tried to analyze the same table, and got:

FATAL 2: open of /var/lib/pgsql/data/pg_clog/0980 failed: No such file or directory

I tried it again, right away, and got the same error, complaining about 02B0 this time. I tried select count(*) again, and it worked. I restarted the server, did analyze, and it worked too...

Any ideas?
Dmitry Tkach <dmitry@openratings.com> writes:
> Any ideas?

Time to get out memtest86 and badblocks.

			regards, tom lane