Thread: FATAL 2: open of /var/lib/pgsql/data/pg_clog/0EE3 failed: No such file or directory
From: Dmitry Tkach
Hi, everybody!

I was getting these errors (see subject) from time to time on 7.2.1 when trying to analyze tables... trying to repeat the statement would usually work. I then saw somebody else post in the list, mentioning the same problem, and Tom Lane's reply to it suggesting to upgrade to 7.2.4, where that was supposedly fixed.

I have upgraded recently, and got the same problem again yesterday... It seems to have become even worse, as it was happening consistently, even when I tried to repeat the query (with 7.2.1, the second time always used to work, as far as I remember)... it seems to have gone away after I restarted the server manually.

Was I wrong to deduce from Tom's message that this should be fixed in 7.2.4? Or is this news to you guys too?

Any idea what is causing this, and/or how it can be avoided?

Thanks a lot!

Dima
On Fri, 18 Jul 2003, Dmitry Tkach wrote:
> I was getting these errors (see subject) from time to time on 7.2.1 when
> trying to analyze tables... trying to repeat the statement would usually
> work. I then saw somebody else post in the list, mentioning the same
> problem, and Tom Lane's reply to it suggesting to upgrade to 7.2.4,
> where that was supposedly fixed.
>
> I have upgraded recently, and got the same problem again yesterday...
> It seems to have become even worse, as it was happening consistently,
> even when I tried to repeat the query (with 7.2.1, the second time
> always used to work, as far as I remember)... it seems to have gone
> away after I restarted the server manually.
>
> Any idea what is causing this, and/or how it can be avoided?

Did you dump and reload your database? It may well be that 7.2.1 planted logic bombs in the base directory that 7.2.4 can't fix now. I.e. some data structure got munged and now 7.2.4 is busy complaining about it.
scott.marlowe wrote:
> did you dump and reload your database?

Nope... It's a 200 Gig database... If it was that simple to dump and reload it, I would have upgraded to 7.3, not 7.2.4 to begin with :-(

> It may well be that 7.2.1 planted logic bombs in the base directory
> that 7.2.4 can't fix now. I.e. some data structure got munged and now
> 7.2.4 is busy complaining about it.

Why doesn't it complain after I restart it then?

Dima
On Fri, 18 Jul 2003, Dmitry Tkach wrote:
> scott.marlowe wrote:
> > did you dump and reload your database?
>
> Nope... It's a 200 Gig database... If it was that simple to dump and
> reload it, I would have upgraded to 7.3, not 7.2.4 to begin with :-(
>
> > It may well be that 7.2.1 planted logic bombs in the base directory
> > that 7.2.4 can't fix now. I.e. some data structure got munged and
> > now 7.2.4 is busy complaining about it.
>
> Why doesn't it complain after I restart it then?

Why should it? If the problem is corrupted data in an index / table / system table etc., you won't see an error until it accesses the table/index etc. that's causing the problem.
scott.marlowe wrote:
> Why should it? If the problem is corrupted data in an index / table /
> system table etc., you won't see an error until it accesses the
> table/index etc. that's causing the problem.

Right... So:

1) I do: analyze mytable; ... it crashes.
2) I do it again: analyze mytable; ... it crashes.
3) I restart the server manually, and try again: analyze mytable; ... it *works*
4) I let it run for a while, then try again: analyze mytable; ... it crashes.

So, it looks like, if there is some data structure or a catalog screwed up, it is not screwed up by 7.2.1 earlier, it is *being* screwed up somewhere between #3 and #4 above...

Dima
--- Dmitry Tkach <dmitry@openratings.com> wrote:
> So, it looks like, if there is some data structure or a catalog screwed
> up, it is not screwed up by 7.2.1 earlier, it is *being* screwed up
> somewhere between #3 and #4 above...

I *think* (ignorance showing here) that "analyze" only samples data, so the bad data will not necessarily be touched on a given pass.

I went through the problem too (with version 7.2.1), and went through the archives pretty thoroughly. The problem is caused by a spurious xid value in some record somewhere. The theories offered on the cause of this were:

* hardware problems
* an unknown bug in 7.2, fixed around 7.2.3(?)

The second theory looked likely, because there was a rash of reports for versions in the 7.2.0 - 7.2.1 range, but none for 7.2.4 (until now). The bug theory was never investigated, because no-one was able to supply a reproducible test case, and the problem seemed to be fixed by an upgrade anyway.

So the dump and restore should have fixed the problem for you, because xid values are not dumped. And if you had a bad xid value in your database, the dump should have failed anyway. All very mysterious.

Search the archives for plenty more on this.
Dmitry Tkach <dmitry@openratings.com> writes:
> 3) I restart the server manually, and try again: analyze mytable; ... it *works*
> 4) I let it run for a while, then try again: analyze mytable; ... it crashes.

Proves nothing, since ANALYZE only touches a random sample of the rows. If you get that behavior with VACUUM, or a full-table SELECT (say, "SELECT count(*) FROM foo"), then it'd be interesting.

What range of file names do you actually have in pg_clog/, anyway?

			regards, tom lane
Tom Lane wrote:
> Proves nothing, since ANALYZE only touches a random sample of the rows.

Ok, I understand... Thanks.

> If you get that behavior with VACUUM, or a full-table SELECT (say,
> "SELECT count(*) FROM foo"), then it'd be interesting.

I never got it with select - only with vacuum and/or analyze... Are you suggesting it should/could happen with select, or was that just meant to be an example of a full table scan? I just did select count(*) from that table, and it worked...

> What range of file names do you actually have in pg_clog/, anyway?

Well... *today* there seem to be files between 0000 and 00EC. Is that range supposed to stay the same, or does it vary? ... because that problem I had happened yesterday, and I have restarted the server since then...

Thanks!

Dima
Dmitry Tkach <dmitry@openratings.com> writes:
> Well... *today* there seem to be files between 0000 and 00EC.
> Is that range supposed to stay the same or does it vary?

It will vary, but not quickly --- each file represents 1 million transactions.

If the problem is erratic with VACUUM or SELECT COUNT(*), then the only speculation I have is flaky hardware: you must be reading different xids from the table at different times.

			regards, tom lane
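Tom's "1 million transactions per file" figure lets you map a segment name back to a transaction-id range: the segment name is just the xid divided by the per-segment capacity, written as four hex digits. A sketch of the arithmetic (the exact constant is an assumption based on the 7.2-era 256 KB clog segments):

```python
# Assumed: 256 KB segment * 4 transactions per byte = 1,048,576 xids.
XACTS_PER_SEGMENT = 1048576

def segment_for_xid(xid):
    """Clog segment file name that would hold this transaction's status."""
    return format(xid // XACTS_PER_SEGMENT, "04X")

def xid_range_for_segment(name):
    """Inclusive (low, high) xid range covered by a segment file."""
    seg = int(name, 16)
    return seg * XACTS_PER_SEGMENT, (seg + 1) * XACTS_PER_SEGMENT - 1

# The failing segment 0EE3 (hex 0EE3 = decimal 3811) maps to xids near
# 4 billion -- far beyond the 0000..00EC range actually on disk, which
# is consistent with a corrupted xid rather than a missing clog file.
lo, hi = xid_range_for_segment("0EE3")
```

Running the same check on the other reported segments (0980, 02B0) gives similarly out-of-range xids, which is what you would expect from random bit flips rather than a single damaged tuple.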
Tom Lane wrote:
> It will vary, but not quickly --- each file represents 1 million
> transactions.
>
> If the problem is erratic with VACUUM or SELECT COUNT(*), then the
> only speculation I have is flaky hardware: you must be reading different
> xids from the table at different times.

Oops... I just got it again - right after doing that select count(*), which *worked*, I tried to analyze the same table, and got:

FATAL 2: open of /var/lib/pgsql/data/pg_clog/0980 failed: No such file or directory

I tried it again, right away, and got the same error, complaining about 02B0 this time. I tried select count(*) again, and it worked. I restarted the server, did analyze, and it worked too...

Any ideas?
Dmitry Tkach <dmitry@openratings.com> writes:
> Any ideas?

Time to get out memtest86 and badblocks.

			regards, tom lane