Thread: Potential RC1-stoppers
I'm currently concerned about these recent reports:

* Joel Burton's report of disappearing files, 3/20. This is real scary,
  but no one else has reported anything like it.

* Tatsuo's weird failure in XLogFileInit ("ZeroFill: no such file or
  directory"). I'm hoping this can be explained away, but probably we
  ought to alter the code so that we can detect the case where no errno
  is set by write() and avoid printing a bogus message.

Do people feel comfortable putting out RC1 when we don't know the
reasons for these reports?

Another thing I'd like to fix before RC1 is Adriaan's complaint about
mishandling of int8-sized numeric constants on Alpha. Seems to me that
we want Alpha to behave like other platforms, ie T_Integer parse nodes
should only be generated for values that fit in int4. Otherwise Alpha
will have different type resolution behavior for expressions that
contain such constants, and that's going to be real confusing.
I'm thinking about making scan.l do

    long x;

    errno = 0;
    x = strtol((char *)yytext, &endptr, 10);
    if (*endptr != '\0' || errno == ERANGE
#ifdef HAVE_LONG_INT_64
        /* if long is wider than 32 bits, check for overflow */
        || x != (long) ((int32) x)
#endif
        )
    {
        /* integer too large, treat it as a float */

Objections?

            regards, tom lane
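For readers who want to try the proposed range check outside the scanner, here is a minimal standalone sketch. The function name fits_in_int32 is hypothetical and is not part of scan.l. The sketch also compares LONG_MAX against INT32_MAX from the standard headers instead of using the HAVE_LONG_INT_64 configure symbol, and it rejects empty input, which the scanner itself would never produce.

```c
#include <errno.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper, not the actual scan.l code.
 * Returns 1 and stores the value in *result if text is a decimal integer
 * that fits in 32 bits; returns 0 otherwise, in which case the caller
 * would treat the literal as a float.
 */
int
fits_in_int32(const char *text, int32_t *result)
{
    char *endptr;
    long  x;

    errno = 0;
    x = strtol(text, &endptr, 10);

    /* reject empty input, trailing junk, or overflow of long itself */
    if (endptr == text || *endptr != '\0' || errno == ERANGE)
        return 0;

#if LONG_MAX > INT32_MAX
    /* long is wider than 32 bits, so ERANGE alone is not enough */
    if (x < INT32_MIN || x > INT32_MAX)
        return 0;
#endif

    *result = (int32_t) x;
    return 1;
}
```

With this check, "2147483647" is accepted on every platform and "2147483648" is rejected on every platform, whether long is 32 or 64 bits wide, which is the platform-independent behavior the proposal is after.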
> I'm currently concerned about these recent reports:
>
> * Joel Burton's report of disappearing files, 3/20. This is real scary,
>   but no one else has reported anything like it.
>
> * Tatsuo's weird failure in XLogFileInit ("ZeroFill: no such file or
>   directory"). I'm hoping this can be explained away, but probably we
>   ought to alter the code so that we can detect the case where no errno
>   is set by write() and avoid printing a bogus message.
>
> Do people feel comfortable putting out RC1 when we don't know the
> reasons for these reports?

Can we keep an eye on these and address in 7.1.1? 7.1 will need fixes
anyway.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
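On the XLogFileInit point quoted above: when write() transfers fewer bytes than requested (for example, because the disk filled up), it returns the short count and leaves errno untouched, so an error message built from errno reports whatever stale value happened to be there. One common convention, shown here as an assumption and not as the fix that was actually applied, is to clear errno before the call and substitute ENOSPC if the write comes up short with errno still zero. The helper name checked_write is hypothetical.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: attempt a single write of len bytes.
 * Returns 0 if all bytes were written, -1 otherwise.  On failure errno
 * is always meaningful: either the value set by write(), or ENOSPC when
 * write() came up short without setting errno (an assumed default).
 */
int
checked_write(int fd, const void *buf, size_t len)
{
    ssize_t written;

    errno = 0;                  /* so a stale value cannot leak through */
    written = write(fd, buf, len);

    if (written < 0 || (size_t) written != len)
    {
        /* short write with no errno: assume the device is out of space */
        if (errno == 0)
            errno = ENOSPC;
        return -1;
    }
    return 0;
}
```

A caller can then report strerror(errno) after a failure without risking a message like "no such file or directory" left over from an unrelated earlier call.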
> * Joel Burton's report of disappearing files, 3/20. This is
>   real scary, but no one else has reported anything like it.

Can you please remind me of that report?

Vadim
"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes: >> * Joel Burton's report of disappearing files, 3/20. This is >> real scary, but no one else has reported anything like it. > Can please you remind that report? It's the "pg_inherits: not found, but visible" thread in pghackers on 3/20 and 3/21. Briefly, he had two separate occurrences of a table file disappearing while the pg_class row remained (and he hadn't tried to delete it, either). The only idea I can come up with is that a removal of some other table removed the wrong file. Ugly. Joel, can you give us any more info? Do you have a postmaster log of the queries that were issued while this was happening? regards, tom lane
On Thu, 22 Mar 2001, Tom Lane wrote:

> "Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:
> >> * Joel Burton's report of disappearing files, 3/20. This is
> >>   real scary, but no one else has reported anything like it.
>
> > Can you please remind me of that report?
>
> It's the "pg_inherits: not found, but visible" thread in pghackers
> on 3/20 and 3/21. Briefly, he had two separate occurrences of a table
> file disappearing while the pg_class row remained (and he hadn't
> tried to delete it, either). The only idea I can come up with is that
> a removal of some other table removed the wrong file. Ugly.
>
> Joel, can you give us any more info? Do you have a postmaster log of
> the queries that were issued while this was happening?

Sorry; I've been at client sites for the past day.

I rebooted my machine, and it didn't happen again that night. Yesterday,
my staff reinstalled Pg straight from the CVS but without (!) tarring up
the old Pg install, so I'm afraid I don't have any logs. I run Pg w/debug
switches on my development machine; this machine did not have such.

After rebooting, and since reinstalling Pg
beta-6-or-whatever-we're-at-now, it hasn't happened again.

I'm afraid I can't think of anything unusual about the PC.

  Unbranded, decent-quality components
  AMD K6-III/550
  256MB RAM
  Linux-Mandrake 7.2 w/the secure version of the kernel (2.2.17, IIRC)
  Pg beta4

I don't have a log, but do have the query that was issued, multiple times,
overlapping:

  SELECT * FROM zope_facinst LIMIT 1000;

where zope_facinst is the view

  SELECT DISTINCT ON (t.lname, t.fname, c.fulltitle, c.classcode, t.trainid)
         c.classcode, t.trainid,
         scw_namecode(t.fname, t.lname) AS namecode,
         t.fullname, c.fulltitle, c.descrip,
         t.descripshort AS train_descripshort,
         c.descripshort AS class_descripshort
    FROM vlkpclass c, vlkptrain t, tblinst i, trelinsttrain it
   WHERE (((c.classcode = i.classcode) AND (i.instid = it.instid))
         AND (it.trainid = t.trainid))
   ORDER BY t.lname, t.fname, c.fulltitle, c.classcode, t.trainid;

So it's pretty complicated, but not terrible. The classes starting w/'t'
are tables; those starting with 'v' are views; none of the views are too
complex. scw_namecode() is a simple pl/pgsql routine that just joins the
strings together in a particular way.

There are about 400 records returned by the view.

EXPLAIN for it looks like this:

reg2=# explain select * from zope_Facinst;
NOTICE:  QUERY PLAN:

Subquery Scan zope_facinst  (cost=339.93..356.42 rows=132 width=141)
  ->  Unique  (cost=339.93..356.42 rows=132 width=141)
        ->  Sort  (cost=339.93..339.93 rows=1319 width=141)
              ->  Merge Join  (cost=261.33..271.56 rows=1319 width=141)
                    ->  Sort  (cost=223.52..223.52 rows=597 width=92)
                          ->  Merge Join  (cost=131.72..195.99 rows=597 width=92)
                                ->  Index Scan using tblinst_pkey on tblinst i  (cost=0.00..53.69 rows=769 width=16)
                                ->  Sort  (cost=131.72..131.72 rows=78 width=76)
                                      ->  Merge Join  (cost=52.15..129.28 rows=78 width=76)
                                            ->  Merge Join  (cost=52.15..59.96 rows=976 width=68)
                                                  ->  Sort  (cost=27.28..27.28 rows=316 width=40)
                                                        ->  Seq Scan on tblpers p  (cost=0.00..14.16 rows=316 width=40)
                                                  ->  Sort  (cost=24.87..24.87 rows=309 width=28)
                                                        ->  Seq Scan on tbltrain t  (cost=0.00..12.09 rows=309 width=28)
                                            ->  Index Scan using trelinsttrain_trainid_idx on trelinsttrain it  (cost=0.00..42.75 rows=795 width=8)
                    ->  Sort  (cost=37.82..37.82 rows=221 width=49)
                          ->  Seq Scan on tblclass c  (cost=0.00..29.21 rows=221 width=49)

I can provide a dump of the database if anyone would like, or copies of
the Zope scripts (very, very simple: they just call the ZSQL method
'select * from zope_facinst limit 1000').

Sorry I can't provide much more, and, yes, I know it sucks to have a
problem I can't replicate. Err. Computers can be like that.

I hope this helps.

--
Joel Burton <jburton@scw.org>
Director of Information Systems, Support Center of Washington
Joel Burton <jburton@scw.org> writes:
> I rebooted my machine, and it didn't happen again that night. Yesterday,
> my staff reinstalled Pg straight from the CVS but without (!) tarring up
> the old Pg install, so I'm afraid I don't have any logs. I run Pg w/debug
> switches on my development machine; this machine did not have such.

Drat.

> I don't have a log, but do have the query that was issued, multiple times,
> overlapping:
> SELECT * FROM zope_facinst LIMIT 1000;

It's really unlikely (I hope) that the clients running SELECTs had
anything to do with it. You had mentioned that you were busy making
manual schema revisions while this went on; that process seems more
likely to be the guilty party. But if you don't have the logs anymore,
I suppose there's not much chance of reconstructing what you did :-(

I spent much of this afternoon groveling through the deletion-related
code, looking for some code path that could lead to a deletion operation
deleting the wrong file. I didn't find anything that looked plausible
enough to be worth pursuing. So I'm stumped for the moment. We'll have
to hope that if it happens again, we can gather more data.

            regards, tom lane
On Thu, 22 Mar 2001, Tom Lane wrote:

> Joel Burton <jburton@scw.org> writes:
> > I rebooted my machine, and it didn't happen again that night. Yesterday,
> > my staff reinstalled Pg straight from the CVS but without (!) tarring up
> > the old Pg install, so I'm afraid I don't have any logs. I run Pg w/debug
> > switches on my development machine; this machine did not have such.
>
> Drat.
>
> > I don't have a log, but do have the query that was issued, multiple times,
> > overlapping:
> > SELECT * FROM zope_facinst LIMIT 1000;
>
> It's really unlikely (I hope) that the clients running SELECTs had
> anything to do with it. You had mentioned that you were busy making
> manual schema revisions while this went on; that process seems more
> likely to be the guilty party. But if you don't have the logs anymore,
> I suppose there's not much chance of reconstructing what you did :-(

The dropping and re-making were of the zope_facinst view listed in my
email. I was tinkering with various parameters, trying to see if
distinct on (list) was faster than distinct list, etc.

> I spent much of this afternoon groveling through the deletion-related
> code, looking for some code path that could lead to a deletion operation
> deleting the wrong file. I didn't find anything that looked plausible
> enough to be worth pursuing. So I'm stumped for the moment. We'll have
> to hope that if it happens again, we can gather more data.

It could be my machine; it's not a heavily used machine, so I can't vouch
for its stability.

Sorry I couldn't help more.

As always, thanks.

--
Joel Burton <jburton@scw.org>
Director of Information Systems, Support Center of Washington