Thread: Potential RC1-stoppers

Potential RC1-stoppers

From
Tom Lane
Date:
I'm currently concerned about these recent reports:

* Joel Burton's report of disappearing files, 3/20.  This is real scary,
but no one else has reported anything like it.

* Tatsuo's weird failure in XLogFileInit ("ZeroFill: no such file or
directory").  I'm hoping this can be explained away, but probably we
ought to alter the code so that we can detect the case where no errno
is set by write() and avoid printing a bogus message.
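A minimal sketch of the detection idea in the second item, not the actual XLogFileInit code. The helper name `write_fully_or_set_errno` is hypothetical, and treating a short write with no errno as "out of disk space" (ENOSPC) is an assumption made here for illustration. The point is only that errno is cleared before the call, so a stale value from some earlier failure cannot end up in the error message.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write nbytes to fd.  Returns 0 on success and -1 on
 * failure.  On failure errno is guaranteed to be non-zero, so the caller
 * never reports a leftover errno from an unrelated earlier call.
 */
static int
write_fully_or_set_errno(int fd, const void *buf, size_t nbytes)
{
    ssize_t written;

    assert(buf != NULL);

    errno = 0;                  /* clear any stale value first */
    written = write(fd, buf, nbytes);
    if (written != (ssize_t) nbytes)
    {
        /*
         * A short write may leave errno untouched.  Assumption for this
         * sketch: the most likely cause is a full disk.
         */
        if (errno == 0)
            errno = ENOSPC;
        return -1;
    }
    return 0;
}
```

A caller would print strerror(errno) only after this helper returns -1, at which point errno describes the write itself.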

Do people feel comfortable putting out RC1 when we don't know the
reasons for these reports?

Another thing I'd like to fix before RC1 is Adriaan's complaint about
mishandling of int8-sized numeric constants on Alpha.  Seems to me that
we want Alpha to behave like other platforms, ie T_Integer parse nodes
should only be generated for values that fit in int4.  Otherwise Alpha
will have different type resolution behavior for expressions that
contain such constants, and that's going to be real confusing.  I'm
thinking about making scan.l do
                    long x;

                    errno = 0;
                    x = strtol((char *)yytext, &endptr, 10);
                    if (*endptr != '\0' || errno == ERANGE
#ifdef HAVE_LONG_INT_64
                        /* if long is wider than 32 bits, check for overflow */
                        || x != (long) ((int32) x)
#endif
                        )
                    {
                        /* integer too large, treat it as a float */
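For readers who want to try the check outside the scanner, here is a self-contained sketch of the same test. The function name `literal_fits_int32` is hypothetical and this is not the scan.l code. It also swaps the `#ifdef HAVE_LONG_INT_64` cast round-trip for a plain range comparison against INT32_MIN and INT32_MAX, which should give the same answer whether long is 32 or 64 bits wide.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: returns 1 and stores the value in *result if text is
 * a decimal integer that fits in 32 bits.  Returns 0 otherwise, in which
 * case a scanner following the proposal would treat the token as a float.
 */
static int
literal_fits_int32(const char *text, int32_t *result)
{
    char *endptr;
    long  x;

    assert(text != NULL && result != NULL);

    errno = 0;
    x = strtol(text, &endptr, 10);

    /* Reject empty input, trailing junk, and overflow of long itself. */
    if (endptr == text || *endptr != '\0' || errno == ERANGE)
        return 0;

    /* Where long is wider than 32 bits, also reject values beyond int32. */
    if (x < INT32_MIN || x > INT32_MAX)
        return 0;

    *result = (int32_t) x;
    return 1;
}
```

With this check, a literal such as 2147483648 is rejected on every platform, so it would be typed the same way on Alpha as elsewhere.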

Objections?
        regards, tom lane


Re: Potential RC1-stoppers

From
Bruce Momjian
Date:
> I'm currently concerned about these recent reports:
> 
> * Joel Burton's report of disappearing files, 3/20.  This is real scary,
> but no one else has reported anything like it.
> 
> * Tatsuo's weird failure in XLogFileInit ("ZeroFill: no such file or
> directory").  I'm hoping this can be explained away, but probably we
> ought to alter the code so that we can detect the case where no errno
> is set by write() and avoid printing a bogus message.
> 
> Do people feel comfortable putting out RC1 when we don't know the
> reasons for these reports?

Can we keep an eye on these and address in 7.1.1?  7.1 will need fixes
anyway.


-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 


RE: Potential RC1-stoppers

From
"Mikheev, Vadim"
Date:
> * Joel Burton's report of disappearing files, 3/20.  This is 
> real scary, but no one else has reported anything like it.

Can please you remind that report?

Vadim


Re: Potential RC1-stoppers

From
Tom Lane
Date:
"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:
>> * Joel Burton's report of disappearing files, 3/20.  This is 
>> real scary, but no one else has reported anything like it.

> Can please you remind that report?

It's the "pg_inherits: not found, but visible" thread in pghackers
on 3/20 and 3/21.  Briefly, he had two separate occurrences of a table
file disappearing while the pg_class row remained (and he hadn't
tried to delete it, either).  The only idea I can come up with is that
a removal of some other table removed the wrong file.  Ugly.

Joel, can you give us any more info?  Do you have a postmaster log of
the queries that were issued while this was happening?
        regards, tom lane


Re: Potential RC1-stoppers

From
Joel Burton
Date:
On Thu, 22 Mar 2001, Tom Lane wrote:

> "Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:
> >> * Joel Burton's report of disappearing files, 3/20.  This is 
> >> real scary, but no one else has reported anything like it.
> 
> > Can please you remind that report?
> 
> It's the "pg_inherits: not found, but visible" thread in pghackers
> on 3/20 and 3/21.  Briefly, he had two separate occurrences of a table
> file disappearing while the pg_class row remained (and he hadn't
> tried to delete it, either).  The only idea I can come up with is that
> a removal of some other table removed the wrong file.  Ugly.
> 
> Joel, can you give us any more info?  Do you have a postmaster log of
> the queries that were issued while this was happening?

Sorry; I've been at client sites for the past day.

I rebooted my machine, and it didn't happen again that night. Yesterday,
my staff reinstalled Pg straight from the CVS but without (!) tarring up
the old Pg install, so I'm afraid I don't have any logs. I run Pg w/debug
switches on my development machine; this machine did not have such.

After rebooting, and since reinstalling Pg
beta-6-or-whatever-we're-at-now, it hasn't happened again. I'm afraid I
can't think of anything unusual about the PC.

Unbranded, decent-quality components
AMD K6-III/550
256MB RAM
Linux-Mandrake 7.2 w/the secure version of the kernel (2.2.17, IIRC)
Pg beta4


I don't have a log, but do have the query that was issued, multiple times,
overlapping:

SELECT * FROM zope_facinst LIMIT 1000;

where zope_facinst is the view

SELECT DISTINCT ON (t.lname,
                    t.fname,
                    c.fulltitle,
                    c.classcode,
                    t.trainid)
         c.classcode,
         t.trainid,
         scw_namecode(t.fname, t.lname) AS namecode,
         t.fullname,
         c.fulltitle,
         c.descrip,
         t.descripshort AS train_descripshort,
         c.descripshort AS class_descripshort
FROM     vlkpclass c,
         vlkptrain t,
         tblinst i,
         trelinsttrain it
WHERE    (((c.classcode = i.classcode) AND
           (i.instid = it.instid))
AND       (it.trainid = t.trainid))
ORDER BY t.lname,
         t.fname,
         c.fulltitle,
         c.classcode,
         t.trainid;

So it's pretty complicated, but not terrible.

The classes starting w/'t' are tables; those starting with 'v' are
views; none of the views are too complex.

scw_namecode() is a simple pl/pgsql routine that just joins the strings
together in a particular way.

There are about 400 records returned by the view.



EXPLAIN for it looks like this:

reg2=# explain select * from zope_Facinst;
NOTICE:  QUERY PLAN:

Subquery Scan zope_facinst  (cost=339.93..356.42 rows=132 width=141)
  ->  Unique  (cost=339.93..356.42 rows=132 width=141)
        ->  Sort  (cost=339.93..339.93 rows=1319 width=141)
              ->  Merge Join  (cost=261.33..271.56 rows=1319 width=141)
                    ->  Sort  (cost=223.52..223.52 rows=597 width=92)
                          ->  Merge Join  (cost=131.72..195.99 rows=597 width=92)
                                ->  Index Scan using tblinst_pkey on tblinst i  (cost=0.00..53.69 rows=769 width=16)
                                ->  Sort  (cost=131.72..131.72 rows=78 width=76)
                                      ->  Merge Join  (cost=52.15..129.28 rows=78 width=76)
                                            ->  Merge Join  (cost=52.15..59.96 rows=976 width=68)
                                                  ->  Sort  (cost=27.28..27.28 rows=316 width=40)
                                                        ->  Seq Scan on tblpers p  (cost=0.00..14.16 rows=316 width=40)
                                                  ->  Sort  (cost=24.87..24.87 rows=309 width=28)
                                                        ->  Seq Scan on tbltrain t  (cost=0.00..12.09 rows=309 width=28)
                                            ->  Index Scan using trelinsttrain_trainid_idx on trelinsttrain it  (cost=0.00..42.75 rows=795 width=8)
                    ->  Sort  (cost=37.82..37.82 rows=221 width=49)
                          ->  Seq Scan on tblclass c  (cost=0.00..29.21 rows=221 width=49)



I can provide a dump of the database if anyone would like, or copies of
the Zope scripts (very, very simple: they just call the ZSQL method
'select * from zope_facinst limit 1000')


Sorry I can't provide much more, and, yes, I know it sucks to have a
problem I can't replicate. Err. Computers can be like that.

I hope this helps.


-- 
Joel Burton   <jburton@scw.org>
Director of Information Systems, Support Center of Washington



Re: Potential RC1-stoppers

From
Tom Lane
Date:
Joel Burton <jburton@scw.org> writes:
> I rebooted my machine, and it didn't happen again that night. Yesterday,
> my staff reinstalled Pg straight from the CVS but without (!) tarring up
> the old Pg install, so I'm afraid I don't have any logs. I run Pg w/debug
> switches on my development machine; this machine did not have such.

Drat.

> I don't have a log, but do have the query that was issued, multiple times,
> overlapping:
> SELECT * FROM zope_facinst LIMIT 1000;

It's really unlikely (I hope) that the clients running SELECTs had
anything to do with it.  You had mentioned that you were busy making
manual schema revisions while this went on; that process seems more
likely to be the guilty party.  But if you don't have the logs anymore,
I suppose there's not much chance of reconstructing what you did :-(

I spent much of this afternoon groveling through the deletion-related
code, looking for some code path that could lead to a deletion operation
deleting the wrong file.  I didn't find anything that looked plausible
enough to be worth pursuing.  So I'm stumped for the moment.  We'll have
to hope that if it happens again, we can gather more data.
        regards, tom lane


Re: Potential RC1-stoppers

From
Joel Burton
Date:
On Thu, 22 Mar 2001, Tom Lane wrote:

> Joel Burton <jburton@scw.org> writes:
> > I rebooted my machine, and it didn't happen again that night. Yesterday,
> > my staff reinstalled Pg straight from the CVS but without (!) tarring up
> > the old Pg install, so I'm afraid I don't have any logs. I run Pg w/debug
> > switches on my development machine; this machine did not have such.
> 
> Drat.
> 
> > I don't have a log, but do have the query that was issued, multiple times,
> > overlapping:
> > SELECT * FROM zope_facinst LIMIT 1000;
> 
> It's really unlikely (I hope) that the clients running SELECTs had
> anything to do with it.  You had mentioned that you were busy making
> manual schema revisions while this went on; that process seems more
> likely to be the guilty party.  But if you don't have the logs anymore,
> I suppose there's not much chance of reconstructing what you did :-(

The dropping and re-making were the zope_facinst view listed in my email.
I was tinkering with various parameters, trying to see if distinct on
(list) was faster than distinct list, etc.

> I spent much of this afternoon groveling through the deletion-related
> code, looking for some code path that could lead to a deletion operation
> deleting the wrong file.  I didn't find anything that looked plausible
> enough to be worth pursuing.  So I'm stumped for the moment.  We'll have
> to hope that if it happens again, we can gather more data.

It could be my machine; it's not a heavily used machine, so I can't vouch
for its stability.

Sorry I couldn't help more.

As always, thanks.
-- 
Joel Burton   <jburton@scw.org>
Director of Information Systems, Support Center of Washington