Thread: Re: I might be getting closer?

Re: I might be getting closer?

From
Bruce Momjian
Date:
[ cc to hackers]

It certainly looks closer, particularly because the failure is a simple
domain constraint failure and not a more internal error.

Have you tried moving ahead a few days to see if the bug was fixed in
CVS?

---------------------------------------------------------------------------

Robert Creager wrote:
-- Start of PGP signed section.
> 
> Hey Bruce,
> 
> I can get version 2003-02-01 to only fail one test, and sporadically at
> that (2 out of 50 runs):
> 
> *** ./expected/domain.out    Sat Jul 26 12:24:18 2003
> --- ./results/domain.out    Sat Jul 26 12:56:01 2003
> ***************
> *** 263,269 ****
>   insert into domcontest values (5);
>   alter domain con drop constraint t;
>   insert into domcontest values (-5); --fails
> ! ERROR:  ExecEvalConstraintTest: Domain con constraint $1 failed
>   insert into domcontest values (42);
>   -- cleanup
>   drop domain ddef1 restrict;
> --- 263,269 ----
>   insert into domcontest values (5);
>   alter domain con drop constraint t;
>   insert into domcontest values (-5); --fails
> ! ERROR:  ExecEvalConstraintTest: Domain con constraint  failed
>   insert into domcontest values (42);
>   -- cleanup
>   drop domain ddef1 restrict;
> 
> ======================================================================
> 
> -- 
>  13:04:42 up 8 days, 17:05,  2 users,  load average: 1.84, 1.24, 1.34
-- End of PGP section, PGP failed!

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: I might be getting closer?

From
Robert Creager
Date:
On Sat, 26 Jul 2003 16:49:27 -0400 (EDT)
Bruce Momjian <pgman@candle.pha.pa.us> said something like:

> [ cc to hackers]
> 
> It certainly looks closer, particularly because the failure is a
> simple domain constraint failure and not a more internal error.
> 
> Have you tried moving ahead a few days to see if the bug was fixed in
> CVS?
> 

No.  I'll run 2003-02-15 next.

I just got the domain failure on 2003-01-26 after 42 passes.

-- 15:03:30 up 8 days, 19:04,  2 users,  load average: 2.40, 2.15, 2.31

Regression test failure date.

From
Robert Creager
Date:
I found it (I think)...

Looks like something was done after the 15'th...

2003-02-15 passes 50/50 and 33/33 on second pass (so far)
2003-02-16 fails 6/50
   vacuum failed 1 times
   misc failed 3 times
   sanity_check failed 3 times
   inherit failed 1 times
   triggers failed 4 times
2003-02-18 fails 11/50
   constraints failed 5 times
   sanity_check failed 3 times
   misc failed 8 times
   inherit failed 2 times
   rules failed 1 times
   triggers failed 5 times

Cheers,
Rob

-- 17:42:41 up 8 days, 21:43,  2 users,  load average: 3.62, 2.69, 2.35

Re: Regression test failure date.

From
Tom Lane
Date:
Robert Creager <Robert_Creager@LogicalChaos.org> writes:
> Looks like something was done after the 15'th...

> 2003-02-15 passes 50/50 and 33/33 on second pass (so far)
> 2003-02-16 fails 6/50

As far back as that!  Okay, many thanks for the info --- that will help.

I'm buried in error message editing right now but will look at the diffs
in that timeframe tomorrow, unless someone beats me to it.
        regards, tom lane


Re: Regression test failure date.

From
Tom Lane
Date:
Robert Creager <Robert_Creager@LogicalChaos.org> writes:
> 2003-02-15 passes 50/50 and 33/33 on second pass (so far)
> 2003-02-16 fails 6/50

I looked in the CVS logs while waiting for a compile, and the only patch
I see that goes anywhere near the locking or cache code around that time
is this one:

2003-02-17 21:13  momjian
    * src/: backend/storage/lmgr/deadlock.c,
    backend/storage/lmgr/lock.c, backend/storage/lmgr/proc.c,
    backend/utils/adt/lockfuncs.c, include/storage/lock.h,
    include/storage/proc.h: Rename 'holder' references to 'proclock'
    for PROCLOCK references, for consistency.

which seems like a safe change (I assume it was just a
search-and-replace; do you recall, Bruce?) and anyway the time is not
quite right.

What time of day did your successive pulls correspond to, anyway?
(I believe my cvs2cl printout above is showing me EST.)
        regards, tom lane


Re: Regression test failure date.

From
Bruce Momjian
Date:
Tom Lane wrote:
> Robert Creager <Robert_Creager@LogicalChaos.org> writes:
> > 2003-02-15 passes 50/50 and 33/33 on second pass (so far)
> > 2003-02-16 fails 6/50
> 
> I looked in the CVS logs while waiting for a compile, and the only patch
> I see that goes anywhere near the locking or cache code around that time
> is this one:
> 
> 2003-02-17 21:13  momjian
> 
>     * src/: backend/storage/lmgr/deadlock.c,
>     backend/storage/lmgr/lock.c, backend/storage/lmgr/proc.c,
>     backend/utils/adt/lockfuncs.c, include/storage/lock.h,
>     include/storage/proc.h: Rename 'holder' references to 'proclock'
>     for PROCLOCK references, for consistency.
> 
> which seems like a safe change (I assume it was just a
> search-and-replace; do you recall, Bruce?) and anyway the time is not
> quite right.

Yes, just a rename operation.

> What time of day did your successive pulls correspond to, anyway?
> (I believe my cvs2cl printout above is showing me EST.)

For the date range:
pgcvs log -d'2003-02-15 00:00:00 GMT<2003-02-18 00:00:00 GMT' -rHEAD

I see:

---------------------------------------------------------------------------

/src/include/optimizer/pathnode.h    tgl
Teach planner how to propagate pathkeys from sub-SELECTs in FROM up to
the outer query.  (The implementation is a bit klugy, but it would take
nontrivial restructuring to make it nicer, which this is probably not
worth.)  This avoids unnecessary sort steps in examples like
    SELECT foo,count(*) FROM (SELECT ... ORDER BY foo,bar) sub GROUP BY foo
which means there is now a reasonable technique for controlling the
order of inputs to custom aggregates, even in the grouping case.

---
/src/test/regress/expected/case.out    tgl
COALESCE() and NULLIF() are now first-class expressions, not macros
that turn into CASE expressions.  They evaluate their arguments at most
once.  Patch by Kris Jurka, review and (very light) editorializing by
me.

---
/doc/TODO.detail/exists    momjian
Remove IN/EXISTS TODO.detail item.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: Regression test failure date.

From
Bruce Momjian
Date:
I am seeing repeatable success from a CVS of 2003-05-01, and repeatable
failure from current CVS.

I have only been running nightly parallel regression runs since June 27,
so it is possible that the parallel regression was broken in February,
fixed in May, then broken some time after that.

I will test June 1 now.

---------------------------------------------------------------------------

Robert Creager wrote:
-- Start of PGP signed section.
> 
> I found it (I think)...
> 
> Looks like something was done after the 15'th...
> 
> 2003-02-15 passes 50/50 and 33/33 on second pass (so far)
> 2003-02-16 fails 6/50
>    vacuum failed 1 times
>    misc failed 3 times
>    sanity_check failed 3 times
>    inherit failed 1 times
>    triggers failed 4 times
> 2003-02-18 fails 11/50
>    constraints failed 5 times
>    sanity_check failed 3 times
>    misc failed 8 times
>    inherit failed 2 times
>    rules failed 1 times
>    triggers failed 5 times
> 
> Cheers,
> Rob
> 
> -- 
>  17:42:41 up 8 days, 21:43,  2 users,  load average: 3.62, 2.69, 2.35
-- End of PGP section, PGP failed!

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: Regression test failure date.

From
Robert Creager
Date:
On Sat, 26 Jul 2003 20:24:56 -0400
Tom Lane <tgl@sss.pgh.pa.us> said something like:

> 
> What time of day did your successive pulls correspond to, anyway?
> (I believe my cvs2cl printout above is showing me EST.)
> 
>             regards, tom lane
> 
> 

I'm MST, and I did not specify a timezone on the cvs updates.  just <cvs
update -D 2003-02-16>

I can re-do with a specific time/date if you tell me what you want.  Or
give me a range.  I take a few minutes to do a complete cvs download.
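
A pull date given without a timezone is ambiguous when compared against
a changelog printed in another zone.  The sketch below converts a local
midnight cutoff to GMT so that the pull can be stated unambiguously.  It
assumes that cvs treats a bare date as local midnight and that MST is a
fixed UTC-7 offset; the helper name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: MST is a fixed UTC-7 offset (no daylight saving applied).
MST = timezone(timedelta(hours=-7), "MST")


def local_midnight_as_gmt(date_str, tz=MST):
    """Return the GMT instant for 00:00 local time on date_str (YYYY-MM-DD)."""
    local = datetime.strptime(date_str, "%Y-%m-%d").replace(tzinfo=tz)
    return local.astimezone(timezone.utc)


for day in ("2003-02-15", "2003-02-16"):
    gmt = local_midnight_as_gmt(day)
    # Print an explicit-timezone form of the same cutoff.
    print(f"cvs update -D '{gmt:%Y-%m-%d %H:%M:%S} GMT'")
```

Under these assumptions the 2003-02-16 pull corresponds to 07:00 GMT on
that day, which is the instant to compare against commit timestamps.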

Later,
Rob

-- 19:10:13 up 8 days, 23:10,  2 users,  load average: 0.00, 0.00, 0.00

Re: Regression test failure date.

From
Robert Creager
Date:
On Sat, 26 Jul 2003 21:08:46 -0400 (EDT)
Bruce Momjian <pgman@candle.pha.pa.us> said something like:

> 
> I am seeing repeatable success from a CVS of 2003-05-01, and
> repeatable failure from current CVS.
> 
> I have only been running nightly parallel regression runs since June
> 27, so it is possible that the parallel regression was broken in
> February, fixed in May, then broken some time after that.
> 
> I will test June 1 now.
> 

I don't know about that Bruce.  When I grabbed 2003-05-01, I have 2
failures in 15 runs so far.  One item I did have to change was to move
from bison 1.5 to bison 1.875.

I've included the first failure.

*** ./expected/triggers.out    Sat Nov 23 11:13:22 2002
--- ./results/triggers.out    Sat Jul 26 20:10:18 2003
***************
*** 87,92 ****
--- 87,93 ----
  NOTICE:  check_pkeys_fkey_cascade: 1 tuple(s) of fkeys are deleted
  NOTICE:  check_pkeys_fkey_cascade: 1 tuple(s) of fkeys2 are deleted
  DROP TABLE pkeys;
+ ERROR:  cache lookup of relation 129432 failed
  DROP TABLE fkeys;
  DROP TABLE fkeys2;
  -- -- I've disabled the funny_dup17 test because the new semantics

======================================================================

*** ./expected/sanity_check.out    Mon Aug 19 13:33:36 2002
--- ./results/sanity_check.out    Sat Jul 26 20:10:20 2003
***************
*** 58,68 ****
   pg_statistic        | t
   pg_trigger          | t
   pg_type             | t
   road                | t
   shighway            | t
   tenk1               | t
   tenk2               | t
! (52 rows)
  
  --
  -- another sanity check: every system catalog that has OIDs should have
--- 58,69 ----
   pg_statistic        | t
   pg_trigger          | t
   pg_type             | t
+  pkeys               | t
   road                | t
   shighway            | t
   tenk1               | t
   tenk2               | t
! (53 rows)
  
  --
  -- another sanity check: every system catalog that has OIDs should have

======================================================================

*** ./expected/misc.out    Sat Jul 26 20:03:48 2003
--- ./results/misc.out    Sat Jul 26 20:10:22 2003
***************
*** 633,638 ****
--- 633,639 ----
   onek2
   path_tbl
   person
+  pkeys
   point_tbl
   polygon_tbl
   ramp
***************
*** 657,663 ****
   toyemp
   varchar_tbl
   xacttest
! (93 rows)
  
  --SELECT name(equipment(hobby_construct(text 'skywalking', text 'mer'))) AS equip_name;
  SELECT hobbies_by_name('basketball');
--- 658,664 ----
   toyemp
   varchar_tbl
   xacttest
! (94 rows)
  
  --SELECT name(equipment(hobby_construct(text 'skywalking', text 'mer'))) AS equip_name;
  SELECT hobbies_by_name('basketball');

======================================================================



-- 20:11:31 up 9 days, 12 min,  2 users,  load average: 2.86, 2.30, 1.52

Re: Regression test failure date.

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> I have only been running nightly parallel regression runs since June 27,
> so it is possible that the parallel regression was broken in February,
> fixed in May, then broken some time after that.

Any further progress on this?

My best theory at the moment is that we have a problem with relcache
entry creation failing if it's interrupted by an SI inval message at
just the right time.  I don't much want to grovel through six months
worth of changelog entries looking for candidate mistakes, though.
        regards, tom lane


Re: Regression test failure date.

From
Robert Creager
Date:
I will stand by the fact that I cannot generate failures from
2003-02-15 (200+ runs), and I can from 2003-02-16.  Just to make sure I
didn't screw up the cvs usage, I'll try again tonight if I get the
chance and re-download and re-test these two days.

I can set up a script that will step through weekly dates starting from
'now' and see if the 02-16 problem might have been fixed and then
re-introduced if you like.
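
A minimal sketch of such a weekly stepping script, which only prints the
checkout commands rather than running them.  The start and end dates,
the function names, and the explicit GMT suffix are assumptions made for
illustration; building and running the regression tests at each step is
left out.

```python
from datetime import date, timedelta


def weekly_dates(newest, oldest):
    """Yield dates from newest back to oldest (inclusive) in 7-day steps."""
    current = newest
    while current >= oldest:
        yield current
        current -= timedelta(days=7)


def checkout_command(day):
    # 'cvs update -D <date>' is the form used earlier in the thread; the
    # explicit time and GMT suffix are an added assumption to avoid
    # timezone ambiguity between machines.
    return f"cvs update -D '{day.isoformat()} 00:00:00 GMT'"


if __name__ == "__main__":
    # Hypothetical range: from the date of this message back to 2003-02-16.
    for day in weekly_dates(date(2003, 7, 28), date(2003, 2, 16)):
        print(checkout_command(day))
```

Stepping at a fixed interval instead of bisecting keeps the search valid
even if the problem was fixed and then re-introduced in between.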

2003-02-16 fails 6/50
   vacuum failed 1 times
   misc failed 3 times
   sanity_check failed 3 times
   inherit failed 1 times
   triggers failed 4 times

Cheers,
Rob

On Mon, 28 Jul 2003 02:14:32 -0400
Tom Lane <tgl@sss.pgh.pa.us> said something like:

> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > I have only been running nightly parallel regression runs since June
> > 27, so it is possible that the parallel regression was broken in
> > February, fixed in May, then broken some time after that.
> 
> Any further progress on this?
> 
> My best theory at the moment is that we have a problem with relcache
> entry creation failing if it's interrupted by an SI inval message at
> just the right time.  I don't much want to grovel through six months
> worth of changelog entries looking for candidate mistakes, though.
> 
>             regards, tom lane
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 3: if posting/reading through Usenet, please send an appropriate
>       subscribe-nomail command to majordomo@postgresql.org so that
>       your message can get through to the mailing list cleanly
> 
> 


-- 06:57:40 up 10 days, 10:57,  2 users,  load average: 2.17, 2.08, 1.83

Re: Regression test failure date.

From
Bruce Momjian
Date:
I am testing this today.  I found 2003-03-03 to not generate a failure
in 20 tests, so I am moving forward to April/May.
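
With a sporadic failure, a clean batch of runs is weak evidence.  The
sketch below estimates how likely an all-pass batch is for a given
per-run failure rate.  It assumes runs are independent and that the rate
matches the 6/50 reported for 2003-02-16; both are assumptions, and the
function names are hypothetical.

```python
import math


def prob_all_pass(fail_rate, runs):
    """Probability that `runs` independent runs all pass."""
    return (1.0 - fail_rate) ** runs


def runs_needed(fail_rate, confidence=0.95):
    """Smallest run count at which at least one failure is expected to
    appear with the given confidence, if the bug is present."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - fail_rate))


rate = 6 / 50  # failure rate reported for the 2003-02-16 snapshot
print(f"P(20 clean runs | bug present) = {prob_all_pass(rate, 20):.3f}")
print(f"runs for 95% confidence        = {runs_needed(rate)}")
```

At a 12% failure rate, 20 clean runs would still happen roughly 8% of
the time with the bug present, so a longer batch is needed before a
snapshot can be called good.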

---------------------------------------------------------------------------

Robert Creager wrote:
-- Start of PGP signed section.
> 
> I will stand by the fact that I cannot generate failures from
> 2003-02-15 (200+ runs), and I can from 2003-02-16.  Just to make sure I
> didn't screw up the cvs usage, I'll try again tonight if I get the
> chance and re-download and re-test these two days.
> 
> I can set up a script that will step through weekly dates starting from
> 'now' and see if the 02-16 problem might have been fixed and then
> re-introduced if you like.
> 
> 2003-02-16 fails 6/50
>    vacuum failed 1 times
>    misc failed 3 times
>    sanity_check failed 3 times
>    inherit failed 1 times
>    triggers failed 4 times
> 
> Cheers,
> Rob
> 
> On Mon, 28 Jul 2003 02:14:32 -0400
> Tom Lane <tgl@sss.pgh.pa.us> said something like:
> 
> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > I have only been running nightly parallel regression runs since June
> > > 27, so it is possible that the parallel regression was broken in
> > > February, fixed in May, then broken some time after that.
> > 
> > Any further progress on this?
> > 
> > My best theory at the moment is that we have a problem with relcache
> > entry creation failing if it's interrupted by an SI inval message at
> > just the right time.  I don't much want to grovel through six months
> > worth of changelog entries looking for candidate mistakes, though.
> > 
> >             regards, tom lane
> > 
> > 
> > 
> 
> 
> -- 
>  06:57:40 up 10 days, 10:57,  2 users,  load average: 2.17, 2.08, 1.83
-- End of PGP section, PGP failed!

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: Regression test failure date.

From
Bruce Momjian
Date:
I am now seeing this error in 2003-03-03.
  CREATE TABLE INSERT_CHILD (cx INT default 42,
        cy INT CHECK (cy > x))
        INHERITS (INSERT_TBL);
+ ERROR:  RelationClearRelation: relation 130996 deleted while still in use

---------------------------------------------------------------------------

Bruce Momjian wrote:
> 
> I am testing this today.  I found 2003-03-03 to not generate a failure
> in 20 tests, so I am moving forward to April/May.
> 
> ---------------------------------------------------------------------------
> 
> Robert Creager wrote:
> -- Start of PGP signed section.
> > 
> > I will stand by the fact that I cannot generate failures from
> > 2003-02-15 (200+ runs), and I can from 2003-02-16.  Just to make sure I
> > didn't screw up the cvs usage, I'll try again tonight if I get the
> > chance and re-download and re-test these two days.
> > 
> > I can set up a script that will step through weekly dates starting from
> > 'now' and see if the 02-16 problem might have been fixed and then
> > re-introduced if you like.
> > 
> > 2003-02-16 fails 6/50
> >    vacuum failed 1 times
> >    misc failed 3 times
> >    sanity_check failed 3 times
> >    inherit failed 1 times
> >    triggers failed 4 times
> > 
> > Cheers,
> > Rob
> > 
> > On Mon, 28 Jul 2003 02:14:32 -0400
> > Tom Lane <tgl@sss.pgh.pa.us> said something like:
> > 
> > > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > > I have only been running nightly parallel regression runs since June
> > > > 27, so it is possible that the parallel regression was broken in
> > > > February, fixed in May, then broken some time after that.
> > > 
> > > Any further progress on this?
> > > 
> > > My best theory at the moment is that we have a problem with relcache
> > > entry creation failing if it's interrupted by an SI inval message at
> > > just the right time.  I don't much want to grovel through six months
> > > worth of changelog entries looking for candidate mistakes, though.
> > > 
> > >             regards, tom lane
> > > 
> > > 
> > > 
> > 
> > 
> > -- 
> >  06:57:40 up 10 days, 10:57,  2 users,  load average: 2.17, 2.08, 1.83
> -- End of PGP section, PGP failed!
> 
> -- 
>   Bruce Momjian                        |  http://candle.pha.pa.us
>   pgman@candle.pha.pa.us               |  (610) 359-1001
>   +  If your life is a hard drive,     |  13 Roberts Road
>   +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
> 

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: Regression test failure date.

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> I am now seeing this error in 2003-03-03.

>   CREATE TABLE INSERT_CHILD (cx INT default 42,
>         cy INT CHECK (cy > x))
>         INHERITS (INSERT_TBL);
> + ERROR:  RelationClearRelation: relation 130996 deleted while still in use

Define "now seeing".  Did you change something?  Did you just run more
test cycles and it happened one time?  Did it suddenly start to happen a
lot?
        regards, tom lane