Thread: join regression failure on cygwin

join regression failure on cygwin

From
Andrew Dunstan
Date:
My Cygwin buildfarm member started failing (hanging, in fact) recently. 
It seems to hang consistently in join.sql and the only way I can get it 
to complete is to kill the backend fairly violently. See 
<http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=brown_bat&dt=2009-07-22%2023:10:21> 
It looks like this is the result of the semi-join ordering bug fix Tom 
applied a few days ago, but it is building and running 8.4 quite 
happily, and I think that branch got the same changes, so I'm not quite 
sure about it.

I'm going to try reversing that change locally to see if it fixes the 
problem. Unfortunately, this isn't an easy platform to debug.

cheers

andrew


Re: join regression failure on cygwin

From
Andrew Dunstan
Date:
I wrote:
>
> My Cygwin buildfarm member started failing (hanging, in fact) 
> recently. It seems to hang consistently in join.sql and the only way I 
> can get it to complete is to kill the backend fairly violently. See 
> <http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=brown_bat&dt=2009-07-22%2023:10:21> 
> It looks like this is the result of the semi-join ordering bug fix Tom 
> applied a few days ago, but it is building and running 8.4 quite 
> happily, and I think that branch got the same changes, so I'm not 
> quite sure about it.
>
> I'm going to try reversing that change locally to see if it fixes the 
> problem. Unfortunately, this isn't an easy platform to debug.
>
>

Nope, it wasn't that. So my next candidate is the prior 
join_is_legal/GEQO changes. But reverting that would get rid of the new 
regression test where we seem to be failing, so I'm not quite sure where 
to go on it.

cheers

andrew


Re: join regression failure on cygwin

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
>> My Cygwin buildfarm member started failing (hanging, in fact) 
>> recently. It seems to hang consistently in join.sql and the only way I 
>> can get it to complete is to kill the backend fairly violently. See 
>> <http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=brown_bat&dt=2009-07-22%2023:10:21> 

> Nope, it wasn't that. So my next candidate is the prior 
> join_is_legal/GEQO changes. But reverting that would get rid of the new 
> regression test where we seem to be failing, so I'm not quite sure where 
> to go on it.

I see it claims to have working erand48, but maybe not so much in
reality?
        regards, tom lane


Re: join regression failure on cygwin

From
Andrew Dunstan
Date:

Tom Lane wrote:
>
> I see it claims to have working erand48, but maybe not so much in
> reality?
>
>             
>   

Good guess. I removed erand48 from the configure file and added 
erand48.o  to OBJS in src/port/Makefile and it suddenly sailed through.

cheers

andrew


Re: join regression failure on cygwin

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> Tom Lane wrote:
>> I see it claims to have working erand48, but maybe not so much in
>> reality?

> Good guess. I removed erand48 from the configure file and added 
> erand48.o  to OBJS in src/port/Makefile and it suddenly sailed through.

Hmm.  So we need to figure out how to improve configure's check so that
it rejects whatever broken version you've got ...
        regards, tom lane


Re: join regression failure on cygwin

From
Andrew Dunstan
Date:

Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>   
>> Tom Lane wrote:
>>     
>>> I see it claims to have working erand48, but maybe not so much in
>>> reality?
>>>       
>
>   
>> Good guess. I removed erand48 from the configure file and added 
>> erand48.o  to OBJS in src/port/Makefile and it suddenly sailed through.
>>     
>
> Hmm.  So we need to figure out how to improve configure's check so that
> it rejects whatever broken version you've got ...
>
>             
>   

Yeah.  Any ideas? I'd hate just to exclude the system erand48 on Cygwin 
and then find out later it's broken on some other abstruse system.

cheers

andrew


Re: join regression failure on cygwin

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> Tom Lane wrote:
>> Hmm.  So we need to figure out how to improve configure's check so that
>> it rejects whatever broken version you've got ...

> Yeah.  Any ideas? I'd hate just to exclude the system erand48 on Cygwin 
> and then find out later it's broken on some other abstruse system.

Seems like it would be useful to figure out exactly why it's failing.

I don't personally have a problem with just forcing use of our own
erand48 on Cygwin; it's not a lot of code and it would make the behavior
of that build more like the MSVC build.  But it's curious that such a
simple library function is seemingly broken on Cygwin ... especially
when their random() and srandom() evidently work.
        regards, tom lane


Re: join regression failure on cygwin

From
Andrew Dunstan
Date:

Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>   
>> Tom Lane wrote:
>>     
>>> Hmm.  So we need to figure out how to improve configure's check so that
>>> it rejects whatever broken version you've got ...
>>>       
>
>   
>> Yeah.  Any ideas? I'd hate just to exclude the system erand48 on Cygwin 
>> and then find out later it's broken on some other abstruse system.
>>     
>
> Seems like it would be useful to figure out exactly why it's failing.
>
> I don't personally have a problem with just forcing use of our own
> erand48 on Cygwin; it's not a lot of code and it would make the behavior
> of that build more like the MSVC build.  But it's curious that such a
> simple library function is seemingly broken on Cygwin ... especially
> when their random() and srandom() evidently work.
>   

I'll work on it, but for now I propose to make the following change to 
configure.in and the corresponding change in configure:

diff -u -r1.605 configure.in
--- configure.in        16 Jul 2009 17:43:52 -0000      1.605
+++ configure.in        23 Jul 2009 22:39:19 -0000
@@ -1249,7 +1249,7 @@pgac_save_LIBS="$LIBS"LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
-AC_REPLACE_FUNCS([crypt erand48 getopt getrusage inet_aton random rint 
srandom strdup strerror strlcat strlcpy strtol strtoul])
+AC_REPLACE_FUNCS([crypt getopt getrusage inet_aton random rint srandom 
strdup strerror strlcat strlcpy strtol strtoul])case $host_os in
@@ -1262,6 +1262,12 @@               ;;esac
+# Cygwin's erand48 sometimes hangs, so force use of ours
+if test "$PORTNAME" = "cygwin"; then
+  AC_LIBOBJ(erand48)
+else
+  AC_REPLACE_FUNCS([erand48])
+fiLIBS="$pgac_save_LIBS"



>             regards, tom lane
>
>   


Re: join regression failure on cygwin

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> I'll work on it, but for now I propose to make the following change to 
> configure.in and the corresponding change in configure:

I believe you can just add AC_LIBOBJ(erand48) in the Cygwin-specific
section without touching the other part; that's supposed to be a no-op
if the filename has already been added to LIBOBJS.
        regards, tom lane


Re: join regression failure on cygwin

From
Andrew Dunstan
Date:

Tom Lane wrote:
> I don't personally have a problem with just forcing use of our own
> erand48 on Cygwin; it's not a lot of code and it would make the behavior
> of that build more like the MSVC build.  But it's curious that such a
> simple library function is seemingly broken on Cygwin ... especially
> when their random() and srandom() evidently work.
>
>   

It appears on Googling a bit that the erand48() is buggy in that it 
requires the seed to have been initialized with srand48() or it will 
constantly return 0.0.

So I think just forcing use of ours is the safe way to go. It might have 
been fixed since I installed Cygwin, although I can't find a reference 
to that, and I don't feel like triangulating it anyway.

cheers

andrew


Re: join regression failure on cygwin

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> It appears on Googling a bit that the erand48() is buggy in that it 
> requires the seed to have been initialized with srand48() or it will 
> constantly return 0.0.

Huh, and that sends us into an infinite loop?  I'll take a look at that.
Even though it's surely nonrandom, it doesn't seem like pathological
behavior of the RNG should wedge us completely.

> So I think just forcing use of ours is the safe way to go.

Agreed, but I'm curious ...
        regards, tom lane


Re: join regression failure on cygwin

From
Tom Lane
Date:
I wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> It appears on Googling a bit that the erand48() is buggy in that it 
>> requires the seed to have been initialized with srand48() or it will 
>> constantly return 0.0.

> Huh, and that sends us into an infinite loop?  I'll take a look at that.
> Even though it's surely nonrandom, it doesn't seem like pathological
> behavior of the RNG should wedge us completely.

The answer is that a constant RNG result sends this bit of
geqo_selection() into a tight loop:
   int         first,               second;
   first = linear(root, pool->size, bias);   second = linear(root, pool->size, bias);
   if (pool->size > 1)   {       while (first == second)           second = linear(root, pool->size, bias);   }

Not sure if it's worth trying to do something about that, or exactly
what we'd do anyway.  Even if we hacked this up somehow, a constant RNG
result would pretty much break GEQO for any useful purpose.  So it could
be argued that having the regression tests fail here is a good thing...
        regards, tom lane


Re: join regression failure on cygwin

From
Andrew Dunstan
Date:

Tom Lane wrote:
> I wrote:
>   
>> Andrew Dunstan <andrew@dunslane.net> writes:
>>     
>>> It appears on Googling a bit that the erand48() is buggy in that it 
>>> requires the seed to have been initialized with srand48() or it will 
>>> constantly return 0.0.
>>>       
>
>   
>> Huh, and that sends us into an infinite loop?  I'll take a look at that.
>> Even though it's surely nonrandom, it doesn't seem like pathological
>> behavior of the RNG should wedge us completely.
>>     
>
> The answer is that a constant RNG result sends this bit of
> geqo_selection() into a tight loop:
>
>     int         first,
>                 second;
>
>     first = linear(root, pool->size, bias);
>     second = linear(root, pool->size, bias);
>
>     if (pool->size > 1)
>     {
>         while (first == second)
>             second = linear(root, pool->size, bias);
>     }
>
> Not sure if it's worth trying to do something about that, or exactly
> what we'd do anyway.  Even if we hacked this up somehow, a constant RNG
> result would pretty much break GEQO for any useful purpose.  So it could
> be argued that having the regression tests fail here is a good thing...
>
>             
>   

Right. Let's let sleeping dogs lie. I think at most a code comment is 
the only action called for.

cheers

andrew