Thread: join regression failure on cygwin
My Cygwin buildfarm member started failing (hanging, in fact) recently. It seems to hang consistently in join.sql and the only way I can get it to complete is to kill the backend fairly violently. See <http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=brown_bat&dt=2009-07-22%2023:10:21> It looks like this is the result of the semi-join ordering bug fix Tom applied a few days ago, but it is building and running 8.4 quite happily, and I think that branch got the same changes, so I'm not quite sure about it. I'm going to try reversing that change locally to see if it fixes the problem. Unfortunately, this isn't an easy platform to debug. cheers andrew
I wrote: > > My Cygwin buildfarm member started failing (hanging, in fact) > recently. It seems to hang consistently in join.sql and the only way I > can get it to complete is to kill the backend fairly violently. See > <http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=brown_bat&dt=2009-07-22%2023:10:21> > It looks like this is the result of the semi-join ordering bug fix Tom > applied a few days ago, but it is building and running 8.4 quite > happily, and I think that branch got the same changes, so I'm not > quite sure about it. > > I'm going to try reversing that change locally to see if it fixes the > problem. Unfortunately, this isn't an easy platform to debug. > > Nope, it wasn't that. So my next candidate is the prior join_is_legal/GEQO changes. But reverting that would get rid of the new regression test where we seem to be failing, so I'm not quite sure where to go on it. cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: >> My Cygwin buildfarm member started failing (hanging, in fact) >> recently. It seems to hang consistently in join.sql and the only way I >> can get it to complete is to kill the backend fairly violently. See >> <http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=brown_bat&dt=2009-07-22%2023:10:21> > Nope, it wasn't that. So my next candidate is the prior > join_is_legal/GEQO changes. But reverting that would get rid of the new > regression test where we seem to be failing, so I'm not quite sure where > to go on it. I see it claims to have working erand48, but maybe not so much in reality? regards, tom lane
Tom Lane wrote: > > I see it claims to have working erand48, but maybe not so much in > reality? > > > Good guess. I removed erand48 from the configure file and added erand48.o to OBJS in src/port/Makefile and it suddenly sailed through. cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > Tom Lane wrote: >> I see it claims to have working erand48, but maybe not so much in >> reality? > Good guess. I removed erand48 from the configure file and added > erand48.o to OBJS in src/port/Makefile and it suddenly sailed through. Hmm. So we need to figure out how to improve configure's check so that it rejects whatever broken version you've got ... regards, tom lane
Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > >> Tom Lane wrote: >> >>> I see it claims to have working erand48, but maybe not so much in >>> reality? >>> > > >> Good guess. I removed erand48 from the configure file and added >> erand48.o to OBJS in src/port/Makefile and it suddenly sailed through. >> > > Hmm. So we need to figure out how to improve configure's check so that > it rejects whatever broken version you've got ... > > > Yeah. Any ideas? I'd hate just to exclude the system erand48 on Cygwin and then find out later it's broken on some other abstruse system. cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > Tom Lane wrote: >> Hmm. So we need to figure out how to improve configure's check so that >> it rejects whatever broken version you've got ... > Yeah. Any ideas? I'd hate just to exclude the system erand48 on Cygwin > and then find out later it's broken on some other abstruse system. Seems like it would be useful to figure out exactly why it's failing. I don't personally have a problem with just forcing use of our own erand48 on Cygwin; it's not a lot of code and it would make the behavior of that build more like the MSVC build. But it's curious that such a simple library function is seemingly broken on Cygwin ... especially when their random() and srandom() evidently work. regards, tom lane
Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > >> Tom Lane wrote: >> >>> Hmm. So we need to figure out how to improve configure's check so that >>> it rejects whatever broken version you've got ... >>> > > >> Yeah. Any ideas? I'd hate just to exclude the system erand48 on Cygwin >> and then find out later it's broken on some other abstruse system. >> > > Seems like it would be useful to figure out exactly why it's failing. > > I don't personally have a problem with just forcing use of our own > erand48 on Cygwin; it's not a lot of code and it would make the behavior > of that build more like the MSVC build. But it's curious that such a > simple library function is seemingly broken on Cygwin ... especially > when their random() and srandom() evidently work. > I'll work on it, but for now I propose to make the following change to configure.in and the corresponding change in configure: diff -u -r1.605 configure.in --- configure.in 16 Jul 2009 17:43:52 -0000 1.605 +++ configure.in 23 Jul 2009 22:39:19 -0000 @@ -1249,7 +1249,7 @@pgac_save_LIBS="$LIBS"LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'` -AC_REPLACE_FUNCS([crypt erand48 getopt getrusage inet_aton random rint srandom strdup strerror strlcat strlcpy strtol strtoul]) +AC_REPLACE_FUNCS([crypt getopt getrusage inet_aton random rint srandom strdup strerror strlcat strlcpy strtol strtoul])case $host_os in @@ -1262,6 +1262,12 @@ ;;esac +# Cygwin's erand48 sometimes hangs, so force use of ours +if test "$PORTNAME" = "cygwin"; then + AC_LIBOBJ(erand48) +else + AC_REPLACE_FUNCS([erand48]) +fiLIBS="$pgac_save_LIBS" > regards, tom lane > >
Andrew Dunstan <andrew@dunslane.net> writes: > I'll work on it, but for now I propose to make the following change to > configure.in and the corresponding change in configure: I believe you can just add AC_LIBOBJ(erand48) in the Cygwin-specific section without touching the other part; that's supposed to be a no-op if the filename has already been added to LIBOBJS. regards, tom lane
Tom Lane wrote: > I don't personally have a problem with just forcing use of our own > erand48 on Cygwin; it's not a lot of code and it would make the behavior > of that build more like the MSVC build. But it's curious that such a > simple library function is seemingly broken on Cygwin ... especially > when their random() and srandom() evidently work. > > It appears on Googling a bit that the erand48() is buggy in that it requires the seed to have been initialized with srand48() or it will constantly return 0.0. So I think just forcing use of ours is the safe way to go. It might have been fixed since I installed Cygwin, although I can't find a reference to that, and I don't feel like triangulating it anyway. cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > It appears on Googling a bit that the erand48() is buggy in that it > requires the seed to have been initialized with srand48() or it will > constantly return 0.0. Huh, and that sends us into an infinite loop? I'll take a look at that. Even though it's surely nonrandom, it doesn't seem like pathological behavior of the RNG should wedge us completely. > So I think just forcing use of ours is the safe way to go. Agreed, but I'm curious ... regards, tom lane
I wrote: > Andrew Dunstan <andrew@dunslane.net> writes: >> It appears on Googling a bit that the erand48() is buggy in that it >> requires the seed to have been initialized with srand48() or it will >> constantly return 0.0. > Huh, and that sends us into an infinite loop? I'll take a look at that. > Even though it's surely nonrandom, it doesn't seem like pathological > behavior of the RNG should wedge us completely. The answer is that a constant RNG result sends this bit of geqo_selection() into a tight loop: int first, second; first = linear(root, pool->size, bias); second = linear(root, pool->size, bias); if (pool->size > 1) { while (first == second) second = linear(root, pool->size, bias); } Not sure if it's worth trying to do something about that, or exactly what we'd do anyway. Even if we hacked this up somehow, a constant RNG result would pretty much break GEQO for any useful purpose. So it could be argued that having the regression tests fail here is a good thing... regards, tom lane
Tom Lane wrote: > I wrote: > >> Andrew Dunstan <andrew@dunslane.net> writes: >> >>> It appears on Googling a bit that the erand48() is buggy in that it >>> requires the seed to have been initialized with srand48() or it will >>> constantly return 0.0. >>> > > >> Huh, and that sends us into an infinite loop? I'll take a look at that. >> Even though it's surely nonrandom, it doesn't seem like pathological >> behavior of the RNG should wedge us completely. >> > > The answer is that a constant RNG result sends this bit of > geqo_selection() into a tight loop: > > int first, > second; > > first = linear(root, pool->size, bias); > second = linear(root, pool->size, bias); > > if (pool->size > 1) > { > while (first == second) > second = linear(root, pool->size, bias); > } > > Not sure if it's worth trying to do something about that, or exactly > what we'd do anyway. Even if we hacked this up somehow, a constant RNG > result would pretty much break GEQO for any useful purpose. So it could > be argued that having the regression tests fail here is a good thing... > > > Right. Let's let sleeping dogs lie. I think at most a code comment is the only action called for. cheers andrew