Thread: Oops - BF:Mastodon just died

Re: Oops - BF:Mastodon just died

From
Magnus Hagander
Date:
Dave Page wrote:
> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=mastodon&dt=2008-01-30%2020:00:00

Maybe I shouldn't have had those beers after work today, but that looks 
like it's for example failing tsearch2, which hasn't been touched for 
over a month!

Any chance there's something dodgy in the build env?

(If I'm missing the obvious, I blame the beer!)

//Magnus



Re: Oops - BF:Mastodon just died

From
"Dave Page"
Date:
On Jan 30, 2008 9:13 PM, Magnus Hagander <magnus@hagander.net> wrote:
> Dave Page wrote:
> > http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=mastodon&dt=2008-01-30%2020:00:00
>
> Maybe I shouldn't have had those beers after work today, but that looks
> like it's for example failing tsearch2, which hasn't been touched for
> over a month!
>
> Any chance there's something dodgy in the build env?

I can't remember the last time I logged into that box so if it's
something in the buildenv, it's either caused by a Windows update, or
some failing hardware.

/D


Re: Oops - BF:Mastodon just died

From
Magnus Hagander
Date:
Dave Page wrote:
> On Jan 30, 2008 9:13 PM, Magnus Hagander <magnus@hagander.net> wrote:
>> Dave Page wrote:
>>> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=mastodon&dt=2008-01-30%2020:00:00
>> Maybe I shouldn't have had those beers after work today, but that looks
>> like it's for example failing tsearch2, which hasn't been touched for
>> over a month!
>>
>> Any chance there's something dodgy in the build env?
> 
> I can't remember the last time I logged into that box so if it's
> something in the buildenv, it's either caused by a Windows update, or
> some failing hardware.

I won't have access to my MSVC box until tomorrow, but unless beaten to 
it I can dig into it a bit more. I don't see anything obvious in the
latest patches though (but again, that could be the beer :-P).

Any chance you could just do a forced run on it now to show if it was 
some kind of transient stuff?

//Magnus


Re: Oops - BF:Mastodon just died

From
Andrew Dunstan
Date:

Dave Page wrote:
> On Jan 30, 2008 9:13 PM, Magnus Hagander <magnus@hagander.net> wrote:
>   
>> Dave Page wrote:
>>     
>>> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=mastodon&dt=2008-01-30%2020:00:00
>>>       
>> Maybe I shouldn't have had those beers after work today, but that looks
>> like it's for example failing tsearch2, which hasn't been touched for
>> over a month!
>>
>> Any chance there's something dodgy in the build env?
>>     
>
> I can't remember the last time I logged into that box so if it's
> something in the buildenv, it's either caused by a Windows update, or
> some failing hardware.
>
>
>   

None of the CVS changes in the relevant period seems to have any 
relation to the errors, so I suspect a local problem.

red_bat is due to build in a couple of hours, so we will soon see if it 
reproduces the error.

cheers

andrew



Re: Oops - BF:Mastodon just died

From
"Dave Page"
Date:
On Jan 30, 2008 9:21 PM, Magnus Hagander <magnus@hagander.net> wrote:
>
> I won't have access to my MSVC box until tomorrow, but unless beaten to
> it I can dig into it a bit more. I don't see anything obvious in the
> latest patches though (but again, that could be the beer :-P).
>
> Any chance you could just do a forced run on it now to show if it was
> some kind of transient stuff?

Not from here. :-(

/D


Re: Oops - BF:Mastodon just died

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> None of the CVS changes in the relevant period seems to have any 
> relation to the errors, so I suspect a local problem.

skylark and baiji are now red too, so I guess that theory is dead in the
water.  Something in today's changes broke the MSVC build, but what?

I diffed yesterday's and today's make logs from skylark, and found
nothing interesting except this:

***************
*** 605,611 ****
          Generate DEF file^M
          Generating POSTGRES.DEF from directory Release\postgres^M
  ........................................ [progress dots trimmed] ^M
!         Generated 5208 symbols^M
          Linking...^M
             Creating library Release\postgres\postgres.lib and object Release\postgres\postgres.exp^M
          Embedding manifest...^M
--- 605,611 ----
          Generate DEF file^M
          Generating POSTGRES.DEF from directory Release\postgres^M
  ........................................ [progress dots trimmed] ^M
!         Generated 5205 symbols^M
          Linking...^M
             Creating library Release\postgres\postgres.lib and object Release\postgres\postgres.exp^M
          Embedding manifest...^M
***************

Presumably the three missing symbols include the two that are being
complained of later, but what the heck?

(Hmm, actually today's commits should have added two global symbols to
the backend, so it seems there are five not three symbols to be
accounted for.)

It is probably significant that both of the known missing symbols come
from guc.c, which we added another variable to today.  I have a
sickening feeling that we have hit some kind of undocumented internal
limit in MSVC as to the number of symbols imported/exported by one
source file...
        regards, tom lane


Re: Oops - BF:Mastodon just died

From
Tom Lane
Date:
I wrote:
> I diffed yesterday's and today's make logs from skylark, and found
> nothing interesting except this:

> ***************
> *** 605,611 ****
>           Generating POSTGRES.DEF from directory Release\postgres^M
> !         Generated 5208 symbols^M
>           Linking...^M
> --- 605,611 ----
>           Generating POSTGRES.DEF from directory Release\postgres^M
> !         Generated 5205 symbols^M
>           Linking...^M
> ***************

Looking at this a bit closer, I realize that it's coming from
gendef.pl's dumpbin usage of recent infamy.  So there are a couple
of ideas that come to mind:

* Has the buildfarm script changed recently in a way that might change
the execution PATH and thereby suck in a different version of dumpbin?
(Or even a different version of Perl?)

* Is it conceivable that dumpbin's output format has changed in a way
that confuses the bit of Perl code that's parsing it?  One idea that
comes to mind is that it contains a timestamp that just got wider ---
I remember seeing some bugs like that when the value of Unix time_t
reached 1 billion and became 10 instead of 9 digits.

Neither of these sound very plausible, but it seems the next step for
investigation is to look closely at what's happening in gendef.pl.
        regards, tom lane


Re: Oops - BF:Mastodon just died

From
Tom Lane
Date:
"Dave Page" <dpage@postgresql.org> writes:
> I can't remember the last time I logged into that box so if it's
> something in the buildenv, it's either caused by a Windows update,

Re-reading the thread ... could that last point be significant?  Are
all four of these boxen set to auto-accept updates from Redmond?
        regards, tom lane


Re: Oops - BF:Mastodon just died

From
Andrew Dunstan
Date:

Tom Lane wrote:
>
> * Has the buildfarm script changed recently in a way that might change
> the execution PATH and thereby suck in a different version of dumpbin?
> (Or even a different version of Perl?)
>   


No. In at least the case of red_bat nothing has changed for months.

> * Is it conceivable that dumpbin's output format has changed in a way
> that confuses the bit of Perl code that's parsing it?  One idea that
> comes to mind is that it contains a timestamp that just got wider ---
> I remember seeing some bugs like that when the value of Unix time_t
> reached 1 billion and became 10 instead of 9 digits.
>
> Neither of these sound very plausible, but it seems the next step for
> investigation is to look closely at what's happening in gendef.pl.
>
>             
>   

Right. I agree that your diff makes gendef.pl the prime suspect.

You also just said:
> "Dave Page" <dpage@postgresql.org> writes:
>   
>> > I can't remember the last time I logged into that box so if it's
>> > something in the buildenv, it's either caused by a Windows update,
>>     
>
> Re-reading the thread ... could that last point be significant?  Are
> all four of these boxen set to auto-accept updates from Redmond?

No. red_bat does not auto-accept anything.

cheers

andrew


Re: Oops - BF:Mastodon just died

From
Andrew Dunstan
Date:

Tom Lane wrote:
>
> Neither of these sound very plausible, but it seems the next step for
> investigation is to look closely at what's happening in gendef.pl.
>
>             
>   

Yes, I have found the problem. It is this line, which I am amazed hasn't 
bitten us before:
       next unless /^\d/;

The first field in the dumpbin output looks like a 3 digit hex number. 
The line on my system for GetConfigOptionByName starts with 'A02' which 
of course fails the test above.

For now I'm going to try to fix it by changing it to:
       next unless $pieces[0] =~/^[A-F0-9]{3}$/;
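
To make the failure concrete, here is a minimal sketch of the two filters.
It is written in Python rather than Perl and is not the actual gendef.pl
code; the helper names are hypothetical, and every sample index except
'A02' (the value reported above) is made up for illustration.

```python
import re

# Python stand-ins for the two Perl patterns quoted above.
OLD_FILTER = re.compile(r"^\d")            # old test: must start with a decimal digit
NEW_FILTER = re.compile(r"^[A-F0-9]{3}$")  # proposed test: exactly three uppercase hex digits


def kept_by(pattern, field):
    """Return True if the line would survive the 'next unless' test."""
    return pattern.search(field) is not None


# 'A02' is the index reported for GetConfigOptionByName; the rest are illustrative.
samples = ["001", "9FF", "A02", "B7C"]

old_kept = [f for f in samples if kept_by(OLD_FILTER, f)]
new_kept = [f for f in samples if kept_by(NEW_FILTER, f)]

print("old filter keeps:", old_kept)
print("new filter keeps:", new_kept)
```

The old test silently drops any entry whose index begins with A-F, which is
why nothing went wrong until the index column passed 9FF.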

I also propose to have the gendef.pl script save the dumpbin output so 
this sort of problem will be easier to debug.

cheers

andrew


Re: Oops - BF:Mastodon just died

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> Yes, I have found the problem. It is this line, which I am amazed hasn't 
> bitten us before:
>         next unless /^\d/;
> The first field in the dumpbin output looks like a 3 digit hex number. 

Argh, so it was crossing a power-of-2 boundary that got us.  Good catch.

> For now I'm going to try to fix it by changing it to:
>         next unless $pieces[0] =~/^[A-F0-9]{3}$/;

Check.

> I also propose to have the gendef.pl script save the dumpbin output so 
> this sort of problem will be easier to debug.

Agreed, but I suggest waiting till 8.4 is branched unless you are really
sure about this addition.  We freeze for 8.3.0 in less than 24 hours.
        regards, tom lane


Re: Oops - BF:Mastodon just died

From
"Dave Page"
Date:
On Jan 31, 2008 1:33 AM, Andrew Dunstan <andrew@dunslane.net> wrote:
> > Re-reading the thread ... could that last point be significant?  Are
> > all four of these boxen set to auto-accept updates from Redmond?
>
> No. red_bat does not auto-accept anything.

For future reference, my BF members do auto-accept updates (though
they only reboot if I tell them to). It seems like having red_bat do
the opposite provides a useful baseline for tracking down future
issues.

I wonder if it would be worth adding a notes field to the BF so we can
record this sort of detail...

/D


Re: Oops - BF:Mastodon just died

From
Magnus Hagander
Date:
On Thu, Jan 31, 2008 at 08:28:21AM +0000, Dave Page wrote:
> On Jan 31, 2008 1:33 AM, Andrew Dunstan <andrew@dunslane.net> wrote:
> > > Re-reading the thread ... could that last point be significant?  Are
> > > all four of these boxen set to auto-accept updates from Redmond?
> >
> > No. red_bat does not auto-accept anything.
> 
> For future reference, my BF members do auto-accept updates (though
> they only reboot if I tell them to). It seems like having red_bat do
> the opposite provides a useful baseline for tracking down future
> issues.
> 
> I wonder if it would be worth adding a notes field to the BF so we can
> record this sort of detail...

+1. That should be interesting for non-win32 platforms as well... Assuming
it's not too much work, of course ;)

I have yet to see the first case where a windows update breaks PostgreSQL
in any way though, but once it happens it would be nice to have the info.

//Magnus


Re: Oops - BF:Mastodon just died

From
Magnus Hagander
Date:
On Thu, Jan 31, 2008 at 12:45:40AM -0500, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
> > Yes, I have found the problem. It is this line, which I am amazed hasn't 
> > bitten us before:
> >         next unless /^\d/;
> > The first field in the dumpbin output looks like a 3 digit hex number. 
> 
> Argh, so it was crossing a power-of-2 boundary that got us.  Good catch.
> 
> > For now I'm going to try to fix it by changing it to:
> >         next unless $pieces[0] =~/^[A-F0-9]{3}$/;
> 
> Check.

Yeah, nice catch. Wouldn't surprise me if we actually had this problem
before, just that the dropped symbols were not actually used by our own
modules. I notice the export count jumped to 5226...


> > I also propose to have the gendef.pl script save the dumpbin output so 
> > this sort of problem will be easier to debug.
> 
> Agreed, but I suggest waiting till 8.4 is branched unless you are really
> sure about this addition.  We freeze for 8.3.0 in less than 24 hours.

+1

//Magnus


Re: Oops - BF:Mastodon just died

From
Andrew Dunstan
Date:

Magnus Hagander wrote:
>>> I also propose to have the gendef.pl script save the dumpbin output so 
>>> this sort of problem will be easier to debug.
>>>       
>> Agreed, but I suggest waiting till 8.4 is branched unless you are really
>> sure about this addition.  We freeze for 8.3.0 in less than 24 hours.
>>     
>
> +1
>
>
>   

I am pretty damn sure it's OK. It's pretty low risk (change an unlink 
call to a rename call) and even if it's broken as my first version was, 
it doesn't appear to break the build. It's working on the buildfarm. I 
want it in so if we have problems with 8.3 we don't have to go through 
the handstands I had to to find out what was broken.

cheers

andrew


Re: Oops - BF:Mastodon just died

From
Tom Lane
Date:
Magnus Hagander <magnus@hagander.net> writes:
>> Andrew Dunstan <andrew@dunslane.net> writes:
>>> For now I'm going to try to fix it by changing it to:
>>> next unless $pieces[0] =~/^[A-F0-9]{3}$/;

> Yeah, nice catch. Wouldn't surprise me if we actually had this problem
> before, just that the dropped symbols were not actually used by our own
> modules. I notice the export count jumped to 5226...

I was wondering where the count would go.

It strikes me that the pattern needs to be {3,} or maybe just +.
I dunno what this column is measuring, but if we are past 0xA00
then surely 0x1000 is not far away.
        regards, tom lane


Re: Oops - BF:Mastodon just died

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
>>> Agreed, but I suggest waiting till 8.4 is branched unless you are really
>>> sure about this addition.  We freeze for 8.3.0 in less than 24 hours.

> I am pretty damn sure it's OK. It's pretty low risk (change an unlink 
> call to a rename call) and even if it's broken as my first version was, 
> it doesn't appear to break the build. It's working on the buildfarm. I 
> want it in so if we have problems with 8.3 we don't have to go through 
> the handstands I had to to find out what was broken.

After looking at the patch, my only question is how all those junk files
get cleaned up at "make clean".
        regards, tom lane


Re: Oops - BF:Mastodon just died

From
Andrew Dunstan
Date:

Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>   
>>>> Agreed, but I suggest waiting till 8.4 is branched unless you are really
>>>> sure about this addition.  We freeze for 8.3.0 in less than 24 hours.
>>>>         
>
>   
>> I am pretty damn sure it's OK. It's pretty low risk (change an unlink 
>> call to a rename call) and even if it's broken as my first version was, 
>> it doesn't appear to break the build. It's working on the buildfarm. I 
>> want it in so if we have problems with 8.3 we don't have to go through 
>> the handstands I had to to find out what was broken.
>>     
>
> After looking at the patch, my only question is how all those junk files
> get cleaned up at "make clean".
>
>             
>   

The symbol files we are keeping as a result of the patch are renamed 
into the release or debug hierarchy (depending on what we're 
building). Those entire trees are removed by src/tools/msvc/clean.bat.
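
A rough sketch of that idea, in Python rather than the Perl the build
scripts use: the temporary symbols file is moved into the build tree instead
of being deleted, so removing the tree also removes the saved file. The file
name, directory layout and helper name below are assumptions for
illustration, not what the patch actually uses.

```python
import os
import shutil
import tempfile


def keep_symbols_file(tmp_path, build_dir, name="symbols.out"):
    """Move the saved dumpbin output into the build tree instead of deleting it.

    The file name and layout are hypothetical.
    """
    os.makedirs(build_dir, exist_ok=True)
    dest = os.path.join(build_dir, name)
    os.replace(tmp_path, dest)  # a rename where the old code did an unlink
    return dest


with tempfile.TemporaryDirectory() as work:
    tmp = os.path.join(work, "symbols.tmp")
    with open(tmp, "w") as fh:
        fh.write("placeholder dumpbin output\n")

    release_tree = os.path.join(work, "Release")
    kept = keep_symbols_file(tmp, os.path.join(release_tree, "postgres"))
    kept_exists = os.path.exists(kept)
    tmp_gone = not os.path.exists(tmp)

    # Stand-in for the cleanup step: removing the whole tree removes the saved file.
    shutil.rmtree(release_tree)
    cleaned = not os.path.exists(kept)

print(kept_exists, tmp_gone, cleaned)
```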

cheers

andrew


Re: Oops - BF:Mastodon just died

From
Andrew Dunstan
Date:

Tom Lane wrote:
> Magnus Hagander <magnus@hagander.net> writes:
>   
>>> Andrew Dunstan <andrew@dunslane.net> writes:
>>>       
>>>> For now I'm going try to fix it by changing it to:
>>>> next unless $pieces[0] =~/^[A-F0-9]{3}$/;
>>>>         
>
>   
>> Yeah, nice catch. Wouldn't surprise me if we actually had this problem
>> before, just that the dropped symbols were not actually used by our own
>> modules. I notice the export count jumped to 5226...
>>     
>
> I was wondering where the count would go.
>
> It strikes me that the pattern needs to be {3,} or maybe just +.
> I dunno what this column is measuring, but if we are past 0xA00
> then surely 0x1000 is not far away.
>
>   

http://msdn2.microsoft.com/en-us/library/b842y285(VS.71).aspx appears to 
suggest that the size of the field is fixed.

But who knows?

cheers

andrew


Re: Oops - BF:Mastodon just died

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> Tom Lane wrote:
>> It strikes me that the pattern needs to be {3,} or maybe just +.
>> I dunno what this column is measuring, but if we are past 0xA00
>> then surely 0x1000 is not far away.

> http://msdn2.microsoft.com/en-us/library/b842y285(VS.71).aspx appears to 
> suggest that the size of the field is fixed.

That would imply that dumpbin fails at 4096 symbols per file.  While I
surely wouldn't put it past M$ to have put in such a limitation, I think
it's more likely that the documentation is badly written.

In any case it would be easy enough to make up a quick test to see what
happens with say
	void func1() {}
	void func2() {}
	...
	void func5000() {}
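
One way to produce such a test file without typing it by hand is a small
generator. The sketch below is in Python and only builds the source text;
the function name is hypothetical, and compiling the result and inspecting
it with dumpbin is left to whoever has an MSVC box available.

```python
def make_test_source(count):
    """Return C source text defining func1 .. func<count> as empty functions."""
    return "".join(f"void func{i}() {{}}\n" for i in range(1, count + 1))


source = make_test_source(5000)
lines = source.splitlines()

# Show the two ends of the generated file rather than all 5000 lines.
print(lines[0])
print(lines[-1])
print(len(lines), "functions generated")
```

Writing `source` to a .c file, compiling it, and looking at the index column
in the resulting symbol dump should show whether the field widens past FFF.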

        regards, tom lane


Re: Oops - BF:Mastodon just died

From
"Zeugswetter Andreas ADI SD"
Date:
> > http://msdn2.microsoft.com/en-us/library/b842y285(VS.71).aspx appears to
> > suggest that the size of the field is fixed.
>
> That would imply that dumpbin fails at 4096 symbols per file.  While I
> surely wouldn't put it past M$ to have put in such a limitation, I think
> it's more likely that the documentation is badly written.

Yes, it starts with 3 and goes to 4 digits above FFF

Andreas


Re: Oops - BF:Mastodon just died

From
Andrew Dunstan
Date:

Zeugswetter Andreas ADI SD wrote:
>>> http://msdn2.microsoft.com/en-us/library/b842y285(VS.71).aspx appears to
>>> suggest that the size of the field is fixed.
>>
>> That would imply that dumpbin fails at 4096 symbols per file.  While I
>> surely wouldn't put it past M$ to have put in such a limitation, I think
>> it's more likely that the documentation is badly written.
>
> Yes, it starts with 3 and goes to 4 digits above FFF
>   

OK, then {3,} is the right quantification. Will fix.
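
For the record, a small sketch of why the quantifier matters, again in
Python rather than the Perl in gendef.pl. The sample index values and the
helper name are made up for illustration.

```python
import re

FIXED_WIDTH = re.compile(r"^[A-F0-9]{3}$")  # first fix: exactly three hex digits
OPEN_ENDED = re.compile(r"^[A-F0-9]{3,}$")  # revised: three or more hex digits


def accepts(pattern, field):
    """Return True if the index field passes the filter."""
    return pattern.search(field) is not None


# 'FFF' is the last three-digit index; the four-digit values come after it.
fields = ["FFF", "1000", "1A2B"]
fixed_results = [accepts(FIXED_WIDTH, f) for f in fields]
open_results = [accepts(OPEN_ENDED, f) for f in fields]

for field, fixed, opened in zip(fields, fixed_results, open_results):
    print(field, "{3}:", fixed, "{3,}:", opened)
```

With {3} every entry past FFF would be dropped the same way the A-F entries
were dropped before, so {3,} avoids repeating the problem at 0x1000.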

cheers

andrew