Thread: regresssion script hole

regresssion script hole

From
Andrew Dunstan
Date:
While investigating some problems with buildfarm member spoonbill I came 
across this piece of code in pg_regress.sh, which seems less than robust:

   # Wait till postmaster is able to accept connections (normally only   # a second or so, but Cygwin is reportedly
*much*slower).  Don't   # wait forever, however.   i=0   max=60   until "$bindir/psql" -X $psql_options postgres
</dev/null2>/dev/null   do       i=`expr $i + 1`       if [ $i -ge $max ]       then           break       fi       if
kill-0 $postmaster_pid >/dev/null 2>&1       then           : still starting up       else           break       fi
 sleep 1   done
 
   if kill -0 $postmaster_pid >/dev/null 2>&1   then       echo "running on port $PGPORT with pid $postmaster_pid"
else      echo       echo "$me: postmaster did not start"       echo "Examine $LOGDIR/postmaster.log for the reason."
   echo       (exit 2); exit   fi
 

The problem is that if the postmaster takes more than 60 seconds to 
start listening (as is apparently happening on spoonbill - don't yet 
know why) this code falls through.

I'm inclined to run the psql test one more time to make sure we can 
actually connect, and if not then fail at this point. I wouldn't bother 
but it did confuse the heck out of both Stefan and me when createlang 
failed.

Thoughts?

cheers

andrew


Re: regresssion script hole

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> The problem is that if the postmaster takes more than 60 seconds to 
> start listening (as is apparently happening on spoonbill - don't yet 
> know why) this code falls through.

If the postmaster takes that long to start listening, I'd say we need to
fix the postmaster not pg_regress.

> I'm inclined to run the psql test one more time to make sure we can 
> actually connect, and if not then fail at this point.

How does that differ from just iterating the loop one more time?
        regards, tom lane


Re: regresssion script hole

From
Andrew Dunstan
Date:

Tom Lane wrote:

>Andrew Dunstan <andrew@dunslane.net> writes:
>  
>
>>The problem is that if the postmaster takes more than 60 seconds to 
>>start listening (as is apparently happening on spoonbill - don't yet 
>>know why) this code falls through.
>>    
>>
>
>If the postmaster takes that long to start listening, I'd say we need to
>fix the postmaster not pg_regress.
>
>  
>

We need both, I think. I am still trying to find out why it's taking so 
long. This is on the 8.0 branch, though. Later branches seem to be working.

>>I'm inclined to run the psql test one more time to make sure we can 
>>actually connect, and if not then fail at this point.
>>    
>>
>
>How does that differ from just iterating the loop one more time?
>  
>

There is no guarantee that at the end of the loop we have connected 
successfully to postgres.

I will post a patch that shows what I suggest.

cheers

andrew


Re: regresssion script hole

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> We need both, I think. I am still trying to find out why it's taking so 
> long. This is on the 8.0 branch, though. Later branches seem to be working.

One idea that comes to mind is a DNS lookup timeout.  Can you strace the
postmaster to see what it's doing?
        regards, tom lane


Re: regresssion script hole

From
Andrew Dunstan
Date:

Tom Lane wrote:

>Andrew Dunstan <andrew@dunslane.net> writes:
>  
>
>>We need both, I think. I am still trying to find out why it's taking so 
>>long. This is on the 8.0 branch, though. Later branches seem to be working.
>>    
>>
>
>One idea that comes to mind is a DNS lookup timeout.  Can you strace the
>postmaster to see what it's doing?
>
>        
>  
>

There is ktrace output I managed to capture at 
http://developer.postgresql.org/~adunstan/ktrace.txt

Not sure what it tells us. I do see it reading in the whole timezone db, 
which isn't pretty.

cheers

andrew


Re: regresssion script hole

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> Tom Lane wrote:
>> One idea that comes to mind is a DNS lookup timeout.  Can you strace the
>> postmaster to see what it's doing?

> There is ktrace output I managed to capture at 
> http://developer.postgresql.org/~adunstan/ktrace.txt
> Not sure what it tells us. I do see it reading in the whole timezone db, 
> which isn't pretty.

Yeah... I think we fixed that in 8.1.  Anyway, the tail end of the trace
shows it repeatedly sending off a UDP packet and getting practically the
same data back:
24583 postgres CALL  socket(0x2,0x2,0)24583 postgres RET   socket 324583 postgres CALL
sendto(0x3,0x43e1e000,0x25,0,0x493a6338,0x10)24583postgres GIO   fd 3 wrote 37 bytes
"\M-Sr\^A\0\0\^A\0\0\0\0\0\0\^B''\rkaltenbrunner\^Bcc\0\0\^A\0\^A"24583postgres RET   sendto 37/0x2524583 postgres CALL
getpid()24583 postgres RET   getpid 24583/0x600724583 postgres CALL  select(0x4,0x40739600,0,0,0xfffffffffffe4d90)24583
postgresRET   select 124583 postgres CALL
recvfrom(0x3,0x477e4000,0x10000,0,0xfffffffffffe4da0,0xfffffffffffe4d5c)24583postgres GIO   fd 3 read 37 bytes
"\M-Sr\M^A\M^B\0\^A\0\0\0\0\0\0\^B''\rkaltenbrunner\^Bcc\0\0\^A\0\^A"24583postgres RET   recvfrom 37/0x2524583 postgres
CALL close(0x3)24583 postgres RET   close 0
 

I'm not too up on what the DNS protocol looks like on-the-wire, but I'll
bet this is it.  I think it's trying to look up "kaltenbrunner.cc" and
failing.
        regards, tom lane


Re: regresssion script hole

From
Andrew Dunstan
Date:

Tom Lane wrote:

>Anyway, the tail end of the trace
>shows it repeatedly sending off a UDP packet and getting practically the
>same data back:
>
> 
>I'm not too up on what the DNS protocol looks like on-the-wire, but I'll
>bet this is it.  I think it's trying to look up "kaltenbrunner.cc" and
>failing.
>
>  
>

Why are we actually looking up anything? Just so we can bind to a 
listening socket?

Anyway, maybe the box needs a lookup line in its /etc/resolv.conf to 
direct it to use files first, something like
 lookup file bind

Stefan, can you look into that? It would be a bit ugly if it's calling 
DNS (and failing) to resolve localhost.

cheers

andrew


Re: regresssion script hole

From
Michael Fuhr
Date:
On Sun, Jun 18, 2006 at 07:50:04PM -0400, Tom Lane wrote:
>  24583 postgres CALL  recvfrom(0x3,0x477e4000,0x10000,0,0xfffffffffffe4da0,0xfffffffffffe4d5c)
>  24583 postgres GIO   fd 3 read 37 bytes
>        "\M-Sr\M^A\M^B\0\^A\0\0\0\0\0\0\^B''\rkaltenbrunner\^Bcc\0\0\^A\0\^A"
>  24583 postgres RET   recvfrom 37/0x25
>  24583 postgres CALL  close(0x3)
>  24583 postgres RET   close 0
> 
> I'm not too up on what the DNS protocol looks like on-the-wire, but I'll
> bet this is it.  I think it's trying to look up "kaltenbrunner.cc" and
> failing.

Maybe I'm misreading the packet, but I think the query is for
''kaltenbrunner.cc (two single quotes followed by kaltenbrunner.cc)
and the DNS server is responding with SRVFAIL.

-- 
Michael Fuhr


Re: regresssion script hole

From
Michael Fuhr
Date:
On Sun, Jun 18, 2006 at 07:18:07PM -0600, Michael Fuhr wrote:
> Maybe I'm misreading the packet, but I think the query is for
> ''kaltenbrunner.cc (two single quotes followed by kaltenbrunner.cc)

Correction: ''.kaltenbrunner.cc

-- 
Michael Fuhr


Re: regresssion script hole

From
Stefan Kaltenbrunner
Date:
Michael Fuhr wrote:
> On Sun, Jun 18, 2006 at 07:18:07PM -0600, Michael Fuhr wrote:
>> Maybe I'm misreading the packet, but I think the query is for
>> ''kaltenbrunner.cc (two single quotes followed by kaltenbrunner.cc)
> 
> Correction: ''.kaltenbrunner.cc

yes that is exactly the issue - the postmaster tries to resolve
''.kaltenbrunner.cc multiple times during startup and getting ServFail
as a response from the upstream resolver.


Stefan


Re: regresssion script hole

From
Stefan Kaltenbrunner
Date:
Andrew Dunstan wrote:
> 
> 
> Tom Lane wrote:
> 
>> Anyway, the tail end of the trace
>> shows it repeatedly sending off a UDP packet and getting practically the
>> same data back:
>>
>>
>> I'm not too up on what the DNS protocol looks like on-the-wire, but I'll
>> bet this is it.  I think it's trying to look up "kaltenbrunner.cc" and
>> failing.
>>
>>  
>>
> 
> Why are we actually looking up anything? Just so we can bind to a
> listening socket?
> 
> Anyway, maybe the box needs a lookup line in its /etc/resolv.conf to
> direct it to use files first, something like
> 
>  lookup file bind
> 
> Stefan, can you look into that? It would be a bit ugly if it's calling
> DNS (and failing) to resolve localhost.


no - resolving localhost works fine (both using /etc/hosts and through
the dns-resolver) - and I infact verified that when we initially started
to investigate that issue a while ago :-)


Stefan


Re: regresssion script hole

From
Andrew Dunstan
Date:

Stefan Kaltenbrunner wrote:

>Andrew Dunstan wrote:
>  
>
>>Why are we actually looking up anything? Just so we can bind to a
>>listening socket?
>>
>>Anyway, maybe the box needs a lookup line in its /etc/resolv.conf to
>>direct it to use files first, something like
>>
>> lookup file bind
>>
>>Stefan, can you look into that? It would be a bit ugly if it's calling
>>DNS (and failing) to resolve localhost.
>>    
>>
>
>
>no - resolving localhost works fine (both using /etc/hosts and through
>the dns-resolver) - and I infact verified that when we initially started
>to investigate that issue a while ago :-)
>
>
>  
>

Why are we looking up 'kaltenbrunner.cc' at all then? In any case, can 
we just try with that resolver line?

The question isn't whether is succeeds, it's how long it takes to 
succeed. When I increased the pg_regress timeout it actually went 
through the whole regression test happily. I suspect we have 2 things 
eating up the 60s timeout here: loading the timezone db and resolving 
whatever it is we are trying to resolve.

cheers

andrew




Re: regresssion script hole

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> The question isn't whether is succeeds, it's how long it takes to 
> succeed. When I increased the pg_regress timeout it actually went 
> through the whole regression test happily. I suspect we have 2 things 
> eating up the 60s timeout here: loading the timezone db and resolving 
> whatever it is we are trying to resolve.

The behavior of loading the whole TZ database was there for awhile
before anyone noticed; I believe it could only be responsible for a
few seconds.  So the failed DNS responses must be the problem.  Could
we get a ktrace with timestamps on the syscalls to confirm that?

Of course the $64 question is *why* is 8.0 trying to resolve that name,
particularly seeing that the later branches apparently aren't.
        regards, tom lane


Re: regresssion script hole

From
Martijn van Oosterhout
Date:
On Mon, Jun 19, 2006 at 09:21:21AM -0400, Tom Lane wrote:
> Of course the $64 question is *why* is 8.0 trying to resolve that name,
> particularly seeing that the later branches apparently aren't.

The formatting of the message suggests it is a gethostbyname("''")
doing it. Did any quoting rules change between 8.0 and 8.1 w.r.t. the
configuration files?

I wonder it it'd be worth adding some conditional code around
gethostbyname() calls that warn if the call took longer than say 10
seconds. By printing out that and the string it's looking up you could
save a lot of time confirming if the delay is there or elsewhere...

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Re: regresssion script hole

From
Stefan Kaltenbrunner
Date:
Martijn van Oosterhout wrote:
> On Mon, Jun 19, 2006 at 09:21:21AM -0400, Tom Lane wrote:
>> Of course the $64 question is *why* is 8.0 trying to resolve that name,
>> particularly seeing that the later branches apparently aren't.
> 
> The formatting of the message suggests it is a gethostbyname("''")
> doing it. Did any quoting rules change between 8.0 and 8.1 w.r.t. the
> configuration files?

I tcpdump'd the dns-traffic on that box during a postmaster startup and
it's definitly trying to look up ''.kaltenbrunner.cc a lot of times.
And from what it looks like it might be getting somehow rate limited by
my ISPs recursive resolvers after doing the same query a dozens of times
and getting a "servfail" every time.
At least the timestamps seem to indicate that the responses are getting
delayed up to 10 seconds after a number of queries ...
It might be a complete shot in the dark but spoonbill worked fine on
REL_8_0_STABLE until i disabled reporting 3 month ago.
During this time the large escaping security fix/standard_strings patch
went in - could this be related in any way ?


Stefan


Re: regresssion script hole

From
Tom Lane
Date:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> I tcpdump'd the dns-traffic on that box during a postmaster startup and
> it's definitly trying to look up ''.kaltenbrunner.cc a lot of times.

I just strace'd postmaster start on a Fedora box and can see nothing
corresponding.  Since this is a "make check" we know that the PG
configuration files it's using are stock ... so it must be something
about the system config that's sending it round the bend.  What do you
have in /etc/hosts, /etc/resolv.conf, /etc/nsswitch.conf?
        regards, tom lane


Re: regresssion script hole

From
Tom Lane
Date:
Oh, I think I see the problem:

8.0 pg_regress:
   if [ "$unix_sockets" = no ]; then       postmaster_options="$postmaster_options -c listen_addresses=$hostname"
else      postmaster_options="$postmaster_options -c listen_addresses=''"   fi
 

8.1 pg_regress:
   if [ "$unix_sockets" = no ]; then       postmaster_options="$postmaster_options -c listen_addresses=$hostname"
else      postmaster_options="$postmaster_options -c listen_addresses="   fi
 
        regards, tom lane


Re: regresssion script hole

From
Andrew Dunstan
Date:
Stefan Kaltenbrunner wrote:
> Tom Lane wrote:
>   
>> Andrew Dunstan <andrew@dunslane.net> writes:
>>     
>>> The question isn't whether is succeeds, it's how long it takes to 
>>> succeed. When I increased the pg_regress timeout it actually went 
>>> through the whole regression test happily. I suspect we have 2 things 
>>> eating up the 60s timeout here: loading the timezone db and resolving 
>>> whatever it is we are trying to resolve.
>>>       
>> The behavior of loading the whole TZ database was there for awhile
>> before anyone noticed; I believe it could only be responsible for a
>> few seconds.  So the failed DNS responses must be the problem.  Could
>> we get a ktrace with timestamps on the syscalls to confirm that?
>>
>> Of course the $64 question is *why* is 8.0 trying to resolve that name,
>> particularly seeing that the later branches apparently aren't.
>>     
>
> hmm maybe the later branches are trying to resolve that too - but only
> the combination of the TZ database loading + the failed DNS-queries is
> pushing the startup time over the 60 second limit on this (quite slow) box ?
>
> I will try to verify what the later branches are doing exactly ...
>
>
>   

Yes, we're on the margin here. The successful runs I saw got through the 
timeout in 5 or 10 seconds over the 60 that we currently allow.

What interests me is where it even gets the string 'kaltenbrunner.cc' 
from. It looks to me like the most likely place is the search line in 
/etc/resolv.conf. Ir would be nice to know exactly what it is trying to 
resolve.

cheers

andrew

cheers

andrew



Re: regresssion script hole

From
Tom Lane
Date:
I wrote:
> 8.0 pg_regress:
>         postmaster_options="$postmaster_options -c listen_addresses=''"
> 8.1 pg_regress:
>         postmaster_options="$postmaster_options -c listen_addresses="

and in fact here's the commit that changed that:

2005-06-19 22:26  tgl
* src/test/regress/pg_regress.sh: Change shell syntax that seemsnot to work right on FreeBSD 6-CURRENT buildfarm
machines.

So apparently it's some generic disease of the BSD shell.  I should have
back-patched at the time but did not.  Will take care of it.

On the timezone search business, it's still the case that HEAD will
search through all the timezones if it's not given an explicit setting
(eg an explicit environment TZ value).  We could suppress that by having
pg_regress set TZ, but then the regression tests wouldn't exercise the
search code at all, which is probably not a great idea.
        regards, tom lane


Re: regresssion script hole

From
Andrew Dunstan
Date:
Tom Lane wrote:
> Oh, I think I see the problem:
>
> 8.0 pg_regress:
>
>     if [ "$unix_sockets" = no ]; then
>         postmaster_options="$postmaster_options -c listen_addresses=$hostname"
>     else
>         postmaster_options="$postmaster_options -c listen_addresses=''"
>     fi
>
> 8.1 pg_regress:
>
>     if [ "$unix_sockets" = no ]; then
>         postmaster_options="$postmaster_options -c listen_addresses=$hostname"
>     else
>         postmaster_options="$postmaster_options -c listen_addresses="
>     fi
>
>     
>   

Good catch! I'm impressed! This is surely the heart of the problem.

That change (from rev 1.56) clearly needs to be backported to 8.0.

cheers

andrew



Re: regresssion script hole

From
Stefan Kaltenbrunner
Date:
Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> The question isn't whether is succeeds, it's how long it takes to 
>> succeed. When I increased the pg_regress timeout it actually went 
>> through the whole regression test happily. I suspect we have 2 things 
>> eating up the 60s timeout here: loading the timezone db and resolving 
>> whatever it is we are trying to resolve.
> 
> The behavior of loading the whole TZ database was there for awhile
> before anyone noticed; I believe it could only be responsible for a
> few seconds.  So the failed DNS responses must be the problem.  Could
> we get a ktrace with timestamps on the syscalls to confirm that?
> 
> Of course the $64 question is *why* is 8.0 trying to resolve that name,
> particularly seeing that the later branches apparently aren't.

hmm maybe the later branches are trying to resolve that too - but only
the combination of the TZ database loading + the failed DNS-queries is
pushing the startup time over the 60 second limit on this (quite slow) box ?

I will try to verify what the later branches are doing exactly ...


Stefan