Thread: regression failures - further data

regression failures - further data

From
Andrew Dunstan
Date:
I have managed (with a lot of effort) to track down the apparent cause
of the regression failures I was seeing. They appear to be directly
related to the degree of parallelism with which the tests are run. I can
reliably get a 100% clean run on the serial tests, and on the parallel
tests with MAX_CONNECTIONS=5. But if I run at MAX_CONNECTIONS=10 I
(almost) always get failures, which for some reason that is beyond me
start with the copy test, which isn't even run in parallel with other tests.

This is all quite worrying, and suggests that we will need to do some
careful stress testing before we can release this.

Is there some W2K parameter I can tweak in the TCP stack that might
alleviate the problem?

Cheers

andrew

Re: regression failures - further data

From
Bruce Momjian
Date:
Andrew Dunstan wrote:
>
> I have managed (with a lot of effort) to track down the apparent cause
> of the regression failures I was seeing. They appear to be directly
> related to the degree of parallelism with which the tests are run. I can
> reliably get a 100% clean run on the serial tests, and on the parallel
> tests with MAX_CONNECTIONS=5. But if I run at MAX_CONNECTIONS=10 I
> (almost) always get failures, which for some reason that is beyond me
> start with the copy test, which isn't even run in parallel with other tests.
>
> This is all quite worrying, and suggests that we will need to do some
> careful stress testing before we can release this.
>
> Is there some W2K parameter I can tweak in the TCP stack that might
> alleviate the problem?

Is this the extra newline regression failure you were seeing?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: regression failures - further data

From
"Andrew Dunstan"
Date:
Bruce Momjian said:
> Andrew Dunstan wrote:
>>
>> I have managed (with a lot of effort) to track down the apparent cause
>>  of the regression failures I was seeing. They appear to be directly
>> related to the degree of parallelism with which the tests are run. I
>> can  reliably get a 100% clean run on the serial tests, and on the
>> parallel  tests with MAX_CONNECTIONS=5. But if I run at
>> MAX_CONNECTIONS=10 I  (almost) always get failures, which for some
>> reason that is beyond me  start with the copy test, which isn't even
>> run in parallel with other tests.
>>
>> This is all quite worrying, and suggests that we will need to do some
>> careful stress testing before we can release this.
>>
>> Is there some W2K parameter I can tweak in the TCP stack that might
>> alleviate the problem?
>
> Is this the extra newline regression failure you were seeing?
>

No, this is running with the patch that suppresses that. Basically, for
some reason that I have been unable to find, and which leaves no log
trace, copy just stops after about 4 or 5 lines, and then there are a
bunch of consequent failures. I can't account for it yet. All I do know is
that it happens when the tests are run with high parallelism and doesn't
with no or low parallelism.

(tests are all run from MSys)

cheers

andrew



Re: regression failures - further data

From
Bruce Momjian
Date:
Andrew Dunstan wrote:
> No, this is running with the patch that suppresses that. Basically, for
> some reason that I have been unable to find, and which leaves no log
> trace, copy just stops after about 4 or 5 lines, and then there are a
> bunch of consequent failures. I can't account for it yet. All I do know is
> that it happens when the tests are run with high parallelism and doesn't
> with no or low parallelism.

Not surprising.  We expect to have these problems with such a new port.


Re: regression failures - further data

From
Tom Lane
Date:
"Andrew Dunstan" <andrew@dunslane.net> writes:
> No, this is running with the patch that suppresses that. Basically, for
> some reason that I have been unable to find, and which leaves no log
> trace, copy just stops after about 4 or 5 lines, and then there are a
> bunch of consequent failures. I can't account for it yet. All I do know is
> that it happens when the tests are run with high parallelism and doesn't
> with no or low parallelism.

That seems *really weird*.  When you say "stops after 4 or 5 lines",
do you mean that the first few rows of the source data get committed
successfully, but the rest don't?

            regards, tom lane

Re: regression failures - further data

From
"Andrew Dunstan"
Date:
Tom Lane said:
> "Andrew Dunstan" <andrew@dunslane.net> writes:
>> No, this is running with the patch that suppresses that. Basically,
>> for some reason that I have been unable to find, and which leaves no
>> log trace, copy just stops after about 4 or 5 lines, and then there
>> are a bunch of consequent failures. I can't account for it yet. All I
>> do know is that it happens when the tests are run with high
>> parallelism and doesn't with no or low parallelism.
>
> That seems *really weird*.  When you say "stops after 4 or 5 lines", do
> you mean that the first few rows of the source data get committed
> successfully, but the rest don't?
>

Er, no :-) I mean the first few lines of source code ... around where it
does the "delete from onek". It's not entirely consistent. I'm playing
with debugging etc. in the postmaster log to get further and better
particulars.

cheers

andrew



Re: regression failures - further data

From
Tom Lane
Date:
"Andrew Dunstan" <andrew@dunslane.net> writes:
> Tom Lane said:
>> "Andrew Dunstan" <andrew@dunslane.net> writes:
>>> No, this is running with the patch that suppresses that. Basically,
>>> for some reason that I have been unable to find, and which leaves no
>>> log trace, copy just stops after about 4 or 5 lines, and then there
>>> are a bunch of consequent failures.

>> That seems *really weird*.  When you say "stops after 4 or 5 lines", do
>> you mean that the first few rows of the source data get committed
>> successfully, but the rest don't?

> Er, no :-) I mean the first few lines of source code ... around where it
> does the "delete from onek".

Oh, you mean the copy regression script stops.

Is it possible that it didn't fail, per se, but simply stopped making
progress?

Theoretically the regression script couldn't move on to later steps
until that script gets done, but it wouldn't be quite so weird to think
that that interlock failed ...

            regards, tom lane

Re: regression failures - further data

From
"Magnus Hagander"
Date:
> >> I have managed (with a lot of effort) to track down the apparent
> >> cause of the regression failures I was seeing. They appear to be
> >> directly related to the degree of parallelism with which the tests
> >> are run. I can reliably get a 100% clean run on the serial tests,
> >> and on the parallel tests with MAX_CONNECTIONS=5. But if I run at
> >> MAX_CONNECTIONS=10 I (almost) always get failures, which for some
> >> reason that is beyond me start with the copy test, which isn't even
> >> run in parallel with other tests.
> >>
> >> This is all quite worrying, and suggests that we will need to do some
> >> careful stress testing before we can release this.
> >>
> >> Is there some W2K parameter I can tweak in the TCP stack that might
> >> alleviate the problem?
> >
> > Is this the extra newline regression failure you were seeing?
> >
>
> No, this is running with the patch that suppresses that.
> Basically, for some reason that I have been unable to find,
> and which leaves no log trace, copy just stops after about 4
> or 5 lines, and then there are a bunch of consequent
> failures. I can't account for it yet. All I do know is that
> it happens when the tests are run with high parallelism and
> doesn't with no or low parallelism.
>
> (tests are all run from MSys)

Does this mean they work when you run the tests from cygwin, or that you
haven't tried it there? I've always run it with the default, which
should be unlimited, with no problems.

Could it simply be another issue with the infamous stdout/stderr
buffering on the msys console, that causes the script to get results in
the wrong order or something? [assuming this *is* a client-side problem,
of course]

//Magnus


Re: regression failures - further data

From
"Dave Page"
Date:

> -----Original Message-----
> From: Andrew Dunstan [mailto:andrew@dunslane.net]
> Sent: 06 May 2004 23:52
> To: pgsql-hackers-win32@postgresql.org
> Subject: Re: [pgsql-hackers-win32] regression failures - further data
>
> No, this is running with the patch that suppresses that.
> Basically, for some reason that I have been unable to find,
> and which leaves no log trace, copy just stops after about 4
> or 5 lines, and then there are a bunch of consequent
> failures. I can't account for it yet. All I do know is that
> it happens when the tests are run with high parallelism and
> doesn't with no or low parallelism.

Are you running on a workstation edition of Windows? If so, it sounds
like you've hit the limit that M$ put in to stop you using it as a cheap
server. That's always been an issue with the parallel regression tests
under Cygwin.

Regards, Dave

Re: regression failures - further data

From
Andrew Dunstan
Date:
Magnus Hagander wrote:

>>>>I have managed (with a lot of effort) to track down the apparent
>>>>cause of the regression failures I was seeing. They appear to be
>>>>directly related to the degree of parallelism with which the tests
>>>>are run. I can reliably get a 100% clean run on the serial tests,
>>>>and on the parallel tests with MAX_CONNECTIONS=5. But if I run at
>>>>MAX_CONNECTIONS=10 I (almost) always get failures, which for some
>>>>reason that is beyond me start with the copy test, which isn't even
>>>>run in parallel with other tests.
>>>>
>>>>This is all quite worrying, and suggests that we will need to do some
>>>>careful stress testing before we can release this.
>>>>
>>>>Is there some W2K parameter I can tweak in the TCP stack that might
>>>>alleviate the problem?
>>>>
>>>Is this the extra newline regression failure you were seeing?
>>>
>>No, this is running with the patch that suppresses that.
>>Basically, for some reason that I have been unable to find,
>>and which leaves no log trace, copy just stops after about 4
>>or 5 lines, and then there are a bunch of consequent
>>failures. I can't account for it yet. All I do know is that
>>it happens when the tests are run with high parallelism and
>>doesn't with no or low parallelism.
>>
>>(tests are all run from MSys)
>>
>Does this mean they work when you run the tests from cygwin, or that you
>haven't tried it there? I've always run it with the default, which
>should be unlimited, with no problems.
>
>Could it simply be another issue with the infamous stdout/stderr
>buffering on the msys console, that causes the script to get results in
>the wrong order or something? [assuming this *is* a client-side problem,
>of course]
>

Haven't run from cygwin because what I'm trying to do is get it so you
don't need to. I think that's incredibly ugly. You should be able to run
regression tests from your build platform. The stderr buffering problem
should have been fixed by the recent patch explicitly unbuffering it.

still investigating ...

cheers

andrew

Re: regression failures - further data

From
Andrew Dunstan
Date:
Andrew Dunstan wrote:

>
> still investigating ...
>

The log traces (log_connections=true, log_disconnections=true,
log_statement='all') show that if I run without limiting
max_connections, the next tests start up before the copy is finished -
no wonder things get right royally screwed as a result.

It seems like the problem is in the Msys shell. It appears not to wait
correctly for a job to finish (Single tests are run in the foreground by
the shell, so no explicit 'wait' is run - I tried putting one in with no
effect). It's probably triggered by the copy test because it takes such
a long time. I have no idea why the parallelism of the tests should
affect it.

trying to find a workaround.

Does anyone have any contacts with the MINGW/MSys people?

cheers

andrew

Couldn't make check

From
"Hisaji ONO"
Date:
Hi.

I've succeeded in building PostgreSQL with the latest msys/mingw.

However, I got the following message:

initdb.exe  - couldn't find the component -

couldn't find libpq.dll.......

Could anyone offer a suggestion?

 Regards.


Re: Couldn't make check

From
Andrew Dunstan
Date:
Hisaji ONO wrote:

>Hi.
>
> I've succeeded in building PostgreSQL with the latest msys/mingw.
>
>However, I got the following message:
>
>initdb.exe  - couldn't find the component -
>
>couldn't find libpq.dll.......
>
>Could anyone offer a suggestion?
>
> Regards.
>

The attached patch against pg_regress.sh v 1.42 (i.e. cvs tip) is what 
I'm currently testing with (minus a few local tweaks I have for 
debugging purposes).

cheers

andrew
Index: pg_regress.sh
===================================================================
RCS file: /projects/cvsroot/pgsql-server/src/test/regress/pg_regress.sh,v
retrieving revision 1.42
diff -c -w -r1.42 pg_regress.sh
*** pg_regress.sh    3 May 2004 13:25:23 -0000    1.42
--- pg_regress.sh    7 May 2004 16:09:33 -0000
***************
*** 1,5 ****
  #! /bin/sh
! # $PostgreSQL: pgsql-server/src/test/regress/pg_regress.sh,v 1.42 2004/05/03 13:25:23 momjian Exp $
  
  me=`basename $0`
  : ${TMPDIR=/tmp}
--- 1,5 ----
  #! /bin/sh
! # $PostgreSQL: pgsql-server/src/test/regress/pg_regress.sh,v 1.38 2004/01/08 20:04:41 neilc Exp $
  
  me=`basename $0`
  : ${TMPDIR=/tmp}
***************
*** 208,225 ****
  
  
  # ----------
- # Set up pwd to give a win32 happy pathname
- # ----------
- 
- case $host_platform in
-     *-*-mingw32*)
-         PWDFLAGS=-W;;
-     *)
-         PWDFLAGS=;;
- esac
- 
- 
- # ----------
  # Set backend timezone and datestyle explicitly
  #
  # To pass the horology test in its current form, the postmaster must be
--- 208,213 ----
***************
*** 306,317 ****
  if [ x"$temp_install" != x"" ]
  then
      if echo x"$temp_install" | grep -v '^x/' >/dev/null 2>&1; then
!         temp_install="`pwd $PWDFLAGS`/$temp_install"
      fi
  
      bindir=$temp_install/install/$bindir
      libdir=$temp_install/install/$libdir
-     pkglibdir=$temp_install/install/$pkglibdir
      datadir=$temp_install/install/$datadir
      PGDATA=$temp_install/data
  
--- 294,313 ----
  if [ x"$temp_install" != x"" ]
  then
      if echo x"$temp_install" | grep -v '^x/' >/dev/null 2>&1; then
!         case `uname` in
!           MINGW*)
!                 pkglibdir="`pwd -W`/$temp_install/install/$pkglibdir"
!         temp_install="`pwd`/$temp_install"
!                 ;;
!           *)
!                 temp_install="`pwd`/$temp_install"
!                 pkglibdir=$temp_install/install/$pkglibdir
!                 ;;
!         esac
      fi
  
      bindir=$temp_install/install/$bindir
      libdir=$temp_install/install/$libdir
      datadir=$temp_install/install/$datadir
      PGDATA=$temp_install/data
  
***************
*** 348,354 ****
      # executables, not dlopen'ed ones)
      # ----------
      case $host_platform in
!         *-*-cygwin*)
              PATH=$libdir:$PATH
              export PATH
              ;;
--- 344,350 ----
      # executables, not dlopen'ed ones)
      # ----------
      case $host_platform in
!         *-*-cygwin* | *-*-mingw32*)
              PATH=$libdir:$PATH
              export PATH
              ;;

white smoke up the chimney

From
Andrew Dunstan
Date:
Andrew Dunstan wrote:

> Andrew Dunstan wrote:
>
>>
>> still investigating ...
>>
>
> The log traces (log_connections=true, log_disconnections=true,
> log_statement='all') show that if I run without limiting
> max_connections, the next tests start up before the copy is finished -
> no wonder things get right royally screwed as a result.
>
> It seems like the problem is in the Msys shell. It appears not to wait
> correctly for a job to finish (Single tests are run in the foreground
> by the shell, so no explicit 'wait' is run - I tried putting one in
> with no effect). It's probably triggered by the copy test because it
> takes such a long time. I have no idea why the parallelism of the
> tests should affect it.
>
> trying to find a workaround.
>

OK. The workaround that I have just come up with has worked in 6
successive runs of "make check" and friends under MSys, so I'm
prepared to declare a win and start preparing patches. It will be next
week before I can get that done.

The workaround is to run the single command in a background shell and
wait for it, just like the parallel tests.
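
For readers who want to see the shape of that workaround, here is a minimal
sketch. It is not the actual pg_regress.sh change (that patch is not shown in
this message); the function name run_single_test is hypothetical and the
sleep/echo command merely stands in for a slow test such as copy.

```shell
#!/bin/sh
# Sketch only: run_single_test is a hypothetical name, not taken from
# pg_regress.sh.
run_single_test() {
    # Start the command in a background subshell, the way the parallel
    # tests are started.
    ( "$@" ) &
    # Block until that particular background job exits. wait returns
    # the job's exit status, so failures still propagate to the caller.
    wait $!
}

# Stand-in for a slow single test followed by the next step.
run_single_test sh -c 'sleep 1; echo "copy done"'
echo "next test starts only now (status $?)"
```

The point of the design is that the shell is made to wait on an explicit job
rather than relying on its handling of a foreground command, which is the part
that appeared to misbehave under MSys.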

Here is a summary of the diffs in my tree:
. configure.in
    - checks at the end to make sure that links have built properly and
      warns if not
. src/bin
    - added pgkill, based on the one on the web page, with a Makefile and
      install target
. src/bin/psql/print.c
    - suppress newline after footers for win32, as previously discussed
. src/test/regress/GNUmakefile
    - added a target pinstallcheck to run against an installed and
      running server, but using the parallel tests
    - added a test to the sed command to make sure the files were built
      correctly
. src/test/regress/pg_regress.sh
    - see previous post and above
    - also calls pgkill under mingw instead of kill -15, which doesn't
      work. Gets the pid for pgkill from postmaster.pid.
. src/test/regress/expected/join-win32.out
    - new file reflecting different order from join results on win32
. src/test/regress/resultmap
    - maps above for win32

The pgkill stuff is pending us getting a binary pg_ctl. But it seems to
work very well.
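
As an illustration of the "pid from postmaster.pid" step, here is a rough
sketch. It relies only on the fact that the first line of postmaster.pid in
the data directory holds the postmaster's PID. The function names are
hypothetical, and because pgkill's command line is not shown in this thread,
the signalling command is passed in by the caller rather than guessed at.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical, not from pg_regress.sh.

# Print the postmaster PID, i.e. the first line of postmaster.pid.
# $1 is the data directory (PGDATA).
postmaster_pid() {
    head -n 1 "$1/postmaster.pid"
}

# Run a caller-supplied signalling command with the PID appended.
# $1 is the data directory; the remaining arguments are the command.
stop_postmaster() {
    datadir=$1
    shift
    # Give up if the pid file cannot be read.
    pid=`postmaster_pid "$datadir"` || return 1
    "$@" "$pid"
}
```

On a Unix-like system this could be called as
stop_postmaster "$PGDATA" kill -15; under MinGW the pgkill command, with
whatever arguments it actually takes, would be substituted for kill -15.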

With all these changes I now consistently get 94 of 94 tests passing,
and a completely clean (and automatic) server shutdown.

cheers

andrew