Thread: Tests randomly failed

Tests randomly failed

From
Alexander Klimov
Date:
Hi all.

The first time I ran `make check', 10 tests failed:
     float8               ... FAILED
test numerology           ... FAILED
     point                ... FAILED
     lseg                 ... FAILED
     interval             ... FAILED
test geometry             ... FAILED
test horology             ... FAILED
     subselect            ... FAILED
     union                ... FAILED
test misc                 ... FAILED

the second time it was only 5:

     abstime              ... FAILED
test horology             ... FAILED
     subselect            ... FAILED
     union                ... FAILED
test misc                 ... FAILED

the third time it was 10 again:
     abstime              ... FAILED
     tinterval            ... FAILED
     inet                 ... FAILED
     comments             ... FAILED
     oidjoins             ... FAILED
test horology             ... FAILED
     case                 ... FAILED
     join                 ... FAILED
     portals              ... FAILED
test misc                 ... FAILED

The results of the second and third passes are in the attachment.
It looks like the failed tests are due to
! psql: connectDBStart() -- connect() failed: Connection refused
!     Is the postmaster running locally
!     and accepting connections on Unix socket '/tmp/.s.PGSQL.65432'?

My guess was that this could be due to high load on my box, but
w said
 11:29am  up 24 day(s), 18:30,  2 users,  load average: 0.00, 0.18, 0.29
and I shut down my production postmaster before the tests, and I have 256MB of
RAM.
SunOS iridium 5.6 Generic_105181-20 sun4u sparc SUNW,Ultra-5_10
gcc version 2.95.2 19991024 (release)
psql (PostgreSQL) 7.1RC1 (actually from CVS)

So the question is: what causes this behaviour, and how can I
fix it?

Regards,
ASK

Re: Tests randomly failed

From
Tom Lane
Date:
Alexander Klimov <ask@wisdom.weizmann.ac.il> writes:
> It looks like the failed tests are due to
> ! psql: connectDBStart() -- connect() failed: Connection refused
> !     Is the postmaster running locally
> !     and accepting connections on Unix socket '/tmp/.s.PGSQL.65432'?

What I see is a lot of

! psql: Backend startup failed

which suggests a fork() failure.  Look in the postmaster logfile to see
the exact kernel error code --- but probably you are out of swap space
or up against the kernel's limit on number of processes for one userid.
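
The two symptoms mentioned so far ("Connection refused" and "Backend startup failed") can be told apart by counting them in the regression output. The following is only a rough sketch: the function name is made up for illustration, and the file you pass in (for example a regression.diffs file) is an assumption about where the psql messages end up.

```shell
# Count the two failure signatures quoted in this thread in a diff file.
# Usage: count_failure_kinds path/to/diffs-file
count_failure_kinds() {
    diffs_file=$1
    # grep -c prints the number of matching lines (0 when none match);
    # "|| true" keeps a zero-match result from being treated as an error.
    refused=$(grep -c 'Connection refused' "$diffs_file" || true)
    startup=$(grep -c 'Backend startup failed' "$diffs_file" || true)
    echo "connection_refused=$refused backend_startup_failed=$startup"
}
```

A high count of "Backend startup failed" would point toward the fork() explanation, but the postmaster logfile remains the place to find the actual kernel error code.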

            regards, tom lane

Re: Tests randomly failed

From
Peter Eisentraut
Date:
Alexander Klimov writes:

> Results of second and third passes are in the attachment.
> It looks like the failed tests are due to
> ! psql: connectDBStart() -- connect() failed: Connection refused
> !     Is the postmaster running locally
> !     and accepting connections on Unix socket '/tmp/.s.PGSQL.65432'?
>
> My guess is that this could be due to high load of my box, but
> w said
>  11:29am  up 24 day(s), 18:30,  2 users,  load average: 0.00, 0.18, 0.29
> and I shut down my production postmaster before tests, and I have 256MB of
> RAM,
> SunOS iridium 5.6 Generic_105181-20 sun4u sparc SUNW,Ultra-5_10
> gcc version 2.95.2 19991024 (release)
> psql (PostgreSQL) 7.1RC1 (actually from CVS)

In src/test/regress/pg_regress[.sh], line 163, change

    *-*-qnx* | *beos*)

to

    *-*-qnx* | *beos* | *solaris*)

and rerun the tests.  This will avoid using Unix domain sockets, which are
broken on Solaris.
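
For readers unfamiliar with shell case patterns, the effect of the change can be sketched as below. This is not the actual pg_regress.sh code: the function name and the example platform strings are assumptions, and only the pattern itself is taken from the message above.

```shell
# Decide whether Unix domain sockets would be used for a platform string.
# Prints "no" for platforms matched by the pattern, "yes" otherwise.
unix_sockets_for() {
    case $1 in
        *-*-qnx* | *beos* | *solaris*)
            echo no ;;
        *)
            echo yes ;;
    esac
}
```

With this sketch, `unix_sockets_for sparc-sun-solaris2.6` prints "no", so the tests would go over TCP/IP instead of the socket in /tmp.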

--
Peter Eisentraut      peter_e@gmx.net       http://yi.org/peter-e/

Re: Tests randomly failed

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> In src/test/regress/pg_regress[.sh], line 163, change
>     *-*-qnx* | *beos*)
> to
>     *-*-qnx* | *beos* | *solaris*)

> and rerun the tests.  This will avoid using Unix domain sockets, which are
> broken on Solaris.

I was just thinking that maybe pg_regress should have a command line
option to set unix_sockets=no, so that both connection options could
be exercised when there's doubt.
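
A minimal sketch of how such an override could sit on top of the per-platform default follows. The option name --no-unix-sockets and the function name are invented for illustration; no such pg_regress option is confirmed anywhere in this thread.

```shell
# Hypothetical: start from the platform default ("yes" or "no") and let a
# made-up command line option force unix_sockets=no.
choose_unix_sockets() {
    unix_sockets=$1   # platform default, passed as the first argument
    shift
    for arg in "$@"; do
        case $arg in
            --no-unix-sockets) unix_sockets=no ;;
        esac
    done
    echo "$unix_sockets"
}
```

Running the suite once with the default and once with the override would exercise both connection methods on the same machine.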

            regards, tom lane

Re: Tests randomly failed

From
Justin Clift
Date:
Hey guys,

I don't understand what you mean by "This will avoid using Unix domain
sockets, which are broken on Solaris.".

If this were the case, then the errors which are described would happen
on ALL Solaris platforms, wouldn't they?  And other packages using Unix
domain sockets would have problems too, wouldn't they?

If it's of any help, I get the same types of regression testing failures
on Solaris, with the same "is the backend running?" type error
messages, when the installation of Solaris HAS NOT had its /etc/system
file altered to increase the number of shared memory segments and
semaphores.

Whenever I have those problems, I insert the updated (higher) values for
shared memory and semaphores, reboot the system, then the tests pass as
the backend is able to start fine.
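
A quick way to see whether a machine has had these values touched at all is to list the relevant "set" lines in /etc/system. This is only a sketch: the function name is made up, and it assumes the overrides use the usual shmsys: and semsys: module prefixes.

```shell
# Print the shared memory and semaphore overrides found in a Solaris-style
# system file. Prints nothing if the file contains no such "set" lines.
list_ipc_overrides() {
    system_file=${1:-/etc/system}
    grep -E '^[[:space:]]*set[[:space:]]+(shmsys|semsys):' "$system_file" || true
}
```

Empty output would suggest the system is still running with the defaults.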

Hope this is helpful.

Regards and best wishes,

Justin Clift

Tom Lane wrote:
>
> Peter Eisentraut <peter_e@gmx.net> writes:
> > In src/test/regress/pg_regress[.sh], line 163, change
> >     *-*-qnx* | *beos*)
> > to
> >     *-*-qnx* | *beos* | *solaris*)
>
> > and rerun the tests.  This will avoid using Unix domain sockets, which are
> > broken on Solaris.
>
> I was just thinking that maybe pg_regress should have a command line
> option to set unix_sockets=no, so that both connection options could
> be exercised when there's doubt.
>
>                         regards, tom lane

Re: Tests randomly failed

From
Tom Lane
Date:
Justin Clift <jclift@iprimus.com.au> writes:
> If it's of any help, I get the same types of regression testing failures
> on Solaris, with the same "is the backend running?" type error
> messages.. when the installation of solaris HAS NOT had its /etc/system
> file altered to change the amount of shared memory segments and
> semaphores.

> Whenever I have those problems, I insert the updated (higher) values for
> shared memory and semaphores, reboot the system, then the tests pass as
> the backend is able to start fine.

Hm.  That's interesting, but it's fairly hard to believe.  For at least
a couple releases past, Postgres has grabbed all the shared memory and
semaphores that it wants at postmaster start.  Insufficient shmem/sema
resources should result in postmaster abort, not in occasional failures
to start backends.

            regards, tom lane

Re: Tests randomly failed

From
Justin Clift
Date:
Hi Tom,

I know what you're saying, but I've come across it multiple times.

The process for building a Solaris server for PostgreSQL is (from
memory):

A) Install the OS
B) Install the latest Maintenance Update
C) Install the latest recommended patches
D) Adjust system values for semaphores and shared memory
E) Do an initial lockdown for system security
F) Reboot for the new settings to take effect
G) Create postgres group and postgres user
H) Compile postgres
I) Run the regression tests
J) Lockdown system again
K) Reboot, test startup scripts, etc
<etc>

If I'm working very late and can't find the semaphore settings, then
sometimes I'll do them out-of-order.

A number of times I've totally forgotten to change things until
PostgreSQL complains either in the regression tests (as described in
this thread) or during normal startup.

We're talking a few times anyway, probably about.... um... 15 - 20 times
or so that I've forgotten.

Regards and best wishes,

Justin Clift

Tom Lane wrote:
>
> Justin Clift <jclift@iprimus.com.au> writes:
> > If it's of any help, I get the same types of regression testing failures
> > on Solaris, with the same "is the backend running?" type error
> > messages.. when the installation of solaris HAS NOT had its /etc/system
> > file altered to change the amount of shared memory segments and
> > semaphores.
>
> > Whenever I have those problems, I insert the updated (higher) values for
> > shared memory and semaphores, reboot the system, then the tests pass as
> > the backend is able to start fine.
>
> Hm.  That's interesting, but it's fairly hard to believe.  For at least
> a couple releases past, Postgres has grabbed all the shared memory and
> semaphores that it wants at postmaster start.  Insufficient shmem/sema
> resources should result in postmaster abort, not in occasional failures
> to start backends.
>
>                         regards, tom lane

Re: Tests randomly failed

From
Peter Eisentraut
Date:
Justin Clift writes:

> I don't understand what you mean by "This will avoid using Unix domain
> sockets, which are broken on Solaris.".
>
> If this were the case, then the errors which are described would happen
> on ALL solaris platforms wouldn't they?

I suppose things are a bit more complicated than that.  We once had a
brief suspicion that it could be related to Sun's tmpfs file system that
/tmp often resides on, but I don't think this turned out to be the case.

> And other packages using Unix domain sockets would have problems too
> wouldn't they?

Indeed.  A while ago I looked around and found at least two packages (INN
and Postfix) that had similar-sounding problems.  In fact, one of the two
ended up disabling it with the words "more trouble than it's worth".

You could argue that X and KDE and who knows what else should be broken as well.
This is a good question.  It could perhaps be related to a buffer problem,
under the assumption that X usually passes small amounts of data through
the pipe, whereas PostgreSQL can pass megabytes in a very short time.
(Don't know what INN and Postfix would want to do with a local socket.)


The bottom line here is that the switch from local sockets to TCP/IP
invariably fixes the identical failure pattern.  Make of that what you
will.

--
Peter Eisentraut      peter_e@gmx.net       http://yi.org/peter-e/

Re: Tests randomly failed

From
Alexander Klimov
Date:
On Thu, 22 Mar 2001, Peter Eisentraut wrote:

> In src/test/regress/pg_regress[.sh], line 163, change
>
>     *-*-qnx* | *beos*)
>
> to
>
>     *-*-qnx* | *beos* | *solaris*)
>
> and rerun the tests.  This will avoid using Unix domain sockets, which are
> broken on Solaris.

Yes, it works now:
======================
 All 76 tests passed.
======================

On the other hand, my production version uses Unix domain sockets
without problems.

Regards,
ASK

Re: Tests randomly failed

From
Alexander Klimov
Date:
On Thu, 22 Mar 2001, Tom Lane wrote:
> What I see is a lot of
>
> ! psql: Backend startup failed
>
> which suggests a fork() failure.  Look in the postmaster logfile to see
> the exact kernel error code --- but probably you are out of swap space
> or up against the kernel's limit on number of processes for one userid.
Strange, but this solution *also* works: in /etc/system I raised maxuprc
from 64 to 256:
set maxuprc=256
then reverted pg_regress.sh to its original state (with Unix sockets for
Solaris), and now all tests pass.
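
To check what value is currently configured before editing the file, something like the following could be used. It is a sketch under assumptions: the function name is made up, and it only understands plain "set name=value" lines.

```shell
# Print the value assigned to a tunable by a "set name=value" line in a
# Solaris-style system file; prints nothing if the tunable is not set.
get_system_tunable() {
    name=$1
    system_file=${2:-/etc/system}
    sed -n "s/^[[:space:]]*set[[:space:]][[:space:]]*$name[[:space:]]*=[[:space:]]*\([^[:space:]]*\).*/\1/p" "$system_file"
}
```

For example, `get_system_tunable maxuprc` reads /etc/system and prints the configured per-user process limit, if any. A change to the file only takes effect after a reboot.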

Regards,
ASK