Thread: Connection problem under extreme load.

Connection problem under extreme load.

From
Jeffery Collins
Date:
We have been doing some load testing with PostgreSQL, and we have been
getting the following error when libpq attempts to connect to the
backend.  This only happens occasionally and, as I said, under extreme
load (e.g. a load average of 30+ on a single-processor Sun).

connectDBStart() -- connect() failed: Connection refused
Is the postmaster running at 'localhost' and accepting connections on
Unix socket '6700'?

Has anyone seen this before, or does anyone know what could be happening?
One theory we have is that the connection request timed out because the
server was under such heavy load that it couldn't respond to the request.
Is this possible?

Thank you,
Jeff Collins



Re: Connection problem under extreme load.

From
Thomas Lockhart
Date:
> We have been doing some load testing with PostgreSQL, and we have been
> getting the following error when libpq attempts to connect to the
> backend.  This only happens occasionally and, as I said, under extreme
> load (e.g. a load average of 30+ on a single-processor Sun).
> connectDBStart() -- connect() failed: Connection refused
> Is the postmaster running at 'localhost' and accepting connections on
> Unix socket '6700'?
> Has anyone seen this before, or does anyone know what could be happening?
> One theory we have is that the connection request timed out because the
> server was under such heavy load that it couldn't respond to the request.
> Is this possible?

You are increasing the number of allowed connections above 32, right?
The runtime default is 32, but it can be increased with a command-line
switch.

                     - Thomas

Re: Connection problem under extreme load.

From
Jeffery Collins
Date:
Thomas Lockhart wrote:

> You are increasing the number of allowed connections above 32, right?
> The runtime default is 32, but it can be increased with a command-line
> switch.
>
>                      - Thomas

Yep.  I should have mentioned that I increased the number of allowed
connections using the config option.  I am assuming this also changes the
runtime default.  I did see over 100 backend processes running concurrently.

Jeff



Re: Connection problem under extreme load.

From
Tom Lane
Date:
Jeffery Collins <collins@onyx-technologies.com> writes:
> We have been doing some load testing with PostgreSQL, and we have been
> getting the following error when libpq attempts to connect to the
> backend.  This only happens occasionally and, as I said, under extreme
> load (e.g. a load average of 30+ on a single-processor Sun).

> connectDBStart() -- connect() failed: Connection refused
> Is the postmaster running at 'localhost' and accepting connections on
> Unix socket '6700'?

Interesting.  I *think* (not totally sure) that 'Connection refused'
here implies that the kernel rejected the connection before the
postmaster ever had a chance to do anything with it.  The most likely
reason would probably be that the maximum connection backlog was
exceeded.  On my system (HPUX) man listen(2) sez

     int listen(int s, int backlog);

     ...

     backlog defines the desirable queue length for pending connections.
     The actual queue length may be greater than the specified backlog. If
     a connection request arrives when the queue is full, the client will
     receive an ETIMEDOUT error.

     backlog is limited to the range of 0 to SOMAXCONN, which is defined in
     <sys/socket.h>.  SOMAXCONN is currently set to 20.  If any other value
     is specified, the system automatically assigns the closest value
     within the range.  A backlog of 0 specifies only 1 pending connection
     is allowed at any given time.

ETIMEDOUT is not the error you are getting, but that could be a platform
difference.  In fact the nearest BSD system I have access to says that
"the client will receive an error with an indication of ECONNREFUSED".
The same box defines SOMAXCONN as 5, which seems a tad low :-(
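The backlog theory can be checked locally.  Below is a minimal C sketch,
assuming a POSIX system; the helper name flood_backlog and the socket path
are made up for illustration, and this is not how the postmaster opens its
own socket.  It listens on a Unix-domain socket with a tiny backlog, never
calls accept(), and counts how many connect() attempts the kernel turns
away.  The errno it records is platform-dependent, in line with the
ETIMEDOUT versus ECONNREFUSED difference noted above.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical demo helper (not PostgreSQL code): listen on a Unix-domain
   socket at `path` with the given `backlog`, never accept(), and fire
   `attempts` non-blocking connect() calls at it.  Returns how many were
   rejected, or -1 on setup failure; *last_errno receives the errno of the
   last rejection, or 0 if nothing was rejected. */
static int flood_backlog(const char *path, int backlog, int attempts,
                         int *last_errno)
{
    enum { MAX_CLIENTS = 64 };
    int clients[MAX_CLIENTS];
    int nclients = 0, rejected = 0;
    struct sockaddr_un addr;

    memset(&addr, 0, sizeof addr);
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, path, sizeof addr.sun_path - 1);
    *last_errno = 0;

    unlink(path);  /* remove a stale socket file from an earlier run */
    int srv = socket(AF_UNIX, SOCK_STREAM, 0);
    if (srv < 0)
        return -1;
    if (bind(srv, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        listen(srv, backlog) < 0) {
        close(srv);
        return -1;
    }

    for (int i = 0; i < attempts && i < MAX_CLIENTS; i++) {
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);
        if (fd < 0)
            break;
        /* Non-blocking: with a full queue, Linux makes a blocking
           connect() wait instead of failing. */
        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_NONBLOCK);
        if (connect(fd, (struct sockaddr *)&addr, sizeof addr) == 0 ||
            errno == EINPROGRESS) {
            clients[nclients++] = fd;  /* stays queued: nobody accepts */
        } else {
            rejected++;
            /* BSD-derived kernels typically give ECONNREFUSED,
               Linux gives EAGAIN for a full Unix-socket queue. */
            *last_errno = errno;
            close(fd);
        }
    }

    for (int i = 0; i < nclients; i++)
        close(clients[i]);
    close(srv);
    unlink(path);
    return rejected;
}
```

Called from main() as, say, flood_backlog("/tmp/pg_backlog_demo.sock", 1,
16, &err), it should report several rejections; printing strerror(err)
shows which errno the local kernel chose.  Printing the SOMAXCONN constant
from <sys/socket.h> the same way shows the limit the headers advertise on
that box.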

So, it would seem your options are
  (a) recompile your kernel with larger SOMAXCONN, or
  (b) figure out why the postmaster isn't responding faster.

Offhand, the only performance problem I know of in the postmaster is
that it does IDENT checks serially --- if you specify ident checks in
pg_hba.conf, the postmaster will wait for a response from the ident
server before processing more connection requests.  So if you're using
IDENT authentication you might want to consider some other answer, or
else fix that code and send in a patch.
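To see why one slow step in the accept path matters, here is a toy C model
of that situation.  It is an illustration under simplifying assumptions,
not postmaster code, and count_refused is a hypothetical name: requests
arrive at given times, a single server takes them off a fixed-size kernel
queue strictly one at a time, and each costs service_ms to handle (think of
waiting for an IDENT reply).  Any arrival that finds the queue full is
refused.

```c
#define MAX_REQUESTS 1024

/* Toy model, not PostgreSQL code.  `arrivals` holds request times in ms,
   sorted ascending.  A single server removes one request at a time from a
   kernel queue holding at most `backlog` requests, then spends
   `service_ms` on it before taking the next.  An arrival that finds the
   queue full is refused.  Returns the refusal count, or -1 if n is too
   large. */
static int count_refused(const double *arrivals, int n,
                         double service_ms, int backlog)
{
    double queued[MAX_REQUESTS];  /* arrival times, FIFO in [head, tail) */
    int head = 0, tail = 0, refused = 0;
    double server_free = 0.0;     /* time the server can take the next one */

    if (n > MAX_REQUESTS)
        return -1;
    for (int i = 0; i < n; i++) {
        double now = arrivals[i];
        /* Let the server drain whatever it could reach by `now`. */
        while (head < tail) {
            double start = queued[head] > server_free ? queued[head]
                                                      : server_free;
            if (start > now)
                break;
            server_free = start + service_ms;
            head++;
        }
        if (tail - head >= backlog)
            refused++;            /* queue full: the kernel refuses it */
        else
            queued[tail++] = now;
    }
    return refused;
}
```

With a burst of 8 simultaneous requests and a queue of 5 (the SOMAXCONN
value mentioned for the BSD box), a 10 ms stall per request refuses 2 of
them, while a server that hands each request off instantly refuses none.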

If that's not it, please poke into it further and let us know what you
find out.

            regards, tom lane

Re: Connection problem under extreme load.

From
Jeffery Collins
Date:
Tom Lane wrote:

> Interesting.  I *think* (not totally sure) that 'Connection refused'
> here implies that the kernel rejected the connection before the
> postmaster ever had a chance to do anything with it.  The most likely
> reason would probably be that the maximum connection backlog was
> exceeded.  On my system (HPUX) man listen(2) sez
>
>      int listen(int s, int backlog);
>
>      ...
>
>      backlog defines the desirable queue length for pending connections.
>      The actual queue length may be greater than the specified backlog. If
>      a connection request arrives when the queue is full, the client will
>      receive an ETIMEDOUT error.
>
>      backlog is limited to the range of 0 to SOMAXCONN, which is defined in
>      <sys/socket.h>.  SOMAXCONN is currently set to 20.  If any other value
>      is specified, the system automatically assigns the closest value
>      within the range.  A backlog of 0 specifies only 1 pending connection
>      is allowed at any given time.
>
> ETIMEDOUT is not the error you are getting, but that could be a platform
> difference.  In fact the nearest BSD system I have access to says that
> "the client will receive an error with an indication of ECONNREFUSED".
> The same box defines SOMAXCONN as 5, which seems a tad low :-(
>
> So, it would seem your options are
>   (a) recompile your kernel with larger SOMAXCONN, or
>   (b) figure out why the postmaster isn't responding faster.
>
> Offhand, the only performance problem I know of in the postmaster is
> that it does IDENT checks serially --- if you specify ident checks in
> pg_hba.conf, the postmaster will wait for a response from the ident
> server before processing more connection requests.  So if you're using
> IDENT authentication you might want to consider some other answer, or
> else fix that code and send in a patch.
>
> If that's not it, please poke into it further and let us know what you
> find out.
>
>                         regards, tom lane

I think you are correct.  The listen man page on my machine (Sun Solaris)
says:

     If a connection request arrives with the queue full, the client will
     receive an error with an indication of ECONNREFUSED...

SOMAXCONN is also 5 here, which IS a tad low.

Unfortunately, I don't have the ability to rebuild the kernel, so this is not
an option.

As to why the postmaster was not responding faster, I think it was because of
the load on the machine.  The load was so heavy, and there were so many
simultaneous connection requests, that I am not surprised it could not keep
up.  My test was probably not a realistic load.

I think my best option is to retry the connection when this happens.  I do
wish my kernel returned a different error, because there really is no way
to distinguish a legitimate ECONNREFUSED (i.e. the server really isn't
listening) from a full backlog queue.
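The retry plan can be sketched in C as a small loop with exponential
backoff.  The names connect_fn, connect_with_retry, flaky_ctx and
flaky_connect are hypothetical.  libpq reports connection failures as a
message string rather than an errno, so a real wrapper around PQconnectdb()
would need its own rule for deciding what is retryable; here the connector
simply returns an errno-style code, and flaky_connect is a stand-in that
refuses its first few calls so the loop can be exercised.

```c
#include <errno.h>
#include <time.h>

/* Hypothetical connector: returns 0 on success or an errno-style code. */
typedef int (*connect_fn)(void *ctx);

/* Try `fn` up to `max_attempts` (>= 1) times.  Sleep `delay_ms` after the
   first failure and double the delay after each further one.  Only
   ECONNREFUSED is retried, since it may mean "backlog full" rather than
   "no server".  Returns 0 on success, otherwise the last error. */
static int connect_with_retry(connect_fn fn, void *ctx,
                              int max_attempts, long delay_ms)
{
    int err = ECONNREFUSED;
    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        err = fn(ctx);
        if (err != ECONNREFUSED)
            return err;          /* success (0) or a non-retryable error */
        if (attempt < max_attempts) {
            struct timespec ts = { delay_ms / 1000,
                                   (delay_ms % 1000) * 1000000L };
            nanosleep(&ts, NULL);
            delay_ms *= 2;       /* back off so retries don't pile up */
        }
    }
    return err;
}

/* Stand-in connector for testing: refuses `failures_left` times. */
struct flaky_ctx { int failures_left; int calls; };

static int flaky_connect(void *p)
{
    struct flaky_ctx *c = p;
    c->calls++;
    if (c->failures_left > 0) {
        c->failures_left--;
        return ECONNREFUSED;
    }
    return 0;
}
```

Doubling the delay keeps a crowd of clients from hammering a queue that is
already full; because ECONNREFUSED is ambiguous, max_attempts also bounds
how long a client waits when the server really is down.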

Once again, thank you very much,
Jeff