Re: max_connections reached in postgres 9.3.3 - Mailing list pgsql-general

From Merlin Moncure
Subject Re: max_connections reached in postgres 9.3.3
Date
Msg-id CAHyXU0z+QnP1p+mB1X4Szw7R8LnfJ=xU-Cf65q1hXgOjuH3hcQ@mail.gmail.com
Whole thread Raw
In response to max_connections reached in postgres 9.3.3  ("Vasudevan, Ramya" <ramya.vasudevan@classmates.com>)
Responses Re: max_connections reached in postgres 9.3.3
List pgsql-general
On Wed, Jun 11, 2014 at 1:24 PM, Vasudevan, Ramya
<ramya.vasudevan@classmates.com> wrote:
> Our set up:
>
> ·         Db version: postgres 9.3.3
>
> ·         OS: CentOS 6.5
>
> ·         kernel Version - Linux 2.6.32-431.3.1.el6.x86_64 #1 SMP Fri Jan 3
> 21:39:27 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
>
> ·         cpu - 24 proc
>
> ·         memory - 768 GB
>
> ·         The disks are SAN fiber.
>
> ·         We have streaming replication set up from this server to the
> secondary server.
>
> ·         Some of the DB parameters:
>
> max_connections  - 1500
>
> shared_buffers - 4GB
>
> work_mem = 130MB
>
>  maintenance_work_mem = 1GB
>
>
>
> Issue this morning:
>
> At 9:04 - FATAL:  remaining connection slots are reserved for
> non-replication superuser connections
>
> These were the things we noticed and have question on each one of them:
>
> pg_stat_activity showed a total of 850 connections but a 'ps -ef|grep
> postgers' showed 1500 connections and DB wasn’t accepting new connections.
> How is it possible to have such discrepancy in count between
> pg_stat_activity and ps?
>
> 200 were in waiting status, 50 were in authentication status, 748 were in
> IDLE status and the rest were all doing some DML operations. Why will 50
> connections be in authentication status in the database and stay there?
>
>  Some existing sessions were completing the sqls successfully (though not as
> fast). So, the database was not completely hung or unresponsive.
>
> Trying to kill the session at the DB level (pg_terminate_backend) were just
> hanging. Why would this be the case?
>
> During this time, server had a very high load average of 130 (normally we
> have < 5).
>
> We had to restart the database to clear it all up.

There are two problems here:
1. there was some type of 'high load' event.  Perhaps it was spurious,
like one of the various o/s mishaps like THP or something on your end
like a bad query plan or some database impacting application bug.
When this happens, queries are accumulating faster than they are
clearing and load escalates

2. your database is configured with very high max connection count and
your application servers open connections without bound


This is a very common problem: there is a trigger that causes some
type of huccup then queries storm in and overload the database.
Raising the max_connection count is not a good solution beyond a
certain threshold as this will simple make your load issues more
severe when you are trying to rescue the database and you are facing
emergency restart.

Obviously fixing #1 is ideal.  The problem is, all too often by the
time it comes to your attention you are in emergency fix mode as the
server is unresponsive (or barely-) even from psql.

One way of managing problem #2 is to install something like pgbouncer
which is a 'man in the middle' connection pooler and virtualizes the
connection count so that you only have a very well defined number of
physical connections actually in use by the database.  Installing this
is not to be taken lightly but what it *does* do is protect the
database when something like occurs -- load will be capped and you can
still get emergency diagnostic information out of the server  (like
perf or sar) query logs out of pg_stat_activity, and pull of cancels
and things like that.

merlin


pgsql-general by date:

Previous
From: Keith
Date:
Subject: Re: How can I tell if pg_restore is running?
Next
From: Tom Lane
Date:
Subject: Re: [HACKERS] Question about partial functional indexes and the query planner