Thread: max_connections reached in postgres 9.3.3

max_connections reached in postgres 9.3.3

From
"Vasudevan, Ramya"
Date:

Our set up:

·         Db version: postgres 9.3.3

·         OS: CentOS 6.5

·         kernel Version - Linux 2.6.32-431.3.1.el6.x86_64 #1 SMP Fri Jan 3 21:39:27 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

·         cpu - 24 proc

·         memory - 768 GB

·         The disks are SAN fiber.

·         We have streaming replication set up from this server to the secondary server.

·         Some of the DB parameters:

max_connections  - 1500

shared_buffers - 4GB

work_mem = 130MB             

 maintenance_work_mem = 1GB

 

Issue this morning:

At 9:04 - FATAL:  remaining connection slots are reserved for non-replication superuser connections

These were the things we noticed, and we have a question on each of them:

pg_stat_activity showed a total of 850 connections, but a 'ps -ef|grep postgres' showed 1500 connections and the DB wasn't accepting new connections. How is it possible to have such a discrepancy in count between pg_stat_activity and ps?

200 were in waiting status, 50 were in authentication status, 748 were in IDLE status, and the rest were all doing some DML operations. Why would 50 connections be in authentication status in the database and stay there?
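
For reference, the breakdown above was taken from pg_stat_activity with something along these lines (a sketch, not necessarily the exact query we ran; column names are the 9.3 ones):

    -- count backends by state as seen from inside the database,
    -- to compare against what 'ps -ef | grep postgres' shows on the host
    SELECT coalesce(state, 'unknown') AS state,
           waiting,
           count(*) AS sessions
    FROM pg_stat_activity
    GROUP BY state, waiting
    ORDER BY sessions DESC;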

Some existing sessions were completing their SQL successfully (though not as fast). So the database was not completely hung or unresponsive.

Attempts to kill sessions at the DB level (pg_terminate_backend) were just hanging. Why would this be the case?

During this time, the server had a very high load average of 130 (normally we have < 5).

We had to restart the database to clear it all up.

 Note:  

We have been having this issue of max_connections being reached for a couple of months now. At one point we tried increasing max_connections to 2500 to see if we would still run out of connections, and we did.

We noticed that generic-segmentation-offload was set to ON on this server. We changed it to OFF.

 

Recent Changes in the environment:

·         We turned off THP on the server a couple of months ago to see if that would fix the issue.

·         We recently (2 weeks ago) moved from RedHat ext2 to xfs because we were having the same problem.

 

Please let us know if you need more information on our environment or on the issue this morning. Any help will be greatly appreciated.

 

Thank you,

________________

Ramya Vasudevan

Database Administrator

CLASSMATES

 

Re: max_connections reached in postgres 9.3.3

From
Adrian Klaver
Date:
On 06/11/2014 11:24 AM, Vasudevan, Ramya wrote:
> [original problem report quoted in full; trimmed]

So unless I am missing something, the issue seems to be unclosed
connections. It would seem the thing to do would be to look at the
application(s) hitting the database and determine why the connections
are not being closed. Increasing max_connections is just driving deeper
into the wreck.
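
As a first step, something along these lines (a rough sketch against the
9.3 pg_stat_activity view; application_name only helps if your clients
actually set it) will show where the idle connections come from and how
long they have been sitting idle:

    -- idle connections grouped by where they came from
    SELECT application_name,
           client_addr,
           count(*)                  AS idle_sessions,
           max(now() - state_change) AS longest_idle
    FROM pg_stat_activity
    WHERE state = 'idle'
    GROUP BY application_name, client_addr
    ORDER BY idle_sessions DESC;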



--
Adrian Klaver
adrian.klaver@aklaver.com


Re: max_connections reached in postgres 9.3.3

From
Merlin Moncure
Date:
On Wed, Jun 11, 2014 at 1:24 PM, Vasudevan, Ramya
<ramya.vasudevan@classmates.com> wrote:
> [original problem report quoted in full; trimmed]

There are two problems here:
1. There was some type of 'high load' event.  Perhaps it was spurious,
like one of the various o/s mishaps such as THP, or something on your
end like a bad query plan or some database-impacting application bug.
When this happens, queries accumulate faster than they clear and load
escalates.

2. Your database is configured with a very high max connection count
and your application servers open connections without bound.


This is a very common problem: there is a trigger that causes some
type of hiccup, then queries storm in and overload the database.
Raising the max_connections count is not a good solution beyond a
certain threshold, as it will simply make your load issues more severe
when you are trying to rescue the database and are facing an emergency
restart.

Obviously fixing #1 is ideal.  The problem is, all too often by the
time it comes to your attention you are in emergency fix mode, as the
server is unresponsive (or barely responsive) even from psql.

One way of managing problem #2 is to install something like pgbouncer,
which is a 'man in the middle' connection pooler that virtualizes the
connection count so that you only have a very well-defined number of
physical connections actually in use by the database.  Installing it
is not to be taken lightly, but what it *does* do is protect the
database when something like this occurs -- load will be capped, and
you can still get emergency diagnostic information out of the server
(like perf or sar), pull query logs out of pg_stat_activity, issue
cancels, and things like that.
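
For example, the kind of snapshot worth grabbing during an event, and the
kind of emergency cancel you can still issue, looks something like this
(a sketch only -- the five minute cutoff is an arbitrary number, and the
column names are the 9.3 ones):

    -- how many backends are in each state, and how long the worst one has been running
    SELECT state,
           waiting,
           count(*)                 AS sessions,
           max(now() - query_start) AS longest_running
    FROM pg_stat_activity
    GROUP BY state, waiting
    ORDER BY sessions DESC;

    -- cancel (not terminate) anything that has been active for more than five minutes
    SELECT pg_cancel_backend(pid)
    FROM pg_stat_activity
    WHERE state = 'active'
      AND now() - query_start > interval '5 minutes'
      AND pid <> pg_backend_pid();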

merlin


Re: max_connections reached in postgres 9.3.3

From
Merlin Moncure
Date:
On Thu, Jun 12, 2014 at 1:23 PM, Vasudevan, Ramya
<ramya.vasudevan@classmates.com> wrote:
> Thank you for the response.
>
> On further investigation, we found out that select statements were happening
> normally, but DMLs (writes to the DB) were hung for minutes at a time, and
> some of them went through. And we had 2 checkpoints during this period.
> Yesterday when we had the issue, we had 759 connections that were not idle
> (116 COMMIT, 238 INSERT, 176 UPDATE, 57 AUTHENTICATION, 133 BIND). So it
> looked like writes (simple inserts and updates) were not happening as fast
> and caused connections to back up in the DB. So it didn't look like any one
> bad query.
> It almost seems like postgres could not write to the WAL logs.
>
> We normally have 600-700 connections in the database. Since we migrated a
> lot more applications to this postgres database from oracle, we increased
> max_connections just as a test to see if we legitimately need to allow more
> connections or if it is an issue. And quickly realized that we already had
> a high number (1500).

Be sure to rule out locks -- some operation blocks queries until it
clears, then everyone storms in.  It's a good idea to count blocked
queries (select count(*) from pg_locks where not granted) and raise an
alarm when you see any accumulation there.
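
If that count ever goes nonzero, the next question is who is blocking
whom; on 9.3 (no pg_blocking_pids() yet) a pg_locks self-join along these
lines does the job (untested sketch, adapted from the usual
lock-monitoring recipe):

    -- pair each ungranted lock request with a granted lock on the same object
    SELECT blocked.pid        AS blocked_pid,
           blocked_act.query  AS blocked_query,
           blocking.pid       AS blocking_pid,
           blocking_act.query AS blocking_query
    FROM pg_locks blocked
    JOIN pg_locks blocking
      ON  blocking.locktype = blocked.locktype
      AND blocking.database      IS NOT DISTINCT FROM blocked.database
      AND blocking.relation      IS NOT DISTINCT FROM blocked.relation
      AND blocking.page          IS NOT DISTINCT FROM blocked.page
      AND blocking.tuple         IS NOT DISTINCT FROM blocked.tuple
      AND blocking.virtualxid    IS NOT DISTINCT FROM blocked.virtualxid
      AND blocking.transactionid IS NOT DISTINCT FROM blocked.transactionid
      AND blocking.classid       IS NOT DISTINCT FROM blocked.classid
      AND blocking.objid         IS NOT DISTINCT FROM blocked.objid
      AND blocking.objsubid      IS NOT DISTINCT FROM blocked.objsubid
      AND blocking.pid <> blocked.pid
    JOIN pg_stat_activity blocked_act  ON blocked_act.pid  = blocked.pid
    JOIN pg_stat_activity blocking_act ON blocking_act.pid = blocking.pid
    WHERE NOT blocked.granted
      AND blocking.granted;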

merlin


Re: max_connections reached in postgres 9.3.3

From
Merlin Moncure
Date:
On Thu, Jun 12, 2014 at 1:51 PM, Vasudevan, Ramya
<ramya.vasudevan@classmates.com> wrote:
> Thanks Merlin.
>
> We did look at the locks in the DB and all we saw were RowExclusiveLock,
> AccessShareLock, RowShareLock, AccessExclusiveLock. The ExclusiveLocks we
> saw were all on virtualxids.
>
> I think max_connections maxing out is due to the DB not being able to write
> and complete existing connections before new legitimate ones come in, thus
> piling up connections until it reaches 1500 and starts erroring out.

(Aside: please stop top-posting.)  The key is not what kind of locks you
have, but whether any queries are blocked.  Check pg_stat_activity for
'waiting'.  Via pg_locks you can check for granted = f.

If you've ruled out locking, the next steps are to:
1) make sure slow query logging (log_min_duration_statement) is on --
we need to rule out a bad application query
2) gather detailed system performance diagnostics during the load
event.  We need to know precise values for user, system, and iowait CPU
3) find out whether transaction rates spike going into (as opposed to
during) the load event.  We can capture tps rates by logging the output
of: select * from pg_stat_database (a sketch follows below).
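
Something like the following, run from cron every minute or so and
logged, is enough -- a sketch; diff xact_commit + xact_rollback between
consecutive samples to get transactions per second:

    -- one sample per database; the deltas between samples give the tps rate
    SELECT now() AS sample_time,
           datname,
           xact_commit,
           xact_rollback,
           xact_commit + xact_rollback AS xact_total
    FROM pg_stat_database
    WHERE datname NOT IN ('template0', 'template1');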

Basically we need to narrow the problem down to an application-side
issue (always suspect this first), an o/s issue (there are a number of
interesting potential culprits such as THP compaction), a
storage-related problem (for example, a database checkpoint storm or a
page flush storm in the o/s), or something else entirely.

merlin


Re: max_connections reached in postgres 9.3.3

From
"Vasudevan, Ramya"
Date:
Thanks Merlin.

We did look at the locks in the DB and all we saw were RowExclusiveLock, AccessShareLock, RowShareLock,
AccessExclusiveLock. The ExclusiveLocks we saw were all on virtualxids.
 

I think max_connections maxing out is due to the DB not being able to write and complete existing connections
before new legitimate ones come in, thus piling up connections until it reaches 1500 and starts erroring out.
 

Thank You
Ramya



Re: max_connections reached in postgres 9.3.3

From
"Vasudevan, Ramya"
Date:
Thank you for the response.

On further investigation, we found out that select statements were happening normally, but DMLs (writes to the DB) were
hung for minutes at a time, and some of them went through. And we had 2 checkpoints during this period. Yesterday when
we had the issue, we had 759 connections that were not idle (116 COMMIT, 238 INSERT, 176 UPDATE, 57 AUTHENTICATION, 133
BIND). So it looked like writes (simple inserts and updates) were not happening as fast and caused connections to back
up in the DB. So it didn't look like any one bad query.
 
It almost seems like postgres could not write to the WAL logs.

We normally have 600-700 connections in the database. Since we migrated a lot more applications to this postgres database
from oracle, we increased max_connections just as a test to see if we legitimately need to allow more connections or if
it is an issue. And quickly realized that we already had a high number (1500).
 

Thank You
Ramya



Re: max_connections reached in postgres 9.3.3

From
Kevin Grittner
Date:
Merlin Moncure <mmoncure@gmail.com> wrote:

> or something else entirely.


It strikes me that this might be relevant:

http://wiki.postgresql.org/wiki/Number_Of_Database_Connections


--
Kevin Grittner

EDB: http://www.enterprisedb.com

The Enterprise PostgreSQL Company


Re: max_connections reached in postgres 9.3.3

From
Merlin Moncure
Date:
On Thu, Jun 12, 2014 at 3:32 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
> Merlin Moncure <mmoncure@gmail.com> wrote:
>
>> or something else entirely.
>
>
> It strikes me that this might be relevant:

Agreed.  The stock advice for many, many problems of this sort is 'use
pgbouncer', but it can be hard to work into a lot of code bases, and we
have to be careful to rule out some underlying possible contributing
factors before switching things up too much.  THP compaction in
particular has been plaguing servers throughout the company I work for;
we haven't figured out why the OP's system became loaded all of a
sudden.  Nevertheless, having a setup that can accumulate thousands of
connections during high load events is not the way to go.  Other than
pgbouncer, being very spartan with application server connection pools
can bring benefits (but can also cause problems).

merlin


Re: max_connections reached in postgres 9.3.3

From
Kevin Grittner
Date:
Merlin Moncure <mmoncure@gmail.com> wrote:

> we have to be careful to rule out some underlying possible
> contributing factors before switching things up too much.

Agreed.

> THP compaction in particular has been plaguing servers throughout the
> company I work for;

I've seen many support tickets where turning off Transparent Huge
Page support has been the solution, but in all cases that I've seen
it is *system* CPU time that spikes when that is the problem, and
the OP here showed a graph where it was *user* CPU time spiking.
With high concurrency that is usually (in my experience) spinlocks
inside of PostgreSQL -- often spinlocks guarding transitions of hot
lightweight locks.

> we haven't figured out why the OP's system became loaded all of a sudden.

Agreed, although I feel that THP problems are not indicated because
system CPU time wasn't pegged, and write gluts seem unlikely based
on the IO wait numbers.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: max_connections reached in postgres 9.3.3

From
"Vasudevan, Ramya"
Date:
Merlin, Thank you for the response.

On the waiting queries - When we reached 1500 connections, we had 759 connections that were in active state (116
COMMIT, 238 INSERT, 176 UPDATE, 57 AUTHENTICATION, 133 BIND). These active INSERTs and UPDATEs also include the 80-90
waiting sessions (we checked pg_stat_activity for 'waiting' state, and pg_locks for granted=f). The blocking and the
waiting queries were simple one-row updates/inserts/deletes. These shouldn't be blocking each other normally (unless,
we think, there was a problem writing to the disk). Correct me if I am wrong.
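
For reference, the check of pg_stat_activity for waiting sessions looks roughly like this (a sketch, not necessarily
the exact query we ran; 9.3 column names):

    -- waiting sessions, how long their current statement has been running, and the statement itself
    SELECT a.pid,
           now() - a.query_start AS running_for,
           a.waiting,
           a.query
    FROM pg_stat_activity a
    WHERE a.waiting
    ORDER BY running_for DESC;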
 

1) log_min_duration_statement >>
We have it set to 200ms and are capturing all slow queries.

2) detailed system performance diagnostics during the load event.  We need to know precise values for user, system, and iowait
>>
First occurrence of the 'max_connections reached' problem: it seems to have started at about 10:40.

At 10:42, we ran out of connections:
Jun 17 10:42:21 FATAL:  remaining connection slots are reserved for non-replication superuser connections

At 10:48:
We dropped the disk cache (used
http://www.evanhoffman.com/evan/2012/09/25/problems-with-postgresql-9-0-on-centos-6-3-with-a-2-6-32-series-kernel/
as a reference). However, that didn't help.
 

vmstat information (with user cpu, system cpu, idle % and i/o wait):
-----------------------------------------------------------------------------------
At 10:40:19, user cpu started climbing, and cs (context switches) and in (interrupts) climbed up as well. No real change
in swap, bi (blocks in), or bo (blocks out).
 
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------ ---timestamp---
  r  b  swpd    free    buff      cache   si  so    bi     bo     in      cs  us sy id wa st
  7  0     0  939888  145808  774588544    0   0  1524    772  21177   32761  26  6 67  1  0  2014-06-17 10:40:06 PDT
 14  0     0  951436  145816  774589952    0   0  1128   1396  22584   42339  34  6 59  1  0  2014-06-17 10:40:07 PDT
 12  1     0  916132  145816  774591360    0   0  2036   1956  23132   35945  27  6 65  2  0  2014-06-17 10:40:08 PDT
  7  0     0  925444  145816  774593920    0   0  1604   3156  23722   78796  35  8 55  1  0  2014-06-17 10:40:09 PDT
 15  2     0  946200  145816  774595840    0   0  1420    460  24051   56856  35  8 57  1  0  2014-06-17 10:40:10 PDT
  6  2     0  979356  145816  774597376    0   0  1708   5016  23379   33239  31  6 61  1  0  2014-06-17 10:40:11 PDT
  6  0     0 1005784  145816  774615616    0   0  1980   6792  23584   36025  30  6 62  2  0  2014-06-17 10:40:12 PDT
  9  1     0 1004352  145816  774618880    0   0  1760   9784  24826   63949  35  7 57  2  0  2014-06-17 10:40:13 PDT
 18  0     0  980156  145816  774621568    0   0  1260   9336  23897   69468  35  8 56  1  0  2014-06-17 10:40:14 PDT
  4  1     0  960672  145816  774639168    0   0  1516   6420  23235   38023  33  7 58  1  0  2014-06-17 10:40:15 PDT
 19  0     0  944560  145816  774642048    0   0  1880   6940  23337   39658  32  6 60  2  0  2014-06-17 10:40:16 PDT
 16  1     0  897036  145816  774661440    0   0  2072   6024  26300   52597  39  8 53  1  0  2014-06-17 10:40:17 PDT
 16  5     0  880428  145816  774664192    0   0  1460  12652  22617   28171  32  5 62  1  0  2014-06-17 10:40:18 PDT
 11  0     0  885412  145816  774666112    0   0  1840   7884  25048   45393  39  7 53  1  0  2014-06-17 10:40:19 PDT
* interrupts and context switches (in and cs columns) start climbing about here *
 24  1     0  900336  145816  774685120    0   0  2248   4804  26126   42707  42  7 50  1  0  2014-06-17 10:40:20 PDT
 14  2     0  895120  145816  774687744    0   0  1856  22660  27721   57019  46  8 45  1  0  2014-06-17 10:40:21 PDT
 27  0     0  881216  145816  774690496    0   0  1756   5872  27643   44114  43  7 48  2  0  2014-06-17 10:40:22 PDT
 19  0     0  865320  145816  774692864    0   0  1704  25752  34434  132248  59 13 27  1  0  2014-06-17 10:40:23 PDT
  8  0     0  868056  145816  774711680    0   0  1308  10960  30492   81550  52 10 37  2  0  2014-06-17 10:40:24 PDT
 33  0     0  853888  145816  774713152    0   0   616   2960  31951   61918  68 10 21  0  0  2014-06-17 10:40:25 PDT
 44  0     0  784172  145816  774714176    0   0   700   2292  40585   77163  77 15  8  0  0  2014-06-17 10:40:26 PDT
 32  0     0  613712  145816  774715008    0   0   768   1172  44675  108155  74 17  8  0  0  2014-06-17 10:40:27 PDT
 68  0     0  564336  145764  774600320    0   0   960   4892  46838  145268  68 21 11  0  0  2014-06-17 10:40:28 PDT
 49  1     0  565880  145732  774426560    0   0  1340   4112  47306  165764  68 21 11  0  0  2014-06-17 10:40:29 PDT
  3  0     0  552808  145716  774312896    0   0  2072   1336  46749  160141  66 23 11  1  0  2014-06-17 10:40:30 PDT
 26  2     0  549716  145704  774139328    0   0  1260   1212  47234  164933  67 22 10  0  0  2014-06-17 10:40:31 PDT
 13  1     0  579216  145676  773907072    0   0  1116   1328  46831  173111  64 24 12  0  0  2014-06-17 10:40:32 PDT
 44  0     0  563372  145640  773703680    0   0  1384   3120  49568  166285  68 23  9  0  0  2014-06-17 10:40:33 PDT
 55  0     0  545040  145624  773587776    0   0  1056   3864  50042  167987  68 23  9  0  0  2014-06-17 10:40:34 PDT
 86  2     0  572432  145572  773430144    0   0   944   1084  58868  140190  77 18  4  0  0  2014-06-17 10:40:35 PDT
 63  0     0  555424  145548  773372864    0   0   312    616  61136  126323  82 16  3  0  0  2014-06-17 10:40:36 PDT
 71  0     0  584556  145500  773255936    0   0   888    852  59944  128606  81 16  3  0  0  2014-06-17 10:40:37 PDT
 45  1     0  575360  145500  773140224    0   0   796   1844  60107  128723  81 16  3  0  0  2014-06-17 10:40:38 PDT
 37  0     0  560712  145472  772965120    0   0  1440   4796  57150  140963  76 19  5  0  0  2014-06-17 10:40:39 PDT
112  0     0  541944  145424  772732480    0   0  2340   1272  53304  154266  70 22  7  0  0  2014-06-17 10:40:40 PDT
 55  1     0  548428  145368  772440000    0   0   852    816  51299  166797  71 22  7  0  0  2014-06-17 10:40:41 PDT
 21  0     0  584668  145288  772177216    0   0  1232    992  49328  157423  70 23  7  0  0  2014-06-17 10:40:42 PDT
 61  0     0  552688  145236  771940608    0   0  2224   5216  49301  165526  66 25  9  0  0  2014-06-17 10:40:43 PDT
 29  1     0  564316  145216  771679104    0   0  1340   4832  48844  164812  68 24  8  0  0  2014-06-17 10:40:44 PDT
 34  0     0  566816  145192  771444288    0   0  1156    776  49673  155265  71 22  7  0  0  2014-06-17 10:40:45 PDT
 34  0     0  569152  145176  771268608    0   0  1232  33408  49762  138163  74 20  6  0  0  2014-06-17 10:40:46 PDT
 32  1     0  582320  145144  771034752    0   0   712    596  47012  156240  69 22  9  0  0  2014-06-17 10:40:47 PDT


iostat information, consolidated from the output of iostat -x into one-minute intervals:
--------------------------------------------------------------------------------------------------
 RPS: reads in blocks per second
 WPS: writes in blocks per second
 BLK_R: blocks read
 BLK_W: blocks written

This is our file system mount point information:
 site and site2 - mounts holding the data files
 data - pg_xlog
 archive - archive logs

In general, we do not see anything that stands out in terms of I/O leading up to the problem.

I/O information on all mount points:
site - RPS went up after the cache was cleared, as expected. But no pattern that I can see leading up to the issue:
20140617 10:36:       site: RPS:  1433.91 WPS:   599.67 BLK_R:  1434.27 BLK_W:     1.08
20140617 10:37:       site: RPS:  1364.76 WPS:   583.80 BLK_R:  1364.87 BLK_W:     0.88
20140617 10:38:       site: RPS:  1392.00 WPS:  1412.67 BLK_R:  1392.07 BLK_W:     0.75
20140617 10:39:       site: RPS:  1481.12 WPS:   664.47 BLK_R:  1481.20 BLK_W:     0.88
20140617 10:40:       site: RPS:  1242.64 WPS:  2034.67 BLK_R:  1242.73 BLK_W:     1.08 (problem)
20140617 10:41:       site: RPS:   758.62 WPS:   620.86 BLK_R:   758.73 BLK_W:     1.10
20140617 10:42:       site: RPS:  1482.59 WPS:  1304.47 BLK_R:  1482.80 BLK_W:     0.80
20140617 10:43:       site: RPS:  1699.67 WPS:  3165.47 BLK_R:  1699.67 BLK_W:     1.13
20140617 10:44:       site: RPS:  1072.45 WPS:  1193.67 BLK_R:  1072.67 BLK_W:     1.05
20140617 10:45:       site: RPS:   941.51 WPS:   594.07 BLK_R:   941.67 BLK_W:     0.82
20140617 10:46:       site: RPS:   622.42 WPS:   358.07 BLK_R:   622.73 BLK_W:     0.82
20140617 10:47:       site: RPS:   781.18 WPS:  2038.07 BLK_R:   781.13 BLK_W:     0.98
20140617 10:48:       site: RPS:  1582.35 WPS:   730.27 BLK_R:  1582.40 BLK_W:     0.93 (cache cleared)
20140617 10:49:       site: RPS:  7954.96 WPS:  1713.60 BLK_R:  7955.87 BLK_W:     1.10
20140617 10:50:       site: RPS: 10837.16 WPS:  1065.13 BLK_R: 10839.73 BLK_W:     1.05
20140617 10:51:       site: RPS: 27859.84 WPS:   846.22 BLK_R: 27863.53 BLK_W:     1.02
20140617 10:52:       site: RPS: 23309.31 WPS:  1650.86 BLK_R: 23312.67 BLK_W:     1.37
20140617 10:53:       site: RPS: 28450.93 WPS:   220.83 BLK_R: 28454.00 BLK_W:     0.78
20140617 10:54:       site: RPS: 14312.11 WPS:   272.51 BLK_R: 14313.60 BLK_W:     1.18
20140617 10:55:       site: RPS: 29515.76 WPS:   157.33 BLK_R: 29516.00 BLK_W:     0.58
20140617 10:56:       site: RPS: 22423.51 WPS:   683.07 BLK_R: 22426.80 BLK_W:     0.73

site2 - WPS went down after the problem, and even further down after the cache was cleared.
Definitely an effect, not a cause.
The DB was hung, and so there were fewer writes.
Again, no pattern that I can see leading up to the issue:

20140617 10:36:      site2: RPS:  1199.19 WPS:  1366.67 BLK_R:  1199.47 BLK_W:     1.17
20140617 10:37:      site2: RPS:  1241.60 WPS:  1450.93 BLK_R:  1241.60 BLK_W:     1.27
20140617 10:38:      site2: RPS:  1277.89 WPS:  1306.53 BLK_R:  1278.13 BLK_W:     1.20
20140617 10:39:      site2: RPS:  1254.00 WPS:  1174.40 BLK_R:  1254.00 BLK_W:     1.07
20140617 10:40:      site2: RPS:  1106.41 WPS:  1206.39 BLK_R:  1106.53 BLK_W:     0.87 (problem)
20140617 10:41:      site2: RPS:   360.63 WPS:   898.53 BLK_R:   360.67 BLK_W:     0.87
20140617 10:42:      site2: RPS:  1290.44 WPS:   418.27 BLK_R:  1290.67 BLK_W:     0.93
20140617 10:43:      site2: RPS:  1804.53 WPS:  1614.93 BLK_R:  1804.53 BLK_W:     1.27
20140617 10:44:      site2: RPS:   815.21 WPS:  1029.73 BLK_R:   815.47 BLK_W:     0.73
20140617 10:45:      site2: RPS:   731.70 WPS:   640.67 BLK_R:   732.00 BLK_W:     1.33
20140617 10:46:      site2: RPS:   415.36 WPS:   362.80 BLK_R:   415.47 BLK_W:     1.13
20140617 10:47:      site2: RPS:   461.93 WPS:   268.13 BLK_R:   462.27 BLK_W:     1.13
20140617 10:48:      site2: RPS:   559.47 WPS:   175.07 BLK_R:   559.47 BLK_W:     1.07
20140617 10:49:      site2: RPS:   578.12 WPS:   306.53 BLK_R:   578.27 BLK_W:     1.20
20140617 10:50:      site2: RPS:   419.14 WPS:   155.60 BLK_R:   419.20 BLK_W:     1.10
20140617 10:51:      site2: RPS:   407.66 WPS:   152.80 BLK_R:   407.73 BLK_W:     1.13
20140617 10:52:      site2: RPS:   316.07 WPS:   204.62 BLK_R:   316.00 BLK_W:     1.07
20140617 10:53:      site2: RPS:  4957.84 WPS:   122.80 BLK_R:  4958.00 BLK_W:     1.13
20140617 10:54:      site2: RPS:  5731.85 WPS:   236.40 BLK_R:  5732.40 BLK_W:     0.90
20140617 10:55:      site2: RPS: 69626.67 WPS:     0.67 BLK_R: 69626.67 BLK_W:     0.17 (cache cleared)
20140617 10:56:      site2: RPS: 77967.18 WPS:     2.93 BLK_R: 77980.80 BLK_W:     0.27
20140617 10:57:      site2: RPS: 92866.05 WPS:    12.79 BLK_R: 92880.93 BLK_W:     3.13
20140617 10:58:      site2: RPS: 84538.00 WPS:    22.53 BLK_R: 84538.00 BLK_W:     4.20
20140617 10:59:      site2: RPS: 37730.09 WPS:    72.26 BLK_R: 37737.33 BLK_W:     2.93
20140617 11:00:      site2: RPS: 82512.67 WPS:    13.60 BLK_R: 82512.67 BLK_W:     2.27
20140617 11:01:      site2: RPS: 78838.10 WPS:    23.33 BLK_R: 78850.67 BLK_W:     4.17

data - A minute after the issue, a spurt in reads (could be a one-off).
Writes drop (effect, not cause).
After the cache is cleared, reads from disk increase and writes decrease (expected):

20140617 10:11:       data: RPS:     0.00 WPS:  1342.90 BLK_R:     0.00 BLK_W:     3.73
20140617 10:12:       data: RPS:     0.00 WPS:  1187.73 BLK_R:     0.00 BLK_W:     4.57
20140617 10:13:       data: RPS:     0.00 WPS:  1318.91 BLK_R:     0.00 BLK_W:     4.23
20140617 10:14:       data: RPS:     0.00 WPS:  1350.00 BLK_R:     0.00 BLK_W:     4.17
20140617 10:15:       data: RPS:     0.00 WPS:  1539.12 BLK_R:     0.00 BLK_W:     3.83
20140617 10:16:       data: RPS:     0.00 WPS:  1255.47 BLK_R:     0.00 BLK_W:     4.47
20140617 10:17:       data: RPS:     0.00 WPS:  1734.58 BLK_R:     0.00 BLK_W:     4.20
20140617 10:18:       data: RPS:     0.00 WPS:  2082.13 BLK_R:     0.00 BLK_W:     4.13
20140617 10:19:       data: RPS:     0.00 WPS:  1560.90 BLK_R:     0.00 BLK_W:     4.70
20140617 10:20:       data: RPS:     0.00 WPS:  1413.73 BLK_R:     0.00 BLK_W:     3.57
20140617 10:21:       data: RPS:     0.00 WPS:  1677.33 BLK_R:     0.00 BLK_W:     3.90
20140617 10:22:       data: RPS:     0.27 WPS:  1454.13 BLK_R:     0.27 BLK_W:     3.97
20140617 10:23:       data: RPS:     0.00 WPS:  1439.16 BLK_R:     0.00 BLK_W:     4.00
20140617 10:24:       data: RPS:     0.00 WPS:  1323.47 BLK_R:     0.00 BLK_W:     4.13
20140617 10:25:       data: RPS:     0.00 WPS:  1650.56 BLK_R:     0.00 BLK_W:     3.27
20140617 10:26:       data: RPS:     0.00 WPS:  1361.20 BLK_R:     0.00 BLK_W:     3.87
20140617 10:27:       data: RPS:     0.00 WPS:  1360.07 BLK_R:     0.00 BLK_W:     3.93
20140617 10:28:       data: RPS:     0.00 WPS:  1260.13 BLK_R:     0.00 BLK_W:     3.97
20140617 10:29:       data: RPS:     0.00 WPS:  1513.86 BLK_R:     0.00 BLK_W:     3.50
20140617 10:30:       data: RPS:     0.00 WPS:  1499.73 BLK_R:     0.00 BLK_W:     3.90
20140617 10:31:       data: RPS:     0.00 WPS:  1394.80 BLK_R:     0.00 BLK_W:     3.47
20140617 10:32:       data: RPS:     0.00 WPS:  1690.27 BLK_R:     0.00 BLK_W:     3.90
20140617 10:33:       data: RPS:     0.00 WPS:  1543.55 BLK_R:     0.00 BLK_W:     3.57
20140617 10:34:       data: RPS:     0.00 WPS:  1457.47 BLK_R:     0.00 BLK_W:     3.80
20140617 10:35:       data: RPS:     0.00 WPS:  1414.00 BLK_R:     0.00 BLK_W:     4.50
20140617 10:36:       data: RPS:     0.00 WPS:  1557.14 BLK_R:     0.00 BLK_W:     4.23
20140617 10:37:       data: RPS:     0.00 WPS:  1675.60 BLK_R:     0.00 BLK_W:     3.77
20140617 10:38:       data: RPS:     0.00 WPS:  1473.56 BLK_R:     0.00 BLK_W:     4.57
20140617 10:39:       data: RPS:     0.00 WPS:  1424.00 BLK_R:     0.00 BLK_W:     3.67
20140617 10:40:       data: RPS:     0.00 WPS:  4113.20 BLK_R:     0.00 BLK_W:     4.13 (problem)

* (anomaly, RPS goes to 2.40 from 0.0. One-off?) *
20140617 10:41:       data: RPS:     2.40 WPS:   425.04 BLK_R:     2.40 BLK_W:     3.90 
20140617 10:42:       data: RPS:     0.00 WPS:  1402.49 BLK_R:     0.00 BLK_W:     4.50
20140617 10:43:       data: RPS:     0.00 WPS:  1619.47 BLK_R:     0.00 BLK_W:     3.47
20140617 10:44:       data: RPS:     0.00 WPS:   715.64 BLK_R:     0.00 BLK_W:     3.53
20140617 10:45:       data: RPS:     0.27 WPS:   547.02 BLK_R:     0.27 BLK_W:     3.53
20140617 10:46:       data: RPS:     0.00 WPS:   163.31 BLK_R:     0.00 BLK_W:     4.00
20140617 10:47:       data: RPS:     0.13 WPS:   815.87 BLK_R:     0.13 BLK_W:     4.30
20140617 10:48:       data: RPS:    74.00 WPS:   663.07 BLK_R:    74.00 BLK_W:     3.90 (cache cleared)
20140617 10:49:       data: RPS:   448.53 WPS:   628.63 BLK_R:   448.53 BLK_W:     4.33
20140617 10:50:       data: RPS:    39.20 WPS:   208.50 BLK_R:    39.20 BLK_W:     4.03
20140617 10:51:       data: RPS:    35.73 WPS:   228.12 BLK_R:    35.73 BLK_W:     4.13
20140617 10:52:       data: RPS:    15.20 WPS:   148.78 BLK_R:    15.20 BLK_W:     3.63
20140617 10:53:       data: RPS:    24.27 WPS:   162.37 BLK_R:    24.27 BLK_W:     3.40
20140617 10:54:       data: RPS:   136.26 WPS:   139.74 BLK_R:   136.27 BLK_W:     1.90
20140617 10:55:       data: RPS:    93.33 WPS:  2253.87 BLK_R:    93.33 BLK_W:     4.37
20140617 10:56:       data: RPS:    20.27 WPS:   837.60 BLK_R:    20.27 BLK_W:     4.57
20140617 10:57:       data: RPS:    11.46 WPS:   391.42 BLK_R:    11.47 BLK_W:     4.30
20140617 10:58:       data: RPS:    21.07 WPS:   348.27 BLK_R:    21.07 BLK_W:     3.77
20140617 10:59:       data: RPS:    65.83 WPS:   388.87 BLK_R:    65.87 BLK_W:     3.43
20140617 11:00:       data: RPS:    45.87 WPS:  1159.20 BLK_R:    45.87 BLK_W:     3.37

archive - After the problem started, WPS dropped very low for some time.
Is that expected? Is it because WAL logs were being written at a lower pace, and so archiving had less to do?

20140617 10:34:    archive: RPS:     0.00 WPS:  1648.27 BLK_R:     0.00 BLK_W:     0.93
20140617 10:35:    archive: RPS:     0.00 WPS:   553.07 BLK_R:     0.00 BLK_W:     0.57
20140617 10:36:    archive: RPS:     0.00 WPS:  1648.10 BLK_R:     0.00 BLK_W:     1.30
20140617 10:37:    archive: RPS:     0.00 WPS:  1102.00 BLK_R:     0.00 BLK_W:     1.17
20140617 10:38:    archive: RPS:     0.00 WPS:  1098.39 BLK_R:     0.00 BLK_W:     0.73
20140617 10:39:    archive: RPS:     0.00 WPS:  1102.27 BLK_R:     0.00 BLK_W:     0.60
20140617 10:40:    archive: RPS:     0.00 WPS:  3285.07 BLK_R:     0.00 BLK_W:     0.73 (problem)
20140617 10:41:    archive: RPS:     0.00 WPS:  1102.27 BLK_R:     0.00 BLK_W:     1.10
20140617 10:42:    archive: RPS:     0.00 WPS:   551.73 BLK_R:     0.00 BLK_W:     0.73
20140617 10:43:    archive: RPS:     0.00 WPS:  1101.07 BLK_R:     0.00 BLK_W:     1.07
20140617 10:44:    archive: RPS:     0.00 WPS:  1103.47 BLK_R:     0.00 BLK_W:     1.13
20140617 10:45:    archive: RPS:     0.00 WPS:   553.20 BLK_R:     0.00 BLK_W:     0.70
20140617 10:46:    archive: RPS:     0.00 WPS:     7.07 BLK_R:     0.00 BLK_W:     0.73
20140617 10:47:    archive: RPS:     0.00 WPS:   553.84 BLK_R:     0.00 BLK_W:     0.87
20140617 10:48:    archive: RPS:     0.00 WPS:   554.53 BLK_R:     0.00 BLK_W:     0.87 (cache cleared)
20140617 10:49:    archive: RPS:     0.00 WPS:   551.31 BLK_R:     0.00 BLK_W:     0.50
20140617 10:50:    archive: RPS:     0.00 WPS:     8.53 BLK_R:     0.00 BLK_W:     0.87
20140617 10:51:    archive: RPS:     0.00 WPS:   553.73 BLK_R:     0.00 BLK_W:     0.73
20140617 10:52:    archive: RPS:     0.00 WPS:     3.87 BLK_R:     0.00 BLK_W:     0.87
20140617 10:53:    archive: RPS:     0.00 WPS:     6.67 BLK_R:     0.00 BLK_W:     0.83
20140617 10:54:    archive: RPS:     0.00 WPS:   554.27 BLK_R:     0.00 BLK_W:     0.77
20140617 10:55:    archive: RPS:   848.80 WPS:   555.87 BLK_R:   848.80 BLK_W:     1.03
20140617 10:56:    archive: RPS:     0.27 WPS:  1099.20 BLK_R:     0.27 BLK_W:     0.87
20140617 10:57:    archive: RPS:     0.00 WPS:     1.33 BLK_R:     0.00 BLK_W:     0.17
20140617 10:58:    archive: RPS:     0.00 WPS:     1.33 BLK_R:     0.00 BLK_W:     0.17
20140617 10:59:    archive: RPS:     0.00 WPS:  1110.13 BLK_R:     0.00 BLK_W:     0.13
20140617 11:00:    archive: RPS:     0.00 WPS:   553.87 BLK_R:     0.00 BLK_W:     0.37
20140617 11:01:    archive: RPS:     0.00 WPS:  1102.13 BLK_R:     0.00 BLK_W:     0.80
20140617 11:02:    archive: RPS:     0.00 WPS:  1647.87 BLK_R:     0.00 BLK_W:     1.03
20140617 11:03:    archive: RPS:     0.00 WPS:  2191.60 BLK_R:     0.00 BLK_W:     0.60
20140617 11:04:    archive: RPS:     0.00 WPS:  2194.27 BLK_R:     0.00 BLK_W:     0.93
20140617 11:05:    archive: RPS:     0.00 WPS:  2193.33 BLK_R:     0.00 BLK_W:     1.33
20140617 11:06:    archive: RPS:     0.00 WPS:     7.84 BLK_R:     0.00 BLK_W:     0.70
20140617 11:07:    archive: RPS:  3229.33 WPS:     4.27 BLK_R:  3229.33 BLK_W:     0.43
20140617 11:08:    archive: RPS:  3281.07 WPS:     6.40 BLK_R:  3281.07 BLK_W:     0.73
20140617 11:09:    archive: RPS: 19427.33 WPS:   554.27 BLK_R: 19427.33 BLK_W:     0.77
20140617 11:10:    archive: RPS:     0.00 WPS:  2191.47 BLK_R:     0.00 BLK_W:     0.63
20140617 11:11:    archive: RPS:     0.00 WPS:  1102.27 BLK_R:     0.00 BLK_W:     0.60
20140617 11:12:    archive: RPS:     0.00 WPS:  1102.00 BLK_R:     0.00 BLK_W:     0.67
20140617 11:13:    archive: RPS:     0.00 WPS:  1101.60 BLK_R:     0.00 BLK_W:     0.77
20140617 11:14:    archive: RPS:     0.00 WPS:  3829.47 BLK_R:     0.00 BLK_W:     0.80
20140617 11:15:    archive: RPS:     0.00 WPS:  1104.13 BLK_R:     0.00 BLK_W:     0.97
20140617 11:16:    archive: RPS:     0.00 WPS:  1101.87 BLK_R:     0.00 BLK_W:     0.87
20140617 11:17:    archive: RPS:  1339.07 WPS:  1100.40 BLK_R:  1339.07 BLK_W:     0.90
20140617 11:18:    archive: RPS:     0.00 WPS:  1100.67 BLK_R:     0.00 BLK_W:     0.83


Load average information leading up to the problem:
-----------------------------------------------------------------
20140617 10:30:01 top - 10:30:08 up 161 days, 19:46,  5 users,  load average: 1.30, 1.22, 1.20
20140617 10:30:01 Tasks: 1347 total,  15 running, 1332 sleeping,   0 stopped,   0 zombie
--
20140617 10:31:01 top - 10:31:08 up 161 days, 19:47,  5 users,  load average: 1.89, 1.35, 1.25
20140617 10:31:01 Tasks: 1339 total,   6 running, 1333 sleeping,   0 stopped,   0 zombie
--
20140617 10:32:01 top - 10:32:09 up 161 days, 19:48,  5 users,  load average: 1.53, 1.33, 1.25
20140617 10:32:01 Tasks: 1359 total,   5 running, 1354 sleeping,   0 stopped,   0 zombie
--
20140617 10:33:01 top - 10:33:08 up 161 days, 19:49,  5 users,  load average: 1.10, 1.25, 1.22
20140617 10:33:01 Tasks: 1333 total,  10 running, 1323 sleeping,   0 stopped,   0 zombie
--
20140617 10:34:01 top - 10:34:08 up 161 days, 19:50,  5 users,  load average: 1.01, 1.23, 1.22
20140617 10:34:01 Tasks: 1368 total,   8 running, 1360 sleeping,   0 stopped,   0 zombie
--
20140617 10:35:01 top - 10:35:09 up 161 days, 19:51,  5 users,  load average: 1.21, 1.26, 1.22
20140617 10:35:01 Tasks: 1359 total,   6 running, 1353 sleeping,   0 stopped,   0 zombie
--
20140617 10:36:01 top - 10:36:08 up 161 days, 19:52,  5 users,  load average: 1.19, 1.25, 1.22
20140617 10:36:01 Tasks: 1356 total,   9 running, 1347 sleeping,   0 stopped,   0 zombie
--
20140617 10:37:01 top - 10:37:08 up 161 days, 19:53,  5 users,  load average: 0.88, 1.15, 1.19
20140617 10:37:01 Tasks: 1339 total,  13 running, 1326 sleeping,   0 stopped,   0 zombie
--
20140617 10:38:01 top - 10:38:08 up 161 days, 19:54,  5 users,  load average: 1.51, 1.24, 1.21
20140617 10:38:01 Tasks: 1356 total,   9 running, 1347 sleeping,   0 stopped,   0 zombie
--
20140617 10:39:01 top - 10:39:08 up 161 days, 19:55,  5 users,  load average: 1.75, 1.35, 1.25
20140617 10:39:01 Tasks: 1339 total,  12 running, 1327 sleeping,   0 stopped,   0 zombie
--
20140617 10:40:01 top - 10:40:08 up 161 days, 19:56,  5 users,  load average: 1.27, 1.27, 1.23
20140617 10:40:01 Tasks: 1359 total,   4 running, 1355 sleeping,   0 stopped,   0 zombie

* Start of problem *
--
20140617 10:41:02 top - 10:41:12 up 161 days, 19:57,  5 users,  load average: 14.96, 4.47, 2.29
20140617 10:41:02 Tasks: 1831 total,  59 running, 1772 sleeping,   0 stopped,   0 zombie
--
20140617 10:43:01 top - 10:43:09 up 161 days, 19:59,  5 users,  load average: 16.23, 10.94, 5.07
20140617 10:43:01 Tasks: 2203 total,  49 running, 2154 sleeping,   0 stopped,   0 zombie
--
20140617 10:44:01 top - 10:44:09 up 161 days, 20:00,  5 users,  load average: 8.61, 9.60, 4.98
20140617 10:44:01 Tasks: 1942 total,  36 running, 1906 sleeping,   0 stopped,   0 zombie
--
20140617 10:45:01 top - 10:45:10 up 161 days, 20:01,  5 users,  load average: 26.00, 14.28, 6.86
20140617 10:45:01 Tasks: 2226 total,  49 running, 2177 sleeping,   0 stopped,   0 zombie
--
20140617 10:46:02 top - 10:46:22 up 161 days, 20:02,  6 users,  load average: 121.92, 47.56, 19.11
20140617 10:46:02 Tasks: 2248 total, 102 running, 2145 sleeping,   0 stopped,   1 zombie
--
20140617 10:47:01 top - 10:47:15 up 161 days, 20:03,  7 users,  load average: 119.38, 58.95, 24.52
20140617 10:47:01 Tasks: 2257 total,  83 running, 2171 sleeping,   0 stopped,   2 zombie
--
20140617 10:48:01 top - 10:48:12 up 161 days, 20:04,  9 users,  load average: 78.64, 57.52, 26.14
20140617 10:48:01 Tasks: 2247 total,  95 running, 2152 sleeping,   0 stopped,   0 zombie
--
20140617 10:55:01 top - 10:55:08 up 161 days, 20:11, 13 users,  load average: 55.85, 64.71, 41.02
20140617 10:55:01 Tasks: 1275 total,   4 running, 1271 sleeping,   0 stopped,   0 zombie
--
20140617 10:56:01 top - 10:56:08 up 161 days, 20:12, 13 users,  load average: 306.87, 132.30, 65.5
20140617 10:56:01 Tasks: 2007 total,   1 running, 2006 sleeping,   0 stopped,   0 zombie
--
20140617 10:57:01 top - 10:57:09 up 161 days, 20:13, 13 users,  load average: 591.73, 241.81, 107.
20140617 10:57:01 Tasks: 2209 total,   2 running, 2207 sleeping,   0 stopped,   0 zombie
--
20140617 10:58:01 top - 10:58:08 up 161 days, 20:14, 13 users,  load average: 858.24, 378.84, 162.
20140617 10:58:01 Tasks: 2266 total,   3 running, 2263 sleeping,   0 stopped,   0 zombie
--
20140617 10:59:02 top - 10:59:09 up 161 days, 20:15, 13 users,  load average: 1068.68, 525.91, 226
20140617 10:59:02 Tasks: 2263 total,   1 running, 2262 sleeping,   0 stopped,   0 zombie
--
20140617 11:00:01 top - 11:00:08 up 161 days, 20:16, 13 users,  load average: 1104.86, 639.53, 284
20140617 11:00:01 Tasks: 2279 total,   2 running, 2277 sleeping,   0 stopped,   0 zombie
--
20140617 11:01:01 top - 11:01:08 up 161 days, 20:17, 13 users,  load average: 664.81, 602.11, 294.
20140617 11:01:01 Tasks: 2275 total,   6 running, 2269 sleeping,   0 stopped,   0 zombie
--
20140617 11:02:01 top - 11:02:09 up 161 days, 20:18, 13 users,  load average: 485.78, 561.31, 299.
20140617 11:02:01 Tasks: 2272 total,   9 running, 2263 sleeping,   0 stopped,   0 zombie
--
20140617 11:03:01 top - 11:03:08 up 161 days, 20:19, 13 users,  load average: 422.34, 529.23, 304.
20140617 11:03:01 Tasks: 2118 total,   4 running, 2114 sleeping,   0 stopped,   0 zombie
--
20140617 11:04:01 top - 11:04:09 up 161 days, 20:20, 13 users,  load average: 371.39, 495.59, 307.
20140617 11:04:01 Tasks: 1953 total,  10 running, 1943 sleeping,   0 stopped,   0 zombie
--
20140617 11:05:01 top - 11:05:08 up 161 days, 20:21, 13 users,  load average: 331.89, 461.73, 307.
20140617 11:05:01 Tasks: 1844 total,   8 running, 1836 sleeping,   0 stopped,   0 zombie
--
20140617 11:06:01 top - 11:06:12 up 161 days, 20:22, 13 users,  load average: 213.18, 403.72, 297.
20140617 11:06:01 Tasks: 1913 total, 110 running, 1803 sleeping,   0 stopped,   0 zombie
--
20140617 11:10:01 top - 11:10:08 up 161 days, 20:26, 14 users,  load average: 90.95, 229.57, 249.9
20140617 11:10:01 Tasks: 1326 total,  11 running, 1314 sleeping,   1 stopped,   0 zombie
--
20140617 11:11:01 top - 11:11:08 up 161 days, 20:27, 14 users,  load average: 83.22, 202.67, 239.4
20140617 11:11:01 Tasks: 1544 total,   6 running, 1537 sleeping,   1 stopped,   0 zombie
--
20140617 11:12:01 top - 11:12:08 up 161 days, 20:28, 14 users,  load average: 76.15, 179.03, 229.0
20140617 11:12:01 Tasks: 1486 total,  12 running, 1473 sleeping,   1 stopped,   0 zombie
--
20140617 11:13:01 top - 11:13:08 up 161 days, 20:29, 14 users,  load average: 70.00, 158.61, 218.8
20140617 11:13:01 Tasks: 1518 total,   3 running, 1514 sleeping,   1 stopped,   0 zombie
--
20140617 11:14:01 top - 11:14:08 up 161 days, 20:30, 14 users,  load average: 67.06, 141.63, 209.2
20140617 11:14:01 Tasks: 1439 total,  39 running, 1399 sleeping,   1 stopped,   0 zombie
--
20140617 11:15:01 top - 11:15:08 up 161 days, 20:31, 14 users,  load average: 72.38, 128.89, 200.6
20140617 11:15:01 Tasks: 2279 total,   9 running, 2269 sleeping,   1 stopped,   0 zombie
--
20140617 11:16:01 top - 11:16:08 up 161 days, 20:32, 14 users,  load average: 63.85, 116.17, 191.7
20140617 11:16:01 Tasks: 2002 total,  18 running, 1983 sleeping,   1 stopped,   0 zombie
--
20140617 11:17:01 top - 11:17:08 up 161 days, 20:33, 14 users,  load average: 54.34, 103.79, 182.7
20140617 11:17:01 Tasks: 1759 total,  12 running, 1746 sleeping,   1 stopped,   0 zombie
--
20140617 11:18:01 top - 11:18:08 up 161 days, 20:34, 14 users,  load average: 56.49, 95.20, 174.84
20140617 11:18:01 Tasks: 1630 total,  13 running, 1616 sleeping,   1 stopped,   0 zombie
--
20140617 11:19:01 top - 11:19:08 up 161 days, 20:35, 15 users,  load average: 53.37, 87.20, 167.10
20140617 11:19:01 Tasks: 1501 total,  11 running, 1489 sleeping,   1 stopped,   0 zombie
--
20140617 11:20:01 top - 11:20:08 up 161 days, 20:36, 15 users,  load average: 47.54, 79.38, 159.41
20140617 11:20:01 Tasks: 1415 total,   8 running, 1406 sleeping,   1 stopped,   0 zombie
--

3) know if transaction rates spike going into (as opposed to during) the load event.  We can capture tps
rates by logging the output of: select * from pg_stat_database. >>

We noticed that the tps rate was not abnormal before, during, or immediately after the issue happens. Old sessions were
still "processing" and we had new sessions coming in (like normal). When legitimate connections from our site came in
while the existing old connections were still running in the database (either active/processing or active/waiting
state), we hit the max_connections limit.
 

A few more pieces of information to add:
This is what we see happening almost every time we have the issue:
- At least from 5 minutes before the issue (max_connections reached) happens, we see queries slowing down considerably
(simple selects using indexes taking 10-15 minutes, and fewer DMLs)

- When the issue happens, user cpu maxes out, i/o wait goes almost to zero, and load goes really high. Simple selects
take forever, and a ton of DMLs are waiting.
 

Things we changed in the environment so far:
 1. Disabled THP and tcp offloading (GRP/GSO/TRO/TSO). They helped reduce the frequency of the issue but didn't fix it.
 2. We restarted postgres with "numactl --interleave=all" last Thursday (06-12-14) to alleviate the NUMA-specific
memory offloading issues.
 3. We also dropped the disk cache a couple of times so far.

Questions:
1) As mentioned earlier, we are on CentOS 6.5 and kernel version 2.6.32. Are there any known issues with these
versions? We read in some forums that downgrading to CentOS 5.8 and kernel 2.6.18 has helped with similar issues. Is
it worth trying?
 
2) Should we look into changing our I/O scheduler from CFQ to deadline?
(http://www.cybertec.at/postgresql-linux-kernel-io-tuning/)

Thank You
Ramya

Re: max_connections reached in postgres 9.3.3

From
Kevin Grittner
Date:
"Vasudevan, Ramya" <ramya.vasudevan@classmates.com> wrote:

> On the waiting queries - When we reached 1500 connections, we
> had 759 connections that were in active state (116 COMMIT, 238
> INSERT, 176 UPDATE, 57 AUTHENTICATION, 133 BIND).  These active
> INSERTs and UPDATEs also include the 80-90 waiting sessions (we
> checked pg_stat_activity for 'waiting' state, and pg_locks for
> granted=f).  The blocking and the waiting queries were simple
> one-row updates/inserts/deletes.  These shouldn't be blocking
> each other normally (unless, we think, there was a problem
> writing to the disk).  Correct me if I am wrong.

You may want to consider this:

http://wiki.postgresql.org/wiki/Number_Of_Database_Connections

> [vmstat shows up to 173111 context switches per second, with the high
> cs rate corresponding to user CPU% between 64 and 82]

That usually means there is high contention for spinlocks,
potentially with processes getting suspended while holding
spinlocks, making things worse.

> 2) Should we look into changing our I/O scheduler from CFQ to
> deadline?
> (http://www.cybertec.at/postgresql-linux-kernel-io-tuning/)

That is likely to help when disk I/O is the bottleneck, but your
problem now is you need to be using a connection pooler in
transaction mode, so that when more than some limit of transactions
are active (that limit probably being somewhere around twice the
core count on the machine), new requests to start a transaction are
queued.  You will see much better throughput and much better
latency if you can do that.
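
A quick way to see how far you are from that during a load event is
something like the following (a sketch; 48 is simply twice the 24 cores
you described, not a magic number):

    -- how many backends are actually trying to run at once,
    -- versus a target of roughly 2 x 24 cores = ~48 active sessions
    SELECT count(*)                                          AS total_backends,
           sum(CASE WHEN state = 'active' THEN 1 ELSE 0 END) AS active_backends,
           sum(CASE WHEN waiting THEN 1 ELSE 0 END)          AS waiting_backends
    FROM pg_stat_activity;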

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: max_connections reached in postgres 9.3.3

From
Merlin Moncure
Date:
On Thu, Jun 19, 2014 at 2:35 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
> "Vasudevan, Ramya" <ramya.vasudevan@classmates.com> wrote:
>
>> On the waiting queries - When we reached 1500 connections, we
>> had 759 connections that were in active state (116 COMMIT, 238
>> INSERT, 176 UPDATE, 57 AUTHENTICATION, 133 BIND).  These active
>> INSERTs and UPDATEs also include the 80-90 waiting sessions (we
>> checked pg_stat_activity for 'waiting' state, and pg_locks for
>> granted=f).  The blocking and the waiting queries were simple
>> one-row updates/inserts/deletes.  These shouldn't be blocking
>> each other normally (unless, we think, there was a problem
>> writing to the disk).  Correct me if I am wrong.
>
> You may want to consider this:
>
> http://wiki.postgresql.org/wiki/Number_Of_Database_Connections
>
>> [vmstat show up to 173111 context switches per second, with high
>> cs rate corresponding to user CPU% between 64 and 82]
>
> That usually means there is high contention for spinlocks,
> potentially with processes getting suspended while holding
> spinlocks, making things worse.

Completely agree on both diagnosis and proposed solution -- the load
profile (low iowait, high user%, high process count, high CS) is
symptomatic of too many processes trying to do things at once.  So
there may be some application-caused driver of the problem, or you are
hitting a contention point within postgres (a 'perf top' might give
clues to the latter).  Either way, once you are in this state you end
up with too many processes fighting for cpu and cache lines, which
exacerbates the problem to the point you can classify it as an outage.

Be advised that transaction-mode pooling makes certain features of the
database difficult or impossible to use -- advisory locks (except the xact
variants), server-side prepared statements, asynchronous
notification, WITH HOLD cursors and the like -- basically anything
scoped to the session.  For many workloads it is a big win though.
If for whatever reason this solution doesn't work, your other options
are to try to optimize whatever is causing the load event
(particularly if it's in your code -- careful query logging might give
some clues) or to simply upgrade hardware (more/faster CPUs, especially
in your case) to the point that even when highly loaded you're always
clearing queries at an acceptable rate.  The hardware approach has
some risk though -- if you have a contention problem, it's not always a
given that adding cores will scale as well as you think.  A faster
core/bus is almost always a win, but obviously there's a limit to how
far you can go.

merlin


Re: max_connections reached in postgres 9.3.3

From
Borislav Ivanov
Date:
Hi Ramya,

We experience exactly the same problem here at Bitbucket. From what I can tell, the major difference between your setup and ours is that you run 9.3.3 and we run 9.2.8. Our post about the issue is at http://www.postgresql.org/message-id/CAJ+wzrb1qhz3xuoeSy5mo8i=E-5OO9Yvm6R+VxLBGaPB=uevqA@mail.gmail.com, where we have shared the information collected from the incident.

We have tried quite a few things and none of them have solved the problem yet. Things we tried:
- Upgraded CentOS 6.4 to 6.5
- Upgrade kernel multiple times (current 2.6.32-431.17.1 )
- Changed I/O scheduler from CFQ to deadline (front_merges = 0, read_expire = 150, write_expire = 1500), at the time with ext4 for both PGDATA and pg_xlog volumes (see the sysfs sketch after this list)
- Switched from ext4 to XFS for both PGDATA and pg_xlog (We thought we were hitting the ext4 freeze bug)
- Upgraded Postgres from 9.2.5 to 9.2.6, then to 9.2.8
- Lowered max_connections from 3000 to 500
- Lowered pgbouncer default_pool_size
- Switched pgbouncer from session to transaction mode (It was suitable for our use)
- Multiple App improvements which resulted in lowering the returned tuples from 10M to 1.5M
- Fixed all slow queries. 
- Truncated tables bigger than 25GB data, 25GB index
- Tried shared_buffers with 4GB, 6GB, 8GB (current 4GB)
- Tried work_mem with multiple values (current 384MB)
- Tried wal_buffers with 4MB and 8MB (current 8MB)
- Tried different approaches to checkpoints. None of them made a difference for the actual problem.
- We've always been running with synchronous_commit = on
- We have 4 streaming replicas
- We do archive from the master using rsync in archive_command
- Switched the master to a different machine to triple-confirm no hardware issues
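
For reference, those deadline tunables live under sysfs; "sda" is a placeholder for the actual device, and the values are simply the ones listed above:

echo deadline > /sys/block/sda/queue/scheduler
echo 0    > /sys/block/sda/queue/iosched/front_merges
echo 150  > /sys/block/sda/queue/iosched/read_expire
echo 1500 > /sys/block/sda/queue/iosched/write_expire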


Next we are adding a second layer of pgbouncer on the database server itself. Our goal is to pass fewer connections to Postgres than the number of cores we have. Many are suggesting that we lower our max_connections, and this thread says the same, so we are going to try that.

However, most people on our team think that the number of connections is purely a symptom of the actual problem. We would love to be wrong about this. But for now we feel the high number of connections contributes to preserving the problem but is not actually triggering it.
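
As an aside, a quick way to break those connections down by state and see how many are stuck on ungranted locks; this is only a sketch against the 9.2/9.3 catalogs, and "appdb" is a placeholder database name:

psql -d appdb <<'EOF'
-- backends by state
SELECT state, count(*) FROM pg_stat_activity GROUP BY state ORDER BY count(*) DESC;
-- backends blocked waiting on a lock
SELECT pid, waiting, now() - query_start AS runtime, query
FROM pg_stat_activity WHERE waiting;
-- ungranted lock requests by type and mode
SELECT locktype, mode, count(*) FROM pg_locks WHERE NOT granted GROUP BY 1, 2 ORDER BY 3 DESC;
EOF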

I am including some data from when the DB freezes at 400 connections, and another data set from the DB slowing down during a load test with 500 connections, where performance is quite degraded but Postgres didn't hang. I am also willing to provide more data on our side as needed.

#  vmstat -S M 1 20 from when the DB freezes. Site is down. We are maxed out on conns. This incident is the same as what you see. In this case our Postgres cannot recover. The only fix we have so far is to restart Postgres.

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
51  0      0 123717    152 121785    0    0     4    34    1    1 20  3 77  0  0
228  0      0 123702    152 121786    0    0     0   720 134874 295787 79 11 10  0  0
232  0      0 123700    152 121786    0    0     0  7128 138700 314263 79 12  9  0  0
274  0      0 123691    152 121786    0    0     0   980 133590 245257 81 11  8  0  0
380  0      0 123789    152 121786    0    0     0  1008 133792 258339 80 11  9  0  0
142  0      0 123839    152 121786    0    0     0  1328 139243 303489 78 12 10  0  0

 
#  vmstat -S M 1 20 from when the DB is stressed with real production load. The site is slow but up. Performance is degraded but the machine and Postgres can recover from this situation. Note that the machine was freshly rebooted before running this test.

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
156 10      0 232743     45  16749    0    0    42     2   80   80  3  0 96  1  0
260 11      0 232639     45  16828    0    0 76704  2792 155594 223928 83 12  4  0  0
128 12      0 232552     45  16887    0    0 59536  2896 153932 229723 82 12  6  1  0
267  6      0 232480     45  16931    0    0 43144  2320 149501 200510 83 11  5  0  0
105  8      0 232408     45  16965    0    0 38016  2208 148905 185142 84 11  5  0  0
112  1      0 232339     45  17000    0    0 32640  2528 148390 183620 83 11  6  0  0


Again, if you want to see more of the data we've collected, see our post at http://www.postgresql.org/message-id/CAJ+wzrb1qhz3xuoeSy5mo8i=E-5OO9Yvm6R+VxLBGaPB=uevqA@mail.gmail.com.

I'll also keep you updated with any new things we find, and with how lowering the number of connections to our Postgres goes. We don't have plans to test on CentOS 5.8.




Re: max_connections reached in postgres 9.3.3

From
Merlin Moncure
Date:
On Thu, Jun 19, 2014 at 5:12 PM, Borislav Ivanov <bivanov@atlassian.com> wrote:
> However, most people on our team think that the number of connections is
> purely a symptom of the actual problem. We would love to be wrong about
> this. But for now we feel the high number of connections contributes for
> preserving the problem but it's not actually triggering the problem.

This is entirely correct.  pgbouncer is not about preventing database load
but about limiting the damage when it occurs.  This is generally necessary in
environments where application servers keep piling on connections when
the database is not clearing queries fast enough.

In your case user% is dominating system load.  Along with the high cs
this is really suggesting spinlock contention.  A 'perf top' is
essential for identifying the culprit.  It's very possible that 9.4
will fix your problem...see:
http://postgresql.1045698.n5.nabble.com/Cpu-usage-100-on-slave-s-lock-problem-td5768655.html.
There was some poorly optimized code in the wal replay.

merlin


Re: max_connections reached in postgres 9.3.3

From
"Vasudevan, Ramya"
Date:

Hi Borislav – Thank You for the update and all the information. It does look like we are in the same boat. And I feel the same too - maxing out on max_connections is just a symptom. pgbouncer may help alleviate the problem (though in your case it didn’t) and is definitely good to have either way.

 

One thing I wanted to mention is that we upgraded from CentOS 5.6 to 6.3 in order to fix a kernel bug we hit on 5.6. It was only after we moved to 6.3 that we started seeing issues with max_connections. And it is worth mentioning that there was no application change when we moved from 5.6 to 6.3. We then upgraded to 6.5 and still have the same problem. And from your case, 6.4 also has the same problem. I think that is why we wanted to downgrade to 5.8 to see if it resolves the issue.

 

Thank You

Ramya

Re: max_connections reached in postgres 9.3.3

From
Borislav Ivanov
Date:
We do record perf data. For each incident we've had, the data looks about the same. Unfortunately, I can't read much into it beyond the fact that it's stuck on a spinlock. But why, and on what?
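
I won't claim this is exactly how the data below was gathered, but a comparable profile can be collected along these lines; the output path and the 30-second window are arbitrary, and perf needs to run as root:

# live, system-wide view of the hottest symbols while the event is happening
perf top

# or record system-wide with call graphs for 30 seconds and inspect afterwards
perf record -a -g -o /tmp/perf.data -- sleep 30
perf report -i /tmp/perf.data
perf annotate -i /tmp/perf.data s_lock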


### from perf report

 53.28%       postmaster  postgres               [.] s_lock
  6.22%       postmaster  postgres               [.] 0x00000000001b4306
  2.30%       postmaster  [kernel.kallsyms]      [k] _spin_lock
  2.23%       postmaster  postgres               [.] LWLockAcquire
  1.79%       postmaster  postgres               [.] LWLockRelease
  1.39%       postmaster  postgres               [.] hash_search_with_hash_value
  1.20%       postmaster  postgres               [.] SearchCatCache
  0.71%       postmaster  [kernel.kallsyms]      [k] hrtimer_interrupt
  0.56%       postmaster  [kernel.kallsyms]      [k] tick_program_event
  0.54%       postmaster  [kernel.kallsyms]      [k] schedule
  0.44%       postmaster  [kernel.kallsyms]      [k] _spin_lock_irq


### Then zooming on s_lock

# Annotate s_lock

...

 99.04       test   %al,%al 

...


# Zoom into postmaster(81487) thread

 55.84%  postmaster  postgres               [.] s_lock
  6.52%  postmaster  postgres               [.] 0x00000000001b4306
  2.42%  postmaster  [kernel.kallsyms]      [k] _spin_lock
  2.34%  postmaster  postgres               [.] LWLockAcquire
  1.87%  postmaster  postgres               [.] LWLockRelease
  1.46%  postmaster  postgres               [.] hash_search_with_hash_value
  1.26%  postmaster  postgres               [.] SearchCatCache
  0.75%  postmaster  [kernel.kallsyms]      [k] hrtimer_interrupt
  0.59%  postmaster  [kernel.kallsyms]      [k] tick_program_event
  0.57%  postmaster  [kernel.kallsyms]      [k] schedule


# Zoom into postgres DSO

 65.75%  postmaster  [.] s_lock
  7.68%  postmaster  [.] 0x00000000001b4306
  2.75%  postmaster  [.] LWLockAcquire
  2.20%  postmaster  [.] LWLockRelease
  1.72%  postmaster  [.] hash_search_with_hash_value
  1.49%  postmaster  [.] SearchCatCache
  0.54%  postmaster  [.] _bt_compare
  0.51%  postmaster  [.] _bt_checkkeys




Re: max_connections reached in postgres 9.3.3

From
Erik van Zijst
Date:
On Thu, Jun 19, 2014 at 3:57 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> In your case user% is dominating system load.  Along with the high cs
> this is really suggesting spinlock contention.  A 'perf top' is
> essential for identifying the culprit.  It's very possible that 9.4
> will fix your problem...see:
> http://postgresql.1045698.n5.nabble.com/Cpu-usage-100-on-slave-s-lock-problem-td5768655.html.
> There was some poorly optimized code in the wal replay.

Did that patch go in? The mailing list thread doesn't seem conclusive.

Cheers,
Erik




Re: max_connections reached in postgres 9.3.3

From
Erik van Zijst
Date:
On Thu, Jun 19, 2014 at 10:10 PM, Erik van Zijst
<erik.van.zijst@gmail.com> wrote:
> On Thu, Jun 19, 2014 at 3:57 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> In your case user% is dominating system load.  Along with the high cs
>> this is really suggesting spinlock contention.  A 'perf top' is
>> essential for identifying the culprit.  It's very possible that 9.4
>> will fix your problem...see:
>> http://postgresql.1045698.n5.nabble.com/Cpu-usage-100-on-slave-s-lock-problem-td5768655.html.
>> There was some poorly optimized code in the wal replay.
>
> Did that patch go in? The mailing list thread doesn't seem conclusive.

Also, that thread talks about slave databases (we're seeing these
issues exclusively on our master). Is that RecoveryMightBeInProgress
code applicable to masters, too?

Cheers,
Erik




Re: max_connections reached in postgres 9.3.3

From
Merlin Moncure
Date:
On Fri, Jun 20, 2014 at 12:58 AM, Erik van Zijst
<erik.van.zijst@gmail.com> wrote:
> On Thu, Jun 19, 2014 at 10:10 PM, Erik van Zijst
> <erik.van.zijst@gmail.com> wrote:
>> On Thu, Jun 19, 2014 at 3:57 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>>> In your case user% is dominating system load.  Along with the high cs
>>> this is really suggesting spinlock contention.  A 'perf top' is
>>> essential for identifying the culprit.  It's very possible that 9.4
>>> will fix your problem...see:
>>> http://postgresql.1045698.n5.nabble.com/Cpu-usage-100-on-slave-s-lock-problem-td5768655.html.
>>> There was some poorly optimized code in the wal replay.
>>
>> Did that patch go in? The mailing list thread doesn't seem conclusive.
>
> Also, that thread talks about slave databases (we're seeing these
> issues exclusively on our master). Is that RecoveryMightBeInProgress
> code applicable to masters, too?

hm -- no.  However, it shows how important it is to grab a performance
profile in cases of suspected contention.

merlin


Re: max_connections reached in postgres 9.3.3

From
Tom Lane
Date:
Erik van Zijst <erik.van.zijst@gmail.com> writes:
> On Thu, Jun 19, 2014 at 3:57 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> In your case user% is dominating system load.  Along with the high cs
>> this is really suggesting spinlock contention.  A 'perf top' is
>> essential for identifying the culprit.  It's very possible that 9.4
>> will fix your problem...see:
>> http://postgresql.1045698.n5.nabble.com/Cpu-usage-100-on-slave-s-lock-problem-td5768655.html.
>> There was some poorly optimized code in the wal replay.

> Did that patch go in? The mailing list thread doesn't seem conclusive.

A descendant patch was applied:

http://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=1a3d104475ce01326fc00601ed66ac4d658e37e5

but that's only in 9.4, not any earlier branches.

            regards, tom lane


Re: max_connections reached in postgres 9.3.3

From
"Vasudevan, Ramya"
Date:
We turned off NUMA in the BIOS on Jul 2nd and haven't seen the issue since (though once last week, we had the
connection count go up to 1000, but it recovered in a few seconds on its own). Will keep you all posted when I have more
updates.

Appreciate everyone's help, comments and suggestions so far. Thank You.

Ramya