Re: Server hitting 100% CPU usage, system comes to a crawl. - Mailing list pgsql-general

From Scott Mead
Subject Re: Server hitting 100% CPU usage, system comes to a crawl.
Date
Msg-id CAKq0gvKw_dvXMYme+MG=Aj8ENoPFug+u_Jvx5nXgLig-G32JRw@mail.gmail.com
Whole thread Raw
In response to Server hitting 100% CPU usage, system comes to a crawl.  (Brian Fehrle <brianf@consistentstate.com>)
Responses Re: Server hitting 100% CPU usage, system comes to a crawl.  (Brian Fehrle <brianf@consistentstate.com>)
List pgsql-general


On Thu, Oct 27, 2011 at 2:39 PM, Brian Fehrle <brianf@consistentstate.com> wrote:
Hi all, need some help/clues on tracking down a performance issue.

PostgreSQL version: 8.3.11

I've got a system that has 32 cores and 128 gigs of ram. We have connection pooling set up, with about 100 - 200 persistent connections open to the database. Our applications then use these connections to query the database constantly, but when a connection isn't currently executing a query, it's <IDLE>. On average, at any given time, there are 3 - 6 connections that are actually executing a query, while the rest are <IDLE>.

Remember, when you read pg_stat_activity, it is showing you query activity from that exact specific moment in time.  Just because it looks like only 3-6 connections are executing, doesn't mean that 200 aren't actually executing < .1ms statements.  With such a beefy box, I would see if you can examine any stats from your connection pooler to find out how many connections are actually getting used.

 

About once a day, queries that normally take just a few seconds slow way down, and start to pile up, to the point where instead of just having 3-6 queries running at any given time, we get 100 - 200. The whole system comes to a crawl, and looking at top, the CPU usage is 99%.

Looking at top, I see no SWAP usage, very little IOWait, and there are a large number of postmaster processes at 100% cpu usage (makes sense, at this point there are 150 or so queries currently executing on the database).

 Tasks: 713 total,  44 running, 668 sleeping,   0 stopped,   1 zombie
Cpu(s):  4.4%us, 92.0%sy,  0.0%ni,  3.0%id,  0.0%wa,  0.0%hi,  0.3%si,  0.2%st
Mem:  134217728k total, 131229972k used,  2987756k free,   462444k buffers
Swap:  8388600k total,      296k used,  8388304k free, 119029580k cached


In the past, we noticed that autovacuum was hitting some large tables at the same time this happened, so we turned autovacuum off to see if that was the issue, and it still happened without any vacuums running.
That was my next question :)

We also ruled out checkpoints being the cause.

How exactly did you rule this out?  Just because a checkpoint is over doesn't mean that it hasn't had a negative effect on the OS cache.  If you're stuck going to disk, that could be hurting you (that being said, you do point to a low I/O wait above, so you're probably correct in ruling this out).
 

I'm currently digging through some statistics I've been gathering to see if traffic increased at all, or remained the same when the slowdown occurred. I'm also digging through the logs from the postgresql cluster (I increased verbosity yesterday), looking for any clues. Any suggestions or clues on where to look for this to see what can be causing a slowdown like this would be greatly appreciated.

Are you capturing table-level stats from pg_stat_user_[tables | indexes]?  Just because a server doesn't look busy doesn't mean that you're not doing 1000 index scans per second returning 1000 tuples each time.

--Scott
 
Thanks,
   - Brian F

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

pgsql-general by date:

Previous
From: Brian Fehrle
Date:
Subject: Re: Server hitting 100% CPU usage, system comes to a crawl.
Next
From: Adrian Schreyer
Date:
Subject: Custom data type in C with one fixed and one variable attribute