Re: Server hitting 100% CPU usage, system comes to a crawl. - Mailing list pgsql-general

From Brian Fehrle
Subject Re: Server hitting 100% CPU usage, system comes to a crawl.
Msg-id 4EA9C0E2.2030000@consistentstate.com
In response to Re: Server hitting 100% CPU usage, system comes to a crawl.  (Scott Mead <scottm@openscg.com>)
List pgsql-general
On 10/27/2011 02:27 PM, Scott Mead wrote:


On Thu, Oct 27, 2011 at 2:39 PM, Brian Fehrle <brianf@consistentstate.com> wrote:
Hi all, need some help/clues on tracking down a performance issue.

PostgreSQL version: 8.3.11

I've got a system that has 32 cores and 128 gigs of ram. We have connection pooling set up, with about 100 - 200 persistent connections open to the database. Our applications then use these connections to query the database constantly, but when a connection isn't currently executing a query, it's <IDLE>. On average, at any given time, there are 3 - 6 connections that are actually executing a query, while the rest are <IDLE>.

Remember, when you read pg_stat_activity, it is showing you query activity from that exact moment in time. Just because it looks like only 3-6 connections are executing doesn't mean that 200 aren't actually executing sub-0.1ms statements. With such a beefy box, I would see if you can examine any stats from your connection pooler to find out how many connections are actually getting used.

 
Correct, we're getting a few hundred transactions per second, but under normal operation polling pg_stat_activity shows an average of 3 - 6 queries running at that moment, and those queries run for an average of 5 - 7 seconds. So my belief is that something happens to the system where either a) we get far more queries than normal from the application (currently hunting down data to support this), or b) the overall speed of the system slows down so much that every query takes long enough for polling pg_stat_activity to actually catch it.
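A minimal version of that kind of poll, using the 8.3 pg_stat_activity columns (idle backends report '<IDLE>' in current_query), might look like this:

    -- how many backends are actually running a statement right now
    SELECT count(*) AS active
      FROM pg_stat_activity
     WHERE current_query NOT LIKE '<IDLE>%';

    -- what they are running, and for how long
    SELECT procpid, now() - query_start AS runtime, current_query
      FROM pg_stat_activity
     WHERE current_query NOT LIKE '<IDLE>%'
     ORDER BY query_start;

Of course, a snapshot like this still misses statements that finish in well under the polling interval, which is Scott's point about the short statements.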


About once a day, queries that normally take just a few seconds slow way down, and start to pile up, to the point where instead of just having 3-6 queries running at any given time, we get 100 - 200. The whole system comes to a crawl, and looking at top, the CPU usage is 99%.

Looking at top, I see no swap usage, very little I/O wait, and a large number of postmaster processes at 100% CPU usage (which makes sense, since at this point there are 150 or so queries executing on the database).

 Tasks: 713 total,  44 running, 668 sleeping,   0 stopped,   1 zombie
Cpu(s):  4.4%us, 92.0%sy,  0.0%ni,  3.0%id,  0.0%wa,  0.0%hi,  0.3%si,  0.2%st
Mem:  134217728k total, 131229972k used,  2987756k free,   462444k buffers
Swap:  8388600k total,      296k used,  8388304k free, 119029580k cached


In the past, we noticed that autovacuum was hitting some large tables at the same time this happened, so we turned autovacuum off to see if that was the issue, and it still happened without any vacuums running.
That was my next question :)

We also ruled out checkpoints being the cause.

How exactly did you rule this out?  Just because a checkpoint is over doesn't mean that it hasn't had a negative effect on the OS cache.  If you're stuck going to disk, that could be hurting you (that being said, you do point to a low I/O wait above, so you're probably correct in ruling this out).
 
Checkpoint settings were at the install defaults: a 5 minute timeout, a 0.5 completion target, and a 30s warning. Looking at the logs, we were getting a checkpoint every 5 minutes on the dot.
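In postgresql.conf terms those defaults amount to roughly the following (log_checkpoints is shown on the assumption that it was enabled, since the checkpoint times are visible in the logs; it is off by default):

    checkpoint_timeout = 5min              # default
    checkpoint_completion_target = 0.5     # default
    checkpoint_warning = 30s               # default
    log_checkpoints = on                   # assumed on, so checkpoint timing shows up in the logs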

I looked at the data in pg_stat_bgwriter and noticed that the buffers written by checkpoints were nearly 4X those written by the background writer. So I made some changes to shift more of the writing onto the background writer, including increasing the checkpoint timeout to 30 minutes and reducing the bgwriter delay from 200ms to 50ms.
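In 8.3 those counters live in the pg_stat_bgwriter view, so a quick comparison can be pulled with something like:

    SELECT checkpoints_timed, checkpoints_req,
           buffers_checkpoint,   -- buffers written during checkpoints
           buffers_clean,        -- buffers written by the background writer
           buffers_backend       -- buffers written directly by backends
      FROM pg_stat_bgwriter;

and the changes described above correspond to:

    checkpoint_timeout = 30min   # was 5min
    bgwriter_delay = 50ms        # was 200ms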

Checkpoints now happen 30 minutes apart on the dot, and there was not a checkpoint running the last time this major slowdown occurred.


I'm currently digging through some statistics I've been gathering to see whether traffic increased or stayed the same when the slowdown occurred. I'm also digging through the logs from the PostgreSQL cluster (I increased verbosity yesterday), looking for any clues. Any suggestions on where to look to see what could be causing a slowdown like this would be greatly appreciated.
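As a sketch of the kind of extra logging that can help here (the exact settings used aren't given in the thread, so the values below are hypothetical):

    log_min_duration_statement = 1000   # hypothetical: log any statement running longer than 1 second
    log_line_prefix = '%t [%p] '        # hypothetical: timestamp and PID, to line entries up with the slowdowns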

Are you capturing table-level stats from pg_stat_user_[tables | indexes]?  Just because a server doesn't look busy doesn't mean that you're not doing 1000 index scans per second returning 1000 tuples each time.

I am not grabbing any of those at the moment, I'll look into those.
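A starting point for those, using the standard 8.3 statistics view columns, might be something like the queries below; the counters are cumulative, so the interesting numbers are the deltas between two samples taken a few minutes apart:

    -- tables doing the most sequential-scan reads
    SELECT relname, seq_scan, seq_tup_read, idx_scan, idx_tup_fetch
      FROM pg_stat_user_tables
     ORDER BY seq_tup_read DESC
     LIMIT 10;

    -- most heavily hit indexes
    SELECT relname, indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
      FROM pg_stat_user_indexes
     ORDER BY idx_scan DESC
     LIMIT 10;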

- Brian F
--Scott
 
Thanks,
   - Brian F


