Server hitting 100% CPU usage, system comes to a crawl. - Mailing list pgsql-general

From Brian Fehrle
Subject Server hitting 100% CPU usage, system comes to a crawl.
Msg-id 4EA9A562.7020808@consistentstate.com
Responses Re: Server hitting 100% CPU usage, system comes to a crawl.  (John R Pierce <pierce@hogranch.com>)
Re: Server hitting 100% CPU usage, system comes to a crawl.  (Scott Marlowe <scott.marlowe@gmail.com>)
Re: Server hitting 100% CPU usage, system comes to a crawl.  (Scott Mead <scottm@openscg.com>)
Re: Server hitting 100% CPU usage, system comes to a crawl.  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-general
Hi all, need some help/clues on tracking down a performance issue.

PostgreSQL version: 8.3.11

I've got a system that has 32 cores and 128 GB of RAM. We have
connection pooling set up, with about 100-200 persistent connections
open to the database. Our applications use these connections to
query the database constantly, but when a connection isn't currently
executing a query, it shows as <IDLE>. On average, at any given time,
3-6 connections are actually executing a query, while the rest
are <IDLE>.
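
For anyone wanting to track that active/idle split over time, here is a
minimal sketch in Python. It assumes the list of strings comes from
something like SELECT current_query FROM pg_stat_activity on 8.3, where
idle backends report '<IDLE>' or '<IDLE> in transaction'. The function
name summarize_activity is hypothetical, and fetching the rows is left
out so the snippet runs without a database.

```python
def summarize_activity(current_queries):
    """Count idle, idle-in-transaction, and active backends.

    current_queries: iterable of current_query strings, assumed to come
    from pg_stat_activity (8.3-style '<IDLE>' markers).
    """
    counts = {"idle": 0, "idle_in_transaction": 0, "active": 0}
    for query in current_queries:
        text = (query or "").strip()
        if text == "<IDLE>":
            counts["idle"] += 1
        elif text == "<IDLE> in transaction":
            counts["idle_in_transaction"] += 1
        else:
            # Anything else is treated as a running statement.
            counts["active"] += 1
    return counts


# Hypothetical sample resembling the normal state described above.
sample = ["<IDLE>"] * 3 + ["SELECT 1", "<IDLE> in transaction"]
summary = summarize_activity(sample)
print(summary)
```

Logging this once a minute would show whether the jump from a handful
of active backends to 100-200 is sudden or builds up gradually.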

About once a day, queries that normally take just a few seconds slow way
down and start to pile up, to the point where instead of 3-6 queries
running at any given time, we get 100-200. The whole system comes to a
crawl, and top shows CPU usage at 99%.

In top, I see no swap usage, very little I/O wait, and a large number
of postmaster processes at 100% CPU usage (which makes sense, since at
this point there are 150 or so queries executing on the database).

Tasks: 713 total,  44 running, 668 sleeping,   0 stopped,   1 zombie
Cpu(s):  4.4%us, 92.0%sy,  0.0%ni,  3.0%id,  0.0%wa,  0.0%hi,  0.3%si,  0.2%st
Mem:  134217728k total, 131229972k used,  2987756k free,   462444k buffers
Swap:  8388600k total,      296k used,  8388304k free, 119029580k cached
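
To compare this breakdown between normal and slow periods, the Cpu(s)
line can be captured and parsed. The following is a rough sketch, not a
polished tool: the function name parse_top_cpu is hypothetical, and the
regular expression assumes the "value%field" layout shown in the output
above (other top versions format this line differently).

```python
import re


def parse_top_cpu(line):
    """Parse a top 'Cpu(s):' summary line into {field: percent}."""
    pairs = re.findall(r"([\d.]+)%(\w+)", line)
    return {name: float(value) for value, name in pairs}


# The Cpu(s) line from the top output above, joined onto one line.
line = ("Cpu(s):  4.4%us, 92.0%sy,  0.0%ni,  3.0%id,  0.0%wa,  "
        "0.0%hi,  0.3%si,  0.2%st")
cpu = parse_top_cpu(line)

# Compare user time against system (kernel) time.
print("user: %.1f%%  system: %.1f%%" % (cpu["us"], cpu["sy"]))
```

In this sample, nearly all of the busy time is reported under sy
(system) rather than us (user).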


In the past, we noticed that autovacuum was hitting some large tables at
the same time this happened, so we turned autovacuum off to see if that
was the issue, but it still happened without any vacuums running.

We also ruled out checkpoints being the cause.

I'm currently digging through some statistics I've been gathering to see
whether traffic increased or stayed the same when the slowdown
occurred. I'm also digging through the logs from the PostgreSQL cluster
(I increased verbosity yesterday), looking for any clues. Any
suggestions on where to look to find out what could be causing a
slowdown like this would be greatly appreciated.

Thanks,
     - Brian F
