On Wed, Feb 1, 2012 at 6:38 PM, Christopher Opena <counterveil@gmail.com> wrote:
> Hello folks,
>
> We've been running into some very strange issues of late with our PostgreSQL
> database(s). We have an issue where a couple of queries push high CPU on a
> few of our processors and the entire database locks (reads, writes, console
> cannot be achieved unless the high CPU query procs are killed). Further
> investigation shows ~59% total cpu usage (we have 16 total cores), low io,
> and mid-to-low memory usage (we have 74GB of memory, shared_buffers=16GB).
> We had previously seen some high io problems but those turned out to be
> unconnected and ultimately solved, yet we are still seeing a complete lock
> of the DB occasionally as previously described.
>
> The queries themselves are not any different than normal usage on other
> databases; they are pulling back a little more data but there's nothing that
> stands out about them as far as query construction.
>
> One thing that we aren't sure of is whether or not we are running into a
> general connection pooling issue. Our typical number of postgresql
> processes fluctuates between 1,400 and 1,600 - most of which are idle - as
> we have a number of application servers all connecting to a central
> read/write master (the master replicates out to a secondary via streaming
> replication). We have max_processes set to 3,000 after tweaking some kernel
> memory parameters so at least we know we aren't exceeding that, but is there
> a practical "real world" limit or issue with setting this too high?
>
> Ultimately, the problem we're seeing is a full read/write lock on a system
> that is apparently at medium usage levels once we got rid of our high io red
> herring. Honestly I'm a little stumped as to where to look next; is there
> some specific metric I might be missing here?
>
> Any help is greatly appreciated,
Random thoughts/suggestions:
*) Look for some correlation between non-idle process count and
locking situation. You are running a lot of processes and if I was in
your shoes I would be strongly looking at pgbouncer to handle
connection pooling. You could be binding in the database or (worse)
the kernel
*) Try logging checkpoints to see if there is any correlation with your locks.
*) An strace of both the 'high cpu' process and one of the blocked
process might give some clues -- in particular if you are being
blocked on a system call
*) Given enough time, do your high cpu queries ever complete? Are they
writing or reading?
*) What happens to overall system load if you lower shared_buffers to, say, 4gb?
merlin