Re: Puzzling full database lock - Mailing list pgsql-general

From Merlin Moncure
Subject Re: Puzzling full database lock
Date
Msg-id CAHyXU0zmv=un8ewJCKDqLUjPwFJofGM1+UucTYwczPQ0r9x_XQ@mail.gmail.com
Whole thread Raw
In response to Puzzling full database lock  (Christopher Opena <counterveil@gmail.com>)
Responses Re: Puzzling full database lock
List pgsql-general
On Wed, Feb 1, 2012 at 6:38 PM, Christopher Opena <counterveil@gmail.com> wrote:
> Hello folks,
>
> We've been running into some very strange issues of late with our PostgreSQL
> database(s).  We have an issue where a couple of queries push high CPU on a
> few of our processors and the entire database locks (reads, writes, console
> cannot be achieved unless the high CPU query procs are killed).  Further
> investigation shows ~59% total cpu usage (we have 16 total cores), low io,
> and mid-to-low memory usage (we have 74GB of memory, shared_buffers=16GB).
>  We had previously seen some high io problems but those turned out to be
> unconnected and ultimately solved, yet we are still seeing a complete lock
> of the DB occasionally as previously described.
>
> The queries themselves are not any different than normal usage on other
> databases; they are pulling back a little more data but there's nothing that
> stands out about them as far as query construction.
>
> One thing that we aren't sure of is whether or not we are running into a
> general connection pooling issue.  Our typical number of postgresql
> processes fluctuates between 1,400 and 1,600 - most of which are idle - as
> we have a number of application servers all connecting to a central
> read/write master (the master replicates out to a secondary via streaming
> replication).  We have max_processes set to 3,000 after tweaking some kernel
> memory parameters so at least we know we aren't exceeding that, but is there
> a practical "real world" limit or issue with setting this too high?
>
> Ultimately, the problem we're seeing is a full read/write lock on a system
> that is apparently at medium usage levels once we got rid of our high io red
> herring.  Honestly I'm a little stumped as to where to look next; is there
> some specific metric I might be missing here?
>
> Any help is greatly appreciated,

Random thoughts/suggestions:
*) Look for some correlation between non-idle process count and
locking situation.  You are running a lot of processes and if I was in
your shoes I would be strongly looking at pgbouncer to handle
connection pooling.  You could be binding in the database or (worse)
the kernel

*) Try logging checkpoints to see if there is any correlation with your locks.

*) An strace of both the 'high cpu' process and one of the blocked
process might give some clues -- in particular if you are being
blocked on a system call

*) Given enough time, do your high cpu queries ever complete? Are they
writing or reading?

*) What happens to overall system load if you lower shared_buffers to, say, 4gb?

merlin

pgsql-general by date:

Previous
From: hubert depesz lubaczewski
Date:
Subject: Re: Is it possible to speed up addition of "not null"?
Next
From: Florian Weimer
Date:
Subject: Re: Is it possible to speed up addition of "not null"?