Thread: Puzzling full database lock
Hello folks,
We've been running into some very strange issues of late with our PostgreSQL database(s). We have an issue where a couple of queries push high CPU on a few of our processors and the entire database locks (reads, writes, console cannot be achieved unless the high CPU query procs are killed). Further investigation shows ~59% total cpu usage (we have 16 total cores), low io, and mid-to-low memory usage (we have 74GB of memory, shared_buffers=16GB). We had previously seen some high io problems but those turned out to be unconnected and ultimately solved, yet we are still seeing a complete lock of the DB occasionally as previously described.
The queries themselves are not any different than normal usage on other databases; they are pulling back a little more data but there's nothing that stands out about them as far as query construction.
One thing that we aren't sure of is whether or not we are running into a general connection pooling issue. Our typical number of postgresql processes fluctuates between 1,400 and 1,600 - most of which are idle - as we have a number of application servers all connecting to a central read/write master (the master replicates out to a secondary via streaming replication). We have max_processes set to 3,000 after tweaking some kernel memory parameters so at least we know we aren't exceeding that, but is there a practical "real world" limit or issue with setting this too high?
Ultimately, the problem we're seeing is a full read/write lock on a system that is apparently at medium usage levels once we got rid of our high io red herring. Honestly I'm a little stumped as to where to look next; is there some specific metric I might be missing here?
Any help is greatly appreciated,
-Chris.
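[A quick way to tell whether a "full lock" like this is genuine lock contention or something else is to snapshot pg_stat_activity and pg_locks while the system is wedged. A minimal sketch for the 9.0-series catalogs (the database name is a placeholder; later releases renamed these columns):

# What is every backend doing right now? (9.0 column names: procpid,
# waiting, current_query.)
psql -U postgres -d yourdb -c \
  "SELECT procpid, waiting, query_start, substr(current_query, 1, 60)
     FROM pg_stat_activity
    ORDER BY query_start;"

# Are any lock requests sitting ungranted?
psql -U postgres -d yourdb -c \
  "SELECT pid, locktype, mode, relation::regclass
     FROM pg_locks
    WHERE NOT granted;"

If the second query comes back empty during an incident, the stall is probably not heavyweight lock contention at all.]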
On Wed, Feb 1, 2012 at 7:38 PM, Christopher Opena <counterveil@gmail.com> wrote:
> Hello folks,
>
> We've been running into some very strange issues of late with our PostgreSQL
> database(s). We have an issue where a couple of queries push high CPU on a
> few of our processors and the entire database locks (reads, writes, console
> cannot be achieved unless the high CPU query procs are killed). Further
> investigation shows ~59% total cpu usage (we have 16 total cores), low io,
> and mid-to-low memory usage (we have 74GB of memory, shared_buffers=16GB).

Just out of curiosity, what OS are you running?
It's CentOS 5.5, PostgreSQL version 9.0.4 (x86_64).
On Wed, Feb 1, 2012 at 4:44 PM, Carlos Mennens <carlos.mennens@gmail.com> wrote:
> Just out of curiosity, what OS are you running?
On Wed, Feb 1, 2012 at 7:51 PM, Christopher Opena <counterveil@gmail.com> wrote:
> It's CentOS 5.5, PostgreSQL version 9.0.4 (x86_64).

That seems extremely bleeding edge for CentOS. Did you compile this
package from source RPM or some 3rd party package maintainer for
PostgreSQL?
On Wed, Feb 1, 2012 at 7:38 PM, Christopher Opena <counterveil@gmail.com> wrote:
> Hello folks,
>
> We've been running into some very strange issues of late with our
> PostgreSQL database(s). We have an issue where a couple of queries
> push high CPU on a few of our processors and the entire database locks
> (reads, writes, console cannot be achieved unless the high CPU query
> procs are killed). Further investigation shows ~59% total cpu usage
> (we have 16 total cores), low io, and mid-to-low memory usage (we have
> 74GB of memory, shared_buffers=16GB).

Define "low I/O". The only things I've ever seen interfere with console access
are running out of memory or excessive disk i/o. If you're seeing even 6-12%
iowait on a 16 core machine you might very well have swamped your available
disk I/O.
It was installed from pgrpms.org's repository.
On Wed, Feb 1, 2012 at 4:55 PM, Carlos Mennens <carlos.mennens@gmail.com> wrote:
> That seems extremely bleeding edge for CentOS. Did you compile this
> package from source RPM or some 3rd party package maintainer for
> PostgreSQL?
Hi,

On 2 February 2012 11:38, Christopher Opena <counterveil@gmail.com> wrote:
> We've been running into some very strange issues of late with our PostgreSQL
> database(s). We have an issue where a couple of queries push high CPU on a
> few of our processors and the entire database locks (reads, writes, console
> cannot be achieved unless the high CPU query procs are killed). Further
> investigation shows ~59% total cpu usage (we have 16 total cores), low io,
> and mid-to-low memory usage (we have 74GB of memory, shared_buffers=16GB).

I think 16GB is too much. We started with 9GB (16 cores, 80GB RAM, SAN) and
then experimented with a lower value (6GB), but we never used it in production
because we switched to a different database / storage technology. Anyway,
overall CPU utilisation was lower with 6GB.

If CPU utilisation is high because of io waits then it might be worth playing
with dirty_background_ratio and dirty_ratio. The problem is that these values
are percentages and you have 74GB. CentOS has 10% and 40% as the default
values for dirty_background_ratio and dirty_ratio respectively. 10% of 74GB is
7.4GB, and there is no storage controller with 7.4GB of cache, so you get IO
waits (and high load). Writes will then back up until you hit the 40% hard
limit (vm.dirty_ratio), which is even worse (~30GB to flush). I think you
should try lowering both. For example, try 1 and 10 for
vm.dirty_background_ratio and vm.dirty_ratio respectively.

> One thing that we aren't sure of is whether or not we are running into a
> general connection pooling issue. Our typical number of postgresql
> processes fluctuates between 1,400 and 1,600 - most of which are idle - as
> we have a number of application servers all connecting to a central
> read/write master (the master replicates out to a secondary via streaming
> replication). We have max_processes set to 3,000 after tweaking some kernel
> memory parameters so at least we know we aren't exceeding that, but is there
> a practical "real world" limit or issue with setting this too high?

I would use a connection pooler like pgpool-II. It can add transparent
failover, and you don't need max_processes set so high (plus the parallel
query feature could be useful).

--
Ondrej Ivanic
(ondrej.ivanic@gmail.com)
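[A minimal sketch of adjusting those writeback thresholds on a CentOS 5 box; the values are the ones suggested above and should be treated as a starting point to test, not a drop-in fix:

# Check the current thresholds (CentOS 5 defaults are 10 and 40).
sysctl vm.dirty_background_ratio vm.dirty_ratio

# Apply the lower values at runtime (as root).
sysctl -w vm.dirty_background_ratio=1
sysctl -w vm.dirty_ratio=10

# Persist them across reboots.
cat >> /etc/sysctl.conf <<'EOF'
vm.dirty_background_ratio = 1
vm.dirty_ratio = 10
EOF

Lower thresholds mean the kernel starts flushing dirty pages earlier, trading peak write throughput for smoother, smaller flushes.]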
Do you mean 6-12% of total iowait, or per cpu? Our average iowait in the last week is 34.31% of a total 1600% with an average idle of 1451.76%. Our iowait *does* spike occasionally (today it went up to 148.01%) but it doesn't coincide with the lock happening. At the time of the lock we were at 10.58% iowait, which is quite a bit below our average.
Thanks,
-Chris.
On Wed, Feb 1, 2012 at 4:55 PM, Alan Hodgson <ahodgson@simkin.ca> wrote:
> Define "low I/O". The only things I've ever seen interfere with console access
> are running out of memory or excessive disk i/o. If you're seeing even 6-12%
> iowait on a 16 core machine you might very well have swamped your available
> disk I/O.
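[As an aside, a quick sketch of how to see where that iowait actually lands, assuming the sysstat package is installed:

# Per-CPU view; a single saturated core can hide behind a low aggregate
# iowait number on a 16-core box.
mpstat -P ALL 5

# Per-device view; %util near 100 on one device means that disk or array is
# saturated even if the overall iowait figure looks modest.
iostat -dx 5
]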
On Wednesday, February 01, 2012 05:13:15 PM Christopher Opena wrote:
> Do you mean 6-12% of total iowait, or per cpu? Our average iowait in the
> last week is 34.31% of a total 1600% with an average idle of 1451.76%. Our
> iowait *does* spike occasionally (today it went up to 148.01%) but it
> doesn't coincide with the lock happening. At the time of the lock we were
> at 10.58% iowait, which is quite a bit below our average.

Total, but it doesn't sound like that's the problem.
Yeah, it's strange because we definitely have periods of high iowait but this is not when the locks are happening. If I could correlate it directly to that it would be so much easier. Thanks again for the response!
On Wed, Feb 1, 2012 at 8:42 PM, Alan Hodgson <ahodgson@simkin.ca> wrote:
> Total, but it doesn't sound like that's the problem.
On Wed, Feb 1, 2012 at 6:38 PM, Christopher Opena <counterveil@gmail.com> wrote:
> We've been running into some very strange issues of late with our PostgreSQL
> database(s). We have an issue where a couple of queries push high CPU on a
> few of our processors and the entire database locks (reads, writes, console
> cannot be achieved unless the high CPU query procs are killed). Further
> investigation shows ~59% total cpu usage (we have 16 total cores), low io,
> and mid-to-low memory usage (we have 74GB of memory, shared_buffers=16GB).
> [...]
> One thing that we aren't sure of is whether or not we are running into a
> general connection pooling issue. Our typical number of postgresql
> processes fluctuates between 1,400 and 1,600 - most of which are idle - as
> we have a number of application servers all connecting to a central
> read/write master (the master replicates out to a secondary via streaming
> replication). We have max_processes set to 3,000 after tweaking some kernel
> memory parameters so at least we know we aren't exceeding that, but is there
> a practical "real world" limit or issue with setting this too high?
> [...]

Random thoughts/suggestions:

*) Look for some correlation between non-idle process count and
locking situation. You are running a lot of processes and if I was in
your shoes I would be strongly looking at pgbouncer to handle
connection pooling. You could be binding in the database or (worse)
the kernel

*) Try logging checkpoints to see if there is any correlation with your locks.

*) An strace of both the 'high cpu' process and one of the blocked
processes might give some clues -- in particular if you are being
blocked on a system call

*) Given enough time, do your high cpu queries ever complete? Are they
writing or reading?

*) What happens to overall system load if you lower shared_buffers to, say, 4gb?

merlin
* Christopher Opena:

> We've been running into some very strange issues of late with our
> PostgreSQL database(s). We have an issue where a couple of queries push
> high CPU on a few of our processors and the entire database locks (reads,
> writes, console cannot be achieved unless the high CPU query procs are
> killed).

Does the kernel log something? Does "dmesg" display anything illuminating?

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99
Merlin, thanks for the response. My comments below, but firstly, does anyone know if autovacuum is affected by setting a statement_timeout? There was a long thread here from 2007'ish, but it's unclear to me which way that ended up going. We're thinking of setting statement_timeout to something fairly high (that will catch these queries we've been seeing) so that we can further troubleshoot over time. We're worried, however, that autovacuum might be affected.
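[A sketch of scoping such a timeout to the application role rather than setting it globally; the role name and value below are placeholders:

# Applies only to new sessions of that role; superuser/maintenance sessions
# are unaffected, and autovacuum ignores statement_timeout entirely (it
# forces the setting to 0, as Tom confirms later in the thread).
psql -U postgres -c "ALTER ROLE app_user SET statement_timeout = '10min';"

# Alternatively, a global default can go in postgresql.conf and be picked up
# with a reload:
#   statement_timeout = 600000    # milliseconds; 0 disables the timeout
]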
> Random thoughts/suggestions:
>
> *) Look for some correlation between non-idle process count and
> locking situation. You are running a lot of processes and if I was in
> your shoes I would be strongly looking at pgbouncer to handle
> connection pooling. You could be binding in the database or (worse)
> the kernel
We're definitely looking at our options with pgbouncer right now; the issue is that we'd have to run a bouncer per database, and our architecture currently calls for "many databases, many connections". For now we're trying to limit that by having our application pooler cap the number of active connections a client can have to the application itself (and thereby to the database, by proxy). This is still an option, however, so we're doing some research here.
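[For what it's worth, one pgbouncer instance can front several databases from a single [databases] section, with a wildcard entry as a fallback, so the per-database overhead may be smaller than it first appears. A hypothetical fragment (database names, port, and pool sizes are made up; adjust paths to taste):

cat > /etc/pgbouncer/pgbouncer.ini <<'EOF'
[databases]
; explicit entries, plus a wildcard that forwards anything not listed
appdb1 = host=127.0.0.1 port=5432 dbname=appdb1
appdb2 = host=127.0.0.1 port=5432 dbname=appdb2
* = host=127.0.0.1 port=5432

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
; transaction pooling gives the biggest reduction in idle backends
pool_mode = transaction
default_pool_size = 20
max_client_conn = 2000
EOF

With a setup like this, thousands of client connections can share a few dozen real server backends per database.]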
> *) Try logging checkpoints to see if there is any correlation with your locks.
We've been logging checkpoints for several days now with no hard correlation that we can find. Thanks for the suggestion though!
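[For anyone else trying this, the relevant knob, sketched with placeholder paths; run as the postgres user:

# Enable checkpoint logging and reload; no restart needed.
echo "log_checkpoints = on" >> /var/lib/pgsql/9.0/data/postgresql.conf
pg_ctl -D /var/lib/pgsql/9.0/data reload

# Then line the checkpoint entries up against the incident timestamps.
grep -i checkpoint /var/lib/pgsql/9.0/data/pg_log/*.log
]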
> *) An strace of both the 'high cpu' process and one of the blocked
> processes might give some clues -- in particular if you are being
> blocked on a system call
We have yet to try this; definitely next in line.
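[A rough sketch of that, in case it's useful; the PID is a placeholder, and it's worth running against both the spinning backend and one of the blocked ones:

# Find candidate backend PIDs (9.0 columns: procpid, waiting, current_query).
psql -U postgres -c "SELECT procpid, waiting, substr(current_query, 1, 60) FROM pg_stat_activity ORDER BY backend_start;"

# Syscall summary for the high-CPU backend; Ctrl-C after ~30s prints the table.
strace -c -p 12345

# Or watch calls live with wall-clock timestamps and per-call durations.
strace -tt -T -p 12345

# A backend waiting on a heavyweight lock typically sits in semop(); one that
# is purely computing shows almost no syscall activity at all.
]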
> *) Given enough time, do your high cpu queries ever complete? Are they
> writing or reading?
The queries are reading in this case; we haven't allowed them to run their course because of the effect it has on our entire user base. Right now we've patched our application to catch these cases and handle them outright by notifying the end user that there is a potentially damaging query that is being cancelled. Short term solution, but for now it's something we need to do until we can replicate and solve the problem on a non-production system.
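[For completeness, the server-side equivalent of that cancellation, if it ever has to be done by hand; the PID is a placeholder:

# Cancel only the running statement; the client connection stays up.
psql -U postgres -c "SELECT pg_cancel_backend(12345);"

# Or drop the whole backend (superuser only).
psql -U postgres -c "SELECT pg_terminate_backend(12345);"
]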
> *) What happens to overall system load if you lower shared_buffers to, say, 4gb?
We're going to be trying this as well once we have an appropriate maintenance window. It seems to be a general consensus that this is something we should at least try.
Thanks,
-Chris.
Christopher Opena <counterveil@gmail.com> writes:
> Merlin, thanks for the response. My comments below, but firstly, does
> anyone know if autovacuum is affected by setting a statement_timeout?

It is not; in all recent PG releases, the autovacuum processes are careful
to force a session setting of zero.

                        regards, tom lane
On Fri, Feb 3, 2012 at 2:55 PM, Christopher Opena <counterveil@gmail.com> wrote:
> Merlin, thanks for the response.

no problem. if you're open to architecture suggestions you might also
want to consider going with HS/SR and getting those large olap queries
off your main database. you'll have to configure it to be very
forgiving of long running queries (which can pause replication) but
this should be a huge win in terms of total load on the main server.
also, you can skimp on hardware to a certain degree to cut costs. we
still want to get to the bottom of this obviously but it can't hurt to
get a plan 'B' going...

merlin
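[A sketch of the "forgiving" standby configuration Merlin is describing, for a 9.0 streaming-replication standby; the data directory path is a placeholder:

# On the standby only: allow read-only queries, and let them hold up WAL
# replay indefinitely (-1) instead of being cancelled when replay conflicts
# with a long-running report.
cat >> /var/lib/pgsql/9.0/data/postgresql.conf <<'EOF'
hot_standby = on
max_standby_streaming_delay = -1
max_standby_archive_delay = -1
EOF

The trade-off is that while an OLAP query pins replay the standby falls behind the master, so this suits a dedicated reporting replica rather than an HA failover target.]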