Thread: PG connections going to 'waiting'
Hey folks, I've got Munin installed on all my systems, so I was able to get some
interesting data around the big crash we had last night.  We'd thought it was
simply a matter of our DB connections maxing out, but it looks a bit more
complex than that.  A good 2 or 3 hours before the connections max out (at 300),
about 80% of them go to 'waiting'.  I'm wondering what other data I should be
looking at to see why.

Munin also graphs iostat, but unfortunately only the amount of activity.  I
looked at the iostat man page and these two fields look like they might be of
interest, no?

  avgqu-sz  The average queue length of the requests that were issued
            to the device.

  await     The average time (in milliseconds) for I/O requests issued
            to the device to be served.  This includes the time spent
            by the requests in queue and the time spent servicing them.

Here is the core of the Munin plugin for anyone who wants to know - written
in Perl:

    my $sql = "select count(*), waiting from pg_stat_activity ";
    $sql .= " where datname = ? group by waiting ";
    print "# $sql\n" if $debug;
    my $sth = $dbh->prepare($sql);
    $sth->execute($dbname);
    printf("# Rows: %d\n", $sth->rows) if $debug;
    if ($sth->rows > 0) {
        my $c_waiting = 0;
        my $c_active  = 0;
        while (my ($count, $waiting) = $sth->fetchrow_array()) {
            if ($waiting) {
                $c_waiting = $count;
            } else {
                $c_active = $count;
            }
        }
        print "waiting.value $c_waiting\n";
        print "active.value $c_active\n";
    }

    $sql = "select count(*) from pg_stat_activity ";
    $sql .= " where datname = ? and current_query like '<IDLE>%'";
    print "# $sql\n" if $debug;
    $sth = $dbh->prepare($sql);
    $sth->execute($dbname);
    printf("# Rows: %d\n", $sth->rows) if $debug;
    if ($sth->rows > 0) {
        my ($idle) = $sth->fetchrow_array();
        print "idle.value $idle\n";
    }

--
“Don't eat anything you've ever seen advertised on TV”
   - Michael Pollan, author of "In Defense of Food"
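For anyone poking at the same data: the waiting backends themselves can be
listed straight from pg_stat_activity.  A minimal sketch, assuming the
8.3-era column names (procpid, current_query, waiting) that the plugin above
also uses:

    SELECT procpid,           -- backend PID (renamed to pid in 9.2+)
           usename,
           query_start,
           current_query      -- query text (renamed to query in 9.2+)
      FROM pg_stat_activity
     WHERE waiting            -- true while the backend is blocked on a lock
     ORDER BY query_start;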
Alan McKay <alan.mckay@gmail.com> writes:
> I've got Munin installed on all my systems, so was able to get some
> interesting data around the big crash we had last night.  We'd
> thought it was simply a matter of our DB connections maxing out, but
> it looks a bit more complex than that.  A good 2 or 3 hours before the
> connections max (at 300), about 80% of them go to 'waiting'.  I'm
> wondering what other data I should be looking at to see why.

pg_locks?  Somebody taking exclusive lock on a widely-used table might
explain that.

			regards, tom lane
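A query along these lines should show who is stuck and on which relation -
just a sketch, assuming the 8.3 catalog layout where pg_locks.pid joins to
pg_stat_activity.procpid:

    SELECT l.relation::regclass AS locked_table,  -- NULL for non-relation locks
           l.mode,                                -- e.g. AccessExclusiveLock
           l.granted,                             -- false = still waiting
           a.procpid,
           a.current_query
      FROM pg_locks l
      JOIN pg_stat_activity a ON a.procpid = l.pid
     WHERE NOT l.granted;

Whoever holds the conflicting lock will show up with granted = true on the
same relation.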
> pg_locks?  Somebody taking exclusive lock on a widely-used table might
> explain that.

Thanks, I'll check with the SW designers and DB admin.

--
“Don't eat anything you've ever seen advertised on TV”
   - Michael Pollan, author of "In Defense of Food"
> pg_locks?  Somebody taking exclusive lock on a widely-used table might
> explain that.

OK, in theory we could do the following, no?

Use our PITR logs to restore a tertiary system to the point when we
were having the problem (we have a pretty wide 2 or 3 hour window to
hit), then query the pg_locks table on that system?

If so, is it possible to only restore a subset of the DB to that point
- in particular just this table?  The problem is that our DB is pretty
massive and this could take an extremely long time.

--
“Don't eat anything you've ever seen advertised on TV”
   - Michael Pollan, author of "In Defense of Food"
Alan McKay <alan.mckay@gmail.com> writes:
>> pg_locks?  Somebody taking exclusive lock on a widely-used table might
>> explain that.

> OK, in theory we could do the following, no?

> Use our PITR logs to restore a tertiary system to the point when we
> were having the problem (we have a pretty wide 2 or 3 hour window to
> hit), then query the pg_locks table on that system?

No, that wouldn't tell you a single thing.  pg_locks is a dynamic view
of current system state.  You would have to look at it while having
the problem.

			regards, tom lane
On Sat, Sep 05, 2009 at 01:08:30PM -0400, Alan McKay wrote:
> > pg_locks?  Somebody taking exclusive lock on a widely-used table might
> > explain that.
>
> OK, in theory we could do the following, no?
>
> Use our PITR logs to restore a tertiary system to the point when we
> were having the problem (we have a pretty wide 2 or 3 hour window to
> hit), then query the pg_locks table on that system?

I'm guessing this wouldn't work.  pg_locks isn't a table, it's a view,
based on the pg_lock_status function.  I don't know how that function
works, but I'm guessing it gathers lock data from shared memory or some
such, and would only contain the same values it did at the time of the
crash if you had the same queries and autovacuum processes currently in
flight.  Autovacuum presumably didn't contribute to the locking in
meaningful ways, so if you have logs of the queries you were running at
the time, you might be able to get the same or a similar condition by
executing those queries against the restored system.

--
Joshua Tolley / eggyknap
End Point Corporation
http://www.endpoint.com
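Since pg_locks only reflects the current instant, one workaround is to
snapshot it on a schedule so there is history to dig through after the next
incident.  A rough sketch - the history-table name and sampling interval are
made up, not anything the list suggested:

    -- one-time setup: a table shaped like pg_locks, plus a capture timestamp
    CREATE TABLE lock_snapshots AS
        SELECT now() AS captured_at, * FROM pg_locks LIMIT 0;

    -- run from cron every minute or so, e.g.:
    --   psql -d yourdb -c "INSERT INTO lock_snapshots SELECT now(), * FROM pg_locks"
    INSERT INTO lock_snapshots
        SELECT now(), * FROM pg_locks;

After an incident, filtering on captured_at within the problem window and
granted = false narrows things down.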
Is there any way to look at the statistics on table name, length and type over a period of time?
Or, would we have to use munin and capture these stats for analysis later?
Chris
> To: alan.mckay@gmail.com
> CC: pgsql-general@postgresql.org
> Subject: Re: [GENERAL] PG connections going to 'waiting'
> Date: Sat, 5 Sep 2009 15:24:55 -0400
> From: tgl@sss.pgh.pa.us
>
> Alan McKay <alan.mckay@gmail.com> writes:
> >> pg_locks? Somebody taking exclusive lock on a widely-used table might
> >> explain that.
>
> > OK, in theory we could do the following, no?
>
> > Use our PITR logs to restore a tertiary system to the point when we
> > were having the problem (we have a pretty wide 2 or 3 hour window to
> > hit), then query the pg_locks table on that system?
>
> No, that wouldn't tell you a single thing. pg_locks is a dynamic view
> of current system state. You would have to look at it while having
> the problem.
>
> regards, tom lane
On Sat, Sep 05, 2009 at 07:41:42PM -0400, Chris Barnes wrote:
> Is there any way to look at the statistics on table name, length and
> type over a period of time?

Only if you've captured them.  Various system information views provide
those statistics (or the server's best guesses about those statistics)
for the moment they're queried, but when those values are updated, the
old values are lost.

--
Joshua Tolley / eggyknap
End Point Corporation
http://www.endpoint.com
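Capturing them can be as simple as copying the relevant view into a history
table on a schedule, same trick as with pg_locks above.  A sketch against
pg_stat_user_tables (the history-table name is made up; columns as in 8.3):

    -- one-time setup
    CREATE TABLE stat_history AS
        SELECT now() AS captured_at, relname, seq_scan, idx_scan,
               n_tup_ins, n_tup_upd, n_tup_del, n_dead_tup
          FROM pg_stat_user_tables LIMIT 0;

    -- sample from cron every few minutes
    INSERT INTO stat_history
        SELECT now(), relname, seq_scan, idx_scan,
               n_tup_ins, n_tup_upd, n_tup_del, n_dead_tup
          FROM pg_stat_user_tables;

Diffing consecutive rows per relname then gives per-interval activity rather
than the raw lifetime counters.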
OK, looks like the time window is exactly when we run vacuum.  That has
been running now for a couple of months with no problem, but the last 2
weekends we've been doing massive data loads which could be complicating
things.

Is vacuum a good candidate for what could be locking up the tables?

Here is the script that is running:

---snip---
    #Initialize the variables
    yesterday=`date -d "yesterday" +%Y-%m-%d`

    for db in `psql -t -A -d template1 -c "select datname from pg_database"`; do
        /opt/PostgreSQL/8.3/bin/vacuumdb --analyze --full $db
    done

    #Mail vacuum log to admin group
    mail -s " Vacuum log on pgprd01 " admin@example.com < /home/postgres/vacuum/vacuum.log

    #Move vacuum.log to the yesterday folder
    #(braces needed: $yesterday_vacuum would be read as one variable name)
    mkdir /home/postgres/vacuum/$yesterday/
    mv /home/postgres/vacuum/vacuum.log /home/postgres/vacuum/$yesterday/${yesterday}_vacuum.log

--
“Don't eat anything you've ever seen advertised on TV”
   - Michael Pollan, author of "In Defense of Food"
Alan McKay wrote:
> OK, looks like the time window is exactly when we run vacuum.  That
> has been running now for a couple of months with no problem, but the
> last 2 weekends we've been doing massive data loads which could be
> complicating things.
>
> Is vacuum a good candidate for what could be locking up the tables?

Vacuum full, which is what your script is using, locks tables.  But
it's not recommended.  You should be looking at an alternative
vacuuming strategy that does not involve vacuum full at all.  Perhaps
autovacuum, properly tuned, is a better solution.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
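To check whether autovacuum is keeping up once the nightly vacuum full loop
is dropped, the per-table counters and timestamps in pg_stat_user_tables are
a reasonable place to start.  A sketch, using the column names as they exist
in 8.3:

    SELECT relname,
           last_vacuum,        -- last manual VACUUM
           last_autovacuum,    -- last time autovacuum processed the table
           n_dead_tup          -- dead rows waiting to be reclaimed
      FROM pg_stat_user_tables
     ORDER BY n_dead_tup DESC  -- most bloated tables first
     LIMIT 20;

Tables whose n_dead_tup keeps climbing with no recent last_autovacuum are
the candidates for more aggressive autovacuum settings.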