Thread: PG connections going to 'waiting'
Hey folks, I've got Munin installed on all my systems, so I was able to get some
interesting data around the big crash we had last night.  We'd thought it was
simply a matter of our DB connections maxing out, but it looks a bit more
complex than that.  A good 2 or 3 hours before the connections max out (at 300),
about 80% of them go to 'waiting'.  I'm wondering what other data I should be
looking at to see why.

Munin also graphs iostat, but unfortunately only the amount of activity.  I
looked at the iostat man page and these two fields look like they might be of
interest, no?

  avgqu-sz  The average queue length of the requests that were issued
            to the device.

  await     The average time (in milliseconds) for I/O requests issued
            to the device to be served.  This includes the time spent
            by the requests in queue and the time spent servicing them.

Here is the core of the Munin plugin for anyone who wants to know - written
in Perl:

    my $sql = "select count(*), waiting from pg_stat_activity ";
    $sql .= " where datname = ? group by waiting ";
    print "# $sql\n" if $debug;
    my $sth = $dbh->prepare($sql);
    $sth->execute($dbname);
    printf("# Rows: %d\n", $sth->rows) if $debug;
    if ($sth->rows > 0) {
        my $c_waiting = 0;
        my $c_active  = 0;
        while (my ($count, $waiting) = $sth->fetchrow_array()) {
            if ($waiting) {
                $c_waiting = $count;
            } else {
                $c_active = $count;
            }
        }
        print "waiting.value $c_waiting\n";
        print "active.value $c_active\n";
    }

    $sql = "select count(*) from pg_stat_activity ";
    $sql .= " where datname = ? and current_query like '<IDLE>%'";
    print "# $sql\n" if $debug;
    $sth = $dbh->prepare($sql);
    $sth->execute($dbname);
    printf("# Rows: %d\n", $sth->rows) if $debug;
    if ($sth->rows > 0) {
        my ($idle) = $sth->fetchrow_array();
        print "idle.value $idle\n";
    }

--
“Don't eat anything you've ever seen advertised on TV”
   - Michael Pollan, author of "In Defense of Food"
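For anyone poking at the same data: the waiting backends themselves can be
listed straight from pg_stat_activity.  A minimal sketch, assuming the
8.3-era column names (procpid, current_query, waiting) that the plugin above
also uses:

    SELECT procpid,           -- backend PID (renamed to pid in 9.2+)
           usename,
           query_start,
           current_query      -- query text (renamed to query in 9.2+)
      FROM pg_stat_activity
     WHERE waiting            -- true while the backend is blocked on a lock
     ORDER BY query_start;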
Alan McKay <alan.mckay@gmail.com> writes:
> I've got Munin installed on all my systems, so was able to get some
> interesting data around the big crash we had last night.  We'd
> thought it was simply a matter of our DB connections maxing out, but
> it looks a bit more complex than that.  A good 2 or 3 hours before the
> connections max (at 300), about 80% of them go to 'waiting'.  I'm
> wondering what other data I should be looking at to see why.

pg_locks?  Somebody taking exclusive lock on a widely-used table might
explain that.

			regards, tom lane
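A query along these lines should show who is stuck and on which relation -
just a sketch, assuming the 8.3 catalog layout where pg_locks.pid joins to
pg_stat_activity.procpid:

    SELECT l.relation::regclass AS locked_table,  -- NULL for non-relation locks
           l.mode,                                -- e.g. AccessExclusiveLock
           l.granted,                             -- false = still waiting
           a.procpid,
           a.current_query
      FROM pg_locks l
      JOIN pg_stat_activity a ON a.procpid = l.pid
     WHERE NOT l.granted;

Whoever holds the conflicting lock will show up with granted = true on the
same relation.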
> pg_locks?  Somebody taking exclusive lock on a widely-used table might
> explain that.

Thanks, I'll check with the SW designers and DB admin.

--
“Don't eat anything you've ever seen advertised on TV”
   - Michael Pollan, author of "In Defense of Food"
> pg_locks?  Somebody taking exclusive lock on a widely-used table might
> explain that.

OK, in theory we could do the following, no?

Use our PITR logs to restore a tertiary system to the point when we
were having the problem (we have a pretty wide 2 or 3 hour window to
hit), then query the pg_locks table on that system?

If so, is it possible to only restore a subset of the DB to that point
- in particular just this table?  The problem is that our DB is pretty
massive and this could take an extremely long time.

--
“Don't eat anything you've ever seen advertised on TV”
   - Michael Pollan, author of "In Defense of Food"
Alan McKay <alan.mckay@gmail.com> writes:
>> pg_locks?  Somebody taking exclusive lock on a widely-used table might
>> explain that.

> OK, in theory we could do the following, no?

> Use our PITR logs to restore a tertiary system to the point when we
> were having the problem (we have a pretty wide 2 or 3 hour window to
> hit), then query the pg_locks table on that system?

No, that wouldn't tell you a single thing.  pg_locks is a dynamic view
of current system state.  You would have to look at it while having
the problem.

			regards, tom lane
On Sat, Sep 05, 2009 at 01:08:30PM -0400, Alan McKay wrote:
> > pg_locks?  Somebody taking exclusive lock on a widely-used table might
> > explain that.
>
> OK, in theory we could do the following, no?
>
> Use our PITR logs to restore a tertiary system to the point when we
> were having the problem (we have a pretty wide 2 or 3 hour window to
> hit), then query the pg_locks table on that system?

I'm guessing this wouldn't work.  pg_locks isn't a table, it's a view,
based on the pg_lock_status function.  I don't know how that function
works, but I'm guessing it gathers lock data from shared memory or some
such, and would only contain the same values it did at the time of the
crash if you had the same queries and autovacuum processes currently in
flight.  Autovacuum presumably didn't contribute to the locking in
meaningful ways, so if you have logs of the queries you were running at
the time, you might be able to get the same or a similar condition by
executing those queries against the restored system.

--
Joshua Tolley / eggyknap
End Point Corporation
http://www.endpoint.com
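Since pg_locks only reflects the current instant, one workaround is to
snapshot it on a schedule so there is history to dig through after the next
incident.  A rough sketch - the history-table name and sampling interval are
made up, not anything the list suggested:

    -- one-time setup: a table shaped like pg_locks, plus a capture timestamp
    CREATE TABLE lock_snapshots AS
        SELECT now() AS captured_at, * FROM pg_locks LIMIT 0;

    -- run from cron every minute or so, e.g.:
    --   psql -d yourdb -c "INSERT INTO lock_snapshots SELECT now(), * FROM pg_locks"
    INSERT INTO lock_snapshots
        SELECT now(), * FROM pg_locks;

After an incident, filtering on captured_at within the problem window and
granted = false narrows things down.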
Is there any way to look at the statistics on table name, length and type over a period of time?
Or, would we have to use munin and capture these stats for analysis later?
Chris
> To: alan.mckay@gmail.com
> CC: pgsql-general@postgresql.org
> Subject: Re: [GENERAL] PG connections going to 'waiting'
> Date: Sat, 5 Sep 2009 15:24:55 -0400
> From: tgl@sss.pgh.pa.us
>
> Alan McKay <alan.mckay@gmail.com> writes:
> >> pg_locks? Somebody taking exclusive lock on a widely-used table might
> >> explain that.
>
> > OK, in theory we could do the following, no?
>
> > Use our PITR logs to restore a tertiary system to the point when we
> > were having the problem (we have a pretty wide 2 or 3 hour window to
> > hit), then query the pg_locks table on that system?
>
> No, that wouldn't tell you a single thing. pg_locks is a dynamic view
> of current system state. You would have to look at it while having
> the problem.
>
> regards, tom lane
On Sat, Sep 05, 2009 at 07:41:42PM -0400, Chris Barnes wrote:
> Is there any way to look at the statistics on table name, length and
> type over a period of time?

Only if you've captured them.  Various system information views provide
those statistics (or the server's best guesses about those statistics)
for the moment they're queried, but when those values are updated, the
old values are lost.

--
Joshua Tolley / eggyknap
End Point Corporation
http://www.endpoint.com
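Capturing them can be as simple as copying the relevant view into a history
table on a schedule, same trick as with pg_locks above.  A sketch against
pg_stat_user_tables (the history-table name is made up; columns as in 8.3):

    -- one-time setup
    CREATE TABLE stat_history AS
        SELECT now() AS captured_at, relname, seq_scan, idx_scan,
               n_tup_ins, n_tup_upd, n_tup_del, n_dead_tup
          FROM pg_stat_user_tables LIMIT 0;

    -- sample from cron every few minutes
    INSERT INTO stat_history
        SELECT now(), relname, seq_scan, idx_scan,
               n_tup_ins, n_tup_upd, n_tup_del, n_dead_tup
          FROM pg_stat_user_tables;

Diffing consecutive rows per relname then gives per-interval activity rather
than the raw lifetime counters.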
OK, looks like the time window is exactly when we run vacuum.  That has
been running now for a couple of months with no problem, but the last 2
weekends we've been doing massive data loads which could be complicating
things.

Is vacuum a good candidate for what could be locking up the tables?

Here is the script that is running:

---snip---
    #Initialize the variables
    yesterday=`date -d "yesterday" +%Y-%m-%d`

    for db in `psql -t -A -d template1 -c "select datname from pg_database"`; do
        /opt/PostgreSQL/8.3/bin/vacuumdb --analyze --full $db
    done

    #Mail vacuum log to admin group
    mail -s " Vacuum log on pgprd01 " admin@example.com < /home/postgres/vacuum/vacuum.log

    #Move vacuum.log to the yesterday folder
    #(braces needed: $yesterday_vacuum would be read as one variable name)
    mkdir /home/postgres/vacuum/$yesterday/
    mv /home/postgres/vacuum/vacuum.log /home/postgres/vacuum/$yesterday/${yesterday}_vacuum.log

--
“Don't eat anything you've ever seen advertised on TV”
   - Michael Pollan, author of "In Defense of Food"
Alan McKay wrote:
> OK, looks like the time window is exactly when we run vacuum.  That
> has been running now for a couple of months with no problem, but the
> last 2 weekends we've been doing massive data loads which could be
> complicating things.
>
> Is vacuum a good candidate for what could be locking up the tables?

Vacuum full, which is what your script is using, locks tables.  But
it's not recommended.  You should be looking at an alternative
vacuuming strategy that does not involve vacuum full at all.  Perhaps
autovacuum, properly tuned, is a better solution.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
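To check whether autovacuum is keeping up once the nightly vacuum full loop
is dropped, the per-table counters and timestamps in pg_stat_user_tables are
a reasonable place to start.  A sketch, using the column names as they exist
in 8.3:

    SELECT relname,
           last_vacuum,        -- last manual VACUUM
           last_autovacuum,    -- last time autovacuum processed the table
           n_dead_tup          -- dead rows waiting to be reclaimed
      FROM pg_stat_user_tables
     ORDER BY n_dead_tup DESC  -- most bloated tables first
     LIMIT 20;

Tables whose n_dead_tup keeps climbing with no recent last_autovacuum are
the candidates for more aggressive autovacuum settings.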