Thread: Query stucked in pg_stat_activity

Query stucked in pg_stat_activity

From
Csaba Nagy
Date:
Hi all,

I have a postgres system where we just migrated a fairly big data set.
The application accessing it is a cluster of servers which do burst-like
processing, i.e. when they have some work to do, it will be distributed
in the cluster and the data base will be under fairly high load.
On our first test run everything went fine, the only strange thing is a
row in the pg_stat_activity, which has a row about a query which is long
gone, the process pointed by the procpid field is not existing.
I've executed a "select pg_stat_reset();" as superuser, and all went
away except the offending row... I wonder what can cause this behavior,
and how to get rid of it ? I'm using the pg_stat_activity view to detect
long running queries, and I do expect a few more to pop up in time...

Thanks,
Csaba.




Re: Query stucked in pg_stat_activity

From
Csaba Nagy
Date:
Oh, I've forgot to mention that the DB is version 8.0.3 running on
linux.

Cheers,
Csaba.


On Tue, 2005-08-09 at 15:19, Csaba Nagy wrote:
> Hi all,
>
> I have a postgres system where we just migrated a fairly big data set.
> The application accessing it is a cluster of servers which do burst-like
> processing, i.e. when they have some work to do, it will be distributed
> in the cluster and the data base will be under fairly high load.
> On our first test run everything went fine, the only strange thing is a
> row in the pg_stat_activity, which has a row about a query which is long
> gone, the process pointed by the procpid field is not existing.
> I've executed a "select pg_stat_reset();" as superuser, and all went
> away except the offending row... I wonder what can cause this behavior,
> and how to get rid of it ? I'm using the pg_stat_activity view to detect
> long running queries, and I do expect a few more to pop up in time...
>
> Thanks,
> Csaba.
>
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: if posting/reading through Usenet, please send an appropriate
>        subscribe-nomail command to majordomo@postgresql.org so that your
>        message can get through to the mailing list cleanly


Re: Query stucked in pg_stat_activity

From
Michael Fuhr
Date:
On Tue, Aug 09, 2005 at 03:19:46PM +0200, Csaba Nagy wrote:
> I have a postgres system where we just migrated a fairly big data set.
> The application accessing it is a cluster of servers which do burst-like
> processing, i.e. when they have some work to do, it will be distributed
> in the cluster and the data base will be under fairly high load.
> On our first test run everything went fine, the only strange thing is a
> row in the pg_stat_activity, which has a row about a query which is long
> gone, the process pointed by the procpid field is not existing.

I ran across this situation a while ago, where high load caused
pg_stat_activity to have stale entries.  Tom Lane wondered if the
stats subsystem was under a high enough load that it was dropping
messages, as it's designed to do.

http://archives.postgresql.org/pgsql-bugs/2004-10/msg00163.php

--
Michael Fuhr

Re: Query stucked in pg_stat_activity

From
Csaba Nagy
Date:
Michael,

I've read the message you referred, and it's probably what happens.
In fact the original row I've complained about is gone, and I have now
10 other dead processes listed in pg_stat_activity... one of the queries
is a "<BIND>", still running after 25 minutes, and the associated
process is gone, so it's clearly an inconsistent state of the stats
collector. I wonder if there's a way to fix that without too much
affecting performance ?
The logs don't show the "statistics buffer is full" message as suggested
by Tom, but ITOH "log_min_messages = info", and that message might be a
debug level one.

In any case it seems my system can readily reproduce the issue whenever
I place a bigger load on it...

Cheers,
Csaba.



On Tue, 2005-08-09 at 15:51, Michael Fuhr wrote:
> On Tue, Aug 09, 2005 at 03:19:46PM +0200, Csaba Nagy wrote:
> > I have a postgres system where we just migrated a fairly big data set.
> > The application accessing it is a cluster of servers which do burst-like
> > processing, i.e. when they have some work to do, it will be distributed
> > in the cluster and the data base will be under fairly high load.
> > On our first test run everything went fine, the only strange thing is a
> > row in the pg_stat_activity, which has a row about a query which is long
> > gone, the process pointed by the procpid field is not existing.
>
> I ran across this situation a while ago, where high load caused
> pg_stat_activity to have stale entries.  Tom Lane wondered if the
> stats subsystem was under a high enough load that it was dropping
> messages, as it's designed to do.
>
> http://archives.postgresql.org/pgsql-bugs/2004-10/msg00163.php


Re: Query stucked in pg_stat_activity

From
Michael Fuhr
Date:
On Tue, Aug 09, 2005 at 04:25:30PM +0200, Csaba Nagy wrote:
> The logs don't show the "statistics buffer is full" message as suggested
> by Tom, but ITOH "log_min_messages = info", and that message might be a
> debug level one.

The message is in src/backend/postmaster/pgstat.c:

    if (!overflow)
    {
        ereport(LOG,
                (errmsg("statistics buffer is full")));
        overflow = true;
    }

For log_min_messages, LOG is just above FATAL and PANIC, so I'd
expect those messages to appear in the logs if they're happening.
But I don't recall seeing them either.

> In any case it seems my system can readily reproduce the issue whenever
> I place a bigger load on it...

I was also able to reproduce the behavior when running pgbench with
sufficiently high settings.

--
Michael Fuhr

Re: Query stucked in pg_stat_activity

From
Csaba Nagy
Date:
[snip]
> > I've executed a "select pg_stat_reset();" as superuser, and all went
> > away except the offending row...
>
> That only resets the I/O counts (and only for one database), not the
> backend activity info.
>
>             regards, tom lane

This reminds me I've forgot to ask, is there any other way of getting
rid of those ghost entries than via big load ? The next big load will
leave another set of ghosts behind it...

Cheers,
Csaba.


Re: Query stucked in pg_stat_activity

From
Tom Lane
Date:
Csaba Nagy <nagy@ecircle-ag.com> writes:
> On our first test run everything went fine, the only strange thing is a
> row in the pg_stat_activity, which has a row about a query which is long
> gone, the process pointed by the procpid field is not existing.

This is not totally surprising, since the pgstat mechanism is by design
not 100% reliable (it will drop statistics messages under high load
rather than making backends wait for the stats collector).  Probably the
backend-is-exiting message for that process got dropped.

Eventually that backend slot will get re-used for another backend, and
then the entry will go away, but if it's a high-number slot then it'll
take a similar load level to get to it.

> I've executed a "select pg_stat_reset();" as superuser, and all went
> away except the offending row...

That only resets the I/O counts (and only for one database), not the
backend activity info.

            regards, tom lane

Re: Query stucked in pg_stat_activity

From
Tom Lane
Date:
Csaba Nagy <nagy@ecircle-ag.com> writes:
>>> I've executed a "select pg_stat_reset();" as superuser, and all went
>>> away except the offending row...
>>
>> That only resets the I/O counts (and only for one database), not the
>> backend activity info.

> This reminds me I've forgot to ask, is there any other way of getting
> rid of those ghost entries than via big load ?

Not at the moment.  It might be worth teaching the pgstats code to
cross-check the activity list every so often, but the only place
where it'd really fit naturally is vacuum_tabstats which is probably
not executed often enough to be helpful.

Or maybe we could just filter the data on the reading side: ignore
anything the stats collector reports that doesn't correspond to a
live backend according to the PGPROC array.

Jan, any thoughts?

            regards, tom lane

Re: Query stucked in pg_stat_activity

From
Jan Wieck
Date:
On 8/9/2005 12:21 PM, Tom Lane wrote:

> Csaba Nagy <nagy@ecircle-ag.com> writes:
>>>> I've executed a "select pg_stat_reset();" as superuser, and all went
>>>> away except the offending row...
>>>
>>> That only resets the I/O counts (and only for one database), not the
>>> backend activity info.
>
>> This reminds me I've forgot to ask, is there any other way of getting
>> rid of those ghost entries than via big load ?
>
> Not at the moment.  It might be worth teaching the pgstats code to
> cross-check the activity list every so often, but the only place
> where it'd really fit naturally is vacuum_tabstats which is probably
> not executed often enough to be helpful.
>
> Or maybe we could just filter the data on the reading side: ignore
> anything the stats collector reports that doesn't correspond to a
> live backend according to the PGPROC array.
>
> Jan, any thoughts?

The reset call is supposed to throw away everything. If it leaves crap
behind, I'd call that a bug.

IIRC the pg_stat functions don't examine the shared memory, but rely
entirely on information from the stats file. It sure would be possible
to add something there that checks the PGPROC array.


Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #

Re: Query stucked in pg_stat_activity

From
Tom Lane
Date:
Jan Wieck <JanWieck@Yahoo.com> writes:
>> Jan, any thoughts?

> The reset call is supposed to throw away everything. If it leaves crap
> behind, I'd call that a bug.

resetcounters only thinks it is supposed to zero the counters for the
current database.  That seems considerably different from "throw away
everything".  In any case, with autovacuum coming up fast on the outside
it doesn't seem that we want to encourage people to reset the stats on
a routine basis.

> IIRC the pg_stat functions don't examine the shared memory, but rely
> entirely on information from the stats file. It sure would be possible
> to add something there that checks the PGPROC array.

Yeah.  I'll take a look at doing this.

            regards, tom lane

Re: Query stucked in pg_stat_activity

From
"Matthew T. O'Connor"
Date:
Jan Wieck wrote:

> On 8/9/2005 12:21 PM, Tom Lane wrote:
>
>>> This reminds me I've forgot to ask, is there any other way of getting
>>> rid of those ghost entries than via big load ?
>>
>>
>> Not at the moment.  It might be worth teaching the pgstats code to
>> cross-check the activity list every so often, but the only place
>> where it'd really fit naturally is vacuum_tabstats which is probably
>> not executed often enough to be helpful.
>>
>> Or maybe we could just filter the data on the reading side: ignore
>> anything the stats collector reports that doesn't correspond to a
>> live backend according to the PGPROC array.
>>
>> Jan, any thoughts?
>
>
> The reset call is supposed to throw away everything. If it leaves crap
> behind, I'd call that a bug.
>
> IIRC the pg_stat functions don't examine the shared memory, but rely
> entirely on information from the stats file. It sure would be possible
> to add something there that checks the PGPROC array.


Is that the same stats reset that effects autovacuum?