Thread: Query stucked in pg_stat_activity
Hi all, I have a postgres system where we just migrated a fairly big data set. The application accessing it is a cluster of servers which do burst-like processing, i.e. when they have some work to do, it will be distributed in the cluster and the data base will be under fairly high load. On our first test run everything went fine, the only strange thing is a row in the pg_stat_activity, which has a row about a query which is long gone, the process pointed by the procpid field is not existing. I've executed a "select pg_stat_reset();" as superuser, and all went away except the offending row... I wonder what can cause this behavior, and how to get rid of it ? I'm using the pg_stat_activity view to detect long running queries, and I do expect a few more to pop up in time... Thanks, Csaba.
Oh, I've forgot to mention that the DB is version 8.0.3 running on linux. Cheers, Csaba. On Tue, 2005-08-09 at 15:19, Csaba Nagy wrote: > Hi all, > > I have a postgres system where we just migrated a fairly big data set. > The application accessing it is a cluster of servers which do burst-like > processing, i.e. when they have some work to do, it will be distributed > in the cluster and the data base will be under fairly high load. > On our first test run everything went fine, the only strange thing is a > row in the pg_stat_activity, which has a row about a query which is long > gone, the process pointed by the procpid field is not existing. > I've executed a "select pg_stat_reset();" as superuser, and all went > away except the offending row... I wonder what can cause this behavior, > and how to get rid of it ? I'm using the pg_stat_activity view to detect > long running queries, and I do expect a few more to pop up in time... > > Thanks, > Csaba. > > > > > ---------------------------(end of broadcast)--------------------------- > TIP 1: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly
On Tue, Aug 09, 2005 at 03:19:46PM +0200, Csaba Nagy wrote: > I have a postgres system where we just migrated a fairly big data set. > The application accessing it is a cluster of servers which do burst-like > processing, i.e. when they have some work to do, it will be distributed > in the cluster and the data base will be under fairly high load. > On our first test run everything went fine, the only strange thing is a > row in the pg_stat_activity, which has a row about a query which is long > gone, the process pointed by the procpid field is not existing. I ran across this situation a while ago, where high load caused pg_stat_activity to have stale entries. Tom Lane wondered if the stats subsystem was under a high enough load that it was dropping messages, as it's designed to do. http://archives.postgresql.org/pgsql-bugs/2004-10/msg00163.php -- Michael Fuhr
Michael, I've read the message you referred, and it's probably what happens. In fact the original row I've complained about is gone, and I have now 10 other dead processes listed in pg_stat_activity... one of the queries is a "<BIND>", still running after 25 minutes, and the associated process is gone, so it's clearly an inconsistent state of the stats collector. I wonder if there's a way to fix that without too much affecting performance ? The logs don't show the "statistics buffer is full" message as suggested by Tom, but ITOH "log_min_messages = info", and that message might be a debug level one. In any case it seems my system can readily reproduce the issue whenever I place a bigger load on it... Cheers, Csaba. On Tue, 2005-08-09 at 15:51, Michael Fuhr wrote: > On Tue, Aug 09, 2005 at 03:19:46PM +0200, Csaba Nagy wrote: > > I have a postgres system where we just migrated a fairly big data set. > > The application accessing it is a cluster of servers which do burst-like > > processing, i.e. when they have some work to do, it will be distributed > > in the cluster and the data base will be under fairly high load. > > On our first test run everything went fine, the only strange thing is a > > row in the pg_stat_activity, which has a row about a query which is long > > gone, the process pointed by the procpid field is not existing. > > I ran across this situation a while ago, where high load caused > pg_stat_activity to have stale entries. Tom Lane wondered if the > stats subsystem was under a high enough load that it was dropping > messages, as it's designed to do. > > http://archives.postgresql.org/pgsql-bugs/2004-10/msg00163.php
On Tue, Aug 09, 2005 at 04:25:30PM +0200, Csaba Nagy wrote: > The logs don't show the "statistics buffer is full" message as suggested > by Tom, but ITOH "log_min_messages = info", and that message might be a > debug level one. The message is in src/backend/postmaster/pgstat.c: if (!overflow) { ereport(LOG, (errmsg("statistics buffer is full"))); overflow = true; } For log_min_messages, LOG is just above FATAL and PANIC, so I'd expect those messages to appear in the logs if they're happening. But I don't recall seeing them either. > In any case it seems my system can readily reproduce the issue whenever > I place a bigger load on it... I was also able to reproduce the behavior when running pgbench with sufficiently high settings. -- Michael Fuhr
[snip] > > I've executed a "select pg_stat_reset();" as superuser, and all went > > away except the offending row... > > That only resets the I/O counts (and only for one database), not the > backend activity info. > > regards, tom lane This reminds me I've forgot to ask, is there any other way of getting rid of those ghost entries than via big load ? The next big load will leave another set of ghosts behind it... Cheers, Csaba.
Csaba Nagy <nagy@ecircle-ag.com> writes: > On our first test run everything went fine, the only strange thing is a > row in the pg_stat_activity, which has a row about a query which is long > gone, the process pointed by the procpid field is not existing. This is not totally surprising, since the pgstat mechanism is by design not 100% reliable (it will drop statistics messages under high load rather than making backends wait for the stats collector). Probably the backend-is-exiting message for that process got dropped. Eventually that backend slot will get re-used for another backend, and then the entry will go away, but if it's a high-number slot then it'll take a similar load level to get to it. > I've executed a "select pg_stat_reset();" as superuser, and all went > away except the offending row... That only resets the I/O counts (and only for one database), not the backend activity info. regards, tom lane
Csaba Nagy <nagy@ecircle-ag.com> writes: >>> I've executed a "select pg_stat_reset();" as superuser, and all went >>> away except the offending row... >> >> That only resets the I/O counts (and only for one database), not the >> backend activity info. > This reminds me I've forgot to ask, is there any other way of getting > rid of those ghost entries than via big load ? Not at the moment. It might be worth teaching the pgstats code to cross-check the activity list every so often, but the only place where it'd really fit naturally is vacuum_tabstats which is probably not executed often enough to be helpful. Or maybe we could just filter the data on the reading side: ignore anything the stats collector reports that doesn't correspond to a live backend according to the PGPROC array. Jan, any thoughts? regards, tom lane
On 8/9/2005 12:21 PM, Tom Lane wrote: > Csaba Nagy <nagy@ecircle-ag.com> writes: >>>> I've executed a "select pg_stat_reset();" as superuser, and all went >>>> away except the offending row... >>> >>> That only resets the I/O counts (and only for one database), not the >>> backend activity info. > >> This reminds me I've forgot to ask, is there any other way of getting >> rid of those ghost entries than via big load ? > > Not at the moment. It might be worth teaching the pgstats code to > cross-check the activity list every so often, but the only place > where it'd really fit naturally is vacuum_tabstats which is probably > not executed often enough to be helpful. > > Or maybe we could just filter the data on the reading side: ignore > anything the stats collector reports that doesn't correspond to a > live backend according to the PGPROC array. > > Jan, any thoughts? The reset call is supposed to throw away everything. If it leaves crap behind, I'd call that a bug. IIRC the pg_stat functions don't examine the shared memory, but rely entirely on information from the stats file. It sure would be possible to add something there that checks the PGPROC array. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
Jan Wieck <JanWieck@Yahoo.com> writes: >> Jan, any thoughts? > The reset call is supposed to throw away everything. If it leaves crap > behind, I'd call that a bug. resetcounters only thinks it is supposed to zero the counters for the current database. That seems considerably different from "throw away everything". In any case, with autovacuum coming up fast on the outside it doesn't seem that we want to encourage people to reset the stats on a routine basis. > IIRC the pg_stat functions don't examine the shared memory, but rely > entirely on information from the stats file. It sure would be possible > to add something there that checks the PGPROC array. Yeah. I'll take a look at doing this. regards, tom lane
Jan Wieck wrote: > On 8/9/2005 12:21 PM, Tom Lane wrote: > >>> This reminds me I've forgot to ask, is there any other way of getting >>> rid of those ghost entries than via big load ? >> >> >> Not at the moment. It might be worth teaching the pgstats code to >> cross-check the activity list every so often, but the only place >> where it'd really fit naturally is vacuum_tabstats which is probably >> not executed often enough to be helpful. >> >> Or maybe we could just filter the data on the reading side: ignore >> anything the stats collector reports that doesn't correspond to a >> live backend according to the PGPROC array. >> >> Jan, any thoughts? > > > The reset call is supposed to throw away everything. If it leaves crap > behind, I'd call that a bug. > > IIRC the pg_stat functions don't examine the shared memory, but rely > entirely on information from the stats file. It sure would be possible > to add something there that checks the PGPROC array. Is that the same stats reset that effects autovacuum?