Thread: BUG #10123: Weird entries in pg_stat_activity

BUG #10123: Weird entries in pg_stat_activity

From
maxim.boguk@gmail.com
Date:
The following bug has been logged on the website:

Bug reference:      10123
Logged by:          Maxim Boguk
Email address:      maxim.boguk@gmail.com
PostgreSQL version: 9.2.8
Operating system:   Linux
Description:

Hi everyone,

During debug of the weird nagios database alerts about long transactions I
found that sometime pg_stat_activity contains very weird entries with
xact_start NOT NULL and state='idle'... what's more they happened during
backend idle state.

To check what's going on I performed the following actions on production
database:
1)set log_min_duration_statement=0 to see ALL database queries
2)run single line bash script to check for weird entries:
while true; do  psql -x -X -c "select now() as ts,now()-xact_start as
tx_age, now()-state_change as change_age,* from pg_stat_activity where
state='idle' and xact_start is not null" |  grep -vE '^$|No rows'; sleep
0.1; done

And yes, once-twice-trice per minute script output some interesting results
such as:


-[ RECORD 1

]----+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
ts               | 2014-04-24 13:49:03.498354+04
tx_age           | 00:04:01.283106
change_age       | 00:04:01.279373
datid            | 16385
datname          | *******
pid              | 685
usesysid         | 16384
usename          | *******
application_name |
client_addr      | 127.0.0.1
client_hostname  |
client_port      | 37461
backend_start    | 2014-04-24 13:30:03.275711+04
xact_start       | 2014-04-24 13:45:02.215248+04
query_start      | 2014-04-24 13:45:02.215248+04
state_change     | 2014-04-24 13:45:02.218981+04
waiting          | f
state            | idle
query            |
                 |                     SELECT cache.save_response(
                 |                      *****
                 |                     );


State=idle
xact_start=4 min ago

Now what I found in the full database log for this pid:

...

2014-04-24 13:45:02 MSK 685 ****@**** from 127.0.0.1 [vxid:24/0 txid:0]
[SELECT] LOG:  duration: 3.703 ms  statement:
                            SELECT cache.save_response(...)

-- no queries between 13:45 and 13:50 at all!

2014-04-24 13:50:01 MSK 685 ****@**** from 127.0.0.1 [vxid:24/0 txid:0]
[SELECT] LOG:  duration: 0.962 ms  statement: select 1
2014-04-24 13:50:01 MSK 685 ****@**** from 127.0.0.1 [vxid:24/0 txid:0]
[SELECT] LOG:  duration: 0.415 ms  statement: SELECT * FROM o_queries_queue
...

So it seems that sometime pg_stat_activity entry become alive with the old
information (it shows xact_start from the last executed query in backend,
even if that query finished long time ago).
I know about asynchronous behavior of pg_stat_activity but getting
xact_start from deep past seems wrong for me.

What's more: i tested that problem on another databases - and found they
have same effect (so it doesn't look like a problem local to that single
exact database).

Kind Regards,
Maksym

Re: BUG #10123: Weird entries in pg_stat_activity

From
Magnus Hagander
Date:
On Thu, Apr 24, 2014 at 12:18 PM, <maxim.boguk@gmail.com> wrote:

> The following bug has been logged on the website:
>
> Bug reference:      10123
> Logged by:          Maxim Boguk
> Email address:      maxim.boguk@gmail.com
> PostgreSQL version: 9.2.8
> Operating system:   Linux
> Description:
>
> Hi everyone,
>
> During debug of the weird nagios database alerts about long transactions =
I
> found that sometime pg_stat_activity contains very weird entries with
> xact_start NOT NULL and state=3D'idle'... what's more they happened durin=
g
> backend idle state.
>
> To check what's going on I performed the following actions on production
> database:
> 1)set log_min_duration_statement=3D0 to see ALL database queries
> 2)run single line bash script to check for weird entries:
> while true; do  psql -x -X -c "select now() as ts,now()-xact_start as
> tx_age, now()-state_change as change_age,* from pg_stat_activity where
> state=3D'idle' and xact_start is not null" |  grep -vE '^$|No rows'; slee=
p
> 0.1; done
>
> And yes, once-twice-trice per minute script output some interesting resul=
ts
> such as:
>
>
> -[ RECORD 1
>
> ]----+-------------------------------------------------------------------=
---------------------------------------------------------------------------=
---------------------------------------------------------------------------=
---------------------------------------------------
> ts               | 2014-04-24 13:49:03.498354+04
> tx_age           | 00:04:01.283106
> change_age       | 00:04:01.279373
> datid            | 16385
> datname          | *******
> pid              | 685
> usesysid         | 16384
> usename          | *******
> application_name |
> client_addr      | 127.0.0.1
> client_hostname  |
> client_port      | 37461
> backend_start    | 2014-04-24 13:30:03.275711+04
> xact_start       | 2014-04-24 13:45:02.215248+04
> query_start      | 2014-04-24 13:45:02.215248+04
> state_change     | 2014-04-24 13:45:02.218981+04
> waiting          | f
> state            | idle
> query            |
>                  |                     SELECT cache.save_response(
>                  |                      *****
>                  |                     );
>
>
> State=3Didle
> xact_start=3D4 min ago
>
> Now what I found in the full database log for this pid:
>
> ...
>
> 2014-04-24 13:45:02 MSK 685 ****@**** from 127.0.0.1 [vxid:24/0 txid:0]
> [SELECT] LOG:  duration: 3.703 ms  statement:
>                             SELECT cache.save_response(...)
>
> -- no queries between 13:45 and 13:50 at all!
>
> 2014-04-24 13:50:01 MSK 685 ****@**** from 127.0.0.1 [vxid:24/0 txid:0]
> [SELECT] LOG:  duration: 0.962 ms  statement: select 1
> 2014-04-24 13:50:01 MSK 685 ****@**** from 127.0.0.1 [vxid:24/0 txid:0]
> [SELECT] LOG:  duration: 0.415 ms  statement: SELECT * FROM o_queries_que=
ue
> ...
>
> So it seems that sometime pg_stat_activity entry become alive with the ol=
d
> information (it shows xact_start from the last executed query in backend,
> even if that query finished long time ago).
> I know about asynchronous behavior of pg_stat_activity but getting
> xact_start from deep past seems wrong for me.
>
> What's more: i tested that problem on another databases - and found they
> have same effect (so it doesn't look like a problem local to that single
> exact database).
>

This sounds like the expected behavior. When state=3DIdle, the query field
shows the value of the *last* query that ran. See
http://www.postgresql.org/docs/9.3/static/monitoring-stats.html#PG-STAT-ACT=
IVITY-VIEW
.

Note that this was changed in 9.2, and at the same time the current_query
field was renamed to query to highlight this change.

The fact that xact_start shows something far in the past just means that
there is an open transaction that has been open since then.

--=20
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

Re: BUG #10123: Weird entries in pg_stat_activity

From
Andres Freund
Date:
On 2014-04-24 12:26:09 +0200, Magnus Hagander wrote:
> On Thu, Apr 24, 2014 at 12:18 PM, <maxim.boguk@gmail.com> wrote:
> >
]----+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> > ts               | 2014-04-24 13:49:03.498354+04
> > tx_age           | 00:04:01.283106
> > change_age       | 00:04:01.279373
> > datid            | 16385
> > datname          | *******
> > pid              | 685
> > usesysid         | 16384
> > usename          | *******
> > application_name |
> > client_addr      | 127.0.0.1
> > client_hostname  |
> > client_port      | 37461
> > backend_start    | 2014-04-24 13:30:03.275711+04
> > xact_start       | 2014-04-24 13:45:02.215248+04
> > query_start      | 2014-04-24 13:45:02.215248+04
> > state_change     | 2014-04-24 13:45:02.218981+04
> > waiting          | f
> > state            | idle
> > query            |
> >                  |                     SELECT cache.save_response(
> >                  |                      *****
> >                  |                     );

> This sounds like the expected behavior. When state=Idle, the query field
> shows the value of the *last* query that ran. See
> http://www.postgresql.org/docs/9.3/static/monitoring-stats.html#PG-STAT-ACTIVITY-VIEW
> .
>
> Note that this was changed in 9.2, and at the same time the current_query
> field was renamed to query to highlight this change.
>
> The fact that xact_start shows something far in the past just means that
> there is an open transaction that has been open since then.

I don't think that's true. In that case the state should have been idle
in transaction, no?

I think there's a pgstat_report_xact_timestamp(0) missing from
xact.c:PrepareTransaction(). There's no point in reporting a prepared
xact's starttime in a backend that's not associated with it anymore.

Maxim, are you using prepared transactions?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: BUG #10123: Weird entries in pg_stat_activity

From
Magnus Hagander
Date:
On Thu, Apr 24, 2014 at 12:44 PM, Andres Freund <andres@2ndquadrant.com>wro=
te:

> On 2014-04-24 12:26:09 +0200, Magnus Hagander wrote:
> > On Thu, Apr 24, 2014 at 12:18 PM, <maxim.boguk@gmail.com> wrote:
> > >
> ]----+-------------------------------------------------------------------=
---------------------------------------------------------------------------=
---------------------------------------------------------------------------=
---------------------------------------------------
> > > ts               | 2014-04-24 13:49:03.498354+04
> > > tx_age           | 00:04:01.283106
> > > change_age       | 00:04:01.279373
> > > datid            | 16385
> > > datname          | *******
> > > pid              | 685
> > > usesysid         | 16384
> > > usename          | *******
> > > application_name |
> > > client_addr      | 127.0.0.1
> > > client_hostname  |
> > > client_port      | 37461
> > > backend_start    | 2014-04-24 13:30:03.275711+04
> > > xact_start       | 2014-04-24 13:45:02.215248+04
> > > query_start      | 2014-04-24 13:45:02.215248+04
> > > state_change     | 2014-04-24 13:45:02.218981+04
> > > waiting          | f
> > > state            | idle
> > > query            |
> > >                  |                     SELECT cache.save_response(
> > >                  |                      *****
> > >                  |                     );
>
> > This sounds like the expected behavior. When state=3DIdle, the query fi=
eld
> > shows the value of the *last* query that ran. See
> >
> http://www.postgresql.org/docs/9.3/static/monitoring-stats.html#PG-STAT-A=
CTIVITY-VIEW
> > .
> >
> > Note that this was changed in 9.2, and at the same time the current_que=
ry
> > field was renamed to query to highlight this change.
> >
> > The fact that xact_start shows something far in the past just means tha=
t
> > there is an open transaction that has been open since then.
>
> I don't think that's true. In that case the state should have been idle
> in transaction, no?
>

Oh, good point. Clearly I was too cofused on the "query" field. Sorry about
that, and thanks for spotting it!



I think there's a pgstat_report_xact_timestamp(0) missing from
> xact.c:PrepareTransaction(). There's no point in reporting a prepared
> xact's starttime in a backend that's not associated with it anymore.


> Maxim, are you using prepared transactions?
>

--=20
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

Re: BUG #10123: Weird entries in pg_stat_activity

From
Maxim Boguk
Date:
On Thu, Apr 24, 2014 at 8:44 PM, Andres Freund <andres@2ndquadrant.com>wrot=
e:

> On 2014-04-24 12:26:09 +0200, Magnus Hagander wrote:
> > On Thu, Apr 24, 2014 at 12:18 PM, <maxim.boguk@gmail.com> wrote:
> > >
> ]----+-------------------------------------------------------------------=
---------------------------------------------------------------------------=
---------------------------------------------------------------------------=
---------------------------------------------------
> > > ts               | 2014-04-24 13:49:03.498354+04
> > > tx_age           | 00:04:01.283106
> > > change_age       | 00:04:01.279373
> > > datid            | 16385
> > > datname          | *******
> > > pid              | 685
> > > usesysid         | 16384
> > > usename          | *******
> > > application_name |
> > > client_addr      | 127.0.0.1
> > > client_hostname  |
> > > client_port      | 37461
> > > backend_start    | 2014-04-24 13:30:03.275711+04
> > > xact_start       | 2014-04-24 13:45:02.215248+04
> > > query_start      | 2014-04-24 13:45:02.215248+04
> > > state_change     | 2014-04-24 13:45:02.218981+04
> > > waiting          | f
> > > state            | idle
> > > query            |
> > >                  |                     SELECT cache.save_response(
> > >                  |                      *****
> > >                  |                     );
>
> > This sounds like the expected behavior. When state=3DIdle, the query fi=
eld
> > shows the value of the *last* query that ran. See
> >
> http://www.postgresql.org/docs/9.3/static/monitoring-stats.html#PG-STAT-A=
CTIVITY-VIEW
> > .
> >
> > Note that this was changed in 9.2, and at the same time the current_que=
ry
> > field was renamed to query to highlight this change.
> >
> > The fact that xact_start shows something far in the past just means tha=
t
> > there is an open transaction that has been open since then.
>
> I don't think that's true. In that case the state should have been idle
> in transaction, no?
>
> I think there's a pgstat_report_xact_timestamp(0) missing from
> xact.c:PrepareTransaction(). There's no point in reporting a prepared
> xact's starttime in a backend that's not associated with it anymore.
>
> Maxim, are you using prepared transactions?
>
>
No, project isn't using prepared transactions. I absolutely sure.
postgres=3D# show max_prepared_transactions;
 max_prepared_transactions
---------------------------
 0


What strange - that exact pg_stat_activity entry stayed with xact_start
NULL for 4 minute after the last query for this pid finished, and after
that somehow got the live xact_start again (and in the next check iteration
after 0.1s xact_start become NULL again for that pid).
tx_age can vary, i seen anything between 1 second and 30 minutes... all
related to idle backends (so state=3D'idle' is correct).


Kind Regards,
Maksym

Re: BUG #10123: Weird entries in pg_stat_activity

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> I think there's a pgstat_report_xact_timestamp(0) missing from
> xact.c:PrepareTransaction(). There's no point in reporting a prepared
> xact's starttime in a backend that's not associated with it anymore.

> Maxim, are you using prepared transactions?

Even if he were, that wouldn't explain this report, because the displayed
latest-query would have to have been a PREPARE TRANSACTION, no?

However, regardless of exactly what's going on here, I think you're
probably right that PrepareTransaction ought to do
pgstat_report_xact_timestamp(0) once it exits the transaction state
locally.  The definition of that field is that it's null whenever the
session isn't inside a transaction, and once we do PREPARE, we aren't.

            regards, tom lane

Re: BUG #10123: Weird entries in pg_stat_activity

From
Andres Freund
Date:
On 2014-04-24 12:37:15 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > I think there's a pgstat_report_xact_timestamp(0) missing from
> > xact.c:PrepareTransaction(). There's no point in reporting a prepared
> > xact's starttime in a backend that's not associated with it anymore.
>
> > Maxim, are you using prepared transactions?
>
> Even if he were, that wouldn't explain this report, because the displayed
> latest-query would have to have been a PREPARE TRANSACTION, no?

Right. I think there's some other contents that could potentially end up
there, but surely not the SELECT Maxim is reporting. Especially not as
there seem to be several minutes inbetween the pg_stat_get_activity call
and the last state change...

> However, regardless of exactly what's going on here, I think you're
> probably right that PrepareTransaction ought to do
> pgstat_report_xact_timestamp(0) once it exits the transaction state
> locally.  The definition of that field is that it's null whenever the
> session isn't inside a transaction, and once we do PREPARE, we aren't.

Thanks for fixing it.

I don't really have a clue what's going on here. I don't believe
pgstat.c's changecount thing is 100% safe, but that doesn't explain the
problem here.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: BUG #10123: Weird entries in pg_stat_activity

From
Tom Lane
Date:
maxim.boguk@gmail.com writes:
> During debug of the weird nagios database alerts about long transactions I
> found that sometime pg_stat_activity contains very weird entries with
> xact_start NOT NULL and state='idle'...
> ...
> What's more: i tested that problem on another databases - and found they
> have same effect (so it doesn't look like a problem local to that single
> exact database).

So can you provide a self-contained test case for this?  I tried to
reproduce it here without much success, and I don't see a code path
other than PREPARE TRANSACTION that could lead to this.

            regards, tom lane

Re: BUG #10123: Weird entries in pg_stat_activity

From
Maxim Boguk
Date:
On Fri, Apr 25, 2014 at 4:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> maxim.boguk@gmail.com writes:
> > During debug of the weird nagios database alerts about long transactions
> I
> > found that sometime pg_stat_activity contains very weird entries with
> > xact_start NOT NULL and state='idle'...
> > ...
> > What's more: i tested that problem on another databases - and found they
> > have same effect (so it doesn't look like a problem local to that single
> > exact database).
>
> So can you provide a self-contained test case for this?  I tried to
> reproduce it here without much success, and I don't see a code path
> other than PREPARE TRANSACTION that could lead to this.
>
>                         regards, tom lane
>

I going to work on test case over weekend.
Initial quick and dirty attempts to reproduce outside of complicated
production environments provided no results yet.

What I almost sure now that reproducing of the bug requires a mix of very
active connections to the database (with a lot of requests per second) with
semi-idle connections.
And that transient problem affects only these semi-idle connections.

I will provide an update when I get more information.

Kind Regards,
Maksym

Re: BUG #10123: Weird entries in pg_stat_activity

From
Maxim Boguk
Date:
On Fri, Apr 25, 2014 at 11:04 AM, Maxim Boguk <maxim.boguk@gmail.com> wrote:

>
>
>
> On Fri, Apr 25, 2014 at 4:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
>> maxim.boguk@gmail.com writes:
>> > During debug of the weird nagios database alerts about long
>> transactions I
>> > found that sometime pg_stat_activity contains very weird entries with
>> > xact_start NOT NULL and state='idle'...
>> > ...
>> > What's more: i tested that problem on another databases - and found they
>> > have same effect (so it doesn't look like a problem local to that single
>> > exact database).
>>
>> So can you provide a self-contained test case for this?  I tried to
>> reproduce it here without much success, and I don't see a code path
>> other than PREPARE TRANSACTION that could lead to this.
>>
>>                         regards, tom lane
>>
>
> I going to work on test case over weekend.
> Initial quick and dirty attempts to reproduce outside of complicated
> production environments provided no results yet.
>
>  What I almost sure now that reproducing of the bug requires a mix of very
> active connections to the database (with a lot of requests per second) with
> semi-idle connections.
> And that transient problem affects only these semi-idle connections.
>
> I will provide an update when I get more information.
>
> Kind Regards,
> Maksym
>


Unfortunately even if problem can be seen in at least 10 different project
and db-servers, creating reproducible test case had not been successful.
So I decided work with gdb on the real database.

At this moment I dig only to the pgstat.c :
pgstat_read_current_status(void) function.

gdb:

set print pretty on
b pgstat.c:2796 if ((localentry->st_procpid>0) &&
(localentry->st_xact_start_timestamp>0) &&
(localentry->st_state==STATE_IDLE))
cont

2796 line in 9.3.4 source is pgstat_read_current_status function just after
                memcpy(localentry, (char *) beentry,
sizeof(PgBackendStatus));

After short time running:
select pg_sleep(0.01); select now() as ts,now()-xact_start as tx_age,
now()-state_change as change_age,* from pg_stat_activity where state='idle'
and xact_start is not null;
in loop in process under gdb I got conditional breakpoint fired with the
next results:

(gdb) cont
Continuing.

Breakpoint 1, pgstat_read_current_status () at
/tmp/buildd/postgresql-9.3-9.3.4/build/../src/backend/postmaster/pgstat.c:2796
2796                                    strcpy(localappname, (char *)
beentry->st_appname);
(gdb) p *localentry
$1 = {
  st_changecount = 1222944,
  st_procpid = 16215,
  st_proc_start_timestamp = 454324143169490,
  st_xact_start_timestamp = 454324206165363,
  st_activity_start_timestamp = 454324206165363,
  st_state_start_timestamp = 454324206165650,
  st_databaseid = 16423,
  st_userid = 25078444,
  st_clientaddr = {
    addr = {
      ss_family = 2,
      __ss_align = 0,
      __ss_padding = '\000' <repeats 111 times>
    },
    salen = 16
  },
  st_clienthostname = 0x7fac4344fbe0 "",
  st_waiting = 0 '\000',
  st_state = STATE_IDLE,
  st_appname = 0x7fac624d8ff0 "",
  st_activity = 0x7fac624e4bb0 "SELECT
\"last_host_hour_accumulation_flag\".\"id\",
\"last_host_hour_accumulation_flag\".\"day\",
\"last_host_hour_accumulation_flag\".\"hour\",
\"last_host_hour_accumulation_flag\".\"type_id\", \"last_host_hour_accum"...
}

Please note that the:
  st_xact_start_timestamp = 454324206165363,
  st_activity_start_timestamp = 454324206165363,

and in the same time:
st_state = STATE_IDLE

On the next iteration after 0.01 second   st_proc_start_timestamp and
st_activity_start_timestamp returned to 0 again.

Now I have no good idea how to perform the future debug because
BackendStatusArray is shared memory structure with every backend writing
information into it (so I can't set a conditional breakpoint in specific
process)
Ideas welcome.

Kind Regards,
Maksym

Re: [BUGS] BUG #10123: Weird entries in pg_stat_activity

From
Fujii Masao
Date:
On Thu, Apr 24, 2014 at 7:19 PM <maxim.boguk@gmail.com> wrote:
>
> The following bug has been logged on the website:
>
> Bug reference:      10123
> Logged by:          Maxim Boguk
> Email address:      maxim.boguk@gmail.com
> PostgreSQL version: 9.2.8
> Operating system:   Linux
> Description:
>
> Hi everyone,
>
> During debug of the weird nagios database alerts about long transactions I
> found that sometime pg_stat_activity contains very weird entries with
> xact_start NOT NULL and state='idle'... what's more they happened during
> backend idle state.
>
> To check what's going on I performed the following actions on production
> database:
> 1)set log_min_duration_statement=0 to see ALL database queries
> 2)run single line bash script to check for weird entries:
> while true; do  psql -x -X -c "select now() as ts,now()-xact_start as
> tx_age, now()-state_change as change_age,* from pg_stat_activity where
> state='idle' and xact_start is not null" |  grep -vE '^$|No rows'; sleep
> 0.1; done
>
> And yes, once-twice-trice per minute script output some interesting results
> such as:
>
>
> -[ RECORD 1
>
]----+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> ts               | 2014-04-24 13:49:03.498354+04
> tx_age           | 00:04:01.283106
> change_age       | 00:04:01.279373
> datid            | 16385
> datname          | *******
> pid              | 685
> usesysid         | 16384
> usename          | *******
> application_name |
> client_addr      | 127.0.0.1
> client_hostname  |
> client_port      | 37461
> backend_start    | 2014-04-24 13:30:03.275711+04
> xact_start       | 2014-04-24 13:45:02.215248+04
> query_start      | 2014-04-24 13:45:02.215248+04
> state_change     | 2014-04-24 13:45:02.218981+04
> waiting          | f
> state            | idle
> query            |
>                  |                     SELECT cache.save_response(
>                  |                      *****
>                  |                     );
>
>
> State=idle
> xact_start=4 min ago
>
> Now what I found in the full database log for this pid:
>
> ...
>
> 2014-04-24 13:45:02 MSK 685 ****@**** from 127.0.0.1 [vxid:24/0 txid:0]
> [SELECT] LOG:  duration: 3.703 ms  statement:
>                             SELECT cache.save_response(...)
>
> -- no queries between 13:45 and 13:50 at all!
>
> 2014-04-24 13:50:01 MSK 685 ****@**** from 127.0.0.1 [vxid:24/0 txid:0]
> [SELECT] LOG:  duration: 0.962 ms  statement: select 1
> 2014-04-24 13:50:01 MSK 685 ****@**** from 127.0.0.1 [vxid:24/0 txid:0]
> [SELECT] LOG:  duration: 0.415 ms  statement: SELECT * FROM o_queries_queue
> ...
>
> So it seems that sometime pg_stat_activity entry become alive with the old
> information (it shows xact_start from the last executed query in backend,
> even if that query finished long time ago).
> I know about asynchronous behavior of pg_stat_activity but getting
> xact_start from deep past seems wrong for me.
>
> What's more: i tested that problem on another databases - and found they
> have same effect (so it doesn't look like a problem local to that single
> exact database).

Sorry for replying to a very old thread.

My customer also observed weird entries in pg_stat_activity, with state="idle"
and xact_start showing an old timestamp. I figured out this issue occurs
if catchup interrupts arrive in "idle" state and
ProcessCatchupInterrupt() is called.

ProcessCatchupInterrupt() starts and completes the transaction, as follows,
without changing the state (it remains "idle"). However,
StartTransactionCommand() sets xact_start to the timestamp of the last query
executed, not the current timestamp. If pg_stat_activity is read during
this brief transaction, the weird entries (state="idle" and
xact_start="old timestamp")
can appear.

        else
        {
                elog(DEBUG4, "ProcessCatchupEvent outside transaction");
                StartTransactionCommand();
                CommitTransactionCommand();
        }

Similarly, RemoveTempRelationsCallback() can cause the same issue.
When a backend exits and RemoveTempRelationsCallback() is called,
it also starts and completes a transaction, as follows, without changing
the state. This sets xact_start to the timestamp of the last query
executed, too.

        if (OidIsValid(myTempNamespace)) /* should always be true */
        {
                /* Need to ensure we have a usable transaction. */
                AbortOutOfAnyTransaction();
                StartTransactionCommand();
                PushActiveSnapshot(GetTransactionSnapshot());

                RemoveTempRelations(myTempNamespace);

                PopActiveSnapshot();
                CommitTransactionCommand();
        }

Additionally, exec_describe_statement_message() and
exec_describe_portal_message() can cause pg_stat_activity records
with state="idle" but xact_start!=NULL. However, these cases call
SetCurrentStatementStartTimestamp(), setting xact_start to
the current timestamp instead of an old one.

I'm not sure if these are bugs. However, to prevent confusion from
these odd pg_stat_activity entries, I suggest the following changes:

        1. Modify ProcessCatchupInterrupt() and RemoveTempRelationsCallback()
            to reset stmtStartTimestamp (timestamp of the last query
executed) to 0
            before starting the transaction. This way, during the transaction,
            records with state="idle" and xact_start=NULL would be observed.

        2. Update exec_describe_statement_message() and
exec_describe_portal_message()
            to change the state to "active," similar to exec_simple_query().
            This would result in records with state="active" and
xact_start="current timestamp"
            during those operations.

Thoughts?

Regards,

--
Fujii Masao



Re: [BUGS] BUG #10123: Weird entries in pg_stat_activity

From
Fujii Masao
Date:

On 2024/07/04 0:14, Fujii Masao wrote:
> I'm not sure if these are bugs. However, to prevent confusion from
> these odd pg_stat_activity entries, I suggest the following changes:
> 
>          1. Modify ProcessCatchupInterrupt() and RemoveTempRelationsCallback()
>              to reset stmtStartTimestamp (timestamp of the last query
> executed) to 0
>              before starting the transaction. This way, during the transaction,
>              records with state="idle" and xact_start=NULL would be observed.
> 
>          2. Update exec_describe_statement_message() and
> exec_describe_portal_message()
>              to change the state to "active," similar to exec_simple_query().
>              This would result in records with state="active" and
> xact_start="current timestamp"
>              during those operations.

Attached patch does these things. I will register it into the next CF.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Attachment

Re: [BUGS] BUG #10123: Weird entries in pg_stat_activity

From
Andres Freund
Date:
Hi,

On 2024-07-04 14:54:01 +0900, Fujii Masao wrote:
> @@ -515,6 +515,17 @@ ProcessClientReadInterrupt(bool blocked)
>          /* Check for general interrupts that arrived before/while reading */
>          CHECK_FOR_INTERRUPTS();
>  
> +        /*
> +         * Reset statement_timestamp() to 0 so that ProcessCatchupInterrupt()
> +         * and ProcessNotifyInterrupt() will set pg_stat_activity.xact_start
> +         * to NULL when they start new transaction. Otherwise, xact_start
> +         * could be set to statement_timestamp() that may indicate the
> +         * timestamp of the last query executed, while the backend's state
> +         * remains "idle," leading to incorrect "idle" entries with non-NULL
> +         * xact_start in pg_stat_activity.
> +         */
> +        ResetCurrentStatementStartTimestamp();
> +
>          /* Process sinval catchup interrupts, if any */
>          if (catchupInterruptPending)
>              ProcessCatchupInterrupt();

This seems odd to me. ProcessClientReadInterrupt() can be called a lot, why do
we need to reset the statement start timestamp even if nothing is happening here?


> @@ -2598,6 +2609,8 @@ exec_describe_statement_message(const char *stmt_name)
>  {
>      CachedPlanSource *psrc;
>  
> +    pgstat_report_activity(STATE_RUNNING, NULL);
> +
>      /*
>       * Start up a transaction command. (Note that this will normally change
>       * current memory context.) Nothing happens if we are already in one.
> @@ -2692,6 +2705,8 @@ exec_describe_portal_message(const char *portal_name)
>  {
>      Portal        portal;
>  
> +    pgstat_report_activity(STATE_RUNNING, NULL);
> +
>      /*
>       * Start up a transaction command. (Note that this will normally change
>       * current memory context.) Nothing happens if we are already in one.

Calls to pgstat_report_activity() aren't exactly free. This substantially
increases the number of calls to it for common workloads. There got to be a
more targeted way of dealing with this.

Greetings,

Andres Freund



Re: [BUGS] BUG #10123: Weird entries in pg_stat_activity

From
Fujii Masao
Date:

On 2024/07/05 7:03, Andres Freund wrote:
> Calls to pgstat_report_activity() aren't exactly free. This substantially
> increases the number of calls to it for common workloads. There got to be a
> more targeted way of dealing with this.

Yes. So, as an alternative approach, how about setting xact_start and
query_start to NULL whenever the state is idle, since non-NULL values
aren't expected in that case? This would prevent the weired pg_stat_activity
entries I mentioned earlier, with minimal performance impact.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: [BUGS] BUG #10123: Weird entries in pg_stat_activity

From
Maxim Orlov
Date:

On Fri, 22 Nov 2024 at 17:30, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

Patch attached.
+1 I think, this is much better approach than try to reset timestamp.

             * Don't expose transaction time for walsenders; it confuses
             * monitoring, particularly because we don't keep the time up-to-
             * date.
+            *
+            * Also, don't show transaction time for backends in the "idle"
+            * state. There are cases, like during "Describe" message
+            * handling, removing temporary relations at exit, or processing
+            * client read interrupts, where the backend remains "idle" but
+            * still sets transaction time. This can lead to incorrect "idle"
+            * entries with non-NULL transaction times in pg_stat_activity. To
+            * prevent these misleading entries, avoid exposing transaction
+            * time for idle backends.
             */
            if (beentry->st_xact_start_timestamp != 0 &&
-               beentry->st_backendType != B_WAL_SENDER)
+               beentry->st_backendType != B_WAL_SENDER &&
+               (beentry->st_state != STATE_IDLE ||
+                beentry->st_backendType != B_BACKEND))

One thing here is not 100% clear for me. As comment above stated, we don't expose transaction time for walsenders, since it confuses monitoring. That is clear.

-           if (beentry->st_activity_start_timestamp != 0)
+           /*
+            * Don't expose query start time for idle backends for the same
+            * reasons mentioned above regarding transaction time.
+            */
+           if (beentry->st_activity_start_timestamp != 0 &&
+               (beentry->st_state != STATE_IDLE ||
+                beentry->st_backendType != B_BACKEND))

But here the comment stated "for the same reasons mentioned above" but the condition is different. AFAICS, the walsender backend is the only one to be affected. It's quite possible I miss something, but I expected to have, let's say, alike conditions in this particular cases.

--
Best regards,
Maxim Orlov.

Re: [BUGS] BUG #10123: Weird entries in pg_stat_activity

From
Tomas Vondra
Date:
On 10/17/24 06:19, Fujii Masao wrote:
> 
> 
> On 2024/10/17 1:59, Fujii Masao wrote:
>>
>>
>> On 2024/07/05 7:03, Andres Freund wrote:
>>> Calls to pgstat_report_activity() aren't exactly free. This
>>> substantially
>>> increases the number of calls to it for common workloads. There got
>>> to be a
>>> more targeted way of dealing with this.
>>
>> Yes. So, as an alternative approach, how about setting xact_start and
>> query_start to NULL whenever the state is idle, since non-NULL values
>> aren't expected in that case? This would prevent the weired
>> pg_stat_activity
>> entries I mentioned earlier, with minimal performance impact.
> 
> Patch attached.
> 

I was going through the CF looking for long-running patches that might
need more attention ... I was not expecting a thread started in 2014.

It seems a bit strange to "fix" this while querying the data, which I
think is what the patch is doing. Wouldn't it be better to make sure we
don't have such "incorrect" data in the first place?

I mean, couldn't we simply reset the timestamp whenever the state gets
set to IDLE? Of maybe if there are legitimate cases where we can't do
that, maybe we could add a flag to pgstat_report_activity() to trigger
this timestamp reset?

Alternatively, why don't we simply add pgstat_report_xact_timestamp(0)
to the two places that can cause this, after CommitTransactionCommand()?

FWIW it seems a bit confusing that CommitTransactionCommand() doesn't
already do this. If StartTransactionCommand() can set the start
timestamp, wouldn't it be reasonable for CommitTransactionCommand() to
reset it?

Of course, it's entirely possible I miss something and there are reasons
why none of this would work.


regards

-- 
Tomas Vondra