[BUGS] BUG #14901: Canceled queries missing from pg_stat_statements - Mailing list pgsql-bugs

From gregburek@heroku.com
Subject [BUGS] BUG #14901: Canceled queries missing from pg_stat_statements
Date
Msg-id 20171112222811.1464.28388@wrigleys.postgresql.org
Whole thread Raw
Responses Re: [BUGS] BUG #14901: Canceled queries missing frompg_stat_statements  (Andres Freund <andres@anarazel.de>)
List pgsql-bugs
The following bug has been logged on the website:

Bug reference:      14901
Logged by:          Greg Burek
Email address:      gregburek@heroku.com
PostgreSQL version: 9.5.10
Operating system:   linux
Description:

Hello,
I recently worked with an app that hit a mysterious wall in Postgres query
performance, where overall query latency had increased dramatically, with
sporadic statement_timeout producing app errors. Examining
pg_stat_statements showed some queries that were consuming resources, which
were then optimized, but the overall query latency and app error rate stayed
constant. 

Only after examining logs did we find a class of query that was being
consistently canceled by statement_timeout=30s. This query was found mostly
by chance as it was a fraction of the total statement_timeout rate, as some
sessions have much more aggressive statement_timeout=2s. Running explain
analyze on this 30s timeout query, we found it to be extremely resource
heavy and ultimately causing the overall degredation of response time.

This was a surprise to the app's developers that a query, which had a large
execution time but was being canceled by a relatively sane
statement_timeout, was missing from pg_stat_statements. That one would need
to access, consume and analyze postgres logs was unexpected, given Postgres
documents
(https://www.postgresql.org/docs/current/static/pgstatstatements.html) that
"The pg_stat_statements module provides a means for tracking execution
statistics of *all* SQL statements executed by a server." That canceled
queries are missing from this view seems like an easy to miss, but vital
exception.

I think this is either a bug in how pgss does not account for canceled
queries that still exert load on a database or in the documentation of pgss
that it only accounts for completed queries and that statement_timeout or
app timeouts can result in query undercounting or omissions. 

Below is a reproduction of this issue, where a pathologically bad query,
which is responsible for the bulk of the load executed on the db, does not
increment or add execution statistics to pg_stat_statements.

I would expect that the final `select * from test order by random() limit
?;` entry of pg_stat_statements would have a count equal 7 and total_time
around 142623.038. 

=# create extension "uuid-ossp";
CREATE EXTENSION
Time: 10.053 ms

=# create extension "pg_stat_statements";
CREATE EXTENSION
Time: 23.607 ms

=# create table test as (   SELECT id as id, uuid_generate_v4() as uuid, uuid_generate_v1() as
name     FROM generate_series(1,(10^8)::int) as id);
SELECT 100000000
Time: 276467.422 ms (04:36.467)

=# select pg_stat_statements_reset();pg_stat_statements_reset
--------------------------

(1 row)

Time: 0.867 ms

=# select query, calls from pg_stat_statements;              query                | calls
------------------------------------+-------select pg_stat_statements_reset(); |     1
(1 row)

Time: 1.221 ms

=# select * from test order by random() limit 1;   id    |                 uuid                 |                 name
----------+--------------------------------------+--------------------------------------79600490 |
46844417-f3ee-4add-b5c5-1ba3e6db5da5| 
e4c118b2-c5a3-11e7-be4b-8c8590305665
(1 row)

Time: 22627.277 ms (00:22.627)

=# select query, calls from pg_stat_statements;                    query                     | calls
-----------------------------------------------+-------select query, calls from pg_stat_statements;  |     1select
pg_stat_statements_reset();           |     1select * from test order by random() limit ?; |     1
 
(3 rows)

Time: 1.439 ms

=# SET statement_timeout='20s';
SET
Time: 8.936 ms

=# select * from test order by random() limit 1;
ERROR:  57014: canceling statement due to statement timeout
LOCATION:  ProcessInterrupts, postgres.c:2971
Time: 20073.747 ms (00:20.074)

=# select * from test order by random() limit 1;
ERROR:  57014: canceling statement due to statement timeout
LOCATION:  ProcessInterrupts, postgres.c:2971
Time: 20000.455 ms (00:20.000)

=# select * from test order by random() limit 1;
ERROR:  57014: canceling statement due to statement timeout
LOCATION:  ProcessInterrupts, postgres.c:2971
Time: 20065.561 ms (00:20.066)

=# select * from test order by random() limit 1;
ERROR:  57014: canceling statement due to statement timeout
LOCATION:  ProcessInterrupts, postgres.c:2971
Time: 20019.753 ms (00:20.020)

=# select * from test order by random() limit 1;
ERROR:  57014: canceling statement due to statement timeout
LOCATION:  ProcessInterrupts, postgres.c:2971
Time: 20056.721 ms (00:20.057)

=# select * from test order by random() limit 1;
ERROR:  57014: canceling statement due to statement timeout
LOCATION:  ProcessInterrupts, postgres.c:2971
Time: 20001.994 ms (00:20.002)

=# select query, calls, total_time from pg_stat_statements;                    query                     | calls |
total_time
-----------------------------------------------+-------+------------select query, calls from pg_stat_statements;  |
2|      2.063SET statement_timeout='20s';                  |     1 |      0.857select pg_stat_statements_reset();
    |     1 |      2.753select * from test order by random() limit ?; |     1 |  22623.038
 
(4 rows)

Time: 2.536 ms


--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

pgsql-bugs by date:

Previous
From: Tom Lane
Date:
Subject: Re: [BUGS] BUG #14897: Segfault on statitics SQL request
Next
From: Andres Freund
Date:
Subject: Re: [BUGS] BUG #14901: Canceled queries missing frompg_stat_statements