[HACKERS] [PATCH v2] Progress command to monitor progression of long runningSQL queries - Mailing list pgsql-hackers

From Remi Colinet
Subject [HACKERS] [PATCH v2] Progress command to monitor progression of long runningSQL queries
Date
Msg-id CADdR5ny_0dFwnD+suBnV1Vz6NDKbFHeWoV1EDv9buhDCtc3aAA@mail.gmail.com
Whole thread Raw
Responses Re: [HACKERS] [PATCH v2] Progress command to monitor progression oflong running SQL queries  (David Fetter <david@fetter.org>)
Re: [HACKERS] [PATCH v2] Progress command to monitor progression oflong running SQL queries  (Pavel Stehule <pavel.stehule@gmail.com>)
Re: [HACKERS] [PATCH v2] Progress command to monitor progression oflong running SQL queries  (Michael Paquier <michael.paquier@gmail.com>)
Re: [HACKERS] [PATCH v2] Progress command to monitor progression oflong running SQL queries  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
Hello,

This is version 2 of the new command name PROGRESS which I wrote in order to monitor long running SQL queries in a Postgres backend process.


New command justification
====================

The purpose of the command is to monitor long running SQL queries in a backend process to evaluate when they will complete.

First, the user needs to find the pid of the backend process which is executing the long running SQL query for which we want to monitor progression.
This search may be based on the SQL query, the session, the terminal, using the view pg_stat_activity. This view provides the best method to identify the query to be monitored.
 
Once the pid of the backend is determined, user may run:
psql> PROGRESS [VERBOSE|(FORMAT TEXT|JSON|XML|INLINE)] PID_OF_BACKEND

The output will show a table similar to the one of the EXPLAIN command but with further details about the progression of execution of each node which may take a long time to be executed.

Such nodes are SeqScan (Sequential Scans), IndexScan (Index Scans), Sort (long sorts which require disk tapes mecanism), TupleStore (Store on disks of tuple tables), Limit, HashJoinTable which use one disk files to perform HashJoin.
Other nodes use the previous nodes. For instance, Material nodes use TupleStore nodes.

The PROGRESS command can be used with the psql \watch command. For instance, user may run:

psql> \watch PROGRESS PID_OF_BACKEND

This is handy to view progression of SQL queries.


Use case
=======

A use case is shown in the below example based on a table named t_10m with at least 10 millions rows.

The table has been created with :

CREATE TABLE T_10M ( id integer, md5 text);
INSERT INTO T_10M SELECT generate_series(1,10000000) AS id, md5(random()::text) AS md5;
ANALYSE t_10m;

ANALYSE is needed to have statistics updated. These are used to compare rows fetched with total number of rows and report percentage of execution of query nodes.
If ANALYSE is not run against relations used by the monitored query, statistics may be inaccurate.

1/ Start a first psql session to run a long SQL query:

[pgadm@rco ~]$ psql -A -d test
psql (10devel)
Type "help" for help.
test=#

The option -A is used to allow rows to be output without formatting work.
Redirect output to a file in order to let the query run without terminal interaction:

test=# \o out

Start a long running query:
test=# select * from t_10M order by md5;

2/ In a second psql session, list the backend pid(s) and their matching SQL query

[pgadm@rco ~]$ psql -d test
psql (10devel)
Type "help" for help.

test=# select * from prg;
  pid  |               query              
-------+-----------------------------------
 26306 |
 26309 |
 26311 | select * from t_10m order by md5;
 26312 | select * from t_10m order by md5;
 26313 | select * from t_10m order by md5;
 26330 | select * from prg;
 26304 |
 26303 |
 26305 |
(9 rows)

test=#

Chose the pid of the backend running the long SQL query to be monitored.
Above example is a parallel SQL query. Lowest pid is the main backend of the query.

test=# PROGRESS 26311;

                                       PLAN PROGRESS                                       
--------------------------------------------------------------------------------------------
 status: <query running>
 query: select * from t_10m order by md5;
 time used (s): 20
 Gather Merge
   worker child: 26312
   ->  Sort =>  rows r/w merge 0/0 sort 0/1895721 0% tape blocks 10903
         Sort Key: md5
         ->  Parallel Seq Scan on t_10m => rows 1903441/4166657 45% => blks 49924/83334 59%
   worker child: 26313
   ->  Sort =>  rows r/w merge 0/0 sort 0/1967521 0% tape blocks 11314
         Sort Key: md5
         ->  Parallel Seq Scan on t_10m => rows 1975561/4166657 47% => blks 49927/83334 59%
   worker parent: 26311
   ->  Sort on tapes writing =>  rows r/w merge 0/0 sort 0/2075590 0% tape blocks 11935
         Sort Key: md5
         ->  Parallel Seq Scan on t_10m => rows 2111550/4166657 50% => blks 49928/83334 59%
 total disk space used (MB) 266
(17 rows)

test=#

The query of the monitored backend is:
test=# select * from t_10M order by md5;

Because the table has 10 millions of rows, the sort is done on tapes.

The output shows the progression in terms of:
-  rows with:
   - already fetched rows,
   - total rows to be fetched,
- blocks with:
   - already fetched blocks
   - total blocks to be fetched

Each node may have different output types.
Progression is reported in terms of rows, blocks, bytes, and percentages.

Sort nodes show tapes blocks being used and step phase (merge/sort).


Implementation of the command
========================

The design of the command is:

- the user issue the "PROGRESS pid" command from a psql session.
The pid is the one of the backend which runs the SQL query for which we want to get a progression report. It can be determined from the view pg_stat_activity.

- the monitoring backend, upon receiving the "PROGRESS pid" command from psql utility used in step above, sends a signal to the backend whose process pid is the one provided in the PROGRESS command, and whose reason is PROCSIG_PROGRESS defined in src/include/storage/procsignal.h

- the monitored backend receives the signal with the reason of the signal (progression request) and notes the request as for any interrupt. Then, it continues its execution of its SQL query until interrupts can be serviced.

- when the monitored process can service the interrupts, it deals with the progress request by collecting its execution tree with the execution progress of each long running node. At this time, the SQL query must be in a consistent state without partial nodes under alllocation or freeing. This is enforced by flags added in the executor. The progression is only collected if the backend is in the function of ExecutorRun(). If execution tree is in a consistent state, it is dumped in shared memory pages allocated at the start of the server. Then, the monitored backend set a latch on which the monitoring backend is waiting for. It then continues executing its SQL query.

- the monitoring backend collects the share memory data dumped, and sends it to its psql session as a list of rows.

The command PROGRESS does not incur any slowness on the running query because the execution progression is only computed upon receiving the progress request which is supposed to be seldom used.

The code heavily reuses the one of the explain command. In order to share as much code as possible with the EXPLAIN command, part of the EXPLAIN code which deals with reporting quals for instance, has been moved to a new report.c file in the src/backend/commands folder. This code in report.c is shared between explain.c source code and PROGRESS command source code which is in progress.c file.

The progression reported by PROGRESS command is given in terms of rows, blocks, bytes and percents. The values displayed depend on the node type in the execution plan. The current patch implements all the possible nodes which could take a lot of time.

Parallel queries can also be monitored. The same mecanism is used to monitor child workers with a slight difference: the main worker requests the child progression directly in order to dump the whole progress tree in shared memory.


Patch
=====

The patch for version 2 is attached to the current email.
The patch has been rebased on latest commits.

The patch is also available at : https://github.com/colinet/progress/tree/progress

The diff stat of the patch is:

[root@rco pg]# git diff --stat master..
 contrib/auto_explain/auto_explain.c       |    5 +-
 contrib/postgres_fdw/postgres_fdw.c       |   13 +-
 src/backend/access/heap/heapam.c          |    2 +
 src/backend/access/transam/parallel.c     |    2 +-
 src/backend/commands/Makefile             |    3 +-
 src/backend/commands/explain.c            | 2834 ++++++++++++++-----------------------------------------------------------------------------------------------
 src/backend/commands/prepare.c            |    5 +-
 src/backend/commands/progress.c           | 2035 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 src/backend/commands/report.c             | 2252 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 src/backend/executor/execMain.c           |    8 +
 src/backend/executor/execProcnode.c       |   31 ++
 src/backend/executor/nodeBitmapHeapscan.c |   13 +-
 src/backend/executor/nodeIndexonlyscan.c  |   13 +-
 src/backend/executor/nodeIndexscan.c      |   15 +-
 src/backend/executor/nodeSamplescan.c     |   12 +-
 src/backend/executor/nodeSeqscan.c        |   16 +-
 src/backend/nodes/bitmapset.c             |   19 +
 src/backend/nodes/outfuncs.c              |  245 ++++++++++
 src/backend/parser/gram.y                 |   79 +++-
 src/backend/postmaster/postmaster.c       |    1 +
 src/backend/storage/file/buffile.c        |   73 +++
 src/backend/storage/file/fd.c             |   15 +
 src/backend/storage/ipc/ipci.c            |    3 +
 src/backend/storage/ipc/procarray.c       |   57 +++
 src/backend/storage/ipc/procsignal.c      |    4 +
 src/backend/storage/lmgr/lwlock.c         |    7 +-
 src/backend/storage/lmgr/lwlocknames.txt  |    1 +
 src/backend/tcop/postgres.c               |   65 ++-
 src/backend/tcop/pquery.c                 |   25 +
 src/backend/tcop/utility.c                |   10 +
 src/backend/utils/init/globals.c          |   12 +
 src/backend/utils/sort/logtape.c          |   18 +
 src/backend/utils/sort/tuplesort.c        |  153 +++++-
 src/backend/utils/sort/tuplestore.c       |   75 ++-
 src/include/access/parallel.h             |    1 +
 src/include/commands/explain.h            |   67 +--
 src/include/commands/prepare.h            |    2 +-
 src/include/commands/report.h             |  159 +++++++
 src/include/executor/execdesc.h           |    3 +
 src/include/executor/progress.h           |   52 ++
 src/include/foreign/fdwapi.h              |   10 +-
 src/include/miscadmin.h                   |    2 +
 src/include/nodes/bitmapset.h             |    1 +
 src/include/nodes/execnodes.h             |    3 +
 src/include/nodes/extensible.h            |    6 +-
 src/include/nodes/nodes.h                 |    8 +
 src/include/nodes/parsenodes.h            |   11 +
 src/include/nodes/plannodes.h             |   11 +
 src/include/parser/kwlist.h               |    2 +
 src/include/pgstat.h                      |    3 +-
 src/include/storage/buffile.h             |   11 +
 src/include/storage/fd.h                  |    2 +
 src/include/storage/procarray.h           |    3 +
 src/include/storage/procsignal.h          |    3 +
 src/include/tcop/pquery.h                 |    1 +
 src/include/utils/logtape.h               |    2 +
 src/include/utils/tuplesort.h             |   72 ++-
 src/include/utils/tuplestore.h            |   35 ++
 58 files changed, 5972 insertions(+), 2619 deletions(-)
[root@rco pg]#


The progress command can be used with the watch command of psql making it more handy to monitor a long running query.
The default format of the PROGRESS command is text inline. It can be set as json, xml, or text as for EXPLAIN command.


Use cases
========

Some further examples of use are shown below in the test_v2.txt file with parallel queries, sorts, hashes, scans, json format output.


TODO
=====

Monitor progress of utilities commands such as CREATE INDEX which may take a long time to complete.
Write documentation


Changelog
========

v2:
Added JSON, XML, TEXT INLINE format output
Added time consumed to run SQL queries
Added SQL query being monitored
Added support for parallel queries
Added disk used by the queries
Rebased on 9th May 2017 commits.

v1:
Initial version with only TEXT format reporting



Any suggestion, comment or feedback is welcome.

Regards
Remi


Attachment

pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: [HACKERS] [POC] hash partitioning
Next
From: Robert Haas
Date:
Subject: Re: [HACKERS] [POC] hash partitioning