From: Remi Colinet
Subject: [HACKERS] [PATCH v3] pg_progress() SQL function to monitor progression of long running SQL queries/utilities
Msg-id: CADdR5nxQUSh5kCm9MKmNga8+c1JLxLHDzLhAyXpfo9-Wmc6s5g@mail.gmail.com
List: pgsql-hackers
Hello,
This is version 3 of a new SQL function written to monitor long running SQL queries/utilities in Postgres backends.
Previous patches used a command named PROGRESS. This restricted its use to psql only.
New SQL function
=========================
The purpose of the new SQL function is to monitor the progression of long running SQL queries/utilities in backend processes. It can be used to evaluate when a query will complete and, consequently, whether it will complete within a reasonable amount of time. The function is named pg_progress(pid, verbosity).
Arguments are:
- pid: process pid of a given backend. This pid can be found using pg_stat_activity. The value can also be 0, in which case the progress of all backend processes is reported, one by one. Only regular backends can be monitored. Special backends such as the WAL writer are not covered by the new function.
- verbosity: this value ranges from 0 to 3. Higher values provide more details. It also controls whether the disk space used is reported. (See the example calls below.)
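For illustration, two example calls (the pid 14106 is only illustrative; take it from pg_stat_activity as shown further below):

-- Progress of all regular backends, lowest verbosity
SELECT * FROM pg_progress(0, 0);

-- Progress of a single backend, with the most detailed output
SELECT * FROM pg_progress(14106, 3);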
The function returns a set of rows of name/value pairs reflecting the execution tree and the progression of its long running nodes. Long running nodes are, for instance, SeqScan, IndexScan, Sort using tapes, TupleStore (on disk), Limit, and HashJoin.
The row fields returned by the SQL function are:
- pid: process pid of the backend being monitored
- ppid: parent pid, or master process pid for parallel queries
- bid: backend Id
- lineid: line number of the progress report for a given backend
- indent: indentation of the node. This field and lineid are used to order and indent the output, in order to reflect the execution tree.
- type: 3 row types exist:
- nodes, which are for instance SeqScan, HashJoin, Sort
- relationships between nodes and child nodes of the execution tree
- properties exhibiting the progress of long running nodes
- name: the name of the node/relationship/property. This is the name in the name/value pairs.
- value: the value of the node.
- unit: the unit for the above value field.
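Because the output is a flat set of name/value rows, it can be filtered with plain SQL. A small sketch, assuming the 'completion'/'percent' name and unit values shown in the sample output further below:

-- One completion percentage row per long running node, for all backends
SELECT pid, lineid, value AS percent_done
FROM pg_progress(0, 1)
WHERE name = 'completion' AND unit = 'percent'
ORDER BY pid, lineid;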
An SQL query run in the monitoring process can be used to exhibit the execution tree progress:

test=# SELECT pid, ppid, bid, concat(repeat(' ', 3 * indent),name), value, unit FROM pg_progress(0,0);

The indent field can be used to reflect the tree nature of the execution tree.
Use case
=======
A use case is shown in the example below, based on a table named t_10m with 10 million rows.
The table has been created with :
CREATE TABLE T_10M ( id integer, md5 text);
INSERT INTO T_10M SELECT generate_series(1,10000000) AS id, md5(random()::text) AS md5;
ANALYSE t_10m;
ANALYSE should be run in order to have up-to-date statistics.
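As a quick cross-check (this query is only illustrative and not part of the patch), the number of heap blocks of the table can be read from pg_class; it should roughly match the 'total' block count reported by pg_progress in the sample output below:

-- Number of blocks (pages) of the table according to the statistics
SELECT relpages FROM pg_class WHERE relname = 't_10m';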
1/ Start a first psql session to run a long SQL query:
[pgadm@rco ~]$ psql -A -d test
psql (10devel)
Type "help" for help.
test=#
The option -A selects unaligned output mode. Otherwise psql might slow down the progression of the query.
Redirect output to a file in order to let the query run without terminal interaction:
test=# \o file
Start a long running query:
test=# select * from t_10M order by md5;
2/ In a second psql session, list the backend pid(s) and their matching SQL query
[pgadm@rco ~]$ psql -d test
psql (10devel)
Type "help" for help.
test=# SELECT pid, ppid, bid, concat(repeat(' ', 3 * indent),name), value, unit FROM pg_progress(0,0);
  pid  | ppid | bid |    concat    |      value       |  unit
-------+------+-----+--------------+------------------+---------
 14106 |    0 |   4 | status       | query running    |
 14106 |    0 |   4 | relationship | progression      |
 14106 |    0 |   4 | node name    | Sort             |
 14106 |    0 |   4 | sort status  | on tapes writing |
 14106 |    0 |   4 | completion   | 0                | percent
 14106 |    0 |   4 | relationship | Outer            |
 14106 |    0 |   4 | node name    | Seq Scan         |
 14106 |    0 |   4 | scan on      | t_10m            |
 14106 |    0 |   4 | fetched      | 25049            | block
 14106 |    0 |   4 | total        | 83334            | block
 14106 |    0 |   4 | completion   | 30               | percent
(11 rows)
test=#
Choose the pid of the backend running the long SQL query to be monitored, or use 0 for the pid, as above, to get the progress report of all backends. (An illustrative way of picking the pid is sketched below.)
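A small sketch of picking a pid from pg_stat_activity and then targeting it (the pid 14106 and verbosity 1 are illustrative):

-- List the other active backends and their queries
SELECT pid, state, query
FROM pg_stat_activity
WHERE state = 'active' AND pid <> pg_backend_pid();

-- Then request the progress report of the chosen backend
SELECT pid, ppid, bid, concat(repeat(' ', 3 * indent), name), value, unit
FROM pg_progress(14106, 1)
ORDER BY lineid;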
The sort is done on tapes because the table has 10 million rows.
The output shows the progression in terms of:
- blocks, with:
- already fetched blocks
- total blocks to be fetched
- percentage of completion
Sort nodes show the tape blocks being used and the step phases (merge/sort).
Each node may have different output types. Progression is reported in terms of rows, blocks, bytes, and percentages.
Parallel queries can be monitored by observing progress of both master and worker backends. In one psql session, run:
test=# SELECT pid, ppid, bid, concat(repeat(' ', 3 * indent),name), value, unit FROM pg_progress(0,0);
  pid  | ppid  | bid |    concat    |     value     |  unit
-------+-------+-----+--------------+---------------+---------
 26924 |     0 |   4 | status       | query running |
 26924 |     0 |   4 | relationship | progression   |
 26924 |     0 |   4 | node name    | Gather Merge  |
 26924 |     0 |   4 | relationship | Outer         |
 26924 |     0 |   4 | node name    | Sort          |
 26924 |     0 |   4 | completion   | 0             | percent
 26924 |     0 |   4 | relationship | Outer         |
 26924 |     0 |   4 | node name    | Seq Scan      |
 26924 |     0 |   4 | node mode    | parallel      |
 26924 |     0 |   4 | scan on      | t_10m         |
 26924 |     0 |   4 | scan mode    | parallel      |
 26924 |     0 |   4 | fetched      | 38231         | block
 26924 |     0 |   4 | total        | 132403        | block
 26924 |     0 |   4 | completion   | 28            | percent
 26926 | 26924 |   6 | relationship | child worker  |
 26926 | 26924 |   6 | node name    | Sort          |
 26926 | 26924 |   6 | completion   | 0             | percent
 26926 | 26924 |   6 | relationship | Outer         |
 26926 | 26924 |   6 | node name    | Seq Scan      |
 26926 | 26924 |   6 | node mode    | parallel      |
 26926 | 26924 |   6 | scan on      | t_10m         |
 26926 | 26924 |   6 | scan mode    | parallel      |
 26926 | 26924 |   6 | fetched      | 38246         | block
 26926 | 26924 |   6 | total        | 132403        | block
 26926 | 26924 |   6 | completion   | 28            | percent
 26925 | 26924 |   5 | relationship | child worker  |
 26925 | 26924 |   5 | node name    | Sort          |
 26925 | 26924 |   5 | completion   | 0             | percent
 26925 | 26924 |   5 | relationship | Outer         |
 26925 | 26924 |   5 | node name    | Seq Scan      |
 26925 | 26924 |   5 | node mode    | parallel      |
 26925 | 26924 |   5 | scan on      | t_10m         |
 26925 | 26924 |   5 | scan mode    | parallel      |
 26925 | 26924 |   5 | fetched      | 38283         | block
 26925 | 26924 |   5 | total        | 132403        | block
 26925 | 26924 |   5 | completion   | 28            | percent
(36 rows)

test=#
The node types involved here are Gather Merge, Sort, and Seq Scan.
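To relate progress rows back to the queries they belong to, the report can be joined with pg_stat_activity; a hedged example (the join and column choices are illustrative):

-- Completion percentage per backend, alongside the monitored query text
SELECT p.pid, p.ppid, a.query, p.value AS percent_done
FROM pg_progress(0, 0) AS p
LEFT JOIN pg_stat_activity AS a ON a.pid = p.pid
WHERE p.name = 'completion' AND p.unit = 'percent'
ORDER BY p.pid, p.lineid;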
Design of the SQL function
===================
The design of the SQL function relies on a per backend shared memory structure named ProgressCtl. This structure contains fields for progression requests and responses.
The monitoring process fills the request fields in the per backend ProgressCtl structure. Once the request fields are set, it sends a signal to the monitored process and starts waiting on a per backend latch for the response of the monitored backend to become available.
The monitored process, upon receiving the signal, notes the progression request and continues its execution until the interrupt can be serviced. At that moment, it disables interrupts and walks through its execution tree to collect the progress of every node of the tree. This is done in one step, interrupting the backend SQL query execution only once. The structure ProgressState, stored in the monitored backend process, is used to manage the progress report. The monitored backend then dumps the progress report into a per backend shared memory page and sets the latch to let the monitoring backend wake up and collect the report placed in shared memory. Finally, the monitored backend resumes the execution of its SQL query.
The monitoring backend collects the progress report from shared memory and sends it to its psql session as a list of rows.
If one shared memory page is not enough for the whole progress report, the transfer between the two backends is done as a series of request/response exchanges. Before setting the latch, the monitored backend writes the size of the data dumped in shared memory and sets a status flag to indicate that more data is to be sent through the shared memory page. The monitoring backend gets this partial result, sends another signal, and then waits on the latch again. The monitored backend does not collect a new progress report but keeps dumping the already collected one. The exchange goes on until the full progress report has been transferred.
Should the backend process disappear, the wait on the latch will eventually complete after a timeout whose value is a GUC variable. The default is 5 seconds, which should be enough for most cases.
Should the monitoring process disappear, the monitored backend will still dump its data in shared memory and continue its execution. On the next progression request, the monitored backend starts a new progress report.
While the monitored process executes a SQL query, it collects the progress of each long running node with counters of rows and blocks. This data collection incurs a very low overhead on the running query.
At the time a monitored backend deals with a progress report, the SQL query must be in a consistent state, without nodes partially allocated or freed. This is needed to walk through the execution tree in the middle of the SQL query execution. It is enforced by a flag added in the executor: the progression is only collected if the backend is within ExecutorRun(), which is known from this execution flag. If the execution tree is in a consistent state, it is collected and dumped into the shared memory page. The dump of the progress report is only done upon receiving a progress request, which is expected to be infrequent.
Parallel queries can also be monitored. The same mechanism is used to monitor child workers.
Utilities such as CREATE INDEX, which may take long to complete, are also monitored.
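For instance (a hedged illustration reusing the t_10m table from the use case above; the index name is arbitrary), a long index build started in the first session can be watched from the second one with the same monitoring query as before:

-- Session 1: long running utility
CREATE INDEX t_10m_md5_idx ON t_10m (md5);

-- Session 2: watch its progression
SELECT pid, ppid, bid, concat(repeat(' ', 3 * indent), name), value, unit
FROM pg_progress(0, 0);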
Patch
=====
The patch for version 3 is attached to the current email.
The patch has been rebased on the latest commits as of today.
The patch is also available at: https://github.com/colinet/progress/tree/progress
The diff stat of the patch is:
[root@rco pg]# git diff --stat master..
doc/src/sgml/config.sgml | 27 +-
doc/src/sgml/logical-replication.sgml | 2 +-
doc/src/sgml/monitoring.sgml | 283 +++++++++++-
doc/src/sgml/ref/create_function.sgml | 5 +-
doc/src/sgml/ref/create_subscription.sgml | 2 +-
doc/src/sgml/ref/load.sgml | 11 +-
doc/src/sgml/ref/psql-ref.sgml | 12 +-
src/backend/catalog/index.c | 11 +
src/backend/catalog/partition.c | 2 +-
src/backend/catalog/pg_publication.c | 24 -
src/backend/commands/explain.c | 3 +-
src/backend/executor/execMain.c | 12 +-
src/backend/executor/execProcnode.c | 31 ++
src/backend/nodes/outfuncs.c | 245 ++++++++++
src/backend/optimizer/path/costsize.c | 97 ++--
src/backend/postmaster/postmaster.c | 1 +
src/backend/rewrite/rewriteDefine.c | 6 -
src/backend/storage/file/buffile.c | 73 +++
src/backend/storage/file/fd.c | 15 +
src/backend/storage/ipc/ipci.c | 4 +
src/backend/storage/ipc/procarray.c | 59 +++
src/backend/storage/ipc/procsignal.c | 5 +
src/backend/storage/lmgr/lwlock.c | 2 +-
src/backend/tcop/postgres.c | 59 ++-
src/backend/tcop/pquery.c | 7 +
src/backend/tcop/utility.c | 10 +
src/backend/utils/adt/Makefile | 2 +-
src/backend/utils/adt/pg_progress.c | 2764 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
src/backend/utils/adt/varlena.c | 4 +-
src/backend/utils/init/globals.c | 18 +
src/backend/utils/init/miscinit.c | 21 +-
src/backend/utils/misc/guc.c | 23 +
src/backend/utils/sort/logtape.c | 18 +
src/backend/utils/sort/tuplesort.c | 153 +++++-
src/backend/utils/sort/tuplestore.c | 75 ++-
src/bin/pg_dump/pg_backup_archiver.c | 8 +-
src/bin/pg_upgrade/check.c | 20 +-
src/bin/pg_upgrade/pg_upgrade.c | 2 +-
src/bin/pg_upgrade/pg_upgrade.h | 2 +-
src/bin/psql/describe.c | 16 +-
src/include/catalog/catversion.h | 2 +-
src/include/catalog/pg_proc.h | 6 +-
src/include/commands/explain.h | 2 +
src/include/executor/execdesc.h | 3 +
src/include/miscadmin.h | 2 +
src/include/nodes/execnodes.h | 3 +
src/include/nodes/nodes.h | 7 +
src/include/nodes/plannodes.h | 11 +
src/include/pgstat.h | 3 +-
src/include/storage/buffile.h | 11 +
src/include/storage/fd.h | 2 +
src/include/storage/lwlock.h | 1 +
src/include/storage/procarray.h | 3 +
src/include/storage/procsignal.h | 3 +
src/include/tcop/pquery.h | 1 +
src/include/utils/logtape.h | 2 +
src/include/utils/pg_progress.h | 46 ++
src/include/utils/tuplesort.h | 72 ++-
src/include/utils/tuplestore.h | 35 ++
src/test/regress/expected/join.out | 35 --
src/test/regress/expected/opr_sanity.out | 10 -
src/test/regress/expected/rules.out | 5 -
src/test/regress/sql/join.sql | 28 --
src/test/regress/sql/opr_sanity.sql | 7 -
src/test/regress/sql/rules.sql | 5 -
65 files changed, 4153 insertions(+), 286 deletions(-)
[root@rco pg]#
2 main files are created:
src/backend/utils/adt/pg_progress.c
src/include/utils/pg_progress.h
Other files receive small changes in order to collect the progression of the long running nodes.
Changelog
========
v3:
Added documentation
Added GUC parameters for policy change (timeout, delay before reporting).
Restrict use of pg_progress() to superuser only
Support utility progression with CREATE INDEX.
Allow multiple monitoring backends at the same time to request progress of the backends.
Fixed design of shared memory exchange between monitoring and monitored backends when report is more than one shm page size.
Fixed locking of backend report request.
Moved code for the new SQL function to src/backend/utils/adt/pg_progress.c
Use LWLockTranche to allocate one LWLock per backend. This lock is used to serialize progression requests on backends.
Replace PROGRESS command by a new SQL function returning rows instead of TEXT, JSON, XML.
Improved debug support
v2:
Added JSON, XML, TEXT INLINE format output
Added time consumed to run SQL queries
Added SQL query being monitored
Added support for parallel queries
Added disk used by the queries
Rebased on 9th May 2017 commits.
v1:
Initial version with only TEXT format reporting
Any suggestion, comment or feedback is welcome.