Re: [HACKERS] [PATCH] New command to monitor progression of long running queries - Mailing list pgsql-hackers

From Remi Colinet
Subject Re: [HACKERS] [PATCH] New command to monitor progression of long running queries
Msg-id CADdR5nyy_qSRivJaWC6iUSSxuQt-2a6Ozg_=Oa17GW_NJFrejA@mail.gmail.com
In response to Re: [HACKERS] [PATCH] New command to monitor progression of long running queries  (Remi Colinet <remi.colinet@gmail.com>)
List pgsql-hackers
Following up on my previous email...

I have added below some use cases which I find very relevant when one needs to know the progress of a SQL query.
The command can be used with any SQL query (SELECT, UPDATE, DELETE, INSERT).


The tables used have been created with:

CREATE TABLE T_1M (id integer, md5 text);
INSERT INTO T_1M SELECT generate_series(1,1000000) AS id, md5(random()::text) AS md5;

CREATE TABLE T_10M ( id integer, md5 text);
INSERT INTO T_10M SELECT generate_series(1,10000000) AS id, md5(random()::text) AS md5;


All the different leaf node types are implemented.



1/ Parallel select with sort (no index)
===========================

=> Terminal running the long SQL query:
test=# select * from t_10M order by md5;

=> Terminal monitoring SQL query progression:
test=# select pid,query from pg_stat_activity ;
  pid  |                  query                  
-------+------------------------------------------
  8062 |
  8065 |
 19605 | select pid,query from pg_stat_activity ;
 20830 | select * from t_10M order by md5;
 20832 | select * from t_10M order by md5;
 20833 | select * from t_10M order by md5;
  8060 |
  8059 |
  8061 |
(9 rows)

test=# PROGRESS 20830
test-# ;
                             PLAN PROGRESS                             
------------------------------------------------------------------------
 Gather Merge
   ->  Sort=> dumping tuples to tapes / merging tapes
         rows r/w merge 2167923/2167908 rows r/w effective 0/3514320 0%
         Sort Key: md5
         ->  Parallel Seq Scan on t_10m => rows 3514321/4166700 84%
(5 rows)

test=#
test=#
test=# PROGRESS 20830;
                               PLAN PROGRESS                               
----------------------------------------------------------------------------
 Gather Merge
   ->  Sort=> final merge sort on tapes
         rows r/w merge 4707198/4691167 rows r/w effective 16016/3514320 0%
         Sort Key: md5
         ->  Parallel Seq Scan on t_10m => rows 3514321/4166700 84%
(5 rows)

test=# PROGRESS 20830;
                                PLAN PROGRESS                               
-----------------------------------------------------------------------------
 Gather Merge
   ->  Sort=> final merge sort on tapes
         rows r/w merge 4809857/4691167 rows r/w effective 118675/3514320 3%
         Sort Key: md5
         ->  Parallel Seq Scan on t_10m => rows 3514321/4166700 84%
(5 rows)

test=# PROGRESS 20830;
                                PLAN PROGRESS                               
-----------------------------------------------------------------------------
 Gather Merge
   ->  Sort=> final merge sort on tapes
         rows r/w merge 4883715/4691167 rows r/w effective 192533/3514320 5%
         Sort Key: md5
         ->  Parallel Seq Scan on t_10m => rows 3514321/4166700 84%
(5 rows)

test=# PROGRESS 20830;
                                PLAN PROGRESS                               
-----------------------------------------------------------------------------
 Gather Merge
   ->  Sort=> final merge sort on tapes
         rows r/w merge 4948381/4691167 rows r/w effective 257199/3514320 7%
         Sort Key: md5
         ->  Parallel Seq Scan on t_10m => rows 3514321/4166700 84%
(5 rows)

test=# PROGRESS 20830;
                                PLAN PROGRESS                               
-----------------------------------------------------------------------------
 Gather Merge
   ->  Sort=> final merge sort on tapes
         rows r/w merge 5022137/4691167 rows r/w effective 330955/3514320 9%
         Sort Key: md5
         ->  Parallel Seq Scan on t_10m => rows 3514321/4166700 84%
(5 rows)

test=# PROGRESS 20830;
                                PLAN PROGRESS                                
------------------------------------------------------------------------------
 Gather Merge
   ->  Sort=> final merge sort on tapes
         rows r/w merge 5079083/4691167 rows r/w effective 387901/3514320 11%
         Sort Key: md5
         ->  Parallel Seq Scan on t_10m => rows 3514321/4166700 84%
(5 rows)

test=# PROGRESS 20830;
                                PLAN PROGRESS                                
------------------------------------------------------------------------------
 Gather Merge
   ->  Sort=> final merge sort on tapes
         rows r/w merge 5144499/4691167 rows r/w effective 453317/3514320 12%
         Sort Key: md5
         ->  Parallel Seq Scan on t_10m => rows 3514321/4166700 84%
(5 rows)

test=# PROGRESS 20830;
    PLAN PROGRESS    
----------------------
 <out of transaction>
(1 row)

test=#

The SQL query was interrupted before completion.


2/ Insert into table
=============

=> Terminal running the long SQL query:
test=# INSERT INTO T_10M SELECT generate_series(10000001, 12000000)  AS id, md5(random()::text) AS md5;

=> Terminal monitoring SQL query progression:
test=# PROGRESS 20830;
    PLAN PROGRESS    
----------------------
 <out of transaction>
(1 row)

test=#
test=# PROGRESS 20830;
     PLAN PROGRESS    
-----------------------
 Insert => rows 718161
   ->  ProjectSet
         ->  Result
(3 rows)

test=# PROGRESS 20830;
     PLAN PROGRESS     
------------------------
 Insert => rows 1370255
   ->  ProjectSet
         ->  Result
(3 rows)

test=# PROGRESS 20830;
     PLAN PROGRESS     
------------------------
 Insert => rows 1916731
   ->  ProjectSet
         ->  Result
(3 rows)

test=# PROGRESS 20830;
 PLAN PROGRESS 
----------------
 <idle backend>
(1 row)

test=#


3/ Delete with like clause 
===================

=> Terminal running the long SQL query:
test=# DELETE FROM T_10M WHERE md5 like '%cb%';

=> Terminal monitoring SQL query progression:
test=# PROGRESS 20830;
 PLAN PROGRESS 
----------------
 <idle backend>
(1 row)

test=# PROGRESS 20830;
                            PLAN PROGRESS                            
----------------------------------------------------------------------
 Delete => rows 91906
   ->  Seq Scan on t_10m => rows 91906/848485 10% blks 6668/100000 6%
         Filter: (md5 ~~ '%cb%'::text)
(3 rows)

test=# PROGRESS 20830;
                              PLAN PROGRESS                             
-------------------------------------------------------------------------
 Delete => rows 151900
   ->  Seq Scan on t_10m => rows 151900/848485 17% blks 11019/100000 11%
         Filter: (md5 ~~ '%cb%'::text)
(3 rows)

test=# PROGRESS 20830;
                              PLAN PROGRESS                             
-------------------------------------------------------------------------
 Delete => rows 309533
   ->  Seq Scan on t_10m => rows 309533/848485 36% blks 22471/100000 22%
         Filter: (md5 ~~ '%cb%'::text)
(3 rows)

test=# PROGRESS 20830;
                              PLAN PROGRESS                             
-------------------------------------------------------------------------
 Delete => rows 705968
   ->  Seq Scan on t_10m => rows 705968/848485 83% blks 51274/100000 51%
         Filter: (md5 ~~ '%cb%'::text)
(3 rows)

test=# PROGRESS 20830;
                              PLAN PROGRESS                              
--------------------------------------------------------------------------
 Delete => rows 913843
   ->  Seq Scan on t_10m => rows 913843/848485 107% blks 66417/100000 66%
         Filter: (md5 ~~ '%cb%'::text)
(3 rows)

test=# PROGRESS 20830;
                               PLAN PROGRESS                              
---------------------------------------------------------------------------
 Delete => rows 1113104
   ->  Seq Scan on t_10m => rows 1113104/848485 131% blks 80881/100000 80%
         Filter: (md5 ~~ '%cb%'::text)
(3 rows)

test=# PROGRESS 20830;
 PLAN PROGRESS 
----------------
 <idle backend>
(1 row)

test=#

The monitoring report above shows:
- the Seq Scan node, with the number of rows scanned and the number of blocks scanned/read;
- the Delete node, with the number of rows deleted.


4/ Select with offset and limit clause
===========================

=> Terminal running the long SQL query:
test=# select * from t_10M order by md5 offset 80 limit 10;

=> Terminal monitoring SQL query progression:
test=# \watch PROGRESS 20830;
Wed 19 Apr 2017 04:36:16 PM CEST (every 1s)

 PLAN PROGRESS 
----------------
 <idle backend>
(1 row)

Wed 19 Apr 2017 04:36:17 PM CEST (every 1s)

 PLAN PROGRESS 
----------------
 <idle backend>
(1 row)

                 Wed 19 Apr 2017 04:36:18 PM CEST (every 1s)

                                PLAN PROGRESS                                
------------------------------------------------------------------------------
 Limit   => offset 0% limit 0%
   ->  Sort=> loading tuples in memory 90
         Sort Key: md5
         ->  Seq Scan on t_10m => rows 174392/11586584 1% blks 1640/100000 1%
(4 rows)

...

                                  PLAN PROGRESS                                  
----------------------------------------------------------------------------------
 Limit   => offset 0% limit 0%
   ->  Sort=> loading tuples in memory 90
         Sort Key: md5
         ->  Seq Scan on t_10m => rows 1656828/11586584 14% blks 15600/100000 15%
(4 rows)

...

                                  PLAN PROGRESS                                  
----------------------------------------------------------------------------------
 Limit   => offset 0% limit 0%
   ->  Sort=> loading tuples in memory 90
         Sort Key: md5
         ->  Seq Scan on t_10m => rows 4954207/11586584 42% blks 46640/100000 46%
(4 rows)

                   Wed 19 Apr 2017 04:36:35 PM CEST (every 1s)

...

                                  PLAN PROGRESS                                  
----------------------------------------------------------------------------------
 Limit   => offset 0% limit 0%
   ->  Sort=> loading tuples in memory 90
         Sort Key: md5
         ->  Seq Scan on t_10m => rows 7837687/11586584 67% blks 73772/100000 73%
(4 rows)

                   Wed 19 Apr 2017 04:36:41 PM CEST (every 1s)

...


                                   PLAN PROGRESS                                  
-----------------------------------------------------------------------------------
 Limit   => offset 0% limit 0%
   ->  Sort=> loading tuples in memory 90
         Sort Key: md5
         ->  Seq Scan on t_10m => rows 10378786/11586584 89% blks 97690/100000 97%
(4 rows)

Wed 19 Apr 2017 04:36:49 PM CEST (every 1s)

 PLAN PROGRESS 
----------------
 <idle backend>
(1 row)


5/ Sample scan
=============

=> Terminal running the long SQL query:
# select * from t_10m tablesample system(50);

=> Terminal monitoring SQL query progression:
 PLAN PROGRESS 
----------------
 <idle backend>
(1 row)

              Wed 19 Apr 2017 04:44:12 PM CEST (every 1s)

                             PLAN PROGRESS                            
-----------------------------------------------------------------------
 Sample Scan on t_10m => rows 783274/5793292 13% blks 14616/100000 14%
   Sampling: system ('50'::real)
(2 rows)

              Wed 19 Apr 2017 04:44:13 PM CEST (every 1s)

                             PLAN PROGRESS                             
------------------------------------------------------------------------
 Sample Scan on t_10m => rows 2514675/5793292 43% blks 47076/100000 47%
   Sampling: system ('50'::real)
(2 rows)

              Wed 19 Apr 2017 04:44:14 PM CEST (every 1s)

                             PLAN PROGRESS                             
------------------------------------------------------------------------
 Sample Scan on t_10m => rows 4031400/5793292 69% blks 75625/100000 75%
   Sampling: system ('50'::real)
(2 rows)

Wed 19 Apr 2017 04:44:15 PM CEST (every 1s)

 PLAN PROGRESS 
----------------
 <idle backend>
(1 row)


2017-04-19 16:13 GMT+02:00 Remi Colinet <remi.colinet@gmail.com>:
Maksim,


2017-04-18 20:31 GMT+02:00 Maksim Milyutin <m.milyutin@postgrespro.ru>:
On 18.04.2017 17:39, Remi Colinet wrote:
Hello Maksim,

The core implementation I suggested for the new PROGRESS command uses
different functions from the ones used by EXPLAIN for its core
implementation.
Some source code is shared with the EXPLAIN command. But this shared code is
only related to quals, properties, children, subPlans and a few other nodes.

All other code for PROGRESS is new code.

I don't believe the explain.c code can be fully shared with that of the
new PROGRESS command. These 2 commands have different purposes.
The core code used for the new PROGRESS command is very different from
the core code used for EXPLAIN.


Perhaps you will be forced to duplicate significant snippets of functionality from explain.c into your progress.c.
 
Currently, little code is duplicated between the EXPLAIN and PROGRESS commands. The duplicated code could be moved to the file src/backend/commands/report.c, which is used to gather code shared between the 2 commands. I will try to complete this code sharing as much as possible.

The main point is that PROGRESS uses the same design pattern as EXPLAIN, walking the query plan tree. The workhorse of the PROGRESS command is ProgressNode(), which calls sub nodes recursively until it reaches leaf nodes such as SeqScan, IndexScan, TupleStore, Sort, Material, ... . The EXPLAIN command uses a similar workhorse, ExplainNode(), which eventually calls different leaf node functions.

Some of the leaf nodes which are common to the 2 commands have been put in the file src/backend/commands/report.c. Maybe some further code sharing is also possible for the workhorse, by using a template function which would call either EXPLAIN-specific or PROGRESS-specific leaf node functions.
 


Regarding the queryDesc state of the SQL query upon receiving a request to
report its execution progress, it does not raise any issue. The request
is noted when the signal is received by the monitored backend. Then the
backend continues along its execution code path. When interrupts are checked
in the executor code, the request is dealt with.


Yes, interrupts are checked at the CHECK_FOR_INTERRUPTS call sites.

When the request is dealt with, the monitored backend stops its
execution and reports the progress of the SQL query. Whatever the
status of the SQL query, the progress.c code checks the status and reports
either that the SQL query does not have a valid status, or otherwise the
current execution state of the SQL query.

SQL query status checking covers:
- idle transaction
- out of transaction status
- null planned statement
- utility command
- self monitoring

Other tests can be added if needed to exclude some SQL query states. Such
checking is done in HandleProgressRequest(void).
I do not see why reporting a SQL query's progress would not be possible in
this context. Even when the queryDesc is NULL, we can just report an <idle
transaction> output. This is currently the case with the suggested patch.


It's an interesting question - how usable the state of an active query is during execution. Tom Lane noticed that CHECK_FOR_INTERRUPTS doesn't give us a 100% guarantee about full consistency [1].

I wonder what you mean by a usable state.

Currently, the suggested code tests the queryDesc pointer and all the sub node pointers in order to detect NULL pointers. When the progress report is collected by the backend, this backend does the collection and consequently is not running the query. So the query tree is not being modified. At that moment, whatever the query state, we can manage to deal with its static state. It is only a tree, which could also be just a NULL pointer in the most extreme case. Such a case is dealt with in the current code.
 

So far, I've found this new command very handy. It makes it possible to
estimate the time needed to complete a SQL query.


Could you explain how you compute the percentage of execution for the nodes of the plan tree, and overall for the query?

The progress of execution of the query is computed as follows, at 2 different places for each leaf node type (Scan, IndexScan, Sort, Material, TupleStore, ...):

- one place in the executor code, or in the access methods code, or in the sort utilities code, used during the execution of the SQL query, where the following values are counted, for instance: rows R/W, blocks R/W, tapes R/W used for sorts, tuple store R/W, ... . Some of these values are already computed in the current PostgreSQL official source code. Some other values had to be added and collected.

- one place in the leaf function of each node type (ProgressScan(), ProgressSort(), ...) where percentages are computed and then dumped, together with the raw values collected during execution, into the report. The dump details can be selected with the VERBOSE option of the PROGRESS command (for instance # PROGRESS VERBOSE $pid)

For instance:

1/ read/write rows are collected when running the executor in the file src/backend/executor/execProcnode.c
==============================================================================

This is already implemented in the current official source tree. Nothing more needs to be implemented to collect values (the total row count and the number of rows already fetched are collected).
The report is implemented in the leaf function ProgressScanRows().

2/ read/write blocks are collected in the file src/backend/access/heap/heapam.c
==========================================================

This is already implemented in the current official source tree. Nothing more needs to be implemented to collect values during execution.
The report is implemented in the leaf function ProgressScanBlks().

3/ the sort progressions are collected in the file src/backend/utils/sort/tuplesort.c
==========================================================

[root@rco pg]# git diff --stat master.. src/backend/utils/sort/tuplesort.c
 src/backend/utils/sort/tuplesort.c | 142 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 127 insertions(+), 15 deletions(-)
[root@rco pg]#

New fields have been added to track the different I/O (read/write/merge/...) per tape, for instance during a sort on tapes.
The report for the Sort node is implemented in the leaf function ProgressSort().
 
4/ the tuple store progressions are computed in the file src/backend/utils/sort/tuplestore.c
=================================================================

[root@rco pg]# git diff --stat master.. src/backend/utils/sort/tuplestore.c
 src/backend/utils/sort/tuplestore.c | 73 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 56 insertions(+), 17 deletions(-)
[root@rco pg]#

New fields have been added to collect the I/O needed for such a tuple store.
The report for the TupleStore node is implemented in the leaf function ProgressTupleStore().


Other node types have been implemented: TidScan, IndexScan, LimitScan, CustomScan, Hash, ModifyTable.
Such nodes may require some new fields to collect values during the SQL query execution.

Overall, the overhead caused by the new values collected during the SQL query execution is very low.
Only a few values need to be collected.


A further improvement would be to report the memory, disk and time
resources used by the monitored backend. An overuse of memory, disk or
time resources can prevent the SQL query from completing.


This functionality is somehow implemented in explain.c. You can see my patch to this file [2]. I only manipulate runtime statistics (the data in the 'Instrumentation' structure) to achieve printing the state of a running query.

I will check your patch and try to add such a feature to the current patch.
It provides a valuable hint for estimating whether a SQL query has a chance to complete without reaching the resource limits.

1. https://www.postgresql.org/message-id/24182.1472745492%40sss.pgh.pa.us
2. https://github.com/postgrespro/pg_query_state/blob/master/runtime_explain.patch

Thanks for your suggestions and comments

Regards
Remi

--
Maksim Milyutin
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

