Thread: [HACKERS] [PATCH v2] Progress command to monitor progression of long running SQL queries

Hello,

This is version 2 of the new command, named PROGRESS, which I wrote to monitor long running SQL queries in a Postgres backend process.


New command justification
====================

The purpose of the command is to monitor long running SQL queries in a backend process to evaluate when they will complete.

First, the user needs to find the pid of the backend process executing the long running SQL query whose progression is to be monitored.
The pid can be looked up by SQL query text, session, or terminal in the view pg_stat_activity, which is the most convenient way to identify the query to be monitored.
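
For instance, the active backends and their queries can be listed with:

psql> SELECT pid, state, query FROM pg_stat_activity WHERE state = 'active';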
 
Once the pid of the backend is determined, the user may run:
psql> PROGRESS [VERBOSE|(FORMAT TEXT|JSON|XML|INLINE)] PID_OF_BACKEND

The output shows a table similar to that of the EXPLAIN command, but with further details about the execution progress of each node which may take a long time to execute.

Such nodes are SeqScan (sequential scan), IndexScan (index scan), Sort (long sorts which spill to the on-disk tape mechanism), TupleStore (on-disk stores of tuple tables), Limit, and HashJoinTable (which uses disk files to perform the HashJoin).
Other nodes build on these. For instance, Material nodes use TupleStore nodes.

The PROGRESS command can be used with the psql \watch meta-command, which re-executes the current query buffer at a fixed interval. For instance, the user may run:

psql> PROGRESS PID_OF_BACKEND
psql> \watch 2

This is a handy way to follow the progression of a SQL query over time.


Use case
=======

A use case is shown in the example below, based on a table named t_10m with 10 million rows.

The table has been created with:

CREATE TABLE T_10M ( id integer, md5 text);
INSERT INTO T_10M SELECT generate_series(1,10000000) AS id, md5(random()::text) AS md5;
ANALYSE t_10m;

ANALYSE is needed to bring the statistics up to date. They are used to compare the number of rows already fetched with the estimated total number of rows, and to report a percentage of completion for each query node.
If ANALYSE has not been run on the relations used by the monitored query, the reported percentages may be inaccurate.
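
As a rough sketch of how such a percentage can be derived (node->plan->plan_rows is the planner's pre-existing row estimate, kept fresh by ANALYSE; the plan_rows counter on the PlanState is the one added by this patch):

/* Sketch only: completion percentage for one executor node */
static int
progress_percent(PlanState *node)
{
    double estimated = node->plan->plan_rows;   /* planner row estimate */

    if (estimated <= 0)
        return 0;                               /* no usable estimate */

    /* node->plan_rows is the row counter this patch adds to PlanState */
    return Min((int) (100.0 * node->plan_rows / estimated), 100);
}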

1/ Start a first psql session to run a long SQL query:

[pgadm@rco ~]$ psql -A -d test
psql (10devel)
Type "help" for help.
test=#

The -A option selects unaligned output mode, so rows are written without formatting work.
Redirect the output to a file in order to let the query run without terminal interaction:

test=# \o out

Start a long running query:
test=# select * from t_10M order by md5;

2/ In a second psql session, list the backend pid(s) and their current SQL query (here, prg is a local convenience view over pg_stat_activity returning pid and query):

[pgadm@rco ~]$ psql -d test
psql (10devel)
Type "help" for help.

test=# select * from prg;
  pid  |               query              
-------+-----------------------------------
 26306 |
 26309 |
 26311 | select * from t_10m order by md5;
 26312 | select * from t_10m order by md5;
 26313 | select * from t_10m order by md5;
 26330 | select * from prg;
 26304 |
 26303 |
 26305 |
(9 rows)

test=#

Choose the pid of the backend running the long SQL query to be monitored.
The example above is a parallel SQL query; the lowest pid is the main backend of the query.

test=# PROGRESS 26311;

                                       PLAN PROGRESS                                       
--------------------------------------------------------------------------------------------
 status: <query running>
 query: select * from t_10m order by md5;
 time used (s): 20
 Gather Merge
   worker child: 26312
   ->  Sort =>  rows r/w merge 0/0 sort 0/1895721 0% tape blocks 10903
         Sort Key: md5
         ->  Parallel Seq Scan on t_10m => rows 1903441/4166657 45% => blks 49924/83334 59%
   worker child: 26313
   ->  Sort =>  rows r/w merge 0/0 sort 0/1967521 0% tape blocks 11314
         Sort Key: md5
         ->  Parallel Seq Scan on t_10m => rows 1975561/4166657 47% => blks 49927/83334 59%
   worker parent: 26311
   ->  Sort on tapes writing =>  rows r/w merge 0/0 sort 0/2075590 0% tape blocks 11935
         Sort Key: md5
         ->  Parallel Seq Scan on t_10m => rows 2111550/4166657 50% => blks 49928/83334 59%
 total disk space used (MB) 266
(17 rows)

test=#

The query of the monitored backend is:
test=# select * from t_10M order by md5;

Because the table has 10 million rows, the sort is done on tapes.

The output shows the progression in terms of:
- rows:
   - rows already fetched,
   - total number of rows to be fetched,
- blocks:
   - blocks already fetched,
   - total number of blocks to be fetched.

For instance, "rows 1903441/4166657 45%" above means that 1903441 out of an estimated 4166657 rows have been fetched so far, about 45% of the scan.

Each node type may have a different output.
Progression is reported in terms of rows, blocks, bytes, and percentages.

Sort nodes also show the tape blocks being used and the current step phase (merge/sort).


Implementation of the command
========================

The design of the command is:

- the user issues the "PROGRESS pid" command from a psql session.
The pid is that of the backend running the SQL query whose progression is to be reported. It can be found in the view pg_stat_activity.

- the monitoring backend, upon receiving the "PROGRESS pid" command from the psql utility, sends a signal to the backend whose pid was given in the command, with reason PROCSIG_PROGRESS defined in src/include/storage/procsignal.h

- the monitored backend receives the signal, notes the progression request as for any interrupt, and continues executing its SQL query until interrupts can be serviced.

- when the monitored backend can service interrupts, it handles the progress request by collecting, from its execution tree, the execution progress of each long running node. At this point the SQL query must be in a consistent state, with no node being allocated or freed; this is enforced by flags added in the executor, and progression is only collected while the backend is inside ExecutorRun(). If the execution tree is in a consistent state, it is dumped into shared memory pages allocated at server start. The monitored backend then sets a latch, on which the monitoring backend is waiting, and resumes executing its SQL query.

- the monitoring backend collects the data dumped in shared memory, and sends it to its psql session as a list of rows (a minimal sketch of this handshake follows below).
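
A minimal sketch of the handshake, for illustration only: SendProcSignal(), PROCSIG_PROGRESS and the latch calls are the real APIs involved; the flag and helper names are hypothetical stand-ins for the patch's code.

/* Monitoring backend: raise the request and wait for the answer */
SendProcSignal(pid, PROCSIG_PROGRESS, backendId);
(void) WaitLatch(MyLatch,
                 WL_LATCH_SET | WL_TIMEOUT,
                 5000L,                              /* give up after 5 seconds */
                 PG_WAIT_EXTENSION);
ResetLatch(MyLatch);
/* ... then read the report dumped in shared memory ... */

/* Monitored backend: at the next point where interrupts are serviced */
if (ProgressRequestPending && ExecutorIsRunning)         /* hypothetical flags */
{
    ProgressRequestPending = false;
    ProgressDumpPlanState(ActiveQueryDesc->planstate);   /* hypothetical dump */
    SetLatch(&requestorProc->procLatch);                 /* wake the requestor */
}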

The PROGRESS command does not slow down the running query: the execution progression is only computed upon receiving a progress request, which is expected to be infrequent.

The code heavily reuses that of the EXPLAIN command. In order to share as much code as possible with EXPLAIN, the parts of the EXPLAIN code which deal with reporting (quals for instance) have been moved to a new report.c file in the src/backend/commands folder. The code in report.c is shared between explain.c and the PROGRESS command implementation in progress.c.

The progression reported by the PROGRESS command is given in terms of rows, blocks, bytes and percentages. The values displayed depend on the node type in the execution plan. The current patch covers all node types which could take a long time to execute.

Parallel queries can also be monitored. The same mechanism is used to monitor child workers, with a slight difference: the main worker requests the progression of its children directly, in order to dump the whole progress tree in shared memory.


Patch
=====

The patch for version 2 is attached to the current email.
The patch has been rebased on latest commits.

The patch is also available at : https://github.com/colinet/progress/tree/progress

The diff stat of the patch is:

[root@rco pg]# git diff --stat master..
 contrib/auto_explain/auto_explain.c       |    5 +-
 contrib/postgres_fdw/postgres_fdw.c       |   13 +-
 src/backend/access/heap/heapam.c          |    2 +
 src/backend/access/transam/parallel.c     |    2 +-
 src/backend/commands/Makefile             |    3 +-
 src/backend/commands/explain.c            | 2834 ++++++++++++++-----------------------------------------------------------------------------------------------
 src/backend/commands/prepare.c            |    5 +-
 src/backend/commands/progress.c           | 2035 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 src/backend/commands/report.c             | 2252 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 src/backend/executor/execMain.c           |    8 +
 src/backend/executor/execProcnode.c       |   31 ++
 src/backend/executor/nodeBitmapHeapscan.c |   13 +-
 src/backend/executor/nodeIndexonlyscan.c  |   13 +-
 src/backend/executor/nodeIndexscan.c      |   15 +-
 src/backend/executor/nodeSamplescan.c     |   12 +-
 src/backend/executor/nodeSeqscan.c        |   16 +-
 src/backend/nodes/bitmapset.c             |   19 +
 src/backend/nodes/outfuncs.c              |  245 ++++++++++
 src/backend/parser/gram.y                 |   79 +++-
 src/backend/postmaster/postmaster.c       |    1 +
 src/backend/storage/file/buffile.c        |   73 +++
 src/backend/storage/file/fd.c             |   15 +
 src/backend/storage/ipc/ipci.c            |    3 +
 src/backend/storage/ipc/procarray.c       |   57 +++
 src/backend/storage/ipc/procsignal.c      |    4 +
 src/backend/storage/lmgr/lwlock.c         |    7 +-
 src/backend/storage/lmgr/lwlocknames.txt  |    1 +
 src/backend/tcop/postgres.c               |   65 ++-
 src/backend/tcop/pquery.c                 |   25 +
 src/backend/tcop/utility.c                |   10 +
 src/backend/utils/init/globals.c          |   12 +
 src/backend/utils/sort/logtape.c          |   18 +
 src/backend/utils/sort/tuplesort.c        |  153 +++++-
 src/backend/utils/sort/tuplestore.c       |   75 ++-
 src/include/access/parallel.h             |    1 +
 src/include/commands/explain.h            |   67 +--
 src/include/commands/prepare.h            |    2 +-
 src/include/commands/report.h             |  159 +++++++
 src/include/executor/execdesc.h           |    3 +
 src/include/executor/progress.h           |   52 ++
 src/include/foreign/fdwapi.h              |   10 +-
 src/include/miscadmin.h                   |    2 +
 src/include/nodes/bitmapset.h             |    1 +
 src/include/nodes/execnodes.h             |    3 +
 src/include/nodes/extensible.h            |    6 +-
 src/include/nodes/nodes.h                 |    8 +
 src/include/nodes/parsenodes.h            |   11 +
 src/include/nodes/plannodes.h             |   11 +
 src/include/parser/kwlist.h               |    2 +
 src/include/pgstat.h                      |    3 +-
 src/include/storage/buffile.h             |   11 +
 src/include/storage/fd.h                  |    2 +
 src/include/storage/procarray.h           |    3 +
 src/include/storage/procsignal.h          |    3 +
 src/include/tcop/pquery.h                 |    1 +
 src/include/utils/logtape.h               |    2 +
 src/include/utils/tuplesort.h             |   72 ++-
 src/include/utils/tuplestore.h            |   35 ++
 58 files changed, 5972 insertions(+), 2619 deletions(-)
[root@rco pg]#


The PROGRESS command can be used with the \watch command of psql, making it handier to monitor a long running query.
The default output format of the PROGRESS command is TEXT INLINE. As with EXPLAIN, the format can be set to JSON, XML, or TEXT.


Use cases
========

Some further examples of use are shown in the attached test_v2.txt file, with parallel queries, sorts, hashes, scans, and JSON format output.


TODO
=====

Monitor the progress of utility commands such as CREATE INDEX, which may take a long time to complete.
Write documentation.


Changelog
========

v2:
Added JSON, XML, TEXT INLINE format output
Added time consumed to run SQL queries
Added SQL query being monitored
Added support for parallel queries
Added disk used by the queries
Rebased on 9th May 2017 commits.

v1:
Initial version with only TEXT format reporting



Any suggestion, comment or feedback is welcome.

Regards
Remi


On Wed, May 10, 2017 at 06:40:31PM +0200, Remi Colinet wrote:
> Hello,
> 
> This is version 2 of the new command name PROGRESS which I wrote in
> order to monitor long running SQL queries in a Postgres backend
> process.

Perhaps I missed something important in the discussion, but was there
a good reason that this isn't a function callable from SQL, i.e. not
restricted to the psql client?

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Hi

2017-05-10 18:40 GMT+02:00 Remi Colinet <remi.colinet@gmail.com>:
Hello,

This is version 2 of the new command, named PROGRESS, which I wrote to monitor long running SQL queries in a Postgres backend process.

This patch introduces a lot of reserved keywords. Why? It can break a lot of applications. Could you not enhance the EXPLAIN command instead?

Regards

Pavel

On Thu, May 11, 2017 at 1:40 AM, Remi Colinet <remi.colinet@gmail.com> wrote:
> This is version 2 of the new command name PROGRESS which I wrote in order to
> monitor long running SQL queries in a Postgres backend process.

It would be a good idea to add this patch to the next commit fest if
you are expecting some feedback:
https://commitfest.postgresql.org/14/
But please don't expect immediate feedback, the primary focus these
days is to stabilize the upcoming release.

> New command justification
> ====================
>
> [root@rco pg]# git diff --stat master..
> [...]
>  58 files changed, 5972 insertions(+), 2619 deletions(-)

For your first patch, you may want something less... Challenging.
Please note as well that it would be nice if you review other patches
to get more familiar with the community process, it is expected that
for each patch submitted, an equivalent amount of review is done.
That's hard to measure but for a patch as large as that you will need
to pick up at least one patch equivalent in difficulty.

Regarding the patch, this is way too close to the progress facility
already in place. So why don't you extend it for the executor? We can
likely come up with something that can help, though I see the part
where the plan run by a backend is shared among all processes as a
major performance bottleneck. You have at least three different
concepts here:
- Store plans run in shared memory to allow access to any other processes.
- Allow a process to look at the plan run by another one with a SQL interface.
- Track the progress of a running plan, via pg_stat_activity.
All in all, I think that a new command is not adapted, and that you
had better split each concept before implementing something
over-engineered like the patch attached.
-- 
Michael



That's a good point.

I will probably convert the new command to a SQL function.

I was hesitating between a table and a SQL function. Oracle uses v$session_longops to track the progression of long running SQL queries which have run for more than 6 seconds.
But a function is much better for providing parameters such as the backend pid and a format specification for the output.

Regards
Remi

2017-05-10 21:52 GMT+02:00 David Fetter <david@fetter.org>:
On Wed, May 10, 2017 at 06:40:31PM +0200, Remi Colinet wrote:
> Hello,
>
> This is version 2 of the new command name PROGRESS which I wrote in
> order to monitor long running SQL queries in a Postgres backend
> process.

Perhaps I missed something important in the discussion, but was there
a good reason that this isn't a function callable from SQL, i.e. not
restricted to the psql client?

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



2017-05-11 4:05 GMT+02:00 Michael Paquier <michael.paquier@gmail.com>:
On Thu, May 11, 2017 at 1:40 AM, Remi Colinet <remi.colinet@gmail.com> wrote:
> This is version 2 of the new command name PROGRESS which I wrote in order to
> monitor long running SQL queries in a Postgres backend process.

It would be a good idea to add this patch to the next commit fest if
you are expecting some feedback:
https://commitfest.postgresql.org/14/
But please don't expect immediate feedback, the primary focus these
days is to stabilize the upcoming release.

Yes, I understand. I will submit the patch to the commit fest once I have had some feedback.
I have already received some interesting suggestions, such as using a SQL function instead of a whole new command.

The purpose for the moment is to gather feedback and advice about this feature.

> New command justification
> ====================
>
> [root@rco pg]# git diff --stat master..
> [...]
>  58 files changed, 5972 insertions(+), 2619 deletions(-)

For your first patch, you may want something less... Challenging.
Please note as well that it would be nice if you review other patches
to get more familiar with the community process, it is expected that
for each patch submitted, an equivalent amount of review is done.
That's hard to measure but for a patch as large as that you will need
to pick up at least one patch equivalent in difficulty.

Acked


Regarding the patch, this is way too close to the progress facility
already in place. So why don't you extend it for the executor? We can
likely come up with something that can help, though I see the part
where the plan run by a backend is shared among all processes as a
major performance bottleneck. You have at least three different
concepts here:
- Store plans run in shared memory to allow access to any other processes.
- Allow a process to look at the plan run by another one with a SQL interface.
- Track the progress of a running plan, via pg_stat_activity.
All in all, I think that a new command is not adapted, and that you
had better split each concept before implementing something
over-engineered like the patch attached.

I initially had a look at the progress facility, but it seemed to me very far from the current goal.
That said, I will reconsider it; things may have changed.

The plan and its planstate are not permanently shared with other backends.
This would be useless and very costly.

As long as the PROGRESS command is not used, the patch has a very low impact on the executor.
A few counters and flags are added/updated by the patch in the executor. This is all that is needed to track the progress of execution.

For instance in src/backend/executor/execProcnode.c, I've added:

ExecInitNode()
+ result->plan_rows = 0;    /* reset the per-node row counter */

ExecProcNode()
+ node->plan_rows++;        /* count each row the node returns */

ExecEndNode()
+ node->plan_rows = 0;      /* clear the counter when execution ends */

This is enough to compute the number of rows fetched by the plan at each step of its execution.
A few more counters have been added to track sorts, scans, hashes, and tuple stores.

The disk space used by the nodes is also tracked. This shows the disk resources needed to complete a SQL query.
This was not initially a purpose of the patch, but I find it very handy.

Some debugging code will also be removed, making the change in the executor very small.
The code change in the executor is less than 50 lines; about 150 lines are added to the tuple sort/store code, mainly to add functions which fetch the sort and store status.
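
For the sort side, core already exposes tuplesort_get_stats(), which EXPLAIN ANALYZE uses; a status-fetching function of the kind added here can be sketched along the same lines (illustration only, not the patch's actual code):

/* Sketch: querying a sort's current status via the existing stats API */
const char *sortMethod;
const char *spaceType;
long        spaceUsed;

tuplesort_get_stats(state, &sortMethod, &spaceType, &spaceUsed);
elog(DEBUG1, "sort status: method=%s, space=%ld kB (%s)",
     sortMethod, spaceUsed, spaceType);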

The execution tree is dumped in shared memory only when requested by a monitoring backend sending a signal to the monitored backend.
The monitoring backend first sends a signal, which is noted by the backend running the long SQL query.
Once the long SQL query can deal with the request (interrupts enabled), it dumps its execution tree in shared memory and then resumes its execution.

So the dump of the execution tree is done only when it is requested.
This does not seem to incur any performance hit, unless I missed something.

Once the concept is mature, I will split the patch into small consistent parts to allow a smooth and clean merge.

Regards
Remi
  
--
Michael

Thx for your comment.

Based on other comments, I will probably convert the command into a SQL function.
This allows passing arguments such as the one which can be used in the current command (VERBOSE, FORMAT).

This will avoid keyword collisions.

Regards
Remi

2017-05-10 22:04 GMT+02:00 Pavel Stehule <pavel.stehule@gmail.com>:
Hi

2017-05-10 18:40 GMT+02:00 Remi Colinet <remi.colinet@gmail.com>:
Hello,

This is version 2 of the new command, named PROGRESS, which I wrote to monitor long running SQL queries in a Postgres backend process.

This patch introduces a lot of reserved keywords. Why? It can break a lot of applications. Could you not enhance the EXPLAIN command instead?

Regards

Pavel


On Thu, May 11, 2017 at 05:24:16PM +0200, Remi Colinet wrote:
> 2017-05-10 21:52 GMT+02:00 David Fetter <david@fetter.org>:
> 
> > On Wed, May 10, 2017 at 06:40:31PM +0200, Remi Colinet wrote:
> > > Hello,
> > >
> > > This is version 2 of the new command name PROGRESS which I wrote in
> > > order to monitor long running SQL queries in a Postgres backend
> > > process.
> >
> > Perhaps I missed something important in the discussion, but was there
> > a good reason that this isn't a function callable from SQL, i.e. not
> > restricted to the psql client?

Please don't top post. http://www.caliburn.nl/topposting.html

> That's a good point.
> 
> I will probably convert the new command to a SQL function.

Great!

> I was hesitating between a table and a SQL function. Oracle uses
> v$session_longops to track the progression of long running SQL
> queries which have run for more than 6 seconds.  But a function is
> much better for providing parameters such as the backend pid and a
> format specification for the output.

Once it's a function, it can very easily be called in a system (or
user-defined) view.

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



On Wed, May 10, 2017 at 10:05 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> Regarding the patch, this is way too close to the progress facility
> already in place. So why don't you extend it for the executor?

I don't think that's a good idea.  The existing progress facility
advertises a fixed number of counters with a command-type-specific
interpretation, but for *query* progress reporting, we really need a
richer model.  A query plan can contain an unbounded number of
executor nodes and Remi's idea is, quite reasonably, to report
something about each one.

From a design point of view, I think a patch like this has to clear a
number of hurdles.  It needs to:

1. Decide what data to expose.  The sample output looks reasonable in
this case, although the amount of code churn looks really high.

2. Find a way to advertise the data that it exposes.  This patch looks
like it allocates a half-MB of shared memory per backend and gives up
if that's not enough.

3. Find a way to synchronize the access to the data that it exposes.
This patch ignores that problem completely, so multiple progress
report commands running at the same time will behave unpredictably.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Wed, May 10, 2017 at 10:10 PM, Remi Colinet <remi.colinet@gmail.com> wrote:
>
> Parallel queries can also be monitored. The same mechanism is used to monitor
> child workers with a slight difference: the main worker requests the child
> progression directly in order to dump the whole progress tree in shared
> memory.
>

What if there is any error in the worker (like "out of memory") while
gathering the statistics?  It seems both for workers as well as for
the main backend it will just error out.  I am not sure if it is a
good idea to error out the backend or parallel worker as it will just
end the query execution.  Also, even if it is okay, there doesn't seem
to be a way by which a parallel worker can communicate the error back
to the master backend; rather it will just exit silently, which is not
right.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



On Sat, May 13, 2017 at 10:53 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, May 10, 2017 at 10:05 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> Regarding the patch, this is way too close to the progress facility
>> already in place. So why don't you extend it for the executor?
>
> I don't think that's a good idea.  The existing progress facility
> advertises a fixed number of counters with a command-type-specific
> interpretation, but for *query* progress reporting, we really need a
> richer model.  A query plan can contain an unbounded number of
> executor nodes and Remi's idea is, quite reasonably, to report
> something about each one.

I see what you're getting at. As st_progress_param is upper-bounded,
what should actually be used as the report structure is a tree, with
one planner node attached to each node of the report tree. The
reporting facility would then need to make the reported data
persistent (in this case a set of planner nodes), and also to persist
the changes on the tree reported to the user. Then, when a backend
reports its execution status, all it does is update a portion of the
tree in shared memory. Lock contention may be a problem though;
perhaps things could be made atomic?

> From a design point of view, I think a patch like this has to clear a
> number of hurdles.  It needs to:
>
> 1. Decide what data to expose.  The sample output looks reasonable in
> this case, although the amount of code churn looks really high.
>
> 2. Find a way to advertise the data that it exposes.  This patch looks
> like it allocates a half-MB of shared memory per backend and gives up
> if that's not enough.

Perhaps DSM? It is not user-friendly to fail sporadically...

> 3. Find a way to synchronize the access to the data that it exposes.
> This patch ignores that problem completely, so multiple progress
> report commands running at the same time will behave unpredictably.

Yeah..
-- 
Michael



On Tue, May 16, 2017 at 2:17 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> Perhaps DSM? It is not user-friendly to fail sporadically...

Yeah.  I've been thinking we might want to give each backend a
backend-lifetime DSA that is created on first use.  That could be
useful for some parallel query stuff and maybe for this as well.
Other backends could attach to it to read data from it, and it would
go away on last detach (which would normally be the detach by the
creating backend, but not if somebody else happens to be reading at
the moment the backend exits).
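
For illustration, the pattern might look like this with the existing DSA API (where the handle gets advertised is left open; the per-backend slot and the dsa_pointer below are hypothetical):

/* Owning backend, on first use */
dsa_area   *area = dsa_create(my_tranche_id);    /* tranche registered elsewhere */
dsa_pin_mapping(area);                           /* stay mapped for backend lifetime */
MyProcSlot->progress_dsa = dsa_get_handle(area); /* hypothetical per-backend slot */

/* Reading backend */
dsa_area   *peer = dsa_attach(MyProcSlot->progress_dsa);
char       *report = dsa_get_address(peer, report_dp);  /* report_dp published likewise */
/* ... read the report ... */
dsa_detach(peer);                                /* area goes away on last detach */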

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Tue, May 16, 2017 at 7:26 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, May 16, 2017 at 2:17 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> Perhaps DSM? It is not user-friendly to fail sporadically...

Yeah.  I've been thinking we might want to give each backend a
backend-lifetime DSA that is created on first use.  That could be
useful for some parallel query stuff and maybe for this as well.
Other backends could attach to it to read data from it, and it would
go away on last detach (which would normally be the detach by the
creating backend, but not if somebody else happens to be reading at
the moment the backend exits).

That seems like a pretty good idea. I've been considering something similar for the use case of being able to view the "collected but not sent yet" statistics inside a backend (which tables have been accessed, number of reads etc), and this seems like it could be used for that as well.


--
2017-05-13 3:53 GMT+02:00 Robert Haas <robertmhaas@gmail.com>:
On Wed, May 10, 2017 at 10:05 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> Regarding the patch, this is way too close to the progress facility
> already in place. So why don't you extend it for the executor?

I don't think that's a good idea.  The existing progress facility
advertises a fixed number of counters with a command-type-specific
interpretation, but for *query* progress reporting, we really need a
richer model.  A query plan can contain an unbounded number of
executor nodes and Remi's idea is, quite reasonably, to report
something about each one.
 
Oracle's way to report the progress of long queries is the v$session_longops view, which collects progress reports for any backend that has been running a SQL query for more than 6 seconds. But it's probably even better to use a SQL function, which for instance allows a specific pid parameter to limit the report to a given backend. Users are expected to focus on one long running SQL query at a time.

From a design point of view, I think a patch like this has to clear a
number of hurdles.  It needs to:

1. Decide what data to expose.  The sample output looks reasonable in
this case, although the amount of code churn looks really high.

I will probably remove:
- the verbose option (PROGRESS VERBOSE PID), which exposes very low-level details of the tapes used by Sort nodes and by HashJoinTable nodes.
- the JSON and XML output formats, because the command will be replaced by a SQL function returning a row set.
This should make the code much smaller and avoid any change to the EXPLAIN code (shared code so far).


2. Find a way to advertise the data that it exposes.  This patch looks
like it allocates a half-MB of shared memory per backend and gives up
if that's not enough.
 
Maybe it needs dynamic allocation of shared memory instead of static allocation at server start. This would spare shared memory and avoid failures when the report is larger than the pre-allocated shared memory buffer.
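
As a sketch with the existing DSM API (the report sizing and handle publication are assumptions):

/* Monitored backend: size the segment to the report, publish the handle */
dsm_segment *seg = dsm_create(report_size, 0);
dsm_handle   h   = dsm_segment_handle(seg);      /* publish h, e.g. in shared memory */

/* Monitoring backend: attach, read, detach */
dsm_segment *rseg = dsm_attach(h);
char        *buf  = dsm_segment_address(rseg);
/* ... copy the report out ... */
dsm_detach(rseg);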
 

3. Find a way to synchronize the access to the data that it exposes.
This patch ignores that problem completely, so multiple progress
report commands running at the same time will behave unpredictably.

So far, I've been using a single lock in shared memory to synchronize the requests for SQL progress reports.

LWLockAcquire(ProgressLock, LW_EXCLUSIVE);          /* serialize progress requests */
...
SendProcSignal(stmt->pid, PROCSIG_PROGRESS, bid);   /* ask the target backend to dump its state */
...
LWLockRelease(ProgressLock);

It's expected that only a few DBAs would use such a feature at the same time. The lock would need to be released if the requesting backend fails; note that LWLocks are released automatically during error recovery.

I've also realized that access control for progress reports needs to be added.

Thx & regards
Remi

 
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


2017-05-13 14:38 GMT+02:00 Amit Kapila <amit.kapila16@gmail.com>:
On Wed, May 10, 2017 at 10:10 PM, Remi Colinet <remi.colinet@gmail.com> wrote:
>
> Parallel queries can also be monitored. The same mecanism is used to monitor
> child workers with a slight difference: the main worker requests the child
> progression directly in order to dump the whole progress tree in shared
> memory.
>

What if there is any error in the worker (like "out of memory") while
gathering the statistics?  It seems both for workers as well as for
the main backend it will just error out.  I am not sure if it is a
good idea to error out the backend or parallel worker as it will just
end the query execution. 

The handling of a progress report starts with the creation of a MemoryContext attached to CurrentMemoryContext. Then a small amount of memory (a few KB) is allocated. The handling of the progress report could indeed exhaust memory and fail the backend request, but in such a situation the backend could also have failed even without any progress request.
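
A sketch of that setup with the standard allocator API (the context name is illustrative):

/* Short-lived context for building the progress report */
MemoryContext progress_cxt = AllocSetContextCreate(CurrentMemoryContext,
                                                   "ProgressReport",
                                                   ALLOCSET_DEFAULT_SIZES);
MemoryContext old_cxt = MemoryContextSwitchTo(progress_cxt);
/* ... build the report; running out of memory here raises ERROR as usual ... */
MemoryContextSwitchTo(old_cxt);
MemoryContextDelete(progress_cxt);   /* release the few KB in one shot */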
 
Also, even if it is okay, there doesn't seem
to be a way by which a parallel worker can communicate the error back
to the master backend; rather it will just exit silently, which is not
right.

If a child worker fails, it will not respond to the main backend's request. The main backend will resume its execution after a 5 second timeout (maybe as a GUC parameter to be added), in which case the report would be only partially filled.

If the main backend fails, the requesting backend will get a response such as:

test=# PROGRESS 14611;
 PLAN PROGRESS 
----------------
 <backend timeout>
(1 row)

test=# 

and the child workers will still write their responses to shared memory. Those responses will simply not be collected by the main backend which has failed.
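
A sketch of how the requesting side can distinguish a timely response from a timeout (WaitLatch() and its flags are the real API; the 5 second value is the timeout discussed above):

int rc = WaitLatch(MyLatch,
                   WL_LATCH_SET | WL_TIMEOUT,
                   5000L,                        /* the 5 second timeout */
                   PG_WAIT_EXTENSION);
ResetLatch(MyLatch);

if (rc & WL_TIMEOUT)
{
    /* no answer: report "<backend timeout>" to the user */
}
else
{
    /* latch was set: collect the report from shared memory */
}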

Thx & regards
Remi


--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


2017-05-16 8:17 GMT+02:00 Michael Paquier <michael.paquier@gmail.com>:
On Sat, May 13, 2017 at 10:53 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, May 10, 2017 at 10:05 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> Regarding the patch, this is way to close to the progress facility
>> already in place. So why don't you extend it for the executor?
>
> I don't think that's a good idea.  The existing progress facility
> advertises a fixed number of counters with a command-type-specific
> interpretation, but for *query* progress reporting, we really need a
> richer model.  A query plan can contain an unbounded number of
> executor nodes and Remi's idea is, quite reasonably, to report
> something about each one.

I see what you're getting at. As st_progress_param is upper-bounded,
what should actually be used as the report structure is a tree, with
one planner node attached to each node of the report tree. The
reporting facility would then need to make the reported data
persistent (in this case a set of planner nodes), and also to persist
the changes on the tree reported to the user. Then, when a backend
reports its execution status, all it does is update a portion of the
tree in shared memory. Lock contention may be a problem though;
perhaps things could be made atomic?

In order to report the progress of a SQL query, the progress request handler walks the execution tree of the query. No new node is ever created. The execution tree has been slightly modified to collect the number of rows/blocks returned during the executor run. The progress state is only collected when requested by a user, and it is dumped at that moment into shared memory.

As long as nobody asks for a progress report, there is no overhead apart from a few counters updated during the executor run.
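
The walk itself can be sketched with the core planstate_tree_walker() helper (a real API; the context type, the collector, and the plan_rows counter are the patch's own or hypothetical):

/* Sketch: collecting per-node progress by walking the planstate tree */
static bool
progress_walker(PlanState *ps, ProgressState *ctx)    /* ProgressState: hypothetical */
{
    ProgressCollectNode(ctx, ps, ps->plan_rows);      /* record rows fetched so far */
    return planstate_tree_walker(ps, progress_walker, ctx);   /* recurse into children */
}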

Regards
 

> From a design point of view, I think a patch like this has to clear a
> number of hurdles.  It needs to:
>
> 1. Decide what data to expose.  The sample output looks reasonable in
> this case, although the amount of code churn looks really high.
>
> 2. Find a way to advertise the data that it exposes.  This patch looks
> like it allocates a half-MB of shared memory per backend and gives up
> if that's not enough.

Perhaps DSM? It is not user-friendly to fail sporadically...

> 3. Find a way to synchronize the access to the data that it exposes.
> This patch ignores that problem completely, so multiple progress
> report commands running at the same time will behave unpredictably.

Yeah..
--
Michael

On Wed, May 17, 2017 at 9:43 PM, Remi Colinet <remi.colinet@gmail.com> wrote:
>
> 2017-05-13 14:38 GMT+02:00 Amit Kapila <amit.kapila16@gmail.com>:
>>
>> On Wed, May 10, 2017 at 10:10 PM, Remi Colinet <remi.colinet@gmail.com>
>> wrote:
>> >
>> > Parallel queries can also be monitored. The same mechanism is used to
>> > monitor
>> > child workers with a slight difference: the main worker requests the
>> > child
>> > progression directly in order to dump the whole progress tree in shared
>> > memory.
>> >
>>
>> What if there is any error in the worker (like "out of memory") while
>> gathering the statistics?  It seems both for workers as well as for
>> the main backend it will just error out.  I am not sure if it is a
>> good idea to error out the backend or parallel worker as it will just
>> end the query execution.
>
>
> The handling of a progress report starts with the creation of a MemoryContext
> attached to CurrentMemoryContext. Then a small amount of memory (a few KB) is
> allocated. The handling of the progress report could indeed exhaust memory and
> fail the backend request, but in such a situation the backend could also
> have failed even without any progress request.
>
>>
>> Also, even if it is okay, there doesn't seem
>> to be a way by which a parallel worker can communicate the error back
>> to the master backend; rather it will just exit silently, which is not
>> right.
>
>
> If a child worker fails, it will not respond to the main backend's request.
> The main backend will resume its execution after a 5 second timeout (maybe
> as a GUC parameter to be added), in which case the report would be only
> partially filled.
>
> If the main backend fails, the requesting backend will get a response such
> as:
>
> test=# PROGRESS 14611;
>  PLAN PROGRESS
> ----------------
>  <backend timeout>
> (1 row)
>
> test=#
>
> and the child workers will still write their responses to shared memory.
> Those responses will simply not be collected by the main backend which has
> failed.
>

If the worker errors out for any reason, we should end the main
query as well; otherwise it can produce wrong results.  See the error
handling of workers in HandleParallelMessage().


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com