Thread: Duplicate Workers entries in some EXPLAIN plans

Duplicate Workers entries in some EXPLAIN plans

From

Maciek Sakrejda

Date:

22 October 2019, 18:58:35

Hello,

I originally reported this in pgsql-bugs [0], but there wasn't much feedback there, so moving the discussion here. When using JSON, YAML, or XML-format EXPLAIN on a plan that uses a parallelized sort, the Sort nodes list two different entries for "Workers", one for the sort-related info, and one for general worker info. This is what this looks like in JSON (some details elided):

{
"Node Type": "Sort",
...
"Workers": [
{
"Worker Number": 0,
"Sort Method": "external merge",
"Sort Space Used": 20128,
"Sort Space Type": "Disk"
},
{
"Worker Number": 1,
"Sort Method": "external merge",
"Sort Space Used": 20128,
"Sort Space Type": "Disk"
}
],
...
"Workers": [
{
"Worker Number": 0,
"Actual Startup Time": 309.726,
"Actual Total Time": 310.179,
"Actual Rows": 4128,
"Actual Loops": 1,
"Shared Hit Blocks": 2872,
"Shared Read Blocks": 7584,
"Shared Dirtied Blocks": 0,
"Shared Written Blocks": 0,
"Local Hit Blocks": 0,
"Local Read Blocks": 0,
"Local Dirtied Blocks": 0,
"Local Written Blocks": 0,
"Temp Read Blocks": 490,
"Temp Written Blocks": 2529
},
{
"Worker Number": 1,
"Actual Startup Time": 306.523,
"Actual Total Time": 307.001,
"Actual Rows": 4128,
"Actual Loops": 1,
"Shared Hit Blocks": 3356,
"Shared Read Blocks": 7100,
"Shared Dirtied Blocks": 0,
"Shared Written Blocks": 0,
"Local Hit Blocks": 0,
"Local Read Blocks": 0,
"Local Dirtied Blocks": 0,
"Local Written Blocks": 0,
"Temp Read Blocks": 490,
"Temp Written Blocks": 2529
}
],
"Plans:" ...
}

This is technically valid JSON, but it's extremely difficult to work with, since default JSON parsing in Ruby, node, Python, Go, and even Postgres' own jsonb only keep the latest key--the sort information is discarded (other languages probably don't fare much better; this is what I had on hand). As Tom Lane pointed out in my pgsql-bugs thread, this has been reported before [1] and in that earlier thread, Andrew Dunstan suggested that perhaps the simplest solution is to just rename the sort-related Workers node. Tom expressed some concerns about a breaking change here, though I think the current behavior means vanishingly few users are parsing this data correctly. Thoughts?

Thanks,

Maciek

[0]: https://www.postgresql.org/message-id/CADXhmgSr807j2Pc9aUjW2JOzOBe3FeYnQBe_f9U%2B-Mm4b1HRUw%40mail.gmail.com

[1]: https://www.postgresql.org/message-id/flat/41ee53a5-a36e-cc8f-1bee-63f6565bb1ee@dalibo.com

Re: Duplicate Workers entries in some EXPLAIN plans

From

Andres Freund

Date:

25 October 2019, 01:48:21

Hi,

On 2019-10-22 11:58:35 -0700, Maciek Sakrejda wrote:
> I originally reported this in pgsql-bugs [0], but there wasn't much
> feedback there, so moving the discussion here. When using JSON, YAML, or
> XML-format EXPLAIN on a plan that uses a parallelized sort, the Sort nodes
> list two different entries for "Workers", one for the sort-related info,
> and one for general worker info. This is what this looks like in JSON (some
> details elided):
>
> {
>   "Node Type": "Sort",
>   ...
>   "Workers": [
>     {
>       "Worker Number": 0,
>       "Sort Method": "external merge",
>       "Sort Space Used": 20128,
>       "Sort Space Type": "Disk"
>     },
>     {
>       "Worker Number": 1,
>       "Sort Method": "external merge",
>       "Sort Space Used": 20128,
>       "Sort Space Type": "Disk"
>     }
>   ],
>   ...
>   "Workers": [
>     {

> This is technically valid JSON, but it's extremely difficult to work with,
> since default JSON parsing in Ruby, node, Python, Go, and even Postgres'
> own jsonb only keep the latest key

It's also quite confusing.

> As Tom Lane pointed out in my pgsql-bugs thread, this has been
> reported before [1] and in that earlier thread, Andrew Dunstan suggested
> that perhaps the simplest solution is to just rename the sort-related
> Workers node.  Thoughts?

Yea, I think we should fix this. The current output basically makes no
sense.

> Tom expressed some concerns about a breaking change here,
> though I think the current behavior means vanishingly few users are parsing
> this data correctly.

Well, in a lot of the cases there's no parallel output for the sort, and
in other cases BUFFERS is not specified. In either case the 'duplicate
key' problem won't exist then.

While Tom said:

On 2019-10-16 09:16:56 +0200, Tom Lane wrote:
> I think the text-mode output is intentional, but the other formats
> need more work.

  Sort Method: external merge  Disk: 4920kB
  Worker 0:  Sort Method: external merge  Disk: 5880kB
  Worker 1:  Sort Method: external merge  Disk: 5920kB
  Buffers: shared hit=682 read=10188, temp read=1415 written=2101
  Worker 0: actual time=130.058..130.324 rows=1324 loops=1
    Buffers: shared hit=337 read=3489, temp read=505 written=739
  Worker 1: actual time=130.273..130.512 rows=1297 loops=1
    Buffers: shared hit=345 read=3507, temp read=505 written=744

I don't think this is close to being good enough to be worth
preserving. I think it's worth avoiding unnecessary breakage of explain
output, but we also shouldn't endlessly carry forward confusing output,
just because of that.

It clearly seems like it'd be better if this instead were

  Sort Method: external merge  Disk: 4920kB
  Buffers: shared hit=682 read=10188, temp read=1415 written=2101
  Worker 0: actual time=130.058..130.324 rows=1324 loops=1
    Sort Method: external merge  Disk: 5880kB
    Buffers: shared hit=337 read=3489, temp read=505 written=739
  Worker 1: actual time=130.273..130.512 rows=1297 loops=1
    Buffers: shared hit=345 read=3507, temp read=505 written=744
    Sort Method: external merge  Disk: 5920kB

I think the way this information was added in bf11e7ee2e36 and
33001fd7a707, contrasting to the output added in b287df70e408, is just
not right. If we add similar instrumentation reporting to more nodes,
we'll end up with duplicated information all over.  Additionally the
per-worker part of show_sort_info() basically just duplicated the rest
of the function.  I then also did something similar (although luckily
with a different key...), with the ExplainPrintJIT() call for Gather
nodes.

Unfortunately I think the fix isn't all that trivial, due to the way we
output the per-worker information at the end of ExplainNode(), by just
dumping things into a string.  It seems to me that a step in the right
direction would be for ExplainNode() to create
planstate->worker_instrument StringInfos, which can be handed to
routines like show_sort_info(), which would print the per-node
information into that, rather than directly dumping into
es->output. Most of the current "Show worker detail" would be done
earlier in ExplainNode(), at the place where we current display the
"actual rows" bit.

ISTM that should include removing the duplication fo the the contents of
show_sort_info(), and probably also for the Gather, GatherMerge blocks
(I've apparently skipped adding the JIT information to the latter, not
sure if we ought to fix that in the stable branches).

Any chance you want to take a stab at that?

I don't think we'll fix it soon, but damn, all this string appending
around just isn't a great way to reliably build nested data formats.

Greetings,

Andres Freund

Re: Duplicate Workers entries in some EXPLAIN plans

From

Maciek Sakrejda

Date:

18 November 2019, 23:39:33

On Thu, Oct 24, 2019 at 6:48 PM Andres Freund <andres@anarazel.de> wrote:
> Unfortunately I think the fix isn't all that trivial, due to the way we
> output the per-worker information at the end of ExplainNode(), by just
> dumping things into a string.  It seems to me that a step in the right
> direction would be for ExplainNode() to create
> planstate->worker_instrument StringInfos, which can be handed to
> routines like show_sort_info(), which would print the per-node
> information into that, rather than directly dumping into
> es->output. Most of the current "Show worker detail" would be done
> earlier in ExplainNode(), at the place where we current display the
> "actual rows" bit.
>
> ISTM that should include removing the duplication fo the the contents of
> show_sort_info(), and probably also for the Gather, GatherMerge blocks
> (I've apparently skipped adding the JIT information to the latter, not
> sure if we ought to fix that in the stable branches).
>
> Any chance you want to take a stab at that?

It took me a while, but I did take a stab at it (thanks for your
off-list help). Attached is my patch that changes the structured
formats to merge sort worker output in with costs/timing/buffers
worker output. I have not touched any other worker output yet, since
it's not under a Workers group as far as I can tell (so it does not
exhibit the problem I originally reported). I can try to make further
changes here if the approach is deemed sound. I also have not touched
text output; above you had proposed

>  Sort Method: external merge  Disk: 4920kB
>  Buffers: shared hit=682 read=10188, temp read=1415 written=2101
>  Worker 0: actual time=130.058..130.324 rows=1324 loops=1
>    Sort Method: external merge  Disk: 5880kB
>    Buffers: shared hit=337 read=3489, temp read=505 written=739
>  Worker 1: actual time=130.273..130.512 rows=1297 loops=1
>    Buffers: shared hit=345 read=3507, temp read=505 written=744
>    Sort Method: external merge  Disk: 5920kB

which makes sense to me, but I'd like to confirm this is the approach
we want before I add it to the patch.

This is my first C in close to a decade (and I was never much of a C
programmer to begin with), so please be gentle.

As Andres suggested off-list, I also changed the worker output to
order fields that also occur in the parent node in the same way as the
parent node.

I've also added a test for the patch, and because this is really an
EXPLAIN issue rather than a query feature issue, I added a
src/test/regress/sql/explain.sql for the test. I added a couple of
utility functions for munging json-formatted EXPLAIN plans into
something we can repeatably verify in regression tests (the functions
use json rather than jsonb to preserve field order). I have not added
this for YAML or XML (even though they should behave the same way),
since I'm not familiar with the the functions to manipulate those data
types in a similar way (if they exist). My hunch is due to the
similarity of structured formats, just testing JSON is enough, but I
can expand/adjust tests as necessary.

Attachment

merge-explain-worker-output.patch

Re: Duplicate Workers entries in some EXPLAIN plans

From

Maciek Sakrejda

Date:

26 December 2019, 23:31:16

I wanted to follow up on this patch since I received no feedback. What should my next steps be (besides rebasing, though I want to confirm there's interest before I do that)?

Re: Duplicate Workers entries in some EXPLAIN plans

From

Julien Rouhaud

Date:

27 December 2019, 06:29:48

On Fri, Dec 27, 2019 at 12:31 AM Maciek Sakrejda <m.sakrejda@gmail.com> wrote:
>
> I wanted to follow up on this patch since I received no feedback. What should my next steps be (besides rebasing,
thoughI want to confirm there's interest before I do that)?

Given Andres' answer I'd say that there's interest in this patch.  You
should register this patch in the next commitfest
(https://commitfest.postgresql.org/26/) to make sure that it's not
forgotten, which unfortunately is probably what happened here .

Re: Duplicate Workers entries in some EXPLAIN plans

From

Maciek Sakrejda

Date:

28 December 2019, 01:11:38

Done! Thanks!

Re: Duplicate Workers entries in some EXPLAIN plans

From

Georgios Kokolatos

Date:

14 January 2020, 14:44:51

The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:       not tested
Spec compliant:           not tested
Documentation:            not tested

This is a high level review only. However, seeing that there is a conflict and does not merge cleanly after commit
<b925a00f4ef>,I return to Author.
 

To be fair the resolution seems quite straight forward and I took the liberty of applying the necessary changes to
includethe new element of ExplainState introduced in the above commit, namely hide_workers. However since the author
mighthave a different idea on how to incorporate this change I leave it up to him.
 

Another very high level comment is the introduction of a new test file, namely explain. Seeing `explain.sql` in the
testssuits, personally and very opinion based, I would have expected the whole spectrum of the explain properties to be
tested.However only a slight fraction of the functionality is tested. Since this is a bit more of a personal opinion, I
don'texpect any changes unless the author happens to agree.
 

Other than these minor nitpick, the code seems clear, concise and does what it says on the box. It follows the comments
inthe discussion thread(s) and solves a real issue.
 

Please have a look on how commit <b925a00f4ef> affects this patch and rebase.

The new status of this patch is: Waiting on Author

Re: Duplicate Workers entries in some EXPLAIN plans

From

Maciek Sakrejda

Date:

15 January 2020, 07:22:04

Thanks for the review! I looked at <b925a00f4ef> and rebased the patch
on current master, ac5bdf6.

I introduced a new test file because this bug is specifically about
EXPLAIN output (as opposed to query execution or planning
functionality), and it didn't seem like a test would fit in any of the
other files. I focused on testing just the behavior around this
specific bug (and fix). I think eventually we should probably test
other more fundamental EXPLAIN features (and I'm happy to contribute
to that) in that file, but that seems outside of the scope of this
patch.

Any thoughts on what we should do with text mode output (which is
untouched right now)? The output Andres proposed above makes sense to
me, but I'd like to get more input.

Attachment

merge-explain-worker-output-v2.patch

Re: Duplicate Workers entries in some EXPLAIN plans

From

Georgios Kokolatos

Date:

15 January 2020, 10:11:43

The following review has been posted through the commitfest application:
make installcheck-world:  tested, failed
Implements feature:       not tested
Spec compliant:           not tested
Documentation:            not tested

The current version of the patch (v2) applies cleanly and does what it says on the box.

> Any thoughts on what we should do with text mode output (which is
untouched right now)? The output Andres proposed above makes sense to
me, but I'd like to get more input.

Nothing to add beyond what is stated in the thread.

>  Sort Method: external merge  Disk: 4920kB
>  Buffers: shared hit=682 read=10188, temp read=1415 written=2101
>  Worker 0: actual time=130.058..130.324 rows=1324 loops=1
>    Sort Method: external merge  Disk: 5880kB
>    Buffers: shared hit=337 read=3489, temp read=505 written=739
>  Worker 1: actual time=130.273..130.512 rows=1297 loops=1
>    Buffers: shared hit=345 read=3507, temp read=505 written=744
>    Sort Method: external merge  Disk: 5920kB

This proposal seems like a fitting approach. Awaiting for v3 which
will include the text version. May I suggest a format yaml test case?
Just to make certain that no regressions occur in the future.

Thanks,

Re: Duplicate Workers entries in some EXPLAIN plans

From

Maciek Sakrejda

Date:

15 January 2020, 17:12:04

Sounds good, I'll try that format. Any idea how to test YAML? With the
JSON format, I was able to rely on Postgres' own JSON-manipulating
functions to strip or canonicalize fields that can vary across
executions--I can't really do that with YAML. Or should I run EXPLAIN
with COSTS OFF, TIMING OFF, SUMMARY OFF and assume that for simple
queries the BUFFERS output (and other fields I can't turn off like
Sort Space Used) *is* going to be stable?

Re: Duplicate Workers entries in some EXPLAIN plans

From

Georgios Kokolatos

Date:

16 January 2020, 14:07:36

> Sounds good, I'll try that format. Any idea how to test YAML? With the
> JSON format, I was able to rely on Postgres' own JSON-manipulating
> functions to strip or canonicalize fields that can vary across
> executions--I can't really do that with YAML. 

Yes, this approach was clear in the patch and works great with Json. Also
you are correct, this can not be done with YAML. I spend a bit of time to
look around and I could not find any tests really on yaml format.

> Or should I run EXPLAIN
> with COSTS OFF, TIMING OFF, SUMMARY OFF and assume that for simple
> queries the BUFFERS output (and other fields I can't turn off like
> Sort Space Used) *is* going to be stable?

I have to admit with the current diff tool used in pg_regress, this is not possible.
I am pretty certain that it *is not* going to be stable. Not for long anyways.
I withdraw my suggestion for YAML and currently awaiting for TEXT format only.

Re: Duplicate Workers entries in some EXPLAIN plans

From

Maciek Sakrejda

Date:

21 January 2020, 08:48:35

TEXT format was tricky due to its inconsistencies, but I think I have
something working reasonably well. I added a simple test for TEXT
format output as well, using a similar approach as the JSON format
test, and liberally regexp_replacing away any volatile output. I
suppose in theory we could do this for YAML, too, but I think it's
gross enough not to be worth it, especially given the high similarity
of all the structured outputs.

Attachment

merge-explain-worker-output-v3.patch

Re: Duplicate Workers entries in some EXPLAIN plans

From

Georgios Kokolatos

Date:

22 January 2020, 12:54:07

The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:       not tested
Spec compliant:           not tested
Documentation:            not tested

> TEXT format was tricky due to its inconsistencies, but I think I have
> something working reasonably well. I added a simple test for TEXT
> format output as well, using a similar approach as the JSON format

Great!

> test, and liberally regexp_replacing away any volatile output. I
> suppose in theory we could do this for YAML, too, but I think it's
> gross enough not to be worth it, especially given the high similarity
> of all the structured outputs.

Agreed, what is in the patch suffices. Overall great work, a couple of
minor nitpicks if you allow me.

+   /* Prepare per-worker output */
+   if (es->analyze && planstate->worker_instrument) {

Style, parenthesis on its own line.

+       int num_workers = planstate->worker_instrument->num_workers;
+       int n;
+       worker_strs = (StringInfo *) palloc0(num_workers * sizeof(StringInfo));
+       for (n = 0; n < num_workers; n++) {

I think C99 would be better here. Also no parenthesis needed.

+           worker_strs[n] = makeStringInfo();
+       }
+   }

@@ -1357,6 +1369,58 @@ ExplainNode(PlanState *planstate, List *ancestors,
        ExplainPropertyBool("Parallel Aware", plan->parallel_aware, es);
    }

+   /* Prepare worker general execution details */
+   if (es->analyze && es->verbose && planstate->worker_instrument)
+   {
+       WorkerInstrumentation *w = planstate->worker_instrument;
+       int         n;
+
+       for (n = 0; n < w->num_workers; ++n)

I think C99 would be better here.

+       {
+           Instrumentation *instrument = &w->instrument[n];
+           double      nloops = instrument->nloops;

-               appendStringInfoSpaces(es->str, es->indent * 2);
-               if (n > 0 || !es->hide_workers)
-                   appendStringInfo(es->str, "Worker %d:  ", n);
+               if (indent)
+               {
+                   appendStringInfoSpaces(es->str, es->indent * 2);
+               }

Style: No parenthesis needed


-       if (opened_group)
-           ExplainCloseGroup("Workers", "Workers", false, es);
+   /* Show worker detail */
+   if (planstate->worker_instrument) {
+       ExplainFlushWorkers(worker_strs, planstate->worker_instrument->num_workers, es);
    }

Style: No parenthesis needed


+    * just indent once, to add worker info on the next worker line.
+    */
+   if (es->str == es->root_str)
+   {
+       es->indent += es->format == EXPLAIN_FORMAT_TEXT ? 1 : 2;
+   }
+

Style: No parenthesis needed

+   ExplainCloseGroup("Workers", "Workers", false, es);
+   // do we have any other cleanup to do?

This comment does not really explain anything. Either remove
or rephrase. Also C style comments.

+   es->print_workers = false;
+}

    int         indent;         /* current indentation level */
    List       *grouping_stack; /* format-specific grouping state */
+   bool        print_workers;  /* whether current node has worker metadata */

Hmm.. commit <b925a00f4ef> introduced `hide_workers` in the struct. Having both
names in the struct so far apart even seems a bit confusing and sloppy. Do you
think it would be possible to combine or rename?


+extern void ExplainOpenWorker(StringInfo worker_str, ExplainState *es);
+extern void ExplainCloseWorker(ExplainState *es);
+extern void ExplainFlushWorkers(StringInfo *worker_strs, int num_workers, ExplainState *es);

No need to expose those, is there? I feel there should be static.

Awaiting for answer or resolution of these comments to change the status.

//Georgios

Re: Duplicate Workers entries in some EXPLAIN plans

From

Maciek Sakrejda

Date:

22 January 2020, 16:54:40

Thanks! I'll fix the brace issues. Re: the other items:

> +       int num_workers = planstate->worker_instrument->num_workers;
> +       int n;
> +       worker_strs = (StringInfo *) palloc0(num_workers * sizeof(StringInfo));
> +       for (n = 0; n < num_workers; n++) {
>
> I think C99 would be better here. Also no parenthesis needed.

Pardon my C illiteracy, but what am I doing that's not valid C99 here?

> +   /* Prepare worker general execution details */
> +   if (es->analyze && es->verbose && planstate->worker_instrument)
> +   {
> +       WorkerInstrumentation *w = planstate->worker_instrument;
> +       int         n;
> +
> +       for (n = 0; n < w->num_workers; ++n)
>
> I think C99 would be better here.

And here (if it's not the same problem)?

> +   ExplainCloseGroup("Workers", "Workers", false, es);
> +   // do we have any other cleanup to do?
>
> This comment does not really explain anything. Either remove
> or rephrase. Also C style comments.

Good catch, thanks--I had put this in to remind myself (and reviewers)
about cleanup, but I don't think there's anything else to do, so I'll
just drop it.

>     int         indent;         /* current indentation level */
>     List       *grouping_stack; /* format-specific grouping state */
> +   bool        print_workers;  /* whether current node has worker metadata */
>
> Hmm.. commit <b925a00f4ef> introduced `hide_workers` in the struct. Having both
> names in the struct so far apart even seems a bit confusing and sloppy. Do you
> think it would be possible to combine or rename?

I noticed that. I was thinking about combining them, but
"hide_workers" seems to be about "pretend there is no worker output
even if there is" and "print_workers" is "keep track of whether or not
there is worker output to print". Maybe I'll rename to
"has_worker_output"?

> +extern void ExplainOpenWorker(StringInfo worker_str, ExplainState *es);
> +extern void ExplainCloseWorker(ExplainState *es);
> +extern void ExplainFlushWorkers(StringInfo *worker_strs, int num_workers, ExplainState *es);
>
> No need to expose those, is there? I feel there should be static.

Good call, I'll update.

Re: Duplicate Workers entries in some EXPLAIN plans

From

Georgios Kokolatos

Date:

22 January 2020, 17:36:20

>> +       int num_workers = planstate->worker_instrument->num_workers;
>> +       int n;
>> +       worker_strs = (StringInfo *) palloc0(num_workers * sizeof(StringInfo));
>> +       for (n = 0; n < num_workers; n++) {
>>
>> I think C99 would be better here. Also no parenthesis needed.
>
>
> Pardon my C illiteracy, but what am I doing that's not valid C99 here?

My bad, I should have been more clear. I meant that it is preferable to use
the C99 standard which calls for declaring variables in the scope that you
need them. In this case, 'n' is needed only in the for loop, so something like

for (int n = 0; n < num_workers; n++) 

is preferable. To be clear, your code was perfectly valid. It was only the
style I was referring to.

>> +       for (n = 0; n < w->num_workers; ++n)
>>
>> I think C99 would be better here.
>
>
> And here (if it's not the same problem)?

Exactly the same as above. 

>>     int         indent;         /* current indentation level */
>>     List       *grouping_stack; /* format-specific grouping state */
>> +   bool        print_workers;  /* whether current node has worker metadata */
>>
>> Hmm.. commit <b925a00f4ef> introduced `hide_workers` in the struct. Having both
>> names in the struct so far apart even seems a bit confusing and sloppy. Do you
>> think it would be possible to combine or rename?
>
>
> I noticed that. I was thinking about combining them, but
> "hide_workers" seems to be about "pretend there is no worker output
> even if there is" and "print_workers" is "keep track of whether or not
> there is worker output to print". Maybe I'll rename to
> "has_worker_output"?

The rename sounds a bit better in my humble opinion. Thanks.

Re: Duplicate Workers entries in some EXPLAIN plans

From

Maciek Sakrejda

Date:

23 January 2020, 09:00:32

On Wed, Jan 22, 2020 at 9:37 AM Georgios Kokolatos <gkokolatos@pm.me> wrote:
> My bad, I should have been more clear. I meant that it is preferable to use
> the C99 standard which calls for declaring variables in the scope that you
> need them.

Ah, I see. I think I got that from ExplainPrintSettings. I've
corrected my usage--thanks for pointing it out. I appreciate the
effort to maintain a consistent style.

>
> >>     int         indent;         /* current indentation level */
> >>     List       *grouping_stack; /* format-specific grouping state */
> >> +   bool        print_workers;  /* whether current node has worker metadata */
> >>
> >> Hmm.. commit <b925a00f4ef> introduced `hide_workers` in the struct. Having both
> >> names in the struct so far apart even seems a bit confusing and sloppy. Do you
> >> think it would be possible to combine or rename?
> >
> >
> > I noticed that. I was thinking about combining them, but
> > "hide_workers" seems to be about "pretend there is no worker output
> > even if there is" and "print_workers" is "keep track of whether or not
> > there is worker output to print". Maybe I'll rename to
> > "has_worker_output"?
>
> The rename sounds a bit better in my humble opinion. Thanks.

Also, reviewing my code again, I noticed that when I moved the general
worker output earlier, I missed part of the merge conflict: I had
replaced

-       /* Show worker detail */
-       if (es->analyze && es->verbose && !es->hide_workers &&
-               planstate->worker_instrument)

with

+       /* Prepare worker general execution details */
+       if (es->analyze && es->verbose && planstate->worker_instrument)

which ignores the es->hide_workers flag (it did not fail the tests,
but the intent is pretty clear). I've corrected this in the current
patch.

I also noticed that we can now handle worker buffer output more
consistently across TEXT and structured formats, so I made that small
change too:

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 140d0be426..b23b015594 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1401,8 +1401,6 @@ ExplainNode(PlanState *planstate, List *ancestors,
                                        appendStringInfo(es->str,

  "actual rows=%.0f loops=%.0f\n",

  rows, nloops);
-                               if (es->buffers)
-                                       show_buffer_usage(es,
&instrument->bufusage);
                        }
                        else
                        {
@@ -1951,7 +1949,7 @@ ExplainNode(PlanState *planstate, List *ancestors,

        /* Prepare worker buffer usage */
        if (es->buffers && es->analyze && es->verbose && !es->hide_workers
-               && planstate->worker_instrument && es->format !=
EXPLAIN_FORMAT_TEXT)
+               && planstate->worker_instrument)
        {
                WorkerInstrumentation *w = planstate->worker_instrument;
                int                     n;
diff --git a/src/test/regress/expected/explain.out
b/src/test/regress/expected/explain.out
index 8034a4e0db..a4eed3067f 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -103,8 +103,8 @@ $$, 'verbose', 'analyze', 'buffers', 'timing off',
'costs off', 'summary off'),
          Sort Key: (ROW("*VALUES*".column1))                      +
          Buffers: shared hit=114                                  +
          Worker 0: actual rows=2 loops=1                          +
-           Buffers: shared hit=114                                +
            Sort Method: xxx                                       +
+           Buffers: shared hit=114                                +
          ->  Values Scan on "*VALUES*" (actual rows=2 loops=1)    +
                Output: "*VALUES*".column1, ROW("*VALUES*".column1)+
                Worker 0: actual rows=2 loops=1                    +


I think the "producing plan output for a worker" process is easier to
reason about now, and while it changes TEXT format worker output
order, the other changes in this patch are more drastic so this
probably does not matter.

I've also addressed the other feedback above, and reworded a couple of
comments slightly.

Attachment

merge-explain-worker-output-v4.patch

Re: Duplicate Workers entries in some EXPLAIN plans

From

Georgios Kokolatos

Date:

23 January 2020, 10:38:38

The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:       not tested
Spec compliant:           not tested
Documentation:            not tested

> Ah, I see. I think I got that from ExplainPrintSettings. I've
> corrected my usage--thanks for pointing it out. I appreciate the
> effort to maintain a consistent style.

Thanks, I am just following the reviewing guide to be honest.

> Also, reviewing my code again, I noticed that when I moved the general
> worker output earlier, I missed part of the merge conflict:

Right. I thought that was intentional.

> which ignores the es->hide_workers flag (it did not fail the tests,
> but the intent is pretty clear). I've corrected this in the current
> patch.

Noted and appreciated.

> I also noticed that we can now handle worker buffer output more
> consistently across TEXT and structured formats, so I made that small
> change too:

Looks good.

> I think the "producing plan output for a worker" process is easier to
> reason about now, and while it changes TEXT format worker output
> order, the other changes in this patch are more drastic so this
> probably does not matter.
> 
> I've also addressed the other feedback above, and reworded a couple of
> comments slightly.

Thanks for the above. Looks clean, does what it says in the tin and solves a
problem that needs solving. All relevant installcheck-world pass. As far as I 
am concerned, I think it is ready to be sent to a committer. Updating the status
accordingly.

The new status of this patch is: Ready for Committer

Re: Duplicate Workers entries in some EXPLAIN plans

From

Maciek Sakrejda

Date:

24 January 2020, 17:03:22

Great, thank you. I noticed in the CF patch tester app, the build
fails on Windows [1]. Investigating, I realized I had failed to fully
strip volatile EXPLAIN info (namely buffers) in TEXT mode due to a
bad regexp_replace. I've fixed this in the attached patch (which I
also rebased against current master again).

[1]: https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.76313

Attachment

merge-explain-worker-output-v5.patch

Re: Duplicate Workers entries in some EXPLAIN plans

From

Tom Lane

Date:

24 January 2020, 17:40:13

Maciek Sakrejda <m.sakrejda@gmail.com> writes:
> Great, thank you. I noticed in the CF patch tester app, the build
> fails on Windows [1]. Investigating, I realized I had failed to fully
> strip volatile EXPLAIN info (namely buffers) in TEXT mode due to a
> bad regexp_replace.

You haven't gone nearly far enough in that direction.  The proposed
output still relies on the assumption that the session will always
get as many workers as it asks for, which it will not.  For previous
bitter experience in this department, see for instance commits 4ea03f3f4
and 13e8b2ee8.

TBH I am not sure that the proposed regression tests for this change
can be committed at all.  Which is a bit of a problem perhaps, but
then again we don't have terribly good coverage for the existing code
either, for the same reason.

            regards, tom lane

Re: Duplicate Workers entries in some EXPLAIN plans

From

Maciek Sakrejda

Date:

25 January 2020, 00:26:57

Okay. Does not getting as many workers as it asks for include
sometimes getting zero, completely changing the actual output? If so,
I'll submit a new version of the patch removing all tests--I was
hoping to improve coverage, but I guess this is not the way to start.
If not, can we keep the json tests at least if we only consider the
first worker?

Re: Duplicate Workers entries in some EXPLAIN plans

From

Tom Lane

Date:

25 January 2020, 02:14:55

Maciek Sakrejda <m.sakrejda@gmail.com> writes:
> Okay. Does not getting as many workers as it asks for include
> sometimes getting zero, completely changing the actual output?

Yup :-(.  We could possibly avoid that by running the explain
test by itself rather than in a parallel group, but I don't
especially want to add yet more non-parallelizable tests.

Meanwhile, I spent some time looking at the code, and wasn't very happy
with it.  I'm on board with the general plan of redirecting EXPLAIN
output into per-worker buffers that we eventually recombine, but I think
that this particular coding is pretty unmaintainable/unextensible.
In particular, I'm really unhappy that the code is ignoring EXPLAIN's
grouping_state stack.  That means that when it's formatting fields that
belong to the per-worker group, it's using the state-stack entry that
corresponds to the plan node's main level.  This seems to accidentally
work, but that fact depends on a number of shaky assumptions:

* Before messing with any per-worker data, we've always emitted at
least one field in the plan node's main level, so that the state-stack
entry isn't at its initial state for the level.

* Before actually emitting the shunted-aside data, we've always emitted
a "Worker Number" field in correct format within the per-worker group,
so that the formatting state is now correct for a non-initial field.

* There is no formatting difference between any non-first fields in
a level (ie the state stack entries are basically just booleans),
so that it doesn't matter how many plan-node fields we emitted before
starting the per-worker data, so long as there was at least one, nor
does transiently abusing the plan node level's stack entry like this
break the formatting of subsequent plan-node-level fields.

Now maybe we'll never implement an output format that breaks that
last assumption, and maybe we'll never rearrange the EXPLAIN code
in a way that breaks either of the first two.  But I don't like those
assumptions too much.  I also didn't like the code's assumption that
all the non-text formats interpret es->indent the same.

I also noted an actual bug, which is that the patch fails regression
testing under force_parallel_mode = regress.  This isn't really your
fault, because the issue is in this obscure and poorly-commented hack
in show_sort_info:

     * You might think we should just skip this stanza entirely when
     * es->hide_workers is true, but then we'd get no sort-method output at
     * all.  We have to make it look like worker 0's data is top-level data.
     * Currently, we only bother with that for text-format output.

Nonetheless, it's broken.

So I spent some time hacking on this and came up with the attached.
It's noticeably more verbose than your patch, but it keeps the
output-format-aware code at what seems to me to be a maintainable
arm's-length distance from the parallel-worker hacking.  TEXT is
still a special case of course :-(.

This patch just covers the code, I'm not taking any position yet
about what to do about the tests.  I did tweak the code to eliminate
the one formatting difference in select_parallel (ie put two spaces
after "Worker N:", which I think reads better anyhow), so it
passes check-world as it stands.

Thoughts?

            regards, tom lane

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index d189b8d..9e965d2 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -123,11 +123,20 @@ static void ExplainSubPlans(List *plans, List *ancestors,
                             const char *relationship, ExplainState *es);
 static void ExplainCustomChildren(CustomScanState *css,
                                   List *ancestors, ExplainState *es);
+static ExplainWorkersState *ExplainCreateWorkersState(int num_workers);
+static void ExplainOpenWorker(int n, ExplainState *es);
+static void ExplainCloseWorker(int n, ExplainState *es);
+static void ExplainFlushWorkersState(ExplainState *es);
 static void ExplainProperty(const char *qlabel, const char *unit,
                             const char *value, bool numeric, ExplainState *es);
+static void ExplainOpenSetAsideGroup(const char *objtype, const char *labelname,
+                                     bool labeled, int depth, ExplainState *es);
+static void ExplainSaveGroup(ExplainState *es, int depth, int *state_save);
+static void ExplainRestoreGroup(ExplainState *es, int depth, int *state_save);
 static void ExplainDummyGroup(const char *objtype, const char *labelname,
                               ExplainState *es);
 static void ExplainXMLTag(const char *tagname, int flags, ExplainState *es);
+static void ExplainIndentText(ExplainState *es);
 static void ExplainJSONLineEnding(ExplainState *es);
 static void ExplainYAMLLineStarting(ExplainState *es);
 static void escape_yaml(StringInfo buf, const char *str);
@@ -827,7 +836,7 @@ ExplainPrintJIT(ExplainState *es, int jit_flags,
     /* for higher density, open code the text output format */
     if (es->format == EXPLAIN_FORMAT_TEXT)
     {
-        appendStringInfoSpaces(es->str, es->indent * 2);
+        ExplainIndentText(es);
         if (for_workers)
             appendStringInfo(es->str, "JIT for worker %u:\n", worker_num);
         else
@@ -836,7 +845,7 @@ ExplainPrintJIT(ExplainState *es, int jit_flags,

         ExplainPropertyInteger("Functions", NULL, ji->created_functions, es);

-        appendStringInfoSpaces(es->str, es->indent * 2);
+        ExplainIndentText(es);
         appendStringInfo(es->str, "Options: %s %s, %s %s, %s %s, %s %s\n",
                          "Inlining", jit_flags & PGJIT_INLINE ? "true" : "false",
                          "Optimization", jit_flags & PGJIT_OPT3 ? "true" : "false",
@@ -845,7 +854,7 @@ ExplainPrintJIT(ExplainState *es, int jit_flags,

         if (es->analyze && es->timing)
         {
-            appendStringInfoSpaces(es->str, es->indent * 2);
+            ExplainIndentText(es);
             appendStringInfo(es->str,
                              "Timing: %s %.3f ms, %s %.3f ms, %s %.3f ms, %s %.3f ms, %s %.3f ms\n",
                              "Generation", 1000.0 * INSTR_TIME_GET_DOUBLE(ji->generation_counter),
@@ -1074,9 +1083,10 @@ ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used)
  * optional name to be attached to the node.
  *
  * In text format, es->indent is controlled in this function since we only
- * want it to change at plan-node boundaries.  In non-text formats, es->indent
- * corresponds to the nesting depth of logical output groups, and therefore
- * is controlled by ExplainOpenGroup/ExplainCloseGroup.
+ * want it to change at plan-node boundaries (but a few subroutines will
+ * transiently increment it).  In non-text formats, es->indent corresponds
+ * to the nesting depth of logical output groups, and therefore is controlled
+ * by ExplainOpenGroup/ExplainCloseGroup.
  */
 static void
 ExplainNode(PlanState *planstate, List *ancestors,
@@ -1090,9 +1100,20 @@ ExplainNode(PlanState *planstate, List *ancestors,
     const char *partialmode = NULL;
     const char *operation = NULL;
     const char *custom_name = NULL;
+    ExplainWorkersState *save_workers_state = es->workers_state;
     int            save_indent = es->indent;
     bool        haschildren;

+    /*
+     * Prepare per-worker output buffers, if needed.  We'll append the data in
+     * these to the main output string further down.
+     */
+    if (planstate->worker_instrument && es->analyze && !es->hide_workers)
+        es->workers_state = ExplainCreateWorkersState(planstate->worker_instrument->num_workers);
+    else
+        es->workers_state = NULL;
+
+    /* Identify plan node type, and print generic details */
     switch (nodeTag(plan))
     {
         case T_Result:
@@ -1324,13 +1345,13 @@ ExplainNode(PlanState *planstate, List *ancestors,
     {
         if (plan_name)
         {
-            appendStringInfoSpaces(es->str, es->indent * 2);
+            ExplainIndentText(es);
             appendStringInfo(es->str, "%s\n", plan_name);
             es->indent++;
         }
         if (es->indent)
         {
-            appendStringInfoSpaces(es->str, es->indent * 2);
+            ExplainIndentText(es);
             appendStringInfoString(es->str, "->  ");
             es->indent += 2;
         }
@@ -1574,6 +1595,56 @@ ExplainNode(PlanState *planstate, List *ancestors,
     if (es->format == EXPLAIN_FORMAT_TEXT)
         appendStringInfoChar(es->str, '\n');

+    /* prepare per-worker general execution details */
+    if (es->workers_state && es->verbose)
+    {
+        WorkerInstrumentation *w = planstate->worker_instrument;
+
+        for (int n = 0; n < w->num_workers; n++)
+        {
+            Instrumentation *instrument = &w->instrument[n];
+            double        nloops = instrument->nloops;
+            double        startup_ms;
+            double        total_ms;
+            double        rows;
+
+            if (nloops <= 0)
+                continue;
+            startup_ms = 1000.0 * instrument->startup / nloops;
+            total_ms = 1000.0 * instrument->total / nloops;
+            rows = instrument->ntuples / nloops;
+
+            ExplainOpenWorker(n, es);
+
+            if (es->format == EXPLAIN_FORMAT_TEXT)
+            {
+                ExplainIndentText(es);
+                if (es->timing)
+                    appendStringInfo(es->str,
+                                     "actual time=%.3f..%.3f rows=%.0f loops=%.0f\n",
+                                     startup_ms, total_ms, rows, nloops);
+                else
+                    appendStringInfo(es->str,
+                                     "actual rows=%.0f loops=%.0f\n",
+                                     rows, nloops);
+            }
+            else
+            {
+                if (es->timing)
+                {
+                    ExplainPropertyFloat("Actual Startup Time", "ms",
+                                         startup_ms, 3, es);
+                    ExplainPropertyFloat("Actual Total Time", "ms",
+                                         total_ms, 3, es);
+                }
+                ExplainPropertyFloat("Actual Rows", NULL, rows, 0, es);
+                ExplainPropertyFloat("Actual Loops", NULL, nloops, 0, es);
+            }
+
+            ExplainCloseWorker(n, es);
+        }
+    }
+
     /* target list */
     if (es->verbose)
         show_plan_tlist(planstate, ancestors, es);
@@ -1885,76 +1956,30 @@ ExplainNode(PlanState *planstate, List *ancestors,
     if (es->buffers && planstate->instrument)
         show_buffer_usage(es, &planstate->instrument->bufusage);

-    /* Show worker detail */
-    if (es->analyze && es->verbose && !es->hide_workers &&
-        planstate->worker_instrument)
+    /* Prepare per-worker buffer usage */
+    if (es->workers_state && es->buffers && es->verbose)
     {
         WorkerInstrumentation *w = planstate->worker_instrument;
-        bool        opened_group = false;
-        int            n;

-        for (n = 0; n < w->num_workers; ++n)
+        for (int n = 0; n < w->num_workers; n++)
         {
             Instrumentation *instrument = &w->instrument[n];
             double        nloops = instrument->nloops;
-            double        startup_ms;
-            double        total_ms;
-            double        rows;

             if (nloops <= 0)
                 continue;
-            startup_ms = 1000.0 * instrument->startup / nloops;
-            total_ms = 1000.0 * instrument->total / nloops;
-            rows = instrument->ntuples / nloops;

-            if (es->format == EXPLAIN_FORMAT_TEXT)
-            {
-                appendStringInfoSpaces(es->str, es->indent * 2);
-                appendStringInfo(es->str, "Worker %d: ", n);
-                if (es->timing)
-                    appendStringInfo(es->str,
-                                     "actual time=%.3f..%.3f rows=%.0f loops=%.0f\n",
-                                     startup_ms, total_ms, rows, nloops);
-                else
-                    appendStringInfo(es->str,
-                                     "actual rows=%.0f loops=%.0f\n",
-                                     rows, nloops);
-                es->indent++;
-                if (es->buffers)
-                    show_buffer_usage(es, &instrument->bufusage);
-                es->indent--;
-            }
-            else
-            {
-                if (!opened_group)
-                {
-                    ExplainOpenGroup("Workers", "Workers", false, es);
-                    opened_group = true;
-                }
-                ExplainOpenGroup("Worker", NULL, true, es);
-                ExplainPropertyInteger("Worker Number", NULL, n, es);
-
-                if (es->timing)
-                {
-                    ExplainPropertyFloat("Actual Startup Time", "ms",
-                                         startup_ms, 3, es);
-                    ExplainPropertyFloat("Actual Total Time", "ms",
-                                         total_ms, 3, es);
-                }
-                ExplainPropertyFloat("Actual Rows", NULL, rows, 0, es);
-                ExplainPropertyFloat("Actual Loops", NULL, nloops, 0, es);
-
-                if (es->buffers)
-                    show_buffer_usage(es, &instrument->bufusage);
-
-                ExplainCloseGroup("Worker", NULL, true, es);
-            }
+            ExplainOpenWorker(n, es);
+            show_buffer_usage(es, &instrument->bufusage);
+            ExplainCloseWorker(n, es);
         }
-
-        if (opened_group)
-            ExplainCloseGroup("Workers", "Workers", false, es);
     }

+    /* Show per-worker details, then pop that stack */
+    if (es->workers_state)
+        ExplainFlushWorkersState(es);
+    es->workers_state = save_workers_state;
+
     /* Get ready to display the child plans */
     haschildren = planstate->initPlan ||
         outerPlanState(planstate) ||
@@ -2525,7 +2550,7 @@ show_tablesample(TableSampleClause *tsc, PlanState *planstate,
     {
         bool        first = true;

-        appendStringInfoSpaces(es->str, es->indent * 2);
+        ExplainIndentText(es);
         appendStringInfo(es->str, "Sampling: %s (", method_name);
         foreach(lc, params)
         {
@@ -2572,7 +2597,7 @@ show_sort_info(SortState *sortstate, ExplainState *es)

         if (es->format == EXPLAIN_FORMAT_TEXT)
         {
-            appendStringInfoSpaces(es->str, es->indent * 2);
+            ExplainIndentText(es);
             appendStringInfo(es->str, "Sort Method: %s  %s: %ldkB\n",
                              sortMethod, spaceType, spaceUsed);
         }
@@ -2588,12 +2613,14 @@ show_sort_info(SortState *sortstate, ExplainState *es)
      * You might think we should just skip this stanza entirely when
      * es->hide_workers is true, but then we'd get no sort-method output at
      * all.  We have to make it look like worker 0's data is top-level data.
-     * Currently, we only bother with that for text-format output.
+     * This is easily done by just skipping the OpenWorker/CloseWorker calls.
+     * Currently, we don't worry about the possibility that there are multiple
+     * workers in such a case; if there are, duplicate output fields will be
+     * emitted.
      */
     if (sortstate->shared_info != NULL)
     {
         int            n;
-        bool        opened_group = false;

         for (n = 0; n < sortstate->shared_info->num_workers; n++)
         {
@@ -2609,32 +2636,26 @@ show_sort_info(SortState *sortstate, ExplainState *es)
             spaceType = tuplesort_space_type_name(sinstrument->spaceType);
             spaceUsed = sinstrument->spaceUsed;

+            if (es->workers_state)
+                ExplainOpenWorker(n, es);
+
             if (es->format == EXPLAIN_FORMAT_TEXT)
             {
-                appendStringInfoSpaces(es->str, es->indent * 2);
-                if (n > 0 || !es->hide_workers)
-                    appendStringInfo(es->str, "Worker %d:  ", n);
+                ExplainIndentText(es);
                 appendStringInfo(es->str,
                                  "Sort Method: %s  %s: %ldkB\n",
                                  sortMethod, spaceType, spaceUsed);
             }
             else
             {
-                if (!opened_group)
-                {
-                    ExplainOpenGroup("Workers", "Workers", false, es);
-                    opened_group = true;
-                }
-                ExplainOpenGroup("Worker", NULL, true, es);
-                ExplainPropertyInteger("Worker Number", NULL, n, es);
                 ExplainPropertyText("Sort Method", sortMethod, es);
                 ExplainPropertyInteger("Sort Space Used", "kB", spaceUsed, es);
                 ExplainPropertyText("Sort Space Type", spaceType, es);
-                ExplainCloseGroup("Worker", NULL, true, es);
             }
+
+            if (es->workers_state)
+                ExplainCloseWorker(n, es);
         }
-        if (opened_group)
-            ExplainCloseGroup("Workers", "Workers", false, es);
     }
 }

@@ -2721,7 +2742,7 @@ show_hash_info(HashState *hashstate, ExplainState *es)
         else if (hinstrument.nbatch_original != hinstrument.nbatch ||
                  hinstrument.nbuckets_original != hinstrument.nbuckets)
         {
-            appendStringInfoSpaces(es->str, es->indent * 2);
+            ExplainIndentText(es);
             appendStringInfo(es->str,
                              "Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
                              hinstrument.nbuckets,
@@ -2732,7 +2753,7 @@ show_hash_info(HashState *hashstate, ExplainState *es)
         }
         else
         {
-            appendStringInfoSpaces(es->str, es->indent * 2);
+            ExplainIndentText(es);
             appendStringInfo(es->str,
                              "Buckets: %d  Batches: %d  Memory Usage: %ldkB\n",
                              hinstrument.nbuckets, hinstrument.nbatch,
@@ -2758,7 +2779,7 @@ show_tidbitmap_info(BitmapHeapScanState *planstate, ExplainState *es)
     {
         if (planstate->exact_pages > 0 || planstate->lossy_pages > 0)
         {
-            appendStringInfoSpaces(es->str, es->indent * 2);
+            ExplainIndentText(es);
             appendStringInfoString(es->str, "Heap Blocks:");
             if (planstate->exact_pages > 0)
                 appendStringInfo(es->str, " exact=%ld", planstate->exact_pages);
@@ -2894,7 +2915,7 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
         /* Show only positive counter values. */
         if (has_shared || has_local || has_temp)
         {
-            appendStringInfoSpaces(es->str, es->indent * 2);
+            ExplainIndentText(es);
             appendStringInfoString(es->str, "Buffers:");

             if (has_shared)
@@ -2949,7 +2970,7 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
         /* As above, show only positive counter values. */
         if (has_timing)
         {
-            appendStringInfoSpaces(es->str, es->indent * 2);
+            ExplainIndentText(es);
             appendStringInfoString(es->str, "I/O Timings:");
             if (!INSTR_TIME_IS_ZERO(usage->blk_read_time))
                 appendStringInfo(es->str, " read=%0.3f",
@@ -3237,7 +3258,7 @@ show_modifytable_info(ModifyTableState *mtstate, List *ancestors,
              */
             if (es->format == EXPLAIN_FORMAT_TEXT)
             {
-                appendStringInfoSpaces(es->str, es->indent * 2);
+                ExplainIndentText(es);
                 appendStringInfoString(es->str,
                                        fdwroutine ? foperation : operation);
             }
@@ -3427,6 +3448,150 @@ ExplainCustomChildren(CustomScanState *css, List *ancestors, ExplainState *es)
 }

 /*
+ * Create a per-plan-node workspace for collecting per-worker data.
+ */
+static ExplainWorkersState *
+ExplainCreateWorkersState(int num_workers)
+{
+    ExplainWorkersState *wstate;
+
+    wstate = (ExplainWorkersState *) palloc(sizeof(ExplainWorkersState));
+    wstate->num_workers = num_workers;
+    wstate->worker_inited = (bool *) palloc0(num_workers * sizeof(bool));
+    wstate->worker_str = (StringInfoData *)
+        palloc0(num_workers * sizeof(StringInfoData));
+    wstate->worker_state_save = (int *) palloc(num_workers * sizeof(int));
+    return wstate;
+}
+
+/*
+ * Begin or resume output into the set-aside group for worker N.
+ */
+static void
+ExplainOpenWorker(int n, ExplainState *es)
+{
+    ExplainWorkersState *wstate = es->workers_state;
+
+    Assert(wstate);
+    Assert(n >= 0 && n < wstate->num_workers);
+
+    /* Save prior output pointer */
+    wstate->prev_str = es->str;
+
+    if (!wstate->worker_inited[n])
+    {
+        /* First time through, so create the buffer for this worker */
+        initStringInfo(&wstate->worker_str[n]);
+        es->str = &wstate->worker_str[n];
+
+        /*
+         * Push suitable initial formatting state for this worker's field
+         * group.  We allow one extra logical nesting level, since this group
+         * will eventually be wrapped in an outer "Workers" group.
+         */
+        ExplainOpenSetAsideGroup("Worker", NULL, true, 2, es);
+
+        /*
+         * In non-TEXT formats, we always emit a "Worker Number" field even if
+         * there's no other data for this worker.
+         */
+        if (es->format != EXPLAIN_FORMAT_TEXT)
+            ExplainPropertyInteger("Worker Number", NULL, n, es);
+
+        wstate->worker_inited[n] = true;
+    }
+    else
+    {
+        /* Resuming output for a worker we've already emitted some data for */
+        es->str = &wstate->worker_str[n];
+
+        /* Restore formatting state saved by last ExplainCloseWorker() */
+        ExplainRestoreGroup(es, 2, &wstate->worker_state_save[n]);
+    }
+
+    /*
+     * In TEXT format, prefix the first output line for this worker with
+     * "Worker N:".  Then, any additional lines should be indented one more
+     * stop than the "Worker N" line is.
+     */
+    if (es->format == EXPLAIN_FORMAT_TEXT)
+    {
+        if (es->str->len == 0)
+        {
+            ExplainIndentText(es);
+            appendStringInfo(es->str, "Worker %d:  ", n);
+        }
+
+        es->indent++;
+    }
+}
+
+/*
+ * End output for worker N --- must pair with previous ExplainOpenWorker call
+ */
+static void
+ExplainCloseWorker(int n, ExplainState *es)
+{
+    ExplainWorkersState *wstate = es->workers_state;
+
+    Assert(wstate);
+    Assert(n >= 0 && n < wstate->num_workers);
+    Assert(wstate->worker_inited[n]);
+
+    /*
+     * Save formatting state in case we do another ExplainOpenWorker(), then
+     * pop the formatting stack.
+     */
+    ExplainSaveGroup(es, 2, &wstate->worker_state_save[n]);
+
+    /*
+     * In TEXT format, if we didn't actually produce any output line(s) then
+     * truncate off the partial line (this avoids bogus output if, for
+     * instance, show_buffer_usage chooses not to print anything).  Also fix
+     * up the indent level.
+     */
+    if (es->format == EXPLAIN_FORMAT_TEXT)
+    {
+        while (es->str->len > 0 && es->str->data[es->str->len - 1] != '\n')
+            es->str->data[--(es->str->len)] = '\0';
+
+        es->indent--;
+    }
+
+    /* Restore prior output pointer */
+    es->str = wstate->prev_str;
+}
+
+/*
+ * Print per-worker info for current node, then free the ExplainWorkersState.
+ */
+static void
+ExplainFlushWorkersState(ExplainState *es)
+{
+    ExplainWorkersState *wstate = es->workers_state;
+
+    ExplainOpenGroup("Workers", "Workers", false, es);
+    for (int i = 0; i < wstate->num_workers; i++)
+    {
+        if (wstate->worker_inited[i])
+        {
+            /* This must match previous ExplainOpenSetAsideGroup call */
+            ExplainOpenGroup("Worker", NULL, true, es);
+            appendStringInfoString(es->str, wstate->worker_str[i].data);
+            ExplainCloseGroup("Worker", NULL, true, es);
+
+            pfree(wstate->worker_str[i].data);
+        }
+    }
+    ExplainCloseGroup("Workers", "Workers", false, es);
+
+    pfree(wstate->worker_inited);
+    pfree(wstate->worker_str);
+    pfree(wstate->worker_state_save);
+    pfree(wstate);
+}
+
+/*
  * Explain a property, such as sort keys or targets, that takes the form of
  * a list of unlabeled items.  "data" is a list of C strings.
  */
@@ -3439,7 +3604,7 @@ ExplainPropertyList(const char *qlabel, List *data, ExplainState *es)
     switch (es->format)
     {
         case EXPLAIN_FORMAT_TEXT:
-            appendStringInfoSpaces(es->str, es->indent * 2);
+            ExplainIndentText(es);
             appendStringInfo(es->str, "%s: ", qlabel);
             foreach(lc, data)
             {
@@ -3560,7 +3725,7 @@ ExplainProperty(const char *qlabel, const char *unit, const char *value,
     switch (es->format)
     {
         case EXPLAIN_FORMAT_TEXT:
-            appendStringInfoSpaces(es->str, es->indent * 2);
+            ExplainIndentText(es);
             if (unit)
                 appendStringInfo(es->str, "%s: %s %s\n", qlabel, value, unit);
             else
@@ -3752,6 +3917,117 @@ ExplainCloseGroup(const char *objtype, const char *labelname,
 }

 /*
+ * Open a group of related objects, without emitting actual data.
+ *
+ * Prepare the formatting state as though we were beginning a group with
+ * the identified properties, but don't actually emit anything.  Output
+ * subsequent to this call can be redirected into a separate output buffer,
+ * and then eventually appended to the main output buffer after doing a
+ * regular ExplainOpenGroup call (with the same parameters).
+ *
+ * The extra "depth" parameter is the new group's depth compared to current.
+ * It could be more than one, in case the eventual output will be enclosed
+ * in additional nesting group levels.  We assume we don't need to track
+ * formatting state for those levels while preparing this group's output.
+ *
+ * There is no ExplainCloseSetAsideGroup --- in current usage, we always
+ * pop this state with ExplainSaveGroup.
+ */
+static void
+ExplainOpenSetAsideGroup(const char *objtype, const char *labelname,
+                         bool labeled, int depth, ExplainState *es)
+{
+    switch (es->format)
+    {
+        case EXPLAIN_FORMAT_TEXT:
+            /* nothing to do */
+            break;
+
+        case EXPLAIN_FORMAT_XML:
+            es->indent += depth;
+            break;
+
+        case EXPLAIN_FORMAT_JSON:
+            es->grouping_stack = lcons_int(0, es->grouping_stack);
+            es->indent += depth;
+            break;
+
+        case EXPLAIN_FORMAT_YAML:
+            if (labelname)
+                es->grouping_stack = lcons_int(1, es->grouping_stack);
+            else
+                es->grouping_stack = lcons_int(0, es->grouping_stack);
+            es->indent += depth;
+            break;
+    }
+}
+
+/*
+ * Pop one level of grouping state, allowing for a re-push later.
+ *
+ * This is typically used after ExplainOpenSetAsideGroup; pass the
+ * same "depth" used for that.
+ *
+ * This should not emit any output.  If state needs to be saved,
+ * save it at *state_save.  Currently, an integer save area is sufficient
+ * for all formats, but we might need to revisit that someday.
+ */
+static void
+ExplainSaveGroup(ExplainState *es, int depth, int *state_save)
+{
+    switch (es->format)
+    {
+        case EXPLAIN_FORMAT_TEXT:
+            /* nothing to do */
+            break;
+
+        case EXPLAIN_FORMAT_XML:
+            es->indent -= depth;
+            break;
+
+        case EXPLAIN_FORMAT_JSON:
+            es->indent -= depth;
+            *state_save = linitial_int(es->grouping_stack);
+            es->grouping_stack = list_delete_first(es->grouping_stack);
+            break;
+
+        case EXPLAIN_FORMAT_YAML:
+            es->indent -= depth;
+            *state_save = linitial_int(es->grouping_stack);
+            es->grouping_stack = list_delete_first(es->grouping_stack);
+            break;
+    }
+}
+
+/*
+ * Re-push one level of grouping state, undoing the effects of ExplainSaveGroup.
+ */
+static void
+ExplainRestoreGroup(ExplainState *es, int depth, int *state_save)
+{
+    switch (es->format)
+    {
+        case EXPLAIN_FORMAT_TEXT:
+            /* nothing to do */
+            break;
+
+        case EXPLAIN_FORMAT_XML:
+            es->indent += depth;
+            break;
+
+        case EXPLAIN_FORMAT_JSON:
+            es->grouping_stack = lcons_int(*state_save, es->grouping_stack);
+            es->indent += depth;
+            break;
+
+        case EXPLAIN_FORMAT_YAML:
+            es->grouping_stack = lcons_int(*state_save, es->grouping_stack);
+            es->indent += depth;
+            break;
+    }
+}
+
+/*
  * Emit a "dummy" group that never has any members.
  *
  * objtype is the type of the group object, labelname is its label within
@@ -3913,6 +4189,21 @@ ExplainXMLTag(const char *tagname, int flags, ExplainState *es)
 }

 /*
+ * Indent a text-format line.
+ *
+ * We indent by two spaces per indentation level.  However, when emitting
+ * data for a parallel worker there might already be data on the current line
+ * (cf. ExplainOpenWorker); in that case, don't indent any more.
+ */
+static void
+ExplainIndentText(ExplainState *es)
+{
+    Assert(es->format == EXPLAIN_FORMAT_TEXT);
+    if (es->str->len == 0 || es->str->data[es->str->len - 1] == '\n')
+        appendStringInfoSpaces(es->str, es->indent * 2);
+}
+
+/*
  * Emit a JSON line ending.
  *
  * JSON requires a comma after each property but the last.  To facilitate this,
diff --git a/src/include/commands/explain.h b/src/include/commands/explain.h
index 45027a7..853e206 100644
--- a/src/include/commands/explain.h
+++ b/src/include/commands/explain.h
@@ -25,6 +25,15 @@ typedef enum ExplainFormat
     EXPLAIN_FORMAT_YAML
 } ExplainFormat;

+typedef struct ExplainWorkersState
+{
+    int            num_workers;    /* # of worker processes the plan used */
+    bool       *worker_inited;    /* per-worker state-initialized flags */
+    StringInfoData *worker_str; /* per-worker transient output buffers */
+    int           *worker_state_save;    /* per-worker grouping state save areas */
+    StringInfo    prev_str;        /* saved output buffer while redirecting */
+} ExplainWorkersState;
+
 typedef struct ExplainState
 {
     StringInfo    str;            /* output buffer */
@@ -47,6 +56,8 @@ typedef struct ExplainState
     List       *deparse_cxt;    /* context list for deparsing expressions */
     Bitmapset  *printed_subplans;    /* ids of SubPlans we've printed */
     bool        hide_workers;    /* set if we find an invisible Gather */
+    /* state related to the current plan node */
+    ExplainWorkersState *workers_state; /* needed if parallel plan */
 } ExplainState;

 /* Hook for plugins to get control in ExplainOneQuery() */

Re: Duplicate Workers entries in some EXPLAIN plans

From

Maciek Sakrejda

Date:

25 January 2020, 03:33:18

Thanks for the thorough review. I obviously missed some critical
issues. I recognized some of the other maintainability problems, but
did not have a sense of how to best avoid them, being unfamiliar with
the code.

For what it's worth, this version makes sense to me.

Re: Duplicate Workers entries in some EXPLAIN plans

From

Tom Lane

Date:

25 January 2020, 19:23:50

Maciek Sakrejda <m.sakrejda@gmail.com> writes:
> For what it's worth, this version makes sense to me.

Thanks for looking.  Here's a version that deals with the JIT
instrumentation.  As Andres noted far upthread, that was also
really bogusly done before.  Not only could you get multiple "JIT"
subnodes on a Gather node, but we failed to print the info at all
if the parallelism was expressed as Gather Merge rather than
Gather.

A side effect of this change is that per-worker JIT info is now
printed one plan level further down: before it was printed on
the Gather node, but now it's attached to the Gather's child,
because that's where we print other per-worker data.  I don't
see anything particularly wrong with that, but it's another
change from the behavior today.

It's still really unclear to me how we could exercise any of
this behavior meaningfully in a regression test.  I thought
for a little bit about using the TAP infrastructure instead
of a traditional-style test, but it seems like that doesn't
buy anything except for a bias towards ignoring details instead
of overspecifying them.  Which is not much of an improvement.

            regards, tom lane

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index d189b8d..fe3c83b 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -57,6 +57,8 @@ static void ExplainOneQuery(Query *query, int cursorOptions,
                             IntoClause *into, ExplainState *es,
                             const char *queryString, ParamListInfo params,
                             QueryEnvironment *queryEnv);
+static void ExplainPrintJIT(ExplainState *es, int jit_flags,
+                            JitInstrumentation *ji);
 static void report_triggers(ResultRelInfo *rInfo, bool show_relname,
                             ExplainState *es);
 static double elapsed_time(instr_time *starttime);
@@ -123,11 +125,20 @@ static void ExplainSubPlans(List *plans, List *ancestors,
                             const char *relationship, ExplainState *es);
 static void ExplainCustomChildren(CustomScanState *css,
                                   List *ancestors, ExplainState *es);
+static ExplainWorkersState *ExplainCreateWorkersState(int num_workers);
+static void ExplainOpenWorker(int n, ExplainState *es);
+static void ExplainCloseWorker(int n, ExplainState *es);
+static void ExplainFlushWorkersState(ExplainState *es);
 static void ExplainProperty(const char *qlabel, const char *unit,
                             const char *value, bool numeric, ExplainState *es);
+static void ExplainOpenSetAsideGroup(const char *objtype, const char *labelname,
+                                     bool labeled, int depth, ExplainState *es);
+static void ExplainSaveGroup(ExplainState *es, int depth, int *state_save);
+static void ExplainRestoreGroup(ExplainState *es, int depth, int *state_save);
 static void ExplainDummyGroup(const char *objtype, const char *labelname,
                               ExplainState *es);
 static void ExplainXMLTag(const char *tagname, int flags, ExplainState *es);
+static void ExplainIndentText(ExplainState *es);
 static void ExplainJSONLineEnding(ExplainState *es);
 static void ExplainYAMLLineStarting(ExplainState *es);
 static void escape_yaml(StringInfo buf, const char *str);
@@ -790,31 +801,22 @@ ExplainPrintJITSummary(ExplainState *es, QueryDesc *queryDesc)
     if (queryDesc->estate->es_jit_worker_instr)
         InstrJitAgg(&ji, queryDesc->estate->es_jit_worker_instr);

-    ExplainPrintJIT(es, queryDesc->estate->es_jit_flags, &ji, -1);
+    ExplainPrintJIT(es, queryDesc->estate->es_jit_flags, &ji);
 }

 /*
  * ExplainPrintJIT -
  *      Append information about JITing to es->str.
- *
- * Can be used to print the JIT instrumentation of the backend (worker_num =
- * -1) or that of a specific worker (worker_num = ...).
  */
-void
-ExplainPrintJIT(ExplainState *es, int jit_flags,
-                JitInstrumentation *ji, int worker_num)
+static void
+ExplainPrintJIT(ExplainState *es, int jit_flags, JitInstrumentation *ji)
 {
     instr_time    total_time;
-    bool        for_workers = (worker_num >= 0);

     /* don't print information if no JITing happened */
     if (!ji || ji->created_functions == 0)
         return;

-    /* don't print per-worker info if we're supposed to hide that */
-    if (for_workers && es->hide_workers)
-        return;
-
     /* calculate total time */
     INSTR_TIME_SET_ZERO(total_time);
     INSTR_TIME_ADD(total_time, ji->generation_counter);
@@ -827,16 +829,13 @@ ExplainPrintJIT(ExplainState *es, int jit_flags,
     /* for higher density, open code the text output format */
     if (es->format == EXPLAIN_FORMAT_TEXT)
     {
-        appendStringInfoSpaces(es->str, es->indent * 2);
-        if (for_workers)
-            appendStringInfo(es->str, "JIT for worker %u:\n", worker_num);
-        else
-            appendStringInfoString(es->str, "JIT:\n");
-        es->indent += 1;
+        ExplainIndentText(es);
+        appendStringInfoString(es->str, "JIT:\n");
+        es->indent++;

         ExplainPropertyInteger("Functions", NULL, ji->created_functions, es);

-        appendStringInfoSpaces(es->str, es->indent * 2);
+        ExplainIndentText(es);
         appendStringInfo(es->str, "Options: %s %s, %s %s, %s %s, %s %s\n",
                          "Inlining", jit_flags & PGJIT_INLINE ? "true" : "false",
                          "Optimization", jit_flags & PGJIT_OPT3 ? "true" : "false",
@@ -845,7 +844,7 @@ ExplainPrintJIT(ExplainState *es, int jit_flags,

         if (es->analyze && es->timing)
         {
-            appendStringInfoSpaces(es->str, es->indent * 2);
+            ExplainIndentText(es);
             appendStringInfo(es->str,
                              "Timing: %s %.3f ms, %s %.3f ms, %s %.3f ms, %s %.3f ms, %s %.3f ms\n",
                              "Generation", 1000.0 * INSTR_TIME_GET_DOUBLE(ji->generation_counter),
@@ -855,11 +854,10 @@ ExplainPrintJIT(ExplainState *es, int jit_flags,
                              "Total", 1000.0 * INSTR_TIME_GET_DOUBLE(total_time));
         }

-        es->indent -= 1;
+        es->indent--;
     }
     else
     {
-        ExplainPropertyInteger("Worker Number", NULL, worker_num, es);
         ExplainPropertyInteger("Functions", NULL, ji->created_functions, es);

         ExplainOpenGroup("Options", "Options", true, es);
@@ -1074,9 +1072,10 @@ ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used)
  * optional name to be attached to the node.
  *
  * In text format, es->indent is controlled in this function since we only
- * want it to change at plan-node boundaries.  In non-text formats, es->indent
- * corresponds to the nesting depth of logical output groups, and therefore
- * is controlled by ExplainOpenGroup/ExplainCloseGroup.
+ * want it to change at plan-node boundaries (but a few subroutines will
+ * transiently increment it).  In non-text formats, es->indent corresponds
+ * to the nesting depth of logical output groups, and therefore is controlled
+ * by ExplainOpenGroup/ExplainCloseGroup.
  */
 static void
 ExplainNode(PlanState *planstate, List *ancestors,
@@ -1090,9 +1089,20 @@ ExplainNode(PlanState *planstate, List *ancestors,
     const char *partialmode = NULL;
     const char *operation = NULL;
     const char *custom_name = NULL;
+    ExplainWorkersState *save_workers_state = es->workers_state;
     int            save_indent = es->indent;
     bool        haschildren;

+    /*
+     * Prepare per-worker output buffers, if needed.  We'll append the data in
+     * these to the main output string further down.
+     */
+    if (planstate->worker_instrument && es->analyze && !es->hide_workers)
+        es->workers_state = ExplainCreateWorkersState(planstate->worker_instrument->num_workers);
+    else
+        es->workers_state = NULL;
+
+    /* Identify plan node type, and print generic details */
     switch (nodeTag(plan))
     {
         case T_Result:
@@ -1324,13 +1334,13 @@ ExplainNode(PlanState *planstate, List *ancestors,
     {
         if (plan_name)
         {
-            appendStringInfoSpaces(es->str, es->indent * 2);
+            ExplainIndentText(es);
             appendStringInfo(es->str, "%s\n", plan_name);
             es->indent++;
         }
         if (es->indent)
         {
-            appendStringInfoSpaces(es->str, es->indent * 2);
+            ExplainIndentText(es);
             appendStringInfoString(es->str, "->  ");
             es->indent += 2;
         }
@@ -1574,6 +1584,56 @@ ExplainNode(PlanState *planstate, List *ancestors,
     if (es->format == EXPLAIN_FORMAT_TEXT)
         appendStringInfoChar(es->str, '\n');

+    /* prepare per-worker general execution details */
+    if (es->workers_state && es->verbose)
+    {
+        WorkerInstrumentation *w = planstate->worker_instrument;
+
+        for (int n = 0; n < w->num_workers; n++)
+        {
+            Instrumentation *instrument = &w->instrument[n];
+            double        nloops = instrument->nloops;
+            double        startup_ms;
+            double        total_ms;
+            double        rows;
+
+            if (nloops <= 0)
+                continue;
+            startup_ms = 1000.0 * instrument->startup / nloops;
+            total_ms = 1000.0 * instrument->total / nloops;
+            rows = instrument->ntuples / nloops;
+
+            ExplainOpenWorker(n, es);
+
+            if (es->format == EXPLAIN_FORMAT_TEXT)
+            {
+                ExplainIndentText(es);
+                if (es->timing)
+                    appendStringInfo(es->str,
+                                     "actual time=%.3f..%.3f rows=%.0f loops=%.0f\n",
+                                     startup_ms, total_ms, rows, nloops);
+                else
+                    appendStringInfo(es->str,
+                                     "actual rows=%.0f loops=%.0f\n",
+                                     rows, nloops);
+            }
+            else
+            {
+                if (es->timing)
+                {
+                    ExplainPropertyFloat("Actual Startup Time", "ms",
+                                         startup_ms, 3, es);
+                    ExplainPropertyFloat("Actual Total Time", "ms",
+                                         total_ms, 3, es);
+                }
+                ExplainPropertyFloat("Actual Rows", NULL, rows, 0, es);
+                ExplainPropertyFloat("Actual Loops", NULL, nloops, 0, es);
+            }
+
+            ExplainCloseWorker(n, es);
+        }
+    }
+
     /* target list */
     if (es->verbose)
         show_plan_tlist(planstate, ancestors, es);
@@ -1684,24 +1744,6 @@ ExplainNode(PlanState *planstate, List *ancestors,
                                            nworkers, es);
                 }

-                /*
-                 * Print per-worker Jit instrumentation. Use same conditions
-                 * as for the leader's JIT instrumentation, see comment there.
-                 */
-                if (es->costs && es->verbose &&
-                    outerPlanState(planstate)->worker_jit_instrument)
-                {
-                    PlanState  *child = outerPlanState(planstate);
-                    int            n;
-                    SharedJitInstrumentation *w = child->worker_jit_instrument;
-
-                    for (n = 0; n < w->num_workers; ++n)
-                    {
-                        ExplainPrintJIT(es, child->state->es_jit_flags,
-                                        &w->jit_instr[n], n);
-                    }
-                }
-
                 if (gather->single_copy || es->format != EXPLAIN_FORMAT_TEXT)
                     ExplainPropertyBool("Single Copy", gather->single_copy, es);
             }
@@ -1881,80 +1923,54 @@ ExplainNode(PlanState *planstate, List *ancestors,
             break;
     }

+    /*
+     * Prepare per-worker JIT instrumentation.  As with the overall JIT
+     * summary, this is printed only if printing costs is enabled.
+     */
+    if (es->workers_state && es->costs && es->verbose)
+    {
+        SharedJitInstrumentation *w = planstate->worker_jit_instrument;
+
+        if (w)
+        {
+            for (int n = 0; n < w->num_workers; n++)
+            {
+                ExplainOpenWorker(n, es);
+                ExplainPrintJIT(es, planstate->state->es_jit_flags,
+                                &w->jit_instr[n]);
+                ExplainCloseWorker(n, es);
+            }
+        }
+    }
+
     /* Show buffer usage */
     if (es->buffers && planstate->instrument)
         show_buffer_usage(es, &planstate->instrument->bufusage);

-    /* Show worker detail */
-    if (es->analyze && es->verbose && !es->hide_workers &&
-        planstate->worker_instrument)
+    /* Prepare per-worker buffer usage */
+    if (es->workers_state && es->buffers && es->verbose)
     {
         WorkerInstrumentation *w = planstate->worker_instrument;
-        bool        opened_group = false;
-        int            n;

-        for (n = 0; n < w->num_workers; ++n)
+        for (int n = 0; n < w->num_workers; n++)
         {
             Instrumentation *instrument = &w->instrument[n];
             double        nloops = instrument->nloops;
-            double        startup_ms;
-            double        total_ms;
-            double        rows;

             if (nloops <= 0)
                 continue;
-            startup_ms = 1000.0 * instrument->startup / nloops;
-            total_ms = 1000.0 * instrument->total / nloops;
-            rows = instrument->ntuples / nloops;
-
-            if (es->format == EXPLAIN_FORMAT_TEXT)
-            {
-                appendStringInfoSpaces(es->str, es->indent * 2);
-                appendStringInfo(es->str, "Worker %d: ", n);
-                if (es->timing)
-                    appendStringInfo(es->str,
-                                     "actual time=%.3f..%.3f rows=%.0f loops=%.0f\n",
-                                     startup_ms, total_ms, rows, nloops);
-                else
-                    appendStringInfo(es->str,
-                                     "actual rows=%.0f loops=%.0f\n",
-                                     rows, nloops);
-                es->indent++;
-                if (es->buffers)
-                    show_buffer_usage(es, &instrument->bufusage);
-                es->indent--;
-            }
-            else
-            {
-                if (!opened_group)
-                {
-                    ExplainOpenGroup("Workers", "Workers", false, es);
-                    opened_group = true;
-                }
-                ExplainOpenGroup("Worker", NULL, true, es);
-                ExplainPropertyInteger("Worker Number", NULL, n, es);
-
-                if (es->timing)
-                {
-                    ExplainPropertyFloat("Actual Startup Time", "ms",
-                                         startup_ms, 3, es);
-                    ExplainPropertyFloat("Actual Total Time", "ms",
-                                         total_ms, 3, es);
-                }
-                ExplainPropertyFloat("Actual Rows", NULL, rows, 0, es);
-                ExplainPropertyFloat("Actual Loops", NULL, nloops, 0, es);

-                if (es->buffers)
-                    show_buffer_usage(es, &instrument->bufusage);
-
-                ExplainCloseGroup("Worker", NULL, true, es);
-            }
+            ExplainOpenWorker(n, es);
+            show_buffer_usage(es, &instrument->bufusage);
+            ExplainCloseWorker(n, es);
         }
-
-        if (opened_group)
-            ExplainCloseGroup("Workers", "Workers", false, es);
     }

+    /* Show per-worker details for this plan node, then pop that stack */
+    if (es->workers_state)
+        ExplainFlushWorkersState(es);
+    es->workers_state = save_workers_state;
+
     /* Get ready to display the child plans */
     haschildren = planstate->initPlan ||
         outerPlanState(planstate) ||
@@ -2525,7 +2541,7 @@ show_tablesample(TableSampleClause *tsc, PlanState *planstate,
     {
         bool        first = true;

-        appendStringInfoSpaces(es->str, es->indent * 2);
+        ExplainIndentText(es);
         appendStringInfo(es->str, "Sampling: %s (", method_name);
         foreach(lc, params)
         {
@@ -2572,7 +2588,7 @@ show_sort_info(SortState *sortstate, ExplainState *es)

         if (es->format == EXPLAIN_FORMAT_TEXT)
         {
-            appendStringInfoSpaces(es->str, es->indent * 2);
+            ExplainIndentText(es);
             appendStringInfo(es->str, "Sort Method: %s  %s: %ldkB\n",
                              sortMethod, spaceType, spaceUsed);
         }
@@ -2588,12 +2604,14 @@ show_sort_info(SortState *sortstate, ExplainState *es)
      * You might think we should just skip this stanza entirely when
      * es->hide_workers is true, but then we'd get no sort-method output at
      * all.  We have to make it look like worker 0's data is top-level data.
-     * Currently, we only bother with that for text-format output.
+     * This is easily done by just skipping the OpenWorker/CloseWorker calls.
+     * Currently, we don't worry about the possibility that there are multiple
+     * workers in such a case; if there are, duplicate output fields will be
+     * emitted.
      */
     if (sortstate->shared_info != NULL)
     {
         int            n;
-        bool        opened_group = false;

         for (n = 0; n < sortstate->shared_info->num_workers; n++)
         {
@@ -2609,32 +2627,26 @@ show_sort_info(SortState *sortstate, ExplainState *es)
             spaceType = tuplesort_space_type_name(sinstrument->spaceType);
             spaceUsed = sinstrument->spaceUsed;

+            if (es->workers_state)
+                ExplainOpenWorker(n, es);
+
             if (es->format == EXPLAIN_FORMAT_TEXT)
             {
-                appendStringInfoSpaces(es->str, es->indent * 2);
-                if (n > 0 || !es->hide_workers)
-                    appendStringInfo(es->str, "Worker %d:  ", n);
+                ExplainIndentText(es);
                 appendStringInfo(es->str,
                                  "Sort Method: %s  %s: %ldkB\n",
                                  sortMethod, spaceType, spaceUsed);
             }
             else
             {
-                if (!opened_group)
-                {
-                    ExplainOpenGroup("Workers", "Workers", false, es);
-                    opened_group = true;
-                }
-                ExplainOpenGroup("Worker", NULL, true, es);
-                ExplainPropertyInteger("Worker Number", NULL, n, es);
                 ExplainPropertyText("Sort Method", sortMethod, es);
                 ExplainPropertyInteger("Sort Space Used", "kB", spaceUsed, es);
                 ExplainPropertyText("Sort Space Type", spaceType, es);
-                ExplainCloseGroup("Worker", NULL, true, es);
             }
+
+            if (es->workers_state)
+                ExplainCloseWorker(n, es);
         }
-        if (opened_group)
-            ExplainCloseGroup("Workers", "Workers", false, es);
     }
 }

@@ -2721,7 +2733,7 @@ show_hash_info(HashState *hashstate, ExplainState *es)
         else if (hinstrument.nbatch_original != hinstrument.nbatch ||
                  hinstrument.nbuckets_original != hinstrument.nbuckets)
         {
-            appendStringInfoSpaces(es->str, es->indent * 2);
+            ExplainIndentText(es);
             appendStringInfo(es->str,
                              "Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
                              hinstrument.nbuckets,
@@ -2732,7 +2744,7 @@ show_hash_info(HashState *hashstate, ExplainState *es)
         }
         else
         {
-            appendStringInfoSpaces(es->str, es->indent * 2);
+            ExplainIndentText(es);
             appendStringInfo(es->str,
                              "Buckets: %d  Batches: %d  Memory Usage: %ldkB\n",
                              hinstrument.nbuckets, hinstrument.nbatch,
@@ -2758,7 +2770,7 @@ show_tidbitmap_info(BitmapHeapScanState *planstate, ExplainState *es)
     {
         if (planstate->exact_pages > 0 || planstate->lossy_pages > 0)
         {
-            appendStringInfoSpaces(es->str, es->indent * 2);
+            ExplainIndentText(es);
             appendStringInfoString(es->str, "Heap Blocks:");
             if (planstate->exact_pages > 0)
                 appendStringInfo(es->str, " exact=%ld", planstate->exact_pages);
@@ -2894,7 +2906,7 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
         /* Show only positive counter values. */
         if (has_shared || has_local || has_temp)
         {
-            appendStringInfoSpaces(es->str, es->indent * 2);
+            ExplainIndentText(es);
             appendStringInfoString(es->str, "Buffers:");

             if (has_shared)
@@ -2949,7 +2961,7 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
         /* As above, show only positive counter values. */
         if (has_timing)
         {
-            appendStringInfoSpaces(es->str, es->indent * 2);
+            ExplainIndentText(es);
             appendStringInfoString(es->str, "I/O Timings:");
             if (!INSTR_TIME_IS_ZERO(usage->blk_read_time))
                 appendStringInfo(es->str, " read=%0.3f",
@@ -3237,7 +3249,7 @@ show_modifytable_info(ModifyTableState *mtstate, List *ancestors,
              */
             if (es->format == EXPLAIN_FORMAT_TEXT)
             {
-                appendStringInfoSpaces(es->str, es->indent * 2);
+                ExplainIndentText(es);
                 appendStringInfoString(es->str,
                                        fdwroutine ? foperation : operation);
             }
@@ -3427,6 +3439,158 @@ ExplainCustomChildren(CustomScanState *css, List *ancestors, ExplainState *es)
 }

 /*
+ * Create a per-plan-node workspace for collecting per-worker data.
+ *
+ * Output related to each worker will be temporarily "set aside" into a
+ * separate buffer, which we'll merge into the main output stream once
+ * we've processed all data for the plan node.  This makes it feasible to
+ * generate a coherent sub-group of fields for each worker, even though the
+ * code that produces the fields is in several different places in this file.
+ * Formatting of such a set-aside field group is managed by
+ * ExplainOpenSetAsideGroup and ExplainSaveGroup/ExplainRestoreGroup.
+ */
+static ExplainWorkersState *
+ExplainCreateWorkersState(int num_workers)
+{
+    ExplainWorkersState *wstate;
+
+    wstate = (ExplainWorkersState *) palloc(sizeof(ExplainWorkersState));
+    wstate->num_workers = num_workers;
+    wstate->worker_inited = (bool *) palloc0(num_workers * sizeof(bool));
+    wstate->worker_str = (StringInfoData *)
+        palloc0(num_workers * sizeof(StringInfoData));
+    wstate->worker_state_save = (int *) palloc(num_workers * sizeof(int));
+    return wstate;
+}
+
+/*
+ * Begin or resume output into the set-aside group for worker N.
+ */
+static void
+ExplainOpenWorker(int n, ExplainState *es)
+{
+    ExplainWorkersState *wstate = es->workers_state;
+
+    Assert(wstate);
+    Assert(n >= 0 && n < wstate->num_workers);
+
+    /* Save prior output buffer pointer */
+    wstate->prev_str = es->str;
+
+    if (!wstate->worker_inited[n])
+    {
+        /* First time through, so create the buffer for this worker */
+        initStringInfo(&wstate->worker_str[n]);
+        es->str = &wstate->worker_str[n];
+
+        /*
+         * Push suitable initial formatting state for this worker's field
+         * group.  We allow one extra logical nesting level, since this group
+         * will eventually be wrapped in an outer "Workers" group.
+         */
+        ExplainOpenSetAsideGroup("Worker", NULL, true, 2, es);
+
+        /*
+         * In non-TEXT formats we always emit a "Worker Number" field, even if
+         * there's no other data for this worker.
+         */
+        if (es->format != EXPLAIN_FORMAT_TEXT)
+            ExplainPropertyInteger("Worker Number", NULL, n, es);
+
+        wstate->worker_inited[n] = true;
+    }
+    else
+    {
+        /* Resuming output for a worker we've already emitted some data for */
+        es->str = &wstate->worker_str[n];
+
+        /* Restore formatting state saved by last ExplainCloseWorker() */
+        ExplainRestoreGroup(es, 2, &wstate->worker_state_save[n]);
+    }
+
+    /*
+     * In TEXT format, prefix the first output line for this worker with
+     * "Worker N:".  Then, any additional lines should be indented one more
+     * stop than the "Worker N" line is.
+     */
+    if (es->format == EXPLAIN_FORMAT_TEXT)
+    {
+        if (es->str->len == 0)
+        {
+            ExplainIndentText(es);
+            appendStringInfo(es->str, "Worker %d:  ", n);
+        }
+
+        es->indent++;
+    }
+}
+
+/*
+ * End output for worker N --- must pair with previous ExplainOpenWorker call
+ */
+static void
+ExplainCloseWorker(int n, ExplainState *es)
+{
+    ExplainWorkersState *wstate = es->workers_state;
+
+    Assert(wstate);
+    Assert(n >= 0 && n < wstate->num_workers);
+    Assert(wstate->worker_inited[n]);
+
+    /*
+     * Save formatting state in case we do another ExplainOpenWorker(), then
+     * pop the formatting stack.
+     */
+    ExplainSaveGroup(es, 2, &wstate->worker_state_save[n]);
+
+    /*
+     * In TEXT format, if we didn't actually produce any output line(s) then
+     * truncate off the partial line.  (This is to avoid bogus output if, for
+     * instance, show_buffer_usage chooses not to print anything for the
+     * worker.)  Also fix up the indent level.
+     */
+    if (es->format == EXPLAIN_FORMAT_TEXT)
+    {
+        while (es->str->len > 0 && es->str->data[es->str->len - 1] != '\n')
+            es->str->data[--(es->str->len)] = '\0';
+
+        es->indent--;
+    }
+
+    /* Restore prior output buffer pointer */
+    es->str = wstate->prev_str;
+}
+
+/*
+ * Print per-worker info for current node, then free the ExplainWorkersState.
+ */
+static void
+ExplainFlushWorkersState(ExplainState *es)
+{
+    ExplainWorkersState *wstate = es->workers_state;
+
+    ExplainOpenGroup("Workers", "Workers", false, es);
+    for (int i = 0; i < wstate->num_workers; i++)
+    {
+        if (wstate->worker_inited[i])
+        {
+            /* This must match previous ExplainOpenSetAsideGroup call */
+            ExplainOpenGroup("Worker", NULL, true, es);
+            appendStringInfoString(es->str, wstate->worker_str[i].data);
+            ExplainCloseGroup("Worker", NULL, true, es);
+
+            pfree(wstate->worker_str[i].data);
+        }
+    }
+    ExplainCloseGroup("Workers", "Workers", false, es);
+
+    pfree(wstate->worker_inited);
+    pfree(wstate->worker_str);
+    pfree(wstate->worker_state_save);
+    pfree(wstate);
+}
+
+/*
  * Explain a property, such as sort keys or targets, that takes the form of
  * a list of unlabeled items.  "data" is a list of C strings.
  */
@@ -3439,7 +3603,7 @@ ExplainPropertyList(const char *qlabel, List *data, ExplainState *es)
     switch (es->format)
     {
         case EXPLAIN_FORMAT_TEXT:
-            appendStringInfoSpaces(es->str, es->indent * 2);
+            ExplainIndentText(es);
             appendStringInfo(es->str, "%s: ", qlabel);
             foreach(lc, data)
             {
@@ -3560,7 +3724,7 @@ ExplainProperty(const char *qlabel, const char *unit, const char *value,
     switch (es->format)
     {
         case EXPLAIN_FORMAT_TEXT:
-            appendStringInfoSpaces(es->str, es->indent * 2);
+            ExplainIndentText(es);
             if (unit)
                 appendStringInfo(es->str, "%s: %s %s\n", qlabel, value, unit);
             else
@@ -3752,6 +3916,117 @@ ExplainCloseGroup(const char *objtype, const char *labelname,
 }

 /*
+ * Open a group of related objects, without emitting actual data.
+ *
+ * Prepare the formatting state as though we were beginning a group with
+ * the identified properties, but don't actually emit anything.  Output
+ * subsequent to this call can be redirected into a separate output buffer,
+ * and then eventually appended to the main output buffer after doing a
+ * regular ExplainOpenGroup call (with the same parameters).
+ *
+ * The extra "depth" parameter is the new group's depth compared to current.
+ * It could be more than one, in case the eventual output will be enclosed
+ * in additional nesting group levels.  We assume we don't need to track
+ * formatting state for those levels while preparing this group's output.
+ *
+ * There is no ExplainCloseSetAsideGroup --- in current usage, we always
+ * pop this state with ExplainSaveGroup.
+ */
+static void
+ExplainOpenSetAsideGroup(const char *objtype, const char *labelname,
+                         bool labeled, int depth, ExplainState *es)
+{
+    switch (es->format)
+    {
+        case EXPLAIN_FORMAT_TEXT:
+            /* nothing to do */
+            break;
+
+        case EXPLAIN_FORMAT_XML:
+            es->indent += depth;
+            break;
+
+        case EXPLAIN_FORMAT_JSON:
+            es->grouping_stack = lcons_int(0, es->grouping_stack);
+            es->indent += depth;
+            break;
+
+        case EXPLAIN_FORMAT_YAML:
+            if (labelname)
+                es->grouping_stack = lcons_int(1, es->grouping_stack);
+            else
+                es->grouping_stack = lcons_int(0, es->grouping_stack);
+            es->indent += depth;
+            break;
+    }
+}
+
+/*
+ * Pop one level of grouping state, allowing for a re-push later.
+ *
+ * This is typically used after ExplainOpenSetAsideGroup; pass the
+ * same "depth" used for that.
+ *
+ * This should not emit any output.  If state needs to be saved,
+ * save it at *state_save.  Currently, an integer save area is sufficient
+ * for all formats, but we might need to revisit that someday.
+ */
+static void
+ExplainSaveGroup(ExplainState *es, int depth, int *state_save)
+{
+    switch (es->format)
+    {
+        case EXPLAIN_FORMAT_TEXT:
+            /* nothing to do */
+            break;
+
+        case EXPLAIN_FORMAT_XML:
+            es->indent -= depth;
+            break;
+
+        case EXPLAIN_FORMAT_JSON:
+            es->indent -= depth;
+            *state_save = linitial_int(es->grouping_stack);
+            es->grouping_stack = list_delete_first(es->grouping_stack);
+            break;
+
+        case EXPLAIN_FORMAT_YAML:
+            es->indent -= depth;
+            *state_save = linitial_int(es->grouping_stack);
+            es->grouping_stack = list_delete_first(es->grouping_stack);
+            break;
+    }
+}
+
+/*
+ * Re-push one level of grouping state, undoing the effects of ExplainSaveGroup.
+ */
+static void
+ExplainRestoreGroup(ExplainState *es, int depth, int *state_save)
+{
+    switch (es->format)
+    {
+        case EXPLAIN_FORMAT_TEXT:
+            /* nothing to do */
+            break;
+
+        case EXPLAIN_FORMAT_XML:
+            es->indent += depth;
+            break;
+
+        case EXPLAIN_FORMAT_JSON:
+            es->grouping_stack = lcons_int(*state_save, es->grouping_stack);
+            es->indent += depth;
+            break;
+
+        case EXPLAIN_FORMAT_YAML:
+            es->grouping_stack = lcons_int(*state_save, es->grouping_stack);
+            es->indent += depth;
+            break;
+    }
+}
+
+/*
  * Emit a "dummy" group that never has any members.
  *
  * objtype is the type of the group object, labelname is its label within
@@ -3913,6 +4188,21 @@ ExplainXMLTag(const char *tagname, int flags, ExplainState *es)
 }

 /*
+ * Indent a text-format line.
+ *
+ * We indent by two spaces per indentation level.  However, when emitting
+ * data for a parallel worker there might already be data on the current line
+ * (cf. ExplainOpenWorker); in that case, don't indent any more.
+ */
+static void
+ExplainIndentText(ExplainState *es)
+{
+    Assert(es->format == EXPLAIN_FORMAT_TEXT);
+    if (es->str->len == 0 || es->str->data[es->str->len - 1] == '\n')
+        appendStringInfoSpaces(es->str, es->indent * 2);
+}
+
+/*
  * Emit a JSON line ending.
  *
  * JSON requires a comma after each property but the last.  To facilitate this,
diff --git a/src/include/commands/explain.h b/src/include/commands/explain.h
index 45027a7..54f6240 100644
--- a/src/include/commands/explain.h
+++ b/src/include/commands/explain.h
@@ -25,6 +25,15 @@ typedef enum ExplainFormat
     EXPLAIN_FORMAT_YAML
 } ExplainFormat;

+typedef struct ExplainWorkersState
+{
+    int            num_workers;    /* # of worker processes the plan used */
+    bool       *worker_inited;    /* per-worker state-initialized flags */
+    StringInfoData *worker_str; /* per-worker transient output buffers */
+    int           *worker_state_save;    /* per-worker grouping state save areas */
+    StringInfo    prev_str;        /* saved output buffer while redirecting */
+} ExplainWorkersState;
+
 typedef struct ExplainState
 {
     StringInfo    str;            /* output buffer */
@@ -47,6 +56,8 @@ typedef struct ExplainState
     List       *deparse_cxt;    /* context list for deparsing expressions */
     Bitmapset  *printed_subplans;    /* ids of SubPlans we've printed */
     bool        hide_workers;    /* set if we find an invisible Gather */
+    /* state related to the current plan node */
+    ExplainWorkersState *workers_state; /* needed if parallel plan */
 } ExplainState;

 /* Hook for plugins to get control in ExplainOneQuery() */
@@ -84,8 +95,6 @@ extern void ExplainPrintPlan(ExplainState *es, QueryDesc *queryDesc);
 extern void ExplainPrintTriggers(ExplainState *es, QueryDesc *queryDesc);

 extern void ExplainPrintJITSummary(ExplainState *es, QueryDesc *queryDesc);
-extern void ExplainPrintJIT(ExplainState *es, int jit_flags,
-                            struct JitInstrumentation *jit_instr, int worker_num);

 extern void ExplainQueryText(ExplainState *es, QueryDesc *queryDesc);

Re: Duplicate Workers entries in some EXPLAIN plans

From

Tom Lane

Date:

25 January 2020, 23:20:16

I wrote:
> It's still really unclear to me how we could exercise any of
> this behavior meaningfully in a regression test.  I thought
> for a little bit about using the TAP infrastructure instead
> of a traditional-style test, but it seems like that doesn't
> buy anything except for a bias towards ignoring details instead
> of overspecifying them.  Which is not much of an improvement.

After further thought, I decided that about the best we can do
is suppress the "Workers" field in the regression test's expected
output.  This still gives us code coverage of the relevant code,
and we can check that the output is valid JSON before we strip it,
so it's not a dead loss.

I rewrote the test script a bit to add some coverage of XML and YAML
output formats, since we had exactly none before, and pushed it.

            regards, tom lane

Re: Duplicate Workers entries in some EXPLAIN plans

From

Andres Freund

Date:

26 January 2020, 00:02:58

Hi,

On 2020-01-25 14:23:50 -0500, Tom Lane wrote:
> A side effect of this change is that per-worker JIT info is now
> printed one plan level further down: before it was printed on
> the Gather node, but now it's attached to the Gather's child,
> because that's where we print other per-worker data.  I don't
> see anything particularly wrong with that, but it's another
> change from the behavior today.

Yea, I don't see any need to be bothered by that.


> It's still really unclear to me how we could exercise any of
> this behavior meaningfully in a regression test.  I thought
> for a little bit about using the TAP infrastructure instead
> of a traditional-style test, but it seems like that doesn't
> buy anything except for a bias towards ignoring details instead
> of overspecifying them.  Which is not much of an improvement.

Hm. I'd like to avoid needing TAP for this kind of thing, it'll just
make it much more expensive in just about all ways.

Testing JIT explain is "easy" enough I think, I've posted a patch in
another thread, which just skips over the region of the test if JIT is
not available. See e.g. 0008 in
https://www.postgresql.org/message-id/20191029000229.fkjmuld3g7f2jq7i%40alap3.anarazel.de
(that's a thread I'd love your input in btw)


It's harder for parallel query though. If parallel query were able to
reuse workers, we could "just" check at the beginning of the test if we
are able to get the workers we need, and then skip the rest of the tests
if not. But as things stand that doesn't guarantee anything.

I wonder if we could introduce a debug GUC that makes parallel worker
acquisition just retry in a loop, for a time determined by the GUC. That
obviously would be a bad idea to do in a production setup, but it could
be good enough for regression tests?  There are some deadlock dangers,
but I'm not sure they really matter for the tests.



> +    /* prepare per-worker general execution details */
> +    if (es->workers_state && es->verbose)
> +    {
> +        WorkerInstrumentation *w = planstate->worker_instrument;
> +
> +        for (int n = 0; n < w->num_workers; n++)
> +        {
> +            Instrumentation *instrument = &w->instrument[n];
> +            double        nloops = instrument->nloops;
> +            double        startup_ms;
> +            double        total_ms;
> +            double        rows;
> +
> +            if (nloops <= 0)
> +                continue;
> +            startup_ms = 1000.0 * instrument->startup / nloops;
> +            total_ms = 1000.0 * instrument->total / nloops;
> +            rows = instrument->ntuples / nloops;
> +
> +            ExplainOpenWorker(n, es);
> +
> +            if (es->format == EXPLAIN_FORMAT_TEXT)
> +            {
> +                ExplainIndentText(es);
> +                if (es->timing)
> +                    appendStringInfo(es->str,
> +                                     "actual time=%.3f..%.3f rows=%.0f loops=%.0f\n",
> +                                     startup_ms, total_ms, rows, nloops);
> +                else
> +                    appendStringInfo(es->str,
> +                                     "actual rows=%.0f loops=%.0f\n",
> +                                     rows, nloops);
> +            }
> +            else
> +            {
> +                if (es->timing)
> +                {
> +                    ExplainPropertyFloat("Actual Startup Time", "ms",
> +                                         startup_ms, 3, es);
> +                    ExplainPropertyFloat("Actual Total Time", "ms",
> +                                         total_ms, 3, es);
> +                }
> +                ExplainPropertyFloat("Actual Rows", NULL, rows, 0, es);
> +                ExplainPropertyFloat("Actual Loops", NULL, nloops, 0, es);
> +            }
> +
> +            ExplainCloseWorker(n, es);
> +        }
> +    }

I'd personally move this into a separate function, given the patches
moves the code around already.  ExplainNode() is already hard enough to
navigate...

It probably also makes sense to move everything but the nloops <= 0,
ExplainOpenWorker/ExplainCloseWorker into its own function. As far as I
can tell it now should be identical between the non-parallel case?


> +/*
> + * Begin or resume output into the set-aside group for worker N.
> + */
> +static void

Would it make sense to make these functions non-static? It seems
plausible that code for a custom node or such would want to add
its own information?


> +ExplainOpenWorker(int n, ExplainState *es)
> +{
> +    ExplainWorkersState *wstate = es->workers_state;
> +
> +    Assert(wstate);
> +    Assert(n >= 0 && n < wstate->num_workers);
> +
> +    /* Save prior output buffer pointer */
> +    wstate->prev_str = es->str;
> +
> +    if (!wstate->worker_inited[n])
> +    {
> +        /* First time through, so create the buffer for this worker */
> +        initStringInfo(&wstate->worker_str[n]);
> +        es->str = &wstate->worker_str[n];
> +
> +        /*
> +         * Push suitable initial formatting state for this worker's field
> +         * group.  We allow one extra logical nesting level, since this group
> +         * will eventually be wrapped in an outer "Workers" group.
> +         */
> +        ExplainOpenSetAsideGroup("Worker", NULL, true, 2, es);
> +
> +        /*
> +         * In non-TEXT formats we always emit a "Worker Number" field, even if
> +         * there's no other data for this worker.
> +         */
> +        if (es->format != EXPLAIN_FORMAT_TEXT)
> +            ExplainPropertyInteger("Worker Number", NULL, n, es);
> +
> +        wstate->worker_inited[n] = true;
> +    }
> +    else
> +    {
> +        /* Resuming output for a worker we've already emitted some data for */
> +        es->str = &wstate->worker_str[n];
> +
> +        /* Restore formatting state saved by last ExplainCloseWorker() */
> +        ExplainRestoreGroup(es, 2, &wstate->worker_state_save[n]);
> +    }
> +
> +    /*
> +     * In TEXT format, prefix the first output line for this worker with
> +     * "Worker N:".  Then, any additional lines should be indented one more
> +     * stop than the "Worker N" line is.
> +     */
> +    if (es->format == EXPLAIN_FORMAT_TEXT)
> +    {
> +        if (es->str->len == 0)
> +        {
> +            ExplainIndentText(es);
> +            appendStringInfo(es->str, "Worker %d:  ", n);
> +        }
> +
> +        es->indent++;
> +    }
> +}

I don't quite get the Worker %d bit. Why are we not outputting that in
the !worker_inited block?


> +/*
> + * Print per-worker info for current node, then free the ExplainWorkersState.
> + */
> +static void
> +ExplainFlushWorkersState(ExplainState *es)
> +{
> +    ExplainWorkersState *wstate = es->workers_state;
> +
> +    ExplainOpenGroup("Workers", "Workers", false, es);
> +    for (int i = 0; i < wstate->num_workers; i++)
> +    {
> +        if (wstate->worker_inited[i])
> +        {
> +            /* This must match previous ExplainOpenSetAsideGroup call */
> +            ExplainOpenGroup("Worker", NULL, true, es);
> +            appendStringInfoString(es->str, wstate->worker_str[i].data);

Probably never matters, but given we do have the string length already,
we could use appendBinaryStringInfo().


> +            ExplainCloseGroup("Worker", NULL, true, es);

Not related to this patch: I never got why we maintain a grouping stack
for some things, but don't do it for the group properties
themselves.


>  /*
> + * Open a group of related objects, without emitting actual data.
> + *
> + * Prepare the formatting state as though we were beginning a group with
> + * the identified properties, but don't actually emit anything.  Output
> + * subsequent to this call can be redirected into a separate output buffer,
> + * and then eventually appended to the main output buffer after doing a
> + * regular ExplainOpenGroup call (with the same parameters).
> + *
> + * The extra "depth" parameter is the new group's depth compared to current.
> + * It could be more than one, in case the eventual output will be enclosed
> + * in additional nesting group levels.  We assume we don't need to track
> + * formatting state for those levels while preparing this group's output.
> + *
> + * There is no ExplainCloseSetAsideGroup --- in current usage, we always
> + * pop this state with ExplainSaveGroup.
> + */
> +static void
> +ExplainOpenSetAsideGroup(const char *objtype, const char *labelname,
> +                         bool labeled, int depth, ExplainState *es)
> +{
> +    switch (es->format)
> +    {
> +        case EXPLAIN_FORMAT_TEXT:
> +            /* nothing to do */
> +            break;
> +
> +        case EXPLAIN_FORMAT_XML:
> +            es->indent += depth;
> +            break;
> +
> +        case EXPLAIN_FORMAT_JSON:
> +            es->grouping_stack = lcons_int(0, es->grouping_stack);
> +            es->indent += depth;
> +            break;
> +
> +        case EXPLAIN_FORMAT_YAML:
> +            if (labelname)
> +                es->grouping_stack = lcons_int(1, es->grouping_stack);
> +            else
> +                es->grouping_stack = lcons_int(0, es->grouping_stack);
> +            es->indent += depth;
> +            break;
> +    }
> +}

Hm. It might be worthwhile to rename ExplainOpenSetAsideGroup and use it
from ExplainOpenGroup()? Seems we could just call it after
ExplainOpenGroup()'s switch, and remove all the indent/grouping_stack
related code from ExplainOpenGroup().


Greetings,

Andres Freund

Re: Duplicate Workers entries in some EXPLAIN plans

From

Tom Lane

Date:

26 January 2020, 00:30:33

Andres Freund <andres@anarazel.de> writes:
> I wonder if we could introduce a debug GUC that makes parallel worker
> acquisition just retry in a loop, for a time determined by the GUC. That
> obviously would be a bad idea to do in a production setup, but it could
> be good enough for regression tests?  There are some deadlock dangers,
> but I'm not sure they really matter for the tests.

Hmmm .... might work.  Seems like a better idea than "run it by itself"
as we have to do now.

> I'd personally move this into a separate function, given the patches
> moves the code around already.  ExplainNode() is already hard enough to
> navigate...

Well, it was already inline in ExplainNode, so this just moved the
code a few lines.  I'm not that excited about moving little bits of
that function out-of-line.

>> +/*
>> + * Begin or resume output into the set-aside group for worker N.
>> + */
>> +static void

> Would it make sense to make these functions non-static? It seems
> plausible that code for a custom node or such would want to add
> its own information?

Maybe, but seems to me that there'd be a whole lot of other infrastructure
needed to track additional per-worker state.  I'd just as soon not
expose this stuff until (a) there's a concrete not hypothetical use case
and (b) it's been around long enough to feel comfortable that there's
nothing wrong with the design.

>> +    /*
>> +     * In TEXT format, prefix the first output line for this worker with
>> +     * "Worker N:".  Then, any additional lines should be indented one more
>> +     * stop than the "Worker N" line is.
>> +     */

> I don't quite get the Worker %d bit. Why are we not outputting that in
> the !worker_inited block?

We might strip it off again in ExplainCloseWorker, and then potentially
add it back again in a later ExplainOpenWorker.  That could only happen
if an earlier ExplainOpen/CloseWorker fragment decides not to emit any
text and then a later one wants to do so.  Which I'm pretty sure is
unreachable right now, but I don't think this code should assume that.

>> +            appendStringInfoString(es->str, wstate->worker_str[i].data);

> Probably never matters, but given we do have the string length already,
> we could use appendBinaryStringInfo().

Ah, I was thinking about making that change but then forgot.

>> +            ExplainCloseGroup("Worker", NULL, true, es);

> Not related to this patch: I never got why we maintain a grouping stack
> for some things, but don't do it for the group properties
> themselves.

Right now it'd just be extra overhead.  If we ever have a case where it's
not convenient for an ExplainCloseGroup caller to provide the same data
as for ExplainOpenGroup, then I'd be on board with storing that info.

> Hm. It might be worthwhile to rename ExplainOpenSetAsideGroup and use it
> from ExplainOpenGroup()? Seems we could just call it after
> ExplainOpenGroup()'s switch, and remove all the indent/grouping_stack
> related code from ExplainOpenGroup().

Hmm.  It seemed easier to me to keep them separate, but ...

I did consider a design in which, instead of ExplainOpenSetAsideGroup,
there was some function that would initialize the "state_save" area and
then you'd call the "restore" function to make that state active.  It
seemed like that would be too dissimilar from ExplainOpenGroup --- but
conceivably, we could reimplement ExplainOpenGroup as calling the
initialize function and then the restore function (along with doing some
output).  Not really sure that'd be an improvement though: it'd involve
less duplicate code, but the functions would individually be harder to
wrap your brain around.

            regards, tom lane

Re: Duplicate Workers entries in some EXPLAIN plans

From

Tom Lane

Date:

26 January 2020, 23:00:21

I wrote:
> Andres Freund <andres@anarazel.de> writes:
>> I wonder if we could introduce a debug GUC that makes parallel worker
>> acquisition just retry in a loop, for a time determined by the GUC. That
>> obviously would be a bad idea to do in a production setup, but it could
>> be good enough for regression tests?  There are some deadlock dangers,
>> but I'm not sure they really matter for the tests.

> Hmmm .... might work.  Seems like a better idea than "run it by itself"
> as we have to do now.

The more I think about this, the more it seems like a good idea, and
not only for regression test purposes.  If you're about to launch a
query that will run for hours even with the max number of workers,
you don't want it to launch with less than that number just because
somebody else was eating a worker slot for a few milliseconds.

So I'm imagining a somewhat general-purpose GUC defined like
"max_delay_to_acquire_parallel_worker", measured say in milliseconds.
The default would be zero (current behavior: try once and give up),
but you could set it to small positive values if you have that kind
of production concern, while the regression tests could set it to big
positive values.  This would alleviate all sorts of problems we have
with not being able to assume stable results from parallel worker
acquisition in the tests.

            regards, tom lane