Thread: Duplicate Workers entries in some EXPLAIN plans
Hello,
I originally reported this in pgsql-bugs [0], but there wasn't much feedback there, so moving the discussion here. When using JSON, YAML, or XML-format EXPLAIN on a plan that uses a parallelized sort, the Sort nodes list two different entries for "Workers", one for the sort-related info, and one for general worker info. This is what this looks like in JSON (some details elided):
{
"Node Type": "Sort",
...
"Workers": [
{
"Worker Number": 0,
"Sort Method": "external merge",
"Sort Space Used": 20128,
"Sort Space Type": "Disk"
},
{
"Worker Number": 1,
"Sort Method": "external merge",
"Sort Space Used": 20128,
"Sort Space Type": "Disk"
}
],
...
"Workers": [
{
"Worker Number": 0,
"Actual Startup Time": 309.726,
"Actual Total Time": 310.179,
"Actual Rows": 4128,
"Actual Loops": 1,
"Shared Hit Blocks": 2872,
"Shared Read Blocks": 7584,
"Shared Dirtied Blocks": 0,
"Shared Written Blocks": 0,
"Local Hit Blocks": 0,
"Local Read Blocks": 0,
"Local Dirtied Blocks": 0,
"Local Written Blocks": 0,
"Temp Read Blocks": 490,
"Temp Written Blocks": 2529
},
{
"Worker Number": 1,
"Actual Startup Time": 306.523,
"Actual Total Time": 307.001,
"Actual Rows": 4128,
"Actual Loops": 1,
"Shared Hit Blocks": 3356,
"Shared Read Blocks": 7100,
"Shared Dirtied Blocks": 0,
"Shared Written Blocks": 0,
"Local Hit Blocks": 0,
"Local Read Blocks": 0,
"Local Dirtied Blocks": 0,
"Local Written Blocks": 0,
"Temp Read Blocks": 490,
"Temp Written Blocks": 2529
}
],
"Plans:" ...
}
"Node Type": "Sort",
...
"Workers": [
{
"Worker Number": 0,
"Sort Method": "external merge",
"Sort Space Used": 20128,
"Sort Space Type": "Disk"
},
{
"Worker Number": 1,
"Sort Method": "external merge",
"Sort Space Used": 20128,
"Sort Space Type": "Disk"
}
],
...
"Workers": [
{
"Worker Number": 0,
"Actual Startup Time": 309.726,
"Actual Total Time": 310.179,
"Actual Rows": 4128,
"Actual Loops": 1,
"Shared Hit Blocks": 2872,
"Shared Read Blocks": 7584,
"Shared Dirtied Blocks": 0,
"Shared Written Blocks": 0,
"Local Hit Blocks": 0,
"Local Read Blocks": 0,
"Local Dirtied Blocks": 0,
"Local Written Blocks": 0,
"Temp Read Blocks": 490,
"Temp Written Blocks": 2529
},
{
"Worker Number": 1,
"Actual Startup Time": 306.523,
"Actual Total Time": 307.001,
"Actual Rows": 4128,
"Actual Loops": 1,
"Shared Hit Blocks": 3356,
"Shared Read Blocks": 7100,
"Shared Dirtied Blocks": 0,
"Shared Written Blocks": 0,
"Local Hit Blocks": 0,
"Local Read Blocks": 0,
"Local Dirtied Blocks": 0,
"Local Written Blocks": 0,
"Temp Read Blocks": 490,
"Temp Written Blocks": 2529
}
],
"Plans:" ...
}
This is technically valid JSON, but it's extremely difficult to work with, since default JSON parsing in Ruby, node, Python, Go, and even Postgres' own jsonb only keep the latest key--the sort information is discarded (other languages probably don't fare much better; this is what I had on hand). As Tom Lane pointed out in my pgsql-bugs thread, this has been reported before [1] and in that earlier thread, Andrew Dunstan suggested that perhaps the simplest solution is to just rename the sort-related Workers node. Tom expressed some concerns about a breaking change here, though I think the current behavior means vanishingly few users are parsing this data correctly. Thoughts?
Thanks,
Maciek
Hi, On 2019-10-22 11:58:35 -0700, Maciek Sakrejda wrote: > I originally reported this in pgsql-bugs [0], but there wasn't much > feedback there, so moving the discussion here. When using JSON, YAML, or > XML-format EXPLAIN on a plan that uses a parallelized sort, the Sort nodes > list two different entries for "Workers", one for the sort-related info, > and one for general worker info. This is what this looks like in JSON (some > details elided): > > { > "Node Type": "Sort", > ... > "Workers": [ > { > "Worker Number": 0, > "Sort Method": "external merge", > "Sort Space Used": 20128, > "Sort Space Type": "Disk" > }, > { > "Worker Number": 1, > "Sort Method": "external merge", > "Sort Space Used": 20128, > "Sort Space Type": "Disk" > } > ], > ... > "Workers": [ > { > This is technically valid JSON, but it's extremely difficult to work with, > since default JSON parsing in Ruby, node, Python, Go, and even Postgres' > own jsonb only keep the latest key It's also quite confusing. > As Tom Lane pointed out in my pgsql-bugs thread, this has been > reported before [1] and in that earlier thread, Andrew Dunstan suggested > that perhaps the simplest solution is to just rename the sort-related > Workers node. Thoughts? Yea, I think we should fix this. The current output basically makes no sense. > Tom expressed some concerns about a breaking change here, > though I think the current behavior means vanishingly few users are parsing > this data correctly. Well, in a lot of the cases there's no parallel output for the sort, and in other cases BUFFERS is not specified. In either case the 'duplicate key' problem won't exist then. While Tom said: On 2019-10-16 09:16:56 +0200, Tom Lane wrote: > I think the text-mode output is intentional, but the other formats > need more work. Sort Method: external merge Disk: 4920kB Worker 0: Sort Method: external merge Disk: 5880kB Worker 1: Sort Method: external merge Disk: 5920kB Buffers: shared hit=682 read=10188, temp read=1415 written=2101 Worker 0: actual time=130.058..130.324 rows=1324 loops=1 Buffers: shared hit=337 read=3489, temp read=505 written=739 Worker 1: actual time=130.273..130.512 rows=1297 loops=1 Buffers: shared hit=345 read=3507, temp read=505 written=744 I don't think this is close to being good enough to be worth preserving. I think it's worth avoiding unnecessary breakage of explain output, but we also shouldn't endlessly carry forward confusing output, just because of that. It clearly seems like it'd be better if this instead were Sort Method: external merge Disk: 4920kB Buffers: shared hit=682 read=10188, temp read=1415 written=2101 Worker 0: actual time=130.058..130.324 rows=1324 loops=1 Sort Method: external merge Disk: 5880kB Buffers: shared hit=337 read=3489, temp read=505 written=739 Worker 1: actual time=130.273..130.512 rows=1297 loops=1 Buffers: shared hit=345 read=3507, temp read=505 written=744 Sort Method: external merge Disk: 5920kB I think the way this information was added in bf11e7ee2e36 and 33001fd7a707, contrasting to the output added in b287df70e408, is just not right. If we add similar instrumentation reporting to more nodes, we'll end up with duplicated information all over. Additionally the per-worker part of show_sort_info() basically just duplicated the rest of the function. I then also did something similar (although luckily with a different key...), with the ExplainPrintJIT() call for Gather nodes. Unfortunately I think the fix isn't all that trivial, due to the way we output the per-worker information at the end of ExplainNode(), by just dumping things into a string. It seems to me that a step in the right direction would be for ExplainNode() to create planstate->worker_instrument StringInfos, which can be handed to routines like show_sort_info(), which would print the per-node information into that, rather than directly dumping into es->output. Most of the current "Show worker detail" would be done earlier in ExplainNode(), at the place where we current display the "actual rows" bit. ISTM that should include removing the duplication fo the the contents of show_sort_info(), and probably also for the Gather, GatherMerge blocks (I've apparently skipped adding the JIT information to the latter, not sure if we ought to fix that in the stable branches). Any chance you want to take a stab at that? I don't think we'll fix it soon, but damn, all this string appending around just isn't a great way to reliably build nested data formats. Greetings, Andres Freund
On Thu, Oct 24, 2019 at 6:48 PM Andres Freund <andres@anarazel.de> wrote: > Unfortunately I think the fix isn't all that trivial, due to the way we > output the per-worker information at the end of ExplainNode(), by just > dumping things into a string. It seems to me that a step in the right > direction would be for ExplainNode() to create > planstate->worker_instrument StringInfos, which can be handed to > routines like show_sort_info(), which would print the per-node > information into that, rather than directly dumping into > es->output. Most of the current "Show worker detail" would be done > earlier in ExplainNode(), at the place where we current display the > "actual rows" bit. > > ISTM that should include removing the duplication fo the the contents of > show_sort_info(), and probably also for the Gather, GatherMerge blocks > (I've apparently skipped adding the JIT information to the latter, not > sure if we ought to fix that in the stable branches). > > Any chance you want to take a stab at that? It took me a while, but I did take a stab at it (thanks for your off-list help). Attached is my patch that changes the structured formats to merge sort worker output in with costs/timing/buffers worker output. I have not touched any other worker output yet, since it's not under a Workers group as far as I can tell (so it does not exhibit the problem I originally reported). I can try to make further changes here if the approach is deemed sound. I also have not touched text output; above you had proposed > Sort Method: external merge Disk: 4920kB > Buffers: shared hit=682 read=10188, temp read=1415 written=2101 > Worker 0: actual time=130.058..130.324 rows=1324 loops=1 > Sort Method: external merge Disk: 5880kB > Buffers: shared hit=337 read=3489, temp read=505 written=739 > Worker 1: actual time=130.273..130.512 rows=1297 loops=1 > Buffers: shared hit=345 read=3507, temp read=505 written=744 > Sort Method: external merge Disk: 5920kB which makes sense to me, but I'd like to confirm this is the approach we want before I add it to the patch. This is my first C in close to a decade (and I was never much of a C programmer to begin with), so please be gentle. As Andres suggested off-list, I also changed the worker output to order fields that also occur in the parent node in the same way as the parent node. I've also added a test for the patch, and because this is really an EXPLAIN issue rather than a query feature issue, I added a src/test/regress/sql/explain.sql for the test. I added a couple of utility functions for munging json-formatted EXPLAIN plans into something we can repeatably verify in regression tests (the functions use json rather than jsonb to preserve field order). I have not added this for YAML or XML (even though they should behave the same way), since I'm not familiar with the the functions to manipulate those data types in a similar way (if they exist). My hunch is due to the similarity of structured formats, just testing JSON is enough, but I can expand/adjust tests as necessary.
Attachment
I wanted to follow up on this patch since I received no feedback. What should my next steps be (besides rebasing, though I want to confirm there's interest before I do that)?
On Fri, Dec 27, 2019 at 12:31 AM Maciek Sakrejda <m.sakrejda@gmail.com> wrote: > > I wanted to follow up on this patch since I received no feedback. What should my next steps be (besides rebasing, thoughI want to confirm there's interest before I do that)? Given Andres' answer I'd say that there's interest in this patch. You should register this patch in the next commitfest (https://commitfest.postgresql.org/26/) to make sure that it's not forgotten, which unfortunately is probably what happened here .
Done! Thanks!
The following review has been posted through the commitfest application: make installcheck-world: not tested Implements feature: not tested Spec compliant: not tested Documentation: not tested This is a high level review only. However, seeing that there is a conflict and does not merge cleanly after commit <b925a00f4ef>,I return to Author. To be fair the resolution seems quite straight forward and I took the liberty of applying the necessary changes to includethe new element of ExplainState introduced in the above commit, namely hide_workers. However since the author mighthave a different idea on how to incorporate this change I leave it up to him. Another very high level comment is the introduction of a new test file, namely explain. Seeing `explain.sql` in the testssuits, personally and very opinion based, I would have expected the whole spectrum of the explain properties to be tested.However only a slight fraction of the functionality is tested. Since this is a bit more of a personal opinion, I don'texpect any changes unless the author happens to agree. Other than these minor nitpick, the code seems clear, concise and does what it says on the box. It follows the comments inthe discussion thread(s) and solves a real issue. Please have a look on how commit <b925a00f4ef> affects this patch and rebase. The new status of this patch is: Waiting on Author
Thanks for the review! I looked at <b925a00f4ef> and rebased the patch on current master, ac5bdf6. I introduced a new test file because this bug is specifically about EXPLAIN output (as opposed to query execution or planning functionality), and it didn't seem like a test would fit in any of the other files. I focused on testing just the behavior around this specific bug (and fix). I think eventually we should probably test other more fundamental EXPLAIN features (and I'm happy to contribute to that) in that file, but that seems outside of the scope of this patch. Any thoughts on what we should do with text mode output (which is untouched right now)? The output Andres proposed above makes sense to me, but I'd like to get more input.
Attachment
The following review has been posted through the commitfest application: make installcheck-world: tested, failed Implements feature: not tested Spec compliant: not tested Documentation: not tested The current version of the patch (v2) applies cleanly and does what it says on the box. > Any thoughts on what we should do with text mode output (which is untouched right now)? The output Andres proposed above makes sense to me, but I'd like to get more input. Nothing to add beyond what is stated in the thread. > Sort Method: external merge Disk: 4920kB > Buffers: shared hit=682 read=10188, temp read=1415 written=2101 > Worker 0: actual time=130.058..130.324 rows=1324 loops=1 > Sort Method: external merge Disk: 5880kB > Buffers: shared hit=337 read=3489, temp read=505 written=739 > Worker 1: actual time=130.273..130.512 rows=1297 loops=1 > Buffers: shared hit=345 read=3507, temp read=505 written=744 > Sort Method: external merge Disk: 5920kB This proposal seems like a fitting approach. Awaiting for v3 which will include the text version. May I suggest a format yaml test case? Just to make certain that no regressions occur in the future. Thanks,
Sounds good, I'll try that format. Any idea how to test YAML? With the JSON format, I was able to rely on Postgres' own JSON-manipulating functions to strip or canonicalize fields that can vary across executions--I can't really do that with YAML. Or should I run EXPLAIN with COSTS OFF, TIMING OFF, SUMMARY OFF and assume that for simple queries the BUFFERS output (and other fields I can't turn off like Sort Space Used) *is* going to be stable?
> Sounds good, I'll try that format. Any idea how to test YAML? With the > JSON format, I was able to rely on Postgres' own JSON-manipulating > functions to strip or canonicalize fields that can vary across > executions--I can't really do that with YAML. Yes, this approach was clear in the patch and works great with Json. Also you are correct, this can not be done with YAML. I spend a bit of time to look around and I could not find any tests really on yaml format. > Or should I run EXPLAIN > with COSTS OFF, TIMING OFF, SUMMARY OFF and assume that for simple > queries the BUFFERS output (and other fields I can't turn off like > Sort Space Used) *is* going to be stable? I have to admit with the current diff tool used in pg_regress, this is not possible. I am pretty certain that it *is not* going to be stable. Not for long anyways. I withdraw my suggestion for YAML and currently awaiting for TEXT format only.
TEXT format was tricky due to its inconsistencies, but I think I have something working reasonably well. I added a simple test for TEXT format output as well, using a similar approach as the JSON format test, and liberally regexp_replacing away any volatile output. I suppose in theory we could do this for YAML, too, but I think it's gross enough not to be worth it, especially given the high similarity of all the structured outputs.
Attachment
The following review has been posted through the commitfest application: make installcheck-world: not tested Implements feature: not tested Spec compliant: not tested Documentation: not tested > TEXT format was tricky due to its inconsistencies, but I think I have > something working reasonably well. I added a simple test for TEXT > format output as well, using a similar approach as the JSON format Great! > test, and liberally regexp_replacing away any volatile output. I > suppose in theory we could do this for YAML, too, but I think it's > gross enough not to be worth it, especially given the high similarity > of all the structured outputs. Agreed, what is in the patch suffices. Overall great work, a couple of minor nitpicks if you allow me. + /* Prepare per-worker output */ + if (es->analyze && planstate->worker_instrument) { Style, parenthesis on its own line. + int num_workers = planstate->worker_instrument->num_workers; + int n; + worker_strs = (StringInfo *) palloc0(num_workers * sizeof(StringInfo)); + for (n = 0; n < num_workers; n++) { I think C99 would be better here. Also no parenthesis needed. + worker_strs[n] = makeStringInfo(); + } + } @@ -1357,6 +1369,58 @@ ExplainNode(PlanState *planstate, List *ancestors, ExplainPropertyBool("Parallel Aware", plan->parallel_aware, es); } + /* Prepare worker general execution details */ + if (es->analyze && es->verbose && planstate->worker_instrument) + { + WorkerInstrumentation *w = planstate->worker_instrument; + int n; + + for (n = 0; n < w->num_workers; ++n) I think C99 would be better here. + { + Instrumentation *instrument = &w->instrument[n]; + double nloops = instrument->nloops; - appendStringInfoSpaces(es->str, es->indent * 2); - if (n > 0 || !es->hide_workers) - appendStringInfo(es->str, "Worker %d: ", n); + if (indent) + { + appendStringInfoSpaces(es->str, es->indent * 2); + } Style: No parenthesis needed - if (opened_group) - ExplainCloseGroup("Workers", "Workers", false, es); + /* Show worker detail */ + if (planstate->worker_instrument) { + ExplainFlushWorkers(worker_strs, planstate->worker_instrument->num_workers, es); } Style: No parenthesis needed + * just indent once, to add worker info on the next worker line. + */ + if (es->str == es->root_str) + { + es->indent += es->format == EXPLAIN_FORMAT_TEXT ? 1 : 2; + } + Style: No parenthesis needed + ExplainCloseGroup("Workers", "Workers", false, es); + // do we have any other cleanup to do? This comment does not really explain anything. Either remove or rephrase. Also C style comments. + es->print_workers = false; +} int indent; /* current indentation level */ List *grouping_stack; /* format-specific grouping state */ + bool print_workers; /* whether current node has worker metadata */ Hmm.. commit <b925a00f4ef> introduced `hide_workers` in the struct. Having both names in the struct so far apart even seems a bit confusing and sloppy. Do you think it would be possible to combine or rename? +extern void ExplainOpenWorker(StringInfo worker_str, ExplainState *es); +extern void ExplainCloseWorker(ExplainState *es); +extern void ExplainFlushWorkers(StringInfo *worker_strs, int num_workers, ExplainState *es); No need to expose those, is there? I feel there should be static. Awaiting for answer or resolution of these comments to change the status. //Georgios
Thanks! I'll fix the brace issues. Re: the other items: > + int num_workers = planstate->worker_instrument->num_workers; > + int n; > + worker_strs = (StringInfo *) palloc0(num_workers * sizeof(StringInfo)); > + for (n = 0; n < num_workers; n++) { > > I think C99 would be better here. Also no parenthesis needed. Pardon my C illiteracy, but what am I doing that's not valid C99 here? > + /* Prepare worker general execution details */ > + if (es->analyze && es->verbose && planstate->worker_instrument) > + { > + WorkerInstrumentation *w = planstate->worker_instrument; > + int n; > + > + for (n = 0; n < w->num_workers; ++n) > > I think C99 would be better here. And here (if it's not the same problem)? > + ExplainCloseGroup("Workers", "Workers", false, es); > + // do we have any other cleanup to do? > > This comment does not really explain anything. Either remove > or rephrase. Also C style comments. Good catch, thanks--I had put this in to remind myself (and reviewers) about cleanup, but I don't think there's anything else to do, so I'll just drop it. > int indent; /* current indentation level */ > List *grouping_stack; /* format-specific grouping state */ > + bool print_workers; /* whether current node has worker metadata */ > > Hmm.. commit <b925a00f4ef> introduced `hide_workers` in the struct. Having both > names in the struct so far apart even seems a bit confusing and sloppy. Do you > think it would be possible to combine or rename? I noticed that. I was thinking about combining them, but "hide_workers" seems to be about "pretend there is no worker output even if there is" and "print_workers" is "keep track of whether or not there is worker output to print". Maybe I'll rename to "has_worker_output"? > +extern void ExplainOpenWorker(StringInfo worker_str, ExplainState *es); > +extern void ExplainCloseWorker(ExplainState *es); > +extern void ExplainFlushWorkers(StringInfo *worker_strs, int num_workers, ExplainState *es); > > No need to expose those, is there? I feel there should be static. Good call, I'll update.
>> + int num_workers = planstate->worker_instrument->num_workers; >> + int n; >> + worker_strs = (StringInfo *) palloc0(num_workers * sizeof(StringInfo)); >> + for (n = 0; n < num_workers; n++) { >> >> I think C99 would be better here. Also no parenthesis needed. > > > Pardon my C illiteracy, but what am I doing that's not valid C99 here? My bad, I should have been more clear. I meant that it is preferable to use the C99 standard which calls for declaring variables in the scope that you need them. In this case, 'n' is needed only in the for loop, so something like for (int n = 0; n < num_workers; n++) is preferable. To be clear, your code was perfectly valid. It was only the style I was referring to. >> + for (n = 0; n < w->num_workers; ++n) >> >> I think C99 would be better here. > > > And here (if it's not the same problem)? Exactly the same as above. >> int indent; /* current indentation level */ >> List *grouping_stack; /* format-specific grouping state */ >> + bool print_workers; /* whether current node has worker metadata */ >> >> Hmm.. commit <b925a00f4ef> introduced `hide_workers` in the struct. Having both >> names in the struct so far apart even seems a bit confusing and sloppy. Do you >> think it would be possible to combine or rename? > > > I noticed that. I was thinking about combining them, but > "hide_workers" seems to be about "pretend there is no worker output > even if there is" and "print_workers" is "keep track of whether or not > there is worker output to print". Maybe I'll rename to > "has_worker_output"? The rename sounds a bit better in my humble opinion. Thanks.
On Wed, Jan 22, 2020 at 9:37 AM Georgios Kokolatos <gkokolatos@pm.me> wrote: > My bad, I should have been more clear. I meant that it is preferable to use > the C99 standard which calls for declaring variables in the scope that you > need them. Ah, I see. I think I got that from ExplainPrintSettings. I've corrected my usage--thanks for pointing it out. I appreciate the effort to maintain a consistent style. > > >> int indent; /* current indentation level */ > >> List *grouping_stack; /* format-specific grouping state */ > >> + bool print_workers; /* whether current node has worker metadata */ > >> > >> Hmm.. commit <b925a00f4ef> introduced `hide_workers` in the struct. Having both > >> names in the struct so far apart even seems a bit confusing and sloppy. Do you > >> think it would be possible to combine or rename? > > > > > > I noticed that. I was thinking about combining them, but > > "hide_workers" seems to be about "pretend there is no worker output > > even if there is" and "print_workers" is "keep track of whether or not > > there is worker output to print". Maybe I'll rename to > > "has_worker_output"? > > The rename sounds a bit better in my humble opinion. Thanks. Also, reviewing my code again, I noticed that when I moved the general worker output earlier, I missed part of the merge conflict: I had replaced - /* Show worker detail */ - if (es->analyze && es->verbose && !es->hide_workers && - planstate->worker_instrument) with + /* Prepare worker general execution details */ + if (es->analyze && es->verbose && planstate->worker_instrument) which ignores the es->hide_workers flag (it did not fail the tests, but the intent is pretty clear). I've corrected this in the current patch. I also noticed that we can now handle worker buffer output more consistently across TEXT and structured formats, so I made that small change too: diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c index 140d0be426..b23b015594 100644 --- a/src/backend/commands/explain.c +++ b/src/backend/commands/explain.c @@ -1401,8 +1401,6 @@ ExplainNode(PlanState *planstate, List *ancestors, appendStringInfo(es->str, "actual rows=%.0f loops=%.0f\n", rows, nloops); - if (es->buffers) - show_buffer_usage(es, &instrument->bufusage); } else { @@ -1951,7 +1949,7 @@ ExplainNode(PlanState *planstate, List *ancestors, /* Prepare worker buffer usage */ if (es->buffers && es->analyze && es->verbose && !es->hide_workers - && planstate->worker_instrument && es->format != EXPLAIN_FORMAT_TEXT) + && planstate->worker_instrument) { WorkerInstrumentation *w = planstate->worker_instrument; int n; diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out index 8034a4e0db..a4eed3067f 100644 --- a/src/test/regress/expected/explain.out +++ b/src/test/regress/expected/explain.out @@ -103,8 +103,8 @@ $$, 'verbose', 'analyze', 'buffers', 'timing off', 'costs off', 'summary off'), Sort Key: (ROW("*VALUES*".column1)) + Buffers: shared hit=114 + Worker 0: actual rows=2 loops=1 + - Buffers: shared hit=114 + Sort Method: xxx + + Buffers: shared hit=114 + -> Values Scan on "*VALUES*" (actual rows=2 loops=1) + Output: "*VALUES*".column1, ROW("*VALUES*".column1)+ Worker 0: actual rows=2 loops=1 + I think the "producing plan output for a worker" process is easier to reason about now, and while it changes TEXT format worker output order, the other changes in this patch are more drastic so this probably does not matter. I've also addressed the other feedback above, and reworded a couple of comments slightly.
Attachment
The following review has been posted through the commitfest application: make installcheck-world: not tested Implements feature: not tested Spec compliant: not tested Documentation: not tested > Ah, I see. I think I got that from ExplainPrintSettings. I've > corrected my usage--thanks for pointing it out. I appreciate the > effort to maintain a consistent style. Thanks, I am just following the reviewing guide to be honest. > Also, reviewing my code again, I noticed that when I moved the general > worker output earlier, I missed part of the merge conflict: Right. I thought that was intentional. > which ignores the es->hide_workers flag (it did not fail the tests, > but the intent is pretty clear). I've corrected this in the current > patch. Noted and appreciated. > I also noticed that we can now handle worker buffer output more > consistently across TEXT and structured formats, so I made that small > change too: Looks good. > I think the "producing plan output for a worker" process is easier to > reason about now, and while it changes TEXT format worker output > order, the other changes in this patch are more drastic so this > probably does not matter. > > I've also addressed the other feedback above, and reworded a couple of > comments slightly. Thanks for the above. Looks clean, does what it says in the tin and solves a problem that needs solving. All relevant installcheck-world pass. As far as I am concerned, I think it is ready to be sent to a committer. Updating the status accordingly. The new status of this patch is: Ready for Committer
Great, thank you. I noticed in the CF patch tester app, the build fails on Windows [1]. Investigating, I realized I had failed to fully strip volatile EXPLAIN info (namely buffers) in TEXT mode due to a bad regexp_replace. I've fixed this in the attached patch (which I also rebased against current master again). [1]: https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.76313
Attachment
Maciek Sakrejda <m.sakrejda@gmail.com> writes: > Great, thank you. I noticed in the CF patch tester app, the build > fails on Windows [1]. Investigating, I realized I had failed to fully > strip volatile EXPLAIN info (namely buffers) in TEXT mode due to a > bad regexp_replace. You haven't gone nearly far enough in that direction. The proposed output still relies on the assumption that the session will always get as many workers as it asks for, which it will not. For previous bitter experience in this department, see for instance commits 4ea03f3f4 and 13e8b2ee8. TBH I am not sure that the proposed regression tests for this change can be committed at all. Which is a bit of a problem perhaps, but then again we don't have terribly good coverage for the existing code either, for the same reason. regards, tom lane
Okay. Does not getting as many workers as it asks for include sometimes getting zero, completely changing the actual output? If so, I'll submit a new version of the patch removing all tests--I was hoping to improve coverage, but I guess this is not the way to start. If not, can we keep the json tests at least if we only consider the first worker?
Maciek Sakrejda <m.sakrejda@gmail.com> writes: > Okay. Does not getting as many workers as it asks for include > sometimes getting zero, completely changing the actual output? Yup :-(. We could possibly avoid that by running the explain test by itself rather than in a parallel group, but I don't especially want to add yet more non-parallelizable tests. Meanwhile, I spent some time looking at the code, and wasn't very happy with it. I'm on board with the general plan of redirecting EXPLAIN output into per-worker buffers that we eventually recombine, but I think that this particular coding is pretty unmaintainable/unextensible. In particular, I'm really unhappy that the code is ignoring EXPLAIN's grouping_state stack. That means that when it's formatting fields that belong to the per-worker group, it's using the state-stack entry that corresponds to the plan node's main level. This seems to accidentally work, but that fact depends on a number of shaky assumptions: * Before messing with any per-worker data, we've always emitted at least one field in the plan node's main level, so that the state-stack entry isn't at its initial state for the level. * Before actually emitting the shunted-aside data, we've always emitted a "Worker Number" field in correct format within the per-worker group, so that the formatting state is now correct for a non-initial field. * There is no formatting difference between any non-first fields in a level (ie the state stack entries are basically just booleans), so that it doesn't matter how many plan-node fields we emitted before starting the per-worker data, so long as there was at least one, nor does transiently abusing the plan node level's stack entry like this break the formatting of subsequent plan-node-level fields. Now maybe we'll never implement an output format that breaks that last assumption, and maybe we'll never rearrange the EXPLAIN code in a way that breaks either of the first two. But I don't like those assumptions too much. I also didn't like the code's assumption that all the non-text formats interpret es->indent the same. I also noted an actual bug, which is that the patch fails regression testing under force_parallel_mode = regress. This isn't really your fault, because the issue is in this obscure and poorly-commented hack in show_sort_info: * You might think we should just skip this stanza entirely when * es->hide_workers is true, but then we'd get no sort-method output at * all. We have to make it look like worker 0's data is top-level data. * Currently, we only bother with that for text-format output. Nonetheless, it's broken. So I spent some time hacking on this and came up with the attached. It's noticeably more verbose than your patch, but it keeps the output-format-aware code at what seems to me to be a maintainable arm's-length distance from the parallel-worker hacking. TEXT is still a special case of course :-(. This patch just covers the code, I'm not taking any position yet about what to do about the tests. I did tweak the code to eliminate the one formatting difference in select_parallel (ie put two spaces after "Worker N:", which I think reads better anyhow), so it passes check-world as it stands. Thoughts? regards, tom lane diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c index d189b8d..9e965d2 100644 --- a/src/backend/commands/explain.c +++ b/src/backend/commands/explain.c @@ -123,11 +123,20 @@ static void ExplainSubPlans(List *plans, List *ancestors, const char *relationship, ExplainState *es); static void ExplainCustomChildren(CustomScanState *css, List *ancestors, ExplainState *es); +static ExplainWorkersState *ExplainCreateWorkersState(int num_workers); +static void ExplainOpenWorker(int n, ExplainState *es); +static void ExplainCloseWorker(int n, ExplainState *es); +static void ExplainFlushWorkersState(ExplainState *es); static void ExplainProperty(const char *qlabel, const char *unit, const char *value, bool numeric, ExplainState *es); +static void ExplainOpenSetAsideGroup(const char *objtype, const char *labelname, + bool labeled, int depth, ExplainState *es); +static void ExplainSaveGroup(ExplainState *es, int depth, int *state_save); +static void ExplainRestoreGroup(ExplainState *es, int depth, int *state_save); static void ExplainDummyGroup(const char *objtype, const char *labelname, ExplainState *es); static void ExplainXMLTag(const char *tagname, int flags, ExplainState *es); +static void ExplainIndentText(ExplainState *es); static void ExplainJSONLineEnding(ExplainState *es); static void ExplainYAMLLineStarting(ExplainState *es); static void escape_yaml(StringInfo buf, const char *str); @@ -827,7 +836,7 @@ ExplainPrintJIT(ExplainState *es, int jit_flags, /* for higher density, open code the text output format */ if (es->format == EXPLAIN_FORMAT_TEXT) { - appendStringInfoSpaces(es->str, es->indent * 2); + ExplainIndentText(es); if (for_workers) appendStringInfo(es->str, "JIT for worker %u:\n", worker_num); else @@ -836,7 +845,7 @@ ExplainPrintJIT(ExplainState *es, int jit_flags, ExplainPropertyInteger("Functions", NULL, ji->created_functions, es); - appendStringInfoSpaces(es->str, es->indent * 2); + ExplainIndentText(es); appendStringInfo(es->str, "Options: %s %s, %s %s, %s %s, %s %s\n", "Inlining", jit_flags & PGJIT_INLINE ? "true" : "false", "Optimization", jit_flags & PGJIT_OPT3 ? "true" : "false", @@ -845,7 +854,7 @@ ExplainPrintJIT(ExplainState *es, int jit_flags, if (es->analyze && es->timing) { - appendStringInfoSpaces(es->str, es->indent * 2); + ExplainIndentText(es); appendStringInfo(es->str, "Timing: %s %.3f ms, %s %.3f ms, %s %.3f ms, %s %.3f ms, %s %.3f ms\n", "Generation", 1000.0 * INSTR_TIME_GET_DOUBLE(ji->generation_counter), @@ -1074,9 +1083,10 @@ ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used) * optional name to be attached to the node. * * In text format, es->indent is controlled in this function since we only - * want it to change at plan-node boundaries. In non-text formats, es->indent - * corresponds to the nesting depth of logical output groups, and therefore - * is controlled by ExplainOpenGroup/ExplainCloseGroup. + * want it to change at plan-node boundaries (but a few subroutines will + * transiently increment it). In non-text formats, es->indent corresponds + * to the nesting depth of logical output groups, and therefore is controlled + * by ExplainOpenGroup/ExplainCloseGroup. */ static void ExplainNode(PlanState *planstate, List *ancestors, @@ -1090,9 +1100,20 @@ ExplainNode(PlanState *planstate, List *ancestors, const char *partialmode = NULL; const char *operation = NULL; const char *custom_name = NULL; + ExplainWorkersState *save_workers_state = es->workers_state; int save_indent = es->indent; bool haschildren; + /* + * Prepare per-worker output buffers, if needed. We'll append the data in + * these to the main output string further down. + */ + if (planstate->worker_instrument && es->analyze && !es->hide_workers) + es->workers_state = ExplainCreateWorkersState(planstate->worker_instrument->num_workers); + else + es->workers_state = NULL; + + /* Identify plan node type, and print generic details */ switch (nodeTag(plan)) { case T_Result: @@ -1324,13 +1345,13 @@ ExplainNode(PlanState *planstate, List *ancestors, { if (plan_name) { - appendStringInfoSpaces(es->str, es->indent * 2); + ExplainIndentText(es); appendStringInfo(es->str, "%s\n", plan_name); es->indent++; } if (es->indent) { - appendStringInfoSpaces(es->str, es->indent * 2); + ExplainIndentText(es); appendStringInfoString(es->str, "-> "); es->indent += 2; } @@ -1574,6 +1595,56 @@ ExplainNode(PlanState *planstate, List *ancestors, if (es->format == EXPLAIN_FORMAT_TEXT) appendStringInfoChar(es->str, '\n'); + /* prepare per-worker general execution details */ + if (es->workers_state && es->verbose) + { + WorkerInstrumentation *w = planstate->worker_instrument; + + for (int n = 0; n < w->num_workers; n++) + { + Instrumentation *instrument = &w->instrument[n]; + double nloops = instrument->nloops; + double startup_ms; + double total_ms; + double rows; + + if (nloops <= 0) + continue; + startup_ms = 1000.0 * instrument->startup / nloops; + total_ms = 1000.0 * instrument->total / nloops; + rows = instrument->ntuples / nloops; + + ExplainOpenWorker(n, es); + + if (es->format == EXPLAIN_FORMAT_TEXT) + { + ExplainIndentText(es); + if (es->timing) + appendStringInfo(es->str, + "actual time=%.3f..%.3f rows=%.0f loops=%.0f\n", + startup_ms, total_ms, rows, nloops); + else + appendStringInfo(es->str, + "actual rows=%.0f loops=%.0f\n", + rows, nloops); + } + else + { + if (es->timing) + { + ExplainPropertyFloat("Actual Startup Time", "ms", + startup_ms, 3, es); + ExplainPropertyFloat("Actual Total Time", "ms", + total_ms, 3, es); + } + ExplainPropertyFloat("Actual Rows", NULL, rows, 0, es); + ExplainPropertyFloat("Actual Loops", NULL, nloops, 0, es); + } + + ExplainCloseWorker(n, es); + } + } + /* target list */ if (es->verbose) show_plan_tlist(planstate, ancestors, es); @@ -1885,76 +1956,30 @@ ExplainNode(PlanState *planstate, List *ancestors, if (es->buffers && planstate->instrument) show_buffer_usage(es, &planstate->instrument->bufusage); - /* Show worker detail */ - if (es->analyze && es->verbose && !es->hide_workers && - planstate->worker_instrument) + /* Prepare per-worker buffer usage */ + if (es->workers_state && es->buffers && es->verbose) { WorkerInstrumentation *w = planstate->worker_instrument; - bool opened_group = false; - int n; - for (n = 0; n < w->num_workers; ++n) + for (int n = 0; n < w->num_workers; n++) { Instrumentation *instrument = &w->instrument[n]; double nloops = instrument->nloops; - double startup_ms; - double total_ms; - double rows; if (nloops <= 0) continue; - startup_ms = 1000.0 * instrument->startup / nloops; - total_ms = 1000.0 * instrument->total / nloops; - rows = instrument->ntuples / nloops; - if (es->format == EXPLAIN_FORMAT_TEXT) - { - appendStringInfoSpaces(es->str, es->indent * 2); - appendStringInfo(es->str, "Worker %d: ", n); - if (es->timing) - appendStringInfo(es->str, - "actual time=%.3f..%.3f rows=%.0f loops=%.0f\n", - startup_ms, total_ms, rows, nloops); - else - appendStringInfo(es->str, - "actual rows=%.0f loops=%.0f\n", - rows, nloops); - es->indent++; - if (es->buffers) - show_buffer_usage(es, &instrument->bufusage); - es->indent--; - } - else - { - if (!opened_group) - { - ExplainOpenGroup("Workers", "Workers", false, es); - opened_group = true; - } - ExplainOpenGroup("Worker", NULL, true, es); - ExplainPropertyInteger("Worker Number", NULL, n, es); - - if (es->timing) - { - ExplainPropertyFloat("Actual Startup Time", "ms", - startup_ms, 3, es); - ExplainPropertyFloat("Actual Total Time", "ms", - total_ms, 3, es); - } - ExplainPropertyFloat("Actual Rows", NULL, rows, 0, es); - ExplainPropertyFloat("Actual Loops", NULL, nloops, 0, es); - - if (es->buffers) - show_buffer_usage(es, &instrument->bufusage); - - ExplainCloseGroup("Worker", NULL, true, es); - } + ExplainOpenWorker(n, es); + show_buffer_usage(es, &instrument->bufusage); + ExplainCloseWorker(n, es); } - - if (opened_group) - ExplainCloseGroup("Workers", "Workers", false, es); } + /* Show per-worker details, then pop that stack */ + if (es->workers_state) + ExplainFlushWorkersState(es); + es->workers_state = save_workers_state; + /* Get ready to display the child plans */ haschildren = planstate->initPlan || outerPlanState(planstate) || @@ -2525,7 +2550,7 @@ show_tablesample(TableSampleClause *tsc, PlanState *planstate, { bool first = true; - appendStringInfoSpaces(es->str, es->indent * 2); + ExplainIndentText(es); appendStringInfo(es->str, "Sampling: %s (", method_name); foreach(lc, params) { @@ -2572,7 +2597,7 @@ show_sort_info(SortState *sortstate, ExplainState *es) if (es->format == EXPLAIN_FORMAT_TEXT) { - appendStringInfoSpaces(es->str, es->indent * 2); + ExplainIndentText(es); appendStringInfo(es->str, "Sort Method: %s %s: %ldkB\n", sortMethod, spaceType, spaceUsed); } @@ -2588,12 +2613,14 @@ show_sort_info(SortState *sortstate, ExplainState *es) * You might think we should just skip this stanza entirely when * es->hide_workers is true, but then we'd get no sort-method output at * all. We have to make it look like worker 0's data is top-level data. - * Currently, we only bother with that for text-format output. + * This is easily done by just skipping the OpenWorker/CloseWorker calls. + * Currently, we don't worry about the possibility that there are multiple + * workers in such a case; if there are, duplicate output fields will be + * emitted. */ if (sortstate->shared_info != NULL) { int n; - bool opened_group = false; for (n = 0; n < sortstate->shared_info->num_workers; n++) { @@ -2609,32 +2636,26 @@ show_sort_info(SortState *sortstate, ExplainState *es) spaceType = tuplesort_space_type_name(sinstrument->spaceType); spaceUsed = sinstrument->spaceUsed; + if (es->workers_state) + ExplainOpenWorker(n, es); + if (es->format == EXPLAIN_FORMAT_TEXT) { - appendStringInfoSpaces(es->str, es->indent * 2); - if (n > 0 || !es->hide_workers) - appendStringInfo(es->str, "Worker %d: ", n); + ExplainIndentText(es); appendStringInfo(es->str, "Sort Method: %s %s: %ldkB\n", sortMethod, spaceType, spaceUsed); } else { - if (!opened_group) - { - ExplainOpenGroup("Workers", "Workers", false, es); - opened_group = true; - } - ExplainOpenGroup("Worker", NULL, true, es); - ExplainPropertyInteger("Worker Number", NULL, n, es); ExplainPropertyText("Sort Method", sortMethod, es); ExplainPropertyInteger("Sort Space Used", "kB", spaceUsed, es); ExplainPropertyText("Sort Space Type", spaceType, es); - ExplainCloseGroup("Worker", NULL, true, es); } + + if (es->workers_state) + ExplainCloseWorker(n, es); } - if (opened_group) - ExplainCloseGroup("Workers", "Workers", false, es); } } @@ -2721,7 +2742,7 @@ show_hash_info(HashState *hashstate, ExplainState *es) else if (hinstrument.nbatch_original != hinstrument.nbatch || hinstrument.nbuckets_original != hinstrument.nbuckets) { - appendStringInfoSpaces(es->str, es->indent * 2); + ExplainIndentText(es); appendStringInfo(es->str, "Buckets: %d (originally %d) Batches: %d (originally %d) Memory Usage: %ldkB\n", hinstrument.nbuckets, @@ -2732,7 +2753,7 @@ show_hash_info(HashState *hashstate, ExplainState *es) } else { - appendStringInfoSpaces(es->str, es->indent * 2); + ExplainIndentText(es); appendStringInfo(es->str, "Buckets: %d Batches: %d Memory Usage: %ldkB\n", hinstrument.nbuckets, hinstrument.nbatch, @@ -2758,7 +2779,7 @@ show_tidbitmap_info(BitmapHeapScanState *planstate, ExplainState *es) { if (planstate->exact_pages > 0 || planstate->lossy_pages > 0) { - appendStringInfoSpaces(es->str, es->indent * 2); + ExplainIndentText(es); appendStringInfoString(es->str, "Heap Blocks:"); if (planstate->exact_pages > 0) appendStringInfo(es->str, " exact=%ld", planstate->exact_pages); @@ -2894,7 +2915,7 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage) /* Show only positive counter values. */ if (has_shared || has_local || has_temp) { - appendStringInfoSpaces(es->str, es->indent * 2); + ExplainIndentText(es); appendStringInfoString(es->str, "Buffers:"); if (has_shared) @@ -2949,7 +2970,7 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage) /* As above, show only positive counter values. */ if (has_timing) { - appendStringInfoSpaces(es->str, es->indent * 2); + ExplainIndentText(es); appendStringInfoString(es->str, "I/O Timings:"); if (!INSTR_TIME_IS_ZERO(usage->blk_read_time)) appendStringInfo(es->str, " read=%0.3f", @@ -3237,7 +3258,7 @@ show_modifytable_info(ModifyTableState *mtstate, List *ancestors, */ if (es->format == EXPLAIN_FORMAT_TEXT) { - appendStringInfoSpaces(es->str, es->indent * 2); + ExplainIndentText(es); appendStringInfoString(es->str, fdwroutine ? foperation : operation); } @@ -3427,6 +3448,150 @@ ExplainCustomChildren(CustomScanState *css, List *ancestors, ExplainState *es) } /* + * Create a per-plan-node workspace for collecting per-worker data. + */ +static ExplainWorkersState * +ExplainCreateWorkersState(int num_workers) +{ + ExplainWorkersState *wstate; + + wstate = (ExplainWorkersState *) palloc(sizeof(ExplainWorkersState)); + wstate->num_workers = num_workers; + wstate->worker_inited = (bool *) palloc0(num_workers * sizeof(bool)); + wstate->worker_str = (StringInfoData *) + palloc0(num_workers * sizeof(StringInfoData)); + wstate->worker_state_save = (int *) palloc(num_workers * sizeof(int)); + return wstate; +} + +/* + * Begin or resume output into the set-aside group for worker N. + */ +static void +ExplainOpenWorker(int n, ExplainState *es) +{ + ExplainWorkersState *wstate = es->workers_state; + + Assert(wstate); + Assert(n >= 0 && n < wstate->num_workers); + + /* Save prior output pointer */ + wstate->prev_str = es->str; + + if (!wstate->worker_inited[n]) + { + /* First time through, so create the buffer for this worker */ + initStringInfo(&wstate->worker_str[n]); + es->str = &wstate->worker_str[n]; + + /* + * Push suitable initial formatting state for this worker's field + * group. We allow one extra logical nesting level, since this group + * will eventually be wrapped in an outer "Workers" group. + */ + ExplainOpenSetAsideGroup("Worker", NULL, true, 2, es); + + /* + * In non-TEXT formats, we always emit a "Worker Number" field even if + * there's no other data for this worker. + */ + if (es->format != EXPLAIN_FORMAT_TEXT) + ExplainPropertyInteger("Worker Number", NULL, n, es); + + wstate->worker_inited[n] = true; + } + else + { + /* Resuming output for a worker we've already emitted some data for */ + es->str = &wstate->worker_str[n]; + + /* Restore formatting state saved by last ExplainCloseWorker() */ + ExplainRestoreGroup(es, 2, &wstate->worker_state_save[n]); + } + + /* + * In TEXT format, prefix the first output line for this worker with + * "Worker N:". Then, any additional lines should be indented one more + * stop than the "Worker N" line is. + */ + if (es->format == EXPLAIN_FORMAT_TEXT) + { + if (es->str->len == 0) + { + ExplainIndentText(es); + appendStringInfo(es->str, "Worker %d: ", n); + } + + es->indent++; + } +} + +/* + * End output for worker N --- must pair with previous ExplainOpenWorker call + */ +static void +ExplainCloseWorker(int n, ExplainState *es) +{ + ExplainWorkersState *wstate = es->workers_state; + + Assert(wstate); + Assert(n >= 0 && n < wstate->num_workers); + Assert(wstate->worker_inited[n]); + + /* + * Save formatting state in case we do another ExplainOpenWorker(), then + * pop the formatting stack. + */ + ExplainSaveGroup(es, 2, &wstate->worker_state_save[n]); + + /* + * In TEXT format, if we didn't actually produce any output line(s) then + * truncate off the partial line (this avoids bogus output if, for + * instance, show_buffer_usage chooses not to print anything). Also fix + * up the indent level. + */ + if (es->format == EXPLAIN_FORMAT_TEXT) + { + while (es->str->len > 0 && es->str->data[es->str->len - 1] != '\n') + es->str->data[--(es->str->len)] = '\0'; + + es->indent--; + } + + /* Restore prior output pointer */ + es->str = wstate->prev_str; +} + +/* + * Print per-worker info for current node, then free the ExplainWorkersState. + */ +static void +ExplainFlushWorkersState(ExplainState *es) +{ + ExplainWorkersState *wstate = es->workers_state; + + ExplainOpenGroup("Workers", "Workers", false, es); + for (int i = 0; i < wstate->num_workers; i++) + { + if (wstate->worker_inited[i]) + { + /* This must match previous ExplainOpenSetAsideGroup call */ + ExplainOpenGroup("Worker", NULL, true, es); + appendStringInfoString(es->str, wstate->worker_str[i].data); + ExplainCloseGroup("Worker", NULL, true, es); + + pfree(wstate->worker_str[i].data); + } + } + ExplainCloseGroup("Workers", "Workers", false, es); + + pfree(wstate->worker_inited); + pfree(wstate->worker_str); + pfree(wstate->worker_state_save); + pfree(wstate); +} + +/* * Explain a property, such as sort keys or targets, that takes the form of * a list of unlabeled items. "data" is a list of C strings. */ @@ -3439,7 +3604,7 @@ ExplainPropertyList(const char *qlabel, List *data, ExplainState *es) switch (es->format) { case EXPLAIN_FORMAT_TEXT: - appendStringInfoSpaces(es->str, es->indent * 2); + ExplainIndentText(es); appendStringInfo(es->str, "%s: ", qlabel); foreach(lc, data) { @@ -3560,7 +3725,7 @@ ExplainProperty(const char *qlabel, const char *unit, const char *value, switch (es->format) { case EXPLAIN_FORMAT_TEXT: - appendStringInfoSpaces(es->str, es->indent * 2); + ExplainIndentText(es); if (unit) appendStringInfo(es->str, "%s: %s %s\n", qlabel, value, unit); else @@ -3752,6 +3917,117 @@ ExplainCloseGroup(const char *objtype, const char *labelname, } /* + * Open a group of related objects, without emitting actual data. + * + * Prepare the formatting state as though we were beginning a group with + * the identified properties, but don't actually emit anything. Output + * subsequent to this call can be redirected into a separate output buffer, + * and then eventually appended to the main output buffer after doing a + * regular ExplainOpenGroup call (with the same parameters). + * + * The extra "depth" parameter is the new group's depth compared to current. + * It could be more than one, in case the eventual output will be enclosed + * in additional nesting group levels. We assume we don't need to track + * formatting state for those levels while preparing this group's output. + * + * There is no ExplainCloseSetAsideGroup --- in current usage, we always + * pop this state with ExplainSaveGroup. + */ +static void +ExplainOpenSetAsideGroup(const char *objtype, const char *labelname, + bool labeled, int depth, ExplainState *es) +{ + switch (es->format) + { + case EXPLAIN_FORMAT_TEXT: + /* nothing to do */ + break; + + case EXPLAIN_FORMAT_XML: + es->indent += depth; + break; + + case EXPLAIN_FORMAT_JSON: + es->grouping_stack = lcons_int(0, es->grouping_stack); + es->indent += depth; + break; + + case EXPLAIN_FORMAT_YAML: + if (labelname) + es->grouping_stack = lcons_int(1, es->grouping_stack); + else + es->grouping_stack = lcons_int(0, es->grouping_stack); + es->indent += depth; + break; + } +} + +/* + * Pop one level of grouping state, allowing for a re-push later. + * + * This is typically used after ExplainOpenSetAsideGroup; pass the + * same "depth" used for that. + * + * This should not emit any output. If state needs to be saved, + * save it at *state_save. Currently, an integer save area is sufficient + * for all formats, but we might need to revisit that someday. + */ +static void +ExplainSaveGroup(ExplainState *es, int depth, int *state_save) +{ + switch (es->format) + { + case EXPLAIN_FORMAT_TEXT: + /* nothing to do */ + break; + + case EXPLAIN_FORMAT_XML: + es->indent -= depth; + break; + + case EXPLAIN_FORMAT_JSON: + es->indent -= depth; + *state_save = linitial_int(es->grouping_stack); + es->grouping_stack = list_delete_first(es->grouping_stack); + break; + + case EXPLAIN_FORMAT_YAML: + es->indent -= depth; + *state_save = linitial_int(es->grouping_stack); + es->grouping_stack = list_delete_first(es->grouping_stack); + break; + } +} + +/* + * Re-push one level of grouping state, undoing the effects of ExplainSaveGroup. + */ +static void +ExplainRestoreGroup(ExplainState *es, int depth, int *state_save) +{ + switch (es->format) + { + case EXPLAIN_FORMAT_TEXT: + /* nothing to do */ + break; + + case EXPLAIN_FORMAT_XML: + es->indent += depth; + break; + + case EXPLAIN_FORMAT_JSON: + es->grouping_stack = lcons_int(*state_save, es->grouping_stack); + es->indent += depth; + break; + + case EXPLAIN_FORMAT_YAML: + es->grouping_stack = lcons_int(*state_save, es->grouping_stack); + es->indent += depth; + break; + } +} + +/* * Emit a "dummy" group that never has any members. * * objtype is the type of the group object, labelname is its label within @@ -3913,6 +4189,21 @@ ExplainXMLTag(const char *tagname, int flags, ExplainState *es) } /* + * Indent a text-format line. + * + * We indent by two spaces per indentation level. However, when emitting + * data for a parallel worker there might already be data on the current line + * (cf. ExplainOpenWorker); in that case, don't indent any more. + */ +static void +ExplainIndentText(ExplainState *es) +{ + Assert(es->format == EXPLAIN_FORMAT_TEXT); + if (es->str->len == 0 || es->str->data[es->str->len - 1] == '\n') + appendStringInfoSpaces(es->str, es->indent * 2); +} + +/* * Emit a JSON line ending. * * JSON requires a comma after each property but the last. To facilitate this, diff --git a/src/include/commands/explain.h b/src/include/commands/explain.h index 45027a7..853e206 100644 --- a/src/include/commands/explain.h +++ b/src/include/commands/explain.h @@ -25,6 +25,15 @@ typedef enum ExplainFormat EXPLAIN_FORMAT_YAML } ExplainFormat; +typedef struct ExplainWorkersState +{ + int num_workers; /* # of worker processes the plan used */ + bool *worker_inited; /* per-worker state-initialized flags */ + StringInfoData *worker_str; /* per-worker transient output buffers */ + int *worker_state_save; /* per-worker grouping state save areas */ + StringInfo prev_str; /* saved output buffer while redirecting */ +} ExplainWorkersState; + typedef struct ExplainState { StringInfo str; /* output buffer */ @@ -47,6 +56,8 @@ typedef struct ExplainState List *deparse_cxt; /* context list for deparsing expressions */ Bitmapset *printed_subplans; /* ids of SubPlans we've printed */ bool hide_workers; /* set if we find an invisible Gather */ + /* state related to the current plan node */ + ExplainWorkersState *workers_state; /* needed if parallel plan */ } ExplainState; /* Hook for plugins to get control in ExplainOneQuery() */
Thanks for the thorough review. I obviously missed some critical issues. I recognized some of the other maintainability problems, but did not have a sense of how to best avoid them, being unfamiliar with the code. For what it's worth, this version makes sense to me.
Maciek Sakrejda <m.sakrejda@gmail.com> writes: > For what it's worth, this version makes sense to me. Thanks for looking. Here's a version that deals with the JIT instrumentation. As Andres noted far upthread, that was also really bogusly done before. Not only could you get multiple "JIT" subnodes on a Gather node, but we failed to print the info at all if the parallelism was expressed as Gather Merge rather than Gather. A side effect of this change is that per-worker JIT info is now printed one plan level further down: before it was printed on the Gather node, but now it's attached to the Gather's child, because that's where we print other per-worker data. I don't see anything particularly wrong with that, but it's another change from the behavior today. It's still really unclear to me how we could exercise any of this behavior meaningfully in a regression test. I thought for a little bit about using the TAP infrastructure instead of a traditional-style test, but it seems like that doesn't buy anything except for a bias towards ignoring details instead of overspecifying them. Which is not much of an improvement. regards, tom lane diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c index d189b8d..fe3c83b 100644 --- a/src/backend/commands/explain.c +++ b/src/backend/commands/explain.c @@ -57,6 +57,8 @@ static void ExplainOneQuery(Query *query, int cursorOptions, IntoClause *into, ExplainState *es, const char *queryString, ParamListInfo params, QueryEnvironment *queryEnv); +static void ExplainPrintJIT(ExplainState *es, int jit_flags, + JitInstrumentation *ji); static void report_triggers(ResultRelInfo *rInfo, bool show_relname, ExplainState *es); static double elapsed_time(instr_time *starttime); @@ -123,11 +125,20 @@ static void ExplainSubPlans(List *plans, List *ancestors, const char *relationship, ExplainState *es); static void ExplainCustomChildren(CustomScanState *css, List *ancestors, ExplainState *es); +static ExplainWorkersState *ExplainCreateWorkersState(int num_workers); +static void ExplainOpenWorker(int n, ExplainState *es); +static void ExplainCloseWorker(int n, ExplainState *es); +static void ExplainFlushWorkersState(ExplainState *es); static void ExplainProperty(const char *qlabel, const char *unit, const char *value, bool numeric, ExplainState *es); +static void ExplainOpenSetAsideGroup(const char *objtype, const char *labelname, + bool labeled, int depth, ExplainState *es); +static void ExplainSaveGroup(ExplainState *es, int depth, int *state_save); +static void ExplainRestoreGroup(ExplainState *es, int depth, int *state_save); static void ExplainDummyGroup(const char *objtype, const char *labelname, ExplainState *es); static void ExplainXMLTag(const char *tagname, int flags, ExplainState *es); +static void ExplainIndentText(ExplainState *es); static void ExplainJSONLineEnding(ExplainState *es); static void ExplainYAMLLineStarting(ExplainState *es); static void escape_yaml(StringInfo buf, const char *str); @@ -790,31 +801,22 @@ ExplainPrintJITSummary(ExplainState *es, QueryDesc *queryDesc) if (queryDesc->estate->es_jit_worker_instr) InstrJitAgg(&ji, queryDesc->estate->es_jit_worker_instr); - ExplainPrintJIT(es, queryDesc->estate->es_jit_flags, &ji, -1); + ExplainPrintJIT(es, queryDesc->estate->es_jit_flags, &ji); } /* * ExplainPrintJIT - * Append information about JITing to es->str. - * - * Can be used to print the JIT instrumentation of the backend (worker_num = - * -1) or that of a specific worker (worker_num = ...). */ -void -ExplainPrintJIT(ExplainState *es, int jit_flags, - JitInstrumentation *ji, int worker_num) +static void +ExplainPrintJIT(ExplainState *es, int jit_flags, JitInstrumentation *ji) { instr_time total_time; - bool for_workers = (worker_num >= 0); /* don't print information if no JITing happened */ if (!ji || ji->created_functions == 0) return; - /* don't print per-worker info if we're supposed to hide that */ - if (for_workers && es->hide_workers) - return; - /* calculate total time */ INSTR_TIME_SET_ZERO(total_time); INSTR_TIME_ADD(total_time, ji->generation_counter); @@ -827,16 +829,13 @@ ExplainPrintJIT(ExplainState *es, int jit_flags, /* for higher density, open code the text output format */ if (es->format == EXPLAIN_FORMAT_TEXT) { - appendStringInfoSpaces(es->str, es->indent * 2); - if (for_workers) - appendStringInfo(es->str, "JIT for worker %u:\n", worker_num); - else - appendStringInfoString(es->str, "JIT:\n"); - es->indent += 1; + ExplainIndentText(es); + appendStringInfoString(es->str, "JIT:\n"); + es->indent++; ExplainPropertyInteger("Functions", NULL, ji->created_functions, es); - appendStringInfoSpaces(es->str, es->indent * 2); + ExplainIndentText(es); appendStringInfo(es->str, "Options: %s %s, %s %s, %s %s, %s %s\n", "Inlining", jit_flags & PGJIT_INLINE ? "true" : "false", "Optimization", jit_flags & PGJIT_OPT3 ? "true" : "false", @@ -845,7 +844,7 @@ ExplainPrintJIT(ExplainState *es, int jit_flags, if (es->analyze && es->timing) { - appendStringInfoSpaces(es->str, es->indent * 2); + ExplainIndentText(es); appendStringInfo(es->str, "Timing: %s %.3f ms, %s %.3f ms, %s %.3f ms, %s %.3f ms, %s %.3f ms\n", "Generation", 1000.0 * INSTR_TIME_GET_DOUBLE(ji->generation_counter), @@ -855,11 +854,10 @@ ExplainPrintJIT(ExplainState *es, int jit_flags, "Total", 1000.0 * INSTR_TIME_GET_DOUBLE(total_time)); } - es->indent -= 1; + es->indent--; } else { - ExplainPropertyInteger("Worker Number", NULL, worker_num, es); ExplainPropertyInteger("Functions", NULL, ji->created_functions, es); ExplainOpenGroup("Options", "Options", true, es); @@ -1074,9 +1072,10 @@ ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used) * optional name to be attached to the node. * * In text format, es->indent is controlled in this function since we only - * want it to change at plan-node boundaries. In non-text formats, es->indent - * corresponds to the nesting depth of logical output groups, and therefore - * is controlled by ExplainOpenGroup/ExplainCloseGroup. + * want it to change at plan-node boundaries (but a few subroutines will + * transiently increment it). In non-text formats, es->indent corresponds + * to the nesting depth of logical output groups, and therefore is controlled + * by ExplainOpenGroup/ExplainCloseGroup. */ static void ExplainNode(PlanState *planstate, List *ancestors, @@ -1090,9 +1089,20 @@ ExplainNode(PlanState *planstate, List *ancestors, const char *partialmode = NULL; const char *operation = NULL; const char *custom_name = NULL; + ExplainWorkersState *save_workers_state = es->workers_state; int save_indent = es->indent; bool haschildren; + /* + * Prepare per-worker output buffers, if needed. We'll append the data in + * these to the main output string further down. + */ + if (planstate->worker_instrument && es->analyze && !es->hide_workers) + es->workers_state = ExplainCreateWorkersState(planstate->worker_instrument->num_workers); + else + es->workers_state = NULL; + + /* Identify plan node type, and print generic details */ switch (nodeTag(plan)) { case T_Result: @@ -1324,13 +1334,13 @@ ExplainNode(PlanState *planstate, List *ancestors, { if (plan_name) { - appendStringInfoSpaces(es->str, es->indent * 2); + ExplainIndentText(es); appendStringInfo(es->str, "%s\n", plan_name); es->indent++; } if (es->indent) { - appendStringInfoSpaces(es->str, es->indent * 2); + ExplainIndentText(es); appendStringInfoString(es->str, "-> "); es->indent += 2; } @@ -1574,6 +1584,56 @@ ExplainNode(PlanState *planstate, List *ancestors, if (es->format == EXPLAIN_FORMAT_TEXT) appendStringInfoChar(es->str, '\n'); + /* prepare per-worker general execution details */ + if (es->workers_state && es->verbose) + { + WorkerInstrumentation *w = planstate->worker_instrument; + + for (int n = 0; n < w->num_workers; n++) + { + Instrumentation *instrument = &w->instrument[n]; + double nloops = instrument->nloops; + double startup_ms; + double total_ms; + double rows; + + if (nloops <= 0) + continue; + startup_ms = 1000.0 * instrument->startup / nloops; + total_ms = 1000.0 * instrument->total / nloops; + rows = instrument->ntuples / nloops; + + ExplainOpenWorker(n, es); + + if (es->format == EXPLAIN_FORMAT_TEXT) + { + ExplainIndentText(es); + if (es->timing) + appendStringInfo(es->str, + "actual time=%.3f..%.3f rows=%.0f loops=%.0f\n", + startup_ms, total_ms, rows, nloops); + else + appendStringInfo(es->str, + "actual rows=%.0f loops=%.0f\n", + rows, nloops); + } + else + { + if (es->timing) + { + ExplainPropertyFloat("Actual Startup Time", "ms", + startup_ms, 3, es); + ExplainPropertyFloat("Actual Total Time", "ms", + total_ms, 3, es); + } + ExplainPropertyFloat("Actual Rows", NULL, rows, 0, es); + ExplainPropertyFloat("Actual Loops", NULL, nloops, 0, es); + } + + ExplainCloseWorker(n, es); + } + } + /* target list */ if (es->verbose) show_plan_tlist(planstate, ancestors, es); @@ -1684,24 +1744,6 @@ ExplainNode(PlanState *planstate, List *ancestors, nworkers, es); } - /* - * Print per-worker Jit instrumentation. Use same conditions - * as for the leader's JIT instrumentation, see comment there. - */ - if (es->costs && es->verbose && - outerPlanState(planstate)->worker_jit_instrument) - { - PlanState *child = outerPlanState(planstate); - int n; - SharedJitInstrumentation *w = child->worker_jit_instrument; - - for (n = 0; n < w->num_workers; ++n) - { - ExplainPrintJIT(es, child->state->es_jit_flags, - &w->jit_instr[n], n); - } - } - if (gather->single_copy || es->format != EXPLAIN_FORMAT_TEXT) ExplainPropertyBool("Single Copy", gather->single_copy, es); } @@ -1881,80 +1923,54 @@ ExplainNode(PlanState *planstate, List *ancestors, break; } + /* + * Prepare per-worker JIT instrumentation. As with the overall JIT + * summary, this is printed only if printing costs is enabled. + */ + if (es->workers_state && es->costs && es->verbose) + { + SharedJitInstrumentation *w = planstate->worker_jit_instrument; + + if (w) + { + for (int n = 0; n < w->num_workers; n++) + { + ExplainOpenWorker(n, es); + ExplainPrintJIT(es, planstate->state->es_jit_flags, + &w->jit_instr[n]); + ExplainCloseWorker(n, es); + } + } + } + /* Show buffer usage */ if (es->buffers && planstate->instrument) show_buffer_usage(es, &planstate->instrument->bufusage); - /* Show worker detail */ - if (es->analyze && es->verbose && !es->hide_workers && - planstate->worker_instrument) + /* Prepare per-worker buffer usage */ + if (es->workers_state && es->buffers && es->verbose) { WorkerInstrumentation *w = planstate->worker_instrument; - bool opened_group = false; - int n; - for (n = 0; n < w->num_workers; ++n) + for (int n = 0; n < w->num_workers; n++) { Instrumentation *instrument = &w->instrument[n]; double nloops = instrument->nloops; - double startup_ms; - double total_ms; - double rows; if (nloops <= 0) continue; - startup_ms = 1000.0 * instrument->startup / nloops; - total_ms = 1000.0 * instrument->total / nloops; - rows = instrument->ntuples / nloops; - - if (es->format == EXPLAIN_FORMAT_TEXT) - { - appendStringInfoSpaces(es->str, es->indent * 2); - appendStringInfo(es->str, "Worker %d: ", n); - if (es->timing) - appendStringInfo(es->str, - "actual time=%.3f..%.3f rows=%.0f loops=%.0f\n", - startup_ms, total_ms, rows, nloops); - else - appendStringInfo(es->str, - "actual rows=%.0f loops=%.0f\n", - rows, nloops); - es->indent++; - if (es->buffers) - show_buffer_usage(es, &instrument->bufusage); - es->indent--; - } - else - { - if (!opened_group) - { - ExplainOpenGroup("Workers", "Workers", false, es); - opened_group = true; - } - ExplainOpenGroup("Worker", NULL, true, es); - ExplainPropertyInteger("Worker Number", NULL, n, es); - - if (es->timing) - { - ExplainPropertyFloat("Actual Startup Time", "ms", - startup_ms, 3, es); - ExplainPropertyFloat("Actual Total Time", "ms", - total_ms, 3, es); - } - ExplainPropertyFloat("Actual Rows", NULL, rows, 0, es); - ExplainPropertyFloat("Actual Loops", NULL, nloops, 0, es); - if (es->buffers) - show_buffer_usage(es, &instrument->bufusage); - - ExplainCloseGroup("Worker", NULL, true, es); - } + ExplainOpenWorker(n, es); + show_buffer_usage(es, &instrument->bufusage); + ExplainCloseWorker(n, es); } - - if (opened_group) - ExplainCloseGroup("Workers", "Workers", false, es); } + /* Show per-worker details for this plan node, then pop that stack */ + if (es->workers_state) + ExplainFlushWorkersState(es); + es->workers_state = save_workers_state; + /* Get ready to display the child plans */ haschildren = planstate->initPlan || outerPlanState(planstate) || @@ -2525,7 +2541,7 @@ show_tablesample(TableSampleClause *tsc, PlanState *planstate, { bool first = true; - appendStringInfoSpaces(es->str, es->indent * 2); + ExplainIndentText(es); appendStringInfo(es->str, "Sampling: %s (", method_name); foreach(lc, params) { @@ -2572,7 +2588,7 @@ show_sort_info(SortState *sortstate, ExplainState *es) if (es->format == EXPLAIN_FORMAT_TEXT) { - appendStringInfoSpaces(es->str, es->indent * 2); + ExplainIndentText(es); appendStringInfo(es->str, "Sort Method: %s %s: %ldkB\n", sortMethod, spaceType, spaceUsed); } @@ -2588,12 +2604,14 @@ show_sort_info(SortState *sortstate, ExplainState *es) * You might think we should just skip this stanza entirely when * es->hide_workers is true, but then we'd get no sort-method output at * all. We have to make it look like worker 0's data is top-level data. - * Currently, we only bother with that for text-format output. + * This is easily done by just skipping the OpenWorker/CloseWorker calls. + * Currently, we don't worry about the possibility that there are multiple + * workers in such a case; if there are, duplicate output fields will be + * emitted. */ if (sortstate->shared_info != NULL) { int n; - bool opened_group = false; for (n = 0; n < sortstate->shared_info->num_workers; n++) { @@ -2609,32 +2627,26 @@ show_sort_info(SortState *sortstate, ExplainState *es) spaceType = tuplesort_space_type_name(sinstrument->spaceType); spaceUsed = sinstrument->spaceUsed; + if (es->workers_state) + ExplainOpenWorker(n, es); + if (es->format == EXPLAIN_FORMAT_TEXT) { - appendStringInfoSpaces(es->str, es->indent * 2); - if (n > 0 || !es->hide_workers) - appendStringInfo(es->str, "Worker %d: ", n); + ExplainIndentText(es); appendStringInfo(es->str, "Sort Method: %s %s: %ldkB\n", sortMethod, spaceType, spaceUsed); } else { - if (!opened_group) - { - ExplainOpenGroup("Workers", "Workers", false, es); - opened_group = true; - } - ExplainOpenGroup("Worker", NULL, true, es); - ExplainPropertyInteger("Worker Number", NULL, n, es); ExplainPropertyText("Sort Method", sortMethod, es); ExplainPropertyInteger("Sort Space Used", "kB", spaceUsed, es); ExplainPropertyText("Sort Space Type", spaceType, es); - ExplainCloseGroup("Worker", NULL, true, es); } + + if (es->workers_state) + ExplainCloseWorker(n, es); } - if (opened_group) - ExplainCloseGroup("Workers", "Workers", false, es); } } @@ -2721,7 +2733,7 @@ show_hash_info(HashState *hashstate, ExplainState *es) else if (hinstrument.nbatch_original != hinstrument.nbatch || hinstrument.nbuckets_original != hinstrument.nbuckets) { - appendStringInfoSpaces(es->str, es->indent * 2); + ExplainIndentText(es); appendStringInfo(es->str, "Buckets: %d (originally %d) Batches: %d (originally %d) Memory Usage: %ldkB\n", hinstrument.nbuckets, @@ -2732,7 +2744,7 @@ show_hash_info(HashState *hashstate, ExplainState *es) } else { - appendStringInfoSpaces(es->str, es->indent * 2); + ExplainIndentText(es); appendStringInfo(es->str, "Buckets: %d Batches: %d Memory Usage: %ldkB\n", hinstrument.nbuckets, hinstrument.nbatch, @@ -2758,7 +2770,7 @@ show_tidbitmap_info(BitmapHeapScanState *planstate, ExplainState *es) { if (planstate->exact_pages > 0 || planstate->lossy_pages > 0) { - appendStringInfoSpaces(es->str, es->indent * 2); + ExplainIndentText(es); appendStringInfoString(es->str, "Heap Blocks:"); if (planstate->exact_pages > 0) appendStringInfo(es->str, " exact=%ld", planstate->exact_pages); @@ -2894,7 +2906,7 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage) /* Show only positive counter values. */ if (has_shared || has_local || has_temp) { - appendStringInfoSpaces(es->str, es->indent * 2); + ExplainIndentText(es); appendStringInfoString(es->str, "Buffers:"); if (has_shared) @@ -2949,7 +2961,7 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage) /* As above, show only positive counter values. */ if (has_timing) { - appendStringInfoSpaces(es->str, es->indent * 2); + ExplainIndentText(es); appendStringInfoString(es->str, "I/O Timings:"); if (!INSTR_TIME_IS_ZERO(usage->blk_read_time)) appendStringInfo(es->str, " read=%0.3f", @@ -3237,7 +3249,7 @@ show_modifytable_info(ModifyTableState *mtstate, List *ancestors, */ if (es->format == EXPLAIN_FORMAT_TEXT) { - appendStringInfoSpaces(es->str, es->indent * 2); + ExplainIndentText(es); appendStringInfoString(es->str, fdwroutine ? foperation : operation); } @@ -3427,6 +3439,158 @@ ExplainCustomChildren(CustomScanState *css, List *ancestors, ExplainState *es) } /* + * Create a per-plan-node workspace for collecting per-worker data. + * + * Output related to each worker will be temporarily "set aside" into a + * separate buffer, which we'll merge into the main output stream once + * we've processed all data for the plan node. This makes it feasible to + * generate a coherent sub-group of fields for each worker, even though the + * code that produces the fields is in several different places in this file. + * Formatting of such a set-aside field group is managed by + * ExplainOpenSetAsideGroup and ExplainSaveGroup/ExplainRestoreGroup. + */ +static ExplainWorkersState * +ExplainCreateWorkersState(int num_workers) +{ + ExplainWorkersState *wstate; + + wstate = (ExplainWorkersState *) palloc(sizeof(ExplainWorkersState)); + wstate->num_workers = num_workers; + wstate->worker_inited = (bool *) palloc0(num_workers * sizeof(bool)); + wstate->worker_str = (StringInfoData *) + palloc0(num_workers * sizeof(StringInfoData)); + wstate->worker_state_save = (int *) palloc(num_workers * sizeof(int)); + return wstate; +} + +/* + * Begin or resume output into the set-aside group for worker N. + */ +static void +ExplainOpenWorker(int n, ExplainState *es) +{ + ExplainWorkersState *wstate = es->workers_state; + + Assert(wstate); + Assert(n >= 0 && n < wstate->num_workers); + + /* Save prior output buffer pointer */ + wstate->prev_str = es->str; + + if (!wstate->worker_inited[n]) + { + /* First time through, so create the buffer for this worker */ + initStringInfo(&wstate->worker_str[n]); + es->str = &wstate->worker_str[n]; + + /* + * Push suitable initial formatting state for this worker's field + * group. We allow one extra logical nesting level, since this group + * will eventually be wrapped in an outer "Workers" group. + */ + ExplainOpenSetAsideGroup("Worker", NULL, true, 2, es); + + /* + * In non-TEXT formats we always emit a "Worker Number" field, even if + * there's no other data for this worker. + */ + if (es->format != EXPLAIN_FORMAT_TEXT) + ExplainPropertyInteger("Worker Number", NULL, n, es); + + wstate->worker_inited[n] = true; + } + else + { + /* Resuming output for a worker we've already emitted some data for */ + es->str = &wstate->worker_str[n]; + + /* Restore formatting state saved by last ExplainCloseWorker() */ + ExplainRestoreGroup(es, 2, &wstate->worker_state_save[n]); + } + + /* + * In TEXT format, prefix the first output line for this worker with + * "Worker N:". Then, any additional lines should be indented one more + * stop than the "Worker N" line is. + */ + if (es->format == EXPLAIN_FORMAT_TEXT) + { + if (es->str->len == 0) + { + ExplainIndentText(es); + appendStringInfo(es->str, "Worker %d: ", n); + } + + es->indent++; + } +} + +/* + * End output for worker N --- must pair with previous ExplainOpenWorker call + */ +static void +ExplainCloseWorker(int n, ExplainState *es) +{ + ExplainWorkersState *wstate = es->workers_state; + + Assert(wstate); + Assert(n >= 0 && n < wstate->num_workers); + Assert(wstate->worker_inited[n]); + + /* + * Save formatting state in case we do another ExplainOpenWorker(), then + * pop the formatting stack. + */ + ExplainSaveGroup(es, 2, &wstate->worker_state_save[n]); + + /* + * In TEXT format, if we didn't actually produce any output line(s) then + * truncate off the partial line. (This is to avoid bogus output if, for + * instance, show_buffer_usage chooses not to print anything for the + * worker.) Also fix up the indent level. + */ + if (es->format == EXPLAIN_FORMAT_TEXT) + { + while (es->str->len > 0 && es->str->data[es->str->len - 1] != '\n') + es->str->data[--(es->str->len)] = '\0'; + + es->indent--; + } + + /* Restore prior output buffer pointer */ + es->str = wstate->prev_str; +} + +/* + * Print per-worker info for current node, then free the ExplainWorkersState. + */ +static void +ExplainFlushWorkersState(ExplainState *es) +{ + ExplainWorkersState *wstate = es->workers_state; + + ExplainOpenGroup("Workers", "Workers", false, es); + for (int i = 0; i < wstate->num_workers; i++) + { + if (wstate->worker_inited[i]) + { + /* This must match previous ExplainOpenSetAsideGroup call */ + ExplainOpenGroup("Worker", NULL, true, es); + appendStringInfoString(es->str, wstate->worker_str[i].data); + ExplainCloseGroup("Worker", NULL, true, es); + + pfree(wstate->worker_str[i].data); + } + } + ExplainCloseGroup("Workers", "Workers", false, es); + + pfree(wstate->worker_inited); + pfree(wstate->worker_str); + pfree(wstate->worker_state_save); + pfree(wstate); +} + +/* * Explain a property, such as sort keys or targets, that takes the form of * a list of unlabeled items. "data" is a list of C strings. */ @@ -3439,7 +3603,7 @@ ExplainPropertyList(const char *qlabel, List *data, ExplainState *es) switch (es->format) { case EXPLAIN_FORMAT_TEXT: - appendStringInfoSpaces(es->str, es->indent * 2); + ExplainIndentText(es); appendStringInfo(es->str, "%s: ", qlabel); foreach(lc, data) { @@ -3560,7 +3724,7 @@ ExplainProperty(const char *qlabel, const char *unit, const char *value, switch (es->format) { case EXPLAIN_FORMAT_TEXT: - appendStringInfoSpaces(es->str, es->indent * 2); + ExplainIndentText(es); if (unit) appendStringInfo(es->str, "%s: %s %s\n", qlabel, value, unit); else @@ -3752,6 +3916,117 @@ ExplainCloseGroup(const char *objtype, const char *labelname, } /* + * Open a group of related objects, without emitting actual data. + * + * Prepare the formatting state as though we were beginning a group with + * the identified properties, but don't actually emit anything. Output + * subsequent to this call can be redirected into a separate output buffer, + * and then eventually appended to the main output buffer after doing a + * regular ExplainOpenGroup call (with the same parameters). + * + * The extra "depth" parameter is the new group's depth compared to current. + * It could be more than one, in case the eventual output will be enclosed + * in additional nesting group levels. We assume we don't need to track + * formatting state for those levels while preparing this group's output. + * + * There is no ExplainCloseSetAsideGroup --- in current usage, we always + * pop this state with ExplainSaveGroup. + */ +static void +ExplainOpenSetAsideGroup(const char *objtype, const char *labelname, + bool labeled, int depth, ExplainState *es) +{ + switch (es->format) + { + case EXPLAIN_FORMAT_TEXT: + /* nothing to do */ + break; + + case EXPLAIN_FORMAT_XML: + es->indent += depth; + break; + + case EXPLAIN_FORMAT_JSON: + es->grouping_stack = lcons_int(0, es->grouping_stack); + es->indent += depth; + break; + + case EXPLAIN_FORMAT_YAML: + if (labelname) + es->grouping_stack = lcons_int(1, es->grouping_stack); + else + es->grouping_stack = lcons_int(0, es->grouping_stack); + es->indent += depth; + break; + } +} + +/* + * Pop one level of grouping state, allowing for a re-push later. + * + * This is typically used after ExplainOpenSetAsideGroup; pass the + * same "depth" used for that. + * + * This should not emit any output. If state needs to be saved, + * save it at *state_save. Currently, an integer save area is sufficient + * for all formats, but we might need to revisit that someday. + */ +static void +ExplainSaveGroup(ExplainState *es, int depth, int *state_save) +{ + switch (es->format) + { + case EXPLAIN_FORMAT_TEXT: + /* nothing to do */ + break; + + case EXPLAIN_FORMAT_XML: + es->indent -= depth; + break; + + case EXPLAIN_FORMAT_JSON: + es->indent -= depth; + *state_save = linitial_int(es->grouping_stack); + es->grouping_stack = list_delete_first(es->grouping_stack); + break; + + case EXPLAIN_FORMAT_YAML: + es->indent -= depth; + *state_save = linitial_int(es->grouping_stack); + es->grouping_stack = list_delete_first(es->grouping_stack); + break; + } +} + +/* + * Re-push one level of grouping state, undoing the effects of ExplainSaveGroup. + */ +static void +ExplainRestoreGroup(ExplainState *es, int depth, int *state_save) +{ + switch (es->format) + { + case EXPLAIN_FORMAT_TEXT: + /* nothing to do */ + break; + + case EXPLAIN_FORMAT_XML: + es->indent += depth; + break; + + case EXPLAIN_FORMAT_JSON: + es->grouping_stack = lcons_int(*state_save, es->grouping_stack); + es->indent += depth; + break; + + case EXPLAIN_FORMAT_YAML: + es->grouping_stack = lcons_int(*state_save, es->grouping_stack); + es->indent += depth; + break; + } +} + +/* * Emit a "dummy" group that never has any members. * * objtype is the type of the group object, labelname is its label within @@ -3913,6 +4188,21 @@ ExplainXMLTag(const char *tagname, int flags, ExplainState *es) } /* + * Indent a text-format line. + * + * We indent by two spaces per indentation level. However, when emitting + * data for a parallel worker there might already be data on the current line + * (cf. ExplainOpenWorker); in that case, don't indent any more. + */ +static void +ExplainIndentText(ExplainState *es) +{ + Assert(es->format == EXPLAIN_FORMAT_TEXT); + if (es->str->len == 0 || es->str->data[es->str->len - 1] == '\n') + appendStringInfoSpaces(es->str, es->indent * 2); +} + +/* * Emit a JSON line ending. * * JSON requires a comma after each property but the last. To facilitate this, diff --git a/src/include/commands/explain.h b/src/include/commands/explain.h index 45027a7..54f6240 100644 --- a/src/include/commands/explain.h +++ b/src/include/commands/explain.h @@ -25,6 +25,15 @@ typedef enum ExplainFormat EXPLAIN_FORMAT_YAML } ExplainFormat; +typedef struct ExplainWorkersState +{ + int num_workers; /* # of worker processes the plan used */ + bool *worker_inited; /* per-worker state-initialized flags */ + StringInfoData *worker_str; /* per-worker transient output buffers */ + int *worker_state_save; /* per-worker grouping state save areas */ + StringInfo prev_str; /* saved output buffer while redirecting */ +} ExplainWorkersState; + typedef struct ExplainState { StringInfo str; /* output buffer */ @@ -47,6 +56,8 @@ typedef struct ExplainState List *deparse_cxt; /* context list for deparsing expressions */ Bitmapset *printed_subplans; /* ids of SubPlans we've printed */ bool hide_workers; /* set if we find an invisible Gather */ + /* state related to the current plan node */ + ExplainWorkersState *workers_state; /* needed if parallel plan */ } ExplainState; /* Hook for plugins to get control in ExplainOneQuery() */ @@ -84,8 +95,6 @@ extern void ExplainPrintPlan(ExplainState *es, QueryDesc *queryDesc); extern void ExplainPrintTriggers(ExplainState *es, QueryDesc *queryDesc); extern void ExplainPrintJITSummary(ExplainState *es, QueryDesc *queryDesc); -extern void ExplainPrintJIT(ExplainState *es, int jit_flags, - struct JitInstrumentation *jit_instr, int worker_num); extern void ExplainQueryText(ExplainState *es, QueryDesc *queryDesc);
I wrote: > It's still really unclear to me how we could exercise any of > this behavior meaningfully in a regression test. I thought > for a little bit about using the TAP infrastructure instead > of a traditional-style test, but it seems like that doesn't > buy anything except for a bias towards ignoring details instead > of overspecifying them. Which is not much of an improvement. After further thought, I decided that about the best we can do is suppress the "Workers" field in the regression test's expected output. This still gives us code coverage of the relevant code, and we can check that the output is valid JSON before we strip it, so it's not a dead loss. I rewrote the test script a bit to add some coverage of XML and YAML output formats, since we had exactly none before, and pushed it. regards, tom lane
Hi, On 2020-01-25 14:23:50 -0500, Tom Lane wrote: > A side effect of this change is that per-worker JIT info is now > printed one plan level further down: before it was printed on > the Gather node, but now it's attached to the Gather's child, > because that's where we print other per-worker data. I don't > see anything particularly wrong with that, but it's another > change from the behavior today. Yea, I don't see any need to be bothered by that. > It's still really unclear to me how we could exercise any of > this behavior meaningfully in a regression test. I thought > for a little bit about using the TAP infrastructure instead > of a traditional-style test, but it seems like that doesn't > buy anything except for a bias towards ignoring details instead > of overspecifying them. Which is not much of an improvement. Hm. I'd like to avoid needing TAP for this kind of thing, it'll just make it much more expensive in just about all ways. Testing JIT explain is "easy" enough I think, I've posted a patch in another thread, which just skips over the region of the test if JIT is not available. See e.g. 0008 in https://www.postgresql.org/message-id/20191029000229.fkjmuld3g7f2jq7i%40alap3.anarazel.de (that's a thread I'd love your input in btw) It's harder for parallel query though. If parallel query were able to reuse workers, we could "just" check at the beginning of the test if we are able to get the workers we need, and then skip the rest of the tests if not. But as things stand that doesn't guarantee anything. I wonder if we could introduce a debug GUC that makes parallel worker acquisition just retry in a loop, for a time determined by the GUC. That obviously would be a bad idea to do in a production setup, but it could be good enough for regression tests? There are some deadlock dangers, but I'm not sure they really matter for the tests. > + /* prepare per-worker general execution details */ > + if (es->workers_state && es->verbose) > + { > + WorkerInstrumentation *w = planstate->worker_instrument; > + > + for (int n = 0; n < w->num_workers; n++) > + { > + Instrumentation *instrument = &w->instrument[n]; > + double nloops = instrument->nloops; > + double startup_ms; > + double total_ms; > + double rows; > + > + if (nloops <= 0) > + continue; > + startup_ms = 1000.0 * instrument->startup / nloops; > + total_ms = 1000.0 * instrument->total / nloops; > + rows = instrument->ntuples / nloops; > + > + ExplainOpenWorker(n, es); > + > + if (es->format == EXPLAIN_FORMAT_TEXT) > + { > + ExplainIndentText(es); > + if (es->timing) > + appendStringInfo(es->str, > + "actual time=%.3f..%.3f rows=%.0f loops=%.0f\n", > + startup_ms, total_ms, rows, nloops); > + else > + appendStringInfo(es->str, > + "actual rows=%.0f loops=%.0f\n", > + rows, nloops); > + } > + else > + { > + if (es->timing) > + { > + ExplainPropertyFloat("Actual Startup Time", "ms", > + startup_ms, 3, es); > + ExplainPropertyFloat("Actual Total Time", "ms", > + total_ms, 3, es); > + } > + ExplainPropertyFloat("Actual Rows", NULL, rows, 0, es); > + ExplainPropertyFloat("Actual Loops", NULL, nloops, 0, es); > + } > + > + ExplainCloseWorker(n, es); > + } > + } I'd personally move this into a separate function, given the patches moves the code around already. ExplainNode() is already hard enough to navigate... It probably also makes sense to move everything but the nloops <= 0, ExplainOpenWorker/ExplainCloseWorker into its own function. As far as I can tell it now should be identical between the non-parallel case? > +/* > + * Begin or resume output into the set-aside group for worker N. > + */ > +static void Would it make sense to make these functions non-static? It seems plausible that code for a custom node or such would want to add its own information? > +ExplainOpenWorker(int n, ExplainState *es) > +{ > + ExplainWorkersState *wstate = es->workers_state; > + > + Assert(wstate); > + Assert(n >= 0 && n < wstate->num_workers); > + > + /* Save prior output buffer pointer */ > + wstate->prev_str = es->str; > + > + if (!wstate->worker_inited[n]) > + { > + /* First time through, so create the buffer for this worker */ > + initStringInfo(&wstate->worker_str[n]); > + es->str = &wstate->worker_str[n]; > + > + /* > + * Push suitable initial formatting state for this worker's field > + * group. We allow one extra logical nesting level, since this group > + * will eventually be wrapped in an outer "Workers" group. > + */ > + ExplainOpenSetAsideGroup("Worker", NULL, true, 2, es); > + > + /* > + * In non-TEXT formats we always emit a "Worker Number" field, even if > + * there's no other data for this worker. > + */ > + if (es->format != EXPLAIN_FORMAT_TEXT) > + ExplainPropertyInteger("Worker Number", NULL, n, es); > + > + wstate->worker_inited[n] = true; > + } > + else > + { > + /* Resuming output for a worker we've already emitted some data for */ > + es->str = &wstate->worker_str[n]; > + > + /* Restore formatting state saved by last ExplainCloseWorker() */ > + ExplainRestoreGroup(es, 2, &wstate->worker_state_save[n]); > + } > + > + /* > + * In TEXT format, prefix the first output line for this worker with > + * "Worker N:". Then, any additional lines should be indented one more > + * stop than the "Worker N" line is. > + */ > + if (es->format == EXPLAIN_FORMAT_TEXT) > + { > + if (es->str->len == 0) > + { > + ExplainIndentText(es); > + appendStringInfo(es->str, "Worker %d: ", n); > + } > + > + es->indent++; > + } > +} I don't quite get the Worker %d bit. Why are we not outputting that in the !worker_inited block? > +/* > + * Print per-worker info for current node, then free the ExplainWorkersState. > + */ > +static void > +ExplainFlushWorkersState(ExplainState *es) > +{ > + ExplainWorkersState *wstate = es->workers_state; > + > + ExplainOpenGroup("Workers", "Workers", false, es); > + for (int i = 0; i < wstate->num_workers; i++) > + { > + if (wstate->worker_inited[i]) > + { > + /* This must match previous ExplainOpenSetAsideGroup call */ > + ExplainOpenGroup("Worker", NULL, true, es); > + appendStringInfoString(es->str, wstate->worker_str[i].data); Probably never matters, but given we do have the string length already, we could use appendBinaryStringInfo(). > + ExplainCloseGroup("Worker", NULL, true, es); Not related to this patch: I never got why we maintain a grouping stack for some things, but don't do it for the group properties themselves. > /* > + * Open a group of related objects, without emitting actual data. > + * > + * Prepare the formatting state as though we were beginning a group with > + * the identified properties, but don't actually emit anything. Output > + * subsequent to this call can be redirected into a separate output buffer, > + * and then eventually appended to the main output buffer after doing a > + * regular ExplainOpenGroup call (with the same parameters). > + * > + * The extra "depth" parameter is the new group's depth compared to current. > + * It could be more than one, in case the eventual output will be enclosed > + * in additional nesting group levels. We assume we don't need to track > + * formatting state for those levels while preparing this group's output. > + * > + * There is no ExplainCloseSetAsideGroup --- in current usage, we always > + * pop this state with ExplainSaveGroup. > + */ > +static void > +ExplainOpenSetAsideGroup(const char *objtype, const char *labelname, > + bool labeled, int depth, ExplainState *es) > +{ > + switch (es->format) > + { > + case EXPLAIN_FORMAT_TEXT: > + /* nothing to do */ > + break; > + > + case EXPLAIN_FORMAT_XML: > + es->indent += depth; > + break; > + > + case EXPLAIN_FORMAT_JSON: > + es->grouping_stack = lcons_int(0, es->grouping_stack); > + es->indent += depth; > + break; > + > + case EXPLAIN_FORMAT_YAML: > + if (labelname) > + es->grouping_stack = lcons_int(1, es->grouping_stack); > + else > + es->grouping_stack = lcons_int(0, es->grouping_stack); > + es->indent += depth; > + break; > + } > +} Hm. It might be worthwhile to rename ExplainOpenSetAsideGroup and use it from ExplainOpenGroup()? Seems we could just call it after ExplainOpenGroup()'s switch, and remove all the indent/grouping_stack related code from ExplainOpenGroup(). Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > I wonder if we could introduce a debug GUC that makes parallel worker > acquisition just retry in a loop, for a time determined by the GUC. That > obviously would be a bad idea to do in a production setup, but it could > be good enough for regression tests? There are some deadlock dangers, > but I'm not sure they really matter for the tests. Hmmm .... might work. Seems like a better idea than "run it by itself" as we have to do now. > I'd personally move this into a separate function, given the patches > moves the code around already. ExplainNode() is already hard enough to > navigate... Well, it was already inline in ExplainNode, so this just moved the code a few lines. I'm not that excited about moving little bits of that function out-of-line. >> +/* >> + * Begin or resume output into the set-aside group for worker N. >> + */ >> +static void > Would it make sense to make these functions non-static? It seems > plausible that code for a custom node or such would want to add > its own information? Maybe, but seems to me that there'd be a whole lot of other infrastructure needed to track additional per-worker state. I'd just as soon not expose this stuff until (a) there's a concrete not hypothetical use case and (b) it's been around long enough to feel comfortable that there's nothing wrong with the design. >> + /* >> + * In TEXT format, prefix the first output line for this worker with >> + * "Worker N:". Then, any additional lines should be indented one more >> + * stop than the "Worker N" line is. >> + */ > I don't quite get the Worker %d bit. Why are we not outputting that in > the !worker_inited block? We might strip it off again in ExplainCloseWorker, and then potentially add it back again in a later ExplainOpenWorker. That could only happen if an earlier ExplainOpen/CloseWorker fragment decides not to emit any text and then a later one wants to do so. Which I'm pretty sure is unreachable right now, but I don't think this code should assume that. >> + appendStringInfoString(es->str, wstate->worker_str[i].data); > Probably never matters, but given we do have the string length already, > we could use appendBinaryStringInfo(). Ah, I was thinking about making that change but then forgot. >> + ExplainCloseGroup("Worker", NULL, true, es); > Not related to this patch: I never got why we maintain a grouping stack > for some things, but don't do it for the group properties > themselves. Right now it'd just be extra overhead. If we ever have a case where it's not convenient for an ExplainCloseGroup caller to provide the same data as for ExplainOpenGroup, then I'd be on board with storing that info. > Hm. It might be worthwhile to rename ExplainOpenSetAsideGroup and use it > from ExplainOpenGroup()? Seems we could just call it after > ExplainOpenGroup()'s switch, and remove all the indent/grouping_stack > related code from ExplainOpenGroup(). Hmm. It seemed easier to me to keep them separate, but ... I did consider a design in which, instead of ExplainOpenSetAsideGroup, there was some function that would initialize the "state_save" area and then you'd call the "restore" function to make that state active. It seemed like that would be too dissimilar from ExplainOpenGroup --- but conceivably, we could reimplement ExplainOpenGroup as calling the initialize function and then the restore function (along with doing some output). Not really sure that'd be an improvement though: it'd involve less duplicate code, but the functions would individually be harder to wrap your brain around. regards, tom lane
I wrote: > Andres Freund <andres@anarazel.de> writes: >> I wonder if we could introduce a debug GUC that makes parallel worker >> acquisition just retry in a loop, for a time determined by the GUC. That >> obviously would be a bad idea to do in a production setup, but it could >> be good enough for regression tests? There are some deadlock dangers, >> but I'm not sure they really matter for the tests. > Hmmm .... might work. Seems like a better idea than "run it by itself" > as we have to do now. The more I think about this, the more it seems like a good idea, and not only for regression test purposes. If you're about to launch a query that will run for hours even with the max number of workers, you don't want it to launch with less than that number just because somebody else was eating a worker slot for a few milliseconds. So I'm imagining a somewhat general-purpose GUC defined like "max_delay_to_acquire_parallel_worker", measured say in milliseconds. The default would be zero (current behavior: try once and give up), but you could set it to small positive values if you have that kind of production concern, while the regression tests could set it to big positive values. This would alleviate all sorts of problems we have with not being able to assume stable results from parallel worker acquisition in the tests. regards, tom lane