Thread: JIT performance bug/regression & JIT EXPLAIN
Hi,

Unfortunately I found a performance regression for JITed query compilation introduced in 12, compared to 11. It is fixed in one of the attached patches (v1-0009-Fix-determination-when-tuple-deforming-can-be-JIT.patch - which needs a better commit message).

The first question is when to push that fix. I'm inclined to just do so now - as we still do JITed tuple deforming in most cases, and did so in 11 in the places this patch fixes, the risk seems low. But I can also see an argument for waiting until after 12.0.

For me the bigger question is how to make sure we can write tests determining which parts of the query tree are JIT compiled and which are not. There's the above bug, and I'm also hunting a regression introduced somewhere during 11's lifetime, which suggests to me that we need better coverage. I also want to add new JIT logic, making this even more important.

The reason that 11 didn't have tests verifying that certain parts of the plan tree are JIT compiled is that EXPLAIN doesn't currently show the relevant information, and it's not that trivial to do so.

What I'd like to do is to add additional, presumably optional, output to EXPLAIN showing more information about expressions. There are two major parts to doing so:

1) Find a way to represent the additional information attached to expressions, and provide show_expression() et al with the ExprState to be able to do so. The additional information I think is necessary is
   a) whether the expression is JIT compiled
   b-d) whether scan/outer/inner tuple deforming is necessary, and if so, whether it is JIT compiled.

   We can't unconditionally JIT compile tuple deforming, because there are a number of cases where the source slot doesn't have precisely the same tuple desc, and/or doesn't have the same type.

2) Expand EXPLAIN output to show expressions that currently aren't shown.
Performance-wise, the most critical expressions that aren't currently visible, and that I know about, are:

- Agg's combined transition function; we also currently don't display in any understandable way how many passes over the input we do (for grouping sets), nor how much memory is needed.
- Agg's hash comparator (separate regression referenced above)
- Hash/HashJoin's hashkeys/hjclauses

For 1) I think we need to change show_expression()/show_qual() etc to also pass down the corresponding ExprState if available (it's not available in plenty of cases, most of which are not particularly important). That's fairly mechanical. Then we need to add information about JIT to individual expressions. In the attached WIP patchset I've made that dependent on the new "jit_details" EXPLAIN option. When specified, new per-expression information is shown:

- JIT-Expr: whether the expression was JIT compiled (might e.g. not be the case because no parent was provided)
- JIT-Deform-{Scan,Outer,Inner}: whether necessary, and whether JIT accelerated.

I don't like these names much, but ...

For the deform cases I chose to display
a) the function name if JIT compiled
b) "false" if the expression is JIT compiled, deforming is necessary, but deforming is not JIT compiled (e.g. because the slot type wasn't fixed)
c) "null" if not necessary, with that being omitted in text mode.

So e.g. in json format this looks like:

    "Filter": {
      "Expr": "(lineitem.l_shipdate <= '1998-09-18 00:00:00'::timestamp without time zone)",
      "JIT-Expr": "evalexpr_0_2",
      "JIT-Deform-Scan": "deform_0_3",
      "JIT-Deform-Outer": null,
      "JIT-Deform-Inner": null
    }

and in text mode:

    Filter: (lineitem.l_shipdate <= '1998-09-18 00:00:00'::timestamp without time zone); JIT-Expr: evalexpr_0_2, JIT-Deform-Scan: deform_0_3

For now I chose to make Filter a group only when both not in text mode and jit_details is on - otherwise it's unclear what the JIT fields would apply to.
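As a sketch of what a consumer of this proposed format would do, here's how a tool might read the per-expression JIT fields; this assumes the WIP patchset's field names (JIT-Expr, JIT-Deform-*), which may well change, and the `jit_summary` helper is hypothetical:

```python
import json

# Hypothetical EXPLAIN (FORMAT JSON, JIT_DETAILS) fragment for a Filter,
# using the field names from the WIP patchset described above.
node = json.loads("""
{
  "Filter": {
    "Expr": "(lineitem.l_shipdate <= '1998-09-18 00:00:00'::timestamp without time zone)",
    "JIT-Expr": "evalexpr_0_2",
    "JIT-Deform-Scan": "deform_0_3",
    "JIT-Deform-Outer": null,
    "JIT-Deform-Inner": null
  }
}
""")

def jit_summary(expr_group):
    """Summarize JIT info for one expression group.

    Per the proposal, each JIT-Deform-* value is either a function name
    (deforming was JIT compiled), False (needed but not compiled), or
    None/null (deforming not necessary).
    """
    compiled = expr_group.get("JIT-Expr") is not None
    deforms = {
        slot: expr_group.get("JIT-Deform-%s" % slot)
        for slot in ("Scan", "Outer", "Inner")
    }
    return compiled, deforms

compiled, deforms = jit_summary(node["Filter"])
```

Note that this only works because "Filter" is a group here; with jit_details off it would be a plain string, which is exactly the shape problem discussed next.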
But that's pretty crappy, because it means that the 'shape' of the output depends on the jit_details option. I think if we were starting from scratch it'd make sense to always have the expression as its own sub-node, so interpreting code doesn't have to know all the places an expression can be referenced from. But it's probably not too attractive to change that today?

Somewhat independently, the series also contains a patch that renames verbose mode's "Output" to "Project" if the node projects. I find it pretty hard to interpret whether a node projects otherwise, and it's confusing when jit_details shows details only for some nodes' Output, but not for others. But the compat break due to that change is not small - perhaps we could instead mark that in another way?

For 2) I've only started to improve the situation, but it covers a number of pretty crucial pieces. I first focused on adding information for Agg nodes, as a) those are typically performance sensitive in cases where JIT is beneficial b) the current instrumentation is really insufficient, especially in cases where multiple grouping sets are computed at the same time - I think it's effectively not interpretable.

In verbose mode explain now shows per-phase output about the transition computation. E.g.
for a grouping set query that can't be computed in one pass, it now displays something like:

    MixedAggregate (cost=6083420.07..14022888.98 rows=10011685 width=64)
      Project: avg((l_linenumber)::bigint), count((l_partkey)::bigint), sum(l_quantity), l_linenumber, l_partkey, l_quantity
      Filter: (sum(lineitem.l_quantity) IS NOT NULL)
      Phase 2 using strategy "Sort":
        Sort Key: lineitem.l_partkey, lineitem.l_quantity
        Transition Function: 2 * int8_avg_accum(TRANS, (l_linenumber)::bigint), 2 * int8inc_any(TRANS, (l_partkey)::bigint), 2 * float8pl(TRANS, l_quantity)
        Sorted Group: lineitem.l_partkey, lineitem.l_quantity
        Sorted Group: lineitem.l_partkey
      Phase 1 using strategy "Sorted Input & All & Hash":
        Transition Function: 6 * int8_avg_accum(TRANS, (l_linenumber)::bigint), 6 * int8inc_any(TRANS, (l_partkey)::bigint), 6 * float8pl(TRANS, l_quantity)
        Sorted Input Group: lineitem.l_linenumber, lineitem.l_partkey, lineitem.l_quantity
        Sorted Input Group: lineitem.l_linenumber, lineitem.l_partkey
        Sorted Input Group: lineitem.l_linenumber
        All Group
        Hash Group: lineitem.l_quantity
        Hash Group: lineitem.l_quantity, lineitem.l_linenumber
      ->  Sort (cost=6083420.07..6158418.50 rows=29999372 width=16)
          ...

The N * indicates how many of the same transition functions are computed during that phase. I'm not sure that 'TRANS' is the best placeholder for the transition value here. Maybe $TRANS would be clearer?
For a parallel aggregate the upper level looks like:

    Finalize HashAggregate (cost=610681.93..610682.02 rows=9 width=16)
      Project: l_tax, sum(l_quantity)
      Phase 0 using strategy "Hash":
        Transition Function: float8pl(TRANS, (PARTIAL sum(l_quantity)))
        Hash Group: lineitem.l_tax
      ->  Gather (cost=610677.11..610681.70 rows=45 width=16)
            Output: l_tax, (PARTIAL sum(l_quantity))
            Workers Planned: 5
            ->  Partial HashAggregate (cost=609677.11..609677.20 rows=9 width=16)
                  Project: l_tax, PARTIAL sum(l_quantity)

I've not done that yet, but I think it's way past time that we also add memory usage information to Aggregate nodes (both for the hashtable(s), and for internal sorts if those are performed for grouping sets). That would also be very hard in the "current" format, as there's no representation of passes.

With jit_details enabled, we then can show information about the aggregation function, and grouping functions:

    Phase 0 using strategy "Hash":
      Transition Function: float8pl(TRANS, (PARTIAL sum(l_quantity))); JIT-Expr: evalexpr_0_11, JIT-Deform-Outer: false
      Hash Group: lineitem.l_tax; JIT-Expr: evalexpr_0_8, JIT-Deform-Outer: deform_0_10, JIT-Deform-Inner: deform_0_9

Currently the "new" format is used when either grouping sets are in use (as the previous explain output was not particularly useful, and information about the passes is important), or if VERBOSE or JIT_DETAILS are specified.

For HashJoin/Hash I've added 'Outer Hash Key' and 'Hash Key' for each key, but only in verbose mode. That's somewhat important because for HashJoins those are currently often the performance critical bit, as they'll commonly be the expressions that deform the slots from below. That display is somewhat redundant with HashJoin's "Hash Cond", but they're evaluated separately. Under verbose that seems OK to me. With jit_details enabled, this e.g.
looks like this:

    Hash Join (cost=271409.60..2326739.51 rows=30000584 width=250)
      Project: lineitem.l_orderkey, lineitem.l_partkey, lineitem.l_suppkey, lineitem.l_linenumber, lineitem.l_quantity, lineitem.l_extendedprice, lineitem.l_discount, lineitem.l_tax,
      Inner Unique: true
      Hash Cond: ((lineitem.l_partkey = partsupp.ps_partkey) AND (lineitem.l_suppkey = partsupp.ps_suppkey)); JIT-Expr: evalexpr_0_7, JIT-Deform-Outer: deform_0_9, JIT-Deform-Inner:
      Outer Hash Key: lineitem.l_partkey; JIT-Expr: evalexpr_0_10, JIT-Deform-Outer: deform_0_11
      Outer Hash Key: lineitem.l_suppkey; JIT-Expr: evalexpr_0_12, JIT-Deform-Outer: deform_0_13
      ->  Seq Scan on public.lineitem (cost=0.00..819684.84 rows=30000584 width=106)
            Output: lineitem.l_orderkey, lineitem.l_partkey, lineitem.l_suppkey, lineitem.l_linenumber, lineitem.l_quantity, lineitem.l_extendedprice, lineitem.l_discount, lineitem.l
      ->  Hash (cost=129384.24..129384.24 rows=3999824 width=144)
            Output: partsupp.ps_partkey, partsupp.ps_suppkey, partsupp.ps_availqty, partsupp.ps_supplycost, partsupp.ps_comment
            Hash Key: partsupp.ps_partkey; JIT-Expr: evalexpr_0_0, JIT-Deform-Outer: deform_0_1
            Hash Key: partsupp.ps_suppkey; JIT-Expr: evalexpr_0_2, JIT-Deform-Outer: deform_0_3
            ->  Seq Scan on public.partsupp (cost=0.00..129384.24 rows=3999824 width=144)
                  Output: partsupp.ps_partkey, partsupp.ps_suppkey, partsupp.ps_availqty, partsupp.ps_supplycost, partsupp.ps_comment
    JIT:
      Functions: 14 (6 for expression evaluation, 8 for tuple deforming)
      Options: Inlining true, Optimization true, Expressions true, Deforming true

This also highlights the sad fact that we currently use a separate ExprState to compute each of the hash keys, and then "manually" invoke the hash function itself. That's bad both for interpreted execution, as we repeatedly pay executor startup overhead and don't even hit the fastpath, and for JITed execution, because we have more code to optimize (some of it pretty redundant, in particular the deforming).
In both cases we suffer from the problem that we deform the tuple incrementally.

A later patch in the series then uses the new explain output to add some tests for JIT, and then fixes two bugs, showing that the test output changes. Additionally I've also included a small improvement to the expression evaluation logic, which also changes output in the JIT test, as it should.

Comments?

Greetings,

Andres Freund
Attachment
- v1-0001-jit-Instrument-function-purpose-separately-add-tr.patch
- v1-0002-Refactor-explain.c-to-pass-ExprState-down-to-show.patch
- v1-0003-Explain-Differentiate-between-a-node-projecting-o.patch
- v1-0004-Add-EXPLAIN-option-jit_details-showing-per-expres.patch
- v1-0005-jit-explain-remove-backend-lifetime-module-count-.patch
- v1-0006-WIP-explain-Show-per-phase-information-about-aggr.patch
- v1-0007-WIP-explain-Output-hash-keys-in-verbose-mode.patch
- v1-0008-jit-Add-tests.patch
- v1-0009-Fix-determination-when-tuple-deforming-can-be-JIT.patch
- v1-0010-jit-Fix-pessimization-of-execGrouping.c-compariso.patch
- v1-0011-Reduce-code-duplication-for-ExecJust-Var-operatio.patch
- v1-0012-Don-t-generate-EEOP_-_FETCHSOME-operations-for-sl.patch
Hi,

On 2019-09-27 00:20:53 -0700, Andres Freund wrote:
> Unfortunately I found a performance regression for JITed query
> compilation introduced in 12, compared to 11. Fixed in one of the
> attached patches (v1-0009-Fix-determination-when-tuple-deforming-can-be-JIT.patch
> - which needs a better commit message).
>
> The first question is when to push that fix. I'm inclined to just do so
> now - as we still do JITed tuple deforming in most cases, as well as
> doing so in 11 in the places this patch fixes, the risk of that seems
> low. But I can also see an argument for waiting until after 12.0.

Since nobody opined, I now have pushed that, and the other fix mentioned later in that email.

I'd appreciate comments on the rest of the email; it's clear that we need to improve the test infrastructure here. And also the explain output for grouping sets...

Greetings,

Andres Freund
> But that's pretty crappy, because it means that the 'shape' of the
> output depends on the jit_details option.

Yeah, that would be hard to work with. What about adding it as a sibling group?

    "Filter": "(lineitem.l_shipdate <= '1998-09-18 00:00:00'::timestamp without time zone)",
    "Filter JIT": {
      "Expr": "evalexpr_0_2",
      "Deform Scan": "deform_0_3",
      "Deform Outer": null,
      "Deform Inner": null
    }

Also not that pretty, but at least it's easier to work with (I also changed the dashes to spaces since that's what the rest of EXPLAIN is doing as a matter of style).

> But the compat break due to that change is not small - perhaps we
> could instead mark that in another way?

We could add a "Projects" boolean key instead? Of course that's more awkward in text mode. Maybe the compat break is less of an issue in text mode and we can treat this differently?

> I'm not sure that 'TRANS' is the best placeholder for the transition
> value here. Maybe $TRANS would be clearer?

+1, I think the `$` makes it clearer that this is not a literal expression.

> For HashJoin/Hash I've added 'Outer Hash Key' and 'Hash Key' for each
> key, but only in verbose mode.

That reads pretty well to me. What does the structured output look like?
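To illustrate the sibling-group idea concretely, here's roughly what consuming it would look like; the plan fragment and the `pair_jit_info` helper are hypothetical, and the crude "any string value is an expression" heuristic is deliberately naive - without a fixed key list a tool can't reliably tell expression fields from other string fields, which is the objection raised downthread:

```python
import re

# Hypothetical node fragment under the sibling-group proposal above.
plan = {
    "Node Type": "Seq Scan",
    "Filter": "(lineitem.l_shipdate <= '1998-09-18'::date)",
    "Filter JIT": {"Expr": "evalexpr_0_2", "Deform Scan": "deform_0_3"},
}

def pair_jit_info(node):
    """Return {expression key: (expression text, JIT group or None)}."""
    exprs = {}
    for key, value in node.items():
        if re.fullmatch(r"(.+) JIT", key):
            continue  # JIT groups are picked up via their base key below
        # Naive heuristic: treat non-metadata string fields as expressions.
        if isinstance(value, str) and key not in ("Node Type",):
            exprs[key] = (value, node.get(key + " JIT"))
    return exprs

pairs = pair_jit_info(plan)
```

In a real plan, fields like "Relation Name" or "Alias" are also strings, so the key filter above would have to enumerate every expression key name after all.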
On Fri, Sep 27, 2019 at 3:21 AM Andres Freund <andres@anarazel.de> wrote:
> - JIT-Expr: whether the expression was JIT compiled (might e.g. not be
>   the case because no parent was provided)
> - JIT-Deform-{Scan,Outer,Inner}: whether necessary, and whether JIT accelerated.
>
> I don't like these names much, but ...
>
> For the deform cases I chose to display
> a) the function name if JIT compiled
> b) "false" if the expression is JIT compiled, deforming is
>    necessary, but deforming is not JIT compiled (e.g. because the slot type
>    wasn't fixed)
> c) "null" if not necessary, with that being omitted in text mode.

I mean, why not just omit it in all modes if it's not necessary? I don't see that making the information we produce randomly inconsistent between modes is buying us anything.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 2019-10-28 15:05:01 -0400, Robert Haas wrote:
> On Fri, Sep 27, 2019 at 3:21 AM Andres Freund <andres@anarazel.de> wrote:
> > - JIT-Expr: whether the expression was JIT compiled (might e.g. not be
> >   the case because no parent was provided)
> > - JIT-Deform-{Scan,Outer,Inner}: whether necessary, and whether JIT accelerated.
> >
> > I don't like these names much, but ...
> >
> > For the deform cases I chose to display
> > a) the function name if JIT compiled
> > b) "false" if the expression is JIT compiled, deforming is
> >    necessary, but deforming is not JIT compiled (e.g. because the slot type
> >    wasn't fixed)
> > c) "null" if not necessary, with that being omitted in text mode.
>
> I mean, why not just omit in all modes if it's not necessary? I don't
> see that making the information we produce randomly inconsistent
> between modes is buying us anything.

Because that's the normal way to represent something non-existing in formats like json? There's a lot of information we show always for !text format, even if not really applicable to the context (e.g. Triggers for select statements). I think there's an argument to be made to deviate in this case, but I don't think it's obvious.

Abstract formatting reasons aside, it's actually useful to see where we know we're dealing with tuples that don't need to be deformed, and thus overhead due to that cannot be relevant. Not sure if there are sufficient consumers for that, but ... We e.g. should verify that the "none" doesn't suddenly vanish because we broke the information that lets us infer that we don't need tuple deforming - and that's easier to understand if there's an explicit field, rather than reasoning from absence. IMO.

Greetings,

Andres Freund
Hi,

On 2019-10-28 11:27:02 -0700, Maciek Sakrejda wrote:
> > But that's pretty crappy, because it means that the 'shape' of the
> > output depends on the jit_details option.
>
> Yeah, that would be hard to work with. What about adding it as a sibling group?
>
> "Filter": "(lineitem.l_shipdate <= '1998-09-18 00:00:00'::timestamp
> without time zone)",
> "Filter JIT": {
>   "Expr": "evalexpr_0_2",
>   "Deform Scan": "deform_0_3",
>   "Deform Outer": null,
>   "Deform Inner": null
> }
>
> Also not that pretty, but at least it's easier to work with

What I dislike about that is that it again introduces something that requires either pattern matching on key names (i.e. a key of '(.*) JIT' is one that has information about JIT, and the associated expression is in key $1), or knowing all the potential keys an expression could be in.

> (I also
> changed the dashes to spaces since that's what the rest of EXPLAIN is
> doing as a matter of style).

That makes sense.

> > But the compat break due to that change is not small - perhaps we
> > could instead mark that in another way?
>
> We could add a "Projects" boolean key instead? Of course that's more
> awkward in text mode. Maybe compat break is less of an issue in text
> mode and we can treat this differently?

Yea, I think a "Projects" key for each node makes sense. For text mode I guess we could just display the key on the same line when es->verbose is set? Still not sure whether just changing the output isn't the better approach.

Another alternative would be to just remove the 'Output' line when a node doesn't project - it can't really carry meaning in those cases anyway?

> > For HashJoin/Hash I've added 'Outer Hash Key' and 'Hash Key' for each
> > key, but only in verbose mode.
>
> That reads pretty well to me. What does the structured output look
> like?

Just a new "Outer Hash Key" for the HashJoin node, and "Hash Key" for the Hash node.
Perhaps the latter should be 'Inner Hash Key' - while that's currently a bit confusing because of Hash's subtree being the outer tree, it'd reduce changes when merging Hash into HashJoin [1], and it's clearer when looking at the HashJoin node itself.

Here's an example query:

    EXPLAIN (VERBOSE, FORMAT JSON, COSTS OFF)
    SELECT pc.oid::regclass, pc.relkind, pc.relfilenode,
           pc_t.oid::regclass AS toast_rel, pc_t.relfilenode AS toast_relfilenode
    FROM pg_class pc
        LEFT OUTER JOIN pg_class pc_t ON (pc.reltoastrelid = pc_t.oid);

    [
      {
        "Plan": {
          "Node Type": "Hash Join",
          "Parallel Aware": false,
          "Join Type": "Left",
          "Project": ["(pc.oid)::regclass", "pc.relkind", "pc.relfilenode", "(pc_t.oid)::regclass", "pc_t.relfilenode"],
          "Inner Unique": true,
          "Hash Cond": "(pc.reltoastrelid = pc_t.oid)",
          "Outer Hash Key": "pc.reltoastrelid",
          "Plans": [
            {
              "Node Type": "Seq Scan",
              "Parent Relationship": "Outer",
              "Parallel Aware": false,
              "Relation Name": "pg_class",
              "Schema": "pg_catalog",
              "Alias": "pc",
              "Output": ["pc.oid", "pc.relname", "pc.relnamespace", "pc.reltype", "pc.reloftype", "pc.relowner", "pc.relam", "pc.relfilenode", "pc.reltablespace", "pc.relpages", "pc.reltuples", "pc.relallvisible", "pc.reltoastrelid", "pc.relhasindex", "pc.relisshared", "pc.relpersistence", "pc.relkind", "pc.relnatts", "pc.relchecks", "pc.relhasrules", "pc.relhastriggers", "pc.relhassubclass", "pc.relrowsecurity", "pc.relforcerowsecurity", "pc.relispopulated", "pc.relreplident", "pc.relispartition", "pc.relrewrite", "pc.relfrozenxid", "pc.relminmxid", "pc.relacl", "pc.reloptions", "pc.relpartbound"]
            },
            {
              "Node Type": "Hash",
              "Parent Relationship": "Inner",
              "Parallel Aware": false,
              "Output": ["pc_t.oid", "pc_t.relfilenode"],
              "Hash Key": "pc_t.oid",
              "Plans": [
                {
                  "Node Type": "Seq Scan",
                  "Parent Relationship": "Outer",
                  "Parallel Aware": false,
                  "Relation Name": "pg_class",
                  "Schema": "pg_catalog",
                  "Alias": "pc_t",
                  "Project": ["pc_t.oid", "pc_t.relfilenode"]
                }
              ]
            }
          ]
        }
      }
    ]

and in plain text:

    Hash Left Join
      Project: (pc.oid)::regclass, pc.relkind, pc.relfilenode, (pc_t.oid)::regclass, pc_t.relfilenode
      Inner Unique: true
      Hash Cond: (pc.reltoastrelid = pc_t.oid)
      Outer Hash Key: pc.reltoastrelid
      ->  Seq Scan on pg_catalog.pg_class pc
            Output: pc.oid, pc.relname, pc.relnamespace, pc.reltype, pc.reloftype, pc.relowner, pc.relam, pc.relfilenode, pc.reltablespace, pc.relpages, pc.reltuples, pc.relallvisible, pc.reltoastrelid, pc.relhasindex, pc.relisshared, pc.relpersistence, pc.relkind, pc.relnatts, pc.relchecks, pc.relhasrules, pc.relhastriggers, pc.relhassubclass, pc.relrowsecurity, pc.relforcerowsecurity, pc.relispopulated, pc.relreplident, pc.relispartition, pc.relrewrite, pc.relfrozenxid, pc.relminmxid, pc.relacl, pc.reloptions, pc.relpartbound
      ->  Hash
            Output: pc_t.oid, pc_t.relfilenode
            Hash Key: pc_t.oid
            ->  Seq Scan on pg_catalog.pg_class pc_t
                  Project: pc_t.oid, pc_t.relfilenode

which also serves as an example of my previous point about potentially just hiding the 'Output:' line when no projection is done: it's very verbose, without adding much, while hiding that there's actually nothing being done at the SeqScan level.

I've attached a rebased version of the patchset. No changes except for a minor conflict, and removing some already-applied bugfixes.

Greetings,

Andres Freund

[1] https://www.postgresql.org/message-id/20191028231526.wcnwag7lllkra4qt%40alap3.anarazel.de
Attachment
- v2-0001-jit-Instrument-function-purpose-separately-add-tr.patch
- v2-0002-Refactor-explain.c-to-pass-ExprState-down-to-show.patch
- v2-0003-Explain-Differentiate-between-a-node-projecting-o.patch
- v2-0004-Add-EXPLAIN-option-jit_details-showing-per-expres.patch
- v2-0005-jit-explain-remove-backend-lifetime-module-count-.patch
- v2-0006-WIP-explain-Show-per-phase-information-about-aggr.patch
- v2-0007-WIP-explain-Output-hash-keys-in-verbose-mode.patch
- v2-0008-jit-Add-tests.patch
On Mon, Oct 28, 2019 at 5:02 PM Andres Freund <andres@anarazel.de> wrote:
> What I dislike about that is that it basically again is introducing

"again"? Am I missing some history here? I'd love to read up on this if there are mistakes to learn from.

> something that requires either pattern matching on key names (i.e. a key
> of '(.*) JIT' is one that has information about JIT, and the associated
> expression is in key $1), or knowing all the potential keys an
> expression could be in.

That still seems less awkward than having to handle a Filter field that's either a scalar or a group. Most current EXPLAIN options just add additional fields to the structured plan instead of modifying it, no? If that output is better enough, though, maybe we should just always make Filter a group and go with the breaking change? If tooling authors need to treat this case specially anyway, might as well evolve the format.

> Another alternative would be to just remove the 'Output' line when a
> node doesn't project - it can't really carry meaning in those cases
> anyway?

¯\_(ツ)_/¯ For what it's worth, I certainly wouldn't miss it.
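For what it's worth, the idea floated above of omitting "Output" for non-projecting nodes seems straightforward for tools to cope with: treat a missing "Output" as pass-through of the child's output. A minimal sketch, assuming a hypothetical plan shape where non-projecting nodes simply lack the key:

```python
# Hypothetical JSON plan fragment: the Hash node doesn't project,
# so (under the proposal) its "Output" key is simply absent.
plan = {
    "Node Type": "Hash",
    "Plans": [
        {
            "Node Type": "Seq Scan",
            "Output": ["pc_t.oid", "pc_t.relfilenode"],
        }
    ],
}

def effective_output(node):
    """Resolve a node's output, recursing into the single child when
    'Output' is omitted (i.e. the node doesn't project)."""
    if "Output" in node:
        return node["Output"]
    children = node.get("Plans", [])
    if len(children) == 1:
        return effective_output(children[0])
    return None  # this sketch doesn't try to infer for multi-child nodes

out = effective_output(plan)
```

The multi-child case (e.g. Append) would need more thought, but single-child non-projecting nodes cover the verbose cases complained about upthread.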
Hi,

On 2019-11-12 13:42:10 -0800, Maciek Sakrejda wrote:
> On Mon, Oct 28, 2019 at 5:02 PM Andres Freund <andres@anarazel.de> wrote:
> > What I dislike about that is that it basically again is introducing
>
> "again"? Am I missing some history here? I'd love to read up on this
> if there are mistakes to learn from.

I think I was mostly referring to mistakes we've made for the json etc key names. By e.g. having expressions as "Function Call", "Table Function Call", "Filter", "TID Cond", ..., a tool that wants to interpret the output needs awareness of all of these different names, rather than knowing that everything with a sub-group "Expression" has to be an expression.

I.e. instead of

    "Plan": {
      "Node Type": "Seq Scan",
      "Parallel Aware": false,
      "Relation Name": "pg_class",
      "Schema": "pg_catalog",
      "Alias": "pg_class",
      "Startup Cost": 0.00,
      "Total Cost": 17.82,
      "Plan Rows": 385,
      "Plan Width": 68,
      "Output": ["relname", "tableoid"],
      "Filter": "(pg_class.relname <> 'foo'::name)"
    }

we ought to have gone for

    "Plan": {
      "Node Type": "Seq Scan",
      "Parallel Aware": false,
      "Relation Name": "pg_class",
      "Schema": "pg_catalog",
      "Alias": "pg_class",
      "Startup Cost": 0.00,
      "Total Cost": 17.82,
      "Plan Rows": 385,
      "Plan Width": 68,
      "Output": ["relname", "tableoid"],
      "Filter": {"Expression": {"text": "(pg_class.relname <> 'foo'::name)"}}
    }

or something like that. Which'd then make it obvious how to add information about JIT to each expression. Whereas the proposal of the separate key name perpetuates the messiness...

> > something that requires either pattern matching on key names (i.e. a key
> > of '(.*) JIT' is one that has information about JIT, and the associated
> > expression is in key $1), or knowing all the potential keys an
> > expression could be in.
>
> That still seems less awkward than having to handle a Filter field
> that's either scalar or a group.
Yea, it's a sucky option :(

> Most current EXPLAIN options just add
> additional fields to the structured plan instead of modifying it, no?
> If that output is better enough, though, maybe we should just always
> make Filter a group and go with the breaking change? If tooling
> authors need to treat this case specially anyway, might as well evolve
> the format.

Yea, maybe that's the right thing to do. Would be nice to have some more input...

Greetings,

Andres Freund
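The consumer-side payoff of the {"Filter": {"Expression": ...}} shape sketched in the message above is that a tool can collect every expression in a plan tree generically, without enumerating key names. A sketch, using a hypothetical plan fragment in that proposed shape (the "JIT" sub-field is likewise hypothetical):

```python
# Hypothetical plan tree where every expression-valued field wraps its
# content in an "Expression" sub-group, as proposed above.
plan = {
    "Node Type": "Seq Scan",
    "Filter": {"Expression": {"text": "(pg_class.relname <> 'foo'::name)"}},
    "Plans": [
        {
            "Node Type": "Index Scan",
            "Index Cond": {"Expression": {"text": "(t.id = 42)",
                                          "JIT": "evalexpr_0_1"}},
        }
    ],
}

def collect_expressions(node, found=None):
    """Walk the plan tree, gathering (key, expression dict) pairs.

    No knowledge of "Filter", "Index Cond", "Hash Cond", etc. is needed:
    anything with an "Expression" sub-group is an expression.
    """
    if found is None:
        found = []
    for key, value in node.items():
        if isinstance(value, dict) and "Expression" in value:
            found.append((key, value["Expression"]))
    for child in node.get("Plans", []):
        collect_expressions(child, found)
    return found

exprs = collect_expressions(plan)
```

Under the current format, the same tool would instead need a hard-coded list of every expression key name, which is the maintenance burden being debated here.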
On Mon, Oct 28, 2019 at 7:21 PM Andres Freund <andres@anarazel.de> wrote:
> Because that's the normal way to represent something non-existing for
> formats like json? There's a lot of information we show always for !text
> format, even if not really applicable to the context (e.g. Triggers for
> select statements). I think there's an argument to be made to deviate in
> this case, but I don't think it's obvious.

I've consistently been of the view that anyone who thinks that the FORMAT option should affect what information gets displayed doesn't understand the meaning of the word "format." And I still feel that way.

I also think that conditionally renaming "Output" to "Project" is a super-bad idea. The idea of a format like this is that the "keys" stay constant and the values change. If you need to tell people more, you add more keys.

I also think that making the Filter field a group conditionally is a bad idea, for similar reasons. But making it always be a group doesn't necessarily seem like a bad idea. I think, though, that you could handle this in other ways, like by suffixing existing keys. e.g. if you've got Index-Qual and Filter, just do Index-Qual-JIT and Filter-JIT and call it good.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 2019-11-13 14:29:07 -0500, Robert Haas wrote:
> On Mon, Oct 28, 2019 at 7:21 PM Andres Freund <andres@anarazel.de> wrote:
> > Because that's the normal way to represent something non-existing for
> > formats like json? There's a lot of information we show always for !text
> > format, even if not really applicable to the context (e.g. Triggers for
> > select statements). I think there's an argument to be made to deviate in
> > this case, but I don't think it's obvious.
>
> I've consistently been of the view that anyone who thinks that the
> FORMAT option should affect what information gets displayed doesn't
> understand the meaning of the word "format." And I still feel that
> way.

Well, it's not been that way since the format option was added, so ...

> I also think that conditionally renaming "Output" to "Project" is a
> super-bad idea. The idea of a format like this is that the "keys" stay
> constant and the values change. If you need to tell people more, you
> add more keys.

Yea, I don't like the compat break either. But I'm not so convinced that just continuing to collect cruft because of compatibility is worth it - I just don't see an all that high reliance interest in explain output.

I think adding a new key is somewhat OK for !text, but for text that doesn't seem like an easy solution?

I kind of like my idea somewhere downthread, in a reply to Maciek, of simply not listing "Output" for nodes that don't project. While that's still a format break, it seems that tools already need to deal with "Output" not being present?

> I also think that making the Filter field a group conditionally is a
> bad idea, for similar reasons.

Oh yea, it's utterly terrible (I called it crappy in my email :)).

> But making it always be a group doesn't necessarily seem like a bad
> idea. I think, though, that you could handle this in other ways, like
> by suffixing existing keys. e.g. if you've got Index-Qual and Filter,
> just do Index-Qual-JIT and Filter-JIT and call it good.
Maciek suggested the same. But to me it seems going down that path will make the format harder and harder to understand. So I think I'd rather break compat here and go for a group.

Personally I think the group naming choices for explain make the !text outputs much less useful than they could be - we basically force every tool to understand all possible keys to make sense of formatted output. Instead of something like 'Filter: {"Qual": {"text": "...", "JIT": ...}}', where a tool only needs to understand that everything that has a "Qual" inside is a filtering expression, and everything that has a "Project" is a projecting type of expression, a tool needs to know about "Inner Cond", "Order By", "Filter", "Recheck Cond", "TID Cond", "Join Filter", "Merge Cond", "Hash Cond", "One-Time Filter", ...

Greetings,

Andres Freund
On Wed, Nov 13, 2019 at 3:03 PM Andres Freund <andres@anarazel.de> wrote:
> Well, it's not been that way since the format option was added, so ...

It was pretty close in the original version, but people keep trying to be clever.

> > I also think that conditionally renaming "Output" to "Project" is a
> > super-bad idea. The idea of a format like this is that the "keys" stay
> > constant and the values change. If you need to tell people more, you
> > add more keys.
>
> Yea, I don't like the compat break either. But I'm not so convinced
> that just continuing to collect cruft because of compatibility is worth
> it - I just don't see an all that high reliance interest for explain
> output.
>
> I think adding a new key is somewhat ok for !text, but for text that
> doesn't seem like an easy solution?
>
> I kind of like my idea somewhere downthread, in a reply to Maciek, of
> simply not listing "Output" for nodes that don't project. While that's
> still a format break, it seems that tools already need to deal with
> "Output" not being present?

Yes, I think leaving out Output for a node that doesn't project would be fine, as long as we're consistent about it.

> > But making it always be a group doesn't necessarily seem like a bad
> > idea. I think, though, that you could handle this in other ways, like
> > by suffixing existing keys. e.g. if you've got Index-Qual and Filter,
> > just do Index-Qual-JIT and Filter-JIT and call it good.
>
> Maciek suggested the same. But to me it seems going down that way will
> make the format harder and harder to understand? So I think I'd rather
> break compat here, and go for a group.

Personally, I don't care very much about backward-compatibility, or about how hard it is for tools to parse. I want it to be possible, but if it takes a little extra effort, so be it. My main concern is having the text output look good to human beings, because that is the primary format and they are the primary consumers.
> Personally I think the group naming choice for explain makes the
> !text outputs much less useful than they could be - we basically force
> every tool to understand all possible keys, to make sense of formatted
> output. Instead of something like 'Filter: {"Qual":{"text" : "...",
> "JIT": ...}' where a tool only needed to understand that everything that
> has a "Qual" inside is a filtering expression, everything that has a
> "Project" is a projecting type of expression, ... a tool needs to know
> about "Inner Cond", "Order By", "Filter", "Recheck Cond", "TID Cond",
> "Join Filter", "Merge Cond", "Hash Cond", "One-Time Filter", ...

It's not that long of a list, and I don't know of a tool that tries to do something in particular with all of those types of things anyway.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Nov 15, 2019 at 5:49 AM Robert Haas <robertmhaas@gmail.com> wrote:
> Personally, I don't care very much about backward-compatibility, or
> about how hard it is for tools to parse. I want it to be possible, but
> if it takes a little extra effort, so be it.

I think these are two separate issues. I agree on backward-compatibility (especially if we can embed a server version in structured EXPLAIN output to make it easier for tools to track format differences), but not caring how hard it is for tools to parse? What's the point of structured formats, then?

> My main concern is having
> the text output look good to human beings, because that is the primary
> format and they are the primary consumers.

Structured output is also for human beings, albeit indirectly. That text is the primary format may be more of a reflection of the difficulty of building and integrating EXPLAIN tools than of its inherent superiority (that said, I'll concede it's a concise and elegant format for what it does). What if psql supported an EXPLAINER like it does EDITOR?

For what it's worth, after thinking about this a bit, I'd like to see structured EXPLAIN evolve into a more consistent format, even if it means breaking changes (and I do think a version specifier at the root of the plan would make this easier).
Maciek Sakrejda <m.sakrejda@gmail.com> writes:
> On Fri, Nov 15, 2019 at 5:49 AM Robert Haas <robertmhaas@gmail.com> wrote:
>> Personally, I don't care very much about backward-compatibility, or
>> about how hard it is for tools to parse. I want it to be possible, but
>> if it takes a little extra effort, so be it.

> I think these are two separate issues. I agree on
> backward-compatibility (especially if we can embed a server version in
> structured EXPLAIN output to make it easier for tools to track format
> differences), but not caring how hard it is for tools to parse? What's
> the point of structured formats, then?

I'd not been paying any attention to this thread, but Andres just referenced it in another discussion, so I went back and read it. Here's my two cents:

* I agree with Robert that conditionally changing "Output" to "Project" is an absolutely horrid idea. That will break every tool that looks at this stuff, and it just flies in the face of the design principle that the output schema should be stable, and it'll be a long-term pain-in-the-rear for regression test back-patching, and it will confuse users much more than it will help them. The other idea of suppressing "Output" in cases where no projection is happening might be all right, but only in text format, where we don't worry about schema stability. Another idea perhaps is to emit "Output: all columns" (in text format; less sure what to do in structured formats).

* In the structured formats, I think it should be okay to convert expression-ish fields from being raw strings to being {Expression} sub-nodes with the raw string as one field. Aside from making it easy to inject JIT info, that would also open the door to someday showing expressions in some more-parse-able format than a string, since other representations could also be added as new fields. (I have a vague recollection of wanting a list of all the Vars used in an expression, for example.)
* Unfortunately that does nothing for the problem of how to show per-expression JIT info in text format. Maybe we just shouldn't. I do not think that the readability-vs-usefulness tradeoff is going to be all that good there, anyway. Certainly for testing purposes it's going to be more useful to examine portions of a structured output.

* I'm not on board with the idea of adding a version number to the structured output formats. In the first place, it's too late, since we didn't leave room for one to begin with. In the second, an overall version number just isn't very helpful for this sort of problem. If a tool sees a version number higher than the latest thing it knows, what's it supposed to do, just fail? In practice it could still extract an awful lot of info, so that really isn't a desirable answer. It's better if the data structure is such that a tool can understand that some sub-part of the data is something it can't interpret, and just ignore that part. (This is more or less the same design principle that PNG image format was built on, FWIW.) Adding fields to an existing node type easily meets that requirement, as does inventing new sub-node types, and that's all that we've done so far. But I think that replacing a scalar field value with a sub-node probably works too (at least for well-written tools), so the expression change suggested above should be OK.

			regards, tom lane
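[Editor's note: the forward-compatibility principle above (tools should ignore sub-parts they can't interpret, and a well-written tool can tolerate a scalar field becoming a sub-node) can be sketched from the consumer side. This is a hypothetical illustration, not actual PostgreSQL or tool code; the "Expr" key name is borrowed from Andres's WIP example, and the `expr_text` helper is invented here:]

```python
import json

def expr_text(value):
    # Accept an expression field in either shape: a plain string (the
    # current EXPLAIN JSON format) or an {Expression}-style sub-node
    # (the proposed format), silently ignoring sub-keys we don't know.
    if isinstance(value, str):
        return value
    if isinstance(value, dict):
        # "Expr" as the raw-string key is an assumption from the WIP patch.
        return value.get("Expr")
    return None

# Current format: a bare string.
old = json.loads('{"Filter": "(a.x > 1)"}')
# Proposed format: a sub-node carrying extra, possibly unknown, keys.
new = json.loads('{"Filter": {"Expr": "(a.x > 1)", '
                 '"JIT-Expr": "evalexpr_0_2", "Some-Future-Key": 42}}')

assert expr_text(old["Filter"]) == expr_text(new["Filter"]) == "(a.x > 1)"
```

A tool written this way keeps working when new sub-keys appear, which is the PNG-style property Tom describes.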
Hi,

On 2020-01-27 12:15:53 -0500, Tom Lane wrote:
> Maciek Sakrejda <m.sakrejda@gmail.com> writes:
> > On Fri, Nov 15, 2019 at 5:49 AM Robert Haas <robertmhaas@gmail.com> wrote:
> >> Personally, I don't care very much about backward-compatibility, or
> >> about how hard it is for tools to parse. I want it to be possible, but
> >> if it takes a little extra effort, so be it.
>
> > I think these are two separate issues. I agree on
> > backward-compatibility (especially if we can embed a server version in
> > structured EXPLAIN output to make it easier for tools to track format
> > differences), but not caring how hard it is for tools to parse? What's
> > the point of structured formats, then?
>
> I'd not been paying any attention to this thread, but Andres just
> referenced it in another discussion, so I went back and read it.
> Here's my two cents:
>
> * I agree with Robert that conditionally changing "Output" to "Project" is
> an absolutely horrid idea.

Yea, I think I'm convinced on that front. I never liked the idea, and the opposition has been pretty unanimous...

> That will break every tool that looks at this stuff, and it just flies
> in the face of the design principle that the output schema should be
> stable, and it'll be a long term pain-in-the-rear for regression test
> back-patching, and it will confuse users much more than it will help
> them. The other idea of suppressing "Output" in cases where no
> projection is happening might be all right, but only in text format
> where we don't worry about schema stability. Another idea perhaps is
> to emit "Output: all columns" (in text formats, less sure what to do
> in structured formats).

I think I like the "all columns" idea. Not what I'd do on a green field, but...

If we were just dealing with the XML format, we could just add a

    <Projecting>True/False</Projecting>

to the current

    <Output>
      <Item>a</Item>
      <Item>b</Item>
      ...
    </Output>

and it'd make plenty of sense.
But for json's

    "Output": ["a", "b"]

and yaml's

    Output:
      - "a"
      - "b"

that's not an option as far as I can tell. Not sure what to do about that.

> * In the structured formats, I think it should be okay to convert
> expression-ish fields from being raw strings to being {Expression}
> sub-nodes with the raw string as one field. Aside from making it easy
> to inject JIT info, that would also open the door to someday showing
> expressions in some more-parse-able format than a string, since other
> representations could also be added as new fields. (I have a vague
> recollection of wanting a list of all the Vars used in an expression,
> for example.)

Cool. Being extendable seems like a good direction. That's what I primarily dislike about the various work-arounds for how to associate information about JIT by a "related" name.

That'd e.g. open the door to have both a normalized and an original expression in the explain output. Which would be quite valuable for some monitoring tools.

> * Unfortunately that does nothing for the problem of how to show
> per-expression JIT info in text format. Maybe we just shouldn't.
> I do not think that the readability-vs-usefulness tradeoff is going
> to be all that good there, anyway. Certainly for testing purposes
> it's going to be more useful to examine portions of a structured output.

I think I can live with that; I don't think it's going to be a very commonly used option. It's basically useful for regression tests, JIT improvements, and people that want to see whether they can change their query / schema to make better use of JIT - the latter category won't be many, I think.

Since this is going to be a default-off option anyway, I don't think we'd need to be as concerned with compatibility. But even leaving compatibility aside, it's not that clear how to best attach information in the current text format, without being confusing.

Greetings,

Andres Freund
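[Editor's note: the problem Andres describes, that a bare JSON array leaves no room for an extra flag, is visible from the consumer side. The sketch below is hypothetical: the object-with-"Items"/"Projecting" shape is not a proposal from the thread, just an illustration of why changing the shape of "Output" forces every tool to handle both forms:]

```python
import json

def parse_output(node):
    # Normalize the "Output" field of a plan node to (columns, projecting).
    out = node.get("Output")
    if out is None:
        # Key suppressed entirely: the node doesn't project.
        return [], False
    if isinstance(out, list):
        # Current format: a bare array of column strings. There is
        # nowhere to hang a Projecting flag on this shape.
        return out, True
    # Hypothetical object shape that could carry a flag beside the items.
    return out.get("Items", []), out.get("Projecting", True)

current = json.loads('{"Output": ["a", "b"]}')
hypothetical = json.loads('{"Output": {"Items": ["a", "b"], "Projecting": false}}')

assert parse_output(current) == (["a", "b"], True)
assert parse_output(hypothetical) == (["a", "b"], False)
assert parse_output({}) == ([], False)
```

The branching here is exactly the per-key special-casing the thread is trying to avoid imposing on tool authors.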
On Fri, Nov 15, 2019 at 8:05 PM Maciek Sakrejda <m.sakrejda@gmail.com> wrote:
> On Fri, Nov 15, 2019 at 5:49 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > Personally, I don't care very much about backward-compatibility, or
> > about how hard it is for tools to parse. I want it to be possible, but
> > if it takes a little extra effort, so be it.
>
> I think these are two separate issues. I agree on
> backward-compatibility (especially if we can embed a server version in
> structured EXPLAIN output to make it easier for tools to track format
> differences), but not caring how hard it is for tools to parse? What's
> the point of structured formats, then?

To make the data easy to parse. :-)

I mean, it's clear that, on the one hand, having a format like JSON that, as has recently been pointed out elsewhere, is parsable by a wide variety of tools, is advantageous. However, I don't think it really matters whether somebody's got to look at a tag called Flump and match it up with the data in another tag called JIT-Flump, or whether there's a Flump group that has RegularStuff and JIT tags inside of it. There's just not much difference in the effort involved. Being able to parse the JSON or XML using generic code is enough of a win that the details shouldn't matter that much.

I think if you were going to complain about the limitations of our current EXPLAIN output format, it'd make a lot more sense to focus on the way we output expressions. If you want to mechanically parse one of those expressions and figure out what it's doing - what functions or operators are involved, and to what they are being applied - you are probably out of luck altogether, and you are certainly not going to have an easy time of it. I'm not saying we have to solve that problem, but I believe it's a much bigger nuisance than the sort of thing we are talking about here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Jan 27, 2020 at 12:41 PM Andres Freund <andres@anarazel.de> wrote:
> > I do not think that the readability-vs-usefulness tradeoff is going
> > to be all that good there, anyway. Certainly for testing purposes
> > it's going to be more useful to examine portions of a structured output.
>
> I think I can live with that, I don't think it's going to be a very
> commonly used option. It's basically useful for regression tests, JIT
> improvements, and people that want to see whether they can change their
> query / schema to make better use of JIT - the latter category won't be
> many, I think.

I intensely dislike having information that we can't show in the text format, or really, that we can't show in every format. I might be outvoted, but I stand by that position.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
>>> I do not think that the readability-vs-usefulness tradeoff is going
>>> to be all that good there, anyway. Certainly for testing purposes
>>> it's going to be more useful to examine portions of a structured output.

> I intensely dislike having information that we can't show in the text
> format, or really, that we can't show in every format.

Well, if it's relegated to a "jit = detail" option or some such, the readability objection could be overcome. But I'm still not clear on how you'd physically wedge it into the output, at least not in a way that matches up with the proposal that non-text modes handle this stuff by producing sub-nodes for the existing types of expression fields.

			regards, tom lane
On Mon, Jan 27, 2020 at 11:01 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Nov 15, 2019 at 8:05 PM Maciek Sakrejda <m.sakrejda@gmail.com> wrote:
> > On Fri, Nov 15, 2019 at 5:49 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > > Personally, I don't care very much about backward-compatibility, or
> > > about how hard it is for tools to parse. I want it to be possible, but
> > > if it takes a little extra effort, so be it.
> >
> > I think these are two separate issues. I agree on
> > backward-compatibility (especially if we can embed a server version in
> > structured EXPLAIN output to make it easier for tools to track format
> > differences), but not caring how hard it is for tools to parse? What's
> > the point of structured formats, then?
>
> To make the data easy to parse. :-)
>
> I mean, it's clear that, on the one hand, having a format like JSON
> that, as has recently been pointed out elsewhere, is parsable by a
> wide variety of tools, is advantageous. However, I don't think it
> really matters whether somebody's got to look at a tag called
> Flump and match it up with the data in another tag called JIT-Flump,
> or whether there's a Flump group that has RegularStuff and JIT tags
> inside of it. There's just not much difference in the effort involved.
> Being able to parse the JSON or XML using generic code is enough of a
> win that the details shouldn't matter that much.

Having a structured EXPLAIN schema that's semantically consistent is still valuable. At the end of the day, it's humans who are writing the tools that consume that structured output. Given the sparse structured EXPLAIN schema documentation, as someone who currently works on EXPLAIN tooling, I'd prefer a trend toward consistency at the expense of backward compatibility. (Of course, we should avoid gratuitous changes.) But I take back the version number suggestion after reading Tom's response; that was naïve.
> I think if you were going to complain about the limitations of our
> current EXPLAIN output format, it'd make a lot more sense to focus on
> the way we output expressions.

That would be nice to have, but for what it's worth, my main complaint would be about documentation (especially around structured formats). The "Using EXPLAIN" section covers the basics, but understanding what node types exist, and what fields show up for what nodes and what they mean--that seems to be a big missing piece (I don't feel entitled to this documentation; as a structured format consumer, I'm just pointing out a deficiency). Contrast that with the great wire protocol documentation. In some ways it's easier to work on native drivers than on EXPLAIN tooling because the docs are thorough and well organized.
On Mon, Jan 27, 2020 at 4:18 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> >>> I do not think that the readability-vs-usefulness tradeoff is going
> >>> to be all that good there, anyway. Certainly for testing purposes
> >>> it's going to be more useful to examine portions of a structured output.
>
> > I intensely dislike having information that we can't show in the text
> > format, or really, that we can't show in every format.
>
> Well, if it's relegated to a "jit = detail" option or some such,
> the readability objection could be overcome. But I'm still not clear
> on how you'd physically wedge it into the output, at least not in a way
> that matches up with the proposal that non-text modes handle this stuff
> by producing sub-nodes for the existing types of expression fields.

Well, remember that the text format was the original format. The whole idea of "groups" was an anachronism that I imposed on the text format to make it possible to add other formats. It wasn't entirely natural, because the text format basically indicated nesting by indentation, and that wasn't going to work for XML or JSON. The text format also felt free to repeat elements and assume the reader would figure it out; repeating elements is OK in XML in general, but in JSON it's only OK if the surrounding context is an array rather than an object.

Anyway, the point is that I (necessarily) started with whatever we had and found a way to fit it into a structure. It seems like it ought to be possible to go the other direction also, and figure out how to make the structured data look OK as text.
Here's Andres's original example:

    "Filter": {
        "Expr": "(lineitem.l_shipdate <= '1998-09-18 00:00:00'::timestamp without time zone)",
        "JIT-Expr": "evalexpr_0_2",
        "JIT-Deform-Scan": "deform_0_3",
        "JIT-Deform-Outer": null,
        "JIT-Deform-Inner": null
    }

Right now we show:

    Filter: (lineitem.l_shipdate <= '1998-09-18 00:00:00'::timestamp without time zone)

Andres proposed:

    Filter: (lineitem.l_shipdate <= '1998-09-18 00:00:00'::timestamp without time zone); JIT-Expr: evalexpr_0_2, JIT-Deform-Scan: deform_0_3

That's not ideal because it's all on one line, but that could be changed:

    Filter: (lineitem.l_shipdate <= '1998-09-18 00:00:00'::timestamp without time zone)
      JIT-Expr: evalexpr_0_2
      JIT-Deform-Scan: deform_0_3

I would propose either including null all the time or omitting it all the time, so that we would either change the JSON output to...

    "Filter": {
        "Expr": "(lineitem.l_shipdate <= '1998-09-18 00:00:00'::timestamp without time zone)",
        "JIT-Expr": "evalexpr_0_2",
        "JIT-Deform-Scan": "deform_0_3"
    }

Or the text output to:

    Filter: (lineitem.l_shipdate <= '1998-09-18 00:00:00'::timestamp without time zone)
      JIT-Expr: evalexpr_0_2
      JIT-Deform-Scan: deform_0_3
      JIT-Deform-Outer: null
      JIT-Deform-Inner: null

You could argue that this is inconsistent, because the JSON format shows a bunch of keys that are essentially parallel, while this text format makes the Expr key essentially the primary value and the others secondary. But since the text format is for human beings, and since human beings are likely to find the Expr key to be the primary piece of information, maybe that's totally fine.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
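[Editor's note: the mapping Robert sketches, from the structured Filter group to a multi-line text rendering with null entries omitted, can be prototyped in a few lines. This is a hypothetical consumer-side sketch, not EXPLAIN's actual formatting code; the `render_filter_text` helper and its exact indentation are invented here, with key names taken from Andres's example:]

```python
def render_filter_text(group):
    # Render a structured "Filter" group in the multi-line text layout
    # discussed above: expression first, JIT detail keys indented beneath
    # it, null (None) entries omitted, which is one of Robert's two
    # proposed conventions.
    lines = ["Filter: " + group["Expr"]]
    for key, value in group.items():
        if key != "Expr" and value is not None:
            lines.append("  {}: {}".format(key, value))
    return "\n".join(lines)

group = {
    "Expr": "(lineitem.l_shipdate <= '1998-09-18 00:00:00'::timestamp without time zone)",
    "JIT-Expr": "evalexpr_0_2",
    "JIT-Deform-Scan": "deform_0_3",
    "JIT-Deform-Outer": None,
    "JIT-Deform-Inner": None,
}

print(render_filter_text(group))
```

Under these assumptions the group renders as three lines: the Filter expression, then the JIT-Expr and JIT-Deform-Scan details indented below it, with the two null deform entries dropped.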