Thread: Generating code for query jumbling through gen_node_support.pl
Hi all, This thread is a follow-up of the recent discussion about query jumbling with DDL statements, where the conclusion was that we'd want to generate all this code automatically for all the nodes: https://www.postgresql.org/message-id/36e5bffe-e989-194f-85c8-06e7bc88e6f7@amazon.com What this patch allows to do it to compute the same query ID for utility statements using their parsed Node state instead of their string, meaning that things like "BEGIN", "bEGIN" or "begin" would be treated the same, for example. But the main idea is not only that. I have implemented that as of the attached, where the following things are done: - queryjumble.c is moved to src/backend/nodes/, to stick with the other things for node equal/read/write/copy, renamed to jumblefuncs.c. - gen_node_support.c is extended to generate the functions and the switch for the jumbling. There are a few exceptions, as of the Lists and RangeTblEntry to do the jumbling consistently. - Two pg_node_attr() are added in consistency with the existing ones: no_jumble to discard completely a node from the the query jumbling and jumble_ignore to discard one field from the jumble. The patch is in a rather good shape, passes check-world and the CI, but there are a few things that need to be discussed IMO. Things could be perhaps divided in more patches, now the areas touched are quite different so it did not look like a big deal to me as the changes touch different areas and are straight-forward. The location of the Nodes is quite invasive because we only care about that for T_Const now in the query jumbling, and this could be compensated with a third pg_node_attr() that tracks for the "int location" of a Node whether it should participate in the jumbling or not. There is also an argument where we would want to not include by default new fields added to a Node, but that would require added more pg_node_attr() than what's done here. Note that the plan is to extend the normalization to some other parts of the Nodes, like CALL and SET as mentioned on the other thread. I have done nothing about that yet but doing so can be done in a few lines with the facility presented here (aka just add a location field). Hence, the normalization is consistent with the existing queryjumble.c for the fields and the nodes processed. In this patch, things are done so as the query ID is not computed anymore from the query string but from the Query. I still need to study the performance impact of that with short queries. If things prove to be noticeable in some cases, this stuff could be switched to use a new GUC where we could have a code path for the computation of utilityStmt using its string as a fallback. I am not sure that I want to enter in this level of complications, though, to keep things simple, but that's yet to be done. A bit more could be cut but pg_ident got in the way.. There are also a few things for pg_stat_statements where a query ID of 0 can be implied for utility statements in some cases. Generating this code leads to an overall removal of code as what queryjumble.c is generated automatically: 13 files changed, 901 insertions(+), 1113 deletions(-) I am adding that to the next commit fest. Thoughts? -- Michael
Attachment
On Wed, 7 Dec 2022 at 13:27, Michael Paquier <michael@paquier.xyz> wrote: > > Hi all, > > This thread is a follow-up of the recent discussion about query > jumbling with DDL statements, where the conclusion was that we'd want > to generate all this code automatically for all the nodes: > https://www.postgresql.org/message-id/36e5bffe-e989-194f-85c8-06e7bc88e6f7@amazon.com > > What this patch allows to do it to compute the same query ID for > utility statements using their parsed Node state instead of their > string, meaning that things like "BEGIN", "bEGIN" or "begin" would be > treated the same, for example. But the main idea is not only that. > > I have implemented that as of the attached, where the following things > are done: > - queryjumble.c is moved to src/backend/nodes/, to stick with the > other things for node equal/read/write/copy, renamed to > jumblefuncs.c. > - gen_node_support.c is extended to generate the functions and the > switch for the jumbling. There are a few exceptions, as of the Lists > and RangeTblEntry to do the jumbling consistently. > - Two pg_node_attr() are added in consistency with the existing ones: > no_jumble to discard completely a node from the the query jumbling > and jumble_ignore to discard one field from the jumble. > > The patch is in a rather good shape, passes check-world and the CI, > but there are a few things that need to be discussed IMO. Things > could be perhaps divided in more patches, now the areas touched are > quite different so it did not look like a big deal to me as the > changes touch different areas and are straight-forward. > > The location of the Nodes is quite invasive because we only care about > that for T_Const now in the query jumbling, and this could be > compensated with a third pg_node_attr() that tracks for the "int > location" of a Node whether it should participate in the jumbling or > not. There is also an argument where we would want to not include by > default new fields added to a Node, but that would require added more > pg_node_attr() than what's done here. > > Note that the plan is to extend the normalization to some other parts > of the Nodes, like CALL and SET as mentioned on the other thread. I > have done nothing about that yet but doing so can be done in a few > lines with the facility presented here (aka just add a location > field). Hence, the normalization is consistent with the existing > queryjumble.c for the fields and the nodes processed. > > In this patch, things are done so as the query ID is not computed > anymore from the query string but from the Query. I still need to > study the performance impact of that with short queries. If things > prove to be noticeable in some cases, this stuff could be switched to > use a new GUC where we could have a code path for the computation of > utilityStmt using its string as a fallback. I am not sure that I want > to enter in this level of complications, though, to keep things > simple, but that's yet to be done. > > A bit more could be cut but pg_ident got in the way.. There are also > a few things for pg_stat_statements where a query ID of 0 can be > implied for utility statements in some cases. > > Generating this code leads to an overall removal of code as what > queryjumble.c is generated automatically: > 13 files changed, 901 insertions(+), 1113 deletions(-) > > I am adding that to the next commit fest. The patch does not apply on top of HEAD as in [1], please post a rebased patch: === Applying patches on top of PostgreSQL commit ID b82557ecc2ebbf649142740a1c5ce8d19089f620 === === applying patch ./0001-Support-for-automated-query-jumble-with-all-Nodes.patch ... patching file src/backend/utils/misc/queryjumble.c Hunk #1 FAILED at 1. Not deleting file src/backend/utils/misc/queryjumble.c as content differs from patch 1 out of 1 hunk FAILED -- saving rejects to file src/backend/utils/misc/queryjumble.c.rej [1] - http://cfbot.cputube.org/patch_41_4047.log Regards, Vignesh
On 07.12.22 08:56, Michael Paquier wrote: > The location of the Nodes is quite invasive because we only care about > that for T_Const now in the query jumbling, and this could be > compensated with a third pg_node_attr() that tracks for the "int > location" of a Node whether it should participate in the jumbling or > not. The generation script already has a way to identify location fields, by ($t eq 'int' && $f =~ 'location$'), so you could use that as well. > There is also an argument where we would want to not include by > default new fields added to a Node, but that would require added more > pg_node_attr() than what's done here. I'm concerned about the large number of additional field annotations this adds. We have been careful so far to document the use of each attribute, e.g., *why* does a field not need to be copied etc. This patch adds dozens and dozens of annotations without any explanation at all. Now, the code this replaces also has no documentation, but maybe this is the time to add some.
On Sat, Jan 07, 2023 at 07:37:49AM +0100, Peter Eisentraut wrote: > The generation script already has a way to identify location fields, by ($t > eq 'int' && $f =~ 'location$'), so you could use that as well. I recall that some of the nodes may need renames to map with this choice. That could be just one patch on top of the actual feature. > I'm concerned about the large number of additional field annotations this > adds. We have been careful so far to document the use of each attribute, > e.g., *why* does a field not need to be copied etc. This patch adds dozens > and dozens of annotations without any explanation at all. Now, the code > this replaces also has no documentation, but maybe this is the time to add > some. In most cases, the addition of the node marker would be enough to self-explain why they are included, but there is a trend for a lot of the nodes when it comes to collations and typmods where we don't want to see these in the jumbling calculations. I'll look at providing more info for all that. (Note: I'm out for now.) -- Michael
Attachment
On Sat, Jan 07, 2023 at 07:37:49AM +0100, Peter Eisentraut wrote: > On 07.12.22 08:56, Michael Paquier wrote: >> The location of the Nodes is quite invasive because we only care about >> that for T_Const now in the query jumbling, and this could be >> compensated with a third pg_node_attr() that tracks for the "int >> location" of a Node whether it should participate in the jumbling or >> not. > > The generation script already has a way to identify location fields, by ($t > eq 'int' && $f =~ 'location$'), so you could use that as well. I did not recall exactly everything here, but there are two parts to the logic: - gen_node_support.pl uses exactly this condition when scanning the nodes to put the correct macro to mark a location to track, calling down RecordConstLocation(). - Marking a bunch of nodes as jumble_ignore is actually necessary, or we may finish by silencing parts of queries that should be semantically unrelevant to the queries jumbled (ColumnRef is one). Using a "jumble_ignore" flag is equally invasive to using an "jumble_include" flag for each field, I am afraid, as the number of fields in the nodes included in jumbles is pretty equivalent to the number of fields ignored. I tend to prefer the approach of ignoring things though, which is more consistent with the practive for node read, write and copy. Anyway, when it comes to the location, another thing that can be considered here would be to require a node-level flag for the nodes on which we want to track the location. This overlaps a bit with the fields satisfying "($t eq 'int' && $f =~ 'location$')", but it removes most of the code changes like this one as at the end we only care about the location for Const nodes: - int location; /* token location, or -1 if unknown */ + int location pg_node_attr(jumble_ignore); /* token location, or -1 + * if unknown */ I have taken this approach in v2 of the patch, shaving ~100 lines of more code as there is no need to mark all these location fields with "jumble_ignore" anymore, except for Const, of course. >> There is also an argument where we would want to not include by >> default new fields added to a Node, but that would require added more >> pg_node_attr() than what's done here. > > I'm concerned about the large number of additional field annotations this > adds. We have been careful so far to document the use of each attribute, > e.g., *why* does a field not need to be copied etc. This patch adds dozens > and dozens of annotations without any explanation at all. Now, the code > this replaces also has no documentation, but maybe this is the time to add > some. That's fair, though it is not doing to buy us much to update all the nodes with similar small comments, as well. As far as I know, there are basiscally three things here: typmods, collation information, and internal data of the nodes stored during parse-analyze. I have added more documentation to track what looks like the most relevant areas. I have begun running some performance tests with this stuff and HEAD to see if this leads to any difference in the query ID compilation (compute_query_id = on, on scissors roughly) with a simple set of short commands (like BEGIN/COMMIT) or longer ones, and I am seeing a speedup trend actually (?). I still need to think more about a set of tests here, but I think that micro-benchmarking of JumbleQuery() is the most adapted approach to minimize the noise, with a few nodes of various sizes (Const, Query, ColumnRef, anything..). Thoughts? -- Michael
Attachment
On 13.01.23 08:54, Michael Paquier wrote: > Using a "jumble_ignore" flag is equally invasive to using an > "jumble_include" flag for each field, I am afraid, as the number of > fields in the nodes included in jumbles is pretty equivalent to the > number of fields ignored. I tend to prefer the approach of ignoring > things though, which is more consistent with the practive for node > read, write and copy. I concur that jumble_ignore is better. I suppose you placed the jumble_ignore markers to maintain parity with the existing code, but I think that some the markers are actually wrong and are just errors of omission in the existing code (such as Query.override, to pick a random example). The way you have structured this would allow us to find and analyze such errors better. > Anyway, when it comes to the location, another thing that can be > considered here would be to require a node-level flag for the nodes on > which we want to track the location. This overlaps a bit with the > fields satisfying "($t eq 'int' && $f =~ 'location$')", but it removes > most of the code changes like this one as at the end we only care > about the location for Const nodes: > - int location; /* token location, or -1 if unknown */ > + int location pg_node_attr(jumble_ignore); /* token location, or -1 > + * if unknown */ > > I have taken this approach in v2 of the patch, shaving ~100 lines of > more code as there is no need to mark all these location fields with > "jumble_ignore" anymore, except for Const, of course. I don't understand why you chose that when we already have an established way. This would just make the jumble annotations inconsistent with the other annotations. I also have two suggestions to make this patch more palatable: 1. Make a separate patch to reformat long comments, like 835d476fd21bcfb60b055941dee8c3d9559af14c. 2. Make a separate patch to move the jumble support to src/backend/nodes/jumblefuncs.c. (It was probably a mistake that it wasn't there to begin with, and some of the errors of omission alluded to above are probably caused by it having been hidden away in the wrong place.)
On Mon, Jan 16, 2023 at 03:13:35PM +0100, Peter Eisentraut wrote: > On 13.01.23 08:54, Michael Paquier wrote: >> Using a "jumble_ignore" flag is equally invasive to using an >> "jumble_include" flag for each field, I am afraid, as the number of >> fields in the nodes included in jumbles is pretty equivalent to the >> number of fields ignored. I tend to prefer the approach of ignoring >> things though, which is more consistent with the practive for node >> read, write and copy. > > I concur that jumble_ignore is better. I suppose you placed the > jumble_ignore markers to maintain parity with the existing code, but I think > that some the markers are actually wrong and are just errors of omission in > the existing code (such as Query.override, to pick a random example). The > way you have structured this would allow us to find and analyze such errors > better. Thanks. Yes, I have made an effort of going down to maintain an exact compatibility with the existing code for now. My take is that removing or adding more things into the jumble deserves its own discussion. I think that's a bit better once this code is automated with the nodes, now it would not be difficult either to adjust HEAD depending on that. Only the analysis effort is different. >> Anyway, when it comes to the location, another thing that can be >> considered here would be to require a node-level flag for the nodes on >> which we want to track the location. This overlaps a bit with the >> fields satisfying "($t eq 'int' && $f =~ 'location$')", but it removes >> most of the code changes like this one as at the end we only care >> about the location for Const nodes: >> - int location; /* token location, or -1 if unknown */ >> + int location pg_node_attr(jumble_ignore); /* token location, or -1 >> + * if unknown */ >> >> I have taken this approach in v2 of the patch, shaving ~100 lines of >> more code as there is no need to mark all these location fields with >> "jumble_ignore" anymore, except for Const, of course. > > I don't understand why you chose that when we already have an established > way. This would just make the jumble annotations inconsistent with the > other annotations. Because we don't want to track the location of all the nodes! If we do that, pg_stat_statements would begin to parameterize a lot more things, from column aliases to full contents of IN or WITH clauses. The root point is that we only want to track the jumble location for Const nodes now. In order to do that, there are two approaches: - Mark all the locations with jumble_ignore: more invasive, but it requires only one per-field attribute with "jumble_ignore". This is what v1 did. - Mark only the fields where we want to track the location with a second node type, like "jumble_location". We could restrict that depending on the field type or its name on top of checking the field attribute, whatever. This is what v2 did. Which approach do you prefer? Marking all the locations we don't want with jumble_ignore, or introduce a second attribute with jumble_location. I'd tend to prefer the approach of minimizing the number of node and field attributes, FWIW. Now you have worked on this area previously, so your view may be more adapted than my thinking process The long-term perspective is that I'd like to extend the tracking of the locations to more DDL nodes, like parameters of SET statements or parts of CALL statements. Not to mention that this makes the work of forks easier. This is a separate discussion. > I also have two suggestions to make this patch more palatable: > > 1. Make a separate patch to reformat long comments, like > 835d476fd21bcfb60b055941dee8c3d9559af14c. > > 2. Make a separate patch to move the jumble support to > src/backend/nodes/jumblefuncs.c. (It was probably a mistake that it wasn't > there to begin with, and some of the errors of omission alluded to above are > probably caused by it having been hidden away in the wrong place.) Both suggestions make sense. I'll shape the next series of the patch to do something among these lines. My question about the location tracking greatly influences the first patch (comment reformating) and third patch (automated generation of the code) of the series, so do you have a preference about that? -- Michael
Attachment
On 17.01.23 04:48, Michael Paquier wrote: >>> Anyway, when it comes to the location, another thing that can be >>> considered here would be to require a node-level flag for the nodes on >>> which we want to track the location. This overlaps a bit with the >>> fields satisfying "($t eq 'int' && $f =~ 'location$')", but it removes >>> most of the code changes like this one as at the end we only care >>> about the location for Const nodes: >>> - int location; /* token location, or -1 if unknown */ >>> + int location pg_node_attr(jumble_ignore); /* token location, or -1 >>> + * if unknown */ >>> >>> I have taken this approach in v2 of the patch, shaving ~100 lines of >>> more code as there is no need to mark all these location fields with >>> "jumble_ignore" anymore, except for Const, of course. >> >> I don't understand why you chose that when we already have an established >> way. This would just make the jumble annotations inconsistent with the >> other annotations. > > Because we don't want to track the location of all the nodes! If we > do that, pg_stat_statements would begin to parameterize a lot more > things, from column aliases to full contents of IN or WITH clauses. > The root point is that we only want to track the jumble location for > Const nodes now. In order to do that, there are two approaches: > - Mark all the locations with jumble_ignore: more invasive, but > it requires only one per-field attribute with "jumble_ignore". This > is what v1 did. Ok, I understand now, and I agree with this approach over the opposite. I was confused because the snippet you showed above used "jumble_ignore", but your patch is correct as it uses "jumble_location". That said, the term "jumble" is really weird, because in the sense that we are using it here it means, approximately, "to mix together", "to unify". So what we are doing with the Const nodes is really to *not* jumble the location, but for all other node types we are jumbling the location. At least that is my understanding.
On Tue, Jan 17, 2023 at 08:43:44AM +0100, Peter Eisentraut wrote: > Ok, I understand now, and I agree with this approach over the opposite. I > was confused because the snippet you showed above used "jumble_ignore", but > your patch is correct as it uses "jumble_location". Okay. I'll refresh the patch set so as we have only "jumble_ignore", then, like v1, with preparatory patches for what you mentioned and anything that comes into mind. > That said, the term "jumble" is really weird, because in the sense that we > are using it here it means, approximately, "to mix together", "to unify". > So what we are doing with the Const nodes is really to *not* jumble the > location, but for all other node types we are jumbling the location. At > least that is my understanding. I am quite familiar with this term, FWIW. That's what we've inherited from the days where this has been introduced in pg_stat_statements. -- Michael
Attachment
On Tue, Jan 17, 2023 at 04:52:28PM +0900, Michael Paquier wrote: > On Tue, Jan 17, 2023 at 08:43:44AM +0100, Peter Eisentraut wrote: >> Ok, I understand now, and I agree with this approach over the opposite. I >> was confused because the snippet you showed above used "jumble_ignore", but >> your patch is correct as it uses "jumble_location". > > Okay. I'll refresh the patch set so as we have only "jumble_ignore", > then, like v1, with preparatory patches for what you mentioned and > anything that comes into mind. This is done as of the patch series v3 attached: - 0001 reformats all the comments of the nodes. - 0002 moves the current files for query jumble as of queryjumble.c -> queryjumblefuncs.c and utils/queryjumble.h -> nodes/queryjumble.h. - 0003 is the core feature, where I have done a second pass over the nodes to make sure that things map with HEAD, incorporating the extra docs coming from v2, adding a bit more. >> That said, the term "jumble" is really weird, because in the sense that we >> are using it here it means, approximately, "to mix together", "to unify". >> So what we are doing with the Const nodes is really to *not* jumble the >> location, but for all other node types we are jumbling the location. At >> least that is my understanding. > > I am quite familiar with this term, FWIW. That's what we've inherited > from the days where this has been introduced in pg_stat_statements. I have renamed the node attributes to query_jumble_ignore and no_query_jumble at the end, after considering Peter's point that only "jumble" could be fuzzy here. The file names are changed in consequence. While doing all that, I have done some micro-benchmarking of JumbleQuery(), making it loop 50M times on my laptop each time a query ID is computed (hideous hack with a loop in queryjumble.c): - For non-utility queries, aka queries that go through JumbleQueryInternal(), I am measuring a repeatable ~10% improvement with the generated code over HEAD, which is kind of nice. I have tested a few DMLs and simple SELECTs, still it looks like a trend. - For utility queries, the new code is competing against hash_any_extended() with the query string, which is going to be hard to beat.. I am measuring what looks like a 5 times slowdown, at least, and more depending on the depth of the query tree. That's not surprising. On a 50M loop, this comes down to compare a computation of 100ns to 5ns for a 20-time slowdown for example, still this could justify the addition of a GUC to control whether utility queries have their query ID compiled depending on their nodes or their string hash, as this could become noticeable in OLTP workloads with loads of short statements and BEGIN/COMMIT queries? Thoughts or comments? -- Michael
Attachment
On 18.01.23 08:04, Michael Paquier wrote: > On Tue, Jan 17, 2023 at 04:52:28PM +0900, Michael Paquier wrote: >> On Tue, Jan 17, 2023 at 08:43:44AM +0100, Peter Eisentraut wrote: >>> Ok, I understand now, and I agree with this approach over the opposite. I >>> was confused because the snippet you showed above used "jumble_ignore", but >>> your patch is correct as it uses "jumble_location". >> >> Okay. I'll refresh the patch set so as we have only "jumble_ignore", >> then, like v1, with preparatory patches for what you mentioned and >> anything that comes into mind. > > This is done as of the patch series v3 attached: > - 0001 reformats all the comments of the nodes. > - 0002 moves the current files for query jumble as of queryjumble.c -> > queryjumblefuncs.c and utils/queryjumble.h -> nodes/queryjumble.h. > - 0003 is the core feature, where I have done a second pass over the > nodes to make sure that things map with HEAD, incorporating the extra > docs coming from v2, adding a bit more. This patch structure looks good. >>> That said, the term "jumble" is really weird, because in the sense that we >>> are using it here it means, approximately, "to mix together", "to unify". >>> So what we are doing with the Const nodes is really to *not* jumble the >>> location, but for all other node types we are jumbling the location. At >>> least that is my understanding. >> >> I am quite familiar with this term, FWIW. That's what we've inherited >> from the days where this has been introduced in pg_stat_statements. > > I have renamed the node attributes to query_jumble_ignore and > no_query_jumble at the end, after considering Peter's point that only > "jumble" could be fuzzy here. The file names are changed in > consequence. I see that in the 0003 patch, most location fields now have an explicit markup with query_jumble_ignore. I thought we had previously resolved to consider location fields to be automatically ignored unless explicitly included (like for the Const node). This appears to invert that? Am I missing something?
On Thu, Jan 19, 2023 at 09:42:03AM +0100, Peter Eisentraut wrote: > I see that in the 0003 patch, most location fields now have an explicit > markup with query_jumble_ignore. I thought we had previously resolved to > consider location fields to be automatically ignored unless explicitly > included (like for the Const node). This appears to invert that? Am I > missing something? My misunderstanding then, I thought that you were OK with what was part of v1, where all these fields was marked as "ignore". But you actually prefer v2, with the second field "location" on top of "ignore". I can update 0003 to refresh that. Would you be OK if I apply 0001 (with the comments of the locations still reshaped to ease future property additions) and 0002? -- Michael
Attachment
On Thu, Jan 19, 2023 at 09:42:03AM +0100, Peter Eisentraut wrote: > I see that in the 0003 patch, most location fields now have an explicit > markup with query_jumble_ignore. I thought we had previously resolved to > consider location fields to be automatically ignored unless explicitly > included (like for the Const node). This appears to invert that? Am I > missing something? As a result, I have rebased the patch set to use the two-attribute approach: query_jumble_ignore and query_jumble_location. On top of the three previous patches, I am adding 0004 to implement a GUC able to switch the computation of the utility statements between what I am calling "string" to compute the query IDs based on the hash of the query string and the previous default, or "jumble", to use the parsed tree, with a few more tests to see the difference. Perhaps it is not worth bothering, but it could be possible that some users don't want to pay the penalty of doing the query jumbling with the parsed tree for utilities, as well.. -- Michael
Attachment
On 20.01.23 05:35, Michael Paquier wrote: > On Thu, Jan 19, 2023 at 09:42:03AM +0100, Peter Eisentraut wrote: >> I see that in the 0003 patch, most location fields now have an explicit >> markup with query_jumble_ignore. I thought we had previously resolved to >> consider location fields to be automatically ignored unless explicitly >> included (like for the Const node). This appears to invert that? Am I >> missing something? > As a result, I have rebased the patch set to use the two-attribute > approach: query_jumble_ignore and query_jumble_location. Structurally, this looks okay to me now. In your 0001 patch, most of the comment reformattings for location fields are no longer needed, so you should undo those. The 0002 patch looks good. Those two could be committed with those adjustments, I think. I'll read the 0003 again more carefully. I haven't studied the new 0004 yet.
On Fri, Jan 20, 2023 at 11:56:00AM +0100, Peter Eisentraut wrote: > In your 0001 patch, most of the comment reformattings for location fields > are no longer needed, so you should undo those. > > The 0002 patch looks good. Okay, I have gone through these two again and applied what I had. 0001 has been cleaned up of the extra comment moves for the locations. Now, I have kept a few changes for some of the nodes to have some consistency with the other fields, in the case where most of the fields at the end of the structures have to be marked with new node attributes. This made the style of the header a bit more elegant, IMV. > I'll read the 0003 again more carefully. I haven't studied the new 0004 > yet. Thanks, again. Rebased version attached. -- Michael
Attachment
On 21.01.23 04:35, Michael Paquier wrote: >> I'll read the 0003 again more carefully. I haven't studied the new 0004 >> yet. > > Thanks, again. Rebased version attached. A couple of small fixes are attached. There is something weird in _jumbleNode(). There are two switch (nodeTag(expr)) statements. Maybe that's intentional, but then it should be commented better, because now it looks more like an editing mistake. The handling of T_RangeTblEntry could be improved. In other contexts we have for example a custom_copy attribute, which generates the switch entry but not the function. Something like this could be useful here too. Otherwise, this looks ok. I haven't checked whether it maintains the exact behavior from before. What is the test coverage situation for this? For the 0004 patch, it should be documented why one would want one behavior or the other. That's totally unclear right now. I think if we are going to accept 0004, then it might be better to combine it with 0003. Otherwise, 0004 is just undoing a lot of the code structure changes in JumbleQuery() that 0003 did.
Attachment
On Mon, Jan 23, 2023 at 02:27:13PM +0100, Peter Eisentraut wrote: > A couple of small fixes are attached. Thanks. > There is something weird in _jumbleNode(). There are two switch > (nodeTag(expr)) statements. Maybe that's intentional, but then it should be > commented better, because now it looks more like an editing mistake. This one is intentional, so as it is possible to track correctly the highest param ID found while browsing the nodes. IMO it would be confusing to add that into gen_node_support.pl. Another thing that could be done is to switch Param to have a custom implementation, like RangeTblEntry, though this removes the automation around the creation of _jumbleParam(). I have clarified the comments around that. > The handling of T_RangeTblEntry could be improved. In other contexts we > have for example a custom_copy attribute, which generates the switch entry > but not the function. Something like this could be useful here too. Hmm. Okay. Fine by me. > Otherwise, this looks ok. I haven't checked whether it maintains the exact > behavior from before. What is the test coverage situation for this? 0003 taken in isolation has some minimal coverage through pg_stat_statements, though it turns around 15% with compute_query_id = auto that would enforce the jumbling path only when pg_stat_statements uses it. Still, my plan here is to enforce the loading of pg_stat_statements with compute_query_id = regress and utility_query_id = jumble (if needed) in a new buildfarm machine, because that's the cheapest path. An extra possibility is to have pg_regress kicked in a new TAP test with these settings, but that's costly and we have already two of these :/ Another possibility is to plug in that into 027_stream_regress or the pg_upgrade test suite with new settings :/ Anyway, the regression tests of pg_stat_statements should be extended a bit to cover more node types by default (Say COPY with DMLs for the InsertStmt & co) to look at how these are showing up once normalized using their parsed query, and we don't do much around that now. Normalizing more DDLs should use this path, as well. > For the 0004 patch, it should be documented why one would want one behavior > or the other. That's totally unclear right now. I am not 100% sure whether we should have this part at the end, but as an exit path in case somebody complains about the extra load with the automated jumbling compared to the hash of the query strings, that can be used as a backup. Anyway, attached is what would be a clarification. > I think if we are going to accept 0004, then it might be better to combine > it with 0003. Otherwise, 0004 is just undoing a lot of the code structure > changes in JumbleQuery() that 0003 did. Makes sense. That would be my intention if 0004 is the most acceptable and splitting things makes things a bit easier to review. -- Michael
Attachment
On Tue, Jan 24, 2023 at 03:57:56PM +0900, Michael Paquier wrote: > Makes sense. That would be my intention if 0004 is the most > acceptable and splitting things makes things a bit easier to review. There was a silly mistake in 0004 where the jumbling code relied on compute_query_id rather than utility_query_id, so fixed and rebased as of v7 attached. -- Michael
Attachment
On 24.01.23 07:57, Michael Paquier wrote: >> For the 0004 patch, it should be documented why one would want one behavior >> or the other. That's totally unclear right now. > I am not 100% sure whether we should have this part at the end, but as > an exit path in case somebody complains about the extra load with the > automated jumbling compared to the hash of the query strings, that can > be used as a backup. Anyway, attached is what would be a > clarification. Ok, the documentation make sense now. I wonder what the performance impact is. Probably, nobody cares about microoptimizing CREATE TABLE statements. But BEGIN/COMMIT could matter. However, whatever you do in between the BEGIN and COMMIT will already be jumbled, so you're already paying the overhead. Hopefully, jumbling such simple commands will have no noticeable overhead. In other words, we should test this and hopefully get rid of the 'string' method.
On 25.01.23 01:08, Michael Paquier wrote: > On Tue, Jan 24, 2023 at 03:57:56PM +0900, Michael Paquier wrote: >> Makes sense. That would be my intention if 0004 is the most >> acceptable and splitting things makes things a bit easier to review. > > There was a silly mistake in 0004 where the jumbling code relied on > compute_query_id rather than utility_query_id, so fixed and rebased as > of v7 attached. Overall, this looks good to me. There are a couple of repetitive comments, like "typmod and collation information are irrelevant for the query jumbling". This applies to all nodes, so we don't need to repeat it for a number of nodes (and then not mention it for other nodes). Maybe there should be a central place somewhere that describes "these kinds of fields should normally be ignored".
On Thu, Jan 26, 2023 at 09:37:13AM +0100, Peter Eisentraut wrote: > Ok, the documentation make sense now. I wonder what the performance impact > is. Probably, nobody cares about microoptimizing CREATE TABLE statements. > But BEGIN/COMMIT could matter. However, whatever you do in between the > BEGIN and COMMIT will already be jumbled, so you're already paying the > overhead. Hopefully, jumbling such simple commands will have no noticeable > overhead. > > In other words, we should test this and hopefully get rid of the 'string' > method. Yep. I have mentioned a few numbers upthread, and this deserves discussion. FYI, I have done more micro-benchmarking to compare both methods for utility queries by hijacking JumbleQuery() to run the computation in a tight loop run N times (could not come up with a better idea to avoid the repeated palloc/pfree overhead), as the path to stress is _jumbleNode(). See the attached, that should be able to apply on top of the latest patch set (named as .txt to not feed it to the CF bot, and need to recompile to switch the iteration). Using that, I can compile the following results for various cases (-O2 and compute_query_id=on): query | mode | iterations | avg_runtime_ns | avg_jumble_ns -------------------------+--------+------------+----------------+--------------- begin | string | 50000000 | 4.53116 | 4.54 begin | jumble | 50000000 | 30.94578 | 30.94 commit | string | 50000000 | 4.76004 | 4.74 commit | jumble | 50000000 | 31.4791 | 31.48 create table 1 column | string | 50000000 | 7.22836 | 7.08 create table 1 column | jumble | 50000000 | 152.10852 | 151.96 create table 5 columns | string | 50000000 | 12.43412 | 12.28 create table 5 columns | jumble | 50000000 | 352.88976 | 349.1 create table 20 columns | string | 5000000 | 49.591 | 48.2 create table 20 columns | jumble | 5000000 | 2272.4066 | 2271 drop table 1 column | string | 50000000 | 6.70538 | 6.56 drop table 1 column | jumble | 50000000 | 50.38 | 50.24 drop table 5 columns | string | 50000000 | 6.88256 | 6.74 drop table 5 columns | jumble | 50000000 | 50.02898 | 49.9 SET work_mem | string | 50000000 | 7.28752 | 7.28 SET work_mem | jumble | 50000000 | 91.66588 | 91.64 (16 rows) avg_runtime_ns is (query runtime / iterations) and avg_jumble_ns is the same with the difference between the start/end logs in the txt patch attached. The overhead to run the query does not matter much if you compare both. The time it takes to run a jumble is correlated to the number of nodes to go through for each query, and there is a larger gap for more nodes to go through. Well, a simple "begin" or "commit" query has its computation time increase from 4ns to 30ns in average which would be unnoticeable. The gap is larger for larger nodes, like SET, still we jump from 7ns to 90ns in this case. DDLs take the most hit with this method, where a 20-column CREATE TABLE jumps from 50ns to 2us (note that the iteration is 10 times lower here). At the end, that would be unnoticeable for the average user, I guess, but here are the numbers I get on my laptop :) -- Michael
Attachment
On Thu, Jan 26, 2023 at 09:39:05AM +0100, Peter Eisentraut wrote: > There are a couple of repetitive comments, like "typmod and collation > information are irrelevant for the query jumbling". This applies to all > nodes, so we don't need to repeat it for a number of nodes (and then not > mention it for other nodes). Maybe there should be a central place > somewhere that describes "these kinds of fields should normally be ignored". The place that would make the most sense to me to centralize this knowlegde is nodes.h itself, because that's where the node attributes are defined? -- Michael
Attachment
On Tue, Jan 24, 2023 at 03:57:56PM +0900, Michael Paquier wrote: > Still, my plan here is to enforce the loading of > pg_stat_statements with compute_query_id = regress and > utility_query_id = jumble (if needed) in a new buildfarm machine, Actually, about this specific point, I have been able to set up a buildfarm machine that uses shared_preload_libraries = pg_stat_statements and compute_query_id = regress in the base configuration, which is uncovered yet. This works as long as one sets up EXTRA_INSTALL => "contrib/pg_stat_statements" in build_env. -- Michael
Attachment
On Fri, Jan 27, 2023 at 11:59:47AM +0900, Michael Paquier wrote: > Using that, I can compile the following results for various cases (-O2 > and compute_query_id=on): > query | mode | iterations | avg_runtime_ns | avg_jumble_ns > -------------------------+--------+------------+----------------+--------------- > begin | string | 50000000 | 4.53116 | 4.54 > begin | jumble | 50000000 | 30.94578 | 30.94 > commit | string | 50000000 | 4.76004 | 4.74 > commit | jumble | 50000000 | 31.4791 | 31.48 > create table 1 column | string | 50000000 | 7.22836 | 7.08 > create table 1 column | jumble | 50000000 | 152.10852 | 151.96 > create table 5 columns | string | 50000000 | 12.43412 | 12.28 > create table 5 columns | jumble | 50000000 | 352.88976 | 349.1 > create table 20 columns | string | 5000000 | 49.591 | 48.2 > create table 20 columns | jumble | 5000000 | 2272.4066 | 2271 > drop table 1 column | string | 50000000 | 6.70538 | 6.56 > drop table 1 column | jumble | 50000000 | 50.38 | 50.24 > drop table 5 columns | string | 50000000 | 6.88256 | 6.74 > drop table 5 columns | jumble | 50000000 | 50.02898 | 49.9 > SET work_mem | string | 50000000 | 7.28752 | 7.28 > SET work_mem | jumble | 50000000 | 91.66588 | 91.64 > (16 rows) Just to close the loop here, I have done more measurements to compare the jumble done for some DMLs and some SELECTs between HEAD and the patch (forgot to post some last Friday). Both methods show comparable results: query | mode | iterations | avg_runtime_ns | avg_jumble_ns ----------------------+--------+------------+----------------+--------------- insert table 10 cols | master | 50000000 | 377.17878 | 377.04 insert table 10 cols | jumble | 50000000 | 409.47924 | 409.34 insert table 20 cols | master | 50000000 | 692.94924 | 692.8 insert table 20 cols | jumble | 50000000 | 710.0901 | 709.96 insert table 5 cols | master | 50000000 | 232.44308 | 232.3 insert table 5 cols | jumble | 50000000 | 253.49854 | 253.36 select 10 cols | master | 50000000 | 449.13608 | 383.36 select 10 cols | jumble | 50000000 | 491.61912 | 323.86 select 5 cols | master | 50000000 | 277.477 | 277.46 select 5 cols | jumble | 50000000 | 323.88152 | 323.86 (10 rows) The averages are in ns, so the difference does not bother me much. There may be some noise mixed in that ;) (Attached is the tweak I have applied on HEAD to get some numbers.) -- Michael
Attachment
On 27.01.23 04:07, Michael Paquier wrote: > On Thu, Jan 26, 2023 at 09:39:05AM +0100, Peter Eisentraut wrote: >> There are a couple of repetitive comments, like "typmod and collation >> information are irrelevant for the query jumbling". This applies to all >> nodes, so we don't need to repeat it for a number of nodes (and then not >> mention it for other nodes). Maybe there should be a central place >> somewhere that describes "these kinds of fields should normally be ignored". > > The place that would make the most sense to me to centralize this > knowlegde is nodes.h itself, because that's where the node attributes > are defined? Either that or src/backend/nodes/README.
On 27.01.23 03:59, Michael Paquier wrote: > At the end, that would be unnoticeable for the average user, I guess, > but here are the numbers I get on my laptop :) Personally, I think we do not want the two jumble methods in parallel. Maybe there are other opinions. I'm going to set this thread as "Ready for Committer". Either wait a bit for more feedback on this topic, or just go ahead with either solution. We can leave it as a semi-open item for reconsideration later.
On Mon, Jan 30, 2023 at 11:46:06AM +0100, Peter Eisentraut wrote: > Either that or src/backend/nodes/README. The README holds nothing about the node attributes currently, so nodes.h feels most adapted here at the end.. Thanks, -- Michael
Attachment
On Mon, Jan 30, 2023 at 11:48:45AM +0100, Peter Eisentraut wrote: > I'm going to set this thread as "Ready for Committer". Either wait a bit > for more feedback on this topic, or just go ahead with either solution. We > can leave it as a semi-open item for reconsideration later. All the measurements I have done for the last couple of days point me in the direction that there is no need for an extra node based on the averaged computation times (did a few more today with some long CREATE FUNCTION, VIEW or event trigger queries, for example). Agreed to add the extra option as something to consider at some point during beta, as long as it is fresh. I am not convinced that it will be necessary but let's see how it goes. While reviewing all the nodes, I have noticed two mistakes for a few things marked as query_jumble_ignore but they should not: - orderClause in WindowClause - aliascolnames in CommonTableExpr The rest was fine. varnullingrels has been added very recently, and there was in queryjumblefuncs.c a comment explaining why it should be ignored. This has been moved to nodes.h, like the others. With all that in mind, I have spent my day polishing that and doing a close lookup, and the patch has been applied. Thanks a lot! -- Michael
Attachment
On Tue, Jan 31, 2023 at 03:40:56PM +0900, Michael Paquier wrote: > With all that in mind, I have spent my day polishing that and doing a > close lookup, and the patch has been applied. Thanks a lot! While working on a different patch, I have noticed a small issue in the way the jumbling happens for A_Const, where ValUnion was not getting jumbled correctly. This caused a few statements that rely on this node to compile the same query IDs when using different values. The full contents of pg_stat_statements for a regression database point to: - SET. - COPY with queries. - CREATE TABLE with partition bounds and default expressions. This was causing some confusion in pg_stat_statements where some utility queries would be incorrectly reported, and at this point the intention is to keep this area compatible with the string-based method when it comes to the values. Like read, write and copy nodes, we need to compile the query ID based on the type of the value, which cannot be automated. Attached is a patch to fix this issue with some regression tests, that I'd like to get fixed before moving on with more business in pg_stat_statements (aka properly show Const nodes for utilities with normalized queries). Comments or objections are welcome, of course. (FWIW, I'd like to think that there is an argument to normalize the A_Const nodes for a portion of the DDL queries, by ignoring their values in the query jumbling and mark a location, which would be really useful for some workloads, but that's a separate discussion I am keeping for later.) -- Michael
Attachment
On Sun, Feb 05, 2023 at 07:40:57PM +0900, Michael Paquier wrote: > Comments or objections are welcome, of course. Done this part. > (FWIW, I'd like to think that there is an argument to normalize the > A_Const nodes for a portion of the DDL queries, by ignoring their > values in the query jumbling and mark a location, which would be > really useful for some workloads, but that's a separate discussion I > am keeping for later.) And I have a patch pretty much OK for this part of the discussion. Will post soon.. -- Michael
Attachment
Michael Paquier <michael@paquier.xyz> writes: > With all that in mind, I have spent my day polishing that and doing a > close lookup, and the patch has been applied. Thanks a lot! I have just noticed that this patch is generating useless jumbling code for node types such as Path nodes and other planner infrastructure nodes. That no doubt contributes to the miserable code coverage rating for queryjumblefuncs.*.c, which have enough dead lines to drag down the overall rating for all of backend/nodes/. Shouldn't a little more attention have been paid to excluding entire node classes if they can never appear in Query? regards, tom lane
On Tue, Feb 07, 2023 at 05:32:07PM -0500, Tom Lane wrote: > I have just noticed that this patch is generating useless jumbling > code for node types such as Path nodes and other planner infrastructure > nodes. That no doubt contributes to the miserable code coverage rating > for queryjumblefuncs.*.c, which have enough dead lines to drag down the > overall rating for all of backend/nodes/. Shouldn't a little more > attention have been paid to excluding entire node classes if they can > never appear in Query? This one was intentional to let extensions play with jumbling of such nodes, but perhaps you are right that it makes little sense at this stage. If there is an ask for it later, though.. Using shared_preload_libraries = pg_stat_statements and compute_query_id = regress shows that numbers go up to 60% for funcs.c and 30% for switch.c. Removing nodes like as of the attached brings these numbers respectively up to 94.5% and 93.5% for a check. With a check-world, I measure respectively 96.7% and 96.1% because there is more coverage for extensions, ALTER SYSTEM and database commands, roughly. This could also be a file-level policy by enforcing no_query_jumble in gen_node_support.pl by looking at the header name, still I favor no_query_jumble to keep all the pg_node_attr() in a single area with the headers. Note that the attached includes in 0002 the tweak to enforce the computation with compute_query_id if you want to test it yourself and check my numbers. This is useful IMO as we could detect missing nodes for all queries (utilities or not), still doing this change may deserve a separate discussion. Note that I am not seeing any "unrecognized node type" in any of the logs for any queries. As a side note, should we be more aggressive with the tests related to the jumbling code since it is now in core? For example, XmlExpr or MinMaxExpr, which are part of Query nodes that can be jumbled even in older branches, have zero coverage by default as coverage.postgresql.org reports, because everything goes through pg_stat_statements and it includes no queries with such nodes. My buildfarm member batta makes sure to stress that no nodes are missing by overloading pg_stat_statements and compute_query_id = regress in the configuration, so no nodes are missing from the computation, still the coverage could be better across the board. Expanding the tests of pg_stat_statements is needed in this area for some time, still could there be a point in switching compute_query_id = regress so as it is a synonym of "on" without the EXPLAIN output rather than "auto"? If the setting is enforced by pg_regress, the coverage of queryjumble.c would be so much better, at least.. -- Michael
Attachment
Hi, On 2023-02-08 15:47:51 +0900, Michael Paquier wrote: > This one was intentional to let extensions play with jumbling of such > nodes, but perhaps you are right that it makes little sense at this > stage. If there is an ask for it later, though.. Using > shared_preload_libraries = pg_stat_statements and compute_query_id = > regress shows that numbers go up to 60% for funcs.c and 30% for > switch.c. Removing nodes like as of the attached brings these numbers > respectively up to 94.5% and 93.5% for a check. With a check-world, I > measure respectively 96.7% and 96.1% because there is more coverage > for extensions, ALTER SYSTEM and database commands, roughly. Given that we already pay the price of multiple regress runs, and that jumbling is now really a core feature, perhaps we should enable pg_stat_statements in pg_upgrade or 027_stream_regress.pl? I'd hope it wouldn't add a meaningful amount of time? A tiny bit of verification at the end should also be ok. Both pg_upgrade and 027_stream_regress.pl have some advantages. The former would test pg_upgrade interactions with shared_preload_libraries, the latter could do some basic checks of pg_stat_statements on a standby. Greetings, Andres Freund
On Tue, Feb 07, 2023 at 11:01:03PM -0800, Andres Freund wrote: > Given that we already pay the price of multiple regress runs, and that > jumbling is now really a core feature, perhaps we should enable > pg_stat_statements in pg_upgrade or 027_stream_regress.pl? I'd hope it > wouldn't add a meaningful amount of time? A tiny bit of verification at the > end should also be ok. Yeah, I have briefly mentioned this part upthread: https://www.postgresql.org/message-id/Y8+BdCOjxykre5es@paquier.xyz It would not, I guess, as long as pg_stat_statements.max is set large enough in the TAP test. There are currently 21k~22k entries in the regression database, much larger than the default of 5000 so this may become an issue on small-ish machines if left untouched even in a TAP test. > Both pg_upgrade and 027_stream_regress.pl have some advantages. The former > would test pg_upgrade interactions with shared_preload_libraries, the latter > could do some basic checks of pg_stat_statements on a standby. Yes, there could be more checks, potentially useful for both cases, so I may choose both at the end of the day. Checking the consistency of the contents of pg_stat_statements across a pg_upgrade run for the same version may be one thing? I am not sure if it is that interesting, TBH, still that's one idea :) -- Michael
Attachment
On Wed, Feb 08, 2023 at 03:47:51PM +0900, Michael Paquier wrote: > This one was intentional to let extensions play with jumbling of such > nodes, but perhaps you are right that it makes little sense at this > stage. If there is an ask for it later, though.. Using > shared_preload_libraries = pg_stat_statements and compute_query_id = > regress shows that numbers go up to 60% for funcs.c and 30% for > switch.c. Removing nodes like as of the attached brings these numbers > respectively up to 94.5% and 93.5% for a check. With a check-world, I > measure respectively 96.7% and 96.1% because there is more coverage > for extensions, ALTER SYSTEM and database commands, roughly. Tom, did you get a chance to look at what is proposed here and expand the use of query_jumble_ignore in the definitions of the nodes rather than have an enforced per-file policy in gen_node_support.pl? The attached improves the situation by as much as we can, still the numbers reported by coverage.postgresql.org won't go that up until we enforce query jumbling testing on the regression database (something we should have done since this code moved to core, I guess). -- Michael
Attachment
Michael Paquier <michael@paquier.xyz> writes: > Tom, did you get a chance to look at what is proposed here and expand > the use of query_jumble_ignore in the definitions of the nodes rather > than have an enforced per-file policy in gen_node_support.pl? Sorry, didn't look at it before. I'm okay with the pathnodes.h changes --- although surely you don't need changes like this: - pg_node_attr(abstract) + pg_node_attr(abstract, no_query_jumble) "abstract" should already imply "no_query_jumble". I wonder too if you could shorten the changes by making no_query_jumble an inheritable attribute, and then just applying it to Path and Plan. The changes in parsenodes.h seem wrong, except for RawStmt. Those node types are used in parsed queries, aren't they? regards, tom lane
On Thu, Feb 09, 2023 at 06:12:50PM -0500, Tom Lane wrote: > I'm okay with the pathnodes.h changes --- although surely you don't need > changes like this: > > - pg_node_attr(abstract) > + pg_node_attr(abstract, no_query_jumble) > > "abstract" should already imply "no_query_jumble". Okay, understood. Following this string of thoughts, I am a bit surprised for two cases, though: - PartitionPruneStep. - Plan. Both are abstract and both are marked with no_equal. I guess that applying no_query_jumble to both of them is fine, and that's what you mean? > I wonder too if you could shorten the changes by making no_query_jumble > an inheritable attribute, and then just applying it to Path and Plan. Ah. I did not catch what you meant here at first, but I think that I do now. Are you referring to the part of gen_node_support.pl where we propagate properties when a node is a supertype? This part would be taken into account when a node is parsed but we find that its first member is already tracked as a node: # Propagate some node attributes from supertypes if ($supertype) { push @no_copy, $in_struct if elem $supertype, @no_copy; push @no_equal, $in_struct if elem $supertype, @no_equal; push @no_read, $in_struct if elem $supertype, @no_read; + push @no_query_jumble, $in_struct + if elem $supertype, @no_query_jumble; } A benefit of doing that would also discard all the Scan and Sort nodes. So like the other no_* attributes, it makes sense to force the inheritance here. > The changes in parsenodes.h seem wrong, except for RawStmt. Those node > types are used in parsed queries, aren't they? RTEPermissionInfo is a recent addition, as of a61b1f7. This commit documents it as a plan node, still it is part of a Query while being ignored in the query jumbling since its introduction, so I am a bit confused by this one. Anyway, none of these need to be included in the query jumbling currently because they are ignored, but I'd be fine to generate their code by default as they could become relevant if other nodes begin to rely on them more heavily, as being part of queries. Peter E. has mentioned upthread that a few nodes should include more jumbling while some other parts should be ignored. This should be analyzed separately because ~15 does not seem to be strictly right, either. Attached is a patch refreshed with all that. Feel free to ignore 0002 as that's just useful to enforce the tests to go through the jumbling code. The attached reaches 95.0% of line coverage after a check-world in funcs.c. Thoughts? -- Michael
Attachment
Michael Paquier <michael@paquier.xyz> writes: > Okay, understood. Following this string of thoughts, I am a bit > surprised for two cases, though: > - PartitionPruneStep. > - Plan. > Both are abstract and both are marked with no_equal. I guess that > applying no_query_jumble to both of them is fine, and that's what you > mean? On second thought, the point of that is to allow the no_equal property to automatically inherit to child node types, so doing likewise for no_query_jumble is sensible. >> The changes in parsenodes.h seem wrong, except for RawStmt. Those node >> types are used in parsed queries, aren't they? > RTEPermissionInfo is a recent addition, as of a61b1f7. This commit > documents it as a plan node, still it is part of a Query while being > ignored in the query jumbling since its introduction, so I am a bit > confused by this one. Hmm ... it is part of Query, so that documentation is wrong, and the fact that it's not reached by query jumbling kind of seems like a bug. However, it might be that everything in it is derived from something else that *is* covered by jumbling, in which case that's okay, if underdocumented. > ... Peter E. has > mentioned upthread that a few nodes should include more jumbling while > some other parts should be ignored. This should be analyzed > separately because ~15 does not seem to be strictly right, either. Yeah. It'd surprise me not at all if people have overlooked that. v2 looks good to me as far as it goes. I agree these other questions deserve a separate look. regards, tom lane
On Fri, Feb 10, 2023 at 04:40:08PM -0500, Tom Lane wrote: > v2 looks good to me as far as it goes. Thanks. I have applied that after an extra lookup. > I agree these other questions deserve a separate look. Okay, I may be able to come back to that. Another point is that we need to do a better job in forcing the execution of the query jumbling in one of the TAP tests running pg_regress, outside pg_stat_statements, to maximize coverage. Will see to that on a separate thread. -- Michael
Attachment
On Mon, Jan 30, 2023 at 11:48:45AM +0100, Peter Eisentraut wrote: > On 27.01.23 03:59, Michael Paquier wrote: >> At the end, that would be unnoticeable for the average user, I guess, >> but here are the numbers I get on my laptop :) > > Personally, I think we do not want the two jumble methods in parallel. > > Maybe there are other opinions. (Thanks Jonathan for the poke.) Now that we are in mid-beta for 16, it would be a good time to conclude on this open item: "Reconsider a utility_query_id GUC to control if query jumbling of utilities can go through the past string-only mode and the new mode?" In Postgres ~15, utility commands used a hash of the query string to compute their query ID. The current query jumbling code uses a Query instead, like any other queries. I have registered this open item as a self-reminder, mostly in case there would be an argument to have a GUC where users could switch from one mode to another. See here as well for some computation times for each method (table is in ns, wiht millions of iterations): https://www.postgresql.org/message-id/Y9eeYinDb1AcpWrG@paquier.xyz I still don't think that we need both methods based on these numbers, but there may be more opinions about that? Are people OK if this open item is discarded? -- Michael
Attachment
On 11/7/2023 05:35, Michael Paquier wrote: > On Mon, Jan 30, 2023 at 11:48:45AM +0100, Peter Eisentraut wrote: >> On 27.01.23 03:59, Michael Paquier wrote: >>> At the end, that would be unnoticeable for the average user, I guess, >>> but here are the numbers I get on my laptop :) >> >> Personally, I think we do not want the two jumble methods in parallel. >> >> Maybe there are other opinions. > > (Thanks Jonathan for the poke.) > > Now that we are in mid-beta for 16, it would be a good time to > conclude on this open item: > "Reconsider a utility_query_id GUC to control if query jumbling of > utilities can go through the past string-only mode and the new mode?" > > In Postgres ~15, utility commands used a hash of the query string to > compute their query ID. The current query jumbling code uses a Query > instead, like any other queries. I have registered this open item as > a self-reminder, mostly in case there would be an argument to have a > GUC where users could switch from one mode to another. See here as > well for some computation times for each method (table is in ns, wiht > millions of iterations): > https://www.postgresql.org/message-id/Y9eeYinDb1AcpWrG@paquier.xyz > > I still don't think that we need both methods based on these numbers, > but there may be more opinions about that? Are people OK if this open > item is discarded? I vote for only one method based on a query tree structure. BTW, did you think about different algorithms of queryId generation? Auto-generated queryId code can open a way for extensions to have easy-supporting custom queryIds. -- regards, Andrey Lepikhov Postgres Professional
On Tue, Jul 11, 2023 at 12:29:29PM +0700, Andrey Lepikhov wrote: > I vote for only one method based on a query tree structure. Noted > BTW, did you think about different algorithms of queryId generation? Not really, except if you are referring to the possibility of being able to handle differently different portions of the nodes depending on a context given by the callers willing to do a query jumbling computation. (For example, choose to *not* silence the Const nodes, etc.) > Auto-generated queryId code can open a way for extensions to have > easy-supporting custom queryIds. Extensions can control that at some extent, already. -- Michael
Attachment
On 11/7/2023 12:35, Michael Paquier wrote: > On Tue, Jul 11, 2023 at 12:29:29PM +0700, Andrey Lepikhov wrote: >> I vote for only one method based on a query tree structure. > > Noted > >> BTW, did you think about different algorithms of queryId generation? > > Not really, except if you are referring to the possibility of being > able to handle differently different portions of the nodes depending > on a context given by the callers willing to do a query jumbling > computation. (For example, choose to *not* silence the Const nodes, > etc.) Yes, I have two requests on different queryId algorithms: 1. With suppressed Const nodes. 2. With replacement of Oids with full names - to give a chance to see the same queryId at different instances for the same query. It is quite trivial to implement, but not easy in support. > >> Auto-generated queryId code can open a way for extensions to have >> easy-supporting custom queryIds. > > Extensions can control that at some extent, already. > -- > Michael -- regards, Andrey Lepikhov Postgres Professional
On Tue, Jul 11, 2023 at 07:35:43AM +0900, Michael Paquier wrote: > I still don't think that we need both methods based on these numbers, > but there may be more opinions about that? Are people OK if this open > item is discarded? Hearing nothing about this point, removed from the open item list, then. -- Michael