Thread: Generating code for query jumbling through gen_node_support.pl

Generating code for query jumbling through gen_node_support.pl

From

Michael Paquier

Date:

07 December 2022, 10:56:41

Hi all,

This thread is a follow-up of the recent discussion about query
jumbling with DDL statements, where the conclusion was that we'd want
to generate all this code automatically for all the nodes:
https://www.postgresql.org/message-id/36e5bffe-e989-194f-85c8-06e7bc88e6f7@amazon.com

What this patch allows to do it to compute the same query ID for
utility statements using their parsed Node state instead of their
string, meaning that things like "BEGIN", "bEGIN" or "begin" would be
treated the same, for example.  But the main idea is not only that.

I have implemented that as of the attached, where the following things
are done:
- queryjumble.c is moved to src/backend/nodes/, to stick with the
other things for node equal/read/write/copy, renamed to
jumblefuncs.c.
- gen_node_support.c is extended to generate the functions and the
switch for the jumbling.  There are a few exceptions, as of the Lists
and RangeTblEntry to do the jumbling consistently.
- Two pg_node_attr() are added in consistency with the existing ones:
no_jumble to discard completely a node from the the query jumbling
and jumble_ignore to discard one field from the jumble.

The patch is in a rather good shape, passes check-world and the CI,
but there are a few things that need to be discussed IMO.  Things
could be perhaps divided in more patches, now the areas touched are
quite different so it did not look like a big deal to me as the
changes touch different areas and are straight-forward.

The location of the Nodes is quite invasive because we only care about
that for T_Const now in the query jumbling, and this could be
compensated with a third pg_node_attr() that tracks for the "int
location" of a Node whether it should participate in the jumbling or
not.  There is also an argument where we would want to not include by
default new fields added to a Node, but that would require added more
pg_node_attr() than what's done here.

Note that the plan is to extend the normalization to some other parts
of the Nodes, like CALL and SET as mentioned on the other thread.  I
have done nothing about that yet but doing so can be done in a few
lines with the facility presented here (aka just add a location
field).  Hence, the normalization is consistent with the existing
queryjumble.c for the fields and the nodes processed.

In this patch, things are done so as the query ID is not computed
anymore from the query string but from the Query.  I still need to
study the performance impact of that with short queries.  If things
prove to be noticeable in some cases, this stuff could be switched to
use a new GUC where we could have a code path for the computation of
utilityStmt using its string as a fallback.  I am not sure that I want
to enter in this level of complications, though, to keep things
simple, but that's yet to be done.

A bit more could be cut but pg_ident got in the way..  There are also
a few things for pg_stat_statements where a query ID of 0 can be
implied for utility statements in some cases.

Generating this code leads to an overall removal of code as what
 queryjumble.c is generated automatically:
 13 files changed, 901 insertions(+), 1113 deletions(-)

I am adding that to the next commit fest.

Thoughts?
--
Michael

Attachment

Re: Generating code for query jumbling through gen_node_support.pl

From

vignesh C

Date:

06 January 2023, 09:07:32

On Wed, 7 Dec 2022 at 13:27, Michael Paquier <michael@paquier.xyz> wrote:
>
> Hi all,
>
> This thread is a follow-up of the recent discussion about query
> jumbling with DDL statements, where the conclusion was that we'd want
> to generate all this code automatically for all the nodes:
> https://www.postgresql.org/message-id/36e5bffe-e989-194f-85c8-06e7bc88e6f7@amazon.com
>
> What this patch allows to do it to compute the same query ID for
> utility statements using their parsed Node state instead of their
> string, meaning that things like "BEGIN", "bEGIN" or "begin" would be
> treated the same, for example.  But the main idea is not only that.
>
> I have implemented that as of the attached, where the following things
> are done:
> - queryjumble.c is moved to src/backend/nodes/, to stick with the
> other things for node equal/read/write/copy, renamed to
> jumblefuncs.c.
> - gen_node_support.c is extended to generate the functions and the
> switch for the jumbling.  There are a few exceptions, as of the Lists
> and RangeTblEntry to do the jumbling consistently.
> - Two pg_node_attr() are added in consistency with the existing ones:
> no_jumble to discard completely a node from the the query jumbling
> and jumble_ignore to discard one field from the jumble.
>
> The patch is in a rather good shape, passes check-world and the CI,
> but there are a few things that need to be discussed IMO.  Things
> could be perhaps divided in more patches, now the areas touched are
> quite different so it did not look like a big deal to me as the
> changes touch different areas and are straight-forward.
>
> The location of the Nodes is quite invasive because we only care about
> that for T_Const now in the query jumbling, and this could be
> compensated with a third pg_node_attr() that tracks for the "int
> location" of a Node whether it should participate in the jumbling or
> not.  There is also an argument where we would want to not include by
> default new fields added to a Node, but that would require added more
> pg_node_attr() than what's done here.
>
> Note that the plan is to extend the normalization to some other parts
> of the Nodes, like CALL and SET as mentioned on the other thread.  I
> have done nothing about that yet but doing so can be done in a few
> lines with the facility presented here (aka just add a location
> field).  Hence, the normalization is consistent with the existing
> queryjumble.c for the fields and the nodes processed.
>
> In this patch, things are done so as the query ID is not computed
> anymore from the query string but from the Query.  I still need to
> study the performance impact of that with short queries.  If things
> prove to be noticeable in some cases, this stuff could be switched to
> use a new GUC where we could have a code path for the computation of
> utilityStmt using its string as a fallback.  I am not sure that I want
> to enter in this level of complications, though, to keep things
> simple, but that's yet to be done.
>
> A bit more could be cut but pg_ident got in the way..  There are also
> a few things for pg_stat_statements where a query ID of 0 can be
> implied for utility statements in some cases.
>
> Generating this code leads to an overall removal of code as what
>  queryjumble.c is generated automatically:
>  13 files changed, 901 insertions(+), 1113 deletions(-)
>
> I am adding that to the next commit fest.

The patch does not apply on top of HEAD as in [1], please post a rebased patch:
=== Applying patches on top of PostgreSQL commit ID
b82557ecc2ebbf649142740a1c5ce8d19089f620 ===
=== applying patch
./0001-Support-for-automated-query-jumble-with-all-Nodes.patch
...
patching file src/backend/utils/misc/queryjumble.c
Hunk #1 FAILED at 1.
Not deleting file src/backend/utils/misc/queryjumble.c as content
differs from patch
1 out of 1 hunk FAILED -- saving rejects to file
src/backend/utils/misc/queryjumble.c.rej

[1] - http://cfbot.cputube.org/patch_41_4047.log

Regards,
Vignesh

Re: Generating code for query jumbling through gen_node_support.pl

From

Peter Eisentraut

Date:

07 January 2023, 09:37:49

On 07.12.22 08:56, Michael Paquier wrote:
> The location of the Nodes is quite invasive because we only care about
> that for T_Const now in the query jumbling, and this could be
> compensated with a third pg_node_attr() that tracks for the "int
> location" of a Node whether it should participate in the jumbling or
> not.

The generation script already has a way to identify location fields, by 
($t eq 'int' && $f =~ 'location$'), so you could use that as well.

> There is also an argument where we would want to not include by
> default new fields added to a Node, but that would require added more
> pg_node_attr() than what's done here.

I'm concerned about the large number of additional field annotations 
this adds.  We have been careful so far to document the use of each 
attribute, e.g., *why* does a field not need to be copied etc.  This 
patch adds dozens and dozens of annotations without any explanation at 
all.  Now, the code this replaces also has no documentation, but maybe 
this is the time to add some.

Re: Generating code for query jumbling through gen_node_support.pl

From

Michael Paquier

Date:

07 January 2023, 13:47:59

On Sat, Jan 07, 2023 at 07:37:49AM +0100, Peter Eisentraut wrote:
> The generation script already has a way to identify location fields, by ($t
> eq 'int' && $f =~ 'location$'), so you could use that as well.

I recall that some of the nodes may need renames to map with this
choice.  That could be just one patch on top of the actual feature.

> I'm concerned about the large number of additional field annotations this
> adds.  We have been careful so far to document the use of each attribute,
> e.g., *why* does a field not need to be copied etc.  This patch adds dozens
> and dozens of annotations without any explanation at all.  Now, the code
> this replaces also has no documentation, but maybe this is the time to add
> some.

In most cases, the addition of the node marker would be enough to
self-explain why they are included, but there is a trend for a lot of
the nodes when it comes to collations and typmods where we don't want
to see these in the jumbling calculations.  I'll look at providing
more info for all that.  (Note: I'm out for now.)
--
Michael

Attachment

signature.asc

Re: Generating code for query jumbling through gen_node_support.pl

From

Michael Paquier

Date:

13 January 2023, 10:54:58

On Sat, Jan 07, 2023 at 07:37:49AM +0100, Peter Eisentraut wrote:
> On 07.12.22 08:56, Michael Paquier wrote:
>> The location of the Nodes is quite invasive because we only care about
>> that for T_Const now in the query jumbling, and this could be
>> compensated with a third pg_node_attr() that tracks for the "int
>> location" of a Node whether it should participate in the jumbling or
>> not.
>
> The generation script already has a way to identify location fields, by ($t
> eq 'int' && $f =~ 'location$'), so you could use that as well.

I did not recall exactly everything here, but there are two parts to
the logic:
- gen_node_support.pl uses exactly this condition when scanning the
nodes to put the correct macro to mark a location to track, calling
down RecordConstLocation().
- Marking a bunch of nodes as jumble_ignore is actually necessary, or
we may finish by silencing parts of queries that should be
semantically unrelevant to the queries jumbled (ColumnRef is one).
Using a "jumble_ignore" flag is equally invasive to using an
"jumble_include" flag for each field, I am afraid, as the number of
fields in the nodes included in jumbles is pretty equivalent to the
number of fields ignored.  I tend to prefer the approach of ignoring
things though, which is more consistent with the practive for node
read, write and copy.

Anyway, when it comes to the location, another thing that can be
considered here would be to require a node-level flag for the nodes on
which we want to track the location.  This overlaps a bit with the
fields satisfying "($t eq 'int' && $f =~ 'location$')", but it removes
most of the code changes like this one as at the end we only care
about the location for Const nodes:
-   int         location;       /* token location, or -1 if unknown */
+   int         location pg_node_attr(jumble_ignore);   /* token location, or -1
+                                                        * if unknown */

I have taken this approach in v2 of the patch, shaving ~100 lines of
more code as there is no need to mark all these location fields with
"jumble_ignore" anymore, except for Const, of course.

>> There is also an argument where we would want to not include by
>> default new fields added to a Node, but that would require added more
>> pg_node_attr() than what's done here.
>
> I'm concerned about the large number of additional field annotations this
> adds.  We have been careful so far to document the use of each attribute,
> e.g., *why* does a field not need to be copied etc.  This patch adds dozens
> and dozens of annotations without any explanation at all.  Now, the code
> this replaces also has no documentation, but maybe this is the time to add
> some.

That's fair, though it is not doing to buy us much to update all the
nodes with similar small comments, as well.  As far as I know, there
are basiscally three things here: typmods, collation information, and
internal data of the nodes stored during parse-analyze.  I have added
more documentation to track what looks like the most relevant areas.

I have begun running some performance tests with this stuff and HEAD
to see if this leads to any difference in the query ID compilation
(compute_query_id = on, on scissors roughly) with a simple set of
short commands (like BEGIN/COMMIT) or longer ones, and I am seeing a
speedup trend actually (?).  I still need to think more about a set of
tests here, but I think that micro-benchmarking of JumbleQuery() is
the most adapted approach to minimize the noise, with a few nodes of
various sizes (Const, Query, ColumnRef, anything..).

Thoughts?
--
Michael

Attachment

Re: Generating code for query jumbling through gen_node_support.pl

From

Peter Eisentraut

Date:

16 January 2023, 17:13:35

On 13.01.23 08:54, Michael Paquier wrote:
> Using a "jumble_ignore" flag is equally invasive to using an
> "jumble_include" flag for each field, I am afraid, as the number of
> fields in the nodes included in jumbles is pretty equivalent to the
> number of fields ignored.  I tend to prefer the approach of ignoring
> things though, which is more consistent with the practive for node
> read, write and copy.

I concur that jumble_ignore is better.  I suppose you placed the 
jumble_ignore markers to maintain parity with the existing code, but I 
think that some the markers are actually wrong and are just errors of 
omission in the existing code (such as Query.override, to pick a random 
example).  The way you have structured this would allow us to find and 
analyze such errors better.

> Anyway, when it comes to the location, another thing that can be
> considered here would be to require a node-level flag for the nodes on
> which we want to track the location.  This overlaps a bit with the
> fields satisfying "($t eq 'int' && $f =~ 'location$')", but it removes
> most of the code changes like this one as at the end we only care
> about the location for Const nodes:
> -   int         location;       /* token location, or -1 if unknown */
> +   int         location pg_node_attr(jumble_ignore);   /* token location, or -1
> +                                                        * if unknown */
> 
> I have taken this approach in v2 of the patch, shaving ~100 lines of
> more code as there is no need to mark all these location fields with
> "jumble_ignore" anymore, except for Const, of course.

I don't understand why you chose that when we already have an 
established way.  This would just make the jumble annotations 
inconsistent with the other annotations.

I also have two suggestions to make this patch more palatable:

1. Make a separate patch to reformat long comments, like 
835d476fd21bcfb60b055941dee8c3d9559af14c.

2. Make a separate patch to move the jumble support to 
src/backend/nodes/jumblefuncs.c.  (It was probably a mistake that it 
wasn't there to begin with, and some of the errors of omission alluded 
to above are probably caused by it having been hidden away in the wrong 
place.)

Re: Generating code for query jumbling through gen_node_support.pl

From

Michael Paquier

Date:

17 January 2023, 06:48:32

On Mon, Jan 16, 2023 at 03:13:35PM +0100, Peter Eisentraut wrote:
> On 13.01.23 08:54, Michael Paquier wrote:
>> Using a "jumble_ignore" flag is equally invasive to using an
>> "jumble_include" flag for each field, I am afraid, as the number of
>> fields in the nodes included in jumbles is pretty equivalent to the
>> number of fields ignored.  I tend to prefer the approach of ignoring
>> things though, which is more consistent with the practive for node
>> read, write and copy.
>
> I concur that jumble_ignore is better.  I suppose you placed the
> jumble_ignore markers to maintain parity with the existing code, but I think
> that some the markers are actually wrong and are just errors of omission in
> the existing code (such as Query.override, to pick a random example).  The
> way you have structured this would allow us to find and analyze such errors
> better.

Thanks.  Yes, I have made an effort of going down to maintain an exact
compatibility with the existing code for now.  My take is that
removing or adding more things into the jumble deserves its own
discussion.  I think that's a bit better once this code is automated
with the nodes, now it would not be difficult either to adjust HEAD
depending on that.  Only the analysis effort is different.

>> Anyway, when it comes to the location, another thing that can be
>> considered here would be to require a node-level flag for the nodes on
>> which we want to track the location.  This overlaps a bit with the
>> fields satisfying "($t eq 'int' && $f =~ 'location$')", but it removes
>> most of the code changes like this one as at the end we only care
>> about the location for Const nodes:
>> -   int         location;       /* token location, or -1 if unknown */
>> +   int         location pg_node_attr(jumble_ignore);   /* token location, or -1
>> +                                                        * if unknown */
>>
>> I have taken this approach in v2 of the patch, shaving ~100 lines of
>> more code as there is no need to mark all these location fields with
>> "jumble_ignore" anymore, except for Const, of course.
>
> I don't understand why you chose that when we already have an established
> way.  This would just make the jumble annotations inconsistent with the
> other annotations.

Because we don't want to track the location of all the nodes!  If we
do that, pg_stat_statements would begin to parameterize a lot more
things, from column aliases to full contents of IN or WITH clauses.
The root point is that we only want to track the jumble location for
Const nodes now.  In order to do that, there are two approaches:
- Mark all the locations with jumble_ignore: more invasive, but
it requires only one per-field attribute with "jumble_ignore".  This
is what v1 did.
- Mark only the fields where we want to track the location with a
second node type, like "jumble_location".  We could restrict that
depending on the field type or its name on top of checking the field
attribute, whatever.  This is what v2 did.

Which approach do you prefer?  Marking all the locations we don't want
with jumble_ignore, or introduce a second attribute with
jumble_location.  I'd tend to prefer the approach of minimizing the
number of node and field attributes, FWIW.  Now you have worked on
this area previously, so your view may be more adapted than my
thinking process

The long-term perspective is that I'd like to extend the tracking of
the locations to more DDL nodes, like parameters of SET statements or
parts of CALL statements.  Not to mention that this makes the work of
forks easier.  This is a separate discussion.

> I also have two suggestions to make this patch more palatable:
>
> 1. Make a separate patch to reformat long comments, like
> 835d476fd21bcfb60b055941dee8c3d9559af14c.
>
> 2. Make a separate patch to move the jumble support to
> src/backend/nodes/jumblefuncs.c.  (It was probably a mistake that it wasn't
> there to begin with, and some of the errors of omission alluded to above are
> probably caused by it having been hidden away in the wrong place.)

Both suggestions make sense.  I'll shape the next series of the patch
to do something among these lines.

My question about the location tracking greatly influences the first
patch (comment reformating) and third patch (automated generation of
the code) of the series, so do you have a preference about that?
--
Michael

Attachment

signature.asc

Re: Generating code for query jumbling through gen_node_support.pl

From

Peter Eisentraut

Date:

17 January 2023, 10:43:44

On 17.01.23 04:48, Michael Paquier wrote:
>>> Anyway, when it comes to the location, another thing that can be
>>> considered here would be to require a node-level flag for the nodes on
>>> which we want to track the location.  This overlaps a bit with the
>>> fields satisfying "($t eq 'int' && $f =~ 'location$')", but it removes
>>> most of the code changes like this one as at the end we only care
>>> about the location for Const nodes:
>>> -   int         location;       /* token location, or -1 if unknown */
>>> +   int         location pg_node_attr(jumble_ignore);   /* token location, or -1
>>> +                                                        * if unknown */
>>>
>>> I have taken this approach in v2 of the patch, shaving ~100 lines of
>>> more code as there is no need to mark all these location fields with
>>> "jumble_ignore" anymore, except for Const, of course.
>>
>> I don't understand why you chose that when we already have an established
>> way.  This would just make the jumble annotations inconsistent with the
>> other annotations.
> 
> Because we don't want to track the location of all the nodes!  If we
> do that, pg_stat_statements would begin to parameterize a lot more
> things, from column aliases to full contents of IN or WITH clauses.
> The root point is that we only want to track the jumble location for
> Const nodes now.  In order to do that, there are two approaches:
> - Mark all the locations with jumble_ignore: more invasive, but
> it requires only one per-field attribute with "jumble_ignore".  This
> is what v1 did.

Ok, I understand now, and I agree with this approach over the opposite. 
I was confused because the snippet you showed above used 
"jumble_ignore", but your patch is correct as it uses "jumble_location".

That said, the term "jumble" is really weird, because in the sense that 
we are using it here it means, approximately, "to mix together", "to 
unify".  So what we are doing with the Const nodes is really to *not* 
jumble the location, but for all other node types we are jumbling the 
location.  At least that is my understanding.

Re: Generating code for query jumbling through gen_node_support.pl

From

Michael Paquier

Date:

17 January 2023, 10:52:28

On Tue, Jan 17, 2023 at 08:43:44AM +0100, Peter Eisentraut wrote:
> Ok, I understand now, and I agree with this approach over the opposite. I
> was confused because the snippet you showed above used "jumble_ignore", but
> your patch is correct as it uses "jumble_location".

Okay.  I'll refresh the patch set so as we have only "jumble_ignore",
then, like v1, with preparatory patches for what you mentioned and
anything that comes into mind.

> That said, the term "jumble" is really weird, because in the sense that we
> are using it here it means, approximately, "to mix together", "to unify".
> So what we are doing with the Const nodes is really to *not* jumble the
> location, but for all other node types we are jumbling the location.  At
> least that is my understanding.

I am quite familiar with this term, FWIW.  That's what we've inherited
from the days where this has been introduced in pg_stat_statements.
--
Michael

Attachment

signature.asc

Re: Generating code for query jumbling through gen_node_support.pl

From

Michael Paquier

Date:

18 January 2023, 10:04:23

On Tue, Jan 17, 2023 at 04:52:28PM +0900, Michael Paquier wrote:
> On Tue, Jan 17, 2023 at 08:43:44AM +0100, Peter Eisentraut wrote:
>> Ok, I understand now, and I agree with this approach over the opposite. I
>> was confused because the snippet you showed above used "jumble_ignore", but
>> your patch is correct as it uses "jumble_location".
>
> Okay.  I'll refresh the patch set so as we have only "jumble_ignore",
> then, like v1, with preparatory patches for what you mentioned and
> anything that comes into mind.

This is done as of the patch series v3 attached:
- 0001 reformats all the comments of the nodes.
- 0002 moves the current files for query jumble as of queryjumble.c ->
queryjumblefuncs.c and utils/queryjumble.h -> nodes/queryjumble.h.
- 0003 is the core feature, where I have done a second pass over the
nodes to make sure that things map with HEAD, incorporating the extra
docs coming from v2, adding a bit more.

>> That said, the term "jumble" is really weird, because in the sense that we
>> are using it here it means, approximately, "to mix together", "to unify".
>> So what we are doing with the Const nodes is really to *not* jumble the
>> location, but for all other node types we are jumbling the location.  At
>> least that is my understanding.
>
> I am quite familiar with this term, FWIW.  That's what we've inherited
> from the days where this has been introduced in pg_stat_statements.

I have renamed the node attributes to query_jumble_ignore and
no_query_jumble at the end, after considering Peter's point that only
"jumble" could be fuzzy here.  The file names are changed in
consequence.

While doing all that, I have done some micro-benchmarking of
JumbleQuery(), making it loop 50M times on my laptop each time a query
ID is computed (hideous hack with a loop in queryjumble.c):
- For non-utility queries, aka queries that go through
JumbleQueryInternal(), I am measuring a repeatable ~10% improvement
with the generated code over HEAD, which is kind of nice.  I have
tested a few DMLs and simple SELECTs, still it looks like a trend.
- For utility queries, the new code is competing against
hash_any_extended() with the query string, which is going to be hard
to beat..  I am measuring what looks like a 5 times slowdown, at
least, and more depending on the depth of the query tree.  That's not
surprising.  On a 50M loop, this comes down to compare a computation
of 100ns to 5ns for a 20-time slowdown for example, still this could
justify the addition of a GUC to control whether utility queries have
their query ID compiled depending on their nodes or their string
hash, as this could become noticeable in OLTP workloads with loads
of short statements and BEGIN/COMMIT queries?

Thoughts or comments?
--
Michael

Attachment

Re: Generating code for query jumbling through gen_node_support.pl

From

Peter Eisentraut

Date:

19 January 2023, 11:42:03

On 18.01.23 08:04, Michael Paquier wrote:
> On Tue, Jan 17, 2023 at 04:52:28PM +0900, Michael Paquier wrote:
>> On Tue, Jan 17, 2023 at 08:43:44AM +0100, Peter Eisentraut wrote:
>>> Ok, I understand now, and I agree with this approach over the opposite. I
>>> was confused because the snippet you showed above used "jumble_ignore", but
>>> your patch is correct as it uses "jumble_location".
>>
>> Okay.  I'll refresh the patch set so as we have only "jumble_ignore",
>> then, like v1, with preparatory patches for what you mentioned and
>> anything that comes into mind.
> 
> This is done as of the patch series v3 attached:
> - 0001 reformats all the comments of the nodes.
> - 0002 moves the current files for query jumble as of queryjumble.c ->
> queryjumblefuncs.c and utils/queryjumble.h -> nodes/queryjumble.h.
> - 0003 is the core feature, where I have done a second pass over the
> nodes to make sure that things map with HEAD, incorporating the extra
> docs coming from v2, adding a bit more.

This patch structure looks good.

>>> That said, the term "jumble" is really weird, because in the sense that we
>>> are using it here it means, approximately, "to mix together", "to unify".
>>> So what we are doing with the Const nodes is really to *not* jumble the
>>> location, but for all other node types we are jumbling the location.  At
>>> least that is my understanding.
>>
>> I am quite familiar with this term, FWIW.  That's what we've inherited
>> from the days where this has been introduced in pg_stat_statements.
> 
> I have renamed the node attributes to query_jumble_ignore and
> no_query_jumble at the end, after considering Peter's point that only
> "jumble" could be fuzzy here.  The file names are changed in
> consequence.

I see that in the 0003 patch, most location fields now have an explicit 
markup with query_jumble_ignore.  I thought we had previously resolved 
to consider location fields to be automatically ignored unless 
explicitly included (like for the Const node).  This appears to invert 
that?  Am I missing something?

Re: Generating code for query jumbling through gen_node_support.pl

From

Michael Paquier

Date:

19 January 2023, 11:46:37

On Thu, Jan 19, 2023 at 09:42:03AM +0100, Peter Eisentraut wrote:
> I see that in the 0003 patch, most location fields now have an explicit
> markup with query_jumble_ignore.  I thought we had previously resolved to
> consider location fields to be automatically ignored unless explicitly
> included (like for the Const node).  This appears to invert that?  Am I
> missing something?

My misunderstanding then, I thought that you were OK with what was
part of v1, where all these fields was marked as "ignore".  But you
actually prefer v2, with the second field "location" on top of
"ignore".  I can update 0003 to refresh that.

Would you be OK if I apply 0001 (with the comments of the locations
still reshaped to ease future property additions) and 0002?
--
Michael

Attachment

signature.asc

Re: Generating code for query jumbling through gen_node_support.pl

From

Michael Paquier

Date:

20 January 2023, 07:35:15

On Thu, Jan 19, 2023 at 09:42:03AM +0100, Peter Eisentraut wrote:
> I see that in the 0003 patch, most location fields now have an explicit
> markup with query_jumble_ignore.  I thought we had previously resolved to
> consider location fields to be automatically ignored unless explicitly
> included (like for the Const node).  This appears to invert that?  Am I
> missing something?

As a result, I have rebased the patch set to use the two-attribute
approach: query_jumble_ignore and query_jumble_location.

On top of the three previous patches, I am adding 0004 to implement a
GUC able to switch the computation of the utility statements between
what I am calling "string" to compute the query IDs based on the hash
of the query string and the previous default, or "jumble", to use the
parsed tree, with a few more tests to see the difference.  Perhaps it
is not worth bothering, but it could be possible that some users don't
want to pay the penalty of doing the query jumbling with the parsed
tree for utilities, as well..
--
Michael

Attachment

Re: Generating code for query jumbling through gen_node_support.pl

From

Peter Eisentraut

Date:

20 January 2023, 13:56:00

On 20.01.23 05:35, Michael Paquier wrote:
> On Thu, Jan 19, 2023 at 09:42:03AM +0100, Peter Eisentraut wrote:
>> I see that in the 0003 patch, most location fields now have an explicit
>> markup with query_jumble_ignore.  I thought we had previously resolved to
>> consider location fields to be automatically ignored unless explicitly
>> included (like for the Const node).  This appears to invert that?  Am I
>> missing something?
> As a result, I have rebased the patch set to use the two-attribute
> approach: query_jumble_ignore and query_jumble_location.

Structurally, this looks okay to me now.

In your 0001 patch, most of the comment reformattings for location 
fields are no longer needed, so you should undo those.

The 0002 patch looks good.

Those two could be committed with those adjustments, I think.

I'll read the 0003 again more carefully.  I haven't studied the new 0004 
yet.

Re: Generating code for query jumbling through gen_node_support.pl

From

Michael Paquier

Date:

21 January 2023, 06:35:52

On Fri, Jan 20, 2023 at 11:56:00AM +0100, Peter Eisentraut wrote:
> In your 0001 patch, most of the comment reformattings for location fields
> are no longer needed, so you should undo those.
>
> The 0002 patch looks good.

Okay, I have gone through these two again and applied what I had.
0001 has been cleaned up of the extra comment moves for the
locations.  Now, I have kept a few changes for some of the nodes to
have some consistency with the other fields, in the case where most of
the fields at the end of the structures have to be marked with new
node attributes.  This made the style of the header a bit more
elegant, IMV.

> I'll read the 0003 again more carefully.  I haven't studied the new 0004
> yet.

Thanks, again.  Rebased version attached.
--
Michael

Attachment

Re: Generating code for query jumbling through gen_node_support.pl

From

Peter Eisentraut

Date:

23 January 2023, 16:27:13

On 21.01.23 04:35, Michael Paquier wrote:
>> I'll read the 0003 again more carefully.  I haven't studied the new 0004
>> yet.
> 
> Thanks, again.  Rebased version attached.

A couple of small fixes are attached.

There is something weird in _jumbleNode().  There are two switch 
(nodeTag(expr)) statements.  Maybe that's intentional, but then it 
should be commented better, because now it looks more like an editing 
mistake.

The handling of T_RangeTblEntry could be improved.  In other contexts we 
have for example a custom_copy attribute, which generates the switch 
entry but not the function.  Something like this could be useful here too.

Otherwise, this looks ok.  I haven't checked whether it maintains the 
exact behavior from before.  What is the test coverage situation for this?

For the 0004 patch, it should be documented why one would want one 
behavior or the other.  That's totally unclear right now.

I think if we are going to accept 0004, then it might be better to 
combine it with 0003.  Otherwise, 0004 is just undoing a lot of the code 
structure changes in JumbleQuery() that 0003 did.

Attachment

0001-fixup-Support-for-automated-query-jumble-with-all-No.patch

Re: Generating code for query jumbling through gen_node_support.pl

From

Michael Paquier

Date:

24 January 2023, 09:57:56

On Mon, Jan 23, 2023 at 02:27:13PM +0100, Peter Eisentraut wrote:
> A couple of small fixes are attached.

Thanks.

> There is something weird in _jumbleNode().  There are two switch
> (nodeTag(expr)) statements.  Maybe that's intentional, but then it should be
> commented better, because now it looks more like an editing mistake.

This one is intentional, so as it is possible to track correctly the
highest param ID found while browsing the nodes.  IMO it would be
confusing to add that into gen_node_support.pl.  Another thing that
could be done is to switch Param to have a custom implementation, like
RangeTblEntry, though this removes the automation around the creation
of _jumbleParam().  I have clarified the comments around that.

> The handling of T_RangeTblEntry could be improved.  In other contexts we
> have for example a custom_copy attribute, which generates the switch entry
> but not the function.  Something like this could be useful here too.

Hmm.  Okay.  Fine by me.

> Otherwise, this looks ok.  I haven't checked whether it maintains the exact
> behavior from before.  What is the test coverage situation for this?

0003 taken in isolation has some minimal coverage through
pg_stat_statements, though it turns around 15% with compute_query_id =
auto that would enforce the jumbling path only when pg_stat_statements
uses it.  Still, my plan here is to enforce the loading of
pg_stat_statements with compute_query_id = regress and
utility_query_id = jumble (if needed) in a new buildfarm machine,
because that's the cheapest path.  An extra possibility is to have
pg_regress kicked in a new TAP test with these settings, but that's
costly and we have already two of these :/  Another possibility is to
plug in that into 027_stream_regress or the pg_upgrade test suite with
new settings :/

Anyway, the regression tests of pg_stat_statements should be extended
a bit to cover more node types by default (Say COPY with DMLs for the
InsertStmt & co) to look at how these are showing up once normalized
using their parsed query, and we don't do much around that now.
Normalizing more DDLs should use this path, as well.

> For the 0004 patch, it should be documented why one would want one behavior
> or the other.  That's totally unclear right now.

I am not 100% sure whether we should have this part at the end, but as
an exit path in case somebody complains about the extra load with the
automated jumbling compared to the hash of the query strings, that can
be used as a backup.  Anyway, attached is what would be a
clarification.

> I think if we are going to accept 0004, then it might be better to combine
> it with 0003.  Otherwise, 0004 is just undoing a lot of the code structure
> changes in JumbleQuery() that 0003 did.

Makes sense.  That would be my intention if 0004 is the most
acceptable and splitting things makes things a bit easier to review.
--
Michael

Attachment

Re: Generating code for query jumbling through gen_node_support.pl

From

Michael Paquier

Date:

25 January 2023, 03:08:17

On Tue, Jan 24, 2023 at 03:57:56PM +0900, Michael Paquier wrote:
> Makes sense.  That would be my intention if 0004 is the most
> acceptable and splitting things makes things a bit easier to review.

There was a silly mistake in 0004 where the jumbling code relied on
compute_query_id rather than utility_query_id, so fixed and rebased as
of v7 attached.
--
Michael

Attachment

Re: Generating code for query jumbling through gen_node_support.pl

From

Peter Eisentraut

Date:

26 January 2023, 11:37:13

On 24.01.23 07:57, Michael Paquier wrote:
>> For the 0004 patch, it should be documented why one would want one behavior
>> or the other.  That's totally unclear right now.
> I am not 100% sure whether we should have this part at the end, but as
> an exit path in case somebody complains about the extra load with the
> automated jumbling compared to the hash of the query strings, that can
> be used as a backup.  Anyway, attached is what would be a
> clarification.

Ok, the documentation make sense now.  I wonder what the performance 
impact is.  Probably, nobody cares about microoptimizing CREATE TABLE 
statements.  But BEGIN/COMMIT could matter.  However, whatever you do in 
between the BEGIN and COMMIT will already be jumbled, so you're already 
paying the overhead.  Hopefully, jumbling such simple commands will have 
no noticeable overhead.

In other words, we should test this and hopefully get rid of the 
'string' method.

Re: Generating code for query jumbling through gen_node_support.pl

From

Peter Eisentraut

Date:

26 January 2023, 11:39:05

On 25.01.23 01:08, Michael Paquier wrote:
> On Tue, Jan 24, 2023 at 03:57:56PM +0900, Michael Paquier wrote:
>> Makes sense.  That would be my intention if 0004 is the most
>> acceptable and splitting things makes things a bit easier to review.
> 
> There was a silly mistake in 0004 where the jumbling code relied on
> compute_query_id rather than utility_query_id, so fixed and rebased as
> of v7 attached.

Overall, this looks good to me.

There are a couple of repetitive comments, like "typmod and collation 
information are irrelevant for the query jumbling".  This applies to all 
nodes, so we don't need to repeat it for a number of nodes (and then not 
mention it for other nodes).  Maybe there should be a central place 
somewhere that describes "these kinds of fields should normally be ignored".

Re: Generating code for query jumbling through gen_node_support.pl

From

Michael Paquier

Date:

27 January 2023, 05:59:47

On Thu, Jan 26, 2023 at 09:37:13AM +0100, Peter Eisentraut wrote:
> Ok, the documentation make sense now.  I wonder what the performance impact
> is.  Probably, nobody cares about microoptimizing CREATE TABLE statements.
> But BEGIN/COMMIT could matter.  However, whatever you do in between the
> BEGIN and COMMIT will already be jumbled, so you're already paying the
> overhead.  Hopefully, jumbling such simple commands will have no noticeable
> overhead.
>
> In other words, we should test this and hopefully get rid of the 'string'
> method.

Yep.  I have mentioned a few numbers upthread, and this deserves
discussion.

FYI, I have done more micro-benchmarking to compare both methods for
utility queries by hijacking JumbleQuery() to run the computation in a
tight loop run N times (could not come up with a better idea to avoid
the repeated palloc/pfree overhead), as the path to stress is
_jumbleNode().  See the attached, that should be able to apply on top
of the latest patch set (named as .txt to not feed it to the CF bot,
and need to recompile to switch the iteration).

Using that, I can compile the following results for various cases (-O2
and compute_query_id=on):
          query          |  mode  | iterations | avg_runtime_ns | avg_jumble_ns
-------------------------+--------+------------+----------------+---------------
 begin                   | string |   50000000 |        4.53116 |          4.54
 begin                   | jumble |   50000000 |       30.94578 |         30.94
 commit                  | string |   50000000 |        4.76004 |          4.74
 commit                  | jumble |   50000000 |        31.4791 |         31.48
 create table 1 column   | string |   50000000 |        7.22836 |          7.08
 create table 1 column   | jumble |   50000000 |      152.10852 |        151.96
 create table 5 columns  | string |   50000000 |       12.43412 |         12.28
 create table 5 columns  | jumble |   50000000 |      352.88976 |         349.1
 create table 20 columns | string |    5000000 |         49.591 |          48.2
 create table 20 columns | jumble |    5000000 |      2272.4066 |          2271
 drop table 1 column     | string |   50000000 |        6.70538 |          6.56
 drop table 1 column     | jumble |   50000000 |          50.38 |         50.24
 drop table 5 columns    | string |   50000000 |        6.88256 |          6.74
 drop table 5 columns    | jumble |   50000000 |       50.02898 |          49.9
 SET work_mem            | string |   50000000 |        7.28752 |          7.28
 SET work_mem            | jumble |   50000000 |       91.66588 |         91.64
(16 rows)

avg_runtime_ns is (query runtime / iterations) and avg_jumble_ns is
the same with the difference between the start/end logs in the txt
patch attached.  The overhead to run the query does not matter much if
you compare both.  The time it takes to run a jumble is correlated to
the number of nodes to go through for each query, and there is a
larger gap for more nodes to go through.  Well, a simple "begin" or
"commit" query has its computation time increase from 4ns to 30ns in
average which would be unnoticeable.  The gap is larger for larger
nodes, like SET, still we jump from 7ns to 90ns in this case.  DDLs
take the most hit with this method, where a 20-column CREATE TABLE
jumps from 50ns to 2us (note that the iteration is 10 times lower
here).

At the end, that would be unnoticeable for the average user, I guess,
but here are the numbers I get on my laptop :)
--
Michael

Attachment

Re: Generating code for query jumbling through gen_node_support.pl

From

Michael Paquier

Date:

27 January 2023, 06:07:36

On Thu, Jan 26, 2023 at 09:39:05AM +0100, Peter Eisentraut wrote:
> There are a couple of repetitive comments, like "typmod and collation
> information are irrelevant for the query jumbling".  This applies to all
> nodes, so we don't need to repeat it for a number of nodes (and then not
> mention it for other nodes).  Maybe there should be a central place
> somewhere that describes "these kinds of fields should normally be ignored".

The place that would make the most sense to me to centralize this
knowlegde is nodes.h itself, because that's where the node attributes
are defined?
--
Michael

Attachment

signature.asc

Re: Generating code for query jumbling through gen_node_support.pl

From

Michael Paquier

Date:

28 January 2023, 06:32:21

On Tue, Jan 24, 2023 at 03:57:56PM +0900, Michael Paquier wrote:
> Still, my plan here is to enforce the loading of
> pg_stat_statements with compute_query_id = regress and
> utility_query_id = jumble (if needed) in a new buildfarm machine,

Actually, about this specific point, I have been able to set up a
buildfarm machine that uses shared_preload_libraries =
pg_stat_statements and compute_query_id = regress in the base
configuration, which is uncovered yet.  This works as long as one sets
up EXTRA_INSTALL => "contrib/pg_stat_statements" in build_env.
--
Michael

Attachment

signature.asc

Re: Generating code for query jumbling through gen_node_support.pl

From

Michael Paquier

Date:

30 January 2023, 13:39:30

On Fri, Jan 27, 2023 at 11:59:47AM +0900, Michael Paquier wrote:
> Using that, I can compile the following results for various cases (-O2
> and compute_query_id=on):
>           query          |  mode  | iterations | avg_runtime_ns | avg_jumble_ns
> -------------------------+--------+------------+----------------+---------------
>  begin                   | string |   50000000 |        4.53116 |          4.54
>  begin                   | jumble |   50000000 |       30.94578 |         30.94
>  commit                  | string |   50000000 |        4.76004 |          4.74
>  commit                  | jumble |   50000000 |        31.4791 |         31.48
>  create table 1 column   | string |   50000000 |        7.22836 |          7.08
>  create table 1 column   | jumble |   50000000 |      152.10852 |        151.96
>  create table 5 columns  | string |   50000000 |       12.43412 |         12.28
>  create table 5 columns  | jumble |   50000000 |      352.88976 |         349.1
>  create table 20 columns | string |    5000000 |         49.591 |          48.2
>  create table 20 columns | jumble |    5000000 |      2272.4066 |          2271
>  drop table 1 column     | string |   50000000 |        6.70538 |          6.56
>  drop table 1 column     | jumble |   50000000 |          50.38 |         50.24
>  drop table 5 columns    | string |   50000000 |        6.88256 |          6.74
>  drop table 5 columns    | jumble |   50000000 |       50.02898 |          49.9
>  SET work_mem            | string |   50000000 |        7.28752 |          7.28
>  SET work_mem            | jumble |   50000000 |       91.66588 |         91.64
> (16 rows)

Just to close the loop here, I have done more measurements to compare
the jumble done for some DMLs and some SELECTs between HEAD and the
patch (forgot to post some last Friday).  Both methods show comparable
results:
        query         |  mode  | iterations | avg_runtime_ns | avg_jumble_ns
----------------------+--------+------------+----------------+---------------
 insert table 10 cols | master |   50000000 |      377.17878 |        377.04
 insert table 10 cols | jumble |   50000000 |      409.47924 |        409.34
 insert table 20 cols | master |   50000000 |      692.94924 |         692.8
 insert table 20 cols | jumble |   50000000 |       710.0901 |        709.96
 insert table 5 cols  | master |   50000000 |      232.44308 |         232.3
 insert table 5 cols  | jumble |   50000000 |      253.49854 |        253.36
 select 10 cols       | master |   50000000 |      449.13608 |        383.36
 select 10 cols       | jumble |   50000000 |      491.61912 |        323.86
 select 5 cols        | master |   50000000 |        277.477 |        277.46
 select 5 cols        | jumble |   50000000 |      323.88152 |        323.86
(10 rows)

The averages are in ns, so the difference does not bother me much.
There may be some noise mixed in that ;)

(Attached is the tweak I have applied on HEAD to get some numbers.)
--
Michael

Attachment

Re: Generating code for query jumbling through gen_node_support.pl

From

Peter Eisentraut

Date:

30 January 2023, 13:46:06

On 27.01.23 04:07, Michael Paquier wrote:
> On Thu, Jan 26, 2023 at 09:39:05AM +0100, Peter Eisentraut wrote:
>> There are a couple of repetitive comments, like "typmod and collation
>> information are irrelevant for the query jumbling".  This applies to all
>> nodes, so we don't need to repeat it for a number of nodes (and then not
>> mention it for other nodes).  Maybe there should be a central place
>> somewhere that describes "these kinds of fields should normally be ignored".
> 
> The place that would make the most sense to me to centralize this
> knowlegde is nodes.h itself, because that's where the node attributes
> are defined?

Either that or src/backend/nodes/README.

Re: Generating code for query jumbling through gen_node_support.pl

From

Peter Eisentraut

Date:

30 January 2023, 13:48:45

On 27.01.23 03:59, Michael Paquier wrote:
> At the end, that would be unnoticeable for the average user, I guess,
> but here are the numbers I get on my laptop :)

Personally, I think we do not want the two jumble methods in parallel.

Maybe there are other opinions.

I'm going to set this thread as "Ready for Committer".  Either wait a 
bit for more feedback on this topic, or just go ahead with either 
solution.  We can leave it as a semi-open item for reconsideration later.

Re: Generating code for query jumbling through gen_node_support.pl

From

Michael Paquier

Date:

31 January 2023, 07:21:40

On Mon, Jan 30, 2023 at 11:46:06AM +0100, Peter Eisentraut wrote:
> Either that or src/backend/nodes/README.

The README holds nothing about the node attributes currently, so
nodes.h feels most adapted here at the end..

Thanks,
--
Michael

Attachment

signature.asc

Re: Generating code for query jumbling through gen_node_support.pl

From

Michael Paquier

Date:

31 January 2023, 09:40:56

On Mon, Jan 30, 2023 at 11:48:45AM +0100, Peter Eisentraut wrote:
> I'm going to set this thread as "Ready for Committer".  Either wait a bit
> for more feedback on this topic, or just go ahead with either solution.  We
> can leave it as a semi-open item for reconsideration later.

All the measurements I have done for the last couple of days point me
in the direction that there is no need for an extra node based on the
averaged computation times (did a few more today with some long CREATE
FUNCTION, VIEW or event trigger queries, for example).  Agreed to add
the extra option as something to consider at some point during beta,
as long as it is fresh.  I am not convinced that it will be necessary
but let's see how it goes.

While reviewing all the nodes, I have noticed two mistakes for a few
things marked as query_jumble_ignore but they should not:
- orderClause in WindowClause
- aliascolnames in CommonTableExpr
The rest was fine.

varnullingrels has been added very recently, and there was in
queryjumblefuncs.c a comment explaining why it should be ignored.
This has been moved to nodes.h, like the others.

With all that in mind, I have spent my day polishing that and doing a
close lookup, and the patch has been applied.  Thanks a lot!
--
Michael

Attachment

signature.asc

Re: Generating code for query jumbling through gen_node_support.pl

From

Michael Paquier

Date:

05 February 2023, 13:40:57

On Tue, Jan 31, 2023 at 03:40:56PM +0900, Michael Paquier wrote:
> With all that in mind, I have spent my day polishing that and doing a
> close lookup, and the patch has been applied.  Thanks a lot!

While working on a different patch, I have noticed a small issue in
the way the jumbling happens for A_Const, where ValUnion was not
getting jumbled correctly.  This caused a few statements that rely on
this node to compile the same query IDs when using different values.
The full contents of pg_stat_statements for a regression database
point to:
- SET.
- COPY with queries.
- CREATE TABLE with partition bounds and default expressions.

This was causing some confusion in pg_stat_statements where some
utility queries would be incorrectly reported, and at this point the
intention is to keep this area compatible with the string-based method
when it comes to the values.  Like read, write and copy nodes, we need
to compile the query ID based on the type of the value, which cannot
be automated.  Attached is a patch to fix this issue with some
regression tests, that I'd like to get fixed before moving on with
more business in pg_stat_statements (aka properly show Const nodes for
utilities with normalized queries).

Comments or objections are welcome, of course.

(FWIW, I'd like to think that there is an argument to normalize the
A_Const nodes for a portion of the DDL queries, by ignoring their
values in the query jumbling and mark a location, which would be
really useful for some workloads, but that's a separate discussion I
am keeping for later.)
--
Michael

Attachment

Re: Generating code for query jumbling through gen_node_support.pl

From

Michael Paquier

Date:

07 February 2023, 03:16:36

On Sun, Feb 05, 2023 at 07:40:57PM +0900, Michael Paquier wrote:
> Comments or objections are welcome, of course.

Done this part.

> (FWIW, I'd like to think that there is an argument to normalize the
> A_Const nodes for a portion of the DDL queries, by ignoring their
> values in the query jumbling and mark a location, which would be
> really useful for some workloads, but that's a separate discussion I
> am keeping for later.)

And I have a patch pretty much OK for this part of the discussion.
Will post soon..
--
Michael

Attachment

signature.asc

Re: Generating code for query jumbling through gen_node_support.pl

From

Tom Lane

Date:

08 February 2023, 01:32:07

Michael Paquier <michael@paquier.xyz> writes:
> With all that in mind, I have spent my day polishing that and doing a
> close lookup, and the patch has been applied.  Thanks a lot!

I have just noticed that this patch is generating useless jumbling
code for node types such as Path nodes and other planner infrastructure
nodes.  That no doubt contributes to the miserable code coverage rating
for queryjumblefuncs.*.c, which have enough dead lines to drag down the
overall rating for all of backend/nodes/.  Shouldn't a little more
attention have been paid to excluding entire node classes if they can
never appear in Query?

            regards, tom lane

Re: Generating code for query jumbling through gen_node_support.pl

From

Michael Paquier

Date:

08 February 2023, 09:47:51

On Tue, Feb 07, 2023 at 05:32:07PM -0500, Tom Lane wrote:
> I have just noticed that this patch is generating useless jumbling
> code for node types such as Path nodes and other planner infrastructure
> nodes.  That no doubt contributes to the miserable code coverage rating
> for queryjumblefuncs.*.c, which have enough dead lines to drag down the
> overall rating for all of backend/nodes/.  Shouldn't a little more
> attention have been paid to excluding entire node classes if they can
> never appear in Query?

This one was intentional to let extensions play with jumbling of such
nodes, but perhaps you are right that it makes little sense at this
stage.  If there is an ask for it later, though..  Using
shared_preload_libraries = pg_stat_statements and compute_query_id =
regress shows that numbers go up to 60% for funcs.c and 30% for
switch.c.  Removing nodes like as of the attached brings these numbers
respectively up to 94.5% and 93.5% for a check.  With a check-world, I
measure respectively 96.7% and 96.1% because there is more coverage
for extensions, ALTER SYSTEM and database commands, roughly.

This could also be a file-level policy by enforcing no_query_jumble in
gen_node_support.pl by looking at the header name, still I favor
no_query_jumble to keep all the pg_node_attr() in a single area with
the headers.  Note that the attached includes in 0002 the tweak to
enforce the computation with compute_query_id if you want to test it
yourself and check my numbers.  This is useful IMO as we could detect
missing nodes for all queries (utilities or not), still doing this
change may deserve a separate discussion.  Note that I am not seeing
any "unrecognized node type" in any of the logs for any queries.

As a side note, should we be more aggressive with the tests related to
the jumbling code since it is now in core?  For example, XmlExpr or
MinMaxExpr, which are part of Query nodes that can be jumbled even in
older branches, have zero coverage by default as
coverage.postgresql.org reports, because everything goes through
pg_stat_statements and it includes no queries with such nodes.  My
buildfarm member batta makes sure to stress that no nodes are missing
by overloading pg_stat_statements and compute_query_id = regress in
the configuration, so no nodes are missing from the computation, still
the coverage could be better across the board.  Expanding the tests of
pg_stat_statements is needed in this area for some time, still could
there be a point in switching compute_query_id = regress so as it is a
synonym of "on" without the EXPLAIN output rather than "auto"?  If the
setting is enforced by pg_regress, the coverage of queryjumble.c would
be so much better, at least..
--
Michael

Attachment

Re: Generating code for query jumbling through gen_node_support.pl

From

Andres Freund

Date:

08 February 2023, 10:01:03

Hi,

On 2023-02-08 15:47:51 +0900, Michael Paquier wrote:
> This one was intentional to let extensions play with jumbling of such
> nodes, but perhaps you are right that it makes little sense at this
> stage.  If there is an ask for it later, though..  Using
> shared_preload_libraries = pg_stat_statements and compute_query_id =
> regress shows that numbers go up to 60% for funcs.c and 30% for
> switch.c.  Removing nodes like as of the attached brings these numbers
> respectively up to 94.5% and 93.5% for a check.  With a check-world, I
> measure respectively 96.7% and 96.1% because there is more coverage
> for extensions, ALTER SYSTEM and database commands, roughly.

Given that we already pay the price of multiple regress runs, and that
jumbling is now really a core feature, perhaps we should enable
pg_stat_statements in pg_upgrade or 027_stream_regress.pl? I'd hope it
wouldn't add a meaningful amount of time?  A tiny bit of verification at the
end should also be ok.

Both pg_upgrade and 027_stream_regress.pl have some advantages. The former
would test pg_upgrade interactions with shared_preload_libraries, the latter
could do some basic checks of pg_stat_statements on a standby.

Greetings,

Andres Freund

Re: Generating code for query jumbling through gen_node_support.pl

From

Michael Paquier

Date:

08 February 2023, 10:16:29

On Tue, Feb 07, 2023 at 11:01:03PM -0800, Andres Freund wrote:
> Given that we already pay the price of multiple regress runs, and that
> jumbling is now really a core feature, perhaps we should enable
> pg_stat_statements in pg_upgrade or 027_stream_regress.pl? I'd hope it
> wouldn't add a meaningful amount of time?  A tiny bit of verification at the
> end should also be ok.

Yeah, I have briefly mentioned this part upthread:
https://www.postgresql.org/message-id/Y8+BdCOjxykre5es@paquier.xyz

It would not, I guess, as long as pg_stat_statements.max is set large
enough in the TAP test.  There are currently 21k~22k entries in the
regression database, much larger than the default of 5000 so this may
become an issue on small-ish machines if left untouched even in a TAP
test.

> Both pg_upgrade and 027_stream_regress.pl have some advantages. The former
> would test pg_upgrade interactions with shared_preload_libraries, the latter
> could do some basic checks of pg_stat_statements on a standby.

Yes, there could be more checks, potentially useful for both cases, so
I may choose both at the end of the day.  Checking the consistency of
the contents of pg_stat_statements across a pg_upgrade run for the
same version may be one thing?  I am not sure if it is that
interesting, TBH, still that's one idea :)
--
Michael

Attachment

signature.asc

Re: Generating code for query jumbling through gen_node_support.pl

From

Michael Paquier

Date:

10 February 2023, 01:49:45

On Wed, Feb 08, 2023 at 03:47:51PM +0900, Michael Paquier wrote:
> This one was intentional to let extensions play with jumbling of such
> nodes, but perhaps you are right that it makes little sense at this
> stage.  If there is an ask for it later, though..  Using
> shared_preload_libraries = pg_stat_statements and compute_query_id =
> regress shows that numbers go up to 60% for funcs.c and 30% for
> switch.c.  Removing nodes like as of the attached brings these numbers
> respectively up to 94.5% and 93.5% for a check.  With a check-world, I
> measure respectively 96.7% and 96.1% because there is more coverage
> for extensions, ALTER SYSTEM and database commands, roughly.

Tom, did you get a chance to look at what is proposed here and expand
the use of query_jumble_ignore in the definitions of the nodes rather
than have an enforced per-file policy in gen_node_support.pl?  The
attached improves the situation by as much as we can, still the
numbers reported by coverage.postgresql.org won't go that up until we
enforce query jumbling testing on the regression database (something
we should have done since this code moved to core, I guess).
--
Michael

Attachment

signature.asc

Re: Generating code for query jumbling through gen_node_support.pl

From

Tom Lane

Date:

10 February 2023, 02:12:50

Michael Paquier <michael@paquier.xyz> writes:
> Tom, did you get a chance to look at what is proposed here and expand
> the use of query_jumble_ignore in the definitions of the nodes rather
> than have an enforced per-file policy in gen_node_support.pl?

Sorry, didn't look at it before.

I'm okay with the pathnodes.h changes --- although surely you don't need
changes like this:

-    pg_node_attr(abstract)
+    pg_node_attr(abstract, no_query_jumble)

"abstract" should already imply "no_query_jumble".

I wonder too if you could shorten the changes by making no_query_jumble
an inheritable attribute, and then just applying it to Path and Plan.

The changes in parsenodes.h seem wrong, except for RawStmt.  Those node
types are used in parsed queries, aren't they?

            regards, tom lane

Re: Generating code for query jumbling through gen_node_support.pl

From

Michael Paquier

Date:

10 February 2023, 05:28:53

On Thu, Feb 09, 2023 at 06:12:50PM -0500, Tom Lane wrote:
> I'm okay with the pathnodes.h changes --- although surely you don't need
> changes like this:
>
> -    pg_node_attr(abstract)
> +    pg_node_attr(abstract, no_query_jumble)
>
> "abstract" should already imply "no_query_jumble".

Okay, understood.  Following this string of thoughts, I am a bit
surprised for two cases, though:
- PartitionPruneStep.
- Plan.
Both are abstract and both are marked with no_equal.  I guess that
applying no_query_jumble to both of them is fine, and that's what you
mean?

> I wonder too if you could shorten the changes by making no_query_jumble
> an inheritable attribute, and then just applying it to Path and Plan.

Ah.  I did not catch what you meant here at first, but I think that I
do now.  Are you referring to the part of gen_node_support.pl where we
propagate properties when a node is a supertype?  This part would be
taken into account when a node is parsed but we find that its first
member is already tracked as a node:
    # Propagate some node attributes from supertypes
    if ($supertype)
    {
        push @no_copy, $in_struct
          if elem $supertype, @no_copy;
        push @no_equal, $in_struct
          if elem $supertype, @no_equal;
        push @no_read, $in_struct
          if elem $supertype, @no_read;
+        push @no_query_jumble, $in_struct
+          if elem $supertype, @no_query_jumble;
    }

A benefit of doing that would also discard all the Scan and Sort
nodes.  So like the other no_* attributes, it makes sense to force the
inheritance here.

> The changes in parsenodes.h seem wrong, except for RawStmt.  Those node
> types are used in parsed queries, aren't they?

RTEPermissionInfo is a recent addition, as of a61b1f7.  This commit
documents it as a plan node, still it is part of a Query while being
ignored in the query jumbling since its introduction, so I am a bit
confused by this one.

Anyway, none of these need to be included in the query jumbling
currently because they are ignored, but I'd be fine to generate their
code by default as they could become relevant if other nodes begin to
rely on them more heavily, as being part of queries.  Peter E. has
mentioned upthread that a few nodes should include more jumbling while
some other parts should be ignored.  This should be analyzed
separately because ~15 does not seem to be strictly right, either.

Attached is a patch refreshed with all that.  Feel free to ignore 0002
as that's just useful to enforce the tests to go through the jumbling
code.  The attached reaches 95.0% of line coverage after a check-world
in funcs.c.

Thoughts?
--
Michael

Attachment

Re: Generating code for query jumbling through gen_node_support.pl

From

Tom Lane

Date:

11 February 2023, 00:40:08

Michael Paquier <michael@paquier.xyz> writes:
> Okay, understood.  Following this string of thoughts, I am a bit
> surprised for two cases, though:
> - PartitionPruneStep.
> - Plan.
> Both are abstract and both are marked with no_equal.  I guess that
> applying no_query_jumble to both of them is fine, and that's what you
> mean?

On second thought, the point of that is to allow the no_equal property
to automatically inherit to child node types, so doing likewise
for no_query_jumble is sensible.

>> The changes in parsenodes.h seem wrong, except for RawStmt.  Those node
>> types are used in parsed queries, aren't they?

> RTEPermissionInfo is a recent addition, as of a61b1f7.  This commit
> documents it as a plan node, still it is part of a Query while being
> ignored in the query jumbling since its introduction, so I am a bit
> confused by this one.

Hmm ... it is part of Query, so that documentation is wrong, and the
fact that it's not reached by query jumbling kind of seems like a bug.
However, it might be that everything in it is derived from something
else that *is* covered by jumbling, in which case that's okay, if
underdocumented.

> ... Peter E. has
> mentioned upthread that a few nodes should include more jumbling while
> some other parts should be ignored.  This should be analyzed
> separately because ~15 does not seem to be strictly right, either.

Yeah.  It'd surprise me not at all if people have overlooked that.

v2 looks good to me as far as it goes.  I agree these other questions
deserve a separate look.

            regards, tom lane

Re: Generating code for query jumbling through gen_node_support.pl

From

Michael Paquier

Date:

13 February 2023, 03:11:53

On Fri, Feb 10, 2023 at 04:40:08PM -0500, Tom Lane wrote:
> v2 looks good to me as far as it goes.

Thanks.  I have applied that after an extra lookup.

> I agree these other questions deserve a separate look.

Okay, I may be able to come back to that.  Another point is that we
need to do a better job in forcing the execution of the query jumbling
in one of the TAP tests running pg_regress, outside
pg_stat_statements, to maximize coverage.  Will see to that on a
separate thread.
--
Michael

Attachment

signature.asc

Re: Generating code for query jumbling through gen_node_support.pl

From

Michael Paquier

Date:

11 July 2023, 01:35:43

On Mon, Jan 30, 2023 at 11:48:45AM +0100, Peter Eisentraut wrote:
> On 27.01.23 03:59, Michael Paquier wrote:
>> At the end, that would be unnoticeable for the average user, I guess,
>> but here are the numbers I get on my laptop :)
>
> Personally, I think we do not want the two jumble methods in parallel.
>
> Maybe there are other opinions.

(Thanks Jonathan for the poke.)

Now that we are in mid-beta for 16, it would be a good time to
conclude on this open item:
"Reconsider a utility_query_id GUC to control if query jumbling of
utilities can go through the past string-only mode and the new mode?"

In Postgres ~15, utility commands used a hash of the query string to
compute their query ID.  The current query jumbling code uses a Query
instead, like any other queries.  I have registered this open item as
a self-reminder, mostly in case there would be an argument to have a
GUC where users could switch from one mode to another.  See here as
well for some computation times for each method (table is in ns, wiht
millions of iterations):
https://www.postgresql.org/message-id/Y9eeYinDb1AcpWrG@paquier.xyz

I still don't think that we need both methods based on these numbers,
but there may be more opinions about that?  Are people OK if this open
item is discarded?
--
Michael

Attachment

signature.asc

Re: Generating code for query jumbling through gen_node_support.pl

From

Andrey Lepikhov

Date:

11 July 2023, 08:29:29

On 11/7/2023 05:35, Michael Paquier wrote:
> On Mon, Jan 30, 2023 at 11:48:45AM +0100, Peter Eisentraut wrote:
>> On 27.01.23 03:59, Michael Paquier wrote:
>>> At the end, that would be unnoticeable for the average user, I guess,
>>> but here are the numbers I get on my laptop :)
>>
>> Personally, I think we do not want the two jumble methods in parallel.
>>
>> Maybe there are other opinions.
> 
> (Thanks Jonathan for the poke.)
> 
> Now that we are in mid-beta for 16, it would be a good time to
> conclude on this open item:
> "Reconsider a utility_query_id GUC to control if query jumbling of
> utilities can go through the past string-only mode and the new mode?"
> 
> In Postgres ~15, utility commands used a hash of the query string to
> compute their query ID.  The current query jumbling code uses a Query
> instead, like any other queries.  I have registered this open item as
> a self-reminder, mostly in case there would be an argument to have a
> GUC where users could switch from one mode to another.  See here as
> well for some computation times for each method (table is in ns, wiht
> millions of iterations):
> https://www.postgresql.org/message-id/Y9eeYinDb1AcpWrG@paquier.xyz
> 
> I still don't think that we need both methods based on these numbers,
> but there may be more opinions about that?  Are people OK if this open
> item is discarded?
I vote for only one method based on a query tree structure.
BTW, did you think about different algorithms of queryId generation? 
Auto-generated queryId code can open a way for extensions to have 
easy-supporting custom queryIds.

-- 
regards,
Andrey Lepikhov
Postgres Professional

Re: Generating code for query jumbling through gen_node_support.pl

From

Michael Paquier

Date:

11 July 2023, 08:35:55

On Tue, Jul 11, 2023 at 12:29:29PM +0700, Andrey Lepikhov wrote:
> I vote for only one method based on a query tree structure.

Noted

> BTW, did you think about different algorithms of queryId generation?

Not really, except if you are referring to the possibility of being
able to handle differently different portions of the nodes depending
on a context given by the callers willing to do a query jumbling
computation.  (For example, choose to *not* silence the Const nodes,
etc.)

> Auto-generated queryId code can open a way for extensions to have
> easy-supporting custom queryIds.

Extensions can control that at some extent, already.
--
Michael

Attachment

signature.asc

Re: Generating code for query jumbling through gen_node_support.pl

From

Andrey Lepikhov

Date:

11 July 2023, 09:44:39

On 11/7/2023 12:35, Michael Paquier wrote:
> On Tue, Jul 11, 2023 at 12:29:29PM +0700, Andrey Lepikhov wrote:
>> I vote for only one method based on a query tree structure.
> 
> Noted
> 
>> BTW, did you think about different algorithms of queryId generation?
> 
> Not really, except if you are referring to the possibility of being
> able to handle differently different portions of the nodes depending
> on a context given by the callers willing to do a query jumbling
> computation.  (For example, choose to *not* silence the Const nodes,
> etc.)
Yes, I have two requests on different queryId algorithms:
1. With suppressed Const nodes.
2. With replacement of Oids with full names - to give a chance to see 
the same queryId at different instances for the same query.

It is quite trivial to implement, but not easy in support.
> 
>> Auto-generated queryId code can open a way for extensions to have
>> easy-supporting custom queryIds.
> 
> Extensions can control that at some extent, already.
> --
> Michael

-- 
regards,
Andrey Lepikhov
Postgres Professional

Re: Generating code for query jumbling through gen_node_support.pl

From

Michael Paquier

Date:

19 July 2023, 06:30:57

On Tue, Jul 11, 2023 at 07:35:43AM +0900, Michael Paquier wrote:
> I still don't think that we need both methods based on these numbers,
> but there may be more opinions about that?  Are people OK if this open
> item is discarded?

Hearing nothing about this point, removed from the open item list,
then.
--
Michael

Attachment

signature.asc