Thread: [HACKERS] BUG: pg_stat_statements query normalization issues with combined queries

[HACKERS] BUG: pg_stat_statements query normalization issues with combined queries

From
Fabien COELHO
Date:
Hello,

While investigating a performance issue, I tried to get information from 
pg_stat_statements; however I ran into another issue: it seems that when 
using combined queries, pg_stat_statements query normalization does not 
work properly... 2 queries that should have been mapped to only one are 
instead mapped to... 3 cases, as constants are not all ignored:
                query
 BEGIN ;                              +
 SELECT data FROM Stuff WHERE id = 1 ;+
 SELECT data FROM Stuff WHERE id = 2 ;+
 SELECT data FROM Stuff WHERE id = 3 ;+
 COMMIT;

 BEGIN ;                              +
 SELECT data FROM Stuff WHERE id = 4 ;+
 SELECT data FROM Stuff WHERE id = 5 ;+
 SELECT data FROM Stuff WHERE id = 6 ;+
 COMMIT;

 BEGIN ;                              +
 SELECT data FROM Stuff WHERE id = ?  +
 SELECT data FROM Stuff WHERE id = 2  +
 SELECT data FROM Stuff WHERE id = 3  +
 COMMIT;

I was expecting the 2 combined queries either to be separated into 
individual queries "SELECT data FROM Stuff WHERE id = ?" or merged into 
one large query with three ?, but not the above result...
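
For reference, the statements were sent as single combined strings, i.e.
something of roughly this shape under psql (a sketch only, not the exact
script):

  BEGIN \;
  SELECT data FROM Stuff WHERE id = 1 \;
  SELECT data FROM Stuff WHERE id = 2 \;
  SELECT data FROM Stuff WHERE id = 3 \;
  COMMIT;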

-- 
Fabien



> I was expecting the 2 combined queries either to be separated into individual 
> queries "SELECT data FROM Stuff WHERE id = ?" or merged into one large query 
> with three ?, but not the above result...

Oops, I forgot the attachment; see the repro script, run on 9.6.1.

-- 
Fabien.
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment
>> I was expecting the 2 combined queries either to be separated into individual 
>> queries "SELECT data FROM Stuff WHERE id = ?" or merged into one large query 
>> with three ?, but not the above result...
>
> Oops, I forgot the attachment; see the repro script, run on 9.6.1.

After some quick investigation, I concluded that the issue is that the 
whole combined query string is used instead of the part referring to the 
actual query being processed.

As a result:

 - the full un-normalized string is used for both BEGIN & COMMIT
   => n shared entries, one for each actual query; BEGIN & COMMIT
      are mixed together

 - the first-part-only normalized string is used for each SELECT
   => 1 shared entry whose "query" is the first partially
      normalized combined query encountered
 

-- 
Fabien.



Hello again pgdevs,

Further investigation concludes that the information is lost by the parser.

From a ;-separated string, raw_parser returns a List<Node> whose elements 
are directly the nodes of the statements, with empty statements silently 
skipped.

The Node entry of high-level statements (*Stmt) does NOT contain any
location information; only some lower-level nodes have it.

A possible fix would be to add the query string location 
information systematically at the head of every high-level *Stmt, which 
would then all share both the NodeTag and this information (say, location 
and length). This would be a new intermediate kind of node from which all 
these statements would "inherit", say a "ParseNode" structure.
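
A minimal sketch of the idea (illustrative only, not an actual patch):

    typedef struct ParseNode
    {
        NodeTag     type;       /* tag of the concrete *Stmt node */
        int         location;   /* offset of the statement in the query string */
        int         length;     /* length of the statement text, in bytes */
    } ParseNode;

    /* hypothetical example of a statement sharing that leading layout */
    typedef struct SomeStmt
    {
        NodeTag     type;
        int         location;
        int         length;
        /* ... statement-specific fields follow ... */
    } SomeStmt;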

The actual location information may be filled-in at quite a high level in 
the parser, maybe around "stmtmulti" and "stmt" rules, so it would not 
necessarily be too intrusive in the overall parser grammar.

Then once the information is available, the query string may be cut where 
appropriate to only store the relevant string in pg_stat_statements or 
above.

Would this approach be acceptable, or is modifying Nodes a no-go area?

If it is acceptable, I can probably put together a patch and submit it.

If not, I suggest updating the documentation to note that 
pg_stat_statements does not work properly with combined queries.

-- 
Fabien.



On Tue, Dec 20, 2016 at 6:18 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
> Would this approach be acceptable, or is modifying Nodes a no-go area?
>
> If it is acceptable, I can probably put together a patch and submit it.
>
> If not, I suggest updating the documentation to note that
> pg_stat_statements does not work properly with combined queries.

I think you've found a bug, but I'm a little doubtful about your
proposed fix.  However, I haven't studied the code, so I don't know
what other approach might be better.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] BUG: pg_stat_statements query normalization issues with combined queries

From
Kyotaro HORIGUCHI
Date:
At Tue, 20 Dec 2016 22:42:48 -0500, Robert Haas <robertmhaas@gmail.com> wrote in
<CA+TgmoZT94brLAGK7gCmxB4mO=C-Cdz1N8KN8Xen4sexHozN=Q@mail.gmail.com>
> On Tue, Dec 20, 2016 at 6:18 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
> > Would this approach be acceptable, or is modifying Nodes a no-go area?
> >
> > If it is acceptable, I can probably put together a patch and submit it.
> >
> > If not, I suggest updating the documentation to note that
> > pg_stat_statements does not work properly with combined queries.
> 
> I think you've found a bug, but I'm a little doubtful about your
> proposed fix.  However, I haven't studied the code, so I don't know
> what other approach might be better.

That would work and is doable, but the more fundamental issue here
seems to be that post_parse_analyze_hook and other similar hooks
are called with a Query inconsistent with debug_query_string. It
is very conveniently used, but the name seems to me to suggest
that such usage is beyond its purpose. I'm not sure, though.

A similar behavior is shown in error messages but this harms far
less than the case of pg_stat_statements.

ERROR:  column "b" does not exist
LINE 1: ...elect * from t where a = 1; select * from t where b = 2; com...
                                                             ^
 


1. Let all of the parse nodes have their location in debug_query_string.
   (The original proposal)

2. Let the parser return once for each statement (like the psql parser),
   with the corresponding query string stored in a variable other than
   debug_query_string.

...


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Hello Robert & Kyotaro,

>> I'm a little doubtful about your proposed fix.  However, I haven't 
>> studied the code, so I don't know what other approach might be better.

That is indeed the question!

The key point is that the parser parses several statements from a string, 
but currently there is no clue about where each query was found in the 
string. Only the parser knows what is being parsed, so only it can generate 
the location information. Now the possible solutions I see are:
 - the string is split per query before parsing, but this requires at
   least a lexer... it looks pretty ineffective to have another lexing
   phase involved, even if an existing lexer is reused.

 - the parser processes one query at a time and keeps the "remaining
   unparsed string" in some state that can be queried to check up to
   where it proceeded, but then the result must be stored somewhere,
   and it would make sense for it to be in the statement anyway;
   just the management of the location information would be outside
   the parser. Also, that would add the cost of relaunching the parser;
   not sure how bad or insignificant this is. This is (2) below.

 - the parser still generates a List<Node> but keeps track of the location
   of statements while doing so, somehow... probably as I outlined.

The query string information can be either some way of pointing into the 
initial string, or the substring itself, extracted at some point.
I initially suggested the former because this is already what the parser 
does for some nodes, and because it seems simpler to do so.

Extracting the string instead would suggest that the locations of tokens
within this statement be relative to this string rather than to the initial 
one, but that involves a lot more changes and it is easier to miss 
something doing so.

> That would work and is doable, but the more fundamental issue here
> seems to be that post_parse_analyze_hook and other similar hooks
> are called with a Query inconsistent with debug_query_string.

Sure.

> It is very conveniently used, but the name seems to me to suggest 
> that such usage is beyond its purpose. I'm not sure, though.
>
> A similar behavior is shown in error messages but this harms far
> less than the case of pg_stat_statements.
>
> ERROR:  column "b" does not exist
> LINE 1: ...elect * from t where a = 1; select * from t where b = 2; com...
>                                                             ^

Ah, I wrote this piece of code a long time ago :-) The location is relative 
to the full string (see comment above); changing that would involve many
more changes, and I'm not sure whether it is desirable.

Also, think of:

SELECT * FROM foo; DROP TABLE foo; SELECT * FROM foo;

Maybe the context to know which "SELECT * FROM foo" generates an error 
should be kept.

> 1. Let all of the parse node have location in
>  debug_query_string. (The original proposal)

Yep.

> 2. Let the parser return once for each statement (like psql
>   parser)

I'm not sure it does... "\;" generates ";" in the output and the psql 
lexer keeps on lexing.

> and corresponding query string is stored in a
>   variable other than debug_query_string.

I think that would involve many changes because of the way postgres is 
written: the list is expected and returned by quite a few functions. 
Moreover, query rewriting may generate several queries out of one anyway, 
so the list would be kept.

So I'm rather still in favor of my initial proposal, that is, extending the 
existing location information to statements, not only to some tokens.

-- 
Fabien.



Re: [HACKERS] BUG: pg_stat_statements query normalization issues with combined queries

From
Kyotaro HORIGUCHI
Date:
At Wed, 21 Dec 2016 09:28:58 +0100 (CET), Fabien COELHO <coelho@cri.ensmp.fr> wrote in
<alpine.DEB.2.20.1612210841370.3892@lancre>
> 
> Hello Robert & Kyotaro,
> 
> >> I'm a little doubtful about your proposed fix.  However, I haven't
> >> studied the code, so I don't know what other approach might be better.
> 
> That is indeed the question!
> 
> The key point is that the parser parses several statements from a
> string, but currently there is no clue about where the queries was
> found in the string. Only a parser may know about what is being parsed
> so can generate the location information. Now the possible solutions I
> see are:
> 
>  - the string is split per query before parsing, but this requires at
>    least a lexer... it looks pretty ineffective to have another lexing
>    phase involved, even if an existing lexer is reused.

I don't see this as promising. Apparently a waste of CPU cycles.

>  - the parser processes one query at a time and keep the "remaining
>    unparsed string" in some state that can be queried to check up to
>    where it proceeded, but then the result must be stored somewhere
>    and it would make sense that it would be in the statement anyway,
>    just the management of the location information would be outside
>    the parser. Also that would add the cost of relaunching the parser,
>    not sure how bad or insignificant this is. This is (2) below.

I raised this as a spoiler; I see this is somewhat too invasive
for the problem to be solved.

>  - the parser still generates a List<Node> but keep track of the location
>    of statements while doing so, somehow... probably as I outlined.

Yeah, this seems most reasonable so far. It seems to me to be
better that the statement location is conveyed as a part of the
parse tree, so as not to need a side channel for the location.

I'd like to rename debug_query_string to a more reasonable name if
we are allowed to, but perhaps not.

> The query string information can be at some way of pointing in the
> initial
> string, or the substring itself that could be extracted at some point.
> I initially suggested the former because this is already what the
> parser does for some nodes, and because it seems simpler to do so.
> 
> Extracting the string instead would suggest that the location of
> tokens
> within this statement are relative to this string rather than the
> initial one, but that involves a lot more changes and it is easier to
> miss something doing so.

Agreed that copying the statement string would be too much. But the
new *Stmt node should somehow also have the end location of the
statement, since the user of one parse tree cannot look into another
one.


> > That would work and is doable, but the more fundamental issue here
> > seems to be that post_parse_analyze_hook and other similar hooks
> > are called with a Query inconsistent with debug_query_string.
> 
> Sure.
> 
> > It is very conveniently used but the name seems to me to be suggesting
> > that such usage is out of purpose. I'm not sure, though.
> >
> > A similar behavior is shown in error messages but this harms far
> > less than the case of pg_stat_statements.
> >
> > ERROR:  column "b" does not exist
> > LINE 1: ...elect * from t where a = 1; select * from t where b = 2;
> > com...
> >                                                             ^
> 
> Ah, I wrote this piece of code a long time ago:-) The location is
> relative to the full string, see comment above, changing that would
> involve much
> more changes, and I'm not sure whether it is desirable.
> 
> Also, think of:
> 
> SELECT * FROM foo; DROP TABLE foo; SELECT * FROM foo;
> 
> Maybe the context to know which "SELECT * FROM foo" generates an error
> should be kept.
> 
> > 1. Let all of the parse node have location in
> >  debug_query_string. (The original proposal)
> 
> Yep.
> 
> > 2. Let the parser return once for each statement (like psql
> >   parser)
> 
> I'm not sure it does... "\;" generates ";" in the output and the psql
> lexer keeps on lexing.
> 
> > and corresponding query string is stored in a
> >   variable other than debug_query_string.
> 
> I think that would involve many changes because of the way postgres is
> written, the list is expected and returned by quite a few
> functions. Moreover query rewriting may generate several queries out
> of one anyway, so the list would be kept.
> 
> So I'm rather still in favor of my initial proposal, that is extend
> the existing location information to statements, not only some tokens.

I thought that it's too much to let all types of parse nodes have a
location, but grepping gram.y for "makeNode" persuaded me to
agree with you. After changing all *Stmt nodes, only several
types of nodes seem to be missing it.
regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Hello Kyotaro-san,

> [...] Agreed that copying statement string would be too much. But the 
> new *Stmt node should somehow have also the end location of the 
> statement since the user of a parse tree cannot look into another one.

Yes. I was thinking of adding a "length" field next to "location", where 
appropriate.

>> So I'm rather still in favor of my initial proposal, that is extend
>> the existing location information to statements, not only some tokens.
>
> I thought that it's too much to let all types of parse node have
> location but grepping the gram.y with "makeNode" persuaded me to
> agree with you.

Indeed...

> After changing all *Stmt nodes, only several types of nodes seems 
> missing it.

Yes. I'll try to put together a patch and submit it to the next CF.

-- 
Fabien.



> Yes. I'll try to put together a patch and submit it to the next CF.

Here it is. I'll add this to the next CF.



This patch fixes the handling of compound/combined queries by 
pg_stat_statements (i.e. several queries sent together, e.g. with psql: 
"SELECT 1 \; SELECT 2;").

This bug was found in passing while investigating a strange performance 
issue I had with compound statements.


Collect query location information in parser:

  * create a ParseNode for statements with location information fields
    (qlocation & qlength).

  * add these fields to all parsed statements (*Stmt), Query and PlannedStmt.

  * statements location is filled in:

    - the lexer keeps track of the last ';' encountered.

    - the information is used by the parser to compute the query length,
      which depends on whether the query ended on ';' or on the input end.

    - the query location is propagated to Query and PlannedStmt when built.


Fix pg_stat_statement:

  * the information is used by pg_stat_statements so as to extract the relevant part
    of the query string, including some trimming.

  * the pg_stat_statements validation is greatly extended so as to test options
    and exercise various cases, including compound statements.


note 1: two non-statement tags (used for alter table commands) have been 
moved so that all statement tags are contiguous and easy to check.

note 2: the impact on the lexer & parser is quite minimal with this 
approach, about 30 LOC, most of which in one function. The largest changes 
are in the node header to add location fields.

note 3: the query length excludes the final separator, so that 
;-terminated and end-of-input-terminated queries show up the same.

note 4: the added test suggests that when tracking "all", queries in SQL 
user functions are not tracked; this might be a bug.

-- 
Fabien.
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment
>> Yes. I'll try to put together a patch and submit it to the next CF.
>
> Here it is. I'll add this to the next CF.

Oops... better without a stupid overflow. Shame on me!

-- 
Fabien.
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment


On 21 Dec. 2016 11:44, "Robert Haas" <robertmhaas@gmail.com> wrote:
On Tue, Dec 20, 2016 at 6:18 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
> Would this approach be acceptable, or is modifying Nodes a no-go area?
>
> If it is acceptable, I can probably put together a patch and submit it.
>
> If not, I suggest to update the documentation to tell that
> pg_stat_statements does not work properly with combined queries.

I think you've found a bug, but I'm a little doubtful about your
proposed fix.  However, I haven't studied the code, so I don't know
what other approach might be better.


FWIW this issue with multi-statements also causes issues with ProcessUtility_hook. It gets the char* querytext of the whole multi-statement, then gets invoked once for each utility command within. It has no information about which statement text the current invocation corresponds to.

Having a pointer into the query text for the start and end would be good there too. Not as good as doing away with multis entirely as a bad hack, but that's not practical for BC and protocol reasons.

BTW we should be sure the somewhat wacky semantics of multi-statements with embedded commits are documented. I'll check tomorrow.


On 23 Dec. 2016 17:53, "Fabien COELHO" <coelho@cri.ensmp.fr> wrote:

Yes. I'll try to put together a patch and submit it to the next CF.

Here it is. I'll add this to the next CF.

Great. Please put me (ringerc) down as a reviewer. I'll get to this as soon as holidays settle down. It'd be very useful for some things I'm working on too, and is much better than my early draft of similar functionality.

Hello Craig,

> Please put me (ringerc) down as a reviewer.

Done.

Please do not lose time reviewing stupid version 1... skip to version 2 
directly :-)

Also, note that the query-location part may be considered separately from 
the pg_stat_statements part which uses it.

-- 
Fabien.



Re: [HACKERS] BUG: pg_stat_statements query normalization issues with combined queries

From
Kyotaro HORIGUCHI
Date:
Hello,

At Mon, 26 Dec 2016 16:00:57 +0100 (CET), Fabien COELHO <coelho@cri.ensmp.fr> wrote in
<alpine.DEB.2.20.1612261554510.4911@lancre>
> 
> Hello Craig,
> 
> > Please put me (ringerc) down as a reviewer.
> 
> Done.
> 
> Please do not lose time reviewing stupid version 1... skip to version
> 2 directly :-)
> 
> Also, note that the query-location part may be considered separately
> from the pg_stat_statements part which uses it.

In the nonterminal stmtmulti, setQueryLocation is added and used to
calculate and store the length of every stmt. This needs
extra storage in base_yy_extra_type and introduces somewhat
complicated stuff. But I think raw_parser can do that without
such extra storage and labor; then gram.y gets far simpler.

The struct members to store the location and length are named
'qlocation' and 'qlength' in the added ParseNode. But many nodes
already have a 'location' with exactly the same purpose. I don't see
a necessity to differentiate them.

Putting the two above together, the following is my suggestion
for the parser part.

- Make all parse nodes have the following two members at the
  beginning. This unifies all parse nodes from the viewpoint of
  setting their location. Commenting about this would be better.

  | NodeTag   type;
  | int       location;

- Remove struct ParseNode and setQueryLocation. stmtmulti sets the
  location in the same manner as other kinds of nodes, just doing
  '= @n'. base_yyparse() doesn't calculate statement length.

- Make raw_parser calculate each statement's length and then store it into
  each stmt while scanning yyextra.parsetree.
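
A rough sketch of that last step (names are illustrative, not actual patch
code):

    /* after base_yyparse(): derive each statement's length from the start
     * of the following statement, or from the end of the input string */
    ListCell   *lc;

    foreach(lc, yyextra.parsetree)
    {
        GenericStmt *stmt = (GenericStmt *) lfirst(lc); /* hypothetical common header */
        int          end;

        if (lnext(lc) != NULL)
            end = ((GenericStmt *) lfirst(lnext(lc)))->location;
        else
            end = strlen(str);      /* "str" being raw_parser's input string */

        stmt->length = end - stmt->location;
    }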

What do you think about this?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] BUG: pg_stat_statements query normalization issues with combined queries

From
Kyotaro HORIGUCHI
Date:
Hello,

At Tue, 27 Dec 2016 10:28:53 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20161227.102853.204155513.horiguchi.kyotaro@lab.ntt.co.jp>
> Putting the two above together, the following is my suggestion
> for the parser part.
> 
> - Make all parse nodes have the following two members at the
>   beginning. This unifies all parse node from the view of setting
>   their location. Commenting about this would be better.
> 
>   | NodeTag   type;
>   | int       location;
> 
> - Remove struct ParseNode and setQueryLocation. stmtmulti sets
>   location in the same manner to other kind of nodes, just doing
>   '= @n'. base_yyparse() doesn't calculate statement length.
> 
> - Make raw_parser calculate statement length then store it into
>   each stmt scanning the yyextra.parsetree.

An additional comment on the parser (planner?) part.

This patch makes planner() copy the location and length from parse to
result, but copying such stuff is a job for standard_planner.

Then the following is a comment on pg_stat_statements.c

- pg_stat_statements.c:977 - isParseNode(node)
  node should be parsenode.

- The name for the additional parameter query_loc is written as
  query_len in its prototype.

- In pgss_store, "else if (strlen(query) != query_len)" doesn't work as
  expected because of an off-by-one error. query_len here is the length
  of the query *excluding* the last semicolon.

- qtext_store doesn't rely on a terminating nul character and the function
  already takes the query length as a parameter. So, there's no need to
  copy the partial debug_query_string into palloc'ed memory.  Just
  replacing query with query_loc will be sufficient.

Have nice holidays.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Hello Kyotaro-san,

> In the nonterminal stmtmulti, setQueryLocation is added and used to
> calculate and store the length of every stmt. This needs
> extra storage in base_yy_extra_type

The extra storage is one int.

> and introduces a bit complicated stuff. But I think raw_parser can do 
> that without such extra storage and labor,

How? The issue is that stmtmulti silently skips some ';' when empty 
statements are found, so I need to keep track of that to know where to 
stop; using the beginning location of the next statement, which is 
probably your idea, does not work.

> then gram.y gets far simpler.


> The struct member to store the location and length is named
> 'qlocation', and 'qlength' in the added ParseNode. But many nodes
> already have a 'location' with exactly the same purpose. I don't see
> a necessity to differentiate them.

Alas, the "CreateTableSpaceStmt" struct already had a "char *location" 
that I did not want to change... If I use "location", then I have to 
change it, why not...

Another reason for the name difference is that the qlocation/qlength 
convention is slightly different from location when not set: location is 
set manually to -1 when the information is not available, but this 
convention is not possible for statements without changing all their 
allocations and initializations (there are hundreds...). So the convention 
I used for qlocation/qlength is that it is not set if qlength is zero, 
which is the default value set by makeNode, so no change was needed...
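
In code, the convention reads roughly like this (a sketch; variable and
helper names are illustrative):

    if (stmt->qlength > 0)
    {
        /* the parser filled in this statement's position */
        norm_query = pnstrdup(query_string + stmt->qlocation, stmt->qlength);
    }
    else
    {
        /* qlength == 0: makeNode() default, i.e. "not set", use the whole string */
        norm_query = pstrdup(query_string);
    }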

Changing this point would mean a *lot* of possibly error prone changes in 
the parser code everywhere statements are allocated...

> Putting the two above together, the following is my suggestion
> for the parser part.
>
> - Make all parse nodes have the following two members at the
>  beginning. This unifies all parse node from the view of setting
>  their location. Commenting about this would be better.
>
>  | NodeTag   type;
>  | int       location;

Currently only a few parse nodes have locations, those about variables and 
constants that can be designated by an error message. I added the 
information to about 100 statements, but for the many remaining parse 
nodes the information does not exist and is not needed, so I would prefer 
to avoid adding a field both unset and unused.

> - Remove struct ParseNode

"ParseNode" is needed to access the location and length of statements, 
otherwise they are only "Node", which does not have a location.

> and setQueryLocation.

The function is called twice, I wanted to avoid duplicating code.

> stmtmulti sets location in the same manner to other kind of nodes, just 
> doing '= @n'. base_yyparse() doesn't calculate statement length.

But...

> - Make raw_parser caluclate statement length then store it into
>  each stmt scanning the yyextra.parsetree.

... I cannot calculate the statement length cleanly because of empty 
statements. If I know where the last ';' is, then the length computation 
must be done when the reduction occurs, hence the function called from the 
stmtmulti rule.

> What do you think about this?

I think that I do not know how to compute the length without knowing where 
the last ';' was, because of how empty statements are silently skipped 
around the stmtmulti/stmt rules, so I think that the length computation 
should stay where it is.

What I can certainly do is:
 - add more comments to explain why it is done like that


What I could do with some input from reviewers/committers is:
 - rename the "char *location" field of the create table space statement
   to "directory" or something else... but what?
 - change qlocation/qlength to location/length...
   However, then the -1 convention for an unset location no longer holds,
   which is annoying. Using this convention means hundreds of changes because
   every statement allocation & initialization must be changed. This is
   obviously possible, but it is a much larger patch, which I cannot say
   would be much better than this one; it would just be different.

What I'm reserved about:
 - adding a location field to nodes that do not need it.

-- 
Fabien.



> An additional comment on parser(planner?) part.
>
>  This patch makes planner() copy the location and length from
>  parse to result, but copying such stuff is a job for
>  standard_planner.

I put the copy in planner() because standard_planner may not be called by 
planner(), and in all cases I think that the fields should be copied...

Otherwise it is the responsibility of the planner hook to do the copy, which 
would be ok if the planner hook calls standard_planner, but that fact is 
not guaranteed; the comment says "the plugin would normally call 
standard_planner". What if it does not?

So it seemed safer this way.

> Then the following is a comment on pg_stat_statements.c
>
> - pg_stat_statements.c:977 - isParseNode(node)
>  node should be parsenode.

Could be. ParseNode is a Node, it is just a pointer cast. I can assert on 
pn instead of node.

> - The name for the additional parameter query_loc is written as
>  query_len in its prototype.

Indeed, a typo on "generate_normalized_query" prototype.

> - In pgss_store, "else if (strlen(query) != query_len)" doesn't
>  work as expected because of an off-by-one error. query_len here is
>  the length of the query *excluding* the last semicolon.

It was somewhat intentional: if the query is not the same as the string, 
then a copy is performed to have the right null-terminated string...
but

> - qtext_store doesn't rely on terminating nul character and the
>  function already takes query length as a parameter. So, there's
>  no need to copy the partial debug_query_string into palloc'ed
>  memory.  Just replacing the query with query_loc will be
>  sufficient.

Hmmm... it seemed good so I tried it, and it did not work...

The subtle reason is that qtext_store does assume that the query string is 
null terminated... it just does not recompute its length.

However, it can be changed to behave correctly in this case, so as to avoid 
the copy.

-- 
Fabien.



Fabien COELHO <coelho@cri.ensmp.fr> writes:
> How? The issue is that stmtmulti silently skips some ';' when empty 
> statements are found, so I need to keep track of that to know where to 
> stop; using the beginning location of the next statement, which is 
> probably your idea, does not work.

Maybe you should undo that.  I've generally found that trying to put
optimizations into the grammar is a bad idea; it's better to KISS in
gram.y and apply optimizations later on.

>> - Make all parse nodes have the following two members at the
>> beginning. This unifies all parse node from the view of setting
>> their location. Commenting about this would be better.
>> 
>> | NodeTag   type;
>> | int       location;

I do not like this.  You are making what should be a small patch into
a giant, invasive one.  I'd think about adding a single parse node type
that corresponds to "stmt" in the grammar and really doesn't do anything
but carry the location info for the overall statement.  That info could
then be transferred into the resulting Query node.  Problem solved with
(I think) very little code touched.
        regards, tom lane



Hello Tom,

>> How? The issue is that stmtmulti silently skips some ';' when empty
>> statements are found, [...]
>
> Maybe you should undo that.

I was trying to be minimally invasive in the parser, i.e. not to change 
any rules.

> I've generally found that trying to put optimizations into the grammar 
> is a bad idea; it's better to KISS in gram.y and apply optimizations 
> later on.

I can change the rules, but I'm not sure it will be better. It would allow 
removing the last-';' location tracking in the lexer, which Kyotaro does 
not like, but it would add some new contortions to compute the length of 
statements and remove the empty statements.

I would say that it would be different, but not necessarily better.

>>> - Make all parse nodes have the following two members at the
>>> beginning. This unifies all parse node from the view of setting
>>> their location. Commenting about this would be better.
>>>
>>> | NodeTag   type;
>>> | int       location;
>
> I do not like this.  You are making what should be a small patch into
> a giant, invasive one.

I would not say that the current patch is giant & invasive, if you 
abstract the two added fields to high-level statements.

> I'd think about adding a single parse node type that corresponds to 
> "stmt" in the grammar and really doesn't do anything but carry the 
> location info for the overall statement.  That info could then be 
> transferred into the resulting Query node.

I'm not sure I understand your suggestion. Currently I have added the 
location & length information to all high-level statement nodes, plus 
Query and PlannedStmt. I think that PlannedStmt is necessary for utility 
statements.

I understand that you suggest adding a new intermediate structure:

  typedef struct {
    NodeTag tag;
    int location, length;
    Node *stmt;
  } ParseNode;

So that raw_parser would return a List<ParseNode> instead of a List<Node>, 
and the statements would be unmodified.

> Problem solved with (I think) very little code touched.

Hmmm...

Then raw_parser() callers have to manage a List<ParseNode> instead of the 
List<Node>. I'm not sure of the size of the propagation needed to manage the 
added indirection level appropriately: raw_parser is called 4 times (in 
parser, tcop, commands, plpgsql...). The first call site, in tcop, then 
passes the list to copyObject(), equal(), check_log_statement(), 
errdetail_execute()...

Probably it can be done, but I'm moderately unenthusiastic: it looks like 
starting to unwind a ball of wool, and I'm not sure where it would stop, 
as it means touching at least the 4 uses of raw_parser in 4 distinct parts 
of postgres, whereas the current approach did not change anything for 
raw_parser callers, although at the price of modifying all statement 
nodes.

Did I read you correctly?

-- 
Fabien.



Fabien COELHO <coelho@cri.ensmp.fr> writes:
>>> How? The issue is that stmtmulti silently skip some ';' when empty
>>> statements are found, [...]

>> Maybe you should undo that.

> I was trying to be minimally invasive in the parser, i.e. not to change 
> any rules.

That's fairly silly when the alternative is to be maximally invasive
at the parse-nodes level, thereby affecting lots and lots of code that
isn't really related to the problem at hand.

>> I do not like this.  You are making what should be a small patch into
>> a giant, invasive one.

> I would not say that the current patch is giant & invasive, if you 
> abstract the two added fields to high-level statements.

It's touching every single utility statement type, which is not only
pretty invasive in itself, but will break any pending patches that
add more utility statement types.  And you've not followed through on the
implications of adding those fields in, for instance, src/backend/nodes/;
so the patch will be even bigger once all the required tidying-up is done.

> I understand that you suggest to add a new intermediate structure:

>    typedef struct {
>      NodeTag tag;
>      int location, length;
>      Node *stmt;
>    } ParseNode;

> So that raw_parser would return a List<ParseNode> instead of a List<Node>, 
> and the statements would be unmodified.

Yeah, that's what I was thinking of.  There aren't very many places that
would need to know about that, I believe; possibly just gram.y itself
and transformTopLevelStmt().  The stuff in between only knows that it's
passing around lists of statement parse trees, it doesn't know much about
what's in those trees.
        regards, tom lane



On 28 December 2016 at 08:33, Tom Lane <tgl@sss.pgh.pa.us> wrote:

>> I would not say that the current patch is giant & invasive, if you
>> abstract the two added fields to high-level statements.
>
> It's touching every single utility statement type, which is not only
> pretty invasive in itself, but will break any pending patches that
> add more utility statement types.  And you've not followed through on the
> implications of adding those fields in, for instance, src/backend/nodes/;
> so the patch will be even bigger once all the required tidying-up is done.

I echo Tom's complaint. Adding fields to every node is definitely not
the way to do this.

>> I understand that you suggest to add a new intermediate structure:
>
>>    typedef struct {
>>      NodeTag tag;
>>      int location, length;
>>      Node *stmt;
>>    } ParseNode;
>
>> So that raw_parser would return a List<ParseNode> instead of a List<Node>,
>> and the statements would be unmodified.
>
> Yeah, that's what I was thinking of.

I strongly agree. That's what I'd done in my draft patch, but I hadn't
got the details of actually collecting position tracking info sorted
out yet.

The information will need to be copied to plans later, but that's
easy, and would've been necessary anyway.

-- 
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Hello Tom,

> [...] It's touching every single utility statement type, which is not 
> only pretty invasive in itself, but will break any pending patches that 
> add more utility statement types.

Yep, but it is limited to headers and the break is trivial...

> And you've not followed through on the implications of adding those 
> fields in, for instance, src/backend/nodes/;

Hmmm, probably you are pointing out structure read/write functions.

I would have hoped that such code would be automatically derived from the 
corresponding struct definition...


>> I understand that you suggest to add a new intermediate structure [...] 
>> So that raw_parser would return a List<ParseNode> instead of a 
>> List<Node>, and the statements would be unmodified.
>
> Yeah, that's what I was thinking of.  There aren't very many places that
> would need to know about that, I believe; [...]

Hmmm. I've run into a tiny small ever so little detail in trying to apply 
this clever approach...

For fixing the information in pg_stat_statements, the location data must be 
transported from the parsed node to the query to the planned node, because 
the latter two node types are passed to different hooks.

Now the detail is that utility statements, which seem to be nearly all of 
them except select/update/delete/insert, do not have plans: the statement 
itself is its own plan... so there is no place to store the location & 
length.

As adding such fields to store the information in the structures is a no-go 
area, I'm hesitating about the course to follow to provide a solution 
acceptable to you.

Would you have any other clever proposition or advice about how to 
proceed?

I thought of creating yet another node "utility plans with locations" or 
maybe reuse the "parsed node" for the purpose, but then it has to be 
managed in quite a few places as well.

For the parser output, there are 4 raw_parser calls but also 10 
pg_parse_query calls to manage. I'm not sure that going this way will fit 
the "few lines of code" bill...

-- 
Fabien.



On 28 December 2016 at 20:11, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
> Hello Tom,
>
>> [...] It's touching every single utility statement type, which is not only
>> pretty invasive in itself, but will break any pending patches that add more
>> utility statement types.
>
>
> Yep, but it is limited to headers and the break is trivial...

I'm not sure how else that part can be done.

We can add a common parent for all utility statement types, but that
also touches all utility statements, and since we're working with pure
C, it means 'type' won't be accessible as e.g. someCreateStmt.type,
but only as something like ((UtilityStatement)someCreateStmt).type or
someCreateStmt.hdr.type .

e.g.

typedef struct UtilityStatement
{
    NodeTag type;
    int querytext_offset;
    int querytext_length;
};

typedef struct CreateStmt
{
    UtilityStatement hdr;
    RangeVar   *relation;           /* relation to create */
    ....
}

I don't think that's so bad, but it's not exactly less invasive.

>> And you've not followed through on the implications of adding those fields
>> in, for instance, src/backend/nodes/;
>
> Hmmm, probably you are pointing out structure read/write functions.

I'm sure that's his intent, yes.

> I would have hoped that such code would be automatically derived from the
> corresponding struct definition...

That would be nice. But no, unfortunately, we have no such preprocessing.

To do so would require that we either generate things like
parsenodes.h and the relevant parts of copyfuncs.c etc from a common
source, or have a toolchain that can ingest parsenodes.h and emit
suitable code. The latter means either "fragile, hideous Perl script"
or requiring something like LLVM or the gcc frontend libraries in the
toolchain to parse the C code and emit an intermediate representation
of the structures that we could then generate our functions from.

If you've come from the Java world you'd expect that we'd just
generate the copy and makenode stuff from introspection of the node
structs, or preprocess them or something, but it's not very practical.
They don't change enough to really make it worth it either.

> Hmmm. I've run into a tiny small ever so little detail in trying to apply
> this clever approach...
>
> For fixing the information in pg_stat_statement, the location data must be
> transported from the parsed node to the query to the planned node, because
> the later two nodes types are passed to different hooks.

Yep.

> Now the detail is that utility statements, which seems to be nearly all of
> them but select/update/delete/insert, do not have plans: The statement
> itself is its own plan... so there is no place to store the location &
> length.

Yeah. I don't see any way around adding location info for utility
statements one way or the other.

We could wrap every utility statement up in a container with the
location info, but that'd break every user of ProcessUtility_hook and
unless we put the container as the first by-value member of each
utility struct anyway, it'd mean extra allocations for each utility
statement. Not very appealing.

TBH I think that for utility statements, adding a member struct with
location info is the least-bad option. Something like:

/*
 * Location of the original querytext of an individual parsetree or plan
 * relative to the start of the querytext. This is usually offset 0 and the
 * length of the querytext itself, but in multi-statements and other cases
 * where a single parse has multiple products, hooks may need to know which
 * part of the querytext corresponds to which product plan.
 *
 * offset and length are both -1 for plans with no corresponding querytext,
 * such as additional/replacement queries emitted by the rewrite phase or
 * additional statements produced as a result of processing some utility
 * statements.
 */
typedef struct StmtLocation
{
    int offset;
    int length;
}

typedef struct CreateStmt
{
    NodeTag       type;
    StmtLocation  location;
    ....
}


Personally I don't much care if we do it this way, or by embedding the
NodeTag too and creating a common UtilityStmt. The latter probably has
benefits for later extension, but is a bit more annoying in the
shorter term.

-- 
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Hello Craig,

>>> [...] It's touching every single utility statement type, which is not 
>>> only pretty invasive in itself, but will break any pending patches 
>>> that add more utility statement types.
>>
>> Yep, but it is limited to headers and the break is trivial...
>
> I'm not sure how else that part can be done.
>
> We can add a common parent for all utility statement types, but that 
> also touches all utility statements, and since we're working with pure 
> C, it means 'type' won't be accessible as e.g. someCreateStmt.type, but 
> only as something like ((UtilityStatement)someCreateStmt).type or 
> someCreateStmt.hdr.type. [...] I don't think that's so bad, but it's not 
> exactly less invasive.

I thought of this way of implementing it for the submitted version; I 
decided against it because then it is less obvious what is directly in the 
structure, existing code may reference the tag field, and the number of 
header changes is the same, if a little less verbose.

>> [...] For fixing the information in pg_stat_statement, the location 
>> data must be transported from the parsed node to the query to the 
>> planned node, because the later two nodes types are passed to different 
>> hooks.
>
> Yep.
>
>> Now the detail is that utility statements, which seems to be nearly all of
>> them but select/update/delete/insert, do not have plans: The statement
>> itself is its own plan... so there is no place to store the location &
>> length.
>
> Yeah. I don't see any way around adding location info for utility
> statements one way or the other.

If utility statements are done this way, that's 95% of all statements, so 
the point of doing the remaining 5% with Tom's neat intermediate node 
trick seems moot; I think that it is better to have all of them done the 
same way.

> TBH I think that for utility statements, adding a member struct with
> location info is the least-bad option. Something like:

> typedef struct StmtLocation
> {
>    int offset;
>    int length;
> }

It is possible, but:

> typedef struct CreateStmt
> {
>        NodeTag         type;
>        StmtLocation  location;
>        ....
> }

Name "location" is already used meaning the offset or the directory 
destination for create table space, that would create a third option.

For homogeneity, ISTM that keeping location & length directly and renaming 
the table space location is the simplest & most homogeneous option.

-- 
Fabien.



Hello,

>> Yeah, that's what I was thinking of.  There aren't very many places that
>> would need to know about that, I believe; [...]
>
> For fixing the information in pg_stat_statement, the location data must be 
> transported from the parsed node to the query to the planned node, because 
> the later two nodes types are passed to different hooks.
>
> Now the detail is that utility statements, which seems to be nearly all of 
> them but select/update/delete/insert, do not have plans: The statement itself 
> is its own plan... so there is no place to store the location & length.

Here is an updated version:

Changes wrt v2:

  - I have added the missing stuff under /nodes/, this is stupid code that
    should be automatically generated:-(

  - I have added comments in "gram.y" about how the length is computed.
    I have also slightly simplified the rule code there.

  - I have renamed "location" in create table space to "location_dir"
    to avoid confusion.

  - I have renamed the fields to "location" and "length" instead of q*.

  - I have moved the location & length copies into standard_planner.

  - I have fixed the function declaration typo.

  - I have simplified the pgss_store code to avoid a copy, and moved the
    length truncation into qtext_store.

  - I have improved again the pg_stat_statement regression tests with
    combined utility statement tests, which implied some fixes to
    extract the right substring for utility queries.

However, not changed:

  - I cannot use the intermediate node trick suggested by Tom because
    it does not work for utility statements which do not have plans, so
    the code still adds location & length, sorry.

  - I still use the 'last_semicolon' lexer variable. The alternative is to
    change rules so as not to skip empty statements, then write a loop to
    compute the length based on successor location, and remove the empty
    statements. It can be done, I do not think it is better, it is only
    different and more verbose. I'll do it if required by a committer.

-- 
Fabien.
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] BUG: pg_stat_statements query normalization issues with combined queries

From
Kyotaro HORIGUCHI
Date:
Hello,

At Fri, 30 Dec 2016 15:10:42 +0100 (CET), Fabien COELHO <coelho@cri.ensmp.fr> wrote in
<alpine.DEB.2.20.1612301453280.32017@lancre>
> 
> Hello,
> 
> >> Yeah, that's what I was thinking of.  There aren't very many places
> >> that
> >> would need to know about that, I believe; [...]
> >
> > For fixing the information in pg_stat_statement, the location data
> > must be transported from the parsed node to the query to the planned
> > node, because the later two nodes types are passed to different hooks.
> >
> > Now the detail is that utility statements, which seems to be nearly
> > all of them but select/update/delete/insert, do not have plans: The
> > statement itself is its own plan... so there is no place to store the
> > location & length.
> 
> Here is an updated version:
> 
> Changes wrt v2:
> 
>  - I have added the missing stuff under /nodes/, this is stupid code that
>    should be automatically generated:-(
>
>  - I have added comments in "gram.y" about how the length is computed.
>    I have also slightly simplified the rule code there.
> 
>  - I have renamed "location" in create table space to "location_dir"
>    to avoid confusion.
> 
>  - I have renamed the fields "location" and "length" instead of q*.
> 
>  - I have moved the location & length copies in standard_planner.
> 
>  - I have fixed the function declaration typo.
> 
>  - I have simplified pgss_store code to avoid a copy, and move the
>    length truncation in qtext_store.
> 
>  - I have improved again the pg_stat_statement regression tests with
>    combined utility statement tests, which implied some fixes to
>    extract the right substring for utility queries.
> 
> However, not changed:
> 
>  - I cannot use the intermediate node trick suggested by Tom because
>    it does not work for utility statements which do not have plans, so
>    the code still adds location & length, sorry.

Wrapping the output of pg_parse_query still works for non-utility
commands. But pg_stat_statements retrieves the location
information via ProcessUtility, which has the plantree. Wrapping
the plantree with the same struct would also work, but it impacts
too many places and is extremely error-prone.

One available solution is, as Fabien did, to bring the location and
length data along with the transformation.  Another is to give a
chopped query string to each query tree.

The attached patch does it the second way (but it is quite ugly and
incomplete; it's a PoC). This seems to work as expected to a
certain degree but is somewhat bogus.  A significant drawback of
this is that it makes a copy of each query in a multi-statement.

>  - I still use the 'last_semicolon' lexer variable. The alternative is to
>    change rules so as not to skip empty statements, then write a loop to
>    compute the length based on successor location, and remove the empty
>    statements. It can be done, I do not think it is better, it is only
>    different and more verbose. I'll do it if required by a committer.

I think this is doable in the way shown in this patch. But this
seems somewhat bogus, too..

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:
> At Fri, 30 Dec 2016 15:10:42 +0100 (CET), Fabien COELHO <coelho@cri.ensmp.fr> wrote in
<alpine.DEB.2.20.1612301453280.32017@lancre>
>> - I cannot use the intermediate node trick suggested by Tom because
>> it does not work for utility statements which do not have plans, so
>> the code still adds location & length, sorry.

I still really, really don't like this patch.  I think it's going to
create continuing maintenance problems, because of shortcuts like this:

+#define isParseNodeTag(tag)    ((T_Query <= (tag)) && ((tag) < T_A_Expr))

We've never had any code that depends on ranges of nodetag numbers, and
now is not the time to start.  (The fact that you had to randomly relocate
some existing tags to make this work didn't make me any happier about it.)

>> - I still use the 'last_semicolon' lexer variable. The alternative is to
>> change rules so as not to skip empty statements, then write a loop to
>> compute the length based on successor location, and remove the empty
>> statements. It can be done, I do not think it is better, it is only
>> different and more verbose. I'll do it if required by a committer.

> I think this is doable in the way shown in this patch. But this
> seems somewhat bogus, too..

Yeah, I doubt that this technique for getting the raw locations in the
grammar+lexer works reliably.  In general, Bison is a bit asynchronous
in terms of scanning, and may or may not have scanned a token ahead of
what the current production covers.  So having production rules that
look at the scanner state is just dangerous.  You should be relying on
the Bison locations mechanism (@N), instead.  That has some gotchas of
its own with respect to locations of nonterminals, but at least they're
known --- and we could fix them if we had to.


Anyway, I decided to see what it would take to do it the way I had
in mind, which was to stick a separate RawStmt node atop each statement
parsetree.  The project turned out a bit larger than I hoped :-(, but
I'm really pleased with the end result, because it makes the APIs around
lists of statements much simpler and more regular than they used to be.
I would probably propose most of the attached patch for adoption even
if we weren't interested in tracking statement boundaries.

In brief, what this does is:

* The output of raw parsing is now always a list of RawStmt nodes;
the statement-type-dependent nodes are one level down from that.
This is similar to the existing rule that the output of parse analysis
is always a list of Query nodes, even though the Query struct is mostly
empty in the CMD_UTILITY case.  Furthermore, the output of pg_plan_queries
is now always a list of PlannedStmt nodes, even for utility statements.
In the case of a utility statement, "planning" just consists of wrapping
a CMD_UTILITY PlannedStmt around the utility node.  Now, every list of
statements has a consistent head-node type depending on how far along
it is in processing.
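
For reference, the wrapper itself is just a thin node along these lines (a
sketch; the exact field names are whatever the attached patch uses):

    typedef struct RawStmt
    {
        NodeTag     type;
        Node       *stmt;           /* raw parse tree */
        int         stmt_location;  /* start location, or -1 if unknown */
        int         stmt_len;       /* length in bytes; 0 means "rest of string" */
    } RawStmt;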

* This allows us to keep statement location/length in RawStmt, Query,
and PlannedStmt nodes, and not anywhere else.

* This results in touching quite a bit of code that is concerned with
lists of statements in different formats, but in most places, it's
simple changes like replacing generic Node* pointers with specific
RawStmt* or PlannedStmt* pointers.  I think that's an unalloyed good.

* To preserve the new rule that the node passed to top-level
parse_analyze() is now always a RawStmt, I made the grammar insert
RawStmts in the sub-statements of EXPLAIN, DECLARE CURSOR, etc.
This adds a bit of cruft in gram.y but avoids cruft elsewhere.

* In addition, I had to change the API of ProcessUtility() so that
what it's passed is the wrapper PlannedStmt not just the bare
utility statement.  This is what allows pg_stat_statements to get at
the location info for a utility statement.  This will affect all
users of ProcessUtility_hook, but the changes are pretty trivial
for those that don't care about location info --- see the changes
to contrib/sepgsql/hooks.c for an example.

* Because PlannedStmt also carries a canSetTag field, we're able to
get rid of some ad-hoc rules about how to reconstruct canSetTag for
a bare utility statement (specifically, the assumption that a utility
is canSetTag if and only if it's the only one in its list).  While
I see no near-term need for relaxing that, it's nice to get rid of the
ad-hocery.

* I chose to also rearrange the post-parse-analysis representation
of DECLARE CURSOR so that it looks more like EXPLAIN, PREPARE, etc.
That is, the contained SELECT remains a child of the DeclareCursorStmt
rather than getting flipped around to be the other way.  Possibly this
patch could have been made to work without that, but the existing code
had some weird rules about how the presence of a PlannedStmt where a
utility command was expected meant it was DECLARE CURSOR, and I was out
to get rid of weird rules like that one.  With these changes, it's
true for both Query and PlannedStmt that utilityStmt is non-null if
and only if commandType is CMD_UTILITY.  That allows simplifying a
whole lot of places that were testing both fields --- I think a few
of those were just defensive programming, but in a lot of them, it was
actually necessary to avoid confusing DECLARE CURSOR with SELECT.

So this ends up costing one extra palloc per statement, or two extra
in the case of a utility statement, but really that's quite negligible
compared to everything else that happens in processing a statement.
As against that, I think it makes the related code a lot clearer.
The sheer size of the patch is a bit more than Fabien's patch, but
what it is touching is not per-statement-type code but code that
works generically with lists of statements.  So I think it's less
likely to cause problems for other patches.

            regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment
Hello Tom,

> Yeah, I doubt that this technique for getting the raw locations in the
> grammar+lexer works reliably.

It worked reliably on all tests, and the assumption seemed sound to me, 
given the one-look-ahead property and that statement reductions can only 
be triggered when ';' or the end of input comes.

> Anyway, I decided to see what it would take to do it the way I had
> in mind, which was to stick a separate RawStmt node atop each statement
> parsetree.  The project turned out a bit larger than I hoped :-(,

Indeed: I tried it as it looked elegant, but it did not turn out to be the 
simple thing you announced...

> [...] Furthermore, the output of pg_plan_queries is now always a list of 
> PlannedStmt nodes, even for utility statements.

I stopped exactly there when I tried: I thought that changing that would 
be enough to get the patch rejected, because it would be considered much 
too invasive in too many places for fixing a small bug.

> So this ends up costing one extra palloc per statement, or two extra
> in the case of a utility statement, but really that's quite negligible
> compared to everything else that happens in processing a statement.

> As against that, I think it makes the related code a lot clearer.

Having spent some time there, I agree that a refactoring and cleanup was 
needed anyway...

> The sheer size of the patch is a bit more than Fabien's patch,

Yep, nearly twice as big and significantly more complex, but the result is 
a definite improvement.

> but what it is touching is not per-statement-type code but code that 
> works generically with lists of statements.  So I think it's less likely 
> to cause problems for other patches.

The patch applies (with "patch"; "git apply" did not like it, though), 
compiles, overall make check is ok, and the pgss make check is ok as well. 
I do not know whether the coverage of the pg tests is enough to tell 
whether anything is broken...

I noticed that you also improved the pgss fix and tests. Good. I was 
unsure about removing spaces at the end of the query because of possible 
encoding issues.

The update end trick is nice to deal with empty statements. I wonder 
whether the sub "query" in Copy, View, CreateAs, Explain, Prepare, 
Execute, Declare... could be typed RawStmt* instead of Node*. That would 
avoid some casts in updateRawStmtEnd, but maybe add some others elsewhere.

One point bothered me a bit when looking at the initial code, and your 
refactoring has not changed it: the location refers to a string which is 
not accessible from the node, so the string must be kept elsewhere and 
passed as a second argument here and there. ISTM that it would make some 
sense to have the initial string directly available from RawStmt, Query 
and PlannedStmt.

-- 
Fabien.



Fabien COELHO <coelho@cri.ensmp.fr> writes:
>> [...] Furthermore, the output of pg_plan_queries is now always a list of 
>> PlannedStmt nodes, even for utility statements.

> I stopped exactly there when I tried: I thought that changing that would 
> be enough to get a reject because it would be considered much too invasive 
> in too many places for fixing a small bug.

Well, I don't know that we'd have bothered to change this without a
reason, but now that the effort's been put in, it seems a lot cleaner
to me.

> I noticed that you also improved the pgss fix and tests. Good. I was 
> unsure about removing spaces at the end of the query because of possible 
> encoding issues.

It's safe in backend-legal encodings.  I agree it could be problematic
in, say, SJIS, but we disallow that as a backend encoding precisely
because of this type of hazard.
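
For anyone wondering why byte-wise trimming like that is safe: in every
backend-legal encoding, all bytes of a multibyte character have the high
bit set, so an ASCII byte such as a trailing space can never be the tail
of a multibyte character (SJIS lacks that property, hence client-only).
A tiny sketch of the kind of trimming involved -- illustrative only, not
the committed pg_stat_statements code:

/* Trim trailing ASCII whitespace from a statement's text.  Safe in
 * backend-legal encodings because no multibyte character there can end
 * in a byte below 0x80.  Illustrative sketch only. */
static int
trim_statement_tail(const char *query, int len)
{
    while (len > 0)
    {
        unsigned char ch = (unsigned char) query[len - 1];

        if (ch == ' ' || ch == '\t' || ch == '\n' || ch == '\r')
            len--;
        else
            break;
    }
    return len;
}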

> The update end trick is nice to deal with empty statements. I wonder 
> whether the sub "query" in Copy, View, CreateAs, Explain, Prepare, 
> Execute, Declare... could be typed RawStmt* instead of Node*.

I did that for some of them, eg CopyStmt.  The trouble for statements like
EXPLAIN is that we replace the sub-statement during parse analysis, so
that the field points to either a raw grammar output node or to a Query
depending on when you look.  If we wanted to get rid of using a Node*
field there, we'd have to invent two separate struct types to represent
raw EXPLAIN and post-parse-analyze EXPLAIN (analogously to the way that,
say, ColumnRef and Var are distinct node types).  It doesn't seem worth
the trouble to me.

> One point bothered me a bit when looking at the initial code, that your 
> refactoring has not changed: the location is about a string which is not 
> accessible from the node, so that the string must be kept elsewhere and 
> passed as a second argument here and there. ISTM that it would make some 
> sense to have the initial string directly available from RawStmt, Query 
> and PlannedStmt.

Yeah, perhaps, but that would be something to tackle as a separate round
of changes I think.  This might also tie into the nearby thread about
having that string available to pass to parallel workers.  The history
here is that we didn't use to keep the source string around after parsing,
but that's not the case anymore.  But we've not entirely accepted that
in all the internal APIs ...
        regards, tom lane



I wrote:
> Fabien COELHO <coelho@cri.ensmp.fr> writes:
>> One point bothered me a bit when looking at the initial code, that your 
>> refactoring has not changed: the location is about a string which is not 
>> accessible from the node, so that the string must be kept elsewhere and 
>> passed as a second argument here and there. ISTM that it would make some 
>> sense to have the initial string directly available from RawStmt, Query 
>> and PlannedStmt.

> Yeah, perhaps, but that would be something to tackle as a separate round
> of changes I think.

I remembered one reason why we haven't done this: it's unclear how we'd
handle copying if we do it.  If, say, Query contains a "char *" pointer
then you'd expect copyObject() to pstrdup that string, but we really don't
want that to happen in scenarios where a single source string generated a
large number of Query trees.  That's not an academic concern, we've gotten
real field bug reports when we broke it --- cf commit 1591fcbec.  We'd
need to work out a way of managing multiple Queries carrying references
to the same source string, and it's not clear how to do that reasonably.
So for now, it remains the path of least resistance for Portals and
CachedPlans to contain a list of Query or PlannedStmt objects and a
separate source string.
        regards, tom lane



> [...] it seems a lot cleaner to me.

I agree.

> [...] It's safe in backend-legal encodings.

Good:-)

> [...] The trouble for statements like EXPLAIN is that we replace the 
> sub-statement during parse analysis, [...]

Ok. So Node* is fine to keep it generic.

-- 
Fabien.



About having a pointer to the initial string from RawStmt, Query & 
PlannedStmt:

> I remembered one reason why we haven't done this: it's unclear how we'd 
> handle copying if we do it. If, say, Query contains a "char *" pointer 
> then you'd expect copyObject() to pstrdup that string, [..., So] We'd 
> need to work out a way of managing multiple Queries carrying references 
> to the same source string, and it's not clear how to do that reasonably.

For me it would be shared, but then it may break some memory-management 
assumptions downstream.

> So for now, it remains the path of least resistance for Portals and 
> CachedPlans to contain a list of Query or PlannedStmt objects and a 
> separate source string.

Ok. Thanks for the explanation.

-- 
Fabien.



Re: [HACKERS] BUG: pg_stat_statements query normalization issues with combined queries

From
Kyotaro HORIGUCHI
Date:
Hello,

It is big, but I think quite reasonable. The refactoring makes
many things clearer and I like it. No objection to the
patch. The effect of changing some APIs is not a problem.

At Thu, 12 Jan 2017 19:00:47 +0100 (CET), Fabien COELHO <coelho@cri.ensmp.fr> wrote in
<alpine.DEB.2.20.1701121752430.3788@lancre>
> Patch applies (with "patch", "git apply" did not like it though),
> compiles, overall make check ok, pgss make check ok as well. I do not
> know whether the coverage of pg tests is enough to know whether
> anything is broken...

The same for me. There are some parts that I haven't fully
understood, though..

> I noticed that you also improved the pgss fix and tests. Good. I was
> unsure about removing spaces at the end of the query because of
> possible encoding issues.
> 
> The update end trick is nice to deal with empty statements. I wonder
> whether the sub "query" in Copy, View, CreateAs, Explain, Prepare,
> Execute, Declare... could be typed RawStmt* instead of Node*. That
> would avoid some casts in updateRawStmtEnd, but maybe add some other
> elsewhere.
> 
> One point bothered me a bit when looking at the initial code, that
> your refactoring has not changed: the location is about a string which
> is not accessible from the node, so that the string must be kept
> elsewhere and passed as a second argument here and there. ISTM that it
> would make some sense to have the initial string directly available
> from RawStmt, Query and PlannedStmt.

I found an inconsistency (in style, not function:) between
copyfuncs and equalfuncs. The patch handles the three members
utilityStmt, stmt_location and stmt_len of Query at once, and
copyfuncs seems to have been edited to follow that rule, but in
equalfuncs the position of utilityStmt is not moved.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] BUG: pg_stat_statements query normalization issues with combined queries

From
Kyotaro HORIGUCHI
Date:
Hello,

At Thu, 12 Jan 2017 20:08:54 +0100 (CET), Fabien COELHO <coelho@cri.ensmp.fr> wrote in
<alpine.DEB.2.20.1701122004190.3788@lancre>
> 
> About having a pointer to the initial string from RawStmt, Query &
> PlannedStmt:
> 
> > I remembered one reason why we haven't done this: it's unclear how
> > we'd handle copying if we do it. If, say, Query contains a "char *"
> > pointer then you'd expect copyObject() to pstrdup that string, [...,
> > So] We'd need to work out a way of managing multiple Queries carrying
> > references to the same source string, and it's not clear how to do
> > that reasonably.
> 
> For me it would be shared, but then it may break some memory
> management hypothesis downstream.

+1 to having them carry a pointer to the shared query string. But doing
that without some measure like reference counting seems
difficult..

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:
> I found an inconsistency (in style, not function:) between
> copyfuncs and equalfuncs. The patch handles the three members
> utilityStmt, stmt_location and stmt_len of Query at once and
> copyfuncs seems to be edited to follow the rule, but in
> equalfuncs the position of utilityStmt is not moved.

I think you're looking at struct Query's instance of utilityStmt.
I didn't move that one.  (Maybe I should've, but it didn't
look quite as randomly dropped into the middle of the struct
as the PlannedStmt one did.  Just a judgment call in either case.)
        regards, tom lane





On 13 Jan. 2017 16:35, "Kyotaro HORIGUCHI" <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Hello,

At Thu, 12 Jan 2017 20:08:54 +0100 (CET), Fabien COELHO <coelho@cri.ensmp.fr> wrote in <alpine.DEB.2.20.1701122004190.3788@lancre>
>
> About having a pointer to the initial string from RawStmt, Query &
> PlannedStmt:
>
> > I remembered one reason why we haven't done this: it's unclear how
> > we'd handle copying if we do it. If, say, Query contains a "char *"
> > pointer then you'd expect copyObject() to pstrdup that string, [...,
> > So] We'd need to work out a way of managing multiple Queries carrying
> > references to the same source string, and it's not clear how to do
> > that reasonably.
>
> For me it would be shared, but then it may break some memory
> management hypothesis downstream.

+1 to having them carry a pointer to the shared query string. But doing
that without some measure like reference counting seems
difficult..


Sounds like it'd be better as a separate change so as not to block this one.

I really like what you have done Tom, though I'm about to travel so I haven't read it in full detail. Like Fabien I would've been certain that it'd be rejected if I tried it, but I sure am glad you did it.


Craig Ringer <craig.ringer@2ndquadrant.com> writes:
> Sounds like it'd be better as a separate change so as not to block this one.

Yeah, if we do that at all it should be a separate patch IMO.  The
concerns are largely different.

> I really like what you have done Tom, though I'm about to travel so I
> haven't read it in full detail. Like Fabien I would've been certain that
> it'd be rejected if I tried it, but I sure am glad you did it.

Does anyone want to do further review on this before I push it?

One thing that I'm not quite satisfied with is the business with
non-top-level RawStmt nodes in some utility statements.  That's certainly
a wart from gram.y's perspective, and it's mostly a wart from analyze.c's
perspective as well --- the parse analyze routines mostly just throw away
the non-top-level RawStmt.

The original reason for doing it was that DoCopy needs to call
pg_analyze_and_rewrite() on the sub-statement, and I wanted
pg_analyze_and_rewrite's argument to be declared as RawStmt, so CopyStmt's
sub-statement had to have a RawStmt.  And it seemed like a good idea for
the other utility statements to be consistent with that.

However, when all the dust settled, DoCopy ended up needing to construct
a RawStmt node anyway for the case where it's inventing a Query tree to
support an RLS query.  We could make it construct a RawStmt node in the
plain copy-from-query case too, and then there needn't be one in the
grammar output.

So I'm now thinking that it might be better if the grammar produced
RawStmt only at top level, and anybody who calls pg_analyze_and_rewrite
on sub-sections of a utility statement has to cons up a RawStmt to put
at the top of the sub-query.  This complicates life a little for utility
statements that call parse analysis during execution, but the principle
that RawStmt appears only at top level seems attractively simple.
We don't nest PlannedStmts either.

One thing that connects into this is why do we have some utility
statements that do parse analysis of sub-statements immediately during
their own parse analysis, while others postpone that to execution?
I've not researched it yet, but my vague memory is that EXPLAIN and
friends used to all do it in the second way, and we realized that that
was broken so we changed them to the first way.  COPY has only recently
grown the ability to have a sub-query, and I'm wondering if it's just a
failure of institutional memory that led us to do it the second way.
If we ended up requiring all these cases to work more like EXPLAIN,
they'd not be needing to construct RawStmts at execution anyway.
        regards, tom lane



> One thing that I'm not quite satisfied with is the business with
> non-top-level RawStmt nodes in some utility statements.  That's certainly
> a wart from gram.y's perspective, and it's mostly a wart from analyze.c's
> perspective as well --- the parse analyze routines mostly just throw away
> the non-top-level RawStmt.
>
> The original reason for doing it was that DoCopy needs to call
> pg_analyze_and_rewrite() on the sub-statement, and I wanted
> pg_analyze_and_rewrite's argument to be declared as RawStmt,

My 0,02€, feel free to ignore them:

Personally, when I started doing a version I had decided to only change 
the type at the top level, and then I made a few functions resilient to 
receiving either a RawStmt (which I had called ParsedStmt in my version) 
or a direct statement; rather than change the type, I had kept Node*.

Now I see the benefit of changing the type, because the compiler will 
complain if there is an issue, and there are no hidden level changes.

> So I'm now thinking that it might be better if the grammar produced
> RawStmt only at top level, and anybody who calls pg_analyze_and_rewrite
> on sub-sections of a utility statement has to cons up a RawStmt to put
> at the top of the sub-query.

Why not. The lazy programmer in me notices that there seem to be only 6 
instances, some of which are already dealt with, so this is not too bad. 
The RawStmt may not need to be allocated dynamically; a stack instance 
could be enough, as in the sketch below.
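
For what it's worth, the kind of ad-hoc wrapping could look like this
sketch; the pg_analyze_and_rewrite signature and the field conventions are
only my reading of the patch, and sub_parsetree/queryString just stand for
whatever the caller has on hand:

#include "postgres.h"
#include "nodes/parsenodes.h"
#include "tcop/tcopprot.h"

/* Sketch: wrap a utility sub-statement in a RawStmt before parse analysis.
 * A stack instance may do, provided parse analysis does not keep the pointer. */
static List *
analyze_substatement(Node *sub_parsetree, const char *queryString)
{
    RawStmt     rs;

    memset(&rs, 0, sizeof(rs));
    rs.type = T_RawStmt;
    rs.stmt = sub_parsetree;
    rs.stmt_location = 0;       /* or the real offset within queryString */
    rs.stmt_len = 0;            /* 0 taken to mean "up to the end of the string" */

    return pg_analyze_and_rewrite(&rs, queryString, NULL, 0);
}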

-- 
Fabien.

Fabien COELHO <coelho@cri.ensmp.fr> writes:
>> One thing that I'm not quite satisfied with is the business with
>> non-top-level RawStmt nodes in some utility statements. 
>> ...
>> So I'm now thinking that it might be better if the grammar produced
>> RawStmt only at top level, and anybody who calls pg_analyze_and_rewrite
>> on sub-sections of a utility statement has to cons up a RawStmt to put
>> at the top of the sub-query.

> Why not. The lazy programmer I am notices that there seems to be 6 
> instances, this is not too bad, some of which are already dealt with. The 
> RawStmt may not need to be allocated dynamically, a stack instance could 
> be enough.

Here's a v2 that does it like that.  It ends up being about 30 fewer lines
of code overall, despite there being four places that have to make ad-hoc
RawStmt nodes.  On the whole I feel like this is cleaner, although there's
room to argue that.  Still, the extra cruft is in places that I'm
suspicious are wrong anyway, as I noted.

            regards, tom lane


> Here's a v2 that does it like that.

Applies, compiles, overall & pgss make check are both ok.

Personally I find having RawStmt only for the top parsed statement 
cleaner, and it significantly reduces changes in the parser, as expected, 
without inducing too many changes elsewhere.

Same comment as on v1: I'm wondering about the actual coverage of the 
default tests... I cannot say I understand everything.

> It ends up being about 30 fewer lines of code overall, despite there 
> being four places that have to make ad-hoc RawStmt nodes.  On the whole 
> I feel like this is cleaner,

I agree: Better typing, more homogeneous code (PlannedStmt for all), 
less ad-hoc checks to work around utility statements...

-- 
Fabien.



Fabien COELHO <coelho@cri.ensmp.fr> writes:
>> It ends up being about 30 fewer lines of code overall, despite there 
>> being four places that have to make ad-hoc RawStmt nodes.  On the whole 
>> I feel like this is cleaner,

> I agree: Better typing, more homogeneous code (PlannedStmt for all), 
> less ad-hoc checks to work around utility statements...

OK, pushed like that.
        regards, tom lane



On 15 January 2017 at 05:18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Fabien COELHO <coelho@cri.ensmp.fr> writes:
>>> It ends up being about 30 fewer lines of code overall, despite there
>>> being four places that have to make ad-hoc RawStmt nodes.  On the whole
>>> I feel like this is cleaner,
>
>> I agree: Better typing, more homogeneous code (PlannedStmt for all),
>> less ad-hoc checks to work around utility statements...
>
> OK, pushed like that.


Thanks very much for that Tom, it's great to see this change.

One suggestion: it's currently non-obvious that ProcessUtility_hook
gets called with the full text of all parts of a multi-statement. I
suggest adding the following wording to the comment on ProcessUtility()
in src/backend/tcop/utility.c, after the note on the query string, or
something like it:

 The same query string may be passed to multiple invocations of
 ProcessUtility if a utility statement in turn invokes other utility
 statements, or if the user supplied a query string containing multiple
 semicolon-separated statements in a single protocol message. It is also
 possible for the query text to contain other non-utility-statement text
 like comments, empty statements, and plannable statements. Callers that
 use the queryString should use pstmt->stmt_location and pstmt->stmt_len
 to extract the text for the statement of interest and should guard
 against re-entrant invocation.


That should help with at least some of the traps around
ProcessUtility_hook, and I certainly wish I'd known it some months
ago.



For the record, this is commits
ab1f0c8225714aaa18d2f9ca4f80cd009f145421 and
83f2061dd037477ec8479ee160367840e203a722 .

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=ab1f0c8225714aaa18d2f9ca4f80cd009f145421

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=83f2061dd037477ec8479ee160367840e203a722

-- 
Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Craig Ringer <craig.ringer@2ndquadrant.com> writes:
> One suggestion: it's currently non-obvious that ProcessUtility_hook
> gets called with the full text of all parts of a multi-statement.

OK, we can improve that ...

>   The same query string may be passed to multiple invocations of ProcessUtility
>   if a utility statement in turn invokes other utility statements, or if the
>   user supplied a query string containing multiple semicolon-separated
>   statements in a single protocol message. It is also possible for the query
>   text to contain other non-utility-statement text like comments, empty
>   statements, and plannable statements. Callers that use the queryString
>   should use pstmt->stmt_location and pstmt->stmt_len to extract the text for
>   the statement of interest and should guard against re-entrant invocation.

Not sure about the reference to re-entrancy.  It's not especially relevant
to query texts AFAICS, and wouldn't a utility statement know darn well if
it was doing something that could end up invoking another instance of
itself?
        regards, tom lane



On 26 January 2017 at 21:42, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Craig Ringer <craig.ringer@2ndquadrant.com> writes:
>> One suggestion: it's currently non-obvious that ProcessUtility_hook
>> gets called with the full text of all parts of a multi-statement.
>
> OK, we can improve that ...
>
>>   The same query string may be passed to multiple invocations of ProcessUtility
>>   if a utility statement in turn invokes other utility statements, or if the
>>   user supplied a query string containing multiple semicolon-separated
>>   statements in a single protocol message. It is also possible for the query
>>   text to contain other non-utility-statement text like comments, empty
>>   statements, and plannable statements. Callers that use the queryString
>>   should use pstmt->stmt_location and pstmt->stmt_len to extract the text for
>>   the statement of interest and should guard against re-entrant invocation.
>
> Not sure about the reference to re-entrancy.  It's not especially relevant
> to query texts AFAICS, and wouldn't a utility statement know darn well if
> it was doing something that could end up invoking another instance of
> itself?

The utility statement does, but the hooks don't necessarily. If you fire an

ALTER TABLE ... ADD COLUMN .. ADD COLUMN .. ADD CONSTRAINT ..;

for example.

However, I was wrong to say we must guard against re-entrancy. We
should only enter ProcessUtility once with context ==
PROCESS_UTILITY_QUERY, and that's what hooks should be looking at
rather than keeping track of re-entrant invocations.

So perhaps:


"The same query string may be passed to multiple invocations of
ProcessUtility if a utility statement invokes subcommands (e.g. ALTER
TABLE), in which case context will be set to
PROCESS_UTILITY_SUBCOMMAND, or if the user supplied a query string
containing multiple semicolon-separated statements in a single
protocol message. It is also possible for the query text to contain
other non-utility-statement text like comments, empty  statements, and
plannable statements that don't pass through ProcessUtility. Hooks
that use the queryString should use pstmt->stmt_location and
pstmt->stmt_len to extract the text for the statement of interest and
should pay attention to the context to avoid repeatedly handling the
same query string in subcommands."



-- 
Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Craig Ringer <craig.ringer@2ndquadrant.com> writes:
> So perhaps:

> "The same query string may be passed to multiple invocations of
> ProcessUtility if a utility statement invokes subcommands (e.g. ALTER
> TABLE), in which case context will be set to
> PROCESS_UTILITY_SUBCOMMAND, or if the user supplied a query string
> containing multiple semicolon-separated statements in a single
> protocol message. It is also possible for the query text to contain
> other non-utility-statement text like comments, empty  statements, and
> plannable statements that don't pass through ProcessUtility. Hooks
> that use the queryString should use pstmt->stmt_location and
> pstmt->stmt_len to extract the text for the statement of interest and
> should pay attention to the context to avoid repeatedly handling the
> same query string in subcommands."

Meh.  I really don't think that a comment about the query string is
the place to get into promises that are unrelated to the string, and
that ProcessUtility is in no position to enforce anyway.  How about
we just say this:

"The same queryString may be passed to multiple invocations of
ProcessUtility when processing a query string containing multiple
semicolon-separated statements; one should use pstmt->stmt_location and
pstmt->stmt_len to identify the substring containing the current
statement.  Keep in mind also that some utility statements (e.g.,
CREATE SCHEMA) will recurse to ProcessUtility to process sub-statements,
often passing down the same queryString, stmt_location, and stmt_len
that were given for the whole statement."
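
As an illustration of that convention for hook authors, a minimal sketch
(assuming stmt_location is -1 when unknown and a stmt_len of 0 means
"rest of string"; the helper name is made up):

#include "postgres.h"

/* Return a palloc'd copy of the current statement's text, following the
 * stmt_location/stmt_len conventions.  Illustrative sketch only. */
static char *
current_statement_text(const char *queryString, int stmt_location, int stmt_len)
{
    int         query_len = (int) strlen(queryString);
    int         start = (stmt_location >= 0) ? stmt_location : 0;
    int         len = (stmt_len > 0) ? stmt_len : query_len - start;

    return pnstrdup(queryString + start, len);
}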
        regards, tom lane



Re: [HACKERS] BUG: pg_stat_statements query normalization issues with combined queries

From
Kyotaro HORIGUCHI
Date:
At Thu, 26 Jan 2017 22:34:57 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in <23778.1485488097@sss.pgh.pa.us>
> Craig Ringer <craig.ringer@2ndquadrant.com> writes:
> > So perhaps:
> 
> > "The same query string may be passed to multiple invocations of
> > ProcessUtility if a utility statement invokes subcommands (e.g. ALTER
> > TABLE), in which case context will be set to
> > PROCESS_UTILITY_SUBCOMMAND, or if the user supplied a query string
> > containing multiple semicolon-separated statements in a single
> > protocol message. It is also possible for the query text to contain
> > other non-utility-statement text like comments, empty  statements, and
> > plannable statements that don't pass through ProcessUtility. Hooks
> > that use the queryString should use pstmt->stmt_location and
> > pstmt->stmt_len to extract the text for the statement of interest and
> > should pay attention to the context to avoid repeatedly handling the
> > same query string in subcommands."
> 
> Meh.  I really don't think that a comment about the query string is
> the place to get into promises that are unrelated to the string, and
> that ProcessUtility is in no position to enforce anyway.  How about
> we just say this:
> 
> "The same queryString may be passed to multiple invocations of
> ProcessUtility when processing a query string containing multiple
> semicolon-separated statements; one should use pstmt->stmt_location and
> pstmt->stmt_len to identify the substring containing the current
> statement.  Keep in mind also that some utility statements (e.g.,
> CREATE SCHEMA) will recurse to ProcessUtility to process sub-statements,
> often passing down the same queryString, stmt_location, and stmt_len
> that were given for the whole statement."

Tom's rewrite is easier to read for me and seems to tell me
enough as a user of the hook.

By the way the existing comment for the hook,

> *
> * We provide a function hook variable that lets loadable plugins get
> * control when ProcessUtility is called.  Such a plugin would normally
> * call standard_ProcessUtility().
> */

This is quite a matter of course for developers. planner_hook and
the other ones don't have such a comment, or have a one-liner at
most. Since this hook gets a good amount of comment, it seems
better to just remove the existing one..

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center







On 27 Jan. 2017 14:34, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
Craig Ringer <craig.ringer@2ndquadrant.com> writes:
> So perhaps:

> "The same query string may be passed to multiple invocations of
> ProcessUtility if a utility statement invokes subcommands (e.g. ALTER
> TABLE), in which case context will be set to
> PROCESS_UTILITY_SUBCOMMAND, or if the user supplied a query string
> containing multiple semicolon-separated statements in a single
> protocol message. It is also possible for the query text to contain
> other non-utility-statement text like comments, empty  statements, and
> plannable statements that don't pass through ProcessUtility. Hooks
> that use the queryString should use pstmt->stmt_location and
> pstmt->stmt_len to extract the text for the statement of interest and
> should pay attention to the context to avoid repeatedly handling the
> same query string in subcommands."

Meh.  I really don't think that a comment about the query string is
the place to get into promises that are unrelated to the string, and
that ProcessUtility is in no position to enforce anyway.  How about
we just say this:

"The same queryString may be passed to multiple invocations of
ProcessUtility when processing a query string containing multiple
semicolon-separated statements; one should use pstmt->stmt_location and
pstmt->stmt_len to identify the substring containing the current
statement.  Keep in mind also that some utility statements (e.g.,
CREATE SCHEMA) will recurse to ProcessUtility to process sub-statements,
often passing down the same queryString, stmt_location, and stmt_len
that were given for the whole statement."

Much better wording. I like that.


Craig Ringer <craig.ringer@2ndquadrant.com> writes:
> On 27 Jan. 2017 14:34, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
>> "The same queryString may be passed to multiple invocations of
>> ProcessUtility when processing a query string containing multiple
>> semicolon-separated statements; one should use pstmt->stmt_location and
>> pstmt->stmt_len to identify the substring containing the current
>> statement.  Keep in mind also that some utility statements (e.g.,
>> CREATE SCHEMA) will recurse to ProcessUtility to process sub-statements,
>> often passing down the same queryString, stmt_location, and stmt_len
>> that were given for the whole statement."

> Much better wording. I like that.

OK, done that way.
        regards, tom lane



Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:
> By the way the existing comment for the hook,

>> *
>> * We provide a function hook variable that lets loadable plugins get
>> * control when ProcessUtility is called.  Such a plugin would normally
>> * call standard_ProcessUtility().
>> */

> This is quite a matter of course for developers. planner_hook and
> other ones don't have such a comment or have a one-liner at
> most. Since this hook gets a good amount of comment, it seems
> better to just remove the existing one..

Hm?  I see pretty much the same wording for planner_hook:
 * To support loadable plugins that monitor or modify planner behavior,
 * we provide a hook variable that lets a plugin get control before and
 * after the standard planning process.  The plugin would normally call
 * standard_planner().

I think other places have copied-and-pasted this as well.

The real problem with this is it isn't a correct specification of the
contract: in most cases, the right thing to do is more like "call the
previous occupant of the ProcessUtility_hook, or standard_ProcessUtility
if there was none."

Not sure if it's worth trying to be more precise in these comments,
but if we do something about it, we need to hit all the places with
this wording.
        regards, tom lane





On 28 Jan. 2017 02:08, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:
> By the way the existing comment for the hook,

>> *
>> * We provide a function hook variable that lets loadable plugins get
>> * control when ProcessUtility is called.  Such a plugin would normally
>> * call standard_ProcessUtility().
>> */

> This is quite a matter of course for developers. planner_hook and
> other ones don't have such a comment or have a one-liner at
> most. Since this hook gets a good amount of comment, it seems
> better to just remove the existing one..

Hm?  I see pretty much the same wording for planner_hook:

 * To support loadable plugins that monitor or modify planner behavior,
 * we provide a hook variable that lets a plugin get control before and
 * after the standard planning process.  The plugin would normally call
 * standard_planner().

I think other places have copied-and-pasted this as well.

The real problem with this is it isn't a correct specification of the
contract: in most cases, the right thing to do is more like "call the
previous occupant of the ProcessUtility_hook, or standard_ProcessUtility
if there was none."

Not sure if it's worth trying to be more precise in these comments,
but if we do something about it, we need to hit all the places with
this wording.

I'd rather leave it for the hooks documentation work that's being done elsewhere and have a README.hooks or something that discusses patterns common across hooks. Then we can just reference it.

Otherwise we'll have a lot of repetitive comments. Especially once we mention that you can also ERROR out or (potentially) take no action at all.

Re: [HACKERS] BUG: pg_stat_statements query normalization issues with combined queries

From
Kyotaro HORIGUCHI
Date:
Mmm..

At Sat, 28 Jan 2017 11:52:20 +0800, Craig Ringer <craig.ringer@2ndquadrant.com> wrote in
<CAMsr+YGX0uU5Rw0fOKFwxtg_5WxWOBQ=KiTHWiKrb65H67inwg@mail.gmail.com>
> On 28 Jan. 2017 02:08, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
> 
> Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:
> > By the way the existing comment for the hook,
> 
> >> *
> >> * We provide a function hook variable that lets loadable plugins get
> >> * control when ProcessUtility is called.  Such a plugin would normally
> >> * call standard_ProcessUtility().
> >> */
> 
> > This is quite a matter of course for developers. planner_hook and
> > other ones don't have such a comment or have a one-liner at
> > most. Since this hook gets a good amount of comment, it seems
> > better to just remove the existing one..
> 
> Hm?  I see pretty much the same wording for planner_hook:

Sorry, my eyes were very short-sighted. Surely most of them have
comments with a very similar phrase. ExplainOneQuery seems to be
one exception, but I would call ExplainOneQuery in the hook function
anyway, though :p

Anyway, I don't want to remove the comment in ProcessUtility, since
now I know that it is rather the common case.

>  * To support loadable plugins that monitor or modify planner behavior,
>  * we provide a hook variable that lets a plugin get control before and
>  * after the standard planning process.  The plugin would normally call
>  * standard_planner().
> 
> I think other places have copied-and-pasted this as well.
> 
> The real problem with this is it isn't a correct specification of the
> contract: in most cases, the right thing to do is more like "call the
> previous occupant of the ProcessUtility_hook, or standard_ProcessUtility
> if there was none."
> 
> Not sure if it's worth trying to be more precise in these comments,
> but if we do something about it, we need to hit all the places with
> this wording.

That seems bad. I would rather leave them alone. Modifying
them wouldn't be much of a gain if many of them already have such a
comment.

> I'd rather leave it for the hooks documentation work that's being done
> elsewhere and have a README.hooks or something that discusses patterns
> common across hooks. Then we can just reference it.
> 
> Otherwise we'll have a lot of repetitive comments. Especially once we
> mention that you can also ERROR out or (potentially) take no action at all.

Sorry for the noise..

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center