Thread: [HACKERS] [PATCH] Use $ parameters as replacement characters for pg_stat_statements

Hi,

Currently pg_stat_statements replaces constant values with ? characters. I've seen this be a problem on multiple occasions, in particular since it conflicts with the use of ? as an operator.

I'd like to propose changing the replacement character from ? to instead be a parameter (like $1).

My main motiviation is to aid external tools that parse pg_stat_statements output (like [0]), and currently have to resort to complex parser patches to distinguish operators from replacement characters.

First of all, attached 0001_pgss_additional_regression_tests.v1.patch which increases regression test coverage for a few scenarios relevant to this discussion.

Then, there is two variants I've prepared, only one of the two to be committed:

A) 0002_pgss_mask_with_incrementing_params.v1.patch: Replace constants with $1, $2, $3 (etc) in the normalized query, whilst respecting any existing params in the counting

B) 0003_pgss_mask_with_zero_param.v1.patch: Replace constants with $0 in the normalized query

Ideally I'd see A turn into something we can commit, but it involves a bit more computation to get the parameter numbers right. The benefit is that we could use PREPARE/EXECUTE with pg_stat_statements output.

B is intentionally simple, and would address the primary issue at hand (distinguishing from the ? operator).

I'm also adding this patch to the commitfest starting tomorrow.

Best,
Lukas


--
Lukas Fittl
Attachment
On 2/28/17 20:01, Lukas Fittl wrote:
> Currently pg_stat_statements replaces constant values with ? characters.
> I've seen this be a problem on multiple occasions, in particular since
> it conflicts with the use of ? as an operator.
> 
> I'd like to propose changing the replacement character from ? to instead
> be a parameter (like $1).

Hmm, I think this could confuse people into thinking that the queries
displayed were in fact prepared queries.

Maybe we could gather some more ideas.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



On Wed, Mar 1, 2017 at 6:51 AM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
Hmm, I think this could confuse people into thinking that the queries
displayed were in fact prepared queries.

Maybe we could gather some more ideas.

I think thats a reasonable concern - the main benefit of $1 is that its already designated as something that can replace a constant, and still be read by the Postgres parser.

Is there any other character that has the same properties?

I'll also note that Greg Stark mentioned in [0] that "There's another feature pg_stat_statements *really* needs. A way to convert a jumbled statement into one that can be prepared easily. The use of ? instead of :1 :2 etc makes this a mechanical but annoying process."

Using $1, $2, etc. for jumbling statements would give us that for free, no additional effort needed.


Best,
Lukas

--
Lukas Fittl
On Wed, Mar 1, 2017 at 8:21 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> On 2/28/17 20:01, Lukas Fittl wrote:
>> Currently pg_stat_statements replaces constant values with ? characters.
>> I've seen this be a problem on multiple occasions, in particular since
>> it conflicts with the use of ? as an operator.
>>
>> I'd like to propose changing the replacement character from ? to instead
>> be a parameter (like $1).
>
> Hmm, I think this could confuse people into thinking that the queries
> displayed were in fact prepared queries.
>
> Maybe we could gather some more ideas.

Perhaps there could be a choice of behaviors.  Even if we all agreed
that parameter notation was better in theory, there's something to be
said for maintaining backward compatibility, or having an option to do
so.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Mar 1, 2017 at 8:21 PM, Peter Eisentraut
> <peter.eisentraut@2ndquadrant.com> wrote:
>> On 2/28/17 20:01, Lukas Fittl wrote:
>>> I'd like to propose changing the replacement character from ? to instead
>>> be a parameter (like $1).

>> Hmm, I think this could confuse people into thinking that the queries
>> displayed were in fact prepared queries.

> Perhaps there could be a choice of behaviors.  Even if we all agreed
> that parameter notation was better in theory, there's something to be
> said for maintaining backward compatibility, or having an option to do
> so.

Meh ... we've generally regretted it when we "solved" a backwards
compatibility problem by introducing a GUC that changes query semantics.
I'm inclined to think we should either do it or not.

My own vote would probably be for "not", because I haven't seen a case
made why it's important to be able to automatically distinguish a
constant-substitution marker from a "?" operator.

On the other hand, it seems like arguing for backwards compatibility here
is a bit contradictory, because that would only matter if you think there
*are* people trying to automatically parse the output of
pg_stat_statements in that much detail.  And if there are, they would
likely appreciate it becoming less ambiguous.

But speaking of ambiguity: isn't it possible for $n symbols to appear in
pg_stat_statements already?  I think it is, both from extended-protocol
client queries and from SPI commands, which would mean that the proposal
as it stands is not fixing the ambiguity problem at all.  So yes, we need
another idea.
        regards, tom lane



Hi,

On 2017-03-04 11:02:14 -0500, Tom Lane wrote:
> But speaking of ambiguity: isn't it possible for $n symbols to appear in
> pg_stat_statements already?

Indeed.

> I think it is, both from extended-protocol
> client queries and from SPI commands, which would mean that the proposal
> as it stands is not fixing the ambiguity problem at all.  So yes, we need
> another idea.

I think using $ to signal a parameter is still a good idea, as people
kinda have to know that one anyway.  Maybe one of "$:n" "$-n"?

- Andres



On Sat, Mar 4, 2017 at 8:02 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Perhaps there could be a choice of behaviors.  Even if we all agreed
>> that parameter notation was better in theory, there's something to be
>> said for maintaining backward compatibility, or having an option to do
>> so.
>
> Meh ... we've generally regretted it when we "solved" a backwards
> compatibility problem by introducing a GUC that changes query semantics.
> I'm inclined to think we should either do it or not.

In my opinion, we expose query id (and dbid, and userid) as the
canonical identifier for each pg_stat_statements entry, and have done
so for some time. That's the stable API -- not query text. I'm aware
of cases where query text was used as an identifier, but that ended up
being hashed anyway.

Query text is just for human consumption. I'd be in favor of a change
that makes it easier to copy and paste a query, to run EXPLAIN and so
on. Lukas probably realizes that there are no guarantees that the
query text that appears in pg_stat_statements will even appear as
normalized in all cases. The "sticky entry" stuff is intended to
maximize the chances of that happening, but it's still generally quite
possible (e.g. pg_stat_statements never swaps constants in a query
like "SELECT 5, pg_stat_statements_reset()"). This means that we
cannot really say that this buys us a machine-readable query text
format, at least not without adding some fairly messy caveats.

-- 
Peter Geoghegan



On Sat, Mar 4, 2017 at 1:52 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Sat, Mar 4, 2017 at 8:02 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Perhaps there could be a choice of behaviors.  Even if we all agreed
>>> that parameter notation was better in theory, there's something to be
>>> said for maintaining backward compatibility, or having an option to do
>>> so.
>>
>> Meh ... we've generally regretted it when we "solved" a backwards
>> compatibility problem by introducing a GUC that changes query semantics.
>> I'm inclined to think we should either do it or not.
>
> In my opinion, we expose query id (and dbid, and userid) as the
> canonical identifier for each pg_stat_statements entry, and have done
> so for some time. That's the stable API -- not query text. I'm aware
> of cases where query text was used as an identifier, but that ended up
> being hashed anyway.
>
> Query text is just for human consumption.

Lukas evidently thinks otherwise, based on the original post.

> I'd be in favor of a change
> that makes it easier to copy and paste a query, to run EXPLAIN and so
> on. Lukas probably realizes that there are no guarantees that the
> query text that appears in pg_stat_statements will even appear as
> normalized in all cases. The "sticky entry" stuff is intended to
> maximize the chances of that happening, but it's still generally quite
> possible (e.g. pg_stat_statements never swaps constants in a query
> like "SELECT 5, pg_stat_statements_reset()"). This means that we
> cannot really say that this buys us a machine-readable query text
> format, at least not without adding some fairly messy caveats.

Well, Lukas's original suggestion of using $n for a placeholder would
do that, unless there's already a $n with the same numerical value,
but Andres's proposal to use $-n or $:n would not.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Mon, Mar 6, 2017 at 9:36 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sat, Mar 4, 2017 at 1:52 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> In my opinion, we expose query id (and dbid, and userid) as the
> canonical identifier for each pg_stat_statements entry, and have done
> so for some time. That's the stable API -- not query text. I'm aware
> of cases where query text was used as an identifier, but that ended up
> being hashed anyway.
>
> Query text is just for human consumption.

Lukas evidently thinks otherwise, based on the original post.

I actually agree with Peter that the queryid+userid+dbid is the canonical identifier,
not the query text.

There is however value in parsing the query, e.g. to find out which statement
type something is, or to determine which table names a query references
(assuming one knows the search_path) programatically.

It is for that latter reason I'm interested in parsing the query, and avoiding the
ambiguity that ? carries, since its also an operator.

Based on some hackery, I've previously built a little example script that
filters pg_stat_statements output: https://github.com/lfittl/pg_qtop#usage

This script currently breaks in complex cases of ? operators, since the
pg_stat_statements query text is ambiguous.
 
> I'd be in favor of a change
> that makes it easier to copy and paste a query, to run EXPLAIN and so
> on. Lukas probably realizes that there are no guarantees that the
> query text that appears in pg_stat_statements will even appear as
> normalized in all cases. The "sticky entry" stuff is intended to
> maximize the chances of that happening, but it's still generally quite
> possible (e.g. pg_stat_statements never swaps constants in a query
> like "SELECT 5, pg_stat_statements_reset()"). This means that we
> cannot really say that this buys us a machine-readable query text
> format, at least not without adding some fairly messy caveats.

Well, Lukas's original suggestion of using $n for a placeholder would
do that, unless there's already a $n with the same numerical value,
but Andres's proposal to use $-n or $:n would not.

Yes, and I do think that making it easier to run EXPLAIN would be the
primary user-visible benefit in core.

I'd be happy to add a docs section showing how to use this, if there is
some consensus that its worth pursuing this direction.

Thanks for all the comments, appreciate the discussion.

Best,
Lukas

--
Lukas Fittl
On Sat, Mar  4, 2017 at 10:52:44AM -0800, Peter Geoghegan wrote:
> On Sat, Mar 4, 2017 at 8:02 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Meh ... we've generally regretted it when we "solved" a backwards
> > compatibility problem by introducing a GUC that changes query semantics.
> > I'm inclined to think we should either do it or not.
> 
> In my opinion, we expose query id (and dbid, and userid) as the
> canonical identifier for each pg_stat_statements entry, and have done
> so for some time. That's the stable API -- not query text. I'm aware
> of cases where query text was used as an identifier, but that ended up
> being hashed anyway.

Speaking of hash values for queries, someone once asked me if we could
display a hash value for queries displayed in pg_stat_activity and
pg_stat_statements so they could take a running query and look in
pg_stat_statements to see how long is usually ran.  It seemed like a
useful idea to me.

I don't think they can hash the query manually because of the constants
involved.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



On Thu, Mar 9, 2017 at 8:17 PM, Bruce Momjian <bruce@momjian.us> wrote:
>> In my opinion, we expose query id (and dbid, and userid) as the
>> canonical identifier for each pg_stat_statements entry, and have done
>> so for some time. That's the stable API -- not query text. I'm aware
>> of cases where query text was used as an identifier, but that ended up
>> being hashed anyway.
>
> Speaking of hash values for queries, someone once asked me if we could
> display a hash value for queries displayed in pg_stat_activity and
> pg_stat_statements so they could take a running query and look in
> pg_stat_statements to see how long is usually ran.  It seemed like a
> useful idea to me.

I agree.

> I don't think they can hash the query manually because of the constants
> involved.

It would be a matter of having postgres expose Query.queryId (or the
similar field in PlannedStmt, I suppose). Morally, that field belongs
to pg_stat_statements, since it was written to make the query
normalization stuff work, and because everything would break if
another extension attempted to use it as pg_stat_statements does.
Whether or not that makes it okay to expose the hash value in
pg_stat_activity like that is above my pay grade, as Tom would say.

-- 
Peter Geoghegan



Robert Haas <robertmhaas@gmail.com> writes:
> On Sat, Mar 4, 2017 at 1:52 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> I'd be in favor of a change
>> that makes it easier to copy and paste a query, to run EXPLAIN and so
>> on. Lukas probably realizes that there are no guarantees that the
>> query text that appears in pg_stat_statements will even appear as
>> normalized in all cases. The "sticky entry" stuff is intended to
>> maximize the chances of that happening, but it's still generally quite
>> possible (e.g. pg_stat_statements never swaps constants in a query
>> like "SELECT 5, pg_stat_statements_reset()"). This means that we
>> cannot really say that this buys us a machine-readable query text
>> format, at least not without adding some fairly messy caveats.

> Well, Lukas's original suggestion of using $n for a placeholder would
> do that, unless there's already a $n with the same numerical value,
> but Andres's proposal to use $-n or $:n would not.

I don't much like the $-n or $:n proposals, as those are infringing on
syntax space we might wish we had back someday.  $:n also could cause
confusion with psql variable substitution.

I wonder if it would improve matters to use $n, but starting with the
first number after the actual external Param symbols in the query.
(This presumes that we can identify the last in-use external Param number
reasonably efficiently.  struct Query doesn't really expose that ---
although it might not be unreasonable to add a field to do so.
In practice pg_stat_statements could perhaps look aside to find
that out, say from EState.es_param_list_info->numParams.)

In a situation where you wanted to try to actually execute the query,
this might be fairly convenient since you could do PREPARE with the
original external Params followed by the ones pg_stat_statements had
abstracted from constants.  Of course, guessing the types of those
might be nontrivial.  I wonder whether we should change the printout
to look like '$n::type' not just '$n'.

A variant that might make it easier to read would be to start the
numbering of the abstracted params from $100, or some other number
much larger than the last external Param, thus creating a pretty
clear distinction between the two.  But that would break the
hypothetical use-case of building a prepared statement directly
from the query text, or at least make it mighty inconvenient.
        regards, tom lane



On Mon, Mar 13, 2017 at 6:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Sat, Mar 4, 2017 at 1:52 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>>> I'd be in favor of a change
>>> that makes it easier to copy and paste a query, to run EXPLAIN and so
>>> on. Lukas probably realizes that there are no guarantees that the
>>> query text that appears in pg_stat_statements will even appear as
>>> normalized in all cases. The "sticky entry" stuff is intended to
>>> maximize the chances of that happening, but it's still generally quite
>>> possible (e.g. pg_stat_statements never swaps constants in a query
>>> like "SELECT 5, pg_stat_statements_reset()"). This means that we
>>> cannot really say that this buys us a machine-readable query text
>>> format, at least not without adding some fairly messy caveats.
>
>> Well, Lukas's original suggestion of using $n for a placeholder would
>> do that, unless there's already a $n with the same numerical value,
>> but Andres's proposal to use $-n or $:n would not.
>
> I don't much like the $-n or $:n proposals, as those are infringing on
> syntax space we might wish we had back someday.  $:n also could cause
> confusion with psql variable substitution.
>
> I wonder if it would improve matters to use $n, but starting with the
> first number after the actual external Param symbols in the query.

That sounds pretty sensible, although I guess it's got the weakness
that you might get confused about which kind of $n you are looking at.
However, I'd be inclined to deem that a fairly minor weakness and go
with it: just because somebody could hypothetically get confused
doesn't mean that real people will.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Mar 13, 2017 at 6:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I wonder if it would improve matters to use $n, but starting with the
>> first number after the actual external Param symbols in the query.

> That sounds pretty sensible, although I guess it's got the weakness
> that you might get confused about which kind of $n you are looking at.
> However, I'd be inclined to deem that a fairly minor weakness and go
> with it: just because somebody could hypothetically get confused
> doesn't mean that real people will.

So it turns out this discussion just reinvented the alternative that
Lukas had in his 0002 proposal.  Are there any remaining objections
to proceeding with that approach?


In a quick read of the patch, I note that it falsifies and fails to
update the header comment for generate_normalized_query:
* *query_len_p contains the input string length, and is updated with* the result string length (which cannot be longer)
onexit. 

It doesn't look like the calling code depends (any more?) on the
assumption that the string doesn't get longer, but it would be good
to double-check that.

This bit seems much more expensive and complicated than necessary:

+    /* Accomodate the additional query replacement characters */
+    norm_query_buflen = query_len;
+    for (i = 0; i < jstate->clocations_count; i++)
+    {
+        norm_query_buflen += floor(log10(abs(i + 1 + jstate->highest_extern_param_id))) + 1;
+    }

I'd just add, say, "10 * clocations_count" to the allocation length.
We can afford to waste a few bytes on this transient copy of the query
text.

The documentation is overly vague about what the parameter symbols are,
in particular failing to explain that their numbers start from after
the last $n occurring in the original query text.

It seems like a good idea to have a regression test case demonstrating
that, too.
        regards, tom lane



On Sat, Mar 18, 2017 at 12:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
So it turns out this discussion just reinvented the alternative that
Lukas had in his 0002 proposal.  Are there any remaining objections
to proceeding with that approach?

Thanks for reviewing - updated patch attached, comments below.
 
In a quick read of the patch, I note that it falsifies and fails to
update the header comment for generate_normalized_query:

 * *query_len_p contains the input string length, and is updated with
 * the result string length (which cannot be longer) on exit.

It doesn't look like the calling code depends (any more?) on the
assumption that the string doesn't get longer, but it would be good
to double-check that.

Fixed.
 
I'd just add, say, "10 * clocations_count" to the allocation length.
We can afford to waste a few bytes on this transient copy of the query
text.

I've changed this, although I've kept this somewhat dynamic, to avoid
having an accidental overrun in edge cases:

1 + floor(log10(jstate->clocations_count + jstate->highest_extern_param_id));
 
The documentation is overly vague about what the parameter symbols are,
in particular failing to explain that their numbers start from after
the last $n occurring in the original query text.

Updated the docs to clarify this.
 
It seems like a good idea to have a regression test case demonstrating
that, too.

I already had a separate patch that adds more regression tests (0000),
including one for prepared statements in the initial email.

Merged together now into one patch to keep it simple, and added another
constant value to the prepared statement test case. I've also included
a commit message in the patch file, if helpful.

Thanks,
Lukas

--
Lukas Fittl
Attachment
Lukas Fittl <lukas@fittl.com> writes:
> On Sat, Mar 18, 2017 at 12:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> So it turns out this discussion just reinvented the alternative that
>> Lukas had in his 0002 proposal.  Are there any remaining objections
>> to proceeding with that approach?

> Thanks for reviewing - updated patch attached, comments below.

Pushed with minor adjustments.

The main non-cosmetic thing I did was to replace the floor(log10())
business with plain constant "10" as I suggested before.  That's
what we do in other places --- see int4out for an example --- and
frankly I did not feel that a small space savings in a transient
string buffer was worth the intellectual effort to verify whether
that calculation was correct or not, never mind whatever runtime
cycles it would take.  I don't believe the argument that it's safer
your way: if you had an off-by-one thinko in the calculation, or even
just roundoff error in the log10() call, it could result in an actual
reachable buffer overrun, because there's no safety margin.
        regards, tom lane



On Mon, Mar 27, 2017 at 8:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Pushed with minor adjustments.

Excellent - thanks for your review and all the discussion here!
 
The main non-cosmetic thing I did was to replace the floor(log10())
business with plain constant "10" as I suggested before.  That's
what we do in other places --- see int4out for an example --- and
frankly I did not feel that a small space savings in a transient
string buffer was worth the intellectual effort to verify whether
that calculation was correct or not, never mind whatever runtime
cycles it would take.  I don't believe the argument that it's safer
your way: if you had an off-by-one thinko in the calculation, or even
just roundoff error in the log10() call, it could result in an actual
reachable buffer overrun, because there's no safety margin.

Makes sense, guess I was overthinking this one.

Best,
Lukas 

--
Lukas Fittl

Skype: lfittl
Phone: +1 415 321 0630