Thread: Revisiting pg_stat_statements and IN() (Was: Re: pg_stat_statements fingerprinting logic and ArrayExpr)

On Tue, Dec 10, 2013 at 1:30 AM, Peter Geoghegan <pg@heroku.com> wrote:
> pg_stat_statements' fingerprinting logic considers the following two
> statements as distinct:
>
> select 1 in (1, 2, 3);
> select 1 in (1, 2, 3, 4);
>
> This is because the ArrayExpr jumble case jumbles any ArrayExpr's list
> of elements recursively. In this case it's a list of Const nodes, and
> the fingerprinting logic jumbles those nodes indifferently.
>
> Somebody told me that they think that pg_stat_statements should not do
> that. This person felt that it would be preferable for such
> expressions to be normalized without regard to the number of distinct
> Const elements. I suppose that that would work by determing if the
> ArrayExpr elements list was a list of Const nodes and only const
> nodes. Iff that turned out to be the case, something else would be
> jumbled (something other than the list) that would essentially be a
> representation of "some list of zero or more (or maybe one or more)
> Const nodes with consttype of, in this example, 23". I think that this
> would make at least one person happy, because of course the two
> statements above would have their costs aggregated within a single
> pg_stat_statements entry.

Baron Schwartz recently gave a talk at PGConf Silicon Valley about the
proprietary query instrumentation tool, VividCortex. The slides are
available from:


http://info.citusdata.com/rs/235-CNE-301/images/Analyzing_PostgreSQL_Network_Traffic_with_vc-pgsql-sniffer_-_Baron_Schwartz.pdf

One specific justification he gave for not using pg_stat_statements was:

"Doesn’t merge bind vars in IN()" (See slide #11)

His theory is that you should allow a proprietary binary to run with
root permissions, a binary that sniffs the wire protocol, mostly
because pg_stat_statements has this limitation (all other
pg_stat_statements limitations listed are pretty unconvincing, IMV).
That doesn't seem like a great plan to me, but I think he has a point
about pg_stat_statements. It's about time that we fixed this -- it
isn't realistic to imagine that people are going to know to use an
array constant like "= any ('{1,2,3}')" -- even a major contributor to
Django that I talked to about this issue a couple of years ago didn't
know about that. It also isn't portable across database systems.

I wonder:

* How do other people feel about this? Personally, I've seen enough
problems of this kind in the field that "slippery slope" arguments
against this don't seem very compelling.

* How might altering the jumbling logic to make it recognize a
variable number of constants as equivalent work in practice? Perhaps
we should do something to flatten the representation based on which
powers of two the number of constants is between. There are still some
details to work out there, but that's my first idea. That seems like a
good compromise between the current behavior, and completely
disregarding the number of constants.

--
Peter Geoghegan



On Tue, Nov 24, 2015 at 7:53 AM, Peter Geoghegan <pg@heroku.com> wrote:
> * How do other people feel about this? Personally, I've seen enough
> problems of this kind in the field that "slippery slope" arguments
> against this don't seem very compelling.

I also always felt there should be some kind of ??? symbol to
represent a list of constants. In my experience these lists are
actually *more* likely to be variables being inlined than single
constants since it's easy to use :1 etc for single constants and quite
a bit trickier to do it for lists. I guess that might be changed these
days since I think you can do  =any(?::int[]) and construct an array
literal as a parameter. But plenty of code actually constructs lists
of question marks to interpolate.

I have also seen code where I would have needed *not* to have this
jumbling though. I think this is a general problem with jumbling that
there needs to be some kind of intelligence that deduces when it's
important to break out the statistics by constant. In my case it was
an IN query where specific values were very common but others very
rare. Partial indexes ended up being the solution and we had to
identify which partial indexes were needed.

Incidentally there's another feature pg_stat_statements *really*
needs. A way to convert a jumbled statement into one that can be
prepared easily. The use of ? instead of :1 :2 etc makes this a
mechanical but annoying process. Adding ??? would make it even more
annoying. Even just a function that does this (and takes an optional
list of counts for lists I guess?) would be a big help.


-- 
greg



On Tue, Nov 24, 2015 at 2:25 PM, Greg Stark <stark@mit.edu> wrote:
> On Tue, Nov 24, 2015 at 7:53 AM, Peter Geoghegan <pg@heroku.com> wrote:
>> * How do other people feel about this? Personally, I've seen enough
>> problems of this kind in the field that "slippery slope" arguments
>> against this don't seem very compelling.

+1 for changing jumbling logic to consider any number of constants as identical.

> I have also seen code where I would have needed *not* to have this
> jumbling though. I think this is a general problem with jumbling that
> there needs to be some kind of intelligence that deduces when it's
> important to break out the statistics by constant. In my case it was
> an IN query where specific values were very common but others very
> rare. Partial indexes ended up being the solution and we had to
> identify which partial indexes were needed.

This is an issue with jumbling in general and very vaguely related to
merging the number of constants in IN. I think a better solution for
this type of issue would be a way to sample the parameter values and
possibly EXPLAIN ANALYZE results to logs. Having runtime intelligence
in PostgreSQL that would be capable of distinguishing interesting
variations from irrelevant doesn't seem like a feasible plan. In my
view the best we could do is to aim to have entries roughly correspond
to application query invocation points and leave the more complex
statistical analysis use cases to more specialized tools.

Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Peter Geoghegan <pg@heroku.com> writes:
> On Tue, Dec 10, 2013 at 1:30 AM, Peter Geoghegan <pg@heroku.com> wrote:
>> pg_stat_statements' fingerprinting logic considers the following two
>> statements as distinct:
>> 
>> select 1 in (1, 2, 3);
>> select 1 in (1, 2, 3, 4);
>> 
>> This is because the ArrayExpr jumble case jumbles any ArrayExpr's list
>> of elements recursively. In this case it's a list of Const nodes, and
>> the fingerprinting logic jumbles those nodes indifferently.

I think this is a vastly oversimplified explanation of the problem.
In particular, because the planner will flatten an ArrayExpr containing
only Const nodes to an array constant (see eval_const_expressions),
I don't believe the case ever arises in exactly the form you posit here.

A portion of the problem is possibly due to the heuristics in
parse_expr.c's transformAExprIn():
    * We try to generate a ScalarArrayOpExpr from IN/NOT IN, but this is only    * possible if there is a suitable
arraytype available.  If not, we fall    * back to a boolean condition tree with multiple copies of the lefthand    *
expression. Also, any IN-list items that contain Vars are handled as    * separate boolean conditions, because that
givesthe planner more scope    * for optimization on such clauses.
 

If the original text actually involves a variable number of Vars, then you
will end up with a boolean expression with a varying number of OR arms,
even if the Vars later get flattened to constants.  However, it's not
clear to me that anyone would expect such cases to be treated as
identical.  Another possibility is a type clash, for example
"x IN (42, 44.1)" will end up as a boolean tree for lack of a common
type for the would-be array elements.  That case might possibly be an
issue in practice.

But what seems more likely to be annoying people is cases in which the
original text contains a varying number of Param markers.  Those might or
might not get folded to constants during planning depending on context,
so that they might or might not look different to pg_stat_statements.

So I suspect the real problem here is that we might want all of these
things to look identical to pg_stat_statements:
      ARRAY[$1, $2, 42]      ARRAY[$1, $2, $3, 47]      '{1,2,3,47}'::int[]

Don't see a very clean way to do that ...
        regards, tom lane



On 11/24/15 1:29 PM, Tom Lane wrote:
> So I suspect the real problem here is that we might want all of these
> things to look identical to pg_stat_statements:
>
>         ARRAY[$1, $2, 42]
>         ARRAY[$1, $2, $3, 47]
>         '{1,2,3,47}'::int[]
>
> Don't see a very clean way to do that ...

Another not-uncommon case is IN ( '1', '2', ... , '2342' ); in other 
words, treating an integer as text. A lot of frameworks like to do that 
and just push the problem onto the database. I'm not sure what 
pg_stat_statements would ultimately see in that case..

Since there's a few different things people might want, maybe a good 
first step is to allow extending/changing the jumbling decision at the C 
level. That would make it easy for a knowledgeable enough person to come 
up with an alternative as a plugin that regular users could use.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com



On Tue, Nov 24, 2015 at 5:39 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> Another not-uncommon case is IN ( '1', '2', ... , '2342' ); in other words,
> treating an integer as text. A lot of frameworks like to do that and just
> push the problem onto the database. I'm not sure what pg_stat_statements
> would ultimately see in that case..

They do?

postgres=# select 5::int4 in ('5');?column?
──────────t
(1 row)

postgres=# select 5::int4 in ('5a');
ERROR:  22P02: invalid input syntax for integer: "5a"
LINE 1: select 5::int4 in ('5a');                          ^
--
Peter Geoghegan



I wrote:
> Peter Geoghegan <pg@heroku.com> writes:
>> This is because the ArrayExpr jumble case jumbles any ArrayExpr's list
>> of elements recursively. In this case it's a list of Const nodes, and
>> the fingerprinting logic jumbles those nodes indifferently.

> I think this is a vastly oversimplified explanation of the problem.
> In particular, because the planner will flatten an ArrayExpr containing
> only Const nodes to an array constant (see eval_const_expressions),
> I don't believe the case ever arises in exactly the form you posit here.

Um ... disregard that.  For some reason I was thinking that
pg_stat_statements looks at plan trees, but of course what it looks at
is the query tree immediately post-parse-analysis.  So the behavior of
the const-folding pass is not relevant.

The heuristics in transformAExprIn() are still relevant, though, and
I suspect that the question of whether Params should be considered
the same as Consts is also highly relevant.

I wonder whether we could improve this by arranging things so that both
Consts and Params contribute zero to the jumble hash, and a list of these
things also contributes zero, regardless of the length of the list.
        regards, tom lane



On 11/24/15 7:46 PM, Peter Geoghegan wrote:
> On Tue, Nov 24, 2015 at 5:39 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>> Another not-uncommon case is IN ( '1', '2', ... , '2342' ); in other words,
>> treating an integer as text. A lot of frameworks like to do that and just
>> push the problem onto the database. I'm not sure what pg_stat_statements
>> would ultimately see in that case..
>
> They do?
>
> postgres=# select 5::int4 in ('5');
>   ?column?
> ──────────
>   t
> (1 row)
>
> postgres=# select 5::int4 in ('5a');
> ERROR:  22P02: invalid input syntax for integer: "5a"
> LINE 1: select 5::int4 in ('5a');
>                             ^

I'm not following your point. Obviously you can't compare int to text 
that doesn't convert back to an int, but that's not what I was talking 
about.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com



On Tue, Nov 24, 2015 at 10:55 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> I'm not following your point. Obviously you can't compare int to text that
> doesn't convert back to an int, but that's not what I was talking about.

I didn't see what else you could have meant. In any case, the type
text has no involvement in your example.

pg_stat_statements sees the following SQL queries as equivalent:

select integer '5';

select 6;

select 7::int4;

-- 
Peter Geoghegan



On Mon, Nov 23, 2015 at 11:53 PM, Peter Geoghegan <pg@heroku.com> wrote:
One specific justification he gave for not using pg_stat_statements was:

"Doesn’t merge bind vars in IN()" (See slide #11)

I wonder:

* How do other people feel about this? Personally, I've seen enough
problems of this kind in the field that "slippery slope" arguments
against this don't seem very compelling.

As someone who runs a little monitoring service thats solely based on pg_stat_statements data, ignoring IN list length would certainly be a good change.

We currently do this in post-processing, together with other data cleanup (e.g. ignoring the length of a VALUES list in an INSERT statement).

Given the fact that pgss data is normalized & you don't know which plan was chosen, I don't think distinguishing based on the length of the list helps anyone really.

I see pg_stat_statements as a high-level overview of which queries have run, and which ones you might want to look into closer using e.g. auto_explain.

Best,
Lukas

--
Lukas Fittl

Skype: lfittl
Phone: +1 415 321 0630
On Wed, Nov 25, 2015 at 9:13 AM, Lukas Fittl <lukas@fittl.com> wrote:
On Mon, Nov 23, 2015 at 11:53 PM, Peter Geoghegan <pg@heroku.com> wrote:
One specific justification he gave for not using pg_stat_statements was:

"Doesn’t merge bind vars in IN()" (See slide #11)

I wonder:

* How do other people feel about this? Personally, I've seen enough
problems of this kind in the field that "slippery slope" arguments
against this don't seem very compelling.

As someone who runs a little monitoring service thats solely based on pg_stat_statements data, ignoring IN list length would certainly be a good change.

We currently do this in post-processing, together with other data cleanup (e.g. ignoring the length of a VALUES list in an INSERT statement).

Given the fact that pgss data is normalized & you don't know which plan was chosen, I don't think distinguishing based on the length of the list helps anyone really.

I see pg_stat_statements as a high-level overview of which queries have run, and which ones you might want to look into closer using e.g. auto_explain.

I still have the plans to try to marry pg_stat_statements and auto_explain for the next iteration of "online query plans" extension I was proposing a few months ago, and the first thing I was going to look into is rectifying this problem with IN() jumbling.  So, have a +1 from me.

--
Alex

On Mon, Nov 30, 2015 at 02:41:02PM +0100, Shulgin, Oleksandr wrote:
> On Wed, Nov 25, 2015 at 9:13 AM, Lukas Fittl <lukas@fittl.com> wrote:
> 
>     On Mon, Nov 23, 2015 at 11:53 PM, Peter Geoghegan <pg@heroku.com> wrote:
> 
>         One specific justification he gave for not using pg_stat_statements
>         was:
> 
>         "Doesn’t merge bind vars in IN()" (See slide #11)
> 
> 
>         I wonder:
> 
>         * How do other people feel about this? Personally, I've seen enough
>         problems of this kind in the field that "slippery slope" arguments
>         against this don't seem very compelling.
> 
> 
>     As someone who runs a little monitoring service thats solely based on
>     pg_stat_statements data, ignoring IN list length would certainly be a good
>     change.
> 
>     We currently do this in post-processing, together with other data cleanup
>     (e.g. ignoring the length of a VALUES list in an INSERT statement).
> 
>     Given the fact that pgss data is normalized & you don't know which plan was
>     chosen, I don't think distinguishing based on the length of the list helps
>     anyone really.
> 
>     I see pg_stat_statements as a high-level overview of which queries have
>     run, and which ones you might want to look into closer using e.g.
>     auto_explain.
> 
> 
> I still have the plans to try to marry pg_stat_statements and auto_explain for
> the next iteration of "online query plans" extension I was proposing a few
> months ago, and the first thing I was going to look into is rectifying this
> problem with IN() jumbling.  So, have a +1 from me.

Is this a TODO?

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Roman grave inscription                             +



On Wed, Dec 30, 2015 at 5:20 PM, Bruce Momjian <bruce@momjian.us> wrote:
> Is this a TODO?

It's on my (very long) personal TODO list. It would be great if
someone else took it, though. So, yes.

-- 
Peter Geoghegan



On Wed, Dec 30, 2015 at 05:21:05PM -0800, Peter Geoghegan wrote:
> On Wed, Dec 30, 2015 at 5:20 PM, Bruce Momjian <bruce@momjian.us> wrote:
> > Is this a TODO?
> 
> It's on my (very long) personal TODO list. It would be great if
> someone else took it, though. So, yes.

TODO added.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Roman grave inscription                             +