Thread: pg_stat_statements fingerprinting logic and ArrayExpr

pg_stat_statements fingerprinting logic and ArrayExpr

From
Peter Geoghegan
Date:
pg_stat_statements' fingerprinting logic considers the following two
statements as distinct:

select 1 in (1, 2, 3);
select 1 in (1, 2, 3, 4);

This is because the ArrayExpr jumble case jumbles any ArrayExpr's list
of elements recursively. In this case it's a list of Const nodes, and
the fingerprinting logic jumbles those nodes indifferently.

Somebody told me that they think that pg_stat_statements should not do
that. This person felt that it would be preferable for such
expressions to be normalized without regard to the number of distinct
Const elements. I suppose that that would work by determing if the
ArrayExpr elements list was a list of Const nodes and only const
nodes. Iff that turned out to be the case, something else would be
jumbled (something other than the list) that would essentially be a
representation of "some list of zero or more (or maybe one or more)
Const nodes with consttype of, in this example, 23". I think that this
would make at least one person happy, because of course the two
statements above would have their costs aggregated within a single
pg_stat_statements entry.

I'm not sure that I agree, but there is anecdata that suggests that it
isn't uncommon for these sorts of queries to be broken out when
they're all traceable back to a single point in the application
(apparently it's common for Django apps to do so, perhaps
questionably). If we assume that doing what I've described has no real
downside, then it would probably be worth implementing. Plus I'm
pretty sure that tools that do regex normalization are already doing
something analogous. Thoughts?

-- 
Peter Geoghegan



Re: pg_stat_statements fingerprinting logic and ArrayExpr

From
Robert Haas
Date:
On Tue, Dec 10, 2013 at 4:30 AM, Peter Geoghegan <pg@heroku.com> wrote:
> pg_stat_statements' fingerprinting logic considers the following two
> statements as distinct:
>
> select 1 in (1, 2, 3);
> select 1 in (1, 2, 3, 4);
>
> This is because the ArrayExpr jumble case jumbles any ArrayExpr's list
> of elements recursively. In this case it's a list of Const nodes, and
> the fingerprinting logic jumbles those nodes indifferently.
>
> Somebody told me that they think that pg_stat_statements should not do
> that. This person felt that it would be preferable for such
> expressions to be normalized without regard to the number of distinct
> Const elements. I suppose that that would work by determing if the
> ArrayExpr elements list was a list of Const nodes and only const
> nodes. Iff that turned out to be the case, something else would be
> jumbled (something other than the list) that would essentially be a
> representation of "some list of zero or more (or maybe one or more)
> Const nodes with consttype of, in this example, 23". I think that this
> would make at least one person happy, because of course the two
> statements above would have their costs aggregated within a single
> pg_stat_statements entry.
>
> I'm not sure that I agree, but there is anecdata that suggests that it
> isn't uncommon for these sorts of queries to be broken out when
> they're all traceable back to a single point in the application
> (apparently it's common for Django apps to do so, perhaps
> questionably). If we assume that doing what I've described has no real
> downside, then it would probably be worth implementing. Plus I'm
> pretty sure that tools that do regex normalization are already doing
> something analogous. Thoughts?

Sounds like this:
   // What's the worst thing that could happen?   pandoras_box.open();

I am very wary of implementing special-case logic here even though I
know it could be useful to some people, simply because I fear that
there could be a near-infinite variety of situations where, in a
particular environment, a particular distinction isn't important.  We
won't be serving anyone well if we ignore all of those distinctions,
because not everyone will want to ignore all of them.  And adding a
configuration option for each one doesn't sound like a good idea,
either.  *Maybe* this particular case is narrow enough that it's OK to
just ignore it unconditionally and maybe that'd be OK ... but I fear
this is a rathole.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: pg_stat_statements fingerprinting logic and ArrayExpr

From
Peter Geoghegan
Date:
On Tue, Dec 10, 2013 at 1:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I am very wary of implementing special-case logic here even though I
> know it could be useful to some people, simply because I fear that
> there could be a near-infinite variety of situations where, in a
> particular environment, a particular distinction isn't important.

I am too, which is why I asked.

We're already in the business of deciding what is and isn't essential
to a query in this way. For example, we already determine that
Var.varcollid shouldn't appear in a query jumble - there is no better
reason for that then "it would hurt more than it helped", even though
it's possible that someone could care about such a distinction. Now, I
have no intention of avoiding the issue with a relativistic argument
("who is to say what the essential nature of a query is anyway?"), but
I know doctrinarianism isn't helpful either.

I do think I know who should determine what is the essential nature of
a query for fingerprinting purposes: we should. We should pick the
scheme that is most widely useful, while weighing the worst case. I'm
not asserting that this is closer to that, but it might be.


-- 
Peter Geoghegan



Re: pg_stat_statements fingerprinting logic and ArrayExpr

From
Andres Freund
Date:
On 2013-12-10 16:09:02 -0500, Robert Haas wrote:
> On Tue, Dec 10, 2013 at 4:30 AM, Peter Geoghegan <pg@heroku.com> wrote:
> > I'm not sure that I agree, but there is anecdata that suggests that it
> > isn't uncommon for these sorts of queries to be broken out when
> > they're all traceable back to a single point in the application
> > (apparently it's common for Django apps to do so, perhaps
> > questionably). If we assume that doing what I've described has no real
> > downside, then it would probably be worth implementing. Plus I'm
> > pretty sure that tools that do regex normalization are already doing
> > something analogous. Thoughts?
> 
> Sounds like this:
> 
>     // What's the worst thing that could happen?
>     pandoras_box.open();
> 
> I am very wary of implementing special-case logic here even though I
> know it could be useful to some people, simply because I fear that
> there could be a near-infinite variety of situations where, in a
> particular environment, a particular distinction isn't important.  We
> won't be serving anyone well if we ignore all of those distinctions,
> because not everyone will want to ignore all of them.  And adding a
> configuration option for each one doesn't sound like a good idea,
> either.  *Maybe* this particular case is narrow enough that it's OK to
> just ignore it unconditionally and maybe that'd be OK ... but I fear
> this is a rathole.

FWIW, I hit exactly this issue every single time I have looked at
pg_stat_statements in some client's database making it nearly useless
for analysis. So improving it might be worthwile.

On the other hand, so far all could fix their application in a
reasonable amount of time to generate = ANY(array) queries
instead. Which probably isn't a bad idea either, considering they now
use far fewer prepared statements.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: pg_stat_statements fingerprinting logic and ArrayExpr

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Dec 10, 2013 at 4:30 AM, Peter Geoghegan <pg@heroku.com> wrote:
>> pg_stat_statements' fingerprinting logic considers the following two
>> statements as distinct:
>> 
>> select 1 in (1, 2, 3);
>> select 1 in (1, 2, 3, 4);
>> 
>> [ and some people think it shouldn't ]

> I am very wary of implementing special-case logic here even though I
> know it could be useful to some people, simply because I fear that
> there could be a near-infinite variety of situations where, in a
> particular environment, a particular distinction isn't important.

FWIW, I didn't much care for this idea either, though Robert's expressed
it better than what was rattling around in my brain this morning.  There's
a very slippery slope from here to inserting all sorts of random hacks
into the query fingerprinter, and that's not someplace I want to go.

There are alternatives that the requestor could consider for making these
cases more alike, such as supplying the set of IDs as an array constant
(or parameter) to begin with.
        regards, tom lane



Re: pg_stat_statements fingerprinting logic and ArrayExpr

From
Peter Geoghegan
Date:
On Tue, Dec 10, 2013 at 1:41 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> FWIW, I hit exactly this issue every single time I have looked at
> pg_stat_statements in some client's database making it nearly useless
> for analysis. So improving it might be worthwile.

I think the thing that makes me lean towards doing this is the fact
that pgFouine and pgBadger have been doing something similar for
years, and those tools continue to be much more popular than I thought
they'd be now that pg_stat_statements with normalization is fairly
widely available. Yes, of course the problem can be solved by using "
= ANY(array)". But some people are lazy, or uninformed, or they don't
directly control this. As I mentioned, this has traditionally been a
problem with Django, where I imagine the fix is non-trivial. I haven't
asked, but it isn't too hard to imagine that the Django guys are
probably not all that keen on doing a lot of work that will only be
applicable to Postgres.

Did you really find pg_stat_statements to be almost useless in such
situations? That seems worse than I thought.

I guess it in no small part comes down to what the long term future of
the query finger-printer is.

-- 
Peter Geoghegan



Re: pg_stat_statements fingerprinting logic and ArrayExpr

From
Andres Freund
Date:
On 2013-12-10 14:30:36 -0800, Peter Geoghegan wrote:
> Did you really find pg_stat_statements to be almost useless in such
> situations? That seems worse than I thought.

It's very hard to see where you should spend efforts when every "logical
query" is split into hundreds of pg_stat_statement entries. Suddenly
it's important whether a certain counts of parameters are more frequent
than others because in the equally distributed cases they fall out of
p_s_s again pretty soon. I think that's probably a worse than average
case, but certainly not something only I could have the bad fortune of
looking at.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: pg_stat_statements fingerprinting logic and ArrayExpr

From
Peter Geoghegan
Date:
On Tue, Dec 10, 2013 at 2:38 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> It's very hard to see where you should spend efforts when every "logical
> query" is split into hundreds of pg_stat_statement entries. Suddenly
> it's important whether a certain counts of parameters are more frequent
> than others because in the equally distributed cases they fall out of
> p_s_s again pretty soon. I think that's probably a worse than average
> case, but certainly not something only I could have the bad fortune of
> looking at.

Another problem is that creating a new entry is relatively expensive,
because we need to acquire an exclusive lock to do so. If there was a
lot of churn, I'd worry that the performance overhead of
pg_stat_statements. It could be quite a lot higher than necessary.


-- 
Peter Geoghegan



Re: pg_stat_statements fingerprinting logic and ArrayExpr

From
Robert Haas
Date:
On Tue, Dec 10, 2013 at 5:38 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-12-10 14:30:36 -0800, Peter Geoghegan wrote:
>> Did you really find pg_stat_statements to be almost useless in such
>> situations? That seems worse than I thought.
>
> It's very hard to see where you should spend efforts when every "logical
> query" is split into hundreds of pg_stat_statement entries. Suddenly
> it's important whether a certain counts of parameters are more frequent
> than others because in the equally distributed cases they fall out of
> p_s_s again pretty soon. I think that's probably a worse than average
> case, but certainly not something only I could have the bad fortune of
> looking at.

Right, but the flip side is that you could collapse things that people
don't want collapsed.  If you've got lots of query that differ only in
that some of them say user_id IN (const1, const2) and others say
user_id IN (const1, const2, const3) and the constants vary a lot, then
of course this seems attractive.  On the other hand if you have two
queries and one of them looks like this:

WHERE status IN ('active') AND user_id = ?

and the other looks like this:

WHERE status IN ('inactive', 'deleted') AND user_id = ?

...it might actually annoy you to have those two things conflated;
it's easy to imagine one having much different performance
characteristics than the other.

Part of me wonders if the real solution here is to invent a way to
support an arbitrarily large hash table of entries efficiently, and
then let people do further roll-ups of the data in userland if they
don't like our rollups.  Part of the pain here is that when you
overflow the hash table, you start losing information that can't be
recaptured after the fact.  If said hash table were by chance also
suitable for use as part of the stats infrastructure, in place of the
whole-file-rewrite technique we use today, massive win.

Of course, even if we had all this, it necessarily make doing
additional rollups *easy*; it's easy to construct cases that can be
handled much better given access to the underlying parse tree
representation than they can be with sed and awk.  But it's a thought.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: pg_stat_statements fingerprinting logic and ArrayExpr

From
Peter Geoghegan
Date:
On Tue, Dec 10, 2013 at 2:46 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Right, but the flip side is that you could collapse things that people
> don't want collapsed.  If you've got lots of query that differ only in
> that some of them say user_id IN (const1, const2) and others say
> user_id IN (const1, const2, const3) and the constants vary a lot, then
> of course this seems attractive.  On the other hand if you have two
> queries and one of them looks like this:
>
> WHERE status IN ('active') AND user_id = ?
>
> and the other looks like this:
>
> WHERE status IN ('inactive', 'deleted') AND user_id = ?
>
> ...it might actually annoy you to have those two things conflated;
> it's easy to imagine one having much different performance
> characteristics than the other.

That is true, but you're probably not going to be able to make any use
of the distinction in the first place, because it's just going to be a
bunch of ? ? characters as constants. You'd have to plan ahead most
likely. You might get lucky and have this exact case, and be able to
leverage the knowledge that the 2 constants in the ArrayExpr must be
the latter and 1 must be the former, but experience suggests very
probably not.


-- 
Peter Geoghegan



Re: pg_stat_statements fingerprinting logic and ArrayExpr

From
Andres Freund
Date:
On 2013-12-10 17:46:56 -0500, Robert Haas wrote:
> On Tue, Dec 10, 2013 at 5:38 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2013-12-10 14:30:36 -0800, Peter Geoghegan wrote:
> >> Did you really find pg_stat_statements to be almost useless in such
> >> situations? That seems worse than I thought.
> >
> > It's very hard to see where you should spend efforts when every "logical
> > query" is split into hundreds of pg_stat_statement entries. Suddenly
> > it's important whether a certain counts of parameters are more frequent
> > than others because in the equally distributed cases they fall out of
> > p_s_s again pretty soon. I think that's probably a worse than average
> > case, but certainly not something only I could have the bad fortune of
> > looking at.
> 
> Right, but the flip side is that you could collapse things that people
> don't want collapsed.  If you've got lots of query that differ only in
> that some of them say user_id IN (const1, const2) and others say
> user_id IN (const1, const2, const3) and the constants vary a lot, then
> of course this seems attractive.

Yea, completely agreed. It might also lead to users missing the fact
that their precious prepared-statement cache is just using up loads of
backend memory and individual prepared statements are seldomly
re-executed because there are so many...

>  On the other hand if you have two
> queries and one of them looks like this:
> 
> WHERE status IN ('active') AND user_id = ?
> 
> and the other looks like this:
> 
> WHERE status IN ('inactive', 'deleted') AND user_id = ?

That too.

> Part of me wonders if the real solution here is to invent a way to
> support an arbitrarily large hash table of entries efficiently, and
> then let people do further roll-ups of the data in userland if they
> don't like our rollups.  Part of the pain here is that when you
> overflow the hash table, you start losing information that can't be
> recaptured after the fact.  If said hash table were by chance also
> suitable for use as part of the stats infrastructure, in place of the
> whole-file-rewrite technique we use today, massive win.
> 
> Of course, even if we had all this, it necessarily make doing
> additional rollups *easy*; it's easy to construct cases that can be
> handled much better given access to the underlying parse tree
> representation than they can be with sed and awk.  But it's a thought.

That would obviously be neat, but I have roughly no clue how to achieve
that. Granular control over how such rollups would work sounds very hard
to achieve unless that granular control just is getting passed a tree
and returning another.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: pg_stat_statements fingerprinting logic and ArrayExpr

From
Peter Geoghegan
Date:
On Tue, Dec 10, 2013 at 2:55 PM, Peter Geoghegan <pg@heroku.com> wrote:
> You might get lucky and have this exact case, and be able to
> leverage the knowledge that the 2 constants in the ArrayExpr must be
> the latter and 1 must be the former, but experience suggests very
> probably not.

When things get this bad, you'd probably be better off adding a
tautology to the query with the problematic constants (or, dare I say,
installing pg_stat_plans). That seems weird, but then if you were able
to recognize the case you described without one it's nothing more than
dumb luck.


-- 
Peter Geoghegan



Re: pg_stat_statements fingerprinting logic and ArrayExpr

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Right, but the flip side is that you could collapse things that people
> don't want collapsed.  If you've got lots of query that differ only in
> that some of them say user_id IN (const1, const2) and others say
> user_id IN (const1, const2, const3) and the constants vary a lot, then
> of course this seems attractive.  On the other hand if you have two
> queries and one of them looks like this:

> WHERE status IN ('active') AND user_id = ?

> and the other looks like this:

> WHERE status IN ('inactive', 'deleted') AND user_id = ?

> ...it might actually annoy you to have those two things conflated;
> it's easy to imagine one having much different performance
> characteristics than the other.

Of course, "status = 'active'" and "status = 'deleted'" might have very
different performance characteristics all by themselves, yet we've
already hard-wired a decision that pg_stat_statements will conflate them.
So I don't think the above argument holds a lot of water.

A different point of view is that it's more or less an implementation
artifact that pg_stat_statements doesn't already see the cases as
equivalent; that happens only because it looks at the querytree before
the planner gets around to constant-folding ARRAY[1,2,3] into the single
Const '{1,2,3}'::int[].

So my objection to what Peter is suggesting is not that it's a bad idea
in isolation, but that I don't see where he's going to stop, short of
reinventing every query-normalization behavior that exists in the planner.
If this particular case is worthy of fixing with a hack in the
fingerprinter, aren't there going to be dozens more with just as good
claims?  (Perhaps not, but what's the basis for thinking this is far
worse than any other normalization issue?)

I'm wondering whether this doesn't indicate that we need to rethink where
the fingerprinter has been plugged in.  I'm not sure that somewhere in
the planner, post-constant-folding, would be a better place; but it's
worth thinking about.
        regards, tom lane



Re: pg_stat_statements fingerprinting logic and ArrayExpr

From
Daniel Farina
Date:
On Tue, Dec 10, 2013 at 3:08 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> So my objection to what Peter is suggesting is not that it's a bad idea
> in isolation, but that I don't see where he's going to stop, short of
> reinventing every query-normalization behavior that exists in the planner.
> If this particular case is worthy of fixing with a hack in the
> fingerprinter, aren't there going to be dozens more with just as good
> claims?  (Perhaps not, but what's the basis for thinking this is far
> worse than any other normalization issue?)

Qualitatively, the dynamic length values list is the worst offender.

There is no algebraic solution to where to stop with normalizations,
but as Peter points out, that bridge has been crossed already:
assumptions have already been made that toss some potentially useful
information already, and yet the program is undoubtedly practical.

Based on my own experience (which sounds similar to Andres's), I am
completely convinced the canonicalization he proposes here is more
practical than the current definition to a large degree, so I suggest
the goal is worthwhile.



Re: pg_stat_statements fingerprinting logic and ArrayExpr

From
Peter Geoghegan
Date:
On Tue, Dec 10, 2013 at 3:08 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> A different point of view is that it's more or less an implementation
> artifact that pg_stat_statements doesn't already see the cases as
> equivalent; that happens only because it looks at the querytree before
> the planner gets around to constant-folding ARRAY[1,2,3] into the single
> Const '{1,2,3}'::int[].

Right. It's also an implementation artifact, for example, that
pg_stat_statements sees two queries as distinct based solely on the
choice of JOIN syntax for (arguably) semantically equivalent inner
join queries. In a perfect world, I guess it would see them as
non-distinct.

> So my objection to what Peter is suggesting is not that it's a bad idea
> in isolation, but that I don't see where he's going to stop, short of
> reinventing every query-normalization behavior that exists in the planner.
> If this particular case is worthy of fixing with a hack in the
> fingerprinter, aren't there going to be dozens more with just as good
> claims?  (Perhaps not, but what's the basis for thinking this is far
> worse than any other normalization issue?)

The basis is simply that I've heard a number of complaints about it
(I'm shocked at just how big a problem Andres considers this to be),
and yet I've heard no complaints about any of the other cases. No one
cares about those, because those are distinctions that applications
and application frameworks are not at all inclined to unpredictably
inject. Doing some string interpolation to build what will become an
ArrayExpr during parse analysis is a common thing. But unpredictably
switching from one syntax to another equivalent one (like varying JOIN
syntax), with the relevant user code traceable to a single point in
the application, simply doesn't happen in the real world.

> I'm wondering whether this doesn't indicate that we need to rethink where
> the fingerprinter has been plugged in.  I'm not sure that somewhere in
> the planner, post-constant-folding, would be a better place; but it's
> worth thinking about.

Hardly seems worth the trouble. A nice property of pg_stat_statements
is that it has good temporal locality; fingerprinting is just an extra
step of parse analysis. I don't think that this would necessarily be
lost by what you outline, but it'd sure be a concern.

We had this discussion, or at least a similar discussion back when the
normalization work was underway (fingerprint at raw parse tree vs.
post-parse-analysis tree .vs after rule expansion .vs plan tree). My
position at the time was that assigning query execution costs at the
plan granularity is a fine thing, but doing so at the query
granularity is also a fine thing, which is more immediately useful. I
believe this so strongly that I went so far as to write a derivative
of pg_stat_statements to do the fingerprinting at the plan granularity
(while expanding that in a more experimental direction, towards
querying the structure of plans, relatively untested ideas not well
suited to even a contrib module). I could certainly see us
instrumenting plan costs as a separate, perhaps complimentary thing,
but it should not happen at the expense of simple blaming of costs on
queries.

So, if you're not talking about blaming plans and not queries, you
must be talking about making the planner do the constant folding in a
way that facilitates tools like pg_stat_statements in getting the
"essential nature" of the query (*not* the plan) post constant
folding. Which doesn't seem like a good allocation of resources, since
as I said this is the sole complaint of this nature that I've heard,
and people talk to me about pg_stat_statements a lot.

The role of pg_stat_statements is to instrument statements: to assign
costs traceable back to a point in the application. Because of this
implementation artifact, that makes it harder to trace back the costs
to that point, because some entries with less execution numbers of
ArrayExpr elements may be evicted while others are not, and so on.

I just discussed the Django issue with Jacob Kaplan-Moss, Heroku's
directory of security, who is on the Django core team. He's going to
take a look at it, and I wouldn't be surprised if it was fixed, but
that's hardly a scalable model.

-- 
Peter Geoghegan



Re: pg_stat_statements fingerprinting logic and ArrayExpr

From
Tom Lane
Date:
Peter Geoghegan <pg@heroku.com> writes:
> On Tue, Dec 10, 2013 at 3:08 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I'm wondering whether this doesn't indicate that we need to rethink where
>> the fingerprinter has been plugged in.  I'm not sure that somewhere in
>> the planner, post-constant-folding, would be a better place; but it's
>> worth thinking about.

> ... if you're not talking about blaming plans and not queries, you
> must be talking about making the planner do the constant folding in a
> way that facilitates tools like pg_stat_statements in getting the
> "essential nature" of the query (*not* the plan) post constant
> folding.

Yeah; if we were going to take this seriously, we'd need to do some
refactoring to separate normalization-type activities from other
planning activities.  I'm not sure it's worth the trouble.  Right
now, for instance, eval_const_expressions() also handles inlining
of SQL functions, which is a behavior we'd almost certainly *not*
want in front of query fingerprinting.  But it's hard to see how
we disentangle that without either lost optimization capacity
or duplicative processing.  A significant fraction of the point of
const-folding is to exploit opportunities revealed by inlining.

Anyway, I'm not volunteering to do this ;-) ... just idly speculating
about whether it'd be worth doing.
        regards, tom lane