Re: [PATCH] optional cleaning queries stored in pg_stat_statements - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: [PATCH] optional cleaning queries stored in pg_stat_statements
Date
Msg-id CAEYLb_V0KRPMppgKsSEZVatOMp52DNXxV=DpuyTA_JQzd9LZiw@mail.gmail.com
In response to Re: [PATCH] optional cleaning queries stored in pg_stat_statements  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On 6 November 2011 15:08, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Are you intending to hash the raw
> grammar output tree, the parse analysis result, the rewriter result,
> or the actual plan tree?

I'm hashing the Query tree from a planner plugin (function assigned to
planner_hook), immediately prior to it being passed to
standard_planner. A major consideration was backwards compatibility;
at one point, it looked like this all could be done without adding any
new infrastructure, with perhaps an officially sanctioned 9.2 version
of pg_stat_statements that could be built against earlier pg versions.Other than that, it seems intuitively obvious
thatthis should happen 
as late as possible in the parsing stage - any transformation
performed prior to planning was considered essential to the query,
even if that potentially meant that successive calls of the same query
string were considered different due to external factors. These things
probably don't come up too often in practice, but I think that they're
nice to have, as they prevent the DBA from looking at statistics that
aren't actually representative.
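To make the idea concrete, here is a toy sketch (in Python, purely for
illustration; the real patch walks PostgreSQL's post-analysis Query tree
in C, from planner_hook) of fingerprinting a tree so that structure and
node tags contribute to the hash while the values of constants do not.
The node shapes and names here are invented:

```python
# Illustrative sketch only: hash node tags and tree structure, but skip
# the values of Const nodes, so queries differing only in constants
# collapse to one fingerprint.
import hashlib

def fingerprint(node, h=None):
    """Hash node tags and structure, but not the values of constants."""
    if h is None:
        h = hashlib.sha1()
    tag = node["tag"]
    h.update(tag.encode())
    if tag != "Const":                      # constant values are skipped
        h.update(repr(node.get("value")).encode())
    for child in node.get("children", []):
        fingerprint(child, h)
    return h.hexdigest()

# WHERE col = 5  vs  WHERE col = 99: same fingerprint
q1 = {"tag": "OpExpr", "value": "=", "children": [
        {"tag": "Var", "value": 1}, {"tag": "Const", "value": 5}]}
q2 = {"tag": "OpExpr", "value": "=", "children": [
        {"tag": "Var", "value": 1}, {"tag": "Const", "value": 99}]}

print(fingerprint(q1) == fingerprint(q2))   # → True
```

Because the hash is taken after parse analysis, anything the transform
stage changed (resolved names, expanded views, and so on) is already
reflected in what gets hashed.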

>> A guiding principle that
>> I've followed is that anything that could result in a different plan
>> is a differentiator of queries.
>
> This claim seems like bunk, unless you're hashing the plan tree,
> in which case it's tautological.

I think I may have been unclear, for which I apologise. Certainly, if
the principle I (mis)described was followed to the letter, I'd end up
with something that was exactly the same as what we already have. I
merely meant to suggest, with an awareness of the fact that I was
saying something slightly controversial, that because the planner is
/invariably/ sensitive to changes in limit/offset constants,
particularly large changes, it made sense to have those constants as
differentiators, and it certainly made sense to have a limit n
differentiate the same query with and without a limit n. I do of
course appreciate that the use of a different constant in quals could
result in a different plan being generated due to their selectivity
estimates varying.
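The distinction I'm drawing can be sketched like this (again a toy
illustration in Python with invented node shapes, not the patch itself):
LIMIT/OFFSET constants are folded into the fingerprint, while constants
in quals are masked out.

```python
# Sketch only: LIMIT constants differentiate queries; qual constants
# do not.  The tree representation is invented for illustration.
import hashlib

def fingerprint(node):
    h = hashlib.sha1()

    def walk(n):
        h.update(n["tag"].encode())
        if n["tag"] == "Limit":
            # the limit count is part of the query's identity
            h.update(str(n["count"]).encode())
        elif n["tag"] != "Const":
            h.update(repr(n.get("value")).encode())
        for child in n.get("children", []):
            walk(child)

    walk(node)
    return h.hexdigest()

def query(const, count):
    return {"tag": "Limit", "count": count, "children": [
        {"tag": "OpExpr", "value": "=", "children": [
            {"tag": "Var", "value": 1},
            {"tag": "Const", "value": const}]}]}

print(fingerprint(query(5, 10)) == fingerprint(query(99, 10)))   # → True
print(fingerprint(query(5, 10)) == fingerprint(query(5, 1000)))  # → False
```

So `... WHERE a = 5 LIMIT 10` and `... WHERE a = 99 LIMIT 10` share an
entry, but `LIMIT 10` and `LIMIT 1000` do not.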

> I'm not real sure whether it's better to classify on the basis of
> similar plans or similar original queries, anyway.  This seems like
> something that could be open for debate about use-cases.

Indeed - it's generally difficult to reason about what behaviour is
optimal when attempting to anticipate the needs of every Postgres DBA.
In this case, I myself lean strongly towards similar original queries,
because classifying on the basis of similar plans isn't likely to be
nearly as actionable, and there are really no obvious precedents to
draw on - how can it be turned into a useful tool?

>> There will be additional infrastructure added to the parser to support
>> normalisation of query strings for the patch I'll be submitting (that
>> obviously won't be supported in the version that builds against
>> existing Postgres versions that I'll make available). Essentially,
>> I'll be adding a length field to certain nodes,
>
> This seems like a good way to get your patch rejected: adding overhead
> to the core system for the benefit of a feature that not everyone cares
> about is problematic.

It's problematic, but I believe that it can be justified. Without
being glib, exactly no one cares about the location of tokens that
correspond to Const nodes until they hit an error outside the parser
that is attributable to a node and its corresponding token, which, in
a production environment, could take a long time. All I'm doing is
moving slightly more towards the default Bison representation of
YYLTYPE, where the column and row of both the beginning and end of
each token are stored. I'm storing location and length rather than
just location, which is a modest change. Not everyone cares about this
feature, but plenty do. It will be particularly useful to have the
Query representation stable, even to the point where it is stable
between pg_stat_statements_reset() calls - third party tools can rely
on the stability of the query string.
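The normalisation step itself is mechanical once the parser has
recorded where each constant token starts and how long it is. A minimal
sketch, assuming the (location, length) pairs are handed over as a list
(the pair format here is an assumption for illustration; the actual
patch stores these on certain parse nodes):

```python
# Sketch of query-string normalisation: splice '?' placeholders over
# the constant tokens, using (location, length) spans recorded by the
# parser rather than re-lexing the string after the fact.
def normalise(query, const_spans):
    out, last = [], 0
    for loc, length in sorted(const_spans):
        out.append(query[last:loc])
        out.append("?")
        last = loc + length
    out.append(query[last:])
    return "".join(out)

sql = "SELECT * FROM t WHERE a = 5 AND b = 'abc'"
spans = [(26, 1), (36, 5)]      # spans of the tokens 5 and 'abc'
print(normalise(sql, spans))    # → SELECT * FROM t WHERE a = ? AND b = ?
```

With only a start location, the length of a multi-character literal
would have to be rediscovered by scanning the string, which is exactly
the after-the-fact kludge discussed below.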

> Why do you need it anyway?  Surely it's possible
> to determine the length of a literal token after the fact.

Probably, but not in a manner that I deem to be well-principled. I
want to push the onus of keeping the bookkeeping for the replacement
of tokens onto the parser, which is authoritative - otherwise, I'll end
up with a kludge that is liable to fall out of sync. The community
will have the opportunity to consider if that trade-off makes sense.

> More generally, if you're hashing anything later than the raw grammar
> tree, I think that generating a suitable representation of the queries
> represented by a single hash entry is going to be problematic anyway.
> There could be significant differences --- much more than just the
> values of constants --- between query strings that end up being
> semantically the same.  Or for that matter we could have identical query
> strings that wind up being considered different because of the action of
> search_path or other context.

I don't see that that has to be problematic. I consider that to be a
feature, and it can be documented as such. I think that any scenario
anyone might care to describe where this is a real problem will be
mostly contrived.

> It might be that the path of least resistance is to document that we
> select one of the actual query strings "at random" to represent all the
> queries that went into the same hash entry, and not even bother with
> trying to strip out constants.  The effort required to do that seems
> well out of proportion to the reward, if we can't do a perfect job of
> representing the common aspects of the queries.

Up until quite recently, this patch actually updated the query string,
so that a new set of constants was seen after each query call - that's
what Greg posted a link to. The part of the patch that normalises the
query string and the related backend infrastructure are fairly neat
adjuncts to what you see there. So while I'll be asking someone to
commit the patch with that infrastructure, because the adjunct adds
quite a lot of value to the patch, there is no need to have that be a
blocker.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

