Re: [PATCH] optional cleaning queries stored in pg_stat_statements - Mailing list pgsql-hackers
From: Peter Geoghegan
Subject: Re: [PATCH] optional cleaning queries stored in pg_stat_statements
Msg-id: CAEYLb_V0KRPMppgKsSEZVatOMp52DNXxV=DpuyTA_JQzd9LZiw@mail.gmail.com
In response to: Re: [PATCH] optional cleaning queries stored in pg_stat_statements (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: [PATCH] optional cleaning queries stored in pg_stat_statements
List: pgsql-hackers
On 6 November 2011 15:08, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Are you intending to hash the raw grammar output tree, the parse
> analysis result, the rewriter result, or the actual plan tree?

I'm hashing the Query tree from a planner plugin (a function assigned
to planner_hook), immediately before it is passed to standard_planner.
A major consideration was backwards compatibility; at one point, it
looked like this could all be done without adding any new
infrastructure, with perhaps an officially sanctioned 9.2 version of
pg_stat_statements that could be built against earlier pg versions.
Other than that, it seems intuitively obvious that this should happen
as late as possible in the parsing stage - any transformation performed
prior to planning is considered essential to the query, even if that
potentially means that successive calls of the same query string are
considered different due to external factors. These things probably
don't come up too often in practice, but I think they're nice to have,
as they prevent the DBA from looking at statistics that aren't actually
representative.
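For illustration, a minimal sketch of that hook arrangement, assuming
9.1-era hook signatures, is below. planner_hook, standard_planner,
_PG_init, nodeToString and hash_any are the real extension points;
pgss_planner and hash_query_tree are names I've made up, and the helper
just serialises the tree and hashes the resulting string, whereas the
patch proper canonicalises constants before hashing:

#include "postgres.h"
#include "fmgr.h"
#include "access/hash.h"
#include "nodes/params.h"
#include "optimizer/planner.h"

PG_MODULE_MAGIC;

static planner_hook_type prev_planner_hook = NULL;

/* Crude stand-in for the real tree hash: serialise, then hash the string */
static uint32
hash_query_tree(Query *parse)
{
	char	   *str = nodeToString(parse);

	return DatumGetUInt32(hash_any((const unsigned char *) str, strlen(str)));
}

/* Hash the Query tree immediately before handing it to the planner proper */
static PlannedStmt *
pgss_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
{
	uint32		queryId = hash_query_tree(parse);

	/* ... look up / update the statistics entry for queryId here ... */
	(void) queryId;

	if (prev_planner_hook)
		return prev_planner_hook(parse, cursorOptions, boundParams);
	return standard_planner(parse, cursorOptions, boundParams);
}

void
_PG_init(void)
{
	prev_planner_hook = planner_hook;
	planner_hook = pgss_planner;
}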
>> A guiding principle that I've followed is that anything that could
>> result in a different plan is a differentiator of queries.
>
> This claim seems like bunk, unless you're hashing the plan tree,
> in which case it's tautological.

I think I may have been unclear, for which I apologise. Certainly, if
the principle I (mis)described were followed to the letter, I'd end up
with something exactly the same as what we already have. I merely meant
to suggest, aware that I was saying something slightly controversial,
that because the planner is /invariably/ sensitive to changes in
limit/offset constants, particularly large changes, it makes sense to
have those constants as differentiators, and it certainly makes sense
for a LIMIT n to differentiate a query from the same query without one.
I do of course appreciate that a different constant in a qual can also
result in a different plan being generated, because selectivity
estimates vary.

> I'm not real sure whether it's better to classify on the basis of
> similar plans or similar original queries, anyway. This seems like
> something that could be open for debate about use-cases.

Indeed - it's generally difficult to reason about what behaviour is
optimal when attempting to anticipate the needs of every Postgres DBA.
In this case, I lean strongly towards similar original queries, because
classifying on the basis of similar plans isn't likely to be nearly as
actionable, and there are no obvious precedents to draw on - how could
it be turned into a useful tool?

>> There will be additional infrastructure added to the parser to
>> support normalisation of query strings for the patch I'll be
>> submitting (that obviously won't be supported in the version that
>> builds against existing Postgres versions that I'll make available).
>> Essentially, I'll be adding a length field to certain nodes,
>
> This seems like a good way to get your patch rejected: adding overhead
> to the core system for the benefit of a feature that not everyone
> cares about is problematic.

It's problematic, but I believe it can be justified. Without being
glib: exactly no one cares about the location of the tokens that
correspond to Const nodes until they hit an error, outside the parser,
that is attributable to a node and its corresponding token - which, in
a production environment, could take a long time.

All I'm doing is moving slightly closer to the default Bison
representation of YYLTYPE, in which the column and row of both the
beginning and end of each token are stored. I'm storing location and
length rather than just location, which is a modest change. Not
everyone cares about this feature, but plenty do. It will be
particularly useful to have the query representation stable, even to
the point where it is stable across pg_stat_statements_reset() calls -
third-party tools can then rely on the stability of the query string.

> Why do you need it anyway? Surely it's possible
> to determine the length of a literal token after the fact.

Probably, but not in a manner that I deem well-principled. I want to
push the onus of keeping the bookkeeping for the replacement of tokens
onto the parser, which is authoritative - otherwise, I'll end up with a
kludge that is liable to fall out of sync. The community will have the
opportunity to consider whether that trade-off makes sense.
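To make concrete how recorded locations and lengths would be used when
rewriting the stored query text, here is a self-contained sketch; the
Span struct, normalize_query and the choice of '?' as a placeholder are
illustrative assumptions of mine, not the patch's actual code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record of one constant's position, as the parser would emit */
typedef struct Span
{
	int		location;	/* byte offset of the Const token in the query */
	int		length;		/* length of its text: the proposed new field */
} Span;

/*
 * Return a normalised copy of "query", with each recorded constant
 * replaced by a '?' placeholder.  Spans must be sorted by location.
 */
static char *
normalize_query(const char *query, const Span *spans, int nspans)
{
	size_t	len = strlen(query);
	char   *out = malloc(len + 1);	/* output never exceeds input length */
	size_t	i = 0,
			o = 0;
	int		s = 0;

	while (i < len)
	{
		if (s < nspans && (int) i == spans[s].location)
		{
			out[o++] = '?';
			i += spans[s].length;	/* skip over the constant's text */
			s++;
		}
		else
			out[o++] = query[i++];
	}
	out[o] = '\0';
	return out;
}

int
main(void)
{
	/* spans cover the "42" and the "'x'" in the string below */
	Span	spans[] = {{30, 2}, {43, 3}};
	char   *norm = normalize_query(
		"SELECT * FROM foo WHERE bar = 42 AND baz = 'x'", spans, 2);

	puts(norm);		/* prints: SELECT * FROM foo WHERE bar = ? AND baz = ? */
	free(norm);
	return 0;
}

The point of having the parser emit the spans, rather than re-lexing,
is that the rewriting step never has to re-derive token boundaries for
itself.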
> More generally, if you're hashing anything later than the raw grammar
> tree, I think that generating a suitable representation of the queries
> represented by a single hash entry is going to be problematic anyway.
> There could be significant differences --- much more than just the
> values of constants --- between query strings that end up being
> semantically the same. Or for that matter we could have identical
> query strings that wind up being considered different because of the
> action of search_path or other context.

I don't see why that has to be problematic. I consider it a feature,
and it can be documented as such. I think that any scenario anyone
might care to describe where this is a real problem will be mostly
contrived.

> It might be that the path of least resistance is to document that we
> select one of the actual query strings "at random" to represent all
> the queries that went into the same hash entry, and not even bother
> with trying to strip out constants. The effort required to do that
> seems well out of proportion to the reward, if we can't do a perfect
> job of representing the common aspects of the queries.

Up until quite recently, this patch actually updated the query string,
so that a new set of constants was seen after each call of the query -
that's what Greg posted a link to. The part of the patch that
normalises the query string, and the related backend infrastructure,
are fairly neat adjuncts to what you see there. So while I'll be asking
someone to commit the patch with that infrastructure, because the
adjunct adds quite a lot of value to the patch, there is no need to
have it be a blocker.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services