Next steps on pg_stat_statements normalisation - Mailing list pgsql-hackers
From: Peter Geoghegan
Subject: Next steps on pg_stat_statements normalisation
Date:
Msg-id: CAEYLb_W4oUZ+qE6t8Bm_YpXVU=33i6MvxXt43sHpsZGVw5TD=A@mail.gmail.com
Responses: Re: Next steps on pg_stat_statements normalisation
           Re: Next steps on pg_stat_statements normalisation
List: pgsql-hackers
So, having received feedback from Tom and others in relation to this patch, I would like to state how I think I should go about addressing various concerns to ensure that a revision of the patch gets into the 9.2 release. As I've said time and again, I think that it is very important that we have this, sooner rather than later.

The main objection made in relation to my patch was to the overhead of the admittedly invasive (though purely mechanical) changes to the grammar to record the length of all tokens, so that Const tokens could subsequently have their length known in a principled fashion. While I believe that you'd be extremely hard pushed to demonstrate any regression at all from these extra cycles, it's also true that, given the starting position of tokens (a value already exposed for the purposes of producing error messages outside the parser that reference particular tokens with a caret to their position in the query string, corresponding to various nodes), there's no reason to think that it should not be possible to calculate the total length on-the-fly from within pg_stat_statements.

The syntax for constants is sufficiently simple that I think a good set of regression tests should make this entirely practicable, covering all permutations of the relevant factors affecting how the implementation should act, including, for example, standard_conforming_strings. There's no reason to think that the SQL syntax rules for constants should need to change very frequently, or even at all, so we should be fine with just knowing the starting position. It's quicker and easier to do it this way than to argue the case for my original implementation, so that's what I'll do.

Whatever overhead this independent, pg_stat_statements-internal const parsing may impose, it will at least only be paid once per query, when we first require a canonicalised representation of the query for the pg_stat_statements view. I have, in the past, spoken against the idea of parsing query strings on the fly (an objection proven to be well-founded by various third party tools), but this is something quite different - by starting from a position that is known to be where a Const token starts, we need only concern ourselves with constant syntax, a much more limited and stable thing to have to pin down.

I imagined that by plugging into planner_hook, and hashing the post-rewrite query tree immediately before it was passed to the planner, there was a certain amount of merit in recognising as equivalent queries that would definitely result in the same plan, all other things, and selectivity estimates of constants, being equal. However, since some of that logic, such as the logic through which different though equivalent join syntaxes are normalised, actually occurs in the planner, it was judged that there was an undesirable inconsistency. I also felt that it was useful that the hashing occur on the post-rewrite "query proper", but various examples illustrated that that might be problematic, such as the inability of the implementation to differentiate constants in the original query string from, for example, constants appearing in view definitions.

Tom wanted to hash plans. I think that this view was partially informed by concerns about co-ordination of hooks. While I had worked around those problems, the solution that I'd proposed wasn't very pretty, and it wasn't that hard to imagine someone finding a reason to need to break it.
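As an aside, to make the on-the-fly constant scanning described above a bit more concrete, here is a minimal, illustrative sketch of what figuring out a constant's length from its known starting position could look like. To be clear, this isn't code from the patch; get_const_length is a made-up name, and only ordinary strings, E'' strings and numeric literals are handled - the real thing would also have to deal with dollar quoting, bit strings, Unicode escapes and so on:

    #include <ctype.h>
    #include <stdbool.h>

    /*
     * Hypothetical helper: given the query text and the known starting
     * offset of a Const token, return the token's length in bytes.
     * std_strings is the value of standard_conforming_strings that was
     * in effect when the query was parsed.
     */
    static int
    get_const_length(const char *query, int offset, bool std_strings)
    {
        const char *start = query + offset;
        const char *p = start;
        bool        escape_string = false;

        /* E'...' (or e'...') strings always treat backslash as an escape */
        if ((*p == 'E' || *p == 'e') && p[1] == '\'')
        {
            escape_string = true;
            p++;
        }

        if (*p == '\'')
        {
            p++;
            while (*p != '\0')
            {
                if (*p == '\\' && p[1] != '\0' &&
                    (escape_string || !std_strings))
                    p += 2;         /* backslash escape consumes next char */
                else if (*p == '\'' && p[1] == '\'')
                    p += 2;         /* doubled quote is an escaped quote */
                else if (*p == '\'')
                {
                    p++;            /* closing quote: we're done */
                    break;
                }
                else
                    p++;
            }
        }
        else if (isdigit((unsigned char) *p) || *p == '.')
        {
            /* integer or numeric literal */
            while (isdigit((unsigned char) *p) || *p == '.')
                p++;
            /* optional exponent, as in 1.5e-3 */
            if ((*p == 'e' || *p == 'E') &&
                (isdigit((unsigned char) p[1]) ||
                 ((p[1] == '+' || p[1] == '-') &&
                  isdigit((unsigned char) p[2]))))
            {
                p += 2;
                while (isdigit((unsigned char) *p))
                    p++;
            }
        }

        return (int) (p - start);
    }

None of this needs any help from the core grammar or lexer; the only thing the core system has to supply is the constant's starting position, which it already records.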
While I think there might be a lot to be said for hashing plans, that isn't a natural adjunct to pg_stat_statements - the user expects to see exactly one entry for what they consider to be the same query, and I think you'd have a hard time convincing many people that it's okay to have a large number of entries for what they thought was the same query, just reflecting the fact that a different plan was used for each of several entries with the same query string, based on factors they didn't consider essential to the query and that may be quite transitory. Some more marginal plans for the same query could be missing entirely from the fixed-size hash table. It wouldn't have been a very useful way of normalising queries, even if it did more accurately reflect where execution costs came from, so I am dead set against doing this for normalisation. However, it might be that we can add even more value in the 9.3 release by hashing plans as well, to do entirely more interesting things, but we need to have a canonicalised, normalised representation of queries first - that's obviously how people expect a tool like this to work, and it's difficult to imagine a practical work-flow for isolating poorly performing queries/plans that doesn't stem from that.

Balancing all of these concerns, here's a basic outline of what I think the way forward is (a rough sketch of how the pg_stat_statements side of this could hang together follows the outline):

* Add a hook to core that allows modules to hook into the calls to parse_analyze() and parse_analyze_varparams(). There'd probably be two single-line plugin functions, one for each, so that they're both implemented within a single function that does all the real work.

* Add an int64 value, query_id, to both the Query and PlannedStmt structs, to be blindly copied from one to the other by the core system. No need to worry about the query tree and plan not being sufficiently coupled now, as was previously a concern with prepared queries, as that value will persist.

* Within pg_stat_statements, hook into parse_analyze(). Hash the post-analysis Query returned by parse_analyze() (and not within a planner hook, which this proposed new revision no longer has). Now, I know that Tom felt that equivalent syntaxes shouldn't be normalized, or at least that this was trivial, and this approach would still do some measure of that. However, having a big case statement like the one in transformStmt() to act on the pre-analysis parse tree (which is always referenced through a Node pointer, as it is polymorphic) seems like an awful lot more code for little benefit. I'm not even sure right now whether doing that would avoid this implementation artefact. In any case, ISTM that the solution is simply not to document the precise behaviour of normalisation, so the user has no reasonable basis for expecting, for example, that a cross join should be treated the same as when the comma syntax is used. Is it really defensible to extrapolate that that should happen from the observation that it works when noise words like "outer" in "left outer join" are omitted? I think not.

* Copy the hash to the Query struct's query_id, so that the plan that our executor hook will later see will bear that same query_id.
* Track the positions of constants within the query string when walking the post-analysis query tree, if we don't already have a hash table entry for the query - and do update the constants if we already have constants for the query_id but no hash table entry yet. Keep this bookkeeping information in palloc()'d memory, using some kind of associative data structure that associates the list of constant positions with their query_id. All of this avoids using the wrong set of constants with an equivalent query that has a different amount of whitespace or something like that.

* Within the existing executor hook, look for the query in the hash table based on the PlannedStmt's query_id, trusting that the core system has copied that value from the original Query tree when the query was originally parsed, potentially quite a while ago for prepared queries. If it's recognised, just update the stats. If it is not, make a new entry. This entry will require a canonicalised query string; generate one from the query string the executor plugin already sees, plus the list of constant starting positions previously computed for this query_id/query string, figuring out the length of each constant simply by knowing the rules for constants and applying them, using the principles I've already described. Free the memory in the bookkeeping structure allocated for this entry, safe in the knowledge that the hash table entry will prevent it from being updated/needed, until it is evicted from the hash, at which point, if the query is seen again, constant positions need to be figured out again (that new canonicalised query string might have more whitespace than before, obviously).
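To give a slightly more concrete feel for how these pieces could fit together, here is a rough sketch, with the obvious caveat that the proposed hook and the query_id fields don't exist yet, and that every pgss_* helper named here is a placeholder rather than an existing function:

    #include "postgres.h"

    #include "executor/executor.h"
    #include "nodes/parsenodes.h"

    /*
     * Placeholders only - none of these exist today.  query_id would be the
     * new int64 field copied from Query to PlannedStmt by the core system.
     */
    extern int64 pgss_hash_query(Query *query);
    extern bool pgss_entry_exists(int64 query_id);
    extern void pgss_record_const_locations(int64 query_id, Query *query);
    extern char *pgss_canonicalise(const char *source, int64 query_id);
    extern void pgss_store_entry(int64 query_id, const char *canon_query);
    extern void pgss_forget_const_locations(int64 query_id);
    extern void pgss_update_stats(int64 query_id, QueryDesc *queryDesc);

    /* called from the proposed parse_analyze() hook */
    static void
    pgss_post_parse_analyze(Query *query, const char *sourceText)
    {
        /* hash the post-analysis tree, ignoring Const values themselves */
        query->query_id = pgss_hash_query(query);

        /*
         * No shared hash table entry yet?  Walk the tree and remember each
         * Const's starting position in palloc()'d backend-local memory,
         * keyed by query_id, so the string can be canonicalised later.
         */
        if (!pgss_entry_exists(query->query_id))
            pgss_record_const_locations(query->query_id, query);
    }

    /* the existing ExecutorEnd hook (chaining to the previous hook omitted) */
    static void
    pgss_ExecutorEnd(QueryDesc *queryDesc)
    {
        int64       query_id = queryDesc->plannedstmt->query_id;

        if (!pgss_entry_exists(query_id))
        {
            /*
             * First execution we've seen: build the canonical string by
             * overwriting each recorded constant (start position plus a
             * length worked out as sketched earlier) with a placeholder,
             * then throw the bookkeeping away - the hash table entry makes
             * it unnecessary until this entry is evicted.
             */
            char       *canon = pgss_canonicalise(queryDesc->sourceText,
                                                  query_id);

            pgss_store_entry(query_id, canon);
            pgss_forget_const_locations(query_id);
        }

        pgss_update_stats(query_id, queryDesc);
    }

The important property is that pg_stat_statements no longer needs to see the planner at all: the query_id travels from parse analysis to execution inside the core structures.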
Thoughts?

--
Peter Geoghegan
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services