pg_stat_statements normalisation without invasive changes to the parser (was: Next steps on pg_stat_statements normalisation) - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | pg_stat_statements normalisation without invasive changes to the parser (was: Next steps on pg_stat_statements normalisation) |
Date | |
Msg-id | CAEYLb_WGeFCT7MfJ8FXf-CR6BSE6Lbn+O1VX3+OGrc4Bscn4=A@mail.gmail.com |
Responses | Re: pg_stat_statements normalisation without invasive changes to the parser (was: Next steps on pg_stat_statements normalisation) |
List | pgsql-hackers |
Attached is a revision of the pg_stat_statements normalisation patch, plus its Python/psycopg2/dellstore2 test suite, which has yet to be converted to produce pg_regress output. This revision is a response to the last round of feedback given by reviewers. Highlights include:

* No more invasive changes to the parser. The only way that this patch touches the core code at all is through the addition of a new hook for hashing the post-parse-analysis tree (technically there are two hooks - parse_analyze_hook and parse_analyze_varparams_hook), plus a query_id field added to the Query and PlannedStmt structs, which the core system simply copies around naively. This resolves the hook synchronisation issues that had less elegant workarounds in prior revisions. (A rough sketch of how a module might use such a hook follows this list.)

* We now use the internal, low-level scanner API declared in scanner.h, so that pg_stat_statements can robustly detect a given constant's length based only on its position in the query string (taken from Const nodes, as before) and the string itself.

* All my old regression tests pass, and I've added quite a few new ones as problems transpired, including tests that exercise canonicalisation of what might be considered edge-case query strings, such as ones with many large constants. There are 100 tests just for that, which use pseudo-random constants to exercise the canonicalisation logic thoroughly. Once things start to shape up, I'll modify that Python script to spit out pg_regress tests - it seems worth delaying committing to that less flexible approach for now, and clearly not all of the hundreds of tests are going to make the cut, as at certain points I was shooting from the hip, so to speak. I'll do something similar to sepgsql, another contrib module that has tests.

* All the regular Postgres regression tests now pass with the new pg_stat_statements enabled, under both the parallel and serial schedules. There are no unrecognised nodes, nor any other apparent failures, with assertions enabled. All strings that are subsequently seen in the view are correctly canonicalised, with the exception of 3 or 4 corner cases, noted below. These may well not be worth fixing, or may be down to subtle bugs in the core system parser that we ought to fix.

* Continual hashing is now used, so that arbitrarily long queries can be differentiated (though of course we are still subject to the previous limitation of the query string being capped at pgstat_track_activity_query_size - now, that's what the *canonicalised* query is capped at). That's another improvement on 9.1's pg_stat_statements, which didn't see any differences past pgstat_track_activity_query_size (default: 1024) characters.
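For illustration, here is an untested sketch of roughly how a module might use such a hook. The hook's signature, the uint32 type of the query_id field, and the helper pgss_hash_query() are stand-ins based on the description above, not the actual patch code:

```c
/*
 * Illustrative sketch only -- the hook signature, the uint32 type of
 * query_id and pgss_hash_query() are stand-ins, not the patch itself.
 */
#include "postgres.h"
#include "fmgr.h"
#include "nodes/parsenodes.h"

PG_MODULE_MAGIC;

void		_PG_init(void);

/* assumed shape of the proposed hook, fired once parse analysis is done */
typedef void (*parse_analyze_hook_type) (Query *query, const char *sourceText);
extern parse_analyze_hook_type parse_analyze_hook;	/* added by the patch */

static parse_analyze_hook_type prev_parse_analyze_hook = NULL;

static uint32
pgss_hash_query(Query *query, const char *sourceText)
{
	/* stand-in: the real code jumbles the whole post-analysis tree */
	return (uint32) query->commandType;
}

static void
pgss_parse_analyze(Query *query, const char *sourceText)
{
	if (prev_parse_analyze_hook)
		prev_parse_analyze_hook(query, sourceText);

	/*
	 * Stash the hash in the new query_id field; the core system copies it
	 * from the Query into the PlannedStmt untouched, so the executor hook
	 * later sees the same value without any hook synchronisation games.
	 */
	query->query_id = pgss_hash_query(query, sourceText);
}

void
_PG_init(void)
{
	prev_parse_analyze_hook = parse_analyze_hook;
	parse_analyze_hook = pgss_parse_analyze;
}
```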
There are a number of outstanding issues that I'm aware of:

* Under some situations, the logic through which we determine the length of constants is a little fragile, though I believe we can find a solution. In particular, consider this query:

select integer '1';

This normalises to:

select ?

and not, as was the case in prior revisions:

select integer ?;

This is because the post-analysis tree, unlike the post-rewrite tree, appears to give the position of the constant in this case as starting with the datatype, so I'm forced to work out a way to have the length of the constant considered as more than a single token. I'll break on reaching an SCONST token in this case, but there are other cases that require careful workarounds. I wouldn't be surprised if someone was able to craft a query to break this logic. Ideally, I'd be able to assume that constants are exactly one token, allowing me to greatly simplify the code (the first sketch after this list shows that single-token case).

* I am aware that it's suboptimal to initialise the scanner once each time a constant of a given query is first seen. The function get_constant_length is fairly messy, but the fact that we may only need to take the length of a single token in a future revision (once we address the previous known issue) doesn't leave me with much motivation to clean it up just yet.

* This test from the Python suite currently fails:

# XXX: This test currently fails!:
#verify_normalizes_correctly("SELECT cast('1' as dnotnull);","SELECT cast(? as dnotnull);",conn, "domain literal canonicalization/cast")

It appears to fail because the CoerceToDomain node gives its location to the constant node as starting from "cast", so we end up with "SELECT ?('1' as dnotnull);". I'm not quite sure whether this points to a slight tension with my use of the location field in this way, or whether it is something that could be fixed as a bug in core (albeit a highly obscure one), though I suspect the latter.

* I'm still not using a core mechanism like query_tree_walker to walk the tree, which would be preferable. The maintainability of the walker logic was criticized. At about 800 lines of code in total (the functions PerformJumble, QualsNode, LeafNode, LimitOffsetNode, JoinExprNode and JoinExprNodeChild), for structures that in practice are seldom changed, and with a good test suite, I think we could do a lot worse. We now raise a warning rather than an error in the event of an unrecognised node, which seems more sensible - people really aren't going to thank you for making their entire query fail just because we failed to serialise some node at some point. I don't think that we can get away with just accumulating node tags much of the time, at least if we'd like to implement this feature as I'd envisaged, which is that it would be robust and comprehensive.

* If we use prepared statements, it's possible that an entry created from within our parse analysis hook will get evicted from the fixed-size shared hash table before it is once again executed within our executor hook. If this happens, we won't be able to canonicalise the query string's constants again. However, it can probably only happen with prepared statements (I concede that eviction might be possible between a given backend's parse analysis hook and executor hook being called - I'm not really sure. It might be worth holding a shared lock between the hooks in that case, on the off chance that the query string otherwise wouldn't be canonicalised, but then again that's a fairly rare failure). People aren't going to care too much about canonicalisation of prepared statement constants, but I haven't just removed it and hashed the query string there, because it may still be valuable to be able to differentiate arbitrarily long prepared queries. Maybe the answer here is to have pg_stat_statements tell the core system "this is that querytree's original query string now". That would have hazards of its own though, including invalidating the positions of constants. Another option would be to add a normalized_query char* to the Query and PlannedStmt structs, which the core system would treat much the same way as the query_id field in the proposed patch.

* The way that I maintain a stack of range tables, so that Vars whose varlevelsup != 0 can rt_fetch() an rte to hash its relid, may be less than idiomatic. There is a function used elsewhere on the raw parse tree to do something similar, but that tree has a parent pointer that can be followed, which is not available to me (the second sketch after this list illustrates the basic idea).

* I would have liked to give pg_stat_statements a configurable eviction criterion, so that queries with the lowest total execution time could be evicted first, rather than those with the fewest calls. I haven't done that here.
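To give a feel for it, here is an untested sketch of that happy case, where the constant really is exactly one token, using the scanner API from scanner.h. The function name constant_token_length() is illustrative only, not the patch's get_constant_length(), and the length calculation is deliberately naive (it ignores trailing whitespace and the "integer '1'" problem described above):

```c
/*
 * Untested sketch: find the length of the single token that a Const
 * node's location points at, by scanning the original query text.
 */
#include "postgres.h"
#include "parser/keywords.h"
#include "parser/scanner.h"

static int
constant_token_length(const char *query, int constlocation)
{
	core_yyscan_t yyscanner;
	core_yy_extra_type yyextra;
	core_YYSTYPE yylval;
	int			yylloc;
	int			length = -1;

	yyscanner = scanner_init(query, &yyextra, ScanKeywords, NumScanKeywords);

	for (;;)
	{
		int			tok = core_yylex(&yylval, &yylloc, yyscanner);

		if (tok == 0)
			break;				/* end of input without finding the constant */
		if (yylloc != constlocation)
			continue;

		/*
		 * Found the token the Const node points at.  Naively take its
		 * length as the distance to the start of the next token (or to the
		 * end of the string).  Cases like "integer '1'", where the location
		 * points at the type name rather than the literal, need more care.
		 */
		tok = core_yylex(&yylval, &yylloc, yyscanner);
		length = (tok == 0) ? (int) strlen(query) - constlocation :
			yylloc - constlocation;
		break;
	}

	scanner_finish(yyscanner);
	return length;
}
```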
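And the range table stack business amounts to something like the following sketch. The ordering assumption (the current query level's rtable at the front of the list, each enclosing level after it) and the name pgss_var_relid() are mine, for illustration only:

```c
/*
 * Sketch: while jumbling, keep a stack (a List) of range tables, with the
 * current level's rtable at position 0 and enclosing levels after it.  A
 * Var's varlevelsup then indexes into that stack, and rt_fetch() gives us
 * the RTE whose relid we hash.
 */
#include "postgres.h"
#include "nodes/parsenodes.h"
#include "parser/parsetree.h"

static Oid
pgss_var_relid(Var *var, List *rangetable_stack)
{
	List	   *rtable = (List *) list_nth(rangetable_stack,
										   var->varlevelsup);
	RangeTblEntry *rte = rt_fetch(var->varno, rtable);

	return rte->relid;			/* InvalidOid unless this is a plain relation RTE */
}
```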
Thoughts?

--
Peter Geoghegan
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
Attachment