pg_stat_statements normalisation without invasive changes to the parser (was: Next steps on pg_stat_statements normalisation) - Mailing list pgsql-hackers

From Peter Geoghegan
Subject pg_stat_statements normalisation without invasive changes to the parser (was: Next steps on pg_stat_statements normalisation)
Date
Msg-id CAEYLb_WGeFCT7MfJ8FXf-CR6BSE6Lbn+O1VX3+OGrc4Bscn4=A@mail.gmail.com
Whole thread Raw
Responses Re: pg_stat_statements normalisation without invasive changes to the parser (was: Next steps on pg_stat_statements normalisation)
List pgsql-hackers
Attached is a revision of the pg_stat_statements normalisation patch,
plus its Python/psycopg2/dellstore2 test suite, which has yet to be
converted to produce pg_regress output. This revision is a response to
the last round of feedback given by reviewers. Highlights include:

* No more invasive changes to the parser. The only way that this patch
even touches the core code itself is the addition of a new hook for
hashing the post parse analysis tree (there are actually technically
two hooks - parse_analyze_hook and parse_analyze_varparams_hook), as
well as by adding a query_id to the Query and PlannedStmt structs,
that the core system simply naively copies around. This resolves the
hook synchronisation issues that had less elegant workarounds in prior
revisions.

* We now use the internal, low-level scanner API declared in
scanner.h, so that pg_stat_statements has the capability of robustly
detecting a given constant's length based only on its position in the
query string (taken from Const nodes, as before) and the string
itself.

* All my old regression tests pass, but I've added quite a few new
ones too, as problems transpired, including tests to exercise
canonicalisation of what might be considered edge-case query strings,
such as ones with many large constants. There are 100 tests just for
that, that use psuedo-random constants to exercise the
canonicalisation logic thoroughly. Once things start to shape up, I'll
modify that python script to spit out pg_regress tests - it seems
worth delaying committing to that less flexible approach for now
though, and clearly not all of the hundreds of tests are going to make
the cut, as at certain points I was shooting from the hip, so to
speak. I'll do something similar to sepgsql, another contrib module
that has tests.

* All the regular Postgres regression tests now pass, with the new
pg_stat_statements enabled, and with both parallel and serial
schedules. There are no unrecognised nodes, nor any other apparent
failures, with assertions enabled. All strings that are subsequently
seen in the view are correctly canonicalised, with the exception 3 or
4 corner cases, noted below. These may well not be worth fixing, or
may be down to subtle bugs in the core system parser that we ought to
fix.

* Continual hashing is now used, so that arbitrarily long queries can
be differentiated (though of course we are still subject to the
previous limitation of a query string being capped to
pgstat_track_activity_query_size - now, that's what the
*canonicalised* query is capped at). That's another improvement on
9.1's pg_stat_statements, which didn't see any differences past
pgstat_track_activity_query_size (default: 1024) characters.

There are a number of outstanding issues that I'm aware of:

* Under some situations, the logic through which we determine the
length of constants is a little fragile, though I believe we can find
a solution. In particular, consider this query:

select integer '1';

this normalises to:

select ?

and not, as was the case in prior revisions:

select integer ?;

This is because the post analysis tree, unlike the post rewrite tree,
appears to give the position of the constant in this case as starting
with the datatype, so I'm forced to try and work out a way to have the
length of the constant considered as more than a single token. I'll
break on reaching a SCONST token in this case, but there are other
cases that require careful workarounds. I wouldn't be surprised if
someone was able to craft a query to break this logic. Ideally, I'd be
able to assume that constants are exactly one token, allowing me to
greatly simplify the code.

* I am aware that it's suboptimal how I initialise the scanner once
for each time a constant of a given query is first seen. The function
get_constant_length is fairly messy, but the fact that we may only
need to take the length of a single token in a future revision (once
we address the previous known issue) doesn't leave me with much
motivation to clean it up just yet.

*     # XXX: This test currently fails!:
    #verify_normalizes_correctly("SELECT cast('1' as dnotnull);","SELECT
cast(? as dnotnull);",conn, "domain literal canonicalization/cast")

It appears to fail because the CoerceToDomain node gives its location
to the constant node as starting from "cast", so we end up with
"SELECT ?('1' as dnotnull);". I'm not quite sure if this points to
there being a slight tension with my use of the location field in this
way, or if this is something that could be fixed as a bug in core
(albeit a highly obscure one), though I suspect the latter.

* I'm still not using a core mechanism like query_tree_walker to walk
the tree, which would be preferable. The maintainability of the walker
logic was criticized. At about 800 lines of code in total for the
walker logic (for the functions PerformJumble, QualsNode, LeafNode,
LimitOffsetNode, JoinExprNode, JoinExprNodeChild), for structures that
in practice are seldom changed, with a good test suite, I think we
could do a lot worse. We now raise a warning rather than an error in
the event of an unrecognised node, which seems more sensible - people
really aren't going to thank you for making their entire query fail,
just because we failed to serialise some node at some point. I don't
think that we can get away with just accumulating nodetags much of the
time, as least if we'd like to implement this feature as I'd
envisaged, which is that it would be robust and comprehensive.

* If we use prepared statements, it's possible that an entry, created
from within our parse analysis hook, will get evicted from the
fixed-sized shared hash table before it is once again executed within
our executor hook. Now, if this happens, we won't be able to
canonicalise the query string constants again. However, it can
probably only happen with prepared statements (I concede that eviction
might be possible between a given backends parse analysis hook and
executor hook being called - not really sure. Might be worth holding a
shared lock between the hooks in that case, on the off chance that the
query string won't be canonicalised, but then again that's a fairly
rare failure). People aren't going to care too much about
canonicalisation of prepared statement constants, but I haven't just
removed it and hashed the query string there because it may still be
valuable to be able to differentiate arbitrarily long prepared
queries.

Maybe the answer here is to have pg_stat_statements tell the core
system "this is that querytree's original query string now". That
would have hazards of its own though, including invalidating the
positions of constants. Another option would be to add a
normalized_query char* to the Query and PlannedStmt structs, with
which the core system does much the same thing as the query_id field
in the proposed patch.

* The way that I maintain a stack of range tables, so that Vars whose
vallevelsup != 0 can rt_fetch() an rte to hash its relid may be less
than idiomatic. There is a function used elsewhere on the raw parse
tree to do something similar, but that tree has a parent pointer that
can be followed which is not available to me.

* I would have liked to have been able to have pg_stat_statements have
a configurable eviction criteria, so that queries with the lowest
total time executed could be evicted first, rather than the lowest
number of calls. I haven't done that here.

Thoughts?

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

Attachment

pgsql-hackers by date:

Previous
From: Andrew Dunstan
Date:
Subject: Re: proposal: copybytea command for psql
Next
From: Alvaro Herrera
Date:
Subject: Re: Command Triggers