Thread: pg_stat_statements and non default search_path
Hello, I've faced multiple times a painful limitation with pg_stat_statements. Queries are normalized based on the textual SQL statements, and the queryid is computed using objects' oids. So for instance with different search_path settings, we can end up with the same normalized query but different queryids, and there's no way to know what are the actual relations each query is using. It also means that it can be almost impossible to use the normalized query to replay a query. I see two ways of fixing this. First one would be to store normalized queries while fully qualifiying the relation's names. After a quick look this can't be done without storing at least a token location in RangeTblEntry nodes, which sounds like a bad idea. The other way would be to store the value of search_path with each pgss entry (either in shared_memory or in the external file). Is it something that we should fix, and if yes is any of this acceptable, or does someone see another way to solve this problem? Regards. -- Julien Rouhaud http://dalibo.com - http://dalibo.org
On 10/7/16 4:39 AM, Julien Rouhaud wrote: > I see two ways of fixing this. First one would be to store normalized > queries while fully qualifiying the relation's names. After a quick look > this can't be done without storing at least a token location in > RangeTblEntry nodes, which sounds like a bad idea. > > The other way would be to store the value of search_path with each pgss > entry (either in shared_memory or in the external file). > > Is it something that we should fix, and if yes is any of this > acceptable, or does someone see another way to solve this problem? Would another option be to temporarily set search_path to '' when getting the query text? ISTM that would be the best option. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532) mobile: 512-569-9461
On 10/10/2016 05:00, Jim Nasby wrote: > On 10/7/16 4:39 AM, Julien Rouhaud wrote: >> I see two ways of fixing this. First one would be to store normalized >> queries while fully qualifiying the relation's names. After a quick look >> this can't be done without storing at least a token location in >> RangeTblEntry nodes, which sounds like a bad idea. >> >> The other way would be to store the value of search_path with each pgss >> entry (either in shared_memory or in the external file). >> >> Is it something that we should fix, and if yes is any of this >> acceptable, or does someone see another way to solve this problem? > > Would another option be to temporarily set search_path to '' when > getting the query text? ISTM that would be the best option. I don't think that possible. The query text used is raw query text provided by the post_parse_analyse hook (ParseState->p_sourcetext). Unless you mean deparsing the Query instead of using raw source text? I think that would solve this issue (and also the other issue when multiple queries are submitted at once, you get the normalized version of all the queries multiple time), but AFAIK ruleutils.c doesn't expose enough to do it (like get_query_def()), and exposing it isn't an option. -- Julien Rouhaud http://dalibo.com - http://dalibo.org
On 10/10/16 12:58 AM, Julien Rouhaud wrote: >> > Would another option be to temporarily set search_path to '' when >> > getting the query text? ISTM that would be the best option. > I don't think that possible. The query text used is raw query text > provided by the post_parse_analyse hook (ParseState->p_sourcetext). Oh, hrm. :/ > Unless you mean deparsing the Query instead of using raw source text? I > think that would solve this issue (and also the other issue when > multiple queries are submitted at once, you get the normalized version > of all the queries multiple time), but AFAIK ruleutils.c doesn't expose > enough to do it (like get_query_def()), and exposing it isn't an option. Why couldn't we expose it? BTW, after thinking about it some more, I don't see how storing the search_path would help at all... it's not like you can do anything with it unless you have a huge chunk of the parser available. BTW, this issue isn't limited to just tables; it affects almost every object identifier you can put in a query, like functions, operators, types, etc. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532) mobile: 512-569-9461
On 16/10/2016 02:38, Jim Nasby wrote: > On 10/10/16 12:58 AM, Julien Rouhaud wrote: >> Unless you mean deparsing the Query instead of using raw source text? I >> think that would solve this issue (and also the other issue when >> multiple queries are submitted at once, you get the normalized version >> of all the queries multiple time), but AFAIK ruleutils.c doesn't expose >> enough to do it (like get_query_def()), and exposing it isn't an option. > > Why couldn't we expose it? > I'm not really sure. The only thread I could find on this topic didn't get any answer: https://www.postgresql.org/message-id/flat/5E5BB0F8-B05E-47C2-8706-F85E70A6DF7D@citusdata.com > BTW, after thinking about it some more, I don't see how storing the > search_path would help at all... it's not like you can do anything with > it unless you have a huge chunk of the parser available. > My use case is not really to know the fully qualified name of each identifier, but being able to optimize a problematic query found with pg_stat_statements. I can already "unjumble" it automatically, but the search_path is needed to be able to get an explain or just execute the query. Of course, I'd prefer that pgss can generate a fully qualified normalized query instead of storing the search_path. > BTW, this issue isn't limited to just tables; it affects almost every > object identifier you can put in a query, like functions, operators, > types, etc. Yes, indeed. -- Julien Rouhaud http://dalibo.com - http://dalibo.org
On Sat, Oct 15, 2016 at 11:29 PM, Julien Rouhaud <julien.rouhaud@dalibo.com> wrote:
> BTW, after thinking about it some more, I don't see how storing the
> search_path would help at all... it's not like you can do anything with
> it unless you have a huge chunk of the parser available.
>
My use case is not really to know the fully qualified name of each
identifier, but being able to optimize a problematic query found with
pg_stat_statements. I can already "unjumble" it automatically, but the
search_path is needed to be able to get an explain or just execute the
query.
I'd also find having the search_path available to be a helpful benefit - I've run into the same problem as Julien where two identical queries couldn't be identified correctly because of the missing search_path.
In particular this situation might happen if you shard different tenants using schemas (one schema for each tenant), and use the search_path to set the current tenant.
In my own setup I combine pg_stat_statements with a slightly hacked up copy of the Postgres parser, so having the correct search_path + query text available is enough to find the objects a query references (most of the time, there are a few other edge cases).
My assumption thus far has been that adding another field like this might not be considered because of performance considerations.
Can somebody chime in if it would be feasible to store this in the out-of-band query text file, and whether a patch for this would be considered acceptable?
Best,
Lukas
Lukas Fittl
Skype: lfittl
Phone: +1 415 321 0630
Skype: lfittl
Phone: +1 415 321 0630
On 16/10/2016 10:47, Lukas Fittl wrote: > Can somebody chime in if it would be feasible to store this in the > out-of-band query text file, and whether a patch for this would be > considered acceptable? FWIW I already have a quick and dirty patch for this, I don't think there's any major difficulty here. I still hope we can find a way to store the fully qualified objects name in the normalized query instead. -- Julien Rouhaud http://dalibo.com - http://dalibo.org
<p dir="ltr"><p dir="ltr">On 16 Oct. 2016 14:31, "Julien Rouhaud" <<a href="mailto:julien.rouhaud@dalibo.com">julien.rouhaud@dalibo.com</a>>wrote:<br /> ><br /> > On 16/10/2016 02:38,Jim Nasby wrote:<br /> > > On 10/10/16 12:58 AM, Julien Rouhaud wrote:<br /> > >> Unless you mean deparsingthe Query instead of using raw source text? I<br /> > >> think that would solve this issue (and also theother issue when<br /> > >> multiple queries are submitted at once, you get the normalized version<br /> >>> of all the queries multiple time), but AFAIK ruleutils.c doesn't expose<br /> > >> enough to do it(like get_query_def()), and exposing it isn't an option.<br /> > ><br /> > > Why couldn't we expose it?<p dir="ltr">I'minterested in that too, for the purpose of passing the correct substring of a multi-statement to ProcessUtility_hook.Perhaps by using parser position data to generate a start pointer and length within the querystring alreadypassed.<p dir="ltr">For the problem in this thread one could always implement plan-to-query transforms (deparse).But you'd get back something pretty far from the input query after function inlining, subquery flattering, conditionpushdown/pullup, join elimination, etc etc.<p dir="ltr">I think pg_stat_plans attempted a middle ground here, capturingexplain style plans rather than trying to report the SQL.<p dir="ltr">I do think there is utility to storing search_pathand optionally specialising stats according to it. The most obvious case is multi-tenant and schema-sharded modelswhere per-schema stats would be very handy. You'd want to use the active search path though so you eliminate referencesto nonexistent schemas. Otherwise $user would screw things up.
On 16/10/2016 11:21, Craig Ringer wrote: > On 16 Oct. 2016 14:31, Julien Rouhaud wrote: >> >> On 16/10/2016 02:38, Jim Nasby wrote: >> > On 10/10/16 12:58 AM, Julien Rouhaud wrote: >> >> Unless you mean deparsing the Query instead of using raw source > text? I >> >> think that would solve this issue (and also the other issue when >> >> multiple queries are submitted at once, you get the normalized version >> >> of all the queries multiple time), but AFAIK ruleutils.c doesn't expose >> >> enough to do it (like get_query_def()), and exposing it isn't an > option. >> > >> > Why couldn't we expose it? > > I'm interested in that too, for the purpose of passing the correct > substring of a multi-statement to ProcessUtility_hook. I have another use case for this: being able to easily print what is the query (or queries) that are actually executed (ie. what's get out of the rewriter). That's useful when trying to figure out / optimize nested views hell or abuse of rules. -- Julien Rouhaud http://dalibo.com - http://dalibo.org