Thread: New hook after raw parsing, before analyze
Hello Hackers,

I work on a foreign data wrapper for a legacy system. I generally find the hook system a very useful and flexible way to extend Postgres. The post_parse_analyze_hook almost fits what I need, but I have a few use cases where I would need to tap right into the parsed queries before any catalog-based validation is done. Please find the attached trivial patch for this new hook.

One of the use cases I have is this:

I have table-like data structures in the source system for the FDW I work on. These tables are sometimes too big, and the source system is able to filter and join them (with limitations), so it is not optimal to transfer the data to Postgres. At the same time I want the users to think in terms of the original tables.

The idea is to rewrite SQL queries like this:

"SELECT * FROM tableA a, tableB b WHERE a.id=b.id AND a.col1=1234 AND b.col2=987"

to:

"SELECT * FROM fdw_tableA_tableB ab WHERE ab.col1=1234 AND ab.col2=987"

This rewritten query would be handled by the FDW table that I previously added to the catalog.

The reason I want this new hook is that I don't want tableA and tableB to be in the catalog.

Looking forward to hearing your thoughts, opinions and comments.

Best regards, David
Attachment
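(The attached patch itself is not reproduced in the archive. Going by the description in this message and later in the thread -- a hook named post_raw_parse_hook, called from exec_simple_query() right after pg_parse_query() -- a minimal, hypothetical sketch of what such a patch might look like is:)

/* Hypothetical reconstruction of the proposed hook -- not the actual attachment. */

/* In a core header (exact location assumed), a hook over the raw parse trees: */
typedef List *(*post_raw_parse_hook_type) (const char *query_string,
                                           List *raw_parsetree_list);
extern PGDLLIMPORT post_raw_parse_hook_type post_raw_parse_hook;

/* In src/backend/tcop/postgres.c: */
post_raw_parse_hook_type post_raw_parse_hook = NULL;

/* ... and inside exec_simple_query(), right after raw parsing
 * (parsetree_list is the List * already declared there): */
parsetree_list = pg_parse_query(query_string);

/* New: give extensions a chance to rewrite the raw parse trees
 * before any catalog-based transformation/analysis happens. */
if (post_raw_parse_hook)
    parsetree_list = (*post_raw_parse_hook) (query_string, parsetree_list);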
See the discussion of the Custom-Scan API:
https://commitfest.postgresql.org/action/patch_view?id=1282

I believe my third patch is what you really want to do...

> This rewritten query would be handled by the FDW table that I previously added to the catalog.
>
> The reason I want this new hook is that I don't want tableA and tableB to be in the catalog.
>
I'd like to see why you want the pseudo table "fdw_tableA_tableB" to be in the catalog instead of "tableA" and "tableB". In addition, the parser will raise an error if the referenced columns (as part of "tableA" or "tableB") are not in the catalog, because of the name lookup failure.

Thanks,

2014-02-13 19:01 GMT+09:00 David Beck <dbeck@starschema.net>:
> Hello Hackers,
>
> I work on a foreign data wrapper for a legacy system. I generally find the hook system a very useful and flexible way to extend Postgres.
> The post_parse_analyze_hook almost fits what I need, but I have a few use cases where I would need to tap right into the parsed queries before any catalog-based validation is done.
> Please find the attached trivial patch for this new hook.
>
> One of the use cases I have is this:
>
> I have table-like data structures in the source system for the FDW I work on.
> These tables are sometimes too big, and the source system is able to filter and join them (with limitations), so it is not optimal to transfer the data to Postgres.
> At the same time I want the users to think in terms of the original tables.
>
> The idea is to rewrite SQL queries like this:
>
> "SELECT * FROM tableA a, tableB b WHERE a.id=b.id AND a.col1=1234 AND b.col2=987"
>
> to:
>
> "SELECT * FROM fdw_tableA_tableB ab WHERE ab.col1=1234 AND ab.col2=987"
>
> This rewritten query would be handled by the FDW table that I previously added to the catalog.
>
> The reason I want this new hook is that I don't want tableA and tableB to be in the catalog.
>
> Looking forward to hearing your thoughts, opinions and comments.
>
> Best regards, David

--
KaiGai Kohei <kaigai@kaigai.gr.jp>
Thanks for the link.

I want flexibility. Here is a situation: my hook knows the size of tableA and tableB on the legacy side. It should be able to decide whether to offload the join/filter onto the legacy side or not. At the same time it can start transferring the data to real Postgres tables in the background. When the data is here in the local DB, it can change the rewrite to use the local tables instead for new queries.

Answering your question: I don't know what you mean by pseudo table. The "fdw_tableA_tableB" is a real FDW table with a catalog entry and real catalog columns. tableA and tableB are not real tables in my setup. My understanding is that the raw_parse/transform/analyze triple first checks the catalog in the transform phase. That's why I want the hook to be after raw_parse but before transform, so the transform phase won't see tableA, tableB and their columns; thus I don't expect errors.

Thank you, David

Please find my answers inline.

On 13 Feb 2014, at 13:28, Kohei KaiGai <kaigai@kaigai.gr.jp> wrote:

> See the discussion of the Custom-Scan API:
> https://commitfest.postgresql.org/action/patch_view?id=1282
>
> I believe my third patch is what you really want to do...
>
>> This rewritten query would be handled by the FDW table that I previously added to the catalog.
>>
>> The reason I want this new hook is that I don't want tableA and tableB to be in the catalog.
>>
> I'd like to see why you want the pseudo table "fdw_tableA_tableB" to be in the catalog instead of "tableA" and "tableB". In addition, the parser will raise an error if the referenced columns (as part of "tableA" or "tableB") are not in the catalog, because of the name lookup failure.
>
> Thanks,
>
> 2014-02-13 19:01 GMT+09:00 David Beck <dbeck@starschema.net>:
>> Hello Hackers,
>>
>> I work on a foreign data wrapper for a legacy system. I generally find the hook system a very useful and flexible way to extend Postgres.
>> The post_parse_analyze_hook almost fits what I need, but I have a few use cases where I would need to tap right into the parsed queries before any catalog-based validation is done.
>> Please find the attached trivial patch for this new hook.
>>
>> One of the use cases I have is this:
>>
>> I have table-like data structures in the source system for the FDW I work on.
>> These tables are sometimes too big, and the source system is able to filter and join them (with limitations), so it is not optimal to transfer the data to Postgres.
>> At the same time I want the users to think in terms of the original tables.
>>
>> The idea is to rewrite SQL queries like this:
>>
>> "SELECT * FROM tableA a, tableB b WHERE a.id=b.id AND a.col1=1234 AND b.col2=987"
>>
>> to:
>>
>> "SELECT * FROM fdw_tableA_tableB ab WHERE ab.col1=1234 AND ab.col2=987"
>>
>> This rewritten query would be handled by the FDW table that I previously added to the catalog.
>>
>> The reason I want this new hook is that I don't want tableA and tableB to be in the catalog.
>>
>> Looking forward to hearing your thoughts, opinions and comments.
>>
>> Best regards, David
>
> --
> KaiGai Kohei <kaigai@kaigai.gr.jp>
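(To make the intended use concrete, here is a rough, hypothetical sketch of what an extension-side hook implementation along these lines could look like for the tableA/tableB example. The hook name, the helper functions and the simplified RangeVar handling are all illustrative assumptions, and a real version would also have to rewrite column references and the WHERE clause:)

#include "postgres.h"
#include "fmgr.h"
#include "nodes/makefuncs.h"
#include "nodes/parsenodes.h"

PG_MODULE_MAGIC;

/* These would come from the hypothetical patch's header. */
typedef List *(*post_raw_parse_hook_type) (const char *query_string,
                                           List *raw_parsetree_list);
extern post_raw_parse_hook_type post_raw_parse_hook;

/* Hypothetical helper: matches a FROM list of exactly "tableA, tableB". */
static bool
from_is_tableA_tableB(List *fromClause)
{
    RangeVar   *a;
    RangeVar   *b;

    if (list_length(fromClause) != 2)
        return false;
    if (!IsA(linitial(fromClause), RangeVar) ||
        !IsA(lsecond(fromClause), RangeVar))
        return false;
    a = (RangeVar *) linitial(fromClause);
    b = (RangeVar *) lsecond(fromClause);
    /* unquoted identifiers arrive already lower-cased from the lexer */
    return strcmp(a->relname, "tablea") == 0 &&
           strcmp(b->relname, "tableb") == 0;
}

/*
 * Hypothetical hook body: replace the two-table join with the combined
 * foreign table.  Column references and the a.id = b.id qual are left
 * untouched here, which a real implementation could not afford.
 */
static List *
rewrite_raw_parse(const char *query_string, List *raw_parsetree_list)
{
    ListCell   *lc;

    foreach(lc, raw_parsetree_list)
    {
        Node       *stmt = (Node *) lfirst(lc);

        if (IsA(stmt, SelectStmt) &&
            from_is_tableA_tableB(((SelectStmt *) stmt)->fromClause))
        {
            SelectStmt *select = (SelectStmt *) stmt;
            RangeVar   *combined = makeRangeVar(NULL,
                                                pstrdup("fdw_tableA_tableB"),
                                                -1);

            combined->alias = makeAlias("ab", NIL);
            select->fromClause = list_make1(combined);
        }
    }
    return raw_parsetree_list;
}

void
_PG_init(void)
{
    /* Install the rewrite into the (hypothetical) raw-parse hook. */
    post_raw_parse_hook = rewrite_raw_parse;
}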
David Beck <dbeck@starschema.net> writes:
> I have table-like data structures in the source system for the FDW I work on.
> These tables are sometimes too big, and the source system is able to filter and join them (with limitations), so it is not optimal to transfer the data to Postgres.
> At the same time I want the users to think in terms of the original tables.

> The idea is to rewrite SQL queries like this:

> "SELECT * FROM tableA a, tableB b WHERE a.id=b.id AND a.col1=1234 AND b.col2=987"

> to:

> "SELECT * FROM fdw_tableA_tableB ab WHERE ab.col1=1234 AND ab.col2=987"

TBH this sounds like a spectacularly bad idea, especially in the place and way you propose to do it. You can't even do catalog access safely where you've put that hook, not to mention that there are many other places where queries can be submitted. But more generally, an FDW should not operate in the way you're describing.

We do lack support for pushing joins to the foreign server, and that needs to be addressed; but we need to do it in the planner, not by kluging the query somewhere upstream of that.

			regards, tom lane
Thanks for the reply. There are two things where I think I've been misunderstood:

1. The point is to do the rewrite without, and before, catalog access.
2. I do want to push the join to the source and, equally important, to push the WHERE conditions there.

Best regards, David

On 13 Feb 2014, at 21:22, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> David Beck <dbeck@starschema.net> writes:
>> I have table-like data structures in the source system for the FDW I work on.
>> These tables are sometimes too big, and the source system is able to filter and join them (with limitations), so it is not optimal to transfer the data to Postgres.
>> At the same time I want the users to think in terms of the original tables.
>
>> The idea is to rewrite SQL queries like this:
>
>> "SELECT * FROM tableA a, tableB b WHERE a.id=b.id AND a.col1=1234 AND b.col2=987"
>
>> to:
>
>> "SELECT * FROM fdw_tableA_tableB ab WHERE ab.col1=1234 AND ab.col2=987"
>
> TBH this sounds like a spectacularly bad idea, especially in the place and way you propose to do it. You can't even do catalog access safely where you've put that hook, not to mention that there are many other places where queries can be submitted. But more generally, an FDW should not operate in the way you're describing.
>
> We do lack support for pushing joins to the foreign server, and that needs to be addressed; but we need to do it in the planner, not by kluging the query somewhere upstream of that.
>
> 			regards, tom lane
Let me rephrase this. Let's remove my motivations and use cases from this conversation...

Why is it a bad idea to rewrite the query before it reaches transform/analyze (without ever accessing the catalog)?

If that flexibility is acceptable to you, where would be the best place to put it?

Thanks, David

On 14 Feb 2014, at 10:30, David Beck <dbeck@starschema.net> wrote:

> Thanks for the reply. There are two things where I think I've been misunderstood:
>
> 1. The point is to do the rewrite without, and before, catalog access.
> 2. I do want to push the join to the source and, equally important, to push the WHERE conditions there.
>
> Best regards, David
>
> On 13 Feb 2014, at 21:22, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
>> David Beck <dbeck@starschema.net> writes:
>>> I have table-like data structures in the source system for the FDW I work on.
>>> These tables are sometimes too big, and the source system is able to filter and join them (with limitations), so it is not optimal to transfer the data to Postgres.
>>> At the same time I want the users to think in terms of the original tables.
>>
>>> The idea is to rewrite SQL queries like this:
>>
>>> "SELECT * FROM tableA a, tableB b WHERE a.id=b.id AND a.col1=1234 AND b.col2=987"
>>
>>> to:
>>
>>> "SELECT * FROM fdw_tableA_tableB ab WHERE ab.col1=1234 AND ab.col2=987"
>>
>> TBH this sounds like a spectacularly bad idea, especially in the place and way you propose to do it. You can't even do catalog access safely where you've put that hook, not to mention that there are many other places where queries can be submitted. But more generally, an FDW should not operate in the way you're describing.
>>
>> We do lack support for pushing joins to the foreign server, and that needs to be addressed; but we need to do it in the planner, not by kluging the query somewhere upstream of that.
>>
>> 			regards, tom lane
On Fri, Feb 14, 2014 at 2:28 PM, David Beck <dbeck@starschema.net> wrote:
> Why is it a bad idea to rewrite the query before it reaches transform/analyze (without ever accessing the catalog)?
>
> If that flexibility is acceptable to you, where would be the best place to put it?

Well, if there are two foreign tables and the planner could push the join work down to the FDW, then the planner should be able to accurately represent that plan and cost it without the user having to create any catalog structures. That's what the planner does for every other type of plan node.

What you're describing would still be useful for materialized views. In that case the user is creating the materialized view and it is a real thing in the catalogs that won't disappear out from under the planner. Even then it would be ideal if the planner could decide to use the materialized view late enough that it can actually determine whether it's superior, rather than rewriting the query before it gets to that point. That would also be much more flexible for users, who might not write the query in a way that exactly matches the materialized view.

--
greg
I think I'm going to need to dig into the planner to fully understand your points. Thank you for the insights. I was more into putting the knowledge of the legacy system into an extension and my codebase; now I see that making better use of the planner would help. Thank you.

What inspired me is the scriptable query rewrite in http://dev.mysql.com/downloads/mysql-proxy/
The hook I proposed would be a lot nicer in Postgres, because the raw parsing is already done at that point, while in mysql-proxy it has to be done manually.

Another point I liked in MySQL is the possibility to write information schema plugins: http://dev.mysql.com/doc/refman/5.1/en/writing-information-schema-plugins.html
Like a virtual catalog. Is there anything similar in Postgres?

Thank you, David

On 14 Feb 2014, at 18:06, Greg Stark <stark@mit.edu> wrote:

> On Fri, Feb 14, 2014 at 2:28 PM, David Beck <dbeck@starschema.net> wrote:
>> Why is it a bad idea to rewrite the query before it reaches transform/analyze (without ever accessing the catalog)?
>>
>> If that flexibility is acceptable to you, where would be the best place to put it?
>
> Well, if there are two foreign tables and the planner could push the join work down to the FDW, then the planner should be able to accurately represent that plan and cost it without the user having to create any catalog structures. That's what the planner does for every other type of plan node.
>
> What you're describing would still be useful for materialized views. In that case the user is creating the materialized view and it is a real thing in the catalogs that won't disappear out from under the planner. Even then it would be ideal if the planner could decide to use the materialized view late enough that it can actually determine whether it's superior, rather than rewriting the query before it gets to that point. That would also be much more flexible for users, who might not write the query in a way that exactly matches the materialized view.
>
> --
> greg
On Fri, Feb 14, 2014 at 9:16 PM, David Beck <dbeck@starschema.net> wrote:
> Another point I liked in MySQL is the possibility to write information schema plugins: http://dev.mysql.com/doc/refman/5.1/en/writing-information-schema-plugins.html
> Like a virtual catalog. Is there anything similar in Postgres?

The documentation you linked to describes how to provide information_schema plugins, but not why you would want to do such a thing. I'm not seeing why this would be useful. The information_schema schema is described by the standard, so creating new views in it isn't needed very often, and the schema for the existing views doesn't change very often. I can see why a plugin might want to add rows to the views, but that doesn't seem to be what this feature is about.

--
greg
Greg Stark <stark@mit.edu> writes:
> On Fri, Feb 14, 2014 at 9:16 PM, David Beck <dbeck@starschema.net> wrote:
>> Another point I liked in MySQL is the possibility to write information schema plugins: http://dev.mysql.com/doc/refman/5.1/en/writing-information-schema-plugins.html
>> Like a virtual catalog. Is there anything similar in Postgres?

> The documentation you linked to describes how to provide information_schema plugins, but not why you would want to do such a thing. I'm not seeing why this would be useful. The information_schema schema is described by the standard, so creating new views in it isn't needed very often, and the schema for the existing views doesn't change very often.

IIRC, mysql has a liberal view about what they can do to the information_schema, so for them this isn't as insane as it sounds to us.

			regards, tom lane
On Fri, Feb 14, 2014 at 9:16 PM, David Beck <dbeck@starschema.net> wrote:
> What inspired me is the scriptable query rewrite in http://dev.mysql.com/downloads/mysql-proxy/
> The hook I proposed would be a lot nicer in Postgres, because the raw parsing is already done at that point, while in mysql-proxy it has to be done manually.

There are a few similar things in the Postgres world, such as pgbouncer and plproxy, but as you note they are limited in how powerful they can be by the complexity of parsing and analyzing SQL.

I think Postgres tends to take a different tack on extensibility than MySQL. Rather than having one system and then hooks to allow external code to modify or replace it, in Postgres modules are usually treated similarly to the internal code. This is a pretty broad generalization and there are certainly exceptions, but, for example, new data types, functions, even whole new index types are all treated almost identically to the internal data types, functions, and index types. The planner then considers them all on a more or less equal basis.

Obviously there are limits. Btree indexes are kind of special because they represent Postgres's basic concept of ordering. We don't have a pluggable recovery system, which makes any externally provided index type non-crash-safe. And we don't have a generalized concept of table storage -- the closest thing we have is FDWs, which is more like MySQL's style of extensibility, where the extension is a special case.

But this is why our instinct is that if you want to be able to push down joins, the way to do it is to extend the FDW API so that the planner can make those decisions just like it makes the decisions about internal joins, rather than have an external module that takes over part of the planner's work.

--
greg
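(For context, this is roughly how an FDW already participates in planning on a per-table basis: its GetForeignPaths callback, installed via the FdwRoutine returned by the FDW's handler, offers the planner candidate paths with the FDW's own cost estimates, and the planner weighs them against every other path. A minimal sketch against the 9.3/9.4-era API -- the exact create_foreignscan_path signature varies between releases, and the cost numbers here are placeholders:)

#include "postgres.h"
#include "foreign/fdwapi.h"
#include "optimizer/cost.h"
#include "optimizer/pathnode.h"

/*
 * Sketch of the existing per-table case: offer one candidate scan path
 * with FDW-estimated costs.  Extending this so an FDW can offer paths
 * for whole joins is the planner-level approach being argued for here.
 */
static void
legacyGetForeignPaths(PlannerInfo *root, RelOptInfo *baserel, Oid foreigntableid)
{
    Cost        startup_cost = 10.0;            /* placeholder estimate */
    Cost        total_cost = startup_cost +
        baserel->rows * cpu_tuple_cost;         /* placeholder estimate */

    add_path(baserel, (Path *)
             create_foreignscan_path(root, baserel,
                                     baserel->rows,
                                     startup_cost,
                                     total_cost,
                                     NIL,       /* no pathkeys */
                                     NULL,      /* no required outer rels */
                                     NIL));     /* no per-path private data */
}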
On 15 Feb 2014, at 0:46, Greg Stark <stark@mit.edu> wrote:

> On Fri, Feb 14, 2014 at 9:16 PM, David Beck <dbeck@starschema.net> wrote:
>> Another point I liked in MySQL is the possibility to write information schema plugins: http://dev.mysql.com/doc/refman/5.1/en/writing-information-schema-plugins.html
>> Like a virtual catalog. Is there anything similar in Postgres?
>
> The documentation you linked to describes how to provide information_schema plugins, but not why you would want to do such a thing. I'm not seeing why this would be useful. The information_schema schema is described by the standard, so creating new views in it isn't needed very often, and the schema for the existing views doesn't change very often. I can see why a plugin might want to add rows to the views, but that doesn't seem to be what this feature is about.

Another reason I was thinking about a dynamic catalog and/or query rewrite is that the project I work on is a data integration platform. Right now it is in the feasibility study phase, and Postgres plus an extension looks to be the strongest option.

The legacy system we want to interface with has over 150k table-like objects. Our platform's task is to provide a relational view on top of them.

I know that it is unlikely the users will use all 150k tables. I would expect maybe 10-100 to be used in practice, but I didn't want to figure out which 100, nor did I want to create all 150k catalog entries in advance.

I was also dreaming about the possibility of transferring the small enough objects to Postgres tables in the background, sparing the communication with the legacy system and letting Postgres do the joins on these.

The solution I was thinking about is this:

- when a query arrives, a smart rewrite would know 1) what tables are local, 2) what tables need new catalog entries, and 3) what can be joined on the other side
- the rewriter would potentially add SQL statements at the beginning of the query for creating the missing FDW catalog entries, if needed
- the FDW would be handled by the same extension, so they can easily talk to each other about the status of the objects; the rewriter would then know whether the background transfer of a small table is completed and do the rewrite accordingly

I know these are pretty far from the functionality and traditional operation of an RDBMS... but if you look at FDW examples like doing a SELECT on a Google IMAP mailbox, it is not that far from Postgres.

Best regards, David
On Sat, Feb 15, 2014 at 2:06 PM, David Beck <dbeck@starschema.net> wrote:
> - when a query arrives, a smart rewrite would know 1) what tables are local, 2) what tables need new catalog entries, and 3) what can be joined on the other side
> - the rewriter would potentially add SQL statements at the beginning of the query for creating the missing FDW catalog entries, if needed
> - the FDW would be handled by the same extension, so they can easily talk to each other about the status of the objects; the rewriter would then know whether the background transfer of a small table is completed and do the rewrite accordingly

I think you have a pretty big impedance mismatch with Postgres, which assumes the schema of the database is fairly static and known when parsing begins. To do what you describe you pretty much need to write your own SQL parser.

There is a hook, "post_parse_analyze_hook", but I think it comes too late, as it runs after the analyze step, which is when Postgres looks up the schema information for every relation mentioned in the query. What you would need is a post_parse_hook which would work on the raw parse tree before the analyze step. That doesn't seem terribly controversial to add, though there may be some technical details. The API would of course be completely unstable from major release to major release -- the parse tree gets knocked around quite a bit.

But that only gets you so far. I think you would be able to get lazy creation of schema objects fairly straightforwardly, i.e. any legacy objects referenced in a query get corresponding FDW schema objects created before analysis is done -- though I would expect a few gotchas, possibly deadlocks, with concurrent queries creating the same objects. But I don't think that gets you the joins, which I think would be quite a battle.

And I have to wonder if you aren't going the long way around to do something that can be done more simply some other way. If you have 150k objects, I wonder if your objects aren't all very similar and could be handled by a single Postgres schema object -- either a single FDW object or a simple function. If they really are different schema objects, perhaps a single function that returns a more flexible data type like an hstore blob? That has obvious disadvantages compared to an object the planner understands better, but at least you would be in a well-supported, stable API (well, for interesting definitions of "stable", given that 9.4 will have a whole new version of hstore).

As a side note, you should evaluate carefully what lazily creating objects will buy you. Perhaps just creating 150k objects would be cheaper than maintaining this code. In particular, since the user *might* access all 150k, you still have to worry about the worst case anyway, and it might be cheaper to just engineer for it in the first place.

--
greg
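(To make the lazy-creation idea slightly more concrete: the detection half -- finding which relations a raw query references before analysis -- could look something like the hypothetical sketch below, which walks a SelectStmt's FROM list. The actual CREATE FOREIGN TABLE step is deliberately left out, since that is exactly where the locking and concurrent-creation gotchas mentioned above live; the function names are made up for illustration:)

#include "postgres.h"
#include "nodes/parsenodes.h"
#include "nodes/value.h"

/*
 * Hypothetical sketch: collect the relation names a raw SELECT references,
 * so that missing legacy objects could be created as foreign tables before
 * the analyze step looks them up.  Only plain FROM items and JOINs are
 * handled; subqueries, CTEs and function items are ignored for brevity.
 */
static void
collect_relation_names(Node *item, List **names)
{
    if (item == NULL)
        return;

    if (IsA(item, RangeVar))
        *names = lappend(*names,
                         makeString(pstrdup(((RangeVar *) item)->relname)));
    else if (IsA(item, JoinExpr))
    {
        collect_relation_names(((JoinExpr *) item)->larg, names);
        collect_relation_names(((JoinExpr *) item)->rarg, names);
    }
}

/* Called on each raw SelectStmt from a raw-parse hook. */
static List *
referenced_relations(SelectStmt *select)
{
    List       *names = NIL;
    ListCell   *lc;

    foreach(lc, select->fromClause)
        collect_relation_names((Node *) lfirst(lc), &names);

    return names;
}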
> There is a hook, "post_parse_analyze_hook", but I think it comes too late, as it runs after the analyze step, which is when Postgres looks up the schema information for every relation mentioned in the query. What you would need is a post_parse_hook which would work on the raw parse tree before the analyze step. That doesn't seem terribly controversial to add, though there may be some technical details. The API would of course be completely unstable from major release to major release -- the parse tree gets knocked around quite a bit.

Good to know that the parse tree is not stable. My original proposal was exactly the one you describe. I called it post_raw_parse_hook and wanted to call it from exec_simple_query(), right after pg_parse_query(). Maybe not the best place or way to achieve this? It was a few-liner patch; I wonder if it ever has a chance to get into Postgres.

> And I have to wonder if you aren't going the long way around to do something that can be done more simply some other way. If you have 150k objects, I wonder if your objects aren't all very similar and could be handled by a single Postgres schema object -- either a single FDW object or a simple function.

I need to do a bit more research on that side. I was hoping to spend less time on understanding that legacy system...

> As a side note, you should evaluate carefully what lazily creating objects will buy you. Perhaps just creating 150k objects would be cheaper than maintaining this code. In particular, since the user *might* access all 150k, you still have to worry about the worst case anyway, and it might be cheaper to just engineer for it in the first place.

I have a similar problem with the legacy system: the set of 150k objects may change between releases. My gut feeling is that it is easier to understand and keep track of an open source project like Postgres.

Best regards, and thanks for all the insights and ideas, David