Thread: New hook after raw parsing, before analyze

New hook after raw parsing, before analyze

From: David Beck
Hello Hackers,

I work on a foreign data wrapper for a legacy system. I generally find the hook system a very useful and flexible way to extend Postgres.
The post parse analyze hook almost fits what I need, but I have a few use cases where I would need to tap right into the parsed queries, before any catalog-based validation is done.
Please find the attached trivial patch for this new hook.
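
For illustration, the shape of the hook: I called it post_raw_parse_hook and it is invoked from exec_simple_query(), right after pg_parse_query(). The signature below is only a sketch, not the attached patch itself:

  /* Sketch only -- an illustrative signature, not the attached patch. */
  #include "postgres.h"
  #include "nodes/pg_list.h"

  /* takes and returns the list of raw parse trees, before any analysis */
  typedef List *(*post_raw_parse_hook_type) (List *raw_parsetree_list,
                                             const char *query_string);

  extern PGDLLIMPORT post_raw_parse_hook_type post_raw_parse_hook;

  /*
   * In exec_simple_query(), right after pg_parse_query():
   *
   *     parsetree_list = pg_parse_query(query_string);
   *     if (post_raw_parse_hook)
   *         parsetree_list = (*post_raw_parse_hook) (parsetree_list,
   *                                                  query_string);
   */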

One of the use cases I have is this:

I have table-like data structures in the source system for the FDW I work on.
These tables are sometimes too big, and the source system can filter and join them (with limitations), so it is not optimal to transfer the data to Postgres.
At the same time I want the users to think in terms of the original tables.

The idea is to rewrite the SQL queries like this:

  “SELECT * FROM tableA a, tableB b WHERE a.id=b.id AND a.col1=1234 AND b.col2=987”

to:

  “SELECT * FROM fdw_tableA_tableB ab WHERE ab.col1=1234 AND ab.col2=987”


This rewritten query would be handled by the FDW table that I previously added to the catalog.

The reason I want this new hook is that I don’t want tableA and tableB to be in the catalog.

Looking forward to hearing your thoughts, opinions, and comments.

Best regards, David





Re: New hook after raw parsing, before analyze

From: Kohei KaiGai
See the discussion of the Custom-Scan API:
https://commitfest.postgresql.org/action/patch_view?id=1282

I believe my third patch is what you really want to do...

> This rewritten query would be handled by the FDW table that I previously added to the catalog.
>
> The reason I want this new hook is that I don't want tableA and tableB to be in the catalog.
>
I'd like to understand why you want the pseudo table "fdw_tableA_tableB" to be in the catalog instead of "tableA" and "tableB". In addition, the parser will raise an error if the referenced columns (as part of "tableA" or "tableB") are not in the catalog, because name lookup fails.

Thanks,


2014-02-13 19:01 GMT+09:00 David Beck <dbeck@starschema.net>:
> Hello Hackers,
>
> I work on a foreign data wrapper for a legacy system. I generally find the hook system a very useful and flexible way to extend Postgres.
> The post parse analyze hook almost fits what I need, but I have a few use cases where I would need to tap right into the parsed queries, before any catalog-based validation is done.
> Please find the attached trivial patch for this new hook.
>
> One of the use cases I have is this:
>
> I have table-like data structures in the source system for the FDW I work on.
> These tables are sometimes too big, and the source system can filter and join them (with limitations), so it is not optimal to transfer the data to Postgres.
> At the same time I want the users to think in terms of the original tables.
>
> The idea is to rewrite the SQL queries like this:
>
>   "SELECT * FROM tableA a, tableB b WHERE a.id=b.id AND a.col1=1234 AND b.col2=987"
>
> to:
>
>   "SELECT * FROM fdw_tableA_tableB ab WHERE ab.col1=1234 AND ab.col2=987"
>
>
> This rewritten query would be handled by the FDW table that I previously added to the catalog.
>
> The reason I want this new hook is that I don't want tableA and tableB to be in the catalog.
>
> Looking forward to hearing your thoughts, opinions, and comments.
>
> Best regards, David



-- 
KaiGai Kohei <kaigai@kaigai.gr.jp>



Re: New hook after raw parsing, before analyze

From: David Beck
Thanks for the link.

I want flexibility. Here is a situation: my hook knows the size of tableA and tableB on the legacy side. It should be able to decide whether to offload the join/filter onto the legacy side or not. At the same time it can start transferring the data to real Postgres tables in the background. Once the data is in the local DB, it can change the rewrite to use the local tables for new queries.

Answering your question:

I don’t know what you mean by pseudo table. The “fdw_tableA_tableB” is a real FDW table with a catalog entry and real catalog columns. tableA and tableB are non-real tables in my setup. My understanding is that the raw_parse-transform-analyze triple first touches the catalog in the transform phase. That’s why I want the hook after raw_parse but before transform, so the transform phase won’t see tableA, tableB, or their columns, and thus I don’t expect errors.
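
To make the "before any catalog access" point concrete: a hook at that spot only sees raw parse nodes. Here is a minimal sketch of what my extension could do there (is_legacy_table() and its naming rule are hypothetical placeholders):

  /* Sketch: inspect a raw parse tree with no catalog access at all.
   * A raw SELECT arrives as a SelectStmt; plain table references in its
   * FROM list are RangeVar nodes whose relname is just the parsed
   * identifier -- nothing has been validated against the catalog yet.
   */
  #include "postgres.h"
  #include "nodes/parsenodes.h"

  /* hypothetical helper: consults extension state, not the catalog */
  static bool
  is_legacy_table(const char *relname)
  {
      return strncmp(relname, "table", 5) == 0;   /* placeholder rule */
  }

  static bool
  references_legacy_table(Node *raw_stmt)
  {
      if (IsA(raw_stmt, SelectStmt))
      {
          SelectStmt *stmt = (SelectStmt *) raw_stmt;
          ListCell   *lc;

          foreach(lc, stmt->fromClause)
          {
              Node *n = (Node *) lfirst(lc);

              if (IsA(n, RangeVar) &&
                  is_legacy_table(((RangeVar *) n)->relname))
                  return true;
          }
      }
      return false;
  }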

Thank you, David


Please find my answers inline.

On 2014.02.13, at 13:28, Kohei KaiGai <kaigai@kaigai.gr.jp> wrote:

> See the discussion of the Custom-Scan API:
> https://commitfest.postgresql.org/action/patch_view?id=1282
>
> I believe my third patch is what you really want to do...
>
>> This rewritten query would be handled by the FDW table that I previously added to the catalog.
>>
>> The reason I want this new hook is that I don't want tableA and tableB to be in the catalog.
>>
> I'd like to understand why you want the pseudo table "fdw_tableA_tableB" to be in the catalog instead of "tableA" and "tableB". In addition, the parser will raise an error if the referenced columns (as part of "tableA" or "tableB") are not in the catalog, because name lookup fails.
>
> Thanks,
>
>
> 2014-02-13 19:01 GMT+09:00 David Beck <dbeck@starschema.net>:
>> Hello Hackers,
>>
>> I work on a foreign data wrapper for a legacy system. I generally find the hook system a very useful and flexible way to extend Postgres.
>> The post parse analyze hook almost fits what I need, but I have a few use cases where I would need to tap right into the parsed queries, before any catalog-based validation is done.
>> Please find the attached trivial patch for this new hook.
>>
>> One of the use cases I have is this:
>>
>> I have table-like data structures in the source system for the FDW I work on.
>> These tables are sometimes too big, and the source system can filter and join them (with limitations), so it is not optimal to transfer the data to Postgres.
>> At the same time I want the users to think in terms of the original tables.
>>
>> The idea is to rewrite the SQL queries like this:
>>
>>  "SELECT * FROM tableA a, tableB b WHERE a.id=b.id AND a.col1=1234 AND b.col2=987"
>>
>> to:
>>
>>  "SELECT * FROM fdw_tableA_tableB ab WHERE ab.col1=1234 AND ab.col2=987"
>>
>>
>> This rewritten query would be handled by the FDW table that I previously added to the catalog.
>>
>> The reason I want this new hook is that I don't want tableA and tableB to be in the catalog.
>>
>> Looking forward to hearing your thoughts, opinions, and comments.
>>
>> Best regards, David
>
>
>
> --
> KaiGai Kohei <kaigai@kaigai.gr.jp>




Re: New hook after raw parsing, before analyze

From: Tom Lane
David Beck <dbeck@starschema.net> writes:
> I have table-like data structures in the source system for the FDW I work on.
> These tables are sometimes too big, and the source system can filter and join them (with limitations), so it is not optimal to transfer the data to Postgres.
> At the same time I want the users to think in terms of the original tables.

> The idea is to rewrite the SQL queries like this:

>   “SELECT * FROM tableA a, tableB b WHERE a.id=b.id AND a.col1=1234 AND b.col2=987”

> to:

>   “SELECT * FROM fdw_tableA_tableB ab WHERE ab.col1=1234 AND ab.col2=987”

TBH this sounds like a spectacularly bad idea, especially in the place and
way you propose to do it.  You can't even do catalog access safely where
you've put that hook, not to mention that there are many other places
where queries can be submitted.  But more generally, an FDW should not
operate in the way you're describing.

We do lack support for pushing joins to the foreign server, and that needs
to be addressed; but we need to do it in the planner, not by kluging the
query somewhere upstream of that.
        regards, tom lane



Re: New hook after raw parsing, before analyze

From: David Beck
Thanks for the reply. There are two points on which I think I’ve been misunderstood:

1. The point is to do the rewrite without, and before, any catalog access.
2. I do want to push the join to the source, and, equally important, to push the WHERE conditions there.

Best regards, David


On 2014.02.13, at 21:22, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> David Beck <dbeck@starschema.net> writes:
>> I have table-like data structures in the source system for the FDW I work on.
>> These tables are sometimes too big, and the source system can filter and join them (with limitations), so it is not optimal to transfer the data to Postgres.
>> At the same time I want the users to think in terms of the original tables.
>
>> The idea is to rewrite the SQL queries like this:
>
>>  “SELECT * FROM tableA a, tableB b WHERE a.id=b.id AND a.col1=1234 AND b.col2=987”
>
>> to:
>
>>  “SELECT * FROM fdw_tableA_tableB ab WHERE ab.col1=1234 AND ab.col2=987”
>
> TBH this sounds like a spectacularly bad idea, especially in the place and
> way you propose to do it.  You can't even do catalog access safely where
> you've put that hook, not to mention that there are many other places
> where queries can be submitted.  But more generally, an FDW should not
> operate in the way you're describing.
>
> We do lack support for pushing joins to the foreign server, and that needs
> to be addressed; but we need to do it in the planner, not by kluging the
> query somewhere upstream of that.
>
>             regards, tom lane




Re: New hook after raw parsing, before analyze

From: David Beck
Let me rephrase this:

Let’s remove my motivations and use cases from this conversation….

Why is it a bad idea to rewrite the query before it reaches transform/analyze (without ever accessing the catalog)?

If that flexibility is acceptable to you, where would be the best place to put it?

Thanks, David


On 2014.02.14, at 10:30, David Beck <dbeck@starschema.net> wrote:

> Thanks for the reply. There are two points on which I think I’ve been misunderstood:
>
> 1. The point is to do the rewrite without, and before, any catalog access.
> 2. I do want to push the join to the source, and, equally important, to push the WHERE conditions there.
>
> Best regards, David
>
>
> On 2014.02.13, at 21:22, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
>> David Beck <dbeck@starschema.net> writes:
>>> I have table-like data structures in the source system for the FDW I work on.
>>> These tables are sometimes too big, and the source system can filter and join them (with limitations), so it is not optimal to transfer the data to Postgres.
>>> At the same time I want the users to think in terms of the original tables.
>>
>>> The idea is to rewrite the SQL queries like this:
>>
>>> “SELECT * FROM tableA a, tableB b WHERE a.id=b.id AND a.col1=1234 AND b.col2=987”
>>
>>> to:
>>
>>> “SELECT * FROM fdw_tableA_tableB ab WHERE ab.col1=1234 AND ab.col2=987”
>>
>> TBH this sounds like a spectacularly bad idea, especially in the place and
>> way you propose to do it.  You can't even do catalog access safely where
>> you've put that hook, not to mention that there are many other places
>> where queries can be submitted.  But more generally, an FDW should not
>> operate in the way you're describing.
>>
>> We do lack support for pushing joins to the foreign server, and that needs
>> to be addressed; but we need to do it in the planner, not by kluging the
>> query somewhere upstream of that.
>>
>>             regards, tom lane
>




Re: New hook after raw parsing, before analyze

From: Greg Stark
On Fri, Feb 14, 2014 at 2:28 PM, David Beck <dbeck@starschema.net> wrote:
> Why is it a bad idea to rewrite the query before it reaches transform/analyze (without ever accessing the catalog)?
>
> If that flexibility is acceptable to you, where would be the best place to put it?

Well, if there are two foreign tables and the planner could push the join work down to the FDW, then the planner should be able to accurately represent that plan and cost it without the user having to create any catalog structures. That's what the planner does for every other type of plan node.

What you're describing would still be useful for materialized views. In that case the user is creating the materialized view, and it is a real thing in the catalogs that won't disappear out from under the planner. Even then it would be ideal if the planner could decide to use the materialized view late enough that it can actually determine whether it's superior, rather than rewriting the query before it gets to that point. That would also be much more flexible for users, who might not write the query in a way that exactly matches the materialized view.

-- 
greg



Re: New hook after raw parsing, before analyze

From: David Beck
I think I’m gonna need to dig into the planner to fully understand your points. Thank you for the insights. I was more into putting the knowledge of the legacy system into an extension and my codebase. Now I see that better use of the planner would help. Thank you.

What inspired me is the scriptable query rewrite in http://dev.mysql.com/downloads/mysql-proxy/
The hook I proposed would be a lot nicer in Postgres because the raw parsing is already done at this point, while in mysql-proxy it has to be done manually.

Another point I liked in MySQL is the possibility of writing information schema plugins:
http://dev.mysql.com/doc/refman/5.1/en/writing-information-schema-plugins.html
Like a virtual catalog. Is there anything similar in Postgres?

Thank you, David


On 2014.02.14, at 18:06, Greg Stark <stark@mit.edu> wrote:

> On Fri, Feb 14, 2014 at 2:28 PM, David Beck <dbeck@starschema.net> wrote:
>> Why is it a bad idea to rewrite the query before it reaches transform/analyze (without ever accessing the catalog)?
>>
>> If that flexibility is acceptable to you, where would be the best place to put it?
>
> Well, if there are two foreign tables and the planner could push the join work down to the FDW, then the planner should be able to accurately represent that plan and cost it without the user having to create any catalog structures. That's what the planner does for every other type of plan node.
>
> What you're describing would still be useful for materialized views. In that case the user is creating the materialized view, and it is a real thing in the catalogs that won't disappear out from under the planner. Even then it would be ideal if the planner could decide to use the materialized view late enough that it can actually determine whether it's superior, rather than rewriting the query before it gets to that point. That would also be much more flexible for users, who might not write the query in a way that exactly matches the materialized view.
>
> --
> greg




Re: New hook after raw parsing, before analyze

From: Greg Stark
On Fri, Feb 14, 2014 at 9:16 PM, David Beck <dbeck@starschema.net> wrote:
> Another point I liked in MySQL is the possibility of writing information schema plugins:
> http://dev.mysql.com/doc/refman/5.1/en/writing-information-schema-plugins.html
> Like a virtual catalog. Is there anything similar in Postgres?

The documentation you linked to describes how to provide
information_schema plugins but not why you would want to do such a
thing. I'm not seeing why this would be useful. The information_schema
schema is described by the standard so creating new views in it isn't
needed very often and the schema for the existing views doesn't change
very often. I can see why a plugin might want to add rows to the views
but that doesn't seem to be what this feature is about.


-- 
greg



Re: New hook after raw parsing, before analyze

From: Tom Lane
Greg Stark <stark@mit.edu> writes:
> On Fri, Feb 14, 2014 at 9:16 PM, David Beck <dbeck@starschema.net> wrote:
>> Another point I liked in MySQL is the possibility of writing information schema plugins:
>> http://dev.mysql.com/doc/refman/5.1/en/writing-information-schema-plugins.html
>> Like a virtual catalog. Is there anything similar in Postgres?

> The documentation you linked to describes how to provide
> information_schema plugins but not why you would want to do such a
> thing. I'm not seeing why this would be useful. The information_schema
> schema is described by the standard so creating new views in it isn't
> needed very often and the schema for the existing views doesn't change
> very often.

IIRC, mysql has a liberal view about what they can do to the
information_schema, so for them this isn't as insane as it sounds to us.
        regards, tom lane



Re: New hook after raw parsing, before analyze

From: Greg Stark
On Fri, Feb 14, 2014 at 9:16 PM, David Beck <dbeck@starschema.net> wrote:
> What inspired me is the scriptable query rewrite in http://dev.mysql.com/downloads/mysql-proxy/
> The hook I proposed would be a lot nicer in Postgres because the raw parsing is already done at this point, while in mysql-proxy it has to be done manually.

There are a few similar things in the Postgres world such as pgbouncer
and plproxy but as you note they are limited in how powerful they can
be by the complexity of parsing and analyzing SQL.

I think Postgres tends to take a different tack on extensibility than MySQL. Rather than having one system plus hooks that allow external code to modify or replace it, Postgres usually treats modules the same way as its internal code. This is a pretty broad generalization and there are certainly exceptions. But, for example, new data types, functions, even whole new index types are all treated almost identically to the internal data types, functions, and index types. The planner then considers them all on a more or less equal basis.

Obviously there are limits. Btree indexes are kind of special because they represent Postgres's basic concept of ordering. And we don't have a pluggable recovery system, which makes any externally provided index type non-crash-safe. And we don't have a generalized concept of table storage -- the closest thing we have is FDWs, which is more like MySQL's style of extensibility, where the extension is a special case.

But this is why our instinct is that if you want to be able to push down joins, the way to do it is to extend the FDW API so that the planner can make those decisions just as it makes decisions about internal joins, rather than having an external module take over part of the planner's work.

-- 
greg



Re: New hook after raw parsing, before analyze

From: David Beck
On 2014.02.15, at 0:46, Greg Stark <stark@mit.edu> wrote:

> On Fri, Feb 14, 2014 at 9:16 PM, David Beck <dbeck@starschema.net> wrote:
>> Another point I liked in MySQL is the possibility of writing information schema plugins:
>> http://dev.mysql.com/doc/refman/5.1/en/writing-information-schema-plugins.html
>> Like a virtual catalog. Is there anything similar in Postgres?
>
> The documentation you linked to describes how to provide
> information_schema plugins but not why you would want to do such a
> thing. I'm not seeing why this would be useful. The information_schema
> schema is described by the standard so creating new views in it isn't
> needed very often and the schema for the existing views doesn't change
> very often. I can see why a plugin might want to add rows to the views
> but that doesn't seem to be what this feature is about.

Another reason I was thinking about a dynamic catalog and/or query rewrite is that the project I work on is a data integration platform. Right now it is in the feasibility study phase, and Postgres+extension looks to be the strongest option.

The legacy system we want to interface with has over 150k table-like objects. Our platform’s task is to provide a relational view on top of them.

I know that it is unlikely that users will use all 150k tables. I would expect maybe 10-100 to be used in practice, but I didn’t want to figure out which 100, nor did I want to create all 150k catalog entries in advance.

I was also dreaming about the possibility of transferring the small enough objects to Postgres tables in the background, to spare the communication with the legacy system and let Postgres do the joins on these.

The solution I was thinking about is this (a rough sketch follows the list):

- when a query arrives, a smart rewriter would know 1) which tables are local, 2) which tables need new catalog entries, and 3) what can be joined on the other side
- the rewriter would potentially add SQL statements at the beginning of the query to create the missing FDW catalog entries, if needed
- the FDW would be handled by the same extension, so the two can easily talk to each other about the status of the objects; the rewriter would thus know when the background transfer of a small table has completed, and do the rewrite accordingly
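
Purely as an illustration of the decision the rewriter would make -- every name below is hypothetical, and the state lookup is a placeholder for extension-local shared state:

  /* Hypothetical sketch: track, outside the catalog, where each legacy
   * object currently lives, and pick the rewrite target accordingly. */
  #include "postgres.h"

  typedef enum LegacyObjState
  {
      LEGACY_REMOTE,      /* exists only on the legacy side */
      LEGACY_FDW_MAPPED,  /* FDW catalog entry already created */
      LEGACY_LOCAL        /* background transfer done; local table */
  } LegacyObjState;

  /* placeholder: the real version would consult shared state kept by
   * the extension and its background transfer worker */
  static LegacyObjState
  legacy_object_state(const char *name)
  {
      return LEGACY_REMOTE;
  }

  /* returns the table name to substitute into the query, or NULL if a
   * CREATE FOREIGN TABLE statement must be prepended first */
  static const char *
  rewrite_target(const char *legacy_name)
  {
      static char buf[2 * NAMEDATALEN];

      switch (legacy_object_state(legacy_name))
      {
          case LEGACY_LOCAL:
              snprintf(buf, sizeof(buf), "local_%s", legacy_name);
              return buf;
          case LEGACY_FDW_MAPPED:
              snprintf(buf, sizeof(buf), "fdw_%s", legacy_name);
              return buf;
          case LEGACY_REMOTE:
          default:
              return NULL;
      }
  }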

I know these are pretty far from the functionality and traditional operation of an RDBMS… but if you look at FDW examples like doing a SELECT on a Google IMAP mailbox, it is not that far from Postgres.

Best regards, David




Re: New hook after raw parsing, before analyze

From: Greg Stark
On Sat, Feb 15, 2014 at 2:06 PM, David Beck <dbeck@starschema.net> wrote:
> - when a query arrives, a smart rewriter would know 1) which tables are local, 2) which tables need new catalog entries, and 3) what can be joined on the other side
> - the rewriter would potentially add SQL statements at the beginning of the query to create the missing FDW catalog entries, if needed
> - the FDW would be handled by the same extension, so the two can easily talk to each other about the status of the objects; the rewriter would thus know when the background transfer of a small table has completed, and do the rewrite accordingly

I think you have a pretty big impedance mismatch with Postgres, which assumes the schema of the database is fairly static and known when parsing begins. To do what you describe you pretty much need to write your own SQL parser.

There is a hook "post_parse_analyze_hook", but I think it comes too late, as it runs after the analyze step, which is when Postgres looks up the schema information for every relation mentioned in the query. What you would need is a post_parse_hook that would work on the raw parse tree before the analyze step. That doesn't seem terribly controversial to add, though there may be some technical details. The API would of course be completely unstable from major release to major release -- the parse tree gets knocked around quite a bit.
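
For contrast, here is a minimal sketch of how an extension installs the existing post_parse_analyze_hook (the signature as of the 9.3/9.4 era); by the time it fires, every relation in the query has already been looked up in the catalogs:

  #include "postgres.h"
  #include "fmgr.h"
  #include "parser/analyze.h"

  PG_MODULE_MAGIC;

  void _PG_init(void);

  static post_parse_analyze_hook_type prev_post_parse_analyze_hook = NULL;

  /* runs on the analyzed Query: relation lookup has already happened,
   * which is exactly why this hook is too late for the use case above */
  static void
  my_post_parse_analyze(ParseState *pstate, Query *query)
  {
      if (prev_post_parse_analyze_hook)
          prev_post_parse_analyze_hook(pstate, query);
  }

  void
  _PG_init(void)
  {
      prev_post_parse_analyze_hook = post_parse_analyze_hook;
      post_parse_analyze_hook = my_post_parse_analyze;
  }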

But that only gets you so far. I think you would be able to get lazy creation of schema objects fairly straightforwardly, i.e. any legacy objects referenced in a query get corresponding FDW schema objects created before analysis is done -- though I would expect a few gotchas, possibly deadlocks, with concurrent queries creating the same objects.

But I don't think that gets you the joins, which I think would be quite a battle.

And I have to wonder if you aren't going the long way around to do
something that can be done more simply some other way. If you have
150k objects I wonder if your objects aren't all very similar and
could be handled by a single Postgres schema object. Either a single
FDW object or a simple function.

If they really are different schema objects, perhaps a single function that returns a more flexible data type like an hstore blob? That has obvious disadvantages compared to an object the planner understands better, but at least you would be on a well supported, stable API (well, for interesting definitions of "stable", given that 9.4 will have a whole new version of hstore).

As a side note, you should evaluate carefully what lazily creating objects will buy you. Perhaps just creating 150k objects would be cheaper than maintaining this code. In particular, since the user *might* access all 150k, you still have to worry about the worst case anyway, and it might be cheaper to just engineer for it in the first place.

-- 
greg



Re: New hook after raw parsing, before analyze

From: David Beck
> There is a hook "post_parse_analyze_hook", but I think it comes too late, as it runs after
> the analyze step, which is when Postgres looks up the schema information for every relation
> mentioned in the query. What you would need is a post_parse_hook that would work on the raw
> parse tree before the analyze step. That doesn't seem terribly controversial to add, though
> there may be some technical details. The API would of course be completely unstable from
> major release to major release -- the parse tree gets knocked around quite a bit.

Good to know that the parse tree is not stable. My original proposal was exactly the one you describe: I called it post_raw_parse_hook and wanted to call it from exec_simple_query(), right after pg_parse_query(). Maybe that is not the best place or way to achieve this?

It was a few-line patch; I wonder if it ever has a chance to get into Postgres.

> And I have to wonder if you aren't going the long way around to do
> something that can be done more simply some other way. If you have
> 150k objects I wonder if your objects aren't all very similar and
> could be handled by a single Postgres schema object. Either a single
> FDW object or a simple function.

I need to do a bit more research on that side. I was hoping to spend less time on understanding that legacy system...

> As a side note, you should evaluate carefully what lazily creating
> objects will buy you. Perhaps just creating 150k objects would be
> cheaper than maintaining this code. In particular since the user
> *might* access all 150k you still have to worry about the worst case
> anyway and it might be cheaper to just engineer for it in the first
> place.

I have a similar problem with the legacy system: the set of 150k objects may change between releases. My gut feeling is that it is easier to understand and keep track of an open source project like Postgres.

Best regards and thanks for all insights and ideas, David