Thread: pass-through queries to foreign servers

pass-through queries to foreign servers

From
David Gudeman
Date:
When you write an application involving foreign tables, you frequently
end up with queries that are just too inefficient because they bring
too much data over from the foreign server. For a trivial example,
consider "SELECT count(*) FROM t" where t is a foreign table. This
will pull the entire table over the network just to count up the rows.
If the writer of the foreign data wrapper was clever enough, this may
only pull one column from the foreign server, but that can still be a
lot of data.

To solve (or work around) this problem, it would be convenient to have
a pass-through query mechanism associated with foreign servers. A
pass-through query would look like a table function, but would use the
name of the foreign server as the function name. For example:

CREATE SERVER foo ...;
CREATE USER MAPPING ...;
CREATE FOREIGN TABLE t (...) SERVER foo ... OPTIONS (table 't');

SELECT size FROM foo('SELECT count(*) FROM t') AS t(size BIGINT);

The SELECT above will execute the quoted string as a query on the
foreign server represented by foo. (Notice that only the CREATE SERVER
and CREATE USER MAPPING are needed for the SELECT to work. I just
added the CREATE FOREIGN TABLE for context.)

I can think of two ways to implement this. I think it would pretty
easy to just add a table function foo that does the right thing. This
would require the author of the foreign data wrapper to provide
another callback function to send the query and get back the results.
Such a callback function would largely duplicate the functionality of
the current callback functions and --because of the current
implementation of table functions-- it would materialize the entire
result set before returning it.

A more difficult solution (for me, at least) would be to construct a
sort of temporary foreign table from the pass-through query then let
it go through the usual foreign-table handling code. This also would
require some changes to foreign data wrappers. Current wrappers have
to construct a query to scan a foreign table but with a pass-through
query the query is already constructed. But this probably requires
less work for the authors of foreign data wrappers and it doesn't
materialize the results of the foreign query unnecessarily.

Any suggestions or hints?

Regards,
David Gudeman
http://unobtainabol.blogspot.com



Re: pass-through queries to foreign servers

From
David Fetter
Date:
On Tue, Jul 30, 2013 at 04:40:38PM -0700, David Gudeman wrote:
> When you write an application involving foreign tables, you frequently
> end up with queries that are just too inefficient because they bring
> too much data over from the foreign server. For a trivial example,
> consider "SELECT count(*) FROM t" where t is a foreign table. This
> will pull the entire table over the network just to count up the rows.
> If the writer of the foreign data wrapper was clever enough, this may
> only pull one column from the foreign server, but that can still be a
> lot of data.

Yes, and this case is a known limitation of our planner
infrastructure.   Aggregates are "special" when it comes to
generating paths for the planner to evaluate, so there's no current
way a FDW could supply such info to the planner, and hence no API in
our FDW code for having FDWs supply that info.  That's probably a
"should fix" but I don't know whether a project that size could be
done by 9.4.

All that said, my DBI-Link, back in the bad old days, provided two
important functions: remote_select(), which returned SETOF RECORD and
remote_execute(), which returned nothing.  It also provided ways to
control connections to the remote host, introspect remote schemas,
etc., etc.  We need capabilities like that in the FDW API, I believe
we could have them by 9.4.

I don't know how to solve your problem within the context of our
current FDW API, but I think it's common enough that we do need to
solve it as above.

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: pass-through queries to foreign servers

From
Tom Lane
Date:
David Fetter <david@fetter.org> writes:
> On Tue, Jul 30, 2013 at 04:40:38PM -0700, David Gudeman wrote:
>> When you write an application involving foreign tables, you frequently
>> end up with queries that are just too inefficient because they bring
>> too much data over from the foreign server. For a trivial example,
>> consider "SELECT count(*) FROM t" where t is a foreign table. This
>> will pull the entire table over the network just to count up the rows.

> Yes, and this case is a known limitation of our planner
> infrastructure.   Aggregates are "special" when it comes to
> generating paths for the planner to evaluate, so there's no current
> way a FDW could supply such info to the planner, and hence no API in
> our FDW code for having FDWs supply that info.  That's probably a
> "should fix" but I don't know whether a project that size could be
> done by 9.4.

Yeah.  There's a lot left to be done in the FDW infrastructure.
But not this:

> All that said, my DBI-Link, back in the bad old days, provided two
> important functions: remote_select(), which returned SETOF RECORD and
> remote_execute(), which returned nothing.  It also provided ways to
> control connections to the remote host, introspect remote schemas,
> etc., etc.  We need capabilities like that in the FDW API, I believe
> we could have them by 9.4.

I would argue we *don't* want that.  If you want pass-through queries
or explicit connection control, your needs are already met by dblink or
dbi-link.  The whole point of FDW is that it's at a higher level of
abstraction than that; which offers greater ease of use and will
eventually offer better optimization than what you can get from dblink
et al.  If we start trying to shoehorn things like passthrough queries
into FDW, we'll be crippling the technology.  As an example, somebody
on planet postgresql was just recently surprised to find that postgres_fdw
honors transaction rollback.  Well, it can do that because users can't
disconnect the connection underneath it, nor issue passthrough
commit/rollback commands.  You don't get to have it both ways.
        regards, tom lane



Re: pass-through queries to foreign servers

From
David Fetter
Date:
On Wed, Jul 31, 2013 at 01:22:56AM -0400, Tom Lane wrote:
> David Fetter <david@fetter.org> writes:
> > On Tue, Jul 30, 2013 at 04:40:38PM -0700, David Gudeman wrote:
> >> When you write an application involving foreign tables, you frequently
> >> end up with queries that are just too inefficient because they bring
> >> too much data over from the foreign server. For a trivial example,
> >> consider "SELECT count(*) FROM t" where t is a foreign table. This
> >> will pull the entire table over the network just to count up the rows.
> 
> > Yes, and this case is a known limitation of our planner
> > infrastructure.   Aggregates are "special" when it comes to
> > generating paths for the planner to evaluate, so there's no current
> > way a FDW could supply such info to the planner, and hence no API in
> > our FDW code for having FDWs supply that info.  That's probably a
> > "should fix" but I don't know whether a project that size could be
> > done by 9.4.
> 
> Yeah.  There's a lot left to be done in the FDW infrastructure.
> But not this:
> 
> > All that said, my DBI-Link, back in the bad old days, provided two
> > important functions: remote_select(), which returned SETOF RECORD and
> > remote_execute(), which returned nothing.  It also provided ways to
> > control connections to the remote host, introspect remote schemas,
> > etc., etc.  We need capabilities like that in the FDW API, I believe
> > we could have them by 9.4.
> 
> I would argue we *don't* want that.  If you want pass-through queries
> or explicit connection control, your needs are already met by dblink or
> dbi-link.

The standard actually describes a passthrough mode.  That mode is
mind-numbingly weird, but it's there.  It's weird because As far as I
can tell, when it's set to true, all commands, *whether the local
server can parse them or not*, are sent to the foreign server until
it's set to false again.

We can easily do better.

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: pass-through queries to foreign servers

From
David Gudeman
Date:
On Tue, Jul 30, 2013 at 10:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> David Fetter <david@fetter.org> writes:
>> On Tue, Jul 30, 2013 at 04:40:38PM -0700, David Gudeman wrote:
>>> When you write an application involving foreign tables, you frequently
>>> end up with queries that are just too inefficient because they bring
>>> too much data over from the foreign server. For a trivial example,
>>> consider "SELECT count(*) FROM t" where t is a foreign table. This
>>> will pull the entire table over the network just to count up the rows.
>
>> Yes, and this case is a known limitation of our planner
>> infrastructure.   Aggregates are "special" when it comes to
>> generating paths for the planner to evaluate, so there's no current
>> way a FDW could supply such info to the planner, and hence no API in
>> our FDW code for having FDWs supply that info.  That's probably a
>> "should fix" but I don't know whether a project that size could be
>> done by 9.4.
>
> Yeah.  There's a lot left to be done in the FDW infrastructure.
> But not this:
>
>> All that said, my DBI-Link, back in the bad old days, provided two
>> important functions: remote_select(), which returned SETOF RECORD and
>> remote_execute(), which returned nothing.  It also provided ways to
>> control connections to the remote host, introspect remote schemas,
>> etc., etc.  We need capabilities like that in the FDW API, I believe
>> we could have them by 9.4.
>
> I would argue we *don't* want that.  If you want pass-through queries
> or explicit connection control, your needs are already met by dblink or
> dbi-link.  The whole point of FDW is that it's at a higher level of
> abstraction than that; which offers greater ease of use and will
> eventually offer better optimization than what you can get from dblink
> et al.  If we start trying to shoehorn things like passthrough queries
> into FDW, we'll be crippling the technology.  As an example, somebody
> on planet postgresql was just recently surprised to find that postgres_fdw
> honors transaction rollback.  Well, it can do that because users can't
> disconnect the connection underneath it, nor issue passthrough
> commit/rollback commands.  You don't get to have it both ways.
>
>                         regards, tom lane

Tom, you have a good point about transaction management, but I think
we _can_ have it both ways. There are several things that the author
of the foreign data wrapper can do to prevent bad things from being
done since he has ultimate control of everything that gets sent to the
foreign server. For many foreign servers it is enough to check that
the string being sent to the foreign server begins with "select ". Or
he can prevent pass-through queries when there is an on-going
transaction on the foreign server. Or the author of a particular
foreign data wrapper can prevent pass-through queries entirely.

The point is that this is only a concern for some kinds of foreign
servers and even then only for those foreign data wrappers that care
about transactions. If they don't implement update/insert/delete, for
example, then it doesn't matter. Since there are many other kinds of
foreign servers where this could be useful, it should be available, at
least as an option.

The reason I want it is to do use it for some of the things that David
Fetter was talking about --optimizing queries with aggregates and
GROUP BY. I have code that currently optimizes these sorts of queries
for a particular database engine. I did this several years ago before
there were foreign data wrappers so I had to roll my own using table
functions. My implementation query rewrite rather than plan
optimization (it seemed to me to be too hard to do in the panning
phase). See http://unobtainabol.blogspot.com/2013/04/daves-foreign-data-translating-foreign_24.html
for a description of what it does.

My plan was to generalize my current code to generic SQL databases and
to make it work with foreign data wrappers. If there is any interest
from the PG community I'll try to get my company to let me contribute
this back. But the first thing I need is to implement pass-through
queries for foreign servers or I have to duplicate all of the
functionality for managing foreign servers and tables.

Regards,
David Gudeman
http://unobtainabol.blogspot.com



Re: pass-through queries to foreign servers

From
David Gudeman
Date:
For those who don't want to go to the link to see what I'm talking
about with query rewrites, I thought I'd give a brief description.
Foreign data wrappers currently do all of their work in the planning
phase but I claim that isn't the right place to optimize foreign
queries with aggregates and GROUP BY because optimizing those things
would involve collapsing multiple plan node back into a single node
for a foreign call. I propose to do these optimizations as query
rewrites instead. So for example suppose t is a foreign table on the
foreign server named fs. Then the query
 SELECT count(*) FROM t

is rewritten to
 SELECT count FROM fs('select count(*) from t') fs(count bigint)

where ts() is the pass-through query function for the server fs. To
implement this optimization as a query rewrite, all of the elements of
the result have to be real source-language constructs so the
pass-through query has to be available in Postgresql SQL.

My current implementation of this uses a plugin that hooks into
planner_hook, but I'm hoping that I can get some support for adding
the query rewriting as callback functions for the FDW system.

Regards,
David Gudeman
http://unobtainabol.blogspot.com



Re: pass-through queries to foreign servers

From
Pavel Stehule
Date:
Hello

2013/8/5 David Gudeman <dave.gudeman@gmail.com>:
> For those who don't want to go to the link to see what I'm talking
> about with query rewrites, I thought I'd give a brief description.
> Foreign data wrappers currently do all of their work in the planning
> phase but I claim that isn't the right place to optimize foreign
> queries with aggregates and GROUP BY because optimizing those things
> would involve collapsing multiple plan node back into a single node
> for a foreign call. I propose to do these optimizations as query
> rewrites instead. So for example suppose t is a foreign table on the
> foreign server named fs. Then the query
>
>   SELECT count(*) FROM t
>
> is rewritten to
>
>   SELECT count FROM fs('select count(*) from t') fs(count bigint)
>
> where ts() is the pass-through query function for the server fs. To
> implement this optimization as a query rewrite, all of the elements of
> the result have to be real source-language constructs so the
> pass-through query has to be available in Postgresql SQL.
>

why you introduce new API? There is still dblink.

Regards

Pavel

> My current implementation of this uses a plugin that hooks into
> planner_hook, but I'm hoping that I can get some support for adding
> the query rewriting as callback functions for the FDW system.
>
> Regards,
> David Gudeman
> http://unobtainabol.blogspot.com
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers



Re: pass-through queries to foreign servers

From
Tom Lane
Date:
David Gudeman <dave.gudeman@gmail.com> writes:
> For those who don't want to go to the link to see what I'm talking
> about with query rewrites, I thought I'd give a brief description.
> Foreign data wrappers currently do all of their work in the planning
> phase but I claim that isn't the right place to optimize foreign
> queries with aggregates and GROUP BY because optimizing those things
> would involve collapsing multiple plan node back into a single node
> for a foreign call.

I'm not sure what the best implementation for that is, but what you
propose here would still involve such collapsing, so this argument
seems rather empty.

> I propose to do these optimizations as query
> rewrites instead. So for example suppose t is a foreign table on the
> foreign server named fs. Then the query

>   SELECT count(*) FROM t

> is rewritten to

>   SELECT count FROM fs('select count(*) from t') fs(count bigint)

> where ts() is the pass-through query function for the server fs. To
> implement this optimization as a query rewrite, all of the elements of
> the result have to be real source-language constructs so the
> pass-through query has to be available in Postgresql SQL.

I don't believe in any part of that design, starting with the "pass
through query function".  For one thing, it seems narrowly targeted to the
assumption that the FDW is a frontend for a foreign server that speaks
SQL.  If the FDW's infrastructure doesn't include some kind of textual
query language, this isn't going to be useful for it at all.  For another,
a query rewrite system is unlikely to be able to cost out the alternatives
and decide whether pushing the aggregation across is actually a win or
not.

The direction I think we ought to be heading is to generate explicit Paths
representing the various ways in which aggregation can be implemented.
The logic in grouping_planner is already overly complex, and hard to
extend, because it's all hard-wired comparisons of alternatives.  We'd be
better off with something more like the add_path infrastructure.  Once
that's been done, maybe we can allow FDWs to add Paths representing remote
aggregation.
        regards, tom lane



Re: pass-through queries to foreign servers

From
David Gudeman
Date:
On Mon, Aug 5, 2013 at 12:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> David Gudeman <dave.gudeman@gmail.com> writes:
>> For those who don't want to go to the link to see what I'm talking
>> about with query rewrites, I thought I'd give a brief description.
>> Foreign data wrappers currently do all of their work in the planning
>> phase but I claim that isn't the right place to optimize foreign
>> queries with aggregates and GROUP BY because optimizing those things
>> would involve collapsing multiple plan node back into a single node
>> for a foreign call.
>
> I'm not sure what the best implementation for that is, but what you
> propose here would still involve such collapsing, so this argument
> seems rather empty.
>
>> I propose to do these optimizations as query
>> rewrites instead. So for example suppose t is a foreign table on the
>> foreign server named fs. Then the query
>
>>   SELECT count(*) FROM t
>
>> is rewritten to
>
>>   SELECT count FROM fs('select count(*) from t') fs(count bigint)
>
>> where ts() is the pass-through query function for the server fs. To
>> implement this optimization as a query rewrite, all of the elements of
>> the result have to be real source-language constructs so the
>> pass-through query has to be available in Postgresql SQL.
>
> I don't believe in any part of that design, starting with the "pass
> through query function".  For one thing, it seems narrowly targeted to the
> assumption that the FDW is a frontend for a foreign server that speaks
> SQL.  If the FDW's infrastructure doesn't include some kind of textual
> query language, this isn't going to be useful for it at all.  For another,
> a query rewrite system is unlikely to be able to cost out the alternatives
> and decide whether pushing the aggregation across is actually a win or
> not.
>
> The direction I think we ought to be heading is to generate explicit Paths
> representing the various ways in which aggregation can be implemented.
> The logic in grouping_planner is already overly complex, and hard to
> extend, because it's all hard-wired comparisons of alternatives.  We'd be
> better off with something more like the add_path infrastructure.  Once
> that's been done, maybe we can allow FDWs to add Paths representing remote
> aggregation.
>
>                         regards, tom lane

Well, I will have to defer to your greater knowledge of the Posgres
planner. I'm surprised that it can do that since in my experience, the
planner did a cost-based search to find the best join order but the
rest of the plan structure was generated independently of the tables.
If the Postgres planner can look at the FROM clause and generate a
completely different structure based on the types of tables it finds
there, then that doesn't sound unreasonable.

However it still seems a bit like overkill. The purpose of a
cost-based heuristic search is to deal with the fact that there are
lots and lots of possible join orders (as the number of tables grows)
and it makes a huge difference which order you chose. By contrast,
this decision is fairly easy. Either you send the computation to the
data or you bring the data to the computation. Since in the case of
SQL aggregation the computation generally reduces the amount of data
that has to be  moved, the first choice is almost always the right
one.

Also, this means that you won't be able to use the code that I've
written since it's all based on query rewriting. I'm disappointed
because I wanted to contribute something back, but that's for you guys
to decide.

Regards,
David Gudeman
http://unobtainabol.blogspot.com



Re: pass-through queries to foreign servers

From
Ashutosh Bapat
Date:



On Tue, Aug 6, 2013 at 12:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
David Gudeman <dave.gudeman@gmail.com> writes:
> For those who don't want to go to the link to see what I'm talking
> about with query rewrites, I thought I'd give a brief description.
> Foreign data wrappers currently do all of their work in the planning
> phase but I claim that isn't the right place to optimize foreign
> queries with aggregates and GROUP BY because optimizing those things
> would involve collapsing multiple plan node back into a single node
> for a foreign call.

I'm not sure what the best implementation for that is, but what you
propose here would still involve such collapsing, so this argument
seems rather empty.

> I propose to do these optimizations as query
> rewrites instead. So for example suppose t is a foreign table on the
> foreign server named fs. Then the query

>   SELECT count(*) FROM t

> is rewritten to

>   SELECT count FROM fs('select count(*) from t') fs(count bigint)

> where ts() is the pass-through query function for the server fs. To
> implement this optimization as a query rewrite, all of the elements of
> the result have to be real source-language constructs so the
> pass-through query has to be available in Postgresql SQL.

I don't believe in any part of that design, starting with the "pass
through query function".  For one thing, it seems narrowly targeted to the
assumption that the FDW is a frontend for a foreign server that speaks
SQL.  If the FDW's infrastructure doesn't include some kind of textual
query language, this isn't going to be useful for it at all.  For another,
a query rewrite system is unlikely to be able to cost out the alternatives
and decide whether pushing the aggregation across is actually a win or
not.

The direction I think we ought to be heading is to generate explicit Paths
representing the various ways in which aggregation can be implemented.
The logic in grouping_planner is already overly complex, and hard to
extend, because it's all hard-wired comparisons of alternatives.  We'd be
better off with something more like the add_path infrastructure.  Once
that's been done, maybe we can allow FDWs to add Paths representing remote
aggregation.


Postgres-XC has extended the current PostgreSQL planner to find out the largest subset of join tree that can be evaluated on the server where the data is (called the Datanode in XC jargon). If it finds that the whole of join tree can be evaluated on the Datanode/s, it also attempts to evaluate the grouped aggregates (sometime partially). Same is the case with ORDER BY, LIMIT clauses. An alternate method called fast-query-shipping is used to avoid planning and pass the entire query to the Datanode/s if the query can be completely evaluated at the Datanode/s. These two techniques eliminate the need of pass-through syntax in XC.

But, XC planner currently has these extensions 1. without real cost estimations (since in XC assumption is that the query perform better if evaluated on the Datanodes, which itself is not right in some cases.) 2. Right now it works only for PostgreSQL (but can be extended easily for all SQL based databases).

It might be worth to look at the XC planner and pick up pieces of work that fit in PostgreSQL.

                        regards, tom lane


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers



--
Best Wishes,
Ashutosh Bapat
EntepriseDB Corporation
The Postgres Database Company

Re: pass-through queries to foreign servers

From
Merlin Moncure
Date:
On Mon, Aug 5, 2013 at 2:02 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> Hello
>
> 2013/8/5 David Gudeman <dave.gudeman@gmail.com>:
>> For those who don't want to go to the link to see what I'm talking
>> about with query rewrites, I thought I'd give a brief description.
>> Foreign data wrappers currently do all of their work in the planning
>> phase but I claim that isn't the right place to optimize foreign
>> queries with aggregates and GROUP BY because optimizing those things
>> would involve collapsing multiple plan node back into a single node
>> for a foreign call. I propose to do these optimizations as query
>> rewrites instead. So for example suppose t is a foreign table on the
>> foreign server named fs. Then the query
>>
>>   SELECT count(*) FROM t
>>
>> is rewritten to
>>
>>   SELECT count FROM fs('select count(*) from t') fs(count bigint)
>>
>> where ts() is the pass-through query function for the server fs. To
>> implement this optimization as a query rewrite, all of the elements of
>> the result have to be real source-language constructs so the
>> pass-through query has to be available in Postgresql SQL.
>>
>
> why you introduce new API? There is still dblink.


yeah: FDW infrastructure as it exists today is not really an ad hoc
query tool: you have to (unlike XXXlink) persist global objects for
each command string.

If it was me, in addition to (or instead of) trying to go down the
rathole of exposing optimization mechanics to the FDW driver, I'd be
leaning on some user facing infrastructure to say, parameterize the
query string and/or run ad hoc queries.  This would be very useful and
a whole lot less complicated.  Note the FDW spec may already have
these facilities; I didn't read it.

merlin