Re: Join push-down support for foreign tables - Mailing list pgsql-hackers

From Shigeru HANADA
Subject Re: Join push-down support for foreign tables
Date
Msg-id 540CE54F.8050006@gmail.com
Whole thread Raw
In response to Re: Join push-down support for foreign tables  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Join push-down support for foreign tables
Re: Join push-down support for foreign tables
List pgsql-hackers
(2014/09/04 21:37), Robert Haas wrote:> On Wed, Sep 3, 2014 at 5:16 AM, 
Shigeru Hanada <shigeru.hanada@gmail.com> wrote:>> (1) Separate cost estimation phases?>> For existing join paths,
plannerestimates their costs in two phaeses.>> In the first phase initial_cost_foo(), here foo is one of>>
nestloop/mergejoin/hashjoin,produces lower-bound estimates for>> elimination.  The second phase is done for only
promisingpaths which>> passed add_path_precheck(), by final_cost_foo() for cost and result>> size.  I'm not sure that
weneed to follow this manner, since FDWs>> would be able to estimate final cost/size with their own methods.>> The main
problemI see here is that accurate costing may require a> round-trip to the remote server.  If there is only one path
thatis> probably OK; the cost of asking the question will usually be more than> paid for by hearing that the
pushed-downjoin clobbers the other> possible methods of executing the query.  But if there are many paths,> for example
becausethere are multiple sets of useful pathkeys, it> might start to get a bit expensive.
 

I agree that requiring round-trip per path is unbearable, so main source 
of plan cost should be local statistics gathered by ANALYZE command.  If 
an FDW needs extra information for planning, it should obtain that from 
FDW options or its private catalogs to avoid undesirable round-trips.
FDWs like postgres_fdw would want to optimize plan by providing paths 
with pathkeys (not only use remote index, but it also allows MergeJoin 
at upper level),

I noticed that order of join considering is an issue too.  Planner 
compares currently-cheapest path to newly generated path, and mostly 
foreign join path would be the cheapest, so considering foreign join 
would reduce planner overhead.
> Probably both the initial cost and final cost calculations should be> delegated to the FDW, but maybe within
postgres_fdw,the initial cost> should do only the work that can be done without contacting the remote> server; then,
letthe final cost step do that if appropriate.  But I'm> not entirely sure what is best here.
 

Agreed.  I'll design planner API along that way for now.
>> (2) How to reflect cost of transfer>> Cost of transfer is dominant in foreign table operations, including>> foreign
scans. It would be nice to have some mechanism to reflect>> actual time of transfer to the cost estimation.  An idea is
tohave a>> FDW option which represents cost factor of transfer, say>> transfer_cost.>> That would be reasonable.  I
assumeusers would normally wish to> specify this per-server, and the default should be something> reasonable for a
LAN.

This enhancement could be applied separately from foreign join patch.
>> (4) criteria for push-down>> It is assumed that FDWs can push joins down to remote when all foreign>> tables are in
sameserver.  IMO a SERVER objects represents a logical>> data source.  For instance database for postgres_fdw and
other>>connection-based FDWs, and disk volumes (or directory?) for file_fdw.>> Is this reasonable assumption?>> I think
it'sprobably good to give an FDW the option of producing a> ForeignJoinPath for any join against a ForeignPath *or>
ForeignJoinPath*for the same FDW.  It's perhaps unlikely that an FDW> can perform a join efficiently between two data
sourceswith different> server definitions, but why not give it the option?  It should be> pretty fast for the FDW to
realize,oh, the server OIDs don't match -> and at that point it can exit without doing anything further if that> seems
desirable. And there might be some kinds of data sources where> cross-server joins actually can be executed quickly
(e.g.when the> underlying data is just in two files in different places on the local> machine).
 

Indeed how to separate servers is left to users, or author of FDWs, 
though postgres_fdw and most of other FDWs can join foreign tables in a 
server.  I think it would be good if we can know two foreign tables are 
managed by same FDW, from FdwRoutine, maybe adding new API which returns 
FDW identifier?
>> (5) Terminology>> I used "foreign join" as a process which joins foreign tables on>> *remote* side, but is this
enoughintuitive?  Another idea is using>> "remote join", is this more appropriate for this kind of process?  I>>
hesitateto use "remote join" because it implies client-server FDWs,>> but foreign join is not limited to such FDWs,
e.g.file_fdw can have>> extra file which is already joined files accessed via foreign tables.>> Foreign join is
perfect.>>As I alluded to above, it's pretty important to make sure that this> works with large join trees; that is, if
Ijoin four foreign tables, I> don't want it to push down a join between two of the tables and a join> between the other
twotables and then join the results of those joins> locally.  Instead, I want to push the entire join tree to the
foreign>server and execute the whole thing there.  Some care may be needed in> designing the hooks to make sure this
worksas desired.>
 

I think so too, so ForeignJoinPath should be able to be an input of 
another ForeignJoinPath in upper join level.  But I also think joining 
on remote or not should be decided based on cost, as existing joins are 
planned with bottom-up approach.

Regards,
--
Shigeru HANADA



pgsql-hackers by date:

Previous
From: Костя Кузнецов
Date:
Subject: gist vacuum gist access
Next
From: Shigeru HANADA
Date:
Subject: Re: Join push-down support for foreign tables