Re: Custom/Foreign-Join-APIs (Re: [v9.5] Custom Plan API) - Mailing list pgsql-hackers
From | Kouhei Kaigai |
---|---|
Subject | Re: Custom/Foreign-Join-APIs (Re: [v9.5] Custom Plan API) |
Date | |
Msg-id | 9A28C8860F777E439AA12E8AEA7694F8010C1D3A@BPXM15GP.gisp.nec.co.jp Whole thread Raw |
In response to | Re: Custom/Foreign-Join-APIs (Re: [v9.5] Custom Plan API) (Ashutosh Bapat <ashutosh.bapat@enterprisedb.com>) |
Responses |
Re: Custom/Foreign-Join-APIs (Re: [v9.5] Custom Plan API)
|
List | pgsql-hackers |
> On Sat, Mar 14, 2015 at 3:48 AM, Robert Haas <robertmhaas@gmail.com> wrote: > > > On Fri, Mar 13, 2015 at 2:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Robert Haas <robertmhaas@gmail.com> writes: > >> Another bit of this that I think we could commit without fretting > >> about it too much is the code adding set_join_pathlist_hook. This > is > >> - I think - analogous to set_rel_pathlist_hook, and like that hook, > >> could be used for other purposes than custom plan generation - e.g. > to > >> delete paths we do not want to use. I've extracted this portion of > >> the patch and adjusted the comments; if there are no objections, I > >> will commit this bit also. > > > > I don't object to the concept, but I think that is a pretty bad place > > to put the hook call: add_paths_to_joinrel is typically called multiple > > (perhaps *many*) times per joinrel and thus this placement would force > > any user of the hook to do a lot of repetitive work. > > Interesting point. I guess the question is whether a some or all > callers are going to actually *want* a separate call for each > invocation of add_paths_to_joinrel(), or whether they'll be happy to > operate on the otherwise-complete path list. It's true that if your > goal is to delete paths, it's probably best to be called just once > after the path list is complete, and there might be a use case for > that, but I guess it's less useful than for baserels. For a baserel, > as long as you don't nuke the sequential-scan path, there is always > going to be a way to complete the plan; so this would be a fine way to > implement a disable-an-index extension. But for joinrels, it's not so > easy to rule out, say, a hash-join here. Neither hook placement is > much good for that; the path you want to get rid of may have already > dominated paths you want to keep. > > Suppose you want to add paths - e.g. you have an extension that goes > and looks for a materialized view that matches this subtree of the > query, and if it finds one, it substitutes a scan of the materialized > view for a scan of the baserel. Or, as in KaiGai's case, you have an > extension that can perform the whole join in GPU-land and produce the > same results we would have gotten via normal execution. Either way, > you want - and this is the central point of the whole patch here - to > inject a scan path into a joinrel. It is not altogether obvious to me > what the best placement for this is. In the materialized view case, > you probably need a perfect match between the baserels in the view and > the baserels in the joinrel to do anything. There's no point in > re-checking that for every innerrels/outerrels combination. I don't > know enough about the GPU case to reason about it intelligently; maybe > KaiGai can comment. > > I think the foreign data wrapper join pushdown case, which also aims > to substitute a scan for a join, is interesting to think about, even > though it's likely to be handled by a new FDW method instead of via > the hook. Where should the FDW method get called from? Currently, > the FDW method in KaiGai's patch is GetForeignJoinPaths, and that gets > called from add_paths_to_joinrel(). The patch at > http://www.postgresql.org/message-id/CAEZqfEfy7p=uRpwN-Q-NNgzb8kwHbf > qF82YSb9ztFZG7zN64Xw@mail.gmail.com > uses that to implement join pushdown in postgres_fdw; if you have A > JOIN B JOIN C all on server X, we'll notice that the join with A and B > can be turned into a foreign scan on A JOIN B, and similarly for A-C > and B-C. Then, if it turns out that the cheapest path for A-B is the > foreign join, and the cheapest path for C is a foreign scan, we'll > arrive at the idea of a foreign scan on A-B-C, and we'll realize the > same thing in each of the other combinations as well. So, eventually > the foreign join gets pushed down. > > But there's another possible approach: suppose that > join_search_one_level, after considering left-sided and right-sided > joins and after considering bushy joins, checks whether every relation > it's got is from the same foreign server, and if so, asks that foreign > server whether it would like to contribute any paths. Would that be > better or worse? A disadvantage is that if you've got something like > A LEFT JOIN B LEFT JOIN C LEFT JOIN D LEFT JOIN E LEFT JOIN F LEFT > JOIN G LEFT JOIN H LEFT JOIN I but none of the joins can be pushed > down (say, each join clause calls a non-pushdown-safe function) you'll > end up examining a pile of joinrels - at every level of the join tree > - and individually rejecting each one. With the > build-it-up-incrementally approach, you'll figure that all out at > level 2, and then after that there's nothing to do but give up > quickly. On the other hand, I'm afraid the incremental approach might > miss a trick: consider small LEFT JOIN (big INNER JOIN huge ON big.x = > huge.x) ON small.y = big.y AND small.z = huge.z, where all three are > foreign tables on the same server. If the output of the big/huge join > is big, none of those paths are going to survive at level 2, but the > overall join size might be very small, so we surely want a chance to > recover at level 3. (We discussed test cases of this form quite a bit > in the context of e2fa76d80ba571d4de8992de6386536867250474.) > > > > > The real problem here, is that with FDW in picture, the "optimal substructure" > property required by dynamic programming is broken. If A foreign join B foreign > join C is optimal solution for problem A join B join C, A foreign join B is not > necessarily optimal solution for subproblem A join B. While for local relations, > PostgreSQL has to compute each two way join itself, and thus chooses the cheapest > path for each two way join, FDW (esp. those working with real foreign servers) > do not compute the joins in two-way fashion and don't need to choose the cheapest > path for each two way join. > I cannot agree 100% because we cannot know whether A foreign join B foreign join C is optimal than A join B join C. For example, if (A x B) is estimated to generate O(N) rows but (A x B) x C is estimated to generate O(N x M) rows, local join may be optimal to process the final stage. Even if N-way remote join might be possible, we need to estimate the cost of remote join for each level, and make a decision whether it shall be pushed- down to the remote server based on the estimated cost. The hooks location Tom suggested requires FDW to compute a foreign-scan path for each joinrel during concentration of join combinations, but not multiple times for each joinrel. > A way to work around this is to leave the ForeignPaths (there can possibly be > only one foreign path per join relation) in the joinrel without removing them. > FDW should work on joining two relations if they have foreign paths in the list > of paths, irrespective of whether the cheapest path is foreign join path or not. > For the topmost joinrel, if the foreign path happens to be the cheapest one, the > whole join tree will be pushed down. > > On the other thread implementing foreign join for postgres_fdw, > postgresGetForeignJoinPaths(), is just looking at the cheapest path, which would > cause the problem you have described above. > It might be an idea if foreign-scan path is not wiped out regardless of the estimated cost, we will be able to construct an entirely remote-join path even if intermediation path is expensive than local join. A problem is, how to distinct these special paths from usual paths that are eliminated on the previous stage once its path is more expensive. Thanks, -- NEC OSS Promotion Center / PG-Strom Project KaiGai Kohei <kaigai@ak.jp.nec.com>
pgsql-hackers by date: