Re: Hooking at standard_join_search (Was: Re: Foreign join pushdown vs EvalPlanQual) - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Hooking at standard_join_search (Was: Re: Foreign join pushdown vs EvalPlanQual)
Date
Msg-id CA+TgmoZH9PB8BC+Z3rE7wo8CwuxAF7VP3066iSG39QfR1jJ+UQ@mail.gmail.com
Whole thread Raw
In response to Re: Hooking at standard_join_search (Was: Re: Foreign join pushdown vs EvalPlanQual)  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Hooking at standard_join_search (Was: Re: Foreign join pushdown vs EvalPlanQual)  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Hooking at standard_join_search (Was: Re: Foreign join pushdown vs EvalPlanQual)  (Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp>)
List pgsql-hackers
On Wed, Sep 2, 2015 at 1:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Wed, Sep 2, 2015 at 10:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> But if you have in mind that typical FDWs would actually create join paths
>>> at that point, consider that
>>>
>>> 1. The FDW would have to find all the combinations of its supplied
>>> relations (unless you are only intending to generate one path for the
>>> union of all such rels, which seems pretty narrow-minded from here).
>
>> Well, if the remote end is another database server, presumably we can
>> leave it to optimize the query, so why would we need more than one
>> path?
>
> If you have say 5 relations in the query, 3 of which are foreign, it might
> make sense to join all 3 at the remote end, or maybe you should only join
> 2 of them remotely because it's better to then join to one of the local
> rels before joining the last remote rel.

True.  But that's not the problem I'm concerned about.  Suppose the
query looks like this:

SELECT * FROM ft1 LEFT JOIN ft2 ON ft1.x = ft2.x LEFT JOIN t1 ON ft2.y
= t1.y LEFT JOIN ft3 ON ft1.z = ft3.z LEFT JOIN t2 ON ft1.w = t2.w;

Now, no matter where we put the hooks, we'll consider foreign join
paths for all of the various combinations of tables that we could push
down.  We'll decide between those various options based on cost, which
is fine.  But let's consider just one joinrel, the one that includes
(ft1 ft2 ft3).  Assuming that the remote tables have the same name as
the local tables.  The path that implements a pushed-down join of all
three tables will send one of these two queries to the remote server:

SELECT * FROM ft1 LEFT JOIN ft2 ON ft1.x = ft2.x LEFT JOIN ft3 ON ft1.z = ft3.z;
SELECT * FROM ft1 LEFT JOIN ft3 ON ft1.z = ft3.z LEFT JOIN ft2 ON
ft1.x = ft2.x ;

We need to generate one of those two queries, and we need to figure
out what the remote server thinks it will cost to execute.  We
presumably do not to cost both of them, because if it's legal to
commute the joins, the remote server can and will do that itself.  It
would be stupid to cost both possible queries if the remote server is
going to pick the same plan either way.  However - and this is the key
point - the one we choose to generate *must represent a legal join
order*.  If the ft1-ft2 join were a FULL JOIN instead of a LEFT JOIN,
the second query wouldn't be a legal thing to send to the remote
server.  So, the problem I'm worried about is: given that we know we
want to at least consider the path that pushes the whole join to the
remote server, how do we construct an SQL query that embodies a legal
join order of the relations being pushed down?

> Even if you claim that that
> would never make sense from a cost standpoint (a claim easily seen to be
> silly), there might not be any legal way to join all 3 directly because of
> join order constraints.
>
> The larger point is that we can't expect the remote server to be fully
> responsible for optimizing, because it will know nothing of what's being
> done on our end.

No argument with any of that.

>> So, the problem is that I don't think this entirely skirts the
>> join_is_legal issues, which are a principal point of concern for me.
>> Say this is a joinrel between (A B) and (C D E).  We need to generate
>> an SQL query for (A B C D E).  We know that the outermost syntactic
>> join can be (A B) to (C D E).  But how do we know which join orders
>> are legal as among (C D E)?  Maybe there's a simple way to handle this
>> that I'm not seeing.
>
> Well, if the joins get built up in the way I think should happen, we'd
> have already considered (C D E), and we could have recorded the legal join
> orders within that at the time.  (I imagine that we should allow FDWs to
> store some data within RelOptInfo structs that represent foreign joins
> belonging entirely to them, so that there'd be a handy place to keep that
> data till later.)

Yes, that would help.  Can fdw_private serve that purpose, or do we
need something else?

> Or we could trawl through the paths associated with the
> child joinrel, which will presumably include instances of every reasonable
> sub-join combination.  Or the FDW could look at the SpecialJoinInfo data
> and determine things for itself (or more likely, ask join_is_legal about
> that).

Yeah, this is the part I'm worried will be complex, which accounts for
the current hook placement.  I'm worried that trawling through that
SpecialJoinInfo data will end up needing to duplicate much of
make_join_rel and add_paths_to_joinrel.  For example, consider:

SELECT * FROM verysmall v JOIN (bigft1 FULL JOIN bigft2 ON bigft1.x =
bigft2.x) ON v.q = bigft1.q AND v.r = bigft2.r;

The best path for this plan is presumably something like this:

Nested Loop
-> Seq Scan on verysmall v
-> Foreign Scan on bigft1 and bigft2   Remote SQL: SELECT * FROM bigft1 FULL JOIN bigft2 ON bigft1.x =
bigft2.x AND bigft1.q = $1 AND bigft2.r = $2

Now, how is the FDW going to figure out that it needs to generate this
parameterized path without duplicating this code from
add_paths_to_joinrel?
   /*    * Decide whether it's sensible to generate parameterized paths for this    * joinrel, and if so, which
relationssuch paths should require.  There    * is usually no need to create a parameterized result path unless there
 
...

Maybe there's a very simple answer to this question and I'm just not
seeing it, but I really don't see how that's going to work.

> In the case of postgres_fdw, I think the actual requirement will be to be
> able to reconstruct a SQL query that correctly expresses the join; that
> is, we need to send over something like "from c left join d on (...) full
> join e on (...)", not just "from c, d, e", or we'll get totally bogus
> estimates as well as bogus execution results.

Agreed.

> Offhand I think that the
> most likely way to build that text will be to examine the query's jointree
> to see where c,d,e appear in it.  But in any case, that's a separate issue
> and I fail to see how plopping the join search problem into the FDW's lap
> would make it any easier.

Yeah, I am not advocating for putting the hook in
standard_join_search.  I'm explaining why I put it in
add_paths_to_joinrel instead of, as I believe you were advocating, in
make_join_rel prior to the big switch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Dave Page
Date:
Subject: Re: Proposal: Implement failover on libpq connect level.
Next
From: Anastasia Lubennikova
Date:
Subject: Re: [PROPOSAL] Effective storage of duplicates in B-tree index.