Re: Getting sorted data from foreign server for merge join - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Getting sorted data from foreign server for merge join
Date
Msg-id CA+TgmoYcCabE9i8oGXngtsjeuEh-vXfQ43B7Djw=57vO83OyFQ@mail.gmail.com
Whole thread Raw
In response to Getting sorted data from foreign server for merge join  (Ashutosh Bapat <ashutosh.bapat@enterprisedb.com>)
Responses Re: Getting sorted data from foreign server for merge join  (Kevin Grittner <kgrittn@ymail.com>)
Re: Getting sorted data from foreign server for merge join  (Corey Huinker <corey.huinker@gmail.com>)
Re: Getting sorted data from foreign server for merge join  (Ashutosh Bapat <ashutosh.bapat@enterprisedb.com>)
List pgsql-hackers
On Thu, Nov 5, 2015 at 11:54 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> Hi All,
> PFA patch to get data sorted from the foreign server (postgres_fdw)
> according to the pathkeys useful for merge join.
>
> For a given base relation (extendable to join when that becomes available in
> postgres_fdw), the patch tries to find merge joinable clauses. It then adds
> paths with pathkeys extracted out of these merge joinable clauses. The merge
> joinable clauses form equivalence classes. The patch searches in
> root->eq_classes for equivalence members belonging to the given relation.
> For every such expression it creates a single member pathkey list and
> corresponding path. The test postgres_fdw.sql has an existing join which
> uses merge join. With this patch the data is sorted on the foreign server
> than locally.
>
> While mergejoinable clauses can be obtained from rel->joininfo as well. But
> rel->joininfo contains other clauses as well and we need extra efforts to
> remove duplicates if the same expression appears in multiple merge joinable
> clauses.
>
> Two joining relations can have multiple merge joinable clauses, requiring
> multi-member pathkeys so that merge join is efficient to the maximum extent.
> The order in which the expressions appears in pathkeys can change the costs
> of sorting the data according to the pathkeys, depending upon the
> expressions and the presence of indexes containing those expressions. Thus
> ideally we would need to club all the expressions appearing in all the
> clauses for given two relations and create paths with pathkeys for every
> order of these expressions.That explodes the number of possible paths. We
> may restrict the number of paths created by considering only certain orders
> like sort_inner_and_outer(). In any case, costing such paths increases the
> planning time which may not be worth it. So, this patch uses a heuristic
> approach of creating single member pathkeys path for every merge joinable
> expression.
>
> The pathkeys need to be canonicalised using make_canonical_pathkey(), which
> is a static function. I have added a TODO and comments in the patch
> explaining possible ways to avoid "extern"alization of this function.
>
> Comments/suggestions are welcome.

I think this approach is generally reasonable, but I suggested parts
of it, so may be biased.  I would be interested in hearing the
opinions of others.

Random notes:

"possibily" is a typo.

usable_pklist is confusing because it seems like it might be talking
about primary keys rather than pathkeys.  Also, I realize now, looking
at this again, that you're saying "usable" when what I really think
you mean is "useful".  Lots of pathkeys are usable, but only a few of
those are useful.  I suggest renaming usable_pathkeys to
query_pathkeys and usable_pklist to useful_pathkeys.  Similarly, let's
rename generate_pathkeys_for_relation() to
get_useful_pathkeys_for_relation().

Although I'm usually on the side of marking things as extern whenever
we find it convenient, I'm nervous about doing that to
make_canonical_pathkey(), because it has side effects.  Searching the
list of canonical pathkeys for the one we need is reasonable, but is
it really right to ever think that we might create a new one at this
stage?  Maybe it is, but if so I'd like to hear a good explanation as
to why.
Is the comment "Equivalence classes covering relations other than the
current one are of interest here" missing a "not"?

I don't find this comment illuminating:

+         * In case of child relation, we need to check that the
+         * equivalence class indicates a join to a relation other than
+         * parents, other children and itself (something similar to above).
+         * Otherwise we will end up creating useless paths. The code below is
+         * similar to generate_implied_equalities_for_column(), which might
+         * give a hint.

That basically just says that we have to do it this way because the
other way would be wrong.  But it doesn't say WHY the other way would
be wrong. Then a few lines later, you have another comment which says
the same thing again:

+                /*
+                 * Ignore equivalence members which correspond to children
+                 * or same relation or to parent relations
+                 */

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: patch for geqo tweaks
Next
From: Robert Haas
Date:
Subject: Re: Within CF app, "Bug Fixes" should be "Bug Fixes/Refactoring"