Re: Push down more full joins in postgres_fdw - Mailing list pgsql-hackers

From Ashutosh Bapat
Subject Re: Push down more full joins in postgres_fdw
Date
Msg-id CAFjFpRfwoSsJr9418b2jA7g0nwagjZSWhPQPUmK9M6z5XSOAqQ@mail.gmail.com
Whole thread Raw
In response to Re: Push down more full joins in postgres_fdw  (Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp>)
Responses Re: Push down more full joins in postgres_fdw  (Ashutosh Bapat <ashutosh.bapat@enterprisedb.com>)
Re: Push down more full joins in postgres_fdw  (Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp>)
List pgsql-hackers
Review for postgres-fdw-more-full-join-pushdown-v6 patch.

The patch applies cleanly and regression is clean (make check in
regress directory and that in postgres_fdw).

Here are some comments.
1. Because of the following code change, for a joinrel we might end up using
attrs_used, which will be garbage for a join rel. The assumption that tlist can
not be NIL for a join relation is wrong. Even for a join relation, tlist can be
NULL.  Consider query 'explain verbose select count(*) from ft1, ft2' Since
count(*) doesn't need any columns, the tlist is NIL. With the patch the server
crashes for this query.
-    if (foreignrel->reloptkind == RELOPT_JOINREL)
+    /*
+     * Note: tlist for a base relation might be non-NIL.  For example, if the
+     * base relation is an operand of a foreign join performing a full outer
+     * join and has non-NIL remote_conds, the base relation will be deparsed
+     * as a subquery, so the tlist for the base relation could be non-NIL.
+     */
+    if (tlist != NIL)    {
-        /* For a join relation use the input tlist */

We can not decide whether to use deparseExplicitTargetList or
deparse it from attrs_used based on the tlist. SELECT lists, whether it's
topmost SELECT or a subquery (even on a base relation), for deparsing a
JOIN query need to use deparseExplicitTargetList.

2. The code in deparseRangeTblRef() dealing with subqueries, is very similar to
deparseSelectStmtForRel(). The only thing deparseRangeTblRef() does not require
is the appending ORDER BY, which is determined by existence of pathkeys
argument. Probably we should reuse deparseSelectStmtForRel(), instead of
duplicating the code. This will also make the current changes to
deparseSelectSql unnecessary.

3. Why do we need following change? The elog() just few lines above says that
we expect only Var nodes. Why then we use deparseExpr() instead of
deparseVar()?        if (i > 0)            appendStringInfoString(buf, ", ");
-        deparseVar(var, context);
+        deparseExpr((Expr *) var, context);
        *retrieved_attrs = lappend_int(*retrieved_attrs, i + 1);

And I get the answer for that question, somewhere down in the patch
+    /*
+     * If the given expression is an output column of a subquery-in-FROM,
+     * deparse the alias to the expression instead.
+     */
+    if (IsA(node, Var))
+    {
+        int            tabno;
+        int            colno;
+
+        if (isSubqueryExpr(node, context->foreignrel, &tabno, &colno))
+        {
+            appendStringInfo(context->buf, "%s%d.%s%d",
+                             SS_TAB_ALIAS_PREFIX, tabno,
+                             SS_COL_ALIAS_PREFIX, colno);
+            return;
+        }
+    }
+

Functionally, this code belongs to deparseVar() and not deparseExpr(). Like all
other functions which handle Expr nodes, deparseExpr() is expected to be a
dispatcher to the node specific function.

4. This comment is good to have there, but is unrelated to this patch. May be a
separate patch?
+    /* Don't generate bad syntax if no columns */

5. Changed comment doesn't say anything different from the original comment. It
may have been good to have it this way to start with, but changing it now
doesn't bring anything new. You seem to have just merged prologue of the
function with a comment in that function. I think, this change is unnecessary.
- * The function constructs ... JOIN ... ON ... for join relation. For a base
- * relation it just returns schema-qualified tablename, with the appropriate
- * alias if so requested.
+ * For a join relation the clause of the following form is appended to buf:
+ * ((outer relation) <join type> (inner relation) ON (joinclauses))
+ * For a base relation the function just adds the schema-qualified tablename,
+ * with the appropriate alias if so requested.
+    /* Don't generate bad syntax if no columns */

6. Change may be good but unrelated to this patch. May be material for a
separate patch. There are few such changes in this function. While these
changes may be good by themselves, they distract reviewer from the goal of the
patch.
-            appendStringInfo(buf, "(");
+            appendStringInfoChar(buf, '(');

7. I don't understand why you need to change this function so much. Take for
example the following change.
-        RelOptInfo *rel_o = fpinfo->outerrel;
-        RelOptInfo *rel_i = fpinfo->innerrel;
-        StringInfoData join_sql_o;
-        StringInfoData join_sql_i;
+        /* Begin the FROM clause entry */
+        appendStringInfoChar(buf, '(');
        /* Deparse outer relation */
-        initStringInfo(&join_sql_o);
-        deparseFromExprForRel(&join_sql_o, root, rel_o, true, params_list);
+        deparseRangeTblRef(buf, root,
+                           fpinfo->outerrel,
+                           fpinfo->make_outerrel_subquery,
+                           params_list);        /* Deparse outer relation */
-        initStringInfo(&join_sql_o);
-        deparseFromExprForRel(&join_sql_o, root, rel_o, true, params_list);
+        deparseRangeTblRef(buf, root,
+                           fpinfo->outerrel,
+                           fpinfo->make_outerrel_subquery,
+                           params_list);

It removes few variables, instead directly accesses the members from fpinfo.
Also, deparses the individual relations in the provided buffer directly, rather
than in separate buffers in the original code. Instead of this, all you had to
do was replace a call
-        deparseFromExprForRel(&join_sql_o, root, rel_o, true, params_list);
with
+        deparseRangeTblRef(&join_sql_o, root, rel_o,
fpinfo->make_outerrel_subquery, params_list)
Similarly for the inner relation. Again, the changes you have done might have
been good, if done in the original code, but doing those in this patch just
creates distractions and increases the size of the patch.

Similarly changes
-        /* Append join clause; (TRUE) if no join clause */
+        /* Append join conditions */

+        {
+            /* No join conditions; add "(TRUE)" */            appendStringInfoString(buf, "(TRUE)");
+        }
+        appendStringInfoString(buf, " ON ");

-        /* End the FROM clause entry. */
-        appendStringInfo(buf, ")");


+        /* End the FROM clause entry */
+        appendStringInfoChar(buf, ')');

8. Why have you changed the name of variable here?
-        if (use_alias)
+        if (add_rel_alias)

9. In deparseRangeTblRef(), the targetlist for the subquery is obtained by
calling build_tlist_to_deparse(), but appendSubselectAlias() constructs the
column aliases based on foreignrel->reltarget->exprs. Those two are usually
same, but there is no code which guarantees that. Instead, pass tlist to
appendSubselectAlias() or simply pass the number of entries in that list
(list_length(tlist), which is all it needs. OR use foreignrel->reltarget->exprs
directly instead of calling build_tlist_to_deparse(). Similar problem exists
with getSubselectAliasInfo().

10. Deparsing a query in the form of a subquery makes it hard-to-read. The
original code aimed at avoiding a subquery. Also, this change has created many
expected output changes, which again seem unnecessary. In fact, having the
pushable join clauses of an inner join in ON clause, which is closer to JOIN
clause, is better than having them farther in the WHERE clause.
-    /*
-     * For an inner join, all restrictions can be treated alike. Treating the
-     * pushed down conditions as join conditions allows a top level full outer
-     * join to be deparsed without requiring subqueries.
-     */
-    if (jointype == JOIN_INNER)
-    {
-        Assert(!fpinfo->joinclauses);
-        fpinfo->joinclauses = fpinfo->remote_conds;
-        fpinfo->remote_conds = NIL;
-    }
-

11. I have reworded following comment and restructured the code that follows it
in the attached patch.
+    /*
+     * Set the subquery information.  If the relation performs a full outer
+     * join and if the input relations have non-NIL remote_conds, the input
+     * relations need to be deparsed as a subquery.
+     */
The code esp. the if .. else .. block followed by another one, made it a bit
unreadable. Hence I restructured it so that it's readable and doesn't require
variable "relids" from earlier code, which was being reused. Let me know if
this change looks good.

12. The code below deletes the condition, which is understandable.
-            if (fpinfo_i->remote_conds || fpinfo_o->remote_conds)
But why does it add a new unrelated condition here? What the comment claims,
looks like an existing bug, unrelated to the patch. Can you please give an
example? If that it's an existing bug, it should be fixed as a separate patch.
I don't understand the relation of this code with what the patch is doing.
+            /*
+             * We can't do anything here, and if there are any non-Vars in the
+             * outerrel/innerrel's reltarget, give up pushing down this join,
+             * because the deparsing logic can't support such a case currently.
+             */
+            if (reltarget_has_non_vars(outerrel))
+                return false;
+            if (reltarget_has_non_vars(innerrel))                return false;

13. The comment below is missing the main purpose i.e. creating a a unique
alias, in case the relation gets converted into a subquery. Lowest or highest
relid will create a unique alias at given level of join and that would be more
future proof. If we decide to consider paths for every join order, following
comment will no more be true.
+    /*
+     * Set the relation index.  This is defined as the position of this
+     * joinrel in the join_rel_list list plus the length of the rtable list.
+     * Note that since this joinrel is at the end of the list when we are
+     * called, we can get the position by list_length.
+     */
+    fpinfo->relation_index =
+        list_length(root->parse->rtable) + list_length(root->join_rel_list);

14. The function name talks about non-vars but the Assert few lines below
asserts that every node there is a Var. Need better naming.
+reltarget_has_non_vars(RelOptInfo *foreignrel)
+{
+    ListCell   *lc;
+
+    foreach(lc, foreignrel->reltarget->exprs)
+    {
+        Var           *var = (Var *) lfirst(lc);
+
+        Assert(IsA(var, Var));
And also an explanation for the claim
+ * Note: currently deparseExplicitTargetList can't properly handle such Vars.

15. While deparsing every Var, we are descending the RelOptInfo hierarchy.
Instead of that, we should probably gather all the alias information once and
store it in context as an array indexed by relid. While deparsing a Var we can
get the targetlist and alias for a given relation by using var->varno as index.
And then search for given Var node in the targetlist to deparse the column name
by its position in the targetlist. For the relations that are not converted
into subqueries, this array will not hold any information and the Var will be
deparsed as it's being done today.

Testing
-------
1. The code is changing deparser but doesn't have testcases
testing impact of the code. For example, we should have testcases with remote
query deparsed as nested subqueries or nested subqueries within subqueries and
so on May be testcases where a relation deeper in the RelOptInfo hierarchy has
conditions but it's immediate upper relation does not have those.

2. The only testcase added by this patch, copies an existing case and adds a
whole row reference to one of the relations being joined. Instead we could add
that whole-row reference to the existing testcase itself.

On Thu, Oct 13, 2016 at 4:05 PM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:
> On 2016/09/29 21:12, Etsuro Fujita wrote:
>>
>> Please find attached an updated version of the patch.
>
>
> Here is a new version (V6) of the patch, which is basically the same as the
> previous patch, but I slightly modified that patch.  Another patch I am
> attaching is an updated patch for pushing down PHVs to the remote, which is
> created on top of the V6 patch.
>
> Best regards,
> Etsuro Fujita



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



pgsql-hackers by date:

Previous
From: Ashutosh Bapat
Date:
Subject: Re: Aggregate Push Down - Performing aggregation on foreign server
Next
From: Julien Rouhaud
Date:
Subject: issue with track_commit_timestamp and server restart