Re: Custom/Foreign-Join-APIs (Re: [v9.5] Custom Plan API) - Mailing list pgsql-hackers

From Kouhei Kaigai
Subject Re: Custom/Foreign-Join-APIs (Re: [v9.5] Custom Plan API)
Date
Msg-id 9A28C8860F777E439AA12E8AEA7694F8010CD77E@BPXM15GP.gisp.nec.co.jp
Whole thread Raw
In response to Custom/Foreign-Join-APIs (Re: [v9.5] Custom Plan API)  (Kouhei Kaigai <kaigai@ak.jp.nec.com>)
Responses Re: Custom/Foreign-Join-APIs (Re: [v9.5] Custom Plan API)
List pgsql-hackers
Hanada-san,

> In addition to your comments, I removed useless code that retrieves ForeignPath
> from outer/inner RelOptInfo and store them into ForeignPath#fdw_private.  Now
> postgres_fdw’s join pushd-down is free from existence of ForeignPath under the
> join relation.  This means that we can support the case Robert mentioned before,
> that whole "(huge JOIN large) JOIN small” can be pushed down even if “(huge JOIN
> large)” is dominated by another join path.
>
Yes, it's definitely reasonable design, and fits intention of the interface.
I should point out it from the beginning. :-)

> > "l" of the first SELECT represents a whole-row reference.
> > However, underlying SELECT contains system columns in its target-
> > list.
> >
> > Is it available to construct such a query?
> >  SELECT l.a1, l.a2 FROM (SELECT (id,a,atext), ctid) l (a1, a2) ...
> >                                  ^^^^^^^^^^
> 
> Simple relation reference such as "l" is not sufficient for the purpose, yes.
> But putting columns in parentheses would not work when a user column is referenced
> in original query.
> 
> I implemented deparseProjectionSql to use ROW(...) expression for a whole-row
> reference in the target list, in addition to ordinary column references for
> actually used columns and ctid.
> 
> Please see the test case for mixed use of ctid and whole-row reference to
> postgres_fdw’s regression tests.  Now a whole-row reference in the remote query
> looks like this:
> 
It seems to me that deparseProjectionSql() works properly.

> -- ctid with whole-row reference
> EXPLAIN (COSTS false, VERBOSE)
> SELECT t1.ctid, t1, t2 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c3,
> t1.c1 OFFSET 100 LIMIT 10;
> 
> ----------------------------------------------------------------------------
> ----------------------------------------------------------------------------
> -------------------------------------------------------------------------
>  Limit
>    Output: t1.ctid, t1.*, t2.*, t1.c3, t1.c1
>    ->  Sort
>          Output: t1.ctid, t1.*, t2.*, t1.c3, t1.c1
>          Sort Key: t1.c3, t1.c1
>          ->  Foreign Scan
>                Output: t1.ctid, t1.*, t2.*, t1.c3, t1.c1
>                Remote SQL: SELECT l.a1, l.a2, l.a3, l.a4, r.a1 FROM (SELECT l.a7,
> ROW(l.a10, l.a11, l.a12, l.a13, l.a14, l.a15, l.a16, l.a17), l.a12, l.a10 FROM
> (SELECT "C 1" a10, c2 a11, c3 a12, c4 a13, c5 a14, c6 a15, c7 a1
> (8 rows)
> 
> In fact l.a12 and l.a10, for t1.c3 and t1.c1, are redundant in transferred data,
> but IMO this would simplify the code for deparsing.
>
I agree. Even if we can reduce networking cost a little, tuple
construction takes CPU cycles. Your decision is fair enough for
me.

> > * merge_fpinfo()
> > It seems to me fpinfo->rows should be joinrel->rows, and
> > fpinfo->width also should be joinrel->width.
> > No need to have special intelligence here, isn't it?
> 
> 
> Oops. They are vestige of my struggle which disabled SELECT clause optimization
> (omit unused columns).  Now width and rows are inherited from joinrel.  Besides
> that, fdw_startup_cost and fdw_tuple_cost seem wrong, so I fixed them to use simple
> summary, not average.
>
Does fpinfo->fdw_startup_cost represent a cost to open connection to remote
PostgreSQL, doesn't it?

postgres_fdw.c:1757 says as follows:
   /*    * Add some additional cost factors to account for connection overhead    * (fdw_startup_cost), transferring
dataacross the network    * (fdw_tuple_cost per retrieved row), and local manipulation of the data    * (cpu_tuple_cost
perretrieved row).    */
 

If so, does a ForeignScan that involves 100 underlying relation takes 100
times heavy network operations on startup? Probably, no.
I think, average is better than sum, and max of them will reflect the cost
more correctly.
Also, fdw_tuple_cost introduce the cost of data transfer over the network.
I thinks, weighted average is the best strategy, like: fpinfo->fdw_tuple_cost =   (fpinfo_o->width / (fpinfo_o->width +
fpinfo_i->width)* fpinfo_o->fdw_tuple_cost +   (fpinfo_i->width / (fpinfo_o->width + fpinfo_i->width) *
fpinfo_i->fdw_tuple_cost;

That's just my suggestion. Please apply the best way you thought.

> > * explain output
> >
> > EXPLAIN output may be a bit insufficient to know what does it
> > actually try to do.
> >
> > postgres=# explain select * from ft1,ft2 where a = b;
> >                       QUERY PLAN
> > --------------------------------------------------------
> > Foreign Scan  (cost=200.00..212.80 rows=1280 width=80)
> > (1 row)
> >
> > Even though it is not an immediate request, it seems to me
> > better to show users joined relations and remote ON/WHERE
> > clause here.
> >
> 
> Like this?
> 
> Foreign Scan on ft1 INNER JOIN ft2 ON ft1.a = ft2.b  (cost=200.00..212.80
> rows=1280 width=80)
> …
>
No. This line is produced by ExplainScanTarget(), so it requires the
backend knowledge to individual FDW.
Rather than the backend, postgresExplainForeignScan() shall produce
the output.

> It might produce a very long line in a case of joining many tables because it
> contains most of remote query other than SELECT clause, but I prefer detailed.
> Another idea is to print “Join Cond” and “Remote Filter” as separated EXPLAIN
> items.
>
It is good, if postgres_fdw can generate relations name involved in
the join for each level, and join cond/remote filter individually.

> Note that v8 patch doesn’t contain this change yet!
>
It is a "nice to have" feature. So, I don't think the first commit needs
to support this. Just a suggestion in the next step.


* implementation suggestion

At the deparseJoinSql(), 

+   /* print SELECT clause of the join scan */
+   initStringInfo(&selbuf);
+   i = 0;
+   foreach(lc, baserel->reltargetlist)
+   {
+       Var        *var = (Var *) lfirst(lc);
+       TargetEntry *tle;
+
+       if (i > 0)
+           appendStringInfoString(&selbuf, ", ");
+       deparseJoinVar(var, &context);
+
+       tle = makeTargetEntry((Expr *) copyObject(var),
+                             i + 1, pstrdup(""), false);
+       if (fdw_ps_tlist)
+           *fdw_ps_tlist = lappend(*fdw_ps_tlist, copyObject(tle));
+
+       *retrieved_attrs = lappend_int(*retrieved_attrs, i + 1);
+
+       i++;
+   }

The tle is a copy of the original target-entry, and var-node is also
copied one. Why is the tle copied on lappend() again?
Also, NULL as acceptable as 3rd argument of makeTargetEntry.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: "rejected" vs "returned with feedback" in new CF app
Next
From: Fujii Masao
Date:
Subject: Re: Removal of FORCE option in REINDEX