Re: Foreign join pushdown vs EvalPlanQual - Mailing list pgsql-hackers

From Kouhei Kaigai
Subject Re: Foreign join pushdown vs EvalPlanQual
Date
Msg-id 9A28C8860F777E439AA12E8AEA7694F80115A959@BPXM15GP.gisp.nec.co.jp
Whole thread Raw
In response to Foreign join pushdown vs EvalPlanQual  (Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp>)
Responses Re: Foreign join pushdown vs EvalPlanQual  (Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp>)
List pgsql-hackers
> On Mon, Oct 19, 2015 at 12:17 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> > 1. Fetch every EPQ slot of base relations involved in this join.
> >    In case of ForeignScan, all the required tuples of base relations
> >    should be filled because it is preliminary fetched by whole-row var
> >    if earlier row-locking, or by RefetchForeignRow if later row-locking.
> >    In case of CustomScan, it can call ExecProcNode() to generate the
> >    first tuple even if it does not exists.
> >    Anyway, I assume all the component tuples of this join can be fetched
> >    using existing EPQ slot because they are owned by base relations.
> >
> > 2. The recheck callback fills up ss_ScanTupleSlot according to the
> >    fdw_scan_tlist or custom_scan_tlist. The callback knows the best way
> >    to reconstruct the joined tuple from the base relations' tuple fetched
> >    on the step-1.
> >    For example, if joined tuple is consists of (t1.a, t1.b, t2.x, t3.s),
> >    the callback picks up 't1.a' and 't1.b' from the tuple fetched from
> >    the EPQ slot of t1, then put these values onto the 1st and 2nd slot.
> >    Also, it picks up 't2.x' from the tuple fetched from the EPQ slot of
> >    t2, then put this value onto the 3rd slot. Same as above for 't3'.
> >    At this point, ss_ScanTupleSlot gets filled up by the expected fields
> >    as if join clauses are satisfied.
> >
> > 3. The recheck callback also checks qualifiers of base relations that
> >    are pushed down. Because expression nodes kept in fds_exprs or
> >    custom_exprs are initialized to reference ss_ScanTupleSlot at setrefs.c,
> >    it is more reasonable to run ExecQual after the step-2.
> >    If one of the qualifiers of base relation was evaluated as false,
> >    the recheck callback returns an empty slot.
> >
> > 4. The recheck callback also checks join-clauses to join underlying
> >    base relations. Due to same reason at step-3, it is more reasonable
> >    to execute ExecQual after the step-2.
> >    If one of the join-clauses was evaluated as false, the recheck returns
> >    an empty slot.
> >    Elsewhere, it returns ss_ScanTupleSlot, then ExecScan will process
> >    any further jobs.
> 
> Hmm, I guess this would work.  But it still feels unnatural to me.  It
> feels like we haven't really pushed down the join.  It's pushed down
> except when there's an EPQ check, and then it's not.  So we need a
> whole alternate plan tree.  With my proposal, we don't need that.
>
Even if we fetch whole-row of both side, join pushdown is exactly working
because we can receive less number of rows than local join + 2 of foreign-
scan. (If planner works well, we can expect join-path that increases number
of rows shall be dropped.)

One downside of my proposition is growth of width for individual rows.
It is a trade-off situation. The above approach takes no changes for
existing EPQ infrastructure, thus, its implementation design is clear.
On the other hands, your approach will reduce traffic over the network,
however, it is still unclear how we integrate scanrelid==0 with EPQ
infrastructure.

On the other hands, in case of custom-scan that takes underlying local
scan-nodes, thus, any kind of ROW_MARK_* except for ROW_MARK_COPY will
happen. I think width of the joined tuples are relatively minor issue
than FDW cases. However, we cannot expect the fetched rows are protected
by early row-locking mechanism, so probability of re-fetching rows and
reconstruction of joined-tuple has relatively higher priority.

> There is also some possible loss of efficiency with this approach.
> Suppose that we have two tables ft1 and ft2 which are being joined,
> and we push down the join.  They are being joined on an integer
> column, and the join needs to select several other columns as well.
> However, ft1 and ft2 are very wide tables that also contain some text
> columns.   The query is like this:
> 
> SELECT localtab.a, ft1.p, ft2.p FROM localtab LEFT JOIN (ft1 JOIN ft2
> ON ft1.x = ft2.x AND ft1.huge ~ 'stuff' AND f2.huge2 ~ 'nonsense') ON
> localtab.q = ft1.q;
> 
> If we refetch each row individually, we will need a wholerow image of
> ft1 and ft2 that includes all columns, or at least f1.huge and
> f2.huge2.  If we just fetch a wholerow image of the join output, we
> can exclude those.  The only thing we need to recheck is that it's
> still the case that localtab.q = ft1.q (because the value of
> localtab.q might have changed).
>
Isn't it possible to distinguish whole-var reference required by
locking mechanism, from the ones required by users?
(Does resjunk=true give us a hint?)

In case when whole-var reference is required by system internal, it
seems to me harmless to put dummy NULLs on unreferenced columns.
Is it a feasible idea?

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


pgsql-hackers by date:

Previous
From: Tomas Vondra
Date:
Subject: Re: Multi-column distinctness.
Next
From: Etsuro Fujita
Date:
Subject: Re: Foreign join pushdown vs EvalPlanQual