Re: Foreign join pushdown vs EvalPlanQual - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Foreign join pushdown vs EvalPlanQual
Date
Msg-id CA+TgmoYAz=vpDn_EVBcBzTa=JUiHBGmtpreJoP+A=p80mkt0UQ@mail.gmail.com
Whole thread Raw
In response to Re: Foreign join pushdown vs EvalPlanQual  (Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp>)
Responses Re: Foreign join pushdown vs EvalPlanQual
Re: Foreign join pushdown vs EvalPlanQual
List pgsql-hackers
On Mon, Sep 28, 2015 at 11:15 PM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:
> I thought the same thing [1].  While I thought it was relatively easy to
> make changes to RefetchForeignRow that way for the foreign table case
> (scanrelid>0), I was not sure how hard it would be to do so for the foreign
> join case (scanrelid==0).  So, I proposed to leave that changes for 9.6.
> I'll have a rethink on this issue along the lines of that approach.

Well, I spent some more time looking at this today, and testing it out
using a fixed-up version of your foreign_join_v16 patch, and I decided
that RefetchForeignRow is basically a red herring.  That's only used
for FDWs that do late row locking, but postgres_fdw (and probably many
others) do early row locking, in which case RefetchForeignRow never
gets called. Instead, the row is treated as a "non-locked source row"
by ExecLockRows (even though it is in fact locked) and is re-fetched
by EvalPlanQualFetchRowMarks.  We should probably update the comment
about non-locked source rows to mention the case of FDWs that do early
row locking.

Anyway, everything appears to work OK up to this point: we correctly
retrieve the saved whole-rows from the foreign side and call
EvalPlanQualSetTuple on each one, setting es_epqTuple[rti - 1] and
es_epqTupleSet[rti - 1].  So far, so good.  Now we call
EvalPlanQualNext, and that's where we get into trouble.  We've got the
already-locked tuples from the foreign side and those tuples CANNOT
have gone away or been modified because we have already locked them.
So, all the foreign join needs to do is return the same tuple that it
returned before: the EPQ recheck was triggered by some *other* table
involved in the plan, not our table.  A local table also involved in
the query, or conceivably a foreign table that does late row locking,
could have had something change under it after the row was fetched,
but in postgres_fdw that can't happen because we locked the row up
front.  And thus, again, all we need to do is re-return the same
tuple.  But we don't have that.  Instead, the ROW_MARK_COPY logic has
caused us to preserve a copy of each *baserel* tuple.

Now, this is as sad as can be.  Early row locking has huge advantages
for FDWs, both in terms of minimizing server round trips and also
because the FDW doesn't really need to do anything about EPQ.  Sure,
it's inefficient to carry around whole-row references, but it makes
life easy for the FDW author.

So, if we wanted to fix this in a way that preserves the spirit of
what's there now, it seems to me that we'd want the FDW to return
something that's like a whole row reference, but represents the output
of the foreign join rather than some underlying base table.  And then
get the EPQ machinery to have the evaluation of the ForeignScan for
the join, when it happens in an EPQ context, to return that tuple.
But I don't really have a good idea how to do that.

More thought seems needed here...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: Add pg_basebackup single tar output format
Next
From: Stephen Frost
Date:
Subject: Re: Arguable RLS security bug, EvalPlanQual() paranoia