Re: CustomScan under the Gather node? - Mailing list pgsql-hackers

From Kouhei Kaigai
Subject Re: CustomScan under the Gather node?
Msg-id 9A28C8860F777E439AA12E8AEA7694F8011A2759@BPXM15GP.gisp.nec.co.jp
In response to CustomScan under the Gather node?  (Kouhei Kaigai <kaigai@ak.jp.nec.com>)
List pgsql-hackers
> On Tue, Jan 26, 2016 at 1:30 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> > What enhancements will be necessary to implement a feature similar
> > to partial seq-scan using the custom-scan interface?
> >
> > It seems to me callbacks on the three points below are needed.
> > * ExecParallelEstimate
> > * ExecParallelInitializeDSM
> > * ExecParallelInitializeWorker
> >
> > Anything else?
> > Does ForeignScan also need an equivalent enhancement?
> 
> For postgres_fdw, running the query from a parallel worker would
> change the transaction semantics.  Suppose you begin a transaction,
> UPDATE data on the foreign server, and then run a parallel query.  If
> the leader performs the ForeignScan it will see the uncommitted
> UPDATE, but a worker would have to make its own connection which would
> not be part of the same transaction and which would therefore not see
> the update.  That's a problem.
>
Ah, yes. As long as the FDW driver can ensure that the remote session
has no uncommitted data, pg_export_snapshot() might give us a way to
let workers see the same data as the leader. However, once the session
has written something, the FDW driver has to prohibit parallel execution.

> Also, for postgres_fdw, and many other FDWs I suspect, the assumption
> is that most of the work is being done on the remote side, so doing
> the work in a parallel worker doesn't seem super interesting.  Instead
> of incurring transfer costs to move the data from remote to local, we
> incur two sets of transfer costs: first remote to local, then worker
> to leader.  Ouch.  I think a more promising line of inquiry is to try
> to provide asynchronous execution when we have something like:
> 
> Append
> -> Foreign Scan
> -> Foreign Scan
> 
> ...so that we can return a row from whichever Foreign Scan receives
> data back from the remote server first.
> 
> So it's not impossible that an FDW author could want this, but mostly
> probably not.  I think.
>
Yes, I have the same opinion. Local parallelism is unlikely to be
valuable for the class of FDWs that obtain data from a remote
server (e.g., postgres_fdw), except when the cost of packing and
unpacking data sent over the network is the major bottleneck.
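
For reference, the "return a row from whichever Foreign Scan receives
data first" idea quoted above can be sketched outside the executor with
plain poll(). This is only an illustration under my own assumptions:
demo_first_ready() and DEMO_MAX_CHILDREN are hypothetical names, and
the descriptors stand in for whatever sockets the remote connections
would expose. It is not how an asynchronous Append would have to be
built.

```c
#include <poll.h>
#include <unistd.h>

#define DEMO_MAX_CHILDREN 8     /* hypothetical limit for this sketch */

/*
 * Wait up to timeout_ms for any child descriptor to become readable.
 * Returns the lowest index whose descriptor has data ready, or -1 on
 * timeout, error, or an unsupported number of children.  fds[] stands
 * in for the sockets each Foreign Scan would be waiting on.
 */
int
demo_first_ready(const int *fds, int nfds, int timeout_ms)
{
    struct pollfd pfds[DEMO_MAX_CHILDREN];
    int         i;

    if (nfds <= 0 || nfds > DEMO_MAX_CHILDREN)
        return -1;

    for (i = 0; i < nfds; i++)
    {
        pfds[i].fd = fds[i];
        pfds[i].events = POLLIN;
        pfds[i].revents = 0;
    }

    /* Block until at least one child has data, or the timeout expires. */
    if (poll(pfds, (nfds_t) nfds, timeout_ms) <= 0)
        return -1;

    for (i = 0; i < nfds; i++)
    {
        if (pfds[i].revents & POLLIN)
            return i;
    }
    return -1;
}
```

A parent node would call this in a loop and fetch the next row from
the child whose index is returned, instead of draining the children
strictly in order.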

On the other hand, it will be valuable for the class of FDWs that
act as a wrapper around a local data structure (e.g., file_fdw), in
the same way as the current partial seq-scan does.
Their data source is not under transaction control, and the 'remote
execution' of these FDWs actually runs on local computing
resources.

If I make a proof-of-concept patch for the interface itself, file_fdw
seems to me a good candidate for this enhancement.
postgres_fdw is not a good fit for it.
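
To make the three callback points listed at the top a bit more
concrete, here is a rough standalone sketch of what each step would be
responsible for. Everything named demo_* and Demo* is hypothetical and
not an existing PostgreSQL API. A malloc'ed buffer stands in for the
DSM segment, and the "chunks" stand in for ranges of a local file that
participants claim one at a time, in the spirit of partial seq-scan.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical shared state; in PostgreSQL this would live in the DSM. */
typedef struct DemoSharedState
{
    atomic_uint next_chunk;     /* next unclaimed chunk number */
    unsigned    nchunks;        /* total number of chunks to scan */
} DemoSharedState;

/* Per-process scan state; the leader and each worker have their own. */
typedef struct DemoScanState
{
    DemoSharedState *shared;
} DemoScanState;

/* Step 1 (cf. ExecParallelEstimate): report the shared memory needed. */
size_t
demo_estimate(void)
{
    return sizeof(DemoSharedState);
}

/* Step 2 (cf. ExecParallelInitializeDSM): leader sets up the shared area. */
void
demo_initialize_dsm(DemoScanState *leader, void *shm, unsigned nchunks)
{
    DemoSharedState *shared = (DemoSharedState *) shm;

    atomic_init(&shared->next_chunk, 0);
    shared->nchunks = nchunks;
    leader->shared = shared;
}

/* Step 3 (cf. ExecParallelInitializeWorker): worker attaches to the area. */
void
demo_initialize_worker(DemoScanState *worker, void *shm)
{
    worker->shared = (DemoSharedState *) shm;
}

/* Claim the next chunk; returns -1 once every chunk has been handed out. */
long
demo_next_chunk(DemoScanState *node)
{
    unsigned    chunk = atomic_fetch_add(&node->shared->next_chunk, 1);

    return (chunk < node->shared->nchunks) ? (long) chunk : -1;
}
```

The point of the split is that the provider, not the core executor,
decides the size and layout of the shared area, while the leader and
the workers only agree on where it is located.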

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

