Re: Custom Scan APIs (Re: Custom Plan node) - Mailing list pgsql-hackers

From Kouhei Kaigai
Subject Re: Custom Scan APIs (Re: Custom Plan node)
Date
Msg-id 9A28C8860F777E439AA12E8AEA7694F8F8010F@BPXM15GP.gisp.nec.co.jp
In response to Re: Custom Scan APIs (Re: Custom Plan node)  (Stephen Frost <sfrost@snowman.net>)
List pgsql-hackers
> * Kouhei Kaigai (kaigai@ak.jp.nec.com) wrote:
> > IIUC, his approach was integration of join-pushdown within FDW APIs,
> > however, it does not mean the idea of remote-join is rejected.
>
> For my part, trying to consider doing remote joins *without* going through
> FDWs is just nonsensical.  What are you joining remotely if not two foreign
> tables?
>
It is the case where they are joined locally. If a query has two foreign
tables managed by the same server, this pair shall be found while the
optimizer tries the various possible combinations.

> With regard to the GPU approach, if that model works whereby the
> normal PG tuples are read off disk, fed over to the GPU, processed, then
> returned back to the user through PG, then I wouldn't consider it really
> a 'remote' join but rather simply a new execution node inside of PG which
> is planned and costed just like the others.  We've been over the discussion
> already about trying to make that a pluggable system but the, very reasonable,
> push-back on that has been if it's really possible and really makes sense
> to be pluggable.  It certainly doesn't *have* to be- PostgreSQL is written
> in C, as we all know, and plenty of C code talks to GPUs and shuffles memory
> around- and that's almost exactly what Robert is working on supporting with
> regular CPUs and PG backends already.
>
> In many ways, trying to conflate this idea of using-GPUs-to-do-work with
> the idea of remote-FDW-joins has really disillusioned me with regard to
> the CustomScan approach.
>
Are you suggesting that I focus on the GPU stuff, rather than killing two
birds with one stone? That may be an approach; however, the two cases have a
common part, because a plan node for remote join pops tuples up towards its
upper node.
From the viewpoint of the upper node, it looks like a black box that returns
tuples joined from two underlying relations. On the other hand, here is
another black box that returns tuples by scanning or joining underlying
relations with GPU assistance.
Neither implementation detail is visible to the upper node, but the external
interface is common. The custom-scan node can provide a pluggable way for
both use cases.
Anyway, I'm not motivated by the remote-join feature any more than by the
GPU-acceleration stuff. If it is better to drop the FDW remote-join stuff
from the custom-scan scope, I won't insist on it.
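To illustrate that common interface, here is a minimal sketch; the struct and
callback names below are only illustrative, not the actual API of the patch:

    /*
     * Illustrative sketch only -- the struct and callback names are not
     * the actual API of the patch.  The upper node drives any custom-scan
     * through one narrow interface and never sees what is inside the
     * black box (a remote join, a GPU-assisted scan, or anything else).
     */
    #include "executor/tuptable.h"      /* for TupleTableSlot */

    typedef struct CustomScanState CustomScanState;     /* opaque here */

    typedef struct CustomScanMethods
    {
        const char *name;   /* e.g. "GpuScan" or "RemoteJoin" */
        void        (*BeginCustomScan) (CustomScanState *node);
        TupleTableSlot *(*ExecCustomScan) (CustomScanState *node);
        void        (*ReScanCustomScan) (CustomScanState *node);
        void        (*EndCustomScan) (CustomScanState *node);
    } CustomScanMethods;

Whatever runs inside, the upper node only ever calls ExecCustomScan() to pop
the next tuple, which is why one pluggable node type can serve both use cases.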

> > > Then perhaps they should be exposed more directly?  I can understand
> > > generally useful functionality being exposed in a way that anyone
> > > can use it, but we need to avoid interfaces which can't be stable
> > > due to normal / ongoing changes to the backend code.
> > >
> > The functions my patches want to expose are:
> >  - get_restriction_qual_cost()
> >  - fix_expr_common()
>
> I'll try and find time to go look at these in more detail later this week.
> I have reservations about exposing the current estimates on costs as we
> may want to adjust them in the future- but such adjustments may need to
> be made in balance with other changes throughout the system and an external
> module which depends on one result from the qual costing might end up having
> problems with the costing changes because the extension author wasn't aware
> of the other changes happening in other areas of the costing.
>
That is my point as well. If the cost estimation logic is revised in the
future, it becomes a problem if an extension has cut and pasted that code.
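For example, an extension could simply call the core routine instead of
duplicating its logic. A minimal sketch, assuming the patch exports
get_restriction_qual_cost() from costsize.c with its current (static)
signature; my_custom_cost() is a hypothetical extension function:

    #include "postgres.h"
    #include "optimizer/cost.h"

    static void
    my_custom_cost(PlannerInfo *root, RelOptInfo *baserel,
                   ParamPathInfo *param_info, Path *path)
    {
        QualCost    qpqual_cost;
        Cost        cpu_per_tuple;

        /* Let the core code cost the quals, so future changes to the
         * costing model are picked up automatically. */
        get_restriction_qual_cost(root, baserel, param_info, &qpqual_cost);

        cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple;
        path->startup_cost = qpqual_cost.startup;
        path->total_cost = path->startup_cost +
            cpu_per_tuple * baserel->tuples;
    }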

> I'm talking about this from a "beyond-just-the-GUCs" point of view, I
> realize that the extension author could go look at the GUC settings, but
> it's entirely reasonable to believe we'll make changes to the default GUC
> settings along with how they're used in the future.
>
Is the GUC something like a Boolean that shows whether the new costing model
is applied or not? If so, the extension needs to keep two cost estimation
logics within its code, doesn't it?
If the GUC represents something like a weight, I also think it makes sense.
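For illustration, a weight-style knob, similar in spirit to
cpu_operator_cost, could also be defined by the extension itself. A minimal
sketch; the GUC name and default value here are made up:

    #include "postgres.h"
    #include "fmgr.h"
    #include "utils/guc.h"
    #include <float.h>

    PG_MODULE_MAGIC;

    static double gpu_operator_cost = 0.0025;   /* hypothetical weight */

    void
    _PG_init(void)
    {
        DefineCustomRealVariable("my_ext.gpu_operator_cost",
                                 "Cost of one qualifier operator on GPU.",
                                 NULL,
                                 &gpu_operator_cost,
                                 0.0025,        /* boot value */
                                 0.0,           /* min */
                                 DBL_MAX,       /* max */
                                 PGC_USERSET,
                                 0,
                                 NULL, NULL, NULL);
    }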

> > And, the functions my patches newly want are:
> >  - bms_to_string()
> >  - bms_from_string()
>
> Offhand, these look fine, if there's really an external use for them.
> Will try to look at them in more detail later.
>
At least, it makes sense to carry a bitmap data structure in the private
field of a custom-scan, because every plan node has to be safe under
copyObject().
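For reference, a minimal sketch of what bms_to_string() / bms_from_string()
could look like; the actual patch may differ. Serializing the Bitmapset as a
space-separated list of member numbers lets it ride in a string field that
copyObject() already knows how to handle:

    #include "postgres.h"
    #include "nodes/bitmapset.h"
    #include "lib/stringinfo.h"

    char *
    bms_to_string(const Bitmapset *bms)
    {
        StringInfoData  buf;
        Bitmapset      *tmp = bms_copy(bms);
        int             x;

        initStringInfo(&buf);
        /* bms_first_member() is destructive, so scan a copy */
        while ((x = bms_first_member(tmp)) >= 0)
            appendStringInfo(&buf, " %d", x);
        bms_free(tmp);
        return buf.data;
    }

    Bitmapset *
    bms_from_string(const char *str)
    {
        Bitmapset  *bms = NULL;
        char       *endp;

        for (;;)
        {
            long    x = strtol(str, &endp, 10);

            if (endp == str)
                break;          /* no more numbers */
            bms = bms_add_member(bms, (int) x);
            str = endp;
        }
        return bms;
    }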

> > > That's fine, if we can get data to and from those co-processors
> > > efficiently enough that it's worth doing so.  If moving the data to
> > > the GPU's memory will take longer than running the actual
> > > aggregation, then it doesn't make any sense for regular tables
> > > because then we'd have to cache the data in the GPU's memory in some
> > > way across multiple queries, which isn't something we're set up to do.
> > >
> > When I made a prototype implementation on top of FDW, using CUDA, it
> > could run a sequential scan 10 times faster than SeqScan on regular
> > tables, if the qualifiers are complex enough.
> > The libraries used to communicate with the GPU (OpenCL/CUDA) have an
> > asynchronous data transfer mode using hardware DMA. It allows the cost
> > of data transfer to be hidden by pipelining, if there are enough
> > records to be transferred.
>
> That sounds very interesting and certainly figuring out the costing to
> support that model will be tricky.  Also, shuffling the data around in that
> way will also be interesting.  It strikes me that it'll be made more
> difficult if we're trying to do it through the limitations of a pre-defined
> API between the core code and an extension.
>
This data shuffling is done on the extension side, so core PG just picks up
tuples from the box that handles the underlying table scan in some way.
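For illustration, a rough sketch of the pipelining mentioned above, using the
CUDA runtime API outside of any PostgreSQL interface. process_chunk() stands
for whatever qualifier-evaluation kernel the extension runs; it is not a real
function, and host_buf must be pinned memory (cudaHostAlloc) for
cudaMemcpyAsync() to be truly asynchronous:

    #include <cuda_runtime.h>

    #define NSTREAMS    2

    void
    scan_pipelined(const char *host_buf, char *dev_buf,
                   size_t chunk_sz, int nchunks)
    {
        cudaStream_t    stream[NSTREAMS];
        int             i;

        for (i = 0; i < NSTREAMS; i++)
            cudaStreamCreate(&stream[i]);

        for (i = 0; i < nchunks; i++)
        {
            cudaStream_t    s = stream[i % NSTREAMS];

            /*
             * While the DMA engine copies chunk i on this stream, the
             * kernel queued on the other stream still crunches chunk
             * i-1, hiding the transfer cost behind computation.
             */
            cudaMemcpyAsync(dev_buf + i * chunk_sz,
                            host_buf + i * chunk_sz,
                            chunk_sz, cudaMemcpyHostToDevice, s);
            /* process_chunk<<<grid, block, 0, s>>>(dev_buf + i * chunk_sz); */
        }

        for (i = 0; i < NSTREAMS; i++)
        {
            cudaStreamSynchronize(stream[i]);
            cudaStreamDestroy(stream[i]);
        }
    }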

> > Also, the recent trend in semiconductor devices is GPU integration with
> > the CPU, sharing a common memory space. See Intel's Haswell, AMD's
> > Kaveri, or NVIDIA's Tegra K1. All of them share the same memory, so
> > there is no need to transfer the data to be calculated. This trend is
> > driven by physical law, namely the energy consumption of
> > semiconductors. So, I'm optimistic about my idea.
>
> And this just makes me wonder why the focus isn't on the background worker
> approach instead of trying to do this all in an extension.
>
The GPU portion of the above processors has a different instruction set from
the CPU, so we cannot utilize its parallel execution capability even if we
launch tons of background workers, which run existing CPU instructions.

> > Hmm... It seems to me we should follow the existing manner of
> > constructing join paths, rather than special handling. Even if a query
> > contains three or more foreign tables managed by the same server, they
> > shall be consolidated into one remote join as long as its cost is less
> > than the local ones.
>
> I'm not convinced that it's going to be that simple, but I'm certainly
> interested in the general idea.
>
That is implemented in my part-3 patch. The add_join_path hook adds a
custom-scan path that joins two foreign tables, a foreign table and a
custom-scan, or two custom-scans, if all of them are managed by the same
foreign server.
As long as its execution cost is reasonable, this allows a remote join that
contains three or more relations; a sketch of the hook's shape is below.
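A minimal sketch of such a hook; the signature, rel_foreign_server(), and
create_remotejoin_path() are hypothetical, not the exact code of the part-3
patch:

    static void
    my_add_join_path(PlannerInfo *root, RelOptInfo *joinrel,
                     RelOptInfo *outerrel, RelOptInfo *innerrel,
                     JoinType jointype, List *restrictlist)
    {
        Oid     outer_server = rel_foreign_server(root, outerrel);
        Oid     inner_server = rel_foreign_server(root, innerrel);

        /*
         * Both sides must be handled by the same foreign server; either
         * side may itself be a previously built remote-join custom path,
         * which is how joins of three or more relations get consolidated.
         */
        if (!OidIsValid(outer_server) || outer_server != inner_server)
            return;

        add_path(joinrel, create_remotejoin_path(root, joinrel,
                                                 outerrel, innerrel,
                                                 jointype, restrictlist));
    }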

> > So, I'd like to bet on using the new add_join_path_hook to compute
> > possible join paths. If a remote join implemented by custom-scan is
> > cheaper than the local join, it shall be chosen, then the optimizer
> > will try joining other foreign tables with this custom-scan node. If
> > the remote join is still cheap, then it shall be consolidated again.
>
> And I'm still unconvinced that trying to make this a hook and implemented
> by an extension makes sense.
>
The postgresAddJoinPaths() in my part-3 patch is doing that. Of course, some
portion of its code might better be supported in the core backend. However,
I don't think the overall design is less reasonable than special handling.

> > > Admittedly, getting the costing right isn't easy either, but it's
> > > not clear to me how it'd make sense for the local server to be doing
> > > costing for remote servers.
> > >
> > Right now, I ignored the cost of running the remote server and focused
> > on the cost of transfer via the network. It might be an idea to
> > discount the CPU cost of remote execution.
>
> Pretty sure we're going to need to consider the remote processing cost of
> the join as well..
>
I also think so, even though it is not done yet.

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>



