Re: pgsql_fdw, FDW for PostgreSQL server - Mailing list pgsql-hackers

From Shigeru HANADA
Subject Re: pgsql_fdw, FDW for PostgreSQL server
Date
Msg-id 4F61F75A.6090000@gmail.com
Whole thread Raw
In response to Re: pgsql_fdw, FDW for PostgreSQL server  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: pgsql_fdw, FDW for PostgreSQL server
List pgsql-hackers
I fixed pgsql_fdw to use new FDW API.  Also I fixed to use own deparser
to generate remote queries, so these patch don't contain changes for
backend codes any more.  Now planning foreign scan is done in the steps
below:

1) in GerForeignRelSize, pgsql_fdw generates simple SELECT statement
which has no WHERE clause from given information such as PlannerInfo and
RelOptInfo.  This query string is used to execute remote EXPLAIN in
order to estimate number of rows which will be returned from the scan.
Remote EXPLAIN doesn't return such number directly, so pgsql_fdw calls
set_baserel_size_estimates function which calculates number of rows with
selectivity of quals and reltuples.  It also estimates width, but
currently no statistic is available for foreign tables, so pgsql_fdw
overrides it with width value provided by remote EXPLAIN.  If we support
ANALYZE for foreign tables, we would be able to defer EXPLAIN until
GetForeignPaths, because local statistics are enough information to
estimate number of rows returned by the scan.

2) in GetForeignPaths, first, pgsql_fdw considers a very simple Path
which doesn't push any expression down to remote end.  It is like
SeqScan for regular tables.  Backend will filter the result with all
quals in baserestrictinfo.  In push-down supported version, pgsql_fdw
also considers another Path which pushes conditions as much as possible.
 This would reduce the amount of data transfer, and clocks used to
convert strings to tuples at local side.  pgsql_fdw emits another
EXPLAIN for this path, though we might be able to omit.  Planner can
choose either path with basis of their costs.  If pgsql_fdw can know
that the remote table is indexed, pgsql_fdw would be able to consider
sorted Path for each remote index.

3) in GetForeignPlan, pgsql_fdw creates only one ForeignScan node from
given best_path.  Currently pgsql_fdw uses SQL-level cursor in order to
avoid out-of-memory by huge result set, so we need to construct several
SQL statements for the plan.  Old implementation has created SQL
statements in Path phase, but now it's deferred until Plan phase.  This
change would avoid possible unnecessary string operation.  I worry that
I've misunderstood the purpose of fdw_exprs...

Although the patches are still WIP, especially in WHERE push-down part,
but I'd like to post them so that I can get feedback of the design as
soon as possible.

(2012/03/11 1:34), Tom Lane wrote:
>> 1) IIUC, GetForeignRelSize should set baserel->rows to the number of
>> rows the ForeignScan node returns to upper node, but not the number
>> of rows FDW returns to core executor, right?
>
> It should be the number of rows estimated to pass the baserestrictinfo
> restriction clauses, so yeah, not the same as what the FDW would return,
> except in cases where all the restriction clauses are handled internally
> by the FDW.
>
>> BTW, once Fujita-san's ANALYZE support patch is merged, we will be
>> able to get rows estimatation easily by calling clauselist_selectivity
>> with baserel->tuples and baserestrictinfo.  Otherwise, pgsql_fdw
>> would still need to execute EXPLAIN on remote side to get meaningful
>> rows estimation.
>
> Yeah, one of the issues for that patch is how we see it coexisting with
> the option of doing a remote-side EXPLAIN.

It seems not so easy to determine whether remote EXPLAIN is better from
local statistics.  An easy way is having a per-relation FDW option like
"use_local_stats" or something for pgsql_fdw, but it doesn't sound right
because other FDWs have same problem...

>> 2) ISTM that pgsql_fdw needs to execute EXPLAIN on remote side for each
>> possible remote query to get meaningful costs estimation, and it
>> requires pgsql_fdw to generate SQL statements in GetForeignPaths.
>> I worry that I've misunderstood intention of your design because
>> you've mentioned postponing SQL deparsing to createplan time.
>
> If you want to get the cost estimates that way, then yes, you'd be
> needing to do some SQL-statement-construction earlier than final plan
> generation.  But it's not apparent to me that those statements would
> necessarily be the same as, or even very similar to, what the final
> queries would be.  For instance, you'd probably try to reduce parameters
> to constants for estimation purposes.

Hm, I though that using queries same as final ones has no overhead, if
we don't need to  deparse clauses redundantly.  In current
implementation, WHERE clause is deparsed only once for a scan, and basis
of the query (SELLECT ... FROM ...) is also deparseed only once.  Indeed
string copy is not avoidable, but I feel that's not big problem.  Thoughts?

Regards,
--
Shigeru Hanada

Attachment

pgsql-hackers by date:

Previous
From: Thom Brown
Date:
Subject: Re: Command Triggers, patch v11
Next
From: Tom Lane
Date:
Subject: Re: libpq should have functions for escaping data for use in COPY FROM