Re: pgsql_fdw, FDW for PostgreSQL server - Mailing list pgsql-hackers

From Shigeru HANADA
Subject Re: pgsql_fdw, FDW for PostgreSQL server
Date
Msg-id 4F71A101.9060407@gmail.com
Whole thread Raw
In response to Re: pgsql_fdw, FDW for PostgreSQL server  ("Albe Laurenz" <laurenz.albe@wien.gv.at>)
Responses Re: pgsql_fdw, FDW for PostgreSQL server
List pgsql-hackers
Thanks for the comments.

(2012/03/27 18:36), Albe Laurenz wrote:
>> 2) Defer planning stuffs as long as possible to clarify the role of
> each
>> function.  Currently GetRelSize just estimates result rows from local
>> statistics, and GetPaths adds only one path which represents SeqScan
> on
>> remote side.  As result of this change, PgsqlFdwPlanState struct is
>> obsolete.
> 
> I see the advantage of being able to do all this locally, but
> I think there are a lot of downsides too:
> - You have an additional maintenance task if you want to keep
>    statistics for remote tables accurate.  I understand that this may
>    get better in a future release.
> - You depend on the structure of pg_statistic, which means a potential
>    incompatibility between server versions.  You can add cases to
>    pgsql_fdw_analyze to cater for changes, but that is cumbersome and
> will
>    only help for later PostgreSQL versions connecting to earlier ones.
> - Planning and execution will change (improve, of course) between server
>    versions.  The local planner may choose an inferior plan based on a
>    wrong assumption of how a certain query can be handled on the remote.
> - You have no statistics if the foreign table points to a view on the
>    remote system.

Especially for 2nd and 4th, generating pg_statistic records without
calling do_analyze_rel() seems unpractical in multiple version
environment.  As you pointed out, I've missed another
semantics-different problem here.  We would have to use do_analyze_rel()
and custom sampling function which returns sample rows from remote data
source, if we want to have statistics of foreign data locally.  This
method would be available for most of FDWs, but requires some changes in
core.  [I'll comment on Fujita-san's ANALYZE patch about this issue soon.]

> My gut feeling is that planning should be done by the server which
> will execute the query.

Agreed, if selectivity of both local filtering and remote filtering were
available, we can estimate result rows correctly and choose better plan.

How about getting # of rows estimate by executing EXPLAIN for
fully-fledged remote query (IOW, contains pushed-down WHERE clause), and
estimate selectivity of local filter on the basis of the statistics
which are generated by FDW via do_analyze_rel() and FDW-specific
sampling function?  In this design, we would be able to use quite
correct rows estimate because we can consider filtering stuffs done on
each side separately, though it requires expensive remote EXPLAIN for
each possible path.

>> 3) Implement pgsql_fdw's own deparser which pushes down collation-free
>> and immutable expressions in local WHERE clause.  This means that most
>> of numeric conditions can be pushed down, but conditions using
> character
>> types are not.
> 
> I understand that this is simple and practical, but it is a pity that
> this excludes equality and inequality conditions on strings.
> Correct me if I am wrong, but I thought that these work the same
> regardless of the collation.

You are right, built-in equality and inequality operators don't cause
collation problem.  Perhaps allowing them would cover significant cases
of string comparison, but I'm not sure how to determine whether an
operator is = or != in generic way.  We might have to hold list of oid
for collation-safe operator/functions until we support ROUTINE MAPPING
or something like that...  Anyway, I'll fix pgsql_fdw to allow = and !=
for character types.

Regards,
-- 
Shigeru Hanada


pgsql-hackers by date:

Previous
From: Fujii Masao
Date:
Subject: Re: PATCH: pg_basebackup (missing exit on error)
Next
From: Thom Brown
Date:
Subject: Re: pgsql_fdw, FDW for PostgreSQL server