Thread: PG9.2 and FDW query planning.

PG9.2 and FDW query planning.

From
Ronan Dunklau
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hello.

I'm in the process of porting our multicorn extension to pg9.2, and
I'd like to take advantage of the GetForeignPaths hook.

The multicorn extension allows the creation of a FDW in python using a
simple API, and I'd like to be able to provide FDW implementors a way
to influence the planner.

Let's say I have an IMAP foreign data wrapper, and I write a query
joining the table on itself using the In-Reply-To and Message-ID
headers, is there anything I can do to avoid fetching all the mails
from the remote server ?

If I could somehow inform the planner that it can look up rows by
message-id, thus avoiding the need to fetch everything from the remote
server. Perhaps "persuading" the planner to use a nested-loop ?

If I understand correctly, I should build one (or more) list of
pathkeys, and add more foreign paths using those pathkeys.

There was a discussion about index on foreign tables back in march.
- From what I understand from this discussion, someone proposed to
locally store information about indexes on the foreign tables, but I
did not find anything on how to build a path "from scratch".

Thank you.

- -- 
Ronan Dunklau

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)

iQEcBAEBAgAGBQJP9bsIAAoJECTYLCgFy323KtsH/2/8AVfBRm75oWFMlU0l0oBC
ujxkt338PnpVi1V5gtE5GSRSwybPWytXAkgzIQ5/DEP/RmeW8pliV+0V7zvqlEWG
zgvfA7stRBWtIIIv6mdlTM0eBBgsFnoJLiJDTUDst5vAaj8vg8b+pX/ip7nSF5sw
dV2i3ir6JSsG4nJOcQ/kP6xg4Joan65pOwFDfwnx9pFnerT0YN9f87DRuohcj12e
fgWSqZkGU5nx9yCLWa294YzIFFY7lIjLowzEfg2eP2dVIM09GquKsSXyilJiMy4J
3QL64mUDF/pNZeH0LnpyGJfCfPlQsX4c554rZbO03tVeSEZMyVpCISLqutTnR9I=
=HwMc
-----END PGP SIGNATURE-----


Re: PG9.2 and FDW query planning.

From
Tom Lane
Date:
Ronan Dunklau <rdunklau@gmail.com> writes:
> Let's say I have an IMAP foreign data wrapper, and I write a query
> joining the table on itself using the In-Reply-To and Message-ID
> headers, is there anything I can do to avoid fetching all the mails
> from the remote server ?

> If I could somehow inform the planner that it can look up rows by
> message-id, thus avoiding the need to fetch everything from the remote
> server. Perhaps "persuading" the planner to use a nested-loop ?

OK, so what you're saying is that the imap server can effectively
provide an index on message_id.  What you'd do is create a parameterized
path that uses the tbl.message_id = other_tbl.in_reply_to join clause.
If that's enough cheaper than a full scan, the planner would select it.

FWIW, I'm not sure that it's sane to try to expose this stuff to python
yet.  It's complicated and still something of a moving target.  Once
we've got a few more C-coded FDWs that can do this type of optimization,
things might be stable enough that it'd be useful to try to provide a
high-level API.
        regards, tom lane


Re: PG9.2 and FDW query planning.

From
Ronan Dunklau
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 05/07/2012 18:30, Tom Lane wrote:
> Ronan Dunklau <rdunklau@gmail.com> writes:
>> Let's say I have an IMAP foreign data wrapper, and I write a
>> query joining the table on itself using the In-Reply-To and
>> Message-ID headers, is there anything I can do to avoid fetching
>> all the mails from the remote server ?
> 
>> If I could somehow inform the planner that it can look up rows
>> by message-id, thus avoiding the need to fetch everything from
>> the remote server. Perhaps "persuading" the planner to use a
>> nested-loop ?
> 
> OK, so what you're saying is that the imap server can effectively 
> provide an index on message_id.  What you'd do is create a
> parameterized path that uses the tbl.message_id =
> other_tbl.in_reply_to join clause. If that's enough cheaper than a
> full scan, the planner would select it.

Thank you, I was able to build such paths from your indication.

The python FDW implementor can optionally give a list of tuples
consisting of (path key, expected_number_of_row). So, in the imap
example that could be [('Message-ID', 1), ('From', 1000)] for example.

- From this information, if there is an equivalence class which
restrictinfo uses one of those keys, we build a parameterized path,
with an associated cost of base_width * expected_number_of_row, in
addition to the generic, unparameterized path.

The planner can then select this path, and build plans looking like this:

postgres=# explain select m1."From",      m1."To",      m2."From",      m2."To"
from mails m1 inner join mails m2 on m2."Message-ID" = m1."In-Reply-To"
where m1."From" = '%test@example.com%';
                       QUERY PLAN
- --------------------------------------------------------------------Nested Loop  (cost=10.00..60001000.00
rows=500000000width=128)  ->  Foreign Scan on mails m1  (cost=0.00..30000000.00 rows=100000
 
width=300)        Filter: (("From")::text = '%test@example.com%'::text)  ->  Foreign Scan on mails m2
(cost=10.00..300.00rows=1 width=300)        Filter: (("Message-ID")::text = (m1."In-Reply-To")::text)
 


If I understand it correctly, after returning a ForeignScan (from
GetForeignPlan), the planner decides to use a nestloop, and in the
process of creating the nestloop plan, replaces Var nodes coming from
the outerel (here, m1."In-Reply-To") by params nodes.

My current implementation already looks for (var = param) expressions
that it may handle  during the plan phase and stores the association
between the var and the param_id.
At execution time, the needed parameters values are fetched (from the
ParamExecData array found in es_param_exec_vals) and passed to the
python foreign data wrapper.

The problem I have: how can I retrieve the generated params and keep
the association between the var and the param ?

Should I replace the (var = outervar) clauses by (var = param) myself
and store them in the fdw_exprs field of the foreign scan ?

> FWIW, I'm not sure that it's sane to try to expose this stuff to
> python yet.  It's complicated and still something of a moving
> target.  Once we've got a few more C-coded FDWs that can do this
> type of optimization, things might be stable enough that it'd be
> useful to try to provide a high-level API.

The current API (as mentioned above) would be more declarative than
anything, only offering a way to (maybe) build parameterized paths
without guaranteeing anything. Even if the internals change, I fail to
see how it can hurt to offer such a feature.


Regards,

- -- 
Ronan Dunklau


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)

iQEcBAEBAgAGBQJP/s2dAAoJECTYLCgFy3239KkIAIiKJo/F1r4Yp49wLpmThjQI
ICo910ZajqlUKVsl9ye8m2l6p+lyGEmZMWUAWP6ae2pqFR+aC0zThypjF1faZ9tN
HfqMbEKx/trkDf05U28tJlvOeu21tiEOEs4n02fmfdHu9SvemuLdyhU3dOLxoBVK
ZZ8ra9q/+zHCPpc3zt0Mow80Q1X1M3DtirsHPoeIdOK69wD4nD2ZfhQule5HaoV1
dG3FlrKGAGzRpohLBCuWzyGPcWCS584lXGWfhsz/waLaSDIjcjvaaMke54eaa8Ci
7KxXkMM12CKFQyheSR5VVwFJrobnME2HDiJCoAOkRc0dW+Y2aASJnKG/FwL8C7s=
=4RjB
-----END PGP SIGNATURE-----