Thread: Optimizing Nested Correlated Queries by decorrelation: GSOC 2012 Project

Optimizing Nested Correlated Queries by decorrelation: GSOC 2012 Project

From
Arun Chaitanya
Date:
Hi,

I wanted to take up this as a GSOC 2012 project.

SQL supports nested queries. When the inner query contains a
correlation variable the present optimizer takes an iterative
execution plan. If the inner query scans over a relation, the
iterative plan chosen can be sub-optimal.

The goal of this project is to enable De-correlation for all possible
cases. The iterative execution plan can be converted to a set oriented
plan by taking a join over the base relations involved. Sufficient
work has already been done in this area and has been implemented in
SQL Server.

The changes required to incorporate the above mentioned strategy is in
rewriting phase to the best of my knowledge. The key idea is to
introduce the APPLY operator in the raw parse tree. In the above
mentioned Papers, the author has mentioned the process of removing the
apply. The author has proposed a set of rules which will allow us to
achieve the goal. The present postgresql optimizer community has done
some work in these lines for simple subqueries involving =, > , < in
the predicates [ I observed it by seeing the result of EXPLAIN for
relevant queries ]. The optimization is not done for subqueries
containing aggregate queries and existential and containment queries.

An example query from TPCH benchmark discussed by the author:

select c_custkey
from customer
where 1000000 < (select sum(o_totalprice)
from orders
where o_custkey = c_custkey)

In the above case, c_custkey is a correlation variable (parameterized)
coming from the outer query. Hence in the present system, the inner
query is executed as many times as the tuples in the customer
relation. As the subQuery involves a scan over orders relation, the
total I/O cost involved is pretty high.

Using the transformation proposed by the author, the query can be re-written as

select c_custkey
from customer left outer join
orders on o_custkey = c custkey
group by c_custkey
having 1000000 < sum(o_totalprice)

This allows the optimizer to chose a better plan for left outer join
and avoid iterative execution. The idea here is not to get a rewritten
query as output but to generate a plan tree that does the same as the
above query.

Regards,
Arun


On Fri, Mar 30, 2012 at 7:33 AM, Arun Chaitanya <chaitan64arun@gmail.com> wrote:
> I wanted to take up this as a GSOC 2012 project.

This would be a great query planner optimization but the chances of
getting it done in one summer as a GSoC project seem to me to be nil.
You've never had a patch accepted before; picking a massive
reorganization of the query planner as your first project is not going
to work out well.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Optimizing Nested Correlated Queries by decorrelation: GSOC 2012 Project

From
Arun Chaitanya
Date:
Thanks a lot Heikki.
I have already posted an example in the mail.
The link to the paper is
http://www.iith.ac.in/~ravig/courses/cs5050/papers/decorrelation-cesar.pdf

Hope this helps,
Arun

On Fri, Mar 30, 2012 at 5:32 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> (off-list)
>
> You'll want to post a link to the paper, otherwise people who are not GSoC
> mentors will have no idea what you're talking about ;-). Posting an example
> query and access plans would illustrate the point, too.
>
>
> On 30.03.2012 14:33, Arun Chaitanya wrote:
>>
>> Hi,
>>
>> I wanted to take up this as a GSOC 2012 project.
>>
>> SQL supports nested queries. When the inner query contains a
>> correlation variable the present optimizer takes an iterative
>> execution plan. If the inner query scans over a relation, the
>> iterative plan chosen can be sub-optimal.
>>
>> The goal of this project is to enable De-correlation for all possible
>> cases. The iterative execution plan can be converted to a set oriented
>> plan by taking a join over the base relations involved. Sufficient
>> work has already been done in this area and has been implemented in
>> SQL Server.
>>
>> The changes required to incorporate the above mentioned strategy is in
>> rewriting phase to the best of my knowledge. The key idea is to
>> introduce the APPLY operator in the raw parse tree. In the above
>> mentioned Papers, the author has mentioned the process of removing the
>> apply. The author has proposed a set of rules which will allow us to
>> achieve the goal. The present postgresql optimizer community has done
>> some work in these lines for simple subqueries involving =,>  ,<  in
>> the predicates [ I observed it by seeing the result of EXPLAIN for
>> relevant queries ]. The optimization is not done for subqueries
>> containing aggregate queries and existential and containment queries.
>>
>> An example query from TPCH benchmark discussed by the author:
>>
>> select c_custkey
>> from customer
>> where 1000000<  (select sum(o_totalprice)
>> from orders
>> where o_custkey = c_custkey)
>>
>> In the above case, c_custkey is a correlation variable (parameterized)
>> coming from the outer query. Hence in the present system, the inner
>> query is executed as many times as the tuples in the customer
>> relation. As the subQuery involves a scan over orders relation, the
>> total I/O cost involved is pretty high.
>>
>> Using the transformation proposed by the author, the query can be
>> re-written as
>>
>> select c_custkey
>> from customer left outer join
>> orders on o_custkey = c custkey
>> group by c_custkey
>> having 1000000<  sum(o_totalprice)
>>
>> This allows the optimizer to chose a better plan for left outer join
>> and avoid iterative execution. The idea here is not to get a rewritten
>> query as output but to generate a plan tree that does the same as the
>> above query.
>>
>> Regards,
>> Arun
>>
>
>
> --
>  Heikki Linnakangas
>  EnterpriseDB   http://www.enterprisedb.com


Re: Optimizing Nested Correlated Queries by decorrelation: GSOC 2012 Project

From
Arun Chaitanya
Date:
Thanks Robert,

Yes. I think I am being over ambitious as I never had any Open Source
development experience.
Anyways, please go through the idea. I have posted the link to the
paper in on of the replies.

Please, suggest me other options which I can take up as a GSOC 2012 project.

On Fri, Mar 30, 2012 at 5:34 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Mar 30, 2012 at 7:33 AM, Arun Chaitanya <chaitan64arun@gmail.com> wrote:
>> I wanted to take up this as a GSOC 2012 project.
>
> This would be a great query planner optimization but the chances of
> getting it done in one summer as a GSoC project seem to me to be nil.
> You've never had a patch accepted before; picking a massive
> reorganization of the query planner as your first project is not going
> to work out well.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company


Arun Chaitanya <chaitan64arun@gmail.com> writes:
> The link to the paper is
> http://www.iith.ac.in/~ravig/courses/cs5050/papers/decorrelation-cesar.pdf

Given the authorship of that paper, I'd have to wonder whether Microsoft
has filed for any patents regarding these ideas.
        regards, tom lane


On 3/30/12 7:29 AM, Tom Lane wrote:
> Arun Chaitanya <chaitan64arun@gmail.com> writes:
>> The link to the paper is
>> http://www.iith.ac.in/~ravig/courses/cs5050/papers/decorrelation-cesar.pdf
> 
> Given the authorship of that paper, I'd have to wonder whether Microsoft
> has filed for any patents regarding these ideas.

Feh, too bad Jim Grey isn't around anymore; I could just ask.

Is there some smaller query optimization project which Arun could
reasonably tackle?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com