Thread: Idea for speeding up uncorrelated subqueries

Idea for speeding up uncorrelated subqueries

From
Tom Lane
Date:
Someone was just complaining over in the sql list about the poor
performance of

select name,description from descriptions 
where name in (select name     from descriptions     where description like '%Bankverbindung%');

Since the inner query is uncorrelated with the outer, there's really
no need to execute it more than once, but currently it's re-executed
each time through the outer plan.

I wonder whether it wouldn't be a good idea to force a Materialize
node to be added to the top of an uncorrelated subplan?  Then at
least the re-executions would be pretty cheap...
        regards, tom lane


Re: [HACKERS] Idea for speeding up uncorrelated subqueries

From
Bruce Momjian
Date:
> Someone was just complaining over in the sql list about the poor
> performance of
> 
> select name,description from descriptions 
> where name in (select name 
>         from descriptions 
>         where description like '%Bankverbindung%');
> 
> Since the inner query is uncorrelated with the outer, there's really
> no need to execute it more than once, but currently it's re-executed
> each time through the outer plan.
> 
> I wonder whether it wouldn't be a good idea to force a Materialize
> node to be added to the top of an uncorrelated subplan?  Then at
> least the re-executions would be pretty cheap...

Yes, the subqueries need work.  We don't even do index lookups into the
inner plan, only sequential.  Already on TODO.  The multiple query
executions are not on the TODO list.  Not sure why this is happening
here.


--  Bruce Momjian                        |  http://www.op.net/~candle maillist@candle.pha.pa.us            |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: [HACKERS] Idea for speeding up uncorrelated subqueries

From
Tom Lane
Date:
Bruce Momjian <maillist@candle.pha.pa.us> writes:
> Yes, the subqueries need work.  We don't even do index lookups into the
> inner plan, only sequential.  Already on TODO.

Huh?  I don't follow that at all...

> The multiple query executions are not on the TODO list.  Not sure why
> this is happening here.

After looking at subselect.c I think I understand why --- InitPlans are
only for subqueries that are known to return a *single* reslt.  When you
have a subquery that might potentially return many, many tuples, you
need to scan through those tuples, so we use SubPlan tactics even if
there's not a query correlation.

But this neglects the cost of re-executing the subplan over and over.
Materializing the result should help, no?  (Of course there are cases
where it won't, such as when the subplan is just an unqualified select,
but most of the time it should be a win, I think...)
        regards, tom lane


Re: [HACKERS] Idea for speeding up uncorrelated subqueries

From
Bruce Momjian
Date:
> Bruce Momjian <maillist@candle.pha.pa.us> writes:
> > Yes, the subqueries need work.  We don't even do index lookups into the
> > inner plan, only sequential.  Already on TODO.
> 
> Huh?  I don't follow that at all...

Suppose you have a subquery that returns 1000 rows.  There is no code so
lookups of the inner table are indexed:
select *from tabwhere col in (select col2 from tab2)

In this case, a sequential scan of the subquery results are required.  I
didn't think the subquery was executed every time it needed to see if
col1 was in the subquery.

> 
> > The multiple query executions are not on the TODO list.  Not sure why
> > this is happening here.
> 
> After looking at subselect.c I think I understand why --- InitPlans are
> only for subqueries that are known to return a *single* reslt.  When you
> have a subquery that might potentially return many, many tuples, you
> need to scan through those tuples, so we use SubPlan tactics even if
> there's not a query correlation.
> 
> But this neglects the cost of re-executing the subplan over and over.
> Materializing the result should help, no?  (Of course there are cases
> where it won't, such as when the subplan is just an unqualified select,
> but most of the time it should be a win, I think...)

No what Vadim is done MVCC, I would like to bug him to improve subquery
performance.  We are tweeking the optimizer, but we have this huge
subquery performance problem here.

--  Bruce Momjian                        |  http://www.op.net/~candle maillist@candle.pha.pa.us            |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: [HACKERS] Idea for speeding up uncorrelated subqueries

From
Tom Lane
Date:
Bruce Momjian <maillist@candle.pha.pa.us> writes:
>> Bruce Momjian <maillist@candle.pha.pa.us> writes:
>>>> Yes, the subqueries need work.  We don't even do index lookups into the
>>>> inner plan, only sequential.  Already on TODO.
>> 
>> Huh?  I don't follow that at all...

> Suppose you have a subquery that returns 1000 rows.  There is no code so
> lookups of the inner table are indexed:

>     select *
>     from tab
>     where col in (select col2 from tab2)

> In this case, a sequential scan of the subquery results are required.

Well, yes, the subquery is a sequential scan.  I guess what you are
envisioning is rewriting this into some kind of nested-loop join?
For simple cases that might be possible...
        regards, tom lane


Re: [HACKERS] Idea for speeding up uncorrelated subqueries

From
Bruce Momjian
Date:
> Bruce Momjian <maillist@candle.pha.pa.us> writes:
> >> Bruce Momjian <maillist@candle.pha.pa.us> writes:
> >>>> Yes, the subqueries need work.  We don't even do index lookups into the
> >>>> inner plan, only sequential.  Already on TODO.
> >> 
> >> Huh?  I don't follow that at all...
> 
> > Suppose you have a subquery that returns 1000 rows.  There is no code so
> > lookups of the inner table are indexed:
> 
> >     select *
> >     from tab
> >     where col in (select col2 from tab2)
> 
> > In this case, a sequential scan of the subquery results are required.
> 
> Well, yes, the subquery is a sequential scan.  I guess what you are
> envisioning is rewriting this into some kind of nested-loop join?
> For simple cases that might be possible...

Yes, or mergejoin/hashjoin.

--  Bruce Momjian                        |  http://www.op.net/~candle maillist@candle.pha.pa.us            |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: [HACKERS] Idea for speeding up uncorrelated subqueries

From
Vadim Mikheev
Date:
Bruce Momjian wrote:
> 
> Yes, the subqueries need work.  We don't even do index lookups into the
> inner plan, only sequential.  Already on TODO.  The multiple query             ^^^^^^^^^^^^^^^
What? Indices are used when appropriate.

Vadim


Re: [HACKERS] Idea for speeding up uncorrelated subqueries

From
Bruce Momjian
Date:
> Bruce Momjian wrote:
> > 
> > Yes, the subqueries need work.  We don't even do index lookups into the
> > inner plan, only sequential.  Already on TODO.  The multiple query
>               ^^^^^^^^^^^^^^^
> What? Indices are used when appropriate.

Sorry, bad wording.  My English should be better.  :-)

I meant to say that joins from the outer plan to subplan are always
nested loops, not hash or mergejoins.  If you have something like:
delete from tab where col in (select col2 from largetable)

it takes a long time, when it really should be quick.  That is why
people are encouraged to use EXISTS.

--  Bruce Momjian                        |  http://www.op.net/~candle maillist@candle.pha.pa.us            |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: [HACKERS] Idea for speeding up uncorrelated subqueries

From
Vadim Mikheev
Date:
Bruce Momjian wrote:
> 
> Suppose you have a subquery that returns 1000 rows.  There is no code so
> lookups of the inner table are indexed:
> 
>         select *
>         from tab
>         where col in (select col2 from tab2)
> 
> In this case, a sequential scan of the subquery results are required.  I
> didn't think the subquery was executed every time it needed to see if
> col1 was in the subquery.

Oh, well. These are cases when query may be rewritten with EXISTS.
This is possible when there are no aggregates in subquery.

> > After looking at subselect.c I think I understand why --- InitPlans are
> > only for subqueries that are known to return a *single* reslt.  When you
> > have a subquery that might potentially return many, many tuples, you
> > need to scan through those tuples, so we use SubPlan tactics even if
> > there's not a query correlation.

Yes. But as I said already, you can use InitPlan (i.e. execute subquery
first) after removing duplicates from subquery results.

> > But this neglects the cost of re-executing the subplan over and over.
> > Materializing the result should help, no?  (Of course there are cases

We could not only cache subquery results (materialization) but also
hash them. 

> > where it won't, such as when the subplan is just an unqualified select,
> > but most of the time it should be a win, I think...)

In such cases, if there are no aggregates in subquery then EXISTS
could be used else materialization will still help. 

> No what Vadim is done MVCC, I would like to bug him to improve subquery
> performance.  We are tweeking the optimizer, but we have this huge
> subquery performance problem here.

No, Bruce. I'm in WAL now. I think that we need in recovery
(remember that you'll lose indices being updated when some
crash took place), fast backup (it's easy to copy part of log 
than dump 1Gb table), fast commits (<= 1 fsync per commit
using group commit, instead of >= 2 fsyncs now), savepoints 
AND buffered logging, which you, Bruce, want so much, 
and so long -:).

Vadim


Re: [HACKERS] Idea for speeding up uncorrelated subqueries

From
Bruce Momjian
Date:
> > > where it won't, such as when the subplan is just an unqualified select,
> > > but most of the time it should be a win, I think...)
> 
> In such cases, if there are no aggregates in subquery then EXISTS
> could be used else materialization will still help. 
> 
> > No what Vadim is done MVCC, I would like to bug him to improve subquery
> > performance.  We are tweeking the optimizer, but we have this huge
> > subquery performance problem here.
> 
> No, Bruce. I'm in WAL now. I think that we need in recovery
> (remember that you'll lose indices being updated when some
> crash took place), fast backup (it's easy to copy part of log 
> than dump 1Gb table), fast commits (<= 1 fsync per commit
> using group commit, instead of >= 2 fsyncs now), savepoints 
> AND buffered logging, which you, Bruce, want so much, 
> and so long -:).

Oh, no, I have been outmaneuvered by Vadim.

Help.

Isn't it something that takes only a few hours to implement.  We can't
keep telling people to us EXISTS, especially because most SQL people
think correlated queries are slower that non-correlated ones.  Can we
just on-the-fly rewrite the query to use exists?

--  Bruce Momjian                        |  http://www.op.net/~candle maillist@candle.pha.pa.us            |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: [HACKERS] Idea for speeding up uncorrelated subqueries

From
Vadim Mikheev
Date:
Bruce Momjian wrote:
> 
> Isn't it something that takes only a few hours to implement.  We can't
> keep telling people to us EXISTS, especially because most SQL people
> think correlated queries are slower that non-correlated ones.  Can we
> just on-the-fly rewrite the query to use exists?

This seems easy to implement. We could look does subquery have
aggregates or not before calling union_planner() in
subselect.c:_make_subplan() and rewrite it (change 
slink->subLinkType from IN to EXISTS and add quals).

Without caching implemented IN-->EXISTS rewriting always
has sence.

After implementation of caching we probably should call union_planner()
for both original/modified subqueries and compare costs/sizes
of EXISTS/IN_with_caching plans and maybe even make
decision what plan to use after parent query is planned
and we know for how many parent rows subplan will be executed.

Vadim


Re: [HACKERS] Idea for speeding up uncorrelated subqueries

From
Tom Lane
Date:
Bruce Momjian <maillist@candle.pha.pa.us> writes:
> Isn't it something that takes only a few hours to implement.  We can't
> keep telling people to us EXISTS, especially because most SQL people
> think correlated queries are slower that non-correlated ones.  Can we
> just on-the-fly rewrite the query to use exists?

I was just about to suggest exactly that.  The "IN (subselect)"
notation seems to be a lot more intuitive --- at least, people
keep coming up with it --- so why not rewrite it to the EXISTS
form, if we can handle that more efficiently?
        regards, tom lane


Re: [HACKERS] Idea for speeding up uncorrelated subqueries

From
Bruce Momjian
Date:
> Bruce Momjian <maillist@candle.pha.pa.us> writes:
> > Isn't it something that takes only a few hours to implement.  We can't
> > keep telling people to us EXISTS, especially because most SQL people
> > think correlated queries are slower that non-correlated ones.  Can we
> > just on-the-fly rewrite the query to use exists?
> 
> I was just about to suggest exactly that.  The "IN (subselect)"
> notation seems to be a lot more intuitive --- at least, people
> keep coming up with it --- so why not rewrite it to the EXISTS
> form, if we can handle that more efficiently?

Yes, we have the nice subselect feature.  I just hate to see it not
completely finished, performance-wise.

--  Bruce Momjian                        |  http://www.op.net/~candle maillist@candle.pha.pa.us            |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026