Thread: about 7.0 LIMIT optimization

about 7.0 LIMIT optimization

From
"Roberto Cornacchia"
Date:
Hi there,

I've just had a look at the 7.0beta and I've seen your enhancements
to the LIMIT optimization.
Did you by chance read my previous message entitled
"Generalized Top Queries on PostgreSQL"?
When I wrote it I had not yet read the thread
entitled "Solution for LIMIT cost estimation".

What you did looks pretty similar to part of our extension
(cost model and pruning rules). The main differences are:

- the FOR EACH generalization.

- You cannot select the top N rows according to criterion A while ordering the results by a different criterion B.

- If you ask for the best 10 rows, from a relation including 100000 rows,
you have to do a traditional sort on 100000 rows and then retain only the
first 10, doing more comparisons than requested.

- You can choose a "fast-start" plan (i.e., basically, a pipelined plan),
but you cannot perform an "early-stop" of the stream when you have a
"slow-start" plan (e.g. one involving sorts or hash tables). We noticed
that this kind of plan often outperforms the first one.

So, we are looking forward to seeing how the new LIMIT optimization works
(we will run some tests as soon as possible). Have you noticed
relevant improvements?

Actually, we should say we can't figure out the reason for
managing the LIMIT clause in such a complicated way, rather than
providing a node in the plan like any other operator.
In our opinion, the choice to process the LIMIT clause separately
has two problems:
1. We find it more complicated and less natural.
2. It is an obstacle to some optimizations and to some functionality
   (how could it be used in subselects or views?)

Best regards

R. Cornacchia (cornacch@cs.unibo.it) Computer Science, University of Bologna

A. Ghidini (ghidini@cs.unibo.it) Computer Science, University of Bologna

Dr. Paolo Ciaccia (ciaccia@cs.unibo.it) DEIS CSITE-CNR, University of Bologna

===========================================================

VIRGILIO MAIL - Il tuo indirizzo E-mail gratis e per sempre
http://mail.virgilio.it/


VIRGILIO - La guida italiana a Internet
http://www.virgilio.it/


Re: about 7.0 LIMIT optimization

From
Tom Lane
Date:
> Did you by chance read my previous message entitled
> "Generalized Top Queries on PostgreSQL"?

I vaguely recall it, but forget the details...

> - You cannot select the top N rows according to criterion A while
>   ordering the results by a different criterion B.

True, but I don't see how to do that with one indexscan (for that
matter, I don't even see how to express it in the SQL subset that
we support...)

> - If you ask for the best 10 rows, from a relation including 
>   100000 rows, you have to do a traditional sort on 100000 rows and
>   then retain only the first 10, doing more comparisons than requested.

Not if there's an index that implements the ordering --- and if there
is not, I don't see how to avoid the sort anyway.

> - You can choose a "fast-start" plan (i.e., basically, 
>   a pipelined plan), but you cannot perform an "early-stop" of 
>   the stream when you have a "slow-start" plan  (e.g. involving sorts 
>   or hash tables).

Why not?  The executor *will* stop when it has as many output rows as
the LIMIT demands.

> We noticed that this kind of plan often outperforms the first one.

I'd be the first to admit that the cost model needs some fine-tuning
still.  It's just a conceptual structure at this point.

> Actually, we should say we can't figure out the reason for
> managing the LIMIT clause in such a complicated way, rather than
> providing a node in the plan like any other operator.

We will probably end up doing it like that sooner or later, in order to
allow attaching LIMIT to sub-selects.  I don't take any credit or blame
for the execution-time implementation of LIMIT; I just worked with what
I found...
        regards, tom lane


Re: about 7.0 LIMIT optimization

From
"Roberto Cornacchia"
Date:
>> - You cannot select the top N rows according to criterion A while
>>   ordering the results by a different criterion B.

> True, but I don't see how to do that with one indexscan (for that
> matter, I don't even see how to express it in the SQL subset that
> we support...)

...That's why we proposed this syntax extension:

SELECT
.
.
STOP AFTER <N>   (we changed the name, but this is the LIMIT)
RANK BY <A>
ORDER BY <B>

Here you can select the best <N> rows according to <A> and then order the results on <B>.
We note that, without such an extension, you could do the same thing only
by using a subselect (with an ORDER BY clause in the inner select, which
is non-standard as well).
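In Python terms, the intended semantics can be sketched as follows (a hypothetical four-row table, not the actual implementation): select the survivors under criterion <A>, then present them under criterion <B>.

```python
import heapq

# Hypothetical rows: (salary, hire_year).
rows = [(5000, 1997), (8000, 1999), (6500, 1995), (9000, 1998)]

# STOP AFTER 2 RANK BY salary DESC: keep the 2 best rows by salary...
top_n = heapq.nlargest(2, rows, key=lambda r: r[0])

# ...ORDER BY hire_year: ...then order those survivors differently.
result = sorted(top_n, key=lambda r: r[1])
print(result)  # [(9000, 1998), (8000, 1999)]
```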
 


>> - If you ask for the best 10 rows, from a relation including 
>>   100000 rows, you have to do a traditional sort on 100000 rows and
>>   then retain only the first 10, doing more comparisons than requested.

> Not if there's an index that implements the ordering --- and if there
> is not, I don't see how to avoid the sort anyway.

Of course, if you have an index there is no problem.
It is also true that if you don't have an index there is no way to avoid
the sort, but in that case we use a specialized sort, which performs far
fewer comparisons.
For example, if you want the 10 best rows from 100000, these are the average numbers of comparisons:

QuickSort: 1.6E+14
SortStop: 1.5E+11
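The idea can be reproduced with a quick experiment (a sketch with synthetic data; the counts it measures are its own, not the figures above): wrap the comparison operator so it counts calls, then contrast a full sort with a bounded "sort-stop" that keeps only the 10 best rows seen so far.

```python
import heapq
import random

comparisons = 0

class Row:
    """Value wrapper whose < operator counts comparisons."""
    def __init__(self, v):
        self.v = v
    def __lt__(self, other):
        global comparisons
        comparisons += 1
        return self.v < other.v

random.seed(0)
data = [Row(random.random()) for _ in range(100_000)]

# Traditional plan: sort all 100000 rows, keep the first 10.
comparisons = 0
full = [r.v for r in sorted(data)[:10]]
full_cost = comparisons

# Sort-stop plan: scan once, keeping only the 10 best rows seen so far.
comparisons = 0
stopped = [r.v for r in heapq.nsmallest(10, data)]
stop_cost = comparisons

assert full == stopped          # identical answer
assert stop_cost < full_cost    # but far fewer comparisons
```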

>> - You can choose a "fast-start" plan (i.e., basically, 
>>   a pipelined plan), but you cannot perform an "early-stop" of 
>>   the stream when you have a "slow-start" plan  (e.g. involving sorts 
>>   or hash tables).

> Why not?  The executor *will* stop when it has as many output rows as
> the LIMIT demands.

Yes, but consider this very simple case:

LIMIT 10
[something else]
  MergeJoin (100000 rows)
    Sort (100000 rows)
      SeqScan on Table1 (100000 rows)
    IndexScan on Table2 (100 rows)

Assuming that referential constraints allow us to do it, we would do the following:

[something else]
  MergeJoin (10 rows)
    SortStop 10 (10 rows)
      SeqScan on Table1 (100000 rows)
    IndexScan on Table2 (100 rows)

Here, we get only 10 rows from the outer relation. *In general*, this is
NOT correct, but referential constraints make it safe in many cases. You
can see that in the second approach, the "[something else]" node will
operate on an input stream with cardinality 10, against 100000 in the
first approach. This is what we call the "push-down" of the Stop operator.
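The saving can be simulated with a toy model of the two plans (the tables and `score` column are made up, and the join is a simple probe loop rather than a MergeJoin): pushing the Stop below the join shrinks the outer stream from 100000 rows to 10 before the join runs.

```python
import heapq

# Every table1.id has a matching table2 entry (the referential
# constraint that makes the push-down safe in this example).
table1 = [{"id": i, "score": (i * 7919) % 100_000} for i in range(100_000)]
table2 = {i: f"row{i}" for i in range(100_000)}

probes = 0
def join(rows):
    """Probe table2 for each input row, counting the probes performed."""
    global probes
    out = []
    for r in rows:
        probes += 1
        out.append((r["score"], table2[r["id"]]))
    return out

# Plan 1: join all 100000 rows, then keep the 10 best.
probes = 0
plan1 = heapq.nlargest(10, join(table1))
plan1_probes = probes

# Plan 2: push the Stop below the join (safe here: every id matches).
probes = 0
plan2 = join(heapq.nlargest(10, table1, key=lambda r: r["score"]))
plan2_probes = probes

assert sorted(plan1) == sorted(plan2)          # same answer...
assert (plan1_probes, plan2_probes) == (100_000, 10)  # ...far less join work
```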
 

> I'd be the first to admit that the cost model needs some fine-tuning
> still.  It's just a conceptual structure at this point.

We hope you are not taking our posts as criticism. We used PostgreSQL as
a base for our proposal, with good results, and now we are wondering
whether you are interested in continuing in this direction.
 
Btw, DB2 currently adopts "LIMIT" optimization techniques similar to ours. 

Regards
Roberto Cornacchia





Re: [HACKERS] Re: about 7.0 LIMIT optimization

From
Don Baccus
Date:
At 07:10 PM 2/23/00 -0500, Roberto Cornacchia wrote:

>Of course, if you have an index there is no problem. 
>It is even true that if you don't have an index there is no way to avoid
>the sort, but in that case we use a specialized sort, which performs far
>fewer comparisons. 
>For example, if you want the 10 best rows from 100000, these are the
>average numbers of comparisons:
>
>QuickSort: 1.6E+14
>SortStop: 1.5E+11

This makes sense ... you can stop once you can guarantee that the first
ten rows are in proper order.  I'm not familiar with the algorithm
but not terribly surprised that one exists.



- Don Baccus, Portland OR <dhogaza@pacifier.com>
  Nature photos, on-line guides, Pacific Northwest Rare Bird Alert Service
  and other goodies at http://donb.photo.net.


Re: about 7.0 LIMIT optimization

From
Tom Lane
Date:
"Roberto Cornacchia" <rob.c@virgilio.it> writes:
>> Why not?  The executor *will* stop when it has as many output rows as
>> the LIMIT demands.

> Yes, but consider this very simple case:

> LIMIT 10
> [something else]
>   MergeJoin (100000 rows)
>     Sort (100000 rows)
>       SeqScan on Table1 (100000 rows)


>     IndexScan on Table2 (100 rows)

> Assuming that referential constraints allow us to do it, we would do the
> following:

> [something else]
>   MergeJoin (10 rows)
>     SortStop 10 (10 rows)
>         SeqScan on Table1 (100000 rows)
>     IndexScan on Table2 (100 rows)

> Here, we get only 10 rows from the outer relation. *In general*, this is
> NOT correct, but referential constraints make it safe in many cases. You
> can see that in the second approach, the "[something else]" will operate
> with an input stream cardinality of 10, against 100000 of the first
> approach. This is what we call the "push-down" of the Stop operator.

If I understand your point correctly, the existing code arrives at this
same effect through another direction: it will choose the right plan for
the query when the [something else] node doesn't need to read very many
rows.  This isn't reflected in the EXPLAIN output very well, which might
be fooling you as to what's really happening.

I'm not sure about your comment about referential constraints.  If you
are doing analysis of restriction clauses to prove that a particular
stage doesn't require reading as many rows as it otherwise would, then
you've done more than I have.
        regards, tom lane


Re: [HACKERS] Re: about 7.0 LIMIT optimization

From
Tom Lane
Date:
Don Baccus <dhogaza@pacifier.com> writes:
>> For example, if you want the 10 best rows from 100000, these are the
>> average numbers of comparisons:
>>
>> QuickSort: 1.6E+14
>> SortStop: 1.5E+11

Are there some zeroes missing here?  That sounds like an awful lot of
operations for a quicksort of only 1E5 elements...

> This makes sense ... you can stop once you can guarantee that the first
> ten rows are in proper order.  I'm not familiar with the algorithm
> but not terribly surprised that one exists.

The obvious way to do it would be with a heap-based sort.  After you've
built the heap, you pull out the first ten elements and then stop.
Offhand this only seems like it'd save about half the work, though,
so maybe Roberto has a better idea.
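For reference, the scheme just described can save much more than half for small N: building the heap over n rows is O(n), and popping the first N costs only O(N log n), versus O(n log n) for a full sort. A minimal sketch with made-up data:

```python
import heapq
import random

random.seed(1)
data = [random.random() for _ in range(100_000)]

# Build the heap over all rows (linear time)...
heap = list(data)
heapq.heapify(heap)

# ...then pull out just the first ten elements and stop.
top10 = [heapq.heappop(heap) for _ in range(10)]

assert top10 == sorted(data)[:10]
```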
        regards, tom lane


Re: [HACKERS] Re: about 7.0 LIMIT optimization

From
Don Baccus
Date:
At 09:30 PM 2/23/00 -0500, Tom Lane wrote:
>Don Baccus <dhogaza@pacifier.com> writes:
>>> For example, if you want the 10 best rows from 100000, these are the
>>> average numbers of comparisons:
>>>
>>> QuickSort: 1.6E+14
>>> SortStop: 1.5E+11
>
>Are there some zeroes missing here?  That sounds like an awful lot of
>operations for a quicksort of only 1E5 elements...

Yeah, obviously one or more of his numbers are wrong.  Let's see, a
bubble sort's only O(n^2), "only" 1E10/2 comparisons for 1E5 elements,
right?  Surely O(n*log(n)) is quicker :)

>
>> This makes sense ... you can stop once you can guarantee that the first
>> ten rows are in proper order.  I'm not familiar with the algorithm
>> but not terribly surprised that one exists.
>
>The obvious way to do it would be with a heap-based sort.  After you've
>built the heap, you pull out the first ten elements and then stop.
>Offhand this only seems like it'd save about half the work, though,
>so maybe Roberto has a better idea.

I'd like to see some elaboration.



- Don Baccus, Portland OR <dhogaza@pacifier.com>
  Nature photos, on-line guides, Pacific Northwest Rare Bird Alert Service
  and other goodies at http://donb.photo.net.


Re: about 7.0 LIMIT optimization

From
"Roberto Cornacchia"
Date:
Hi,

> I'm not sure about your comment about referential constraints.  If you
> are doing analysis of restriction clauses to prove that a particular
> stage doesn't require reading as many rows as it otherwise would, then
> you've done more than I have.

Yes, that's what we do. Here is a clarifying example:

-----
"Retrieve the name, salary and department name of the 10 highest-paid employees"

SELECT Emp.name, Emp.salary, Dept.name
FROM Emp, Dept
WHERE Emp.dno = Dept.dno
STOP AFTER 10
RANK BY Emp.salary DESC;
-----

Suppose you have a referential constraint like:
Emp.dno --> Dept.dno  (foreign key --> primary key)

In this case we can do:

Join (Emp.dno = Dept.dno)
  Stop 10
    Scan Emp
  Scan Dept

since we are sure that every employee works in a department (because of
the constraint), the 10 highest-paid employees will survive the join. In
this way you can reduce the cardinality of one of the input streams of
the join, obtaining the same final results.
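The safety condition can be demonstrated with a toy check (hypothetical tables, not our actual algorithm): with the foreign key satisfied the push-down preserves the answer; with a dangling employee it does not.

```python
import heapq

dept = {1: "Sales", 2: "R&D"}
emp = [("ann", 900, 1), ("bob", 800, 2), ("carl", 700, 1), ("dora", 600, 2)]

def top2_then_join(employees):
    """Push-down: Stop 2 below the join."""
    top = heapq.nlargest(2, employees, key=lambda e: e[1])
    return [(n, s, dept[d]) for n, s, d in top if d in dept]

def join_then_top2(employees):
    """No push-down: join first, Stop 2 on the result."""
    joined = [(n, s, dept[d]) for n, s, d in employees if d in dept]
    return heapq.nlargest(2, joined, key=lambda e: e[1])

# FK holds: every emp.dno exists in dept, so the push-down is safe.
assert top2_then_join(emp) == join_then_top2(emp)

# FK violated: the best-paid employee has no department, so stopping
# the outer stream early loses a row that should have qualified.
bad = [("eve", 999, 7)] + emp   # dno 7 does not exist
assert top2_then_join(bad) != join_then_top2(bad)
```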
 

Note that this is a very simple case. In many plans involving a larger
number of joins you can place a Stop operator in a deep position,
reducing the work of all the subsequent joins.
 

We have formalized a set of rules which allow us to determine whether or
not a position in the plan is safe for the Stop operator, and we have
developed a fast algorithm able to take the right decision.
 

Regards

R. Cornacchia
A. Ghidini
Dr. P. Ciaccia
