Thread: Planner not using column limit specified for one column for another column equal to first

Planner not using column limit specified for one column for another column equal to first

From
Віталій Тимчишин
Date:
Hello.

I have a query that performs very poor because there is a limit on join column that is not applied to other columns:

select * from company this_ left outer join company_tag this_1_ on this_.id=this_1_.company_id left outer join company_measures companymea2_ on this_.id=companymea2_.company_id left outer join company_descr ces3_ on this_.id=ces3_.company_id where this_1_.tag_id = 7 and this_.id>50000000 
and this_1_.company_id>50000000
order by this_.id asc limit 1000;

(plan1.txt)
Total runtime: 7794.692 ms

At the same time if I apply the limit (>50000000) to other columns in query itself it works like a charm:

select * from company this_ left outer join company_tag this_1_ on this_.id=this_1_.company_id left outer join company_measures companymea2_ on this_.id=companymea2_.company_id left outer join company_descr ces3_ on this_.id=ces3_.company_id where this_1_.tag_id = 7 and this_.id>50000000 
and this_1_.company_id>50000000
and companymea2_.company_id>50000000 and ces3_.company_id>50000000
order by this_.id asc limit 1000;

(plan2.txt)
Total runtime: 27.547 ms

I've thought and someone in this list've told me that this should be done automatically. But I have pretty recent server:
PostgreSQL 8.4.2 on amd64-portbld-freebsd8.0, compiled by GCC cc (GCC) 4.2.1 20070719  [FreeBSD], 64-bit
and it still do not work

Do I misunderstand something or this feature don't work in such a query?

Best regards, Vitalii Tymchyshyn
Attachment
Віталій Тимчишин wrote:
> Hello.
>
> I have a query that performs very poor because there is a limit on
> join column that is not applied to other columns:
>
> select * from company this_ left outer join company_tag this_1_ on
> this_.id=this_1_.company_id left outer join company_measures
> companymea2_ on this_.id=companymea2_.company_id left outer join
> company_descr ces3_ on this_.id=ces3_.company_id where this_1_.tag_id
> = 7 and this_.id>50000000
> and this_1_.company_id>50000000
> order by this_.id asc limit 1000;
>
> (plan1.txt)
> Total runtime: 7794.692 ms
>
> At the same time if I apply the limit (>50000000) to other columns in
> query itself it works like a charm:
>
> select * from company this_ left outer join company_tag this_1_ on
> this_.id=this_1_.company_id left outer join company_measures
> companymea2_ on this_.id=companymea2_.company_id left outer join
> company_descr ces3_ on this_.id=ces3_.company_id where this_1_.tag_id
> = 7 and this_.id>50000000
> and this_1_.company_id>50000000
> and companymea2_.company_id>50000000 and ces3_.company_id>50000000
> order by this_.id asc limit 1000;
>
> (plan2.txt)
> Total runtime: 27.547 ms
>
> I've thought and someone in this list've told me that this should be
> done automatically.
Yes, if you have in a query a=b and b=c, then the optimizer figures out
that a=c as well. (a,b and c are then member of the same equivalence class).

However both queries are not the same, since the joins you're using are
outer joins. In the first it's possible that records are returned for
company records with no matching ces3_ records, the ces3_ records is
null in that case. In the second query no NULL ces3_ information may be
returned.

Another thing is it seems that the number of rows guessed is far off
from the actual number of rows, is the number 5000000 artificial or are
you're statistics old or too small histogram/mcv's?

regards,
Yeb Havinga


Re: Planner not using column limit specified for one column for another column equal to first

From
Віталій Тимчишин
Date:


16 квітня 2010 р. 11:31 Yeb Havinga <yebhavinga@gmail.com> написав:
Віталій Тимчишин wrote:
Hello.

I have a query that performs very poor because there is a limit on join column that is not applied to other columns:

select * from company this_ left outer join company_tag this_1_ on this_.id=this_1_.company_id left outer join company_measures companymea2_ on this_.id=companymea2_.company_id left outer join company_descr ces3_ on this_.id=ces3_.company_id where this_1_.tag_id = 7 and this_.id>50000000 and this_1_.company_id>50000000
order by this_.id asc limit 1000;

(plan1.txt)
Total runtime: 7794.692 ms

At the same time if I apply the limit (>50000000) to other columns in query itself it works like a charm:

select * from company this_ left outer join company_tag this_1_ on this_.id=this_1_.company_id left outer join company_measures companymea2_ on this_.id=companymea2_.company_id left outer join company_descr ces3_ on this_.id=ces3_.company_id where this_1_.tag_id = 7 and this_.id>50000000 and this_1_.company_id>50000000
and companymea2_.company_id>50000000 and ces3_.company_id>50000000
order by this_.id asc limit 1000;

(plan2.txt)
Total runtime: 27.547 ms

I've thought and someone in this list've told me that this should be done automatically.
Yes, if you have in a query a=b and b=c, then the optimizer figures out that a=c as well. (a,b and c are then member of the same equivalence class).

However both queries are not the same, since the joins you're using are outer joins. In the first it's possible that records are returned for company records with no matching ces3_ records, the ces3_ records is null in that case. In the second query no NULL ces3_ information may be returned.

OK, but when I move limit to join condition the query is still fast:

select * from company this_ left outer join company_tag this_1_ on this_.id=this_1_.company_id 
left outer join company_measures companymea2_ on this_.id=companymea2_.company_id and companymea2_.company_id>50000000
left outer join company_descr ces3_ on this_.id=ces3_.company_id and ces3_.company_id>50000000
where this_1_.tag_id = 7 and this_.id>50000000 
and this_1_.company_id>50000000
order by this_.id asc limit 1000;

(plan3.txt),
Total runtime: 26.327 ms
BTW: Changing slow query to inner joins do not make it fast
 

Another thing is it seems that the number of rows guessed is far off from the actual number of rows, is the number 5000000 artificial or are you're statistics old or too small histogram/mcv's?

Nope, I suppose this is because of limit. If I remove the limit, the estimations are quite correct. There are ~6 millions of row in each table.

Attachment

Re: Planner not using column limit specified for one column for another column equal to first

From
Віталій Тимчишин
Date:


16 квітня 2010 р. 11:25 Hannu Krosing <hannu@2ndquadrant.com> написав:
On Fri, 2010-04-16 at 11:02 +0300, Віталій Тимчишин wrote:
> Hello.
>
>
> I have a query that performs very poor because there is a limit on
> join column that is not applied to other columns:
>
>
> select * from company this_ left outer join company_tag this_1_ on
> this_.id=this_1_.company_id left outer join company_measures
> companymea2_ on this_.id=companymea2_.company_id left outer join
> company_descr ces3_ on this_.id=ces3_.company_id where this_1_.tag_id
> = 7 and this_.id>50000000
> and this_1_.company_id>50000000
> order by this_.id asc limit 1000;
>
>
> (plan1.txt)
> Total runtime: 7794.692 ms
>
>
> At the same time if I apply the limit (>50000000) to other columns in
> query itself it works like a charm:
>
>
> select * from company this_ left outer join company_tag this_1_ on
> this_.id=this_1_.company_id left outer join company_measures
> companymea2_ on this_.id=companymea2_.company_id left outer join
> company_descr ces3_ on this_.id=ces3_.company_id where this_1_.tag_id
> = 7 and this_.id>50000000
> and this_1_.company_id>50000000
> and companymea2_.company_id>50000000 and ces3_.company_id>50000000
> order by this_.id asc limit 1000;

The queries are not the same.

2nd variant will not return the rows where there are no matching rows
inthis_1_ , companymea2_ or ces3_.company_id

A query equivalent to first one would be:


select * from company this_
 left outer join company_tag this_1_
              on (this_.id=this_1_.company_id
              and this_1_.company_id>50000000)
 left outer join company_measures companymea2_
              on (this_.id=companymea2_.company_id
              and companymea2_.company_id>50000000)
 left outer join company_descr ces3_
              on (this_.id=ces3_.company_id
              and ces3_.company_id>50000000)
 where this_1_.tag_id = 7
  and this_.id>50000000
 order by this_.id asc
 limit 1000;

And it's still fast (see plan in another mail), while "inner join" variant of original query is still slow.
 


I'm not sure that planner considers the above form of plan rewrite, nor
that it would make much sense to do so unless there was a really small
number of rows where x_.company_id>50000000

Actually no,
select id > 50000000, count(*) from company group by 1
f,1096042
t,5725630

I don't know why the planner wishes to perform few merges of 1000 to a million of records (and the merges is the thing that takes time) instead of taking a 1000 of records from main table and then doing a nested loop. And it must read all the records that DO NOT match the criteria for secondary tables before getting to correct records if it do not filter secondary tables with index on retrieve.

set enable_mergejoin=false helps original query, but this is another problem and first solution is simpler and can be used by planner automatically, while second requires rethinking/rewrite of LIMIT estimation logic
(Plan of nested loop attached)

Attachment
Віталій Тимчишин wrote:
>
> BTW: Changing slow query to inner joins do not make it fast
I'm interested to see the query andplan of the slow query with inner joins.

>
>
>     Another thing is it seems that the number of rows guessed is far
>     off from the actual number of rows, is the number 5000000
>     artificial or are you're statistics old or too small histogram/mcv's?
>
>
> Nope, I suppose this is because of limit. If I remove the limit, the
> estimations are quite correct. There are ~6 millions of row in each table.
Yes, that makes sense.

regards,
Yeb Havinga


Re: Planner not using column limit specified for one column for another column equal to first

From
Віталій Тимчишин
Date:


16 квітня 2010 р. 16:21 Yeb Havinga <yebhavinga@gmail.com> написав:
Віталій Тимчишин wrote:

BTW: Changing slow query to inner joins do not make it fast
I'm interested to see the query andplan of the slow query with inner joins.


Here you are. The query:

select * from company this_ inner join company_tag this_1_ on this_.id=this_1_.company_id 
inner join company_measures companymea2_ on this_.id=companymea2_.company_id 
inner join company_descr ces3_ on this_.id=ces3_.company_id 
where this_1_.tag_id = 7 and this_.id>50000000 
order by this_.id asc
limit 1000
Total runtime: 14088.942 ms
(plan is attached)

Best regards, Vitalii Tymchyshyn
Attachment
=?KOI8-U?B?96bUwcymyiD0yc3eydvJzg==?= <tivv00@gmail.com> writes:
> I've thought and someone in this list've told me that this should be done
> automatically.

No, that's not true.  We do make deductions about transitive equalities,
ie, given WHERE a=b AND b=c the planner will infer a=c and use that if
it's helpful.  We don't make deductions about inequalities such as a>c.
In theory there's enough information available to do so, but overall
trying to do that would probably waste more cycles than it would save.
You'd need a lot of expensive new planner infrastructure, and in the
vast majority of queries it wouldn't produce anything very helpful.

As was pointed out, even if we had such logic it wouldn't apply in this
example, because the equality conditions aren't real equalities but
OUTER JOIN conditions.

            regards, tom lane

Tom Lane wrote:
> =?KOI8-U?B?96bUwcymyiD0yc3eydvJzg==?= <tivv00@gmail.com> writes:
>
>> I've thought and someone in this list've told me that this should be done
>> automatically.
>>
>
> No, that's not true.  We do make deductions about transitive equalities,
> ie, given WHERE a=b AND b=c the planner will infer a=c and use that if
> it's helpful.  We don't make deductions about inequalities such as a>c.
> In theory there's enough information available to do so, but overall
> trying to do that would probably waste more cycles than it would save.
> You'd need a lot of expensive new planner infrastructure, and in the
> vast majority of queries it wouldn't produce anything very helpful.
>
New expensive planner infrastructure to support from a>b and b>c infer
a>c, yes.

But I wonder if something like Leibniz's principle of identity holds for
members of the same equivalence class, e.g. like if x,y are both members
of the same EC, then for every predicate P, P(x) iff P(y). Probably not
for every predicate (like varno = 2 or attname='x'), but for the query
evaluation, the object denoted by the variables are the same, since that
is the standard meaning of the = operator. I cannot think of any
standard (btree) operator where 'Leibniz' would fail in this case.

regards,
Yeb Havinga



Yeb Havinga <yebhavinga@gmail.com> writes:
> New expensive planner infrastructure to support from a>b and b>c infer
> a>c, yes.

> But I wonder if something like Leibniz's principle of identity holds for
> members of the same equivalence class, e.g. like if x,y are both members
> of the same EC, then for every predicate P, P(x) iff P(y).

This could only be assumed to apply for predicates constructed from
operators that are in the equivalence operator's btree opfamily.
Now, that would certainly be a large enough set of cases to sometimes
give useful results --- but I stand by the opinion that it wouldn't
win often enough to justify the added planner overhead.

            regards, tom lane

Re: Planner not using column limit specified for one column for another column equal to first

From
Віталій Тимчишин
Date:


16 квітня 2010 р. 17:19 Tom Lane <tgl@sss.pgh.pa.us> написав:
Віталій Тимчишин <tivv00@gmail.com> writes:
> I've thought and someone in this list've told me that this should be done
> automatically.

As was pointed out, even if we had such logic it wouldn't apply in this
example, because the equality conditions aren't real equalities but
OUTER JOIN conditions.


In this case you can copy condition to "ON" condition, not to where cause and this would work correct, e.g. "select something from a join b on a.x=b.y where a.x > n" <=> "select something from a join b on a.x=b.y and b.y > n where a.x > n".

As of making planner more clever, may be it is possible to introduce division on "fast queries" and "long queries", so that if after fast planning cost is greater then some configurable threshold, advanced planning techniques (or settings) are used. As far as I have seen in this list, many techniques are not used simply because they are too complex and could make planning take too much time for really fast queries, but they are vital for long ones.
Also same (or similar) threshold could be used to enable replanning for each run of prepared query - also an often complaint is that planned query is not that fast as is could be.
 
--
Best regards,
Vitalii Tymchyshyn
On Sat, 17 Apr 2010, Віталій Тимчишин wrote:
> As of making planner more clever, may be it is possible to introduce
> division on "fast queries" and "long queries", so that if after fast
> planning cost is greater then some configurable threshold, advanced planning
> techniques (or settings) are used. As far as I have seen in this list, many
> techniques are not used simply because they are too complex and could make
> planning take too much time for really fast queries, but they are vital for
> long ones.

+1. That's definitely a good idea in my view. The query optimiser I wrote
(which sits on top of Postgres and makes use of materialised views to
speed up queries) uses a similar approach - it expends effort proportional
to the estimated cost of the query, as reported by EXPLAIN.

Matthew

--
 To most people, solutions mean finding the answers. But to chemists,
 solutions are things that are still all mixed up.

On Fri, 2010-04-16 at 11:02 +0300, Віталій Тимчишин wrote:
> Hello.
>
>
> I have a query that performs very poor because there is a limit on
> join column that is not applied to other columns:
>
>
> select * from company this_ left outer join company_tag this_1_ on
> this_.id=this_1_.company_id left outer join company_measures
> companymea2_ on this_.id=companymea2_.company_id left outer join
> company_descr ces3_ on this_.id=ces3_.company_id where this_1_.tag_id
> = 7 and this_.id>50000000
> and this_1_.company_id>50000000
> order by this_.id asc limit 1000;
>
>
> (plan1.txt)
> Total runtime: 7794.692 ms
>
>
> At the same time if I apply the limit (>50000000) to other columns in
> query itself it works like a charm:
>
>
> select * from company this_ left outer join company_tag this_1_ on
> this_.id=this_1_.company_id left outer join company_measures
> companymea2_ on this_.id=companymea2_.company_id left outer join
> company_descr ces3_ on this_.id=ces3_.company_id where this_1_.tag_id
> = 7 and this_.id>50000000
> and this_1_.company_id>50000000
> and companymea2_.company_id>50000000 and ces3_.company_id>50000000
> order by this_.id asc limit 1000;

The queries are not the same.

2nd variant will not return the rows where there are no matching rows
inthis_1_ , companymea2_ or ces3_.company_id

A query equivalent to first one would be:


select * from company this_
  left outer join company_tag this_1_
               on (this_.id=this_1_.company_id
           and this_1_.company_id>50000000)
  left outer join company_measures companymea2_
               on (this_.id=companymea2_.company_id
           and companymea2_.company_id>50000000)
  left outer join company_descr ces3_
               on (this_.id=ces3_.company_id
           and ces3_.company_id>50000000)
 where this_1_.tag_id = 7
   and this_.id>50000000
 order by this_.id asc
 limit 1000;


I'm not sure that planner considers the above form of plan rewrite, nor
that it would make much sense to do so unless there was a really small
number of rows where x_.company_id>50000000


--
Hannu Krosing   http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability
   Services, Consulting and Training