Thread: Planner not using column limit specified for one column for another column equal to first
Planner not using column limit specified for one column for another column equal to first
From
Віталій Тимчишин
Date:
Hello.
I have a query that performs very poor because there is a limit on join column that is not applied to other columns:
select * from company this_ left outer join company_tag this_1_ on this_.id=this_1_.company_id left outer join company_measures companymea2_ on this_.id=companymea2_.company_id left outer join company_descr ces3_ on this_.id=ces3_.company_id where this_1_.tag_id = 7 and this_.id>50000000
and this_1_.company_id>50000000
order by this_.id asc limit 1000;
(plan1.txt)
Total runtime: 7794.692 ms
At the same time if I apply the limit (>50000000) to other columns in query itself it works like a charm:
select * from company this_ left outer join company_tag this_1_ on this_.id=this_1_.company_id left outer join company_measures companymea2_ on this_.id=companymea2_.company_id left outer join company_descr ces3_ on this_.id=ces3_.company_id where this_1_.tag_id = 7 and this_.id>50000000
and this_1_.company_id>50000000
and companymea2_.company_id>50000000 and ces3_.company_id>50000000
order by this_.id asc limit 1000;
(plan2.txt)
Total runtime: 27.547 ms
I've thought and someone in this list've told me that this should be done automatically. But I have pretty recent server:
PostgreSQL 8.4.2 on amd64-portbld-freebsd8.0, compiled by GCC cc (GCC) 4.2.1 20070719 [FreeBSD], 64-bit
and it still do not work
Do I misunderstand something or this feature don't work in such a query?
Best regards, Vitalii Tymchyshyn
Attachment
Re: Planner not using column limit specified for one column for another column equal to first
From
Yeb Havinga
Date:
Віталій Тимчишин wrote: > Hello. > > I have a query that performs very poor because there is a limit on > join column that is not applied to other columns: > > select * from company this_ left outer join company_tag this_1_ on > this_.id=this_1_.company_id left outer join company_measures > companymea2_ on this_.id=companymea2_.company_id left outer join > company_descr ces3_ on this_.id=ces3_.company_id where this_1_.tag_id > = 7 and this_.id>50000000 > and this_1_.company_id>50000000 > order by this_.id asc limit 1000; > > (plan1.txt) > Total runtime: 7794.692 ms > > At the same time if I apply the limit (>50000000) to other columns in > query itself it works like a charm: > > select * from company this_ left outer join company_tag this_1_ on > this_.id=this_1_.company_id left outer join company_measures > companymea2_ on this_.id=companymea2_.company_id left outer join > company_descr ces3_ on this_.id=ces3_.company_id where this_1_.tag_id > = 7 and this_.id>50000000 > and this_1_.company_id>50000000 > and companymea2_.company_id>50000000 and ces3_.company_id>50000000 > order by this_.id asc limit 1000; > > (plan2.txt) > Total runtime: 27.547 ms > > I've thought and someone in this list've told me that this should be > done automatically. Yes, if you have in a query a=b and b=c, then the optimizer figures out that a=c as well. (a,b and c are then member of the same equivalence class). However both queries are not the same, since the joins you're using are outer joins. In the first it's possible that records are returned for company records with no matching ces3_ records, the ces3_ records is null in that case. In the second query no NULL ces3_ information may be returned. Another thing is it seems that the number of rows guessed is far off from the actual number of rows, is the number 5000000 artificial or are you're statistics old or too small histogram/mcv's? regards, Yeb Havinga
Re: Planner not using column limit specified for one column for another column equal to first
From
Віталій Тимчишин
Date:
16 квітня 2010 р. 11:31 Yeb Havinga <yebhavinga@gmail.com> написав:
Віталій Тимчишин wrote:Yes, if you have in a query a=b and b=c, then the optimizer figures out that a=c as well. (a,b and c are then member of the same equivalence class).Hello.
I have a query that performs very poor because there is a limit on join column that is not applied to other columns:
select * from company this_ left outer join company_tag this_1_ on this_.id=this_1_.company_id left outer join company_measures companymea2_ on this_.id=companymea2_.company_id left outer join company_descr ces3_ on this_.id=ces3_.company_id where this_1_.tag_id = 7 and this_.id>50000000 and this_1_.company_id>50000000
order by this_.id asc limit 1000;
(plan1.txt)
Total runtime: 7794.692 ms
At the same time if I apply the limit (>50000000) to other columns in query itself it works like a charm:
select * from company this_ left outer join company_tag this_1_ on this_.id=this_1_.company_id left outer join company_measures companymea2_ on this_.id=companymea2_.company_id left outer join company_descr ces3_ on this_.id=ces3_.company_id where this_1_.tag_id = 7 and this_.id>50000000 and this_1_.company_id>50000000
and companymea2_.company_id>50000000 and ces3_.company_id>50000000
order by this_.id asc limit 1000;
(plan2.txt)
Total runtime: 27.547 ms
I've thought and someone in this list've told me that this should be done automatically.
However both queries are not the same, since the joins you're using are outer joins. In the first it's possible that records are returned for company records with no matching ces3_ records, the ces3_ records is null in that case. In the second query no NULL ces3_ information may be returned.
OK, but when I move limit to join condition the query is still fast:
select * from company this_ left outer join company_tag this_1_ on this_.id=this_1_.company_id
left outer join company_measures companymea2_ on this_.id=companymea2_.company_id and companymea2_.company_id>50000000
left outer join company_descr ces3_ on this_.id=ces3_.company_id and ces3_.company_id>50000000
where this_1_.tag_id = 7 and this_.id>50000000
and this_1_.company_id>50000000
order by this_.id asc limit 1000;
(plan3.txt),
Total runtime: 26.327 ms
BTW: Changing slow query to inner joins do not make it fast
Another thing is it seems that the number of rows guessed is far off from the actual number of rows, is the number 5000000 artificial or are you're statistics old or too small histogram/mcv's?
Nope, I suppose this is because of limit. If I remove the limit, the estimations are quite correct. There are ~6 millions of row in each table.
Attachment
Re: Planner not using column limit specified for one column for another column equal to first
From
Віталій Тимчишин
Date:
16 квітня 2010 р. 11:25 Hannu Krosing <hannu@2ndquadrant.com> написав:
On Fri, 2010-04-16 at 11:02 +0300, Віталій Тимчишин wrote:The queries are not the same.
> Hello.
>
>
> I have a query that performs very poor because there is a limit on
> join column that is not applied to other columns:
>
>
> select * from company this_ left outer join company_tag this_1_ on
> this_.id=this_1_.company_id left outer join company_measures
> companymea2_ on this_.id=companymea2_.company_id left outer join
> company_descr ces3_ on this_.id=ces3_.company_id where this_1_.tag_id
> = 7 and this_.id>50000000
> and this_1_.company_id>50000000
> order by this_.id asc limit 1000;
>
>
> (plan1.txt)
> Total runtime: 7794.692 ms
>
>
> At the same time if I apply the limit (>50000000) to other columns in
> query itself it works like a charm:
>
>
> select * from company this_ left outer join company_tag this_1_ on
> this_.id=this_1_.company_id left outer join company_measures
> companymea2_ on this_.id=companymea2_.company_id left outer join
> company_descr ces3_ on this_.id=ces3_.company_id where this_1_.tag_id
> = 7 and this_.id>50000000
> and this_1_.company_id>50000000
> and companymea2_.company_id>50000000 and ces3_.company_id>50000000
> order by this_.id asc limit 1000;
2nd variant will not return the rows where there are no matching rows
inthis_1_ , companymea2_ or ces3_.company_id
A query equivalent to first one would be:and this_1_.company_id>50000000)
select * from company this_
left outer join company_tag this_1_
on (this_.id=this_1_.company_idleft outer join company_measures companymea2_and companymea2_.company_id>50000000)
on (this_.id=companymea2_.company_idleft outer join company_descr ces3_and ces3_.company_id>50000000)
on (this_.id=ces3_.company_idwhere this_1_.tag_id = 7
and this_.id>50000000order by this_.id asc
limit 1000;
And it's still fast (see plan in another mail), while "inner join" variant of original query is still slow.
I'm not sure that planner considers the above form of plan rewrite, nor
that it would make much sense to do so unless there was a really small
number of rows where x_.company_id>50000000
Actually no,
select id > 50000000, count(*) from company group by 1
f,1096042
t,5725630
I don't know why the planner wishes to perform few merges of 1000 to a million of records (and the merges is the thing that takes time) instead of taking a 1000 of records from main table and then doing a nested loop. And it must read all the records that DO NOT match the criteria for secondary tables before getting to correct records if it do not filter secondary tables with index on retrieve.
set enable_mergejoin=false helps original query, but this is another problem and first solution is simpler and can be used by planner automatically, while second requires rethinking/rewrite of LIMIT estimation logic
(Plan of nested loop attached)
Attachment
Re: Planner not using column limit specified for one column for another column equal to first
From
Yeb Havinga
Date:
Віталій Тимчишин wrote: > > BTW: Changing slow query to inner joins do not make it fast I'm interested to see the query andplan of the slow query with inner joins. > > > Another thing is it seems that the number of rows guessed is far > off from the actual number of rows, is the number 5000000 > artificial or are you're statistics old or too small histogram/mcv's? > > > Nope, I suppose this is because of limit. If I remove the limit, the > estimations are quite correct. There are ~6 millions of row in each table. Yes, that makes sense. regards, Yeb Havinga
Re: Planner not using column limit specified for one column for another column equal to first
From
Віталій Тимчишин
Date:
16 квітня 2010 р. 16:21 Yeb Havinga <yebhavinga@gmail.com> написав:
Віталій Тимчишин wrote:I'm interested to see the query andplan of the slow query with inner joins.
BTW: Changing slow query to inner joins do not make it fast
Here you are. The query:
select * from company this_ inner join company_tag this_1_ on this_.id=this_1_.company_id
inner join company_measures companymea2_ on this_.id=companymea2_.company_id
inner join company_descr ces3_ on this_.id=ces3_.company_id
where this_1_.tag_id = 7 and this_.id>50000000
order by this_.id asc
limit 1000
;
Total runtime: 14088.942 ms
(plan is attached)
Best regards, Vitalii Tymchyshyn
Attachment
Re: Planner not using column limit specified for one column for another column equal to first
From
Tom Lane
Date:
=?KOI8-U?B?96bUwcymyiD0yc3eydvJzg==?= <tivv00@gmail.com> writes: > I've thought and someone in this list've told me that this should be done > automatically. No, that's not true. We do make deductions about transitive equalities, ie, given WHERE a=b AND b=c the planner will infer a=c and use that if it's helpful. We don't make deductions about inequalities such as a>c. In theory there's enough information available to do so, but overall trying to do that would probably waste more cycles than it would save. You'd need a lot of expensive new planner infrastructure, and in the vast majority of queries it wouldn't produce anything very helpful. As was pointed out, even if we had such logic it wouldn't apply in this example, because the equality conditions aren't real equalities but OUTER JOIN conditions. regards, tom lane
Re: Planner not using column limit specified for one column for another column equal to first
From
Yeb Havinga
Date:
Tom Lane wrote: > =?KOI8-U?B?96bUwcymyiD0yc3eydvJzg==?= <tivv00@gmail.com> writes: > >> I've thought and someone in this list've told me that this should be done >> automatically. >> > > No, that's not true. We do make deductions about transitive equalities, > ie, given WHERE a=b AND b=c the planner will infer a=c and use that if > it's helpful. We don't make deductions about inequalities such as a>c. > In theory there's enough information available to do so, but overall > trying to do that would probably waste more cycles than it would save. > You'd need a lot of expensive new planner infrastructure, and in the > vast majority of queries it wouldn't produce anything very helpful. > New expensive planner infrastructure to support from a>b and b>c infer a>c, yes. But I wonder if something like Leibniz's principle of identity holds for members of the same equivalence class, e.g. like if x,y are both members of the same EC, then for every predicate P, P(x) iff P(y). Probably not for every predicate (like varno = 2 or attname='x'), but for the query evaluation, the object denoted by the variables are the same, since that is the standard meaning of the = operator. I cannot think of any standard (btree) operator where 'Leibniz' would fail in this case. regards, Yeb Havinga
Re: Planner not using column limit specified for one column for another column equal to first
From
Tom Lane
Date:
Yeb Havinga <yebhavinga@gmail.com> writes: > New expensive planner infrastructure to support from a>b and b>c infer > a>c, yes. > But I wonder if something like Leibniz's principle of identity holds for > members of the same equivalence class, e.g. like if x,y are both members > of the same EC, then for every predicate P, P(x) iff P(y). This could only be assumed to apply for predicates constructed from operators that are in the equivalence operator's btree opfamily. Now, that would certainly be a large enough set of cases to sometimes give useful results --- but I stand by the opinion that it wouldn't win often enough to justify the added planner overhead. regards, tom lane
Re: Planner not using column limit specified for one column for another column equal to first
From
Віталій Тимчишин
Date:
16 квітня 2010 р. 17:19 Tom Lane <tgl@sss.pgh.pa.us> написав:
-- Віталій Тимчишин <tivv00@gmail.com> writes:As was pointed out, even if we had such logic it wouldn't apply in this
> I've thought and someone in this list've told me that this should be done
> automatically.
example, because the equality conditions aren't real equalities but
OUTER JOIN conditions.
In this case you can copy condition to "ON" condition, not to where cause and this would work correct, e.g. "select something from a join b on a.x=b.y where a.x > n" <=> "select something from a join b on a.x=b.y and b.y > n where a.x > n".
As of making planner more clever, may be it is possible to introduce division on "fast queries" and "long queries", so that if after fast planning cost is greater then some configurable threshold, advanced planning techniques (or settings) are used. As far as I have seen in this list, many techniques are not used simply because they are too complex and could make planning take too much time for really fast queries, but they are vital for long ones.
Also same (or similar) threshold could be used to enable replanning for each run of prepared query - also an often complaint is that planned query is not that fast as is could be.
Best regards,
Vitalii Tymchyshyn
Re: Planner not using column limit specified for one column for another column equal to first
From
Matthew Wakeling
Date:
On Sat, 17 Apr 2010, Віталій Тимчишин wrote: > As of making planner more clever, may be it is possible to introduce > division on "fast queries" and "long queries", so that if after fast > planning cost is greater then some configurable threshold, advanced planning > techniques (or settings) are used. As far as I have seen in this list, many > techniques are not used simply because they are too complex and could make > planning take too much time for really fast queries, but they are vital for > long ones. +1. That's definitely a good idea in my view. The query optimiser I wrote (which sits on top of Postgres and makes use of materialised views to speed up queries) uses a similar approach - it expends effort proportional to the estimated cost of the query, as reported by EXPLAIN. Matthew -- To most people, solutions mean finding the answers. But to chemists, solutions are things that are still all mixed up.
Re: Planner not using column limit specified for one column for another column equal to first
From
Hannu Krosing
Date:
On Fri, 2010-04-16 at 11:02 +0300, Віталій Тимчишин wrote: > Hello. > > > I have a query that performs very poor because there is a limit on > join column that is not applied to other columns: > > > select * from company this_ left outer join company_tag this_1_ on > this_.id=this_1_.company_id left outer join company_measures > companymea2_ on this_.id=companymea2_.company_id left outer join > company_descr ces3_ on this_.id=ces3_.company_id where this_1_.tag_id > = 7 and this_.id>50000000 > and this_1_.company_id>50000000 > order by this_.id asc limit 1000; > > > (plan1.txt) > Total runtime: 7794.692 ms > > > At the same time if I apply the limit (>50000000) to other columns in > query itself it works like a charm: > > > select * from company this_ left outer join company_tag this_1_ on > this_.id=this_1_.company_id left outer join company_measures > companymea2_ on this_.id=companymea2_.company_id left outer join > company_descr ces3_ on this_.id=ces3_.company_id where this_1_.tag_id > = 7 and this_.id>50000000 > and this_1_.company_id>50000000 > and companymea2_.company_id>50000000 and ces3_.company_id>50000000 > order by this_.id asc limit 1000; The queries are not the same. 2nd variant will not return the rows where there are no matching rows inthis_1_ , companymea2_ or ces3_.company_id A query equivalent to first one would be: select * from company this_ left outer join company_tag this_1_ on (this_.id=this_1_.company_id and this_1_.company_id>50000000) left outer join company_measures companymea2_ on (this_.id=companymea2_.company_id and companymea2_.company_id>50000000) left outer join company_descr ces3_ on (this_.id=ces3_.company_id and ces3_.company_id>50000000) where this_1_.tag_id = 7 and this_.id>50000000 order by this_.id asc limit 1000; I'm not sure that planner considers the above form of plan rewrite, nor that it would make much sense to do so unless there was a really small number of rows where x_.company_id>50000000 -- Hannu Krosing http://www.2ndQuadrant.com PostgreSQL Scalability and Availability Services, Consulting and Training