Thread: Optimizer picks an inefficient plan
I have a customer table that has the field CUSTOMER_ID as the primary key (cust_pkkey); the table has 102,834 records in it. The following select statement works fine:

select * from customer order by customer_id;

QUERY PLAN:
Index Scan using cust_pkkey on customer  (cost=0.00..5175.17 rows=102834 width=724)
Total runtime: 5999.47 msec

but...

select * from customer order by customer_id, first_name;

QUERY PLAN:
Sort  (cost=142028.25..142028.25 rows=102834 width=724)
  ->  Seq Scan on customer  (cost=0.00..4617.34 rows=102834 width=724)
Total runtime: 19999.81 msec

It seems that the optimizer should be able to detect (in this instance at least) that the first order-by field is a primary key and should not consider the other fields, because it's pointless... the result set will be in <primary key> order regardless.

NOTE: I'm testing this on PostgreSQL 7.2 for Windows, so this may have already been dealt with.

Thanks and keep up the great work!!
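One workaround, sketched here with a hypothetical index name against the table from the post above, is to create a two-column index that matches the full ORDER BY; the planner can then satisfy the sort with an index scan instead of a sequential scan plus sort:

    -- two-column index covering the whole ORDER BY list
    CREATE INDEX cust_id_fname_idx ON customer (customer_id, first_name);

    -- the same query can now be answered by an index scan:
    SELECT * FROM customer ORDER BY customer_id, first_name;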
"Bupp Phillips" <hello@noname.com> writes: > but... > > select * from customer order by customer_id, first_name; > QUERY PLAN: > Sort(cost=142028.25..142028.25 rows=102834 width=724) > -> Seq Scan on customer (cost=0.00..4617.34 rows=102834 width=724) > Total runtime: 19999.81 msec Actually in this case the optimizer would likely still use a sequential scan even if it had an index it thought it could use. If you're going to be reading the whole table anyways it'll be faster to read it in order than to jump all around even if you have to sort it. However you do have a point. In this case I don't think postgres even considers using the index. Even if it would decide not to use it in this case there could conceivably be cases where it would want to use it. However I'm not sure I see a lot of cases where this would come up. Even in automatically generated code, which is the usual cause of redundant things like this, I don't think I've seen this particular combination ever come up. -- greg
Greg Stark <gsstark@mit.edu> writes:
> "Bupp Phillips" <hello@noname.com> writes:
>> select * from customer order by customer_id, first_name;
>> [ where customer_id is the primary key ]

> However you do have a point. In this case I don't think postgres even
> considers using the index.

It will not, since the index does not appear to provide the correct sort order.

> However I'm not sure I see a lot of cases where this would come up.

Yes, that's the real crux of the matter. Should the optimizer spend cycles on *every* query to detect cases where the user has written useless sort keys? I've got grave doubts that it's a win. ISTM such an optimization penalizes the folk who write their queries well to benefit those who are careless.

			regards, tom lane
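The immediate fix on the query side follows directly from Tom's point: since customer_id is unique, the second sort key can never affect the output order, so dropping it lets the existing primary-key index provide the ordering:

    -- equivalent result order, and the cust_pkkey index now matches the ORDER BY
    SELECT * FROM customer ORDER BY customer_id;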
Tom Lane <tgl@sss.pgh.pa.us> writes:
> > However I'm not sure I see a lot of cases where this would come up.
>
> Yes, that's the real crux of the matter. Should the optimizer spend
> cycles on *every* query to detect cases where the user has written
> useless sort keys? I've got grave doubts that it's a win. ISTM such
> an optimization penalizes the folk who write their queries well to
> benefit those who are careless.

Well, I'm sure the same arguments were made 30 years ago about optimizing compilers. But thankfully the optimizer-geeks won the argument. As a result, these days we can more or less write our code in whatever form is most readable and flexible. We can spend our time worrying about algorithmic improvements and design abstraction, and not worry that the extra layer of abstraction will introduce redundant code or inefficient expressions.

Already today we see database-independent toolkits that write SQL queries for you. They introduce needless subqueries and other inefficiencies willy-nilly.

I argue that if the performance of the database is important down to a sub-second constant term per query, then you're talking about an OLTP system where all the queries ought to be prepared and the plans cached. If you're talking about a system where queries are constructed ad hoc for every execution, then you should be talking about a DSS system running batch jobs, where an extra few seconds spent optimizing could save you hours.

All that said, I'm not sure the case at hand is a great example. I don't think it would be a big performance loss, but the added code complexity for nothing might be annoying. I don't see how even automatically generated code is likely to produce this situation.

--
greg
Greg Stark <gsstark@mit.edu> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> Yes, that's the real crux of the matter. Should the optimizer spend
>> cycles on *every* query to detect cases where the user has written
>> useless sort keys? I've got grave doubts that it's a win.

> Well I'm sure the same arguments were made 30 years ago about optimizing
> compilers. But thankfully the optimizer-geeks won the argument.

Um ... I *am* an optimizer-geek. You can find my name in the credits for Bliss/11, which was the best optimizing compiler on the planet about thirty years ago. I stand by my comment that there's a tradeoff between the potential gain from an optimization and the time spent to find it.

PG is at a disadvantage compared to typical compilation scenarios, in that a compiler assumes its output will be executed many times, while SQL queries often are planned and then executed but once. There's been some talk of working harder when planning a "prepared statement", but so far I've not seen very many places where I'd really want to alter the planner's behavior on that basis.

			regards, tom lane
On Thu, 4 Sep 2003, Tom Lane wrote:

> PG is at a disadvantage compared to typical compilation scenarios, in
> that a compiler assumes its output will be executed many times, while
> SQL queries often are planned and then executed but once. There's been
> some talk of working harder when planning a "prepared statement", but
> so far I've not seen very many places where I'd really want to alter
> the planner's behavior on that basis.

An interesting point. Perhaps storing some statistics on views would help, and maybe adding a cache facility for views would speed some things up. I don't really see anything against storing statistics on prepared statements and views the way we do on tables. Indexes on views might also be useful, but keeping them up to date would be a hazard.

Peter Childs
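PostgreSQL had no built-in materialized views at the time, but a hand-rolled sketch of the cache facility Peter suggests (all names here are hypothetical) is to materialize a view's output into an ordinary table, which the planner can then keep statistics on:

    -- materialize the result once
    CREATE TABLE customer_cache AS
        SELECT customer_id, first_name FROM customer;

    -- ordinary table statistics now exist for the planner
    ANALYZE customer_cache;

    -- refresh by hand whenever the base table changes enough to matter:
    -- TRUNCATE customer_cache;
    -- INSERT INTO customer_cache SELECT customer_id, first_name FROM customer;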
Tom Lane <tgl@sss.pgh.pa.us> writes:
> Um ... I *am* an optimizer-geek.

That explains so much :)

> I stand by my comment that there's a tradeoff between the potential gain
> from an optimization and the time spent to find it.

Well, there are always tradeoffs in engineering. I'm just trying to push a little bit in one direction and make you rethink your assumptions.

In the past Postgres users were largely OLTP systems (mostly web sites) running ad-hoc queries that are parsed and executed every time, interpolating parameters into the query string. That's crap. With the new FE binary protocol, and a bit of a push to the driver writers, a good high-performance (and secure) system ought to be able to run entirely prepared queries that are prepared once per backend process and executed thousands of times.

> PG is at a disadvantage compared to typical compilation scenarios, in
> that a compiler assumes its output will be executed many times, while
> SQL queries often are planned and then executed but once. There's been
> some talk of working harder when planning a "prepared statement", but
> so far I've not seen very many places where I'd really want to alter
> the planner's behavior on that basis.

I think that's backwards, actually. The queries where you would want to work super extra hard, spending a few seconds or even minutes checking possible plans, are the ones that will run for hours. Those are more likely to be DSS queries that are unprepared, ad hoc, and executed only once.

For OLTP queries I think Postgres can afford not to worry about small (sub-second) constant-time optimizations, even if they're unlikely to return big benefits, because OLTP systems should be running entirely prepared queries.

--
greg
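A minimal sketch of the prepared-query style Greg is advocating, using the SQL-level PREPARE/EXECUTE added in PostgreSQL 7.3 (the statement name and parameter value are illustrative):

    -- planned once per backend process
    PREPARE get_customer (integer) AS
        SELECT * FROM customer WHERE customer_id = $1;

    -- executed many times with different parameters, with no re-planning
    EXECUTE get_customer(42);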
Well, it's unfortunate that you feel that way, because SQL Server handles it correctly.

"Tom Lane" <tgl@sss.pgh.pa.us> wrote in message news:4375.1062643465@sss.pgh.pa.us...
> Yes, that's the real crux of the matter. Should the optimizer spend
> cycles on *every* query to detect cases where the user has written
> useless sort keys? I've got grave doubts that it's a win. ISTM such
> an optimization penalizes the folk who write their queries well to
> benefit those who are careless.
On Wed, 3 Sep 2003, Bupp Phillips wrote:

> Well, it's unfortunate that you feel that way, because SQL Server handles it
> correctly.

For some definition of correctly. If you're in a system which gets penalized .001 seconds for each query planning that uses a multi-column order by, and you do 100 million of them that this doesn't apply to plus one that it does, which saves you 30 seconds, is that correct? (That's 100,000,000 x 0.001 s, roughly 28 hours of extra planning time, spent to save 30 seconds.)
Stephan Szabo <sszabo@megazone.bigpanda.com> writes:
> For some definition of correctly. If you're in a system which gets
> penalized .001 seconds for each query planning that uses a multi-column
> order by, and you do 100 million of them that this doesn't apply to plus
> one that it does, which saves you 30 seconds, is that correct?

Uhm. Yes. Absolutely.

For OLTP systems like a web site, if there's a single query anywhere in the system that takes 30s, the system is broken. Every time that query happens to be hit a few times in a row, the OLTP system will simply break. This isn't a performance issue; it's outright broken.

Whereas a 1ms performance hit on every query plan will make virtually no difference at all in performance, and most importantly, the system will work. At worst it means provisioning a 1% faster machine. Even then, that's only if my system isn't preparing all these queries in advance, in which case I have bigger performance and security issues than a 1ms per-query hit.

For DSS systems handling queries that take minutes or hours to run, the case is even clearer. The 1ms hit is even less of an issue, and the 30s gain will balloon into minutes or hours of speedup.

I'm pretty sure this particular case was not one of the cases where people said it just wasn't worth doing. This was just too hard. Solving it in a way that integrates cleanly with Postgres's aggregates will be very hard, and will require someone who understands lots of different parts of Postgres to spend an awful lot of time thinking about the problem.

[This is one of the strengths of free software. There's no marketing department providing checklists that have to be met even if there's no good solution today. So instead of a bad solution that sticks around for a long time, we'll one day (hopefully) have a good solution when someone figures out how to do it.]

--
greg
On 5 Sep 2003, Greg Stark wrote:

> > For some definition of correctly. If you're in a system which gets
> > penalized .001 seconds for each query planning that uses a multi-column
> > order by, and you do 100 million of them that this doesn't apply to plus
> > one that it does, which saves you 30 seconds, is that correct?
>
> Uhm. Yes. Absolutely.

Even if all the people that don't write queries with redundant sort columns pay the cost for those that do? In some ways it'd be nice if we could control the planner more, so that individual systems could make determinations of whether this was useful or not.

From the below, I'm not sure we're talking about the same case any longer. :)

> I'm pretty sure this particular case was not one of the cases where people
> said it just wasn't worth doing. This was just too hard. Solving it in a way
> that integrates cleanly with Postgres's aggregates will be very hard, and
> will require someone who understands lots of different parts of Postgres to
> spend an awful lot of time thinking about the problem.

I'm not sure how finding redundant sort columns due to unique constraints really integrates with aggregates at all. Did we maybe cross conversation with one of the aggregate discussions on min/max/count?
What the heck are you talking about?

SQL Server returns the query immediately (milliseconds) whereas PGSQL takes 6 seconds!! So, yes, I would say that SQL Server did it correctly in this situation. The optimizer did what it was supposed to do... find the quickest way to get me the data.

Now, if you guys here at PGSQL don't believe it's necessary to change the optimizer for this purpose, that's fine, but don't start putting other DBMSs down because they DO handle this situation.

"Stephan Szabo" <sszabo@megazone.bigpanda.com> wrote in message news:20030904132334.E44205-100000@megazone.bigpanda.com...
> For some definition of correctly. If you're in a system which gets
> penalized .001 seconds for each query planning that uses a multi-column
> order by, and you do 100 million of them that this doesn't apply to plus
> one that it does, which saves you 30 seconds, is that correct?
On Thu, 4 Sep 2003, Relaxin wrote:

> What the heck are you talking about?

Whether or not trying every optimization is worthwhile in every potential situation where it might apply, in theory.

> SQL Server returns the query immediately (milliseconds) whereas PGSQL
> takes 6 seconds!! So, yes, I would say that SQL Server did it correctly
> in this situation. The optimizer did what it was supposed to do... find
> the quickest way to get me the data.

*If* you had other queries that took 6 seconds rather than milliseconds because the planner was trying to work out if the optimization applied, would you still think it was correct to be trying the optimization on those queries? What about if it added 50 ms to every query you did with an order by, and you never used redundant sort columns?

> Now, if you guys here at PGSQL don't believe it's necessary to change the
> optimizer for this purpose, that's fine, but don't start putting other
> DBMSs down because they DO handle this situation.

Where did I put down another DBMS? (I've left my text below.) I said, not in so many words, that some optimizations take time, and always doing every available optimization may not be "correct" because it may affect the running time of other queries adversely.

I don't really know if this particular case warrants doing; I don't know if the time involved in checking it is entirely negligible or not for complicated queries. I do know that there are people who use order by and do not use redundant sort columns, and that if there is a non-negligible cost, they're the ones that are really going to be paying for it. Tom believed that the cost was non-negligible for its benefit; if someone else wants to do it and get numbers, and the costs are negligible, it's likely to get put in.

> "Stephan Szabo" <sszabo@megazone.bigpanda.com> wrote:
> > For some definition of correctly. If you're in a system which gets
> > penalized .001 seconds for each query planning that uses a multi-column
> > order by, and you do 100 million of them that this doesn't apply to plus
> > one that it does, which saves you 30 seconds, is that correct?
Regardless of your argument, all I know is that SQL Server handles it correctly (giving me the data in the fastest possible way) and PostgreSQL doesn't.

And your argument about how it will slow down other queries is pointless to me. Will you still stand by this argument when the PG folks find a situation where the optimizer creates a terrible plan and the only way to fix it is to add additional logic? Based on what you are saying, it wouldn't be worth it because it could add .001 seconds of additional processing.

Again, like I said, if the PG folks don't see a need for this type of optimization, that's fine. What I'm stating is that SQL Server is a very fast and robust database that also handles this, as you may call it, odd situation.

Thanks

"Stephan Szabo" <sszabo@megazone.bigpanda.com> wrote in message news:20030905071003.P73820-100000@megazone.bigpanda.com...
> *If* you had other queries that took 6 seconds rather than milliseconds
> because the planner was trying to work out if the optimization applied,
> would you still think it was correct to be trying the optimization on
> those queries? What about if it added 50 ms to every query you did with
> an order by, and you never used redundant sort columns?
On Fri, 5 Sep 2003, Relaxin wrote:

> And your argument about how it will slow down other queries is pointless
> to me. Will you still stand by this argument when the PG folks find a
> situation where the optimizer creates a terrible plan and the only way to
> fix it is to add additional logic? Based on what you are saying, it
> wouldn't be worth it because it could add .001 seconds of additional
> processing.

I've tried to say that you need to balance the cost of the optimization against its potential gain in the cases it helps, the frequency of those cases, the ability to solve the problem in other ways, and the development time that goes into it. There isn't a push button with two states, best and not best.

When you want to argue for an optimization you should think about:

 a) What the optimization actually means (actual specification, not vague
    words)
 b) What cases the optimization is legal for (doesn't change results,
    doesn't violate spec, etc)
 c) What cases the optimization helps (as separate from b, in that the
    optimization may be legal for some cases without actually being faster)
 d) What the gain is over the cases that the optimization helps
 e) What the penalty is over the cases that the optimization does not help

If someone feels strongly about it or feels that they have the time, they can (and should generally be encouraged to) attempt the optimization. If the optimization is not expensive, or has better-than-expected gains or good side effects, it's likely to get accepted.