Thread: MAX/MIN optimization via rewrite (plus query rewrites generally)
I am looking at implementing this TODO item. e.g. (max case): rewrite SELECT max(foo) FROM bar as SELECT foo FROM bar ORDER BY foo DESC LIMIT 1 if there is an index on bar(foo) Suggestions about the most suitable point in the parser/planner stage to perform this sort of rewrite would be most welcome! (as this would be my first non trivial getting of hands dirty in the code). My initial thoughts revolved around extending the existing RULE system to be able to handle more general types of rewrite - like conditionals in SELECT rules and rewrites that change elements of the query other than the target relation. Planning for future note: I would like whatever mechanism that is added for this MAX/MIN stuff to be amenable to more subtle things like aggregate navigation (see R.Kimball's article http://www.dbmsmag.com/9608d54.html). regards Mark
Mark Kirkwood <markir@coretech.co.nz> writes: > I am looking at implementing this TODO item. e.g. (max case): > My initial thoughts revolved around extending the existing RULE system > to be able to handle more general types of rewrite - like conditionals > in SELECT rules and rewrites that change elements of the query other > than the target relation. The rule rewriter is almost certainly the wrong place, because it has only the most superficial understanding of a query's semantics. Doing this processing there would require re-inventing (or at least duplicating the execution of) a lot of the planner's query analysis work. My thoughts would run towards doing this after the prepqual and prepjointree steps (probably somewhere in grouping_planner). Even there is a bit early since you'd have to duplicate plancat.c's extraction of information about related indexes; but possibly it'd be reasonable to move the add_base_rels_to_query() call out of query_planner and do it in grouping_planner. A more radical way of handling it would be to detect the relevance of an indexscan in indxpath.c and generate a special kind of Path node; this would not generalize to other sorts of things as you were hoping, but I'm unconvinced that the mechanism is going to be very general-purpose anyway. The major advantage is that this would work conveniently for comparing the cost of a rewritten query to a non-rewritten one. How are you planning to represent the association between MIN/MAX and particular index orderings in the system catalogs? regards, tom lane
On Wed, Nov 10, 2004 at 07:18:59PM -0500, Tom Lane wrote: > A more radical way of handling it would be to detect the relevance of an > indexscan in indxpath.c and generate a special kind of Path node; this > would not generalize to other sorts of things as you were hoping, but > I'm unconvinced that the mechanism is going to be very general-purpose > anyway. The major advantage is that this would work conveniently for > comparing the cost of a rewritten query to a non-rewritten one. What about having a new column in pg_aggregate which would point to a function that would try to optimize the aggregate's handling? There could be multiple calls to that function along the query's way to executor, each one at a different point (with a parameter specifying which one it is), that would try to rewrite the query optimizing the aggregate. So we could optimize some aggregates at one point, and others at a different point, whichever makes the most sense. -- Alvaro Herrera (<alvherre[a]dcc.uchile.cl>) "Hay dos momentos en la vida de un hombre en los que no debería especular: cuando puede permitírselo y cuando no puede" (Mark Twain)
Tom Lane wrote: > >A more radical way of handling it would be to detect the relevance of an >indexscan in indxpath.c and generate a special kind of Path node; this >would not generalize to other sorts of things as you were hoping, but >I'm unconvinced that the mechanism is going to be very general-purpose >anyway. The major advantage is that this would work conveniently for >comparing the cost of a rewritten query to a non-rewritten one. > > > I like this point - it makes sense to check that the rewritten query is less costly to execute than the original! >How are you planning to represent the association between MIN/MAX and >particular index orderings in the system catalogs? > > > > That is the next item to think on, we could have a rewrite catalog that holds possible transformations for certain functions (certain aggregates at this stage I guess). This is a bit like Alvaro's idea - however it may be better to represent it the way he suggested! regards Mark
On Wed, Nov 10, 2004 at 22:21:31 -0300, Alvaro Herrera <alvherre@dcc.uchile.cl> wrote: > On Wed, Nov 10, 2004 at 07:18:59PM -0500, Tom Lane wrote: > > > A more radical way of handling it would be to detect the relevance of an > > indexscan in indxpath.c and generate a special kind of Path node; this > > would not generalize to other sorts of things as you were hoping, but > > I'm unconvinced that the mechanism is going to be very general-purpose > > anyway. The major advantage is that this would work conveniently for > > comparing the cost of a rewritten query to a non-rewritten one. > > What about having a new column in pg_aggregate which would point to a > function that would try to optimize the aggregate's handling? I think you want to store an operator class and a direction. This allows you to figure out what indexes might be usable. This could be used on all of the max and min aggregates and the boolean and and or aggregates.
On Thu, Nov 11, 2004 at 11:48:49AM +1300, Mark Kirkwood wrote: > I am looking at implementing this TODO item. e.g. (max case): > > rewrite > SELECT max(foo) FROM bar > as > SELECT foo FROM bar ORDER BY foo DESC LIMIT 1 > if there is an index on bar(foo) Out of curiosity, will you be doing this in such a way that SELECT min(foo), max(foo) FROM bar will end up as SELECT (SELECT foo FROM bar ORDER BY foo ASC LIMIT 1), (SELECT ... DESC LIMIT 1) ? -- Jim C. Nasby, Database Consultant decibel@decibel.org Give your computer some brain candy! www.distributed.net Team #1828 Windows: "Where do you want to go today?" Linux: "Where do you want to go tomorrow?" FreeBSD: "Are you guys coming, or what?"
Your example and ones like : SELECT max(foo), count(foo) FROM bar SELECT max(a.foo1), max(b.foo2) FROM bar1 AS a NATURAL JOIN bar2 AS b have made me realize that the scope of "what should be optimized" is somewhat subtle. I am inclined to keep it simple (i.e rather limited) for a first cut, and if that works well, then look at extending to more complex rewrites. What do you think? Jim C. Nasby wrote: >On Thu, Nov 11, 2004 at 11:48:49AM +1300, Mark Kirkwood wrote: > > >>I am looking at implementing this TODO item. e.g. (max case): >> >>rewrite >>SELECT max(foo) FROM bar >>as >>SELECT foo FROM bar ORDER BY foo DESC LIMIT 1 >>if there is an index on bar(foo) >> >> > >Out of curiosity, will you be doing this in such a way that > >SELECT min(foo), max(foo) FROM bar > >will end up as > >SELECT (SELECT foo FROM bar ORDER BY foo ASC LIMIT 1), (SELECT ... DESC >LIMIT 1) > >? > >
On Thu, Nov 11, 2004 at 17:57:42 +1300, Mark Kirkwood <markir@coretech.co.nz> wrote: > Your example and ones like : > > SELECT max(foo), count(foo) FROM bar > SELECT max(a.foo1), max(b.foo2) FROM bar1 AS a NATURAL JOIN bar2 AS b > > have made me realize that the scope of "what should be optimized" is > somewhat subtle. > > I am inclined to keep it simple (i.e rather limited) for a first cut, > and if that works well, then look at extending to more complex rewrites. > > What do you think? I don't think you should be rewriting queries as much as providing alternate plans and letting the rest of the optimizer decided which plan to use. If you just rewrite a query you might lock yourself into using a poor plan.
Alvaro Herrera <alvherre@dcc.uchile.cl> writes: > What about having a new column in pg_aggregate which would point to a > function that would try to optimize the aggregate's handling? I can't get very excited about this, because how would you make a reasonably stable/narrow API for such a thing? The function as you propose it would have to know everything about not only the planner's data representations but the N specific places it would be called from. The existing selectivity functions are bad enough on this score ... regards, tom lane
Why not just change the function all together to 'select $1 from $2 order by $1 desc limit 1;' Is there ANY situation where max(col) as it is, would be faster? ... John
Certainly handling only one case is better than none. I just wanted to bring up the multiple aggregate scenario. Also, consider that SELECT min(a), max(a), min(b), max(c) FROM table could be optimized as well (into 4 index scans, assuming a, b, and c all had indexes). I don't think any other aggregates are candidates for optimization right now, though I guess I could be wrong. On Thu, Nov 11, 2004 at 05:57:42PM +1300, Mark Kirkwood wrote: > Your example and ones like : > > SELECT max(foo), count(foo) FROM bar > SELECT max(a.foo1), max(b.foo2) FROM bar1 AS a NATURAL JOIN bar2 AS b > > have made me realize that the scope of "what should be optimized" is > somewhat subtle. > > I am inclined to keep it simple (i.e rather limited) for a first cut, > and if that works well, then look at extending to more complex rewrites. > > What do you think? > > > Jim C. Nasby wrote: > > >On Thu, Nov 11, 2004 at 11:48:49AM +1300, Mark Kirkwood wrote: > > > > > >>I am looking at implementing this TODO item. e.g. (max case): > >> > >>rewrite > >>SELECT max(foo) FROM bar > >>as > >>SELECT foo FROM bar ORDER BY foo DESC LIMIT 1 > >>if there is an index on bar(foo) > >> > >> > > > >Out of curiosity, will you be doing this in such a way that > > > >SELECT min(foo), max(foo) FROM bar > > > >will end up as > > > >SELECT (SELECT foo FROM bar ORDER BY foo ASC LIMIT 1), (SELECT ... DESC > >LIMIT 1) > > > >? > > > > > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Don't 'kill -9' the postmaster > -- Jim C. Nasby, Database Consultant decibel@decibel.org Give your computer some brain candy! www.distributed.net Team #1828 Windows: "Where do you want to go today?" Linux: "Where do you want to go tomorrow?" FreeBSD: "Are you guys coming, or what?"
There seems to be (as Tom indicated) a choice of approaches: i) rewrite max/min querys and then plan 'em ii) provide alternate plans based on presence of certain aggregate types in the query when I first examined this TODO item, I was really thinking about i), but I suspect that ii) is probably the best approach. regards Mark Bruno Wolff III wrote: >On Thu, Nov 11, 2004 at 17:57:42 +1300, > Mark Kirkwood <markir@coretech.co.nz> wrote: > > >>Your example and ones like : >> >>SELECT max(foo), count(foo) FROM bar >>SELECT max(a.foo1), max(b.foo2) FROM bar1 AS a NATURAL JOIN bar2 AS b >> >>have made me realize that the scope of "what should be optimized" is >>somewhat subtle. >> >>I am inclined to keep it simple (i.e rather limited) for a first cut, >>and if that works well, then look at extending to more complex rewrites. >> >>What do you think? >> >> > >I don't think you should be rewriting queries as much as providing >alternate plans and letting the rest of the optimizer decided which >plan to use. If you just rewrite a query you might lock yourself into >using a poor plan. > >
Probably for a small table, where the machinery of reading the index, followed by checking the table for non-visible tuples is more costly than just scanning the table! regards Mark John Hansen wrote: >Why not just change the function all together to 'select $1 from $2 >order by $1 desc limit 1;' > >Is there ANY situation where max(col) as it is, would be faster? > >... John > >
Bruno Wolff III <bruno@wolff.to> writes: > I don't think you should be rewriting queries as much as providing > alternate plans and letting the rest of the optimizer decided which > plan to use. If you just rewrite a query you might lock yourself into > using a poor plan. Moreover, none of these rewritten queries would work properly for select min(foo) from tab group by bar This should still be aware it can use an index on <bar,foo> and get much better performance. Well it can't know, but it should be able to. All it should really need to know is that min() only needs a particular subset of the dataset -- namely the first record, as long as the records are provided in a particular order. Also, the same code ought to be able to handle select first(foo) from tab group by bar Which is exactly equivalent to the min() case except that no particular ordering is required. -- greg
"Zeugswetter Andreas DAZ SD" <ZeugswetterA@spardat.at> writes: >> How are you planning to represent the association between MIN/MAX and >> particular index orderings in the system catalogs? > Don't we already have that info to decide whether an index handles > an "ORDER BY" without a sort node ? We know how to determine that an index matches an ORDER BY clause. But what has an aggregate called MAX() got to do with ORDER BY? Magic assumptions about operators named "<" are not acceptable answers; there has to be a traceable connection in the catalogs. As a real-world example of why I won't hold still for hard-wiring this: a complex-number data type might have btree opclasses allowing it to be sorted either by real part or by absolute value. One might then define max_real() and max_abs() aggregates on the type. It should be possible to optimize such aggregates the same way as any other max() aggregate. regards, tom lane
On Wed, 2004-11-10 at 22:48, Mark Kirkwood wrote: > Planning for future note: I would like whatever mechanism that is added > for this MAX/MIN stuff to be amenable to more subtle things like > aggregate navigation (see R.Kimball's article > http://www.dbmsmag.com/9608d54.html). > With you on that one... -- Best Regards, Simon Riggs
On Thu, Nov 11, 2004 at 01:08:39AM -0500, Tom Lane wrote: > Alvaro Herrera <alvherre@dcc.uchile.cl> writes: > > What about having a new column in pg_aggregate which would point to a > > function that would try to optimize the aggregate's handling? > > I can't get very excited about this, because how would you make a > reasonably stable/narrow API for such a thing? The function as you > propose it would have to know everything about not only the planner's > data representations but the N specific places it would be called from. No, the function would discard all calls except the one it knows how to optimize. The point in having multiple call places is that some aggregates will likely be optimized in some place, and others somewhere else. Most likely, a first patch would include only the call site that would help in optimizing min() and max(). Of course, an aggregate could specify no optimizing function, which would be the current situation, where no aggregate knows how to optimize itself. Re: knowing internal representation, I think this is required anyway; else the optimization would only work on a very limited numer of situations. -- Alvaro Herrera (<alvherre[a]dcc.uchile.cl>) "A wizard is never late, Frodo Baggins, nor is he early. He arrives precisely when he means to." (Gandalf, en LoTR FoTR)
On Thu, Nov 11, 2004 at 01:18:05 -0600, "Jim C. Nasby" <decibel@decibel.org> wrote: > Certainly handling only one case is better than none. I just wanted to > bring up the multiple aggregate scenario. Also, consider that > > SELECT min(a), max(a), min(b), max(c) FROM table > > could be optimized as well (into 4 index scans, assuming a, b, and c all > had indexes). > > I don't think any other aggregates are candidates for optimization right > now, though I guess I could be wrong. Remember that max and min are a number of aggregates, as each datatype which have max and min functions have different ones from those used by other datatypes. I think someone added boolean aggregates for and and or in version 8. If so, those can also use indexes in the same way.
Re: MAX/MIN optimization via rewrite (plus query rewrites generally)
From
"Zeugswetter Andreas DAZ SD"
Date:
> How are you planning to represent the association between MIN/MAX and > particular index orderings in the system catalogs? Don't we already have that info to decide whether an index handles an "ORDER BY" without a sort node ? Andreas
On Thu, Nov 11, 2004 at 17:52:19 +1100, John Hansen <john@geeknet.com.au> wrote: > Why not just change the function all together to 'select $1 from $2 > order by $1 desc limit 1;' > > Is there ANY situation where max(col) as it is, would be faster? Yes. A couple I can think of are: When count(col) is also being used. When a GROUP BY is being used and there isn't an index that can both be used to do the grouping and col order within each group.
Simon Riggs <simon@2ndquadrant.com> writes: > On Wed, 2004-11-10 at 22:48, Mark Kirkwood wrote: > > Planning for future note: I would like whatever mechanism that is added > > for this MAX/MIN stuff to be amenable to more subtle things like > > aggregate navigation (see R.Kimball's article > > http://www.dbmsmag.com/9608d54.html). > > > > With you on that one... This looks like more of a materialized views application than anything to do with planning aggregate functions. Materialized views are most useful when the materialized view is smaller than the original data and therefore usually involve aggregates, but it's not necessary. -- greg
Bruno Wolff III <bruno@wolff.to> writes: > Tom Lane <tgl@sss.pgh.pa.us> wrote: >> We know how to determine that an index matches an ORDER BY clause. >> But what has an aggregate called MAX() got to do with ORDER BY? > Wouldn't knowing an opclass and direction associated with an aggregrate > function allow you to do this? That's one way you could do it. Another possibly cleaner way is to just supply a sort operator (with the implication "the desired value is the one ordered first by this operator") and then let the optimizer work out which opclass(es) are relevant. This would win if the same operator appears in different opclasses, which is uncommon at the moment but would not be so if we start offering "reverse sort" opclasses. I think we covered all this ground before, though --- have you checked the archives from the last time this was discussed? regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes: > As a real-world example of why I won't hold still for hard-wiring this: > a complex-number data type might have btree opclasses allowing it to be > sorted either by real part or by absolute value. One might then define > max_real() and max_abs() aggregates on the type. It should be possible > to optimize such aggregates the same way as any other max() aggregate. So if the max_real() aggregate had a field that indicated that max_real(x) could be satisfied with only the first record from the dataset as long as it's sorted by "real(x)" that would be enough information. The optimizer would still have a lot of work to combine this information for all aggregates used and check the costs for providing sorted result sets to the scan. There would also need new scans that could handle reading just one record and then skipping to the next group. It's a lot of work but it would make a lot of aggregates a lot more useful. It would also make it possible to deprecate DISTINCT ON in favour of GROUP BY with first() calls. -- greg
On Thu, Nov 11, 2004 at 10:24:34 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > We know how to determine that an index matches an ORDER BY clause. > But what has an aggregate called MAX() got to do with ORDER BY? Magic > assumptions about operators named "<" are not acceptable answers; there > has to be a traceable connection in the catalogs. > > As a real-world example of why I won't hold still for hard-wiring this: > a complex-number data type might have btree opclasses allowing it to be > sorted either by real part or by absolute value. One might then define > max_real() and max_abs() aggregates on the type. It should be possible > to optimize such aggregates the same way as any other max() aggregate. Wouldn't knowing an opclass and direction associated with an aggregrate function allow you to do this?
On Thu, Nov 11, 2004 at 09:29:14 +0000, Simon Riggs <simon@2ndquadrant.com> wrote: > On Wed, 2004-11-10 at 22:48, Mark Kirkwood wrote: > > Planning for future note: I would like whatever mechanism that is added > > for this MAX/MIN stuff to be amenable to more subtle things like > > aggregate navigation (see R.Kimball's article > > http://www.dbmsmag.com/9608d54.html). > > > > With you on that one... I disaggree. What that author is suggesting is orthogonal to what is being proposed by Mark. That article is really suggesting a varient of materialized views where you can use the normal aggregate notation instead of having to use a special column name to get access to an aggregate in the materialized view.
Greg Stark <gsstark@mit.edu> writes: > It would also make it possible to deprecate DISTINCT ON in favour of GROUP BY > with first() calls. Oh? How is a first() aggregate going to know what sort order you want within the group? AFAICS first() is only useful when you honestly do not care which group member you get ... which is certainly not the case for applications of DISTINCT ON. regards, tom lane
Alvaro Herrera <alvherre@dcc.uchile.cl> writes: > Re: knowing internal representation, I think this is required anyway; > else the optimization would only work on a very limited numer of > situations. The point of my remark is that pushing this knowledge out to a function is helpful only if you can put that function somehow at arm's length from the main body of the optimizer. Otherwise structuring it like that is useless obscurantism. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes: > Greg Stark <gsstark@mit.edu> writes: > > It would also make it possible to deprecate DISTINCT ON in favour of GROUP BY > > with first() calls. > > Oh? How is a first() aggregate going to know what sort order you want > within the group? AFAICS first() is only useful when you honestly do > not care which group member you get ... which is certainly not the case > for applications of DISTINCT ON. It would look something like select x,first(a),first(b) from (select x,a,b from table order by x,y) group by x which is equivalent to select DISTINCT ON (x) x,a,b from table ORDER BY x,y The group by can see that the subquery is already sorted by x and doesn't need to be resorted. In fact I believe you added the smarts to detect that condition in response to a user asking about precisely this type of scenario. This is actually more general than DISTINCT ON since DISTINCT ON is basically a degenerate case of the above where the _only_ aggregate allowed is first(). The more general case could have first() as well as other aggregates, though obviously they would make it unlikely that any optimizations would be applicable. I do kind of like the DISTINCT ON syntax, but the inability to use any other aggregate functions makes me often have to convert queries I originally wrote to use it to use the more general GROUP BY and first() instead. -- greg
Greg Stark <gsstark@mit.edu> writes: > Tom Lane <tgl@sss.pgh.pa.us> writes: >> Oh? How is a first() aggregate going to know what sort order you want >> within the group? > It would look something like > select x,first(a),first(b) from (select x,a,b from table order by x,y) group by x > which is equivalent to > select DISTINCT ON (x) x,a,b from table ORDER BY x,y No, it is not. The GROUP BY has no commitment to preserve order --- consider for example the possibility that we implement the GROUP BY by hashing. > The group by can see that the subquery is already sorted by x and > doesn't need to be resorted. In fact I believe you added the smarts to > detect that condition in response to a user asking about precisely > this type of scenario. The fact that an optimization is present does not make it part of the guaranteed semantics of the language. Basically, first() is a broken concept in SQL. Of course DISTINCT ON is broken too for the same reasons, but I do not see that first() is one whit less of a kluge than DISTINCT ON. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes: > Greg Stark <gsstark@mit.edu> writes: > > Tom Lane <tgl@sss.pgh.pa.us> writes: > >> Oh? How is a first() aggregate going to know what sort order you want > >> within the group? > > > It would look something like > > > select x,first(a),first(b) from (select x,a,b from table order by x,y) group by x > > > which is equivalent to > > > select DISTINCT ON (x) x,a,b from table ORDER BY x,y > > No, it is not. The GROUP BY has no commitment to preserve order --- > consider for example the possibility that we implement the GROUP BY by > hashing. It doesn't matter how the group by is implemented. The only thing that matters is what order the tuples are presented to the aggregate function. Even hashing presents them to the aggregate function as they're read in from the subquery. In fact iirc hashing was why the original user that asked for this behaviour found that it sometimes did work and wondered why it didn't work when the planner *didn't* use hashing. You added code to allow it to function properly for that case as well. > The fact that an optimization is present does not make it part of the > guaranteed semantics of the language. > > Basically, first() is a broken concept in SQL. Of course DISTINCT ON > is broken too for the same reasons, but I do not see that first() is > one whit less of a kluge than DISTINCT ON. first() is only a whit less of a kludge than DISTINCT ON in that it covers the more general case of wanting both first value for each group as well as other aggregates. Otherwise it's exactly equivalent. Depending on ordering was a broken concept in the abstract theoretical world of SQL as originally envisioned. But (unfortunately) we don't really live in that world any more. There are tons of extensions that Postgres and other databases provide that do depend on the ordering of record sets. And there are tons of examples of real business problems that require those extensions. It may be a broken concept but it's one that's extremely popular. Postgres implemented DISTINCT ON independently, but mysql also has an equivalent feature: you can include any column not included in the GROUP BY clause and mysql will implicitly do the equivalent of first(). Oracle on the other hand has a complete suite of functions for processing order-dependent data. They even implement lead()/lag() for allowing you to access the data from subsequent or previous records in the data set. -- greg
On Thu, Nov 11, 2004 at 08:00:01AM -0600, Bruno Wolff III wrote: > On Thu, Nov 11, 2004 at 17:52:19 +1100, > John Hansen <john@geeknet.com.au> wrote: > > Why not just change the function all together to 'select $1 from $2 > > order by $1 desc limit 1;' > > > > Is there ANY situation where max(col) as it is, would be faster? > > Yes. A couple I can think of are: > When count(col) is also being used. Technically, wouldn't that depend on how many rows you were processing? Certainly the time required for the CPU to compare the value of a field in the current row to what it's got stored as the current maximum is small compared to a disk read, but at some point it will be faster to read an index. > When a GROUP BY is being used and there isn't an index that can both be used > to do the grouping and col order within each group. -- Jim C. Nasby, Database Consultant decibel@decibel.org Give your computer some brain candy! www.distributed.net Team #1828 Windows: "Where do you want to go today?" Linux: "Where do you want to go tomorrow?" FreeBSD: "Are you guys coming, or what?"
On Thu, Nov 11, 2004 at 11:59:06AM -0600, Bruno Wolff III wrote: > On Thu, Nov 11, 2004 at 09:29:14 +0000, > Simon Riggs <simon@2ndquadrant.com> wrote: > > On Wed, 2004-11-10 at 22:48, Mark Kirkwood wrote: > > > Planning for future note: I would like whatever mechanism that is added > > > for this MAX/MIN stuff to be amenable to more subtle things like > > > aggregate navigation (see R.Kimball's article > > > http://www.dbmsmag.com/9608d54.html). > > > > > > > With you on that one... > > I disaggree. What that author is suggesting is orthogonal to what is being > proposed by Mark. That article is really suggesting a varient of materialized > views where you can use the normal aggregate notation instead of having > to use a special column name to get access to an aggregate in the > materialized view. And it also appears to be completely missing the automatic query rewrite capabilites found in Oracle, DB2, and MSSQL. -- Jim C. Nasby, Database Consultant decibel@decibel.org Give your computer some brain candy! www.distributed.net Team #1828 Windows: "Where do you want to go today?" Linux: "Where do you want to go tomorrow?" FreeBSD: "Are you guys coming, or what?"
On 11/10/2004 11:57 PM, Mark Kirkwood wrote: > Your example and ones like : > > SELECT max(foo), count(foo) FROM bar > SELECT max(a.foo1), max(b.foo2) FROM bar1 AS a NATURAL JOIN bar2 AS b > > have made me realize that the scope of "what should be optimized" is > somewhat subtle. > > I am inclined to keep it simple (i.e rather limited) for a first cut, > and if that works well, then look at extending to more complex rewrites. > > What do you think? The problem is, that select min(foo) from bar where foo > 100; is still solvable with an index scan, assuming there is an index on foo. But select min(foo) from bar where baz = 'IT'; is only doable with an index scan if you have a compound index on (foo,baz). Both cases can be expressed with order by + limit queries, that would indeed utilize those indexes. But what's been discussed so far does not cover any of them. Jan > > > Jim C. Nasby wrote: > >>On Thu, Nov 11, 2004 at 11:48:49AM +1300, Mark Kirkwood wrote: >> >> >>>I am looking at implementing this TODO item. e.g. (max case): >>> >>>rewrite >>>SELECT max(foo) FROM bar >>>as >>>SELECT foo FROM bar ORDER BY foo DESC LIMIT 1 >>>if there is an index on bar(foo) >>> >>> >> >>Out of curiosity, will you be doing this in such a way that >> >>SELECT min(foo), max(foo) FROM bar >> >>will end up as >> >>SELECT (SELECT foo FROM bar ORDER BY foo ASC LIMIT 1), (SELECT ... DESC >>LIMIT 1) >> >>? >> >> > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Don't 'kill -9' the postmaster -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
Jan Wieck <JanWieck@Yahoo.com> writes: > Both cases can be expressed with order by + limit queries, that would indeed > utilize those indexes. But what's been discussed so far does not cover any of > them. I think people should get away from thinking about "order by + limit". That isn't going to work for anything with a GROUP BY. And it isn't going to work for anything more complex than a single min() or max(). min() only needs the first record from whatever set of records it's operating on as long as they're provided in a specified order. This is just as true for a min() applied to only a single GROUP as it is for a min() applied to an entire table. I don't think you want to use the existing Limit executor node. That will only ever let you handle these simple aggregates that return the first value they see. What you want is a normal Aggregate node, but the node feeding it should be an altered index scan that knows it only needs to pull out the first and/or last record for each GROUP. That will let you handle min() and max() in the same query for example. It might also leave open the door for other more complex data subsets. Say a geometric data type where it needs all the bounding points of an area. -- greg
On 15 Nov 2004 02:00:37 -0500, Greg Stark <gsstark@mit.edu> wrote: > I think people should get away from thinking about "order by + limit". That > isn't going to work for anything with a GROUP BY. And it isn't going to work > for anything more complex than a single min() or max(). > > min() only needs the first record from whatever set of records it's operating > on as long as they're provided in a specified order. This is just as true for > a min() applied to only a single GROUP as it is for a min() applied to an > entire table. But as far as I can tell there is no way of forcing such order, at least ORDER BY queries are doomed to fail: select max(val) from test_max order by val desc; ERROR: column "test_max.val" must appear in the GROUP BY clause or be used in an aggregate function Anyway I think that any optimization (supposedly "imlicit" order by when min() or max() is the only requested column) would at least stop people from using awkward syntax for performance reasons... Regards, Dawid
I think a summary of where the discussion went might be helpful (especially for me after a week or so away doing perl). There were a number of approaches suggested, which I will attempt to summarize in a hand wavy fashion - (apologies for any misrepresentation caused): i) Rewrite max/min querys using order by in presence of a suitable index. ii) Provide alternate (i.e rewritten) querys for consideration along with the original, letting the planner use its costing methods to choose as usual. iii) Provide alternate plans based on presence of certain aggregate types in the query, letting the planner use its costingmethods to choose as usual. iv) Create short-cut evaluations for certain aggregates that don't actually need to see all the (to-be aggregated) data. v) Create a mechanism for defining per-aggregate optimization operators. Note that some of these ideas may overlap one another to some extent. Some critiques of the various approaches are: i) Too simple, rewrite may not be better than original, only simple queries can be handled this way. Probably reasonably easy to implement. ii) Simple queries will be well handled, but very complex transformations needed to handle even slightly more complexones. Probably medium -> difficult to implement. iii) Rules for creating alternate plans will mimic the issues with ii). Probably medium -> difficult to implement. iv) May need different short cuts for each aggregate -> datatype combination. Implies conventional > and < operators, or the existence of similar use definable ones (or a way of findingsuitable ones). Guessing medium to implement. v) Is kind of a generalization of iv). The key areas of difficulty are the specification of said optimization operatorsand the definition of an API for constructing/calling them. Guessing difficult to implement. I am leaning towards ii) or iv) as the most promising approaches - what do people think? regards Mark