Thread: Index not being used in MAX function (7.2.3)
Hi all: I have here a table belonging to a message board (Phorum 3.3), and there's an index in it that is not being used for reasons that I don't understand. The table is: Table "todocinetv" Column | Type | Modifiers -------------+-----------------------------+---------------------- id | integer | not null default '0' datestamp | timestamp without time zone | not null thread | integer | not null default '0' parent | integer | not null default '0' author | character(37) | not null default '' subject | character(255) | not null default '' email | character(200) | not null default '' attachment | character(64) | default '' host | character(50) | not null default '' email_reply | character(1) | not null default 'N' approved | character(1) | not null default 'N' msgid | character(100) | not null default '' modifystamp | integer | not null default '0' userid | integer | not null default '0' Indexes: todocinetv_approved, todocinetv_author, todocinetv_datestamp, todocinetv_modifystamp, todocinetv_msgid, todocinetv_parent, todocinetv_subject, todocinetv_thread, todocinetv_userid, todocinetvpri_key And the index "todocinetvpri_key" is created on the primary key (id). Yet when I do: explain select max(id) from todocinetv; NOTICE: QUERY PLAN: Aggregate (cost=30939.22..30939.22 rows=1 width=4) -> Seq Scan on todocinetv (cost=0.00..30882.98 rows=22498 width=4) It doesn't use the index, and surely, it takes forever. I have tried with VACUUM ANALYZE, and also dropping the index, creating it again and VACUUMing it, and it never uses it. The only explanation I can come up with is that the MAX() function doesn't use indices; I have tried with tables in other databases (running Postgres 7.2.1), and it doesn't use the indices in any of them. Is this the right behaviour? Or there is something else going on? The table mentioned above is in a Postgres 7.2.3 server, while the other ones that I used for testing were, as I said, in 7.2.1. Paulo Jan. DDnet.
Yep. It's a "feature" :-) I had hard time understanding why too... Apparently, there is nothing special about max() aggregate from the query planner's standpoint, compared to any other aggregate function. It doesn't know its specifics, and thus is not able to figure out that all it really needs is to grab the first entry from the index... To mak it quicker do this instead: select id from todocinetv order by id desc limit 1; I hope, it helps... Dima Paulo Jan wrote: > Hi all: > > I have here a table belonging to a message board (Phorum 3.3), and > there's an index in it that is not being used for reasons that I don't > understand. The table is: > > > Table "todocinetv" > Column | Type | Modifiers > -------------+-----------------------------+---------------------- > id | integer | not null default '0' > datestamp | timestamp without time zone | not null > thread | integer | not null default '0' > parent | integer | not null default '0' > author | character(37) | not null default '' > subject | character(255) | not null default '' > email | character(200) | not null default '' > attachment | character(64) | default '' > host | character(50) | not null default '' > email_reply | character(1) | not null default 'N' > approved | character(1) | not null default 'N' > msgid | character(100) | not null default '' > modifystamp | integer | not null default '0' > userid | integer | not null default '0' > Indexes: todocinetv_approved, > todocinetv_author, > todocinetv_datestamp, > todocinetv_modifystamp, > todocinetv_msgid, > todocinetv_parent, > todocinetv_subject, > todocinetv_thread, > todocinetv_userid, > todocinetvpri_key > > > And the index "todocinetvpri_key" is created on the primary key > (id). Yet when I do: > > explain select max(id) from todocinetv; > NOTICE: QUERY PLAN: > > Aggregate (cost=30939.22..30939.22 rows=1 width=4) > -> Seq Scan on todocinetv (cost=0.00..30882.98 rows=22498 width=4) > > > It doesn't use the index, and surely, it takes forever. I have > tried with VACUUM ANALYZE, and also dropping the index, creating it > again and VACUUMing it, and it never uses it. The only explanation I > can come up with is that the MAX() function doesn't use indices; I > have tried with tables in other databases (running Postgres 7.2.1), > and it doesn't use the indices in any of them. > Is this the right behaviour? Or there is something else going on? > The table mentioned above is in a Postgres 7.2.3 server, while the > other ones that I used for testing were, as I said, in 7.2.1. > > > > Paulo Jan. > DDnet. > > > > ---------------------------(end of broadcast)--------------------------- > TIP 3: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly
Is your index a hash or btree? Jon On Tue, 10 Jun 2003, Paulo Jan wrote: > Hi all: > > I have here a table belonging to a message board (Phorum 3.3), and > there's an index in it that is not being used for reasons that I don't > understand. The table is: > > > Table "todocinetv" > Column | Type | Modifiers > -------------+-----------------------------+---------------------- > id | integer | not null default '0' > datestamp | timestamp without time zone | not null > thread | integer | not null default '0' > parent | integer | not null default '0' > author | character(37) | not null default '' > subject | character(255) | not null default '' > email | character(200) | not null default '' > attachment | character(64) | default '' > host | character(50) | not null default '' > email_reply | character(1) | not null default 'N' > approved | character(1) | not null default 'N' > msgid | character(100) | not null default '' > modifystamp | integer | not null default '0' > userid | integer | not null default '0' > Indexes: todocinetv_approved, > todocinetv_author, > todocinetv_datestamp, > todocinetv_modifystamp, > todocinetv_msgid, > todocinetv_parent, > todocinetv_subject, > todocinetv_thread, > todocinetv_userid, > todocinetvpri_key > > > And the index "todocinetvpri_key" is created on the primary key (id). > Yet when I do: > > explain select max(id) from todocinetv; > NOTICE: QUERY PLAN: > > Aggregate (cost=30939.22..30939.22 rows=1 width=4) > -> Seq Scan on todocinetv (cost=0.00..30882.98 rows=22498 width=4) > > > It doesn't use the index, and surely, it takes forever. I have tried > with VACUUM ANALYZE, and also dropping the index, creating it again and > VACUUMing it, and it never uses it. The only explanation I can come up > with is that the MAX() function doesn't use indices; I have tried with > tables in other databases (running Postgres 7.2.1), and it doesn't use > the indices in any of them. > Is this the right behaviour? Or there is something else going on? The > table mentioned above is in a Postgres 7.2.3 server, while the other > ones that I used for testing were, as I said, in 7.2.1. > > > > Paulo Jan. > DDnet. > > > > ---------------------------(end of broadcast)--------------------------- > TIP 3: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly >
At 18:39 10.06.2003, Paulo Jan said: --------------------[snip]-------------------- >Hi all: > > I have here a table belonging to a message board (Phorum 3.3), and >there's an index in it that is not being used for reasons that I don't >understand. The table is: > > ... > > select max(id) from todocinetv; --------------------[snip]-------------------- Your question has already been answered, but: I suspect that you're doing this to retrieve the row ID of a newly inserted row. This may not be foolproof since others could already have inserted rows in between your insertion and this ID lookup. Assuming that you're using a sequence to provide the primary key (which you should) you may safely query its current value: SELECT currval('todocinetv_id_seq') as "newid" This is guaranteed to return the last value **for your connection** only, regardless if the sequence has actually been incremented by others or not. And it's lightning fast. Just my 2c :) -- >O Ernest E. Vogelsinger (\) ICQ #13394035 ^ http://www.vogelsinger.at/
Jonathan Bartlett wrote: > Is your index a hash or btree? > > Jon > It's a btree, but anyway, I see that others have already answered my question. So it's a "feature" and not a bug? Hrmpf. (BTW, the code I was running wasn't written by me; it was part of Phorum, a PHP web posting board application. I'll try to patch it to make it use "SELECT id... ORDER BY id DESC LIMIT 1" and see how it goes). Paulo Jan. DDnet.
Hi all, in a psql script for GnuMed (www.gnumed.org) I am using a snippet like the following for setting up predefined test accounts: CREATE USER "test-doc" WITH PASSWORD 'test-doc' IN GROUP "gm-doctors", "gm-public" VALID UNTIL '2003-09-30' ; I would like to constrain their validity to, say, six months. I have tried but not found a way to tell the VALID UNTIL clause something like now() + '6 months'::interval Anyone have a suggestion (short of calculating in the client at runtime and substituting) on how to do this in plain SQL ? Thanks a lot, Karsten Hilbert, MD GnuMed i18n coordinator -- GPG key ID E4071346 @ wwwkeys.pgp.net E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346
Karsten Hilbert <Karsten.Hilbert@gmx.net> writes: > I have tried but not found a way to tell the VALID UNTIL clause > something like > now() + '6 months'::interval CREATE USER, like pretty much all utility statements in Postgres, won't do any expression evaluation --- the parameters have to be simple literal constants. The only workaround I can think of is to do the CREATE USER inside a helper plpgsql function that does the time calculation and then uses EXECUTE to run the command with the appropriate values plugged in. It seems unlikely that this is a cleaner solution than doing it in the client code ... regards, tom lane
I think it has to be done like this: "now() + '@ 6 month' as six_months_from_now;" which yields six_months_from_now ------------------------ 2003-12-11 09:38:24-05 (1 row) Jim Apsey -------------------------------------------------------------------------------------- Karsten Hilbert wrote: >Hi all, > >in a psql script for GnuMed (www.gnumed.org) I am using a >snippet like the following for setting up predefined test >accounts: > >CREATE USER "test-doc" > WITH PASSWORD 'test-doc' > IN GROUP "gm-doctors", "gm-public" > VALID UNTIL '2003-09-30' >; > >I would like to constrain their validity to, say, six months. I >have tried but not found a way to tell the VALID UNTIL clause >something like > > now() + '6 months'::interval > >Anyone have a suggestion (short of calculating in the client at >runtime and substituting) on how to do this in plain SQL ? > >Thanks a lot, > >Karsten Hilbert, MD >GnuMed i18n coordinator > >
Wouldn't it make sense to optimize max() and min() for indexed columns? I don't know if I'm barking up the wrong tree, but would it be possible to create an aggregate (o_max, o_min) to make the query planner treat it differently from other aggregates? IMO, (if possible...) this would be a more elegant solution than SQL'ing around the "feature". If it is possible, it might be a nifty contrib module, poised for inclusion in the production code. Any takers/thoughts? :) CG --- Paulo Jan <admin@digital.ddnet.es> wrote: > Jonathan Bartlett wrote: > > Is your index a hash or btree? > > > > Jon > > > > > It's a btree, but anyway, I see that others have already answered my > question. So it's a "feature" and not a bug? Hrmpf. > (BTW, the code I was running wasn't written by me; it was part of > Phorum, a PHP web posting board application. I'll try to patch it to > make it use "SELECT id... ORDER BY id DESC LIMIT 1" and see how it goes). > > > > Paulo Jan. > DDnet. > > > > ---------------------------(end of broadcast)--------------------------- > TIP 6: Have you searched our list archives? > > http://archives.postgresql.org __________________________________ Do you Yahoo!? Yahoo! Calendar - Free online calendar with sync to Outlook(TM). http://calendar.yahoo.com
I think this is a wonderful idea. I often wondered why max(column) makes a full table scan. B.T.W.: If your max column is indexed, use select column from table order by column desc limit 1 which gives you the maximum value in no time at all. Detlef -----Ursprungliche Nachricht----- Von: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org]Im Auftrag von Chris Gamache Gesendet: Mittwoch, 11. Juni 2003 17:44 An: pgsql-general@postgresql.org Betreff: Re: [GENERAL] Index not being used in MAX function (7.2.3) Wouldn't it make sense to optimize max() and min() for indexed columns? I don't know if I'm barking up the wrong tree, but would it be possible to create an aggregate (o_max, o_min) to make the query planner treat it differently from other aggregates? IMO, (if possible...) this would be a more elegant solution than SQL'ing around the "feature". If it is possible, it might be a nifty contrib module, poised for inclusion in the production code. Any takers/thoughts? :) CG --- Paulo Jan <admin@digital.ddnet.es> wrote: > Jonathan Bartlett wrote: > > Is your index a hash or btree? > > > > Jon > > > > > It's a btree, but anyway, I see that others have already answered my > question. So it's a "feature" and not a bug? Hrmpf. > (BTW, the code I was running wasn't written by me; it was part of > Phorum, a PHP web posting board application. I'll try to patch it to > make it use "SELECT id... ORDER BY id DESC LIMIT 1" and see how it goes). > > > > Paulo Jan. > DDnet. > > > > ---------------------------(end of broadcast)--------------------------- > TIP 6: Have you searched our list archives? > > http://archives.postgresql.org __________________________________ Do you Yahoo!? Yahoo! Calendar - Free online calendar with sync to Outlook(TM). http://calendar.yahoo.com ---------------------------(end of broadcast)--------------------------- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html
On Wed, Jun 11, 2003 at 02:21:00PM +0200, Paulo Jan wrote: > It's a btree, but anyway, I see that others have already answered my > question. So it's a "feature" and not a bug? Hrmpf. > (BTW, the code I was running wasn't written by me; it was part of > Phorum, a PHP web posting board application. I'll try to patch it to > make it use "SELECT id... ORDER BY id DESC LIMIT 1" and see how it goes). Not to drag this out further, but you might want to hold off on that patch. 7.4 is supposed to use indexes for max/min. -- Jim C. Nasby (aka Decibel!) jim@nasby.net Member: Triangle Fraternity, Sports Car Club of America Give your computer some brain candy! www.distributed.net Team #1828 Windows: "Where do you want to go today?" Linux: "Where do you want to go tomorrow?" FreeBSD: "Are you guys coming, or what?"
"Jim C. Nasby" <jim@nasby.net> writes: > Not to drag this out further, but you might want to hold off on that > patch. 7.4 is supposed to use indexes for max/min. Where did you get that idea? There's been no change in the basic problem, which is that no one has proposed a reasonably general method of translating aggregates into index manipulations. Postgres has an extremely general, extensible concept of aggregates, and we're not going to mess it up with some poorly-designed hack. But show me a clean design and implementation, and it'll go in. regards, tom lane
> -----Original Message----- > From: Tom Lane [mailto:tgl@sss.pgh.pa.us] > Sent: Wednesday, June 11, 2003 10:03 AM > To: jim@nasby.net > Cc: pgsql-general@postgresql.org > Subject: Re: [GENERAL] Index not being used in MAX function (7.2.3) > > > "Jim C. Nasby" <jim@nasby.net> writes: > > Not to drag this out further, but you might want to hold > off on that > > patch. 7.4 is supposed to use indexes for max/min. > > Where did you get that idea? > > There's been no change in the basic problem, which is that no > one has proposed a reasonably general method of translating > aggregates into index manipulations. Postgres has an > extremely general, extensible concept of aggregates, and > we're not going to mess it up with some poorly-designed hack. > But show me a clean design and implementation, and it'll go in. Is this a poorly designed hack: Select max(expression) from <join> where <filter> Becomes: If (non_hashed_index_exists_on_expression) then /* Transform expression to: */ select (expression) from <join> where <filter> order by <expression> limit to 1 rows else do_what_you_are_doing_right_now endif ?
> -----Original Message----- > From: Dann Corbit > Sent: Wednesday, June 11, 2003 10:12 AM > To: Tom Lane; jim@nasby.net > Cc: pgsql-general@postgresql.org > Subject: Re: [GENERAL] Index not being used in MAX function (7.2.3) > > > > -----Original Message----- > > From: Tom Lane [mailto:tgl@sss.pgh.pa.us] > > Sent: Wednesday, June 11, 2003 10:03 AM > > To: jim@nasby.net > > Cc: pgsql-general@postgresql.org > > Subject: Re: [GENERAL] Index not being used in MAX function (7.2.3) > > > > > > "Jim C. Nasby" <jim@nasby.net> writes: > > > Not to drag this out further, but you might want to hold > > off on that > > > patch. 7.4 is supposed to use indexes for max/min. > > > > Where did you get that idea? > > > > There's been no change in the basic problem, which is that no > > one has proposed a reasonably general method of translating > > aggregates into index manipulations. Postgres has an > > extremely general, extensible concept of aggregates, and > > we're not going to mess it up with some poorly-designed hack. > > But show me a clean design and implementation, and it'll go in. > > Is this a poorly designed hack: > > Select max(expression) from <join> where <filter> > > Becomes: > > If (non_hashed_index_exists_on_expression) then > /* Transform expression to: */ > select (expression) from <join> where <filter> order > by <expression> limit to 1 rows "Order by <expression> DESC" for max and ASC for min. > else > do_what_you_are_doing_right_now > endif > ? > > ---------------------------(end of > broadcast)--------------------------- > TIP 5: Have you checked our extensive FAQ? > http://www.postgresql.org/docs/faqs/FAQ.html
"Dann Corbit" <DCorbit@connx.com> writes: > Is this a poorly designed hack: > Select max(expression) from <join> where <filter> Well, to start with, any nonempty WHERE probably invalidates the optimization, and I doubt it works if there's a join rather than a single table involved. But you're just handwaving --- the devil is in the details. How will you identify which aggregates are MIN and MAX (no, I don't like the idea of relying on the names; remember we have an extensible set of aggregates)? What will you do when there are multiple aggregates in the command --- or more generally, how complex a context for the aggregate call are you going to be able to support? Where exactly is this transformation going to occur in the planning pipeline, and how many cycles will we waste checking for the possible transform in cases where it doesn't apply? Questions like these are where we've gotten bogged down in the past. You might care to consult the pgsql-hackers archives for past discussions. regards, tom lane
I wonder if a macro system might be warranted - then have max be a macro instead of an aggregate. However, I don't know exactly how that would work since it involves the whole statement. Anyway, just an idea to hopefully spur someone else's thinking cap :) Jon On Wed, 11 Jun 2003, Tom Lane wrote: > "Dann Corbit" <DCorbit@connx.com> writes: > > Is this a poorly designed hack: > > Select max(expression) from <join> where <filter> > > Well, to start with, any nonempty WHERE probably invalidates the > optimization, and I doubt it works if there's a join rather than a > single table involved. But you're just handwaving --- the devil is in > the details. How will you identify which aggregates are MIN and MAX > (no, I don't like the idea of relying on the names; remember we have > an extensible set of aggregates)? What will you do when there are > multiple aggregates in the command --- or more generally, how complex a > context for the aggregate call are you going to be able to support? > Where exactly is this transformation going to occur in the planning > pipeline, and how many cycles will we waste checking for the possible > transform in cases where it doesn't apply? Questions like these are > where we've gotten bogged down in the past. You might care to consult > the pgsql-hackers archives for past discussions. > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 5: Have you checked our extensive FAQ? > > http://www.postgresql.org/docs/faqs/FAQ.html >
On Wed, Jun 11, 2003 at 10:44:22 -0700, Jonathan Bartlett <johnnyb@eskimo.com> wrote: > I wonder if a macro system might be warranted - then have max be a macro > instead of an aggregate. However, I don't know exactly how that would > work since it involves the whole statement. Anyway, just an idea to > hopefully spur someone else's thinking cap :) I don't think that would work. There are going to be some cases where the aggregate is better than the subselect (and not just when there isn't an appropiate index). And in some cases distinct on order by may be the best way to get what you want.
> -----Original Message----- > From: Bruno Wolff III [mailto:bruno@wolff.to] > Sent: Wednesday, June 11, 2003 11:30 AM > To: Jonathan Bartlett > Cc: Tom Lane; Dann Corbit; jim@nasby.net; pgsql-general@postgresql.org > Subject: Re: [GENERAL] Index not being used in MAX function (7.2.3) > > > On Wed, Jun 11, 2003 at 10:44:22 -0700, > Jonathan Bartlett <johnnyb@eskimo.com> wrote: > > I wonder if a macro system might be warranted - then have max be a > > macro instead of an aggregate. However, I don't know > exactly how that > > would work since it involves the whole statement. Anyway, just an > > idea to hopefully spur someone else's thinking cap :) > > I don't think that would work. There are going to be some > cases where the aggregate is better than the subselect (and > not just when there isn't an appropiate index). And in some > cases distinct on order by may be the best way to get what you want. Isn't that the optimizer's job to figure out? The whole idea of SQL is to abstract the queries and allow the optimizer to make all the smart choices about plans and stuff. I do realize that it is very "non-trivial" to implement, but min() and max() are used so often it seems it might be useful. Here are some "free to use" templates for statistical functions: ftp://cap.connx.com/tournament_software/Kahan.Hpp ftp://cap.connx.com/tournament_software/STATS.HPP The Kahan template is an extremely accurate adder (does not lose precision like direct summation). The Stats template (which uses the Kahan adder) does all sorts of things like skew, kurtosis, min, max, stddev, average, count, sum etc. all simultaneously. Our product uses a similar template to produce all kinds of useful statistical information. See: http://www.connx.com/products/connx/Connx%208.8%20UserGuide/connxcdd32.h tm And look at the statistical functions book. No, we don't do the optimization I have suggested for min/max, but I hope to poke it into our tool set some day. However, we do have a function called "sortfirst()" and a function called "sortlast() " both of which do perform the suggested optimizations [when possible]. Perhaps PostgreSQL could do something similar.
On Wed, Jun 11, 2003 at 11:41:26 -0700, Dann Corbit <DCorbit@connx.com> wrote: > > Isn't that the optimizer's job to figure out? The whole idea of SQL is > to abstract the queries and allow the optimizer to make all the smart > choices about plans and stuff. In theory yes. My comment was specifically addressing the idea of using a macro. I don't think this would work because subselects aren't always the best way to do this. There are other areas where how you write queries makes a big difference in performance even though the queries are logically equivalent.
On Wed, Jun 11, 2003 at 11:59:36 -0700, Dennis Gearon <gearond@cvc.net> wrote: > I guess the question is, are other big iron data bases using indexes on > MAX/MIN functions, and how are they doing it? It is easier for them to do it because they don't have to worry about a function named max not really being a maximum.
Tom Lane wrote: > There's been no change in the basic problem, which is that no one has > proposed a reasonably general method of translating aggregates into > index manipulations. Postgres has an extremely general, extensible > concept of aggregates, and we're not going to mess it up with some > poorly-designed hack. But show me a clean design and implementation, > and it'll go in. Just a quick idea, in CREATE AGGREGATE, add optional parameters of: 1. ORDER BY ASC|DESC|USING operator 2. LIMIT {count} And modify INITCOND param to: INITCOND = COUNT | initial_condition where INITCOND = COUNT forces pgsql to get all row counts into INITCOND first before calling sfunc or ffunc. Still ugly, but things might be able to be generallized to: -- returnme is a function that returns its parameter. Or else make SFUNC optional and would by default return its param. CREATE AGGREGATE max (BASETYPE=int, SFUNC=returnme, STYPE=int, INITCOND=NULL, ORDER BY DESC LIMIT 1); CREATE AGGREGATE min (BASETYPE=int, SFUNC=returnme, STYPE=int, INITCOND=NULL, ORDER BY ASC LIMIT 1); CREATE AGGREGATE count (BASETYPE=int, SFUNC=returnme, STYPE=int, INITCOND=COUNT, LIMIT 0); The implementation would be: SELECT min(foo) FROM bar translates to: SELECT (SELECT sfunc FROM bar ORDER BY foo ASC LIMIT 1) as min; (or similar, if you get the idea) SELECT baz, min(foo) FROM bar GROUP BY baz ORDER BY baz; translates to: SELECT baz, (SELECT sfunc FROM bar WHERE baz = highlevel_bar.baz ORDER BY foo ASC LIMIT 1) as min FROM bar ORDER BY baz; -- Hoping that the subselect would automagically use an index if it -- exists, like normal queries. SELECT baz, count(*) FROM bar GROUP BY baz ORDER BY baz; translates to: SELECT baz, (SELECT __COUNT__ FROM bar WHERE baz = highlevel_bar.baz) as count FROM bar ORDER BY baz; Note how the GROUP BY gets pushed into the subselect as a WHERE condition, possibly allowing generic optimization of SELECT count(*). Lots of hand waving in parts, but I hope I got the idea across. Can't tell how much work is it to do without in depth knowledge of pgsql internals, though. :( -- Linux homer 2.4.18-14 #1 Wed Sep 4 13:35:50 EDT 2002 i686 i686 i386 GNU/Linux 11:00am up 168 days, 1:56, 9 users, load average: 5.08, 5.05, 5.04
Attachment
On Wed, Jun 11, 2003 at 01:02:57PM -0400, Tom Lane wrote: > "Jim C. Nasby" <jim@nasby.net> writes: > > Not to drag this out further, but you might want to hold off on that > > patch. 7.4 is supposed to use indexes for max/min. > > Where did you get that idea? I remember seeing something about someone committing the min/max to ORDER BY/LIMIT transform, but obviously I must have been smoking something. My bad. -- Jim C. Nasby (aka Decibel!) jim@nasby.net Member: Triangle Fraternity, Sports Car Club of America Give your computer some brain candy! www.distributed.net Team #1828 Windows: "Where do you want to go today?" Linux: "Where do you want to go tomorrow?" FreeBSD: "Are you guys coming, or what?"
Am I just being naive, or couldn't this be solved by adding min/max boolean flags to pg_aggregates and the appropriate syntax to CREATE AGGREGATE? That would just leave the simple matter of the index scanning code </sarcasm>. BTW, I recently tried to do something like this... SELECT key, blah, foo, bar, scoring_function(blah) AS score INTO TEMP t1 FROM blah; SELECT key, blah, foo, bar INTO TEMP info_for_max_scoring_entry_for_each_key FROM t1 WHERE t1.score = (SELECT score FROM t1 AS inner_t1 WHERE inner_t1.key = t1.key ORDER BY score DESC LIMIT 1) ; The performance was horrid. I ended up building a middle table using SELECT key, max(score) INTO TEMP t2 FROM t1 GROUP BY key; and joining with that to build the final table I wanted. So it seems the ORDER/LIMIT hack doesn't work well at all except in limited situations. On Wed, Jun 11, 2003 at 03:11:26PM -0500, Bruno Wolff III wrote: > On Wed, Jun 11, 2003 at 11:59:36 -0700, > Dennis Gearon <gearond@cvc.net> wrote: > > I guess the question is, are other big iron data bases using indexes on > > MAX/MIN functions, and how are they doing it? > > It is easier for them to do it because they don't have to worry about > a function named max not really being a maximum. > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Don't 'kill -9' the postmaster > -- Jim C. Nasby (aka Decibel!) jim@nasby.net Member: Triangle Fraternity, Sports Car Club of America Give your computer some brain candy! www.distributed.net Team #1828 Windows: "Where do you want to go today?" Linux: "Where do you want to go tomorrow?" FreeBSD: "Are you guys coming, or what?"
On Thu, Jun 12, 2003 at 12:08:15PM +0800, Ang Chin Han wrote: > Just a quick idea, in CREATE AGGREGATE, add optional parameters of: > > 1. ORDER BY ASC|DESC|USING operator > 2. LIMIT {count} IMHO, I don't think it's right to focus on the ORDER BY/LIMIT hack. The real issue here is that the best way to find a min is to grab the first tuple in the index (granted, a bit tricker in pgsql due to MVCC), and the best way to find a max is to grab the last tuple in the index. And this extends beyond the simplest of min/max examples. For instance, this technique should be used to solve grouped aggregates (assuming a suitable index, of course), the only difference is that you don't use the first/last tuple, you use the first/last one that matches your other criteria. I haven't read the source, but it seems to me what's lacking is the ability to scan indexes in order to do this. This becomes really important whenever pgsql gains the ability to use multiple indexes per table (someone smack me if it can do this now and I don't realize it), because then you could do something like SELECT min(a), max(b), min(c), min(d) and the query would be blazing fast if you had the right indexes on all 4 fields. -- Jim C. Nasby (aka Decibel!) jim@nasby.net Member: Triangle Fraternity, Sports Car Club of America Give your computer some brain candy! www.distributed.net Team #1828 Windows: "Where do you want to go today?" Linux: "Where do you want to go tomorrow?" FreeBSD: "Are you guys coming, or what?"
"Jim C. Nasby" <jim@nasby.net> writes: > Am I just being naive, or couldn't this be solved by adding min/max > boolean flags to pg_aggregates and the appropriate syntax to CREATE > AGGREGATE? I'd prefer to see a direct link to the associated sort operator ('<' for MIN or '>' for MAX). But yeah, some addition to the system catalogs seems essential if you don't want the code to be full of unsupportable assumptions about aggregate and operator names. > So it seems the ORDER/LIMIT hack doesn't work well at all except in > limited situations. No kidding. This is one reason there hasn't been a huge push to get it implemented: the actual usefulness of the hack is quite limited. In your example, I suspect the presence of the unrelated WHERE clause is what makes it unhelpful. regards, tom lane
On Thu, Jun 12, 2003 at 17:17:19 -0500, "Jim C. Nasby" <jim@nasby.net> wrote: > Am I just being naive, or couldn't this be solved by adding min/max > boolean flags to pg_aggregates and the appropriate syntax to CREATE > AGGREGATE? That would just leave the simple matter of the index scanning > code </sarcasm>. There are other potential aggregates of this type. You would be better off adding an operator (class?) than a flag. > > BTW, I recently tried to do something like this... > > SELECT key, blah, foo, bar, scoring_function(blah) AS score INTO TEMP t1 FROM blah; > SELECT key, blah, foo, bar > INTO TEMP info_for_max_scoring_entry_for_each_key > FROM t1 > WHERE t1.score = (SELECT score FROM t1 AS inner_t1 WHERE > inner_t1.key = t1.key ORDER BY score DESC LIMIT 1) > ; > > The performance was horrid. I ended up building a middle table using > SELECT key, max(score) INTO TEMP t2 FROM t1 GROUP BY key; > > and joining with that to build the final table I wanted. So it seems the > ORDER/LIMIT hack doesn't work well at all except in limited situations. Unless there was a combined index on key and score I would expect the form you ended up using to be the fastest way to do it. With a combined index, distinct on would probably be a bit faster (epsecially if there were lots of values with the same key).
On Thu, Jun 12, 2003 at 17:26:11 -0500, "Jim C. Nasby" <jim@nasby.net> wrote: > On Thu, Jun 12, 2003 at 12:08:15PM +0800, Ang Chin Han wrote: > > This becomes really important whenever pgsql gains the ability to use > multiple indexes per table (someone smack me if it can do this now and I SMACK! > don't realize it), because then you could do something like > > SELECT min(a), max(b), min(c), min(d) > > and the query would be blazing fast if you had the right indexes on all > 4 fields. And you can speed this up now using subselects.
On Fri, Jun 13, 2003 at 10:37:46AM -0500, Bruno Wolff III wrote: > On Thu, Jun 12, 2003 at 17:26:11 -0500, > "Jim C. Nasby" <jim@nasby.net> wrote: > > On Thu, Jun 12, 2003 at 12:08:15PM +0800, Ang Chin Han wrote: > > > > This becomes really important whenever pgsql gains the ability to use > > multiple indexes per table (someone smack me if it can do this now and I > > SMACK! You can use multiple indexes per query plan node? IE: if you have indexes on a and b, you can use both indexes when you do WHERE a=blah and b=foo? That's what I was reffering to... -- Jim C. Nasby (aka Decibel!) jim@nasby.net Member: Triangle Fraternity, Sports Car Club of America Give your computer some brain candy! www.distributed.net Team #1828 Windows: "Where do you want to go today?" Linux: "Where do you want to go tomorrow?" FreeBSD: "Are you guys coming, or what?"
On Fri, 13 Jun 2003, Jim C. Nasby wrote: > On Fri, Jun 13, 2003 at 10:37:46AM -0500, Bruno Wolff III wrote: > > On Thu, Jun 12, 2003 at 17:26:11 -0500, > > "Jim C. Nasby" <jim@nasby.net> wrote: > > > On Thu, Jun 12, 2003 at 12:08:15PM +0800, Ang Chin Han wrote: > > > > > > This becomes really important whenever pgsql gains the ability to use > > > multiple indexes per table (someone smack me if it can do this now and I > > > > SMACK! > > You can use multiple indexes per query plan node? IE: if you have > indexes on a and b, you can use both indexes when you do WHERE a=blah > and b=foo? That's what I was reffering to... In that instance you'd be better off with a multiple key index.
On Fri, Jun 13, 2003 at 15:12:39 -0500, "Jim C. Nasby" <jim@nasby.net> wrote: > On Fri, Jun 13, 2003 at 10:37:46AM -0500, Bruno Wolff III wrote: > > On Thu, Jun 12, 2003 at 17:26:11 -0500, > > "Jim C. Nasby" <jim@nasby.net> wrote: > > > On Thu, Jun 12, 2003 at 12:08:15PM +0800, Ang Chin Han wrote: > > > > > > This becomes really important whenever pgsql gains the ability to use > > > multiple indexes per table (someone smack me if it can do this now and I > > > > SMACK! > > You can use multiple indexes per query plan node? IE: if you have > indexes on a and b, you can use both indexes when you do WHERE a=blah > and b=foo? That's what I was reffering to... I think what you are describing doesn't make much sense. On any scan you are only going to using one index (possibly over several columns). But while processing a query you might be making use of several indexes on a table. For example the following query should would fast if appropiate indexes exist: select (select col1 from tab order by col1), (select col2 from tab order by col2)
On Fri, Jun 13, 2003 at 03:58:13PM -0500, Bruno Wolff III wrote: > > You can use multiple indexes per query plan node? IE: if you have > > indexes on a and b, you can use both indexes when you do WHERE a=blah > > and b=foo? That's what I was reffering to... > > I think what you are describing doesn't make much sense. On any scan > you are only going to using one index (possibly over several columns). I believe MSSQL and Oracle support it; they scan the indexes then grab the appropriate set of matching tuple addresses. The advantage you get from this is you can get close to multi-key index performance without using multi-key indexes. This is useful when you need to do lookups on two different fields in a table, and also need to do lookups on both fields. IE: select ... where a=foo; select ... where b=bar; select ... where a=blah and b=baz; I was strictly looking to the future when pgsql migth eventually support this. :) -- Jim C. Nasby (aka Decibel!) jim@nasby.net Member: Triangle Fraternity, Sports Car Club of America Give your computer some brain candy! www.distributed.net Team #1828 Windows: "Where do you want to go today?" Linux: "Where do you want to go tomorrow?" FreeBSD: "Are you guys coming, or what?"
On Fri, Jun 13, 2003 at 17:36:49 -0500, "Jim C. Nasby" <jim@nasby.net> wrote: > > I believe MSSQL and Oracle support it; they scan the indexes then grab > the appropriate set of matching tuple addresses. The advantage you get > from this is you can get close to multi-key index performance without > using multi-key indexes. This is useful when you need to do lookups on > two different fields in a table, and also need to do lookups on both > fields. IE: > > select ... where a=foo; > select ... where b=bar; > select ... where a=blah and b=baz; > > I was strictly looking to the future when pgsql migth eventually support > this. :) I don't see how this would be useful in any of these examples. The first two seem to be ones where one index scan would work. The third would be handled by postgres using an index scan and a filter (assuming no multikey index was available). I can't think of a circumtances where doing two index scans and then joining the rows obtained from each scan would be faster than the way postgres does it. If there was an 'or' instead of an 'and' then unioning the two sets of results from index scans would make sense. I tried a quick test of this and saw postgres using a seq scan with a filter, but I might not have data with the right set of properties to make this work. In theory you could do a plain index scan for one half of the or and an index scan with a filter (to remove what would be duplicates) on the other half of the or clause. This would be a win in many common cases. I tried some stuff to see what postgres does and it looks like it has a way to search two indexes at once. The following commands: select version(); explain analyze select gameid from crate where areaid = 'JL005' or rate = 5034; explain analyze (select gameid from crate where areaid = 'JL005') union all (select gameid from crate where rate = 5034 and areaid <> 'JL005'); generated the following output: version ------------------------------------------------------------------------ PostgreSQL 7.4devel on i686-pc-linux-gnu, compiled by GCC egcs-2.91.66 (1 row) QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------- Index Scan using crate_pkey, "temp" on crate (cost=0.00..151.34 rows=38 width=7) (actual time=0.09..0.72 rows=72 loops=1) Index Cond: ((areaid = 'JL005'::text) OR (rate = 5034)) Total runtime: 0.85 msec (3 rows) QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------ Append (cost=0.00..151.32 rows=39 width=7) (actual time=0.08..0.73 rows=72 loops=1) -> Subquery Scan "*SELECT* 1" (cost=0.00..39.88 rows=10 width=7) (actual time=0.07..0.10 rows=2 loops=1) -> Index Scan using crate_pkey on crate (cost=0.00..39.88 rows=10 width=7) (actual time=0.07..0.09 rows=2 loops=1) Index Cond: (areaid = 'JL005'::text) -> Subquery Scan "*SELECT* 2" (cost=0.00..111.43 rows=29 width=7) (actual time=0.03..0.57 rows=70 loops=1) -> Index Scan using "temp" on crate (cost=0.00..111.43 rows=29 width=7) (actual time=0.03..0.44 rows=70 loops=1) Index Cond: (rate = 5034) Filter: (areaid <> 'JL005'::text) Total runtime: 0.96 msec (9 rows) The select with the union was an attempt to force the results of two index scans to be combined. But if you look at the results of teh plan for the simpler query you will see that postgres is doing an index scan with an 'or' condition which suggests that it is doing pretty much the same thing as the more complicated query, but more efficiently. P.S. In my example yesterday there were suppposed to be limit 1 clauses in both subselects as the were supposed to be the equivalent of min functions.
On Sat, Jun 14, 2003 at 08:27:20AM -0500, Bruno Wolff III wrote: > I don't see how this would be useful in any of these examples. The first > two seem to be ones where one index scan would work. The third would > be handled by postgres using an index scan and a filter (assuming no > multikey index was available). I can't think of a circumtances where > doing two index scans and then joining the rows obtained from each > scan would be faster than the way postgres does it. If there was an 'or' It would be useful if we didn't have to immediately consider MVCC info, which requires hitting the tupples. Indexes are usually much narrower than tables, even with metadata included, so you can scan two indexes and and/or the results faster than scanning one index, then hitting all the tuples. I believe this could be done right now in pgsql, though I'm not sure if it would be as useful since you'll still have to read all the tupples to obtain the MVCC info (this is why I hope the MVCC info can eventually be moved out of mainline storage). > instead of an 'and' then unioning the two sets of results from index > scans would make sense. I tried a quick test of this and saw postgres > using a seq scan with a filter, but I might not have data with the > right set of properties to make this work. In theory you could do > a plain index scan for one half of the or and an index scan with a filter > (to remove what would be duplicates) on the other half of the or clause. > This would be a win in many common cases. I tried some stuff to see > what postgres does and it looks like it has a way to search two indexes > at once. > > The following commands: > > select version(); > > explain analyze > select gameid from crate where areaid = 'JL005' or rate = 5034; > > explain analyze > (select gameid from crate where areaid = 'JL005') union all > (select gameid from crate where rate = 5034 and areaid <> 'JL005'); > > generated the following output: > > version > ------------------------------------------------------------------------ > PostgreSQL 7.4devel on i686-pc-linux-gnu, compiled by GCC egcs-2.91.66 > (1 row) > > QUERY PLAN > ---------------------------------------------------------------------------------------------------------------------------- > Index Scan using crate_pkey, "temp" on crate (cost=0.00..151.34 rows=38 width=7) (actual time=0.09..0.72 rows=72 loops=1) > Index Cond: ((areaid = 'JL005'::text) OR (rate = 5034)) > Total runtime: 0.85 msec > (3 rows) > > QUERY PLAN > ------------------------------------------------------------------------------------------------------------------------------ > Append (cost=0.00..151.32 rows=39 width=7) (actual time=0.08..0.73 rows=72 loops=1) > -> Subquery Scan "*SELECT* 1" (cost=0.00..39.88 rows=10 width=7) (actual time=0.07..0.10 rows=2 loops=1) > -> Index Scan using crate_pkey on crate (cost=0.00..39.88 rows=10 width=7) (actual time=0.07..0.09 rows=2 loops=1) > Index Cond: (areaid = 'JL005'::text) > -> Subquery Scan "*SELECT* 2" (cost=0.00..111.43 rows=29 width=7) (actual time=0.03..0.57 rows=70 loops=1) > -> Index Scan using "temp" on crate (cost=0.00..111.43 rows=29 width=7) (actual time=0.03..0.44 rows=70 loops=1) > Index Cond: (rate = 5034) > Filter: (areaid <> 'JL005'::text) > Total runtime: 0.96 msec > (9 rows) > > The select with the union was an attempt to force the results of two index > scans to be combined. But if you look at the results of teh plan for the > simpler query you will see that postgres is doing an index scan with > an 'or' condition which suggests that it is doing pretty much the same thing > as the more complicated query, but more efficiently. What indexes are on crate? -- Jim C. Nasby (aka Decibel!) jim@nasby.net Member: Triangle Fraternity, Sports Car Club of America Give your computer some brain candy! www.distributed.net Team #1828 Windows: "Where do you want to go today?" Linux: "Where do you want to go tomorrow?" FreeBSD: "Are you guys coming, or what?"
On Sat, Jun 14, 2003 at 10:23:20 -0500, "Jim C. Nasby" <jim@nasby.net> wrote: > > It would be useful if we didn't have to immediately consider MVCC info, > which requires hitting the tupples. Indexes are usually much narrower > than tables, even with metadata included, so you can scan two indexes > and and/or the results faster than scanning one index, then hitting all > the tuples. I don't think you are likely to see much gain from this as scanning two indexes instead of one is likely to cost about as much as scanning an index and looking at the tupples to see if they match the other condition. Also if you look at both indexes you have to do a join to connect the tuples back together and there is going to be some cost to that. > > What indexes are on crate? I had indexes on areaid, game and game, areaid. I created one of crate for the test. I don't normally search on crate. I do do orders on it, but generally after joins so an index isn't a great help. The table only has about 16000 rows in it, so it is a pretty small sample. I would actually be interested in hearing a comment from someone who knows more just how a index condition with an OR is handled when the two indexes aren't part of the same multicolumn index.
Bruno Wolff III <bruno@wolff.to> writes: > I don't think you are likely to see much gain from this as scanning > two indexes instead of one is likely to cost about as much as scanning > an index and looking at the tupples to see if they match the other > condition. We have speculated about this in the past. Several of the big commercial DBs can do it, so it seems that at least some people find it valuable. I'd be inclined to look at it in combination with decoupling heap and index scan order: that is, you traverse the index and make a list of heap tuple TIDs that the index says to visit, then you visit those tuples in heap storage order. This gets rid of a lot of the random-access overhead of the present indexscanning scheme, at the cost of not producing sorted-by-the-indexkey output (but presumably the planner could choose whether to do it this way or the old way). The way this fits with multiple indexes is that you could gather tuple TIDs from several indexes and intersect or union the lists before you visit the heap. I believe the common way to do this is to represent the TID lists as sparse bitmaps and AND or OR the bitmaps. > I would actually be interested in hearing a comment from someone who > knows more just how a index condition with an OR is handled when the > two indexes aren't part of the same multicolumn index. Right now, the way we handle it is to test each tuple retrieved by an indexscan against the conditions associated with the previous indexscans (ie, the earlier OR terms), so that we can detect whether we already returned the tuple in a previous scan. This works but it's not especially efficient, because if the OR terms overlap then each tuple in the overlap is fetched multiple times. Plus you have all that computation done to recheck the conditions. regards, tom lane
On Sat, Jun 14, 2003 at 01:18:15PM -0400, Tom Lane wrote: > Bruno Wolff III <bruno@wolff.to> writes: > > I don't think you are likely to see much gain from this as scanning > > two indexes instead of one is likely to cost about as much as scanning > > an index and looking at the tupples to see if they match the other > > condition. > > We have speculated about this in the past. Several of the big > commercial DBs can do it, so it seems that at least some people find > it valuable. > > I'd be inclined to look at it in combination with decoupling heap and > index scan order: that is, you traverse the index and make a list of > heap tuple TIDs that the index says to visit, then you visit those > tuples in heap storage order. This gets rid of a lot of the > random-access overhead of the present indexscanning scheme, at the cost > of not producing sorted-by-the-indexkey output (but presumably the > planner could choose whether to do it this way or the old way). > > The way this fits with multiple indexes is that you could gather tuple > TIDs from several indexes and intersect or union the lists before you > visit the heap. I believe the common way to do this is to represent > the TID lists as sparse bitmaps and AND or OR the bitmaps. That's my understanding as well. As to how useful it is, you just have to consider the case of a very wide table and being able to narrow things down as much as possible before hitting the table itself. If TID is 6 bytes and you have 2 indexes that are each on int fields, that's 10 bytes per row, per index. Compare that to a table that's 100 bytes wide or more and it's very easy to come out way ahead on the index side. -- Jim C. Nasby (aka Decibel!) jim@nasby.net Member: Triangle Fraternity, Sports Car Club of America Give your computer some brain candy! www.distributed.net Team #1828 Windows: "Where do you want to go today?" Linux: "Where do you want to go tomorrow?" FreeBSD: "Are you guys coming, or what?"