Thread: Speeding up Aggregates
Hi, I have a query that ran quite well initially, but slowed down quite a bit once I introduced an aggregate into the equation. The average execution time went up from around 15 msec to around 300 msec. The original query fetches a bunch of articles: select articlenumber, channel, description, title, link, dtstamp from items, my_channels where items.channel = '22222' and my_channels.id = '22222' and owner = 'drormata' and dtstamp > last_viewed and articlenumber not in (select item from viewed_items where channel ='22222' and owner = 'drormata'); I then added a call to a function: and (dtstamp = item_max_date(22222, link)) item_max_date() looks like this: select max(dtstamp) from items where channel = $1 and link = $2; This should eliminate duplicate articles and only show the most recent one. resulting in the following query select articlenumber, channel, description, title, link, dtstamp from items, my_channels where items.channel = '22222' and my_channels.id = '22222' and owner = 'drormata' and dtstamp > last_viewed and articlenumber not in (select item from viewed_items where channel ='22222' and owner = 'drormata') and (dtstamp = item_max_date(22222, link)); Any suggestions on optimizing the query/function? It makes sense that it slowed down, but I wonder if I can do better. I'm including index list as well as "explain analyze" of both versions. Indexes: "item_channel_link" btree (channel, link) "item_created" btree (dtstamp) "item_signature" btree (signature) "items_channel_article" btree (channel, articlenumber) explain analyze select articlenumber, channel, description, title, link, dtstamp from items, my_channels where items.channel= '22222' and my_channels.id = '22222' and owner = 'drormata' and dtstamp > last_viewed and articlenumber notin (select item from viewed_items where channel ='22222' and owner = 'drormata'); QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------- Nested Loop (cost=8.19..6982.58 rows=302 width=259) (actual time=16.95..17.16 rows=8 loops=1) Join Filter: ("inner".dtstamp > "outer".last_viewed) -> Seq Scan on my_channels (cost=0.00..3.23 rows=1 width=8) (actual time=0.36..0.38 rows=1 loops=1) Filter: ((id = 22222) AND (("owner")::text = 'drormata'::text)) -> Index Scan using items_channel_article on items (cost=8.19..6968.05 rows=904 width=259) (actual time=0.68..13.94rows=899 loops=1) Index Cond: (channel = 22222) Filter: (NOT (hashed subplan)) SubPlan -> Seq Scan on viewed_items (cost=0.00..8.19 rows=2 width=4) (actual time=0.48..0.48 rows=0 loops=1) Filter: ((channel = 22222) AND (("owner")::text = 'drormata'::text)) Total runtime: 17.42 msec (11 rows) explain analyze select articlenumber, channel, description, title, link, dtstamp from items, my_channels where items.channel= '22222' and my_channels.id = '22222' and owner = 'drormata' and dtstamp > last_viewed and articlenumber notin (select item from viewed_items where channel ='22222' and owner = 'drormata') and (dtstamp = item_max_date(22222, link)); QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------ Nested Loop (cost=8.19..6980.33 rows=1 width=259) (actual time=262.94..265.14 rows=7 loops=1) Join Filter: ("outer".dtstamp > "inner".last_viewed) -> Index Scan using items_channel_article on items (cost=8.19..6977.08 rows=1 width=259) (actual time=1.94..150.55 rows=683loops=1) Index Cond: (channel = 22222) Filter: ((dtstamp = item_max_date(22222, link)) AND (NOT (hashed subplan))) SubPlan -> Seq Scan on viewed_items (cost=0.00..8.19 rows=2 width=4) (actual time=0.43..0.43 rows=0 loops=1) Filter: ((channel = 22222) AND (("owner")::text = 'drormata'::text)) -> Seq Scan on my_channels (cost=0.00..3.23 rows=1 width=8) (actual time=0.14..0.15 rows=1 loops=683) Filter: ((id = 22222) AND (("owner")::text = 'drormata'::text)) Total runtime: 265.39 msec -- Dror Matalon Zapatec Inc 1700 MLK Way Berkeley, CA 94709 http://www.zapatec.com
Dror, > select articlenumber, channel, description, title, link, dtstamp from > items, my_channels where items.channel = '22222' and my_channels.id = > '22222' and owner = 'drormata' and dtstamp > last_viewed and > articlenumber not in (select item from viewed_items where channel > ='22222' and owner = 'drormata'); the NOT IN is a bad idea unless the subselect never returns more than a handful of rows. If viewed_items can grow to dozens of rows, wyou should use WHERE NOT EXISTS instead. Unless you're using 7.4. > item_max_date() looks like this: > select max(dtstamp) from items where channel = $1 and link = $2; Change it to SELECT dtstamp from iterm where channel = $1 and link = $2 ORDER BY dtstamp DESC LIMIT 1 and possibly build an index on channel, link, dtstamp -- -Josh Berkus Aglio Database Solutions San Francisco
Hi Josh, On Fri, Oct 03, 2003 at 02:07:10PM -0700, Josh Berkus wrote: > Dror, > > > select articlenumber, channel, description, title, link, dtstamp from > > items, my_channels where items.channel = '22222' and my_channels.id = > > '22222' and owner = 'drormata' and dtstamp > last_viewed and > > articlenumber not in (select item from viewed_items where channel > > ='22222' and owner = 'drormata'); > > the NOT IN is a bad idea unless the subselect never returns more than a > handful of rows. If viewed_items can grow to dozens of rows, wyou should > use WHERE NOT EXISTS instead. Unless you're using 7.4. > I am using 7.4, and had tried NOT EXISTS and didn't see any improvements. > > item_max_date() looks like this: > > select max(dtstamp) from items where channel = $1 and link = $2; > > Change it to > > SELECT dtstamp from iterm where channel = $1 and link = $2 > ORDER BY dtstamp DESC LIMIT 1 > Didn't make a difference. And plugging real values into this query as well as into the original select max(dtstamp) from items where channel = $1 and link = $2; and doing an explain analyze shows that the cost is the same. The strange things is that when I run the above queries by hand they take about .5 msec. Yet on a resultset that fetches 5 rows, I go up from 15 msec to 300 msec. It would seem like it should be something like 15 + (0.5 * 5) + small overhead, = 30 msec or so rather than the 300 I'm seeing. > and possibly build an index on channel, link, dtstamp Didn't make a difference either. Explain analyze shows that it didn't use it. > > -- > -Josh Berkus > Aglio Database Solutions > San Francisco > -- Dror Matalon Zapatec Inc 1700 MLK Way Berkeley, CA 94709 http://www.zapatec.com
Dror, > I am using 7.4, and had tried NOT EXISTS and didn't see any > improvements. It wouldn't if you're using 7.4, which has improved IN performance immensely. What happens if you stop using a function and instead use a subselect? -- -Josh Berkus Aglio Database Solutions San Francisco
> item_max_date() looks like this: > select max(dtstamp) from items where channel = $1 and link = $2; It is too bad the (channel, link) index doesn't have dtstamp at the end of it, otherwise the below query would be a gain (might be a small one anyway). select dtstamp from items where channel = $1 and link = $2 ORDER BY dtstamp DESC LIMIT 1; Could you show us the exact specification of the function? In particular, did you mark it VOLATILE, IMMUTABLE, or STABLE? I hope it isn't the first or second one ;)
Attachment
On Fri, Oct 03, 2003 at 05:44:49PM -0400, Rod Taylor wrote: > > item_max_date() looks like this: > > select max(dtstamp) from items where channel = $1 and link = $2; > > It is too bad the (channel, link) index doesn't have dtstamp at the end > of it, otherwise the below query would be a gain (might be a small one > anyway). > > select dtstamp > from items > where channel = $1 > and link = $2 > ORDER BY dtstamp DESC > LIMIT 1; Similar idea to what Josh suggested. I did create an additional index with dtstamp at the end and it doesn't look like the planner used it. Using the above query instead of max() didn't improve things either. > > > Could you show us the exact specification of the function? In > particular, did you mark it VOLATILE, IMMUTABLE, or STABLE? > > I hope it isn't the first or second one ;) CREATE or REPLACE FUNCTION item_max_date (int4, varchar) RETURNS timestamptz AS ' select max(dtstamp) from items where channel = $1 and link = $2; ' LANGUAGE 'sql'; -- Dror Matalon Zapatec Inc 1700 MLK Way Berkeley, CA 94709 http://www.zapatec.com
On Fri, Oct 03, 2003 at 02:35:46PM -0700, Josh Berkus wrote: > Dror, > > > I am using 7.4, and had tried NOT EXISTS and didn't see any > > improvements. > > It wouldn't if you're using 7.4, which has improved IN performance immensely. > > What happens if you stop using a function and instead use a subselect? An improvement. Now I'm getting in the 200 msec response time. And by the way, I tried "not exists" again and it actually runs slower than "not in." > > -- > -Josh Berkus > Aglio Database Solutions > San Francisco > > > ---------------------------(end of broadcast)--------------------------- > TIP 3: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly -- Dror Matalon Zapatec Inc 1700 MLK Way Berkeley, CA 94709 http://www.zapatec.com
On Fri, 2003-10-03 at 17:53, Dror Matalon wrote: > On Fri, Oct 03, 2003 at 05:44:49PM -0400, Rod Taylor wrote: > > > item_max_date() looks like this: > > > select max(dtstamp) from items where channel = $1 and link = $2; > > > > It is too bad the (channel, link) index doesn't have dtstamp at the end > > of it, otherwise the below query would be a gain (might be a small one > > anyway). > > > > select dtstamp > > from items > > where channel = $1 > > and link = $2 > > ORDER BY dtstamp DESC > > LIMIT 1; It didn't make a difference even with the 3 term index? I guess you don't have very many common values for channel / link combination. How about the below? Note the word STABLE on the end. CREATE or REPLACE FUNCTION item_max_date (int4, varchar) RETURNS timestamptz AS ' select max(dtstamp) from items where channel = $1 and link = $2; ' LANGUAGE 'sql' STABLE;
Attachment
On Fri, Oct 03, 2003 at 06:10:29PM -0400, Rod Taylor wrote: > On Fri, 2003-10-03 at 17:53, Dror Matalon wrote: > > On Fri, Oct 03, 2003 at 05:44:49PM -0400, Rod Taylor wrote: > > > > item_max_date() looks like this: > > > > select max(dtstamp) from items where channel = $1 and link = $2; > > > > > > It is too bad the (channel, link) index doesn't have dtstamp at the end > > > of it, otherwise the below query would be a gain (might be a small one > > > anyway). > > > > > > select dtstamp > > > from items > > > where channel = $1 > > > and link = $2 > > > ORDER BY dtstamp DESC > > > LIMIT 1; > > It didn't make a difference even with the 3 term index? I guess you > don't have very many common values for channel / link combination. There's no noticeable difference between two term and three term indexes. > > > > How about the below? Note the word STABLE on the end. > > CREATE or REPLACE FUNCTION item_max_date (int4, varchar) RETURNS > timestamptz AS ' > select max(dtstamp) from items where channel = $1 and link = $2; > ' LANGUAGE 'sql' STABLE; Made no difference. -- Dror Matalon Zapatec Inc 1700 MLK Way Berkeley, CA 94709 http://www.zapatec.com
> > I hope it isn't the first or second one ;) > > CREATE or REPLACE FUNCTION item_max_date (int4, varchar) RETURNS > timestamptz AS ' > select max(dtstamp) from items where channel = $1 and link = $2; > ' LANGUAGE 'sql'; How about the below? CREATE or REPLACE FUNCTION item_max_date (int4, varchar) RETURNS timestamptz AS ' select max(dtstamp) from items where channel = $1 and link = $2; ' LANGUAGE 'sql' STABLE;
Attachment
Rod Taylor <rbt@rbt.ca> writes: > On Fri, 2003-10-03 at 17:53, Dror Matalon wrote: > > On Fri, Oct 03, 2003 at 05:44:49PM -0400, Rod Taylor wrote: > > > > > > It is too bad the (channel, link) index doesn't have dtstamp at the end > > > of it, otherwise the below query would be a gain (might be a small one > > > anyway). > > > > > > select dtstamp > > > from items > > > where channel = $1 > > > and link = $2 > > > ORDER BY dtstamp DESC > > > LIMIT 1; > > It didn't make a difference even with the 3 term index? I guess you > don't have very many common values for channel / link combination. You need to do: ORDER BY channel DESC, link DESC, dtstamp DESC This is an optimizer nit. It doesn't notice that since it selected on channel and link already the remaining tuples in the index will be ordered simply by dtstamp. (This is the thing i pointed out previously in <87el6ckrlu.fsf@stark.dyndns.tv> on Feb 13th 2003 on pgsql-general) -- greg
Actually what finally sovled the problem is repeating the dtstamp > last_viewed in the sub select select articlenumber, channel, description, title, link, dtstamp from items i1, my_channels where ((i1.channel = '22222'and my_channels.id = '22222' and owner = 'drormata' and (dtstamp > last_viewed)) ) and (dtstamp = (select max (dtstamp) fromitems i2 where channel = '22222' and i1.link = i2.link)); to explain analyze select articlenumber, channel, description, title, link, dtstamp from items i1, my_channels where ((i1.channel= '22222' and my_channels.id = '22222' and owner = 'drormata' and (dtstamp > last_viewed)) ) and (dtstamp = (select max (dtstamp) fromitems i2 where channel = '22222' and i1.link = i2.link and dtstamp > last_viewed)); Which in the stored procedure looks like this: CREATE or REPLACE FUNCTION item_max_date (int4, varchar, timestamptz) RETURNS timestamptz AS ' select max(dtstamp) from items where channel = $1 and link = $2 and dtstamp > $3; ' LANGUAGE 'sql'; Basically I have hundreds or thousands of items but only a few that satisfy "dtstamp > last_viewed". Obviously I want to run the max() only on on a few items. Repeating "dtstamp > last_viewed" did the trick, but it seems like there should be a more elegant/clear way to tell the planner which constraint to apply first. Dror On Wed, Oct 08, 2003 at 10:54:24AM -0400, Greg Stark wrote: > Rod Taylor <rbt@rbt.ca> writes: > > > On Fri, 2003-10-03 at 17:53, Dror Matalon wrote: > > > On Fri, Oct 03, 2003 at 05:44:49PM -0400, Rod Taylor wrote: > > > > > > > > It is too bad the (channel, link) index doesn't have dtstamp at the end > > > > of it, otherwise the below query would be a gain (might be a small one > > > > anyway). > > > > > > > > select dtstamp > > > > from items > > > > where channel = $1 > > > > and link = $2 > > > > ORDER BY dtstamp DESC > > > > LIMIT 1; > > > > It didn't make a difference even with the 3 term index? I guess you > > don't have very many common values for channel / link combination. > > You need to do: > > ORDER BY channel DESC, link DESC, dtstamp DESC > > This is an optimizer nit. It doesn't notice that since it selected on channel > and link already the remaining tuples in the index will be ordered simply by > dtstamp. > > (This is the thing i pointed out previously in > <87el6ckrlu.fsf@stark.dyndns.tv> on Feb 13th 2003 on pgsql-general) > > > -- > greg > -- Dror Matalon Zapatec Inc 1700 MLK Way Berkeley, CA 94709 http://www.zapatec.com
Dror Matalon <dror@zapatec.com> writes: > Actually what finally sovled the problem is repeating the > dtstamp > last_viewed > in the sub select That will at least convince the optimizer to use an index range lookup. But it still will have to scan every record that matches channel==$1, link==$2, and dtstamp>$3. The trick of using limit 1 will be faster still as it only has to retrieve a single record using the index. But you have to be sure to convince it to use the index and the way to do that is to list exactly the same columns in the ORDER BY as are in the index definition. Even if some of the leading columns are redundant because they'll be constant for all of the records retrieved. The optimizer doesn't know to ignore those. > > (This is the thing i pointed out previously in > > <87el6ckrlu.fsf@stark.dyndns.tv> on Feb 13th 2003 on pgsql-general) -- greg
On Thu, Oct 09, 2003 at 07:07:00PM -0400, Greg Stark wrote: > Dror Matalon <dror@zapatec.com> writes: > > > Actually what finally sovled the problem is repeating the > > dtstamp > last_viewed > > in the sub select > > That will at least convince the optimizer to use an index range lookup. But it > still will have to scan every record that matches channel==$1, link==$2, and > dtstamp>$3. > > The trick of using limit 1 will be faster still as it only has to retrieve a > single record using the index. But you have to be sure to convince it to use How is doing order by limit 1 faster than doing max()? Seems like the optimizer will need to sort or scan the data set either way. That part didn't actually make a difference in my specific case. > the index and the way to do that is to list exactly the same columns in the > ORDER BY as are in the index definition. > > Even if some of the leading columns are redundant because they'll be constant > for all of the records retrieved. The optimizer doesn't know to ignore those. The main problem in my case was that the optimizer was doing the max() on all 700 rows, rather than the filtered rows. It's not until I put the "dtstamp> last_viewed" in the sub select as well as in the main query that it realized that it can first filter the 696 rows out and then to the max() on the 4 rows that satisfied this constraint. That was the big saving. Hope this all makes sense, Dror > > > > (This is the thing i pointed out previously in > > > <87el6ckrlu.fsf@stark.dyndns.tv> on Feb 13th 2003 on pgsql-general) > > -- > greg > -- Dror Matalon Zapatec Inc 1700 MLK Way Berkeley, CA 94709 http://www.zapatec.com
On Thu, Oct 09, 2003 at 17:44:46 -0700, Dror Matalon <dror@zapatec.com> wrote: > > How is doing order by limit 1 faster than doing max()? Seems like the > optimizer will need to sort or scan the data set either way. That part > didn't actually make a difference in my specific case. max() will never be evaluated by using an index to find the greatest value. So in many cases using order by and limit 1 is faster.
On Thu, Oct 09, 2003 at 08:35:22PM -0500, Bruno Wolff III wrote: > On Thu, Oct 09, 2003 at 17:44:46 -0700, > Dror Matalon <dror@zapatec.com> wrote: > > > > How is doing order by limit 1 faster than doing max()? Seems like the > > optimizer will need to sort or scan the data set either way. That part > > didn't actually make a difference in my specific case. > > max() will never be evaluated by using an index to find the greatest value. > So in many cases using order by and limit 1 is faster. Ouch. I just double checked and you're right. Is this considered a bug, or just an implementation issue? While I've seen this hint a few times in the lists, it seems like it's one of those magic incantations that those in the know, know about, and that people new to postgres are going to be surprised by the need to use this idiom. Regards, Dror -- Dror Matalon Zapatec Inc 1700 MLK Way Berkeley, CA 94709 http://www.zapatec.com
Dror Matalon <dror@zapatec.com> writes: > Ouch. I just double checked and you're right. Is this considered a bug, > or just an implementation issue? Call it a wishlist bug. The problem is it would be a hard feature to implement properly. And none of the people paid to work on postgres by various companies seem to have this on their to-do lists. So don't expect it in the near future. > While I've seen this hint a few times in the lists, it seems like it's > one of those magic incantations that those in the know, know about, and > that people new to postgres are going to be surprised by the need to use > this idiom. Yup. Though it's in the FAQ and comes up on the mailing list about once a week or so, so it's hard to see how to document it any better. Perhaps a warning specifically on the min/max functions in the documentation? Say, what do people think about a comment board thing like php.net has attached to the documentation. People can add comments that show up directly on the bottom of the documentation for each function. I find it's mostly full of junk but skimming the comments often turns up one or two relevant warnings, especially when I'm wondering why something's not behaving the way I expect. -- greg
> Say, what do people think about a comment board thing like php.net has > attached to the documentation. People can add comments that show up directly > on the bottom of the documentation for each function. I find it's mostly full > of junk but skimming the comments often turns up one or two relevant warnings, > especially when I'm wondering why something's not behaving the way I expect. I thought we had that: http://www.postgresql.org/docs/7.3/interactive/functions-aggregate.html ...and someone has already made the comment. Chris
On Thu, 9 Oct 2003, Dror Matalon wrote: > On Thu, Oct 09, 2003 at 07:07:00PM -0400, Greg Stark wrote: > > Dror Matalon <dror@zapatec.com> writes: > > > > > Actually what finally sovled the problem is repeating the > > > dtstamp > last_viewed > > > in the sub select > > > > That will at least convince the optimizer to use an index range lookup. But it > > still will have to scan every record that matches channel==$1, link==$2, and > > dtstamp>$3. > > > > The trick of using limit 1 will be faster still as it only has to retrieve a > > single record using the index. But you have to be sure to convince it to use > > How is doing order by limit 1 faster than doing max()? Seems like the > optimizer will need to sort or scan the data set either way. That part > didn't actually make a difference in my specific case. > max(field) = sequential scan looking for the hightest. order by field desc limit 1 = index scan (if available), read first record. else (if no index) sequential scan for highest. aggregates don't use indexes because its only appilicable for max() and min() and can't be done for sum(), count(), etc writing an alogorithim to use the index would be complex as you would need to tell the optimized from the inside a function (you can write aggrate functions your self if you wish) to do somthing slighly differently. for my large table.... select max(field) from table; (5264.21 msec) select field from table order by field limit 1; (54.88 msec) Peter Childs
Greg Stark writes: > Call it a wishlist bug. The problem is it would be a hard feature to > implement properly. And none of the people paid to work on postgres > by various companies seem to have this on their to-do lists. So > don't expect it in the near future. We are using Postgres heavily, and we should be able to find some time and/or funding to help. We're becoming more and more frustrated with the discontinuous behaviour of the planner. It seems every complex query we have these days needs some "hint" like "ORDER BY foo DESC LIMIT 1" to make it run on the order of seconds, not minutes. We usually figure out a way to write the query so the planner does the right thing, and pushes the discontinuity out far enough that the user doesn't see it. However, it takes a lot of work, and it seems to me that work would be put to better use improving the planner than improving our knowledge of how to get the planner to do the right thing by coding the SQL in some unusual way. Please allow me to go out on a limb here. I know that Tom is philosophically opposed to planner hints. However, we do have a problem that the planner is never going to be smart enough. This leaves the SQL coder the only option of collecting a bag of (non-portable) SQL tricks. I saw what the best SQL coders did at Tandem, and frankly, it's scary. Being at Tandem, the SQL coders also had the option (if they could argue their case strong enough) of adding a new rule to the optimizer. This doesn't solve the problem for outsiders no matter how good they are at SQL. Would it be possible to extend the planner with a pattern matching language? It would formalize what it is doing already, and would allow outsiders to teach the planner about idiosyncrasies without changing the SQL. It would be like a style sheet in Latex (or Scribe :-) if you are familiar with these typesetting languages. Comments? Rob
Dror, > Ouch. I just double checked and you're right. Is this considered a bug, > or just an implementation issue? It's an implementation issue, which may be fixed by 7.5 but not sooner. Basically, the free ability of PostgreSQL users to define their own aggregates limits our ability to define query planner optimization for aggregates. Only recently has anyone suggested a feasable way around this. > While I've seen this hint a few times in the lists, it seems like it's > one of those magic incantations that those in the know, know about, and > that people new to postgres are going to be surprised by the need to use > this idiom. It IS in the FAQ. -- Josh Berkus Aglio Database Solutions San Francisco
On Fri, Oct 10, 2003 at 10:32:32AM -0700, Josh Berkus wrote: > Dror, > > > Ouch. I just double checked and you're right. Is this considered a bug, > > or just an implementation issue? > > It's an implementation issue, which may be fixed by 7.5 but not sooner. > Basically, the free ability of PostgreSQL users to define their own > aggregates limits our ability to define query planner optimization for > aggregates. Only recently has anyone suggested a feasable way around this. > > > While I've seen this hint a few times in the lists, it seems like it's > > one of those magic incantations that those in the know, know about, and > > that people new to postgres are going to be surprised by the need to use > > this idiom. > > It IS in the FAQ. Might be a good idea to put it in its own section rather than under "My queries are slow or don't make use of the indexes. Why?" Also, you might want to take out for 7.4 4.22) Why are my subqueries using IN so slow? > > -- > Josh Berkus > Aglio Database Solutions > San Francisco -- Dror Matalon Zapatec Inc 1700 MLK Way Berkeley, CA 94709 http://www.zapatec.com