Thread: select max() much slower than select min()
ts_stats_transet_user_interval has ~48M rows. ts_id is the PK and there is
an index on ts_interval_start_time. I reindexed it and ran VACUUM ANALYZE.
Only SELECTs have been done since these operations.

cemdb=# explain select min(ts_id) from ts_stats_transet_user_interval a
        where 0=0 and a.ts_interval_start_time >= '2009-6-16 01:00'
        and a.ts_interval_start_time < '2009-6-16 02:00';
                                   QUERY PLAN
---------------------------------------------------------------------------------
 Result  (cost=12.19..12.20 rows=1 width=0)
   InitPlan
     ->  Limit  (cost=0.00..12.19 rows=1 width=8)
           ->  Index Scan using ts_stats_transet_user_interval_pkey on
               ts_stats_transet_user_interval a
               (cost=0.00..5496152.30 rows=450799 width=8)
                 Filter: ((ts_id IS NOT NULL) AND (ts_interval_start_time >=
                 '2009-06-16 01:00:00-07'::timestamp with time zone) AND
                 (ts_interval_start_time < '2009-06-16 02:00:00-07'::timestamp
                 with time zone))
(5 rows)

cemdb=# explain select max(ts_id) from ts_stats_transet_user_interval a
        where 0=0 and a.ts_interval_start_time >= '2009-6-16 01:00'
        and a.ts_interval_start_time < '2009-6-16 02:00';
                                   QUERY PLAN
---------------------------------------------------------------------------------
 Result  (cost=12.19..12.20 rows=1 width=0)
   InitPlan
     ->  Limit  (cost=0.00..12.19 rows=1 width=8)
           ->  Index Scan Backward using ts_stats_transet_user_interval_pkey on
               ts_stats_transet_user_interval a
               (cost=0.00..5496152.30 rows=450799 width=8)
                 Filter: ((ts_id IS NOT NULL) AND (ts_interval_start_time >=
                 '2009-06-16 01:00:00-07'::timestamp with time zone) AND
                 (ts_interval_start_time < '2009-06-16 02:00:00-07'::timestamp
                 with time zone))
(5 rows)

[root@rdl64xeoserv01 log]# time PGPASSWORD=quality psql -U admin -d cemdb -c
"select min(ts_id) from ts_stats_transet_user_interval a
 where a.ts_interval_start_time >= '2009-6-16 01:00'
 and a.ts_interval_start_time < '2009-6-16 02:00'"
        min
--------------------
 600000000032100000
(1 row)

real    1m32.025s
user    0m0.000s
sys     0m0.003s

[root@rdl64xeoserv01 log]# time PGPASSWORD=quality psql -U admin -d cemdb -c
"select max(ts_id) from ts_stats_transet_user_interval a
 where a.ts_interval_start_time >= '2009-6-16 01:00'
 and a.ts_interval_start_time < '2009-6-16 02:00'"
        max
--------------------
 600000000032399999
(1 row)

real    16m39.412s
user    0m0.002s
sys     0m0.002s

It seems like max() shouldn't take any longer than min(), and certainly not
ten times as long. Any ideas on how to determine the max more quickly?

Thanks,
Brian
Brian Cox <brian.cox@ca.com> wrote:

> cemdb=# explain select min(ts_id) from
> ts_stats_transet_user_interval a
> where 0=0 and a.ts_interval_start_time >= '2009-6-16 01:00' and
> a.ts_interval_start_time < '2009-6-16 02:00';

> seems like max() shouldn't take any longer than min() and certainly
> not 10 times as long. Any ideas on how to determine the max more
> quickly?

Is there any correlation between ts_id and ts_interval_start_time?

Perhaps if you tried min and max with different time ranges it would
find a row on a backward scan faster. It'll take ten times as long if
it has to scan through ten times as many rows to find a match.

I don't suppose you have an index on ts_interval_start_time? If not,
what happens if you run these queries after adding one?

-Kevin
Kevin Grittner [Kevin.Grittner@wicourts.gov] wrote:

> Is there any correlation between ts_id and ts_interval_start_time?

Only vaguely: increasing ts_interval_start_time implies increasing
ts_id, but there may be many rows (100,000's) with the same
ts_interval_start_time.

> Perhaps if you tried min and max with different time ranges it would
> find a row on a backward scan faster. It'll take ten times as long if
> it has to scan through ten times as many rows to find a match.

It looks like there are fewer rows backwards than forwards:

cemdb=> select count(*) from ts_stats_transet_user_interval
        where ts_interval_start_time < '2009-6-16 01:00';
  count
----------
 32100000
(1 row)

cemdb=> select count(*) from ts_stats_transet_user_interval
        where ts_interval_start_time >= '2009-6-16 02:00';
  count
----------
 13500000
(1 row)

> I don't suppose you have an index on ts_interval_start_time?

There is an index; I mentioned this in my original posting.

Thanks,
Brian
Brian Cox <brian.cox@ca.com> wrote:

> Kevin Grittner [Kevin.Grittner@wicourts.gov] wrote:
>> Is there any correlation between ts_id and ts_interval_start_time?

> only vaguely: increasing ts_interval_start_time implies increasing
> ts_id but there may be many rows (100,000's) with the same
> ts_interval_start_time

>> Perhaps if you tried min and max with different time ranges it
>> would find a row on a backward scan faster. It'll take ten times
>> as long if it has to scan through ten times as many rows to find a
>> match.

> it looks like there are fewer rows backwards than forwards:

Hmmm.... I was going to suggest possible bloat near the end of the
table, but the vacuum and reindex should have kept that from being a
problem. This might be an issue with how the disks are laid out: pages
can be read from start to end quickly, while reading backwards might
cause a lot more rotational delay.

>> I don't suppose you have an index on ts_interval_start_time?

> there is an index. I mentioned this in my original posting.

Sorry I missed that. I was afraid that it might not be used because
PostgreSQL doesn't yet recognize correlations between columns. If it
did, it might determine that the other index was better for this query
(which may or may not be the case).

Could you provide the EXPLAIN ANALYZE output for these queries, so we
can compare expected row counts to actual? Also, what is your
statistics target for these columns (default_statistics_target if you
haven't overridden the specific columns involved)?

I guess you could try something like this, too:

select max(ts_id) from
  (select ts_id from ts_stats_transet_user_interval a
   where 0=0
     and a.ts_interval_start_time >= '2009-6-16 01:00'
     and a.ts_interval_start_time < '2009-6-16 02:00') x;

(Untested, so you might need to fix some typo or oversight.) The
EXPLAIN ANALYZE of that might yield interesting information.
If that doesn't get you closer to something acceptable, you could
consider a functional index on the inverse of the ts_id column, and
search for the negative of the min of that. Kinda ugly, but it might
work because the disk would be spinning in the right direction for
you.

-Kevin
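[For reference, a sketch of what Kevin describes here -- untested, and the index name is made up; this relies on max(x) being equal to -min(-x):]

```sql
-- Hypothetical expression index on the negated ts_id: a forward
-- (ascending) scan of this index visits rows in descending ts_id order.
CREATE INDEX ts_stats_transet_user_interval_neg_id
    ON ts_stats_transet_user_interval ((-ts_id));

-- max(ts_id) rewritten as the negative of min(-ts_id), so the planner
-- can satisfy it with a forward scan of the new index:
SELECT -min(-ts_id)
FROM ts_stats_transet_user_interval a
WHERE a.ts_interval_start_time >= '2009-6-16 01:00'
  AND a.ts_interval_start_time < '2009-6-16 02:00';
```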
> -----Original Message-----
> From: Brian Cox
> Subject: [PERFORM] select max() much slower than select min()
>
> seems like max() shouldn't take any longer than min() and
> certainly not 10 times as long. Any ideas on how to determine
> the max more quickly?

That is odd. Given your row counts, it seems like max() should
actually have to scan fewer rows than min() does.

It might still be bloat in the table, because unless you did VACUUM
FULL there could still be dead rows. A VACUUM VERBOSE would show
whether there is bloat or not.

Also, maybe you could try a two-column index like this:

create index test_index on ts_stats_transet_user_interval
    (ts_interval_start_time, ts_id);

Dave
Brian Cox <brian.cox@ca.com> writes:
> Kevin Grittner [Kevin.Grittner@wicourts.gov] wrote:
>> Is there any correlation between ts_id and ts_interval_start_time?

> only vaguely: increasing ts_interval_start_time implies increasing
> ts_id but there may be many rows (100,000's) with the same
> ts_interval_start_time

That's the problem then. Notice what the query plan is doing: it's
scanning the table in order by ts_id, looking for the first row that
falls within the ts_interval_start_time range. Evidently this
particular range is associated with smaller ts_ids, so you reach it a
lot sooner in a ts_id-ascending scan than in a ts_id-descending one.

Given the estimated size of the range, scanning with the
ts_interval_start_time index wouldn't be much fun either, since it
would have to examine all rows in the range to determine the min or
max ts_id. You could possibly twiddle the cost constants to make the
planner choose that plan instead, but it's still not going to be
exactly speedy.

Some experimentation suggests that it might help to provide a
two-column index on (ts_id, ts_interval_start_time). This is still
going to be scanned in order by ts_id, but it will be possible to
check the ts_interval_start_time condition in the index, eliminating a
large number of useless trips to the heap. Whether this type of query
is important enough to justify maintaining an extra index for is
something you'll have to decide for yourself...

			regards, tom lane
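[For reference, the index Tom suggests would look something like this -- the index name is made up. Note the column order is the opposite of Dave's earlier suggestion: ts_id leads, so min/max scans stay in ts_id order but can reject out-of-range rows from the index alone:]

```sql
-- Hypothetical two-column index, ordered by ts_id first: a min/max scan
-- still walks it in ts_id order, but the ts_interval_start_time filter
-- can be evaluated from the index entries, skipping heap fetches for
-- non-matching rows.
CREATE INDEX ts_stats_transet_user_interval_id_time
    ON ts_stats_transet_user_interval (ts_id, ts_interval_start_time);
```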
On Fri, Jun 19, 2009 at 3:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> That's the problem then. Notice what the query plan is doing: it's
> scanning the table in order by ts_id, looking for the first row that
> falls within the ts_interval_start_time range. Evidently this
> particular range is associated with smaller ts_ids, so you reach it a
> lot sooner in a ts_id ascending scan than a ts_id descending one.
>
> Given the estimated size of the range, scanning with the
> ts_interval_start_time index wouldn't be much fun either, since it
> would have to examine all rows in the range to determine the min or
> max ts_id. You could possibly twiddle the cost constants to make the
> planner choose that plan instead, but it's still not going to be
> exactly speedy.

If your range of ts_interval_start_time is relatively static -- it
doesn't look like it in this case, given that it's only an hour,
but... -- then one option is to create a partial index on ts_id with
the condition "WHERE ts_interval_start_time >= 'foo' AND
ts_interval_start_time < 'bar'". But if your range of times is always
going to vary, then you're going to have a problem there.

There ought to be a way to use GiST to do this, but I don't think we
have any way to combine two different columns of different types in a
single GiST index except as a multicolumn index, which doesn't do what
you want.

--
greg
http://mit.edu/~gsstark/resume.pdf
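[For reference, a sketch of the partial index Greg describes, assuming the hour boundaries are fixed; the index name and the literal timestamps are illustrative only:]

```sql
-- Hypothetical partial index: only rows in the fixed time window are
-- indexed, so min(ts_id)/max(ts_id) over exactly that window become
-- cheap endpoint lookups on a small index.
CREATE INDEX ts_id_for_jun16_0100
    ON ts_stats_transet_user_interval (ts_id)
    WHERE ts_interval_start_time >= '2009-6-16 01:00'
      AND ts_interval_start_time < '2009-6-16 02:00';
```

The catch Greg notes is that the planner will only use this index for queries whose WHERE clause implies the index predicate, so it helps only if the same window is queried repeatedly.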
Tom Lane [tgl@sss.pgh.pa.us] wrote:

> Some experimentation suggests that it might help to provide a 2-column
> index on (ts_id, ts_interval_start_time). This is still going to be
> scanned in order by ts_id, but it will be possible to check the
> ts_interval_start_time condition in the index, eliminating a large
> number of useless trips to the heap. Whether this type of query is
> important enough to justify maintaining an extra index for is something
> you'll have to decide for yourself...

Thanks to all for the analysis and suggestions. Since the number of
rows in an hour is under ~500,000, brute force looks to be a fast
solution:

select ts_id from ... where ts_interval_start_time >= ... and ...

This query runs very fast, as does a single pass through the ids to
find the min and max.

Brian
On Fri, Jun 19, 2009 at 1:05 PM, Brian Cox <brian.cox@ca.com> wrote:
> Thanks to all for the analysis and suggestions. Since the number of
> rows in an hour < ~500,000, brute force looks to be a fast solution:
>
> select ts_id from ... where ts_interval_start_time >= ... and ...
>
> This query runs very fast as does a single pass through the ids to
> find the min and max.

Along those lines, couldn't you just have the DB do the work?

select max(ts_id), min(ts_id) from ...
where ts_interval_start_time >= ... and ...

Then you don't have to transfer 500k ids across the network...

-Dave
David Rees [drees76@gmail.com] wrote:

> Along those lines, couldn't you just have the DB do the work?
>
> select max(ts_id), min(ts_id) from ...
> where ts_interval_start_time >= ... and ...
>
> Then you don't have to transfer 500k ids across the network...

I guess you didn't read the entire thread: I started it because the
query you suggest took 15 minutes to complete.

Brian
On Fri, Jun 19, 2009 at 2:05 PM, Brian Cox <brian.cox@ca.com> wrote:
> David Rees [drees76@gmail.com] wrote:
>>
>> Along those lines, couldn't you just have the DB do the work?
>>
>> select max(ts_id), min(ts_id) from ...
>> where ts_interval_start_time >= ... and ...
>>
>> Then you don't have to transfer 500k ids across the network...
>
> I guess you didn't read the entire thread: I started it because the
> query you suggest took 15 mins to complete.

I read the whole thing and just scanned through it again - I didn't
see any queries where you put both the min and max into the same
query, but perhaps I missed it.

Then again - I don't quite see why your brute force method is any
faster than using a min or max, either. It would be interesting to see
the EXPLAIN ANALYZE output, as apparently scanning on
ts_interval_start_time is a lot faster than scanning the pkey (even
though Tom thought there would not be much difference, since either
way you have to hit the heap a lot).

My thought was that putting both the min and max into the same query
would encourage Pg to use the same index as the brute force method. If
not, you could still put the ts_ids into a temporary table using your
brute force query and use that to avoid the overhead of transferring
500k ids over the network.

-Dave
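[For reference, a sketch of the temporary-table variant David mentions -- untested, with illustrative names and time literals taken from earlier in the thread:]

```sql
-- Hypothetical server-side version of the brute-force approach:
-- materialize the matching ids with the fast range scan, then
-- aggregate in the database so only two values cross the network.
CREATE TEMPORARY TABLE tmp_ts_ids AS
    SELECT ts_id
    FROM ts_stats_transet_user_interval
    WHERE ts_interval_start_time >= '2009-6-16 01:00'
      AND ts_interval_start_time < '2009-6-16 02:00';

SELECT min(ts_id), max(ts_id) FROM tmp_ts_ids;
```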