Thread: select max() much slower than select min()

select max() much slower than select min()

From: Brian Cox
ts_stats_transet_user_interval has ~48M rows. ts_id is the PK and there
is an index on ts_interval_start_time. I reindexed it and ran vacuum
analyze. Only SELECTs have been done since these operations.

cemdb=# explain select min(ts_id) from ts_stats_transet_user_interval a
where 0=0 and a.ts_interval_start_time >= '2009-6-16 01:00' and
a.ts_interval_start_time < '2009-6-16 02:00';

                               QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Result  (cost=12.19..12.20 rows=1 width=0)
    InitPlan
      ->  Limit  (cost=0.00..12.19 rows=1 width=8)
            ->  Index Scan using ts_stats_transet_user_interval_pkey on
ts_stats_transet_user_interval a  (cost=0.00..5496152.30 rows=450799
width=8)
                  Filter: ((ts_id IS NOT NULL) AND
(ts_interval_start_time >= '2009-06-16 01:00:00-07'::timestamp with time
zone) AND (ts_interval_start_time < '2009-06-16 02:00:00-07'::timestamp
with time zone))
(5 rows)
cemdb=# explain select max(ts_id) from ts_stats_transet_user_interval a
where 0=0 and a.ts_interval_start_time >= '2009-6-16 01:00' and
a.ts_interval_start_time < '2009-6-16 02:00';

                               QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Result  (cost=12.19..12.20 rows=1 width=0)
    InitPlan
      ->  Limit  (cost=0.00..12.19 rows=1 width=8)
            ->  Index Scan Backward using
ts_stats_transet_user_interval_pkey on ts_stats_transet_user_interval a
  (cost=0.00..5496152.30 rows=450799 width=8)
                  Filter: ((ts_id IS NOT NULL) AND
(ts_interval_start_time >= '2009-06-16 01:00:00-07'::timestamp with time
zone) AND (ts_interval_start_time < '2009-06-16 02:00:00-07'::timestamp
with time zone))
(5 rows)
[root@rdl64xeoserv01 log]# time PGPASSWORD=quality psql -U admin -d
cemdb -c "select min(ts_id) from ts_stats_transet_user_interval a where
a.ts_interval_start_time >= '2009-6-16 01:00' and
a.ts_interval_start_time < '2009-6-16 02:00'"
         min
--------------------
  600000000032100000
(1 row)


real    1m32.025s
user    0m0.000s
sys     0m0.003s
[root@rdl64xeoserv01 log]# time PGPASSWORD=quality psql -U admin -d
cemdb -c "select max(ts_id) from ts_stats_transet_user_interval a where
a.ts_interval_start_time >= '2009-6-16 01:00' and
a.ts_interval_start_time < '2009-6-16 02:00'"
         max
--------------------
  600000000032399999
(1 row)


real    16m39.412s
user    0m0.002s
sys     0m0.002s


Seems like max() shouldn't take any longer than min(), and certainly
not 10 times as long. Any ideas on how to determine the max more
quickly?

Thanks,
Brian

Re: select max() much slower than select min()

From: "Kevin Grittner"
Brian Cox <brian.cox@ca.com> wrote:

> cemdb=# explain select min(ts_id) from
> ts_stats_transet_user_interval a
> where 0=0 and a.ts_interval_start_time >= '2009-6-16 01:00' and
> a.ts_interval_start_time < '2009-6-16 02:00';

> seems like max() shouldn't take any longer than min() and certainly
> not 10 times as long. Any ideas on how to determine the max more
> quickly?

Is there any correlation between ts_id and ts_interval_start_time?
Perhaps if you tried min and max with different time ranges it would
find a row on a backward scan faster.  It'll take ten times as long if
it has to scan through ten times as many rows to find a match.
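
For example, purely as an illustration (the hour below is an arbitrary
different range, not one taken from your data):

select max(ts_id) from ts_stats_transet_user_interval a
 where a.ts_interval_start_time >= '2009-6-16 02:00'
   and a.ts_interval_start_time < '2009-6-16 03:00';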

I don't suppose you have an index on ts_interval_start_time?
If not, what happens if you run these queries after adding one?

-Kevin

Re: select max() much slower than select min()

From: Brian Cox
Kevin Grittner [Kevin.Grittner@wicourts.gov] wrote:
> Is there any correlation between ts_id and ts_interval_start_time?
Only vaguely: increasing ts_interval_start_time implies increasing
ts_id, but there may be many rows (hundreds of thousands) with the same
ts_interval_start_time.

> Perhaps if you tried min and max with different time ranges it would
> find a row on a backward scan faster.  It'll take ten times as long if
> it has to scan through ten times as many rows to find a match.
It looks like there are fewer rows backwards than forwards:

cemdb=> select count(*) from ts_stats_transet_user_interval where
ts_interval_start_time < '2009-6-16 01:00';
   count
----------
  32100000
(1 row)

cemdb=> select count(*) from ts_stats_transet_user_interval where
ts_interval_start_time >= '2009-6-16 02:00';
   count
----------
  13500000
(1 row)


> I don't suppose you have an index on ts_interval_start_time?
There is an index; I mentioned this in my original posting.

Thanks,
Brian



Re: select max() much slower than select min()

From: "Kevin Grittner"
Brian Cox <brian.cox@ca.com> wrote:
> Kevin Grittner [Kevin.Grittner@wicourts.gov] wrote:
>> Is there any correlation between ts_id and ts_interval_start_time?
> only vaguely: increasing ts_interval_start_time implies increasing
> ts_id but there may be many rows (100,000's) with the same
> ts_interval_start_time
>
>> Perhaps if you tried min and max with different time ranges it
>> would find a row on a backward scan faster.  It'll take ten times
>> as long if it has to scan through ten times as many rows to find a
>> match.
> it looks like there are fewer rows backwards than forwards:

Hmmm....  I was going to suggest possible bloat near the end of the
table, but the vacuum and reindex should have kept that from being a
problem.

This might be an issue where disks are laid out so that the pages can
be read from start to end quickly; reading backwards might cause a lot
more rotational delay.

>> I don't suppose you have an index on ts_interval_start_time?
> There is an index; I mentioned this in my original posting.

Sorry I missed that.  I was afraid that it might not use it because
PostgreSQL doesn't yet recognize correlations between columns.  If it
did, it might determine that the other index was better for this query
(which may or may not be the case).

Could you provide EXPLAIN ANALYZE output for these queries, so we can
compare expected to actual?  Also, what is your statistics target for
these columns (default_statistics_target if you haven't overridden the
specific columns involved)?

I guess you could try something like this, too:

select max(ts_id) from
  (select ts_id from ts_stats_transet_user_interval a
    where 0=0 and a.ts_interval_start_time >= '2009-6-16 01:00'
      and a.ts_interval_start_time < '2009-6-16 02:00') x;

(Untested, so you might need to fix some typo or oversight.)

The EXPLAIN ANALYZE of that might yield interesting information.

If that doesn't get you closer to something acceptable, you could
consider a functional index on the inverse of the ts_id column, and
search for the negative of the min of that.  Kinda ugly, but it might
work because the disk would be spinning in the right direction for
you.
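
An untested sketch of that idea (the index name is made up):

create index ts_stats_tui_neg_ts_id
  on ts_stats_transet_user_interval ((-ts_id));

select -min(-ts_id) from ts_stats_transet_user_interval a
 where a.ts_interval_start_time >= '2009-6-16 01:00'
   and a.ts_interval_start_time < '2009-6-16 02:00';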

-Kevin

Re: select max() much slower than select min()

From: "Dave Dutcher"
> -----Original Message-----
> From: Brian Cox
> Subject: [PERFORM] select max() much slower than select min()
>

> seems like max() shouldn't take any longer than min() and
> certainly not 10 times as long. Any ideas on how to determine
> the max more quickly?


That is odd.  It seems like max should actually have to scan fewer
rows than min should.  It might still be bloat in the table: unless you
did a VACUUM FULL, the space left by dead rows is not reclaimed.  A
VACUUM VERBOSE would show whether there is bloat or not.  Also, maybe
you could try a two-column index like this:

create index test_index on ts_stats_transet_user_interval
(ts_interval_start_time, ts_id);


Dave



Re: select max() much slower than select min()

From: Tom Lane
Brian Cox <brian.cox@ca.com> writes:
> Kevin Grittner [Kevin.Grittner@wicourts.gov] wrote:
>> Is there any correlation between ts_id and ts_interval_start_time?

> only vaguely: increasing ts_interval_start_time implies increasing ts_id
> but there may be many rows (100,000's) with the same ts_interval_start_time

That's the problem then.  Notice what the query plan is doing: it's
scanning the table in order by ts_id, looking for the first row that
falls within the ts_interval_start_time range.  Evidently this
particular range is associated with smaller ts_ids, so you reach it a
lot sooner in a ts_id ascending scan than a ts_id descending one.

Given the estimated size of the range, scanning with the
ts_interval_start_time index wouldn't be much fun either, since it would
have to examine all rows in the range to determine the min or max ts_id.
You could possibly twiddle the cost constants to make the planner choose
that plan instead, but it's still not going to be exactly speedy.
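
For instance, a session-local experiment along those lines (the value
is just something to play with, not a recommendation):

set random_page_cost = 2;
explain select max(ts_id) from ts_stats_transet_user_interval a
 where a.ts_interval_start_time >= '2009-6-16 01:00'
   and a.ts_interval_start_time < '2009-6-16 02:00';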

Some experimentation suggests that it might help to provide a 2-column
index on (ts_id, ts_interval_start_time).  This is still going to be
scanned in order by ts_id, but it will be possible to check the
ts_interval_start_time condition in the index, eliminating a large
number of useless trips to the heap.  Whether this type of query is
important enough to justify maintaining an extra index for is something
you'll have to decide for yourself...
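
A sketch of such an index (the name is arbitrary):

create index ts_stats_tui_ts_id_start_time
  on ts_stats_transet_user_interval (ts_id, ts_interval_start_time);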

            regards, tom lane

Re: select max() much slower than select min()

From: Greg Stark
On Fri, Jun 19, 2009 at 3:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> That's the problem then.  Notice what the query plan is doing: it's
> scanning the table in order by ts_id, looking for the first row that
> falls within the ts_interval_start_time range.  Evidently this
> particular range is associated with smaller ts_ids, so you reach it a
> lot sooner in a ts_id ascending scan than a ts_id descending one.
>
> Given the estimated size of the range, scanning with the
> ts_interval_start_time index wouldn't be much fun either, since it would
> have to examine all rows in the range to determine the min or max ts_id.
> You could possibly twiddle the cost constants to make the planner choose
> that plan instead, but it's still not going to be exactly speedy.

If your range of ts_interval_start_time is relatively static -- it
doesn't look like it in this case given that's only an hour, but... --
then one option is to create a partial index on "ts_id" with the
condition "WHERE ts_interval_start_time >= 'foo' AND
ts_interval_start_time < 'bar' ".
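
Roughly like this, filling in the range from the original post (the
index name is made up):

create index ts_stats_tui_june16_01
  on ts_stats_transet_user_interval (ts_id)
 where ts_interval_start_time >= '2009-6-16 01:00'
   and ts_interval_start_time < '2009-6-16 02:00';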

But if your range of times is always going to vary then you're going
to have a problem there.

There ought to be a way to use GiST to do this, but I don't think we
have any way to combine two different columns of different types in a
single GiST index except as a multicolumn index, which doesn't do what
you want.

--
greg
http://mit.edu/~gsstark/resume.pdf

Re: select max() much slower than select min()

From: Brian Cox
Tom Lane [tgl@sss.pgh.pa.us] wrote:
> Some experimentation suggests that it might help to provide a 2-column
> index on (ts_id, ts_interval_start_time).  This is still going to be
> scanned in order by ts_id, but it will be possible to check the
> ts_interval_start_time condition in the index, eliminating a large
> number of useless trips to the heap.  Whether this type of query is
> important enough to justify maintaining an extra index for is something
> you'll have to decide for yourself...

Thanks to all for the analysis and suggestions. Since the number of
rows in an hour is under ~500,000, brute force looks to be a fast
solution:

select ts_id from ... where ts_interval_start_time >= ... and ...

This query runs very fast, as does a single pass through the ids to
find the min and max.

Brian

Re: select max() much slower than select min()

From: David Rees
On Fri, Jun 19, 2009 at 1:05 PM, Brian Cox <brian.cox@ca.com> wrote:
> Thanks to all for the analysis and suggestions. Since the number of rows in
> an hour < ~500,000, brute force looks to be a fast solution:
>
> select ts_id from ... where ts_interval_start_time >= ... and ...
>
> This query runs very fast as does a single pass through the ids to find the
> min and max.

Along those lines, couldn't you just have the DB do the work?

select max(ts_id), min(ts_id) from ... where ts_interval_start_time >=
... and ...

Then you don't have to transfer 500k ids across the network...

-Dave

Re: select max() much slower than select min()

From: Brian Cox
David Rees [drees76@gmail.com] wrote:
> Along those lines, couldn't you just have the DB do the work?
>
> select max(ts_id), min(ts_id) from ... where ts_interval_start_time >=
> ... and ...
>
> Then you don't have to transfer 500k ids across the network...
I guess you didn't read the entire thread: I started it because the
query you suggest took 15 mins to complete.

Brian


Re: select max() much slower than select min()

From: David Rees
On Fri, Jun 19, 2009 at 2:05 PM, Brian Cox <brian.cox@ca.com> wrote:
> David Rees [drees76@gmail.com] wrote:
>>
>> Along those lines, couldn't you just have the DB do the work?
>>
>> select max(ts_id), min(ts_id) from ... where ts_interval_start_time >=
>> ... and ...
>>
>> Then you don't have to transfer 500k ids across the network...
>
> I guess you didn't read the entire thread: I started it because the query
> you suggest took 15 mins to complete.

I read the whole thing and just scanned through it again - I didn't
see any queries where you put both the min and max into the same
query, but perhaps I missed it.  Then again - I don't quite see why
your brute force method is any faster than using min or max, either.
It would be interesting to see the EXPLAIN ANALYZE output, as
apparently scanning on ts_interval_start_time is a lot faster than
scanning the pkey (even though Tom thought it wouldn't make much
difference, since either way you have to hit the heap a lot).

My thought was that putting both the min and max into the same query
would encourage Pg to use the same index as the brute force method.
If not, you could still put the ts_ids into a temporary table using
your brute force query and use that to avoid the overhead of
transferring 500k ids over the network.
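
Something along these lines (untested sketch; the temp table name is
made up):

create temporary table hour_ids as
  select ts_id from ts_stats_transet_user_interval
   where ts_interval_start_time >= '2009-6-16 01:00'
     and ts_interval_start_time < '2009-6-16 02:00';

select min(ts_id), max(ts_id) from hour_ids;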

-Dave