Thread: How to use indexes for GROUP BY

How to use indexes for GROUP BY

From
Dimi Paun
Date:
Hi folks,

I have a table like so:

create table tagRecord (
    uid             varchar(60) primary key,
    [bunch of other fields]
    location      varchar(32),
    creationTS    timestamp
);
create index idx_tagdata_loc_creationTS on tagRecord(location, creationTS);

The number of individual values in location is small (e.g. 2).

I want to simply get the latest "creationTS" for each location,
but that seems to result in a full table scan:

tts_server_db=# explain analyze select location, max(creationTS) from tagrecord group by location;
                                                       QUERY PLAN
 

-------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=5330.53..5330.55 rows=2 width=18) (actual time=286.161..286.165 rows=3 loops=1)
   ->  Seq Scan on tagrecord  (cost=0.00..4771.35 rows=111835 width=18) (actual time=0.059..119.828 rows=111739
loops=1)
 Total runtime: 286.222 ms


Now I have the idx_tagdata_loc_creationTS, and it seemed to me that
it should be able to use it to quickly figure out the max creationTS
for each location.

Any way I can make this more efficient?

BTW, I am using postgresql-server-8.1.22-1.el5_5.1

--
Dimi Paun <dimi@lattica.com>
Lattica, Inc.


Re: How to use indexes for GROUP BY

From
Shaun Thomas
Date:
On 01/24/2011 12:29 PM, Dimi Paun wrote:

> I want to simply get the latest "creationTS" for each location,
> but that seems to result in a full table scan:
>
> tts_server_db=# explain analyze select location, max(creationTS) from
> tagrecord group by location;

Try this, it *might* work:

select DISTINCT ON (location) location, creationTS
   from tagrecord
  order by location, creationTS DESC;

Secondly... Postgresql 8.1? Really? If at all possible, upgrade. There
is a lot you're missing from the last six years of PostgreSQL releases.
For instance, your MAX means a reverse index scan for each location,
which is far more expensive than an ordered index scan, so the planner
may be ignoring it, if the planner in 8.1 is even that intelligent.

If you were running 8.3, for instance, your index could be:

create index idx_tagdata_loc_creationTS on tagRecord(location,
creationTS DESC);

And then suddenly it just has to use the first match for that index for
each location. Older PG versions are... flaky when it comes to
optimization. I'm not sure if 8.1 used MAX as an internal construct or
treated it like a function. If it's the latter, it has to read every
value to find out which is the "max", which is why using ORDER BY *may*
fix your problem.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
sthomas@peak6.com

______________________________________________

See  http://www.peak6.com/email_disclaimer.php
for terms and conditions related to this email

Re: How to use indexes for GROUP BY

From
Scott Marlowe
Date:
On Mon, Jan 24, 2011 at 11:29 AM, Dimi Paun <dimi@lattica.com> wrote:

Two very quick points:

> tts_server_db=# explain analyze select location, max(creationTS) from tagrecord group by location;
>                                                       QUERY PLAN
>
-------------------------------------------------------------------------------------------------------------------------
>  HashAggregate  (cost=5330.53..5330.55 rows=2 width=18) (actual time=286.161..286.165 rows=3 loops=1)
>   ->  Seq Scan on tagrecord  (cost=0.00..4771.35 rows=111835 width=18) (actual time=0.059..119.828 rows=111739
loops=1)
>  Total runtime: 286.222 ms

Most of your run time is the hashaggregate running, not the seq scan

> BTW, I am using postgresql-server-8.1.22-1.el5_5.1

As another poster observed, you're running an ancient version of pgsql
from a performance perspective.  Upgrading to 8.4 or 9.0 would make a
huge difference in overall performance, not just with one or two
queries.

Re: How to use indexes for GROUP BY

From
hubert depesz lubaczewski
Date:
On Mon, Jan 24, 2011 at 01:29:01PM -0500, Dimi Paun wrote:
> Hi folks,
>
> I have a table like so:
>
> create table tagRecord (
>     uid             varchar(60) primary key,
>     [bunch of other fields]
>     location      varchar(32),
>     creationTS    timestamp
> );
> create index idx_tagdata_loc_creationTS on tagRecord(location, creationTS);
>
> The number of individual values in location is small (e.g. 2).
>
> I want to simply get the latest "creationTS" for each location,
> but that seems to result in a full table scan:
>
> tts_server_db=# explain analyze select location, max(creationTS) from tagrecord group by location;
>                                                        QUERY PLAN
   
>
-------------------------------------------------------------------------------------------------------------------------
>  HashAggregate  (cost=5330.53..5330.55 rows=2 width=18) (actual time=286.161..286.165 rows=3 loops=1)
>    ->  Seq Scan on tagrecord  (cost=0.00..4771.35 rows=111835 width=18) (actual time=0.059..119.828 rows=111739
loops=1)
>  Total runtime: 286.222 ms

you can use technique described in here:
http://www.depesz.com/index.php/2009/07/10/getting-list-of-unique-elements/

Best regards,

depesz


Re: How to use indexes for GROUP BY

From
Dimi Paun
Date:
On Mon, 2011-01-24 at 12:33 -0700, Scott Marlowe wrote:
> As another poster observed, you're running an ancient version of pgsql
> from a performance perspective.  Upgrading to 8.4 or 9.0 would make a
> huge difference in overall performance, not just with one or two
> queries.

Thanks for the trips.

I'll try to first upgrade, and I'll report back if that doesn't help :)

--
Dimi Paun <dimi@lattica.com>
Lattica, Inc.