Thread: Speeding up select distinct

Speeding up select distinct

From
Laurent Martelli
Date:
Consider this query:

SELECT distinct owner from pictures;

 Unique  (cost=361.18..382.53 rows=21 width=4) (actual time=14.197..17.639 rows=21 loops=1)
   ->  Sort  (cost=361.18..371.86 rows=4270 width=4) (actual time=14.188..15.450 rows=4270 loops=1)
         Sort Key: "owner"
         ->  Seq Scan on pictures  (cost=0.00..103.70 rows=4270 width=4) (actual time=0.012..5.795 rows=4270 loops=1)
 Total runtime: 19.147 ms

I thought that 19ms to return 20 rows out of a 4000 rows table so I
added an index:

CREATE INDEX pictures_owner ON pictures (owner);

It gives a slight improvement:

 Unique  (cost=0.00..243.95 rows=21 width=4) (actual time=0.024..10.293 rows=21 loops=1)
   ->  Index Scan using pictures_owner on pictures  (cost=0.00..233.27 rows=4270 width=4) (actual time=0.022..8.227
rows=4270loops=1) 
 Total runtime: 10.369 ms

But still, it's a lot for 20 rows. I looked at other type of indexes,
but they seem to either not give beter perfs or be irrelevant.

Any ideas, apart from more or less manually maintaining a list of
distinct owners in another table ?

--
Laurent Martelli
laurent@aopsys.com                                Java Aspect Components
http://www.aopsys.com/                          http://jac.objectweb.org


Re: Speeding up select distinct

From
PFC
Date:
    Try :

    SELECT owner from pictures group by owner;

> Any ideas, apart from more or less manually maintaining a list of
> distinct owners in another table ?

    That would be a good idea too for normalizing your database.


Re: Speeding up select distinct

From
Rod Taylor
Date:
On Wed, 2005-03-16 at 18:58 +0100, Laurent Martelli wrote:
> Consider this query:
>
> SELECT distinct owner from pictures;

The performance has nothing to do with the number of rows returned, but
rather the complexity of calculations and amount of data to sift through
in order to find it.

> Any ideas, apart from more or less manually maintaining a list of
> distinct owners in another table ?

This would be the proper thing to do, along with adding a foreign key
from pictures to the new owner structure for integrity enforcement.
--


Re: Speeding up select distinct

From
"Merlin Moncure"
Date:
> Consider this query:
>
> SELECT distinct owner from pictures;

[...]
> Any ideas, apart from more or less manually maintaining a list of
> distinct owners in another table ?

you answered your own question.  With a 20 row owners table, you should
be directing your efforts there group by is faster than distinct, but
both are very wasteful and essentially require s full seqscan of the
detail table.

With a little hacking, you can change 'manual maintenance' to 'automatic
maintenance'.

1. create table owner as select distinct owner from pictures;
2. alter table owner add constraint owner_pkey(owner);
3. alter table pictures add constraint ri_picture_owner(owner)
references owner;
4. make a little append_ownder function which adds an owner to the owner
table if there is not already one there. Inline this to your insert
statement on pictures.

Voila!
Merlin
p.s. normalize your data always!

Re: Speeding up select distinct

From
Laurent Martelli
Date:
Wow, what a fast response !!!

>>>>> "PFC" == PFC  <lists@boutiquenumerique.com> writes:

  PFC>     Try :

  PFC>     SELECT owner from pictures group by owner;

That's a slight improvement, but there's still a seq scan on pictures:

 HashAggregate  (cost=114.38..114.38 rows=21 width=4) (actual time=7.585..7.605 rows=21 loops=1)
   ->  Seq Scan on pictures  (cost=0.00..103.70 rows=4270 width=4) (actual time=0.015..3.272 rows=4270 loops=1)
 Total runtime: 7.719 ms




--
Laurent Martelli
laurent@aopsys.com                                Java Aspect Components
http://www.aopsys.com/                          http://jac.objectweb.org


Re: Speeding up select distinct

From
Laurent Martelli
Date:
>>>>> "Rod" == Rod Taylor <pg@rbt.ca> writes:

  Rod> On Wed, 2005-03-16 at 18:58 +0100, Laurent Martelli wrote:
  >> Consider this query:
  >>
  >> SELECT distinct owner from pictures;

  Rod> The performance has nothing to do with the number of rows
  Rod> returned, but rather the complexity of calculations and amount
  Rod> of data to sift through in order to find it.

Yes, but I thought that an index might be able to know what distinct
values there are and help optime that query very much.

--
Laurent Martelli
laurent@aopsys.com                                Java Aspect Components
http://www.aopsys.com/                          http://jac.objectweb.org


Re: Speeding up select distinct

From
Laurent Martelli
Date:
>>>>> "Merlin" == Merlin Moncure <merlin.moncure@rcsonline.com> writes:

  >> Consider this query:
  >>
  >> SELECT distinct owner from pictures;

  Merlin> [...]
  >> Any ideas, apart from more or less manually maintaining a list of
  >> distinct owners in another table ?

  Merlin> you answered your own question.  With a 20 row owners table,
  Merlin> you should be directing your efforts there group by is
  Merlin> faster than distinct, but both are very wasteful and
  Merlin> essentially require s full seqscan of the detail table.

  Merlin> With a little hacking, you can change 'manual maintenance'
  Merlin> to 'automatic maintenance'.

  Merlin> 1. create table owner as select distinct owner from
  Merlin> pictures; 2. alter table owner add constraint
  Merlin> owner_pkey(owner); 3. alter table pictures add constraint
  Merlin> ri_picture_owner(owner) references owner; 4. make a little
  Merlin> append_ownder function which adds an owner to the owner
  Merlin> table if there is not already one there. Inline this to your
  Merlin> insert statement on pictures.

I just wished there was a means to fully automate all this and render
it transparent to the user, just like an index.

  Merlin> Voila!  Merlin p.s. normalize your data always!

I have this:

pictures(
    PictureID serial PRIMARY KEY,
    Owner integer NOT NULL REFERENCES users,
    [...]);
CREATE TABLE users (
    UserID serial PRIMARY KEY,
    Name character varying(255),
    [...]);

Isn't it normalized ?

--
Laurent Martelli
laurent@aopsys.com                                Java Aspect Components
http://www.aopsys.com/                          http://jac.objectweb.org


Re: Speeding up select distinct

From
Rod Taylor
Date:
On Wed, 2005-03-16 at 19:31 +0100, Laurent Martelli wrote:
> >>>>> "Rod" == Rod Taylor <pg@rbt.ca> writes:
>
>   Rod> On Wed, 2005-03-16 at 18:58 +0100, Laurent Martelli wrote:
>   >> Consider this query:
>   >>
>   >> SELECT distinct owner from pictures;
>
>   Rod> The performance has nothing to do with the number of rows
>   Rod> returned, but rather the complexity of calculations and amount
>   Rod> of data to sift through in order to find it.
>
> Yes, but I thought that an index might be able to know what distinct
> values there are and help optime that query very much.

The index does know. You just have to visit all of the pages within the
index to find out, which it does, and that's why you dropped 10ms.

But if you want a sub ms query, you're going to have to normalize the
structure.

--


Re: Speeding up select distinct

From
"Merlin Moncure"
Date:
> I just wished there was a means to fully automate all this and render
> it transparent to the user, just like an index.
>
>   Merlin> Voila!  Merlin p.s. normalize your data always!
>
> I have this:
>
> pictures(
>     PictureID serial PRIMARY KEY,
>     Owner integer NOT NULL REFERENCES users,
>     [...]);
> CREATE TABLE users (
>     UserID serial PRIMARY KEY,
>     Name character varying(255),
>     [...]);
>
> Isn't it normalized ?

try:
select * from users where UserID in (select pictureId from pictures);
select * userid from users intersect select pictureid from pictures;
select distinct userid, [...] from users, pictures where user userid =
pictureid)

if none of these give you what you want then you can solve this with a
new tble, picture_user using the instructions I gave previously.

Not sure if your data is normalized, but ISTM you are over-using
surrogate keys.  It may not be possible, but consider downgrading ID
columns to unique and picking a natural key.  Now you get better benefit
of RI and you can sometimes remove joins from certain queries.

Rule: use natural keys when you can, surrogate keys when you have to.
Corollary: use domains for fields used in referential integrity.

Merlin


Re: Speeding up select distinct

From
Greg Stark
Date:
Laurent Martelli <laurent@aopsys.com> writes:

>   PFC>     SELECT owner from pictures group by owner;
>
> That's a slight improvement, but there's still a seq scan on pictures:

It should be a sequential scan. An index will be slower.

>  HashAggregate  (cost=114.38..114.38 rows=21 width=4) (actual time=7.585..7.605 rows=21 loops=1)
>    ->  Seq Scan on pictures  (cost=0.00..103.70 rows=4270 width=4) (actual time=0.015..3.272 rows=4270 loops=1)
>  Total runtime: 7.719 ms

That's the best plan for this query.

--
greg