Thread: Speeding up select distinct
Consider this query: SELECT distinct owner from pictures; Unique (cost=361.18..382.53 rows=21 width=4) (actual time=14.197..17.639 rows=21 loops=1) -> Sort (cost=361.18..371.86 rows=4270 width=4) (actual time=14.188..15.450 rows=4270 loops=1) Sort Key: "owner" -> Seq Scan on pictures (cost=0.00..103.70 rows=4270 width=4) (actual time=0.012..5.795 rows=4270 loops=1) Total runtime: 19.147 ms I thought that 19ms to return 20 rows out of a 4000 rows table so I added an index: CREATE INDEX pictures_owner ON pictures (owner); It gives a slight improvement: Unique (cost=0.00..243.95 rows=21 width=4) (actual time=0.024..10.293 rows=21 loops=1) -> Index Scan using pictures_owner on pictures (cost=0.00..233.27 rows=4270 width=4) (actual time=0.022..8.227 rows=4270loops=1) Total runtime: 10.369 ms But still, it's a lot for 20 rows. I looked at other type of indexes, but they seem to either not give beter perfs or be irrelevant. Any ideas, apart from more or less manually maintaining a list of distinct owners in another table ? -- Laurent Martelli laurent@aopsys.com Java Aspect Components http://www.aopsys.com/ http://jac.objectweb.org
Try : SELECT owner from pictures group by owner; > Any ideas, apart from more or less manually maintaining a list of > distinct owners in another table ? That would be a good idea too for normalizing your database.
On Wed, 2005-03-16 at 18:58 +0100, Laurent Martelli wrote: > Consider this query: > > SELECT distinct owner from pictures; The performance has nothing to do with the number of rows returned, but rather the complexity of calculations and amount of data to sift through in order to find it. > Any ideas, apart from more or less manually maintaining a list of > distinct owners in another table ? This would be the proper thing to do, along with adding a foreign key from pictures to the new owner structure for integrity enforcement. --
> Consider this query: > > SELECT distinct owner from pictures; [...] > Any ideas, apart from more or less manually maintaining a list of > distinct owners in another table ? you answered your own question. With a 20 row owners table, you should be directing your efforts there group by is faster than distinct, but both are very wasteful and essentially require s full seqscan of the detail table. With a little hacking, you can change 'manual maintenance' to 'automatic maintenance'. 1. create table owner as select distinct owner from pictures; 2. alter table owner add constraint owner_pkey(owner); 3. alter table pictures add constraint ri_picture_owner(owner) references owner; 4. make a little append_ownder function which adds an owner to the owner table if there is not already one there. Inline this to your insert statement on pictures. Voila! Merlin p.s. normalize your data always!
Wow, what a fast response !!! >>>>> "PFC" == PFC <lists@boutiquenumerique.com> writes: PFC> Try : PFC> SELECT owner from pictures group by owner; That's a slight improvement, but there's still a seq scan on pictures: HashAggregate (cost=114.38..114.38 rows=21 width=4) (actual time=7.585..7.605 rows=21 loops=1) -> Seq Scan on pictures (cost=0.00..103.70 rows=4270 width=4) (actual time=0.015..3.272 rows=4270 loops=1) Total runtime: 7.719 ms -- Laurent Martelli laurent@aopsys.com Java Aspect Components http://www.aopsys.com/ http://jac.objectweb.org
>>>>> "Rod" == Rod Taylor <pg@rbt.ca> writes: Rod> On Wed, 2005-03-16 at 18:58 +0100, Laurent Martelli wrote: >> Consider this query: >> >> SELECT distinct owner from pictures; Rod> The performance has nothing to do with the number of rows Rod> returned, but rather the complexity of calculations and amount Rod> of data to sift through in order to find it. Yes, but I thought that an index might be able to know what distinct values there are and help optime that query very much. -- Laurent Martelli laurent@aopsys.com Java Aspect Components http://www.aopsys.com/ http://jac.objectweb.org
>>>>> "Merlin" == Merlin Moncure <merlin.moncure@rcsonline.com> writes: >> Consider this query: >> >> SELECT distinct owner from pictures; Merlin> [...] >> Any ideas, apart from more or less manually maintaining a list of >> distinct owners in another table ? Merlin> you answered your own question. With a 20 row owners table, Merlin> you should be directing your efforts there group by is Merlin> faster than distinct, but both are very wasteful and Merlin> essentially require s full seqscan of the detail table. Merlin> With a little hacking, you can change 'manual maintenance' Merlin> to 'automatic maintenance'. Merlin> 1. create table owner as select distinct owner from Merlin> pictures; 2. alter table owner add constraint Merlin> owner_pkey(owner); 3. alter table pictures add constraint Merlin> ri_picture_owner(owner) references owner; 4. make a little Merlin> append_ownder function which adds an owner to the owner Merlin> table if there is not already one there. Inline this to your Merlin> insert statement on pictures. I just wished there was a means to fully automate all this and render it transparent to the user, just like an index. Merlin> Voila! Merlin p.s. normalize your data always! I have this: pictures( PictureID serial PRIMARY KEY, Owner integer NOT NULL REFERENCES users, [...]); CREATE TABLE users ( UserID serial PRIMARY KEY, Name character varying(255), [...]); Isn't it normalized ? -- Laurent Martelli laurent@aopsys.com Java Aspect Components http://www.aopsys.com/ http://jac.objectweb.org
On Wed, 2005-03-16 at 19:31 +0100, Laurent Martelli wrote: > >>>>> "Rod" == Rod Taylor <pg@rbt.ca> writes: > > Rod> On Wed, 2005-03-16 at 18:58 +0100, Laurent Martelli wrote: > >> Consider this query: > >> > >> SELECT distinct owner from pictures; > > Rod> The performance has nothing to do with the number of rows > Rod> returned, but rather the complexity of calculations and amount > Rod> of data to sift through in order to find it. > > Yes, but I thought that an index might be able to know what distinct > values there are and help optime that query very much. The index does know. You just have to visit all of the pages within the index to find out, which it does, and that's why you dropped 10ms. But if you want a sub ms query, you're going to have to normalize the structure. --
> I just wished there was a means to fully automate all this and render > it transparent to the user, just like an index. > > Merlin> Voila! Merlin p.s. normalize your data always! > > I have this: > > pictures( > PictureID serial PRIMARY KEY, > Owner integer NOT NULL REFERENCES users, > [...]); > CREATE TABLE users ( > UserID serial PRIMARY KEY, > Name character varying(255), > [...]); > > Isn't it normalized ? try: select * from users where UserID in (select pictureId from pictures); select * userid from users intersect select pictureid from pictures; select distinct userid, [...] from users, pictures where user userid = pictureid) if none of these give you what you want then you can solve this with a new tble, picture_user using the instructions I gave previously. Not sure if your data is normalized, but ISTM you are over-using surrogate keys. It may not be possible, but consider downgrading ID columns to unique and picking a natural key. Now you get better benefit of RI and you can sometimes remove joins from certain queries. Rule: use natural keys when you can, surrogate keys when you have to. Corollary: use domains for fields used in referential integrity. Merlin
Laurent Martelli <laurent@aopsys.com> writes: > PFC> SELECT owner from pictures group by owner; > > That's a slight improvement, but there's still a seq scan on pictures: It should be a sequential scan. An index will be slower. > HashAggregate (cost=114.38..114.38 rows=21 width=4) (actual time=7.585..7.605 rows=21 loops=1) > -> Seq Scan on pictures (cost=0.00..103.70 rows=4270 width=4) (actual time=0.015..3.272 rows=4270 loops=1) > Total runtime: 7.719 ms That's the best plan for this query. -- greg