Re: fast DISTINCT or EXIST - Mailing list pgsql-performance
From | Tilo Buschmann |
---|---|
Subject | Re: fast DISTINCT or EXIST |
Date | |
Msg-id | 20070407182407.297e7e35@wonderland Whole thread Raw |
In response to | Re: fast DISTINCT or EXIST (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: fast DISTINCT or EXIST
Re: fast DISTINCT or EXIST |
List | pgsql-performance |
Hi everyone, On Sat, 07 Apr 2007 11:54:08 -0400 Tom Lane <tgl@sss.pgh.pa.us> wrote: > Arjen van der Meijden <acmmailing@tweakers.net> writes: > > If that is your main culprit, you could also use two limits based on the > > fact that there will be at most X songs per cd which would match your > > title (my not very educated guess is 3x). Its a bit ugly... but if that > > is what it takes to make postgresql not scan your entire index, so be it... > > > SELECT ... FROM cd > > JOIN tracks ... > > WHERE cd.id IN (SELECT DISTINCT cd_id FROM (SELECT t.cd_id FROM tracks t > > WHERE t.tstitle @@ plainto_tsquery('simple','education') LIMIT 30) > > as foo LIMIT 10) > > I think that's the only way. There is no plan type in Postgres that > will generate unique-ified output without scanning the whole input > first, except for Uniq on pre-sorted input, which we can't use here > because the tsearch scan isn't going to deliver the rows in cd_id order. Unfortunately, the query above will definitely not work correctly, if someone searches for "a" or "the". The correct query does not perform as well as I hoped. #v+ cddb=# EXPLAIN ANALYSE SELECT cd.cd_id,cd.artist,cd.title,tracks.title FROM cd JOIN tracks USING (cd_id) WHERE cd_id IN (SELECTDISTINCT tracks.cd_id FROM tracks WHERE tracks.tstitle @@ plainto_tsquery('simple','sympathy') LIMIT 10); QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- Nested Loop (cost=61031.41..64906.58 rows=139 width=69) (actual time=31236.562..31810.940 rows=166 loops=1) -> Nested Loop (cost=61031.41..61176.20 rows=10 width=50) (actual time=31208.649..31388.289 rows=10 loops=1) -> Limit (cost=61031.41..61089.74 rows=10 width=4) (actual time=31185.972..31186.024 rows=10 loops=1) -> Unique (cost=61031.41..61124.74 rows=16 width=4) (actual time=31185.967..31186.006 rows=10 loops=1) -> Sort (cost=61031.41..61078.07 rows=18665 width=4) (actual time=31185.961..31185.977 rows=11 loops=1) Sort Key: public.tracks.cd_id -> Bitmap Heap Scan on tracks (cost=536.76..59707.31 rows=18665 width=4) (actual time=146.222..30958.057rows=1677 loops=1) Recheck Cond: (tstitle @@ '''sympathy'''::tsquery) -> Bitmap Index Scan on tstitle_tracks_idx (cost=0.00..532.09 rows=18665 width=0) (actualtime=126.328..126.328 rows=1677 loops=1) Index Cond: (tstitle @@ '''sympathy'''::tsquery) -> Index Scan using cd_id_key on cd (cost=0.00..8.62 rows=1 width=46) (actual time=20.218..20.219 rows=1 loops=10) Index Cond: (cd.cd_id = "IN_subquery".cd_id) -> Index Scan using cdid_tracks_idx on tracks (cost=0.00..358.08 rows=1197 width=27) (actual time=39.935..42.247 rows=17loops=10) Index Cond: (cd.cd_id = public.tracks.cd_id) Total runtime: 31811.256 ms (15 rows) #v- It gets better when the rows are in memory (down to 10.452 ms), but Murphy tells me, that the content that I need will never be in memory. I think I disregarded this variant at first, because it limits the possibility to restrict the cd artist and title. > I can see how to build one: make a variant of HashAggregate that returns > each input row immediately after hashing it, *if* it isn't a duplicate > of one already in the hash table. But it'd be a lot of work for what > seems a rather specialized need. D'oh. Actually, I hoped to find an alternative, that does not involve DISTINCT. Best Regards, Tilo
pgsql-performance by date: