Re: fast DISTINCT or EXIST - Mailing list pgsql-performance

From Tilo Buschmann
Subject Re: fast DISTINCT or EXIST
Date
Msg-id 20070407182407.297e7e35@wonderland
Whole thread Raw
In response to Re: fast DISTINCT or EXIST  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: fast DISTINCT or EXIST
Re: fast DISTINCT or EXIST
List pgsql-performance
Hi everyone,

On Sat, 07 Apr 2007 11:54:08 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Arjen van der Meijden <acmmailing@tweakers.net> writes:
> > If that is your main culprit, you could also use two limits based on the
> > fact that there will be at most X songs per cd which would match your
> > title (my not very educated guess is 3x). Its a bit ugly... but if that
> > is what it takes to make postgresql not scan your entire index, so be it...
>
> > SELECT ... FROM cd
> >    JOIN tracks ...
> > WHERE cd.id IN (SELECT DISTINCT cd_id FROM (SELECT t.cd_id FROM tracks t
> >       WHERE t.tstitle @@ plainto_tsquery('simple','education') LIMIT 30)
> > as foo LIMIT 10)
>
> I think that's the only way.  There is no plan type in Postgres that
> will generate unique-ified output without scanning the whole input
> first, except for Uniq on pre-sorted input, which we can't use here
> because the tsearch scan isn't going to deliver the rows in cd_id order.

Unfortunately, the query above will definitely not work correctly, if
someone searches for "a" or "the".

The correct query does not perform as well as I hoped.

#v+
cddb=# EXPLAIN ANALYSE SELECT cd.cd_id,cd.artist,cd.title,tracks.title FROM cd JOIN tracks USING (cd_id) WHERE cd_id IN
(SELECTDISTINCT tracks.cd_id FROM tracks WHERE tracks.tstitle @@ plainto_tsquery('simple','sympathy') LIMIT 10); 
                                                                              QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=61031.41..64906.58 rows=139 width=69) (actual time=31236.562..31810.940 rows=166 loops=1)
   ->  Nested Loop  (cost=61031.41..61176.20 rows=10 width=50) (actual time=31208.649..31388.289 rows=10 loops=1)
         ->  Limit  (cost=61031.41..61089.74 rows=10 width=4) (actual time=31185.972..31186.024 rows=10 loops=1)
               ->  Unique  (cost=61031.41..61124.74 rows=16 width=4) (actual time=31185.967..31186.006 rows=10 loops=1)
                     ->  Sort  (cost=61031.41..61078.07 rows=18665 width=4) (actual time=31185.961..31185.977 rows=11
loops=1)
                           Sort Key: public.tracks.cd_id
                           ->  Bitmap Heap Scan on tracks  (cost=536.76..59707.31 rows=18665 width=4) (actual
time=146.222..30958.057rows=1677 loops=1) 
                                 Recheck Cond: (tstitle @@ '''sympathy'''::tsquery)
                                 ->  Bitmap Index Scan on tstitle_tracks_idx  (cost=0.00..532.09 rows=18665 width=0)
(actualtime=126.328..126.328 rows=1677 loops=1) 
                                       Index Cond: (tstitle @@ '''sympathy'''::tsquery)
         ->  Index Scan using cd_id_key on cd  (cost=0.00..8.62 rows=1 width=46) (actual time=20.218..20.219 rows=1
loops=10)
               Index Cond: (cd.cd_id = "IN_subquery".cd_id)
   ->  Index Scan using cdid_tracks_idx on tracks  (cost=0.00..358.08 rows=1197 width=27) (actual time=39.935..42.247
rows=17loops=10) 
         Index Cond: (cd.cd_id = public.tracks.cd_id)
 Total runtime: 31811.256 ms
(15 rows)
#v-

It gets better when the rows are in memory (down to 10.452 ms), but
Murphy tells me, that the content that I need will never be in memory.

I think I disregarded this variant at first, because it limits the
possibility to restrict the cd artist and title.

> I can see how to build one: make a variant of HashAggregate that returns
> each input row immediately after hashing it, *if* it isn't a duplicate
> of one already in the hash table.  But it'd be a lot of work for what
> seems a rather specialized need.

D'oh.

Actually, I hoped to find an alternative, that does not involve
DISTINCT.

Best Regards,

Tilo

pgsql-performance by date:

Previous
From: Tom Lane
Date:
Subject: Re: fast DISTINCT or EXIST
Next
From: Tom Lane
Date:
Subject: Re: fast DISTINCT or EXIST