Thread: fast DISTINCT or EXIST

fast DISTINCT or EXIST

From

Tilo Buschmann

Date:

07 April 2007, 08:13:00

Hello,

I am trying to build a application to search CDs and their tracks and I
am experiencing some performance difficulties.

The database is very simple at the moment, two tables "cd" and "tracks"
contain the CD-information and their respective tracks. A column
"cd_id" in public.tracks is the foreign key to the cd table.

#v+
                                  Table "public.cd"
   Column    |       Type        |                     Modifiers
-------------+-------------------+----------------------------------------------------
 revision    | integer           | not null default 0
 disc_length | integer           |
 via         | character varying |
 cd_id       | integer           | not null default nextval('cd_cd_id_seq'::regclass)
 discid      | integer           | not null
 title       | character varying | not null
 artist      | character varying | not null
 year        | smallint          |
 genre       | character varying |
 ext         | character varying |
 tstitle     | tsvector          |
 tsartist    | tsvector          |
Indexes:
    "cd_id_key" PRIMARY KEY, btree (cd_id)
    "discid_key" UNIQUE, btree (discid)
    "tsartist_cd_idx" gist (tsartist)
    "tstitle_cd_idx" gist (tstitle)
Check constraints:
    "year_check" CHECK ("year" IS NULL OR "year" >= 0 AND "year" <= 10000)
Tablespace: "d_separate"

                                   Table "public.tracks"
  Column  |       Type        |                         Modifiers
----------+-------------------+-----------------------------------------------------------
 track_id | integer           | not null default nextval('tracks_track_id_seq'::regclass)
 cd_id    | integer           | not null
 title    | character varying |
 artist   | character varying |
 ext      | character varying |
 length   | integer           |
 number   | smallint          | not null default 0
 tstitle  | tsvector          |
 tsartist | tsvector          |
Indexes:
    "tracks_pkey" PRIMARY KEY, btree (track_id)
    "cdid_tracks_idx" btree (cd_id)
    "tsartist_tracks_idx" gist (tsartist)
    "tstitle_tracks_idx" gin (tstitle)
Foreign-key constraints:
    "tracks_cd_id_fkey" FOREIGN KEY (cd_id) REFERENCES cd(cd_id) ON UPDATE RESTRICT ON DELETE RESTRICT
Tablespace: "d_separate"

#v-

I am using tsearch2 to be able to search very fast for CD and track
artists and titles.

The database is created only once and I expect SELECTS to happen very
often, therefore the indexes will not hurt the performance. I also ran
a VACUUM FULL ANALYSE.

The query that I want to optimise at the moment is the "Give me all CDs
with their tracks, that contain a track with the Title 'foobar'". The
query is very expensive, so I try to limit it to 10 cds at once.

My first idea was:

#+
cddb=# EXPLAIN ANALYSE SELECT cd.cd_id,cd.title,cd.artist,tracks.title FROM tracks JOIN (SELECT
cd.cd_id,cd.artist,cd.titleFROM cd JOIN tracks USING (cd_id) WHERE tracks.tstitle @@
plainto_tsquery('simple','education')LIMIT 10) AS cd USING (cd_id); 
                                                                        QUERY PLAN
                                  

----------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..3852.42 rows=11974 width=91) (actual time=310.983..972.739 rows=136 loops=1)
   ->  Limit  (cost=0.00..121.94 rows=10 width=46) (actual time=264.797..650.178 rows=10 loops=1)
         ->  Nested Loop  (cost=0.00..227602.43 rows=18665 width=46) (actual time=264.793..650.165 rows=10 loops=1)
               ->  Index Scan using tstitle_tracks_idx on tracks  (cost=0.00..73402.74 rows=18665 width=4) (actual
time=155.516..155.578rows=10 loops=1) 
                     Index Cond: (tstitle @@ '''education'''::tsquery)
               ->  Index Scan using cd_id_key on cd  (cost=0.00..8.25 rows=1 width=46) (actual time=49.452..49.453
rows=1loops=10) 
                     Index Cond: (public.cd.cd_id = public.tracks.cd_id)
   ->  Index Scan using cdid_tracks_idx on tracks  (cost=0.00..358.08 rows=1197 width=27) (actual time=29.588..32.239
rows=14loops=10) 
         Index Cond: (public.tracks.cd_id = cd.cd_id)
 Total runtime: 972.917 ms
(10 rows)
#v-


The query is fast enough, but erroneous. If a cd contains more than one
track, that matches the condition, the inner SELECT will return more
than one cd and therefore the whole query will shield duplicate cds.

The solution is to either insert DISTINCT into the above query or use
EXISTS as condition, but both queries show a terrible performance:

#v+
cddb=# EXPLAIN ANALYSE SELECT cd.cd_id,cd.title,cd.artist,tracks.title FROM tracks JOIN (SELECT DISTINCT
cd.cd_id,cd.artist,cd.titleFROM cd JOIN tracks USING (cd_id) WHERE tracks.tstitle @@
plainto_tsquery('simple','education')LIMIT 10) AS cd USING (cd_id); 
                                                                           QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=152390.12..156120.71 rows=11974 width=91) (actual time=37356.517..37605.073 rows=137 loops=1)
   ->  Limit  (cost=152390.12..152390.22 rows=10 width=46) (actual time=37289.598..37289.643 rows=10 loops=1)
         ->  Unique  (cost=152390.12..152576.77 rows=18665 width=46) (actual time=37289.594..37289.629 rows=10 loops=1)
               ->  Sort  (cost=152390.12..152436.79 rows=18665 width=46) (actual time=37289.590..37289.601 rows=12
loops=1)
                     Sort Key: public.cd.cd_id, public.cd.artist, public.cd.title
                     ->  Hash Join  (cost=78926.50..151066.02 rows=18665 width=46) (actual time=36214.504..37285.974
rows=811loops=1) 
                           Hash Cond: (public.tracks.cd_id = public.cd.cd_id)
                           ->  Bitmap Heap Scan on tracks  (cost=536.76..59707.31 rows=18665 width=4) (actual
time=0.724..39.253rows=811 loops=1) 
                                 Recheck Cond: (tstitle @@ '''education'''::tsquery)
                                 ->  Bitmap Index Scan on tstitle_tracks_idx  (cost=0.00..532.09 rows=18665 width=0)
(actualtime=0.492..0.492 rows=811 loops=1) 
                                       Index Cond: (tstitle @@ '''education'''::tsquery)
                           ->  Hash  (cost=49111.33..49111.33 rows=1344433 width=46) (actual time=36211.598..36211.598
rows=1344433loops=1) 
                                 ->  Seq Scan on cd  (cost=0.00..49111.33 rows=1344433 width=46) (actual
time=31.094..19813.716rows=1344433 loops=1) 
   ->  Index Scan using cdid_tracks_idx on tracks  (cost=0.00..358.08 rows=1197 width=27) (actual time=31.294..31.527
rows=14loops=10) 
         Index Cond: (public.tracks.cd_id = cd.cd_id)
 Total runtime: 37614.523 ms
(16 rows)

cddb=# EXPLAIN ANALYSE SELECT cd.cd_id,cd.artist,cd.title,tracks.title FROM tracks JOIN (SELECT
cd.cd_id,cd.artist,cd.titleFROM cd WHERE EXISTS (SELECT 1 FROM tracks WHERE tracks.cd_id = cd.cd_id AND tracks.tstitle
@@plainto_tsquery('simple','education')) LIMIT 10) as cd USING (cd_id); 
                                                                    QUERY PLAN
                          

--------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..10023.37 rows=11974 width=91) (actual time=126.607..40853.563 rows=148 loops=1)
   ->  Limit  (cost=0.00..6292.89 rows=10 width=46) (actual time=126.587..40853.072 rows=10 loops=1)
         ->  Seq Scan on cd  (cost=0.00..423018283.46 rows=672216 width=46) (actual time=126.584..40853.035 rows=10
loops=1)
               Filter: (subplan)
               SubPlan
                 ->  Index Scan using cdid_tracks_idx on tracks  (cost=0.00..314.61 rows=1 width=0) (actual
time=1.025..1.025rows=0 loops=39706) 
                       Index Cond: (cd_id = $0)
                       Filter: (tstitle @@ '''education'''::tsquery)
   ->  Index Scan using cdid_tracks_idx on tracks  (cost=0.00..358.08 rows=1197 width=27) (actual time=0.011..0.029
rows=15loops=10) 
         Index Cond: (tracks.cd_id = cd.cd_id)
 Total runtime: 40853.789 ms
(11 rows)
#v-

Rephrasing the EXISTS-query as an IN-query did not help the
performance, either.

I get the impression, that I am blind and cannot find the obvious
solution, do you have any idea how to accomplish, what I am trying?

Best Regards,

Tilo

Re: fast DISTINCT or EXIST

From

Arjen van der Meijden

Date:

07 April 2007, 09:31:47

Can't you use something like this? Or is the distinct on the t.cd_id
still causing the major slowdown here?

SELECT ... FROM cd
   JOIN tracks ...
WHERE cd.id IN (SELECT DISTINCT t.cd_id FROM tracks t
      WHERE t.tstitle @@ plainto_tsquery('simple','education') LIMIT 10)

If that is your main culprit, you could also use two limits based on the
fact that there will be at most X songs per cd which would match your
title (my not very educated guess is 3x). Its a bit ugly... but if that
is what it takes to make postgresql not scan your entire index, so be it...

SELECT ... FROM cd
   JOIN tracks ...
WHERE cd.id IN (SELECT DISTINCT cd_id FROM (SELECT t.cd_id FROM tracks t
      WHERE t.tstitle @@ plainto_tsquery('simple','education') LIMIT 30)
as foo LIMIT 10)


Best regards,

Arjen

On 7-4-2007 12:47 Tilo Buschmann wrote:
> Hello,
>
> I am trying to build a application to search CDs and their tracks and I
> am experiencing some performance difficulties.
>
> The database is very simple at the moment, two tables "cd" and "tracks"
> contain the CD-information and their respective tracks. A column
> "cd_id" in public.tracks is the foreign key to the cd table.
>
> #v+
>                                   Table "public.cd"
>    Column    |       Type        |                     Modifiers
> -------------+-------------------+----------------------------------------------------
>  revision    | integer           | not null default 0
>  disc_length | integer           |
>  via         | character varying |
>  cd_id       | integer           | not null default nextval('cd_cd_id_seq'::regclass)
>  discid      | integer           | not null
>  title       | character varying | not null
>  artist      | character varying | not null
>  year        | smallint          |
>  genre       | character varying |
>  ext         | character varying |
>  tstitle     | tsvector          |
>  tsartist    | tsvector          |
> Indexes:
>     "cd_id_key" PRIMARY KEY, btree (cd_id)
>     "discid_key" UNIQUE, btree (discid)
>     "tsartist_cd_idx" gist (tsartist)
>     "tstitle_cd_idx" gist (tstitle)
> Check constraints:
>     "year_check" CHECK ("year" IS NULL OR "year" >= 0 AND "year" <= 10000)
> Tablespace: "d_separate"
>
>                                    Table "public.tracks"
>   Column  |       Type        |                         Modifiers
> ----------+-------------------+-----------------------------------------------------------
>  track_id | integer           | not null default nextval('tracks_track_id_seq'::regclass)
>  cd_id    | integer           | not null
>  title    | character varying |
>  artist   | character varying |
>  ext      | character varying |
>  length   | integer           |
>  number   | smallint          | not null default 0
>  tstitle  | tsvector          |
>  tsartist | tsvector          |
> Indexes:
>     "tracks_pkey" PRIMARY KEY, btree (track_id)
>     "cdid_tracks_idx" btree (cd_id)
>     "tsartist_tracks_idx" gist (tsartist)
>     "tstitle_tracks_idx" gin (tstitle)
> Foreign-key constraints:
>     "tracks_cd_id_fkey" FOREIGN KEY (cd_id) REFERENCES cd(cd_id) ON UPDATE RESTRICT ON DELETE RESTRICT
> Tablespace: "d_separate"
>
> #v-
>
> I am using tsearch2 to be able to search very fast for CD and track
> artists and titles.
>
> The database is created only once and I expect SELECTS to happen very
> often, therefore the indexes will not hurt the performance. I also ran
> a VACUUM FULL ANALYSE.
>
> The query that I want to optimise at the moment is the "Give me all CDs
> with their tracks, that contain a track with the Title 'foobar'". The
> query is very expensive, so I try to limit it to 10 cds at once.
>
> My first idea was:
>
> #+
> cddb=# EXPLAIN ANALYSE SELECT cd.cd_id,cd.title,cd.artist,tracks.title FROM tracks JOIN (SELECT
cd.cd_id,cd.artist,cd.titleFROM cd JOIN tracks USING (cd_id) WHERE tracks.tstitle @@
plainto_tsquery('simple','education')LIMIT 10) AS cd USING (cd_id); 
>                                                                         QUERY PLAN
                                    
>
----------------------------------------------------------------------------------------------------------------------------------------------------------
>  Nested Loop  (cost=0.00..3852.42 rows=11974 width=91) (actual time=310.983..972.739 rows=136 loops=1)
>    ->  Limit  (cost=0.00..121.94 rows=10 width=46) (actual time=264.797..650.178 rows=10 loops=1)
>          ->  Nested Loop  (cost=0.00..227602.43 rows=18665 width=46) (actual time=264.793..650.165 rows=10 loops=1)
>                ->  Index Scan using tstitle_tracks_idx on tracks  (cost=0.00..73402.74 rows=18665 width=4) (actual
time=155.516..155.578rows=10 loops=1) 
>                      Index Cond: (tstitle @@ '''education'''::tsquery)
>                ->  Index Scan using cd_id_key on cd  (cost=0.00..8.25 rows=1 width=46) (actual time=49.452..49.453
rows=1loops=10) 
>                      Index Cond: (public.cd.cd_id = public.tracks.cd_id)
>    ->  Index Scan using cdid_tracks_idx on tracks  (cost=0.00..358.08 rows=1197 width=27) (actual time=29.588..32.239
rows=14loops=10) 
>          Index Cond: (public.tracks.cd_id = cd.cd_id)
>  Total runtime: 972.917 ms
> (10 rows)
> #v-
>
>
> The query is fast enough, but erroneous. If a cd contains more than one
> track, that matches the condition, the inner SELECT will return more
> than one cd and therefore the whole query will shield duplicate cds.
>
> The solution is to either insert DISTINCT into the above query or use
> EXISTS as condition, but both queries show a terrible performance:
>
> #v+
> cddb=# EXPLAIN ANALYSE SELECT cd.cd_id,cd.title,cd.artist,tracks.title FROM tracks JOIN (SELECT DISTINCT
cd.cd_id,cd.artist,cd.titleFROM cd JOIN tracks USING (cd_id) WHERE tracks.tstitle @@
plainto_tsquery('simple','education')LIMIT 10) AS cd USING (cd_id); 
>                                                                            QUERY PLAN
>
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
>  Nested Loop  (cost=152390.12..156120.71 rows=11974 width=91) (actual time=37356.517..37605.073 rows=137 loops=1)
>    ->  Limit  (cost=152390.12..152390.22 rows=10 width=46) (actual time=37289.598..37289.643 rows=10 loops=1)
>          ->  Unique  (cost=152390.12..152576.77 rows=18665 width=46) (actual time=37289.594..37289.629 rows=10
loops=1)
>                ->  Sort  (cost=152390.12..152436.79 rows=18665 width=46) (actual time=37289.590..37289.601 rows=12
loops=1)
>                      Sort Key: public.cd.cd_id, public.cd.artist, public.cd.title
>                      ->  Hash Join  (cost=78926.50..151066.02 rows=18665 width=46) (actual time=36214.504..37285.974
rows=811loops=1) 
>                            Hash Cond: (public.tracks.cd_id = public.cd.cd_id)
>                            ->  Bitmap Heap Scan on tracks  (cost=536.76..59707.31 rows=18665 width=4) (actual
time=0.724..39.253rows=811 loops=1) 
>                                  Recheck Cond: (tstitle @@ '''education'''::tsquery)
>                                  ->  Bitmap Index Scan on tstitle_tracks_idx  (cost=0.00..532.09 rows=18665 width=0)
(actualtime=0.492..0.492 rows=811 loops=1) 
>                                        Index Cond: (tstitle @@ '''education'''::tsquery)
>                            ->  Hash  (cost=49111.33..49111.33 rows=1344433 width=46) (actual
time=36211.598..36211.598rows=1344433 loops=1) 
>                                  ->  Seq Scan on cd  (cost=0.00..49111.33 rows=1344433 width=46) (actual
time=31.094..19813.716rows=1344433 loops=1) 
>    ->  Index Scan using cdid_tracks_idx on tracks  (cost=0.00..358.08 rows=1197 width=27) (actual time=31.294..31.527
rows=14loops=10) 
>          Index Cond: (public.tracks.cd_id = cd.cd_id)
>  Total runtime: 37614.523 ms
> (16 rows)
>
> cddb=# EXPLAIN ANALYSE SELECT cd.cd_id,cd.artist,cd.title,tracks.title FROM tracks JOIN (SELECT
cd.cd_id,cd.artist,cd.titleFROM cd WHERE EXISTS (SELECT 1 FROM tracks WHERE tracks.cd_id = cd.cd_id AND tracks.tstitle
@@plainto_tsquery('simple','education')) LIMIT 10) as cd USING (cd_id); 
>                                                                     QUERY PLAN
                            
>
--------------------------------------------------------------------------------------------------------------------------------------------------
>  Nested Loop  (cost=0.00..10023.37 rows=11974 width=91) (actual time=126.607..40853.563 rows=148 loops=1)
>    ->  Limit  (cost=0.00..6292.89 rows=10 width=46) (actual time=126.587..40853.072 rows=10 loops=1)
>          ->  Seq Scan on cd  (cost=0.00..423018283.46 rows=672216 width=46) (actual time=126.584..40853.035 rows=10
loops=1)
>                Filter: (subplan)
>                SubPlan
>                  ->  Index Scan using cdid_tracks_idx on tracks  (cost=0.00..314.61 rows=1 width=0) (actual
time=1.025..1.025rows=0 loops=39706) 
>                        Index Cond: (cd_id = $0)
>                        Filter: (tstitle @@ '''education'''::tsquery)
>    ->  Index Scan using cdid_tracks_idx on tracks  (cost=0.00..358.08 rows=1197 width=27) (actual time=0.011..0.029
rows=15loops=10) 
>          Index Cond: (tracks.cd_id = cd.cd_id)
>  Total runtime: 40853.789 ms
> (11 rows)
> #v-
>
> Rephrasing the EXISTS-query as an IN-query did not help the
> performance, either.
>
> I get the impression, that I am blind and cannot find the obvious
> solution, do you have any idea how to accomplish, what I am trying?
>
> Best Regards,
>
> Tilo
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster
>

Re: fast DISTINCT or EXIST

From

Tom Lane

Date:

07 April 2007, 12:54:16

Arjen van der Meijden <acmmailing@tweakers.net> writes:
> If that is your main culprit, you could also use two limits based on the
> fact that there will be at most X songs per cd which would match your
> title (my not very educated guess is 3x). Its a bit ugly... but if that
> is what it takes to make postgresql not scan your entire index, so be it...

> SELECT ... FROM cd
>    JOIN tracks ...
> WHERE cd.id IN (SELECT DISTINCT cd_id FROM (SELECT t.cd_id FROM tracks t
>       WHERE t.tstitle @@ plainto_tsquery('simple','education') LIMIT 30)
> as foo LIMIT 10)

I think that's the only way.  There is no plan type in Postgres that
will generate unique-ified output without scanning the whole input
first, except for Uniq on pre-sorted input, which we can't use here
because the tsearch scan isn't going to deliver the rows in cd_id order.

I can see how to build one: make a variant of HashAggregate that returns
each input row immediately after hashing it, *if* it isn't a duplicate
of one already in the hash table.  But it'd be a lot of work for what
seems a rather specialized need.

            regards, tom lane

Re: fast DISTINCT or EXIST

From

Tilo Buschmann

Date:

07 April 2007, 13:24:19

Hi everyone,

On Sat, 07 Apr 2007 11:54:08 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Arjen van der Meijden <acmmailing@tweakers.net> writes:
> > If that is your main culprit, you could also use two limits based on the
> > fact that there will be at most X songs per cd which would match your
> > title (my not very educated guess is 3x). Its a bit ugly... but if that
> > is what it takes to make postgresql not scan your entire index, so be it...
>
> > SELECT ... FROM cd
> >    JOIN tracks ...
> > WHERE cd.id IN (SELECT DISTINCT cd_id FROM (SELECT t.cd_id FROM tracks t
> >       WHERE t.tstitle @@ plainto_tsquery('simple','education') LIMIT 30)
> > as foo LIMIT 10)
>
> I think that's the only way.  There is no plan type in Postgres that
> will generate unique-ified output without scanning the whole input
> first, except for Uniq on pre-sorted input, which we can't use here
> because the tsearch scan isn't going to deliver the rows in cd_id order.

Unfortunately, the query above will definitely not work correctly, if
someone searches for "a" or "the".

The correct query does not perform as well as I hoped.

#v+
cddb=# EXPLAIN ANALYSE SELECT cd.cd_id,cd.artist,cd.title,tracks.title FROM cd JOIN tracks USING (cd_id) WHERE cd_id IN
(SELECTDISTINCT tracks.cd_id FROM tracks WHERE tracks.tstitle @@ plainto_tsquery('simple','sympathy') LIMIT 10); 
                                                                              QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=61031.41..64906.58 rows=139 width=69) (actual time=31236.562..31810.940 rows=166 loops=1)
   ->  Nested Loop  (cost=61031.41..61176.20 rows=10 width=50) (actual time=31208.649..31388.289 rows=10 loops=1)
         ->  Limit  (cost=61031.41..61089.74 rows=10 width=4) (actual time=31185.972..31186.024 rows=10 loops=1)
               ->  Unique  (cost=61031.41..61124.74 rows=16 width=4) (actual time=31185.967..31186.006 rows=10 loops=1)
                     ->  Sort  (cost=61031.41..61078.07 rows=18665 width=4) (actual time=31185.961..31185.977 rows=11
loops=1)
                           Sort Key: public.tracks.cd_id
                           ->  Bitmap Heap Scan on tracks  (cost=536.76..59707.31 rows=18665 width=4) (actual
time=146.222..30958.057rows=1677 loops=1) 
                                 Recheck Cond: (tstitle @@ '''sympathy'''::tsquery)
                                 ->  Bitmap Index Scan on tstitle_tracks_idx  (cost=0.00..532.09 rows=18665 width=0)
(actualtime=126.328..126.328 rows=1677 loops=1) 
                                       Index Cond: (tstitle @@ '''sympathy'''::tsquery)
         ->  Index Scan using cd_id_key on cd  (cost=0.00..8.62 rows=1 width=46) (actual time=20.218..20.219 rows=1
loops=10)
               Index Cond: (cd.cd_id = "IN_subquery".cd_id)
   ->  Index Scan using cdid_tracks_idx on tracks  (cost=0.00..358.08 rows=1197 width=27) (actual time=39.935..42.247
rows=17loops=10) 
         Index Cond: (cd.cd_id = public.tracks.cd_id)
 Total runtime: 31811.256 ms
(15 rows)
#v-

It gets better when the rows are in memory (down to 10.452 ms), but
Murphy tells me, that the content that I need will never be in memory.

I think I disregarded this variant at first, because it limits the
possibility to restrict the cd artist and title.

> I can see how to build one: make a variant of HashAggregate that returns
> each input row immediately after hashing it, *if* it isn't a duplicate
> of one already in the hash table.  But it'd be a lot of work for what
> seems a rather specialized need.

D'oh.

Actually, I hoped to find an alternative, that does not involve
DISTINCT.

Best Regards,

Tilo

Re: fast DISTINCT or EXIST

From

Tom Lane

Date:

07 April 2007, 13:39:43

Tilo Buschmann <mailinglist.postgresql.performance@b-n-w.org> writes:
>> Arjen van der Meijden <acmmailing@tweakers.net> writes:
>>> SELECT ... FROM cd
>>> JOIN tracks ...
>>> WHERE cd.id IN (SELECT DISTINCT cd_id FROM (SELECT t.cd_id FROM tracks t
>>> WHERE t.tstitle @@ plainto_tsquery('simple','education') LIMIT 30)
>>> as foo LIMIT 10)

> Unfortunately, the query above will definitely not work correctly, if
> someone searches for "a" or "the".

Well, the "incorrectness" is only that it might deliver fewer than the
hoped-for ten CDs ... but that was a completely arbitrary cutoff anyway,
no?  I think in practice this'd give perfectly acceptable results.

> Actually, I hoped to find an alternative, that does not involve
> DISTINCT.

You could try playing around with GROUP BY rather than DISTINCT; those
are separate code paths and will probably give you different plans.
But I don't think you'll find that GROUP BY does any better on this
particular measure of yielding rows before the full input has been
scanned.

            regards, tom lane

Re: fast DISTINCT or EXIST

From

Arjen van der Meijden

Date:

07 April 2007, 14:27:59

On 7-4-2007 18:24 Tilo Buschmann wrote:
> Unfortunately, the query above will definitely not work correctly, if
> someone searches for "a" or "the".

That are two words you may want to consider not searching on at all.

As Tom said, its not very likely to be fixed in PostgreSQL. But you can
always consider using application logic (or a pgpsql function, you could
even use a set returning function to replace the double-limit subselects
in your in-statement) which will automatically fetch more records when
the initial guess turns out to be wrong, obviously using something like
a NOT IN to remove the initially returned cd.id's for the next batches.
Then again, even 'a' or 'the' will not likely be in *all* tracks of a
cd, so you can also use the 'average amount of tracks per cd' (about 10
or 11?) as your multiplier rather than my initial 3. Obviously you'll
loose performance with each increment of that value.

Best regards,

Arjen