Re: Mnogosearch (Was: Re: website doc search is ... ) - Mailing list pgsql-general

From: Oleg Bartunov
Subject: Re: Mnogosearch (Was: Re: website doc search is ... )
Date:
Msg-id: Pine.GSO.4.58.0401031751460.11643@ra.sai.msu.su
In response to: Re: Mnogosearch (Was: Re: website doc search is ... )  ("Marc G. Fournier" <scrappy@postgresql.org>)
Responses: Re: Mnogosearch (Was: Re: website doc search is ... )  ("Marc G. Fournier" <scrappy@postgresql.org>)
List: pgsql-general
On Thu, 1 Jan 2004, Marc G. Fournier wrote:

> On Thu, 1 Jan 2004, Bruce Momjian wrote:
>
> > Marc G. Fournier wrote:
> > > 186_archives=# \d ndict7
> > >          Table "public.ndict7"
> > >  Column  |  Type   |     Modifiers
> > > ---------+---------+--------------------
> > >  url_id  | integer | not null default 0
> > >  word_id | integer | not null default 0
> > >  intag   | integer | not null default 0
> > > Indexes:
> > >     "n7_url" btree (url_id)
> > >     "n7_word" btree (word_id)
> > >
> > >
> > > The slowdown is the LIKE condition, as the ndict[78] word_id conditions
> > > return near instantly when run individually, and when I run the 'url/LIKE'
> > > condition, it takes "forever" ...
> >
> > Does it help to CLUSTER url.url?  Is your data being loaded in such a way
> > that identical values used by the LIKE end up next to each other?
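
(For anyone following along, this is roughly what Bruce is suggesting. A
minimal sketch only: the index name url_url is my invention, since I don't
know what the mnogosearch schema actually calls its index on url.url, and
the CLUSTER syntax shown is the 7.x-era form.)

    -- hypothetical index name; cluster the url table physically on its url
    -- column so rows with similar URL prefixes end up next to each other
    CREATE INDEX url_url ON url (url);
    CLUSTER url_url ON url;    -- 7.x syntax: CLUSTER indexname ON tablename
    ANALYZE url;               -- refresh planner statistics after the rewrite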
>
> Just tried CLUSTER, and no difference, but ... chatted with Dave on ICQ
> this evening, and was thinking of something ... and it comes back to
> something that I mentioned a while back ...
>
> Taking the ndict8 query that I originally presented, now post-CLUSTER,
> an explain analyze looks like:
>
>                                                               QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------------------------
>  Hash Join  (cost=13918.23..26550.58 rows=17 width=8) (actual time=4053.403..83481.769 rows=13415 loops=1)
>    Hash Cond: ("outer".url_id = "inner".rec_id)
>    ->  Index Scan using n8_word on ndict8  (cost=0.00..12616.09 rows=3219 width=8) (actual time=113.645..79163.431 rows=15533 loops=1)
>          Index Cond: (word_id = 417851441)
>    ->  Hash  (cost=13913.31..13913.31 rows=1968 width=4) (actual time=3920.597..3920.597 rows=0 loops=1)
>          ->  Seq Scan on url  (cost=0.00..13913.31 rows=1968 width=4) (actual time=3.837..2377.853 rows=304811 loops=1)
>                Filter: ((url || ''::text) ~~ 'http://archives.postgresql.org/%%'::text)
>  Total runtime: 83578.572 ms
> (8 rows)
>
> Now, if I knock off the LIKE, so that I'm returning all rows from ndict8,
> joined to all the URLs that contain them, you get:
>
>                                                             QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------------------------
>  Nested Loop  (cost=0.00..30183.13 rows=3219 width=8) (actual time=0.299..1217.116 rows=15533 loops=1)
>    ->  Index Scan using n8_word on ndict8  (cost=0.00..12616.09 rows=3219 width=8) (actual time=0.144..458.891 rows=15533 loops=1)
>          Index Cond: (word_id = 417851441)
>    ->  Index Scan using url_rec_id on url  (cost=0.00..5.44 rows=1 width=4) (actual time=0.024..0.029 rows=1 loops=15533)
>          Index Cond: (url.rec_id = "outer".url_id)
>  Total runtime: 1286.647 ms
> (6 rows)
>
> So, there are 15533 URLs that contain that word ... now, what I want to
> find out is how many of those 15533 URLs contain
> 'http://archives.postgresql.org/%%', which is 13415 ...

What's the need for such a query? Are you trying to restrict the search to
the archives? Why not just have a site attribute for each document and use
a simple join?
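
Something along these lines is what I mean; just a sketch, and the site_id
column, its index name, and the value 1 are made up for illustration, not
the real mnogosearch schema:

    -- hypothetical: tag each url row with the site it belongs to
    ALTER TABLE url ADD COLUMN site_id integer;
    UPDATE url SET site_id = 1
     WHERE url LIKE 'http://archives.postgresql.org/%';
    CREATE INDEX url_site ON url (site_id);

    -- the search then becomes a plain equality join, no LIKE at run time
    SELECT n.url_id
      FROM ndict8 n
      JOIN url u ON u.rec_id = n.url_id
     WHERE n.word_id = 417851441
       AND u.site_id = 1;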

>
> The problem is that right now, we evaluate the LIKE first, giving us ~300k
> rows, and then search through those for the ones with a matching word
> ... is there some way of reducing the priority of the LIKE part of the
> query, as far as the planner is concerned, so that it will "resolve" the =
> first, and then apply the LIKE to the resultant set, instead of the other
> way around?  That way the query would only be checking 15k records for the
> 13k that match, instead of searching through 300k.
>
> I'm guessing that the reason the LIKE is taking precedence is
> because the url table has fewer rows in it than ndict8?
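
On the planner question above: it is not so much the table sizes as the row
estimate for the LIKE. The planner expects 1968 url rows to match, while the
plan shows 304811 actually do, so it thinks building a hash on that side is
cheap. One way to experiment (a diagnostic sketch, not a fix; the query text
below is reconstructed from the plans above, and mnogosearch really writes
the LIKE as (url || '') ~~ '...%%') is to turn off the seq-scan path for the
session and see whether the planner then drives the join from the ndict8
index scan, applying the LIKE only to the ~15k fetched url rows:

    SET enable_seqscan = off;   -- session-local experiment only

    EXPLAIN ANALYZE
    SELECT n.url_id
      FROM ndict8 n
      JOIN url u ON u.rec_id = n.url_id
     WHERE n.word_id = 417851441
       AND u.url LIKE 'http://archives.postgresql.org/%';

    RESET enable_seqscan;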
>
> ----
> Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
> Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664
>
> ---------------------------(end of broadcast)---------------------------
> TIP 8: explain analyze is your friend
>

    Regards,
        Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
