Re: Mnogosearch (Was: Re: website doc search is ... ) - Mailing list pgsql-general

From Marc G. Fournier
Subject Re: Mnogosearch (Was: Re: website doc search is ... )
Date
Msg-id 20040101020755.F62198@ganymede.hub.org
Whole thread Raw
In response to Re: Mnogosearch (Was: Re: website doc search is ... )  (Bruce Momjian <pgman@candle.pha.pa.us>)
Responses Re: Mnogosearch (Was: Re: website doc search is ... )
Re: Mnogosearch (Was: Re: website doc search is ... )
Re: Mnogosearch (Was: Re: website doc search is ... )
List pgsql-general
On Thu, 1 Jan 2004, Bruce Momjian wrote:

> Marc G. Fournier wrote:
> > 186_archives=# \d ndict7
> >          Table "public.ndict7"
> >  Column  |  Type   |     Modifiers
> > ---------+---------+--------------------
> >  url_id  | integer | not null default 0
> >  word_id | integer | not null default 0
> >  intag   | integer | not null default 0
> > Indexes:
> >     "n7_url" btree (url_id)
> >     "n7_word" btree (word_id)
> >
> >
> > The slowdown is the LIKE condition, as the ndict[78] word_id conditions
> > return near instantly when run individually, and when I run the 'url/LIKE'
> > condition, it takes "forever" ...
>
> Does it help to CLUSTER url.url?  Is your data being loaded in so
> identical values used by LIKE are next to each other?

Just tried CLUSTER, and no difference, but ... chat'd with Dave on ICQ
this evening, and was thinking of something ... and it comes back to
something that I mentioned awhile back ...

Taking the ndict8 query that I originally presented, now post CLUSTER, and
an explain analyze looks like:

                                                              QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=13918.23..26550.58 rows=17 width=8) (actual time=4053.403..83481.769 rows=13415 loops=1)
   Hash Cond: ("outer".url_id = "inner".rec_id)
   ->  Index Scan using n8_word on ndict8  (cost=0.00..12616.09 rows=3219 width=8) (actual time=113.645..79163.431
rows=15533loops=1) 
         Index Cond: (word_id = 417851441)
   ->  Hash  (cost=13913.31..13913.31 rows=1968 width=4) (actual time=3920.597..3920.597 rows=0 loops=1)
         ->  Seq Scan on url  (cost=0.00..13913.31 rows=1968 width=4) (actual time=3.837..2377.853 rows=304811 loops=1)
               Filter: ((url || ''::text) ~~ 'http://archives.postgresql.org/%%'::text)
 Total runtime: 83578.572 ms
(8 rows)

Now, if I knock off the LIKE, so that I'm returning all rows from ndict8,
join'd to all the URLs that contain them, you get:

                                                            QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..30183.13 rows=3219 width=8) (actual time=0.299..1217.116 rows=15533 loops=1)
   ->  Index Scan using n8_word on ndict8  (cost=0.00..12616.09 rows=3219 width=8) (actual time=0.144..458.891
rows=15533loops=1) 
         Index Cond: (word_id = 417851441)
   ->  Index Scan using url_rec_id on url  (cost=0.00..5.44 rows=1 width=4) (actual time=0.024..0.029 rows=1
loops=15533)
         Index Cond: (url.rec_id = "outer".url_id)
 Total runtime: 1286.647 ms
(6 rows)

So, there are 15333 URLs that contain that word ... now, what I want to
find out is how many of those 15333 URLs contain
'http://archives.postgresql.org/%%', which is 13415 ...

The problem is that right now, we look at the LIKE first, giving us ~300k
rows, and then search through those for those who have the word matching
... is there some way of reducing the priority of the LIKE part of the
query, as far as the planner is concerned, so that it will "resolve" the =
first, and then work the LIKE on the resultant set, instead of the other
way around?  So that the query is only checking 15k records for the 13k
that match, instead of searching through 300k?

I'm guessing that the reason that the LIKE is taking precidence(sp?) is
because the URL table has less rows in it then ndict8?

----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664

pgsql-general by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: Mnogosearch (Was: Re: website doc search is ... )
Next
From: Dustin Sallings
Date:
Subject: Re: speeding up inserts