Re: Mnogosearch (Was: Re: website doc search is ... ) - Mailing list pgsql-general
From: Marc G. Fournier
Subject: Re: Mnogosearch (Was: Re: website doc search is ... )
Msg-id: 20040101020755.F62198@ganymede.hub.org
In response to: Re: Mnogosearch (Was: Re: website doc search is ... ) (Bruce Momjian <pgman@candle.pha.pa.us>)
Responses:
    Re: Mnogosearch (Was: Re: website doc search is ... )
    Re: Mnogosearch (Was: Re: website doc search is ... )
    Re: Mnogosearch (Was: Re: website doc search is ... )
List: pgsql-general
On Thu, 1 Jan 2004, Bruce Momjian wrote:

> Marc G. Fournier wrote:
> > 186_archives=# \d ndict7
> >          Table "public.ndict7"
> >  Column  |  Type   |     Modifiers
> > ---------+---------+--------------------
> >  url_id  | integer | not null default 0
> >  word_id | integer | not null default 0
> >  intag   | integer | not null default 0
> > Indexes:
> >     "n7_url" btree (url_id)
> >     "n7_word" btree (word_id)
> >
> > The slowdown is the LIKE condition, as the ndict[78] word_id conditions
> > return near instantly when run individually, and when I run the
> > 'url/LIKE' condition, it takes "forever" ...
>
> Does it help to CLUSTER url.url?  Is your data being loaded in so
> identical values used by LIKE are next to each other?

Just tried CLUSTER, and no difference, but ... chatted with Dave on ICQ
this evening, and was thinking of something ... and it comes back to
something that I mentioned awhile back ...

Taking the ndict8 query that I originally presented, now post-CLUSTER,
an EXPLAIN ANALYZE looks like:

                                                              QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=13918.23..26550.58 rows=17 width=8) (actual time=4053.403..83481.769 rows=13415 loops=1)
   Hash Cond: ("outer".url_id = "inner".rec_id)
   ->  Index Scan using n8_word on ndict8  (cost=0.00..12616.09 rows=3219 width=8) (actual time=113.645..79163.431 rows=15533 loops=1)
         Index Cond: (word_id = 417851441)
   ->  Hash  (cost=13913.31..13913.31 rows=1968 width=4) (actual time=3920.597..3920.597 rows=0 loops=1)
         ->  Seq Scan on url  (cost=0.00..13913.31 rows=1968 width=4) (actual time=3.837..2377.853 rows=304811 loops=1)
               Filter: ((url || ''::text) ~~ 'http://archives.postgresql.org/%%'::text)
 Total runtime: 83578.572 ms
(8 rows)

Now, if I knock off the LIKE, so that I'm returning all rows from ndict8
joined to all the URLs that contain them, you get:

                                                           QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..30183.13 rows=3219 width=8) (actual time=0.299..1217.116 rows=15533 loops=1)
   ->  Index Scan using n8_word on ndict8  (cost=0.00..12616.09 rows=3219 width=8) (actual time=0.144..458.891 rows=15533 loops=1)
         Index Cond: (word_id = 417851441)
   ->  Index Scan using url_rec_id on url  (cost=0.00..5.44 rows=1 width=4) (actual time=0.024..0.029 rows=1 loops=15533)
         Index Cond: (url.rec_id = "outer".url_id)
 Total runtime: 1286.647 ms
(6 rows)

So, there are 15533 URLs that contain that word ... now, what I want to
find out is how many of those 15533 URLs match
'http://archives.postgresql.org/%%', which is 13415 ...

The problem is that right now, we evaluate the LIKE first, giving us
~300k rows, and then search through those for the ones whose word_id
matches ... is there some way of reducing the priority of the LIKE part
of the query, as far as the planner is concerned, so that it will
"resolve" the = condition first, and then apply the LIKE to the
resulting set, instead of the other way around?  So that the query only
checks 15k records for the 13k that match, instead of searching through
300k?

I'm guessing that the reason the LIKE is taking precedence is that the
url table has fewer rows in it than ndict8?

----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664
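[Editor's note: a minimal sketch of the kind of experiment being asked about, using the table and column names from the plans above. `SET enable_seqscan = off` is a standard PostgreSQL planner knob that penalizes sequential scans, which should push the planner toward the nested-loop plan that resolves `word_id = ...` first; whether it actually wins for this mnogosearch query is untested here. Running ANALYZE first is also worth trying, since the planner estimated 1968 rows from the LIKE filter but got 304811, and 3219 rows from ndict8 but got 15533.]

```sql
-- Refresh statistics: the row estimates in the plans above are far off.
ANALYZE url;
ANALYZE ndict8;

-- Temporarily discourage the seq scan over url, so the planner prefers
-- the index-driven nested loop (word_id lookup first, LIKE applied to
-- the ~15k matching rows via the url_rec_id index):
SET enable_seqscan = off;

EXPLAIN ANALYZE
SELECT ndict8.url_id, ndict8.intag
  FROM ndict8, url
 WHERE ndict8.word_id = 417851441
   AND url.rec_id = ndict8.url_id
   AND url.url LIKE 'http://archives.postgresql.org/%%';

-- Restore the default for the rest of the session:
RESET enable_seqscan;
```

`enable_seqscan = off` is a blunt instrument meant for experiments like this, not production settings; if the forced plan is in fact faster, the usual follow-ups are better statistics (higher statistics targets on url.url and ndict8.word_id) or adjusting random_page_cost.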