Re: Searching for substring with tsearch(1/2) - Mailing list pgsql-hackers

From Teodor Sigaev
Subject Re: Searching for substring with tsearch(1/2)
Date
Msg-id 3FD639A5.9080902@sigaev.ru
Whole thread Raw
In response to Searching for substring with tsearch(1/2)  (Urmo <urmo@xwm.ee>)
Responses Re: Searching for substring with tsearch(1/2)  (Hannu Krosing <hannu@tm.ee>)
List pgsql-hackers

Urmo wrote:
> Hi,
> 
> there seems to be no way of searching partial matches with tsearch. 
> Would it be hard to implement prefix based matching, i.e.
> "hu" matches "human", "humanity", "humming", "huge"? With some hacking I 
> managed to disable morphology part from tsearch1 (database contained 
> multiple languages in a single table so morphology could not be used) 
> and it run happily for a year. But now I needed prefix based substring 
> match and I'm kinda lost. I tried using fulltextindex but it took ages 
> to create the index with 100K table (query run about a day before I lost 
> my pacience and canceled it). I even modified it to lose suffixes and 
> index only full words but it was still too slow (50K records were 
> indexed in 24h).
> 
> Can anybody help or point out right direction? Or is this even (easily) 
> doable with tsearch1 or tsearch2?


Tsearch was never minded as prefix search, and index structure doesn't support 
any kind of prefix or suffix. But you can write extension to tsearch, which will 
search by prefix. But such solution wiil not use index, only sequence scan.
:
Prefix searches easy realized with inverted index, but it require a lot of 
programing.
The simplest way is:
create table invidx (lexeme text not null primary key,        ids[] int
);

where ids[] - array with identificators of documents which contains this word.
So, your custom software may look as follow:
create function getids_by_word(text) returns setof int as ..........;

This function should returns all identificators of docs which contains word with  prefix in argument. So result query
willbe:
 
select * from docs where docs.id in (select * from getids_by_word());

Whats good:
1 efficience of word search
2 Reuse some code from tsearch :) - word parser

Whats bad:
1 update - too slow, some more efficient way is a  bulk upload.
2 If word is frequent then query with 'IN (select * from func()) may works slow...




-- 
Teodor Sigaev                                  E-mail: teodor@sigaev.ru



pgsql-hackers by date:

Previous
From: Sailesh Krishnamurthy
Date:
Subject: Re: PostgreSQL port to pure Java?
Next
From: Teodor Sigaev
Date:
Subject: Re: Searching for substring with tsearch(1/2)