Thread: Searching for substring with tsearch(1/2)

Searching for substring with tsearch(1/2)

From
Urmo
Date:
Hi,

there seems to be no way of searching for partial matches with tsearch. 
Would it be hard to implement prefix-based matching, i.e.
"hu" matches "human", "humanity", "humming", "huge"? With some hacking I 
managed to disable the morphology part of tsearch1 (the database contained 
multiple languages in a single table, so morphology could not be used) 
and it ran happily for a year. But now I need prefix-based substring 
matching and I'm kinda lost. I tried using fulltextindex, but it took ages 
to create the index on a 100K-row table (the query ran for about a day before 
I lost my patience and canceled it). I even modified it to drop suffixes and 
index only full words, but it was still too slow (50K records were 
indexed in 24h).

Can anybody help or point me in the right direction? Or is this even (easily) 
doable with tsearch1 or tsearch2?

Urmo



Re: Searching for substring with tsearch(1/2)

From
Teodor Sigaev
Date:

Urmo wrote:
> Hi,
> 
> there seems to be no way of searching for partial matches with tsearch. 
> Would it be hard to implement prefix-based matching, i.e.
> "hu" matches "human", "humanity", "humming", "huge"? With some hacking I 
> managed to disable the morphology part of tsearch1 (the database contained 
> multiple languages in a single table, so morphology could not be used) 
> and it ran happily for a year. But now I need prefix-based substring 
> matching and I'm kinda lost. I tried using fulltextindex, but it took ages 
> to create the index on a 100K-row table (the query ran for about a day before 
> I lost my patience and canceled it). I even modified it to drop suffixes and 
> index only full words, but it was still too slow (50K records were 
> indexed in 24h).
> 
> Can anybody help or point me in the right direction? Or is this even (easily) 
> doable with tsearch1 or tsearch2?


Tsearch was never meant for prefix search, and the index structure doesn't support 
any kind of prefix or suffix matching. You could write an extension to tsearch that 
searches by prefix, but such a solution will not use the index, only a sequential scan.

Prefix searches are easily realized with an inverted index, but that requires a lot 
of programming. The simplest way is:

create table invidx (
    lexeme text not null primary key,
    ids    int[]
);

where ids is an array of identifiers of the documents that contain this word.
So your custom software may look as follows:

create function getids_by_word(text) returns setof int as ..........;

This function should return the identifiers of all docs that contain a word with the 
prefix given in the argument. So the resulting query will be:

select * from docs where docs.id in (select * from getids_by_word());

What's good:
1. Efficient word search
2. Reuse of some code from tsearch :) - the word parser

What's bad:
1. Updates are too slow; bulk upload is a more efficient way.
2. If a word is frequent, then a query with 'IN (select * from func())' may work slowly...
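The scheme above can be sketched in memory (Python purely for illustration; the sample data and the binary-search helper are hypothetical, only the invidx and getids_by_word names come from the mail):

```python
from bisect import bisect_left

# invidx maps each lexeme to the sorted ids of documents containing it,
# mirroring the invidx(lexeme, ids int[]) table sketched above.
invidx = {
    "huge":     [3, 7],
    "human":    [1, 2, 7],
    "humanity": [2],
    "humming":  [5],
    "state":    [1, 3],
}

# A sorted lexeme list plays the role of the btree on the primary key.
lexemes = sorted(invidx)

def getids_by_word(prefix):
    """Return the set of doc ids that contain a word with this prefix."""
    ids = set()
    # Binary-search to the first lexeme >= prefix, then walk while it matches.
    i = bisect_left(lexemes, prefix)
    while i < len(lexemes) and lexemes[i].startswith(prefix):
        ids.update(invidx[lexemes[i]])
        i += 1
    return ids

print(sorted(getids_by_word("hu")))  # → [1, 2, 3, 5, 7]
```

The prefix lookup is a single range scan over the sorted lexemes, which is exactly what a btree primary key on lexeme gives you inside the database.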




-- 
Teodor Sigaev                                  E-mail: teodor@sigaev.ru



Re: Searching for substring with tsearch(1/2)

From
Teodor Sigaev
Date:
> managed to disable the morphology part of tsearch1 (the database contained 
> multiple languages in a single table, so morphology could not be used) 
BTW, why? How many languages are possible in one record? tsearch V2 allows 
using many dictionaries for each type of lexeme, and you can write your own 
dictionary, which can detect the language and normalize the word...
Look at the default_russian configuration of tsearch2 and read more about tsearch2.



-- 
Teodor Sigaev                                  E-mail: teodor@sigaev.ru



Re: Searching for substring with tsearch(1/2)

From
Hannu Krosing
Date:
Teodor Sigaev kirjutas T, 09.12.2003 kell 23:07:
> Urmo wrote:
> > Hi,
> >
> > there seems to be no way of searching for partial matches with tsearch.
> > Would it be hard to implement prefix-based matching, i.e.
> > "hu" matches "human", "humanity", "humming", "huge"? With some hacking I
> > managed to disable the morphology part of tsearch1 (the database contained
> > multiple languages in a single table, so morphology could not be used)
> > and it ran happily for a year. But now I need prefix-based substring
> > matching and I'm kinda lost. I tried using fulltextindex, but it took ages
> > to create the index on a 100K-row table (the query ran for about a day before
> > I lost my patience and canceled it). I even modified it to drop suffixes and
> > index only full words, but it was still too slow (50K records were
> > indexed in 24h).
> >
> > Can anybody help or point me in the right direction? Or is this even (easily)
> > doable with tsearch1 or tsearch2?

I have done it outside PostgreSQL (using BSDDB 1.85 and Python, about
7-8 years ago) in a manner very much like Teodor describes below.

It was the original web search system for our leading newspaper, Eesti
Päevaleht. It worked without any maintenance for 3-4 years, even after
trashing parts of the index when other processes filled up the disk a few
times, after which it did not always return all the older results ;)

Bulk updates for a day's worth of articles (50-70) took just a few
minutes; a search took about one second including starting up the Python
CGI, which usually took most of that second.

> Tsearch was never meant for prefix search, and the index structure doesn't support
> any kind of prefix or suffix matching. You could write an extension to tsearch that
> searches by prefix, but such a solution will not use the index, only a sequential scan.

How efficient would tsearch be for really big expressions, where 'hu%'
would be expanded (using a btree word index on a one-column word table) to
the tsearch equivalent of ("human" or "humanity" or "humming" or "huge" or
...1000 words here...) before passing the expression to tsearch?
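The expansion step could look roughly like this (Python, hypothetical word list; in the database the range scan would be done by the btree index, and '|' is assumed as the OR operator of the resulting tsearch query):

```python
# One-column "word table"; in the database a btree index on it makes the
# prefix range scan cheap (e.g. WHERE word >= 'hu' AND word < 'hv').
words = ["huge", "human", "humanity", "humming", "state", "statistics"]

def expand_prefix(prefix, words):
    """Expand 'hu%' into the sorted list of full words it matches."""
    return sorted(w for w in words if w.startswith(prefix))

def to_or_query(expansion):
    """Join the expansion into a single ORed query string."""
    return " | ".join(expansion)

print(to_or_query(expand_prefix("hu", words)))  # → huge | human | humanity | humming
```

The open question in the mail is how the index behaves when that expansion contains hundreds or thousands of ORed terms; Teodor answers it below.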

> Prefix searches are easily realized with an inverted index, but that requires a lot
> of programming. The simplest way is:
>
> create table invidx (
>     lexeme text not null primary key,
>     ids    int[]
> );
>
> where ids is an array of identifiers of the documents that contain this word.

How hard (or sensible ;) would creating such an index using GiST be?

As tsearch proves, GiST can cope well with many-to-many indexes.

> So your custom software may look as follows:
> create function getids_by_word(text) returns setof int as ..........;
>
> This function should return the identifiers of all docs that contain a word with
> the prefix given in the argument. So the resulting query will be:
> select * from docs where docs.id in (select * from getids_by_word());
>
> What's good:
> 1. Efficient word search
> 2. Reuse of some code from tsearch :) - the word parser
>
> What's bad:
> 1. Updates are too slow; bulk upload is a more efficient way.

or some hybrid of bulk and update, perhaps with a table structure like

create table invidx (
    lexeme   text not null,
    textdate date not null,
    ids      int[],
    primary key (lexeme, textdate)
);

which would partition the invidx table on textdate (or some other
suitable datum).
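A rough sketch of how query-time merging over such date partitions could work (Python, hypothetical data; the point is that each (lexeme, textdate) row stays small, and a bulk load only touches one date's rows):

```python
# Each (lexeme, textdate) pair holds its own small id array, mirroring the
# composite primary key (lexeme, textdate) in the table above.
invidx = {
    ("human", "2003-12-09"): [1, 2],
    ("human", "2003-12-10"): [7],
    ("huge",  "2003-12-10"): [3],
}

def getids(lexeme):
    """Merge the per-date id arrays for one lexeme into a single result set."""
    ids = set()
    for (lex, _date), doc_ids in invidx.items():
        if lex == lexeme:
            ids.update(doc_ids)
    return ids

print(sorted(getids("human")))  # → [1, 2, 7]
```

Updates for a new day then append fresh (lexeme, date) rows instead of rewriting ever-growing arrays, which is the hybrid of bulk and update being suggested.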

> 2. If a word is frequent, then a query with 'IN (select * from func())' may work slowly...

If it is often too slow, then creating a temp table and doing a plain
join may be faster.

-------------
Hannu





Re: Searching for substring with tsearch(1/2)

From
Teodor Sigaev
Date:
>>Tsearch was never meant for prefix search, and the index structure doesn't support 
>>any kind of prefix or suffix matching. You could write an extension to tsearch that 
>>searches by prefix, but such a solution will not use the index, only a sequential scan.
> 
> 
> How efficient would tsearch be for really big expressions, where 'hu%'
> would be expanded (using a btree word index on a one-column word table) to
> the tsearch equivalent of ("human" or "humanity" or "humming" or "huge" or
> ...1000 words here...) before passing the expression to tsearch?

The GiST index of tsearch doesn't support prefix search, so it will work only by 
seqscan, as we know :) Disk is much slower than processor, so speed will be 
limited by disk.


>>Prefix searches are easily realized with an inverted index, but that requires a lot 
>>of programming. The simplest way is:
>>
>>create table invidx (
>>    lexeme text not null primary key,
>>    ids    int[]
>>);
>>
>>where ids is an array of identifiers of the documents that contain this word.
> 
> 
> How hard (or sensible ;) would creating such an index using GiST be?
> As tsearch proves, GiST can cope well with many-to-many indexes.

Sorry, I don't understand. Do you mean that GiST supports one heap tuple in 
several index tuples? If yes, then no :). GiST doesn't support this feature. I 
don't think GiST can help in this situation.



> create table invidx (
>     lexeme   text not null,
>     textdate date not null,
>     ids      int[],
>     primary key (lexeme, textdate)
> );
> 
> which would partition the invidx table on textdate (or some other
> suitable datum)
> 
> 
>>2. If a word is frequent, then a query with 'IN (select * from func())' may work slowly...
> If it is often too slow, then creating a temp table and doing a plain
> join may be faster.
A table structure like invidx above decreases this problem.
-- 
Teodor Sigaev                                  E-mail: teodor@sigaev.ru



Re: Searching for substring with tsearch(1/2)

From
Hannu Krosing
Date:
Teodor Sigaev kirjutas K, 10.12.2003 kell 11:20:
> >>Tsearch was never meant for prefix search, and the index structure doesn't support 
> >>any kind of prefix or suffix matching. You could write an extension to tsearch that 
> >>searches by prefix, but such a solution will not use the index, only a sequential scan.
> > 
> > 
> > How efficient would tsearch be for really big expressions, where 'hu%'
> > would be expanded (using a btree word index on a one-column word table) to
> > the tsearch equivalent of ("human" or "humanity" or "humming" or "huge" or
> > ...1000 words here...) before passing the expression to tsearch?
> 
> The GiST index of tsearch doesn't support prefix search, so it will work only by 
> seqscan, as we know :) Disk is much slower than processor, so speed will be 
> limited by disk.

I meant that the expansion of 'hu%' is done before and outside of
tsearch, so the question is how efficient tsearch will be when searching
for hundreds or thousands of words in one expression.


> > How hard (or sensible ;) would creating such an index using GiST be?
> > As tsearch proves, GiST can cope well with many-to-many indexes.
> 
> Sorry, I don't understand. Do you mean that GiST supports one heap tuple in 
> several index tuples? If yes, then no :). GiST doesn't support this feature. I 
> don't think GiST can help in this situation.

But tsearch seems to support this, and tsearch uses GiST. Is this
functionality added entirely by tsearch?

--------------
Hannu



Re: Searching for substring with tsearch(1/2)

From
Teodor Sigaev
Date:
> I meant that the expansion of 'hu%' is done before and outside of
> tsearch, so the question is how efficient tsearch will be when searching
> for hundreds or thousands of words in one expression.

Ok, I see. The answer: bad. The index structure is a signature tree with a 
constant signature length, by default 2016 bits. The signature is made by 
hashing each word and setting bit number HASHVAL % 2016 to 1. So, if the query 
has many terms and all terms are ORed, then a lot of signatures are matched by 
the query. This means a lot of pages in the index will be read.
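The signature scheme described above can be sketched like this (Python; the 2016-bit default comes from the mail, while the concrete hash function is just a stand-in for tsearch's):

```python
import zlib

SIGLEN = 2016  # signature length in bits; tsearch's default per the mail

def signature(words):
    """Hash each word and set bit HASHVAL % SIGLEN, as described above."""
    sig = 0
    for w in words:
        bit = zlib.crc32(w.encode()) % SIGLEN  # stand-in for tsearch's hash
        sig |= 1 << bit
    return sig

def may_match(doc_sig, query_sig):
    """A doc can match only if every query bit is set (false positives possible)."""
    return doc_sig & query_sig == query_sig

doc = signature(["human", "rights", "law"])
print(may_match(doc, signature(["human"])))  # → True
```

With many ORed terms the query lights up many bits, so many signatures pass the filter and many index pages must be read, which is the behaviour Teodor predicts.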


>>>How hard (or sensible ;) would creating such an index using GiST be?
>>>As tsearch proves, GiST can cope well with many-to-many indexes.
>>
>>Sorry, I don't understand. Do you mean that GiST supports one heap tuple in 
>>several index tuples? If yes, then no :). GiST doesn't support this feature. I 
>>don't think GiST can help in this situation.
> 
> 
> But tsearch seems to support this, and tsearch uses GiST. Is this
> functionality added entirely by tsearch?
No, one heap tuple - one index tuple.

I'll try to explain the index structure used by tsearch (three levels, just for example):

Root page
    internal tuple 1 -> second level page 1
                            internal tuple 1.1 -> ...
                            internal tuple 1.2 -> ...
    internal tuple 2 -> second level page 2
                            internal tuple 2.1 -> ...
                            internal tuple 2.2 -> third level (leaf) page 2.2
                                                      leaf tuple 2.2.1 -> heap tuple
                                                      leaf tuple 2.2.2 -> heap tuple

A leaf tuple contains one of two types of predicates:
1. just the lexemes (without position information)
2. if the stored size of the first type is too big, then the tuple
   stores a signature as described above.

An internal tuple contains the ORed (superimposed) signatures of its children.
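That superimposition is just a bitwise OR of the child signatures, so an internal tuple matches whenever some descendant might; a tiny sketch under the same hypothetical signature scheme:

```python
def superimpose(child_sigs):
    """An internal tuple's signature is the bitwise OR of its children's."""
    sig = 0
    for s in child_sigs:
        sig |= s
    return sig

# Two leaf signatures with disjoint bits set...
leaf1 = (1 << 5) | (1 << 100)
leaf2 = (1 << 7) | (1 << 2000)
parent = superimpose([leaf1, leaf2])

# ...the parent's signature matches any query that either leaf would match.
query = 1 << 7
print(parent & query == query)  # → True (via leaf2)
```

This is why the tree degrades on heavily ORed queries: superimposed signatures at the upper levels match almost everything, forcing descent into many subtrees.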



-- 
Teodor Sigaev                                  E-mail: teodor@sigaev.ru