Thread: Searching for substring with tsearch (1/2)
Hi,

there seems to be no way of searching for partial matches with tsearch. Would it be hard to implement prefix-based matching, i.e. "hu" matches "human", "humanity", "humming", "huge"? With some hacking I managed to disable the morphology part of tsearch1 (the database contained multiple languages in a single table, so morphology could not be used) and it ran happily for a year. But now I need a prefix-based substring match and I'm kind of lost. I tried using fulltextindex, but it took ages to create the index on a 100K-row table (the query ran for about a day before I lost my patience and cancelled it). I even modified it to drop suffixes and index only full words, but it was still too slow (50K records were indexed in 24h).

Can anybody help or point out the right direction? Or is this even (easily) doable with tsearch1 or tsearch2?

Urmo
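[For concreteness, the behaviour being asked for looks like this in plain SQL; the docs table and body column are hypothetical names, and a query in this form cannot use a tsearch index, so it degenerates to a sequential scan:

    -- Match any document containing a word that starts with 'hu'.
    -- \m is PostgreSQL's start-of-word escape in regular expressions.
    select id, body from docs where body ~* '\mhu';
]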
Urmo wrote:
> Hi,
>
> there seems to be no way of searching for partial matches with tsearch.
> Would it be hard to implement prefix-based matching, i.e. "hu" matches
> "human", "humanity", "humming", "huge"? With some hacking I managed to
> disable the morphology part of tsearch1 (the database contained multiple
> languages in a single table, so morphology could not be used) and it ran
> happily for a year. But now I need a prefix-based substring match and
> I'm kind of lost. I tried using fulltextindex, but it took ages to
> create the index on a 100K-row table (the query ran for about a day
> before I lost my patience and cancelled it). I even modified it to drop
> suffixes and index only full words, but it was still too slow (50K
> records were indexed in 24h).
>
> Can anybody help or point out the right direction? Or is this even
> (easily) doable with tsearch1 or tsearch2?

Tsearch was never intended for prefix search, and the index structure doesn't support any kind of prefix or suffix matching. But you can write an extension to tsearch which will search by prefix. Such a solution will not use the index, though, only a sequential scan.

Prefix searches are easily realized with an inverted index, but that requires a lot of programming. The simplest way is:

create table invidx (
    lexeme text not null primary key,
    ids int[]
);

where ids is an array with the identifiers of the documents that contain this word. So your custom software may look as follows:

create function getids_by_word(text) returns setof int as ..........;

This function should return the identifiers of all docs that contain a word with the prefix given in the argument. So the resulting query will be:

select * from docs where docs.id in (select * from getids_by_word('hu'));

What's good:
1. efficiency of the word search
2. reuse of some code from tsearch :) - the word parser

What's bad:
1. updates are too slow; a bulk upload is a more efficient way
2. if the word is frequent, then the query with 'IN (select * from func())' may work slowly...

--
Teodor Sigaev
E-mail: teodor@sigaev.ru
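[A fleshed-out sketch of the outline above, with hypothetical names and using unnest() from later PostgreSQL releases (8.4+), so treat it as illustrative rather than a drop-in implementation:

    create table invidx (
        lexeme text not null primary key,
        ids    int[]   -- identifiers of documents containing this lexeme
    );

    -- Return the ids of all documents containing a word with the given prefix.
    create function getids_by_word(prefix text) returns setof int as $$
        select distinct d.id
        from invidx, unnest(ids) as d(id)
        where lexeme like prefix || '%';
    $$ language sql;

    select * from docs where docs.id in (select * from getids_by_word('hu'));

Note that the LIKE prefix scan can use the btree primary-key index only in the C locale, or with an additional index built with text_pattern_ops.]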
> managed to disable the morphology part of tsearch1 (the database contained
> multiple languages in a single table, so morphology could not be used)

BTW, why? How many languages can there be in one record? tsearch V2 allows the use of many dictionaries for each type of lexeme, and you can write your own dictionary, which can detect the language and normalize the word... Look at the default_russian configuration of tsearch2 and read more about tsearch2.

--
Teodor Sigaev
E-mail: teodor@sigaev.ru
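[For readers on later releases: the multi-dictionary idea described here survives in the in-core text search that grew out of tsearch2 (PostgreSQL 8.3+). A hedged sketch with illustrative names; each token type is mapped to a chain of dictionaries, and the first dictionary that recognizes a word wins, which is what makes per-word language handling possible:

    create text search configuration multilang (copy = simple);
    alter text search configuration multilang
        alter mapping for asciiword, word
        with english_stem, russian_stem, simple;

    -- Each word is normalized by the first dictionary that accepts it.
    select to_tsvector('multilang', 'humming humanity');
]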
Teodor Sigaev wrote on Tue, 09.12.2003 at 23:07:
> Urmo wrote:
> > Hi,
> >
> > there seems to be no way of searching for partial matches with tsearch.
> > Would it be hard to implement prefix-based matching, i.e. "hu" matches
> > "human", "humanity", "humming", "huge"? With some hacking I managed to
> > disable the morphology part of tsearch1 (the database contained multiple
> > languages in a single table, so morphology could not be used) and it ran
> > happily for a year. But now I need a prefix-based substring match and
> > I'm kind of lost. I tried using fulltextindex, but it took ages to
> > create the index on a 100K-row table (the query ran for about a day
> > before I lost my patience and cancelled it). I even modified it to drop
> > suffixes and index only full words, but it was still too slow (50K
> > records were indexed in 24h).
> >
> > Can anybody help or point out the right direction? Or is this even
> > (easily) doable with tsearch1 or tsearch2?

I have done this outside PostgreSQL (using BSDDB 1.85 and Python, about 7-8 years ago) in a manner very much like Teodor describes below. It was the original web search system for our leading newspaper, Eesti Päevaleht. It worked without any maintenance for 3-4 years, even after trashing parts of the index the few times other processes filled up the disk, after which it did not always return all the older results ;)

Bulk updates for a day's worth of articles (50-70) took just a few minutes; a search took about one second, including starting up the Python CGI, which usually accounted for most of that second.

> Tsearch was never intended for prefix search, and the index structure
> doesn't support any kind of prefix or suffix matching. But you can write
> an extension to tsearch which will search by prefix. Such a solution
> will not use the index, though, only a sequential scan.

How efficient would tsearch be for really big expressions, where 'hu%' would be expanded (using a btree index on a one-column word table) to the tsearch equivalent of ("human" or "humanity" or "humming" or "huge" or ...1000 words here...) before passing the expression to tsearch? (See the sketch below.)

> Prefix searches are easily realized with an inverted index, but that
> requires a lot of programming. The simplest way is:
> create table invidx (
>     lexeme text not null primary key,
>     ids int[]
> );
>
> where ids is an array with the identifiers of the documents that contain
> this word.

How hard (or sensible ;) would it be to create such an index using GiST? As tsearch proves, GiST can cope well with many-to-many indexes.

> So your custom software may look as follows:
> create function getids_by_word(text) returns setof int as ..........;
>
> This function should return the identifiers of all docs that contain a
> word with the prefix given in the argument. So the resulting query will be:
> select * from docs where docs.id in (select * from getids_by_word('hu'));
>
> What's good:
> 1. efficiency of the word search
> 2. reuse of some code from tsearch :) - the word parser
>
> What's bad:
> 1. updates are too slow; a bulk upload is a more efficient way

or some hybrid of bulk and update, perhaps with a table structure like

create table invidx (
    lexeme text not null,
    textdate date not null,
    ids int[],
    primary key (lexeme, textdate)
);

which would partition the invidx table on textdate (or some other suitable datum).

> 2. if the word is frequent, then the query with 'IN (select * from
> func())' may work slowly...

If it is often too slow, then creating a temp table and doing a plain join may be faster.

-------------
Hannu
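[A sketch of the expansion step Hannu describes, with hypothetical names; string_agg() is from PostgreSQL 9.0+ and is used here only to make the OR-expansion concrete:

    -- One-column word table with an index usable for prefix range scans:
    create table words (word text primary key);
    create index words_prefix_idx on words (word text_pattern_ops);

    -- Expand 'hu%' with a fast btree scan, then OR the words together
    -- into one expression to hand to tsearch:
    select string_agg(word, ' | ')   -- 'human | humanity | humming | ...'
    from words
    where word like 'hu%';
]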
> > Tsearch was never intended for prefix search, and the index structure
> > doesn't support any kind of prefix or suffix matching. But you can write
> > an extension to tsearch which will search by prefix. Such a solution
> > will not use the index, though, only a sequential scan.
>
> How efficient would tsearch be for really big expressions, where 'hu%'
> would be expanded (using a btree index on a one-column word table) to the
> tsearch equivalent of ("human" or "humanity" or "humming" or "huge" or
> ...1000 words here...) before passing the expression to tsearch?

The GiST index of tsearch doesn't support prefix search, so it will work only by seqscan, as we know :) The disk is much slower than the processor, so speed will be limited by the disk.

> > Prefix searches are easily realized with an inverted index, but that
> > requires a lot of programming. The simplest way is:
> > create table invidx (
> >     lexeme text not null primary key,
> >     ids int[]
> > );
> >
> > where ids is an array with the identifiers of the documents that contain
> > this word.
>
> How hard (or sensible ;) would it be to create such an index using GiST?
> As tsearch proves, GiST can cope well with many-to-many indexes.

Sorry, I don't understand. Do you mean that GiST supports one heap tuple in several index tuples? If yes, then no :). GiST doesn't support that feature. I don't think that GiST can help in this situation.

> create table invidx (
>     lexeme text not null,
>     textdate date not null,
>     ids int[],
>     primary key (lexeme, textdate)
> );
>
> which would partition the invidx table on textdate (or some other
> suitable datum).
>
> > 2. if the word is frequent, then the query with 'IN (select * from
> > func())' may work slowly...
>
> If it is often too slow, then creating a temp table and doing a plain
> join may be faster.

A table structure like the invidx above decreases this problem.

--
Teodor Sigaev
E-mail: teodor@sigaev.ru
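[The temp-table variant Hannu mentions, sketched with the same hypothetical getids_by_word() function from earlier in the thread:

    -- Materialize the matching ids once, then join instead of IN (...):
    create temp table matched (id int primary key);
    insert into matched select * from getids_by_word('hu');
    analyze matched;   -- give the planner row estimates for the join

    select d.* from docs d join matched m on m.id = d.id;
]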
Teodor Sigaev wrote on Wed, 10.12.2003 at 11:20:
> > > Tsearch was never intended for prefix search, and the index structure
> > > doesn't support any kind of prefix or suffix matching. But you can
> > > write an extension to tsearch which will search by prefix. Such a
> > > solution will not use the index, though, only a sequential scan.
> >
> > How efficient would tsearch be for really big expressions, where 'hu%'
> > would be expanded (using a btree index on a one-column word table) to
> > the tsearch equivalent of ("human" or "humanity" or "humming" or "huge"
> > or ...1000 words here...) before passing the expression to tsearch?
>
> The GiST index of tsearch doesn't support prefix search, so it will work
> only by seqscan, as we know :) The disk is much slower than the
> processor, so speed will be limited by the disk.

I meant that the expansion of 'hu%' is done before and outside of tsearch, so the question is how efficient tsearch will be when searching for hundreds or thousands of words in one expression.

> > How hard (or sensible ;) would it be to create such an index using
> > GiST? As tsearch proves, GiST can cope well with many-to-many indexes.
>
> Sorry, I don't understand. Do you mean that GiST supports one heap tuple
> in several index tuples? If yes, then no :). GiST doesn't support that
> feature. I don't think that GiST can help in this situation.

But tsearch seems to support this, and tsearch uses GiST. Is this functionality added entirely by tsearch?

--------------
Hannu
> I meant that the expansion of 'hu%' is done before and outside of
> tsearch, so the question is how efficient tsearch will be when searching
> for hundreds or thousands of words in one expression.

Ok, I see. The answer: badly. The index structure is a signature tree with a constant signature length, 2016 bits by default. A signature is made by hashing each word and setting bit number HASHVAL % 2016 to 1. So, if a query has many terms and all the terms are ORed, then a lot of signatures are matched by the query. This means a lot of index pages will be read.

> > > How hard (or sensible ;) would it be to create such an index using
> > > GiST? As tsearch proves, GiST can cope well with many-to-many indexes.
> >
> > Sorry, I don't understand. Do you mean that GiST supports one heap
> > tuple in several index tuples? If yes, then no :). GiST doesn't support
> > that feature. I don't think that GiST can help in this situation.
>
> But tsearch seems to support this, and tsearch uses GiST. Is this
> functionality added entirely by tsearch?

No, one heap tuple - one index tuple. I'll try to explain the index structure used by tsearch (three levels, just for example):

Root page
    internal tuple 1 -> second level page 1
                            internal tuple 1.1 ->
                            internal tuple 1.2 ->
    internal tuple 2 -> second level page 2
                            internal tuple 2.1 ->
                            internal tuple 2.2 -> third level (leaf) page 2.2
                                                      leaf tuple 2.2.1 -> heap tuple
                                                      leaf tuple 2.2.2 -> heap tuple

A leaf tuple contains one of two types of predicates:
1. just the lexemes (without position information)
2. if the storage size of the first type is too big, then the tuple stores a signature as described above

An internal tuple contains the ORed (superimposed) signatures of its children.

--
Teodor Sigaev
E-mail: teodor@sigaev.ru
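[To make the signature scheme concrete: each lexeme sets a single bit chosen by hashing it modulo the signature length. A toy illustration using PostgreSQL's internal hashtext() function; tsearch uses its own hash, so the bit positions shown are illustrative only:

    -- Which of the 2016 signature bits each lexeme would set:
    select lexeme, abs(hashtext(lexeme)) % 2016 as signature_bit
    from (values ('human'), ('humanity'), ('humming'), ('huge')) as w(lexeme);

Two different lexemes can land on the same bit, so a signature match is only a "maybe" and matching rows must be rechecked; with thousands of ORed terms, most signatures in the tree match, which is exactly the page-read blowup described above.]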