Re: Searching for substring with tsearch(1/2) - Mailing list pgsql-hackers
From: Hannu Krosing
Subject: Re: Searching for substring with tsearch(1/2)
Msg-id: 1071009771.3194.34.camel@fuji.krosing.net
In response to: Re: Searching for substring with tsearch(1/2) (Teodor Sigaev <teodor@sigaev.ru>)
Responses: Re: Searching for substring with tsearch(1/2)
List: pgsql-hackers
Teodor Sigaev wrote on Tue, 09.12.2003 at 23:07:
> Urmo wrote:
> > Hi,
> >
> > there seems to be no way of searching for partial matches with tsearch.
> > Would it be hard to implement prefix based matching, i.e.
> > "hu" matches "human", "humanity", "humming", "huge"? With some hacking I
> > managed to disable the morphology part of tsearch1 (the database
> > contained multiple languages in a single table, so morphology could not
> > be used) and it ran happily for a year. But now I need prefix based
> > substring matching and I'm kinda lost. I tried using fulltextindex, but
> > it took ages to create the index on a 100K-row table (the query ran for
> > about a day before I lost my patience and cancelled it). I even modified
> > it to drop suffixes and index only full words, but it was still too slow
> > (50K records were indexed in 24h).
> >
> > Can anybody help or point out the right direction? Or is this even
> > (easily) doable with tsearch1 or tsearch2?

I have done this outside PostgreSQL (using BSDDB 1.85 and Python, about 7-8
years ago) in a manner very much like Teodor describes below. It was the
original web search system for our leading newspaper, Eesti Päevaleht.

It worked without any maintenance for 3-4 years, even after parts of the
index were trashed the few times other processes filled up the disk, after
which it did not always return all the older results ;)

Bulk updates for a day's worth of articles (50-70) took just a few minutes;
a search took about one second, including starting up the Python CGI, which
usually took most of that second.

> Tsearch was never intended as a prefix search, and the index structure
> doesn't support any kind of prefix or suffix search. But you can write an
> extension to tsearch which will search by prefix. Such a solution will
> not use the index, though, only a sequential scan.

How efficient would tsearch be for really big expressions, where 'hu%'
would be expanded (using a btree word index on a one-column word table) to
the tsearch equivalent of

  ( "human" or "humanity" or "humming" or "huge" or ..1000 words here... )

before passing the expression to tsearch?

> Prefix searches are easily realized with an inverted index, but that
> requires a lot of programming. The simplest way is:
>
> create table invidx (
>     lexeme text not null primary key,
>     ids int[]
> );
>
> where ids is an array holding the identifiers of the documents which
> contain this word.

How hard (or sensible ;) would it be to create such an index using GiST?
As tsearch proves, GiST can cope well with many-to-many indexes.

> So, your custom software may look as follows:
>
> create function getids_by_word(text) returns setof int as ..........;
>
> This function should return all identifiers of docs which contain a word
> with the prefix given in the argument. So the resulting query will be:
>
> select * from docs where docs.id in (select * from getids_by_word('hu'));
>
> What's good:
> 1. efficiency of the word search
> 2. reuse of some code from tsearch :) - the word parser
>
> What's bad:
> 1. updates - too slow; a more efficient way is a bulk upload.

or some hybrid of bulk and update, perhaps with a table structure like

create table invidx (
    lexeme text not null,
    textdate date not null,
    ids int[],
    primary key (lexeme, textdate)
);

which would partition the invidx table on textdate (or some other suitable
datum).

> 2. If a word is frequent, then a query with 'IN (select * from func())'
> may work slowly...

If it is often too slow, then creating a temp table and doing a plain join
may be faster.

-------------
Hannu
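P.S. For what it's worth, here is a rough sketch of how getids_by_word()
might look in plpgsql over the invidx table above. Untested, and the
names are only my assumptions; also note that whether the LIKE can use
the primary-key btree depends on locale (C locale, or an extra
text_pattern_ops index):

create function getids_by_word(text) returns setof int as '
declare
    r record;
    i int;
begin
    -- walk all lexemes that start with the given prefix
    for r in select ids from invidx where lexeme like $1 || ''%'' loop
        if r.ids is not null then
            -- hand-unnest the int[] of matching document ids
            for i in array_lower(r.ids, 1) .. array_upper(r.ids, 1) loop
                return next r.ids[i];
            end loop;
        end if;
    end loop;
    return;
end;
' language plpgsql;

A document containing several of the expanded words comes out more than
once, but the IN () in the query above does not care about duplicates.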
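P.P.S. And the temp table variant I mean at the end would be something
like this (again just a sketch, assuming the same function):

create temp table matched_ids as
    select distinct id from getids_by_word('hu') as g(id);
analyze matched_ids;

select d.* from docs d join matched_ids m on m.id = d.id;

The distinct collapses documents matched by several words, and the
analyze gives the planner real row counts to plan the join with.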