Re: Dictionary chaining and stop words - Mailing list pgsql-hackers
From | Oleg Bartunov |
---|---|
Subject | Re: Dictionary chaining and stop words |
Date | |
Msg-id | Pine.LNX.4.64.0708291835100.2767@sn.sai.msu.ru |
In response to | Dictionary chaining and stop words ("Heikki Linnakangas" <heikki@enterprisedb.com>) |
List | pgsql-hackers |
Heikki,

we know about this (I call it filtering), but we are leaving it for the
future, once we have everything in core. A more demonstrative example
is the well-known accent-removal problem. I used to recommend
preprocessing the string before tsearch2, but that does not work with
headline(), so clearly we need accent removal in the dictionary chain,
using a simple pg_unaccent dictionary that returns the original word
without accents and then passes it to the next dictionary. Currently,
this is impossible. But it's not obvious how this should work in the
general case, when a dictionary returns an array of lexemes. So we
decided to leave it for the future.

I'm very pleased that we now have many developers interested in text
search development! We have many interesting todo items, like
'phrase search'.

Oleg

On Wed, 29 Aug 2007, Heikki Linnakangas wrote:

> It's nice to be able to chain tsearch dictionaries, but I find that it's
> not as flexible as it should be. Currently we have these dictionaries
> built-in:
>
> dict_simple - lowercases and checks against stop word list, accepts
>   everything not in stop word list
> dict_synonym - replaces with synonym, if found
> dict_thesaurus - similar to synonym, but can recognize phrases
> dict_ispell - lowercases, checks dictionary, then checks stop words
> dict_snowball - lowercases, checks stop words, then stems
>
> The way things are at the moment, you can't for example use any of the
> built-in dictionaries in case-sensitive mode, without writing custom C
> code. Or check against stop words before going through an ispell
> dictionary (dict_simple accepts everything, so you can't put it in front
> of dict_ispell). Or use ispell dictionary first, then replace synonyms
> with dict_synonym, and so forth.
>
> To make the chaining more useful, I'm proposing some changes to the
> dictionary API and the set of built-in dictionaries.
> Currently, a dictionary can either:
> - Accept the word (and possibly replace it with something else)
> - Reject the word
> - Do nothing
>
> There's clearly a need for transforming a word and passing on the
> transformed version to the next dictionary. dict_thesaurus does exactly
> that by supporting a subdictionary which is called before invoking the
> thesaurus, but it should be a generic capability, not specific to any
> dictionary. Let's modify the lexize API so that a dictionary can:
> - Accept the word (and possibly replace it with something else)
> - Reject the word
> - Transform word into another (or pass on as is)
>
> If we do that, and modularize the lowercasing and stopwords
> functionality into separate dictionaries, we end up with this nice,
> orthogonal set of dictionaries that you can use as building blocks for a
> wide range of more complex rules:
>
> dict_lowercase - lowercases, doesn't accept or reject anything
> dict_simple - accepts or rejects (depending on dict option) words in
>   list, passes on others. This can be used for stop words functionality,
>   or to accept words found in a simple list of words
> dict_accept - accepts everything (for use as a terminator in the chain,
>   if you want to accept everything not accepted or rejected by other
>   dictionaries)
>
> dict_synonym - replaces input with synonym, passes on or accepts matches
>   depending on dict option
> dict_thesaurus - replaces input with preferred term, passes on or
>   accepts matches depending on dict option
> dict_ispell - replaces input with basic form from dictionary, passes on
>   or accepts matches depending on dict option
> dict_snowball - replaces input with stem, passes on
>
> I don't know what the current plan for beta is, but it would be nice to
> get the API right even though there is some work to do. I can write a
> patch if no-one objects.
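
To make the three outcomes in the quoted proposal concrete, here is a
minimal sketch in Python (not the C lexize API, and not any existing
patch). The names Outcome, run_chain, make_dict_simple and dict_unaccent
are hypothetical and chosen only for illustration; dict_unaccent stands
in for the accent-removal filter mentioned above. It handles a single
word per step only, so it sidesteps the open question of what to do when
a dictionary returns an array of lexemes.

```python
import unicodedata
from enum import Enum
from typing import Callable, Iterable, List, Optional, Tuple


class Outcome(Enum):
    """The three results a dictionary could report under the proposal."""
    ACCEPT = "accept"    # stop here, word (possibly replaced) is the lexeme
    REJECT = "reject"    # stop here, word is dropped (e.g. a stop word)
    PASS_ON = "pass_on"  # hand the (possibly transformed) word to the next dictionary


Result = Tuple[Outcome, str]
Dictionary = Callable[[str], Result]


def dict_unaccent(word: str) -> Result:
    """Hypothetical filter: strip combining accents, never accept or reject."""
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return Outcome.PASS_ON, stripped


def dict_lowercase(word: str) -> Result:
    """Lowercases, doesn't accept or reject anything."""
    return Outcome.PASS_ON, word.lower()


def make_dict_simple(words: Iterable[str], reject_matches: bool) -> Dictionary:
    """Accepts or rejects words in the list (per option), passes on others."""
    wordset = frozenset(words)

    def lexize(word: str) -> Result:
        if word in wordset:
            outcome = Outcome.REJECT if reject_matches else Outcome.ACCEPT
            return outcome, word
        return Outcome.PASS_ON, word

    return lexize


def dict_accept(word: str) -> Result:
    """Terminator: accepts whatever reaches the end of the chain."""
    return Outcome.ACCEPT, word


def run_chain(chain: List[Dictionary], word: str) -> Optional[str]:
    """Run a word through the chain; return the lexeme, or None if dropped."""
    for lexize in chain:
        outcome, word = lexize(word)
        if outcome is Outcome.ACCEPT:
            return word
        if outcome is Outcome.REJECT:
            return None
    # Assumption: a word that no dictionary accepted is not indexed.
    return None


# Example chain: unaccent, lowercase, stop words, then accept the rest.
chain = [
    dict_unaccent,
    dict_lowercase,
    make_dict_simple({"the", "a"}, reject_matches=True),
    dict_accept,
]
```

With this chain, run_chain(chain, "Hôtel") yields "hotel", while
run_chain(chain, "The") yields None because the stop-word step rejects
it after lowercasing. Reordering the list is all it takes to check stop
words before or after any other step, which is the flexibility the
proposal asks for.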

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83