Re: contrib/tsearch - Mailing list pgsql-hackers
From | Oleg Bartunov |
---|---|
Subject | Re: contrib/tsearch |
Date | |
Msg-id | Pine.GSO.4.44.0209061348260.13637-100000@ra.sai.msu.su |
In response to | Re: contrib/tsearch ("Christopher Kings-Lynne" <chriskl@familyhealth.com.au>) |
List | pgsql-hackers |
On Fri, 6 Sep 2002, Christopher Kings-Lynne wrote:

> > Should we check for stop words before stemming or after ?
>
> I think you should.
>
> > In the first case we have to collect all forms of stop-words
> > which is doable
> > but difficult to maintain, in latter - we'll have current problem.
>
> Looking at the list of stopwords you sent me, Oleg, there are only about 1
> out of the list of 120 stopwords that need to have all word forms added. I
> also don't think it'll be a maintenance problem. The reason I think this is
> because stopwords in general don't have different word forms.
>
> eg. her, his, i, and, etc. They don't have different forms. In fact, the
> _only_ word in the stopword list that needs a different form is yourself and
> yourselves. Actually, according to dictionary.com 'ourself' is also a word.
> 'themself' isn't tho. Some others I don't know about are:
>
> 'veri' - I assume this is stemmed 'very', so why not just use 'very'?

That's because we currently check for stop word after stemming and
I think porters algorithm converts 'very' to 'veri' :-)

>
> So, why don't you change tsearch to check for stop words _before_ stemming?
> I can give you a list of revised stopwords that haven't been stemmed, with
> all forms of the words.
>

I agree that english list is, probably, easy to maintain, but what about
other languages ? We don't have any volunteers - you're the first one.

> > It's time for beta1 and I'm not sure if we could work on this issue
> > right now, but I feel a big pressure from tsearch users :-)
> > If people want to help us why not to work on stop words list including
> > all forms ? In any case, we are not native english, so don't expect we'll
> > create more or less decent list. Programming changes are trivial, probably
> > we'll end for the moment just using compile time option.
> > As always, your patches are welcome !
>
> I'm happy to work on the list of stopwords for you, Oleg. I agree this
> might be 7.4 thing though...

We always could keep updates separately on our page and in CVS.

>
> Chris
>

	Regards,
		Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
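
For readers following the thread, the difference between the two orderings can be sketched roughly as below. This is only an illustration in Python, not tsearch code: `toy_stem` is a hypothetical stand-in that mimics a single rule of Porter's algorithm (terminal 'y' becomes 'i' when the rest of the word contains a vowel), and `index_terms` and the word lists are made up for the example.

```python
VOWELS = set("aeiou")

def toy_stem(word):
    """Hypothetical toy stemmer, NOT the full Porter algorithm.

    It only imitates the one rule discussed in the thread: a terminal
    'y' becomes 'i' when the preceding part contains a vowel, which is
    how 'very' ends up as 'veri'.
    """
    if len(word) > 1 and word.endswith("y") and any(c in VOWELS for c in word[:-1]):
        return word[:-1] + "i"
    return word

def index_terms(words, stop_words, check_before_stemming):
    """Return the stemmed terms that survive stop-word filtering."""
    terms = []
    for w in words:
        w = w.lower()
        if check_before_stemming:
            # Compare the raw word against an unstemmed stop-word list.
            if w in stop_words:
                continue
            terms.append(toy_stem(w))
        else:
            # Compare the stemmed word, so the list must hold stems.
            s = toy_stem(w)
            if s in stop_words:
                continue
            terms.append(s)
    return terms

words = ["very", "happy", "search"]

# Unstemmed list, checked before stemming: 'very' is dropped.
before = index_terms(words, {"very"}, check_before_stemming=True)

# Same unstemmed list, checked after stemming: 'veri' slips through.
after_unstemmed = index_terms(words, {"very"}, check_before_stemming=False)

# Stemmed list, checked after stemming: 'veri' is dropped again.
after_stemmed = index_terms(words, {"veri"}, check_before_stemming=False)
```

With the check done after stemming, the stop-word list has to contain stems such as 'veri'; with the check done before stemming, it can hold ordinary words, but then every surface form (for example 'yourself' and 'yourselves') has to be listed.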