Thread: contrib/tsearch
Hi Oleg/Teodor,

I'm sorry to keep posting bugs without patches, but I'm just hoping you guys know the answer faster than I...I know you're busy.

What does tsearch have against the word 'herring' (as in the fish)? Why is it considered a stopword?

Attached are example queries...

Chris
Hmmm...thinking about it, maybe 'herring' is being reduced to 'her' after the stemming process and hence is thought to be a stopword? This is a bug, but how should it be fixed?

Although, tests don't support that:

usa=# select food_id, brand, description, ftiidx from food_foods where ftiidx ## 'himring';
 food_id | brand | description | ftiidx
---------+-------+-------------+--------
(0 rows)

usa=# select food_id, brand, description, ftiidx from food_foods where ftiidx ## 'hisring';
 food_id | brand | description | ftiidx
---------+-------+-------------+--------
(0 rows)

usa=# select food_id, brand, description, ftiidx from food_foods where ftiidx ## 'hising';
 food_id | brand | description | ftiidx
---------+-------+-------------+--------
(0 rows)

usa=# select food_id, brand, description, ftiidx from food_foods where ftiidx ## 'himing';
 food_id | brand | description | ftiidx
---------+-------+-------------+--------
(0 rows)

All work...?

Chris
On Thu, 5 Sep 2002, Christopher Kings-Lynne wrote:

> Hmmm...thinking about it, maybe 'herring' is being reduced to 'her' after
> the stemming process and hence is thought to be a stopword? This is a bug,
> but how should it be fixed?

It's a difficult question how to use stop words. We'll see what we could do. Probably Porter's stemming algorithm has a problem here: 'herring' -> 'her'~'ring'. (I have a demo of the English-Russian stemmer, so you can play with it: http://intra.astronet.ru/db/lingua/snowball/) I'll ask Martin Porter if there could be an error in the stemmer.

But I think the problem is in the concept of using stop words. Should we check for stop words before stemming or after? In the first case we have to collect all forms of stop words, which is doable but difficult to maintain; in the latter we'll have the current problem.

It's time for beta1 and I'm not sure if we could work on this issue right now, but I feel big pressure from tsearch users :-) If people want to help us, why not work on a stop word list including all forms? In any case, we are not native English speakers, so don't expect us to create a more or less decent list. Programming changes are trivial; probably we'll end up, for the moment, just using a compile-time option. As always, your patches are welcome!

btw, you may test your queries much more easily:

list=# select 'herring'::mquery_txt;
ERROR: Your query contained only stopword(s), ignored
list=# select 'herring'::query_txt;
 query_txt
-----------
 'herring'
(1 row)

Regards, Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
Oleg, The Porter stemming stems herring and herrings to her, which is a bit unfortunate. A quick fix is to put 'herring/herrings' in the exception list in the english (porter2) stemmer, but I'll look at this case over the next few days and see if I can come up with something a bit better. Interesting that no one has reported this before. Martin
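The quick fix Martin mentions, an exception list consulted before the stemming rules run, could be sketched roughly as follows. This is a toy illustration in Python, not Snowball's or tsearch's code: `toy_stem` is a hypothetical stand-in that only reproduces the over-stemming reported in this thread, and the contents of `EXCEPTIONS` are an assumption about what such a list might hold.

```python
# Hypothetical stand-in for a real stemmer: it only reproduces the
# behaviour reported in this thread (herring/herrings -> her).
TOY_STEMS = {"herring": "her", "herrings": "her", "fishing": "fish"}

# Assumed exception list: words mapped straight to the stem we want,
# bypassing the stemming rules entirely.
EXCEPTIONS = {"herring": "herring", "herrings": "herring"}


def toy_stem(word):
    """Return the toy stem, or the word unchanged if it is unknown."""
    return TOY_STEMS.get(word, word)


def stem_with_exceptions(word):
    """Consult the exception list first, then fall back to the stemmer."""
    word = word.lower()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    return toy_stem(word)
```

With the exception list in place, `stem_with_exceptions("herrings")` yields `"herring"` instead of the stop word `"her"`, while words outside the list still go through the normal stemmer.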
On Thu, 5 Sep 2002, Martin Porter wrote:

> The Porter stemming stems herring and herrings to her, which is a bit
> unfortunate. A quick fix is to put 'herring/herrings' in the exception list
> in the english (porter2) stemmer, but I'll look at this case over the next
> few days and see if I can come up with something a bit better.

Unfortunately, we wrote the tsearch module before the Snowball project started, so we used an implementation we found on the net (www.muscat.com), and it has no exception list. OpenFTS uses Snowball stemming, so we'd like to have a fix. I think we have enough arguments to use Snowball stemmers in tsearch also.

> Interesting that no one has reported this before.

:-) Thanks to Christopher for his persistence.

Regards, Oleg
> Should we check for stop words before stemming or after ?

I think you should check before stemming.

> In the first case we have to collect all forms of stop-words
> which is doable
> but difficult to maintain, in latter - we'll have current problem.

Looking at the list of stopwords you sent me, Oleg, there is only about 1 out of the list of 120 stopwords that needs to have all word forms added. I also don't think it'll be a maintenance problem. The reason I think this is because stopwords in general don't have different word forms.

eg. her, his, i, and, etc. They don't have different forms. In fact, the _only_ word in the stopword list that needs a different form is yourself and yourselves. Actually, according to dictionary.com 'ourself' is also a word. 'themself' isn't tho. Some others I don't know about are:

'veri' - I assume this is stemmed 'very', so why not just use 'very'?

So, why don't you change tsearch to check for stop words _before_ stemming? I can give you a list of revised stopwords that haven't been stemmed, with all forms of the words.

> It's time for beta1 and I'm not sure if we could work on this issue
> right now, but I feel a big pressure from tsearch users :-)
> If people want to help us why not to work on stop words list including
> all forms ? In any case, we are not native english, so don't expect we'll
> create more or less decent list. Programming changes are trivial, probably
> we'll end for the moment just using compile time option.
> As always, your patches are welcome !

I'm happy to work on the list of stopwords for you, Oleg. I agree this might be a 7.4 thing though...

Chris
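The difference between the two orders can be seen in a small Python sketch. Everything here is illustrative: `toy_stem` is a hypothetical lookup that mimics the 'herring' -> 'her' behaviour reported above, and `STOPWORDS` holds just the unstemmed examples named in this message, not the real tsearch list.

```python
# Unstemmed stop words, taken from the examples in this thread.
STOPWORDS = {"her", "his", "i", "and"}

# Hypothetical stemmer output, limited to the case under discussion.
TOY_STEMS = {"herring": "her", "herrings": "her"}


def toy_stem(word):
    """Return the toy stem, or the word unchanged if it is unknown."""
    return TOY_STEMS.get(word, word)


def check_after_stemming(word):
    """Stem first, then test the stem against the stop list."""
    stem = toy_stem(word.lower())
    return None if stem in STOPWORDS else stem


def check_before_stemming(word):
    """Test the raw word against the stop list, then stem the survivors."""
    word = word.lower()
    return None if word in STOPWORDS else toy_stem(word)
```

In this sketch `check_after_stemming("herring")` returns `None` (the word is thrown away as a stop word), while `check_before_stemming("herring")` keeps it. Note that the surviving lexeme is still the stem `"her"`; only the stop-word decision changes, not the stemming itself.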
> Looking at the list of stopwords you sent me, Oleg, there are only about 1
> out of the list of 120 stopwords that need to have all word forms
> added. I
> also don't think it'll be a maintenance problem. The reason I
> think this is
> because stopwords in general don't have different word forms.

Actually, it just occurred to me that stuff like:

will
won't
it
it's
where
where's

Will all have to be in the list, right?

Chris
There also seems to be a more complete list of english stopwords here:

http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/

However this list again does not include contractions. I can take this list, check it and submit it to you Oleg, but do you want me to add contractions?

eg. wasn't, isn't, it's, etc.?

Chris
On Fri, 6 Sep 2002, Christopher Kings-Lynne wrote:

> There also seems to be a more complete list of english stopwords here:
>
> http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/

Chris, I think we have to separate the stop word list from the tsearch package and supply just some defaults. The reason for this is to let the user decide what is a stop word: various domains should have different stop words. This is how OpenFTS works. Also, we probably need to let the user decide when to check for stop words, after or before stemming. I'm waiting for Martin's fix for the English stemmer, and probably we'll switch to the Snowball ones, which are better qualified.

Damn, we wanted to do these and much more a bit later because we're under big pressure from our work. We'll see if we can manage our plans. We certainly need developers to help us with full text searching and ltree (it has a chance to support XML). Also we need to work on adding concurrency support to GiST.

So, I can't promise we'll work on tsearch right now, but we provide makedict.pl so you can build a dictionary with a custom list of stop words. Did you try it?

> However this list again does not include contractions. I can take this
> list, check it and submit it to you Oleg, but do you want me to add
> contractions?
>
> eg. wasn't, isn't, it's, etc.?

Hmm, our parser isn't smart enough to handle them as a single word, so it won't help:

13:30:03[megera@amon]~/app/fts/test-suite>./testdict.pl -p wasn't
lexeme:wasn:1:Latin word
lexeme:':12:Space symbols
lexeme:t:1:Latin word

But you could always add 'wasn', 'isn' ... and 't', 's' to your list of stop words and be happy. Hmm, probably we could enhance our parser to handle such words too. Anyway, most problems are just a question of time we don't have :-(

Regards, Oleg
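To make the contraction problem concrete, here is a toy tokenizer in Python that splits at the apostrophe the same way the testdict.pl output shows. The type ids 1 ("Latin word") and 12 ("Space symbols") are copied from that output; the regular expression and the function names are assumptions for illustration and are not the real tsearch/OpenFTS parser.

```python
import re

# Type ids copied from the testdict.pl output; everything else is assumed.
LATIN_WORD = 1
SPACE_SYMBOLS = 12

# A run of ASCII letters is a word; any other run is a separator.
TOKEN_RE = re.compile(r"(?P<word>[A-Za-z]+)|(?P<other>[^A-Za-z]+)")


def toy_parse(text):
    """Split text into (lexeme, type id) pairs, breaking at the apostrophe."""
    return [
        (m.group(0), LATIN_WORD if m.lastgroup == "word" else SPACE_SYMBOLS)
        for m in TOKEN_RE.finditer(text)
    ]


def searchable_words(text, stopwords):
    """Keep the Latin words that are not in the stop list."""
    return [
        tok.lower()
        for tok, kind in toy_parse(text)
        if kind == LATIN_WORD and tok.lower() not in stopwords
    ]
```

Because "wasn't" never reaches the stop-word check as a single lexeme, listing the contraction itself would have no effect; listing the fragments does, e.g. `searchable_words("it wasn't herring", {"it", "wasn", "t"})` leaves only `"herring"`.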
On Fri, 6 Sep 2002, Christopher Kings-Lynne wrote:

> Actually, it just occurred to me that stuff like:
>
> will
> won't
> it
> it's
> where
> where's
>
> Will all have to be in the list, right?

Right, see my previous message. Teodor is our main developer; he should be back from vacation very soon. But he already has many assignments regarding our main project. Is there one smart programmer out there?

Regards, Oleg
On Fri, 6 Sep 2002, Christopher Kings-Lynne wrote:

> eg. her, his, i, and, etc. They don't have different forms. In fact, the
> _only_ word in the stopword list that needs a different form is yourself and
> yourselves. Actually, according to dictionary.com 'ourself' is also a word.
> 'themself' isn't tho. Some others I don't know about are:
>
> 'veri' - I assume this is stemmed 'very', so why not just use 'very'?

That's because we currently check for stop words after stemming, and I think Porter's algorithm converts 'very' to 'veri' :-)

> So, why don't you change tsearch to check for stop words _before_ stemming?
> I can give you a list of revised stopwords that haven't been stemmed, with
> all forms of the words.

I agree that the English list is probably easy to maintain, but what about other languages? We don't have any volunteers; you're the first one.

> I'm happy to work on the list of stopwords for you, Oleg. I agree this
> might be 7.4 thing though...

We could always keep updates separately on our page and in CVS.

Regards, Oleg
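The 'veri' entry follows from the check-after-stemming order: the stop list has to hold stems, not dictionary words. The Python sketch below applies only one simplified Porter-style rule (a final 'y' becomes 'i' when the preceding part contains a vowel), which is enough to turn 'very' into 'veri'. It is not a full Porter stemmer, and `stemmed_stoplist` is a hypothetical helper, not part of tsearch or makedict.pl.

```python
VOWELS = set("aeiou")


def has_vowel(stem):
    """True if the stem contains a vowel; 'y' after a consonant counts as one."""
    for i, ch in enumerate(stem):
        if ch in VOWELS:
            return True
        if ch == "y" and i > 0 and stem[i - 1] not in VOWELS:
            return True
    return False


def y_to_i(word):
    """Single simplified rule: final 'y' -> 'i' if the rest has a vowel."""
    if word.endswith("y") and has_vowel(word[:-1]):
        return word[:-1] + "i"
    return word


def stemmed_stoplist(words):
    """Build the list to use when stop words are tested after stemming."""
    return {y_to_i(w) for w in words}
```

Under this rule a human-readable list such as `{"very", "her"}` becomes `{"veri", "her"}`, which is the form a post-stemming check needs. A pre-stemming check would use the readable list directly.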
> Should we check for stop words before stemming or after ?

The current implementation supports both variants. Look at the dictionary interface definition in morph.c:

typedef struct
{
    char        localename[NAMEDATALEN];
    /* init dictionary */
    void       *(*init) (void);
    /* close dictionary */
    void        (*close) (void *);
    /* find in dictionary */
    char       *(*lemmatize) (void *, char *, int *);
    int         (*is_stoplemm) (void *, char *, int);
    int         (*is_stemstoplemm) (void *, char *, int);
}   DICT;

The 'is_stoplemm' method is called before 'lemmatize' and 'is_stemstoplemm' after.

dict/porter_english.dct, at the end:

TABLE_DICT_START
    "C",
    setup_english_stemmer,
    closedown_english_stemmer,
    engstemming,
    NULL,
    is_stopengword
TABLE_DICT_END

dict/russian_stemming.dct:

TABLE_DICT_START
    "ru_RU.KOI8-R",
    NULL,
    NULL,
    ru_RUKOI8R_stem,
    ru_RUKOI8R_is_stopword,
    NULL
TABLE_DICT_END

So the English stemmer decides whether a lexeme is a stop word after stemming, but the Russian one decides before.

--
Teodor Sigaev
teodor@stack.net
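The calling order Teodor describes can be mirrored in a short Python sketch. The field names follow the DICT struct, but the rest is an assumption: in particular, treating a NULL (here `None`) slot as "skip this check" is inferred from the two tables above, and the toy dictionaries below are hypothetical, not the contents of porter_english.dct.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ToyDict:
    """Loose Python analogue of DICT; None plays the role of NULL."""
    localename: str
    lemmatize: Callable[[str], str]
    is_stoplemm: Optional[Callable[[str], bool]] = None      # before stemming
    is_stemstoplemm: Optional[Callable[[str], bool]] = None  # after stemming


def process(d, word):
    """Return the lexeme to index, or None if a stop check rejects it."""
    if d.is_stoplemm is not None and d.is_stoplemm(word):
        return None
    lemma = d.lemmatize(word)
    if d.is_stemstoplemm is not None and d.is_stemstoplemm(lemma):
        return None
    return lemma


def toy_stem(word):
    """Hypothetical stemmer reproducing only herring -> her."""
    return {"herring": "her"}.get(word, word)


def is_stop(word):
    """Hypothetical one-word stop list."""
    return word in {"her"}


# Same stemmer and stop list, wired into the two different slots.
post_check = ToyDict("C", toy_stem, is_stemstoplemm=is_stop)
pre_check = ToyDict("C", toy_stem, is_stoplemm=is_stop)
```

With these toy dictionaries, `process(post_check, "herring")` returns `None` while `process(pre_check, "herring")` returns `"her"`, so under this reading the choice between the two behaviours comes down to which slot of the table holds the stop-word function.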