> Should we check for stop words before stemming or after ?
The current implementation supports both variants. Look at the dictionary
interface definition in morph.c:
typedef struct
{
    char        localename[NAMEDATALEN];
    /* init dictionary */
    void       *(*init) (void);
    /* close dictionary */
    void        (*close) (void *);
    /* find in dictionary */
    char       *(*lemmatize) (void *, char *, int *);
    int         (*is_stoplemm) (void *, char *, int);
    int         (*is_stemstoplemm) (void *, char *, int);
} DICT;
The 'is_stoplemm' method is called before 'lemmatize', and 'is_stemstoplemm'
is called after it. Either one may be NULL, in which case that check is skipped.
At the end of dict/porter_english.dct:
TABLE_DICT_START "C", setup_english_stemmer, closedown_english_stemmer, engstemming,
NULL, is_stopengword
TABLE_DICT_END
dict/russian_stemming.dct:
TABLE_DICT_START "ru_RU.KOI8-R", NULL, NULL, ru_RUKOI8R_stem,
ru_RUKOI8R_is_stopword, NULL
TABLE_DICT_END
So the English stemmer decides whether a lexeme is a stop word after stemming,
while the Russian one decides before stemming.
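
To make the call order concrete, here is a minimal sketch of how a caller could
drive the two hooks. It is not the actual code from morph.c: process_word,
lemma_length, toy_stem, toy_is_stop, the two toy dictionaries and the
NAMEDATALEN value are all made up for illustration, and I am assuming that
'lemmatize' returns the lemma and updates the length through its int pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumption: stand-in value; the real one comes from the PostgreSQL headers. */
#define NAMEDATALEN 64

typedef struct
{
    char        localename[NAMEDATALEN];
    void       *(*init) (void);
    void        (*close) (void *);
    char       *(*lemmatize) (void *, char *, int *);
    int         (*is_stoplemm) (void *, char *, int);
    int         (*is_stemstoplemm) (void *, char *, int);
} DICT;

/* Hypothetical toy stemmer: drops one trailing 's' by shortening *len. */
static char *
toy_stem(void *state, char *word, int *len)
{
    (void) state;
    if (*len > 1 && word[*len - 1] == 's')
        (*len)--;
    return word;
}

/* Hypothetical stop-word check against a two-entry list. */
static int
toy_is_stop(void *state, char *word, int len)
{
    static const char *const stop[] = {"it", "the"};
    size_t      i;

    (void) state;
    for (i = 0; i < sizeof(stop) / sizeof(stop[0]); i++)
        if ((int) strlen(stop[i]) == len &&
            strncmp(stop[i], word, (size_t) len) == 0)
            return 1;
    return 0;
}

/* Checks stop words after stemming (same slot layout as the English entry). */
const DICT  toy_check_after = {"toy_after", NULL, NULL, toy_stem, NULL, toy_is_stop};

/* Checks stop words before stemming (same slot layout as the Russian entry). */
const DICT  toy_check_before = {"toy_before", NULL, NULL, toy_stem, toy_is_stop, NULL};

/* Returns the lemma, or NULL if either hook reports a stop word. */
char *
process_word(const DICT *d, void *state, char *word, int *len)
{
    char       *lemma;

    assert(d != NULL && d->lemmatize != NULL);

    /* Pre-stemming check on the raw word; skipped when the hook is NULL. */
    if (d->is_stoplemm && d->is_stoplemm(state, word, *len))
        return NULL;

    lemma = d->lemmatize(state, word, len);

    /* Post-stemming check on the lemma; skipped when the hook is NULL. */
    if (lemma && d->is_stemstoplemm && d->is_stemstoplemm(state, lemma, *len))
        return NULL;

    return lemma;
}

/* Convenience wrapper: lemma length, 0 for a stop word, -1 if text is too long. */
int
lemma_length(const DICT *d, const char *text)
{
    char        buf[64];
    int         len = (int) strlen(text);

    if (len >= (int) sizeof(buf))
        return -1;
    memcpy(buf, text, (size_t) len + 1);
    return process_word(d, NULL, buf, &len) ? len : 0;
}
```

With this toy stop list, lemma_length(&toy_check_before, "its") returns 2,
because "its" is not a stop word and is then stemmed to "it", while
lemma_length(&toy_check_after, "its") returns 0, because the stem "it" is
caught by the post-stemming check. That is the practical difference between
the two slots.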
--
Teodor Sigaev
teodor@stack.net