Thread: contrib/tsearch

contrib/tsearch

From
"Christopher Kings-Lynne"
Date:
Hi Oleg/Teodor,

I'm sorry to keep posting bugs without patches, but I'm just hoping you guys
know the answer faster than I...I know you're busy.

What does tsearch have against the word 'herring' (as in the fish)?  Why is
it considered a stopword?

Attached are example queries...

Chris

Re: contrib/tsearch

From
"Christopher Kings-Lynne"
Date:
Hmmm...thinking about it, maybe 'herring' is being reduced to 'her' after
the stemming process and hence is thought to be a stopword?  This is a bug,
but how should it be fixed?

Although, tests don't support that:

usa=# select food_id, brand,description,ftiidx from food_foods where ftiidx
## 'himring';
 food_id | brand | description | ftiidx
---------+-------+-------------+--------
(0 rows)

usa=# select food_id, brand,description,ftiidx from food_foods where ftiidx
## 'hisring';
 food_id | brand | description | ftiidx
---------+-------+-------------+--------
(0 rows)

usa=# select food_id, brand,description,ftiidx from food_foods where ftiidx
## 'hising';
 food_id | brand | description | ftiidx
---------+-------+-------------+--------
(0 rows)

usa=# select food_id, brand,description,ftiidx from food_foods where ftiidx
## 'himing';
 food_id | brand | description | ftiidx
---------+-------+-------------+--------
(0 rows)

All work...?

Chris



Re: contrib/tsearch

From
Oleg Bartunov
Date:
On Thu, 5 Sep 2002, Christopher Kings-Lynne wrote:

> Hmmm...thinking about it, maybe 'herring' is being reduced to 'her' after
> the stemming process and hence is thought to be a stopword?  This is a bug,
> but how should it be fixed?
>

It's a difficult question how to use stop words. We'll see what we could
do. Probably, Porter's stemming algorithm has a problem here:
'herring' -> 'her'~'ring'
(I have a demo of an English-Russian stemmer, so you can play)
http://intra.astronet.ru/db/lingua/snowball/
I'll ask Martin Porter if there could be an error in the stemmer.
But I think the problem is in the concept of using stop words.
Should we check for stop words before stemming or after ?
In the first case we have to collect all forms of stop-words which is doable
but difficult to maintain, in latter - we'll have current problem.
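
To make the two orderings concrete, here is a minimal Python sketch. It is an illustration only: toy_stem is a hypothetical stand-in that merely strips "ing" and undoubles a trailing consonant (it is not the Porter algorithm and not tsearch code), and STOP_WORDS is an assumed sample list.

```python
# Toy illustration only: NOT the Porter algorithm and NOT tsearch code.
STOP_WORDS = {"her", "his", "i", "and"}  # assumed sample stop list


def toy_stem(word):
    """Crude suffix stripper, just enough to turn 'herring' into 'her'."""
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        # Undouble a trailing consonant: 'herr' -> 'her'.
        if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
            word = word[:-1]
    return word


def term_stop_after(word):
    """Stem first, then test the stem against the stop list."""
    stem = toy_stem(word)
    return None if stem in STOP_WORDS else stem


def term_stop_before(word):
    """Test the surface form against the stop list, then stem."""
    if word in STOP_WORDS:
        return None
    return toy_stem(word)


if __name__ == "__main__":
    for w in ("herring", "her"):
        print(w, term_stop_after(w), term_stop_before(w))
```

With the check after stemming, 'herring' is dropped because its stem collides with the stop word 'her'. With the check before stemming it survives, but then every surface form of a stop word has to be listed.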

It's time for beta1 and I'm not sure if we could work on this issue
right now, but I feel a big pressure from tsearch users :-)
If people want to help us why not to work on stop words list including
all forms ? In any case, we are not native  english, so don't expect we'll
create more or less decent list. Programming changes are trivial, probably
we'll end for the moment just using compile time option.
As always, your patches are welcome !

btw, you may test your queries much easier:

list=# select 'herring'::mquery_txt;
ERROR:  Your query contained only stopword(s), ignored
list=# select 'herring'::query_txt;
 query_txt
-----------
 'herring'
(1 row)




Regards,    Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83





Re: contrib/tsearch

From
martin_porter@softhome.net (Martin Porter)
Date:
Oleg,

The Porter stemming stems herring and herrings to her, which is a bit
unfortunate. A quick fix is to put 'herring/herrings' in the exception list
in the english (porter2) stemmer, but I'll look at this case over the next
few days and see if I can come up with something a bit better.
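
The quick fix could look roughly like the Python sketch below. Everything in it is an assumption made for illustration: toy_stem is a crude stand-in rather than the real (porter2) stemmer, and mapping both forms to 'herring' is only one plausible choice for what the exception list would return.

```python
# Hypothetical names throughout; toy_stem is NOT the Porter/porter2 stemmer.
EXCEPTIONS = {"herring": "herring", "herrings": "herring"}  # assumed mapping


def toy_stem(word):
    """Crude stand-in: strip a plural 's', then 'ing', then undouble."""
    if word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
            word = word[:-1]
    return word


def stem_with_exceptions(word):
    """Consult the exception list first; fall back to the stemmer."""
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    return toy_stem(word)


if __name__ == "__main__":
    for w in ("herring", "herrings", "walking"):
        print(w, "->", stem_with_exceptions(w))
```

The lookup happens before any suffix stripping, so listed words never reach the rules that would over-stem them.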

Interesting that no one has reported this before.

Martin




Re: contrib/tsearch

From
Oleg Bartunov
Date:
On Thu, 5 Sep 2002, Martin Porter wrote:

>
> Oleg,
>
> The Porter stemming stems herring and herrings to her, which is a bit
> unfortunate. A quick fix is to put 'herring/herrings' in the exception list
> in the english (porter2) stemmer, but I'll look at this case over the next
> few days and see if I can come up with something a bit better.

Unfortunately, we wrote the tsearch module before the Snowball project had
started, so we used an implementation we found on the net (www.muscat.com) and
there is no exception list. OpenFTS uses snowball stemming, so we'd like
to have a fix. I think we have enough arguments to use snowball stemmers
in tsearch also.

>
> Interesting that no one has reported this before.

:-) Thanks to Christopher for his persistence.

>
> Martin
>
>
Regards,    Oleg



Re: contrib/tsearch

From
"Christopher Kings-Lynne"
Date:
> Should we check for stop words before stemming or after ?

I think you should.

> In the first case we have to collect all forms of stop-words
> which is doable
> but difficult to maintain, in latter - we'll have current problem.

Looking at the list of stopwords you sent me, Oleg, there are only about 1
out of the list of 120 stopwords that need to have all word forms added.  I
also don't think it'll be a maintenance problem.  The reason I think this is
because stopwords in general don't have different word forms.

eg. her, his, i, and, etc.  They don't have different forms.  In fact, the
_only_ word in the stopword list that needs a different form is yourself and
yourselves.  Actually, according to dictionary.com 'ourself' is also a word.
'themself' isn't tho.  Some others I don't know about are:

'veri' - I assume this is stemmed 'very', so why not just use 'very'?

So, why don't you change tsearch to check for stop words _before_ stemming?
I can give you a list of revised stopwords that haven't been stemmed, with
all forms of the words.

> It's time for beta1 and I'm not sure if we could work on this issue
> right now, but I feel a big pressure from tsearch users :-)
> If people want to help us why not to work on stop words list including
> all forms ? In any case, we are not native  english, so don't expect we'll
> create more or less decent list. Programming changes are trivial, probably
> we'll end for the moment just using compile time option.
> As always, your patches are welcome !

I'm happy to work on the list of stopwords for you, Oleg.  I agree this
might be 7.4 thing though...

Chris



Re: contrib/tsearch

From
"Christopher Kings-Lynne"
Date:
> Looking at the list of stopwords you sent me, Oleg, there are only about 1
> out of the list of 120 stopwords that need to have all word forms added.  I
> also don't think it'll be a maintenance problem.  The reason I think this is
> because stopwords in general don't have different word forms.

Actually, it just occurred to me that stuff like:

will
won't
it
it's
where
where's

Will all have to be in the list, right?

Chris



Re: contrib/tsearch

From
"Christopher Kings-Lynne"
Date:
There also seems to be a more complete list of english stopwords here:

http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/

However this list again does not include contractions.  I can take this
list, check it and submit it to you Oleg, but do you want me to add
contractions?

eg. wasn't, isn't, it's, etc.?

Chris



Re: contrib/tsearch

From
Oleg Bartunov
Date:
On Fri, 6 Sep 2002, Christopher Kings-Lynne wrote:

> There also seems to be a more complete list of english stopwords here:
>
> http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/

Chris, I think we have to separate the stop word list from the tsearch package
and supply just some defaults. The reason for this is to let the user decide
what is a stop word - various domains should have different stop words.
This is how OpenFTS works.
Also, we probably need to let the user decide when to check for stop words -
after or before stemming. I'm waiting for Martin's fix for the English stemmer
and probably we'll switch to the snowball stemmers, which are of better quality.

Damn, we wanted to do these and much more a bit later because we're under
big pressure from our work. We'll see if we can manage our plans.

We certainly need developers to help us with full text searching and
ltree (it has a chance to support XML). Also we need to work
on adding concurrency support to GiST.

So, I can't promise we'll work on tsearch right now, but we provide
makedict.pl so you can build a dictionary with a custom list of stop words.
Did you try it?

>
> However this list again does not include contractions.  I can take this
> list, check it and submit it to you Oleg, but do you want me to add
> contractions?
>
> eg. wasn't, isn't, it's, etc.?

Hmm, our parser isn't smart enough to handle them as a single word, so
it won't help:

13:30:03[megera@amon]~/app/fts/test-suite>./testdict.pl -p
wasn't
lexeme:wasn:1:Latin word
lexeme:':12:Space symbols
lexeme:t:1:Latin word

But you could always add 'wasn', 'isn' ... and 't', 's' to your list of
stop words and be happy. Hmm, probably we could enhance our parser to
handle such words too.
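
A small Python sketch of that workaround, under stated assumptions: split_words only imitates the parser behaviour shown in the testdict.pl output (a contraction breaks at the apostrophe), it is not the tsearch parser, and STOP_WORDS is a made-up sample list.

```python
import re

# Assumed stop list: ordinary stop words plus the fragments left over
# when a contraction is split at the apostrophe.
STOP_WORDS = {"it", "is", "was", "wasn", "isn", "t", "s"}


def split_words(text):
    """Keep runs of Latin letters only, so "wasn't" becomes 'wasn' and 't'."""
    return [w.lower() for w in re.findall(r"[A-Za-z]+", text)]


def index_terms(text):
    """Drop every token that appears in the stop list."""
    return [w for w in split_words(text) if w not in STOP_WORDS]


if __name__ == "__main__":
    print(split_words("wasn't"))
    print(index_terms("It wasn't herring"))
```

Listing the fragments works without touching the parser, at the cost of also discarding a stand-alone 't' or 's' wherever one appears.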

Anyway, most problems are just a question of time we don't have :-(


Regards,    Oleg



Re: contrib/tsearch

From
Oleg Bartunov
Date:
On Fri, 6 Sep 2002, Christopher Kings-Lynne wrote:

> > Looking at the list of stopwords you sent me, Oleg, there are only about 1
> > out of the list of 120 stopwords that need to have all word forms added.  I
> > also don't think it'll be a maintenance problem.  The reason I think this is
> > because stopwords in general don't have different word forms.
>
> Actually, it just occurred to me that stuff like:
>
> will
> won't
> it
> it's
> where
> where's
>
> Will all have to be in the list, right?

Right, see my previous message. Teodor is our main developer; he should be
back from vacation very soon. But he already has many assignments regarding
our main project. Is there one smart programmer out there?


>
> Chris
>
Regards,    Oleg



Re: contrib/tsearch

From
Oleg Bartunov
Date:
On Fri, 6 Sep 2002, Christopher Kings-Lynne wrote:

> > Should we check for stop words before stemming or after ?
>
> I think you should.
>
> > In the first case we have to collect all forms of stop-words
> > which is doable
> > but difficult to maintain, in latter - we'll have current problem.
>
> Looking at the list of stopwords you sent me, Oleg, there are only about 1
> out of the list of 120 stopwords that need to have all word forms added.  I
> also don't think it'll be a maintenance problem.  The reason I think this is
> because stopwords in general don't have different word forms.
>
> eg. her, his, i, and, etc.  They don't have different forms.  In fact, the
> _only_ word in the stopword list that needs a different form is yourself and
> yourselves.  Actually, according to dictionary.com 'ourself' is also a word.
> 'themself' isn't tho.  Some others I don't know about are:
>
> 'veri' - I assume this is stemmed 'very', so why not just use 'very'?

That's because we currently check for stop words after stemming, and
I think Porter's algorithm converts 'very' to 'veri' :-)

>
> So, why don't you change tsearch to check for stop words _before_ stemming?
> I can give you a list of revised stopwords that haven't been stemmed, with
> all forms of the words.
>

I agree that the English list is probably easy to maintain, but what about
other languages? We don't have any volunteers - you're the first one.


> > It's time for beta1 and I'm not sure if we could work on this issue
> > right now, but I feel a big pressure from tsearch users :-)
> > If people want to help us why not to work on stop words list including
> > all forms ? In any case, we are not native  english, so don't expect we'll
> > create more or less decent list. Programming changes are trivial, probably
> > we'll end for the moment just using compile time option.
> > As always, your patches are welcome !
>
> I'm happy to work on the list of stopwords for you, Oleg.  I agree this
> might be 7.4 thing though...

We could always keep updates separately on our page and in CVS.

>
> Chris
>
Regards,    Oleg



Re: contrib/tsearch

From
Teodor Sigaev
Date:
> Should we check for stop words before stemming or after ?

The current implementation supports both variants. Look at the dictionary
interface definition in morph.c:

typedef struct
{
        char            localename[NAMEDATALEN];
        /* init dictionary */
        void       *(*init) (void);
        /* close dictionary */
        void            (*close) (void *);
        /* find in dictionary */
        char       *(*lemmatize) (void *, char *, int *);
        int             (*is_stoplemm) (void *, char *, int);
        int             (*is_stemstoplemm) (void *, char *, int);
}       DICT;

The 'is_stoplemm' method is called before 'lemmatize' and 'is_stemstoplemm' after.
At the end of dict/porter_english.dct:
TABLE_DICT_START
        "C",
        setup_english_stemmer,
        closedown_english_stemmer,
        engstemming,
        NULL,
        is_stopengword
TABLE_DICT_END

dict/russian_stemming.dct:
TABLE_DICT_START
        "ru_RU.KOI8-R",
        NULL,
        NULL,
        ru_RUKOI8R_stem,
        ru_RUKOI8R_is_stopword,
        NULL
TABLE_DICT_END

So the English stemmer decides whether a lexeme is a stop word after stemming, but the Russian one decides before.
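
For readers who do not want to dig through morph.c, the calling order can be modelled in a few lines of Python. This is a sketch of the idea only, not a translation of the C code: the Dictionary class, process() and toy_lemmatize are hypothetical names, and only the two hook names and their ordering come from the interface above.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Dictionary:
    """Loose model of the DICT entry; either stop hook may be absent (NULL)."""
    localename: str
    lemmatize: Callable[[str], str]
    is_stoplemm: Optional[Callable[[str], bool]] = None      # before stemming
    is_stemstoplemm: Optional[Callable[[str], bool]] = None  # after stemming


def process(d, word):
    """Return the lemma to index, or None if the word counts as a stop word."""
    if d.is_stoplemm is not None and d.is_stoplemm(word):
        return None
    lemma = d.lemmatize(word)
    if d.is_stemstoplemm is not None and d.is_stemstoplemm(lemma):
        return None
    return lemma


def toy_lemmatize(word):
    """Hypothetical stemmer that reproduces only the 'herring' -> 'her' case."""
    return "her" if word in ("herring", "herrings") else word


def is_stop(word):
    return word in {"her", "his", "and"}  # assumed sample stop list


# Configured like the English table: stop check after stemming only.
check_after = Dictionary("C", toy_lemmatize, None, is_stop)
# Configured like the Russian table: stop check before stemming only.
check_before = Dictionary("ru_RU.KOI8-R", toy_lemmatize, is_stop, None)

if __name__ == "__main__":
    print(process(check_after, "herring"))
    print(process(check_before, "herring"))
```

Under this model, moving a stop-word function from the last slot of the table to the one before it is all that switches a dictionary from the check-after behaviour to check-before.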



-- 
Teodor Sigaev
teodor@stack.net