Re: contrib/tsearch - Mailing list pgsql-hackers

From Oleg Bartunov
Subject Re: contrib/tsearch
Date
Msg-id Pine.GSO.4.44.0209051313210.3967-100000@ra.sai.msu.su
In response to Re: contrib/tsearch  ("Christopher Kings-Lynne" <chriskl@familyhealth.com.au>)
Responses Re: contrib/tsearch  ("Christopher Kings-Lynne" <chriskl@familyhealth.com.au>)
List pgsql-hackers
On Thu, 5 Sep 2002, Christopher Kings-Lynne wrote:

> Hmmm...thinking about it, maybe 'herring' is being reduced to 'her' after
> the stemming process and hence is thought to be a stopword?  This is a bug,
> but how should it be fixed?
>

How to use stop words is a difficult question. We'll see what we can
do. Probably Porter's stemming algorithm has a problem here:
'herring' -> 'her'~'ring'
(I have a demo of an English-Russian stemmer, so you can play with it)
http://intra.astronet.ru/db/lingua/snowball/
I'll ask Martin Porter whether there could be an error in the stemmer.
But I think the problem is in the concept of using stop words.
Should we check for stop words before stemming or after?
In the first case we have to collect all forms of the stop words, which is doable
but difficult to maintain; in the latter we'll have the current problem.
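
To make the two orderings concrete, here is a rough sketch in Python. It is
not tsearch code: STOP_WORDS is a made-up list and toy_stem is a hypothetical
stand-in that only mimics the 'herring' -> 'her' reduction, not the real
Porter stemmer.

```python
# Toy illustration only. STOP_WORDS and toy_stem are hypothetical stand-ins,
# not tsearch's stop list and not the real Porter stemmer.
STOP_WORDS = {"her", "him", "his", "the"}


def toy_stem(word):
    """Strip a trailing 'ing', then undouble a final consonant (herring -> herr -> her)."""
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeioulsz":
            word = word[:-1]
    return word


def stop_check_after_stemming(words):
    """Stem first, then drop stems found in the stop list (the current problem)."""
    stems = [toy_stem(w) for w in words]
    return [s for s in stems if s not in STOP_WORDS]


def stop_check_before_stemming(words):
    """Drop stop words first, then stem; the list must hold every surface form."""
    return [toy_stem(w) for w in words if w not in STOP_WORDS]


words = ["herring", "her"]
print(stop_check_after_stemming(words))   # 'herring' is lost together with 'her'
print(stop_check_before_stemming(words))  # 'herring' survives as the stem 'her'
```

With the check after stemming, 'herring' disappears because its stem collides
with a stop word. With the check before stemming it survives, but only the
exact forms present in the list are filtered.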

It's time for beta1 and I'm not sure we can work on this issue
right now, but I feel a lot of pressure from tsearch users :-)
If people want to help us, why not work on a stop-word list that includes
all forms? In any case, we are not native English speakers, so don't expect us
to create a more or less decent list. The programming changes are trivial; probably
we'll end up, for the moment, just using a compile-time option.
As always, your patches are welcome!

btw, you can test your queries much more easily:

list=# select 'herring'::mquery_txt;
ERROR:  Your query contained only stopword(s), ignored
list=# select 'herring'::query_txt;
 query_txt
-----------
 'herring'
(1 row)




> Although, tests don't support that:
>
> usa=# select food_id, brand,description,ftiidx from food_foods where ftiidx
> ## 'himring';
>  food_id | brand | description | ftiidx
> ---------+-------+-------------+--------
> (0 rows)
> usa=# select food_id, brand,description,ftiidx from food_foods where ftiidx
> ## 'hisring';
>  food_id | brand | description | ftiidx
> ---------+-------+-------------+--------
> (0 rows)
>
> usa=# select food_id, brand,description,ftiidx from food_foods where ftiidx
> ## 'hising';
>  food_id | brand | description | ftiidx
> ---------+-------+-------------+--------
> (0 rows)
>
> usa=# select food_id, brand,description,ftiidx from food_foods where ftiidx
> ## 'himing';
>  food_id | brand | description | ftiidx
> ---------+-------+-------------+--------
> (0 rows)
>
> All work...?
>
> Chris
>
> > -----Original Message-----
> > From: pgsql-hackers-owner@postgresql.org
> > [mailto:pgsql-hackers-owner@postgresql.org]On Behalf Of Christopher
> > Kings-Lynne
> > Sent: Thursday, 5 September 2002 2:36 PM
> > To: Hackers
> > Subject: [HACKERS] contrib/tsearch
> >
> >
> > Hi Oleg/Teodor,
> >
> > I'm sorry to keep posting bugs without patches, but I'm just
> > hoping you guys
> > know the answer faster than I...I know you're busy.
> >
> > What does tsearch have against the word 'herring' (as in the
> > fish).  Why is
> > it considered a stopword?
> >
> > Attached is example queries...
> >
> > Chris
> >
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster
>
Regards,    Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83




