Thread: Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords

Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords

From: Tom Lane
This patch:
http://archives.postgresql.org/pgsql-patches/2007-11/msg00137.php
seems simple and useful enough that I think we ought to slip it into
8.3, even though we are far past feature freeze.

As the "simple" dictionary type stands in CVS HEAD, it is only useful as
the last dictionary in a stack, since it never passes anything on as
unrecognized.  With the proposed AcceptAll = false option, it could be
used to filter out some stopwords before feeding tokens to another
dictionary.  While most dictionary types have their own stopword support,
some of them match stopwords after their own normalization processing,
and so there's no way to filter on pre-normalized words.  That seems
like a good improvement, even without the specific need-example that
Jan provided at the start of the thread.
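
A minimal sketch of the stacking behavior described above, written as a toy Python model rather than the actual C code in the patch. It assumes the usual dictionary convention: an empty result means "stopword, drop it", a non-empty result means "recognized", and no result means "unrecognized, try the next dictionary". The names make_simple, make_toy_stemmer and run_stack, and the accept parameter standing in for the proposed option, are hypothetical.

```python
from typing import Callable, Iterable, List, Optional

# A dictionary maps a token to:
#   []     -> stopword: drop the token, stop consulting the stack
#   [lex]  -> recognized: emit the lexeme, stop consulting the stack
#   None   -> unrecognized: hand the token to the next dictionary
Dictionary = Callable[[str], Optional[List[str]]]


def make_simple(stopwords: Iterable[str], accept: bool = True) -> Dictionary:
    """Toy model of the "simple" dictionary (assumed behavior, not the real code)."""
    stop = {w.lower() for w in stopwords}

    def lexize(token: str) -> Optional[List[str]]:
        word = token.lower()
        if word in stop:
            return []                      # filtered out as a stopword
        return [word] if accept else None  # accept=False passes the token on

    return lexize


def make_toy_stemmer() -> Dictionary:
    """Hypothetical stand-in for a later dictionary: strips a trailing 's'."""

    def lexize(token: str) -> Optional[List[str]]:
        word = token.lower()
        if word.endswith("s") and len(word) > 3:
            return [word[:-1]]
        return [word]

    return lexize


def run_stack(stack: List[Dictionary], token: str) -> List[str]:
    """Consult dictionaries in order until one recognizes the token."""
    for dictionary in stack:
        result = dictionary(token)
        if result is not None:
            return result
    return []  # nothing recognized the token, so it is discarded


filtering = [make_simple(["the"], accept=False), make_toy_stemmer()]
swallowing = [make_simple(["the"], accept=True), make_toy_stemmer()]
print(run_stack(filtering, "The"), run_stack(filtering, "Stacks"))
print(run_stack(swallowing, "Stacks"))
```

With accept=True the simple dictionary recognizes every token, so the stemmer behind it is never consulted; with accept=False it only removes its own stopwords and leaves the rest to the next dictionary.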

Normally we'd never consider adding a new feature so late in the
development cycle, but this seems small enough and useful enough
to make an exception.  Comments?

            regards, tom lane

Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords

From: Bruce Momjian
Tom Lane wrote:
> This patch:
> http://archives.postgresql.org/pgsql-patches/2007-11/msg00137.php
> seems simple and useful enough that I think we ought to slip it into
> 8.3, even though we are far past feature freeze.
>
> As the "simple" dictionary type stands in CVS HEAD, it is only useful as
> the last dictionary in a stack, since it never passes anything on as
> unrecognized.  With the proposed AcceptAll = false option, it could be
> used to filter out some stopwords before feeding tokens to another
> dictionary.  While most dictionary types have their own stopword support,
> some of them match stopwords after their own normalization processing,
> and so there's no way to filter on pre-normalized words.  That seems
> like a good improvement, even without the specific need-example that
> Jan provided at the start of the thread.
>
> Normally we'd never consider adding a new feature so late in the
> development cycle, but this seems small enough and useful enough
> to make an exception.  Comments?

Agreed.  The logic is that textsearch is getting a major overhaul in 8.3
and it is reasonable to keep adjusting things during beta.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords

From: Oleg Bartunov
In principle, the right way is to allow any dictionary to have an option
like 'PassThrough', with an internal function get_dict_options(dict, option)
to check whether the PassThrough option is true.
Let's consider one example: removing accents.
In the past I always recommended that people use regex functions before
the to_tsvector conversion to remove accents, but recently I noticed that
this trick doesn't work with headline(). So the only way is to put a
special dictionary, dict_remove_accent, in front, which works as a filter.
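
A minimal sketch of what such a filter could look like, as a toy Python model. This is an assumption about the intended behavior, not an existing API: a filter rewrites the token, and the rewritten token is what later dictionaries see. The names remove_accents, simple_lexize and run_stack and the is_filter flag are hypothetical; the accent stripping relies on Python's standard unicodedata module.

```python
import unicodedata
from typing import Callable, List, Optional, Tuple


def remove_accents(token: str) -> str:
    """Filter step: decompose characters, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def simple_lexize(token: str) -> Optional[List[str]]:
    """Stand-in for an ordinary dictionary: lowercases and accepts everything."""
    return [token.lower()]


def run_stack(stack: List[Tuple[bool, Callable]], token: str) -> List[str]:
    """Each entry is (is_filter, function).

    A filter replaces the token and processing continues with the next entry;
    an ordinary dictionary ends processing as soon as it recognizes the token.
    """
    for is_filter, fn in stack:
        if is_filter:
            token = fn(token)   # later dictionaries see the rewritten token
            continue
        result = fn(token)
        if result is not None:
            return result
    return []                   # nothing recognized the token


with_filter = [(True, remove_accents), (False, simple_lexize)]
without_filter = [(False, simple_lexize)]
print(run_stack(with_filter, "H\u00f4tel"))
print(run_stack(without_filter, "H\u00f4tel"))
```

Because the filter runs inside the dictionary stack, the same rewriting would apply wherever the configuration is used, which is the property the regex-before-to_tsvector approach lacks.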

I don't remember why we left this for future releases, though.

Oleg

On Wed, 14 Nov 2007, Tom Lane wrote:

> This patch:
> http://archives.postgresql.org/pgsql-patches/2007-11/msg00137.php
> seems simple and useful enough that I think we ought to slip it into
> 8.3, even though we are far past feature freeze.
>
> As the "simple" dictionary type stands in CVS HEAD, it is only useful as
> the last dictionary in a stack, since it never passes anything on as
> unrecognized.  With the proposed AcceptAll = false option, it could be
> used to filter out some stopwords before feeding tokens to another
> dictionary.  While most dictionary types have their own stopword support,
> some of them match stopwords after their own normalization processing,
> and so there's no way to filter on pre-normalized words.  That seems
> like a good improvement, even without the specific need-example that
> Jan provided at the start of the thread.
>
> Normally we'd never consider adding a new feature so late in the
> development cycle, but this seems small enough and useful enough
> to make an exception.  Comments?
>
>             regards, tom lane
>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords

From: Tom Lane
Oleg Bartunov <oleg@sai.msu.su> writes:
> Let's consider one example: removing accents.
> In the past I always recommended that people use regex functions before
> the to_tsvector conversion to remove accents, but recently I noticed that
> this trick doesn't work with headline(). So the only way is to put a
> special dictionary, dict_remove_accent, in front, which works as a filter.

> I don't remember why we left this for future releases, though.

That would require a system-to-dictionary API change (to be able to
modify the token under inspection), no?  So it's certainly something
I'd say is too late for 8.3.

One thought that came to mind is that the option name should be just
"Accept" not "AcceptAll".  To me "All" implies that it would accept
*everything* ... including stopwords.

            regards, tom lane

Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords

From: Oleg Bartunov
On Wed, 14 Nov 2007, Tom Lane wrote:

> Oleg Bartunov <oleg@sai.msu.su> writes:
>> Let's consider one example: removing accents.
>> In the past I always recommended that people use regex functions before
>> the to_tsvector conversion to remove accents, but recently I noticed that
>> this trick doesn't work with headline(). So the only way is to put a
>> special dictionary, dict_remove_accent, in front, which works as a filter.
>
>> I don't remember why we left this for future releases, though.
>
> That would require a system-to-dictionary API change (to be able to
> modify the token under inspection), no?  So it's certainly something

It requires one reserved option for dictionaries and the ability to get a
dictionary option.  Unless somebody has a dictionary with the same option
name, this change looks harmless.

> I'd say is too late for 8.3.

Yes, we will probably come up with a better idea.

>
> One thought that came to mind is that the option name should be just
> "Accept" not "AcceptAll".  To me "All" implies that it would accept
> *everything* ... including stopwords.

Wait, I remember the problem with filters: how will it work with a thesaurus?

     Regards,
         Oleg

Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords

From: Tom Lane
Oleg Bartunov <oleg@sai.msu.su> writes:
> On Wed, 14 Nov 2007, Tom Lane wrote:
>> One thought that came to mind is that the option name should be just
>> "Accept" not "AcceptAll".  To me "All" implies that it would accept
>> *everything* ... including stopwords.

> Wait, I remember the problem with filters: how will it work with a thesaurus?

Huh?  This is just an option for the "simple" dictionary, it's got
nothing to do with thesaurus AFAICS.

            regards, tom lane

Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords

From: Oleg Bartunov
On Wed, 14 Nov 2007, Tom Lane wrote:

> Oleg Bartunov <oleg@sai.msu.su> writes:
>> On Wed, 14 Nov 2007, Tom Lane wrote:
>>> One thought that came to mind is that the option name should be just
>>> "Accept" not "AcceptAll".  To me "All" implies that it would accept
>>> *everything* ... including stopwords.
>
>> Wait, I remember the problem with filters: how will it work with a thesaurus?
>
> Huh?  This is just an option for the "simple" dictionary, it's got
> nothing to do with thesaurus AFAICS.

I can assign the simple dictionary as a normalization dictionary for a thesaurus.

     Regards,
         Oleg

Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords

From: Tom Lane
Oleg Bartunov <oleg@sai.msu.su> writes:
> On Wed, 14 Nov 2007, Tom Lane wrote:
>> Huh?  This is just an option for the "simple" dictionary, it's got
>> nothing to do with thesaurus AFAICS.

> I can assign the simple dictionary as a normalization dictionary for a thesaurus.

Sure.  So what?  You wouldn't use this option in that case.

            regards, tom lane

Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords

From: Oleg Bartunov
On Wed, 14 Nov 2007, Tom Lane wrote:

> Oleg Bartunov <oleg@sai.msu.su> writes:
>> On Wed, 14 Nov 2007, Tom Lane wrote:
>>> Huh?  This is just an option for the "simple" dictionary, it's got
>>> nothing to do with thesaurus AFAICS.
>
>> I can assign the simple dictionary as a normalization dictionary for a thesaurus.
>
> Sure.  So what?  You wouldn't use this option in that case.

Right. That should be documented to avoid possible confusion.

     Regards,
         Oleg

Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords

From: Bruce Momjian
Added to TODO:

* Allow text search dictionary to filter out only stop words

  http://archives.postgresql.org/pgsql-patches/2007-11/msg00081.php


---------------------------------------------------------------------------

Tom Lane wrote:
> Oleg Bartunov <oleg@sai.msu.su> writes:
> > Let's consider one example: removing accents.
> > In the past I always recommended that people use regex functions before
> > the to_tsvector conversion to remove accents, but recently I noticed that
> > this trick doesn't work with headline(). So the only way is to put a
> > special dictionary, dict_remove_accent, in front, which works as a filter.
>
> > I don't remember why we left this for future releases, though.
>
> That would require a system-to-dictionary API change (to be able to
> modify the token under inspection), no?  So it's certainly something
> I'd say is too late for 8.3.
>
> One thought that came to mind is that the option name should be just
> "Accept" not "AcceptAll".  To me "All" implies that it would accept
> *everything* ... including stopwords.
>
>             regards, tom lane

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Bruce Momjian <bruce@momjian.us> writes:
> Added to TODO:

> * Allow text search dictionary to filter out only stop words

>   http://archives.postgresql.org/pgsql-patches/2007-11/msg00081.php

That's a poor description.  I thought the TODO was something more like
"allow dictionaries to change the token that is passed on to later
dictionaries".

            regards, tom lane

Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > Added to TODO:
>
> > * Allow text search dictionary to filter out only stop words
>
> >   http://archives.postgresql.org/pgsql-patches/2007-11/msg00081.php
>
> That's a poor description.  I thought the TODO was something more like
> "allow dictionaries to change the token that is passed on to later
> dictionaries".

TODO updated as described.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +