Thread: Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords
This patch:
http://archives.postgresql.org/pgsql-patches/2007-11/msg00137.php
seems simple and useful enough that I think we ought to slip it into 8.3, even though we are far past feature freeze.

As the "simple" dictionary type stands in CVS HEAD, it is only useful as the last dictionary in a stack, since it never passes anything on as unrecognized. With the proposed AcceptAll = false option, it could be used to filter out some stopwords before feeding tokens to another dictionary. While most dictionary types have their own stopword support, some of them match stopwords after their own normalization processing, and so there's no way to filter on pre-normalized words. That seems like a good improvement, even without the specific need-example that Jan provided at the start of the thread.

Normally we'd never consider adding a new feature so late in the development cycle, but this seems small enough and useful enough to make an exception. Comments?

regards, tom lane
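To make the stacking behaviour concrete, here is a minimal toy model in Python. It is not PostgreSQL's implementation: the names make_simple, make_toy_stemmer and run_stack are hypothetical, and the "stemmer" only strips a trailing "s". It assumes the usual convention that a dictionary returns nothing for an unrecognized token, an empty list for a stopword, and a list of lexemes otherwise.

```python
from typing import Callable, Iterable, List, Optional

# Toy convention (an assumption modelled on the discussion above):
#   None   -> token not recognized, try the next dictionary in the stack
#   []     -> token is a stopword, discard it
#   [lex]  -> token recognized, these are its normalized lexemes
Dictionary = Callable[[str], Optional[List[str]]]


def make_simple(stopwords: Iterable[str], accept: bool = True) -> Dictionary:
    """Hypothetical stand-in for the "simple" dictionary type."""
    stop = {w.lower() for w in stopwords}

    def lexize(token: str) -> Optional[List[str]]:
        word = token.lower()
        if word in stop:
            return []  # stopword: drop the token here
        # accept=True: claim every non-stopword (only useful last in a stack).
        # accept=False: report it as unrecognized so later dictionaries see it.
        return [word] if accept else None

    return lexize


def make_toy_stemmer() -> Dictionary:
    """Hypothetical normalizing dictionary: strips one trailing 's'."""

    def lexize(token: str) -> Optional[List[str]]:
        word = token.lower()
        if len(word) > 1 and word.endswith("s"):
            return [word[:-1]]
        return [word]

    return lexize


def run_stack(tokens: Iterable[str], stack: List[Dictionary]) -> List[str]:
    """Offer each token to the dictionaries in order; first non-None result wins."""
    lexemes: List[str] = []
    for token in tokens:
        for dictionary in stack:
            result = dictionary(token)
            if result is not None:
                lexemes.extend(result)
                break
    return lexemes


if __name__ == "__main__":
    stemmer = make_toy_stemmer()
    # As a filter: stopwords are removed, everything else reaches the stemmer.
    print(run_stack(["The", "cats"], [make_simple(["the"], accept=False), stemmer]))
    # As it stands today: the simple dictionary swallows every token.
    print(run_stack(["The", "cats"], [make_simple(["the"]), stemmer]))
```

In this sketch the first call yields ['cat'] and the second yields ['cats'], which is the difference the proposed option is meant to expose: with accept=False the stopword list is applied to the raw token before any later dictionary normalizes it.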
Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords
From: Bruce Momjian

Tom Lane wrote:
> Normally we'd never consider adding a new feature so late in the
> development cycle, but this seems small enough and useful enough
> to make an exception. Comments?

Agreed. The logic is that textsearch is getting a major overhaul in 8.3 and it is reasonable to keep adjusting things during beta.

--
  Bruce Momjian <bruce@momjian.us> http://momjian.us
  EnterpriseDB http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords
From: Oleg Bartunov

In principle, the right way is to allow any dictionary to have an option like 'PassThrough', together with an internal function get_dict_options(dict, option) to check whether the PassThrough option is true.

Let's consider one example: removing accents. In the past I always recommended that people use regex functions to remove accents before the to_tsvector conversion, but recently I noticed that this trick doesn't work with headline(). So the only way is to have a special dictionary, dict_remove_accent, placed first, which works as a filter.

I don't remember why we left this for future releases, though.

Oleg

On Wed, 14 Nov 2007, Tom Lane wrote:
> As the "simple" dictionary type stands in CVS HEAD, it is only useful as
> the last dictionary in a stack, since it never passes anything on as
> unrecognized. With the proposed AcceptAll = false option, it could be
> used to filter out some stopwords before feeding tokens to another
> dictionary.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
Oleg Bartunov <oleg@sai.msu.su> writes:
> Let's consider one example: removing accents.
> In the past I always recommended that people use regex functions to
> remove accents before the to_tsvector conversion, but recently I noticed
> that this trick doesn't work with headline(). So the only way is to have
> a special dictionary, dict_remove_accent, placed first, which works as a
> filter.
> I don't remember why we left this for future releases, though.

That would require a system-to-dictionary API change (to be able to modify the token under inspection), no? So it's certainly something I'd say is too late for 8.3.

One thought that came to mind is that the option name should be just "Accept" not "AcceptAll". To me "All" implies that it would accept *everything* ... including stopwords.

regards, tom lane
Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords
From: Oleg Bartunov

On Wed, 14 Nov 2007, Tom Lane wrote:
> That would require a system-to-dictionary API change (to be able to
> modify the token under inspection), no? So it's certainly something

It requires one reserved option for dictionaries and the ability to read a dictionary's options. Unless somebody already has a dictionary with the same option name, this change looks harmless.

> I'd say is too late for 8.3.

Yes, we will probably come up with a better idea.

> One thought that came to mind is that the option name should be just
> "Accept" not "AcceptAll". To me "All" implies that it would accept
> *everything* ... including stopwords.

Wait, I have just remembered the problem with filters. How will it work with thesaurus?

Regards,
Oleg
Oleg Bartunov <oleg@sai.msu.su> writes:
> Wait, I have just remembered the problem with filters. How will it work
> with thesaurus?

Huh? This is just an option for the "simple" dictionary, it's got nothing to do with thesaurus AFAICS.

regards, tom lane
Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords
From: Oleg Bartunov

On Wed, 14 Nov 2007, Tom Lane wrote:
> Huh? This is just an option for the "simple" dictionary, it's got
> nothing to do with thesaurus AFAICS.

I can assign the simple dictionary as the normalization dictionary for a thesaurus.

Regards,
Oleg
Oleg Bartunov <oleg@sai.msu.su> writes:
> I can assign the simple dictionary as the normalization dictionary for a
> thesaurus.

Sure. So what? You wouldn't use this option in that case.

regards, tom lane
Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords
From: Oleg Bartunov

On Wed, 14 Nov 2007, Tom Lane wrote:
> Sure. So what? You wouldn't use this option in that case.

Right. That should be documented to avoid possible confusion.

Regards,
Oleg
Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords
From: Bruce Momjian

Added to TODO:

  * Allow text search dictionary to filter out only stop words

    http://archives.postgresql.org/pgsql-patches/2007-11/msg00081.php
Bruce Momjian <bruce@momjian.us> writes:
> Added to TODO:
> * Allow text search dictionary to filter out only stop words
> http://archives.postgresql.org/pgsql-patches/2007-11/msg00081.php

That's a poor description. I thought the TODO was something more like "allow dictionaries to change the token that is passed on to later dictionaries".

regards, tom lane
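As an illustration of what "change the token that is passed on to later dictionaries" could mean, here is a small Python sketch of an accent-removing filter placed ahead of a simple dictionary. It is a toy model only: the three-way result protocol ("pass", "filter", "accept") and the names remove_accents, make_simple and run_stack are assumptions made for this example, not an API that exists in 8.3.

```python
import unicodedata
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical result protocol (an assumption, not PostgreSQL's API):
#   ("pass", None)       token not recognized, hand it on unchanged
#   ("filter", [token])  hand a replacement token to the later dictionaries
#   ("accept", lexemes)  stop here; an empty list discards a stopword
Result = Tuple[str, Optional[List[str]]]
Dictionary = Callable[[str], Result]


def remove_accents(token: str) -> Result:
    """Toy filtering dictionary: strips combining marks from the token."""
    decomposed = unicodedata.normalize("NFKD", token)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    if stripped == token:
        return ("pass", None)  # nothing to change
    return ("filter", [stripped])  # later dictionaries see the new token


def make_simple(stopwords: Iterable[str]) -> Dictionary:
    """Toy "simple" dictionary: lower-cases, drops stopwords, accepts the rest."""
    stop = {w.lower() for w in stopwords}

    def lexize(token: str) -> Result:
        word = token.lower()
        return ("accept", [] if word in stop else [word])

    return lexize


def run_stack(token: str, stack: List[Dictionary]) -> List[str]:
    """Walk the stack, letting filters rewrite the token under inspection."""
    current = token
    for dictionary in stack:
        kind, value = dictionary(current)
        if kind == "accept":
            return value or []
        if kind == "filter" and value:
            current = value[0]
    return []  # no dictionary recognized the token, so it is dropped


if __name__ == "__main__":
    stack = [remove_accents, make_simple(["the"])]
    print(run_stack("H\u00f4tel", stack))
    print(run_stack("The", stack))
```

Because the rewriting happens inside the stack rather than in a regex applied before to_tsvector, every consumer of the configuration would see the same unaccented token, which is the property Oleg found missing with headline().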
Re: Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords
From: Bruce Momjian

Tom Lane wrote:
> That's a poor description. I thought the TODO was something more like
> "allow dictionaries to change the token that is passed on to later
> dictionaries".

TODO updated as described.