Thread: Fulltext search configuration
I have run into some problems here.
I am trying to implement Arabic full-text search on three columns.
To create a dictionary I have a Hunspell dictionary and an Arabic stopwords file.
CREATE TEXT SEARCH DICTIONARY hunspell_dic (
TEMPLATE = ispell,
DictFile = hunarabic,
AffFile = hunarabic,
StopWords = arabic
);
1) The problem is that the Hunspell package contains a .dic and a .aff file, but the configuration requires a .dict and a .affix file. I have tried to change the extensions, but with no success.
2) ts_lexize('hunspell_dic', 'ARABIC WORD') returns nothing.
3) How can I convert my .dic and .aff into a valid .dict and .affix?
4) I have read that when using dictionaries, if a word is not recognized by any dictionary it will not be indexed. I find that troublesome; I would like everything but the stop words to be indexed. I guess this might be a step that I am not ready for yet, but I just wanted to put it out there.
Also, I would like to know what the full-text search implementation process looks like, from configuration to search.
Create a dictionary, then a text search configuration, add the dictionary to the configuration, index the columns with GIN or GiST ...
What does a search look like? Does it match against the GIN/GiST index? Is that index built using the dictionary/configuration, or is the dictionary only used on search phrases?
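For reference, the outline above can be sketched end to end in SQL. This is a minimal sketch, not a tested configuration: `hunspell_dic` is from the post, while `arabic_cfg`, `documents`, `title`, and `body` are placeholder names. Note that the tsvector is built *with* the configuration, so the dictionary is applied both at index time and at query time.

```sql
-- 1) Dictionary (expects hunarabic.dict, hunarabic.affix and arabic.stop
--    in $SHAREDIR/tsearch_data/).
CREATE TEXT SEARCH DICTIONARY hunspell_dic (
    TEMPLATE = ispell,
    DictFile = hunarabic,
    AffFile  = hunarabic,
    StopWords = arabic
);

-- 2) Configuration that routes plain words through the dictionary.
CREATE TEXT SEARCH CONFIGURATION arabic_cfg (COPY = simple);
ALTER TEXT SEARCH CONFIGURATION arabic_cfg
    ALTER MAPPING FOR word, asciiword WITH hunspell_dic;

-- 3) Expression index on the normalized text (GIN; GiST also works).
CREATE INDEX docs_fts_idx ON documents
    USING gin (to_tsvector('arabic_cfg', title || ' ' || body));

-- 4) Search: the same configuration normalizes the query words,
--    and the planner matches the expression against the index.
SELECT * FROM documents
WHERE to_tsvector('arabic_cfg', title || ' ' || body)
      @@ to_tsquery('arabic_cfg', 'word');
```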
/ Moe
Hi Mohamed.
I don't know where you got the dictionary - I unsuccessfully tried the OpenOffice one (the Ayaspell one) myself, and I had no Arabic stopwords file.
Renaming the files is supposed to be enough (I did it successfully for a Thai dictionary) - the ".aff" file becoming the ".affix" one.
When I tried to create the dictionary:
CREATE TEXT SEARCH DICTIONARY ar_ispell (
TEMPLATE = ispell,
DictFile = ar_utf8,
AffFile = ar_utf8,
StopWords = english
);
I had an error:
ERREUR: mauvais format de fichier affixe pour le drapeau
CONTEXTE : ligne 42 du fichier de configuration « /usr/share/pgsql/tsearch_data/ar_utf8.affix » : « PFX Aa Y 40
(which means: wrong affix file format for flag, at line 42 of the configuration file)
Do you have an error when creating your dictionary?
Daniel
No, I don't. But ts_lexize doesn't return anything, so I figured there must be an error somewhere.
I think we are using the same dictionary, except that I am using the stopwords file and a different affix file, because using the Hunspell (Ayaspell) .aff gives me this error:
ERROR: wrong affix file format for flag
CONTEXT: line 42 of configuration file "C:/Program Files/PostgreSQL/8.3/share/tsearch_data/hunarabic.affix": "PFX Aa Y 40
/ Moe
On Mon, Feb 2, 2009 at 12:13 PM, Daniel Chiaramello <daniel.chiaramello@golog.net> wrote:
Mohamed,

We are looking into the problem.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
Ok, thank you Oleg.
I have another dictionary package, which is a conversion to Hunspell as well:
http://wiki.services.openoffice.org/wiki/Dictionaries#Arabic_.28North_Africa_and_Middle_East.29
(Conversion of Buckwalter's Arabic morphological analyser) 2006-02-08
And running that gives me this error (again in the affix file):
ERROR: wrong affix file format for flag
CONTEXT: line 560 of configuration file "C:/Program Files/PostgreSQL/8.3/share/tsearch_data/arabic_utf8_alias.affix": "PFX 1013 Y 6
"
/ Moe
On Mon, Feb 2, 2009 at 2:41 PM, Oleg Bartunov <oleg@sai.msu.su> wrote:
Oleg, as I mentioned earlier, I have a different .affix file that I got from Andrew along with the stop file; I get no errors creating the dictionary with that one, but I get nothing out of ts_lexize.
That one is 406,219 bytes.
And the Hunspell one (the first) is 406,229 bytes.
A little too close, don't you think?
It might be that the Arabic Hunspell (Ayaspell) affix file is damaged on some lines and that I got the fixed one from Andrew.
Just wanted to let you know.
/ Moe
On Mon, Feb 2, 2009 at 3:25 PM, Mohamed <mohamed5432154321@gmail.com> wrote:
Mohamed,

Comment out the line in ar.affix:

#FLAG long

and creation of the ispell dictionary will work. This is a temporary solution; Teodor is working on fixing affix autorecognition.

I can't say anything about testing, since somebody should provide a first test case. I don't know how to type Arabic :)

Oleg
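The workaround above (rename the Hunspell files to the extensions PostgreSQL expects, then comment out the FLAG directive) can be sketched as a shell session. File names and contents here are illustrative, and `sed -i` is the GNU form (BSD sed needs `-i ''`):

```shell
# Hypothetical affix file standing in for the real Ayaspell ar.aff.
printf 'SET UTF-8\nFLAG long\nPFX Aa Y 40\n' > ar_utf8.affix

# The ispell template looks for <name>.dict and <name>.affix under
# share/tsearch_data/, so the Hunspell .dic/.aff are simply renamed:
#   cp ar.dic $SHAREDIR/tsearch_data/ar_utf8.dict
#   cp ar.aff $SHAREDIR/tsearch_data/ar_utf8.affix

# Oleg's temporary workaround: comment out the FLAG directive,
# which PostgreSQL 8.3 cannot parse yet.
sed -i 's/^FLAG long/#FLAG long/' ar_utf8.affix

grep '^#FLAG' ar_utf8.affix   # the directive is now commented out
```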
Hehe, ok..
I don't know either, but I took some lines from Al Jazeera (http://aljazeera.net/portal),
made the change you said, created the dictionary successfully, and tried this:
select ts_lexize('ayaspell', 'استشهد فلسطيني وأصيب ثلاثة في غارة إسرائيلية جديدة')
but I got nothing... :(
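One likely reason for the empty result, worth checking: ts_lexize takes a single lexeme, not a whole phrase, so passing a full sentence is expected to return nothing. To exercise a sentence, let the parser tokenize it first via to_tsvector or ts_debug. A sketch, assuming the dictionary is wired into a configuration (`arabic_cfg` is a placeholder name):

```sql
-- ts_lexize: one word at a time
SELECT ts_lexize('ayaspell', 'فلسطيني');

-- For a whole sentence, the parser splits it into tokens first:
SELECT to_tsvector('arabic_cfg', 'استشهد فلسطيني وأصيب ثلاثة');

-- ts_debug shows which dictionary handles each token:
SELECT * FROM ts_debug('arabic_cfg', 'استشهد فلسطيني');
```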
Is there a way of making sure that words that are not recognized also get indexed/searched for? (Not that I think this is the problem)
/ Moe
On Mon, Feb 2, 2009 at 3:50 PM, Oleg Bartunov <oleg@sai.msu.su> wrote:
On Mon, 2 Feb 2009, Mohamed wrote:
> Hehe, ok..
> I don't know either, but I took some lines from Al-Jazeera :
> http://aljazeera.net/portal
>
> I just made the change you said, created it successfully and tried this :
>
> select ts_lexize('ayaspell', '?????? ??????? ????? ????? ?? ???? ????????? ?????')
>
> but I got nothing... :(

Mohamed, what did you expect from ts_lexize ? Please, provide us valuable information, else we can't help you.

> Is there a way of making sure that words not recognized also get
> indexed/searched for ? (Not that I think this is the problem)

yes

> On Mon, Feb 2, 2009 at 3:50 PM, Oleg Bartunov <oleg@sai.msu.su> wrote:
>> Mohamed,
>>
>> Comment out the line in ar.affix
>> #FLAG long
>> and creation of the ispell dictionary will work. This is a temporary solution; Teodor is working on fixing affix autorecognition.
>>
>> I can't say anything about testing, since somebody should provide the first test case. I don't know how to type Arabic :)
>>
>> Oleg
>>
>> On Mon, 2 Feb 2009, Mohamed wrote:
>>> Oleg, like I mentioned earlier, I have a different .affix file that I got from Andrew along with the stop file. I get no errors creating the dictionary using that one, but I get nothing out of ts_lexize.
>>> The size of that one is 406,219 bytes, and the size of the hunspell one (the first) is 406,229 bytes.
>>>
>>> A little too close, don't you think ?
>>>
>>> It might be that the Arabic hunspell (ayaspell) affix file is damaged on some lines and I got the fixed one from Andrew.
>>>
>>> / Moe
>>>
>>> On Mon, Feb 2, 2009 at 3:25 PM, Mohamed <mohamed5432154321@gmail.com> wrote:
>>>> Ok, thank you Oleg.
>>>> I have another dictionary package which is a conversion to hunspell as well:
>>>> http://wiki.services.openoffice.org/wiki/Dictionaries#Arabic_.28North_Africa_and_Middle_East.29
>>>> (Conversion of Buckwalter's Arabic morphological analyser) 2006-02-08
>>>>
>>>> Running that gives me this error (again the affix file) :
>>>>
>>>> ERROR: wrong affix file format for flag
>>>> CONTEXT: line 560 of configuration file "C:/Program Files/PostgreSQL/8.3/share/tsearch_data/arabic_utf8_alias.affix": "PFX 1013 Y 6"
>>>>
>>>> / Moe
>>>>
>>>> On Mon, Feb 2, 2009 at 2:41 PM, Oleg Bartunov <oleg@sai.msu.su> wrote:
>>>>> Mohamed,
>>>>>
>>>>> We are looking into the problem.
>>>>>
>>>>> Oleg
>>>>>
>>>>> On Mon, 2 Feb 2009, Mohamed wrote:
>>>>>> No, I don't. But ts_lexize doesn't return anything, so I figured there must be an error somehow.
>>>>>> I think we are using the same dictionary, plus I am using the stopwords file and a different affix file, because using the hunspell (ayaspell) .aff gives me this error :
>>>>>>
>>>>>> ERROR: wrong affix file format for flag
>>>>>> CONTEXT: line 42 of configuration file "C:/Program Files/PostgreSQL/8.3/share/tsearch_data/hunarabic.affix": "PFX Aa Y 40"
>>>>>>
>>>>>> / Moe

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
On Mon, 2 Feb 2009, Oleg Bartunov wrote:
> On Mon, 2 Feb 2009, Mohamed wrote:
>> Is there a way of making sure that words not recognized also get
>> indexed/searched for ? (Not that I think this is the problem)
>
> yes

Read http://www.postgresql.org/docs/8.3/static/textsearch-dictionaries.html

"A text search configuration binds a parser together with a set of dictionaries to process the parser's output tokens. For each token type that the parser can return, a separate list of dictionaries is specified by the configuration. When a token of that type is found by the parser, each dictionary in the list is consulted in turn, until some dictionary recognizes it as a known word. If it is identified as a stop word, or if no dictionary recognizes the token, it will be discarded and not indexed or searched for. The general rule for configuring a list of dictionaries is to place first the most narrow, most specific dictionary, then the more general dictionaries, finishing with a very general dictionary, like a Snowball stemmer or simple, which recognizes everything."

Quick example:

CREATE TEXT SEARCH CONFIGURATION arabic (
    COPY = english
);

=# \dF+ arabic
Text search configuration "public.arabic"
Parser: "pg_catalog.default"
      Token      | Dictionaries
-----------------+--------------
 asciihword      | english_stem
 asciiword       | english_stem
 email           | simple
 file            | simple
 float           | simple
 host            | simple
 hword           | english_stem
 hword_asciipart | english_stem
 hword_numpart   | simple
 hword_part      | english_stem
 int             | simple
 numhword        | simple
 numword         | simple
 sfloat          | simple
 uint            | simple
 url             | simple
 url_path        | simple
 version         | simple
 word            | english_stem

Then you can alter this configuration.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
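The end-to-end flow asked about at the start of the thread (configuration → index → query) looks roughly like this; the table and column names below are hypothetical:

```sql
-- Build a GIN index over the tsvector produced by the configuration.
CREATE INDEX articles_fts_idx ON articles
    USING gin (to_tsvector('arabic', coalesce(title, '') || ' ' || coalesce(body, '')));

-- Search: the query text goes through the same configuration, and the
-- @@ match against the identical to_tsvector expression can use the index.
SELECT id, title
FROM articles
WHERE to_tsvector('arabic', coalesce(title, '') || ' ' || coalesce(body, ''))
      @@ plainto_tsquery('arabic', 'search words here');
```

So the dictionaries are used on both sides: at index time, when the tsvectors are built, and at query time, when the search phrase is parsed, which keeps the two normalized the same way.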
On Mon, Feb 2, 2009 at 4:34 PM, Oleg Bartunov <oleg@sai.msu.su> wrote:
> On Mon, 2 Feb 2009, Oleg Bartunov wrote:
>> On Mon, 2 Feb 2009, Mohamed wrote:
>>> Hehe, ok..
>>> I don't know either but I took some lines from Al-Jazeera :
>>> http://aljazeera.net/portal
>>>
>>> just made the change you said and created it successfully and tried this :
>>>
>>> select ts_lexize('ayaspell', '?????? ??????? ????? ????? ?? ???? ????????? ?????')
>>>
>>> but I got nothing... :(
>>
>> Mohamed, what did you expect from ts_lexize ? Please, provide us valuable
>> information, else we can't help you.
What I expected was something to be returned. After all, they are valid words taken from an article (perhaps you don't see the words, only "???..."). Am I wrong to expect something? Should I set up the configuration completely first?
SELECT ts_lexize('norwegian_ispell', 'overbuljongterningpakkmesterassistent');
{over,buljong,terning,pakk,mester,assistent}
Check out this article if you need a sample :
http://www.aljazeera.net/NR/exeres/103CFC06-0195-47FD-A29F-2C84B5A15DD0.htm
> Read http://www.postgresql.org/docs/8.3/static/textsearch-dictionaries.html
>
>>> Is there a way of making sure that words not recognized also get
>>> indexed/searched for ? (Not that I think this is the problem)
>>
>> yes
>
> "A text search configuration binds a parser together with a set of dictionaries to process the parser's output tokens. For each token type that the parser can return, a separate list of dictionaries is specified by the configuration. When a token of that type is found by the parser, each dictionary in the list is consulted in turn, until some dictionary recognizes it as a known word. If it is identified as a stop word, or if no dictionary recognizes the token, it will be discarded and not indexed or searched for. The general rule for configuring a list of dictionaries is to place first the most narrow, most specific dictionary, then the more general dictionaries, finishing with a very general dictionary, like a Snowball stemmer or simple, which recognizes everything."
Ok, but I don't have a Thesaurus or a Snowball stemmer to fall back on. So words that are real words but for some reason are not recognized "will be discarded and not indexed or searched for", which I consider a problem, since I don't trust my configuration to cover everything.
Is this not a valid concern?
> Quick example:
>
> CREATE TEXT SEARCH CONFIGURATION arabic (
>     COPY = english
> );
>
> =# \dF+ arabic
> Text search configuration "public.arabic"
> Parser: "pg_catalog.default"
>       Token      | Dictionaries
> -----------------+--------------
>  asciihword      | english_stem
>  asciiword       | english_stem
>  email           | simple
>  file            | simple
>  float           | simple
>  host            | simple
>  hword           | english_stem
>  hword_asciipart | english_stem
>  hword_numpart   | simple
>  hword_part      | english_stem
>  int             | simple
>  numhword        | simple
>  numword         | simple
>  sfloat          | simple
>  uint            | simple
>  url             | simple
>  url_path        | simple
>  version         | simple
>  word            | english_stem
>
> Then you can alter this configuration.
Yes, I figured that's the next step, but I thought I should get ts_lexize to work first? What do you think?
Just a thought, say I have this :
ALTER TEXT SEARCH CONFIGURATION pg
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
word, hword, hword_part
WITH pga_ardict, ar_ispell, ar_stem;
Is it possible to keep adding dictionaries, to get both Arabic and English matches on the same column (Arabic speakers tend to mix languages), like this :
ALTER TEXT SEARCH CONFIGURATION pg
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
word, hword, hword_part
WITH pga_ardict, ar_ispell, ar_stem, pg_english_dict, english_ispell, english_stem;
Will something like that work ?
/ Moe
Mohamed, please, try to read the docs and think a bit first.

On Mon, 2 Feb 2009, Mohamed wrote:
>> select ts_lexize('ayaspell', '?????? ??????? ????? ????? ?? ???? ????????? ?????')
>>
>> but I got nothing... :(

You did wrong ! ts_lexize expects a word, not a phrase !

> What I expected was something to be returned. After all they are valid words
> taken from an article. Am I wrong to expect something? Should I go for
> setting up the configuration completely first?

You should definitely read the documentation:
http://www.postgresql.org/docs/8.3/static/textsearch-debugging.html#TEXTSEARCH-DICTIONARY-TESTING

Period.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
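Oleg's point can be illustrated with a short sketch (the `ayaspell` dictionary and an `arabic` configuration name come from this thread; actual output depends on the dictionary files):

```sql
-- ts_lexize tests ONE dictionary with ONE word; passing a multi-word
-- string returns NULL because no single dictionary entry matches it.
SELECT ts_lexize('ayaspell', 'word');

-- To see how a whole phrase is tokenized and which dictionary handles
-- each token, use ts_debug with a configuration instead:
SELECT * FROM ts_debug('arabic', 'a whole sentence of text');
```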
A little harsh, are we? I have read the WHOLE documentation; it's a bit long, so confusion might arise, plus I am not familiar with Postgres AT ALL, so the confusion grows.
Perhaps I am an idiot and you don't like helping idiots or perhaps it's something else? Which one is it?
If you don't want to help me, then DON'T ! Period.
The mailing list is not yours.
.
.
.
I have tried ts_lexize with words, lots of them, and I have yet to get anything out of it!
/ Moe