Thread: Fulltext search configuration

Fulltext search configuration

From
Mohamed
Date:
I have run into some problems here.

I am trying to implement Arabic fulltext search on three columns.

To create a dictionary I have a hunspell dictionary and an Arabic stop-word file.

CREATE TEXT SEARCH DICTIONARY hunspell_dic (
    TEMPLATE = ispell,
    DictFile = hunarabic,
    AffFile = hunarabic,
    StopWords = arabic
);

1) The problem is that the hunspell package contains a .dic and an .aff file, but the configuration requires a .dict and an .affix file. I have tried changing the file extensions, but with no success.

2) ts_lexize('hunspell_dic', 'ARABIC WORD') returns nothing.

3) How can I convert my .dic and .aff files to valid .dict and .affix files?

4) I have read that, when using dictionaries, a word that is not recognized by any dictionary will not be indexed. I find that troublesome. I would like everything but the stop words to be indexed. I guess this might be a step that I am not ready for yet, but I just wanted to put it out there.



Also, I would like to know what the process of implementing fulltext search looks like, from configuration to search.

Create a dictionary, then a text search configuration, add the dictionary to the configuration, index the columns with GIN or GiST ...

What does a search look like? Does it match against the GIN/GiST index? Has that index been built using the dictionary/configuration, or is the dictionary only applied to the search phrases?

/ Moe
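
A minimal sketch of the flow asked about above, from configuration to search, assuming the hunspell_dic dictionary from this message has been created successfully. The names arabic_cfg, articles, id, title, summary and body are hypothetical and do not come from the thread. In this sketch the configuration is applied twice: once when the index is built (through to_tsvector) and once when the search phrase is parsed (through plainto_tsquery).

```sql
-- Hypothetical names: arabic_cfg, articles, id, title, summary, body.

-- 1. A configuration ties the parser to dictionaries. Start from the
--    built-in "simple" configuration, which uses the default parser.
CREATE TEXT SEARCH CONFIGURATION arabic_cfg (COPY = simple);

-- 2. Send word-like tokens to the ispell dictionary first. A token that
--    hunspell_dic does not recognize falls through to "simple".
ALTER TEXT SEARCH CONFIGURATION arabic_cfg
    ALTER MAPPING FOR word, hword, hword_part
    WITH hunspell_dic, simple;

-- 3. Index the three columns. The index stores the lexemes produced by
--    arabic_cfg, so the dictionaries are applied at index time.
CREATE INDEX articles_fts_idx ON articles USING gin (
    to_tsvector('arabic_cfg',
        coalesce(title, '') || ' ' || coalesce(summary, '') || ' ' || coalesce(body, ''))
);

-- 4. Search. The query text goes through the same configuration, and the
--    WHERE expression repeats the index expression so the planner can use
--    the index.
SELECT id, title
FROM articles
WHERE to_tsvector('arabic_cfg',
        coalesce(title, '') || ' ' || coalesce(summary, '') || ' ' || coalesce(body, ''))
      @@ plainto_tsquery('arabic_cfg', 'search words here');
```

Naming the configuration explicitly in to_tsvector keeps the expression immutable, which an expression index requires.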



Re: Fulltext search configuration

From
Daniel Chiaramello
Date:
Hi Mohamed.

I don't know where you got the dictionary. I tried the OpenOffice one myself (the Ayaspell one) without success, and I had no Arabic stop-words file.

Renaming the files is supposed to be enough (I did it successfully for a Thai dictionary), with the ".aff" file becoming the ".affix" one.
When I tried to create the dictionary:

CREATE TEXT SEARCH DICTIONARY ar_ispell (
    TEMPLATE = ispell,
    DictFile = ar_utf8,
    AffFile = ar_utf8,
    StopWords = english
);

I got an error:

ERROR:  wrong affix file format for flag
CONTEXT:  line 42 of configuration file "/usr/share/pgsql/tsearch_data/ar_utf8.affix": "PFX Aa      Y       40

Do you get an error when creating your dictionary?

Daniel


Re: Fulltext search configuration

From
Mohamed
Date:
No, I don't. But ts_lexize doesn't return anything, so I figured there must be an error somewhere.

I think we are using the same dictionary, except that I am also using the stop-words file and a different affix file, because the hunspell (Ayaspell) .aff file gives me this error:

ERROR:  wrong affix file format for flag
CONTEXT:  line 42 of configuration file "C:/Program Files/PostgreSQL/8.3/share/tsearch_data/hunarabic.affix": "PFX Aa Y 40

/ Moe





Re: Fulltext search configuration

From
Oleg Bartunov
Date:
Mohamed,

We are looking into the problem.

Oleg

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: Fulltext search configuration

From
Mohamed
Date:
OK, thank you Oleg.

I have another dictionary package, which is also a conversion to hunspell:

http://wiki.services.openoffice.org/wiki/Dictionaries#Arabic_.28North_Africa_and_Middle_East.29
(Conversion of Buckwalter's Arabic morphological analyser) 2006-02-08

Running that gives me this error (again from the affix file):

ERROR:  wrong affix file format for flag
CONTEXT:  line 560 of configuration file "C:/Program Files/PostgreSQL/8.3/share/tsearch_data/arabic_utf8_alias.affix": "PFX 1013 Y 6
"

/ Moe




Re: Fulltext search configuration

From
Mohamed
Date:
Oleg, as I mentioned earlier, I have a different .affix file that I got from Andrew along with the stop file. I get no errors creating the dictionary with that one, but I get nothing out of ts_lexize.

The size of that one is: 406,219 bytes
And the size of the hunspell one (the first): 406,229 bytes

A little too close, don't you think?

It might be that the Arabic hunspell (Ayaspell) affix file is damaged on some lines and that the one I got from Andrew is the fixed version.

Just wanted to let you know.

/ Moe





Re: Fulltext search configuration

From
Oleg Bartunov
Date:
Mohamed,

Comment out the FLAG line in ar.affix, so that it reads
#FLAG   long
and creation of the ispell dictionary will work.
This is a temporary solution.
Teodor is working on fixing the autorecognition of the affix format.

I can't say anything about testing, since somebody needs to provide
a first test case. I don't know how to type Arabic :)

Oleg
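
A hedged sketch of what re-running the creation could look like once the FLAG line has been commented out. It assumes the files are installed as ar.dict, ar.affix and arabic.stop in the server's tsearch_data directory; those file names, and the dictionary name ayaspell, are assumptions for illustration.

```sql
-- Assumes ar.dict, ar.affix (with the FLAG line commented out) and
-- arabic.stop are present in $SHAREDIR/tsearch_data.
CREATE TEXT SEARCH DICTIONARY ayaspell (
    TEMPLATE = ispell,
    DictFile = ar,
    AffFile = ar,
    StopWords = arabic
);

-- If the dictionary already existed when the file was edited, skip the
-- CREATE above. A "dummy" ALTER that restates an existing option makes
-- sessions re-read the files on next use.
ALTER TEXT SEARCH DICTIONARY ayaspell (StopWords = arabic);
```

A session normally reads dictionary files only once, when the dictionary is first used, which is why the ALTER is needed after editing a file behind an existing dictionary.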


Re: Fulltext search configuration

From
Mohamed
Date:
Hehe, OK.

I don't know either, but I took some lines from Al-Jazeera: http://aljazeera.net/portal

I just made the change you suggested, created the dictionary successfully, and tried this:

select ts_lexize('ayaspell', 'استشهد فلسطيني وأصيب ثلاثة في غارة إسرائيلية جديدة')

but I got nothing... :(

Is there a way of making sure that words that are not recognized also get indexed and searched for? (Not that I think this is the problem.)

/ Moe




Re: Fulltext search configuration

From
Oleg Bartunov
Date:
On Mon, 2 Feb 2009, Mohamed wrote:

> Hehe, ok..
> I don't know either but I took some lines from Al-Jazeera :
> http://aljazeera.net/portal
>
> just made the change you said and created it successfully and tried this :
>
> select ts_lexize('ayaspell', 'استشهد فلسطيني وأصيب ثلاثة في غارة إسرائيلية جديدة')
>
> but I got nothing... :(

Mohamed, what did you expect from ts_lexize? Please give us useful
information, otherwise we can't help you.
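
For context, ts_lexize expects a single token rather than a sentence. It returns NULL when the dictionary does not know the token, which psql shows as an empty result, and an empty array when the token is a stop word. A minimal sketch of how the test could be narrowed down, assuming the ayaspell dictionary exists and assuming a hypothetical configuration named arabic_cfg that maps word tokens to it:

```sql
-- One token at a time: NULL means "unknown to this dictionary",
-- an empty array means "stop word".
SELECT ts_lexize('ayaspell', 'فلسطيني');

-- For a whole sentence, go through a configuration so that the parser
-- splits the text into tokens first. arabic_cfg is a hypothetical name.
SELECT alias, token, dictionaries, dictionary, lexemes
FROM ts_debug('arabic_cfg', 'استشهد فلسطيني وأصيب ثلاثة');
```

The ts_debug output shows, per token, which dictionaries were consulted and which one, if any, produced the lexemes.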

>
> Is there a way of making sure that words not recognized also gets
> indexed/searched for ? (Not that I think this is the problem)

yes
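
One way to do this, sketched under assumptions: put a dictionary based on the simple template at the end of the mapping, so that it accepts whatever the ispell dictionary does not recognize. Giving that fallback the same stop-word file keeps stop words out of the index even when the ispell dictionary does not know them. The names arabic_simple and arabic_cfg are hypothetical, and the sketch assumes that arabic_cfg, hunspell_dic and an arabic.stop file already exist.

```sql
-- Fallback dictionary: drops words listed in arabic.stop and accepts
-- every other token (after lowercasing it).
CREATE TEXT SEARCH DICTIONARY arabic_simple (
    TEMPLATE = pg_catalog.simple,
    StopWords = arabic
);

-- Dictionaries are consulted from left to right until one of them
-- recognizes the token. arabic_simple recognizes everything, so it
-- goes last.
ALTER TEXT SEARCH CONFIGURATION arabic_cfg
    ALTER MAPPING FOR word, hword, hword_part
    WITH hunspell_dic, arabic_simple;
```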


>
> / Moe
>
>
>
> On Mon, Feb 2, 2009 at 3:50 PM, Oleg Bartunov <oleg@sai.msu.su> wrote:
>
>> Mohamed,
>>
>> comment line in ar.affix
>> #FLAG   long
>> and creation of ispell dictionary will work. This is temp, solution. Teodor
>> is working on fixing affix autorecognizing.
>>
>> I can't say anything about testing, since somebody should provide
>> first test case. I don't know how to type arabic :)
>>
>>
>> Oleg
>>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: Fulltext search configuration

From
Oleg Bartunov
Date:
On Mon, 2 Feb 2009, Oleg Bartunov wrote:

> On Mon, 2 Feb 2009, Mohamed wrote:
>
>> Hehe, ok..
>> I don't know either but I took some lines from Al-Jazeera :
>> http://aljazeera.net/portal
>>
>> just made the change you said and created it successfully and tried this :
>>
>> select ts_lexize('ayaspell', '?????? ??????? ????? ????? ?? ???? ?????????
>> ?????')
>>
>> but I got nothing... :(
>
> Mohamed, what did you expect from ts_lexize ?  Please, provide us valuable
> information, else we can't help you.
>
>>
>> Is there a way of making sure that words not recognized also gets
>> indexed/searched for ? (Not that I think this is the problem)
>
> yes

Read http://www.postgresql.org/docs/8.3/static/textsearch-dictionaries.html
"A text search configuration binds a parser together with a set of
dictionaries to process the parser's output tokens. For each token type that
the parser can return, a separate list of dictionaries is specified by the
configuration. When a token of that type is found by the parser, each
dictionary in the list is consulted in turn, until some dictionary recognizes
it as a known word. If it is identified as a stop word, or if no dictionary
recognizes the token, it will be discarded and not indexed or searched for.
The general rule for configuring a list of dictionaries is to place first
the most narrow, most specific dictionary, then the more general dictionaries,
finishing with a very general dictionary, like a Snowball stemmer or simple,
which recognizes everything."

quick example:

CREATE TEXT SEARCH CONFIGURATION arabic (
     COPY = english
);

=# \dF+ arabic
Text search configuration "public.arabic"
Parser: "pg_catalog.default"
      Token      | Dictionaries
-----------------+--------------
 asciihword      | english_stem
 asciiword       | english_stem
 email           | simple
 file            | simple
 float           | simple
 host            | simple
 hword           | english_stem
 hword_asciipart | english_stem
 hword_numpart   | simple
 hword_part      | english_stem
 int             | simple
 numhword        | simple
 numword         | simple
 sfloat          | simple
 uint            | simple
 url             | simple
 url_path        | simple
 version         | simple
 word            | english_stem

Then you can alter this configuration.
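
As a rough sketch of such an alteration: the statements below assume that an ispell dictionary named ar_ispell has already been created (the name is borrowed from earlier in this thread and is only an assumption here). The word-like token types are remapped so that the built-in simple dictionary acts as a catch-all behind the ispell dictionary.

```sql
-- Assumes a dictionary named ar_ispell already exists (hypothetical name).
-- Tokens that ar_ispell does not recognize fall through to "simple",
-- which accepts every token, so they are still indexed and searchable.
ALTER TEXT SEARCH CONFIGURATION arabic
    ALTER MAPPING FOR word, hword, hword_part
    WITH ar_ispell, simple;

-- Show how a document is tokenized and which dictionary handled each token.
-- Replace the literal with real Arabic text; ASCII-only words are reported
-- as asciiword and still go to english_stem in this copied configuration.
SELECT alias, token, dictionary, lexemes
FROM ts_debug('arabic', 'TEXT TO TEST GOES HERE');
```

The order in the WITH list matters: dictionaries are consulted left to right, so the catch-all belongs at the end.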




     Regards,
         Oleg

Re: Fulltext search configuration

From
Mohamed
Date:


On Mon, Feb 2, 2009 at 4:34 PM, Oleg Bartunov <oleg@sai.msu.su> wrote:
> On Mon, 2 Feb 2009, Oleg Bartunov wrote:
>
>> On Mon, 2 Feb 2009, Mohamed wrote:
>>
>>> Hehe, ok..
>>> I don't know either but I took some lines from Al-Jazeera :
>>> http://aljazeera.net/portal
>>>
>>> just made the change you said and created it successfully and tried this :
>>>
>>> select ts_lexize('ayaspell', '?????? ??????? ????? ????? ?? ???? ?????????
>>> ?????')
>>>
>>> but I got nothing... :(
>>
>> Mohamed, what did you expect from ts_lexize ?  Please, provide us valuable
>> information, else we can't help you.

What I expected was for something to be returned. After all, they are valid words taken from an article (perhaps you don't see the words, only ???...). Am I wrong to expect something? Should I set up the configuration completely first?

SELECT ts_lexize('norwegian_ispell', 'overbuljongterningpakkmesterassistent');
{over,buljong,terning,pakk,mester,assistent}

Check out this article if you need a sample:
http://www.aljazeera.net/NR/exeres/103CFC06-0195-47FD-A29F-2C84B5A15DD0.htm

>>> Is there a way of making sure that words not recognized also gets
>>> indexed/searched for ? (Not that I think this is the problem)
>>
>> yes
>
> Read http://www.postgresql.org/docs/8.3/static/textsearch-dictionaries.html
> "A text search configuration binds a parser together with a set of
> dictionaries to process the parser's output tokens. For each token type that
> the parser can return, a separate list of dictionaries is specified by the
> configuration. When a token of that type is found by the parser, each
> dictionary in the list is consulted in turn, until some dictionary recognizes
> it as a known word. If it is identified as a stop word, or if no dictionary
> recognizes the token, it will be discarded and not indexed or searched for.
> The general rule for configuring a list of dictionaries is to place first
> the most narrow, most specific dictionary, then the more general dictionaries,
> finishing with a very general dictionary, like a Snowball stemmer or simple,
> which recognizes everything."


Ok, but I don't have a thesaurus or a Snowball stemmer to fall back on. So when real words are, for some reason, not recognized, they "will be discarded and not indexed or searched for", which I consider a problem since I don't trust my configuration to cover everything.

Is this not a valid concern?

> quick example:
>
> CREATE TEXT SEARCH CONFIGURATION arabic (
>    COPY = english
> );
>
> =# \dF+ arabic
> Text search configuration "public.arabic"
> Parser: "pg_catalog.default"
>       Token      | Dictionaries
> -----------------+--------------
>  asciihword      | english_stem
>  asciiword       | english_stem
>  email           | simple
>  file            | simple
>  float           | simple
>  host            | simple
>  hword           | english_stem
>  hword_asciipart | english_stem
>  hword_numpart   | simple
>  hword_part      | english_stem
>  int             | simple
>  numhword        | simple
>  numword         | simple
>  sfloat          | simple
>  uint            | simple
>  url             | simple
>  url_path        | simple
>  version         | simple
>  word            | english_stem
>
> Then you can alter this configuration.


Yes, I figured that's the next step, but I thought I should get ts_lexize to work first. What do you think?

Just a thought, say I have this : 

ALTER TEXT SEARCH CONFIGURATION pg
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH pga_ardict, ar_ispell, ar_stem;

Is it possible to keep adding dictionaries, to get both Arabic and English matches on the same column (Arabic speakers tend to mix the two), like this:

ALTER TEXT SEARCH CONFIGURATION pg
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH pga_ardict, ar_ispell, ar_stem, pg_english_dict, english_ispell, english_stem;


Will something like that work ? 


 / Moe
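
One possible arrangement for mixed Arabic and English text, offered only as a sketch. The default parser reports words made purely of ASCII letters as asciiword and words containing non-ASCII letters as word, so the two scripts can be given separate dictionary lists instead of one long list. The configuration name arabic and the dictionary name ar_ispell are assumptions taken from elsewhere in this thread.

```sql
-- ASCII-only words (English): use the built-in English Snowball stemmer.
ALTER TEXT SEARCH CONFIGURATION arabic
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
    WITH english_stem;

-- Words containing non-ASCII letters (Arabic): try the ispell dictionary
-- first, then fall back to "simple" so unrecognized words are kept.
-- ar_ispell is a hypothetical dictionary that must already exist.
ALTER TEXT SEARCH CONFIGURATION arabic
    ALTER MAPPING FOR word, hword, hword_part
    WITH ar_ispell, simple;
```

Within a single list, dictionaries are consulted in order and the first one that recognizes the token wins, so anything listed after a catch-all such as simple or a Snowball stemmer is never reached.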

Re: Fulltext search configuration

From
Oleg Bartunov
Date:
Mohamed,

Please try to read the docs and think a bit first.

On Mon, 2 Feb 2009, Mohamed wrote:

> On Mon, Feb 2, 2009 at 4:34 PM, Oleg Bartunov <oleg@sai.msu.su> wrote:
>
>> On Mon, 2 Feb 2009, Oleg Bartunov wrote:
>>
>>  On Mon, 2 Feb 2009, Mohamed wrote:
>>>
>>>  Hehe, ok..
>>>> I don't know either but I took some lines from Al-Jazeera :
>>>> http://aljazeera.net/portal
>>>>
>>>> just made the change you said and created it successfully and tried this
>>>> :
>>>>
>>>> select ts_lexize('ayaspell', '?????? ??????? ????? ????? ?? ????
>>>> ?????????
>>>> ?????')
>>>>
>>>> but I got nothing... :(


You did it wrong! ts_lexize expects a single word, not a phrase!
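
To illustrate the distinction, a minimal sketch. The built-in english_stem dictionary is used only because it ships with PostgreSQL; ayaspell and arabic are the dictionary and configuration names used earlier in this thread, and the Arabic sample word (meaning "word") is an arbitrary choice for illustration.

```sql
-- ts_lexize tests ONE token against ONE dictionary.
SELECT ts_lexize('english_stem', 'stars');  -- known word: array of lexemes
SELECT ts_lexize('english_stem', 'a');      -- stop word: empty array

-- A dictionary returns NULL for a token it does not recognize, so this
-- tells "unknown word" apart from "stop word".
SELECT ts_lexize('ayaspell', 'كلمة') IS NULL AS unknown_word;

-- For a whole phrase, go through a configuration instead: the parser
-- splits the text into tokens before any dictionary is consulted.
SELECT to_tsvector('arabic', 'A WHOLE PHRASE GOES HERE');
```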

>>>>
>>>
>>> Mohamed, what did you expect from ts_lexize ?  Please, provide us valuable
>>> information, else we can't help you.
>>>
>>
> What I expected was something to be returned. After all they are valid words
> taken from an article. (perhaps you don't see the words, but only ???... )
> Am I wrong to expect something ? Should I go for setting up the
> configuration completly first?

You should definitely read the documentation:
http://www.postgresql.org/docs/8.3/static/textsearch-debugging.html#TEXTSEARCH-DICTIONARY-TESTING
Period.


     Regards,
         Oleg

Re: Fulltext search configuration

From
Mohamed
Date:
A little harsh, aren't we? I have read the WHOLE documentation. It's a bit long, so confusion might arise, and I am not familiar with PostgreSQL AT ALL, so the confusion grows.

Perhaps I am an idiot and you don't like helping idiots or perhaps it's something else? Which one is it?

If you don't want to help me, then DON'T ! Period. 

The mailing list is not yours.

.
.
.

I have tried ts_lexize with words, lots of them and I have yet to get something out of it!

/ Moe