Re: Fulltext search configuration - Mailing list pgsql-general

From Mohamed
Subject Re: Fulltext search configuration
Date
Msg-id 861fed220902020809u534b743atba3491397f27b404@mail.gmail.com
In response to Re: Fulltext search configuration  (Oleg Bartunov <oleg@sai.msu.su>)
Responses Re: Fulltext search configuration
List pgsql-general


On Mon, Feb 2, 2009 at 4:34 PM, Oleg Bartunov <oleg@sai.msu.su> wrote:
On Mon, 2 Feb 2009, Oleg Bartunov wrote:

On Mon, 2 Feb 2009, Mohamed wrote:

Hehe, ok..
I don't know either, but I took some lines from Al-Jazeera:
http://aljazeera.net/portal

I just made the change you suggested, created it successfully, and tried this:

select ts_lexize('ayaspell', '?????? ??????? ????? ????? ?? ???? ?????????
?????')

but I got nothing... :(

Mohamed, what did you expect from ts_lexize? Please give us useful
information, otherwise we can't help you.

What I expected was for something to be returned. After all, these are valid words taken from an article (perhaps you don't see the words, but only ???...). Am I wrong to expect something? Should I set up the configuration completely first?

SELECT ts_lexize('norwegian_ispell', 'overbuljongterningpakkmesterassistent');
{over,buljong,terning,pakk,mester,assistent}

Check out this article if you need a sample.
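
A note on the ts_lexize call, going by the PostgreSQL 8.3 documentation: ts_lexize expects a single token, not a whole sentence. It returns NULL when the dictionary does not recognize the token and an empty array when the token is a stop word. Passing several words at once may therefore return nothing even if each word is in the dictionary. A minimal sketch using the built-in english_stem dictionary; whether this explains the ayaspell result is an assumption, not something confirmed in the thread:

```sql
-- One token at a time: the stemmer reduces it to a lexeme.
SELECT ts_lexize('english_stem', 'stars');   -- {star}

-- A stop word yields an empty array rather than NULL.
SELECT ts_lexize('english_stem', 'a');       -- {}

-- 'ayaspell' is the dictionary from this thread. Test it with one word,
-- not a sentence; 'single_word_here' is a placeholder to replace with
-- a single Arabic word.
SELECT ts_lexize('ayaspell', 'single_word_here');
```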



 


Is there a way of making sure that words that are not recognized also get
indexed/searched for? (Not that I think this is the problem)

yes

Read http://www.postgresql.org/docs/8.3/static/textsearch-dictionaries.html
"A text search configuration binds a parser together with a set of dictionaries to process the parser's output tokens. For each token type that the parser can return, a separate list of dictionaries is specified by the configuration. When a token of that type is found by the parser, each dictionary in the list is consulted in turn, until some dictionary recognizes it as a known word. If it is identified as a stop word, or if no dictionary recognizes the token, it will be discarded and not indexed or searched for. The general rule for configuring a list of dictionaries is to place first the most narrow, most specific dictionary, then the more general dictionaries,
finishing with a very general dictionary, like a Snowball stemmer or simple, which recognizes everything."


Ok, but I don't have a Thesaurus or a Snowball stemmer to fall back on. So when real words are, for some reason, not recognized, they "will be discarded and not indexed or searched for", which I consider a problem since I don't trust my configuration to cover everything.

Is this not a valid concern?
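
One way to address that concern, sketched under assumptions: list the built-in simple dictionary last for the relevant token types. simple lowercases the token and returns it as-is, so a word the earlier dictionaries do not recognize still gets indexed instead of being discarded. Here ar_ispell stands for an ispell dictionary that has already been created (a hypothetical name), and arabic is a configuration like the one created in the quick example in this message:

```sql
-- ar_ispell is a hypothetical, user-created ispell dictionary.
-- simple comes last so that it only sees tokens ar_ispell rejected.
ALTER TEXT SEARCH CONFIGURATION arabic
    ALTER MAPPING FOR word, hword, hword_part
    WITH ar_ispell, simple;

-- simple accepts any token and lowercases it.
SELECT ts_lexize('simple', 'YeS');  -- {yes}
```

The trade-off is that words caught only by simple are indexed in their surface form, without stemming or normalization.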
 

quick example:

CREATE TEXT SEARCH CONFIGURATION arabic (
   COPY = english
);

=# \dF+ arabic
Text search configuration "public.arabic"
Parser: "pg_catalog.default"
     Token      | Dictionaries
-----------------+--------------
 asciihword      | english_stem
 asciiword       | english_stem
 email           | simple
 file            | simple
 float           | simple
 host            | simple
 hword           | english_stem
 hword_asciipart | english_stem
 hword_numpart   | simple
 hword_part      | english_stem
 int             | simple
 numhword        | simple
 numword         | simple
 sfloat          | simple
 uint            | simple
 url             | simple
 url_path        | simple
 version         | simple
 word            | english_stem

Then you can alter this configuration.
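
To check what a configuration does with real text, ts_debug shows, per token, the token type, the dictionaries consulted, the one that recognized the token, and the resulting lexemes. A small sketch against the copied configuration above; the sample sentence is arbitrary:

```sql
-- lexemes is NULL when no dictionary recognized the token,
-- and {} when the token was recognized as a stop word.
SELECT alias, token, dictionaries, dictionary, lexemes
FROM ts_debug('public.arabic', 'The quick brown foxes');
```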


Yes, I figured that's the next step, but I thought I should get ts_lexize to work first. What do you think?

Just a thought: say I have this:

ALTER TEXT SEARCH CONFIGURATION pg
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH pga_ardict, ar_ispell, ar_stem;

Is it possible to keep adding dictionaries, to get both Arabic and English matches on the same column (Arabic speakers tend to mix the two), like this:

ALTER TEXT SEARCH CONFIGURATION pg
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH pga_ardict, ar_ispell, ar_stem, pg_english_dict, english_ispell, english_stem;


Will something like that work?
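
Two things to keep in mind here, as far as the documentation goes. Dictionaries in a list are tried in order until one recognizes the token, and a Snowball stemmer recognizes everything, so if ar_stem is a Snowball-style stemmer, nothing listed after it would ever be consulted. Also, the default parser reports ASCII-only words as asciiword and words containing other letters as word, so mixed Arabic and English text can be handled by mapping the two groups separately. A sketch, assuming ar_ispell and english_ispell are ispell dictionaries created beforehand (hypothetical names) and arabic is the configuration from the quick example:

```sql
-- ASCII-only tokens: English dictionaries, stemmer last as the catch-all.
ALTER TEXT SEARCH CONFIGURATION arabic
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
    WITH english_ispell, english_stem;

-- Tokens containing non-ASCII letters: Arabic dictionary, then simple.
ALTER TEXT SEARCH CONFIGURATION arabic
    ALTER MAPPING FOR word, hword, hword_part
    WITH ar_ispell, simple;
```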


 / Moe
