Re: Tsearch2 custom dictionaries - Mailing list pgsql-general
From | Oleg Bartunov |
---|---|
Subject | Re: Tsearch2 custom dictionaries |
Date | |
Msg-id | Pine.GSO.4.56.0308072106070.17880@ra.sai.msu.su |
In response to | Re: Tsearch2 custom dictionaries (psql-mail@freeuk.com) |
List | pgsql-general |
On Thu, 7 Aug 2003 psql-mail@freeuk.com wrote:

> > On Thu, 7 Aug 2003 psql-mail@freeuk.com wrote:
> >
> > > Part1.
> > >
> > > I have created a dictionary called 'webwords' which checks all words
> > > and curtails them to 300 chars (for now)
> > >
> > > after running
> > > make
> > > make install
> > >
> > > I then copied the lib_webwords.so into my $libdir
> > >
> > > I have run
> > >
> > > psql mybd < dict_webwords.sql
> > >
> >
> > Once you did 'psql mybd < dict_webwords.sql' you should be able use it :)
> > Test it :
> > select lexize('webwords','some_web_word');
>
> I did test it with
> select lexize('webwords','some_web_word');
>  lexize
> -------
>  {some_web_word}
>
> select lexize('webwords','some_400char_web_word');
>  lexize
> --------
>  {some_shortened_web_word}
>
>
> so that bit works, but then I tried
>
> SELECT to_tsvector( 'webwords', 'my words' );
> Error: No tsearch config

from ref.guide:
to_tsvector( [configuration,] document TEXT) RETURNS tsvector

> > Did you read http://www.sai.msu.su/~megera/oddmuse/index.cgi/Gendict
>
> yeah, i did read it - its good!
> should i run:
> update pg_ts_cfgmap set dict_name='{webwords}';
>

after loading your dictionary to db you should have it registered
in pg_ts_dict, try
select * from pg_ts_dict;
next, you need to read docs, for example
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/tsearch-V2-intro.html
how to create your configuration and specify lexem_type-dictionary mapping;

> >
> > > Part2.
> <snip>
> > > As the text can be multilingual I don't think stemming is possible?
> > >
> >
> > You're right. I'm afraid you need UTF database, but tsearch2 isn't
> > UTF-8 compatible :(
>
> My database was created as unicode - does this mean I cannot use
> tsaerch?!
>

We have no any experience with UTF, so you may better ask openfts mailing list
and read archives.

> > > I also need to include many none-standard words in the index such as
> > > urls and message ID's contained in the text.
> > >
> >
> > What's message ID ? Integer ? it's already recognized by parser.
> >
> > try
> > select * from token_type();
> >
> > Also, last version of tsearch2 (for 7.3 grab from
> > http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/,
> > for 7.4 - available from CVS)
> > has rather useful function - ts_debug
> >
> > apod=# select * from ts_debug('http://www.sai.msu.su/~megera');
> >  ts_name | tok_type | description |     token      | dict_name |     tsvector
> > ---------+----------+-------------+----------------+-----------+------------------
> >  simple  | host     | Host        | www.sai.msu.su | {simple}  | 'www.sai.msu.su'
> >  simple  | lword    | Latin word  | megera         | {simple}  | 'megera'
> > (2 rows)
> >
> >
> >
> > > I get the feeling that building these indexs will by no means be an
> > > easy task so any suggestions will be gratefully recieved!
> > >
> >
> > You may write your own parser, at last. Some info about parser API:
> > http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_in_Brief
>
>
> Parser writing...scary stuff :-)
>
>
> Thanks!
>
>

	Regards,
		Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
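The to_tsvector() error in this thread comes from passing a dictionary name ('webwords') where a configuration name is expected. A minimal sketch of the step Oleg describes (create a configuration, then map token types to the dictionary) might look like the following. It assumes the tsearch2 contrib tables pg_ts_cfg and pg_ts_cfgmap with the columns shown in the tsearch2 introduction; the configuration name 'web_cfg' is hypothetical, the NULL locale is an assumption, and only the two token types that appear in the ts_debug output above are mapped.

```sql
-- Sketch only, not a tested setup. 'web_cfg' is a made-up configuration name;
-- 'default' is assumed to be the stock tsearch2 parser.

-- 1. Check that dict_webwords.sql registered the dictionary.
SELECT dict_name FROM pg_ts_dict WHERE dict_name = 'webwords';

-- 2. Create a configuration that uses the default parser.
--    locale is left NULL here (assumption); set it to the server locale
--    if this configuration should be selected automatically.
INSERT INTO pg_ts_cfg (ts_name, prs_name, locale)
VALUES ('web_cfg', 'default', NULL);

-- 3. Map token types to the dictionary, one row per token type.
--    'lword' and 'host' are the types seen in the ts_debug output above;
--    select * from token_type(); lists the rest.
INSERT INTO pg_ts_cfgmap (ts_name, tok_alias, dict_name)
VALUES ('web_cfg', 'lword', '{webwords}');
INSERT INTO pg_ts_cfgmap (ts_name, tok_alias, dict_name)
VALUES ('web_cfg', 'host', '{webwords}');

-- 4. Pass the configuration name, not the dictionary name.
SELECT to_tsvector('web_cfg', 'my words');
```

Note that the UPDATE quoted earlier in the thread has no WHERE clause, so it would overwrite the dictionary list for every token type in every configuration; adding rows for a separate configuration leaves the existing ones untouched.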