Thread: making tsearch2 dictionaries

making tsearch2 dictionaries

From: Ben
I'm trying to make myself a dictionary for tsearch2 that converts
numbers to their English word equivalents. This seems to be working
great, except that I can't figure out how to make my lexize function
return multiple lexemes. For instance, I'd like "100" to get converted
to {one,hundred}, not {"one hundred"} as is currently happening.

How do I specify the output of the lexize function so that this will
happen?


Re: making tsearch2 dictionaries

From: Ben
Okay, so I was actually able to answer this question on my own, in a
manner of speaking. It seems the way to do this is to merely return a
larger char** array, with one element for each word. But I was having
trouble with postgres crashing, because (I think) it tries to free each
element independently before using all of them. I had set each element
to a different null-terminated chunk of the same palloc'd memory
segment. Having never written C stored procs before, I take it that's
bad practice?

Anyway, now that this is working, my next question is: can I take the
lexemes from one dictionary lookup and pipe them into another
dictionary? I see that I can have redundant dictionaries, such that if
lexemes aren't found in one it'll try another, but that's not quite the
same.

For instance, the en_stem dictionary converts "hundred" into "hundr".
Right now, my dictionary converts "100" into "one" and "hundred", but
I'd like it to filter both one and hundred through the en_stem
dictionary to arrive at "one" and "hundr".

It also occurs to me I could pipe things through an ispell dictionary
and be able to handle misspellings....



Re: making tsearch2 dictionaries

From: Teodor Sigaev
From http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_in_Brief

    Table for storing dictionaries. The dict_init field stores the OID of
    the function that initializes the dictionary. dict_init takes one
    option, a text value from dict_initoption, and should return the
    internal representation (structure) of the dictionary. The structure
    must be malloc'ed, or palloc'ed in TopMemoryContext. dict_init is
    called only once per process.
    The dict_lexize field stores the OID of the function that lemmatizes
    a lexeme. Input values: the dictionary structure, a pointer to the
    string, and its length. Output: a pointer to an array of pointers to
    C strings; the last pointer in the array must be NULL. Returning NULL
    means the dictionary cannot resolve the word, while returning an
    empty array means the dictionary knows the input word but considers
    it a stop word.
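
For illustration, a lexize function that follows this contract might
look like the sketch below. This is an editorial sketch, not code from
the thread: the function name and the hard-coded lexemes are made up,
and only the return convention comes from the description above.

    /* Return a palloc'd, NULL-terminated array of pointers to
     * palloc'd C strings -- here always {"one", "hundred"}. */
    char **
    num2english_lexize(void *dict, char *word, int len)
    {
        char **res = (char **) palloc(3 * sizeof(char *));

        res[0] = pstrdup("one");        /* first lexeme */
        res[1] = pstrdup("hundred");    /* second lexeme */
        res[2] = NULL;                  /* required terminator */

        /* Returning NULL instead would mean "word not recognized";
         * an array whose first element is NULL would mean
         * "known word, but a stop word". */
        return res;
    }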


--
Teodor Sigaev                                  E-mail: teodor@sigaev.ru

Re: making tsearch2 dictionaries

From: Tom Lane

Given Teodor's response, I think the issue is probably that you were
palloc'ing in too short-lived a context.  But whatever the problem is,
you'll narrow it down a lot faster if you build with --enable-cassert.
I wouldn't ever recommend trying to debug C functions without that.
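
For reference, that means building the server from source with
assertion checking turned on; adding --enable-debug for debugging
symbols is a natural companion:

    ./configure --enable-cassert --enable-debug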

            regards, tom lane

Re: making tsearch2 dictionaries

From: Teodor Sigaev
Excuse me, I was too brief.
I meant that your dictionary's lexize method should return a pointer to
an array with 3 elements: the first should point to the "one" C string,
the second to the "hundred" C string, and the 3rd should be NULL.
The array and the C strings should be palloc'ed in a short-lived
context, because they only live while the text is being parsed.





--
Teodor Sigaev                                  E-mail: teodor@sigaev.ru

Re: making tsearch2 dictionaries

From: Ben
Thanks for the replies. Just to clarify what I was doing, my quasicode
looked something like:

phrase = palloc(8);
phrase = "foo\0bar\0";
res = palloc(3);
res[0] = phrase[0];
res[1] = phrase[5];
res[2] = 0;

That crashed. Once I changed it to:

res = palloc(3);
res[0] = palloc(4);
res[0] = "foo\0";
res[1] = palloc(4);
res[2] = "bar\0";
res[3] = 0;

it worked.

Anyway, I'm happy to forget my pain with this if only I could figure out
how to pipe the lexemes from one dictionary into another dictionary. :)

On Mon, 2004-02-16 at 08:09, Teodor Sigaev wrote:
> Excuse me, but I was too brief.
> I mean your lexize method of dictionary should return pointer to array with 3
> elements:
> first should points to "one" C-string, second - to "hundred" C-string and 3rd is
> NULL.
> Array and C-strings should be palloc'ed in short-lived context, because it's
> lives during parse text only.
>
>
>
>
> Tom Lane wrote:
> > Ben <bench@silentmedia.com> writes:
> >
> >>Okay, so I was actually able to answer this question on my own, in a
> >>manner of speaking. It seems the way to do this is to merely return a
> >>larger char** array, with one element for each word. But I was having
> >>trouble with postgres crashing, because (I think) it tries to free each
> >>element independently before using all of them. I had set each element
> >>to a different null-terminated chunk of the same palloc'd memory
> >>segment. Having never written C stored procs before, I take it that's
> >>bad practice?
> >
> >
> > Given Teodor's response, I think the issue is probably that you were
> > palloc'ing in too short-lived a context.  But whatever the problem is,
> > you'll narrow it down a lot faster if you build with --enable-cassert.
> > I wouldn't ever recommend trying to debug C functions without that.
> >
> >             regards, tom lane
> >
> > ---------------------------(end of broadcast)---------------------------
> > TIP 5: Have you checked our extensive FAQ?
> >
> >                http://www.postgresql.org/docs/faqs/FAQ.html


Re: making tsearch2 dictionaries

From: Ben
Like I said, quasicode. :)

And in fact I see I even put an off-by-one error in this last email that
wasn't in my function. (Honest!) Should have been "res[1] = phrase[4]"
in the first section.

Are there docs for making parsers? Or anything like gendict?

On Mon, 2004-02-16 at 09:25, Teodor Sigaev wrote:

> :)
> I hope you mean:
> res = palloc(3);
> res[0] = palloc(4);
> memcpy(res[0], "foo", 4);
> res[1] = palloc(4);
> memcpy(res[1], "bar", 4);
> res[2] = 0;
>
> Look at indexes of res.


Re: making tsearch2 dictionaries

From: Teodor Sigaev

:)
I hope you mean:
res = palloc(3);
res[0] = palloc(4);
memcpy(res[0], "foo", 4);
res[1] = palloc(4);
memcpy(res[1], "bar", 4);
res[2] = 0;

Look at indexes of res.
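
Spelled out with explicit types and sizes, an editorial sketch of the
same fix; note that res needs room for three pointers, not three bytes,
and each string needs its own palloc'd buffer:

    char **res = (char **) palloc(3 * sizeof(char *));

    res[0] = (char *) palloc(sizeof("foo"));   /* 4 bytes: "foo" + '\0' */
    memcpy(res[0], "foo", sizeof("foo"));
    res[1] = (char *) palloc(sizeof("bar"));
    memcpy(res[1], "bar", sizeof("bar"));
    res[2] = NULL;                             /* terminator */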

--
Teodor Sigaev                                  E-mail: teodor@sigaev.ru

Re: making tsearch2 dictionaries

From: Teodor Sigaev
Small docs are available at
http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_in_Brief

and the current implementation is in contrib/tsearch2/wparser_def.c.
The largest part of the code there deals with headline generation.


--
Teodor Sigaev                                  E-mail: teodor@sigaev.ru

Re: making tsearch2 dictionaries

From: Oleg Bartunov
btw, Ben, if you get your dictionary working, could you describe the
development process so other people can benefit from your work? This
part of the tsearch2 documentation is very weak.

    Oleg


    Regards,
        Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

Re: making tsearch2 dictionaries

From: Ben
So I noticed. ;) The dictionary's working, and I'd be happy to expand
upon the documentation. Just point me at something to work on.

But, like I said, I really want to figure out a way to pipe the output
of my dictionary through another dictionary. If I can't do that, it
doesn't seem as useful, because "100" (handled by my dictionary) and
"one hundred" (handled by en_stem) currently don't generate the same
tsvector.

Once I figure out how to tweak the parser to parse things the way I
want, I can expand upon those docs too. Looks like I'm going to need to
reach waaaay back into my brain and dust off my flex knowledge for that,
though....



Re: making tsearch2 dictionaries

From: Oleg Bartunov
On Mon, 16 Feb 2004, Ben wrote:

> So I noticed. ;) The dictionary's working, and I'd be happy to expand
> upon the documentation. Just point me at something to work on.
>

I think you could just write a paper, "How I made a custom dictionary
for tsearch2". From what I've read, your dictionary could be
interesting to people, especially if you describe the motivation and
usage.
Do you want '100' and 'hundred' to be fully equivalent? So that if you
search for '100' you will find documents containing 'hundred'?
Interestingly, a search for 'hundred' would then also find '123',
because '123' becomes 'one hundred twenty three'.

> But, like I said, I really want to figure out a way to pipe the output
> of my dictionary through another dictionary. If I can't do that, it
> doesn't seem as useful, because "100" (handled by my dictionary) and
> "one hundred" (handled by en_stem) currently don't generate the same
> tsvector.

What's the problem? You can configure which dictionaries are used, and
in what order, for a given type of token (the pg_ts_cfgmap table).
Aha, I see your problem:

www=# select * from ts_debug('one hundred');
     ts_name     | tok_type | description |  token  | dict_name | tsvector
-----------------+----------+-------------+---------+-----------+----------
 default_russian | lword    | Latin word  | one     | {en_stem} | 'one'
 default_russian | lword    | Latin word  | hundred | {en_stem} | 'hundr'

'hundred' becomes 'hundr'. You could use the synonym dictionary, which
is rather simple
(see http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Notes for details).
Once a word is recognized by the synonym dictionary, it will not be
passed to the next dictionary! This is how tsearch2 works with any
dictionary.
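
As a sketch of the configuration side (editorial: num2english is the
hypothetical dictionary from this thread, the ts_name comes from the
ts_debug output above, and the 'uint' token alias for numbers is an
assumption to adjust to your setup):

    -- Try num2english first for number tokens, then fall back to en_stem.
    -- tsearch2 stops at the first dictionary that recognizes the token.
    UPDATE pg_ts_cfgmap
       SET dict_name = '{num2english,en_stem}'
     WHERE ts_name = 'default_russian'
       AND tok_alias = 'uint';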


>
> Once I figure out how to tweak the parser to parse things the way I
> want, I can expand upon those docs too. Looks like I'm going to need to
> reach waaaay back into my brain and dust off my flex knowledge for that,
> though....

What do you want from the parser?


    Regards,
        Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

Re: making tsearch2 dictionaries

From: Ben
On Tue, 2004-02-17 at 03:15, Oleg Bartunov wrote:

> Do you want '100' and 'hundred' to be fully equivalent? So that if you
> search for '100' you will find documents containing 'hundred'?
> Interestingly, a search for 'hundred' would then also find '123',
> because '123' becomes 'one hundred twenty three'.

Yeah, for the general case of documents I'm not sure how much it would
improve accuracy, but I'm trying to index music artist names and song
titles, where I'd get things like "3 Dog Night".... or is that "Three
Dog Night"? :)

> What's the problem? You can configure which dictionaries are used, and
> in what order, for a given type of token (the pg_ts_cfgmap table).
> Aha, I see your problem:

> Once a word is recognized by the synonym dictionary, it will not be
> passed to the next dictionary! This is how tsearch2 works with any
> dictionary.

Yep, that's my problem. :) And it seems that if I could pass the normal
words into an ispell dictionary before passing them on to the en_stem
dictionary, I'd get spell checking for free. Unless there's a better way
to give "did you mean: <your search spelled correctly>?" results....?

I know doing this would increase the size of the generated tsvector,
but for my case, where what I'm indexing is generally only a few words
anyway, that's not an issue. As it is, I'm already going to get rid of
the stop words file, so that I can actually find things like "The Who."

How hard do you think it would be to change the behavior to make this
happen?

> What do you want from the parser?

I want to be able to recognize symbols, such as the degree (°) and
vulgar half (½) symbols.


Re: making tsearch2 dictionaries

From: Oleg Bartunov
On Tue, 17 Feb 2004, Ben wrote:

> Yep, that's my problem. :) And it seems that if I could pass the normal
> words into an ispell dictionary before passing them on to the en_stem
> dictionary, I'd get spell checking for free. Unless there's a better way
> to give "did you mean: <your search spelled correctly>?" results....?
>

If the ispell dictionary recognizes a word, that word will not be
passed to en_stem.
We know how to add a "query spelling" feature to tsearch2; we're just
waiting for sponsorship :) Meanwhile, you could use our trgm module,
which implements trigram-based spelling correction. You need to
maintain a separate table with all words of interest (say, taken from
your tsvectors) and search for query words in that table using the
best-match function.
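
A sketch of that lookup, using the similarity function and the % match
operator as the trgm module later shipped in contrib/pg_trgm (the
2004-era module may have differed; the words table here is
hypothetical):

    -- Find the stored word closest to a possibly misspelled query word.
    SELECT word, similarity(word, 'hundrd') AS sml
      FROM words                 -- hypothetical table of tsvector words
     WHERE word % 'hundrd'       -- trigram match above the threshold
     ORDER BY sml DESC
     LIMIT 1;                    -- best match for "did you mean?"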

>
> > What do you want from the parser?
>
> I want to be able to recognize symbols, such as the degree (ТА) and
> vulgar half (ТН) symbols.

You mean '(TA)', '(TH)'? I think it's not very difficult. What would
the token type be (parenthesis_word?)


    Regards,
        Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

Re: making tsearch2 dictionaries

From: Ben
On Tue, 17 Feb 2004, Oleg Bartunov wrote:

> If the ispell dictionary recognizes a word, that word will not be
> passed to en_stem.
> We know how to add a "query spelling" feature to tsearch2; we're just
> waiting for sponsorship :) Meanwhile, you could use our trgm module,
> which implements trigram-based spelling correction. You need to
> maintain a separate table with all words of interest (say, taken from
> your tsvectors) and search for query words in that table using the
> best-match function.

Hm, I'll take a look at this approach. I take it you think piping
dictionary output to more dictionaries in the chain is a bad idea? :)

> > > What do you want from the parser?
> >
> > I want to be able to recognize symbols, such as the degree (ôá) and
> > vulgar half (ôî) symbols.
>
> You mean '(TA)', '(TH)'? I think it's not very difficult. What would
> the token type be (parenthesis_word?)

uh, not sure how you got (TA) and (TH)... if you look at the original
message with utf-8 unicode encoding, the symbols come out fine. Or, maybe
you'd just have better luck pointing a browser at a page like
http://homepages.comnet.co.nz/~r-mahoney/bca_text/utf8.html. I want to be
able to recognize a subset of these symbols, and I'd want another
dictionary I'd make to handle the symbol token to return both the symbol
and the common name as lexemes, in case people spell out the symbol
instead of entering it.


Re: making tsearch2 dictionaries

From: Oleg Bartunov
On Tue, 17 Feb 2004, Ben wrote:

> Hm, I'll take a look at this approach. I take it you think piping
> dictionary output to more dictionaries in the chain is a bad idea? :)

It's unpredictable, and I still don't get your idea of pipelining, but
in general I have nothing against it.

>
> > > > What do you want from the parser?
> > >
> > > I want to be able to recognize symbols, such as the degree (ТА) and
> > > vulgar half (ТН) symbols.
> >
> > You mean '(TA)', '(TH)'? I think it's not very difficult. What would
> > the token type be (parenthesis_word?)
>
> uh, not sure how you got (TA) and (TH)... if you look at the original
> message with utf-8 unicode encoding, the symbols come out fine. Or, maybe
> you'd just have better luck pointing a browser at a page like

Yup:)

> http://homepages.comnet.co.nz/~r-mahoney/bca_text/utf8.html. I want to be
> able to recognize a subset of these symbols, and I'd want another
> dictionary I'd make to handle the symbol token to return both the symbol
> and the common name as lexemes, in case people spell out the symbol
> instead of entering it.
>

Aha, the same way we handle complex words with hyphens: we return the
whole word and its parts. So you need to introduce a new type of token
in the parser and use a synonym dictionary, which in turn will return
both the symbol token and the human-readable word.

    Regards,
        Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

Re: making tsearch2 dictionaries

From: Ben
On Tue, 17 Feb 2004, Oleg Bartunov wrote:

> It's unpredictable, and I still don't get your idea of pipelining, but
> in general I have nothing against it.

Oh, well, the idea is that instead of the dictionary search stopping at
the first dictionary in the chain that returns a lexeme, it would take
each of the lexemes returned and pass them on to the next dictionary in
the chain.

So if I specified that numbers were to be handled by my num2english
dictionary, followed by en_stem, and then tried to get a vector for
"100", num2english would return "one" and "hundred". Then both "one"
and "hundred" would each be looked up in en_stem, and the union of
those lexemes would be the final result.

Similarly, if a latin word gets piped through an ispell dictionary before
being sent to en_stem, each possible spelling would be stemmed.
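
As an editorial sketch of this proposal in C (tsearch2 does not work
this way, as Oleg notes; every name here is made up):

    #include <string.h>           /* assumes postgres.h for palloc */

    #define MAX_LEXEMES 32        /* arbitrary bound for the sketch */

    typedef char **(*lexize_fn)(void *dict, char *word, int len);

    /* Feed every lexeme produced by dictionary A into dictionary B and
     * return the union of B's results, NULL-terminated. */
    char **
    chain_lexize(void *dict_a, lexize_fn lexize_a,
                 void *dict_b, lexize_fn lexize_b,
                 char *word, int len)
    {
        char **mid = lexize_a(dict_a, word, len);
        char **out;
        int    n = 0;

        if (mid == NULL)
            return NULL;          /* A doesn't know the word */

        out = (char **) palloc(MAX_LEXEMES * sizeof(char *));
        for (char **m = mid; *m != NULL; m++)
        {
            /* e.g. "one" -> {"one"}, "hundred" -> {"hundr"} */
            char **sub = lexize_b(dict_b, *m, (int) strlen(*m));

            for (char **s = sub; sub != NULL && *s != NULL; s++)
                out[n++] = *s;
        }
        out[n] = NULL;
        return out;
    }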

> Aha, the same way we handle complex words with hyphens: we return the
> whole word and its parts. So you need to introduce a new type of token
> in the parser and use a synonym dictionary, which in turn will return
> both the symbol token and the human-readable word.

Okay, that makes sense. I'll look more into how hyphenated words are being
handled now.