Re: tsearch2 and hyphenated terms - Mailing list pgsql-general

From Reece Hart
Subject Re: tsearch2 and hyphenated terms
Date
Msg-id 1207960275.7053.86.camel@snafu
In response to Re: tsearch2 and hyphenated terms  (Oleg Bartunov <oleg@sai.msu.su>)
List pgsql-general
On Fri, 2008-04-11 at 22:07 +0400, Oleg Bartunov wrote:
> We have the same problem with names in astronomy, so we implemented
> dict_regex  http://vo.astronet.ru/arxiv/dict_regex.html
> Check it out !

Oleg-

This gets me a lot closer. Thank you.  I have two remaining problems.


The first problem is that 'bcl-w' and 'bcl-2' are parsed differently,
like so:

unison@u8.3=> select * from ts_debug('english','bcl-w');
      alias      |           description           | token |  dictionaries  |  dictionary  | lexemes
-----------------+---------------------------------+-------+----------------+--------------+---------
 asciihword      | Hyphenated word, all ASCII      | bcl-w | {english_stem} | english_stem | {bcl-w}
 hword_asciipart | Hyphenated word part, all ASCII | bcl   | {english_stem} | english_stem | {bcl}
 blank           | Space symbols                   | -     | {}             |              |
 hword_asciipart | Hyphenated word part, all ASCII | w     | {english_stem} | english_stem | {w}

unison@u8.3=> select * from ts_debug('english','bcl-2');
   alias   |   description   | token |  dictionaries  |  dictionary  | lexemes
-----------+-----------------+-------+----------------+--------------+---------
 asciiword | Word, all ASCII | bcl   | {english_stem} | english_stem | {bcl}
 int       | Signed integer  | -2    | {simple}       | simple       | {-2}

One option would be to write a new parser, or modify wparser_def.c, so
that the InHyphenWordFirst state accepts p_isdigit or p_isalnum on the
first character (I think I got this right).  This would achieve Tom's
initial inkling that Bcl-2 might be parsed as a numhword, and (to me) it
seems more congruent with the asciihword class.

Perhaps a more broadly useful modification is for the lexer to also emit
whitespace-delimited tokens (period). asciihword almost does the trick,
but it too requires a post-hyphen alphabetic character.
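
To make the idea concrete, here is a minimal sketch in Python of what
"also emit whitespace-delimited tokens" could mean. It is not the
tsearch2 parser and does not touch wparser_def.c; the function name
tokens_with_whole_words is hypothetical, and splitting on '-' is an
assumption made only for illustration.

```python
def tokens_with_whole_words(text):
    """Return each whitespace-delimited token, followed by its hyphen parts.

    Illustration only: the real parser classifies tokens (asciihword,
    int, ...) and this sketch does not.
    """
    out = []
    for whole in text.split():
        # Always emit the whole whitespace-delimited token first.
        out.append(whole)
        # Then emit the hyphen-separated parts, if there is more than one.
        parts = [p for p in whole.split("-") if p]
        if len(parts) > 1:
            out.extend(parts)
    return out


if __name__ == "__main__":
    print(tokens_with_whole_words("bcl-w and bcl-2"))
```

With this behaviour 'bcl-w' and 'bcl-2' are treated alike: each is
emitted whole, regardless of whether the character after the hyphen is
a letter or a digit.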



The second problem is with quantifiers in PCRE regexps.  I initially
implemented a dict_regex with a conf line like
(\w+)-(\w{1,2}) $1$2
but that did not work. I can make simpler expressions work (e.g.,
(bcl)-(\w)). I think it must be related to the README caveat regarding
PCRE partial matching mode, which I didn't understand initially.

However, I don't see that it's possible to write a general regexp like
the one I initially tried. Do you have any suggestions?
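
For reference, the rewrite that the general pattern is meant to perform
can be sketched with Python's re module. This is only an illustration
of the intended mapping on a whole string: Python's re is not PCRE, it
has no partial matching mode, and it does not see the token stream that
dict_regex receives. The names HYPHEN_SUFFIX and join_hyphenated are
hypothetical.

```python
import re

# Same pattern as the conf line: a word, a hyphen, then a one- or
# two-character suffix.
HYPHEN_SUFFIX = re.compile(r"(\w+)-(\w{1,2})")


def join_hyphenated(term):
    """Return the term with the hyphen removed if the whole term matches.

    Mirrors the "$1$2" replacement; terms that do not match are
    returned unchanged.
    """
    m = HYPHEN_SUFFIX.fullmatch(term)
    if m is None:
        return term
    return m.group(1) + m.group(2)


if __name__ == "__main__":
    for term in ("bcl-2", "bcl-w", "bcl-xyz"):
        print(term, "->", join_hyphenated(term))
```

Here 'bcl-2' and 'bcl-w' both lose the hyphen, while a longer suffix
such as 'bcl-xyz' is left alone, which is the behaviour the quantifier
{1,2} was supposed to provide.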


Thanks again. I'm very impressed with tsearch2.

-Reece

--
Reece Hart, http://harts.net/reece/, GPG:0x25EC91A0

