Thread: tsearch2 and hyphenated terms
I'd like to use tsearch2 to index protein and gene names. Unfortunately, such names are written inconsistently and sometimes with hyphens. For example, MCL-1 and MCL1 are semantically equivalent but with the default parser and to_tsvector, I see this:

unison@u8.3=> select to_tsvector('MCL1 MCL-1');
       to_tsvector
-------------------------
 '-1':3 'mcl':2 'mcl1':1

For the purposes of indexing these names, I suspect I'd get the majority of cases by removing a hyphen when it's followed by 1 or 2 chars from [a-zA-Z0-9]. Does that require a custom parser?

Thanks,
Reece

--
Reece Hart, http://harts.net/reece/, GPG:0x25EC91A0
Reece Hart <reece@harts.net> writes:
> For the purposes of indexing these names, I suspect I'd get the majority
> of cases by removing a hyphen when it's followed by 1 or 2 chars from
> [a-zA-Z0-9]. Does that require a custom parser?

Yeah, looks like it:

regression=# select * from ts_debug('MCL1 MCL-1');
   alias   |       description        | token |  dictionaries  |  dictionary  | lexemes
-----------+--------------------------+-------+----------------+--------------+---------
 numword   | Word, letters and digits | MCL1  | {simple}       | simple       | {mcl1}
 blank     | Space symbols            |       | {}             |              |
 asciiword | Word, all ASCII          | MCL   | {english_stem} | english_stem | {mcl}
 int       | Signed integer           | -1    | {simple}       | simple       | {-1}
(4 rows)

I had thought you might get a "numhword" output, but that only seems to happen if there's at least one letter after the dash:

regression=# select * from ts_debug('MCL1 MCL-X1');
      alias      |               description                | token  |  dictionaries  |  dictionary  | lexemes
-----------------+------------------------------------------+--------+----------------+--------------+----------
 numword         | Word, letters and digits                 | MCL1   | {simple}       | simple       | {mcl1}
 blank           | Space symbols                            |        | {}             |              |
 numhword        | Hyphenated word, letters and digits      | MCL-X1 | {simple}       | simple       | {mcl-x1}
 hword_asciipart | Hyphenated word part, all ASCII          | MCL    | {english_stem} | english_stem | {mcl}
 blank           | Space symbols                            | -      | {}             |              |
 hword_numpart   | Hyphenated word part, letters and digits | X1     | {simple}       | simple       | {x1}
(6 rows)

regards, tom lane
We have the same problem with names in astronomy, so we implemented dict_regex
http://vo.astronet.ru/arxiv/dict_regex.html
Check it out!

Oleg

On Thu, 10 Apr 2008, Reece Hart wrote:

> I'd like to use tsearch2 to index protein and gene names. Unfortunately,
> such names are written inconsistently and sometimes with hyphens. For
> example, MCL-1 and MCL1 are semantically equivalent but with the default
> parser and to_tsvector, I see this:
>
> unison@u8.3=> select to_tsvector('MCL1 MCL-1');
>        to_tsvector
> -------------------------
>  '-1':3 'mcl':2 'mcl1':1
>
> For the purposes of indexing these names, I suspect I'd get the majority
> of cases by removing a hyphen when it's followed by 1 or 2 chars from
> [a-zA-Z0-9]. Does that require a custom parser?
>
> Thanks,
> Reece
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
On Fri, 2008-04-11 at 22:07 +0400, Oleg Bartunov wrote:
> We have the same problem with names in astronomy, so we implemented
> dict_regex http://vo.astronet.ru/arxiv/dict_regex.html
> Check it out !

Oleg-

This gets me a lot closer. Thank you. I have two remaining problems.

The first problem is that 'bcl-w' and 'bcl-2' are parsed differently, like so:

unison@u8.3=> select * from ts_debug('english','bcl-w');
      alias      |           description           | token |  dictionaries  |  dictionary  | lexemes
-----------------+---------------------------------+-------+----------------+--------------+---------
 asciihword      | Hyphenated word, all ASCII      | bcl-w | {english_stem} | english_stem | {bcl-w}
 hword_asciipart | Hyphenated word part, all ASCII | bcl   | {english_stem} | english_stem | {bcl}
 blank           | Space symbols                   | -     | {}             |              |
 hword_asciipart | Hyphenated word part, all ASCII | w     | {english_stem} | english_stem | {w}

unison@u8.3=> select * from ts_debug('english','bcl-2');
   alias   |   description   | token |  dictionaries  |  dictionary  | lexemes
-----------+-----------------+-------+----------------+--------------+---------
 asciiword | Word, all ASCII | bcl   | {english_stem} | english_stem | {bcl}
 int       | Signed integer  | -2    | {simple}       | simple       | {-2}

One option would be to write a new parser, or modify wparser_def.c, so that InHyphenWordFirst accepts p_isdigit or p_isalnum on the first character (I think I got this right). This would achieve Tom's initial inkling that Bcl-2 might be parsed as a numhword, and (to me) it seems more congruent with the asciihword class.

Perhaps a more broadly useful modification is for the lexer to also emit whitespace-delimited tokens (period). asciihword almost does the trick, but it too requires a post-hyphen alphabetic character.

The second problem is with quantifiers in PCRE regexps. I initially implemented a dict_regex with a conf line like

    (\w+)-(\w{1,2}) $1$2

I can make simpler expressions work (e.g., (bcl)-(\w)). I think it must be related to the README caveat regarding PCRE partial matching mode, which I didn't understand initially. However, I don't see that it's possible to write a general regexp like the one I initially tried. Do you have any suggestions?

Thanks again. I'm very impressed with tsearch2.

-Reece

--
Reece Hart, http://harts.net/reece/, GPG:0x25EC91A0
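
One workaround that sidesteps both the parser and the dict_regex quantifier question is to normalize names before the text ever reaches to_tsvector. The sketch below is only an illustration of the rule proposed at the top of the thread (drop a hyphen when it is followed by 1 or 2 characters from [a-zA-Z0-9]). It uses Python's re module rather than PCRE partial matching, and the function name normalize_names is hypothetical, not part of tsearch2 or dict_regex.

```python
import re

# Assumption: the rule from the thread, written as an ordinary (non-partial)
# regular expression. Group 1 is the name stem, group 2 is a suffix of one or
# two characters from [a-zA-Z0-9]; the trailing \b keeps longer suffixes
# (e.g. "well-known") from matching.
_HYPHEN_SUFFIX = re.compile(r"\b([A-Za-z0-9]+)-([A-Za-z0-9]{1,2})\b")


def normalize_names(text: str) -> str:
    """Join 'MCL-1' into 'MCL1'; leave ordinary hyphenated words alone."""
    return _HYPHEN_SUFFIX.sub(r"\1\2", text)


if __name__ == "__main__":
    print(normalize_names("MCL1 MCL-1 bcl-w bcl-2"))  # MCL1 MCL1 bclw bcl2
```

The same normalization would have to be applied to query strings as well as to indexed documents, otherwise a search for 'MCL-1' would still be split by the default parser.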