Re: using Tsearch2 for chemical text - Mailing list pgsql-general

From	Tatsuo Ishii
Subject	Re: using Tsearch2 for chemical text
Date	July 25, 2007 22:30:58
Msg-id	20070726.102837.51511527.t-ishii@sraoss.co.jp
In response to	Re: using Tsearch2 for chemical text (Tom Lane <tgl@sss.pgh.pa.us>)
List	pgsql-general

> Rajarshi Guha <rguha@indiana.edu> writes:
> > My problem is that the name column contains names of chemicals. Now
> > for many cases this may simply be a number (1674-56-2) and in other
> > cases it may be an alphanumeric string (such as (-)O-acetylcarnitine
> > or 1,2-cis-dihydroxybenzoate). In some cases it is a well-known word
> > (say viagra or calcium  chloride or pentathol).
>
> > My question is: will Tsearch2 be able to handle this type of text?
>
> I think you might need to write a custom lexer to divide the strings
> into meaningful units.  If there are subsections of these names that
> make sense to search for, then tsearch2 can certainly handle the
> mechanics of that, but I doubt that the standard rules will divide
> these names into lexemes usefully.
>
>             regards, tom lane

We have a similar problem, since Japanese is an agglutinative
language and is written without spaces between words. To solve it, we
divide Japanese text into space-separated "words" using a specialized
tool, which has a huge dictionary for finding word boundaries. To make
things easier, I have written a simple C function that calls the tool
and returns the space-separated text.
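
The same pre-segmentation idea might carry over to chemical names. What
follows is only a rough sketch in Python (not the C function mentioned
above), and the function name segment_name and the splitting rules are
assumptions made up for illustration: split on whitespace, commas,
hyphens and brackets, but keep CAS-style numbers such as 1674-56-2 in
one piece.

```python
import re

# Assumed rule: a token shaped like a CAS registry number stays whole.
CAS_NUMBER = re.compile(r"\d{2,7}-\d{2}-\d")
# Assumed rule: these characters separate the searchable parts of a name.
SEPARATORS = re.compile(r"[\s,\-()\[\]]+")


def segment_name(name: str) -> str:
    """Return `name` as lowercase tokens joined by single spaces."""
    tokens = []
    for chunk in name.split():
        if CAS_NUMBER.fullmatch(chunk):
            # Keep registry numbers intact so they match as one unit.
            tokens.append(chunk)
        else:
            # Drop empty strings produced by leading/trailing separators.
            tokens.extend(t for t in SEPARATORS.split(chunk) if t)
    return " ".join(t.lower() for t in tokens)


if __name__ == "__main__":
    for example in ("1674-56-2", "(-)O-acetylcarnitine",
                    "1,2-cis-dihydroxybenzoate", "calcium  chloride"):
        print(segment_name(example))
```

The resulting string could then be handed to tsearch2 in place of the
raw name. How the parser treats a token like 1674-56-2 afterwards is
worth checking with ts_debug before relying on it.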

Just for your information.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
