Thread: using Tsearch2 for chemical text

using Tsearch2 for chemical text

From

Rajarshi Guha

Date:

25 July 2007, 19:40:58

Hi, I have a table with about 9M entries. The table has 2 fields: id
and name which are of serial and text types respectively. I have a
ordinary index on the text field which allows me to do searches in
reasonable time. Most of my searches are of the form

select * from mytable where name ~ 'some text query'

I know that the Tsearch2 module will let me have very efficient text
searches. But if I understand correctly, it's based on a language
specific dictionary.

My problem is that the name column contains names of chemicals. Now
for many cases this may simply be a number (1674-56-2) and in other
cases it may be an alphanumeric string (such as (-)O-acetylcarnitine
or 1,2-cis-dihydroxybenzoate). In some cases it is a well-known word
(say viagra or calcium  chloride or pentathol).

My question is: will Tsearch2 be able to handle this type of text? Or
will it be hampered by the fact that the bulk of the rows do not
correspond to ordinary English

-------------------------------------------------------------------
Rajarshi Guha  <rguha@indiana.edu>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04  06F7 1BB9 E634 9B87 56EE
-------------------------------------------------------------------
My Ethicator machine must have had a built-in moral
compromise spectral phantasmatron! I'm a genius."
                 -Calvin

Re: using Tsearch2 for chemical text

From

Tom Lane

Date:

25 July 2007, 19:51:31

Rajarshi Guha <rguha@indiana.edu> writes:
> My problem is that the name column contains names of chemicals. Now
> for many cases this may simply be a number (1674-56-2) and in other
> cases it may be an alphanumeric string (such as (-)O-acetylcarnitine
> or 1,2-cis-dihydroxybenzoate). In some cases it is a well-known word
> (say viagra or calcium  chloride or pentathol).

> My question is: will Tsearch2 be able to handle this type of text?

I think you might need to write a custom lexer to divide the strings
into meaningful units.  If there are subsections of these names that
make sense to search for, then tsearch2 can certainly handle the
mechanics of that, but I doubt that the standard rules will divide
these names into lexemes usefully.

            regards, tom lane

Re: using Tsearch2 for chemical text

From

"Dann Corbit"

Date:

25 July 2007, 19:53:07

Tsearch2 is used for full text indexing.  It won't be any faster than a
btree index like the one you have now (I assume it's unique -- if it
isn't then I think it ought to be).  If you cluster the table by your
index it will speed up your queries.

> -----Original Message-----
> From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-
> owner@postgresql.org] On Behalf Of Rajarshi Guha
> Sent: Wednesday, July 25, 2007 3:41 PM
> To: pgsql-general
> Subject: [GENERAL] using Tsearch2 for chemical text
>
> Hi, I have a table with about 9M entries. The table has 2 fields: id
> and name which are of serial and text types respectively. I have a
> ordinary index on the text field which allows me to do searches in
> reasonable time. Most of my searches are of the form
>
> select * from mytable where name ~ 'some text query'
>
> I know that the Tsearch2 module will let me have very efficient text
> searches. But if I understand correctly, it's based on a language
> specific dictionary.
>
> My problem is that the name column contains names of chemicals. Now
> for many cases this may simply be a number (1674-56-2) and in other
> cases it may be an alphanumeric string (such as (-)O-acetylcarnitine
> or 1,2-cis-dihydroxybenzoate). In some cases it is a well-known word
> (say viagra or calcium  chloride or pentathol).
>
> My question is: will Tsearch2 be able to handle this type of text? Or
> will it be hampered by the fact that the bulk of the rows do not
> correspond to ordinary English
>
> -------------------------------------------------------------------
> Rajarshi Guha  <rguha@indiana.edu>
> GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04  06F7 1BB9 E634 9B87 56EE
> -------------------------------------------------------------------
> My Ethicator machine must have had a built-in moral
> compromise spectral phantasmatron! I'm a genius."
>                  -Calvin
>
>
>
> ---------------------------(end of
broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
>        choose an index scan if your joining column's datatypes do not
>        match

Re: using Tsearch2 for chemical text

From

Tatsuo Ishii

Date:

25 July 2007, 22:30:58

> Rajarshi Guha <rguha@indiana.edu> writes:
> > My problem is that the name column contains names of chemicals. Now
> > for many cases this may simply be a number (1674-56-2) and in other
> > cases it may be an alphanumeric string (such as (-)O-acetylcarnitine
> > or 1,2-cis-dihydroxybenzoate). In some cases it is a well-known word
> > (say viagra or calcium  chloride or pentathol).
>
> > My question is: will Tsearch2 be able to handle this type of text?
>
> I think you might need to write a custom lexer to divide the strings
> into meaningful units.  If there are subsections of these names that
> make sense to search for, then tsearch2 can certainly handle the
> mechanics of that, but I doubt that the standard rules will divide
> these names into lexemes usefully.
>
>             regards, tom lane

We have similar problem since Japanese is an agglutinative
language. To solve the problem, we divide Japanese texts into space
separted "words" by using specialized tool, which has huge dictionary
to look for word boundaries. To make things easier, I have written a
simple C function which calls the tool and returns the space separated
texts.

Just for your information.
--
Tatsuo Ishii
SRA OSS, Inc. Japan

Re: using Tsearch2 for chemical text

From

Naz Gassiep

Date:

26 July 2007, 02:53:13

> I think you might need to write a custom lexer to divide the strings
> into meaningful units.  If there are subsections of these names that
> make sense to search for, then tsearch2 can certainly handle the
> mechanics of that, but I doubt that the standard rules will divide
> these names into lexemes usefully.

A custom lexer for tsearch2 that recognized chemistry related lexical
components (di-, tetra-, acetyl-, ethan-, -oic, -ane, -ene etc) would
increase *hugely* the out-of-the-box applicability of PostgreSQL to
scientific applications. Perhaps such an effort could be co ordinated
with a physics based lexer and biology related lexer, to perhaps provide
a unified lexer that provided full scientific capabilities in the way
that PostGIS provides unified geospatial capabilities.

I don't know how best to bring such an effort about, but I do know that
if such a thing were created it would be a boon for PostgreSQL, giving
it a very significant leg up in terms of functionality, not to mention
the great positive impact that the wide, free availability of such a
tool would have on the scientific research community.

Re: using Tsearch2 for chemical text

From

Oleg Bartunov

Date:

26 July 2007, 02:53:56

On Wed, 25 Jul 2007, Rajarshi Guha wrote:

> Hi, I have a table with about 9M entries. The table has 2 fields: id and name
> which are of serial and text types respectively. I have a ordinary index on
> the text field which allows me to do searches in reasonable time. Most of my
> searches are of the form
>
> select * from mytable where name ~ 'some text query'
>
> I know that the Tsearch2 module will let me have very efficient text
> searches. But if I understand correctly, it's based on a language specific
> dictionary.

wrong ! it comes with some written human language dictionaries, but you can
write your very own dictionaries. dictionary is just a C-program.

>
> My problem is that the name column contains names of chemicals. Now for many
> cases this may simply be a number (1674-56-2) and in other cases it may be an
> alphanumeric string (such as (-)O-acetylcarnitine or
> 1,2-cis-dihydroxybenzoate). In some cases it is a well-known word (say viagra
> or calcium  chloride or pentathol).
>
> My question is: will Tsearch2 be able to handle this type of text? Or will it
> be hampered by the fact that the bulk of the rows do not correspond to
> ordinary English

Oh, sure. See, for example, our dict_regex dictionary, we use for
astronomical search.
http://lynx.sao.ru/~karpov/software/postgres_dict_regex.html

This is a work in progress, but it works.

>
> -------------------------------------------------------------------
> Rajarshi Guha  <rguha@indiana.edu>
> GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04  06F7 1BB9 E634 9B87 56EE
> -------------------------------------------------------------------
> My Ethicator machine must have had a built-in moral
> compromise spectral phantasmatron! I'm a genius."
>               -Calvin
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
>     choose an index scan if your joining column's datatypes do not
>     match

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: using Tsearch2 for chemical text

From

Oleg Bartunov

Date:

26 July 2007, 03:08:51

Naz, in posted link to the dict_regex dictionary for tsearch2
http://lynx.sao.ru/~karpov/software/postgres_dict_regex.html

Feel free to test it and send us feedback. It's rather general, of course,
it uses regex (pcre library).

Oleg
On Thu, 26 Jul 2007, Naz Gassiep wrote:

>
>> I think you might need to write a custom lexer to divide the strings
>> into meaningful units.  If there are subsections of these names that
>> make sense to search for, then tsearch2 can certainly handle the
>> mechanics of that, but I doubt that the standard rules will divide
>> these names into lexemes usefully.
>
> A custom lexer for tsearch2 that recognized chemistry related lexical
> components (di-, tetra-, acetyl-, ethan-, -oic, -ane, -ene etc) would
> increase *hugely* the out-of-the-box applicability of PostgreSQL to
> scientific applications. Perhaps such an effort could be co ordinated with a
> physics based lexer and biology related lexer, to perhaps provide a unified
> lexer that provided full scientific capabilities in the way that PostGIS
> provides unified geospatial capabilities.
>
> I don't know how best to bring such an effort about, but I do know that if
> such a thing were created it would be a boon for PostgreSQL, giving it a very
> significant leg up in terms of functionality, not to mention the great
> positive impact that the wide, free availability of such a tool would have on
> the scientific research community.
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend
>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83