Thread: using Tsearch2 for chemical text
Hi, I have a table with about 9M entries. The table has 2 fields: id and name which are of serial and text types respectively. I have a ordinary index on the text field which allows me to do searches in reasonable time. Most of my searches are of the form select * from mytable where name ~ 'some text query' I know that the Tsearch2 module will let me have very efficient text searches. But if I understand correctly, it's based on a language specific dictionary. My problem is that the name column contains names of chemicals. Now for many cases this may simply be a number (1674-56-2) and in other cases it may be an alphanumeric string (such as (-)O-acetylcarnitine or 1,2-cis-dihydroxybenzoate). In some cases it is a well-known word (say viagra or calcium chloride or pentathol). My question is: will Tsearch2 be able to handle this type of text? Or will it be hampered by the fact that the bulk of the rows do not correspond to ordinary English ------------------------------------------------------------------- Rajarshi Guha <rguha@indiana.edu> GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE ------------------------------------------------------------------- My Ethicator machine must have had a built-in moral compromise spectral phantasmatron! I'm a genius." -Calvin
Rajarshi Guha <rguha@indiana.edu> writes: > My problem is that the name column contains names of chemicals. Now > for many cases this may simply be a number (1674-56-2) and in other > cases it may be an alphanumeric string (such as (-)O-acetylcarnitine > or 1,2-cis-dihydroxybenzoate). In some cases it is a well-known word > (say viagra or calcium chloride or pentathol). > My question is: will Tsearch2 be able to handle this type of text? I think you might need to write a custom lexer to divide the strings into meaningful units. If there are subsections of these names that make sense to search for, then tsearch2 can certainly handle the mechanics of that, but I doubt that the standard rules will divide these names into lexemes usefully. regards, tom lane
Tsearch2 is used for full text indexing. It won't be any faster than a btree index like the one you have now (I assume it's unique -- if it isn't then I think it ought to be). If you cluster the table by your index it will speed up your queries. > -----Original Message----- > From: pgsql-general-owner@postgresql.org [mailto:pgsql-general- > owner@postgresql.org] On Behalf Of Rajarshi Guha > Sent: Wednesday, July 25, 2007 3:41 PM > To: pgsql-general > Subject: [GENERAL] using Tsearch2 for chemical text > > Hi, I have a table with about 9M entries. The table has 2 fields: id > and name which are of serial and text types respectively. I have a > ordinary index on the text field which allows me to do searches in > reasonable time. Most of my searches are of the form > > select * from mytable where name ~ 'some text query' > > I know that the Tsearch2 module will let me have very efficient text > searches. But if I understand correctly, it's based on a language > specific dictionary. > > My problem is that the name column contains names of chemicals. Now > for many cases this may simply be a number (1674-56-2) and in other > cases it may be an alphanumeric string (such as (-)O-acetylcarnitine > or 1,2-cis-dihydroxybenzoate). In some cases it is a well-known word > (say viagra or calcium chloride or pentathol). > > My question is: will Tsearch2 be able to handle this type of text? Or > will it be hampered by the fact that the bulk of the rows do not > correspond to ordinary English > > ------------------------------------------------------------------- > Rajarshi Guha <rguha@indiana.edu> > GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE > ------------------------------------------------------------------- > My Ethicator machine must have had a built-in moral > compromise spectral phantasmatron! I'm a genius." > -Calvin > > > > ---------------------------(end of broadcast)--------------------------- > TIP 9: In versions below 8.0, the planner will ignore your desire to > choose an index scan if your joining column's datatypes do not > match
> Rajarshi Guha <rguha@indiana.edu> writes: > > My problem is that the name column contains names of chemicals. Now > > for many cases this may simply be a number (1674-56-2) and in other > > cases it may be an alphanumeric string (such as (-)O-acetylcarnitine > > or 1,2-cis-dihydroxybenzoate). In some cases it is a well-known word > > (say viagra or calcium chloride or pentathol). > > > My question is: will Tsearch2 be able to handle this type of text? > > I think you might need to write a custom lexer to divide the strings > into meaningful units. If there are subsections of these names that > make sense to search for, then tsearch2 can certainly handle the > mechanics of that, but I doubt that the standard rules will divide > these names into lexemes usefully. > > regards, tom lane We have similar problem since Japanese is an agglutinative language. To solve the problem, we divide Japanese texts into space separted "words" by using specialized tool, which has huge dictionary to look for word boundaries. To make things easier, I have written a simple C function which calls the tool and returns the space separated texts. Just for your information. -- Tatsuo Ishii SRA OSS, Inc. Japan
> I think you might need to write a custom lexer to divide the strings > into meaningful units. If there are subsections of these names that > make sense to search for, then tsearch2 can certainly handle the > mechanics of that, but I doubt that the standard rules will divide > these names into lexemes usefully. A custom lexer for tsearch2 that recognized chemistry related lexical components (di-, tetra-, acetyl-, ethan-, -oic, -ane, -ene etc) would increase *hugely* the out-of-the-box applicability of PostgreSQL to scientific applications. Perhaps such an effort could be co ordinated with a physics based lexer and biology related lexer, to perhaps provide a unified lexer that provided full scientific capabilities in the way that PostGIS provides unified geospatial capabilities. I don't know how best to bring such an effort about, but I do know that if such a thing were created it would be a boon for PostgreSQL, giving it a very significant leg up in terms of functionality, not to mention the great positive impact that the wide, free availability of such a tool would have on the scientific research community.
On Wed, 25 Jul 2007, Rajarshi Guha wrote: > Hi, I have a table with about 9M entries. The table has 2 fields: id and name > which are of serial and text types respectively. I have a ordinary index on > the text field which allows me to do searches in reasonable time. Most of my > searches are of the form > > select * from mytable where name ~ 'some text query' > > I know that the Tsearch2 module will let me have very efficient text > searches. But if I understand correctly, it's based on a language specific > dictionary. wrong ! it comes with some written human language dictionaries, but you can write your very own dictionaries. dictionary is just a C-program. > > My problem is that the name column contains names of chemicals. Now for many > cases this may simply be a number (1674-56-2) and in other cases it may be an > alphanumeric string (such as (-)O-acetylcarnitine or > 1,2-cis-dihydroxybenzoate). In some cases it is a well-known word (say viagra > or calcium chloride or pentathol). > > My question is: will Tsearch2 be able to handle this type of text? Or will it > be hampered by the fact that the bulk of the rows do not correspond to > ordinary English Oh, sure. See, for example, our dict_regex dictionary, we use for astronomical search. http://lynx.sao.ru/~karpov/software/postgres_dict_regex.html This is a work in progress, but it works. > > ------------------------------------------------------------------- > Rajarshi Guha <rguha@indiana.edu> > GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE > ------------------------------------------------------------------- > My Ethicator machine must have had a built-in moral > compromise spectral phantasmatron! I'm a genius." > -Calvin > > > > ---------------------------(end of broadcast)--------------------------- > TIP 9: In versions below 8.0, the planner will ignore your desire to > choose an index scan if your joining column's datatypes do not > match Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
Naz, in posted link to the dict_regex dictionary for tsearch2 http://lynx.sao.ru/~karpov/software/postgres_dict_regex.html Feel free to test it and send us feedback. It's rather general, of course, it uses regex (pcre library). Oleg On Thu, 26 Jul 2007, Naz Gassiep wrote: > >> I think you might need to write a custom lexer to divide the strings >> into meaningful units. If there are subsections of these names that >> make sense to search for, then tsearch2 can certainly handle the >> mechanics of that, but I doubt that the standard rules will divide >> these names into lexemes usefully. > > A custom lexer for tsearch2 that recognized chemistry related lexical > components (di-, tetra-, acetyl-, ethan-, -oic, -ane, -ene etc) would > increase *hugely* the out-of-the-box applicability of PostgreSQL to > scientific applications. Perhaps such an effort could be co ordinated with a > physics based lexer and biology related lexer, to perhaps provide a unified > lexer that provided full scientific capabilities in the way that PostGIS > provides unified geospatial capabilities. > > I don't know how best to bring such an effort about, but I do know that if > such a thing were created it would be a boon for PostgreSQL, giving it a very > significant leg up in terms of functionality, not to mention the great > positive impact that the wide, free availability of such a tool would have on > the scientific research community. > > > ---------------------------(end of broadcast)--------------------------- > TIP 6: explain analyze is your friend > Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83