Re: HTML tags and tsearch2 - Mailing list pgsql-general

From Oleg Bartunov
Subject Re: HTML tags and tsearch2
Date
Msg-id Pine.LNX.4.64.0806261602120.11363@sn.sai.msu.ru
In response to HTML tags and tsearch2  (Joanna Sharman <Joanna.Sharman@ed.ac.uk>)
List pgsql-general
On Thu, 26 Jun 2008, Joanna Sharman wrote:

> Hi,
>
> I have recently started experimenting with tsearch2 and it seems that the
> default behaviour is to ignore HTML tags and treat them as word-separators.
> What I would like it to do is to ignore HTML tags within words, but instead
> of creating separate words, combine the characters separated by the tag into
> one word.
>
> For example: in the database I have words like 'K<sub>ir</sub>' that need to
> be searched using the term without HTML tags, i.e. 'Kir'. Currently, the HTML
> tags are ignored and two words are stored in the vector, 'k' and 'ir'. I
> would like only one word, 'kir', to be stored in the vector, so that searches
> using the word 'kir' will match the row.

Two options: write your own HTML parser, or preprocess the text before passing it to to_tsvector.
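
A minimal sketch of the preprocessing route, done on the client side before the text reaches to_tsvector. The function name strip_html_for_fts and the list of inline tags are assumptions made for illustration, not part of tsearch2; adjust the list to whatever markup actually occurs inside words in your data.

```python
import re

# Hypothetical list of inline tags that may appear in the middle of a word.
INLINE_TAGS = ("sub", "sup", "i", "b", "em", "strong", "span")

# Matches opening and closing forms of the inline tags only (not <br>, <img>, ...).
_INLINE_RE = re.compile(r"</?(?:%s)\b[^>]*>" % "|".join(INLINE_TAGS), re.IGNORECASE)
# Matches any remaining tag.
_ANY_TAG_RE = re.compile(r"<[^>]+>")


def strip_html_for_fts(text):
    """Remove inline tags without a separator; turn all other tags into spaces."""
    text = _INLINE_RE.sub("", text)    # 'K<sub>ir</sub>' becomes 'Kir'
    text = _ANY_TAG_RE.sub(" ", text)  # block-level tags still separate words
    return " ".join(text.split())      # collapse runs of whitespace
```

The cleaned string would then be passed to to_tsvector (for example as a query parameter), so that 'K<sub>ir</sub>' is indexed as the single lexeme 'kir'. Regular expressions are not a full HTML parser; this only covers simple, well-formed tags.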

>
> A second, related question is whether it is possible to cause tsearch2 to
> split up words when it encounters digits, e.g. 'TM8' into 'TM' and '8'.

you can write your own dictionary or use dict_regex from
http://vo.astronet.ru/arxiv/dict_regex.html
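
If neither a custom dictionary nor dict_regex is convenient, the same effect can be approximated by preprocessing. The sketch below is not dict_regex; it is a plain client-side alternative, and the name split_at_digits is hypothetical. It assumes ASCII letters and digits only.

```python
import re

# Zero-width boundary between an ASCII letter and a digit, in either order.
_LETTER_DIGIT = re.compile(
    r"(?<=[A-Za-z])(?=[0-9])|(?<=[0-9])(?=[A-Za-z])"
)


def split_at_digits(text):
    """Insert a space wherever a letter meets a digit, e.g. 'TM8' -> 'TM 8'."""
    return _LETTER_DIGIT.sub(" ", text)
```

Applied before to_tsvector, this makes 'TM8' arrive as the two tokens 'TM' and '8'. The same transformation has to be applied to the search terms, otherwise a query for 'TM8' will no longer match.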

>
> I am not sure if this functionality is possible to implement using tsearch2
> or if there might be a better way, so I would be grateful for any advice or
> pointers to further reading on how I might do this. (I am using PostgreSQL
> version 8.1.10)

think about upgrading to 8.3, where full-text search is integrated into the core

>
> Many thanks in advance,
> Joanna
>
>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83