Re: BUG #15277: ts_headline strips things that look like HTML tags and it cannot be disabled - Mailing list pgsql-bugs

From	Arthur Zakirov
Subject	Re: BUG #15277: ts_headline strips things that look like HTML tags and it cannot be disabled
Date	July 12, 2018 12:22:06
Msg-id	20180712092205.GA16177@zakirov.localdomain
In response to	BUG #15277: ts_headline strips things that look like HTML tags and it cannot be disabled (PG Bug reporting form <noreply@postgresql.org>)
Responses	Re: BUG #15277: ts_headline strips things that look like HTML tags and it cannot be disabled
List	pgsql-bugs

Hello,

On Thu, Jul 12, 2018 at 07:59:40AM +0000, PG Bug reporting form wrote:
> I have text that is not HTML and contains things that look like HTML tags.
> The headlines are HTML escaped when output. It is very odd to have this text
> missing from the resulting headlines and no way to control the behavior.

<b> and </b> are recognized as "tag" tokens, which are ignored by
default. You need to modify an existing configuration or create a new one:

=# CREATE TEXT SEARCH CONFIGURATION english_tag (COPY = english);
=# ALTER TEXT SEARCH CONFIGURATION english_tag
   ADD MAPPING FOR tag WITH simple;

Then tags aren't skipped:

=# select * from ts_debug('english_tag', 'query <b>test</b>');
   alias   |   description   | token |  dictionaries  |  dictionary  | lexemes 
-----------+-----------------+-------+----------------+--------------+---------
 asciiword | Word, all ASCII | query | {english_stem} | english_stem | {queri}
 blank     | Space symbols   |       | {}             | (null)       | (null)
 tag       | XML tag         | <b>   | {simple}       | simple       | {<b>}
 asciiword | Word, all ASCII | test  | {english_stem} | english_stem | {test}
 tag       | XML tag         | </b>  | {simple}       | simple       | {</b>}

But even in this case ts_headline will skip tags, because that
behaviour is hardcoded [1].
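
To check this on your own server, a query along the following lines can
be run against the english_tag configuration created above. It is only a
sketch: the sample text, the search term and the << >> markers are chosen
for illustration (the markers just keep the highlighting apart from the
<b> in the text), and no output is shown since it should be verified
locally.

```sql
-- Assumes the english_tag configuration from above already exists.
-- Sample document, query term and StartSel/StopSel values are illustrative.
SELECT ts_headline('english_tag',
                   'query <b>test</b>',
                   to_tsquery('english_tag', 'test'),
                   'StartSel = <<, StopSel = >>');
```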

I don't think it is good to change the behaviour for existing versions of
PostgreSQL. But there is a workaround, if it is appropriate for someone:
you can create your own text search parser extension (see the example
in [2]) and change

#define HLIDREPLACE(x)    ( (x)==TAG_T )

to

#define HLIDREPLACE(x)    ( false )
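
After building and installing a parser extension with that change, a
configuration still has to be pointed at the new parser. A rough sketch
follows, assuming the extension is a patched build of pg_tsparser [2],
that it registers a parser named tsparser with the same token types as
the default parser, and using english_notag as a made-up configuration
name. None of these names come from this message, so check them against
the extension's README.

```sql
-- Assumption: the patched extension is installed under the name
-- pg_tsparser and provides a parser called tsparser.
CREATE EXTENSION pg_tsparser;

-- english_notag is a hypothetical name. A configuration created with
-- PARSER = ... starts without any token mappings, so add them explicitly.
CREATE TEXT SEARCH CONFIGURATION english_notag (PARSER = tsparser);

-- Assumption: token type names match those of the default parser.
ALTER TEXT SEARCH CONFIGURATION english_notag
    ADD MAPPING FOR asciiword, word WITH english_stem;
ALTER TEXT SEARCH CONFIGURATION english_notag
    ADD MAPPING FOR tag WITH simple;

-- The headline is now produced by the patched parser's headline function.
SELECT ts_headline('english_notag',
                   'query <b>test</b>',
                   to_tsquery('english_notag', 'test'));
```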


1 - https://github.com/postgres/postgres/blob/master/src/backend/tsearch/wparser_def.c#L1923
2 - https://github.com/postgrespro/pg_tsparser

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
