Thread: BUG #15277: ts_headline strips things that look like HTML tags and it cannot be disabled

BUG #15277: ts_headline strips things that look like HTML tags and it cannot be disabled

From: PG Bug reporting form
The following bug has been logged on the website:

Bug reference:      15277
Logged by:          Dan Book
Email address:      grinnz@gmail.com
PostgreSQL version: 9.6.9
Operating system:   CentOS 7
Description:

This post has a good overview of the issue with a reproduction case:
https://stackoverflow.com/questions/40263956/why-is-postgresql-stripping-html-entities-in-ts-headline

I have text that is not HTML and contains things that look like HTML tags.
The headlines are HTML escaped when output. It is very odd to have this text
missing from the resulting headlines and no way to control the behavior.
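
A minimal reproduction sketch of the reported behaviour, using the sample string 'query <b>test</b>'. The StartSel/StopSel values are an arbitrary choice made here so that the highlight markers cannot be confused with the <b> tags in the input; they are not part of the original report.

```sql
-- The document text is plain text that happens to contain <b> and </b>.
-- Custom highlight markers ([ and ]) make it easy to see what survives.
SELECT ts_headline(
    'english',
    'query <b>test</b>',
    to_tsquery('english', 'test'),
    'StartSel=[, StopSel=]'
);
```

If the behaviour described in the report applies, the literal <b> and </b> from the input should be missing from the returned headline, while the matched word is wrapped in the chosen markers.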


Hello,

On Thu, Jul 12, 2018 at 07:59:40AM +0000, PG Bug reporting form wrote:
> I have text that is not HTML and contains things that look like HTML tags.
> The headlines are HTML escaped when output. It is very odd to have this text
> missing from the resulting headlines and no way to control the behavior.

<b> and </b> are recognized as "tag" tokens. By default they are
ignored. You need to modify the existing configuration or create a new one:

=# CREATE TEXT SEARCH CONFIGURATION english_tag (COPY = english);
=# alter text search configuration english_tag
   add mapping for tag with simple;

Then tags aren't skipped:

=# select * from ts_debug('english_tag', 'query <b>test</b>');
   alias   |   description   | token |  dictionaries  |  dictionary  | lexemes 
-----------+-----------------+-------+----------------+--------------+---------
 asciiword | Word, all ASCII | query | {english_stem} | english_stem | {queri}
 blank     | Space symbols   |       | {}             | (null)       | (null)
 tag       | XML tag         | <b>   | {simple}       | simple       | {<b>}
 asciiword | Word, all ASCII | test  | {english_stem} | english_stem | {test}
 tag       | XML tag         | </b>  | {simple}       | simple       | {</b>}

But even in this case ts_headline will skip tags, because that
behaviour is hardcoded [1].
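
A short sketch of how this could be checked, assuming a configuration named english_tag that maps the tag token type to the simple dictionary. The StartSel/StopSel markers are an arbitrary choice made here to keep the highlighting distinct from the <b> tags in the sample text.

```sql
-- Skip these two statements if english_tag already exists.
CREATE TEXT SEARCH CONFIGURATION english_tag (COPY = english);
ALTER TEXT SEARCH CONFIGURATION english_tag
    ADD MAPPING FOR tag WITH simple;

-- Tags are now indexed as lexemes, but ts_headline is still expected
-- to drop them from the generated headline.
SELECT ts_headline(
    'english_tag',
    'query <b>test</b>',
    to_tsquery('english_tag', 'test'),
    'StartSel=[, StopSel=]'
);
```

Comparing this result with the same call against the stock english configuration should show whether the mapping makes any difference to the headline text.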

I don't think it would be good to change the behaviour in existing versions
of PostgreSQL. But there is a workaround, if it is appropriate for you.
It is possible to create your own text search parser extension. Take [2]
as an example and change

#define HLIDREPLACE(x)    ( (x)==TAG_T )

to

#define HLIDREPLACE(x)    ( false )
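
A rough sketch of how a configuration could be pointed at such a custom parser once the extension is built and installed. The names my_tag_parser and english_keep_tags are hypothetical placeholders, not names taken from this thread or from [2], and the sketch assumes the custom parser exposes the same token type names as the default parser.

```sql
-- Hypothetical: my_tag_parser is a custom parser built with
-- HLIDREPLACE(x) defined as ( false ). Substitute the real parser name.
CREATE TEXT SEARCH CONFIGURATION english_keep_tags (PARSER = my_tag_parser);

-- A configuration created from a parser starts with no mappings,
-- so the token types of interest are mapped explicitly.
ALTER TEXT SEARCH CONFIGURATION english_keep_tags
    ADD MAPPING FOR asciiword, word WITH english_stem;
ALTER TEXT SEARCH CONFIGURATION english_keep_tags
    ADD MAPPING FOR tag WITH simple;

-- The headline is then generated through the custom parser.
SELECT ts_headline(
    'english_keep_tags',
    'query <b>test</b>',
    to_tsquery('english_keep_tags', 'test'),
    'StartSel=[, StopSel=]'
);
```

Only a few token types are mapped here to keep the sketch short; a real configuration would likely map the remaining types as well.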


[1] https://github.com/postgres/postgres/blob/master/src/backend/tsearch/wparser_def.c#L1923
[2] https://github.com/postgrespro/pg_tsparser

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


On Thu, Jul 12, 2018 at 5:22 AM Arthur Zakirov <a.zakirov@postgrespro.ru> wrote:
> It is possible to create your own text search parser extension.

Thanks for the response. It's good to know this is possible, but defining a custom parser is not ideal.

-Dan