Thread: BUG #15277: ts_headline strips things that look like HTML tags and it cannot be disabled
BUG #15277: ts_headline strips things that look like HTML tags and it cannot be disabled
From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference: 15277
Logged by: Dan Book
Email address: grinnz@gmail.com
PostgreSQL version: 9.6.9
Operating system: CentOS 7
Description:

This post has a good overview of the issue with a reproduction case:
https://stackoverflow.com/questions/40263956/why-is-postgresql-stripping-html-entities-in-ts-headline

I have text that is not HTML and contains things that look like HTML tags.
The headlines are HTML escaped when output. It is very odd to have this text
missing from the resulting headlines and no way to control the behavior.
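
For readers without access to the linked post, a minimal sketch of the kind of query that shows the behavior might look like the following. The sample text and the StartSel/StopSel values are illustrative assumptions, not taken from the report; the brackets are chosen only so that the highlight markers cannot be mistaken for the tag-like text in the input.

```sql
-- Plain text (not HTML) that happens to contain tag-like substrings.
-- StartSel/StopSel are overridden (illustrative choice) because the
-- default highlight markers are themselves <b> and </b>.
SELECT ts_headline(
    'english',
    'query <b>test</b>',
    to_tsquery('english', 'query'),
    'StartSel = "[", StopSel = "]"'
);
```

As described in the report, the tag-like substrings are expected to be missing from the returned headline.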
Re: BUG #15277: ts_headline strips things that look like HTML tags and it cannot be disabled
From
Arthur Zakirov
Date:
Hello,

On Thu, Jul 12, 2018 at 07:59:40AM +0000, PG Bug reporting form wrote:
> I have text that is not HTML and contains things that look like HTML tags.
> The headlines are HTML escaped when output. It is very odd to have this text
> missing from the resulting headlines and no way to control the behavior.

<b> and </b> are recognized as "tag" token. By default they are
ignored. You need to modify existing configuration or create new one:

=# CREATE TEXT SEARCH CONFIGURATION english_tag (COPY = english);
=# alter text search configuration english_tag
     add mapping for tag with simple;

Then tags aren't skipped:

=# select * from ts_debug('english_tag', 'query <b>test</b>');
   alias   |   description   | token |  dictionaries  |  dictionary  | lexemes
-----------+-----------------+-------+----------------+--------------+---------
 asciiword | Word, all ASCII | query | {english_stem} | english_stem | {queri}
 blank     | Space symbols   |       | {}             | (null)       | (null)
 tag       | XML tag         | <b>   | {simple}       | simple       | {<b>}
 asciiword | Word, all ASCII | test  | {english_stem} | english_stem | {test}
 tag       | XML tag         | </b>  | {simple}       | simple       | {</b>}

But even in this case ts_headline will skip tags. Because it is
hardcoded [1].

I think it isn't good to change the behaviour for existing versions of
PostgreSQL. But there is a workaround of course if it is appropriate for
someone. It is possible to create your own text search parser extension.
Example [2]. And change

#define HLIDREPLACE(x)  ( (x)==TAG_T )

to

#define HLIDREPLACE(x)  ( false )

1 - https://github.com/postgres/postgres/blob/master/src/backend/tsearch/wparser_def.c#L1923
2 - https://github.com/postgrespro/pg_tsparser

--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
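
The point above, that ts_headline still skips tags even when the configuration maps the tag token type to a dictionary, could be checked with a query along these lines. This is only a sketch: it assumes the english_tag configuration from the message has already been created, and the StartSel/StopSel values are an illustrative choice so that the highlight markers differ from the tags in the sample text.

```sql
-- Assumes english_tag exists (created with the statements in the message above).
-- Brackets are used as highlight markers (illustrative choice) so that any
-- <b> or </b> in the output could only have come from the input text.
SELECT ts_headline(
    'english_tag',
    'query <b>test</b>',
    to_tsquery('english_tag', 'query'),
    'StartSel = "[", StopSel = "]"'
);
```

If the behavior is as described, the headline should not contain the <b> and </b> tokens, even though ts_debug reports them as indexed under this configuration.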
Re: BUG #15277: ts_headline strips things that look like HTML tags and it cannot be disabled
From
Dan Book
Date:
On Thu, Jul 12, 2018 at 5:22 AM Arthur Zakirov <a.zakirov@postgrespro.ru> wrote:
> Hello,
>
> On Thu, Jul 12, 2018 at 07:59:40AM +0000, PG Bug reporting form wrote:
> > I have text that is not HTML and contains things that look like HTML tags.
> > The headlines are HTML escaped when output. It is very odd to have this text
> > missing from the resulting headlines and no way to control the behavior.
>
> <b> and </b> are recognized as "tag" token. By default they are
> ignored. You need to modify existing configuration or create new one:
>
> =# CREATE TEXT SEARCH CONFIGURATION english_tag (COPY = english);
> =# alter text search configuration english_tag
>      add mapping for tag with simple;
>
> Then tags aren't skipped:
>
> =# select * from ts_debug('english_tag', 'query <b>test</b>');
>    alias   |   description   | token |  dictionaries  |  dictionary  | lexemes
> -----------+-----------------+-------+----------------+--------------+---------
>  asciiword | Word, all ASCII | query | {english_stem} | english_stem | {queri}
>  blank     | Space symbols   |       | {}             | (null)       | (null)
>  tag       | XML tag         | <b>   | {simple}       | simple       | {<b>}
>  asciiword | Word, all ASCII | test  | {english_stem} | english_stem | {test}
>  tag       | XML tag         | </b>  | {simple}       | simple       | {</b>}
>
> But even in this case ts_headline will skip tags. Because it is
> hardcoded [1].
>
> I think it isn't good to change the behaviour for existing versions of
> PostgreSQL. But there is a workaround of course if it is appropriate for
> someone. It is possible to create your own text search parser extension.
> Example [2]. And change
>
> #define HLIDREPLACE(x)  ( (x)==TAG_T )
>
> to
>
> #define HLIDREPLACE(x)  ( false )

Thanks for the response. It's good to know this is possible but defining a custom parser is not ideal.
-Dan