Thread: BUG #15172: Postgresql ts_headline with <-> operator does not highlight text properly
BUG #15172: Postgresql ts_headline with <-> operator does not highlight text properly
From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference: 15172
Logged by: Ngigi Waithaka
Email address: ngigi@at.co.ke
PostgreSQL version: 10.3
Operating system: Linux
Description:

I have noticed a likely bug when using ts_headline with the <-> operator.

Assuming the following query:

SELECT ts_headline('English','This Commercial Bank does not have any Equity in Europe but European Commercial Bank does', phraseto_tsquery('English','European Commercial Bank')::tsquery);

The returned result is:
This <b>Commercial</b> <b>Bank</b> does not have any Equity in Europe but <b>European</b> <b>Commercial</b> <b>Bank</b> does

This highlights the words Commercial & Bank separately, in addition to European Commercial Bank.

However, the expected output is:
This Commercial Bank does not have any Equity in Europe but <b>European</b> <b>Commercial</b> <b>Bank</b> does

which highlights only *European Commercial Bank*, because of the <-> operator in phraseto_tsquery.

SELECT phraseto_tsquery('English','European Commercial Bank');
returns 'european' <-> 'commerci' <-> 'bank' as expected, indicating that the problem is with the ts_headline function.

Regards
NgigiW
Re: BUG #15172: Postgresql ts_headline with <-> operator does not highlight text properly
From
Alex Malek
Date:
I can confirm this is still an issue in PostgreSQL 14.4
Best,
Alex
On Wed, Aug 3, 2022 at 1:58 PM PG Bug reporting form <noreply@postgresql.org> wrote:
The following bug has been logged on the website:
Bug reference: 15172
Logged by: Ngigi Waithaka
Email address: ngigi@at.co.ke
PostgreSQL version: 10.3
Operating system: Linux
Description:
I have a noticed a likely bug when using ts_headline with the <-> operator
Assuming the following query:
SELECT ts_headline('English','This Commercial Bank does not have any Equity
in Europe but European Commercial Bank does',
phraseto_tsquery('English','European Commercial
Bank')::tsquery);
The returned result is:
This <b>Commercial</b> <b>Bank</b> does not have any Equity in Europe but
<b>European</b> <b>Commercial</b> <b>Bank</b> does
This highlights the words Commercial & Bank separately in addition to
European Commercial Bank.
However, the correct output expected should be:
This Commercial Bank does not have any Equity in Europe but <b>European</b>
<b>Commercial</b> <b>Bank</b> does
Which only highlights *European Commercial Bank* due to the <-> operator in
phraseto_tsquery.
SELECT phraseto_tsquery('English','European Commercial Bank');
returns 'european' <-> 'commerci' <-> 'bank' as expected indicating the
problem is with ts_headline function.
Regards
NgigiW
Re: BUG #15172: Postgresql ts_headline with <-> operator does not highlight text properly
From
Bruce Momjian
Date:
On Wed, Aug 3, 2022 at 02:02:51PM -0400, Alex Malek wrote:
> On Wed, Aug 3, 2022 at 1:58 PM PG Bug reporting form <noreply@postgresql.org>
> wrote:
> I have a noticed a likely bug when using ts_headline with the <-> operator
>
> Assuming the following query:
>
> SELECT ts_headline('English','This Commercial Bank does not have any Equity
> in Europe but European Commercial Bank does',
> phraseto_tsquery('English','European Commercial
> Bank')::tsquery);
>
> The returned result is:
> This <b>Commercial</b> <b>Bank</b> does not have any Equity in Europe but
> <b>European</b> <b>Commercial</b> <b>Bank</b> does
>
> This highlights the words Commercial & Bank separately in addition to
> European Commercial Bank.
>
> However, the correct output expected should be:
> This Commercial Bank does not have any Equity in Europe but <b>European</b>
> <b>Commercial</b> <b>Bank</b> does
>
> Which only highlights *European Commercial Bank* due to the <-> operator in
> phraseto_tsquery.
>
> SELECT phraseto_tsquery('English','European Commercial Bank');
> returns 'european' <-> 'commerci' <-> 'bank' as expected indicating the
> problem is with ts_headline function.

I tested this against Postgres 11 and master (and you tested on PG 10 and 14) and I found the same behavior, plus I found something even worse:

    SELECT ts_headline('English', 'This Commercial Bank does not have any Equity in Europe but European Commercial Bank does', ('''equiti'' <-> ''bank''')::tsquery);
                                                      ts_headline
    ----------------------------------------------------------------------------------------------------------------
     This Commercial <b>Bank</b> does not have any <b>Equity</b> in Europe but European Commercial <b>Bank</b> does

Notice that "Bank" and "Equity" are not next to each other, but they still highlight.
In fact, the words appear to be independently checked:

    SELECT ts_headline('English', 'This Commercial Bank does not have any Equity in Europe but European Commercial Bank does', ('''XXX'' <-> ''bank''')::tsquery);
                                                   ts_headline
    ---------------------------------------------------------------------------------------------------------
     This Commercial <b>Bank</b> does not have any Equity in Europe but European Commercial <b>Bank</b> does

Is this documented somewhere?

--
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.
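The "independently checked" observation can be mimicked outside PostgreSQL with a toy model. The Python sketch below is not ts_headline's implementation: `phrase_match` and `independent_hits` are hypothetical helpers invented for illustration, and plain lowercasing stands in for the stemming that the English configuration performs (so the term is written `equity` here rather than `equiti`).

```python
TEXT = ("This Commercial Bank does not have any Equity in Europe "
        "but European Commercial Bank does")


def phrase_match(text, first, second):
    """Toy stand-in for `first <-> second`: True only if `second`
    immediately follows `first` somewhere in the text."""
    words = [w.lower() for w in text.split()]
    return any(a == first and b == second for a, b in zip(words, words[1:]))


def independent_hits(text, terms):
    """Return the words that equal any query term, each term checked
    on its own with no regard to position (assumed model of the
    highlighting seen in the thread)."""
    wanted = set(terms)
    return [w for w in text.split() if w.lower() in wanted]


if __name__ == "__main__":
    # The phrase is not present, yet both terms are found individually.
    print(phrase_match(TEXT, "equity", "bank"))
    print(independent_hits(TEXT, ["equity", "bank"]))
    # A term that never occurs does not stop the other one from matching.
    print(independent_hits(TEXT, ["xxx", "bank"]))
```

In this model the adjacency test and the per-word test are separate questions, which is why a query whose phrase never occurs can still produce highlighted words.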
Re: BUG #15172: Postgresql ts_headline with <-> operator does not highlight text properly
From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> Is this documented somewhere?

The docs [1] only say that ts_headline "returns an excerpt from the
document in which terms from the query are highlighted".  This
behavior does not violate that admittedly-weak contract.

IIRC, ts_headline does attempt to find a text fragment or fragments
that fully satisfy the query (e.g., include an exact phrase match)
but it will then highlight all the matching words in the fragment,
not only the location of the phrase match.  I do not agree with the
OP's opinion that that's wrong.  The highlight-em-all approach has its
own value, and in any case it may not be possible to find a full match
that satisfies the function's other constraints such as MaxWords.
Refusing to highlight anything in that event would be unhelpful.

regards, tom lane

[1] https://www.postgresql.org/docs/current/textsearch-controls.html#TEXTSEARCH-HEADLINE
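To make the two highlighting policies under discussion concrete, here is a small Python sketch that applies each one to the sentence from the original report. It is a simplified model under stated assumptions, not PostgreSQL code: `highlight_all` and `highlight_phrase_only` are hypothetical names, words are compared after lowercasing only (no stemming), and fragment selection and options such as MaxWords are left out entirely.

```python
TEXT = ("This Commercial Bank does not have any Equity in Europe "
        "but European Commercial Bank does")
PHRASE = ["European", "Commercial", "Bank"]


def highlight_all(text, phrase):
    """Wrap every word that occurs anywhere in the phrase, ignoring
    adjacency (the "highlight-em-all" policy described above)."""
    wanted = {w.lower() for w in phrase}
    return " ".join(f"<b>{w}</b>" if w.lower() in wanted else w
                    for w in text.split())


def highlight_phrase_only(text, phrase):
    """Wrap words only where the whole phrase occurs as consecutive
    words (the behavior the original reporter expected)."""
    words = text.split()
    lowered = [w.lower() for w in words]
    target = [w.lower() for w in phrase]
    marked = set()
    for i in range(len(words) - len(target) + 1):
        if lowered[i:i + len(target)] == target:
            marked.update(range(i, i + len(target)))
    return " ".join(f"<b>{w}</b>" if i in marked else w
                    for i, w in enumerate(words))


if __name__ == "__main__":
    print(highlight_all(TEXT, PHRASE))
    print(highlight_phrase_only(TEXT, PHRASE))
```

The first function reproduces the output shown in the bug report; the second reproduces the output the reporter asked for. Note that the second policy has nothing to highlight when the phrase does not occur in the chosen text, which is the situation raised above as an argument for keeping the first.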
Re: BUG #15172: Postgresql ts_headline with <-> operator does not highlight text properly
From
Pavel Borisov
Date:
Hi, Bruce and Tom!

On Sun, 29 Oct 2023 at 00:46, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Bruce Momjian <bruce@momjian.us> writes:
> > Is this documented somewhere?
>
> The docs [1] only say that ts_headline "returns an excerpt from the
> document in which terms from the query are highlighted". This
> behavior does not violate that admittedly-weak contract.
>
> IIRC, ts_headline does attempt to find a text fragment or fragments
> that fully satisfy the query (e.g., include an exact phrase match)
> but it will then highlight all the matching words in the fragment,
> not only the location of the phrase match. I do not agree with the
> OP's opinion that that's wrong. The highlight-em-all approach has its
> own value, and in any case it may not be possible to find a full match
> that satisfies the function's other constraints such as MaxWords.
> Refusing to highlight anything in that event would be unhelpful.
>
> regards, tom lane

I think that the main purpose of ts_headline is to make Postgres
friendlier to a search-engine-like approach, which I feel is too
niche a usage scenario to support as part of the core code. If I
remember right, bug reports from users who assume it has stricter
semantics than it really has come in regularly. I also remember
being puzzled by unusual output myself in the past.

If we fiddle with other parameters of ts_headline, we can easily get
other kinds of output that seem counterintuitive, e.g.:

    SELECT ts_headline('English', 'This Commercial Bank does not have any Equity in Europe but European Commercial Bank does', ('''equiti'' <-> ''bank''')::tsquery, 'MaxWords=30, MinWords=2');
       ts_headline
    -----------------
     This Commercial
    (1 row)

What do you think about clearly deprecating this feature in the docs,
while still leaving it working as it is?

Kind regards,
Pavel Borisov,
Supabase.
Re: BUG #15172: Postgresql ts_headline with <-> operator does not highlight text properly
From
Bruce Momjian
Date:
On Sat, Oct 28, 2023 at 04:46:40PM -0400, Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > Is this documented somewhere?
>
> The docs [1] only say that ts_headline "returns an excerpt from the
> document in which terms from the query are highlighted". This
> behavior does not violate that admittedly-weak contract.
>
> IIRC, ts_headline does attempt to find a text fragment or fragments
> that fully satisfy the query (e.g., include an exact phrase match)
> but it will then highlight all the matching words in the fragment,
> not only the location of the phrase match. I do not agree with the

I see what you mean in this query output:

    SELECT ts_headline('English','kj asdlkjf alds jflkasjd flkaj dsflkja sdlfk jaslfd kjasdlfkj salfdkj This Commercial Bank does not have any Equity in Europe but European Commercial Bank does lkj sadlkjf asldkjf alskjd flsakj fdlkaj dfaslkfdjlakds jaslkfdj', ('''european'' <-> ''commerci'' <-> ''bank''')::tsquery);
                                                               ts_headline
    ---------------------------------------------------------------------------------------------------------------------------------
     Europe but <b>European</b> <b>Commercial</b> <b>Bank</b> does lkj sadlkjf asldkjf alskjd flsakj fdlkaj dfaslkfdjlakds jaslkfdj

The query controls the fragment chosen.

> OP's opinion that that's wrong. The highlight-em-all approach has its
> own value, and in any case it may not be possible to find a full match
> that satisfies the function's other constraints such as MaxWords.
> Refusing to highlight anything in that event would be unhelpful.

Attached is a proposed doc patch.

I hope people don't mind me addressing these old emails, but I think
they raise important issues, and while I wasn't able to deal with
them when they were posted, I have time for the next month to do so.

--
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.
Attachment
Re: BUG #15172: Postgresql ts_headline with <-> operator does not highlight text properly
From
Bruce Momjian
Date:
On Sun, Oct 29, 2023 at 01:20:11AM +0400, Pavel Borisov wrote:
> Hi, Bruce and Tom!
> I think that the ts_headline main functionality is to make Postgres
> more friendly to search-engine-like approach, which I feel is too
> niche usage scenario for supporting it as a part of core code. If
> remember right, bug reports coming from the users supposing it has
> more strict semantics than it has in reality are regular. And I also
> remember myself being puzzled by unusual output in the past.
>
> If we fiddle with other parameters of ts_headline we can easily have
> other kinds of output that seem counterintuitive e.g.:
> SELECT ts_headline('English',

I just posted a proposed doc patch which should help reduce the number
of people surprised by the highlighting. Let's see if that helps.

FYI, here is a Stack Overflow post from 2021 linking to the original
email that started this thread from 2018:

https://stackoverflow.com/questions/69512416/is-ts-headline-intended-to-highlight-non-matching-parts-of-the-query-which-it

--
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.
Re: BUG #15172: Postgresql ts_headline with <-> operator does not highlight text properly
From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> Attached is a proposed doc patch.

As I pointed out before, the fragments *don't* necessarily satisfy
the query, so this is still promising too much.

An important edge case to keep in mind is that the given text
itself might not satisfy the query; ts_headline has no control
over what you hand it.  But even if the text as a whole does,
there may not be small fragments that do.

regards, tom lane
Re: BUG #15172: Postgresql ts_headline with <-> operator does not highlight text properly
From
Bruce Momjian
Date:
On Sun, Oct 29, 2023 at 11:53:35AM -0400, Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > Attached is a proposed doc patch.
>
> As I pointed out before, the fragments *don't* necessarily satisfy
> the query, so this is still promising too much.
>
> An important edge case to keep in mind is that the given text
> itself might not satisfy the query; ts_headline has no control
> over what you hand it. But even if the text as a whole does,
> there may not be small fragments that do.

How is this weasel-wording, attached. :-)

--
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.
Attachment
Re: BUG #15172: Postgresql ts_headline with <-> operator does not highlight text properly
From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> How is this weasel-wording, attached. :-)

Getting there.  What do you think of

+   Specifically, the function will use the query to select relevant
+   text fragments, and then highlight all words that appear in the query,
+   even if those word positions do not match the query's restrictions.

regards, tom lane
Re: BUG #15172: Postgresql ts_headline with <-> operator does not highlight text properly
From
Bruce Momjian
Date:
On Mon, Oct 30, 2023 at 11:32:26AM -0400, Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > How is this weasel-wording, attached. :-)
>
> Getting there. What do you think of
>
> + Specifically, the function will use the query to select relevant
> + text fragments, and then highlight all words that appear in the query,
> + even if those word positions do not match the query's restrictions.

Sold! :-) Attached.

--
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.
Attachment
Re: BUG #15172: Postgresql ts_headline with <-> operator does not highlight text properly
From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> Sold! :-) Attached.

LGTM.

BTW, just for the OP's context: ts_headline was designed before
we had phrase search operators.  With only AND/OR/NOT, there aren't
any location restrictions on individual words.  (I recall that we
occasionally got complaints about how it shouldn't highlight words
that are supposed to NOT be there, but that was an uncommon situation
because normally you wouldn't be selecting such a document to
highlight.)  So both the function's basic algorithm and its control
parameters were designed without thought for what to do if the query
restricted match locations.

Maybe there's a case for rethinking what it should do more than we
already have; but it's not clear that you can do much better without
throwing out the current set of control parameters as well as the
algorithm.  See [1] for some context and discussion.

regards, tom lane

[1] https://www.postgresql.org/message-id/flat/840.1669405935%40sss.pgh.pa.us
Re: BUG #15172: Postgresql ts_headline with <-> operator does not highlight text properly
From
Bruce Momjian
Date:
On Mon, Oct 30, 2023 at 12:00:38PM -0400, Bruce Momjian wrote:
> On Mon, Oct 30, 2023 at 11:32:26AM -0400, Tom Lane wrote:
> > Bruce Momjian <bruce@momjian.us> writes:
> > > How is this weasel-wording, attached. :-)
> >
> > Getting there. What do you think of
> >
> > + Specifically, the function will use the query to select relevant
> > + text fragments, and then highlight all words that appear in the query,
> > + even if those word positions do not match the query's restrictions.
>
> Sold! :-) Attached.

Patch applied back to PG 16.

--
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.