Thread: BUG #17691: Unexpected behaviour using ts_headline()

BUG #17691: Unexpected behaviour using ts_headline()

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      17691
Logged by:          Sebastian Patino-Lang
Email address:      sebastian.patino-lang@posteo.net
PostgreSQL version: 13.9
Operating system:   x86_64-apple-darwin19.6.0
Description:

I experience unexpected behaviour when using ts_headline() in general, but
especially when changing  MaxFragments. Given the data in
ts_headline_report.sql [1]

SELECT id,
ts_headline('english', "texts"."fulltext", to_tsquery('english', 'amazon &
world'), 'StartSel=<<, StopSel=>>') AS "preview"
FROM texts;

id=1: Highlight word is the first one in the result. Expectation: highlight
word is somewhere in the middle.
id=2: No highlight word at all.
id=3: Highlight words are the first and last one in the result. Not ideal
but ok-ish.

SELECT id,
ts_headline('english', "texts"."fulltext", to_tsquery('english', 'amazon &
world'), 'MaxFragments=1, StartSel=<<, StopSel=>>') AS "preview"
FROM texts;

id=1: Highlight word is now in the middle of the result. This is ok.
id=2: No highlight word at all.
id=3: Highlight words are in the middle part of the result. This is ok.

SELECT id,
ts_headline('english', "texts"."fulltext", to_tsquery('english', 'amazon &
world'), 'MaxFragments=2, StartSel=<<, StopSel=>>') AS "preview"
FROM texts;

id=1: Correct number of fragments (2) with highlight words are returned. 
id=2: No highlight word at all.
id=3: Correct number of fragments (2) with highlight words are returned. 

SELECT id,
ts_headline('english', "texts"."fulltext", to_tsquery('english', 'amazon &
world'), 'MaxFragments=3, StartSel=<<, StopSel=>>') AS "preview"
FROM texts;

id=1: Wrong number of fragments (2) with highlight words are returned. 
id=2: No highlight word at all.
id=3: Correct number of fragments (3) with highlight words are returned.

SELECT id,
ts_headline('english', "texts"."fulltext", to_tsquery('english', 'amazon &
world'), 'MaxFragments=4, StartSel=<<, StopSel=>>') AS "preview"
FROM texts;

id=1: Wrong number of fragments (2) with highlight words are returned. 
id=2: No highlight word at all.
id=3: Correct number of fragments (4) with highlight words are returned.

... and so on. Until MaxFragments=6 where for id=1 suddenly more fragments
(4) get returned.

SELECT id,
ts_headline('english', "texts"."fulltext", to_tsquery('english', 'amazon &
world'), 'MaxFragments=4, StartSel=<<, StopSel=>>') AS "preview"
FROM texts;

id=1: Wrong number of fragments (4) with highlight words are returned. 
id=2: No highlight word at all.
id=3: Correct number of fragments (6) with highlight words are returned.

... and so on. Until MaxFragments=11 where for id=1 the number of returned
fragments changes again (5)

SELECT id,
ts_headline('english', "texts"."fulltext", to_tsquery('english', 'amazon &
world'), 'MaxFragments=11, StartSel=<<, StopSel=>>') AS "preview"
FROM texts;

id=1: Wrong number of fragments (5) with highlight words are returned. 
id=2: No highlight word at all.
id=3: Correct number of fragments (11) with highlight words are returned.

... and so on. Until MaxFragments=14 where for id=2 suddenly more fragments
(2) get returned.

SELECT id,
ts_headline('english', "texts"."fulltext", to_tsquery('english', 'amazon &
world'), 'MaxFragments=14, StartSel=<<, StopSel=>>') AS "preview"
FROM texts;

id=1: Wrong number of fragments (5) with highlight words are returned. 
id=2: Wrong number of fragments (2) with highlight words are returned. 
id=3: Correct number of fragments (11) with highlight words are returned.

I stopped testing here, but im sure the strange behaviour and jumps in
fragment count will continue.

Any ideas?

[1] https://1drv.ms/u/s!AqRGi9iGBWddgZVU_M2iuoRdTzM6tg?e=0OHHnB


Re: BUG #17691: Unexpected behaviour using ts_headline()

From
Tom Lane
Date:
PG Bug reporting form <noreply@postgresql.org> writes:
> I experience unexpected behaviour when using ts_headline() in general, but
> especially when changing  MaxFragments. Given the data in
> ts_headline_report.sql [1]

That URL doesn't work for me, and in any case we dislike
non-self-contained bug reports for archival reasons.  Could you
provide the sample data as an attachment in a reply to pgsql-bugs?

            regards, tom lane



Re: BUG #17691: Unexpected behaviour using ts_headline()

From
sebastian.patino-lang@posteo.net
Date:
Hi all,

Tom Lane made me aware that the link is not working and its better anyway to include all data in the bug report directly. Please find the file attached.

@Tom: thanks for the hint!

Regards,
Sebastian
On 19. Nov 2022, 14:05 +0100, PG Bug reporting form <noreply@postgresql.org>, wrote:
The following bug has been logged on the website:

Bug reference: 17691
Logged by: Sebastian Patino-Lang
Email address: sebastian.patino-lang@posteo.net
PostgreSQL version: 13.9
Operating system: x86_64-apple-darwin19.6.0
Description:

I experience unexpected behaviour when using ts_headline() in general, but
especially when changing MaxFragments. Given the data in
ts_headline_report.sql [1]

SELECT id,
ts_headline('english', "texts"."fulltext", to_tsquery('english', 'amazon &
world'), 'StartSel=<<, StopSel=>>') AS "preview"
FROM texts;

id=1: Highlight word is the first one in the result. Expectation: highlight
word is somewhere in the middle.
id=2: No highlight word at all.
id=3: Highlight words are the first and last one in the result. Not ideal
but ok-ish.

SELECT id,
ts_headline('english', "texts"."fulltext", to_tsquery('english', 'amazon &
world'), 'MaxFragments=1, StartSel=<<, StopSel=>>') AS "preview"
FROM texts;

id=1: Highlight word is now in the middle of the result. This is ok.
id=2: No highlight word at all.
id=3: Highlight words are in the middle part of the result. This is ok.

SELECT id,
ts_headline('english', "texts"."fulltext", to_tsquery('english', 'amazon &
world'), 'MaxFragments=2, StartSel=<<, StopSel=>>') AS "preview"
FROM texts;

id=1: Correct number of fragments (2) with highlight words are returned.
id=2: No highlight word at all.
id=3: Correct number of fragments (2) with highlight words are returned.

SELECT id,
ts_headline('english', "texts"."fulltext", to_tsquery('english', 'amazon &
world'), 'MaxFragments=3, StartSel=<<, StopSel=>>') AS "preview"
FROM texts;

id=1: Wrong number of fragments (2) with highlight words are returned.
id=2: No highlight word at all.
id=3: Correct number of fragments (3) with highlight words are returned.

SELECT id,
ts_headline('english', "texts"."fulltext", to_tsquery('english', 'amazon &
world'), 'MaxFragments=4, StartSel=<<, StopSel=>>') AS "preview"
FROM texts;

id=1: Wrong number of fragments (2) with highlight words are returned.
id=2: No highlight word at all.
id=3: Correct number of fragments (4) with highlight words are returned.

... and so on. Until MaxFragments=6 where for id=1 suddenly more fragments
(4) get returned.

SELECT id,
ts_headline('english', "texts"."fulltext", to_tsquery('english', 'amazon &
world'), 'MaxFragments=4, StartSel=<<, StopSel=>>') AS "preview"
FROM texts;

id=1: Wrong number of fragments (4) with highlight words are returned.
id=2: No highlight word at all.
id=3: Correct number of fragments (6) with highlight words are returned.

... and so on. Until MaxFragments=11 where for id=1 the number of returned
fragments changes again (5)

SELECT id,
ts_headline('english', "texts"."fulltext", to_tsquery('english', 'amazon &
world'), 'MaxFragments=11, StartSel=<<, StopSel=>>') AS "preview"
FROM texts;

id=1: Wrong number of fragments (5) with highlight words are returned.
id=2: No highlight word at all.
id=3: Correct number of fragments (11) with highlight words are returned.

... and so on. Until MaxFragments=14 where for id=2 suddenly more fragments
(2) get returned.

SELECT id,
ts_headline('english', "texts"."fulltext", to_tsquery('english', 'amazon &
world'), 'MaxFragments=14, StartSel=<<, StopSel=>>') AS "preview"
FROM texts;

id=1: Wrong number of fragments (5) with highlight words are returned.
id=2: Wrong number of fragments (2) with highlight words are returned.
id=3: Correct number of fragments (11) with highlight words are returned.

I stopped testing here, but im sure the strange behaviour and jumps in
fragment count will continue.

Any ideas?

[1] https://1drv.ms/u/s!AqRGi9iGBWddgZVU_M2iuoRdTzM6tg?e=0OHHnB

Attachment

Re: BUG #17691: Unexpected behaviour using ts_headline()

From
Tom Lane
Date:
PG Bug reporting form <noreply@postgresql.org> writes:
> I experience unexpected behaviour when using ts_headline() in general, but
> especially when changing  MaxFragments. Given the data in
> ts_headline_report.sql [1]

Thanks for sending the sample data.  After poking at this a bit, it
seems the question boils down to what ts_headline should do when there
simply aren't any short matches to the query.  Part of the issue there
is what qualifies as "short".  The definition I installed in 78e73e875
was

+   /*
+    * We might eventually make max_cover a user-settable parameter, but for
+    * now, just compute a reasonable value based on max_words and
+    * max_fragments.
+    */
+   max_cover = Max(max_words * 10, 100);
+   if (max_fragments > 0)
+       max_cover *= max_fragments;

Looking at this again, I think the dependency on max_fragments is just
wrong.  That is what is causing your different results for different
settings of MaxFragments: larger values allow recognition of longer
candidate covers ("cover" being jargon in this code for "a substring
satisfying the query").  But there's no good reason for that.  I'd misread
the logic in mark_hl_fragments, which "breaks the cover into smaller
fragments such that each fragment has at most max_words", to think that
it needed to see longer covers that it could divide into fragments.
But actually max_fragments limits the number of fragments it will return
across all covers, and there doesn't seem to be a reason to connect that
parameter to the acceptable length of any particular cover.  So dropping
those two lines is enough to make the strange behavior across differing
MaxFragments go away.

The other issue is what to do after finding there's no match of max_cover
or fewer tokens.  As the code now stands, it'll just give up and return
the initial bit of the text, likely with no query words.  That's not
great.  I propose that we look for any match and return (highlight) just
its first word.

> SELECT id,
> ts_headline('english', "texts"."fulltext", to_tsquery('english', 'amazon &
> world'), 'StartSel=<<, StopSel=>>') AS "preview"
> FROM texts;

> id=1: Highlight word is the first one in the result. Expectation: highlight
> word is somewhere in the middle.

Meh --- I don't agree with that expectation.  The fragment-based code does
act that way, but the non-fragment path never has, so I think we'd make
more people unhappy than happy if we change its behavior.

> id=2: No highlight word at all.

Right, because there are no short matches to "amazon & world" anywhere.
With the attached proposed patch, we'll highlight whichever one of those
words appears first.

> id=3: Highlight words are the first and last one in the result. Not ideal
> but ok-ish.

As I said, that's how non-fragment highlighting has always worked.  It
will extend the quoted text, but not beyond min_words, which is a bit
different from fragment highlighting.

> SELECT id,
> ts_headline('english', "texts"."fulltext", to_tsquery('english', 'amazon &
> world'), 'MaxFragments=3, StartSel=<<, StopSel=>>') AS "preview"
> FROM texts;

> id=1: Wrong number of fragments (2) with highlight words are returned. 

I don't agree with this expectation either.  There's only one reasonably
sane match to 'amazon & world' in that text, and it's near the beginning.
As the code stands, increasing max_fragments eventually allows it to find
some less-sane matches, but that doesn't seem like an improvement.


In short, I suggest something about like the attached.  For your id=2
example, which has only very far-apart occurrences of 'amazon' and
'world', this produces

    <<world>> where fast runtimes make or break the popularity of research
    fields, this oversight has effectively

without fragmenting, and

    researchers make the effort of understanding the inner workings of
    these convenient black-boxes. In a <<world>> where fast runtimes make
    or break the popularity of research fields, this oversight has
    effectively surrendered most

with fragmenting, which is okay by me given their historically different
behaviors.  Otherwise, it gets rid of the strange changes in matching
behavior for different MaxFragment values.

            regards, tom lane

diff --git a/src/backend/tsearch/wparser_def.c b/src/backend/tsearch/wparser_def.c
index 826027844e..ea73a60852 100644
--- a/src/backend/tsearch/wparser_def.c
+++ b/src/backend/tsearch/wparser_def.c
@@ -2085,6 +2085,33 @@ hlCover(HeadlineParsedText *prs, TSQuery query, int max_cover,
         /* No luck here, so try next feasible startpoint */
         pmin = nextpmin;
     }
+
+    /*
+     * If there is no max_cover-or-less substring anywhere that satisfies the
+     * query, throw up our hands and return a single-word "match" that is just
+     * the first word appearing in the first query match, if any.  (We need to
+     * verify a match to avoid possibly returning a word that is negated in
+     * the query.)
+     */
+    if (*p == 0)
+    {
+        pmin = hlFirstIndex(prs, *p);
+        while (pmin >= 0)
+        {
+            /* Try to match query against pmin .. end of text */
+            ch.words = &(prs->words[pmin]);
+            ch.len = prs->curwords - pmin;
+            if (TS_execute(GETQUERY(query), &ch,
+                           TS_EXEC_EMPTY, checkcondition_HL))
+            {
+                *p = *q = pmin;
+                return true;
+            }
+            /* Nope, so advance pmin to next feasible start point */
+            pmin = hlFirstIndex(prs, pmin + 1);
+        }
+    }
+
     return false;
 }

@@ -2581,12 +2608,11 @@ prsd_headline(PG_FUNCTION_ARGS)

     /*
      * We might eventually make max_cover a user-settable parameter, but for
-     * now, just compute a reasonable value based on max_words and
-     * max_fragments.
+     * now, just compute a reasonable value based on max_words.  Note that it
+     * has to be several times more than max_words, because max_cover needs to
+     * allow for non-word tokens appearing between the word tokens.
      */
     max_cover = Max(max_words * 10, 100);
-    if (max_fragments > 0)
-        max_cover *= max_fragments;

     /* in HighlightAll mode these parameters are ignored */
     if (!highlightall)

Re: BUG #17691: Unexpected behaviour using ts_headline()

From
sebastian.patino-lang@posteo.net
Date:
Hi Tom,

thanks for the in-depth explanation. Makes a bit more sense now. Not sure what to do with the attached patch though. Should I use that to compile my own version of PG? Since I use a managed version thats not an option currently.

Regards

Sebastian
On 20. Nov 2022, 20:30 +0100, Tom Lane <tgl@sss.pgh.pa.us>, wrote:
PG Bug reporting form <noreply@postgresql.org> writes:
I experience unexpected behaviour when using ts_headline() in general, but
especially when changing MaxFragments. Given the data in
ts_headline_report.sql [1]

Thanks for sending the sample data. After poking at this a bit, it
seems the question boils down to what ts_headline should do when there
simply aren't any short matches to the query. Part of the issue there
is what qualifies as "short". The definition I installed in 78e73e875
was

+ /*
+ * We might eventually make max_cover a user-settable parameter, but for
+ * now, just compute a reasonable value based on max_words and
+ * max_fragments.
+ */
+ max_cover = Max(max_words * 10, 100);
+ if (max_fragments > 0)
+ max_cover *= max_fragments;

Looking at this again, I think the dependency on max_fragments is just
wrong. That is what is causing your different results for different
settings of MaxFragments: larger values allow recognition of longer
candidate covers ("cover" being jargon in this code for "a substring
satisfying the query"). But there's no good reason for that. I'd misread
the logic in mark_hl_fragments, which "breaks the cover into smaller
fragments such that each fragment has at most max_words", to think that
it needed to see longer covers that it could divide into fragments.
But actually max_fragments limits the number of fragments it will return
across all covers, and there doesn't seem to be a reason to connect that
parameter to the acceptable length of any particular cover. So dropping
those two lines is enough to make the strange behavior across differing
MaxFragments go away.

The other issue is what to do after finding there's no match of max_cover
or fewer tokens. As the code now stands, it'll just give up and return
the initial bit of the text, likely with no query words. That's not
great. I propose that we look for any match and return (highlight) just
its first word.

SELECT id,
ts_headline('english', "texts"."fulltext", to_tsquery('english', 'amazon &
world'), 'StartSel=<<, StopSel=>>') AS "preview"
FROM texts;

id=1: Highlight word is the first one in the result. Expectation: highlight
word is somewhere in the middle.

Meh --- I don't agree with that expectation. The fragment-based code does
act that way, but the non-fragment path never has, so I think we'd make
more people unhappy than happy if we change its behavior.

id=2: No highlight word at all.

Right, because there are no short matches to "amazon & world" anywhere.
With the attached proposed patch, we'll highlight whichever one of those
words appears first.

id=3: Highlight words are the first and last one in the result. Not ideal
but ok-ish.

As I said, that's how non-fragment highlighting has always worked. It
will extend the quoted text, but not beyond min_words, which is a bit
different from fragment highlighting.

SELECT id,
ts_headline('english', "texts"."fulltext", to_tsquery('english', 'amazon &
world'), 'MaxFragments=3, StartSel=<<, StopSel=>>') AS "preview"
FROM texts;

id=1: Wrong number of fragments (2) with highlight words are returned.

I don't agree with this expectation either. There's only one reasonably
sane match to 'amazon & world' in that text, and it's near the beginning.
As the code stands, increasing max_fragments eventually allows it to find
some less-sane matches, but that doesn't seem like an improvement.


In short, I suggest something about like the attached. For your id=2
example, which has only very far-apart occurrences of 'amazon' and
'world', this produces

<<world>> where fast runtimes make or break the popularity of research
fields, this oversight has effectively

without fragmenting, and

researchers make the effort of understanding the inner workings of
these convenient black-boxes. In a <<world>> where fast runtimes make
or break the popularity of research fields, this oversight has
effectively surrendered most

with fragmenting, which is okay by me given their historically different
behaviors. Otherwise, it gets rid of the strange changes in matching
behavior for different MaxFragment values.

regards, tom lane

Re: BUG #17691: Unexpected behaviour using ts_headline()

From
Tom Lane
Date:
For the archives' sake:

I've moved this discussion to a new thread on pgsql-hackers,

https://www.postgresql.org/message-id/flat/840.1669405935%40sss.pgh.pa.us

            regards, tom lane