Home > mailing lists

Re: BUG #17691: Unexpected behaviour using ts_headline() - Mailing list pgsql-bugs

From	Tom Lane
Subject	Re: BUG #17691: Unexpected behaviour using ts_headline()
Date	November 20, 2022 19:29:51
Msg-id	2885495.1668972591@sss.pgh.pa.us Whole thread Raw
In response to	BUG #17691: Unexpected behaviour using ts_headline() (PG Bug reporting form <noreply@postgresql.org>)
Responses	Re: BUG #17691: Unexpected behaviour using ts_headline()
List	pgsql-bugs

Tree view

PG Bug reporting form <noreply@postgresql.org> writes:
> I experience unexpected behaviour when using ts_headline() in general, but
> especially when changing  MaxFragments. Given the data in
> ts_headline_report.sql [1]

Thanks for sending the sample data.  After poking at this a bit, it
seems the question boils down to what ts_headline should do when there
simply aren't any short matches to the query.  Part of the issue there
is what qualifies as "short".  The definition I installed in 78e73e875
was

+   /*
+    * We might eventually make max_cover a user-settable parameter, but for
+    * now, just compute a reasonable value based on max_words and
+    * max_fragments.
+    */
+   max_cover = Max(max_words * 10, 100);
+   if (max_fragments > 0)
+       max_cover *= max_fragments;

Looking at this again, I think the dependency on max_fragments is just
wrong.  That is what is causing your different results for different
settings of MaxFragments: larger values allow recognition of longer
candidate covers ("cover" being jargon in this code for "a substring
satisfying the query").  But there's no good reason for that.  I'd misread
the logic in mark_hl_fragments, which "breaks the cover into smaller
fragments such that each fragment has at most max_words", to think that
it needed to see longer covers that it could divide into fragments.
But actually max_fragments limits the number of fragments it will return
across all covers, and there doesn't seem to be a reason to connect that
parameter to the acceptable length of any particular cover.  So dropping
those two lines is enough to make the strange behavior across differing
MaxFragments go away.

The other issue is what to do after finding there's no match of max_cover
or fewer tokens.  As the code now stands, it'll just give up and return
the initial bit of the text, likely with no query words.  That's not
great.  I propose that we look for any match and return (highlight) just
its first word.

> SELECT id,
> ts_headline('english', "texts"."fulltext", to_tsquery('english', 'amazon &
> world'), 'StartSel=<<, StopSel=>>') AS "preview"
> FROM texts;

> id=1: Highlight word is the first one in the result. Expectation: highlight
> word is somewhere in the middle.

Meh --- I don't agree with that expectation.  The fragment-based code does
act that way, but the non-fragment path never has, so I think we'd make
more people unhappy than happy if we change its behavior.

> id=2: No highlight word at all.

Right, because there are no short matches to "amazon & world" anywhere.
With the attached proposed patch, we'll highlight whichever one of those
words appears first.

> id=3: Highlight words are the first and last one in the result. Not ideal
> but ok-ish.

As I said, that's how non-fragment highlighting has always worked.  It
will extend the quoted text, but not beyond min_words, which is a bit
different from fragment highlighting.

> SELECT id,
> ts_headline('english', "texts"."fulltext", to_tsquery('english', 'amazon &
> world'), 'MaxFragments=3, StartSel=<<, StopSel=>>') AS "preview"
> FROM texts;

> id=1: Wrong number of fragments (2) with highlight words are returned. 

I don't agree with this expectation either.  There's only one reasonably
sane match to 'amazon & world' in that text, and it's near the beginning.
As the code stands, increasing max_fragments eventually allows it to find
some less-sane matches, but that doesn't seem like an improvement.

In short, I suggest something about like the attached.  For your id=2
example, which has only very far-apart occurrences of 'amazon' and
'world', this produces

    <<world>> where fast runtimes make or break the popularity of research
    fields, this oversight has effectively

without fragmenting, and

    researchers make the effort of understanding the inner workings of
    these convenient black-boxes. In a <<world>> where fast runtimes make
    or break the popularity of research fields, this oversight has
    effectively surrendered most

with fragmenting, which is okay by me given their historically different
behaviors.  Otherwise, it gets rid of the strange changes in matching
behavior for different MaxFragment values.

            regards, tom lane

diff --git a/src/backend/tsearch/wparser_def.c b/src/backend/tsearch/wparser_def.c
index 826027844e..ea73a60852 100644
--- a/src/backend/tsearch/wparser_def.c
+++ b/src/backend/tsearch/wparser_def.c
@@ -2085,6 +2085,33 @@ hlCover(HeadlineParsedText *prs, TSQuery query, int max_cover,
         /* No luck here, so try next feasible startpoint */
         pmin = nextpmin;
     }
+
+    /*
+     * If there is no max_cover-or-less substring anywhere that satisfies the
+     * query, throw up our hands and return a single-word "match" that is just
+     * the first word appearing in the first query match, if any.  (We need to
+     * verify a match to avoid possibly returning a word that is negated in
+     * the query.)
+     */
+    if (*p == 0)
+    {
+        pmin = hlFirstIndex(prs, *p);
+        while (pmin >= 0)
+        {
+            /* Try to match query against pmin .. end of text */
+            ch.words = &(prs->words[pmin]);
+            ch.len = prs->curwords - pmin;
+            if (TS_execute(GETQUERY(query), &ch,
+                           TS_EXEC_EMPTY, checkcondition_HL))
+            {
+                *p = *q = pmin;
+                return true;
+            }
+            /* Nope, so advance pmin to next feasible start point */
+            pmin = hlFirstIndex(prs, pmin + 1);
+        }
+    }
+
     return false;
 }

@@ -2581,12 +2608,11 @@ prsd_headline(PG_FUNCTION_ARGS)

     /*
      * We might eventually make max_cover a user-settable parameter, but for
-     * now, just compute a reasonable value based on max_words and
-     * max_fragments.
+     * now, just compute a reasonable value based on max_words.  Note that it
+     * has to be several times more than max_words, because max_cover needs to
+     * allow for non-word tokens appearing between the word tokens.
      */
     max_cover = Max(max_words * 10, 100);
-    if (max_fragments > 0)
-        max_cover *= max_fragments;

     /* in HighlightAll mode these parameters are ignored */
     if (!highlightall)

pgsql-bugs by date:

From: sebastian.patino-lang@posteo.net
Date: 20 November 2022, 12:03:39
Subject: Re: BUG #17691: Unexpected behaviour using ts_headline()

From: Amit Kapila
Date: 21 November 2022, 14:26:20
Subject: Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

Re: BUG #17691: Unexpected behaviour using ts_headline() - Mailing list pgsql-bugs

Previous

Next