Re: BUG #16345: ts_headline does not find phrase matches correctly - Mailing list pgsql-bugs

From Tom Lane
Subject Re: BUG #16345: ts_headline does not find phrase matches correctly
Date
Msg-id 25072.1586401341@sss.pgh.pa.us
Whole thread Raw
In response to BUG #16345: ts_headline does not find phrase matches correctly  (PG Bug reporting form <noreply@postgresql.org>)
List pgsql-bugs
PG Bug reporting form <noreply@postgresql.org> writes:
> When query:
> select ts_headline(
>     $$Lorem ipsum urna.  Nullam  nullam ullamcorper urna.$$,
>     to_tsquery('Lorem') && phraseto_tsquery('ullamcorper urna'),
>     'StartSel=#$#, StopSel=#$#, FragmentDelimiter=$#$, MaxFragments=100,
> MaxWords=100, MinWords=1'
> );
> is ran, a fragment of
>> Lorem ipsum urna.  Nullam  nullam ullamcorper urna.
> should be returned, however, the result is a single word of #$#Lorem#$# is
> returned, meaning that ts_headline did not find the queried string.

Yeah.  I spent some time digging into this, and was reminded of how
miserably baroque and undercommented almost all of the text-search code
is.  Anyway, the bottom line is that wparser_def.c's hlCover() is
failing here.  It is looking for a minimal cover, that is a substring
satisfying the given tsquery, and obviously with this query the only
such substring is the whole sentence.  (Well, not the trailing period.)
However, it looks like hlCover() wasn't updated when we added phrase
search, because that made word position significant and invalidated
a rather fundamental assumption that hlCover() seems to be making:
it figures that any substring including all words of the query ought
to be good enough.  Since "urna" also appears as the third word of the
text, hlCover() doesn't try any substrings longer than "Lorem ipsum
urna.  Nullam nullam ullamcorper", thus it never finds one that
actually satisfies the query, and we end up failing.

Although utterly undocumented, the algorithm it is using seems to be
"find the latest first occurrence of any query word (here,
'ullamcorper'), then find the earliest last occurrence of any query
word before that (hence, 'Lorem'), and then see if the substring
between those points satisfies the query (oops, nope).  If not,
start over from a point one past the previous try."  But all the
tries after that will omit 'Lorem', so they all fail to match the
query, even though it'll eventually try substrings that include
the later occurrence of 'urna'.

I've not spent a huge amount of time thinking about it, but this might
be all right as a way to find a shortest-possible cover for queries
involving only AND and OR operators.  (It'd fall down on NOT operators
of course, but it already cheats on that by telling TS_execute to
ignore NOT subtrees.)  However, it's blatantly wrong as soon as a
phrase operator is involved, because then only some occurrences of a
particular word in the string might meet the query's requirements.

So I set out to rewrite hlCover to make fewer assumptions about
what a valid cover could be.  In the new version appearing below,
it just tries every substring that begins and ends with some query
word, preferring earlier and shorter substrings.  This should
certainly find the desired cover ... but it didn't work, plus it
broke a number of existing regression test cases.  I was thus forced
to the realization that its immediate caller mark_hl_words is *also*
broken, because it's rejecting good headlines in favor of bad ones,
or even in favor of headlines that contain no cover at all.  Which
was a bit daunting, because that's an even larger and uglier chunk of
undocumented code.

I ended up with the following stepwise approach to improving the
situation.

0001 below adds the problematic test case, with the wrong output
that HEAD produces.  This was basically just to track which changes
affected that.

0002 makes a bunch of purely-cosmetic changes to mark_hl_words
and its siblings, in hopes of making it less unintelligible to
the next hacker.  I added comments, used macros to make some of
the hairier if-tests more readable, and changed a couple of
small things for more readability (though they can be proven
not to affect the behavior).  As expected, regression outputs
don't change here.

0003 fixes a couple of fairly obvious bugs.  One is that there's
an early-exit optimization that tries to reject a possible headline
before having fully defined its boundaries.  This is not really
necessary, but worse it's wrong because the figure of merit might
change by the time we've chosen the actual boundaries.  Deleting
it doesn't change any regression test cases, but I'm sure it'd be
possible to invent a scenario where it does the wrong thing.
The other bug is that the loop that tries to shorten a maximum-length
headline until it has a good endpoint has an order-of-operations issue:
it can exit after making adjustments to curlen and poslen that discount
the i'th word from the headline, but without changing pose to actually
exclude that word.  So we end up with a poslen figure-of-merit that
does not describe the actual headline from posb to pose.

Unfortunately, the one regression test output change caused by 0003
is clearly for the worse: hlCover successfully finds the cover '1 3'
for the query, but now mark_hl_words discards '3' and only highlights
'1'.  What is happening is that '3' fails the "short word" test and
is thereby excluded from the headline.  This behavior is clearly what
the code intends, but it was accidentally masked before by the
order-of-operations bug.

I argue that the problem here is that excluding an actual query term
from the headline on the basis of its being short is just stupid.
There is much-earlier processing that is charged with excluding
stop words from tsqueries, and this code has no business second-
guessing that.  So the short-word test should only be applied to
text words that did not appear in the tsquery.

Hence, 0004 rejiggers the new BADENDPOINT() macro so that query
terms are never considered bad endpoints.  That fixes the '1 <-> 3'
test case broken by 0003, but now there's a new diff: matching
'1 2 3 1 3' to '1 & 3' now selects only '1 2 3' as the headline
not '1 2 3 1'.  I do not think this is a bug though.  The reason
we got '1 2 3 1' before is that hlCover selected the cover '1 2 3'
but then mark_hl_words decided '3' was a bad endpoint and extended
the headline by one word.  (It would've extended more, because '1'
is also a bad endpoint, except MaxWords=4 stopped it.)  Again,
treating '3' as a bad endpoint is just silly, so I think this change
is acceptable.

Next, 0005 rearranges the preference order for different possible
headlines so that the first preference item is whether or not the
headline includes the cover string initially found by hlCover.
(It might not, if the cover string was longer than MaxWords.)
It seems to me to be dumb to prefer headlines that don't meet that
requirement to ones that do, because shorter headlines might not
satisfy the user's query, which surely fails to satisfy the principle
of least astonishment.  (While I'm not entirely sure what can be
said about the old implementation of hlCover, with my rewrite it
is *certain* that substrings not including the full cover won't
satisfy the query.)  0005 also gets rid of what seems to me to
be a corner-case bug in the old preference logic, which is that
it will take a headline with fewer query words over one with more,
if the former has a "good" endpoint and the latter doesn't.  That
makes the actual preference order close to unexplainable.

0005 doesn't in itself change any regression results, but it's
necessary to prevent problems from appearing with the next patch.
The real situation here, as I've come to understand it, is that
the existing hlCover code frequently produces only one possible cover
and thus it doesn't matter how silly are mark_hl_words's rules for
preferring one over another.  The rewrite causes hlCover to produce
more candidate covers and so it becomes more important for
mark_hl_words to make sane decisions.

Lastly, 0006 introduces the new hlCover code.  This at last fixes
the test case for the bug at hand.  It also introduces two new diffs.
One is this change in one Rime of the Ancient Mariner example:

- <b>painted</b> <b>Ocean</b>.    +
- Water, water, every where       +
-   And all the boards did shrink;+
- Water, water, every

+ <b>painted</b> Ship                  +
+   Upon a <b>painted</b> <b>Ocean</b>.+
+ Water, water, every where            +
+   And all the boards did shrink

I don't see any way that that's not an improved match, given that
the query is 'painted <-> Ocean'; including another match to one
of the query words is surely better than not doing so.  The other
change is that matching '1 2 3 1 3' to '1 <-> 3' now selects '3 1 3'
not just '1 3'.  These are basically both the same change.  The
reason for it is that the old hlCover would *only* find the cover
'painted Ocean' (or '1 3').  The new hlCover finds that, but it
also finds 'painted ... painted Ocean' (or '3 1 3'), and then the
preference metric for more query words likes this option better.

So my feeling is that these changes are for the better and we shouldn't
complain.  We could perhaps make them go away if we changed the
preference rules some more, for example by preferring headlines that
use shorter covers instead of (or at least ahead of) those having more
query words.  But ISTM that would actually be a bigger change from the
current behavior, so likely it would create new changes in other query
results.  Besides, there's already a preference for shorter covers in
hlCover, so I don't feel like we need another one at the calling level.

In short then, I propose applying 0001-0006.  I'm not quite sure
if we should back-patch, or just be content to fix this in HEAD.
But there's definitely an argument that this has been broken since
we added phrase search (in 9.6) and deserves to be back-patched.

(BTW, I wonder if we could now undo the hack to ignore NOT
restrictions while finding covers.  I haven't tried it though.)

            regards, tom lane

diff --git a/src/test/regress/expected/tsearch.out b/src/test/regress/expected/tsearch.out
index 2bfd58b..d01010f 100644
--- a/src/test/regress/expected/tsearch.out
+++ b/src/test/regress/expected/tsearch.out
@@ -1328,6 +1328,15 @@ S. T. Coleridge (1772-1834)
    And all the boards
 (1 row)

+SELECT ts_headline('english',
+'Lorem ipsum urna.  Nullam nullam ullamcorper urna.',
+to_tsquery('english','Lorem') && phraseto_tsquery('english','ullamcorper urna'),
+'MaxWords=100, MinWords=1');
+ ts_headline
+--------------
+ <b>Lorem</b>
+(1 row)
+
 SELECT ts_headline('english', '
 <html>
 <!-- some comment -->
@@ -1467,6 +1476,16 @@ S. T. Coleridge (1772-1834)
  S. T. <b>Coleridge</b>
 (1 row)

+--Fragments with phrase search
+SELECT ts_headline('english',
+'Lorem ipsum urna.  Nullam nullam ullamcorper urna.',
+to_tsquery('english','Lorem') && phraseto_tsquery('english','ullamcorper urna'),
+'MaxFragments=100, MaxWords=100, MinWords=1');
+ ts_headline
+--------------
+ <b>Lorem</b>
+(1 row)
+
 --Rewrite sub system
 CREATE TABLE test_tsquery (txtkeyword TEXT, txtsample TEXT);
 \set ECHO none
diff --git a/src/test/regress/sql/tsearch.sql b/src/test/regress/sql/tsearch.sql
index c71cda5..22c84a3 100644
--- a/src/test/regress/sql/tsearch.sql
+++ b/src/test/regress/sql/tsearch.sql
@@ -384,6 +384,11 @@ Water, water, every where,
 S. T. Coleridge (1772-1834)
 ', phraseto_tsquery('english', 'idle as a painted Ship'));

+SELECT ts_headline('english',
+'Lorem ipsum urna.  Nullam nullam ullamcorper urna.',
+to_tsquery('english','Lorem') && phraseto_tsquery('english','ullamcorper urna'),
+'MaxWords=100, MinWords=1');
+
 SELECT ts_headline('english', '
 <html>
 <!-- some comment -->
@@ -454,6 +459,12 @@ Water, water, every where,
 S. T. Coleridge (1772-1834)
 ', to_tsquery('english', 'Coleridge & stuck'), 'MaxFragments=2,FragmentDelimiter=***');

+--Fragments with phrase search
+SELECT ts_headline('english',
+'Lorem ipsum urna.  Nullam nullam ullamcorper urna.',
+to_tsquery('english','Lorem') && phraseto_tsquery('english','ullamcorper urna'),
+'MaxFragments=100, MaxWords=100, MinWords=1');
+
 --Rewrite sub system

 CREATE TABLE test_tsquery (txtkeyword TEXT, txtsample TEXT);
diff --git a/src/backend/tsearch/wparser_def.c b/src/backend/tsearch/wparser_def.c
index 898466f..3e9d1f8 100644
--- a/src/backend/tsearch/wparser_def.c
+++ b/src/backend/tsearch/wparser_def.c
@@ -2064,6 +2064,20 @@ mark_fragment(HeadlineParsedText *prs, int highlight, int startpos, int endpos)
     }
 }

+/*
+ * Macros useful in headline selection.  These rely on availability of
+ * "HeadlineParsedText *prs" describing some text, and "int shortword"
+ * describing the "short word" length parameter.
+ */
+
+/* Interesting words are non-repeated search terms */
+#define INTERESTINGWORD(j) \
+    (prs->words[j].item && !prs->words[j].repeated)
+
+/* Don't want to end at a non-word or a short word */
+#define BADENDPOINT(j) \
+    (NOENDTOKEN(prs->words[j].type) || prs->words[j].len <= shortword)
+
 typedef struct
 {
     int32        startpos;
@@ -2091,7 +2105,7 @@ get_next_fragment(HeadlineParsedText *prs, int *startpos, int *endpos,
     for (i = *startpos; i <= *endpos; i++)
     {
         *startpos = i;
-        if (prs->words[i].item && !prs->words[i].repeated)
+        if (INTERESTINGWORD(i))
             break;
     }
     /* cut endpos to have only max_words */
@@ -2101,7 +2115,7 @@ get_next_fragment(HeadlineParsedText *prs, int *startpos, int *endpos,
     {
         if (!NONWORDTOKEN(prs->words[i].type))
             *curlen += 1;
-        if (prs->words[i].item && !prs->words[i].repeated)
+        if (INTERESTINGWORD(i))
             *poslen += 1;
     }
     /* if the cover was cut then move back endpos to a query item */
@@ -2111,7 +2125,7 @@ get_next_fragment(HeadlineParsedText *prs, int *startpos, int *endpos,
         for (i = *endpos; i >= *startpos; i--)
         {
             *endpos = i;
-            if (prs->words[i].item && !prs->words[i].repeated)
+            if (INTERESTINGWORD(i))
                 break;
             if (!NONWORDTOKEN(prs->words[i].type))
                 *curlen -= 1;
@@ -2197,8 +2211,9 @@ mark_hl_fragments(HeadlineParsedText *prs, TSQuery query, int highlight,
         for (i = 0; i < numcovers; i++)
         {
             if (!covers[i].in && !covers[i].excluded &&
-                (maxitems < covers[i].poslen || (maxitems == covers[i].poslen
-                                                 && minwords > covers[i].curlen)))
+                (maxitems < covers[i].poslen ||
+                 (maxitems == covers[i].poslen &&
+                  minwords > covers[i].curlen)))
             {
                 maxitems = covers[i].poslen;
                 minwords = covers[i].curlen;
@@ -2235,8 +2250,8 @@ mark_hl_fragments(HeadlineParsedText *prs, TSQuery query, int highlight,
                     }
                     posmarker = i;
                 }
-                /* cut back startpos till we find a non short token */
-                for (i = posmarker; i < startpos && (NOENDTOKEN(prs->words[i].type) || prs->words[i].len <=
shortword);i++) 
+                /* cut back startpos till we find a good endpoint */
+                for (i = posmarker; i < startpos && BADENDPOINT(i); i++)
                 {
                     if (!NONWORDTOKEN(prs->words[i].type))
                         curlen--;
@@ -2250,8 +2265,8 @@ mark_hl_fragments(HeadlineParsedText *prs, TSQuery query, int highlight,
                         curlen++;
                     posmarker = i;
                 }
-                /* cut back endpos till we find a non-short token */
-                for (i = posmarker; i > endpos && (NOENDTOKEN(prs->words[i].type) || prs->words[i].len <= shortword);
i--)
+                /* cut back endpos till we find a good endpoint */
+                for (i = posmarker; i > endpos && BADENDPOINT(i); i--)
                 {
                     if (!NONWORDTOKEN(prs->words[i].type))
                         curlen--;
@@ -2267,7 +2282,11 @@ mark_hl_fragments(HeadlineParsedText *prs, TSQuery query, int highlight,
             /* exclude overlapping covers */
             for (i = 0; i < numcovers; i++)
             {
-                if (i != minI && ((covers[i].startpos >= covers[minI].startpos && covers[i].startpos <=
covers[minI].endpos)|| (covers[i].endpos >= covers[minI].startpos && covers[i].endpos <= covers[minI].endpos))) 
+                if (i != minI &&
+                    ((covers[i].startpos >= covers[minI].startpos &&
+                      covers[i].startpos <= covers[minI].endpos) ||
+                     (covers[i].endpos >= covers[minI].startpos &&
+                      covers[i].endpos <= covers[minI].endpos)))
                     covers[i].excluded = 1;
             }
         }
@@ -2275,7 +2294,7 @@ mark_hl_fragments(HeadlineParsedText *prs, TSQuery query, int highlight,
             break;
     }

-    /* show at least min_words we have not marked anything */
+    /* show at least min_words if we have not marked anything */
     if (num_f <= 0)
     {
         startpos = endpos = curlen = 0;
@@ -2299,7 +2318,7 @@ mark_hl_words(HeadlineParsedText *prs, TSQuery query, int highlight,
     int            bestb = -1,
                 beste = -1;
     int            bestlen = -1;
-    int            pose = 0,
+    int            pose,
                 posb,
                 poslen,
                 curlen;
@@ -2310,55 +2329,69 @@ mark_hl_words(HeadlineParsedText *prs, TSQuery query, int highlight,
     {
         while (hlCover(prs, query, &p, &q))
         {
-            /* find cover len in words */
+            /*
+             * Count words (curlen) and interesting words (poslen) within
+             * cover, but stop once we reach max_words.  This step doesn't
+             * consider whether that's a good stopping point.  posb and pose
+             * are set to the start and end indexes of the possible headline.
+             */
             curlen = 0;
             poslen = 0;
+            posb = pose = p;
             for (i = p; i <= q && curlen < max_words; i++)
             {
                 if (!NONWORDTOKEN(prs->words[i].type))
                     curlen++;
-                if (prs->words[i].item && !prs->words[i].repeated)
+                if (INTERESTINGWORD(i))
                     poslen++;
                 pose = i;
             }

-            if (poslen < bestlen && !(NOENDTOKEN(prs->words[beste].type) || prs->words[beste].len <= shortword))
+            /* XXX this optimization seems unnecessary and wrong */
+            if (poslen < bestlen && !BADENDPOINT(beste))
             {
-                /* best already found, so try one more cover */
+                /* better cover already found, so try next cover */
                 p++;
                 continue;
             }

-            posb = p;
             if (curlen < max_words)
-            {                    /* find good end */
+            {
+                /*
+                 * We have room to lengthen the headline, so search forward
+                 * until it's full or we find a good stopping point.  We'll
+                 * reconsider the word at "q", then move forward.
+                 */
                 for (i = i - 1; i < prs->curwords && curlen < max_words; i++)
                 {
-                    if (i != q)
+                    if (i > q)
                     {
                         if (!NONWORDTOKEN(prs->words[i].type))
                             curlen++;
-                        if (prs->words[i].item && !prs->words[i].repeated)
+                        if (INTERESTINGWORD(i))
                             poslen++;
                     }
                     pose = i;
-                    if (NOENDTOKEN(prs->words[i].type) || prs->words[i].len <= shortword)
+                    if (BADENDPOINT(i))
                         continue;
                     if (curlen >= min_words)
                         break;
                 }
-                if (curlen < min_words && i >= prs->curwords)
-                {                /* got end of text and our cover is shorter
-                                 * than min_words */
+                if (curlen < min_words)
+                {
+                    /*
+                     * Reached end of text and our headline is still shorter
+                     * than min_words, so try to extend it to the left.
+                     */
                     for (i = p - 1; i >= 0; i--)
                     {
                         if (!NONWORDTOKEN(prs->words[i].type))
                             curlen++;
-                        if (prs->words[i].item && !prs->words[i].repeated)
+                        if (INTERESTINGWORD(i))
                             poslen++;
                         if (curlen >= max_words)
                             break;
-                        if (NOENDTOKEN(prs->words[i].type) || prs->words[i].len <= shortword)
+                        if (BADENDPOINT(i))
                             continue;
                         if (curlen >= min_words)
                             break;
@@ -2367,25 +2400,34 @@ mark_hl_words(HeadlineParsedText *prs, TSQuery query, int highlight,
                 }
             }
             else
-            {                    /* shorter cover :((( */
+            {
+                /*
+                 * Can't make headline longer, so consider making it shorter
+                 * if needed to avoid a bad endpoint.
+                 */
                 if (i > q)
                     i = q;
                 for (; curlen > min_words; i--)
                 {
                     if (!NONWORDTOKEN(prs->words[i].type))
                         curlen--;
-                    if (prs->words[i].item && !prs->words[i].repeated)
+                    if (INTERESTINGWORD(i))
                         poslen--;
                     pose = i;
-                    if (NOENDTOKEN(prs->words[i].type) || prs->words[i].len <= shortword)
-                        continue;
-                    break;
+                    if (!BADENDPOINT(i))
+                        break;
                 }
             }

-            if (bestlen < 0 || (poslen > bestlen && !(NOENDTOKEN(prs->words[pose].type) || prs->words[pose].len <=
shortword))|| 
-                (bestlen >= 0 && !(NOENDTOKEN(prs->words[pose].type) || prs->words[pose].len <= shortword) &&
-                 (NOENDTOKEN(prs->words[beste].type) || prs->words[beste].len <= shortword)))
+            /*
+             * Adopt this headline if it's the first, or if it has more
+             * interesting words and isn't ending at a bad endpoint, or if it
+             * replaces a bad endpoint with a good one (XXX even if it has
+             * fewer interesting words?  Really?)
+             */
+            if (bestlen < 0 ||
+                (poslen > bestlen && !BADENDPOINT(pose)) ||
+                (!BADENDPOINT(pose) && BADENDPOINT(beste)))
             {
                 bestb = posb;
                 beste = pose;
@@ -2395,6 +2437,10 @@ mark_hl_words(HeadlineParsedText *prs, TSQuery query, int highlight,
             p++;
         }

+        /*
+         * If we found nothing acceptable, select min_words words starting at
+         * the beginning.
+         */
         if (bestlen < 0)
         {
             curlen = 0;
@@ -2433,7 +2479,6 @@ mark_hl_words(HeadlineParsedText *prs, TSQuery query, int highlight,

         prs->words[i].in = (prs->words[i].repeated) ? 0 : 1;
     }
-
 }

 Datum
diff --git a/src/backend/tsearch/wparser_def.c b/src/backend/tsearch/wparser_def.c
index 3e9d1f8..84b6f8f9 100644
--- a/src/backend/tsearch/wparser_def.c
+++ b/src/backend/tsearch/wparser_def.c
@@ -2347,14 +2347,6 @@ mark_hl_words(HeadlineParsedText *prs, TSQuery query, int highlight,
                 pose = i;
             }

-            /* XXX this optimization seems unnecessary and wrong */
-            if (poslen < bestlen && !BADENDPOINT(beste))
-            {
-                /* better cover already found, so try next cover */
-                p++;
-                continue;
-            }
-
             if (curlen < max_words)
             {
                 /*
@@ -2409,13 +2401,13 @@ mark_hl_words(HeadlineParsedText *prs, TSQuery query, int highlight,
                     i = q;
                 for (; curlen > min_words; i--)
                 {
+                    if (!BADENDPOINT(i))
+                        break;
                     if (!NONWORDTOKEN(prs->words[i].type))
                         curlen--;
                     if (INTERESTINGWORD(i))
                         poslen--;
-                    pose = i;
-                    if (!BADENDPOINT(i))
-                        break;
+                    pose = i - 1;
                 }
             }

diff --git a/src/test/regress/expected/tsearch.out b/src/test/regress/expected/tsearch.out
index d01010f..e6fb820 100644
--- a/src/test/regress/expected/tsearch.out
+++ b/src/test/regress/expected/tsearch.out
@@ -1367,9 +1367,9 @@ to_tsquery('english', 'sea&foo'), 'HighlightAll=true');
 (1 row)

 SELECT ts_headline('simple', '1 2 3 1 3'::text, '1 <-> 3', 'MaxWords=2, MinWords=1');
-    ts_headline
--------------------
- <b>1</b> <b>3</b>
+ ts_headline
+-------------
+ <b>1</b>
 (1 row)

 SELECT ts_headline('simple', '1 2 3 1 3'::text, '1 & 3', 'MaxWords=4, MinWords=1');
diff --git a/src/backend/tsearch/wparser_def.c b/src/backend/tsearch/wparser_def.c
index 84b6f8f9..aa4271a 100644
--- a/src/backend/tsearch/wparser_def.c
+++ b/src/backend/tsearch/wparser_def.c
@@ -2074,9 +2074,10 @@ mark_fragment(HeadlineParsedText *prs, int highlight, int startpos, int endpos)
 #define INTERESTINGWORD(j) \
     (prs->words[j].item && !prs->words[j].repeated)

-/* Don't want to end at a non-word or a short word */
+/* Don't want to end at a non-word or a short word, unless interesting */
 #define BADENDPOINT(j) \
-    (NOENDTOKEN(prs->words[j].type) || prs->words[j].len <= shortword)
+    ((NOENDTOKEN(prs->words[j].type) || prs->words[j].len <= shortword) && \
+     !INTERESTINGWORD(j))

 typedef struct
 {
diff --git a/src/test/regress/expected/tsearch.out b/src/test/regress/expected/tsearch.out
index e6fb820..36e995b 100644
--- a/src/test/regress/expected/tsearch.out
+++ b/src/test/regress/expected/tsearch.out
@@ -1367,15 +1367,15 @@ to_tsquery('english', 'sea&foo'), 'HighlightAll=true');
 (1 row)

 SELECT ts_headline('simple', '1 2 3 1 3'::text, '1 <-> 3', 'MaxWords=2, MinWords=1');
- ts_headline
--------------
- <b>1</b>
+    ts_headline
+-------------------
+ <b>1</b> <b>3</b>
 (1 row)

 SELECT ts_headline('simple', '1 2 3 1 3'::text, '1 & 3', 'MaxWords=4, MinWords=1');
-         ts_headline
-------------------------------
- <b>1</b> 2 <b>3</b> <b>1</b>
+     ts_headline
+---------------------
+ <b>1</b> 2 <b>3</b>
 (1 row)

 SELECT ts_headline('simple', '1 2 3 1 3'::text, '1 <-> 3', 'MaxWords=4, MinWords=1');
diff --git a/src/backend/tsearch/wparser_def.c b/src/backend/tsearch/wparser_def.c
index aa4271a..2f84259 100644
--- a/src/backend/tsearch/wparser_def.c
+++ b/src/backend/tsearch/wparser_def.c
@@ -2319,11 +2319,12 @@ mark_hl_words(HeadlineParsedText *prs, TSQuery query, int highlight,
     int            bestb = -1,
                 beste = -1;
     int            bestlen = -1;
+    bool        bestcover = false;
     int            pose,
                 posb,
                 poslen,
                 curlen;
-
+    bool        poscover;
     int            i;

     if (highlight == 0)
@@ -2413,18 +2414,27 @@ mark_hl_words(HeadlineParsedText *prs, TSQuery query, int highlight,
             }

             /*
-             * Adopt this headline if it's the first, or if it has more
-             * interesting words and isn't ending at a bad endpoint, or if it
-             * replaces a bad endpoint with a good one (XXX even if it has
-             * fewer interesting words?  Really?)
+             * Check whether the proposed headline includes the original
+             * cover; it might not if we trimmed it due to max_words.
+             */
+            poscover = (posb <= p && pose >= q);
+
+            /*
+             * Adopt this headline if it's better than the last one, giving
+             * highest priority to headlines including the cover, then to
+             * headlines with more interesting words, then to headlines with
+             * good stopping points.  (Since bestlen is initially -1, we will
+             * certainly adopt the first headline.)
              */
-            if (bestlen < 0 ||
-                (poslen > bestlen && !BADENDPOINT(pose)) ||
-                (!BADENDPOINT(pose) && BADENDPOINT(beste)))
+            if (poscover > bestcover ||
+                (poscover == bestcover && poslen > bestlen) ||
+                (poscover == bestcover && poslen == bestlen &&
+                 !BADENDPOINT(pose) && BADENDPOINT(beste)))
             {
                 bestb = posb;
                 beste = pose;
                 bestlen = poslen;
+                bestcover = poscover;
             }

             p++;
diff --git a/src/backend/tsearch/wparser_def.c b/src/backend/tsearch/wparser_def.c
index 2f84259..6bf7e49 100644
--- a/src/backend/tsearch/wparser_def.c
+++ b/src/backend/tsearch/wparser_def.c
@@ -1966,75 +1966,97 @@ checkcondition_HL(void *opaque, QueryOperand *val, ExecPhraseData *data)
     return false;
 }

-
-static bool
-hlCover(HeadlineParsedText *prs, TSQuery query, int *p, int *q)
+/*
+ * hlFirstIndex: find first index >= pos containing any word used in query
+ *
+ * Returns -1 if no such index
+ */
+static int
+hlFirstIndex(HeadlineParsedText *prs, TSQuery query, int pos)
 {
-    int            i,
-                j;
-    QueryItem  *item = GETQUERY(query);
-    int            pos = *p;
-
-    *q = -1;
-    *p = INT_MAX;
+    int            i;

-    for (j = 0; j < query->size; j++)
+    /* For each word ... */
+    for (i = pos; i < prs->curwords; i++)
     {
-        if (item->type != QI_VAL)
+        /* ... scan the query to see if this word matches any operand */
+        QueryItem  *item = GETQUERY(query);
+        int            j;
+
+        for (j = 0; j < query->size; j++)
         {
+            if (item->type == QI_VAL &&
+                prs->words[i].item == &item->qoperand)
+                return i;
             item++;
-            continue;
         }
-        for (i = pos; i < prs->curwords; i++)
-        {
-            if (prs->words[i].item == &item->qoperand)
-            {
-                if (i > *q)
-                    *q = i;
-                break;
-            }
-        }
-        item++;
     }
+    return -1;
+}

-    if (*q < 0)
-        return false;
+/*
+ * hlCover: try to find a substring of prs' word list that satisfies query
+ *
+ * At entry, *p must be the first word index to consider (initialize this to
+ * zero, or to the next index after a previous successful search).
+ *
+ * On success, sets *p to first word index and *q to last word index of the
+ * cover substring, and returns true.
+ *
+ * The result is a minimal cover, in the sense that both *p and *q will be
+ * words used in the query.
+ */
+static bool
+hlCover(HeadlineParsedText *prs, TSQuery query, int *p, int *q)
+{
+    int            pmin,
+                pmax,
+                nextpmin,
+                nextpmax;
+    hlCheck        ch;

-    item = GETQUERY(query);
-    for (j = 0; j < query->size; j++)
+    /*
+     * We look for the earliest, shortest substring of prs->words that
+     * satisfies the query.  Both the pmin and pmax indices must be words
+     * appearing in the query; there's no point in trying endpoints in between
+     * such points.
+     */
+    pmin = hlFirstIndex(prs, query, *p);
+    while (pmin >= 0)
     {
-        if (item->type != QI_VAL)
+        /* This useless assignment just keeps stupider compilers quiet */
+        nextpmin = -1;
+        /* Consider substrings starting at pmin */
+        ch.words = &(prs->words[pmin]);
+        /* Consider the length-one substring first, then longer substrings */
+        pmax = pmin;
+        do
         {
-            item++;
-            continue;
-        }
-        for (i = *q; i >= pos; i--)
-        {
-            if (prs->words[i].item == &item->qoperand)
+            /* Try to match query against pmin .. pmax substring */
+            ch.len = pmax - pmin + 1;
+            if (TS_execute(GETQUERY(query), &ch,
+                           TS_EXEC_EMPTY, checkcondition_HL))
             {
-                if (i < *p)
-                    *p = i;
-                break;
+                *p = pmin;
+                *q = pmax;
+                return true;
             }
-        }
-        item++;
-    }
-
-    if (*p <= *q)
-    {
-        hlCheck        ch;
+            /* Nope, so advance pmax to next feasible endpoint */
+            nextpmax = hlFirstIndex(prs, query, pmax + 1);

-        ch.words = &(prs->words[*p]);
-        ch.len = *q - *p + 1;
-        if (TS_execute(GETQUERY(query), &ch, TS_EXEC_EMPTY, checkcondition_HL))
-            return true;
-        else
-        {
-            (*p)++;
-            return hlCover(prs, query, p, q);
+            /*
+             * If this is our first advance past pmin, then the result is also
+             * the next feasible value of pmin; remember it to save a
+             * redundant search.
+             */
+            if (pmax == pmin)
+                nextpmin = nextpmax;
+            pmax = nextpmax;
         }
+        while (pmax >= 0);
+        /* No luck here, so try next feasible startpoint */
+        pmin = nextpmin;
     }
-
     return false;
 }

diff --git a/src/test/regress/expected/tsearch.out b/src/test/regress/expected/tsearch.out
index 36e995b..d6e0ba6 100644
--- a/src/test/regress/expected/tsearch.out
+++ b/src/test/regress/expected/tsearch.out
@@ -1301,12 +1301,12 @@ Water, water, every where,
   Nor any drop to drink.
 S. T. Coleridge (1772-1834)
 ', phraseto_tsquery('english', 'painted Ocean'));
-           ts_headline
-----------------------------------
- <b>painted</b> <b>Ocean</b>.    +
- Water, water, every where       +
-   And all the boards did shrink;+
- Water, water, every
+              ts_headline
+---------------------------------------
+ <b>painted</b> Ship                  +
+   Upon a <b>painted</b> <b>Ocean</b>.+
+ Water, water, every where            +
+   And all the boards did shrink
 (1 row)

 SELECT ts_headline('english', '
@@ -1332,9 +1332,9 @@ SELECT ts_headline('english',
 'Lorem ipsum urna.  Nullam nullam ullamcorper urna.',
 to_tsquery('english','Lorem') && phraseto_tsquery('english','ullamcorper urna'),
 'MaxWords=100, MinWords=1');
- ts_headline
---------------
- <b>Lorem</b>
+                                  ts_headline
+-------------------------------------------------------------------------------
+ <b>Lorem</b> ipsum <b>urna</b>.  Nullam nullam <b>ullamcorper</b> <b>urna</b>
 (1 row)

 SELECT ts_headline('english', '
@@ -1379,9 +1379,9 @@ SELECT ts_headline('simple', '1 2 3 1 3'::text, '1 & 3', 'MaxWords=4, MinWords=1
 (1 row)

 SELECT ts_headline('simple', '1 2 3 1 3'::text, '1 <-> 3', 'MaxWords=4, MinWords=1');
-    ts_headline
--------------------
- <b>1</b> <b>3</b>
+        ts_headline
+----------------------------
+ <b>3</b> <b>1</b> <b>3</b>
 (1 row)

 --Check if headline fragments work
@@ -1481,9 +1481,9 @@ SELECT ts_headline('english',
 'Lorem ipsum urna.  Nullam nullam ullamcorper urna.',
 to_tsquery('english','Lorem') && phraseto_tsquery('english','ullamcorper urna'),
 'MaxFragments=100, MaxWords=100, MinWords=1');
- ts_headline
---------------
- <b>Lorem</b>
+                                  ts_headline
+-------------------------------------------------------------------------------
+ <b>Lorem</b> ipsum <b>urna</b>.  Nullam nullam <b>ullamcorper</b> <b>urna</b>
 (1 row)

 --Rewrite sub system

pgsql-bugs by date:

Previous
From: Kyotaro Horiguchi
Date:
Subject: Re: [BUG] non archived WAL removed during production crash recovery
Next
From: PG Bug reporting form
Date:
Subject: BUG #16353: pg_isready timeout is ignored