Re: BUG #16345: ts_headline does not find phrase matches correctly - Mailing list pgsql-bugs

From:           Tom Lane
Subject:        Re: BUG #16345: ts_headline does not find phrase matches correctly
Msg-id:         25072.1586401341@sss.pgh.pa.us
In response to: BUG #16345: ts_headline does not find phrase matches correctly
                (PG Bug reporting form <noreply@postgresql.org>)
List:           pgsql-bugs
PG Bug reporting form <noreply@postgresql.org> writes:
> When query:
>
> select ts_headline(
>   $$Lorem ipsum urna. Nullam nullam ullamcorper urna.$$,
>   to_tsquery('Lorem') && phraseto_tsquery('ullamcorper urna'),
>   'StartSel=#$#, StopSel=#$#, FragmentDelimiter=$#$, MaxFragments=100,
>   MaxWords=100, MinWords=1'
> );
>
> is run, a fragment of
>
>> Lorem ipsum urna. Nullam nullam ullamcorper urna.
>
> should be returned; however, the result is the single word #$#Lorem#$#,
> meaning that ts_headline did not find the queried string.

Yeah.  I spent some time digging into this, and was reminded of how
miserably baroque and undercommented almost all of the text-search code
is.  Anyway, the bottom line is that wparser_def.c's hlCover() is failing
here.  It is looking for a minimal cover, that is a substring satisfying
the given tsquery, and obviously with this query the only such substring
is the whole sentence.  (Well, not the trailing period.)

However, it looks like hlCover() wasn't updated when we added phrase
search, because that made word position significant and invalidated a
rather fundamental assumption that hlCover() seems to be making: it
figures that any substring including all words of the query ought to be
good enough.  Since "urna" also appears as the third word of the text,
hlCover() doesn't try any substrings longer than "Lorem ipsum urna.
Nullam nullam ullamcorper", thus it never finds one that actually
satisfies the query, and we end up failing.

Although utterly undocumented, the algorithm it is using seems to be
"find the latest first occurrence of any query word (here,
'ullamcorper'), then find the earliest last occurrence of any query word
before that (hence, 'Lorem'), and then see if the substring between those
points satisfies the query (oops, nope).  If not, start over from a point
one past the previous try."
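For illustration, that heuristic can be sketched over plain string arrays; this is a toy model, not the PostgreSQL code, and old_cover_bounds() and its inputs are invented names.  Applied to the bug's sentence, it yields exactly the too-short candidate ending at "ullamcorper", never reaching the final "urna":

```c
/*
 * Toy sketch of the old hlCover() heuristic: *q becomes the latest first
 * occurrence of any query word at or after 'pos', *p the earliest last
 * occurrence of any query word at or before *q.  The caller would then
 * test whether text[p..q] satisfies the query, restarting at pos = p + 1
 * on failure.
 */
#include <string.h>

static int
old_cover_bounds(const char *text[], int ntext,
                 const char *query[], int nquery,
                 int pos, int *p, int *q)
{
    *q = -1;
    *p = ntext;                 /* stands in for INT_MAX */

    for (int j = 0; j < nquery; j++)
    {
        for (int i = pos; i < ntext; i++)
            if (strcmp(text[i], query[j]) == 0)
            {
                if (i > *q)
                    *q = i;
                break;
            }
    }
    if (*q < 0)
        return 0;               /* some query word is missing entirely */

    for (int j = 0; j < nquery; j++)
    {
        for (int i = *q; i >= pos; i--)
            if (strcmp(text[i], query[j]) == 0)
            {
                if (i < *p)
                    *p = i;
                break;
            }
    }
    return *p <= *q;
}
```

For the Lorem ipsum text, this picks the substring from word 0 ("lorem") through word 5 ("ullamcorper"), excluding the final "urna" at index 6 that the phrase operator needs.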
But all the tries after that will omit 'Lorem', so they all fail to match
the query, even though it'll eventually try substrings that include the
later occurrence of 'urna'.

I've not spent a huge amount of time thinking about it, but this might be
all right as a way to find a shortest-possible cover for queries
involving only AND and OR operators.  (It'd fall down on NOT operators of
course, but it already cheats on that by telling TS_execute to ignore NOT
subtrees.)  However, it's blatantly wrong as soon as a phrase operator is
involved, because then only some occurrences of a particular word in the
string might meet the query's requirements.

So I set out to rewrite hlCover to make fewer assumptions about what a
valid cover could be.  In the new version appearing below, it just tries
every substring that begins and ends with some query word, preferring
earlier and shorter substrings.  This should certainly find the desired
cover ... but it didn't work, plus it broke a number of existing
regression test cases.  I was thus forced to the realization that its
immediate caller mark_hl_words is *also* broken, because it's rejecting
good headlines in favor of bad ones, or even in favor of headlines that
contain no cover at all.  Which was a bit daunting, because that's an
even larger and uglier chunk of undocumented code.

I ended up with the following stepwise approach to improving the
situation.

0001 below adds the problematic test case, with the wrong output that
HEAD produces.  This was basically just to track which changes affected
that.

0002 makes a bunch of purely-cosmetic changes to mark_hl_words and its
siblings, in hopes of making it less unintelligible to the next hacker.
I added comments, used macros to make some of the hairier if-tests more
readable, and changed a couple of small things for more readability
(though they can be proven not to affect the behavior).  As expected,
regression outputs don't change here.

0003 fixes a couple of fairly obvious bugs.
One is that there's an early-exit optimization that tries to reject a
possible headline before having fully defined its boundaries.  This is
not really necessary, but worse it's wrong, because the figure of merit
might change by the time we've chosen the actual boundaries.  Deleting it
doesn't change any regression test cases, but I'm sure it'd be possible
to invent a scenario where it does the wrong thing.

The other bug is that the loop that tries to shorten a maximum-length
headline until it has a good endpoint has an order-of-operations issue:
it can exit after making adjustments to curlen and poslen that discount
the i'th word from the headline, but without changing pose to actually
exclude that word.  So we end up with a poslen figure-of-merit that does
not describe the actual headline from posb to pose.

Unfortunately, the one regression test output change caused by 0003 is
clearly for the worse: hlCover successfully finds the cover '1 3' for the
query, but now mark_hl_words discards '3' and only highlights '1'.  What
is happening is that '3' fails the "short word" test and is thereby
excluded from the headline.  This behavior is clearly what the code
intends, but it was accidentally masked before by the
order-of-operations bug.

I argue that the problem here is that excluding an actual query term from
the headline on the basis of its being short is just stupid.  There is
much-earlier processing that is charged with excluding stop words from
tsqueries, and this code has no business second-guessing that.  So the
short-word test should only be applied to text words that did not appear
in the tsquery.  Hence, 0004 rejiggers the new BADENDPOINT() macro so
that query terms are never considered bad endpoints.

That fixes the '1 <-> 3' test case broken by 0003, but now there's a new
diff: matching '1 2 3 1 3' to '1 & 3' now selects only '1 2 3' as the
headline, not '1 2 3 1'.  I do not think this is a bug though.
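In code terms, the revised endpoint rule amounts to something like the following toy predicate (the struct and names here are invented for illustration; they are not the actual macro or data structures):

```c
/*
 * Toy model of the 0004 endpoint rule: a non-word token, or a word at or
 * below the "short word" length, is a bad place to end a headline --
 * unless the word is itself a query term.
 */
struct hl_word
{
    int     is_nonword;     /* token is whitespace/punctuation */
    int     len;            /* word length */
    int     is_query_term;  /* word matches a tsquery operand */
};

static int
bad_endpoint(const struct hl_word *w, int shortword)
{
    return (w->is_nonword || w->len <= shortword) && !w->is_query_term;
}
```

Under this rule the one-character word '3' in the '1 <-> 3' test is an acceptable endpoint, because it is a query term, while an equally short non-query word is still rejected.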
The reason we got '1 2 3 1' before is that hlCover selected the cover
'1 2 3', but then mark_hl_words decided '3' was a bad endpoint and
extended the headline by one word.  (It would've extended more, because
'1' is also a bad endpoint, except MaxWords=4 stopped it.)  Again,
treating '3' as a bad endpoint is just silly, so I think this change is
acceptable.

Next, 0005 rearranges the preference order for different possible
headlines so that the first preference item is whether or not the
headline includes the cover string initially found by hlCover.  (It might
not, if the cover string was longer than MaxWords.)  It seems to me to be
dumb to prefer headlines that don't meet that requirement to ones that
do, because shorter headlines might not satisfy the user's query, which
surely fails to satisfy the principle of least astonishment.  (While I'm
not entirely sure what can be said about the old implementation of
hlCover, with my rewrite it is *certain* that substrings not including
the full cover won't satisfy the query.)

0005 also gets rid of what seems to me to be a corner-case bug in the old
preference logic, which is that it will take a headline with fewer query
words over one with more, if the former has a "good" endpoint and the
latter doesn't.  That makes the actual preference order close to
unexplainable.

0005 doesn't in itself change any regression results, but it's necessary
to prevent problems from appearing with the next patch.  The real
situation here, as I've come to understand it, is that the existing
hlCover code frequently produces only one possible cover, and thus it
doesn't matter how silly are mark_hl_words's rules for preferring one
over another.  The rewrite causes hlCover to produce more candidate
covers, and so it becomes more important for mark_hl_words to make sane
decisions.

Lastly, 0006 introduces the new hlCover code.  This at last fixes the
test case for the bug at hand.  It also introduces two new diffs.
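The 0005 preference order can be modeled as a small comparator; again this is a sketch with invented names, not the patch's code, which folds the same logic into one if-test:

```c
/*
 * Toy model of the 0005 preference order: a candidate headline is better
 * if it includes the full cover and the current best doesn't; else if it
 * highlights more query words; else if it ends at a good endpoint while
 * the current best does not.
 */
struct headline_cand
{
    int     has_cover;      /* includes the cover found by hlCover? */
    int     poslen;         /* number of query words highlighted */
    int     bad_end;        /* ends at a bad endpoint? */
};

static int
better_headline(const struct headline_cand *cand,
                const struct headline_cand *best)
{
    if (cand->has_cover != best->has_cover)
        return cand->has_cover;
    if (cand->poslen != best->poslen)
        return cand->poslen > best->poslen;
    return !cand->bad_end && best->bad_end;
}
```

Note that a candidate containing the cover beats one with more query words but no cover, which is exactly the inversion 0005 makes relative to the old logic.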
One is this change in one Rime of the Ancient Mariner example:

    -  <b>painted</b> <b>Ocean</b>.        +
    -  Water, water, every where           +
    -  And all the boards did shrink;      +
    -  Water, water, every
    +  <b>painted</b> Ship                 +
    +  Upon a <b>painted</b> <b>Ocean</b>. +
    +  Water, water, every where           +
    +  And all the boards did shrink

I don't see any way that that's not an improved match, given that the
query is 'painted <-> Ocean'; including another match to one of the query
words is surely better than not doing so.  The other change is that
matching '1 2 3 1 3' to '1 <-> 3' now selects '3 1 3', not just '1 3'.

These are basically both the same change.  The reason for it is that the
old hlCover would *only* find the cover 'painted Ocean' (or '1 3').  The
new hlCover finds that, but it also finds 'painted ... painted Ocean'
(or '3 1 3'), and then the preference metric for more query words likes
this option better.  So my feeling is that these changes are for the
better and we shouldn't complain.

We could perhaps make them go away if we changed the preference rules
some more, for example by preferring headlines that use shorter covers
instead of (or at least ahead of) those having more query words.  But
ISTM that would actually be a bigger change from the current behavior, so
likely it would create new changes in other query results.  Besides,
there's already a preference for shorter covers in hlCover, so I don't
feel like we need another one at the calling level.

In short then, I propose applying 0001-0006.  I'm not quite sure if we
should back-patch, or just be content to fix this in HEAD.  But there's
definitely an argument that this has been broken since we added phrase
search (in 9.6) and deserves to be back-patched.

(BTW, I wonder if we could now undo the hack to ignore NOT restrictions
while finding covers.  I haven't tried it though.)
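Abstracting away the PostgreSQL data structures, the rewritten search strategy of 0006 looks roughly like this; new_cover(), first_query_index(), and demo_check() are invented names, with demo_check() standing in for TS_execute():

```c
/*
 * Toy sketch of the rewritten hlCover() strategy: consider only
 * substrings whose endpoints are words used in the query, trying earlier
 * start points first and, for each start, shorter substrings first.
 */
typedef int (*cover_check)(int p, int q, void *arg);

/* first index >= pos whose word appears in the query, or -1 */
static int
first_query_index(const int is_query_word[], int nwords, int pos)
{
    for (int i = pos; i < nwords; i++)
        if (is_query_word[i])
            return i;
    return -1;
}

static int
new_cover(const int is_query_word[], int nwords,
          cover_check satisfies, void *arg, int *p, int *q)
{
    int pmin = first_query_index(is_query_word, nwords, *p);

    while (pmin >= 0)
    {
        int nextpmin = -1;
        int pmax = pmin;

        /* length-one substring first, then longer ones */
        do
        {
            if (satisfies(pmin, pmax, arg))
            {
                *p = pmin;
                *q = pmax;
                return 1;
            }
            /* advance pmax to the next feasible endpoint */
            int nextpmax = first_query_index(is_query_word, nwords, pmax + 1);

            /* first advance past pmin also gives the next feasible pmin */
            if (pmax == pmin)
                nextpmin = nextpmax;
            pmax = nextpmax;
        } while (pmax >= 0);
        pmin = nextpmin;
    }
    return 0;
}

/*
 * Stand-in query check for the bug's example: the substring must contain
 * word 0 ('Lorem') and words 5..6 ('ullamcorper urna' as a phrase).
 */
static int
demo_check(int p, int q, void *arg)
{
    (void) arg;
    return p <= 0 && q >= 6;
}
```

Because every pair of query-word endpoints is eventually tried, this version does reach the substring spanning word 0 through the final "urna" at word 6, which the old heuristic never considered.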
			regards, tom lane

diff --git a/src/test/regress/expected/tsearch.out b/src/test/regress/expected/tsearch.out
index 2bfd58b..d01010f 100644
--- a/src/test/regress/expected/tsearch.out
+++ b/src/test/regress/expected/tsearch.out
@@ -1328,6 +1328,15 @@ S. T. Coleridge (1772-1834)
  And all the boards
 (1 row)
 
+SELECT ts_headline('english',
+'Lorem ipsum urna. Nullam nullam ullamcorper urna.',
+to_tsquery('english','Lorem') && phraseto_tsquery('english','ullamcorper urna'),
+'MaxWords=100, MinWords=1');
+ ts_headline
+--------------
+ <b>Lorem</b>
+(1 row)
+
 SELECT ts_headline('english', '
 <html>
 <!-- some comment -->
@@ -1467,6 +1476,16 @@ S. T. Coleridge (1772-1834)
  S. T. <b>Coleridge</b>
 (1 row)
 
+--Fragments with phrase search
+SELECT ts_headline('english',
+'Lorem ipsum urna. Nullam nullam ullamcorper urna.',
+to_tsquery('english','Lorem') && phraseto_tsquery('english','ullamcorper urna'),
+'MaxFragments=100, MaxWords=100, MinWords=1');
+ ts_headline
+--------------
+ <b>Lorem</b>
+(1 row)
+
 --Rewrite sub system
 CREATE TABLE test_tsquery (txtkeyword TEXT, txtsample TEXT);
 \set ECHO none
diff --git a/src/test/regress/sql/tsearch.sql b/src/test/regress/sql/tsearch.sql
index c71cda5..22c84a3 100644
--- a/src/test/regress/sql/tsearch.sql
+++ b/src/test/regress/sql/tsearch.sql
@@ -384,6 +384,11 @@ Water, water, every where,
 S. T. Coleridge (1772-1834)
 ', phraseto_tsquery('english', 'idle as a painted Ship'));
 
+SELECT ts_headline('english',
+'Lorem ipsum urna. Nullam nullam ullamcorper urna.',
+to_tsquery('english','Lorem') && phraseto_tsquery('english','ullamcorper urna'),
+'MaxWords=100, MinWords=1');
+
 SELECT ts_headline('english', '
 <html>
 <!-- some comment -->
@@ -454,6 +459,12 @@ Water, water, every where,
 S. T. Coleridge (1772-1834)
 ', to_tsquery('english', 'Coleridge & stuck'),
 'MaxFragments=2,FragmentDelimiter=***');
 
+--Fragments with phrase search
+SELECT ts_headline('english',
+'Lorem ipsum urna. Nullam nullam ullamcorper urna.',
+to_tsquery('english','Lorem') && phraseto_tsquery('english','ullamcorper urna'),
+'MaxFragments=100, MaxWords=100, MinWords=1');
+
 --Rewrite sub system
 CREATE TABLE test_tsquery (txtkeyword TEXT, txtsample TEXT);
diff --git a/src/backend/tsearch/wparser_def.c b/src/backend/tsearch/wparser_def.c
index 898466f..3e9d1f8 100644
--- a/src/backend/tsearch/wparser_def.c
+++ b/src/backend/tsearch/wparser_def.c
@@ -2064,6 +2064,20 @@ mark_fragment(HeadlineParsedText *prs, int highlight, int startpos, int endpos)
     }
 }
 
+/*
+ * Macros useful in headline selection.  These rely on availability of
+ * "HeadlineParsedText *prs" describing some text, and "int shortword"
+ * describing the "short word" length parameter.
+ */
+
+/* Interesting words are non-repeated search terms */
+#define INTERESTINGWORD(j) \
+    (prs->words[j].item && !prs->words[j].repeated)
+
+/* Don't want to end at a non-word or a short word */
+#define BADENDPOINT(j) \
+    (NOENDTOKEN(prs->words[j].type) || prs->words[j].len <= shortword)
+
 typedef struct
 {
     int32       startpos;
@@ -2091,7 +2105,7 @@ get_next_fragment(HeadlineParsedText *prs, int *startpos, int *endpos,
     for (i = *startpos; i <= *endpos; i++)
     {
         *startpos = i;
-        if (prs->words[i].item && !prs->words[i].repeated)
+        if (INTERESTINGWORD(i))
             break;
     }
     /* cut endpos to have only max_words */
@@ -2101,7 +2115,7 @@ get_next_fragment(HeadlineParsedText *prs, int *startpos, int *endpos,
     {
         if (!NONWORDTOKEN(prs->words[i].type))
             *curlen += 1;
-        if (prs->words[i].item && !prs->words[i].repeated)
+        if (INTERESTINGWORD(i))
             *poslen += 1;
     }
     /* if the cover was cut then move back endpos to a query item */
@@ -2111,7 +2125,7 @@ get_next_fragment(HeadlineParsedText *prs, int *startpos, int *endpos,
     for (i = *endpos; i >= *startpos; i--)
     {
         *endpos = i;
-        if (prs->words[i].item && !prs->words[i].repeated)
+        if (INTERESTINGWORD(i))
             break;
         if (!NONWORDTOKEN(prs->words[i].type))
             *curlen -= 1;
@@ -2197,8 +2211,9 @@ mark_hl_fragments(HeadlineParsedText *prs, TSQuery query, int highlight,
     for (i = 0; i < numcovers; i++)
     {
         if (!covers[i].in && !covers[i].excluded &&
-            (maxitems < covers[i].poslen || (maxitems == covers[i].poslen
-                        && minwords > covers[i].curlen)))
+            (maxitems < covers[i].poslen ||
+             (maxitems == covers[i].poslen &&
+              minwords > covers[i].curlen)))
         {
             maxitems = covers[i].poslen;
             minwords = covers[i].curlen;
@@ -2235,8 +2250,8 @@ mark_hl_fragments(HeadlineParsedText *prs, TSQuery query, int highlight,
             }
             posmarker = i;
         }
-        /* cut back startpos till we find a non short token */
-        for (i = posmarker; i < startpos && (NOENDTOKEN(prs->words[i].type) || prs->words[i].len <= shortword); i++)
+        /* cut back startpos till we find a good endpoint */
+        for (i = posmarker; i < startpos && BADENDPOINT(i); i++)
         {
             if (!NONWORDTOKEN(prs->words[i].type))
                 curlen--;
@@ -2250,8 +2265,8 @@ mark_hl_fragments(HeadlineParsedText *prs, TSQuery query, int highlight,
                 curlen++;
             posmarker = i;
         }
-        /* cut back endpos till we find a non-short token */
-        for (i = posmarker; i > endpos && (NOENDTOKEN(prs->words[i].type) || prs->words[i].len <= shortword); i--)
+        /* cut back endpos till we find a good endpoint */
+        for (i = posmarker; i > endpos && BADENDPOINT(i); i--)
         {
             if (!NONWORDTOKEN(prs->words[i].type))
                 curlen--;
@@ -2267,7 +2282,11 @@ mark_hl_fragments(HeadlineParsedText *prs, TSQuery query, int highlight,
         /* exclude overlapping covers */
         for (i = 0; i < numcovers; i++)
         {
-            if (i != minI && ((covers[i].startpos >= covers[minI].startpos && covers[i].startpos <= covers[minI].endpos) || (covers[i].endpos >= covers[minI].startpos && covers[i].endpos <= covers[minI].endpos)))
+            if (i != minI &&
+                ((covers[i].startpos >= covers[minI].startpos &&
+                  covers[i].startpos <= covers[minI].endpos) ||
+                 (covers[i].endpos >= covers[minI].startpos &&
+                  covers[i].endpos <= covers[minI].endpos)))
                 covers[i].excluded = 1;
         }
     }
@@ -2275,7 +2294,7 @@ mark_hl_fragments(HeadlineParsedText *prs, TSQuery query, int highlight,
         break;
     }
 
-    /* show at least min_words we have not marked anything */
+    /* show at least min_words if we have not marked anything */
     if (num_f <= 0)
     {
         startpos = endpos = curlen = 0;
@@ -2299,7 +2318,7 @@ mark_hl_words(HeadlineParsedText *prs, TSQuery query, int highlight,
     int         bestb = -1,
                 beste = -1;
     int         bestlen = -1;
-    int         pose = 0,
+    int         pose,
                 posb,
                 poslen,
                 curlen;
@@ -2310,55 +2329,69 @@ mark_hl_words(HeadlineParsedText *prs, TSQuery query, int highlight,
     {
         while (hlCover(prs, query, &p, &q))
         {
-            /* find cover len in words */
+            /*
+             * Count words (curlen) and interesting words (poslen) within
+             * cover, but stop once we reach max_words.  This step doesn't
+             * consider whether that's a good stopping point.  posb and pose
+             * are set to the start and end indexes of the possible headline.
+             */
             curlen = 0;
             poslen = 0;
+            posb = pose = p;
             for (i = p; i <= q && curlen < max_words; i++)
             {
                 if (!NONWORDTOKEN(prs->words[i].type))
                     curlen++;
-                if (prs->words[i].item && !prs->words[i].repeated)
+                if (INTERESTINGWORD(i))
                     poslen++;
                 pose = i;
             }
-            if (poslen < bestlen && !(NOENDTOKEN(prs->words[beste].type) || prs->words[beste].len <= shortword))
+            /* XXX this optimization seems unnecessary and wrong */
+            if (poslen < bestlen && !BADENDPOINT(beste))
             {
-                /* best already found, so try one more cover */
+                /* better cover already found, so try next cover */
                 p++;
                 continue;
             }
-            posb = p;
             if (curlen < max_words)
-            {                   /* find good end */
+            {
+                /*
+                 * We have room to lengthen the headline, so search forward
+                 * until it's full or we find a good stopping point.  We'll
+                 * reconsider the word at "q", then move forward.
+                 */
                 for (i = i - 1; i < prs->curwords && curlen < max_words; i++)
                 {
-                    if (i != q)
+                    if (i > q)
                     {
                         if (!NONWORDTOKEN(prs->words[i].type))
                             curlen++;
-                        if (prs->words[i].item && !prs->words[i].repeated)
+                        if (INTERESTINGWORD(i))
                             poslen++;
                     }
                     pose = i;
-                    if (NOENDTOKEN(prs->words[i].type) || prs->words[i].len <= shortword)
+                    if (BADENDPOINT(i))
                         continue;
                     if (curlen >= min_words)
                         break;
                 }
-                if (curlen < min_words && i >= prs->curwords)
-                {               /* got end of text and our cover is shorter
-                                 * than min_words */
+                if (curlen < min_words)
+                {
+                    /*
+                     * Reached end of text and our headline is still shorter
+                     * than min_words, so try to extend it to the left.
+                     */
                     for (i = p - 1; i >= 0; i--)
                     {
                         if (!NONWORDTOKEN(prs->words[i].type))
                             curlen++;
-                        if (prs->words[i].item && !prs->words[i].repeated)
+                        if (INTERESTINGWORD(i))
                             poslen++;
                         if (curlen >= max_words)
                             break;
-                        if (NOENDTOKEN(prs->words[i].type) || prs->words[i].len <= shortword)
+                        if (BADENDPOINT(i))
                             continue;
                         if (curlen >= min_words)
                             break;
@@ -2367,25 +2400,34 @@ mark_hl_words(HeadlineParsedText *prs, TSQuery query, int highlight,
                 }
             }
             else
-            {                   /* shorter cover :((( */
+            {
+                /*
+                 * Can't make headline longer, so consider making it shorter
+                 * if needed to avoid a bad endpoint.
+                 */
                 if (i > q)
                     i = q;
                 for (; curlen > min_words; i--)
                 {
                     if (!NONWORDTOKEN(prs->words[i].type))
                         curlen--;
-                    if (prs->words[i].item && !prs->words[i].repeated)
+                    if (INTERESTINGWORD(i))
                         poslen--;
                     pose = i;
-                    if (NOENDTOKEN(prs->words[i].type) || prs->words[i].len <= shortword)
-                        continue;
-                    break;
+                    if (!BADENDPOINT(i))
+                        break;
                 }
             }
-            if (bestlen < 0 || (poslen > bestlen && !(NOENDTOKEN(prs->words[pose].type) || prs->words[pose].len <= shortword)) ||
-                (bestlen >= 0 && !(NOENDTOKEN(prs->words[pose].type) || prs->words[pose].len <= shortword) &&
-                 (NOENDTOKEN(prs->words[beste].type) || prs->words[beste].len <= shortword)))
+            /*
+             * Adopt this headline if it's the first, or if it has more
+             * interesting words and isn't ending at a bad endpoint, or if it
+             * replaces a bad endpoint with a good one (XXX even if it has
+             * fewer interesting words?  Really?)
+             */
+            if (bestlen < 0 ||
+                (poslen > bestlen && !BADENDPOINT(pose)) ||
+                (!BADENDPOINT(pose) && BADENDPOINT(beste)))
             {
                 bestb = posb;
                 beste = pose;
@@ -2395,6 +2437,10 @@ mark_hl_words(HeadlineParsedText *prs, TSQuery query, int highlight,
             p++;
         }
 
+        /*
+         * If we found nothing acceptable, select min_words words starting at
+         * the beginning.
+         */
         if (bestlen < 0)
         {
             curlen = 0;
@@ -2433,7 +2479,6 @@ mark_hl_words(HeadlineParsedText *prs, TSQuery query, int highlight,
                 prs->words[i].in = (prs->words[i].repeated) ? 0 : 1;
         }
-
 }
 
 Datum
diff --git a/src/backend/tsearch/wparser_def.c b/src/backend/tsearch/wparser_def.c
index 3e9d1f8..84b6f8f9 100644
--- a/src/backend/tsearch/wparser_def.c
+++ b/src/backend/tsearch/wparser_def.c
@@ -2347,14 +2347,6 @@ mark_hl_words(HeadlineParsedText *prs, TSQuery query, int highlight,
                 pose = i;
             }
 
-            /* XXX this optimization seems unnecessary and wrong */
-            if (poslen < bestlen && !BADENDPOINT(beste))
-            {
-                /* better cover already found, so try next cover */
-                p++;
-                continue;
-            }
-
             if (curlen < max_words)
             {
                 /*
@@ -2409,13 +2401,13 @@ mark_hl_words(HeadlineParsedText *prs, TSQuery query, int highlight,
                 i = q;
                 for (; curlen > min_words; i--)
                 {
+                    if (!BADENDPOINT(i))
+                        break;
                     if (!NONWORDTOKEN(prs->words[i].type))
                         curlen--;
                     if (INTERESTINGWORD(i))
                         poslen--;
-                    pose = i;
-                    if (!BADENDPOINT(i))
-                        break;
+                    pose = i - 1;
                 }
             }
 
diff --git a/src/test/regress/expected/tsearch.out b/src/test/regress/expected/tsearch.out
index d01010f..e6fb820 100644
--- a/src/test/regress/expected/tsearch.out
+++ b/src/test/regress/expected/tsearch.out
@@ -1367,9 +1367,9 @@ to_tsquery('english', 'sea&foo'), 'HighlightAll=true');
 (1 row)
 
 SELECT ts_headline('simple', '1 2 3 1 3'::text, '1 <-> 3', 'MaxWords=2, MinWords=1');
-    ts_headline    
--------------------
- <b>1</b> <b>3</b>
+ ts_headline 
+-------------
+ <b>1</b>
 (1 row)
 
 SELECT ts_headline('simple', '1 2 3 1 3'::text, '1 & 3', 'MaxWords=4, MinWords=1');
diff --git a/src/backend/tsearch/wparser_def.c b/src/backend/tsearch/wparser_def.c
index 84b6f8f9..aa4271a 100644
--- a/src/backend/tsearch/wparser_def.c
+++ b/src/backend/tsearch/wparser_def.c
@@ -2074,9 +2074,10 @@ mark_fragment(HeadlineParsedText *prs, int highlight, int startpos, int endpos)
 #define INTERESTINGWORD(j) \
     (prs->words[j].item && !prs->words[j].repeated)
 
-/* Don't want to end at a non-word or a short word */
+/* Don't want to end at a non-word or a short word, unless interesting */
 #define BADENDPOINT(j) \
-    (NOENDTOKEN(prs->words[j].type) || prs->words[j].len <= shortword)
+    ((NOENDTOKEN(prs->words[j].type) || prs->words[j].len <= shortword) && \
+     !INTERESTINGWORD(j))
 
 typedef struct
 {
diff --git a/src/test/regress/expected/tsearch.out b/src/test/regress/expected/tsearch.out
index e6fb820..36e995b 100644
--- a/src/test/regress/expected/tsearch.out
+++ b/src/test/regress/expected/tsearch.out
@@ -1367,15 +1367,15 @@ to_tsquery('english', 'sea&foo'), 'HighlightAll=true');
 (1 row)
 
 SELECT ts_headline('simple', '1 2 3 1 3'::text, '1 <-> 3', 'MaxWords=2, MinWords=1');
- ts_headline 
--------------
- <b>1</b>
+    ts_headline    
+-------------------
+ <b>1</b> <b>3</b>
 (1 row)
 
 SELECT ts_headline('simple', '1 2 3 1 3'::text, '1 & 3', 'MaxWords=4, MinWords=1');
-         ts_headline          
-------------------------------
- <b>1</b> 2 <b>3</b> <b>1</b>
+     ts_headline     
+---------------------
+ <b>1</b> 2 <b>3</b>
 (1 row)
 
 SELECT ts_headline('simple', '1 2 3 1 3'::text, '1 <-> 3', 'MaxWords=4, MinWords=1');
diff --git a/src/backend/tsearch/wparser_def.c b/src/backend/tsearch/wparser_def.c
index aa4271a..2f84259 100644
--- a/src/backend/tsearch/wparser_def.c
+++ b/src/backend/tsearch/wparser_def.c
@@ -2319,11 +2319,12 @@ mark_hl_words(HeadlineParsedText *prs, TSQuery query, int highlight,
     int         bestb = -1,
                 beste = -1;
     int         bestlen = -1;
+    bool        bestcover = false;
     int         pose,
                 posb,
                 poslen,
                 curlen;
-
+    bool        poscover;
     int         i;
 
     if (highlight == 0)
@@ -2413,18 +2414,27 @@ mark_hl_words(HeadlineParsedText *prs, TSQuery query, int highlight,
             }
 
             /*
-             * Adopt this headline if it's the first, or if it has more
-             * interesting words and isn't ending at a bad endpoint, or if it
-             * replaces a bad endpoint with a good one (XXX even if it has
-             * fewer interesting words?  Really?)
+             * Check whether the proposed headline includes the original
+             * cover; it might not if we trimmed it due to max_words.
+             */
+            poscover = (posb <= p && pose >= q);
+
+            /*
+             * Adopt this headline if it's better than the last one, giving
+             * highest priority to headlines including the cover, then to
+             * headlines with more interesting words, then to headlines with
+             * good stopping points.  (Since bestlen is initially -1, we will
+             * certainly adopt the first headline.)
              */
-            if (bestlen < 0 ||
-                (poslen > bestlen && !BADENDPOINT(pose)) ||
-                (!BADENDPOINT(pose) && BADENDPOINT(beste)))
+            if (poscover > bestcover ||
+                (poscover == bestcover && poslen > bestlen) ||
+                (poscover == bestcover && poslen == bestlen &&
+                 !BADENDPOINT(pose) && BADENDPOINT(beste)))
             {
                 bestb = posb;
                 beste = pose;
                 bestlen = poslen;
+                bestcover = poscover;
             }
 
             p++;
diff --git a/src/backend/tsearch/wparser_def.c b/src/backend/tsearch/wparser_def.c
index 2f84259..6bf7e49 100644
--- a/src/backend/tsearch/wparser_def.c
+++ b/src/backend/tsearch/wparser_def.c
@@ -1966,75 +1966,97 @@ checkcondition_HL(void *opaque, QueryOperand *val, ExecPhraseData *data)
     return false;
 }
 
-
-static bool
-hlCover(HeadlineParsedText *prs, TSQuery query, int *p, int *q)
+/*
+ * hlFirstIndex: find first index >= pos containing any word used in query
+ *
+ * Returns -1 if no such index
+ */
+static int
+hlFirstIndex(HeadlineParsedText *prs, TSQuery query, int pos)
 {
-    int         i,
-                j;
-    QueryItem  *item = GETQUERY(query);
-    int         pos = *p;
-
-    *q = -1;
-    *p = INT_MAX;
+    int         i;
 
-    for (j = 0; j < query->size; j++)
+    /* For each word ... */
+    for (i = pos; i < prs->curwords; i++)
     {
-        if (item->type != QI_VAL)
+        /* ... scan the query to see if this word matches any operand */
+        QueryItem  *item = GETQUERY(query);
+        int         j;
+
+        for (j = 0; j < query->size; j++)
         {
+            if (item->type == QI_VAL &&
+                prs->words[i].item == &item->qoperand)
+                return i;
             item++;
-            continue;
         }
-        for (i = pos; i < prs->curwords; i++)
-        {
-            if (prs->words[i].item == &item->qoperand)
-            {
-                if (i > *q)
-                    *q = i;
-                break;
-            }
-        }
-        item++;
     }
+    return -1;
+}
 
-    if (*q < 0)
-        return false;
+/*
+ * hlCover: try to find a substring of prs' word list that satisfies query
+ *
+ * At entry, *p must be the first word index to consider (initialize this to
+ * zero, or to the next index after a previous successful search).
+ *
+ * On success, sets *p to first word index and *q to last word index of the
+ * cover substring, and returns true.
+ *
+ * The result is a minimal cover, in the sense that both *p and *q will be
+ * words used in the query.
+ */
+static bool
+hlCover(HeadlineParsedText *prs, TSQuery query, int *p, int *q)
+{
+    int         pmin,
+                pmax,
+                nextpmin,
+                nextpmax;
+    hlCheck     ch;
 
-    item = GETQUERY(query);
-    for (j = 0; j < query->size; j++)
+    /*
+     * We look for the earliest, shortest substring of prs->words that
+     * satisfies the query.  Both the pmin and pmax indices must be words
+     * appearing in the query; there's no point in trying endpoints in between
+     * such points.
+     */
+    pmin = hlFirstIndex(prs, query, *p);
+    while (pmin >= 0)
     {
-        if (item->type != QI_VAL)
+        /* This useless assignment just keeps stupider compilers quiet */
+        nextpmin = -1;
+        /* Consider substrings starting at pmin */
+        ch.words = &(prs->words[pmin]);
+        /* Consider the length-one substring first, then longer substrings */
+        pmax = pmin;
+        do
         {
-            item++;
-            continue;
-        }
-        for (i = *q; i >= pos; i--)
-        {
-            if (prs->words[i].item == &item->qoperand)
+            /* Try to match query against pmin .. pmax substring */
+            ch.len = pmax - pmin + 1;
+            if (TS_execute(GETQUERY(query), &ch,
+                           TS_EXEC_EMPTY, checkcondition_HL))
             {
-                if (i < *p)
-                    *p = i;
-                break;
+                *p = pmin;
+                *q = pmax;
+                return true;
             }
-        }
-        item++;
-    }
-
-    if (*p <= *q)
-    {
-        hlCheck     ch;
+            /* Nope, so advance pmax to next feasible endpoint */
+            nextpmax = hlFirstIndex(prs, query, pmax + 1);
 
-        ch.words = &(prs->words[*p]);
-        ch.len = *q - *p + 1;
-        if (TS_execute(GETQUERY(query), &ch, TS_EXEC_EMPTY, checkcondition_HL))
-            return true;
-        else
-        {
-            (*p)++;
-            return hlCover(prs, query, p, q);
+            /*
+             * If this is our first advance past pmin, then the result is also
+             * the next feasible value of pmin; remember it to save a
+             * redundant search.
+             */
+            if (pmax == pmin)
+                nextpmin = nextpmax;
+            pmax = nextpmax;
         }
+        while (pmax >= 0);
+        /* No luck here, so try next feasible startpoint */
+        pmin = nextpmin;
     }
-
     return false;
 }
diff --git a/src/test/regress/expected/tsearch.out b/src/test/regress/expected/tsearch.out
index 36e995b..d6e0ba6 100644
--- a/src/test/regress/expected/tsearch.out
+++ b/src/test/regress/expected/tsearch.out
@@ -1301,12 +1301,12 @@ Water, water, every where,
 Nor any drop to drink.
 S. T. Coleridge (1772-1834)
 ', phraseto_tsquery('english', 'painted Ocean'));
-           ts_headline           
----------------------------------
- <b>painted</b> <b>Ocean</b>.   +
- Water, water, every where      +
- And all the boards did shrink; +
- Water, water, every
+              ts_headline              
+---------------------------------------
+ <b>painted</b> Ship                   +
+ Upon a <b>painted</b> <b>Ocean</b>.   +
+ Water, water, every where             +
+ And all the boards did shrink
 (1 row)
 
 SELECT ts_headline('english', '
@@ -1332,9 +1332,9 @@ SELECT ts_headline('english',
 'Lorem ipsum urna. Nullam nullam ullamcorper urna.',
 to_tsquery('english','Lorem') && phraseto_tsquery('english','ullamcorper urna'),
 'MaxWords=100, MinWords=1');
- ts_headline 
--------------
- <b>Lorem</b>
+                                  ts_headline                                  
+-------------------------------------------------------------------------------
+ <b>Lorem</b> ipsum <b>urna</b>. Nullam nullam <b>ullamcorper</b> <b>urna</b>
 (1 row)
 
 SELECT ts_headline('english', '
@@ -1379,9 +1379,9 @@ SELECT ts_headline('simple', '1 2 3 1 3'::text, '1 & 3', 'MaxWords=4, MinWords=1
 (1 row)
 
 SELECT ts_headline('simple', '1 2 3 1 3'::text, '1 <-> 3', 'MaxWords=4, MinWords=1');
-    ts_headline    
--------------------
- <b>1</b> <b>3</b>
+        ts_headline         
+----------------------------
+ <b>3</b> <b>1</b> <b>3</b>
 (1 row)
 
 --Check if headline fragments work
@@ -1481,9 +1481,9 @@ SELECT ts_headline('english',
 'Lorem ipsum urna. Nullam nullam ullamcorper urna.',
 to_tsquery('english','Lorem') && phraseto_tsquery('english','ullamcorper urna'),
 'MaxFragments=100, MaxWords=100, MinWords=1');
- ts_headline 
--------------
- <b>Lorem</b>
+                                  ts_headline                                  
+-------------------------------------------------------------------------------
+ <b>Lorem</b> ipsum <b>urna</b>. Nullam nullam <b>ullamcorper</b> <b>urna</b>
 (1 row)
 
 --Rewrite sub system