Re: BUG #17691: Unexpected behaviour using ts_headline() - Mailing list pgsql-bugs
From | Tom Lane |
---|---|
Subject | Re: BUG #17691: Unexpected behaviour using ts_headline() |
Date | |
Msg-id | 2885495.1668972591@sss.pgh.pa.us Whole thread Raw |
In response to | BUG #17691: Unexpected behaviour using ts_headline() (PG Bug reporting form <noreply@postgresql.org>) |
Responses |
Re: BUG #17691: Unexpected behaviour using ts_headline()
|
List | pgsql-bugs |
PG Bug reporting form <noreply@postgresql.org> writes: > I experience unexpected behaviour when using ts_headline() in general, but > especially when changing MaxFragments. Given the data in > ts_headline_report.sql [1] Thanks for sending the sample data. After poking at this a bit, it seems the question boils down to what ts_headline should do when there simply aren't any short matches to the query. Part of the issue there is what qualifies as "short". The definition I installed in 78e73e875 was + /* + * We might eventually make max_cover a user-settable parameter, but for + * now, just compute a reasonable value based on max_words and + * max_fragments. + */ + max_cover = Max(max_words * 10, 100); + if (max_fragments > 0) + max_cover *= max_fragments; Looking at this again, I think the dependency on max_fragments is just wrong. That is what is causing your different results for different settings of MaxFragments: larger values allow recognition of longer candidate covers ("cover" being jargon in this code for "a substring satisfying the query"). But there's no good reason for that. I'd misread the logic in mark_hl_fragments, which "breaks the cover into smaller fragments such that each fragment has at most max_words", to think that it needed to see longer covers that it could divide into fragments. But actually max_fragments limits the number of fragments it will return across all covers, and there doesn't seem to be a reason to connect that parameter to the acceptable length of any particular cover. So dropping those two lines is enough to make the strange behavior across differing MaxFragments go away. The other issue is what to do after finding there's no match of max_cover or fewer tokens. As the code now stands, it'll just give up and return the initial bit of the text, likely with no query words. That's not great. I propose that we look for any match and return (highlight) just its first word. > SELECT id, > ts_headline('english', "texts"."fulltext", to_tsquery('english', 'amazon & > world'), 'StartSel=<<, StopSel=>>') AS "preview" > FROM texts; > id=1: Highlight word is the first one in the result. Expectation: highlight > word is somewhere in the middle. Meh --- I don't agree with that expectation. The fragment-based code does act that way, but the non-fragment path never has, so I think we'd make more people unhappy than happy if we change its behavior. > id=2: No highlight word at all. Right, because there are no short matches to "amazon & world" anywhere. With the attached proposed patch, we'll highlight whichever one of those words appears first. > id=3: Highlight words are the first and last one in the result. Not ideal > but ok-ish. As I said, that's how non-fragment highlighting has always worked. It will extend the quoted text, but not beyond min_words, which is a bit different from fragment highlighting. > SELECT id, > ts_headline('english', "texts"."fulltext", to_tsquery('english', 'amazon & > world'), 'MaxFragments=3, StartSel=<<, StopSel=>>') AS "preview" > FROM texts; > id=1: Wrong number of fragments (2) with highlight words are returned. I don't agree with this expectation either. There's only one reasonably sane match to 'amazon & world' in that text, and it's near the beginning. As the code stands, increasing max_fragments eventually allows it to find some less-sane matches, but that doesn't seem like an improvement. In short, I suggest something about like the attached. For your id=2 example, which has only very far-apart occurrences of 'amazon' and 'world', this produces <<world>> where fast runtimes make or break the popularity of research fields, this oversight has effectively without fragmenting, and researchers make the effort of understanding the inner workings of these convenient black-boxes. In a <<world>> where fast runtimes make or break the popularity of research fields, this oversight has effectively surrendered most with fragmenting, which is okay by me given their historically different behaviors. Otherwise, it gets rid of the strange changes in matching behavior for different MaxFragment values. regards, tom lane diff --git a/src/backend/tsearch/wparser_def.c b/src/backend/tsearch/wparser_def.c index 826027844e..ea73a60852 100644 --- a/src/backend/tsearch/wparser_def.c +++ b/src/backend/tsearch/wparser_def.c @@ -2085,6 +2085,33 @@ hlCover(HeadlineParsedText *prs, TSQuery query, int max_cover, /* No luck here, so try next feasible startpoint */ pmin = nextpmin; } + + /* + * If there is no max_cover-or-less substring anywhere that satisfies the + * query, throw up our hands and return a single-word "match" that is just + * the first word appearing in the first query match, if any. (We need to + * verify a match to avoid possibly returning a word that is negated in + * the query.) + */ + if (*p == 0) + { + pmin = hlFirstIndex(prs, *p); + while (pmin >= 0) + { + /* Try to match query against pmin .. end of text */ + ch.words = &(prs->words[pmin]); + ch.len = prs->curwords - pmin; + if (TS_execute(GETQUERY(query), &ch, + TS_EXEC_EMPTY, checkcondition_HL)) + { + *p = *q = pmin; + return true; + } + /* Nope, so advance pmin to next feasible start point */ + pmin = hlFirstIndex(prs, pmin + 1); + } + } + return false; } @@ -2581,12 +2608,11 @@ prsd_headline(PG_FUNCTION_ARGS) /* * We might eventually make max_cover a user-settable parameter, but for - * now, just compute a reasonable value based on max_words and - * max_fragments. + * now, just compute a reasonable value based on max_words. Note that it + * has to be several times more than max_words, because max_cover needs to + * allow for non-word tokens appearing between the word tokens. */ max_cover = Max(max_words * 10, 100); - if (max_fragments > 0) - max_cover *= max_fragments; /* in HighlightAll mode these parameters are ignored */ if (!highlightall)
pgsql-bugs by date: