Thread: BUG #17691: Unexpected behaviour using ts_headline()
The following bug has been logged on the website: Bug reference: 17691 Logged by: Sebastian Patino-Lang Email address: sebastian.patino-lang@posteo.net PostgreSQL version: 13.9 Operating system: x86_64-apple-darwin19.6.0 Description: I experience unexpected behaviour when using ts_headline() in general, but especially when changing MaxFragments. Given the data in ts_headline_report.sql [1] SELECT id, ts_headline('english', "texts"."fulltext", to_tsquery('english', 'amazon & world'), 'StartSel=<<, StopSel=>>') AS "preview" FROM texts; id=1: Highlight word is the first one in the result. Expectation: highlight word is somewhere in the middle. id=2: No highlight word at all. id=3: Highlight words are the first and last one in the result. Not ideal but ok-ish. SELECT id, ts_headline('english', "texts"."fulltext", to_tsquery('english', 'amazon & world'), 'MaxFragments=1, StartSel=<<, StopSel=>>') AS "preview" FROM texts; id=1: Highlight word is now in the middle of the result. This is ok. id=2: No highlight word at all. id=3: Highlight words are in the middle part of the result. This is ok. SELECT id, ts_headline('english', "texts"."fulltext", to_tsquery('english', 'amazon & world'), 'MaxFragments=2, StartSel=<<, StopSel=>>') AS "preview" FROM texts; id=1: Correct number of fragments (2) with highlight words are returned. id=2: No highlight word at all. id=3: Correct number of fragments (2) with highlight words are returned. SELECT id, ts_headline('english', "texts"."fulltext", to_tsquery('english', 'amazon & world'), 'MaxFragments=3, StartSel=<<, StopSel=>>') AS "preview" FROM texts; id=1: Wrong number of fragments (2) with highlight words are returned. id=2: No highlight word at all. id=3: Correct number of fragments (3) with highlight words are returned. SELECT id, ts_headline('english', "texts"."fulltext", to_tsquery('english', 'amazon & world'), 'MaxFragments=4, StartSel=<<, StopSel=>>') AS "preview" FROM texts; id=1: Wrong number of fragments (2) with highlight words are returned. id=2: No highlight word at all. id=3: Correct number of fragments (4) with highlight words are returned. ... and so on. Until MaxFragments=6 where for id=1 suddenly more fragments (4) get returned. SELECT id, ts_headline('english', "texts"."fulltext", to_tsquery('english', 'amazon & world'), 'MaxFragments=4, StartSel=<<, StopSel=>>') AS "preview" FROM texts; id=1: Wrong number of fragments (4) with highlight words are returned. id=2: No highlight word at all. id=3: Correct number of fragments (6) with highlight words are returned. ... and so on. Until MaxFragments=11 where for id=1 the number of returned fragments changes again (5) SELECT id, ts_headline('english', "texts"."fulltext", to_tsquery('english', 'amazon & world'), 'MaxFragments=11, StartSel=<<, StopSel=>>') AS "preview" FROM texts; id=1: Wrong number of fragments (5) with highlight words are returned. id=2: No highlight word at all. id=3: Correct number of fragments (11) with highlight words are returned. ... and so on. Until MaxFragments=14 where for id=2 suddenly more fragments (2) get returned. SELECT id, ts_headline('english', "texts"."fulltext", to_tsquery('english', 'amazon & world'), 'MaxFragments=14, StartSel=<<, StopSel=>>') AS "preview" FROM texts; id=1: Wrong number of fragments (5) with highlight words are returned. id=2: Wrong number of fragments (2) with highlight words are returned. id=3: Correct number of fragments (11) with highlight words are returned. I stopped testing here, but im sure the strange behaviour and jumps in fragment count will continue. Any ideas? [1] https://1drv.ms/u/s!AqRGi9iGBWddgZVU_M2iuoRdTzM6tg?e=0OHHnB
PG Bug reporting form <noreply@postgresql.org> writes: > I experience unexpected behaviour when using ts_headline() in general, but > especially when changing MaxFragments. Given the data in > ts_headline_report.sql [1] That URL doesn't work for me, and in any case we dislike non-self-contained bug reports for archival reasons. Could you provide the sample data as an attachment in a reply to pgsql-bugs? regards, tom lane
Re: BUG #17691: Unexpected behaviour using ts_headline()
From
sebastian.patino-lang@posteo.net
Date:
Hi all,
Tom Lane made me aware that the link is not working and its better anyway to include all data in the bug report directly. Please find the file attached.
@Tom: thanks for the hint!
Regards,
Sebastian
Tom Lane made me aware that the link is not working and its better anyway to include all data in the bug report directly. Please find the file attached.
@Tom: thanks for the hint!
Regards,
Sebastian
On 19. Nov 2022, 14:05 +0100, PG Bug reporting form <noreply@postgresql.org>, wrote:
The following bug has been logged on the website:
Bug reference: 17691
Logged by: Sebastian Patino-Lang
Email address: sebastian.patino-lang@posteo.net
PostgreSQL version: 13.9
Operating system: x86_64-apple-darwin19.6.0
Description:
I experience unexpected behaviour when using ts_headline() in general, but
especially when changing MaxFragments. Given the data in
ts_headline_report.sql [1]
SELECT id,
ts_headline('english', "texts"."fulltext", to_tsquery('english', 'amazon &
world'), 'StartSel=<<, StopSel=>>') AS "preview"
FROM texts;
id=1: Highlight word is the first one in the result. Expectation: highlight
word is somewhere in the middle.
id=2: No highlight word at all.
id=3: Highlight words are the first and last one in the result. Not ideal
but ok-ish.
SELECT id,
ts_headline('english', "texts"."fulltext", to_tsquery('english', 'amazon &
world'), 'MaxFragments=1, StartSel=<<, StopSel=>>') AS "preview"
FROM texts;
id=1: Highlight word is now in the middle of the result. This is ok.
id=2: No highlight word at all.
id=3: Highlight words are in the middle part of the result. This is ok.
SELECT id,
ts_headline('english', "texts"."fulltext", to_tsquery('english', 'amazon &
world'), 'MaxFragments=2, StartSel=<<, StopSel=>>') AS "preview"
FROM texts;
id=1: Correct number of fragments (2) with highlight words are returned.
id=2: No highlight word at all.
id=3: Correct number of fragments (2) with highlight words are returned.
SELECT id,
ts_headline('english', "texts"."fulltext", to_tsquery('english', 'amazon &
world'), 'MaxFragments=3, StartSel=<<, StopSel=>>') AS "preview"
FROM texts;
id=1: Wrong number of fragments (2) with highlight words are returned.
id=2: No highlight word at all.
id=3: Correct number of fragments (3) with highlight words are returned.
SELECT id,
ts_headline('english', "texts"."fulltext", to_tsquery('english', 'amazon &
world'), 'MaxFragments=4, StartSel=<<, StopSel=>>') AS "preview"
FROM texts;
id=1: Wrong number of fragments (2) with highlight words are returned.
id=2: No highlight word at all.
id=3: Correct number of fragments (4) with highlight words are returned.
... and so on. Until MaxFragments=6 where for id=1 suddenly more fragments
(4) get returned.
SELECT id,
ts_headline('english', "texts"."fulltext", to_tsquery('english', 'amazon &
world'), 'MaxFragments=4, StartSel=<<, StopSel=>>') AS "preview"
FROM texts;
id=1: Wrong number of fragments (4) with highlight words are returned.
id=2: No highlight word at all.
id=3: Correct number of fragments (6) with highlight words are returned.
... and so on. Until MaxFragments=11 where for id=1 the number of returned
fragments changes again (5)
SELECT id,
ts_headline('english', "texts"."fulltext", to_tsquery('english', 'amazon &
world'), 'MaxFragments=11, StartSel=<<, StopSel=>>') AS "preview"
FROM texts;
id=1: Wrong number of fragments (5) with highlight words are returned.
id=2: No highlight word at all.
id=3: Correct number of fragments (11) with highlight words are returned.
... and so on. Until MaxFragments=14 where for id=2 suddenly more fragments
(2) get returned.
SELECT id,
ts_headline('english', "texts"."fulltext", to_tsquery('english', 'amazon &
world'), 'MaxFragments=14, StartSel=<<, StopSel=>>') AS "preview"
FROM texts;
id=1: Wrong number of fragments (5) with highlight words are returned.
id=2: Wrong number of fragments (2) with highlight words are returned.
id=3: Correct number of fragments (11) with highlight words are returned.
I stopped testing here, but im sure the strange behaviour and jumps in
fragment count will continue.
Any ideas?
[1] https://1drv.ms/u/s!AqRGi9iGBWddgZVU_M2iuoRdTzM6tg?e=0OHHnB
Attachment
PG Bug reporting form <noreply@postgresql.org> writes: > I experience unexpected behaviour when using ts_headline() in general, but > especially when changing MaxFragments. Given the data in > ts_headline_report.sql [1] Thanks for sending the sample data. After poking at this a bit, it seems the question boils down to what ts_headline should do when there simply aren't any short matches to the query. Part of the issue there is what qualifies as "short". The definition I installed in 78e73e875 was + /* + * We might eventually make max_cover a user-settable parameter, but for + * now, just compute a reasonable value based on max_words and + * max_fragments. + */ + max_cover = Max(max_words * 10, 100); + if (max_fragments > 0) + max_cover *= max_fragments; Looking at this again, I think the dependency on max_fragments is just wrong. That is what is causing your different results for different settings of MaxFragments: larger values allow recognition of longer candidate covers ("cover" being jargon in this code for "a substring satisfying the query"). But there's no good reason for that. I'd misread the logic in mark_hl_fragments, which "breaks the cover into smaller fragments such that each fragment has at most max_words", to think that it needed to see longer covers that it could divide into fragments. But actually max_fragments limits the number of fragments it will return across all covers, and there doesn't seem to be a reason to connect that parameter to the acceptable length of any particular cover. So dropping those two lines is enough to make the strange behavior across differing MaxFragments go away. The other issue is what to do after finding there's no match of max_cover or fewer tokens. As the code now stands, it'll just give up and return the initial bit of the text, likely with no query words. That's not great. I propose that we look for any match and return (highlight) just its first word. > SELECT id, > ts_headline('english', "texts"."fulltext", to_tsquery('english', 'amazon & > world'), 'StartSel=<<, StopSel=>>') AS "preview" > FROM texts; > id=1: Highlight word is the first one in the result. Expectation: highlight > word is somewhere in the middle. Meh --- I don't agree with that expectation. The fragment-based code does act that way, but the non-fragment path never has, so I think we'd make more people unhappy than happy if we change its behavior. > id=2: No highlight word at all. Right, because there are no short matches to "amazon & world" anywhere. With the attached proposed patch, we'll highlight whichever one of those words appears first. > id=3: Highlight words are the first and last one in the result. Not ideal > but ok-ish. As I said, that's how non-fragment highlighting has always worked. It will extend the quoted text, but not beyond min_words, which is a bit different from fragment highlighting. > SELECT id, > ts_headline('english', "texts"."fulltext", to_tsquery('english', 'amazon & > world'), 'MaxFragments=3, StartSel=<<, StopSel=>>') AS "preview" > FROM texts; > id=1: Wrong number of fragments (2) with highlight words are returned. I don't agree with this expectation either. There's only one reasonably sane match to 'amazon & world' in that text, and it's near the beginning. As the code stands, increasing max_fragments eventually allows it to find some less-sane matches, but that doesn't seem like an improvement. In short, I suggest something about like the attached. For your id=2 example, which has only very far-apart occurrences of 'amazon' and 'world', this produces <<world>> where fast runtimes make or break the popularity of research fields, this oversight has effectively without fragmenting, and researchers make the effort of understanding the inner workings of these convenient black-boxes. In a <<world>> where fast runtimes make or break the popularity of research fields, this oversight has effectively surrendered most with fragmenting, which is okay by me given their historically different behaviors. Otherwise, it gets rid of the strange changes in matching behavior for different MaxFragment values. regards, tom lane diff --git a/src/backend/tsearch/wparser_def.c b/src/backend/tsearch/wparser_def.c index 826027844e..ea73a60852 100644 --- a/src/backend/tsearch/wparser_def.c +++ b/src/backend/tsearch/wparser_def.c @@ -2085,6 +2085,33 @@ hlCover(HeadlineParsedText *prs, TSQuery query, int max_cover, /* No luck here, so try next feasible startpoint */ pmin = nextpmin; } + + /* + * If there is no max_cover-or-less substring anywhere that satisfies the + * query, throw up our hands and return a single-word "match" that is just + * the first word appearing in the first query match, if any. (We need to + * verify a match to avoid possibly returning a word that is negated in + * the query.) + */ + if (*p == 0) + { + pmin = hlFirstIndex(prs, *p); + while (pmin >= 0) + { + /* Try to match query against pmin .. end of text */ + ch.words = &(prs->words[pmin]); + ch.len = prs->curwords - pmin; + if (TS_execute(GETQUERY(query), &ch, + TS_EXEC_EMPTY, checkcondition_HL)) + { + *p = *q = pmin; + return true; + } + /* Nope, so advance pmin to next feasible start point */ + pmin = hlFirstIndex(prs, pmin + 1); + } + } + return false; } @@ -2581,12 +2608,11 @@ prsd_headline(PG_FUNCTION_ARGS) /* * We might eventually make max_cover a user-settable parameter, but for - * now, just compute a reasonable value based on max_words and - * max_fragments. + * now, just compute a reasonable value based on max_words. Note that it + * has to be several times more than max_words, because max_cover needs to + * allow for non-word tokens appearing between the word tokens. */ max_cover = Max(max_words * 10, 100); - if (max_fragments > 0) - max_cover *= max_fragments; /* in HighlightAll mode these parameters are ignored */ if (!highlightall)
Re: BUG #17691: Unexpected behaviour using ts_headline()
From
sebastian.patino-lang@posteo.net
Date:
Hi Tom,
thanks for the in-depth explanation. Makes a bit more sense now. Not sure what to do with the attached patch though. Should I use that to compile my own version of PG? Since I use a managed version thats not an option currently.
Regards
thanks for the in-depth explanation. Makes a bit more sense now. Not sure what to do with the attached patch though. Should I use that to compile my own version of PG? Since I use a managed version thats not an option currently.
Regards
Sebastian
On 20. Nov 2022, 20:30 +0100, Tom Lane <tgl@sss.pgh.pa.us>, wrote:
PG Bug reporting form <noreply@postgresql.org> writes:I experience unexpected behaviour when using ts_headline() in general, but
especially when changing MaxFragments. Given the data in
ts_headline_report.sql [1]
Thanks for sending the sample data. After poking at this a bit, it
seems the question boils down to what ts_headline should do when there
simply aren't any short matches to the query. Part of the issue there
is what qualifies as "short". The definition I installed in 78e73e875
was
+ /*
+ * We might eventually make max_cover a user-settable parameter, but for
+ * now, just compute a reasonable value based on max_words and
+ * max_fragments.
+ */
+ max_cover = Max(max_words * 10, 100);
+ if (max_fragments > 0)
+ max_cover *= max_fragments;
Looking at this again, I think the dependency on max_fragments is just
wrong. That is what is causing your different results for different
settings of MaxFragments: larger values allow recognition of longer
candidate covers ("cover" being jargon in this code for "a substring
satisfying the query"). But there's no good reason for that. I'd misread
the logic in mark_hl_fragments, which "breaks the cover into smaller
fragments such that each fragment has at most max_words", to think that
it needed to see longer covers that it could divide into fragments.
But actually max_fragments limits the number of fragments it will return
across all covers, and there doesn't seem to be a reason to connect that
parameter to the acceptable length of any particular cover. So dropping
those two lines is enough to make the strange behavior across differing
MaxFragments go away.
The other issue is what to do after finding there's no match of max_cover
or fewer tokens. As the code now stands, it'll just give up and return
the initial bit of the text, likely with no query words. That's not
great. I propose that we look for any match and return (highlight) just
its first word.SELECT id,
ts_headline('english', "texts"."fulltext", to_tsquery('english', 'amazon &
world'), 'StartSel=<<, StopSel=>>') AS "preview"
FROM texts;id=1: Highlight word is the first one in the result. Expectation: highlight
word is somewhere in the middle.
Meh --- I don't agree with that expectation. The fragment-based code does
act that way, but the non-fragment path never has, so I think we'd make
more people unhappy than happy if we change its behavior.id=2: No highlight word at all.
Right, because there are no short matches to "amazon & world" anywhere.
With the attached proposed patch, we'll highlight whichever one of those
words appears first.id=3: Highlight words are the first and last one in the result. Not ideal
but ok-ish.
As I said, that's how non-fragment highlighting has always worked. It
will extend the quoted text, but not beyond min_words, which is a bit
different from fragment highlighting.SELECT id,
ts_headline('english', "texts"."fulltext", to_tsquery('english', 'amazon &
world'), 'MaxFragments=3, StartSel=<<, StopSel=>>') AS "preview"
FROM texts;id=1: Wrong number of fragments (2) with highlight words are returned.
I don't agree with this expectation either. There's only one reasonably
sane match to 'amazon & world' in that text, and it's near the beginning.
As the code stands, increasing max_fragments eventually allows it to find
some less-sane matches, but that doesn't seem like an improvement.
In short, I suggest something about like the attached. For your id=2
example, which has only very far-apart occurrences of 'amazon' and
'world', this produces
<<world>> where fast runtimes make or break the popularity of research
fields, this oversight has effectively
without fragmenting, and
researchers make the effort of understanding the inner workings of
these convenient black-boxes. In a <<world>> where fast runtimes make
or break the popularity of research fields, this oversight has
effectively surrendered most
with fragmenting, which is okay by me given their historically different
behaviors. Otherwise, it gets rid of the strange changes in matching
behavior for different MaxFragment values.
regards, tom lane
For the archives' sake: I've moved this discussion to a new thread on pgsql-hackers, https://www.postgresql.org/message-id/flat/840.1669405935%40sss.pgh.pa.us regards, tom lane