Re: Rethinking the implementation of ts_headline() - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: Rethinking the implementation of ts_headline() |
Date | |
Msg-id | 3951261.1674082546@sss.pgh.pa.us Whole thread Raw |
In response to | Re: Rethinking the implementation of ts_headline() (Alvaro Herrera <alvherre@alvh.no-ip.org>) |
Responses |
Re: Rethinking the implementation of ts_headline()
|
List | pgsql-hackers |
Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > I tried this other test, based on looking at the new regression tests > you added, > SELECT ts_headline('english', ' > Day after day, day after day, > We stuck, nor breath nor motion, > As idle as a painted Ship > Upon a painted Ocean. > Water, water, every where > And all the boards did shrink; > Water, water, every where, > Nor any drop to drink. > S. T. Coleridge (1772-1834) > ', to_tsquery('english', '(day & drink) | (idle & painted)'), 'MaxFragments=5, MaxWords=9, MinWords=4'); > ts_headline > ───────────────────────────────────────── > motion, ↵ > As <b>idle</b> as a <b>painted</b> Ship↵ > Upon > (1 fila) > and was surprised that the match for the 'day & drink' arm of the OR > disappears from the reported headline. I'd argue that that's exactly what should happen. It's supposed to find as-short-as-possible cover strings that satisfy the query. In this case, satisfying the 'day & drink' condition would require nearly the entire input text, whereas 'idle & painted' can be satisfied in just the third line. So what you get is the third line, slightly expanded because of some later rules that like to add context if the cover is shorter than MaxWords. I don't find anything particularly principled about the old behavior: > <b>Day</b> after <b>day</b>, <b>day</b> after <b>day</b>,↵ > We stuck ... motion, ↵ > As <b>idle</b> as a <b>painted</b> Ship ↵ > Upon It's including hits for "day" into the cover despite the lack of any nearby match to "drink". > Another thing I think might be a regression is the way fragments are > selected. Consider what happens if I change the "idle & painted" in the > earlier query to "idle <-> painted", and MaxWords is kept low: Of course, "idle <-> painted" is satisfied nowhere in this text (the words are there, but not adjacent). So the cover has to run from the last 'day' to the 'drink'. I think v15 is deciding that it runs from the first 'day' to the 'drink', which while not exactly wrong is not the shortest cover. The rest of this is just minor variations in what mark_hl_fragments() decides to do based on the precise cover string it's given. I don't dispute that mark_hl_fragments() could be improved, but this patch doesn't touch its algorithm and I think that doing so should be material for a different patch. (I have no immediate ideas about what would be a better algorithm for it, anyway.) > (Both 15 and master highlight 'painted' in the "Upon a painted Ocean" > verse, which perhaps they shouldn't do, since it's not preceded by > 'idle'.) Yeah, and 'idle' too. Once it's chosen a string to show, it'll highlight all the query words within that string, whether they constitute part of the match or not. I can see arguments on both sides of doing it that way; it was probably more sensible before we had phrase match than it is now. But again, changing that phase of the processing is outside the scope of this patch. I'm just trying to undo the damage I did to the cover-string-selection phase. regards, tom lane
pgsql-hackers by date: