Thread: ts_headline and query with hyphen

From
daniel
Date:
Hi

I have a question about ts_headline: when the query includes a word like
'on-line', only the 'line' part is highlighted, even though the whole
compound is indexed too. Some details below.

PostgreSQL 9.1.6

select
token, dictionary, lexemes
from
ts_debug('play on-line') where alias <> 'blank';

   token  |  dictionary  | lexemes
---------+--------------+----------
  play    | english_stem | {play}
  on-line | english_stem | {on-lin}
  on      | english_stem | {}
  line    | english_stem | {line}


select to_tsquery('play & on-line');
          to_tsquery
----------------------------
  'play' & 'on-lin' & 'line'


select ts_headline('play on-line', to_tsquery('play & on-line'));

         ts_headline
----------------------------
  <b>play</b> on-<b>line</b>

Same as

select ts_headline('play on-line', to_tsquery('play & line'));
         ts_headline
----------------------------
  <b>play</b> on-<b>line</b>

Is that the intended behaviour? I guess the problem here is that 'on' is
not a lexeme, but then what about 'on-lin'?

In another example, I thought that a hyphenated match would have some
kind of preference:

select token, dictionary, lexemes from ts_debug('custom-built query')
where alias <> 'blank';
     token     |  dictionary  |    lexemes
--------------+--------------+----------------
  custom-built | english_stem | {custom-built}
  custom       | english_stem | {custom}
  built        | english_stem | {built}
  query        | english_stem | {queri}


select to_tsquery('query & custom-built');
                   to_tsquery
-----------------------------------------------
  'queri' & 'custom-built' & 'custom' & 'built'


select ts_headline('custom-built query', to_tsquery('query & custom-built'));
                ts_headline
-----------------------------------------
  <b>custom</b>-<b>built</b> <b>query</b>


This works better, but still both parts of 'custom-built' are
highlighted separately. But maybe ts_headline understands or operates on
single, not hyphenated words only?

thanks
daniel



Re: ts_headline and query with hyphen

From
Tom Lane
Date:
daniel <dochtorek@gmail.com> writes:
> I have a question about ts_headline, when the query includes word like
> 'on-line' - only the 'line' part is highlighted, even though the whole
> phrase is indexed too, some details below.

Part of the reason is that "on" is a stop word (at least in the default
english dictionary).  That's why you get

> select to_tsquery('play & on-line');
>           to_tsquery
> ----------------------------
>   'play' & 'on-lin' & 'line'

and not "'play' & 'on-lin' & 'on' & 'line'".  If you did get the latter
then you'd get a headline result with both parts highlighted, similar to
your "custom-built" case.
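
The stop-word explanation can be checked directly. The following is a
minimal sketch, assuming the stock english_stem dictionary and the
built-in 'simple' configuration, which ships without a stop-word list.
The exact tsquery text depends on the server version, so no output is
shown.

```sql
-- A stop word comes back as an empty lexeme array (not NULL).
select ts_lexize('english_stem', 'on');

-- A regular word comes back as its stem.
select ts_lexize('english_stem', 'line');

-- Assumption: with the 'simple' configuration nothing is discarded as a
-- stop word, so the 'on' part should survive into the query.
select to_tsquery('simple', 'play & on-line');
```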

> But maybe ts_headline understands or operates on
> single, not hyphenated words only?

Dunno.  It would seem reasonable to highlight the whole compound in
these cases, but I have no idea how hard that is.

Another thing that seems a bit odd here is that we seem to be stemming
the compound word as a whole, but not the individual parts.  Not sure
how sane that combination of choices is ...

            regards, tom lane


Re: ts_headline and query with hyphen

From
daniel
Date:
On 12/05/2012 04:49 AM, Tom Lane wrote:
> daniel <dochtorek@gmail.com> writes:
>> I have a question about ts_headline, when the query includes word like
>> 'on-line' - only the 'line' part is highlighted, even though the whole
>> phrase is indexed too, some details below.
>
> Part of the reason is that "on" is a stop word (at least in the default
> english dictionary).  That's why you get
>
>> select to_tsquery('play & on-line');
>>            to_tsquery
>> ----------------------------
>>    'play' & 'on-lin' & 'line'
>
> and not "'play' & 'on-lin' & 'on' & 'line'".  If you did get the latter
> then you'd get a headline result with both parts highlighted, similar to
> your "custom-built" case.
>

I understand the 'on' part, but still, 'on-lin' is passed to
ts_headline, so I thought that match would be preferred over 'line' and
highlighted as a whole.

Additionally, with a specific value of MaxWords I could see a dangling
"line" at the start of a headline ("on-" has been cut off), which is
kinda troubling, because it's not even an English document. It doesn't
seem to happen to queries like 'custom-built' - I can't see it being
split either at the beginning of a headline or at the end.

Just to be clear - the headline with cut off "on-" is OK (having the
matched stuff somewhere in the middle, though with highlighted 'line'
only), it's just that the word 'on-line' is used multiple times in the
doc and it happened to appear at the beginning of a headline. Cutting
was not affected by the ShortWord setting, so I guess it's a stopword
thing again. If that's the case, then IMHO it should treat a hyphenated
word as one word when creating the headline and not cut it off like
that. But maybe it was intended to work like that..

>> But maybe ts_headline understands or operates on
>> single, not hyphenated words only?
>
> Dunno.  It would seem reasonable to highlight the whole compound in
> these cases, but I have no idea how hard that is.
>

Right, although that latter case is easy to fix outside postgres and
still looks fine - I've included it just as an example. The former
causes a few problems in specific cases, and I have to fix them manually
now, word by word.
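
For illustration, both cases can also be patched up with regexp_replace
on the ts_headline result. This is only a sketch under assumptions: it
assumes the default <b> / </b> markers (StartSel/StopSel unchanged) and
that a compound is joined by a single hyphen. The second pattern is a
guess at what "fixing word by word" amounts to, not a confirmed recipe.

```sql
-- Case 1: both parts highlighted separately; merge them across the hyphen.
select regexp_replace(
         '<b>custom</b>-<b>built</b> <b>query</b>',
         '</b>-<b>',          -- closing tag, hyphen, opening tag
         '-',
         'g');
-- gives: <b>custom-built</b> <b>query</b>

-- Case 2: only the second part highlighted; pull the unhighlighted
-- first part (word characters before the hyphen) inside the tags.
select regexp_replace(
         '<b>play</b> on-<b>line</b>',
         '(\w+)-<b>([^<]*)</b>',
         '<b>\1-\2</b>',
         'g');
-- gives: <b>play</b> <b>on-line</b>
```

In practice the string literal would be replaced by the ts_headline(...)
call itself. Neither pattern helps when the first part has already been
cut off at the start of the headline.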

> Another thing that seems a bit odd here is that we seem to be stemming
> the compound word as a whole, but not the individual parts.  Not sure
> how sane that combination of choices is ...
>

Good question, hope others will jump in.

thanks,
daniel



Re: ts_headline and query with hyphen

From
daniel
Date:
As a follow-up to my previous comment, here is an example of the cutting:

select ts_headline('game played on-line', to_tsquery('on-line & game'),
'MaxWords=3,MinWords=2,ShortWord=1');

       ts_headline
-----------------------
  <b>game</b> played on


that can't be right...

daniel