Re: BUG #17556: ts_headline does not correctly find matches when separated by 4,999 words - Mailing list pgsql-bugs

From Alex Malek
Subject Re: BUG #17556: ts_headline does not correctly find matches when separated by 4,999 words
Date
Msg-id CAGH8cceNS=J3OJMv9y_D009hnFhZtU4YbBwp3OxYhn8TA=i0VQ@mail.gmail.com
In response to Re: BUG #17556: ts_headline does not correctly find matches when separated by 4,999 words  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
List pgsql-bugs
On Sun, Jul 24, 2022 at 10:36 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
At Fri, 22 Jul 2022 14:06:43 +0000, PG Bug reporting form <noreply@postgresql.org> wrote in
> The following bug has been logged on the website:
>
> Bug reference:      17556
> Logged by:          Alex Malek
> Email address:      magicagent@gmail.com
> PostgreSQL version: 14.4
> Operating system:   Red Hat
> Description:       
>
> Correct results when 4,998 words separate search terms:
>
> # select ts_headline('baz baz baz ipsum ' || repeat(' foo ',4998) || ' labor',
>            $$'ipsum' & 'labor'$$::tsquery, 'StartSel=>, StopSel=<, MaxFragments=100, MaxWords=7, MinWords=3') ;
>      ts_headline
> ---------------------
>  >ipsum< ... >labor<
> (1 row)
>
> Add one more word between terms being searched for, to total 4,999, and
> terms are not found:
>
> # select ts_headline('baz baz baz ipsum ' || repeat(' foo ',4999) || ' labor',
>            $$'ipsum' & 'labor'$$::tsquery, 'StartSel=>, StopSel=<, MaxFragments=100, MaxWords=7, MinWords=3') ;
>  ts_headline
> -------------
>  baz baz baz
> (1 row)

When ts_headline searches the document, it splits the document into
segments whose length is called max_cover internally, which is not
configurable for now [1].  In the latter case above, it is
MaxFragments * (max(MaxWords * 10, 100)) = 10000 "words", where
runs of whitespace are counted as words. The document has 10007 "words",
where 'ipsum' is the 7th word and 'labor' is the 10007th word. The two
words aren't within a 10000-word segment, so the match is missed. ts_headline
instead returns the first MinWords words, as you see.
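
The arithmetic above can be checked from psql. This is only a sketch: the
first query restates the max_cover formula quoted above with the option
values from the report, and the second counts parser tokens with ts_parse,
assuming the 'default' parser is the one in use.

```sql
-- max_cover = MaxFragments * max(MaxWords * 10, 100)
-- with MaxFragments=100 and MaxWords=7, as in the report
select 100 * greatest(7 * 10, 100) as max_cover;   -- 10000

-- Count every token the parser emits, blanks included, for the failing document.
-- Assumption: the 'default' parser is the one the text search configuration uses.
select count(*) as tokens
  from ts_parse('default',
                'baz baz baz ipsum ' || repeat(' foo ', 4999) || ' labor');
```

If the token count exceeds max_cover, the two terms cannot fall inside a
single segment.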

This is not a bug, but designed behavior.  However, we might want to
document that behavior.

This could be "improved" as in [1], but in this specific case, I doubt
the usefulness of ts_headline picking up the match when the two words are
that far apart from each other, in exchange for possible performance degradation.


[1] For developers, wparser_def.c:2582
>        * We might eventually make max_cover a user-settable parameter, but for
>        * now, just compute a reasonable value based on max_words and
>        * max_fragments.


Since the expected output is produced for much larger documents when OR ('|') replaces AND ('&'),
what if the code, when no match is found, tries again with such a replacement?
Alternatively, since the "highlighting" of terms is the same for '|' vs '&', maybe always do the replacement?

Note: I have no idea how the parsing, max_cover, etc. actually work. I am suggesting "high level" ideas
that I realize may or may not make sense for that code base.


Correct highlighting for 100,000+ "words" using OR ('|'):

# select ts_headline('baz baz baz ipsum ' || repeat(' foo ',100000) || ' labor',
           $$'ipsum' | 'labor'$$::tsquery, 'StartSel=>, StopSel=<, MaxFragments=100, MaxWords=7, MinWords=3') ;
     ts_headline
---------------------
 >ipsum< ... >labor<
(1 row)
 

Highlighting is the same for OR vs AND:

# select ts_headline('baz baz baz ipsum labor foo foo foo', $$'ipsum' & 'labor'$$::tsquery, 'StartSel=>, StopSel=<');
               ts_headline
-----------------------------------------
 baz baz baz >ipsum< >labor< foo foo foo
(1 row)

# select ts_headline('baz baz baz ipsum labor foo foo foo', $$'ipsum' | 'labor'$$::tsquery, 'StartSel=>, StopSel=<');
               ts_headline
-----------------------------------------
 baz baz baz >ipsum< >labor< foo foo foo
(1 row)
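
Until something like this exists in ts_headline itself, the replacement idea
can be approximated on the SQL side. The query below is only a sketch of that
workaround, not a proposal for how the server code should do it: it rewrites
the tsquery's text form with replace(), which assumes no lexeme in the query
contains a literal '&'.

```sql
-- Hypothetical workaround: turn AND into OR in the tsquery text before highlighting.
-- Assumption: no quoted lexeme in the query contains a literal '&'.
select ts_headline(
    'baz baz baz ipsum ' || repeat(' foo ', 4999) || ' labor',
    replace($$'ipsum' & 'labor'$$::tsquery::text, '&', '|')::tsquery,
    'StartSel=>, StopSel=<, MaxFragments=100, MaxWords=7, MinWords=3');
```

Going by the OR example above, I would expect this to highlight both terms,
but the original AND query should still be used for the actual matching,
since the rewritten query matches documents containing only one of the terms.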

Best,
Alex
