Re: Rethinking the implementation of ts_headline() - Mailing list pgsql-hackers

From Alvaro Herrera
Subject Re: Rethinking the implementation of ts_headline()
Msg-id 20230118110942.od2naagwp6molgxz@alvherre.pgsql
In response to Rethinking the implementation of ts_headline()  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Rethinking the implementation of ts_headline()  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
I tried this other test, based on looking at the new regression tests
you added,

SELECT ts_headline('english', '
Day after day, day after day,
  We stuck, nor breath nor motion,
As idle as a painted Ship
  Upon a painted Ocean.
Water, water, every where
  And all the boards did shrink;
Water, water, every where,
  Nor any drop to drink.
S. T. Coleridge (1772-1834)
', to_tsquery('english', '(day & drink) | (idle & painted)'), 'MaxFragments=5, MaxWords=9, MinWords=4');
               ts_headline               
─────────────────────────────────────────
 motion,                                ↵
 As <b>idle</b> as a <b>painted</b> Ship↵
   Upon
(1 row)

and was surprised that the match for the 'day & drink' arm of the OR
disappears from the reported headline.

This is what PostgreSQL 15 reports for the same query:

SELECT ts_headline('english', '
Day after day, day after day,
  We stuck, nor breath nor motion,
As idle as a painted Ship
  Upon a painted Ocean.
Water, water, every where
  And all the boards did shrink;
Water, water, every where,
  Nor any drop to drink.
S. T. Coleridge (1772-1834)
', to_tsquery('english', '(day & drink) | (idle & painted)'), 'MaxFragments=5, MaxWords=9, MinWords=4');
                        ts_headline                        
───────────────────────────────────────────────────────────
 <b>Day</b> after <b>day</b>, <b>day</b> after <b>day</b>,↵
   We stuck ... motion,                                   ↵
 As <b>idle</b> as a <b>painted</b> Ship                  ↵
   Upon
(1 row)

I think this was better.
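
To see which arm of the OR actually matches the text, independently of
what ts_headline() chooses to display, each arm can be tested on its own
with @@.  This is only a sketch: the CTE name "doc", the column "body"
and the output column aliases are made up for illustration, and it
assumes the stock 'english' configuration.

```sql
-- Sketch: evaluate each arm of the OR separately against the same text.
-- "doc", "body" and the column aliases are illustrative names only.
WITH doc(body) AS (
  VALUES ('
Day after day, day after day,
  We stuck, nor breath nor motion,
As idle as a painted Ship
  Upon a painted Ocean.
Water, water, every where
  And all the boards did shrink;
Water, water, every where,
  Nor any drop to drink.
S. T. Coleridge (1772-1834)
')
)
SELECT to_tsvector('english', body) @@ to_tsquery('english', 'day & drink')
         AS day_drink_arm,
       to_tsvector('english', body) @@ to_tsquery('english', 'idle & painted')
         AS idle_painted_arm
FROM doc;
```

If both columns come out true, a headline that shows words from only one
arm is a choice made by the headline code and not a consequence of the
match itself.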

15 seems to fail in other ways, though.  For instance, 'drink' is not
highlighted in the headline when the other arm of the OR also matches, but
it is when the other arm doesn't match.  Both 15 and master return the same
for this one:

SELECT ts_headline('english', '
Day after day, day after day,
  We stuck, nor breath nor motion,
As idle as a painted Ship
  Upon a painted Ocean.
Water, water, every where
  And all the boards did shrink;
Water, water, every where,
  Nor any drop to drink.
S. T. Coleridge (1772-1834)
', to_tsquery('english', '(day & drink) | (mountain & backpack)'), 'MaxFragments=5, MaxWords=9, MinWords=4');
                        ts_headline                        
───────────────────────────────────────────────────────────
 <b>Day</b> after <b>day</b>, <b>day</b> after <b>day</b>,↵
   We stuck ... drop to <b>drink</b>.                     ↵
 S. T. Coleridge
(1 row)



Another thing I think might be a regression is the way fragments are
selected.  Consider what happens if I change the "idle & painted" in the
earlier query to "idle <-> painted", and MaxWords is kept low:

SELECT ts_headline('english', '
Day after day, day after day,
  We stuck, nor breath nor motion,
As idle as a painted Ship
  Upon a painted Ocean.
Water, water, every where
  And all the boards did shrink;
Water, water, every where,
  Nor any drop to drink.
S. T. Coleridge (1772-1834)
', to_tsquery('english', '(day & drink) | (idle <-> painted)'), 'MaxFragments=5, MaxWords=9, MinWords=4');
                  ts_headline                  
───────────────────────────────────────────────
 <b>day</b>,                                  ↵
   We stuck, nor breath nor motion,           ↵
 As <b>idle</b> ... <b>painted</b> Ship       ↵
   Upon a <b>painted</b> Ocean.               ↵
 Water, water, every ... drop to <b>drink</b>.↵
 S. T. Coleridge
(1 row)

Note that it chose to put a fragment delimiter exactly in the middle of the
phrase match, where the stop words are.  If I raise MaxWords, the result is
much better, I suppose because the word limit no longer forces a new fragment:

SELECT ts_headline('english', '
Day after day, day after day,
  We stuck, nor breath nor motion,
As idle as a painted Ship
  Upon a painted Ocean.
Water, water, every where
  And all the boards did shrink;
Water, water, every where,
  Nor any drop to drink.
S. T. Coleridge (1772-1834)
', to_tsquery('english', '(day & drink) | (idle <-> painted)'), 'MaxFragments=5, MaxWords=25, MinWords=4');
                   ts_headline                    
──────────────────────────────────────────────────
 after <b>day</b>, <b>day</b> after <b>day</b>,  ↵
   We stuck, nor breath nor motion,              ↵
 As <b>idle</b> as a <b>painted</b> Ship         ↵
   Upon a <b>painted</b> Ocean.                  ↵
 Water, water, every where ... boards did shrink;↵
 Water, water, every where,                      ↵
   Nor any drop to <b>drink</b>.                 ↵
 S. T. Coleridge
(1 row)

But in 15, the query with low MaxWords does this instead, placing the
fragment delimiter just *before* the phrasal match:

SELECT ts_headline('english', '
Day after day, day after day,
  We stuck, nor breath nor motion,
As idle as a painted Ship
  Upon a painted Ocean.
Water, water, every where
  And all the boards did shrink;
Water, water, every where,
  Nor any drop to drink.
S. T. Coleridge (1772-1834)
', to_tsquery('english', '(day & drink) | (idle <-> painted)'), 'MaxFragments=5, MaxWords=9, MinWords=4');
                        ts_headline                        
───────────────────────────────────────────────────────────
 <b>Day</b> after <b>day</b>, <b>day</b> after <b>day</b>,↵
   We stuck ... <b>idle</b> as a <b>painted</b> Ship      ↵
   Upon a <b>painted</b> Ocean ... drop to <b>drink</b>.  ↵
 S. T. Coleridge
(1 row)

(Both 15 and master highlight 'painted' in the "Upon a painted Ocean"
verse, which perhaps they shouldn't do, since it's not preceded by
'idle'.)
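
When judging which occurrences of 'painted' can take part in the phrase
arm, it may help to look at the lexeme positions that the <-> operator
compares.  As far as I know, stop words are dropped from the tsvector but
still consume positions, so the distance between 'idle' and each
'painted' is visible in the output.  The following is a sketch assuming
the stock 'english' configuration; the shortened text and the column
aliases are illustrative only.

```sql
-- Sketch: show lexeme positions, the parsed phrase query, and whether
-- the phrase arm matches on its own.  Column aliases are illustrative.
SELECT to_tsvector('english',
                   'As idle as a painted Ship Upon a painted Ocean.')
         AS lexeme_positions,
       to_tsquery('english', 'idle <-> painted')
         AS phrase_query,
       to_tsvector('english',
                   'As idle as a painted Ship Upon a painted Ocean.')
         @@ to_tsquery('english', 'idle <-> painted')
         AS phrase_arm_matches;
```

Comparing the positions against the distance shown in phrase_query makes
it easier to tell which highlighted words are really part of a phrase
match and which are merely query words that appear elsewhere in the text.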


(I think it's super annoying that the fragment separation algorithm
fails to preserve newlines between verses as it adds the '...'
separator.  But I guess poetry is not the main use case for text search
anyway, so it probably doesn't matter much.)

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Every machine is a smoke machine if you operate it wrong enough."
https://twitter.com/libseybieda/status/1541673325781196801


