Re: Mailing list search engine: surprising missing results? - Mailing list pgsql-www

From Ivan Panchenko
Subject Re: Mailing list search engine: surprising missing results?
Date
Msg-id 79b3eb6e-152e-3c56-7b71-51d091c0f6d9@postgrespro.ru
In response to Re: Mailing list search engine: surprising missing results?  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Mailing list search engine: surprising missing results?  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-www


On 25.01.2022 19:22, Tom Lane wrote:
Laurenz Albe <laurenz.albe@cybertec.at> writes:
On Tue, 2022-01-25 at 14:04 +0300, Oleg Bartunov wrote:
On Mon, Jan 24, 2022 at 11:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Bruce Momjian <bruce@momjian.us> writes:
On Mon, Jan 24, 2022 at 08:27:41AM +0100, Laurenz Albe wrote:
The reason is that the 'moore' in 'boyer-moore' is stemmed, since it
is at the end of the word, while the 'moore' in 'Boyer-Moore-Horspool'
isn't:
Not quite.  The problem in question is the "'boyer-moore':1".
If that were "'boyer-moor':1" instead, the problem would disappear.
Actually, when I try this here, it seems like the stemming *is*
consistent:

regression=# SELECT to_tsvector('english', 'Boyer-Moore-Horspool');
                       to_tsvector                        
----------------------------------------------------------
 'boyer':2 'boyer-moore-horspool':1 'horspool':4 'moor':3
(1 row)

regression=# SELECT to_tsvector('english', 'Boyer-Moore');
            to_tsvector            
-----------------------------------
 'boyer':2 'boyer-moor':1 'moor':3
(1 row)

If you try variants of that where the first or third term is stemmable,
say

regression=# SELECT to_tsvector('english', 'Boyers-Moore-Horspool');
                        to_tsvector                        
-----------------------------------------------------------
 'boyer':2 'boyers-moore-horspool':1 'horspool':4 'moor':3
(1 row)

it sure appears that each component word is stemmed independently
already.  So I think the original explanation here is wrong and
we need to probe more closely.
The actual explanation can be seen from comparing a tsvector with a tsquery.
To avoid stemming effects, we use the simple configuration below.
# select plainto_tsquery('simple','boyers-moore');

           plainto_tsquery           
-------------------------------------
 'boyers-moore' & 'boyers' & 'moore'
 
# select to_tsvector('simple','boyers-moore-horspool');
                         to_tsvector                        
-------------------------------------------------------------
 'boyers':2 'boyers-moore-horspool':1 'horspool':4 'moore':3
Obviously, such a tsvector does not match the above tsquery, because it lacks the compound lexeme 'boyers-moore' that the query requires. I think a better tsquery for this query would be
 'boyers-moore' | ('boyers' & 'moore')
Maybe it is worth changing the to_tsquery() behavior for such cases.
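
To make the mismatch concrete, here is a minimal Python sketch that models only the boolean part of matching: a document is reduced to its set of lexemes (positions dropped), and each query is evaluated as plain AND/OR over set membership. The function names are hypothetical and this is not how PostgreSQL implements the @@ operator; it only reproduces the logic of the two query shapes discussed above.

```python
# Lexemes taken from the to_tsvector('simple', ...) output above,
# with positions dropped. This is a toy model, not PostgreSQL's matcher.
HORSPOOL_DOC = {"boyers", "boyers-moore-horspool", "horspool", "moore"}

# Assumed lexemes for a document containing just 'boyers-moore'
# (the compound plus its two parts).
PLAIN_DOC = {"boyers-moore", "boyers", "moore"}


def matches_current(doc):
    """Models 'boyers-moore' & 'boyers' & 'moore'."""
    return "boyers-moore" in doc and "boyers" in doc and "moore" in doc


def matches_proposed(doc):
    """Models 'boyers-moore' | ('boyers' & 'moore')."""
    return "boyers-moore" in doc or ("boyers" in doc and "moore" in doc)


if __name__ == "__main__":
    for name, doc in (("horspool", HORSPOOL_DOC), ("plain", PLAIN_DOC)):
        print(name, matches_current(doc), matches_proposed(doc))
```

In this model the three-part compound fails the current query only because the lexeme 'boyers-moore' is absent from its tsvector, while the proposed form accepts it through the ('boyers' & 'moore') branch. Note that the proposed form ignores word order and adjacency, so it would also match documents where the two words merely co-occur.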

			regards, tom lane


Regards,
Ivan
