Re: Should phraseto_tsquery('simple', 'blue blue') @@ to_tsvector('simple', 'blue') be true ? - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Should phraseto_tsquery('simple', 'blue blue') @@ to_tsvector('simple', 'blue') be true ?
Msg-id 11252.1465422251@sss.pgh.pa.us
In response to Re: Should phraseto_tsquery('simple', 'blue blue') @@ to_tsvector('simple', 'blue') be true ?  (Oleg Bartunov <obartunov@gmail.com>)
Responses Re: Should phraseto_tsquery('simple', 'blue blue') @@ to_tsvector('simple', 'blue') be true ?  (Teodor Sigaev <teodor@sigaev.ru>)
List pgsql-hackers
Oleg Bartunov <obartunov@gmail.com> writes:
> On Wed, Jun 8, 2016 at 1:05 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I concur that that seems like a rather useless behavior.  If we have
>> "x <-> y" it is not possible to match at distance zero, while if we
>> have "x <-> x" it seems unlikely that the user is expecting us to
>> treat that identically to "x".  So phrase search simply should not
>> consider distance-zero matches.

> what's about word with several infinitives

> select to_tsvector('en', 'leavings');
>       to_tsvector
> ------------------------
>  'leave':1 'leavings':1
> (1 row)

> select to_tsvector('en', 'leavings') @@ 'leave <0> leavings'::tsquery;
>  ?column?
> ----------
>  t
> (1 row)

Hmm.  I can grant that there might be some cases where you want to see
if two separate patterns match the same lexeme, but that seems like an
extremely specialized use-case that you would only invoke very
intentionally.  It should not be built in as part of the default behavior
of every phrase search, because 99% of the time this would be an
unexpected and unwanted match.  I'm not even convinced that the operator
for this should be spelled <0> --- that seems more like a hack than a
natural extension of phrase search.  But if we do spell it like that,
then I think it should be called out as a special case that only applies
to <0>; that is, for any other value of N, the match has to be to separate
lexemes.
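
To make that rule concrete, here is a minimal Python sketch of it. This is a toy model, not PostgreSQL code: a tsvector is assumed to be a plain dict from lexeme to positions, the function name is hypothetical, and "separate lexemes" is read as "distinct positions".

```python
# Toy model, not PostgreSQL code: a "tsvector" is a dict lexeme -> positions.
def matches_with_special_zero(tsv, left, right, n):
    """Hypothetical rule: <0> matches only at the same position;
    <N> with N > 0 requires distinct positions at distance 1..N."""
    for l in tsv.get(left, []):
        for r in tsv.get(right, []):
            d = r - l
            if n == 0 and d == 0:
                return True
            if n > 0 and 1 <= d <= n:
                return True
    return False


leavings = {"leave": [1], "leavings": [1]}  # positions from Oleg's example
blue = {"blue": [1]}                        # to_tsvector('simple', 'blue')

print(matches_with_special_zero(leavings, "leave", "leavings", 0))  # True
print(matches_with_special_zero(leavings, "leave", "leavings", 1))  # False
print(matches_with_special_zero(blue, "blue", "blue", 1))           # False
```

Under this reading, the same-position match is available only when <0> is written explicitly, and 'blue' <-> 'blue' no longer matches a document containing a single 'blue'.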

This brings up something else that I am not very sold on: to wit,
do we really want the "less than or equal" distance behavior at all?
The documentation gives the example that
    phraseto_tsquery('cat ate some rats')
produces
    ( 'cat' <-> 'ate' ) <2> 'rat'
because "some" is a stopword.  However, that pattern will also match
"cat ate rats", which seems surprising and unexpected to me; certainly
it would surprise a user who did not realize that "some" is a stopword.

So I think there's a reasonable case for decreeing that <N> should only
match lexemes *exactly* N apart.  If we did that, we would no longer have
the misbehavior that Jean-Pierre is complaining about, and we'd not need
to argue about whether <0> needs to be treated specially.
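
The difference between the two readings can be sketched in a few lines of Python. Again this is a toy model under stated assumptions, not the actual implementation: tsvectors are dicts from lexeme to positions, the helper names are hypothetical, the positions below are assumed (the stopword "some" is taken to consume position 3), and "at most N" is taken to include distance zero, as the behavior discussed in this thread implies.

```python
# Toy model, not PostgreSQL code: compare "at most N" with "exactly N".
def phrase_positions(left_pos, right_pos, n, exact):
    """Return the right-operand positions that follow some left-operand
    position at an accepted distance."""
    hits = []
    for r in right_pos:
        for l in left_pos:
            d = r - l
            ok = (d == n) if exact else (0 <= d <= n)
            if ok:
                hits.append(r)
                break
    return hits


def cat_ate_rat(tsv, exact):
    """Evaluate ( 'cat' <-> 'ate' ) <2> 'rat' against a toy tsvector."""
    cat_ate = phrase_positions(tsv.get("cat", []), tsv.get("ate", []), 1, exact)
    return bool(phrase_positions(cat_ate, tsv.get("rat", []), 2, exact))


with_stopword = {"cat": [1], "ate": [2], "rat": [4]}     # "cat ate some rats"
without_stopword = {"cat": [1], "ate": [2], "rat": [3]}  # "cat ate rats"
blue = {"blue": [1]}                                     # single 'blue'

print(cat_ate_rat(with_stopword, exact=False))     # True
print(cat_ate_rat(without_stopword, exact=False))  # True  (the surprising match)
print(cat_ate_rat(with_stopword, exact=True))      # True
print(cat_ate_rat(without_stopword, exact=True))   # False
print(bool(phrase_positions(blue["blue"], blue["blue"], 1, exact=False)))  # True
print(bool(phrase_positions(blue["blue"], blue["blue"], 1, exact=True)))   # False
```

With exact distances, both the "cat ate rats" surprise and the 'blue' <-> 'blue' match against a single 'blue' go away without any special rule for <0>.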

Or maybe we need two operators, one for exactly-N-apart and one for
at-most-N-apart.
        regards, tom lane


