Re: BUG #15689: Stemming of negation/not operator - Mailing list pgsql-bugs

From Ivan Viragine
Subject Re: BUG #15689: Stemming of negation/not operator
Date
Msg-id CAOWkBR+AdOer2mWX5ahKsPUgozZ7-0s-FRN6+ENhjmtG8mZqyQ@mail.gmail.com
In response to Re: BUG #15689: Stemming of negation/not operator  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-bugs
Hi, Tom!

Thanks for the reply.
Surely there are many cases where stemming is useful, but from the user's perspective, someone who writes a complex query with NOTs usually knows what they are doing and wants to match certain specific cases. Stemming the NOT clause takes that control away.
Also, I think it is better to return more results, including the stemmed words, and let the user add new clauses to filter them out, than to lose correct results without the user knowing why, or even that anything is being lost (a priori, the user does not know that the NOT clause was stemmed).
If you try this on Elasticsearch, it works as (I) expected.
The idea is not to exempt particular words, but to leave the negated clauses of the query unstemmed. That is, the query parser knows which parts are in a NOT clause, so it could parse them and dynamically add them to the set of words that are not stemmed.
As for the index token being "car" for the word "cars": sure it will be, as long as we use the same parser/tokenizer. That is why the recheck you mentioned would be necessary.
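
A minimal sketch of what such a recheck could look like with existing functions, assuming the built-in 'simple' configuration is used as the non-stemming one. The table docs and its rows are hypothetical and exist only for illustration; this is not a proposed implementation.

```sql
-- Hypothetical table and data, used only for illustration.
CREATE TEMP TABLE docs (id int, body text);
INSERT INTO docs VALUES
  (1, 'I bought a car'),
  (2, 'I bought two cars');

-- The first condition is the usual stemmed match: both rows contain the
-- lexeme 'car' under the 'english' configuration.
-- The second condition is the recheck: the 'simple' configuration does not
-- stem, so it can tell the literal word "cars" apart and reject that row.
SELECT id, body
FROM docs
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'car')
  AND NOT (to_tsvector('simple', body) @@ to_tsquery('simple', 'cars'));
```

With these two rows, only the row containing "car" should come back. The cost is a second tsvector computed per candidate row (or a second index), which is the kind of trade-off an explicit recheck would have to accept.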

About the lexemes: we do not use prefix match here. But I see your point. It falls almost in the same category: doing things under the hood that the user may not be aware of.

The usual way to explicitly prevent stemming would be quotes. Normally, quotes mean "match this exactly".

Best regards,

--
Ivan Nicola Viragine


On Tue, Mar 12, 2019 at 7:34 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
PG Bug reporting form <noreply@postgresql.org> writes:
> When using to_tsquery function it is stemming negation/not parts of the
> query, where it probably shouldn't.
> Some examples:

> SELECT to_tsquery('english', 'car & !cars');
>    to_tsquery   
> ----------------
>  'car' & !'car'

I'm not exactly convinced by this argument, because it seems like
you're only thinking about a corner case.  There are probably at
least as many examples where you *do* want stemming on a negated term.

Another issue is that even if we changed the tsquery input function
to not stem particular words, I doubt that it would do anything useful,
because what it will be comparing to is tsvector entries that have
certainly been stemmed.  That is, even if the original document said
"cars", what's going to be in the tsvector is just "car", so that
forbidding a match to "cars" isn't going to do anything.  (Maybe
what this says is that there should be a less-lossy recheck against
the original document after the tsvector match, but that'd have to
be done by an additional, explicit operator I think.  Or possibly
the recheck just requires tsquery match with a different stemming
configuration.)
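
The point about stemmed tsvector entries can be seen directly. The sketch below uses a direct cast to tsquery, which skips normalization, as a stand-in for a hypothetical "unstemmed negation"; the sample text is made up for illustration.

```sql
-- The document says "cars", but the english tsvector stores only 'car'.
SELECT to_tsvector('english', 'red cars');

-- A cast to tsquery does not normalize, so the negated term stays 'cars'.
-- No lexeme 'cars' exists in the tsvector, so !'cars' is always satisfied
-- and the document still matches: this returns true.
SELECT to_tsvector('english', 'red cars') @@ 'car & !cars'::tsquery;
```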

A related problem that's bothered me for some time is that lexemes
get stemmed even if there is a "*" (prefix match) marker on them,
causing them to possibly match much more than the user expected.
But again, it's not real obvious how to make that better given the
match-to-tsvector context --- not stemming could easily remove
desired matches to stemmed tsvector entries.
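
A small illustration of the prefix-match effect, using made-up input: the prefix term is stemmed before the "*" marker is applied, so the resulting prefix is shorter than what the user typed.

```sql
-- 'cars:*' is stemmed to the prefix 'car' before matching.
SELECT to_tsquery('english', 'cars:*');

-- The stemmed prefix 'car' also matches the lexeme 'carpet',
-- which a user asking for "cars..." probably did not intend: returns true.
SELECT to_tsvector('english', 'carpet') @@ to_tsquery('english', 'cars:*');
```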

If we could think of a way for it to do something useful, my inclination
would be to allow an explicit "don't stem" marker on lexemes, rather
than trying to drive it off whether the context is a negation or not.

                        regards, tom lane
