Re: BUG #18149: Incorrect lexeme for english token "proxy" - Mailing list pgsql-bugs

From	Tom Lane
Subject	Re: BUG #18149: Incorrect lexeme for english token "proxy"
Date	October 7, 2023 14:18:56
Msg-id	3123965.1696688336@sss.pgh.pa.us
In response to	Re: BUG #18149: Incorrect lexeme for english token "proxy" (Laurenz Albe <laurenz.albe@cybertec.at>)
Responses	Re: BUG #18149: Incorrect lexeme for english token "proxy"
List	pgsql-bugs

Laurenz Albe <laurenz.albe@cybertec.at> writes:
> On Thu, 2023-10-05 at 21:44 +0000, PG Bug reporting form wrote:
>> The english dictionary is using the lexeme "proxi" for the token "proxy". As
>> a result, the search term "proxy" is not yielding results for records that
>> contain this word.

> I cannot reproduce that.

Me either.  It suggests that you're trying to match against documents
that haven't been put through the same normalization process as the
query.

>> I think this lexeme was chosen to support the plural of proxy which is
>> proxies. However there are other plurals where the root word is spelled
>> different and Postgres creates the correct lexeme such as: [goose or mouse]

> The snowball dictionary has no real knowledge of the words.  Stemming is
> done by applying some heuristics which work "well enough" in most cases.

Yeah.  I don't see anything hugely wrong with this particular
transformation.  It is doing something useful, in that "proxy"
and "proxies" are both converted to the same lexeme "proxi".
In an ideal world, the lexeme would be "proxy", but it doesn't
really make that much difference if it isn't.
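
To illustrate the point, here is a minimal Python sketch. It assumes a
toy stemmer with only two suffix rules modeled on the English Snowball
stemmer (trailing "ies" and trailing "y" both become "i"); it is not the
full algorithm the english dictionary runs, and toy_stem and matches are
hypothetical names used only for this example. It shows that a search
matches as long as the query and the document go through the same
normalization, and fails when the document side was left unstemmed.

```python
# Toy sketch only: two suffix rules modeled on the English Snowball stemmer.
# This is NOT the full algorithm used by PostgreSQL's english dictionary.
VOWELS = set("aeiouy")


def toy_stem(word: str) -> str:
    """Reduce a word to a crude lexeme using two suffix rules."""
    w = word.lower()
    # Plural rule: "ies" -> "i" if more than one letter precedes it, else "ie".
    if w.endswith("ies"):
        return w[:-2] if len(w) > 4 else w[:-1]
    # Final "y" -> "i" after a non-vowel that is not the first letter.
    if w.endswith("y") and len(w) > 2 and w[-2] not in VOWELS:
        return w[:-1] + "i"
    return w


def matches(query: str, document_lexemes: set) -> bool:
    """The query is always stemmed; the document set is used as given."""
    return toy_stem(query) in document_lexemes


# Document normalized with the same rule as the query.
normalized = {toy_stem(t) for t in "The proxies were restarted".split()}
# Document stored as raw lowercase tokens, bypassing the stemmer.
raw = {t.lower() for t in "Configure the proxy".split()}

print(toy_stem("proxy"), toy_stem("proxies"))  # proxi proxi
print(matches("proxy", normalized))            # True
print(matches("proxy", raw))                   # False
```

The last case is the kind of mismatch suspected in this report: the
lexeme "proxi" is never compared against the literal word "proxy" unless
one side skipped normalization.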

In any case, changing it now wouldn't be very practical, because
existing documents will already have been made into tsvectors
using this rule.

            regards, tom lane
