Thread: BUG #18149: Incorrect lexeme for english token "proxy"
The following bug has been logged on the website:

Bug reference:      18149
Logged by:          Patrick Peralta
Email address:      pperalta@gmail.com
PostgreSQL version: 14.5
Operating system:   Linux
Description:

The english dictionary is using the lexeme "proxi" for the token "proxy".
As a result, the search term "proxy" is not yielding results for records
that contain this word.

# select * from ts_debug('english', 'proxy');
   alias   |   description   | token |  dictionaries  |  dictionary  | lexemes
-----------+-----------------+-------+----------------+--------------+---------
 asciiword | Word, all ASCII | proxy | {english_stem} | english_stem | {proxi}

I think this lexeme was chosen to support the plural of proxy, which is
"proxies". However, there are other plurals where the root word is spelled
differently and Postgres creates the correct lexeme, such as:

# select * from ts_debug('english', 'goose');
   alias   |   description   | token |  dictionaries  |  dictionary  | lexemes
-----------+-----------------+-------+----------------+--------------+---------
 asciiword | Word, all ASCII | goose | {english_stem} | english_stem | {goos}

# select * from ts_debug('english', 'mouse');
   alias   |   description   | token |  dictionaries  |  dictionary  | lexemes
-----------+-----------------+-------+----------------+--------------+---------
 asciiword | Word, all ASCII | mouse | {english_stem} | english_stem | {mous}

I believe we can create our own dictionary as a workaround
(https://www.postgresql.org/docs/current/textsearch-dictionaries.html),
but I'm reporting this to see if using "proxi" for "proxy" is intentional.
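[Editor's note: as a rough sketch of the custom-dictionary workaround the
report mentions — not a confirmed fix; the dictionary name "proxy_fix", the
configuration name "my_english", and the synonym file are all hypothetical —
a synonym dictionary consulted ahead of the stemmer can pin the lexeme:]

-- assumes $SHAREDIR/tsearch_data/proxy_fix.syn contains the two lines:
--   proxy proxy
--   proxies proxy
CREATE TEXT SEARCH DICTIONARY proxy_fix (TEMPLATE = synonym, SYNONYMS = proxy_fix);
CREATE TEXT SEARCH CONFIGURATION my_english (COPY = english);
-- consult proxy_fix first; tokens it doesn't list fall through to english_stem
ALTER TEXT SEARCH CONFIGURATION my_english
    ALTER MAPPING FOR asciiword WITH proxy_fix, english_stem;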
On Thu, 2023-10-05 at 21:44 +0000, PG Bug reporting form wrote:
> The english dictionary is using the lexeme "proxi" for the token "proxy".
> As a result, the search term "proxy" is not yielding results for records
> that contain this word.
>
> # select * from ts_debug('english', 'proxy');
>    alias   |   description   | token |  dictionaries  |  dictionary  | lexemes
> -----------+-----------------+-------+----------------+--------------+---------
>  asciiword | Word, all ASCII | proxy | {english_stem} | english_stem | {proxi}

I cannot reproduce that.  If I generate a text search query for "proxy",
I get this:

SELECT to_tsquery('english', 'proxy');

 to_tsquery
════════════
 'proxi'
(1 row)

which will work just fine.

> I think this lexeme was chosen to support the plural of proxy which is
> proxies. However there are other plurals where the root word is spelled
> different and Postgres creates the correct lexeme such as: [goose or mouse]

The snowball dictionary has no real knowledge of the words.  Stemming is
done by applying some heuristics which work "well enough" in most cases.

In the case of "proxy", the rule seems simply to be to stem everything that
ends in "y" as ending in "i", in the hope of catching plurals that are
built along the same lines as for "proxy":

select * from to_tsvector('english', 'mumply');

 to_tsvector
═════════════
 'mumpli':1
(1 row)

This works fine in most cases, but every heuristic can go wrong
occasionally:

select * from to_tsvector('english', 'standby');

 to_tsvector
═════════════
 'standbi':1
(1 row)

Yours,
Laurenz Albe
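[Editor's note: to see the stemmer's behavior in isolation, ts_lexize can
be applied to the english_stem dictionary directly. Both the singular and
the plural land on the same lexeme, which is why matching still works:]

SELECT ts_lexize('english_stem', 'proxy'), ts_lexize('english_stem', 'proxies');

 ts_lexize | ts_lexize
-----------+-----------
 {proxi}   | {proxi}
(1 row)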
Laurenz Albe <laurenz.albe@cybertec.at> writes:
> On Thu, 2023-10-05 at 21:44 +0000, PG Bug reporting form wrote:
>> The english dictionary is using the lexeme "proxi" for the token "proxy".
>> As a result, the search term "proxy" is not yielding results for records
>> that contain this word.

> I cannot reproduce that.

Me either.  It suggests that you're trying to match against documents
that haven't been put through the same normalization process as the
query.

>> I think this lexeme was chosen to support the plural of proxy which is
>> proxies. However there are other plurals where the root word is spelled
>> different and Postgres creates the correct lexeme such as: [goose or mouse]

> The snowball dictionary has no real knowledge of the words.  Stemming is
> done by applying some heuristics which work "well enough" in most cases.

Yeah.  I don't see anything hugely wrong with this particular
transformation.  It is doing something useful, in that "proxy" and
"proxies" are both converted to the same lexeme "proxi".  In an ideal
world, the lexeme would be "proxy", but it doesn't really make that much
difference if it isn't.

In any case, changing it now wouldn't be very practical, because existing
documents will already have been made into tsvectors using this rule.

			regards, tom lane
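[Editor's note: for illustration, one way the reported symptom can arise,
in the spirit of Tom's diagnosis: if the document side never goes through
the english configuration (here via a bare tsvector cast), its lexemes are
never stemmed and the stemmed query no longer lines up:]

SELECT 'proxy'::tsvector @@ to_tsquery('english', 'proxy');
-- f: the vector holds 'proxy', the query holds 'proxi'

SELECT to_tsvector('english', 'proxy') @@ to_tsquery('english', 'proxy');
-- t: both sides normalized by the same configuration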
Hi Tom and Laurenz,
Thank you for your timely replies. I may have misdiagnosed my problem, so I'll elaborate a bit more.
As you mentioned, I see that searching for the term "proxy" works in general:
# SELECT to_tsvector('english', 'I set up an http proxy for my network.') @@ to_tsquery('english','proxy');
?column?
----------
t
(1 row)
However I ran into an anomaly with this query:
# SELECT to_tsvector('english', 'CLOUD-PROXY-SEP19-T1-254--1695167380256') @@ to_tsquery('english','cloud-proxy:*');
?column?
----------
f
(1 row)
When I search with the prefix "cloud-proxy" it doesn't match the input.
When I try this:
# SELECT to_tsvector('english', 'CLOUD-SERVER-SEP19-T1-254--1695167380256') @@ to_tsquery('english','cloud-server:*');
 ?column?
----------
 t
(1 row)
The prefix "cloud-server" works.
Furthermore, if I switch to the 'simple' dictionary instead of 'english':
# SELECT to_tsvector('simple', 'CLOUD-PROXY-SEP19-T1-254--1695167380256') @@ to_tsquery('simple','cloud-proxy:*');
?column?
----------
t
(1 row)
Here the 'cloud-proxy' prefix works.
The only difference I can see is that the 'simple' dictionary uses 'proxy' as the lexeme whereas the 'english' dictionary uses 'proxi'. Could this explain the difference in these queries, or is there something else I'm missing?
Thanks,
Patrick
Patrick Peralta <pperalta@gmail.com> writes:
> However I ran into an anomaly with this query:
> # SELECT to_tsvector('english', 'CLOUD-PROXY-SEP19-T1-254--1695167380256')
> @@ to_tsquery('english','cloud-proxy:*');
>  ?column?
> ----------
>  f
> (1 row)

Hmm.  Investigating that a bit:

regression=# select * from ts_debug('english', 'cloud-proxy');
      alias      |           description           |    token    |  dictionaries  |  dictionary  |    lexemes
-----------------+---------------------------------+-------------+----------------+--------------+---------------
 asciihword      | Hyphenated word, all ASCII      | cloud-proxy | {english_stem} | english_stem | {cloud-proxi}
 hword_asciipart | Hyphenated word part, all ASCII | cloud       | {english_stem} | english_stem | {cloud}
 blank           | Space symbols                   | -           | {}             |
 hword_asciipart | Hyphenated word part, all ASCII | proxy       | {english_stem} | english_stem | {proxi}
(4 rows)

regression=# select * from ts_debug('english', 'CLOUD-PROXY-SEP19-T1-254--1695167380256');
      alias      |               description                |        token         |  dictionaries  |  dictionary  |        lexemes
-----------------+------------------------------------------+----------------------+----------------+--------------+------------------------
 numhword        | Hyphenated word, letters and digits      | CLOUD-PROXY-SEP19-T1 | {simple}       | simple       | {cloud-proxy-sep19-t1}
 hword_asciipart | Hyphenated word part, all ASCII          | CLOUD                | {english_stem} | english_stem | {cloud}
 blank           | Space symbols                            | -                    | {}             |
 hword_asciipart | Hyphenated word part, all ASCII          | PROXY                | {english_stem} | english_stem | {proxi}
 blank           | Space symbols                            | -                    | {}             |
 hword_numpart   | Hyphenated word part, letters and digits | SEP19                | {simple}       | simple       | {sep19}
 blank           | Space symbols                            | -                    | {}             |
 hword_numpart   | Hyphenated word part, letters and digits | T1                   | {simple}       | simple       | {t1}
 blank           | Space symbols                            | -                    | {}             |
 uint            | Unsigned integer                         | 254                  | {simple}       | simple       | {254}
 blank           | Space symbols                            | -                    | {}             |
 int             | Signed integer                           | -1695167380256       | {simple}       | simple       | {-1695167380256}
(12 rows)

So the difficulty is that (a) the default TS parser doesn't break down
this multiply-hyphenated word quite the way you'd hoped, and (b) fragments
classified as numhword aren't passed through the english_stem dictionary
at all.  Also, (c) I'm doubtful that the snowball stemmer would have
converted cloud-proxy-sep19-t1 to cloud-proxi-sep19-t1; but it didn't get
the chance anyway.

While (b) would be easy to address with a custom TS configuration,
(a) and (c) can't be fixed without getting your hands dirty in C code.
Is there any chance of adjusting the notation you're dealing with here?
I get sane-looking results from, for example,

regression=# select to_tsvector('english', 'CLOUD-PROXY--SEP19-T1-254--1695167380256');
                                          to_tsvector
----------------------------------------------------------------------------------------------
 '-1695167380256':8 '254':7 'cloud':2 'cloud-proxi':1 'proxi':3 'sep19':5 'sep19-t1':4 't1':6
(1 row)

If that data format is being imposed on you then I'm not seeing a good
solution without custom C code.  I'd be inclined to try to make the
parser generate all of "cloud-proxy-sep19-t1", "cloud-proxy-sep19",
"cloud-proxy" from this input, but a custom TS parser is kind of a
high bar to clear.

			regards, tom lane
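[Editor's note: to spell out why the prefix query fails here:
to_tsquery('english', 'cloud-proxy:*') stems its operand to 'cloud-proxi':*,
which is not a textual prefix of the indexed lexeme 'cloud-proxy-sep19-t1';
"server" happens to survive stemming unchanged, which is why the
cloud-server variant matched. As a rough sketch of the custom-configuration
route Tom mentions for (b) — the name "my_numconf" is hypothetical — the
numeric hyphenated-word token types can simply be remapped to the stemmer:]

CREATE TEXT SEARCH CONFIGURATION my_numconf (COPY = english);
-- send letters-and-digits tokens through english_stem instead of simple
ALTER TEXT SEARCH CONFIGURATION my_numconf
    ALTER MAPPING FOR numhword, hword_numpart WITH english_stem;

[Per Tom's point (c), this only changes which dictionary sees the token,
not how the parser splits the hyphenated input.]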
> So the difficulty is that (a) the default TS parser doesn't break down
> this multiply-hyphenated word quite the way you'd hoped
I see, thanks for pointing this out. Now I understand that this isn't a lexeme issue; rather it's a side effect of how the parser handles hyphens.
I tried this out with underscores, which produces a better result:
# select * from ts_debug('english', 'CLOUD_PROXY_SEP19_T1_254__1695167380256');
alias | description | token | dictionaries | dictionary | lexemes
-----------+--------------------------+---------------+----------------+--------------+-----------------
asciiword | Word, all ASCII | CLOUD | {english_stem} | english_stem | {cloud}
blank | Space symbols | _ | {} | |
asciiword | Word, all ASCII | PROXY | {english_stem} | english_stem | {proxi}
blank | Space symbols | _ | {} | |
numword | Word, letters and digits | SEP19 | {simple} | simple | {sep19}
blank | Space symbols | _ | {} | |
numword | Word, letters and digits | T1 | {simple} | simple | {t1}
blank | Space symbols | _ | {} | |
uint | Unsigned integer | 254 | {simple} | simple | {254}
blank | Space symbols | __ | {} | |
uint | Unsigned integer | 1695167380256 | {simple} | simple | {1695167380256}
(11 rows)
# SELECT to_tsvector('english', 'CLOUD_PROXY__SEP19_T1-254__1695167380256') @@ to_tsquery('english','cloud_proxy:*');
 ?column?
----------
 t
(1 row)
> If that data format is being imposed on you
These are freeform fields, so we are not imposing a data format. However, I wonder if we can do some data "massaging" to deal with hyphens. Writing our own parser is definitely overkill for this edge case.
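[Editor's note: a rough sketch of such massaging, assuming the hyphens in
these freeform fields carry no meaning: fold them to spaces on the document
side and use a phrase query instead of the hyphenated prefix:]

# SELECT to_tsvector('english',
           translate('CLOUD-PROXY-SEP19-T1-254--1695167380256', '-', ' '))
       @@ to_tsquery('english', 'cloud <-> proxy');
 ?column?
----------
 t
(1 row)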
Really appreciate this insight; it was very helpful! We can close this ticket.
Thanks,
Patrick