Re: BUG #18149: Incorrect lexeme for english token "proxy" - Mailing list pgsql-bugs
From | Tom Lane |
---|---|
Subject | Re: BUG #18149: Incorrect lexeme for english token "proxy" |
Date | |
Msg-id | 3149021.1696696657@sss.pgh.pa.us |
In response to | BUG #18149: Incorrect lexeme for english token "proxy" (PG Bug reporting form <noreply@postgresql.org>) |
Responses | Re: BUG #18149: Incorrect lexeme for english token "proxy" |
List | pgsql-bugs |
Patrick Peralta <pperalta@gmail.com> writes:
> However I ran into an anomaly with this query:

> # SELECT to_tsvector('english', 'CLOUD-PROXY-SEP19-T1-254--1695167380256')
> @@ to_tsquery('english','cloud-proxy:*');
>  ?column?
> ----------
>  f
> (1 row)

Hmm.  Investigating that a bit:

regression=# select * from ts_debug('english', 'cloud-proxy');
      alias      |           description           |    token    |  dictionaries  |  dictionary  |    lexemes
-----------------+---------------------------------+-------------+----------------+--------------+---------------
 asciihword      | Hyphenated word, all ASCII      | cloud-proxy | {english_stem} | english_stem | {cloud-proxi}
 hword_asciipart | Hyphenated word part, all ASCII | cloud       | {english_stem} | english_stem | {cloud}
 blank           | Space symbols                   | -           | {}             |              |
 hword_asciipart | Hyphenated word part, all ASCII | proxy       | {english_stem} | english_stem | {proxi}
(4 rows)

regression=# select * from ts_debug('english', 'CLOUD-PROXY-SEP19-T1-254--1695167380256');
      alias      |               description                |        token         |  dictionaries  |  dictionary  |        lexemes
-----------------+------------------------------------------+----------------------+----------------+--------------+------------------------
 numhword        | Hyphenated word, letters and digits      | CLOUD-PROXY-SEP19-T1 | {simple}       | simple       | {cloud-proxy-sep19-t1}
 hword_asciipart | Hyphenated word part, all ASCII          | CLOUD                | {english_stem} | english_stem | {cloud}
 blank           | Space symbols                            | -                    | {}             |              |
 hword_asciipart | Hyphenated word part, all ASCII          | PROXY                | {english_stem} | english_stem | {proxi}
 blank           | Space symbols                            | -                    | {}             |              |
 hword_numpart   | Hyphenated word part, letters and digits | SEP19                | {simple}       | simple       | {sep19}
 blank           | Space symbols                            | -                    | {}             |              |
 hword_numpart   | Hyphenated word part, letters and digits | T1                   | {simple}       | simple       | {t1}
 blank           | Space symbols                            | -                    | {}             |              |
 uint            | Unsigned integer                         | 254                  | {simple}       | simple       | {254}
 blank           | Space symbols                            | -                    | {}             |              |
 int             | Signed integer                           | -1695167380256       | {simple}       | simple       | {-1695167380256}
(12 rows)

So the difficulty is that (a) the default TS parser doesn't break down
this multiply-hyphenated word quite the way you'd hoped, and (b)
fragments classified as numhword aren't passed through the english_stem
dictionary at all.  Also, (c) I'm doubtful that the snowball stemmer
would have converted cloud-proxy-sep19-t1 to cloud-proxi-sep19-t1;
but it didn't get the chance anyway.

While (b) would be easy to address with a custom TS configuration,
(a) and (c) can't be fixed without getting your hands dirty in
C code.

Is there any chance of adjusting the notation you're dealing with
here?  I get sane-looking results from, for example,

regression=# select to_tsvector('english', 'CLOUD-PROXY--SEP19-T1-254--1695167380256');
                                         to_tsvector
----------------------------------------------------------------------------------------------
 '-1695167380256':8 '254':7 'cloud':2 'cloud-proxi':1 'proxi':3 'sep19':5 'sep19-t1':4 't1':6
(1 row)

If that data format is being imposed on you then I'm not seeing a
good solution without custom C code.  I'd be inclined to try to
make the parser generate all of "cloud-proxy-sep19-t1",
"cloud-proxy-sep19", "cloud-proxy" from this input, but a custom
TS parser is kind of a high bar to clear.

			regards, tom lane
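
For reference, the custom TS configuration mentioned under (b) might look roughly like the sketch below. The configuration name english_numstem is hypothetical, and the choice to remap numword along with numhword and hword_numpart is an assumption rather than something stated in the message. This only routes those token types through english_stem; per points (a) and (c) above, it should not be expected to make the original prefix query match on its own.

```sql
-- Hypothetical configuration name; start from a copy of the built-in
-- english configuration so the other token mappings stay unchanged.
CREATE TEXT SEARCH CONFIGURATION english_numstem (COPY = pg_catalog.english);

-- Send mixed letter/digit tokens through english_stem instead of simple.
-- Remapping numword as well is an assumption, not part of the message.
ALTER TEXT SEARCH CONFIGURATION english_numstem
    ALTER MAPPING FOR numhword, hword_numpart, numword
    WITH english_stem;

-- Inspect how the problem string is now tokenized and which dictionary
-- handles each token.
SELECT alias, token, dictionary, lexemes
  FROM ts_debug('english_numstem', 'CLOUD-PROXY-SEP19-T1-254--1695167380256');
```

Comparing the ts_debug output before and after the remapping shows whether the stemmer changes the numhword lexeme at all, which is the open question in (c).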