Home > mailing lists

Re: BUG #18149: Incorrect lexeme for english token "proxy" - Mailing list pgsql-bugs

From	Patrick Peralta
Subject	Re: BUG #18149: Incorrect lexeme for english token "proxy"
Date	October 7, 2023 21:22:58
Msg-id	CADV9oGz+JRYg2n1X2Jzknr-2YSAjcGT9FJPY0wiMNBnV+ZJUgw@mail.gmail.com Whole thread Raw
In response to	Re: BUG #18149: Incorrect lexeme for english token "proxy" (Tom Lane <tgl@sss.pgh.pa.us>)
List	pgsql-bugs

Tree view

So the difficulty is that (a) the default TS parser doesn't break down
this multiply-hyphenated word quite the way you'd hoped

I see, thanks for pointing this out. Now I understand that this isn't a lexeme issue; rather it's a side effect of how the parser handles hyphens.

I tried this out with underscores which produces a better result:

# select * from ts_debug('english', 'CLOUD_PROXY_SEP19_T1_254__1695167380256');
alias | description | token | dictionaries | dictionary | lexemes
-----------+--------------------------+---------------+----------------+--------------+-----------------
asciiword | Word, all ASCII | CLOUD | {english_stem} | english_stem | {cloud}
blank | Space symbols | _ | {} | |
asciiword | Word, all ASCII | PROXY | {english_stem} | english_stem | {proxi}
blank | Space symbols | _ | {} | |
numword | Word, letters and digits | SEP19 | {simple} | simple | {sep19}
blank | Space symbols | _ | {} | |
numword | Word, letters and digits | T1 | {simple} | simple | {t1}
blank | Space symbols | _ | {} | |
uint | Unsigned integer | 254 | {simple} | simple | {254}
blank | Space symbols | __ | {} | |
uint | Unsigned integer | 1695167380256 | {simple} | simple | {1695167380256}
(11 rows)

# SELECT to_tsvector('english', 'CLOUD_PROXY__SEP19_T1-254__1695167380256') @@ to_tsquery('english','cloud_proxy:*');
?column?
----------
t

If that data format is being imposed on you

These are freeform fields so we are not imposing a data format. However I wonder if we can do some data "massaging" to deal with hyphens. Writing our own parser is definitely overkill for this edge condition.

Really appreciate this insight, it was very helpful! We can close this ticket.

Thanks,

Patrick

On Sat, Oct 7, 2023 at 12:37 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Patrick Peralta <pperalta@gmail.com> writes:
> However I ran into an anomaly with this query:

> # SELECT to_tsvector('english', 'CLOUD-PROXY-SEP19-T1-254--1695167380256')
> @@ to_tsquery('english','cloud-proxy:*');
> ?column?
> ----------
> f
> (1 row)

Hmm. Investigating that a bit:

regression=# select * from ts_debug('english', 'cloud-proxy');
alias | description | token | dictionaries | dictionary | lexemes
-----------------+---------------------------------+-------------+----------------+--------------+---------------
asciihword | Hyphenated word, all ASCII | cloud-proxy | {english_stem} | english_stem | {cloud-proxi}
hword_asciipart | Hyphenated word part, all ASCII | cloud | {english_stem} | english_stem | {cloud}
blank | Space symbols | - | {} | |
hword_asciipart | Hyphenated word part, all ASCII | proxy | {english_stem} | english_stem | {proxi}
(4 rows)

regression=# select * from ts_debug('english', 'CLOUD-PROXY-SEP19-T1-254--1695167380256');
alias | description | token | dictionaries | dictionary | lexemes
-----------------+------------------------------------------+----------------------+----------------+--------------+------------------------
numhword | Hyphenated word, letters and digits | CLOUD-PROXY-SEP19-T1 | {simple} | simple | {cloud-proxy-sep19-t1}
hword_asciipart | Hyphenated word part, all ASCII | CLOUD | {english_stem} | english_stem | {cloud}
blank | Space symbols | - | {} | |
hword_asciipart | Hyphenated word part, all ASCII | PROXY | {english_stem} | english_stem | {proxi}
blank | Space symbols | - | {} | |
hword_numpart | Hyphenated word part, letters and digits | SEP19 | {simple} | simple | {sep19}
blank | Space symbols | - | {} | |
hword_numpart | Hyphenated word part, letters and digits | T1 | {simple} | simple | {t1}
blank | Space symbols | - | {} | |
uint | Unsigned integer | 254 | {simple} | simple | {254}
blank | Space symbols | - | {} | |
int | Signed integer | -1695167380256 | {simple} | simple | {-1695167380256}
(12 rows)

So the difficulty is that (a) the default TS parser doesn't break down
this multiply-hyphenated word quite the way you'd hoped, and (b) fragments
classified as numhword aren't passed through the english_stem dictionary
at all. Also, (c) I'm doubtful that the snowball stemmer would have
converted cloud-proxy-sep19-t1 to cloud-proxi-sep19-t1; but it didn't get
the chance anyway.

While (b) would be easy to address with a custom TS configuration,
(a) and (c) can't be fixed without getting your hands dirty in
C code. Is there any chance of adjusting the notation you're dealing
with here? I get sane-looking results from, for example,

regression=# select to_tsvector('english', 'CLOUD-PROXY--SEP19-T1-254--1695167380256');
to_tsvector
----------------------------------------------------------------------------------------------
'-1695167380256':8 '254':7 'cloud':2 'cloud-proxi':1 'proxi':3 'sep19':5 'sep19-t1':4 't1':6
(1 row)

If that data format is being imposed on you then I'm not seeing a good
solution without custom C code. I'd be inclined to try to make the
parser generate all of "cloud-proxy-sep19-t1", "cloud-proxy-sep19",
"cloud-proxy" from this input, but a custom TS parser is kind of a
high bar to clear.

regards, tom lane

pgsql-bugs by date:

From: Tom Lane
Date: 07 October 2023, 19:37:37
Subject: Re: BUG #18149: Incorrect lexeme for english token "proxy"

From: Richard Guo
Date: 08 October 2023, 11:26:43
Subject: Re: BUG #17540: Prepared statement: PG switches to a generic query plan which is consistently much slower

Re: BUG #18149: Incorrect lexeme for english token "proxy" - Mailing list pgsql-bugs

Previous

Next