Re: BUG #18149: Incorrect lexeme for english token "proxy" - Mailing list pgsql-bugs

From Tom Lane
Subject Re: BUG #18149: Incorrect lexeme for english token "proxy"
Date
Msg-id 3149021.1696696657@sss.pgh.pa.us
Whole thread Raw
In response to BUG #18149: Incorrect lexeme for english token "proxy"  (PG Bug reporting form <noreply@postgresql.org>)
Responses Re: BUG #18149: Incorrect lexeme for english token "proxy"
List pgsql-bugs
Patrick Peralta <pperalta@gmail.com> writes:
> However I ran into an anomaly with this query:

> # SELECT to_tsvector('english', 'CLOUD-PROXY-SEP19-T1-254--1695167380256')
> @@ to_tsquery('english','cloud-proxy:*');
>  ?column?
> ----------
>  f
> (1 row)

Hmm.  Investigating that a bit:

regression=# select * from ts_debug('english', 'cloud-proxy');
      alias      |           description           |    token    |  dictionaries  |  dictionary  |    lexemes
-----------------+---------------------------------+-------------+----------------+--------------+---------------
 asciihword      | Hyphenated word, all ASCII      | cloud-proxy | {english_stem} | english_stem | {cloud-proxi}
 hword_asciipart | Hyphenated word part, all ASCII | cloud       | {english_stem} | english_stem | {cloud}
 blank           | Space symbols                   | -           | {}             |              |
 hword_asciipart | Hyphenated word part, all ASCII | proxy       | {english_stem} | english_stem | {proxi}
(4 rows)

regression=# select * from ts_debug('english', 'CLOUD-PROXY-SEP19-T1-254--1695167380256');
      alias      |               description                |        token         |  dictionaries  |  dictionary  |
   lexemes          

-----------------+------------------------------------------+----------------------+----------------+--------------+------------------------
 numhword        | Hyphenated word, letters and digits      | CLOUD-PROXY-SEP19-T1 | {simple}       | simple       |
{cloud-proxy-sep19-t1}
 hword_asciipart | Hyphenated word part, all ASCII          | CLOUD                | {english_stem} | english_stem |
{cloud}
 blank           | Space symbols                            | -                    | {}             |              |
 hword_asciipart | Hyphenated word part, all ASCII          | PROXY                | {english_stem} | english_stem |
{proxi}
 blank           | Space symbols                            | -                    | {}             |              |
 hword_numpart   | Hyphenated word part, letters and digits | SEP19                | {simple}       | simple       |
{sep19}
 blank           | Space symbols                            | -                    | {}             |              |
 hword_numpart   | Hyphenated word part, letters and digits | T1                   | {simple}       | simple       |
{t1}
 blank           | Space symbols                            | -                    | {}             |              |
 uint            | Unsigned integer                         | 254                  | {simple}       | simple       |
{254}
 blank           | Space symbols                            | -                    | {}             |              |
 int             | Signed integer                           | -1695167380256       | {simple}       | simple       |
{-1695167380256}
(12 rows)

So the difficulty is that (a) the default TS parser doesn't break down
this multiply-hyphenated word quite the way you'd hoped, and (b) fragments
classified as numhword aren't passed through the english_stem dictionary
at all.  Also, (c) I'm doubtful that the snowball stemmer would have
converted cloud-proxy-sep19-t1 to cloud-proxi-sep19-t1; but it didn't get
the chance anyway.

While (b) would be easy to address with a custom TS configuration,
(a) and (c) can't be fixed without getting your hands dirty in
C code.  Is there any chance of adjusting the notation you're dealing
with here?  I get sane-looking results from, for example,

regression=# select to_tsvector('english', 'CLOUD-PROXY--SEP19-T1-254--1695167380256');
                                         to_tsvector
----------------------------------------------------------------------------------------------
 '-1695167380256':8 '254':7 'cloud':2 'cloud-proxi':1 'proxi':3 'sep19':5 'sep19-t1':4 't1':6
(1 row)

If that data format is being imposed on you then I'm not seeing a good
solution without custom C code.  I'd be inclined to try to make the
parser generate all of "cloud-proxy-sep19-t1", "cloud-proxy-sep19",
"cloud-proxy" from this input, but a custom TS parser is kind of a
high bar to clear.

            regards, tom lane



pgsql-bugs by date:

Previous
From: Patrick Peralta
Date:
Subject: Re: BUG #18149: Incorrect lexeme for english token "proxy"
Next
From: Patrick Peralta
Date:
Subject: Re: BUG #18149: Incorrect lexeme for english token "proxy"