Re: BUG #18149: Incorrect lexeme for english token "proxy" - Mailing list pgsql-bugs

From Patrick Peralta
Subject Re: BUG #18149: Incorrect lexeme for english token "proxy"
Date
Msg-id CADV9oGz+JRYg2n1X2Jzknr-2YSAjcGT9FJPY0wiMNBnV+ZJUgw@mail.gmail.com
Whole thread Raw
In response to Re: BUG #18149: Incorrect lexeme for english token "proxy"  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-bugs
So the difficulty is that (a) the default TS parser doesn't break down
this multiply-hyphenated word quite the way you'd hoped

I see, thanks for pointing this out. Now I understand that this isn't a lexeme issue; rather it's a side effect of how the parser handles hyphens.

I tried this out with underscores which produces a better result:

# select * from ts_debug('english', 'CLOUD_PROXY_SEP19_T1_254__1695167380256');
   alias   |       description        |     token     |  dictionaries  |  dictionary  |     lexemes
-----------+--------------------------+---------------+----------------+--------------+-----------------
 asciiword | Word, all ASCII          | CLOUD         | {english_stem} | english_stem | {cloud}
 blank     | Space symbols            | _             | {}             |              |
 asciiword | Word, all ASCII          | PROXY         | {english_stem} | english_stem | {proxi}
 blank     | Space symbols            | _             | {}             |              |
 numword   | Word, letters and digits | SEP19         | {simple}       | simple       | {sep19}
 blank     | Space symbols            | _             | {}             |              |
 numword   | Word, letters and digits | T1            | {simple}       | simple       | {t1}
 blank     | Space symbols            | _             | {}             |              |
 uint      | Unsigned integer         | 254           | {simple}       | simple       | {254}
 blank     | Space symbols            | __            | {}             |              |
 uint      | Unsigned integer         | 1695167380256 | {simple}       | simple       | {1695167380256}
(11 rows)

# SELECT to_tsvector('english', 'CLOUD_PROXY__SEP19_T1-254__1695167380256') @@ to_tsquery('english','cloud_proxy:*');
 ?column?
----------
 t

If that data format is being imposed on you

These are freeform fields so we are not imposing a data format. However I wonder if we can do some data "massaging" to deal with hyphens. Writing our own parser is definitely overkill for this edge condition. 

Really appreciate this insight, it was very helpful! We can close this ticket.

Thanks,
Patrick


On Sat, Oct 7, 2023 at 12:37 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Patrick Peralta <pperalta@gmail.com> writes:
> However I ran into an anomaly with this query:

> # SELECT to_tsvector('english', 'CLOUD-PROXY-SEP19-T1-254--1695167380256')
> @@ to_tsquery('english','cloud-proxy:*');
>  ?column?
> ----------
>  f
> (1 row)

Hmm.  Investigating that a bit:

regression=# select * from ts_debug('english', 'cloud-proxy');
      alias      |           description           |    token    |  dictionaries  |  dictionary  |    lexemes   
-----------------+---------------------------------+-------------+----------------+--------------+---------------
 asciihword      | Hyphenated word, all ASCII      | cloud-proxy | {english_stem} | english_stem | {cloud-proxi}
 hword_asciipart | Hyphenated word part, all ASCII | cloud       | {english_stem} | english_stem | {cloud}
 blank           | Space symbols                   | -           | {}             |              |
 hword_asciipart | Hyphenated word part, all ASCII | proxy       | {english_stem} | english_stem | {proxi}
(4 rows)

regression=# select * from ts_debug('english', 'CLOUD-PROXY-SEP19-T1-254--1695167380256');
      alias      |               description                |        token         |  dictionaries  |  dictionary  |        lexemes         
-----------------+------------------------------------------+----------------------+----------------+--------------+------------------------
 numhword        | Hyphenated word, letters and digits      | CLOUD-PROXY-SEP19-T1 | {simple}       | simple       | {cloud-proxy-sep19-t1}
 hword_asciipart | Hyphenated word part, all ASCII          | CLOUD                | {english_stem} | english_stem | {cloud}
 blank           | Space symbols                            | -                    | {}             |              |
 hword_asciipart | Hyphenated word part, all ASCII          | PROXY                | {english_stem} | english_stem | {proxi}
 blank           | Space symbols                            | -                    | {}             |              |
 hword_numpart   | Hyphenated word part, letters and digits | SEP19                | {simple}       | simple       | {sep19}
 blank           | Space symbols                            | -                    | {}             |              |
 hword_numpart   | Hyphenated word part, letters and digits | T1                   | {simple}       | simple       | {t1}
 blank           | Space symbols                            | -                    | {}             |              |
 uint            | Unsigned integer                         | 254                  | {simple}       | simple       | {254}
 blank           | Space symbols                            | -                    | {}             |              |
 int             | Signed integer                           | -1695167380256       | {simple}       | simple       | {-1695167380256}
(12 rows)

So the difficulty is that (a) the default TS parser doesn't break down
this multiply-hyphenated word quite the way you'd hoped, and (b) fragments
classified as numhword aren't passed through the english_stem dictionary
at all.  Also, (c) I'm doubtful that the snowball stemmer would have
converted cloud-proxy-sep19-t1 to cloud-proxi-sep19-t1; but it didn't get
the chance anyway.

While (b) would be easy to address with a custom TS configuration,
(a) and (c) can't be fixed without getting your hands dirty in
C code.  Is there any chance of adjusting the notation you're dealing
with here?  I get sane-looking results from, for example,

regression=# select to_tsvector('english', 'CLOUD-PROXY--SEP19-T1-254--1695167380256');
                                         to_tsvector                                         
----------------------------------------------------------------------------------------------
 '-1695167380256':8 '254':7 'cloud':2 'cloud-proxi':1 'proxi':3 'sep19':5 'sep19-t1':4 't1':6
(1 row)

If that data format is being imposed on you then I'm not seeing a good
solution without custom C code.  I'd be inclined to try to make the
parser generate all of "cloud-proxy-sep19-t1", "cloud-proxy-sep19",
"cloud-proxy" from this input, but a custom TS parser is kind of a
high bar to clear.

                        regards, tom lane

pgsql-bugs by date:

Previous
From: Tom Lane
Date:
Subject: Re: BUG #18149: Incorrect lexeme for english token "proxy"
Next
From: Richard Guo
Date:
Subject: Re: BUG #17540: Prepared statement: PG switches to a generic query plan which is consistently much slower