Thread: BUG #15488: Unexpected behaviour of to_tsverctor and ts_query

BUG #15488: Unexpected behaviour of to_tsverctor and ts_query

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      15488
Logged by:          Nikolajs Arhipovs
Email address:      nikolajs.arhipovs@gmail.com
PostgreSQL version: 9.6.8
Operating system:   Linux, kernel 4.18.16-arch1-1-ARCH
Description:

I am trying to do prefix search of emails using full text search. Emails
sometimes contain dot symbols.
Now, select to_tsquery('simple', 'thing.a:*'); returns a single term as
expected but 
select to_tsquery('simple', 'the.thing.a:*')  suddenly tokenizes my search
expression into 'the.thing':* & 'a':*. At the same time select
to_tsquery('simple', 'the.thing.aa:*'); tokenizes as expected into a single
term.
Also select to_tsquery('simple', 'the.thing@gmail.com'); returns a single
term as expected, but select to_tsquery('simple', 'the.thing@gma:*'); splits
on @ char and returns two tokens.
This behavior is not covered in the official documentation.


Re: BUG #15488: Unexpected behaviour of to_tsverctor and ts_query

From
Arthur Zakirov
Date:
Hello,

On 06.11.2018 15:51, PG Bug reporting form wrote:
> I am trying to do prefix search of emails using full text search. Emails
> sometimes contain dot symbols.
> Now, select to_tsquery('simple', 'thing.a:*'); returns a single term as
> expected but
> select to_tsquery('simple', 'the.thing.a:*')  suddenly tokenizes my search
> expression into 'the.thing':* & 'a':*. At the same time select
> to_tsquery('simple', 'the.thing.aa:*'); tokenizes as expected into a single
> term.
> Also select to_tsquery('simple', 'the.thing@gmail.com'); returns a single
> term as expected, but select to_tsquery('simple', 'the.thing@gma:*'); splits
> on @ char and returns two tokens.
> This behavior is not covered in the official documentation.

Yeah, it seems strange, but full text search parser thinks that each 
label of a host should have at least two characters.
In first example it thinks that it is a file name:

=# select * from ts_debug('simple', 'thing.a:*');
  alias |    description    |  token  | dictionaries | dictionary | 
lexemes
-------+-------------------+---------+--------------+------------+-----------
  file  | File or path name | thing.a | {simple}     | simple     | 
{thing.a}
  blank | Space symbols     | :*      | {}           | (null)     | (null)

Next, it thinks that only first two label are only part of host name:

=# select * from ts_debug('simple', 'the.thing.a:*');
    alias   |   description   |   token   | dictionaries | dictionary | 
  lexemes
-----------+-----------------+-----------+--------------+------------+-------------
  host      | Host            | the.thing | {simple}     | simple     | 
{the.thing}
  blank     | Space symbols   | .         | {}           | (null)     | 
(null)
  asciiword | Word, all ASCII | a         | {simple}     | simple     | {a}
  blank     | Space symbols   | :*        | {}           | (null)     | 
(null)

And next is host name:

=# select * from ts_debug('simple', 'the.thing.aa:*');
  alias |  description  |    token     | dictionaries | dictionary | 
lexemes
-------+---------------+--------------+--------------+------------+----------------
  host  | Host          | the.thing.aa | {simple}     | simple     | 
{the.thing.aa}
  blank | Space symbols | :*           | {}           | (null)     | (null)

It seems that the RFC doesn't have such restriction, if I'm not 
mistaken. See:
https://tools.ietf.org/html/rfc1034#section-3.1

And same for emails:

=# select * from ts_debug('simple', 'the.thing@gmail.com');
  alias |  description  |        token        | dictionaries | 
dictionary |        lexemes
-------+---------------+---------------------+--------------+------------+-----------------------
  email | Email address | the.thing@gmail.com | {simple}     | simple 
  | {the.thing@gmail.com}

=# select * from ts_debug('simple', 'the.thing@gmail.c:*');
  alias |    description    |   token   | dictionaries | dictionary | 
lexemes
-------+-------------------+-----------+--------------+------------+-------------
  host  | Host              | the.thing | {simple}     | simple     | 
{the.thing}
  blank | Space symbols     | @         | {}           | (null)     | (null)
  file  | File or path name | gmail.c   | {simple}     | simple     | 
{gmail.c}
  blank | Space symbols     | :*        | {}           | (null)     | (null)

=# select * from ts_debug('simple', 'the.thing@gma:*');
    alias   |   description   |   token   | dictionaries | dictionary | 
  lexemes
-----------+-----------------+-----------+--------------+------------+-------------
  host      | Host            | the.thing | {simple}     | simple     | 
{the.thing}
  blank     | Space symbols   | @         | {}           | (null)     | 
(null)
  asciiword | Word, all ASCII | gma       | {simple}     | simple     | 
{gma}
  blank     | Space symbols   | :*        | {}           | (null)     | 
(null)


It seems that it can be easily fixed, but I'm not sure that it won't 
break something.
If you have such host names only in prefix queries (not in documents 
itself) I may suggest the following workaround:

=# select to_tsvector('simple', 'the.thing@gmail.com') @@ 
'the.thing@gma:*'::tsquery;
  ?column?
----------
  t

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company