Re: BUG #15488: Unexpected behaviour of to_tsverctor and ts_query - Mailing list pgsql-bugs

From Arthur Zakirov
Subject Re: BUG #15488: Unexpected behaviour of to_tsverctor and ts_query
Date
Msg-id dc4c5f92-0eeb-2d5e-32af-653b80816c31@postgrespro.ru
Whole thread Raw
In response to BUG #15488: Unexpected behaviour of to_tsverctor and ts_query  (PG Bug reporting form <noreply@postgresql.org>)
List pgsql-bugs
Hello,

On 06.11.2018 15:51, PG Bug reporting form wrote:
> I am trying to do prefix search of emails using full text search. Emails
> sometimes contain dot symbols.
> Now, select to_tsquery('simple', 'thing.a:*'); returns a single term as
> expected but
> select to_tsquery('simple', 'the.thing.a:*')  suddenly tokenizes my search
> expression into 'the.thing':* & 'a':*. At the same time select
> to_tsquery('simple', 'the.thing.aa:*'); tokenizes as expected into a single
> term.
> Also select to_tsquery('simple', 'the.thing@gmail.com'); returns a single
> term as expected, but select to_tsquery('simple', 'the.thing@gma:*'); splits
> on @ char and returns two tokens.
> This behavior is not covered in the official documentation.

Yeah, it seems strange, but full text search parser thinks that each 
label of a host should have at least two characters.
In first example it thinks that it is a file name:

=# select * from ts_debug('simple', 'thing.a:*');
  alias |    description    |  token  | dictionaries | dictionary | 
lexemes
-------+-------------------+---------+--------------+------------+-----------
  file  | File or path name | thing.a | {simple}     | simple     | 
{thing.a}
  blank | Space symbols     | :*      | {}           | (null)     | (null)

Next, it thinks that only first two label are only part of host name:

=# select * from ts_debug('simple', 'the.thing.a:*');
    alias   |   description   |   token   | dictionaries | dictionary | 
  lexemes
-----------+-----------------+-----------+--------------+------------+-------------
  host      | Host            | the.thing | {simple}     | simple     | 
{the.thing}
  blank     | Space symbols   | .         | {}           | (null)     | 
(null)
  asciiword | Word, all ASCII | a         | {simple}     | simple     | {a}
  blank     | Space symbols   | :*        | {}           | (null)     | 
(null)

And next is host name:

=# select * from ts_debug('simple', 'the.thing.aa:*');
  alias |  description  |    token     | dictionaries | dictionary | 
lexemes
-------+---------------+--------------+--------------+------------+----------------
  host  | Host          | the.thing.aa | {simple}     | simple     | 
{the.thing.aa}
  blank | Space symbols | :*           | {}           | (null)     | (null)

It seems that the RFC doesn't have such restriction, if I'm not 
mistaken. See:
https://tools.ietf.org/html/rfc1034#section-3.1

And same for emails:

=# select * from ts_debug('simple', 'the.thing@gmail.com');
  alias |  description  |        token        | dictionaries | 
dictionary |        lexemes
-------+---------------+---------------------+--------------+------------+-----------------------
  email | Email address | the.thing@gmail.com | {simple}     | simple 
  | {the.thing@gmail.com}

=# select * from ts_debug('simple', 'the.thing@gmail.c:*');
  alias |    description    |   token   | dictionaries | dictionary | 
lexemes
-------+-------------------+-----------+--------------+------------+-------------
  host  | Host              | the.thing | {simple}     | simple     | 
{the.thing}
  blank | Space symbols     | @         | {}           | (null)     | (null)
  file  | File or path name | gmail.c   | {simple}     | simple     | 
{gmail.c}
  blank | Space symbols     | :*        | {}           | (null)     | (null)

=# select * from ts_debug('simple', 'the.thing@gma:*');
    alias   |   description   |   token   | dictionaries | dictionary | 
  lexemes
-----------+-----------------+-----------+--------------+------------+-------------
  host      | Host            | the.thing | {simple}     | simple     | 
{the.thing}
  blank     | Space symbols   | @         | {}           | (null)     | 
(null)
  asciiword | Word, all ASCII | gma       | {simple}     | simple     | 
{gma}
  blank     | Space symbols   | :*        | {}           | (null)     | 
(null)


It seems that it can be easily fixed, but I'm not sure that it won't 
break something.
If you have such host names only in prefix queries (not in documents 
itself) I may suggest the following workaround:

=# select to_tsvector('simple', 'the.thing@gmail.com') @@ 
'the.thing@gma:*'::tsquery;
  ?column?
----------
  t

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


pgsql-bugs by date:

Previous
From: Etsuro Fujita
Date:
Subject: Re: BUG #15449: file_fdw using program cause exit code error whenusing LIMIT
Next
From: Arthur Zakirov
Date:
Subject: Re: BUG #15491: index on function not being used for full text searchwhen querying through a view