Hello,
On 06.11.2018 15:51, PG Bug reporting form wrote:
> I am trying to do prefix search of emails using full text search. Emails
> sometimes contain dot symbols.
> Now, select to_tsquery('simple', 'thing.a:*'); returns a single term as
> expected but
> select to_tsquery('simple', 'the.thing.a:*') suddenly tokenizes my search
> expression into 'the.thing':* & 'a':*. At the same time select
> to_tsquery('simple', 'the.thing.aa:*'); tokenizes as expected into a single
> term.
> Also select to_tsquery('simple', 'the.thing@gmail.com'); returns a single
> term as expected, but select to_tsquery('simple', 'the.thing@gma:*'); splits
> on @ char and returns two tokens.
> This behavior is not covered in the official documentation.
Yeah, it seems strange, but full text search parser thinks that each
label of a host should have at least two characters.
In first example it thinks that it is a file name:
=# select * from ts_debug('simple', 'thing.a:*');
alias | description | token | dictionaries | dictionary |
lexemes
-------+-------------------+---------+--------------+------------+-----------
file | File or path name | thing.a | {simple} | simple |
{thing.a}
blank | Space symbols | :* | {} | (null) | (null)
Next, it thinks that only first two label are only part of host name:
=# select * from ts_debug('simple', 'the.thing.a:*');
alias | description | token | dictionaries | dictionary |
lexemes
-----------+-----------------+-----------+--------------+------------+-------------
host | Host | the.thing | {simple} | simple |
{the.thing}
blank | Space symbols | . | {} | (null) |
(null)
asciiword | Word, all ASCII | a | {simple} | simple | {a}
blank | Space symbols | :* | {} | (null) |
(null)
And next is host name:
=# select * from ts_debug('simple', 'the.thing.aa:*');
alias | description | token | dictionaries | dictionary |
lexemes
-------+---------------+--------------+--------------+------------+----------------
host | Host | the.thing.aa | {simple} | simple |
{the.thing.aa}
blank | Space symbols | :* | {} | (null) | (null)
It seems that the RFC doesn't have such restriction, if I'm not
mistaken. See:
https://tools.ietf.org/html/rfc1034#section-3.1
And same for emails:
=# select * from ts_debug('simple', 'the.thing@gmail.com');
alias | description | token | dictionaries |
dictionary | lexemes
-------+---------------+---------------------+--------------+------------+-----------------------
email | Email address | the.thing@gmail.com | {simple} | simple
| {the.thing@gmail.com}
=# select * from ts_debug('simple', 'the.thing@gmail.c:*');
alias | description | token | dictionaries | dictionary |
lexemes
-------+-------------------+-----------+--------------+------------+-------------
host | Host | the.thing | {simple} | simple |
{the.thing}
blank | Space symbols | @ | {} | (null) | (null)
file | File or path name | gmail.c | {simple} | simple |
{gmail.c}
blank | Space symbols | :* | {} | (null) | (null)
=# select * from ts_debug('simple', 'the.thing@gma:*');
alias | description | token | dictionaries | dictionary |
lexemes
-----------+-----------------+-----------+--------------+------------+-------------
host | Host | the.thing | {simple} | simple |
{the.thing}
blank | Space symbols | @ | {} | (null) |
(null)
asciiword | Word, all ASCII | gma | {simple} | simple |
{gma}
blank | Space symbols | :* | {} | (null) |
(null)
It seems that it can be easily fixed, but I'm not sure that it won't
break something.
If you have such host names only in prefix queries (not in documents
itself) I may suggest the following workaround:
=# select to_tsvector('simple', 'the.thing@gmail.com') @@
'the.thing@gma:*'::tsquery;
?column?
----------
t
--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company