Re: BUG #15488: Unexpected behaviour of to_tsverctor and ts_query - Mailing list pgsql-bugs
From | Arthur Zakirov |
---|---|
Subject | Re: BUG #15488: Unexpected behaviour of to_tsverctor and ts_query |
Date | |
Msg-id | dc4c5f92-0eeb-2d5e-32af-653b80816c31@postgrespro.ru Whole thread Raw |
In response to | BUG #15488: Unexpected behaviour of to_tsverctor and ts_query (PG Bug reporting form <noreply@postgresql.org>) |
List | pgsql-bugs |
Hello, On 06.11.2018 15:51, PG Bug reporting form wrote: > I am trying to do prefix search of emails using full text search. Emails > sometimes contain dot symbols. > Now, select to_tsquery('simple', 'thing.a:*'); returns a single term as > expected but > select to_tsquery('simple', 'the.thing.a:*') suddenly tokenizes my search > expression into 'the.thing':* & 'a':*. At the same time select > to_tsquery('simple', 'the.thing.aa:*'); tokenizes as expected into a single > term. > Also select to_tsquery('simple', 'the.thing@gmail.com'); returns a single > term as expected, but select to_tsquery('simple', 'the.thing@gma:*'); splits > on @ char and returns two tokens. > This behavior is not covered in the official documentation. Yeah, it seems strange, but full text search parser thinks that each label of a host should have at least two characters. In first example it thinks that it is a file name: =# select * from ts_debug('simple', 'thing.a:*'); alias | description | token | dictionaries | dictionary | lexemes -------+-------------------+---------+--------------+------------+----------- file | File or path name | thing.a | {simple} | simple | {thing.a} blank | Space symbols | :* | {} | (null) | (null) Next, it thinks that only first two label are only part of host name: =# select * from ts_debug('simple', 'the.thing.a:*'); alias | description | token | dictionaries | dictionary | lexemes -----------+-----------------+-----------+--------------+------------+------------- host | Host | the.thing | {simple} | simple | {the.thing} blank | Space symbols | . | {} | (null) | (null) asciiword | Word, all ASCII | a | {simple} | simple | {a} blank | Space symbols | :* | {} | (null) | (null) And next is host name: =# select * from ts_debug('simple', 'the.thing.aa:*'); alias | description | token | dictionaries | dictionary | lexemes -------+---------------+--------------+--------------+------------+---------------- host | Host | the.thing.aa | {simple} | simple | {the.thing.aa} blank | Space symbols | :* | {} | (null) | (null) It seems that the RFC doesn't have such restriction, if I'm not mistaken. See: https://tools.ietf.org/html/rfc1034#section-3.1 And same for emails: =# select * from ts_debug('simple', 'the.thing@gmail.com'); alias | description | token | dictionaries | dictionary | lexemes -------+---------------+---------------------+--------------+------------+----------------------- email | Email address | the.thing@gmail.com | {simple} | simple | {the.thing@gmail.com} =# select * from ts_debug('simple', 'the.thing@gmail.c:*'); alias | description | token | dictionaries | dictionary | lexemes -------+-------------------+-----------+--------------+------------+------------- host | Host | the.thing | {simple} | simple | {the.thing} blank | Space symbols | @ | {} | (null) | (null) file | File or path name | gmail.c | {simple} | simple | {gmail.c} blank | Space symbols | :* | {} | (null) | (null) =# select * from ts_debug('simple', 'the.thing@gma:*'); alias | description | token | dictionaries | dictionary | lexemes -----------+-----------------+-----------+--------------+------------+------------- host | Host | the.thing | {simple} | simple | {the.thing} blank | Space symbols | @ | {} | (null) | (null) asciiword | Word, all ASCII | gma | {simple} | simple | {gma} blank | Space symbols | :* | {} | (null) | (null) It seems that it can be easily fixed, but I'm not sure that it won't break something. If you have such host names only in prefix queries (not in documents itself) I may suggest the following workaround: =# select to_tsvector('simple', 'the.thing@gmail.com') @@ 'the.thing@gma:*'::tsquery; ?column? ---------- t -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
pgsql-bugs by date: