Re: BUG #6375: tsearch does not recognize all valid emails - Mailing list pgsql-bugs
From | Bruce Momjian |
---|---|
Subject | Re: BUG #6375: tsearch does not recognize all valid emails |
Date | |
Msg-id | 20120207174138.GL19450@momjian.us |
In response to | BUG #6375: tsearch does not recognize all valid emails (valgog@gmail.com) |
Responses | Re: BUG #6375: tsearch does not recognize all valid emails |
List | pgsql-bugs |
On Tue, Jan 03, 2012 at 06:04:23PM +0000, valgog@gmail.com wrote:
> The following bug has been logged on the website:
>
> Bug reference:      6375
> Logged by:          Valentine Gogichashvili
> Email address:      valgog@gmail.com
> PostgreSQL version: 9.1.1
> Operating system:   Debian 4.4.5-8
> Description:
>
> Hello,
>
> default tsearch parser does not recognize all valid email addresses and
> tokenizes them as text, splitting into tokens.
>
> For example:
>
> postgres=# select to_tsquery('simple', 'normal@email.com' );
>      to_tsquery
> ────────────────────
>  'normal@email.com'
> (1 row)
>
> here it behaves ok;
>
> postgres=# select to_tsquery('simple', '-still-normal@email.com' );
>         to_tsquery
> ──────────────────────────
>  'still-normal@email.com'
> (1 row)
>
> here it trims '-' from the beginning of an email. This is not correct, but
> will at least find that email.
>
> postgres=# select to_tsquery('simple', '-not-normal-with-dash-@email.com'
> );
>                                 to_tsquery
> ───────────────────────────────────────────────────────────────────────────
>  'not-normal-with-dash' & 'not' & 'normal' & 'with' & 'dash' & 'email.com'
> (1 row)
>
> and this is now a real problem as it leads to finding emails that are not
> the same, but are "super-sets" of that one.
>
> Valid email characters, that are not correctly treated also are at least '+'
> and '.'

Yep. :-(  You can see the oddness here:

	test=> SELECT alias, description, token FROM ts_debug('-myname@gmail.com');
	 alias |  description  |      token
	-------+---------------+------------------
	 blank | Space symbols | -
	 email | Email address | myname@gmail.com
	(2 rows)

	test=> SELECT alias, description, token FROM ts_debug('-myna-me@gmail.com');
	 alias |  description  |       token
	-------+---------------+-------------------
	 blank | Space symbols | -
	 email | Email address | myna-me@gmail.com
	(2 rows)

	test=> SELECT alias, description, token FROM ts_debug('-myna-me-@gmail.com');
	      alias      |           description           |   token
	-----------------+---------------------------------+-----------
	 blank           | Space symbols                   | -
	 asciihword      | Hyphenated word, all ASCII      | myna-me
	 hword_asciipart | Hyphenated word part, all ASCII | myna
	 blank           | Space symbols                   | -
	 hword_asciipart | Hyphenated word part, all ASCII | me
	 blank           | Space symbols                   | -@
	 host            | Host                            | gmail.com
	(7 rows)

The first and second show that the leading dash is split off as a separate
token. The third one shows that a trailing dash causes the middle dash to be
split off as well, so the address is no longer recognized as an email at all.
This email thread from 2010 has a similar problem:

	http://archives.postgresql.org/pgsql-hackers/2010-10/msg00772.php

What is limiting a fix for this is the breaking of existing behavior,
and the breaking of indexes used during pg_upgrade.

I have added your email to the existing TODO item:

	http://wiki.postgresql.org/wiki/Todo#Text_Search

	Improve handling of dash and plus signs in email address user names, and
	perhaps improve URL parsing

	    http://archives.postgresql.org/pgsql-hackers/2010-10/msg00772.php
	    tsearch does not recognize all valid emails

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +
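To illustrate the "super-set" matching the report complains about, here is a minimal Python sketch. It is a toy model, not PostgreSQL's parser or its `@@` operator: it only assumes that a query of the form `'a' & 'b' & ...` matches whenever every term is present in the document. The query terms are copied from the `to_tsquery` output in the report; the function name `matches_all` and the three example documents are hypothetical and exist only for illustration.

```python
# Toy model only: this is NOT PostgreSQL's tokenizer or matcher.
# Assumption: an AND query matches when every query term is present.

def matches_all(query_terms, document_terms):
    """Return True when every query term occurs among the document's terms."""
    return set(query_terms) <= set(document_terms)

# Terms copied from the to_tsquery output quoted in the bug report.
query_terms = ["not-normal-with-dash", "not", "normal", "with", "dash", "email.com"]

# Hypothetical documents (assumptions for illustration, not from the thread).
same_address_doc = set(query_terms)              # exactly the same terms
other_doc = set(query_terms) | {"and", "more"}   # a super-set of those terms
unrelated_doc = {"normal", "email.com"}          # only some of the terms

print(matches_all(query_terms, same_address_doc))  # True
print(matches_all(query_terms, other_doc))         # True: super-set also matches
print(matches_all(query_terms, unrelated_doc))     # False
```

Once the address has been split into word parts, the query can no longer tell the original address apart from any text that happens to contain all of those parts, which is why keeping the address as a single email token matters.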