Thread: BUG #6375: tsearch does not recognize all valid emails
The following bug has been logged on the website: Bug reference: 6375 Logged by: Valentine Gogichashvili Email address: valgog@gmail.com PostgreSQL version: 9.1.1 Operating system: Debian 4.4.5-8 Description:=20=20=20=20=20=20=20=20 Hello,=20 default tsearch parser does not recognize all valid email addresses and tokenizes them as text, splitting into tokens.=20 For example: postgres=3D# select to_tsquery('simple', 'normal@email.com' ); to_tsquery=20=20=20=20=20 =E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2= =94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94= =80=E2=94=80=E2=94=80=E2=94=80 'normal@email.com' (1 row) here it behaves ok; postgres=3D# select to_tsquery('simple', '-still-normal@email.com' ); to_tsquery=20=20=20=20=20=20=20=20 =E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2= =94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94= =80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80= =E2=94=80 'still-normal@email.com' (1 row) here it trims '-' from the beginning of an email. This is not correct, but will at least find that email. postgres=3D# select to_tsquery('simple', '-not-normal-with-dash-@email.com' ); to_tsquery=20=20=20=20=20=20=20=20=20=20= =20=20=20=20=20=20=20=20=20=20=20=20=20=20=20=20=20=20=20=20=20=20 =20=20 =E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2= =94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94= =80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80= =E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2= =94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94= =80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80= =E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2= =94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94= =80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80= =E2=94=80=E2=94=80=E2=94=80=E2=94=80 'not-normal-with-dash' & 'not' & 'normal' & 'with' & 'dash' & 'email.com' (1 row) and this is now a real problem as it leads to finding emails that are not the same, but are "super-sets" of that one. Valid email characters, that are not correctly treated also are at least '+' and '.' With my best regards,=20 -- Valentine Gogichashvili=20
On Tue, Jan 03, 2012 at 06:04:23PM +0000, valgog@gmail.com wrote: > The following bug has been logged on the website: > > Bug reference: 6375 > Logged by: Valentine Gogichashvili > Email address: valgog@gmail.com > PostgreSQL version: 9.1.1 > Operating system: Debian 4.4.5-8 > Description: > > Hello, > > default tsearch parser does not recognize all valid email addresses and > tokenizes them as text, splitting into tokens. > > For example: > > postgres=# select to_tsquery('simple', 'normal@email.com' ); > to_tsquery > ââââââââââââââââââââ > 'normal@email.com' > (1 row) > > here it behaves ok; > > postgres=# select to_tsquery('simple', '-still-normal@email.com' ); > to_tsquery > ââââââââââââââââââââââââââ > 'still-normal@email.com' > (1 row) > > here it trims '-' from the beginning of an email. This is not correct, but > will at least find that email. > > postgres=# select to_tsquery('simple', '-not-normal-with-dash-@email.com' > ); > to_tsquery > > âââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââ > 'not-normal-with-dash' & 'not' & 'normal' & 'with' & 'dash' & 'email.com' > (1 row) > > and this is now a real problem as it leads to finding emails that are not > the same, but are "super-sets" of that one. > > Valid email characters, that are not correctly treated also are at least '+' > and '.' Yep. :-( You can see the oddness here: test=> SELECT alias, description, token FROM ts_debug('-myname@gmail.com'); alias | description | token -------+---------------+------------------ blank | Space symbols | - email | Email address | myname@gmail.com (2 rows) test=> SELECT alias, description, token FROM ts_debug('-myna-me@gmail.com'); alias | description | token -------+---------------+------------------- blank | Space symbols | - email | Email address | myna-me@gmail.com (2 rows) test=> SELECT alias, description, token FROM ts_debug('-myna-me-@gmail.com'); alias | description | token -----------------+---------------------------------+----------- blank | Space symbols | - asciihword | Hyphenated word, all ASCII | myna-me hword_asciipart | Hyphenated word part, all ASCII | myna blank | Space symbols | - hword_asciipart | Hyphenated word part, all ASCII | me blank | Space symbols | -@ host | Host | gmail.com (7 rows) The first and second show that the leading-dash is separated. The third ones shows that a trailing dash causes the middle-dash to also be separated. This email thread from 2010 has a similar problem: http://archives.postgresql.org/pgsql-hackers/2010-10/msg00772.php What is limiting a fix for this is the breaking of existing behavior, and the breaking of indexes used during pg_upgrade. I have added your email to the existing TODO item: http://wiki.postgresql.org/wiki/Todo#Text_Search Improve handling of dash and plus signs in email address user names, and perhaps improve URL parsing http://archives.postgresql.org/pgsql-hackers/2010-10/msg00772.php tsearch does not recognize all valid emails -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
> > This email thread from 2010 has a similar problem: > http://archives.postgresql.org/pgsql-hackers/2010-10/msg00772.php > What is limiting a fix for this is the breaking of existing behavior, > and the breaking of indexes used during pg_upgrade. > I have added your email to the existing TODO item: > http://wiki.postgresql.org/wiki/Todo#Text_Search > Improve handling of dash and plus signs in email address user > names, and > perhaps improve URL parsing > > http://archives.postgresql.org/pgsql-hackers/2010-10/msg00772.php > tsearch does not recognize all valid emails Thank you Bruce, as an intermediate solution, so that people, who have the same problem and search the mailing archives, can get at least some way to overcome this issue, I am rewriting such emails when building tverctor, and use rewrite the tsearch queries as well: code from a function, that builds the tsvector: select (select string_agg( case when CASE WHEN n in ( 1, s ) -- all special outer chars should be escaped THEN c = ANY ('{.,!,#,$,%,&,'',*,+,/,=,?,_,`,"{",|,"}",~,^,-}'::text[]) ELSE c = ANY ('{!,#,$,%,&,'',*,+,/,=,?,`,"{",|,"}",~,^}'::text[]) END then 'BCHR' || ascii(c)::text || 'END' else c end, '') from ( select row_number() over() as n, count(1) over() as s, c from regexp_split_to_table(split_part(p, '@', 1 ), '') as e(c) ) as ee ) || '@' || split_part(p, '@', 2 ) into email from ( select p_customer_user_row.email::text as p ) as e; return to_tsvector('simple', coalesce( p_customer_user_row.first_name, '') ) || to_tsvector('simple', coalesce( p_customer_user_row.last_name, '') ) || to_tsvector('simple', coalesce( p_customer_user_row.customer_id, '') ) || to_tsvector('simple', coalesce( email, '') ); code from a function, that builds a tsquery: RETURN (select to_tsquery('simple', string_agg( case when p ~ '^[^@]+@[^@]+$' -- has only one @ inside then (select string_agg( case when CASE WHEN n in ( 1, s ) -- all special outer chars should be escaped THEN c = ANY ('{.,!,#,$,%,&,'',*,+,/,=,?,_,`,"{",|,"}",~,^,-}'::text[]) ELSE c = ANY ('{!,#,$,%,&,'',*,+,/,=,?,`,"{",|,"}",~,^}'::text[]) END then 'BCHR' || ascii(c)::text || 'END' else c end, '') from ( select row_number() over() as n, count(1) over() as s, c from regexp_split_to_table(split_part(p, '@', 1 ), '') as e(c) ) as ee ) || '@' || split_part(p, '@', 2 ) else (select string_agg(token, ' & ') from ( select unnest(lexemes) || ':*' as token from ts_debug('simple', p) ) as g ) end, ' & ' ) ) from regexp_split_to_table(btrim(search_text), E'\\s+') as s(p) ); Cheers, -- Valentine