Thread: BUG #6375: tsearch does not recognize all valid emails

BUG #6375: tsearch does not recognize all valid emails

From
valgog@gmail.com
Date:
The following bug has been logged on the website:

Bug reference:      6375
Logged by:          Valentine Gogichashvili
Email address:      valgog@gmail.com
PostgreSQL version: 9.1.1
Operating system:   Debian 4.4.5-8
Description:=20=20=20=20=20=20=20=20

Hello,=20

default tsearch parser does not recognize all valid email addresses and
tokenizes them as text, splitting into tokens.=20

For example:

postgres=3D# select to_tsquery('simple', 'normal@email.com' );
     to_tsquery=20=20=20=20=20
=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=
=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=
=80=E2=94=80=E2=94=80=E2=94=80
 'normal@email.com'
(1 row)

here it behaves ok;

postgres=3D# select to_tsquery('simple', '-still-normal@email.com' );
        to_tsquery=20=20=20=20=20=20=20=20
=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=
=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=
=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=
=E2=94=80
 'still-normal@email.com'
(1 row)

here it trims '-' from the beginning of an email. This is not correct, but
will at least find that email.

postgres=3D# select to_tsquery('simple', '-not-normal-with-dash-@email.com'
);
                                  to_tsquery=20=20=20=20=20=20=20=20=20=20=
=20=20=20=20=20=20=20=20=20=20=20=20=20=20=20=20=20=20=20=20=20=20
=20=20
=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=
=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=
=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=
=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=
=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=
=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=
=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=
=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=
=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=
=E2=94=80=E2=94=80=E2=94=80=E2=94=80
 'not-normal-with-dash' & 'not' & 'normal' & 'with' & 'dash' & 'email.com'
(1 row)

and this is now a real problem as it leads to finding emails that are not
the same, but are "super-sets" of that one.

Valid email characters, that are not correctly treated also are at least '+'
and '.'

With my best regards,=20

-- Valentine Gogichashvili=20

Re: BUG #6375: tsearch does not recognize all valid emails

From
Bruce Momjian
Date:
On Tue, Jan 03, 2012 at 06:04:23PM +0000, valgog@gmail.com wrote:
> The following bug has been logged on the website:
>
> Bug reference:      6375
> Logged by:          Valentine Gogichashvili
> Email address:      valgog@gmail.com
> PostgreSQL version: 9.1.1
> Operating system:   Debian 4.4.5-8
> Description:
>
> Hello,
>
> default tsearch parser does not recognize all valid email addresses and
> tokenizes them as text, splitting into tokens.
>
> For example:
>
> postgres=# select to_tsquery('simple', 'normal@email.com' );
>      to_tsquery
> ────────────────────
>  'normal@email.com'
> (1 row)
>
> here it behaves ok;
>
> postgres=# select to_tsquery('simple', '-still-normal@email.com' );
>         to_tsquery
> ──────────────────────────
>  'still-normal@email.com'
> (1 row)
>
> here it trims '-' from the beginning of an email. This is not correct, but
> will at least find that email.
>
> postgres=# select to_tsquery('simple', '-not-normal-with-dash-@email.com'
> );
>                                   to_tsquery
>
>
───────────────────────────────────────────────────────────────────────────────
>  'not-normal-with-dash' & 'not' & 'normal' & 'with' & 'dash' & 'email.com'
> (1 row)
>
> and this is now a real problem as it leads to finding emails that are not
> the same, but are "super-sets" of that one.
>
> Valid email characters, that are not correctly treated also are at least '+'
> and '.'

Yep.  :-(

You can see the oddness here:

    test=> SELECT alias, description, token FROM ts_debug('-myname@gmail.com');
     alias |  description  |      token
    -------+---------------+------------------
     blank | Space symbols | -
     email | Email address | myname@gmail.com
    (2 rows)

    test=> SELECT alias, description, token FROM ts_debug('-myna-me@gmail.com');
     alias |  description  |       token
    -------+---------------+-------------------
     blank | Space symbols | -
     email | Email address | myna-me@gmail.com
    (2 rows)

    test=> SELECT alias, description, token FROM ts_debug('-myna-me-@gmail.com');
          alias      |           description           |   token
    -----------------+---------------------------------+-----------
     blank           | Space symbols                   | -
     asciihword      | Hyphenated word, all ASCII      | myna-me
     hword_asciipart | Hyphenated word part, all ASCII | myna
     blank           | Space symbols                   | -
     hword_asciipart | Hyphenated word part, all ASCII | me
     blank           | Space symbols                   | -@
     host            | Host                            | gmail.com
    (7 rows)

The first and second show that the leading-dash is separated.  The third
ones shows that a trailing dash causes the middle-dash to also be
separated.

This email thread from 2010 has a similar problem:

    http://archives.postgresql.org/pgsql-hackers/2010-10/msg00772.php

What is limiting a fix for this is the breaking of existing behavior,
and the breaking of indexes used during pg_upgrade.

I have added your email to the existing TODO item:

    http://wiki.postgresql.org/wiki/Todo#Text_Search

    Improve handling of dash and plus signs in email address user names, and
    perhaps improve URL parsing

        http://archives.postgresql.org/pgsql-hackers/2010-10/msg00772.php
        tsearch does not recognize all valid emails

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Re: BUG #6375: tsearch does not recognize all valid emails

From
Valentine Gogichashvili
Date:
>
> This email thread from 2010 has a similar problem:
>        http://archives.postgresql.org/pgsql-hackers/2010-10/msg00772.php
> What is limiting a fix for this is the breaking of existing behavior,
> and the breaking of indexes used during pg_upgrade.
> I have added your email to the existing TODO item:
>        http://wiki.postgresql.org/wiki/Todo#Text_Search
>        Improve handling of dash and plus signs in email address user
> names, and
>        perhaps improve URL parsing
>
> http://archives.postgresql.org/pgsql-hackers/2010-10/msg00772.php
>            tsearch does not recognize all valid emails



Thank you Bruce,

as an intermediate solution, so that people, who have the same problem and
search the mailing archives, can get at least some way to overcome this
issue, I am rewriting such emails when building tverctor, and use rewrite
the tsearch queries as well:

code from a function, that builds the tsvector:

  select (select string_agg(
                 case when CASE WHEN n in ( 1, s ) -- all special outer
chars should be escaped
                                THEN c = ANY
('{.,!,#,$,%,&,'',*,+,/,=,?,_,`,"{",|,"}",~,^,-}'::text[])
                                ELSE c = ANY
('{!,#,$,%,&,'',*,+,/,=,?,`,"{",|,"}",~,^}'::text[])
                           END
                      then 'BCHR' || ascii(c)::text || 'END'
                      else c
                      end, '')
            from ( select row_number() over() as n, count(1) over() as s, c
from regexp_split_to_table(split_part(p, '@', 1 ), '') as e(c) ) as ee
          ) || '@' || split_part(p, '@', 2 )
         into email
    from ( select p_customer_user_row.email::text as p ) as e;

  return to_tsvector('simple', coalesce( p_customer_user_row.first_name,
'') ) ||
         to_tsvector('simple', coalesce( p_customer_user_row.last_name, '')
) ||
         to_tsvector('simple', coalesce( p_customer_user_row.customer_id,
'') ) ||
         to_tsvector('simple', coalesce( email, '') );


code from a function, that builds a tsquery:

  RETURN (select to_tsquery('simple',
                 string_agg(
                   case when p ~ '^[^@]+@[^@]+$' -- has only one @ inside
                        then (select string_agg(
                                     case when CASE WHEN n in ( 1, s ) --
all special outer chars should be escaped
                                                    THEN c = ANY
('{.,!,#,$,%,&,'',*,+,/,=,?,_,`,"{",|,"}",~,^,-}'::text[])
                                                    ELSE c = ANY
('{!,#,$,%,&,'',*,+,/,=,?,`,"{",|,"}",~,^}'::text[])
                                               END
                                          then 'BCHR' || ascii(c)::text ||
'END'
                                          else c
                                          end, '')
                                from ( select row_number() over() as n,
count(1) over() as s, c from regexp_split_to_table(split_part(p, '@', 1 ),
'') as e(c) ) as ee
                             ) || '@' || split_part(p, '@', 2 )
                        else (select string_agg(token, ' & ')
                                from ( select unnest(lexemes) || ':*' as
token
                                         from ts_debug('simple', p)
                                      ) as g
                              )
                   end, ' & ' )
                 )
            from regexp_split_to_table(btrim(search_text), E'\\s+') as s(p)
         );

Cheers,

-- Valentine