Re: BUG #6375: tsearch does not recognize all valid emails - Mailing list pgsql-bugs

From Bruce Momjian
Subject Re: BUG #6375: tsearch does not recognize all valid emails
Msg-id 20120207174138.GL19450@momjian.us
In response to BUG #6375: tsearch does not recognize all valid emails  (valgog@gmail.com)
Responses Re: BUG #6375: tsearch does not recognize all valid emails
List pgsql-bugs
On Tue, Jan 03, 2012 at 06:04:23PM +0000, valgog@gmail.com wrote:
> The following bug has been logged on the website:
>
> Bug reference:      6375
> Logged by:          Valentine Gogichashvili
> Email address:      valgog@gmail.com
> PostgreSQL version: 9.1.1
> Operating system:   Debian 4.4.5-8
> Description:
>
> Hello,
>
> default tsearch parser does not recognize all valid email addresses and
> tokenizes them as text, splitting into tokens.
>
> For example:
>
> postgres=# select to_tsquery('simple', 'normal@email.com' );
>      to_tsquery
> ────────────────────
>  'normal@email.com'
> (1 row)
>
> here it behaves ok;
>
> postgres=# select to_tsquery('simple', '-still-normal@email.com' );
>         to_tsquery
> ──────────────────────────
>  'still-normal@email.com'
> (1 row)
>
> here it trims '-' from the beginning of an email. This is not correct, but
> will at least find that email.
>
> postgres=# select to_tsquery('simple', '-not-normal-with-dash-@email.com' );
>                                 to_tsquery
> ───────────────────────────────────────────────────────────────────────────
>  'not-normal-with-dash' & 'not' & 'normal' & 'with' & 'dash' & 'email.com'
> (1 row)
>
> and this is now a real problem as it leads to finding emails that are not
> the same, but are "super-sets" of that one.
>
> Other valid email characters that are not treated correctly include at
> least '+' and '.'

Yep.  :-(

You can see the oddness here:

    test=> SELECT alias, description, token FROM ts_debug('-myname@gmail.com');
     alias |  description  |      token
    -------+---------------+------------------
     blank | Space symbols | -
     email | Email address | myname@gmail.com
    (2 rows)

    test=> SELECT alias, description, token FROM ts_debug('-myna-me@gmail.com');
     alias |  description  |       token
    -------+---------------+-------------------
     blank | Space symbols | -
     email | Email address | myna-me@gmail.com
    (2 rows)

    test=> SELECT alias, description, token FROM ts_debug('-myna-me-@gmail.com');
          alias      |           description           |   token
    -----------------+---------------------------------+-----------
     blank           | Space symbols                   | -
     asciihword      | Hyphenated word, all ASCII      | myna-me
     hword_asciipart | Hyphenated word part, all ASCII | myna
     blank           | Space symbols                   | -
     hword_asciipart | Hyphenated word part, all ASCII | me
     blank           | Space symbols                   | -@
     host            | Host                            | gmail.com
    (7 rows)

The first and second examples show that the leading dash is split off as
a separate token.  The third shows that, with a trailing dash, the
address is no longer recognized as an email at all, so the middle dash
is split off as well.
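
To make the reported "super-set" matches concrete, here is a minimal
Python sketch of AND-style lexeme matching.  It is only a model, not the
tsearch implementation.  SPLIT_QUERY is copied from the to_tsquery
output quoted above; WHOLE_QUERY (the address kept as one token) and the
lexeme set in OTHER_DOC are assumptions made up for illustration.

```python
# Lexemes copied from the to_tsquery output quoted earlier in this thread.
SPLIT_QUERY = ["not-normal-with-dash", "not", "normal", "with", "dash",
               "email.com"]

# Assumption: what the query would look like if the whole address were
# kept as a single email token (this is NOT what 9.1 produces).
WHOLE_QUERY = ["-not-normal-with-dash-@email.com"]

# Hypothetical lexeme set of a document that contains the hyphenated word
# and the host name, but never the address itself.
OTHER_DOC = {"not-normal-with-dash", "not", "normal", "with", "dash",
             "other", "email.com", "example.org"}


def matches(query_lexemes, doc_lexemes):
    """Model of 'a & b & c': every query lexeme must be in the document."""
    return set(query_lexemes).issubset(doc_lexemes)


print(matches(SPLIT_QUERY, OTHER_DOC))   # True: a false positive
print(matches(WHOLE_QUERY, OTHER_DOC))   # False: no single-token match
```

Under these assumptions the split query matches a document that never
contains the address, while a single-token query would not.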

This email thread from 2010 has a similar problem:

    http://archives.postgresql.org/pgsql-hackers/2010-10/msg00772.php

What holds back a fix is that changing the parser would break existing
behavior, and would also break existing text search indexes that are
carried over by pg_upgrade.

I have added your email to the existing TODO item:

    http://wiki.postgresql.org/wiki/Todo#Text_Search

    Improve handling of dash and plus signs in email address user names, and
    perhaps improve URL parsing

        http://archives.postgresql.org/pgsql-hackers/2010-10/msg00772.php
        tsearch does not recognize all valid emails

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +
