Thread: BUG #5021: ts_parse doesn't recognize email addresses with underscores

BUG #5021: ts_parse doesn't recognize email addresses with underscores

From
"Dan O'Hara"
Date:
The following bug has been logged online:

Bug reference:      5021
Logged by:          Dan O'Hara
Email address:      danarasoftware@gmail.com
PostgreSQL version: 8.3.7
Operating system:   win32
Description:        ts_parse doesn't recognize email addresses with
underscores
Details:

In the following example,

select distinct token as email
from ts_parse('default', ' first_last@yahoo.com '   )
where tokid = 4

ts_parse returns last@yahoo.com rather than first_last@yahoo.com  It seems
that any text prior to the underscore is truncated.  If the portion
following the underscore is only numeric, such as this example,

select distinct token as email
from ts_parse('default', ' bill_2000@yahoo.com '   )
where tokid = 4

then ts_parse returns nothing at all.

section 3.2.3 of RFC 5322 indicates that underscores are valid characters in
an email address.

http://tools.ietf.org/html/rfc5322

Re: BUG #5021: ts_parse doesn't recognize email addresses with underscores

From
Robert Haas
Date:
On Fri, Aug 28, 2009 at 9:59 AM, Dan O'Hara <danarasoftware@gmail.com> wrot=
e:
>
> The following bug has been logged online:
>
> Bug reference: =A0 =A0 =A05021
> Logged by: =A0 =A0 =A0 =A0 =A0Dan O'Hara
> Email address: =A0 =A0 =A0danarasoftware@gmail.com
> PostgreSQL version: 8.3.7
> Operating system: =A0 win32
> Description: =A0 =A0 =A0 =A0ts_parse doesn't recognize email addresses wi=
th
> underscores
> Details:
>
> In the following example,
>
> select distinct token as email
> from ts_parse('default', ' first_last@yahoo.com ' =A0 )
> where tokid =3D 4
>
> ts_parse returns last@yahoo.com rather than first_last@yahoo.com =A0It se=
ems
> that any text prior to the underscore is truncated. =A0If the portion
> following the underscore is only numeric, such as this example,
>
> select distinct token as email
> from ts_parse('default', ' bill_2000@yahoo.com ' =A0 )
> where tokid =3D 4
>
> then ts_parse returns nothing at all.
>
> section 3.2.3 of RFC 5322 indicates that underscores are valid characters=
 in
> an email address.
>
> http://tools.ietf.org/html/rfc5322

I don't think this has much to do with email addresses.  If you do:

select token from ts_parse('a_b');

...you get three tokens.  In your case you're pulling out the fourth
token, but some of your examples don't have four tokens, so then you
get nothing at all.

I'm not real familiar with ts_parse(), but I'm thinking that it
doesn't have any special casing for email addresses and is just
intended to parse text for full-text-search - in which case splitting
on _ is a pretty good algorithm.

...Robert

Re: BUG #5021: ts_parse doesn't recognize email addresses with underscores

From
"Dan O'Hara"
Date:
Thanks for having a look at this bug.

According to section 12.8.2 of the postgres manual, ts_parse is
supposed to recognize different types of data, one of which (#4) is an
email address.

The list of recognized data formats for parse can be selected via this quer=
y:

 SELECT * FROM ts_token_type('default');

The example in the bug I reported is valid email address, according to
the RFC, but isn't recognized as such by the full text search in
postgres.  This bug will have a real impact on anybody using ts
functions to locate email addresses, as only some of them are found in
the query.

Regards
Dan



On Thu, Oct 22, 2009 at 12:29 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Aug 28, 2009 at 9:59 AM, Dan O'Hara <danarasoftware@gmail.com> wr=
ote:
>>
>> The following bug has been logged online:
>>
>> Bug reference: =A0 =A0 =A05021
>> Logged by: =A0 =A0 =A0 =A0 =A0Dan O'Hara
>> Email address: =A0 =A0 =A0danarasoftware@gmail.com
>> PostgreSQL version: 8.3.7
>> Operating system: =A0 win32
>> Description: =A0 =A0 =A0 =A0ts_parse doesn't recognize email addresses w=
ith
>> underscores
>> Details:
>>
>> In the following example,
>>
>> select distinct token as email
>> from ts_parse('default', ' first_last@yahoo.com ' =A0 )
>> where tokid =3D 4
>>
>> ts_parse returns last@yahoo.com rather than first_last@yahoo.com =A0It s=
eems
>> that any text prior to the underscore is truncated. =A0If the portion
>> following the underscore is only numeric, such as this example,
>>
>> select distinct token as email
>> from ts_parse('default', ' bill_2000@yahoo.com ' =A0 )
>> where tokid =3D 4
>>
>> then ts_parse returns nothing at all.
>>
>> section 3.2.3 of RFC 5322 indicates that underscores are valid character=
s in
>> an email address.
>>
>> http://tools.ietf.org/html/rfc5322
>
> I don't think this has much to do with email addresses. =A0If you do:
>
> select token from ts_parse('a_b');
>
> ...you get three tokens. =A0In your case you're pulling out the fourth
> token, but some of your examples don't have four tokens, so then you
> get nothing at all.
>
> I'm not real familiar with ts_parse(), but I'm thinking that it
> doesn't have any special casing for email addresses and is just
> intended to parse text for full-text-search - in which case splitting
> on _ is a pretty good algorithm.
>
> ...Robert
>



--=20
-------------------------------------------------------------------
Dan O'Hara
Danara Software Systems, Inc.
danarasoftware@gmail.com
613 288-8733

Re: BUG #5021: ts_parse doesn't recognize email addresses with underscores

From
Euler Taveira de Oliveira
Date:
Robert Haas escreveu:
> I'm not real familiar with ts_parse(), but I'm thinking that it
> doesn't have any special casing for email addresses and is just
> intended to parse text for full-text-search - in which case splitting
> on _ is a pretty good algorithm.
>
It is a bug. The tsearch claims to identify types of tokens but it doesn't
correctly identify any valid e-mail addresses. As Dan stated ts_parse() fails
to recognize an e-mail address. For example, foo+bar@baz.com is a valid e-mail
but the function fails to report that.

It is not that simple to identify an e-mail address that agrees with RFC. As
that code is a state machine, IMHO it decides too early (when it finds _) that
that string is not an e-mail address. AFAIR, that's not an one-line fix.

euler=# select distinct token as email from ts_parse('default',
'foo.bar@baz.com');
      email
─────────────────
 foo.bar@baz.com
(1 row)

euler=# select distinct token as email from ts_parse('default',
'foo+bar@baz.com');
    email
─────────────
 foo
 +
 bar@baz.com
(3 rows)

euler=# select distinct token as email from ts_parse('default',
'foo_bar@baz.com');
    email
─────────────
 foo
 bar@baz.com
 _
(3 rows)


--
  Euler Taveira de Oliveira
  http://www.timbira.com/

Re: BUG #5021: ts_parse doesn't recognize email addresses with underscores

From
"Dan O'Hara"
Date:
I agree that it isn't easy to determine if given text is a valid email
address.  As I couldn't use ts_parse, I ended up using a regex, which
worked substantially better at pulling out the emails from the text
stream.  I haven't looked at the code, but perhaps it is possible to
do the same thing here?  Even a regex that is 99% correct would be
better than the current tokenizer which is only right about 80-85% of
the time.

My workaround looked something like this:

  select regexp_matches(resumetext,E'[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4=
}','gi')
as email
        from "Resume"
cheers
Dan

On Thu, Oct 22, 2009 at 3:39 PM, Euler Taveira de Oliveira
<euler@timbira.com> wrote:
> Robert Haas escreveu:
>> I'm not real familiar with ts_parse(), but I'm thinking that it
>> doesn't have any special casing for email addresses and is just
>> intended to parse text for full-text-search - in which case splitting
>> on _ is a pretty good algorithm.
>>
> It is a bug. The tsearch claims to identify types of tokens but it doesn't
> correctly identify any valid e-mail addresses. As Dan stated ts_parse() f=
ails
> to recognize an e-mail address. For example, foo+bar@baz.com is a valid e=
-mail
> but the function fails to report that.
>
> It is not that simple to identify an e-mail address that agrees with RFC.=
 As
> that code is a state machine, IMHO it decides too early (when it finds _)=
 that
> that string is not an e-mail address. AFAIR, that's not an one-line fix.
>
> euler=3D# select distinct token as email from ts_parse('default',
> 'foo.bar@baz.com');
> =C2=A0 =C2=A0 =C2=A0email
> =E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=
=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=
=94=80
> =C2=A0foo.bar@baz.com
> (1 row)
>
> euler=3D# select distinct token as email from ts_parse('default',
> 'foo+bar@baz.com');
> =C2=A0 =C2=A0email
> =E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=
=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80
> =C2=A0foo
> =C2=A0+
> =C2=A0bar@baz.com
> (3 rows)
>
> euler=3D# select distinct token as email from ts_parse('default',
> 'foo_bar@baz.com');
> =C2=A0 =C2=A0email
> =E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=
=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80
> =C2=A0foo
> =C2=A0bar@baz.com
> =C2=A0_
> (3 rows)
>
>
> --
> =C2=A0Euler Taveira de Oliveira
> =C2=A0http://www.timbira.com/
>



--=20
-------------------------------------------------------------------
Dan O'Hara
Danara Software Systems, Inc.
danarasoftware@gmail.com
613 288-8733

Re: BUG #5021: ts_parse doesn't recognize email addresses with underscores

From
Alvaro Herrera
Date:
Euler Taveira de Oliveira escribió:
> Robert Haas escreveu:
> > I'm not real familiar with ts_parse(), but I'm thinking that it
> > doesn't have any special casing for email addresses and is just
> > intended to parse text for full-text-search - in which case splitting
> > on _ is a pretty good algorithm.
> >
> It is a bug. The tsearch claims to identify types of tokens but it doesn't
> correctly identify any valid e-mail addresses. As Dan stated ts_parse() fails
> to recognize an e-mail address. For example, foo+bar@baz.com is a valid e-mail
> but the function fails to report that.

It is similarly too-simplistic for other cases too, like file names
(particularly where Windows filenames are concerned).

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support