Re: BUG #5021: ts_parse doesn't recognize email addresses with underscores - Mailing list pgsql-bugs

From Dan O'Hara
Subject Re: BUG #5021: ts_parse doesn't recognize email addresses with underscores
Date
Msg-id 557802370910221010k5669e9f0v559213d998e286d3@mail.gmail.com
In response to Re: BUG #5021: ts_parse doesn't recognize email addresses with underscores  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-bugs
Thanks for having a look at this bug.

According to section 12.8.2 of the postgres manual, ts_parse is
supposed to recognize different types of data, one of which (#4) is an
email address.

The list of token types recognized by the parser can be selected via this query:

 SELECT * FROM ts_token_type('default');

The example in the bug I reported is a valid email address, according to
the RFC, but isn't recognized as such by the full text search in
postgres.  This bug will have a real impact on anybody using the ts
functions to locate email addresses, as only some of them are returned
by the query.
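
To check how the parser classifies each piece of the address, the two
functions can be joined so that every token appears next to its type name.
This is only a sketch: it assumes the 'default' parser and the tokid, alias,
description and token columns that ts_parse and ts_token_type return; the
aliases p and t are purely illustrative.

```sql
-- Show each token next to the name of its token type.
-- tokid is a token *type* id (4 = email in the default parser),
-- not the position of the token in the input.
SELECT p.tokid, t.alias, t.description, p.token
FROM ts_parse('default', 'first_last@yahoo.com') AS p
JOIN ts_token_type('default') AS t ON t.tokid = p.tokid;
```

If the whole address were recognized, a single row with the email type would
come back; per the report, on 8.3.7 only last@yahoo.com is returned with that
type.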

Regards
Dan



On Thu, Oct 22, 2009 at 12:29 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Aug 28, 2009 at 9:59 AM, Dan O'Hara <danarasoftware@gmail.com> wrote:
>>
>> The following bug has been logged online:
>>
>> Bug reference:      5021
>> Logged by:          Dan O'Hara
>> Email address:      danarasoftware@gmail.com
>> PostgreSQL version: 8.3.7
>> Operating system:   win32
>> Description:        ts_parse doesn't recognize email addresses with underscores
>> Details:
>>
>> In the following example,
>>
>> select distinct token as email
>> from ts_parse('default', ' first_last@yahoo.com ')
>> where tokid = 4
>>
>> ts_parse returns last@yahoo.com rather than first_last@yahoo.com.  It seems
>> that any text prior to the underscore is truncated.  If the portion
>> following the underscore is only numeric, such as this example,
>>
>> select distinct token as email
>> from ts_parse('default', ' bill_2000@yahoo.com ')
>> where tokid = 4
>>
>> then ts_parse returns nothing at all.
>>
>> Section 3.2.3 of RFC 5322 indicates that underscores are valid characters in
>> an email address.
>>
>> http://tools.ietf.org/html/rfc5322
>
> I don't think this has much to do with email addresses.  If you do:
>
> select token from ts_parse('default', 'a_b');
>
> ...you get three tokens.  In your case you're pulling out the fourth
> token, but some of your examples don't have four tokens, so then you
> get nothing at all.
>
> I'm not real familiar with ts_parse(), but I'm thinking that it
> doesn't have any special casing for email addresses and is just
> intended to parse text for full-text-search - in which case splitting
> on _ is a pretty good algorithm.
>
> ...Robert
>



-- 
-------------------------------------------------------------------
Dan O'Hara
Danara Software Systems, Inc.
danarasoftware@gmail.com
613 288-8733
