Thread: Select all invalid e-mail addresses

Select all invalid e-mail addresses

From
"Andrus"
Date:
I have a database of e-mail addresses.

I want to select the email addresses which are not valid:

do not contain exactly one @ character,
contain ; > < " ' , characters or spaces etc.

What is the WHERE clause for this ?

Andrus.



Re: Select all invalid e-mail addresses

From
"Dann Corbit"
Date:
This might be handy:
http://www.databasejournal.com/img/email_val.sql

> -----Original Message-----
> From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-
> owner@postgresql.org] On Behalf Of Andrus
> Sent: Wednesday, October 19, 2005 11:12 AM
> To: pgsql-general@postgresql.org
> Subject: [GENERAL] Select all invalid e-mail addresses
>
> I have a database of e-mail addresses.
>
> I want to select the email addresses which are not valid:
>
> do not contain exactly one @ character,
> contain ; > < " ' , characters or spaces etc.
>
> What is the WHERE clause for this ?
>
> Andrus.
>
>
>
> ---------------------------(end of
broadcast)---------------------------
> TIP 6: explain analyze is your friend

Re: Select all invalid e-mail addresses

From
"Guy Rouillier"
Date:
Andrus wrote:
> I have a database of e-mail addresses.
>
> I want to select the email addresses which are not valid:
>
> do not contain exactly one @ character,
> contain ; > < " ' , characters or spaces etc.
>
> What is the WHERE clause for this ?

Please see a long, detailed thread in the archives titled "Email
Verification Regular Expression" on Sept 7, 2005.

--
Guy Rouillier


Re: Select all invalid e-mail addresses

From
Michael Fuhr
Date:
On Wed, Oct 19, 2005 at 09:12:16PM +0300, Andrus wrote:
> I want to select the email addresses which are not valid:
>
> do not contain exactly one @ character,
> contain ; > < " ' , characters or spaces etc.

The rules that define a valid email address are more complex than
most people realize, and even if an address is syntactically valid
that doesn't mean it's valid in the sense that you can deliver mail
to it.  Whatever method you end up using, be sure to understand its
limitations.

One possibility would be to write a plperlu function that uses the
Email::Valid module.  Here's a trivial example; see the Email::Valid
documentation to learn about its full capabilities:

CREATE FUNCTION is_valid_email(text) RETURNS boolean AS $$
use Email::Valid;
return Email::Valid->address($_[0]) ? "true" : "false";
$$ LANGUAGE plperlu IMMUTABLE STRICT;

You could then do something like:

SELECT * FROM foo WHERE NOT is_valid_email(email_address);

Again, be aware that passing this or any other test doesn't necessarily
mean that an address is truly valid -- it's just an attempt to identify
addresses that are obviously bogus.

--
Michael Fuhr

Re: Select all invalid e-mail addresses

From
Hannes Dorbath
Date:
On 19.10.2005 21:18, Michael Fuhr wrote:
> One possibility would be to write a plperlu function that uses the
> Email::Valid module.  Here's a trivial example; see the Email::Valid
> documentation to learn about its full capabilities:

..and if you don't mind installing pl/php, you can use this function:

http://hannes.imos.net/validmail.html

It performs a MX-lookup, which IMHO is the best way to check for validity.


--
Regards,
Hannes Dorbath

Re: Select all invalid e-mail addresses

From
Edmund
Date:
"Andrus" <eetasoft@online.ee> writes:

> I have a database of e-mail addresses.
>
> I want to select the email addresses which are not valid:
>
> do not contain exactly one @ character,
> contain ; > < " ' , characters or spaces etc.
>
> What is the WHERE clause for this ?
>

There was a thread here not so long ago about matching valid email addresses.
It's not so simple.  You probably want to do a regex match - e.g.

select ... where email ~ '<regex>'

However the regex for all valid email possibilities is *VERY* complex.
see: http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html

You should probably search the maillist archives.  ISTR that there were
some suggestions on how one might simplify the search space.



Re: Select all invalid e-mail addresses

From
Douglas McNaught
Date:
Hannes Dorbath <light@theendofthetunnel.de> writes:

> On 19.10.2005 21:18, Michael Fuhr wrote:
>> One possibility would be to write a plperlu function that uses the
>> Email::Valid module.  Here's a trivial example; see the Email::Valid
>> documentation to learn about its full capabilities:
>
> ..and if you don't mind installing pl/php, you can use this function:
>
> http://hannes.imos.net/validmail.html
>
> It performs a MX-lookup, which IMHO is the best way to check for validity.

But that's expensive and slow, and doesn't tell you whether the user
part of the address is valid (and in general, there's no way to
determine that short of actually sending a message).  So what's the
point?

-Doug

Re: Select all invalid e-mail addresses

From
Hannes Dorbath
Date:
On 20.10.2005 14:00, Douglas McNaught wrote:
> But that's expensive and slow

Sure, that isn't meant to be used in a WHERE condition on a 100k row
table.. more to be bound via check constraint on a user table, so
incomming data is validated.

> and doesn't tell you whether the user part of the address is valid (and in general, there's no way to
> determine that short of actually sending a message). So what's the
> point?

The point is to validate as good as possible, and as you said yourself,
the user part can't be validated further.


--
Regards,
Hannes Dorbath

Re: Select all invalid e-mail addresses

From
"Andrus"
Date:
""Guy Rouillier"" <guyr@masergy.com> wrote in message
news:CC1CF380F4D70844B01D45982E671B239E8BE9@mtxexch01.add0.masergy.com...
> Andrus wrote:
>> I have a database of e-mail addresses.
>>
>> I want to select the email addresses which are not valid:
>>
>> do not contain exactly one @ character,
>> contain ; > < " ' , characters or spaces etc.
>>
>> What is the WHERE clause for this ?
>
> Please see a long, detailed thread in the archives titled "Email
> Verification Regular Expression" on Sept 7, 2005.

Guy Rouillier,

thank you.

I have emails in CHARACTER(60) type columns in database, total 3000 emails.
I need to check email addresses for most frequent typos before send.
I have only plpgsql language installed, no perl.

From this thread I got the regular expression


/^[^@]*@(?:[^@]*\.)?[a-z0-9-_]+\.(?:a[defgilmnoqrstuwz]|b[abdefghijmnorstvwyz]|c[acdfghiklmnoruvxyz]|d[ejkmoz]|e[ceghrst]|f[ijkmorx]|g[abdefhilmnpqrstuwy]|h[kmnrtu]|i[delnoqrst]|j[mop]|k[eghimnprwyz]|l[abcikrstuvy]|m[acdghklmnopqrstuvwxyz]|n[acefgilopruz]|om|p[aefghklmnrtwy]|qa|r[eouw]|s[abcdeghijklmnortvyz]|t[cdfghjkmnoprtvwz]|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw]|edu|com|net|org|gov|mil|info|biz|coop|museum|aero|name|pro)$/How
Ican use this in where clause ? I havent never used regular expressionsin Postgres.How to exclude top-level domain
namesfrom this regex ?Andrus. 


Re: Select all invalid e-mail addresses

From
Brian Mathis
Date:
On 10/20/05, Douglas McNaught <doug@mcnaught.org> wrote:
> It performs a MX-lookup, which IMHO is the best way to check for validity.

But that's expensive and slow, and doesn't tell you whether the user
part of the address is valid (and in general, there's no way to
determine that short of actually sending a message).  So what's the
point?

-Doug


That's why I think the better term for this is "well formed".  "Validity" can only be determined by sending to it, but you can tell if an address at least conforms to the specs programmatically.  In the end, when talking about a "valid" address in this context, that is what most people are talking about.

The point is to weed out malformed email addresses, just like you would enforce any other data formatting standards in other types of data.

Re: Select all invalid e-mail addresses

From
Andrew Sullivan
Date:
On Thu, Oct 20, 2005 at 06:10:40PM +0300, Andrus wrote:
> >From this thread I got the regular expression

[snipped]

Note that that regular expression, which appears to be validating
TLDs as well, is incredibly fragile.  John Klensin has actually
written an RFC about this very problem.  Among other problems, what
do you do when a country code ceases to be?  (There's a similar
problem that the naming bodies struggke with from time to time.)

I suggest that if you want to validate TLDs, you pull them off when
you write the data in your database, and use a lookup table to make
sure they're valid (you can keep the table up to date regularly by
checking the official IANA registry for them).  At least that way you
don't have to change a regex every time ICANN decides to add another
TLD.  (The regex is wrong anyway, I think: it doesn't have .mobi,
which has been announced although isn't taking registrations yet, and
it doesn't appear to have arpa, either.)

A


--
Andrew Sullivan  | ajs@crankycanuck.ca
Information security isn't a technological problem.  It's an economics
problem.
        --Bruce Schneier

Re: Select all invalid e-mail addresses

From
Andrew Sullivan
Date:
On Thu, Oct 20, 2005 at 11:22:25AM -0400, Brian Mathis wrote:

> That's why I think the better term for this is "well formed". "Validity" can
> only be determined by sending to it, but you can tell if an address at least

In fact, it can only be determined by sending to it over and over
again, because whether a mail address works may change over time (and
may have nothing to do with the poor schmuck whose email
administrators don't know how to spell "MX record").

A

--
Andrew Sullivan  | ajs@crankycanuck.ca
This work was visionary and imaginative, and goes to show that visionary
and imaginative work need not end up well.
        --Dennis Ritchie

Re: Select all invalid e-mail addresses

From
"Andrus"
Date:
"Andrew Sullivan" <ajs@crankycanuck.ca> wrote in message

> I suggest that if you want to validate TLDs, you pull them off when
> you write the data in your database, and use a lookup table to make
> sure they're valid (you can keep the table up to date regularly by
> checking the official IANA registry for them).  At least that way you
> don't have to change a regex every time ICANN decides to add another
> TLD.  (The regex is wrong anyway, I think: it doesn't have .mobi,
> which has been announced although isn't taking registrations yet, and
> it doesn't appear to have arpa, either.)

Andrew, thank you.

I understand now that I do'nt want to validate TLDs at all.

I have an existing database of e-mail addresses. Those addesses are copied
from letters so they contain < > chars, points, commas etc. stupid
characters.
Sometimes two email addresses are copied to this field (contains two @
sings, spaces or commas). Sometimes web addresses starting with www. and
without @ are present in email column.
I want simply to allow user to view those addresses and make manual
corrections before starting large mailing session in night.

How to write a WHERE clause which selects e-mail addresses which
are surely wrong ?

Andrus.


> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
>       choose an index scan if your joining column's datatypes do not
>       match
>




Re: Select all invalid e-mail addresses

From
Steve Atkins
Date:
On Thu, Oct 20, 2005 at 11:52:40AM -0400, Andrew Sullivan wrote:
> On Thu, Oct 20, 2005 at 06:10:40PM +0300, Andrus wrote:
> > >From this thread I got the regular expression
>
> [snipped]
>
> Note that that regular expression, which appears to be validating
> TLDs as well, is incredibly fragile.  John Klensin has actually
> written an RFC about this very problem.  Among other problems, what
> do you do when a country code ceases to be?  (There's a similar
> problem that the naming bodies struggke with from time to time.)
>
> I suggest that if you want to validate TLDs, you pull them off when
> you write the data in your database, and use a lookup table to make
> sure they're valid (you can keep the table up to date regularly by
> checking the official IANA registry for them).  At least that way you
> don't have to change a regex every time ICANN decides to add another
> TLD.

You need to maintain the data, certainly. To argue that it must
be in a table to be maintained is, well, wrong. My preference would
be to keep it in a table and regenerate the regex periodically, and
in the application layer I do exactly that, but to try and do that
in a check constraint would be painful. A cleaner approach would
be to have a regex that checks for general syntax and extracts the
TLD, which is then compared to a lookup table, perhaps, but that
adds a lot of complexity for no real benefit.

> (The regex is wrong anyway, I think: it doesn't have .mobi,
> which has been announced although isn't taking registrations yet, and
> it doesn't appear to have arpa, either.)

While there are valid deliverable email addresses in .arpa, you really
don't want to be accepting them from end users...

Cheers,
  Steve

Re: Select all invalid e-mail addresses

From
Steve Atkins
Date:
On Thu, Oct 20, 2005 at 09:54:13PM +0300, Andrus wrote:
> "Andrew Sullivan" <ajs@crankycanuck.ca> wrote in message
>
> > I suggest that if you want to validate TLDs, you pull them off when
> > you write the data in your database, and use a lookup table to make
> > sure they're valid (you can keep the table up to date regularly by
> > checking the official IANA registry for them).  At least that way you
> > don't have to change a regex every time ICANN decides to add another
> > TLD.  (The regex is wrong anyway, I think: it doesn't have .mobi,
> > which has been announced although isn't taking registrations yet, and
> > it doesn't appear to have arpa, either.)
>
> Andrew, thank you.
>
> I understand now that I do'nt want to validate TLDs at all.
>
> I have an existing database of e-mail addresses. Those addesses are copied
> from letters so they contain < > chars, points, commas etc. stupid
> characters.
> Sometimes two email addresses are copied to this field (contains two @
> sings, spaces or commas). Sometimes web addresses starting with www. and
> without @ are present in email column.
> I want simply to allow user to view those addresses and make manual
> corrections before starting large mailing session in night.
>
> How to write a WHERE clause which selects e-mail addresses which
> are surely wrong ?

... WHERE email !~ '...insert previously mentioned regex here...';

Cheers,
  Steve

Re: Select all invalid e-mail addresses

From
Andrew Sullivan
Date:
On Thu, Oct 20, 2005 at 09:54:13PM +0300, Andrus wrote:
> How to write a WHERE clause which selects e-mail addresses which
> are surely wrong ?

Then I think the validating function someone else sent here
(<http://www.databasejournal.com/img/email_val.sql>) is a good start.
You probably want the opposite behaviour -- emailinvalidate(), I
guess -- but that seems like a good "obviously wrong" tester.  It
might not be fast, though -- that loop at the special character
check looks pretty painful.

A

--
Andrew Sullivan  | ajs@crankycanuck.ca
The fact that technology doesn't work is no bar to success in the marketplace.
        --Philip Greenspun

Re: Select all invalid e-mail addresses

From
"Dann Corbit"
Date:
Interesting article:
http://coveryourasp.com/ValidateEmail.asp

See also:
http://search.cpan.org/~cwest/Email-Address-1.80/lib/Email/Address.pm
http://www.faqs.org/rfcs/rfc2822.html
http://docs.python.org/lib/module-rfc822.html

> -----Original Message-----
> From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-
> owner@postgresql.org] On Behalf Of Steve Atkins
> Sent: Thursday, October 20, 2005 12:35 PM
> To: pgsql-general@postgresql.org
> Subject: Re: [GENERAL] Select all invalid e-mail addresses
>
> On Thu, Oct 20, 2005 at 11:52:40AM -0400, Andrew Sullivan wrote:
> > On Thu, Oct 20, 2005 at 06:10:40PM +0300, Andrus wrote:
> > > >From this thread I got the regular expression
> >
> > [snipped]
> >
> > Note that that regular expression, which appears to be validating
> > TLDs as well, is incredibly fragile.  John Klensin has actually
> > written an RFC about this very problem.  Among other problems, what
> > do you do when a country code ceases to be?  (There's a similar
> > problem that the naming bodies struggke with from time to time.)
> >
> > I suggest that if you want to validate TLDs, you pull them off when
> > you write the data in your database, and use a lookup table to make
> > sure they're valid (you can keep the table up to date regularly by
> > checking the official IANA registry for them).  At least that way
you
> > don't have to change a regex every time ICANN decides to add another
> > TLD.
>
> You need to maintain the data, certainly. To argue that it must
> be in a table to be maintained is, well, wrong. My preference would
> be to keep it in a table and regenerate the regex periodically, and
> in the application layer I do exactly that, but to try and do that
> in a check constraint would be painful. A cleaner approach would
> be to have a regex that checks for general syntax and extracts the
> TLD, which is then compared to a lookup table, perhaps, but that
> adds a lot of complexity for no real benefit.
>
> > (The regex is wrong anyway, I think: it doesn't have .mobi,
> > which has been announced although isn't taking registrations yet,
and
> > it doesn't appear to have arpa, either.)
>
> While there are valid deliverable email addresses in .arpa, you really
> don't want to be accepting them from end users...
>
> Cheers,
>   Steve
>
> ---------------------------(end of
broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
>                http://archives.postgresql.org

Re: Select all invalid e-mail addresses

From
Andrew Sullivan
Date:
On Thu, Oct 20, 2005 at 12:34:39PM -0700, Steve Atkins wrote:
>
> While there are valid deliverable email addresses in .arpa, you really
> don't want to be accepting them from end users...

You know, as someone who has been bitten hundreds of times by the
decision of some application designer who thought s/he knew better
than I what my email address could possibly be, I respectfully submit
that you're mistaken.  We call it a bug when other databases accept
dates like '0000-00-00'; but we'd just as surely call it a bug if
PostgreSQL refused to accept valid leap year dates or leap seconds.
It's one thing to say you should not accept known-bad data; it's
quite another to refuse data that is improbable but nevertheless
perfectly good.

A

--
Andrew Sullivan  | ajs@crankycanuck.ca
I remember when computers were frustrating because they *did* exactly what
you told them to.  That actually seems sort of quaint now.
        --J.D. Baldwin

Re: Select all invalid e-mail addresses

From
"Andrus"
Date:
>> How to write a WHERE clause which selects e-mail addresses which
>> are surely wrong ?
>
> ... WHERE email !~ '...insert previously mentioned regex here...';

 Steve,

thank you.

I tried

SELECT   email  FROM customer
  WHERE  email !~

'/^[^@]*@(?:[^@]*\.)?[a-z0-9-_]+\.(?:a[defgilmnoqrstuwz]|b[abdefghijmnorstvwyz]|c[acdfghiklmnoruvxyz]|d[ejkmoz]|e[ceghrst]|f[ijkmorx]|g[abdefhilmnpqrstuwy]|h[kmnrtu]|i[delnoqrst]|j[mop]|k[eghimnprwyz]|l[abcikrstuvy]|m[acdghklmnopqrstuvwxyz]|n[acefgilopruz]|om|p[aefghklmnrtwy]|qa|r[eouw]|s[abcdeghijklmnortvyz]|t[cdfghjkmnoprtvwz]|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw]|edu|com|net|org|gov|mil|info|biz|coop|museum|aero|name|pro)$/'

but got an error

ERROR:  invalid regular expression: invalid character range


I'm using

"PostgreSQL 8.0.2 on i686-pc-mingw32, compiled by GCC gcc.exe (GCC) 3.4.2
(mingw-special)"

Andrus.




Re: Select all invalid e-mail addresses

From
Michael Fuhr
Date:
On Fri, Oct 21, 2005 at 08:15:23PM +0300, Andrus wrote:
> I tried
>
> SELECT   email  FROM customer
>   WHERE  email !~
>
'/^[^@]*@(?:[^@]*\.)?[a-z0-9-_]+\.(?:a[defgilmnoqrstuwz]|b[abdefghijmnorstvwyz]|c[acdfghiklmnoruvxyz]|d[ejkmoz]|e[ceghrst]|f[ijkmorx]|g[abdefhilmnpqrstuwy]|h[kmnrtu]|i[delnoqrst]|j[mop]|k[eghimnprwyz]|l[abcikrstuvy]|m[acdghklmnopqrstuvwxyz]|n[acefgilopruz]|om|p[aefghklmnrtwy]|qa|r[eouw]|s[abcdeghijklmnortvyz]|t[cdfghjkmnoprtvwz]|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw]|edu|com|net|org|gov|mil|info|biz|coop|museum|aero|name|pro)$/'
>
> but got an error
>
> ERROR:  invalid regular expression: invalid character range

Aside from the fact that this regular expression is semantically wrong,
it has a few other problems:

* A hyphen (-) must come first or last in a character class if you want
it interpreted literally instead of as part of a range specification.

  test=> SELECT 'abc' ~ '[a-z0-9-_]';  -- WRONG
  ERROR:  invalid regular expression: invalid character range

  test=> SELECT 'abc' ~ '[a-z0-9_-]';
   ?column?
  ----------
   t
  (1 row)

* Regular expressions in PostgreSQL don't use delimiters like / at
the beginning and end of the expression.

  test=> SELECT 'abc' ~ '/abc/';  -- WRONG
   ?column?
  ----------
   f
  (1 row)

  test=> SELECT 'abc' ~ 'abc';
   ?column?
  ----------
   t
  (1 row)

* If you use single quotes around the regular expression then you
need to escape backslashes that should be part of the regular
expression; otherwise the backslash will be parsed by the string
parser before the string is used as a regular expression and you'll
get unexpected results.  In other words, there's an extra layer of
string parsing that you have to allow for.  In 8.0 and later you
can avoid this by using dollar quotes.

  test=> SELECT 'abc' ~ 'a\.c';  -- WRONG
   ?column?
  ----------
   t
  (1 row)

  test=> SELECT 'abc' ~ 'a\\.c';
   ?column?
  ----------
   f
  (1 row)

  test=> SELECT 'a.c' ~ 'a\\.c';
   ?column?
  ----------
   t
  (1 row)

  test=> SELECT 'abc' ~ $$a\.c$$;
   ?column?
  ----------
   f
  (1 row)

  test=> SELECT 'a.c' ~ $$a\.c$$;
   ?column?
  ----------
   t
  (1 row)

--
Michael Fuhr

Re: Select all invalid e-mail addresses

From
Steve Atkins
Date:
On Fri, Oct 21, 2005 at 08:15:23PM +0300, Andrus wrote:
> >> How to write a WHERE clause which selects e-mail addresses which
> >> are surely wrong ?
> >
> > ... WHERE email !~ '...insert previously mentioned regex here...';
>
>  Steve,
>
> thank you.
>
> I tried

[snip]

SELECT   email  FROM customer
   WHERE  email !~*

'^[^@]*@(?:[^@]*\.)?[a-z0-9_-]+\.(?:a[defgilmnoqrstuwz]|b[abdefghijmnorstvwyz]|c[acdfghiklmnoruvxyz]|d[ejkmoz]|e[ceghrst]|f[ijkmorx]|g[abdefhilmnpqrstuwy]|h[kmnrtu]|i[delnoqrst]|j[mop]|k[eghimnprwyz]|l[abcikrstuvy]|m[acdghklmnopqrstuvwxyz]|n[acefgilopruz]|om|p[aefghklmnrtwy]|qa|r[eouw]|s[abcdeghijklmnortvyz]|t[cdfghjkmnoprtvwz]|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw]|edu|com|net|org|gov|mil|info|biz|coop|museum|aero|name|pro|mobi|arpa)$'

...should be closer. Fixes one typo in the range, uses valid pg format regex, rather
than perl regex and had a couple of pedant-fixes in the TLDs supported.

It's syntactically correct, and appears to do the right thing on my production
DB here (which conincedentally has a customer table with an email field :)), but
you should make sure you understand what the regex actually does.

Cheers,
  Steve


Re: Select all invalid e-mail addresses

From
Michael Fuhr
Date:
On Fri, Oct 21, 2005 at 11:49:54AM -0700, Steve Atkins wrote:
> SELECT   email  FROM customer
>    WHERE  email !~*
>
'^[^@]*@(?:[^@]*\.)?[a-z0-9_-]+\.(?:a[defgilmnoqrstuwz]|b[abdefghijmnorstvwyz]|c[acdfghiklmnoruvxyz]|d[ejkmoz]|e[ceghrst]|f[ijkmorx]|g[abdefhilmnpqrstuwy]|h[kmnrtu]|i[delnoqrst]|j[mop]|k[eghimnprwyz]|l[abcikrstuvy]|m[acdghklmnopqrstuvwxyz]|n[acefgilopruz]|om|p[aefghklmnrtwy]|qa|r[eouw]|s[abcdeghijklmnortvyz]|t[cdfghjkmnoprtvwz]|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw]|edu|com|net|org|gov|mil|info|biz|coop|museum|aero|name|pro|mobi|arpa)$'
>
> ...should be closer. Fixes one typo in the range, uses valid pg format regex, rather
> than perl regex and had a couple of pedant-fixes in the TLDs supported.
>
> It's syntactically correct, and appears to do the right thing on my production
> DB here (which conincedentally has a customer table with an email field :))

The backslashes should be escaped or the regular expression should
be quoted with dollar quotes (8.0 and later) -- otherwise the string
parser converts "\." to ".", which matches anything.  For example,
the above regular expression considers the following address valid:

foo@example?com

Even with that correction the regular expression is still wrong,
especially the ^[^@]*@ part at the beginning.  See this group's
archives and numerous other sources for further discussion on this
topic.

--
Michael Fuhr

Re: Select all invalid e-mail addresses

From
"Andrus"
Date:
> SELECT   email  FROM customer
>   WHERE  email !~*
>
'^[^@]*@(?:[^@]*\.)?[a-z0-9_-]+\.(?:a[defgilmnoqrstuwz]|b[abdefghijmnorstvwyz]|c[acdfghiklmnoruvxyz]|d[ejkmoz]|e[ceghrst]|f[ijkmorx]|g[abdefhilmnpqrstuwy]|h[kmnrtu]|i[delnoqrst]|j[mop]|k[eghimnprwyz]|l[abcikrstuvy]|m[acdghklmnopqrstuvwxyz]|n[acefgilopruz]|om|p[aefghklmnrtwy]|qa|r[eouw]|s[abcdeghijklmnortvyz]|t[cdfghjkmnoprtvwz]|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw]|edu|com|net|org|gov|mil|info|biz|coop|museum|aero|name|pro|mobi|arpa)$'
>
> ...should be closer. Fixes one typo in the range, uses valid pg format
> regex, rather
> than perl regex and had a couple of pedant-fixes in the TLDs supported.
>
> It's syntactically correct, and appears to do the right thing on my
> production
> DB here (which conincedentally has a customer table with an email field
> :)), but
> you should make sure you understand what the regex actually does.

Steve,

thank you again.

I applied Michael hint about dollar quoting to this and tried

create temp table customer ( email char(60));
insert into customer values( 'steve@blighty.com');
SELECT   email  FROM customer  WHERE  email !~*
$$

^[^@]*@(?:[^@]*\.)?[a-z0-9_-]+\.(?:a[defgilmnoqrstuwz]|b[abdefghijmnorstvwyz]|c[acdfghiklmnoruvxyz]|d[ejkmoz]|e[ceghrst]|f[ijkmorx]|g[abdefhilmnpqrstuwy]|h[kmnrtu]|i[delnoqrst]|j[mop]|k[eghimnprwyz]|l[abcikrstuvy]|m[acdghklmnopqrstuvwxyz]|n[acefgilopruz]|om|p[aefghklmnrtwy]|qa|r[eouw]|s[abcdeghijklmnortvyz]|t[cdfghjkmnoprtvwz]|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw]|edu|com|net|org|gov|mil|info|biz|coop|museum|aero|name|pro|mobi|arpa)$
$$

but this classifies e-mail address steve@blighty.com as invalid (select
returns it). The same result are without dollar quoting, using your original
select.

Andrus.





Re: Select all invalid e-mail addresses

From
Michael Fuhr
Date:
On Mon, Oct 24, 2005 at 09:02:26PM +0300, Andrus wrote:
> I applied Michael hint about dollar quoting to this and tried
>
> create temp table customer ( email char(60));
> insert into customer values( 'steve@blighty.com');
> SELECT   email  FROM customer  WHERE  email !~*
> $$
>
^[^@]*@(?:[^@]*\.)?[a-z0-9_-]+\.(?:a[defgilmnoqrstuwz]|b[abdefghijmnorstvwyz]|c[acdfghiklmnoruvxyz]|d[ejkmoz]|e[ceghrst]|f[ijkmorx]|g[abdefhilmnpqrstuwy]|h[kmnrtu]|i[delnoqrst]|j[mop]|k[eghimnprwyz]|l[abcikrstuvy]|m[acdghklmnopqrstuvwxyz]|n[acefgilopruz]|om|p[aefghklmnrtwy]|qa|r[eouw]|s[abcdeghijklmnortvyz]|t[cdfghjkmnoprtvwz]|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw]|edu|com|net|org|gov|mil|info|biz|coop|museum|aero|name|pro|mobi|arpa)$
> $$
>
> but this classifies e-mail address steve@blighty.com as invalid (select
> returns it). The same result are without dollar quoting, using your original
> select.

There are at least two problems:

1. Since you're storing the email address as char(60), in some cases
it'll be padded with spaces up to 60 characters.  This appears to
be one of those cases:

SELECT 'foo'::char(60) ~ '^foo$';
 ?column?
----------
 f
(1 row)

test=> SELECT 'foo'::char(60) ~ '^foo {57}$';
 ?column?
----------
 t
(1 row)

2. Everything in the quoted string is part of the regular expression,
including the embedded newlines immediately after the open quote and
before the close quote.

test=> SELECT 'foo'::text ~ $$
test$> ^foo$
test$> $$;
 ?column?
----------
 f
(1 row)

test=> SELECT 'foo'::text ~ $_$^foo$$_$;
 ?column?
----------
 t
(1 row)

Note the need to quote with something other than $$ ($_$ in this case)
because of the $ that's part of the regular expression.  Otherwise
you'd get this:

test=> SELECT 'foo'::text ~ $$^foo$$$;
ERROR:  syntax error at or near "$" at character 30
LINE 1: SELECT 'foo'::text ~ $$^foo$$$;
                                     ^

Suggestions: use text or varchar for the email address, don't embed
newlines in the regular expression, and if you use dollar quotes
and the regular expression ends with a dollar sign then quote with
a character sequence other than $$.

--
Michael Fuhr

Re: Select all invalid e-mail addresses

From
"Andrus"
Date:
> Suggestions: use text or varchar for the email address, don't embed
> newlines in the regular expression, and if you use dollar quotes
> and the regular expression ends with a dollar sign then quote with
> a character sequence other than $$.

Michael,

thank you. Excellent!

I'm afraid that using VARCHAR instead on CHAR may cause some parts of my
application to stop working. So I used TRIM function.

This regex allows email addresses containing two dots without any letters,
like eeta..soft@online.ee
I havent seen any email of such kind.

Andrus.


create temp table customer ( email char(60)) on commit drop;
insert into customer values( 'eeta..soft@online.ee');
SELECT   email  FROM customer WHERE  email!='' and email IS NOT NULL and
TRIM(email) !~*

$_$^[^@]*@(?:[^@]*\.)?[a-z0-9_-]+\.(?:a[defgilmnoqrstuwz]|b[abdefghijmnorstvwyz]|c[acdfghiklmnoruvxyz]|d[ejkmoz]|e[ceghrst]|f[ijkmorx]|g[abdefhilmnpqrstuwy]|h[kmnrtu]|i[delnoqrst]|j[mop]|k[eghimnprwyz]|l[abcikrstuvy]|m[acdghklmnopqrstuvwxyz]|n[acefgilopruz]|om|p[aefghklmnrtwy]|qa|r[eouw]|s[abcdeghijklmnortvyz]|t[cdfghjkmnoprtvwz]|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw]|edu|com|net|org|gov|mil|info|biz|coop|museum|aero|name|pro|mobi|arpa)$$_$




Re: Select all invalid e-mail addresses

From
Michael Fuhr
Date:
On Tue, Oct 25, 2005 at 11:20:53AM +0300, Andrus wrote:
> This regex allows email addresses containing two dots without any letters,
> like eeta..soft@online.ee
> I havent seen any email of such kind.

That's because the regular expression is wrong: it simply checks
the local part for zero or more non-@ characters instead of checking
against the RFC822/RFC2822 specification.  Use a search engine to
find a more complete regular expression (beware: it's long).

--
Michael Fuhr

Re: Select all invalid e-mail addresses

From
Steve Atkins
Date:
On Tue, Oct 25, 2005 at 09:09:44AM -0600, Michael Fuhr wrote:
> On Tue, Oct 25, 2005 at 11:20:53AM +0300, Andrus wrote:
> > This regex allows email addresses containing two dots without any letters,
> > like eeta..soft@online.ee
> > I havent seen any email of such kind.
>
> That's because the regular expression is wrong: it simply checks
> the local part for zero or more non-@ characters instead of checking
> against the RFC822/RFC2822 specification.  Use a search engine to
> find a more complete regular expression (beware: it's long).

eeta..soft@online.ee is a perfectly functional email address, despite
not being in dot-atom form, so technically in violation of RFC
2822. There are few constraints on the local part of an email address,
and those constraints are often violated in practice, and cause no
problems.

I do data analysis on email addresses all day, every day. I'm fully
aware of RFC 2822 constraints, and I'm also aware that the correlation
between them and the real world is high, but not absolute.

If you were using this to validate email software that would be a
different thing, but if you're actually working in the real world with
real world data and are actually concerned about finding email
addresses that are likely to be incorrect (rather than punishing users
with noc RFC 2822 compliant email addresses) then looking at the
local-part in much detail is really not useful.

Cheers,
  Steve

Re: Select all invalid e-mail addresses

From
"Andrus"
Date:
>> This regex allows email addresses containing two dots without any
>> letters,
>> like eeta..soft@online.ee
>
> That's because the regular expression is wrong: it simply checks
> the local part for zero or more non-@ characters instead of checking
> against the RFC822/RFC2822 specification.  Use a search engine to
> find a more complete regular expression (beware: it's long).

Michael, thank you.
I found correct regexp from

http://www.twilightsoul.com/Domains/Voyager/DeveloperVision/BestPracticesPatterns/EmailAddresses/tabid/134/Default.aspx?PageContentID=2

but this needs to be converted to Postgres. It causes the famuous ERROR:
invalid regular expression: invalid character range.
Since text editor find/replace cannot be used to convert it it is probably
not reasonable to waste time trying to make the following code to work in
Postgres.

Andrus.

select email from customer where email!='' and trim(email) !~*
$_$[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\
xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xf
f\n\015()]*)*\)[\040\t]*)*(?:(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\x
ff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|"[^\\\x80-\xff\n\015
"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015"]*)*")[\040\t]*(?:\([^\\\x80-\
xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80
-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*
)*(?:\.[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\
\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\
x80-\xff\n\015()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x8
0-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|"[^\\\x80-\xff\n
\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015"]*)*")[\040\t]*(?:\([^\\\x
80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^
\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040
\t]*)*)*@[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([
^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\
\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\
x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-
\xff\n\015\[\]]|\\[^\x80-\xff])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()
]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\
x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\04
0\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\
n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\
015()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?!
[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\
]]|\\[^\x80-\xff])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\
x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\01
5()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*)*|(?:[^(\040)<>@,;:".
\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]
)|"[^\\\x80-\xff\n\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015"]*)*")[^
()<>@,;:".\\\[\]\x80-\xff\000-\010\012-\037]*(?:(?:\([^\\\x80-\xff\n\0
15()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][
^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)|"[^\\\x80-\xff\
n\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015"]*)*")[^()<>@,;:".\\\[\]\
x80-\xff\000-\010\012-\037]*)*<[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?
:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-
\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:@[\040\t]*
(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015
()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()
]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\0
40)<>@,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\
[^\x80-\xff])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\
xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*
)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:\([^\\\x80
-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x
80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t
]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\
\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff])
*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x
80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80
-\xff\n\015()]*)*\)[\040\t]*)*)*(?:,[\040\t]*(?:\([^\\\x80-\xff\n\015(
)]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\
\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*@[\040\t
]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\0
15()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015
()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(
\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|
\\[^\x80-\xff])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80
-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()
]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:\([^\\\x
80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^
\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040
\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".
\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff
])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\
\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x
80-\xff\n\015()]*)*\)[\040\t]*)*)*)*:[\040\t]*(?:\([^\\\x80-\xff\n\015
()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\
\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*)?(?:[^
(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-
\037\x80-\xff])|"[^\\\x80-\xff\n\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\
n\015"]*)*")[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|
\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))
[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:\([^\\\x80-\xff
\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\x
ff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(
?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\
000-\037\x80-\xff])|"[^\\\x80-\xff\n\015"]*(?:\\[^\x80-\xff][^\\\x80-\
xff\n\015"]*)*")[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\x
ff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)
*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*)*@[\040\t]*(?:\([^\\\x80-\x
ff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-
\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)
*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\
]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff])*\]
)[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-
\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\x
ff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(
?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80
-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:[^(\040)<
>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x8
0-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff])*\])[\040\t]*(?:
\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]
*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)
*\)[\040\t]*)*)*>)$_$






Re: Select all invalid e-mail addresses

From
Scott Marlowe
Date:
On Tue, 2005-10-25 at 13:01, Andrus wrote:
> >> This regex allows email addresses containing two dots without any
> >> letters,
> >> like eeta..soft@online.ee
> >
> > That's because the regular expression is wrong: it simply checks
> > the local part for zero or more non-@ characters instead of checking
> > against the RFC822/RFC2822 specification.  Use a search engine to
> > find a more complete regular expression (beware: it's long).
>
> Michael, thank you.
> I found correct regexp from
>
http://www.twilightsoul.com/Domains/Voyager/DeveloperVision/BestPracticesPatterns/EmailAddresses/tabid/134/Default.aspx?PageContentID=2
>
> but this needs to be converted to Postgres. It causes the famuous ERROR:
> invalid regular expression: invalid character range.
> Since text editor find/replace cannot be used to convert it it is probably
> not reasonable to waste time trying to make the following code to work in
> Postgres.
>
> Andrus.

PERL REGEX SNIPPED.

That's because it's a perl regex, not a posix or sql regex.  IF you
wrapped it in a plperl function, then you could use it.  Anyone know if
the PCRE library can handle this thing?   I guess I could try it myself.