Thread: Email data type

Email data type

From
Gaetano Mendola
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi all, this is the alpha version for the new email data type,
is not written as built in type but as plug in, Tome Lane and others
drove me in this direction.

The type is indexable and provide also conversion methods:

text <--> email
and the operator >>, is possible use it in select like:

select * from my_user where email >> 'hotmail.com';

this select will extract all records with an email inside
the domain 'hotmail.com'.

The validation routine is very simple, right now test only
the presence of character '@' (it's an alpha version), I'd like
to put it on pgfoundry but apparently my DNS are unable to resolve
www.pgfoundry.org. I'll put this version as soon the address become
available.

This is the first time that I wrote code for postgresql so please, if
you can, do a sort of code revision on it.

Comments are welcomed.

Regards
Gaetano Mendola.











-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (MingW32)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFAptPG7UpzwH2SGd4RAhGYAKDMCfPhsBCwGAIFFTH2YHD/fHK2VACfZeUU
Y2nIWu56nvwhpZDTq1ARueo=
=b84w
-----END PGP SIGNATURE-----

Attachment

Re: Email data type

From
Gaetano Mendola
Date:
Gaetano Mendola wrote:

> Hi all, this is the alpha version for the new email data type,
> is not written as built in type but as plug in, Tome Lane and others
> drove me in this direction.
> 
> The type is indexable and provide also conversion methods:
> 
> text <--> email
> and the operator >>, is possible use it in select like:
> 
> select * from my_user where email >> 'hotmail.com';
> 
> this select will extract all records with an email inside
> the domain 'hotmail.com'.
> 
> The validation routine is very simple, right now test only
> the presence of character '@' (it's an alpha version), I'd like
> to put it on pgfoundry but apparently my DNS are unable to resolve
> www.pgfoundry.org. I'll put this version as soon the address become
> available.
> 
> This is the first time that I wrote code for postgresql so please, if
> you can, do a sort of code revision on it.
> 
> Comments are welcomed.

Forget the version attached and use the version on pgfoundry.


Regards
Gaetano Mendola







Re: Email data type

From
Bruno Wolff III
Date:
On Sun, May 16, 2004 at 04:36:55 +0200, Gaetano Mendola <mendola@bigfoot.com> wrote:
> 
> The type is indexable and provide also conversion methods:
> 
> text <--> email
> and the operator >>, is possible use it in select like:

When you are converting between text and email data, what format are the
text version of the address going to be? For example you might be using
rfc2821 encoding, rfc2822 encoding or concatenating the local part,
an @ sign and the domain name. Don't forget about domain literals.


Re: Email data type

From
Gaetano Mendola
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Bruno Wolff III wrote:

| On Sun, May 16, 2004 at 04:36:55 +0200,
|   Gaetano Mendola <mendola@bigfoot.com> wrote:
|
|>The type is indexable and provide also conversion methods:
|>
|>text <--> email
|>and the operator >>, is possible use it in select like:
|
|
| When you are converting between text and email data, what format are the
| text version of the address going to be? For example you might be using
| rfc2821 encoding, rfc2822 encoding or concatenating the local part,
| an @ sign and the domain name. Don't forget about domain literals.

Actually I use <local_part>@<domain_name>

Also the validator will validate emails in this form, if you are thinking to
validate emails as:

"Gaetano M. Public"(junior)<gmendola@(new account)bigfoot.com>

that are perfectly valid I think that it's a valid option to consider.

About the domain literals, I think to validate it in the near future,
rejecting private subnet according to this list:

10.0.0.0 - 10.255.255.255
172.16.0.0 - 172.31.255.255
192.168.0.0 - 192.168.255.255
169.254.0.0 -169.254.255.255

I think I have to discard also the addresses with last octet equal to 256.

Any comments ?


Regards
Gaetano Mendola




-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (MingW32)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFAqNPP7UpzwH2SGd4RAgqxAJ9UCPOOTpPN6Tl9w1748inn7t4a/QCg7pl1
8O3lHDwEKsCuCOJYQAo0vHM=
=wFo+
-----END PGP SIGNATURE-----



Re: Email data type

From
"Dave Page"
Date:

> -----Original Message-----
> From: Gaetano Mendola [mailto:mendola@bigfoot.com]
> Sent: 17 May 2004 16:02
> To: Bruno Wolff III
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Email data type
>
> About the domain literals, I think to validate it in the near
> future, rejecting private subnet according to this list:
>
> 10.0.0.0 - 10.255.255.255
> 172.16.0.0 - 172.31.255.255
> 192.168.0.0 - 192.168.255.255
> 169.254.0.0 -169.254.255.255

Why? What's wrong with private addresses?

Regards, Dave.


Re: Email data type

From
Steve Atkins
Date:
On Mon, May 17, 2004 at 05:01:36PM +0200, Gaetano Mendola wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Bruno Wolff III wrote:
> 
> | On Sun, May 16, 2004 at 04:36:55 +0200,
> |   Gaetano Mendola <mendola@bigfoot.com> wrote:
> |
> |>The type is indexable and provide also conversion methods:
> |>
> |>text <--> email
> |>and the operator >>, is possible use it in select like:
> |
> |
> | When you are converting between text and email data, what format are the
> | text version of the address going to be? For example you might be using
> | rfc2821 encoding, rfc2822 encoding or concatenating the local part,
> | an @ sign and the domain name. Don't forget about domain literals.
> 
> Actually I use <local_part>@<domain_name>
> 
> Also the validator will validate emails in this form, if you are thinking to
> validate emails as:
> 
> "Gaetano M. Public"(junior)<gmendola@(new account)bigfoot.com>
> 
> that are perfectly valid I think that it's a valid option to consider.
> 
> About the domain literals, I think to validate it in the near future,
> rejecting private subnet according to this list:
> 
> 10.0.0.0 - 10.255.255.255
> 172.16.0.0 - 172.31.255.255
> 192.168.0.0 - 192.168.255.255
> 169.254.0.0 -169.254.255.255

Bad idea. If I'm testing I'm likely to be testing with addresses on my
local network. My local network is in 10/8. The middleware or the
application should be making those decisions. But if you are going to
filter on IP space, also consider class D & E space.

Also, a@10.11.12.13 is a syntactically valid email address, in the .13
TLD. It does not deliver to 10.11.12.13, or anywhere else, as of
today, unless the MTA or local recursive resolver is broken (a common
case). a@[10.11.12.13] is a whole other thing. As is a@[::10.11.12.13]
and various other IPv6 variants.

a@foo.bar is syntactically valid. a@foo.invalid is syntactically valid, but
should be immediately rejected. a@example.com is valid, even if known to
be non-deliverable.

> I think I have to discard also the addresses with last octet equal to 256.

Or equal to 4872014, come to that. Any reason you're looking at 256
in particular? If you mean .255 (or .0) then don't fall into that trap -
there are perfectly valid, routable addresses ending in both .0 and
.255, despite what some folks in Redmond would have you believe.

> Any comments ?

Parsing email addresses is a significant part of my day job, and email
address validation is a lot harder than it looks at first sight.

Don't forget quoting, whitespace, escaping and nesting parenthetical
comments. Also, remember that A@b.com and a@b.com are different email
addresses, while a@b.com and a@B.com are the same email
address. POSTMASTER@b.com and postmaster@b.com are the same email
address. ABUse@b.com and abuse@b.com may be the same address or
different email addresses, depending on which religious faction you
belong to.

In some contexts the empty string is a valid email address. In some
contexts "Postmaster" is a valid email address.

Also, one persons definition of a valid email address will be very
different from another persons definition of such. Many of those
definitions require some DNS resolution to make the decision.

I'm not entirely convinced that an email address is a simple and
well-defined enough datatype to handle comprehensively within the
DB. The validation decisions are complex and vary from application to
application.

(I use two text columns - localpart and domainpart, with an indexon reverse(lower(domainpart)) and leave validation to
theapplication,myself).
 

Cheers, Steve



Re: Email data type

From
"Bort, Paul"
Date:
> From: Gaetano Mendola [mailto:mendola@bigfoot.com]
> 
> I think I have to discard also the addresses with last octet 
> equal to 256.
> 
> Any comments ?
> 

Any octet that contains a number less than 0 or greater than 255 should be
suspect. 

Assuming you really meant 255:

It would be perfectly legal for an entity on the internet to have a block of
addresses with a subnet mask of less than 24 bits, which leads to legal
addresses that end in 255.

For example, if your company/university/black helicopter squad needed about
500 servers with direct presence, you might be assigned a block like
123.45.6.0/23. So the network address would be 123.45.6.0, the broadcast
address would be 123.45.7.255, and everything in between, including
123.45.6.255 and 123.45.7.0, would be available for your servers.

HTH, 

Paul




Re: Email data type

From
"Dave Page"
Date:

> -----Original Message-----
> From: Steve Atkins [mailto:steve@blighty.com]
> Sent: 17 May 2004 16:46
> To: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Email data type
>
> a@foo.bar is syntactically valid. a@foo.invalid is
> syntactically valid, but should be immediately rejected.

I disagree - just because the database server cannot verify the the
existence of a domain does not mean that it isn't valid. I think only
addresses that are syntactically incorrect should be rejected.

Regards, Dave.


Re: Email data type

From
Gaetano Mendola
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Dave Page wrote:
|>-----Original Message-----
|>From: Gaetano Mendola [mailto:mendola@bigfoot.com]
|>Sent: 17 May 2004 16:02
|>To: Bruno Wolff III
|>Cc: pgsql-hackers@postgresql.org
|>Subject: Re: [HACKERS] Email data type
|>
|>About the domain literals, I think to validate it in the near
|>future, rejecting private subnet according to this list:
|>
|>10.0.0.0 - 10.255.255.255
|>172.16.0.0 - 172.31.255.255
|>192.168.0.0 - 192.168.255.255
|>169.254.0.0 -169.254.255.255
|
|
| Why? What's wrong with private addresses?

Well I think that accept an email like:

gmendola@192.168.212.2

is a risky.

It's true that any RFC forbid this kind of emails,
so I can accept it and leave to the final user the
opportunity to extract the domain an validate it.


Regards
Gaetano Mendola











-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (MingW32)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFAqOHm7UpzwH2SGd4RAp4dAJwNqDSws1cVwKr/QMXG/JvV/8M94QCbBfIS
alixKQYe/VoLpwXrBFHszDs=
=BAML
-----END PGP SIGNATURE-----



Re: Email data type

From
Steve Atkins
Date:
On Mon, May 17, 2004 at 05:01:05PM +0100, Dave Page wrote:

> > a@foo.bar is syntactically valid. a@foo.invalid is 
> > syntactically valid, but should be immediately rejected. 
> 
> I disagree - just because the database server cannot verify the the
> existence of a domain does not mean that it isn't valid. I think only
> addresses that are syntactically incorrect should be rejected.

I think you're missing the point. Go and take a glance at RFC 2606 -
it's a BCP which defines the .test, .invalid, .example and .localhost
TLDs.

Of course, some users may want to use .invalid email addresses in
their database, despite their being, by definition, invalid. It's
something that will vary depending on their needs.

Cheers, Steve



Re: Email data type

From
"Dave Page"
Date:

> -----Original Message-----
> From: Gaetano Mendola [mailto:mendola@bigfoot.com]
> Sent: 17 May 2004 17:02
> To: Dave Page
> Cc: Bruno Wolff III; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Email data type
>
>
> Well I think that accept an email like:
>
> gmendola@192.168.212.2
>
> is a risky.
>
> It's true that any RFC forbid this kind of emails, so I can
> accept it and leave to the final user the opportunity to
> extract the domain an validate it.

Forbidden on the 'net with good reason, but useful internally on private
networks for testing, and probably various other reasons where you might
want to direct a message to a specific interface rather than a hostname
or MX.

Regards, Dave




Re: Email data type

From
Steve Crawford
Date:
On Monday 17 May 2004 8:45 am, Steve Atkins wrote:
> Also, a@10.11.12.13 is a syntactically valid email address, in the
> .13 TLD. It does not deliver to 10.11.12.13, or anywhere else, as
> of today, unless the MTA or local recursive resolver is broken (a
> common case). a@[10.11.12.13] is a whole other thing. As is
> a@[::10.11.12.13] and various other IPv6 variants.

Along those lines a@foo.bar.com and a@[foo.bar.com] are valid but they 
don't necessarily refer to the same mailbox (depends on the mx for 
foo.bar.com).

> Parsing email addresses is a significant part of my day job, and
> email address validation is a lot harder than it looks at first
> sight.

Yes, indeed.

> Don't forget quoting, whitespace, escaping and nesting
> parenthetical comments....

The "just looking for an @" breaks pretty quickly, say with 
"joe@"some.dom or a myriad of other variations.

> In some contexts the empty string is a valid email address. In some
> contexts "Postmaster" is a valid email address.

As are postmaster and pOsTmaSTeR and POSTmaster and they are all the 
same address.

> I'm not entirely convinced that an email address is a simple and
> well-defined enough datatype to handle comprehensively within the
> DB. The validation decisions are complex and vary from application
> to application.
>
> (I use two text columns - localpart and domainpart, with an index
>  on reverse(lower(domainpart)) and leave validation to the
> application, myself).

Indeed. A problem with the "email address" datatype is that it hinders 
normalization:

"Joe User" <joe@user.dom> is valid but a database designer would 
probably prefer columns for name and email, or if the addresses were 
all people, firstname, middlename, lastname, email.

As you mentioned, the email can be broken into localpart and 
domainpart but if the app requires it, the domainpart could be 
further rendered into toplevel (so you could find all the .gov or 
.edu), secondlevel (at least corresponds to a registrant) and 
subdomain(s) as necessary.

Cheers,
Steve



Re: Email data type

From
Steve Atkins
Date:
On Mon, May 17, 2004 at 09:21:54AM -0700, Steve Crawford wrote:

> Along those lines a@foo.bar.com and a@[foo.bar.com] are valid but they 
> don't necessarily refer to the same mailbox (depends on the mx for 
> foo.bar.com).

I don't believe the latter is actually valid, as it has to be an
address literal inside the square brackets. Until that's extended
by RFC that means IPv4 or IPv6 literals.

It's another example of "this is neither trivial nor well-defined",
though.

> > I'm not entirely convinced that an email address is a simple and
> > well-defined enough datatype to handle comprehensively within the
> > DB. The validation decisions are complex and vary from application
> > to application.
> >
> > (I use two text columns - localpart and domainpart, with an index
> >  on reverse(lower(domainpart)) and leave validation to the
> > application, myself).
> 
> Indeed. A problem with the "email address" datatype is that it hinders 
> normalization:

Yup. That's a bigger concern than the overall vagueness of the problem.

> "Joe User" <joe@user.dom> is valid but a database designer would 
> probably prefer columns for name and email, or if the addresses were 
> all people, firstname, middlename, lastname, email.
> 
> As you mentioned, the email can be broken into localpart and 
> domainpart but if the app requires it, the domainpart could be 
> further rendered into toplevel (so you could find all the .gov or 
> .edu), secondlevel (at least corresponds to a registrant) and 
> subdomain(s) as necessary.

That's why I index the domainpart on reverse(lower()) - that way I
can say WHERE reverse(lower(domainpart)) LIKE reverse('%.com')

Cheers, Steve


Re: Email data type

From
"Dave Page"
Date:
It's rumoured that Steve Atkins once said:
> On Mon, May 17, 2004 at 05:01:05PM +0100, Dave Page wrote:
>
>> > a@foo.bar is syntactically valid. a@foo.invalid is
>> > syntactically valid, but should be immediately rejected.
>>
>> I disagree - just because the database server cannot verify the the
>> existence of a domain does not mean that it isn't valid. I think only
>> addresses that are syntactically incorrect should be rejected.
>
> I think you're missing the point. Go and take a glance at RFC 2606 -
> it's a BCP which defines the .test, .invalid, .example and .localhost
> TLDs.

Ahh, yes - clearly I was missing the point :-)

Regards, Dave.




Re: Email data type

From
Bruno Wolff III
Date:
On Mon, May 17, 2004 at 17:01:36 +0200, Gaetano Mendola <mendola@bigfoot.com> wrote:
> 
> Also the validator will validate emails in this form, if you are thinking to
> validate emails as:
> 
> "Gaetano M. Public"(junior)<gmendola@(new account)bigfoot.com>

That appears to be an rfc 2822 address. RFC 2821 addresses are more limited
in that they can't have comments but still escape special characters.

> 
> that are perfectly valid I think that it's a valid option to consider.

What you want to store depends on what you are going to do with the address.
If you don't need the comments, storing it as a local part and a domain
is probably best.

> About the domain literals, I think to validate it in the near future,
> rejecting private subnet according to this list:
> 
> 10.0.0.0 - 10.255.255.255
> 172.16.0.0 - 172.31.255.255
> 192.168.0.0 - 192.168.255.255
> 169.254.0.0 -169.254.255.255

That would depend on whether or not sending to those addresses would make
sense in your case.

> 
> I think I have to discard also the addresses with last octet equal to 256.

Email addresses are in ascii and hence bytes should only contain characters
with the codes 0 through 127.


Re: Email data type

From
Bruno Wolff III
Date:
On Mon, May 17, 2004 at 18:01:43 +0200, Gaetano Mendola <mendola@bigfoot.com> wrote:
> Well I think that accept an email like:
> 
> gmendola@192.168.212.2

Be careful about this. gmendola@192.168.212.2 is not the same as
gmendola@[192.168.212.2] . 


Re: Email data type

From
Andrew Dunstan
Date:
Bruno Wolff III wrote: <blockquote cite="mid20040517193140.GB24230@wolff.to" type="cite"><pre wrap="">On Mon, May 17,
2004at 18:01:43 +0200, Gaetano Mendola <a class="moz-txt-link-rfc2396E"
href="mailto:mendola@bigfoot.com"><mendola@bigfoot.com></a>wrote: </pre><blockquote type="cite"><pre wrap="">Well
Ithink that accept an email like:
 

<a class="moz-txt-link-abbreviated" href="mailto:gmendola@192.168.212.2">gmendola@192.168.212.2</a>
</pre></blockquote><prewrap="">
 
Be careful about this. <a class="moz-txt-link-abbreviated"
href="mailto:gmendola@192.168.212.2">gmendola@192.168.212.2</a>is not the same as
 
<a class="moz-txt-link-abbreviated" href="mailto:gmendola@[192.168.212.2">gmendola@[192.168.212.2</a>] . 
 </pre></blockquote><br /> Not wanting to rain on this fascinating parade, but shouldn't this discussion be carried out
ona mailing list for the pgfoundry emailadt project? (pgfoundry mailing lists were a bit broken but I think they are
fixednow).<br /><br /> cheers<br /><br /> andrew<br />