Re: Case Insensitive Data Type - Mailing list pgsql-general

From Tom Lane
Subject Re: Case Insensitive Data Type
Date
Msg-id 13443.1022266051@sss.pgh.pa.us
Whole thread Raw
In response to Case Insensitive Data Type  ("Russell Black" <russell.black@iarchives.com>)
Responses Re: Case Insensitive Data Type
Re: Case Insensitive Data Type
Re: Case Insensitive Data Type
List pgsql-general
"Russell Black" <russell.black@iarchives.com> writes:
> I'm using email addresses as a primary key in one of my tables.  Currently,=
>  I have to ensure that the email addresses are converted to lower case befo=
> re they get put into the table, and that all lookups on that field are conv=
> erted to lower case before the select statement, in order to ensure that Jo=
> e@Somewhere.com is the same as joe@somewhere.com.

A partial answer to this is

CREATE UNIQUE INDEX foo_key on foo (lower(email));

which will prevent duplicate entries with the same-up-to-case-folding
address.  However you'd still have to query with queries like

SELECT * FROM foo WHERE lower(email) = lower('probe value');

if the probe values you are handed can't be relied on to be downcased
already.  (I tried to think of a way to insert the lower() automatically
using a rule, but couldn't come up with anything --- ideas anyone?)


> Does anyone know of a case-insensitive data type?

I do not think you should go away feeling that a case-insensitive data
type would magically solve your problems.  Today you might think that
case-folding is what you want, but by no later than next week you would
figure out that there are other sorts of normalizations you'll also need
to do on provided addresses --- one obvious example is stripping leading
and trailing blanks and reducing consecutive blanks to a single blank.
In fact if you really want to look like you know what you're doing,
you'll have to strip out the comment portions of an address entirely.
For example, these are all equivalent forms per RFC specs:
    joe@blow.com
    Joe Blow <joe@blow.com>
    "Joe Blow" <joe@blow.com>
    joe@blow.com (Joe Blow)
and should be recognized as such by anything that pretends to know
what email addresses are.  (For that matter, you do know that the
specs say only the hostname part should be assumed case-insensitive,
don't you?)

So the real bottom line here is that you'd better figure a way to pass
the input strings through an address-normalization function.  "lower()"
might do as a first-order approximation but you won't stay with it
forever.

> Can I create a custom data type to do this?

You probably could, but it'd likely be more work than you want.  A
minimal implementation would require an input function (which would
apply the normalization rules you want), an output function (for which
you could probably just commandeer textout), and a no-op text() coercion
function (assuming you'd like to do anything besides display the value).
Then you'd have to create a set of index operators and access method
entries, if you intended to make this indexable --- these could all
piggyback on text operators, but you'd still need to do the work of
making the catalog entries for them.

Assuming you did have all that, I think a query like

    select * from foo where email = 'Joe Blow <joe@blow.com>'

would indeed work the way you want --- the initially untyped string
literal would eventually get coerced to your datatype, and then your
input conversion routine could do the right things to it.  But it'll
likely be easier to make it happen on the client side ;-)

            regards, tom lane

pgsql-general by date:

Previous
From: Neil Conway
Date:
Subject: Re: Question on crypt password
Next
From: Neil Conway
Date:
Subject: Re: Altering existing table to be WITHOUT OIDs