Thread: Best way to store case-insensitive data?
I have a column called "email" that users login with, thus I need to be able to lookup email very quickly. The problem is, emails are case-insensitive. I want foo@bar.com to be able to login with FOO@Bar.com as well. There's two ways of doing this, that I can see: 1) Every time I lookup an email in the database, do a case-insensitive ilike, or cast both sides with LOWER(). I think both are slow, correct? 2) Every time the user updates or saves their email, store it in lowercase, and every time I lookup an email, pass in a lowercase email. This is somewhat of a bug farm because one might miss some little spot in a piece of code where an email is compared or updated. Is there any way to tell postgres to always store data in lowercase form, or just have a flat out case-insensitive column? Thanks! Mike
On 6/10/2010 3:50 PM, Mike Christensen wrote: > I have a column called "email" that users login with, thus I need to > be able to lookup email very quickly. The problem is, emails are > case-insensitive. I want foo@bar.com to be able to login with > FOO@Bar.com as well. There's two ways of doing this, that I can see: > > 1) Every time I lookup an email in the database, do a case-insensitive > ilike, or cast both sides with LOWER(). I think both are slow, > correct? > 2) Every time the user updates or saves their email, store it in > lowercase, and every time I lookup an email, pass in a lowercase > email. This is somewhat of a bug farm because one might miss some > little spot in a piece of code where an email is compared or updated. > > Is there any way to tell postgres to always store data in lowercase > form, or just have a flat out case-insensitive column? Thanks! > > Mike > There is citext in contrib, it makes case insensitive text columns. -Andy
On Thu, Jun 10, 2010 at 2:50 PM, Mike Christensen <mike@kitchenpc.com> wrote: > I have a column called "email" that users login with, thus I need to > be able to lookup email very quickly. The problem is, emails are > case-insensitive. I want foo@bar.com to be able to login with > FOO@Bar.com as well. There's two ways of doing this, that I can see: > > 1) Every time I lookup an email in the database, do a case-insensitive > ilike, or cast both sides with LOWER(). I think both are slow, > correct? > 2) Every time the user updates or saves their email, store it in > lowercase, and every time I lookup an email, pass in a lowercase > email. This is somewhat of a bug farm because one might miss some > little spot in a piece of code where an email is compared or updated. > > Is there any way to tell postgres to always store data in lowercase > form, or just have a flat out case-insensitive column? Thanks! The contrib module citext provides a case insensitive text type.
Where do I get info on installing this? On Thu, Jun 10, 2010 at 2:15 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote: > On Thu, Jun 10, 2010 at 2:50 PM, Mike Christensen <mike@kitchenpc.com> wrote: >> I have a column called "email" that users login with, thus I need to >> be able to lookup email very quickly. The problem is, emails are >> case-insensitive. I want foo@bar.com to be able to login with >> FOO@Bar.com as well. There's two ways of doing this, that I can see: >> >> 1) Every time I lookup an email in the database, do a case-insensitive >> ilike, or cast both sides with LOWER(). I think both are slow, >> correct? >> 2) Every time the user updates or saves their email, store it in >> lowercase, and every time I lookup an email, pass in a lowercase >> email. This is somewhat of a bug farm because one might miss some >> little spot in a piece of code where an email is compared or updated. >> >> Is there any way to tell postgres to always store data in lowercase >> form, or just have a flat out case-insensitive column? Thanks! > > The contrib module citext provides a case insensitive text type. >
On Thu, Jun 10, 2010 at 2:50 PM, Mike Christensen <mike@kitchenpc.com> wrote: > I have a column called "email" that users login with, thus I need to > be able to lookup email very quickly. The problem is, emails are > case-insensitive. I want foo@bar.com to be able to login with > FOO@Bar.com as well. There's two ways of doing this, that I can see: > > 1) Every time I lookup an email in the database, do a case-insensitive > ilike, or cast both sides with LOWER(). I think both are slow, > correct? > 2) Every time the user updates or saves their email, store it in > lowercase, and every time I lookup an email, pass in a lowercase > email. This is somewhat of a bug farm because one might miss some > little spot in a piece of code where an email is compared or updated. > > Is there any way to tell postgres to always store data in lowercase > form, or just have a flat out case-insensitive column? Thanks! Note the other option is to store an index on lower(column) create index mycaseinsensitiveindex on table ((lower(column));
On Thu, Jun 10, 2010 at 3:16 PM, Mike Christensen <mike@kitchenpc.com> wrote: > Where do I get info on installing this? Very much depends on OS and how you installed pgsql
On 6/10/2010 3:50 PM, Mike Christensen wrote: > I have a column called "email" that users login with, thus I need to > be able to lookup email very quickly. The problem is, emails are > case-insensitive. I want foo@bar.com to be able to login with > FOO@Bar.com as well. There's two ways of doing this, that I can see: > > 1) Every time I lookup an email in the database, do a case-insensitive > ilike, or cast both sides with LOWER(). I think both are slow, > correct? Use a functional index and they won't be. create index email_lower_idx on foo (lower(email)); select * from foo where lower(email) = lower('foo@bar'); > 2) Every time the user updates or saves their email, store it in > lowercase, and every time I lookup an email, pass in a lowercase > email. This is somewhat of a bug farm because one might miss some > little spot in a piece of code where an email is compared or updated. > > Is there any way to tell postgres to always store data in lowercase > form, or just have a flat out case-insensitive column? Thanks! > > Mike > Cheers, Steve
Right now, I'm running 8.3.4 on Windows 2003, and I just used the MSI installer to install it. On Thu, Jun 10, 2010 at 2:18 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote: > On Thu, Jun 10, 2010 at 3:16 PM, Mike Christensen <mike@kitchenpc.com> wrote: >> Where do I get info on installing this? > > Very much depends on OS and how you installed pgsql >
From this site: http://developer.postgresql.org/pgdocs/postgres/citext.html I couldn't tell if you still had to create an index on the lower case value. It seems that it basically mimics the WHERE LOWER(email) = LOWER(?) method. Since this part is incredibly performance critical, maybe I'm better off storing my data all in lowercase and keeping the DB case sensitive. On Thu, Jun 10, 2010 at 2:29 PM, Mike Christensen <mike@kitchenpc.com> wrote: > Right now, I'm running 8.3.4 on Windows 2003, and I just used the MSI > installer to install it. > > On Thu, Jun 10, 2010 at 2:18 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote: >> On Thu, Jun 10, 2010 at 3:16 PM, Mike Christensen <mike@kitchenpc.com> wrote: >>> Where do I get info on installing this? >> >> Very much depends on OS and how you installed pgsql >> >
On Thu, Jun 10, 2010 at 3:34 PM, Mike Christensen <mike@kitchenpc.com> wrote: > From this site: > > http://developer.postgresql.org/pgdocs/postgres/citext.html > > I couldn't tell if you still had to create an index on the lower case > value. It seems that it basically mimics the WHERE LOWER(email) = > LOWER(?) method. Since this part is incredibly performance critical, > maybe I'm better off storing my data all in lowercase and keeping the > DB case sensitive. of course you'd still need an index. whether you store it lower case in regular text or mixed case in a citext, the db would need an index for good performance. But you wouldn't have to store a lower() index for citext, just an index. BTW, citext it new for 8.4, so it's probably not an option for you if you're on 8.3
Ah, I should probably upgrade to 8.4. However, I'll probably just wait for 9.0 to come out. So it seems like citext will be about the same as casting both sides to LOWER(), plus putting an index on the lowercase version of the text. I'd probably use that if it were out of the box, but I'm trying to stay away from adding too many dependencies.. I think I'll stick with my original approach of only storing lowercase data in the DB, and perhaps put a CHECK constraint on there to ensure no upper case letters sneak in. Mike On Thu, Jun 10, 2010 at 2:42 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote: > On Thu, Jun 10, 2010 at 3:34 PM, Mike Christensen <mike@kitchenpc.com> wrote: >> From this site: >> >> http://developer.postgresql.org/pgdocs/postgres/citext.html >> >> I couldn't tell if you still had to create an index on the lower case >> value. It seems that it basically mimics the WHERE LOWER(email) = >> LOWER(?) method. Since this part is incredibly performance critical, >> maybe I'm better off storing my data all in lowercase and keeping the >> DB case sensitive. > > of course you'd still need an index. whether you store it lower case > in regular text or mixed case in a citext, the db would need an index > for good performance. But you wouldn't have to store a lower() index > for citext, just an index. > > BTW, citext it new for 8.4, so it's probably not an option for you if > you're on 8.3 >
Heyho! On Thursday 10 June 2010 22.50:23 Mike Christensen wrote: > 2) Every time the user updates or saves their email, store it in > lowercase, and every time I lookup an email, pass in a lowercase > email. I'd do it this way. Plus either a CHECK condition on the table (email = lowercase(email)) (this will reliably catch all cases, but you will experience failures until you have found all cases) Or a BEFORE trigger that converts email to lowercase. (This is mostly transparent for storing, but I usually try to avoid triggers that modify data like this. But that's probably just me.) In either case, obviously you'll still need to change the code that is used for retrieving and comparing email addresses. cheers -- vbi -- featured link: http://www.pool.ntp.org
Attachment
Yup, I actually ended up doing this with this constraint: ALTER TABLE Users ADD CONSTRAINT check_email CHECK (email ~ E'^[^A-Z]+$'); However, I like your version better so I'll use that instead :) Mike On Thu, Jun 10, 2010 at 11:48 PM, Adrian von Bidder <avbidder@fortytwo.ch> wrote: > Heyho! > > On Thursday 10 June 2010 22.50:23 Mike Christensen wrote: >> 2) Every time the user updates or saves their email, store it in >> lowercase, and every time I lookup an email, pass in a lowercase >> email. > > I'd do it this way. Plus either a CHECK condition on the table (email = > lowercase(email)) (this will reliably catch all cases, but you will > experience failures until you have found all cases) > > Or a BEFORE trigger that converts email to lowercase. (This is mostly > transparent for storing, but I usually try to avoid triggers that modify > data like this. But that's probably just me.) > > In either case, obviously you'll still need to change the code that is used > for retrieving and comparing email addresses. > > cheers > -- vbi > > -- > featured link: http://www.pool.ntp.org >
On Thu, 10 Jun 2010 13:50:23 -0700, Mike Christensen wrote: > I have a column called "email" that users login with, thus I need to > be able to lookup email very quickly. The problem is, emails are > case-insensitive. I want foo@bar.com to be able to login with > FOO@Bar.com as well. There's two ways of doing this, that I can see: NB: technically the local part in an email address can be case sensitive. As RFC 5321 says: The local-part of a mailbox MUST BE treated as case sensitive. Therefore, SMTP implementations MUST take care to preserve the case of mailbox local-parts. In particular, for some hosts, the user "smith" is different from the user "Smith". However, exploiting the case sensitivity of mailbox local-parts impedes interoperability and is discouraged. Mailbox domains follow normal DNS rules and are hence not case sensitive. In practice I've yet to see a system having both smith and Smith and them being different, but still it is theoretically posible. -- Michał Politowski Talking has been known to lead to communication if practiced carelessly.
Yea this is a valid point. It's very possible my design won't work for the long term, and at some point I'll have to store the email name exactly as it was entered, and allow the lookup logic to be case insensitive with a lowercase index. However, I think the way I have it now should not break any known email server heh. Mike 2010/6/11 Michal Politowski <mpol@charybda.icm.edu.pl>: > On Thu, 10 Jun 2010 13:50:23 -0700, Mike Christensen wrote: >> I have a column called "email" that users login with, thus I need to >> be able to lookup email very quickly. The problem is, emails are >> case-insensitive. I want foo@bar.com to be able to login with >> FOO@Bar.com as well. There's two ways of doing this, that I can see: > > NB: technically the local part in an email address can be case sensitive. > As RFC 5321 says: > The local-part of a mailbox MUST BE treated as case sensitive. > Therefore, SMTP implementations MUST take care to preserve the case > of mailbox local-parts. In particular, for some hosts, the user > "smith" is different from the user "Smith". However, exploiting the > case sensitivity of mailbox local-parts impedes interoperability and > is discouraged. Mailbox domains follow normal DNS rules and are > hence not case sensitive. > > In practice I've yet to see a system having both smith and Smith > and them being different, but still it is theoretically posible. > > -- > Michał Politowski > Talking has been known to lead to communication if practiced carelessly. > > -- > Sent via pgsql-general mailing list (pgsql-general@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-general >
On Friday 11 June 2010 09.27:15 Michal Politowski wrote: [email address local part is case sensitive] > In practice I've yet to see a system having both smith and Smith > and them being different, but still it is theoretically posible. I routinely modify email addresses I store to my addressbook to all lowercase. I have yet to have a single case where this gets me into problems; I think it's probably quite exotic to find a system that actually is case sensitive. cheers -- vbi -- featured product: ClamAV Antivirus - http://www.clamav.net/
Attachment
2010/6/11 Mike Christensen <mike@kitchenpc.com>: > Yea this is a valid point. It's very possible my design won't work > for the long term, and at some point I'll have to store the email name > exactly as it was entered, and allow the lookup logic to be case > insensitive with a lowercase index. However, I think the way I have > it now should not break any known email server heh. Instead of mangling data when you store it, mangle it later when you retrieve it. with a functional index on the column, you get the comparison data stored in an index, ready to go. Performance test the index: create test_index on table (lower(fieldname)); versus storing the emails in lower case.
n Sat, Jun 12, 2010 at 3:21 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote: > Performance test the index: > > create test_index on table (lower(fieldname)); > > versus storing the emails in lower case. Some quick testing on that tells me that storing in lower case will be about twice as fast at retrieval. But we're talking things like 1ms versus 2ms.
Please don't top-post. Mike Christensen wrote: > Ah, I should probably upgrade to 8.4. However, I'll probably just > wait for 9.0 to come out. So it seems like citext will be about the > same as casting both sides to LOWER(), plus putting an index on the > lowercase version of the text. I'd probably use that if it were out > of the box, but I'm trying to stay away from adding too many > dependencies.. I think I'll stick with my original approach of only > storing lowercase data in the DB, and perhaps put a CHECK constraint > on there to ensure no upper case letters sneak in. If your db contains international text there are some corner cases where lower( upper( val )) != val or upper( lower( val )) != val. Or there should be, because that's what happens in certain languages. For example, upper-case 'ß' should be 'SS' in German. Lower-case 'SS' is 'ss'. -- Lew