Thread: Case sensitivity

Case sensitivity

From

"Frank Millman"

Date:

06 August 2005, 13:22:06

Hi all

Is there an LC_COLLATE setting, or any other method, which allows all data
in a database to be treated in a case-insensitive manner?

I have two scenarios in mind. There are workarounds for both of them, but it
would be nice if they were not necessary.

1. In a UNIQUE column, I would like a value of 'a' to be rejected if there
is already a value of 'A'. Workaround - create a unique index on
LOWER(col_name).

2. I would like WHERE col_name = 'x' and WHERE col_name LIKE 'x%' to find
'X' and 'X1'. Workaround - WHERE LOWER(col_name) = 'x' and WHERE col_name
ILIKE 'x%'.

TIA

Frank Millman

Re: Case sensitivity

From

Richard Huxton

Date:

08 August 2005, 10:00:54

Frank Millman wrote:
> Hi all
>
> Is there an LC_COLLATE setting, or any other method, which allows all data
> in a database to be treated in a case-insensitive manner?
>
> I have two scenarios in mind. There are workarounds for both of them, but it
> would be nice if they were not necessary.
>
> 1. In a UNIQUE column, I would like a value of 'a' to be rejected if there
> is already a value of 'A'. Workaround - create a unique index on
> LOWER(col_name).
>
> 2. I would like WHERE col_name = 'x' and WHERE col_name LIKE 'x%' to find
> 'X' and 'X1'. Workaround - WHERE LOWER(col_name) = 'x' and WHERE col_name
> ILIKE 'x%'.

You could define your own type if you want to go to that level of
effort. Or, you could just decide only to allow upper-case values (or
lower-case).

--
   Richard Huxton
   Archonet Ltd

Re: Case sensitivity

From

"Frank Millman"

Date:

09 August 2005, 07:36:58

Frank Millman wrote:

> Hi all
>
> Is there an LC_COLLATE setting, or any other method, which allows all
> data in a database to be treated in a case-insensitive manner?

I was hoping to stimulate some discussion on this topic, but it seems I will
have to kick-start it myself and see if anyone responds.

My area of interest is general accounting/business systems, using a typical
Western character set. I would imagine that this is a common scenario, but
it is not universal, so read my comments in this context.

In the good old days, data entry was always done in upper case, by dedicated
data entry personnel. These days, it is typically done by a wide variety of
individuals, who carry out a wide range of tasks, most of which require
lower case (word processing, email) with the occasional use of the shift key
to enter a capital letter.

In this context, here are two undesirable effects.

1. The user tries to call up account 'A001', but they enter 'a001'. First
problem, the system does not find the account. Second problem, the system
allows them to create a new account with the code 'a001'. Now you have
'A001' and 'a001'. This is a recipe for chaos.

2. The user tries to call up a product item using a search string on the
description. Assume they enter 'Wheel nut'. Assume further that the person
who created the product item used the description 'Wheel Nut'. Try
explaining to the user why the system cannot find the item they are looking
for.

I mentioned in my original post that there are workarounds for these
problems. However, it seems to me that in a typical system you would want to
apply the workaround on every table, and therefore there is a case for
saying that the database should handle it.

I have some experience of two other database systems, and it is of interest
to see how they handle it.

1. D3 (the latest implementation of the old Pick Database System). In the
early days it was case sensitive. When they brought out a new version in the
early 90's they changed it to case insensitive. As you would expect, an
upgrade required a full backup and restore. I was involved in many of these,
some of them quite large. On two occasions I found that accounts were out of
balance after the restore, and on investigation found that situations
similar to 'A001' 'a001' had crept into the old database, and on restore the
second insert was rejected as the first one already existed. When this was
explained to the user, the reaction was always concern that this 'error' had
been allowed to happen, and relief that the new version ensured that it
could never happen again.

2. MSSQL Server 2000. Each time you create a new database you have to
specify which 'collation' to use. There is a wide range available, both case
sensitive and case insensitive. The default (on my system at least) is case
insensitive, and I believe that in practice this is what most people want.

There may well be counter-arguments to this being handled by the database,
and I would be interested to hear them. However, from my point of view, if
this capability is not currently available in PostgreSQL, I would like to
propose that it is considered for some future release.

Looking forward to any comments.

Frank Millman

Re: Case sensitivity

From

Martijn van Oosterhout

Date:

09 August 2005, 09:58:08

On Tue, Aug 09, 2005 at 09:35:25AM +0200, Frank Millman wrote:
> Frank Millman wrote:
>
> > Hi all
> >
> > Is there an LC_COLLATE setting, or any other method, which allows all
> > data in a database to be treated in a case-insensitive manner?
>
> I was hoping to stimulate some discussion on this topic, but it seems I will
> have to kick-start it myself and see if anyone responds.

I know there have been complaints from people that their database is
sorting case-insensetively when they wish it wouldn't. This generally
happens when the LC_COLLATE is set to en_US or some such. However, I
think that even the en_US locale just fiddles the sort order, but
doesn't make upper and lowercase equal.

> I mentioned in my original post that there are workarounds for these
> problems. However, it seems to me that in a typical system you would want to
> apply the workaround on every table, and therefore there is a case for
> saying that the database should handle it.

These "workarounds" are the recommended way of dealing with this issue.
Another option would be to create a new datatype 'itext' which works
like text except it compares case insensetively. PostgreSQL is flexible
like that. Here's something to get you started, see below for example.

http://svana.org/kleptog/pgsql/type_itext.sql

At the moment it uses SQL functions for the comparisons, for production
you'd probably want to have them in C for performance. Also, it's not
pg_dump safe (no operator class support).

BTW, I can't beleive I'm the first to do this, but hey. It's also my
first type with index support so it may be buggy. But it does work for
basic tests...

Have a nice day,
--- snip ---

test=# create table itest ( pkey serial primary key, val itext );
NOTICE:  CREATE TABLE will create implicit sequence "itest_pkey_seq"
for "serial" column "itest.pkey"
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index
"itest_pkey" for table "itest"
CREATE TABLE
test=# insert into itest (val) values ('a');
INSERT 72279 1
test=# insert into itest (val) values ('A');
INSERT 72280 1
test=# insert into itest (val) values ('b');
INSERT 72281 1
test=# select * from itest where val = 'a';
 pkey | val
------+-----
    1 | a
    2 | A
(2 rows)

test=# create unique index itest_val on itest(val);
ERROR:  could not create unique index
DETAIL:  Table contains duplicated values.
test=# delete from itest where val = 'a';
DELETE 2
test=# create unique index itest_val on itest(val);
CREATE INDEX
test=# insert into itest (val) values ('a');
INSERT 72284 1
test=# insert into itest (val) values ('A');
ERROR:  duplicate key violates unique constraint "itest_val"
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Attachment

msg-32091-101926.dat

Re: Case sensitivity

From

Martijn van Oosterhout

Date:

09 August 2005, 11:04:33

On Tue, Aug 09, 2005 at 11:57:48AM +0200, Martijn van Oosterhout wrote:
> Another option would be to create a new datatype 'itext' which works
> like text except it compares case insensetively. PostgreSQL is flexible
> like that. Here's something to get you started, see below for example.
>
> http://svana.org/kleptog/pgsql/type_itext.sql
>
> At the moment it uses SQL functions for the comparisons, for production
> you'd probably want to have them in C for performance. Also, it's not
> pg_dump safe (no operator class support).

Oops, turns out there *is* a CREATE OPERATOR CLASS but my version of
psql doesn't have it in command completion. And when you use that it
*is* saved by pg_dump. Problem solved.

I've tested various things, DISTINCT works, ORDER BY works, GROUP BY
works. Neat huh?

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Attachment

msg-32091-101929.dat

Re: Case sensitivity

From

Richard Huxton

Date:

09 August 2005, 11:46:38

Frank Millman wrote:
> Frank Millman wrote:
>
>
>>Hi all
>>
>>Is there an LC_COLLATE setting, or any other method, which allows all
>>data in a database to be treated in a case-insensitive manner?
>
>
> I was hoping to stimulate some discussion on this topic, but it seems I will
> have to kick-start it myself and see if anyone responds.
>
> My area of interest is general accounting/business systems, using a typical
> Western character set. I would imagine that this is a common scenario, but
> it is not universal, so read my comments in this context.
>
> In the good old days, data entry was always done in upper case, by dedicated
> data entry personnel. These days, it is typically done by a wide variety of
> individuals, who carry out a wide range of tasks, most of which require
> lower case (word processing, email) with the occasional use of the shift key
> to enter a capital letter.
>
> In this context, here are two undesirable effects.

Martijn has pointed to a case-insensitive type, but I'll add a couple of
  points.

> 1. The user tries to call up account 'A001', but they enter 'a001'. First
> problem, the system does not find the account. Second problem, the system
> allows them to create a new account with the code 'a001'. Now you have
> 'A001' and 'a001'. This is a recipe for chaos.

The basic problem here is that the value isn't text. This is partly the
fault of development-systems not having a way to deal with sophisticated
types in databases.

What should happen is that you define some suitable type "AccountCode"
which is defined as allowing character data in the form of (e.g.)
"[A-Z][0-9][0-9][0-9]". That type can cast to/from text but doesn't need
access to the full range of text-handling functions (e.g. concatenating
two account-codes is probably meaningless). Of course, you want to
define this in one place and have both the database constraints and
user-interface understand what you want.

Ironically, MS-Access does this quite well with its tight coupling of
user-interface and data storage.

> 2. The user tries to call up a product item using a search string on the
> description. Assume they enter 'Wheel nut'. Assume further that the person
> who created the product item used the description 'Wheel Nut'. Try
> explaining to the user why the system cannot find the item they are looking
> for.

Here, I'm not convinced a case-insensitive type is any more useful than
just using ILIKE.

--
   Richard Huxton
   Archonet Ltd

Re: Case sensitivity

From

Tom Lane

Date:

09 August 2005, 15:03:20

Martijn van Oosterhout <kleptog@svana.org> writes:
> Another option would be to create a new datatype 'itext' which works
> like text except it compares case insensetively. PostgreSQL is flexible
> like that. Here's something to get you started, see below for example.

> http://svana.org/kleptog/pgsql/type_itext.sql

> At the moment it uses SQL functions for the comparisons, for production
> you'd probably want to have them in C for performance.

I believe there is a C-coded type like this on gborg ("citext" is the
name I think).

            regards, tom lane

Re: Case sensitivity

From

Martijn van Oosterhout

Date:

09 August 2005, 16:15:43

On Tue, Aug 09, 2005 at 11:02:47AM -0400, Tom Lane wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
> > Another option would be to create a new datatype 'itext' which works
> > like text except it compares case insensetively. PostgreSQL is flexible
> > like that. Here's something to get you started, see below for example.
>
> > http://svana.org/kleptog/pgsql/type_itext.sql
>
> > At the moment it uses SQL functions for the comparisons, for production
> > you'd probably want to have them in C for performance.
>
> I believe there is a C-coded type like this on gborg ("citext" is the
> name I think).

And so it is, full points to Tom. Here's the link:

http://gborg.postgresql.org/project/citext/projdisplay.php

I couldn't work out any obvious way to make google spit out this link
without the magic word "citext", so hopefully this reference will raise
the score enough that a plain google search for "case insensitive
postgresql" will find it.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Attachment

msg-32091-101976.dat

Re: Case sensitivity

From

"Frank Millman"

Date:

10 August 2005, 11:22:40

Frank Millman wrote:
> Hi all
>
> Is there an LC_COLLATE setting, or any other method, which
allows all
> data in a database to be treated in a case-insensitive manner?
>

Thanks for all the replies, guys, I really appreciate it.

Here is what I have decided to do. If anyone sees any problems with my
approach, please let me know.

Interesting though the citypes are, I will not use them. If I get anywhere
with the app I am developing (it is making progress, but rather slowly) I
will release it as an open source project. I do not want to make it a
requirement that everyone must install a new datatype before they can use
it.

To handle searching for a row based on a string, I will use "LOWER(colname)
= 'x'" and "LOWER(colname) LIKE 'x%'". AFAICT, the second one is equivalent
to "colname ILIKE 'x%'", provided I force 'x' to lowercase first. I prefer
it as I need to support SQL Server as well, and this should work on both
platforms.

The problem of inserting 'a001' when 'A001' exists is still potentially
there, but it should not occur within my app, due to the way I handle table
maintenance. I do not distinguish between INSERT and UPDATE from the user's
point of view, but allow them to enter a primary key, do a SELECT, and then
assume UPDATE mode if the row exists, and INSERT mode if it does not. As I
will use SELECT WHERE LOWER(colname) = 'a001', it will find 'A001' and go
into UPDATE mode, so there should be no danger of duplication. It does not
feel entirely robust, so I will have to go through my app carefully to see
if I can find any loopholes in this theory.

Two questions.

1. Will SELECT WHERE LOWER(colname) = 'a001' use the index, or must I create
a separate index on LOWER(colname)?

2. I was expecting to have a problem with LOWER(colname) if the column was
of a numeric or date type, but it accepts it without complaining. Is it safe
for me to rely on this behaviour in the future?

Thanks again to everyone.

Frank

Re: Case sensitivity

From

"Frank Millman"

Date:

10 August 2005, 13:07:26

Frank Millman wrote:

> 1. Will SELECT WHERE LOWER(colname) = 'a001' use the index, or must I
create a separate index on LOWER(colname)?

Sorry. RTFM. The docs clearly state that this is necessary.

Frank

Re: Case sensitivity

From

"John D. Burger"

Date:

10 August 2005, 13:52:02

> To handle searching for a row based on a string, I will use
> "LOWER(colname)
> = 'x'" and "LOWER(colname) LIKE 'x%'". AFAICT, the second one is
> equivalent
> to "colname ILIKE 'x%'", provided I force 'x' to lowercase first.

I would strongly suggest you do this instead:

   LOWER(colname) = LOWER('x')

This is far more bullet-proof than lower-casing in the client, in case
the client and the server differ (mismatched locales, etc.).  If you
don't want to use ILIKE for portability reasons (perfectly reasonable),
I suggest the analogous:

   LOWER(colname) LIKE LOWER('x%')

- John D. Burger
   MITRE

Re: Case sensitivity

From

Roman Neuhauser

Date:

10 August 2005, 17:35:14

# frank@chagford.com / 2005-08-10 13:18:32 +0200:
> 1. Will SELECT WHERE LOWER(colname) = 'a001' use the index, or must I create
> a separate index on LOWER(colname)?

    the latter

--
How many Vietnam vets does it take to screw in a light bulb?
You don't know, man.  You don't KNOW.
Cause you weren't THERE.             http://bash.org/?255991