Thread: Case sensitivity
Hi all Is there an LC_COLLATE setting, or any other method, which allows all data in a database to be treated in a case-insensitive manner? I have two scenarios in mind. There are workarounds for both of them, but it would be nice if they were not necessary. 1. In a UNIQUE column, I would like a value of 'a' to be rejected if there is already a value of 'A'. Workaround - create a unique index on LOWER(col_name). 2. I would like WHERE col_name = 'x' and WHERE col_name LIKE 'x%' to find 'X' and 'X1'. Workaround - WHERE LOWER(col_name) = 'x' and WHERE col_name ILIKE 'x%'. TIA Frank Millman
Frank Millman wrote: > Hi all > > Is there an LC_COLLATE setting, or any other method, which allows all data > in a database to be treated in a case-insensitive manner? > > I have two scenarios in mind. There are workarounds for both of them, but it > would be nice if they were not necessary. > > 1. In a UNIQUE column, I would like a value of 'a' to be rejected if there > is already a value of 'A'. Workaround - create a unique index on > LOWER(col_name). > > 2. I would like WHERE col_name = 'x' and WHERE col_name LIKE 'x%' to find > 'X' and 'X1'. Workaround - WHERE LOWER(col_name) = 'x' and WHERE col_name > ILIKE 'x%'. You could define your own type if you want to go to that level of effort. Or, you could just decide only to allow upper-case values (or lower-case). -- Richard Huxton Archonet Ltd
Frank Millman wrote: > Hi all > > Is there an LC_COLLATE setting, or any other method, which allows all > data in a database to be treated in a case-insensitive manner? I was hoping to stimulate some discussion on this topic, but it seems I will have to kick-start it myself and see if anyone responds. My area of interest is general accounting/business systems, using a typical Western character set. I would imagine that this is a common scenario, but it is not universal, so read my comments in this context. In the good old days, data entry was always done in upper case, by dedicated data entry personnel. These days, it is typically done by a wide variety of individuals, who carry out a wide range of tasks, most of which require lower case (word processing, email) with the occasional use of the shift key to enter a capital letter. In this context, here are two undesirable effects. 1. The user tries to call up account 'A001', but they enter 'a001'. First problem, the system does not find the account. Second problem, the system allows them to create a new account with the code 'a001'. Now you have 'A001' and 'a001'. This is a recipe for chaos. 2. The user tries to call up a product item using a search string on the description. Assume they enter 'Wheel nut'. Assume further that the person who created the product item used the description 'Wheel Nut'. Try explaining to the user why the system cannot find the item they are looking for. I mentioned in my original post that there are workarounds for these problems. However, it seems to me that in a typical system you would want to apply the workaround on every table, and therefore there is a case for saying that the database should handle it. I have some experience of two other database systems, and it is of interest to see how they handle it. 1. D3 (the latest implementation of the old Pick Database System). In the early days it was case sensitive. When they brought out a new version in the early 90's they changed it to case insensitive. As you would expect, an upgrade required a full backup and restore. I was involved in many of these, some of them quite large. On two occasions I found that accounts were out of balance after the restore, and on investigation found that situations similar to 'A001' 'a001' had crept into the old database, and on restore the second insert was rejected as the first one already existed. When this was explained to the user, the reaction was always concern that this 'error' had been allowed to happen, and relief that the new version ensured that it could never happen again. 2. MSSQL Server 2000. Each time you create a new database you have to specify which 'collation' to use. There is a wide range available, both case sensitive and case insensitive. The default (on my system at least) is case insensitive, and I believe that in practice this is what most people want. There may well be counter-arguments to this being handled by the database, and I would be interested to hear them. However, from my point of view, if this capability is not currently available in PostgreSQL, I would like to propose that it is considered for some future release. Looking forward to any comments. Frank Millman
On Tue, Aug 09, 2005 at 09:35:25AM +0200, Frank Millman wrote: > Frank Millman wrote: > > > Hi all > > > > Is there an LC_COLLATE setting, or any other method, which allows all > > data in a database to be treated in a case-insensitive manner? > > I was hoping to stimulate some discussion on this topic, but it seems I will > have to kick-start it myself and see if anyone responds. I know there have been complaints from people that their database is sorting case-insensetively when they wish it wouldn't. This generally happens when the LC_COLLATE is set to en_US or some such. However, I think that even the en_US locale just fiddles the sort order, but doesn't make upper and lowercase equal. > I mentioned in my original post that there are workarounds for these > problems. However, it seems to me that in a typical system you would want to > apply the workaround on every table, and therefore there is a case for > saying that the database should handle it. These "workarounds" are the recommended way of dealing with this issue. Another option would be to create a new datatype 'itext' which works like text except it compares case insensetively. PostgreSQL is flexible like that. Here's something to get you started, see below for example. http://svana.org/kleptog/pgsql/type_itext.sql At the moment it uses SQL functions for the comparisons, for production you'd probably want to have them in C for performance. Also, it's not pg_dump safe (no operator class support). BTW, I can't beleive I'm the first to do this, but hey. It's also my first type with index support so it may be buggy. But it does work for basic tests... Have a nice day, --- snip --- test=# create table itest ( pkey serial primary key, val itext ); NOTICE: CREATE TABLE will create implicit sequence "itest_pkey_seq" for "serial" column "itest.pkey" NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "itest_pkey" for table "itest" CREATE TABLE test=# insert into itest (val) values ('a'); INSERT 72279 1 test=# insert into itest (val) values ('A'); INSERT 72280 1 test=# insert into itest (val) values ('b'); INSERT 72281 1 test=# select * from itest where val = 'a'; pkey | val ------+----- 1 | a 2 | A (2 rows) test=# create unique index itest_val on itest(val); ERROR: could not create unique index DETAIL: Table contains duplicated values. test=# delete from itest where val = 'a'; DELETE 2 test=# create unique index itest_val on itest(val); CREATE INDEX test=# insert into itest (val) values ('a'); INSERT 72284 1 test=# insert into itest (val) values ('A'); ERROR: duplicate key violates unique constraint "itest_val" -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
Attachment
On Tue, Aug 09, 2005 at 11:57:48AM +0200, Martijn van Oosterhout wrote: > Another option would be to create a new datatype 'itext' which works > like text except it compares case insensetively. PostgreSQL is flexible > like that. Here's something to get you started, see below for example. > > http://svana.org/kleptog/pgsql/type_itext.sql > > At the moment it uses SQL functions for the comparisons, for production > you'd probably want to have them in C for performance. Also, it's not > pg_dump safe (no operator class support). Oops, turns out there *is* a CREATE OPERATOR CLASS but my version of psql doesn't have it in command completion. And when you use that it *is* saved by pg_dump. Problem solved. I've tested various things, DISTINCT works, ORDER BY works, GROUP BY works. Neat huh? Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
Attachment
Frank Millman wrote: > Frank Millman wrote: > > >>Hi all >> >>Is there an LC_COLLATE setting, or any other method, which allows all >>data in a database to be treated in a case-insensitive manner? > > > I was hoping to stimulate some discussion on this topic, but it seems I will > have to kick-start it myself and see if anyone responds. > > My area of interest is general accounting/business systems, using a typical > Western character set. I would imagine that this is a common scenario, but > it is not universal, so read my comments in this context. > > In the good old days, data entry was always done in upper case, by dedicated > data entry personnel. These days, it is typically done by a wide variety of > individuals, who carry out a wide range of tasks, most of which require > lower case (word processing, email) with the occasional use of the shift key > to enter a capital letter. > > In this context, here are two undesirable effects. Martijn has pointed to a case-insensitive type, but I'll add a couple of points. > 1. The user tries to call up account 'A001', but they enter 'a001'. First > problem, the system does not find the account. Second problem, the system > allows them to create a new account with the code 'a001'. Now you have > 'A001' and 'a001'. This is a recipe for chaos. The basic problem here is that the value isn't text. This is partly the fault of development-systems not having a way to deal with sophisticated types in databases. What should happen is that you define some suitable type "AccountCode" which is defined as allowing character data in the form of (e.g.) "[A-Z][0-9][0-9][0-9]". That type can cast to/from text but doesn't need access to the full range of text-handling functions (e.g. concatenating two account-codes is probably meaningless). Of course, you want to define this in one place and have both the database constraints and user-interface understand what you want. Ironically, MS-Access does this quite well with its tight coupling of user-interface and data storage. > 2. The user tries to call up a product item using a search string on the > description. Assume they enter 'Wheel nut'. Assume further that the person > who created the product item used the description 'Wheel Nut'. Try > explaining to the user why the system cannot find the item they are looking > for. Here, I'm not convinced a case-insensitive type is any more useful than just using ILIKE. -- Richard Huxton Archonet Ltd
Martijn van Oosterhout <kleptog@svana.org> writes: > Another option would be to create a new datatype 'itext' which works > like text except it compares case insensetively. PostgreSQL is flexible > like that. Here's something to get you started, see below for example. > http://svana.org/kleptog/pgsql/type_itext.sql > At the moment it uses SQL functions for the comparisons, for production > you'd probably want to have them in C for performance. I believe there is a C-coded type like this on gborg ("citext" is the name I think). regards, tom lane
On Tue, Aug 09, 2005 at 11:02:47AM -0400, Tom Lane wrote: > Martijn van Oosterhout <kleptog@svana.org> writes: > > Another option would be to create a new datatype 'itext' which works > > like text except it compares case insensetively. PostgreSQL is flexible > > like that. Here's something to get you started, see below for example. > > > http://svana.org/kleptog/pgsql/type_itext.sql > > > At the moment it uses SQL functions for the comparisons, for production > > you'd probably want to have them in C for performance. > > I believe there is a C-coded type like this on gborg ("citext" is the > name I think). And so it is, full points to Tom. Here's the link: http://gborg.postgresql.org/project/citext/projdisplay.php I couldn't work out any obvious way to make google spit out this link without the magic word "citext", so hopefully this reference will raise the score enough that a plain google search for "case insensitive postgresql" will find it. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
Attachment
Frank Millman wrote: > Hi all > > Is there an LC_COLLATE setting, or any other method, which allows all > data in a database to be treated in a case-insensitive manner? > Thanks for all the replies, guys, I really appreciate it. Here is what I have decided to do. If anyone sees any problems with my approach, please let me know. Interesting though the citypes are, I will not use them. If I get anywhere with the app I am developing (it is making progress, but rather slowly) I will release it as an open source project. I do not want to make it a requirement that everyone must install a new datatype before they can use it. To handle searching for a row based on a string, I will use "LOWER(colname) = 'x'" and "LOWER(colname) LIKE 'x%'". AFAICT, the second one is equivalent to "colname ILIKE 'x%'", provided I force 'x' to lowercase first. I prefer it as I need to support SQL Server as well, and this should work on both platforms. The problem of inserting 'a001' when 'A001' exists is still potentially there, but it should not occur within my app, due to the way I handle table maintenance. I do not distinguish between INSERT and UPDATE from the user's point of view, but allow them to enter a primary key, do a SELECT, and then assume UPDATE mode if the row exists, and INSERT mode if it does not. As I will use SELECT WHERE LOWER(colname) = 'a001', it will find 'A001' and go into UPDATE mode, so there should be no danger of duplication. It does not feel entirely robust, so I will have to go through my app carefully to see if I can find any loopholes in this theory. Two questions. 1. Will SELECT WHERE LOWER(colname) = 'a001' use the index, or must I create a separate index on LOWER(colname)? 2. I was expecting to have a problem with LOWER(colname) if the column was of a numeric or date type, but it accepts it without complaining. Is it safe for me to rely on this behaviour in the future? Thanks again to everyone. Frank
Frank Millman wrote: > 1. Will SELECT WHERE LOWER(colname) = 'a001' use the index, or must I create a separate index on LOWER(colname)? Sorry. RTFM. The docs clearly state that this is necessary. Frank
> To handle searching for a row based on a string, I will use > "LOWER(colname) > = 'x'" and "LOWER(colname) LIKE 'x%'". AFAICT, the second one is > equivalent > to "colname ILIKE 'x%'", provided I force 'x' to lowercase first. I would strongly suggest you do this instead: LOWER(colname) = LOWER('x') This is far more bullet-proof than lower-casing in the client, in case the client and the server differ (mismatched locales, etc.). If you don't want to use ILIKE for portability reasons (perfectly reasonable), I suggest the analogous: LOWER(colname) LIKE LOWER('x%') - John D. Burger MITRE
# frank@chagford.com / 2005-08-10 13:18:32 +0200: > 1. Will SELECT WHERE LOWER(colname) = 'a001' use the index, or must I create > a separate index on LOWER(colname)? the latter -- How many Vietnam vets does it take to screw in a light bulb? You don't know, man. You don't KNOW. Cause you weren't THERE. http://bash.org/?255991