Thread: Re: Case Insensitive Queries
We tried these but it didn't work. However, that's because username is a bpchar and not a varchar, so its padded with blanks. so we tried where lower(trim(username)) = 'test' and it works. We'll change that column to varchar. The real problem was in the datatype for username. Thanks, On 29 May 2001 12:35:53 -0400, ANDREW PERRIN wrote: > Try: > > - The ILIKE operator, for example, > > SELECT * FROM account WHERE username ILIKE "test"; > > - upper() or lower(), for example, > > SELECT * FROM accont WHERE lower(username) = "test"; > > --------------------------------------------------------- > Andrew J. Perrin - Assistant Professor of Sociology > University of North Carolina, Chapel Hill > 269 Hamilton Hall CB#3210, Chapel Hill, NC 27599-3210 USA > andrew_perrin@unc.edu - http://www.unc.edu/~aperrin > > On 29 May 2001, Mark wrote: > > > Is it possible to execute a query using a where clause that allows case > > insensitive comparison between a field and text. > > > > For example: > > > > select * from account where username = 'test' > > > > where username could be 'Test', which would be a match. As is, this > > compare is case sensitive. > > > > grep'd the source, but stricmp is only used for keywords and not actual > > column data. > > > > Any help would be greatly appreciated. > > > > Thanks, > > > > Mark > > > > > > ---------------------------(end of broadcast)--------------------------- > > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org > > >
This is a good point - and it means that Postgres is not following the SQL Standard in this regard. According to the standard,a scalar string function of a single string argument should return the same "type" of string as its input. So upper(<fixed-char-field>)should return a fixed-char-field. But it doesn't - it always returns a varchar that includes thetrailing spaces from the space-padded fixed char argument. And those trailing spaces are significant for the varcharcomparison with the string literal. It seems to me there are two ways to correct this behavior. One is to have overloaded versions of the relevant string functionthat return the right types. But, probably better, Postgres could support the notion of PAD SPACE or PAD OFF tocontrol the behavior of string comparisons regardless of the particular types of the character fields involved. Are ther plans to change this Postgres behavior? Thanks, Jim Ballard Netezza Corp. ---------- Original Message ---------------------------------- From: Mark <mark@zserve.com> Date: 29 May 2001 10:21:15 -0600 >We tried these but it didn't work. However, that's because username is >a bpchar and not a varchar, so its padded with blanks. so we tried >where lower(trim(username)) = 'test' and it works. We'll change that >column to varchar. The real problem was in the datatype for username. > >Thanks, > >On 29 May 2001 12:35:53 -0400, ANDREW PERRIN wrote: >> Try: >> >> - The ILIKE operator, for example, >> >> SELECT * FROM account WHERE username ILIKE "test"; >> >> - upper() or lower(), for example, >> >> SELECT * FROM accont WHERE lower(username) = "test"; >> >> --------------------------------------------------------- >> Andrew J. Perrin - Assistant Professor of Sociology >> University of North Carolina, Chapel Hill >> 269 Hamilton Hall CB#3210, Chapel Hill, NC 27599-3210 USA >> andrew_perrin@unc.edu - http://www.unc.edu/~aperrin >> >> On 29 May 2001, Mark wrote: >> >> > Is it possible to execute a query using a where clause that allows case >> > insensitive comparison between a field and text. >> > >> > For example: >> > >> > select * from account where username = 'test' >> > >> > where username could be 'Test', which would be a match. As is, this >> > compare is case sensitive. >> > >> > grep'd the source, but stricmp is only used for keywords and not actual >> > column data. >> > >> > Any help would be greatly appreciated. >> > >> > Thanks, >> > >> > Mark >> > >> > >> > ---------------------------(end of broadcast)--------------------------- >> > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org >> > >> > > >---------------------------(end of broadcast)--------------------------- >TIP 4: Don't 'kill -9' the postmaster >
Can you please explain in little more detail? I am curious. I haven't noticed any discussion about upper() being different from lower() when it comes to such comparisons. As far as I know, upper() and lower() only operate on ascii characters a-z. If you are using the default locale, neither function should have any impact on characters in the extended ascii range. If upper() and lower() operate on characters in 8859-1 and other character sets when the appropriate locale is set, then a difference in the behavior of upper() and lower() would seem like a bug. If you can shed some light on this, I would appreciate it. Thanks, Troy > > Mark writes: > > Is it possible to execute a query using a where clause that allows case > > insensitive comparison between a field and text. > > select * from account where upper(username) = upper('test') > > (Upper used because, as has been remarked on this list and in other places, > folding from richer character sets is likely to get better matches this way). > > And yes, you can create an index on upper(fieldname). > > Dan > > ---------------------------(end of broadcast)--------------------------- > TIP 2: you can get off all lists at once with the unregister command > (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) >
"tjk@tksoft.com" <tjk@tksoft.com> writes: > If upper() and lower() operate on characters in 8859-1 and other character > sets when the appropriate locale is set, then a difference in the behavior > of upper() and lower() would seem like a bug. Au contraire ... upper() and lower() are not symmetric operations in quite a few non-English locales. I'll let those who regularly work with them give specific details, but handling of accents, German esstet (sp?), etc are the gotchas that I recall. regards, tom lane
It appears that the behavior of a bpchar compare with a string literal is not implicitly trimming the bpchar before the compare, which IMHO is incorrect behavior. Is my opinion valid? If so, how difficult of a fix would this be in terms of time and effort? Should I submit a bug report to another list, or is a developer receiving this? Is this a feature? This is an important issue for me, because I am converting a db from MS SQL to postgresql. The MS SQL database uses bpchar (or just char in MS SQL terms) because performance is slightly better; the compares automatically trim the blanks off of the char at compare time. I have over 150 tables to work with, and I would rather not have to change them from bpchar to varchar, not to mention the performance decrease this might incur. You might be thinking, 'just use trim(username) everywhere you compare'. Yes, that is a solution, but not a practical one in my case. If this is a bug, I don't want to hack around it: I'd rather wait for the fix. Varchars would incur performance penalties I want to try to avoid if at all possible. Thanks, Mark On 29 May 2001 09:55:18 -0700, Dan Lyke wrote: > Mark writes: > > Is it possible to execute a query using a where clause that allows case > > insensitive comparison between a field and text. > > select * from account where upper(username) = upper('test') > > (Upper used because, as has been remarked on this list and in other places, > folding from richer character sets is likely to get better matches this way). > > And yes, you can create an index on upper(fieldname). > > Dan > > ---------------------------(end of broadcast)--------------------------- > TIP 2: you can get off all lists at once with the unregister command > (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) >
It appears that the behavior of a bpchar compare with a string literal is not implicitly trimming the bpchar before the compare, which IMHO is incorrect behavior. Is my opinion valid? If so, how difficult of a fix would this be in terms of time and effort? Should I submit a bug report to another list, or is a developer receiving this? Is this a feature? This is an important issue for me, because I am converting a db from MS SQL to postgresql. The MS SQL database uses bpchar (or just char in MS SQL terms) because performance is slightly better; the compares automatically trim the blanks off of the char at compare time. I have over 150 tables to work with, and I would rather not have to change them from bpchar to varchar, not to mention the performance decrease this might incur. You might be thinking, 'just use trim(username) everywhere you compare'. Yes, that is a solution, but not a practical one in my case. If this is a bug, I don't want to hack around it: I'd rather wait for the fix. Varchars would incur performance penalties I want to try to avoid if at all possible. Thanks, Mark On 29 May 2001 09:55:18 -0700, Dan Lyke wrote: > Mark writes: > > Is it possible to execute a query using a where clause that allows case > > insensitive comparison between a field and text. > > select * from account where upper(username) = upper('test') > > (Upper used because, as has been remarked on this list and in other places, > folding from richer character sets is likely to get better matches this way). > > And yes, you can create an index on upper(fieldname). > > Dan > > ---------------------------(end of broadcast)--------------------------- > TIP 2: you can get off all lists at once with the unregister command > (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) >
I use a few of them, and in my opinion there is a distinct group of characters at last in the 8859-1 character set which have a lower and upper case instance. The ranges are 0xC0 to 0xDD for upper case and 0xE0 to 0xFD for upper (with the exception of 0xD0, 0xD7, 0xF0, and 0xF7). I haven't examined all the relevant docs, so I might be wrong. The lists are based on my own observations of the characters in question. There is probably no harm in sending a few extra bytes, so I am appending a related function below. If someone finds a flaw with the function, please tell me; that would be greatly appreciated. I am also including a list of related characters. This email is going out with 8859-1 as the charset, so I hope you are able to view them. UPPER CASE: 192: À (0xc0) 193: Á (0xc1) 194: Â (0xc2) 195: Ã (0xc3) 196: Ä (0xc4) 197: Å (0xc5) 198: Æ (0xc6) 199: Ç (0xc7) 200: È (0xc8) 201: É (0xc9) 202: Ê (0xca) 203: Ë (0xcb) 204: Ì (0xcc) 205: Í (0xcd) 206: Î (0xce) 207: Ï (0xcf) 209: Ñ (0xd1) 210: Ò (0xd2) 211: Ó (0xd3) 212: Ô (0xd4) 213: Õ (0xd5) 214: Ö (0xd6) 216: Ø (0xd8) 217: Ù (0xd9) 218: Ú (0xda) 219: Û (0xdb) 220: Ü (0xdc) 221: Ý (0xdd) LOWER CASE: 224: à (0xe0) 225: á (0xe1) 226: â (0xe2) 227: ã (0xe3) 228: ä (0xe4) 229: å (0xe5) 230: æ (0xe6) 231: ç (0xe7) 232: è (0xe8) 233: é (0xe9) 234: ê (0xea) 235: ë (0xeb) 236: ì (0xec) 237: í (0xed) 238: î (0xee) 239: ï (0xef) 241: ñ (0xf1) 242: ò (0xf2) 243: ó (0xf3) 244: ô (0xf4) 245: õ (0xf5) 246: ö (0xf6) 248: ø (0xf8) 249: ù (0xf9) 250: ú (0xfa) 251: û (0xfb) 252: ü (0xfc) 253: ý (0xfd) SKIPPED 208: Ð (0xd0) 215: × (0xd7) 222: Þ (0xde) 240: ð (0xf0) 247: ÷ (0xf7) 254: þ (0xfe) CREATE FUNCTION lower8859_1 (text) RETURNS text AS '/usr/include/pgsql/lib/str8859_1.so' LANGUAGE 'C'; /* No warranty of any kind, use at your own risk. Use freely. */ text * lower8859_1 (text * str1) { text * result; int32 len1 = 0, i; unsigned char * p, * p2, c; unsigned char upper_min= 0xC0; unsigned char upper_max = 0xDD; len1 = VARSIZE(str1) - VARHDRSZ; if (len1 <= 0) return str1; result = (text *) palloc (len1 + 2 + VARHDRSZ); if (! result) return str1; memset (result, 0, len1 + 2 + VARHDRSZ); p = VARDATA(result); p2 = VARDATA(str1); for (i=0; i < len1; i++) { c = p2[i]; if (isupper(c) || (c >= upper_min && c <= upper_max && c != 0xD0 && c !=0xD7)) p[i] = c + 0x20; else p[i] = c; } VARSIZE(result) = len1 + VARHDRSZ; return result; } Troy > "tjk@tksoft.com" <tjk@tksoft.com> writes: > > If upper() and lower() operate on characters in 8859-1 and other character > > sets when the appropriate locale is set, then a difference in the behavior > > of upper() and lower() would seem like a bug. > > Au contraire ... upper() and lower() are not symmetric operations in > quite a few non-English locales. I'll let those who regularly work with > them give specific details, but handling of accents, German esstet (sp?), > etc are the gotchas that I recall. > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 5: Have you checked our extensive FAQ? > > http://www.postgresql.org/users-lounge/docs/faq.html >
Mark <mark@zserve.com> writes: > It appears that the behavior of a bpchar compare with a string literal > is not implicitly trimming the bpchar before the compare, which IMHO is > incorrect behavior. Is my opinion valid? regression=# create table foo (f1 char(20)); CREATE regression=# insert into foo values ('zz'); INSERT 800569 1 regression=# select * from foo; f1 ----------------------zz (1 row) regression=# select * from foo where f1 = 'zz'; f1 ----------------------zz (1 row) regression=# You'll need to be more specific about what you're unhappy about. > Varchars would incur performance penalties I want to try to avoid if at > all possible. You are operating under misinformation about what's efficient or not. There are no performance penalties that I know of for varchar ... if anything, bpchar is the less efficient choice, at least in Postgres. The extra I/O costs for those padding blanks add up, and there's no compensatory savings anywhere. In any case, if your data is really variable-length strings, forcing it into a datatype that doesn't match its semantics because of dubious micro-efficiency considerations is just plain bad database design. Rather than having blanks that you want to pretend aren't there, you should not have the blanks in the first place. IMHO anyway. regards, tom lane
On Wed, 30 May 2001, Tom Lane wrote: > Mark <mark@zserve.com> writes: > > It appears that the behavior of a bpchar compare with a string literal > > is not implicitly trimming the bpchar before the compare, which IMHO is > > incorrect behavior. Is my opinion valid? > > regression=# create table foo (f1 char(20)); > CREATE > regression=# insert into foo values ('zz'); > INSERT 800569 1 > regression=# select * from foo; > f1 > ---------------------- > zz > (1 row) > > regression=# select * from foo where f1 = 'zz'; > f1 > ---------------------- > zz > (1 row) > > regression=# > > You'll need to be more specific about what you're unhappy about. Given the thread, I think the problem he's having is tied up in upper and lower implicitly converting to text. select * from foo where upper(f1)='ZZ'; gives no rows but if you put 18 spaces after the ZZ you get the row.
On 30 May 2001 11:16:35 -0700, Stephan Szabo wrote: > On Wed, 30 May 2001, Tom Lane wrote: > > > Mark <mark@zserve.com> writes: > > > It appears that the behavior of a bpchar compare with a string literal > > > is not implicitly trimming the bpchar before the compare, which IMHO is > > > incorrect behavior. Is my opinion valid? > > > > regression=# create table foo (f1 char(20)); > > CREATE > > regression=# insert into foo values ('zz'); > > INSERT 800569 1 > > regression=# select * from foo; > > f1 > > ---------------------- > > zz > > (1 row) > > > > regression=# select * from foo where f1 = 'zz'; > > f1 > > ---------------------- > > zz > > (1 row) > > > > regression=# > > > > You'll need to be more specific about what you're unhappy about. > > Given the thread, I think the problem he's having is tied up in > upper and lower implicitly converting to text. > > select * from foo where upper(f1)='ZZ'; > gives no rows but if you put 18 spaces after the ZZ you get the > row. > > could I cast from text to something else?
On 30 May 2001, Mark wrote: > On 30 May 2001 11:16:35 -0700, Stephan Szabo wrote: > > On Wed, 30 May 2001, Tom Lane wrote: > > > > > Mark <mark@zserve.com> writes: > > > > It appears that the behavior of a bpchar compare with a string literal > > > > is not implicitly trimming the bpchar before the compare, which IMHO is > > > > incorrect behavior. Is my opinion valid? > > > > > > regression=# create table foo (f1 char(20)); > > > CREATE > > > regression=# insert into foo values ('zz'); > > > INSERT 800569 1 > > > regression=# select * from foo; > > > f1 > > > ---------------------- > > > zz > > > (1 row) > > > > > > regression=# select * from foo where f1 = 'zz'; > > > f1 > > > ---------------------- > > > zz > > > (1 row) > > > > > > regression=# > > > > > > You'll need to be more specific about what you're unhappy about. > > > > Given the thread, I think the problem he's having is tied up in > > upper and lower implicitly converting to text. > > > > select * from foo where upper(f1)='ZZ'; > > gives no rows but if you put 18 spaces after the ZZ you get the > > row. > > > > > > > could I cast from text to something else? You might be able to get away with something like: create function upper(char) returns char as 'upper'language 'internal'; It seems to work for me, but I'm not 100% sure how safe it is. With the function above, select * from foo where upper(f1)='ZZ' returns me the zz row.
On 30 May 2001 12:53:22 -0400, Tom Lane wrote: > > You are operating under misinformation about what's efficient or not. > There are no performance penalties that I know of for varchar ... if > anything, bpchar is the less efficient choice, at least in Postgres. > The extra I/O costs for those padding blanks add up, and there's no > compensatory savings anywhere. with varchars, as I understand it (and postgresql may be different), each varchar field has a header that stores the length of the particular entry's length. Further, if the varchar field precedes another field, the system loses the ability to use fixed-length addressing to access the field after the varchar, since the system must determine on a case-by-case basis how to access the field after the varchar. It has to calculate the size of the varchar, add that to the start of the varchar (plus header length), and then it has the address of the next field. With non-variant char it is fixed length, so selects and updates operate much more quickly. Even the postgresql documentation asserts something similar to this: 'Both TEXT and VARCHAR() store only the number of characters in the string. CHAR(length) is similar to VARCHAR(), except it always stores exactly length characters. This type pads the value with trailing spaces to achieve the specified length, and provides slightly faster access than TEXT or VARCHAR().' Perhaps I am misinformed. > > In any case, if your data is really variable-length strings, forcing > it into a datatype that doesn't match its semantics because of dubious > micro-efficiency considerations is just plain bad database design. > Rather than having blanks that you want to pretend aren't there, you > should not have the blanks in the first place. IMHO anyway. > Point well taken. If the gain from using bpchar is not much more than using varchar and the data used is actualy variable length up to a max length, the argument is unfounded. So, what to make of all of this? It depends on the performance gain/loss of using varchar. We originally used fixed-length chars because of the performance gain. We try to avoid varchars for that reason. Now, if postgresql is different, then we'll use varchars, as that precisely models our data. Thanks, Mark
Mark <mark@zserve.com> writes: > with varchars, as I understand it (and postgresql may be different), > each varchar field has a header that stores the length of the particular > entry's length. Further, if the varchar field precedes another field, > the system loses the ability to use fixed-length addressing to access > the field after the varchar, since the system must determine on a > case-by-case basis how to access the field after the varchar. It has to > calculate the size of the varchar, add that to the start of the varchar > (plus header length), and then it has the address of the next field. > With non-variant char it is fixed length, so selects and updates operate > much more quickly. Even the postgresql documentation asserts something > similar to this: > > 'Both TEXT and VARCHAR() store only the number of characters in the > string. CHAR(length) is similar to VARCHAR(), except it always stores > exactly length characters. This type pads the value with trailing spaces > to achieve the specified length, and provides slightly faster access > than TEXT or VARCHAR().' > > Perhaps I am misinformed. You are misinformed with respect to Postgres. The implementation of CHAR(length) is just like the implementation of VARCHAR(). It stores a header with the entry length. That header is considered in all functions in the same way that the VARCHAR() header is. I don't know what documentation you are quoting, but it seems somewhat misleading to me. Here is the comment from the implementation (src/backend/utils/adt/varchar.c): /** CHAR() and VARCHAR() types are part of the ANSI SQL standard. CHAR()* is for blank-padded string whose length is specifiedin CREATE TABLE.* VARCHAR is for storing string whose length is at most the length specified* at CREATE TABLE time.**It's hard to implement these types because we cannot figure out* the length of the type from the type itself. I change(hopefully all) the* fmgr calls that invoke input functions of a data type to supply the* length also. (eg. in INSERTs,we have the tupleDescriptor which contains* the length of the attributes and hence the exact length of the char()or* varchar(). We pass this to bpcharin() or varcharin().) In the case where* we cannot determine the length, we passin -1 instead and the input string* must be null-terminated.** We actually implement this as a varlena so that we don'thave to pass in* the length for the comparison functions. (The difference between these* types and "text" is that wetruncate and possibly blank-pad the string* at insertion time.)** - ay 6/95*/ Ian
Mark <mark@zserve.com> writes: > Even the postgresql documentation asserts something similar to this: You're reading obsolete documentation. There is no such assertion (as far as I can find, anyway) in the 7.1 documentation. The speed advantage of bpchar --- which was always extremely marginal anyway, if it was real at all when you consider I/O costs --- is gone completely now, because with TOAST in the picture the system cannot assume that bpchar is fixed length on disk. regards, tom lane