Thread: Re: Case Insensitive Queries

Re: Case Insensitive Queries

From
Mark
Date:
We tried these but it didn't work.  However, that's because username is
a bpchar and not a varchar, so its padded with blanks.  so we tried
where lower(trim(username)) = 'test' and it works.  We'll change that
column to varchar.  The real problem was in the datatype for username.

Thanks,

On 29 May 2001 12:35:53 -0400, ANDREW PERRIN wrote:
> Try:
> 
> - The ILIKE operator, for example,
> 
> SELECT * FROM account WHERE username ILIKE "test";
> 
> - upper() or lower(), for example,
> 
> SELECT * FROM accont WHERE lower(username) = "test";
> 
> ---------------------------------------------------------
>    Andrew J. Perrin - Assistant Professor of Sociology
>         University of North Carolina, Chapel Hill
> 269 Hamilton Hall CB#3210, Chapel Hill, NC 27599-3210 USA
>    andrew_perrin@unc.edu - http://www.unc.edu/~aperrin
> 
> On 29 May 2001, Mark wrote:
> 
> > Is it possible to execute a query using a where clause that allows case
> > insensitive comparison between a field and text.
> > 
> > For example:
> > 
> > select * from account where username = 'test'
> > 
> > where username could be 'Test', which would be a match.  As is, this
> > compare is case sensitive.
> > 
> > grep'd the source, but stricmp is only used for keywords and not actual
> > column data.
> > 
> > Any help would be greatly appreciated.
> > 
> > Thanks,
> > 
> > Mark
> > 
> > 
> > ---------------------------(end of broadcast)---------------------------
> > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
> > 
> 



Re: Case Insensitive Queries

From
"Jim Ballard"
Date:
This is a good point - and it means that Postgres is not following the SQL Standard in this regard.  According to the
standard,a scalar string function of a single string argument should return the same "type" of string as its input.  So
upper(<fixed-char-field>)should return a fixed-char-field.  But it doesn't - it always returns a varchar that includes
thetrailing spaces from the space-padded fixed char argument.  And those trailing spaces are significant for the
varcharcomparison with the string literal.
 

It seems to me there are two ways to correct this behavior.  One is to have overloaded versions of the relevant string
functionthat return the right types.  But, probably better, Postgres could support the notion of PAD SPACE or PAD OFF
tocontrol the behavior of string comparisons regardless of the particular types of the character fields involved.
 

Are ther plans to change this Postgres behavior?

Thanks,
Jim Ballard
Netezza Corp.

---------- Original Message ----------------------------------
From: Mark <mark@zserve.com>
Date: 29 May 2001 10:21:15 -0600

>We tried these but it didn't work.  However, that's because username is
>a bpchar and not a varchar, so its padded with blanks.  so we tried
>where lower(trim(username)) = 'test' and it works.  We'll change that
>column to varchar.  The real problem was in the datatype for username.
>
>Thanks,
>
>On 29 May 2001 12:35:53 -0400, ANDREW PERRIN wrote:
>> Try:
>> 
>> - The ILIKE operator, for example,
>> 
>> SELECT * FROM account WHERE username ILIKE "test";
>> 
>> - upper() or lower(), for example,
>> 
>> SELECT * FROM accont WHERE lower(username) = "test";
>> 
>> ---------------------------------------------------------
>>    Andrew J. Perrin - Assistant Professor of Sociology
>>         University of North Carolina, Chapel Hill
>> 269 Hamilton Hall CB#3210, Chapel Hill, NC 27599-3210 USA
>>    andrew_perrin@unc.edu - http://www.unc.edu/~aperrin
>> 
>> On 29 May 2001, Mark wrote:
>> 
>> > Is it possible to execute a query using a where clause that allows case
>> > insensitive comparison between a field and text.
>> > 
>> > For example:
>> > 
>> > select * from account where username = 'test'
>> > 
>> > where username could be 'Test', which would be a match.  As is, this
>> > compare is case sensitive.
>> > 
>> > grep'd the source, but stricmp is only used for keywords and not actual
>> > column data.
>> > 
>> > Any help would be greatly appreciated.
>> > 
>> > Thanks,
>> > 
>> > Mark
>> > 
>> > 
>> > ---------------------------(end of broadcast)---------------------------
>> > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
>> > 
>> 
>
>
>---------------------------(end of broadcast)---------------------------
>TIP 4: Don't 'kill -9' the postmaster
>


Re: Case Insensitive Queries

From
"tjk@tksoft.com"
Date:
Can you please explain in little more detail? I am curious.

I haven't noticed any discussion about upper() being different from
lower() when it comes to such comparisons.  As far as I know, upper()
and lower() only operate on ascii characters a-z. If you are using the
default locale, neither function should have any impact on characters
in the extended ascii range.

If upper() and lower() operate on characters in 8859-1 and other character
sets when the appropriate locale is set, then a difference in the behavior
of upper() and lower() would seem like a bug.

If you can shed some light on this, I would appreciate it.

Thanks,


Troy






> 
> Mark writes:
> > Is it possible to execute a query using a where clause that allows case
> > insensitive comparison between a field and text.
> 
> select * from account where upper(username) = upper('test')
> 
> (Upper used because, as has been remarked on this list and in other places,
> folding from richer character sets is likely to get better matches this way).
> 
> And yes, you can create an index on upper(fieldname).
> 
> Dan
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
> 



Re: Case Insensitive Queries

From
Tom Lane
Date:
"tjk@tksoft.com" <tjk@tksoft.com> writes:
> If upper() and lower() operate on characters in 8859-1 and other character
> sets when the appropriate locale is set, then a difference in the behavior
> of upper() and lower() would seem like a bug.

Au contraire ... upper() and lower() are not symmetric operations in
quite a few non-English locales.  I'll let those who regularly work with
them give specific details, but handling of accents, German esstet (sp?),
etc are the gotchas that I recall.
        regards, tom lane


Re: Case Insensitive Queries

From
Mark
Date:
It appears that the behavior of a bpchar compare with a string literal
is not implicitly trimming the bpchar before the compare, which IMHO is
incorrect behavior.  Is my opinion valid?  If so, how difficult of a fix
would this be in terms of time and effort?  Should I submit a bug report
to another list, or is a developer receiving this?  Is this a feature?

This is an important issue for me, because I am converting a db from MS
SQL to postgresql.  The MS SQL database uses bpchar (or just char in MS
SQL terms) because performance is slightly better; the compares
automatically trim the blanks off of the char at compare time.  I have
over 150 tables to work with, and I would rather not have to change them
from bpchar to varchar, not to mention the performance decrease this
might incur.

You might be thinking, 'just use trim(username) everywhere you compare'.
Yes, that is a solution, but not a practical one in my case.  If this is
a bug, I don't want to hack around it: I'd rather wait for the fix.
Varchars would incur performance penalties I want to try to avoid if at
all possible.

Thanks,

Mark

On 29 May 2001 09:55:18 -0700, Dan Lyke wrote:
> Mark writes:
> > Is it possible to execute a query using a where clause that allows
case
> > insensitive comparison between a field and text.
> 
> select * from account where upper(username) = upper('test')
> 
> (Upper used because, as has been remarked on this list and in other
places,
> folding from richer character sets is likely to get better matches
this way).
> 
> And yes, you can create an index on upper(fieldname).
> 
> Dan
> 
> ---------------------------(end of
broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to
majordomo@postgresql.org)
> 




Re: Case Insensitive Queries

From
Mark
Date:
It appears that the behavior of a bpchar compare with a string literal
is not implicitly trimming the bpchar before the compare, which IMHO is
incorrect behavior.  Is my opinion valid?  If so, how difficult of a fix
would this be in terms of time and effort?  Should I submit a bug report
to another list, or is a developer receiving this?  Is this a feature?

This is an important issue for me, because I am converting a db from MS
SQL to postgresql.  The MS SQL database uses bpchar (or just char in MS
SQL terms) because performance is slightly better; the compares
automatically trim the blanks off of the char at compare time.  I have
over 150 tables to work with, and I would rather not have to change them
from bpchar to varchar, not to mention the performance decrease this
might incur.

You might be thinking, 'just use trim(username) everywhere you compare'.
Yes, that is a solution, but not a practical one in my case.  If this is
a bug, I don't want to hack around it: I'd rather wait for the fix.
Varchars would incur performance penalties I want to try to avoid if at
all possible.

Thanks,

Mark

On 29 May 2001 09:55:18 -0700, Dan Lyke wrote:
> Mark writes:
> > Is it possible to execute a query using a where clause that allows
case
> > insensitive comparison between a field and text.
> 
> select * from account where upper(username) = upper('test')
> 
> (Upper used because, as has been remarked on this list and in other
places,
> folding from richer character sets is likely to get better matches
this way).
> 
> And yes, you can create an index on upper(fieldname).
> 
> Dan
> 
> ---------------------------(end of
broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to
majordomo@postgresql.org)
> 




Re: Case Insensitive Queries

From
"tjk@tksoft.com"
Date:
I use a few of them, and in my opinion there is a distinct group of
characters at last in the 8859-1 character set which have a lower and
upper case instance. The ranges are 0xC0 to 0xDD for upper case and 0xE0
to 0xFD for upper (with the exception of 0xD0, 0xD7, 0xF0, and 0xF7).
I haven't examined all the relevant docs, so I might be wrong.  The lists
are based on my own observations of the characters in question.

There is probably no harm in sending a few extra bytes, so I am appending
a related function below. If someone finds a flaw with the function,
please tell me; that would be greatly appreciated.

I am also including a list of related characters.  This email is going
out with 8859-1 as the charset, so I hope you are able to view them.




UPPER CASE:
192: À (0xc0)  193: Á (0xc1)  194: Â (0xc2)  195: Ã (0xc3)
196: Ä (0xc4)  197: Å (0xc5)  198: Æ (0xc6)  199: Ç (0xc7)
200: È (0xc8)  201: É (0xc9)  202: Ê (0xca)  203: Ë (0xcb)
204: Ì (0xcc)  205: Í (0xcd)  206: Î (0xce)  207: Ï (0xcf)              209: Ñ (0xd1)  210: Ò (0xd2)  211: Ó (0xd3)
212: Ô (0xd4)  213: Õ (0xd5)  214: Ö (0xd6)
216: Ø (0xd8)  217: Ù (0xd9)  218: Ú (0xda)  219: Û (0xdb)
220: Ü (0xdc)  221: Ý (0xdd)  

LOWER CASE:
224: à (0xe0)  225: á (0xe1)  226: â (0xe2)  227: ã (0xe3)
228: ä (0xe4)  229: å (0xe5)  230: æ (0xe6)  231: ç (0xe7)
232: è (0xe8)  233: é (0xe9)  234: ê (0xea)  235: ë (0xeb)
236: ì (0xec)  237: í (0xed)  238: î (0xee)  239: ï (0xef)              241: ñ (0xf1)  242: ò (0xf2)  243: ó (0xf3)
244: ô (0xf4)  245: õ (0xf5)  246: ö (0xf6)
248: ø (0xf8)  249: ù (0xf9)  250: ú (0xfa)  251: û (0xfb)
252: ü (0xfc)  253: ý (0xfd)

SKIPPED
208: Ð (0xd0) 
215: × (0xd7)
222: Þ (0xde)
240: ð (0xf0)  
247: ÷ (0xf7)
254: þ (0xfe)



CREATE FUNCTION lower8859_1 (text) RETURNS text  AS '/usr/include/pgsql/lib/str8859_1.so'  LANGUAGE 'C';




/* No warranty of any kind, use at your own risk. Use freely. */

text * lower8859_1 (text * str1) {  text * result;  int32 len1  = 0, i;  unsigned char * p, * p2, c;  unsigned char
upper_min= 0xC0;  unsigned char upper_max = 0xDD;
 
  len1 = VARSIZE(str1) - VARHDRSZ;
  if (len1 <= 0)     return str1;
  result = (text *) palloc (len1 + 2 + VARHDRSZ);  if (! result)     return str1;
  memset (result, 0, len1 + 2 + VARHDRSZ);
  p = VARDATA(result);  p2 = VARDATA(str1);
  for (i=0; i < len1; i++) {     c = p2[i];     if (isupper(c) || (c >= upper_min && c <= upper_max && c != 0xD0 && c
!=0xD7))        p[i] = c + 0x20;     else        p[i] = c;  }
 
  VARSIZE(result) = len1 + VARHDRSZ;
  return result;
}




Troy








> "tjk@tksoft.com" <tjk@tksoft.com> writes:
> > If upper() and lower() operate on characters in 8859-1 and other character
> > sets when the appropriate locale is set, then a difference in the behavior
> > of upper() and lower() would seem like a bug.
> 
> Au contraire ... upper() and lower() are not symmetric operations in
> quite a few non-English locales.  I'll let those who regularly work with
> them give specific details, but handling of accents, German esstet (sp?),
> etc are the gotchas that I recall.
> 
>             regards, tom lane
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
> 
> http://www.postgresql.org/users-lounge/docs/faq.html
> 



Re: Case Insensitive Queries

From
Tom Lane
Date:
Mark <mark@zserve.com> writes:
> It appears that the behavior of a bpchar compare with a string literal
> is not implicitly trimming the bpchar before the compare, which IMHO is
> incorrect behavior.  Is my opinion valid?

regression=# create table foo (f1 char(20));
CREATE
regression=# insert into foo values ('zz');
INSERT 800569 1
regression=# select * from foo;         f1
----------------------zz
(1 row)

regression=# select * from foo where f1 = 'zz';         f1
----------------------zz
(1 row)

regression=#

You'll need to be more specific about what you're unhappy about.


> Varchars would incur performance penalties I want to try to avoid if at
> all possible.

You are operating under misinformation about what's efficient or not.
There are no performance penalties that I know of for varchar ... if
anything, bpchar is the less efficient choice, at least in Postgres.
The extra I/O costs for those padding blanks add up, and there's no
compensatory savings anywhere.

In any case, if your data is really variable-length strings, forcing
it into a datatype that doesn't match its semantics because of dubious
micro-efficiency considerations is just plain bad database design.
Rather than having blanks that you want to pretend aren't there, you
should not have the blanks in the first place.  IMHO anyway.
        regards, tom lane


Re: Case Insensitive Queries

From
Stephan Szabo
Date:
On Wed, 30 May 2001, Tom Lane wrote:

> Mark <mark@zserve.com> writes:
> > It appears that the behavior of a bpchar compare with a string literal
> > is not implicitly trimming the bpchar before the compare, which IMHO is
> > incorrect behavior.  Is my opinion valid?
> 
> regression=# create table foo (f1 char(20));
> CREATE
> regression=# insert into foo values ('zz');
> INSERT 800569 1
> regression=# select * from foo;
>           f1
> ----------------------
>  zz
> (1 row)
> 
> regression=# select * from foo where f1 = 'zz';
>           f1
> ----------------------
>  zz
> (1 row)
> 
> regression=#
> 
> You'll need to be more specific about what you're unhappy about.

Given the thread, I think the problem he's having is tied up in
upper and lower implicitly converting to text.

select * from foo where upper(f1)='ZZ';
gives no rows but if you put 18 spaces after the ZZ you get the
row.




Re: Case Insensitive Queries

From
Mark
Date:
On 30 May 2001 11:16:35 -0700, Stephan Szabo wrote:
> On Wed, 30 May 2001, Tom Lane wrote:
> 
> > Mark <mark@zserve.com> writes:
> > > It appears that the behavior of a bpchar compare with a string literal
> > > is not implicitly trimming the bpchar before the compare, which IMHO is
> > > incorrect behavior.  Is my opinion valid?
> > 
> > regression=# create table foo (f1 char(20));
> > CREATE
> > regression=# insert into foo values ('zz');
> > INSERT 800569 1
> > regression=# select * from foo;
> >           f1
> > ----------------------
> >  zz
> > (1 row)
> > 
> > regression=# select * from foo where f1 = 'zz';
> >           f1
> > ----------------------
> >  zz
> > (1 row)
> > 
> > regression=#
> > 
> > You'll need to be more specific about what you're unhappy about.
> 
> Given the thread, I think the problem he's having is tied up in
> upper and lower implicitly converting to text.
> 
> select * from foo where upper(f1)='ZZ';
> gives no rows but if you put 18 spaces after the ZZ you get the
> row.
> 
> 


could I cast from text to something else?




Re: Case Insensitive Queries

From
Stephan Szabo
Date:
On 30 May 2001, Mark wrote:

> On 30 May 2001 11:16:35 -0700, Stephan Szabo wrote:
> > On Wed, 30 May 2001, Tom Lane wrote:
> > 
> > > Mark <mark@zserve.com> writes:
> > > > It appears that the behavior of a bpchar compare with a string literal
> > > > is not implicitly trimming the bpchar before the compare, which IMHO is
> > > > incorrect behavior.  Is my opinion valid?
> > > 
> > > regression=# create table foo (f1 char(20));
> > > CREATE
> > > regression=# insert into foo values ('zz');
> > > INSERT 800569 1
> > > regression=# select * from foo;
> > >           f1
> > > ----------------------
> > >  zz
> > > (1 row)
> > > 
> > > regression=# select * from foo where f1 = 'zz';
> > >           f1
> > > ----------------------
> > >  zz
> > > (1 row)
> > > 
> > > regression=#
> > > 
> > > You'll need to be more specific about what you're unhappy about.
> > 
> > Given the thread, I think the problem he's having is tied up in
> > upper and lower implicitly converting to text.
> > 
> > select * from foo where upper(f1)='ZZ';
> > gives no rows but if you put 18 spaces after the ZZ you get the
> > row.
> > 
> > 
> 
> 
> could I cast from text to something else?

You might be able to get away with something like:
create function upper(char) returns char as 'upper'language 'internal';

It seems to work for me, but I'm not 100% sure how
safe it is.

With the function above, 
select * from foo where upper(f1)='ZZ' returns me the zz
row.




Re: Case Insensitive Queries

From
Mark
Date:
On 30 May 2001 12:53:22 -0400, Tom Lane wrote:

> 
> You are operating under misinformation about what's efficient or not.
> There are no performance penalties that I know of for varchar ... if
> anything, bpchar is the less efficient choice, at least in Postgres.
> The extra I/O costs for those padding blanks add up, and there's no
> compensatory savings anywhere.

with varchars, as I understand it (and postgresql may be different),
each varchar field has a header that stores the length of the particular
entry's length.  Further, if the varchar field precedes another field,
the system loses the ability to use fixed-length addressing to access
the field after the varchar, since the system must determine on a
case-by-case basis how to access the field after the varchar.  It has to
calculate the size of the varchar, add that to the start of the varchar
(plus header length), and then it has the address of the next field.
With non-variant char it is fixed length, so selects and updates operate
much more quickly.  Even the postgresql documentation asserts something
similar to this:

'Both TEXT and VARCHAR() store only the number of characters in the
string. CHAR(length) is similar to VARCHAR(), except it always stores
exactly length characters. This type pads the value with trailing spaces
to achieve the specified length, and provides slightly faster access
than TEXT or VARCHAR().'

Perhaps I am misinformed.

> 
> In any case, if your data is really variable-length strings, forcing
> it into a datatype that doesn't match its semantics because of dubious
> micro-efficiency considerations is just plain bad database design.
> Rather than having blanks that you want to pretend aren't there, you
> should not have the blanks in the first place.  IMHO anyway.
> 

Point well taken.  If the gain from using bpchar is not much more than
using varchar and the data used is actualy variable length up to a max
length, the argument is unfounded.

So, what to make of all of this?  It depends on the performance
gain/loss of using varchar.  We originally used fixed-length chars
because of the performance gain.  We try to avoid varchars for that
reason.  Now, if postgresql is different, then we'll use varchars, as
that precisely models our data.

Thanks,
Mark



Re: Case Insensitive Queries

From
Ian Lance Taylor
Date:
Mark <mark@zserve.com> writes:

> with varchars, as I understand it (and postgresql may be different),
> each varchar field has a header that stores the length of the particular
> entry's length.  Further, if the varchar field precedes another field,
> the system loses the ability to use fixed-length addressing to access
> the field after the varchar, since the system must determine on a
> case-by-case basis how to access the field after the varchar.  It has to
> calculate the size of the varchar, add that to the start of the varchar
> (plus header length), and then it has the address of the next field.
> With non-variant char it is fixed length, so selects and updates operate
> much more quickly.  Even the postgresql documentation asserts something
> similar to this:
> 
> 'Both TEXT and VARCHAR() store only the number of characters in the
> string. CHAR(length) is similar to VARCHAR(), except it always stores
> exactly length characters. This type pads the value with trailing spaces
> to achieve the specified length, and provides slightly faster access
> than TEXT or VARCHAR().'
> 
> Perhaps I am misinformed.

You are misinformed with respect to Postgres.  The implementation of
CHAR(length) is just like the implementation of VARCHAR().  It stores
a header with the entry length.  That header is considered in all
functions in the same way that the VARCHAR() header is.

I don't know what documentation you are quoting, but it seems somewhat
misleading to me.

Here is the comment from the implementation
(src/backend/utils/adt/varchar.c):

/** CHAR() and VARCHAR() types are part of the ANSI SQL standard. CHAR()* is for blank-padded string whose length is
specifiedin CREATE TABLE.* VARCHAR is for storing string whose length is at most the length specified* at CREATE TABLE
time.**It's hard to implement these types because we cannot figure out* the length of the type from the type itself. I
change(hopefully all) the* fmgr calls that invoke input functions of a data type to supply the* length also. (eg. in
INSERTs,we have the tupleDescriptor which contains* the length of the attributes and hence the exact length of the
char()or* varchar(). We pass this to bpcharin() or varcharin().) In the case where* we cannot determine the length, we
passin -1 instead and the input string* must be null-terminated.** We actually implement this as a varlena so that we
don'thave to pass in* the length for the comparison functions. (The difference between these* types and "text" is that
wetruncate and possibly blank-pad the string* at insertion time.)**
      - ay 6/95*/
 

Ian


Re: Case Insensitive Queries

From
Tom Lane
Date:
Mark <mark@zserve.com> writes:
> Even the postgresql documentation asserts something similar to this:

You're reading obsolete documentation.  There is no such assertion
(as far as I can find, anyway) in the 7.1 documentation.  The speed
advantage of bpchar --- which was always extremely marginal anyway,
if it was real at all when you consider I/O costs --- is gone completely
now, because with TOAST in the picture the system cannot assume that
bpchar is fixed length on disk.
        regards, tom lane