Thread: String comparision in PostgreSQL

String comparision in PostgreSQL

From
Nicola Cisternino
Date:
Hi all,
I'm valutating a complex porting of our application based on Sybase SqlAnywhere on PostgreSQL (I've love it ...) and I'd like to have your opinion about searching/ordering funcionality.
The problem is about string comparision.
MS Sql server, MySql, SqlAnywhere and other DB engine allow a simple definition of case sensitive/insensitive behavior using char, varchar and text field type.
In PostgreSQL I've already tried to use "citext", lower() function (applied to indexes, too ...), ILIKE an so on ..... but nothing really work as I need (poor performances ...) !!
My questions are:
1) Why PostgreSQL don't use COLLATE to manage case sensitive / insensitive comparision (I think it's the best and ANSI standard way ....) ?
2) Can I build a custom COLLATION (for example named: "NOCASE" ....) to apply to my DB objects ? What's the way (... some example ? ) ???

Thanks.
Best Regards.
Nicola.

Re: String comparision in PostgreSQL

From
Merlin Moncure
Date:
On Tue, Aug 28, 2012 at 9:46 AM, Nicola Cisternino <ncister@tiscali.it> wrote:
> Hi all,
> I'm valutating a complex porting of our application based on Sybase
> SqlAnywhere on PostgreSQL (I've love it ...) and I'd like to have your
> opinion about searching/ordering funcionality.
> The problem is about string comparision.
> MS Sql server, MySql, SqlAnywhere and other DB engine allow a simple
> definition of case sensitive/insensitive behavior using char, varchar and
> text field type.
> In PostgreSQL I've already tried to use "citext", lower() function (applied
> to indexes, too ...), ILIKE an so on ..... but nothing really work as I need
> (poor performances ...) !!

hm, poor performance? can you elaborate?

merlin


Re: String comparision in PostgreSQL

From
Nicola Cisternino
Date:
Il 29/08/2012 17.08, Merlin Moncure ha scritto:
On Tue, Aug 28, 2012 at 9:46 AM, Nicola Cisternino <ncister@tiscali.it> wrote:
Hi all,
I'm valutating a complex porting of our application based on Sybase
SqlAnywhere on PostgreSQL (I've love it ...) and I'd like to have your
opinion about searching/ordering funcionality.
The problem is about string comparision.
MS Sql server, MySql, SqlAnywhere and other DB engine allow a simple
definition of case sensitive/insensitive behavior using char, varchar and
text field type.
In PostgreSQL I've already tried to use "citext", lower() function (applied
to indexes, too ...), ILIKE an so on ..... but nothing really work as I need
(poor performances ...) !!
hm, poor performance? can you elaborate?

merlin
The same query using " .... LIKE <value> ...." is completed in 15 ms while using " .... ILIKE <value> ...." the execution time is 453 ms ....

Re: String comparision in PostgreSQL

From
Chris Angelico
Date:
On Thu, Aug 30, 2012 at 1:56 AM, Nicola Cisternino <ncister@tiscali.it> wrote:
> The same query using " .... LIKE <value> ...." is completed in 15 ms while
> using " .... ILIKE <value> ...." the execution time is 453 ms ....

Sounds to me like (pun not intended) there's an index that's being
used in one case and not in the other.

But taking this back a step: Do you really need case-insensitive
comparisons? They become pretty much impossible once you start looking
at internationalization. Sure, it's easy in ASCII. You just mask off
one bit and off you go. But truly case insensitive matching gets
really hairy. Can you redo things with case sensitive searches,
possibly with some forcing of case in simple situations? For instance,
you accept a name prefix from the user, look at it and find that it's
all ASCII; lower-case it, then upper-case the first letter, then add a
percent sign, and use a case-sensitive LIKE. That's going to produce
correct results in most cases, and is way faster than truly case
insensitive searching.

ChrisA


Re: String comparision in PostgreSQL

From
Merlin Moncure
Date:
On Wed, Aug 29, 2012 at 10:56 AM, Nicola Cisternino <ncister@tiscali.it> wrote:
> Il 29/08/2012 17.08, Merlin Moncure ha scritto:
>
> On Tue, Aug 28, 2012 at 9:46 AM, Nicola Cisternino <ncister@tiscali.it>
> wrote:
>
> Hi all,
> I'm valutating a complex porting of our application based on Sybase
> SqlAnywhere on PostgreSQL (I've love it ...) and I'd like to have your
> opinion about searching/ordering funcionality.
> The problem is about string comparision.
> MS Sql server, MySql, SqlAnywhere and other DB engine allow a simple
> definition of case sensitive/insensitive behavior using char, varchar and
> text field type.
> In PostgreSQL I've already tried to use "citext", lower() function (applied
> to indexes, too ...), ILIKE an so on ..... but nothing really work as I need
> (poor performances ...) !!
>
> hm, poor performance? can you elaborate?
>
> merlin
>
> The same query using " .... LIKE <value> ...." is completed in 15 ms while
> using " .... ILIKE <value> ...." the execution time is 453 ms ....

citext unfortunately doesn't allow for index optimization of LIKE
queries, which IMNSHO defeats the whole purpose.  to the best way
remains to use lower()

create table foo(f text);
create index on foo(lower(f));
select * from f where lower(f) = 'abc%'

this will be index optimized and fast as long as you specified C
locale for your database.

merlin


Re: String comparision in PostgreSQL

From
Craig Ringer
Date:
On 08/28/2012 10:46 PM, Nicola Cisternino wrote:

> 1) Why PostgreSQL don't use COLLATE to manage case sensitive /
> insensitive comparision (I think it's the best and ANSI standard way ....) ?

Support for per-column collations in PG was only added relatively
recently - in 9.1, by the looks:

     http://www.postgresql.org/docs/9.1/static/collation.html

Prior to that, there was no meaningful way to use case insensitive
collations, as these would affect the whole database, including system
tables, which could break all sorts of things in new and exciting ways.

With the advent of per-column and per-operation collation control,
case-insensitive collations become very appealing.

One of the challenges with adding case insensitive collations is that,
AFAIK, collations are implemented using the operating system charset and
locale support, which may not offer case insensitive collation directly.

Another challenge is the rather ... variable ... meaning of "case
insensitive". Results are likely to vary between host platforms and
versions.

Still, now that per-col / per-op collation is supported, it'd be nice to
have. I don't know if it's just a matter of needing someone with the
desire and time (or funding) and expertise to design and build it, or if
there'd be issues with getting it accepted.

--
Craig Ringer


Re: String comparision in PostgreSQL

From
Nicola Cisternino
Date:
Il 29/08/2012 18.09, Chris Angelico ha scritto:
On Thu, Aug 30, 2012 at 1:56 AM, Nicola Cisternino <ncister@tiscali.it> wrote:
The same query using " .... LIKE <value> ...." is completed in 15 ms while
using " .... ILIKE <value> ...." the execution time is 453 ms ....
Sounds to me like (pun not intended) there's an index that's being
used in one case and not in the other.

But taking this back a step: Do you really need case-insensitive
comparisons? They become pretty much impossible once you start looking
at internationalization. Sure, it's easy in ASCII. You just mask off
one bit and off you go. But truly case insensitive matching gets
really hairy. Can you redo things with case sensitive searches,
possibly with some forcing of case in simple situations? For instance,
you accept a name prefix from the user, look at it and find that it's
all ASCII; lower-case it, then upper-case the first letter, then add a
percent sign, and use a case-sensitive LIKE. That's going to produce
correct results in most cases, and is way faster than truly case
insensitive searching.

ChrisA


Yes I need case-insensitive comparision and the best way to (optionally) obtain it is without alter all application queries that already works with other DB engine (!!)
My simply questions are:
1) Can be a custom collation a solution for my needs ?
2) How can create a custom collation (... what steps ...) ?
Thanks.

Re: String comparision in PostgreSQL

From
Nicola Cisternino
Date:
Il 30/08/2012 4.01, Craig Ringer ha scritto:
On 08/28/2012 10:46 PM, Nicola Cisternino wrote:

1) Why PostgreSQL don't use COLLATE to manage case sensitive /
insensitive comparision (I think it's the best and ANSI standard way ....) ?

Support for per-column collations in PG was only added relatively recently - in 9.1, by the looks:

    http://www.postgresql.org/docs/9.1/static/collation.html

Prior to that, there was no meaningful way to use case insensitive collations, as these would affect the whole database, including system tables, which could break all sorts of things in new and exciting ways.

With the advent of per-column and per-operation collation control, case-insensitive collations become very appealing.

One of the challenges with adding case insensitive collations is that, AFAIK, collations are implemented using the operating system charset and locale support, which may not offer case insensitive collation directly.

Another challenge is the rather ... variable ... meaning of "case insensitive". Results are likely to vary between host platforms and versions.

Still, now that per-col / per-op collation is supported, it'd be nice to have. I don't know if it's just a matter of needing someone with the desire and time (or funding) and expertise to design and build it, or if there'd be issues with getting it accepted.

--
Craig Ringer

Thus the problem is that " .... collations are implemented using the operating system charset and locale support ... " while, other engines, implements collations internally ..... is it right ?
Thanks.

Re: String comparision in PostgreSQL

From
Craig Ringer
Date:
On 08/30/2012 05:16 PM, Nicola Cisternino wrote:

> Thus the problem is that " .... collations are implemented using the
> operating system charset and locale support ... " while, other engines,
> implements collations internally ..... is it right ?

That's my understanding, but I don't know which other database systems
you're talking about because you've never specifically named any.

It's entirely possible that some other DBMSs use the system locale and
collation support with internal downcasing, for example.

All I know, I've already said, but I'm not an expert on Pg's innards.

--
Craig Ringer


Re: String comparision in PostgreSQL

From
Nicola Cisternino
Date:
Il 30/08/2012 12.45, Craig Ringer ha scritto:
On 08/30/2012 05:16 PM, Nicola Cisternino wrote:

Thus the problem is that " .... collations are implemented using the
operating system charset and locale support ... " while, other engines,
implements collations internally ..... is it right ?

That's my understanding, but I don't know which other database systems you're talking about because you've never specifically named any.

It's entirely possible that some other DBMSs use the system locale and collation support with internal downcasing, for example.

All I know, I've already said, but I'm not an expert on Pg's innards.

--
Craig Ringer
Tahnk you for replay.
At this point, the solution could be a new, custom, operating system collation .... (something like: en_CI_US.UTF-8) ....

Re: String comparision in PostgreSQL

From
Craig Ringer
Date:
On 08/30/2012 06:54 PM, Nicola Cisternino wrote:

At this point, the solution could be a new, custom, operating system collation .... (something like: en_CI_US.UTF-8) ....

As far as I know - and as I said, I'm hardly an expert in Pg's guts - there's no way to create a case insensitive collation as things stand.

Creating case insensitive collations in the C library is probably more work than you could possibly expect, if it's even possible at all on some platforms.

You'd have a better chance investigating whether it could be possible to "wrap" an operating system collation with lower-casing. This is likely to be something you will want to work with some experienced Pg developers with, most likely on a funded work basis because you're unlikely to find anyone who wants to implement case insensitive collations for fun in their spare time.

If you're seriously interested in enhancing PostgreSQL to support case insensitive collation, and you have the funds to sponsor the significant amount of work required, you could ask on pgsql-hackers and see if anyone's interested, or contact http://www.postgresql.org/support/professional_support/ . This is not a trivial job.

It will help if you are able to provide *specific* *test* *cases* showing exactly how you think it should work, with samples showing how it works on other specifically named products. No more hand-waving.

--
Craig Ringer

Re: String comparision in PostgreSQL

From
"F. BROUARD / SQLpro"
Date:
Le 30/08/2012 12:45, Craig Ringer a écrit :

> That's my understanding, but I don't know which other database systems
> you're talking about because you've never specifically named any.
>
In his primary post he talk about SQL Server, Sybase and MySQL wich does
good jobs with collation....

Almost a majority of RDBMS have collation support wich is very important
for non english languages, that represents about 90% of the planet
languages !

This feature has always been a "black spot" in PG, and the most
important topic to discourage to use it in professionnal applications.

A +

--
Frédéric BROUARD - expert SGBDR et SQL - MVP SQL Server - 06 11 86 40 66
Le site sur le langage SQL et les SGBDR  :  http://sqlpro.developpez.com
Enseignant Arts & Métiers PACA, ISEN Toulon et CESI/EXIA Aix en Provence
Audit, conseil, expertise, formation, modélisation, tuning, optimisation
*********************** http://www.sqlspot.com *************************



Re: String comparision in PostgreSQL

From
Scott Marlowe
Date:
On Thu, Aug 30, 2012 at 6:09 AM, F. BROUARD / SQLpro
<sqlpro@club-internet.fr> wrote:
> Le 30/08/2012 12:45, Craig Ringer a écrit :
>
>
>> That's my understanding, but I don't know which other database systems
>> you're talking about because you've never specifically named any.
>>
> In his primary post he talk about SQL Server, Sybase and MySQL wich does
> good jobs with collation....
>
> Almost a majority of RDBMS have collation support wich is very important for
> non english languages, that represents about 90% of the planet languages !
>
> This feature has always been a "black spot" in PG, and the most important
> topic to discourage to use it in professionnal applications.

Citations please.

PostgreSQL has excellent collation support.
http://www.postgresql.org/docs/9.1/static/collation.html

Sybase performance on upper() case insensitive searchs:
http://stackoverflow.com/questions/81268/case-insensitive-search-on-sybase

MySQL case insensitive searchs rely on indexing upper or lower
functions just like PostgreSQL and use seq scans for collation induced
case insensitive searchs:
http://use-the-index-luke.com/sql/where-clause/functions/case-insensitive-search

Further PostgreSQL has the citext type:
http://www.postgresql.org/docs/9.1/static/citext.html
Which can be handy for case insensitive searches but can ONLY do case
insensitive stuff.


Re: String comparision in PostgreSQL

From
Merlin Moncure
Date:
On Thu, Aug 30, 2012 at 9:34 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> On Thu, Aug 30, 2012 at 6:09 AM, F. BROUARD / SQLpro
> <sqlpro@club-internet.fr> wrote:
>> Le 30/08/2012 12:45, Craig Ringer a écrit :
>>
>>
>>> That's my understanding, but I don't know which other database systems
>>> you're talking about because you've never specifically named any.
>>>
>> In his primary post he talk about SQL Server, Sybase and MySQL wich does
>> good jobs with collation....
>>
>> Almost a majority of RDBMS have collation support wich is very important for
>> non english languages, that represents about 90% of the planet languages !
>>
>> This feature has always been a "black spot" in PG, and the most important
>> topic to discourage to use it in professionnal applications.
>
> Citations please.
>
> PostgreSQL has excellent collation support.
> http://www.postgresql.org/docs/9.1/static/collation.html
>
> Sybase performance on upper() case insensitive searchs:
> http://stackoverflow.com/questions/81268/case-insensitive-search-on-sybase
>
> MySQL case insensitive searchs rely on indexing upper or lower
> functions just like PostgreSQL and use seq scans for collation induced
> case insensitive searchs:
> http://use-the-index-luke.com/sql/where-clause/functions/case-insensitive-search
>
> Further PostgreSQL has the citext type:
> http://www.postgresql.org/docs/9.1/static/citext.html
> Which can be handy for case insensitive searches but can ONLY do case
> insensitive stuff.

Yeah.  In particular, lower() approaches for case insensitive
searching have always worked and IMSNHO remain the best way.
Expression based searching and indexing is a real strong point for
postgres and is the 'go to' method for solving a broad array of
problems.  The fairest complaint you can make is that historically
you've had to ditch performance to get good collation features -- and
this is mostly solved.

I guess the biggest problem that remains is the inability to use LIKE
searches for index through utf8 ordered indexes.

merlin