Thread: String comparision in PostgreSQL
Hi all,
I'm valutating a complex porting of our application based on Sybase SqlAnywhere on PostgreSQL (I've love it ...) and I'd like to have your opinion about searching/ordering funcionality.
The problem is about string comparision.
MS Sql server, MySql, SqlAnywhere and other DB engine allow a simple definition of case sensitive/insensitive behavior using char, varchar and text field type.
In PostgreSQL I've already tried to use "citext", lower() function (applied to indexes, too ...), ILIKE an so on ..... but nothing really work as I need (poor performances ...) !!
My questions are:
1) Why PostgreSQL don't use COLLATE to manage case sensitive / insensitive comparision (I think it's the best and ANSI standard way ....) ?
2) Can I build a custom COLLATION (for example named: "NOCASE" ....) to apply to my DB objects ? What's the way (... some example ? ) ???
Thanks.
Best Regards.
Nicola.
I'm valutating a complex porting of our application based on Sybase SqlAnywhere on PostgreSQL (I've love it ...) and I'd like to have your opinion about searching/ordering funcionality.
The problem is about string comparision.
MS Sql server, MySql, SqlAnywhere and other DB engine allow a simple definition of case sensitive/insensitive behavior using char, varchar and text field type.
In PostgreSQL I've already tried to use "citext", lower() function (applied to indexes, too ...), ILIKE an so on ..... but nothing really work as I need (poor performances ...) !!
My questions are:
1) Why PostgreSQL don't use COLLATE to manage case sensitive / insensitive comparision (I think it's the best and ANSI standard way ....) ?
2) Can I build a custom COLLATION (for example named: "NOCASE" ....) to apply to my DB objects ? What's the way (... some example ? ) ???
Thanks.
Best Regards.
Nicola.
On Tue, Aug 28, 2012 at 9:46 AM, Nicola Cisternino <ncister@tiscali.it> wrote: > Hi all, > I'm valutating a complex porting of our application based on Sybase > SqlAnywhere on PostgreSQL (I've love it ...) and I'd like to have your > opinion about searching/ordering funcionality. > The problem is about string comparision. > MS Sql server, MySql, SqlAnywhere and other DB engine allow a simple > definition of case sensitive/insensitive behavior using char, varchar and > text field type. > In PostgreSQL I've already tried to use "citext", lower() function (applied > to indexes, too ...), ILIKE an so on ..... but nothing really work as I need > (poor performances ...) !! hm, poor performance? can you elaborate? merlin
Il 29/08/2012 17.08, Merlin Moncure ha scritto:
The same query using " .... LIKE <value> ...." is completed in 15 ms while using " .... ILIKE <value> ...." the execution time is 453 ms ....On Tue, Aug 28, 2012 at 9:46 AM, Nicola Cisternino <ncister@tiscali.it> wrote:Hi all, I'm valutating a complex porting of our application based on Sybase SqlAnywhere on PostgreSQL (I've love it ...) and I'd like to have your opinion about searching/ordering funcionality. The problem is about string comparision. MS Sql server, MySql, SqlAnywhere and other DB engine allow a simple definition of case sensitive/insensitive behavior using char, varchar and text field type. In PostgreSQL I've already tried to use "citext", lower() function (applied to indexes, too ...), ILIKE an so on ..... but nothing really work as I need (poor performances ...) !!hm, poor performance? can you elaborate? merlin
On Thu, Aug 30, 2012 at 1:56 AM, Nicola Cisternino <ncister@tiscali.it> wrote: > The same query using " .... LIKE <value> ...." is completed in 15 ms while > using " .... ILIKE <value> ...." the execution time is 453 ms .... Sounds to me like (pun not intended) there's an index that's being used in one case and not in the other. But taking this back a step: Do you really need case-insensitive comparisons? They become pretty much impossible once you start looking at internationalization. Sure, it's easy in ASCII. You just mask off one bit and off you go. But truly case insensitive matching gets really hairy. Can you redo things with case sensitive searches, possibly with some forcing of case in simple situations? For instance, you accept a name prefix from the user, look at it and find that it's all ASCII; lower-case it, then upper-case the first letter, then add a percent sign, and use a case-sensitive LIKE. That's going to produce correct results in most cases, and is way faster than truly case insensitive searching. ChrisA
On Wed, Aug 29, 2012 at 10:56 AM, Nicola Cisternino <ncister@tiscali.it> wrote: > Il 29/08/2012 17.08, Merlin Moncure ha scritto: > > On Tue, Aug 28, 2012 at 9:46 AM, Nicola Cisternino <ncister@tiscali.it> > wrote: > > Hi all, > I'm valutating a complex porting of our application based on Sybase > SqlAnywhere on PostgreSQL (I've love it ...) and I'd like to have your > opinion about searching/ordering funcionality. > The problem is about string comparision. > MS Sql server, MySql, SqlAnywhere and other DB engine allow a simple > definition of case sensitive/insensitive behavior using char, varchar and > text field type. > In PostgreSQL I've already tried to use "citext", lower() function (applied > to indexes, too ...), ILIKE an so on ..... but nothing really work as I need > (poor performances ...) !! > > hm, poor performance? can you elaborate? > > merlin > > The same query using " .... LIKE <value> ...." is completed in 15 ms while > using " .... ILIKE <value> ...." the execution time is 453 ms .... citext unfortunately doesn't allow for index optimization of LIKE queries, which IMNSHO defeats the whole purpose. to the best way remains to use lower() create table foo(f text); create index on foo(lower(f)); select * from f where lower(f) = 'abc%' this will be index optimized and fast as long as you specified C locale for your database. merlin
On 08/28/2012 10:46 PM, Nicola Cisternino wrote: > 1) Why PostgreSQL don't use COLLATE to manage case sensitive / > insensitive comparision (I think it's the best and ANSI standard way ....) ? Support for per-column collations in PG was only added relatively recently - in 9.1, by the looks: http://www.postgresql.org/docs/9.1/static/collation.html Prior to that, there was no meaningful way to use case insensitive collations, as these would affect the whole database, including system tables, which could break all sorts of things in new and exciting ways. With the advent of per-column and per-operation collation control, case-insensitive collations become very appealing. One of the challenges with adding case insensitive collations is that, AFAIK, collations are implemented using the operating system charset and locale support, which may not offer case insensitive collation directly. Another challenge is the rather ... variable ... meaning of "case insensitive". Results are likely to vary between host platforms and versions. Still, now that per-col / per-op collation is supported, it'd be nice to have. I don't know if it's just a matter of needing someone with the desire and time (or funding) and expertise to design and build it, or if there'd be issues with getting it accepted. -- Craig Ringer
Il 29/08/2012 18.09, Chris Angelico ha scritto:
My simply questions are:
1) Can be a custom collation a solution for my needs ?
2) How can create a custom collation (... what steps ...) ?
Thanks.
Yes I need case-insensitive comparision and the best way to (optionally) obtain it is without alter all application queries that already works with other DB engine (!!)On Thu, Aug 30, 2012 at 1:56 AM, Nicola Cisternino <ncister@tiscali.it> wrote:The same query using " .... LIKE <value> ...." is completed in 15 ms while using " .... ILIKE <value> ...." the execution time is 453 ms ....Sounds to me like (pun not intended) there's an index that's being used in one case and not in the other. But taking this back a step: Do you really need case-insensitive comparisons? They become pretty much impossible once you start looking at internationalization. Sure, it's easy in ASCII. You just mask off one bit and off you go. But truly case insensitive matching gets really hairy. Can you redo things with case sensitive searches, possibly with some forcing of case in simple situations? For instance, you accept a name prefix from the user, look at it and find that it's all ASCII; lower-case it, then upper-case the first letter, then add a percent sign, and use a case-sensitive LIKE. That's going to produce correct results in most cases, and is way faster than truly case insensitive searching. ChrisA
My simply questions are:
1) Can be a custom collation a solution for my needs ?
2) How can create a custom collation (... what steps ...) ?
Thanks.
Il 30/08/2012 4.01, Craig Ringer ha scritto:
Thanks.
On 08/28/2012 10:46 PM, Nicola Cisternino wrote:Thus the problem is that " .... collations are implemented using the operating system charset and locale support ... " while, other engines, implements collations internally ..... is it right ?1) Why PostgreSQL don't use COLLATE to manage case sensitive /
insensitive comparision (I think it's the best and ANSI standard way ....) ?
Support for per-column collations in PG was only added relatively recently - in 9.1, by the looks:
http://www.postgresql.org/docs/9.1/static/collation.html
Prior to that, there was no meaningful way to use case insensitive collations, as these would affect the whole database, including system tables, which could break all sorts of things in new and exciting ways.
With the advent of per-column and per-operation collation control, case-insensitive collations become very appealing.
One of the challenges with adding case insensitive collations is that, AFAIK, collations are implemented using the operating system charset and locale support, which may not offer case insensitive collation directly.
Another challenge is the rather ... variable ... meaning of "case insensitive". Results are likely to vary between host platforms and versions.
Still, now that per-col / per-op collation is supported, it'd be nice to have. I don't know if it's just a matter of needing someone with the desire and time (or funding) and expertise to design and build it, or if there'd be issues with getting it accepted.
--
Craig Ringer
Thanks.
On 08/30/2012 05:16 PM, Nicola Cisternino wrote: > Thus the problem is that " .... collations are implemented using the > operating system charset and locale support ... " while, other engines, > implements collations internally ..... is it right ? That's my understanding, but I don't know which other database systems you're talking about because you've never specifically named any. It's entirely possible that some other DBMSs use the system locale and collation support with internal downcasing, for example. All I know, I've already said, but I'm not an expert on Pg's innards. -- Craig Ringer
Il 30/08/2012 12.45, Craig Ringer ha scritto:
At this point, the solution could be a new, custom, operating system collation .... (something like: en_CI_US.UTF-8) ....
On 08/30/2012 05:16 PM, Nicola Cisternino wrote:Tahnk you for replay.Thus the problem is that " .... collations are implemented using the
operating system charset and locale support ... " while, other engines,
implements collations internally ..... is it right ?
That's my understanding, but I don't know which other database systems you're talking about because you've never specifically named any.
It's entirely possible that some other DBMSs use the system locale and collation support with internal downcasing, for example.
All I know, I've already said, but I'm not an expert on Pg's innards.
--
Craig Ringer
At this point, the solution could be a new, custom, operating system collation .... (something like: en_CI_US.UTF-8) ....
On 08/30/2012 06:54 PM, Nicola Cisternino wrote:
At this point, the solution could be a new, custom, operating system collation .... (something like: en_CI_US.UTF-8) ....
As far as I know - and as I said, I'm hardly an expert in Pg's guts - there's no way to create a case insensitive collation as things stand.
Creating case insensitive collations in the C library is probably more work than you could possibly expect, if it's even possible at all on some platforms.
You'd have a better chance investigating whether it could be possible to "wrap" an operating system collation with lower-casing. This is likely to be something you will want to work with some experienced Pg developers with, most likely on a funded work basis because you're unlikely to find anyone who wants to implement case insensitive collations for fun in their spare time.
If you're seriously interested in enhancing PostgreSQL to support case insensitive collation, and you have the funds to sponsor the significant amount of work required, you could ask on pgsql-hackers and see if anyone's interested, or contact http://www.postgresql.org/support/professional_support/ . This is not a trivial job.
It will help if you are able to provide *specific* *test* *cases* showing exactly how you think it should work, with samples showing how it works on other specifically named products. No more hand-waving.
--
Craig Ringer
Le 30/08/2012 12:45, Craig Ringer a écrit : > That's my understanding, but I don't know which other database systems > you're talking about because you've never specifically named any. > In his primary post he talk about SQL Server, Sybase and MySQL wich does good jobs with collation.... Almost a majority of RDBMS have collation support wich is very important for non english languages, that represents about 90% of the planet languages ! This feature has always been a "black spot" in PG, and the most important topic to discourage to use it in professionnal applications. A + -- Frédéric BROUARD - expert SGBDR et SQL - MVP SQL Server - 06 11 86 40 66 Le site sur le langage SQL et les SGBDR : http://sqlpro.developpez.com Enseignant Arts & Métiers PACA, ISEN Toulon et CESI/EXIA Aix en Provence Audit, conseil, expertise, formation, modélisation, tuning, optimisation *********************** http://www.sqlspot.com *************************
On Thu, Aug 30, 2012 at 6:09 AM, F. BROUARD / SQLpro <sqlpro@club-internet.fr> wrote: > Le 30/08/2012 12:45, Craig Ringer a écrit : > > >> That's my understanding, but I don't know which other database systems >> you're talking about because you've never specifically named any. >> > In his primary post he talk about SQL Server, Sybase and MySQL wich does > good jobs with collation.... > > Almost a majority of RDBMS have collation support wich is very important for > non english languages, that represents about 90% of the planet languages ! > > This feature has always been a "black spot" in PG, and the most important > topic to discourage to use it in professionnal applications. Citations please. PostgreSQL has excellent collation support. http://www.postgresql.org/docs/9.1/static/collation.html Sybase performance on upper() case insensitive searchs: http://stackoverflow.com/questions/81268/case-insensitive-search-on-sybase MySQL case insensitive searchs rely on indexing upper or lower functions just like PostgreSQL and use seq scans for collation induced case insensitive searchs: http://use-the-index-luke.com/sql/where-clause/functions/case-insensitive-search Further PostgreSQL has the citext type: http://www.postgresql.org/docs/9.1/static/citext.html Which can be handy for case insensitive searches but can ONLY do case insensitive stuff.
On Thu, Aug 30, 2012 at 9:34 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote: > On Thu, Aug 30, 2012 at 6:09 AM, F. BROUARD / SQLpro > <sqlpro@club-internet.fr> wrote: >> Le 30/08/2012 12:45, Craig Ringer a écrit : >> >> >>> That's my understanding, but I don't know which other database systems >>> you're talking about because you've never specifically named any. >>> >> In his primary post he talk about SQL Server, Sybase and MySQL wich does >> good jobs with collation.... >> >> Almost a majority of RDBMS have collation support wich is very important for >> non english languages, that represents about 90% of the planet languages ! >> >> This feature has always been a "black spot" in PG, and the most important >> topic to discourage to use it in professionnal applications. > > Citations please. > > PostgreSQL has excellent collation support. > http://www.postgresql.org/docs/9.1/static/collation.html > > Sybase performance on upper() case insensitive searchs: > http://stackoverflow.com/questions/81268/case-insensitive-search-on-sybase > > MySQL case insensitive searchs rely on indexing upper or lower > functions just like PostgreSQL and use seq scans for collation induced > case insensitive searchs: > http://use-the-index-luke.com/sql/where-clause/functions/case-insensitive-search > > Further PostgreSQL has the citext type: > http://www.postgresql.org/docs/9.1/static/citext.html > Which can be handy for case insensitive searches but can ONLY do case > insensitive stuff. Yeah. In particular, lower() approaches for case insensitive searching have always worked and IMSNHO remain the best way. Expression based searching and indexing is a real strong point for postgres and is the 'go to' method for solving a broad array of problems. The fairest complaint you can make is that historically you've had to ditch performance to get good collation features -- and this is mostly solved. I guess the biggest problem that remains is the inability to use LIKE searches for index through utf8 ordered indexes. merlin