Thread: Collate order on Mac OS X, text with diacritics in UTF-8
Hi,
We are a software publisher searching for a new DBMS for our software. We have more than one hundred installed servers, running Mac OS and a Primebase database.
We are a software publisher searching for a new DBMS for our software. We have more than one hundred installed servers, running Mac OS and a Primebase database.
We have spend some time evaluating PostgreSQL and we can't get correct outputs with the ORDER BY command.
LC_COLLATE and LC_CTYPE are set to fr_FR.UTF-8.
It seems there is a known problem with the collating order of text including diacritics with the UTF8 encodings on BSD systems.
Does anyone know a workaround ?
Thanks for your answers,
Martin
Am 12.01.2010 um 12:36 schrieb Martin Flahault: > We have spend some time evaluating PostgreSQL and we can't get correct outputs with the ORDER BY command. > LC_COLLATE and LC_CTYPE are set to fr_FR.UTF-8. > > It seems there is a known problem with the collating order of text including diacritics with the UTF8 encodings on BSDsystems. > > Does anyone know a workaround ? The best i've seen so far is: CREATE OR REPLACE FUNCTION f_getorderbyfriendlyversion(texttoconvert text) RETURNS text AS $BODY$ select translate(upper($1),'ÄÖÜ','AOU')--add french diacritical characters here $BODY$ LANGUAGE 'sql' IMMUTABLE STRICT COST 100; ALTER FUNCTION f_getorderbyfriendlyversion(text) OWNER TO postgres; CREATE INDEX idx_mytable_myfield_orderbyfriendly ON mytable USING btree (f_getorderbyfriendlyversion(myfield::text)); Select * from mytable order by f_getorderbyfriendlyversion(myfield); Not an ideal solution, but it seems to perform quite well. Best wishes from Berlin, Maximilian Tyrtania
On 12/01/2010 7:36 PM, Martin Flahault wrote: > Hi, > > We are a software publisher searching for a new DBMS for our software. > We have more than one hundred installed servers, running Mac OS and a > Primebase database. > > We have spend some time evaluating PostgreSQL and we can't get correct > outputs with the ORDER BY command. Can you provide a sample? Include a table of sample values, an example query, its output, and what you'd expect to get instead? And why? -- Craig Ringer
Here is an exemple :
postgres=# create database newbase;
CREATE DATABASE
postgres=# \c newbase;
psql (8.4.2)
You are now connected to database "newbase".
newbase=# create table t1 (contenu text);
CREATE TABLE
newbase=# insert into t1 values ('a'), ('e'), ('à'), ('é'), ('A'), ('E');
INSERT 0 6
newbase=# select * from t1 order by contenu;
contenu
---------
A
E
a
e
à
é
(6 rows)
newbase=# select * from t1 order by upper(contenu);
contenu
---------
a
A
e
E
à
é
(6 rows)
Here is the encoding informations :
newbase=# \encoding
UTF8
newbase=# show lc_collate;
lc_collate
------------
fr_FR
(1 row)
newbase=# show lc_ctype;
lc_ctype
----------
fr_FR
(1 row)
As with others DBMS (MySQL for example), diacritics should be ignored when determining the sort order. Here is the expected output:
a
à
A
e
é
E
It seems there is a problem with the collating order on BSD systems with diacritics using UTF8.
If you put this text :
a
A
à
é
e
E
in a UTF8 text file and use the "sort" command on it, you will have the same wrong output as with PostgreSQL :
A
E
a
e
à
é
Hope this will help,
Martin
On Wed, Jan 13, 2010 at 04:15:06PM +0100, Martin Flahault wrote: [postgres] > newbase=# select * from t1 order by contenu; > contenu > --------- > A > E > a > e Postgresql outputs whatever the C library does on the underlying system. The quality of this varies wildly. > à > As with others DBMS (MySQL for example), diacritics should be ignored when determining the sort order. Here is the expectedoutput: MySQL implements the unicode collation algorithm, which means it essentially does what you want. > > It seems there is a problem with the collating order on BSD systems with diacritics using UTF8. Last I checked, BSD did not support useful sorting on UTF-8 at all, so it's not surprised it doesn't work. > in a UTF8 text file and use the "sort" command on it, you will have the same wrong output as with PostgreSQL : Yes, that's the basic idea. Mac OS X apparently provides ICU underneath for programs that would like true unicode collation, but there is little chance that postgresql will ever use this. Hope this helps, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Please line up in a tree and maintain the heap invariant while > boarding. Thank you for flying nlogn airlines.
Attachment
On 13/01/2010 11:15 PM, Martin Flahault wrote: > It seems there is a problem with the collating order on BSD systems with > diacritics using UTF8. > If you put this text : > a > A > à > é > e > E > > in a UTF8 text file and use the "sort" command on it, you will have the > same wrong output as with PostgreSQL : > A > E > a > e > à > é First: PostgreSQL expects the OS to behave correctly and sort according to the locale. It relies on the C library for this. If the C library doesn't do it right, PostgreSQL won't do it right either. So you need to get Mac OS X to do the right thing. Your results match what I get on a Linux system without a properly generated fr_FR.UTF-8 locale. Libc falls back on the "C" locale, which sorts that way. If I generate the fr_FR.UTF-8 locale and run the sort (on the file "x"), I get the desired result: LANG=fr_FR.UTF-8 LC_ALL=fr_FR.UTF-8 sort x a A à e E é I don't know Mac OS X well, but this is making me wonder if maybe you're just missing the required information for the locale, so libc is falling back on the "C" locale. (Of course, being Mac OS X there are probably at least three out of date or simply false "man" pages describing the behaviour, none of which reflect the reality of a magic config key buried somewhere in NetInfo, for which the documentation is also completely out of date. Bitter? Me? Yeah, I admin a bunch of OS X machines on a business network.) Hmm... a quick test suggests that Mac OS X (testing on 10.4) at least *thinks* it supports the fr_FR.UTF-8 locale: osx104$ LANG=xxx LC_ALL=xxx locale LANG="xxx" LC_COLLATE="C" LC_CTYPE="C" LC_MESSAGES="C" LC_MONETARY="C" LC_NUMERIC="C" LC_TIME="C" LC_ALL="C" osx104$ LANG=fr_FR.UTF-8 LC_ALL=fr_FR.UTF-8 locale LANG="fr_FR.UTF-8" LC_COLLATE="fr_FR.UTF-8" LC_CTYPE="fr_FR.UTF-8" LC_MESSAGES="fr_FR.UTF-8" LC_MONETARY="fr_FR.UTF-8" LC_NUMERIC="fr_FR.UTF-8" LC_TIME="fr_FR.UTF-8" LC_ALL="fr_FR.UTF-8" osx104$ locale -a | grep fr_FR fr_FR fr_FR.ISO8859-1 fr_FR.ISO8859-15 fr_FR.UTF-8 ... yet it clearly doesn't: osx104$ LANG=C LC_ALL=C sort x A E a e à é osx104$ LANG=fr_FR.UTF-8 LC_ALL=fr_FR.UTF-8 sort x A E a e à é osx104$ LANG=fr_FR.ISO8859-1 LC_ALL=fr_FR.ISO8859-1 sort x A E a e à é Mac OS X seems to keep its locale config in /usr/share/locale . Looking there, there are clearly LC_COLLATE files for fr_FR.UTF-8 . However, they're identical to those for en_US.UTF-8: osx104$ cd /usr/share/locale osx104$ diff fr_FR.UTF-8/LC_COLLATE en_US.UTF-8/LC_COLLATE ... so your OS's localized collation support is broken/missing, at least if the same is true for more modern versions of OS X. -- Craig Ringer
Martijn van Oosterhout wrote: >> in a UTF8 text file and use the "sort" command on it, you will have the same wrong output as with PostgreSQL : > > Yes, that's the basic idea. Mac OS X apparently provides ICU underneath > for programs that would like true unicode collation, but there is > little chance that postgresql will ever use this. Out of interest: Why not? Using ICU would permit Pg to be independent of libc's collation rules, finally permitting things like specifying a specific collation for a textual sort. It'd make mixing data from different locales in a database a lot easier (read: possible to do correctly). Is this just a matter of "nobody cares enough to produce a solid, tested patch with equivalent performance that doesn't turn people who try to review it green with disgust" ... or are there specific reasons why using something like ICU instead of libc's locale support is not appropriate for Pg? -- Craig Ringer
Craig Ringer <craig@postnewspapers.com.au> writes: > Martijn van Oosterhout wrote: >> Yes, that's the basic idea. Mac OS X apparently provides ICU underneath >> for programs that would like true unicode collation, but there is >> little chance that postgresql will ever use this. > Out of interest: Why not? There's plenty of discussion in the archives about it, but basically ICU would represent a pretty enormous dependency and would lock us in to having no other backend encoding but UTF8. The state of OS X's POSIX-spec locale support is pretty pitiful, but on the whole I'd say if you need better UTF8 locale support you could use another OS. regards, tom lane
Tom Lane wrote: > Craig Ringer <craig@postnewspapers.com.au> writes: >> Martijn van Oosterhout wrote: >>> Yes, that's the basic idea. Mac OS X apparently provides ICU underneath >>> for programs that would like true unicode collation, but there is >>> little chance that postgresql will ever use this. > >> Out of interest: Why not? > > There's plenty of discussion in the archives about it, but basically > ICU would represent a pretty enormous dependency and would lock us in > to having no other backend encoding but UTF8. Thanks. You're right - I should've just STFA ;-) so I appreciate the brief explanation. > The state of OS X's POSIX-spec locale support is pretty pitiful, but on > the whole I'd say if you need better UTF8 locale support you could use > another OS. That's my personal opinion too ... I have the "pleasure" of administrating an OS X Server and six Mac Pro clients at work, and have become well acquainted with the exciting variety of bugs, undocumented "features", and bizarre quirks of that particular OS. POSIX locale issues are the least of its issues. Alas, people will want to run Pg on it anyway, especially when bundling with an app. It'd be nice if it could be made to work smoothly ... but I certainly don't care enough to try! Like you, I favour using an OS that follows the specs it claims to support instead. Perhaps someone who wants to use Mac OS X and Pg for their product will come forward with some compat wrapper functions for the localizable libc/posix functions, so Pg can just be built against the wrapper and the rest of us need not care about OS X's bugs. -- Craig Ringer
On Wed, Jan 13, 2010 at 11:15 PM, Craig Ringer <craig@postnewspapers.com.au> wrote: > Perhaps someone who wants to use Mac OS X and Pg for their product will > come forward with some compat wrapper functions for the localizable > libc/posix functions, so Pg can just be built against the wrapper and > the rest of us need not care about OS X's bugs. I know this sounds crazy, but couldn't Apple be bothered to fix their OS? :)
Craig Ringer <craig@postnewspapers.com.au> writes: > Tom Lane wrote: >> The state of OS X's POSIX-spec locale support is pretty pitiful, but on >> the whole I'd say if you need better UTF8 locale support you could use >> another OS. > Alas, people will want to run Pg on it anyway, especially when bundling > with an app. It'd be nice if it could be made to work smoothly ... but I > certainly don't care enough to try! Like you, I favour using an OS that > follows the specs it claims to support instead. For the record, I *like* OS X; I'm typing this on a Macbook Pro. But it doesn't do everything, and one of the things it doesn't do well is POSIX-spec locale support. As you now know from having looked at the archives, there've been many discussions of changing PG to not rely on the platform's locale support. But the bang-for-buck ratio of any such change doesn't seem very attractive --- we have a lot of higher priority things to spend our finite development manpower on. regards, tom lane
Scott Marlowe wrote: > On Wed, Jan 13, 2010 at 11:15 PM, Craig Ringer > <craig@postnewspapers.com.au> wrote: >> Perhaps someone who wants to use Mac OS X and Pg for their product will >> come forward with some compat wrapper functions for the localizable >> libc/posix functions, so Pg can just be built against the wrapper and >> the rest of us need not care about OS X's bugs. > > I know this sounds crazy, but couldn't Apple be bothered to fix their OS? :) I see you don't use Macs much :-P ( Goes back to researching a way to make network automounts on login work and properly support search in 10.6, since the only way that used to work in 10.4 is broken by 10.6 ) -- Craig Ringer
On Thu, Jan 14, 2010 at 6:15 AM, Craig Ringer <craig@postnewspapers.com.au> wrote: >>> Out of interest: Why not? >> >> There's plenty of discussion in the archives about it, but basically >> ICU would represent a pretty enormous dependency and would lock us in >> to having no other backend encoding but UTF8. > > Thanks. You're right - I should've just STFA ;-) so I appreciate the > brief explanation. There's also the question of whether being inconsistent with the rest of the system is really the right thing at all. If a programmer does a sort in the database and then writes application code using the same collation on the same system which depends on the data being sorted it's nice that that works. Or if an admin presorts the data using sort before doing a data load he might reasonable expect the table to be effectively clustered. Switching to ICU means trading our current inconsistency from platform to platform for a different inconsistency which would be better in some cases and worse in others. -- greg
On Sat, Jan 16, 2010 at 09:10:53PM +0000, Greg Stark wrote: > Switching to ICU means trading our current inconsistency from platform > to platform for a different inconsistency which would be better in > some cases and worse in others. Or, you can have the cake and eat it too. That is, aim for the end goal and let people choose what library they want to use for sorting (that is, extend the meaning of the locale identifier). Patches for this should be in the archives somewhere. As I recall the reason this was rejected is that *BSD lack the capability of handling multiple collation algorithms at all at the libc level (that is, if you don't just tell people to use ICU in that case). Mac OS X doesn't have great POSIX locale support but at least they implemented strcoll_l. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Please line up in a tree and maintain the heap invariant while > boarding. Thank you for flying nlogn airlines.
Attachment
Very interesting discussion indeed. It seems that "Postgresql:The world's most advanced open source database" can not work properly on "Mac OS X: the world'smost advanced operating system" and FreeBSD. Don't you think postgresql.org should remove from their download page the links to FreeBSD and Mac OS X binary packages? Martin Flahault Le 18 janv. 2010 à 08:10, Martijn van Oosterhout a écrit : > On Sat, Jan 16, 2010 at 09:10:53PM +0000, Greg Stark wrote: >> Switching to ICU means trading our current inconsistency from platform >> to platform for a different inconsistency which would be better in >> some cases and worse in others. > > Or, you can have the cake and eat it too. That is, aim for the end goal > and let people choose what library they want to use for sorting (that > is, extend the meaning of the locale identifier). Patches for this > should be in the archives somewhere. As I recall the reason this was > rejected is that *BSD lack the capability of handling multiple > collation algorithms at all at the libc level (that is, if you don't > just tell people to use ICU in that case). > > Mac OS X doesn't have great POSIX locale support but at least they > implemented strcoll_l. > > Have a nice day, > -- > Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ >> Please line up in a tree and maintain the heap invariant while >> boarding. Thank you for flying nlogn airlines.