Thread: Collate order on Mac OS X, text with diacritics in UTF-8

Collate order on Mac OS X, text with diacritics in UTF-8

From
Martin Flahault
Date:
Hi,

We are a software publisher searching for a new DBMS for our software. We have more than one hundred installed servers, running Mac OS and a Primebase database.

We have spend some time evaluating PostgreSQL and we can't get correct outputs with the ORDER BY command.
LC_COLLATE and LC_CTYPE are set to fr_FR.UTF-8.

It seems there is a known problem with the collating order of text including diacritics with the UTF8 encodings on BSD systems.

Does anyone know a workaround ?

Thanks for your answers,

Martin

Re: Collate order on Mac OS X, text with diacritics in UTF-8

From
Maximilian Tyrtania
Date:
Am 12.01.2010 um 12:36 schrieb Martin Flahault:

> We have spend some time evaluating PostgreSQL and we can't get correct outputs with the ORDER BY command.
> LC_COLLATE and LC_CTYPE are set to fr_FR.UTF-8.
>
> It seems there is a known problem with the collating order of text including diacritics with the UTF8 encodings on
BSDsystems. 
>
> Does anyone know a workaround ?

The best i've seen so far is:

CREATE OR REPLACE FUNCTION f_getorderbyfriendlyversion(texttoconvert text)
 RETURNS text AS
$BODY$
select translate(upper($1),'ÄÖÜ','AOU')--add french diacritical characters here
$BODY$
 LANGUAGE 'sql' IMMUTABLE STRICT
 COST 100;
ALTER FUNCTION f_getorderbyfriendlyversion(text) OWNER TO postgres;

CREATE INDEX idx_mytable_myfield_orderbyfriendly
 ON mytable
 USING btree
 (f_getorderbyfriendlyversion(myfield::text));

Select * from mytable order by f_getorderbyfriendlyversion(myfield);

Not an ideal solution, but it seems to perform quite well.

Best wishes from Berlin,

Maximilian Tyrtania

Re: Collate order on Mac OS X, text with diacritics in UTF-8

From
Craig Ringer
Date:
On 12/01/2010 7:36 PM, Martin Flahault wrote:
> Hi,
>
> We are a software publisher searching for a new DBMS for our software.
> We have more than one hundred installed servers, running Mac OS and a
> Primebase database.
>
> We have spend some time evaluating PostgreSQL and we can't get correct
> outputs with the ORDER BY command.

Can you provide a sample? Include a table of sample values, an example
query, its output, and what you'd expect to get instead? And why?

--
Craig Ringer

Re: Collate order on Mac OS X, text with diacritics in UTF-8

From
Martin Flahault
Date:

Here is an exemple :

postgres=# create database newbase;
CREATE DATABASE
postgres=# \c newbase;
psql (8.4.2)
You are now connected to database "newbase".
newbase=# create table t1 (contenu text);
CREATE TABLE
newbase=# insert into t1 values ('a'), ('e'), ('à'), ('é'), ('A'), ('E');
INSERT 0 6

newbase=# select * from t1 order by contenu;
 contenu 
---------
 A
 E
 a
 e
 à
 é
(6 rows)

newbase=# select * from t1 order by upper(contenu);
 contenu 
---------
 a
 A
 e
 E
 à
 é
(6 rows)


Here is the encoding informations :

newbase=# \encoding
UTF8
newbase=# show lc_collate;
 lc_collate 
------------
 fr_FR
(1 row)

newbase=# show lc_ctype;
 lc_ctype 
----------
 fr_FR
(1 row)


As with others DBMS (MySQL for example), diacritics should be ignored when determining the sort order. Here is the expected output:
 a
 à
 A
 e
 é 
 E


It seems there is a problem with the collating order on BSD systems with diacritics using UTF8.
If you put this text :
a
A
à
é
e
E

in a UTF8 text file and use the "sort" command on it, you will have the same wrong output as with PostgreSQL :
A
E
a
e
à
é

Hope this will help,

Martin

Re: Collate order on Mac OS X, text with diacritics in UTF-8

From
Martijn van Oosterhout
Date:
On Wed, Jan 13, 2010 at 04:15:06PM +0100, Martin Flahault wrote:

[postgres]
> newbase=# select * from t1 order by contenu;
>  contenu
> ---------
>  A
>  E
>  a
>  e

Postgresql outputs whatever the C library does on the underlying
system. The quality of this varies wildly.
>  à
> As with others DBMS (MySQL for example), diacritics should be ignored when determining the sort order. Here is the
expectedoutput: 

MySQL implements the unicode collation algorithm, which means it
essentially does what you want.
>
> It seems there is a problem with the collating order on BSD systems with diacritics using UTF8.

Last I checked, BSD did not support useful sorting on UTF-8 at all, so
it's not surprised it doesn't work.

> in a UTF8 text file and use the "sort" command on it, you will have the same wrong output as with PostgreSQL :

Yes, that's the basic idea. Mac OS X apparently provides ICU underneath
for programs that would like true unicode collation, but there is
little chance that postgresql will ever use this.

Hope this helps,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.

Attachment

Re: Collate order on Mac OS X, text with diacritics in UTF-8

From
Craig Ringer
Date:
On 13/01/2010 11:15 PM, Martin Flahault wrote:

> It seems there is a problem with the collating order on BSD systems with
> diacritics using UTF8.
> If you put this text :
> a
> A
> à
> é
> e
> E
>
> in a UTF8 text file and use the "sort" command on it, you will have the
> same wrong output as with PostgreSQL :
> A
> E
> a
> e
> à
> é

First: PostgreSQL expects the OS to behave correctly and sort according
to the locale. It relies on the C library for this. If the C library
doesn't do it right, PostgreSQL won't do it right either. So you need to
get Mac OS X to do the right thing.

Your results match what I get on a Linux system without a properly
generated fr_FR.UTF-8 locale. Libc falls back on the "C" locale, which
sorts that way.

If I generate the fr_FR.UTF-8 locale and run the sort (on the file "x"),
I get the desired result:

LANG=fr_FR.UTF-8 LC_ALL=fr_FR.UTF-8 sort x
a
A
à
e
E
é

I don't know Mac OS X well, but this is making me wonder if maybe you're
just missing the required information for the locale, so libc is falling
back on the "C" locale.

(Of course, being Mac OS X there are probably at least three out of date
or simply false "man" pages describing the behaviour, none of which
reflect the reality of a magic config key buried somewhere in NetInfo,
for which the documentation is also completely out of date. Bitter? Me?
Yeah, I admin a bunch of OS X machines on a business network.)

Hmm... a quick test suggests that Mac OS X (testing on 10.4) at least
*thinks* it supports the fr_FR.UTF-8 locale:

osx104$ LANG=xxx LC_ALL=xxx locale
LANG="xxx"
LC_COLLATE="C"
LC_CTYPE="C"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL="C"

osx104$ LANG=fr_FR.UTF-8 LC_ALL=fr_FR.UTF-8 locale
LANG="fr_FR.UTF-8"
LC_COLLATE="fr_FR.UTF-8"
LC_CTYPE="fr_FR.UTF-8"
LC_MESSAGES="fr_FR.UTF-8"
LC_MONETARY="fr_FR.UTF-8"
LC_NUMERIC="fr_FR.UTF-8"
LC_TIME="fr_FR.UTF-8"
LC_ALL="fr_FR.UTF-8"

osx104$ locale -a  | grep fr_FR
fr_FR
fr_FR.ISO8859-1
fr_FR.ISO8859-15
fr_FR.UTF-8

... yet it clearly doesn't:

osx104$ LANG=C LC_ALL=C sort x
A
E
a
e
à
é
osx104$ LANG=fr_FR.UTF-8 LC_ALL=fr_FR.UTF-8 sort x
A
E
a
e
à
é
osx104$ LANG=fr_FR.ISO8859-1 LC_ALL=fr_FR.ISO8859-1 sort x
A
E
a
e
à
é

Mac OS X seems to keep its locale config in /usr/share/locale . Looking
there, there are clearly LC_COLLATE files for fr_FR.UTF-8 . However,
they're identical to those for en_US.UTF-8:

osx104$ cd /usr/share/locale
osx104$ diff fr_FR.UTF-8/LC_COLLATE en_US.UTF-8/LC_COLLATE

... so your OS's localized collation support is broken/missing, at least
if the same is true for more modern versions of OS X.

--
Craig Ringer

Re: Collate order on Mac OS X, text with diacritics in UTF-8

From
Craig Ringer
Date:
Martijn van Oosterhout wrote:

>> in a UTF8 text file and use the "sort" command on it, you will have the same wrong output as with PostgreSQL :
>
> Yes, that's the basic idea. Mac OS X apparently provides ICU underneath
> for programs that would like true unicode collation, but there is
> little chance that postgresql will ever use this.

Out of interest: Why not?

Using ICU would permit Pg to be independent of libc's collation rules,
finally permitting things like specifying a specific collation for a
textual sort. It'd make mixing data from different locales in a database
a lot easier (read: possible to do correctly).

Is this just a matter of "nobody cares enough to produce a solid, tested
patch with equivalent performance that doesn't turn people who try to
review it green with disgust" ... or are there specific reasons why
using something like ICU instead of libc's locale support is not
appropriate for Pg?

--
Craig Ringer

Re: Collate order on Mac OS X, text with diacritics in UTF-8

From
Tom Lane
Date:
Craig Ringer <craig@postnewspapers.com.au> writes:
> Martijn van Oosterhout wrote:
>> Yes, that's the basic idea. Mac OS X apparently provides ICU underneath
>> for programs that would like true unicode collation, but there is
>> little chance that postgresql will ever use this.

> Out of interest: Why not?

There's plenty of discussion in the archives about it, but basically
ICU would represent a pretty enormous dependency and would lock us in
to having no other backend encoding but UTF8.

The state of OS X's POSIX-spec locale support is pretty pitiful, but on
the whole I'd say if you need better UTF8 locale support you could use
another OS.

            regards, tom lane

Re: Collate order on Mac OS X, text with diacritics in UTF-8

From
Craig Ringer
Date:
Tom Lane wrote:
> Craig Ringer <craig@postnewspapers.com.au> writes:
>> Martijn van Oosterhout wrote:
>>> Yes, that's the basic idea. Mac OS X apparently provides ICU underneath
>>> for programs that would like true unicode collation, but there is
>>> little chance that postgresql will ever use this.
>
>> Out of interest: Why not?
>
> There's plenty of discussion in the archives about it, but basically
> ICU would represent a pretty enormous dependency and would lock us in
> to having no other backend encoding but UTF8.

Thanks. You're right - I should've just STFA ;-) so I appreciate the
brief explanation.

> The state of OS X's POSIX-spec locale support is pretty pitiful, but on
> the whole I'd say if you need better UTF8 locale support you could use
> another OS.

That's my personal opinion too ... I have the "pleasure" of
administrating an OS X Server and six Mac Pro clients at work, and have
become well acquainted with the exciting variety of bugs, undocumented
"features", and bizarre quirks of that particular OS. POSIX locale
issues are the least of its issues.

Alas, people will want to run Pg on it anyway, especially when bundling
with an app. It'd be nice if it could be made to work smoothly ... but I
certainly don't care enough to try! Like you, I favour using an OS that
follows the specs it claims to support instead.

Perhaps someone who wants to use Mac OS X and Pg for their product will
come forward with some compat wrapper functions for the localizable
libc/posix functions, so Pg can just be built against the wrapper and
the rest of us need not care about OS X's bugs.

--
Craig Ringer

Re: Collate order on Mac OS X, text with diacritics in UTF-8

From
Scott Marlowe
Date:
On Wed, Jan 13, 2010 at 11:15 PM, Craig Ringer
<craig@postnewspapers.com.au> wrote:
> Perhaps someone who wants to use Mac OS X and Pg for their product will
> come forward with some compat wrapper functions for the localizable
> libc/posix functions, so Pg can just be built against the wrapper and
> the rest of us need not care about OS X's bugs.

I know this sounds crazy, but couldn't Apple be bothered to fix their OS?  :)

Re: Collate order on Mac OS X, text with diacritics in UTF-8

From
Tom Lane
Date:
Craig Ringer <craig@postnewspapers.com.au> writes:
> Tom Lane wrote:
>> The state of OS X's POSIX-spec locale support is pretty pitiful, but on
>> the whole I'd say if you need better UTF8 locale support you could use
>> another OS.

> Alas, people will want to run Pg on it anyway, especially when bundling
> with an app. It'd be nice if it could be made to work smoothly ... but I
> certainly don't care enough to try! Like you, I favour using an OS that
> follows the specs it claims to support instead.

For the record, I *like* OS X; I'm typing this on a Macbook Pro.  But
it doesn't do everything, and one of the things it doesn't do well is
POSIX-spec locale support.

As you now know from having looked at the archives, there've been many
discussions of changing PG to not rely on the platform's locale
support.  But the bang-for-buck ratio of any such change doesn't seem
very attractive --- we have a lot of higher priority things to spend
our finite development manpower on.

            regards, tom lane

Re: Collate order on Mac OS X, text with diacritics in UTF-8

From
Craig Ringer
Date:
Scott Marlowe wrote:
> On Wed, Jan 13, 2010 at 11:15 PM, Craig Ringer
> <craig@postnewspapers.com.au> wrote:
>> Perhaps someone who wants to use Mac OS X and Pg for their product will
>> come forward with some compat wrapper functions for the localizable
>> libc/posix functions, so Pg can just be built against the wrapper and
>> the rest of us need not care about OS X's bugs.
>
> I know this sounds crazy, but couldn't Apple be bothered to fix their OS?  :)

I see you don't use Macs much :-P

( Goes back to researching a way to make network automounts on login
work and properly support search in 10.6, since the only way that used
to work in 10.4 is broken by 10.6 )

--
Craig Ringer

Re: Collate order on Mac OS X, text with diacritics in UTF-8

From
Greg Stark
Date:
On Thu, Jan 14, 2010 at 6:15 AM, Craig Ringer
<craig@postnewspapers.com.au> wrote:
>>> Out of interest: Why not?
>>
>> There's plenty of discussion in the archives about it, but basically
>> ICU would represent a pretty enormous dependency and would lock us in
>> to having no other backend encoding but UTF8.
>
> Thanks. You're right - I should've just STFA ;-) so I appreciate the
> brief explanation.

There's also the question of whether being inconsistent with the rest
of the system is really the right thing at all. If a programmer does a
sort in the database and then writes application code using the same
collation on the same system which depends on the data being sorted
it's nice that that works. Or if an admin presorts the data using sort
before doing a data load he might reasonable expect the table to be
effectively clustered.

Switching to ICU means trading our current inconsistency from platform
to platform for a different inconsistency which would be better in
some cases and worse in others.


--
greg

Re: Collate order on Mac OS X, text with diacritics in UTF-8

From
Martijn van Oosterhout
Date:
On Sat, Jan 16, 2010 at 09:10:53PM +0000, Greg Stark wrote:
> Switching to ICU means trading our current inconsistency from platform
> to platform for a different inconsistency which would be better in
> some cases and worse in others.

Or, you can have the cake and eat it too. That is, aim for the end goal
and let people choose what library they want to use for sorting (that
is, extend the meaning of the locale identifier). Patches for this
should be in the archives somewhere. As I recall the reason this was
rejected is that *BSD lack the capability of handling multiple
collation algorithms at all at the libc level (that is, if you don't
just tell people to use ICU in that case).

Mac OS X doesn't have great POSIX locale support but at least they
implemented strcoll_l.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.

Attachment

Re: Collate order on Mac OS X, text with diacritics in UTF-8

From
Martin Flahault
Date:
Very interesting discussion indeed.

It seems that "Postgresql:The world's most advanced open source database" can not work properly on "Mac OS X: the
world'smost advanced operating system" and FreeBSD. 

Don't you think postgresql.org should remove from their download page the links to FreeBSD and Mac OS X binary
packages?

Martin Flahault



Le 18 janv. 2010 à 08:10, Martijn van Oosterhout a écrit :

> On Sat, Jan 16, 2010 at 09:10:53PM +0000, Greg Stark wrote:
>> Switching to ICU means trading our current inconsistency from platform
>> to platform for a different inconsistency which would be better in
>> some cases and worse in others.
>
> Or, you can have the cake and eat it too. That is, aim for the end goal
> and let people choose what library they want to use for sorting (that
> is, extend the meaning of the locale identifier). Patches for this
> should be in the archives somewhere. As I recall the reason this was
> rejected is that *BSD lack the capability of handling multiple
> collation algorithms at all at the libc level (that is, if you don't
> just tell people to use ICU in that case).
>
> Mac OS X doesn't have great POSIX locale support but at least they
> implemented strcoll_l.
>
> Have a nice day,
> --
> Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
>> Please line up in a tree and maintain the heap invariant while
>> boarding. Thank you for flying nlogn airlines.