Thread: help understanding collation order

help understanding collation order

From
raf
Date:
hi,

postgresl-8.4.6 (database encoding is utf8)

the following sql behaves strangely on one of my servers:

  create table ttt(id serial not null primary key, name text);
  insert into ttt (name) values ('CLARKE, DAVID');
  insert into ttt (name) values ('CLARK, PETER');
  insert into ttt (name) values ('CLARKE');
  insert into ttt (name) values ('CLARK');
  select * from ttt order by name;
  drop table ttt;

the behaviour i expect (and see on macosx-10.6.6) is:

   id |     name
  ----+---------------
    4 | CLARK
    2 | CLARK, PETER
    3 | CLARKE
    1 | CLARKE, DAVID

the behaviour i don't expect but see anyway (on debian-5.0) is:

   id |     name
  ----+---------------
    4 | CLARK
    3 | CLARKE
    1 | CLARKE, DAVID
    2 | CLARK, PETER

the "good" server has lc_messages='en_AU' and the
"bad" server has lc_messages="en_AU.utf8" which may
be relevant but i can't successfully alter these
values (i.e. postgres doesn't restart after i try)
so i'm hoping it's not relevant.

can anyone explain what i've done wrong here and
suggest what i can do to fix it?

cheers,
raf

p.s. the postgres installation was PostgresPlus/8.4SS that
was subsequently upgraded via postgres one click installers.


Re: help understanding collation order

From
Tom Lane
Date:
raf <raf@raf.org> writes:
> the behaviour i expect (and see on macosx-10.6.6) is:

>    id |     name
>   ----+---------------
>     4 | CLARK
>     2 | CLARK, PETER
>     3 | CLARKE
>     1 | CLARKE, DAVID

> the behaviour i don't expect but see anyway (on debian-5.0) is:

>    id |     name
>   ----+---------------
>     4 | CLARK
>     3 | CLARKE
>     1 | CLARKE, DAVID
>     2 | CLARK, PETER

> the "good" server has lc_messages='en_AU' and the
> "bad" server has lc_messages="en_AU.utf8" which may
> be relevant

No, not particularly.  Sort order is determined by lc_collate
not lc_messages.  Unfortunately it's entirely possible that OSX
will give you a different sort order than Linux even for similarly
named lc_collate settings.  About the only lc_collate setting that
really behaves the same everywhere, guaranteed, is "C" ... and that
might or might not do what you want.  (C locale does satisfy the
above example but it's hard to be sure what you want in general;
and if you are using any non-ASCII characters, C locale will more
than likely not be very satisfactory.)

            regards, tom lane

Re: help understanding collation order

From
Andrew Sullivan
Date:
On Mon, Jan 17, 2011 at 02:19:14PM -0500, Tom Lane wrote:

> No, not particularly.  Sort order is determined by lc_collate
> not lc_messages.  Unfortunately it's entirely possible that OSX
> will give you a different sort order than Linux even for similarly
> named lc_collate settings.  About the only lc_collate setting that
> really behaves the same everywhere, guaranteed, is "C"

This is partly because of the generous room allowed for linguistic
tailoring in the Unicode standard on collation.  In case one really
wants to have a bad day, I can suggest reading
http://unicode.org/reports/tr10/ carefully.  But they make an
important point there, which is that collation rules work by language,
not by script.  As I understand things, in Postgres's case it's partly
a matter of how strongly your OS cleaves to the locale conventions
that determines how this will work.  (Note that not every database
system relies on the underlying OS's facilities the way Postgres does;
some have an independent collation mechanism.)

Unicode does maintain a locale data repository:
http://cldr.unicode.org/.  You might be able to figure out which of
your systems is not playing nice and complain to the OS vendor.

A


--
Andrew Sullivan
ajs@crankycanuck.ca

Re: help understanding collation order

From
raf
Date:
Tom Lane wrote:

> raf <raf@raf.org> writes:
> > the behaviour i expect (and see on macosx-10.6.6) is:
>
> >    id |     name
> >   ----+---------------
> >     4 | CLARK
> >     2 | CLARK, PETER
> >     3 | CLARKE
> >     1 | CLARKE, DAVID
>
> > the behaviour i don't expect but see anyway (on debian-5.0) is:
>
> >    id |     name
> >   ----+---------------
> >     4 | CLARK
> >     3 | CLARKE
> >     1 | CLARKE, DAVID
> >     2 | CLARK, PETER
>
> > the "good" server has lc_messages='en_AU' and the
> > "bad" server has lc_messages="en_AU.utf8" which may
> > be relevant
>
> No, not particularly.  Sort order is determined by lc_collate
> not lc_messages.  Unfortunately it's entirely possible that OSX
> will give you a different sort order than Linux even for similarly
> named lc_collate settings.  About the only lc_collate setting that
> really behaves the same everywhere, guaranteed, is "C" ... and that
> might or might not do what you want.  (C locale does satisfy the
> above example but it's hard to be sure what you want in general;
> and if you are using any non-ASCII characters, C locale will more
> than likely not be very satisfactory.)
>
>             regards, tom lane

thanks. "C" will have to do, i suppose.
that and/or re-sort in the client.

cheers,
raf

p.s. if anyone in debian locale land is listening,
'E' does not sort before ','. what were you thinking? :-)


Re: help understanding collation order

From
Peter Eisentraut
Date:
On tis, 2011-01-18 at 10:33 +1100, raf wrote:
> p.s. if anyone in debian locale land is listening,
> 'E' does not sort before ','. what were you thinking? :-)

What is actually happening is that the punctuation is sorted in a second
pass after the letters.  Which is both correct according to the relevant
standards and also practical in many situations.

It's usually actually the Mac OS X locales that are broken.



Re: help understanding collation order

From
raf
Date:
Peter Eisentraut wrote:

> On tis, 2011-01-18 at 10:33 +1100, raf wrote:
> > p.s. if anyone in debian locale land is listening,
> > 'E' does not sort before ','. what were you thinking? :-)
>
> What is actually happening is that the punctuation is sorted in a second
> pass after the letters.  Which is both correct according to the relevant
> standards and also practical in many situations.

i have no doubt that what you say is true just as i have no doubt that
it is also incorrect and impractical in every situation i'll ever encounter.
i can't imagine ever wanting CLARK to sort both before and after CLARKE.
i'll just re-sort all name-ordered reports in the client. pity.

> It's usually actually the Mac OS X locales that are broken.

only when ignoring the principle of least astonishment. :-)

but it's not the same locale on the two hosts. the macosx
locale is en_AU. the debian locale is en_AU.utf8 so i don't
think they can be compared for brokenness purposes in this case.
they may both be correct.

cheers,
raf


Re: help understanding collation order

From
raf
Date:
raf wrote:

> Peter Eisentraut wrote:
>
> > On tis, 2011-01-18 at 10:33 +1100, raf wrote:
> > > p.s. if anyone in debian locale land is listening,
> > > 'E' does not sort before ','. what were you thinking? :-)
> >
> > What is actually happening is that the punctuation is sorted in a second
> > pass after the letters.  Which is both correct according to the relevant
> > standards and also practical in many situations.
>
> i have no doubt that what you say is true just as i have no doubt that
> it is also incorrect and impractical in every situation i'll ever encounter.
> i can't imagine ever wanting CLARK to sort both before and after CLARKE.

> i'll just re-sort all name-ordered reports in the client. pity.

better idea: i'll separate the family name from the personal
names using regexp_replace() and use them separately in the
order by clause so the sorting can remain in postgres where
it belongs without any commas getting in the way.

cheers,
raf