Thread: Collation sequence and use of operatings system's locale

Collation sequence and use of operatings system's locale

From
James Gates
Date:
I'm working with a issue where it seems the PostgreSQL server collation
sequence for certain locales don't match the operating system's
collation sequence for the same locale (set during initdb).

I can reproduce this on both 8.1.9 & 8.2.5 on Solaris (both Nevada & 10).



1) First I have a text file containing some LATIN1 (ISO8859-1)
characters in random order (followed by their decimal byte code):

$ cat barf
Ö       214
A       65
a       97
Å       197
ä       228
Ä       196

2) Perform ascending byte code & dictionary order sorts (on 1st field
only) with the "C" locale (Note that the results are the same when using
a POSIX locale):

$ locale
LANG=C
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=

$ sort +1 -0 barf
A       65
a       97
Ä       196
Å       197
Ö       214
ä       228

$ sort +0 -1 -d barf
Ä       196
Å       197
Ö       214
ä       228
A       65
a       97

3) Perform ascending byte code & dictionary order sorts (on 1st field
only) with the Swedish locale (Note that the results are the same when
using any ISO8859 locale):

$ locale
LANG=sv_SE.ISO8859-1
LC_CTYPE="sv_SE.ISO8859-1"
LC_NUMERIC="sv_SE.ISO8859-1"
LC_TIME="sv_SE.ISO8859-1"
LC_COLLATE="sv_SE.ISO8859-1"
LC_MONETARY="sv_SE.ISO8859-1"
LC_MESSAGES="sv_SE.ISO8859-1"
LC_ALL=

$ sort +1 -0 barf
A       65
a       97
Ä       196
Å       197
Ö       214
ä       228

$ sort +0 -1 -d barf
a       97
A       65
Å       197
ä       228
Ä       196
Ö       214

(Note that in the CLDR definition for 8859-1 locales, lower case letters
appear before the equivalent upper case letter in a dictionary sort)

4) Now create a db with the same locale:

$ locale
LANG=sv_SE.ISO8859-1
LC_CTYPE="sv_SE.ISO8859-1"
LC_NUMERIC="sv_SE.ISO8859-1"
LC_TIME="sv_SE.ISO8859-1"
LC_COLLATE="sv_SE.ISO8859-1"
LC_MONETARY="sv_SE.ISO8859-1"
LC_MESSAGES="sv_SE.ISO8859-1"
LC_ALL=

$ initdb
The files belonging to this database system will be owned by user
"postgres".
This user must also own the server process.

The database cluster will be initialized with locale sv_SE.ISO8859-1.
The default database encoding has accordingly been set to LATIN1.

fixing permissions on existing directory
/var/tmp/postgres/8.2.5/sv_SE.ISO8859-1 ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers/max_fsm_pages ... 32MB/204800
creating configuration files ... ok
creating template1 database in
/var/tmp/postgres/8.2.5/sv_SE.ISO8859-1/base/1 ... ok
initializing pg_authid ... ok
initializing dependencies ... ok
creating system views ... ok
loading system objects' descriptions ... ok
creating conversions ... ok
setting privileges on built-in objects ... ok
creating information schema ... ok
vacuuming database template1 ... ok
copying template1 to template0 ... ok
copying template1 to postgres ... ok
.... etc ....

5) Check the correct collation sequence & char set in the db:

postgres=# \l
         List of databases
    Name    |  Owner   | Encoding
-----------+----------+----------
  postgres  | postgres | LATIN1
  template0 | postgres | LATIN1
  template1 | postgres | LATIN1
(3 rows)

postgres=# show lc_collate;
    lc_collate
-----------------
  sv_SE.ISO8859-1
(1 row)

6) And try the same sort on a table containing the same characters:

postgres=# select barf,ascii(barf) from jim order by barf asc;
  barf | ascii
------+-------
  Å    |   197
  ä    |   228
  Ä    |   196
  Ö    |   214
  a    |    97
  A    |    65
(6 rows)

postgres=# \d jim
     Table "public.jim"
  Column | Type | Modifiers
--------+------+-----------
  barf   | text |

Notice that the results don't match the operating system's (either byte
code or dictionary) sort order for the same locale, or even the C or
POSIX locales.

In fact, I can't tell where this order is derived from? I thought that
calling initdb with a particular locale meant that the database used all
aspects of the operating systems locale, including collation sequence.
This is implied in the docs at:

http://www.postgresql.org/docs/8.2/interactive/sql-select.html#SQL-ORDERBY

> Character-string data is sorted according to the locale-specific
> collation order that was established when the database cluster was
> initialized.

I'm probably missing (or misunderstanding) something, since I don't
fully understand PostgreSQL's localization. Can someone please explain?

Many thanks.




Re: Collation sequence and use of operatings system's locale

From
Tom Lane
Date:
James Gates <james.gates@sun.com> writes:
> I'm working with a issue where it seems the PostgreSQL server collation
> sequence for certain locales don't match the operating system's
> collation sequence for the same locale (set during initdb).
> I can reproduce this on both 8.1.9 & 8.2.5 on Solaris (both Nevada & 10).

FWIW, your example works as expected for me with 8.3 CVS tip on Fedora
Core 6:

postgres=# \encoding
LATIN1
postgres=# show lc_collate ;
   lc_collate
----------------
 sv_SE.iso88591
(1 row)

postgres=# select barf,ascii(barf) from jim order by barf asc;
 barf | ascii
------+-------
 a    |    97
 A    |    65
 �    |   197
 �    |   228
 �    |   196
 �    |   214
(6 rows)

> In fact, I can't tell where this order is derived from?

In this context Postgres believes whatever strcoll() tells it.  I don't
see any obvious hole in your methodology (except that I'm dubious about
the exhibited arguments for sort(1)) so it seems possible you've got a
bug in Solaris' strcoll().  But you should probably triple-check the
question of whether what's arriving at strcoll() is in the encoding it
expects.

            regards, tom lane

Re: Collation sequence and use of operatings system's locale

From
Peter Eisentraut
Date:
Am Montag, 29. Oktober 2007 schrieb James Gates:
> Notice that the results don't match the operating system's (either byte
> code or dictionary) sort order for the same locale, or even the C or
> POSIX locales.

Note that none of the sort orders you showed match an actual Swedish
dictionary sort, so something on your system is probably broken.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/