Thread: sort order (ORDER BY) hu_HU.UTF-8 locale with UTF-8 encoding is not working correctly on FreeBSD

Dear List,

I'm trying to solve this problem for more than 2 days, but I have no
other idea.

My system is:

FreeBSD titanium 8.1-RELEASE FreeBSD 8.1-RELEASE #0: Mon Jul 19 02:36:49
UTC 2010     root@mason.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC  amd64

Tested PgSQL versions are: 8.4 and 9.0.3 (fresh install using ports)

initdb flags are: --encoding=UTF-8 --locale=hu_HU.UTF-8

Initdb output:

The files belonging to this database system will be owned by user "pgsql".
This user must also own the server process.

The database cluster will be initialized with locale hu_HU.UTF-8.
The default text search configuration will be set to "hungarian".

creating directory /usr/local/pgsql/data ... ok
creating subdirectories ... ok
selecting default max_connections ... 40
selecting default shared_buffers ... 28MB
creating configuration files ... ok
creating template1 database in /usr/local/pgsql/data/base/1 ... ok
initializing pg_authid ... ok
initializing dependencies ... ok
creating system views ... ok
loading system objects' descriptions ... ok
creating conversions ... ok
creating dictionaries ... ok
setting privileges on built-in objects ... ok
creating information schema ... ok
loading PL/pgSQL server-side language ... ok
vacuuming database template1 ... ok
copying template1 to template0 ... ok
copying template1 to postgres ... ok

step 1: createdb -U wooh test --encoding=UTF-8 --locale=hu_HU.UTF-8

step 2: psql -U wooh -l
                              List of databases
   Name    | Owner | Encoding |  Collation  |    Ctype    | Access
privileges
-----------+-------+----------+-------------+-------------+-------------------
 postgres  | pgsql | UTF8     | hu_HU.UTF-8 | hu_HU.UTF-8 |
 template0 | pgsql | UTF8     | hu_HU.UTF-8 | hu_HU.UTF-8 | =c/pgsql
     +
           |       |          |             |             | pgsql=CTc/pgsql
 template1 | pgsql | UTF8     | hu_HU.UTF-8 | hu_HU.UTF-8 | =c/pgsql
     +
           |       |          |             |             | pgsql=CTc/pgsql
 test      | wooh  | UTF8     | hu_HU.UTF-8 | hu_HU.UTF-8 |

Everything seems to be OK.

let's create a table with idn, and name.

step 3:

test=# create table test (idn serial primary key not null, name
varchar(255));
NOTICE:  CREATE TABLE will create implicit sequence "test_idn_seq" for
serial column "test.idn"
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index
"test_pkey" for table "test"
CREATE TABLE

step 4:

[root@titanium /home/wooh]# file insert.sql
insert.sql: UTF-8 Unicode text

step 5:

[root@titanium /home/wooh]# cat insert.sql
INSERT INTO test (name)  values ( 'a' );
INSERT INTO test (name)  values ( 'á' );
INSERT INTO test (name)  values ( 'b' );
INSERT INTO test (name)  values ( 'c' );
INSERT INTO test (name)  values ( 'd' );
INSERT INTO test (name)  values ( 'e' );
INSERT INTO test (name)  values ( 'é' );
INSERT INTO test (name)  values ( 'f' );
INSERT INTO test (name)  values ( 'g' );
INSERT INTO test (name)  values ( 'h' );

step 6:

[root@titanium /home/wooh]# psql -U wooh test < insert.sql
INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1

step 7:

[root@titanium /home/wooh]# psql -U wooh test
psql (9.0.3)
Type "help" for help.

test=# select * from test order by name asc;
 idn | name
-----+------
   1 | a
   3 | b
   4 | c
   5 | d
   6 | e
   8 | f
   9 | g
  10 | h
   2 | á
   7 | é
(10 rows)

test=# show lc_collate;
 lc_collate
-------------
 hu_HU.UTF-8
(1 row)

test=# show lc_ctype;
  lc_ctype
-------------
 hu_HU.UTF-8
(1 row)

The sort order is incorrect, and I cannot find out why. Everything seems
to be Ok. (locale, collate, encoding)

But this whole thing works if I chose "LATIN2" encoding with
"hu_HU.ISO8859-2" locale. Unfortunately we use UTF-8 databases, so I
need to use the UTF8 Encoding with hu_HU.UTF-8

Any ideas?

Additional information:

[root@titanium /home/wooh]# pg_controldata /usr/local/pgsql/data/
pg_control version number:            903
Catalog version number:               201008051
Database system identifier:           5570517093231621070
Database cluster state:               in production
pg_control last modified:             Sun Feb  6 11:19:27 2011
Latest checkpoint location:           0/1610CA0
Prior checkpoint location:            0/15F8028
Latest checkpoint's REDO location:    0/1610CA0
Latest checkpoint's TimeLineID:       1
Latest checkpoint's NextXID:          0/675
Latest checkpoint's NextOID:          24576
Latest checkpoint's NextMultiXactId:  1
Latest checkpoint's NextMultiOffset:  0
Latest checkpoint's oldestXID:        654
Latest checkpoint's oldestXID's DB:   1
Latest checkpoint's oldestActiveXID:  0
Time of latest checkpoint:            Sun Feb  6 11:19:19 2011
Minimum recovery ending location:     0/0
Backup start location:                0/0
Current wal_level setting:            minimal
Current max_connections setting:      40
Current max_prepared_xacts setting:   0
Current max_locks_per_xact setting:   64
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value

--
Adam PAPAI
BSD Support Service
http://www.bsdsupportservice.hu
E-mail: adam.papai@bsdsupportservice.hu
Phone: +36 30 33-55-735 (Hungary)
Adam PAPAI wrote:

The correct encoding of the values are:

INSERT INTO test (name)  values ( 'a' );
INSERT INTO test (name)  values ( 'á' );
INSERT INTO test (name)  values ( 'b' );
INSERT INTO test (name)  values ( 'c' );
INSERT INTO test (name)  values ( 'd' );
INSERT INTO test (name)  values ( 'e' );
INSERT INTO test (name)  values ( 'é' );
INSERT INTO test (name)  values ( 'f' );
INSERT INTO test (name)  values ( 'g' );
INSERT INTO test (name)  values ( 'h' );

I forgot to change the encoding of my mail client.

test=# select * from test order by name asc;
 idn | name
-----+------
   1 | a
   3 | b
   4 | c
   5 | d
   6 | e
   8 | f
   9 | g
  10 | h
   2 | á
   7 | é
(10 rows)


--
Adam PAPAI
BSD Support Service
http://www.bsdsupportservice.hu
E-mail: adam.papai@bsdsupportservice.hu
Phone: +36 30 33-55-735 (Hungary)
Adam PAPAI <adam.papai@bsdsupportservice.hu> writes:
> [ hu_HU.UTF-8 locale doesn't sort as expected on ]
> FreeBSD titanium 8.1-RELEASE FreeBSD 8.1-RELEASE #0: Mon Jul 19 02:36:49
> UTC 2010     root@mason.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC  amd64

Do you get the expected sort ordering from the "sort" command either?
I'm betting not.  This is not something Postgres controls -- we just
rely on the results of strcoll().

I don't know the state of affairs on FreeBSD for sure, but I do know
that the UTF-8 locales are fairly broken on Mac OS X, which is a BSD
derived platform too.  It looks like nobody in BSD-land has bothered to
put in the work to make multibyte characters sort properly.

You might have better luck with hu_HU using a suitable LATINn character
set instead of UTF8.  If that still doesn't do what you want, you might
have no choice but to move off BSD :-(

            regards, tom lane
Tom Lane wrote:
> Adam PAPAI <adam.papai@bsdsupportservice.hu> writes:
>> [ hu_HU.UTF-8 locale doesn't sort as expected on ]
>> FreeBSD titanium 8.1-RELEASE FreeBSD 8.1-RELEASE #0: Mon Jul 19 02:36:49
>> UTC 2010     root@mason.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC  amd64
>
> Do you get the expected sort ordering from the "sort" command either?
> I'm betting not.  This is not something Postgres controls -- we just
> rely on the results of strcoll().
>
> I don't know the state of affairs on FreeBSD for sure, but I do know
> that the UTF-8 locales are fairly broken on Mac OS X, which is a BSD
> derived platform too.  It looks like nobody in BSD-land has bothered to
> put in the work to make multibyte characters sort properly.
>
> You might have better luck with hu_HU using a suitable LATINn character
> set instead of UTF8.  If that still doesn't do what you want, you might
> have no choice but to move off BSD :-(
>
>             regards, tom lane

Hi Tom,

I've tested the "sort" command. It fails too.

Perhaps it's really with the *BSD and not with PostgreSQL.

Thanks,

--
Adam PAPAI
BSD Support Service
http://www.bsdsupportservice.hu
E-mail: adam.papai@bsdsupportservice.hu
Phone: +36 30 33-55-735 (Hungary)
On sön, 2011-02-06 at 11:06 -0500, Tom Lane wrote:
> I don't know the state of affairs on FreeBSD for sure, but I do know
> that the UTF-8 locales are fairly broken on Mac OS X, which is a BSD
> derived platform too.  It looks like nobody in BSD-land has bothered
> to put in the work to make multibyte characters sort properly.

Yeah, same on FreeBSD.  Using LATINn is the right choice there.
QWRhbSBQQVBBSSA8YWRhbS5wYXBhaUBic2RzdXBwb3J0c2VydmljZS5odT4g
d3JpdGVzOg0KDQo+IEZyZWVCU0QgdGl0YW5pdW0gOC4xLVJFTEVBU0UgRnJl
ZUJTRCA4LjEtUkVMRUFTRSAjMDogTW9uIEp1bCAxOSAwMjozNjo0OQ0KPiBV
VEMgMjAxMCAgICAgcm9vdEBtYXNvbi5jc2UuYnVmZmFsby5lZHU6L3Vzci9v
YmovdXNyL3NyYy9zeXMvR0VORVJJQyAgYW1kNjQNCj4NCj4gVGVzdGVkIFBn
U1FMIHZlcnNpb25zIGFyZTogOC40IGFuZCA5LjAuMyAoZnJlc2ggaW5zdGFs
bCB1c2luZyBwb3J0cykNCg0KRGlkIHlvdSB0cnkgSUNVIHBhdGNoPyBJTUhP
IGl0IGNhbiBoZWxwIHdpdGggeW91ciBwcm9ibGVtLiBCdXQgaSBhbSBub3Qg
c3VyZQ0KaG93IGl0IGlzIHN0YWJsZS4NCg0KIyBjZCAvdXNyL3BvcnRzL2Rh
dGFiYXNlcy9wb3N0Z3Jlc3FsODQtc2VydmVyDQojIG1ha2UgY29uZmlnDQri
lIzilIDilIDilIDilIDilIDilIDilIDilIDilIDilIDilIDilIDilIDilIDi
lIDilIDilIDilIDilIDilIDilIDilIDilIDilIDilIDilIDilIDilIDilIDi
lIDilIDilIDilIDilIDilIDilIDilIDilIDilIDilIDilIDilIDilIDilIDi
lIDilIDilIDilIDilIDilIDilIDilIDilIDilIDilIDilIDilIDilIDilIDi
lIDilIDilIDilIDilIDilIDilIDilIDilIDilJANCuKUgiAgICAgICAgICAg
ICAgT3B0aW9ucyBmb3IgcG9zdGdyZXNxbC1zZXJ2ZXIgOC40LjVfMiAgICAg
ICAgICAgICAgICAg4pSCDQrilIIg4pSM4pSA4pSA4pSA4pSA4pSA4pSA4pSA
4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA
4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA
4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA
4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSQIOKUgg0K
4pSCIOKUgltYXSBOTFMgICAgICAgICAgICAgICBVc2UgaW50ZXJuYXRpb25h
bGl6ZWQgbWVzc2FnZXMgICAgICAgICAgICDilIIg4pSCDQrilIIg4pSCWyBd
IFBBTSAgICAgICAgICAgICAgIEJ1aWxkIHdpdGggUEFNIHN1cHBvcnQgKHNl
cnZlciBvbmx5KSAgICAgIOKUgiDilIINCuKUgiDilIJbIF0gTERBUCAgICAg
ICAgICAgICAgQnVpbGQgd2l0aCBMREFQIGF1dGhlbnRpY2F0aW9uIHN1cHBv
cnQgICAg4pSCIOKUgg0K4pSCIOKUglsgXSBNSVRfS1JCNSAgICAgICAgICBC
dWlsZCB3aXRoIE1JVCdzIGtlcmJlcm9zIHN1cHBvcnQgICAgICAgICDilIIg
4pSCDQrilIIg4pSCWyBdIEhFSU1EQUxfS1JCNSAgICAgIEJ1aWxkcyB3aXRo
IEhlaW1kYWwga2VyYmVyb3Mgc3VwcG9ydCAgICAgIOKUgiDilIINCuKUgiDi
lIJbIF0gT1BUSU1JWkVEX0NGTEFHUyAgQnVpbGRzIHdpdGggY29tcGlsZXIg
b3B0aW1pemF0aW9ucyAoLU8zKSAg4pSCIOKUgg0K4pSCIOKUgltYXSBYTUwg
ICAgICAgICAgICAgICBCdWlsZCB3aXRoIFhNTCBkYXRhIHR5cGUgKHNlcnZl
cikgICAgICAgICDilIIg4pSCDQrilIIg4pSCW1hdIFRaREFUQSAgICAgICAg
ICAgIFVzZSBpbnRlcm5hbCB0aW1lem9uZSBkYXRhYmFzZSAoc2VydmVyKSAg
IOKUgiDilIINCuKUgiDilIJbIF0gREVCVUcgICAgICAgICAgICAgQnVpbGRz
IHdpdGggZGVidWdnaW5nIHN5bWJvbHMgICAgICAgICAgICAg4pSCIOKUgg0K
4pSCIOKUgltYXSBHU1NBUEkgICAgICAgICAgICBCVWlsZCB3aXRoIEdTU0FQ
SSBzdXBwb3J0ICAgICAgICAgICAgICAgICDilIIg4pSCDQo9PT7ilIIg4pSC
WyBdIElDVSAgICAgICAgICAgICAgIFVzZSBJQ1UgZm9yIHVuaWNvZGUgY29s
bGF0aW9uIChzZXJ2ZXIpICAgIOKUgiDilIINCuKUgiDilIJbWF0gSU5UREFU
RSAgICAgICAgICAgQnVpbGRzIHdpdGggNjQtYml0IGRhdGUvdGltZSB0eXBl
IChzZXJ2ZXIp4pSCIOKUgg0K4pSCIOKUgiAgICAgICAgICAgICAgICAgICAg
ICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICDi
lIIg4pSCDQrilIIg4pSCICAgICAgICAgICAgICAgICAgICAgICAgICAgICAg
ICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIOKUgiDilIINCuKU
giDilIIgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAg
ICAgICAgICAgICAgICAgICAgICAgICAg4pSCIOKUgg0K4pSc4pSA4pSU4pSA
4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA
4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA
4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA
4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA
4pSA4pSA4pSA4pSY4pSA4pSkDQrilIIgICAgICAgICAgICAgICAgICAgICAg
IFsgIE9LICBdICAgICAgIENhbmNlbCAgICAgICAgICAgICAgICAgICAgICAg
IOKUgg0K4pSU4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA
4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA
4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA
4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA
4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSA4pSYDQoNCi0tIA0K
U2VyZ2V5IEJ1cmxhZHlhbg0K