Thread: Is it bad sorting in UTF ??

Is it bad sorting in UTF ??

From
wstrzalka
Date:
Why PG sort's my data in case insensitive manner?


masterdb=# SELECT name FROM enterprises ORDER BY name;
                 name
--------------------------------------
................
 abc
 AKS 514
 aks518
 AKSFree11111
..........................

The worst (totally mess) example is:

masterdb=# SELECT name, replace(name, ' ', '<20>') FROM enterprises
ORDER BY name;
                 name                 |                        replace
--------------------------------------
+--------------------------------------------------------
...............................
 alinatestfree                        | alinatestfree
 Alina Test Free                      | Alina<20>Test<20>Free
 alinatestfree220                     | alinatestfree220
...............................

More info here:

masterdb=# select version();
                                                 version
----------------------------------------------------------------------------------------------------------
 PostgreSQL 8.3.7 on x86_64-redhat-linux-gnu, compiled by GCC gcc
(GCC) 4.1.2 20070925 (Red Hat 4.1.2-27)
(1 row)



masterdb=# select name, setting from pg_settings WHERE name ilike 'lc
%';
    name     |   setting
-------------+-------------
 lc_collate  | en_US.UTF-8
 lc_ctype    | en_US.UTF-8
 lc_messages | en_US.UTF-8
 lc_monetary | en_US.UTF-8
 lc_numeric  | en_US.UTF-8
 lc_time     | en_US.UTF-8
(6 rows)




masterdb=# \l
                    List of databases
                Name                |  Owner   | Encoding
------------------------------------+----------+----------
 masterdb                           | postgres | UTF8
...............



masterdb=# \d+ enterprises
                       Table "public.enterprises"
       Column       |           Type           | Modifiers |
Description
--------------------+--------------------------+-----------
+-------------
...............
 name               | text                     |           |
...................




masterdb=# EXPLAIN ANALYZE SELECT name FROM enterprises ORDER BY name;
                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Sort  (cost=29.62..30.68 rows=422 width=11) (actual time=1.537..1.596
rows=367 loops=1)
   Sort Key: name
   Sort Method:  quicksort  Memory: 44kB
   ->  Seq Scan on enterprises  (cost=0.00..11.22 rows=422 width=11)
(actual time=0.012..0.181 rows=367 loops=1)
 Total runtime: 1.677 ms
(5 rows)

Re: Is it bad sorting in UTF ??

From
Richard Huxton
Date:
wstrzalka wrote:
> Why PG sort's my data in case insensitive manner?

> masterdb=# select name, setting from pg_settings WHERE name ilike 'lc
> %';
>     name     |   setting
> -------------+-------------
>  lc_collate  | en_US.UTF-8

Because that's what en_US.UTF-8 does. If you want "computer-style"
sorting choose the "C" locale. You'll need to dump all your databases an
re-initdb to do that I'm afraid. I believe 8.4 is going to allow
different locales for each database though.

--
   Richard Huxton
   Archonet Ltd

Re: Is it bad sorting in UTF ??

From
Craig Ringer
Date:
Richard Huxton wrote:
> wstrzalka wrote:
>> Why PG sort's my data in case insensitive manner?
>
>> masterdb=# select name, setting from pg_settings WHERE name ilike 'lc
>> %';
>>     name     |   setting
>> -------------+-------------
>>  lc_collate  | en_US.UTF-8
>
> Because that's what en_US.UTF-8 does. If you want "computer-style"
> sorting choose the "C" locale. You'll need to dump all your databases an
> re-initdb to do that I'm afraid. I believe 8.4 is going to allow
> different locales for each database though.

Even if it were slower, having a way to specify that sorting be WITH a
particular collation method/locale would be appealing.

I guess this could be done with a custom operator/opclass implementing
non-libc-based sorting and collation, which is probably where PostgreSQL
will land up heading in the end.

--
Craig Ringer

Re: Is it bad sorting in UTF ??

From
Lincoln Yeoh
Date:
At 07:57 PM 4/14/2009, Richard Huxton wrote:
>wstrzalka wrote:
>>Why PG sort's my data in case insensitive manner?
>
>>masterdb=# select name, setting from pg_settings WHERE name ilike 'lc
>>%';
>>     name     |   setting
>>-------------+-------------
>>  lc_collate  | en_US.UTF-8
>
>Because that's what en_US.UTF-8 does. If you want "computer-style"
>sorting choose the "C" locale. You'll need to dump all your
>databases an re-initdb to do that I'm afraid. I believe 8.4 is going
>to allow different locales for each database though.

Is it possible to have something like:

select name from sometable order by collate(name,'en_US.UTF-8');
select name from sometable order by collate(name,'C');

Link.