Thread: lc_collate issue

lc_collate issue

From

Cody Pisto

Date:

24 August 2007, 15:01:06

Hi All,

I'm looking for any kind of a reason (and potential workarounds), be it
bug or otherwise, why the following two queries produce different
results under a database encoding of UTF8 and lc_collate of en_US.UTF-8:

SELECT x FROM (SELECT 'Something else' AS x UNION SELECT '-SOMETHING
ELSE-' AS x UNION SELECT 'Somethang' AS x) y ORDER BY LOWER(x)

        x
------------------
 Somethang
 -SOMETHING ELSE-
 Something else
(3 rows)


*AND*

SELECT x FROM (SELECT 'Somethingelse' AS x UNION SELECT
'-SOMETHINGELSE-' AS x UNION SELECT 'Somethang' AS x) y ORDER BY LOWER(x)

        x
-----------------
 Somethang
 Somethingelse
 -SOMETHINGELSE-


The removal of spaces from the strings gives "more correct" sorting
results, with the spaces and '-' characters, '-SOMETHING ELSE-' is
strangely sorted in the middle?
It does not matter if you use LOWER or UPPER, and the "problem" does not
occur on databases with encoding SQL_ASCII and lc_collate of C

I have tested this on Postgres 8.1.9, 8.2, 8.2.4 with database encoding
of UTF8 and lc_collate of en_US.UTF8
and on 7.4.16 with database encoding of SQL_ASCII and lc_collate of C

Thank in advance for any consideration!


-Cody

Re: lc_collate issue

From

Tom Lane

Date:

24 August 2007, 17:16:52

Cody Pisto <cpisto@rvweb.com> writes:
> I'm looking for any kind of a reason (and potential workarounds), be it
> bug or otherwise, why the following two queries produce different
> results under a database encoding of UTF8 and lc_collate of en_US.UTF-8:

That's just how it is in most non-C locales --- they use some weird
algorithm that's alleged to approximate what dictionary makers
traditionally do with phrases.  I don't recall the details but there's
something about multiple passes with spaces being ignored in earlier
passes.  You'll find that sort(1) sorts these lines the same way.

If you don't like it, use C locale, or put together your own locale
definition.  (No, I'm not sure how hard that is ...)

            regards, tom lane

Re: lc_collate issue

From

Cody Pisto

Date:

24 August 2007, 17:48:40

Hi Tom,

Thanks for answering,

I pretty much assumed that was the case (whatever library postgres is using for encoding is causing the "issue")

The[my] problem is, it just seems like completely incorrect behavior..

The quickest and dirtiest examples I can do are that both python and mysql sort these 3 example strings (in utf8 encoding) the way I would expect (like a C locale)

python:

>>> x = [unicode("Somethang", "utf8"), unicode("-SOMETHING ELSE-", "utf8"), unicode("Something else", "utf8")]
>>> x
[u'Somethang', u'-SOMETHING ELSE-', u'Something else']
>>> x.sort()
>>> x
[u'-SOMETHING ELSE-', u'Somethang', u'Something else']

mysql:

mysql> SET NAMES 'utf8';
Query OK, 0 rows affected (0.02 sec)

mysql> SELECT x FROM (SELECT 'Something else' AS x UNION SELECT '-SOMETHING ELSE-' AS x UNION SELECT 'Somethang' AS x) y ORDER BY LOWER(x);
+------------------+
| x                |
+------------------+
| -SOMETHING ELSE- |
| Somethang        |
| Something else   |
+------------------+
3 rows in set (0.00 sec)

postgres:

SELECT x FROM (SELECT 'Something else' AS x UNION SELECT '-SOMETHING ELSE-' AS x UNION SELECT 'Somethang' AS x) y ORDER BY LOWER(x);
        x
------------------
Somethang
-SOMETHING ELSE-
Something else
(3 rows)

And I bet oracle, firebird, sqlite, mssql, and everything else out there that does utf8 would return it in the "right" order (I'm willing to test that too if needed..)

If this is potentially a problem in postgres somewhere, point me in the general direction and I'm more than willing to fix it myself..

Thanks for your consideration..

-Cody

Tom Lane wrote:

Cody Pisto <cpisto@rvweb.com> writes:

I'm looking for any kind of a reason (and potential workarounds), be it 
bug or otherwise, why the following two queries produce different 
results under a database encoding of UTF8 and lc_collate of en_US.UTF-8:

That's just how it is in most non-C locales --- they use some weird
algorithm that's alleged to approximate what dictionary makers
traditionally do with phrases.  I don't recall the details but there's
something about multiple passes with spaces being ignored in earlier
passes.  You'll find that sort(1) sorts these lines the same way.

If you don't like it, use C locale, or put together your own locale
definition.  (No, I'm not sure how hard that is ...)
		regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

-- 

*Cody Pisto*
Redzia RVs
10555 Montgomery NE
Suite 80
Albuquerque, NM 87111
Phone: (866) 844-1986
Fax: (866) 204-4403

Re: lc_collate issue

From

Tom Lane

Date:

24 August 2007, 18:52:45

Cody Pisto <cpisto@rvweb.com> writes:
> If this is potentially a problem in postgres somewhere, point me in the
> general direction and I'm more than willing to fix it myself..

You seem not to have absorbed what I said.  This *is* the correct result
according to that locale's definition of sorting.  You can demonstrate
that without any use of Postgres:

[tgl@rh2 ~]$ cat fooey
Somethang
-SOMETHING ELSE-
Something else
[tgl@rh2 ~]$ LANG=C sort fooey
-SOMETHING ELSE-
Somethang
Something else
[tgl@rh2 ~]$ LANG=en_US sort fooey
Somethang
Something else
-SOMETHING ELSE-
[tgl@rh2 ~]$

If you prefer C sort ordering, run Postgres in C locale.  It's as
simple as that.

            regards, tom lane

Re: lc_collate issue

From

Cody Pisto

Date:

24 August 2007, 19:23:46

Hi Tom,

I did understand what you said, I apologize that it came out otherwise.

I'm just looking for the correct workaround.

If initdb was done with a C locale, and thus lc_collate and friends where all C, but the database and client encoding was set to UTF-8, would postgres convert data on the fly from UTF-8(storage) to ASCII for sorting or would things just blow up when a >1 byte character hit the mix?

The docs say bad things would happen:

http://www.postgresql.org/docs/8.2/static/multibyte.html
Important: Although you can specify any encoding you want for a database, it is unwise to choose an encoding that is not what is expected by the locale you have selected. The LC_COLLATE and LC_CTYPE settings imply a particular encoding, and locale-dependent operations (such as sorting) are likely to misinterpret data that is in an incompatible encoding.

Right now for me ORDER BY LOWER(ASCII(column)), LOWER(column) (or some variation there of) works, but is there a better workaround?

Thanks,

-Cody

Tom Lane wrote:

Cody Pisto <cpisto@rvweb.com> writes:

If this is potentially a problem in postgres somewhere, point me in the 
general direction and I'm more than willing to fix it myself..

You seem not to have absorbed what I said.  This *is* the correct result
according to that locale's definition of sorting.  You can demonstrate
that without any use of Postgres:

[tgl@rh2 ~]$ cat fooey
Somethang
-SOMETHING ELSE-
Something else
[tgl@rh2 ~]$ LANG=C sort fooey
-SOMETHING ELSE-
Somethang
Something else
[tgl@rh2 ~]$ LANG=en_US sort fooey
Somethang
Something else
-SOMETHING ELSE-
[tgl@rh2 ~]$ 

If you prefer C sort ordering, run Postgres in C locale.  It's as
simple as that.
		regards, tom lane

-- 

*Cody Pisto*
Redzia RVs
10555 Montgomery NE
Suite 80
Albuquerque, NM 87111
Phone: (866) 844-1986
Fax: (866) 204-4403

Re: lc_collate issue

From

Tom Lane

Date:

24 August 2007, 22:18:41

Cody Pisto <cpisto@rvweb.com> writes:
> If initdb was done with a C locale, and thus lc_collate and friends
> where all C, but the database and client encoding was set to UTF-8,
> would postgres convert data on the fly from UTF-8(storage) to ASCII for
> sorting or would things just blow up when a >1 byte character hit the mix?

No, C locale just sorts the bytes.  It won't "blow up".  Whether it will
give you a sort ordering you like for multibyte characters is a
different question.

            regards, tom lane

Re: lc_collate issue

From

Tatsuo Ishii

Date:

24 August 2007, 23:20:39

> Cody Pisto <cpisto@rvweb.com> writes:
> > If initdb was done with a C locale, and thus lc_collate and friends
> > where all C, but the database and client encoding was set to UTF-8,
> > would postgres convert data on the fly from UTF-8(storage) to ASCII for
> > sorting or would things just blow up when a >1 byte character hit the mix?
>
> No, C locale just sorts the bytes.  It won't "blow up".  Whether it will
> give you a sort ordering you like for multibyte characters is a
> different question.

Yup.

For example, LATIN1 part of UTF-8 (UNICODE) is physicaly ordered same
as ISO 8859-1. So if you see the order of ISO 8859-1 is "natural",
then the sort order of UTF-8 is ok as well. However the order of CJK
part of UTF-8 is totally different from the original charcater sets
(almost random), you need to use convert() for converting UTF-8 to
original encoding to get "natural" sort order. I don't think you are
interested in CJK part, though.
--
Tatsuo Ishii
SRA OSS, Inc. Japan

Re: lc_collate issue

From

Peter Eisentraut

Date:

25 August 2007, 04:57:43

Cody Pisto wrote:
> I'm just looking for the correct workaround.

The canonically correct workaround it to define your own locale.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: lc_collate issue

From

"Karsten Hilbert"

Date:

25 August 2007, 11:16:47

> would postgres convert data on the fly from UTF-8(storage) to ASCII for
> sorting
That ain't possible, it seems, or else we wouldn't need UTF-8.

Karsten
--
GMX FreeMail: 1 GB Postfach, 5 E-Mail-Adressen, 10 Free SMS.
Alle Infos und kostenlose Anmeldung: http://www.gmx.net/de/go/freemail

Re: lc_collate issue

From

db@zigo.dhs.org

Date:

30 August 2007, 06:36:41

> I'm just looking for the correct workaround.

While adding a new collation is the "correct" solution it's a lot of work.
Even then pg can't use different collations anyway unless you reinit
the datadir using initdb.

One workaround is to cast the text value into a bytea value, and then it
will be sorted in byte order no matter what locale you have. Like this:

  SELECT *
    FROM foo
ORDER BY CAST (some_column AS BYTEA);

This work except that there is no cast from text to bytea. But we can add
one using:

  CREATE CAST (text AS bytea) WITHOUT FUNCTION AS ASSIGNMENT;

I can't promise that WITHOUT FUNCTION will always work but currently bytea
and text values are stored in the same way so it should work (and it
probably will in future versions as well).

/Dennis