Thread: lc_collate issue
Hi All, I'm looking for any kind of a reason (and potential workarounds), be it bug or otherwise, why the following two queries produce different results under a database encoding of UTF8 and lc_collate of en_US.UTF-8: SELECT x FROM (SELECT 'Something else' AS x UNION SELECT '-SOMETHING ELSE-' AS x UNION SELECT 'Somethang' AS x) y ORDER BY LOWER(x) x ------------------ Somethang -SOMETHING ELSE- Something else (3 rows) *AND* SELECT x FROM (SELECT 'Somethingelse' AS x UNION SELECT '-SOMETHINGELSE-' AS x UNION SELECT 'Somethang' AS x) y ORDER BY LOWER(x) x ----------------- Somethang Somethingelse -SOMETHINGELSE- The removal of spaces from the strings gives "more correct" sorting results, with the spaces and '-' characters, '-SOMETHING ELSE-' is strangely sorted in the middle? It does not matter if you use LOWER or UPPER, and the "problem" does not occur on databases with encoding SQL_ASCII and lc_collate of C I have tested this on Postgres 8.1.9, 8.2, 8.2.4 with database encoding of UTF8 and lc_collate of en_US.UTF8 and on 7.4.16 with database encoding of SQL_ASCII and lc_collate of C Thank in advance for any consideration! -Cody
Cody Pisto <cpisto@rvweb.com> writes: > I'm looking for any kind of a reason (and potential workarounds), be it > bug or otherwise, why the following two queries produce different > results under a database encoding of UTF8 and lc_collate of en_US.UTF-8: That's just how it is in most non-C locales --- they use some weird algorithm that's alleged to approximate what dictionary makers traditionally do with phrases. I don't recall the details but there's something about multiple passes with spaces being ignored in earlier passes. You'll find that sort(1) sorts these lines the same way. If you don't like it, use C locale, or put together your own locale definition. (No, I'm not sure how hard that is ...) regards, tom lane
Hi Tom,
Thanks for answering,
I pretty much assumed that was the case (whatever library postgres is using for encoding is causing the "issue")
The[my] problem is, it just seems like completely incorrect behavior..
The quickest and dirtiest examples I can do are that both python and mysql sort these 3 example strings (in utf8 encoding) the way I would expect (like a C locale)
python:
>>> x = [unicode("Somethang", "utf8"), unicode("-SOMETHING ELSE-", "utf8"), unicode("Something else", "utf8")]
>>> x
[u'Somethang', u'-SOMETHING ELSE-', u'Something else']
>>> x.sort()
>>> x
[u'-SOMETHING ELSE-', u'Somethang', u'Something else']
mysql:
mysql> SET NAMES 'utf8';
Query OK, 0 rows affected (0.02 sec)
mysql> SELECT x FROM (SELECT 'Something else' AS x UNION SELECT '-SOMETHING ELSE-' AS x UNION SELECT 'Somethang' AS x) y ORDER BY LOWER(x);
+------------------+
| x |
+------------------+
| -SOMETHING ELSE- |
| Somethang |
| Something else |
+------------------+
3 rows in set (0.00 sec)
postgres:
SELECT x FROM (SELECT 'Something else' AS x UNION SELECT '-SOMETHING ELSE-' AS x UNION SELECT 'Somethang' AS x) y ORDER BY LOWER(x);
x
------------------
Somethang
-SOMETHING ELSE-
Something else
(3 rows)
And I bet oracle, firebird, sqlite, mssql, and everything else out there that does utf8 would return it in the "right" order (I'm willing to test that too if needed..)
If this is potentially a problem in postgres somewhere, point me in the general direction and I'm more than willing to fix it myself..
Thanks for your consideration..
-Cody
Tom Lane wrote:
Thanks for answering,
I pretty much assumed that was the case (whatever library postgres is using for encoding is causing the "issue")
The[my] problem is, it just seems like completely incorrect behavior..
The quickest and dirtiest examples I can do are that both python and mysql sort these 3 example strings (in utf8 encoding) the way I would expect (like a C locale)
python:
>>> x = [unicode("Somethang", "utf8"), unicode("-SOMETHING ELSE-", "utf8"), unicode("Something else", "utf8")]
>>> x
[u'Somethang', u'-SOMETHING ELSE-', u'Something else']
>>> x.sort()
>>> x
[u'-SOMETHING ELSE-', u'Somethang', u'Something else']
mysql:
mysql> SET NAMES 'utf8';
Query OK, 0 rows affected (0.02 sec)
mysql> SELECT x FROM (SELECT 'Something else' AS x UNION SELECT '-SOMETHING ELSE-' AS x UNION SELECT 'Somethang' AS x) y ORDER BY LOWER(x);
+------------------+
| x |
+------------------+
| -SOMETHING ELSE- |
| Somethang |
| Something else |
+------------------+
3 rows in set (0.00 sec)
postgres:
SELECT x FROM (SELECT 'Something else' AS x UNION SELECT '-SOMETHING ELSE-' AS x UNION SELECT 'Somethang' AS x) y ORDER BY LOWER(x);
x
------------------
Somethang
-SOMETHING ELSE-
Something else
(3 rows)
And I bet oracle, firebird, sqlite, mssql, and everything else out there that does utf8 would return it in the "right" order (I'm willing to test that too if needed..)
If this is potentially a problem in postgres somewhere, point me in the general direction and I'm more than willing to fix it myself..
Thanks for your consideration..
-Cody
Tom Lane wrote:
Cody Pisto <cpisto@rvweb.com> writes:I'm looking for any kind of a reason (and potential workarounds), be it bug or otherwise, why the following two queries produce different results under a database encoding of UTF8 and lc_collate of en_US.UTF-8:That's just how it is in most non-C locales --- they use some weird algorithm that's alleged to approximate what dictionary makers traditionally do with phrases. I don't recall the details but there's something about multiple passes with spaces being ignored in earlier passes. You'll find that sort(1) sorts these lines the same way. If you don't like it, use C locale, or put together your own locale definition. (No, I'm not sure how hard that is ...) regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings
-- *Cody Pisto* Redzia RVs 10555 Montgomery NE Suite 80 Albuquerque, NM 87111 Phone: (866) 844-1986 Fax: (866) 204-4403
Cody Pisto <cpisto@rvweb.com> writes: > If this is potentially a problem in postgres somewhere, point me in the > general direction and I'm more than willing to fix it myself.. You seem not to have absorbed what I said. This *is* the correct result according to that locale's definition of sorting. You can demonstrate that without any use of Postgres: [tgl@rh2 ~]$ cat fooey Somethang -SOMETHING ELSE- Something else [tgl@rh2 ~]$ LANG=C sort fooey -SOMETHING ELSE- Somethang Something else [tgl@rh2 ~]$ LANG=en_US sort fooey Somethang Something else -SOMETHING ELSE- [tgl@rh2 ~]$ If you prefer C sort ordering, run Postgres in C locale. It's as simple as that. regards, tom lane
Hi Tom,
I did understand what you said, I apologize that it came out otherwise.
I'm just looking for the correct workaround.
If initdb was done with a C locale, and thus lc_collate and friends where all C, but the database and client encoding was set to UTF-8, would postgres convert data on the fly from UTF-8(storage) to ASCII for sorting or would things just blow up when a >1 byte character hit the mix?
The docs say bad things would happen:
http://www.postgresql.org/docs/8.2/static/multibyte.html
Important: Although you can specify any encoding you want for a database, it is unwise to choose an encoding that is not what is expected by the locale you have selected. The LC_COLLATE and LC_CTYPE settings imply a particular encoding, and locale-dependent operations (such as sorting) are likely to misinterpret data that is in an incompatible encoding.
Right now for me ORDER BY LOWER(ASCII(column)), LOWER(column) (or some variation there of) works, but is there a better workaround?
Thanks,
-Cody
Tom Lane wrote:
I did understand what you said, I apologize that it came out otherwise.
I'm just looking for the correct workaround.
If initdb was done with a C locale, and thus lc_collate and friends where all C, but the database and client encoding was set to UTF-8, would postgres convert data on the fly from UTF-8(storage) to ASCII for sorting or would things just blow up when a >1 byte character hit the mix?
The docs say bad things would happen:
http://www.postgresql.org/docs/8.2/static/multibyte.html
Important: Although you can specify any encoding you want for a database, it is unwise to choose an encoding that is not what is expected by the locale you have selected. The LC_COLLATE and LC_CTYPE settings imply a particular encoding, and locale-dependent operations (such as sorting) are likely to misinterpret data that is in an incompatible encoding.
Right now for me ORDER BY LOWER(ASCII(column)), LOWER(column) (or some variation there of) works, but is there a better workaround?
Thanks,
-Cody
Tom Lane wrote:
Cody Pisto <cpisto@rvweb.com> writes:If this is potentially a problem in postgres somewhere, point me in the general direction and I'm more than willing to fix it myself..You seem not to have absorbed what I said. This *is* the correct result according to that locale's definition of sorting. You can demonstrate that without any use of Postgres: [tgl@rh2 ~]$ cat fooey Somethang -SOMETHING ELSE- Something else [tgl@rh2 ~]$ LANG=C sort fooey -SOMETHING ELSE- Somethang Something else [tgl@rh2 ~]$ LANG=en_US sort fooey Somethang Something else -SOMETHING ELSE- [tgl@rh2 ~]$ If you prefer C sort ordering, run Postgres in C locale. It's as simple as that. regards, tom lane
-- *Cody Pisto* Redzia RVs 10555 Montgomery NE Suite 80 Albuquerque, NM 87111 Phone: (866) 844-1986 Fax: (866) 204-4403
Cody Pisto <cpisto@rvweb.com> writes: > If initdb was done with a C locale, and thus lc_collate and friends > where all C, but the database and client encoding was set to UTF-8, > would postgres convert data on the fly from UTF-8(storage) to ASCII for > sorting or would things just blow up when a >1 byte character hit the mix? No, C locale just sorts the bytes. It won't "blow up". Whether it will give you a sort ordering you like for multibyte characters is a different question. regards, tom lane
> Cody Pisto <cpisto@rvweb.com> writes: > > If initdb was done with a C locale, and thus lc_collate and friends > > where all C, but the database and client encoding was set to UTF-8, > > would postgres convert data on the fly from UTF-8(storage) to ASCII for > > sorting or would things just blow up when a >1 byte character hit the mix? > > No, C locale just sorts the bytes. It won't "blow up". Whether it will > give you a sort ordering you like for multibyte characters is a > different question. Yup. For example, LATIN1 part of UTF-8 (UNICODE) is physicaly ordered same as ISO 8859-1. So if you see the order of ISO 8859-1 is "natural", then the sort order of UTF-8 is ok as well. However the order of CJK part of UTF-8 is totally different from the original charcater sets (almost random), you need to use convert() for converting UTF-8 to original encoding to get "natural" sort order. I don't think you are interested in CJK part, though. -- Tatsuo Ishii SRA OSS, Inc. Japan
Cody Pisto wrote: > I'm just looking for the correct workaround. The canonically correct workaround it to define your own locale. -- Peter Eisentraut http://developer.postgresql.org/~petere/
> would postgres convert data on the fly from UTF-8(storage) to ASCII for > sorting That ain't possible, it seems, or else we wouldn't need UTF-8. Karsten -- GMX FreeMail: 1 GB Postfach, 5 E-Mail-Adressen, 10 Free SMS. Alle Infos und kostenlose Anmeldung: http://www.gmx.net/de/go/freemail
> I'm just looking for the correct workaround. While adding a new collation is the "correct" solution it's a lot of work. Even then pg can't use different collations anyway unless you reinit the datadir using initdb. One workaround is to cast the text value into a bytea value, and then it will be sorted in byte order no matter what locale you have. Like this: SELECT * FROM foo ORDER BY CAST (some_column AS BYTEA); This work except that there is no cast from text to bytea. But we can add one using: CREATE CAST (text AS bytea) WITHOUT FUNCTION AS ASSIGNMENT; I can't promise that WITHOUT FUNCTION will always work but currently bytea and text values are stored in the same way so it should work (and it probably will in future versions as well). /Dennis