Thread: BUG #1721: mutiple bytes character string comaprison error

BUG #1721: mutiple bytes character string comaprison error

From

"Chii-Tung Liu"

Date:

19 June 2005, 13:45:42

The following bug has been logged online:

Bug reference:      1721
Logged by:          Chii-Tung Liu
Email address:      cdliou@mail.cyut.edu.tw
PostgreSQL version: 8.0.3
Operating system:   Windows XP SP2
Description:        mutiple bytes character string comaprison error
Details:

When compare two UTF-8 encoded string that contains Chinese words, the
result is always TRUE
1. create a database test with encoding set to unicode
CREATE DATABASE test
  WITH OWNER = postgres
       ENCODING = 'UNICODE'
       TABLESPACE = pg_default;
2. insert data with Chinese words
INSERT into node set title='1 ä¸æ'

3. SELECT title from node where title > '1.1 '
would return '1 ä¸æ'

4. Both SELECT '1 ä¸æ' > '1.1' and  SELECT '1.1' > '1 ä¸æ' return
FALSE

Re: BUG #1721: mutiple bytes character string comaprison error

From

Tom Lane

Date:

19 June 2005, 15:18:26

"Chii-Tung Liu" <cdliou@mail.cyut.edu.tw> writes:
> PostgreSQL version: 8.0.3
> Operating system:   Windows XP SP2

> When compare two UTF-8 encoded string that contains Chinese words, the
> result is always TRUE

Sorry, but UTF-8 encoding doesn't work properly on Windows (yet).
Use some other database encoding.

            regards, tom lane

Re: BUG #1721: mutiple bytes character string comaprison

From

Kris Jurka

Date:

20 June 2005, 04:30:51

On Sun, 19 Jun 2005, Tom Lane wrote:

> "Chii-Tung Liu" <cdliou@mail.cyut.edu.tw> writes:
> > PostgreSQL version: 8.0.3
> > Operating system:   Windows XP SP2
>
> > When compare two UTF-8 encoded string that contains Chinese words, the
> > result is always TRUE
>
> Sorry, but UTF-8 encoding doesn't work properly on Windows (yet).
> Use some other database encoding.
>

Shouldn't we forbid its creation then?  At least a strongly worded
warning?  We see these complaints too often.

Kris Jurka

Re: BUG #1721: mutiple bytes character string comaprison error

From

Tom Lane

Date:

20 June 2005, 04:46:51

Kris Jurka <books@ejurka.com> writes:
> On Sun, 19 Jun 2005, Tom Lane wrote:
>> Sorry, but UTF-8 encoding doesn't work properly on Windows (yet).
>> Use some other database encoding.

> Shouldn't we forbid its creation then?

There was serious discussion of that before the 8.0 release, but
we decided not to forbid it.  Check the archives; I don't recall
the reasoning at the moment.

> We see these complaints too often.

There are lots of complaints we see way too often ;-) ... but
distressingly, there are still only 24 hours in a day.

            regards, tom lane

Re: BUG #1721: mutiple bytes character string comaprison

From

Tatsuo Ishii

Date:

20 June 2005, 05:38:45

> The following bug has been logged online:
>
> Bug reference:      1721
> Logged by:          Chii-Tung Liu
> Email address:      cdliou@mail.cyut.edu.tw
> PostgreSQL version: 8.0.3
> Operating system:   Windows XP SP2
> Description:        mutiple bytes character string comaprison error
> Details:
>
> When compare two UTF-8 encoded string that contains Chinese words, the
> result is always TRUE
> 1. create a database test with encoding set to unicode
> CREATE DATABASE test
>   WITH OWNER = postgres
>        ENCODING = 'UNICODE'
>        TABLESPACE = pg_default;
> 2. insert data with Chinese words
> INSERT into node set title='1 $BCfJ8(B'
>
> 3. SELECT title from node where title > '1.1 '
> would return '1 $BCfJ8(B'
>
> 4. Both SELECT '1 $BCfJ8(B' > '1.1' and  SELECT '1.1' > '1 $BCfJ8(B' return
> FALSE

I think you need to use C locale.
--
Tatsuo Ishii

Re: BUG #1721: mutiple bytes character string comaprison error

From

Bruce Momjian

Date:

20 June 2005, 18:52:34

Tom Lane wrote:
> Kris Jurka <books@ejurka.com> writes:
> > On Sun, 19 Jun 2005, Tom Lane wrote:
> >> Sorry, but UTF-8 encoding doesn't work properly on Windows (yet).
> >> Use some other database encoding.
>
> > Shouldn't we forbid its creation then?
>
> There was serious discussion of that before the 8.0 release, but
> we decided not to forbid it.  Check the archives; I don't recall
> the reasoning at the moment.

UTF8 encoding works with the C locale assuming you don't care about
ordering of the character set, e.g. Japanese.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: BUG #1721: mutiple bytes character string comaprison error

From

"John Hansen"

Date:

20 June 2005, 19:31:54

>=20
> UTF8 encoding works with the C locale assuming you don't care=20
> about ordering of the character set, e.g. Japanese.
>=20

Has anyone with the ability to compile postgresql on windows tested the
ICU patch?

... John

Re: BUG #1721: mutiple bytes character string comaprison error

From

"Magnus Hagander"

Date:

20 June 2005, 20:01:15

> > UTF8 encoding works with the C locale assuming you don't care about=20
> > ordering of the character set, e.g. Japanese.
> >=20
>=20
> Has anyone with the ability to compile postgresql on windows=20
> tested the ICU patch?

Yes.
See http://archives.postgresql.org/pgsql-hackers/2005-05/msg00662.php


//Magnus

Re: BUG #1721: mutiple bytes character string comaprison

From

Tatsuo Ishii

Date:

20 June 2005, 22:38:05

> Tom Lane wrote:
> > Kris Jurka <books@ejurka.com> writes:
> > > On Sun, 19 Jun 2005, Tom Lane wrote:
> > >> Sorry, but UTF-8 encoding doesn't work properly on Windows (yet).
> > >> Use some other database encoding.
> >
> > > Shouldn't we forbid its creation then?
> >
> > There was serious discussion of that before the 8.0 release, but
> > we decided not to forbid it.  Check the archives; I don't recall
> > the reasoning at the moment.
>
> UTF8 encoding works with the C locale assuming you don't care about
> ordering of the character set, e.g. Japanese.

No, sometimes Japanese needs char ordering too and I think this is not
a Windows only problem. The real problem is Unicode defines char
orderes in totally random manner because Chinese/Japanese/Korean Kanji
characters are "Unified" in Unicode. To solve the problem, we can use
convert UTF8 to EUC_JP using CONVERT. See archives for more details.

Or you can use Unicode locale only if your platform's locale database
is not broken and you only use single locale.
--
Tatsuo Ishii