Thread: UTF-8 and LIKE vs =

UTF-8 and LIKE vs =

From
David Wheeler
Date:
Hi All,

I'm having some trouble with multibyte characters and LIKE. We've been
using LIKE instead of = for string queries for a long time, as it gives
us flexibility to use wildcards such as "%" when we need to and get the
same results as with = by not using them. But I've just found that it
sometimes doesn't work properly:

bric=# select version();
                                                  version
------------------------------------------------------------------------
---------------------------------
  PostgreSQL 7.4.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.2.2
20030222 (Red Hat Linux 3.2.2-5)
(1 row)

bric=# select * from keyword where name = '북한의';
   id  |  name  | screen_name | sort_name | active
------+--------+-------------+-----------+--------
  1218 | 국방비 | 국방비      | 국방비    |      1
(1 row)

bric=# select * from keyword where name LIKE '북한의';
  id | name | screen_name | sort_name | active
----+------+-------------+-----------+--------
(0 rows)

bric=# select * from keyword where name ~ '^북한의';
  id | name | screen_name | sort_name | active
----+------+-------------+-----------+--------
(0 rows)

Any idea why = works here and LIKE and ~ wouldn't?

TIA,

David
Attachment

Re: UTF-8 and LIKE vs =

From
Peter Eisentraut
Date:
David Wheeler wrote:
> Any idea why = works here and LIKE and ~ wouldn't?

Because LIKE does a character-by-character matching and = uses the
operating system locale, which could do anything.  If you set the
locale to C, you should get matching results.  Which one is "better"
depends on the semantics of the language, which I cannot judge here.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/


Re: UTF-8 and LIKE vs =

From
David Wheeler
Date:
On Aug 23, 2004, at 1:22 PM, Peter Eisentraut wrote:

> Because LIKE does a character-by-character matching and = uses the
> operating system locale, which could do anything.  If you set the
> locale to C, you should get matching results.  Which one is "better"
> depends on the semantics of the language, which I cannot judge here.

Thanks. So I need to set the locale to C and then LIKE will work
properly? How do I go about doing that? I can see these options:

LC_COLLATE
String sort order

LC_CTYPE
Character classification (What is a letter? The upper-case equivalent?)

LC_MESSAGES
Language of messages

LC_MONETARY
Formatting of currency amounts

LC_NUMERIC
Formatting of numbers

LC_TIME
Formatting of dates and times

Is one of these the one I need to set?

Thanks,

David


Attachment

Re: UTF-8 and LIKE vs =

From
Ian Barwick
Date:
On Mon, 23 Aug 2004 12:41:30 -0700, David Wheeler <david@kineticode.com> wrote:
(...)
> bric=# select version();
>                                                   version
> ------------------------------------------------------------------------
> ---------------------------------
>   PostgreSQL 7.4.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.2.2
> 20030222 (Red Hat Linux 3.2.2-5)
> (1 row)
>
> bric=# select * from keyword where name = '북한의';
>    id  |  name  | screen_name | sort_name | active
> ------+--------+-------------+-----------+--------
>   1218 | 국방비 | 국방비      | 국방비    |      1
> (1 row)

er, the characters in "name" don't seem to match the characters in the
query - '국방비' vs. '북한의' - does that have any bearing?

FWIW (on 7.4.3):

test=#  select * from t1 ;
 id | value
----+--------
  1 | 日本
  2 | 日本語
  3 | 北海道
(3 rows)

test=#  select * from t1 where value ~ '日';
 id | value
----+--------
  1 | 日本
  2 | 日本語
(2 rows)

test=# select * from t1 where value like '日%';
 id | value
----+--------
  1 | 日本
  2 | 日本語

test=# select * from t1 where value like '北海%';
 id | value
----+--------
  3 | 北海道
(1 row)

Ian Barwick
barwick@gmail.net

Re: UTF-8 and LIKE vs =

From
David Wheeler
Date:
On Aug 23, 2004, at 1:58 PM, Ian Barwick wrote:

> er, the characters in "name" don't seem to match the characters in the
> query - '국방비' vs. '북한의' - does that have any bearing?

Yes, it means that = is doing the wrong thing!!

I noticed this because I had a query that was looking in the keyword
table for an existing record using LIKE. If it didn't find it, it
inserted it. But the inserts were giving me an error because the name
column has a UNIQUE index on it. Could it be that the index and the =
operator are comparing bytes, and that '국방비' and '북한의' have the same
bytes but different characters??

If so, this is a pretty serious problem. How can I get = and the
indices to use character semantics rather than byte semantics? I also
need to be able to store data in different languages in the database
(and in the same column!), but all in Unicode.

TIA,

David

Attachment

Re: UTF-8 and LIKE vs =

From
Ian Barwick
Date:
On Mon, 23 Aug 2004 14:04:05 -0700, David Wheeler <david@kineticode.com> wrote:
> On Aug 23, 2004, at 1:58 PM, Ian Barwick wrote:
>
> > er, the characters in "name" don't seem to match the characters in the
> > query - '국방비' vs. '북한의' - does that have any bearing?
>
> Yes, it means that = is doing the wrong thing!!
>
> I noticed this because I had a query that was looking in the keyword
> table for an existing record using LIKE. If it didn't find it, it
> inserted it. But the inserts were giving me an error because the name
> column has a UNIQUE index on it. Could it be that the index and the =
> operator are comparing bytes, and that '국방비' and '북한의' have the same
> bytes but different characters??
>
> If so, this is a pretty serious problem. How can I get = and the
> indices to use character semantics rather than byte semantics? I also
> need to be able to store data in different languages in the database
> (and in the same column!), but all in Unicode.

I don't know what the problem is, but you might want to check the
client encoding settings, and the encoding your characters are
arriving in (remembering all the time, in Postgres "UNICODE" really
means UTF-8).

If you're using Perl (I'm guessing this is Bricolage-related) the
"_utf8_on"-ness of strings might be worth checking too, and also the
"pg_enable_utf8" flag in DBD::Pg.

Ian Barwick
barwick@gmail.net

Re: UTF-8 and LIKE vs =

From
Robert Treat
Date:
On Mon, 2004-08-23 at 16:43, David Wheeler wrote:
> On Aug 23, 2004, at 1:22 PM, Peter Eisentraut wrote:
>
> > Because LIKE does a character-by-character matching and = uses the
> > operating system locale, which could do anything.  If you set the
> > locale to C, you should get matching results.  Which one is "better"
> > depends on the semantics of the language, which I cannot judge here.
>
> Thanks. So I need to set the locale to C and then LIKE will work
> properly? How do I go about doing that? I can see these options:
>
> LC_COLLATE
> String sort order
>
> LC_CTYPE
> Character classification (What is a letter? The upper-case equivalent?)
>
> LC_MESSAGES
> Language of messages
>
> LC_MONETARY
> Formatting of currency amounts
>
> LC_NUMERIC
> Formatting of numbers
>
> LC_TIME
> Formatting of dates and times
>
> Is one of these the one I need to set?
>

initdb is your friend. (well, not really, but that's where your headed)

Robert Treat
--
Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL


Re: UTF-8 and LIKE vs =

From
David Wheeler
Date:
On Aug 23, 2004, at 2:25 PM, Ian Barwick wrote:

> I don't know what the problem is, but you might want to check the
> client encoding settings, and the encoding your characters are
> arriving in (remembering all the time, in Postgres "UNICODE" really
> means UTF-8).
>
> If you're using Perl (I'm guessing this is Bricolage-related) the
> "_utf8_on"-ness of strings might be worth checking too, and also the
> "pg_enable_utf8" flag in DBD::Pg.

Bricolage is getting all its content at UTF-8. It has been working
beautifully for some time. I tried setting the utf8 flag on the
variable passed to the query, but it made no difference.

I think that LIKE is doing the right thing, and = is not. And I need to
find out how to get = to do the right thing. If I need to dump my
database and run initdb to use C for LC_COLLATE, the, feh, I will.
Right now I have:

LC_COLLATE:                           en_US.UTF-8
LC_CTYPE:                             en_US.UTF-8

Regards,

David


Attachment

Re: UTF-8 and LIKE vs =

From
David Wheeler
Date:
On Aug 23, 2004, at 2:31 PM, Robert Treat wrote:

> initdb is your friend. (well, not really, but that's where your headed)

Yes, that's what I'm beginning to suspect.

Cheers,

David

Attachment

Re: UTF-8 and LIKE vs =

From
David Wheeler
Date:
On Aug 23, 2004, at 2:46 PM, David Wheeler wrote:

> I think that LIKE is doing the right thing, and = is not. And I need
> to find out how to get = to do the right thing. If I need to dump my
> database and run initdb to use C for LC_COLLATE, the, feh, I will.
> Right now I have:
>
> LC_COLLATE:                           en_US.UTF-8
> LC_CTYPE:                             en_US.UTF-8

And testing on another box with these set to "C", it seems to work
properly.

*sigh* Nothing like dumping and restoring a 2.7 GB database to keep me
engaged in what I'm doing, eh?

Regards,

David

Attachment

Re: UTF-8 and LIKE vs =

From
Tom Lane
Date:
David Wheeler <david@kineticode.com> writes:
> On Aug 23, 2004, at 1:58 PM, Ian Barwick wrote:
>> er, the characters in "name" don't seem to match the characters in the
>> query - '=B1=B9=B9=E6=BA=F1' vs. '=BA=CF=C7=D1=C0=C7' - does that have an=
> y bearing?

> Yes, it means that = is doing the wrong thing!!

I have seen this happen in situations where the strings contained
character sequences that were illegal according to the encoding that the
locale thought was in force.  (It seems that strcoll() will return more
or less random results in such cases...)  In particular, given that you
have

> LC_COLLATE:                           en_US.UTF-8
> LC_CTYPE:                             en_US.UTF-8

you are at risk if the data is not legal UTF-8 strings.

The real question therefore is whether you have the database encoding
set correctly --- ie, is it UNICODE (== UTF8)?  If not then it may well
be that Postgres is presenting strings to strcoll() that the latter will
choke on.

            regards, tom lane

Re: UTF-8 and LIKE vs =

From
Markus Bertheau
Date:
В Пнд, 23.08.2004, в 23:04, David Wheeler пишет:
> On Aug 23, 2004, at 1:58 PM, Ian Barwick wrote:
>
> > er, the characters in "name" don't seem to match the characters in the
> > query - '국방비' vs. '북한의' - does that have any bearing?
>
> Yes, it means that = is doing the wrong thing!!

The collation rules of your (and my) locale say that these strings are
the same:

[markus@teetnang markus]$ cat > t
국방비
북한의
[markus@teetnang markus]$ uniq t
국방비
[markus@teetnang markus]$

Make sure that you have initdb'd the database under the right locale.
There's not much PostgreSQL can do if strcoll() says that the strings
are equal.

--
Markus Bertheau <twanger@bluetwanger.de>


Re: UTF-8 and LIKE vs =

From
David Wheeler
Date:
On Aug 23, 2004, at 3:44 PM, Tom Lane wrote:

>> Yes, it means that = is doing the wrong thing!!
>
> I have seen this happen in situations where the strings contained
> character sequences that were illegal according to the encoding that
> the
> locale thought was in force.  (It seems that strcoll() will return more
> or less random results in such cases...)  In particular, given that you
> have
>
>> LC_COLLATE:                           en_US.UTF-8
>> LC_CTYPE:                             en_US.UTF-8
>
> you are at risk if the data is not legal UTF-8 strings.

But is it possible to store non-UTF-8 data in a UNICODE database?

> The real question therefore is whether you have the database encoding
> set correctly --- ie, is it UNICODE (== UTF8)?  If not then it may well
> be that Postgres is presenting strings to strcoll() that the latter
> will
> choke on.

The database is UNICODE.

$ psql -U postgres -l
         List of databases
    Name    |  Owner   | Encoding
-----------+----------+-----------
  bric      | postgres | UNICODE
  template0 | postgres | SQL_ASCII
  template1 | postgres | SQL_ASCII
(3 rows)

I plan to dump it, run initdb with LC_COLLATE and LC_CTYPE both set to
"C", and restore the database and see if that helps.

Thanks,

David

Attachment

Re: UTF-8 and LIKE vs =

From
David Wheeler
Date:
On Aug 23, 2004, at 3:46 PM, Markus Bertheau wrote:

> The collation rules of your (and my) locale say that these strings are
> the same:
>
> [markus@teetnang markus]$ cat > t
> 국방비
> 북한의
> [markus@teetnang markus]$ uniq t
> 국방비
> [markus@teetnang markus]$

Interesting.

> Make sure that you have initdb'd the database under the right locale.
> There's not much PostgreSQL can do if strcoll() says that the strings
> are equal.

Well, I have data from a number of different locales in the same
database. I'm hoping that setting the locale to "C" will do the trick.
It seems to work properly on my Mac:

sharky=# select * from keyword where name = '국방비';
  id |  name  | screen_name | sort_name | active
----+--------+-------------+-----------+--------
   0 | 국방비 | 국방비      | 국방비    |      1
(1 row)

sharky=# select * from keyword where name = '북한의';
  id | name | screen_name | sort_name | active
----+------+-------------+-----------+--------
(0 rows)

sharky=# select * from keyword where name like '북한의';
  id | name | screen_name | sort_name | active
----+------+-------------+-----------+--------
(0 rows)

sharky=# select * from keyword where lower(name) like '국방비';
  id |  name  | screen_name | sort_name | active
----+--------+-------------+-----------+--------
   0 | 국방비 | 국방비      | 국방비    |      1
(1 row)

Regards,

David
Attachment

Re: UTF-8 and LIKE vs =

From
Tom Lane
Date:
David Wheeler <david@kineticode.com> writes:
> But is it possible to store non-UTF-8 data in a UNICODE database?

In theory not ... but I think there was a discussion earlier that
concluded that our check for encoding validity is not airtight ...

            regards, tom lane

Re: UTF-8 and LIKE vs =

From
David Wheeler
Date:
On Aug 23, 2004, at 3:59 PM, Tom Lane wrote:

>> But is it possible to store non-UTF-8 data in a UNICODE database?
>
> In theory not ... but I think there was a discussion earlier that
> concluded that our check for encoding validity is not airtight ...

Well, it it was mostly right, I wouldn't expect it to be a problem as
much as this issue is coming up for me. If, OTOH, the encoding validity
check leaks like a sieve, then I might indeed have a bigger problem.

Is the encoding check fixed in 8.0beta1?

Thanks,

David

Attachment

Re: UTF-8 and LIKE vs =

From
Tom Lane
Date:
David Wheeler <david@kineticode.com> writes:
> Is the encoding check fixed in 8.0beta1?

[ looks back at discussion... ]  Actually I misremembered --- the
discussion was about how we would *reject* legal UTF-8 codes that are
more than 2 bytes long.  So the code is broken, but not in the direction
that would cause your problem.  Time for another theory.

Is the problem query using an index?  If so, does REINDEX help?

            regards, tom lane

Re: UTF-8 and LIKE vs =

From
David Wheeler
Date:
On Aug 23, 2004, at 4:08 PM, Tom Lane wrote:

> [ looks back at discussion... ]  Actually I misremembered --- the
> discussion was about how we would *reject* legal UTF-8 codes that are
> more than 2 bytes long.  So the code is broken, but not in the
> direction
> that would cause your problem.  Time for another theory.

Whew!

> Is the problem query using an index?  If so, does REINDEX help?

Doesn't look like it:

bric=# reindex index udx_keyword__name;
REINDEX
bric=# select * from keyword where name ='북한의';
   id  |  name  | screen_name | sort_name | active
------+--------+-------------+-----------+--------
  1218 | 국방비 | 국방비      | 국방비    |      1
(1 row)

That's still giving me an invalid row for the value I passed to it
(note that the value of the "name" column is different than the value I
queried for).

Regards,

David
Attachment

Re: UTF-8 and LIKE vs =

From
Ian Barwick
Date:
On Tue, 24 Aug 2004 00:46:50 +0200, Markus Bertheau
<twanger@bluetwanger.de> wrote:
>
>
> В Пнд, 23.08.2004, в 23:04, David Wheeler пишет:
> > On Aug 23, 2004, at 1:58 PM, Ian Barwick wrote:
> >
> > > er, the characters in "name" don't seem to match the characters in the
> > > query - '국방비' vs. '북한의' - does that have any bearing?
> >
> > Yes, it means that = is doing the wrong thing!!
>
> The collation rules of your (and my) locale say that these strings are
> the same:
>
> [markus@teetnang markus]$ cat > t
> 국방비
> 북한의
> [markus@teetnang markus]$ uniq t
> 국방비
> [markus@teetnang markus]$

wild speculation in need of a Korean speaker, but:

ian@linux:~/tmp> cat j.txt
テスト
환경설
전검색
웹문서
국방비
북한의
てすと
ian@linux:~/tmp> uniq  j.txt
テスト
환경설
てすと

All but the first and last lines are random Korean (Hangul)
characters. Evidently our respective locales think all Hangul strings
of the same length are identical, which is very probably not the
case...

Ian Barwick

Re: UTF-8 and LIKE vs =

From
Tom Lane
Date:
David Wheeler <david@kineticode.com> writes:
>> Is the problem query using an index?  If so, does REINDEX help?

> Doesn't look like it:

> bric=3D# reindex index udx_keyword__name;
> REINDEX
> bric=3D# select * from keyword where name =3D'=BA=CF=C7=D1=C0=C7';
>    id  |  name  | screen_name | sort_name | active
> ------+--------+-------------+-----------+--------
>   1218 | =B1=B9=B9=E6=BA=F1 | =B1=B9=B9=E6=BA=F1      | =B1=B9=B9=E6=BA=F1=
>     |      1
> (1 row)

Hmm.  I tried putting your string into a UNICODE database and I got
ERROR:  invalid byte sequence for encoding "UNICODE": 0xc7

So there's something funny happening here.  What is your client_encoding
setting?

            regards, tom lane

Re: UTF-8 and LIKE vs =

From
David Wheeler
Date:
On Aug 23, 2004, at 4:35 PM, Tom Lane wrote:

> Hmm.  I tried putting your string into a UNICODE database and I got
> ERROR:  invalid byte sequence for encoding "UNICODE": 0xc7

Really? Curious.

> So there's something funny happening here.  What is your
> client_encoding
> setting?

It's not set. I've had it commented out for quite some time:

#    $ENV{PGCLIENTENCODING} = 'UNICODE'; # Should default to this,
anyway.
#    $ENV{PGSERVERENCODING} = 'UNICODE'; # Should default to this,
anyway.

Regards,

David

Attachment

Re: UTF-8 and LIKE vs =

From
David Wheeler
Date:
On Aug 23, 2004, at 4:34 PM, Ian Barwick wrote:

> wild speculation in need of a Korean speaker, but:
>
> ian@linux:~/tmp> cat j.txt
> テスト
> 환경설
> 전검색
> 웹문서
> 국방비
> 북한의
> てすと
> ian@linux:~/tmp> uniq  j.txt
> テスト
> 환경설
> てすと
>
> All but the first and last lines are random Korean (Hangul)
> characters. Evidently our respective locales think all Hangul strings
> of the same length are identical, which is very probably not the
> case...

Does this go away if you change your locale to C?

Regards,

David

Attachment

Re: UTF-8 and LIKE vs =

From
David Wheeler
Date:
On Aug 23, 2004, at 4:49 PM, David Wheeler wrote:

>> Hmm.  I tried putting your string into a UNICODE database and I got
>> ERROR:  invalid byte sequence for encoding "UNICODE": 0xc7
>
> Really? Curious.

Oh, are you sure that you got my UTF-8 data? Because it came back in
your reply all mangled.

Cheers,

David

Attachment

Re: UTF-8 and LIKE vs =

From
Ian Barwick
Date:
On Mon, 23 Aug 2004 16:50:04 -0700, David Wheeler <david@kineticode.com> wrote:
> On Aug 23, 2004, at 4:34 PM, Ian Barwick wrote:
>
> > wild speculation in need of a Korean speaker, but:
> >
> > ian@linux:~/tmp> cat j.txt
> > テスト
> > 환경설
> > 전검색
> > 웹문서
> > 국방비
> > 북한의
> > てすと
> > ian@linux:~/tmp> uniq  j.txt
> > テスト
> > 환경설
> > てすと
> >
> > All but the first and last lines are random Korean (Hangul)
> > characters. Evidently our respective locales think all Hangul strings
> > of the same length are identical, which is very probably not the
> > case...
>
> Does this go away if you change your locale to C?

Yes.

Ian Barwick

Re: UTF-8 and LIKE vs =

From
David Wheeler
Date:
On Aug 23, 2004, at 5:07 PM, Ian Barwick wrote:

>> Does this go away if you change your locale to C?
>
> Yes.

Hallelujah! I'm running initdb again now.

Cheers,

David

Attachment

Re: UTF-8 and LIKE vs =

From
Tatsuo Ishii
Date:
> > 
> > В Пнд, 23.08.2004, в 23:04, David Wheeler пишет:
> > > On Aug 23, 2004, at 1:58 PM, Ian Barwick wrote:
> > >
> > > > er, the characters in "name" don't seem to match the characters in the
> > > > query - '국방비' vs. '북한의' - does that have any bearing?
> > >
> > > Yes, it means that = is doing the wrong thing!!
> > 
> > The collation rules of your (and my) locale say that these strings are
> > the same:
> > 
> > [markus@teetnang markus]$ cat > t
> > 국방비
> > 북한의
> > [markus@teetnang markus]$ uniq t
> > 국방비
> > [markus@teetnang markus]$
> 
> wild speculation in need of a Korean speaker, but:
> 
> ian@linux:~/tmp> cat j.txt
> テスト
> 환경설
> 전검색
> 웹문서
> 국방비
> 북한의
> てすと
> ian@linux:~/tmp> uniq  j.txt
> テスト
> 환경설
> てすと
> 
> All but the first and last lines are random Korean (Hangul)
> characters. Evidently our respective locales think all Hangul strings
> of the same length are identical, which is very probably not the
> case...

Locales for multibyte encodings are often broken on many platforms. I
see identical things with Japanese on Red Hat. This is one of the
reason why I tell Japanese PostgreSQL users not to enable locale while
initdb...
--
Tatsuo Ishii

Re: UTF-8 and LIKE vs =

From
David Wheeler
Date:
On Aug 23, 2004, at 5:22 PM, Tatsuo Ishii wrote:

> Locales for multibyte encodings are often broken on many platforms. I
> see identical things with Japanese on Red Hat. This is one of the
> reason why I tell Japanese PostgreSQL users not to enable locale while
> initdb...

Yep, and exporting my data, deleting the data directory, running initdb
with --locale=C fixd the problem for me. Woot!

But given what you've said, Tatsuo, it makes me wonder if it's worth it
to use the system locale default when running initdb? Maybe it'd make
more sense for PostgreSQL to default to C unless someone specifies
another --locale?

Thanks for the help, everyone,

David

Attachment

Re: UTF-8 and LIKE vs =

From
Tim Allen
Date:
Tom Lane wrote:
> David Wheeler <david@kineticode.com> writes:
>>bric=3D# reindex index udx_keyword__name;
>>REINDEX
>>bric=3D# select * from keyword where name =3D'=BA=CF=C7=D1=C0=C7';
>>   id  |  name  | screen_name | sort_name | active
>>------+--------+-------------+-----------+--------
>>  1218 | =B1=B9=B9=E6=BA=F1 | =B1=B9=B9=E6=BA=F1      | =B1=B9=B9=E6=BA=F1=
>>    |      1
>>(1 row)
>
>
> Hmm.  I tried putting your string into a UNICODE database and I got
> ERROR:  invalid byte sequence for encoding "UNICODE": 0xc7
>
> So there's something funny happening here.  What is your client_encoding
> setting?
>
>             regards, tom lane

One possible clue: your original post in this thread was using encoding
euc-kr, not unicode (utf-8). If your mailer was set to use that
encoding, perhaps your other client software is/was also?

Just a guess...

Tim

Re: UTF-8 and LIKE vs =

From
David Wheeler
Date:
On Aug 23, 2004, at 6:49 PM, Tim Allen wrote:

> One possible clue: your original post in this thread was using
> encoding euc-kr, not unicode (utf-8). If your mailer was set to use
> that encoding, perhaps your other client software is/was also?

Bah! Stupid Mail.app was trying to be too smart!

Thanks,

David


Attachment

Re: UTF-8 and LIKE vs =

From
Joel
Date:
On Tue, 24 Aug 2004 01:34:46 +0200
Ian Barwick <barwick@gmail.com> wrote

> ...
> wild speculation in need of a Korean speaker, but:
>
> ian@linux:~/tmp> cat j.txt
> 繝㋚せ繝\x88
> 紇俾イス牕、
> 琊⁇イ\x80lラ
> 珖ケ橖ク牕\x9C
> 弶ュ𣝣ゥ欄\x84
> 櫤≶復珣\x98
> 縺ヲ縺吶→
> ian@linux:~/tmp> uniq  j.txt
> 繝㋚せ繝\x88
> 紇俾イス牕、
> 縺ヲ縺吶→
>
> All but the first and last lines are random Korean (Hangul)
> characters. Evidently our respective locales think all Hangul strings
> of the same length are identical, which is very probably not the
> case...

My browser just nicely botched replying on those, but looking at Ian's
post, the first and last lines looked like "test" written in Japanese,
the first line in katakana and the last line in hiragana.

The following should end up posted as shift-JIS, but

テスト
and
てすと

should collate the same under some contexts, since it's more-or-less
equivalent to a variation in case.

--
Joel <rees@ddcom.co.jp>


Re: UTF-8 and LIKE vs =

From
David Wheeler
Date:
On Aug 23, 2004, at 6:58 PM, Joel wrote:

> The following should end up posted as shift-JIS, but
>
> テスト
> and
> てすと
>
> should collate the same under some contexts, since it's more-or-less
> equivalent to a variation in case.

Do you mean if you search for

   SELECT * FROM keyword where name = 'テスト';

You might expect it to return a row with name set to 'てすと'?

Regards,

David

Attachment

Re: UTF-8 and LIKE vs =

From
Joel
Date:
> > The following should end up posted as shift-JIS, but
> >
> > テスト
> > and
> > てすと
> >
> > should collate the same under some contexts, since it's more-or-less
> > equivalent to a variation in case.
>
> Do you mean if you search for
>
>    SELECT * FROM keyword where name = 'テスト';
>
> You might expect it to return a row with name set to 'てすと'?

If the locale machinery iw functioning correctly (and if I understand
correctly), there ought to be a setting that would allow those to
collate to the same point.

I'm not sure what the settings would be, or if it's fully funtional yet.
Maybe Tatsuo would know? (Hope he doesn't mind me CC-ing him.)

So, what was the original language and text of the queries you started
this thread with?

--
Joel <rees@ddcom.co.jp>


Re: UTF-8 and LIKE vs =

From
Tatsuo Ishii
Date:
> > > The following should end up posted as shift-JIS, but
> > >
> > > テスト
> > > and
> > > てすと
> > >
> > > should collate the same under some contexts, since it's more-or-less
> > > equivalent to a variation in case.
> >
> > Do you mean if you search for
> >
> >    SELECT * FROM keyword where name = 'テスト';
> >
> > You might expect it to return a row with name set to 'てすと'?
>
> If the locale machinery iw functioning correctly (and if I understand
> correctly), there ought to be a setting that would allow those to
> collate to the same point.
>
> I'm not sure what the settings would be, or if it's fully funtional yet.
> Maybe Tatsuo would know? (Hope he doesn't mind me CC-ing him.)

I'm not sure what the point of the original question was, but I know
that

    SELECT * FROM keyword where name = 'テスト';

should work if locale = C and encoding is either EUC_JP or UTF-8 on my
standard Linux box (of course I did initdb --no-locale).
--
Tatsuo Ishii

Re: UTF-8 and LIKE vs =

From
Joel
Date:
Just want to check this --

> > > > The following should end up posted as shift-JIS, but
> > > >
> > > > テスト
> > > > and
> > > > てすと
> > > >
> > > > should collate the same under some contexts, since it's more-or-less
> > > > equivalent to a variation in case.
> > >
> > > Do you mean if you search for
> > >
> > >    SELECT * FROM keyword where name = 'テスト';
> > >
> > > You might expect it to return a row with name set to 'てすと'?
> >
> > If the locale machinery iw functioning correctly (and if I understand
> > correctly), there ought to be a setting that would allow those to
> > collate to the same point.
> >
> > I'm not sure what the settings would be, or if it's fully funtional yet.
> > Maybe Tatsuo would know? (Hope he doesn't mind me CC-ing him.)
>
> I'm not sure what the point of the original question was, but I know
> that
>
>     SELECT * FROM keyword where name = 'テスト';
>
> should work if locale = C and encoding is either EUC_JP or UTF-8 on my
> standard Linux box (of course I did initdb --no-locale).

(query is katakana)

    SELECT * FROM keyword where name = 'テスト';


returns a record whose name field is 'てすと' (hiragana)?

Doesn't this require some LC_COLLATE flag?

(I suppose I should try this myself.)

--
Joel <rees@ddcom.co.jp>


Re: UTF-8 and LIKE vs =

From
Tatsuo Ishii
Date:
> Just want to check this --
>
> > > > > The following should end up posted as shift-JIS, but
> > > > >
> > > > > テスト
> > > > > and
> > > > > てすと
> > > > >
> > > > > should collate the same under some contexts, since it's more-or-less
> > > > > equivalent to a variation in case.
> > > >
> > > > Do you mean if you search for
> > > >
> > > >    SELECT * FROM keyword where name = 'テスト';
> > > >
> > > > You might expect it to return a row with name set to 'てすと'?
> > >
> > > If the locale machinery iw functioning correctly (and if I understand
> > > correctly), there ought to be a setting that would allow those to
> > > collate to the same point.
> > >
> > > I'm not sure what the settings would be, or if it's fully funtional yet.
> > > Maybe Tatsuo would know? (Hope he doesn't mind me CC-ing him.)
> >
> > I'm not sure what the point of the original question was, but I know
> > that
> >
> >     SELECT * FROM keyword where name = 'テスト';
> >
> > should work if locale = C and encoding is either EUC_JP or UTF-8 on my
> > standard Linux box (of course I did initdb --no-locale).
>
> (query is katakana)
>
>     SELECT * FROM keyword where name = 'テスト';
>
>
> returns a record whose name field is 'てすと' (hiragana)?
>
> Doesn't this require some LC_COLLATE flag?

No.

> (I suppose I should try this myself.)

Actually there's a similar regression test case in
src/test/mb/sql/euc_jp.sql(expected result is
src/test/mb/expected/euc_jp.sql).
--
Tatsuo Ishii

Re: UTF-8 and LIKE vs =

From
Tatsuo Ishii
Date:
> > Locales for multibyte encodings are often broken on many platforms. I
> > see identical things with Japanese on Red Hat. This is one of the
> > reason why I tell Japanese PostgreSQL users not to enable locale while
> > initdb...
>
> Yep, and exporting my data, deleting the data directory, running initdb
> with --locale=C fixd the problem for me. Woot!
>
> But given what you've said, Tatsuo, it makes me wonder if it's worth it
> to use the system locale default when running initdb? Maybe it'd make
> more sense for PostgreSQL to default to C unless someone specifies
> another --locale?

You are quite right. initdb defaults to use the system locale is
evil. I have been saying this for long time...
--
Tatsuo Ishii

Re: UTF-8 and LIKE vs =

From
Peter Eisentraut
Date:
David Wheeler wrote:
> But given what you've said, Tatsuo, it makes me wonder if it's worth
> it to use the system locale default when running initdb?

Yes, because that is the locale that the user prefers.  If a locale is
broken then you shouldn't set it as system locale in the first place.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/


Re: UTF-8 and LIKE vs =

From
David Wheeler
Date:
On Aug 23, 2004, at 10:25 PM, Joel wrote:

> If the locale machinery iw functioning correctly (and if I understand
> correctly), there ought to be a setting that would allow those to
> collate to the same point.

Bleh. There must be some distinction between them. It sounds like
querying for synonyms.

> I'm not sure what the settings would be, or if it's fully funtional
> yet.
> Maybe Tatsuo would know? (Hope he doesn't mind me CC-ing him.)
>
> So, what was the original language and text of the queries you started
> this thread with?

Korean, but the database I was using has data in 10 different languages
in it, making any locale-specific collation useless.

Regards,

David

Attachment

Re: UTF-8 and LIKE vs =

From
David Wheeler
Date:
On Aug 23, 2004, at 11:13 PM, Tatsuo Ishii wrote:

> I'm not sure what the point of the original question was, but I know
> that
>
>     SELECT * FROM keyword where name = 'テスト';
>
> should work if locale = C and encoding is either EUC_JP or UTF-8 on my
> standard Linux box (of course I did initdb --no-locale).

Yes, that's what I would want, and after rebuilding with --locale=C,
it's what works. Is --no-locale the same as --locale=C, BTW?

Regards,

David

Attachment

Re: UTF-8 and LIKE vs =

From
David Wheeler
Date:
On Aug 24, 2004, at 9:13 AM, Peter Eisentraut wrote:

>> But given what you've said, Tatsuo, it makes me wonder if it's worth
>> it to use the system locale default when running initdb?
>
> Yes, because that is the locale that the user prefers.  If a locale is
> broken then you shouldn't set it as system locale in the first place.

I disagree. IME, it is seldom the locale the user prefers, and nearly
always the locale set by the OS vendor.

I am in agreement with Tatsuo on this one.

Regards,

David

Attachment

Re: UTF-8 and LIKE vs =

From
Peter Eisentraut
Date:
David Wheeler wrote:
> I disagree. IME, it is seldom the locale the user prefers, and nearly
> always the locale set by the OS vendor.

Not in my experience.  The OS does not pick the locale out of the blue.
If your vendor does not respect your preferences, choose a new vendor.

Going back to no locale is not going to happen.  We want to offer
correct and rich semantics to our users, not cripple the system for
some people who cannot configure their operation system correctly.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/


Re: UTF-8 and LIKE vs =

From
David Wheeler
Date:
On Aug 24, 2004, at 11:29 AM, Peter Eisentraut wrote:

> Not in my experience.  The OS does not pick the locale out of the blue.
> If your vendor does not respect your preferences, choose a new vendor.

/me kicks RedHat.

> Going back to no locale is not going to happen.  We want to offer
> correct and rich semantics to our users, not cripple the system for
> some people who cannot configure their operation system correctly.

That's not the trouble so much as that the locales can be badly broken,
and that they're useless for multilingual use.

Regards,

David

Attachment

Re: UTF-8 and LIKE vs =

From
Peter Eisentraut
Date:
David Wheeler wrote:
> That's not the trouble so much as that the locales can be badly

If we always followed the principle "X could be broken, so let's not use
X", then we would never get anything done.  Instead, "X is broken, so
fix it".

> broken, and that they're useless for multilingual use.

I don't agree with that, but perhaps we differ in our interpretation of
"multilingual use".  If you have special requirements, you can always
turn the locales off.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/


Re: UTF-8 and LIKE vs =

From
David Wheeler
Date:
On Aug 24, 2004, at 12:20 PM, Peter Eisentraut wrote:

>> broken, and that they're useless for multilingual use.
>
> I don't agree with that, but perhaps we differ in our interpretation of
> "multilingual use".  If you have special requirements, you can always
> turn the locales off.

Well, we're getting beyond my realm of knowledge here, so I'll leave it
to Tatsuo to articulate better than I can. In the meantime, turning
locales off is exactly what I will do from here on in.

Regards,

David

Attachment

Re: UTF-8 and LIKE vs =

From
Tom Lane
Date:
David Wheeler <david@kineticode.com> writes:
>>> Hmm.  I tried putting your string into a UNICODE database and I got
>>> ERROR:  invalid byte sequence for encoding "UNICODE": 0xc7
>>
>> Really? Curious.

> Oh, are you sure that you got my UTF-8 data? Because it came back in
> your reply all mangled.

I deliberately left it half-MIME-ified so that the hex byte values would
be visible.  One of the variables we have not controlled for is whether
what you posted is the same as what other people received --- Tim
Allen's point about the claimed encoding of the message is very relevant
here.  Lots of mail software will try to translate encodings.

            regards, tom lane

Re: UTF-8 and LIKE vs =

From
Tatsuo Ishii
Date:
> David Wheeler wrote:
> > That's not the trouble so much as that the locales can be badly
>
> If we always followed the principle "X could be broken, so let's not use
> X", then we would never get anything done.  Instead, "X is broken, so
> fix it".

It's not a fair example IMO. We have many counter examples in our
source code itself which are trying to workaround verndors "minor"
bugs. Probably the point is X is important for almost everyone, while
the locale collation is not so for most of Linux uses (and thus for
vendors).  Unfortunately our voices are so small and does not reach to
verndors...

> > broken, and that they're useless for multilingual use.
>
> I don't agree with that, but perhaps we differ in our interpretation of
> "multilingual use".  If you have special requirements, you can always
> turn the locales off.

And they are forced to run initdb...
--
Tatsuo Ishii

Re: UTF-8 and LIKE vs =

From
Joel
Date:
On Tue, 24 Aug 2004 David Wheeler wrote

> On Aug 23, 2004, at 10:25 PM, Joel wrote:
>
> > If the locale machinery iw functioning correctly (and if I understand
> > correctly), there ought to be a setting that would allow those to
> > collate to the same point.
>
> Bleh. There must be some distinction between them. It sounds like
> querying for synonyms.

Sorry, I guess I wasn't clear. It's almost directly corollary to
ignoring case. (Almost, but not quite.) Kind of like querying with "TEST"
and getting records with "test".

> > I'm not sure what the settings would be, or if it's fully funtional
> > yet.
> > Maybe Tatsuo would know? (Hope he doesn't mind me CC-ing him.)
> >
> > So, what was the original language and text of the queries you started
> > this thread with?
>
> Korean, but the database I was using has data in 10 different languages
> in it, making any locale-specific collation useless.

Which means that the correct solution now is beyond the question of what
did the vendor think you would want. No way the vendor is going to be
willing to think you really want that, in spite of the number of people
who want precisely that. We don't yet have a mechanism that works for
general multiple language solutions.

At this point in time, "no locale" is probably close to best practice,
but it leaves you a lot of work.

--
Joel <rees@ddcom.co.jp>


Re: UTF-8 and LIKE vs =

From
Joel
Date:
Peter Eisentraut wrote
> David Wheeler wrote:
> > That's not the trouble so much as that the locales can be badly
>
> If we always followed the principle "X could be broken, so let's not use
> X", then we would never get anything done.  Instead, "X is broken, so
> fix it".

You want to talk my employer into giving me a month (full time) to build
some foundation open source tools, with the understanding that when
that's done maybe we'll know enough to actually be able to pin down a
schedule for fixing this one?

> > broken, and that they're useless for multilingual use.
>
> I don't agree with that, but perhaps we differ in our interpretation of
> "multilingual use".

Yes. There is a huge difference between incidental use of CJKV encodings
in a mostly Latin-based language database and using incidental Latin
characters in a mostly CJKV database.

> If you have special requirements, you can always
> turn the locales off.

Wish it were that simple, but that's where we apparently have to start,
for now.

--
Joel <rees@ddcom.co.jp>


Re: UTF-8 and LIKE vs =

From
Joel
Date:
> > Just want to check this --
> >
> > > > > > The following should end up posted as shift-JIS, but
> > > > > >
> > > > > > テスト
> > > > > > and
> > > > > > てすと
> > > > > >
> > > > > > should collate the same under some contexts, since it's more-or-less
> > > > > > equivalent to a variation in case.
> > > > >
> > > > > Do you mean if you search for
> > > > >
> > > > >    SELECT * FROM keyword where name = 'テスト';
> > > > >
> > > > > You might expect it to return a row with name set to 'てすと'?
> > > >
> > > > If the locale machinery iw functioning correctly (and if I understand
> > > > correctly), there ought to be a setting that would allow those to
> > > > collate to the same point.
> > > >
> > > > I'm not sure what the settings would be, or if it's fully funtional yet.
> > > > Maybe Tatsuo would know? (Hope he doesn't mind me CC-ing him.)
> > >
> > > I'm not sure what the point of the original question was, but I know
> > > that
> > >
> > >     SELECT * FROM keyword where name = 'テスト';
> > >
> > > should work if locale = C and encoding is either EUC_JP or UTF-8 on my
> > > standard Linux box (of course I did initdb --no-locale).
> >
> > (query is katakana)
> >
> >     SELECT * FROM keyword where name = 'テスト';
> >
> >
> > returns a record whose name field is 'てすと' (hiragana)?
> >
> > Doesn't this require some LC_COLLATE flag?
>
> No.

Ick. No way to turn it off, I suppose?

> > (I suppose I should try this myself.)
>
> Actually there's a similar regression test case in
> src/test/mb/sql/euc_jp.sql(expected result is
> src/test/mb/expected/euc_jp.sql).

--
Joel <rees@ddcom.co.jp>


Re: UTF-8 and LIKE vs =

From
Tatsuo Ishii
Date:
> > > I'm not sure what the settings would be, or if it's fully funtional
> > > yet.
> > > Maybe Tatsuo would know? (Hope he doesn't mind me CC-ing him.)
> > >
> > > So, what was the original language and text of the queries you started
> > > this thread with?
> >
> > Korean, but the database I was using has data in 10 different languages
> > in it, making any locale-specific collation useless.

I don't know exactly what kind of encodings you wish to use, but I
think MULE_INTERNAL might help you. It's actually mixture of various
encodings with encoding-prefix added to each letter.  For example if
you can mix KS5601 (Korean) and Japanese Kanji (JIS X0208) in a *same*
column. So if you don't have problem with sorting KS5601/JIS X0208
with C locale, you should not have problem with MULE_INTERNAL too in
theory.

Remaining problem is how to display the Korean-Japanese mixed string
in your client, but this is not PostgreSQL's problem, of
course. However you could write your own conversion function
MULE_INTERNAL <--> UTF-8, and might be able to solve the problem.
--
Tatsuo Ishii

Re: UTF-8 and LIKE vs =

From
Lincoln Yeoh
Date:
At 09:20 PM 8/24/2004 +0200, Peter Eisentraut wrote:

>David Wheeler wrote:
> > That's not the trouble so much as that the locales can be badly
>
>If we always followed the principle "X could be broken, so let's not use
>X", then we would never get anything done.  Instead, "X is broken, so
>fix it".
>
> > broken, and that they're useless for multilingual use.
>
>I don't agree with that, but perhaps we differ in our interpretation of
>"multilingual use".  If you have special requirements, you can always
>turn the locales off.

I think we've been through this before more than a year ago (or even earlier).

See: "default locale considered harmful"

IMO I suggested the default to be C, and I still think that's the best
default. But of course that's just my opinion.

What would be useful would be functions to allow selects etc to be ordered
as if under different query specifiable locales.

Example scenario would be an internationalized webmail application.
Depending on each user preferences, you'd have a different sort order for
their messages/addressbook.

In this case which locale should you pick for initdb? I'd say C.

In most environments where people aren't bothering about locale, C does
fine (and is likely to perform better). In environments where locales
matter having one often isn't enough.

In which case would picking the O/S locale as default be useful? Would
picking C be worse for the user in this case compared to if the user was
expecting C, and got the O/S locale instead?

Cheerio,
Link.

Re: UTF-8 and LIKE vs =

From
David Wheeler
Date:
On Aug 23, 2004, at 7:47 PM, Tom Lane wrote:

>> Oh, are you sure that you got my UTF-8 data? Because it came back in
>> your reply all mangled.
>
> I deliberately left it half-MIME-ified so that the hex byte values
> would
> be visible.  One of the variables we have not controlled for is whether
> what you posted is the same as what other people received --- Tim
> Allen's point about the claimed encoding of the message is very
> relevant
> here.  Lots of mail software will try to translate encodings.

Yes, my mail client changed the UTF-8 to EUC-KR (I'm going to find out
how to make it send everything in UTF-8). But my examples were working
as I described in the terminal with psql (and I do have my term set to
use utf-8), and the problem was solved by initdb --locale=C.

And now that I know to just use C, I won't worry about it again.

Cheers,

David


Attachment

Re: UTF-8 and LIKE vs =

From
David Wheeler
Date:
On Aug 24, 2004, at 6:58 PM, Joel wrote:

> Sorry, I guess I wasn't clear. It's almost directly corollary to
> ignoring case. (Almost, but not quite.) Kind of like querying with
> "TEST"
> and getting records with "test".

How interesting!

> Which means that the correct solution now is beyond the question of
> what
> did the vendor think you would want. No way the vendor is going to be
> willing to think you really want that, in spite of the number of people
> who want precisely that. We don't yet have a mechanism that works for
> general multiple language solutions.

Right, except for "no locale" (is C locale the same thing?).

> At this point in time, "no locale" is probably close to best practice,
> but it leaves you a lot of work.

It does?

Regards,

David


Re: UTF-8 and LIKE vs =

From
David Wheeler
Date:
On Aug 24, 2004, at 7:21 PM, Tatsuo Ishii wrote:

> I don't know exactly what kind of encodings you wish to use, but I
> think MULE_INTERNAL might help you. It's actually mixture of various
> encodings with encoding-prefix added to each letter.  For example if
> you can mix KS5601 (Korean) and Japanese Kanji (JIS X0208) in a *same*
> column. So if you don't have problem with sorting KS5601/JIS X0208
> with C locale, you should not have problem with MULE_INTERNAL too in
> theory.

MULE_INTERNAL? Is that a locale?

> Remaining problem is how to display the Korean-Japanese mixed string
> in your client, but this is not PostgreSQL's problem, of
> course. However you could write your own conversion function
> MULE_INTERNAL <--> UTF-8, and might be able to solve the problem.

Well, the strings aren't usually mixed, but there can be different
languages in different rows. For example, I have a customer with a
single Bricolage instance in which they manage content in all of the
following languages:

   Burmese
   Cantonese
   English
   Korean
   Khmer
   Lao
   Mandarin
   Tibetan
   Uyghur
   Vietnamese

And they might do a search that returns results in more than one of
these languages.

Regards,

David


Re: UTF-8 and LIKE vs =

From
David Wheeler
Date:
On Aug 25, 2004, at 7:35 AM, Lincoln Yeoh wrote:

> IMO I suggested the default to be C, and I still think that's the best
> default. But of course that's just my opinion.

I agree.

> What would be useful would be functions to allow selects etc to be
> ordered as if under different query specifiable locales.

Wouldn't that require different indexes for the different locales?
(Where there are indexes, of course.)

> In this case which locale should you pick for initdb? I'd say C.

I did. Is that the same as --no-locale?

Regards,

David


Re: UTF-8 and LIKE vs =

From
Lincoln Yeoh
Date:
At 09:41 AM 8/26/2004 -0700, David Wheeler wrote:

>>What would be useful would be functions to allow selects etc to be
>>ordered as if under different query specifiable locales.
>
>Wouldn't that require different indexes for the different locales? (Where
>there are indexes, of course.)

That's optional I think - depends on amount of data selected and how
selective the indexes would be.

It should be just like any other index on a function- like having an index
on lower(text).

You have index on locale('korean',text) and so on.

Not really sure how to do that tho :). Anyone have ideas?

Should be possible right?

Link.

Re: UTF-8 and LIKE vs =

From
Joel
Date:
On Thu, 26 Aug 2004 09:36:20 -0700
David Wheeler <david@kineticode.com> wrote

> On Aug 24, 2004, at 6:58 PM, Joel wrote:
>
> > Sorry, I guess I wasn't clear. It's almost directly corollary to
> > ignoring case. (Almost, but not quite.) Kind of like querying with
> > "TEST"
> > and getting records with "test".
>
> How interesting!
>
> > Which means that the correct solution now is beyond the question of
> > what
> > did the vendor think you would want. No way the vendor is going to be
> > willing to think you really want that, in spite of the number of people
> > who want precisely that. We don't yet have a mechanism that works for
> > general multiple language solutions.
>
> Right, except for "no locale" (is C locale the same thing?).

Nobody's answered this yet, but I'll stick my neck out and so I don't
remember. But it should be hidden in the manual somewhere.

(I guess I should say my memories are conflicting, which means I should
look it up myself some time.)

> > At this point in time, "no locale" is probably close to best practice,
> > but it leaves you a lot of work.
>
> It does?

Well, there are just a lot of things that you want to be sort of
automatic that you end up having to do by hand.

--
Joel <rees@ddcom.co.jp>


Re: UTF-8 and LIKE vs =

From
David Wheeler
Date:
On Aug 26, 2004, at 7:51 PM, Joel wrote:

>>> At this point in time, "no locale" is probably close to best
>>> practice,
>>> but it leaves you a lot of work.
>>
>> It does?
>
> Well, there are just a lot of things that you want to be sort of
> automatic that you end up having to do by hand.

Such as? It's not a rhetorical question; I honestly don't know. For me,
C locale simply solved my problem.

Regards,

David


Re: UTF-8 and LIKE vs =

From
Joel
Date:
> >>> At this point in time, "no locale" is probably close to best
> >>> practice,
> >>> but it leaves you a lot of work.
> >>
> >> It does?
> >
> > Well, there are just a lot of things that you want to be sort of
> > automatic that you end up having to do by hand.
>
> Such as? It's not a rhetorical question; I honestly don't know. For me,
> C locale simply solved my problem.

I would expect to run into problems with collation. In that case, you
may end up setting up separate databases for each language, as I
mentioned before in the mail that I forgot to post to the list so people
could correct me if I'm wrong.

Other than that, it depends on what functions the database will have.

If what is being done with the CJKT is pretty basic stuff, I may be just
another too-pessimistic voice.

--
Joel <rees@ddcom.co.jp>


Re: UTF-8 and LIKE vs =

From
David Wheeler
Date:
On Aug 27, 2004, at 5:27 AM, Joel wrote:

> I would expect to run into problems with collation. In that case, you
> may end up setting up separate databases for each language, as I
> mentioned before in the mail that I forgot to post to the list so
> people
> could correct me if I'm wrong.

As far as I know, collation is essentially how an index is ordered,
correct? So that when I so an "ORDER BY" query, the order in which the
rows are returned is determined by the collation. Is that correct?

If so, then I'm happy with the 80% solution of defaulting to Unicode
ordering (or "Unicodabetical").

> Other than that, it depends on what functions the database will have.
>
> If what is being done with the CJKT is pretty basic stuff, I may be
> just
> another too-pessimistic voice.

Frankly, I'm more concerned with the ability of queries to work than I
am of ordering results. Ordering is strictly secondary.

Regards,

David


Re: UTF-8 and LIKE vs =

From
Michael Glaesemann
Date:
On Aug 31, 2004, at 9:16 AM, David Wheeler wrote:

> As far as I know, collation is essentially how an index is ordered,
> correct? So that when I so an "ORDER BY" query, the order in which the
> rows are returned is determined by the collation. Is that correct?

Just a wee thing, and perhaps not the point of the post, but collation
is not important only for indexes. (Actually, I wouldn't be surprised
if some indexes are not necessarily ordered, but that may be my
inexperience talking.) You don't need an index on column foo to be able
to ORDER BY foo.

Michael Glaesemann
grzm myrealbox com


Re: UTF-8 and LIKE vs =

From
David Wheeler
Date:
On Aug 30, 2004, at 5:34 PM, Michael Glaesemann wrote:

> Just a wee thing, and perhaps not the point of the post, but collation
> is not important only for indexes. (Actually, I wouldn't be surprised
> if some indexes are not necessarily ordered, but that may be my
> inexperience talking.) You don't need an index on column foo to be
> able to ORDER BY foo.

Yes, I know that. I guess I should've left mention of indexes out of
it. I was still thinking in terms of my original issue, where a query
for a particular string found no results when it should have because
the index was using the en_us.utf8 locale.

Regards,

David


Re: UTF-8 and LIKE vs =

From
Joel
Date:
On Mon, 30 Aug 2004 17:16:20 -0700 David Wheeler wrote

> On Aug 27, 2004, at 5:27 AM, Joel wrote:
>
> > I would expect to run into problems with collation. In that case, you
> > may end up setting up separate databases for each language, as I
> > mentioned before in the mail that I forgot to post to the list so
> > people
> > could correct me if I'm wrong.
>
> As far as I know, collation is essentially how an index is ordered,

Collation can be used when setting up an index (as Michael points out).

> correct? So that when I so an "ORDER BY" query, the order in which the
> rows are returned is determined by the collation. Is that correct?

Anything that is related to sort order will be effected by collation,
and it is sometimes surprising what is related to sort order. (Sorry to
be vague, it's been a while.)

I have no experience with Korean. All I know is by hearsay.

In Japanese, key fields will often be doubled. This is so that both the
Kanji (ideographic) and kana (pronunciation) can be indexed. Kanji are
not considered to have inherent (standard) ordering in most applications,
and space really isn't (usually) a delimiter. So straight kana order is
(usually) sufficient for the kana field, and codepoint order is usually
sufficient for the Kanji.

Some functions will require some special handling for the kana. One of
the issues is from legacy 8-bit katakana only encodings. Another is
derived from what we would call compositing problems.

There are occasions when codepoint ordering for the Kanji will produce
counter-intuitive results. This is because the natural orderings do
exist (however ambiguously) and both the traditional JIS lists and the
Unicode lists break up the Chinese ideographs in groups that cut cross
sections out of the natural orderings.

What I've heard of Korean, you may run into similar issues even though
legacy should not be so much of a problem.

> If so, then I'm happy with the 80% solution of defaulting to Unicode
> ordering (or "Unicodabetical").

Not knowing what your app is, it would be hard to say how far down the
road it will be before you hit problems with this. (Or if you will.) It
sounds like the only way for you to find out is to put it into
production and ask for feedback.

If you want to get a head start on something, you might want to look
into making or finding custom collation tables. (Maybe.)

> > Other than that, it depends on what functions the database will have.
> >
> > If what is being done with the CJKT is pretty basic stuff, I may be
> > just
> > another too-pessimistic voice.
>
> Frankly, I'm more concerned with the ability of queries to work than I
> am of ordering results. Ordering is strictly secondary.

For now, that's probably correct. But I've given you about a half of a
heads-up on it. Have fun.

--
Joel <rees@ddcom.co.jp>