Thread: UTF-8 and LIKE vs =
Hi All, I'm having some trouble with multibyte characters and LIKE. We've been using LIKE instead of = for string queries for a long time, as it gives us flexibility to use wildcards such as "%" when we need to and get the same results as with = by not using them. But I've just found that it sometimes doesn't work properly: bric=# select version(); version ------------------------------------------------------------------------ --------------------------------- PostgreSQL 7.4.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.2.2 20030222 (Red Hat Linux 3.2.2-5) (1 row) bric=# select * from keyword where name = '북한의'; id | name | screen_name | sort_name | active ------+--------+-------------+-----------+-------- 1218 | 국방비 | 국방비 | 국방비 | 1 (1 row) bric=# select * from keyword where name LIKE '북한의'; id | name | screen_name | sort_name | active ----+------+-------------+-----------+-------- (0 rows) bric=# select * from keyword where name ~ '^북한의'; id | name | screen_name | sort_name | active ----+------+-------------+-----------+-------- (0 rows) Any idea why = works here and LIKE and ~ wouldn't? TIA, David
Attachment
David Wheeler wrote: > Any idea why = works here and LIKE and ~ wouldn't? Because LIKE does a character-by-character matching and = uses the operating system locale, which could do anything. If you set the locale to C, you should get matching results. Which one is "better" depends on the semantics of the language, which I cannot judge here. -- Peter Eisentraut http://developer.postgresql.org/~petere/
On Aug 23, 2004, at 1:22 PM, Peter Eisentraut wrote: > Because LIKE does a character-by-character matching and = uses the > operating system locale, which could do anything. If you set the > locale to C, you should get matching results. Which one is "better" > depends on the semantics of the language, which I cannot judge here. Thanks. So I need to set the locale to C and then LIKE will work properly? How do I go about doing that? I can see these options: LC_COLLATE String sort order LC_CTYPE Character classification (What is a letter? The upper-case equivalent?) LC_MESSAGES Language of messages LC_MONETARY Formatting of currency amounts LC_NUMERIC Formatting of numbers LC_TIME Formatting of dates and times Is one of these the one I need to set? Thanks, David
Attachment
On Mon, 23 Aug 2004 12:41:30 -0700, David Wheeler <david@kineticode.com> wrote: (...) > bric=# select version(); > version > ------------------------------------------------------------------------ > --------------------------------- > PostgreSQL 7.4.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.2.2 > 20030222 (Red Hat Linux 3.2.2-5) > (1 row) > > bric=# select * from keyword where name = '북한의'; > id | name | screen_name | sort_name | active > ------+--------+-------------+-----------+-------- > 1218 | 국방비 | 국방비 | 국방비 | 1 > (1 row) er, the characters in "name" don't seem to match the characters in the query - '국방비' vs. '북한의' - does that have any bearing? FWIW (on 7.4.3): test=# select * from t1 ; id | value ----+-------- 1 | 日本 2 | 日本語 3 | 北海道 (3 rows) test=# select * from t1 where value ~ '日'; id | value ----+-------- 1 | 日本 2 | 日本語 (2 rows) test=# select * from t1 where value like '日%'; id | value ----+-------- 1 | 日本 2 | 日本語 test=# select * from t1 where value like '北海%'; id | value ----+-------- 3 | 北海道 (1 row) Ian Barwick barwick@gmail.net
On Aug 23, 2004, at 1:58 PM, Ian Barwick wrote: > er, the characters in "name" don't seem to match the characters in the > query - '국방비' vs. '북한의' - does that have any bearing? Yes, it means that = is doing the wrong thing!! I noticed this because I had a query that was looking in the keyword table for an existing record using LIKE. If it didn't find it, it inserted it. But the inserts were giving me an error because the name column has a UNIQUE index on it. Could it be that the index and the = operator are comparing bytes, and that '국방비' and '북한의' have the same bytes but different characters?? If so, this is a pretty serious problem. How can I get = and the indices to use character semantics rather than byte semantics? I also need to be able to store data in different languages in the database (and in the same column!), but all in Unicode. TIA, David
Attachment
On Mon, 23 Aug 2004 14:04:05 -0700, David Wheeler <david@kineticode.com> wrote: > On Aug 23, 2004, at 1:58 PM, Ian Barwick wrote: > > > er, the characters in "name" don't seem to match the characters in the > > query - '국방비' vs. '북한의' - does that have any bearing? > > Yes, it means that = is doing the wrong thing!! > > I noticed this because I had a query that was looking in the keyword > table for an existing record using LIKE. If it didn't find it, it > inserted it. But the inserts were giving me an error because the name > column has a UNIQUE index on it. Could it be that the index and the = > operator are comparing bytes, and that '국방비' and '북한의' have the same > bytes but different characters?? > > If so, this is a pretty serious problem. How can I get = and the > indices to use character semantics rather than byte semantics? I also > need to be able to store data in different languages in the database > (and in the same column!), but all in Unicode. I don't know what the problem is, but you might want to check the client encoding settings, and the encoding your characters are arriving in (remembering all the time, in Postgres "UNICODE" really means UTF-8). If you're using Perl (I'm guessing this is Bricolage-related) the "_utf8_on"-ness of strings might be worth checking too, and also the "pg_enable_utf8" flag in DBD::Pg. Ian Barwick barwick@gmail.net
On Mon, 2004-08-23 at 16:43, David Wheeler wrote: > On Aug 23, 2004, at 1:22 PM, Peter Eisentraut wrote: > > > Because LIKE does a character-by-character matching and = uses the > > operating system locale, which could do anything. If you set the > > locale to C, you should get matching results. Which one is "better" > > depends on the semantics of the language, which I cannot judge here. > > Thanks. So I need to set the locale to C and then LIKE will work > properly? How do I go about doing that? I can see these options: > > LC_COLLATE > String sort order > > LC_CTYPE > Character classification (What is a letter? The upper-case equivalent?) > > LC_MESSAGES > Language of messages > > LC_MONETARY > Formatting of currency amounts > > LC_NUMERIC > Formatting of numbers > > LC_TIME > Formatting of dates and times > > Is one of these the one I need to set? > initdb is your friend. (well, not really, but that's where your headed) Robert Treat -- Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL
On Aug 23, 2004, at 2:25 PM, Ian Barwick wrote: > I don't know what the problem is, but you might want to check the > client encoding settings, and the encoding your characters are > arriving in (remembering all the time, in Postgres "UNICODE" really > means UTF-8). > > If you're using Perl (I'm guessing this is Bricolage-related) the > "_utf8_on"-ness of strings might be worth checking too, and also the > "pg_enable_utf8" flag in DBD::Pg. Bricolage is getting all its content at UTF-8. It has been working beautifully for some time. I tried setting the utf8 flag on the variable passed to the query, but it made no difference. I think that LIKE is doing the right thing, and = is not. And I need to find out how to get = to do the right thing. If I need to dump my database and run initdb to use C for LC_COLLATE, the, feh, I will. Right now I have: LC_COLLATE: en_US.UTF-8 LC_CTYPE: en_US.UTF-8 Regards, David
Attachment
On Aug 23, 2004, at 2:31 PM, Robert Treat wrote: > initdb is your friend. (well, not really, but that's where your headed) Yes, that's what I'm beginning to suspect. Cheers, David
Attachment
On Aug 23, 2004, at 2:46 PM, David Wheeler wrote: > I think that LIKE is doing the right thing, and = is not. And I need > to find out how to get = to do the right thing. If I need to dump my > database and run initdb to use C for LC_COLLATE, the, feh, I will. > Right now I have: > > LC_COLLATE: en_US.UTF-8 > LC_CTYPE: en_US.UTF-8 And testing on another box with these set to "C", it seems to work properly. *sigh* Nothing like dumping and restoring a 2.7 GB database to keep me engaged in what I'm doing, eh? Regards, David
Attachment
David Wheeler <david@kineticode.com> writes: > On Aug 23, 2004, at 1:58 PM, Ian Barwick wrote: >> er, the characters in "name" don't seem to match the characters in the >> query - '=B1=B9=B9=E6=BA=F1' vs. '=BA=CF=C7=D1=C0=C7' - does that have an= > y bearing? > Yes, it means that = is doing the wrong thing!! I have seen this happen in situations where the strings contained character sequences that were illegal according to the encoding that the locale thought was in force. (It seems that strcoll() will return more or less random results in such cases...) In particular, given that you have > LC_COLLATE: en_US.UTF-8 > LC_CTYPE: en_US.UTF-8 you are at risk if the data is not legal UTF-8 strings. The real question therefore is whether you have the database encoding set correctly --- ie, is it UNICODE (== UTF8)? If not then it may well be that Postgres is presenting strings to strcoll() that the latter will choke on. regards, tom lane
В Пнд, 23.08.2004, в 23:04, David Wheeler пишет: > On Aug 23, 2004, at 1:58 PM, Ian Barwick wrote: > > > er, the characters in "name" don't seem to match the characters in the > > query - '국방비' vs. '북한의' - does that have any bearing? > > Yes, it means that = is doing the wrong thing!! The collation rules of your (and my) locale say that these strings are the same: [markus@teetnang markus]$ cat > t 국방비 북한의 [markus@teetnang markus]$ uniq t 국방비 [markus@teetnang markus]$ Make sure that you have initdb'd the database under the right locale. There's not much PostgreSQL can do if strcoll() says that the strings are equal. -- Markus Bertheau <twanger@bluetwanger.de>
On Aug 23, 2004, at 3:44 PM, Tom Lane wrote: >> Yes, it means that = is doing the wrong thing!! > > I have seen this happen in situations where the strings contained > character sequences that were illegal according to the encoding that > the > locale thought was in force. (It seems that strcoll() will return more > or less random results in such cases...) In particular, given that you > have > >> LC_COLLATE: en_US.UTF-8 >> LC_CTYPE: en_US.UTF-8 > > you are at risk if the data is not legal UTF-8 strings. But is it possible to store non-UTF-8 data in a UNICODE database? > The real question therefore is whether you have the database encoding > set correctly --- ie, is it UNICODE (== UTF8)? If not then it may well > be that Postgres is presenting strings to strcoll() that the latter > will > choke on. The database is UNICODE. $ psql -U postgres -l List of databases Name | Owner | Encoding -----------+----------+----------- bric | postgres | UNICODE template0 | postgres | SQL_ASCII template1 | postgres | SQL_ASCII (3 rows) I plan to dump it, run initdb with LC_COLLATE and LC_CTYPE both set to "C", and restore the database and see if that helps. Thanks, David
Attachment
On Aug 23, 2004, at 3:46 PM, Markus Bertheau wrote: > The collation rules of your (and my) locale say that these strings are > the same: > > [markus@teetnang markus]$ cat > t > 국방비 > 북한의 > [markus@teetnang markus]$ uniq t > 국방비 > [markus@teetnang markus]$ Interesting. > Make sure that you have initdb'd the database under the right locale. > There's not much PostgreSQL can do if strcoll() says that the strings > are equal. Well, I have data from a number of different locales in the same database. I'm hoping that setting the locale to "C" will do the trick. It seems to work properly on my Mac: sharky=# select * from keyword where name = '국방비'; id | name | screen_name | sort_name | active ----+--------+-------------+-----------+-------- 0 | 국방비 | 국방비 | 국방비 | 1 (1 row) sharky=# select * from keyword where name = '북한의'; id | name | screen_name | sort_name | active ----+------+-------------+-----------+-------- (0 rows) sharky=# select * from keyword where name like '북한의'; id | name | screen_name | sort_name | active ----+------+-------------+-----------+-------- (0 rows) sharky=# select * from keyword where lower(name) like '국방비'; id | name | screen_name | sort_name | active ----+--------+-------------+-----------+-------- 0 | 국방비 | 국방비 | 국방비 | 1 (1 row) Regards, David
Attachment
David Wheeler <david@kineticode.com> writes: > But is it possible to store non-UTF-8 data in a UNICODE database? In theory not ... but I think there was a discussion earlier that concluded that our check for encoding validity is not airtight ... regards, tom lane
On Aug 23, 2004, at 3:59 PM, Tom Lane wrote: >> But is it possible to store non-UTF-8 data in a UNICODE database? > > In theory not ... but I think there was a discussion earlier that > concluded that our check for encoding validity is not airtight ... Well, it it was mostly right, I wouldn't expect it to be a problem as much as this issue is coming up for me. If, OTOH, the encoding validity check leaks like a sieve, then I might indeed have a bigger problem. Is the encoding check fixed in 8.0beta1? Thanks, David
Attachment
David Wheeler <david@kineticode.com> writes: > Is the encoding check fixed in 8.0beta1? [ looks back at discussion... ] Actually I misremembered --- the discussion was about how we would *reject* legal UTF-8 codes that are more than 2 bytes long. So the code is broken, but not in the direction that would cause your problem. Time for another theory. Is the problem query using an index? If so, does REINDEX help? regards, tom lane
On Aug 23, 2004, at 4:08 PM, Tom Lane wrote: > [ looks back at discussion... ] Actually I misremembered --- the > discussion was about how we would *reject* legal UTF-8 codes that are > more than 2 bytes long. So the code is broken, but not in the > direction > that would cause your problem. Time for another theory. Whew! > Is the problem query using an index? If so, does REINDEX help? Doesn't look like it: bric=# reindex index udx_keyword__name; REINDEX bric=# select * from keyword where name ='북한의'; id | name | screen_name | sort_name | active ------+--------+-------------+-----------+-------- 1218 | 국방비 | 국방비 | 국방비 | 1 (1 row) That's still giving me an invalid row for the value I passed to it (note that the value of the "name" column is different than the value I queried for). Regards, David
Attachment
On Tue, 24 Aug 2004 00:46:50 +0200, Markus Bertheau <twanger@bluetwanger.de> wrote: > > > В Пнд, 23.08.2004, в 23:04, David Wheeler пишет: > > On Aug 23, 2004, at 1:58 PM, Ian Barwick wrote: > > > > > er, the characters in "name" don't seem to match the characters in the > > > query - '국방비' vs. '북한의' - does that have any bearing? > > > > Yes, it means that = is doing the wrong thing!! > > The collation rules of your (and my) locale say that these strings are > the same: > > [markus@teetnang markus]$ cat > t > 국방비 > 북한의 > [markus@teetnang markus]$ uniq t > 국방비 > [markus@teetnang markus]$ wild speculation in need of a Korean speaker, but: ian@linux:~/tmp> cat j.txt テスト 환경설 전검색 웹문서 국방비 북한의 てすと ian@linux:~/tmp> uniq j.txt テスト 환경설 てすと All but the first and last lines are random Korean (Hangul) characters. Evidently our respective locales think all Hangul strings of the same length are identical, which is very probably not the case... Ian Barwick
David Wheeler <david@kineticode.com> writes: >> Is the problem query using an index? If so, does REINDEX help? > Doesn't look like it: > bric=3D# reindex index udx_keyword__name; > REINDEX > bric=3D# select * from keyword where name =3D'=BA=CF=C7=D1=C0=C7'; > id | name | screen_name | sort_name | active > ------+--------+-------------+-----------+-------- > 1218 | =B1=B9=B9=E6=BA=F1 | =B1=B9=B9=E6=BA=F1 | =B1=B9=B9=E6=BA=F1= > | 1 > (1 row) Hmm. I tried putting your string into a UNICODE database and I got ERROR: invalid byte sequence for encoding "UNICODE": 0xc7 So there's something funny happening here. What is your client_encoding setting? regards, tom lane
On Aug 23, 2004, at 4:35 PM, Tom Lane wrote: > Hmm. I tried putting your string into a UNICODE database and I got > ERROR: invalid byte sequence for encoding "UNICODE": 0xc7 Really? Curious. > So there's something funny happening here. What is your > client_encoding > setting? It's not set. I've had it commented out for quite some time: # $ENV{PGCLIENTENCODING} = 'UNICODE'; # Should default to this, anyway. # $ENV{PGSERVERENCODING} = 'UNICODE'; # Should default to this, anyway. Regards, David
Attachment
On Aug 23, 2004, at 4:34 PM, Ian Barwick wrote: > wild speculation in need of a Korean speaker, but: > > ian@linux:~/tmp> cat j.txt > テスト > 환경설 > 전검색 > 웹문서 > 국방비 > 북한의 > てすと > ian@linux:~/tmp> uniq j.txt > テスト > 환경설 > てすと > > All but the first and last lines are random Korean (Hangul) > characters. Evidently our respective locales think all Hangul strings > of the same length are identical, which is very probably not the > case... Does this go away if you change your locale to C? Regards, David
Attachment
On Aug 23, 2004, at 4:49 PM, David Wheeler wrote: >> Hmm. I tried putting your string into a UNICODE database and I got >> ERROR: invalid byte sequence for encoding "UNICODE": 0xc7 > > Really? Curious. Oh, are you sure that you got my UTF-8 data? Because it came back in your reply all mangled. Cheers, David
Attachment
On Mon, 23 Aug 2004 16:50:04 -0700, David Wheeler <david@kineticode.com> wrote: > On Aug 23, 2004, at 4:34 PM, Ian Barwick wrote: > > > wild speculation in need of a Korean speaker, but: > > > > ian@linux:~/tmp> cat j.txt > > テスト > > 환경설 > > 전검색 > > 웹문서 > > 국방비 > > 북한의 > > てすと > > ian@linux:~/tmp> uniq j.txt > > テスト > > 환경설 > > てすと > > > > All but the first and last lines are random Korean (Hangul) > > characters. Evidently our respective locales think all Hangul strings > > of the same length are identical, which is very probably not the > > case... > > Does this go away if you change your locale to C? Yes. Ian Barwick
On Aug 23, 2004, at 5:07 PM, Ian Barwick wrote: >> Does this go away if you change your locale to C? > > Yes. Hallelujah! I'm running initdb again now. Cheers, David
Attachment
> > > > В Пнд, 23.08.2004, в 23:04, David Wheeler пишет: > > > On Aug 23, 2004, at 1:58 PM, Ian Barwick wrote: > > > > > > > er, the characters in "name" don't seem to match the characters in the > > > > query - '국방비' vs. '북한의' - does that have any bearing? > > > > > > Yes, it means that = is doing the wrong thing!! > > > > The collation rules of your (and my) locale say that these strings are > > the same: > > > > [markus@teetnang markus]$ cat > t > > 국방비 > > 북한의 > > [markus@teetnang markus]$ uniq t > > 국방비 > > [markus@teetnang markus]$ > > wild speculation in need of a Korean speaker, but: > > ian@linux:~/tmp> cat j.txt > テスト > 환경설 > 전검색 > 웹문서 > 국방비 > 북한의 > てすと > ian@linux:~/tmp> uniq j.txt > テスト > 환경설 > てすと > > All but the first and last lines are random Korean (Hangul) > characters. Evidently our respective locales think all Hangul strings > of the same length are identical, which is very probably not the > case... Locales for multibyte encodings are often broken on many platforms. I see identical things with Japanese on Red Hat. This is one of the reason why I tell Japanese PostgreSQL users not to enable locale while initdb... -- Tatsuo Ishii
On Aug 23, 2004, at 5:22 PM, Tatsuo Ishii wrote: > Locales for multibyte encodings are often broken on many platforms. I > see identical things with Japanese on Red Hat. This is one of the > reason why I tell Japanese PostgreSQL users not to enable locale while > initdb... Yep, and exporting my data, deleting the data directory, running initdb with --locale=C fixd the problem for me. Woot! But given what you've said, Tatsuo, it makes me wonder if it's worth it to use the system locale default when running initdb? Maybe it'd make more sense for PostgreSQL to default to C unless someone specifies another --locale? Thanks for the help, everyone, David
Attachment
Tom Lane wrote: > David Wheeler <david@kineticode.com> writes: >>bric=3D# reindex index udx_keyword__name; >>REINDEX >>bric=3D# select * from keyword where name =3D'=BA=CF=C7=D1=C0=C7'; >> id | name | screen_name | sort_name | active >>------+--------+-------------+-----------+-------- >> 1218 | =B1=B9=B9=E6=BA=F1 | =B1=B9=B9=E6=BA=F1 | =B1=B9=B9=E6=BA=F1= >> | 1 >>(1 row) > > > Hmm. I tried putting your string into a UNICODE database and I got > ERROR: invalid byte sequence for encoding "UNICODE": 0xc7 > > So there's something funny happening here. What is your client_encoding > setting? > > regards, tom lane One possible clue: your original post in this thread was using encoding euc-kr, not unicode (utf-8). If your mailer was set to use that encoding, perhaps your other client software is/was also? Just a guess... Tim
On Aug 23, 2004, at 6:49 PM, Tim Allen wrote: > One possible clue: your original post in this thread was using > encoding euc-kr, not unicode (utf-8). If your mailer was set to use > that encoding, perhaps your other client software is/was also? Bah! Stupid Mail.app was trying to be too smart! Thanks, David
Attachment
On Tue, 24 Aug 2004 01:34:46 +0200 Ian Barwick <barwick@gmail.com> wrote > ... > wild speculation in need of a Korean speaker, but: > > ian@linux:~/tmp> cat j.txt > 繝㋚せ繝\x88 > 紇俾イス牕、 > 琊⁇イ\x80lラ > 珖ケ橖ク牕\x9C > 弶ュ𣝣ゥ欄\x84 > 櫤≶復珣\x98 > 縺ヲ縺吶→ > ian@linux:~/tmp> uniq j.txt > 繝㋚せ繝\x88 > 紇俾イス牕、 > 縺ヲ縺吶→ > > All but the first and last lines are random Korean (Hangul) > characters. Evidently our respective locales think all Hangul strings > of the same length are identical, which is very probably not the > case... My browser just nicely botched replying on those, but looking at Ian's post, the first and last lines looked like "test" written in Japanese, the first line in katakana and the last line in hiragana. The following should end up posted as shift-JIS, but テスト and てすと should collate the same under some contexts, since it's more-or-less equivalent to a variation in case. -- Joel <rees@ddcom.co.jp>
On Aug 23, 2004, at 6:58 PM, Joel wrote: > The following should end up posted as shift-JIS, but > > テスト > and > てすと > > should collate the same under some contexts, since it's more-or-less > equivalent to a variation in case. Do you mean if you search for SELECT * FROM keyword where name = 'テスト'; You might expect it to return a row with name set to 'てすと'? Regards, David
Attachment
> > The following should end up posted as shift-JIS, but > > > > テスト > > and > > てすと > > > > should collate the same under some contexts, since it's more-or-less > > equivalent to a variation in case. > > Do you mean if you search for > > SELECT * FROM keyword where name = 'テスト'; > > You might expect it to return a row with name set to 'てすと'? If the locale machinery iw functioning correctly (and if I understand correctly), there ought to be a setting that would allow those to collate to the same point. I'm not sure what the settings would be, or if it's fully funtional yet. Maybe Tatsuo would know? (Hope he doesn't mind me CC-ing him.) So, what was the original language and text of the queries you started this thread with? -- Joel <rees@ddcom.co.jp>
> > > The following should end up posted as shift-JIS, but > > > > > > テスト > > > and > > > てすと > > > > > > should collate the same under some contexts, since it's more-or-less > > > equivalent to a variation in case. > > > > Do you mean if you search for > > > > SELECT * FROM keyword where name = 'テスト'; > > > > You might expect it to return a row with name set to 'てすと'? > > If the locale machinery iw functioning correctly (and if I understand > correctly), there ought to be a setting that would allow those to > collate to the same point. > > I'm not sure what the settings would be, or if it's fully funtional yet. > Maybe Tatsuo would know? (Hope he doesn't mind me CC-ing him.) I'm not sure what the point of the original question was, but I know that SELECT * FROM keyword where name = 'テスト'; should work if locale = C and encoding is either EUC_JP or UTF-8 on my standard Linux box (of course I did initdb --no-locale). -- Tatsuo Ishii
Just want to check this -- > > > > The following should end up posted as shift-JIS, but > > > > > > > > テスト > > > > and > > > > てすと > > > > > > > > should collate the same under some contexts, since it's more-or-less > > > > equivalent to a variation in case. > > > > > > Do you mean if you search for > > > > > > SELECT * FROM keyword where name = 'テスト'; > > > > > > You might expect it to return a row with name set to 'てすと'? > > > > If the locale machinery iw functioning correctly (and if I understand > > correctly), there ought to be a setting that would allow those to > > collate to the same point. > > > > I'm not sure what the settings would be, or if it's fully funtional yet. > > Maybe Tatsuo would know? (Hope he doesn't mind me CC-ing him.) > > I'm not sure what the point of the original question was, but I know > that > > SELECT * FROM keyword where name = 'テスト'; > > should work if locale = C and encoding is either EUC_JP or UTF-8 on my > standard Linux box (of course I did initdb --no-locale). (query is katakana) SELECT * FROM keyword where name = 'テスト'; returns a record whose name field is 'てすと' (hiragana)? Doesn't this require some LC_COLLATE flag? (I suppose I should try this myself.) -- Joel <rees@ddcom.co.jp>
> Just want to check this -- > > > > > > The following should end up posted as shift-JIS, but > > > > > > > > > > テスト > > > > > and > > > > > てすと > > > > > > > > > > should collate the same under some contexts, since it's more-or-less > > > > > equivalent to a variation in case. > > > > > > > > Do you mean if you search for > > > > > > > > SELECT * FROM keyword where name = 'テスト'; > > > > > > > > You might expect it to return a row with name set to 'てすと'? > > > > > > If the locale machinery iw functioning correctly (and if I understand > > > correctly), there ought to be a setting that would allow those to > > > collate to the same point. > > > > > > I'm not sure what the settings would be, or if it's fully funtional yet. > > > Maybe Tatsuo would know? (Hope he doesn't mind me CC-ing him.) > > > > I'm not sure what the point of the original question was, but I know > > that > > > > SELECT * FROM keyword where name = 'テスト'; > > > > should work if locale = C and encoding is either EUC_JP or UTF-8 on my > > standard Linux box (of course I did initdb --no-locale). > > (query is katakana) > > SELECT * FROM keyword where name = 'テスト'; > > > returns a record whose name field is 'てすと' (hiragana)? > > Doesn't this require some LC_COLLATE flag? No. > (I suppose I should try this myself.) Actually there's a similar regression test case in src/test/mb/sql/euc_jp.sql(expected result is src/test/mb/expected/euc_jp.sql). -- Tatsuo Ishii
> > Locales for multibyte encodings are often broken on many platforms. I > > see identical things with Japanese on Red Hat. This is one of the > > reason why I tell Japanese PostgreSQL users not to enable locale while > > initdb... > > Yep, and exporting my data, deleting the data directory, running initdb > with --locale=C fixd the problem for me. Woot! > > But given what you've said, Tatsuo, it makes me wonder if it's worth it > to use the system locale default when running initdb? Maybe it'd make > more sense for PostgreSQL to default to C unless someone specifies > another --locale? You are quite right. initdb defaults to use the system locale is evil. I have been saying this for long time... -- Tatsuo Ishii
David Wheeler wrote: > But given what you've said, Tatsuo, it makes me wonder if it's worth > it to use the system locale default when running initdb? Yes, because that is the locale that the user prefers. If a locale is broken then you shouldn't set it as system locale in the first place. -- Peter Eisentraut http://developer.postgresql.org/~petere/
On Aug 23, 2004, at 10:25 PM, Joel wrote: > If the locale machinery iw functioning correctly (and if I understand > correctly), there ought to be a setting that would allow those to > collate to the same point. Bleh. There must be some distinction between them. It sounds like querying for synonyms. > I'm not sure what the settings would be, or if it's fully funtional > yet. > Maybe Tatsuo would know? (Hope he doesn't mind me CC-ing him.) > > So, what was the original language and text of the queries you started > this thread with? Korean, but the database I was using has data in 10 different languages in it, making any locale-specific collation useless. Regards, David
Attachment
On Aug 23, 2004, at 11:13 PM, Tatsuo Ishii wrote: > I'm not sure what the point of the original question was, but I know > that > > SELECT * FROM keyword where name = 'テスト'; > > should work if locale = C and encoding is either EUC_JP or UTF-8 on my > standard Linux box (of course I did initdb --no-locale). Yes, that's what I would want, and after rebuilding with --locale=C, it's what works. Is --no-locale the same as --locale=C, BTW? Regards, David
Attachment
On Aug 24, 2004, at 9:13 AM, Peter Eisentraut wrote: >> But given what you've said, Tatsuo, it makes me wonder if it's worth >> it to use the system locale default when running initdb? > > Yes, because that is the locale that the user prefers. If a locale is > broken then you shouldn't set it as system locale in the first place. I disagree. IME, it is seldom the locale the user prefers, and nearly always the locale set by the OS vendor. I am in agreement with Tatsuo on this one. Regards, David
Attachment
David Wheeler wrote: > I disagree. IME, it is seldom the locale the user prefers, and nearly > always the locale set by the OS vendor. Not in my experience. The OS does not pick the locale out of the blue. If your vendor does not respect your preferences, choose a new vendor. Going back to no locale is not going to happen. We want to offer correct and rich semantics to our users, not cripple the system for some people who cannot configure their operation system correctly. -- Peter Eisentraut http://developer.postgresql.org/~petere/
On Aug 24, 2004, at 11:29 AM, Peter Eisentraut wrote: > Not in my experience. The OS does not pick the locale out of the blue. > If your vendor does not respect your preferences, choose a new vendor. /me kicks RedHat. > Going back to no locale is not going to happen. We want to offer > correct and rich semantics to our users, not cripple the system for > some people who cannot configure their operation system correctly. That's not the trouble so much as that the locales can be badly broken, and that they're useless for multilingual use. Regards, David
Attachment
David Wheeler wrote: > That's not the trouble so much as that the locales can be badly If we always followed the principle "X could be broken, so let's not use X", then we would never get anything done. Instead, "X is broken, so fix it". > broken, and that they're useless for multilingual use. I don't agree with that, but perhaps we differ in our interpretation of "multilingual use". If you have special requirements, you can always turn the locales off. -- Peter Eisentraut http://developer.postgresql.org/~petere/
On Aug 24, 2004, at 12:20 PM, Peter Eisentraut wrote: >> broken, and that they're useless for multilingual use. > > I don't agree with that, but perhaps we differ in our interpretation of > "multilingual use". If you have special requirements, you can always > turn the locales off. Well, we're getting beyond my realm of knowledge here, so I'll leave it to Tatsuo to articulate better than I can. In the meantime, turning locales off is exactly what I will do from here on in. Regards, David
Attachment
David Wheeler <david@kineticode.com> writes: >>> Hmm. I tried putting your string into a UNICODE database and I got >>> ERROR: invalid byte sequence for encoding "UNICODE": 0xc7 >> >> Really? Curious. > Oh, are you sure that you got my UTF-8 data? Because it came back in > your reply all mangled. I deliberately left it half-MIME-ified so that the hex byte values would be visible. One of the variables we have not controlled for is whether what you posted is the same as what other people received --- Tim Allen's point about the claimed encoding of the message is very relevant here. Lots of mail software will try to translate encodings. regards, tom lane
> David Wheeler wrote: > > That's not the trouble so much as that the locales can be badly > > If we always followed the principle "X could be broken, so let's not use > X", then we would never get anything done. Instead, "X is broken, so > fix it". It's not a fair example IMO. We have many counter examples in our source code itself which are trying to workaround verndors "minor" bugs. Probably the point is X is important for almost everyone, while the locale collation is not so for most of Linux uses (and thus for vendors). Unfortunately our voices are so small and does not reach to verndors... > > broken, and that they're useless for multilingual use. > > I don't agree with that, but perhaps we differ in our interpretation of > "multilingual use". If you have special requirements, you can always > turn the locales off. And they are forced to run initdb... -- Tatsuo Ishii
On Tue, 24 Aug 2004 David Wheeler wrote > On Aug 23, 2004, at 10:25 PM, Joel wrote: > > > If the locale machinery iw functioning correctly (and if I understand > > correctly), there ought to be a setting that would allow those to > > collate to the same point. > > Bleh. There must be some distinction between them. It sounds like > querying for synonyms. Sorry, I guess I wasn't clear. It's almost directly corollary to ignoring case. (Almost, but not quite.) Kind of like querying with "TEST" and getting records with "test". > > I'm not sure what the settings would be, or if it's fully funtional > > yet. > > Maybe Tatsuo would know? (Hope he doesn't mind me CC-ing him.) > > > > So, what was the original language and text of the queries you started > > this thread with? > > Korean, but the database I was using has data in 10 different languages > in it, making any locale-specific collation useless. Which means that the correct solution now is beyond the question of what did the vendor think you would want. No way the vendor is going to be willing to think you really want that, in spite of the number of people who want precisely that. We don't yet have a mechanism that works for general multiple language solutions. At this point in time, "no locale" is probably close to best practice, but it leaves you a lot of work. -- Joel <rees@ddcom.co.jp>
Peter Eisentraut wrote > David Wheeler wrote: > > That's not the trouble so much as that the locales can be badly > > If we always followed the principle "X could be broken, so let's not use > X", then we would never get anything done. Instead, "X is broken, so > fix it". You want to talk my employer into giving me a month (full time) to build some foundation open source tools, with the understanding that when that's done maybe we'll know enough to actually be able to pin down a schedule for fixing this one? > > broken, and that they're useless for multilingual use. > > I don't agree with that, but perhaps we differ in our interpretation of > "multilingual use". Yes. There is a huge difference between incidental use of CJKV encodings in a mostly Latin-based language database and using incidental Latin characters in a mostly CJKV database. > If you have special requirements, you can always > turn the locales off. Wish it were that simple, but that's where we apparently have to start, for now. -- Joel <rees@ddcom.co.jp>
> > Just want to check this -- > > > > > > > > The following should end up posted as shift-JIS, but > > > > > > > > > > > > テスト > > > > > > and > > > > > > てすと > > > > > > > > > > > > should collate the same under some contexts, since it's more-or-less > > > > > > equivalent to a variation in case. > > > > > > > > > > Do you mean if you search for > > > > > > > > > > SELECT * FROM keyword where name = 'テスト'; > > > > > > > > > > You might expect it to return a row with name set to 'てすと'? > > > > > > > > If the locale machinery iw functioning correctly (and if I understand > > > > correctly), there ought to be a setting that would allow those to > > > > collate to the same point. > > > > > > > > I'm not sure what the settings would be, or if it's fully funtional yet. > > > > Maybe Tatsuo would know? (Hope he doesn't mind me CC-ing him.) > > > > > > I'm not sure what the point of the original question was, but I know > > > that > > > > > > SELECT * FROM keyword where name = 'テスト'; > > > > > > should work if locale = C and encoding is either EUC_JP or UTF-8 on my > > > standard Linux box (of course I did initdb --no-locale). > > > > (query is katakana) > > > > SELECT * FROM keyword where name = 'テスト'; > > > > > > returns a record whose name field is 'てすと' (hiragana)? > > > > Doesn't this require some LC_COLLATE flag? > > No. Ick. No way to turn it off, I suppose? > > (I suppose I should try this myself.) > > Actually there's a similar regression test case in > src/test/mb/sql/euc_jp.sql(expected result is > src/test/mb/expected/euc_jp.sql). -- Joel <rees@ddcom.co.jp>
> > > I'm not sure what the settings would be, or if it's fully funtional > > > yet. > > > Maybe Tatsuo would know? (Hope he doesn't mind me CC-ing him.) > > > > > > So, what was the original language and text of the queries you started > > > this thread with? > > > > Korean, but the database I was using has data in 10 different languages > > in it, making any locale-specific collation useless. I don't know exactly what kind of encodings you wish to use, but I think MULE_INTERNAL might help you. It's actually mixture of various encodings with encoding-prefix added to each letter. For example if you can mix KS5601 (Korean) and Japanese Kanji (JIS X0208) in a *same* column. So if you don't have problem with sorting KS5601/JIS X0208 with C locale, you should not have problem with MULE_INTERNAL too in theory. Remaining problem is how to display the Korean-Japanese mixed string in your client, but this is not PostgreSQL's problem, of course. However you could write your own conversion function MULE_INTERNAL <--> UTF-8, and might be able to solve the problem. -- Tatsuo Ishii
At 09:20 PM 8/24/2004 +0200, Peter Eisentraut wrote: >David Wheeler wrote: > > That's not the trouble so much as that the locales can be badly > >If we always followed the principle "X could be broken, so let's not use >X", then we would never get anything done. Instead, "X is broken, so >fix it". > > > broken, and that they're useless for multilingual use. > >I don't agree with that, but perhaps we differ in our interpretation of >"multilingual use". If you have special requirements, you can always >turn the locales off. I think we've been through this before more than a year ago (or even earlier). See: "default locale considered harmful" IMO I suggested the default to be C, and I still think that's the best default. But of course that's just my opinion. What would be useful would be functions to allow selects etc to be ordered as if under different query specifiable locales. Example scenario would be an internationalized webmail application. Depending on each user preferences, you'd have a different sort order for their messages/addressbook. In this case which locale should you pick for initdb? I'd say C. In most environments where people aren't bothering about locale, C does fine (and is likely to perform better). In environments where locales matter having one often isn't enough. In which case would picking the O/S locale as default be useful? Would picking C be worse for the user in this case compared to if the user was expecting C, and got the O/S locale instead? Cheerio, Link.
On Aug 23, 2004, at 7:47 PM, Tom Lane wrote: >> Oh, are you sure that you got my UTF-8 data? Because it came back in >> your reply all mangled. > > I deliberately left it half-MIME-ified so that the hex byte values > would > be visible. One of the variables we have not controlled for is whether > what you posted is the same as what other people received --- Tim > Allen's point about the claimed encoding of the message is very > relevant > here. Lots of mail software will try to translate encodings. Yes, my mail client changed the UTF-8 to EUC-KR (I'm going to find out how to make it send everything in UTF-8). But my examples were working as I described in the terminal with psql (and I do have my term set to use utf-8), and the problem was solved by initdb --locale=C. And now that I know to just use C, I won't worry about it again. Cheers, David
Attachment
On Aug 24, 2004, at 6:58 PM, Joel wrote: > Sorry, I guess I wasn't clear. It's almost directly corollary to > ignoring case. (Almost, but not quite.) Kind of like querying with > "TEST" > and getting records with "test". How interesting! > Which means that the correct solution now is beyond the question of > what > did the vendor think you would want. No way the vendor is going to be > willing to think you really want that, in spite of the number of people > who want precisely that. We don't yet have a mechanism that works for > general multiple language solutions. Right, except for "no locale" (is C locale the same thing?). > At this point in time, "no locale" is probably close to best practice, > but it leaves you a lot of work. It does? Regards, David
On Aug 24, 2004, at 7:21 PM, Tatsuo Ishii wrote: > I don't know exactly what kind of encodings you wish to use, but I > think MULE_INTERNAL might help you. It's actually mixture of various > encodings with encoding-prefix added to each letter. For example if > you can mix KS5601 (Korean) and Japanese Kanji (JIS X0208) in a *same* > column. So if you don't have problem with sorting KS5601/JIS X0208 > with C locale, you should not have problem with MULE_INTERNAL too in > theory. MULE_INTERNAL? Is that a locale? > Remaining problem is how to display the Korean-Japanese mixed string > in your client, but this is not PostgreSQL's problem, of > course. However you could write your own conversion function > MULE_INTERNAL <--> UTF-8, and might be able to solve the problem. Well, the strings aren't usually mixed, but there can be different languages in different rows. For example, I have a customer with a single Bricolage instance in which they manage content in all of the following languages: Burmese Cantonese English Korean Khmer Lao Mandarin Tibetan Uyghur Vietnamese And they might do a search that returns results in more than one of these languages. Regards, David
On Aug 25, 2004, at 7:35 AM, Lincoln Yeoh wrote: > IMO I suggested the default to be C, and I still think that's the best > default. But of course that's just my opinion. I agree. > What would be useful would be functions to allow selects etc to be > ordered as if under different query specifiable locales. Wouldn't that require different indexes for the different locales? (Where there are indexes, of course.) > In this case which locale should you pick for initdb? I'd say C. I did. Is that the same as --no-locale? Regards, David
At 09:41 AM 8/26/2004 -0700, David Wheeler wrote: >>What would be useful would be functions to allow selects etc to be >>ordered as if under different query specifiable locales. > >Wouldn't that require different indexes for the different locales? (Where >there are indexes, of course.) That's optional I think - depends on amount of data selected and how selective the indexes would be. It should be just like any other index on a function- like having an index on lower(text). You have index on locale('korean',text) and so on. Not really sure how to do that tho :). Anyone have ideas? Should be possible right? Link.
On Thu, 26 Aug 2004 09:36:20 -0700 David Wheeler <david@kineticode.com> wrote > On Aug 24, 2004, at 6:58 PM, Joel wrote: > > > Sorry, I guess I wasn't clear. It's almost directly corollary to > > ignoring case. (Almost, but not quite.) Kind of like querying with > > "TEST" > > and getting records with "test". > > How interesting! > > > Which means that the correct solution now is beyond the question of > > what > > did the vendor think you would want. No way the vendor is going to be > > willing to think you really want that, in spite of the number of people > > who want precisely that. We don't yet have a mechanism that works for > > general multiple language solutions. > > Right, except for "no locale" (is C locale the same thing?). Nobody's answered this yet, but I'll stick my neck out and so I don't remember. But it should be hidden in the manual somewhere. (I guess I should say my memories are conflicting, which means I should look it up myself some time.) > > At this point in time, "no locale" is probably close to best practice, > > but it leaves you a lot of work. > > It does? Well, there are just a lot of things that you want to be sort of automatic that you end up having to do by hand. -- Joel <rees@ddcom.co.jp>
On Aug 26, 2004, at 7:51 PM, Joel wrote: >>> At this point in time, "no locale" is probably close to best >>> practice, >>> but it leaves you a lot of work. >> >> It does? > > Well, there are just a lot of things that you want to be sort of > automatic that you end up having to do by hand. Such as? It's not a rhetorical question; I honestly don't know. For me, C locale simply solved my problem. Regards, David
> >>> At this point in time, "no locale" is probably close to best > >>> practice, > >>> but it leaves you a lot of work. > >> > >> It does? > > > > Well, there are just a lot of things that you want to be sort of > > automatic that you end up having to do by hand. > > Such as? It's not a rhetorical question; I honestly don't know. For me, > C locale simply solved my problem. I would expect to run into problems with collation. In that case, you may end up setting up separate databases for each language, as I mentioned before in the mail that I forgot to post to the list so people could correct me if I'm wrong. Other than that, it depends on what functions the database will have. If what is being done with the CJKT is pretty basic stuff, I may be just another too-pessimistic voice. -- Joel <rees@ddcom.co.jp>
On Aug 27, 2004, at 5:27 AM, Joel wrote: > I would expect to run into problems with collation. In that case, you > may end up setting up separate databases for each language, as I > mentioned before in the mail that I forgot to post to the list so > people > could correct me if I'm wrong. As far as I know, collation is essentially how an index is ordered, correct? So that when I so an "ORDER BY" query, the order in which the rows are returned is determined by the collation. Is that correct? If so, then I'm happy with the 80% solution of defaulting to Unicode ordering (or "Unicodabetical"). > Other than that, it depends on what functions the database will have. > > If what is being done with the CJKT is pretty basic stuff, I may be > just > another too-pessimistic voice. Frankly, I'm more concerned with the ability of queries to work than I am of ordering results. Ordering is strictly secondary. Regards, David
On Aug 31, 2004, at 9:16 AM, David Wheeler wrote: > As far as I know, collation is essentially how an index is ordered, > correct? So that when I so an "ORDER BY" query, the order in which the > rows are returned is determined by the collation. Is that correct? Just a wee thing, and perhaps not the point of the post, but collation is not important only for indexes. (Actually, I wouldn't be surprised if some indexes are not necessarily ordered, but that may be my inexperience talking.) You don't need an index on column foo to be able to ORDER BY foo. Michael Glaesemann grzm myrealbox com
On Aug 30, 2004, at 5:34 PM, Michael Glaesemann wrote: > Just a wee thing, and perhaps not the point of the post, but collation > is not important only for indexes. (Actually, I wouldn't be surprised > if some indexes are not necessarily ordered, but that may be my > inexperience talking.) You don't need an index on column foo to be > able to ORDER BY foo. Yes, I know that. I guess I should've left mention of indexes out of it. I was still thinking in terms of my original issue, where a query for a particular string found no results when it should have because the index was using the en_us.utf8 locale. Regards, David
On Mon, 30 Aug 2004 17:16:20 -0700 David Wheeler wrote > On Aug 27, 2004, at 5:27 AM, Joel wrote: > > > I would expect to run into problems with collation. In that case, you > > may end up setting up separate databases for each language, as I > > mentioned before in the mail that I forgot to post to the list so > > people > > could correct me if I'm wrong. > > As far as I know, collation is essentially how an index is ordered, Collation can be used when setting up an index (as Michael points out). > correct? So that when I so an "ORDER BY" query, the order in which the > rows are returned is determined by the collation. Is that correct? Anything that is related to sort order will be effected by collation, and it is sometimes surprising what is related to sort order. (Sorry to be vague, it's been a while.) I have no experience with Korean. All I know is by hearsay. In Japanese, key fields will often be doubled. This is so that both the Kanji (ideographic) and kana (pronunciation) can be indexed. Kanji are not considered to have inherent (standard) ordering in most applications, and space really isn't (usually) a delimiter. So straight kana order is (usually) sufficient for the kana field, and codepoint order is usually sufficient for the Kanji. Some functions will require some special handling for the kana. One of the issues is from legacy 8-bit katakana only encodings. Another is derived from what we would call compositing problems. There are occasions when codepoint ordering for the Kanji will produce counter-intuitive results. This is because the natural orderings do exist (however ambiguously) and both the traditional JIS lists and the Unicode lists break up the Chinese ideographs in groups that cut cross sections out of the natural orderings. What I've heard of Korean, you may run into similar issues even though legacy should not be so much of a problem. > If so, then I'm happy with the 80% solution of defaulting to Unicode > ordering (or "Unicodabetical"). Not knowing what your app is, it would be hard to say how far down the road it will be before you hit problems with this. (Or if you will.) It sounds like the only way for you to find out is to put it into production and ask for feedback. If you want to get a head start on something, you might want to look into making or finding custom collation tables. (Maybe.) > > Other than that, it depends on what functions the database will have. > > > > If what is being done with the CJKT is pretty basic stuff, I may be > > just > > another too-pessimistic voice. > > Frankly, I'm more concerned with the ability of queries to work than I > am of ordering results. Ordering is strictly secondary. For now, that's probably correct. But I've given you about a half of a heads-up on it. Have fun. -- Joel <rees@ddcom.co.jp>