Thread: another seemingly simple encoding question
maybe a routine question here ... .... i hope i can understand the answer. [postgres@www ~]$ pg_ctl --version pg_ctl (PostgreSQL) 8.0.0beta3 [postgres@www ~]$ i have a problem matching a utf8 string with a field in a database encoded in utf8. i read the documentation, checked the following, and don't know where i went astray, trying to match ... 1) i am almost 100% sure the data is correctly utf8. i just dumped and loaded into postgres. 2) utf8db -> \l List of databases Name | Owner | Encoding --------------+----------+----------- utf8db | postgres | UNICODE 3) postgresql.conf # These settings are initialized by initdb -- they might be changed lc_messages = 'en_US.utf8' # locale for system error message strings lc_monetary = 'en_US.utf8' # locale for monetary formatting lc_numeric = 'en_US.utf8' # locale for number formatting lc_time = 'en_US.utf8' # locale for time formatting # - Other Defaults - 4) set client encoding in client (psql or php, either one, both same mismatch) LOG: statement: select wordid,word from korean_english where word='기르 다' limit 10; LOG: statement: show client_encoding; LOG: statement: set client_encoding to 'utf8'; LOG: statement: select wordid,word from korean_english where word='기르 다' limit 10; LOG: statement: show client_encoding; 5) locale -a | grep en <snip> en_US.utf8 </snip> ohhh, where is my mistake, please!
> i have a problem matching a utf8 string with a field in a database > encoded in utf8. You seem to give all the details of your configuration, but unless I misread your message, you don't say what the actual problem is. Can you provide more details? What exactly doesn't work? This may not be the issue, but many people don't realize that there are sometimes multiple ways to encode what is conceptually the same string in UTF8 (or any of the Unicode encodings). If you do not canonicalize your strings using one of the Unicode normalization forms, then seemingly identical strings may not match, because they are not byte-for-byte identical. - John D. Burger MITRE
problem is that my string -- which is in utf-8 -- because all input is converted first in php with $str_out = mb_convert_encoding($str_in, "UTF-8"); and the query, which is like "select wordid from korean_english where word='utf8string'"; and it is returning wordids for words which are not = utf8string (in debug mode) the above is output as UTF-8 by php / browser encoding over the web, and then "exit;" is called, so i just grab it from the browser by cutting and pasting the whole query string. running the query in php and from psql return the same bad wordids, pointing that the encoding is consistent through the cut-and-paste operation. i don't understand what a "unicode normalization form" is. the postgres docs http://www.postgresql.org/docs/8.0/interactive/multibyte.html say Table 20-1. Server Character Sets Name Description UNICODE Unicode (UTF-8) so i thought they were the same, and i dont know about "unicode normalization form". my question is why isn't the utf8string in query returning only matching, corresponding wordids from the database.... thx. 2006-03-24 (금), 08:56 -0500, John D. Burger 쓰시길: > > i have a problem matching a utf8 string with a field in a database > > encoded in utf8. > > You seem to give all the details of your configuration, but unless I > misread your message, you don't say what the actual problem is. Can > you provide more details? What exactly doesn't work? > > This may not be the issue, but many people don't realize that there are > sometimes multiple ways to encode what is conceptually the same string > in UTF8 (or any of the Unicode encodings). If you do not canonicalize > your strings using one of the Unicode normalization forms, then > seemingly identical strings may not match, because they are not > byte-for-byte identical. > > - John D. Burger > MITRE >
This doesn't sound like your problem, but I'll explain the normalization issue using Korean as an example, since that seems to be your data: There are codepoints in Unicode both for Hangul and Jamo, so a Hangul glyph can be represented either with the single corresponding codepoint, or as two or three Jamo codepoints. A Unicode font would display these two alternatives identically. In any Unicode encoding, including UTF8, these two strings would not be byte-for-byte identical. The Unicode normalization forms are four algorithms for normalizing the strings in such a way that they do compare identically. Anyway, it sounds like you have the opposite problem, two strings that are comparing equal when you think they shouldn't. I don't know that anyone can help you unless you post an actual example of two such strings. - John D. Burger MITRE
this is a forward of my problem from April. I have this time gone all the way and re-inited a DB from scratch, created a new database, documented the import procedure, set the locale to match but I am still having problems. For example, look at this match count~ mod=# select count(*) from korean_english; count -------- 205323 (1 row) mod=# mod=# select count(*) from korean_english where word='안녕'; count ------- 40332 (1 row) mod=# \set VERSION = 'PostgreSQL 8.0.0beta3 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.3.3 20040412 (Red Hat Linux 3.3.3-7)' AUTOCOMMIT = 'on' VERBOSITY = 'default' DBNAME = 'mod' USER = 'postgres' PORT = '5432' ENCODING = 'UNICODE' PROMPT1 = '%/%R%# ' PROMPT2 = '%/%R%# ' PROMPT3 = '>> ' HISTSIZE = '500' mod=# I documented the import procedure and put it at http://www.myowndictionary.com/design.htm if there is anybody out there who has any idea, i would be very grateful for help. I have to move my database to postgres from mysql, and it has this big problem with the encoding. thank you . joseph. -------- Forwarded Message -------- 보낸 사람: joseph <kmh496@kornet.net> 받는 사람: pgsql-general@postgresql.org 제목: another seemingly simple encoding question 날짜: Fri, 24 Mar 2006 22:27:06 +0900 maybe a routine question here ... .... i hope i can understand the answer. [postgres@www ~]$ pg_ctl --version pg_ctl (PostgreSQL) 8.0.0beta3 [postgres@www ~]$ i have a problem matching a utf8 string with a field in a database encoded in utf8. i read the documentation, checked the following, and don't know where i went astray, trying to match ... 1) i am almost 100% sure the data is correctly utf8. i just dumped and loaded into postgres. 2) utf8db -> \l List of databases Name | Owner | Encoding --------------+----------+----------- utf8db | postgres | UNICODE 3) postgresql.conf # These settings are initialized by initdb -- they might be changed lc_messages = 'en_US.utf8' # locale for system error message strings lc_monetary = 'en_US.utf8' # locale for monetary formatting lc_numeric = 'en_US.utf8' # locale for number formatting lc_time = 'en_US.utf8' # locale for time formatting # - Other Defaults - 4) set client encoding in client (psql or php, either one, both same mismatch) LOG: statement: select wordid,word from korean_english where word='기르 다' limit 10; LOG: statement: show client_encoding; LOG: statement: set client_encoding to 'utf8'; LOG: statement: select wordid,word from korean_english where word='기르 다' limit 10; LOG: statement: show client_encoding; 5) locale -a | grep en <snip> en_US.utf8 </snip> ohhh, where is my mistake, please! -- my site <a href="http://www.myowndictionary.com">myowndictionary</a> was made to help students of many languages learn them faster.
kmh496 wrote: > this is a forward of my problem from April. > I have this time gone all the way and re-inited a DB from scratch, > created a new database, documented the import procedure, set the locale > to match but I am still having problems. > For example, look at this match count~ > mod=# select count(*) from korean_english; > count > -------- > 205323 > (1 row) > > mod=# > mod=# select count(*) from korean_english where word='안녕'; > count > ------- > 40332 > (1 row) You seem to be implying there is something wrong with the above results, but you haven't given us enough information to have any idea why that's a problem. AFAICT, it's perfectly plausible that 40332 out of the 205323 rows in that table have that particular value of the word column. If that's not correct, you need to tell us how, otherwise no-one can help you. One clue is that you appear to have your mail client set to use EUC-KR encoding, not UTF-8. Perhaps whatever client you're using to put data into your database is using that encoding too? Tim -- ----------------------------------------------- Tim Allen tim@proximity.com.au Proximity Pty Ltd http://www.proximity.com.au/