Thread: another seemingly simple encoding question

another seemingly simple encoding question

From
joseph
Date:
maybe a routine question here ... .... i hope i can understand the
answer.

[postgres@www ~]$ pg_ctl --version
pg_ctl (PostgreSQL) 8.0.0beta3
[postgres@www ~]$


i have a problem matching a utf8 string with a field in a database
encoded in utf8.
i read the documentation, checked the following, and don't know where i
went astray, trying to match ...
1) i am almost 100% sure the data is correctly utf8. i just dumped and
loaded into postgres.
2)
utf8db -> \l
          List of databases
     Name     |  Owner   | Encoding
--------------+----------+-----------
utf8db         | postgres | UNICODE
3) postgresql.conf

# These settings are initialized by initdb -- they might be changed
lc_messages = 'en_US.utf8'              # locale for system error
message strings
lc_monetary = 'en_US.utf8'              # locale for monetary formatting
lc_numeric = 'en_US.utf8'               # locale for number formatting
lc_time = 'en_US.utf8'                  # locale for time formatting

# - Other Defaults -

4) set client encoding in client (psql or php, either one, both same
mismatch)

LOG:  statement: select wordid,word from korean_english where word='기르
다' limit 10;
LOG:  statement: show client_encoding;
LOG:  statement: set client_encoding to  'utf8';
LOG:  statement: select wordid,word from korean_english where word='기르
다' limit 10;
LOG:  statement: show client_encoding;

5) locale -a | grep en
<snip>
en_US.utf8
</snip>


ohhh,  where is my mistake, please!






Re: another seemingly simple encoding question

From
"John D. Burger"
Date:
> i have a problem matching a utf8 string with a field in a database
> encoded in utf8.

You seem to give all the details of your configuration, but unless I
misread your message, you don't say what the actual problem is.  Can
you provide more details?  What exactly doesn't work?

This may not be the issue, but many people don't realize that there are
sometimes multiple ways to encode what is conceptually the same string
in UTF8 (or any of the Unicode encodings).  If you do not canonicalize
your strings using one of the Unicode normalization forms, then
seemingly identical strings may not match, because they are not
byte-for-byte identical.

- John D. Burger
   MITRE


Re: another seemingly simple encoding question

From
joseph
Date:
problem is that my string -- which is in utf-8 -- because
all input is converted first in php with
       $str_out = mb_convert_encoding($str_in, "UTF-8");
and the query, which
is like
"select wordid from korean_english where word='utf8string'";
and it is returning wordids for words which are not = utf8string

(in debug mode) the above is output as UTF-8 by php / browser encoding
over the web, and then "exit;" is called,
so i just grab it from the browser by cutting and pasting the whole
query string.
running the query in php and from psql return the same bad wordids,
pointing that the encoding is consistent through the cut-and-paste
operation.

i don't understand what a "unicode normalization form" is.  the postgres
docs http://www.postgresql.org/docs/8.0/interactive/multibyte.html
say

Table 20-1. Server Character Sets

                Name
                 Description


UNICODE     Unicode (UTF-8)

so i thought they were the same, and i dont know about "unicode
normalization form".


my question is why isn't the utf8string in query returning only
matching, corresponding wordids from the database....

thx.

2006-03-24 (금), 08:56 -0500, John D. Burger 쓰시길:
> > i have a problem matching a utf8 string with a field in a database
> > encoded in utf8.
>
> You seem to give all the details of your configuration, but unless I
> misread your message, you don't say what the actual problem is.  Can
> you provide more details?  What exactly doesn't work?
>
> This may not be the issue, but many people don't realize that there
are
> sometimes multiple ways to encode what is conceptually the same
string
> in UTF8 (or any of the Unicode encodings).  If you do not
canonicalize
> your strings using one of the Unicode normalization forms, then
> seemingly identical strings may not match, because they are not
> byte-for-byte identical.
>
> - John D. Burger
>    MITRE
>







Re: another seemingly simple encoding question

From
"John D. Burger"
Date:
This doesn't sound like your problem, but I'll explain the
normalization issue using Korean as an example, since that seems to be
your data:  There are codepoints in Unicode both for Hangul and Jamo,
so a Hangul glyph can be represented either with the single
corresponding codepoint, or as two or three Jamo codepoints.  A Unicode
font would display these two alternatives identically.  In any Unicode
encoding, including UTF8, these two strings would not be byte-for-byte
identical.  The Unicode normalization forms are four algorithms for
normalizing the strings in such a way that they do compare identically.

Anyway, it sounds like you have the opposite problem, two strings that
are comparing equal when you think they shouldn't.  I don't know that
anyone can help you unless you post an actual example of two such
strings.

- John D. Burger
   MITRE


Re: another seemingly simple encoding question

From
kmh496
Date:
this is a forward of my problem from April.
I have this time gone all the way and re-inited a DB from scratch,
created a new database, documented the import procedure, set the locale
to match but I am still having problems.
For example, look at this match count~
mod=# select count(*) from korean_english;
 count
--------
 205323
(1 row)

mod=#
mod=# select count(*) from korean_english where word='안녕';
 count
-------
 40332
(1 row)

mod=# \set
VERSION = 'PostgreSQL 8.0.0beta3 on i686-pc-linux-gnu, compiled by GCC
gcc (GCC) 3.3.3 20040412 (Red Hat Linux 3.3.3-7)'
AUTOCOMMIT = 'on'
VERBOSITY = 'default'
DBNAME = 'mod'
USER = 'postgres'
PORT = '5432'
ENCODING = 'UNICODE'
PROMPT1 = '%/%R%# '
PROMPT2 = '%/%R%# '
PROMPT3 = '>> '
HISTSIZE = '500'
mod=#

I documented the import procedure and put it at
http://www.myowndictionary.com/design.htm

if there is anybody out there who has any idea, i would be very grateful
for help.   I have to move my database to postgres from mysql, and it
has this big problem with the encoding.

thank you .

joseph.




-------- Forwarded Message --------
보낸 사람: joseph <kmh496@kornet.net>
받는 사람: pgsql-general@postgresql.org
제목: another seemingly simple encoding question
날짜: Fri, 24 Mar 2006 22:27:06 +0900
maybe a routine question here ... .... i hope i can understand the
answer.

[postgres@www ~]$ pg_ctl --version
pg_ctl (PostgreSQL) 8.0.0beta3
[postgres@www ~]$


i have a problem matching a utf8 string with a field in a database
encoded in utf8.
i read the documentation, checked the following, and don't know where i
went astray, trying to match ...
1) i am almost 100% sure the data is correctly utf8. i just dumped and
loaded into postgres.
2)
utf8db -> \l
          List of databases
     Name     |  Owner   | Encoding
--------------+----------+-----------
utf8db         | postgres | UNICODE
3) postgresql.conf

# These settings are initialized by initdb -- they might be changed
lc_messages = 'en_US.utf8'              # locale for system error
message strings
lc_monetary = 'en_US.utf8'              # locale for monetary formatting
lc_numeric = 'en_US.utf8'               # locale for number formatting
lc_time = 'en_US.utf8'                  # locale for time formatting

# - Other Defaults -

4) set client encoding in client (psql or php, either one, both same
mismatch)

LOG:  statement: select wordid,word from korean_english where word='기르
다' limit 10;
LOG:  statement: show client_encoding;
LOG:  statement: set client_encoding to  'utf8';
LOG:  statement: select wordid,word from korean_english where word='기르
다' limit 10;
LOG:  statement: show client_encoding;

5) locale -a | grep en
<snip>
en_US.utf8
</snip>


ohhh,  where is my mistake, please!
--
my site <a href="http://www.myowndictionary.com">myowndictionary</a> was
made to help students of many languages learn them faster.






Re: another seemingly simple encoding question

From
Tim Allen
Date:
kmh496 wrote:
> this is a forward of my problem from April.
> I have this time gone all the way and re-inited a DB from scratch,
> created a new database, documented the import procedure, set the locale
> to match but I am still having problems.
> For example, look at this match count~
> mod=# select count(*) from korean_english;
>  count
> --------
>  205323
> (1 row)
>
> mod=#
> mod=# select count(*) from korean_english where word='안녕';
>  count
> -------
>  40332
> (1 row)

You seem to be implying there is something wrong with the above results,
but you haven't given us enough information to have any idea why that's
a problem. AFAICT, it's perfectly plausible that 40332 out of the 205323
rows in that table have that particular value of the word column. If
that's not correct, you need to tell us how, otherwise no-one can help you.

One clue is that you appear to have your mail client set to use EUC-KR
encoding, not UTF-8. Perhaps whatever client you're using to put data
into your database is using that encoding too?

Tim

--
-----------------------------------------------
Tim Allen          tim@proximity.com.au
Proximity Pty Ltd  http://www.proximity.com.au/