Thread: Understanding Encoding

Understanding Encoding

From
Beena Emerson
Date:
Hello All,

I am not able to understand how encoding is handled. I would be grateful if someone could explain what is happening in the following scenario:

1. I created a database with EUC_KR encoding, created a table, and inserted a Korean value into it.

=# CREATE DATABASE korean WITH ENCODING 'EUC_KR' LC_COLLATE='ko_KR.euckr' LC_CTYPE='ko_KR.euckr' TEMPLATE=template0;

=# \c korean

korean=# SHOW client_encoding;
 client_encoding 
-----------------
 UTF8
(1 row)

korean=# CREATE TABLE tbl (doc text);

korean=# INSERT INTO tbl VALUES ('그레스');


2. If I insert non-Korean values, it throws an error:

korean=# INSERT INTO tbl VALUES ('データベース');
ERROR:  character with byte sequence 0xe3 0x83 0xbc in encoding "UTF8" has no equivalent in encoding "EUC_KR"

korean=# SELECT * FROM tbl;
  doc   
--------
 그레스
(1 row)


3. I change the client encoding to EUC_KR and try inserting the same Korean characters, and it throws an error:

korean=# SET client_encoding = 'EUC_KR';
SET
korean=# INSERT INTO tbl VALUES ('그레스');
ERROR:  invalid byte sequence for encoding "EUC_KR": 0xa0 0x88


Even the SELECT statement displays something different, and I do not understand why.

korean=# SELECT * FROM tbl;
  doc   
--------
 �׷���
(1 row)


Can someone please help me?

Thank you,

Beena Emerson


Re: Understanding Encoding

From
Gopal Tandon
Date:




Re: [NOVICE] Understanding Encoding

From
Amit Langote
Date:
On Fri, Sep 6, 2013 at 2:56 PM, Beena Emerson <memissemerson@gmail.com> wrote:
> Hello All,
>
> I am not able to understand how the encoding is handled. I would be happy if
> someone can tell what is happening in the following scenario:
>
> 1. I have created a database with EUC_KR encoding and created a table and
> inserted some korean value into it.
>
> =# CREATE DATABASE korean WITH ENCODING 'EUC_KR' LC_COLLATE='ko_KR.euckr'
> LC_CTYPE='ko_KR.euckr' TEMPLATE=template0;
>
> =# \c korean
>
> korean=# SHOW client_encoding;
>  client_encoding
> -----------------
>  UTF8
> (1 row)
>
> korean=# CREATE TABLE tbl (doc text);
>
> korean=# INSERT INTO tbl VALUES ('그레스');
>
>
> 2. If I insert non-korean values it throws error:
>
> korean=# INSERT INTO tbl VALUES ('データベース');
> ERROR:  character with byte sequence 0xe3 0x83 0xbc in encoding "UTF8" has
> no equivalent in encoding "EUC_KR"
>
> korean=# SELECT * FROM tbl;
>   doc
> --------
>  그레스
> (1 row)
>
>
> 3. I change the client encoding to EUC_KR and try inserting the same korean
> characters and it throws an error:
>
> korean=# SET client_encoding = 'EUC_KR';
> SET
> korean=# INSERT INTO tbl VALUES ('그레스');
> ERROR:  invalid byte sequence for encoding "EUC_KR": 0xa0 0x88
>
>
> Even the SELECT statement displays something different. I am not able to
> understand why?
>
> korean=# SELECT * FROM tbl;
>   doc
> --------
>  �׷���
> (1 row)
>

I wonder if you have tried changing your "locale" to ko_KR; something like:

LANG=ko_KR LC_ALL=ko_KR \
psql -d korean


--
Amit Langote


Re: [NOVICE] Understanding Encoding

From
Beena Emerson
Date:


> I wonder if you have tried changing your "locale" to ko_KR; something like:
>
> LANG=ko_KR LC_ALL=ko_KR \
> psql -d korean


Hi,

It still gives the same result:

$ LANG=ko_KR LC_ALL=ko_KR
$ psql -d korean

korean=# SHOW client_encoding;
 client_encoding 
-----------------
 EUC_KR
(1 row)

korean=# INSERT INTO tbl VALUES ('그레스');
ERROR:  invalid byte sequence for encoding "EUC_KR": 0xa0 0x88 
 
Beena Emerson

Re: [NOVICE] Understanding Encoding

From
Tom Lane
Date:
Beena Emerson <memissemerson@gmail.com> writes:
> It still gives same result:

> $ LANG=ko_KR LC_ALL=ko_KR
> $ psql -d korean

> korean=# SHOW client_encoding;
>  client_encoding
> -----------------
>  EUC_KR
> (1 row)

> korean=# INSERT INTO tbl VALUES ('그레스');
> ERROR:  invalid byte sequence for encoding "EUC_KR": 0xa0 0x88

What you need to figure out is what encoding the text you are typing
is in.  You're telling psql it's EUC_KR but it evidently isn't.
If you're typing these characters manually then it's probably determined
by a setting of the terminal-emulator program you're using.  But if
you're copying-and-pasting then things get more complicated.

Also, what you did above is not what Amit suggested: he wanted you to put
the variable assignments on the same command line as the psql invocation,
so that they'd affect the environment passed to psql.  I'm suspicious of
his solution because I'd have thought the terminal program would set up
the right environment ... but you might as well try it.

            regards, tom lane
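
One way to check which bytes are actually being typed is to reproduce them outside psql. The following Python sketch is not part of the original thread. It encodes the string as UTF-8, as a UTF-8 terminal would send it, and scans the bytes with a simplified EUC_KR rule. The rule (both bytes of a two-byte character must lie in 0xA1..0xFE) and the function name are assumptions made for illustration, not PostgreSQL's actual verifier.

```python
def first_invalid_euc_kr_pair(data: bytes):
    """Return the first byte pair that breaks the simplified EUC_KR rule, or None."""
    i = 0
    while i < len(data):
        if data[i] < 0x80:
            # ASCII bytes stand alone.
            i += 1
            continue
        pair = data[i:i + 2]
        # Assumed rule: a non-ASCII character is two bytes, each in 0xA1..0xFE.
        if len(pair) < 2 or not all(0xA1 <= b <= 0xFE for b in pair):
            return pair
        i += 2
    return None


def hex_bytes(data: bytes) -> str:
    """Format bytes the way the PostgreSQL error message does."""
    return " ".join(f"0x{b:02x}" for b in data)


typed = "그레스"
as_utf8 = typed.encode("utf-8")          # what a UTF-8 terminal would send
bad = first_invalid_euc_kr_pair(as_utf8)

print("bytes sent:        ", hex_bytes(as_utf8))
print("first invalid pair:", hex_bytes(bad))
```

Under these assumptions the first rejected pair is 0xa0 0x88, the same pair named in the error message, which is consistent with the terminal sending UTF-8 while client_encoding claims EUC_KR.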


Re: Understanding Encoding

From
Tatsuo Ishii
Date:
> Hello All,
> 
> I am not able to understand how the encoding is handled. I would be happy
> if someone can tell what is happening in the following scenario:
> 
> 1. I have created a database with EUC_KR encoding and created a table and
> inserted some korean value into it.
> 
> =# CREATE DATABASE korean WITH ENCODING 'EUC_KR' LC_COLLATE='ko_KR.euckr'
> LC_CTYPE='ko_KR.euckr' TEMPLATE=template0;
> 
> =# \c korean
> 
> korean=# SHOW client_encoding;
>  client_encoding
> -----------------
>  UTF8
> (1 row)
> 
> korean=# CREATE TABLE tbl (doc text);
> 
> korean=# INSERT INTO tbl VALUES ('그레스');
> 
> 
> 2. If I insert non-korean values it throws error:
> 
> korean=# INSERT INTO tbl VALUES ('データベース');
> ERROR:  character with byte sequence 0xe3 0x83 0xbc in encoding "UTF8" has
> no equivalent in encoding "EUC_KR"

The error message says it all. PostgreSQL accepted 'データベース'
encoded in UTF-8, then tried to convert it to EUC_KR but failed, because
EUC_KR does not accept languages other than Korean (and ASCII). What
else did you expect?
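
The byte sequence in the error, 0xe3 0x83 0xbc, is the UTF-8 form of U+30FC ("ー"), the third character of the string. The Python sketch below is not from the thread. It lists which characters of a string a target encoding cannot represent. It uses Python's own euc_kr codec as a stand-in, on the assumption that it behaves like PostgreSQL's conversion for this case; the two tables are not guaranteed to match in general, and the function name is made up for illustration.

```python
def unencodable(text: str, encoding: str) -> list:
    """Return the characters of text that cannot be encoded in the given encoding."""
    rejected = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            rejected.append(ch)
    return rejected


rejected = unencodable("データベース", "euc_kr")

# Print each rejected character once, with its code point and UTF-8 bytes.
for ch in dict.fromkeys(rejected):
    utf8_hex = " ".join(f"0x{b:02x}" for b in ch.encode("utf-8"))
    print(f"U+{ord(ch):04X} {ch} -> UTF-8 {utf8_hex}")
```

Running the same function on '그레스' returns an empty list, which matches the successful first INSERT.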

> korean=# SELECT * FROM tbl;
>   doc
> --------
>  그레스
> (1 row)
> 
> 
> 3. I change the client encoding to EUC_KR and try inserting the same korean
> characters and it throws an error:
> 
> korean=# SET client_encoding = 'EUC_KR';
> SET
> korean=# INSERT INTO tbl VALUES ('그레스');
> ERROR:  invalid byte sequence for encoding "EUC_KR": 0xa0 0x88

0xa0 is definitely not part of EUC_KR. That's why PostgreSQL throws an
error. I guess you are using UHC (Unified Hangul Code), rather than
EUC_KR. They are different encodings. You should do either:

1) Make sure that your terminal encoding is EUC_KR.

2) set client_encoding = 'uhc';

> Even the SELECT statement displays something different. I am not able to
> understand why?
> 
> korean=# SELECT * FROM tbl;
>   doc
> --------
>  �׷���
> (1 row)

This is for the same reason as above.
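
The garbled SELECT output can be reproduced in the same spirit. The Python sketch below is not from the thread. It assumes the server sends the stored text as EUC_KR bytes (because client_encoding is EUC_KR) and that the terminal interprets those bytes as UTF-8, substituting a replacement character for anything it cannot decode. Both assumptions are about the poster's setup and are not confirmed in the thread.

```python
stored = "그레스"

# Bytes on the wire when client_encoding is EUC_KR (Python's codec as a stand-in).
wire = stored.encode("euc_kr")

# What a terminal that expects UTF-8 would make of those bytes.
shown = wire.decode("utf-8", errors="replace")

print("wire bytes:", " ".join(f"0x{b:02x}" for b in wire))
print("displayed: ", shown)
```

The bytes themselves are fine: decoding them as EUC_KR gives back the original text. Only the interpretation on the client side differs.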

> Can someone please help me.
> 
> Thanks you,
> 
> Beena Emerson
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

Re: [NOVICE] Understanding Encoding

From
Amit Langote
Date:
On Fri, Sep 6, 2013 at 3:47 PM, Beena Emerson <memissemerson@gmail.com> wrote:
>
>>
>> I wonder if you have tried changing your "locale" to ko_KR; something
>> like:
>>
>> LANG=ko_KR LC_ALL=ko_KR \
>> psql -d korean
>>
>
> Hi,
>
> It still gives same result:
>
> $ LANG=ko_KR LC_ALL=ko_KR
> $ psql -d korean
>
> korean=# SHOW client_encoding;
>  client_encoding
> -----------------
>  EUC_KR
> (1 row)
>
> korean=# INSERT INTO tbl VALUES ('그레스');
> ERROR:  invalid byte sequence for encoding "EUC_KR": 0xa0 0x88


I changed the encoding of the terminal emulator (GNOME Terminal
2.31.3) using the Terminal menu as:

Terminal -> Set Character Encoding -> Korean (EUC-KR)

Note that, if the menu only lists UTF-8, you'd have to add EUC-KR
using "Add or Remove".

And it seems to work; could you try the same?

--
Amit Langote


Re: [NOVICE] Understanding Encoding

From
Beena Emerson
Date:
On Fri, Sep 6, 2013 at 12:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Beena Emerson <memissemerson@gmail.com> writes:
>> It still gives same result:
>
>> $ LANG=ko_KR LC_ALL=ko_KR
>> $ psql -d korean
>
>> korean=# SHOW client_encoding;
>>  client_encoding
>> -----------------
>>  EUC_KR
>> (1 row)
>
>> korean=# INSERT INTO tbl VALUES ('그레스');
>> ERROR:  invalid byte sequence for encoding "EUC_KR": 0xa0 0x88
>
> What you need to figure out is what encoding the text you are typing
> is in.  You're telling psql it's EUC_KR but it evidently isn't.
> If you're typing these characters manually then it's probably determined
> by a setting of the terminal-emulator program you're using.  But if
> you're copying-and-pasting then things get more complicated.
>
> Also, what you did above is not what Amit suggested: he wanted you to put
> the variable assignments on the same command line as the psql invocation,
> so that they'd affect the environment passed to psql.  I'm suspicious of
> his solution because I'd have thought the terminal program would set up
> the right environment ... but you might as well try it.
 
I tried with both the assignment and the invocation on the same line. Again, it gave the same result.
Maybe the problem is with copy and paste. I will look into it.
Thank you.

Re: Understanding Encoding

From
Sebastien FLAESCH
Date:
Hi,

Tip:

To identify the encoding of the text you enter in the psql command interpreter:

1) Open a file with vim
2) Type in your SQL or copy/paste it
3) Save the file and quit vim
4) $ file <filename>

This should give you the encoding of that text file.

For example:

sf@orca:~$ echo $LC_ALL
en_US.UTF-8

sf@orca:~$ cat /tmp/xx
abcdefé

sf@orca:~$ file /tmp/xx
/tmp/xx: UTF-8 Unicode text
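
Where the file utility is not available, a rough equivalent can be sketched in Python (not part of the original tip). It tries a list of candidate encodings and reports which ones decode the bytes without error. The candidate list and the function name are assumptions for illustration, and a successful decode only shows that an encoding is plausible, not that it is the one actually in use.

```python
def plausible_encodings(data: bytes,
                        candidates=("ascii", "utf-8", "euc_kr", "cp949")) -> list:
    """Return the candidate encodings that decode data without error."""
    ok = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        ok.append(name)
    return ok


# Simulate a saved SQL file; a real check would read the file in binary
# mode, e.g. open(path, "rb").read().
saved_as_utf8 = "INSERT INTO tbl VALUES ('그레스');".encode("utf-8")
saved_as_euc_kr = "INSERT INTO tbl VALUES ('그레스');".encode("euc_kr")

print("UTF-8 file: ", plausible_encodings(saved_as_utf8))
print("EUC_KR file:", plausible_encodings(saved_as_euc_kr))
```

More than one candidate can succeed for the same bytes, so the result narrows the possibilities rather than proving a single answer.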


Seb


Re: Understanding Encoding

From
Beena Emerson
Date:
Hello,

Thank you all. 

Amit, changing the encoding of the terminal emulator worked.

Sebastien, the tip was helpful.

--

Beena Emerson