Thread: Chinese in Postgres
Hello all,
before writing this message, I wrote about this in other mailing lists without solving my problem.
Maybe some of you can help me.
I have a problem with a Postgres DB when I try to insert Chinese strings in UTF-8 format.
If I insert the data using a C++ program, I get empty squares, like this: ��� (3 empty squares for each Chinese ideogram, as that is its length in UTF-8).
If the string contains Chinese mixed with ASCII, the ASCII is OK but the Chinese is broken:
漢語1-3漢語 --> ������1-3������
All the data is read from a binary file. It seems it's read correctly, but something happens when the query is executed.
(If the text is in a different language that uses only 2 bytes per letter, e.g. Hebrew, I see only 2 empty squares per character, but that is no good either...)
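The byte counts described above (3 bytes per ideogram, 2 per Hebrew letter) follow from how UTF-8 encodes code points, and one way to see what the program is really sending is to dump the string's bytes just before the query is built. The sketch below is only an illustration: `utf8_sequence_length` and `hex_dump` are hypothetical helper names, not part of the poster's program or of any Postgres library.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Number of bytes in the UTF-8 sequence that starts with `lead`,
// or 0 if `lead` cannot start a sequence (continuation or invalid byte).
// Hypothetical helper, for illustration only.
std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;                   // ASCII
    if (lead >= 0xC2 && lead <= 0xDF) return 2;  // e.g. Hebrew letters
    if (lead >= 0xE0 && lead <= 0xEF) return 3;  // e.g. CJK ideographs
    if (lead >= 0xF0 && lead <= 0xF4) return 4;  // code points above U+FFFF
    return 0;
}

// Lowercase hex dump of the raw bytes, separated by spaces.
// Hypothetical helper, for illustration only.
std::string hex_dump(const std::string& s) {
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < s.size(); ++i) {
        std::snprintf(buf, sizeof buf, "%02x",
                      static_cast<unsigned char>(s[i]));
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}
```

If the string really is UTF-8, the dump of 漢 should be `e6 bc a2` (U+6F22) and that of 語 should be `e8 aa 9e` (U+8A9E). Any other bytes at that point would suggest the data was altered before it reached the query.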
Strange things:
1. If I insert the record by running a query from the command line (PuTTY), the Chinese text is OK. The problem occurs only when I insert through the C++ program.
2. I checked the C++ functions involved by writing unit tests; if I run some other tests (on another virtual machine), the text is not damaged.
These strange things are confusing me, but maybe they will be useful information for somebody who has had the same problem.
The DB is set for UTF-8
 Name         | Owner | Encoding | Collate     | Ctype       | Access privileges
--------------+-------+----------+-------------+-------------+------------------
 postgres     | pgsql | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
 MyDB         | pgsql | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
 template0    | pgsql | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
 template1    | pgsql | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
Previously I also tried with:
 Name         | Owner | Encoding | Collate     | Ctype       | Access privileges
--------------+-------+----------+-------------+-------------+------------------
 postgres     | pgsql | UTF8     | C           | C           |
 MyDB         | pgsql | UTF8     | C           | C           |
...
But the problem was the same.
I know that you would like to see the code, but it's too long (if you want, I can try to post a few lines, such as the connection to the DB). I don't know whether Postgres writes a log when damaged data is inserted; that would be useful.
For now, to save your time, my question is: has any of you had the same problem?
(and how did you solve it?)
Thanks,
Francesco
On 08/16/2013 01:25 PM, ciifrancesco@tiscali.it wrote:
> Hello all,
> before writing this message, I wrote about this in other mailing lists
> without solving my problem.
> Maybe some of you can help me.
>
> I have problems with a DB in postgres, when i try to insert Chinese
> strings in UTF-8 format.
> If I insert the data using a C++ program I have empty squares, in this
> format: ��� (3 empty squares for each chinese ideogram as that is the
> length in UTF-8)
> If the string contains chinese mixed with ASCII, the ASCII is OK but
> the Chinese is broken:
> 漢語1-3漢語 --> ������1-3������

Can you check that your client encoding is also UTF8?

hannu=# show client_encoding ;
 client_encoding
-----------------
 UTF8
(1 row)

Cheers
--
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ
Hi, Francesco,
You said "If I insert the data using a C++ program I have empty squares", so I guess you forgot to convert your string to UTF-8 before the insert.
On Fri, Aug 16, 2013 at 4:25 AM, ciifrancesco@tiscali.it <ciifrancesco@tiscali.it> wrote:
> If I insert the data using a C++ program I have empty squares, in this
> format: ��� (3 empty squares for each chinese ideogram as that is the length
> in UTF-8)
> If the string contains chinese mixed with ASCII, the ASCII is OK but the
> Chinese is broken:
> 漢語1-3漢語 --> ������1-3������

You mentioned nothing about what platform this is or how you've built the program, and nothing about the operating system locale. If this is a Windows program (you mention PuTTY), I'd read up on the differences between what are known as "Unicode" and "Multibyte" encodings on MSDN:

http://msdn.microsoft.com/en-us/library/2dax2h36.aspx

Of course, this is a total stab in the dark, but then people with the problem that you describe don't tend to be on *nix systems as a rule. As someone said upthread, if Postgres does that then it's because the bytes you sent aren't what you think they are when rendered as UTF-8.

--
Peter Geoghegan
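One way to test the last point (whether the bytes being sent really are UTF-8) is to validate the string in the C++ program immediately before it goes into the query. The following is a minimal sketch under that assumption. `is_valid_utf8` is a hypothetical helper written for this illustration and is not taken from the poster's code or from any Postgres client library. It only detects ill-formed input and cannot say which other encoding the bytes are in.

```cpp
#include <cstddef>
#include <string>

// Returns true if `s` is well-formed UTF-8: no stray continuation bytes,
// no truncated sequences, no overlong forms, no surrogates, nothing
// above U+10FFFF. Hypothetical helper, for illustration only.
bool is_valid_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        if (b0 < 0x80) { ++i; continue; }        // ASCII

        std::size_t len = 0;
        unsigned char lo = 0x80, hi = 0xBF;      // allowed range for byte 2
        if (b0 >= 0xC2 && b0 <= 0xDF) {
            len = 2;
        } else if (b0 >= 0xE0 && b0 <= 0xEF) {
            len = 3;
            if (b0 == 0xE0) lo = 0xA0;           // reject overlong forms
            if (b0 == 0xED) hi = 0x9F;           // reject surrogates
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {
            len = 4;
            if (b0 == 0xF0) lo = 0x90;           // reject overlong forms
            if (b0 == 0xF4) hi = 0x8F;           // reject > U+10FFFF
        } else {
            return false;                        // invalid lead byte
        }

        if (n - i < len) return false;           // truncated sequence
        const unsigned char b1 = static_cast<unsigned char>(s[i + 1]);
        if (b1 < lo || b1 > hi) return false;
        for (std::size_t k = 2; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if (c < 0x80 || c > 0xBF) return false;
        }
        i += len;
    }
    return true;
}
```

If this returns false for a string read from the binary file, the damage happened before the query was executed. If it returns true, the bytes are well-formed UTF-8 at that point, and the connection's client encoding becomes the more likely suspect.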