Re: Character Encoding problem - Mailing list pgsql-jdbc
From | Craig Ringer |
---|---|
Subject | Re: Character Encoding problem |
Date | |
Msg-id | 47F9A499.3000803@postnewspapers.com.au Whole thread Raw |
In response to | Character Encoding problem ("antony baxter" <antony.baxter@gmail.com>) |
List | pgsql-jdbc |
antony baxter wrote: > Hi Craig - thanks for replying. > >> What's your terminal encoding? The easiest way to find out is to look at >> the output of the `locale' command. > > ant@home (/Users/ant) % locale > LANG="en_GB.UTF-8" Hmm, ok. I was expecting a non-utf-8 locale. > >> What is the output of the psql '\encoding' command when you've just made a >> new connection, and without touching the client_encoding setting? > > testdb=# \encoding > UTF8 OK, so your psql client encoding matches your db locale and your local 8 bit text codec for your terminal etc. >> To perform a valid test you'd need to extract the raw byte sequence from >> the log, encode it as an escaped octal/hex string in psql, and send that. > > Ok. I tried cutting and pasting just some cyrillic from my log file > into 'test.txt' and then did: > > ant@home (/Users/ant) % file test.txt > test.txt: UTF-8 Unicode text > > Which I assumed was enough? Only because all your locales are the same. Copy and paste via a terminal is potentially subject to text encoding conversions by tools like `less' and `vim'. A reliable test would be to use something like Perl or Python to open the file (in plain old 8-bit binary mode, no encoding conversions etc), seek to the appropriate part, and read the sequence of bytes of interest. > Ok. Directly after the Java INSERT: > > testdb=# select name::bytea from testTable where id = '1'; > first_name > ------------------------------------------------------------------------------------------------------ > \342\200\223\303\253\342\200\224\303\204\342\200\224\303\247\342\200\223\316\251\342\200\224\303\207 > (1 row) That decodes as: –ë—Ä—ç–Ω—Ç ie emdash, e-umlaut, emdash, captial-a-umlaut, emdash, c-cedelia, emdash, omega, emdash, capital-c-cedelia. ... which isn't exactly cyrillic ;-) More to the point, the regularly inserted emdashes are very informative, as they suggest that something is mangling the UTF-8 escape bytes. > Then I manually UPDATE and try again: > > testdb=# select name::bytea from testTable where id = '1'; > first_name > ------------------------------------------ > \320\221\321\200\321\215\320\275\321\202 ... which decodes as a much more sensible 'Брэнт'. At least it's cyrillic. >> If you INSERT that string into a text/varchar column in the DB as an >> escaped string as above, then SELECT it back as bytea, you should get the >> original sequence of octal values back. If you select it as text, you should >> get the same value as the string shown above in my email. If you insert the >> non-escaped string as written in my email and select that back out as bytea, >> you should get the same sequence of octal values as shown in my mail. > > Both appear correct - i.e. SELECT name shows me correctly accented + > Chinese text, and SELECT name::bytea shows me the same bytes. Yep, it's now clear that you're getting the right behaviour from psql. Probably perl, too, since it's the same 8-bit encoding all the way through. >> Are you explicitly handling conversions from utf-8 byte strings from psql >> to the local 8 bit encoding, or setting client_encoding to your local 8 bit >> encoding to get the pg libraries to do it for you? > > Neither, as far as I'm aware. I'm running everything (I think) in UTF-8. It's probably a good idea to set the client encoding explicitly to your local encoding anyway, or do explicit utf-8 <> local8bit conversions on data from the DB. Testing your perl code in a non-utf-8 locale >> It's actually rather a lot more likely that the rest of your tests are >> inserting mangled data and then retrieving it by unmangling it the same way, >> and the Java code is getting it right. I see this sort of thing a lot when >> people are working with 8-bit text of different encodings, or converting >> between "Unicode" datatypes and 8-bit strings that don't have an inherent >> encoding. > > Yes - I think thats probably right, except that everything else > *seems* to be functioning correctly... Yep, my bad. So now we look at your strings in Java. Are you certain that the input you're feeding to the JDBC is correctly encoded utf-8 text? I've frequently seen a "unicode" string like java.lang.String or Qt's QString being filled character by character from 8-bit input, resulting in a hopelessly mangled string for non-ascii input, especially in locales where one character isn't necessarily one byte. Treating utf-8 input as ascii or latin-1 (maybe due to an implicit conversion somewhere) is another culprit. -- Craig Ringer
pgsql-jdbc by date: