Thread: Problem with language encoding
I've inherited a database system (including source). It's Russian text (which I don't speak, but I need to work on it). On my screen it shows correctly, but in the database it shows a mess.

From trying with a web browser, it appears that the data in the database is encoded with ISO-8859-1 (my browser and the database viewed with PgAdmin match when my browser is set to ISO-8859-1 encoding), and it shows correctly on screen in my browser when I use the WIN1251 encoding.

The problem is that the database is UTF8, and it won't let me use the convert_to function because it says that the characters don't exist. What I need to do is to pull the UTF8 from the database, tell Postgres that it's 8859-1, and then convert it to WIN1251. How?

Thank you for any assistance!

Ron
On 26/11/2008 14:40, Ronald Vyhmeister wrote:
> The problem is that the database is UTF8, and it won't let me use the
> convert_to function because it says that the characters don't exist... What
> I need to do is to pull the UTF8 from the database, tell postgres that it's
> 8859-1, and then convert it to WIN1251... How?

If the DB is in UTF8, you ought to be able to issue

  set client_encoding to 'WIN1251'

after connecting, and the conversion happens automatically. See:

  http://www.postgresql.org/docs/8.3/static/multibyte.html#AEN27483

HTH,

Ray.

------------------------------------------------------------------
Raymond O'Donnell, Director of Music, Galway Cathedral, Ireland
rod@iol.ie
Galway Cathedral Recitals: http://www.galwaycathedral.org/recitals
------------------------------------------------------------------
> The problem is that the database is UTF8, and it won't let me use the
> convert_to function because it says that the characters don't exist... What
> I need to do is to pull the UTF8 from the database, tell postgres that it's
> 8859-1, and then convert it to WIN1251... How?
>
> If the DB is in UTF8, you ought to be able to issue
>
>   set client_encoding to 'WIN1251'
>
> after connection, and the conversion happens automatically. See:
>
>   http://www.postgresql.org/docs/8.3/static/multibyte.html#AEN27483
>
> HTH,
> Ray.

Wish it would work... when I do it, I get:

  ERROR: character 0xc3bd of encoding "UTF8" has no equivalent in "WIN1251"

  ********** Error **********

  ERROR: character 0xc3bd of encoding "UTF8" has no equivalent in "WIN1251"
  SQL state: 22P05

The DB is storing it as UTF8, but it is really 8859-1... I need to force PG to think that it's 8859-1 in spite of what it thinks it may be...

Ron
"Ronald Vyhmeister" <rvyhmeister@gmail.com> writes:
> The DB is storing it UTF8, but it is really 8859-1... I need to force PG to
> think that it's 8859-1 in spite of what it thinks it may be...

Dump the database, change the "set client_encoding" command in the resulting file, reload.

			regards, tom lane
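[A sketch of the dump/edit/reload cycle described above. The database and file names are placeholders, and which target encoding to declare depends on how the data was actually mangled; the sed edit is demonstrated here on a minimal stand-in for a dump header rather than a real pg_dump output.]

```shell
# In practice, against a real cluster, the steps would look like:
#   pg_dump olddb > dump.sql
#   ...edit the SET client_encoding line as below...
#   createdb newdb && psql -f dump.sql newdb
# Demonstrated on a fake minimal dump header:
cat > dump.sql <<'EOF'
SET client_encoding = 'UTF8';
CREATE TABLE docs (body text);
EOF

# Re-declare the client encoding so the server reinterprets the
# dumped bytes on reload (LATIN1 here is only an example):
sed -i "s/SET client_encoding = 'UTF8';/SET client_encoding = 'LATIN1';/" dump.sql
head -n 1 dump.sql
```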
Ronald Vyhmeister wrote:
>>> The problem is that the database is UTF8, and it won't let me use the
>>> convert_to function because it says that the characters don't exist... What
>>> I need to do is to pull the UTF8 from the database, tell postgres that it's
>>> 8859-1, and then convert it to WIN1251... How?
>>
>> If the DB is in UTF8, you ought to be able to issue
>>
>>   set client_encoding to 'WIN1251'
>>
>> after connection, and the conversion happens automatically. See:
>>
>>   http://www.postgresql.org/docs/8.3/static/multibyte.html#AEN27483
>>
>> HTH,
>> Ray.
>
> Wish it would work... when I do it, I get:
>
> ERROR: character 0xc3bd of encoding "UTF8" has no equivalent in "WIN1251"

Which it does not; that character is "쎽" (HANGUL SYLLABLE SSYEG), which certainly isn't in WIN1251 or in Latin-1 (ISO-8859-1). The byte sequence for this character in UTF-8 is:

  0xec 0x8e 0xbd

When decoded as Latin-1, those three bytes are interpreted as:

  "ì½"

... which doesn't seem much more likely to be valid. Are you SURE it's Latin-1, not (say) KOI8-R?

Python is a useful tool for investigating encoding problems. It draws a strong distinction between true Unicode strings and byte strings, and it provides quality conversion routines that let you specify the encoding of a byte string and that throw an exception for nonsensical input. This lets you take a byte string, decode it according to various encodings, and see what you end up with. You can also test various horrible encoding-mangling schemes used by apps rather easily. The Python `encodings' module contains a list of all known encodings, under all their aliases, as `encodings.aliases.aliases'.
Here's some code that takes an input byte string and prints a table of all the ways it can be interpreted according to various encodings:

-------------------
import sys
import encodings

# Codecs that transform bytes rather than decoding text; skip them.
stupid_encodings = ['zlib_codec', 'uu_codec', 'hex_codec', 'bz2_codec']

bs = '\xc3\xbd'

for encoding in set(encodings.aliases.aliases.values()):
    sys.stdout.write(u"%20s: " % encoding)
    sys.stdout.flush()
    if encoding in stupid_encodings:
        print
        continue
    try:
        print u"\"%6s\" (%6s)" % (bs.decode(encoding),
                                  repr(bs.decode(encoding)))
    except UnicodeDecodeError, e:
        print u" [INVALID]"
    except LookupError, e:
        print u" [UNSUPPORTED]"
--------------------

... and here's the output for the data you mentioned:

--------------------
       bz2_codec:
          cp1140: " C¨" (u'C\xa8')
          rot_13: " ý" (u'\xc3\xbd')
           cp932: " テス" (u'\uff83\uff7d')
    euc_jisx0213: " 箪" (u'\u7baa')
           cp037: " C¨" (u'C\xa8')
       hex_codec:
           cp500: " C¨" (u'C\xa8')
        uu_codec:
       big5hkscs: " 羸" (u'\u7fb8')
            mbcs: [UNSUPPORTED]
    euc_jis_2004: " 箪" (u'\u7baa')
    iso2022_jp_3: [INVALID]
    iso2022_jp_2: [INVALID]
    iso2022_jp_1: [INVALID]
             gbk: " 媒" (u'\u5a92')
 iso2022_jp_2004: [INVALID]
    quopri_codec: [INVALID]
           cp424: " C¨" (u'C\xa8')
      iso2022_jp: [INVALID]
     mac_iceland: " √Ω" (u'\u221a\u03a9')
       hp_roman8: " û§" (u'\xfb\xa7')
      iso2022_kr: [INVALID]
          euc_kr: " 첵" (u'\uccb5')
          cp1254: " ý" (u'\xc3\xbd')
          gb2312: " 媒" (u'\u5a92')
           cp850: " ├¢" (u'\u251c\xa2')
       shift_jis: " テス" (u'\uff83\uff7d')
           cp852: " ├Ż" (u'\u251c\u017b')
           cp855: " ├й" (u'\u251c\u0439')
       utf_16_le: " 뷃" (u'\ubdc3')
           cp857: " ├¢" (u'\u251c\xa2')
           cp775: " ├Į" (u'\u251c\u012e')
          cp1026: " C¨" (u'C\xa8')
      mac_latin2: " √Ĺ" (u'\u221a\u0139')
    mac_cyrillic: " √љ" (u'\u221a\u0459')
    base64_codec: " " ( '')
         ptcp154: " ГҪ" (u'\u0413\u04aa')
          euc_jp: " 箪" (u'\u7baa')
              hz: [INVALID]
           utf_8: " ý" (u'\xfd')
       mac_greek: " ΟΫ" (u'\u039f\u03ab')
           utf_7: [INVALID]
     mac_turkish: " √Ω" (u'\u221a\u03a9')
           cp949: " 첵" (u'\uccb5')
      zlib_codec:
            big5: " 羸" (u'\u7fb8')
       iso8859_9: " ý" (u'\xc3\xbd')
       iso8859_8: [INVALID]
       iso8859_5: " УН" (u'\u0423\u041d')
       iso8859_4: " ÃŊ" (u'\xc3\u014a')
       iso8859_7: " Γ½" (u'\u0393\xbd')
       iso8859_6: [INVALID]
       iso8859_3: [INVALID]
       iso8859_2: " Ă˝" (u'\u0102\u02dd')
         gb18030: " 媒" (u'\u5a92')
  shift_jis_2004: " テス" (u'\uff83\uff7d')
       mac_roman: " √Ω" (u'\u221a\u03a9')
           cp950: " 羸" (u'\u7fb8')
          utf_16: " 뷃" (u'\ubdc3')
      iso8859_15: " Ãœ" (u'\xc3\u0153')
      iso8859_14: " ÃẄ" (u'\xc3\u1e84')
         tis_620: " รฝ" (u'\u0e23\u0e1d')
      iso8859_16: " Ăœ" (u'\u0102\u0153')
      iso8859_11: " รฝ" (u'\u0e23\u0e1d')
      iso8859_10: " Ã―" (u'\xc3\u2015')
          tactis: [UNSUPPORTED]
           ascii: [INVALID]
           cp869: " ├Ξ" (u'\u251c\u039e')
           cp860: " ├╜" (u'\u251c\u255c')
           cp861: " ├╜" (u'\u251c\u255c')
           cp862: " ├╜" (u'\u251c\u255c')
           cp863: " ├╜" (u'\u251c\u255c')
           cp864: " ﺃﺵ" (u'\ufe83\ufeb5')
           cp865: " ├╜" (u'\u251c\u255c')
           cp866: " ├╜" (u'\u251c\u255c')
  shift_jisx0213: " テス" (u'\uff83\uff7d')
          cp1255: " ֳ½" (u'\u05b3\xbd')
         latin_1: " ý" (u'\xc3\xbd')
          cp1257: " Ć½" (u'\u0106\xbd')
          cp1256: " أ½" (u'\u0623\xbd')
          cp1251: " ГЅ" (u'\u0413\u0405')
          cp1250: " Ă˝" (u'\u0102\u02dd')
          cp1253: " Γ½" (u'\u0393\xbd')
          cp1252: " ý" (u'\xc3\xbd')
           cp437: " ├╜" (u'\u251c\u255c')
          cp1258: " Ă½" (u'\u0102\xbd')
      iso8859_13: " Ć½" (u'\u0106\xbd')
          koi8_r: " ц╫" (u'\u0446\u256b')
       utf_16_be: " 쎽" (u'\uc3bd')
           johab: " 칳" (u'\uce73')
  iso2022_jp_ext: [INVALID]
--------------------

I don't see anything particularly striking there, but (a) this is a tiny sample, not a useful string; (b) I'm relying on my interpretation of what the original byte string must have been being correct; (c) Python's encoding support isn't absolutely comprehensive; and (d) I'm assuming the original inserting app only made a simple error in populating the database, rather than doing something really exciting with its text.

Can you provide a more complete sample of the text in the DB? It'd help if you could show the CAST(stringvar AS bytea) value of the strings, which will give the byte sequence each string is stored as in the server encoding.

--
Craig Ringer
On Thu, Nov 27, 2008 at 02:34:17AM +0900, Craig Ringer <craig@postnewspapers.com.au> wrote a message of 188 lines which said:

>> ERROR: character 0xc3bd of encoding "UTF8" has no equivalent in "WIN1251"
>
> Which it does not; that character is "쎽" (HANGUL SYLLABLE SSYEG)

No, I don't think so. I think that 0xc3bd is the binary value, and 0xc3bd in UTF-8 is U+00FD (LATIN SMALL LETTER Y WITH ACUTE), which is less surprising. It is in Latin-1 but not in WIN1251.

In KOI8-R, 0xFD is the Shcha, a common letter in Russian, so I wonder whether the database was really in Latin-1.
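[The two claims above are easy to verify. A quick check, written here as a Python 3 sketch even though the thread's other examples are Python 2:]

```python
# The error message's 0xc3bd, read as the two bytes 0xC3 0xBD,
# is the UTF-8 encoding of a single code point:
ch = b'\xc3\xbd'.decode('utf-8')
print(ch, hex(ord(ch)))          # the character and its code point

# And the single byte 0xFD in KOI8-R is the Russian letter Shcha:
print(b'\xfd'.decode('koi8_r'))
```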
Stephane Bortzmeyer wrote:
> On Thu, Nov 27, 2008 at 02:34:17AM +0900,
> Craig Ringer <craig@postnewspapers.com.au> wrote
> a message of 188 lines which said:
>
>>> ERROR: character 0xc3bd of encoding "UTF8" has no equivalent in "WIN1251"
>>
>> Which it does not; that character is "쎽" (HANGUL SYLLABLE SSYEG)
>
> No, I don't think so. I think that 0xc3bd is the binary value, and
> 0xc3bd in UTF-8 is U+00FD (LATIN SMALL LETTER Y WITH ACUTE) which is
> less surprising. It is in Latin-1 but not in WIN1251.

Yes, that would make a lot more sense. From there it's easy; see below.

> In KOI8-R, FD is the Shcha, a common letter in Russian, so I wonder if
> the database was really in Latin-1.

OK, so what we end up with is the following sequence (expressed in Python, which is always handy for encoding conversions etc):

  print "\xc3\xbd".decode("utf-8").encode("latin-1").decode("koi8_r")

to yield:

  Щ

Assuming that's right, what presumably happened is that the database was initialized as UTF-8 and loaded with data it was told was in the Latin-1 encoding, but which was actually in the KOI8-R encoding. UTF-8 sequences were therefore generated by interpreting the bytes of each KOI8-R character as the Latin-1 character with the same byte value, so:

  Input file:  0xfd  (latin-1: ý, koi8_r: Щ)
       |
       |  (input interpreted as latin-1)
       v
  Database: 0xc3bd  (utf-8: ý)

To recover the data you must reverse that process. Thankfully it's going to be 100% reversible, i.e. no information has been lost.

To create a tiny test table for the following explanation and fix code, I just ran:

  create table ss (x text);
  insert into ss (x) values (E'\xc3\xbd');

Now if I:

  set client_encoding = "WIN1251";

I get:

  test=> select * from ss;
  ERROR: character 0xc3bd of encoding "UTF8" has no equivalent in "WIN1251"

just like you.
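[The "100% reversible" claim can be sanity-checked outside the database. A Python 3 sketch, assuming the mangle really was "KOI8-R bytes misread as Latin-1, then stored as UTF-8":]

```python
# Round-trip every possible byte value through the mangling path and back:
#   koi8_r byte -> misread as latin-1 -> stored as utf-8    (the mangle)
#   utf-8 -> back to the latin-1 byte -> reread as koi8_r   (the fix)
for b in range(256):
    raw = bytes([b])
    mangled = raw.decode('latin-1').encode('utf-8')    # what the DB stored
    fixed = mangled.decode('utf-8').encode('latin-1')  # undo the misreading
    assert fixed == raw                                # original byte recovered
    fixed.decode('koi8_r')                             # and it's valid KOI8-R

print("all 256 byte values round-trip cleanly")
```

Since Latin-1 maps every byte to a distinct code point and KOI8-R assigns a character to every byte value, no input can be lost in either direction.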
With client_encoding set to utf-8 (which is what my machine uses) I get:

  test=> select * from ss;
   x
  ---
   ý
  (1 row)

PostgreSQL's convert() function:

  http://www.postgresql.org/docs/current/static/functions-string.html

may now be used to transform your data. Unlike convert_from and convert_to, it doesn't assume any encoding for the input string, so you can use a statement like this to convert your data (where 'x' is the column of mangled data to be converted):

  select convert_from( convert(x::bytea, 'utf-8', 'latin-1'), 'koi8_r') from ss;

In other words: "Decode the utf-8 sequence in the input and map each utf-8 code point to the corresponding latin-1 character, outputting one byte per latin-1 character. Interpret the sequence of bytes just produced as a sequence of characters in the koi8_r encoding, and map them to the same characters in the database's internal encoding."

To copy the converted data to a new table:

  CREATE TABLE converted (y text);
  INSERT INTO converted (y)
    SELECT convert_from( convert(x::bytea, 'utf-8', 'latin-1'), 'koi8_r')
    FROM ss;

Now if I SELECT from the table of converted data, I get the right(?) output:

  test=> select * from converted;
   y
  ---
   Щ
  (1 row)

You can easily wrap that up in a simple SQL function:

  CREATE OR REPLACE FUNCTION fixstr(text) RETURNS text AS $$
    SELECT convert_from(convert($1::bytea, 'utf-8', 'latin-1'), 'koi8_r')
  $$ LANGUAGE 'SQL' IMMUTABLE;

so you can just:

  test=> select fixstr(x) from ss;
   fixstr
  --------
   Щ
  (1 row)

--
Craig Ringer
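[If you'd rather repair exported data outside the database, the same transformation is a one-liner in Python. A Python 3 sketch; fix_str is my own name, not part of the SQL above:]

```python
def fix_str(mangled: str) -> str:
    """Reverse the latin-1-for-koi8_r mix-up: take the text as it reads
    out of the UTF-8 database, recover the original bytes via latin-1,
    and reinterpret those bytes as KOI8-R."""
    return mangled.encode('latin-1').decode('koi8_r')

# The database's bytes 0xC3 0xBD decode (as UTF-8) to 'ý';
# fixing that yields the Russian letter Shcha:
print(fix_str(b'\xc3\xbd'.decode('utf-8')))  # Щ
```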