Thread: ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1 possiblesolution
ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1 possiblesolution
From
Mauricio Hernández Durán
Date:
Hi all!

We encountered the same problem most people have had when using LATIN1 or UNICODE for Spanish characters on inserts or updates:

ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1

We checked the mailing lists and found solutions from people who had the same problem; we would like to share ours.

We tried the changes proposed by others, which consist of:

1. Set the database encoding to UNICODE or LATIN1 (check this with psql -l to get a list of the databases and their respective encodings).
2. Set the client connection pool encoding in the connection URL: ?charSet=LATIN1
3. Set the charset directive in your JSPs to ISO-8859-1 (LATIN1).

AND

4. Instead of changing the code to work with streams of bytes and their encoding, as suggested by other postings, we changed the encoding system property for the JVM using the -Dfile.encoding=ISO-8859-1 option.

Mind you: since we had no backwards compatibility problems with other legacy apps running on the server, this did the trick for us.

As a side note: the app worked perfectly in our testing environment (Windows XP, JBoss 3.2.3, Postgres 7.4) but not in production (Solaris 8, JBoss 3.2.3, Postgres 7.4), the difference being the default file.encoding system property.

Hope it helps; comments on this solution are welcome!
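A minimal sketch, in Python rather than the Java stack described above, of why two environments with different default encodings can disagree about the same text. The sample string and the helper name decodes_as_utf8 are made up for illustration; this does not reproduce what the JVM or the JDBC driver actually does internally.

```python
# Sample Spanish text, chosen only for illustration.
text = "a\u00f1o"  # "año"

# The same text produces different bytes under each encoding.
latin1_bytes = text.encode("iso-8859-1")  # one byte for "ñ"
utf8_bytes = text.encode("utf-8")         # two bytes for "ñ"

# A component that assumes Latin-1 silently misreads UTF-8 bytes:
# each byte becomes its own character.
misread = utf8_bytes.decode("iso-8859-1")


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    print(latin1_bytes.hex(), utf8_bytes.hex(), repr(misread))
    print(decodes_as_utf8(latin1_bytes), decodes_as_utf8(utf8_bytes))
```

The point of the sketch is only that every layer (JSP, JVM, driver, database) has to agree on one encoding, which is what steps 1 to 4 try to enforce.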
Re: ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1 possiblesolution
From
Guillaume Cottenceau
Date:
Mauricio Hernández Durán <mhernandez 'at' ingenian.com> writes:

> Hi all!
>
> We encountered the same problem most people have had using latin1 or
> unicode for spanish characters upon inserting or updates:
>
> ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1

Iconv actually agrees that this UTF-8 character cannot be converted to ISO8859-1.

I can print UTF-8's 0x00EF which gives "ï".

Then if I manually input "ï", the bytes in UTF-8 to do that are 0xC3AF, and this can be converted to ISO8859-1 (it is 0xEF).

Isn't there a problem with your UTF-8 data containing 0x00EF?

--
Guillaume Cottenceau
Re: ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1 possiblesolution
From
Anders Hermansen
Date:
* Guillaume Cottenceau (gc@mnc.ch) wrote:
> Isn't there a problem with your UTF-8 data containing 0x00EF?

E0 to EF hex (224 to 239): first byte of a three-byte sequence.

Anders Hermansen
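The lead-byte rule quoted above can be sketched as a small lookup. The function name utf8_sequence_length is hypothetical, and the ranges follow the UTF-8 definition in RFC 3629 as I understand it, so treat this as an illustration rather than a validator.

```python
def utf8_sequence_length(lead: int) -> int:
    """Return the sequence length announced by a UTF-8 lead byte.

    Returns 0 for bytes that cannot start a sequence (continuation
    bytes 0x80-0xBF and the lead values UTF-8 never uses).
    """
    if lead < 0x80:
        return 1  # plain ASCII, including 0x00
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3  # the range mentioned in the message above
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


if __name__ == "__main__":
    for byte in (0x00, 0xC3, 0xE2, 0xEF, 0xAF):
        print(f"0x{byte:02X} -> {utf8_sequence_length(byte)}")
```

Under this reading, a lone 0xEF announces two more bytes that must follow it.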
Re: ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1 possiblesolution
From
Anders Hermansen
Date:
* Guillaume Cottenceau (gc@mnc.ch) wrote:
> Anders Hermansen <anders 'at' yoyo.no> writes:
> > * Guillaume Cottenceau (gc@mnc.ch) wrote:
> > > Isn't there a problem with your UTF-8 data containing 0x00EF?
> >
> > E0 to EF hex (224 to 239): first byte of a three-byte sequence.
>
> Well 00 is first byte here, isn't it?

UTF-8 is a byte sequence, so it's not about the first byte in the whole sequence, but about the first byte in a three-byte sequence.

There should be no nul (0) bytes when encoding UTF-8. I believe this is in the specification to allow it to be compatible with C nul-terminated strings.

I believe that the byte sequence 0x00EF is illegal UTF-8 because:
1) It contains a nul (0x00) byte
2) 0xEF is not followed by two more bytes

On the other hand, U+00EF is a valid Unicode code point, which points to:
LATIN SMALL LETTER I WITH DIAERESIS
It is encoded as 0xC3AF in UTF-8
As 0x00EF in UTF-16 (and UCS-2 ?)
As 0xEF in ISO-8859-1

Anders Hermansen
Hi, Guillaume,

Guillaume Cottenceau wrote:
>>ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1
>
> Iconv actually agrees that this UTF-8 character cannot be
> converted to ISO8859-1.
>
> I can print UTF-8's 0x00EF which gives "ï".
>
> Then if I manually input "ï", the bytes in UTF-8 to do that are
> 0xC3AF, and this can be converted to ISO8859-1 (it is 0xEF).
>
> Isn't there a problem with your UTF-8 data containing 0x00EF?

Maybe it is UTF-16 in network byte order.

Markus
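The encodings of U+00EF listed in this exchange, and the suggestion that the two bytes 00 EF might be UTF-16 in network (big-endian) byte order, can be checked with a few lines of Python. The variable and function names are illustrative only.

```python
CODE_POINT = 0x00EF  # LATIN SMALL LETTER I WITH DIAERESIS
ch = chr(CODE_POINT)

# Hex form of the character under each encoding discussed in the thread.
encoded = {
    "utf-8": ch.encode("utf-8").hex(),
    "utf-16-be": ch.encode("utf-16-be").hex(),
    "iso-8859-1": ch.encode("iso-8859-1").hex(),
}

# The two bytes 0x00 0xEF, read as big-endian UTF-16.
raw = bytes([0x00, 0xEF])
as_utf16_be = raw.decode("utf-16-be")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 (illustrative helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    print(encoded)
    print(repr(as_utf16_be), is_valid_utf8(raw))
```

Python's decoder rejects 00 EF as UTF-8 because 0xEF is not followed by continuation bytes; it does accept a lone 0x00 byte as the encoding of U+0000.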
Re: ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1 possiblesolution
From
Guillaume Cottenceau
Date:
Anders Hermansen <anders 'at' yoyo.no> writes:

> * Guillaume Cottenceau (gc@mnc.ch) wrote:
> > Anders Hermansen <anders 'at' yoyo.no> writes:
> > > * Guillaume Cottenceau (gc@mnc.ch) wrote:
> > > > Isn't there a problem with your UTF-8 data containing 0x00EF?
> > >
> > > E0 to EF hex (224 to 239): first byte of a three-byte sequence.
> >
> > Well 00 is first byte here, isn't it?
>
> UTF-8 is a byte sequence, so it's not about the first byte in the whole
> sequence. But about the first byte in a three-byte sequence.

Yes. I forgot that you assumed the machine was big-endian. So the UTF-8 character is here probably first byte 0xEF, second byte 0x00?

I did my test with first byte 0x00 and second byte 0xEF, hence the confusion with your initial comment.

My reasoning was that if the first byte of this two-byte sequence is 0x00, then the rule that 0xEF is the first byte of a three-byte sequence doesn't apply, since 0xEF is the second byte in the sequence.

> There should be no nul (0) bytes when encoding UTF-8. I believe
> this is in the specification to allow it to be compatible with
> C nul-terminated strings.
>
> I believe that the byte sequence 0x00EF is illegal UTF-8 because:
> 1) It contains nul (0x00) byte
> 2) 0xEF is not followed by two more bytes
>
> On the other hand U+00EF is a valid unicode code point. Which points to:

I think this is assumed little-endian, e.g. first byte 0x00 and second byte 0xEF (especially because UTF-8 is just a series of bytes without any endianness aspects, so it makes good sense to actually read this left-to-right, e.g. byte 0x00 first).

> LATIN SMALL LETTER I WITH DIAERESIS
> It is encoded as 0xC3AF in UTF-8
> As 0x00EF in UTF-16 (and UCS-2 ?)

Yes to "and UCS-2". Two-byte sequences in UCS-2 and UTF-16 are the same[1].

> As 0xEF in ISO-8859-1

Hum, I think I may understand what's going on here. It's possible that in the message:

ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1

when they say "0x00ef" they don't talk about UTF-8 per se but they use the Unicode representation (which is error prone).

Ref:
[1] UCS-2 is a subset of UTF-16 which comprises all the 2-byte sequence characters but none of the 4-byte (surrogate pair) sequence characters

--
Guillaume Cottenceau
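A short Python sketch of the byte-order question raised in this message. It assumes nothing about PostgreSQL; it only shows that a 16-bit value such as 0x00EF has two possible byte layouts, that the UTF-8 form of the same character has one, and that a character outside the two-byte range needs four bytes in UTF-16. All names are illustrative.

```python
code_unit = 0x00EF  # the 16-bit value under discussion

# A 16-bit value can be laid out in two byte orders.
big_endian = code_unit.to_bytes(2, "big")
little_endian = code_unit.to_bytes(2, "little")

# The explicit UTF-16 codecs produce exactly those two layouts.
utf16_be = "\u00ef".encode("utf-16-be")
utf16_le = "\u00ef".encode("utf-16-le")

# UTF-8 is defined byte by byte, so there is only one layout.
utf8 = "\u00ef".encode("utf-8")

# A character beyond U+FFFF needs a surrogate pair in UTF-16,
# i.e. four bytes; such characters have no UCS-2 form.
beyond_bmp_length = len(chr(0x10000).encode("utf-16-be"))

if __name__ == "__main__":
    print(big_endian.hex(), little_endian.hex(), utf8.hex(), beyond_bmp_length)
```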
Re: ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1 possiblesolution
From
Vadim Nasardinov
Date:
On Wednesday 27 April 2005 07:54, Anders Hermansen wrote:
> On the other hand U+00EF is a valid unicode code point. Which points to:
> LATIN SMALL LETTER I WITH DIAERESIS
> It is encoded as 0xC3AF in UTF-8
> As 0x00EF in UTF-16 (and UCS-2 ?)
> As 0xEF in ISO-8859-1

http://www.eki.ee/letter/chardata.cgi?ucode=ef
Re: ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1 possiblesolution
From
Guillaume Cottenceau
Date:
Vadim Nasardinov <vadimn 'at' redhat.com> writes:

> On Wednesday 27 April 2005 07:54, Anders Hermansen wrote:
> > On the other hand U+00EF is a valid unicode code point. Which points to:
> > LATIN SMALL LETTER I WITH DIAERESIS
> > It is encoded as 0xC3AF in UTF-8
> > As 0x00EF in UTF-16 (and UCS-2 ?)
> > As 0xEF in ISO-8859-1
>
> http://www.eki.ee/letter/chardata.cgi?ucode=ef

Which is surprising, because this can totally be encoded in ISO8859-1.

--
Guillaume Cottenceau
Re: ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1 possiblesolution
From
Anders Hermansen
Date:
* Guillaume Cottenceau (gc@mnc.ch) wrote:
> Anders Hermansen <anders 'at' yoyo.no> writes:
> > UTF-8 is a byte sequence, so it's not about the first byte in the whole
> > sequence. But about the first byte in a three-byte sequence.
>
> Yes. I forgot that you assumed the machine was big-endian. So the
> UTF-8 character is here probably first byte 0xEF, second byte
> 0x00?
>
> I did my test with first byte 0x00 and second byte 0xEF, hence
> confusion with your initial comment.
>
> My reasoning was that if the first byte of this two-byte
> sequence is 0x00 then the rule that 0xEF is first byte of a
> three-byte sequence doesn't apply, since 0xEF is second byte in
> the sequence.

Endianness is not a problem when working with a sequence of bytes (8-bit) like in UTF-8. It only becomes a problem when you deal with more than 1 byte representing 1 value. So it's an issue in UTF-16, which is big-endian by default, I think.

So I interpreted the message "ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1" as a byte sequence with 0x00 first, and then 0xef. Maybe that's a wrong assumption.

> > There should be no nul (0) bytes when encoding UTF-8. I believe
> > this is in the specification to allow it to be compatible with
> > C nul-terminated strings.
> >
> > I believe that the byte sequence 0x00EF is illegal UTF-8 because:
> > 1) It contains nul (0x00) byte
> > 2) 0xEF is not followed by two more bytes
> >
> > On the other hand U+00EF is a valid unicode code point. Which points to:
>
> I think this is assumed little-endian, e.g. first byte 0x00 and
> second byte 0xEF (especially because UTF-8 is just a series of
> bytes without any endianness aspects, so it makes good sense to
> actually read this left-to-right, e.g. byte 0x00 first).

As I said above, endianness is not an issue for UTF-8. The byte _sequence_ is always read from start to end.

> > LATIN SMALL LETTER I WITH DIAERESIS
> > It is encoded as 0xC3AF in UTF-8
> > As 0x00EF in UTF-16 (and UCS-2 ?)
>
> Yes to "and UCS-2". Two-byte sequences in UCS-2 and UTF-16 are
> the same[1].

Yes.

> > As 0xEF in ISO-8859-1
>
> Hum I think I may understand what's going on here. It's possible
> that in the message:
>
> ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1
>
> when they say "0x00ef" they don't talk about UTF-8 per se but
> they use the unicode representation (which is error prone).

If 0x00ef refers to a Unicode code point, it should not have been a problem to convert it to ISO-8859-1 (0xef).

If 0x00ef refers to a byte sequence, then the error message is a bit misleading, because it's not a character but a byte sequence. And the error is in decoding the UTF-8, not in encoding the ISO-8859-1.

Anders Hermansen
Re: ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1 possiblesolution
From
Tom Lane
Date:
Guillaume Cottenceau <gc@mnc.ch> writes:
> My reasoning was that if the first byte of this two-byte
> sequence is 0x00 then the rule that 0xEF is first byte of a
> three-byte sequence doesn't apply, since 0xEF is second byte in
> the sequence.

Looking at the source code, it's clear that it's reporting just the first byte of the sequence; the 00 is redundant and probably shouldn't be in the message.

There seem to be two possibilities: either there is a valid 3-byte UTF8 character, which cannot be converted to LATIN1; or the alleged UTF8 data isn't really UTF8 at all.

regards, tom lane
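The two possibilities named above can be told apart mechanically. The following Python sketch is not the backend's conversion code; the function name classify_utf8_to_latin1 and the returned labels are invented for illustration.

```python
def classify_utf8_to_latin1(data: bytes) -> str:
    """Say why (or whether) data fails a UTF-8 to ISO-8859-1 conversion."""
    # Possibility 2: the bytes are not well-formed UTF-8 at all.
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return "not valid UTF-8"

    # Possibility 1: valid UTF-8, but some character has no
    # ISO-8859-1 (LATIN1) equivalent.
    try:
        text.encode("iso-8859-1")
    except UnicodeEncodeError:
        return "valid UTF-8, no ISO-8859-1 equivalent"

    return "convertible"


if __name__ == "__main__":
    samples = (
        bytes([0xC3, 0xAF]),        # U+00EF as UTF-8
        bytes([0xE2, 0x80, 0x93]),  # a three-byte sequence
        bytes([0xEF]),              # a lead byte with nothing after it
    )
    for sample in samples:
        print(sample.hex(), "->", classify_utf8_to_latin1(sample))
```

Running a check like this on the stored bytes is one way to find out which of the two cases an error such as the one in the subject line belongs to.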
Re: ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1 possiblesolution
From
Anders Hermansen
Date:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Looking at the source code, it's clear that it's reporting just the
> first byte of the sequence; the 00 is redundant and probably shouldn't
> be in the message.

Yes, the error message can be a bit confusing.

I investigated an error I got when using psql. I did a select and got the message:

"ERROR: could not convert UTF-8 character 0x00e2 to ISO8859-1"

When looking at the database dump, the byte sequence is 0xE2 0x80 0x93, which is valid UTF-8 (U+2013 EN DASH), but cannot be converted because the character is not found in ISO-8859-1.

If I start up a UTF-8 xterm and psql with UNICODE encoding, then everything works as expected.

> There seem to be two possibilities: either there is a valid 3-byte
> UTF8 character, which cannot be converted to LATIN1; or the alleged
> UTF8 data isn't really UTF8 at all.

Yes. Maybe the error messages can be changed so that what actually went wrong is more clear? And possibly printing the whole 3-byte sequence?

Anders Hermansen
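The EN DASH case described above can be reproduced outside the database. This Python sketch lists the characters in a UTF-8 byte string that fall outside ISO-8859-1, which covers only code points up to U+00FF. The function name latin1_unrepresentable is hypothetical, and the input is assumed to be valid UTF-8.

```python
import unicodedata


def latin1_unrepresentable(data: bytes) -> list:
    """List (code point, name) for characters with no ISO-8859-1 form.

    Assumes data is valid UTF-8; a decode error propagates otherwise.
    """
    text = data.decode("utf-8")
    return [
        (f"U+{ord(c):04X}", unicodedata.name(c, "<unnamed>"))
        for c in text
        if ord(c) > 0xFF  # ISO-8859-1 stops at U+00FF
    ]


if __name__ == "__main__":
    # The byte sequence found in the database dump.
    print(latin1_unrepresentable(bytes([0xE2, 0x80, 0x93])))
    # The character from the subject line, for comparison.
    print(latin1_unrepresentable(bytes([0xC3, 0xAF])))
```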
Re: ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1 possiblesolution
From
Tom Lane
Date:
Anders Hermansen <anders@yoyo.no> writes:
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>> Looking at the source code, it's clear that it's reporting just the
>> first byte of the sequence; the 00 is redundant and probably shouldn't
>> be in the message.

> Yes. Maybe the error messages can be changed so that what actually went
> wrong is more clear? And possibly printing the whole 3-byte sequence?

Any volunteers for that? The specific message in question is in
src/backend/utils/mb/conversion_procs/utf8_and_iso8859_1/utf8_and_iso8859_1.c

    else if ((c & 0xe0) == 0xe0)
        elog(ERROR, "could not convert UTF8 character 0x%04x to ISO8859-1",
             c);

Aside from being unhelpful as to the exact input data, this is wrong in another way: it ought to be an ereport() not elog(), because it's certainly not a can't-happen kind of error.

A little bit of grepping turns up a number of similarly deficient elog and ereport calls in the src/backend/utils/mb/ tree. There is more useful code for constructing a character description in pg_verifymbstr() in src/backend/utils/mb/wchar.c. Probably what ought to happen is to split out a small subroutine along the lines of

    char *describe_mb_char(const unsigned char *mbstr, int len)

(returning a palloc'd string "0x....") and then make all the places that complain about bad multibyte input use it.

Don't have time to deal with it myself, but it seems like a pretty easy project for anyone wanting to dip their toes in the backend.

regards, tom lane
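To make the proposal concrete, here is a rough Python model of the two formatting approaches. It is not backend code: the real subroutine would be written in C and return a palloc'd string, and the helper names old_style_message and matches_quoted_test are invented here. Only describe_mb_char mirrors the name suggested above.

```python
def old_style_message(first_byte: int) -> str:
    """Mimic the quoted elog() format, which prints one byte as %04x."""
    return "could not convert UTF8 character 0x%04x to ISO8859-1" % first_byte


def matches_quoted_test(c: int) -> bool:
    """The condition from the quoted snippet: the top three bits are set."""
    return (c & 0xE0) == 0xE0


def describe_mb_char(mbstr: bytes, length: int) -> str:
    """Python stand-in for the proposed helper: hex of the whole character."""
    return "0x" + mbstr[:length].hex()


if __name__ == "__main__":
    en_dash = bytes([0xE2, 0x80, 0x93])
    if matches_quoted_test(en_dash[0]):
        print(old_style_message(en_dash[0]))  # only the first byte, zero padded
        print(describe_mb_char(en_dash, 3))   # the full three-byte sequence
```

The zero padding in the old format is where the misleading "00" comes from; printing every byte of the character removes the ambiguity debated earlier in the thread.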