Thread: ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1 possible solution

ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1 possible solution

From
Mauricio Hernández Durán
Date:
Hi all!

We encountered the same problem most people have had using LATIN1 or
UNICODE for Spanish characters on inserts or updates:

ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1

We checked the mailing lists and found solutions from people who had
the same problem. We would like to share ours.
We applied the changes proposed by others, which consist of:

1. Set the database encoding to UNICODE or LATIN1 (check this with psql
-l to get a list of the databases and their respective encodings).
2. Set the client connection pool encoding in the connection URL:
?charSet=LATIN1
3. Set the charset directive in your JSPs to ISO-8859-1 (LATIN1).

AND

4. Instead of changing the code to work with streams of bytes and their
encoding, as suggested by other postings, we changed the encoding system
property for the JVM using the -Dfile.encoding=ISO-8859-1 option.

Mind you: since we had no backwards compatibility problems with other
legacy apps running on the server this did the trick for us.

As a side note: the app worked perfectly on our testing environment
(Windows XP, JBoss 3.2.3, Postgres 7.4) but not on production (Solaris
8, JBoss 3.2.3, Postgres 7.4), the difference being the default
file.encoding system property.
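
To illustrate why a different default encoding can break an otherwise identical app, here is a minimal sketch. It is written in Python rather than Java, so it is only an analogy for the JVM's file.encoding behaviour; the sample word and variable names are assumptions chosen for the demonstration.

```python
# The same text becomes different bytes depending on the charset in use.
text = "canci\u00f3n"  # "canción"; U+00F3 is LATIN SMALL LETTER O WITH ACUTE

latin1_bytes = text.encode("iso-8859-1")  # one byte per character
utf8_bytes = text.encode("utf-8")         # U+00F3 takes two bytes here

# If one side writes UTF-8 and the other side assumes ISO-8859-1,
# the accented letter turns into two unrelated characters.
mojibake = utf8_bytes.decode("iso-8859-1")

print(latin1_bytes)
print(utf8_bytes)
print(mojibake)
```

When the writer and the reader agree on the charset, the round trip is lossless; the trouble starts only when the default differs between machines.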


Hope it helps. Comments on this solution are welcome!


Re: ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1 possible solution

From
Guillaume Cottenceau
Date:
Mauricio Hernández Durán <mhernandez 'at' ingenian.com> writes:

> Hi all!
>
> We encountered the same problem most people have had using latin1 or
> unicode for spanish characters upon inserting or updates:
>
> ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1

Iconv actually agrees that this UTF-8 character cannot be
converted to ISO8859-1.

I can print UTF-8's 0x00EF which gives "ï".

Then if I manually input "ï", the bytes in UTF-8 to do that are
0xC3AF, and this can be converted to ISO8859-1 (it is 0xEF).

Isn't there a problem with your UTF-8 data containing 0x00EF?
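
A minimal Python sketch of the conversion described above, standing in for the iconv test (the variable names are illustrative assumptions):

```python
# "ï" is U+00EF, LATIN SMALL LETTER I WITH DIAERESIS.
i_diaeresis = "\u00ef"

# Its UTF-8 encoding is the two bytes 0xC3 0xAF.
utf8 = i_diaeresis.encode("utf-8")

# Transcoding those UTF-8 bytes to ISO-8859-1 yields the single byte 0xEF.
latin1 = utf8.decode("utf-8").encode("iso-8859-1")

print(utf8.hex(), latin1.hex())
```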

--
Guillaume Cottenceau

Re: ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1 possible solution

From
Anders Hermansen
Date:
* Guillaume Cottenceau (gc@mnc.ch) wrote:
> Isn't there a problem with your UTF-8 data containing 0x00EF?

E0 to EF hex (224 to 239): first byte of a three-byte sequence.
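
The lead-byte rule can be sketched in Python as follows. The function name is hypothetical, and the ranges follow the UTF-8 definition in RFC 3629 as I understand it:

```python
def utf8_sequence_length(lead: int) -> int:
    """Return the sequence length announced by a UTF-8 lead byte.

    Returns 0 if the byte cannot start a sequence.
    """
    if lead < 0x80:
        return 1  # ASCII, single byte
    if 0xC2 <= lead <= 0xDF:
        return 2  # first byte of a two-byte sequence
    if 0xE0 <= lead <= 0xEF:
        return 3  # first byte of a three-byte sequence
    if 0xF0 <= lead <= 0xF4:
        return 4  # first byte of a four-byte sequence
    return 0      # continuation byte (0x80-0xBF) or never valid as a lead


print(utf8_sequence_length(0xEF))  # 0xEF announces a three-byte sequence
```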


Anders Hermansen

Re: ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1 possible solution

From
Anders Hermansen
Date:
* Guillaume Cottenceau (gc@mnc.ch) wrote:
> Anders Hermansen <anders 'at' yoyo.no> writes:
> > * Guillaume Cottenceau (gc@mnc.ch) wrote:
> > > Isn't there a problem with your UTF-8 data containing 0x00EF?
> >
> > E0 to EF hex (224 to 239): first byte of a three-byte sequence.
>
> Well 00 is first byte here, isn't it?

UTF-8 is a byte sequence, so it's not about the first byte in the whole
sequence. But about the first byte in a three-byte sequence.

There should be no nul (0) bytes when encoding UTF-8. I believe this is in the
specification to allow it to be compatible with C nul-terminated strings.

I believe that the byte sequence 0x00EF is illegal UTF-8 because:
1) It contains nul (0x00) byte
2) 0xEF is not followed by two more bytes

On the other hand U+00EF is a valid Unicode code point, which points to:
LATIN SMALL LETTER I WITH DIAERESIS
It is encoded as 0xC3AF in UTF-8
As 0x00EF in UTF-16 (and UCS-2 ?)
As 0xEF in ISO-8859-1
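
A small Python check of the byte-sequence reading, using Python's strict UTF-8 decoder as a stand-in (the helper name is an assumption). One caveat worth noting: this decoder accepts a lone 0x00 byte as U+0000, so in this sketch it is the truncated 0xEF, not the nul byte, that causes the failure.

```python
def try_decode_utf8(data: bytes):
    """Return (text, None) on success, or (None, (offset, reason)) on failure."""
    try:
        return data.decode("utf-8"), None
    except UnicodeDecodeError as exc:
        return None, (exc.start, exc.reason)


nul_only = try_decode_utf8(b"\x00")       # decodes to U+0000
two_bytes = try_decode_utf8(b"\x00\xef")  # fails at offset 1 (the 0xEF)
proper = try_decode_utf8(b"\xc3\xaf")     # decodes to U+00EF

print(nul_only, two_bytes, proper)
```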


Anders Hermansen

Re: ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1

From
Markus Schaber
Date:
Hi, Guillaume,

Guillaume Cottenceau wrote:
>>ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1
> Iconv actually agrees that this UTF-8 character cannot be
> converted to ISO8859-1.
>
> I can print UTF-8's 0x00EF which gives "ï".
>
> Then if I manually input "ï", the bytes in UTF-8 to do that are
> 0xC3AF, and this can be converted to ISO8859-1 (it is 0xEF).
>
> Isn't there a problem with your UTF-8 data containing 0x00EF?

Maybe it is UTF-16 in Network byte order.
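
This guess is easy to test in Python. The sketch below decodes the two bytes 0x00 0xEF as UTF-16 in both byte orders (variable names are illustrative):

```python
raw = b"\x00\xef"

# Network byte order is big-endian: the code unit is 0x00EF, i.e. "ï".
as_utf16_be = raw.decode("utf-16-be")

# Read as little-endian, the code unit is 0xEF00, a private-use code point.
as_utf16_le = raw.decode("utf-16-le")

print(hex(ord(as_utf16_be)), hex(ord(as_utf16_le)))
```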

Markus



Re: ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1 possible solution

From
Guillaume Cottenceau
Date:
Anders Hermansen <anders 'at' yoyo.no> writes:

> * Guillaume Cottenceau (gc@mnc.ch) wrote:
> > Anders Hermansen <anders 'at' yoyo.no> writes:
> > > * Guillaume Cottenceau (gc@mnc.ch) wrote:
> > > > Isn't there a problem with your UTF-8 data containing 0x00EF?
> > >
> > > E0 to EF hex (224 to 239): first byte of a three-byte sequence.
> >
> > Well 00 is first byte here, isn't it?
>
> UTF-8 is a byte sequence, so it's not about the first byte in the whole
> sequence. But about the first byte in a three-byte sequence.

Yes. I forgot that you assumed the machine was big-endian. So the
UTF-8 character is here probably first byte 0xEF, second byte
0x00?

I did my test with first byte 0x00 and second byte 0xEF, hence
confusion with your initial comment.

My reasoning was that if the first byte of this two-byte
sequence is 0x00 then the rule that 0xEF is first byte of a
three-byte sequence doesn't apply, since 0xEF is second byte in
the sequence.

> There should be no nul (0) bytes when encoding UTF-8. I believe
> this is in the specification to allow it to be compatible with
> C nul-terminated strings.
>
> I believe that the byte sequence 0x00EF is illegal UTF-8 because:
> 1) It contains nul (0x00) byte
> 2) 0xEF is not followed by two more bytes
>
> On the other hand U+00EF is a valid unicode code point. Which points to:

I think this is assumed little-endian, e.g. first byte 0x00 and
second byte 0xEF (especially because UTF-8 is just a series of
bytes without any endianness aspects, so it makes good sense to
actually read this left-to-right, e.g. byte 0x00 first).

> LATIN SMALL LETTER I WITH DIAERESIS
> It is encoded as 0xC3AF in UTF-8
> As 0x00EF in UTF-16 (and UCS-2 ?)

Yes to "and UCS-2". Two-byte sequences in UCS-2 and UTF-16 are
the same[1].

> As 0xEF in ISO-8859-1

Hum I think I may understand what's going on here. It's possible
that in the message:

        ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1

when they say "0x00ef" they don't talk about UTF-8 per se but
they use the Unicode representation (which is error prone).


Ref:
[1] UCS-2 is a subset of UTF-16 which comprises all the 2-byte
    sequence characters but none of the 4-byte (surrogate pair) sequences
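
A short Python sketch of the footnote: characters in the Basic Multilingual Plane take one 2-byte code unit in UTF-16 (the part shared with UCS-2), while characters beyond it take a 4-byte surrogate pair. The helper name and the sample character U+1D11E are assumptions picked for illustration.

```python
def utf16_length(ch: str) -> int:
    """Number of bytes the character occupies in UTF-16 (no BOM)."""
    return len(ch.encode("utf-16-be"))


bmp_len = utf16_length("\u00ef")         # "ï", inside the BMP
astral_len = utf16_length("\U0001d11e")  # MUSICAL SYMBOL G CLEF, outside the BMP

# The 4-byte form is a high surrogate followed by a low surrogate.
surrogate_pair = "\U0001d11e".encode("utf-16-be")

print(bmp_len, astral_len, surrogate_pair.hex())
```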

--
Guillaume Cottenceau

Re: ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1 possible solution

From
Vadim Nasardinov
Date:
On Wednesday 27 April 2005 07:54, Anders Hermansen wrote:
> On the other hand U+00EF is a valid unicode code point. Which points to:
> LATIN SMALL LETTER I WITH DIAERESIS
> It is encoded as 0xC3AF in UTF-8
> As 0x00EF in UTF-16 (and UCS-2 ?)
> As 0xEF in ISO-8859-1

http://www.eki.ee/letter/chardata.cgi?ucode=ef

Re: ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1 possible solution

From
Guillaume Cottenceau
Date:
Vadim Nasardinov <vadimn 'at' redhat.com> writes:

> On Wednesday 27 April 2005 07:54, Anders Hermansen wrote:
> > On the other hand U+00EF is a valid unicode code point. Which points to:
> > LATIN SMALL LETTER I WITH DIAERESIS
> > It is encoded as 0xC3AF in UTF-8
> > As 0x00EF in UTF-16 (and UCS-2 ?)
> > As 0xEF in ISO-8859-1
>
> http://www.eki.ee/letter/chardata.cgi?ucode=ef

Which is surprising, because this can totally be encoded in
ISO8859-1.

--
Guillaume Cottenceau

Re: ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1 possible solution

From
Anders Hermansen
Date:
* Guillaume Cottenceau (gc@mnc.ch) wrote:
> Anders Hermansen <anders 'at' yoyo.no> writes:
> > UTF-8 is a byte sequence, so it's not about the first byte in the whole
> > sequence. But about the first byte in a three-byte sequence.
>
> Yes. I forgot that you assumed the machine was big-endian. So the
> UTF-8 character is here probably first byte 0xEF, second byte
> 0x00?
>
> I did my test with first byte 0x00 and second byte 0xEF, hence
> confusion with your initial comment.
>
> My reasoning was that if the first byte of this two-byte
> sequence is 0x00 then the rule that 0xEF is first byte of a
> three-byte sequence doesn't apply, since 0xEF is second byte in
> the sequence.

Endianness is not a problem when working with a sequence of bytes (8-bit)
like in UTF-8. It only becomes a problem when you deal with more than 1
byte representing 1 value. So it's an issue in UTF-16, which is big-endian by
default I think.
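
A minimal Python sketch of that distinction (variable names are illustrative): the 16-bit code unit of UTF-16 can be serialized in two byte orders, whereas UTF-8 has only one byte sequence per character.

```python
ch = "\u00ef"  # "ï"

# One 16-bit value, two possible serializations.
utf16_big = ch.encode("utf-16-be")     # high byte first
utf16_little = ch.encode("utf-16-le")  # low byte first

# UTF-8 works on 8-bit units, so there is nothing to swap.
utf8 = ch.encode("utf-8")

print(utf16_big.hex(), utf16_little.hex(), utf8.hex())
```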

So I interpreted the message "ERROR: could not convert UTF-8 character 0x00ef
to ISO8859-1" as a byte sequence with 0x00 first, and then 0xef. Maybe that's
a wrong assumption.

> > There should be no nul (0) bytes when encoding UTF-8. I believe
> > this is in the specification to allow it to be compatible with
> > C nul-terminated strings.
> >
> > I believe that the byte sequence 0x00EF is illegal UTF-8 because:
> > 1) It contains nul (0x00) byte
> > 2) 0xEF is not followed by two more bytes
> >
> > On the other hand U+00EF is a valid unicode code point. Which points to:
>
> I think this is assumed little-endian, e.g. first byte 0x00 and
> second byte 0xEF (especially because UTF-8 is just a series of
> bytes without any endianness aspects, so it makes good sense to
> actually read this left-to-right, e.g. byte 0x00 first).

As I said above, endianness is not an issue for UTF-8. The byte _sequence_ is
always read from start to end.

> > LATIN SMALL LETTER I WITH DIAERESIS
> > It is encoded as 0xC3AF in UTF-8
> > As 0x00EF in UTF-16 (and UCS-2 ?)
>
> Yes to "and UCS-2". Two-byte sequences in UCS-2 and UTF-16 are
> the same[1].

Yes.

> > As 0xEF in ISO-8859-1
>
> Hum I think I may understand what's going on here. It's possible
> that in the message:
>
>         ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1
>
> when they say "0x00ef" they don't talk about UTF-8 per se but
> they use the unicode representation (which is error prone).

If 0x00ef refers to a Unicode code point, it should not have been a problem to
convert it to ISO-8859-1 (0xef).

If 0x00ef refers to a byte sequence, then the error message is a bit
misleading because it's not a character but a byte sequence. And the error
is decoding the UTF-8, not encoding the ISO-8859-1.


Anders Hermansen

Guillaume Cottenceau <gc@mnc.ch> writes:
> My reasoning was that if the first byte of this two-byte
> sequence is 0x00 then the rule that 0xEF is first byte of a
> three-byte sequence doesn't apply, since 0xEF is second byte in
> the sequence.

Looking at the source code, it's clear that it's reporting just the
first byte of the sequence; the 00 is redundant and probably shouldn't
be in the message.

There seem to be two possibilities: either there is a valid 3-byte
UTF8 character, which cannot be converted to LATIN1; or the alleged
UTF8 data isn't really UTF8 at all.
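
The two possibilities can be told apart with a small check. This is a Python sketch using the standard codecs, not the backend's conversion code; the function name and return strings are assumptions.

```python
def diagnose(data: bytes) -> str:
    """Classify bytes that are supposed to be UTF-8 and destined for LATIN1."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return "not valid UTF-8"
    try:
        text.encode("iso-8859-1")
    except UnicodeEncodeError:
        return "valid UTF-8, but not representable in ISO-8859-1"
    return "convertible"


print(diagnose(b"\xe2\x80\x93"))  # a valid three-byte character outside LATIN1
print(diagnose(b"\x00\xef"))      # 0xEF is not followed by two more bytes
print(diagnose(b"\xc3\xaf"))      # "ï", fits in LATIN1
```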

            regards, tom lane

Re: ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1 possible solution

From
Anders Hermansen
Date:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Looking at the source code, it's clear that it's reporting just the
> first byte of the sequence; the 00 is redundant and probably shouldn't
> be in the message.

Yes, the error message can be a bit confusing. I investigated an error I
got when using psql. I did a select and got the message:
"ERROR:  could not convert UTF-8 character 0x00e2 to ISO8859-1"

When looking at the database dump the byte sequence is 0xE2 0x80 0x93, which
is valid UTF-8 (U+2013 EN DASH), but cannot be converted because the
character is not found in ISO-8859-1.
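
To locate such characters in a piece of text, something like the following Python sketch could be used. The function name and the sample string are assumptions; the check relies on ISO-8859-1 covering exactly the code points U+0000 to U+00FF.

```python
import unicodedata


def non_latin1_chars(text: str):
    """List (index, code point, name) for characters outside ISO-8859-1."""
    return [
        (i, "U+%04X" % ord(ch), unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ord(ch) > 0xFF
    ]


# Sample data containing the byte sequence 0xE2 0x80 0x93 from the dump.
sample = b"1990\xe2\x80\x932005".decode("utf-8")
hits = non_latin1_chars(sample)
print(hits)
```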

If I start up a UTF-8 xterm and psql with UNICODE encoding, then everything
works as expected.

> There seem to be two possibilities: either there is a valid 3-byte
> UTF8 character, which cannot be converted to LATIN1; or the alleged
> UTF8 data isn't really UTF8 at all.

Yes. Maybe the error messages can be changed so that what actually went
wrong is more clear? And possibly printing the whole 3-byte sequence?


Anders Hermansen

Anders Hermansen <anders@yoyo.no> writes:
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>> Looking at the source code, it's clear that it's reporting just the
>> first byte of the sequence; the 00 is redundant and probably shouldn't
>> be in the message.

> Yes. Maybe the error messages can be changed so that what actually went
> wrong is more clear? And possibly printing the whole 3-byte sequence?

Any volunteers for that?  The specific message in question is in
src/backend/utils/mb/conversion_procs/utf8_and_iso8859_1/utf8_and_iso8859_1.c

        else if ((c & 0xe0) == 0xe0)
            elog(ERROR, "could not convert UTF8 character 0x%04x to ISO8859-1",
                 c);

Aside from being unhelpful as to the exact input data, this is wrong in
another way: it ought to be an ereport() not elog(), because it's
certainly not a can't-happen kind of error.
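
The effect of the "0x%04x" format on a single lead byte can be reproduced in a few lines of Python. This is only an illustration of the formatting, mirroring the format string quoted above; the variable names are assumptions.

```python
lead = 0xEF  # only the first byte of the sequence is reported

# Same test as in the C snippet: the top three bits of the byte are set.
is_flagged = (lead & 0xE0) == 0xE0

# Padding one byte to four hex digits is what produces the redundant "00".
message = "could not convert UTF8 character 0x%04x to ISO8859-1" % lead

print(is_flagged, message)
```

The same formatting applied to 0xE2 gives 0x00e2, matching the psql message reported earlier in the thread.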

A little bit of grepping turns up a number of similarly deficient
elog and ereport calls in the src/backend/utils/mb/ tree.

There is more useful code for constructing a character description in
pg_verifymbstr() in src/backend/utils/mb/wchar.c.  Probably what ought
to happen is to split out a small subroutine along the lines of
    char *describe_mb_char(const unsigned char *mbstr, int len)
(returning a palloc'd string "0x....") and then make all the places
that complain about bad multibyte input use it.
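
As a rough illustration of what such a helper would return, here is a Python analog. It is not backend code: the real routine would be C and return a palloc'd string, and this sketch only mirrors the proposed name and arguments.

```python
def describe_mb_char(mbstr: bytes, length: int) -> str:
    """Return "0x...." covering all `length` bytes of the character."""
    return "0x" + mbstr[:length].hex()


# The whole three-byte sequence is shown, not just the lead byte.
description = describe_mb_char(b"\xe2\x80\x93", 3)
print("could not convert UTF8 character %s to ISO8859-1" % description)
```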

Don't have time to deal with it myself, but it seems like a pretty easy
project for anyone wanting to dip their toes in the backend.

            regards, tom lane