* Guillaume Cottenceau (gc@mnc.ch) wrote:
> Anders Hermansen <anders 'at' yoyo.no> writes:
> > UTF-8 is a byte sequence, so it's not about the first byte in the whole
> > sequence, but about the first byte in a three-byte sequence.
>
> Yes. I forgot that you assumed the machine was big-endian. So the
> UTF-8 character here is probably first byte 0xEF, second byte
> 0x00?
>
> I did my test with first byte 0x00 and second byte 0xEF, hence
> the confusion with your initial comment.
>
> My reasoning was that if the first byte of this two-byte
> sequence is 0x00, then the rule that 0xEF is the first byte of a
> three-byte sequence doesn't apply, since 0xEF is the second byte
> in the sequence.
Endianness is not a problem when working with a sequence of bytes (8-bit)
as in UTF-8. It only becomes a problem when more than one byte represents
a single value. So it's an issue in UTF-16, which I think is big-endian by
default.
So I interpreted the message "ERROR: could not convert UTF-8 character 0x00ef
to ISO8859-1" as a byte sequence with 0x00 first, and then 0xef. Maybe that's
a wrong assumption.
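That reading is easy to check. A minimal sketch in Python (the actual
error comes from a C program, so this only mirrors the decoding rules):

    # Assume the error names the bytes in order: 0x00, then 0xEF.
    data = b"\x00\xef"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # 0x00 decodes fine as NUL; 0xEF then announces a three-byte
        # sequence, but no continuation bytes follow, so decoding fails.
        print(err)  # ... can't decode byte 0xef in position 1: unexpected end of data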
> > There should be no nul (0) bytes when encoding UTF-8. I believe
> > this is in the specification to allow it to be compatible with
> > C nul-terminated strings.
> >
> > I believe that the byte sequence 0x00EF is illegal UTF-8 because:
> > 1) It contains nul (0x00) byte
> > 2) 0xEF is not followed by two more bytes
> >
> > On the other hand U+00EF is a valid unicode code point. Which points to:
>
> I think this is assumed little-endian, e.g. first byte 0x00 and
> second byte 0xEF (especially because UTF-8 is just a series of
> bytes without any endianness aspects, so it makes good sense to
> actually read this left-to-right, e.g. byte 0x00 first).
As I said above, endianness is not an issue for UTF-8. The byte _sequence_ is
always read from start to end.
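To illustrate the difference, here is a small Python sketch encoding
U+00EF both ways; only UTF-16 needs an endianness choice:

    ch = "\u00ef"  # LATIN SMALL LETTER I WITH DIAERESIS
    print(ch.encode("utf-8"))      # b'\xc3\xaf' -- one fixed byte order
    print(ch.encode("utf-16-be"))  # b'\x00\xef' -- big-endian
    print(ch.encode("utf-16-le"))  # b'\xef\x00' -- little-endian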
> > LATIN SMALL LETTER I WITH DIAERESIS
> > It is encoded as 0xC3AF in UTF-8
> > As 0x00EF in UTF-16 (and UCS-2 ?)
>
> Yes to "and UCS-2". Two-byte sequences in UCS-2 and UTF-16 are
> the same[1].
Yes.
> > As 0xEF in ISO-8859-1
>
> Hum I think I may understand what's going on here. It's possible
> that in the message:
>
> ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1
>
> when they say "0x00ef" they don't talk about UTF-8 per se but
> use the unicode representation (which is error prone).
If 0x00ef refers to a unicode codepoint, it should not have been a problem to
convert it to ISO-8859-1 (0xef).
If 0x00ef refers to a byte sequence, then the error message is a bit
misleading, because it names a character when it means a byte sequence. And
the error is then in decoding the UTF-8, not in encoding to ISO-8859-1.
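Both readings are easy to demonstrate (again a Python sketch, just to
mirror the logic):

    # Codepoint reading: U+00EF converts to ISO-8859-1 without trouble.
    print("\u00ef".encode("iso-8859-1"))  # b'\xef'
    # The well-formed UTF-8 for the same character converts fine too.
    print(b"\xc3\xaf".decode("utf-8").encode("iso-8859-1"))  # b'\xef'
    # Byte-sequence reading: the failure happens while *decoding* the
    # UTF-8 input, before any ISO-8859-1 encoding is attempted.
    b"\x00\xef".decode("utf-8")  # raises UnicodeDecodeError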
Anders Hermansen