> Okay. I have NO IDEA why this works. If someone could enlighten me as
> to the math involved I'd appreciate it. First, a little background:
>
> The Euro symbol is unicode value 0x20AC.
That's most significant byte first, in UTF-16.
> UTF-8 encoding is a way of
> representing most unicode characters in two bytes,
err, no.
> and most latin
> characters in one byte.
UTF-8 is a way to transform the standard code points of a large, fixed
width encoding to a variable width encoding. For Unicode, the number of
octets (Octet is explicitly 8 bits, since byte is not necessarily 8 bits)
can vary between one and four. See RFC 2279:
http://www.ietf.org/rfc/rfc2279.txt
http://www.unicode.org/book/preview/ch02.pdf
Those two pages should clear your question up.
> The only way I have found to insert a euro symbol into the database
> from the command line psql client is this:
>
> INSERT INTO mytable VALUES('\342\202\254');
>
> I don't know why this works. In hex, those octal values are:
> E2 82 AC
That's UTF-8.
> I don't know why my "20" byte turned into two bytes of E2 and 82.
In the conversion from UTF-16 to UTF-8. (You converted it, right?)
> Furthermore, I was under the impression that a UTF-8 encoding of the
> Euro sign only took two bytes.
Let's see. From a table on the RFC page, I note that 0x00 to 0x7f fit in
one octet, 0x0080 to 0x07ff fits in two, and 0x0800 to 0xffff fits in
three. 0x20ac is definitely greater than 0x07ff.
> Corroborating this assumption, upon
> dumping that table with pg_dump and examining the resultant file in
> a hex editor, I see this in that character position: AC 20
That's called looking at a UTF-16 character as if it were an integer on
a byte-backwards, I mean, least-significant byte first CPU. (See "Byte
Order Mark" in the unicode.org link I pasted in above.
> Furthermore, according to the psql online documentation and man page:
> "Anything contained in single quotes is furthermore subject to C-like
> substitutions for \n (new line), \t (tab), \digits, \0digits, and
> \0xdigits (the character with the given decimal, octal, or hexadecimal
> code)."
I'm having trouble finding that page. Do you have a URL for it? (But,
then again, I might not be the person to ask this question. It seems
like I saw something about it somewhere, maybe it's in the list archives
or something.)
> Those digits *should* be interpreted as decimal digits, but they aren't.
> The man page for psql is either incorrect, or the implementation is
> buggy.
>
> It's worth noting that the field I'm inserting into is an SQL_ASCII
> field, and I'm reading my UTF-8 string out of it like this, via JDBC:
>
> String value = new String( resultset.getBytes(1), "UTF-8");
>
> Can anyone help me make sense of this mumbo jumbo?
HTH
--
Joel Rees <joel@alpsgiken.gr.jp>