Re: SOLUTION: Insert a Euro symbol as UTF-8 from a latin1 charset. - Mailing list pgsql-general
From | Joel Rees |
---|---|
Subject | Re: SOLUTION: Insert a Euro symbol as UTF-8 from a latin1 charset. |
Date | |
Msg-id | 20030616163913.D11A.JOEL@alpsgiken.gr.jp Whole thread Raw |
In response to | SOLUTION: Insert a Euro symbol as UTF-8 from a latin1 charset. (Roland Glenn McIntosh <roland@steeltorch.com>) |
List | pgsql-general |
> Okay. I have NO IDEA why this works. If someone could enlighten me as > to the math involved I'd appreciate it. First, a little background: > > The Euro symbol is unicode value 0x20AC. That's most significant byte first, in UTF-16. > UTF-8 encoding is a way of > representing most unicode characters in two bytes, err, no. > and most latin > characters in one byte. UTF-8 is a way to transform the standard code points of a large, fixed width encoding to a variable width encoding. For Unicode, the number of octets (Octet is explicitly 8 bits, since byte is not necessarily 8 bits) can vary between one and four. See RFC 2279: http://www.ietf.org/rfc/rfc2279.txt http://www.unicode.org/book/preview/ch02.pdf Those two pages should clear your question up. > The only way I have found to insert a euro symbol into the database > from the command line psql client is this: > > INSERT INTO mytable VALUES('\342\202\254'); > > I don't know why this works. In hex, those octal values are: > E2 82 AC That's UTF-8. > I don't know why my "20" byte turned into two bytes of E2 and 82. In the conversion from UTF-16 to UTF-8. (You converted it, right?) > Furthermore, I was under the impression that a UTF-8 encoding of the > Euro sign only took two bytes. Let's see. From a table on the RFC page, I note that 0x00 to 0x7f fit in one octet, 0x0080 to 0x07ff fits in two, and 0x0800 to 0xffff fits in three. 0x20ac is definitely greater than 0x07ff. > Corroborating this assumption, upon > dumping that table with pg_dump and examining the resultant file in > a hex editor, I see this in that character position: AC 20 That's called looking at a UTF-16 character as if it were an integer on a byte-backwards, I mean, least-significant byte first CPU. (See "Byte Order Mark" in the unicode.org link I pasted in above. > Furthermore, according to the psql online documentation and man page: > "Anything contained in single quotes is furthermore subject to C-like > substitutions for \n (new line), \t (tab), \digits, \0digits, and > \0xdigits (the character with the given decimal, octal, or hexadecimal > code)." I'm having trouble finding that page. Do you have a URL for it? (But, then again, I might not be the person to ask this question. It seems like I saw something about it somewhere, maybe it's in the list archives or something.) > Those digits *should* be interpreted as decimal digits, but they aren't. > The man page for psql is either incorrect, or the implementation is > buggy. > > It's worth noting that the field I'm inserting into is an SQL_ASCII > field, and I'm reading my UTF-8 string out of it like this, via JDBC: > > String value = new String( resultset.getBytes(1), "UTF-8"); > > Can anyone help me make sense of this mumbo jumbo? HTH -- Joel Rees <joel@alpsgiken.gr.jp>
pgsql-general by date: