Re: SOLUTION: Insert a Euro symbol as UTF-8 from a latin1 charset. - Mailing list pgsql-general

From Joel Rees
Subject Re: SOLUTION: Insert a Euro symbol as UTF-8 from a latin1 charset.
Date
Msg-id 20030616163913.D11A.JOEL@alpsgiken.gr.jp
Whole thread Raw
In response to SOLUTION: Insert a Euro symbol as UTF-8 from a latin1 charset.  (Roland Glenn McIntosh <roland@steeltorch.com>)
List pgsql-general
> Okay.  I have NO IDEA why this works.  If someone could enlighten me as
> to the math involved I'd appreciate it.  First, a little background:
>
> The Euro symbol is unicode value 0x20AC.

That's most significant byte first, in UTF-16.

> UTF-8 encoding is a way of
> representing most unicode characters in two bytes,

err, no.

> and most latin
> characters in one byte.

UTF-8 is a way to transform the standard code points of a large, fixed
width encoding to a variable width encoding. For Unicode, the number of
octets (Octet is explicitly 8 bits, since byte is not necessarily 8 bits)
can vary between one and four. See RFC 2279:

    http://www.ietf.org/rfc/rfc2279.txt
    http://www.unicode.org/book/preview/ch02.pdf

Those two pages should clear your question up.

> The only way I have found to insert a euro symbol into the database
> from the command line psql client is this:
>
>     INSERT INTO mytable VALUES('\342\202\254');
>
> I don't know why this works.  In hex, those octal values are:
>     E2 82 AC

That's UTF-8.

> I don't know why my "20" byte turned into two bytes of E2 and 82.

In the conversion from UTF-16 to UTF-8. (You converted it, right?)

> Furthermore, I was under the impression that a UTF-8 encoding of the
> Euro sign only took two bytes.

Let's see. From a table on the RFC page, I note that 0x00 to 0x7f fit in
one octet, 0x0080 to 0x07ff fits in two, and 0x0800 to 0xffff fits in
three. 0x20ac is definitely greater than 0x07ff.

> Corroborating this assumption, upon
> dumping that table with pg_dump and examining the resultant file in
> a hex editor, I see this in that character position: AC 20

That's called looking at a UTF-16 character as if it were an integer on
a byte-backwards, I mean, least-significant byte first CPU. (See "Byte
Order Mark" in the unicode.org link I pasted in above.

> Furthermore, according to the psql online documentation and man page:
> "Anything contained in single quotes is furthermore subject to C-like
> substitutions for \n (new line), \t (tab), \digits, \0digits, and
> \0xdigits (the character with the given decimal, octal, or hexadecimal
> code)."

I'm having trouble finding that page. Do you have a URL for it? (But,
then again, I might not be the person to ask this question. It seems
like I saw something about it somewhere, maybe it's in the list archives
or something.)

> Those digits *should* be interpreted as decimal digits, but they aren't.
> The man page for psql is either incorrect, or the implementation is
> buggy.
>
> It's worth noting that the field I'm inserting into is an SQL_ASCII
> field, and I'm reading my UTF-8 string out of it like this, via JDBC:
>
>     String value = new String( resultset.getBytes(1), "UTF-8");
>
> Can anyone help me make sense of this mumbo jumbo?

HTH

--
Joel Rees <joel@alpsgiken.gr.jp>


pgsql-general by date:

Previous
From: Bruno Wolff III
Date:
Subject: Re: Lyris -> PG 7.1.3
Next
From: Richard Huxton
Date:
Subject: Re: Odd error message