Thread: SOLUTION: Insert a Euro symbol as UTF-8 from a latin1 charset.

From

Roland Glenn McIntosh

Date:

13 June 2003, 12:28:22

This is my solution / bug report / RFC cross-posted from [GENERAL] regarding insertion of hexadecimal characters from
the command line.

-----------------------------------

Okay. I have NO IDEA why this works. If someone could enlighten me as to the math involved I'd appreciate it. First,
a little background:

The Euro symbol is unicode value 0x20AC. UTF-8 encoding is a way of representing most unicode characters in two bytes,
and most latin characters in one byte.

The only way I have found to insert a euro symbol into the database from the command line psql client is this:
INSERT INTO mytable VALUES('\342\202\254');

I don't know why this works. In hex, those octal values are: E2 82 AC
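
The octal-to-hex correspondence can be checked with a short sketch. Python is used here only for illustration and is not part of the original thread; the variable names are made up for this example.

```python
# The three octal escapes from the INSERT statement.
octal_escapes = ["342", "202", "254"]

# Interpret each escape as a base-8 number to get the raw byte values.
raw = bytes(int(digits, 8) for digits in octal_escapes)

# Show the same bytes in hexadecimal.
hex_form = " ".join(f"{b:02X}" for b in raw)
print(hex_form)  # E2 82 AC

# Decoding those bytes as UTF-8 gives a single code point.
code_point = ord(raw.decode("utf-8"))
print(f"U+{code_point:04X}")  # U+20AC
```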

I don't know why my "20" byte turned into two bytes of E2 and 82. Furthermore, I was under the impression that a UTF-8
encoding of the Euro sign only took two bytes. Corroborating this assumption, upon dumping that table with pg_dump and
examining the resultant file in a hex editor, I see this in that character position: AC 20

Additionally, according to the psql online documentation and man page:
"Anything contained in single quotes is furthermore subject to C-like substitutions for \n (new line), \t (tab),
\digits, \0digits, and \0xdigits (the character with the given decimal, octal, or hexadecimal code)."

Those digits *should* be interpreted as decimal digits, but they aren't. The man page for psql is either incorrect, or
the implementation is buggy.

I did try the '\0x20AC' method, and '\0x20\0xAC' without success.
It's worth noting that the field I'm inserting into is an SQL_ASCII field, and I'm reading my UTF-8 string out of it
like this, via JDBC:
String value = new String(resultset.getBytes(1), "UTF-8");

Can anyone help me make sense of this mumbo jumbo?
-Roland

Re: SOLUTION: Insert a Euro symbol as UTF-8 from a latin1 charset.

From

"Jeroen T. Vermeulen"

Date:

13 June 2003, 12:50:48

On Fri, Jun 13, 2003 at 11:28:36AM -0400, Roland Glenn McIntosh wrote:
> 
> The Euro symbol is unicode value 0x20AC.  UTF-8 encoding is a way of representing most unicode characters in two
> bytes, and most latin characters in one byte.

More precisely, UTF-8 encodes ASCII characters in one byte.  All other
latin-1 characters take 2 bytes IIRC, with the rest taking up to 4 bytes.
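
As a rough check of those byte counts, the sketch below asks Python's built-in UTF-8 codec how many bytes it produces for one sample character from each group. The sample characters are arbitrary choices made for this illustration.

```python
# One sample character per group; the choice of samples is arbitrary.
samples = {
    "A": "ASCII letter",
    "\u00e9": "Latin-1 letter (e with acute accent)",
    "\u20ac": "Euro sign",
    "\U0001F600": "character beyond the 16-bit range",
}

# Number of bytes the built-in UTF-8 codec produces for each sample.
lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, label in samples.items():
    print(f"U+{ord(ch):04X} ({label}): {lengths[ch]} byte(s)")
```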

> I don't know why my "20" byte turned into two bytes of E2 and 82.  

Haven't got the spec handy, but UTF-8 uses the most-significant bit(s) of
each byte as a "continuation" field.  If the upper bit is zero, the char
is a plain 7-bit ASCII value.  If it's 1, the byte is part of a multibyte
sequence with a few most-significant bits indicating the sequence's length
and the byte's position in it (IIRC it's something like a countdown to the
end of the sequence).

In a nutshell, you can't just take bits away from your Unicode value and
call it UTF-8; it's a variable-length encoding and it needs some extra
room for the length information to go.
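
To make the arithmetic concrete: in UTF-8 the first byte of a multibyte sequence carries the length marker, and every following byte starts with the bits 10. For code points from U+0800 to U+FFFF the layout is 1110xxxx 10xxxxxx 10xxxxxx, so the 16 bits of 0x20AC are split 4 + 6 + 6. The sketch below does this by hand; the function name utf8_three_byte is invented for this example, and it deliberately ignores every other sequence length.

```python
def utf8_three_byte(code_point: int) -> bytes:
    """Encode a code point in U+0800..U+FFFF using the three-byte UTF-8 form."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("this sketch only handles the three-byte range")
    lead = 0xE0 | (code_point >> 12)          # 1110xxxx: top 4 bits
    mid = 0x80 | ((code_point >> 6) & 0x3F)   # 10xxxxxx: middle 6 bits
    low = 0x80 | (code_point & 0x3F)          # 10xxxxxx: bottom 6 bits
    return bytes([lead, mid, low])


encoded = utf8_three_byte(0x20AC)
print(encoded.hex(" "))  # e2 82 ac
```

For 0x20AC (binary 0010 000010 101100) this gives 0xE0|0x2 = E2, 0x80|0x02 = 82 and 0x80|0x2C = AC, which matches the bytes in the INSERT statement.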

Furthermore, I don't think the Euro symbol is in latin-1 at all.  It was
added in latin-9 (iso 8859-15) and so it's not likely to have gotten a
retroactive spot in the bottom 256 character values.  Hence it will take
UTF-8 more bytes to encode it.
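
This can be checked against the codec tables that ship with Python (an illustration only, not something from the thread): the Euro sign encodes to a single byte in ISO 8859-15 but cannot be encoded in latin-1 at all.

```python
euro = "\u20ac"

# ISO 8859-15 (latin-9) has a slot for the Euro sign.
latin9_bytes = euro.encode("iso8859_15")
print(latin9_bytes.hex())  # a4

# latin-1 (ISO 8859-1) has no Euro sign, so encoding fails.
try:
    euro.encode("latin-1")
    in_latin1 = True
except UnicodeEncodeError:
    in_latin1 = False
print(in_latin1)  # False
```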

> Furthermore, I was under the impression that a UTF-8 encoding of the Euro sign only took two bytes.  Corroborating
> this assumption, upon dumping that table with pg_dump and examining the resultant file in a hex editor, I see this in
> that character position: AC 20

How does that "corroborate the assumption?"  You're looking at the Unicode
value now, in a fixed-length 16-bit encoding.
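
A small sketch of that point, assuming the two bytes are read as little-endian 16-bit Unicode (UTF-16LE): they decode to U+20AC, while the same two bytes are not a valid UTF-8 sequence.

```python
# The two bytes seen in the hex editor, in file order.
dumped = bytes([0xAC, 0x20])

# Read as little-endian UTF-16, they form one 16-bit code unit: 0x20AC.
as_utf16 = dumped.decode("utf-16-le")
print(f"U+{ord(as_utf16):04X}")  # U+20AC

# Read as UTF-8 they are invalid: 0xAC is a continuation byte
# and cannot start a sequence.
try:
    dumped.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print(valid_utf8)  # False
```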

> I did try the '\0x20AC' method, and '\0x20\0xAC' without success.
> It's worth noting that the field I'm inserting into is an SQL_ASCII field, and I'm reading my UTF-8 string out of it
> like this, via JDBC:

You can't fit UTF-8 into ASCII.  UTF-8 is an eight-bit encoding; ASCII
is a 7-bit character set.
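
The 7-bit versus 8-bit distinction shows up directly in the Euro's UTF-8 bytes: each of them has its top bit set, so a strict ASCII decoder rejects them. The sketch below illustrates only that; it says nothing about how PostgreSQL itself treats such bytes in an SQL_ASCII database.

```python
# UTF-8 bytes of the Euro sign.
euro_utf8 = bytes([0xE2, 0x82, 0xAC])

# Top (eighth) bit of each byte; 7-bit ASCII requires it to be 0.
high_bits = [b >> 7 for b in euro_utf8]
print(high_bits)  # [1, 1, 1]

# A strict ASCII decoder therefore refuses the sequence.
try:
    euro_utf8.decode("ascii")
    fits_ascii = True
except UnicodeDecodeError:
    fits_ascii = False
print(fits_ascii)  # False
```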

Jeroen

Re: SOLUTION: Insert a Euro symbol as UTF-8 from a latin1 charset.

From

Ian Barwick

Date:

13 June 2003, 14:27:09

On Friday 13 June 2003 17:28, Roland Glenn McIntosh wrote:
> This is my solution / bug report / RFC cross-posted from [GENERAL]
> regarding insertion of hexadecimal characters from the command line.
> -----------------------------------
>
> Okay.  I have NO IDEA why this works.  If someone could enlighten me as to
> the math involved I'd appreciate it.  First, a little background:
>
> The Euro symbol is unicode value 0x20AC.  UTF-8 encoding is a way of
> representing most unicode characters in two bytes, and most latin
> characters in one byte.
>
> The only way I have found to insert a euro symbol into the database from
> the command line psql client is this: INSERT INTO mytable
> VALUES('\342\202\254');
>
> I don't know why this works.  In hex, those octal values are:
>     E2 82 AC

My apologies, I forgot to mention converting to UTF-8 in my original
reply.

> Additionally, according to the psql online documentation and man page:
> "Anything contained in single quotes is furthermore subject to C-like
> substitutions for \n (new line), \t (tab), \digits, \0digits, and \0xdigits
> (the character with the given decimal, octal, or hexadecimal code)."
>
> Those digits *should* be interpreted as decimal digits, but they aren't. 
> The man page for psql is either incorrect, or the implementation is buggy.

The docs are easy to misunderstand if you are scanning them in a hurry.
This section is referring to substitutions in psql's own meta commands,
not SQL statements, e.g. this:

\echo '\0xe2\0x82\0xac'

will display the Euro sign (assuming your terminal can print it).


Ian Barwick
barwick@gmx.net
