Re: SOLUTION: Insert a Euro symbol as UTF-8 from a latin1 charset. - Mailing list pgsql-hackers

From Jeroen T. Vermeulen
Subject Re: SOLUTION: Insert a Euro symbol as UTF-8 from a latin1 charset.
Msg-id 20030613155029.GN31141@xs4all.nl
In response to SOLUTION: Insert a Euro symbol as UTF-8 from a latin1 charset.  (Roland Glenn McIntosh <roland@steeltorch.com>)
List pgsql-hackers
On Fri, Jun 13, 2003 at 11:28:36AM -0400, Roland Glenn McIntosh wrote:
> 
> The Euro symbol is unicode value 0x20AC.  UTF-8 encoding is a way of
> representing most unicode characters in two bytes, and most latin
> characters in one byte.
 
More precisely, UTF-8 encodes ASCII characters in one byte.  All other
latin-1 characters take 2 bytes IIRC, with the rest taking up to 4 bytes.
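
Those byte counts are easy to check.  Here is a minimal sketch in Python; the sample characters are picked purely for illustration and do not come from the thread:

```python
# Sample characters chosen for illustration: ASCII, latin-1, the Euro sign,
# and one character beyond the 16-bit range.
samples = {
    "A": "ASCII letter",
    "\u00e9": "latin-1 e-acute",
    "\u20ac": "Euro sign",
    "\U0001F600": "character outside the 16-bit range",
}

# Map each character to the length of its UTF-8 encoding.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, label in samples.items():
    print(f"U+{ord(ch):04X} ({label}): {utf8_lengths[ch]} byte(s)")
```

The Euro sign lands in the three-byte group, not the two-byte one.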


> I don't know why my "20" byte turned into two bytes of E2 and 82.  

Haven't got the spec handy, but UTF-8 uses the most-significant bit(s) of
each byte as a marker.  If the upper bit is zero, the byte is a plain
7-bit ASCII value.  If it's 1, the byte is part of a multibyte sequence:
the leading 1 bits of the first byte give the sequence's length (110 for
two bytes, 1110 for three, 11110 for four), and every following byte
starts with 10 to mark it as a continuation byte.

In a nutshell, you can't just take bits away from your Unicode value and
call it UTF-8; it's a variable-length encoding and it needs some extra
room for the length information to go.
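
To make that concrete, here is a rough sketch in Python of encoding a single code point by hand.  The helper name utf8_encode is made up for this example, and it skips validation (surrogates, out-of-range values) that a real encoder would do:

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by hand (illustrative helper)."""
    if code_point < 0x80:
        # 0xxxxxxx: plain 7-bit ASCII, one byte.
        return bytes([code_point])
    if code_point < 0x800:
        # 110xxxxx 10xxxxxx: 11 payload bits in two bytes.
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx: 16 payload bits in three bytes.
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: 21 payload bits in four bytes.
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


euro = utf8_encode(0x20AC)
print(" ".join(f"{b:02X}" for b in euro))  # E2 82 AC
```

The 16 bits of 0x20AC get split 4 + 6 + 6 across the three bytes, which is why the "20" half does not survive as a byte of its own: its top four bits end up in E2 and the remaining bits are spread over 82 and AC.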

Furthermore, the Euro symbol is not in latin-1 at all.  It was added in
latin-9 (ISO 8859-15), and it did not get a retroactive spot in the
bottom 256 Unicode values: its code point is U+20AC.  Hence UTF-8 needs
three bytes to encode it (E2 82 AC).
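
A small Python check of the latin-1 versus latin-9 point, using the codec names Python ships with (this is only an illustration, not something from the original setup):

```python
euro = "\u20ac"

# latin-1 (ISO 8859-1) has no Euro sign, so encoding fails.
try:
    euro.encode("latin-1")
    in_latin1 = True
except UnicodeEncodeError:
    in_latin1 = False

# latin-9 (ISO 8859-15) has it at 0xA4, where latin-1 has the generic
# currency sign.
latin9_bytes = euro.encode("iso8859-15")

print(in_latin1, latin9_bytes.hex())
```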


> Furthermore, I was under the impression that a UTF-8 encoding of the Euro
> sign only took two bytes.  Corroborating this assumption, upon dumping that
> table with pg_dump and examining the resultant file in a hex editor, I see
> this in that character position: AC 20
 
How does that "corroborate the assumption"?  AC 20 is the Unicode value
0x20AC itself, stored low byte first in a fixed-length 16-bit encoding.
That's not UTF-8.
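
For comparison, a short Python sketch of the two encodings side by side.  It assumes the dump held a little-endian 16-bit value, which is what the AC 20 byte order suggests:

```python
euro = "\u20ac"

# Fixed-length 16-bit encoding, low byte first (assumed from "AC 20").
utf16_le = euro.encode("utf-16-le")

# Variable-length UTF-8 encoding of the same character.
utf8 = euro.encode("utf-8")

print("UTF-16LE:", " ".join(f"{b:02X}" for b in utf16_le))
print("UTF-8:   ", " ".join(f"{b:02X}" for b in utf8))
```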

> I did try the '\0x20AC' method, and '\0x20\0xAC' without success.
> It's worth noting that the field I'm inserting into is an SQL_ASCII field,
> and I'm reading my UTF-8 string out of it like this, via JDBC:
 

You can't fit UTF-8 into ASCII.  UTF-8 is an eight-bit encoding; ASCII
is a 7-bit character set.
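
As a quick illustration in Python: every byte of the Euro sign's UTF-8 form has its high bit set, so none of them is a valid 7-bit ASCII value.  This only says something about ASCII proper; how the server treats such bytes in an SQL_ASCII column is a separate question that this sketch does not cover:

```python
utf8_euro = "\u20ac".encode("utf-8")

# True for each byte whose most-significant bit is set (value >= 0x80).
high_bits = [b >= 0x80 for b in utf8_euro]

# A strict ASCII decoder rejects the sequence.
try:
    utf8_euro.decode("ascii")
    fits_ascii = True
except UnicodeDecodeError:
    fits_ascii = False

print(high_bits, fits_ascii)
```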


Jeroen


