RE: PostgreSQL and Unicode - Mailing list pgsql-hackers

From Tatsuo Ishii
Subject RE: PostgreSQL and Unicode
Date
Msg-id 20000516160855E.t-ishii@sra.co.jp
List pgsql-hackers
>     My understanding of the problem with UTF-8 is this. Functionally, it is
> equivalent to UCS-2; that is, you can encode any Unicode character in UTF-8
> that you could encode in UCS-2.
>     The problem we've run into is only related to Postgres. For example, we had
> a field that was fixed at 20 characters. If we put in ASCII then we could
> put in all 20 characters. If we put in UTF-8 encoded Japanese then (depending
> on which characters were used) we got about 3 UTF-8 bytes for each
> Japanese character. Aside from going from 20 characters to 7 (*problem #1*)
> we also now have unpredictable behavior. Some characters, like Japanese,
> had a 3:1 ratio when encoded. UTF-8 can go as high as a 6:1 encoding ratio for
> some languages (I don't know which offhand); this is *problem #2*. Finally,
> as a side effect of this, the string was just truncated, so we sometimes got
> only a partial UTF-8 character in the database. This made the decoding
> either fail or produce weird results (*problem #3*).

Yes, I have noticed this problem too. But don't we have the same problem
with UCS-2, with a 2:1 ratio, then? I think we should fix it this way:
char(10) should mean 10 letters, not 10 bytes, no matter what encoding
we use.
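
To make the difference between counting bytes and counting letters concrete, here is a minimal sketch. It is written in Python rather than the backend's C, it assumes the input is valid UTF-8, and the names truncate_bytes and truncate_chars are hypothetical, not existing PostgreSQL functions. It only illustrates the idea, not how the fix would actually be implemented.

```python
def truncate_bytes(data: bytes, max_bytes: int) -> bytes:
    """Cut at a byte limit; this may split a multibyte character (problem #3)."""
    return data[:max_bytes]


def truncate_chars(data: bytes, max_chars: int) -> bytes:
    """Keep at most max_chars complete characters of UTF-8 encoded data.

    Assumes data is valid UTF-8. Continuation bytes have the bit
    pattern 10xxxxxx, so a character ends just before the next byte
    that is not a continuation byte.
    """
    pos = 0
    count = 0
    while pos < len(data) and count < max_chars:
        pos += 1  # consume the lead byte of the current character
        while pos < len(data) and (data[pos] & 0xC0) == 0x80:
            pos += 1  # consume its continuation bytes
        count += 1
    return data[:pos]


# Illustrative input: 9 Japanese characters, 3 bytes each in UTF-8.
text = "日本語データベース"
data = text.encode("utf-8")
print(len(text), len(data))

# A limit counted in characters always ends on a character boundary.
print(truncate_chars(data, 5).decode("utf-8"))

# A limit counted in bytes can leave a partial character behind.
try:
    truncate_bytes(data, 20).decode("utf-8")
except UnicodeDecodeError as exc:
    print("byte truncation broke a character:", exc.reason)
```

With a limit counted in letters, a 20-letter field holds 20 letters in any encoding, and the stored value can never end in the middle of a character.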

I will tackle this problem for 7.1.

What do you think, Rainer? Are you still unhappy with the solution
above?
--
Tatsuo Ishii

