RE: PostgreSQL and Unicode - Mailing list pgsql-hackers

From	Tatsuo Ishii
Subject	RE: PostgreSQL and Unicode
Date	May 16, 2000 06:09:31
Msg-id	20000516160855E.t-ishii@sra.co.jp
List	pgsql-hackers

>     My understanding of the problem with UTF8 is this. Functionally, it is
> equivalent to UCS-2; that is, you can encode any Unicode character in UTF-8
> that you could encode in UCS-2.
>     The problem we've run into is only related to Postgres. For example, we had
> a field that was fixed at 20 characters. If we put in ASCII then we could
> put in all 20 characters. If we put in UTF8 encoded Japanese then (depending
> on which characters were used) we got about 3 UTF8 bytes for each
> Japanese character. Aside from going from 20 characters to 7 (*problem #1*)
> we also now have unpredictable behavior. Some characters, like Japanese,
> had a 3:1 ratio when encoded. UTF8 can go as high as a 6:1 encoding ratio for
> some languages (I don't know which offhand); this is *problem #2*. Finally,
> as a side effect of this, the string was just truncated so we sometimes got
> only a partial UTF8 character in the database. This made the decoding
> either fail or produce weird results (*problem #3*).
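
To make the three problems concrete, here is a minimal Python sketch of what happens when a byte limit is applied to UTF-8 text. The sample strings and the LIMIT name are assumptions chosen for illustration; the limit of 20 mirrors the fixed-width field described in the quote. This is not PostgreSQL code.

```python
# Hypothetical sample data; LIMIT mirrors the 20-wide field from the quote,
# interpreted as 20 bytes of storage.
LIMIT = 20

ascii_text = "abcdefghijklmnopqrst"       # 20 ASCII characters
japanese_text = "日本語のテキストです"     # 10 Japanese characters

ascii_bytes = ascii_text.encode("utf-8")
japanese_bytes = japanese_text.encode("utf-8")

print(len(ascii_bytes))      # ASCII: 1 byte per character
print(len(japanese_bytes))   # these characters: 3 bytes per character

# Cutting at a byte offset can split a multi-byte character.
truncated = japanese_bytes[:LIMIT]
try:
    truncated.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False       # the last character is incomplete

# Dropping the broken tail shows how many whole characters survived.
survivors = truncated.decode("utf-8", errors="ignore")
print(decoded_ok, len(survivors))
```

With these sample strings, the 20-byte cut keeps six complete characters plus two bytes of a seventh, so a strict decoder rejects the stored value.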

Yes, I have noticed this problem too. But don't we have the same problem
with UCS-2, at a 2:1 ratio, then? I think we should fix it this
way: char(10) should mean 10 letters, not 10 bytes, no matter what
encoding we use.
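
As a rough illustration of "10 letters, not 10 bytes", the following Python sketch truncates a UTF-8 byte string by character count instead of byte count. The function name truncate_utf8_chars is hypothetical, the input is assumed to be valid UTF-8, and "character" here means a Unicode code point. It is only a sketch of the idea, not the implementation planned for 7.1.

```python
def truncate_utf8_chars(data: bytes, max_chars: int) -> bytes:
    """Return the longest prefix of valid UTF-8 `data` that holds
    at most `max_chars` characters (code points)."""
    count = 0
    for i, b in enumerate(data):
        # Continuation bytes look like 10xxxxxx; any other byte
        # starts a new character.
        if (b & 0xC0) != 0x80:
            if count == max_chars:
                return data[:i]   # cut exactly on a character boundary
            count += 1
    return data                   # fewer than max_chars characters present


sample = "日本語のテキストです".encode("utf-8")   # 10 characters, 30 bytes
clipped = truncate_utf8_chars(sample, 7)
print(len(clipped), clipped.decode("utf-8"))
```

Because the cut always lands on a lead byte, the result decodes cleanly whatever the bytes-per-character ratio is. The byte length of the stored value then varies with the text, which is the trade-off of counting letters.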

I will tackle this problem for 7.1.

What do you think, Rainer? Are you still unhappy with the solution
above?
--
Tatsuo Ishii
