Thread: UTF-8 encoding question regarding PhpPgAdmin development

UTF-8 encoding question regarding PhpPgAdmin development

From

Jean-Michel POURE

Date:

07 January 2003, 08:38:39

Dear all,

We are working on PhpPgAdmin UTF-8 support. I would like to be able to view
UTF-8, ASCII and Latin1 databases in PhpPgAdmin without changing HTML header
encodings.

I guess this can be done using:
SET CLIENT_ENCODING='Unicode'
for all PhpPgAdmin connections.

My questions are:

- Are some database encodings not translatable into UTF-8 using SET
CLIENT_ENCODING = 'Unicode'? It used to be the case for Latin1, but it has
been fixed now.

- Some letters, like the euro sign, do not belong to Latin1. Example:  let's
say we have a Latin1 database and use SET CLIENT_ENCODING = 'Unicode'. If I
input a euro sign, does it get rejected by PostgreSQL?

- More generally, is it safe to convert an encoding (e.g. Latin1 or Chinese
multi-byte) into UTF-8 using SET CLIENT_ENCODING? Can all multi-byte
encodings be converted into/from UTF-8 safely?

Best regards,
Jean-Michel

Re: UTF-8 encoding question regarding PhpPgAdmin development

From

Peter Eisentraut

Date:

07 January 2003, 15:50:41

Jean-Michel POURE writes:

> - Are some database encodings not translatable into UTF-8 using SET
> CLIENT_ENCODING = 'Unicode'? It used to be the case for Latin1, but it has
> been fixed now.

It should be possible.  If not, it's a bug.

> - Some letters, like the euro sign, do not belong to Latin1. Example:  let's
> say we have a Latin1 database and use SET CLIENT_ENCODING = 'Unicode'. If I
> input a euro sign, does it get rejected by PostgreSQL?

Currently, it gives you a warning and ignores the character.  Not sure
that is ideal.

> - More generally, is it safe to convert an encoding (e.g. Latin1 or Chinese
> multi-byte) into UTF-8 using SET CLIENT_ENCODING? Can all multi-byte
> encodings be converted into/from UTF-8 safely?

Some points to keep in mind: Some character sets contain characters that
are not in Unicode, although you might choose to ignore that fact because
it is of relatively minor importance.  Round-trip conversion is not safely
possible, so if your tool provides a read/edit/write cycle then you will
have problems.  Finally, when you display East Asian characters you will
have a font problem because the Chinese, Japanese, and Korean characters
are mapped to the same range in Unicode but you are supposed to use
country-specific glyphs.
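
[Editor's sketch] The round-trip concern can be illustrated outside the database with Python's codecs. This is only an analogy, not PostgreSQL's conversion code: the helper name survives_round_trip is hypothetical, and errors="ignore" is an assumption that mimics the "warning and ignore" behaviour described above for characters missing from the database encoding.

```python
def survives_round_trip(text: str, db_encoding: str) -> bool:
    """Return True if text is unchanged after client -> database -> client."""
    # Write path: the client's Unicode text is converted to the database
    # encoding. errors="ignore" drops unmappable characters (assumption).
    stored = text.encode(db_encoding, errors="ignore")
    # Read path: the stored bytes are converted back for the client.
    returned = stored.decode(db_encoding)
    return returned == text


print(survives_round_trip("café", "latin-1"))  # True
print(survives_round_trip("10 €", "latin-1"))  # False
```

A read/edit/write tool could run a check of this kind before saving, and refuse the edit instead of silently losing characters.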

In short, I don't think what you are trying to do is easily achievable.

-- 
Peter Eisentraut   peter_e@gmx.net

Re: UTF-8 encoding question regarding PhpPgAdmin development

From

Jean-Michel POURE

Date:

07 January 2003, 17:14:25

Dear Peter,

Thank you very much for your answers. It rings a bell.

> Finally, when you display East Asian characters you will
> have a font problem because the Chinese, Japanese, and Korean characters
> are mapped to the same range in Unicode but you are supposed to use
> country-specific glyphs.

Do you mean that the glyph for a given hexadecimal code X will display
differently in UTF-8 and EUC_JP? If that is really the case, we cannot use
UTF-8.

> Round-trip conversion is not safely possible, so if your tool provides a
> read/edit/write cycle then you will have problems.

Maybe we could use "getdatabaseencoding()" to determine the database encoding
and generate HTML pages with the corresponding headers. Example: Latin1
database <-> ISO-8859-1 headers.

The problem is that PhpPgAdmin interface needs to be localized in several
languages, not related to database encoding. Example: EUC_JP interface and
Latin1 databases.

Maybe a solution would be to use the ISO 10646 notation for PhpPgAdmin
interface localization:  "&#XH;", where H is a hexadecimal number.
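
[Editor's sketch] Writing interface strings as hexadecimal numeric character references keeps the page source pure ASCII, so it is valid under any ASCII-compatible page charset. The function name to_hex_ncr is hypothetical; the lowercase "x" form is used because it is accepted by both HTML and XML.

```python
def to_hex_ncr(text: str) -> str:
    """Replace every non-ASCII character with a &#xH; reference."""
    # Note: this does not escape &, < or >; that is a separate step.
    return "".join(
        ch if ord(ch) < 128 else f"&#x{ord(ch):X};"
        for ch in text
    )


print(to_hex_ncr("Données €"))  # Donn&#xE9;es &#x20AC;
```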

Cheers,
Jean-Michel POURE

Re: UTF-8 encoding question regarding PhpPgAdmin

From

Adrian 'Dagurashibanipal' von Bidder

Date:

08 January 2003, 09:11:48

On Tue, 2003-01-07 at 21:59, Peter Eisentraut wrote:

> > - Some letters, like the euro sign, do not belong to Latin1. Example:  let's
> > say we have a Latin1 database and use SET CLIENT_ENCODING = 'Unicode'. If I
> > input a euro sign, does it get rejected by PostgreSQL?
>
> Currently, it gives you a warning and ignores the character.  Not sure
> that is ideal.

(Yes, I should try this myself...)

Ignored as in 'passed through unchanged'; or ignored as in 'removed from
the string'?

cheers
-- vbi

--
this email is protected by a digital signature: http://fortytwo.ch/gpg

Re: UTF-8 encoding question regarding PhpPgAdmin development

From

Peter Eisentraut

Date:

08 January 2003, 17:13:41

Jean-Michel POURE writes:

> > Finally, when you display East Asian characters you will
> > have a font problem because the Chinese, Japanese, and Korean characters
> > are mapped to the same range in Unicode but you are supposed to use
> > country-specific glyphs.
>
> Do you mean that glyph hexaX will display differently in UTF-8 and EUC_JP? If
> it is really the case, we cannot use UTF-8.

Well, it's not completely different, but customized to the language.  The
Chinese, Japanese, and Korean ideographs are really the same historically
but are displayed slightly differently.  If you use a country-specific
character set you probably also get a country-specific font with it, but
if you map it to Unicode then you will get whatever the default look is on
your computer.  This is actually not so bad because as I understand it,
for example, a Japanese book that quotes Chinese text uses the
Japanese-look ideographs for the Chinese portions as well.  But a database
administration tool is not a Japanese book, so you need to judge it.
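
[Editor's sketch] The unification described above can be observed with Python's codecs: the same ideograph has different byte sequences in a Japanese and a Chinese encoding, yet both convert to one Unicode code point, so the language-specific look has to come from the font or language markup rather than from the character data. The choice of U+76F4 as the sample character is the editor's.

```python
ideograph = "\u76f4"  # one ideograph present in both legacy encodings

jp_bytes = ideograph.encode("euc_jp")   # Japanese legacy encoding
cn_bytes = ideograph.encode("gb2312")   # Simplified Chinese legacy encoding

# The legacy byte sequences differ between the two encodings...
print(jp_bytes != cn_bytes)  # True

# ...but converting either one to Unicode yields the same code point.
same_code_point = jp_bytes.decode("euc_jp") == cn_bytes.decode("gb2312")
print(same_code_point)  # True
print(f"U+{ord(ideograph):04X}")  # U+76F4
```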

-- 
Peter Eisentraut   peter_e@gmx.net

Re: UTF-8 encoding question regarding PhpPgAdmin

From

Tatsuo Ishii

Date:

08 January 2003, 20:13:15

> > > - Some letters, like the euro sign, do not belong to Latin1. Example:  let's
> > > say we have a Latin1 database and use SET CLIENT_ENCODING = 'Unicode'. If I
> > > input a euro sign, does it get rejected by PostgreSQL?
> > 
> > Currently, it gives you a warning and ignores the character.  Not sure
> > that is ideal.
> 
> (Yes, I should try this myself...)
> 
> Ignored as in 'passed through unchanged'; or ignored as in 'removed from
> the string'?

"removed from the string". BTW, if I remember correctly, the euro sign is
supported in ISO-8859-16, not in ISO-8859-1.
--
Tatsuo Ishii
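
[Editor's sketch] Which ISO-8859 parts can represent the euro sign can be checked with Python's codecs, used here only as a stand-in for the server's conversion tables. The helper name euro_bytes is hypothetical. ISO-8859-15 is included because it also defines the euro sign.

```python
from typing import Optional


def euro_bytes(encoding: str) -> Optional[bytes]:
    """Return the encoded euro sign, or None if the encoding lacks it."""
    try:
        return "\u20ac".encode(encoding)
    except UnicodeEncodeError:
        return None


for name in ("iso8859_1", "iso8859_15", "iso8859_16"):
    result = euro_bytes(name)
    print(name, "no euro sign" if result is None else result)
```

Only the first encoding in the loop (ISO-8859-1, i.e. Latin1) reports no euro sign.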