Thread: UTF-8 support

UTF-8 support

From
Jean-Michel POURE
Date:
Hello,

I have set up a UNICODE database in PostgreSQL 7.1.2 and use psql for
querying (\ENCODING > UNICODE).
To perform tests, I downloaded code charts from http://www.unicode.org/charts/

1) UTF-8
http://www.postgresql.org/idocs/index.php?app-psql.html explains
"Anything contained in single quotes is furthermore subject to C-like
substitutions for \n (new line), \t (tab), \digits, \0digits, and \0xdigits
(the character with the given decimal, octal, or hexadecimal code)."

To start, I would like to store/display a simple 'A' letter in psql, number
0041, with the following queries:

 > 'INSERT INTO TABLE table_name VALUES (column-name) VALUES ( ' \0041' );
    and then SELECT * FROM table_name. It does not work.
 > Or simply SELECT '\0041'; which does not return 'A'.

Do I miss something?

2) Japanese coding
Do you recommend EUC_JP or UNICODE for storing Japanese text in PostgreSQL?
This is for use in PHP (both for input and display, no recode needed).

3) Is there a way to query available encodings in PostgreSQL for display in
pgAdmin.
Is it a planned feature in PostgreSQL 7.2? This would be nice if it existed.
Example: function pg_available_encodings -> SQL-ASCII;UNICODE;EUC-JP etc...

Thank you in advance,
Jean-Michel POURE
pgAdmin Team
http://pgadmin.postgresql.org

Re: UTF-8 support

From
Hiroshi Inoue
Date:
Jean-Michel POURE wrote:
>
> 3) Is there a way to query available encodings in PostgreSQL for display in
> pgAdmin.

Could pgAdmin display multibyte chars in the first place ?

regards,
Hiroshi Inoue

Re: UTF-8 support

From
Dave Page
Date:

> -----Original Message-----
> From: Hiroshi Inoue [mailto:Inoue@tpf.co.jp]
> Sent: 24 September 2001 06:26
> To: Jean-Michel POURE
> Cc: pgsql-odbc@postgresql.org; pgsql-hackers@postgresql.org;
> Tatsuo Ishii
> Subject: Re: [ODBC] UTF-8 support
>
>
> Jean-Michel POURE wrote:
> >
> > 3) Is there a way to query available encodings in PostgreSQL for
> > display in pgAdmin.
>
> Could pgAdmin display multibyte chars in the first place ?

Dunno, never tried it - perhaps some could let us know?

pgAdmin does now (as of last night) support encoding names (SQL_ASCII,
EUC_JP, KOI8 etc), however the list of options is hardcoded from the list in
multibyte.html (though you can overtype if you know better). If there is a
more up-to-date list than this I'd be glad to know.

Regards, Dave.

Re: [HACKERS] UTF-8 support

From
Jean-Michel POURE
Date:
Hello Tatsuo & all,

First of all, I would like to thank all the ODBC team for their work.
I have been using various databases quite extensively and PostgreSQL ranks
among the best for ODBC support.

>Try:
>INSERT INTO TABLE table_name VALUES (column-name) VALUES ( ' \101' );
>I don't know why the docs claim so, '\OCTAL_NUMBER' seems to work
>anyway.BTW, 'A' is not 041 in octal, it is 101.

Would it be possible to use the hexadecimal \u000 notation in 7.2? Is there
a built-in function to convert hexadecimal into octal?
This would be an interesting feature as it seems to be a standard notation
for UNICODE values (http://www.unicode.org/charts/).
Also, Java has the ability to display UNICODE using \u000 notation. I think
it is the same with PHP although I am not sure.

> > 2) Japanese coding
> > Do you recommend EUC_JP or UNICODE for storing Japanese text in PostgreSQL?
> > This is for use in PHP (both for input and display, no recode needed).
>
>If you are going to use Japanese only, EUC_JP will take less storage
>space. So, in general EUC_JP is recommended.

Are some Japanese fonts designed to work only for EUC-JP or UNICODE?
Can all Japanese fonts be mapped from EUC-JP to UNICODE and conversly?

> > Can you show me more concrete examples showing what you actually want
> to do?
Yes, I am very interested in testing double-byte display in pgAdmin II.
pgAdmin I and II are developed in Visual Basic which supports double-byte
Unicode forms since SP4.

How about ODBC double-byte support?
Is any translation performed between back-end and front-end?
Are there special ODBC settings to display Japanese?
Are there limitations due to Windows (95/98/NT)?
I heard 95 Japanese support was catastrophic.

Presently, I have to store Japanese text in PostgreSQL 7.1 with PHP.
I will do my best to display Japanese text in pgAdmin II.

Greetings from Paris,
Jean-Michel POURE

Re: [HACKERS] UTF-8 support

From
Hiroshi Inoue
Date:
Jean-Michel POURE wrote:
>
> Hello Tatsuo & all,
>
>
> How about ODBC double-byte support?

If you use 'EUC_xx' or 'UTF-8' as the client encoding
you don't need the multibyte version of psqlodbc driver
but if you use 'SJIS' or 'BIG5' you need the multibyte
version introduced by Eiji Tokuya.

> Is any translation performed between back-end and front-end?

You have to set the connect settings Datasource option like
  set client_encoding to 'SJIS'
if you use the different client encoding from that of
server. psqlodbc driver issues above command immediately
after the connection was successful and could use conversion
functionality of PostgreSQL backends.

> Are there special ODBC settings to display Japanese?

pgAdmin.exe doesn't display Japanese characters here
but I can see Japanese characters using pgAdmin source.
There seems to be .. e.g. font problem ...

> Are there limitations due to Windows (95/98/NT)?
> I heard 95 Japanese support was catastrophic.

I've ever used win9X little and don't know the
details.

regards,
Hiroshi Inoue

Re: [HACKERS] UTF-8 support

From
Jean-Michel POURE
Date:
Hello,

Are there built-in functions to convert UTF-8 string values into
hexadecimal \uxxxx and octal values and conversely?
If yes, can I parse any UTF-8 string safely with PL/pgSQL to return \uxxxx
and octal values?

Best regards,
Jean-Michel POURE


Re: [HACKERS] UTF-8 support

From
Tatsuo Ishii
Date:
> 1) UTF-8
> http://www.postgresql.org/idocs/index.php?app-psql.html explains
> "Anything contained in single quotes is furthermore subject to C-like
> substitutions for \n (new line), \t (tab), \digits, \0digits, and \0xdigits
> (the character with the given decimal, octal, or hexadecimal code)."
>
> To start, I would like to store/display a simple 'A' letter in psql, number
> 0041, with the following queries:
>
>  > 'INSERT INTO TABLE table_name VALUES (column-name) VALUES ( ' \0041' );
>     and then SELECT * FROM table_name. It does not work.
>  > Or simply SELECT '\0041'; which does not return 'A'.

Try:

INSERT INTO TABLE table_name VALUES (column-name) VALUES ( ' \101' );

I don't know why the docs claim so, '\OCTAL_NUMBER' seems to work
anyway.

BTW, 'A' is not 041 in octal, it is 101.

> 2) Japanese coding
> Do you recommend EUC_JP or UNICODE for storing Japanese text in PostgreSQL?
> This is for use in PHP (both for input and display, no recode needed).

If you are going to use Japanese only, EUC_JP will take less storage
space. So, in general EUC_JP is recommended.

> 3) Is there a way to query available encodings in PostgreSQL for display in
> pgAdmin.
> Is it a planned feature in PostgreSQL 7.2? This would be nice if it existed.
> Example: function pg_available_encodings -> SQL-ASCII;UNICODE;EUC-JP etc...

Currently no. But it would be easy to implement such a function. What
comes in mind is:

pg_available_encodings([INTEGER how]) RETURNS setof TEXT

where how is

      0(or omitted): returns all available encodings
      1: returns encodings in backend
      2: returns encodings in frontend

Comments?
--
Tatsuo Ishii

Re: [HACKERS] UTF-8 support

From
Tatsuo Ishii
Date:
> > pg_available_encodings([INTEGER how]) RETURNS setof TEXT
> >
> > where how is
> >
> >       0(or omitted): returns all available encodings
> >       1: returns encodings in backend
> >       2: returns encodings in frontend
> >
> > Comments?
>
>         3: returns encodings of both backend and frontend
>
> Why both? To compare and match upon the need.
> If by 0 (ALL) you meant the same, then please ignore my comment.

You are correct. We don't need how=3.

> My question is now how many BE's/FE's would you return encodings for?

I don't quite understand your question. What I thought were something
like this:

SELECT pg_available_encodings();
pg_available_encodings
----------------------
SQL_ASCII
EUC_JP
EUC_CN
EUC_KR
EUC_TW
UNICODE
MULE_INTERNAL
LATIN1
LATIN2
LATIN3
LATIN4
LATIN5
KOI8
WIN
ALT
SJIS
BIG5
WIN1250

BTW, another question comes to my mind. Why don't we make available
this kind of information by ordinaly tables or views, rather than by
functions?  It would be more flexible and easy to use.
--
Tatsuo Ishii

Re: [HACKERS] UTF-8 support

From
"Serguei Mokhov"
Date:
----- Original Message -----
From: Tatsuo Ishii <t-ishii@sra.co.jp>
Sent: Sunday, September 23, 2001 10:47 PM

> > My question is now how many BE's/FE's would you return encodings for?
>
> I don't quite understand your question. What I thought were something
> like this:
>
> SELECT pg_available_encodings();
> pg_available_encodings
> ----------------------
> SQL_ASCII
> EUC_JP
> EUC_CN
> EUC_KR
> EUC_TW
> UNICODE
> MULE_INTERNAL
> LATIN1
> LATIN2
> LATIN3
> LATIN4
> LATIN5
> KOI8
> WIN
> ALT
> SJIS
> BIG5
> WIN1250

Which ones belong to the backend and which ones to the frontend?
Or even more: which ones belong to the backend, which ones
to the frontend #1, which ones to the frontend #2, etc...

For examle, I have two fronends:

FE1: UNICODE,  WIN1251
FE2: KOI8, UNICODE
BE: UNICODE, LATIN1, ALT

Which ones SELECT pg_available_encodings(); will show?
The ones of the BE and the FE making the request?

In case I need to communicate with BE using one common
encoding between the two if it is available.

> BTW, another question comes to my mind. Why don't we make available
> this kind of information by ordinaly tables or views, rather than by
> functions?  It would be more flexible and easy to use.

Sounds like a good idea, make another system table for encodings
and NLS stuff...

S.


Re: [HACKERS] UTF-8 support

From
Tatsuo Ishii
Date:
> Which ones belong to the backend and which ones to the frontend?
> Or even more: which ones belong to the backend, which ones
> to the frontend #1, which ones to the frontend #2, etc...
>
> For examle, I have two fronends:
>
> FE1: UNICODE,  WIN1251
> FE2: KOI8, UNICODE
> BE: UNICODE, LATIN1, ALT
>
> Which ones SELECT pg_available_encodings(); will show?
> The ones of the BE and the FE making the request?
>
> In case I need to communicate with BE using one common
> encoding between the two if it is available.

I'm confused.

What do you mean by BE? BE's encoding is determined by the database
that FE chooses. If you just want to know what kind encodings are
there in the database, why not use:

SELECT DISTINCT ON (encoding) pg_encoding_to_char(encoding) AS
encoding FROM pg_database;

Also, FE's encoding could be any valid encoding that FE chooses,
i.e. it' not BE's choice.

Can you show me more concrete examples showing what you actually want
to do?

>> 3) Is there a way to query available encodings in PostgreSQL for display in
>> pgAdmin.
>
> Could pgAdmin display multibyte chars in the first place ?

Wao. If pgAdmin could not display multibyte chars, all discussions
here are meaningless:-<
--
Tatsuo Ishii

Re: [HACKERS] UTF-8 support

From
"Serguei Mokhov"
Date:
----- Original Message -----
From: Tatsuo Ishii <t-ishii@sra.co.jp>
Sent: Sunday, September 23, 2001 7:58 PM

> > 3) Is there a way to query available encodings in PostgreSQL for display in
> > pgAdmin.
> > Is it a planned feature in PostgreSQL 7.2? This would be nice if it existed.
> > Example: function pg_available_encodings -> SQL-ASCII;UNICODE;EUC-JP etc...
>
> Currently no. But it would be easy to implement such a function. What
> comes in mind is:
>
> pg_available_encodings([INTEGER how]) RETURNS setof TEXT
>
> where how is
>
>       0(or omitted): returns all available encodings
>       1: returns encodings in backend
>       2: returns encodings in frontend
>
> Comments?

        3: returns encodings of both backend and frontend

Why both? To compare and match upon the need.
If by 0 (ALL) you meant the same, then please ignore my comment.

My question is now how many BE's/FE's would you return encodings for?

S.