Thread: Re: pgadmin3 clientencoding

Re: pgadmin3 clientencoding

From
Andreas Pflug
Date:
Andreas Pflug wrote:

>
> Do we really need special encodings, besides unicode? If so, this
> should be implemented on a tree node (Server property: client
> encoding) to make it possible to let the change of encoding have
> immediate effect, or as the "System Object" setting is implemented.
>
There's another point:
I'd like to have the list of valid encodings read from the server,
rather than hard-coding it. This obviously needs a database connection,
which frmOptions lacks.
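
A rough sketch (not actual pgAdmin3 code) of how such a list could be read
over an existing libpq connection, using the pg_encoding_to_char() system
function; the probing loop and its upper bound are assumptions:

#include <libpq-fe.h>
#include <cstdio>
#include <string>
#include <vector>

// Probe ascending encoding IDs and collect the names the server knows.
// pg_encoding_to_char() returns an empty string for unused IDs.
std::vector<std::string> serverEncodings(PGconn *conn)
{
    std::vector<std::string> names;
    for (int id = 0; id < 64; id++)        // upper bound is a guess
    {
        char query[64];
        snprintf(query, sizeof(query), "SELECT pg_encoding_to_char(%d)", id);
        PGresult *res = PQexec(conn, query);
        if (PQresultStatus(res) == PGRES_TUPLES_OK && PQntuples(res) == 1)
        {
            const char *name = PQgetvalue(res, 0, 0);
            if (name && *name)
                names.push_back(name);
        }
        PQclear(res);
    }
    return names;
}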

Regards,
Andreas



Re: pgadmin3 clientencoding

From
Jean-Michel POURE
Date:
On Tuesday 10 June 2003 11:11, you wrote:
> > Do we really need special encodings, besides unicode? If so, this
> > should be implemented on a tree node (Server property: client
> > encoding) to make it possible to let the change of encoding have
> > immediate effect, or as the "System Object" setting is implemented.

This question was asked during phpPgAdmin development. The answer is that "set
client_encoding" is not always safe.

For example:
- server = SJIS
- client = Unicode
= mostly safe

- server = Latin1
- client = Unicode
= not safe, because characters that have no Latin1 equivalent will be dropped
during conversion.
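
For illustration, the unsafe combination boils down to something like this at
the libpq level (the table and column are made up; whether the character is
dropped or rejected with an error depends on the server version):

#include <libpq-fe.h>
#include <cstdio>

// Sketch: Latin1 server, Unicode client encoding. "\xE2\x82\xAC" is the
// euro sign in UTF-8; it has no Latin1 equivalent, so the server either
// mangles it or reports a conversion error.
void demo(PGconn *conn)
{
    PGresult *res = PQexec(conn, "SET client_encoding = 'UNICODE'");
    PQclear(res);

    res = PQexec(conn,
        "INSERT INTO demo_table (note) VALUES ('price: 100 \xE2\x82\xAC')");
    if (PQresultStatus(res) != PGRES_COMMAND_OK)
        fprintf(stderr, "conversion problem: %s", PQerrorMessage(conn));
    PQclear(res);
}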

Cheers,
Jean-Michel


Re: pgadmin3 clientencoding

From
Andreas Pflug
Date:
Jean-Michel POURE wrote:

>On Tuesday 10 June 2003 11:11, you wrote:
>
>
>>>Do we really need special encodings, besides unicode? If so, this
>>>should be implemented on a tree node (Server property: client
>>>encoding) to make it possible to let the change of encoding have
>>>immediate effect, or as the "System Object" setting is implemented.
>>>
>>>
>
>This question was asked during PhpPgAdmin development. The answer is that "set
>client_encoding" is not always safe.
>
>For example:
>- server = SJSS
>- client = Unicode
>= mostly safe
>
>- server = Latin1
>- client = Unicode
>= not safe, because if you enter some non-regular characters, they will be
>dropped during conversion.
>
OK, this means a client encoding per database is needed, right?
Additional property for database?

Regards,
Andreas


Re: pgadmin3 clientencoding

From
Jean-Michel POURE
Date:
On Tuesday 10 June 2003 11:39, Andreas Pflug wrote:
> OK, this means a client encoding per database is needed, right?
> Additional property for database?

Yes. Whenever possible, the database, client and wxWindows encodings should be
the same. For example, the best solution is to have a full Unicode chain:

- UTF-8 database
- UTF-8 data stream (set client_encoding='Unicode')
- UTF-8 display libraries (wxGTK with --enable-unicode).

When the server encoding differs, it can cause problems. Example:
- Latin1 database
- Unicode stream (set client_encoding='Unicode')
- UTF-8 display (wxGTK with --enable-unicode)

The grid will display information fine, but whenever the user enters an
illegal character (for example a euro sign, which does not belong to Latin1
but does belong to UTF-8), it will be dropped by the parser.

This kind of problem is less frequent with Asian encodings:
- SJIS database
- Unicode stream (set client_encoding='Unicode')
- UTF-8 display (wxGTK with --enable-unicode)

The only solution I see would be to use the iconv library (or the recode
library) to check whether text entered by a user can be converted into the
server encoding safely or not.

A "safety" conversion test can be done in two steps:
1) convert the text entered by the user from UTF-8 into the database encoding,
2) convert the resulting text back from database encoding into UTF-8.

If the text is the same, the conversion is "safe". Example:
- Latin1 database
- Unicode stream (set client_encoding='Unicode')
- UTF-8 display (wxGTK with --enable-unicode)

1) convert the text entered by the user from UTF-8 into Latin1,
2) convert the resulting text back from Latin1 into UTF-8.

In this example, if a user enters a euro sign (€), it will be dropped and
hence the "safety" test will fail.
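
A minimal sketch of that round-trip test using the iconv library; the
function names below are made up, and real code would first have to get the
UTF-8 bytes out of wxString:

#include <iconv.h>
#include <cerrno>
#include <string>

// Convert 'in' from encoding 'from' to encoding 'to'; returns false as soon
// as a character cannot be represented in the target encoding.
static bool Convert(const char *from, const char *to,
                    const std::string &in, std::string &out)
{
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return false;

    std::string result;
    char *src = const_cast<char *>(in.data());
    size_t srcLeft = in.size();

    while (srcLeft > 0)
    {
        char buf[256];
        char *dst = buf;
        size_t dstLeft = sizeof(buf);
        size_t rc = iconv(cd, &src, &srcLeft, &dst, &dstLeft);
        int err = errno;
        result.append(buf, sizeof(buf) - dstLeft);
        if (rc == (size_t)-1 && err != E2BIG)   // EILSEQ/EINVAL: bad input
        {
            iconv_close(cd);
            return false;
        }
    }
    iconv_close(cd);
    out = result;
    return true;
}

// The "safety" test described above: UTF-8 -> database encoding -> UTF-8
// must reproduce the original text, otherwise the entry would be mangled.
bool IsSafeForEncoding(const std::string &utf8Text, const char *dbEncoding)
{
    std::string converted, roundTrip;
    return Convert("UTF-8", dbEncoding, utf8Text, converted)
        && Convert(dbEncoding, "UTF-8", converted, roundTrip)
        && roundTrip == utf8Text;
}

An error check on the first conversion would already catch most cases; the
back-conversion and comparison only guards against lossy but "successful"
conversions.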

Cheers,
Jean-Michel


Re: pgadmin3 clientencoding

From
Andreas Pflug
Date:
Jean-Michel POURE wrote:

>On Tuesday 10 June 2003 11:39, Andreas Pflug wrote:
>
>
>>OK, this means a client encoding per database is needed, right?
>>Additional property for database?
>>
>>
>
>Yes. Whenever possible database, client and wxWindows encodings should be the
>same. For example, the best solution is to have a full Unicode chain:
>
>- UTF-8 database
>- UTF-8 data stream (set client-encoding='Unicode')
>- UTF-8 display libraries (wxGTK with --enable-unicode).
>
>When server encoding differs, it can cause problems. Example:
>- Latin1 database
>- Unicode stream (set client-encoding='Unicode')
>- UTF-8 display (wxGTK with --enable-unicode)
>
>The grid will display information fine, but whenever the user enters an
>illegal character (for example a euro sign which does not belong to Latin1
>but belongs to UTF-8), it will be dropped by the parser.
>
>This kind of problem is less frequent with Asian encodings:
>- SJSS database
>- Unicode stream (set client-encoding='Unicode')
>- UTF-8 display (wxGTK with --enable-unicode)
>
>The only solution I see would be to use the iconv libraries (or recode
>libraries) to check whether a text entered by a user can be converted into a
>server encoding safely or not.
>
>A "safety" conversion test can be done in two steps:
>1) convert the text entered by the user from UTF-8 into the database encoding,
>2) convert the resulting text back from database encoding into UTF-8.
>
>If the text is the same, the conversion is "safe". Example:
>- Latin1 database
>- Unicode stream (set client-encoding='Unicode')
>- UTF-8 display (wxGTK with --enable-unicode)
>
>1) convert the text entered by the user from UTF-8 into Latin1,
>2) convert the resulting text back from Latin1 into UTF-8.
>
>In this example, if a user enters a euro sign (€), it will be dropped and
>hence the "safety" test will fail.
>
The longer I think about this, the more the current implementation
appears wrong to me. The decisive factor is not the user's wish, but our
charset conversion capability, and that's pretty clear: wxString can
convert Unicode to ASCII and back, nothing else. Since Unicode will be
the recommended setup for non-ASCII databases, the client encoding
should be Unicode for all connections. This should enable correct schema
and property display. Allowing the connection to be something different
would mean wxString needs to know how to convert from xxx to Unicode,
i.e. implementing a client-side conversion, which doesn't make sense.
This means: client encoding=SQL_ASCII for a non-Unicode, and UNICODE for
a Unicode-compiled pgAdmin3.
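
A sketch of what that rule could look like when the connection is opened;
PQsetClientEncoding() is standard libpq and wxUSE_UNICODE is the wxWindows
build flag, the rest is an assumption:

#include <wx/defs.h>      // defines wxUSE_UNICODE for the current build
#include <libpq-fe.h>

// Pick the client encoding once per connection, depending on how pgAdmin3
// was compiled; returns false if the server refuses the encoding.
bool SetupClientEncoding(PGconn *conn)
{
#if wxUSE_UNICODE
    // Unicode build: let the backend recode everything to/from UTF-8.
    return PQsetClientEncoding(conn, "UNICODE") == 0;
#else
    // ANSI build: take the bytes as-is, no backend conversion.
    return PQsetClientEncoding(conn, "SQL_ASCII") == 0;
#endif
}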

The remaining problem is that of text entered by the user. This
separates into two categories:
1) freetext entry from frmQuery; the user is responsible for using
correct settings and input representation
2) guided entry, where we hopefully know what may be entered and can
check for legal characters ourselves.

Regards,
Andreas


Re: pgadmin3 clientencoding

From
Jean-Michel POURE
Date:
Dear Andreas,

I don't know if I understand you well. If I don't, please disregard my message.
Here is my point of view:

> The longer I think about this, the more the current implementation
> appears wrong to me. The decisive factor is not a user's wish, but the
> ability of our charset conversion ability, and that's pretty clear:
> wxString can convert unicode to ascii and back, nothing else. Since
> unicode will be the recommended setup for non-ascii databases, the
> client encoding should be unicode for all connections. This should
> enable correct schema and property display. Allowing the connection to
> be something different would mean wxString needs to know how to convert
> from xxx to unicode, i.e. implementing a client side conversion, which
> doesn't make sense. This means: client encoding=SQL_ASCII for
> non-unicode, and UNICODE for unicode compiled pgAdmin3.

I am absolutely sure that we cannot rely on recommendations such as "create a
UNICODE database for multi-byte data and SQL_ASCII otherwise".

PostgreSQL's central feature is the ability to store and manage various
encodings. For example, in Japan, many databases are stored under EUC_JP and
SJIS. You won't ask users to migrate their databases to UTF-8.

Therefore, pgAdmin3 shall manage encodings transparently. This is a ***key
feature***. Don't get me wrong, I propose to:

1) Always compile pgAdmin3 with Unicode support. (By the way, I would also be
delighted if all .po files were stored in UTF-8).

2) Always "set client_encoding=Unicode" in order to recode data streams at
the backend level. This is 100% safe for data viewing. From my experience, I
have never had any problem with this feature, which is bug-free.

PostgreSQL is the only database in the world with such on-the-fly conversion
at the data stream level. So why not use it?

3) We only need to check whether the data entered in the grid can be (a)
converted from UTF-8 into the database encoding and (b) back from the
database encoding into Unicode.

Iconv (http://www.gnu.org/software/libiconv) or recode
(http://www.iro.umontreal.ca/contrib/recode/HTML/readme.html) libraries can
be used for that. In case of license incompatibilities, we can always use
binary executables of iconv and recode. Iconv and recode are installed by
default in all GNU/Linux distributions.
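
If the libraries turn out to be a licensing problem, here is a rough sketch
of falling back to the iconv binary. This only checks the UTF-8 -> database
encoding direction, and the command line, quoting and temp-file handling are
left as assumptions:

#include <cstdio>      // popen/pclose are POSIX
#include <string>

// Ask the external iconv program whether a UTF-8 file converts cleanly into
// the target encoding: iconv exits non-zero when it meets a character it
// cannot convert.
bool ConvertsCleanly(const std::string &utf8File, const std::string &encoding)
{
    std::string cmd = "iconv -f UTF-8 -t " + encoding + " " + utf8File
                    + " > /dev/null 2>&1";
    FILE *p = popen(cmd.c_str(), "r");
    if (!p)
        return false;
    return pclose(p) == 0;   // exit status 0: whole file converted
}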

Alternatively, we could borrow the PostgreSQL backend validation code. I know
this code exists because in some cases PostgreSQL refused to accept euro signs
in a Latin1 database and returned an error.

There is no other way: the only alternative would be to add native multi-byte
support (SJIS, etc.) to the wxWindows widgets, which is impossible. So the
only remaining solution is to view all data as UTF-8 Unicode.

> The remaining problem is that of text entered by the user. This
> separates into two categories:
> 1) freetext entry from frmQuery. The user is responsible to use correct
> settings and input representation
> 2) guided entry, here we hopefully know what may be entered, and check
> ourselves for legal characters.

I doubt a user would know that the euro sign (€) does not belong to Latin1.
There are hundreds of examples like that. Therefore, it is impossible to
create a list of legal/forbidden characters.

The only ways to test for correct entry are:
- to round-trip the entry as explained in 3),
or
- to use the PostgreSQL backend code.

Maybe we should ask for information on the hackers list. What do you think?

Cheers,
Jean-Michel




Re: pgadmin3 clientencoding

From
Andreas Pflug
Date:
Jean-Michel POURE wrote:

>I am absolutely sure that we cannot rely on recommandations, such as "create
>UNICODE database for multi-byte data and SQL_ASCII otherwise".
>
Jean-Michel,

you got me wrong. Client encoding is only about the data transfer, and
that includes not only the transfer from the server to the client but
also from the client interface to the user interface. PostgreSQL will
happily convert any server encoding to Unicode, and wxWindows will
present this to the user.

>
>PostgreSQL central feature is the ability to store and manage various
>encodings. For example, in Japan, many databases are stored under EUC_JP and
>SJIS. You wron't ask users to migrate their database to UTF-8.
>
Right; we'll be using Unicode for internal communication with the server,
whatever the server encoding might be.

>
>Therefore, pgAdmin3 shall manage encodings transparently. This is a ***key
>feature***. Don't get me wrong, I propose to:
>
>1) Always compile pgAdmin3 with Unicode support. (By the way, I would also be
>delighted if all .po files were stored in UTF-8).
>
Yes.

>
>2) Always "set client_encoding=Unicode" in order to recode data streams at
>backend level. This is 100% safe in case of data viewing. From my point of
>view, I never had any problem with this feature, which is bug free.
>
Yes.

>
>PostgreSQL is the only database in the world with such on-the-fly conversion
>at data stream level. So why not use it.
>
>3) We only need to check whether the data entered in the grid can be (a)
>converted from UTF-8 into the database encoding and (b) back from the
>database encoding into Unicode.
>
Yes.

>
>Iconv (http://www.gnu.org/software/libiconv) or recode
>(http://www.iro.umontreal.ca/contrib/recode/HTML/readme.html) libraries can
>be used for that. In case of license incompatibilities, we can always use
>binary executables of iconv and recode. Iconv and recode are installed by
>default in all GNU/Linux distributions.
>
Hmmm... a lot of work.

>
>Alternatively, we could borrow PostgreSQL backend validation code. I know this
>code exists because in some cases PostgreSQL refused to enter euro signs in a
>Latin1 database and returned an error.
>
Much better!

>
>There is no other way. The only other way would be to add native multi-byte
>support (SJSS, etc...) to wxWindows widgets,
>
Uhhh! Horror!

>which is impossible. So, the
>only remaining solution is to view all data in UTF-8 Unicode.
>
>
Phew... agreed.

>I doubt a user should know that the Euro sign (€) does not belong to Latin1.
>There are hundreds of examples like that. Therefore, it is impossible to
>create a list of legal/forbidden characters.
>
>The only way to test for correct entry is:
>- to convert the entry like explained in 3)
>or,
>- use PostgreSQL backend code.
>
There's probably a function like "bool
PG_is_compliant_to_server_encoding(text)" which we should use. If we
can't locate it, we can contact pgsql-hackers about this.
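
If no such function turns up, one server-side approximation would be to let
the backend attempt the conversion itself via the SQL convert() function and
treat an error as "not compliant". convert()'s exact signature and its
behaviour on unconvertible characters vary between versions, so this is only
a sketch:

#include <libpq-fe.h>
#include <string>

// Let the backend try to convert the (already quote-escaped) text from
// Unicode into the server encoding; an error means it is not representable.
bool ServerAcceptsText(PGconn *conn, const std::string &escapedText,
                       const std::string &serverEncoding)
{
    std::string query = "SELECT convert('" + escapedText + "', 'UNICODE', '"
                      + serverEncoding + "')";
    PGresult *res = PQexec(conn, query.c_str());
    bool ok = (PQresultStatus(res) == PGRES_TUPLES_OK);
    PQclear(res);
    return ok;
}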

Regards,
Andreas


Re: pgadmin3 clientencoding

From
"Hiroshi Saito"
Date:
Hi Andreas.

----- Original Message -----
From: "Andreas Pflug" <Andreas.Pflug@web.de>
To: <pgadmin-hackers@postgresql.org>; "Dave Page" <dpage@vale-housing.co.uk>
Sent: Tuesday, June 10, 2003 6:11 PM
Subject: Re: [pgadmin-hackers] pgadmin3 clientencoding


> Andreas Pflug wrote:
>
> >
> > Do we really need special encodings, besides unicode? If so, this
> > should be implemented on a tree node (Server property: client
> > encoding) to make it possible to let the change of encoding have
> > immediate effect, or as the "System Object" setting is implemented.
> >
> There's another point:
> I'd like to have the list of valid encodings read from the server,
> rather than hard-coding it. This obviously needs a database connection,
> which frmOptions lacks.

Multiple database encodings already work comfortably with the present
implementation (see db/pgConn.cpp); any further remodeling shouldn't be
necessary.

With kindest regards,
Hiroshi-Saito



Re: pgadmin3 clientencoding

From
"Hiroshi Saito"
Date:
Hi Andreas.

----- Original Message -----
From: "Andreas Pflug" <Andreas.Pflug@web.de>
To: <jm.poure@freesurf.fr>; <pgadmin-hackers@postgresql.org>; "Dave Page"
<dpage@vale-housing.co.uk>
Sent: Tuesday, June 10, 2003 11:41 PM
Subject: Re: [pgadmin-hackers] pgadmin3 clientencoding


> Jean-Michel POURE wrote:
(snip)
> >
> >Therefore, pgAdmin3 shall manage encodings transparently. This is a
> >***key feature***. Don't get me wrong, I propose to:
> >
> >1) Always compile pgAdmin3 with Unicode support. (By the way, I would
> >also be delighted if all .po files were stored in UTF-8).
> >
> Yes.

No, I think that should be an option.

(snip)
> >PostgreSQL is the only database in the world with such on-the-fly
> >conversion at the data stream level. So why not use it?
> >
> >3) We only need to check whether the data entered in the grid can be (a)
> >converted from UTF-8 into the database encoding and (b) back from the
> >database encoding into Unicode.
> >
> Yes.

I agree with this, but it will take time.

(snip)
With kindest regards,
Hiroshi-Saito