Thread: character encoding in StartupMessage

character encoding in StartupMessage

From

John DeSoi

Date:

28 February 2006, 04:05:49

I could not find anything in the Frontend/Backend protocol docs about  
character encoding in the StartupMessage. Assuming it is legal for a  
database or user name to have unicode characters, how is this handled  
when nothing yet has been said about the client encoding?

Thanks,


John DeSoi, Ph.D.
http://pgedit.com/
Power Tools for PostgreSQL

Re: character encoding in StartupMessage

From

Christopher Kings-Lynne

Date:

28 February 2006, 04:18:00

> I could not find anything in the Frontend/Backend protocol docs about 
> character encoding in the StartupMessage. Assuming it is legal for a 
> database or user name to have unicode characters, how is this handled 
> when nothing yet has been said about the client encoding?

A similar badness is that if you issue CREATE DATABASE from a UTF8 
database, the dbname will be stored as UTF8.  Then, if you go to a 
LATIN1 database and create another it will be stored as LATIN1.

Then, it's impossible to display both database names on the same screen 
or webpage as they have different encodings...  Not only that but it's 
impossible to know what encoding it IS in since it's the encoding of the 
database from where you issued the CREATE DATABASE instruction from, not 
the encoding of the database itself.

Chris

Re: character encoding in StartupMessage

From

Tom Lane

Date:

28 February 2006, 05:38:13

Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:
>> I could not find anything in the Frontend/Backend protocol docs about 
>> character encoding in the StartupMessage. Assuming it is legal for a 
>> database or user name to have unicode characters, how is this handled 
>> when nothing yet has been said about the client encoding?

> A similar badness is that if you issue CREATE DATABASE from a UTF8 
> database, the dbname will be stored as UTF8.  Then, if you go to a 
> LATIN1 database and create another it will be stored as LATIN1.

Yeah, this has been discussed before.  Database and user names both
have this affliction.

I don't see any very nice solution at the moment.  Once we get support
for per-column locales, it might be possible to declare that the shared
catalogs are always in UTF8 encoding and get the necessary
conversions to happen automatically.
        regards, tom lane

Re: character encoding in StartupMessage

From

Christopher Kings-Lynne

Date:

28 February 2006, 05:44:30

> I don't see any very nice solution at the moment.  Once we get support
> for per-column locales, it might be possible to declare that the shared
> catalogs are always in UTF8 encoding and get the necessary
> conversions to happen automatically.


At the very least, could we always convert dbnames and store them as 
their own encoding?  That way at least in HTML you can probably mark 
them out as having particular encodings or something...

Chris

Re: character encoding in StartupMessage

From

Martijn van Oosterhout

Date:

28 February 2006, 13:45:00

On Tue, Feb 28, 2006 at 02:45:25PM +0800, Christopher Kings-Lynne wrote:
> >I don't see any very nice solution at the moment.  Once we get support
> >for per-column locales, it might be possible to declare that the shared
> >catalogs are always in UTF8 encoding and get the necessary
> >conversions to happen automatically.
>
> At the very least, could we always convert dbnames and store them as
> their own encoding?  That way at least in HTML you can probably mark
> them out as having particular encodings or something...

This may be the only solution. Converting everything to UTF-8 has
issues because some encodings are not roundtrip-safe (Enc -> UTF8 -> Enc
gives you a different string than you started with). There's probably
no encoding round-trip safe with every other encoding.

You could probably do things like assume that the database name is in
the same encoding as that database and set \l to output:

select convert(datname,pg_encoding_to_char(encoding),getdatabaseencoding()) from pg_database;

However, my personal preference is to treat the name of the database as
a "bunch of bits" ie, don't consider it encoded at all. To login the
user must provide the same "bunch of bits". This doesn't solve the
issue of how to display the database names to users. Maybe define a
cluster encoding for the shared catalogs...

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Re: character encoding in StartupMessage

From

Alvaro Herrera

Date:

28 February 2006, 14:05:17

Martijn van Oosterhout wrote:

> This may be the only solution. Converting everything to UTF-8 has
> issues because some encodings are not roundtrip-safe (Enc -> UTF8 -> Enc
> gives you a different string than you started with). There's probably
> no encoding round-trip safe with every other encoding.

Is this still true?  If I remember clearly, Tatsuo-san had asserted that
this was the case, but later he said there was some bug in our
conversion routines or the conversion tables.  So maybe now that those
things are fixed (they are, aren't they?) there _is_ a safe roundtrip
from anything to UTF8 and back.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: character encoding in StartupMessage

From

John DeSoi

Date:

28 February 2006, 14:14:56

On Feb 28, 2006, at 1:38 AM, Tom Lane wrote:

>>> I could not find anything in the Frontend/Backend protocol docs  
>>> about
>>> character encoding in the StartupMessage. Assuming it is legal for a
>>> database or user name to have unicode characters, how is this  
>>> handled
>>> when nothing yet has been said about the client encoding?
>
>> A similar badness is that if you issue CREATE DATABASE from a UTF8
>> database, the dbname will be stored as UTF8.  Then, if you go to a
>> LATIN1 database and create another it will be stored as LATIN1.
>
> Yeah, this has been discussed before.  Database and user names both
> have this affliction.

So are the database/user names in the startup message compared using  
the default encoding of the cluster or is just a straight byte  
comparison with no consideration of the encoding?

Thanks,

John DeSoi, Ph.D.
http://pgedit.com/
Power Tools for PostgreSQL

Re: character encoding in StartupMessage

From

Tom Lane

Date:

28 February 2006, 14:44:41

Martijn van Oosterhout <kleptog@svana.org> writes:
> However, my personal preference is to treat the name of the database as
> a "bunch of bits" ie, don't consider it encoded at all.

That's essentially what we are doing as far as the StartupMessage is
concerned.  Doesn't really solve the problem of post-startup accesses to
the shared catalogs, however.
        regards, tom lane

Re: character encoding in StartupMessage

From

Martijn van Oosterhout

Date:

28 February 2006, 15:14:34

On Tue, Feb 28, 2006 at 12:05:17PM -0300, Alvaro Herrera wrote:
> Martijn van Oosterhout wrote:
>
> > This may be the only solution. Converting everything to UTF-8 has
> > issues because some encodings are not roundtrip-safe (Enc -> UTF8 -> Enc
> > gives you a different string than you started with). There's probably
> > no encoding round-trip safe with every other encoding.
>
> Is this still true?  If I remember clearly, Tatsuo-san had asserted that
> this was the case, but later he said there was some bug in our
> conversion routines or the conversion tables.  So maybe now that those
> things are fixed (they are, aren't they?) there _is_ a safe roundtrip
> from anything to UTF8 and back.

I beleive so. If use the ICU Converter Explorer [1] to examine some of
the encodings we support, they have "Contains ambiguous aliases? TRUE".
This means that there are multiple converters that claim to support that
encoding, though they produce different results.

The UTF-8 and Unicode FAQ [2] also lists some issues with EUC-JP saying
that the converters had to be modified to make round-trip conversion
work. However, not all converters work the same.

Anyway, maybe it's not a big problem anymore. The ISO-2022 series is
definitly not round-trip compatable [3] but I don't think we support
them anyway. I think the only issue is if the mappings postgres uses
internally don't match what the user expects, but I don't think there's
much we can do about that...

[1] http://www-950.ibm.com/software/globalization/icu/demo/converters
[2] http://www.cl.cam.ac.uk/~mgk25/unicode.html
[3] http://www.cl.cam.ac.uk/~mgk25/ucs/iso2022-wc.html

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Re: character encoding in StartupMessage

From

Tom Lane

Date:

28 February 2006, 15:19:13

Martijn van Oosterhout <kleptog@svana.org> writes:
>>> This may be the only solution. Converting everything to UTF-8 has
>>> issues because some encodings are not roundtrip-safe

>> Is this still true?

> I beleive so. If use the ICU Converter Explorer [1] to examine some of
> the encodings we support, they have "Contains ambiguous aliases? TRUE".

Which ones, and are they client-only encodings?  If all our server-side
encodings are round-trip safe then I think there's no big issue.

In any case I don't think there's a huge problem if we say that database
and user names had better be chosen from the round-trip-safe subset.
        regards, tom lane

Re: character encoding in StartupMessage

From

"Magnus Hagander"

Date:

28 February 2006, 15:25:39

> Martijn van Oosterhout <kleptog@svana.org> writes:
> >>> This may be the only solution. Converting everything to UTF-8 has
> >>> issues because some encodings are not roundtrip-safe
>
> >> Is this still true?
>
> > I beleive so. If use the ICU Converter Explorer [1] to
> examine some of
> > the encodings we support, they have "Contains ambiguous
> aliases? TRUE".
>
> Which ones, and are they client-only encodings?  If all our
> server-side encodings are round-trip safe then I think
> there's no big issue.
>
> In any case I don't think there's a huge problem if we say
> that database and user names had better be chosen from the
> round-trip-safe subset.

Doesn't this also affect passwords? If so it might be harder to enforce
as the user is often allowed to pick his own password...

//Magnus

Re: character encoding in StartupMessage

From

Martijn van Oosterhout

Date:

28 February 2006, 15:45:38

On Tue, Feb 28, 2006 at 11:19:02AM -0500, Tom Lane wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
> >>> This may be the only solution. Converting everything to UTF-8 has
> >>> issues because some encodings are not roundtrip-safe
>
> >> Is this still true?
>
> > I beleive so. If use the ICU Converter Explorer [1] to examine some of
> > the encodings we support, they have "Contains ambiguous aliases? TRUE".
>
> Which ones, and are they client-only encodings?  If all our server-side
> encodings are round-trip safe then I think there's no big issue.
>
> In any case I don't think there's a huge problem if we say that database
> and user names had better be chosen from the round-trip-safe subset.

This is what it says here [1]:
 There are only 19 encodings currently used worldwide as legitimate POSIX multi-byte locale encodings:
   UTF-8, ISO-8859-1, ISO-8859-2, ISO-8859-3, ISO-8859-5, ISO-8859-6,   ISO-8859-7, ISO-8859-8, ISO-8859-9,
ISO-8859-13,ISO-8859-15,   EUC-JP, EUC-KR, GB2312 (= EUC-CN), KOI8-R, KOI8-U, VISCII,   WINDOWS-1251, WINDOWS-1256 
 Each of these is fully roundtrip compatible to ISO 10646, therefore all these locales can be represented nicely in
wchar_tas the equivalent UCS values. The above names and the corresponding defining documents are listed in the IANA
charsetregistry. 

Some of these have multiple definitions according to ICU meaning that
different platforms have implemented them differently in the past
(EUC-JP falls into this catagory), but presumably the IANA charset
registry has proper definitions.

Of the reminaing encodings we support, Big5 is OK, although the term
win-950 which is the windows version has changed over time. GBK has
same problem, win-936 has changed to over time. I don't think we should
concern ourselves with bugs in the windows encodings.

IOW, I think we are mostly safe.

[1] http://www.cl.cam.ac.uk/~mgk25/ucs/iso2022-wc.html
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Re: character encoding in StartupMessage

From

John DeSoi

Date:

28 February 2006, 17:53:14

On Feb 28, 2006, at 11:19 AM, Tom Lane wrote:

> In any case I don't think there's a huge problem if we say that  
> database
> and user names had better be chosen from the round-trip-safe subset.

What about the pg_hba.conf file? Is there a provision to specify the  
encoding or some other way to deal with non-ascii characters?

Thanks,

John DeSoi, Ph.D.
http://pgedit.com/
Power Tools for PostgreSQL

Re: character encoding in StartupMessage

From

Tom Lane

Date:

28 February 2006, 18:49:41

John DeSoi <desoi@pgedit.com> writes:
> On Feb 28, 2006, at 11:19 AM, Tom Lane wrote:
> What about the pg_hba.conf file? Is there a provision to specify the  
> encoding or some other way to deal with non-ascii characters?

pg_hba.conf is also processed without any locale considerations,
ie, effectively the "bunch of bits" approach Martijn mentioned.
        regards, tom lane