Thread: character encoding in StartupMessage
I could not find anything in the Frontend/Backend protocol docs about character encoding in the StartupMessage. Assuming it is legal for a database or user name to have unicode characters, how is this handled when nothing yet has been said about the client encoding? Thanks, John DeSoi, Ph.D. http://pgedit.com/ Power Tools for PostgreSQL
> I could not find anything in the Frontend/Backend protocol docs about > character encoding in the StartupMessage. Assuming it is legal for a > database or user name to have unicode characters, how is this handled > when nothing yet has been said about the client encoding? A similar badness is that if you issue CREATE DATABASE from a UTF8 database, the dbname will be stored as UTF8. Then, if you go to a LATIN1 database and create another it will be stored as LATIN1. Then, it's impossible to display both database names on the same screen or webpage as they have different encodings... Not only that but it's impossible to know what encoding it IS in since it's the encoding of the database from where you issued the CREATE DATABASE instruction from, not the encoding of the database itself. Chris
Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes: >> I could not find anything in the Frontend/Backend protocol docs about >> character encoding in the StartupMessage. Assuming it is legal for a >> database or user name to have unicode characters, how is this handled >> when nothing yet has been said about the client encoding? > A similar badness is that if you issue CREATE DATABASE from a UTF8 > database, the dbname will be stored as UTF8. Then, if you go to a > LATIN1 database and create another it will be stored as LATIN1. Yeah, this has been discussed before. Database and user names both have this affliction. I don't see any very nice solution at the moment. Once we get support for per-column locales, it might be possible to declare that the shared catalogs are always in UTF8 encoding and get the necessary conversions to happen automatically. regards, tom lane
> I don't see any very nice solution at the moment. Once we get support > for per-column locales, it might be possible to declare that the shared > catalogs are always in UTF8 encoding and get the necessary > conversions to happen automatically. At the very least, could we always convert dbnames and store them as their own encoding? That way at least in HTML you can probably mark them out as having particular encodings or something... Chris
On Tue, Feb 28, 2006 at 02:45:25PM +0800, Christopher Kings-Lynne wrote: > >I don't see any very nice solution at the moment. Once we get support > >for per-column locales, it might be possible to declare that the shared > >catalogs are always in UTF8 encoding and get the necessary > >conversions to happen automatically. > > At the very least, could we always convert dbnames and store them as > their own encoding? That way at least in HTML you can probably mark > them out as having particular encodings or something... This may be the only solution. Converting everything to UTF-8 has issues because some encodings are not roundtrip-safe (Enc -> UTF8 -> Enc gives you a different string than you started with). There's probably no encoding round-trip safe with every other encoding. You could probably do things like assume that the database name is in the same encoding as that database and set \l to output: select convert(datname,pg_encoding_to_char(encoding),getdatabaseencoding()) from pg_database; However, my personal preference is to treat the name of the database as a "bunch of bits" ie, don't consider it encoded at all. To login the user must provide the same "bunch of bits". This doesn't solve the issue of how to display the database names to users. Maybe define a cluster encoding for the shared catalogs... Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
Martijn van Oosterhout wrote: > This may be the only solution. Converting everything to UTF-8 has > issues because some encodings are not roundtrip-safe (Enc -> UTF8 -> Enc > gives you a different string than you started with). There's probably > no encoding round-trip safe with every other encoding. Is this still true? If I remember clearly, Tatsuo-san had asserted that this was the case, but later he said there was some bug in our conversion routines or the conversion tables. So maybe now that those things are fixed (they are, aren't they?) there _is_ a safe roundtrip from anything to UTF8 and back. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Feb 28, 2006, at 1:38 AM, Tom Lane wrote: >>> I could not find anything in the Frontend/Backend protocol docs >>> about >>> character encoding in the StartupMessage. Assuming it is legal for a >>> database or user name to have unicode characters, how is this >>> handled >>> when nothing yet has been said about the client encoding? > >> A similar badness is that if you issue CREATE DATABASE from a UTF8 >> database, the dbname will be stored as UTF8. Then, if you go to a >> LATIN1 database and create another it will be stored as LATIN1. > > Yeah, this has been discussed before. Database and user names both > have this affliction. So are the database/user names in the startup message compared using the default encoding of the cluster or is just a straight byte comparison with no consideration of the encoding? Thanks, John DeSoi, Ph.D. http://pgedit.com/ Power Tools for PostgreSQL
Martijn van Oosterhout <kleptog@svana.org> writes: > However, my personal preference is to treat the name of the database as > a "bunch of bits" ie, don't consider it encoded at all. That's essentially what we are doing as far as the StartupMessage is concerned. Doesn't really solve the problem of post-startup accesses to the shared catalogs, however. regards, tom lane
On Tue, Feb 28, 2006 at 12:05:17PM -0300, Alvaro Herrera wrote: > Martijn van Oosterhout wrote: > > > This may be the only solution. Converting everything to UTF-8 has > > issues because some encodings are not roundtrip-safe (Enc -> UTF8 -> Enc > > gives you a different string than you started with). There's probably > > no encoding round-trip safe with every other encoding. > > Is this still true? If I remember clearly, Tatsuo-san had asserted that > this was the case, but later he said there was some bug in our > conversion routines or the conversion tables. So maybe now that those > things are fixed (they are, aren't they?) there _is_ a safe roundtrip > from anything to UTF8 and back. I beleive so. If use the ICU Converter Explorer [1] to examine some of the encodings we support, they have "Contains ambiguous aliases? TRUE". This means that there are multiple converters that claim to support that encoding, though they produce different results. The UTF-8 and Unicode FAQ [2] also lists some issues with EUC-JP saying that the converters had to be modified to make round-trip conversion work. However, not all converters work the same. Anyway, maybe it's not a big problem anymore. The ISO-2022 series is definitly not round-trip compatable [3] but I don't think we support them anyway. I think the only issue is if the mappings postgres uses internally don't match what the user expects, but I don't think there's much we can do about that... [1] http://www-950.ibm.com/software/globalization/icu/demo/converters [2] http://www.cl.cam.ac.uk/~mgk25/unicode.html [3] http://www.cl.cam.ac.uk/~mgk25/ucs/iso2022-wc.html Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
Martijn van Oosterhout <kleptog@svana.org> writes: >>> This may be the only solution. Converting everything to UTF-8 has >>> issues because some encodings are not roundtrip-safe >> Is this still true? > I beleive so. If use the ICU Converter Explorer [1] to examine some of > the encodings we support, they have "Contains ambiguous aliases? TRUE". Which ones, and are they client-only encodings? If all our server-side encodings are round-trip safe then I think there's no big issue. In any case I don't think there's a huge problem if we say that database and user names had better be chosen from the round-trip-safe subset. regards, tom lane
> Martijn van Oosterhout <kleptog@svana.org> writes: > >>> This may be the only solution. Converting everything to UTF-8 has > >>> issues because some encodings are not roundtrip-safe > > >> Is this still true? > > > I beleive so. If use the ICU Converter Explorer [1] to > examine some of > > the encodings we support, they have "Contains ambiguous > aliases? TRUE". > > Which ones, and are they client-only encodings? If all our > server-side encodings are round-trip safe then I think > there's no big issue. > > In any case I don't think there's a huge problem if we say > that database and user names had better be chosen from the > round-trip-safe subset. Doesn't this also affect passwords? If so it might be harder to enforce as the user is often allowed to pick his own password... //Magnus
On Tue, Feb 28, 2006 at 11:19:02AM -0500, Tom Lane wrote: > Martijn van Oosterhout <kleptog@svana.org> writes: > >>> This may be the only solution. Converting everything to UTF-8 has > >>> issues because some encodings are not roundtrip-safe > > >> Is this still true? > > > I beleive so. If use the ICU Converter Explorer [1] to examine some of > > the encodings we support, they have "Contains ambiguous aliases? TRUE". > > Which ones, and are they client-only encodings? If all our server-side > encodings are round-trip safe then I think there's no big issue. > > In any case I don't think there's a huge problem if we say that database > and user names had better be chosen from the round-trip-safe subset. This is what it says here [1]: There are only 19 encodings currently used worldwide as legitimate POSIX multi-byte locale encodings: UTF-8, ISO-8859-1, ISO-8859-2, ISO-8859-3, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8, ISO-8859-9, ISO-8859-13,ISO-8859-15, EUC-JP, EUC-KR, GB2312 (= EUC-CN), KOI8-R, KOI8-U, VISCII, WINDOWS-1251, WINDOWS-1256 Each of these is fully roundtrip compatible to ISO 10646, therefore all these locales can be represented nicely in wchar_tas the equivalent UCS values. The above names and the corresponding defining documents are listed in the IANA charsetregistry. Some of these have multiple definitions according to ICU meaning that different platforms have implemented them differently in the past (EUC-JP falls into this catagory), but presumably the IANA charset registry has proper definitions. Of the reminaing encodings we support, Big5 is OK, although the term win-950 which is the windows version has changed over time. GBK has same problem, win-936 has changed to over time. I don't think we should concern ourselves with bugs in the windows encodings. IOW, I think we are mostly safe. [1] http://www.cl.cam.ac.uk/~mgk25/ucs/iso2022-wc.html -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
On Feb 28, 2006, at 11:19 AM, Tom Lane wrote: > In any case I don't think there's a huge problem if we say that > database > and user names had better be chosen from the round-trip-safe subset. What about the pg_hba.conf file? Is there a provision to specify the encoding or some other way to deal with non-ascii characters? Thanks, John DeSoi, Ph.D. http://pgedit.com/ Power Tools for PostgreSQL
John DeSoi <desoi@pgedit.com> writes: > On Feb 28, 2006, at 11:19 AM, Tom Lane wrote: > What about the pg_hba.conf file? Is there a provision to specify the > encoding or some other way to deal with non-ascii characters? pg_hba.conf is also processed without any locale considerations, ie, effectively the "bunch of bits" approach Martijn mentioned. regards, tom lane