Re: JDBC driver patch for non-ASCII users - Mailing list pgsql-jdbc
From | sulfinu@gmail.com |
---|---|
Subject | Re: JDBC driver patch for non-ASCII users |
Date | |
Msg-id | 200712111646.13299.sulfinu@gmail.com |
In response to | Re: JDBC driver patch for non-ASCII users (Tom Lane <tgl@sss.pgh.pa.us>) |
List | pgsql-jdbc |
On Saturday 08 December 2007, Tom Lane wrote:
> Given the current design that allows different databases in a cluster
> to (claim they) have different encodings, it's real hard to see how
> to handle non-ASCII data in shared catalogs sanely. I don't think
> we'll really be able to fix this properly until that mythical day
> when we have support for per-column encoding selections. My guess
> is that we'd then legislate that shared catalog columns are always
> UTF8; after which we could start to think about what it would take
> to do conversion of the connection startup packet's contents from
> client-side encoding to UTF8.

First of all, judging from the code I read, you'll have to adjust the wire protocol so that the encoding is signaled at the very beginning of a connection! The V3 protocol seems close, but is not quite there. Take for example the way XML reader programs process the encoding information.

Next, there's something I already suggested on the "-hackers" mailing list. Until the day when PostgreSQL is rewritten in a Unicode-savvy language (where a "char" is indeed a Unicode point), I believe you should consider enforcing for any database cluster a single encoding chosen from the encodings that cover the whole Unicode set, like UTF-8, UTF-16 etc. This way, lots of problems disappear, things get cleaner and clients need not guess the encoding used at server side for user name, password, database name, table names and so on. Collation rules would finally depend solely on locale, just as they should.

The only downside that I see is a (slight) increase in database size, but that's not an issue nowadays. Perhaps you could offer administrators a choice of encoding upon cluster creation, one that would statistically minimize the size, depending on the most used languages.
If you have other reasons against it, bring them to the table, but please do not post ridiculous statements like "I'm not sure a Java char is a Unicode point" or "I don't think that Unicode covers all languages", which I didn't even bother to answer with the classical "RTFM!".

Support for per-column encoding selection is, from my point of view, a stupid waste of development effort and CPU time, not to mention a great opportunity to introduce a myriad of bugs. You're looking at the problem from the wrong end: it is not the encodings that must be flexibly chosen, it is the alphabet! No user is ever going to be interested in the internal encoding of a Postgres database file, nor should he be. But the user will always appreciate finding again the same strings he put into the database, regardless of his mother tongue and client program. The logical solution is to support Unicode and disregard encodings altogether (actually, keep them under the covers, since they are a result of historical limitations).

On Saturday 08 December 2007, Kris Jurka wrote:
> For the record, I'm in favor of changing our use of initial setup encoding
> from SQL-ASCII to UTF-8. While it doesn't solve the root of the problem,
> it does allow people to use non-ascii user and database names if they set
> them up appropriately and doesn't seem to harm anything.

Will you change ALL clients in order to do that? I only needed one client to actually work, JDBC - very frustrating, since it was supposed to be Unicode-proof, being written in Java. Ironically, psql works because it uses the platform encoding ;)

> The original
> patch's suggested use of the client's environment encoding seems random to
> me.

It's not random, it is a heuristic approach to guessing the right encoding, namely the encoding used by the administrator when he created the database and the user. After all, there cannot be anything random in a computer, can there?

My solution preserves the currently working configurations - the ASCII-only setups will continue to work after the patch is applied. Moreover, UTF-8 setups are guaranteed to always work! In short, my patch solves today(!), with no undesired side effects, a limitation of the PostgreSQL authentication procedure in the JDBC driver. You're free to reject it; I published it for the general benefit (as it happens, you asked for it yourself). Good luck.
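
The point about ASCII-only setups can be checked at the byte level. The following is only a minimal sketch, not code from the patch or from the JDBC driver: the class name `StartupEncodingSketch`, the helper `sameBytes` and the sample user names are made up for illustration. It compares the bytes a client would produce for a user name under different encodings. An ASCII-only name yields the same bytes in US-ASCII, ISO-8859-1 and UTF-8, so the choice of startup encoding does not matter for it; a name with an accented letter does not.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration class; not part of the PostgreSQL JDBC driver.
class StartupEncodingSketch {

    // Returns true if the string encodes to identical bytes in both charsets.
    static boolean sameBytes(String s, Charset a, Charset b) {
        return Arrays.equals(s.getBytes(a), s.getBytes(b));
    }

    public static void main(String[] args) {
        String asciiUser = "postgres";      // ASCII-only user name
        String nonAsciiUser = "jos\u00e9";  // "jose" with an acute accent on the e (U+00E9)

        // ASCII-only names: identical bytes whichever of these encodings is used.
        System.out.println(sameBytes(asciiUser, StandardCharsets.US_ASCII, StandardCharsets.UTF_8));
        System.out.println(sameBytes(asciiUser, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8));

        // Non-ASCII names: the bytes depend on the encoding
        // (one byte for U+00E9 in ISO-8859-1, two bytes in UTF-8).
        System.out.println(sameBytes(nonAsciiUser, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8));

        // The platform encoding, which is what a heuristic based on the
        // client's environment would pick up; its value varies per machine.
        System.out.println(Charset.defaultCharset());
    }
}
```

On such a name the server only recognizes the user if the client sends the same bytes that were stored when the user was created, which is why the encoding used at connection startup matters for non-ASCII names and not for ASCII ones.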