Re: String encoding during connection "handshake" - Mailing list pgsql-hackers

From sulfinu@gmail.com
Subject Re: String encoding during connection "handshake"
Msg-id 200711282017.53764.sulfinu@gmail.com
In response to Re: String encoding during connection "handshake"  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
Responses Re: String encoding during connection "handshake"  (Gregory Stark <stark@enterprisedb.com>)
Re: String encoding during connection "handshake"  ("Trevor Talbot" <quension@gmail.com>)
Re: String encoding during connection "handshake"  (Kris Jurka <books@ejurka.com>)
List pgsql-hackers
On Wednesday 28 November 2007, Alvaro Herrera wrote:
> sulfinu@gmail.com escribió:
> > Martijn,
> >
> > :) don't take it personally, I am just trying to obtain confirmation that
> > I understood the problem correctly. After all, it's just that C has a very
> > outdated notion of "char"s (and no notion of Unicode). I was naively
> > under the impression that "char"s have evolved in present-day C.
>
> This is not the language's fault in any way.  We support plenty of
> encodings beyond UTF-8.
Yes, you support (and worry about) encodings simply because of a C limitation
dating from 1974, if I recall correctly...
In Java, for example, a "char" is a well-defined datum, namely a UTF-16 code
unit, and a String is always Unicode text. In C, the same "char" can be one
character or another (or an error!) depending on which encoding was used. The
only definition that holds up is that a "char" is a byte. Its interpretation
is uncertain and unsafe (see my original problem).
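
To illustrate what I mean, here is a minimal sketch (the class and method
names are made up for this example). The same two bytes, which is all a C
"char*" would carry, decode to different text depending on the encoding the
reader assumes:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: one byte sequence, two interpretations.
class ByteInterpretation {
    // Interpret the raw bytes as UTF-8.
    static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Interpret the very same bytes as ISO-8859-1 (Latin-1).
    static String decodeLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // 0xC3 0xA9 is the UTF-8 encoding of U+00E9 (e with acute accent).
        byte[] raw = {(byte) 0xC3, (byte) 0xA9};
        System.out.println(decodeUtf8(raw).length());   // one character
        System.out.println(decodeLatin1(raw).length()); // two characters
    }
}
```

Without knowing which encoding the sender used, the receiver cannot tell
which of the two strings was meant.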

On Wednesday 28 November 2007, Martijn van Oosterhout wrote:
> On Wed, Nov 28, 2007 at 05:54:05PM +0200, sulfinu@gmail.com wrote:
> > Regarding the problem of "One True Encoding", the answer seems obvious to
> > me: use only one encoding per database cluster, either UTF-8 or UTF-16 or
> > another Unicode-aware scheme, whichever yields a statistically smaller
> > database for the languages employed by the users in their data. This
> > encoding should be a one time choice! De facto, this is already happening
> > now, because one cannot change collation rules after a cluster has been
> > created.
>
> Umm, each database in a cluster can have a different encoding, so there
> is no such thing as the "cluster's encoding".
I meant that a cluster should have a single encoding that covers the whole
Unicode character set. That would certainly satisfy everybody.

> You can certainly argue
> that it should be a one time choice, but I doubt you'll get people to
> remove the possibilities we have now. In fact, if anything we'd probably
> go the other way, allow you to select the collation on a per
> database/table/column level (SQL compliance requires this).
The collation order is implemented in close relationship with the byte
representation of strings, but conceptually it depends solely on the locale
and has nothing to do with the encoding.

> This has nothing to do with C by the way. C has many features that
> allow you to work with different encodings. It just doesn't force you
> to use any particular one.
Yes, my point exactly! C forces you to worry about encoding. I mean, if you're
not an ASCII-only user ;)

Think of it this way: if I give you a Java String you will know exactly what
I meant; if I send you a C char* you don't know what it is in the absence of
extra information - you can even use it as a uint8*, as is actually done
in md5.c.
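
A small sketch of why this matters when the bytes are fed to a hash (the
class and helper names are hypothetical, and this is a plain MD5 of the text,
not the actual PostgreSQL password exchange). The same String produces
different bytes, and therefore a different digest, depending on the encoding
used to serialize it:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical demo class: hashing text means hashing its encoded bytes.
class PasswordBytes {
    // Encode the text with the given charset, then return the MD5 digest in hex.
    static String md5Hex(String text, Charset charset) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(charset));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String secret = "p\u00e4ssword"; // contains one non-ASCII character
        System.out.println(md5Hex(secret, StandardCharsets.UTF_8));
        System.out.println(md5Hex(secret, StandardCharsets.ISO_8859_1));
    }
}
```

For an ASCII-only string the two digests coincide, which is why the problem
goes unnoticed until a non-ASCII character shows up.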

I consider this matter closed from my point of view and I have modified the
JDBC driver according to my needs.
Thank you all for the help.

