Thread: Encoding issues

Encoding issues

From

Jeff Davis

Date:

01 August 2008, 15:05:27

MD5Digest.encode() calls password.getBytes("US-ASCII") and does the same
for the username. This is wrong, because when Java converts a non-ASCII
character to US-ASCII, it replaces it with a "?".

Similarly for sendStartupPacket()
in /org/postgresql/core/v3/ConnectionFactoryImpl.java
and /org/postgresql/core/v2/ConnectionFactoryImpl.java.
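
The replacement behaviour described above can be reproduced in isolation. This is a minimal sketch, not the driver's code: the class and method names are made up for illustration, and it uses the `getBytes(Charset)` overload (Java 7+), whose documentation guarantees that unmappable characters are replaced with the charset's default replacement bytes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the PostgreSQL JDBC driver.
class AsciiReplacementDemo {
    // Encode to US-ASCII, then decode again so the data loss is visible as text.
    static String roundTripThroughAscii(String s) {
        byte[] ascii = s.getBytes(StandardCharsets.US_ASCII);
        return new String(ascii, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // Pure ASCII input survives unchanged.
        System.out.println(roundTripThroughAscii("postgres"));
        // U+00E9 has no US-ASCII mapping, so it is silently replaced by '?'.
        System.out.println(roundTripThroughAscii("jos\u00e9"));
    }
}
```

The second call prints `jos?`, so a user, password, or database name containing that character never reaches the server in its original form.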

I'm not sure exactly what it _should_ do, because the connection itself
is done with ASCII (client_encoding is not yet set).

Rather than trying to convert from characters to ASCII, maybe it should
just get the byte sequence, and send that? That appears to be how other
clients (like libpq) work.

It still fails when the client and server encoding don't match, however.
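
To illustrate why a mismatch matters, here is a small sketch that compares the bytes the same string produces under two encodings. It assumes the server compares the raw bytes it receives against names stored in its own encoding; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of the PostgreSQL JDBC driver.
class EncodingMismatchDemo {
    // True if the string has the same byte representation in UTF-8 and ISO-8859-1.
    static boolean sameBytesInBothEncodings(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);
        return Arrays.equals(utf8, latin1);
    }

    public static void main(String[] args) {
        // 7-bit names are byte-identical in both encodings.
        System.out.println(sameBytesInBothEncodings("postgres"));
        // U+00E9 is 0xC3 0xA9 in UTF-8 but the single byte 0xE9 in ISO-8859-1.
        System.out.println(sameBytesInBothEncodings("jos\u00e9"));
    }
}
```

Only the 7-bit name yields identical bytes, which is why a byte-for-byte comparison can fail when the two sides disagree on the encoding.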

Regards,
    Jeff Davis

Re: Encoding issues

From

Oliver Jowett

Date:

01 August 2008, 20:18:52

Jeff Davis wrote:

> It still fails when the client and server encoding don't match, however.

Right. Use 7-bit usernames, passwords, and database names for this
reason. The handshake protocol does not allow us to get it right if you
use non-7-bit data here.

-O

Re: Encoding issues

From

Jeff Davis

Date:

01 August 2008, 20:46:21

On Sat, 2008-08-02 at 11:18 +1200, Oliver Jowett wrote:
> Right. Use 7-bit usernames, passwords, and database names for this
> reason. The handshake protocol does not allow us to get it right if you
> use non-7-bit data here.

But when someone _does_ use non-ASCII database names, etc., shouldn't we
produce some kind of useful error, or at least blindly pass the bytes on
to the server?

Changing those characters into "?"s does not seem like the right
solution. That gives us the worst of both worlds: we don't get a useful
error message, yet it's impossible to connect when, e.g., the database
name contains non-ASCII characters.
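
One way to get a useful error instead of silent "?" substitution is a strict encoder. The following is only a sketch of that idea using the standard `java.nio.charset` API; the class name, method name, and error message are assumptions, not anything the driver actually contains.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the PostgreSQL JDBC driver.
class StrictAsciiEncoder {
    // Encodes value as US-ASCII, or throws if any character cannot be represented.
    // The label names the field; the value is left out of the message so that
    // passwords are not leaked into logs.
    static byte[] encodeOrFail(String label, String value) {
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            ByteBuffer buf = encoder.encode(CharBuffer.wrap(value));
            byte[] out = new byte[buf.remaining()];
            buf.get(out);
            return out;
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException(
                    label + " contains characters that cannot be sent as US-ASCII", e);
        }
    }
}
```

With this, `encodeOrFail("database", "postgres")` returns the eight ASCII bytes, while a name containing a non-ASCII character fails immediately with a message naming the offending field.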

Regards,
    Jeff Davis

Re: Encoding issues

From

Oliver Jowett

Date:

02 August 2008, 00:25:13

Jeff Davis wrote:

> But when someone _does_ use non-ASCII database names, etc., shouldn't we
> produce some kind of useful error,

That's fair enough.

> or at least blindly pass the bytes on
> to the server?

What bytes? You have a bunch of UTF-16 characters (possibly with
surrogate pairs etc). What encoding do you use to turn that into a
bytestream?

-O

Re: Encoding issues

From

Tom Lane

Date:

02 August 2008, 01:42:32

Oliver Jowett <oliver@opencloud.com> writes:
> Jeff Davis wrote:
>> or at least blindly pass the bytes on to the server?

> What bytes? You have a bunch of UTF-16 characters (possibly with
> surrogate pairs etc). What encoding do you use to turn that into a
> bytestream?

It wouldn't be entirely unreasonable to define the answer as "UTF-8".
That would at least provide serviceable behavior to a goodly group of
users, whereas the current implementation seems guaranteed to fail
for everyone (other than us ASCII-only Neanderthals who don't care
anyway...)
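
For concreteness, this is roughly what "define the answer as UTF-8" looks like on the Java side, including the surrogate-pair case raised earlier in the thread. It is a sketch with a made-up class name, not the driver's implementation.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the PostgreSQL JDBC driver.
class Utf8StartupBytes {
    // Turns Java's UTF-16 string into the UTF-8 byte sequence that would be sent.
    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // One UTF-16 code unit (U+00E9) becomes two UTF-8 bytes.
        System.out.println(toUtf8("\u00e9").length);
        // A surrogate pair (U+1F600) is two UTF-16 code units but one code point,
        // and it becomes a single four-byte UTF-8 sequence.
        System.out.println(toUtf8("\uD83D\uDE00").length);
    }
}
```

Well-formed surrogate pairs therefore need no special handling; the standard encoder combines them into one code point before producing bytes.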

            regards, tom lane

Re: Encoding issues

From

Oliver Jowett

Date:

02 August 2008, 05:56:53

Tom Lane wrote:
> Oliver Jowett <oliver@opencloud.com> writes:
>> Jeff Davis wrote:
>>> or at least blindly pass the bytes on to the server?
>
>> What bytes? You have a bunch of UTF-16 characters (possibly with
>> surrogate pairs etc). What encoding do you use to turn that into a
>> bytestream?
>
> It wouldn't be entirely unreasonable to define the answer as "UTF-8".
> That would at least provide serviceable behavior to a goodly group of
> users, whereas the current implementation seems guaranteed to fail
> for everyone (other than us ASCII-only Neanderthals who don't care
> anyway...)

So then the restriction is "use 7-bit strings, or use a UTF-8 server
encoding"? That sounds reasonable.

How feasible would it be to have the backend transcode user/database
based on the client_encoding given in the StartupMessage? That would
leave authentication as the only remaining wart. It's a pity the current
protocol doesn't allow the backend to emit a ParameterStatus before
authentication is complete...

-O

Re: Encoding issues

From

Kris Jurka

Date:

19 September 2008, 19:55:14

On Sat, 2 Aug 2008, Tom Lane wrote:

> It wouldn't be entirely unreasonable to define the answer as "UTF-8".
> That would at least provide serviceable behavior to a goodly group of
> users, whereas the current implementation seems guaranteed to fail
> for everyone (other than us ASCII-only Neanderthals who don't care
> anyway...)
>

I've committed a change to the driver to send all initial startup data in
UTF-8.  It wouldn't be tough to expose this as a URL parameter for
initialConnectionEncoding because I've refactored the encoding decisions
out of UnixCrypt and MD5Digest.  I haven't done that at this point because
I'm lazy and I'm not sure how many people actually need such a feature.

Kris Jurka