Thread: Encoding issues
MD5Digest.encode() calls password.getBytes("US-ASCII") and does the same for the username. This is wrong, because when Java converts a non-ASCII character to US-ASCII, it replaces it with a "?". The same applies to sendStartupPacket() in /org/postgresql/core/v3/ConnectionFactoryImpl.java and /org/postgresql/core/v2/ConnectionFactoryImpl.java.

I'm not sure exactly what it _should_ do, because the connection itself is done with ASCII (client_encoding is not yet set). Rather than trying to convert from characters to ASCII, maybe it should just get the byte sequence and send that? That appears to be how other clients (like libpq) work. It still fails when the client and server encoding don't match, however.

Regards,
Jeff Davis
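The replacement behaviour described above can be reproduced with a small sketch. The class and method names (AsciiLossDemo, toAscii) are made up for illustration and are not taken from the driver. The sketch uses the getBytes(Charset) overload, which is documented to substitute the charset's replacement byte for unmappable characters; the getBytes(String) overload quoted in the message leaves that case unspecified, so treat the match as an assumption about typical JVM behaviour.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class AsciiLossDemo {
    // Hypothetical helper: encode to US-ASCII. Characters outside ASCII
    // cannot be mapped and are replaced by the byte for '?'.
    static byte[] toAscii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String a = "p\u00e4ss"; // 'a' with diaeresis
        String b = "p\u00f6ss"; // 'o' with diaeresis

        // The non-ASCII character is lost.
        System.out.println(new String(toAscii(a), StandardCharsets.US_ASCII)); // prints "p?ss"

        // Two different inputs become the same byte sequence.
        System.out.println(Arrays.equals(toAscii(a), toAscii(b))); // prints "true"
    }
}
```

Since both strings collapse to the same bytes, anything derived from those bytes (a password hash, a database name in the startup packet) can no longer distinguish them.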
Jeff Davis wrote:
> It still fails when the client and server encoding don't match, however.

Right. Use 7-bit usernames, passwords, and database names for this reason. The handshake protocol does not allow us to get it right if you use non-7-bit data here.

-O
On Sat, 2008-08-02 at 11:18 +1200, Oliver Jowett wrote:
> Right. Use 7-bit usernames, passwords, and database names for this
> reason. The handshake protocol does not allow us to get it right if you
> use non-7-bit data here.

But when someone _does_ use non-ASCII database names, etc., shouldn't we produce some kind of useful error, or at least blindly pass the bytes on to the server?

Changing those characters into "?"s does not seem like the right solution. That gives us the worst of both worlds: we don't get a useful error message, yet it's impossible to connect when, e.g., the database name contains non-ASCII characters.

Regards,
Jeff Davis
Jeff Davis wrote:
> But when someone _does_ use non-ASCII database names, etc., shouldn't we
> produce some kind of useful error,

That's fair enough.

> or at least blindly pass the bytes on
> to the server?

What bytes? You have a bunch of UTF-16 characters (possibly with surrogate pairs etc). What encoding do you use to turn that into a bytestream?

-O
Oliver Jowett <oliver@opencloud.com> writes:
> Jeff Davis wrote:
>> or at least blindly pass the bytes on to the server?
> What bytes? You have a bunch of UTF-16 characters (possibly with
> surrogate pairs etc). What encoding do you use to turn that into a
> bytestream?

It wouldn't be entirely unreasonable to define the answer as "UTF-8". That would at least provide serviceable behavior to a goodly group of users, whereas the current implementation seems guaranteed to fail for everyone (other than us ASCII-only Neanderthals who don't care anyway...)

regards, tom lane
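To make the "what bytes?" question concrete, here is a minimal sketch showing that a Java string has no byte form until an encoding is chosen, and that UTF-8 handles surrogate pairs without loss. The class name EncodingChoiceDemo and the hex helper are invented for this illustration.

```java
import java.nio.charset.StandardCharsets;

class EncodingChoiceDemo {
    // Hypothetical helper: render bytes as lowercase hex for display.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9";       // one UTF-16 code unit
        String emoji = "\ud83d\ude00";  // U+1F600, stored as a surrogate pair

        // The same character yields different bytes per encoding.
        System.out.println(hex(eAcute.getBytes(StandardCharsets.ISO_8859_1))); // prints "e9"
        System.out.println(hex(eAcute.getBytes(StandardCharsets.UTF_8)));      // prints "c3a9"

        // Two UTF-16 code units, but a single four-byte UTF-8 sequence.
        System.out.println(emoji.length());                                    // prints "2"
        System.out.println(hex(emoji.getBytes(StandardCharsets.UTF_8)));       // prints "f09f9880"
    }
}
```

Whichever encoding the client picks, the server only sees the bytes, so the two sides agree only if the server interprets them the same way. That is the mismatch case raised earlier in the thread.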
Tom Lane wrote:
> Oliver Jowett <oliver@opencloud.com> writes:
>> Jeff Davis wrote:
>>> or at least blindly pass the bytes on to the server?
>
>> What bytes? You have a bunch of UTF-16 characters (possibly with
>> surrogate pairs etc). What encoding do you use to turn that into a
>> bytestream?
>
> It wouldn't be entirely unreasonable to define the answer as "UTF-8".
> That would at least provide serviceable behavior to a goodly group of
> users, whereas the current implementation seems guaranteed to fail
> for everyone (other than us ASCII-only Neanderthals who don't care
> anyway...)

So then the restriction is "use 7-bit strings, or use a UTF-8 server encoding"? That sounds reasonable.

How feasible would it be to have the backend transcode user/database based on the client_encoding given in the StartupMessage? That would leave authentication as the only remaining wart. It's a pity the current protocol doesn't allow the backend to emit a ParameterStatus before authentication is complete ..

-O
On Sat, 2 Aug 2008, Tom Lane wrote:
> It wouldn't be entirely unreasonable to define the answer as "UTF-8".
> That would at least provide serviceable behavior to a goodly group of
> users, whereas the current implementation seems guaranteed to fail
> for everyone (other than us ASCII-only Neanderthals who don't care
> anyway...)
>

I've committed a change to the driver to send all initial startup data in UTF-8. It wouldn't be tough to expose this as a URL parameter for initialConnectionEncoding because I've refactored the encoding decisions out of UnixCrypt and MD5Digest. I haven't done that at this point because I'm lazy and I'm not sure how many people actually need such a feature.

Kris Jurka
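For readers who want to see why the encoding choice matters for MD5 authentication, the following is a rough sketch and not the committed driver code. It assumes the response format given in the PostgreSQL protocol documentation, "md5" followed by md5(md5(password + username) + salt) with the inner digest as lowercase hex. The class Md5PasswordSketch, its methods, and the Charset parameter are hypothetical stand-ins for a digest routine whose encoding decision has been moved out to the caller.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class Md5PasswordSketch {
    // Hypothetical helper: MD5 over the concatenated parts, as lowercase hex.
    static String md5Hex(byte[]... parts) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            for (byte[] part : parts) {
                md.update(part);
            }
            StringBuilder sb = new StringBuilder(32);
            for (byte b : md.digest()) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Assumed format: "md5" + md5(md5(password + user) + salt).
    // The caller decides how user and password become bytes.
    static String encode(String user, String password, byte[] salt, Charset cs) {
        String inner = md5Hex(password.getBytes(cs), user.getBytes(cs));
        return "md5" + md5Hex(inner.getBytes(StandardCharsets.US_ASCII), salt);
    }

    public static void main(String[] args) {
        byte[] salt = {1, 2, 3, 4}; // the server sends a 4-byte salt
        String p1 = "p\u00e4ss";
        String p2 = "p\u00f6ss";

        // With US-ASCII both passwords degrade to "p?ss" and hash identically.
        System.out.println(encode("jeff", p1, salt, StandardCharsets.US_ASCII)
                .equals(encode("jeff", p2, salt, StandardCharsets.US_ASCII))); // prints "true"

        // With UTF-8 the two passwords stay distinct.
        System.out.println(encode("jeff", p1, salt, StandardCharsets.UTF_8)
                .equals(encode("jeff", p2, salt, StandardCharsets.UTF_8)));    // prints "false"
    }
}
```

Passing the Charset in as a parameter is one way to keep the digest code encoding-agnostic, which is the kind of separation that would make a connection-time setting possible. The hash still matches the server's stored value only if the server hashed the same bytes when the password was set.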