Thread: SET client_encoding = 'UTF8'

SET client_encoding = 'UTF8'

From
Daniel Migowski
Date:
Hello dear developers,

The command

    SET client_encoding = 'UTF8'

throws an exception in the driver, because the driver expects UNICODE. I
understand exceptions for other encodings, but accepting UTF8 is IMHO a
must-have. Our database scripts should contain this line at the beginning
so that they can also be loaded manually into the server, and it is
actually more correct than a line that sets the encoding to 'UNICODE'.

Thanks in advance and with best regards,
Daniel Migowski

Re: SET client_encoding = 'UTF8'

From
Tom Lane
Date:
Daniel Migowski <dmigowski@ikoffice.de> writes:
> The command
>     SET client_encoding = 'UTF8'
> throws an exception in the driver, because the driver expects UNICODE.

Er, what driver exactly?  Perhaps you need a more up-to-date version
of said driver?  'UTF8' has been our standard spelling of this
encoding's name for quite some time now.

            regards, tom lane

Re: SET client_encoding = 'UTF8'

From
Kris Jurka
Date:

On Sun, 18 May 2008, Daniel Migowski wrote:

> The command SET client_encoding = 'UTF8'
>
> throws an exception in the driver, because the driver expects UNICODE.

This has been discussed before and the problem is that there are too
many ways to say UTF8 [1].  You can say UTF8, UTF-8, UTF -- 8, and so on.
Perhaps we should strip all spaces and dashes prior to comparison?

[1] http://archives.postgresql.org/pgsql-jdbc/2008-02/threads.php#00174

Kris Jurka

Re: SET client_encoding = 'UTF8'

From
Oliver Jowett
Date:
Tom Lane wrote:
> Daniel Migowski <dmigowski@ikoffice.de> writes:
>> The command
>>     SET client_encoding = 'UTF8'
>> throws an exception in the driver, because the driver expects UNICODE.
>
> Er, what driver exactly?  Perhaps you need a more up-to-date version
> of said driver?  'UTF8' has been our standard spelling of this
> encoding's name for quite some time now.

The driver requests client_encoding = UNICODE in the startup packet, and
expects client_encoding to stay as UNICODE throughout.

If client code goes off and manually sets it to UTF8 then the JDBC
driver complains, because it doesn't know that UNICODE is equivalent to
UTF8.
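
To make the failure mode concrete, here is a minimal sketch of a strict
check of the kind described above. It is not the driver's actual code:
the class and method names are hypothetical, and it only assumes that the
driver learns the new value whenever client_encoding changes and compares
it with the literal string it requested at startup.

```java
// Hypothetical sketch, not the actual JDBC driver source.
final class StrictEncodingCheck {
    // The value the driver asked for in the startup packet.
    static final String REQUESTED = "UNICODE";

    // Assumed hook: called with the new value whenever client_encoding changes.
    static void onClientEncodingChanged(String reported) {
        // A plain string comparison has no notion of aliases, so "UTF8"
        // is rejected even though it names the same encoding.
        if (!REQUESTED.equals(reported)) {
            throw new IllegalStateException(
                "client_encoding changed to " + reported
                    + "; the driver requires " + REQUESTED);
        }
    }
}
```

With this check, a reported value of "UNICODE" passes silently while
"UTF8" raises the exception.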

-O


Re: SET client_encoding = 'UTF8'

From
Daniel Migowski
Date:
Kris Jurka wrote:
> On Sun, 18 May 2008, Daniel Migowski wrote:
>> The command SET client_encoding = 'UTF8'
>>
>> throws an exception in the driver, because the driver expects UNICODE.
> This has been discussed before and the problem is that there are too
> many ways to say UTF8 [1].  You can say UTF8, UTF-8, UTF -- 8, and so
> on. Perhaps we should strip all spaces and dashes prior to comparison?

This would be correct in my opinion. I think no one dares to declare a
charset name that relies on characters other than 0-9 and a-z to be
identifiable. IMHO we should just allow whatever postgres itself allows
(we could dig into the parsing code of postgres). I tried at the
command line, and got the following:

set client_encoding='foobar';
ERROR:  invalid value for parameter "client_encoding": "foobar"

set client_encoding='utf8';
OK

set client_encoding='utf-8';
OK

set client_encoding='utf -- 8';
OK

set client_encoding='Utf -- 8';
OK

set client_encoding='Utf -- 98';
ERROR:  invalid value for parameter "client_encoding": "Utf -- 98"

set client_encoding='Utf_8';
OK

But I think we should be fine with

    userencoding.toLowerCase().replaceAll("[^0-9a-z]", "").equals("utf8"); // untested prototype code

or something like this.
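
The one-liner above can be wrapped into a small helper and checked against
the spellings tried at the command line. This is only a sketch of the
proposed normalization, not the server's own matching rules; the class
and method names are made up for illustration.

```java
import java.util.Locale;

// Hypothetical helper illustrating the proposed normalization.
final class EncodingNames {
    // Lower-case the name, drop everything except 0-9 and a-z,
    // then compare the result with "utf8".
    static boolean looksLikeUtf8(String userEncoding) {
        String normalized = userEncoding
            .toLowerCase(Locale.ROOT)   // locale-independent lower-casing
            .replaceAll("[^0-9a-z]", "");
        return normalized.equals("utf8");
    }
}
```

Under this rule 'utf8', 'utf-8', 'Utf -- 8' and 'Utf_8' are accepted, while
'Utf -- 98' and 'foobar' are not. Note that 'UNICODE' is not matched either,
so that alias would still need separate handling.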
>
> [1] http://archives.postgresql.org/pgsql-jdbc/2008-02/threads.php#00174
Thanks for the link.

With best regards,
Daniel Migowski

Re: SET client_encoding = 'UTF8'

From
Tom Lane
Date:
Daniel Migowski <dmigowski@ikoffice.de> writes:
> Kris Jurka wrote:
>> On Sun, 18 May 2008, Daniel Migowski wrote:
>>> The command SET client_encoding = 'UTF8'
>>> throws an exception in the driver, because the driver expects UNICODE.
>> This has been discussed before and the problem is that there are too
>> many ways to say UTF8 [1].  You can say UTF8, UTF-8, UTF -- 8, and so
>> on. Perhaps we should strip all spaces and dashes prior to comparison?

Perhaps we should make the backend return the values of client_encoding
and server_encoding in canonical form (ie, "UTF8") regardless of the
spelling variant the user used.  I'm not thrilled with having JDBC
thinking it knows the conversion algorithm the backend uses.

Of course, such a change would break code relying on the older behavior
:-(

            regards, tom lane

Re: SET client_encoding = 'UTF8'

From
Oliver Jowett
Date:
Tom Lane wrote:
> Daniel Migowski <dmigowski@ikoffice.de> writes:
>> Kris Jurka wrote:
>>> On Sun, 18 May 2008, Daniel Migowski wrote:
>>>> The command SET client_encoding = 'UTF8'
>>>> throws an exception in the driver, because the driver expects UNICODE.
>>> This has been discussed before and the problem is that there are too
>>> many ways to say UTF8 [1].  You can say UTF8, UTF-8, UTF -- 8, and so
>>> on. Perhaps we should strip all spaces and dashes prior to comparison?
>
> Perhaps we should make the backend return the values of client_encoding
> and server_encoding in canonical form (ie, "UTF8") regardless of the
> spelling variant the user used.  I'm not thrilled with having JDBC
> thinking it knows the conversion algorithm the backend uses.
>
> Of course, such a change would break code relying on the older behavior
> :-(

Not sure if this is a big enough issue to warrant a server change. It
only happens when a JDBC client issues a manual SET client_encoding to
an encoding that's UTF8 but isn't spelled "UNICODE". That's going to be
a no-op anyway, so I'm not entirely clear why the client needs to be
sending it in the first place.

It sounds like the root cause might be something like "let's feed
pg_dump output to JDBC". So we could add a special case in the driver to
allow exactly "UTF8" as well as "UNICODE", if that's the canonical way
the server spells it these days.
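
A minimal sketch of that special case, assuming the check is a plain
string comparison on the reported value. The class and method names are
hypothetical and this is not the driver's actual code.

```java
// Hypothetical sketch of the proposed special case.
final class AllowedClientEncoding {
    // Accept only the two exact spellings; no normalization of
    // case, dashes or spaces is attempted.
    static boolean isAccepted(String reported) {
        return "UNICODE".equals(reported) || "UTF8".equals(reported);
    }
}
```

Unlike a normalizing comparison, this deliberately leaves variants such
as "UTF-8" rejected, so the driver does not have to mirror the backend's
name-matching algorithm.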

-O


Re: SET client_encoding = 'UTF8'

From
Tom Lane
Date:
Oliver Jowett <oliver@opencloud.com> writes:
> It sounds like the root cause might be something like "let's feed
> pg_dump output to JDBC". So we could add a special case in the driver to
> allow exactly "UTF8" as well as "UNICODE", if that's the canonical way
> the server spells it these days.

+1 for that in any case, because UNICODE hasn't been the canonical
spelling since 8.1.

            regards, tom lane

Re: SET client_encoding = 'UTF8'

From
Kris Jurka
Date:

On Mon, 19 May 2008, Tom Lane wrote:

> Oliver Jowett <oliver@opencloud.com> writes:
>> So we could add a special case in the driver to allow exactly "UTF8" as
>> well as "UNICODE", if that's the canonical way the server spells it
>> these days.
>
> +1 for that in any case, because UNICODE hasn't been the canonical
> spelling since 8.1.
>

OK, I'll make this happen.  A workaround for the immediate problem is to
use the URL parameter allowEncodingChanges=true.
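
As a sketch of how the parameter might be supplied: the host, database,
user and password below are placeholders, and the class and method names
are hypothetical. Passing the setting as a connection property instead of
in the URL is assumed to be equivalent.

```java
import java.util.Properties;

// Hypothetical helper; only the parameter name comes from the driver docs.
final class AllowEncodingChangesExample {
    // Builds a JDBC URL with allowEncodingChanges appended as a URL parameter.
    static String urlWithEncodingChanges(String host, String database) {
        return "jdbc:postgresql://" + host + "/" + database
            + "?allowEncodingChanges=true";
    }

    // Alternative: carry the same setting as a connection property.
    static Properties propsWithEncodingChanges(String user, String password) {
        Properties props = new Properties();
        props.setProperty("user", user);
        props.setProperty("password", password);
        props.setProperty("allowEncodingChanges", "true");
        return props;
    }
}
```

Either result can then be handed to java.sql.DriverManager.getConnection,
for example getConnection(url, props).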

Kris Jurka

http://jdbc.postgresql.org/documentation/83/connect.html#connection-parameters

Re: SET client_encoding = 'UTF8'

From
Kris Jurka
Date:

On Mon, 19 May 2008, Kris Jurka wrote:

> On Mon, 19 May 2008, Tom Lane wrote:
>
>> Oliver Jowett <oliver@opencloud.com> writes:
>>> So we could add a special case in the driver to allow exactly "UTF8" as
>>> well as "UNICODE", if that's the canonical way the server spells it these
>>> days.
>>
>> +1 for that in any case, because UNICODE hasn't been the canonical
>> spelling since 8.1.
>>
>
> OK, I'll make this happen.  A workaround for the immediate problem is to use
> the URL parameter allowEncodingChanges=true.
>

Done.