Thread: 8.0.0beta4: "copy" and "client_encoding"

8.0.0beta4: "copy" and "client_encoding"

From
mbch67@yahoo.com
Date:
Hi,

The following statement worked fine with 7.4 but fails under 8.0.0beta4:

   SET CLIENT_ENCODING TO 'LATIN1';

Error message:

   08006
   The servers client_encoding parameter was changed to {0}.
   The JDBC driver requires client_encoding to be UNICODE for
   correct operation.

Is this a bug or a new restriction of the JDBC driver/server backend?

That's the (faulty) 8.0.0beta4 configuration:

   Default client encoding: LATIN1
   Database encoding:       UNICODE
   DBMS:                    8.0.0beta4
   JDBC driver:             pgdev.307.jdbc3.jar
   File encoding:           LATIN1

That's the 7.4.x configuration:

   Default client encoding: LATIN1
   Database encoding:       UNICODE
   DBMS:                    7.4.x
   JDBC driver:             pg74.213.jdbc4.jar
   File encoding:           LATIN1


Adrian

Re: 8.0.0beta4: "copy" and "client_encoding"

From
Kris Jurka
Date:

On Wed, 3 Nov 2004 mbch67@yahoo.com wrote:

> The following statement worked fine with 7.4 but fails under 8.0.0beta4:
>
>    SET CLIENT_ENCODING TO 'LATIN1';

You were never supposed to do this, but only in 8.0 do we actually detect
it and stop you.  Why do you need to do this in the first place?

Kris Jurka


Re: 8.0.0beta4: "copy" and "client_encoding"

From
Oliver Jowett
Date:
mbch67@yahoo.com wrote:
> Hi,
>
> The following statement worked fine with 7.4 but fails under 8.0.0beta4:
>
>    SET CLIENT_ENCODING TO 'LATIN1';
>
> Error message:
>
>    08006
>    The servers client_encoding parameter was changed to {0}.
>    The JDBC driver requires client_encoding to be UNICODE for
>    correct operation.
>
> Is this a bug or a new restriction of the JDBC driver/server backend?

It's an old restriction of the JDBC driver. It's just that only the new
driver detects and complains about attempts to change the encoding.

The underlying problem is that currently the driver sets client_encoding
= UNICODE (which matches the internal Java string representation well)
and expects it to stay that way for the duration of the connection. If
you manually change client_encoding, then the driver will happily try to
decode those LATIN1-encoded strings as UTF8-encoded strings, and get
very confused..

Why do you need to change client_encoding?

-O

Re: 8.0.0beta4: "copy" and "client_encoding"

From
mbch67@yahoo.com
Date:
Thanks Kris and Oliver for your explanations.

Why I want to set the client_encoding to latin1? That's a fair
question. Unfortunately I removed the statement following the COPY
command to keep the posting short:

   SET CLIENT_ENCODING TO 'LATIN1';
   COPY temp_test FROM '/home/postgres/temp_test.dat';

I want to load a latin1-encoded file using the COPY command. Not
setting the encoding leads to the following error message:

   22021
   ERROR: Unicode characters greater than or equal to 0x10000 are
   not supported

Executing the same COPY statement with psql works fine.

Cheers,
Adrian

Re: 8.0.0beta4: "copy" and "client_encoding"

From
Oliver Jowett
Date:
mbch67@yahoo.com wrote:
> Thanks Kris and Oliver for your explanations.
>
> Why I want to set the client_encoding to latin1? That's a fair
> question. Unfortunately I removed the statement following the COPY
> command to keep the posting short:
>
>    SET CLIENT_ENCODING TO 'LATIN1';
>    COPY temp_test FROM '/home/postgres/temp_test.dat';
>
> I want to load a latin1-encoded file using the COPY command.

Hmm. I am suprised that COPY does not let you specify the encoding of
the input file. Using client_encoding for this seems wrong: there are
some cases it can't handle, e.g. trying to load a LATIN1-encoded file
into a table that has a name that can't be represented in LATIN1.

I suppose that in the absence of backend support for this, we could add
some URL parameter that allows client_encoding to be changed, with
suitably dangerous warnings around using it. Then you can temporarily
flip client_encoding to LATIN1 for the duration of the COPY, and revert
it to UNICODE afterwards.

-O

Re: 8.0.0beta4: "copy" and "client_encoding"

From
mbch67@yahoo.com
Date:
> I suppose that in the absence of backend support for this, we could add
> some URL parameter that allows client_encoding to be changed, with
> suitably dangerous warnings around using it. Then you can temporarily
> flip client_encoding to LATIN1 for the duration of the COPY, and revert
> it to UNICODE afterwards.
>

To me a temporarly fix is not really the solution. Shouldn't there be
done some more investigations first, e.g.

1. I set LATIN1 as the database (postgresql.conf) default client
encoding. Why does COPY, executed via JDBC not use the right encoding?
=> To me it seems to be a backend problem. Should this be address in
another posting list?

2. Was the decision to disable the "SET CLIENT_ENCODING" command
really a good idea? What about if I am running a server using UNICODE
to store text, my default client encoding is LATIN1 and I want to
import a Korean encoded text file using COPY via JDBC? There is no way
to tell COPY what encoding the input file based on.
In order to be compliant with PSQL I suggest to reactivate the
disabled "SET CLIENT ENCODING" for JDBC.

Re: 8.0.0beta4: "copy" and "client_encoding"

From
Ulrich Meis
Date:
On Thursday 04 November 2004 11:17, Oliver Jowett wrote:
>[...]
> I suppose that in the absence of backend support for this, we could add
> some URL parameter that allows client_encoding to be changed, with
> suitably dangerous warnings around using it. Then you can temporarily
> flip client_encoding to LATIN1 for the duration of the COPY, and revert
> it to UNICODE afterwards.

How about relaxing so that things like this are alright:

st.executeUpdate('SET CLIENT_ENCODING TO 'whatever'; COPY/...; SET
CLIENT_ENCODING TO 'UNICODE');

Or - I'm surprised to see that this doesn't work - calling a function that
does the trick:

create or replace function copy_enc(varchar,varchar) RETURNS void AS ' set
client_encoding TO ''LATIN1'';set client_encoding TO ''UNICODE'';' LANGUAGE
SQL;


Uli


Re: 8.0.0beta4: "copy" and "client_encoding"

From
Markus Schaber
Date:
Hi, Oliver,

On Thu, 04 Nov 2004 10:17:22 +0000
Oliver Jowett <oliver@opencloud.com> wrote:

> Hmm. I am suprised that COPY does not let you specify the encoding of
> the input file. Using client_encoding for this seems wrong: there are
> some cases it can't handle, e.g. trying to load a LATIN1-encoded file
> into a table that has a name that can't be represented in LATIN1.
>
> I suppose that in the absence of backend support for this, we could add
> some URL parameter that allows client_encoding to be changed, with
> suitably dangerous warnings around using it. Then you can temporarily
> flip client_encoding to LATIN1 for the duration of the COPY, and revert
> it to UNICODE afterwards.

I think you're right. Except COPY from STDIN, client encoding should be
independent from COPY encoding. So I suggest that Adrian requests
backend support for this.

Greets,
Markus

--
markus schaber | dipl. informatiker
logi-track ag | rennweg 14-16 | ch 8001 zürich
phone +41-43-888 62 52 | fax +41-43-888 62 53
mailto:schabios@logi-track.com | www.logi-track.com

Re: 8.0.0beta4: "copy" and "client_encoding"

From
Kris Jurka
Date:

On Fri, 5 Nov 2004, Markus Schaber wrote:

> I think you're right. Except COPY from STDIN, client encoding should be
> independent from COPY encoding. So I suggest that Adrian requests
> backend support for this.
>

Well that's not going to get in for 8.0, so we'll need to do something
especially since the driver is supposed to offer backward compatibility.
From my perspective we've got three options, just drop the check, Oliver's
add a URL parameter to turn off the check, or add an API so that they
don't issue the COPY command directly, but something like
issueServerCopy(table, file, encoding) where the driver could turn off the
check, switch the encoding, do the copy, switch the encoding back, and
turn on the check.

I think the third option is too much work at the moment, when we do offer
a copy API in the driver we could fold that in, but not now.  As to the
first two options, I don't really care.  Either is fine with me, so since
Oliver seems to like a URL parameter, that's what we'll probably go with.

Kris Jurka

Re: 8.0.0beta4: "copy" and "client_encoding"

From
Oliver Jowett
Date:
mbch67@yahoo.com wrote:

> 1. I set LATIN1 as the database (postgresql.conf) default client
> encoding. Why does COPY, executed via JDBC not use the right encoding?
> => To me it seems to be a backend problem. Should this be address in
> another posting list?

The postgresql.conf setting is a default that can be overridden on a
per-client basis. JDBC overrides the default when it connects. This is
normal.

> 2. Was the decision to disable the "SET CLIENT_ENCODING" command
> really a good idea? What about if I am running a server using UNICODE
> to store text, my default client encoding is LATIN1 and I want to
> import a Korean encoded text file using COPY via JDBC? There is no way
> to tell COPY what encoding the input file based on.
> In order to be compliant with PSQL I suggest to reactivate the
> disabled "SET CLIENT ENCODING" for JDBC.

It's a good idea in the sense that if you SET CLIENT_ENCODING, you will
break the JDBC driver in nonobvious ways. The check is there as an extra
safety net. I'd be OK with a URL parameter to disable the check so that
expert users can SET CLIENT_ENCODING at their own risk, but I don't want
the check disabled by default.

It would be theoretically possible for the JDBC driver to track
client_encoding and adjust the encoding it uses accordingly, but:

1) someone needs to actually implement that
2) it is not clear exactly when the encoding changes with respect to
receiving the ParameterStatus message (this is only an issue if there
are encodings where the contents of the ParameterStatus message would
change in the new encoding)
3) it results in an extra round of transcoding (i.e. db encoding ->
client encoding -> unicode, rather than just db encoding -> unicode)

Given that the only thing that we've seen that depends on
client_encoding so far is COPY (and even that has problems), I think the
right solution is to fix COPY, not go to a lot of extra work to support
arbitary client_encoding values.

Are there any other cases where client_encoding needs to be modified by
a JDBC user? It really seems to me that client_encoding is an
implementation detail that JDBC users should not need to worry about,
given that Java already has standard mechanisms for dealing with
encodings (namely "turn everything into unicode strings internally").

====

Also, a couple of workarounds for your case that don't need driver
modifications:

- force use of protocol version 2 by adding "?protocolVersion=2" to your
connection URL; you will lose the benefits of version 3 but it should
also defeat the client_encoding checks.
- transcode the file from LATIN1 to UNICODE (UTF8) on the server side
before issuing the COPY.

-O

Re: 8.0.0beta4: "copy" and "client_encoding"

From
"Barry Lind"
Date:
If you choose to go the URL parameter route, I would suggest you use the
existing 'compatible' parameter.  This is exactly the type of thing that
parameter was designed to be used for.  By default the driver does the
new check.  But with a value of 'compatible=7.4' (or less, i.e. < 8.0)
the driver would revert back to the old behavior of not doing this
check.

--Barry

-----Original Message-----
From: Kris Jurka [mailto:books@ejurka.com]
Sent: Friday, November 05, 2004 12:28 PM
To: Markus Schaber
Cc: Oliver Jowett; pgsql-jdbc@postgresql.org; mbch67@yahoo.com
Subject: Re: [JDBC] 8.0.0beta4: "copy" and "client_encoding"



On Fri, 5 Nov 2004, Markus Schaber wrote:

> I think you're right. Except COPY from STDIN, client encoding should
> be independent from COPY encoding. So I suggest that Adrian requests
> backend support for this.
>

Well that's not going to get in for 8.0, so we'll need to do something
especially since the driver is supposed to offer backward compatibility.

From my perspective we've got three options, just drop the check,
Oliver's add a URL parameter to turn off the check, or add an API so
that they don't issue the COPY command directly, but something like
issueServerCopy(table, file, encoding) where the driver could turn off
the check, switch the encoding, do the copy, switch the encoding back,
and turn on the check.

I think the third option is too much work at the moment, when we do
offer a copy API in the driver we could fold that in, but not now.  As
to the first two options, I don't really care.  Either is fine with me,
so since Oliver seems to like a URL parameter, that's what we'll
probably go with.

Kris Jurka

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
    (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)


Re: 8.0.0beta4: "copy" and "client_encoding"

From
Oliver Jowett
Date:
Barry Lind wrote:
> If you choose to go the URL parameter route, I would suggest you use the
> existing 'compatible' parameter.  This is exactly the type of thing that
> parameter was designed to be used for.  By default the driver does the
> new check.  But with a value of 'compatible=7.4' (or less, i.e. < 8.0)
> the driver would revert back to the old behavior of not doing this
> check.

What's the upgrade path for a "legacy" application that uses COPY, so
that it is a "current" application and no longer needs the compatible=
parameter?

I don't see such a path without an additional COPY API or backend
changes. So I'd prefer not to put this into the "compatible behaviour"
bucket.

-O

Re: 8.0.0beta4: "copy" and "client_encoding"

From
"Barry Lind"
Date:
I am assuming this will get addressed in the backend in 8.1 and that
would be the upgrade path.  (I agree if there isn't agreement on the
server side that this is appropriate for the server, then this wouldn't
be the correct parameter).

--Barry




-----Original Message-----
From: Oliver Jowett [mailto:oliver@opencloud.com]
Sent: Monday, November 08, 2004 2:07 PM
To: Barry Lind
Cc: Kris Jurka; Markus Schaber; pgsql-jdbc@postgresql.org;
mbch67@yahoo.com
Subject: Re: [JDBC] 8.0.0beta4: "copy" and "client_encoding"

Barry Lind wrote:
> If you choose to go the URL parameter route, I would suggest you use
> the existing 'compatible' parameter.  This is exactly the type of
> thing that parameter was designed to be used for.  By default the
> driver does the new check.  But with a value of 'compatible=7.4' (or
> less, i.e. < 8.0) the driver would revert back to the old behavior of
> not doing this check.

What's the upgrade path for a "legacy" application that uses COPY, so
that it is a "current" application and no longer needs the compatible=
parameter?

I don't see such a path without an additional COPY API or backend
changes. So I'd prefer not to put this into the "compatible behaviour"
bucket.

-O


Re: 8.0.0beta4: "copy" and "client_encoding"

From
Kris Jurka
Date:
Well, lets ask -hackers...

When COPYing data from a file, the file encoding is taken from the
client_encoding parameter.  The JDBC driver always uses UNICODE as the
client_encoding and wants to prevent people from changing it by
monitoring ParameterStatus messages and erroring out if it's changed.
This presents a problem when you want to COPY to or from a file with a
different encoding.  It seems reasonable to add an ENCODING specification
to the COPY command instead of relying on the somewhat unrelated
client_encoding setting.  Oliver Jowett also noted that copying from a
file with LATIN1 data into a table whose name contained UNICODE characters
could not be done.  Does this seem like a reasonable thing to do?

Kris Jurka

On Mon, 8 Nov 2004, Barry Lind wrote:

> I am assuming this will get addressed in the backend in 8.1 and that
> would be the upgrade path.  (I agree if there isn't agreement on the
> server side that this is appropriate for the server, then this wouldn't
> be the correct parameter).
>
> --Barry
>

Re: 8.0.0beta4: "copy" and "client_encoding"

From
Markus Schaber
Date:
Hi, Kris,

On Fri, 5 Nov 2004 15:27:51 -0500 (EST)
Kris Jurka <books@ejurka.com> wrote:

> Well that's not going to get in for 8.0, so we'll need to do something
> especially since the driver is supposed to offer backward compatibility.
> From my perspective we've got three options, just drop the check, Oliver's
> add a URL parameter to turn off the check, or add an API so that they
> don't issue the COPY command directly, but something like
> issueServerCopy(table, file, encoding) where the driver could turn off the
> check, switch the encoding, do the copy, switch the encoding back, and
> turn on the check.

I would prefer to adding the URL option, maybe combined with the version
compatibility flag.

Adding an API always carries the disadvantage that the user has to do
runtime casts and add special cased code even when the sql syntax is
compatible for all capable servers.

Have fun,
Markus


--
markus schaber | dipl. informatiker
logi-track ag | rennweg 14-16 | ch 8001 zürich
phone +41-43-888 62 52 | fax +41-43-888 62 53
mailto:schabios@logi-track.com | www.logi-track.com

Re: 8.0.0beta4: "copy" and "client_encoding"

From
Markus Schaber
Date:
Hi, Oliver,

On Tue, 09 Nov 2004 11:07:12 +1300
Oliver Jowett <oliver@opencloud.com> wrote:

> > If you choose to go the URL parameter route, I would suggest you use the
> > existing 'compatible' parameter.  This is exactly the type of thing that
> > parameter was designed to be used for.  By default the driver does the
> > new check.  But with a value of 'compatible=7.4' (or less, i.e. < 8.0)
> > the driver would revert back to the old behavior of not doing this
> > check.
>
> What's the upgrade path for a "legacy" application that uses COPY, so
> that it is a "current" application and no longer needs the compatible=
> parameter?

The upgrade path is to use properly recoded files.

> I don't see such a path without an additional COPY API or backend
> changes. So I'd prefer not to put this into the "compatible behaviour"
> bucket.

Maybe the sanest way would be to have an extra parameter, but setting
compatible=7.4 also disables the check.

But this may be to confusing.

Markus

--
markus schaber | dipl. informatiker
logi-track ag | rennweg 14-16 | ch 8001 zürich
phone +41-43-888 62 52 | fax +41-43-888 62 53
mailto:schabios@logi-track.com | www.logi-track.com

Re: 8.0.0beta4: "copy" and "client_encoding"

From
Kris Jurka
Date:

On Thu, 3 Nov 2004 mbch67@yahoo.com wrote:

> Why I want to set the client_encoding to latin1? That's a fair
> question. Unfortunately I removed the statement following the COPY
> command to keep the posting short:
>
>    SET CLIENT_ENCODING TO 'LATIN1';
>    COPY temp_test FROM '/home/postgres/temp_test.dat';
>

I have added a URL parameter allowEncodingChanges=true to allow client
code to set the client_encoding variable.

Kris Jurka