Thread: Why don't I get a LATIN1 encoding here with SET ENCODING?

Why don't I get a LATIN1 encoding here with SET ENCODING?

From

Bryce Nesbitt

Date:

03 November 2009, 21:24:09

I'm tracking another bug, but wanted to verify stuff on the command line.  I can't figure out why this did not
work:<br/><br /> # psql dblack3-deleteme<br /> Welcome to psql 8.3.8 (server 8.3.4), the PostgreSQL interactive
terminal.<br/> dblack3-deleteme=> create table bryce1 (key int,val1 text,val2 bytea);<br /> dblack3-deleteme=>
\encoding<br/> UTF8<br /> dblack3-deleteme=> SET CLIENT_ENCODING TO 'LATIN1';<br /> SET<br /> dblack3-deleteme=>
\encoding<br/> LATIN1<br /> dblack3-deleteme=> insert into bryce1 values(1,2,'test\177');<br /> INSERT 0 1<br /><br
/>dblack3-deleteme=> insert into bryce1 values(1,2,'test\375');<br /> ERROR:  invalid byte sequence for encoding
"UTF8":0xfd<br /> HINT:  This error can also happen if the byte sequence does not match the encoding expected by the
server,which is controlled by "client_encoding".<br /><br /> dblack3-deleteme=> insert into bryce1
values(1,2,'test\200');<br/> ERROR:  invalid byte sequence for encoding "UTF8": 0x80<br /> HINT:  This error can also
happenif the byte sequence does not match the encoding expected by the server, which is controlled by
"client_encoding".<br/><br /><br /> ------------------------------------------------------<br /><br /> The underlying
problemis a version upgrade of DBI drivers has broken some binary objects we used.<br /> For the binary data:<br
/><spanclass="Apple-style-span" style="border-collapse: separate; color: rgb(0, 0, 0); font-family: 'Times New Roman';
font-size:medium; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height:
normal;orphans: 2; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;"><span
class="Apple-style-span"style="border-collapse: collapse; font-family: arial,sans-serif,'Arial Unicode MS'; font-size:
13px;"><div><fontface="'courier new',
monospace"><12><1>5<0><0><0><4>FOO1</font></div><div><fontface="'courier new',
monospace"><10><205><0><0><0><4>FOO2</font></div><div><fontface="'courier new',
monospace"><br/></font></div><font face="'courier new', monospace">Going in like so:</font><br /><div class="im"
style="color:rgb(80, 0, 80);"><div><font face="'courier new', monospace">$sth->bind_param($index++, $data, {pg_type
=>$SQLDb::PG_BYTEA});</font></div><div><font face="'courier new', monospace"><br /></font></div></div><div><font
face="'couriernew', monospace">It now comes out munged:</font></div><div><span style="font-family: 'courier
new',monospace;">\012\001 5 \000 \000 \000 \004 FOO1</span></div><div><font face="'courier new', monospace">\010 \377
\000\000 \000 \004 FOO</font></div></span></span>

Re: Why don't I get a LATIN1 encoding here with SET ENCODING?

From

Craig Ringer

Date:

03 November 2009, 22:29:52

Bryce Nesbitt wrote:
> I'm tracking another bug, but wanted to verify stuff on the command line.  I 
> can't figure out why this did not work:

> dblack3-deleteme=> insert into bryce1 values(1,2,'test\375');
> ERROR:  invalid byte sequence for encoding "UTF8": 0xfd

I'd say the server is interpreting your query text as latin-1 and
converting it to the server encoding UTF-8 as it should, resulting in
the utf-8 string:
 insert into bryce1 values(1,2,'test\375');

which it *then* interprets escapes in. As test\xfd ('0x74 0x65 0x73 0x74
0xfd') isn't valid UTF-8, the server rejects it.

If my understanding is right then the trouble is that the
client_encoding setting doesn't affect string escapes in SQL queries.
The conversion of the query text from client to server encoding is done
before string escapes are processed.

In truth, that's how I'd expect it to happen. If I ask for the byte 0xfd
in a string, I don't want the server to decide that I must've meant
something else because I have a different client encoding. If I wanted
encoding conversion, I wouldn't have written it in an escape form, I'd
have written 'ý' not '\375'.

--
Craig Ringer

Re: Why don't I get a LATIN1 encoding here with SET ENCODING?

From

Bryce Nesbitt

Date:

03 November 2009, 23:36:40


Craig Ringer wrote:
> In truth, that's how I'd expect it to happen. If I ask for the byte 0xfd
> in a string, I don't want the server to decide that I must've meant
> something else because I have a different client encoding. If I wanted
> encoding conversion, I wouldn't have written it in an escape form, I'd
> have written 'ý' not '\375'.
I've got a client encoding of LATIN1... so I'd expect to be able to 
present any valid LATIN1 character, not care how the backend stored it, 
then get the same character back from the database.

Re: Why don't I get a LATIN1 encoding here with SET ENCODING?

From

Craig Ringer

Date:

03 November 2009, 23:47:33

Bryce Nesbitt wrote:
> 
> 
> Craig Ringer wrote:
>> In truth, that's how I'd expect it to happen. If I ask for the byte 0xfd
>> in a string, I don't want the server to decide that I must've meant
>> something else because I have a different client encoding. If I wanted
>> encoding conversion, I wouldn't have written it in an escape form, I'd
>> have written 'ý' not '\375'.

> I've got a client encoding of LATIN1... so I'd expect to be able to
> present any valid LATIN1 character, not care how the backend stored it,
> then get the same character back from the database.

Yes - but you are *not* presenting a Latin-1 character. You're
presenting four Latin-1 characters:
 '\', '3', '7', '5'

The server *cannot* process those as an escape sequence before first
converting the SQL string from client to server encoding. It doesn't
know what the bytes you sent it mean until it converts the data sent by
the client to the server encoding. Not all encodings preserve the lower
128 characters - in shift-jis, for example, the bytes usually used for
the '\' and '~' characters mean '¥' and '‾' respectively. If the server
didn't do client-to-server encoding before escape processing, a user
with a shift-jis client encoding who sent:
  test¥041

would be very surprised when the server saw that as:
  test!

instead of literally test¥041 like it should.

Perhaps when processing escapes after doing the encoding conversion the
server could apply any client->server encoding transformation on escape
sequences too. That would achieve the result you wanted here, but it
would leave you very, very, very confused and frustrated the first time
you tried to insert an image into a `bytea' field or manipulate a BLOB,
because the server would 'helpfully' translate the byte escapes for you.

To come closer to what you want, the server would have to detect whether
the escape was in a string that was going to land up in a
character-typed field instead of a byte-typed field. But what about
casts, functions, etc? And how would you specify it if you really did
want exactly those bytes in a text field? It'd be a nightmare.

The server does the only sensible, consistent thing - when you give it a
byte sequence, it assumes you mean literally those bytes.

--
Craig Ringer

Re: Why don't I get a LATIN1 encoding here with SET ENCODING?

From

Bryce Nesbitt

Date:

03 November 2009, 23:54:48

Craig Ringer wrote:
> Yes - but you are *not* presenting a Latin-1 character. You're
> presenting four Latin-1 characters:
>
>   '\', '3', '7', '5'
Well, then I have a different question.  If I can view a bytea column as so:
> select object from context_objects where context_key=100041;          object
-----------------------------\005\007\021\007Article\003
(1 row)

How can I modify the data and put it back into the column?

I'm trying to track down while a ten year old system no longer works 
after a Perl DBI upgrade.  Something is munging high LATIN1.  The DB 
itself is UTF-8, client_encoding is set to LATIN1.  We expect LATIN1 in, 
same stuff out.  But right now that is what I'm debugging.

Re: Why don't I get a LATIN1 encoding here with SET ENCODING?

From

Craig Ringer

Date:

04 November 2009, 00:25:57

Bryce Nesbitt wrote:
> Craig Ringer wrote:
>> Yes - but you are *not* presenting a Latin-1 character. You're
>> presenting four Latin-1 characters:
>>
>>   '\', '3', '7', '5'
> Well, then I have a different question.  If I can view a bytea column as
> so:
> 
>> select object from context_objects where context_key=100041;
>           object
> -----------------------------
> \005\007\021\007Article\003
> (1 row)
> 
> How can I modify the data and put it back into the column?

Presuming the working environment:

CREATE TABLE context_objects( id serial primary key, object bytea not null
);

-- to avoid E'' and double-escaping
SET standard_conforming_strings = on;

INSERT INTO context_objects (object)
VALUES ( '\005\007\021\007Article\003' ), ( 'test\375' );

I'd just SELECT the `object' value, receiving it in escape form from the
PostgreSQL protocol:

SELECT object FROM context_objects          object
-----------------------------
\005\007\021\007Article\003

... which is a string of 27 characters in the local client_encoding. So
in Perl (or whatever) I'd have a string of length 27:
 \005\007\021\007Article\003

I'd then convert that by parsing the escape sequences. (As Perl uses \0
as the octal escape prefix instead of just \, it's necessary to write a
simple loop for that. I don't do Perl voluntarily and haven't used it
for a while so I won't write one inline here, but it should be easy enough.)

Once you have the byte string (length 12 bytes) you manipulate it
however you want, then convert it back to octal escape form for sending
to the Pg server.

client_encoding has no effect on any of this so long as you're using a
client encoding that preserves the lower 128 characters, ie basically
anything except shift-jis.

Now, if your 'object' is actually a string in a single known text
encoding, not just a sequence of bytes, then another approach is
possible. First, stop using `bytea' for text. If the byte sequences are
all known to be latin-1 encoded text, for example, use:

ALTER TABLE context_objects ALTER COLUMN object TYPE text
USING ( convert_from(object, 'latin-1'));

--
-- Now retrieve the rows, which will be returned as text in
-- the client_encoding. Note that \375 is ý in latin-1.
--
testdb=> SELECT object FROM context_objects;           object
-----------------------------\x05\x07\x11\x07Article\x03testý
(2 rows)

(Note: If they're all of different encodings, but you know the
encodings, you can make the encoding param of convert_from a column
reference instead).

Now you have 'object' as server-side utf-8 encoded text that'll be
automatically converted to and from the specified client_encoding . If
you want to get the raw server-side-encoded byte sequence you can always
cast to bytea in your query to get the utf-8 byte sequence in octal
escape form:

testdb=> SELECT object::bytea FROM context_objects;          object
-----------------------------\005\007\021\007Article\003test\303\275

... but usually you'd just let the DB

> I'm trying to track down while a ten year old system no longer works
> after a Perl DBI upgrade.

Is it just Perl DBI? Or is it also Pg?

Could it be related to the standards_conforming_strings change that
affects string escaping ?

--
Craig Ringer

Re: Why don't I get a LATIN1 encoding here with SET ENCODING?

From

Craig Ringer

Date:

04 November 2009, 00:34:47

Bryce Nesbitt wrote:
> Craig Ringer wrote:
>> Yes - but you are *not* presenting a Latin-1 character. You're
>> presenting four Latin-1 characters:
>>
>>   '\', '3', '7', '5'
> Well, then I have a different question.  If I can view a bytea column as
> so:
> 
>> select object from context_objects where context_key=100041;
>           object
> -----------------------------
> \005\007\021\007Article\003
> (1 row)
> 
> How can I modify the data and put it back into the column?

Oh: you might also find the encode(...) function useful. From :

http://www.postgresql.org/docs/current/static/functions-string.html

encode(data bytea, type text) returns text
Encode binary data to different representation. Supported types are:
base64, hex, escape. Escape merely outputs null bytes as \000 and
doubles backslashes.

so rather than struggling through Pg's octal-escaped strings you can ask
for them in hex or base64 form, eg:



craig=> \d context_objects                        Table "public.context_objects"Column |  Type   |
   Modifiers
 

--------+---------+--------------------------------------------------------------id     | integer | not null default
nextval('context_objects_id_seq'::regclass)object | bytea   | not null
Indexes:   "context_objects_pkey" PRIMARY KEY, btree (id)

craig=> select * from context_objects;id |           object
----+----------------------------- 1 | \005\007\021\007Article\003 2 | test\375
(2 rows)

craig=> SELECT encode( object, 'hex' ) FROM context_objects;         encode
--------------------------0507110741727469636c650374657374fd
(2 rows)

craig=> SELECT encode( object, 'base64' ) FROM context_objects;     encode
------------------BQcRB0FydGljbGUDdGVzdP0=
(2 rows)


Both the hex-string and base64 forms are trivial to convert to and from
a byte string in Perl. You can use the matching 'decode' function when
updating / inserting data, eg:
 UPDATE context_objects SET object = decode('BQcRB0FydGljbGUD', 'base64') WHERE id = 2;

--
Craig Ringer