Re: Bug in UTF8-Validation Code? - Mailing list pgsql-hackers

From Albe Laurenz
Subject Re: Bug in UTF8-Validation Code?
Msg-id AFCCBB403D7E7A4581E48F20AF3E5DB201AC06DC@EXADV1.host.magwien.gv.at
In response to Bug in UTF8-Validation Code?  (Mario Weilguni <mweilguni@sime.com>)
Responses Re: Bug in UTF8-Validation Code?  (Andrew Dunstan <andrew@dunslane.net>)
List pgsql-hackers
Mario Weilguni wrote:
>>> Steps to reproduce:
>>> create database testdb with encoding='UTF8';
>>> \c testdb
>>> create table test(x text);
>>> insert into test values ('\244'); ==> Is accepted, even though it is not valid UTF8.
>>
>> This is working as expected, see the remark in
>>
>> http://www.postgresql.org/docs/current/static/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS
>>
>> "It is your responsibility that the byte sequences you create
>>  are valid characters in the server character set encoding."
>
> In that case, pg_dump is doing the wrong thing here and should quote the output.
> IMO this cannot be called working as expected when it makes database dumps
> worthless, without any warning at dump time.
>
> pg_dump should output \244 itself in that case.

True. Here is a test case on 8.2.3
(OS, database and client all use UTF8):

test=> CREATE TABLE test(x text);
CREATE TABLE
test=> INSERT INTO test VALUES ('correct: ä');
INSERT 0 1
test=> INSERT INTO test VALUES (E'incorrect: \244');
INSERT 0 1
test=> \q
laurenz:~> pg_dump -d -t test -f test.sql

Here is an excerpt from 'od -c test.sql':

0001040   e   n   z  \n   -   -  \n  \n   I   N   S   E   R   T       I
0001060   N   T   O       t   e   s   t       V   A   L   U   E   S
0001100   (   '   c   o   r   r   e   c   t   :     303 244   '   )   ;
0001120  \n   I   N   S   E   R   T       I   N   T   O       t   e   s
0001140   t       V   A   L   U   E   S       (   '   i   n   c   o   r
0001160   r   e   c   t   :     244   '   )   ;  \n  \n  \n   -   -  \n

The invalid byte (octal 244, hex a4) appears unescaped in the INSERT statement!
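
To make the od output concrete, here is a minimal Python sketch (not part of
the original test case; the helper name is_valid_utf8 is made up for
illustration) that checks the two byte sequences seen in the dump. Octal
303 244 is the two-byte UTF-8 encoding of 'ä', while a lone octal 244 is a
continuation byte without a lead byte.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Byte strings as they appear in the dump file (octal escapes, as in od -c).
correct = b"correct: \303\244"    # octal 303 244 = 0xc3 0xa4 = 'ä'
incorrect = b"incorrect: \244"    # lone octal 244 = 0xa4

print(is_valid_utf8(correct))
print(is_valid_utf8(incorrect))
```

The first call reports a valid sequence and the second an invalid one, which
matches the server's complaint about 0xa4 when the dump is restored.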

This makes psql gag:

test=> DROP TABLE test;
DROP TABLE
test=> \i test.sql
SET
SET
SET
SET
SET
SET
SET
SET
CREATE TABLE
ALTER TABLE
INSERT 0 1
psql:test.sql:33: ERROR:  invalid byte sequence for encoding "UTF8": 0xa4
HINT:  This error can also happen if the byte sequence does not match the encoding expected by the server, which is
controlled by "client_encoding".

There are three possible fixes: the server could check escape sequences for
validity, pg_dump could output invalid bytes as escape sequences, or
pg_dump could stop with an error.
I think that the first would be the cleanest.
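
As an illustration of the second option only, here is a rough Python sketch of
what "output invalid bytes as escape sequences" could look like. This is not
pg_dump code; the function name escape_invalid_bytes is hypothetical, and the
sketch ignores quoting of backslashes and single quotes, which a real dump
would also have to handle.

```python
def escape_invalid_bytes(data: bytes) -> str:
    """Render data as text, writing bytes that are not valid UTF-8 as \\ooo.

    Hypothetical helper for illustration; not how pg_dump is implemented.
    """
    # surrogateescape maps each undecodable byte 0xNN to the code point U+DCNN.
    text = data.decode("utf-8", errors="surrogateescape")
    out = []
    for ch in text:
        code = ord(ch)
        if 0xDC80 <= code <= 0xDCFF:
            # Recover the original byte and emit it as a three-digit octal escape.
            out.append("\\%03o" % (code - 0xDC00))
        else:
            out.append(ch)
    return "".join(out)


print(escape_invalid_bytes(b"incorrect: \244"))  # prints: incorrect: \244
```

Valid multibyte characters such as 'ä' pass through unchanged, so only the
offending byte is turned into an escape that the server would accept inside
an E'...' string on restore.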

Yours,
Laurenz Albe

