BUG #9210: PostgreSQL string store bug? not enforce check with correct characterSET/encoding - Mailing list pgsql-bugs

From digoal@126.com
Subject BUG #9210: PostgreSQL string store bug? not enforce check with correct characterSET/encoding
Date
Msg-id 20140213140759.2710.49877@wrigleys.postgresql.org
Whole thread Raw
Responses Re: BUG #9210: PostgreSQL string store bug? not enforce check with correct characterSET/encoding
List pgsql-bugs
The following bug has been logged on the website:

Bug reference:      9210
Logged by:          digoal.zhou
Email address:      digoal@126.com
PostgreSQL version: 9.3.2
Operating system:   CentOS 6.4
Description:

In PostgreSQL , in addition sql_ascii coded character set encoding does not
check the legitimacy of the other characters are encoded legality checks .
This has clear instructions in the PostgreSQL manual:
Reference
http://www.postgresql.org/docs/9.3/static/multibyte.html
http://blog.163.com/digoal @ 126/blog/static/163877040201211281407682 /
http://blog.163.com/digoal @ 126/blog/static/16387704020132150381348 /
http://blog.163.com/digoal @ 126/blog/static/1638770402011718112210516 /
For example, we are using a database server UTF8 character set , then the
time stored UTF8 database checks whether the agreement , if it does not it
will error .
For example:
ERROR: invalid byte sequence for encoding "UTF8": 0xee 0xc1 0x22
It would appear that illegal characters should not be stored in the
database, but recently doing some data migration discovered this problem ,
the two databases on two servers, operating systems consistent LANG, are
UTF8, database the characters are also UTF8, but after export data to
another database into some of the data that is reported such an error , and
therefore can not normally migrate some data .
>From the face of it , it should be stored in the database of the illegal
character . Then since there are checks , but also how to store into it?
I used here convert_from this function to restore this phenomenon.
We see , in addition to sql_ascii will convert character sets.
postgres = # select t, t :: bytea from convert_from ('\ xeec1', 'gbk') as g
(t);
 t | t
---- + ----------
 He | \ xe79b8d
(1 row)
postgres = # select t, t :: bytea from convert_from ('\ xeec1', 'utf8') as g
(t);
ERROR: invalid byte sequence for encoding "UTF8": 0xee 0xc1
But truthfully use sql_ascii is stored byte stream , without any
conversion.
postgres = # select t, t :: bytea from convert_from ('\ xeec1', 'sql_ascii')
as g (t);
 t | t
--- + --------
   | \ Xeec1
(1 row)
Using this method, the illegal byte stream into the database .
postgres = # create table test (info text);
CREATE TABLE
postgres = # insert into test values ​​(convert_from ('\ xeec1',
'sql_ascii'));
INSERT 0 1
postgres = # select info, info :: bytea from test;
 info | info
------ + --------
      | \ Xeec1
(1 row)
This record came out , then the backup is not restored , as follows:
postgres @ db-192-168-173-55-> pg_dump-t test-a | psql-f -
SET
SET
SET
SET
SET
SET
psql: <stdin>: 19: ERROR: invalid byte sequence for encoding "UTF8": 0xee
0xc1
CONTEXT: COPY test, line 1

Can be used directly in the database, for example :
postgres = # insert into test select * from test;
INSERT 0 1
postgres = # select info, info :: bytea from test;
 info | info
------ + --------
      | \ Xeec1
      | \ Xeec1
(2 rows)

This problem makes the backup data reduction becomes a hassle .
There may be other ways to put the data into the database illegally , which
resulted in the present situation .
If you want to import the data across databases currently available through
DBLINK or external table to import the illegal character of the problem the
way to use the backup to restore occurs .

[ References ]
1. Src / backend / utils / mb / mbutils.c
2. Src / backend / utils / mb / wchar.c
3. Postgres = # \ df convert *
                              List of functions
   Schema | Name | Result data type | Argument data types | Type
------------ + -------------- + ------------------ + --- ------------------
+ --------
 pg_catalog | convert | bytea | bytea, name, name | normal
 pg_catalog | convert_from | text | bytea, name | normal
 pg_catalog | convert_to | bytea | text, name | normal
(3 rows)

pgsql-bugs by date:

Previous
From: Joshua Yanovski
Date:
Subject: Re: BUG #9204: truncate_identifier may be called unnecessarily on escaped quoted identifiers
Next
From: Tom Lane
Date:
Subject: Re: BUG #9204: truncate_identifier may be called unnecessarily on escaped quoted identifiers