Thread: 7.3.2 incorrectly counts characters for unicode varchar field

7.3.2 incorrectly counts characters for unicode varchar field

From
"Matthew Cooper"
Date:
============================================================================
POSTGRESQL BUG REPORT TEMPLATE
============================================================================

Your name : Matthew Cooper
Your email address : matty (at) cloverworxs (dot) com

System Configuration
---------------------
Architecture (example: Intel Pentium) : Intel Pentium
Operating System (example: Linux 2.0.26 ELF) : Redhat 8.0 / 9.0
PostgreSQL version (example: PostgreSQL-7.2.2): PostgreSQL-7.2.2 / 7.3.2
Compiler used (example: gcc 2.95.2) : none

Please enter a FULL description of your problem:
------------------------------------------------
I have a database with UNICODE encoding set. In it is a table with a
varchar(10) column. If I insert 10 western characters into it, it is OK. If
I insert 10 chinese characters it says:
postgresql value too long for type character varying 10
when using 7.3.2. If I use 7.2.2 it works fine.
Please describe a way to repeat the problem. Please try to provide a
concise reproducible example, if at all possible:
----------------------------------------------------------------------
createdb -E UNICODE mydb
Then in psql...
create table mgc (c1 varchar(10));
insert into mgc values('0123456789');
This all works fine.
Now I put the following command into a file (say my.sql) which is UTF-8
encoded and the literal is 10 chinese characters.
(I don't know if once emailed this command will be readable so you may have
to re-create the command by pasting 10 chinese characters into your
favourite UTF-8 compatible editor.)
insert into mgc values ('åˆ†é’Ÿç»ƒä¹ åˆ†é’Ÿç»ƒä¹ ç»ƒä¹ ');
I then run psql -f my.sql and get the error for 7.3.2 but it works for
7.2.2.

If you know how this problem might be fixed, list the solution below:
---------------------------------------------------------------------
I am guessing it is incorrectly counting the bytes and not the characters.
Presumably a workaround is to double the length of the field.

Re: 7.3.2 incorrectly counts characters for unicode varchar field

From
Tom Lane
Date:
> insert into mgc values ('åˆ†é’Ÿç»ƒä¹ åˆ†é’Ÿç»ƒä¹ ç»ƒä¹ ');

I don't think this string is correctly unicode-encoded.  Anyway "length"
claims it is 30 characters.

            regards, tom lane

Re: 7.3.2 incorrectly counts characters for unicode varchar field

From
Tom Lane
Date:
"Matthew Cooper" <matty@cloverworxs.com> writes:
> Attached is the UTF-8 encoded sql file in case it got messed up in the mail
> transfer.

Ah, no doubt it did.

This works fine for me, using either 7.3.4 or CVS tip.  Are you sure
that the system knows your client-side encoding is supposed to be UTF8?

uc=# show client_encoding ;
 client_encoding
-----------------
 UNICODE
(1 row)

uc=# create table mgc(f1 varchar(10));
CREATE TABLE
uc=# \i mgc.sql
INSERT 328444 1
uc=# select * from mgc;
          f1
----------------------
 ͈†Õ’ŸÓ£ƒð¿ Íˆ†Õ’ŸÓ£ƒð¿ Ó£ƒð¿ 
(1 row)

uc=# select length(f1) from mgc;
 length
--------
     10
(1 row)


            regards, tom lane

Re: 7.3.2 incorrectly counts characters for unicode varchar field

From
"Matthew Cooper"
Date:
Attached is the UTF-8 encoded sql file in case it got messed up in the mail
transfer.

And here it is pasted in directly from the window that was displaying
chinese characters.

insert into mgc values ('分钟练习分钟练习练习');


Looking at the UTF-8 documentation, 10 chinese characters could be any
number of bytes, each character being say 2 or 3 characters.

Matty.
----- Original Message -----
From: "Tom Lane" <tgl@sss.pgh.pa.us>
To: "Matthew Cooper" <matty@cloverworxs.com>
Cc: <pgsql-bugs@postgresql.org>
Sent: Saturday, September 13, 2003 5:51 PM
Subject: Re: [BUGS] 7.3.2 incorrectly counts characters for unicode varchar
field


> > insert into mgc values ('Ã¥Ë?â? éâ?TŸç»Æ'ä¹ Ã¥Ë?â?
éâ?TŸç»Æ'ä¹ ç»Æ'ä¹ ');
>
> I don't think this string is correctly unicode-encoded.  Anyway "length"
> claims it is 30 characters.
>
> regards, tom lane
>

Re: 7.3.2 incorrectly counts characters for unicode varchar field

From
"Matthew Cooper"
Date:
Doh! It looks like its time to eat humble pie. It turns out that the guy
here who has 7.3.4 and helped me to reproduce the problem did not follow our
own installation instructions (that he recently re-worded!) as follows:

"createdb -E UNICODE -U DB_USER -P DB_PASSWORD DB_NAME"

and did not set the encoding. I, like a good boy, did on my 7.2
installation. The guys I am trying to debug the problem for are in another
location and are using 7.3.4 too. Hence I narrowed it down to a version
problem. I am asking them to check the encoding on their database too and
will post back with huge apologies and thanks for your time when they
inevitably confirm that the encoding is SQL_ANSI.

Thanks, Matty.
----- Original Message -----
From: "Matthew Cooper" <matty@cloverworxs.com>
To: "Tom Lane" <tgl@sss.pgh.pa.us>
Cc: <pgsql-bugs@postgresql.org>
Sent: Monday, September 15, 2003 9:50 AM
Subject: Re: [BUGS] 7.3.2 incorrectly counts characters for unicode varchar
field


> Attached is the UTF-8 encoded sql file in case it got messed up in the
mail
> transfer.
>
> And here it is pasted in directly from the window that was displaying
> chinese characters.
>
> insert into mgc values ('分钟练习分钟练习练习');
>
>
> Looking at the UTF-8 documentation, 10 chinese characters could be any
> number of bytes, each character being say 2 or 3 characters.
>
> Matty.
> ----- Original Message -----
> From: "Tom Lane" <tgl@sss.pgh.pa.us>
> To: "Matthew Cooper" <matty@cloverworxs.com>
> Cc: <pgsql-bugs@postgresql.org>
> Sent: Saturday, September 13, 2003 5:51 PM
> Subject: Re: [BUGS] 7.3.2 incorrectly counts characters for unicode
varchar
> field
>
>
> > > insert into mgc values ('Ã¥Ë?â? éâ?TŸç»Æ'ä¹ Ã¥Ë?â?
> éâ?TŸç»Æ'ä¹ ç»Æ'ä¹ ');
> >
> > I don't think this string is correctly unicode-encoded.  Anyway "length"
> > claims it is 30 characters.
> >
> > regards, tom lane
> >
>