Thread: 7.3.2 incorrectly counts characters for unicode varchar field
============================================================================ POSTGRESQL BUG REPORT TEMPLATE ============================================================================ Your name : Matthew Cooper Your email address : matty (at) cloverworxs (dot) com System Configuration --------------------- Architecture (example: Intel Pentium) : Intel Pentium Operating System (example: Linux 2.0.26 ELF) : Redhat 8.0 / 9.0 PostgreSQL version (example: PostgreSQL-7.2.2): PostgreSQL-7.2.2 / 7.3.2 Compiler used (example: gcc 2.95.2) : none Please enter a FULL description of your problem: ------------------------------------------------ I have a database with UNICODE encoding set. In it is a table with a varchar(10) column. If I insert 10 western characters into it, it is OK. If I insert 10 chinese characters it says: postgresql value too long for type character varying 10 when using 7.3.2. If I use 7.2.2 it works fine. Please describe a way to repeat the problem. Please try to provide a concise reproducible example, if at all possible: ---------------------------------------------------------------------- createdb -E UNICODE mydb Then in psql... create table mgc (c1 varchar(10)); insert into mgc values('0123456789'); This all works fine. Now I put the following command into a file (say my.sql) which is UTF-8 encoded and the literal is 10 chinese characters. (I don't know if once emailed this command will be readable so you may have to re-create the command by pasting 10 chinese characters into your favourite UTF-8 compatible editor.) insert into mgc values ('Ã¥Ëâ éâŸç»Æä¹ Ã¥Ëâ éâŸç»Æä¹ ç»Æä¹ '); I then run psql -f my.sql and get the error for 7.3.2 but it works for 7.2.2. If you know how this problem might be fixed, list the solution below: --------------------------------------------------------------------- I am guessing it is incorrectly counting the bytes and not the characters. Presumably a workaround is to double the length of the field.
> insert into mgc values ('Ã¥Ëâ éâŸç»Æä¹ Ã¥Ëâ éâŸç»Æä¹ ç»Æä¹ '); I don't think this string is correctly unicode-encoded. Anyway "length" claims it is 30 characters. regards, tom lane
"Matthew Cooper" <matty@cloverworxs.com> writes: > Attached is the UTF-8 encoded sql file in case it got messed up in the mail > transfer. Ah, no doubt it did. This works fine for me, using either 7.3.4 or CVS tip. Are you sure that the system knows your client-side encoding is supposed to be UTF8? uc=# show client_encoding ; client_encoding ----------------- UNICODE (1 row) uc=# create table mgc(f1 varchar(10)); CREATE TABLE uc=# \i mgc.sql INSERT 328444 1 uc=# select * from mgc; f1 ---------------------- ÍÕÓ£ð¿ ÍÕÓ£ð¿ Ó£ð¿ (1 row) uc=# select length(f1) from mgc; length -------- 10 (1 row) regards, tom lane
Attached is the UTF-8 encoded sql file in case it got messed up in the mail transfer. And here it is pasted in directly from the window that was displaying chinese characters. insert into mgc values ('åéç»ä¹ åéç»ä¹ ç»ä¹ '); Looking at the UTF-8 documentation, 10 chinese characters could be any number of bytes, each character being say 2 or 3 characters. Matty. ----- Original Message ----- From: "Tom Lane" <tgl@sss.pgh.pa.us> To: "Matthew Cooper" <matty@cloverworxs.com> Cc: <pgsql-bugs@postgresql.org> Sent: Saturday, September 13, 2003 5:51 PM Subject: Re: [BUGS] 7.3.2 incorrectly counts characters for unicode varchar field > > insert into mgc values ('ÃÂ¥Ã?â? éâ?Tà ¸Ã§Ã»Ã'äù ÃÂ¥Ã?â? éâ?Tà ¸Ã§Ã»Ã'äù çûÃ'äù '); > > I don't think this string is correctly unicode-encoded. Anyway "length" > claims it is 30 characters. > > regards, tom lane >
Doh! It looks like its time to eat humble pie. It turns out that the guy here who has 7.3.4 and helped me to reproduce the problem did not follow our own installation instructions (that he recently re-worded!) as follows: "createdb -E UNICODE -U DB_USER -P DB_PASSWORD DB_NAME" and did not set the encoding. I, like a good boy, did on my 7.2 installation. The guys I am trying to debug the problem for are in another location and are using 7.3.4 too. Hence I narrowed it down to a version problem. I am asking them to check the encoding on their database too and will post back with huge apologies and thanks for your time when they inevitably confirm that the encoding is SQL_ANSI. Thanks, Matty. ----- Original Message ----- From: "Matthew Cooper" <matty@cloverworxs.com> To: "Tom Lane" <tgl@sss.pgh.pa.us> Cc: <pgsql-bugs@postgresql.org> Sent: Monday, September 15, 2003 9:50 AM Subject: Re: [BUGS] 7.3.2 incorrectly counts characters for unicode varchar field > Attached is the UTF-8 encoded sql file in case it got messed up in the mail > transfer. > > And here it is pasted in directly from the window that was displaying > chinese characters. > > insert into mgc values ('åéç»ä¹ åéç»ä¹ ç»ä¹ '); > > > Looking at the UTF-8 documentation, 10 chinese characters could be any > number of bytes, each character being say 2 or 3 characters. > > Matty. > ----- Original Message ----- > From: "Tom Lane" <tgl@sss.pgh.pa.us> > To: "Matthew Cooper" <matty@cloverworxs.com> > Cc: <pgsql-bugs@postgresql.org> > Sent: Saturday, September 13, 2003 5:51 PM > Subject: Re: [BUGS] 7.3.2 incorrectly counts characters for unicode varchar > field > > > > > insert into mgc values ('ÃÂ¥Ã?â? éâ?Tà ¸Ã§Ã»Ã'äù ÃÂ¥Ã?â? > éâ?Tà ¸Ã§Ã»Ã'äù çûÃ'äù '); > > > > I don't think this string is correctly unicode-encoded. Anyway "length" > > claims it is 30 characters. > > > > regards, tom lane > > >