Thread: Unicode database question
Hello, I'm running into a bit of trouble with a Unicode-enabled PostgreSQL database (some of the data consists of artist and/or image names in other languages, like French, Spanish, German and Portuguese, which frequently have accents, and I don't want people entering data to have to use ASCII codes). Having (I thought) managed to get past the issues of exporting text as Unicode in order to import it into the database and uploading the text files as binary instead of data to keep them Unicode/UTF-8 as I upload them, and then using psql's \copy command to insert the data into the database, I can't get the special characters to display properly on the web. :-( I'm not even sure how to tell if the problem is on the input side or the output side - as in, whether it's that the data in the database got muddled on the way in and is not valid Unicode, or whether it's OK but every means I try to use to view it doesn't want to accept Unicode. I'm pretty sure the text files got to the server OK as Unicode, because I was able to view them directly with a web browser and the special characters were OK then. But when I imported them into the database, I was not then able to view the special characters correctly, either in my browser through the PHP frontend I'm developing for the database or phpPgAdmin, or via Telnet/SSH. So I don't know if the problem came about somehow while using \copy to import them, or with the means I'm using to view them. I've set the charset encoding of my PHP pages to UTF-8, and the default encoding in my browser as well, but that doesn't help. And I've tried editing the data through phpPgAdmin to restore the special characters, but got the following error message: Error - /[path to my web directory]/phpPgAdmin/tbl_replace.php -- Line: 77 PostgreSQL said: ERROR: Invalid UNICODE character sequence found (0xe7e36f) Your query: UPDATE "artists" SET "artist_id" = 485, "firstname" = 'Teresa', "lastname" = 'Ascenção'... [rest of query deleted] Ironically, the accented characters in her last name (a c with a cedilla and an a with a tilde, in case they don't show up here) displayed fine in the error message! But it wouldn't enter them into the database. Questions that come to mind: 1. Does anyone have any idea what's going wrong here? 2. Can \copy reduce UTF-8 text to plain ASCII while importing data from a text file? 3. If so, can it be made not to, maybe through adding some kind of parameter to the command? Or is there a better way to import the data? 4. Is if correct for the database encoding to be "UNICODE" or should it be UTF-8 specifically? My impression thus far was that Unicode and UTF-8 were more or less the same thing, but maybe more or less isn't good enough. 5. Does a web form have to be specially coded to accept text with accented characters into a database, or does the encoding of the database itself and/or the web page the form is on determine that? Any assistance would be much appreciated... Lynna -- Resource Centre Database Coordinator Gallery 44 www.gallery44.org
I have some experience in the scenario: utf-8 character set in application and unicode encoding in PS, successful and unsuccessful.The successful one is to have the locale installed in the OS where PS resides. And unsuccessful one is the localecan't be installed due to some restricture such as share OS with others in ISP. With the locale installed, you can use psql to verify whether the language text is stored through the application in PG successfullyor not. You need to install multiple locales since your application serves multiple languages. I don't know what is your application environment. The following is quoted from the page 191 of "Java Enterprise Best Practices": Out of the box, PostgreSQL doesn not support localization -- this must be enabled wieh the PostgreSQL server is built fromthe source distrubtion into compiled binaries. However, one of the limitation of this database is that , to quote thePostgreSQL documentation on localization, "PostgreSQL uses the standard ISO C and POSIX-lick locale facilities providedby the server operation system." Therefore, to achieve proper localization for your application running over a PostgreSQLdatabase, you must involve the sytem adminstrator to install the proper language package. The reference web page, www.postgresql.org/idocs/index.php?cjarset.html doesn't exist anymore. This book is published onDecember 2002. Hope this information helops. Vernon -- --------- Original Message --------- DATE: Wed, 16 Jul 2003 19:39:37 From: Lynna Landstreet <lynna@gallery44.org> To: <pgsql-general@postgresql.org> Cc: >Hello, > >I'm running into a bit of trouble with a Unicode-enabled PostgreSQL database >(some of the data consists of artist and/or image names in other languages, >like French, Spanish, German and Portuguese, which frequently have accents, >and I don't want people entering data to have to use ASCII codes). Having (I >thought) managed to get past the issues of exporting text as Unicode in >order to import it into the database and uploading the text files as binary >instead of data to keep them Unicode/UTF-8 as I upload them, and then using >psql's \copy command to insert the data into the database, I can't get the >special characters to display properly on the web. :-( > >I'm not even sure how to tell if the problem is on the input side or the >output side - as in, whether it's that the data in the database got muddled >on the way in and is not valid Unicode, or whether it's OK but every means I >try to use to view it doesn't want to accept Unicode. I'm pretty sure the >text files got to the server OK as Unicode, because I was able to view them >directly with a web browser and the special characters were OK then. But >when I imported them into the database, I was not then able to view the >special characters correctly, either in my browser through the PHP frontend >I'm developing for the database or phpPgAdmin, or via Telnet/SSH. So I don't >know if the problem came about somehow while using \copy to import them, or >with the means I'm using to view them. > >I've set the charset encoding of my PHP pages to UTF-8, and the default >encoding in my browser as well, but that doesn't help. And I've tried >editing the data through phpPgAdmin to restore the special characters, but >got the following error message: > >Error - /[path to my web directory]/phpPgAdmin/tbl_replace.php -- Line: 77 > >PostgreSQL said: ERROR: Invalid UNICODE character sequence found (0xe7e36f) >Your query: >UPDATE "artists" SET "artist_id" = 485, "firstname" = 'Teresa', "lastname" = >'Ascenção'... [rest of query deleted] > >Ironically, the accented characters in her last name (a c with a cedilla and >an a with a tilde, in case they don't show up here) displayed fine in the >error message! But it wouldn't enter them into the database. > >Questions that come to mind: > >1. Does anyone have any idea what's going wrong here? >2. Can \copy reduce UTF-8 text to plain ASCII while importing data from a >text file? >3. If so, can it be made not to, maybe through adding some kind of parameter >to the command? Or is there a better way to import the data? >4. Is if correct for the database encoding to be "UNICODE" or should it be >UTF-8 specifically? My impression thus far was that Unicode and UTF-8 were >more or less the same thing, but maybe more or less isn't good enough. >5. Does a web form have to be specially coded to accept text with accented >characters into a database, or does the encoding of the database itself >and/or the web page the form is on determine that? > >Any assistance would be much appreciated... > > >Lynna >-- >Resource Centre Database Coordinator >Gallery 44 >www.gallery44.org > > >---------------------------(end of broadcast)--------------------------- >TIP 4: Don't 'kill -9' the postmaster > ____________________________________________________________ Get advanced SPAM filtering on Webmail or POP Mail ... Get Lycos Mail! http://login.mail.lycos.com/r/referral?aid=27005
On Wed, Jul 16, 2003 at 08:59:30PM -0400, Vernon Smith wrote: > I don't know what is your application environment. The following is > quoted from the page 191 of "Java Enterprise Best Practices": > > [...] > > The reference web page, > www.postgresql.org/idocs/index.php?cjarset.html doesn't exist anymore. > This book is published on December 2002. The current URL would be http://www.postgresql.org/docs/7.3/static/charset.html -- Alvaro Herrera (<alvherre[a]dcc.uchile.cl>) "Y eso te lo doy firmado con mis lagrimas" (Fiebre del loco)
It would sure be nice to be able to serve any language based on UTF-8 in binary fields with no location rquired, and any sorting done either in binary or through an additional SQL extension such as: SELECT some_field FROM some_table WHERE xlate_utf8binary_to_collations( field_name_or_select_statement, 'lang1', lang2', ...) = xlate_utf8binary_to_collations('some_value', 'lang1', lang2', ...) SORT BY xlate_utf8binary_to_collations('some_value', 'lang1', lang2', ...). etc, etc, etc Isn't postgres going to be importing some libraries for collations sorting soon? I seem to remember coming upon some discussions involving Tom and others about this.
Hi Lanna, we are running postgresql with UNICODE encoding on a regular basis for our shop. This basically means it stores and retrieves strings as utf-8 encoded. If you dont need special collating rules, thats the way to go. However we are using Python/Zope in front of the DB for presentation and maybe PHP behaves differently. Another problem with Web-Clients is, that they sometimes send forms with a default charset and not with what your form HTML originally had. Meaning you send a page with utf-8 and the post-resquest goes in as iso8859-1 or something like that. This is irritating and should be investigated. Try a simple recording proxy or packet sniffer with tcp-stream assembling ability to log whats going over the wire. One solution to the browsers bug is to mark the page with a well known string which gets sent in the answer (say hidden form field) and undergoes the same charset rules as the rest of the form. If then you get your string back with the answer you can check the encoding/charset. HTH Tino Wildenhain Lynna Landstreet wrote: > Hello, > > I'm running into a bit of trouble with a Unicode-enabled PostgreSQL database > (some of the data consists of artist and/or image names in other languages, > like French, Spanish, German and Portuguese, which frequently have accents, > and I don't want people entering data to have to use ASCII codes). Having (I > thought) managed to get past the issues of exporting text as Unicode in > order to import it into the database and uploading the text files as binary > instead of data to keep them Unicode/UTF-8 as I upload them, and then using > psql's \copy command to insert the data into the database, I can't get the > special characters to display properly on the web. :-( > > I'm not even sure how to tell if the problem is on the input side or the > output side - as in, whether it's that the data in the database got muddled > on the way in and is not valid Unicode, or whether it's OK but every means I > try to use to view it doesn't want to accept Unicode. I'm pretty sure the > text files got to the server OK as Unicode, because I was able to view them > directly with a web browser and the special characters were OK then. But > when I imported them into the database, I was not then able to view the > special characters correctly, either in my browser through the PHP frontend > I'm developing for the database or phpPgAdmin, or via Telnet/SSH. So I don't > know if the problem came about somehow while using \copy to import them, or > with the means I'm using to view them. > > I've set the charset encoding of my PHP pages to UTF-8, and the default > encoding in my browser as well, but that doesn't help. And I've tried > editing the data through phpPgAdmin to restore the special characters, but > got the following error message: > > Error - /[path to my web directory]/phpPgAdmin/tbl_replace.php -- Line: 77 > > PostgreSQL said: ERROR: Invalid UNICODE character sequence found (0xe7e36f) > Your query: > UPDATE "artists" SET "artist_id" = 485, "firstname" = 'Teresa', "lastname" = > 'Ascenção'... [rest of query deleted] > > Ironically, the accented characters in her last name (a c with a cedilla and > an a with a tilde, in case they don't show up here) displayed fine in the > error message! But it wouldn't enter them into the database. > > Questions that come to mind: > > 1. Does anyone have any idea what's going wrong here? > 2. Can \copy reduce UTF-8 text to plain ASCII while importing data from a > text file? > 3. If so, can it be made not to, maybe through adding some kind of parameter > to the command? Or is there a better way to import the data? > 4. Is if correct for the database encoding to be "UNICODE" or should it be > UTF-8 specifically? My impression thus far was that Unicode and UTF-8 were > more or less the same thing, but maybe more or less isn't good enough. > 5. Does a web form have to be specially coded to accept text with accented > characters into a database, or does the encoding of the database itself > and/or the web page the form is on determine that? > > Any assistance would be much appreciated... > > > Lynna
On Thu, 2003-07-17 at 01:39, Lynna Landstreet wrote: > I'm running into a bit of trouble with a Unicode-enabled PostgreSQL database > (some of the data consists of artist and/or image names in other languages, > like French, Spanish, German and Portuguese, which frequently have accents, > and I don't want people entering data to have to use ASCII codes). Having (I > thought) managed to get past the issues of exporting text as Unicode in > order to import it into the database and uploading the text files as binary > instead of data to keep them Unicode/UTF-8 as I upload them, and then using > psql's \copy command to insert the data into the database, I can't get the > special characters to display properly on the web. :-( 1. Try LATIN1 instead of Unicode. 2. What front end do you have for queries? JSP works just fine (www.3continents.com/base_de_donnees.html is an example). I had to write a little bit of code to escape accents in search pages so école and ecole will get results. Cheers Tony Grant -- www.tgds.net Library management software toolkit, redhat linux on Sony Vaio C1XD, Dreamweaver MX with Tomcat and PostgreSQL
В Чтв, 17.07.2003, в 01:39, Lynna Landstreet пишет: > I'm not even sure how to tell if the problem is on the input side or the > output side - as in, whether it's that the data in the database got muddled As a database with encoding='UNICODE' won't acceppt invalid UTF-8, the data in the database is valid. > Error - /[path to my web directory]/phpPgAdmin/tbl_replace.php -- Line: 77 Please use phpPgAdmin 3.0, it correctly handles Unicode. -- Markus Bertheau. Berlin, Berlin. Germany.