Thread: Unicode database question

Unicode database question

From
Lynna Landstreet
Date:
Hello,

I'm running into a bit of trouble with a Unicode-enabled PostgreSQL database
(some of the data consists of artist and/or image names in other languages,
like French, Spanish, German and Portuguese, which frequently have accents,
and I don't want people entering data to have to use ASCII codes). Having (I
thought) managed to get past the issues of exporting text as Unicode in
order to import it into the database and uploading the text files as binary
instead of data to keep them Unicode/UTF-8 as I upload them, and then using
psql's \copy command to insert the data into the database, I can't get the
special characters to display properly on the web. :-(

I'm not even sure how to tell if the problem is on the input side or the
output side - as in, whether it's that the data in the database got muddled
on the way in and is not valid Unicode, or whether it's OK but every means I
try to use to view it doesn't want to accept Unicode. I'm pretty sure the
text files got to the server OK as Unicode, because I was able to view them
directly with a web browser and the special characters were OK then. But
when I imported them into the database, I was not then able to view the
special characters correctly, either in my browser through the PHP frontend
I'm developing for the database or phpPgAdmin, or via Telnet/SSH. So I don't
know if the problem came about somehow while using \copy to import them, or
with the means I'm using to view them.

I've set the charset encoding of my PHP pages to UTF-8, and the default
encoding in my browser as well, but that doesn't help. And I've tried
editing the data through phpPgAdmin to restore the special characters, but
got the following error message:

Error - /[path to my web directory]/phpPgAdmin/tbl_replace.php -- Line: 77

PostgreSQL said: ERROR: Invalid UNICODE character sequence found (0xe7e36f)
Your query:
UPDATE "artists" SET "artist_id" = 485, "firstname" = 'Teresa', "lastname" =
'Ascenção'... [rest of query deleted]

Ironically, the accented characters in her last name (a c with a cedilla and
an a with a tilde, in case they don't show up here) displayed fine in the
error message! But it wouldn't enter them into the database.

Questions that come to mind:

1. Does anyone have any idea what's going wrong here?
2. Can \copy reduce UTF-8 text to plain ASCII while importing data from a
text file?
3. If so, can it be made not to, maybe through adding some kind of parameter
to the command? Or is there a better way to import the data?
4. Is if correct for the database encoding to be "UNICODE" or should it be
UTF-8 specifically? My impression thus far was that Unicode and UTF-8 were
more or less the same thing, but maybe more or less isn't good enough.
5. Does a web form have to be specially coded to accept text with accented
characters into a database, or does the encoding of the database itself
and/or the web page the form is on determine that?

Any assistance would be much appreciated...


Lynna
--
Resource Centre Database Coordinator
Gallery 44
www.gallery44.org


Re: Unicode database question

From
"Vernon Smith"
Date:
I have some experience in the scenario: utf-8 character set in application and unicode encoding in PS, successful and
unsuccessful.The successful one is to have the locale installed in the OS where PS resides. And unsuccessful one is the
localecan't be installed due to some restricture such as share OS with others in ISP.  

With the locale installed, you can use psql to verify whether the language text is stored through the application in PG
successfullyor not. You need to install multiple locales since your application serves multiple languages.  

I don't know what is your application environment. The following is quoted from the page 191 of "Java Enterprise Best
Practices":

Out of the box, PostgreSQL doesn not support localization -- this must be enabled wieh the PostgreSQL server is built
fromthe source distrubtion into compiled binaries. However, one of the limitation of this database is that , to quote
thePostgreSQL documentation on localization, "PostgreSQL uses the standard ISO C and POSIX-lick locale facilities
providedby the server operation system." Therefore, to achieve proper localization for your application running over a
PostgreSQLdatabase, you must involve the sytem adminstrator to install the proper language package. 

The reference web page, www.postgresql.org/idocs/index.php?cjarset.html doesn't exist anymore. This book is published
onDecember 2002. 

Hope this information helops.

Vernon
--

--------- Original Message ---------

DATE: Wed, 16 Jul 2003 19:39:37
From: Lynna Landstreet <lynna@gallery44.org>
To: <pgsql-general@postgresql.org>
Cc:

>Hello,
>
>I'm running into a bit of trouble with a Unicode-enabled PostgreSQL database
>(some of the data consists of artist and/or image names in other languages,
>like French, Spanish, German and Portuguese, which frequently have accents,
>and I don't want people entering data to have to use ASCII codes). Having (I
>thought) managed to get past the issues of exporting text as Unicode in
>order to import it into the database and uploading the text files as binary
>instead of data to keep them Unicode/UTF-8 as I upload them, and then using
>psql's \copy command to insert the data into the database, I can't get the
>special characters to display properly on the web. :-(
>
>I'm not even sure how to tell if the problem is on the input side or the
>output side - as in, whether it's that the data in the database got muddled
>on the way in and is not valid Unicode, or whether it's OK but every means I
>try to use to view it doesn't want to accept Unicode. I'm pretty sure the
>text files got to the server OK as Unicode, because I was able to view them
>directly with a web browser and the special characters were OK then. But
>when I imported them into the database, I was not then able to view the
>special characters correctly, either in my browser through the PHP frontend
>I'm developing for the database or phpPgAdmin, or via Telnet/SSH. So I don't
>know if the problem came about somehow while using \copy to import them, or
>with the means I'm using to view them.
>
>I've set the charset encoding of my PHP pages to UTF-8, and the default
>encoding in my browser as well, but that doesn't help. And I've tried
>editing the data through phpPgAdmin to restore the special characters, but
>got the following error message:
>
>Error - /[path to my web directory]/phpPgAdmin/tbl_replace.php -- Line: 77
>
>PostgreSQL said: ERROR: Invalid UNICODE character sequence found (0xe7e36f)
>Your query:
>UPDATE "artists" SET "artist_id" = 485, "firstname" = 'Teresa', "lastname" =
>'Ascenção'... [rest of query deleted]
>
>Ironically, the accented characters in her last name (a c with a cedilla and
>an a with a tilde, in case they don't show up here) displayed fine in the
>error message! But it wouldn't enter them into the database.
>
>Questions that come to mind:
>
>1. Does anyone have any idea what's going wrong here?
>2. Can \copy reduce UTF-8 text to plain ASCII while importing data from a
>text file?
>3. If so, can it be made not to, maybe through adding some kind of parameter
>to the command? Or is there a better way to import the data?
>4. Is if correct for the database encoding to be "UNICODE" or should it be
>UTF-8 specifically? My impression thus far was that Unicode and UTF-8 were
>more or less the same thing, but maybe more or less isn't good enough.
>5. Does a web form have to be specially coded to accept text with accented
>characters into a database, or does the encoding of the database itself
>and/or the web page the form is on determine that?
>
>Any assistance would be much appreciated...
>
>
>Lynna
>--
>Resource Centre Database Coordinator
>Gallery 44
>www.gallery44.org
>
>
>---------------------------(end of broadcast)---------------------------
>TIP 4: Don't 'kill -9' the postmaster
>



____________________________________________________________
Get advanced SPAM filtering on Webmail or POP Mail ... Get Lycos Mail!
http://login.mail.lycos.com/r/referral?aid=27005

Re: Unicode database question

From
Alvaro Herrera
Date:
On Wed, Jul 16, 2003 at 08:59:30PM -0400, Vernon Smith wrote:

> I don't know what is your application environment. The following is
> quoted from the page 191 of "Java Enterprise Best Practices":
>
> [...]
>
> The reference web page,
> www.postgresql.org/idocs/index.php?cjarset.html doesn't exist anymore.
> This book is published on December 2002.

The current URL would be

http://www.postgresql.org/docs/7.3/static/charset.html

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"Y eso te lo doy firmado con mis lagrimas" (Fiebre del loco)

Re: Unicode database question

From
Dennis Gearon
Date:
It would sure be nice to be able to serve any language based on UTF-8 in
binary fields with no location rquired, and any sorting done either in
binary or through an additional SQL extension such as:

SELECT some_field
FROM some_table
WHERE xlate_utf8binary_to_collations( field_name_or_select_statement,
'lang1', lang2', ...) =
               xlate_utf8binary_to_collations('some_value', 'lang1',
lang2', ...)
SORT BY xlate_utf8binary_to_collations('some_value', 'lang1', lang2', ...).

etc, etc, etc

Isn't postgres going to be importing some libraries for collations
sorting soon? I seem to remember coming upon some discussions involving
Tom and others about this.


Re: Unicode database question

From
Tino Wildenhain
Date:
Hi Lanna,

we are running postgresql with UNICODE encoding on a regular
basis for our shop. This basically means it stores and retrieves
strings as utf-8 encoded. If you dont need special collating
rules, thats the way to go. However we are using Python/Zope
in front of the DB for presentation and maybe PHP behaves
differently. Another problem with Web-Clients is, that they
sometimes send forms with a default charset and not with what
your form HTML originally had. Meaning you send a page
with utf-8 and the post-resquest goes in as iso8859-1 or something
like that. This is irritating and should be investigated.
Try a simple recording proxy or packet sniffer with tcp-stream
assembling ability to log whats going over the wire.
One solution to the browsers bug is to mark the page with
a well known string which gets sent in the answer (say hidden
form field) and undergoes the same charset rules as the rest of
the form. If then you get your string back with the answer you
can check the encoding/charset.

HTH
Tino Wildenhain

Lynna Landstreet wrote:
> Hello,
>
> I'm running into a bit of trouble with a Unicode-enabled PostgreSQL database
> (some of the data consists of artist and/or image names in other languages,
> like French, Spanish, German and Portuguese, which frequently have accents,
> and I don't want people entering data to have to use ASCII codes). Having (I
> thought) managed to get past the issues of exporting text as Unicode in
> order to import it into the database and uploading the text files as binary
> instead of data to keep them Unicode/UTF-8 as I upload them, and then using
> psql's \copy command to insert the data into the database, I can't get the
> special characters to display properly on the web. :-(
>
> I'm not even sure how to tell if the problem is on the input side or the
> output side - as in, whether it's that the data in the database got muddled
> on the way in and is not valid Unicode, or whether it's OK but every means I
> try to use to view it doesn't want to accept Unicode. I'm pretty sure the
> text files got to the server OK as Unicode, because I was able to view them
> directly with a web browser and the special characters were OK then. But
> when I imported them into the database, I was not then able to view the
> special characters correctly, either in my browser through the PHP frontend
> I'm developing for the database or phpPgAdmin, or via Telnet/SSH. So I don't
> know if the problem came about somehow while using \copy to import them, or
> with the means I'm using to view them.
>
> I've set the charset encoding of my PHP pages to UTF-8, and the default
> encoding in my browser as well, but that doesn't help. And I've tried
> editing the data through phpPgAdmin to restore the special characters, but
> got the following error message:
>
> Error - /[path to my web directory]/phpPgAdmin/tbl_replace.php -- Line: 77
>
> PostgreSQL said: ERROR: Invalid UNICODE character sequence found (0xe7e36f)
> Your query:
> UPDATE "artists" SET "artist_id" = 485, "firstname" = 'Teresa', "lastname" =
> 'Ascenção'... [rest of query deleted]
>
> Ironically, the accented characters in her last name (a c with a cedilla and
> an a with a tilde, in case they don't show up here) displayed fine in the
> error message! But it wouldn't enter them into the database.
>
> Questions that come to mind:
>
> 1. Does anyone have any idea what's going wrong here?
> 2. Can \copy reduce UTF-8 text to plain ASCII while importing data from a
> text file?
> 3. If so, can it be made not to, maybe through adding some kind of parameter
> to the command? Or is there a better way to import the data?
> 4. Is if correct for the database encoding to be "UNICODE" or should it be
> UTF-8 specifically? My impression thus far was that Unicode and UTF-8 were
> more or less the same thing, but maybe more or less isn't good enough.
> 5. Does a web form have to be specially coded to accept text with accented
> characters into a database, or does the encoding of the database itself
> and/or the web page the form is on determine that?
>
> Any assistance would be much appreciated...
>
>
> Lynna



Re: Unicode database question

From
Tony Grant
Date:
On Thu, 2003-07-17 at 01:39, Lynna Landstreet wrote:

> I'm running into a bit of trouble with a Unicode-enabled PostgreSQL database
> (some of the data consists of artist and/or image names in other languages,
> like French, Spanish, German and Portuguese, which frequently have accents,
> and I don't want people entering data to have to use ASCII codes). Having (I
> thought) managed to get past the issues of exporting text as Unicode in
> order to import it into the database and uploading the text files as binary
> instead of data to keep them Unicode/UTF-8 as I upload them, and then using
> psql's \copy command to insert the data into the database, I can't get the
> special characters to display properly on the web. :-(

1. Try LATIN1 instead of Unicode.

2. What front end do you have for queries?

JSP works just fine (www.3continents.com/base_de_donnees.html is an
example). I had to write a little bit of code to escape accents in
search pages so école and ecole will get results.

Cheers

Tony Grant
--
www.tgds.net Library management software toolkit,
redhat linux on Sony Vaio C1XD,
Dreamweaver MX with Tomcat and PostgreSQL


Re: Unicode database question

From
Markus Bertheau
Date:
В Чтв, 17.07.2003, в 01:39, Lynna Landstreet пишет:

> I'm not even sure how to tell if the problem is on the input side or the
> output side - as in, whether it's that the data in the database got muddled

As a database with encoding='UNICODE' won't acceppt invalid UTF-8, the
data in the database is valid.

> Error - /[path to my web directory]/phpPgAdmin/tbl_replace.php -- Line: 77

Please use phpPgAdmin 3.0, it correctly handles Unicode.

--
Markus Bertheau.
Berlin, Berlin.
Germany.