Thread: Multibyte support and accented characters
Hello all,

Can you handle one more question from me? Not related to keyword checkboxes this time, promise. :-)

Some of the text that will be entered in the database I'm working on includes some names and titles in other languages - predominantly French, but occasionally German, Spanish, etc. So I understand from reading the PostgreSQL docs that in order to handle this, we need to make sure multibyte support is enabled.

Now, I am not very clear on the various encodings and how they work. I've been spoiled by years of working on a Mac where you just type option-e if you want an acute accent, option-u for an umlaut, etc. That's how most of the text that will be used to populate the database has been generated. So my questions are:

1. Which encoding would be best for this? I'm guessing Unicode, but I'm not sure. We pretty much only have to deal with western European languages, not with Russian or Chinese or anything.

2. Once the right one is chosen and enabled, is the process pretty much transparent - i.e., just enter the text and the accented characters will come through fine, or do I have to do something special with them, like the way they have to be encoded with &...; ASCII codes in HTML?

3. Speaking of HTML, even if PostgreSQL is set up to correctly deal with accented characters, when the output is displayed on the web, are they going to have to be converted into &...; form?

Any advice would be much appreciated.

Thanks,

Lynna

--
Resource Centre Database Coordinator
Gallery 44
www.gallery44.org
At 7:08 PM -0400 6/12/03, Lynna Landstreet wrote:
> Hello all,
>
> Can you handle one more question from me? Not related to keyword checkboxes
> this time, promise. :-)
>
> Some of the text that will be entered in the database I'm working on
> includes some names and titles in other languages - predominantly French,
> but occasionally German, Spanish, etc. So I understand from reading the
> PostgreSQL docs that in order to handle this, we need to make sure multibyte
> support is enabled.
>
> Now, I am not very clear on the various encodings and how they work. I've
> been spoiled by years of working on a Mac where you just type option-e if
> you want an acute accent, option-u for an umlaut, etc. That's how most of
> the text that will be used to populate the database has been generated. So
> my questions are:
>
> 1. Which encoding would be best for this? I'm guessing Unicode,

Unicode is the safest way to go indeed. It's well on its way to becoming the new common standard of all computer platforms.

> but I'm not
> sure. We pretty much only have to deal with western European languages, not
> with Russian or Chinese or anything.
>
> 2. Once the right one is chosen and enabled, is the process pretty much
> transparent - i.e., just enter the text and the accented characters will
> come through fine,

No. First create the database as Unicode:

CREATE DATABASE mydb WITH ENCODING = 'UNICODE'

Then the front-end, with which you're doing your input, must send its data encoded in unicode UTF-8. If it sends it in another encoding, then use:

SET CLIENT_ENCODING TO '<whatever encoding the front-end uses>'

to enable automatic translation to unicode by PostgreSQL.

Read the manual for further information: http://www.postgresql.org/docs/view.php?version=7.3&file=multibyte.html

> or do I have to do something special with them, like the
> way they have to be encoded with &...; ASCII codes in HTML?
>
> 3. Speaking of HTML, even if PostgreSQL is set up to correctly deal with
> accented characters, when the output is displayed on the web, are they going
> to have to be converted into &...; form?

Here too you have to tell the browser it's going to receive data in unicode. I don't know whether you can do this in HTML, or whether the user must choose unicode from the browser's appropriate menu.

Perhaps you can have PostgreSQL translate the encoding to iso-latin, the Windows standard.

It's better if someone else answers this one for you.

Marc
On Friday, Jun 13, 2003, at 08:54 Asia/Tokyo, M. Bastin wrote:

> Here too you have to tell the browser it's going to receive data in
> unicode. I don't know whether you can do this in HTML, or whether the
> user must choose unicode from the browser's appropriate menu.

Browsers are supposed to return data in the same encoding as the page, so, for example, if your page is encoded as UTF-8, data the users enter on the page will be returned to the server in UTF-8.

I handle a lot of Japanese, and I've been using XHTML pages encoded as UTF-8 and PHP for handling the scripts, and it's been going in and out of my UTF-8 encoded database just fine.

Good luck!

Michael Glaesemann
grzm myrealbox com
on 6/12/03 7:54 PM, M. Bastin at marcbastin@mindspring.com wrote:

>>1. Which encoding would be best for this? I'm guessing Unicode,
>
>Unicode is the safest way to go indeed. It's well on its way to becoming
>the new common standard of all computer platforms.

Cool, that's what I thought.

>>2. Once the right one is chosen and enabled, is the process pretty much
>>transparent - i.e., just enter the text and the accented characters will
>>come through fine,
>
>Then the front-end, with which you're doing your input, must send its
>data encoded in unicode UTF-8. If it sends it in another encoding, then
>use:
>
>SET CLIENT_ENCODING TO '<whatever encoding the front-end uses>'
>
>to enable automatic translation to unicode by PostgreSQL.

Er... This may sound like a dumb question, but the description of this list *did* say no question was too basic here... How do I tell what encoding the program I'm entering the data with (currently FileMaker Pro on a Mac) is using?

Once the database is up on the web, further data entry will be via a web form processed with PHP, so I presume in that case I can use PHP to control the encoding.

>Read the manual for further information:
>http://www.postgresql.org/docs/view.php?version=7.3&file=multibyte.html

I actually did read that before posing that question, but was still pretty confused, thus my post here. :-)

>>3. Speaking of HTML, even if PostgreSQL is set up to correctly deal with
>>accented characters, when the output is displayed on the web, are they going
>>to have to be converted into &...; form?
>
>Here too you have to tell the browser it's going to receive data in
>unicode. I don't know whether you can do this in HTML, or whether the
>user must choose unicode from the browser's appropriate menu.

Maybe using a Content-Type meta tag like the one Dreamweaver automatically inserts in everything? The default one it uses is <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"> - I presume I'd just change iso-8859-1 to unicode?

In my experience, relying on users to change their browser settings to accommodate your site is usually a very bad idea. 3/4 of them don't know how and the rest can't be bothered.

>Perhaps you can have PostgreSQL translate the encoding to iso-latin, the
>Windows standard.

Not sure if that would work - the default charset for most web pages seems to be iso-8859-1, but that still requires accented characters to use ASCII codes - it can't handle them being typed directly in your text.

I don't really mind if I have to do a global find-and-replace on the exported text from the existing FileMaker Pro database to turn all the accented characters into ASCII codes, but it would be a pain for everyone entering data in the future to have to use those. Most of the people working on this will not have said codes all memorized, the way I do from making web sites for 6-7 years.

I should find out how LiveJournal.com handles encoding. I know there I can type accented characters in directly in their forms and they seem to display properly.

Lynna

--
Resource Centre Database Coordinator
Gallery 44
www.gallery44.org
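For reference, a minimal Python sketch (not part of the original thread) of the three ways an accented character can end up in a page: typed directly and saved as ISO-8859-1, typed directly and saved as UTF-8, or written as an ASCII-only numeric character reference. The sample string and variable names are made up for illustration. It suggests that either charset can carry directly typed accents, as long as the bytes in the file match the charset named in the meta tag, and that the registered charset name is utf-8 rather than unicode.

```python
# Illustrative sample text (an assumption, not data from the thread).
text = "Caf\u00e9 M\u00fcller"  # "Café Müller"

# Typed directly and saved as ISO-8859-1: one byte per accented letter.
as_latin1 = text.encode("iso-8859-1")

# Typed directly and saved as UTF-8: two bytes per accented letter.
as_utf8 = text.encode("utf-8")

# ASCII-only fallback: numeric character references such as &#233;.
as_refs = text.encode("ascii", errors="xmlcharrefreplace")

# The meta tag has to name the encoding the bytes were actually saved in.
META_UTF8 = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'

print(as_latin1)
print(as_utf8)
print(as_refs)
```

The character references are only needed when the page has to stay pure ASCII; people typing into a form would not need to know them.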
>How do I tell what encoding the
>program I'm entering the data with (currently FileMaker Pro on a Mac) is
>using?

(I've done 10 years of FMP development and am in the process of switching to REALbasic in combination with PostgreSQL. It's a nice coincidence I answered your post.)

FMP for Mac uses an encoding known as MacRoman. I have written a small REALbasic app that can do the encoding conversion of a tab-separated text export from FMP for Mac, to a file suitable for a pgSQL COPY. I haven't really tried it out though for anything other than basic stuff. The only potential issue I can think of is how tabs and returns, *inside* FMP fields, would be handled.

I can build you that converting app, just tell me which OS you want. (OS X, or Classic? -- or Windows? Naaa.). Then you can experiment with it and tell me how it works, and whether you need further conversions for those tabs and returns, *inside* FMP fields, if you have these.

>Maybe using a Content-Type meta tag like the one Dreamweaver automatically
>inserts in everything? The default one it uses is <meta
>http-equiv="Content-Type" content="text/html; charset=iso-8859-1"> - I
>presume I'd just change iso-8859-1 to unicode?

I don't know enough about HTML to help you here, but it sounds like that's all there would be to it. (I'm into native OS client apps, nothing browser based. I hate the way browsers handle data, how they can only return, say 50 records per page, and then you have to load the next one and so on, and most of all I hate the way M$ does everything in its power to make sure nothing works perfectly cross-platform.)

>...have said codes all memorized, the way I do from making web
>sites for 6-7 years.

Wow, if I ever have a web question, I know who to ask! ;-)

Marc
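The conversion Marc describes can be sketched in a few lines of Python. This is not his REALbasic app, only a rough illustration under stated assumptions: the function name is hypothetical, the export is assumed to be plain tab-separated text in MacRoman with CR or CRLF line endings, and tabs or returns *inside* fields are not handled at all.

```python
def macroman_export_to_copy(raw: bytes) -> bytes:
    """Convert a MacRoman tab-separated export to UTF-8 text for COPY.

    Hypothetical helper: assumes one record per line, fields separated
    by tabs, and no tabs or returns inside any field.
    """
    text = raw.decode("mac_roman")
    # COPY's text format treats backslash as an escape character,
    # so literal backslashes in the data are doubled.
    text = text.replace("\\", "\\\\")
    # Normalise CRLF and classic Mac CR line endings to LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return text.encode("utf-8")


# Made-up sample export: two records, classic Mac line endings.
sample = "Caf\u00e9\tM\u00fcller\rNi\u00f1o\tA\\B\r".encode("mac_roman")
converted = macroman_export_to_copy(sample)
print(converted)
```

The result would be loaded into a database created with the Unicode encoding; if the file were left in MacRoman instead, the client encoding would have to be declared before loading, as discussed earlier in the thread.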
on 6/17/03 2:43 PM, Lynna Landstreet at lynna@gallery44.org wrote:

> I should find out how LiveJournal.com handles encoding. I know there I can
> type accented characters in directly in their forms and they seem to display
> properly.

Wow, I'm replying to myself, how dorky is that? :-)

I just checked and LiveJournal uses UTF-8 encoding. A look at the charset pages at w3c.org and ietf.org showed that UTF-8 basically is Unicode.

So, I tried putting up two test pages containing accented characters, one with the standard iso-8859-1 encoding specified in a meta tag and one with UTF-8, in hopes that this would demonstrate that the latter worked and the former didn't.

Unfortunately, though, neither of them worked. Tried charset=unicode and that didn't work either. The accented characters just showed up as question marks or nonsense characters. I even made sure that character encoding in my browser was set to UTF-8 and it still didn't work.

Clearly, I'm missing something here. I suppose this technically isn't a PostgreSQL question as such any more, though it's being asked with regard to a PostgreSQL-driven site. But does anyone have any idea what sort of step I might be missing here in trying to get accented characters to display via UTF-8?

Lynna

--
Resource Centre Database Coordinator
Gallery 44
www.gallery44.org
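The thread never establishes why the test pages failed. One possibility, offered here only as an assumption, is that the files were saved in the Mac's native MacRoman encoding while the meta tag declared something else, so the declared charset did not match the bytes on disk. A small Python sketch of that mismatch (the function and variable names are hypothetical):

```python
def as_browser_sees(saved_bytes: bytes, declared_charset: str) -> str:
    """Decode page bytes the way the declared charset says to."""
    return saved_bytes.decode(declared_charset, errors="replace")


word = "Caf\u00e9"  # "Café"
saved_as_macroman = word.encode("mac_roman")
saved_as_utf8 = word.encode("utf-8")

# Saved as MacRoman, declared as UTF-8: the lone byte 0x8E is not valid
# UTF-8, so a replacement character is shown instead of the accent.
mismatch_utf8 = as_browser_sees(saved_as_macroman, "utf-8")

# Saved as MacRoman, declared as ISO-8859-1: the byte decodes to an
# unrelated character rather than the accented letter.
mismatch_latin1 = as_browser_sees(saved_as_macroman, "iso-8859-1")

# Saved as UTF-8 and declared as UTF-8: the accent survives.
match = as_browser_sees(saved_as_utf8, "utf-8")

print(ascii(mismatch_utf8), ascii(mismatch_latin1), ascii(match))
```

If this is what happened, re-saving the test page as UTF-8 in the editor (not just changing the meta tag) would be the step to try.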
Lynna Landstreet <lynna@gallery44.org> writes:

> I just checked and LiveJournal uses UTF-8 encoding. A look at the charset
> pages at w3c.org and ietf.org showed that UTF-8 basically is Unicode.

Well, there's Unicode and Unicode. UTF-8 is one representation of Unicode, but there are a couple others (UCS-2 is the most popular alternative I think). But Postgres uses UTF-8, so that doesn't seem to explain your problem.

I'm baffled, and I suspect the people who do know about this don't read pgsql-novice. Try asking on pgsql-general, you might find someone with a clue ;-)

regards, tom lane
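To make the "Unicode and Unicode" distinction concrete, here is a short Python sketch showing one code point in two representations. It uses Python's utf-16-be codec as a stand-in for UCS-2, on the assumption that only Basic Multilingual Plane characters are involved, where the two produce the same bytes.

```python
ch = "\u00e9"  # e with acute accent

# The abstract Unicode code point, independent of any byte encoding.
code_point = ord(ch)

# UTF-8: variable width; this character takes two bytes,
# while plain ASCII letters still take one.
utf8 = ch.encode("utf-8")

# UCS-2 (big-endian): a fixed two bytes per character.
ucs2 = ch.encode("utf-16-be")

print(hex(code_point), utf8, ucs2)
```

Both byte sequences stand for the same character, which is why a page or a client has to say which representation it is sending.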
* Lynna Landstreet <lynna@gallery44.org> [170603, 16:13]:

> on 6/17/03 2:43 PM, Lynna Landstreet at lynna@gallery44.org wrote:
> Wow, I'm replying to myself, how dorky is that? :-)
> [...]
> charset=unicode and that didn't work either. The accented characters just
> showed up as question marks or nonsense characters. I even made sure that
> character encoding in my browser was set to UTF-8 and it still didn't work.
> [...]
> ......But does anyone have any idea what sort of step I
> might be missing here in trying to get accented characters to display via
> UTF-8?

Lynna, have a look at this message I posted a few weeks ago on a similar problem: it might hopefully give you some clue :-)

bye,
Ennio