Thread: Multibyte support and accented characters

Multibyte support and accented characters

From
Lynna Landstreet
Date:
Hello all,

Can you handle one more question from me? Not related to keyword checkboxes
this time, promise. :-)

Some of the text that will be entered in the database I'm working on
includes some names and titles in other languages - predominantly French,
but occasionally German, Spanish, etc. So I understand from reading the
PostgreSQL docs that in order to handle this, we need to make sure multibyte
support is enabled.

Now, I am not very clear on the various encodings and how they work. I've
been spoiled by years of working on a Mac where you just type option-e if
you want an acute accent, option-u for an umlaut, etc. That's how most of
the text that will be used to populate the database has been generated. So
my questions are:

1. Which encoding would be best for this? I'm guessing Unicode, but I'm not
sure. We pretty much only have to deal with western European languages, not
with Russian or Chinese or anything.

2. Once the right one is chosen and enabled, is the process pretty much
transparent - i.e., just enter the text and the accented characters will
come through fine, or do I have to do something special with them, like the
way they have to be encoded with &...; ASCII codes in HTML?

3. Speaking of HTML, even if PostgreSQL is set up to correctly deal with
accented characters, when the output is displayed on the web, are they going
to have to be converted into &...; form?

Any advice would be much appreciated.

Thanks,

Lynna
--
Resource Centre Database Coordinator
Gallery 44
www.gallery44.org


Re: Multibyte support and accented characters

From
"M. Bastin"
Date:
At 7:08 PM -0400 6/12/03, Lynna Landstreet wrote:
>Hello all,
>
>Can you handle one more question from me? Not related to keyword checkboxes
>this time, promise. :-)
>
>Some of the text that will be entered in the database I'm working on
>includes some names and titles in other languages - predominantly French,
>but occasionally German, Spanish, etc. So I understand from reading the
>PostgreSQL docs that in order to handle this, we need to make sure multibyte
>support is enabled.
>
>Now, I am not very clear on the various encodings and how they work. I've
>been spoiled by years of working on a Mac where you just type option-e if
>you want an acute accent, option-u for an umlaut, etc. That's how most of
>the text that will be used to populate the database has been generated. So
>my questions are:
>
>1. Which encoding would be best for this? I'm guessing Unicode,

Unicode is indeed the safest way to go.  It's well on its way to becoming the new common standard on all computer platforms.

>but I'm not
>sure. We pretty much only have to deal with western European languages, not
>with Russian or Chinese or anything.
>
>2. Once the right one is chosen and enabled, is the process pretty much
>transparent - i.e., just enter the text and the accented characters will
>come through fine,

Not entirely. First, create the database with a Unicode encoding:

CREATE DATABASE mydb WITH ENCODING = 'UNICODE';

Then the front-end, with which you're doing your input, must send its data encoded in unicode UTF-8.  If it sends it in another encoding, then use:

SET CLIENT_ENCODING TO '<whatever encoding the front-end uses>';

to enable automatic translation to unicode by PostgreSQL.

Read the manual for further information: http://www.postgresql.org/docs/view.php?version=7.3&file=multibyte.html
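To make the idea of encoding translation concrete, here is a minimal Python sketch (not from the thread). It shows that one accented character is stored as different bytes in MacRoman, ISO-8859-1, and UTF-8, and what re-encoding to UTF-8 amounts to. The helper name `to_utf8` is hypothetical, and the sketch only approximates the kind of translation the server performs when CLIENT_ENCODING is set.

```python
# One accented character, three encodings, three different byte sequences.
E_ACUTE = "é"

utf8_bytes = E_ACUTE.encode("utf-8")          # two bytes
latin1_bytes = E_ACUTE.encode("latin-1")      # one byte (ISO-8859-1)
macroman_bytes = E_ACUTE.encode("mac_roman")  # one byte, not the Latin-1 one


def to_utf8(raw: bytes, client_encoding: str) -> bytes:
    """Hypothetical helper: re-encode bytes from a client encoding to UTF-8."""
    return raw.decode(client_encoding).encode("utf-8")


if __name__ == "__main__":
    samples = [
        ("utf-8", utf8_bytes),
        ("latin-1", latin1_bytes),
        ("mac_roman", macroman_bytes),
    ]
    for label, raw in samples:
        # Print the original bytes and the UTF-8 bytes they translate to.
        print(label, raw.hex(), "->", to_utf8(raw, label).hex())
```

The translation only works if the declared client encoding is the one the bytes were really written in; declaring the wrong one yields wrong characters rather than an obvious failure in many cases.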

>or do I have to do something special with them, like the
>way they have to be encoded with &...; ASCII codes in HTML?
>
>3. Speaking of HTML, even if PostgreSQL is set up to correctly deal with
>accented characters, when the output is displayed on the web, are they going
>to have to be converted into &...; form?

Here too you have to tell the browser it's going to receive data in unicode.  I don't know whether you can do this in HTML, or whether the user must choose unicode from the browser's appropriate menu.

Perhaps you can have PostgreSQL translate the encoding to ISO-8859-1 (Latin-1), which is close to the Windows-1252 encoding that Windows uses for Western European text.

It's better if someone else answers this one for you.

Marc

Re: Multibyte support and accented characters

From
Michael Glaesemann
Date:
On Friday, Jun 13, 2003, at 08:54 Asia/Tokyo, M. Bastin wrote:
> Here too you have to tell the browser it's going to receive data in
> unicode.  I don't know whether you can do this in HTML, or whether the
> user must choose unicode from the browser's appropriate menu.

Browsers are supposed to return data in the same encoding as the page,
so, for example, if your page is encoded as UTF-8, data the users enter
on the page will be returned to the server in UTF-8. I handle a lot of
Japanese, and I've been using XHTML pages encoded as UTF-8 and PHP for
handling the scripts, and it's been going in and out of my UTF-8
encoded database just fine.
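As an illustration of the round trip described above, the following Python sketch (not from the thread) mimics what a form submission from a UTF-8 page looks like on the wire and how the server side recovers the text. The field name `title` and its value are made up for the example; a real PHP script would do the equivalent decoding itself.

```python
from urllib.parse import parse_qs, urlencode

# Hypothetical form field and value.
form = {"title": "Café de Flore"}

# A browser submitting from a UTF-8 page percent-encodes the UTF-8 bytes.
body = urlencode(form, encoding="utf-8")

# Decoding with the same encoding recovers the original text.
decoded = parse_qs(body, encoding="utf-8")

# Decoding with a different encoding silently produces other characters.
wrong = parse_qs(body, encoding="latin-1")

if __name__ == "__main__":
    print(body)
    print(decoded["title"][0])
    print(wrong["title"][0])
```

The last line shows why the page encoding, the script, and the database all need to agree: nothing fails loudly when they do not.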

Good luck!

Michael Glaesemann
grzm myrealbox com


Re: Multibyte support and accented characters

From
Lynna Landstreet
Date:
on 6/12/03 7:54 PM, M. Bastin at marcbastin@mindspring.com wrote:

>>1. Which encoding would be best for this? I'm guessing Unicode,
>
>Unicode is the safest way to go indeed.  It's well on its way to become
>the new common standard of all computer platforms.

Cool, that's what I thought.


>>2. Once the right one is chosen and enabled, is the process pretty much
>>transparent - i.e., just enter the text and the accented characters will
>>come through fine,
>
>Then the front-end, with which you're doing your input, must send its
>data encoded in unicode UTF-8.  If it sends it in another encoding, then
>use:
>
>SET CLIENT_ENCODING TO '<whatever encoding the front-end uses>'
>
>to enable automatic translation to unicode by PostgreSQL.

Er... This may sound like a dumb question, but the description of this list
*did* say no question was too basic here... How do I tell what encoding the
program I'm entering the data with (currently FileMaker Pro on a Mac) is
using?

Once the database is up on the web, further data entry will be via a web
form processed with PHP, so I presume in that case I can use PHP to control
the encoding.


>Read the manual for further information:
>http://www.postgresql.org/docs/view.php?version=7.3&file=multibyte.html

I actually did read that before posing that question, but was still pretty
confused, thus my post here. :-)


>>3. Speaking of HTML, even if PostgreSQL is set up to correctly deal with
>>accented characters, when the output is displayed on the web, are they going
>>to have to be converted into &...; form?
>
>Here too you have to tell the browser it's going to receive data in
>unicode.  I don't know whether you can do this in HTML, or whether the
>user must choose unicode from the browser's appropriate menu.

Maybe using a Content-Type meta tag like the one Dreamweaver automatically
inserts in everything? The default one it uses is <meta
http-equiv="Content-Type" content="text/html; charset=iso-8859-1"> - I
presume I'd just change iso-8859-1 to unicode?

In my experience, relying on users to change their browser settings to
accommodate your site is usually a very bad idea. 3/4 of them don't know how
and the rest can't be bothered.
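A minimal Python sketch of the meta tag idea above (not from the thread): the charset named in the tag is only a label, so the bytes actually sent have to be written in that same encoding. The helper `render_page` is hypothetical, and `utf-8` is used as the charset name here on the assumption that the page is meant to be served as UTF-8.

```python
import html


def render_page(body_text: str, charset: str = "utf-8") -> bytes:
    """Hypothetical helper: build a page whose bytes match its declared charset."""
    page = (
        "<html><head>\n"
        f'<meta http-equiv="Content-Type" content="text/html; charset={charset}">\n'
        "</head><body>\n"
        # html.escape handles &, <, > and quotes; accented letters pass through.
        f"<p>{html.escape(body_text)}</p>\n"
        "</body></html>\n"
    )
    # Encode with the same charset that the meta tag declares.
    return page.encode(charset)


if __name__ == "__main__":
    print(render_page("Hélène Müller").decode("utf-8"))
```

With the label and the bytes in agreement, accented characters can appear directly in the text without being written as &...; codes.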


>Perhaps you can have PostgresQL translate the encoding to iso-latin, the
>Windows standard.

Not sure if that would work - the default charset for most web pages seems
to be iso-8859-1, but that still requires accented characters to use ASCII
codes - it can't handle them being typed directly in your text.

I don't really mind if I have to do a global find-and-replace on the
exported text from the existing FileMaker Pro database to turn all the
accented characters into ASCII codes, but it would be a pain for everyone
entering data in the future to have to use those. Most of the people working
on this will not have said codes all memorized, the way I do from making web
sites for 6-7 years.
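If the &...; form ever were needed, nobody would have to type the codes by hand, since the conversion can be automated at output time. The Python sketch below (not from the thread) turns accented characters into numeric character references and back; the sample name is made up.

```python
import html

# Hypothetical sample value typed directly with accents.
name = "Hélène Müller"

# Replace every non-ASCII character with a numeric reference such as &#233;
ascii_safe = name.encode("ascii", "xmlcharrefreplace").decode("ascii")

# html.unescape reverses the conversion.
round_trip = html.unescape(ascii_safe)

if __name__ == "__main__":
    print(ascii_safe)
    print(round_trip)
```

The result is plain ASCII, so it displays the same whatever charset the page declares.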

I should find out how LiveJournal.com handles encoding. I know there I can
type accented characters in directly in their forms and they seem to display
properly.


Lynna
--
Resource Centre Database Coordinator
Gallery 44
www.gallery44.org


Re: Multibyte support and accented characters

From
"M. Bastin"
Date:
>How do I tell what encoding the
>program I'm entering the data with (currently FileMaker Pro on a Mac) is
>using?

(I've done 10 years of FMP development and am in the process of
switching to REALbasic in combination with PostgreSQL.  It's a nice
coincidence I answered your post.)

FMP for Mac uses an encoding known as MacRoman.

I have written a small REALbasic app that can do the encoding
conversion of a tab-separated text export from FMP for Mac, to a file
suitable for a pgSQL COPY.

I haven't really tried it out, though, on anything other than basic
stuff.  The only potential issue I can think of is how tabs and
returns, *inside* FMP fields, would be handled.

I can build you that converting app, just tell me which OS you want.
(OS X, or Classic? -- or Windows?  Naaa.).  Then you can experiment
with it and tell me how it works, and whether you need further
conversions for those tabs and returns, *inside* FMP fields, if you
have these.
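For readers without REALbasic, here is a rough Python sketch of the same kind of conversion (it is not Marc's app). It rests on several assumptions that should be checked against a real export: that the file is MacRoman, that records end with a carriage return, and that returns inside fields are exported as vertical-tab characters. The function name and its parameters are hypothetical. Backslashes are doubled and in-field returns are written as \n, which is how the COPY text format expects them.

```python
def fmp_export_to_copy(
    raw: bytes,
    source_encoding: str = "mac_roman",  # assumption: FMP for Mac export
    record_sep: str = "\r",              # assumption: classic Mac line endings
    in_field_return: str = "\x0b",       # assumption: returns inside fields
) -> bytes:
    """Convert a tab-separated export into UTF-8 text for COPY."""
    text = raw.decode(source_encoding)
    lines = []
    for record in text.split(record_sep):
        if not record:
            continue  # skip empty records, e.g. after the final separator
        fields = []
        for field in record.split("\t"):
            field = field.replace("\\", "\\\\")         # escape backslashes first
            field = field.replace(in_field_return, "\\n")  # in-field return -> \n
            fields.append(field)
        lines.append("\t".join(fields))
    return "".join(line + "\n" for line in lines).encode("utf-8")


if __name__ == "__main__":
    sample = "André\tLine one\x0bLine two\r".encode("mac_roman")
    print(fmp_export_to_copy(sample).decode("utf-8"), end="")
```

Empty fields come through as empty strings rather than NULLs, and a tab typed inside a field would be indistinguishable from a field separator, so this sketch does not settle the in-field tab question raised above.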

>Maybe using a Content-Type meta tag like the one Dreamweaver automatically
>inserts in everything? The default one it uses is <meta
>http-equiv="Content-Type" content="text/html; charset=iso-8859-1"> - I
>presume I'd just change iso-8859-1 to unicode?

I don't know enough about HTML to help you here, but it sounds like
that's all there would be to it.

(I'm into native OS client apps, nothing browser based.  I hate the
way browsers handle data, how they can only return, say 50 records
per page, and then you have to load the next one and so on, and most
of all I hate the way M$ does everything in its power to make sure
nothing works perfectly cross-platform.)

>...have said codes all memorized, the way I do from making web
>sites for 6-7 years.

Wow, if I ever have a web question, I know who to ask!  ;-)

Marc

Re: Multibyte support and accented characters

From
Lynna Landstreet
Date:
on 6/17/03 2:43 PM, Lynna Landstreet at lynna@gallery44.org wrote:

> I should find out how LiveJournal.com handles encoding. I know there I can
> type accented characters in directly in their forms and they seem to display
> properly.

Wow, I'm replying to myself, how dorky is that? :-)

I just checked and LiveJournal uses UTF-8 encoding. A look at the charset
pages at w3c.org and ietf.org showed that UTF-8 basically is Unicode.

So, I tried putting up two test pages containing accented characters, one
with the standard iso-8859-1 encoding specified in a meta tag and one with
UTF-8, in hopes that this would demonstrate that the latter worked and the
former didn't. Unfortunately, though, neither of them worked. Tried
charset=unicode and that didn't work either. The accented characters just
showed up as question marks or nonsense characters. I even made sure that
character encoding in my browser was set to UTF-8 and it still didn't work.

Clearly, I'm missing something here. I suppose this technically isn't a
PostgreSQL question as such any more, though it's being asked with regard to
a PostgreSQL-driven site. But does anyone have any idea what sort of step I
might be missing here in trying to get accented characters to display via
UTF-8?
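One possible explanation, which the thread does not confirm, is that the test files themselves were saved in MacRoman, so neither charset label matched the bytes. The Python sketch below shows what that mismatch would look like; `cp1252` stands in for iso-8859-1 because browsers commonly treat the two labels alike. A charset sent by the web server in its Content-Type header can also take precedence over the meta tag, which this sketch does not cover.

```python
# Hypothetical page text typed on a Mac and saved in MacRoman.
text = "Café"
saved_bytes = text.encode("mac_roman")

# Page labelled iso-8859-1: the byte decodes, but to the wrong character.
shown_as_latin1 = saved_bytes.decode("cp1252")

# Page labelled UTF-8: the byte is invalid and becomes a replacement mark.
shown_as_utf8 = saved_bytes.decode("utf-8", errors="replace")

# Saving the file as UTF-8 and labelling it UTF-8 gives the intended text.
fixed = text.encode("utf-8").decode("utf-8")

if __name__ == "__main__":
    print(repr(shown_as_latin1), repr(shown_as_utf8), repr(fixed))
```

If this is the cause, the remedy is to save the file in the same encoding the meta tag names, rather than to change browser settings.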


Lynna
--
Resource Centre Database Coordinator
Gallery 44
www.gallery44.org


Re: Multibyte support and accented characters

From
Tom Lane
Date:
Lynna Landstreet <lynna@gallery44.org> writes:
> I just checked and LiveJournal uses UTF-8 encoding. A look at the charset
> pages at w3c.org and ietf.org showed that UTF-8 basically is Unicode.

Well, there's Unicode and Unicode.  UTF-8 is one representation of
Unicode, but there are a couple others (UCS-2 is the most popular
alternative I think).  But Postgres uses UTF-8, so that doesn't seem
to explain your problem.
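The difference between representations can be seen in a few lines of Python (a sketch, not from the thread). Python has no codec named UCS-2, so `utf-16-be` is used here as a stand-in, which produces the same bytes for the characters shown; the helper name `byte_forms` is hypothetical.

```python
def byte_forms(ch: str) -> dict:
    """Hypothetical helper: hex bytes of one character in two representations."""
    return {
        "utf-8": ch.encode("utf-8").hex(),
        "utf-16-be": ch.encode("utf-16-be").hex(),  # stand-in for UCS-2
    }


if __name__ == "__main__":
    for ch in ("e", "é", "€"):
        print(ch, byte_forms(ch))
```

Plain ASCII letters take one byte in UTF-8 and accented Western European letters take two, while the 16-bit form uses two bytes for all of these characters.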

I'm baffled, and I suspect the people who do know about this don't read
pgsql-novice.  Try asking on pgsql-general, you might find someone with
a clue ;-)

            regards, tom lane

Re: Multibyte support and accented characters

From
Ennio-Sr
Date:
* Lynna Landstreet <lynna@gallery44.org> [170603, 16:13]:
> on 6/17/03 2:43 PM, Lynna Landstreet at lynna@gallery44.org wrote:
> Wow, I'm replying to myself, how dorky is that? :-)
> [...]
> charset=unicode and that didn't work either. The accented characters just
> showed up as question marks or nonsense characters. I even made sure that
> character encoding in my browser was set to UTF-8 and it still didn't work.
> [...]
> ......But does anyone have any idea what sort of step I
> might be missing here in trying to get accented characters to display via
> UTF-8?

Lynna,
have a look at this message I posted a few weeks ago on a similar
problem: it might hopefully give you some clue :-)
bye,
    Ennio