Thread: Space requirements (with respect to foreign languages)

Space requirements (with respect to foreign languages)

From
Gerard Samuel
Date:
My site/code/database is developed primarily for the English language.
I've had people from "The Far East" add content to my site using their
native language, and it is displaying properly in the site.
But I'm a bit concerned about the number of characters these languages use.
For example, I've had someone enter ->
chinese testing 中文

It is saved in the database as ->
chinese testing &#20013;&#25991;

Now, forgive my ignorance, but I have no idea what the additional
Chinese characters mean. From the values in the database, I'm
assuming that they amount to 3 characters.
But if I'm correct that those are 3 characters, they are
using up 24 characters in a column.

My concern is this: if I were to limit a column to, say, 25 "English"
characters, and a Chinese user comes by and hypothetically says
"Hello World" in Chinese and goes over the limit of the column, the data
will be truncated.

Is there anything that can be done to overcome this shortcoming?

I'm currently using PostgreSQL 7.4.2, with SQL_ASCII as the database
character set, FreeBSD 4.10, PHP 4.3.6.

Thanks for any advice you can provide...



Re: Space requirements (with respect to foreign languages)

From
Markus Bertheau
Date:
On Thu, 26.08.2004, at 22:36, Gerard Samuel wrote:
> My site/code/database is developed primarily for the English language.
> I've had people from "The Far East" add content to my site using their
> native language, and it is displaying properly in the site.
> But I'm a bit concerned about the number of characters these languages use.
> For example, I've had someone enter ->
> chinese testing 中文
>
> It is saved in the database as ->
> chinese testing &#20013;&#25991;

Your web page uses a character set that does not contain Chinese
characters, so the browser sends them as numeric HTML entities
instead. Each of these entities, as you correctly observed, takes up
more than one (Latin, ASCII) character.
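
To make the effect concrete, here is a rough sketch of that fallback. It
is written in Python purely for illustration (it is not part of the setup
described in this thread), and ISO-8859-1 is only an assumed page
character set; the point is that each character outside the character set
turns into a numeric entity several ASCII characters long.

```python
# Sketch: mimic the fallback to numeric character references that happens
# when the page's character set cannot represent the submitted text.
# ISO-8859-1 is an assumption made for this example.
text = "chinese testing 中文"

# xmlcharrefreplace substitutes &#NNNNN; for every unencodable character.
stored = text.encode("iso-8859-1", errors="xmlcharrefreplace").decode("ascii")

print(stored)        # the string that ends up in the database
print(len(text))     # length as the user typed it
print(len(stored))   # length as stored, entities included
```

Each of the two Chinese characters becomes an 8-character entity here, so
the stored string is considerably longer than what the user typed.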

> Now, forgive my ignorance, but I have no idea what the additional
> Chinese characters mean. From the values in the database, I'm
> assuming that they amount to 3 characters.
> But if I'm correct that those are 3 characters, they are
> using up 24 characters in a column.
>
> My concern is this: if I were to limit a column to, say, 25 "English"
> characters, and a Chinese user comes by and hypothetically says
> "Hello World" in Chinese and goes over the limit of the column, the data
> will be truncated.

PostgreSQL will not truncate the data; it will reject it with an error.
The general point is correct, though.
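
As a hedged illustration of that point, the sketch below checks a value
against a 25-character limit in application code and rejects it rather
than truncating it. It is Python, used only for the example; `check_fits`
is a hypothetical helper, and the greeting is just one possible rendering
of "Hello World" chosen for the demonstration.

```python
def check_fits(value: str, limit: int) -> str:
    """Hypothetical helper: reject over-length input instead of truncating it."""
    if len(value) > limit:
        raise ValueError(f"value too long: {len(value)} > {limit}")
    return value

# Illustrative greeting (an assumption, not taken from the thread).
greeting = "你好，世界"

# The same text as it would arrive if sent as numeric HTML entities.
as_entities = greeting.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

check_fits(greeting, 25)  # counted as characters, this fits easily

try:
    check_fits(as_entities, 25)  # counted with entities, it does not
except ValueError as exc:
    print("rejected:", exc)
```

Checking the length before the insert lets the site show its own message
instead of surfacing a database error.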

> Is there anything that can be done to overcome this shortcoming?
>
> I'm currently using PostgreSQL 7.4.2, with SQL_ASCII as the database
> character set, FreeBSD 4.10, PHP 4.3.6.

Change your site to use a character set that includes Chinese
characters, for example Unicode. The most common encoding of Unicode on
the web is UTF-8. It's also the encoding PostgreSQL uses when you use
UNICODE as the database encoding.

If you decide to switch your site to UTF-8 and want varchar(25) to mean
25 characters, and not 25 bytes, you have to change the database
encoding to UNICODE accordingly.
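
The difference between counting characters and counting bytes can be
sketched as follows. This is Python, used only for illustration; the
sample text is the one from the original question.

```python
# Sketch: the same text measured in characters and in UTF-8 bytes.
text = "中文"

char_count = len(text)                  # number of characters (code points)
byte_count = len(text.encode("utf-8"))  # number of bytes once UTF-8 encoded

print(char_count, byte_count)
```

Each of these characters takes three bytes in UTF-8, so a limit counted in
bytes is used up three times faster than one counted in characters.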

--
Markus Bertheau <twanger@bluetwanger.de>


Re: Space requirements (with respect to foreign languages)

From
Gerard Samuel
Date:
Markus Bertheau wrote:
> On Thu, 26.08.2004, at 22:36, Gerard Samuel wrote:
>>I'm currently using PostgreSQL 7.4.2, with SQL_ASCII as the database
>>character set, FreeBSD 4.10, PHP 4.3.6.
>
>
> Change your site to use a character set that includes Chinese
> characters, for example Unicode. The most common encoding of Unicode on
> the web is UTF-8. It's also the encoding PostgreSQL uses when you use
> UNICODE as the database encoding.
>
> If you decide to switch your site to UTF-8 and want varchar(25) to mean
> 25 characters, and not 25 bytes, you have to change the database
> encoding to UNICODE accordingly.
>

I'll try some mock scripts to see if it will pan out...
Thanks