Thread: Internationalization

Internationalization

From
Dennis Gearon
Date:
I'm thinking that this should be approached as a gradual, top-down series
of changes.

1/ Make it possible for individual databases within a single instance to
have different encodings AND locales/sort orders, along with all other
aspects of the encoding/language rules.
2/ Then tables.
3/ Then columns.
-------------------------
So, for the first one:
Is there any way for a single statement to access more than one database?
Could a query, regexes, etc. end up facing indexes in different
encodings/collations if different databases in a cluster had
different encodings/collations?

Re: Internationalization

From
Alvaro Herrera
Date:
On Wed, Jun 30, 2004 at 02:26:10PM -0700, Dennis Gearon wrote:

> 1/ Make individual databases possible with a single instance that can be
> different encoding AND locale/sorting, and all other aspects of using
> the encoding/language rules.

> Is there any way for a single statement to access more than one database?
> Could a query, regexes, etc be facing indexes in different
> encodings/sorting collations if different databases in a cluster had
> different encodings/collations?

No, but there are at least two problems:

1. shared tables.  All databases in each cluster share at least
pg_database, pg_shadow, pg_group and (new) pg_tablespace.  And, of
course, all their indexes.  What would you do about them?

2. when creating a new database, the current method is to copy from
template1.  How would you change the encoding of the new database?

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"The problem with the future is that it keeps turning into the present"
(Hobbes)


Re: Internationalization

From
Tom Lane
Date:
Dennis Gearon <gearond@fireserve.net> writes:
> Is there any way for a single statement to access more than one database?
> Could a query, regexes, etc be facing indexes in different
> encodings/sorting collations if different databases in a cluster had
> different encodings/collations?

The indexes on the shared system tables (eg, pg_database) are the only
issue here.  One possible solution is to require that no locale-aware
datatypes ever be used in these indexes.  I think right now this is true
because "name" doesn't use locale-aware sorting; but we'd have to be
careful not to break the restriction in future.

            regards, tom lane

Re: Internationalization

From
Tom Lane
Date:
Dennis Gearon <gearond@fireserve.net> writes:
> Tom Lane wrote:
>> The indexes on the shared system tables (eg, pg_database) are the only
>> issue here.  One possible solution is to require that no locale-aware
>> datatypes ever be used in these indexes.  I think right now this is true
>> because "name" doesn't use locale-aware sorting; but we'd have to be
>> careful not to break the restriction in future.
>>
> Tom, what about table names? Isn't it part of the SQL spec to be able
> to set table names in languages other than English?

[shrug...]  So which language/encoding would you like to force everyone
to use?

The issue is not really whether you can create a database name that
looks like however you want.  The issues are (a) what it will look like
to someone else using a different encoding; and (b) how it will sort if
you ask for "select * from pg_database order by datname", relative to
someone else's database name that he thinks is in a different locale and
encoding than you think yours is.
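
A minimal Python sketch of points (a) and (b), for illustration only. The sample names and the choice of Latin-1, KOI8-R and UTF-8 are assumptions, not taken from any real catalog; the sort shown is a plain byte-by-byte comparison, which is only one possible ordering.

```python
# (a) One stored byte string, as differently configured clients would render it.
stored = "caf\u00e9".encode("latin-1")          # written by a Latin-1 client: b'caf\xe9'

as_latin1 = stored.decode("latin-1")             # the author's intended text
as_koi8r = stored.decode("koi8-r")               # byte 0xE9 is a Cyrillic letter in KOI8-R
as_utf8 = stored.decode("utf-8", errors="replace")  # a lone 0xE9 is not valid UTF-8

print(as_latin1, as_koi8r, as_utf8)

# (b) The same character written in two encodings, plus a third name,
# all sitting in one shared catalog and compared byte by byte.
names = [
    "\u00e9".encode("latin-1"),   # e-acute as Latin-1: b'\xe9'
    "\u00e9".encode("utf-8"),     # e-acute as UTF-8:   b'\xc3\xa9'
    "\u0434".encode("utf-8"),     # Cyrillic de, UTF-8: b'\xd0\xb4'
]
byte_order = sorted(names)
print(byte_order)
```

In this sketch the two spellings of the same letter end up on opposite sides of the Cyrillic name, which is the kind of surprise described above.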

AFAICT the Postgres user community is not ready to accept a "thou shalt
use Unicode" decree, so I don't think that mandating a one-size-fits-all
answer is going to fly.

            regards, tom lane

Re: Internationalization

From
Dennis Gearon
Date:
Tom Lane wrote:

> Dennis Gearon <gearond@fireserve.net> writes:
>
>>Tom Lane wrote:
>>
>>>The indexes on the shared system tables (eg, pg_database) are the only
>>>issue here.  One possible solution is to require that no locale-aware
>>>datatypes ever be used in these indexes.  I think right now this is true
>>>because "name" doesn't use locale-aware sorting; but we'd have to be
>>>careful not to break the restriction in future.
>>>
>>
>>Tom what about table names? Isn't it part of the SQL spec to be able
>>to set table names in languages other than English?
>
>
> [shrug...]  So which language/encoding would you like to force everyone
> to use?
>
> The issue is not really whether you can create a database name that
> looks like however you want.  The issues are (a) what it will look like
> to someone else using a different encoding; and (b) how it will sort if
> you ask for "select * from pg_database order by datname", relative to
> someone else's database name that he thinks is in a different locale and
> encoding than you think yours is.
>
> AFAICT the Postgres user community is not ready to accept a "thou shalt
> use Unicode" decree, so I don't think that mandating a one-size-fits-all
> answer is going to fly.
>
>             regards, tom lane
>
So for now, my database is set up as:

show all shows
------------------
server encoding SQL_ASCII

I didn't see anything that said what the LC_COLLATE and LC_CTYPE settings were when initdb was run.
How can I find that out?


in postgresql.conf
------------------
LC_MESSAGES = 'C'
LC_MONETARY = 'C'
LC_NUMERIC = 'C'
LC_TIME = 'C'

So what do I have?
    An 8-bit encoding with standard ASCII?
Which languages can I put in it?
Will it sort in standard ASCII order, with all non-English characters sorting last?
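
A small Python sketch of what byte-order sorting looks like, assuming LC_COLLATE is C (not yet confirmed at this point in the thread) and assuming the clients store Latin-1 bytes. The word list is made up for illustration. Python compares bytes objects one unsigned byte at a time, which mirrors the byte-wise comparison used under the C locale.

```python
# Hypothetical sample values, stored as Latin-1 bytes in a SQL_ASCII database.
words = ["zebra", "Zulu", "apple", "\u00e9clair", "\u00f1u"]
encoded = [w.encode("latin-1") for w in words]

# Byte-wise ordering: uppercase ASCII, then lowercase ASCII, then bytes >= 0x80.
c_order = [b.decode("latin-1") for b in sorted(encoded)]
print(c_order)
```

Under these assumptions the accented words do sort after every plain ASCII word, and "Zulu" sorts before "apple" because uppercase letters have lower byte values than lowercase ones.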



Re: Internationalization

From
Dennis Gearon
Date:
Tom Lane wrote:

> Dennis Gearon <gearond@fireserve.net> writes:
>
>>Is there any way for a single statement to access more than one database?
>>Could a query, regexes, etc be facing indexes in different
>>encodings/sorting collations if different databases in a cluster had
>>different encodings/collations?
>
>
> The indexes on the shared system tables (eg, pg_database) are the only
> issue here.  One possible solution is to require that no locale-aware
> datatypes ever be used in these indexes.  I think right now this is true
> because "name" doesn't use locale-aware sorting; but we'd have to be
> careful not to break the restriction in future.
>
>             regards, tom lane
>
Tom, what about table names? Isn't it part of the SQL spec to be able to set table names in languages other than
English?

----------------------

I've researched most of the databases out there that will tell you anything about how they have been internationalized.
By a vast majority, I have found them using UTF16 for ALL internals, in memory or CPU. This does double the memory image
of most non-Oriental-language applications. But memory is cheap, and the desktop/Intel server market is just about to
go to 64 bit and use much more memory.

Based on my research, all characters for most human languages can be represented in one 2-byte, 16-bit char via
UTF16. I am going to do some more research on that.

PROBABLY, most of them use UTF16 on the disk as well. Since most slow processes are IO bound, using an 8-bit text
datatype WHEN possible, and converting on the fly, might be a good way to keep some speed while truly making an
ANSI-spec, international database. I'm probably all wet though.
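
As a quick check on the size claims, here is a Python sketch that compares character count, UTF-8 byte count and UTF-16 byte count for a few sample strings. The samples are arbitrary choices for illustration. It shows both the doubling for ASCII text and one caveat: characters outside the Basic Multilingual Plane need two 16-bit units in UTF-16.

```python
# Illustrative samples only; any other strings would do.
samples = {
    "ascii": "database",
    "latin": "na\u00efve",
    "cjk": "\u6570\u636e\u5e93",   # three characters inside the BMP
    "astral": "\U0001D11E",        # MUSICAL SYMBOL G CLEF, outside the BMP
}

def sizes(text):
    """Return (characters, UTF-8 bytes, UTF-16 bytes without BOM)."""
    return len(text), len(text.encode("utf-8")), len(text.encode("utf-16-le"))

results = {label: sizes(text) for label, text in samples.items()}
for label, triple in results.items():
    print(label, triple)
```

The last sample is a single character that occupies four bytes in UTF-16, so "one 16-bit char per character" holds for most everyday text but not for every character.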



Re: Internationalization

From
Tom Lane
Date:
Dennis Gearon <gearond@fireserve.net> writes:
> I didn't see anything that said what the LC_COLLATE and LC_CTYPE settings were when initdb was run.
> How can I find that out?

In 7.4 you can just SHOW 'em, but before that you have to use
pg_controldata to find it out.

> in postgresql.conf
> ------------------
> LC_MESSAGES = 'C'
> LC_MONETARY = 'C'
> LC_NUMERIC = 'C'
> LC_TIME = 'C'

Given that I'd bet you have collate/ctype as C too, but it's not
certain.
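
For the pre-7.4 route, a rough Python sketch of pulling the two settings out of pg_controldata output. The SAMPLE_OUTPUT text is a hand-written stand-in, not captured from a real run (real output has many more lines), and locale_settings is a hypothetical helper name.

```python
# Hand-written stand-in for the relevant lines of pg_controldata output.
SAMPLE_OUTPUT = """\
LC_COLLATE:                           C
LC_CTYPE:                             C
"""

def locale_settings(text):
    """Collect the LC_COLLATE and LC_CTYPE values from 'key: value' lines."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in ("LC_COLLATE", "LC_CTYPE"):
            settings[key.strip()] = value.strip()
    return settings

found = locale_settings(SAMPLE_OUTPUT)
print(found)
```

To use it on a real cluster, the text could come from running pg_controldata against the data directory, for example through subprocess.run with captured output.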

            regards, tom lane