A rough roadmap for internationalization fixes

From: Peter Eisentraut

OK, I've been spreading rumours about fixing the internationalization
problems, so let me make it a bit more clear.  Here are the problems that
need to be fixed:

- Only one locale per process possible.

- Only one gettext-language per process possible.

- lc_collate and lc_ctype need to be held fixed in the entire cluster.

- Gettext relies on iconv character set conversion, which relies on
  lc_ctype, which leads to a complete screw-up in the server because of
  the previous item.

- Locale fixed per cluster, but encoding fixed per database; the two are unaware of each other and don't get along.

- No support for upper/lower with multibyte encoding.

- Implementation of Unicode horribly incomplete.

These are all dependent on each other and sort of flow into each other.

Here is a proposed ordering of steps toward improving the situation:

1. Take out the character set conversion routines from the backend and
make them a library of their own.  This could possibly be modelled after
iconv, but not necessarily.  Or we might conclude that we can just use
iconv in the first place (see the sketch after this list).

2. Reimplement gettext to use 1. and allow switching of language and
encoding at run-time.

3. Implement Unicode collation algorithm and character classification
routines that are aware of 1.  Use that in place of system locale
routines.

4. Allow choice of locale per database.  (This should be fairly easy after
3.)

5. Allow choice of locale per column and implement collation coercion
according to SQL standard.
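
To make step 1 concrete, here is roughly what the iconv flavour of that
interface looks like -- a minimal sketch using plain POSIX iconv, nothing
PostgreSQL-specific; our own library could mirror these three calls:

    #include <iconv.h>
    #include <stdio.h>
    #include <string.h>

    int
    main(void)
    {
        /* Open a converter from ISO 8859-1 to UTF-8. */
        iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
        char    in[] = "caf\xe9";           /* "café" in ISO 8859-1 */
        char    out[16];
        char   *inp = in, *outp = out;
        size_t  inleft = strlen(in), outleft = sizeof(out);

        if (cd == (iconv_t) -1)
            return 1;
        /* iconv() advances the pointers and decrements the counters. */
        if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t) -1)
            perror("iconv");
        else
            printf("converted to %d UTF-8 bytes\n",
                   (int) (sizeof(out) - outleft));
        iconv_close(cd);
        return 0;
    }

Whether we wrap this, reimplement it, or expose the same shape under our
own names is exactly the decision in step 1.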

This could easily take a long time, but I feel that even if we have to
stop after 2., 3., or 4. at feature freeze, we'd be a lot farther.

Comments?  Anything else that needs fixing?

-- 
Peter Eisentraut   peter_e@gmx.net



Re: A rough roadmap for internationalization fixes

From: Tatsuo Ishii

> OK, I've been spreading rumours about fixing the internationalization
> problems, so let me make it a bit more clear.  Here are the problems that
> need to be fixed:
> 
> - Only one locale per process possible.
> 
> - Only one gettext-language per process possible.
> 
> - lc_collate and lc_ctype need to be held fixed in the entire cluster.
> 
> - Gettext relies on iconv character set conversion, which relies on
>   lc_ctype, which leads to a complete screw-up in the server because of
>   the previous item.
> 
> - Locale fixed per cluster, but encoding fixed per database; the two are
>   unaware of each other and don't get along.
> 
> - No support for upper/lower with multibyte encoding.
> 
> - Implementation of Unicode horribly incomplete.
> 
> These are all dependent on each other and sort of flow into each other.
> 
> Here is a proposed ordering of steps toward improving the situation:
> 
> 1. Take out the character set conversion routines from the backend and
> make them a library of their own.  This could possibly be modelled after
> iconv, but not necessarily.  Or we might conclude that we can just use
> iconv in the first place.

How do you handle user-defined conversions?

> 2. Reimplement gettext to use 1. and allow switching of language and
> encoding at run-time.
> 
> 3. Implement Unicode collation algorithm and character classification
> routines that are aware of 1.  Use that in place of system locale
> routines.

I don't see the relationship between Unicode and whatever you are going
to replace the system locale routines with. If you are heading in the
direction of a "Unicode central" implementation, I will object.

> 4. Allow choice of locale per database.  (This should be fairly easy after
> 3.)
> 
> 5. Allow choice of locale per column and implement collation coercion
> according to SQL standard.
> 
> This could easily take a long time, but I feel that even if we have to
> stop after 2., 3., or 4. at feature freeze, we'd be a lot farther.
> 
> Comments?  Anything else that needs fixing?
> 
> -- 
> Peter Eisentraut   peter_e@gmx.net


Re: A rough roadmap for internationalization fixes

From: Dennis Bjorklund

On Mon, 24 Nov 2003, Peter Eisentraut wrote:

> 1. Take out the character set conversion routines from the backend and
> make them a library of their own.  This could possibly be modelled after
> iconv, but not necessarily.  Or we might conclude that we can just use
> iconv in the first place.
> 
> 2. Reimplement gettext to use 1. and allow switching of language and
> encoding at run-time.

Force all translations to be in unicode and convert to other client
encodings if needed. There is no need to support translations stored using
different encodings.
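
GNU gettext already has a hook for exactly that: bind_textdomain_codeset()
recodes messages at lookup time, whatever encoding the .mo file was
compiled in. A minimal sketch (the domain name and path here are just
placeholders):

    #include <libintl.h>
    #include <locale.h>
    #include <stdio.h>

    int
    main(void)
    {
        setlocale(LC_ALL, "");
        bindtextdomain("postgres", "/usr/share/locale");  /* placeholder */
        /* Catalogs can all be UTF-8; deliver them in another encoding. */
        bind_textdomain_codeset("postgres", "ISO-8859-1");
        textdomain("postgres");
        printf("%s\n", gettext("could not open file"));
        return 0;
    }

The missing piece for the server is switching the codeset (and language)
per session rather than per process.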

> 3. Implement Unicode collation algorithm and character classification
> routines that are aware of 1.  Use that in place of system locale
> routines.

Couldn't we use some library that already has this, like glib (or
something else)? If it's not up to what we need, then fix that library
instead.

--
/Dennis



Re: A rough roadmap for internationalization fixes

From: Peter Eisentraut

Tatsuo Ishii writes:

> > 3. Implement Unicode collation algorithm and character classification
> > routines that are aware of 1.  Use that in place of system locale
> > routines.
>
> I don't see the relationship between Unicode and whatever you are going
> to replace the system locale routines with. If you are heading in the
> direction of a "Unicode central" implementation, I will object.

The Unicode collation algorithm works for any character set, not only for
Unicode.  It just happens to be published by the Unicode consortium.  So
basically this is just a concrete alternative to making up our own out of
thin air.  Also, the Unicode collation algorithm gives us the flexibility
to define customizations of collations that users frequently want, such as
ignoring or not ignoring punctuation.

Actually, what will more likely happen is that we'll define a collation as
a collection of one or more support functions, the equivalents of
strxfrm() and possibly a few more.  Then it will be up to those functions
to define the collation order.  The server will provide utility functions
that will facilitate implementing a collation order that follows the
Unicode collation algorithm, but you could just as well implement one
using memcmp() or whatever you like.
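
Something like the following shape, say -- the names here are invented,
nothing is settled API; the point is only that a memcmp() provider and a
UCA provider plug into the same slots:

    #include <stddef.h>
    #include <string.h>

    /* A collation is a bundle of support functions. */
    typedef struct Collation
    {
        const char *name;
        /* strxfrm() equivalent: emit a binary-comparable sort key */
        size_t      (*sortkey)(const char *s, size_t len,
                               char *buf, size_t bufsize);
        /* strcoll() equivalent: direct comparison */
        int         (*compare)(const char *a, size_t alen,
                               const char *b, size_t blen);
    } Collation;

    /* The trivial provider: plain byte order. */
    static int
    bytewise_cmp(const char *a, size_t alen, const char *b, size_t blen)
    {
        size_t  n = (alen < blen) ? alen : blen;
        int     r = memcmp(a, b, n);

        if (r != 0)
            return r;
        return (alen > blen) - (alen < blen);
    }

    static const Collation c_collation = {"C", NULL, bytewise_cmp};

A UCA-based provider would plug a sortkey function built from the server's
Unicode tables into the same structure; the memcmp() one costs nothing.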

-- 
Peter Eisentraut   peter_e@gmx.net



Re: A rough roadmap for internationalization fixes

From: Peter Eisentraut

Dennis Bjorklund writes:

> Force all translations to be in unicode and convert to other client
> encodings if needed. There is no need to support translations stored using
> different encodings.

Tell that to the Japanese.

> Couldn't we use some library that already has this, like glib (or
> something else)? If it's not up to what we need, then fix that library
> instead.

I wasn't aware that glib had this.  I'll look.

-- 
Peter Eisentraut   peter_e@gmx.net



Re: A rough roadmap for internationalization fixes

From: "Zeugswetter Andreas SB SD"

Have you looked at what is available from
http://oss.software.ibm.com/icu/ ?

Seems it has a compatible license, but uses some C++.
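
The C++ is largely avoidable, for what it's worth: ICU also ships a plain
C API for collation. A minimal sketch (these are real ICU entry points,
but error handling is omitted and the fixed buffers are only for
illustration):

    #include <unicode/ucol.h>
    #include <unicode/ustring.h>

    /* Compare two UTF-8 strings under a named locale, strcoll()-style. */
    static int
    icu_compare(const char *locale, const char *s1, const char *s2)
    {
        UErrorCode  status = U_ZERO_ERROR;
        UChar       u1[256], u2[256];
        UCollator  *coll = ucol_open(locale, &status);
        int         result;

        u_strFromUTF8(u1, 256, NULL, s1, -1, &status);
        u_strFromUTF8(u2, 256, NULL, s2, -1, &status);
        result = (int) ucol_strcoll(coll, u1, -1, u2, -1);
        ucol_close(coll);
        return result;                  /* <0, 0, >0 */
    }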

Andreas


Re: A rough roadmap for internationalization fixes

From: Dennis Bjorklund

On Tue, 25 Nov 2003, Peter Eisentraut wrote:

> > Force all translations to be in unicode and convert to other client
> > encodings if needed. There is no need to support translations stored using
> > different encodings.
> 
> Tell that to the Japanese.

I've always thought Unicode was enough even to represent Japanese. Then
the client encoding can be something else that we can convert to. In any
case, the encoding of the message catalog has to be known to the system so
it can be converted to the correct encoding for the client.

> > Couldn't we use some library that already has this, like glib (or
> > something else)? If it's not up to what we need, then fix that library
> > instead.
> 
> I wasn't aware that glib had this.  I'll look.

And I don't really know what demands pg has, but glib has a lot of support
functions for UTF-8. At least we should take a look at it.

-- 
/Dennis



Re: A rough roadmap for internationalization fixes

From: Tatsuo Ishii

> On Tue, 25 Nov 2003, Peter Eisentraut wrote:
> 
> > > Force all translations to be in unicode and convert to other client
> > > encodings if needed. There is no need to support translations stored using
> > > different encodings.
> > 
> > Tell that to the Japanese.
> 
> > I've always thought Unicode was enough even to represent Japanese. Then
> > the client encoding can be something else that we can convert to. In any
> > case, the encoding of the message catalog has to be known to the system so
> > it can be converted to the correct encoding for the client.

I'm tired of repeating that Unicode is not that perfect. Another gotcha
with Unicode is that the UTF-8 encoding (which we currently use) consumes
3 bytes for each Kanji character, while other encodings consume only 2
bytes. IMO a 3/2 storage ratio cannot be neglected for database use.
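
For a concrete instance: the Kanji 日 is U+65E5, which UTF-8 encodes in
three bytes, where EUC-JP or Shift JIS needs two. A tiny check:

    #include <stdio.h>
    #include <string.h>

    int
    main(void)
    {
        const char *kanji = "\xe6\x97\xa5";     /* U+65E5 in UTF-8 */

        printf("%d bytes in UTF-8\n", (int) strlen(kanji)); /* prints 3 */
        /* The same character takes 2 bytes in EUC-JP or Shift JIS,
         * hence the 3/2 ratio for Kanji-heavy text. */
        return 0;
    }
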
--
Tatsuo Ishii


Re: A rough roadmap for internationalization fixes

From: Dennis Bjorklund

On Tue, 25 Nov 2003, Tatsuo Ishii wrote:

> I'm tired of repeating that Unicode is not that perfect. Another gotcha
> with Unicode is that the UTF-8 encoding (which we currently use) consumes
> 3 bytes for each Kanji character, while other encodings consume only 2
> bytes. IMO a 3/2 storage ratio cannot be neglected for database use.

I'm aware of how UTF-8 works, and I was talking about the message
catalogs. It does not affect what you store in the database in any way.

-- 
/Dennis



Re: A rough roadmap for internationalization fixes

From: Dennis Bjorklund

On Tue, 25 Nov 2003, Tatsuo Ishii wrote:

> I'm tired of repeating that Unicode is not that perfect. Another gotcha
> with Unicode is that the UTF-8 encoding (which we currently use) consumes
> 3 bytes for each Kanji character, while other encodings consume only 2
> bytes. IMO a 3/2 storage ratio cannot be neglected for database use.

The rest of the world seems to have settled on Unicode as the way to handle
different languages in the UI of programs. For example, GNOME supports
nothing but Unicode. How is that handled in your country? I know that you
are tired of people who don't understand how difficult it is for you, but
I really would like to know. Is GNOME not used over there because of this?

About storing data in the database, I would expect it to work with any
encoding, just like I would expect pg to be able to store images in any
format.

I'll try not to mention Unicode near you in the future :-)

-- 
/Dennis



Re: A rough roadmap for internationalization fixes

From: Tom Lane

Peter Eisentraut <peter_e@gmx.net> writes:
> Dennis Bjorklund writes:
>> Couldn't we use some library that already has this, like glib (or
>> something else)? If it's not up to what we need, then fix that library
>> instead.

> I wasn't aware that glib had this.  I'll look.

Of course the trouble with relying on glibc is that we'd have no solution
for platforms that don't use glibc.

It might be okay to rely on glibc for a first-cut implementation,
realizing that we couldn't do everything at once anyway.
        regards, tom lane


Re: A rough roadmap for internationalization fixes

From: Tom Lane

Peter Eisentraut <peter_e@gmx.net> writes:
> Actually, what will more likely happen is that we'll define a collation as
> a collection of one or more support functions, the equivalents of
> strxfrm() and possibly a few more.  Then it will be up to those functions
> to define the collation order.  The server will provide utility functions
> that will facilitate implementing a collation order that follows the
> Unicode collation algorithm, but you could just as well implement one
> using memcmp() or whatever you like.

That sounds like a good plan to me.  Personally I'd want a
memcmp()-based collation implementation available, so that people who
don't care about sorting anything beyond 7-bit ASCII don't need to pay
a lot of overhead.

We have seen over and over that strcoll() is depressingly slow in some
locales (at least on some platforms).  Do you have any feeling for the
real-world performance of the Unicode algorithm?
        regards, tom lane


Re: A rough roadmap for internationalization fixes

From: Doug McNaught

Tom Lane <tgl@sss.pgh.pa.us> writes:

> Peter Eisentraut <peter_e@gmx.net> writes:
> 
> > I wasn't aware that glib had this.  I'll look.
> 
> Of course the trouble with relying on glibc is that we'd have no solution
> for platforms that don't use glibc.

glib != glibc.  glib is the low-level library used by GTK and GNOME
for basic data structures, character handling, etc.  It's LGPL AFAIK,
which would seem to rule out direct use from a licensing perspective.

-Doug


Re: A rough roadmap for internationalization fixes

From: Hannu Krosing

Dennis Bjorklund wrote on Tue, 25.11.2003 at 14:51:
> On Tue, 25 Nov 2003, Tatsuo Ishii wrote:
> 
> > I'm tired of repeating that Unicode is not that perfect.

Of course not, but neither is the current multibyte implementation, with
only marginal Unicode support (many people actually need upper()/lower()).

> > Another gotcha with Unicode is that the UTF-8 encoding (which we
> > currently use) consumes 3 bytes for each Kanji character, while other
> > encodings consume only 2 bytes.

I think that for *storage* we should use SCSU (the Standard Compression
Scheme for Unicode).

> > IMO a 3/2 storage ratio cannot be neglected for database use.

SCSU should solve that (actually it should use less than 2 bytes per
character for encoding text in any single language).

> The rest of the world seems to have settled on Unicode as the way to handle
> different languages in the UI of programs. For example, GNOME supports
> nothing but Unicode. How is that handled in your country? I know that you
> are tired of people who don't understand how difficult it is for you, but
> I really would like to know. Is GNOME not used over there because of this?
> 
> About storing data in the database, I would expect it to work with any
> encoding, just like I would expect pg to be able to store images in any
> format.
> 
> I'll try not to mention Unicode near you in the future :-)

---------------
Hannu


Re: A rough roadmap for internationalization fixes

From: Greg Stark

Peter Eisentraut <peter_e@gmx.net> writes:

> 2. Reimplement gettext to use 1. and allow switching of language and
> encoding at run-time.
> 
> 3. Implement Unicode collation algorithm and character classification
> routines that are aware of 1.  Use that in place of system locale
> routines.

This sounds like you want to completely reimplement all of the locale handling
provided by the OS? That seems like a dead-end approach to me. There's no way
your handling will ever be as complete or as well optimized as some OS's.

Better to find ways to use the OS gettext and locale handling on platforms
that provide good interfaces. On platforms that don't provide good interfaces
either don't support the features or use some third party library to provide
a good implementation.

The only thing you really need in the database is a second parameter on all
the collation functions like strxfrm(col,locale) etc. Then functional indexes
take care of almost everything.
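
There is no standard strxfrm() that takes a locale argument, so today such
a function would have to be faked with a setlocale() dance -- a sketch of
the idea only, and the function name is made up:

    #include <locale.h>
    #include <stdlib.h>
    #include <string.h>

    /* Transform str so that memcmp() on the results sorts like
     * strcoll() under the named locale.  Returns malloc'd key or NULL. */
    static char *
    strxfrm_loc(const char *str, const char *locname)
    {
        char   *saved = strdup(setlocale(LC_COLLATE, NULL));
        char   *key = NULL;

        if (saved && setlocale(LC_COLLATE, locname) != NULL)
        {
            size_t  n = strxfrm(NULL, str, 0) + 1;

            key = malloc(n);
            if (key)
                strxfrm(key, str, n);
            setlocale(LC_COLLATE, saved);   /* restore previous locale */
        }
        free(saved);
        return key;
    }

A functional index over such keys then gives locale-aware ORDER BY without
any per-cluster setting. (This is neither thread-safe nor cheap, though,
which is itself an argument for real per-collation support functions.)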

The only advantage to adding locales per-column and/or per-index is the
notational simplicity. Queries could do simple standard expressions and not
have to worry about calling strxfrm or other locale-specific functions all the
time. I'm not sure it's worth the complexity of having to deal with 
"WHERE x>y" where x and y are in different locales though.


-- 
greg



Re: A rough roadmap for internationalization fixes

From: Kurt Roeckx

On Tue, Nov 25, 2003 at 08:40:57PM +0900, Tatsuo Ishii wrote:
> > On Tue, 25 Nov 2003, Peter Eisentraut wrote:
> > 
> > I've always thought Unicode was enough even to represent Japanese. Then
> > the client encoding can be something else that we can convert to. In any
> > case, the encoding of the message catalog has to be known to the system so
> > it can be converted to the correct encoding for the client.
> 
> I'm tired of repeating that Unicode is not that perfect.

Maybe it should be explained what the problems really are,
instead of saying it "isn't perfect"?

From what I understand, the only real problem is converting from a
"legacy" encoding to Unicode and back again; there is no problem if you
stop doing the conversion.

The conversion problem exists because what a legacy encoding represents
as a single character can correspond to several different characters in
Unicode, which makes the round trip ambiguous.

Some examples people might understand are:
- µ: in ISO 8859-1 it's 0xB5; in Unicode it can be U+00B5 (MICRO SIGN)
  or U+03BC (GREEK SMALL LETTER MU)
- Å: in ISO 8859-1 it's 0xC5; in Unicode it can be U+00C5 (LATIN CAPITAL
  LETTER A WITH RING ABOVE) or U+212B (ANGSTROM SIGN)
- the ohm sign vs. the Greek letter omega
- quotation marks: there are left and right double quotes and a few others

> Another gotcha with Unicode is that the UTF-8 encoding (which we
> currently use) consumes 3 bytes for each Kanji character, while other
> encodings consume only 2 bytes. IMO a 3/2 storage ratio cannot be
> neglected for database use.

You can encode unicode in different ways, and UTF-8 is only one
of them.  Is there a problem with using UCS-2 (except that it
would require more storage for ASCII)?


Kurt



Re: A rough roadmap for internationalization fixes

From: Peter Eisentraut

Greg Stark writes:

> This sounds like you want to completely reimplement all of the locale handling
> provided by the OS? That seems like a dead-end approach to me. There's no way
> your handling will ever be as complete or as well optimized as some OS's.

Actually, I'm pretty sure it will be more complete.  About the
optimization, we'll have to see.

> Better to find ways to use the OS gettext and locale handling on platforms
> that provide good interfaces.

There are no such platforms to my knowledge.  The exception is some
version of glibc that provides undocumented interfaces to functionality
that is rumoured to do something that may or may not be relevant to what
we're doing.

> On platforms that don't provide good interfaces either don't support the
> features or use some third party library to provide a good
> implementation.

There are no such libraries.  I keep hearing ICU, but that is much too
bloated.

-- 
Peter Eisentraut   peter_e@gmx.net



Re: A rough roadmap for internationalization fixes

From: Hannu Krosing

Peter Eisentraut wrote on Tue, 25.11.2003 at 21:13:
> Greg Stark writes:
> 
> > This sounds like you want to completely reimplement all of the locale handling
> > provided by the OS? That seems like a dead-end approach to me. There's no way
> > your handling will ever be as complete or as well optimized as some OS's.
> 
> Actually, I'm pretty sure it will be more complete.  About the
> optimization, we'll have to see.
> 
> > Better to find ways to use the OS gettext and locale handling on platforms
> > that provide good interfaces.
> 
> There are no such platforms to my knowledge. 

Unless you consider ICU (http://oss.software.ibm.com/icu/) as a
"platform" ;)

We will hardly ever be more complete than it.

> There are no such libraries.  I keep hearing ICU, but that is much too
> bloated.

At least it is kind of "standard" and something that will be maintained
for the foreseeable future; it also has a compatible license and is
available on all platforms of interest to PostgreSQL.

And I am not sure that this "bloat" will affect us too much unless we
want to start maintaining a parallel copy - glibc is much more bloated
than ICU.

But if you insist on rolling your own library, you can always use ICU to
write regression tests to compare yours against ...

-------------
Hannu



Re: A rough roadmap for internationalization fixes

From: Tom Lane

Kurt Roeckx <Q@ping.be> writes:
> You can encode unicode in different ways, and UTF-8 is only one
> of them.  Is there a problem with using UCS-2 (except that it
> would require more storage for ASCII)?

UCS-2 is impractical without some *extremely* wide-ranging changes in
the backend.  To take just the most obvious point, doesn't it require
allowing embedded zero bytes in text strings?
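
For illustration: plain ASCII "A" is 0x00 0x41 in big-endian UCS-2, so
every strlen()-style routine in the backend stops before the first
character even starts:

    #include <stdio.h>
    #include <string.h>

    int
    main(void)
    {
        /* "AB" in big-endian UCS-2: every other byte is zero. */
        const char ucs2[] = {0x00, 0x41, 0x00, 0x42, 0x00, 0x00};

        printf("strlen() sees %d of 4 bytes\n", (int) strlen(ucs2)); /* 0 */
        return 0;
    }
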
        regards, tom lane


Re: A rough roadmap for internationalization fixes

From: Christopher Kings-Lynne

> About storing data in the database, I would expect it to work with any
> encoding, just like I would expect pg to be able to store images in any
> format.

What's stopping us from supporting the other Unicode encodings, e.g.
UTF-16, which could save Japanese storage space?

Chris




Re: A rough roadmap for internationalization fixes

From: Tom Lane

Greg Stark <gsstark@mit.edu> writes:
> The only advantage to adding locales per-column and/or per-index is the
> notational simplicity.

Well, actually, the reason we are interested in doing it is that the SQL
spec demands it.
        regards, tom lane


Re: A rough roadmap for internationalization fixes

From: "Zeugswetter Andreas SB SD"

> > There are no such libraries.  I keep hearing ICU, but that is much too
> > bloated.
>
> At least it is kind of "standard" and something that will be maintained
> for the foreseeable future; it also has a compatible license and is
> available on all platforms of interest to PostgreSQL.

And it is used for DB/2 and Informix, so it must be quite feature-complete
for DB-relevant stuff.

Andreas


Re: A rough roadmap for internationalization fixes

From: Kurt Roeckx

On Tue, Nov 25, 2003 at 04:19:05PM -0500, Tom Lane wrote:
> 
> UCS-2 is impractical without some *extremely* wide-ranging changes in
> the backend.  To take just the most obvious point, doesn't it require
> allowing embedded zero bytes in text strings?

If you're going to use Unicode in the rest of the backend, you'll have
to be able to deal with embedded zero bytes anyway.  You can't use the
normal C string functions.


Kurt