Thread: Using multi-locale support in glibc

Using multi-locale support in glibc

From
Martijn van Oosterhout
Date:
While browsing the glibc locale code I noticed that glibc actually
allows you to pass the collation order to strcoll and friends. The
feature is, however, marked with:
  Attention: all these functions are *not* standardized in any form.  This is a proof-of-concept implementation.

They do however work fine. I used my taggedtypes module to create a
type that binds the collation order to the text strings and the results
can be seen below.
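
For readers who want to try the per-call collation interface directly from C, here is a minimal sketch. It assumes a glibc system where newlocale, strcoll_l and freelocale are declared in <locale.h> and <string.h>; which names a given glibc version exposes is an assumption, not something taken from the taggedtypes module. The helper name compare_in_locale is hypothetical.

```c
#define _GNU_SOURCE   /* assumption: needed on some glibc versions to expose the *_l functions */
#include <assert.h>
#include <locale.h>
#include <string.h>

/*
 * Hypothetical helper: compare a and b using the collation rules of
 * locale_name without touching the process-wide locale.
 * Returns 0 on success and stores the strcoll-style result in *result;
 * returns -1 if the library does not support the requested locale.
 */
int compare_in_locale(const char *locale_name, const char *a, const char *b,
                      int *result)
{
    assert(result != NULL);

    /* Build a locale object that only carries the collation category. */
    locale_t loc = newlocale(LC_COLLATE_MASK, locale_name, (locale_t) 0);
    if (loc == (locale_t) 0)
        return -1;              /* e.g. locale not installed */

    *result = strcoll_l(a, b, loc);

    freelocale(loc);
    return 0;
}
```

Calling compare_in_locale("C", "Test2", "test1", &r) should succeed with a negative r, matching the 'C' ordering in the test output, while a locale that is not installed should give -1, which is the case the "not supported by library" error reports.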

1. Is something supported by glibc usable for us (re portability to
non-glibc platforms)?

2. Should we be trying to use an interface that's specifically marked
as unstable?

3. What's the plan to support multiple collate orders? There was a
message about it last year but I don't see much progress.

4. It makes some things more difficult. For example, my database is
UNICODE, and until I specified a UTF-8 locale the results didn't come
out right. AFAIK the only easy way to determine whether a locale is
UTF-8 compatible is to use locale -k charmap; the C interface is
hidden. It should be possible to compile a list of locales and allow
only the ones matching the database encoding. Or we could convert the
strings automatically, since the conversion functions exist.

5. Maybe we should evaluate the interface and give feedback to the
glibc developers to see if it can be made more stable.
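
On point 4, one candidate for querying a locale's character set from C is nl_langinfo_l with the CODESET item. The sketch below assumes a glibc system that provides nl_langinfo_l in <langinfo.h>; whether this is available or reliable on other platforms is not addressed here. The helper name locale_codeset is hypothetical.

```c
#define _GNU_SOURCE   /* assumption: needed on some glibc versions to expose nl_langinfo_l */
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/*
 * Hypothetical helper: copy the character set name of locale_name
 * (what "locale -k charmap" prints) into buf.
 * Returns 0 on success, -1 if the locale is unavailable or buf is too small.
 */
int locale_codeset(const char *locale_name, char *buf, size_t buflen)
{
    assert(buf != NULL && buflen > 0);

    /* Only the character-type category determines the codeset. */
    locale_t loc = newlocale(LC_CTYPE_MASK, locale_name, (locale_t) 0);
    if (loc == (locale_t) 0)
        return -1;

    const char *codeset = nl_langinfo_l(CODESET, loc);
    size_t len = strlen(codeset);
    int rc = -1;

    /* Copy before freeing the locale: the string belongs to the locale object. */
    if (len < buflen) {
        memcpy(buf, codeset, len + 1);
        rc = 0;
    }

    freelocale(loc);
    return rc;
}
```

A caller could compare the returned name against the database encoding (for example "UTF-8") before allowing a locale to be used for sorting.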

If you want to have a look to see what's available, use:
rgrep -3 locale_t /usr/include/ |less

Have a nice day,

PS. The code to test this can be found at:
http://svana.org/kleptog/pgsql/taggedtypes.html

--- TEST OUTPUT ---

test=# select strings from taggedtypes.locale_test order by locale_text( strings, 'C' );
 strings
---------
 Test2
 Tést1
 Tëst1
 test1
 tèst2
(5 rows)

test=# select strings from taggedtypes.locale_test order by locale_text( strings, 'en_US' );
 strings
---------
 Tëst1
 Tést1
 tèst2
 test1
 Test2
(5 rows)

test=# select strings from taggedtypes.locale_test order by locale_text( strings, 'nl_NL' );
ERROR:  Locale 'nl_NL' not supported by library
test=# select strings from taggedtypes.locale_test order by locale_text( strings, 'en_AU.UTF-8' );
 strings
---------
 test1
 Tést1
 Tëst1
 Test2
 tèst2
(5 rows)
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Re: Using multi-locale support in glibc

From
Tom Lane
Date:
Martijn van Oosterhout <kleptog@svana.org> writes:
> 1. Is something supported by glibc usable for us (re portability to
> non-glibc platforms)?

Nope.  Sorry.
        regards, tom lane


Re: Using multi-locale support in glibc

From
Martijn van Oosterhout
Date:
On Thu, Sep 01, 2005 at 01:46:00PM -0400, Tom Lane wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
> > 1. Is something supported by glibc usable for us (re portability to
> > non-glibc platforms)?
>
> Nope.  Sorry.

Do we have some platforms that don't have any multi-language support? I
mean, we don't have a complete thread library but a wrapper around the
ones used on the platform. Couldn't we make a similar wrapper that used
glibc if it was available, windows native if it's available, etc...

That way we conform to the platform rather than a version of the
unicode collating set that postgresql happens to ship with it.

For example, Windows doesn't use the standard Unicode sorting rules. Do
we care if people come complaining that postgresql sorts differently
from their app?

Re: Using multi-locale support in glibc

From
Tom Lane
Date:
Martijn van Oosterhout <kleptog@svana.org> writes:
> Do we have some platforms that don't have any multi-language support? I
> mean, we don't have a complete thread library but a wrapper around the
> ones used on the platform. Couldn't we make a similar wrapper that used
> glibc if it was available, windows native if it's available, etc...

> That way we conform to the platform rather than a version of the
> unicode collating set that postgresql happens to ship with it.

That seems likely to be the worst of all possible worlds :-(.  As to
the first point, our problem with the standard locale support is that
(a) it doesn't conveniently/cheaply support use of multiple locales per
program, and (b) it fails to expose (portably) information that we need
such as the character set assumed by a locale setting.  A wrapper around
that might hide the convenience problem, but not the performance problem
and definitely not the hidden-information problem.  As to the second
point, our experience with similar issues in the timezone library says
that platform-dependent behavior is the last thing we want.
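
To make point (a) concrete, here is a sketch of what comparing under a second locale looks like with only the standard setlocale/strcoll interface: save the current setting, switch, compare, switch back. This is an illustration under the assumption of a plain ISO C library, not code from PostgreSQL; the helper name strcoll_with_setlocale is hypothetical.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: compare a and b under locale_name by temporarily
 * switching the process-wide LC_COLLATE setting.
 * Returns 0 on success, -1 if the locale cannot be selected or memory runs out.
 */
int strcoll_with_setlocale(const char *locale_name, const char *a,
                           const char *b, int *result)
{
    assert(result != NULL);

    /* The returned string may be overwritten by the next setlocale call,
     * so it has to be copied before switching. */
    const char *current = setlocale(LC_COLLATE, NULL);
    if (current == NULL)
        return -1;
    char *saved = malloc(strlen(current) + 1);
    if (saved == NULL)
        return -1;
    strcpy(saved, current);

    if (setlocale(LC_COLLATE, locale_name) == NULL) {
        free(saved);
        return -1;              /* locale not available */
    }

    *result = strcoll(a, b);

    /* Restore the previous setting for the rest of the program. */
    setlocale(LC_COLLATE, saved);
    free(saved);
    return 0;
}
```

Every comparison pays for two setlocale calls and a string copy, and the switch affects the whole process while it is in effect, which is the convenience and performance problem described above.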

I think we're going to end up doing just what we did with timezones,
ie, create our own library --- hopefully based on someone else's work
rather than rolled from scratch, but we'll feel free to whack the API
around until we like it.  No one's quite had the stomach to do that
yet though ... in part I suppose we're hoping a good library will drop
into our laps.

(The reason thread support is a poor analogy is that we don't actually
care about threads; we only support them to the extent the platform
wants us to.  The requirements for locale and timezones are driven in
the other direction, ie, we need more than most platforms are willing
to give.)
        regards, tom lane