Locale implementation questions (was: Proof of concept COLLATE support with patch) - Mailing list pgsql-hackers

From Martijn van Oosterhout
Subject Locale implementation questions (was: Proof of concept COLLATE support with patch)
Msg-id 20050903203434.GA4281@svana.org
In response to Re: Proof of concept COLLATE support with patch  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Locale implementation questions (was: Proof of concept COLLATE support with patch)
Re: Locale implementation questions
List pgsql-hackers
On Fri, Sep 02, 2005 at 11:42:21AM -0400, Tom Lane wrote:
> The objection is fundamentally that a platform-specific implementation
> cannot be our long-term goal, and so expending effort on creating one
> seems like a diversion.  If there were a plan put forward showing how
> this is just a useful way-station, and we could see how we'd later get
> rid of the glibc dependency without throwing away the work already done,
> then it would be a different story.

Well, my patch showed that useful locale work can be achieved with
precisely two functions: newlocale and strxfrm_l.
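
As a minimal sketch of how those two calls fit together (this is not the
code from the patch; the helper name compare_in_locale and its ok flag
are made up for illustration): newlocale() loads a locale object without
touching the process-wide setlocale() state, and strxfrm_l() turns each
string into a sort key that can be compared byte-wise.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes newlocale/strxfrm_l on older glibc */
#endif
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#ifdef __APPLE__
#include <xlocale.h>            /* Mac OS X declares the *_l functions here */
#endif

/*
 * Hypothetical helper (not from the patch): compare a and b under the
 * collation rules of the named locale without calling setlocale().
 * Returns <0, 0 or >0 like strcmp(). *ok is set to 0 if the locale
 * cannot be loaded or memory runs out, and to 1 otherwise.
 */
static int
compare_in_locale(const char *locname, const char *a, const char *b, int *ok)
{
    locale_t    loc;
    size_t      la, lb;
    char       *ka, *kb;
    int         result = 0;

    *ok = 0;
    loc = newlocale(LC_COLLATE_MASK, locname, (locale_t) 0);
    if (loc == (locale_t) 0)
        return 0;

    /* With a zero-length buffer strxfrm_l only reports the key length. */
    la = strxfrm_l(NULL, a, 0, loc) + 1;
    lb = strxfrm_l(NULL, b, 0, loc) + 1;
    ka = malloc(la);
    kb = malloc(lb);
    if (ka != NULL && kb != NULL)
    {
        /* Sort keys compare correctly with a plain byte-wise strcmp(). */
        strxfrm_l(ka, a, la, loc);
        strxfrm_l(kb, b, lb, loc);
        result = strcmp(ka, kb);
        *ok = 1;
    }
    free(ka);
    free(kb);
    freelocale(loc);
    return result;
}
```

Calling compare_in_locale("C", "abc", "abd", &ok) orders the strings
byte-wise; passing a name such as "en_US.UTF-8" would apply that
locale's rules, provided the platform has it installed.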

I'm going to talk about two things: one, the code from Apple. Two, how
we present locale support to users.
---
Now, it would be really nice to take Apple's implementation in Darwin
and use that. What I don't understand is the licence of the code in
Darwin. My interpretation is that stuff in:

http://darwinsource.opendarwin.org/10.4.2/Libc-391/locale/

is Apple stuff under APSL, useless to us. And that stuff in:

http://darwinsource.opendarwin.org/10.4.2/Libc-391/locale/FreeBSD/

are just patches to FreeBSD and thus under the normal BSD license (no
big header claiming the licence change). The good news is that the
majority of what we need is in patch form. The bad news is that the hub
of the good stuff (newlocale, duplocale, freelocale) is under a big fat
APSL licence.

Does anyone know if this code can be used at all by BSD projects or did
they blanket relicence everything?
---
Now, I want to bring up some points relating to including a locale
library in PostgreSQL. Given that none of the BSDs seem really
interested in fixing the issue we'll have to do it ourselves (I don't
see anyone else doing it). We can save ourselves effort by basing it on
FreeBSD's locale code, because then we can use their datafiles, which we
*definitely* don't want to maintain ourselves. Now:

1. FreeBSD's locale list is short, some 48 compared with glibc's 217.
Hopefully Apple can expand on that in a way we can use. But given the
difference we should probably give people a way of falling back to the
system libraries in case there's a locale we don't support.

On the other hand, lots of locales are similar so maybe people can find
ones close enough to work. No, glibc and FreeBSD use different file
formats, so you can't copy them.

Do we want this locale data just for collation, or do we want to be
able to use it for formatting monetary amounts too? This is even more
info to store. Lots of languages use ISO/IEC 14651 for order.

2. Locale data needs to be combined with a charset and compiled to work
with the library. PostgreSQL supports at least 15 charsets but we don't
want to ship compiled versions of all of these (Debian learnt that the
hard way). So, how do we generate the files people need?
 a. Auto-compile on demand. First time a locale is referenced spawn
    the compiler to create the locale, then continue. (Ugh)
 b. Add a CREATE LOCALE english AS 'en_US' WITH CHARSET 'utf8'. Then
    require the COLLATE clause to refer to this identifier. This has some
    appeal, separating the system names from the PostgreSQL names. It also
    gives some info regarding charsets.
 c. Should users be allowed to define new locales?
 d. Should admins be required to create the external files using a
    program, say pg_createlocale?

Remember, if you use a latin1 locale to sort utf8 you'll get the wrong
result, so we want to avoid that.
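
To illustrate why the mismatch goes wrong, here is a small sketch (the
function names are hypothetical, and the input is assumed to be valid
UTF-8): a single-byte locale such as latin1 treats every byte as one
character, so a two-byte UTF-8 sequence like 0xC3 0xA9 (e-acute) is
collated as two unrelated latin1 characters.

```c
#include <stddef.h>
#include <string.h>

/* Characters as seen by a single-byte (e.g. latin1) locale: one per byte. */
static size_t
chars_as_latin1(const char *s)
{
    return strlen(s);
}

/*
 * Characters as seen by a UTF-8 aware locale: count every byte that is
 * not a continuation byte (bit pattern 10xxxxxx). Assumes valid UTF-8.
 */
static size_t
chars_as_utf8(const char *s)
{
    size_t      n = 0;

    for (; *s != '\0'; s++)
    {
        if (((unsigned char) *s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the UTF-8 bytes C3 A9 74 C3 A9 the first function reports five
characters and the second reports three, so the two locales are not even
comparing the same units.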

3. Compiled locale files are large. One UTF-8 locale datafile can
exceed a megabyte. Do we want the option of disabling it for small
systems?

4. Do we want the option of running system locale in parallel with the
internal ones?

5. I think we're going to have to deal with the very real possibility
that our locale database will not be as good as some of the system
provided ones. The question is how. This is quite unlike timezones
which are quite standardized and rarely change. That database is quite
well maintained.

Would people object to a configure option that selected:

 --with-locales=internal     (use pg database)
 --with-locales=system       (use system database for win32, glibc or MacOS X)
 --with-locales=none         (what we support now, which is neither)

I don't think it will be much of an issue to support this, all the
functions take the same parameters and have almost the same names.
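
A rough sketch of how such a switch might look on the C side. Everything
here is an assumption made for illustration: the macros
PG_LOCALES_SYSTEM and PG_LOCALES_INTERNAL, the pg_ wrapper names, and
the header pg_locale_internal.h with its pg_internal_* functions do not
exist anywhere. With neither macro defined the wrappers fall back to
plain strxfrm() under the process-wide locale.

```c
#include <stddef.h>
#include <string.h>

#if defined(PG_LOCALES_SYSTEM)
/* Use the platform's per-locale functions directly. */
#include <locale.h>
#ifdef __APPLE__
#include <xlocale.h>
#endif
typedef locale_t pg_locale_t;
#define pg_newlocale(name)          newlocale(LC_COLLATE_MASK, (name), (locale_t) 0)
#define pg_strxfrm(d, s, n, loc)    strxfrm_l((d), (s), (n), (loc))
#define pg_freelocale(loc)          freelocale(loc)

#elif defined(PG_LOCALES_INTERNAL)
/* Hypothetical bundled library with the same call shapes. */
#include "pg_locale_internal.h"
typedef pg_internal_locale_t pg_locale_t;
#define pg_newlocale(name)          pg_internal_newlocale(name)
#define pg_strxfrm(d, s, n, loc)    pg_internal_strxfrm_l((d), (s), (n), (loc))
#define pg_freelocale(loc)          pg_internal_freelocale(loc)

#else
/* "none": no per-locale objects, only the single process-wide locale. */
typedef int pg_locale_t;
#define pg_newlocale(name)          ((void) (name), 0)
#define pg_strxfrm(d, s, n, loc)    strxfrm((d), (s), (n))
#define pg_freelocale(loc)          ((void) (loc))
#endif
```

Callers would only ever use the pg_ names, so the choice made at
configure time stays out of the sorting code.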

6. Locales for SQL_ASCII. Seems to me you have two options, either
reject COLLATE altogether unless they specify a charset, or don't care
and let the user shoot themselves in the foot if they wish...

BTW, this MacOS locale support seems to be new for 10.4.2 according to
the CVS log info, can anyone confirm this?

Anyway, I hope this post didn't bore too much. Locale support has been
one of those things that has bugged me for a long time and it would be
nice if there could be some real movement.

Have a nice weekend,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
