Thread: Collation and case mapping thoughts (long)

From: Peter Eisentraut
I have been doing some research about how to create new routines for
string collation and character case mapping that would allow us to break
out of the one-locale-per-process scheme.  I have found that the Unicode
standard provides a lot of useful specifications and data for this.  The
Unicode data can be mapped to the other character sets (so you don't
actually have to use Unicode), and it should also be future-proof, in case
one day the entire world uses Unicode.

I am looking for replacements for the following C functions:

isalpha(), isdigit(), etc. --> "character properties"
toupper(), tolower()       --> "case mapping"
strcoll(), strxfrm()       --> "string collation"

(Those should be all the cases relying on LC_CTYPE and LC_COLLATE that
we're interested in.)

What we basically need is an API that allows passing locale and character
encoding as parameters, so they can be used flexibly in a server that
might be using many locales and encodings.
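
To make the shape of that interface concrete, here is a rough sketch of
the sort of declarations I have in mind.  All of the names and signatures
below are made up purely for illustration; nothing like this exists yet:

/*
 * Interface sketch only -- the point is merely that locale and encoding
 * are passed explicitly instead of coming from the process-global
 * LC_CTYPE/LC_COLLATE setting.
 */
#include <stdbool.h>
#include <stddef.h>

typedef int pg_enc;             /* server character set, e.g. LATIN1, UTF8 */

/* character properties (replacing isalpha(), isdigit(), ...) */
bool    pg_char_isalpha(unsigned int ch, pg_enc encoding);
bool    pg_char_isdigit(unsigned int ch, pg_enc encoding);

/* case mapping (replacing toupper(), tolower()) */
unsigned int pg_char_toupper(unsigned int ch, pg_enc encoding,
                             const char *locale);
unsigned int pg_char_tolower(unsigned int ch, pg_enc encoding,
                             const char *locale);

/* string collation (replacing strcoll(), strxfrm()) */
int     pg_strcoll(const char *a, const char *b,
                   pg_enc encoding, const char *locale);
size_t  pg_strxfrm(char *dst, const char *src, size_t dstlen,
                   pg_enc encoding, const char *locale);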

(These musings do not cover how to actually implement per-column or
per-datum locales, but they represent prerequisite work.)

Character properties are easy to handle, because they aren't
locale-dependent at all.  (A letter is a letter and a digit is a digit any
way you look at it.)  The Unicode standard defines a character category
for each character which can be mapped to the POSIX categories (alpha,
digit, punct, blank, etc.).  (Some details on the exact mapping are a bit
fuzzy to me, because the POSIX standard is a bit vague on these points,
but that isn't a terribly hard problem to resolve.)

I imagine that for each encoding supported by the PostgreSQL server we
create a simple lookup array indexed by character code.  Those arrays can
be created from Unicode data and conversion mapping files using a bit of
Perl.
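
To give an idea of what such a generated table could look like, here is a
heavily abbreviated sketch for ISO 8859-1.  All of the names are invented,
and the real array would of course be written out in full by the Perl
script from the Unicode general categories and the conversion map:

#include <stdbool.h>

#define PG_ISALPHA  0x01
#define PG_ISDIGIT  0x02
#define PG_ISPUNCT  0x04
#define PG_ISBLANK  0x08

/* one entry per code point of ISO 8859-1; abbreviated here */
static const unsigned char latin1_char_props[256] = {
    [' ']  = PG_ISBLANK,
    ['\t'] = PG_ISBLANK,
    ['0']  = PG_ISDIGIT,        /* ... '1' through '9' likewise ... */
    ['A']  = PG_ISALPHA,        /* ... 'B'-'Z' and 'a'-'z' likewise ... */
    [0xE9] = PG_ISALPHA,        /* U+00E9, e with acute, category Ll */
    /* etc.; the full table would be machine-generated */
};

static inline bool
latin1_isalpha(unsigned char ch)
{
    return (latin1_char_props[ch] & PG_ISALPHA) != 0;
}

static inline bool
latin1_isdigit(unsigned char ch)
{
    return (latin1_char_props[ch] & PG_ISDIGIT) != 0;
}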

Case mapping is only minimally locale-dependent.  For most languages, the
correspondence between lower-case and upper-case letters is the same (even
if the language wouldn't normally use many of those letters).  The Unicode
standard only enumerates a handful of exceptions, which can easily be
hard-coded.  (The only exception we will really be interested in is the
mapping of the Turkish i and I.  The other exceptions mostly apply to
esoteric Unicode features concerning how accents are combined --
something we don't support yet, to my knowledge.)

Thus, we can create for each supported encoding a pair of functions
(tolower/toupper) that maps an input character to the corresponding
lower/upper-case character.  The function only needs to cover the Turkish
exception if all the involved characters are contained in the respective
character set (which they aren't, for example, in ISO 8859-1).  Again, we
can create these functions from Unicode data and conversion map files
using a bit of Perl.
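
As a sketch, here is roughly what the generated function could look like
for ISO 8859-9 (Latin-5), which does contain the Turkish dotted/dotless i
pair: 0xDD is capital I with dot above, 0xFD is small dotless i.  For
ISO 8859-1 the Turkish branch would simply be left out.  The names are
again made up:

#include <stdbool.h>

static unsigned char
latin5_tolower(unsigned char ch, bool turkish_locale)
{
    if (ch == 0xDD)
        return 'i';             /* capital I with dot above -> i */
    if (turkish_locale && ch == 'I')
        return 0xFD;            /* Turkish: I -> small dotless i */
    if (ch >= 'A' && ch <= 'Z')
        return (unsigned char) (ch + 0x20);
    if (ch >= 0xC0 && ch <= 0xDE && ch != 0xD7)
        return (unsigned char) (ch + 0x20);     /* accented letters */
    return ch;
}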

I've already created prototypes for the mentioned character property and
case mapping tables.  The only thing that remains to be done is figuring
out a reasonable space/time tradeoff.

The Unicode standard also defines a collation algorithm.  The collation
algorithm essentially converts a string (of Unicode characters) into a
sequence of numbers which can be compared with, say, memcmp -- much like
strxfrm() does.  The assignment of numbers for characters is the tricky
part.  The Unicode standard defines a default collation order, which is a
nice compromise but not appropriate for all languages.  Thus, making up
various linguistically correct collation tables is the laborious part of
this proposal.
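
In use, that would look roughly like the following: compute a binary sort
key once per string and compare the keys with memcmp.  pg_strxfrm here is
the same hypothetical routine as in the interface sketch above:

#include <string.h>

typedef int pg_enc;

/* hypothetical routine from the interface sketch above */
size_t  pg_strxfrm(char *dst, const char *src, size_t dstlen,
                   pg_enc encoding, const char *locale);

static int
compare_with_sortkeys(const char *a, const char *b,
                      pg_enc encoding, const char *locale)
{
    char    key_a[256];
    char    key_b[256];
    size_t  len_a = pg_strxfrm(key_a, a, sizeof(key_a), encoding, locale);
    size_t  len_b = pg_strxfrm(key_b, b, sizeof(key_b), encoding, locale);
    size_t  minlen = (len_a < len_b) ? len_a : len_b;
    int     cmp = memcmp(key_a, key_b, minlen);

    if (cmp != 0)
        return cmp;
    /* equal prefix: the shorter key sorts first */
    return (len_a > len_b) - (len_a < len_b);
}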

There are a couple of possible implementation approaches for the collation
algorithm:

We can follow the earlier ideas and preprocess collation tables for each
combination of language and character set.  Considering that my system
knows 69 different languages and PostgreSQL supports 28 server-side
character sets, this would give almost two thousand combinations
(69 x 28 = 1932).

Or we could pick out the combinations that are actually distinct and
deemed to be useful.  (For example, Finnish collation with a character set
typically used for Vietnamese does not seem useful, although it's
perfectly possible.)  I can't really judge what volume of data and work
this would give us.

Finally, we could transcode a given character string to Unicode on the fly
before computing the collation transformation.  This would simply mean
two transformations (the transcoding plus the collation transformation we
need anyway) instead of one, and it would keep
everything quite manageable since we'd only need one routine to do it all
and only one set of collation tables.  (Note that the collation problem
does not require round trip transcoding.  We only need conversion *to*
Unicode, which should always be possible for those characters that matter
in collation.)
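
A sketch of how that third approach might be wired up; both helper
routines are placeholders I made up, and note that only conversion *to*
Unicode code points appears anywhere:

#include <stddef.h>

typedef unsigned int pg_wchar;  /* one Unicode code point */

/* hypothetical: convert src from the given server encoding to code points */
size_t  pg_encoding_to_unicode(pg_wchar *dst, size_t dstlen,
                               const char *src, int encoding);

/*
 * hypothetical: Unicode collation algorithm over code points,
 * using one set of collation tables per locale
 */
size_t  pg_unicode_sortkey(unsigned char *dst, size_t dstlen,
                           const pg_wchar *src, size_t srclen,
                           const char *locale);

static size_t
make_sortkey(unsigned char *dst, size_t dstlen,
             const char *src, int encoding, const char *locale)
{
    pg_wchar    codepoints[256];
    size_t      n;

    /* step 1: transcode to Unicode (no round trip needed) */
    n = pg_encoding_to_unicode(codepoints, 256, src, encoding);

    /* step 2: the one encoding-independent collation transformation */
    return pg_unicode_sortkey(dst, dstlen, codepoints, n, locale);
}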

So, any thoughts on these ideas?

-- 
Peter Eisentraut   peter_e@gmx.net