Collation and case mapping thoughts (long) - Mailing list pgsql-hackers
From: Peter Eisentraut
Subject: Collation and case mapping thoughts (long)
Date:
Msg-id: Pine.LNX.4.44.0211121913350.12428-100000@localhost.localdomain
List: pgsql-hackers
I have been doing some research about how to create new routines for string collation and character case mapping that would allow us to break out of the one-locale-per-process scheme. I have found that the Unicode standard provides a lot of useful specifications and data for this. The Unicode data can be mapped to other character sets (so you don't actually have to use Unicode), and it should also be future-proof, in case one day the entire world uses Unicode.

I am looking for replacements for the following C functions:

    isalpha(), isdigit(), etc.  -->  "character properties"
    toupper(), tolower()        -->  "case mapping"
    strcoll(), strxfrm()        -->  "string collation"

(Those should be all the cases relying on LC_CTYPE and LC_COLLATE that we're interested in.)

What we basically need is an API that allows passing locale and character encoding as parameters, so they can be used flexibly in a server that might be using many locales and encodings. (These musings do not cover how to actually implement per-column or per-datum locales, but they represent prerequisite work.)

Character properties are easy to handle, because they aren't locale-dependent at all. (A letter is a letter and a digit is a digit any way you look at it.) The Unicode standard defines a character category for each character, which can be mapped to the POSIX categories (alpha, digit, punct, blank, etc.). (Some details of the exact mapping are a bit fuzzy to me, because the POSIX standard is vague on these points, but that isn't a terribly hard problem to resolve.) I imagine that for each encoding supported by the PostgreSQL server we create a simple lookup array indexed by character code. Those arrays can be created from Unicode data and conversion mapping files using a bit of Perl.

Case mapping is only minimally locale-dependent. For most languages, the correspondence between lower-case and upper-case letters is the same (even if the language wouldn't normally use many of those letters). The Unicode standard enumerates only a handful of exceptions, which can easily be hard-coded. (The only exception we will really be interested in is the mapping of the Turkish i and I. The other exceptions mostly concern esoteric Unicode features about the way accents are combined -- something we don't support yet, to my knowledge.) Thus, we can create for each supported encoding a pair of functions (tolower/toupper) that map an input character to the corresponding lower/upper-case character. These functions only need to cover the Turkish exception if all the involved characters are contained in the respective character set (which they aren't, for example, in ISO 8859-1). Again, we can create these functions from Unicode data and conversion map files using a bit of Perl.

I've already created prototypes for the character property and case mapping tables mentioned above. The only thing that remains to be done is figuring out a reasonable space/time tradeoff.

The Unicode standard also defines a collation algorithm. The collation algorithm essentially converts a string (of Unicode characters) into a sequence of numbers which can be compared with, say, memcmp() -- much like strxfrm() does. The assignment of numbers to characters is the tricky part. The Unicode standard defines a default collation order, which is a nice compromise but not appropriate for all languages. Thus, making up the various linguistically correct collation tables is the laborious part of this proposal.
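To make the API idea above a bit more concrete, here is a rough sketch of what the signatures could look like. None of these names exist in the backend; they are placeholders invented for illustration, and pg_wchar is typedef'ed here only to stand in for the backend's wide-character type. The point is merely that locale and encoding become explicit parameters instead of process-global state.

    #include <stdbool.h>
    #include <stddef.h>

    typedef unsigned int pg_wchar;      /* stand-in for the backend type */

    typedef struct pg_locale_id
    {
        const char *lang;               /* e.g. "tr_TR" */
        int         encoding;           /* server encoding id */
    } pg_locale_id;

    /* character properties (would replace isalpha(), isdigit(), ...) */
    extern bool pg_char_isalpha(pg_wchar c, const pg_locale_id *loc);
    extern bool pg_char_isdigit(pg_wchar c, const pg_locale_id *loc);

    /* case mapping (would replace toupper(), tolower()) */
    extern pg_wchar pg_char_toupper(pg_wchar c, const pg_locale_id *loc);
    extern pg_wchar pg_char_tolower(pg_wchar c, const pg_locale_id *loc);

    /* string collation (would replace strcoll(), strxfrm()) */
    extern int pg_str_collate(const char *a, int alen,
                              const char *b, int blen,
                              const pg_locale_id *loc);
    extern size_t pg_str_xfrm(char *dst, size_t dstsize,
                              const char *src, int srclen,
                              const pg_locale_id *loc);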
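As an illustration of the per-encoding property arrays, here is roughly what the generated output for ISO 8859-1 could look like. Only the first 16 entries are shown; the flag names, table name and layout are invented for this sketch, and the real tables would be emitted by the Perl script from the Unicode data plus the conversion map.

    #define PG_ISALPHA 0x01
    #define PG_ISDIGIT 0x02
    #define PG_ISSPACE 0x04
    #define PG_ISPUNCT 0x08
    #define PG_ISUPPER 0x10
    #define PG_ISLOWER 0x20

    static const unsigned char latin1_props[256] = {
        /* 0x00-0x0F: control characters; 0x09-0x0D are white space */
        0, 0, 0, 0, 0, 0, 0, 0, 0,
        PG_ISSPACE, PG_ISSPACE, PG_ISSPACE, PG_ISSPACE, PG_ISSPACE,
        0, 0
        /* ... remaining 240 entries generated from UnicodeData.txt
           plus the ISO 8859-1 conversion map ... */
    };

    static int
    latin1_isalpha(unsigned char c)
    {
        return (latin1_props[c] & PG_ISALPHA) != 0;
    }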
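Similarly, a per-encoding case mapping function for ISO 8859-9 (which, unlike ISO 8859-1, contains both the dotless i at 0xFD and the dotted capital I at 0xDD) might be generated along these lines. Again, the names are only illustrative, and the bulk of the table would come from the Unicode case mapping data; only the Turkish special case is hard-coded.

    static unsigned char latin5_lower[256];     /* filled from Unicode data */

    static unsigned char
    latin5_tolower(unsigned char c, int turkish)
    {
        if (turkish)
        {
            if (c == 'I')
                return 0xFD;    /* LATIN SMALL LETTER DOTLESS I */
            if (c == 0xDD)      /* LATIN CAPITAL LETTER I WITH DOT ABOVE */
                return 'i';
        }
        return latin5_lower[c]; /* default mapping, same for all languages */
    }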
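To show what the collation transformation amounts to, here is a heavily reduced sketch: each character is mapped to a numeric weight from a (language-specific) table, and the weights are written out big-endian so that the resulting byte strings can be compared with plain memcmp(). The real Unicode collation algorithm uses several weight levels (primary/secondary/tertiary) and handles contractions and expansions; the toy weight function below is just a placeholder for the generated tables.

    #include <stddef.h>

    static unsigned short
    primary_weight(unsigned int codepoint)
    {
        /* in reality: a lookup in a generated, language-specific table */
        return (unsigned short) codepoint;
    }

    /* strxfrm()-style: returns the needed size, writes as much as fits */
    static size_t
    make_sort_key(unsigned char *dst, size_t dstsize,
                  const unsigned int *codepoints, size_t len)
    {
        size_t  out = 0;
        size_t  i;

        for (i = 0; i < len; i++)
        {
            unsigned short w = primary_weight(codepoints[i]);

            if (out + 2 <= dstsize)
            {
                dst[out] = (unsigned char) (w >> 8);        /* big-endian so */
                dst[out + 1] = (unsigned char) (w & 0xFF);  /* memcmp() works */
            }
            out += 2;
        }
        return out;
    }

Two transformed strings would then compare the way strxfrm() output does: memcmp() over the common length, with the shorter key sorting first on a tie.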
There are a couple of possible implementation approaches for the collation algorithm:

1. We can follow the earlier ideas and preprocess collation tables for each combination of language and character set. Considering that my system knows 69 different languages and PostgreSQL supports 28 server-side character sets, this would give far more than a thousand combinations.

2. Or we could pick out the combinations that are actually distinct and deemed to be useful. (For example, Finnish collation with a character set typically used for Vietnamese does not seem useful, although it's perfectly possible.) I can't really judge what volume of data and work this would give us.

3. Finally, we could transcode a given character string to Unicode on the fly before computing the collation transformation. This would simply require two transformations instead of the one we need anyway, and it would keep everything quite manageable, since we'd only need one routine to do it all and only one set of collation tables. (Note that the collation problem does not require round-trip transcoding. We only need conversion *to* Unicode, which should always be possible for those characters that matter in collation.)

So, any thoughts on these ideas?

-- 
Peter Eisentraut   peter_e@gmx.net