Unicode support - Mailing list pgsql-hackers
Subject: Unicode support
Msg-id: 1842a500904131018y108611d0v61529d36b589afc0@mail.gmail.com
Responses: Re: Unicode support (x3)
List: pgsql-hackers
Hi. While PostgreSQL is a great database, it lacks some fundamental Unicode support. I want to present some points that have--to my knowledge--not been addressed so far. In the following text, it is assumed that the database and client encoding is UTF-8.

1) Functions like char_length() or length() do NOT return the number of characters (the manual says they do); instead they return the number of code points. To illustrate the difference, I want to quote a Wikipedia article (http://en.wikipedia.org/wiki/Unicode#Ready-made_versus_composite_characters):

"For example é can be represented in Unicode as U+0065 (Latin small letter e) followed by U+0301 (combining acute) but it can also be represented as the precomposed character U+00E9 (Latin small letter e with acute). So in many cases, users have many ways of encoding the same character. To deal with this, Unicode provides the mechanism of canonical equivalence."

Also, the character U+1EBF (ế), used in Vietnamese, has both an acute and a circumflex accent. An equivalent code point sequence is U+0065 (e) U+0302 (combining circumflex accent) U+0301 (combining acute accent).

So, in the first example, the code points U+0065 U+0301 are in fact _one_ character (and are valid Unicode). These two code points should be rendered as one character and be indistinguishable from U+00E9. char_length() and length() are agnostic of this (demonstrated below). Also, there are quite a few zero-width code points (like code points reserved for surrogate pairs, byte order marks, etc.) which are not characters at all. As described in 3), there also exist Unicode normalization algorithms which can decompose one character into multiple code points. In some Asian scripts, this may boost the number of code points to a much higher number than the number of actual characters. I guess a quick fix would be to change the documented definition of length() so that it is stated to return the number of code points in the case of a Unicode encoding.

2) PG has no support for the Unicode collation algorithm. Collation is offloaded to the OS, which makes this quite inflexible. This point is further described, in the case of the website last.fm, at http://russ.garrett.co.uk/tag/postgresql/ . That article also contains a link to a quick fix in the form of a third-party module. However, in my humble opinion, this functionality should be part of PG itself: the collation algorithm provided by Unicode allows many different locales to be specified. For example, you may want to sort text using the 'de_DE' locale in one query, and use the 'en_GB' locale in the next query (sketched below). This functionality may be needed in databases which store multilingual text.

There are also some characters or sequences of characters which are considered equivalent in some languages. As an example, the umlaut 'ä' is considered to be the same as the character sequence 'ae' in German. You might argue: okay, then just set your OS locale to German. But what happens if text from more languages than just German is stored in the database? There are multiple instances of such equivalent characters in many languages. Some languages, like traditional Spanish, treat 'ch' as one character which is sorted between 'c' and 'd'; German does not. Storing both German and Spanish text in a database requires flexible collation, but this cannot be done if the collation is done at the OS level. In the latter case, you have to set a locale at the beginning and stick with it till the end of time. Also, the functions upper() and lower() depend on the locale.
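To make the length() behaviour from 1) concrete (assuming a UTF-8 database, where chr(n) interprets its argument as a Unicode code point):

    -- 'é' as the single precomposed code point U+00E9:
    SELECT length(chr(233));              -- returns 1
    -- the same character as U+0065 plus combining U+0301:
    SELECT length(chr(101) || chr(769));  -- returns 2, although only one character is displayed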
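And to sketch the per-query collation from 2), a hypothetical per-expression COLLATE clause could look like this (the table name 'artists' and the collation names are made up for illustration):

    -- hypothetical: sort the same column under two different locales
    SELECT name FROM artists ORDER BY name COLLATE "de_DE";
    SELECT name FROM artists ORDER BY name COLLATE "en_GB";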
3) PG has no support for Unicode normalization.

This relates to the problem explained in 1). Because Unicode often has multiple code point sequences to represent the same character, it is necessary to bring Unicode strings into a "normalized" form in order to compare them (Unicode has four normalization forms, as explained at http://en.wikipedia.org/wiki/Unicode_normalization). If you just compare by code points, visually indistinguishable characters may not compare as equal if they have different code point sequences (for example, because they come from different sources, i.e. two users). I'd even go so far as to say that this might pose a security problem. Consider a security-relevant function which compares two strings and misses the instances where the code point sequence is not the same, but the character sequence is. The user is totally unaware of this, since the different code point sequences appear as the same character sequence to him/her, if the font and rendering are done correctly. A sketch follows below.
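A hypothetical normalize(text, form) function would make the difference from 3) visible (again assuming a UTF-8 database):

    -- the two encodings of 'é' do not compare as equal today:
    SELECT chr(101) || chr(769) = chr(233);                   -- false
    -- hypothetical: normalize both sides to NFC before comparing:
    SELECT normalize(chr(101) || chr(769), NFC) = chr(233);   -- would return true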
4) I don't know if it's easier to just implement a new type 'unicode' that deals with all the unique Unicode characteristics, or to extend the already existing functions that deal with 'text' types. But I think that just ignoring all these points is quite dangerous, considering the gain in popularity Unicode has experienced so far and the necessity to support more than just the Latin alphabet. If you consider implementing the points I explained above, the ICU library might be an option. It's open source, has an MIT license, works on at least the operating systems PG works on, and is mature. It covers all the functionality you need. I've used it so far and can't complain. Also, you might consider rolling your own Unicode implementation, in which case www.unicode.org is the way to go.

Any comments are welcome.