Unicode support - Mailing list pgsql-hackers

Subject: Unicode support
Msg-id: 1842a500904131018y108611d0v61529d36b589afc0@mail.gmail.com
Responses: Re: Unicode support (Alvaro Herrera <alvherre@commandprompt.com>)
           Re: Unicode support (Peter Eisentraut <peter_e@gmx.net>)
           Re: Unicode support (Peter Eisentraut <peter_e@gmx.net>)
List: pgsql-hackers
Hi.

While PostgreSQL is a great database, it lacks some fundamental
Unicode support. I want to present some points that have, to my
knowledge, not been addressed so far. In the following text, it is
assumed that both the database encoding and the client encoding are
UTF-8.


1) Functions like char_length() or length() do NOT return the number
of characters (as the manual claims); instead, they return the number
of code points.

To illustrate the difference, I want to quote a Wikipedia article
(http://en.wikipedia.org/wiki/Unicode#Ready-made_versus_composite_characters):

"For example é can be represented in Unicode as U+0065 (Latin small
letter e) followed by U+0301 (combining acute) but it can also be
represented as the precomposed character U+00E9 (Latin small letter e
with acute). So in many cases, users have many ways of encoding the
same character. To deal with this, Unicode provides the mechanism of
canonical equivalence."

Also, the character U+1EBF (ế), used in Vietnamese, carries both an
acute and a circumflex accent. An equivalent code point sequence is
U+0065 (e) followed by U+0302 (combining circumflex accent) and
U+0301 (combining acute accent).

So, in the first example, the Unicode code points U+0065 U+0301 are
in fact _one_ character (and are valid Unicode). These two code points
should be rendered as one character and be indistinguishable from
U+00E9. char_length() and length() are oblivious to this. Also, there
are quite a few zero-width code points (such as the byte order mark),
as well as code points that are not characters at all (such as the
range reserved for surrogate pairs). As described in 3), there are
also Unicode normalization algorithms, which can decompose one
character into multiple code points. In some Asian scripts, this can
push the number of code points far above the number of actual
characters.
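
To see the mismatch concretely, here is a minimal sketch in Python 3,
used purely as an illustration because its len(), like PG's length(),
counts code points rather than user-perceived characters:

    # Python 3's len() counts code points, just like PG's length()
    # does under a UTF-8 encoding.
    composed   = "\u00e9"     # precomposed LATIN SMALL LETTER E WITH ACUTE
    decomposed = "e\u0301"    # 'e' followed by COMBINING ACUTE ACCENT

    print(composed, decomposed)    # renders as: é é
    print(len(composed))           # 1
    print(len(decomposed))         # 2 -- one character, two code points

    # The Vietnamese example from above:
    print(len("\u1ebf"))           # 1 (precomposed ế)
    print(len("e\u0302\u0301"))    # 3 (e + circumflex + acute)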

I guess a quick fix would be to change the documented definition of
length() so that, for Unicode encodings, it is explicitly specified
to return the number of code points.


2) PG has no support for the Unicode collation algorithm. Collation is
offloaded to the OS, which makes it quite inflexible. This point is
described further, using the website last.fm as an example, at
http://russ.garrett.co.uk/tag/postgresql/ . That article also links
to a quick fix in the form of a third-party module.

However, in my humble opinion, this functionality should be part of PG
itself: the collation algorithm provided by Unicode allows many
different locales to be specified. For example, you may want to sort
text using the 'de_DE' locale in one query and the 'en_GB' locale in
the next. This functionality is needed in databases that store
multilingual text. There are also characters or sequences of
characters that are considered equivalent in some languages; for
example, 'umlaut a' (ä) is treated as equivalent to the character
sequence 'ae' in German. You might argue: okay, then just set your OS
locale to German. But what happens if the database stores text in more
languages than just German? There are multiple instances of such
equivalent characters across languages, and sorting rules differ as
well: traditional Spanish treats 'ch' as one character that sorts
between 'c' and 'd'; German does not. Storing both German and Spanish
text in one database requires flexible collation, but that cannot be
done if collation happens at the OS level: in that case, you have to
set a locale at the beginning and stick with it till the end of time.
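
To illustrate what per-query locale selection could look like, here is
a minimal sketch using the ICU library through the third-party PyICU
bindings (this runs outside PG, which offers no comparable hook; the
locale pair was chosen because Swedish shows the effect with a single
letter, sorting 'ä' after 'z'):

    # Sketch: choosing a collation locale per sort via ICU, using the
    # third-party PyICU bindings (nothing PG itself provides today).
    import icu

    words = ["äpple", "banan", "zebra"]

    en = icu.Collator.createInstance(icu.Locale("en_GB"))
    sv = icu.Collator.createInstance(icu.Locale("sv_SE"))

    print(sorted(words, key=en.getSortKey))  # ['äpple', 'banan', 'zebra']
    print(sorted(words, key=sv.getSortKey))  # ['banan', 'zebra', 'äpple']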

Also, the functions upper() and lower() depend on the locale.
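
Turkish is the classic example: it uppercases 'i' to the dotted
capital 'İ' (U+0130), not to 'I'. A sketch, again using the PyICU
bindings as an assumed stand-in:

    # Sketch: locale-sensitive case mapping via ICU (PyICU bindings).
    # Turkish maps 'i' to the dotted capital 'İ', unlike English.
    import icu

    print(str(icu.UnicodeString("quit").toUpper(icu.Locale("en_GB"))))  # QUIT
    print(str(icu.UnicodeString("quit").toUpper(icu.Locale("tr_TR"))))  # QUİT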


3) PG has no support for Unicode normalization. This relates to the
problem explained in 1). Because Unicode often has multiple code point
sequences that represent the same character, it is necessary to bring
Unicode strings into a "normalized" form before comparing them
(Unicode defines four normalization forms, as explained at
http://en.wikipedia.org/wiki/Unicode_normalization). If you compare by
code points alone, visually indistinguishable characters may not
compare as equal whenever their code point sequences differ (for
example, because they come from different sources, say two different
users). I'd even go so far as to say that this might pose a security
problem. Consider a security-relevant function that compares two
strings and misses the cases where the code point sequences differ but
the character sequences are the same. The user is totally unaware of
this, since both code point sequences appear as the same character
sequence, provided the font and rendering are correct.
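
A minimal sketch of the safe pattern, with Python's standard
unicodedata module standing in for what PG could do internally:

    # Sketch: normalize both operands to a common form (NFC here)
    # before comparing, so equivalent sequences compare as equal.
    import unicodedata

    a = "\u00e9"    # precomposed é
    b = "e\u0301"   # 'e' + combining acute accent; renders identically

    print(a == b)                          # False -- raw code point compare
    print(unicodedata.normalize("NFC", a)
          == unicodedata.normalize("NFC", b))  # True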


4) I don't know whether it is easier to implement a new type 'unicode'
that deals with all of these Unicode characteristics, or to extend the
existing functions that operate on the 'text' type. But I think simply
ignoring these points is quite dangerous, considering the popularity
Unicode has gained and the necessity to support more than just the
Latin alphabet.

If you consider implementing the points explained above, the ICU
library might be an option. It is open source, MIT-licensed, mature,
and works on at least the operating systems PG works on. It covers all
the functionality needed here; I have used it so far and can't
complain. Alternatively, you might consider rolling your own Unicode
implementation, in which case www.unicode.org is the way to go.

Any comments are welcome.

