On 12/11/06, Alexander Staubo <alex@purefiction.net> wrote:
> On Dec 11, 2006, at 02:47 , Daniel van Ham Colchete wrote:
>
> > I never understood what the difference between the ASCII/ISO-8859-1/UTF-8
> > charsets means to a database. They're all simple C strings that don't
> > contain a zero byte in the middle (as UTF-16 would) and that don't
> > require any special processing unless you are doing a case-insensitive
> > search (then you would have a problem).
>
> That's not the whole story. UTF-8 and other variable-width encodings
> don't provide a 1:1 mapping of logical characters to single bytes; in
> particular, combining characters open the possibility of multiple
> different code point sequences representing the same logical character;
> therefore, string comparison in such encodings generally cannot be done
> at the byte level (unless, of course, you first ascertain that the
> strings involved are all normalized to an unambiguous canonical form).
>
> PostgreSQL's use of strings is not limited to string comparison.
> Substring extraction, concatenation, regular expression matching, up/
> downcasing, tokenization and so on are all part of PostgreSQL's small
> library of text manipulation functions, and all deal with logical
> characters, meaning they must be Unicode-aware.
>
> Alexander.
>
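To make the byte-vs-logical-character distinction concrete, here is a minimal Python sketch (mine, not from the original messages): "Straße" is six logical characters but seven UTF-8 bytes, so byte-counted lengths and byte-level slicing give wrong answers for substring-style operations.

```python
# "Straße" is 6 logical characters but 7 bytes in UTF-8,
# because the character "ß" encodes as two bytes.
s = "Straße"
encoded = s.encode("utf-8")

assert len(s) == 6        # logical characters (code points)
assert len(encoded) == 7  # UTF-8 bytes

# Slicing at a character boundary is fine...
assert encoded[:4].decode("utf-8") == "Stra"

# ...but a naive byte-level substring can cut a multi-byte
# sequence in half; encoded[:5] ends mid-"ß" and cannot decode.
try:
    encoded[:5].decode("utf-8")
except UnicodeDecodeError:
    pass  # expected: the slice is not valid UTF-8
```

The same trap applies to any operation that indexes by byte offset, which is why character-aware functions must decode first.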
You're right. I was thinking only of my own cases, which take
Unicode normalization for granted and don't use
regexp/tokenization/...
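For illustration (my sketch, not part of the original exchange), the normalization point Alexander raised: "é" can be stored either as one precomposed code point or as "e" plus a combining accent, and the two render identically yet differ byte for byte until normalized.

```python
import unicodedata

precomposed = "\u00e9"   # "é" as a single code point (U+00E9)
decomposed = "e\u0301"   # "e" followed by combining acute accent (U+0301)

# Canonically equivalent, but unequal as strings and as UTF-8 bytes:
assert precomposed != decomposed
assert precomposed.encode("utf-8") != decomposed.encode("utf-8")

# After normalizing both to NFC (or NFD) they compare equal:
assert (unicodedata.normalize("NFC", precomposed)
        == unicodedata.normalize("NFC", decomposed))
```

So byte-level comparison is only safe once every string has been normalized to the same form.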
Thanks
Best
Daniel