Re: Unicode support - Mailing list pgsql-hackers

From - -
Subject Re: Unicode support
Date
Msg-id 1842a500904131411o416032f2sc150fd8421def620@mail.gmail.com
In response to Re: Unicode support  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Unicode support  (Gregory Stark <stark@enterprisedb.com>)
List pgsql-hackers
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Greg Stark <stark@enterprisedb.com> writes:
>> Is it really true that canonical encodings never contain any composed
>> characters in them? I thought there were some glyphs which could only
>> be represented by composed characters.
>
> AFAIK that's not true.  However, in my original comment I was thinking
> about UTF16 surrogates, which are something else entirely --- so I
> withdraw that.  I'm still dubious that it is our job to deal with
> non-normalized characters, though.

Like it or not, composed sequences are part of Unicode and they are very
much valid Unicode. They are not in violation of the standard, and this
has nothing to do with the encoding. There are also code points which
specify the direction of text (e.g. needed if you want to embed a Hebrew
quote in English text). Counting such a code point as a character seems
wrong.
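
To make the distinction concrete, here is a small Python sketch (just an
illustration, nothing PostgreSQL-specific):

    import unicodedata

    # "é" as one precomposed code point vs. base letter plus combining accent
    precomposed = "\u00e9"        # U+00E9 LATIN SMALL LETTER E WITH ACUTE
    decomposed = "\u0065\u0301"   # U+0065 'e' + U+0301 COMBINING ACUTE ACCENT

    print(precomposed == decomposed)          # False: distinct code point sequences
    print(len(precomposed), len(decomposed))  # 1 vs. 2 code points, same "character"
    print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True

    # A direction mark is valid Unicode but hardly a "character" to the user
    rlm = "\u200f"                # U+200F RIGHT-TO-LEFT MARK
    print(len("abc" + rlm))       # 4 code points, but the user sees 3 characters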

>> The original post seemed to be a contrived attempt to say "you should
>> use ICU".
>
> Indeed.  The OP should go read all the previous arguments about ICU
> in our archives.

Not at all. I was just making a suggestion. You may use any other
library or implement it yourself (I even said that in my original
post). www.unicode.org, the official website of the Unicode
Consortium, has a complete database of all Unicode characters which
can be used as a basis.

But even if you want to ignore the normalization/multiple code point
issue, point 2, the collation problem, still remains. And given that
even a database as crappy as MySQL supports Unicode collation, this
isn't something to be ignored, IMHO.
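
To illustrate what I mean by the collation problem, another small
Python sketch: plain code point order vs. locale-aware order (the exact
result depends on the locale data installed on the machine):

    import locale

    words = ["Zebra", "apple", "banana", "\u00e9clair"]   # "éclair"

    # Plain code point order: uppercase before lowercase, accented letters last
    print(sorted(words))          # ['Zebra', 'apple', 'banana', 'éclair']

    # Locale-aware collation (assumes the en_US.UTF-8 locale is installed)
    locale.setlocale(locale.LC_COLLATE, "en_US.UTF-8")
    print(sorted(words, key=locale.strxfrm))   # ['apple', 'banana', 'éclair', 'Zebra']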

Andrew Dunstan <andrew@dunslane.net> wrote:
>
>
> Tom Lane wrote:
>>
>> Andrew Dunstan <andrew@dunslane.net> writes:
>>
>>>
>>> This isn't about the number of bytes, but about whether we should
>>> count characters encoded as two or more combined code points as a
>>> single char or not.
>>>
>>
>> It's really about whether we should support non-canonical encodings.
>> AFAIK that's a hack to cope with implementations that are restricted
>> to UTF-16, and we should Just Say No.  Clients that are sending these
>> things converted to UTF-8 are in violation of the standard.
>>
>
> I don't believe that the standard forbids the use of combining chars at all.
> RFC 3629 says:
>
>  Security may also be impacted by a characteristic of several
>  character encodings, including UTF-8: the "same thing" (as far as a
>  user can tell) can be represented by several distinct character
>  sequences.  For instance, an e with acute accent can be represented
>  by the precomposed U+00E9 E ACUTE character or by the canonically
>  equivalent sequence U+0065 U+0301 (E + COMBINING ACUTE).  Even though
>  UTF-8 provides a single byte sequence for each character sequence,
>  the existence of multiple character sequences for "the same thing"
>  may have security consequences whenever string matching, indexing,
>  searching, sorting, regular expression matching and selection are
>  involved.  An example would be string matching of an identifier
>  appearing in a credential and in access control list entries.  This
>  issue is amenable to solutions based on Unicode Normalization Forms,
>  see [UAX15].
>

Exactly my point.
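
That is exactly the kind of problem a normalization step (the [UAX15]
remedy the RFC points to) solves: normalize both sides before
comparing. A small Python sketch, with a purely illustrative nfc_equal
helper:

    import unicodedata

    def nfc_equal(a, b):
        """Compare two strings after normalizing both to NFC (UAX #15)."""
        return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

    credential_id = "Jos\u00e9"      # precomposed é
    acl_entry = "Jose\u0301"         # 'e' + combining acute accent

    print(credential_id == acl_entry)           # False: raw comparison
    print(nfc_equal(credential_id, acl_entry))  # True: canonically equivalent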

Best Regards.

