Thread: Unicode Normalization
Hackers,

I just had a discussion on IRC about Unicode normalization in PostgreSQL. Apparently there is no support for it currently. Andrew Gierth points out that it's part of the SQL spec to support it, though:

> RhodiumToad: e.g. NORMALIZE(foo,NFC,len)
> justatheory: Oh, just a function then, really.
> RhodiumToad: where the normal form can be any of NFC, NFD, NFKC, NFKD
> RhodiumToad: except that the normal form is an identifier, not a string
> RhodiumToad: also the normal form and length are optional
> RhodiumToad: so NORMALIZE(foo) is equivalent to NORMALIZE(foo,NFC)

I looked around and found the Public Software Group's utf8proc project, which even includes some PostgreSQL support (not, alas, for normalization). It has an MIT-licensed C library that offers these functions:

> uint8_t *utf8proc_NFD(uint8_t *str)
>
> Returns a pointer to newly allocated memory of an NFD normalized
> version of the null-terminated string str.
>
> uint8_t *utf8proc_NFC(uint8_t *str)
>
> Returns a pointer to newly allocated memory of an NFC normalized
> version of the null-terminated string str.
>
> uint8_t *utf8proc_NFKD(uint8_t *str)
>
> Returns a pointer to newly allocated memory of an NFKD normalized
> version of the null-terminated string str.
>
> uint8_t *utf8proc_NFKC(uint8_t *str)
>
> Returns a pointer to newly allocated memory of an NFKC normalized
> version of the null-terminated string str.

Anyone got any interest in porting these functions to PostgreSQL? I guess the parser would need to be updated to support the use of identifiers in the NORMALIZE() function, but otherwise it should be a fairly straightforward port for an experienced C coder, no?

Best,

David
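P.S. To make the porting idea a bit more concrete, here's a rough, untested sketch of what a V1 wrapper around utf8proc_NFC() might look like. The function name pg_normalize_nfc and the error handling are hypothetical, just for illustration; it's not an actual patch, and it only covers the plain-function case, not the spec's identifier syntax:

```c
/*
 * Rough, untested sketch of a PostgreSQL wrapper around utf8proc_NFC().
 * The function name is hypothetical; error handling is illustrative only.
 */
#include "postgres.h"
#include "fmgr.h"
#include "utils/builtins.h"

#include "utf8proc.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(pg_normalize_nfc);

Datum
pg_normalize_nfc(PG_FUNCTION_ARGS)
{
    text       *input = PG_GETARG_TEXT_PP(0);
    char       *cstr = text_to_cstring(input);
    uint8_t    *normalized;
    text       *result;

    /* utf8proc_NFC() returns newly malloc'd, NUL-terminated UTF-8. */
    normalized = utf8proc_NFC((uint8_t *) cstr);
    if (normalized == NULL)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("could not normalize string")));

    result = cstring_to_text((char *) normalized);
    free(normalized);
    pfree(cstr);

    PG_RETURN_TEXT_P(result);
}
```

You'd hook something like that up with a plain CREATE FUNCTION ... LANGUAGE C STRICT IMMUTABLE declaration; the real feature would of course also need the NORMALIZE(foo, NFC) identifier syntax from the spec, which is where the parser work comes in.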
On Sep 23, 2009, at 11:08 AM, David E. Wheeler wrote:

> I just had a discussion on IRC about Unicode normalization in
> PostgreSQL. Apparently there is no support for it currently.

BTW, the only reference I found on the [to do list](http://wiki.postgresql.org/wiki/Todo) was:

> More sensible support for Unicode combining characters, normal forms

I think that should probably be changed to refer to support for the Unicode standard's normalization forms.

Best,

David
On Sep 23, 2009, at 11:08 AM, David E. Wheeler wrote:

> I looked around and found the Public Software Group's utf8proc
> project, which even includes some PostgreSQL support (not, alas, for
> normalization). It has an MIT-licensed C library that offers these
> functions:

Sorry, forgot the link: http://www.public-software-group.org/utf8proc

Best,

David
On Sep 24, 2009, at 6:24 AM, pg@thetdh.com wrote:

> In a context using normalization, wouldn't you typically want to
> store a normalized-text type that could perhaps (depending on
> locale) take advantage of simpler, more-efficient comparison
> functions?

That might be nice, but I'd be wary of a geometric multiplication of text types. We already have TEXT and CITEXT; what if we had your NTEXT (normalized text) but I wanted it to also be case-insensitive?

> Whether you're doing INSERT/UPDATE, or importing a flat text file,
> if you canonicalize characters and substrings of identical meaning
> when trivial distinctions of encoding are irrelevant, you're better
> off later. User-invocable normalization functions by themselves
> don't make much sense.

Well, they make sense because there's nothing else right now. It's an easy way to get some support in, and besides, it's mandated by the SQL standard.

> (If Postgres now supports binary- or mixed-binary-and-text flat
> files, perhaps for restore purposes, the same thing applies.)

Don't follow this bit.

Best,

David
David E. Wheeler wrote:

> On Sep 24, 2009, at 6:24 AM, pg@thetdh.com wrote:
>
>> In a context using normalization, wouldn't you typically want to
>> store a normalized-text type that could perhaps (depending on locale)
>> take advantage of simpler, more-efficient comparison functions?
>
> That might be nice, but I'd be wary of a geometric multiplication of
> text types. We already have TEXT and CITEXT; what if we had your NTEXT
> (normalized text) but I wanted it to also be case-insensitive?

Actually, I don't think it's necessarily a good idea at all. If a user inputs a perfectly valid piece of UTF8 text, we should be able to give it back to them exactly, whether or not it's in normalized form. The normalized forms are useful for certain comparison purposes, but they don't affect the validity of the text. CITEXT doesn't mangle what is stored, just how it's compared.

cheers

andrew
On Sep 24, 2009, at 8:59 AM, Andrew Dunstan wrote:

>> That might be nice, but I'd be wary of a geometric multiplication
>> of text types. We already have TEXT and CITEXT; what if we had your
>> NTEXT (normalized text) but I wanted it to also be case-insensitive?
>
> Actually, I don't think it's necessarily a good idea at all. If a
> user inputs a perfectly valid piece of UTF8 text, we should be able
> to give it back to them exactly, whether or not it's in normalized
> form. The normalized forms are useful for certain comparison
> purposes, but they don't affect the validity of the text. CITEXT
> doesn't mangle what is stored, just how it's compared.

Right, I don't think there's a need for a normalized TEXT type.

Best,

David
In a context using normalization, wouldn't you typically want to store a normalized-text type that could perhaps (depending on locale) take advantage of simpler, more-efficient comparison functions? Whether you're doing INSERT/UPDATE, or importing a flat text file, if you canonicalize characters and substrings of identical meaning when trivial distinctions of encoding are irrelevant, you're better off later. User-invocable normalization functions by themselves don't make much sense. (If Postgres now supports binary- or mixed-binary-and-text flat files, perhaps for restore purposes, the same thing applies.)

David Hudson