Thread: Unicode support
Hi. While PostgreSQL is a great database, it lacks some fundamental Unicode support. I want to present some points that have--to my knowledge--not been addressed so far. In the following text, it is assumed that the database and client encoding is UTF-8.

1) Functions like char_length() or length() do NOT return the number of characters (the manual says they do); instead they return the number of code points.

To illustrate the difference, I want to quote a Wikipedia article (http://en.wikipedia.org/wiki/Unicode#Ready-made_versus_composite_characters):

"For example é can be represented in Unicode as U+0065 (Latin small letter e) followed by U+0301 (combining acute) but it can also be represented as the precomposed character U+00E9 (Latin small letter e with acute). So in many cases, users have many ways of encoding the same character. To deal with this, Unicode provides the mechanism of canonical equivalence."

Also, the character U+1EBF (ế), used in Vietnamese, has both an acute and a circumflex accent. An equivalent code point sequence is U+0065 (e) U+0302 (circumflex accent) U+0301 (acute accent).

So, in the first example, the Unicode code points U+0065 U+0301 are in fact _one_ character (and are valid Unicode). These two code points should be rendered as one character and be indistinguishable from U+00E9. char_length() and length() are agnostic of this. Also, there are quite a few zero-width code points (like code points reserved for surrogate pairs, byte order marks, etc.) which are not characters at all. As described in 3), there also exist Unicode normalization algorithms, which can decompose one character into multiple code points. In some Asian scripts, this may boost the number of code points to a much higher number than the number of actual characters.

I guess a quick fix would be to change the documented definition of length() so that it is said to return the number of code points in the case of a Unicode encoding.

2) PG has no support for the Unicode collation algorithm. Collation is offloaded to the OS, which makes this quite inflexible.

This point is further described, in the case of the website last.fm, here: http://russ.garrett.co.uk/tag/postgresql/ . That article also contains a link to a quick fix in the form of a 3rd party module. However, in my humble opinion, this functionality should be part of PG itself: the collation algorithm provided by Unicode allows many different locales to be specified. For example, you may want to sort text using the 'de_DE' locale in one query, and use the 'en_GB' locale in the next query. This functionality may be needed in databases which store multilingual text.

There are also some characters or sequences of characters which are considered equivalent in some languages. As an example, 'Umlaut a' is considered to be the same as the character sequence 'ae' in German. You might argue: okay, then just set your OS locale to German. But what happens if text from more languages than just German is stored in the database? There are multiple instances of such equivalent characters in many languages. Some languages, like traditional Spanish, treat 'ch' as one character which is sorted between 'c' and 'd'; German does not. Storing both German and Spanish text in a database requires flexible collation, but this cannot be done if the collation is done at the OS level. In the latter case, you have to set a locale at the beginning and stick with it till the end of time. Also, the functions upper() and lower() depend on the locale.

3) PG has no support for Unicode normalization.
This relates to the problem explained in 1). Because Unicode often has multiple code point sequences to represent the same character, it is necessary to bring Unicode strings into a "normalized" form in order to compare them (Unicode has 4 normalization forms, as explained in http://en.wikipedia.org/wiki/Unicode_normalization). If you just compare by code points, visually indistinguishable characters may not compare as equal if they have a different code point sequence (for example, if they come from different sources, say two different users).

I'd even go so far as to say that this might pose a security problem. Consider a security-relevant function which compares two strings and misses the instances where the code point sequence is not the same but the character sequence is. The user is totally unaware of this, since the different code point sequences appear as the same character sequence to him/her, if the font and rendering are done correctly.

4) I don't know if it's easier to just implement a new type 'unicode' that deals with all the unique Unicode characteristics, or to extend the already existing functions that deal with 'text' types. But I think to just ignore all these points is quite dangerous, considering the gain in popularity Unicode has experienced so far and the necessity to support more than just the Latin alphabet.

If you consider implementing the points I explained above, the ICU library might be an option. It's open-source, has an MIT license, works on at least the operating systems PG works on, and is mature. It covers all the functionality you need. I've used it so far and can't complain. Also, you might consider rolling your own Unicode implementation, in which case www.unicode.org is the way to go.
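To make points 1) and 3) concrete: on a UTF-8 database I would expect something like the following (a sketch, not a transcript; both literals should render identically as 'é'):

SELECT U&'\00E9' = U&'\0065\0301'   AS equal,
       char_length(U&'\00E9')       AS composed_length,
       char_length(U&'\0065\0301')  AS decomposed_length;
 equal | composed_length | decomposed_length
-------+-----------------+-------------------
 f     |               1 |                 2
(1 row)

Any comments are welcome.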
- - wrote:
> 1) Functions like char_length() or length() do NOT return the number
> of characters (the manual says they do), instead they return the
> number of code points.

I think you have client_encoding misconfigured.

alvherre=# select length('á'::text);
 length
--------
      1
(1 fila)

alvherre=# select pg_column_size('á');
 pg_column_size
----------------
              3
(1 fila)

(there's one byte of overhead here, so á is two bytes)

-- 
Alvaro Herrera                  http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera <alvherre@commandprompt.com> wrote:
>> 1) Functions like char_length() or length() do NOT return the number
>> of characters (the manual says they do), instead they return the
>> number of code points.
>
> I think you have client_encoding misconfigured.
>
> alvherre=# select length('á'::text);
>  length
> --------
>       1
> (1 fila)

The OP didn't say it returned the number of bytes. Since you found that
this character was stored in only two bytes, it must have been one
two-byte code point. I think storing it as two code points would have
taken at least three bytes (one for the letter and two for the accent),
no?

-Kevin
Alvaro Herrera wrote:
> - - wrote:
>
>> 1) Functions like char_length() or length() do NOT return the number
>> of characters (the manual says they do), instead they return the
>> number of code points.
>
> I think you have client_encoding misconfigured.
>
> alvherre=# select length('á'::text);
>  length
> --------
>       1
> (1 fila)

Umm, but isn't that because your encoding is using one code point?

See the OP's explanation w.r.t. canonical equivalence.

This isn't about the number of bytes, but about whether or not we should
count characters encoded as two or more combined code points as a single
char or not.

cheers

andrew
Andrew Dunstan <andrew@dunslane.net> writes:
> This isn't about the number of bytes, but about whether or not we should
> count characters encoded as two or more combined code points as a single
> char or not.

It's really about whether we should support non-canonical encodings.
AFAIK that's a hack to cope with implementations that are restricted
to UTF-16, and we should Just Say No. Clients that are sending these
things converted to UTF-8 are in violation of the standard.

			regards, tom lane
On Mon, Apr 13, 2009 at 9:15 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> This isn't about the number of bytes, but about whether or not we should
>> count characters encoded as two or more combined code points as a single
>> char or not.
>
> It's really about whether we should support non-canonical encodings.
> AFAIK that's a hack to cope with implementations that are restricted
> to UTF-16, and we should Just Say No. Clients that are sending these
> things converted to UTF-8 are in violation of the standard.

Is it really true that canonical encodings never contain any composed
characters in them? I thought there were some glyphs which could only
be represented by composed characters. Also, users can construct
strings of Unicode code points themselves in SQL using || or other
text operators.

That said, my impression is that composed character support is pretty
thin on the ground elsewhere as well, but I don't have much first-hand
experience.

The original post seemed to be a contrived attempt to say "you should
use ICU". If composed character support were a show-stopper and there
was no other way to get it then it might be convincing, but I don't
know that it is and I don't know that ICU is the only place to get it.
And I'm sure it's not the only way to handle multiple encodings in a
database.

-- 
greg
Greg Stark <stark@enterprisedb.com> writes:
> Is it really true that canonical encodings never contain any composed
> characters in them? I thought there were some glyphs which could only
> be represented by composed characters.

AFAIK that's not true. However, in my original comment I was thinking
about UTF16 surrogates, which are something else entirely --- so I
withdraw that. I'm still dubious that it is our job to deal with
non-normalized characters, though.

> The original post seemed to be a contrived attempt to say "you should
> use ICU".

Indeed. The OP should go read all the previous arguments about ICU
in our archives.

			regards, tom lane
Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> This isn't about the number of bytes, but about whether or not we should
>> count characters encoded as two or more combined code points as a single
>> char or not.
>
> It's really about whether we should support non-canonical encodings.
> AFAIK that's a hack to cope with implementations that are restricted
> to UTF-16, and we should Just Say No. Clients that are sending these
> things converted to UTF-8 are in violation of the standard.

I don't believe that the standard forbids the use of combining chars at
all. RFC 3629 says:

   Security may also be impacted by a characteristic of several
   character encodings, including UTF-8: the "same thing" (as far as a
   user can tell) can be represented by several distinct character
   sequences. For instance, an e with acute accent can be represented
   by the precomposed U+00E9 E ACUTE character or by the canonically
   equivalent sequence U+0065 U+0301 (E + COMBINING ACUTE). Even though
   UTF-8 provides a single byte sequence for each character sequence,
   the existence of multiple character sequences for "the same thing"
   may have security consequences whenever string matching, indexing,
   searching, sorting, regular expression matching and selection are
   involved. An example would be string matching of an identifier
   appearing in a credential and in access control list entries. This
   issue is amenable to solutions based on Unicode Normalization Forms,
   see [UAX15].

cheers

andrew
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Greg Stark <stark@enterprisedb.com> writes:
>> Is it really true that canonical encodings never contain any composed
>> characters in them? I thought there were some glyphs which could only
>> be represented by composed characters.
>
> AFAIK that's not true. However, in my original comment I was thinking
> about UTF16 surrogates, which are something else entirely --- so I
> withdraw that. I'm still dubious that it is our job to deal with
> non-normalized characters, though.

Like it or not, they are part of Unicode and they are very much valid
Unicode. They are not in violation of the standard. This has nothing to
do with the encoding. There are also code points which specify the
direction of text (e.g. needed if you want to embed a Hebrew quote in
English text). To count such a code point as a character seems wrong.

>> The original post seemed to be a contrived attempt to say "you should
>> use ICU".
>
> Indeed. The OP should go read all the previous arguments about ICU
> in our archives.

Not at all. I was just making a suggestion. You may use any other
library or implement it yourself (I even said that in my original
post). www.unicode.org--the official website of the Unicode
consortium--has a complete database of all Unicode characters which can
be used as a basis.

But if you want to ignore the normalization/multiple code point issue,
point 2--the collation problem--still remains. And given that even a
crappy database like MySQL supports Unicode collation, this isn't
something to be ignored, IMHO.

Andrew Dunstan <andrew@dunslane.net> wrote:
> Tom Lane wrote:
>> Andrew Dunstan <andrew@dunslane.net> writes:
>>> This isn't about the number of bytes, but about whether or not we should
>>> count characters encoded as two or more combined code points as a single
>>> char or not.
>>
>> It's really about whether we should support non-canonical encodings.
>> AFAIK that's a hack to cope with implementations that are restricted
>> to UTF-16, and we should Just Say No. Clients that are sending these
>> things converted to UTF-8 are in violation of the standard.
>
> I don't believe that the standard forbids the use of combining chars at all.
> RFC 3629 says:
>
>    Security may also be impacted by a characteristic of several
>    character encodings, including UTF-8: the "same thing" (as far as a
>    user can tell) can be represented by several distinct character
>    sequences. For instance, an e with acute accent can be represented
>    by the precomposed U+00E9 E ACUTE character or by the canonically
>    equivalent sequence U+0065 U+0301 (E + COMBINING ACUTE). Even though
>    UTF-8 provides a single byte sequence for each character sequence,
>    the existence of multiple character sequences for "the same thing"
>    may have security consequences whenever string matching, indexing,
>    searching, sorting, regular expression matching and selection are
>    involved. An example would be string matching of an identifier
>    appearing in a credential and in access control list entries. This
>    issue is amenable to solutions based on Unicode Normalization Forms,
>    see [UAX15].

Exactly my point.

Best Regards.
- - <crossroads0000@googlemail.com> writes:
>>> The original post seemed to be a contrived attempt to say "you should
>>> use ICU".
>>
>> Indeed. The OP should go read all the previous arguments about ICU
>> in our archives.
>
> Not at all. I was just making a suggestion. You may use any other
> library or implement it yourself (I even said that in my original
> post). www.unicode.org--the official website of the Unicode
> consortium--has a complete database of all Unicode characters which
> can be used as a basis.
>
> But if you want to ignore the normalization/multiple code point issue,
> point 2--the collation problem--still remains. And given that even a
> crappy database like MySQL supports Unicode collation, this isn't
> something to be ignored, IMHO.

Sure, supporting multiple collations in a database is definitely a
known missing feature. There is a lot of work required to do it; a
patch to do so was too late to make it into 8.4 and required more work,
so hopefully the issues will be worked out for 8.5. I suggest you read
the old threads and make any contributions you can suggesting how to
solve the problems that arose.

>> I don't believe that the standard forbids the use of combining chars at all.
>> RFC 3629 says:
>>
>> ... This issue is amenable to solutions based on Unicode Normalization
>> Forms, see [UAX15].

This is the relevant part. Tom was claiming that the UTF8 encoding
required normalizing the string of unicode codepoints before encoding.
I'm not sure that's true though, is it?

-- 
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's PostGIS support!
>>>>> "Gregory" == Gregory Stark <stark@enterprisedb.com> writes:

>>> I don't believe that the standard forbids the use of combining
>>> chars at all. RFC 3629 says:
>>>
>>> ... This issue is amenable to solutions based on Unicode
>>> Normalization Forms, see [UAX15].

Gregory> This is the relevant part. Tom was claiming that the UTF8
Gregory> encoding required normalizing the string of unicode
Gregory> codepoints before encoding. I'm not sure that's true though,
Gregory> is it?

FWIW, the SQL spec puts the onus of normalization squarely on the
application; the database is allowed to assume that Unicode strings
are already normalized, is allowed to behave in implementation-defined
ways when presented with strings that aren't normalized, and provision
of normalization functions and predicates is just another optional
feature.

-- 
Andrew (irc:RhodiumToad)
On Monday 13 April 2009 22:39:58 Andrew Dunstan wrote:
> Umm, but isn't that because your encoding is using one code point?
>
> See the OP's explanation w.r.t. canonical equivalence.
>
> This isn't about the number of bytes, but about whether or not we should
> count characters encoded as two or more combined code points as a single
> char or not.

Here is a test case that shows the problem (if your terminal can display
combining characters (xterm appears to work)):

SELECT U&'\00E9', char_length(U&'\00E9');
 ?column? | char_length
----------+-------------
 é        |           1
(1 row)

SELECT U&'\0065\0301', char_length(U&'\0065\0301');
 ?column? | char_length
----------+-------------
 é        |           2
(1 row)
On Tuesday 14 April 2009 07:07:27 Andrew Gierth wrote:
> FWIW, the SQL spec puts the onus of normalization squarely on the
> application; the database is allowed to assume that Unicode strings
> are already normalized, is allowed to behave in implementation-defined
> ways when presented with strings that aren't normalized, and provision
> of normalization functions and predicates is just another optional
> feature.

Can you name chapter and verse on that? I see this, for example:

6.27 <numeric value function>

5) If a <char length expression> is specified, then
   Case:
   a) If the character encoding form of <character value expression> is
      not UTF8, UTF16, or UTF32, then let S be the <string value
      expression>.
      Case:
      i) If the most specific type of S is character string, then the
         result is the number of characters in the value of S.
         NOTE 134 — The number of characters in a character string is
         determined according to the semantics of the character set of
         that character string.
      ii) Otherwise, the result is OCTET_LENGTH(S).
   b) Otherwise, the result is the number of explicit or implicit <char
      length units> in <char length expression>, counted in accordance
      with the definition of those units in the relevant normatively
      referenced document.

So SQL redirects the question of character length to the Unicode
standard. I have not been able to find anything there on a quick look,
but I'm sure the Unicode standard has some very specific ideas on this.
Note that the matter of normalization is not mentioned here.
On Tue, Apr 14, 2009 at 1:32 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
> On Monday 13 April 2009 22:39:58 Andrew Dunstan wrote:
>> Umm, but isn't that because your encoding is using one code point?
>>
>> See the OP's explanation w.r.t. canonical equivalence.
>>
>> This isn't about the number of bytes, but about whether or not we should
>> count characters encoded as two or more combined code points as a single
>> char or not.
>
> Here is a test case that shows the problem (if your terminal can display
> combining characters (xterm appears to work)):
>
> SELECT U&'\00E9', char_length(U&'\00E9');
>  ?column? | char_length
> ----------+-------------
>  é        |           1
> (1 row)
>
> SELECT U&'\0065\0301', char_length(U&'\0065\0301');
>  ?column? | char_length
> ----------+-------------
>  é        |           2
> (1 row)

What's really at issue is "what is a string?". That is, is it a sequence
of characters or a sequence of code points? If it's the former then we
would also have to prohibit certain strings such as U&'\0301'
entirely. And we have to make substr() pick out the right number of
code points, etc.
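For instance, I would expect substr() today to happily slice between a
base character and its combining accent (a sketch, assuming a UTF-8
database):

SELECT substr(U&'\0065\0301', 1, 1);  -- first "character": the bare e,
                                      -- with the accent stripped off
 substr
--------
 e
(1 row)

-- 
greg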
Greg Stark <stark@enterprisedb.com> writes:
> What's really at issue is "what is a string?". That is, is it a sequence
> of characters or a sequence of code points? If it's the former then we
> would also have to prohibit certain strings such as U&'\0301'
> entirely. And we have to make substr() pick out the right number of
> code points, etc.

Another question is "what is the purpose of a database"? To me it would
be quite the wrong thing for the DB to not store what is presented, as
long as it's considered legal. Normalization of legal variant forms
seems pretty questionable. So I'm with the camp that says this is the
application's responsibility.

			regards, tom lane
On Apr 14, 2009, at 9:26 AM, Tom Lane wrote:

> Another question is "what is the purpose of a database"? To me it would
> be quite the wrong thing for the DB to not store what is presented, as
> long as it's considered legal. Normalization of legal variant forms
> seems pretty questionable. So I'm with the camp that says this is the
> application's responsibility.

Can `convert()` normalize strings?

Best,

David
David E. Wheeler wrote:
> On Apr 14, 2009, at 9:26 AM, Tom Lane wrote:
>
>> Another question is "what is the purpose of a database"? To me it would
>> be quite the wrong thing for the DB to not store what is presented, as
>> long as it's considered legal. Normalization of legal variant forms
>> seems pretty questionable. So I'm with the camp that says this is the
>> application's responsibility.
>
> Can `convert()` normalize strings?

I think that's handling a quite different problem. It certainly should
not do so automatically, IMNSHO.

I think there's a good case for some functions implementing the various
Unicode normalization functions, though.
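Something like this, perhaps (an entirely hypothetical function and
signature; nothing of the sort exists today):

SELECT normalize(U&'\0065\0301', 'NFC') = U&'\00E9';
-- hypothetically true: NFC would recombine e + COMBINING ACUTE
-- into the precomposed U+00E9

cheers

andrew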
Kevin Grittner wrote:
> I'm curious -- can every multi-code-point character be normalized to a
> single-code-point character?

I don't believe so. Those combinations used in the most common
orthographies have their own code points, but I understand you can use
the combining chars with essentially any other chars, although it might
not always make much sense to do so. That's important when you're
inventing symbols in things like mathematical and scientific papers.
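For instance, as far as I know the mathematician's x-hat has no
precomposed code point at all; it can only be written as a combining
sequence (a sketch, assuming a UTF-8 database):

SELECT U&'\0078\0302' AS x_hat, char_length(U&'\0078\0302');
 x_hat | char_length
-------+-------------
 x̂     |           2
(1 row)

cheers

andrew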
Andrew Dunstan <andrew@dunslane.net> writes:
> I think there's a good case for some functions implementing the various
> Unicode normalization functions, though.

I have no objection to that so long as the code footprint is in line
with the utility gain (i.e. not all that much). If we have to bring in
ICU or something similar to make it happen, the cost/benefit ratio
looks pretty bad.

			regards, tom lane
Greg Stark <stark@enterprisedb.com> wrote:
> Peter Eisentraut <peter_e@gmx.net> wrote:
>> SELECT U&'\00E9', char_length(U&'\00E9');
>>  ?column? | char_length
>> ----------+-------------
>>  é        |           1
>> (1 row)
>>
>> SELECT U&'\0065\0301', char_length(U&'\0065\0301');
>>  ?column? | char_length
>> ----------+-------------
>>  é        |           2
>> (1 row)
>
> What's really at issue is "what is a string?". That is, is it a
> sequence of characters or a sequence of code points?

Doesn't the SQL standard refer to them as "character string literals"?
The function is called character_length or char_length.

I'm curious -- can every multi-code-point character be normalized to a
single-code-point character?

-Kevin
On Tuesday 14 April 2009 18:49:45 Greg Stark wrote:
> What's really at issue is "what is a string?". That is, is it a sequence
> of characters or a sequence of code points?

I think a sequence of code points would be about as silly a definition
as the antiquated notion of a string as a sequence of bytes.

> If it's the former then we
> would also have to prohibit certain strings such as U&'\0301'
> entirely. And we have to make substr() pick out the right number of
> code points, etc.

Sure enough. That all goes along with what the original poster was
saying.
On Monday 13 April 2009 20:18:31 - - wrote:
> 2) PG has no support for the Unicode collation algorithm. Collation is
> offloaded to the OS, which makes this quite inflexible.

This argument is unclear. Do you want the Unicode collation algorithm,
or do you want flexibility? Some OSes do implement the Unicode
collation algorithm.
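The flexibility half of the request would presumably look like the SQL
standard's COLLATE clause, which we do not implement today (a sketch of
the standard syntax only; 'artists' is a made-up table):

SELECT name FROM artists ORDER BY name COLLATE "de_DE";
SELECT name FROM artists ORDER BY name COLLATE "es_ES";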
On Tuesday 14 April 2009 19:26:41 Tom Lane wrote:
> Another question is "what is the purpose of a database"? To me it would
> be quite the wrong thing for the DB to not store what is presented, as
> long as it's considered legal. Normalization of legal variant forms
> seems pretty questionable. So I'm with the camp that says this is the
> application's responsibility.

I think automatically normalizing or otherwise fiddling with Unicode
strings with combining characters is not acceptable. But the point is
that we should process equivalent forms in a consistent way.
>> I don't believe that the standard forbids the use of combining chars at all.
>> RFC 3629 says:
>>
>> ... This issue is amenable to solutions based on Unicode Normalization
>> Forms, see [UAX15].

> This is the relevant part. Tom was claiming that the UTF8 encoding required
> normalizing the string of unicode codepoints before encoding. I'm not sure
> that's true though, is it?

No. I think Tom has mistaken this for the fact that the UTF-8 encoding
can have multiple byte representations for one and the same code point.
The standard requires the shortest byte representation to be used.
(Please see
http://www.dwheeler.com/secure-programs/Secure-Programs-HOWTO/character-encoding.html
for more information.)

However, this has nothing to do with *code point* normalization. The
encoding does not require a code point sequence to be normalized. In
fact, UTF-8 can hold any of the 4 different normalized forms, 2 of
which are completely decomposed forms, that is, every accent takes up
its own code point. UTF-8 can also hold non-normalized strings.
Encodings just deal with how code points are represented in memory or
over wires.

> Another question is "what is the purpose of a database"? To me it would
> be quite the wrong thing for the DB to not store what is presented, as
> long as it's considered legal. Normalization of legal variant forms
> seems pretty questionable. So I'm with the camp that says this is the
> application's responsibility.

What I did not mean is automatic normalization. I meant something like
PG providing a function to normalize strings which can be explicitly
called by the user in case it is needed. For example:

SELECT * FROM table1 WHERE normalize(a, 'NFC') = normalize($1, 'NFC');
-- NFC is one of the 4 mentioned normalization forms and the one that
-- should probably be used, since it combines code points rather than
-- decomposing them.

I completely agree that the database should never just normalize by
itself, because it might be the user's intention to store
non-normalized strings. An exception might be an explicit configuration
setting which tells PG to normalize automatically.

In case of the above SELECT query, the problem with offloading the
normalization to the app is that every single application that is ever
used with this database has to a) normalize the string, and b) use the
same normalization form. If just one application at one point in time
fails to do so, string comparison is no longer safe (which could be a
security problem, as the quoted RFC text says). But with a callable
function like normalize() above, the user himself can choose whether it
is important or not. That is, does he want code points to match (do not
use normalize() then), or does he want characters to match (use
normalize() then)? The user can normalize the string exactly where it
is needed (for comparison).

I've searched PG's source code and it appeared to me that the 'text'
type is just a typedef for 'varlena', the same type 'bytea' is based
on. Given that the client and database encoding is the same, does this
mean that text is internally stored in exactly the same binary
representation the client has sent it in? So that if the client has
sent it in any of the 4 normalized forms, PG guarantees to store and
retrieve it (in case of a later SELECT) exactly as it was sent ("store
what is presented")? In other words: does PG guarantee the code point
sequence to remain the same?
Because if it does not, you cannot offload the normalization work to
the app anyway, since PG would be allowed to "un-normalize" it
internally. Also, what happens if the client has a different encoding
than the database, and PG has to internally convert client strings to
UTF-8? Does it only generate code points in the same normalized form
that it expects the user input to be in?
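One quick sanity check from SQL, assuming a UTF-8 database: if PG
really stores what is presented, the two equivalent forms should keep
their distinct byte lengths (a sketch of what I would expect to see):

SELECT octet_length(U&'\00E9')       AS composed,
       octet_length(U&'\0065\0301')  AS decomposed;
 composed | decomposed
----------+------------
        2 |          3
(1 row)

The decomposed form needs one extra byte (U+0065 is 1 byte in UTF-8 and
U+0301 is 2, while U+00E9 is 2), so if that distinction survives storage
and retrieval, no silent normalization is happening.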
On Apr 14, 2009, at 11:10 AM, Tom Lane wrote:

> Andrew Dunstan <andrew@dunslane.net> writes:
>> I think there's a good case for some functions implementing the various
>> Unicode normalization functions, though.
>
> I have no objection to that so long as the code footprint is in line
> with the utility gain (i.e. not all that much). If we have to bring in
> ICU or something similar to make it happen, the cost/benefit ratio
> looks pretty bad.

I've no idea what it would require, but the mapping table must be
pretty substantial. Still, I'd love to have this functionality in the
database.

Best,

David
>>>>> "Peter" == Peter Eisentraut <peter_e@gmx.net> writes:

> On Tuesday 14 April 2009 07:07:27 Andrew Gierth wrote:
>> FWIW, the SQL spec puts the onus of normalization squarely on the
>> application; the database is allowed to assume that Unicode
>> strings are already normalized, is allowed to behave in
>> implementation-defined ways when presented with strings that
>> aren't normalized, and provision of normalization functions and
>> predicates is just another optional feature.

Peter> Can you name chapter and verse on that?

4.2.8 Universal character sets

   A UCS string is a character string whose character repertoire is UCS
   and whose character encoding form is one of UTF8, UTF16, or UTF32.
   Any two UCS strings are comparable.

   An SQL-implementation may assume that all UCS strings are normalized
   in one of Normalization Form C (NFC), Normalization Form D (NFD),
   Normalization Form KC (NFKC), or Normalization Form KD (NFKD), as
   specified by [Unicode15]. <normalized predicate> may be used to
   verify the normalization form to which a particular UCS string
   conforms. Applications may also use <normalize function> to enforce
   a particular <normal form>. With the exception of <normalize
   function> and <normalized predicate>, the result of any operation on
   an unnormalized UCS string is implementation-defined.

   Conversion of UCS strings from one character set to another is
   automatic.

   Detection of a noncharacter in a UCS-string causes an exception
   condition to be raised. The detection of an unassigned code point
   does not.

[Obviously there are things here that we don't conform to anyway (we
don't raise exceptions for noncharacters, for example). We don't claim
conformance to T061.]

<normalized predicate> ::=
  <row value predicand> <normalized predicate part 2>

<normalized predicate part 2> ::=
  IS [ NOT ] [ <normal form> ] NORMALIZED

1) Without Feature T061, "UCS support", conforming SQL language shall
   not contain a <normalized predicate>.

2) Without Feature F394, "Optional normal form specification",
   conforming SQL language shall not contain <normal form>.

<normalize function> ::=
  NORMALIZE <left paren> <character value expression>
  [ <comma> <normal form> [ <comma> <normalize function result length> ] ]
  <right paren>

<normal form> ::= NFC | NFD | NFKC | NFKD

7) Without Feature T061, "UCS support", conforming SQL language shall
   not contain a <normalize function>.

9) Without Feature F394, "Optional normal form specification",
   conforming SQL language shall not contain <normal form>.

Peter> I see this, for example,
Peter> 6.27 <numeric value function>
[...]
Peter> So SQL redirects the question of character length to the Unicode
Peter> standard. I have not been able to find anything there on a
Peter> quick look, but I'm sure the Unicode standard has some very
Peter> specific ideas on this. Note that the matter of normalization
Peter> is not mentioned here.

I've taken a not-so-quick look at the Unicode standard (though I don't
claim to be any sort of expert on it), and I certainly can't see any
definitive indication of what the length is supposed to be; however,
the use of terminology such as "combining character sequence" (meaning
a series of code points that combine to make a single glyph) certainly
seems to strongly imply that our interpretation is correct and that the
OP's is not.
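To see the optional syntax concretely, the two features would read
roughly like this (spec syntax only, per the BNF above; we implement
neither, so these are not runnable in PG):

SELECT NORMALIZE(U&'\0065\0301', NFC);    -- <normalize function>
SELECT U&'\0065\0301' IS NFC NORMALIZED;  -- <normalized predicate>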
Other indications: the units used by length() must be the same as the
units used by position() and substring() (in the spec, when USING
CHARACTERS is specified), and it would not make sense to use a
definition of "character" that did not allow you to look inside a
combining sequence.

I've also failed so far to find any examples of other programming
languages in which a combining character sequence is taken to be a
single character for purposes of length or position specification.

-- 
Andrew (irc:RhodiumToad)
On Tue, Apr 14, 2009 at 11:32:57AM -0700, David E. Wheeler wrote:
> I've no idea what it would require, but the mapping table must be
> pretty substantial. Still, I'd love to have this functionality in the
> database.

The Unicode tables in ICU outweigh the size of the code by a factor of
5 or so. Once Postgres decides it wants to do Unicode properly and thus
imports the tables in some way, it could get the code to work with them
for 20% extra.

Have a nice day,
-- 
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.
On Monday 13 April 2009 20:18:31 - - wrote:
> 1) Functions like char_length() or length() do NOT return the number
> of characters (the manual says they do), instead they return the
> number of code points.

I have added a Todo item about possibly fixing this.
Peter Eisentraut <peter_e@gmx.net> writes:
> On Monday 13 April 2009 20:18:31 - - wrote:
>> 1) Functions like char_length() or length() do NOT return the number
>> of characters (the manual says they do), instead they return the
>> number of code points.
>
> I have added a Todo item about possibly fixing this.

I thought the conclusion of the thread was that this wasn't wrong?

			regards, tom lane
On Sunday 19 April 2009 18:54:45 Tom Lane wrote:
> Peter Eisentraut <peter_e@gmx.net> writes:
>> On Monday 13 April 2009 20:18:31 - - wrote:
>>> 1) Functions like char_length() or length() do NOT return the number
>>> of characters (the manual says they do), instead they return the
>>> number of code points.
>>
>> I have added a Todo item about possibly fixing this.
>
> I thought the conclusion of the thread was that this wasn't wrong?

The only consensus I saw was that the normal form of an existing
Unicode string shouldn't be altered by PostgreSQL. That's pretty clear.
However, no one was entirely clear on the matter of how combining
characters are supposed to be processed. And even if we think that the
current interfaces give the right answer, there should possibly be
other interfaces that give the other right answer. It needs more
research, first of all.