Re: Unicode support - Mailing list pgsql-hackers

From - -
Subject Re: Unicode support
Date
Msg-id 1842a500904141107h2f33ca59oada2f65a34eff881@mail.gmail.com
In response to Re: Unicode support  (Andrew Dunstan <andrew@dunslane.net>)
List pgsql-hackers
>> I don't believe that the standard forbids the use of combining chars at all.
>> RFC 3629 says:
>>
>>  ... This issue is amenable to solutions based on Unicode Normalization
>>  Forms, see [UAX15].

> This is the relevant part. Tom was claiming that the UTF8 encoding required
> normalizing the string of unicode codepoints before encoding. I'm not sure
> that's true though, is it?

No. I think Tom has mistaken this for the fact that the UTF-8 encoding
can have multiple byte representations for one and the same code
point. The standard requires the shortest byte representation to be
used. (Please see
http://www.dwheeler.com/secure-programs/Secure-Programs-HOWTO/character-encoding.html
for more information.) However, this has nothing to do with *code
point* normalization. The encoding does not require a code point
sequence to be normalized. In fact, UTF-8 can hold any of the 4
normalization forms, 2 of which are fully decomposed forms in which
every accent takes up its own code point. UTF-8 can also hold
non-normalized strings. Encodings only deal with how code points are
represented in memory or over the wire.
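
To make the distinction concrete, here is a minimal sketch in Python
(not PostgreSQL code; the helper name is_valid_utf8 is made up for
illustration). It shows that a non-shortest byte sequence is rejected
by a strict UTF-8 decoder, while both the composed and the decomposed
spelling of the same character are perfectly valid UTF-8:

```python
import unicodedata

# Overlong (non-shortest) two-byte encoding of "/" (U+002F).
# The shortest form is the single byte b"\x2f".
overlong_slash = b"\xc0\xaf"


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if a strict UTF-8 decoder accepts data."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


composed = "\u00e9"      # e with acute accent as a single code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

composed_bytes = composed.encode("utf-8")      # b"\xc3\xa9"
decomposed_bytes = decomposed.encode("utf-8")  # b"e\xcc\x81"

print(is_valid_utf8(overlong_slash))    # byte-level rule: rejected
print(is_valid_utf8(composed_bytes))    # valid UTF-8
print(is_valid_utf8(decomposed_bytes))  # also valid UTF-8
print(composed_bytes == decomposed_bytes)  # different bytes, same character
```

The shortest-form rule is enforced at the byte level; whether the code
point sequence is normalized is a separate question that the encoding
does not answer.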

> Another question is "what is the purpose of a database"?  To me it would
> be quite the wrong thing for the DB to not store what is presented, as
> long as it's considered legal.  Normalization of legal variant forms
> seems pretty questionable.  So I'm with the camp that says this is the
> application's responsibility.

What I did not mean is automatic normalization. I meant something like
PG providing a function to normalize strings which can be explicitly
called by the user in case it is needed. For example:

SELECT * FROM table1 WHERE normalize(a, 'NFC') = normalize($1, 'NFC');
-- NFC is one of the 4 mentioned normalization forms and the one that
-- should probably be used, since it combines code points rather than
-- decomposing them.

I completely agree that the database should never just normalize by
itself, because it might be the user's intention to store
non-normalized strings. An exception might be an explicit
configuration setting which tells PG to normalize automatically. In
the case of the above SELECT query, the problem with offloading the
normalization to the app is that every single application that is
ever used with this database has to a) normalize the string and b) use
the same normalization form. If just one application at one point in
time fails to do so, string comparison is no longer safe (which
could be a security problem, as the quoted RFC text says). But with a
callable function like normalize() above, the user himself can choose
whether it is important or not. That is, does he want code points to
match (do not use normalize() then), or does he want characters to
match (use normalize() then)? The user can normalize the string
exactly where it is needed (for comparison).
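
The difference between "code points match" and "characters match" can
be sketched in a few lines of Python. This only illustrates the
semantics I have in mind for the proposed normalize() function; the
function names same_code_points and same_characters and the sample
strings are made up for the example:

```python
import unicodedata


def same_code_points(a: str, b: str) -> bool:
    """Plain comparison: the code point sequences must be identical."""
    return a == b


def same_characters(a: str, b: str, form: str = "NFC") -> bool:
    """Comparison after bringing both sides into the same normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


stored = "Ame\u0301lie"  # written by an app that sends decomposed text
param = "Am\u00e9lie"    # sent by an app that sends composed text

print(same_code_points(stored, param))  # the two apps disagree
print(same_characters(stored, param))   # normalizing both sides fixes it
```

Both sides of the comparison are normalized at the point of use, so it
does not matter which form each application happened to send.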

I've searched PG's source code and it appeared to me that the 'text'
type is just a typedef for 'varlena', the same type 'bytea' is based
on. Given that the client and database encoding are the same, does this
mean that text is internally stored in exactly the same binary
representation the client has sent it in? So that if the client has
sent it in any of the 4 normalization forms, PG guarantees to store and
retrieve it (in case of a later SELECT) exactly as it was sent ("store
what is presented")? In other words: does PG guarantee that the code
point sequence remains the same? Because if it does not, you cannot
offload the normalization work to the app anyway, since PG would be
allowed to "un-normalize" it internally.

Also, what happens if the client has a different encoding than the
database, and PG has to internally convert client strings to UTF-8?
Does it only generate code points in the same normalization form that
it expects the user input to be in?
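
As a point of comparison only, here is what happens with Python's
Latin-1 codec (an assumption for illustration; this is not PG's
conversion code and says nothing definite about it). A Latin-1 byte
maps to exactly one code point, so the converted text comes out
composed, and an app that expects NFD would still see a mismatch:

```python
import unicodedata

latin1_bytes = b"caf\xe9"               # "café" as sent by a Latin-1 client
text = latin1_bytes.decode("latin-1")   # one byte becomes one code point
utf8_bytes = text.encode("utf-8")       # what a UTF-8 database would store

already_nfc = unicodedata.normalize("NFC", text) == text
already_nfd = unicodedata.normalize("NFD", text) == text

print(utf8_bytes)    # b'caf\xc3\xa9'
print(already_nfc)   # converted text is in composed form
print(already_nfd)   # but it is not in decomposed form
```

So at least with this kind of one-to-one conversion table, the form of
the result is decided by the table, not by the client.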

