Re: Unicode support - Mailing list pgsql-hackers
From | - - |
---|---|
Subject | Re: Unicode support |
Date | |
Msg-id | 1842a500904141107h2f33ca59oada2f65a34eff881@mail.gmail.com |
In response to | Re: Unicode support (Andrew Dunstan <andrew@dunslane.net>) |
List | pgsql-hackers |
>> I don't believe that the standard forbids the use of combining chars at all.
>> RFC 3629 says:
>>
>> ... This issue is amenable to solutions based on Unicode Normalization
>> Forms, see [UAX15].

> This is the relevant part. Tom was claiming that the UTF8 encoding required
> normalizing the string of unicode codepoints before encoding. I'm not sure
> that's true though, is it?

No. I think Tom has confused this with the fact that the UTF-8 byte scheme
could in principle represent one and the same code point with several byte
sequences of different lengths. The standard requires the shortest byte
representation to be used. (Please see
http://www.dwheeler.com/secure-programs/Secure-Programs-HOWTO/character-encoding.html
for more information.) However, this has nothing to do with *code point*
normalization. The encoding does not require a code point sequence to be
normalized. In fact, UTF-8 can hold any of the 4 normalization forms, 2 of
which are fully decomposed forms, that is, every accent takes up its own
code point. UTF-8 can also hold non-normalized strings. Encodings only deal
with how code points are represented in memory or over the wire.

> Another question is "what is the purpose of a database"? To me it would
> be quite the wrong thing for the DB to not store what is presented, as
> long as it's considered legal. Normalization of legal variant forms
> seems pretty questionable. So I'm with the camp that says this is the
> application's responsibility.

I did not mean automatic normalization. I meant something like PG providing
a function to normalize strings, which the user can call explicitly when it
is needed. For example:

    SELECT * FROM table1 WHERE normalize(a, 'NFC') = normalize($1, 'NFC');
    -- NFC is one of the 4 normalization forms mentioned above and the one
    -- that should probably be used, since it combines code points rather
    -- than decomposing them.
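To make the difference between the encoding rule and code point normalization concrete, here is a minimal sketch in Python (used purely for illustration, this is not PG code). Python's standard unicodedata module stands in for the proposed normalize() function, and the helper names same_characters and is_valid_utf8 are made up for this example.

```python
import unicodedata

# The same character "e with acute accent" as two different code point sequences.
composed = "\u00e9"      # one precomposed code point
decomposed = "e\u0301"   # base letter + COMBINING ACUTE ACCENT

# The encoding does not normalize: each sequence gets its own valid UTF-8 bytes.
composed_bytes = composed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")


def same_characters(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after bringing both into the same normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode under Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# b"\xc0\xaf" is an overlong (non-shortest) byte form of "/", which the
# shortest-form rule forbids. That rule concerns bytes, not code points.
overlong_slash = b"\xc0\xaf"
```

Here composed == decomposed is false while same_characters(composed, decomposed) is true, and is_valid_utf8(overlong_slash) is false although both encoded accent strings are valid. The shortest-form rule and normalization are two independent things.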
I completely agree that the database should never normalize on its own,
because it might be the user's intention to store non-normalized strings.
An exception might be an explicit configuration setting that tells PG to
normalize automatically.

In the case of the above SELECT query, the problem with offloading the
normalization to the app is that every single application that is ever used
with this database has to a) normalize the string and b) use the same
normalization form. If just one application at one point in time fails to do
so, string comparison is no longer safe (which could be a security problem,
as the quoted RFC text says). With a callable function like normalize()
above, the user can decide whether this matters: either code points should
match (do not use normalize() then), or characters should match (use
normalize() then). The user can normalize the string exactly where it is
needed (for comparison).

I've searched PG's source code and it appeared to me that the 'text' type is
just a typedef for 'varlena', the same type 'bytea' is based on. Given that
the client and database encoding are the same, does this mean that text is
internally stored in exactly the same binary representation the client has
sent it in? So that if the client has sent it in any of the 4 normalization
forms, PG guarantees to store and retrieve it (in case of a later SELECT)
exactly as it was sent ("store what is presented")? In other words: does PG
guarantee that the code point sequence remains the same? Because if it does
not, you cannot offload the normalization work to the app anyway, since PG
would be allowed to "un-normalize" it internally.

Also, what happens if the client has a different encoding than the database,
and PG has to internally convert client strings to UTF-8? Does it only
generate code points in the same normalization form that it expects the user
input to be in?
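A small Python sketch of the failure mode described above, where two applications use different normalization forms against the same data. Python's unicodedata module again stands in for the hypothetical normalize() function; "app A", "app B" and all variable names are assumptions made for illustration only.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

name = "Zo\u00eb"  # "Zoe" with a diaeresis on the last letter

# Hypothetical app A normalizes to NFC before storing the value.
stored_by_app_a = unicodedata.normalize("NFC", name)

# Hypothetical app B normalizes to NFD before using the value in a query.
query_from_app_b = unicodedata.normalize("NFD", name)

# Comparing code points directly fails, although both are the same characters.
raw_match = stored_by_app_a == query_from_app_b

# Normalizing both sides at comparison time, as in the SELECT above, matches.
normalized_match = (
    unicodedata.normalize("NFC", stored_by_app_a)
    == unicodedata.normalize("NFC", query_from_app_b)
)

# Number of code points the same name takes in each of the 4 forms.
lengths = {form: len(unicodedata.normalize(form, name)) for form in FORMS}
```

Here raw_match is false and normalized_match is true. The lengths dictionary shows that the composed forms (NFC, NFKC) use 3 code points for this name and the decomposed forms (NFD, NFKD) use 4.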