Thread: Unicode support
Hi. While PostgreSQL is a great database, it lacks some fundamental Unicode support. I want to present some points that have--to my knowledge--not been addressed so far. In the following text, it is assumed that the database and client encoding is UTF-8.

1) Functions like char_length() or length() do NOT return the number of characters (the manual says they do); instead they return the number of code points.

To illustrate the difference, I want to quote a Wikipedia article (http://en.wikipedia.org/wiki/Unicode#Ready-made_versus_composite_characters):

"For example é can be represented in Unicode as U+0065 (Latin small letter e) followed by U+0301 (combining acute) but it can also be represented as the precomposed character U+00E9 (Latin small letter e with acute). So in many cases, users have many ways of encoding the same character. To deal with this, Unicode provides the mechanism of canonical equivalence."

Also, the character U+1EBF (ế), used in Vietnamese, has both an acute and a circumflex accent. An equivalent code point sequence is U+0065 (e) U+0302 (circumflex accent) U+0301 (acute accent).

So, in the first example, the Unicode code points U+0065 U+0301 are in fact _one_ character (and are valid Unicode). These two code points should be rendered as one character and be indistinguishable from U+00E9. char_length() and length() are agnostic of this. Also, there are quite a few zero-width code points (like code points reserved for surrogate pairs, byte order marks, etc.) which are not characters at all. As described in 3), there also exist Unicode normalization algorithms, which can decompose one character into multiple code points. In some Asian scripts, this may boost the number of code points to a much higher number than the number of actual characters.

I guess a quick fix would be to change the documented definition of length() so that it is said to return the number of code points in the case of a Unicode encoding.

2) PG has no support for the Unicode collation algorithm. Collation is offloaded to the OS, which makes this quite inflexible.

This point is further described, in the case of the website last.fm, here: http://russ.garrett.co.uk/tag/postgresql/ . That article also contains a link to a quick fix in the form of a 3rd party module. However, in my humble opinion, this functionality should be part of PG itself: the collation algorithm provided by Unicode allows many different locales to be specified. For example, you may want to sort text using the 'de_DE' locale in one query, and use the 'en_GB' locale in the next query. This functionality may be needed in databases which store multilingual text.

There are also some characters or sequences of characters which are considered equivalent in some languages. As an example, 'Umlaut a' is considered to be the same as the character sequence 'ae' in German. You might argue: okay, then just set your OS locale to German. But what happens if text from more languages than just German is stored in the database? There are multiple instances of such equivalent characters in many languages. Some languages, like traditional Spanish, treat 'ch' as one character which is sorted between 'c' and 'd'; German does not. Storing both German and Spanish text in a database requires flexible collation, but this cannot be done if the collation is done at the OS level. In the latter case, you have to set a locale at the beginning and stick with it till the end of time. Also, the functions upper() and lower() depend on the locale.

3) PG has no support for Unicode normalization.
This relates to the problem explained in 1). Because Unicode often has multiple code point sequences to represent the same character, it is necessary to bring Unicode strings into a "normalized" form in order to compare them (Unicode has 4 normalization forms, as explained in http://en.wikipedia.org/wiki/Unicode_normalization). If you just compare by code points, visually indistinguishable characters may not compare as equal if they have a different code point sequence (for example, if they come from different sources, say two different users).

I'd even go so far as to say that this might pose a security problem. Consider a security-relevant function which compares two strings and misses the instances where the code point sequence is not the same but the character sequence is. The user is totally unaware of this, since the different code point sequences appear as the same character sequence to him/her, if the font and rendering are done correctly.

4) I don't know if it's easier to just implement a new type 'unicode' that deals with all the unique Unicode characteristics, or to extend the already existing functions that deal with 'text' types. But I think to just ignore all these points is quite dangerous, considering the gain in popularity Unicode has experienced so far and the necessity to support more than just the Latin alphabet.

If you consider implementing the points I explained above, the ICU library might be an option. It's open-source, has an MIT license, works on at least the operating systems PG works on, and is mature. It covers all the functionality you need. I've used it so far and can't complain. Also, you might consider rolling your own Unicode implementation, in which case www.unicode.org is the way to go.
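To make points 1) and 3) concrete: on a UTF-8 database I would expect something like the following (a sketch, not a transcript; both literals should render identically as 'é'):

SELECT U&'\00E9' = U&'\0065\0301'   AS equal,
       char_length(U&'\00E9')       AS composed_length,
       char_length(U&'\0065\0301')  AS decomposed_length;
 equal | composed_length | decomposed_length
-------+-----------------+-------------------
 f     |               1 |                 2
(1 row)

Any comments are welcome.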
- - wrote:
> 1) Functions like char_length() or length() do NOT return the number
> of characters (the manual says they do), instead they return the
> number of code points.

I think you have client_encoding misconfigured.

alvherre=# select length('á'::text);
 length
--------
      1
(1 fila)

alvherre=# select pg_column_size('á');
 pg_column_size
----------------
              3
(1 fila)

(there's one byte of overhead here, so á is two bytes)

-- 
Alvaro Herrera                  http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera <alvherre@commandprompt.com> wrote:
>> 1) Functions like char_length() or length() do NOT return the number
>> of characters (the manual says they do), instead they return the
>> number of code points.
>
> I think you have client_encoding misconfigured.
>
> alvherre=# select length('á'::text);
>  length
> --------
>       1
> (1 fila)

The OP didn't say it returned the number of bytes. Since you found that
this character was stored in only two bytes, it must have been one
two-byte code point. I think storing it as two code points would have
taken at least three bytes (one for the letter and two for the accent),
no?

-Kevin
Alvaro Herrera wrote:
> - - wrote:
>
>> 1) Functions like char_length() or length() do NOT return the number
>> of characters (the manual says they do), instead they return the
>> number of code points.
>
> I think you have client_encoding misconfigured.
>
> alvherre=# select length('á'::text);
>  length
> --------
>       1
> (1 fila)

Umm, but isn't that because your encoding is using one code point?

See the OP's explanation w.r.t. canonical equivalence.

This isn't about the number of bytes, but about whether or not we should
count characters encoded as two or more combined code points as a single
char or not.

cheers

andrew
Andrew Dunstan <andrew@dunslane.net> writes:
> This isn't about the number of bytes, but about whether or not we should
> count characters encoded as two or more combined code points as a single
> char or not.

It's really about whether we should support non-canonical encodings.
AFAIK that's a hack to cope with implementations that are restricted
to UTF-16, and we should Just Say No. Clients that are sending these
things converted to UTF-8 are in violation of the standard.

			regards, tom lane
On Mon, Apr 13, 2009 at 9:15 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> This isn't about the number of bytes, but about whether or not we should
>> count characters encoded as two or more combined code points as a single
>> char or not.
>
> It's really about whether we should support non-canonical encodings.
> AFAIK that's a hack to cope with implementations that are restricted
> to UTF-16, and we should Just Say No. Clients that are sending these
> things converted to UTF-8 are in violation of the standard.

Is it really true that canonical encodings never contain any composed
characters in them? I thought there were some glyphs which could only
be represented by composed characters. Also, users can construct
strings of Unicode code points themselves in SQL using || or other
text operators.

That said, my impression is that composed character support is pretty
thin on the ground elsewhere as well, but I don't have much first-hand
experience.

The original post seemed to be a contrived attempt to say "you should
use ICU". If composed character support were a show-stopper and there
was no other way to get it then it might be convincing, but I don't
know that it is and I don't know that ICU is the only place to get it.
And I'm sure it's not the only way to handle multiple encodings in a
database.

-- 
greg
Greg Stark <stark@enterprisedb.com> writes:
> Is it really true that canonical encodings never contain any composed
> characters in them? I thought there were some glyphs which could only
> be represented by composed characters.

AFAIK that's not true. However, in my original comment I was thinking
about UTF16 surrogates, which are something else entirely --- so I
withdraw that. I'm still dubious that it is our job to deal with
non-normalized characters, though.

> The original post seemed to be a contrived attempt to say "you should
> use ICU".

Indeed. The OP should go read all the previous arguments about ICU
in our archives.

			regards, tom lane
Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> This isn't about the number of bytes, but about whether or not we should
>> count characters encoded as two or more combined code points as a single
>> char or not.
>
> It's really about whether we should support non-canonical encodings.
> AFAIK that's a hack to cope with implementations that are restricted
> to UTF-16, and we should Just Say No. Clients that are sending these
> things converted to UTF-8 are in violation of the standard.

I don't believe that the standard forbids the use of combining chars at
all. RFC 3629 says:

   Security may also be impacted by a characteristic of several
   character encodings, including UTF-8: the "same thing" (as far as a
   user can tell) can be represented by several distinct character
   sequences. For instance, an e with acute accent can be represented
   by the precomposed U+00E9 E ACUTE character or by the canonically
   equivalent sequence U+0065 U+0301 (E + COMBINING ACUTE). Even though
   UTF-8 provides a single byte sequence for each character sequence,
   the existence of multiple character sequences for "the same thing"
   may have security consequences whenever string matching, indexing,
   searching, sorting, regular expression matching and selection are
   involved. An example would be string matching of an identifier
   appearing in a credential and in access control list entries. This
   issue is amenable to solutions based on Unicode Normalization Forms,
   see [UAX15].

cheers

andrew
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Greg Stark <stark@enterprisedb.com> writes:
>> Is it really true that canonical encodings never contain any composed
>> characters in them? I thought there were some glyphs which could only
>> be represented by composed characters.
>
> AFAIK that's not true. However, in my original comment I was thinking
> about UTF16 surrogates, which are something else entirely --- so I
> withdraw that. I'm still dubious that it is our job to deal with
> non-normalized characters, though.

Like it or not, they are part of Unicode and they are very much valid
Unicode. They are not in violation of the standard. This has nothing to
do with the encoding. There are also code points which specify the
direction of text (e.g. needed if you want to embed a Hebrew quote in
English text). To count such a code point as a character seems wrong.

>> The original post seemed to be a contrived attempt to say "you should
>> use ICU".
>
> Indeed. The OP should go read all the previous arguments about ICU
> in our archives.

Not at all. I was just making a suggestion. You may use any other
library or implement it yourself (I even said that in my original
post). www.unicode.org--the official website of the Unicode
consortium--has a complete database of all Unicode characters which can
be used as a basis.

But if you want to ignore the normalization/multiple code point issue,
point 2--the collation problem--still remains. And given that even a
crappy database like MySQL supports Unicode collation, this isn't
something to be ignored, IMHO.

Andrew Dunstan <andrew@dunslane.net> wrote:
> Tom Lane wrote:
>> Andrew Dunstan <andrew@dunslane.net> writes:
>>> This isn't about the number of bytes, but about whether or not we should
>>> count characters encoded as two or more combined code points as a single
>>> char or not.
>>
>> It's really about whether we should support non-canonical encodings.
>> AFAIK that's a hack to cope with implementations that are restricted
>> to UTF-16, and we should Just Say No. Clients that are sending these
>> things converted to UTF-8 are in violation of the standard.
>
> I don't believe that the standard forbids the use of combining chars at all.
> RFC 3629 says:
>
>    Security may also be impacted by a characteristic of several
>    character encodings, including UTF-8: the "same thing" (as far as a
>    user can tell) can be represented by several distinct character
>    sequences. For instance, an e with acute accent can be represented
>    by the precomposed U+00E9 E ACUTE character or by the canonically
>    equivalent sequence U+0065 U+0301 (E + COMBINING ACUTE). Even though
>    UTF-8 provides a single byte sequence for each character sequence,
>    the existence of multiple character sequences for "the same thing"
>    may have security consequences whenever string matching, indexing,
>    searching, sorting, regular expression matching and selection are
>    involved. An example would be string matching of an identifier
>    appearing in a credential and in access control list entries. This
>    issue is amenable to solutions based on Unicode Normalization Forms,
>    see [UAX15].

Exactly my point.

Best Regards.
- - <crossroads0000@googlemail.com> writes:
>>> The original post seemed to be a contrived attempt to say "you should
>>> use ICU".
>>
>> Indeed. The OP should go read all the previous arguments about ICU
>> in our archives.
>
> Not at all. I was just making a suggestion. You may use any other
> library or implement it yourself (I even said that in my original
> post). www.unicode.org--the official website of the Unicode
> consortium--has a complete database of all Unicode characters which
> can be used as a basis.
>
> But if you want to ignore the normalization/multiple code point issue,
> point 2--the collation problem--still remains. And given that even a
> crappy database like MySQL supports Unicode collation, this isn't
> something to be ignored, IMHO.

Sure, supporting multiple collations in a database is definitely a
known missing feature. There is a lot of work required to do it; a
patch to do so was too late to make it into 8.4 and required more work,
so hopefully the issues will be worked out for 8.5. I suggest you read
the old threads and make any contributions you can suggesting how to
solve the problems that arose.

>> I don't believe that the standard forbids the use of combining chars at all.
>> RFC 3629 says:
>>
>> ... This issue is amenable to solutions based on Unicode Normalization
>> Forms, see [UAX15].

This is the relevant part. Tom was claiming that the UTF8 encoding
required normalizing the string of unicode codepoints before encoding.
I'm not sure that's true though, is it?

-- 
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's PostGIS support!
>>>>> "Gregory" == Gregory Stark <stark@enterprisedb.com> writes:

>>> I don't believe that the standard forbids the use of combining
>>> chars at all. RFC 3629 says:
>>>
>>> ... This issue is amenable to solutions based on Unicode
>>> Normalization Forms, see [UAX15].

Gregory> This is the relevant part. Tom was claiming that the UTF8
Gregory> encoding required normalizing the string of unicode
Gregory> codepoints before encoding. I'm not sure that's true though,
Gregory> is it?

FWIW, the SQL spec puts the onus of normalization squarely on the
application; the database is allowed to assume that Unicode strings
are already normalized, is allowed to behave in implementation-defined
ways when presented with strings that aren't normalized, and provision
of normalization functions and predicates is just another optional
feature.

-- 
Andrew (irc:RhodiumToad)
On Monday 13 April 2009 22:39:58 Andrew Dunstan wrote:
> Umm, but isn't that because your encoding is using one code point?
>
> See the OP's explanation w.r.t. canonical equivalence.
>
> This isn't about the number of bytes, but about whether or not we should
> count characters encoded as two or more combined code points as a single
> char or not.

Here is a test case that shows the problem (if your terminal can display
combining characters (xterm appears to work)):

SELECT U&'\00E9', char_length(U&'\00E9');
 ?column? | char_length
----------+-------------
 é        |           1
(1 row)

SELECT U&'\0065\0301', char_length(U&'\0065\0301');
 ?column? | char_length
----------+-------------
 é        |           2
(1 row)
On Tuesday 14 April 2009 07:07:27 Andrew Gierth wrote:
> FWIW, the SQL spec puts the onus of normalization squarely on the
> application; the database is allowed to assume that Unicode strings
> are already normalized, is allowed to behave in implementation-defined
> ways when presented with strings that aren't normalized, and provision
> of normalization functions and predicates is just another optional
> feature.

Can you name chapter and verse on that? I see this, for example:

6.27 <numeric value function>

5) If a <char length expression> is specified, then
   Case:
   a) If the character encoding form of <character value expression> is
      not UTF8, UTF16, or UTF32, then let S be the <string value
      expression>.
      Case:
      i) If the most specific type of S is character string, then the
         result is the number of characters in the value of S.
         NOTE 134 — The number of characters in a character string is
         determined according to the semantics of the character set of
         that character string.
      ii) Otherwise, the result is OCTET_LENGTH(S).
   b) Otherwise, the result is the number of explicit or implicit <char
      length units> in <char length expression>, counted in accordance
      with the definition of those units in the relevant normatively
      referenced document.

So SQL redirects the question of character length to the Unicode
standard. I have not been able to find anything there on a quick look,
but I'm sure the Unicode standard has some very specific ideas on this.
Note that the matter of normalization is not mentioned here.
On Tue, Apr 14, 2009 at 1:32 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
> On Monday 13 April 2009 22:39:58 Andrew Dunstan wrote:
>> Umm, but isn't that because your encoding is using one code point?
>>
>> See the OP's explanation w.r.t. canonical equivalence.
>>
>> This isn't about the number of bytes, but about whether or not we should
>> count characters encoded as two or more combined code points as a single
>> char or not.
>
> Here is a test case that shows the problem (if your terminal can display
> combining characters (xterm appears to work)):
>
> SELECT U&'\00E9', char_length(U&'\00E9');
>  ?column? | char_length
> ----------+-------------
>  é        |           1
> (1 row)
>
> SELECT U&'\0065\0301', char_length(U&'\0065\0301');
>  ?column? | char_length
> ----------+-------------
>  é        |           2
> (1 row)

What's really at issue is "what is a string?". That is, is it a sequence
of characters or a sequence of code points? If it's the former then we
would also have to prohibit certain strings such as U&'\0301'
entirely. And we have to make substr() pick out the right number of
code points, etc.
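For instance, I would expect substr() today to happily slice between a
base character and its combining accent (a sketch, assuming a UTF-8
database):

SELECT substr(U&'\0065\0301', 1, 1);  -- first "character": the bare e,
                                      -- with the accent stripped off
 substr
--------
 e
(1 row)

-- 
greg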
Greg Stark <stark@enterprisedb.com> writes:
> What's really at issue is "what is a string?". That is, is it a sequence
> of characters or a sequence of code points? If it's the former then we
> would also have to prohibit certain strings such as U&'\0301'
> entirely. And we have to make substr() pick out the right number of
> code points, etc.

Another question is "what is the purpose of a database"? To me it would
be quite the wrong thing for the DB to not store what is presented, as
long as it's considered legal. Normalization of legal variant forms
seems pretty questionable. So I'm with the camp that says this is the
application's responsibility.

			regards, tom lane
On Apr 14, 2009, at 9:26 AM, Tom Lane wrote:

> Another question is "what is the purpose of a database"? To me it would
> be quite the wrong thing for the DB to not store what is presented, as
> long as it's considered legal. Normalization of legal variant forms
> seems pretty questionable. So I'm with the camp that says this is the
> application's responsibility.

Can `convert()` normalize strings?

Best,

David
David E. Wheeler wrote:
> On Apr 14, 2009, at 9:26 AM, Tom Lane wrote:
>
>> Another question is "what is the purpose of a database"? To me it would
>> be quite the wrong thing for the DB to not store what is presented, as
>> long as it's considered legal. Normalization of legal variant forms
>> seems pretty questionable. So I'm with the camp that says this is the
>> application's responsibility.
>
> Can `convert()` normalize strings?

I think that's handling a quite different problem. It certainly should
not do so automatically, IMNSHO.

I think there's a good case for some functions implementing the various
Unicode normalization functions, though.
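Something like this, perhaps (an entirely hypothetical function and
signature; nothing of the sort exists today):

SELECT normalize(U&'\0065\0301', 'NFC') = U&'\00E9';
-- hypothetically true: NFC would recombine e + COMBINING ACUTE
-- into the precomposed U+00E9

cheers

andrew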
Kevin Grittner wrote:
> I'm curious -- can every multi-code-point character be normalized to a
> single-code-point character?

I don't believe so. Those combinations used in the most common
orthographies have their own code points, but I understand you can use
the combining chars with essentially any other chars, although it might
not always make much sense to do so. That's important when you're
inventing symbols in things like mathematical and scientific papers.
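For instance, as far as I know the mathematician's x-hat has no
precomposed code point at all; it can only be written as a combining
sequence (a sketch, assuming a UTF-8 database):

SELECT U&'\0078\0302' AS x_hat, char_length(U&'\0078\0302');
 x_hat | char_length
-------+-------------
 x̂     |           2
(1 row)

cheers

andrew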
Andrew Dunstan <andrew@dunslane.net> writes:
> I think there's a good case for some functions implementing the various
> Unicode normalization functions, though.

I have no objection to that so long as the code footprint is in line
with the utility gain (i.e. not all that much). If we have to bring in
ICU or something similar to make it happen, the cost/benefit ratio
looks pretty bad.

			regards, tom lane
Greg Stark <stark@enterprisedb.com> wrote:
> Peter Eisentraut <peter_e@gmx.net> wrote:
>> SELECT U&'\00E9', char_length(U&'\00E9');
>>  ?column? | char_length
>> ----------+-------------
>>  é        |           1
>> (1 row)
>>
>> SELECT U&'\0065\0301', char_length(U&'\0065\0301');
>>  ?column? | char_length
>> ----------+-------------
>>  é        |           2
>> (1 row)
>
> What's really at issue is "what is a string?". That is, is it a
> sequence of characters or a sequence of code points?

Doesn't the SQL standard refer to them as "character string literals"?
The function is called character_length or char_length.

I'm curious -- can every multi-code-point character be normalized to a
single-code-point character?

-Kevin
On Tuesday 14 April 2009 18:49:45 Greg Stark wrote:
> What's really at issue is "what is a string?". That is, is it a sequence
> of characters or a sequence of code points?

I think a sequence of code points would be about as silly a definition
as the antiquated notion of a string as a sequence of bytes.

> If it's the former then we
> would also have to prohibit certain strings such as U&'\0301'
> entirely. And we have to make substr() pick out the right number of
> code points, etc.

Sure enough. That all goes along with what the original poster was
saying.
On Monday 13 April 2009 20:18:31 - - wrote:
> 2) PG has no support for the Unicode collation algorithm. Collation is
> offloaded to the OS, which makes this quite inflexible.

This argument is unclear. Do you want the Unicode collation algorithm,
or do you want flexibility? Some OSes do implement the Unicode
collation algorithm.
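The flexibility half of the request would presumably look like the SQL
standard's COLLATE clause, which we do not implement today (a sketch of
the standard syntax only; 'artists' is a made-up table):

SELECT name FROM artists ORDER BY name COLLATE "de_DE";
SELECT name FROM artists ORDER BY name COLLATE "es_ES";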
On Tuesday 14 April 2009 19:26:41 Tom Lane wrote:
> Another question is "what is the purpose of a database"? To me it would
> be quite the wrong thing for the DB to not store what is presented, as
> long as it's considered legal. Normalization of legal variant forms
> seems pretty questionable. So I'm with the camp that says this is the
> application's responsibility.

I think automatically normalizing or otherwise fiddling with Unicode
strings with combining characters is not acceptable. But the point is
that we should process equivalent forms in a consistent way.
>> I don't believe that the standard forbids the use of combining chars at all.
>> RFC 3629 says:
>>
>> ... This issue is amenable to solutions based on Unicode Normalization
>> Forms, see [UAX15].

> This is the relevant part. Tom was claiming that the UTF8 encoding required
> normalizing the string of unicode codepoints before encoding. I'm not sure
> that's true though, is it?

No. I think Tom has mistaken this for the fact that the UTF-8 encoding
can have multiple byte representations for one and the same code point.
The standard requires the shortest byte representation to be used.
(Please see
http://www.dwheeler.com/secure-programs/Secure-Programs-HOWTO/character-encoding.html
for more information.)

However, this has nothing to do with *code point* normalization. The
encoding does not require a code point sequence to be normalized. In
fact, UTF-8 can hold any of the 4 different normalized forms, 2 of
which are completely decomposed forms, that is, every accent takes up
its own code point. UTF-8 can also hold non-normalized strings.
Encodings just deal with how code points are represented in memory or
over wires.

> Another question is "what is the purpose of a database"? To me it would
> be quite the wrong thing for the DB to not store what is presented, as
> long as it's considered legal. Normalization of legal variant forms
> seems pretty questionable. So I'm with the camp that says this is the
> application's responsibility.

What I did not mean is automatic normalization. I meant something like
PG providing a function to normalize strings which can be explicitly
called by the user in case it is needed. For example:

SELECT * FROM table1 WHERE normalize(a, 'NFC') = normalize($1, 'NFC');
-- NFC is one of the 4 mentioned normalization forms and the one that
-- should probably be used, since it combines code points rather than
-- decomposing them.

I completely agree that the database should never just normalize by
itself, because it might be the user's intention to store
non-normalized strings. An exception might be an explicit configuration
setting which tells PG to normalize automatically.

In case of the above SELECT query, the problem with offloading the
normalization to the app is that every single application that is ever
used with this database has to a) normalize the string, and b) use the
same normalization form. If just one application at one point in time
fails to do so, string comparison is no longer safe (which could be a
security problem, as the quoted RFC text says). But with a callable
function like normalize() above, the user himself can choose whether it
is important or not. That is, does he want code points to match (do not
use normalize() then), or does he want characters to match (use
normalize() then)? The user can normalize the string exactly where it
is needed (for comparison).

I've searched PG's source code and it appeared to me that the 'text'
type is just a typedef for 'varlena', the same type 'bytea' is based
on. Given that the client and database encoding is the same, does this
mean that text is internally stored in exactly the same binary
representation the client has sent it in? So that if the client has
sent it in any of the 4 normalized forms, PG guarantees to store and
retrieve it (in case of a later SELECT) exactly as it was sent ("store
what is presented")? In other words: does PG guarantee the code point
sequence to remain the same?
Because if it does not, you cannot offload the normalization work to
the app anyway, since PG would be allowed to "un-normalize" it
internally. Also, what happens if the client has a different encoding
than the database, and PG has to internally convert client strings to
UTF-8? Does it only generate code points in the same normalized form
that it expects the user input to be in?
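One quick sanity check from SQL, assuming a UTF-8 database: if PG
really stores what is presented, the two equivalent forms should keep
their distinct byte lengths (a sketch of what I would expect to see):

SELECT octet_length(U&'\00E9')       AS composed,
       octet_length(U&'\0065\0301')  AS decomposed;
 composed | decomposed
----------+------------
        2 |          3
(1 row)

The decomposed form needs one extra byte (U+0065 is 1 byte in UTF-8 and
U+0301 is 2, while U+00E9 is 2), so if that distinction survives storage
and retrieval, no silent normalization is happening.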
On Apr 14, 2009, at 11:10 AM, Tom Lane wrote:

> Andrew Dunstan <andrew@dunslane.net> writes:
>> I think there's a good case for some functions implementing the various
>> Unicode normalization functions, though.
>
> I have no objection to that so long as the code footprint is in line
> with the utility gain (i.e. not all that much). If we have to bring in
> ICU or something similar to make it happen, the cost/benefit ratio
> looks pretty bad.

I've no idea what it would require, but the mapping table must be
pretty substantial. Still, I'd love to have this functionality in the
database.

Best,

David
>>>>> "Peter" == Peter Eisentraut <peter_e@gmx.net> writes:

> On Tuesday 14 April 2009 07:07:27 Andrew Gierth wrote:
>> FWIW, the SQL spec puts the onus of normalization squarely on the
>> application; the database is allowed to assume that Unicode
>> strings are already normalized, is allowed to behave in
>> implementation-defined ways when presented with strings that
>> aren't normalized, and provision of normalization functions and
>> predicates is just another optional feature.

Peter> Can you name chapter and verse on that?

4.2.8 Universal character sets

   A UCS string is a character string whose character repertoire is UCS
   and whose character encoding form is one of UTF8, UTF16, or UTF32.
   Any two UCS strings are comparable.

   An SQL-implementation may assume that all UCS strings are normalized
   in one of Normalization Form C (NFC), Normalization Form D (NFD),
   Normalization Form KC (NFKC), or Normalization Form KD (NFKD), as
   specified by [Unicode15]. <normalized predicate> may be used to
   verify the normalization form to which a particular UCS string
   conforms. Applications may also use <normalize function> to enforce
   a particular <normal form>. With the exception of <normalize
   function> and <normalized predicate>, the result of any operation on
   an unnormalized UCS string is implementation-defined.

   Conversion of UCS strings from one character set to another is
   automatic.

   Detection of a noncharacter in a UCS-string causes an exception
   condition to be raised. The detection of an unassigned code point
   does not.

[Obviously there are things here that we don't conform to anyway (we
don't raise exceptions for noncharacters, for example). We don't claim
conformance to T061.]

<normalized predicate> ::=
  <row value predicand> <normalized predicate part 2>

<normalized predicate part 2> ::=
  IS [ NOT ] [ <normal form> ] NORMALIZED

1) Without Feature T061, "UCS support", conforming SQL language shall
   not contain a <normalized predicate>.

2) Without Feature F394, "Optional normal form specification",
   conforming SQL language shall not contain <normal form>.

<normalize function> ::=
  NORMALIZE <left paren> <character value expression>
  [ <comma> <normal form> [ <comma> <normalize function result length> ] ]
  <right paren>

<normal form> ::= NFC | NFD | NFKC | NFKD

7) Without Feature T061, "UCS support", conforming SQL language shall
   not contain a <normalize function>.

9) Without Feature F394, "Optional normal form specification",
   conforming SQL language shall not contain <normal form>.

Peter> I see this, for example,
Peter> 6.27 <numeric value function>
[...]
Peter> So SQL redirects the question of character length to the Unicode
Peter> standard. I have not been able to find anything there on a
Peter> quick look, but I'm sure the Unicode standard has some very
Peter> specific ideas on this. Note that the matter of normalization
Peter> is not mentioned here.

I've taken a not-so-quick look at the Unicode standard (though I don't
claim to be any sort of expert on it), and I certainly can't see any
definitive indication of what the length is supposed to be; however,
the use of terminology such as "combining character sequence" (meaning
a series of code points that combine to make a single glyph) certainly
seems to strongly imply that our interpretation is correct and that the
OP's is not.
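To see the optional syntax concretely, the two features would read
roughly like this (spec syntax only, per the BNF above; we implement
neither, so these are not runnable in PG):

SELECT NORMALIZE(U&'\0065\0301', NFC);    -- <normalize function>
SELECT U&'\0065\0301' IS NFC NORMALIZED;  -- <normalized predicate>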
Other indications: the units used by length() must be the same as the
units used by position() and substring() (in the spec, when USING
CHARACTERS is specified), and it would not make sense to use a
definition of "character" that did not allow you to look inside a
combining sequence.

I've also failed so far to find any examples of other programming
languages in which a combining character sequence is taken to be a
single character for purposes of length or position specification.

-- 
Andrew (irc:RhodiumToad)
On Tue, Apr 14, 2009 at 11:32:57AM -0700, David E. Wheeler wrote:
> I've no idea what it would require, but the mapping table must be
> pretty substantial. Still, I'd love to have this functionality in the
> database.

The Unicode tables in ICU outweigh the size of the code by a factor of
5 or so. Once Postgres decides it wants to do Unicode properly and thus
imports the tables in some way, it could get the code to work with them
for 20% extra.

Have a nice day,
-- 
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.
On Monday 13 April 2009 20:18:31 - - wrote:
> 1) Functions like char_length() or length() do NOT return the number
> of characters (the manual says they do), instead they return the
> number of code points.

I have added a Todo item about possibly fixing this.
Peter Eisentraut <peter_e@gmx.net> writes:
> On Monday 13 April 2009 20:18:31 - - wrote:
>> 1) Functions like char_length() or length() do NOT return the number
>> of characters (the manual says they do), instead they return the
>> number of code points.
>
> I have added a Todo item about possibly fixing this.

I thought the conclusion of the thread was that this wasn't wrong?

			regards, tom lane
On Sunday 19 April 2009 18:54:45 Tom Lane wrote:
> Peter Eisentraut <peter_e@gmx.net> writes:
>> On Monday 13 April 2009 20:18:31 - - wrote:
>>> 1) Functions like char_length() or length() do NOT return the number
>>> of characters (the manual says they do), instead they return the
>>> number of code points.
>>
>> I have added a Todo item about possibly fixing this.
>
> I thought the conclusion of the thread was that this wasn't wrong?

The only consensus I saw was that the normal form of an existing
Unicode string shouldn't be altered by PostgreSQL. That's pretty clear.
However, no one was entirely clear on the matter of how combining
characters are supposed to be processed. And even if we think that the
current interfaces give the right answer, there should possibly be
other interfaces that give the other right answer. It needs more
research, first of all.