Thread: Re: [HACKERS] Unicode combining characters
Hi,

I should have sent the patch earlier, but got delayed by other stuff. Anyway, here is the patch:

- most of the functionality is only activated when MULTIBYTE is defined,

- it checks for valid UTF-8 characters, client-side only for now, and only on output: you can still send invalid UTF-8 to the server (so it is only partly compliant with Unicode 3.1, but that's better than nothing),

- it formats with the correct number of columns (that's why I made it in the first place, after all), but only for UNICODE. However, the code allows plugging in routines for other encodings, as Tatsuo did for the other multibyte functions,

- it corrects Tatsuo's UTF-8 code a bit to allow Unicode 3.1 characters (characters with values >= 0x10000, which are encoded on four bytes),

- it doesn't depend on the locale capabilities of glibc (useful for remote telnet).

I would like somebody to check it closely, as it is my first patch to pgsql. Also, I created dummy .orig files so that the two files I created are included; I hope that's the right way.

Now, a lot of functionality is NOT included here, but I will keep that for 7.3 :) That includes all string checking on the server side (which will have to be a bit more optimised ;) ), and the input checking on the client side for UTF-8, though that should not be difficult: it's just a matter of sending the strings through mbvalidate() before sending them to the server. Strong checking on UTF-8 strings is mandatory to be compliant with Unicode 3.1+.

Do I have time to look for a patch to include iso-8859-15 for 7.2? The euro is coming on 1 January 2002 (before 7.3!) and over 280 million people in Europe will need the euro sign, and only iso-8859-15 and iso-8859-16 have it (and unfortunately, I don't think all Unices will switch to Unicode in the meantime)....

err... yes, I know that not every single person in Europe uses PostgreSQL, so it's not exactly 280m, but it's just a matter of time! ;)

I'll come back (on pgsql-hackers) later to ask a few questions regarding the full Unicode support (normalisation, collation, regexes,...) on the server side :)

Here is the patch!

Patrice.

--
Patrice HÉDÉ ------------------------------- patrice à islande org -----
-- Isn't it weird how scientists can imagine all the matter of the universe exploding out of a dot smaller than the head of a pin, but they can't come up with a more evocative name for it than "The Big Bang"?
-- What would _you_ call the creation of the universe?
-- "The HORRENDOUS SPACE KABLOOIE!" - Calvin and Hobbes
------------------------------------------ http://www.islande.org/ -----
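The kind of strict UTF-8 checking described in this message can be sketched roughly as follows. This is only an illustration, not the code from the patch: the names utf8_seq_len, utf8_decode and utf8_valid are hypothetical, and the rules applied (no overlong forms, no surrogates, nothing above U+10FFFF) are the usual well-formedness rules, assumed here rather than taken from the attachment.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sequence length implied by a lead byte, 0 if the
 * byte cannot start a well-formed sequence. */
static int utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if (lead >= 0xC2 && lead <= 0xDF) return 2;
    if (lead >= 0xE0 && lead <= 0xEF) return 3;
    if (lead >= 0xF0 && lead <= 0xF4) return 4;  /* values >= 0x10000 */
    return 0;
}

/* Decode one character from s (avail bytes remaining). Returns the number
 * of bytes consumed, or 0 if the sequence is not valid UTF-8. */
static int utf8_decode(const unsigned char *s, size_t avail, uint32_t *cp)
{
    /* Smallest code point that legitimately needs each length. */
    static const uint32_t min_cp[5] = {0, 0, 0x80, 0x800, 0x10000};
    uint32_t c;
    int len, i;

    if (avail == 0) return 0;
    len = utf8_seq_len(s[0]);
    if (len == 0 || (size_t) len > avail) return 0;
    if (len == 1) { *cp = s[0]; return 1; }

    c = s[0] & (0xFF >> (len + 1));          /* payload bits of the lead byte */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0; /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min_cp[len]) return 0;            /* overlong encoding */
    if (c >= 0xD800 && c <= 0xDFFF) return 0; /* surrogate code point */
    if (c > 0x10FFFF) return 0;               /* beyond the Unicode range */
    *cp = c;
    return len;
}

/* Returns 1 if the whole buffer is valid UTF-8, 0 otherwise. */
static int utf8_valid(const unsigned char *s, size_t n)
{
    size_t i = 0;
    while (i < n) {
        uint32_t cp;
        int used = utf8_decode(s + i, n - i, &cp);
        if (used == 0) return 0;
        i += (size_t) used;
    }
    return 1;
}
```

With this sketch, the euro sign (bytes E2 82 AC) decodes to 0x20AC in three bytes, and F0 90 80 80 decodes to 0x10000 in four.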
Attachment
> - corrects a bit the UTF-8 code from Tatsuo to allow Unicode 3.1
>   characters (characters with values >= 0x10000, which are encoded on
>   four bytes).

After applying your patches, do the 4-byte UTF-8 sequences convert to UCS-2 (2 bytes) or UCS-4 (4 bytes) in pg_utf2wchar_with_len()? If it were 4 bytes, we are in trouble. The current regex implementation does not handle 4-byte-wide charsets.
--
Tatsuo Ishii
* Tatsuo Ishii <t-ishii@sra.co.jp> [011009 18:38]:
> > - corrects a bit the UTF-8 code from Tatsuo to allow Unicode 3.1
> >   characters (characters with values >= 0x10000, which are encoded on
> >   four bytes).
>
> After applying your patches, do the 4-bytes UTF-8 convert to UCS-2 (2
> bytes) or UCS-4 (4 bytes) in pg_utf2wchar_with_len()? If it were 4
> bytes, we are in trouble. Current regex implementation does not handle
> 4 byte width charsets.

*sigh* yes, it does encode to four bytes :(

Three solutions then:

1) we support these supplementary characters, knowing that they won't work with regexes,

2) I back out the change, but then anyone using these characters will get something weird, since the decoding would be faulty (they would be handled as 3-byte UTF-8 chars, and then the fourth byte would become a "faulty char"... not very good, as the 3-byte version is still not a valid UTF-8 code!),

3) we fix the regex engine within the next 24 hours, before the beta deadline kicks in :/

I must say that I doubt that anyone will use these characters in the next few months: these are mostly Chinese extended characters, along with the Old Italic, Deseret, and Gothic scripts, Byzantine and Western musical symbols, and the mathematical alphanumeric symbols.

I would prefer solution 1), as I think it is better to allow these characters, even with a temporary restriction on the regex, than to fail completely on them. As for solution 3), we may still work on it in the next few months :) [I haven't even looked at the regex engine yet, so I don't know the implications of what I have just said!]

What do you think?

Patrice

--
Patrice Hédé
email: patrice hede à islande org
www : http://www.islande.org/
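The faulty decoding described under solution 2) can be illustrated with a small sketch. The length rule in legacy_len below is an assumption about what a decoder without four-byte support might do (treat every lead byte with the top three bits set as a 3-byte sequence); it is not copied from the PostgreSQL source, and all names here are hypothetical.

```c
#include <stddef.h>

/* Assumed pre-patch rule: any lead byte matching 111xxxxx is taken as the
 * start of a 3-byte sequence, so 0xF0..0xF7 are misread. */
static int legacy_len(unsigned char lead)
{
    if ((lead & 0x80) == 0)    return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xE0) == 0xE0) return 3;
    return 1;                  /* stray continuation byte, counted alone */
}

/* Four-byte-aware rule, for comparison. */
static int fixed_len(unsigned char lead)
{
    if ((lead & 0x80) == 0)    return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 1;                  /* stray continuation byte, counted alone */
}

/* Count how many "characters" a given length rule sees in a buffer. */
static size_t count_chars(const unsigned char *s, size_t n,
                          int (*len_of)(unsigned char))
{
    size_t i = 0, count = 0;
    while (i < n) {
        size_t len = (size_t) len_of(s[i]);
        if (len > n - i)
            len = n - i;       /* truncated sequence at end of buffer */
        i += len;
        count++;
    }
    return count;
}
```

For the five bytes F0 90 80 80 41 (U+10000 followed by "A"), the assumed legacy rule sees three items: a bogus 3-byte character, a stray 0x80, and "A". The four-byte-aware rule sees two.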
> > After applying your patches, do the 4-bytes UTF-8 convert to UCS-2 (2
> > bytes) or UCS-4 (4 bytes) in pg_utf2wchar_with_len()? If it were 4
> > bytes, we are in trouble. Current regex implementation does not handle
> > 4 byte width charsets.
>
> *sigh* yes, it does encode to four bytes :(
>
> Three solutions then :
>
> 1) we support these supplementary characters, knowing that they won't
>    work with regexes,
>
> 2) I back out the change, but then anyone using these characters will
>    get something weird, since the decoding would be faulty (they would
>    be handled as 3 bytes UTF-8 chars, and then the fourth byte would
>    become a "faulty char"... not very good, as the 3-byte version is
>    still not a valid UTF-8 code !),
>
> 3) we fix the regex engine within the next 24 hours, before the beta
>    deadline is activated :/
>
> I must say that I doubt that anyone will use these characters in the
> next few months : these are mostly chinese extended characters, with
> old italic, deseret, and gothic scripts, and bysantine and western
> musical symbols, as well as the mathematical alphanumerical symbols.
>
> I would prefer solution 1), as I think it is better to allow these
> characters, even with a temporary restriction on the regex, than to
> fail completely on them. As for solution 3), we may still work at it
> in the next few months :) [I haven't even looked at the regex engine
> yet, so I don't know the implications of what I have just said !]
>
> What do you think ?

I think 2) is not very good, and we should reject these 4-byte UTF-8 strings. After all, we are not ready for them.

BTW, the other parts of your patch look good. Peter, what do you think?
--
Tatsuo Ishii
> > 1) we support these supplementary characters, knowing that they won't
> >    work with regexes,
> >
> > 2) I back out the change, but then anyone using these characters will
> >    get something weird, since the decoding would be faulty (they would
> >    be handled as 3 bytes UTF-8 chars, and then the fourth byte would
> >    become a "faulty char"... not very good, as the 3-byte version is
> >    still not a valid UTF-8 code !),
> >
> > 3) we fix the regex engine within the next 24 hours, before the beta
> >    deadline is activated :/
> >
> > What do you think ?
>
> I think 2) is not very good, and we should reject these 4-bytes UTF-8
> strings. After all, we are not ready for them.

If we still recognise them as 4-byte UTF-8 chars (in order to parse the next char correctly) and reject them as invalid chars, that should be OK :)

> BTW, other part of your patches looks good. Peter, what do you think?

Nice to hear :)

Patrice

--
Patrice Hédé
email: patrice hede à islande org
www : http://www.islande.org/
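The compromise in this reply (recognise a 4-byte sequence so that the next character parses correctly, but reject it as invalid) might look roughly like the sketch below. The names scan_result, seq_len and scan_reject_4byte are hypothetical, and for brevity the sketch does not check continuation bytes; it only shows how the length is used to skip the rejected character cleanly.

```c
#include <stddef.h>

/* Tally of characters seen in a buffer. */
typedef struct {
    size_t accepted;   /* 1- to 3-byte characters */
    size_t rejected;   /* 4-byte characters and unparseable bytes */
} scan_result;

/* Length implied by a lead byte, 0 if the byte cannot start a sequence. */
static int seq_len(unsigned char lead)
{
    if ((lead & 0x80) == 0)    return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 0;
}

/* Walk the buffer, rejecting 4-byte characters but still skipping all four
 * of their bytes, so the following character starts at the right offset. */
static scan_result scan_reject_4byte(const unsigned char *s, size_t n)
{
    scan_result r = {0, 0};
    size_t i = 0;

    while (i < n) {
        int len = seq_len(s[i]);
        if (len == 0 || (size_t) len > n - i) {
            r.rejected++;      /* stray or truncated byte: skip just one */
            i++;
            continue;
        }
        if (len == 4)
            r.rejected++;      /* recognised, but not supported yet */
        else
            r.accepted++;
        i += (size_t) len;
    }
    return r;
}
```

For the bytes 41 F0 90 80 80 42 ("A", U+10000, "B"), this accepts two characters and rejects exactly one, with no leftover "faulty char" from the fourth byte.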