Thread: Unicode combining characters
Hi all,

while working on a new project involving PostgreSQL and making some
tests, I have come up with the following output from psql :

 lang | length | length | text      | text
------+--------+--------+-----------+-----------
 isl  |      7 |      6 | álíta     | áleit
 isl  |      7 |      7 | álíta     | álitum
 isl  |      7 |      7 | álíta     | álitið
 isl  |      5 |      4 | maður     | mann
 isl  |      5 |      7 | maður     | mönnum
 isl  |      5 |      5 | maður     | manna
 isl  |      5 |      4 | óska      | -aði

[the misalignment is what I got, it's not a copy-paste error]

This is pasted from a UTF-8 xterm running psql under a UTF-8 locale,
querying a database created with -E UNICODE (by the way, these are
icelandic words :) ).

What you see above is misleading, since it's not possible to see that
'á', 'í', 'ó' and 'ö' are using combining marks, while 'ð' is not.

As a reminder, a combining mark in Unicode means that á is actually
encoded as a + ' (where ' is the acute combining mark).

Encoded in UTF-8, it's then <61 cc 81> [UTF16: 0061 0301],
instead of <c3 a1> [UTF16: 00E1].

The "length" fields are what is returned by length(a.text) and
length(b.text).

So, this shows two problems :

- length() on the server side doesn't handle Unicode correctly [I have
  the same result with char_length()], and returns the number of chars
  (as it is however advertised to do), rather than the length of the
  string.

- the psql frontend makes the same mistake.

I am using version 7.1.3 (debian sid), so it may have been corrected
in the meantime (in this case, I apologise, but I have only recently
started again to use PostgreSQL and I haven't followed -hackers long
enough).

=> I think fixing psql shouldn't be too complicated, as the glibc
should be providing the locale, and return the right values (is this
the case ? and what happens for combined latin + chinese characters
for example ? I'll have to try that later). If it's not fixed already,
do you want me to look at this ? [it will take some time, as I haven't
set up any development environment for postgres yet, and I'm away for
one week from thursday].

=> regarding the backend, it may be more complex, as the underlying
system may not provide any UTF-8 locale to use (!= from being UTF-8
aware : an administrator may have decided that UTF-8 locales are
useless on a server, as only root connections are made, and he wants
only the C locale on the console - I've seen that quite often ;) ).

This brings me to another subject : I will need to support the full
Unicode collation algorithm (UCA, as described in TR#10 [1] of the
Unicode consortium), and I will need to be able to sort according to
locales which may not be installed on the backend server (some of
which may not even be recognised by GNU libc, which already supports
more than 140 locales -- artificial languages would be an example). I
will also need to be able to normalise the unicode strings (TR#15 [2])
so that I don't have some characters in legacy codepoints [as 00E1
above], and others with combining marks.

There is today an implementation in perl of the needed functionality,
in Unicode::Collate and Unicode::Normalize (which I haven't tried yet
:( ). But as they are Perl modules, the untrusted version of perl,
plperlu, will be needed, and it's a pity for what I consider a core
functionality in the future (not that plperlu isn't a good thing - I
can't wait for it ! - but that an untrusted pl language is needed to
support normalisation and collation).

Note also that there are a lot of data associated with these
algorithms, as you could expect.
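To make the byte-level difference concrete, here is a minimal standalone
C sketch (an illustration only, not PostgreSQL code): it prints the byte
count of each of the two encodings of 'á' given above, which is
effectively what a byte-oriented length() reports.

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* 'á' precomposed: U+00E1, UTF-8 bytes C3 A1 */
        const char precomposed[] = "\xc3\xa1";
        /* 'á' decomposed: 'a' (U+0061) + combining acute (U+0301),
           UTF-8 bytes 61 CC 81 */
        const char decomposed[] = "\x61\xcc\x81";

        printf("precomposed: %d bytes\n", (int) strlen(precomposed)); /* 2 */
        printf("decomposed : %d bytes\n", (int) strlen(decomposed));  /* 3 */

        /* Both forms display as a single character, so neither the byte
           count nor the code-point count matches what the user sees. */
        return 0;
    }

Normalising both forms to a single representation (TR#15, discussed
below) is what makes such counts agree again.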
I was wondering if some people have already thought about this, or
already done something, or if some of you are interested in this. If
nobody does anything, I'll do something eventually, probably before
Christmas (I don't have much time for this, and I don't need the
functionality right now), but if there is an interest, I could team
with others and develop it faster :)

Anyway, I'm open to suggestions :

- implement it in C, in the core,

- implement it in C, as contributed custom functions,

- implement it in perl (by reusing Unicode:: work), in a trusted plperl,

- implement it in perl, calling Unicode:: modules, in an untrusted
  plperl.

and then :

- provide the data in tables (system and/or user) - which should be
  available across databases,

- load the data from the original text files provided in Unicode (and
  others as needed), if the functionality is compiled into the server.

- I believe the basic unicode information should be standard, and the
  locales should be provided as contrib/ files to be plugged in as
  needed.

I can't really accept a solution which would rely on the underlying
libc, as it may not provide the necessary locales (or maybe, then,
have a way to override the collating tables by user tables - actually,
this would certainly be the best solution if it's in the core, as the
tables will put an extra burden on the distribution and the
installation footprint, especially if the tables are already there,
for glibc, for perl5.6+, for other software dealing with Unicode).

The main functions I foresee are :

- provide a normalisation function to all 4 forms,

- provide a collation_key(text, language) function, as the calculation
  of the key may be expensive, some may want to index on the result (I
  would :) ),

- provide a collation algorithm, using the two previous facilities,
  which can do primary to tertiary collation (cf TR#10 for a detailed
  explanation).

I haven't looked at PostgreSQL code yet (shame !), so I may be
completely off-track, in which case I'll retract myself and won't
bother you again (on that subject, that is ;) )...

Comments ?

Patrice.

[1] http://www.unicode.org/unicode/reports/tr10/

[2] http://www.unicode.org/unicode/reports/tr15/

--
Patrice HÉDÉ ------------------------------- patrice à islande.org -----
-- Isn't it weird how scientists can imagine all the matter of the
universe exploding out of a dot smaller than the head of a pin, but they
can't come up with a more evocative name for it than "The Big Bang" ?
 -- What would _you_ call the creation of the universe?
 -- "The HORRENDOUS SPACE KABLOOIE !" - Calvin and Hobbes
------------------------------------------ http://www.islande.org/ -----
> So, this shows two problems :
>
> - length() on the server side doesn't handle correctly Unicode [I have
>   the same result with char_length()], and returns the number of chars
>   (as it is however advertised to do), rather the length of the
>   string.

This is a known limitation.

> - the psql frontend makes the same mistake.
>
> I am using version 7.1.3 (debian sid), so it may have been corrected
> in the meantime (in this case, I apologise, but I have only recently
> started again to use PostgreSQL and I haven't followed -hackers long
> enough).
>
> => I think fixing psql shouldn't be too complicated, as the glibc
> should be providing the locale, and return the right values (is this
> the case ? and what happens for combined latin + chinese characters
> for example ? I'll have to try that later). If it's not fixed already,
> do you want me to look at this ? [it will take some time, as I haven't
> set up any development environment for postgres yet, and I'm away for
> one week from thursday].

Sounds great.

> I was wondering if some people have already thought about this, or
> already done something, or if some of you are interested in this. If
> nobody does anything, I'll do something eventually, probably before
> Christmas (I don't have much time for this, and I don't need the
> functionality right now), but if there is an interest, I could team
> with others and develop it faster :)

I'm very interested in your point. I will start studying [1][2] after
the beta freeze.

> Anyway, I'm open to suggestions :
>
> - implement it in C, in the core,
>
> - implement it in C, as contributed custom functions,

This may be a good starting point.

> I can't really accept a solution which would rely on the underlaying
> libc, as it may not provide the necessary locales (or maybe, then,

I totally agree here.

> The main functions I foresee are :
>
> - provide a normalisation function to all 4 forms,
>
> - provide a collation_key(text, language) function, as the calculation
>   of the key may be expensive, some may want to index on the result (I
>   would :) ),
>
> - provide a collation algorithm, using the two previous facilities,
>   which can do primary to tertiary collation (cf TR#10 for a detailed
>   explanation).
>
> I haven't looked at PostgreSQL code yet (shame !), so I may be
> completely off-track, in which case I'll retract myself and won't
> bother you again (on that subject, that is ;) )...
>
> Comments ?

--
Tatsuo Ishii
Looks like a good project for 7.3. Probably the best starting point
would be to develop contrib/unicode with a smooth transition to core.

	Oleg

On Mon, 24 Sep 2001, Patrice [iso-8859-15] Hédé wrote:

> Hi all,
>
> while working on a new project involving PostgreSQL and making some
> tests, I have come up with the following output from psql :
>
>  lang | length | length | text      | text
> ------+--------+--------+-----------+-----------
>  isl  |      7 |      6 | álíta     | áleit
>  isl  |      7 |      7 | álíta     | álitum
>  isl  |      7 |      7 | álíta     | álitið
>  isl  |      5 |      4 | maður     | mann
>  isl  |      5 |      7 | maður     | mönnum
>  isl  |      5 |      5 | maður     | manna
>  isl  |      5 |      4 | óska      | -aði
>
> [the misalignment is what I got, it's not a copy-paste error]
>
> [...]

	Regards,
		Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
Hi,

* Tatsuo Ishii <t-ishii@sra.co.jp> [010925 18:18]:
> > So, this shows two problems :
> >
> > - length() on the server side doesn't handle correctly Unicode [I
> >   have the same result with char_length()], and returns the number
> >   of chars (as it is however advertised to do), rather the length
> >   of the string.
>
> This is a known limitation.

To solve this, we could use wcwidth() (there is a custom
implementation for the systems which don't have it in the glibc). I'll
have a look at it later.

> > - the psql frontend makes the same mistake.

Same thing here. I have just installed the CVS and downloaded the
development version (thanks Baldvin), tested that the stock version
compiles fine, and I'll now have a look at how to make this work. :)

I'll send a patch when I have this working here.

> Sounds great.

[Unicode normalisation and collation in the backend]

> I'm very interested in your point. I will start studying [1][2] after
> the beta freeze.
>
> > Anyway, I'm open to suggestions :
> >
> > - implement it in C, in the core,
> >
> > - implement it in C, as contributed custom functions,
>
> This may be a good starting point.
>
> > I can't really accept a solution which would rely on the underlaying
> > libc, as it may not provide the necessary locales (or maybe, then,
>
> I totally agree here.

As Oleg suggested, I will try to aim for 7.3, first with a version in
contrib, and later, if the implementation is fine, it could be moved
to the core (or not ? Though it would be nice to make sure every
PostgreSQL installation which supports unicode has it, so that users
won't need to have administrative rights to use the functionality).

I think I will go for a C version, and probably the collation and
normalisation data in tables, with some way to override the defaults
with secondary tables... I'll report as soon as I have something +/-
working.

> --
> Tatsuo Ishii

Patrice.

--
Patrice HÉDÉ ------------------------------- patrice à islande org -----
-- Isn't it weird how scientists can imagine all the matter of the
universe exploding out of a dot smaller than the head of a pin, but they
can't come up with a more evocative name for it than "The Big Bang" ?
 -- What would _you_ call the creation of the universe?
 -- "The HORRENDOUS SPACE KABLOOIE !" - Calvin and Hobbes
------------------------------------------ http://www.islande.org/ -----
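A rough sketch of the wcwidth()-based approach mentioned above, written
as a standalone program rather than the actual psql patch; the helper
name display_width() is invented for the illustration. Note that it goes
through mbstowcs()/wcswidth() and therefore relies on the process
locale, which is precisely the dependency questioned in the next
message.

    #define _XOPEN_SOURCE 700   /* for wcswidth() */
    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <wchar.h>

    /* Number of terminal columns a multibyte string occupies, or -1 on
     * a conversion error.  Combining marks have width 0, so "a" followed
     * by U+0301 counts as one column, not two. */
    static int
    display_width(const char *s)
    {
        size_t   n = mbstowcs(NULL, s, 0);  /* length in wide characters */
        wchar_t *ws;
        int      width;

        if (n == (size_t) -1)
            return -1;
        ws = malloc((n + 1) * sizeof(wchar_t));
        if (ws == NULL)
            return -1;
        mbstowcs(ws, s, n + 1);
        width = wcswidth(ws, n);
        free(ws);
        return width;
    }

    int
    main(void)
    {
        setlocale(LC_CTYPE, "");            /* needs a UTF-8 locale */
        printf("%d\n", display_width("\x61\xcc\x81"));  /* prints 1 */
        return 0;
    }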
> > > - length() on the server side doesn't handle correctly Unicode [I
> > >   have the same result with char_length()], and returns the number
> > >   of chars (as it is however advertised to do), rather the length
> > >   of the string.
> >
> > This is a known limitation.
>
> To solve this, we could use wcwidth() (there is a custom
> implementation for the systems which don't have it in the glibc). I'll
> have a look at it later.

And wcwidth() depends on the locale. That is another reason we could
not use it.

> As Oleg suggested, I will try to aim for 7.3, first with a version in
> contrib, and later, if the implementation is fine, it could be moved
> to the core (or not ? Though it would be nice to make sure every
> PostgreSQL installation which supports unicode has it, so that users
> won't need to have administrative rights to use the functionality).

I would like to see SQL99's charset, collate functionality for 7.3 (or
later). If this happens, the current multibyte implementation would be
dramatically changed. That would be a good time to merge your Unicode
stuff into the main source tree.

--
Tatsuo Ishii
> I would like to see SQL99's charset, collate functionality for 7.3 (or
> later). If this happens, current multibyte implementation would be
> dramatically changed...

I'm *still* interested in working on this (an old story I know). I'm
working on date/time stuff for 7.2, but hopefully 7.3 will see some
advances in the SQL99 direction on charset etc.

                      - Thomas
> > I would like to see SQL99's charset, collate functionality for 7.3 (or
> > later). If this happens, current multibyte implementation would be
> > dramatically changed...
>
> I'm *still* interested in working on this (an old story I know). I'm
> working on date/time stuff for 7.2, but hopefully 7.3 will see some
> advances in the SQL99 direction on charset etc.

BTW, I see "CHARACTER SET" in gram.y. Does current already support
that syntax?

--
Tatsuo Ishii
> BTW, I see "CHARACTER SET" in gram.y. Does current already support
> that syntax?

Yes and no. gram.y knows about CHARACTER SET, but only for the long
form, the clause is in the wrong position (it precedes the length
specification) and it does not do much useful (it generates a data type
based on the character set name which does not get recognized farther
back). Examples:

thomas=# create table t1 (c varchar(20) character set sql_ascii);
ERROR:  parser: parse error at or near "character"
thomas=# create table t1 (c character varying character set sql_ascii (20));
ERROR:  Unable to locate type name 'varsql_ascii' in catalog

I'm pretty sure I'll get shift/reduce troubles when trying to move that
clause to *after* the length specifier. I'll try to do something with
the syntax for 7.2 once I've finished the date/time stuff.

                      - Thomas
Can someone give me TODO items for this discussion?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> Can someone give me TODO items for this discussion?

What about:

	Improve Unicode combined character handling

--
Tatsuo Ishii
> > Can someone give me TODO items for this discussion?
>
> What about:
>
> 	Improve Unicode combined character handling

Done. I can't update the web version because I don't have permission.

Also, have we decided if multibyte should be the configure default now?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> Also, have we decided if multibyte should be the configure default now?

Not sure.

Anyway, I have tested LIKE/REGEX queries using current. The queries
executed were:

explain analyze select '0000000 5089 474e...( 16475
bytes long text containing only 0-9a-z chars) like 'aaa';

and

explain analyze select '0000000 5089 474e...( 16475
bytes long text containing only 0-9a-z chars) ~ 'aaa';

Here is the result:

          no MB        with MB
LIKE      0.09 msec    0.08 msec
REGEX     0.09 msec    0.10 msec

LIKE with MB seemed to be reasonably fast, but REGEX with MB seemed a
little bit slow. Probably this is due to the wide character conversion
overhead.

--
Tatsuo Ishii
If no one can find a case where multibyte is slower, I think we should
enable it by default. Comments?

> > Also, have we decided if multibyte should be the configure default now?
>
> Not sure.
>
> Anyway I have tested LIKE/REGEX query test using current. The query
> executed is:
>
> explain analyze select '0000000 5089 474e...( 16475
> bytes long text containing only 0-9a-z chars) like 'aaa';
>
> and
>
> explain analyze select '0000000 5089 474e...( 16475
> bytes long text containing only 0-9a-z chars) ~ 'aaa';
>
> Here is the result:
>
>           no MB        with MB
> LIKE      0.09 msec    0.08 msec
> REGEX     0.09 msec    0.10 msec
>
> LIKE with MB seemed to be resonably fast, but REGEX with MB seemed a
> little bit slow. Probably this is due the wide character conversion
> overhead.
> --
> Tatsuo Ishii

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> If no one can find a case where multibyte is slower, I think we should
> enable it by default. Comments?

Well, he just did point out such a case:

>>           no MB        with MB
>> LIKE      0.09 msec    0.08 msec
>> REGEX     0.09 msec    0.10 msec

But I agree with your conclusion. If the worst penalty we can find is
that a regex comparison operator is 10% slower, we may as well turn it
on by default. Most people will never notice the difference, and anyone
who really cares can always turn it off again.

			regards, tom lane
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > If no one can find a case where multibyte is slower, I think we should
> > enable it by default. Comments?
>
> Well, he just did point out such a case:
>
> >>           no MB        with MB
> >> LIKE      0.09 msec    0.08 msec
> >> REGEX     0.09 msec    0.10 msec
>
> But I agree with your conclusion. If the worst penalty we can find is
> that a regex comparison operator is 10% slower, we may as well turn it
> on by default. Most people will never notice the difference, and anyone
> who really cares can always turn it off again.

But the strange thing is that LIKE is faster, perhaps meaning his
measurements can't even see the difference, or is it because the LIKE
optimization is off for multibyte?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> But the strange thing is that LIKE is faster, perhaps meaning his
> measurements can't even see the difference,

Yeah, I suspect there's 10% or more noise in these numbers. But then
one could read the results as saying we can't reliably measure any
difference at all ...

I'd feel more confident if the measurements were done using operators
repeated enough times to yield multiple-second runtimes. I don't trust
fractional-second time measurements on Unix boxen; too much chance of
bogus results due to activity of other processes.

			regards, tom lane
Tatsuo Ishii writes:

> LIKE with MB seemed to be resonably fast, but REGEX with MB seemed a
> little bit slow. Probably this is due the wide character conversion
> overhead.

Could this conversion be optimized to recognize when it's dealing with a
single-byte character encoding?

--
Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter
> Yeah, I suspect there's 10% or more noise in these numbers. But then
> one could read the results as saying we can't reliably measure any
> difference at all ...
>
> I'd feel more confident if the measurements were done using operators
> repeated enough times to yield multiple-second runtimes. I don't
> trust fractional-second time measurements on Unix boxen; too much chance
> of bogus results due to activity of other processes.

Any idea to do that? I tried to do a measurement using something like
"SELECT * FROM t1 WHERE very-long-string-column LIKE 'aaa'", but I'm
afraid the I/O time masks the difference...

--
Tatsuo Ishii
> > LIKE with MB seemed to be resonably fast, but REGEX with MB seemed a
> > little bit slow. Probably this is due the wide character conversion
> > overhead.
>
> Could this conversion be optimized to recognize when it's dealing with a
> single-byte character encoding?

Not sure, will look into...

--
Tatsuo Ishii
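One possible shape of the optimization Peter asks about, sketched as a
small standalone program; the encoding table and helper names below are
invented for the illustration and are not the backend's real data
structures. The idea is simply that when the database encoding's maximum
character length is one byte, byte and character positions coincide, so
the wide-character conversion can be skipped.

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical per-encoding table: the maximum number of bytes one
     * character can occupy.  The backend keeps equivalent information
     * in its encoding tables; the names here are made up. */
    typedef struct
    {
        const char *name;
        int         max_bytes;
    } encoding_info;

    static const encoding_info encodings[] = {
        {"SQL_ASCII", 1},
        {"LATIN1",    1},
        {"EUC_JP",    3},
        {"UNICODE",   3},    /* UTF-8 as supported at the time */
    };

    /* Single-byte encodings never need the conversion. */
    static bool
    needs_wide_conversion(const encoding_info *enc)
    {
        return enc->max_bytes > 1;
    }

    int
    main(void)
    {
        size_t i;

        for (i = 0; i < sizeof(encodings) / sizeof(encodings[0]); i++)
            printf("%-10s convert to wide chars? %s\n", encodings[i].name,
                   needs_wide_conversion(&encodings[i]) ? "yes" : "no");
        return 0;
    }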
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
>> I'd feel more confident if the measurements were done using operators
>> repeated enough times to yield multiple-second runtimes.

> Any idea to do that?

Maybe something like this: declare a plpgsql function that takes two
text parameters and has a body like

	for (i = 0 to a million)
		boolvar := $1 like $2;

Then call it with strings of different lengths and see how the runtime
varies. You need to apply the LIKE to function parameters, else the
system will probably collapse the LIKE operation to a constant...

			regards, tom lane
> Maybe something like this: declare a plpgsql function that takes two
> text parameters and has a body like
>
> 	for (i = 0 to a million)
> 		boolvar := $1 like $2;
>
> Then call it with strings of different lengths and see how the runtime
> varies. You need to apply the LIKE to function parameters, else the
> system will probably collapse the LIKE operation to a constant...

Good idea. I did tests for both LIKE and REGEX using PL/pgSQL functions
(see source code below). Here are the results. What I did was calling
the functions while changing the target strings from 32 bytes to 8192
bytes. Times are all in msec.

(1) LIKE

bytes   Without MB   With MB
32      8121.94      8094.73
64      8167.98      8105.24
128     8151.30      8108.61
256     8090.12      8098.20
512     8111.05      8101.07
1024    8110.49      8099.61
2048    8095.32      8106.00
4096    8094.88      8091.19
8192    8123.02      8121.63

(2) REGEX

bytes   Without MB   With MB
32       117.93       119.47
64       126.41       127.61
128      143.97       146.55
256      180.49       183.69
512      255.53       256.16
1024     410.59       409.22
2048    5176.38      5181.99
4096    6000.82      5627.84
8192    6529.15      6547.10

------------- shell script -------------------
for i in 32 64 128 256 512 1024 2048 4096 8192
do
	psql -c "explain analyze select liketest(a,'aaa') from
	(select substring('very_long_text' from 0 for $i) as a) as a" test
done
------------- shell script -------------------

------------- functions -----------------
drop function liketest(text,text);
create function liketest(text,text) returns bool as '
declare
	i int;
	rtn boolean;
begin
	i := 1000000;
	while i > 0 loop
		rtn := $1 like $2;
		i := i - 1;
	end loop;
	return rtn;
end;
' language 'plpgsql';

drop function regextest(text,text);
create function regextest(text,text) returns bool as '
declare
	i int;
	rtn boolean;
begin
	i := 10000;
	while i > 0 loop
		rtn := $1 ~ $2;
		i := i - 1;
	end loop;
	return rtn;
end;
' language 'plpgsql';
------------- functions -----------------
> ------------- shell script -------------------
> for i in 32 64 128 256 512 1024 2048 4096 8192
> do
> 	psql -c "explain analyze select liketest(a,'aaa') from
> 	(select substring('very_long_text' from 0 for $i) as a) as a" test
> done
> ------------- shell script -------------------

I don't think your search string is sufficient for a test.
With 'aaa' it actually knows that it only needs to look at the
first three characters of a. Imho you need to try something
like liketest(a,'%aaa%').

Andreas
> I don't think your search string is sufficient for a test.
> With 'aaa' it actually knows that it only needs to look at the
> first three characters of a. Imho you need to try something
> like liketest(a,'%aaa%').

Ok. I ran the modified test (now the iteration count is reduced to
100000 in liketest()). As you can see, there's a huge difference. MB
seems up to ~8 times slower :-< There seem to be some problems in the
implementation. Considering REGEX is not so slow, maybe we should
employ the same design as REGEX, i.e. using wide characters, not
multibyte streams...

MB+LIKE
Total runtime: 1321.58 msec
Total runtime: 1718.03 msec
Total runtime: 2519.97 msec
Total runtime: 4187.05 msec
Total runtime: 7629.24 msec
Total runtime: 14456.45 msec
Total runtime: 17320.14 msec
Total runtime: 17323.65 msec
Total runtime: 17321.51 msec

noMB+LIKE
Total runtime: 964.90 msec
Total runtime: 993.09 msec
Total runtime: 1057.40 msec
Total runtime: 1192.68 msec
Total runtime: 1494.59 msec
Total runtime: 2078.75 msec
Total runtime: 2328.77 msec
Total runtime: 2326.38 msec
Total runtime: 2330.53 msec

--
Tatsuo Ishii
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> ... There seems some problems existing in the
> implementation. Considering REGEX is not so slow, maybe we should
> employ the same design as REGEX. i.e. using wide charcters, not
> multibyte streams...

Seems like a good thing to put on the to-do list. In the meantime, we
still have the question of whether to enable multibyte in the default
configuration. I'd still vote YES, as these results seem to me to
demonstrate that there is no wide-ranging performance penalty. A
problem confined to LIKE on long strings isn't a showstopper IMHO.

			regards, tom lane
> Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> > ... There seems some problems existing in the
> > implementation. Considering REGEX is not so slow, maybe we should
> > employ the same design as REGEX. i.e. using wide charcters, not
> > multibyte streams...
>
> Seems like a good thing to put on the to-do list. In the meantime,
> we still have the question of whether to enable multibyte in the
> default configuration. I'd still vote YES, as these results seem
> to me to demonstrate that there is no wide-ranging performance penalty.
> A problem confined to LIKE on long strings isn't a showstopper IMHO.

As I said, with a valid not anchored LIKE expression the performance
difference was substantial; even for shorter strings it was 37%.
The test with "like 'aaa'" was not a good test case, and we should
not deduce anything from that.

Andreas
> Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> > ... There seems some problems existing in the
> > implementation. Considering REGEX is not so slow, maybe we should
> > employ the same design as REGEX. i.e. using wide charcters, not
> > multibyte streams...
>
> Seems like a good thing to put on the to-do list. In the meantime,
> we still have the question of whether to enable multibyte in the
> default configuration. I'd still vote YES, as these results seem
> to me to demonstrate that there is no wide-ranging performance penalty.
> A problem confined to LIKE on long strings isn't a showstopper IMHO.

Added to TODO:

	* Use wide characters to evaluate regular expressions, for performance
	  (Tatsuo)

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Added to TODO:
> 	* Use wide characters to evaluate regular expressions, for performance
> 	  (Tatsuo)

Regexes are fine; it's LIKE that's slow.

			regards, tom lane
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Added to TODO:
> > 	* Use wide characters to evaluate regular expressions, for performance
> > 	  (Tatsuo)
>
> Regexes are fine; it's LIKE that's slow.

Oops, thanks. Changed to:

	* Use wide characters to evaluate LIKE, for performance (Tatsuo)

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Tom Lane writes:

> In the meantime, we still have the question of whether to enable
> multibyte in the default configuration.

This would make more sense if all of multibyte, locale, and NLS became
defaults in one release. I haven't quite sold people on the second item
yet, although I have a design how to do that (see below). And the
third, well, who knows...

Perhaps we could make it a release goal for 7.3 to

* Optimize i18n stuff to have a minimal performance penalty when it's
  not used. (locale=C etc.)

* Make i18n stuff sufficiently well-behaved to make it the default.
  (Especially, add initdb options and GUC parameters to set the locale.
  Don't rely on environment variables -- too complicated.)

Meanwhile, quadratic performance penalties (or so it seems) for LIKE
expressions aren't exactly a "minor" problem.

--
Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter
Peter Eisentraut <peter_e@gmx.net> writes:
> Tom Lane writes:
>> In the meantime, we still have the question of whether to enable
>> multibyte in the default configuration.

> Perhaps we could make it a release goal for 7.3

Yeah, that's probably the best way to proceed ... it's awfully late in
the 7.2 cycle to be deciding to do this now ...

			regards, tom lane
> Tom Lane writes:
>
> > In the meantime, we still have the question of whether to enable
> > multibyte in the default configuration.
>
> This would make more sense if all of multibyte, locale, and NLS became
> defaults in one release. I haven't quite sold people on the second item
> yet, although I have a design how to do that (see below). And the third,
> well who knows...
>
> Perhaps we could make it a release goal for 7.3 to
>
> * Optimize i18n stuff to have a minimal performance penalty when it's
>   not used. (locale=C etc.)
>
> * Make i18n stuff sufficiently well-behaved to make it the default.
>   (Especially, add initdb options and GUC parameters to set the locale.
>   Don't rely on environment variables -- too complicated.)
>
> Meanwhile, quadratic performance penalties (or so it seems) for LIKE
> expressions aren't exactly a "minor" problem.

Added to TODO:

	* Optimize locale to have minimal performance impact when not used
	  (Peter E)

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> Ok. I ran the modified test (now the iteration is reduced to 100000 in
> liketest()). As you can see, there's huge difference. MB seems up to
> ~8 times slower:-< There seems some problems existing in the
> implementation. Considering REGEX is not so slow, maybe we should
> employ the same design as REGEX. i.e. using wide charcters, not
> multibyte streams...
>
> MB+LIKE
> Total runtime: 1321.58 msec
> Total runtime: 1718.03 msec
> Total runtime: 2519.97 msec
> Total runtime: 4187.05 msec
> Total runtime: 7629.24 msec
> Total runtime: 14456.45 msec
> Total runtime: 17320.14 msec
> Total runtime: 17323.65 msec
> Total runtime: 17321.51 msec
>
> noMB+LIKE
> Total runtime: 964.90 msec
> Total runtime: 993.09 msec
> Total runtime: 1057.40 msec
> Total runtime: 1192.68 msec
> Total runtime: 1494.59 msec
> Total runtime: 2078.75 msec
> Total runtime: 2328.77 msec
> Total runtime: 2326.38 msec
> Total runtime: 2330.53 msec

I did some trials with a wide character implementation and saw
virtually no improvement. My guess is that the logic employed in LIKE
is too simple to hide the overhead of the multibyte and wide character
conversion. The reason why REGEX with MB is not so slow would be the
complexity of its logic, I think. As you can see in my previous
postings, the $1 ~ $2 operation (which is logically the same as a LIKE
'%a%') is, for example, almost 80 times slower than LIKE (remember that
liketest() loops 10 times more than regextest()).

So I decided to use a completely different approach. Now LIKE has two
matching engines, one for single byte encodings (MatchText()), the
other for multibyte ones (MBMatchText()). MatchText() is identical to
the non-MB version of it, so there is virtually no performance penalty
for single byte encodings. MBMatchText() is for multibyte encodings and
is identical to the one used in 7.1.

Here is the MB case result with SQL_ASCII encoding:

Total runtime: 901.69 msec
Total runtime: 939.08 msec
Total runtime: 993.60 msec
Total runtime: 1148.18 msec
Total runtime: 1434.92 msec
Total runtime: 2024.59 msec
Total runtime: 2288.50 msec
Total runtime: 2290.53 msec
Total runtime: 2316.00 msec

To accomplish this, I moved MatchText etc. to a separate file and now
like.c includes it *twice* (a similar technique is used in regexec()).
This makes like.o a little bit larger, but I believe this is worth it
for the optimization.

--
Tatsuo Ishii
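For readers unfamiliar with the trick Tatsuo describes, here is a
compilable toy version. The real change generates the two engines by
#including one template file twice with different macros defined; the
sketch below collapses that into a single macro so it fits in one file,
and every identifier in it (CharCountSB, CharCountMB, sb_len, mb_len)
is invented for the illustration rather than taken from like.c.

    #include <stdio.h>

    /* One body, written once; expanded twice below into a single-byte
     * engine and a multibyte engine. */
    #define DEFINE_CHAR_COUNTER(NAME, CHARLEN)      \
    static int NAME(const char *s)                  \
    {                                               \
        int n = 0;                                  \
        while (*s)                                  \
        {                                           \
            s += CHARLEN(s);                        \
            n++;                                    \
        }                                           \
        return n;                                   \
    }

    /* single-byte rule: every byte is one character */
    static int sb_len(const char *s) { (void) s; return 1; }

    /* multibyte rule: bytes in one UTF-8 sequence, from its lead byte
     * (assumes well-formed input) */
    static int mb_len(const char *s)
    {
        unsigned char c = (unsigned char) *s;
        if (c < 0x80) return 1;
        if (c < 0xe0) return 2;
        if (c < 0xf0) return 3;
        return 4;
    }

    DEFINE_CHAR_COUNTER(CharCountSB, sb_len)   /* role of MatchText()   */
    DEFINE_CHAR_COUNTER(CharCountMB, mb_len)   /* role of MBMatchText() */

    int main(void)
    {
        const char *s = "\x61\xcc\x81";        /* 'a' + combining acute */
        printf("bytes: %d, UTF-8 characters: %d\n",
               CharCountSB(s), CharCountMB(s));    /* prints 3 and 2 */
        return 0;
    }

The point is the same either way: one matching body is compiled twice,
once advancing byte by byte and once advancing a whole multibyte
character at a time, so the single-byte path pays no conversion
overhead, which is why the SQL_ASCII numbers above come back down to
the no-MB level.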
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> To accomplish this, I moved MatchText etc. to a separate file and now
> like.c includes it *twice* (similar technique used in regexec()). This
> makes like.o a little bit larger, but I believe this is worth for the
> optimization.

That sounds great.

What's your feeling now about the original question: whether to enable
multibyte by default now, or not? I'm still thinking that Peter's
counsel is the wisest: plan to do it in 7.3, not today. But this fix
seems to eliminate the only hard reason we have not to do it today ...

			regards, tom lane
> What's your feeling now about the original question: whether to enable
> multibyte by default now, or not? I'm still thinking that Peter's
> counsel is the wisest: plan to do it in 7.3, not today. But this fix
> seems to eliminate the only hard reason we have not to do it today ...

If SQL99's i18n stuff were added in 7.3, it would mean both the
multibyte support and the locale support might disappear, and the two
might be merged into it (not sure about NLS -- is that compatible with,
for example, per-column charsets?). So, for non-multibyte users, the
multibyte support would be a one-release-only functionality: it
suddenly appears in 7.2 and disappears in 7.3. I'm afraid this would
cause more trouble than usefulness. What do you think?

--
Tatsuo Ishii
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> What do you think?

I think that we were supposed to go beta a month ago, and so this is
no time to start adding new features to this release. Let's plan to
make this happen (one way or the other) in 7.3, instead.

			regards, tom lane
> I think that we were supposed to go beta a month ago, and so this is
> no time to start adding new features to this release. Let's plan to
> make this happen (one way or the other) in 7.3, instead.

Agreed.

--
Tatsuo Ishii
> > Ok. I ran the modified test (now the iteration is reduced to 100000 in
> > liketest()). As you can see, there's huge difference. MB seems up to
> > ~8 times slower:-< There seems some problems existing in the
> > implementation. Considering REGEX is not so slow, maybe we should
> > employ the same design as REGEX. i.e. using wide charcters, not
> > multibyte streams...

Let me add I think our regex code is very slow. It is the standard BSD
regex library by Henry Spencer. He rewrote it a few years ago for TCL
8.X and said he was working on a standalone library version. I have
asked him several times via email over the years but he still has not
released a standalone version of the new optimized regex code. It is on
our TODO list.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
* Bruce Momjian <pgman@candle.pha.pa.us> [011011 22:49]:
>
> Can I ask about the status of this?

I have sent a patch a few days ago solving the client-side issue (on
the pgsql-patches mailing list) for review. I think Tatsuo said it
looked OK, however he should confirm/infirm this.

There is still the issue about unicode characters which have code
points above U00FFFF, which probably should be rejected on the server
side. I have yet to update my patch for that. I'll probably do that
tomorrow, as I don't have more time tonight, but I think this will be
trivial, so maybe Tatsuo can do it, if he has some time before that :)

If there are other issues, I'd like to know :)

Regarding the implementation of Unicode functionality (normalisation,
collation, Unicode-aware regexes, uc/lc/tc (title-case) functions, ...)
on the server side, it's definitely something for 7.3 (though it might
be available sooner). It will probably be just a contributed extension
first. I'm currently making an alpha version of the project I'm working
on in order to have sufficient "real-life" Unicode data to work with,
and make sure the design choices make sense :)

Patrice.

BTW, I tried to find web-accessible archives of pgsql-patches, are
there some, or should each and every discussion be followed up on
pgsql-hackers (even though the description for pgsql-patches includes
discussions on patches) ?

> > Hi all,
> >
> > while working on a new project involving PostgreSQL and making some
> > tests, I have come up with the following output from psql :
> >
> > [...]

--
Patrice Hédé   email: patrice hede à islande org
www : http://www.islande.org/
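The check Patrice mentions (rejecting code points above U+FFFF) could
look roughly like the standalone sketch below; this is an illustration,
not the actual patch. It assumes well-formed UTF-8 and relies on the
fact that, at the time, the server's UTF-8 support only covered
characters encoded in up to three bytes, so any four-byte lead byte
marks a character above U+FFFF.

    #include <stdio.h>

    /* Returns 1 if every character in the (well-formed) UTF-8 string is
     * at most U+FFFF, i.e. encoded in at most three bytes; returns 0 as
     * soon as a four-byte lead byte (U+10000 and above) is seen. */
    static int
    utf8_within_bmp(const unsigned char *s)
    {
        while (*s)
        {
            if (*s < 0x80)
                s += 1;             /* ASCII           */
            else if ((*s & 0xe0) == 0xc0)
                s += 2;             /* U+0080 - U+07FF */
            else if ((*s & 0xf0) == 0xe0)
                s += 3;             /* U+0800 - U+FFFF */
            else
                return 0;           /* U+10000 and up: reject */
        }
        return 1;
    }

    int
    main(void)
    {
        /* "á" (two bytes) passes; U+1D11E (a four-byte character) does not */
        printf("%d\n", utf8_within_bmp((const unsigned char *) "\xc3\xa1"));          /* 1 */
        printf("%d\n", utf8_within_bmp((const unsigned char *) "\xf0\x9d\x84\x9e"));  /* 0 */
        return 0;
    }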
Can I ask about the status of this?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> * Bruce Momjian <pgman@candle.pha.pa.us> [011011 22:49]:
> >
> > Can I ask about the status of this?
>
> I sent a patch a few days ago solving the client-side issue (on the
> pgsql-patches mailing list) for review. I think Tatsuo said it looked
> OK, but he should confirm or deny this.

I've been waiting for Peter's opinion. His understanding of psql is much
better than mine.

> There is still the issue of Unicode characters whose code points are
> above U+FFFF, which should probably be rejected on the server side. I
> have yet to update my patch for that. I'll probably do it tomorrow, as
> I don't have more time tonight, but I think it will be trivial, so
> maybe Tatsuo can do it if he has some time before then :)

Rejecting characters over U+FFFF is considered a bug fix, which means we
can still fix it after the beta test begins :-)

BTW, have you tried my updates for supporting ISO 8859 characters?
Please let me know if you have trouble, especially with the "euro". I
don't understand any of these charsets...
--
Tatsuo Ishii
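For illustration, here is a minimal standalone C sketch of the check
being discussed: in UTF-8, any character at or above U+10000 is encoded
on four bytes, so it is enough to inspect the lead byte of each
sequence. The file and function names are made up for the example and
are not the code from the patch; continuation bytes are not validated
here.

    /* utf8_reject_astral.c -- illustrative sketch, not the backend code.
     * Reports whether a UTF-8 string contains a character encoded on four
     * or more bytes, i.e. a code point >= U+10000.
     */
    #include <stdio.h>
    #include <stddef.h>

    static int
    utf8_char_len(unsigned char lead)
    {
        if (lead < 0x80) return 1;           /* ASCII */
        if ((lead & 0xE0) == 0xC0) return 2;
        if ((lead & 0xF0) == 0xE0) return 3;
        if ((lead & 0xF8) == 0xF0) return 4; /* code point >= U+10000 */
        return -1;                           /* invalid lead byte */
    }

    /* Returns 1 if every character is below U+10000, 0 otherwise
     * (a four-byte or invalid sequence was found). */
    static int
    utf8_below_10000(const unsigned char *s, size_t len)
    {
        size_t i = 0;

        while (i < len)
        {
            int l = utf8_char_len(s[i]);

            if (l < 0 || l > 3 || i + l > len)
                return 0;
            i += l;
        }
        return 1;
    }

    int
    main(void)
    {
        /* U+1D11E MUSICAL SYMBOL G CLEF, encoded on four bytes */
        const unsigned char clef[]    = { 0xF0, 0x9D, 0x84, 0x9E };
        /* a + U+0301 combining acute, both below U+10000 */
        const unsigned char a_acute[] = { 0x61, 0xCC, 0x81 };

        printf("clef accepted:    %d\n", utf8_below_10000(clef, 4));    /* 0 */
        printf("a+acute accepted: %d\n", utf8_below_10000(a_acute, 3)); /* 1 */
        return 0;
    }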
> * Bruce Momjian <pgman@candle.pha.pa.us> [011011 22:49]:
> >
> > Can I ask about the status of this?
>
> I sent a patch a few days ago solving the client-side issue (on the
> pgsql-patches mailing list) for review. I think Tatsuo said it looked
> OK, but he should confirm or deny this.

OK, I saw the client encoding function appear today. I assume Tatsuo and
friends will finish this up.

> Regarding the implementation of Unicode functionality (normalisation,
> collation, Unicode-aware regexes, uc/lc/tc (title-case) functions, ...)
> on the server side, it's definitely something for 7.3 (though it might
> be available sooner). It will probably be just a contributed extension
> first. I'm currently making an alpha version of the project I'm working
> on in order to have sufficient "real-life" Unicode data to work with,
> and to make sure the design choices make sense :)

If you would like to add some TODO items, please let me know. It is good
to document them even if you can't get to them for a while.

> BTW, I tried to find web-accessible archives of pgsql-patches. Are
> there any, or should every discussion be followed up on pgsql-hackers
> (even though the description of pgsql-patches includes discussions on
> patches)?

I use:

    http://fts.postgresql.org/db/mw/

You can discuss on patches or hackers. I usually do patch discussion on
patches unless I need a larger audience for comments.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
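As a rough illustration of what the server-side normalisation function
mentioned above has to do, the toy program below decomposes a single
character: U+00E1 (precomposed á, bytes c3 a1) into U+0061 U+0301
(a + combining acute, bytes 61 cc 81). It is not how a real
implementation would work: NFC/NFD need the full decomposition and
combining-class data from UnicodeData.txt, which is exactly the
data-size question raised earlier in the thread.

    /* toy_normalize.c -- illustration only: a "decomposition table"
     * reduced to one entry. */
    #include <stdio.h>
    #include <string.h>

    static const char *precomposed = "\xc3\xa1";     /* U+00E1 */
    static const char *decomposed  = "\x61\xcc\x81"; /* U+0061 U+0301 */

    /* Replace every precomposed form by base char + combining mark. */
    static void
    to_nfd(const char *in, char *out, size_t outsize)
    {
        size_t o = 0;

        while (*in && o + 4 < outsize)
        {
            if (strncmp(in, precomposed, 2) == 0)
            {
                memcpy(out + o, decomposed, 3);
                o += 3;
                in += 2;
            }
            else
                out[o++] = *in++;
        }
        out[o] = '\0';
    }

    int
    main(void)
    {
        char        buf[64];
        const char *p;

        to_nfd("\xc3\xa1leit", buf, sizeof(buf));  /* "áleit", precomposed */
        printf("NFD bytes:");
        for (p = buf; *p; p++)
            printf(" %02x", (unsigned char) *p);   /* 61 cc 81 6c 65 69 74 */
        printf("\n");
        return 0;
    }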
I have committed part of Patrice's patches with minor fixes. The
uncommitted changes are related to the backend side; the reasons can be
found in the previous discussions (basically, the current regex code
does not support UTF-8 characters >= 0x10000). Instead, pg_verifymbstr()
now rejects UTF-8 characters >= 0x10000.
--
Tatsuo Ishii

> Hi,
>
> I should have sent the patch earlier, but got delayed by other stuff.
> Anyway, here is the patch:
>
> - most of the functionality is only activated when MULTIBYTE is
>   defined,
>
> - it checks for valid UTF-8 characters, client-side only for now, and
>   only on output; you can still send invalid UTF-8 to the server (so
>   it's only partly compliant with Unicode 3.1, but that's better than
>   nothing).
>
> - it formats with the correct number of columns (that's why I made it
>   in the first place, after all), but only for UNICODE. However, the
>   code allows plugging in routines for other encodings, as Tatsuo did
>   for the other multibyte functions.
>
> - it slightly corrects Tatsuo's UTF-8 code to allow Unicode 3.1
>   characters (characters with values >= 0x10000, which are encoded on
>   four bytes).
>
> - it doesn't depend on the locale capabilities of glibc (useful for
>   remote telnet).
>
> I would like somebody to check it closely, as it is my first patch to
> pgsql. Also, I created dummy .orig files so that the two files I
> created are included; I hope that's the right way.
>
> Now, a lot of functionality is NOT included here, but I will keep that
> for 7.3 :) That includes all string checking on the server side (which
> will have to be a bit more optimised ;) ), and the input checking on
> the client side for UTF-8, though that should not be difficult: it's
> just a matter of sending the strings through mbvalidate() before
> sending them to the server. Strong checking of UTF-8 strings is
> mandatory to be compliant with Unicode 3.1+.
>
> Do I have time to look at a patch to include iso-8859-15 for 7.2? The
> euro is coming on 1 January 2002 (before 7.3!) and over 280 million
> people in Europe will need the euro sign, and only iso-8859-15 and
> iso-8859-16 have it (and unfortunately, I don't think all Unices will
> switch to Unicode in the meantime)...
>
> err... yes, I know that not every single person in Europe uses
> PostgreSQL, so it's not exactly 280m, but it's just a matter of
> time! ;)
>
> I'll come back (on pgsql-hackers) later to ask a few questions
> regarding the full Unicode support (normalisation, collation,
> regexes, ...) on the server side :)
>
> Here is the patch!
>
> Patrice.
>
> --
> Patrice HÉDÉ ------------------------------- patrice à islande org -----
> ------------------------------------------ http://www.islande.org/ -----
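The column-width problem the patch addresses can be sketched in a few
lines of standalone C: decode each UTF-8 character and count zero
columns for combining marks, so that "a" + U+0301 occupies one column.
The sketch below treats only the U+0300..U+036F block as combining and
ignores double-width CJK characters, so it is a simplification of what
psql actually needs; the names are illustrative and not taken from the
patch, and invalid input is not handled.

    /* utf8_width.c -- simplified column-width sketch for UTF-8 text. */
    #include <stdio.h>
    #include <stddef.h>

    /* Decode one UTF-8 character, return its code point and advance *i. */
    static unsigned int
    utf8_decode(const unsigned char *s, size_t *i)
    {
        unsigned char c = s[*i];
        unsigned int  cp;

        if (c < 0x80)
        {
            *i += 1;
            return c;
        }
        if ((c & 0xE0) == 0xC0)
        {
            cp = (c & 0x1F) << 6 | (s[*i + 1] & 0x3F);
            *i += 2;
            return cp;
        }
        if ((c & 0xF0) == 0xE0)
        {
            cp = (c & 0x0F) << 12 | (s[*i + 1] & 0x3F) << 6 | (s[*i + 2] & 0x3F);
            *i += 3;
            return cp;
        }
        /* four-byte form, code point >= U+10000 */
        cp = (c & 0x07) << 18 | (s[*i + 1] & 0x3F) << 12
           | (s[*i + 2] & 0x3F) << 6 | (s[*i + 3] & 0x3F);
        *i += 4;
        return cp;
    }

    /* Terminal columns used by a UTF-8 string (combining marks = 0). */
    static int
    utf8_columns(const unsigned char *s, size_t len)
    {
        size_t i = 0;
        int    cols = 0;

        while (i < len)
        {
            unsigned int cp = utf8_decode(s, &i);

            if (cp >= 0x0300 && cp <= 0x036F)   /* combining diacriticals */
                continue;                       /* zero columns */
            cols++;
        }
        return cols;
    }

    int
    main(void)
    {
        /* "álíta" in decomposed form: 7 characters, 9 bytes, 5 columns. */
        const unsigned char word[] = "a\xcc\x81li\xcc\x81ta";

        printf("columns: %d\n", utf8_columns(word, sizeof(word) - 1));  /* 5 */
        return 0;
    }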