Thread: case insensitive match in unicode
I would need to do case insensitive match against a field that contains text in different languages - Greek, Hungarian, Arabic etc. The db encoding is UTF8. So far I found no way to achieve that. I tried converting both strings to the same case and using ~* , but neither worked. Does anybody no a way to do this? Thanks for the help. Balázs
On Mon, Mar 27, 2006 at 11:31:17AM +0200, SunWuKung wrote: > I would need to do case insensitive match against a field that contains > text in different languages - Greek, Hungarian, Arabic etc. > The db encoding is UTF8. > > So far I found no way to achieve that. I tried converting both strings > to the same case and using ~* , but neither worked. Oh, tricky. Firstly, case-insensetive means different things to different locales. For example, in Turkish 'i' is not the lowecase version of 'I'. Maybe you've chosen a locale that doesn't do UTF-8? You don't specify a platform either. Locale support varies wildly by platform. What you probably want it some kind of accent-insensetive match that mean that é, è, ë, e, É, È, E and Ë are all considered to match eachother. The way you do that is by converting unicode to a particular normal form and then comparing. Unfortunatly, I don't think PostgreSQL supplies such a function right now. However, some server-side procedural languages can do this. If you can find one (possibly Perl) that can do the conversion, you can create a function to do the mapping. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
Attachment
In article <20060327094829.GA30791@svana.org>, kleptog@svana.org says... > On Mon, Mar 27, 2006 at 11:31:17AM +0200, SunWuKung wrote: > > I would need to do case insensitive match against a field that contains > > text in different languages - Greek, Hungarian, Arabic etc. > > The db encoding is UTF8. > > > > So far I found no way to achieve that. I tried converting both strings > > to the same case and using ~* , but neither worked. > > Oh, tricky. Firstly, case-insensetive means different things to > different locales. For example, in Turkish 'i' is not the lowecase > version of 'I'. Maybe you've chosen a locale that doesn't do UTF-8? You > don't specify a platform either. Locale support varies wildly by > platform. > > What you probably want it some kind of accent-insensetive match that > mean that é, è, ë, e, É, È, E and Ë are all considered to match > eachother. The way you do that is by converting unicode to a particular > normal form and then comparing. Unfortunatly, I don't think PostgreSQL > supplies such a function right now. > > However, some server-side procedural languages can do this. If you can > find one (possibly Perl) that can do the conversion, you can create a > function to do the mapping. > > Have a nice day, > This sounds like a very interesting concept. It wouldn't be 'case insensitive' just insensitive. The way I imagine it now is a special case of the ~ function. I create matchgroups in a table and check each character if it is in the group. If it is I will replace the character with the group in [éÉE], [oóOÓ??] and do a regexp with that. What do you think? B.
On Mon, Mar 27, 2006 at 12:45:05PM +0200, SunWuKung wrote: > This sounds like a very interesting concept. > It wouldn't be 'case insensitive' just insensitive. > > The way I imagine it now is a special case of the ~ function. > I create matchgroups in a table and check each character if it is in the > group. If it is I will replace the character with the group in [éÉE], > [oóOÓ??] and do a regexp with that. No need to reinvent the wheel. ICU provides a range of services to deal with this. For example the following filter in ICU: NFD; [:Nonspacing Mark:] Remove; NFC. Will remove all accents from characters. And it works for all Unicode characters. With a bit more thinking you can work with case variations also. There is also a locale-independant case-mapping module there plus various locale specific ones also. http://icu.sourceforge.net/userguide/Transform.html http://icu.sourceforge.net/userguide/caseMappings.html http://icu.sourceforge.net/userguide/normalization.html Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
Attachment
In article <20060327114037.GD30791@svana.org>, kleptog@svana.org says... > On Mon, Mar 27, 2006 at 12:45:05PM +0200, SunWuKung wrote: > > This sounds like a very interesting concept. > > It wouldn't be 'case insensitive' just insensitive. > > > > The way I imagine it now is a special case of the ~ function. > > I create matchgroups in a table and check each character if it is in the > > group. If it is I will replace the character with the group in [éÉE], > > [oóOÓ??] and do a regexp with that. > > No need to reinvent the wheel. ICU provides a range of services to deal > with this. For example the following filter in ICU: > > NFD; [:Nonspacing Mark:] Remove; NFC. > > Will remove all accents from characters. And it works for all Unicode > characters. With a bit more thinking you can work with case variations > also. > > There is also a locale-independant case-mapping module there plus > various locale specific ones also. > > http://icu.sourceforge.net/userguide/Transform.html > http://icu.sourceforge.net/userguide/caseMappings.html > http://icu.sourceforge.net/userguide/normalization.html > > Have a nice day, > Thanks, I looked at this and it looks like something that would indeed solve the problem. However I was so far unable to figure out how could I use this from within Postgres. If you have experience with it could you give me an example? Thanks Balázs
On Thu, Apr 06, 2006 at 11:12:26PM +0200, SunWuKung wrote: > > There is also a locale-independant case-mapping module there plus > > various locale specific ones also. > > > > http://icu.sourceforge.net/userguide/Transform.html > > http://icu.sourceforge.net/userguide/caseMappings.html > > http://icu.sourceforge.net/userguide/normalization.html > > > > Have a nice day, > > > Thanks, I looked at this and it looks like something that would indeed > solve the problem. > However I was so far unable to figure out how could I use this from > within Postgres. If you have experience with it could you give me an > example? There are some unofficial ICU patches but I doubt they're still up-to-date. I don't personally use it though maybe someone else here does... Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
Attachment
On 4/6/06, SunWuKung <Balazs.Klein@axelero.hu> wrote: > In article <20060327114037.GD30791@svana.org>, kleptog@svana.org says... > > On Mon, Mar 27, 2006 at 12:45:05PM +0200, SunWuKung wrote: > > > This sounds like a very interesting concept. > > > It wouldn't be 'case insensitive' just insensitive. > > > > > > The way I imagine it now is a special case of the ~ function. > > > I create matchgroups in a table and check each character if it is in the > > > group. If it is I will replace the character with the group in [éÉE], > > > [oóOÓ??] and do a regexp with that. > > > > No need to reinvent the wheel. ICU provides a range of services to deal > > with this. For example the following filter in ICU: > > > > NFD; [:Nonspacing Mark:] Remove; NFC. > > > > Will remove all accents from characters. And it works for all Unicode > > characters. With a bit more thinking you can work with case variations > > also. > > > > There is also a locale-independant case-mapping module there plus > > various locale specific ones also. > > > > http://icu.sourceforge.net/userguide/Transform.html > > http://icu.sourceforge.net/userguide/caseMappings.html > > http://icu.sourceforge.net/userguide/normalization.html > > > > Have a nice day, > > > Thanks, I looked at this and it looks like something that would indeed > solve the problem. > However I was so far unable to figure out how could I use this from > within Postgres. If you have experience with it could you give me an > example? I was looking into creating a Pg function wrapper to some of the ICU stuff, but, to be perfectly honest, I couldn't find an actual API reference for ICU. In any case, you can do this with PL/Perl: CREATE FUNCTION strip_nonspacing_marks ( text ) RETURNS text AS $func$ use Unicode::Normalize; use Encode; my $string = NFD( decode( utf8 => shift() ) ); $string =~ s/\p{Mn}+//ogsm; return NFC($string); $func$ LANGUAGE 'plperl' STRICT; It's untested and won't be as fast as ICU, but it should get the job done. Hope it helps! > > Thanks > Balázs > > ---------------------------(end of broadcast)--------------------------- > TIP 6: explain analyze is your friend > -- Mike Rylander mrylander@gmail.com GPLS -- PINES Development Database Developer http://open-ils.org
This perl function doesn't work for me. I keep getting a Cannot decode string with wide characters at /usr/lib/perl5/5.8.7/i386-linux/Encode.pm line 166 error Do you have any suggestion? Thanks Balázs