Thread: Accent insensitive search?
Hi,
I have a DB in utf-8 and postgres 8.3.x.
How can I do an accent insensitive search (like ...) ?
TIA
Hi, use ILIKE HTH, Pedro Doria Meunier GSM: +351961720188 Skype: pdoriam On Wednesday 18 March 2009 04:29:24 pm cifroes wrote: > Hi, > > I have a DB in utf-8 and postgres 8.3.x. > > How can I do an accent insensitive search (like ...) ? > > > TIA
Attachment
Not case insensitive but accent insensitive :)
And I tried select to_ascii('capo','LATIN1'), to_ascii('çapo','LATIN1') and the results are different....
-----Original Message-----
From: pgsql-general-owner@postgresql.org on behalf of Pedro Doria Meunier
Sent: Wed 3/18/2009 4:46 PM
To: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Accent insensitive search?
Hi,
use ILIKE
HTH,
Pedro Doria Meunier
GSM: +351961720188
Skype: pdoriam
On Wednesday 18 March 2009 04:29:24 pm cifroes wrote:
> Hi,
>
> I have a DB in utf-8 and postgres 8.3.x.
>
> How can I do an accent insensitive search (like ...) ?
>
>
> TIA
2009/3/18 Pedro Doria Meunier <pdoria@netmadeira.com>
ILIKE is only case-insensitive, and won't match accented characters. The only thing I can think of doing is to create a function which will replace characters with their equivalent non-accented counterparts and use that in the query.
Example: SELECT name FROM people WHERE unaccent_string(name) ~* 'Pédro';
The function would perform a reg-ex replace similar to: s/[èééêë]/e
Just an idea
Thom
Hi,
use ILIKE
HTH,
ILIKE is only case-insensitive, and won't match accented characters. The only thing I can think of doing is to create a function which will replace characters with their equivalent non-accented counterparts and use that in the query.
Example: SELECT name FROM people WHERE unaccent_string(name) ~* 'Pédro';
The function would perform a reg-ex replace similar to: s/[èééêë]/e
Just an idea
Thom
Here's an example of a function I might use (although I haven't actually got plperl installed, so can't test it myself, but you'll get the idea:
CREATE OR REPLACE FUNCTION unaccent_string(text) RETURNS text AS $$
my ($input_string) = @_;
$input_string =~ s/[âãäåāăą]/a;
$input_string =~ s/[ÁÂÃÄÅĀĂĄ]/A;
$input_string =~ s/[èééêëēĕėęě]/e;
$input_string =~ s/[ĒĔĖĘĚ]/E;
$input_string =~ s/[ìíîïìĩīĭ]/i;
$input_string =~ s/[ÌÍÎÏÌĨĪĬ]/I;
$input_string =~ s/[óôõöōŏő]/o;
$input_string =~ s/[ÒÓÔÕÖŌŎŐ]/O;
$input_string =~ s/[ùúûüũūŭů]/u;
$input_string =~ s/[ÙÚÛÜŨŪŬŮ]/U;
return $input_string;
$$ LANGUAGE plperl;
Regards
Thom
CREATE OR REPLACE FUNCTION unaccent_string(text) RETURNS text AS $$
my ($input_string) = @_;
$input_string =~ s/[âãäåāăą]/a;
$input_string =~ s/[ÁÂÃÄÅĀĂĄ]/A;
$input_string =~ s/[èééêëēĕėęě]/e;
$input_string =~ s/[ĒĔĖĘĚ]/E;
$input_string =~ s/[ìíîïìĩīĭ]/i;
$input_string =~ s/[ÌÍÎÏÌĨĪĬ]/I;
$input_string =~ s/[óôõöōŏő]/o;
$input_string =~ s/[ÒÓÔÕÖŌŎŐ]/O;
$input_string =~ s/[ùúûüũūŭů]/u;
$input_string =~ s/[ÙÚÛÜŨŪŬŮ]/U;
return $input_string;
$$ LANGUAGE plperl;
Regards
Thom
On Wed, Mar 18, 2009 at 04:29:24PM -0000, cifroes wrote: > I have a DB in utf-8 and postgres 8.3.x. > > How can I do an accent insensitive search (like ...) ? No good idea at the moment; I'd somehow expect to find this sort of normalization in the functionality provided by the "text search" code. My reasoning being that searching for a word should find it if you're being accurate or lazy with your accents. I've not been able to coax it into doing anything reliable though. If you can't find anything better in PG; the translate[1] function would be my best suggestion. Performance should be better than using regular expressions. -- Sam http://samason.me.uk/ [1] http://www.postgresql.org/docs/current/static/functions-string.html
Ooops! Silly me! I should have read more carefully ... (blush) sorry! Pedro Doria Meunier GSM: +351961720188 Skype: pdoriam On Wednesday 18 March 2009 04:46:16 pm Pedro Doria Meunier wrote: > Hi, > > use ILIKE > > HTH, > > Pedro Doria Meunier > GSM: +351961720188 > Skype: pdoriam > > On Wednesday 18 March 2009 04:29:24 pm cifroes wrote: > > Hi, > > > > I have a DB in utf-8 and postgres 8.3.x. > > > > How can I do an accent insensitive search (like ...) ? > > > > > > TIA
Attachment
2009/3/18 Sam Mason <sam@samason.me.uk>
If you can't find anything better in PG; the translate[1] function would
be my best suggestion. Performance should be better than using regular
expressions.
Yeah, that does appear to perform better. I tried the following at it worked for me:
CREATE OR REPLACE FUNCTION unaccent_string(text) RETURNS text AS $$
DECLARE
input_string text := $1;
BEGIN
input_string := translate(input_string, 'âãäåāăąÁÂÃÄÅĀĂĄ', 'aaaaaaaaaaaaaaa');
input_string := translate(input_string, 'èééêëēĕėęěĒĔĖĘĚ', 'eeeeeeeeeeeeeee');
input_string := translate(input_string, 'ìíîïìĩīĭÌÍÎÏÌĨĪĬ', 'iiiiiiiiiiiiiiii');
input_string := translate(input_string, 'óôõöōŏőÒÓÔÕÖŌŎŐ', 'ooooooooooooooo');
input_string := translate(input_string, 'ùúûüũūŭůÙÚÛÜŨŪŬŮ', 'uuuuuuuuuuuuuuuu');
return input_string;
END;
$$ LANGUAGE plpgsql;
Regards
Thom
how about... select .... where translate(lower(myfield), 'âãäåāăąèééêëēĕėęěìíîïìĩīĭóôõöōŏőùúûüũūŭů', 'aaaaaaaeeeeeeeeeeiiiiiiiiooooooouuuuuuu') = 'stringiwannamatch'; or something like that. I may have miscounted the vowells in the 'to' string :)
Sam Mason wrote on 18.03.2009 18:15: > On Wed, Mar 18, 2009 at 04:29:24PM -0000, cifroes wrote: >> I have a DB in utf-8 and postgres 8.3.x. >> >> How can I do an accent insensitive search (like ...) ? > > No good idea at the moment; I'd somehow expect to find this sort of > normalization in the functionality provided by the "text search" code. > My reasoning being that searching for a word should find it if you're > being accurate or lazy with your accents. I've not been able to coax it > into doing anything reliable though. > > If you can't find anything better in PG; the translate[1] function would > be my best suggestion. Performance should be better than using regular > expressions. > Are you aware of any plans to add accent-insesitive collations to PG? Would be great if PG offered a similar flexibility here as other databases. Thomas
What I've done in the past in this situation is to create a separate field with the text normalized to whatever the search form is (all lower case, accents stripped, etc.), and then index and search that from the application. Although I've not tried it, a functional index that did the same thing might work, as long as you were careful in how the queries were written to ensure that it was used.
On Wed, 18 Mar 2009, cifroes wrote: > Hi, > > I have a DB in utf-8 and postgres 8.3.x. > > How can I do an accent insensitive search (like ...) ? Take a look on text search capability and http://www.sai.msu.su/~megera/wiki/unaccent We have patches for CVS HEAD, but unfortunately they will likely go to the 8.5 release Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
Thom Brown escribió: > Here's an example of a function I might use (although I haven't actually got > plperl installed, so can't test it myself, but you'll get the idea: > > CREATE OR REPLACE FUNCTION unaccent_string(text) RETURNS text AS $$ > my ($input_string) = @_; > $input_string =~ s/[âãäåāăą]/a; > $input_string =~ s/[ÁÂÃÄÅĀĂĄ]/A; Hmm, if to_ascii() doesn't work, that's something worth some research. Maybe the encoding config is broken, for example. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Mar 18, 2009, at 11:24 AM, Alvaro Herrera wrote: > Hmm, if to_ascii() doesn't work, that's something worth some research. > Maybe the encoding config is broken, for example. The docs say to_ascii() only works with LATIN1, LATIN2, LATIN9, and WIN1250; maybe convert('string', 'UTF-8', 'SQL_ASCII')?
On 2009-03-18, cifroes <cifroes@netcabo.pt> wrote: > This is a multi-part message in MIME format. > > ------_=_NextPart_001_01C9A7E6.B32BBA87 > Content-Type: text/plain; > charset="iso-8859-1" > Content-Transfer-Encoding: quoted-printable > > Hi, > > I have a DB in utf-8 and postgres 8.3.x.=20 > > How can I do an accent insensitive search (like ...) ? use a posix regular expression that matches the string you want: select * from foo where colname ~ '^[CcĆćĈĉĊċČč][ĀāĂ㥹Aa][Pp][ŌōŎŏŐő]$'; you could write a function to do the translation.
On Tue, Mar 24, 2009 at 4:53 PM, Jasen Betts <jasen@xnet.co.nz> wrote: > On 2009-03-18, cifroes <cifroes@netcabo.pt> wrote: >> This is a multi-part message in MIME format. >> >> ------_=_NextPart_001_01C9A7E6.B32BBA87 >> Content-Type: text/plain; >> charset="iso-8859-1" >> Content-Transfer-Encoding: quoted-printable >> >> Hi, >> >> I have a DB in utf-8 and postgres 8.3.x.=20 >> >> How can I do an accent insensitive search (like ...) ? > > use a posix regular expression that matches the string you want: > > select * from foo where colname ~ '^[CcĆćĈĉĊċČč][ĀāĂ㥹Aa][Pp][ŌōŎŏŐő]$'; > > you could write a function to do the translation. Unicode normalization is an interesting way to strip accents I've found: >>> import unicodedata >>> >>> s = u'Björn' >>> s = unicodedata.normalize("NFKD", s) >>> s = ''.join(c for c in s if ord(c) < 127) >>> print s Bjorn You can also use the character names to map many more characters to the ascii equivalent. A large number of these can me smashed into ASCII using regular expressions and some manual mappings to map LETTER THORN -> th, LETTER LATERAL CLICK -> X etc. Just mapping CAPITAL LETTER XX -> XX and SMALL LETTER XX -> xx seems to get you most of europe if you special case SHARP S -> ss and THORN -> th. >>> s = u'ァ' >>> print unicodedata.name(s) KATAKANA LETTER SMALL A -- Stuart Bishop <stuart@stuartbishop.net> http://www.stuartbishop.net/