Home > mailing lists

Thread: Accent insensitive search?

Accent insensitive search?

From

"cifroes"

Date:

18 March 2009, 13:39:42

Hi,

I have a DB in utf-8 and postgres 8.3.x.

How can I do an accent insensitive search (like ...) ?

TIA

Re: Accent insensitive search?

From

Pedro Doria Meunier

Date:

18 March 2009, 13:46:32

Hi,

use ILIKE

HTH,

Pedro Doria Meunier
GSM: +351961720188
Skype: pdoriam

On Wednesday 18 March 2009 04:29:24 pm cifroes wrote:
> Hi,
>
> I have a DB in utf-8 and postgres 8.3.x.
>
> How can I do an accent insensitive search (like ...) ?
>
>
> TIA

Attachment

signature.asc

Re: Accent insensitive search?

From

"cifroes"

Date:

18 March 2009, 13:51:48

Not case insensitive but accent insensitive :)

And I tried select to_ascii('capo','LATIN1'), to_ascii('çapo','LATIN1') and the results are different....

-----Original Message-----
From: pgsql-general-owner@postgresql.org on behalf of Pedro Doria Meunier
Sent: Wed 3/18/2009 4:46 PM
To: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Accent insensitive search?

Hi,

use ILIKE

HTH,

Pedro Doria Meunier
GSM: +351961720188
Skype: pdoriam

On Wednesday 18 March 2009 04:29:24 pm cifroes wrote:
> Hi,
>
> I have a DB in utf-8 and postgres 8.3.x.
>
> How can I do an accent insensitive search (like ...) ?
>
>
> TIA

Re: Accent insensitive search?

From

Thom Brown

Date:

18 March 2009, 13:53:45

2009/3/18 Pedro Doria Meunier <pdoria@netmadeira.com>

Hi,

use ILIKE

HTH,

ILIKE is only case-insensitive, and won't match accented characters. The only thing I can think of doing is to create a function which will replace characters with their equivalent non-accented counterparts and use that in the query.

Example: SELECT name FROM people WHERE unaccent_string(name) ~* 'Pédro';

The function would perform a reg-ex replace similar to: s/[èééêë]/e

Just an idea

Thom

Re: Accent insensitive search?

From

Thom Brown

Date:

18 March 2009, 14:07:59

Here's an example of a function I might use (although I haven't actually got plperl installed, so can't test it myself, but you'll get the idea:

CREATE OR REPLACE FUNCTION unaccent_string(text) RETURNS text AS $$
my ($input_string) = @_;
$input_string =~ s/[âãäåāăą]/a;
$input_string =~ s/[ÁÂÃÄÅĀĂĄ]/A;
$input_string =~ s/[èééêëēĕėęě]/e;
$input_string =~ s/[ĒĔĖĘĚ]/E;
$input_string =~ s/[ìíîïìĩīĭ]/i;
$input_string =~ s/[ÌÍÎÏÌĨĪĬ]/I;
$input_string =~ s/[óôõöōŏő]/o;
$input_string =~ s/[ÒÓÔÕÖŌŎŐ]/O;
$input_string =~ s/[ùúûüũūŭů]/u;
$input_string =~ s/[ÙÚÛÜŨŪŬŮ]/U;
return $input_string;
$$ LANGUAGE plperl;

Regards

Thom

Re: Accent insensitive search?

From

Sam Mason

Date:

18 March 2009, 14:15:57

On Wed, Mar 18, 2009 at 04:29:24PM -0000, cifroes wrote:
> I have a DB in utf-8 and postgres 8.3.x.
>
> How can I do an accent insensitive search (like ...) ?

No good idea at the moment; I'd somehow expect to find this sort of
normalization in the functionality provided by the "text search" code.
My reasoning being that searching for a word should find it if you're
being accurate or lazy with your accents.  I've not been able to coax it
into doing anything reliable though.

If you can't find anything better in PG; the translate[1] function would
be my best suggestion.  Performance should be better than using regular
expressions.

--
  Sam  http://samason.me.uk/

 [1] http://www.postgresql.org/docs/current/static/functions-string.html

Re: Accent insensitive search?

From

Pedro Doria Meunier

Date:

18 March 2009, 14:17:31

Ooops!
Silly me! I should have read more carefully ... (blush) sorry!

Pedro Doria Meunier
GSM: +351961720188
Skype: pdoriam

On Wednesday 18 March 2009 04:46:16 pm Pedro Doria Meunier wrote:
> Hi,
>
> use ILIKE
>
> HTH,
>
> Pedro Doria Meunier
> GSM: +351961720188
> Skype: pdoriam
>
> On Wednesday 18 March 2009 04:29:24 pm cifroes wrote:
> > Hi,
> >
> > I have a DB in utf-8 and postgres 8.3.x.
> >
> > How can I do an accent insensitive search (like ...) ?
> >
> >
> > TIA

Attachment

signature.asc

Re: Accent insensitive search?

From

Thom Brown

Date:

18 March 2009, 14:24:28

2009/3/18 Sam Mason <sam@samason.me.uk>

If you can't find anything better in PG; the translate[1] function would
be my best suggestion. Performance should be better than using regular
expressions.

Yeah, that does appear to perform better. I tried the following at it worked for me:

CREATE OR REPLACE FUNCTION unaccent_string(text) RETURNS text AS $$
DECLARE
input_string text := $1;
BEGIN

input_string := translate(input_string, 'âãäåāăąÁÂÃÄÅĀĂĄ', 'aaaaaaaaaaaaaaa');
input_string := translate(input_string, 'èééêëēĕėęěĒĔĖĘĚ', 'eeeeeeeeeeeeeee');
input_string := translate(input_string, 'ìíîïìĩīĭÌÍÎÏÌĨĪĬ', 'iiiiiiiiiiiiiiii');
input_string := translate(input_string, 'óôõöōŏőÒÓÔÕÖŌŎŐ', 'ooooooooooooooo');
input_string := translate(input_string, 'ùúûüũūŭůÙÚÛÜŨŪŬŮ', 'uuuuuuuuuuuuuuuu');

return input_string;
END;
$$ LANGUAGE plpgsql;

Regards

Thom

Re: Accent insensitive search?

From

John R Pierce

Date:

18 March 2009, 14:32:54


how about...

select ....   where translate(lower(myfield), 
'âãäåāăąèééêëēĕėęěìíîïìĩīĭóôõöōŏőùúûüũūŭů', 
'aaaaaaaeeeeeeeeeeiiiiiiiiooooooouuuuuuu') = 'stringiwannamatch';


or something like that.   I may have miscounted the vowells in the 'to' 
string :)

Re: Accent insensitive search?

From

Thomas Kellerer

Date:

18 March 2009, 14:34:07

Sam Mason wrote on 18.03.2009 18:15:
> On Wed, Mar 18, 2009 at 04:29:24PM -0000, cifroes wrote:
>> I have a DB in utf-8 and postgres 8.3.x.
>>
>> How can I do an accent insensitive search (like ...) ?
>
> No good idea at the moment; I'd somehow expect to find this sort of
> normalization in the functionality provided by the "text search" code.
> My reasoning being that searching for a word should find it if you're
> being accurate or lazy with your accents.  I've not been able to coax it
> into doing anything reliable though.
>
> If you can't find anything better in PG; the translate[1] function would
> be my best suggestion.  Performance should be better than using regular
> expressions.
>
Are you aware of any plans to add accent-insesitive collations to PG?

Would be great if PG offered a similar flexibility here as other databases.

Thomas

Re: Accent insensitive search?

From

Christophe

Date:

18 March 2009, 14:38:21

What I've done in the past in this situation is to create a separate
field with the text normalized to whatever the search form is (all
lower case, accents stripped, etc.), and then index and search that
from the application.

Although I've not tried it, a functional index that did the same thing
might work, as long as you were careful in how the queries were
written to ensure that it was used.

Re: Accent insensitive search?

From

Oleg Bartunov

Date:

18 March 2009, 15:14:28

On Wed, 18 Mar 2009, cifroes wrote:

> Hi,
>
> I have a DB in utf-8 and postgres 8.3.x.
>
> How can I do an accent insensitive search (like ...) ?

Take a look on text search capability and
http://www.sai.msu.su/~megera/wiki/unaccent
We have patches for CVS HEAD, but unfortunately they will likely
go to the 8.5 release

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: Accent insensitive search?

From

Alvaro Herrera

Date:

18 March 2009, 15:24:59

Thom Brown escribió:
> Here's an example of a function I might use (although I haven't actually got
> plperl installed, so can't test it myself, but you'll get the idea:
>
> CREATE OR REPLACE FUNCTION unaccent_string(text) RETURNS text AS $$
> my ($input_string) = @_;
> $input_string =~ s/[âãäåāăą]/a;
> $input_string =~ s/[ÁÂÃÄÅĀĂĄ]/A;

Hmm, if to_ascii() doesn't work, that's something worth some research.
Maybe the encoding config is broken, for example.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Accent insensitive search?

From

Christophe

Date:

18 March 2009, 15:35:48

On Mar 18, 2009, at 11:24 AM, Alvaro Herrera wrote:
> Hmm, if to_ascii() doesn't work, that's something worth some research.
> Maybe the encoding config is broken, for example.

The docs say to_ascii() only works with LATIN1, LATIN2, LATIN9, and
WIN1250; maybe convert('string', 'UTF-8', 'SQL_ASCII')?

Re: Accent insensitive search?

From

Jasen Betts

Date:

24 March 2009, 06:53:14

On 2009-03-18, cifroes <cifroes@netcabo.pt> wrote:
> This is a multi-part message in MIME format.
>
> ------_=_NextPart_001_01C9A7E6.B32BBA87
> Content-Type: text/plain;
>     charset="iso-8859-1"
> Content-Transfer-Encoding: quoted-printable
>
> Hi,
>
> I have a DB in utf-8 and postgres 8.3.x.=20
>
> How can I do an accent insensitive search (like ...) ?

use a posix regular expression that matches the string you want:

select * from foo where colname ~ '^[CcĆćĈĉĊċČč][ĀāĂăĄąAa][Pp][ŌōŎŏŐő]$';

you could write a function to do the translation.

Re: Accent insensitive search?

From

Stuart Bishop

Date:

30 March 2009, 06:03:39

On Tue, Mar 24, 2009 at 4:53 PM, Jasen Betts <jasen@xnet.co.nz> wrote:
> On 2009-03-18, cifroes <cifroes@netcabo.pt> wrote:
>> This is a multi-part message in MIME format.
>>
>> ------_=_NextPart_001_01C9A7E6.B32BBA87
>> Content-Type: text/plain;
>>       charset="iso-8859-1"
>> Content-Transfer-Encoding: quoted-printable
>>
>> Hi,
>>
>> I have a DB in utf-8 and postgres 8.3.x.=20
>>
>> How can I do an accent insensitive search (like ...) ?
>
> use a posix regular expression that matches the string you want:
>
> select * from foo where colname ~ '^[CcĆćĈĉĊċČč][ĀāĂăĄąAa][Pp][ŌōŎŏŐő]$';
>
> you could write a function to do the translation.

Unicode normalization is an interesting way to strip accents I've found:

>>> import unicodedata
>>>
>>> s = u'Björn'
>>> s = unicodedata.normalize("NFKD", s)
>>> s = ''.join(c for c in s if ord(c) < 127)
>>> print s
Bjorn

You can also use the character names to map many more characters to
the ascii equivalent. A large number of these can me smashed into
ASCII using regular expressions and some manual mappings to map LETTER
THORN -> th, LETTER LATERAL CLICK -> X etc. Just mapping CAPITAL
LETTER XX -> XX and SMALL LETTER XX -> xx seems to get you most of
europe if you special case SHARP S -> ss and THORN -> th.

>>> s = u'ァ'
>>> print unicodedata.name(s)
KATAKANA LETTER SMALL A

--
Stuart Bishop <stuart@stuartbishop.net>
http://www.stuartbishop.net/