Thread: how to ignore accents?

how to ignore accents?

From
"Celia McInnis"
Date:
I have French text with accents in it and would rather (for search purposes)
be able to search for things with the accents removed. Is there any builtin
postgres function which will do this?

Here is the format of some accented (French) words in my database:

fran\347ais caf\351 t\364\t br\373le r\352ver \342nes b\350gues

which I want to be able to search for as:

francais cafe tot brule rever anes begues

Celia

Re: how to ignore accents?

From
Ennio-Sr
Date:
* Celia McInnis <celia@drmath.ca> [300305, 12:29]:
> I have French text with accents in it and would rather (for search purposes)
> be able to search for things with the accents removed. Is there any builtin
> postgres function which will do this?
>
> Here is the format of some accented (French) words in my database:
>
> fran\347ais caf\351 t\364\t br\373le r\352ver \342nes b\350gues
>
> which I want to be able to search for as:
>
> francais cafe tot brule rever anes begues
>
> Celia
>

Hi Celia,

I'm not an expert, but I'm afraid it's not possible to do that (and I'd
be glad if somebody came out saying I'm wrong as I've the same problem).
However, you can achieve a similar result using _like_ and truncating
the accented words so that only un-accented letters are searched for.

In your example, you would use:

mydb=> select * from your_table where col_title like '%franc%';
mydb=> select * from your_table where col_title like '%caf%';
mydb=> select * from your_table where col_title like '%gues%';
and so on.

Hope that helps.
Regards,
    Ennio.


--
[Perche' usare Win$ozz (dico io) se ..."anche uno sciocco sa farlo.   \\?//
 Fa' qualche cosa di cui non sei capace!"  (diceva Henry Miller) ]    (°|°)
[Why use Win$ozz (I say) if ... "even a fool can do that.              )=(
 Do something you aren't good at!" (as Henry Miller used to say) ]

Re: how to ignore accents?

From
Michael Fuhr
Date:
On Thu, Mar 31, 2005 at 02:01:45AM +0200, Ennio-Sr wrote:
> * Celia McInnis <celia@drmath.ca> [300305, 12:29]:
> > I have French text with accents in it and would rather (for search purposes)
> > be able to search for things with the accents removed. Is there any builtin
> > postgres function which will do this?
> >
> > Here is the format of some accented (French) words in my database:
> >
> > fran\347ais caf\351 t\364\t br\373le r\352ver \342nes b\350gues
> >
> > which I want to be able to search for as:
> >
> > francais cafe tot brule rever anes begues
>
> I'm not an expert, but I'm afraid it's not possible to do that (and I'd
> be glad if somebody came out saying I'm wrong as I've the same problem).

You could write an unaccent() function that calls translate().

  http://www.postgresql.org/docs/8.0/interactive/functions-string.html

Here's an (incomplete) example that works in PostgreSQL 8.0.2beta1:

  CREATE FUNCTION unaccent(text) RETURNS text AS $$
  BEGIN
      RETURN translate($1, '\342\347\350\351\352\364\373', 'aceeeou');
  END;
  $$ LANGUAGE plpgsql IMMUTABLE STRICT;

  SELECT unaccent('fran\347ais');
   unaccent
  ----------
   francais
  (1 row)

You could use unaccent() in a functional index:

  CREATE TABLE words (
      id    serial PRIMARY KEY,
      word  text NOT NULL
  );

  CREATE INDEX words_word_idx ON words (unaccent(word));

  INSERT INTO words (word) VALUES ('fran\347ais');
  INSERT INTO words (word) VALUES ('caf\351');
  INSERT INTO words (word) VALUES ('t\364\t');

  SELECT * FROM words WHERE unaccent(word) = 'francais';
   id |   word
  ----+----------
    1 | français
  (1 row)

EXPLAIN shows that the index is being used:

  EXPLAIN SELECT * FROM words WHERE unaccent(word) = 'francais';
                                    QUERY PLAN
  ------------------------------------------------------------------------------
   Index Scan using words_word_idx on words  (cost=0.00..11.31 rows=6 width=36)
     Index Cond: (unaccent(word) = 'francais'::text)
  (2 rows)

Hope this helps.

--
Michael Fuhr
http://www.fuhr.org/~mfuhr/

Re: how to ignore accents?

From
Jaime Casanova
Date:
On Thu, 31 Mar 2005 02:01:45 +0200, Ennio-Sr <nasr.laili@tin.it> wrote:
> * Celia McInnis <celia@drmath.ca> [300305, 12:29]:
> > I have French text with accents in it and would rather (for search
> purposes)
> > be able to search for things with the accents removed. Is there any
> builtin
> > postgres function which will do this?
> >
> > Here is the format of some accented (French) words in my database:
> >
> > fran\347ais caf\351 t\364\t br\373le r\352ver \342nes b\350gues
> >
> > which I want to be able to search for as:
> >
> > francais cafe tot brule rever anes begues
> >
> > Celia
> >
>
> Hi Celia,
>
> I'm not an expert, but I'm afraid it's not possible to do that (and I'd
> be glad if somebody came out saying I'm wrong as I've the same problem).
>
If you are using LATIN1, LATIN2, LATIN9, or WIN1250 encoding i suppose
you can use the to_ascii() function.

regards,
Jaime Casanova

Re: how to ignore accents?

From
"Celia McInnis"
Date:
Thanks, Michael - I just tried your suggestion in postgresql 8.0.1 and
unfortunately it doesn't work. How long will it be before 8.0.2 moves out of
beta mode?

Here's the error which I get in 8.0.1:

ERROR: invalid byte sequence for encoding "UNICODE": 0xe76169

Is there something which I can do in 8.0.1 to have the unaccenting work?

Celia McInnis

On Wed, 30 Mar 2005 17:42:36 -0700, Michael Fuhr wrote
> You could write an unaccent() function that calls translate().
>
>   http://www.postgresql.org/docs/8.0/interactive/functions-string.html
>
> Here's an (incomplete) example that works in PostgreSQL 8.0.2beta1:
>
>   CREATE FUNCTION unaccent(text) RETURNS text AS $$
>   BEGIN
>       RETURN translate($1, '\342\347\350\351\352\364\373', 'aceeeou')
> ;  END;  $$ LANGUAGE plpgsql IMMUTABLE STRICT;
>
>   SELECT unaccent('fran\347ais');
>    unaccent
>   ----------
>    francais
>   (1 row)
>
> You could use unaccent() in a functional index:
>
>   CREATE TABLE words (
>       id    serial PRIMARY KEY,
>       word  text NOT NULL
>   );
>
>   CREATE INDEX words_word_idx ON words (unaccent(word));
>
>   INSERT INTO words (word) VALUES ('fran\347ais');
>   INSERT INTO words (word) VALUES ('caf\351');
>   INSERT INTO words (word) VALUES ('t\364\t');
>
>   SELECT * FROM words WHERE unaccent(word) = 'francais';
>    id |   word
>   ----+----------
>     1 | français
>   (1 row)
>
> EXPLAIN shows that the index is being used:
>
>   EXPLAIN SELECT * FROM words WHERE unaccent(word) = 'francais';
>                                     QUERY PLAN
>
>   --------------------------------------------------------------------------
----
>    Index Scan using words_word_idx on words  (cost=0.00..11.31
> rows=6 width=36)     Index Cond: (unaccent(word) = 'francais'::text)
>   (2 rows)
>
> Hope this helps.
>
> --
> Michael Fuhr
> http://www.fuhr.org/~mfuhr/
>



> On Thu, Mar 31, 2005 at 02:01:45AM +0200, Ennio-Sr wrote:
> > * Celia McInnis <celia@drmath.ca> [300305, 12:29]:
> > > I have French text with accents in it and would rather (for search
purposes)
> > > be able to search for things with the accents removed. Is there any
builtin
> > > postgres function which will do this?
> > >
> > > Here is the format of some accented (French) words in my database:
> > >
> > > fran\347ais caf\351 t\364\t br\373le r\352ver \342nes b\350gues
> > >
> > > which I want to be able to search for as:
> > >
> > > francais cafe tot brule rever anes begues
> >

> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster


--
Open WebMail Project (http://openwebmail.org)


Re: how to ignore accents?

From
Michael Fuhr
Date:
On Thu, Mar 31, 2005 at 07:40:04AM -0600, Celia McInnis wrote:
> Thanks, Michael - I just tried your suggestion in postgresql 8.0.1 and
> unfortunately it doesn't work. How long will it be before 8.0.2 moves out of
> beta mode?
>
> Here's the error which I get in 8.0.1:
>
> ERROR: invalid byte sequence for encoding "UNICODE": 0xe76169

This is an encoding issue, not a difference between 8.0.1 and
8.0.2beta1.  I should have mentioned that the example works with
SQL_ASCII; apparently you're using UNICODE.  Try using your character
set's versions of the accented characters instead of the escape
sequences shown in the example.

--
Michael Fuhr
http://www.fuhr.org/~mfuhr/

Re: how to ignore accents?

From
Ennio-Sr
Date:
* Michael Fuhr <mike@fuhr.org> [300305, 17:42]:
> On Thu, Mar 31, 2005 at 02:01:45AM +0200, Ennio-Sr wrote:
> >
> > I'm not an expert, but I'm afraid it's not possible to do that (and I'd
> > be glad if somebody came out saying I'm wrong as I've the same problem).
>
> You could write an unaccent() function that calls translate().
>
>   http://www.postgresql.org/docs/8.0/interactive/functions-string.html
>
> Here's an (incomplete) example that works in PostgreSQL 8.0.2beta1:
>
>   CREATE FUNCTION unaccent(text) RETURNS text AS $$
>   BEGIN
>       RETURN translate($1, '\342\347\350\351\352\364\373', 'aceeeou');
>   END;
>   $$ LANGUAGE plpgsql IMMUTABLE STRICT;
>

Thank you, Michael, for correcting my pessimism (and ignorance ... :-).
Is there any chance to have that work with PostgreSQL-7.4.7-2 (the
latest available from Debian/sarge)?
I put those instructions in a .sql file, did:
mydb=> \i unaccent.sql
and got:
psql:unaccent.sql:3: ERROR: syntax error at or near "$" at character 50
psql:unaccent.sql:4: WARNING: there is no transaction in progress
COMMIT
psql:unaccent.sql:5: ERROR: syntax error at or near "$" at character 3

Regards,
    Ennio.

--
[Perche' usare Win$ozz (dico io) se ..."anche uno sciocco sa farlo.   \\?//
 Fa' qualche cosa di cui non sei capace!"  (diceva Henry Miller) ]    (°|°)
[Why use Win$ozz (I say) if ... "even a fool can do that.              )=(
 Do something you aren't good at!" (as Henry Miller used to say) ]

Re: how to ignore accents?

From
Michael Fuhr
Date:
On Thu, Mar 31, 2005 at 08:38:02PM +0200, Ennio-Sr wrote:
> * Michael Fuhr <mike@fuhr.org> [300305, 17:42]:
> >
> >   CREATE FUNCTION unaccent(text) RETURNS text AS $$
> >   BEGIN
> >       RETURN translate($1, '\342\347\350\351\352\364\373', 'aceeeou');
> >   END;
> >   $$ LANGUAGE plpgsql IMMUTABLE STRICT;
>
> Thank you, Michael, for correcting my pessimism (and ignorance ... :-).
> Is there any chance to have that work with PostgreSQL-7.4.7-2 (the
> latest available from Debian/sarge)?
> I put those instructions in a .sql file, did:
> mydb=> \i unaccent.sql
> and got:
> psql:unaccent.sql:3: ERROR: syntax error at or near "$" at character 50

The example uses 8.0's dollar quoting; here's something that should
work in earlier versions:

CREATE FUNCTION unaccent(text) RETURNS text AS '
BEGIN
    RETURN translate($1, ''\342\347\350\351\352\364\373'', ''aceeeou'');
END;
' LANGUAGE plpgsql IMMUTABLE STRICT;

As I mentioned in another message, I tested with the SQL_ASCII
encoding, so if you use a different encoding then you might need
to make some changes -- the point is that you can use translate()
to convert one character to another (e.g., an accented character
to its unaccented equivalent).  You'll probably also want to add
more characters to the translation strings -- they're just short
examples for demonstration purposes.

--
Michael Fuhr
http://www.fuhr.org/~mfuhr/

Re: how to ignore accents?

From
Ennio-Sr
Date:
* Michael Fuhr <mike@fuhr.org> [310305, 12:24]:
> On Thu, Mar 31, 2005 at 08:38:02PM +0200, Ennio-Sr wrote:
> > * Michael Fuhr <mike@fuhr.org> [300305, 17:42]:
> > >
>
> The example uses 8.0's dollar quoting; here's something that should
> work in earlier versions:
>
> CREATE FUNCTION unaccent(text) RETURNS text AS '
> BEGIN
>     RETURN translate($1, ''\342\347\350\351\352\364\373'', ''aceeeou'');
> END;
> ' LANGUAGE plpgsql IMMUTABLE STRICT;
>
> As I mentioned in another message, I tested with the SQL_ASCII
> encoding, so if you use a different encoding then you might need
> to make some changes -- the point is that you can use translate()
> to convert one character to another (e.g., an accented character
> to its unaccented equivalent).  You'll probably also want to add
> more characters to the translation strings -- they're just short
> examples for demonstration purposes.
>

OK, I tried the new version and the function was created. However,
something strange is still happening, maybe depending on the LC_ALL
setting. Although I created a new database with sql_ascii encoding and
changed my LC_ALL to 'C', after inserting an accented vowel, either
directly typing it when I was on LC_ALL=it.IT or with \code being on
LC_ALL=C, a select * from table will show all accented vowels whereas
the:
non_acc=> select * from parole where unaccent(parola) = 'cafe';

yealds:

ERROR: column "a^çéèe^o^u^" does not exist
CONTEXT: PL/pgSQL function "unaccent" line 2 at return

[the '^' is on the vowels but I cannot write these letters with ALT+code]

I will read the full documentation before any new attempts ...

Thanks again.
     Ennio

--
[Perche' usare Win$ozz (dico io) se ..."anche uno sciocco sa farlo.   \\?//
 Fa' qualche cosa di cui non sei capace!"  (diceva Henry Miller) ]    (°|°)
[Why use Win$ozz (I say) if ... "even a fool can do that.              )=(
 Do something you aren't good at!" (as Henry Miller used to say) ]

Re: how to ignore accents?

From
Ennio-Sr
Date:
* Ennio-Sr <nasr.laili@tin.it> [010405, 01:40]:
> * Michael Fuhr <mike@fuhr.org> [310305, 12:24]:
> > On Thu, Mar 31, 2005 at 08:38:02PM +0200, Ennio-Sr wrote:

> OK, I tried the new version and the function was created. However,
> something strange is still happening, maybe depending on the LC_ALL
> setting. Although I created a new database with sql_ascii encoding and
> changed my LC_ALL to 'C', after inserting an accented vowel, either
> directly typing it when I was on LC_ALL=it.IT or with \code being on
> LC_ALL=C, a select * from table will show all accented vowels whereas
> the:
> non_acc=> select * from parole where unaccent(parola) = 'cafe';
>
> yealds:
>
> ERROR: column "a^çéèe^o^u^" does not exist
> CONTEXT: PL/pgSQL function "unaccent" line 2 at return
>
> [the '^' is on the vowels but I cannot write these letters with ALT+code]
>
> I will read the full documentation before any new attempts ...
>

A quick look at the docs didn't gave me any clue as to the way to tell
the function it must look at the column 'parola' for string
substitution.
See what happens:

    Table "public.parole"
 Column |  Type   | Modifiers
--------+---------+-----------
 id     | integer |
 parola | text    | not null


 id |  parola
----+----------
  4 | humanité
  5 | café
(2 rows)


 CREATE FUNCTION unaccent(text) RETURNS text AS '
  BEGIN
      RETURN translate(&parola, "\342\347\350\351\352\364\373", "aceeeou");
                       ^^^^^^^
  END;
  ' LANGUAGE plpgsql IMMUTABLE STRICT;

^^^^ [I also tried with field, &field, parola, (parola), [parameter
      "&field?"], $1, arg1 and similar]

# And this is what I get when I give the command:

non_acc=> select * from parole where unaccent(parola)='cafe';

ERROR: column "parola" does not exist   ----------> ???????????????
CONTEXT: PL/pgSQL function "unaccent" line 2 at return

------------------------
Am I missing something important or that type of function is not
supposed to work in 7.4.7-2?

Regards,
    Ennio.

--
[Perche' usare Win$ozz (dico io) se ..."anche uno sciocco sa farlo.   \\?//
 Fa' qualche cosa di cui non sei capace!"  (diceva Henry Miller) ]    (°|°)
[Why use Win$ozz (I say) if ... "even a fool can do that.              )=(
 Do something you aren't good at!" (as Henry Miller used to say) ]

Re: how to ignore accents?

From
Michael Fuhr
Date:
On Fri, Apr 01, 2005 at 08:36:35PM +0200, Ennio-Sr wrote:
>
>  CREATE FUNCTION unaccent(text) RETURNS text AS '
>   BEGIN
>       RETURN translate(&parola, "\342\347\350\351\352\364\373", "aceeeou");
>                        ^^^^^^^
>   END;
>   ' LANGUAGE plpgsql IMMUTABLE STRICT;
>
> ^^^^ [I also tried with field, &field, parola, (parola), [parameter
>       "&field?"], $1, arg1 and similar]

See the PL/pgSQL documentation to get a better idea of the syntax:

http://www.postgresql.org/docs/7.4/interactive/plpgsql.html

Let's get the basic functionality working before messing with
accented characters.  Try this:

CREATE FUNCTION unaccent(text) RETURNS text AS '
BEGIN
    RETURN translate($1, ''ABC'', ''abc'');
END;
' LANGUAGE plpgsql IMMUTABLE STRICT;

CREATE TABLE parole (
    id      serial PRIMARY KEY,
    parola  text NOT NULL
);

INSERT INTO parole (parola) VALUES ('AAA');
INSERT INTO parole (parola) VALUES ('BBB');
INSERT INTO parole (parola) VALUES ('CCC');

SELECT unaccent(parola) FROM parole;
 unaccent
----------
 aaa
 bbb
 ccc
(3 rows)

When you get that working, then you can modify the translate()
strings to convert accented characters to their unaccented
equivalents.

--
Michael Fuhr
http://www.fuhr.org/~mfuhr/

Re: how to ignore accents?

From
Ennio-Sr
Date:
* Michael Fuhr <mike@fuhr.org> [010405, 18:41]:
> On Fri, Apr 01, 2005 at 08:36:35PM +0200, Ennio-Sr wrote:
> >
> >       RETURN translate($1, "\342\347\350\351\352\364\373", "aceeeou");
> >                            ^                           ^  ^       ^

>
> See the PL/pgSQL documentation to get a better idea of the syntax:
>
> http://www.postgresql.org/docs/7.4/interactive/plpgsql.html
>
>     RETURN translate($1, ''ABC'', ''abc'');
>                          ^^   ^^  ^^   ^^

Hi Michael,
thank you for the nice lessson and for the very instructive link.
I read it thouroughly, went back to my example and still got the same
error: Grrr!!!!!!!!!!!
Then, finally, I discovered my error was as stupid as the difference
between _"_ and _''_ !
Now it works greatly! ..... :-)
Cheers,
    Ennio.

--
[Perche' usare Win$ozz (dico io) se ..."anche uno sciocco sa farlo.   \\?//
 Fa' qualche cosa di cui non sei capace!"  (diceva Henry Miller) ]    (°|°)
[Why use Win$ozz (I say) if ... "even a fool can do that.              )=(
 Do something you aren't good at!" (as Henry Miller used to say) ]

Re: how to ignore accents?

From
Ennio-Sr
Date:
* Ennio-Sr <nasr.laili@tin.it> [020405, 16:11]:
> * Michael Fuhr <mike@fuhr.org> [010405, 18:41]:
> > On Fri, Apr 01, 2005 at 08:36:35PM +0200, Ennio-Sr wrote:
> > >
> > >       RETURN translate($1, ''\342\347\350\351\352\364\373'', ''aceeeou'');
> > >
> Now it works greatly! ..... :-)

Sorry, I forgot to ask this in my previous post:
Is there a better way to see all the codes used by postgres to map
characters than trying them one by one?
I mean: where should I look for to have a map similar to the one I get
running (from a linux console):
$ ascii     ????

I tried with unicode but was unable to find any correspondence between,
say, _\350_ and _è_.
Both googling around and looking into the /usr/share/docs didn't help much...

Regards,
    Ennio.

--
[Perche' usare Win$ozz (dico io) se ..."anche uno sciocco sa farlo.   \\?//
 Fa' qualche cosa di cui non sei capace!"  (diceva Henry Miller) ]    (°|°)
[Why use Win$ozz (I say) if ... "even a fool can do that.              )=(
 Do something you aren't good at!" (as Henry Miller used to say) ]

Re: how to ignore accents?

From
Ennio-Sr
Date:
* Ennio-Sr <nasr.laili@tin.it> [020405, 18:45]:
>
> Sorry, I forgot to ask this in my previous post:
> Is there a better way to see all the codes used by postgres to map
> characters than trying them one by one?
> I mean: where should I look for to have a map similar to the one I get
> running (from a linux console):
> $ ascii     ????
>

[For the benefit of future dummies like me :-) ]
A simple 'apropos octal' revealed the arcane:
Postgres uses octal notation. You can see non ascii chars with:

$ man iso-8859-1

Sorry for the moise. Regards,
    Ennio.

--
[Perche' usare Win$ozz (dico io) se ..."anche uno sciocco sa farlo.   \\?//
 Fa' qualche cosa di cui non sei capace!"  (diceva Henry Miller) ]    (°|°)
[Why use Win$ozz (I say) if ... "even a fool can do that.              )=(
 Do something you aren't good at!" (as Henry Miller used to say) ]

Re: how to ignore accents?

From
Ennio-Sr
Date:
* Jaime Casanova <systemguards@gmail.com> [300305, 22:23]:
> On Thu, 31 Mar 2005 02:01:45 +0200, Ennio-Sr <nasr.laili@tin.it> wrote:
> > * Celia McInnis <celia@drmath.ca> [300305, 12:29]:
> > > I have French text with accents in it and would rather (for search
> > purposes)
> > > be able to search for things with the accents removed. Is there any
> > builtin
> > > postgres function which will do this?
> > >
> > > Here is the format of some accented (French) words in my database:
> > >
> > > fran\347ais caf\351 t\364\t br\373le r\352ver \342nes b\350gues
> > >
> > > which I want to be able to search for as:
> > >
> > > francais cafe tot brule rever anes begues
> > >
> > > Celia
> > >
> >
> > Hi Celia,
> >
> > I'm not an expert, but I'm afraid it's not possible to do that (and I'd
> > be glad if somebody came out saying I'm wrong as I've the same problem).
> >
> If you are using LATIN1, LATIN2, LATIN9, or WIN1250 encoding i suppose
> you can use the to_ascii() function.
>
> regards,
> Jaime Casanova
Hi Jaime,
Just for the records:
I tried both Michael's unaccent() and to_ascii() and found that
to_asci() also works, but 'traces' unaccented vowels as well.
Thanks for your suggestion, regards,
    Ennio.

--
[Perche' usare Win$ozz (dico io) se ..."anche uno sciocco sa farlo.   \\?//
 Fa' qualche cosa di cui non sei capace!"  (diceva Henry Miller) ]    (°|°)
[Why use Win$ozz (I say) if ... "even a fool can do that.              )=(
 Do something you aren't good at!" (as Henry Miller used to say) ]