Thread: how to ignore accents?
I have French text with accents in it and would rather (for search purposes) be able to search for things with the accents removed. Is there any builtin postgres function which will do this? Here is the format of some accented (French) words in my database: fran\347ais caf\351 t\364\t br\373le r\352ver \342nes b\350gues which I want to be able to search for as: francais cafe tot brule rever anes begues Celia
* Celia McInnis <celia@drmath.ca> [300305, 12:29]: > I have French text with accents in it and would rather (for search purposes) > be able to search for things with the accents removed. Is there any builtin > postgres function which will do this? > > Here is the format of some accented (French) words in my database: > > fran\347ais caf\351 t\364\t br\373le r\352ver \342nes b\350gues > > which I want to be able to search for as: > > francais cafe tot brule rever anes begues > > Celia > Hi Celia, I'm not an expert, but I'm afraid it's not possible to do that (and I'd be glad if somebody came out saying I'm wrong as I've the same problem). However, you can achieve a similar result using _like_ and truncating the accented words so that only un-accented letters are searched for. In your example, you would use: mydb=> select * from your_table where col_title like '%franc%'; mydb=> select * from your_table where col_title like '%caf%'; mydb=> select * from your_table where col_title like '%gues%'; and so on. Hope that helps. Regards, Ennio. -- [Perche' usare Win$ozz (dico io) se ..."anche uno sciocco sa farlo. \\?// Fa' qualche cosa di cui non sei capace!" (diceva Henry Miller) ] (°|°) [Why use Win$ozz (I say) if ... "even a fool can do that. )=( Do something you aren't good at!" (as Henry Miller used to say) ]
On Thu, Mar 31, 2005 at 02:01:45AM +0200, Ennio-Sr wrote: > * Celia McInnis <celia@drmath.ca> [300305, 12:29]: > > I have French text with accents in it and would rather (for search purposes) > > be able to search for things with the accents removed. Is there any builtin > > postgres function which will do this? > > > > Here is the format of some accented (French) words in my database: > > > > fran\347ais caf\351 t\364\t br\373le r\352ver \342nes b\350gues > > > > which I want to be able to search for as: > > > > francais cafe tot brule rever anes begues > > I'm not an expert, but I'm afraid it's not possible to do that (and I'd > be glad if somebody came out saying I'm wrong as I've the same problem). You could write an unaccent() function that calls translate(). http://www.postgresql.org/docs/8.0/interactive/functions-string.html Here's an (incomplete) example that works in PostgreSQL 8.0.2beta1: CREATE FUNCTION unaccent(text) RETURNS text AS $$ BEGIN RETURN translate($1, '\342\347\350\351\352\364\373', 'aceeeou'); END; $$ LANGUAGE plpgsql IMMUTABLE STRICT; SELECT unaccent('fran\347ais'); unaccent ---------- francais (1 row) You could use unaccent() in a functional index: CREATE TABLE words ( id serial PRIMARY KEY, word text NOT NULL ); CREATE INDEX words_word_idx ON words (unaccent(word)); INSERT INTO words (word) VALUES ('fran\347ais'); INSERT INTO words (word) VALUES ('caf\351'); INSERT INTO words (word) VALUES ('t\364\t'); SELECT * FROM words WHERE unaccent(word) = 'francais'; id | word ----+---------- 1 | français (1 row) EXPLAIN shows that the index is being used: EXPLAIN SELECT * FROM words WHERE unaccent(word) = 'francais'; QUERY PLAN ------------------------------------------------------------------------------ Index Scan using words_word_idx on words (cost=0.00..11.31 rows=6 width=36) Index Cond: (unaccent(word) = 'francais'::text) (2 rows) Hope this helps. -- Michael Fuhr http://www.fuhr.org/~mfuhr/
On Thu, 31 Mar 2005 02:01:45 +0200, Ennio-Sr <nasr.laili@tin.it> wrote: > * Celia McInnis <celia@drmath.ca> [300305, 12:29]: > > I have French text with accents in it and would rather (for search > purposes) > > be able to search for things with the accents removed. Is there any > builtin > > postgres function which will do this? > > > > Here is the format of some accented (French) words in my database: > > > > fran\347ais caf\351 t\364\t br\373le r\352ver \342nes b\350gues > > > > which I want to be able to search for as: > > > > francais cafe tot brule rever anes begues > > > > Celia > > > > Hi Celia, > > I'm not an expert, but I'm afraid it's not possible to do that (and I'd > be glad if somebody came out saying I'm wrong as I've the same problem). > If you are using LATIN1, LATIN2, LATIN9, or WIN1250 encoding i suppose you can use the to_ascii() function. regards, Jaime Casanova
Thanks, Michael - I just tried your suggestion in postgresql 8.0.1 and unfortunately it doesn't work. How long will it be before 8.0.2 moves out of beta mode? Here's the error which I get in 8.0.1: ERROR: invalid byte sequence for encoding "UNICODE": 0xe76169 Is there something which I can do in 8.0.1 to have the unaccenting work? Celia McInnis On Wed, 30 Mar 2005 17:42:36 -0700, Michael Fuhr wrote > You could write an unaccent() function that calls translate(). > > http://www.postgresql.org/docs/8.0/interactive/functions-string.html > > Here's an (incomplete) example that works in PostgreSQL 8.0.2beta1: > > CREATE FUNCTION unaccent(text) RETURNS text AS $$ > BEGIN > RETURN translate($1, '\342\347\350\351\352\364\373', 'aceeeou') > ; END; $$ LANGUAGE plpgsql IMMUTABLE STRICT; > > SELECT unaccent('fran\347ais'); > unaccent > ---------- > francais > (1 row) > > You could use unaccent() in a functional index: > > CREATE TABLE words ( > id serial PRIMARY KEY, > word text NOT NULL > ); > > CREATE INDEX words_word_idx ON words (unaccent(word)); > > INSERT INTO words (word) VALUES ('fran\347ais'); > INSERT INTO words (word) VALUES ('caf\351'); > INSERT INTO words (word) VALUES ('t\364\t'); > > SELECT * FROM words WHERE unaccent(word) = 'francais'; > id | word > ----+---------- > 1 | français > (1 row) > > EXPLAIN shows that the index is being used: > > EXPLAIN SELECT * FROM words WHERE unaccent(word) = 'francais'; > QUERY PLAN > > -------------------------------------------------------------------------- ---- > Index Scan using words_word_idx on words (cost=0.00..11.31 > rows=6 width=36) Index Cond: (unaccent(word) = 'francais'::text) > (2 rows) > > Hope this helps. > > -- > Michael Fuhr > http://www.fuhr.org/~mfuhr/ > > On Thu, Mar 31, 2005 at 02:01:45AM +0200, Ennio-Sr wrote: > > * Celia McInnis <celia@drmath.ca> [300305, 12:29]: > > > I have French text with accents in it and would rather (for search purposes) > > > be able to search for things with the accents removed. Is there any builtin > > > postgres function which will do this? > > > > > > Here is the format of some accented (French) words in my database: > > > > > > fran\347ais caf\351 t\364\t br\373le r\352ver \342nes b\350gues > > > > > > which I want to be able to search for as: > > > > > > francais cafe tot brule rever anes begues > > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Don't 'kill -9' the postmaster -- Open WebMail Project (http://openwebmail.org)
On Thu, Mar 31, 2005 at 07:40:04AM -0600, Celia McInnis wrote: > Thanks, Michael - I just tried your suggestion in postgresql 8.0.1 and > unfortunately it doesn't work. How long will it be before 8.0.2 moves out of > beta mode? > > Here's the error which I get in 8.0.1: > > ERROR: invalid byte sequence for encoding "UNICODE": 0xe76169 This is an encoding issue, not a difference between 8.0.1 and 8.0.2beta1. I should have mentioned that the example works with SQL_ASCII; apparently you're using UNICODE. Try using your character set's versions of the accented characters instead of the escape sequences shown in the example. -- Michael Fuhr http://www.fuhr.org/~mfuhr/
* Michael Fuhr <mike@fuhr.org> [300305, 17:42]: > On Thu, Mar 31, 2005 at 02:01:45AM +0200, Ennio-Sr wrote: > > > > I'm not an expert, but I'm afraid it's not possible to do that (and I'd > > be glad if somebody came out saying I'm wrong as I've the same problem). > > You could write an unaccent() function that calls translate(). > > http://www.postgresql.org/docs/8.0/interactive/functions-string.html > > Here's an (incomplete) example that works in PostgreSQL 8.0.2beta1: > > CREATE FUNCTION unaccent(text) RETURNS text AS $$ > BEGIN > RETURN translate($1, '\342\347\350\351\352\364\373', 'aceeeou'); > END; > $$ LANGUAGE plpgsql IMMUTABLE STRICT; > Thank you, Michael, for correcting my pessimism (and ignorance ... :-). Is there any chance to have that work with PostgreSQL-7.4.7-2 (the latest available from Debian/sarge)? I put those instructions in a .sql file, did: mydb=> \i unaccent.sql and got: psql:unaccent.sql:3: ERROR: syntax error at or near "$" at character 50 psql:unaccent.sql:4: WARNING: there is no transaction in progress COMMIT psql:unaccent.sql:5: ERROR: syntax error at or near "$" at character 3 Regards, Ennio. -- [Perche' usare Win$ozz (dico io) se ..."anche uno sciocco sa farlo. \\?// Fa' qualche cosa di cui non sei capace!" (diceva Henry Miller) ] (°|°) [Why use Win$ozz (I say) if ... "even a fool can do that. )=( Do something you aren't good at!" (as Henry Miller used to say) ]
On Thu, Mar 31, 2005 at 08:38:02PM +0200, Ennio-Sr wrote: > * Michael Fuhr <mike@fuhr.org> [300305, 17:42]: > > > > CREATE FUNCTION unaccent(text) RETURNS text AS $$ > > BEGIN > > RETURN translate($1, '\342\347\350\351\352\364\373', 'aceeeou'); > > END; > > $$ LANGUAGE plpgsql IMMUTABLE STRICT; > > Thank you, Michael, for correcting my pessimism (and ignorance ... :-). > Is there any chance to have that work with PostgreSQL-7.4.7-2 (the > latest available from Debian/sarge)? > I put those instructions in a .sql file, did: > mydb=> \i unaccent.sql > and got: > psql:unaccent.sql:3: ERROR: syntax error at or near "$" at character 50 The example uses 8.0's dollar quoting; here's something that should work in earlier versions: CREATE FUNCTION unaccent(text) RETURNS text AS ' BEGIN RETURN translate($1, ''\342\347\350\351\352\364\373'', ''aceeeou''); END; ' LANGUAGE plpgsql IMMUTABLE STRICT; As I mentioned in another message, I tested with the SQL_ASCII encoding, so if you use a different encoding then you might need to make some changes -- the point is that you can use translate() to convert one character to another (e.g., an accented character to its unaccented equivalent). You'll probably also want to add more characters to the translation strings -- they're just short examples for demonstration purposes. -- Michael Fuhr http://www.fuhr.org/~mfuhr/
* Michael Fuhr <mike@fuhr.org> [310305, 12:24]: > On Thu, Mar 31, 2005 at 08:38:02PM +0200, Ennio-Sr wrote: > > * Michael Fuhr <mike@fuhr.org> [300305, 17:42]: > > > > > The example uses 8.0's dollar quoting; here's something that should > work in earlier versions: > > CREATE FUNCTION unaccent(text) RETURNS text AS ' > BEGIN > RETURN translate($1, ''\342\347\350\351\352\364\373'', ''aceeeou''); > END; > ' LANGUAGE plpgsql IMMUTABLE STRICT; > > As I mentioned in another message, I tested with the SQL_ASCII > encoding, so if you use a different encoding then you might need > to make some changes -- the point is that you can use translate() > to convert one character to another (e.g., an accented character > to its unaccented equivalent). You'll probably also want to add > more characters to the translation strings -- they're just short > examples for demonstration purposes. > OK, I tried the new version and the function was created. However, something strange is still happening, maybe depending on the LC_ALL setting. Although I created a new database with sql_ascii encoding and changed my LC_ALL to 'C', after inserting an accented vowel, either directly typing it when I was on LC_ALL=it.IT or with \code being on LC_ALL=C, a select * from table will show all accented vowels whereas the: non_acc=> select * from parole where unaccent(parola) = 'cafe'; yealds: ERROR: column "a^çéèe^o^u^" does not exist CONTEXT: PL/pgSQL function "unaccent" line 2 at return [the '^' is on the vowels but I cannot write these letters with ALT+code] I will read the full documentation before any new attempts ... Thanks again. Ennio -- [Perche' usare Win$ozz (dico io) se ..."anche uno sciocco sa farlo. \\?// Fa' qualche cosa di cui non sei capace!" (diceva Henry Miller) ] (°|°) [Why use Win$ozz (I say) if ... "even a fool can do that. )=( Do something you aren't good at!" (as Henry Miller used to say) ]
* Ennio-Sr <nasr.laili@tin.it> [010405, 01:40]: > * Michael Fuhr <mike@fuhr.org> [310305, 12:24]: > > On Thu, Mar 31, 2005 at 08:38:02PM +0200, Ennio-Sr wrote: > OK, I tried the new version and the function was created. However, > something strange is still happening, maybe depending on the LC_ALL > setting. Although I created a new database with sql_ascii encoding and > changed my LC_ALL to 'C', after inserting an accented vowel, either > directly typing it when I was on LC_ALL=it.IT or with \code being on > LC_ALL=C, a select * from table will show all accented vowels whereas > the: > non_acc=> select * from parole where unaccent(parola) = 'cafe'; > > yealds: > > ERROR: column "a^çéèe^o^u^" does not exist > CONTEXT: PL/pgSQL function "unaccent" line 2 at return > > [the '^' is on the vowels but I cannot write these letters with ALT+code] > > I will read the full documentation before any new attempts ... > A quick look at the docs didn't gave me any clue as to the way to tell the function it must look at the column 'parola' for string substitution. See what happens: Table "public.parole" Column | Type | Modifiers --------+---------+----------- id | integer | parola | text | not null id | parola ----+---------- 4 | humanité 5 | café (2 rows) CREATE FUNCTION unaccent(text) RETURNS text AS ' BEGIN RETURN translate(&parola, "\342\347\350\351\352\364\373", "aceeeou"); ^^^^^^^ END; ' LANGUAGE plpgsql IMMUTABLE STRICT; ^^^^ [I also tried with field, &field, parola, (parola), [parameter "&field?"], $1, arg1 and similar] # And this is what I get when I give the command: non_acc=> select * from parole where unaccent(parola)='cafe'; ERROR: column "parola" does not exist ----------> ??????????????? CONTEXT: PL/pgSQL function "unaccent" line 2 at return ------------------------ Am I missing something important or that type of function is not supposed to work in 7.4.7-2? Regards, Ennio. -- [Perche' usare Win$ozz (dico io) se ..."anche uno sciocco sa farlo. \\?// Fa' qualche cosa di cui non sei capace!" (diceva Henry Miller) ] (°|°) [Why use Win$ozz (I say) if ... "even a fool can do that. )=( Do something you aren't good at!" (as Henry Miller used to say) ]
On Fri, Apr 01, 2005 at 08:36:35PM +0200, Ennio-Sr wrote: > > CREATE FUNCTION unaccent(text) RETURNS text AS ' > BEGIN > RETURN translate(&parola, "\342\347\350\351\352\364\373", "aceeeou"); > ^^^^^^^ > END; > ' LANGUAGE plpgsql IMMUTABLE STRICT; > > ^^^^ [I also tried with field, &field, parola, (parola), [parameter > "&field?"], $1, arg1 and similar] See the PL/pgSQL documentation to get a better idea of the syntax: http://www.postgresql.org/docs/7.4/interactive/plpgsql.html Let's get the basic functionality working before messing with accented characters. Try this: CREATE FUNCTION unaccent(text) RETURNS text AS ' BEGIN RETURN translate($1, ''ABC'', ''abc''); END; ' LANGUAGE plpgsql IMMUTABLE STRICT; CREATE TABLE parole ( id serial PRIMARY KEY, parola text NOT NULL ); INSERT INTO parole (parola) VALUES ('AAA'); INSERT INTO parole (parola) VALUES ('BBB'); INSERT INTO parole (parola) VALUES ('CCC'); SELECT unaccent(parola) FROM parole; unaccent ---------- aaa bbb ccc (3 rows) When you get that working, then you can modify the translate() strings to convert accented characters to their unaccented equivalents. -- Michael Fuhr http://www.fuhr.org/~mfuhr/
* Michael Fuhr <mike@fuhr.org> [010405, 18:41]: > On Fri, Apr 01, 2005 at 08:36:35PM +0200, Ennio-Sr wrote: > > > > RETURN translate($1, "\342\347\350\351\352\364\373", "aceeeou"); > > ^ ^ ^ ^ > > See the PL/pgSQL documentation to get a better idea of the syntax: > > http://www.postgresql.org/docs/7.4/interactive/plpgsql.html > > RETURN translate($1, ''ABC'', ''abc''); > ^^ ^^ ^^ ^^ Hi Michael, thank you for the nice lessson and for the very instructive link. I read it thouroughly, went back to my example and still got the same error: Grrr!!!!!!!!!!! Then, finally, I discovered my error was as stupid as the difference between _"_ and _''_ ! Now it works greatly! ..... :-) Cheers, Ennio. -- [Perche' usare Win$ozz (dico io) se ..."anche uno sciocco sa farlo. \\?// Fa' qualche cosa di cui non sei capace!" (diceva Henry Miller) ] (°|°) [Why use Win$ozz (I say) if ... "even a fool can do that. )=( Do something you aren't good at!" (as Henry Miller used to say) ]
* Ennio-Sr <nasr.laili@tin.it> [020405, 16:11]: > * Michael Fuhr <mike@fuhr.org> [010405, 18:41]: > > On Fri, Apr 01, 2005 at 08:36:35PM +0200, Ennio-Sr wrote: > > > > > > RETURN translate($1, ''\342\347\350\351\352\364\373'', ''aceeeou''); > > > > Now it works greatly! ..... :-) Sorry, I forgot to ask this in my previous post: Is there a better way to see all the codes used by postgres to map characters than trying them one by one? I mean: where should I look for to have a map similar to the one I get running (from a linux console): $ ascii ???? I tried with unicode but was unable to find any correspondence between, say, _\350_ and _è_. Both googling around and looking into the /usr/share/docs didn't help much... Regards, Ennio. -- [Perche' usare Win$ozz (dico io) se ..."anche uno sciocco sa farlo. \\?// Fa' qualche cosa di cui non sei capace!" (diceva Henry Miller) ] (°|°) [Why use Win$ozz (I say) if ... "even a fool can do that. )=( Do something you aren't good at!" (as Henry Miller used to say) ]
* Ennio-Sr <nasr.laili@tin.it> [020405, 18:45]: > > Sorry, I forgot to ask this in my previous post: > Is there a better way to see all the codes used by postgres to map > characters than trying them one by one? > I mean: where should I look for to have a map similar to the one I get > running (from a linux console): > $ ascii ???? > [For the benefit of future dummies like me :-) ] A simple 'apropos octal' revealed the arcane: Postgres uses octal notation. You can see non ascii chars with: $ man iso-8859-1 Sorry for the moise. Regards, Ennio. -- [Perche' usare Win$ozz (dico io) se ..."anche uno sciocco sa farlo. \\?// Fa' qualche cosa di cui non sei capace!" (diceva Henry Miller) ] (°|°) [Why use Win$ozz (I say) if ... "even a fool can do that. )=( Do something you aren't good at!" (as Henry Miller used to say) ]
* Jaime Casanova <systemguards@gmail.com> [300305, 22:23]: > On Thu, 31 Mar 2005 02:01:45 +0200, Ennio-Sr <nasr.laili@tin.it> wrote: > > * Celia McInnis <celia@drmath.ca> [300305, 12:29]: > > > I have French text with accents in it and would rather (for search > > purposes) > > > be able to search for things with the accents removed. Is there any > > builtin > > > postgres function which will do this? > > > > > > Here is the format of some accented (French) words in my database: > > > > > > fran\347ais caf\351 t\364\t br\373le r\352ver \342nes b\350gues > > > > > > which I want to be able to search for as: > > > > > > francais cafe tot brule rever anes begues > > > > > > Celia > > > > > > > Hi Celia, > > > > I'm not an expert, but I'm afraid it's not possible to do that (and I'd > > be glad if somebody came out saying I'm wrong as I've the same problem). > > > If you are using LATIN1, LATIN2, LATIN9, or WIN1250 encoding i suppose > you can use the to_ascii() function. > > regards, > Jaime Casanova Hi Jaime, Just for the records: I tried both Michael's unaccent() and to_ascii() and found that to_asci() also works, but 'traces' unaccented vowels as well. Thanks for your suggestion, regards, Ennio. -- [Perche' usare Win$ozz (dico io) se ..."anche uno sciocco sa farlo. \\?// Fa' qualche cosa di cui non sei capace!" (diceva Henry Miller) ] (°|°) [Why use Win$ozz (I say) if ... "even a fool can do that. )=( Do something you aren't good at!" (as Henry Miller used to say) ]