Thread: Full text index without accents
Hi again: I am trying to create a full text configuration to ignore word accents in my searches. My approach is similar to simple dicionary one, but i want to remove accents after converting to lower. Is it the only way to do it to develop another .c and write my own dict_noaccent.c, and then compile and install it into the system? Regars, Mario Barcala
You can preprocess text (replace accent by nothing) before to_tsvector or to_tsquery Oleg On Thu, 3 Jul 2008, lbarcala@freeresearch.org wrote: > Hi again: > > I am trying to create a full text configuration to ignore word accents in > my searches. My approach is similar to simple dicionary one, but i want to > remove accents after converting to lower. > > Is it the only way to do it to develop another .c and write my own > dict_noaccent.c, and then compile and install it into the system? > > Regars, > > Mario Barcala > > > Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
And which are the types of argument and returning values of a pl/sql function which preprocess de text? I have been searching that, for example, something like this works fine: CREATE INDEX textindex ON document USING gin(to_tsvector('english',upper(text))); where text is the text column of document. But I have tried to do something like: CREATE INDEX textindex ON document USING gin(to_tsvector('english',myfunction(text))); where myfunction is a PL/SQL function which call upper one, but I didn't find which are the types of the myfunction argument and returning value. I am a PL/SQL novice and I didn't find how to do it yet. Of course, then I will have to change upper experiment to my objective: to index without accents. I don't know if PL/SQL is the better option to build such function. Thanks, Mario Barcala > You can preprocess text (replace accent by nothing) before > to_tsvector or to_tsquery > > > > Oleg > On Thu, 3 Jul 2008, lbarcala@freeresearch.org wrote: > > > Hi again: > > > > I am trying to create a full text configuration to ignore word accents in > > my searches. My approach is similar to simple dicionary one, but i want to > > remove accents after converting to lower. > > > > Is it the only way to do it to develop another .c and write my own > > dict_noaccent.c, and then compile and install it into the system? > > > > Regars, > > > > Mario Barcala > > > > > > > > Regards, > Oleg > _____________________________________________________________ > Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), > Sternberg Astronomical Institute, Moscow University, Russia > Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ > phone: +007(495)939-16-83, +007(495)939-23-83
Here is an example CREATE FUNCTION dropatsymbol(text) RETURNS text AS 'select replace($1, ''@'', '' '');' LANGUAGE SQL; arxiv=# select to_tsvector('english',dropatsymbol('oleg@sai.msu.su')); to_tsvector ------------------------- 'oleg':1 'sai.msu.su':2 On Tue, 22 Jul 2008, Fco. Mario Barcala Rodr?guez wrote: > And which are the types of argument and returning values of a pl/sql > function which preprocess de text? > > I have been searching that, for example, something like this works fine: > > CREATE INDEX textindex ON document USING > gin(to_tsvector('english',upper(text))); > > where text is the text column of document. But I have tried to do > something like: > > CREATE INDEX textindex ON document USING > gin(to_tsvector('english',myfunction(text))); > > where myfunction is a PL/SQL function which call upper one, but I didn't > find which are the types of the myfunction argument and returning value. > > I am a PL/SQL novice and I didn't find how to do it yet. Of course, then > I will have to change upper experiment to my objective: to index without > accents. I don't know if PL/SQL is the better option to build such > function. > > Thanks, > > Mario Barcala > >> You can preprocess text (replace accent by nothing) before >> to_tsvector or to_tsquery >> >> >> >> Oleg >> On Thu, 3 Jul 2008, lbarcala@freeresearch.org wrote: >> >>> Hi again: >>> >>> I am trying to create a full text configuration to ignore word accents in >>> my searches. My approach is similar to simple dicionary one, but i want to >>> remove accents after converting to lower. >>> >>> Is it the only way to do it to develop another .c and write my own >>> dict_noaccent.c, and then compile and install it into the system? >>> >>> Regars, >>> >>> Mario Barcala >>> >>> >>> >> >> Regards, >> Oleg >> _____________________________________________________________ >> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), >> Sternberg Astronomical Institute, Moscow University, Russia >> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ >> phone: +007(495)939-16-83, +007(495)939-23-83 > > > Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
Finally I create a function like: CREATE OR REPLACE FUNCTION nonsensible (text) RETURNS text AS $$ DECLARE var1 varchar; BEGIN var1=replace($1, 'á', 'a'); var1=replace(var1, 'é', 'e'); var1=replace(var1, 'í', 'i'); var1=replace(var1, 'ó', 'o'); var1=replace(var1, 'ú', 'u'); var1=replace(var1, 'Á', 'A'); var1=replace(var1, 'É', 'E'); var1=replace(var1, 'Í', 'I'); var1=replace(var1, 'Ó', 'O'); var1=replace(var1, 'Ú', 'U'); return var1; END $$LANGUAGE plpgsql immutable; Then, create text indexes, one for sensible queries and other for unsensible ones: CREATE INDEX textindex ON document USING gin(to_tsvector('spanish',text)); CREATE INDEX textindexn ON document USING gin(to_tsvector('spanish',nonsensible(text))); And then make a query sensible or unsensible to accents doing: SELECT id FROM document WHERE to_tsvector('spanish',text) @@ to_tsquery('spanish','word_with_accent'); or: SELECT id FROM document WHERE to_tsvector('spanish',nonsensible(text)) @@ to_tsquery('spanish',nonsensible('word_with_accent')); respectively. I think postgreSQL uses both indexes as necessary. I believe to remember reading something about it in the documentation. Thank you very much, Mario Barcala > Here is an example > > CREATE FUNCTION dropatsymbol(text) RETURNS text > AS 'select replace($1, ''@'', '' '');' > LANGUAGE SQL; > > arxiv=# select to_tsvector('english',dropatsymbol('oleg@sai.msu.su')); > to_tsvector > ------------------------- > 'oleg':1 'sai.msu.su':2
This would probably help: CREATE OR REPLACE FUNCTION norm_text_latin(character varying) RETURNS character varying AS $BODY$ declare p_str alias for $1; v_str varchar; begin select translate(p_str, 'ÀÁÂÃÄÅ', 'AAAAAA') into v_str; select translate(v_str, 'ÉÈËÊ', 'EEEE') into v_str; select translate(v_str, 'ÌÍÎÏ', 'IIII') into v_str; select translate(v_str, 'ÌÍÎÏ', 'IIII') into v_str; select translate(v_str, 'ÒÓÔÕÖ', 'OOOOO') into v_str; select translate(v_str, 'ÙÚÛÜ', 'UUUU') into v_str; select translate(v_str, 'àáâãäå', 'aaaaaa') into v_str; select translate(v_str, 'èéêë', 'eeee') into v_str; select translate(v_str, 'ìíîï', 'iiii') into v_str; select translate(v_str, 'òóôõö', 'ooooo') into v_str; select translate(v_str, 'ùúûü', 'uuuu') into v_str; select translate(v_str, 'Çç', 'Cc') into v_str; return v_str; end;$BODY$ LANGUAGE 'plpgsql' VOLATILE; There's also o useful functions here: http://www.project-open.org/doc/intranet-search-pg/intranet-search-pg-create.sql -----Original Message----- From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Fco. Mario Barcala Rodríguez Sent: July 24, 2008 4:47 AM To: pgsql-general@postgresql.org Subject: Re: [GENERAL] Full text index without accents Finally I create a function like: CREATE OR REPLACE FUNCTION nonsensible (text) RETURNS text AS $$ DECLARE var1 varchar; BEGIN var1=replace($1, 'á', 'a'); var1=replace(var1, 'é', 'e'); var1=replace(var1, 'í', 'i'); var1=replace(var1, 'ó', 'o'); var1=replace(var1, 'ú', 'u'); var1=replace(var1, 'Á', 'A'); var1=replace(var1, 'É', 'E'); var1=replace(var1, 'Í', 'I'); var1=replace(var1, 'Ó', 'O'); var1=replace(var1, 'Ú', 'U'); return var1; END $$LANGUAGE plpgsql immutable; Then, create text indexes, one for sensible queries and other for unsensible ones: CREATE INDEX textindex ON document USING gin(to_tsvector('spanish',text)); CREATE INDEX textindexn ON document USING gin(to_tsvector('spanish',nonsensible(text))); And then make a query sensible or unsensible to accents doing: SELECT id FROM document WHERE to_tsvector('spanish',text) @@ to_tsquery('spanish','word_with_accent'); or: SELECT id FROM document WHERE to_tsvector('spanish',nonsensible(text)) @@ to_tsquery('spanish',nonsensible('word_with_accent')); respectively. I think postgreSQL uses both indexes as necessary. I believe to remember reading something about it in the documentation. Thank you very much, Mario Barcala > Here is an example > > CREATE FUNCTION dropatsymbol(text) RETURNS text > AS 'select replace($1, ''@'', '' '');' > LANGUAGE SQL; > > arxiv=# select to_tsvector('english',dropatsymbol('oleg@sai.msu.su')); > to_tsvector > ------------------------- > 'oleg':1 'sai.msu.su':2 -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Jonathan Bond-Caron wrote: > This would probably help: > > CREATE OR REPLACE FUNCTION norm_text_latin(character varying) > RETURNS character varying AS > $BODY$ > declare > p_str alias for $1; > v_str varchar; > begin > select translate(p_str, 'ÀÁÂÃÄÅ', 'AAAAAA') into v_str; Hmm, why not simply use to_ascii() ? -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes: > Hmm, why not simply use to_ascii() ? The big problem with to_ascii is its inadequate set of supported encodings. Somebody *really* needs to give it some love on that front. regards, tom lane
Ya the function name norm_text_latin() was probably misleading, it takes latin1-ish characters *encoded in UTF8* and brings them to ascii. Definitely, the following would be much simpler: SELECT to_ascii('ÀÁÂÃÄÅÒÓÔÕÖ', 'UTF8') As of 8.3, you have to do some magic with to_ascii() and utf8 characters SELECT to_ascii(convert_to_latin('ÀÁÂÃÄÅÒÓÔÕÖ'), 'LATIN1') -----Original Message----- From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Tom Lane Sent: July 25, 2008 10:42 AM To: Alvaro Herrera Cc: Jonathan Bond-Caron; 'Fco. Mario Barcala Rodríguez'; pgsql-general@postgresql.org Subject: Re: [GENERAL] Full text index without accents Alvaro Herrera <alvherre@commandprompt.com> writes: > Hmm, why not simply use to_ascii() ? The big problem with to_ascii is its inadequate set of supported encodings. Somebody *really* needs to give it some love on that front. regards, tom lane -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general