Thread: Full text index without accents

Full text index without accents

From

lbarcala@freeresearch.org

Date:

03 July 2008, 08:05:15

Hi again:

I am trying to create a full text configuration to ignore word accents in
my searches. My approach is similar to simple dicionary one, but i want to
remove accents after converting to lower.

Is it the only way to do it to develop another .c and write my own
dict_noaccent.c, and then compile and install it into the system?

Regars,

  Mario Barcala

Re: Full text index without accents

From

Oleg Bartunov

Date:

03 July 2008, 13:03:55

You can preprocess text (replace accent by nothing) before
to_tsvector or to_tsquery



Oleg
On Thu, 3 Jul 2008, lbarcala@freeresearch.org wrote:

> Hi again:
>
> I am trying to create a full text configuration to ignore word accents in
> my searches. My approach is similar to simple dicionary one, but i want to
> remove accents after converting to lower.
>
> Is it the only way to do it to develop another .c and write my own
> dict_noaccent.c, and then compile and install it into the system?
>
> Regars,
>
>  Mario Barcala
>
>
>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: Full text index without accents

From

"Fco. Mario Barcala" Rodríguez

Date:

22 July 2008, 16:21:40

And which are the types of argument and returning values of a pl/sql
function which preprocess de text?

I have been searching that, for example, something like this works fine:

CREATE INDEX textindex ON document USING
gin(to_tsvector('english',upper(text)));

where text is the text column of document. But I have tried to do
something like:

CREATE INDEX textindex ON document USING
gin(to_tsvector('english',myfunction(text)));

where myfunction is a PL/SQL function which call upper one, but I didn't
find which are the types of the myfunction argument and returning value.

I am a PL/SQL novice and I didn't find how to do it yet. Of course, then
I will have to change upper experiment to my objective: to index without
accents. I don't know if PL/SQL is the better option to build such
function.

Thanks,

  Mario Barcala

> You can preprocess text (replace accent by nothing) before
> to_tsvector or to_tsquery
>
>
>
> Oleg
> On Thu, 3 Jul 2008, lbarcala@freeresearch.org wrote:
>
> > Hi again:
> >
> > I am trying to create a full text configuration to ignore word accents in
> > my searches. My approach is similar to simple dicionary one, but i want to
> > remove accents after converting to lower.
> >
> > Is it the only way to do it to develop another .c and write my own
> > dict_noaccent.c, and then compile and install it into the system?
> >
> > Regars,
> >
> >  Mario Barcala
> >
> >
> >
>
>      Regards,
>          Oleg
> _____________________________________________________________
> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
> Sternberg Astronomical Institute, Moscow University, Russia
> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
> phone: +007(495)939-16-83, +007(495)939-23-83

Re: Full text index without accents

From

Oleg Bartunov

Date:

22 July 2008, 19:30:22

Here is an example

CREATE FUNCTION dropatsymbol(text) RETURNS text
AS 'select replace($1, ''@'', '' '');'
LANGUAGE SQL;

arxiv=# select to_tsvector('english',dropatsymbol('oleg@sai.msu.su'));
        to_tsvector
-------------------------
  'oleg':1 'sai.msu.su':2


On Tue, 22 Jul 2008, Fco. Mario Barcala Rodr?guez wrote:

> And which are the types of argument and returning values of a pl/sql
> function which preprocess de text?
>
> I have been searching that, for example, something like this works fine:
>
> CREATE INDEX textindex ON document USING
> gin(to_tsvector('english',upper(text)));
>
> where text is the text column of document. But I have tried to do
> something like:
>
> CREATE INDEX textindex ON document USING
> gin(to_tsvector('english',myfunction(text)));
>
> where myfunction is a PL/SQL function which call upper one, but I didn't
> find which are the types of the myfunction argument and returning value.
>
> I am a PL/SQL novice and I didn't find how to do it yet. Of course, then
> I will have to change upper experiment to my objective: to index without
> accents. I don't know if PL/SQL is the better option to build such
> function.
>
> Thanks,
>
>  Mario Barcala
>
>> You can preprocess text (replace accent by nothing) before
>> to_tsvector or to_tsquery
>>
>>
>>
>> Oleg
>> On Thu, 3 Jul 2008, lbarcala@freeresearch.org wrote:
>>
>>> Hi again:
>>>
>>> I am trying to create a full text configuration to ignore word accents in
>>> my searches. My approach is similar to simple dicionary one, but i want to
>>> remove accents after converting to lower.
>>>
>>> Is it the only way to do it to develop another .c and write my own
>>> dict_noaccent.c, and then compile and install it into the system?
>>>
>>> Regars,
>>>
>>>  Mario Barcala
>>>
>>>
>>>
>>
>>      Regards,
>>          Oleg
>> _____________________________________________________________
>> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
>> Sternberg Astronomical Institute, Moscow University, Russia
>> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
>> phone: +007(495)939-16-83, +007(495)939-23-83
>
>
>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: Full text index without accents

From

"Fco. Mario Barcala" Rodríguez

Date:

24 July 2008, 05:46:58

Finally I create a function like:

CREATE OR REPLACE FUNCTION nonsensible (text) RETURNS text AS $$
DECLARE
  var1 varchar;
BEGIN
  var1=replace($1, 'á', 'a');
  var1=replace(var1, 'é', 'e');
  var1=replace(var1, 'í', 'i');
  var1=replace(var1, 'ó', 'o');
  var1=replace(var1, 'ú', 'u');
  var1=replace(var1, 'Á', 'A');
  var1=replace(var1, 'É', 'E');
  var1=replace(var1, 'Í', 'I');
  var1=replace(var1, 'Ó', 'O');
  var1=replace(var1, 'Ú', 'U');
  return var1;
END
$$LANGUAGE plpgsql immutable;

Then, create text indexes, one for sensible queries and other for
unsensible ones:

CREATE INDEX textindex ON document USING
gin(to_tsvector('spanish',text));

CREATE INDEX textindexn ON document USING
gin(to_tsvector('spanish',nonsensible(text)));

And then make a query sensible or unsensible to accents doing:

SELECT id FROM document WHERE to_tsvector('spanish',text) @@
to_tsquery('spanish','word_with_accent');

or:

SELECT id FROM document WHERE to_tsvector('spanish',nonsensible(text))
@@ to_tsquery('spanish',nonsensible('word_with_accent'));
respectively.

I think postgreSQL uses both indexes as necessary. I believe to remember
reading something about it in the documentation.

Thank you very much,

  Mario Barcala


> Here is an example
>
> CREATE FUNCTION dropatsymbol(text) RETURNS text
> AS 'select replace($1, ''@'', '' '');'
> LANGUAGE SQL;
>
> arxiv=# select to_tsvector('english',dropatsymbol('oleg@sai.msu.su'));
>         to_tsvector
> -------------------------
>   'oleg':1 'sai.msu.su':2

Re: Full text index without accents

From

"Jonathan Bond-Caron"

Date:

25 July 2008, 09:49:01

This would probably help:

CREATE OR REPLACE FUNCTION norm_text_latin(character varying)
  RETURNS character varying AS
$BODY$
declare
    p_str        alias for $1;
    v_str        varchar;
begin
    select translate(p_str, 'ÀÁÂÃÄÅ', 'AAAAAA') into v_str;
    select translate(v_str, 'ÉÈËÊ', 'EEEE') into v_str;
    select translate(v_str, 'ÌÍÎÏ', 'IIII') into v_str;
    select translate(v_str, 'ÌÍÎÏ', 'IIII') into v_str;
    select translate(v_str, 'ÒÓÔÕÖ', 'OOOOO') into v_str;
    select translate(v_str, 'ÙÚÛÜ', 'UUUU') into v_str;
    select translate(v_str, 'àáâãäå', 'aaaaaa') into v_str;
    select translate(v_str, 'èéêë', 'eeee') into v_str;
    select translate(v_str, 'ìíîï', 'iiii') into v_str;
    select translate(v_str, 'òóôõö', 'ooooo') into v_str;
    select translate(v_str, 'ùúûü', 'uuuu') into v_str;
    select translate(v_str, 'Çç', 'Cc') into v_str;
    return v_str;
end;$BODY$
  LANGUAGE 'plpgsql' VOLATILE;

There's also o useful functions here:
http://www.project-open.org/doc/intranet-search-pg/intranet-search-pg-create.sql




-----Original Message-----
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Fco. Mario Barcala
Rodríguez
Sent: July 24, 2008 4:47 AM
To: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Full text index without accents

Finally I create a function like:

CREATE OR REPLACE FUNCTION nonsensible (text) RETURNS text AS $$
DECLARE
  var1 varchar;
BEGIN
  var1=replace($1, 'á', 'a');
  var1=replace(var1, 'é', 'e');
  var1=replace(var1, 'í', 'i');
  var1=replace(var1, 'ó', 'o');
  var1=replace(var1, 'ú', 'u');
  var1=replace(var1, 'Á', 'A');
  var1=replace(var1, 'É', 'E');
  var1=replace(var1, 'Í', 'I');
  var1=replace(var1, 'Ó', 'O');
  var1=replace(var1, 'Ú', 'U');
  return var1;
END
$$LANGUAGE plpgsql immutable;

Then, create text indexes, one for sensible queries and other for
unsensible ones:

CREATE INDEX textindex ON document USING
gin(to_tsvector('spanish',text));

CREATE INDEX textindexn ON document USING
gin(to_tsvector('spanish',nonsensible(text)));

And then make a query sensible or unsensible to accents doing:

SELECT id FROM document WHERE to_tsvector('spanish',text) @@
to_tsquery('spanish','word_with_accent');

or:

SELECT id FROM document WHERE to_tsvector('spanish',nonsensible(text))
@@ to_tsquery('spanish',nonsensible('word_with_accent'));
respectively.

I think postgreSQL uses both indexes as necessary. I believe to remember
reading something about it in the documentation.

Thank you very much,

  Mario Barcala


> Here is an example
>
> CREATE FUNCTION dropatsymbol(text) RETURNS text
> AS 'select replace($1, ''@'', '' '');'
> LANGUAGE SQL;
>
> arxiv=# select to_tsvector('english',dropatsymbol('oleg@sai.msu.su'));
>         to_tsvector
> -------------------------
>   'oleg':1 'sai.msu.su':2



--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: Full text index without accents

From

Alvaro Herrera

Date:

25 July 2008, 11:18:37

Jonathan Bond-Caron wrote:
> This would probably help:
>
> CREATE OR REPLACE FUNCTION norm_text_latin(character varying)
>   RETURNS character varying AS
> $BODY$
> declare
>     p_str        alias for $1;
>     v_str        varchar;
> begin
>     select translate(p_str, 'ÀÁÂÃÄÅ', 'AAAAAA') into v_str;

Hmm, why not simply use to_ascii() ?



--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Full text index without accents

From

Tom Lane

Date:

25 July 2008, 11:50:42

Alvaro Herrera <alvherre@commandprompt.com> writes:
> Hmm, why not simply use to_ascii() ?

The big problem with to_ascii is its inadequate set of supported
encodings.  Somebody *really* needs to give it some love on that
front.

            regards, tom lane

Re: Full text index without accents

From

"Jonathan Bond-Caron"

Date:

25 July 2008, 12:06:46

Ya the function name norm_text_latin() was probably misleading, it takes
latin1-ish characters *encoded in UTF8* and brings them to ascii.

Definitely, the following would be much simpler:
SELECT to_ascii('ÀÁÂÃÄÅÒÓÔÕÖ', 'UTF8')

As of 8.3, you have to do some magic with to_ascii() and utf8 characters

SELECT to_ascii(convert_to_latin('ÀÁÂÃÄÅÒÓÔÕÖ'), 'LATIN1')

-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org] On Behalf Of Tom Lane
Sent: July 25, 2008 10:42 AM
To: Alvaro Herrera
Cc: Jonathan Bond-Caron; 'Fco. Mario Barcala Rodríguez';
pgsql-general@postgresql.org
Subject: Re: [GENERAL] Full text index without accents

Alvaro Herrera <alvherre@commandprompt.com> writes:
> Hmm, why not simply use to_ascii() ?

The big problem with to_ascii is its inadequate set of supported
encodings.  Somebody *really* needs to give it some love on that
front.

            regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general