Thread: unaccent

unaccent

From
nngodinh@tiscali.it
Date:
Greetings,

As far as I use the txtidx data structure in conjunction with gist indexing
to make a word indexing of a very large UNICODE db, I've implemented a PostgreSQL
function that uses libunac to unaccent TEXT fileds.

The resulting text is in UTF-8, but you can modify it in the sources with
an appropriate value (using iconv charset names).

Get libunac from: http://www.nongnu.org/unac/ (it uses iconv)

Extract the archive, compile it (make). Move pg_unac.so to your postgresql
shared libraries dir.

Link it in postgresql:

CREATE FUNCTION unac(TEXT) RETURNS TEXT AS 'path_to_pg_unac.so' LANGUAGE
C;

What about integrating unaccent libraries directly in tsearch? It is useful
for french search engines (for instance).

Bye.

Nhan NGO DINH


__________________________________________________________________
Tiscali Ricaricasa
la prima prepagata per navigare in Internet a meno di un'urbana e
risparmiare su tutte le tue telefonate. Acquistala on line e non avrai
nessun costo di attivazione né di ricarica!
http://ricaricasaonline.tiscali.it/




Attachment

Re: unaccent

From
Oleg Bartunov
Date:
On Wed, 18 Sep 2002 nngodinh@tiscali.it wrote:

> Greetings,
>
> As far as I use the txtidx data structure in conjunction with gist indexing
> to make a word indexing of a very large UNICODE db, I've implemented a PostgreSQL
> function that uses libunac to unaccent TEXT fileds.
>
> The resulting text is in UTF-8, but you can modify it in the sources with
> an appropriate value (using iconv charset names).
>
> Get libunac from: http://www.nongnu.org/unac/ (it uses iconv)
>
> Extract the archive, compile it (make). Move pg_unac.so to your postgresql
> shared libraries dir.
>
> Link it in postgresql:
>
> CREATE FUNCTION unac(TEXT) RETURNS TEXT AS 'path_to_pg_unac.so' LANGUAGE
> C;
>
> What about integrating unaccent libraries directly in tsearch? It is useful
> for french search engines (for instance).

I think better to have separate module contrib/unac and document using
it with tsearch. Please write us a couple of lines about using
your function and we'll add them into tsearch documentation.

btw, use palloc instead of malloc in postgresql functions .

>
> Bye.
>
> Nhan NGO DINH
>
>
> __________________________________________________________________
> Tiscali Ricaricasa
> la prima prepagata per navigare in Internet a meno di un'urbana e
> risparmiare su tutte le tue telefonate. Acquistala on line e non avrai
> nessun costo di attivazione nИ di ricarica!
> http://ricaricasaonline.tiscali.it/
>
>
>
>
Regards,    Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83



Re: unaccent

From
Karel Zak
Date:
On Wed, Sep 18, 2002 at 03:08:59PM +0300, Oleg Bartunov wrote:
> On Wed, 18 Sep 2002 nngodinh@tiscali.it wrote:
> >
> > Get libunac from: http://www.nongnu.org/unac/ (it uses iconv)
> >
> > Extract the archive, compile it (make). Move pg_unac.so to your postgresql
> > shared libraries dir.
> >
> I think better to have separate module contrib/unac and document using
> it with tsearch. Please write us a couple of lines about using
> your function and we'll add them into tsearch documentation.
I think about --with-unaccent for PostgreSQL and to_ascii() inmain tree. Comment?
   Karel

-- Karel Zak  <zakkr@zf.jcu.cz>http://home.zf.jcu.cz/~zakkr/C, PostgreSQL, PHP, WWW, http://docs.linux.cz,
http://mape.jcu.cz


Re: unaccent

From
nngodinh@tiscali.it
Date:
The best way to use it is quite simple. If you want to index the table "titles"
and "title" is the field containing the text to be indexed, you can create
another unaccented field, for instance "utitle".

UPDATE titles SET utitle = unac(title);

Of course you can set it up as a trigger function. Then you can use utitle
with txt2txtidx and tsearch.

Another solution is to generate the txtidx field (i.e. titleidx) directly
using unac:

UPDATE titles SET titleidx = txt2txtidx(unac(title));

But the problem is that I've not succeeded using it with tsearch because
(of course) it doesn't allow functions as parameters. So my first idea was
to integrate unac in tsearch.

Bye.

>-- Messaggio Originale --
>Date: Wed, 18 Sep 2002 15:08:59 +0300 (GMT)
>From: Oleg Bartunov <oleg@sai.msu.su>
>To: nngodinh@tiscali.it
>Cc: pgsql-hackers@postgresql.org
>Subject: Re: [HACKERS] unaccent
>
>
>On Wed, 18 Sep 2002 nngodinh@tiscali.it wrote:
>
>> Greetings,
>>
>> As far as I use the txtidx data structure in conjunction with gist indexing
>> to make a word indexing of a very large UNICODE db, I've implemented
a
>PostgreSQL
>> function that uses libunac to unaccent TEXT fileds.
>>
>> The resulting text is in UTF-8, but you can modify it in the sources
with
>> an appropriate value (using iconv charset names).
>>
>> Get libunac from: http://www.nongnu.org/unac/ (it uses iconv)
>>
>> Extract the archive, compile it (make). Move pg_unac.so to your postgresql
>> shared libraries dir.
>>
>> Link it in postgresql:
>>
>> CREATE FUNCTION unac(TEXT) RETURNS TEXT AS 'path_to_pg_unac.so' LANGUAGE
>> C;
>>
>> What about integrating unaccent libraries directly in tsearch? It is
useful
>> for french search engines (for instance).
>
>I think better to have separate module contrib/unac and document using
>it with tsearch. Please write us a couple of lines about using
>your function and we'll add them into tsearch documentation.
>
>btw, use palloc instead of malloc in postgresql functions .
>
>>
>> Bye.
>>
>> Nhan NGO DINH
>>
>>
>> __________________________________________________________________
>> Tiscali Ricaricasa
>> la prima prepagata per navigare in Internet a meno di un'urbana e
>> risparmiare su tutte le tue telefonate. Acquistala on line e non avrai
>> nessun costo di attivazione n? di ricarica!
>> http://ricaricasaonline.tiscali.it/
>>
>>
>>
>>
>
>    Regards,
>        Oleg
>_____________________________________________________________
>Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
>Sternberg Astronomical Institute, Moscow University (Russia)
>Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
>phone: +007(095)939-16-83, +007(095)939-23-83
>
>
>---------------------------(end of broadcast)---------------------------
>TIP 3: if posting/reading through Usenet, please send an appropriate
>subscribe-nomail command to majordomo@postgresql.org so that your
>message can get through to the mailing list cleanly



__________________________________________________________________
Tiscali Ricaricasa
la prima prepagata per navigare in Internet a meno di un'urbana e
risparmiare su tutte le tue telefonate. Acquistala on line e non avrai
nessun costo di attivazione né di ricarica!
http://ricaricasaonline.tiscali.it/





Re: unaccent

From
nngodinh@tiscali.it
Date:
Not "to_ascii", since there are so many extended UNICODE characters that
doesn't have any accent and should not be converted to an ASCII character.

>-- Messaggio Originale --
>Date: Wed, 18 Sep 2002 14:24:26 +0200
>From: Karel Zak <zakkr@zf.jcu.cz>
>To: Oleg Bartunov <oleg@sai.msu.su>
>Cc: nngodinh@tiscali.it, pgsql-hackers@postgresql.org
>Subject: Re: [HACKERS] unaccent
>
>
>On Wed, Sep 18, 2002 at 03:08:59PM +0300, Oleg Bartunov wrote:
>> On Wed, 18 Sep 2002 nngodinh@tiscali.it wrote:
>> >
>> > Get libunac from: http://www.nongnu.org/unac/ (it uses iconv)
>> >
>> > Extract the archive, compile it (make). Move pg_unac.so to your postgresql
>> > shared libraries dir.
>> >
>> I think better to have separate module contrib/unac and document using
>> it with tsearch. Please write us a couple of lines about using
>> your function and we'll add them into tsearch documentation.
>
> I think about --with-unaccent for PostgreSQL and to_ascii() in
> main tree. Comment?
>
>    Karel
>
>--
> Karel Zak  <zakkr@zf.jcu.cz>
> http://home.zf.jcu.cz/~zakkr/
>
> C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz
>
>---------------------------(end of broadcast)---------------------------
>TIP 6: Have you searched our list archives?
>
>http://archives.postgresql.org


__________________________________________________________________
Tiscali Ricaricasa
la prima prepagata per navigare in Internet a meno di un'urbana e
risparmiare su tutte le tue telefonate. Acquistala on line e non avrai
nessun costo di attivazione né di ricarica!
http://ricaricasaonline.tiscali.it/





Re: unaccent

From
Oleg Bartunov
Date:
On Wed, 18 Sep 2002, Karel Zak wrote:

> On Wed, Sep 18, 2002 at 03:08:59PM +0300, Oleg Bartunov wrote:
> > On Wed, 18 Sep 2002 nngodinh@tiscali.it wrote:
> > >
> > > Get libunac from: http://www.nongnu.org/unac/ (it uses iconv)
> > >
> > > Extract the archive, compile it (make). Move pg_unac.so to your postgresql
> > > shared libraries dir.
> > >
> > I think better to have separate module contrib/unac and document using
> > it with tsearch. Please write us a couple of lines about using
> > your function and we'll add them into tsearch documentation.
>
>  I think about --with-unaccent for PostgreSQL and to_ascii() in
>  main tree. Comment?

Hmm, it'd require linking yet another library. contrib module is
a standard way to test/develope possible future feature.

>
>     Karel
>
>
Regards,    Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83



Re: unaccent

From
Oleg Bartunov
Date:
On Wed, 18 Sep 2002 nngodinh@tiscali.it wrote:

> The best way to use it is quite simple. If you want to index the table "titles"
> and "title" is the field containing the text to be indexed, you can create
> another unaccented field, for instance "utitle".
>
> UPDATE titles SET utitle = unac(title);
>
> Of course you can set it up as a trigger function. Then you can use utitle
> with txt2txtidx and tsearch.
>
> Another solution is to generate the txtidx field (i.e. titleidx) directly
> using unac:
>
> UPDATE titles SET titleidx = txt2txtidx(unac(title));
>
> But the problem is that I've not succeeded using it with tsearch because
> (of course) it doesn't allow functions as parameters. So my first idea was
> to integrate unac in tsearch.

what's exactly a problem ?
UPDATE titles SET titleidx = txt2txtidx(unac(title));
works fine. Perhaps, you have a problem with query ?

>
> Bye.
>
> >-- Messaggio Originale --
> >Date: Wed, 18 Sep 2002 15:08:59 +0300 (GMT)
> >From: Oleg Bartunov <oleg@sai.msu.su>
> >To: nngodinh@tiscali.it
> >Cc: pgsql-hackers@postgresql.org
> >Subject: Re: [HACKERS] unaccent
> >
> >
> >On Wed, 18 Sep 2002 nngodinh@tiscali.it wrote:
> >
> >> Greetings,
> >>
> >> As far as I use the txtidx data structure in conjunction with gist indexing
> >> to make a word indexing of a very large UNICODE db, I've implemented
> a
> >PostgreSQL
> >> function that uses libunac to unaccent TEXT fileds.
> >>
> >> The resulting text is in UTF-8, but you can modify it in the sources
> with
> >> an appropriate value (using iconv charset names).
> >>
> >> Get libunac from: http://www.nongnu.org/unac/ (it uses iconv)
> >>
> >> Extract the archive, compile it (make). Move pg_unac.so to your postgresql
> >> shared libraries dir.
> >>
> >> Link it in postgresql:
> >>
> >> CREATE FUNCTION unac(TEXT) RETURNS TEXT AS 'path_to_pg_unac.so' LANGUAGE
> >> C;
> >>
> >> What about integrating unaccent libraries directly in tsearch? It is
> useful
> >> for french search engines (for instance).
> >
> >I think better to have separate module contrib/unac and document using
> >it with tsearch. Please write us a couple of lines about using
> >your function and we'll add them into tsearch documentation.
> >
> >btw, use palloc instead of malloc in postgresql functions .
> >
> >>
> >> Bye.
> >>
> >> Nhan NGO DINH
> >>
> >>
> >> __________________________________________________________________
> >> Tiscali Ricaricasa
> >> la prima prepagata per navigare in Internet a meno di un'urbana e
> >> risparmiare su tutte le tue telefonate. Acquistala on line e non avrai
> >> nessun costo di attivazione n? di ricarica!
> >> http://ricaricasaonline.tiscali.it/
> >>
> >>
> >>
> >>
> >
> >    Regards,
> >        Oleg
> >_____________________________________________________________
> >Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
> >Sternberg Astronomical Institute, Moscow University (Russia)
> >Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
> >phone: +007(095)939-16-83, +007(095)939-23-83
> >
> >
> >---------------------------(end of broadcast)---------------------------
> >TIP 3: if posting/reading through Usenet, please send an appropriate
> >subscribe-nomail command to majordomo@postgresql.org so that your
> >message can get through to the mailing list cleanly
>
>
>
> __________________________________________________________________
> Tiscali Ricaricasa
> la prima prepagata per navigare in Internet a meno di un'urbana e
> risparmiare su tutte le tue telefonate. Acquistala on line e non avrai
> nessun costo di attivazione nИ di ricarica!
> http://ricaricasaonline.tiscali.it/
>
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster
>
Regards,    Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83



Re: unaccent

From
nngodinh@tiscali.it
Date:
The txt2txtidx function works fine with unac. The problem is with the trigger:

create trigger txtidxupdate before update or insert on titles for each row
execute procedure tsearch(titleidx, title);

As you know tsearch(titleidx, unac(title)) doesn't work.

>-- Messaggio Originale --
>Date: Wed, 18 Sep 2002 17:04:56 +0300 (GMT)
>From: Oleg Bartunov <oleg@sai.msu.su>
>To: nngodinh@tiscali.it
>Cc: pgsql-hackers@postgresql.org
>Subject: Re: [HACKERS] unaccent
>
>
>On Wed, 18 Sep 2002 nngodinh@tiscali.it wrote:
>
>> The best way to use it is quite simple. If you want to index the table
>"titles"
>> and "title" is the field containing the text to be indexed, you can create
>> another unaccented field, for instance "utitle".
>>
>> UPDATE titles SET utitle = unac(title);
>>
>> Of course you can set it up as a trigger function. Then you can use utitle
>> with txt2txtidx and tsearch.
>>
>> Another solution is to generate the txtidx field (i.e. titleidx) directly
>> using unac:
>>
>> UPDATE titles SET titleidx = txt2txtidx(unac(title));
>>
>> But the problem is that I've not succeeded using it with tsearch because
>> (of course) it doesn't allow functions as parameters. So my first idea
>was
>> to integrate unac in tsearch.
>
>what's exactly a problem ?
>UPDATE titles SET titleidx = txt2txtidx(unac(title));
>works fine. Perhaps, you have a problem with query ?
>
>>
>> Bye.
>>
>> >-- Messaggio Originale --
>> >Date: Wed, 18 Sep 2002 15:08:59 +0300 (GMT)
>> >From: Oleg Bartunov <oleg@sai.msu.su>
>> >To: nngodinh@tiscali.it
>> >Cc: pgsql-hackers@postgresql.org
>> >Subject: Re: [HACKERS] unaccent
>> >
>> >
>> >On Wed, 18 Sep 2002 nngodinh@tiscali.it wrote:
>> >
>> >> Greetings,
>> >>
>> >> As far as I use the txtidx data structure in conjunction with gist
indexing
>> >> to make a word indexing of a very large UNICODE db, I've implemented
>> a
>> >PostgreSQL
>> >> function that uses libunac to unaccent TEXT fileds.
>> >>
>> >> The resulting text is in UTF-8, but you can modify it in the sources
>> with
>> >> an appropriate value (using iconv charset names).
>> >>
>> >> Get libunac from: http://www.nongnu.org/unac/ (it uses iconv)
>> >>
>> >> Extract the archive, compile it (make). Move pg_unac.so to your postgresql
>> >> shared libraries dir.
>> >>
>> >> Link it in postgresql:
>> >>
>> >> CREATE FUNCTION unac(TEXT) RETURNS TEXT AS 'path_to_pg_unac.so' LANGUAGE
>> >> C;
>> >>
>> >> What about integrating unaccent libraries directly in tsearch? It
is
>> useful
>> >> for french search engines (for instance).
>> >
>> >I think better to have separate module contrib/unac and document using
>> >it with tsearch. Please write us a couple of lines about using
>> >your function and we'll add them into tsearch documentation.
>> >
>> >btw, use palloc instead of malloc in postgresql functions .
>> >
>> >>
>> >> Bye.
>> >>
>> >> Nhan NGO DINH
>> >>
>> >>
>> >> __________________________________________________________________
>> >> Tiscali Ricaricasa
>> >> la prima prepagata per navigare in Internet a meno di un'urbana e
>> >> risparmiare su tutte le tue telefonate. Acquistala on line e non avrai
>> >> nessun costo di attivazione n? di ricarica!
>> >> http://ricaricasaonline.tiscali.it/
>> >>
>> >>
>> >>
>> >>
>> >
>> >    Regards,
>> >        Oleg
>> >_____________________________________________________________
>> >Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
>> >Sternberg Astronomical Institute, Moscow University (Russia)
>> >Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
>> >phone: +007(095)939-16-83, +007(095)939-23-83
>> >
>> >
>> >---------------------------(end of broadcast)---------------------------
>> >TIP 3: if posting/reading through Usenet, please send an appropriate
>> >subscribe-nomail command to majordomo@postgresql.org so that your
>> >message can get through to the mailing list cleanly
>>
>>
>>
>> __________________________________________________________________
>> Tiscali Ricaricasa
>> la prima prepagata per navigare in Internet a meno di un'urbana e
>> risparmiare su tutte le tue telefonate. Acquistala on line e non avrai
>> nessun costo di attivazione n? di ricarica!
>> http://ricaricasaonline.tiscali.it/
>>
>>
>>
>>
>> ---------------------------(end of broadcast)---------------------------
>> TIP 4: Don't 'kill -9' the postmaster
>>
>
>    Regards,
>        Oleg
>_____________________________________________________________
>Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
>Sternberg Astronomical Institute, Moscow University (Russia)
>Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
>phone: +007(095)939-16-83, +007(095)939-23-83
>
>
>---------------------------(end of broadcast)---------------------------
>TIP 6: Have you searched our list archives?
>
>http://archives.postgresql.org


__________________________________________________________________
Tiscali Ricaricasa
la prima prepagata per navigare in Internet a meno di un'urbana e
risparmiare su tutte le tue telefonate. Acquistala on line e non avrai
nessun costo di attivazione né di ricarica!
http://ricaricasaonline.tiscali.it/





Re: unaccent

From
Peter Eisentraut
Date:
nngodinh@tiscali.it writes:

> Not "to_ascii", since there are so many extended UNICODE characters that
> doesn't have any accent and should not be converted to an ASCII character.

Really, the accent conversion should be part of the character set
conversion routines.  At least my local iconv does that.

In general, the determination of what is an accent and how to convert it
is both dependent on locale and the intended usage.  It's not clear how
that should be handled.

-- 
Peter Eisentraut   peter_e@gmx.net