Thread: unaccent
Greetings, As far as I use the txtidx data structure in conjunction with gist indexing to make a word indexing of a very large UNICODE db, I've implemented a PostgreSQL function that uses libunac to unaccent TEXT fileds. The resulting text is in UTF-8, but you can modify it in the sources with an appropriate value (using iconv charset names). Get libunac from: http://www.nongnu.org/unac/ (it uses iconv) Extract the archive, compile it (make). Move pg_unac.so to your postgresql shared libraries dir. Link it in postgresql: CREATE FUNCTION unac(TEXT) RETURNS TEXT AS 'path_to_pg_unac.so' LANGUAGE C; What about integrating unaccent libraries directly in tsearch? It is useful for french search engines (for instance). Bye. Nhan NGO DINH __________________________________________________________________ Tiscali Ricaricasa la prima prepagata per navigare in Internet a meno di un'urbana e risparmiare su tutte le tue telefonate. Acquistala on line e non avrai nessun costo di attivazione né di ricarica! http://ricaricasaonline.tiscali.it/
Attachment
On Wed, 18 Sep 2002 nngodinh@tiscali.it wrote: > Greetings, > > As far as I use the txtidx data structure in conjunction with gist indexing > to make a word indexing of a very large UNICODE db, I've implemented a PostgreSQL > function that uses libunac to unaccent TEXT fileds. > > The resulting text is in UTF-8, but you can modify it in the sources with > an appropriate value (using iconv charset names). > > Get libunac from: http://www.nongnu.org/unac/ (it uses iconv) > > Extract the archive, compile it (make). Move pg_unac.so to your postgresql > shared libraries dir. > > Link it in postgresql: > > CREATE FUNCTION unac(TEXT) RETURNS TEXT AS 'path_to_pg_unac.so' LANGUAGE > C; > > What about integrating unaccent libraries directly in tsearch? It is useful > for french search engines (for instance). I think better to have separate module contrib/unac and document using it with tsearch. Please write us a couple of lines about using your function and we'll add them into tsearch documentation. btw, use palloc instead of malloc in postgresql functions . > > Bye. > > Nhan NGO DINH > > > __________________________________________________________________ > Tiscali Ricaricasa > la prima prepagata per navigare in Internet a meno di un'urbana e > risparmiare su tutte le tue telefonate. Acquistala on line e non avrai > nessun costo di attivazione nИ di ricarica! > http://ricaricasaonline.tiscali.it/ > > > > Regards, Oleg _____________________________________________________________ Oleg Bartunov, sci.researcher, hostmaster of AstroNet, Sternberg Astronomical Institute, Moscow University (Russia) Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(095)939-16-83, +007(095)939-23-83
On Wed, Sep 18, 2002 at 03:08:59PM +0300, Oleg Bartunov wrote: > On Wed, 18 Sep 2002 nngodinh@tiscali.it wrote: > > > > Get libunac from: http://www.nongnu.org/unac/ (it uses iconv) > > > > Extract the archive, compile it (make). Move pg_unac.so to your postgresql > > shared libraries dir. > > > I think better to have separate module contrib/unac and document using > it with tsearch. Please write us a couple of lines about using > your function and we'll add them into tsearch documentation. I think about --with-unaccent for PostgreSQL and to_ascii() inmain tree. Comment? Karel -- Karel Zak <zakkr@zf.jcu.cz>http://home.zf.jcu.cz/~zakkr/C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz
The best way to use it is quite simple. If you want to index the table "titles" and "title" is the field containing the text to be indexed, you can create another unaccented field, for instance "utitle". UPDATE titles SET utitle = unac(title); Of course you can set it up as a trigger function. Then you can use utitle with txt2txtidx and tsearch. Another solution is to generate the txtidx field (i.e. titleidx) directly using unac: UPDATE titles SET titleidx = txt2txtidx(unac(title)); But the problem is that I've not succeeded using it with tsearch because (of course) it doesn't allow functions as parameters. So my first idea was to integrate unac in tsearch. Bye. >-- Messaggio Originale -- >Date: Wed, 18 Sep 2002 15:08:59 +0300 (GMT) >From: Oleg Bartunov <oleg@sai.msu.su> >To: nngodinh@tiscali.it >Cc: pgsql-hackers@postgresql.org >Subject: Re: [HACKERS] unaccent > > >On Wed, 18 Sep 2002 nngodinh@tiscali.it wrote: > >> Greetings, >> >> As far as I use the txtidx data structure in conjunction with gist indexing >> to make a word indexing of a very large UNICODE db, I've implemented a >PostgreSQL >> function that uses libunac to unaccent TEXT fileds. >> >> The resulting text is in UTF-8, but you can modify it in the sources with >> an appropriate value (using iconv charset names). >> >> Get libunac from: http://www.nongnu.org/unac/ (it uses iconv) >> >> Extract the archive, compile it (make). Move pg_unac.so to your postgresql >> shared libraries dir. >> >> Link it in postgresql: >> >> CREATE FUNCTION unac(TEXT) RETURNS TEXT AS 'path_to_pg_unac.so' LANGUAGE >> C; >> >> What about integrating unaccent libraries directly in tsearch? It is useful >> for french search engines (for instance). > >I think better to have separate module contrib/unac and document using >it with tsearch. Please write us a couple of lines about using >your function and we'll add them into tsearch documentation. > >btw, use palloc instead of malloc in postgresql functions . > >> >> Bye. >> >> Nhan NGO DINH >> >> >> __________________________________________________________________ >> Tiscali Ricaricasa >> la prima prepagata per navigare in Internet a meno di un'urbana e >> risparmiare su tutte le tue telefonate. Acquistala on line e non avrai >> nessun costo di attivazione n? di ricarica! >> http://ricaricasaonline.tiscali.it/ >> >> >> >> > > Regards, > Oleg >_____________________________________________________________ >Oleg Bartunov, sci.researcher, hostmaster of AstroNet, >Sternberg Astronomical Institute, Moscow University (Russia) >Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ >phone: +007(095)939-16-83, +007(095)939-23-83 > > >---------------------------(end of broadcast)--------------------------- >TIP 3: if posting/reading through Usenet, please send an appropriate >subscribe-nomail command to majordomo@postgresql.org so that your >message can get through to the mailing list cleanly __________________________________________________________________ Tiscali Ricaricasa la prima prepagata per navigare in Internet a meno di un'urbana e risparmiare su tutte le tue telefonate. Acquistala on line e non avrai nessun costo di attivazione né di ricarica! http://ricaricasaonline.tiscali.it/
Not "to_ascii", since there are so many extended UNICODE characters that doesn't have any accent and should not be converted to an ASCII character. >-- Messaggio Originale -- >Date: Wed, 18 Sep 2002 14:24:26 +0200 >From: Karel Zak <zakkr@zf.jcu.cz> >To: Oleg Bartunov <oleg@sai.msu.su> >Cc: nngodinh@tiscali.it, pgsql-hackers@postgresql.org >Subject: Re: [HACKERS] unaccent > > >On Wed, Sep 18, 2002 at 03:08:59PM +0300, Oleg Bartunov wrote: >> On Wed, 18 Sep 2002 nngodinh@tiscali.it wrote: >> > >> > Get libunac from: http://www.nongnu.org/unac/ (it uses iconv) >> > >> > Extract the archive, compile it (make). Move pg_unac.so to your postgresql >> > shared libraries dir. >> > >> I think better to have separate module contrib/unac and document using >> it with tsearch. Please write us a couple of lines about using >> your function and we'll add them into tsearch documentation. > > I think about --with-unaccent for PostgreSQL and to_ascii() in > main tree. Comment? > > Karel > >-- > Karel Zak <zakkr@zf.jcu.cz> > http://home.zf.jcu.cz/~zakkr/ > > C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz > >---------------------------(end of broadcast)--------------------------- >TIP 6: Have you searched our list archives? > >http://archives.postgresql.org __________________________________________________________________ Tiscali Ricaricasa la prima prepagata per navigare in Internet a meno di un'urbana e risparmiare su tutte le tue telefonate. Acquistala on line e non avrai nessun costo di attivazione né di ricarica! http://ricaricasaonline.tiscali.it/
On Wed, 18 Sep 2002, Karel Zak wrote: > On Wed, Sep 18, 2002 at 03:08:59PM +0300, Oleg Bartunov wrote: > > On Wed, 18 Sep 2002 nngodinh@tiscali.it wrote: > > > > > > Get libunac from: http://www.nongnu.org/unac/ (it uses iconv) > > > > > > Extract the archive, compile it (make). Move pg_unac.so to your postgresql > > > shared libraries dir. > > > > > I think better to have separate module contrib/unac and document using > > it with tsearch. Please write us a couple of lines about using > > your function and we'll add them into tsearch documentation. > > I think about --with-unaccent for PostgreSQL and to_ascii() in > main tree. Comment? Hmm, it'd require linking yet another library. contrib module is a standard way to test/develope possible future feature. > > Karel > > Regards, Oleg _____________________________________________________________ Oleg Bartunov, sci.researcher, hostmaster of AstroNet, Sternberg Astronomical Institute, Moscow University (Russia) Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(095)939-16-83, +007(095)939-23-83
On Wed, 18 Sep 2002 nngodinh@tiscali.it wrote: > The best way to use it is quite simple. If you want to index the table "titles" > and "title" is the field containing the text to be indexed, you can create > another unaccented field, for instance "utitle". > > UPDATE titles SET utitle = unac(title); > > Of course you can set it up as a trigger function. Then you can use utitle > with txt2txtidx and tsearch. > > Another solution is to generate the txtidx field (i.e. titleidx) directly > using unac: > > UPDATE titles SET titleidx = txt2txtidx(unac(title)); > > But the problem is that I've not succeeded using it with tsearch because > (of course) it doesn't allow functions as parameters. So my first idea was > to integrate unac in tsearch. what's exactly a problem ? UPDATE titles SET titleidx = txt2txtidx(unac(title)); works fine. Perhaps, you have a problem with query ? > > Bye. > > >-- Messaggio Originale -- > >Date: Wed, 18 Sep 2002 15:08:59 +0300 (GMT) > >From: Oleg Bartunov <oleg@sai.msu.su> > >To: nngodinh@tiscali.it > >Cc: pgsql-hackers@postgresql.org > >Subject: Re: [HACKERS] unaccent > > > > > >On Wed, 18 Sep 2002 nngodinh@tiscali.it wrote: > > > >> Greetings, > >> > >> As far as I use the txtidx data structure in conjunction with gist indexing > >> to make a word indexing of a very large UNICODE db, I've implemented > a > >PostgreSQL > >> function that uses libunac to unaccent TEXT fileds. > >> > >> The resulting text is in UTF-8, but you can modify it in the sources > with > >> an appropriate value (using iconv charset names). > >> > >> Get libunac from: http://www.nongnu.org/unac/ (it uses iconv) > >> > >> Extract the archive, compile it (make). Move pg_unac.so to your postgresql > >> shared libraries dir. > >> > >> Link it in postgresql: > >> > >> CREATE FUNCTION unac(TEXT) RETURNS TEXT AS 'path_to_pg_unac.so' LANGUAGE > >> C; > >> > >> What about integrating unaccent libraries directly in tsearch? It is > useful > >> for french search engines (for instance). > > > >I think better to have separate module contrib/unac and document using > >it with tsearch. Please write us a couple of lines about using > >your function and we'll add them into tsearch documentation. > > > >btw, use palloc instead of malloc in postgresql functions . > > > >> > >> Bye. > >> > >> Nhan NGO DINH > >> > >> > >> __________________________________________________________________ > >> Tiscali Ricaricasa > >> la prima prepagata per navigare in Internet a meno di un'urbana e > >> risparmiare su tutte le tue telefonate. Acquistala on line e non avrai > >> nessun costo di attivazione n? di ricarica! > >> http://ricaricasaonline.tiscali.it/ > >> > >> > >> > >> > > > > Regards, > > Oleg > >_____________________________________________________________ > >Oleg Bartunov, sci.researcher, hostmaster of AstroNet, > >Sternberg Astronomical Institute, Moscow University (Russia) > >Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ > >phone: +007(095)939-16-83, +007(095)939-23-83 > > > > > >---------------------------(end of broadcast)--------------------------- > >TIP 3: if posting/reading through Usenet, please send an appropriate > >subscribe-nomail command to majordomo@postgresql.org so that your > >message can get through to the mailing list cleanly > > > > __________________________________________________________________ > Tiscali Ricaricasa > la prima prepagata per navigare in Internet a meno di un'urbana e > risparmiare su tutte le tue telefonate. Acquistala on line e non avrai > nessun costo di attivazione nИ di ricarica! > http://ricaricasaonline.tiscali.it/ > > > > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Don't 'kill -9' the postmaster > Regards, Oleg _____________________________________________________________ Oleg Bartunov, sci.researcher, hostmaster of AstroNet, Sternberg Astronomical Institute, Moscow University (Russia) Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(095)939-16-83, +007(095)939-23-83
The txt2txtidx function works fine with unac. The problem is with the trigger: create trigger txtidxupdate before update or insert on titles for each row execute procedure tsearch(titleidx, title); As you know tsearch(titleidx, unac(title)) doesn't work. >-- Messaggio Originale -- >Date: Wed, 18 Sep 2002 17:04:56 +0300 (GMT) >From: Oleg Bartunov <oleg@sai.msu.su> >To: nngodinh@tiscali.it >Cc: pgsql-hackers@postgresql.org >Subject: Re: [HACKERS] unaccent > > >On Wed, 18 Sep 2002 nngodinh@tiscali.it wrote: > >> The best way to use it is quite simple. If you want to index the table >"titles" >> and "title" is the field containing the text to be indexed, you can create >> another unaccented field, for instance "utitle". >> >> UPDATE titles SET utitle = unac(title); >> >> Of course you can set it up as a trigger function. Then you can use utitle >> with txt2txtidx and tsearch. >> >> Another solution is to generate the txtidx field (i.e. titleidx) directly >> using unac: >> >> UPDATE titles SET titleidx = txt2txtidx(unac(title)); >> >> But the problem is that I've not succeeded using it with tsearch because >> (of course) it doesn't allow functions as parameters. So my first idea >was >> to integrate unac in tsearch. > >what's exactly a problem ? >UPDATE titles SET titleidx = txt2txtidx(unac(title)); >works fine. Perhaps, you have a problem with query ? > >> >> Bye. >> >> >-- Messaggio Originale -- >> >Date: Wed, 18 Sep 2002 15:08:59 +0300 (GMT) >> >From: Oleg Bartunov <oleg@sai.msu.su> >> >To: nngodinh@tiscali.it >> >Cc: pgsql-hackers@postgresql.org >> >Subject: Re: [HACKERS] unaccent >> > >> > >> >On Wed, 18 Sep 2002 nngodinh@tiscali.it wrote: >> > >> >> Greetings, >> >> >> >> As far as I use the txtidx data structure in conjunction with gist indexing >> >> to make a word indexing of a very large UNICODE db, I've implemented >> a >> >PostgreSQL >> >> function that uses libunac to unaccent TEXT fileds. >> >> >> >> The resulting text is in UTF-8, but you can modify it in the sources >> with >> >> an appropriate value (using iconv charset names). >> >> >> >> Get libunac from: http://www.nongnu.org/unac/ (it uses iconv) >> >> >> >> Extract the archive, compile it (make). Move pg_unac.so to your postgresql >> >> shared libraries dir. >> >> >> >> Link it in postgresql: >> >> >> >> CREATE FUNCTION unac(TEXT) RETURNS TEXT AS 'path_to_pg_unac.so' LANGUAGE >> >> C; >> >> >> >> What about integrating unaccent libraries directly in tsearch? It is >> useful >> >> for french search engines (for instance). >> > >> >I think better to have separate module contrib/unac and document using >> >it with tsearch. Please write us a couple of lines about using >> >your function and we'll add them into tsearch documentation. >> > >> >btw, use palloc instead of malloc in postgresql functions . >> > >> >> >> >> Bye. >> >> >> >> Nhan NGO DINH >> >> >> >> >> >> __________________________________________________________________ >> >> Tiscali Ricaricasa >> >> la prima prepagata per navigare in Internet a meno di un'urbana e >> >> risparmiare su tutte le tue telefonate. Acquistala on line e non avrai >> >> nessun costo di attivazione n? di ricarica! >> >> http://ricaricasaonline.tiscali.it/ >> >> >> >> >> >> >> >> >> > >> > Regards, >> > Oleg >> >_____________________________________________________________ >> >Oleg Bartunov, sci.researcher, hostmaster of AstroNet, >> >Sternberg Astronomical Institute, Moscow University (Russia) >> >Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ >> >phone: +007(095)939-16-83, +007(095)939-23-83 >> > >> > >> >---------------------------(end of broadcast)--------------------------- >> >TIP 3: if posting/reading through Usenet, please send an appropriate >> >subscribe-nomail command to majordomo@postgresql.org so that your >> >message can get through to the mailing list cleanly >> >> >> >> __________________________________________________________________ >> Tiscali Ricaricasa >> la prima prepagata per navigare in Internet a meno di un'urbana e >> risparmiare su tutte le tue telefonate. Acquistala on line e non avrai >> nessun costo di attivazione n? di ricarica! >> http://ricaricasaonline.tiscali.it/ >> >> >> >> >> ---------------------------(end of broadcast)--------------------------- >> TIP 4: Don't 'kill -9' the postmaster >> > > Regards, > Oleg >_____________________________________________________________ >Oleg Bartunov, sci.researcher, hostmaster of AstroNet, >Sternberg Astronomical Institute, Moscow University (Russia) >Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ >phone: +007(095)939-16-83, +007(095)939-23-83 > > >---------------------------(end of broadcast)--------------------------- >TIP 6: Have you searched our list archives? > >http://archives.postgresql.org __________________________________________________________________ Tiscali Ricaricasa la prima prepagata per navigare in Internet a meno di un'urbana e risparmiare su tutte le tue telefonate. Acquistala on line e non avrai nessun costo di attivazione né di ricarica! http://ricaricasaonline.tiscali.it/
nngodinh@tiscali.it writes: > Not "to_ascii", since there are so many extended UNICODE characters that > doesn't have any accent and should not be converted to an ASCII character. Really, the accent conversion should be part of the character set conversion routines. At least my local iconv does that. In general, the determination of what is an accent and how to convert it is both dependent on locale and the intended usage. It's not clear how that should be handled. -- Peter Eisentraut peter_e@gmx.net