Thread: Text search with ispell
I'm trying to figure out how to use PostgreSQL's full-text search with an ispell dictionary, but I'm having a bit of trouble figuring out where this norwegian.dict file comes from.
When I install the Norwegian ispell dictionary, I get four files: nb.aff, nb.hash, nn.aff and nn.hash. What I can't figure out is which steps are needed to use these with PostgreSQL.

-- 
Tommy
Tommy Gildseth wrote:
> I'm trying to figure out how to use PostgreSQL's fulltext search with an
> ispell dictionary. I'm having a bit of trouble figuring out where this
> norwegian.dict comes from though.
> When I install the norwegian ispell dictionary, i get 4 files, nb.aff,
> nb.hash, nn.aff and nn.hash. What I'm unable to figure out, is the steps
> needed to use this for PostgreSQL?

Which version are you running? It's important to know, because tsearch2 has been integrated into the core since version 8.3. The setup procedure for earlier versions is therefore different.

Cheers
Andy

-- 
St.Pauli - Hamburg - Germany
Andreas Wenk
Andreas Wenk wrote:
> Which version are you running? It's important to know, because tsearch2 is integrated
> since version 8.3. The behaviour for implementing in earlier versions is therefore
> different ...

It will be running on version 8.3.

-- 
Tommy Gildseth
On Tue, 27 Jan 2009, Tommy Gildseth wrote:
> When I install the norwegian ispell dictionary, i get 4 files, nb.aff,
> nb.hash, nn.aff and nn.hash. What I'm unable to figure out, is the steps
> needed to use this for PostgreSQL?

You need to choose between the two written forms of Norwegian, nn and nb; see http://en.wikipedia.org/wiki/Norwegian_language
Then follow the standard procedure described in the documentation.
Where did you get the files?

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
Oleg Bartunov wrote:
> you need to make a choice between two kinds of norwegian language - nn, nb,
> see http://en.wikipedia.org/wiki/Norwegian_language
> Then follow standard procedure described in documentation.
> Where did you get them ?

Yes, I'm aware that I need to choose one of those. I guess what I'm having problems with is figuring out where the <language>.dict file comes from.
I didn't find any such file in the rpm downloaded from the links at http://ficus-www.cs.ucla.edu/geoff/ispell.html#ftp-sites, nor in the inorwegian package in the Ubuntu apt repository.
I have read through http://www.postgresql.org/docs/current/static/textsearch.html, but it's not quite clear to me from that what I need to do to use an ispell dictionary with tsearch.

-- 
Tommy Gildseth
Have you read
http://www.postgresql.org/docs/current/static/textsearch-dictionaries.html#TEXTSEARCH-ISPELL-DICTIONARY ?

We suggest using the dictionaries that come with OpenOffice; hunspell probably has better support for compound words.

On Tue, 27 Jan 2009, Tommy Gildseth wrote:
> Yes, I'm aware of that I need to choose one of those. I guess what I'm having
> problems with, is figuring out where the <language>.dict file comes from.
> I didn't find any such file in the rpm downloaded from the links at
> http://ficus-www.cs.ucla.edu/geoff/ispell.html#ftp-sites and also not in the
> inorwegian-package in the ubuntu apt repository.

Regards,
Oleg
Oleg Bartunov wrote:
> Have you read
> http://www.postgresql.org/docs/current/static/textsearch-dictionaries.html#TEXTSEARCH-ISPELL-DICTIONARY
>
> We suggest to use dictionaries which come with openoffice, hunspell,
> probably has better support of composite words.

Thanks, that knocked me onto the right track. Too easy to miss the blindingly obvious at times. :-)
Works beautifully now.

-- 
Tommy Gildseth
Tommy Gildseth wrote:
> Thanks, that knocked me onto the right track. To easy to miss the
> blindingly obvious at times. :-)
> Works beautifully now.

I may have been too quick to declare success.

The following works as expected, returning the individual words:

    SELECT
      ts_debug('norwegian', 'overbuljongterningpakkmesterassistent'),
      ts_debug('norwegian', 'sjokoladefabrikk'),
      ts_debug('norwegian', 'epleskrott');

    -[ RECORD 1 ]---------------------------------------------------------
    ts_debug | (asciiword,"Word, all ASCII",overbuljongterningpakkmesterassistent,"{no_ispell,norwegian_stem}",no_ispell,"{buljong,terning,pakk,mester,assistent}")
    ts_debug | (asciiword,"Word, all ASCII",sjokoladefabrikk,"{no_ispell,norwegian_stem}",no_ispell,"{sjokoladefabrikk,sjokolade,fabrikk}")
    ts_debug | (asciiword,"Word, all ASCII",epleskrott,"{no_ispell,norwegian_stem}",no_ispell,"{epleskrott,eple,skrott}")

But the following does not:

    SELECT
      ts_debug('norwegian', 'hemsedalsdans'),
      ts_debug('norwegian', 'lærdalsbrua'),
      ts_debug('norwegian', 'hengesmykke');

    -[ RECORD 1 ]---------------------------------------------------------
    ts_debug | (asciiword,"Word, all ASCII",hemsedalsdans,"{no_ispell,norwegian_stem}",norwegian_stem,{hemsedalsdan})
    ts_debug | (word,"Word, all letters",lærdalsbrua,"{no_ispell,norwegian_stem}",norwegian_stem,{lærdalsbru})
    ts_debug | (asciiword,"Word, all ASCII",hengesmykke,"{no_ispell,norwegian_stem}",norwegian_stem,{hengesmykk})

Would this be due to a limitation in the dictionary, or a misconfiguration on my side?
Commands used are as follows:

    CREATE TEXT SEARCH DICTIONARY no_ispell (
        TEMPLATE = ispell,
        DictFile = nb_NO,
        AffFile = nb_NO,
        StopWords = norwegian
    );

and

    ALTER TEXT SEARCH CONFIGURATION norwegian
        ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                          word, hword, hword_part
        WITH no_ispell, norwegian_stem;

-- 
Tommy Gildseth
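A minimal sketch for testing the dictionary in isolation, assuming the no_ispell dictionary created by the commands above exists. ts_lexize calls a single dictionary directly, so it shows whether the ispell dictionary itself recognises a word, independent of the parser and of the norwegian_stem fallback in the configuration. No output is shown here because it depends on the installed nb_NO files.

```sql
-- Ask the ispell dictionary directly. A NULL result means the word is
-- unknown to this dictionary, so in the 'norwegian' configuration it
-- would fall through to norwegian_stem.
SELECT ts_lexize('no_ispell', 'sjokoladefabrikk');
SELECT ts_lexize('no_ispell', 'hengesmykke');

-- Check the two components of the compound on their own.
SELECT ts_lexize('no_ispell', 'henge')  AS henge,
       ts_lexize('no_ispell', 'smykke') AS smykke;
```

If the components are recognised individually but the compound returns NULL, that points at the dictionary files rather than at the mapping.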
On Tue, 27 Jan 2009, Tommy Gildseth wrote:
> But, the following does not:
> SELECT
> ts_debug('norwegian', 'hemsedalsdans'),
> ts_debug('norwegian', 'lærdalsbrua'),
> ts_debug('norwegian', 'hengesmykke');
> [...]
> Would this be due to a limitation in the dictionary, or a misconfiguration on
> my side?

Sorry, I don't know Norwegian. What do you mean? Are you saying that no_ispell doesn't recognize these words?

Regards,
Oleg
Oleg Bartunov wrote:
> sorry, I don't know norwegian, what do you mean ? Did you complain that
> no_ispell doesn't recognize these words ?

Yes, I'm sorry, I should have explained better.
The words hemsedalsdans, hengesmykke and lærdalsbrua are "concatenations" of the words Hemsedal and dans, henge and smykke, and Lærdal and bru. Hemsedal and Lærdal are in fact geographic names, so I'm not sure the dictionary would handle those at all anyway. Both parts of the word hengesmykke, i.e. henge and smykke, are in the dictionary though. It seems that it is able to split some words properly, while others it doesn't recognise.

The problem I'm trying to work around is that, as far as I can tell, tsearch doesn't support truncation, i.e. searching for "*smykke" or "hemsedal*" etc.

-- 
Tommy Gildseth
On Tue, 27 Jan 2009, Tommy Gildseth wrote:
> Both parts of the word, hengesmykke, is in the dictionary
> though, ie. both henge and smykke. It seems that some words it is able to
> properly spilt, and then some it doesn't recognise.

You may improve the dictionary. The affix file should have

    COMPOUNDFLAG z

and the dict file should contain 'henge' and 'smykke' with that flag 'z'.
Where did you get the dictionary?

> The problem I'm trying to work around, is that as far as I can tell, tsearch
> doesn't support truncation, ie. searching for "*smykke" or "hemsedal*" etc.

Version 8.4 will support prefix search ("hemsedal*"). But you could always write your own dictionary, or just use the dict_xsyn dictionary for these kinds of exceptions.
http://www.postgresql.org/docs/8.3/static/dict-xsyn.html

Regards,
Oleg
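The two suggestions above can be sketched as follows. This is an illustration under stated assumptions, not a tested setup: the prefix query needs PostgreSQL 8.4 or later, the dict_xsyn contrib module must already be installed so that xsyn_template exists, and the dictionary name no_compounds and the rules file norwegian_compounds.rules are hypothetical names chosen for this example. The 'simple' configuration is used in the first query only to keep stemming out of the picture.

```sql
-- Prefix matching (8.4+): ':*' after a term in to_tsquery marks it as a
-- prefix. Only prefixes are supported; there is no equivalent for a
-- leading wildcard such as "*smykke".
SELECT to_tsvector('simple', 'hemsedalsdans')
       @@ to_tsquery('simple', 'hemsedal:*') AS prefix_match;

-- dict_xsyn for hand-maintained exceptions. Assumes a hypothetical file
-- $SHAREDIR/tsearch_data/norwegian_compounds.rules with one word per
-- line followed by its expansions, for example:
--   hemsedalsdans hemsedal dans
--   hengesmykke henge smykke
CREATE TEXT SEARCH DICTIONARY no_compounds (
    TEMPLATE = xsyn_template,
    RULES = 'norwegian_compounds',
    KEEPORIG = true
);

-- Consult the exception list first, then fall back to the dictionaries
-- already used in this thread.
ALTER TEXT SEARCH CONFIGURATION norwegian
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH no_compounds, no_ispell, norwegian_stem;
```

Putting the exception dictionary first matters because the first dictionary in the list that recognises a token decides the result; words it does not list are passed on to no_ispell and then to norwegian_stem.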