Thread: What is the simplest text search configuration?
Hi all,

I'd like to implement a full text search with PostgreSQL, and I can't find
a text search configuration that would just:

  - map Unicode accented letters to their unaccented equivalents
  - tokenize the words (and skip any non-word characters)
  - apply no stopwords
  - lowercase the tokens

How can I achieve this? I'm particularly interested in deactivating
the stopword filtering.

I tried pg_catalog.simple, but despite its name, it still considers stop words.

Thanks for your help!

Jerome.

--
Jerome Eteve.
http://www.eteve.net
jerome@eteve.net
Dear Jerome,
From personal experience, full-text searching in PostgreSQL can be quite powerful,
but it's not simple: it requires thought, planning and coding. PostgreSQL mainly
provides an efficient token matching mechanism supporting positional information
and weights, but natural language processing and normalization are pretty basic.
If you don't mind writing a couple of user-defined functions to take control of lexeme
normalization, then tsvector/tsquery support can be a very powerful tool for custom
search engines.
regards,
Michael
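
As a rough illustration of the kind of user-defined normalization function mentioned above, here is a minimal sketch. The function name demo_normalize and its character list are hypothetical and far from complete, and it assumes a UTF-8 database whose locale lowercases accented letters.

```sql
-- Hypothetical helper: lowercase the input and fold a few accented
-- Latin letters to ASCII. The character list is illustrative only.
CREATE FUNCTION demo_normalize(text) RETURNS text AS $$
    SELECT translate(lower($1),
                     'àâäéèêëîïôöùûüç',
                     'aaaeeeeiioouuuc');
$$ LANGUAGE sql IMMUTABLE;

-- Normalize before building the tsvector. The same function has to be
-- applied to the query text so that both sides match.
SELECT to_tsvector('simple', demo_normalize('Jérôme Etévé'));
SELECT to_tsquery('simple', demo_normalize('Etévé'));
```

The trade-off of this approach is that every caller must remember to apply the function on both the indexing and the query side.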
2009/11/12 Jérôme Etévé <jerome.eteve@gmail.com>
> Hi all,
>
> I'd like to implement a full text search with postgresql, and I can't find
> a text search configuration that would just:
>
> map unicode accentuated letters to an un-accentuated equivalent
> tokenize the words (and skip any non word characters)
> no stopwords
> lower case the tokens
>
> How can I achieve this? I'm particularly interested in deactivating
> the stopwords filtering.
>
> I tried pg_catalog.simple, but despite its name, it still considers stop words.
>
> Thanks for your help!
>
> Jerome.
Hi Michael,

I actually found that the 'simple' dictionary doesn't enforce a
stopword list by default, so I defined my search conf like this and it
works:

    create text search configuration sbsimple ( parser = 'default' );
    alter text search configuration sbsimple
        ALTER MAPPING FOR word, hword, asciiword, asciihword WITH simple;

Cheers!

J.
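
For readers who want to check this behaviour themselves, the following is a minimal sketch along the same lines. The configuration name nostop_demo is hypothetical, and ADD MAPPING is used here because the configuration starts out with no mappings; it is not meant as a replacement for the statements above.

```sql
-- Hypothetical configuration name; starts with the default parser and
-- no token-type mappings.
CREATE TEXT SEARCH CONFIGURATION nostop_demo (
    PARSER = pg_catalog."default"
);

-- Route plain and hyphenated words to the built-in 'simple' dictionary,
-- which lowercases tokens and has no stopword list configured.
ALTER TEXT SEARCH CONFIGURATION nostop_demo
    ADD MAPPING FOR asciiword, word, asciihword, hword
    WITH pg_catalog.simple;

-- Compare: words such as 'the' and 'and' should survive in the first
-- result but be dropped by the 'english' configuration in the second.
SELECT to_tsvector('nostop_demo', 'The Quick and the Dead');
SELECT to_tsvector('english', 'The Quick and the Dead');
```

Token types that have no mapping (numbers, URLs, e-mail addresses and so on) are silently discarded by this configuration, which matches the "skip any non word characters" requirement only loosely.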
Jérôme Etévé <jerome.eteve@gmail.com> writes:
> I'd like to implement a full text search with postgresql, and I can't find
> a text search configuration that would just:
> map unicode accentuated letters to an un-accentuated equivalent
> tokenize the words (and skip any non word characters)
> no stopwords
> lower case the tokens
> How can I achieve this? I'm particularly interested in deactivating
> the stopwords filtering.
> I tried pg_catalog.simple, but despite its name, it still considers stop words.

What's wrong with specifying an empty stopword list?

(To me, removing accents is already past what I'd expect of a "simple"
configuration, so I doubt you're going to find a dictionary that provides
exactly that set of features and no other ones.)

regards, tom lane
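
One way to read the "empty stopword list" suggestion is a dictionary built on the simple template with no StopWords parameter at all. The sketch below assumes that reading; the dictionary name demo_simple_nostop is hypothetical.

```sql
-- Hypothetical dictionary name. Omitting the StopWords parameter means
-- no stopword file is consulted.
CREATE TEXT SEARCH DICTIONARY demo_simple_nostop (
    TEMPLATE = pg_catalog.simple
);

-- ts_lexize shows what a single dictionary does with one token;
-- here it should return {the} rather than an empty array.
SELECT ts_lexize('demo_simple_nostop', 'The');
```

A dictionary defined this way can then be named in the WITH clause of ALTER TEXT SEARCH CONFIGURATION in place of the built-in simple dictionary.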
We submitted an unaccent dictionary for 8.5.
See http://www.sai.msu.su/~megera/wiki/unaccent for some information.

Regards,
    Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
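
To show how an unaccent dictionary can be combined with the simple dictionary, here is a minimal sketch. It assumes a later PostgreSQL release in which the module is available as the unaccent extension (CREATE EXTENSION requires 9.1 or newer), and the configuration name unaccent_simple is hypothetical.

```sql
-- Assumption: the unaccent contrib module is installed on the server.
CREATE EXTENSION IF NOT EXISTS unaccent;

-- Hypothetical configuration name; copies the built-in 'simple'
-- configuration, which applies no stopword list.
CREATE TEXT SEARCH CONFIGURATION unaccent_simple (
    COPY = pg_catalog.simple
);

-- unaccent is a filtering dictionary: it strips accents and passes the
-- result on to the next dictionary, which lowercases the token.
ALTER TEXT SEARCH CONFIGURATION unaccent_simple
    ALTER MAPPING FOR hword, hword_part, word
    WITH unaccent, simple;

-- Accented and unaccented spellings should now produce the same lexemes.
SELECT to_tsvector('unaccent_simple', 'Jérôme Etévé');
SELECT to_tsvector('unaccent_simple', 'Jerome Eteve');
```

Only the token types that can contain non-ASCII letters are remapped; asciiword and asciihword keep the mapping copied from the simple configuration.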