Thread: tsearch2 dictionary that indexes substrings?
Hi,

In this http://archive.netbsd.se/?ml=pgsql-hackers&a=2006-10&t=2475028
thread Teodor Sigaev describes a way to make tsearch2 index substrings
of words:

> Brain storm method:
>
> Develop a dictionary which returns all substring for lexeme, for
> example for word foobar it will be 'foobar fooba foob foo fo oobar
> ooba oob oo obar oba ob bar ba ar'

Did anyone do that already?

If I understand it correctly, such a dictionary would require writing
a custom C component - is that correct? Or could I get away with
writing a plpgsql function that does the above and hooking that
somehow into the tsearch2 config?

thanks, Til
You want trigram based search, ie.

  postgresql -> 'pos', 'ost', 'stg', 'tgr', 'gre', 'res', 'esq', 'sql'

Searching for 'gresq' is then searching for 'gre' and 'res' and 'esq',
which is good friends with bitmap scan. Then a little LIKE '%gresq%' to
filter the results.

PS: indexing all substrings means that for long words you get a huge
number of lexemes...
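To make that concrete - a rough, untested sketch (assuming a table
content(body text) and tsearch2's stock 'simple' configuration; all
object names are illustrative):

  -- helper: a string's space-separated trigrams,
  -- eg. trigrams('postgresql') = 'pos ost stg tgr gre res esq sql'
  CREATE FUNCTION trigrams(text) RETURNS text AS $$
    SELECT array_to_string(ARRAY(
      SELECT substr($1, i, 3)
      FROM generate_series(1, length($1) - 2) AS i
    ), ' ');
  $$ LANGUAGE sql IMMUTABLE;

  ALTER TABLE content ADD COLUMN body_trgm tsvector;
  UPDATE content
     SET body_trgm = to_tsvector('simple', trigrams(lower(body)));
  CREATE INDEX content_trgm_idx ON content USING gin (body_trgm);

  -- every trigram of the pattern must be present (bitmap-friendly),
  -- then LIKE filters out the false matches exactly
  SELECT * FROM content
   WHERE body_trgm @@ to_tsquery('simple', 'gre & res & esq')
     AND lower(body) LIKE '%gresq%';

This naive version takes trigrams across the whole string, spaces
included; a real setup would generate them per word and keep the
column up to date with a trigger.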
On Fri, 20 Apr 2007, Tilmann Singer wrote:

> Did anyone do that already?

AFAIK, no.

> If I understand it correctly, such a dictionary would require writing
> a custom C component - is that correct? Or could I get away with
> writing a plpgsql function that does the above and hooking that
> somehow into the tsearch2 config?

You need to write a C function; see the example in
http://www.sai.msu.su/~megera/postgres/fts/doc/fts-intdict-xmp.html

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
* Listmail <lists@peufeu.com> [20070420 11:25]:
> You want trigram based search, ie.
>
>   postgresql -> 'pos', 'ost', 'stg', 'tgr', 'gre', 'res', 'esq', 'sql'
>
> Searching for 'gresq' is then searching for 'gre' and 'res' and
> 'esq', which is good friends with bitmap scan. Then a little LIKE
> '%gresq%' to filter the results.

I'm not sure how that would fit in with tsearch2 to do full text
search, so that I can do queries like

  select * from content where plainto_tsquery(:q) @@ to_tsvector(body)

If the possible substrings were already indexed, as Oleg suggested in
his reply, through writing a custom C dictionary, then a query like
the above with q='foo' would find rows from the table content where
body contains 'foobar', for instance.

However, I've seen the example that creates a trigram index on a
unique word list to provide alternative spelling suggestions to the
user, which looked very useful.

> PS: indexing all substrings means that for long words you get a huge
> number of lexemes...

I'm aware of that, and in my case I don't think it will be a problem.
It is for a type-ahead search web interface, so it actually only
requires indexing all possible substrings starting from char 1, ie.
p, po, pos, post, postg, postgr, postgre, postgres, postgresq,
postgresql.

Til
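P.S.: the prefix list itself is easy to generate in plain SQL - a
hypothetical, untested helper:

  -- all prefixes of one word, space-separated,
  -- eg. prefixes('foo') = 'f fo foo'
  CREATE FUNCTION prefixes(text) RETURNS text AS $$
    SELECT array_to_string(ARRAY(
      SELECT substr($1, 1, i)
      FROM generate_series(1, length($1)) AS i
    ), ' ');
  $$ LANGUAGE sql IMMUTABLE;

The catch is the hooking-in part: a dictionary's lexize function takes
internal arguments, which as far as I understand rules out plpgsql and
leads back to the C route Oleg describes.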
> I'm aware of that, and in my case I don't think it will be a
> problem. It is for a type-ahead search web interface, so it actually
> only requires indexing all possible substrings starting from char 1,
> ie. p, po, pos, post, postg, postgr, postgre, postgres, postgresq,
> postgresql.

If you want to provide autocompletion, you could build a list of
unique lexemes from the inserted posts, and use a btree index and
LIKE - something like the sketch below...
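A rough sketch of that idea (untested; table and index names are
illustrative, and the stat() call is tsearch2's - check it against
your installed version):

  CREATE TABLE lexemes (word text PRIMARY KEY);

  -- stat() flattens the tsvectors produced by the given query
  -- into one row per distinct lexeme
  INSERT INTO lexemes
  SELECT word FROM stat('SELECT to_tsvector(body) FROM content');

  -- text_pattern_ops lets a btree serve left-anchored LIKE
  -- regardless of locale
  CREATE INDEX lexemes_prefix_idx ON lexemes (word text_pattern_ops);

  -- type-ahead lookup
  SELECT word FROM lexemes
   WHERE word LIKE 'postg%'
   ORDER BY word
   LIMIT 10;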
* Oleg Bartunov <oleg@sai.msu.su> [20070420 11:32]:
> You need to write a C function; see the example in
> http://www.sai.msu.su/~megera/postgres/fts/doc/fts-intdict-xmp.html

Thanks. My colleague, who speaks more C than I do, came up with the
code below, which works fine for us. Will the memory allocated for
each lexeme be freed by the caller?

Til

/*
 * Dictionary for partials of a word, ie. foo => {f,fo,foo}
 *
 * Based on the tsearch2/gendict/config.sh generator
 *
 * Author: Sean Treadway
 *
 * This code is released under the terms of the PostgreSQL License.
 */
#include "postgres.h"

#include "dict.h"
#include "common.h"
#include "subinclude.h"
#include "ts_locale.h"

/* true for UTF-8 continuation bytes (10xxxxxx) */
#define is_utf8_continuation(c) \
	((unsigned char)(c) >= 0x80 && (unsigned char)(c) <= 0xBF)

PG_FUNCTION_INFO_V1(dlexize_partial);
Datum		dlexize_partial(PG_FUNCTION_ARGS);

Datum
dlexize_partial(PG_FUNCTION_ARGS)
{
	char	   *in = (char *) PG_GETARG_POINTER(1);
	char	   *utxt = pnstrdup(in, PG_GETARG_INT32(2));	/* palloc */
	char	   *txt = lowerstr(utxt);						/* palloc */
	int			txt_len = strlen(txt);
	int			results = 0;
	int			i;

	/* may overallocate, that's ok; palloc0 zeroes the extra
	 * TSLexeme fields */
	TSLexeme   *res = palloc0(sizeof(TSLexeme) * (txt_len + 1));

	for (i = 1; i <= txt_len; i++)
	{
		/*
		 * Emit a prefix only at character boundaries, never in the
		 * middle of a multibyte UTF-8 sequence.  txt[txt_len] is the
		 * terminating '\0', so the full word is always emitted.
		 */
		if (!is_utf8_continuation(txt[i]))
			res[results++].lexeme = pnstrdup(txt, i);
	}
	res[results].lexeme = NULL;

	pfree(utxt);
	pfree(txt);

	/* Receiver must free res memory and res[].lexeme */
	PG_RETURN_POINTER(res);
}
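For the hooking-into-the-config part we register it roughly the way
the gendict-generated SQL does. The library name, the pg_ts_dict
layout and the function signature below are assumptions - check them
against your tsearch2 version:

  -- '$libdir/dict_partial' is a hypothetical shared library name
  CREATE FUNCTION dlexize_partial(internal, internal, int4)
  RETURNS internal
  AS '$libdir/dict_partial'
  LANGUAGE C STRICT;

  -- no init function or options: the dictionary is stateless
  INSERT INTO pg_ts_dict (dict_name, dict_init, dict_initoption,
                          dict_lexize, dict_comment)
  VALUES ('partial', NULL, NULL,
          'dlexize_partial(internal,internal,int4)'::regprocedure,
          'returns all prefixes of a word');

After that the dictionary can be listed in the pg_ts_cfgmap mappings
of the configuration, in place of (or before) the stemmer.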
> My colleague, who speaks more C than I do, came up with the code
> below, which works fine for us. Will the memory allocated for each
> lexeme be freed

Nice, except for the self-defined UTF-8 properties. I think it would
be much better to use pg_mblen(char *); that way your dictionary will
work with any encoding supported by pgsql, not just UTF-8.

> by the caller?

Yes, of course.

-- 
Teodor Sigaev                      E-mail: teodor@sigaev.ru
                                      WWW: http://www.sigaev.ru/