Thread: tsearch2 dictionary that indexes substrings?

tsearch2 dictionary that indexes substrings?

From
Tilmann Singer
Date:
Hi,

In this http://archive.netbsd.se/?ml=pgsql-hackers&a=2006-10&t=2475028
thread Teodor Sigaev describes a way to make tsearch2 index substrings
of words:

> Brain storm method:
>
> Develop a dictionary which returns all substring for lexeme, for
> example for word foobar it will be 'foobar fooba foob foo fo oobar
> ooba oob oo obar oba ob bar ba ar'

Did anyone do that already?

If I understand it correctly such a dictionary would require to write
a custom C component - is that correct? Or could I get away with
writing a plpgsql function that does the above and hooking that
somehow into the tsearch2 config?


thanks, Til

Re: tsearch2 dictionary that indexes substrings?

From
Listmail
Date:
    You want trigram based search.
    ie.

    postgresql -> 'pos', 'ost', 'stg', 'tgr', 'gre', 'res', 'esq', 'sql'

    searching for 'gresq' is searching for  'gre' and 'res' and 'esq' which
is good friends with bitmap scan. Then a little LIKE '%gresq%' to filter
the results.

    PS : indexing all substring means for long words you get huge number of
lexems...



On Fri, 20 Apr 2007 11:06:26 +0200, Tilmann Singer <tils-pgsql@tils.net>
wrote:

> Hi,
>
> In this http://archive.netbsd.se/?ml=pgsql-hackers&a=2006-10&t=2475028
> thread Teodor Sigaev describes a way to make tsearch2 index substrings
> of words:
>
>> Brain storm method:
>>
>> Develop a dictionary which returns all substring for lexeme, for
>> example for word foobar it will be 'foobar fooba foob foo fo oobar
>> ooba oob oo obar oba ob bar ba ar'
>
> Did anyone do that already?
>
> If I understand it correctly such a dictionary would require to write
> a custom C component - is that correct? Or could I get away with
> writing a plpgsql function that does the above and hooking that
> somehow into the tsearch2 config?
>
>
> thanks, Til
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster



Re: tsearch2 dictionary that indexes substrings?

From
Oleg Bartunov
Date:
On Fri, 20 Apr 2007, Tilmann Singer wrote:

> Hi,
>
> In this http://archive.netbsd.se/?ml=pgsql-hackers&a=2006-10&t=2475028
> thread Teodor Sigaev describes a way to make tsearch2 index substrings
> of words:
>
>> Brain storm method:
>>
>> Develop a dictionary which returns all substring for lexeme, for
>> example for word foobar it will be 'foobar fooba foob foo fo oobar
>> ooba oob oo obar oba ob bar ba ar'
>
> Did anyone do that already?

AFAIK, no

>
> If I understand it correctly such a dictionary would require to write
> a custom C component - is that correct? Or could I get away with
> writing a plpgsql function that does the above and hooking that
> somehow into the tsearch2 config?

You need to write C-function, see example in
http://www.sai.msu.su/~megera/postgres/fts/doc/fts-intdict-xmp.html

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: tsearch2 dictionary that indexes substrings?

From
Tilmann Singer
Date:
* Listmail <lists@peufeu.com> [20070420 11:25]:
>     You want trigram based search.
>     ie.
>
>     postgresql -> 'pos', 'ost', 'stg', 'tgr', 'gre', 'res',
>     'esq', 'sql'
>
>     searching for 'gresq' is searching for  'gre' and 'res' and
>     'esq' which  is good friends with bitmap scan. Then a little LIKE
> '%gresq%' to filter  the results.

I'm not sure how that would fit in with tsearch2 to do full text
search so that I can do queries like

select * from content where plainto_tsquery(:q) @@ to_tsvector(body)

If the possible substrings were already indexed like Oleg suggested in
his reply through writing a custom C dictionary, a query like above
with q='foo' would find rows from the table content where body
contains 'foobar' for instance.

However I've seen the example to create a trigram index on a unique
word list to provide alternative spelling suggestions to the user
which looked very useful.

>     PS : indexing all substring means for long words you get huge
>     number of  lexems...

I'm aware of that and in my case I don't think it will be a
problem. It is for a type-ahead search web interface so actually it
only requires indexing all possible substrings starting from char 1,
ie. p, po, pos, post, postg, postgr, postgre, postgres, postgresq,
postgresql.


Til

Re: tsearch2 dictionary that indexes substrings?

From
Listmail
Date:
> I'm aware of that and in my case I don't think it will be a
> problem. It is for a type-ahead search web interface so actually it
> only requires indexing all possible substrings starting from char 1,
> ie. p, po, pos, post, postg, postgr, postgre, postgres, postgresq,
> postgresql.

    If you want to provide autocompletion, you could build a list of unique
lexemes from inserted posts, and use a btree index and LIKE...

Re: tsearch2 dictionary that indexes substrings?

From
Tilmann Singer
Date:
* Oleg Bartunov <oleg@sai.msu.su> [20070420 11:32]:
> >If I understand it correctly such a dictionary would require to write
> >a custom C component - is that correct? Or could I get away with
> >writing a plpgsql function that does the above and hooking that
> >somehow into the tsearch2 config?
>
> You need to write C-function, see example in
> http://www.sai.msu.su/~megera/postgres/fts/doc/fts-intdict-xmp.html

Thanks.

My colleague who speaks more C than me came up with the code below
which works fine for us. Will the memory allocated for lexeme be freed
by the caller?



Til





/*
 * Dictionary for partials of a word, ie. foo => {f,fo,foo}
 *
 * Based on the tsearch2/gendict/config.sh generator
 *
 * Author: Sean Treadway
 *
 * This code is released under the terms of the PostgreSQL License.
 */
#include "postgres.h"

#include "dict.h"
#include "common.h"

#include "subinclude.h"
#include "ts_locale.h"

#define is_utf8_continuation(c) ((unsigned char)(c) >= 0x80 && (unsigned char)(c) <= 0xBF)

PG_FUNCTION_INFO_V1(dlexize_partial);
Datum dlexize_partial(PG_FUNCTION_ARGS);
Datum
dlexize_partial(PG_FUNCTION_ARGS) {
  char*  in = (char*)PG_GETARG_POINTER(1);

  char*  utxt = pnstrdup(in, PG_GETARG_INT32(2)); /* palloc */
  char*  txt = lowerstr(utxt);                    /* palloc */
  int    txt_len = strlen(txt);

  int    results = 0;
  int    i = 0;

  /* may overallocate, that's ok */
  TSLexeme   *res = palloc(sizeof(TSLexeme)*(txt_len+1));

  for (i = 1; i <= txt_len; i++) {
    /* skip UTF8 control codes until EOS */
    if (!is_utf8_continuation(txt[i])) {
      res[results++].lexeme = pnstrdup(txt, i);
    }
  }

  res[results].lexeme=NULL;

  pfree(utxt);
  pfree(txt);

  /* Receiver must free res memory and res[].lexeme */
  PG_RETURN_POINTER(res);
}

Re: tsearch2 dictionary that indexes substrings?

From
Teodor Sigaev
Date:
> My colleague who speaks more C than me came up with the code below
> which works fine for us. Will the memory allocated for lexeme be freed
Nice, except self-defined utf8 properties. I think it will be much better to use
pg_mblen(char*). In this case your dictionary will work with any supported by
pgsql encodings.

> by the caller?
Yes, of course.


--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/