Thread: Database design for separate tsearch table

Database design for separate tsearch table

From
Mikkel Høgh
Date:
Hi,

I'm trying to make a module allowing Drupal to take advantage of
PostgreSQL's excellent Full Text Search, aka. tsearch.

Since this module will probably not become part of Drupal core right off
the bat, I need to do this without modifying Drupal's own tables, so I've
created a new one for the purpose, "tsearch_node".

I'm pondering how best to do this. Basic minimum is a column which holds
the foreign key (nid) to Drupal's node.

I'm wondering whether to merge all the searchable stuff when creating the
index or to have a separate tsvector column for each of the three
commonly indexable things about Drupal content (nodes), namely the title,
the body and taxonomy. Any insights?

Also, I figure it'd be a good idea to keep the language of the indexed
content right there in the same table, since Drupals own node table
specifies two-letter codes (en, da, pt, etc.) and as far as I can
understand the PostgreSQL documentation, tsearch expects the full
language name (english, danish, portuguese), so to use it in queries
without having to do too much magic would require me to store the full
name right there in the table, or am I mistaken?

Kind regards,

      Mikkel Høgh


Re: Database design for separate tsearch table

From
Ivan Sergio Borgonovo
Date:
On Thu, 23 Oct 2008 00:10:19 +0200
Mikkel Høgh <m@ooh.dk> wrote:

> Hi,

> I'm trying to make a module allowing Drupal to take advantage of
> PostgreSQL's excellent Full Text Search, aka. tsearch.

> Since this module will probably not become part of Drupal core
> right off the bat, I need to do this without modifying Drupal's
> own tables, so I've created a new one for the purpose,
> "tsearch_node".
>
> I'm pondering how best to do this. Basic minimum is a column which
> holds the foreign key (nid) to Drupal's node.
>
> I'm wondering whether to merge all the searchable stuff when
> creating the index or to have a separate tsvector column for each
> of the three commonly indexable things about Drupal content
> (nodes), namely the title, the body and taxonomy. Any insights?

weight them and you'll be able to search by field and "globally".

I didn't make any scientific test but I previously had something
like:

create table subtable (
  subtableid int,
  body text,
  ftidx tsvector
)
create table maintable (
  maintableid int,
  body text,
  subtableid int
  ftidx tsvector
);

but it seem that just searching on a tsvector in maintable
build up with

setweight(to_tsvector('pg_catalog.english',
coalesce(maintable.body,'')), 'A') || ' ' ||

setweight(to_tsvector('pg_catalog.english',
coalesce(subtable.body,'')), 'B')

is faster.

Beware of the difference between gist and gin indexes for
"restricted" weighted searches since with the latter you've to use
@@@

> Also, I figure it'd be a good idea to keep the language of the
> indexed content right there in the same table, since Drupals own
> node table specifies two-letter codes (en, da, pt, etc.) and as
> far as I can understand the PostgreSQL documentation, tsearch
> expects the full language name (english, danish, portuguese), so
> to use it in queries without having to do too much magic would
> require me to store the full name right there in the table, or am
> I mistaken?

You'll have to build up a tsvector for each language, so yeah it may
be useful to store the tsvector together with the language with
witch it was obtained.
If you don't tsearch won't work.
eg. "cane" with Italian tsearch becomes {can}, but when you search it
with English tsearch it remains {cane} so you won't find it.


--
Ivan Sergio Borgonovo
http://www.webthatworks.it


Re: Database design for separate tsearch table

From
Mikkel Høgh
Date:
On Thu, Oct 23, 2008 at 01:06:26AM +0200, Ivan Sergio Borgonovo wrote:
>weight them and you'll be able to search by field and "globally".
>
>I didn't make any scientific test but I previously had something
>like:
>
>create table subtable (
>  subtableid int,
>  body text,
>  ftidx tsvector
>)
>create table maintable (
>  maintableid int,
>  body text,
>  subtableid int
>  ftidx tsvector
>);
>
>but it seem that just searching on a tsvector in maintable
>build up with
>
>setweight(to_tsvector('pg_catalog.english',
>coalesce(maintable.body,'')), 'A') || ' ' ||
>
>setweight(to_tsvector('pg_catalog.english',
>coalesce(subtable.body,'')), 'B')
>
>is faster.

Ok, that seems to be a good approach, thank you.

>Beware of the difference between gist and gin indexes for
>"restricted" weighted searches since with the latter you've to use
>@@@
Um, could you clarify that? I know the general differences between gist
and gin, but not how it affects weighted searches...

Kind regards,
      Mikkel Høgh

Re: Database design for separate tsearch table

From
Ivan Sergio Borgonovo
Date:
On Thu, 23 Oct 2008 14:20:49 +0200
Mikkel Høgh <m@ooh.dk> wrote:

> >but it seem that just searching on a tsvector in maintable
> >build up with
> >
> >setweight(to_tsvector('pg_catalog.english',
> >coalesce(maintable.body,'')), 'A') || ' ' ||
> >
> >setweight(to_tsvector('pg_catalog.english',
> >coalesce(subtable.body,'')), 'B')
> >
> >is faster.
>
> Ok, that seems to be a good approach, thank you.

Unless proven false by a more scientific test than mine.
I'd like to hear more informed opinions on this.
I noticed a performance difference between 1 vs more index but
mainly I've chosen 1 index because it made all the code simpler and
did suit to my needs. So I didn't dig further in "could I make the
other way faster?".

There is one limit (with its implication): you've a limited number
of weight (ABCD).

> >Beware of the difference between gist and gin indexes for
> >"restricted" weighted searches since with the latter you've to use
> >@@@
> Um, could you clarify that? I know the general differences between
> gist and gin, but not how it affects weighted searches...

http://www.postgresql.org/docs/8.3/static/textsearch-indexes.html

search for @@@

--
Ivan Sergio Borgonovo
http://www.webthatworks.it

--
Ivan Sergio Borgonovo
http://www.webthatworks.it