Re: Updated tsearch documentation - Mailing list pgsql-hackers

From Bruce Momjian
Subject Re: Updated tsearch documentation
Date
Msg-id 200707181808.l6II8Qo14063@momjian.us
Whole thread Raw
In response to Re: Updated tsearch documentation  (Oleg Bartunov <oleg@sai.msu.su>)
Responses Re: Updated tsearch documentation  (Oleg Bartunov <oleg@sai.msu.su>)
List pgsql-hackers
Oleg Bartunov wrote:
> >> I agree, that there are could be more examples, but text search doesn't
> >> require something special !
> >> *Example* of trigger function is documented on
> >> http://momjian.us/expire/fulltext/HTML/textsearch-opfunc.html
> >
> > Yes, I see that in tsearch() here:
> >
> >     http://momjian.us/expire/fulltext/HTML/textsearch-opfunc.html#TEXTSEARC$
> >
> > I assume my_filter_name is optional right?  I have updated the prototype
> > to be:
> >
> >     tsearch([vector_column_name], [my_filter_name], text_column_name [, ... ])
> >
> > Is this accurate?  What does this text below it mean?
> 
> no, this in inaccurate. First, vector_column_name is not optional argument,
> it's a name of tsvector column name.

Fixed.

> >     There can be many functions and text columns specified in a tsearch()
> >     trigger. The following rule is used: a function is applied to all
> >     subsequent TEXT columns until the next matching column occurs.
> 
> The idea, is to provide user to preprocess text before applying 
> tsearch machinery. my_filter_name() preprocess text_column_name1,
> text_column_name2,....
> The original syntax allows to specify for every text columns their 
> preprocessing functions.
> 
> So, I suggest to keep original syntax, change 'vector_column_name' to
> 'tsvector_column_name'.

OK, change made.

> > Why are we allowing my_filter_name here?  Isn't that something for a
> > custom trigger.  Is calling it tsearch() a good idea?  Why not
> > tsvector_trigger().
> 
> I don't see any benefit from the tsvector_trigger() name. If you want to add
> some semantic, than tsvector_update_trigger() would be better.  Anyway,
> this trigger is an illustration.

Well, the filter that removes '@' might be an example, but tsearch() is
indeed sort of built-in trigger function to be used for simple cases. 
My point is that because it is only for simple cases, why add complexity
and allow a filter?  It seems best to just remove the filter idea and
let people write their own triggers if they want that functionality.

> >>>     CREATE INDEX textsearch_id ON pgweb USING gin(to_tsvector(column));
> >>>
> >>> That avoids having to have a separate column because you can just say:
> >>>
> >>>     WHERE to_query('XXX') @@ to_tsvector(column)
> >>
> >> yes, it's possible, but without ranking, since currently it's impossible
> >> to store any information in index (it's pg's feature). btw, this should
> >> works and for GiST index also.
> >
> > What if they use @@@.  Wouldn't that work because it is going to check
> > the heap?
> 
> It would work, it'd recalculate to_tsvector(column) for rows found
> ( for GiST - to remove false hits and for weight information, for 
> GIN - for weight information only).

Right.  Currently to use text search on a table, you have to do three
things:
o  add a tsvector column to the tableo  add a trigger to keep the tsvector column currento  add an index to the
tsvectorcolumn
 

My question is why bother with the first two steps?  If you do:
CREATE INDEX textsearch_idx ON pgweb USING gist(to_tsvector('english',column));

you don't need a separate column and a trigger to keep it current.  The
index is kept current as part of normal query processing.  The only
downside is that you have to do to_tsvector() in the heap to avoid false
hits, but that seems minor compared to the disk savings of not having
the separate column.  Is to_tsvector() an expensive function?

> >>>  CREATE INDEX textsearch_idx ON pgweb USING gin(to_tsvector('english',column));
> >>>
> >>> so that at least the configuration is documented in the index.
> >>
> >> yes, it's better to always explicitly specify configuration name and not
> >> rely on default configuration.
> >> Unfortunately, configuration name doesn't saved in the index.
> 
> as Teodor corrected me, index doesn't know about configuration at all !
> What accurate user could do, is to provide configuration name in the 
> comment for tsvector column. Configuration name is an accessory of
> to_tsvector() function.

Well, if you create the index with the configuration name it is
guaranteed to match:
CREATE INDEX textsearch_idx ON pgweb USING gist(to_tsvector('english',column));
                   -------
 
And if someone does:
WHERE 'friend'::tsquery @@ to_tsvector('english',column))

the index is used.  Now if the default configuration is 'english' and
they use:
WHERE 'friend'::tsquery @@ to_tsvector(column))

the index is not used, but this just a good example of why default
configurations aren't that useful.  One problem I see is that if the
default configuration is not 'english', then when the index consults the
heap, it would be using a different configuration and yield incorrect
results.  I am unsure how to fix that.

With the trigger idea, you have to be sure your configuration is the same
every time you INSERT/UPDATE the table or the index will have mixed
configuration entries and it will yield incorrect results, aside from
the heap configuration lookup not matching the index.

Once we nail this down we will have to have a documentation section
about configuration mismatches.

> In principle, tsvector as any data type could be obtained by any other ways,
> for example, OpenFTS construct tsvector following its own rules.
> 
> >
> > I was more concerned that there is nothing documenting the configuration
> > used by the index or the tsvector table column trigger.  By doing:
> 
> again, index has nothing with configuration name.
> Our trigger function is an example, which uses default configuration name.
> User could easily write it's own trigger to keep tsvector column up to date 
> and use configuration name as a parameter.

Right. I am thinking beyond that issue.

> >     CREATE INDEX textsearch_idx ON pgweb USING gin(to_tsvector('english',column));
> >
> > you guarantee that the index uses 'english' for all its entries.  If you
> > omit the 'english' or use a different configuration, it will heap scan
> > the table, which at least gives the right answer.
> 
> sometimes it's useful not to use explicitly configuration name 
> to be able to use index with different configuration. Just change
> tsearch_conf_name.

I assume you are saying the benefit is for tsquery to use a different
configuration, not having some tsvector index entries using different
configurations than others.

> > Also, how do you guarantee that tsearch() triggers always uses the same
> > configuration?  The existing tsearch() API seems to make that
> > impossible.  I am wondering if we need to add the configuration name as
> > a mandatory parameter to tsearch().
> 
> Using the same tsearch_conf_name, which could be defined by many ways, 
> you guarantee to use the same configuration.

Yea, I am sure you _can_ do it.  The question is how can we make it less
error-prone.

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


pgsql-hackers by date:

Previous
From: "Pavan Deolasee"
Date:
Subject: Re: Comments on the HOT design
Next
From: Stephen Frost
Date:
Subject: Re: Future of krb5 authentication