Re: Updated tsearch documentation - Mailing list pgsql-hackers

From Bruce Momjian
Subject Re: Updated tsearch documentation
Date
Msg-id 200707182340.l6INe8D11726@momjian.us
Whole thread Raw
In response to Re: Updated tsearch documentation  (Oleg Bartunov <oleg@sai.msu.su>)
List pgsql-hackers
Oleg Bartunov wrote:
> On Wed, 18 Jul 2007, Bruce Momjian wrote:
> 
> >
> >>> Why are we allowing my_filter_name here?  Isn't that something for a
> >>> custom trigger.  Is calling it tsearch() a good idea?  Why not
> >>> tsvector_trigger().
> >>
> >> I don't see any benefit from the tsvector_trigger() name. If you want to add
> >> some semantic, than tsvector_update_trigger() would be better.  Anyway,
> >> this trigger is an illustration.
> >
> > Well, the filter that removes '@' might be an example, but tsearch() is
> > indeed sort of built-in trigger function to be used for simple cases.
> > My point is that because it is only for simple cases, why add complexity
> > and allow a filter?  It seems best to just remove the filter idea and
> > let people write their own triggers if they want that functionality.
> 
> If you aware about documentation simplicity than we could just document 
> two versions:
> 1. without filter function - simple, well understood syntax
> 2. with filter function - for advanced users
> 
> I don't want to remove the feature which works for year without any problem.

Yes, this is what I want.  I would like to show the simple usage first,
then explain that a more complex usage is possible.  This will help
people get started using text search.  Triggers and secondary columns
are fine, but to start using it the CREATE INDEX-only case is best.  I
don't suggest we remove any capabilities, only suggest simple solutions.

> >>>>>     CREATE INDEX textsearch_id ON pgweb USING gin(to_tsvector(column));
> >>>>>
> >>>>> That avoids having to have a separate column because you can just say:
> >>>>>
> >>>>>     WHERE to_query('XXX') @@ to_tsvector(column)
> >>>>
> >>>> yes, it's possible, but without ranking, since currently it's impossible
> >>>> to store any information in index (it's pg's feature). btw, this should
> >>>> works and for GiST index also.
> >>>
> >>> What if they use @@@.  Wouldn't that work because it is going to check
> >>> the heap?
> >>
> >> It would work, it'd recalculate to_tsvector(column) for rows found
> >> ( for GiST - to remove false hits and for weight information, for
> >> GIN - for weight information only).
> >
> > Right.  Currently to use text search on a table, you have to do three
> > things:
> >
> >     o  add a tsvector column to the table
> >     o  add a trigger to keep the tsvector column current
> >     o  add an index to the tsvector column
> >
> > My question is why bother with the first two steps?  If you do:
> >
> > CREATE INDEX textsearch_idx ON pgweb USING gist(to_tsvector('english',column));
> >
> > you don't need a separate column and a trigger to keep it current.  The
> > index is kept current as part of normal query processing.  The only
> > downside is that you have to do to_tsvector() in the heap to avoid false
> > hits, but that seems minor compared to the disk savings of not having
> > the separate column.  Is to_tsvector() an expensive function?
> 
> Bruce, you oversimplify the text search, the document could be fully virtual,
> not a column(s), it could be a result of any SQL commands, so it could be 
> very expensive just to obtain document, and yes, to_tsvector could be
> very expensive, depending on the document size, parser and dictionaries used.
> 
> And, again, current postgres architecture forces to use heap to store
> positional and weight information for ranking.
> 
> The use case for what you described is very limited - simple text search
> on one/several column of the same table without ranking.

Right, but I bet that that is all the majority of users need, at least
at first as they start to use text search.

> >>>>>  CREATE INDEX textsearch_idx ON pgweb USING gin(to_tsvector('english',column));
> >>>>>
> >>>>> so that at least the configuration is documented in the index.
> >>>>
> >>>> yes, it's better to always explicitly specify configuration name and not
> >>>> rely on default configuration.
> >>>> Unfortunately, configuration name doesn't saved in the index.
> >>
> >> as Teodor corrected me, index doesn't know about configuration at all !
> >> What accurate user could do, is to provide configuration name in the
> >> comment for tsvector column. Configuration name is an accessory of
> >> to_tsvector() function.
> >
> > Well, if you create the index with the configuration name it is
> > guaranteed to match:
> >
> > CREATE INDEX textsearch_idx ON pgweb USING gist(to_tsvector('english',column));
> >                                                              -------
> > And if someone does:
> >
> >     WHERE 'friend'::tsquery @@ to_tsvector('english',column))
> >
> > the index is used.  Now if the default configuration is 'english' and
> > they use:
> >
> >     WHERE 'friend'::tsquery @@ to_tsvector(column))
> >
> > the index is not used, but this just a good example of why default
> > configurations aren't that useful.  One problem I see is that if the
> > default configuration is not 'english', then when the index consults the
> > heap, it would be using a different configuration and yield incorrect
> > results.  I am unsure how to fix that.
> 
> again, you consider very simple case  and actually, your example is a 
> good example of usefulness of default configuration ! Just think before
> you develop your application, but this is very general rule. There are
> zillions situations you could do bad things, after all.

Right, but if we can reduce possible errors, that is better, and
embedding the configuration name in the expression index funciton does
that.

> Moreover, consider text search on text column, there is no way to specify 
> configuration at all ! We rely on default configuration here
> 
> CREATE INDEX textsearch_idx ON pgweb USING gin(title);

Ah, this should be:
CREATE INDEX textsearch_idx ON pgweb USING gin(to_tsvector('english', title));

What happens if you create a GIN index on a text column?  Does it
convert it to tsvector automatically?

> > With the trigger idea, you have to be sure your configuration is the same
> > every time you INSERT/UPDATE the table or the index will have mixed
> > configuration entries and it will yield incorrect results, aside from
> > the heap configuration lookup not matching the index.
> >
> > Once we nail this down we will have to have a documentation section
> > about configuration mismatches.
> 
> So what is your proposal ? I'm lost a bit. I suggest to begin new thread :)

Well, the topic is tsearch documentation, so I don't see how a new
thread would help us.

My idea is to suggest using expression indexes as a simple way to use
text search on a table, and mention the trigger idea as a more complex
solution.  Right now there is no mention of the expression index idea in
the documentation.

> >> sometimes it's useful not to use explicitly configuration name
> >> to be able to use index with different configuration. Just change
> >> tsearch_conf_name.
> >
> > I assume you are saying the benefit is for tsquery to use a different
> > configuration, not having some tsvector index entries using different
> > configurations than others.
> 
> sure, but not necessarily. You can mix different configurations if they 
> are 'compatible'. Documents could be created from different sources and
> could require different dictionaries. I don't want to limit people
> in creating complex applications. If you unsure, then always use 
> explicit configuration name. What's the problem ?

OK, I understand.  You can store rows needing different synonym tables
in the same column, and if you use the same synonym configuratin when
searching, that is OK.  Again, that is fine, but I would like to give
people a simple way to use text search that doesn't get into that
complexity.  I realize you understand it well, but I can tell you that
most people, including me, don't and you have to be simple at first to
get people to understand what is happening.

I am learning about the possibilities as I email you so obviously I
didn't fully understand at first either, and I have been working on this
for a while.

Good documentation lays things out simply then adds
complexity/flexibility after the fundamentals are understood.

> >>> Also, how do you guarantee that tsearch() triggers always uses the same
> >>> configuration?  The existing tsearch() API seems to make that
> >>> impossible.  I am wondering if we need to add the configuration name as
> >>> a mandatory parameter to tsearch().
> >>
> >> Using the same tsearch_conf_name, which could be defined by many ways,
> >> you guarantee to use the same configuration.
> >
> > Yea, I am sure you _can_ do it.  The question is how can we make it less
> > error-prone.
> 
> We have only ONE variable - configuration name. It could be explicitly 
> defined and then there is no problem at all, or specified by 
> GUC variable tsearch_conf_name.  What could be simpler !
> 
> The question is do we need to define tsearch_conf_name automagically if
> it's not defined ? I inclined don't do that and just issue error.

Agreed.  Please throw an error.

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Why so many out-of-disk-space failures on buildfarm machines?
Next
From: Andrew Dunstan
Date:
Subject: Re: Why so many out-of-disk-space failures on buildfarm machines?