Re: Updated tsearch documentation - Mailing list pgsql-hackers

From Oleg Bartunov
Subject Re: Updated tsearch documentation
Date
Msg-id Pine.LNX.4.64.0707182253190.18739@sn.sai.msu.ru
Whole thread Raw
In response to Re: Updated tsearch documentation  (Bruce Momjian <bruce@momjian.us>)
Responses Re: Updated tsearch documentation  (Bruce Momjian <bruce@momjian.us>)
Re: Updated tsearch documentation  (Bruce Momjian <bruce@momjian.us>)
List pgsql-hackers
On Wed, 18 Jul 2007, Bruce Momjian wrote:

>
>>> Why are we allowing my_filter_name here?  Isn't that something for a
>>> custom trigger.  Is calling it tsearch() a good idea?  Why not
>>> tsvector_trigger().
>>
>> I don't see any benefit from the tsvector_trigger() name. If you want to add
>> some semantic, than tsvector_update_trigger() would be better.  Anyway,
>> this trigger is an illustration.
>
> Well, the filter that removes '@' might be an example, but tsearch() is
> indeed sort of built-in trigger function to be used for simple cases.
> My point is that because it is only for simple cases, why add complexity
> and allow a filter?  It seems best to just remove the filter idea and
> let people write their own triggers if they want that functionality.

If you aware about documentation simplicity than we could just document 
two versions:
1. without filter function - simple, well understood syntax
2. with filter function - for advanced users

I don't want to remove the feature which works for year without any problem.


>
>>>>>     CREATE INDEX textsearch_id ON pgweb USING gin(to_tsvector(column));
>>>>>
>>>>> That avoids having to have a separate column because you can just say:
>>>>>
>>>>>     WHERE to_query('XXX') @@ to_tsvector(column)
>>>>
>>>> yes, it's possible, but without ranking, since currently it's impossible
>>>> to store any information in index (it's pg's feature). btw, this should
>>>> works and for GiST index also.
>>>
>>> What if they use @@@.  Wouldn't that work because it is going to check
>>> the heap?
>>
>> It would work, it'd recalculate to_tsvector(column) for rows found
>> ( for GiST - to remove false hits and for weight information, for
>> GIN - for weight information only).
>
> Right.  Currently to use text search on a table, you have to do three
> things:
>
>     o  add a tsvector column to the table
>     o  add a trigger to keep the tsvector column current
>     o  add an index to the tsvector column
>
> My question is why bother with the first two steps?  If you do:
>
> CREATE INDEX textsearch_idx ON pgweb USING gist(to_tsvector('english',column));
>
> you don't need a separate column and a trigger to keep it current.  The
> index is kept current as part of normal query processing.  The only
> downside is that you have to do to_tsvector() in the heap to avoid false
> hits, but that seems minor compared to the disk savings of not having
> the separate column.  Is to_tsvector() an expensive function?

Bruce, you oversimplify the text search, the document could be fully virtual,
not a column(s), it could be a result of any SQL commands, so it could be 
very expensive just to obtain document, and yes, to_tsvector could be
very expensive, depending on the document size, parser and dictionaries used.

And, again, current postgres architecture forces to use heap to store
positional and weight information for ranking.

The use case for what you described is very limited - simple text search
on one/several column of the same table without ranking.

>
>>>>>  CREATE INDEX textsearch_idx ON pgweb USING gin(to_tsvector('english',column));
>>>>>
>>>>> so that at least the configuration is documented in the index.
>>>>
>>>> yes, it's better to always explicitly specify configuration name and not
>>>> rely on default configuration.
>>>> Unfortunately, configuration name doesn't saved in the index.
>>
>> as Teodor corrected me, index doesn't know about configuration at all !
>> What accurate user could do, is to provide configuration name in the
>> comment for tsvector column. Configuration name is an accessory of
>> to_tsvector() function.
>
> Well, if you create the index with the configuration name it is
> guaranteed to match:
>
> CREATE INDEX textsearch_idx ON pgweb USING gist(to_tsvector('english',column));
>                                                              -------
> And if someone does:
>
>     WHERE 'friend'::tsquery @@ to_tsvector('english',column))
>
> the index is used.  Now if the default configuration is 'english' and
> they use:
>
>     WHERE 'friend'::tsquery @@ to_tsvector(column))
>
> the index is not used, but this just a good example of why default
> configurations aren't that useful.  One problem I see is that if the
> default configuration is not 'english', then when the index consults the
> heap, it would be using a different configuration and yield incorrect
> results.  I am unsure how to fix that.

again, you consider very simple case  and actually, your example is a 
good example of usefulness of default configuration ! Just think before
you develop your application, but this is very general rule. There are
zillions situations you could do bad things, after all.

Moreover, consider text search on text column, there is no way to specify 
configuration at all ! We rely on default configuration here

CREATE INDEX textsearch_idx ON pgweb USING gin(title);

>
> With the trigger idea, you have to be sure your configuration is the same
> every time you INSERT/UPDATE the table or the index will have mixed
> configuration entries and it will yield incorrect results, aside from
> the heap configuration lookup not matching the index.
>
> Once we nail this down we will have to have a documentation section
> about configuration mismatches.

So what is your proposal ? I'm lost a bit. I suggest to begin new thread :)

>
>> In principle, tsvector as any data type could be obtained by any other ways,
>> for example, OpenFTS construct tsvector following its own rules.
>>
>>>
>>> I was more concerned that there is nothing documenting the configuration
>>> used by the index or the tsvector table column trigger.  By doing:
>>
>> again, index has nothing with configuration name.
>> Our trigger function is an example, which uses default configuration name.
>> User could easily write it's own trigger to keep tsvector column up to date
>> and use configuration name as a parameter.
>
> Right. I am thinking beyond that issue.
>
>>>     CREATE INDEX textsearch_idx ON pgweb USING gin(to_tsvector('english',column));
>>>
>>> you guarantee that the index uses 'english' for all its entries.  If you
>>> omit the 'english' or use a different configuration, it will heap scan
>>> the table, which at least gives the right answer.
>>
>> sometimes it's useful not to use explicitly configuration name
>> to be able to use index with different configuration. Just change
>> tsearch_conf_name.
>
> I assume you are saying the benefit is for tsquery to use a different
> configuration, not having some tsvector index entries using different
> configurations than others.

sure, but not necessarily. You can mix different configurations if they 
are 'compatible'. Documents could be created from different sources and
could require different dictionaries. I don't want to limit people
in creating complex applications. If you unsure, then always use 
explicit configuration name. What's the problem ?

>
>>> Also, how do you guarantee that tsearch() triggers always uses the same
>>> configuration?  The existing tsearch() API seems to make that
>>> impossible.  I am wondering if we need to add the configuration name as
>>> a mandatory parameter to tsearch().
>>
>> Using the same tsearch_conf_name, which could be defined by many ways,
>> you guarantee to use the same configuration.
>
> Yea, I am sure you _can_ do it.  The question is how can we make it less
> error-prone.

We have only ONE variable - configuration name. It could be explicitly 
defined and then there is no problem at all, or specified by 
GUC variable tsearch_conf_name.  What could be simpler !

The question is do we need to define tsearch_conf_name automagically if
it's not defined ? I inclined don't do that and just issue error.

I believe this is enough error-prone.
    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


pgsql-hackers by date:

Previous
From: Heikki Linnakangas
Date:
Subject: Re: Future of krb5 authentication
Next
From: Magnus Hagander
Date:
Subject: Re: Future of krb5 authentication