Thread: Re: default_text_search_config and expression indexes

From: Ron Mayer

From over on the hackers list, Mike Rylander wrote:

> My application (http://open-ils.org, which runs >80% of the public
> libraries in Georgia, USA, http://gapines.org and
> http://georgialibraries.org/lib/pines.html) requires that I be able to
> search a corpus of bibliographic records in a mix of languages, and
> potentially with mixed stop-word rules, with one query.

Whoa, cool. Seems it'd make for a pretty awesome case study.
My apologies if one is already there, but I can't find it on the web site:
http://search.postgresql.org/search?q=georgia&a=1&submit=Search

Is this also related to the project?
http://open-ils.org/


More context from the thread on hackers:

> On 8/13/07, Bruce Momjian <bruce@momjian.us> wrote:
>> Heikki Linnakangas wrote:
>>> Bruce Momjian wrote:
>>>> Heikki Linnakangas wrote:
>>>>> Removing the default configuration setting altogether removes the 2nd
>>>>> problem, but that's not good from a usability point of view. And it
>>>>> doesn't solve the general issue, you can still do things like:
>>>>> SELECT * FROM foo WHERE to_tsvector('confA', textcol) @@
>>>>> to_tsquery('confB', 'query');
>>>> True, but in that case you are specifically naming different
>>>> configurations, so it is hopefully obvious you have a mismatch.
>>> There are many more subtle ways to do that. For example, filling a
>>> tsvector column using a DEFAULT clause. But then you sometimes fill it
>>> in the application instead, with a different configuration. Or if one of
>>> the function calls is buried in another user-defined function.
>>>
>>> I don't think explicitly naming the configuration gives enough protection.
>> Oh, wow, OK, well in that case the text search API isn't ready and we
>> will have to hold this for 8.4.
>>
>
> I've been watching this thread with a mixture of dread and hope,
> waiting to see where the developers' inclination will end up; whether
> leaving a useful foot gun available will be allowed.
>
> This is just my $0.02 as a fairly heavy user of the current tsearch2
> code, but I sincerely hope you do not cripple the system by removing
> the ability to store tsvectors built using arbitrary configurations in
> a single column.  Yes, it can lead to unexpected results if you do not
> know what you are doing, but if you have gone beyond building a single
> tsearch2 configuration then you are required to know what you are
> doing.  What's more, IMO the default configuration mechanism feels
> very much like a CONSTRAINT, as Oleg suggests.  That point is one of
> cognizance, where if one has gone to the trouble of setting up
> multiple configurations and has learned enough to do so correctly,
> then one necessarily understands the importance of the setting and can
> use it (or not, and use explicit configurations) correctly.  The
> default config lowers the bar to an acceptable level for beginners
> that have no need of multiple configurations, and while I don't feel
> too strongly, personally, about having a default, I think it is both
> useful and helpful for new users -- it was for me.
>
> Now, so this email isn't entirely complaining, and as a data point for
> the discussion, I'll explain why I do not want to see tsearch2
> crippled in the way suggested by Heikki and Bruce.
>
> My application (http://open-ils.org, which runs >80% of the public
> libraries in Georgia, USA, http://gapines.org and
> http://georgialibraries.org/lib/pines.html) requires that I be able to
> search a corpus of bibliographic records in a mix of languages, and
> potentially with mixed stop-word rules, with one query.  I cannot know
> ahead of time what languages will be used in the corpus and I cannot
> restrict any one query to one language.  To accomplish this, the
> record itself will be inspected inside an INSERT/UPDATE trigger to
> determine the language and type, and use the correct configuration for
> creating the tsvector.  This will obviously result in a "mixed"
> tsvector column, but that's exactly what I need.  I can filter on
> record language if the user happens to specify a query language (and
> thus configuration), or simply rank the assumed (IP based, perhaps, or
> browser preference based) preferred language higher, or one of a
> hundred other things.  But I won't be able to do any of that if
> tsvectors are required to have one and only one configuration per
> column.
>
> Anyway, I felt I needed to provide some outside perspective to this,
> as a user, since it seems that the external viewpoint (my particular
> viewpoint, at least) was missing from the discussion.
>
> Thanks, folks, for all the work on this so far!
>
> --miker
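
For readers who want to picture the trigger-based approach Mike describes, here is a rough sketch. It is not taken from his application. The table, column, and function names (biblio_record, lang, body, body_tsv, biblio_record_tsv) are made up for illustration, the language-to-configuration mapping is an assumption, and it uses the integrated text search functions (to_tsvector with a configuration argument, plainto_tsquery, ts_rank) rather than the contrib tsearch2 API.

```sql
-- Hypothetical table: one row per bibliographic record.
CREATE TABLE biblio_record (
    id       serial PRIMARY KEY,
    lang     text NOT NULL,      -- assumed language code, e.g. 'en', 'fr'
    body     text NOT NULL,
    body_tsv tsvector            -- "mixed" column: configuration varies per row
);

-- Pick the text search configuration from the row's own language,
-- so each tsvector is built with the rules that suit that record.
CREATE FUNCTION biblio_record_tsv() RETURNS trigger AS $$
BEGIN
    NEW.body_tsv := to_tsvector(
        CASE NEW.lang
            WHEN 'fr' THEN 'french'::regconfig
            WHEN 'es' THEN 'spanish'::regconfig
            ELSE 'english'::regconfig   -- assumed fallback
        END,
        NEW.body);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER biblio_record_tsv_trg
    BEFORE INSERT OR UPDATE ON biblio_record
    FOR EACH ROW EXECUTE PROCEDURE biblio_record_tsv();

-- When the user states a query language, filter on it and build the
-- tsquery with the matching configuration.
SELECT id, lang, ts_rank(body_tsv, q) AS score
FROM biblio_record, plainto_tsquery('french', 'bibliothèques') AS q
WHERE lang = 'fr' AND body_tsv @@ q
ORDER BY score DESC;
```

Dropping the lang filter searches every record with a single tsquery. That is where the mismatch Heikki warns about can show up, since rows built with a different configuration may stem or drop words differently than the query does.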