Re: default_text_search_config and expression indexes - Mailing list pgsql-hackers

From Mike Rylander
Subject Re: default_text_search_config and expression indexes
Date
Msg-id b918cf3d0708141013u7ff808fds8bcf11a58918f6d1@mail.gmail.com
In response to Re: default_text_search_config and expression indexes  (Bruce Momjian <bruce@momjian.us>)
Responses Re: default_text_search_config and expression indexes  (Heikki Linnakangas <heikki@enterprisedb.com>)
Re: default_text_search_config and expression indexes  (Bruce Momjian <bruce@momjian.us>)
Re: default_text_search_config and expression indexes  (Gregory Stark <stark@enterprisedb.com>)
List pgsql-hackers
On 8/13/07, Bruce Momjian <bruce@momjian.us> wrote:
> Heikki Linnakangas wrote:
> > Bruce Momjian wrote:
> > > Heikki Linnakangas wrote:
> > >> Removing the default configuration setting altogether removes the 2nd
> > >> problem, but that's not good from a usability point of view. And it
> > >> doesn't solve the general issue, you can still do things like:
> > >> SELECT * FROM foo WHERE to_tsvector('confA', textcol) @@
> > >> to_tsquery('confB', 'query');
> > >
> > > True, but in that case you are specifically naming different
> > > configurations, so it is hopefully obvious you have a mismatch.
> >
> > There's many more subtle ways to do that. For example, filling a
> > tsvector column using a DEFAULT clause. But then you sometimes fill it
> > in the application instead, with a different configuration. Or if one of
> > the function calls is buried in another user defined function.
> >
> > I don't think explicitly naming the configuration gives enough protection.
>
> Oh, wow, OK, well in that case the text search API isn't ready and we
> will have to hold this for 8.4.
>

I've been watching this thread with a mixture of dread and hope,
waiting to see where the developers' inclinations will land, and
whether a useful foot gun will be left available.

This is just my $0.02 as a fairly heavy user of the current tsearch2
code, but I sincerely hope you do not cripple the system by removing
the ability to store tsvectors built using arbitrary configurations in
a single column.  Yes, it can lead to unexpected results if you do not
know what you are doing, but if you have gone beyond building a single
tsearch2 configuration then you are required to know what you are
doing.  What's more, IMO the default configuration mechanism feels
very much like a CONSTRAINT, as Oleg suggests.  The point is one of
awareness: anyone who has gone to the trouble of setting up
multiple configurations, and has learned enough to do so correctly,
necessarily understands the importance of the setting and can
use it (or ignore it and name configurations explicitly) correctly.  The
default config lowers the bar to an acceptable level for beginners
who have no need of multiple configurations, and while I don't feel
too strongly, personally, about having a default, I think it is both
useful and helpful for new users -- it was for me.
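
To make the two styles concrete, here is a minimal sketch using the
function and setting names from this thread.  The sample strings are
made up, and it assumes the 'english' and 'french' configurations are
installed:

```sql
-- Beginner style: set the default once and never name a configuration.
SET default_text_search_config = 'english';
SELECT to_tsvector('The libraries are searching their catalogs');

-- Explicit style: name the configuration on every call; the default
-- setting is not consulted for these.
SELECT to_tsvector('french', 'Les bibliothèques cherchent');
SELECT to_tsquery('french', 'bibliothèque');
```

Someone who only ever needs one configuration can stay with the first
form; the second form is for those of us who have more than one.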

Now, so this email isn't entirely complaining, and as a data point for
the discussion, I'll explain why I do not want to see tsearch2
crippled in the way suggested by Heikki and Bruce.

My application (http://open-ils.org, which runs more than 80% of the
public libraries in Georgia, USA; see http://gapines.org and
http://georgialibraries.org/lib/pines.html) requires that I be able to
search a corpus of bibliographic records in a mix of languages, and
potentially with mixed stop-word rules, with one query.  I cannot know
ahead of time what languages will be used in the corpus and I cannot
restrict any one query to one language.  To accomplish this, an
INSERT/UPDATE trigger will inspect the record itself to determine its
language and type, and then use the matching configuration when
creating the tsvector.  This will obviously result in a "mixed"
tsvector column, but that's exactly what I need.  I can filter on
record language if the user happens to specify a query language (and
thus configuration), or simply rank the assumed (IP based, perhaps, or
browser preference based) preferred language higher, or one of a
hundred other things.  But I won't be able to do any of that if
tsvectors are required to have one and only one configuration per
column.
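
For illustration only, here is a rough sketch of the kind of trigger I
mean.  It is not the actual Open-ILS schema: the table, column,
function, and language-code names are all hypothetical, and the mapping
from language to configuration is deliberately simplistic.

```sql
-- Hypothetical table: one row per bibliographic record.
CREATE TABLE bib_record (
    id           serial PRIMARY KEY,
    lang         text NOT NULL,   -- assumed language code: 'eng', 'fre', 'spa', ...
    content      text NOT NULL,   -- text to be indexed
    index_vector tsvector         -- filled in by the trigger below
);

-- Pick a text search configuration per row, based on the record itself.
CREATE FUNCTION bib_record_set_vector() RETURNS trigger AS $$
DECLARE
    cfg regconfig;
BEGIN
    cfg := CASE NEW.lang
               WHEN 'eng' THEN 'english'::regconfig
               WHEN 'fre' THEN 'french'::regconfig
               WHEN 'spa' THEN 'spanish'::regconfig
               ELSE 'simple'::regconfig   -- fallback: no stemming, no stop words
           END;
    -- Different rows end up with vectors built by different configurations.
    NEW.index_vector := to_tsvector(cfg, NEW.content);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER bib_record_vector_trig
    BEFORE INSERT OR UPDATE ON bib_record
    FOR EACH ROW EXECUTE PROCEDURE bib_record_set_vector();

-- One index over the whole "mixed" column.
CREATE INDEX bib_record_vector_idx ON bib_record USING gin (index_vector);

-- The user named a query language: filter on it.
SELECT id
  FROM bib_record
 WHERE lang = 'fre'
   AND index_vector @@ to_tsquery('french', 'bibliothèque');

-- No language named: search every row, but rank the assumed
-- preferred language first.
SELECT id, lang
  FROM bib_record, to_tsquery('french', 'bibliothèque') AS q
 WHERE index_vector @@ q
 ORDER BY (lang = 'fre') DESC, ts_rank(index_vector, q) DESC;
```

The point of the sketch is only that the configuration is chosen per
row, inside the trigger, so the column cannot be tied to a single
configuration without breaking this approach.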

Anyway, I felt I needed to provide some outside perspective to this,
as a user, since it seems that the external viewpoint (my particular
viewpoint, at least) was missing from the discussion.

Thanks, folks, for all the work on this so far!

--miker

