"Mike Rylander" <mrylander@gmail.com> writes:
> My application (http://open-ils.org, which run >80% of the public
> libraries in Georgia, USA, http://gapines.org and
> http://georgialibraries.org/lib/pines.html) requires that I be able to
> search a corpus of bibliographic records in a mix of languages, and
> potentially with mixed stop-word rules, with one query. I cannot know
> ahead of time what languages will be used in the corpus and I cannot
> restrict any one query to one language. To accomplish this, the
> record itself will be inspected inside an INSERT/UPDATE trigger to
> determine the language and type, and use the correct configuration for
> creating the tsvector. This will obviously result in a "mixed"
> tsvector column, but that's exactly what I need. I can filter on
> record language if the user happens to specify a query language (and
> thus configuration), or simply rank the assumed (IP based, perhaps, or
> browser preference based) preferred language higher, or one of a
> hundred other things. But I won't be able to do any of that if
> tsvectors are required to have one and only one configuration per
> column.
>
> Anyway, I felt I needed to provide some outside perspective to this,
> as a user, since it seems that the external viewpoint (my particular
> viewpoint, at least) was missing from the discussion.
This is *extremely* useful. I think it's precisely what we've been missing so
far. At least, what I've been missing.
So the question is what exactly happens in this case? If I search for "the"
does that mean it will ignore matches in English where that's a stop-word but
find me books on tea in French? Is that what I should expect to happen? What
if I search for "earl and the"? Does that find me French books on Early Grey
Tea but English books on all earls?
What happens if I use the same operator directly on the text column? Or
perhaps it's not even possible to specify stop-words when operating on a text
column? Should it be?
-- Gregory Stark EnterpriseDB http://www.enterprisedb.com