Re: default_text_search_config and expression indexes - Mailing list pgsql-hackers

From Mike Rylander
Subject Re: default_text_search_config and expression indexes
Date
Msg-id b918cf3d0708142046w775e6021rcc693e3aa70b81ea@mail.gmail.com
Whole thread Raw
In response to Re: default_text_search_config and expression indexes  (Gregory Stark <stark@enterprisedb.com>)
List pgsql-hackers
On 8/14/07, Gregory Stark <stark@enterprisedb.com> wrote:
> "Mike Rylander" <mrylander@gmail.com> writes:
>
> > My application (http://open-ils.org, which run >80% of the public
> > libraries in Georgia, USA, http://gapines.org and
> > http://georgialibraries.org/lib/pines.html) requires that I be able to
> > search a corpus of bibliographic records in a mix of languages, and
> > potentially with mixed stop-word rules, with one query.  I cannot know
> > ahead of time what languages will be used in the corpus and I cannot
> > restrict any one query to one language.  To accomplish this, the
> > record itself will be inspected inside an INSERT/UPDATE trigger to
> > determine the language and type, and use the correct configuration for
> > creating the tsvector.  This will obviously result in a "mixed"
> > tsvector column, but that's exactly what I need.  I can filter on
> > record language if the user happens to specify a query language (and
> > thus configuration), or simply rank the assumed (IP based, perhaps, or
> > browser preference based) preferred language higher, or one of a
> > hundred other things.  But I won't be able to do any of that if
> > tsvectors are required to have one and only one configuration per
> > column.
> >
> > Anyway, I felt I needed to provide some outside perspective to this,
> > as a user, since it seems that the external viewpoint (my particular
> > viewpoint, at least) was missing from the discussion.
>
> This is *extremely* useful. I think it's precisely what we've been missing so
> far. At least, what I've been missing.
>
> So the question is what exactly happens in this case? If I search for "the"
> does that mean it will ignore matches in English where that's a stop-word but
> find me books on tea in French? Is that what I should expect to happen? What
> if I search for "earl and the"? Does that find me French books on Early Grey
> Tea but English books on all earls?

Oh dear ... you went and got me started...

Each field type has a different set of configurations for potenial
use.  Title and subject fields, for instance, do not (generally) use
stop-word lists, so a subject search for "the" will match any record
with the lexem "the" in a subject field.  Title fields are a little
more complicated, because there is information in the bibliographic
record about how and when to skip leading articles, but generally
those are indexed as well for FTS purposes.  English LCSH subjects
generally don't have stop(like) words in them, so you'll probably just
get "French tea" records.  Title results would be a mix of "earls" and
"French tea" records (probably) and the correlation between the user's
preferred language (either chosen query lang or UI lang, whichever is
available) will help adjust the ranking, pushing what are likely to be
the more appropriate records to the top.

Note, however, that much of this multi-tsearch2-configuration setup is
not used in the implementation at http://gapines.org/ because, well,
there's not much need (read: demand from librarians) for that dataset
to support these more complex tricks.  It's basically all en-US and
exclude stop-words.  Other implementations are making more use of what
I describe above, including a (government mandated) French-English
bilingual institution who shall remain nameless for the time being...

>
> What happens if I use the same operator directly on the text column? Or
> perhaps it's not even possible to specify stop-words when operating on a text
> column? Should it be?

You mean with an expression index on a text column?  I haven't
considered using them for FTS.  It just feels easier and more flexible
to me to use an external tsvector column because of the fairly heavy
processing that goes into creating each tsvector value.  I may
re-evaluate that position now that CREATE INDEX CONCURRENTLY exists,
but I'm not developing with 8.2+ only features yet.  Once 8.3 is out
that may change.

Also, unless I misunderstand, you have to wrap the text column in the
function used to build the index.  For my purposes, that makes
building a generic FTS driver for my app (which, admittedly, only has
a Postgresql driver ;) ) more difficult than having a "hidden" extra
column.  Again, that could change if the benefits of CREATE INDEX
CONCURRENTLY end up outweighing simpler FTS driver code.

--miker


pgsql-hackers by date:

Previous
From: Andrew Dunstan
Date:
Subject: Re: tsearch2 in PostgreSQL 8.3?
Next
From: "Marc G. Fournier"
Date:
Subject: Re: CVS corruption/mistagging?