Re: default_text_search_config and expression indexes - Mailing list pgsql-hackers
From | Mike Rylander |
---|---|
Subject | Re: default_text_search_config and expression indexes |
Date | |
Msg-id | b918cf3d0708142046w775e6021rcc693e3aa70b81ea@mail.gmail.com Whole thread Raw |
In response to | Re: default_text_search_config and expression indexes (Gregory Stark <stark@enterprisedb.com>) |
List | pgsql-hackers |
On 8/14/07, Gregory Stark <stark@enterprisedb.com> wrote: > "Mike Rylander" <mrylander@gmail.com> writes: > > > My application (http://open-ils.org, which run >80% of the public > > libraries in Georgia, USA, http://gapines.org and > > http://georgialibraries.org/lib/pines.html) requires that I be able to > > search a corpus of bibliographic records in a mix of languages, and > > potentially with mixed stop-word rules, with one query. I cannot know > > ahead of time what languages will be used in the corpus and I cannot > > restrict any one query to one language. To accomplish this, the > > record itself will be inspected inside an INSERT/UPDATE trigger to > > determine the language and type, and use the correct configuration for > > creating the tsvector. This will obviously result in a "mixed" > > tsvector column, but that's exactly what I need. I can filter on > > record language if the user happens to specify a query language (and > > thus configuration), or simply rank the assumed (IP based, perhaps, or > > browser preference based) preferred language higher, or one of a > > hundred other things. But I won't be able to do any of that if > > tsvectors are required to have one and only one configuration per > > column. > > > > Anyway, I felt I needed to provide some outside perspective to this, > > as a user, since it seems that the external viewpoint (my particular > > viewpoint, at least) was missing from the discussion. > > This is *extremely* useful. I think it's precisely what we've been missing so > far. At least, what I've been missing. > > So the question is what exactly happens in this case? If I search for "the" > does that mean it will ignore matches in English where that's a stop-word but > find me books on tea in French? Is that what I should expect to happen? What > if I search for "earl and the"? Does that find me French books on Early Grey > Tea but English books on all earls? Oh dear ... you went and got me started... Each field type has a different set of configurations for potenial use. Title and subject fields, for instance, do not (generally) use stop-word lists, so a subject search for "the" will match any record with the lexem "the" in a subject field. Title fields are a little more complicated, because there is information in the bibliographic record about how and when to skip leading articles, but generally those are indexed as well for FTS purposes. English LCSH subjects generally don't have stop(like) words in them, so you'll probably just get "French tea" records. Title results would be a mix of "earls" and "French tea" records (probably) and the correlation between the user's preferred language (either chosen query lang or UI lang, whichever is available) will help adjust the ranking, pushing what are likely to be the more appropriate records to the top. Note, however, that much of this multi-tsearch2-configuration setup is not used in the implementation at http://gapines.org/ because, well, there's not much need (read: demand from librarians) for that dataset to support these more complex tricks. It's basically all en-US and exclude stop-words. Other implementations are making more use of what I describe above, including a (government mandated) French-English bilingual institution who shall remain nameless for the time being... > > What happens if I use the same operator directly on the text column? Or > perhaps it's not even possible to specify stop-words when operating on a text > column? Should it be? You mean with an expression index on a text column? I haven't considered using them for FTS. It just feels easier and more flexible to me to use an external tsvector column because of the fairly heavy processing that goes into creating each tsvector value. I may re-evaluate that position now that CREATE INDEX CONCURRENTLY exists, but I'm not developing with 8.2+ only features yet. Once 8.3 is out that may change. Also, unless I misunderstand, you have to wrap the text column in the function used to build the index. For my purposes, that makes building a generic FTS driver for my app (which, admittedly, only has a Postgresql driver ;) ) more difficult than having a "hidden" extra column. Again, that could change if the benefits of CREATE INDEX CONCURRENTLY end up outweighing simpler FTS driver code. --miker
pgsql-hackers by date: