Re: default_text_search_config and expression indexes - Mailing list pgsql-hackers
From | Heikki Linnakangas |
---|---|
Subject | Re: default_text_search_config and expression indexes |
Date | |
Msg-id | 46C1F52E.6070301@enterprisedb.com Whole thread Raw |
In response to | Re: default_text_search_config and expression indexes ("Mike Rylander" <mrylander@gmail.com>) |
Responses |
Re: default_text_search_config and expression indexes
|
List | pgsql-hackers |
Mike Rylander wrote: > This is just my $0.02 as a fairly heavy user of the current tsearch2 > code, but I sincerely hope you do not cripple the system by removing > the ability to store tsvectors built using arbitrary configurations in > a single column. Yes, it can lead to unexpected results if you do not > know what you are doing, but if you have gone beyond building a single > tsearch2 configuration then you are required to know what you are > doing. What's more, IMO the default configuration mechanism feels > very much like a CONSTRAINT, as Oleg suggests. That point is one of > cognizance, where if one has gone to the trouble of setting up > multiple configurations and has learned enough to do so correctly, > then one necessarily understands the importance of the setting and can > use it (or not, and use explicit configurations) correctly. The > default config lowers the bar to an acceptable level for beginners > that have no need of multiple configurations, and while I don't feel > too strongly, personally, about having a default, I think it is both > useful and helpful for new users -- it was for me. Thanks for chiming in. As a disclaimer: I've never used tsearch2 in a real application myself. > My application (http://open-ils.org, which run >80% of the public > libraries in Georgia, USA, http://gapines.org and > http://georgialibraries.org/lib/pines.html) requires that I be able to > search a corpus of bibliographic records in a mix of languages, and > potentially with mixed stop-word rules, with one query. I cannot know > ahead of time what languages will be used in the corpus and I cannot > restrict any one query to one language. To accomplish this, the > record itself will be inspected inside an INSERT/UPDATE trigger to > determine the language and type, and use the correct configuration for > creating the tsvector. This will obviously result in a "mixed" > tsvector column, but that's exactly what I need. I can filter on > record language if the user happens to specify a query language (and > thus configuration), or simply rank the assumed (IP based, perhaps, or > browser preference based) preferred language higher, or one of a > hundred other things. But I won't be able to do any of that if > tsvectors are required to have one and only one configuration per > column. Don't you need to use the right configuration to parse the query into a tsquery as well? What you have is basically the classic problem problem of representing inheritance in a relational model. You have a superclass, bibliographic record, and multiple subclasses, one per language with one extra field, the corpus in the right language. You've solved it by storing them all in one table, with an extra column (language) to denote which subclass the record is. Alternatively, you could solve it by using inherited tables, or having one table per language with a foreign key referencing the master table containing the other fields, or having a single table with one nullable field per configuration, and a check constraint to check that exactly one of them is not null. As a thought experiment, let me present another, not text search related example that's isomorphic to your problem: Imagine an international online auction system, where you can put items for sale and specify a minimum price. Part of the database schema is: CREATE TABLE item ( id INTEGER, description TEXT, currency CHAR(3), price NUMERIC ); It clearly doesn't make sense to directly compare prices in different currencies against each other. A query like WHERE price > 1000 doesn't make sense, unless you also restrict the currency, or use an exchange rate to convert between currencies. Also, different currencies might have slightly different rules on how many decimal places are significant. In this example, as well as your bibliographic scenario, we can conveniently store prices in all currencies in the same field because they're all numerics. If we wanted to enforce per-currency rules, like NUMERIC(10, 2) for USD and NUMERIC(10,0) for Italian lires (which doesn't really exists anymore, I know), we'd need to store them in separate columns. And before the decimalisation in 1971, the British pound was divided into 20 shillings, each of which was divided to 12 pence, so it wouldn't have fit to normal numeric field, and we would again have to store that in a separate column. What I'm trying to point out is that the problem isn't unique to text search. It's an old problem, with many alternative solutions, even with strong typing. Arguably if you store data in multiple languages in same field, you have a denormalized schema. Granted, loose typing can be more convenient, but you give up the benefits of strong typing as well. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
pgsql-hackers by date: