Re: default_text_search_config and expression indexes - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: default_text_search_config and expression indexes
Date
Msg-id 46C1F52E.6070301@enterprisedb.com
Whole thread Raw
In response to Re: default_text_search_config and expression indexes  ("Mike Rylander" <mrylander@gmail.com>)
Responses Re: default_text_search_config and expression indexes  ("Mike Rylander" <mrylander@gmail.com>)
List pgsql-hackers
Mike Rylander wrote:
> This is just my $0.02 as a fairly heavy user of the current tsearch2
> code, but I sincerely hope you do not cripple the system by removing
> the ability to store tsvectors built using arbitrary configurations in
> a single column.  Yes, it can lead to unexpected results if you do not
> know what you are doing, but if you have gone beyond building a single
> tsearch2 configuration then you are required to know what you are
> doing.  What's more, IMO the default configuration mechanism feels
> very much like a CONSTRAINT, as Oleg suggests.  That point is one of
> cognizance, where if one has gone to the trouble of setting up
> multiple configurations and has learned enough to do so correctly,
> then one necessarily understands the importance of the setting and can
> use it (or not, and use explicit configurations) correctly.  The
> default config lowers the bar to an acceptable level for beginners
> that have no need of multiple configurations, and while I don't feel
> too strongly, personally, about having a default, I think it is both
> useful and helpful for new users -- it was for me.

Thanks for chiming in. As a disclaimer: I've never used tsearch2 in a
real application myself.

> My application (http://open-ils.org, which run >80% of the public
> libraries in Georgia, USA, http://gapines.org and
> http://georgialibraries.org/lib/pines.html) requires that I be able to
> search a corpus of bibliographic records in a mix of languages, and
> potentially with mixed stop-word rules, with one query.  I cannot know
> ahead of time what languages will be used in the corpus and I cannot
> restrict any one query to one language.  To accomplish this, the
> record itself will be inspected inside an INSERT/UPDATE trigger to
> determine the language and type, and use the correct configuration for
> creating the tsvector.  This will obviously result in a "mixed"
> tsvector column, but that's exactly what I need.  I can filter on
> record language if the user happens to specify a query language (and
> thus configuration), or simply rank the assumed (IP based, perhaps, or
> browser preference based) preferred language higher, or one of a
> hundred other things.  But I won't be able to do any of that if
> tsvectors are required to have one and only one configuration per
> column.

Don't you need to use the right configuration to parse the query into a
tsquery as well?

What you have is basically the classic problem problem of representing
inheritance in a relational model. You have a superclass, bibliographic
record, and multiple subclasses, one per language with one extra field,
the corpus in the right language. You've solved it by storing them all
in one table, with an extra column (language) to denote which subclass
the record is. Alternatively, you could solve it by using inherited
tables, or having one table per language with a foreign key referencing
the master table containing the other fields, or having a single table
with one nullable field per configuration, and a check constraint to
check that exactly one of them is not null.

As a thought experiment, let me present another, not text search related
example that's isomorphic to your problem:

Imagine an international online auction system, where you can put items
for sale and specify a minimum price. Part of the database schema is:

CREATE TABLE item ( id INTEGER, description TEXT, currency CHAR(3), price NUMERIC
);

It clearly doesn't make sense to directly compare prices in different
currencies against each other. A query like WHERE price > 1000 doesn't
make sense, unless you also restrict the currency, or use an exchange
rate to convert between currencies. Also, different currencies might
have slightly different rules on how many decimal places are
significant. In this example, as well as your bibliographic scenario, we
can conveniently store prices in all currencies in the same field
because they're all numerics. If we wanted to enforce per-currency
rules, like NUMERIC(10, 2) for USD and NUMERIC(10,0) for Italian lires
(which doesn't really exists anymore, I know), we'd need to store them
in separate columns. And before the decimalisation in 1971, the British
pound was divided into 20 shillings, each of which was divided to 12
pence, so it wouldn't have fit to normal numeric field, and we would
again have to store that in a separate column.

What I'm trying to point out is that the problem isn't unique to text
search. It's an old problem, with many alternative solutions, even with
strong typing. Arguably if you store data in multiple languages in same
field, you have a denormalized schema. Granted, loose typing can be more
convenient, but you give up the benefits of strong typing as well.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: change name of redirect_stderr?
Next
From: Heikki Linnakangas
Date:
Subject: Re: default_text_search_config and expression indexes