Re: default_text_search_config and expression indexes - Mailing list pgsql-hackers

From Mike Rylander
Subject Re: default_text_search_config and expression indexes
Date
Msg-id b918cf3d0708141335k362cf9acpa400bc52f5bb427f@mail.gmail.com
Whole thread Raw
In response to Re: default_text_search_config and expression indexes  (Heikki Linnakangas <heikki@enterprisedb.com>)
List pgsql-hackers
On 8/14/07, Heikki Linnakangas <heikki@enterprisedb.com> wrote:
> Mike Rylander wrote:
[snip]
>
> Don't you need to use the right configuration to parse the query into a
> tsquery as well?
>

Only if the user (or user agent) can supply enough information to move
away from the configured default of, say, en-US.  And even then, other
tsvector configurations need to be searched.  Configurations are not,
of course, just about language/stemming, but also stop-words and
thesauri.  If it were just languages it wouldn't be too big of a deal
(other than in-app DDL management...).

> What you have is basically the classic problem problem of representing
> inheritance in a relational model. You have a superclass, bibliographic
> record, and multiple subclasses, one per language with one extra field,
> the corpus in the right language. You've solved it by storing them all
> in one table, with an extra column (language) to denote which subclass
> the record is. Alternatively, you could solve it by using inherited
> tables, or having one table per language with a foreign key referencing
> the master table containing the other fields, or having a single table
> with one nullable field per configuration, and a check constraint to
> check that exactly one of them is not null.
>

Sorry, communication problem here ... I provided an oversimplified
example meant more to show the issues than to find alternate
solutions, though I appreciate you taking the time to consider the
problem.  If I only had to consider one delimiting facet per record
then it would be much simpler. But with the complication that some
fields have stop-word lists (and some not), some use thesauri (and
different ones, at that), and these fields (as extracted from the
records) and their configurations (stem? use a stop-word list? use
thesaurus x, y or z?) are user-defined ...

> As a thought experiment, let me present another, not text search related
> example that's isomorphic to your problem:
>

Unfortunately, the problem you describe is not quite the same as the
problem I need to solve.

Fortunately, the current incarnation of tsearch2 does a pretty good
job of solving the problem if you store everything in one place and
use the query-time environment to apply some adjustments to the
ranking of items.  I could still work around this problem by creating
inherited tables, one for each configuration on each index-providing
table but I /really/ hope to avoid that.  Creating new configurations
for local requirements doesn't require creating new tables (and the
associated management overhead in the app) today, something I'd really
like to avoid.  In fact, I'm starting to sweat just thinking about
what the planner would go through with the number tables needed for
the potential configurations in an installation that makes use of
multiple thesauri and a mix of stop-word lists across, say, 30
languages.  Such a dataset is not uncommon.

In any case, thanks again for taking the time to think about the
problem.  I still think having the ability to store any old tsvector I
happen to have hanging around in any column of the correct type is a
GoodThing(tm).  I see from Oleg's message down-thread that that's the
way things will be (the tsvector type doesn't know about columns, just
lexem "tuples").

--miker


pgsql-hackers by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: tsearch2 in PostgreSQL 8.3?
Next
From: Alvaro Herrera
Date:
Subject: Re: change name of redirect_stderr?