Home > mailing lists

Re: default_text_search_config and expression indexes - Mailing list pgsql-hackers

From	Mike Rylander
Subject	Re: default_text_search_config and expression indexes
Date	August 14, 2007 20:35:22
Msg-id	b918cf3d0708141335k362cf9acpa400bc52f5bb427f@mail.gmail.com Whole thread Raw
In response to	Re: default_text_search_config and expression indexes (Heikki Linnakangas <heikki@enterprisedb.com>)
List	pgsql-hackers

Tree view

On 8/14/07, Heikki Linnakangas <heikki@enterprisedb.com> wrote:
> Mike Rylander wrote:
[snip]
>
> Don't you need to use the right configuration to parse the query into a
> tsquery as well?
>

Only if the user (or user agent) can supply enough information to move
away from the configured default of, say, en-US.  And even then, other
tsvector configurations need to be searched.  Configurations are not,
of course, just about language/stemming, but also stop-words and
thesauri.  If it were just languages it wouldn't be too big of a deal
(other than in-app DDL management...).

> What you have is basically the classic problem problem of representing
> inheritance in a relational model. You have a superclass, bibliographic
> record, and multiple subclasses, one per language with one extra field,
> the corpus in the right language. You've solved it by storing them all
> in one table, with an extra column (language) to denote which subclass
> the record is. Alternatively, you could solve it by using inherited
> tables, or having one table per language with a foreign key referencing
> the master table containing the other fields, or having a single table
> with one nullable field per configuration, and a check constraint to
> check that exactly one of them is not null.
>

Sorry, communication problem here ... I provided an oversimplified
example meant more to show the issues than to find alternate
solutions, though I appreciate you taking the time to consider the
problem.  If I only had to consider one delimiting facet per record
then it would be much simpler. But with the complication that some
fields have stop-word lists (and some not), some use thesauri (and
different ones, at that), and these fields (as extracted from the
records) and their configurations (stem? use a stop-word list? use
thesaurus x, y or z?) are user-defined ...

> As a thought experiment, let me present another, not text search related
> example that's isomorphic to your problem:
>

Unfortunately, the problem you describe is not quite the same as the
problem I need to solve.

Fortunately, the current incarnation of tsearch2 does a pretty good
job of solving the problem if you store everything in one place and
use the query-time environment to apply some adjustments to the
ranking of items.  I could still work around this problem by creating
inherited tables, one for each configuration on each index-providing
table but I /really/ hope to avoid that.  Creating new configurations
for local requirements doesn't require creating new tables (and the
associated management overhead in the app) today, something I'd really
like to avoid.  In fact, I'm starting to sweat just thinking about
what the planner would go through with the number tables needed for
the potential configurations in an installation that makes use of
multiple thesauri and a mix of stop-word lists across, say, 30
languages.  Such a dataset is not uncommon.

In any case, thanks again for taking the time to think about the
problem.  I still think having the ability to store any old tsvector I
happen to have hanging around in any column of the correct type is a
GoodThing(tm).  I see from Oleg's message down-thread that that's the
way things will be (the tsvector type doesn't know about columns, just
lexem "tuples").

--miker

pgsql-hackers by date:

From: Bruce Momjian
Date: 14 August 2007, 20:29:28
Subject: Re: tsearch2 in PostgreSQL 8.3?

From: Alvaro Herrera
Date: 14 August 2007, 20:35:23
Subject: Re: change name of redirect_stderr?

Re: default_text_search_config and expression indexes - Mailing list pgsql-hackers

Previous

Next