Re: default_text_search_config and expression indexes - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: default_text_search_config and expression indexes
Date
Msg-id 46B62169.6050204@enterprisedb.com
In response to Re: default_text_search_config and expression indexes  (Bruce Momjian <bruce@momjian.us>)
Responses Re: default_text_search_config and expression indexes
List pgsql-hackers
Bruce Momjian wrote:
> Ron Mayer wrote:
>>>>>> We need more feedback from users.
>>>>> Well, I am waiting for other hackers to get involved, but if they don't,
>>>>> I have to evaluate it myself on the email lists.
>>>> Personally, I think documentation changes would be an OK way to
>>>> handle it.   Something that makes it extremely clear to the
>>>> user the advantages of having the extra column and the risks
>>>> of avoiding them.
>>> Sure, but you have to make sure you use the right configuration in the
>>> trigger, no?  Does the tsquery have to use the same configuration?
>> I wish I knew this myself. :-)   Whatever I had done happened to work
>> but that was largely through people on IRC walking me through it.
> 
> This illustrates the major issue --- that this has to be simple for
> people to get started, while keeping the capabilities for experienced
> users.
> 
> I am now thinking that making users always specify the configuration
> name and not allowing :: casting is going to be the best approach.  We
> can always add more in 8.4 after it is in wide use.

I just read the docs and I'm trying to get a grip on the problem here.

If I understood correctly, the basic issue is that a tsvector datum
created using configuration A is incompatible with a tsquery datum
created using configuration B, in the sense that you won't get
reasonable results if you use the tsquery to search the tsvector, or do
ranking or highlighting. If the configurations happen to be similar
enough, it can work, but not in general.
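
A minimal sketch of that mismatch, assuming the built-in 'english' and
'simple' configurations as they exist in released PostgreSQL (the
development snapshot under discussion may have named them differently):

```sql
-- Assumption: the built-in 'english' and 'simple' configurations.
-- 'english' stems "running" to the lexeme 'run'; 'simple' only lowercases,
-- so it keeps the lexeme 'running'.

-- Same configuration on both sides: the lexemes line up.
SELECT to_tsvector('english', 'The cats are running')
       @@ to_tsquery('english', 'running') AS same_config;

-- Different configurations: the query looks for 'running',
-- but the vector only contains 'run'.
SELECT to_tsvector('english', 'The cats are running')
       @@ to_tsquery('simple', 'running') AS mixed_config;
```

With those configurations the first query should return true and the
second false, even though both search the same text for the same word.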

That underlying issue manifests itself in many ways, including:
- if you create a table with a field of type tsvector, typically kept
up-to-date by triggers, and do a search on it using a different
configuration, you get incorrect results.
- using an expression index instead of a tsvector-field, and always
explicitly specifying the configuration, you can avoid that problem (a
query with a different configuration won't use the index). But an
expression index, without explicitly specifying the configuration, will
get corrupted if you change the default configuration.
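
The two setups above can be sketched as follows. The table foo and column
textcol follow the names used later in this message; the tsv column, the
trigger name, and the index names are made up for illustration, and the
syntax is that of released PostgreSQL, which may differ from the snapshot
being discussed:

```sql
-- Hypothetical table; only foo and textcol come from this thread.
CREATE TABLE foo (
    id      serial PRIMARY KEY,
    textcol text,
    tsv     tsvector
);

-- Variant 1: a stored tsvector column kept up to date by a trigger.
-- The configuration is pinned in the trigger arguments, so every query
-- against tsv has to build its tsquery with the same configuration.
CREATE TRIGGER foo_tsv_update
    BEFORE INSERT OR UPDATE ON foo
    FOR EACH ROW
    EXECUTE PROCEDURE tsvector_update_trigger(tsv, 'pg_catalog.english', textcol);

CREATE INDEX foo_tsv_idx ON foo USING gin (tsv);

-- Variant 2: an expression index with the configuration spelled out.
CREATE INDEX foo_textcol_fts_idx
    ON foo USING gin (to_tsvector('english', textcol));

-- This query can use foo_textcol_fts_idx only because its to_tsvector()
-- expression matches the indexed expression, configuration included.
SELECT * FROM foo
WHERE to_tsvector('english', textcol) @@ to_tsquery('english', 'query');
```

In released PostgreSQL the one-argument to_tsvector(text) is not
immutable and is rejected in an index expression, which sidesteps the
corruption case described in the second item.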

Removing the default configuration setting altogether removes the 2nd
problem, but that's not good from a usability point of view. And it
doesn't solve the general issue; you can still do things like:

  SELECT * FROM foo
  WHERE to_tsvector('confA', textcol) @@ to_tsquery('confB', 'query');

ISTM we should have a separate tsvector and tsquery data type for each
configuration, and throw an error if you try to mix and match them in a
query. to_tsquery and to_tsvector would be a new kind of polymorphic
functions that work with the types. Or we could automatically create a
copy of them when you create a new configuration. We could have a
default configuration setting and rewrite queries that don't explicitly
specify a configuration to use the default.

You could still get into trouble if you alter the configuration after
starting to use it. We could solve that by not allowing you to ALTER
CONFIGURATION, at least not if it's used in tables or indexes. Forcing
people to create a new configuration, and to recreate all indexes and
tsvector columns every time you add a word to a stop-list, for example,
seems too onerous, though. Not sure what to do about that.
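
To make the ALTER CONFIGURATION hazard concrete, here is a hedged sketch
using the syntax of released PostgreSQL. The configuration my_english and
the index name are hypothetical, and foo/textcol are the names from the
example query earlier in this message:

```sql
-- Assumption: a table foo with a text column textcol
-- (IF NOT EXISTS needs a reasonably recent PostgreSQL).
CREATE TABLE IF NOT EXISTS foo (textcol text);

-- Hypothetical private copy of the built-in 'english' configuration.
CREATE TEXT SEARCH CONFIGURATION my_english (COPY = english);

-- Index entries are built with my_english as it is defined right now.
CREATE INDEX foo_my_english_idx
    ON foo USING gin (to_tsvector('my_english', textcol));

-- Changing the mapping changes the lexemes produced from now on:
-- plain ASCII words are no longer stemmed. The entries already in the
-- index still hold the old, stemmed lexemes.
ALTER TEXT SEARCH CONFIGURATION my_english
    ALTER MAPPING FOR asciiword WITH simple;

-- Until the index is rebuilt, new queries and old entries disagree.
REINDEX INDEX foo_my_english_idx;
```

A stored tsvector column has the same problem and would need an UPDATE
that recomputes it from the source text after the configuration changes.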

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com

