Thread: Full-text search default vs specified configuration

Full-text search default vs specified configuration

From: Richard Huxton

I've been looking at a problem someone encountered with ts_headline:
http://archives.postgresql.org/pgsql-general/2008-02/msg01035.php

It turns out the problem was mixing ts_headline(<no specified config>) 
with to_tsquery(<specified config>) where <specified config> wasn't the 
default.
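
To illustrate the failure mode, here is a toy model in Python. It is not PostgreSQL's actual parser or dictionaries: the two normalizers and the headline function are hypothetical stand-ins for a non-default stemming configuration and a "simple"-style default. The point is only that a query normalized one way will not match document words normalized another way.

```python
# Toy model only: NOT PostgreSQL's real parser, dictionaries, or ts_headline.

def normalize_simple(word):
    """Hypothetical 'simple'-style config: lowercase, no stemming."""
    return word.lower()


def normalize_stemming(word):
    """Hypothetical stemming config: lowercase and drop a trailing 's'."""
    w = word.lower()
    return w[:-1] if w.endswith("s") and len(w) > 3 else w


def headline(document, query_lexemes, normalize):
    """Mark each word whose normalized form appears in the query."""
    out = []
    for word in document.split():
        if normalize(word) in query_lexemes:
            out.append(f"<b>{word}</b>")
        else:
            out.append(word)
    return " ".join(out)


doc = "Indexes speed up queries"

# The query is built with the stemming config (the "specified" one).
query = {normalize_stemming("queries")}

# Same config on both sides: the word is highlighted.
print(headline(doc, query, normalize_stemming))

# Headline falls back to the other ("default") config: nothing matches.
print(headline(doc, query, normalize_simple))
```

In this sketch the second call returns the document unchanged, which is roughly how the reported symptom would look to a user: no error, just a headline with nothing highlighted.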

Fair enough, and in retrospect it's obvious. However, I fear it's going 
to be a pretty common error. It's also one that's not easy to catch - 
you can test a configuration, but you can't see what configuration 
generated a particular tsvector / tsquery (afaict).

I realise there was a lot of discussion during 8.3 development about what 
was wanted from a default config, and I'm guessing there's nothing that can 
be done for 8.3.x.

Would there be any support for two changes in 8.4 though?

1. Tag tsvector/tsquery's with the (oid of) their configuration?
This could then generate a warning/error if you are running a tsquery 
against the wrong tsvector / combining two incompatible tsvectors etc.

2. Either warn or require CASCADE on changes to a 
configuration/dictionary that could impact existing indexes etc.
I've done it once myself where a stopword dictionary was changed from 
accept=true to accept=false. That change is OK (as long as you don't 
mind rogue tokens in your tsvectors) but others are probably not.
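
A minimal sketch of what proposal 1 might look like, written in Python purely for illustration. The class and function names are hypothetical, a plain string stands in for the configuration's oid, and only AND-style matching is modelled; nothing here reflects how tsvector or tsquery are actually stored.

```python
from dataclasses import dataclass


class ConfigMismatch(Exception):
    """Raised when a query and a vector were built with different configs."""


@dataclass(frozen=True)
class TaggedVector:
    config: str          # stand-in for the configuration's oid
    lexemes: frozenset


@dataclass(frozen=True)
class TaggedQuery:
    config: str          # stand-in for the configuration's oid
    lexemes: frozenset


def matches(vector, query):
    """Return True if every query lexeme is in the vector (AND semantics only)."""
    if vector.config != query.config:
        raise ConfigMismatch(
            f"query built with {query.config!r}, vector with {vector.config!r}"
        )
    return query.lexemes <= vector.lexemes


vec = TaggedVector("english", frozenset({"index", "queri"}))
same = TaggedQuery("english", frozenset({"queri"}))
other = TaggedQuery("simple", frozenset({"queries"}))

print(matches(vec, same))          # True
try:
    matches(vec, other)
except ConfigMismatch as exc:
    print("refused:", exc)
```

The design choice being illustrated is that a mismatch becomes a loud failure at comparison time instead of a silent non-match.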

--   Richard Huxton  Archonet Ltd


Re: Full-text search default vs specified configuration

From: Tom Lane

Richard Huxton <dev@archonet.com> writes:
> Would there be any support for two changes in 8.4 though?

> 1. Tag tsvector/tsquery's with the (oid of) their configuration?

> 2. Either warn or require CASCADE on changes to a 
> configuration/dictionary that could impact existing indexes etc.

IIRC, the current behavior is intentional --- Oleg and Teodor argued
that tsvector values are relatively independent of small changes in
configuration and we should *not* force people to, say, reindex their
tables every time they add or subtract a stopword.  If we had some
measure of whether a TS configuration change was "critical" or not,
it might make sense to restrict critical changes; but I fear that
would be kind of hard to determine.
        regards, tom lane


Re: Full-text search default vs specified configuration

From: Richard Huxton

Tom Lane wrote:
> Richard Huxton <dev@archonet.com> writes:
>> Would there be any support for two changes in 8.4 though?
> 
>> 1. Tag tsvector/tsquery's with the (oid of) their configuration?
> 
>> 2. Either warn or require CASCADE on changes to a 
>> configuration/dictionary that could impact existing indexes etc.
> 
> IIRC, the current behavior is intentional --- Oleg and Teodor argued
> that tsvector values are relatively independent of small changes in
> configuration and we should *not* force people to, say, reindex their
> tables every time they add or subtract a stopword.  If we had some
> measure of whether a TS configuration change was "critical" or not,
> it might make sense to restrict critical changes; but I fear that
> would be kind of hard to determine.

Well, clearly in my example it didn't impact operation at all, but it's 
an accident waiting to happen (and more importantly, a hard one to track 
down). It's like running with SQL_ASCII encoding: everything just ticks 
along, only to cause problems a month later.

What about the warning: "This may affect existing indexes - please 
check". Would that cause anyone problems?

What worries me is that it might take 10 messages on general/sql list to 
figure out the problem. This was reported as "words with many hits 
causes problems".

Maybe it's just a matter of getting the message out: "always specify the 
config or never specify the config".
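
One way to follow the "always specify" half of that rule from application code is to build every search statement through a single helper that takes the configuration as a required argument. The sketch below is an assumption-laden illustration: the table `docs`, the column `body`, the helper name, and the psycopg-style named placeholders are all hypothetical. The SQL itself uses only the two-argument forms of to_tsvector and to_tsquery and the config-first form of ts_headline.

```python
def build_headline_query(config, query_text):
    """Return (sql, params) with the same config passed to every FTS call.

    Hypothetical schema: a table "docs" with a text column "body".
    """
    sql = (
        "SELECT ts_headline(%(config)s::regconfig, body, q) "
        "FROM docs, to_tsquery(%(config)s::regconfig, %(query)s) AS q "
        "WHERE to_tsvector(%(config)s::regconfig, body) @@ q"
    )
    return sql, {"config": config, "query": query_text}


sql, params = build_headline_query("english", "index & query")
print(sql)
print(params)
```

Because the configuration is a parameter of the helper and not of each call site, it should be hard to end up with ts_headline and to_tsquery disagreeing.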

--   Richard Huxton  Archonet Ltd


Re: Full-text search default vs specified configuration

From: Oleg Bartunov

On Fri, 22 Feb 2008, Richard Huxton wrote:

> Tom Lane wrote:
>> Richard Huxton <dev@archonet.com> writes:
>>> Would there be any support for two changes in 8.4 though?
>> 
>>> 1. Tag tsvector/tsquery's with the (oid of) their configuration?
>> 
>>> 2. Either warn or require CASCADE on changes to a configuration/dictionary 
>>> that could impact existing indexes etc.
>> 
>> IIRC, the current behavior is intentional --- Oleg and Teodor argued
>> that tsvector values are relatively independent of small changes in
>> configuration and we should *not* force people to, say, reindex their
>> tables every time they add or subtract a stopword.  If we had some
>> measure of whether a TS configuration change was "critical" or not,
>> it might make sense to restrict critical changes; but I fear that
>> would be kind of hard to determine.
>
> Well, clearly in my example it didn't impact operation at all, but it's an 
> accident waiting to happen (and more importantly, a hard one to track down). 
> It's like running with SQL_ASCII encoding: everything just ticks along, only 
> to cause problems a month later.
>
> What about the warning: "This may affect existing indexes - please check". 
> Would that cause anyone problems?
>
> What worries me is that it might take 10 messages on general/sql list to 
> figure out the problem. This was reported as "words with many hits causes 
> problems".

He just didn't read the documentation thoroughly.

>
> Maybe it's just a matter of getting the message out: "always specify the 
> config or never specify the config".

Probably; just stress this in the documentation.

    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83