Thread: tsearch_core patch: permissions and security issues

tsearch_core patch: permissions and security issues

From
Tom Lane
Date:
I've been looking at the tsearch patch a bit, and I think there needs to
be more thought given to the permissions required to mess around with
tsearch configuration objects.

The TSParser objects reference functions declared to take and return
INTERNAL arguments.  This means that the underlying functions must be
coded in C and can only be installed by a superuser, which in turn means
that there is no scenario where it is really useful for a non-superuser
to execute CREATE PARSER.  What's more, allowing a non-superuser to do
it creates security holes: if you can find an unrelated function taking
the right number of INTERNAL arguments, you can install it as a TSParser
support function.  That trivially allows crashing the backend, and it
could allow worse security holes than that.

TSDictionary objects have exactly the same issues since they also depend
on functions with INTERNAL arguments.

At minimum this means that we should restrict CREATE/DROP/ALTER commands
for these objects to superusers.  (Which in turn means there's no point
in tracking an ownership column for them; every superuser is the same as
every other one, permissions-wise.)  I'm wondering though whether this
doesn't mean that we don't need manipulation commands for them at all.
Is it likely that people will be adding parser or dictionary support to
an installation on the fly?  Maybe we can just create 'em all at initdb
time and be done, similar to the way index access methods are treated.
This doesn't say that it's not possible to add more; you can add an
index access method on the fly too, if you want, by inserting stuff into
pg_am by hand.  I'm just wondering whether all that SQL-statement
support and pg_dump support for custom parsers and dictionaries is
really worth the code space and future maintenance effort it'll eat up.

You could remove the immediate source of this objection if you could
redesign the APIs for the underlying support functions to be more
type-safe.  I'm not sure how feasible or useful that would be though.
The bottom-line question here is whether developing a new parser or
dictionary implementation is really something that ordinary users might
do.  If not, then having all this SQL-level support for setting up
catalog entries seems like wasted effort.

TSConfiguration objects are a different story, since they have only
type-safe dependencies on parsers, locales, and dictionaries.  But they
still need some more thought about permissions, because AFAICS mucking
with a configuration can invalidate some other user's data.  Do we want
to allow runtime changes in a configuration that existing tsvector
columns already depend on?  How can we even recognize whether there is
stored data that will be affected by a configuration change?  (AFAICS
the patch doesn't put anything into the pg_depend machinery that could
deal with this.)  And who gets to decide which configuration is default,
anyway?

I'm also a bit disturbed that you've made searches for TSConfiguration
objects be search-path-sensitive.  That is likely to create problems
similar to those we've recently recognized for function lookup, eg,
an insertion into a full-text-indexed column gets treated differently
depending on the caller's search path.  It's particularly bad to have
the default object be search-path-dependent.  We learned the hard way
not to do that for default index operator classes; let's not make the
same mistake again for tsearch configurations.

Next, it took me a while to understand how Mapping objects fit into
the scheme at all, and now that (I think) I understand, I'm wondering
why treat them as an independent concept.  Seems like the mapping from
token types to dictionaries is really a property of a configuration,
and we ought to be handling it through options of CREATE/ALTER
CONFIGURATION commands, not as an apparently independent object type.
The way the patch is doing it feels like implementing CREATE ATTRIBUTE
as a separate command instead of having ALTER TABLE ADD COLUMN; it's
just weird, and it's not obvious that dropping a configuration should
make the associated mapping object go away.

Lastly, I'm unhappy that the patch still keeps a lot of configuration
information, such as stop word lists, in the filesystem rather than the
database.  It seems to me that the single easiest and most useful part
of a configuration to change is the stop word list; but this setup
guarantees that no one but a DBA can do that, and what's more that
pg_dump won't record your changes.  What's the point of having any
non-superuser configuration capability at all, if stop words aren't part
of what you can change?
        regards, tom lane


Re: tsearch_core patch: permissions and security issues

From
Bruce Momjian
Date:
You bring up a very good point.  There are fifteen new commands being
added for full text indexing:
alter-fulltext-config.sgml      alter-fulltext-owner.sgmlcreate-fulltext-dict.sgml
drop-fulltext-dict.sgmlalter-fulltext-dict.sgml       alter-fulltext-parser.sgmlcreate-fulltext-map.sgml
drop-fulltext-map.sgmlalter-fulltext-dictset.sgml    comment-fulltext.sgmlcreate-fulltext-parser.sgml
drop-fulltext-parser.sgmlalter-fulltext-map.sgml        create-fulltext-config.sgmldrop-fulltext-config.sgml
 

I think encoding is a good example to follow.  We allow users to create
new conversions (CREATE CONVERSION), but we don't allow them to create
new encodings --- those are hard-coded in the backend.  Which of the
following full text objects:
configdictmapdictsetparser

can we hard-code into the backend, and just update for every major
release like we do for encodings?

---------------------------------------------------------------------------

Tom Lane wrote:
> I've been looking at the tsearch patch a bit, and I think there needs to
> be more thought given to the permissions required to mess around with
> tsearch configuration objects.
> 
> The TSParser objects reference functions declared to take and return
> INTERNAL arguments.  This means that the underlying functions must be
> coded in C and can only be installed by a superuser, which in turn means
> that there is no scenario where it is really useful for a non-superuser
> to execute CREATE PARSER.  What's more, allowing a non-superuser to do
> it creates security holes: if you can find an unrelated function taking
> the right number of INTERNAL arguments, you can install it as a TSParser
> support function.  That trivially allows crashing the backend, and it
> could allow worse security holes than that.
> 
> TSDictionary objects have exactly the same issues since they also depend
> on functions with INTERNAL arguments.
> 
> At minimum this means that we should restrict CREATE/DROP/ALTER commands
> for these objects to superusers.  (Which in turn means there's no point
> in tracking an ownership column for them; every superuser is the same as
> every other one, permissions-wise.)  I'm wondering though whether this
> doesn't mean that we don't need manipulation commands for them at all.
> Is it likely that people will be adding parser or dictionary support to
> an installation on the fly?  Maybe we can just create 'em all at initdb
> time and be done, similar to the way index access methods are treated.
> This doesn't say that it's not possible to add more; you can add an
> index access method on the fly too, if you want, by inserting stuff into
> pg_am by hand.  I'm just wondering whether all that SQL-statement
> support and pg_dump support for custom parsers and dictionaries is
> really worth the code space and future maintenance effort it'll eat up.
> 
> You could remove the immediate source of this objection if you could
> redesign the APIs for the underlying support functions to be more
> type-safe.  I'm not sure how feasible or useful that would be though.
> The bottom-line question here is whether developing a new parser or
> dictionary implementation is really something that ordinary users might
> do.  If not, then having all this SQL-level support for setting up
> catalog entries seems like wasted effort.
> 
> TSConfiguration objects are a different story, since they have only
> type-safe dependencies on parsers, locales, and dictionaries.  But they
> still need some more thought about permissions, because AFAICS mucking
> with a configuration can invalidate some other user's data.  Do we want
> to allow runtime changes in a configuration that existing tsvector
> columns already depend on?  How can we even recognize whether there is
> stored data that will be affected by a configuration change?  (AFAICS
> the patch doesn't put anything into the pg_depend machinery that could
> deal with this.)  And who gets to decide which configuration is default,
> anyway?
> 
> I'm also a bit disturbed that you've made searches for TSConfiguration
> objects be search-path-sensitive.  That is likely to create problems
> similar to those we've recently recognized for function lookup, eg,
> an insertion into a full-text-indexed column gets treated differently
> depending on the caller's search path.  It's particularly bad to have
> the default object be search-path-dependent.  We learned the hard way
> not to do that for default index operator classes; let's not make the
> same mistake again for tsearch configurations.
> 
> Next, it took me a while to understand how Mapping objects fit into
> the scheme at all, and now that (I think) I understand, I'm wondering
> why treat them as an independent concept.  Seems like the mapping from
> token types to dictionaries is really a property of a configuration,
> and we ought to be handling it through options of CREATE/ALTER
> CONFIGURATION commands, not as an apparently independent object type.
> The way the patch is doing it feels like implementing CREATE ATTRIBUTE
> as a separate command instead of having ALTER TABLE ADD COLUMN; it's
> just weird, and it's not obvious that dropping a configuration should
> make the associated mapping object go away.
> 
> Lastly, I'm unhappy that the patch still keeps a lot of configuration
> information, such as stop word lists, in the filesystem rather than the
> database.  It seems to me that the single easiest and most useful part
> of a configuration to change is the stop word list; but this setup
> guarantees that no one but a DBA can do that, and what's more that
> pg_dump won't record your changes.  What's the point of having any
> non-superuser configuration capability at all, if stop words aren't part
> of what you can change?
> 
>             regards, tom lane
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: tsearch_core patch: permissions and security issues

From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> You could remove the immediate source of this objection if you could
> redesign the APIs for the underlying support functions to be more
> type-safe.  I'm not sure how feasible or useful that would be though.
> The bottom-line question here is whether developing a new parser or
> dictionary implementation is really something that ordinary users might
> do.  If not, then having all this SQL-level support for setting up
> catalog entries seems like wasted effort.

Well assuming we have any SQL-level support at all I think we should strive to
avoid these functions taking INTERNAL arguments.

I feel like having them in the GIST interface has been a major impediment to
more people defining GIST indexes for more datatypes. Because you need to
write C code dealing with internal data structures to handle page splits the
bar to implement GIST index operator classes is too high for most users. So
instead of a simple SQL command we end up with contrib modules implementing
each type of GIST index.

A while back I proposed that we implement the same page-split algorithm that
most (or all?) of those contrib modules copy-paste between them as a default
implementation. That would allow defining a GIST index in terms of a handful
of operators like "distance" which could be defined with a type-safe api. This
would be less flexible than the existing generic solution but it would allow
defining new GIST indexes without writing C code.

> But they still need some more thought about permissions, because AFAICS
> mucking with a configuration can invalidate some other user's data.

ouch. could mucking with a configuration create a corrupt index?

This sounds sort of analogous to the issues collation bring up.

> It seems to me that the single easiest and most useful part of a
> configuration to change is the stop word list; but this setup guarantees
> that no one but a DBA can do that, and what's more that pg_dump won't record
> your changes.

I would second that, in the past I was expected to provide an administrative
web interface to adjust the list of stop words.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com



Re: tsearch_core patch: permissions and security issues

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> Well assuming we have any SQL-level support at all I think we should
> strive to avoid these functions taking INTERNAL arguments.

I don't think I want to get into redesigning the patch at that level of
detail, at least not for 8.3.  It seems like something possibly worth
thinking about for 8.4 though.  The idea that we might want to change
the API for parser and dictionary support routines seems like another
good argument for not exposing user-level facilities for creating them
right now.

What I'm realizing as I look at it is that this is an enormous patch,
and it's not as close to being ready to apply as I had supposed.  If we
don't scale it back, then either it doesn't get into 8.3 or 8.3 gets
delayed a whole lot longer.  So we need to look at what we can trim or
postpone for a later release.

So all these factors seem to me to point in the same direction: at least
for the time being, we should treat TS parsers and dictionaries the way
we treat index access methods.  There'll be a catalog, which the
adventurous can insert new entries into, but no SQL-level support for
doing it, hence no pg_dump support.  And we reserve the right to whack
around the API for the functions referenced by the catalog entries.

That still leaves us with the question of SQL-level support for TS
configurations, which are built on top of parsers and dictionaries.
We definitely need some level of capability for that.  For the
permissions and dependencies issues, the minimalistic approach is to
say "only superusers can create or alter TS configurations, and if you
alter one it's your responsibility to fix up any dependent tsvector
columns or indexes."  We currently handle index operator classes that
way, so it's not completely ridiculous.  Sure it would be nice to do
better, but maybe that's a post-8.3 project.

That gets us down to just needing to worry about whether we like the
SQL representation of configurations.  Which is still a nontrivial
issue, but at least it seems manageable on a timescale that's
reasonable for 8.3.
        regards, tom lane


Re: tsearch_core patch: permissions and security issues

From
"Joshua D. Drake"
Date:
Tom Lane wrote:
> Gregory Stark <stark@enterprisedb.com> writes:
>> Well assuming we have any SQL-level support at all I think we should
>> strive to avoid these functions taking INTERNAL arguments.

> That gets us down to just needing to worry about whether we like the
> SQL representation of configurations.  Which is still a nontrivial
> issue, but at least it seems manageable on a timescale that's
> reasonable for 8.3.

O.k. I am not trying to throw any cold water on this, but with the 
limitations we are suggesting, does the patch gain us anything over just 
leaving tsearch in contrib?

Sincerely,

Joshua D. Drake


> 
>             regards, tom lane
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 7: You can help support the PostgreSQL project by donating at
> 
>                 http://www.postgresql.org/about/donate
> 


-- 
      === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997             http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/



Re: tsearch_core patch: permissions and security issues

From
Tom Lane
Date:
"Joshua D. Drake" <jd@commandprompt.com> writes:
> O.k. I am not trying to throw any cold water on this, but with the 
> limitations we are suggesting, does the patch gain us anything over just 
> leaving tsearch in contrib?

Well, if you want to take a hard-nosed approach, no form of the patch
would gain us anything over leaving it in contrib, at least not from a
functionality standpoint.  The argument in favor has always been about
perception, really: if it's a "core" feature not an "add-on", then
people will take it more seriously.  And there's a rather weak
ease-of-use argument that you don't have to install a contrib module.
(The idea that it's targeted at people who can't or won't install a
contrib module is another reason why I think we can skip user-defined
parsers and dictionaries ...)
        regards, tom lane


Re: tsearch_core patch: permissions and security issues

From
Bruce Momjian
Date:
Joshua D. Drake wrote:
> Tom Lane wrote:
> > Gregory Stark <stark@enterprisedb.com> writes:
> >> Well assuming we have any SQL-level support at all I think we should
> >> strive to avoid these functions taking INTERNAL arguments.
> 
> > That gets us down to just needing to worry about whether we like the
> > SQL representation of configurations.  Which is still a nontrivial
> > issue, but at least it seems manageable on a timescale that's
> > reasonable for 8.3.
> 
> O.k. I am not trying to throw any cold water on this, but with the 
> limitations we are suggesting, does the patch gain us anything over just 
> leaving tsearch in contrib?

The idea is that common operations like searching and mapping dictionaries
will be easier to do, but the more complex stuff will require catalog
manipulations.

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: tsearch_core patch: permissions and security issues

From
"Joshua D. Drake"
Date:
Tom Lane wrote:
> "Joshua D. Drake" <jd@commandprompt.com> writes:
>> O.k. I am not trying to throw any cold water on this, but with the 
>> limitations we are suggesting, does the patch gain us anything over just 
>> leaving tsearch in contrib?
> 
> Well, if you want to take a hard-nosed approach, no form of the patch
> would gain us anything over leaving it in contrib, at least not from a
> functionality standpoint.  The argument in favor has always been about
> perception, really: if it's a "core" feature not an "add-on", then
> people will take it more seriously.  And there's a rather weak
> ease-of-use argument that you don't have to install a contrib module.
> (The idea that it's targeted at people who can't or won't install a
> contrib module is another reason why I think we can skip user-defined
> parsers and dictionaries ...)

Well my argument has always been the "core" feature argument. Perhaps I 
am missing some info here, but when I read what you wrote, I read that 
Tsearch will now be "harder" to work with. Not easier. :(

Removal of pg_dump support kind of hurts us, as we already have problems 
with pg_dump support and tsearch2. Adding work to have to re-assign 
permissions to vector columns because we make changes...

I would grant that having the SQL extensions would certainly be nice.

Anyway, I am not trying to stop the progress. I would like to see 
Tsearch2 in core but I also don't want to add complexity. You did say here:

And we reserve the right to whack around the API for the functions 
referenced by the catalog entries.

Which kind of gets us back to upgrade problems doesn't it?


Sincerely,

Joshua D. Drake


> 
>             regards, tom lane
> 


-- 
      === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997             http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/



Re: tsearch_core patch: permissions and security issues

From
Tom Lane
Date:
"Joshua D. Drake" <jd@commandprompt.com> writes:
> Well my argument has always been the "core" feature argument. Perhaps I 
> am missing some info here, but when I read what you wrote, I read that 
> Tsearch will now be "harder" to work with. Not easier. :(

Then you misread it.  What I was proposing was essentially that there
won't be any need for pg_dump support because everything's built-in
(at least as far as parsers/dictionaries go).

As for the permissions issues, that's just formalizing something that's
true today with the contrib module: if you change a configuration, it's
*your* problem whether that invalidates any table entries, the system
won't take care of it for you.
        regards, tom lane


Re: tsearch_core patch: permissions and security issues

From
"Joshua D. Drake"
Date:
Tom Lane wrote:
> "Joshua D. Drake" <jd@commandprompt.com> writes:
>> Well my argument has always been the "core" feature argument. Perhaps I 
>> am missing some info here, but when I read what you wrote, I read that 
>> Tsearch will now be "harder" to work with. Not easier. :(
> 
> Then you misread it.  What I was proposing was essentially that there
> won't be any need for pg_dump support because everything's built-in
> (at least as far as parsers/dictionaries go).
> 
> As for the permissions issues, that's just formalizing something that's
> true today with the contrib module: if you change a configuration, it's
> *your* problem whether that invalidates any table entries, the system
> won't take care of it for you.

O.k. :)

Joshua D. Drake


> 
>             regards, tom lane
> 


-- 
      === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997             http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/



Re: tsearch_core patch: permissions and security issues

From
Bruce Momjian
Date:
I an attempt to communicate what full text search does, and what
features we are thinking of adding/removing, I have put up the
introduction in HTML:
http://momjian.us/expire/fulltext/HTML/fulltext-intro.html

The links to the other sections don't work yet.

---------------------------------------------------------------------------

Tom Lane wrote:
> "Joshua D. Drake" <jd@commandprompt.com> writes:
> > Well my argument has always been the "core" feature argument. Perhaps I 
> > am missing some info here, but when I read what you wrote, I read that 
> > Tsearch will now be "harder" to work with. Not easier. :(
> 
> Then you misread it.  What I was proposing was essentially that there
> won't be any need for pg_dump support because everything's built-in
> (at least as far as parsers/dictionaries go).
> 
> As for the permissions issues, that's just formalizing something that's
> true today with the contrib module: if you change a configuration, it's
> *your* problem whether that invalidates any table entries, the system
> won't take care of it for you.
> 
>             regards, tom lane
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: tsearch_core patch: permissions and security issues

From
Teodor Sigaev
Date:
>> Well assuming we have any SQL-level support at all I think we should
>> strive to avoid these functions taking INTERNAL arguments.
> That gets us down to just needing to worry about whether we like the
> SQL representation of configurations.  Which is still a nontrivial
> issue, but at least it seems manageable on a timescale that's
> reasonable for 8.3.

Possible solution is to split pg_ts_dict (I'll talk about dictionaries, but the 
same way is possible to parsers, but now it's looked as overdesign) to two table 
like pg_am and pg_opclass.
First table, pg_ts_dict_template (I don't know the exact name yet) which 
contains columns: oid, template_name, dict_init, dict_lexize and second:
pg_ts_dict with colimns: oid, template_oid, owner, schema, dict_initoption.

CREATE/ALTER/DROP DICTIONARY affects only second table and access to first one 
is only select/update/insert/delete similar to pg_am.

IMHO, this interface solves problems with security and dumping.

The reason to save SQLish interface to dictionaries is a simplicity of 
configuration. Snowball's stemmers are useful as is, but ispell dictionary 
requires some configuration action before using.

Next, INTERNAL arguments parser's and dictionary's APIs are used because if 
performance reason. During creation of tsvector from text, there are a lot of 
calls of parsers and dictionaries. And internal structures of they states may be 
rather complex and cannot be matched in any pgsql's type, even in flat memory 
structure.

> Next, it took me a while to understand how Mapping objects fit into> the scheme at all, and now that (I think) I
understand,I'm wondering> why treat them as an independent concept.
 
ALTER FULLTEXT CONFIGURATION cfgname ADD MAPPING FOR tokentypename[, ...] WITH 
dictname1[, ...];
ALTER FULLTEXT CONFIGURATION cfgname ALTER MAPPING FOR tokentypename[, ...] WITH 
dictname1[, ...];
ALTER FULLTEXT CONFIGURATION cfgname ALTER MAPPING [FOR tokentypename[, ...]] REPLACE olddictname TO newdictname;
ALTER FULLTEXT CONFIGURATION cfgname DROP MAPPING [IF EXISTS]  FOR tokentypename;
Is it looking reasonable?
> TSConfiguration objects are a different story, since they have only> type-safe dependencies on parsers, locales, and
dictionaries. But they> still need some more thought about permissions, because AFAICS mucking> with a configuration
caninvalidate some other user's data.Do we want> to allow runtime changes in a configuration that existing tsvector>
columnsalready depend on?  How can we even recognize whether there is> stored data that will be affected by a
configurationchange?  (AFAICS
 

Very complex task: experienced users could use several configuration 
simultaneously. For example: indexing use configuration which doesn't reject 
stop-words, but for default searching use configuration which rejects 
stop-words. BTW, the same effects may be produced by dictionary's change.

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: tsearch_core patch: permissions and security issues

From
Teodor Sigaev
Date:
> can we hard-code into the backend, and just update for every major
> release like we do for encodings?

Sorry, no one of them :(. We know projects which introduce new parser, new 
dictionary. Config and map are changes very often.

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: tsearch_core patch: permissions and security issues

From
Teodor Sigaev
Date:
>> But they still need some more thought about permissions, because AFAICS
>> mucking with a configuration can invalidate some other user's data.
> 
> ouch. could mucking with a configuration create a corrupt index?

Depending on what you mean 'corrupted'. It will not corrupted as non-readable 
or cause backend crash. But usage of such tsvector column could be limited - not 
all words will be searchable.

> This sounds sort of analogous to the issues collation bring up.

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: tsearch_core patch: permissions and security issues

From
Tom Lane
Date:
Teodor Sigaev <teodor@sigaev.ru> writes:
> The reason to save SQLish interface to dictionaries is a simplicity of 
> configuration. Snowball's stemmers are useful as is, but ispell dictionary 
> requires some configuration action before using.

Yeah.  I had been wondering about moving the dict_initoption over to the
configuration entry --- is that sane at all?  It would mean that
dict_init functions would have to guard themselves against invalid
options, but they probably ought to do that anyway.  If we did that,
I think we could have a fixed set of dictionaries without too much
problem, and focus on just configurations as being user-alterable.

>>> Next, it took me a while to understand how Mapping objects fit into
>>> the scheme at all, and now that (I think) I understand, I'm wondering
>>> why treat them as an independent concept.

> ALTER FULLTEXT CONFIGURATION cfgname ADD MAPPING FOR tokentypename[, ...] WITH 
> dictname1[, ...];
> ALTER FULLTEXT CONFIGURATION cfgname ALTER MAPPING FOR tokentypename[, ...] WITH 
> dictname1[, ...];
> ALTER FULLTEXT CONFIGURATION cfgname ALTER MAPPING [FOR tokentypename[, ...]]
>   REPLACE olddictname TO newdictname;
> ALTER FULLTEXT CONFIGURATION cfgname DROP MAPPING [IF EXISTS]  FOR tokentypename;
> Is it looking reasonable?

Er ... what's the difference between the second and third forms?
        regards, tom lane


Re: tsearch_core patch: permissions and security issues

From
Gregory Stark
Date:
"Teodor Sigaev" <teodor@sigaev.ru> writes:

>>> But they still need some more thought about permissions, because AFAICS
>>> mucking with a configuration can invalidate some other user's data.
>>
>> ouch. could mucking with a configuration create a corrupt index?
>
> Depending on what you mean 'corrupted'. It will not corrupted as non-readable
> or cause backend crash. But usage of such tsvector column could be limited -
> not all words will be searchable.

Am I correct to think of this like changing collations leaving your btree
index "corrupt"? In that case it probably won't cause any backend crash either
but you will get incorrect results. For example, returning different results
depending on whether the index or a full table scan is used.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com



Re: tsearch_core patch: permissions and security issues

From
Teodor Sigaev
Date:

Tom Lane wrote:
> Teodor Sigaev <teodor@sigaev.ru> writes:
>> The reason to save SQLish interface to dictionaries is a simplicity of 
>> configuration. Snowball's stemmers are useful as is, but ispell dictionary 
>> requires some configuration action before using.
> 
> Yeah.  I had been wondering about moving the dict_initoption over to the
> configuration entry --- is that sane at all?  It would mean that
It should be. Instances of ispell (and synonym, thesaurus) dictionaries are 
different only in dict_initoption part, so it will be only one entry in 
pg_ts_dict_template and several ones in pg_ts_dict.

>> ALTER FULLTEXT CONFIGURATION cfgname ADD MAPPING FOR tokentypename[, ...] WITH 
>> dictname1[, ...];
>> ALTER FULLTEXT CONFIGURATION cfgname ALTER MAPPING FOR tokentypename[, ...] WITH 
>> dictname1[, ...];
sets dictionary's list for token's type(s)

>> ALTER FULLTEXT CONFIGURATION cfgname ALTER MAPPING [FOR tokentypename[, ...]]
>>   REPLACE olddictname TO newdictname;

Replace dictionary to another dictionary in dictionary's list for token's 
type(s). This command is very useful for tweaking configuration and for creating 
new configuration which differs from already existing one only by pair of 
dictionary.

>> ALTER FULLTEXT CONFIGURATION cfgname DROP MAPPING [IF EXISTS]  FOR tokentypename;
>> Is it looking reasonable?
> 
> Er ... what's the difference between the second and third forms?

That changes are doable for several days. I'd like to make changes together with  replacing of FULLTEXT keyword to TEXT
SEARCHas you suggested.
 

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: tsearch_core patch: permissions and security issues

From
Oleg Bartunov
Date:
On Thu, 14 Jun 2007, Tom Lane wrote:

> Teodor Sigaev <teodor@sigaev.ru> writes:
>> The reason to save SQLish interface to dictionaries is a simplicity of
>> configuration. Snowball's stemmers are useful as is, but ispell dictionary
>> requires some configuration action before using.
>
> Yeah.  I had been wondering about moving the dict_initoption over to the
> configuration entry --- is that sane at all?  It would mean that
> dict_init functions would have to guard themselves against invalid
> options, but they probably ought to do that anyway.  If we did that,
> I think we could have a fixed set of dictionaries without too much
> problem, and focus on just configurations as being user-alterable.

currently, all dictionaries we provide are all template dictionaries,
so users could change only parameters.

But, there are reasons to allow users register new templates and in fact we 
know people/projects with application-dependent dictionaries. 
How they could dump/reload their dictionaries ?
    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


Re: tsearch_core patch: permissions and security issues

From
Teodor Sigaev
Date:
> But, there are reasons to allow users register new templates and in fact 
> we know people/projects with application-dependent dictionaries. How 
> they could dump/reload their dictionaries ?
The same way as pg_am does.

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: tsearch_core patch: permissions and security issues

From
Oleg Bartunov
Date:
On Thu, 14 Jun 2007, Gregory Stark wrote:

> "Teodor Sigaev" <teodor@sigaev.ru> writes:
>
>>>> But they still need some more thought about permissions, because AFAICS
>>>> mucking with a configuration can invalidate some other user's data.
>>>
>>> ouch. could mucking with a configuration create a corrupt index?
>>
>> Depending on what you mean 'corrupted'. It will not corrupted as non-readable
>> or cause backend crash. But usage of such tsvector column could be limited -
>> not all words will be searchable.
>
> Am I correct to think of this like changing collations leaving your btree
> index "corrupt"? In that case it probably won't cause any backend crash either
> but you will get incorrect results. For example, returning different results
> depending on whether the index or a full table scan is used.

You're correct. But we can't defend users from all possible errors. 
Other side, that we need somehow to help user to identify what fts 
configuration was used to produce tsvector. For example, comment on
tsvector column would be useful, but we don't know how to do this
automatically.

    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


Re: tsearch_core patch: permissions and security issues

From
Teodor Sigaev
Date:
> Am I correct to think of this like changing collations leaving your btree
> index "corrupt"? In that case it probably won't cause any backend crash either
> but you will get incorrect results. For example, returning different results
> depending on whether the index or a full table scan is used.

Without exotic cases, maximum disaster may be that queries with some words will 
return more or less results than should be. Because of wrong stemming or wrong 
match of stop-word or wrong mapping.

By default, configuration is useful for most users and works for danish, dutch, 
finnish, french, german, hungarian, italian, norwegian, portuguese, spanish, 
swedish, russin and english languages.



-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: tsearch_core patch: permissions and security issues

From
Tom Lane
Date:
Oleg Bartunov <oleg@sai.msu.su> writes:
> You're correct. But we can't defend users from all possible errors. 
> Other side, that we need somehow to help user to identify what fts 
> configuration was used to produce tsvector. For example, comment on
> tsvector column would be useful, but we don't know how to do this
> automatically.

Yeah, I was wondering about that too.  The only way we could relax the
superuser, you-better-know-what-you're-doing restriction on changing
configurations would be if we had a way to identify which tsvector
columns needed to be updated.  Right now that's pretty hard to find out
because the references to configurations are buried in the bodies of
trigger functions.  That whole trigger-function business is not the
nicest part of tsearch2, either ... it'd be better if we could automate
tsvector maintenance more.

One thing I was thinking about is that rather than storing a physical
tsvector column, people might index a "virtual" column using functional
indexes:
create index ... (to_tsvector('english', big_text_col))

which could be queried
select ... where to_tsvector('english', big_text_col) @@ tsquery

Assuming that the index is lossy, the index condition would have to be
rechecked, so to_tsvector() would have to be recomputed, but only at the
rows identified as candidate matches by the index.  The I/O savings from
eliminating the heap's tsvector column might counterbalance the extra
CPU for recomputing tsvectors.  Or not, but in any case this is
attractive because it doesn't need any handmade maintenance support like
a trigger --- the regular index maintenance code does it all.

It strikes me that we could play the same kind of game we played to make
nextval() references to sequences be recognized as dependencies on
sequences.  Invent a "regconfig" OID type that's just like regclass
except it handles OIDs of ts_config entries instead of pg_class entries,
and make the first argument of to_tsvector be one of those:
create index ... (to_tsvector('english'::regconfig, big_text_col))

Now dependency.c can be taught to recognize the regconfig Const as
depending on the referenced ts_config entry, and voila we have a
pg_depend entry showing that the index depends on the configuration.
What we actually do about it is another question, but this at least
gets the knowledge into the system.

[ thinks some more... ]  If we revived the GENERATED AS patch,
you could imagine computing tsvector columns via "GENERATED AS
to_tsvector('english'::regconfig, big_text_col)" instead of a
trigger, and then again you've got the dependency exposed where
the system can see it.  I don't wanna try to do that for 8.3,
but it might be a good path to pursue in future, instead of assuming
that triggers will be the way forevermore.

Thoughts?
        regards, tom lane


Re: tsearch_core patch: permissions and security issues

From
Michael Paesold
Date:
Bruce Momjian wrote:
> I an attempt to communicate what full text search does, and what
> features we are thinking of adding/removing, I have put up the
> introduction in HTML:
> 
>     http://momjian.us/expire/fulltext/HTML/fulltext-intro.html
> 

Very good idea, Bruce!

After reading the discussion and the introduction, here is what I think 
tsearch in core should at least accomplish in 8.3. Please bear in mind, 
that (a) I am talking from a user perspective (there might be technical 
arguments against my thoughts) and (b) I have no hands-on experience 
with Tsearch2 yet, so more experienced users might have different needs.
 - Basic full text search usable for non-superusers - Out-of-the-box working configuration for as many languages as
reasonable(Teodor named quite a number of languages working as-is,   so this is really an improvement over contrib,
great!)- No foot-guns accessible to non-superuser - Agreement on function names, perhaps some should be changed. For
instanceto_tsquery() and plainto_tsquery() seem rather unintuitive   because they don't have a common prefix, and they
arenot consistent   about using underscores. Perhaps to_tsquery() and to_tsquery_plain()? - Future compatibility for
allfeatures available to non-superusers - Stop words in tables, not in external files. - At least for superusers, all
featuresavailable in contrib now,   should be available, too (don't know about pg_dump).
 

What I don't really like is the number of commands introduced without 
any strong reference to full text search. E.g. CREATE CONFIGURATION 
gives no hint at all that this is about full text search. IMHO there are 
more configurations than just full text ones. :-) So perhaps better 
spell this CREATE FULLTEXT CONFIGURATION etc.? (Think about tab 
completion in psql, for instance.)

I guess this is in line with what Tom said about mapping objects and 
CREATE ATTRIBUTE vs. CREATE/ALTER CONFIGURATION.
(http://archives.postgresql.org/pgsql-hackers/2007-06/msg00522.php)

After all, I would really welcome having full text search capabilities 
in core.

Best Regards
Michael Paesold



Re: tsearch_core patch: permissions and security issues

From
Tom Lane
Date:
Teodor Sigaev <teodor@sigaev.ru> writes:
> Tom Lane wrote:
>> Teodor Sigaev <teodor@sigaev.ru> writes:
>>> The reason to save SQLish interface to dictionaries is a simplicity of 
>>> configuration. Snowball's stemmers are useful as is, but ispell dictionary 
>>> requires some configuration action before using.
>> 
>> Yeah.  I had been wondering about moving the dict_initoption over to the
>> configuration entry --- is that sane at all?  It would mean that

> It should be. Instances of ispell (and synonym, thesaurus) dictionaries are 
> different only in dict_initoption part, so it will be only one entry in 
> pg_ts_dict_template and several ones in pg_ts_dict.

No, I was thinking of still having just one pg_ts_dict catalog (no template)
but removing its dictinit field.  Instead, the init strings would be
stored with configuration mapping entries.

This would mean having to remember to provide the right option along
with the dictionary name when doing ALTER CONFIGURATION ADD MAPPING.
Not sure if that would be harder or easier to use than what you're
thinking of.
        regards, tom lane


Re: tsearch_core patch: permissions and security issues

From
Tom Lane
Date:
Michael Paesold <mpaesold@gmx.at> writes:
> After reading the discussion and the introduction, here is what I think 
> tsearch in core should at least accomplish in 8.3.
> ...
>   - Stop words in tables, not in external files.

I realized that there's a pretty serious problem with doing that, which
is encoding.  We don't have any way to deal with preloaded catalog data
that exceeds 7-bit-ASCII, because when you do CREATE DATABASE ... ENCODING
it's going to be copied over exactly as-is.  And there's plenty of
not-ASCII stuff in the non-English stopword files.  This is something we
need to solve eventually, but I think it ties into the whole multiple
locale can-of-worms; there's no way we're getting it done for 8.3.

So I'm afraid we have to settle for stop words in external files for the
moment.  I do have two suggestions though:

* Let's have just one stopword file for each language, with the
convention that the file is stored in UTF8 no matter what language
you're talking about.  We can have the stopword reading code convert
to the database encoding on-the-fly when it reads the file.  Without
this there's just a whole bunch of foot-guns there.  We'd at least need
to have encoding verification checks when reading the files, which seems
hardly cheaper than just translating the data.

* Let's fix it so the reference to the stoplist in the user-visible
options is just a name, with no path or anything like that.  (Similar
to the handling of timezone_abbreviations.)  Then it will be feasible
to re-interpret the option as a reference to a named list in a catalog
someday, when we solve the encoding problem.  Right now the patch has
things like

+ DATA(insert OID = 5140 (  "ru_stem_koi8" PGNSP PGUID 5135 5137 "dicts_data/russian.stop.koi8"));

which is really binding the option pretty tightly to being a filename;
not to mention the large security risks involved in letting anyone but
a superuser have control of such an option.

> What I don't really like is the number of commands introduced without 
> any strong reference to full text search. E.g. CREATE CONFIGURATION 
> gives no hint at all that this is about full text search.

Yeah.  We had some off-list discussion about this and concluded that
TEXT SEARCH seemed to be the right phrase to use in the command names.
That hasn't gotten reflected into the patch yet.
        regards, tom lane


Re: tsearch_core patch: permissions and security issues

From
Gregory Stark
Date:
"Oleg Bartunov" <oleg@sai.msu.su> writes:

> On Thu, 14 Jun 2007, Gregory Stark wrote:
>
>> Am I correct to think of this like changing collations leaving your btree
>> index "corrupt"? In that case it probably won't cause any backend crash either
>> but you will get incorrect results. For example, returning different results
>> depending on whether the index or a full table scan is used.
>
> You're correct. But we can't defend users from all possible errors. 

Sure, but it seems like a the line, at least in existing cases, is that if you
fiddle with catalogs directly then you should know what consequences you need
to be careful of.

But when if you make changes through a supported, documented interface then
the system will protect you from breaking things. 

Hm, I went to construct an example and accidentally found a precedent for not
necessarily protecting users from themselves in every case:


postgres=# create table x (i integer);
CREATE TABLE
postgres=# create function f(integer) returns integer as 'select $1' immutable strict language sql;
CREATE FUNCTION
postgres=# select f(1);f 
---1
(1 row)

postgres=# create index xi on x (f(i));
CREATE INDEX
postgres=# insert into x values (1);
INSERT 0 1
postgres=# insert into x values (2);
INSERT 0 1
postgres=# create or replace function f(integer) returns integer as 'select -$1' immutable strict language sql;
CREATE FUNCTION


Uhm. Oops! And yes, the resulting index is, of course, corrupt:



postgres=# insert into x (select random() from generate_series(1,2000));
INSERT 0 2000
postgres=# select count(*) from x where f(i) = -1;count 
-------    0
(1 row)
postgres=# set enable_bitmapscan = off;
SET
postgres=# set enable_indexscan = off;
SET
postgres=#  select count(*) from x where f(i) = -1;count 
------- 1003
(1 row)

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com



Re: tsearch_core patch: permissions and security issues

From
Teodor Sigaev
Date:
>> It should be. Instances of ispell (and synonym, thesaurus) dictionaries are 
>> different only in dict_initoption part, so it will be only one entry in 
>> pg_ts_dict_template and several ones in pg_ts_dict.
> 
> No, I was thinking of still having just one pg_ts_dict catalog (no template)
> but removing its dictinit field.  Instead, the init strings would be
> stored with configuration mapping entries.
> 
> This would mean having to remember to provide the right option along
> with the dictionary name when doing ALTER CONFIGURATION ADD MAPPING.
> Not sure if that would be harder or easier to use than what you're
> thinking of.

Hmm. Dictionary may present in several lists of dictionaries in one 
configuration. Suppose, it isn't practical to store dictinitoption 
several times. In other hand, the same dictionary (template) with 
different init option may present on configuration too. Typical example 
is configuration for russian language:
lword, lpword tokens have dictionary's list {ispell_en, stem_en}
nlword, nlpword tokens have dictionary's list {ispell_ru, stem_ru}

stem_(ru|en) is a Snowball's stemmer, but ispell_(ru|en) is a ispell 
dictionary (template) with different dictinitoption. Next, 
configurations may share dictionaries.

And, init option may be rather big.



How does the tsearch configuration get selected?

From
Bruce Momjian
Date:
I am confused by the CREATE FULLTEXT CONFIGURATION command:
http://momjian.us/expire/fulltext/SGML/ref/create-fulltext-config.sgml

First, why are we specifying the server locale here since it never
changes:
  <varlistentry>   <term><literal>LOCALE</literal></term>   <listitem>    <para>   <replaceable
class="PARAMETER">localename</replaceable>   is the name of the locale. It should match server's locale
(<varname>lc_ctype</varname>)   to identify full-text configuration used by default.    </para>   </listitem>
</varlistentry>

Second, I can't figure out how to reference a non-default
configuration.  The description says:
  <varlistentry>   <term><LITERAL>AS DEFAULT</LITERAL></term>   <listitem>    <para>     Set <literal>default</literal>
flagfor the configuration, which     used to identify if this configuration is selectable on default     (see
<LITERAL>LOCALE</LITERAL>description above).     It is possible to have <emphasis>maximum one</emphasis> configuration
  with the same locale and in the same schema with this flag enabled.    </para>   </listitem>  </varlistentry>
 

The documentation says that the first fulltext configuration found in
the search patch is the one used, so how does a secondary configuration
in the same schema actually get accessed by @@ or ::tsquery?  Do you
have to use to_tsquery() with the optional configuration name?

Is this really the direction we want to go, having a default that gets
picked up from the search_path, perhaps based on some encoding/locale
match I can't figure out, or do we want to require the configuration to
be specified always, and if we do that, how do we handle the @@
operator?

I am thinking we should just have use the first fulltext configuration
from the first schema in the search path and eliminate naming the
configurations (same as schema name?).  Allowing configuration names to
be specified only sometimes is confusing.  Or we can use a GUC to name
the configuration we want to use specifically, rather than have a
read-only tsearch_conf_name (is it read-only?) that is controlled by the
search_path.

And why are we talking locale here instead of encoding?  And since we
only have one encoding per database, how can there be more than one?  Is
this _client_ encoding?

FYI, while the configuration selection needs work, the rest of the areas
seem logical.

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: How does the tsearch configuration get selected?

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> First, why are we specifying the server locale here since it never
> changes:

It's poorly described.  What it should really say is the language
that the text-to-be-searched is in.  We can actually support multiple
languages here today, the restriction being that there have to be
stemmer instances for the languages with the database encoding you're
using.  With UTF8 encoding this isn't much of a restriction.  We do need
to put code into the dictionary stuff to enforce that you can't use a
stemmer when the database encoding isn't compatible with it.

I would prefer that we not drive any of this stuff off the server's
LC_xxx settings, since as you say that restricts things to just one
locale.

> Second, I can't figure out how to reference a non-default
> configuration.

See the multi-argument versions of to_tsvector etc.

I do see a problem with having to_tsvector(config, text) plus
to_tsvector(text) where the latter implicitly references a config
selected by a GUC variable: how can you tell whether a query using the
latter matches a particular index using the former?  There isn't
anything in the current planner mechanisms that would make that work.
        regards, tom lane


Re: How does the tsearch configuration get selected?

From
Oleg Bartunov
Date:
On Thu, 14 Jun 2007, Tom Lane wrote:

> Bruce Momjian <bruce@momjian.us> writes:
>> First, why are we specifying the server locale here since it never
>> changes:

server's locale is used just for one purpose - to select what text search 
configuration to use by default. Any text search functions can accept
text search configuration as an optional parameter.

>
> It's poorly described.  What it should really say is the language
> that the text-to-be-searched is in.  We can actually support multiple
> languages here today, the restriction being that there have to be
> stemmer instances for the languages with the database encoding you're
> using.  With UTF8 encoding this isn't much of a restriction.  We do need
> to put code into the dictionary stuff to enforce that you can't use a
> stemmer when the database encoding isn't compatible with it.
>
> I would prefer that we not drive any of this stuff off the server's
> LC_xxx settings, since as you say that restricts things to just one
> locale.

something like 
CREATE TEXT SEARCH DICTIONARY dictname [LOCALE=ru_RU.UTF-8]
and raise warning/error if database encoding doesn't match dictionary 
encoding if specified (not all dictionaries depend on encoding, so it
should be an optional parameter).

>
>> Second, I can't figure out how to reference a non-default
>> configuration.
>
> See the multi-argument versions of to_tsvector etc.
>
> I do see a problem with having to_tsvector(config, text) plus
> to_tsvector(text) where the latter implicitly references a config
> selected by a GUC variable: how can you tell whether a query using the
> latter matches a particular index using the former?  There isn't
> anything in the current planner mechanisms that would make that work.

Probably, having default text search configuration is not a good idea
and we could just require it as a mandatory parameter, which could
eliminate many confusion with selecting text search configuration.

    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


Re: tsearch_core patch: permissions and security issues

From
Oleg Bartunov
Date:
On Thu, 14 Jun 2007, Tom Lane wrote:

> Oleg Bartunov <oleg@sai.msu.su> writes:
>> You're correct. But we can't defend users from all possible errors.
>> Other side, that we need somehow to help user to identify what fts
>> configuration was used to produce tsvector. For example, comment on
>> tsvector column would be useful, but we don't know how to do this
>> automatically.
>
> Yeah, I was wondering about that too.  The only way we could relax the
> superuser, you-better-know-what-you're-doing restriction on changing
> configurations would be if we had a way to identify which tsvector
> columns needed to be updated.  Right now that's pretty hard to find out
> because the references to configurations are buried in the bodies of
> trigger functions.  That whole trigger-function business is not the
> nicest part of tsearch2, either ... it'd be better if we could automate
> tsvector maintenance more.

yes, trigger function is a complex stuff, our tsearch() trigger is an 
example of automated stuff. It could be written very easy on plpgsql,
for example.

=# create function my_update() returns trigger as 
$$
BEGIN   NEW.fts=   setweight( to_tsvector('english',NEW.t1),'A') || ' ' ||   setweight(
to_tsvector('english',NEW.t2),'B');RETURN NEW;
 
END;
$$ 
language plpgsql;

>
> One thing I was thinking about is that rather than storing a physical
> tsvector column, people might index a "virtual" column using functional
> indexes:
>
>     create index ... (to_tsvector('english', big_text_col))
>
> which could be queried
>
>     select ... where to_tsvector('english', big_text_col) @@ tsquery


this is already possible for gin index

create index gin_text_idx on test using gin (
( coalesce(to_tsvector(title),'') || coalesce(to_tsvector(body),'') )
);


apod=# select title from test where 
(coalesce(to_tsvector(title),'') || coalesce(to_tsvector(body),'') ) @@ 
to_tsquery('supernovae') order by sdate desc limit 10;


>
> Assuming that the index is lossy, the index condition would have to be
> rechecked, so to_tsvector() would have to be recomputed, but only at the
> rows identified as candidate matches by the index.  The I/O savings from
> eliminating the heap's tsvector column might counterbalance the extra
> CPU for recomputing tsvectors.  Or not, but in any case this is
> attractive because it doesn't need any handmade maintenance support like
> a trigger --- the regular index maintenance code does it all.

I'm afraid it wouldn't work for all cases. We already have headline() function 
which had to reparse document to produce text snippet and it's very slow
and eats most select time. 
ALso, trigger stuff is a normal machinery for databases.

>
> It strikes me that we could play the same kind of game we played to make
> nextval() references to sequences be recognized as dependencies on
> sequences.  Invent a "regconfig" OID type that's just like regclass
> except it handles OIDs of ts_config entries instead of pg_class entries,
> and make the first argument of to_tsvector be one of those:
>
>     create index ... (to_tsvector('english'::regconfig, big_text_col))
>
> Now dependency.c can be taught to recognize the regconfig Const as
> depending on the referenced ts_config entry, and voila we have a
> pg_depend entry showing that the index depends on the configuration.
> What we actually do about it is another question, but this at least
> gets the knowledge into the system.
>

interesting. And \di could display all configuration stuff for text search
indexes ?


> [ thinks some more... ]  If we revived the GENERATED AS patch,
> you could imagine computing tsvector columns via "GENERATED AS
> to_tsvector('english'::regconfig, big_text_col)" instead of a
> trigger, and then again you've got the dependency exposed where
> the system can see it.  I don't wanna try to do that for 8.3,
> but it might be a good path to pursue in future, instead of assuming
> that triggers will be the way forevermore.
>
> Thoughts?

No way with standard. GENERATED AS says that "all columns references in an 
expression associated with a generated column must be to columns of the base 
table containing that generated column."

tsvector could be result of rather complex select involving several tables.


    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


Re: tsearch_core patch: permissions and security issues

From
"Simon Riggs"
Date:
On Wed, 2007-06-13 at 18:06 -0400, Bruce Momjian wrote:
> You bring up a very good point.  There are fifteen new commands being
> added for full text indexing:
> 
>     alter-fulltext-config.sgml      alter-fulltext-owner.sgml
>     create-fulltext-dict.sgml       drop-fulltext-dict.sgml
>     alter-fulltext-dict.sgml        alter-fulltext-parser.sgml
>     create-fulltext-map.sgml        drop-fulltext-map.sgml
>     alter-fulltext-dictset.sgml     comment-fulltext.sgml
>     create-fulltext-parser.sgml     drop-fulltext-parser.sgml
>     alter-fulltext-map.sgml         create-fulltext-config.sgml
>     drop-fulltext-config.sgml

Although I'm happy to see tsearch finally hit the big time, I'm a bit
disappointed to see so many new datatype-specific SQL commands created.
That sets a bad precedent for other datatype authors, as well as all
those people that want to invent new things like SKYLINE etc..

Whatever the reasons for the new commands, those reasons must also be
potentially shared by authors of other datatypes too. Is there a way to
genericise these commands so that we can offer those same benefits to
all datatypes, rather than setting a double standard? Or do we think
that when a Geodetic datatype tries to come into core we would have
commands like CREATE COORDINATE TRANSFORM?

Can we consider CREATE TYPE CONFIGURATION with subsets such as...
CREATE TYPE CONFIGURATION name USING FULLTEXT (map)
CREATE TYPE CONFIGURATION name USING FULLTEXT (dictionary)
CREATE TYPE CONFIGURATION name USING FULLTEXT (parser)

Your choice of syntax may vary, but my point is about creating a
mechanism by which any datatype author can reference complex
configuration details. We managed to do this for INDEXes and OPERATORs,
so it seems a shame to go for the full 15 new commands when we could do
the same thing with much fewer commands and ones that could then be
utilised by others, e.g. PostGIS.

Last minute change true, but only parser+docs changes suggested. I want
the full tsearch2 functionality as much as anyone.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




Re: How does the tsearch configuration get selected?

From
Teodor Sigaev
Date:
> Probably, having default text search configuration is not a good idea
> and we could just require it as a mandatory parameter, which could
> eliminate many confusion with selecting text search configuration.
Ugh. Having default configuration (by locale or by postgresql.conf or some other 
way) simplifies life a lot in most cases.

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: How does the tsearch configuration get selected?

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > First, why are we specifying the server locale here since it never
> > changes:
> 
> It's poorly described.  What it should really say is the language
> that the text-to-be-searched is in.  We can actually support multiple
> languages here today, the restriction being that there have to be
> stemmer instances for the languages with the database encoding you're
> using.  With UTF8 encoding this isn't much of a restriction.  We do need
> to put code into the dictionary stuff to enforce that you can't use a
> stemmer when the database encoding isn't compatible with it.
> 
> I would prefer that we not drive any of this stuff off the server's
> LC_xxx settings, since as you say that restricts things to just one
> locale.

The idea they had was to set the _default_ full text configuration to
match the locale, e.g.UTF8.en_US.  This works well for cases where we
ship a number of pre-installed full text configurations in pg_catalog.
But of course you can support multiple languages with that
encoding/locale, so you have to have the ability to do other languages,
but not necessarily by default.

> > Second, I can't figure out how to reference a non-default
> > configuration.
> 
> See the multi-argument versions of to_tsvector etc.
> 
> I do see a problem with having to_tsvector(config, text) plus
> to_tsvector(text) where the latter implicitly references a config
> selected by a GUC variable: how can you tell whether a query using the
> latter matches a particular index using the former?  There isn't
> anything in the current planner mechanisms that would make that work.

Well, now that I have gotten feedback, we have a few options:

1)  Require the configuration to be always specified.  The problem with
this is that casting (::tsquery) and operators (@@) have no way to
specify a configuration.

2)  Use a GUC that you can set for the configuration, and perhaps
default it if possible to match the locale.  Is the default affected by
search_path (ouch)?

How do we make sure that any index that is accessed is using the same
configuration that is being used by the query, e.g. ::tsquery?  Do we
have to store the configuration name in the index and somehow throw an
error if it doesn't match?  What about changes to the configuration
after the index has been created, e.g. new stop words or dictionaries?

The two big open issues are whether we allow a default configuration,
and whether we require the configuration name to be always specified.

My guess right now is that we use a GUC that will default if a
pg_catalog configuration name matches the lc_ctype locale name, and we
have to throw an error if an accessed index creation GUC doesn't match
the current GUC.

So we create a pg_catalog full text configuration named UTF8.en-US, and
some others like ru_RU.UTF-8.

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: How does the tsearch configuration get selected?

From
Bruce Momjian
Date:
Bruce Momjian wrote:
> My guess right now is that we use a GUC that will default if a
> pg_catalog configuration name matches the lc_ctype locale name, and we
> have to throw an error if an accessed index creation GUC doesn't match
> the current GUC.
> 
> So we create a pg_catalog full text configuration named UTF8.en-US, and
> some others like ru_RU.UTF-8.

Do locale names vary across operating systems?  If so, we might as well
skip trying to find a default.

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: tsearch_core patch: permissions and security issues

From
Tom Lane
Date:
"Simon Riggs" <simon@2ndquadrant.com> writes:
> Although I'm happy to see tsearch finally hit the big time, I'm a bit
> disappointed to see so many new datatype-specific SQL commands created.

Per subsequent discussion we are down to just one new set of commands,
CREATE/ALTER/DROP TEXT SEARCH CONFIGURATION, so it's not as big a
footprint as it was to start with.

> Can we consider CREATE TYPE CONFIGURATION with subsets such as...
> CREATE TYPE CONFIGURATION name USING FULLTEXT (map)
> CREATE TYPE CONFIGURATION name USING FULLTEXT (dictionary)
> CREATE TYPE CONFIGURATION name USING FULLTEXT (parser)

This seems entirely cosmetic, unless you have some proposal for allowing
a uniform underlying implementation not only syntax.  In the absence of
some concrete cases to consider, I don't see how we could imagine that
we know how to implement a useful generic approach.

I have been thinking that it would be smart to try to use the generic
"definition list" syntax, like CREATE OPERATOR and CREATE AGGREGATE.
But the motivation for that is just to avoid defining more keywords
(which has an overall impact on parser size and performance).  It's
not really going to do anything for us in terms of having an
implementation that can be shared with anything else.
        regards, tom lane


Re: How does the tsearch configuration get selected?

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> Do locale names vary across operating systems?

Yes, which is the fatal flaw in the whole thing.  The ru_RU part is
reasonably well standardized, but the encoding part is not.  Considering
that encoding is exactly the part of it we don't care about for this
purpose (because we should look to the database encoding instead),
I think it's just going to make life harder not easier to model search
language names on locales.

I'd suggest allowing either full names ("swedish") or the standard
two-letter abbreviations ("sv").  But let's stay away from locale names.
        regards, tom lane


Re: How does the tsearch configuration get selected?

From
Teodor Sigaev
Date:
> 1)  Require the configuration to be always specified.  The problem with
> this is that casting (::tsquery) and operators (@@) have no way to
> specify a configuration.
it's not comfortable for most often cases

> 
> 2)  Use a GUC that you can set for the configuration, and perhaps
> default it if possible to match the locale.  Is the default affected by
> search_path (ouch)?
Right now it works so

> 
> How do we make sure that any index that is accessed is using the same
> configuration that is being used by the query, e.g. ::tsquery?  Do we
> have to store the configuration name in the index and somehow throw an
> error if it doesn't match?  What about changes to the configuration
> after the index has been created, e.g. new stop words or dictionaries?
That's possible intentional case, so we should not throw ERROR!


> 
> The two big open issues are whether we allow a default configuration,
> and whether we require the configuration name to be always specified.
> 
> My guess right now is that we use a GUC that will default if a
> pg_catalog configuration name matches the lc_ctype locale name, and we
> have to throw an error if an accessed index creation GUC doesn't match
> the current GUC.

Where will index store index creation GUC?
> 
> So we create a pg_catalog full text configuration named UTF8.en-US, and
> some others like ru_RU.UTF-8.
> 

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: How does the tsearch configuration get selected?

From
Teodor Sigaev
Date:
> I'd suggest allowing either full names ("swedish") or the standard
> two-letter abbreviations ("sv").  But let's stay away from locale names.
We can use database's encoding name (the same names used in initdb -E)


-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: How does the tsearch configuration get selected?

From
Tom Lane
Date:
Teodor Sigaev <teodor@sigaev.ru> writes:
>> My guess right now is that we use a GUC that will default if a
>> pg_catalog configuration name matches the lc_ctype locale name, and we
>> have to throw an error if an accessed index creation GUC doesn't match
>> the current GUC.

> Where will index store index creation GUC?

It's not really the index's problem; IIUC the behavior of the gist and
gin index opclasses is not locale-specific.  It's the to_tsvector calls
that built the tsvector heap column that have a locale specified or
implicit.  We need some way of annotating the heap column about this.

In the case of a functional index you can expose the locale:
create index ... (to_tsvector('english'::regconfig, mytextcol))

but there's still the problem that the planner cannot match that to
a query specified as just WHERE to_tsvector(mytextcol) @@ query.
        regards, tom lane


Re: How does the tsearch configuration get selected?

From
Tom Lane
Date:
Teodor Sigaev <teodor@sigaev.ru> writes:
>> I'd suggest allowing either full names ("swedish") or the standard
>> two-letter abbreviations ("sv").  But let's stay away from locale names.

> We can use database's encoding name (the same names used in initdb -E)

AFAICS the encoding name shouldn't be anywhere near this.

The only reason the TS stuff needs an encoding spec is to figure out how
to read an external stop word file.  I think my suggestion upthread is a
lot better: have just one stop word file per language, store them all in
UTF8, and convert to database encoding when loading them.  The database
encoding is implicit and doesn't need to be mentioned anywhere in the TS
configuration.
        regards, tom lane


Re: How does the tsearch configuration get selected?

From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> It's not really the index's problem; IIUC the behavior of the gist and
> gin index opclasses is not locale-specific.  It's the to_tsvector calls
> that built the tsvector heap column that have a locale specified or
> implicit.  We need some way of annotating the heap column about this.
>
> In the case of a functional index you can expose the locale:
>
>     create index ... (to_tsvector('english'::regconfig, mytextcol))

Maybe there should be a different type for each locale.

I'm not exactly following this thread so I'm not entirely sure whether that
would actually fit well but it's just a thought I had.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com



Re: How does the tsearch configuration get selected?

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>> It's not really the index's problem; IIUC the behavior of the gist and
>> gin index opclasses is not locale-specific.  It's the to_tsvector calls
>> that built the tsvector heap column that have a locale specified or
>> implicit.  We need some way of annotating the heap column about this.

> Maybe there should be a different type for each locale.

I had been idly wondering if we could do anything with using tsvector's
typmod for the purpose ...
        regards, tom lane


Re: tsearch_core patch: permissions and security issues

From
Robert Treat
Date:
On Friday 15 June 2007 00:46, Oleg Bartunov wrote:
> On Thu, 14 Jun 2007, Tom Lane wrote:
> > [ thinks some more... ]  If we revived the GENERATED AS patch,
> > you could imagine computing tsvector columns via "GENERATED AS
> > to_tsvector('english'::regconfig, big_text_col)" instead of a
> > trigger, and then again you've got the dependency exposed where
> > the system can see it.  I don't wanna try to do that for 8.3,
> > but it might be a good path to pursue in future, instead of assuming
> > that triggers will be the way forevermore.
> >
> > Thoughts?
>
> No way with standard. GENERATED AS says that "all columns references in an
> expression associated with a generated column must be to columns of the
> base table containing that generated column."
>
> tsvector could be result of rather complex select involving several tables.
>

Is there some reason for this restriction in the standard?  I might be in 
favor of "extending" the standard to allow this case if not.  

-- 
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL


Re: tsearch_core patch: permissions and security issues

From
Robert Treat
Date:
On Thursday 14 June 2007 15:10, Teodor Sigaev wrote:
> That changes are doable for several days. I'd like to make changes together
> with replacing of FULLTEXT keyword to TEXT SEARCH as you suggested.

AIUI the discussion on this change took place off list?  Can we get a preview 
of what the commands will look like and maybe a summary of the discussion?  It 
may sound a bit pedantic, but the concept/wording of "full text searching" is 
pretty well understood by the database community, so switching to just TEXT 
SEARCH sounds like we're adding ambiguity for reasons that escape me....   

-- 
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL


Re: How does the tsearch configuration get selected?

From
Teodor Sigaev
Date:
> The only reason the TS stuff needs an encoding spec is to figure out how
> to read an external stop word file.  I think my suggestion upthread is a
> lot better: have just one stop word file per language, store them all in
> UTF8, and convert to database encoding when loading them.  The database

Hmm. You mean to use language name in configuration, use current encoding to
define which dictionary should be used (stemmers for the same language are 
different for different encoding) and recode dictionaries file from UTF8 to 
current locale. Did I understand you right?

That's possible to do. But it's incompatible changes and cause some difficulties 
for DBA. If server locale is ISO (or KOI8 or any other) and file is in UTF8 then 
text editor/tools might be confused.


-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: How does the tsearch configuration get selected?

From
Teodor Sigaev
Date:
> It's not really the index's problem; IIUC the behavior of the gist and
> gin index opclasses is not locale-specific.  

Right

> It's the to_tsvector calls
> that built the tsvector heap column that have a locale specified or
> implicit.  We need some way of annotating the heap column about this.
It seems too restrictive to advanced users.

> 
> In the case of a functional index you can expose the locale:
> 
>     create index ... (to_tsvector('english'::regconfig, mytextcol))
> 
> but there's still the problem that the planner cannot match that to
> a query specified as just WHERE to_tsvector(mytextcol) @@ query.


-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: How does the tsearch configuration get selected?

From
Tom Lane
Date:
Teodor Sigaev <teodor@sigaev.ru> writes:
>> It's the to_tsvector calls
>> that built the tsvector heap column that have a locale specified or
>> implicit.  We need some way of annotating the heap column about this.

> It seems too restrictive to advanced users.

Hm, are you trying to say that it's sane to have different tsvectors in
a column computed under different language settings?  Maybe we're all
overthinking the problem.  If the tsvector representation is presumed
language-independent then I could see this being a workable approach.
        regards, tom lane


Re: How does the tsearch configuration get selected?

From
Tom Lane
Date:
Teodor Sigaev <teodor@sigaev.ru> writes:
> Hmm. You mean to use language name in configuration, use current encoding to
> define which dictionary should be used (stemmers for the same language are 
> different for different encoding) and recode dictionaries file from UTF8 to 
> current locale. Did I understand you right?

Right.

> That's possible to do. But it's incompatible changes and cause some
> difficulties for DBA. If server locale is ISO (or KOI8 or any other)
> and file is in UTF8 then text editor/tools might be confused.

Well, I'm not as worried about that as I am about the database being
confused ;-).  We need some way to deal with stopword files that are in
a different encoding than the database encoding, and this has to be
proof against accidental or malicious mistakes by the non-superuser
users who are going to be able to specify which stopword file to use.
So I don't want the specification that goes into the CREATE DICTIONARY
command to involve an encoding.

One possibility is that the user-visible specification is just a name
(eg, "english"), but the actual filename out on the filesystem is,
say, name.encoding.stop (eg, "english.utf8.stop") where we use PG's
names for the encodings.  We could just fail if there's not a file
matching the database encoding, or we could try that and then try
utf8, or some other rule.  In any case I'd want it to verify and
convert encoding as necessary while reading.
        regards, tom lane


Re: tsearch_core patch: permissions and security issues

From
"Simon Riggs"
Date:
On Fri, 2007-06-15 at 10:36 -0400, Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
> > Although I'm happy to see tsearch finally hit the big time, I'm a bit
> > disappointed to see so many new datatype-specific SQL commands created.
> 
> Per subsequent discussion we are down to just one new set of commands,
> CREATE/ALTER/DROP TEXT SEARCH CONFIGURATION, so it's not as big a
> footprint as it was to start with.

Thats a lot better, thanks. I'm sure that will work better in PgAdmin
and many other places too.

> I have been thinking that it would be smart to try to use the generic
> "definition list" syntax, like CREATE OPERATOR and CREATE AGGREGATE.
> But the motivation for that is just to avoid defining more keywords
> (which has an overall impact on parser size and performance).  It's
> not really going to do anything for us in terms of having an
> implementation that can be shared with anything else.

It's OK; ALTER RFID TAG NOMENCLATURE has a nice ring to it. :-)

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




Re: How does the tsearch configuration get selected?

From
Teodor Sigaev
Date:
> Hm, are you trying to say that it's sane to have different tsvectors in
> a column computed under different language settings?  Maybe we're all

Yes, I think so.

That might have sense for close languages. Norwegian languages has two dialects 
and one of them has advanced rules for compound words, russian and ukranian has 
similar rules etc. Operation @@ is language (and encoding) independent, it use 
just strcmp call.

Most often usecase for mixing configuration is somewhere described by me in 
thread using two different configuration for indexing (tsvector creation) and 
search (tsquery creation). BTW, thesaurus dictionary could be used for similar 
reasons in search only configuration.

OpenFTS doesn't use tsearch2 configuration at all, it has such infrastructure 
itself - so, tsvector shouldn't have any information about configuration.

Most often change of configuration is a adding new stop words, which doesn't 
affect correctness of search. Removing stop words cause impossibility to find 
already indexed documents with query contains only removed stop-words.


> overthinking the problem.  If the tsvector representation is presumed
> language-independent then I could see this being a workable approach.

Actually, we should allow to only 'compatible' changes of configuration but it 
very hard (or even impossible) to formulate rules about that. Any dictionary has  its specific dictinitoption changes
tobecome incompatible with itself, the 
 
same is to compatibility between two dictionaries, list of dictionaries.

In practice, we didn't see any disasters after changes in configuration - until 
reindexing search becomes less punctual.


-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: How does the tsearch configuration get selected?

From
Teodor Sigaev
Date:
> One possibility is that the user-visible specification is just a name
> (eg, "english"), but the actual filename out on the filesystem is,
> say, name.encoding.stop (eg, "english.utf8.stop") where we use PG's
> names for the encodings.  We could just fail if there's not a file
> matching the database encoding, or we could try that and then try
> utf8, or some other rule.  In any case I'd want it to verify and
> convert encoding as necessary while reading.

I have no strong objection for UTF8-encoded files (stop words or ispell or 
synonym or thesaurus). Just recode it after reading.

But configuration for different languages might be differ, for example russian 
(and any cyrillic-based) configuration is differ from west-european 
configuration based on different character sets. So, we should have non-obvious 
rules for stemmers to define which exact stemmer and stop-file should be used.
For russian language with utf8 encoding it should use for lword english stemmer, 
but for italian language - italian stemmer. Any ASCII chars can't present in 
russian word, but might italian word can contains only ASCII.



-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: How does the tsearch configuration get selected?

From
Tom Lane
Date:
Teodor Sigaev <teodor@sigaev.ru> writes:
> But configuration for different languages might be differ, for example
> russian (and any cyrillic-based) configuration is differ from
> west-european configuration based on different character sets.

Sure.  I'm just assuming that the set of stopwords doesn't need to vary
depending on the encoding you're using for a language --- that is, if
you're willing to convert the encoding then the same stopword list file
should serve for all encodings of a given language.  Do you think this
might be wrong?
        regards, tom lane


Re: How does the tsearch configuration get selected?

From
Teodor Sigaev
Date:

> Sure.  I'm just assuming that the set of stopwords doesn't need to vary
> depending on the encoding you're using for a language --- that is, if
> you're willing to convert the encoding then the same stopword list file
> should serve for all encodings of a given language.  Do you think this
> might be wrong?
No. I believe that pgsql doesn't support encoding that can not be recoded from 
UTF8, at least for non-hieroglyph languages.

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: How does the tsearch configuration get selected?

From
Teodor Sigaev
Date:
So, added to my plan 
(http://archives.postgresql.org/pgsql-hackers/2007-06/msg00618.php)
n) single encoded files. That will touch snowball, ispell, synonym, thesaurus   and simple dictionaries
n+1) use encoding names instead of locale's names in configuration

Tom Lane wrote:
> Teodor Sigaev <teodor@sigaev.ru> writes:
>> But configuration for different languages might be differ, for example
>> russian (and any cyrillic-based) configuration is differ from
>> west-european configuration based on different character sets.
> 
> Sure.  I'm just assuming that the set of stopwords doesn't need to vary
> depending on the encoding you're using for a language --- that is, if
> you're willing to convert the encoding then the same stopword list file
> should serve for all encodings of a given language.  Do you think this
> might be wrong?
> 
>             regards, tom lane
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
>        choose an index scan if your joining column's datatypes do not
>        match

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: How does the tsearch configuration get selected?

From
Gregory Stark
Date:
"Teodor Sigaev" <teodor@sigaev.ru> writes:

>> Hm, are you trying to say that it's sane to have different tsvectors in
>> a column computed under different language settings?  Maybe we're all
>
> Yes, I think so.
>
> That might have sense for close languages. Norwegian languages has two dialects
> and one of them has advanced rules for compound words, russian and ukranian has
> similar rules etc. Operation @@ is language (and encoding) independent, it use
> just strcmp call.

To support this sanely though wouldn't you need to know which language rule a
tsvector was generated with? Like, have a byte in the tsvector tagging it with
the language rule forever more?

What I'm wondering about is if you use a different rule than what was used
when an index entry was inserted will you get different results using the
index than you would doing a sequential scan and reapplying the operator to
every datum?


--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com



Re: How does the tsearch configuration get selected?

From
Teodor Sigaev
Date:
> To support this sanely though wouldn't you need to know which language rule a
> tsvector was generated with? Like, have a byte in the tsvector tagging it with
> the language rule forever more?

No. As corner case, dictionary might return just a number or a hash value.


> 
> What I'm wondering about is if you use a different rule than what was used
> when an index entry was inserted will you get different results using the
> index than you would doing a sequential scan and reapplying the operator to
> every datum?

Rules are apllyed  during creattion of tsvector, not during indexing of 
tsvectors. So, sequential and index scan will return identical results.


Re: How does the tsearch configuration get selected?

From
Bruce Momjian
Date:
Teodor Sigaev wrote:
> So, added to my plan 
> (http://archives.postgresql.org/pgsql-hackers/2007-06/msg00618.php)
> n) single encoded files. That will touch snowball, ispell, synonym, thesaurus
>     and simple dictionaries
> n+1) use encoding names instead of locale's names in configuration

FYI, I am continuing with the documentation cleanup, though I will not
do the /ref directory until we are sure which commands will be kept.

We can later modify the documentation to match the new behavior.

---------------------------------------------------------------------------


> 
> Tom Lane wrote:
> > Teodor Sigaev <teodor@sigaev.ru> writes:
> >> But configuration for different languages might be differ, for example
> >> russian (and any cyrillic-based) configuration is differ from
> >> west-european configuration based on different character sets.
> > 
> > Sure.  I'm just assuming that the set of stopwords doesn't need to vary
> > depending on the encoding you're using for a language --- that is, if
> > you're willing to convert the encoding then the same stopword list file
> > should serve for all encodings of a given language.  Do you think this
> > might be wrong?
> > 
> >             regards, tom lane
> > 
> > ---------------------------(end of broadcast)---------------------------
> > TIP 9: In versions below 8.0, the planner will ignore your desire to
> >        choose an index scan if your joining column's datatypes do not
> >        match
> 
> -- 
> Teodor Sigaev                                   E-mail: teodor@sigaev.ru
>                                                     WWW: http://www.sigaev.ru/

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Updated tsearch documentation

From
Bruce Momjian
Date:
I have completed my first pass over the tsearch documentation:
http://momjian.us/expire/fulltext/HTML/sql.html

They are from section 14 and following.

I have come up with a number of questions that I placed in SGML comments
in these files:
http://momjian.us/expire/fulltext/SGML/

Teodor/Oleg, let me know when you want to go over my questions.

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: Updated tsearch documentation

From
Oleg Bartunov
Date:
On Sun, 17 Jun 2007, Bruce Momjian wrote:

> I have completed my first pass over the tsearch documentation:
>
>     http://momjian.us/expire/fulltext/HTML/sql.html
>
> They are from section 14 and following.
>
> I have come up with a number of questions that I placed in SGML comments
> in these files:
>
>     http://momjian.us/expire/fulltext/SGML/
>
> Teodor/Oleg, let me know when you want to go over my questions.

Below are my answers (marked as )


Comments to editorial work of Bruce Momjian.

fulltext-intro.sgml:

it is useful to have a predefined list of lexemes.

>Bruce, here should be list of types of lexemes !


</para></listitem>

<!--
SEEMS UNNECESSARY
It useless to attempt normalize <type>email address</type> using
morphological dictionary of russian language, but looks reasonable to pick
out <type>domain name</type> and be able to search for <type>domain
name</type>.
-->

>I dont' understand where did you get this para :)

fulltext-opfunc.sgml:

All of the following functions that accept a configuration argument can
use either an integer <!-- why an integer --> or a textual configuration
name to select a configuration.

> originally it was integer id, probably better use <type>oid</type>


This returns the query used for searching an index. It can be used to test
for an empty query. The <command>SELECT</> below returns <literal>'T'</>,
<!-- lowercase? --> which corresponds to an empty query since GIN indexes
do not support negate queries (a full index scan is inefficient):

> capital case. This looks cumbersome, probably querytree() should
> just return NULL.

The integer option controls several behaviors which is done using bit-wise
fields and <literal>|</literal> (for example, <literal>2|4</literal>):
<!-- why so complex? -->

> to avoid 2 arguments

its <replaceable>id</replaceable> or <replaceable>ts_name</replaceable>; <!-- n
if none is specified that the current configuration is used.

> I don't understand this question

<para>
<!-- why?  -->
Note that the cascade dropping of the <function>headline</function> function
cause dropping of the <literal>parser</literal> used in fulltext configuration
<replaceable>tsname</replaceable>.
</para>

> hmm, probably it should be reversed - cascade dropping of the parser cause
> dropping of the headline function.

In example below, <literal>fulltext_idx</literal> is
a GIN index:<!-- why isn't this automatic -->

> It's explained above. The problem is that current index api doesn't allow
> to say if search was lossy or exact, so to preserve performance of
> GIN index we had to introduce @@@ operator, which is the same as @@, but
> lossy.

nly the <token>lword</token> lexeme, then a <acronym>TZ</acronym>
definition like ' one 1:11' will not work since lexeme type
<token>digit</token> is not assigned to the <acronym>TZ</acronym>.
<!-- what do these numbers mean? -->
</para>

> nothing special, just numbers for example.

<function>ts_debug</> displays information about every token of
<replaceable class="PARAMETER">document</replaceable> as produced by the
parser and processed by the configured dictionaries using the configuration
specified by <replaceable class="PARAMETER">cfgname</replaceable> or
<replaceable class="PARAMETER">oid</replaceable>. <!-- no need for oid

> don't understand this comment. ts_debug accepts cfgname or its oid



    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


Re: Updated tsearch documentation

From
Bruce Momjian
Date:
Oleg Bartunov wrote:
> On Sun, 17 Jun 2007, Bruce Momjian wrote:
> 
> > I have completed my first pass over the tsearch documentation:
> >
> >     http://momjian.us/expire/fulltext/HTML/sql.html
> >
> > They are from section 14 and following.
> >
> > I have come up with a number of questions that I placed in SGML comments
> > in these files:
> >
> >     http://momjian.us/expire/fulltext/SGML/
> >
> > Teodor/Oleg, let me know when you want to go over my questions.
> 
> Below are my answers (marked as )

OK.
> 
> Comments to editorial work of Bruce Momjian.
> 
> fulltext-intro.sgml:
> 
> it is useful to have a predefined list of lexemes.
> 
>Bruce, here should be list of types of lexemes !

Agreed.  Are the list of lexemes parser-specific?

> </para></listitem>
> 
> <!--
> SEEMS UNNECESSARY
> It useless to attempt normalize <type>email address</type> using
> morphological dictionary of russian language, but looks reasonable to pick
> out <type>domain name</type> and be able to search for <type>domain
> name</type>.
> -->
> 
> I dont' understand where did you get this para :)

Uh, it was in the SGML.  I have removed it.

> fulltext-opfunc.sgml:
> 
> All of the following functions that accept a configuration argument can
> use either an integer <!-- why an integer --> or a textual configuration
> name to select a configuration.
> 
> originally it was integer id, probably better use <type>oid</type>

Uh, my question is why are you allowing specification as an integer/oid
when the name works just fine.  I don't see the value in allowing
numbers here.

> This returns the query used for searching an index. It can be used to test
> for an empty query. The <command>SELECT</> below returns <literal>'T'</>,
> <!-- lowercase? --> which corresponds to an empty query since GIN indexes
> do not support negate queries (a full index scan is inefficient):
> 
> > capital case. This looks cumbersome, probably querytree() should
> > just return NULL.

Agreed.

> The integer option controls several behaviors which is done using bit-wise
> fields and <literal>|</literal> (for example, <literal>2|4</literal>):
> <!-- why so complex? -->
> 
> > to avoid 2 arguments

But I don't see why you would want to set two of those values --- they
seem mutually exclusive, e.g.
1 divides the rank by the 1 + logarithm of the document length2 divides the rank by the length itself

I assume you do either one, not both.

> its <replaceable>id</replaceable> or <replaceable>ts_name</replaceable>; <!-- n
> if none is specified that the current configuration is used.
> 
> > I don't understand this question

Same issue as above --- why allow a number here when the name works just
fine.  We don't allow tables to be specified by number, so why
configurations?

> <para>
> <!-- why?  -->
> Note that the cascade dropping of the <function>headline</function> function
> cause dropping of the <literal>parser</literal> used in fulltext configuration
> <replaceable>tsname</replaceable>.
> </para>
> 
> > hmm, probably it should be reversed - cascade dropping of the parser cause
> > dropping of the headline function.

Agreed.

> 
> In example below, <literal>fulltext_idx</literal> is
> a GIN index:<!-- why isn't this automatic -->
> 
> > It's explained above. The problem is that current index api doesn't allow
> > to say if search was lossy or exact, so to preserve performance of
> > GIN index we had to introduce @@@ operator, which is the same as @@, but
> > lossy.

Well, then we have to fix the API.  Telling users to use a different
operator based on what index is defined is just bad style.

> nly the <token>lword</token> lexeme, then a <acronym>TZ</acronym>
> definition like ' one 1:11' will not work since lexeme type
> <token>digit</token> is not assigned to the <acronym>TZ</acronym>.
> <!-- what do these numbers mean? -->
> </para>

OK, I changed it to be clearer.

> > nothing special, just numbers for example.
> 
> <function>ts_debug</> displays information about every token of
> <replaceable class="PARAMETER">document</replaceable> as produced by the
> parser and processed by the configured dictionaries using the configuration
> specified by <replaceable class="PARAMETER">cfgname</replaceable> or
> <replaceable class="PARAMETER">oid</replaceable>. <!-- no need for oid
> 
> > don't understand this comment. ts_debug accepts cfgname or its oid

Again, no need for oid.

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: Updated tsearch documentation

From
Oleg Bartunov
Date:
On Wed, 20 Jun 2007, Bruce Momjian wrote:

> Oleg Bartunov wrote:
>> On Sun, 17 Jun 2007, Bruce Momjian wrote:
>>
>>> I have completed my first pass over the tsearch documentation:
>>>
>>>     http://momjian.us/expire/fulltext/HTML/sql.html
>>>
>>> They are from section 14 and following.
>>>
>>> I have come up with a number of questions that I placed in SGML comments
>>> in these files:
>>>
>>>     http://momjian.us/expire/fulltext/SGML/
>>>
>>> Teodor/Oleg, let me know when you want to go over my questions.
>>
>> Below are my answers (marked as )
>
> OK.
>>
>> Comments to editorial work of Bruce Momjian.
>>
>> fulltext-intro.sgml:
>>
>> it is useful to have a predefined list of lexemes.
>>
>> Bruce, here should be list of types of lexemes !
>
> Agreed.  Are the list of lexemes parser-specific?
>

yes, it it parser which defines types of lexemes.

>> fulltext-opfunc.sgml:
>>
>> All of the following functions that accept a configuration argument can
>> use either an integer <!-- why an integer --> or a textual configuration
>> name to select a configuration.
>>
>> originally it was integer id, probably better use <type>oid</type>
>
> Uh, my question is why are you allowing specification as an integer/oid
> when the name works just fine.  I don't see the value in allowing
> numbers here.

for compatibility reason. Hmm, indeed, i don't recall where oid's could be 
important.

>
>> This returns the query used for searching an index. It can be used to test
>> for an empty query. The <command>SELECT</> below returns <literal>'T'</>,
>> <!-- lowercase? --> which corresponds to an empty query since GIN indexes
>> do not support negate queries (a full index scan is inefficient):
>>
>>> capital case. This looks cumbersome, probably querytree() should
>>> just return NULL.
>
> Agreed.
>
>> The integer option controls several behaviors which is done using bit-wise
>> fields and <literal>|</literal> (for example, <literal>2|4</literal>):
>> <!-- why so complex? -->
>>
>>> to avoid 2 arguments
>
> But I don't see why you would want to set two of those values --- they
> seem mutually exclusive, e.g.
>
>     1 divides the rank by the 1 + logarithm of the document length
>     2 divides the rank by the length itself
>
> I assume you do either one, not both.

but what's about others variants ?

What I missed is the definition of extent.

From http://www.sai.msu.su/~megera/wiki/NewExtentsBasedRanking
Extent is a shortest and non-nested sequence of words, which satisfy a query.


>
>> its <replaceable>id</replaceable> or <replaceable>ts_name</replaceable>; <!-- n
>> if none is specified that the current configuration is used.
>>
>>> I don't understand this question
>
> Same issue as above --- why allow a number here when the name works just
> fine.  We don't allow tables to be specified by number, so why
> configurations?
>
>> <para>
>> <!-- why?  -->
>> Note that the cascade dropping of the <function>headline</function> function
>> cause dropping of the <literal>parser</literal> used in fulltext configuration
>> <replaceable>tsname</replaceable>.
>> </para>
>>
>>> hmm, probably it should be reversed - cascade dropping of the parser cause
>>> dropping of the headline function.
>
> Agreed.
>
>>
>> In example below, <literal>fulltext_idx</literal> is
>> a GIN index:<!-- why isn't this automatic -->
>>
>>> It's explained above. The problem is that current index api doesn't allow
>>> to say if search was lossy or exact, so to preserve performance of
>>> GIN index we had to introduce @@@ operator, which is the same as @@, but
>>> lossy.
>
> Well, then we have to fix the API.  Telling users to use a different
> operator based on what index is defined is just bad style.

This was raised by Heikki and we discussed it a bit in Ottawa, but it's
unclear if it's doable for 8.3.  @@@ operator is in rare use, so we could
say it will be improved in future versions.

>
>> nly the <token>lword</token> lexeme, then a <acronym>TZ</acronym>
>> definition like ' one 1:11' will not work since lexeme type
>> <token>digit</token> is not assigned to the <acronym>TZ</acronym>.
>> <!-- what do these numbers mean? -->
>> </para>
>
> OK, I changed it to be clearer.
>
>>> nothing special, just numbers for example.
>>
>> <function>ts_debug</> displays information about every token of
>> <replaceable class="PARAMETER">document</replaceable> as produced by the
>> parser and processed by the configured dictionaries using the configuration
>> specified by <replaceable class="PARAMETER">cfgname</replaceable> or
>> <replaceable class="PARAMETER">oid</replaceable>. <!-- no need for oid
>>
>>> don't understand this comment. ts_debug accepts cfgname or its oid
>
> Again, no need for oid.

We need to decide if we need oids as user-visible argument. I don't see
any value, probably Teodor think other way.

    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


Re: Updated tsearch documentation

From
Bruce Momjian
Date:
Oleg Bartunov wrote:
> On Wed, 20 Jun 2007, Bruce Momjian wrote:
> >> Comments to editorial work of Bruce Momjian.
> >>
> >> fulltext-intro.sgml:
> >>
> >> it is useful to have a predefined list of lexemes.
> >>
> >> Bruce, here should be list of types of lexemes !
> >
> > Agreed.  Are the list of lexemes parser-specific?
> >
> 
> yes, it it parser which defines types of lexemes.

OK, how will users get a list of supported lexemes?  Do we need a list
per supported parser?

> >> fulltext-opfunc.sgml:
> >>
> >> All of the following functions that accept a configuration argument can
> >> use either an integer <!-- why an integer --> or a textual configuration
> >> name to select a configuration.
> >>
> >> originally it was integer id, probably better use <type>oid</type>
> >
> > Uh, my question is why are you allowing specification as an integer/oid
> > when the name works just fine.  I don't see the value in allowing
> > numbers here.
> 
> for compatibility reason. Hmm, indeed, i don't recall where oid's could be 
> important.

Well, if neither of ussee no reason for it, let's remove it.  We don't
need to support a feature that has no usefulness.

> >> This returns the query used for searching an index. It can be used to test
> >> for an empty query. The <command>SELECT</> below returns <literal>'T'</>,
> >> <!-- lowercase? --> which corresponds to an empty query since GIN indexes
> >> do not support negate queries (a full index scan is inefficient):
> >>
> >>> capital case. This looks cumbersome, probably querytree() should
> >>> just return NULL.
> >
> > Agreed.
> >
> >> The integer option controls several behaviors which is done using bit-wise
> >> fields and <literal>|</literal> (for example, <literal>2|4</literal>):
> >> <!-- why so complex? -->
> >>
> >>> to avoid 2 arguments
> >
> > But I don't see why you would want to set two of those values --- they
> > seem mutually exclusive, e.g.
> >
> >     1 divides the rank by the 1 + logarithm of the document length
> >     2 divides the rank by the length itself
> >
> > I assume you do either one, not both.
> 
> but what's about others variants ?

OK, here is the full list:
0 (the default) ignores document length1 divides the rank by the 1 + logarithm of the document length2 divides the rank
bythe length itself4 divides the rank by the mean harmonic distance between extents8 divides the rank by the number of
uniquewords in document16 divides the rank by 1 + logarithm of the number of unique words in   document
 

so which ones would be both enabled?

> 
> What I missed is the definition of extent.
> 
> >From http://www.sai.msu.su/~megera/wiki/NewExtentsBasedRanking
> Extent is a shortest and non-nested sequence of words, which satisfy a query.

I don't understand how that relates to this.

> >
> >> its <replaceable>id</replaceable> or <replaceable>ts_name</replaceable>; <!-- n
> >> if none is specified that the current configuration is used.
> >>
> >>> I don't understand this question
> >
> > Same issue as above --- why allow a number here when the name works just
> > fine.  We don't allow tables to be specified by number, so why
> > configurations?
> >
> >> <para>
> >> <!-- why?  -->
> >> Note that the cascade dropping of the <function>headline</function> function
> >> cause dropping of the <literal>parser</literal> used in fulltext configuration
> >> <replaceable>tsname</replaceable>.
> >> </para>
> >>
> >>> hmm, probably it should be reversed - cascade dropping of the parser cause
> >>> dropping of the headline function.
> >
> > Agreed.
> >
> >>
> >> In example below, <literal>fulltext_idx</literal> is
> >> a GIN index:<!-- why isn't this automatic -->
> >>
> >>> It's explained above. The problem is that current index api doesn't allow
> >>> to say if search was lossy or exact, so to preserve performance of
> >>> GIN index we had to introduce @@@ operator, which is the same as @@, but
> >>> lossy.
> >
> > Well, then we have to fix the API.  Telling users to use a different
> > operator based on what index is defined is just bad style.
> 
> This was raised by Heikki and we discussed it a bit in Ottawa, but it's
> unclear if it's doable for 8.3.  @@@ operator is in rare use, so we could
> say it will be improved in future versions.

Uh, I am wondering if we just have to force heap access in all cases
until it is fixed.

> >> nly the <token>lword</token> lexeme, then a <acronym>TZ</acronym>
> >> definition like ' one 1:11' will not work since lexeme type
> >> <token>digit</token> is not assigned to the <acronym>TZ</acronym>.
> >> <!-- what do these numbers mean? -->
> >> </para>
> >
> > OK, I changed it to be clearer.
> >
> >>> nothing special, just numbers for example.
> >>
> >> <function>ts_debug</> displays information about every token of
> >> <replaceable class="PARAMETER">document</replaceable> as produced by the
> >> parser and processed by the configured dictionaries using the configuration
> >> specified by <replaceable class="PARAMETER">cfgname</replaceable> or
> >> <replaceable class="PARAMETER">oid</replaceable>. <!-- no need for oid
> >>
> >>> don't understand this comment. ts_debug accepts cfgname or its oid
> >
> > Again, no need for oid.
> 
> We need to decide if we need oids as user-visible argument. I don't see
> any value, probably Teodor think other way.

This is a good time to clean up the API because there are going to be
user-visible changes anyway.

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: Updated tsearch documentation

From
Oleg Bartunov
Date:
On Wed, 20 Jun 2007, Bruce Momjian wrote:

> Oleg Bartunov wrote:
>> On Wed, 20 Jun 2007, Bruce Momjian wrote:
>>>> Comments to editorial work of Bruce Momjian.
>>>>
>>>> fulltext-intro.sgml:
>>>>
>>>> it is useful to have a predefined list of lexemes.
>>>>
>>>> Bruce, here should be list of types of lexemes !
>>>
>>> Agreed.  Are the list of lexemes parser-specific?
>>>
>>
>> yes, it it parser which defines types of lexemes.
>
> OK, how will users get a list of supported lexemes?  Do we need a list
> per supported parser?

it's documented, see "Parser functions" for token_type();

postgres=# select * from token_type('default'); tokid |    alias     |            description
-------+--------------+-----------------------------------     1 | lword        | Latin word     2 | nlword       |
Non-latinword     3 | word         | Word     4 | email        | Email     5 | url          | URL     6 | host
|Host     7 | sfloat       | Scientific notation     8 | version      | VERSION     9 | part_hword   | Part of
hyphenatedword    10 | nlpart_hword | Non-latin part of hyphenated word    11 | lpart_hword  | Latin part of hyphenated
word   12 | blank        | Space symbols    13 | tag          | HTML Tag    14 | protocol     | Protocol head    15 |
hword       | Hyphenated word    16 | lhword       | Latin hyphenated word    17 | nlhword      | Non-latin hyphenated
word   18 | uri          | URI    19 | file         | File or path name    20 | float        | Decimal notation    21 |
int         | Signed integer    22 | uint         | Unsigned integer    23 | entity       | HTML Entity
 

>>>> The integer option controls several behaviors which is done using bit-wise
>>>> fields and <literal>|</literal> (for example, <literal>2|4</literal>):
>>>> <!-- why so complex? -->
>>>>
>>>>> to avoid 2 arguments
>>>
>>> But I don't see why you would want to set two of those values --- they
>>> seem mutually exclusive, e.g.
>>>
>>>     1 divides the rank by the 1 + logarithm of the document length
>>>     2 divides the rank by the length itself
>>>
>>> I assume you do either one, not both.
>>
>> but what's about others variants ?
>
> OK, here is the full list:
>
>     0 (the default) ignores document length
>     1 divides the rank by the 1 + logarithm of the document length
>     2 divides the rank by the length itself
>     4 divides the rank by the mean harmonic distance between extents
>     8 divides the rank by the number of unique words in document
>     16 divides the rank by 1 + logarithm of the number of unique words in
>        document
>
> so which ones would be both enabled?

no one ! This is a list of possible values of rank normalization flag, which 
could be ORed together.

=# select rank_cd('1:1,2,3 4:5 6:7', '1&4',1);  rank_cd
----------- 0.0279055
=# select rank_cd('1:1,2,3 4:5 6:7', '1&4',1|16);  rank_cd
----------- 0.0139528


>
>>
>> What I missed is the definition of extent.
>>
>>> From http://www.sai.msu.su/~megera/wiki/NewExtentsBasedRanking
>> Extent is a shortest and non-nested sequence of words, which satisfy a query.
>
> I don't understand how that relates to this.

because of 
"4 divides the rank by the mean harmonic distance between extents"
   ^^^^^^^
 
it reflects how dense extents which satisfy query are in document.
>
>>>
>>>> its <replaceable>id</replaceable> or <replaceable>ts_name</replaceable>; <!-- n
>>>> if none is specified that the current configuration is used.
>>>>
>>>>> I don't understand this question
>>>
>>> Same issue as above --- why allow a number here when the name works just
>>> fine.  We don't allow tables to be specified by number, so why
>>> configurations?
>>>
>>>> <para>
>>>> <!-- why?  -->
>>>> Note that the cascade dropping of the <function>headline</function> function
>>>> cause dropping of the <literal>parser</literal> used in fulltext configuration
>>>> <replaceable>tsname</replaceable>.
>>>> </para>
>>>>
>>>>> hmm, probably it should be reversed - cascade dropping of the parser cause
>>>>> dropping of the headline function.
>>>
>>> Agreed.
>>>
>>>>
>>>> In example below, <literal>fulltext_idx</literal> is
>>>> a GIN index:<!-- why isn't this automatic -->
>>>>
>>>>> It's explained above. The problem is that current index api doesn't allow
>>>>> to say if search was lossy or exact, so to preserve performance of
>>>>> GIN index we had to introduce @@@ operator, which is the same as @@, but
>>>>> lossy.
>>>
>>> Well, then we have to fix the API.  Telling users to use a different
>>> operator based on what index is defined is just bad style.
>>
>> This was raised by Heikki and we discussed it a bit in Ottawa, but it's
>> unclear if it's doable for 8.3.  @@@ operator is in rare use, so we could
>> say it will be improved in future versions.
>
> Uh, I am wondering if we just have to force heap access in all cases
> until it is fixed.

no-no ! We'll lost performance of GIN index, which isn't lossy and don't
need heap access. I don't see what's wrong if we say that some feature
doesn't supported by text search operator with GIN index.

>> We need to decide if we need oids as user-visible argument. I don't see
>> any value, probably Teodor think other way.
>
> This is a good time to clean up the API because there are going to be
> user-visible changes anyway.

I agree. Keep in mind this, until we get more serious tasks done.
    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


Re: Updated tsearch documentation

From
Oleg Bartunov
Date:
On Wed, 20 Jun 2007, Bruce Momjian wrote:
>>
>> We need to decide if we need oids as user-visible argument. I don't see
>> any value, probably Teodor think other way.
>
> This is a good time to clean up the API because there are going to be
> user-visible changes anyway.

Bruce, just remove oid argument specification from documentation.

    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


Re: Updated tsearch documentation

From
Bruce Momjian
Date:
Oleg Bartunov wrote:
> On Wed, 20 Jun 2007, Bruce Momjian wrote:
> >>
> >> We need to decide if we need oids as user-visible argument. I don't see
> >> any value, probably Teodor think other way.
> >
> > This is a good time to clean up the API because there are going to be
> > user-visible changes anyway.
>
> Bruce, just remove oid argument specification from documentation.

Done.  I am attaching the current function prototypes.  If they don't
match the C code, please let me know.

I have also updated with some minor corrections I received from Erik.  I
will be adding more to the documentation hopefully this week:

    http://momjian.us/expire/fulltext/HTML/

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
*** /pgsgml/fulltext-opfunc.sgml    Sat Jun 16 23:30:11 2007
--- fulltext-opfunc.sgml    Mon Jul  2 21:17:15 2007
***************
*** 141,147 ****

  <term>
  <synopsis>
! to_tsvector(<optional><replaceable class="PARAMETER">configuration</replaceable>,</optional>  <replaceable
class="PARAMETER">document</replaceable>TEXT) returns TSVECTOR 
  </synopsis>
  </term>

--- 141,147 ----

  <term>
  <synopsis>
! to_tsvector(<optional><replaceable class="PARAMETER">conf_name</replaceable></optional>,  <replaceable
class="PARAMETER">document</replaceable>TEXT) returns TSVECTOR 
  </synopsis>
  </term>

***************
*** 285,306 ****

  <term>
  <synopsis>
! tsearch(<replaceable class="PARAMETER">vector_column_name</replaceable><optional>, (<replaceable
class="PARAMETER">my_filter_name</replaceable>| <replaceable class="PARAMETER">text_column_name1</replaceable>)
<optional>...</optional></optional>, <replaceable class="PARAMETER">text_column_nameN</replaceable>) 
  </synopsis>
  </term>

***************
*** 323,329 ****

  <term>
  <synopsis>
! stat(<replaceable class="PARAMETER">sqlquery</replaceable> text <optional>, weight text </optional>) returns SETOF
statinfo
  </synopsis>

  <listitem>
--- 322,328 ----

  <term>
  <synopsis>
! stat(<optional><replaceable class="PARAMETER">sqlquery</replaceable> text </optional>, weight text </optional>)
returnsSETOF statinfo 
  </synopsis>

  <listitem>
***************
*** 403,409 ****

  <term>
  <synopsis>
! to_tsquery(<optional><replaceable class="PARAMETER">configuration</replaceable>,</optional> <replaceable
class="PARAMETER">querytext</replaceable>text) returns TSQUERY 
  </synopsis>
  </term>

--- 402,408 ----

  <term>
  <synopsis>
! to_tsquery(<optional><replaceable class="PARAMETER">conf_name</replaceable></optional>, <replaceable
class="PARAMETER">querytext</replaceable>text) returns TSQUERY 
  </synopsis>
  </term>

***************
*** 446,452 ****

  <term>
  <synopsis>
! plainto_tsquery(<optional><replaceable class="PARAMETER">configuration</replaceable>,</optional>  <replaceable
class="PARAMETER">querytext</replaceable>text) returns TSQUERY 
  </synopsis>
  </term>

--- 445,451 ----

  <term>
  <synopsis>
! plainto_tsquery(<optional><replaceable class="PARAMETER">conf_name</replaceable></optional>,  <replaceable
class="PARAMETER">querytext</replaceable>text) returns TSQUERY 
  </synopsis>
  </term>

***************
*** 989,995 ****

  <term>
  <synopsis>
! rank(<optional> <replaceable class="PARAMETER">weights</replaceable> float4[], </optional> <replaceable
class="PARAMETER">vector</replaceable>TSVECTOR, <replaceable class="PARAMETER">query</replaceable> TSQUERY, <optional>
<replaceableclass="PARAMETER">normalization</replaceable> int4 </optional>) returns float4 
  </synopsis>
  </term>

--- 988,994 ----

  <term>
  <synopsis>
! rank(<optional> <replaceable class="PARAMETER">weights</replaceable> float4[]</optional>, <replaceable
class="PARAMETER">vector</replaceable>TSVECTOR, <replaceable class="PARAMETER">query</replaceable> TSQUERY, <optional>
<replaceableclass="PARAMETER">normalization</replaceable> int4 </optional>) returns float4 
  </synopsis>
  </term>

***************
*** 1084,1090 ****

  <term>
  <synopsis>
! headline(<optional> <replaceable class="PARAMETER">id</replaceable> int4, | <replaceable
class="PARAMETER">ts_name</replaceable>text, </optional> <replaceable class="PARAMETER">document</replaceable> text,
<replaceableclass="PARAMETER">query</replaceable> TSQUERY, <optional> <replaceable
class="PARAMETER">options</replaceable>text </optional>) returns text 
  </synopsis>
  </term>

--- 1083,1089 ----

  <term>
  <synopsis>
! headline(<optional> <replaceable class="PARAMETER">ts_name</replaceable> text</optional>, <replaceable
class="PARAMETER">document</replaceable>text, <replaceable class="PARAMETER">query</replaceable> TSQUERY, <optional>
<replaceableclass="PARAMETER">options</replaceable> text </optional>) returns text 
  </synopsis>
  </term>

***************
*** 1351,1357 ****

  <term>
  <synopsis>
! lexize(<optional> <replaceable class="PARAMETER">oid</replaceable>, | <replaceable
class="PARAMETER">dict_name</replaceable>text, <replaceable class="PARAMETER">lexeme</replaceable> text) returns text[] 
  </synopsis>
  </term>

--- 1350,1356 ----

  <term>
  <synopsis>
! lexize(<optional> <replaceable class="PARAMETER">dict_name</replaceable> text</optional>, <replaceable
class="PARAMETER">lexeme</replaceable>text) returns text[] 
  </synopsis>
  </term>

***************
*** 1858,1878 ****
  <title>Debugging</title>

  <para>
! Function <function>ts_debug</function> allows easy testing of your full text indexing
  configuration.
  </para>

  <synopsis>
! ts_debug(<optional><replaceable class="PARAMETER">cfgname</replaceable> | <replaceable
class="PARAMETER">oid</replaceable></optional>,<replaceable class="PARAMETER">document</replaceable> TEXT) returns
SETOFtsdebug 
  </synopsis>

  <para>
--- 1852,1870 ----
  <title>Debugging</title>

  <para>
! Function <function>ts_debug</function> allows easy testing of your full text searching
  configuration.
  </para>

  <synopsis>
! ts_debug(<optional><replaceable class="PARAMETER">conf_name</replaceable></optional>, <replaceable
class="PARAMETER">document</replaceable>TEXT) returns SETOF tsdebug 
  </synopsis>


Re: Updated tsearch documentation

From
Bruce Momjian
Date:
FYI, I have massively reorganized the text search documentation and it
is getting closer to something I am happy with:
http://momjian.us/expire/fulltext/HTML/textsearch.html

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: Updated tsearch documentation

From
"Nicolas Barbier"
Date:
2007/7/7, Bruce Momjian <bruce@momjian.us>:

> FYI, I have massively reorganized the text search documentation and it
> is getting closer to something I am happy with:
>
>         http://momjian.us/expire/fulltext/HTML/textsearch.html

The following is the result of me proofreading, mainly searching for
small mistakes such as spelling/grammatical errors (that means no
document structure comments, etc).

All corrections are relative to the version of the text at above URL
at the time of me reading it :-).

General

It seems to be a recurring problem that commas are not put between the
brackets when an argument is optional. For example:
"to_tsvector([conf_name], document TEXT)" -> I guess this should be
"to_tsvector([conf_name,] document TEXT)"

Full-text vs. full text and stop-word vs. stop word are not used
consistently. Also, capitalization of full text searching is not used
consistently.

14.1. Introduction

* "indexinging" - > "indexing"
* "There is no linguistic support, even in English" -> "for" instead of "in"?
* "e.g.satisfies" -> add a space before "satisfies"
* "have several thousands derivatives" -> should this not use the
singular form thousand?
* "infinitive form" -> is this the right term? I think it only applies
to verbs (also occurs in 14.4 and probably others)
* "over how lexemes creation" -> not sure what this should be. "are
created" maybe?
* "Map synonyms to a single word. ispell." -> why is ispell a standalone word?
* "so it is natural to introduce a new data type" -> this does not
sound like documentation
* "Also, full-text search operator @@" -> add "the" before "full-text"
* "A document is any text file that can be opened, read, and modified"
-> "file" sounds as if it should be a file on a filesystem.
* "However, the document file must be uniquely identified in the
database." -> why?
* "COALESCE" -> should be a link
* "during calculation of document rank" -> add "the" before
"calculation" and before "document"
* "which supports boolean operators, & (AND)" -> remove the ",". maybe
add "the" before boolean
* "parenthesis" -> "parentheses"
* "Tsquery consists of" -> maybe add "A" before Tsquery

14.2. Operators And Functions               ^^^ -> a non-capital "a" in "and" seems to be more
consistent with the rest of the manual

* "TSVECTOR, otherwise false:" -> "and false if not" or "and false
otherwise" (occurs 3 times in this section)
* "The text should be formatted to match the way a vector is displayed
by SELECT." -> what a strange definition, I think something like
"input format" or so should be used (and defined somewhere, didn't see
it yet) (used twice in this section)
* "tsearch([vector_column_name], my_filter_name | text_column_name1
[...], text_column_nameN)" -> I do not understand the notation
* "The following rule is used: a function is applied to all subsequent
TEXT columns until next matching column occurs." -> I don't get it
* "stat([sqlquery text ], [weight text ]) returns SETOF statinfo" -> I
guess that not both of the arguments are optional?
* "stop-words candidates" -> stop-word candidates
* "tsvectors are compared with each other using lexicographical
ordering." -> of the output representation or something else?
* "Accepts querytext, which should be single tokens separated by" ->
replace "be" with "consist of"
* "& and | or, and ! not" -> putting parentheses around the "and" "or"
and "not" would be more readable. also, a comma is missing before the
"|" sign
* "break it onto tokens" -> into instead of onto
* "since GIN indexes do not support negate queries" -> something like:
"queries with negation" or "negated queries" (depending on what the
correct rule is)
* "Arguments to rewrite() function" -> "the .. functions" or "to .."
(without the "function")
* "can be column names of type tsquery" -> "names of columns of type
tsquery" (the names are not of type tsquery, the columns are)
* "we can change rewriting rule online" -> add "the", possibly use
another word for "online" (it is not clear what that means to me)

14.3. Additional Controls

* "Full text searching in PostgreSQL provides function" -> add "the"
* "we see the resulting" -> maybe "we see that the resulting"
"does not contain a, on, or it, word rats became rat, and the
punctuation sign - was ignored" -> "does not contain the words" (or
lexemes, or tokens), add "the" before "word rats", add quotes around
the "-"
* "on words" -> "into words"
* "they are too frequent" -> "they occur too frequently" (I think a
word cannot "be" frequent)
* "The Punctuation sign -" -> "The punctuation sign -" + put quotes
around the "-"
* "which shows all details of full text machinery" -> add "the" before "full"
* "is to mark out the different parts of document" -> add "a" before "document"
* "by the 1 + logarithm" -> "by 1 + the logarithm"
* "i.e., ordering of search results will not change" -> add "the"
before "ordering", maybe also before "search"
* "note that second example" -> add "the" before "second"
* "than ones with labeled with D" -> "than ones labeled with D" or
"than ones that are labeled with D"
* "Unfortunately, it is almost impossible to avoid since full text
indexing in a database should work without indexes" -> I don't get it
* "to show part of each document" -> add "a" before "part"
* "provides the function headline" -> add something, such as "to
accomplish this" or "that implements such functionality" or something.
* "ellipse-separated" -> "ellipsis-separated"
* "the cascade dropping of the parser function cause dropping of the
headling" -> I don't get the meaning of the sentence. I guess that
"cause" should be "causes" and "headling" should be 'heading"

14.4. Dictionaries

* "to use any word form in a query" -> "to use any derived form of a
word in a query"
* "infinitive" -> is this the right term? I think it only applies to
verbs (used twice in this section)
* "colour" -> is the manual supposed to be UK or US English? I cannot
remember ever having read any UK-isms before
* "substituted to their" -> replace "to" with "by" or "with" (native
English speakers, help me here)
* "see dictionary for integers Section 14.11 as an example" -> strange
way of referring, I would put parenthesis around the section number,
or alternatively put the section number before the title
* "Lexemes come through a stack" -> replace "come through" with "are
processed by" or something
* "appears as a stop-word" -> "turns out to be a stop-word", also
"stop word" is used elsewhere (without the "-") (this inconsistency
occurs a lot in this section)
* "Also, the ts_debug function ( Section 14.10 ) is very useful for
this." -> the spaces around the section reference look strange. maybe
replace "is very useful" by "can be used"
* "and appear in almost every document" -> two times "and" sounds bad,
replace this "and" by a comma
* "discrimination value so they can be ignored in" -> cut this in two
sentences: "discrimination value. Therefore, they can be ignored in
the context of"
* "word like a and it is useless to have them in an index" -> replace
"word" with "words", make "a" somehow stand out (quotes?), replace
"and" with "although" and "have" with "store"
* "However stop words" -> "However, stop words"
* "does affect ranking" -> "do affect ranking" (I think both can be
considered correct, but like this one better)
* "Relative paths in OPTION resolve relative to share/" -> and
"share/" is relative to what? such references occur elsewhere in this
section
* "Synonym dictionary can be used" -> replace "dictonary" with
"dictionaries", or alternatively, put "A" before "synonym"
* "thesynonym" -> add a space
* "en_stemm" -> "en_stem"
* "abbeviated" -> "abbreviated"
* "preferred terms, non-preferred, related terms" -> add "terms" after
"non-preferred", or alternatively, remove all "terms" references apart
from the last one
* "in the thesaurus requires reindexing" -> replace "requires" with "require"
* "It is possible to define only one dictionary." -> I guess that
sentence wants to express that only one dictionary is allowed? In that
case, change to "It is only possible to define one dictionary."
* "Use asterisk" -> add "an" before "asterisk"
* "thesubdictionary" -> "the subdictionary"
* "It is still required that sample words should be known" -> don't
use "required" and "should" together: "sample words are still required
to be known"
* "Since thesaurus dictionary" -> add "a" before "thesaurus"
* "with parser" -> add "the" before "parser"
* "but we can use plainto_tsquery and to_tsvector functions" -> add
"the" before the name of the first function, or remove the "functions"
part
* "not a lexemes" -> "not lexemes"
* "on OpenOffice Wiki" -> add "the" before "OpenOffice"
* "does not supports" -> "does not support"
* "support of" -> "support for"
* "At present, Full text" -> I guess that "full" should not be capitalized
* "see Snowball site" -> add "the" before "Snowball"
* "which accepts a snowball stemmer" -> "that is accepted by a snowball stemmer"

14.5. Indexes

* "speedup" -> "speed up"
* "GiST(The Generalized Search Tree)-based" -> "GiST (Generalized
Search Tree)-based"
* "GIN(The Generalized Inverted Index)-based" -> "GIN (Generalized
Inverted Index)-based"
* "necessary consult the" -> add "to" before "consult"
* "and could be result" -> remove the "be"
* "transitive containment relation is realized" -> add "the" before "transitive"
* "Knuth,1973" -> add a space after the comma
* "i.e. parent is 'OR'-ed bit-strings" -> "i.e., a parent is the
result of 'OR'-ing the bit-strings"
* "of its limited" -> "of the limited"
* "The likelihood of false drops" -> what are "drops"? maybe this
needs to be "hits"?
* "while longer one are" -> replace "one" with "ones"
* "or the result" -> add "whether" before "the"
* "currently is currently" -> remove the first "currently"
* "but its performance" -> replace "its" with "their"
* "heap, so" -> "heap. Therefore, "
* "In example below" -> add "the" before "example"
* "constraint_exclusion" -> why the underscore? should be a link

14.6. Configuration

* "all of the options" -> maybe remove "of the"
* "objects a set" -> add a comma before "a"

14.7. Limitations

* "Length of" -> "The length of" (twice)
* "less then" -> "less than"

None of the numbers use commas to separate the thousands, except for one.

14.8. psql Support

14.9. Application Tutorial

* "searchs" -> "searches"
* "is last-modified date" -> add "the" after "is"

14.10. Debugging

* "Word supernovaes" -> "The word supernovaes"
* "end the dictionary stack" -> add "the" before "dictionary"
* "specifies maximum length" -> add "the" before "maximum"

14.12. Example of Creating a Parser

* "Note it should" -> insert "that" after "Note"
* "The void function" -> replace "The" with "This"

Nicolas

-- 
Nicolas Barbier
http://www.gnu.org/philosophy/no-word-attachments.html


Re: Updated tsearch documentation

From
Bruce Momjian
Date:
Thanks, I applied this patch and rebuild HTML version.  I was wondering
how I was going to make all the changes accurately.  ;-)

---------------------------------------------------------------------------

Nicolas Barbier wrote:
> 2007/7/7, Bruce Momjian <bruce@momjian.us>:
> 
> > FYI, I have massively reorganized the text search documentation and it
> > is getting closer to something I am happy with:
> >
> >         http://momjian.us/expire/fulltext/HTML/textsearch.html
> 
> The following is the result of me proofreading, mainly searching for
> small mistakes such as spelling/grammatical errors (that means no
> document structure comments, etc).
> 
> All corrections are relative to the version of the text at above URL
> at the time of me reading it :-).
> 
> General
> 
> It seems to be a recurring problem that commas are not put between the
> brackets when an argument is optional. For example:
> "to_tsvector([conf_name], document TEXT)" -> I guess this should be
> "to_tsvector([conf_name,] document TEXT)"
> 
> Full-text vs. full text and stop-word vs. stop word are not used
> consistently. Also, capitalization of full text searching is not used
> consistently.
> 

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: Updated tsearch documentation

From
Bruce Momjian
Date:
I think the tsearch documentation is nearing completion:
http://momjian.us/expire/fulltext/HTML/textsearch.html

but I am not happy with how tsearch is enabled in a user table:
http://momjian.us/expire/fulltext/HTML/textsearch-app-tutorial.html

Aside from the fact that it needs more examples, it only illustrates an
example where someone creates a table, populates it, then adds a
tsvector column, populates that, then creates an index.

That seems quite inflexible.  Is there a way to avoid having a separate
tsvector column?  What happens if the table is dynamic?  How is that
column updated based on table changes?  Triggers?  Where are the
examples?  Can you create an index like this:
CREATE INDEX textsearch_id ON pgweb USING gin(to_tsvector(column));

That avoids having to have a separate column because you can just say:
WHERE to_query('XXX') @@ to_tsvector(column)

How do we make sure that the to_query is using the same text search
configuration as the 'column' or index?  Perhaps we should suggest:
 CREATE INDEX textsearch_idx ON pgweb USING gin(to_tsvector('english',column));

so that at least the configuration is documented in the index.

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: Updated tsearch documentation

From
Oleg Bartunov
Date:
On Tue, 17 Jul 2007, Bruce Momjian wrote:

> I think the tsearch documentation is nearing completion:
>
>     http://momjian.us/expire/fulltext/HTML/textsearch.html
>
> but I am not happy with how tsearch is enabled in a user table:
>
>     http://momjian.us/expire/fulltext/HTML/textsearch-app-tutorial.html
>
> Aside from the fact that it needs more examples, it only illustrates an
> example where someone creates a table, populates it, then adds a
> tsvector column, populates that, then creates an index.
>
> That seems quite inflexible.  Is there a way to avoid having a separate
> tsvector column?  What happens if the table is dynamic?  How is that
> column updated based on table changes?  Triggers?  Where are the
> examples?  Can you create an index like this:

I agree, that there are could be more examples, but text search doesn't
require something special !
*Example* of trigger function is documented on 
http://momjian.us/expire/fulltext/HTML/textsearch-opfunc.html


>
>     CREATE INDEX textsearch_id ON pgweb USING gin(to_tsvector(column));
>
> That avoids having to have a separate column because you can just say:
>
>     WHERE to_query('XXX') @@ to_tsvector(column)

yes, it's possible, but without ranking, since currently it's impossible 
to store any information in index (it's pg's feature). btw, this should
works and for GiST index also.

That kind of search is useful if there is  another natural ordering of search 
results, for example, by timestamp.

>
> How do we make sure that the to_query is using the same text search
> configuration as the 'column' or index?  Perhaps we should suggest:

please, keep in mind, it's not mandatory to use the same configuration
at search time, that was used at index creation.

>
>  CREATE INDEX textsearch_idx ON pgweb USING gin(to_tsvector('english',column));
>
> so that at least the configuration is documented in the index.

yes, it's better to always explicitly specify configuration name and not 
rely on default configuration. 
Unfortunately, configuration name doesn't saved in the index.
    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


Re: Updated tsearch documentation

From
Oleg Bartunov
Date:
On Tue, 17 Jul 2007, Oleg Bartunov wrote:

> On Tue, 17 Jul 2007, Bruce Momjian wrote:
>
>> I think the tsearch documentation is nearing completion:
>>
>>     http://momjian.us/expire/fulltext/HTML/textsearch.html
>> 
>> but I am not happy with how tsearch is enabled in a user table:
>>
>>     http://momjian.us/expire/fulltext/HTML/textsearch-app-tutorial.html
>> 
>> Aside from the fact that it needs more examples, it only illustrates an
>> example where someone creates a table, populates it, then adds a
>> tsvector column, populates that, then creates an index.
>> 
>> That seems quite inflexible.  Is there a way to avoid having a separate
>> tsvector column?  What happens if the table is dynamic?  How is that
>> column updated based on table changes?  Triggers?  Where are the
>> examples?  Can you create an index like this:
>
> I agree, that there are could be more examples, but text search doesn't
> require something special !
> *Example* of trigger function is documented on 
> http://momjian.us/expire/fulltext/HTML/textsearch-opfunc.html
>

Bruce,

below is an example of trigger for  insert/update of example table

create function pgweb_update() returns trigger as 
$$
BEGIN   NEW.textsearch_index=   setweight( to_tsvector( coalesce (title,'')), 'A' ) || ' ' ||   setweight(
to_tsvector(coalesce(body,'')),'D'); RETURN NEW;
 
END;
$$ 
language plpgsql;

CREATE TRIGGER fts_update BEFORE INSERT OR UPDATE ON pgweb
FOR EACH ROW EXECUTE PROCEDURE pgweb_update();


>
>>
>>     CREATE INDEX textsearch_id ON pgweb USING gin(to_tsvector(column));
>> 
>> That avoids having to have a separate column because you can just say:
>>
>>     WHERE to_query('XXX') @@ to_tsvector(column)
>
> yes, it's possible, but without ranking, since currently it's impossible to 
> store any information in index (it's pg's feature). btw, this should
> works and for GiST index also.
>
> That kind of search is useful if there is  another natural ordering of search 
> results, for example, by timestamp.
>
>> 
>> How do we make sure that the to_query is using the same text search
>> configuration as the 'column' or index?  Perhaps we should suggest:
>
> please, keep in mind, it's not mandatory to use the same configuration
> at search time, that was used at index creation.
>

one example is when text search index created without taking into account 
stop-words. Then you could search famous 'to be or not to be' with the
same configuration, or ignore stop words with other.


>>
>>  CREATE INDEX textsearch_idx ON pgweb USING 
>> gin(to_tsvector('english',column));
>> 
>> so that at least the configuration is documented in the index.
>
> yes, it's better to always explicitly specify configuration name and not rely 
> on default configuration. Unfortunately, configuration name doesn't saved in 
> the index.
>
>     Regards,
>         Oleg
> _____________________________________________________________
> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
> Sternberg Astronomical Institute, Moscow University, Russia
> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
> phone: +007(495)939-16-83, +007(495)939-23-83
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
>              http://archives.postgresql.org
>
    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


Re: Updated tsearch documentation

From
Bruce Momjian
Date:
Oleg Bartunov wrote:
> On Tue, 17 Jul 2007, Bruce Momjian wrote:
> 
> > I think the tsearch documentation is nearing completion:
> >
> >     http://momjian.us/expire/fulltext/HTML/textsearch.html
> >
> > but I am not happy with how tsearch is enabled in a user table:
> >
> >     http://momjian.us/expire/fulltext/HTML/textsearch-app-tutorial.html
> >
> > Aside from the fact that it needs more examples, it only illustrates an
> > example where someone creates a table, populates it, then adds a
> > tsvector column, populates that, then creates an index.
> >
> > That seems quite inflexible.  Is there a way to avoid having a separate
> > tsvector column?  What happens if the table is dynamic?  How is that
> > column updated based on table changes?  Triggers?  Where are the
> > examples?  Can you create an index like this:
> 
> I agree, that there are could be more examples, but text search doesn't
> require something special !
> *Example* of trigger function is documented on 
> http://momjian.us/expire/fulltext/HTML/textsearch-opfunc.html

Yes, I see that in tsearch() here:
http://momjian.us/expire/fulltext/HTML/textsearch-opfunc.html#TEXTSEARC$

I assume my_filter_name is optional right?  I have updated the prototype
to be:
tsearch([vector_column_name], [my_filter_name], text_column_name [, ... ])

Is this accurate?  What does this text below it mean?
There can be many functions and text columns specified in a tsearch()trigger. The following rule is used: a function is
appliedto allsubsequent TEXT columns until the next matching column occurs. 
 

Why are we allowing my_filter_name here?  Isn't that something for a
custom trigger.  Is calling it tsearch() a good idea?  Why not
tsvector_trigger().

> >     CREATE INDEX textsearch_id ON pgweb USING gin(to_tsvector(column));
> >
> > That avoids having to have a separate column because you can just say:
> >
> >     WHERE to_query('XXX') @@ to_tsvector(column)
> 
> yes, it's possible, but without ranking, since currently it's impossible 
> to store any information in index (it's pg's feature). btw, this should
> works and for GiST index also.

What if they use @@@.  Wouldn't that work because it is going to check
the heap?

> That kind of search is useful if there is  another natural ordering of search 
> results, for example, by timestamp.
> 
> >
> > How do we make sure that the to_query is using the same text search
> > configuration as the 'column' or index?  Perhaps we should suggest:
> 
> please, keep in mind, it's not mandatory to use the same configuration
> at search time, that was used at index creation.

Well, sort of.  If you have stop words in the tquery configuration, you
aren't going to hit any matches in the tsvector, right?  Same for
synonymns, I suppose.  I can see that stemming would work if there was a
mismatch between tsquery and tsvector.

> >  CREATE INDEX textsearch_idx ON pgweb USING gin(to_tsvector('english',column));
> >
> > so that at least the configuration is documented in the index.
> 
> yes, it's better to always explicitly specify configuration name and not 
> rely on default configuration. 
> Unfortunately, configuration name doesn't saved in the index.

I was more concerned that there is nothing documenting the configuration
used by the index or the tsvector table column trigger.  By doing:
CREATE INDEX textsearch_idx ON pgweb USING gin(to_tsvector('english',column));

you guarantee that the index uses 'english' for all its entries.  If you
omit the 'english' or use a different configuration, it will heap scan
the table, which at least gives the right answer.

Also, how do you guarantee that tsearch() triggers always uses the same
configuration?  The existing tsearch() API seems to make that
impossible.  I am wondering if we need to add the configuration name as
a mandatory parameter to tsearch().

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: Updated tsearch documentation

From
Michael Glaesemann
Date:
On Jul 17, 2007, at 16:24 , Bruce Momjian wrote:

> I assume my_filter_name is optional right?  I have updated the  
> prototype
> to be:
>
>     tsearch([vector_column_name], [my_filter_name], text_column_name  
> [, ... ])

Just a style point, but would [filter_name] be better than  
[my_filter_name]? You're not qualifying the others with my_ ... or is  
there something you want to tell us, Bruce? :)

Michael Glaesemann
grzm seespotcode net




Re: Updated tsearch documentation

From
Bruce Momjian
Date:
Michael Glaesemann wrote:
> 
> On Jul 17, 2007, at 16:24 , Bruce Momjian wrote:
> 
> > I assume my_filter_name is optional right?  I have updated the  
> > prototype
> > to be:
> >
> >     tsearch([vector_column_name], [my_filter_name], text_column_name  
> > [, ... ])
> 
> Just a style point, but would [filter_name] be better than  
> [my_filter_name]? You're not qualifying the others with my_ ... or is  
> there something you want to tell us, Bruce? :)

Agreed.  Done.

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: Updated tsearch documentation

From
Oleg Bartunov
Date:
On Tue, 17 Jul 2007, Bruce Momjian wrote:

> Oleg Bartunov wrote:
>> On Tue, 17 Jul 2007, Bruce Momjian wrote:
>>
>>> I think the tsearch documentation is nearing completion:
>>>
>>>     http://momjian.us/expire/fulltext/HTML/textsearch.html
>>>
>>> but I am not happy with how tsearch is enabled in a user table:
>>>
>>>     http://momjian.us/expire/fulltext/HTML/textsearch-app-tutorial.html
>>>
>>> Aside from the fact that it needs more examples, it only illustrates an
>>> example where someone creates a table, populates it, then adds a
>>> tsvector column, populates that, then creates an index.
>>>
>>> That seems quite inflexible.  Is there a way to avoid having a separate
>>> tsvector column?  What happens if the table is dynamic?  How is that
>>> column updated based on table changes?  Triggers?  Where are the
>>> examples?  Can you create an index like this:
>>
>> I agree, that there are could be more examples, but text search doesn't
>> require something special !
>> *Example* of trigger function is documented on
>> http://momjian.us/expire/fulltext/HTML/textsearch-opfunc.html
>
> Yes, I see that in tsearch() here:
>
>     http://momjian.us/expire/fulltext/HTML/textsearch-opfunc.html#TEXTSEARC$
>
> I assume my_filter_name is optional right?  I have updated the prototype
> to be:
>
>     tsearch([vector_column_name], [my_filter_name], text_column_name [, ... ])
>
> Is this accurate?  What does this text below it mean?

no, this in inaccurate. First, vector_column_name is not optional argument,
it's a name of tsvector column name.

>
>     There can be many functions and text columns specified in a tsearch()
>     trigger. The following rule is used: a function is applied to all
>     subsequent TEXT columns until the next matching column occurs.

The idea, is to provide user to preprocess text before applying 
tsearch machinery. my_filter_name() preprocess text_column_name1,
text_column_name2,....
The original syntax allows to specify for every text columns their 
preprocessing functions.

So, I suggest to keep original syntax, change 'vector_column_name' to
'tsvector_column_name'.

>
> Why are we allowing my_filter_name here?  Isn't that something for a
> custom trigger.  Is calling it tsearch() a good idea?  Why not
> tsvector_trigger().

I don't see any benefit from the tsvector_trigger() name. If you want to add
some semantic, than tsvector_update_trigger() would be better.  Anyway,
this trigger is an illustration.

>
>>>     CREATE INDEX textsearch_id ON pgweb USING gin(to_tsvector(column));
>>>
>>> That avoids having to have a separate column because you can just say:
>>>
>>>     WHERE to_query('XXX') @@ to_tsvector(column)
>>
>> yes, it's possible, but without ranking, since currently it's impossible
>> to store any information in index (it's pg's feature). btw, this should
>> works and for GiST index also.
>
> What if they use @@@.  Wouldn't that work because it is going to check
> the heap?

It would work, it'd recalculate to_tsvector(column) for rows found
( for GiST - to remove false hits and for weight information, for 
GIN - for weight information only).

>
>> That kind of search is useful if there is  another natural ordering of search
>> results, for example, by timestamp.
>>
>>>
>>> How do we make sure that the to_query is using the same text search
>>> configuration as the 'column' or index?  Perhaps we should suggest:
>>
>> please, keep in mind, it's not mandatory to use the same configuration
>> at search time, that was used at index creation.
>
> Well, sort of.  If you have stop words in the tquery configuration, you
> aren't going to hit any matches in the tsvector, right?  Same for
> synonymns, I suppose.  I can see that stemming would work if there was a
> mismatch between tsquery and tsvector.
>
>>>  CREATE INDEX textsearch_idx ON pgweb USING gin(to_tsvector('english',column));
>>>
>>> so that at least the configuration is documented in the index.
>>
>> yes, it's better to always explicitly specify configuration name and not
>> rely on default configuration.
>> Unfortunately, configuration name doesn't saved in the index.

as Teodor corrected me, index doesn't know about configuration at all !
What accurate user could do, is to provide configuration name in the 
comment for tsvector column. Configuration name is an accessory of
to_tsvector() function.

In principle, tsvector as any data type could be obtained by any other ways,
for example, OpenFTS construct tsvector following its own rules.

>
> I was more concerned that there is nothing documenting the configuration
> used by the index or the tsvector table column trigger.  By doing:

again, index has nothing with configuration name.
Our trigger function is an example, which uses default configuration name.
User could easily write it's own trigger to keep tsvector column up to date 
and use configuration name as a parameter.

>
>     CREATE INDEX textsearch_idx ON pgweb USING gin(to_tsvector('english',column));
>
> you guarantee that the index uses 'english' for all its entries.  If you
> omit the 'english' or use a different configuration, it will heap scan
> the table, which at least gives the right answer.

sometimes it's useful not to use explicitly configuration name 
to be able to use index with different configuration. Just change
tsearch_conf_name.

> Also, how do you guarantee that tsearch() triggers always uses the same
> configuration?  The existing tsearch() API seems to make that
> impossible.  I am wondering if we need to add the configuration name as
> a mandatory parameter to tsearch().

Using the same tsearch_conf_name, which could be defined by many ways, 
you guarantee to use the same configuration.
    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


Re: Updated tsearch documentation

From
Bruce Momjian
Date:
Oleg Bartunov wrote:
> >> I agree, that there are could be more examples, but text search doesn't
> >> require something special !
> >> *Example* of trigger function is documented on
> >> http://momjian.us/expire/fulltext/HTML/textsearch-opfunc.html
> >
> > Yes, I see that in tsearch() here:
> >
> >     http://momjian.us/expire/fulltext/HTML/textsearch-opfunc.html#TEXTSEARC$
> >
> > I assume my_filter_name is optional right?  I have updated the prototype
> > to be:
> >
> >     tsearch([vector_column_name], [my_filter_name], text_column_name [, ... ])
> >
> > Is this accurate?  What does this text below it mean?
> 
> no, this in inaccurate. First, vector_column_name is not optional argument,
> it's a name of tsvector column name.

Fixed.

> >     There can be many functions and text columns specified in a tsearch()
> >     trigger. The following rule is used: a function is applied to all
> >     subsequent TEXT columns until the next matching column occurs.
> 
> The idea, is to provide user to preprocess text before applying 
> tsearch machinery. my_filter_name() preprocess text_column_name1,
> text_column_name2,....
> The original syntax allows to specify for every text columns their 
> preprocessing functions.
> 
> So, I suggest to keep original syntax, change 'vector_column_name' to
> 'tsvector_column_name'.

OK, change made.

> > Why are we allowing my_filter_name here?  Isn't that something for a
> > custom trigger.  Is calling it tsearch() a good idea?  Why not
> > tsvector_trigger().
> 
> I don't see any benefit from the tsvector_trigger() name. If you want to add
> some semantic, than tsvector_update_trigger() would be better.  Anyway,
> this trigger is an illustration.

Well, the filter that removes '@' might be an example, but tsearch() is
indeed sort of built-in trigger function to be used for simple cases. 
My point is that because it is only for simple cases, why add complexity
and allow a filter?  It seems best to just remove the filter idea and
let people write their own triggers if they want that functionality.

> >>>     CREATE INDEX textsearch_id ON pgweb USING gin(to_tsvector(column));
> >>>
> >>> That avoids having to have a separate column because you can just say:
> >>>
> >>>     WHERE to_query('XXX') @@ to_tsvector(column)
> >>
> >> yes, it's possible, but without ranking, since currently it's impossible
> >> to store any information in index (it's pg's feature). btw, this should
> >> works and for GiST index also.
> >
> > What if they use @@@.  Wouldn't that work because it is going to check
> > the heap?
> 
> It would work, it'd recalculate to_tsvector(column) for rows found
> ( for GiST - to remove false hits and for weight information, for 
> GIN - for weight information only).

Right.  Currently to use text search on a table, you have to do three
things:
o  add a tsvector column to the tableo  add a trigger to keep the tsvector column currento  add an index to the
tsvectorcolumn
 

My question is why bother with the first two steps?  If you do:
CREATE INDEX textsearch_idx ON pgweb USING gist(to_tsvector('english',column));

you don't need a separate column and a trigger to keep it current.  The
index is kept current as part of normal query processing.  The only
downside is that you have to do to_tsvector() in the heap to avoid false
hits, but that seems minor compared to the disk savings of not having
the separate column.  Is to_tsvector() an expensive function?

> >>>  CREATE INDEX textsearch_idx ON pgweb USING gin(to_tsvector('english',column));
> >>>
> >>> so that at least the configuration is documented in the index.
> >>
> >> yes, it's better to always explicitly specify configuration name and not
> >> rely on default configuration.
> >> Unfortunately, configuration name doesn't saved in the index.
> 
> as Teodor corrected me, index doesn't know about configuration at all !
> What accurate user could do, is to provide configuration name in the 
> comment for tsvector column. Configuration name is an accessory of
> to_tsvector() function.

Well, if you create the index with the configuration name it is
guaranteed to match:
CREATE INDEX textsearch_idx ON pgweb USING gist(to_tsvector('english',column));
                   -------
 
And if someone does:
WHERE 'friend'::tsquery @@ to_tsvector('english',column))

the index is used.  Now if the default configuration is 'english' and
they use:
WHERE 'friend'::tsquery @@ to_tsvector(column))

the index is not used, but this just a good example of why default
configurations aren't that useful.  One problem I see is that if the
default configuration is not 'english', then when the index consults the
heap, it would be using a different configuration and yield incorrect
results.  I am unsure how to fix that.

With the trigger idea, you have to be sure your configuration is the same
every time you INSERT/UPDATE the table or the index will have mixed
configuration entries and it will yield incorrect results, aside from
the heap configuration lookup not matching the index.

Once we nail this down we will have to have a documentation section
about configuration mismatches.

> In principle, tsvector as any data type could be obtained by any other ways,
> for example, OpenFTS construct tsvector following its own rules.
> 
> >
> > I was more concerned that there is nothing documenting the configuration
> > used by the index or the tsvector table column trigger.  By doing:
> 
> again, index has nothing with configuration name.
> Our trigger function is an example, which uses default configuration name.
> User could easily write it's own trigger to keep tsvector column up to date 
> and use configuration name as a parameter.

Right. I am thinking beyond that issue.

> >     CREATE INDEX textsearch_idx ON pgweb USING gin(to_tsvector('english',column));
> >
> > you guarantee that the index uses 'english' for all its entries.  If you
> > omit the 'english' or use a different configuration, it will heap scan
> > the table, which at least gives the right answer.
> 
> sometimes it's useful not to use explicitly configuration name 
> to be able to use index with different configuration. Just change
> tsearch_conf_name.

I assume you are saying the benefit is for tsquery to use a different
configuration, not having some tsvector index entries using different
configurations than others.

> > Also, how do you guarantee that tsearch() triggers always uses the same
> > configuration?  The existing tsearch() API seems to make that
> > impossible.  I am wondering if we need to add the configuration name as
> > a mandatory parameter to tsearch().
> 
> Using the same tsearch_conf_name, which could be defined by many ways, 
> you guarantee to use the same configuration.

Yea, I am sure you _can_ do it.  The question is how can we make it less
error-prone.

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: Updated tsearch documentation

From
Oleg Bartunov
Date:
On Wed, 18 Jul 2007, Bruce Momjian wrote:

>
>>> Why are we allowing my_filter_name here?  Isn't that something for a
>>> custom trigger.  Is calling it tsearch() a good idea?  Why not
>>> tsvector_trigger().
>>
>> I don't see any benefit from the tsvector_trigger() name. If you want to add
>> some semantic, than tsvector_update_trigger() would be better.  Anyway,
>> this trigger is an illustration.
>
> Well, the filter that removes '@' might be an example, but tsearch() is
> indeed sort of built-in trigger function to be used for simple cases.
> My point is that because it is only for simple cases, why add complexity
> and allow a filter?  It seems best to just remove the filter idea and
> let people write their own triggers if they want that functionality.

If you aware about documentation simplicity than we could just document 
two versions:
1. without filter function - simple, well understood syntax
2. with filter function - for advanced users

I don't want to remove the feature which works for year without any problem.


>
>>>>>     CREATE INDEX textsearch_id ON pgweb USING gin(to_tsvector(column));
>>>>>
>>>>> That avoids having to have a separate column because you can just say:
>>>>>
>>>>>     WHERE to_query('XXX') @@ to_tsvector(column)
>>>>
>>>> yes, it's possible, but without ranking, since currently it's impossible
>>>> to store any information in index (it's pg's feature). btw, this should
>>>> works and for GiST index also.
>>>
>>> What if they use @@@.  Wouldn't that work because it is going to check
>>> the heap?
>>
>> It would work, it'd recalculate to_tsvector(column) for rows found
>> ( for GiST - to remove false hits and for weight information, for
>> GIN - for weight information only).
>
> Right.  Currently to use text search on a table, you have to do three
> things:
>
>     o  add a tsvector column to the table
>     o  add a trigger to keep the tsvector column current
>     o  add an index to the tsvector column
>
> My question is why bother with the first two steps?  If you do:
>
> CREATE INDEX textsearch_idx ON pgweb USING gist(to_tsvector('english',column));
>
> you don't need a separate column and a trigger to keep it current.  The
> index is kept current as part of normal query processing.  The only
> downside is that you have to do to_tsvector() in the heap to avoid false
> hits, but that seems minor compared to the disk savings of not having
> the separate column.  Is to_tsvector() an expensive function?

Bruce, you oversimplify the text search, the document could be fully virtual,
not a column(s), it could be a result of any SQL commands, so it could be 
very expensive just to obtain document, and yes, to_tsvector could be
very expensive, depending on the document size, parser and dictionaries used.

And, again, current postgres architecture forces to use heap to store
positional and weight information for ranking.

The use case for what you described is very limited - simple text search
on one/several column of the same table without ranking.

>
>>>>>  CREATE INDEX textsearch_idx ON pgweb USING gin(to_tsvector('english',column));
>>>>>
>>>>> so that at least the configuration is documented in the index.
>>>>
>>>> yes, it's better to always explicitly specify configuration name and not
>>>> rely on default configuration.
>>>> Unfortunately, configuration name doesn't saved in the index.
>>
>> as Teodor corrected me, index doesn't know about configuration at all !
>> What accurate user could do, is to provide configuration name in the
>> comment for tsvector column. Configuration name is an accessory of
>> to_tsvector() function.
>
> Well, if you create the index with the configuration name it is
> guaranteed to match:
>
> CREATE INDEX textsearch_idx ON pgweb USING gist(to_tsvector('english',column));
>                                                              -------
> And if someone does:
>
>     WHERE 'friend'::tsquery @@ to_tsvector('english',column))
>
> the index is used.  Now if the default configuration is 'english' and
> they use:
>
>     WHERE 'friend'::tsquery @@ to_tsvector(column))
>
> the index is not used, but this just a good example of why default
> configurations aren't that useful.  One problem I see is that if the
> default configuration is not 'english', then when the index consults the
> heap, it would be using a different configuration and yield incorrect
> results.  I am unsure how to fix that.

again, you consider very simple case  and actually, your example is a 
good example of usefulness of default configuration ! Just think before
you develop your application, but this is very general rule. There are
zillions situations you could do bad things, after all.

Moreover, consider text search on text column, there is no way to specify 
configuration at all ! We rely on default configuration here

CREATE INDEX textsearch_idx ON pgweb USING gin(title);

>
> With the trigger idea, you have to be sure your configuration is the same
> every time you INSERT/UPDATE the table or the index will have mixed
> configuration entries and it will yield incorrect results, aside from
> the heap configuration lookup not matching the index.
>
> Once we nail this down we will have to have a documentation section
> about configuration mismatches.

So what is your proposal ? I'm lost a bit. I suggest to begin new thread :)

>
>> In principle, tsvector as any data type could be obtained by any other ways,
>> for example, OpenFTS construct tsvector following its own rules.
>>
>>>
>>> I was more concerned that there is nothing documenting the configuration
>>> used by the index or the tsvector table column trigger.  By doing:
>>
>> again, index has nothing with configuration name.
>> Our trigger function is an example, which uses default configuration name.
>> User could easily write it's own trigger to keep tsvector column up to date
>> and use configuration name as a parameter.
>
> Right. I am thinking beyond that issue.
>
>>>     CREATE INDEX textsearch_idx ON pgweb USING gin(to_tsvector('english',column));
>>>
>>> you guarantee that the index uses 'english' for all its entries.  If you
>>> omit the 'english' or use a different configuration, it will heap scan
>>> the table, which at least gives the right answer.
>>
>> sometimes it's useful not to use explicitly configuration name
>> to be able to use index with different configuration. Just change
>> tsearch_conf_name.
>
> I assume you are saying the benefit is for tsquery to use a different
> configuration, not having some tsvector index entries using different
> configurations than others.

sure, but not necessarily. You can mix different configurations if they 
are 'compatible'. Documents could be created from different sources and
could require different dictionaries. I don't want to limit people
in creating complex applications. If you unsure, then always use 
explicit configuration name. What's the problem ?

>
>>> Also, how do you guarantee that tsearch() triggers always uses the same
>>> configuration?  The existing tsearch() API seems to make that
>>> impossible.  I am wondering if we need to add the configuration name as
>>> a mandatory parameter to tsearch().
>>
>> Using the same tsearch_conf_name, which could be defined by many ways,
>> you guarantee to use the same configuration.
>
> Yea, I am sure you _can_ do it.  The question is how can we make it less
> error-prone.

We have only ONE variable - configuration name. It could be explicitly 
defined and then there is no problem at all, or specified by 
GUC variable tsearch_conf_name.  What could be simpler !

The question is do we need to define tsearch_conf_name automagically if
it's not defined ? I inclined don't do that and just issue error.

I believe this is enough error-prone.
    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


Re: Updated tsearch documentation

From
Bruce Momjian
Date:
Oleg Bartunov wrote:
> On Wed, 18 Jul 2007, Bruce Momjian wrote:
> 
> >
> >>> Why are we allowing my_filter_name here?  Isn't that something for a
> >>> custom trigger.  Is calling it tsearch() a good idea?  Why not
> >>> tsvector_trigger().
> >>
> >> I don't see any benefit from the tsvector_trigger() name. If you want to add
> >> some semantic, than tsvector_update_trigger() would be better.  Anyway,
> >> this trigger is an illustration.
> >
> > Well, the filter that removes '@' might be an example, but tsearch() is
> > indeed sort of built-in trigger function to be used for simple cases.
> > My point is that because it is only for simple cases, why add complexity
> > and allow a filter?  It seems best to just remove the filter idea and
> > let people write their own triggers if they want that functionality.
> 
> If you aware about documentation simplicity than we could just document 
> two versions:
> 1. without filter function - simple, well understood syntax
> 2. with filter function - for advanced users
> 
> I don't want to remove the feature which works for year without any problem.

Yes, this is what I want.  I would like to show the simple usage first,
then explain that a more complex usage is possible.  This will help
people get started using text search.  Triggers and secondary columns
are fine, but to start using it the CREATE INDEX-only case is best.  I
don't suggest we remove any capabilities, only suggest simple solutions.

> >>>>>     CREATE INDEX textsearch_id ON pgweb USING gin(to_tsvector(column));
> >>>>>
> >>>>> That avoids having to have a separate column because you can just say:
> >>>>>
> >>>>>     WHERE to_query('XXX') @@ to_tsvector(column)
> >>>>
> >>>> yes, it's possible, but without ranking, since currently it's impossible
> >>>> to store any information in index (it's pg's feature). btw, this should
> >>>> works and for GiST index also.
> >>>
> >>> What if they use @@@.  Wouldn't that work because it is going to check
> >>> the heap?
> >>
> >> It would work, it'd recalculate to_tsvector(column) for rows found
> >> ( for GiST - to remove false hits and for weight information, for
> >> GIN - for weight information only).
> >
> > Right.  Currently to use text search on a table, you have to do three
> > things:
> >
> >     o  add a tsvector column to the table
> >     o  add a trigger to keep the tsvector column current
> >     o  add an index to the tsvector column
> >
> > My question is why bother with the first two steps?  If you do:
> >
> > CREATE INDEX textsearch_idx ON pgweb USING gist(to_tsvector('english',column));
> >
> > you don't need a separate column and a trigger to keep it current.  The
> > index is kept current as part of normal query processing.  The only
> > downside is that you have to do to_tsvector() in the heap to avoid false
> > hits, but that seems minor compared to the disk savings of not having
> > the separate column.  Is to_tsvector() an expensive function?
> 
> Bruce, you oversimplify the text search, the document could be fully virtual,
> not a column(s), it could be a result of any SQL commands, so it could be 
> very expensive just to obtain document, and yes, to_tsvector could be
> very expensive, depending on the document size, parser and dictionaries used.
> 
> And, again, current postgres architecture forces to use heap to store
> positional and weight information for ranking.
> 
> The use case for what you described is very limited - simple text search
> on one/several column of the same table without ranking.

Right, but I bet that that is all the majority of users need, at least
at first as they start to use text search.

> >>>>>  CREATE INDEX textsearch_idx ON pgweb USING gin(to_tsvector('english',column));
> >>>>>
> >>>>> so that at least the configuration is documented in the index.
> >>>>
> >>>> yes, it's better to always explicitly specify configuration name and not
> >>>> rely on default configuration.
> >>>> Unfortunately, configuration name doesn't saved in the index.
> >>
> >> as Teodor corrected me, index doesn't know about configuration at all !
> >> What accurate user could do, is to provide configuration name in the
> >> comment for tsvector column. Configuration name is an accessory of
> >> to_tsvector() function.
> >
> > Well, if you create the index with the configuration name it is
> > guaranteed to match:
> >
> > CREATE INDEX textsearch_idx ON pgweb USING gist(to_tsvector('english',column));
> >                                                              -------
> > And if someone does:
> >
> >     WHERE 'friend'::tsquery @@ to_tsvector('english',column))
> >
> > the index is used.  Now if the default configuration is 'english' and
> > they use:
> >
> >     WHERE 'friend'::tsquery @@ to_tsvector(column))
> >
> > the index is not used, but this just a good example of why default
> > configurations aren't that useful.  One problem I see is that if the
> > default configuration is not 'english', then when the index consults the
> > heap, it would be using a different configuration and yield incorrect
> > results.  I am unsure how to fix that.
> 
> again, you consider very simple case  and actually, your example is a 
> good example of usefulness of default configuration ! Just think before
> you develop your application, but this is very general rule. There are
> zillions situations you could do bad things, after all.

Right, but if we can reduce possible errors, that is better, and
embedding the configuration name in the expression index funciton does
that.

> Moreover, consider text search on text column, there is no way to specify 
> configuration at all ! We rely on default configuration here
> 
> CREATE INDEX textsearch_idx ON pgweb USING gin(title);

Ah, this should be:
CREATE INDEX textsearch_idx ON pgweb USING gin(to_tsvector('english', title));

What happens if you create a GIN index on a text column?  Does it
convert it to tsvector automatically?

> > With the trigger idea, you have to be sure your configuration is the same
> > every time you INSERT/UPDATE the table or the index will have mixed
> > configuration entries and it will yield incorrect results, aside from
> > the heap configuration lookup not matching the index.
> >
> > Once we nail this down we will have to have a documentation section
> > about configuration mismatches.
> 
> So what is your proposal ? I'm lost a bit. I suggest to begin new thread :)

Well, the topic is tsearch documentation, so I don't see how a new
thread would help us.

My idea is to suggest using expression indexes as a simple way to use
text search on a table, and mention the trigger idea as a more complex
solution.  Right now there is no mention of the expression index idea in
the documentation.

> >> sometimes it's useful not to use explicitly configuration name
> >> to be able to use index with different configuration. Just change
> >> tsearch_conf_name.
> >
> > I assume you are saying the benefit is for tsquery to use a different
> > configuration, not having some tsvector index entries using different
> > configurations than others.
> 
> sure, but not necessarily. You can mix different configurations if they 
> are 'compatible'. Documents could be created from different sources and
> could require different dictionaries. I don't want to limit people
> in creating complex applications. If you unsure, then always use 
> explicit configuration name. What's the problem ?

OK, I understand.  You can store rows needing different synonym tables
in the same column, and if you use the same synonym configuratin when
searching, that is OK.  Again, that is fine, but I would like to give
people a simple way to use text search that doesn't get into that
complexity.  I realize you understand it well, but I can tell you that
most people, including me, don't and you have to be simple at first to
get people to understand what is happening.

I am learning about the possibilities as I email you so obviously I
didn't fully understand at first either, and I have been working on this
for a while.

Good documentation lays things out simply then adds
complexity/flexibility after the fundamentals are understood.

> >>> Also, how do you guarantee that tsearch() triggers always uses the same
> >>> configuration?  The existing tsearch() API seems to make that
> >>> impossible.  I am wondering if we need to add the configuration name as
> >>> a mandatory parameter to tsearch().
> >>
> >> Using the same tsearch_conf_name, which could be defined by many ways,
> >> you guarantee to use the same configuration.
> >
> > Yea, I am sure you _can_ do it.  The question is how can we make it less
> > error-prone.
> 
> We have only ONE variable - configuration name. It could be explicitly 
> defined and then there is no problem at all, or specified by 
> GUC variable tsearch_conf_name.  What could be simpler !
> 
> The question is do we need to define tsearch_conf_name automagically if
> it's not defined ? I inclined don't do that and just issue error.

Agreed.  Please throw an error.

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: Updated tsearch documentation

From
Bruce Momjian
Date:
Oleg, Teodor,

I am confused by the following example.  How does gin know to create a
tsvector, or does it?  Does gist know too?   

FYI, at some point we need to chat via instant messenger or IRC to
discuss the open items.  My chat information is here:
http://momjian.us/main/contact.html

---------------------------------------------------------------------------

SELECT title
FROM pgweb
WHERE textcat(title,body) @@ plainto_tsquery('create table')
ORDER BY dlm DESC LIMIT 10;

CREATE INDEX pgweb_idx ON pgweb USING gin(textcat(title,body));

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: Updated tsearch documentation

From
Oleg Bartunov
Date:
On Wed, 18 Jul 2007, Bruce Momjian wrote:

> Oleg, Teodor,
>
> I am confused by the following example.  How does gin know to create a
> tsvector, or does it?  Does gist know too?

No, gist doesn't know. I don't remember why, Teodor ?

For GIN see http://archives.postgresql.org/pgsql-hackers/2007-05/msg00625.php
for discussion

>
> FYI, at some point we need to chat via instant messenger or IRC to
> discuss the open items.  My chat information is here:
>
>     http://momjian.us/main/contact.html

I send you invitation for google talk, I use only chat in gmail.
My gmail account is obartunov@gmail.com

>
> ---------------------------------------------------------------------------
>
> SELECT title
> FROM pgweb
> WHERE textcat(title,body) @@ plainto_tsquery('create table')
> ORDER BY dlm DESC LIMIT 10;
>
> CREATE INDEX pgweb_idx ON pgweb USING gin(textcat(title,body));
>
>
    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


Re: Updated tsearch documentation

From
Bruce Momjian
Date:
I have added more documentation to try to show how full text search is
used by user tables.  I think this the documentaiton is almost done:
http://momjian.us/expire/fulltext/HTML/textsearch-tables.html

---------------------------------------------------------------------------

Oleg Bartunov wrote:
> On Wed, 18 Jul 2007, Bruce Momjian wrote:
> 
> > Oleg, Teodor,
> >
> > I am confused by the following example.  How does gin know to create a
> > tsvector, or does it?  Does gist know too?
> 
> No, gist doesn't know. I don't remember why, Teodor ?
> 
> For GIN see http://archives.postgresql.org/pgsql-hackers/2007-05/msg00625.php
> for discussion
> 
> >
> > FYI, at some point we need to chat via instant messenger or IRC to
> > discuss the open items.  My chat information is here:
> >
> >     http://momjian.us/main/contact.html
> 
> I send you invitation for google talk, I use only chat in gmail.
> My gmail account is obartunov@gmail.com
> 
> >
> > ---------------------------------------------------------------------------
> >
> > SELECT title
> > FROM pgweb
> > WHERE textcat(title,body) @@ plainto_tsquery('create table')
> > ORDER BY dlm DESC LIMIT 10;
> >
> > CREATE INDEX pgweb_idx ON pgweb USING gin(textcat(title,body));
> >
> >
> 
>      Regards,
>          Oleg
> _____________________________________________________________
> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
> Sternberg Astronomical Institute, Moscow University, Russia
> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
> phone: +007(495)939-16-83, +007(495)939-23-83

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: Updated tsearch documentation

From
Oleg Bartunov
Date:
Bruce,

I sent you link to my wiki page with summary of changes
http://www.sai.msu.su/~megera/wiki/ts_changes

Your documentation looks rather old.

Oleg
On Tue, 24 Jul 2007, Bruce Momjian wrote:

>
> I have added more documentation to try to show how full text search is
> used by user tables.  I think this the documentaiton is almost done:
>
>     http://momjian.us/expire/fulltext/HTML/textsearch-tables.html
>
> ---------------------------------------------------------------------------
>
> Oleg Bartunov wrote:
>> On Wed, 18 Jul 2007, Bruce Momjian wrote:
>>
>>> Oleg, Teodor,
>>>
>>> I am confused by the following example.  How does gin know to create a
>>> tsvector, or does it?  Does gist know too?
>>
>> No, gist doesn't know. I don't remember why, Teodor ?
>>
>> For GIN see http://archives.postgresql.org/pgsql-hackers/2007-05/msg00625.php
>> for discussion
>>
>>>
>>> FYI, at some point we need to chat via instant messenger or IRC to
>>> discuss the open items.  My chat information is here:
>>>
>>>     http://momjian.us/main/contact.html
>>
>> I send you invitation for google talk, I use only chat in gmail.
>> My gmail account is obartunov@gmail.com
>>
>>>
>>> ---------------------------------------------------------------------------
>>>
>>> SELECT title
>>> FROM pgweb
>>> WHERE textcat(title,body) @@ plainto_tsquery('create table')
>>> ORDER BY dlm DESC LIMIT 10;
>>>
>>> CREATE INDEX pgweb_idx ON pgweb USING gin(textcat(title,body));
>>>
>>>
>>
>>      Regards,
>>          Oleg
>> _____________________________________________________________
>> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
>> Sternberg Astronomical Institute, Moscow University, Russia
>> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
>> phone: +007(495)939-16-83, +007(495)939-23-83
>
>
    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


Re: Updated tsearch documentation

From
Dimitri Fontaine
Date:
Hi,

Le mercredi 25 juillet 2007, Bruce Momjian a écrit :
> I have added more documentation to try to show how full text search is
> used by user tables.  I think this the documentaiton is almost done:
>
>     http://momjian.us/expire/fulltext/HTML/textsearch-tables.html

I've come to understand that GIN indexes are far more costly to update than
GiST one, and Oleg's wiki advice users to partition data and use GiST index
for live part and GIN index for archive part only.

Is it worth mentioning this into this part of the documentation?
And if mentioned here, partitioning step could certainly be part of the
example... or let it as a user exercise, but then explaining why GIN is a
good choice in the provided example.

Hope this helps, regards,
--
dim

Re: Updated tsearch documentation

From
Bruce Momjian
Date:
Oleg Bartunov wrote:
> Bruce,
> 
> I sent you link to my wiki page with summary of changes
> http://www.sai.msu.su/~megera/wiki/ts_changes
> 
> Your documentation looks rather old.

I have updated it to reflect your changes:
http://momjian.us/expire/fulltext/HTML/textsearch-tables.html

---------------------------------------------------------------------------


> 
> Oleg
> On Tue, 24 Jul 2007, Bruce Momjian wrote:
> 
> >
> > I have added more documentation to try to show how full text search is
> > used by user tables.  I think this the documentaiton is almost done:
> >
> >     http://momjian.us/expire/fulltext/HTML/textsearch-tables.html
> >
> > ---------------------------------------------------------------------------
> >
> > Oleg Bartunov wrote:
> >> On Wed, 18 Jul 2007, Bruce Momjian wrote:
> >>
> >>> Oleg, Teodor,
> >>>
> >>> I am confused by the following example.  How does gin know to create a
> >>> tsvector, or does it?  Does gist know too?
> >>
> >> No, gist doesn't know. I don't remember why, Teodor ?
> >>
> >> For GIN see http://archives.postgresql.org/pgsql-hackers/2007-05/msg00625.php
> >> for discussion
> >>
> >>>
> >>> FYI, at some point we need to chat via instant messenger or IRC to
> >>> discuss the open items.  My chat information is here:
> >>>
> >>>     http://momjian.us/main/contact.html
> >>
> >> I send you invitation for google talk, I use only chat in gmail.
> >> My gmail account is obartunov@gmail.com
> >>
> >>>
> >>> ---------------------------------------------------------------------------
> >>>
> >>> SELECT title
> >>> FROM pgweb
> >>> WHERE textcat(title,body) @@ plainto_tsquery('create table')
> >>> ORDER BY dlm DESC LIMIT 10;
> >>>
> >>> CREATE INDEX pgweb_idx ON pgweb USING gin(textcat(title,body));
> >>>
> >>>
> >>
> >>      Regards,
> >>          Oleg
> >> _____________________________________________________________
> >> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
> >> Sternberg Astronomical Institute, Moscow University, Russia
> >> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
> >> phone: +007(495)939-16-83, +007(495)939-23-83
> >
> >
> 
>      Regards,
>          Oleg
> _____________________________________________________________
> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
> Sternberg Astronomical Institute, Moscow University, Russia
> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
> phone: +007(495)939-16-83, +007(495)939-23-83

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


default_text_search_config and expression indexes

From
Bruce Momjian
Date:
Oleg Bartunov wrote:
> >> Second, I can't figure out how to reference a non-default
> >> configuration.
> >
> > See the multi-argument versions of to_tsvector etc.
> >
> > I do see a problem with having to_tsvector(config, text) plus
> > to_tsvector(text) where the latter implicitly references a config
> > selected by a GUC variable: how can you tell whether a query using the
> > latter matches a particular index using the former?  There isn't
> > anything in the current planner mechanisms that would make that work.
> 
> Probably, having default text search configuration is not a good idea
> and we could just require it as a mandatory parameter, which could
> eliminate many confusion with selecting text search configuration.

We have to decide if we want a GUC default_text_search_config, and if so
when can it be changed.

Right now there are three ways to create a tsvector (or tsquery)
::tsvectorto_tsvector(value)to_tsvector(config, value)

(ignoring plainto_tsvector)

Only the last one specifies the configuration. The others use the
configuration specified by default_text_search_config.  (We had an
previous discussion on what the default value of
default_text_search_config should be, and it was decided it should be
set via initdb based on a flag or the locale.)

Now, because most people use a single configuration, they can just set
default_text_search_config and there is no need to specify the
configuration name.

However, expression indexes cause a problem here:
http://momjian.us/expire/fulltext/HTML/textsearch-tables.html#TEXTSEARCH-TABLES-INDEX

We recommend that users create an expression index on the column they
want to do a full text search on, e.g.
CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector(body));

However, the big problem is that the expressions used in expression
indexes should not change their output based on the value of a GUC
variable (because it would corrupt the index), but in the case above,
default_text_search_config controls what configuration is used, and
hence the output of to_tsvector is changed if default_text_search_config
changes.

We have a few possible options:
1) Document the problem and do nothing else.2) Make default_text_search_config a postgresql.conf-only   setting,
therebymaking it impossible to change by non-super   users, or make it a super-user-only setting.3) Remove
default_text_search_configand require the   configuration to be specified in each function call.
 

If we remove default_text_search_config, it would also make ::tsvector
casting useless as well.

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: default_text_search_config and expression indexes

From
"Pavel Stehule"
Date:
>
>         1) Document the problem and do nothing else.
>         2) Make default_text_search_config a postgresql.conf-only
>            setting, thereby making it impossible to change by non-super
>            users, or make it a super-user-only setting.
>         3) Remove default_text_search_config and require the
>            configuration to be specified in each function call.
>

Hello,

2+.

Regards
Pavel Stehule


Re: Updated tsearch documentation

From
Oleg Bartunov
Date:
On Thu, 26 Jul 2007, Bruce Momjian wrote:

> Oleg Bartunov wrote:
>> Bruce,
>>
>> I sent you link to my wiki page with summary of changes
>> http://www.sai.msu.su/~megera/wiki/ts_changes
>>
>> Your documentation looks rather old.
>
> I have updated it to reflect your changes:
>
>     http://momjian.us/expire/fulltext/HTML/textsearch-tables.html


Bruce, I noticed you miss many changes. For example,


options for stemmer has changed (it's documented in my ts_changes), 
so in 
http://momjian.us/expire/fulltext/HTML/textsearch-tables.html#TEXTSEARCH-TABLES-CONFIGURATION

ALTER TEXT SEARCH DICTIONARY en_stem SET OPTION 'english-utf8.stop';

should be


ALTER TEXT SEARCH DICTIONARY en_stem SET OPTION 
'StopFile=english-utf8.stop, Language=english';


Also, this is wrong

DROP TEXT SEARCH CONFIGURATION MAPPING ON pg FOR email, url, sfloat, uri, float;

it should be

ALTER TEXT SEARCH CONFIGURATION pg DROP MAPPING FOR email, url, sfloat, uri, float;

Configuration now doesn't have DEFAULT flag, so \dF should not display 'Y'


=> \dF
pg_catalog | russian  | Y
public     | pg       | Y


This is what I see now

postgres=# \dF public.*
List of fulltext configurations Schema | Name | Description
--------+------+------------- public | pg   |



>
> ---------------------------------------------------------------------------
>
>
>>
>> Oleg
>> On Tue, 24 Jul 2007, Bruce Momjian wrote:
>>
>>>
>>> I have added more documentation to try to show how full text search is
>>> used by user tables.  I think this the documentaiton is almost done:
>>>
>>>     http://momjian.us/expire/fulltext/HTML/textsearch-tables.html
>>>
>>> ---------------------------------------------------------------------------
>>>
>>> Oleg Bartunov wrote:
>>>> On Wed, 18 Jul 2007, Bruce Momjian wrote:
>>>>
>>>>> Oleg, Teodor,
>>>>>
>>>>> I am confused by the following example.  How does gin know to create a
>>>>> tsvector, or does it?  Does gist know too?
>>>>
>>>> No, gist doesn't know. I don't remember why, Teodor ?
>>>>
>>>> For GIN see http://archives.postgresql.org/pgsql-hackers/2007-05/msg00625.php
>>>> for discussion
>>>>
>>>>>
>>>>> FYI, at some point we need to chat via instant messenger or IRC to
>>>>> discuss the open items.  My chat information is here:
>>>>>
>>>>>     http://momjian.us/main/contact.html
>>>>
>>>> I send you invitation for google talk, I use only chat in gmail.
>>>> My gmail account is obartunov@gmail.com
>>>>
>>>>>
>>>>> ---------------------------------------------------------------------------
>>>>>
>>>>> SELECT title
>>>>> FROM pgweb
>>>>> WHERE textcat(title,body) @@ plainto_tsquery('create table')
>>>>> ORDER BY dlm DESC LIMIT 10;
>>>>>
>>>>> CREATE INDEX pgweb_idx ON pgweb USING gin(textcat(title,body));
>>>>>
>>>>>
>>>>
>>>>      Regards,
>>>>          Oleg
>>>> _____________________________________________________________
>>>> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
>>>> Sternberg Astronomical Institute, Moscow University, Russia
>>>> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
>>>> phone: +007(495)939-16-83, +007(495)939-23-83
>>>
>>>
>>
>>      Regards,
>>          Oleg
>> _____________________________________________________________
>> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
>> Sternberg Astronomical Institute, Moscow University, Russia
>> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
>> phone: +007(495)939-16-83, +007(495)939-23-83
>
>
    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


Re: default_text_search_config and expression indexes

From
Oleg Bartunov
Date:
On Fri, 27 Jul 2007, Pavel Stehule wrote:

>>
>>         1) Document the problem and do nothing else.
>>         2) Make default_text_search_config a postgresql.conf-only
>>            setting, thereby making it impossible to change by non-super
>>            users, or make it a super-user-only setting.
>>         3) Remove default_text_search_config and require the
>>            configuration to be specified in each function call.
>>
>
> Hello,
>
> 2+.


One of the most important purpose of integrating tsearch2 was to 
facilitate full-text search for people in hosting environment. Usually,
they have no superuser rights. I'm asking don't forget about them !

There is no problem with current behaviour once user understand what he do.

    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


Re: default_text_search_config and expression indexes

From
"Pavel Stehule"
Date:
2007/7/27, Oleg Bartunov <oleg@sai.msu.su>:
> On Fri, 27 Jul 2007, Pavel Stehule wrote:
>
> >>
> >>         1) Document the problem and do nothing else.
> >>         2) Make default_text_search_config a postgresql.conf-only
> >>            setting, thereby making it impossible to change by non-super
> >>            users, or make it a super-user-only setting.
> >>         3) Remove default_text_search_config and require the
> >>            configuration to be specified in each function call.
> >>
> >
> > Hello,
> >
> > 2+.
>
>
> One of the most important purpose of integrating tsearch2 was to
> facilitate full-text search for people in hosting environment. Usually,
> they have no superuser rights. I'm asking don't forget about them !
>
> There is no problem with current behaviour once user understand what he do.
>
>

I am not sure if postgresql is well for multilangual hosting
environment. There is problem with locales. Without COLLATE support
postgresql can't be used in similar environment. :(

nice a day
Pavel Stehule


Re: default_text_search_config and expression indexes

From
Oleg Bartunov
Date:
On Fri, 27 Jul 2007, Pavel Stehule wrote:

> 2007/7/27, Oleg Bartunov <oleg@sai.msu.su>:
>> On Fri, 27 Jul 2007, Pavel Stehule wrote:
>>
>>>>
>>>>         1) Document the problem and do nothing else.
>>>>         2) Make default_text_search_config a postgresql.conf-only
>>>>            setting, thereby making it impossible to change by non-super
>>>>            users, or make it a super-user-only setting.
>>>>         3) Remove default_text_search_config and require the
>>>>            configuration to be specified in each function call.
>>>>
>>>
>>> Hello,
>>>
>>> 2+.
>>
>>
>> One of the most important purpose of integrating tsearch2 was to
>> facilitate full-text search for people in hosting environment. Usually,
>> they have no superuser rights. I'm asking don't forget about them !
>>
>> There is no problem with current behaviour once user understand what he do.
>>
>>
>
> I am not sure if postgresql is well for multilangual hosting
> environment. There is problem with locales. Without COLLATE support
> postgresql can't be used in similar environment. :(


configuration has NOTHING with language ! This is a most frequent myth about
configuration. It's just the way we chose for default_text_search_config to
use language part of locale at initdb time.
text search configuration is just a bind between parser to use for
breaking document by lexems and mapping between lexeme type and dictionaries.

>
> nice a day
> Pavel Stehule
>
    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


Re: default_text_search_config and expression indexes

From
"Pavel Stehule"
Date:
> configuration has NOTHING with language ! This is a most frequent myth about
> configuration. It's just the way we chose for default_text_search_config to
> use language part of locale at initdb time.
> text search configuration is just a bind between parser to use for
> breaking document by lexems and mapping between lexeme type and dictionaries.
>
> >

I spoke about impossibility well configuration of postgresql without
administrator's rights. For my czech environment is administrator's
rights necessary too, because czech dictionary aren't in default
installation.

Regards
Pavel Stehule


Re: default_text_search_config and expression indexes

From
Magnus Hagander
Date:
On Thu, Jul 26, 2007 at 06:23:51PM -0400, Bruce Momjian wrote:
> Oleg Bartunov wrote:
> > >> Second, I can't figure out how to reference a non-default
> > >> configuration.
> > >
> > > See the multi-argument versions of to_tsvector etc.
> > >
> > > I do see a problem with having to_tsvector(config, text) plus
> > > to_tsvector(text) where the latter implicitly references a config
> > > selected by a GUC variable: how can you tell whether a query using the
> > > latter matches a particular index using the former?  There isn't
> > > anything in the current planner mechanisms that would make that work.
> > 
> > Probably, having default text search configuration is not a good idea
> > and we could just require it as a mandatory parameter, which could
> > eliminate many confusion with selecting text search configuration.
> 
> We have to decide if we want a GUC default_text_search_config, and if so
> when can it be changed.
> 
> Right now there are three ways to create a tsvector (or tsquery)
> 
>     ::tsvector
>     to_tsvector(value)
>     to_tsvector(config, value)
> 
> (ignoring plainto_tsvector)
> 
> Only the last one specifies the configuration. The others use the
> configuration specified by default_text_search_config.  (We had an
> previous discussion on what the default value of
> default_text_search_config should be, and it was decided it should be
> set via initdb based on a flag or the locale.)
> 
> Now, because most people use a single configuration, they can just set
> default_text_search_config and there is no need to specify the
> configuration name.
> 
> However, expression indexes cause a problem here:
> 
>     http://momjian.us/expire/fulltext/HTML/textsearch-tables.html#TEXTSEARCH-TABLES-INDEX
> 
> We recommend that users create an expression index on the column they
> want to do a full text search on, e.g.
> 
>     CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector(body));
> 
> However, the big problem is that the expressions used in expression
> indexes should not change their output based on the value of a GUC
> variable (because it would corrupt the index), but in the case above,
> default_text_search_config controls what configuration is used, and
> hence the output of to_tsvector is changed if default_text_search_config
> changes.

It wuoldn't actually *corrupt* the index, right? You could end up with
wrong results, which might be regarded as corruption in one way, but as
long as you change the value back the index still works, no?


> We have a few possible options:
> 
>     1) Document the problem and do nothing else.
>     2) Make default_text_search_config a postgresql.conf-only
>        setting, thereby making it impossible to change by non-super
>        users, or make it a super-user-only setting.
>     3) Remove default_text_search_config and require the
>        configuration to be specified in each function call.
> 
> If we remove default_text_search_config, it would also make ::tsvector
> casting useless as well.

I think 3 is a really bad solution.

2 is a half-bad solution. Do we have a way to say that it can be set at
database-level for example, but not at user session? Making it
superuser-only to change it but not postgresql.conf-only could accomplish
that, along with warnings in the docs for the super user about the effects
on current indexes by changing it.

//Magnus


Re: Updated tsearch documentation

From
Bruce Momjian
Date:
Thanks, I found a few more places that needed updating.  It should be
accurate now.  Thanks for the report.

---------------------------------------------------------------------------

Oleg Bartunov wrote:
> On Thu, 26 Jul 2007, Bruce Momjian wrote:
> 
> > Oleg Bartunov wrote:
> >> Bruce,
> >>
> >> I sent you link to my wiki page with summary of changes
> >> http://www.sai.msu.su/~megera/wiki/ts_changes
> >>
> >> Your documentation looks rather old.
> >
> > I have updated it to reflect your changes:
> >
> >     http://momjian.us/expire/fulltext/HTML/textsearch-tables.html
> 
> 
> Bruce, I noticed you miss many changes. For example,
> 
> 
> options for stemmer has changed (it's documented in my ts_changes), 
> so in 
> http://momjian.us/expire/fulltext/HTML/textsearch-tables.html#TEXTSEARCH-TABLES-CONFIGURATION
> 
> ALTER TEXT SEARCH DICTIONARY en_stem SET OPTION 'english-utf8.stop';
> 
> should be
> 
> 
> ALTER TEXT SEARCH DICTIONARY en_stem SET OPTION 
> 'StopFile=english-utf8.stop, Language=english';
> 
> 
> Also, this is wrong
> 
> DROP TEXT SEARCH CONFIGURATION MAPPING ON pg FOR email, url, sfloat, uri, float;
> 
> it should be
> 
> ALTER TEXT SEARCH CONFIGURATION pg DROP MAPPING FOR email, url, sfloat, uri, float;
> 
> Configuration now doesn't have DEFAULT flag, so \dF should not display 'Y'
> 
> 
> => \dF
> pg_catalog | russian  | Y
> public     | pg       | Y
> 
> 
> This is what I see now
> 
> postgres=# \dF public.*
> List of fulltext configurations
>   Schema | Name | Description
> --------+------+-------------
>   public | pg   |
> 
> 
> 
> >
> > ---------------------------------------------------------------------------
> >
> >
> >>
> >> Oleg
> >> On Tue, 24 Jul 2007, Bruce Momjian wrote:
> >>
> >>>
> >>> I have added more documentation to try to show how full text search is
> >>> used by user tables.  I think this the documentaiton is almost done:
> >>>
> >>>     http://momjian.us/expire/fulltext/HTML/textsearch-tables.html
> >>>
> >>> ---------------------------------------------------------------------------
> >>>
> >>> Oleg Bartunov wrote:
> >>>> On Wed, 18 Jul 2007, Bruce Momjian wrote:
> >>>>
> >>>>> Oleg, Teodor,
> >>>>>
> >>>>> I am confused by the following example.  How does gin know to create a
> >>>>> tsvector, or does it?  Does gist know too?
> >>>>
> >>>> No, gist doesn't know. I don't remember why, Teodor ?
> >>>>
> >>>> For GIN see http://archives.postgresql.org/pgsql-hackers/2007-05/msg00625.php
> >>>> for discussion
> >>>>
> >>>>>
> >>>>> FYI, at some point we need to chat via instant messenger or IRC to
> >>>>> discuss the open items.  My chat information is here:
> >>>>>
> >>>>>     http://momjian.us/main/contact.html
> >>>>
> >>>> I send you invitation for google talk, I use only chat in gmail.
> >>>> My gmail account is obartunov@gmail.com
> >>>>
> >>>>>
> >>>>> ---------------------------------------------------------------------------
> >>>>>
> >>>>> SELECT title
> >>>>> FROM pgweb
> >>>>> WHERE textcat(title,body) @@ plainto_tsquery('create table')
> >>>>> ORDER BY dlm DESC LIMIT 10;
> >>>>>
> >>>>> CREATE INDEX pgweb_idx ON pgweb USING gin(textcat(title,body));
> >>>>>
> >>>>>
> >>>>
> >>>>      Regards,
> >>>>          Oleg
> >>>> _____________________________________________________________
> >>>> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
> >>>> Sternberg Astronomical Institute, Moscow University, Russia
> >>>> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
> >>>> phone: +007(495)939-16-83, +007(495)939-23-83
> >>>
> >>>
> >>
> >>      Regards,
> >>          Oleg
> >> _____________________________________________________________
> >> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
> >> Sternberg Astronomical Institute, Moscow University, Russia
> >> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
> >> phone: +007(495)939-16-83, +007(495)939-23-83
> >
> >
> 
>      Regards,
>          Oleg
> _____________________________________________________________
> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
> Sternberg Astronomical Institute, Moscow University, Russia
> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
> phone: +007(495)939-16-83, +007(495)939-23-83

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: Updated tsearch documentation

From
Bruce Momjian
Date:
Dimitri Fontaine wrote:
-- Start of PGP signed section.
> Hi,
> 
> Le mercredi 25 juillet 2007, Bruce Momjian a ?crit?:
> > I have added more documentation to try to show how full text search is
> > used by user tables.  I think this the documentaiton is almost done:
> >
> >     http://momjian.us/expire/fulltext/HTML/textsearch-tables.html
> 
> I've come to understand that GIN indexes are far more costly to update than 
> GiST one, and Oleg's wiki advice users to partition data and use GiST index 
> for live part and GIN index for archive part only.
> 
> Is it worth mentioning this into this part of the documentation?
> And if mentioned here, partitioning step could certainly be part of the 
> example... or let it as a user exercise, but then explaining why GIN is a 
> good choice in the provided example.

Partitioning is already in the documentation:
Partitioning of big collections and the proper use of GiST and GINindexes allows the implementation of very fast
searcheswith onlineupdate. Partitioning can be done at the database level using tableinheritance and
<varname>constraint_exclusion</>,or distributingdocuments over servers and collecting search results using
the<filename>contrib/dblink</>extension module. The latter is possiblebecause ranking functions use only local
information.

I don't see a reason to provide an example beyond the existing examples
of how to do partitioning.

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: default_text_search_config and expression indexes

From
Bruce Momjian
Date:
Magnus Hagander wrote:
> > However, the big problem is that the expressions used in expression
> > indexes should not change their output based on the value of a GUC
> > variable (because it would corrupt the index), but in the case above,
> > default_text_search_config controls what configuration is used, and
> > hence the output of to_tsvector is changed if default_text_search_config
> > changes.
> 
> It wuoldn't actually *corrupt* the index, right? You could end up with
> wrong results, which might be regarded as corruption in one way, but as
> long as you change the value back the index still works, no?

Right, it would _temporarily_ corrupt it.  ;-)

> > We have a few possible options:
> > 
> >     1) Document the problem and do nothing else.
> >     2) Make default_text_search_config a postgresql.conf-only
> >        setting, thereby making it impossible to change by non-super
> >        users, or make it a super-user-only setting.
> >     3) Remove default_text_search_config and require the
> >        configuration to be specified in each function call.
> > 
> > If we remove default_text_search_config, it would also make ::tsvector
> > casting useless as well.
> 
> I think 3 is a really bad solution.
> 
> 2 is a half-bad solution. Do we have a way to say that it can be set at
> database-level for example, but not at user session? Making it
> superuser-only to change it but not postgresql.conf-only could accomplish
> that, along with warnings in the docs for the super user about the effects
> on current indexes by changing it.

OK, here is what I am thinking.  If we make default_text_search_config
super-user-only, then the user can't do SET (using "zero_damaged_pages"
as a superuser-only example):
test=> set zero_damaged_pages = on;ERROR:  permission denied to set parameter "zero_damaged_pages"test=> alter user
guestset zero_damaged_pages = on;ERROR:  permission denied to set parameter "zero_damaged_pages"
 

but the super-user can set it in postgresql.conf, or:
test=# alter user guest set zero_damaged_pages = on;ALTER ROLE

or
test=# alter database vendor3 set zero_damaged_pages = on;ALTER ROLE

meaning while it will be super-user-only, the administrator can set the
default for specific databases and users.  Is that the best approach?

A user can still over-ride the default by specifying the configuration
in the function call.

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: default_text_search_config and expression indexes

From
Magnus Hagander
Date:
Bruce Momjian wrote:
> Magnus Hagander wrote:
>>> However, the big problem is that the expressions used in expression
>>> indexes should not change their output based on the value of a GUC
>>> variable (because it would corrupt the index), but in the case above,
>>> default_text_search_config controls what configuration is used, and
>>> hence the output of to_tsvector is changed if default_text_search_config
>>> changes.
>> It wuoldn't actually *corrupt* the index, right? You could end up with
>> wrong results, which might be regarded as corruption in one way, but as
>> long as you change the value back the index still works, no?
> 
> Right, it would _temporarily_ corrupt it.  ;-)
> 
>>> We have a few possible options:
>>>
>>>     1) Document the problem and do nothing else.
>>>     2) Make default_text_search_config a postgresql.conf-only
>>>        setting, thereby making it impossible to change by non-super
>>>        users, or make it a super-user-only setting.
>>>     3) Remove default_text_search_config and require the
>>>        configuration to be specified in each function call.
>>>
>>> If we remove default_text_search_config, it would also make ::tsvector
>>> casting useless as well.
>> I think 3 is a really bad solution.
>>
>> 2 is a half-bad solution. Do we have a way to say that it can be set at
>> database-level for example, but not at user session? Making it
>> superuser-only to change it but not postgresql.conf-only could accomplish
>> that, along with warnings in the docs for the super user about the effects
>> on current indexes by changing it.
> 
> OK, here is what I am thinking.  If we make default_text_search_config
> super-user-only, then the user can't do SET (using "zero_damaged_pages"
> as a superuser-only example):
> 
>     test=> set zero_damaged_pages = on;
>     ERROR:  permission denied to set parameter "zero_damaged_pages"
>     
>     test=> alter user guest set zero_damaged_pages = on;
>     ERROR:  permission denied to set parameter "zero_damaged_pages"
> 
> but the super-user can set it in postgresql.conf, or:
> 
>     test=# alter user guest set zero_damaged_pages = on;
>     ALTER ROLE
> 
> or
> 
>     test=# alter database vendor3 set zero_damaged_pages = on;
>     ALTER ROLE
> 
> meaning while it will be super-user-only, the administrator can set the
> default for specific databases and users.  Is that the best approach?

That's exactly the one I was trying to suggest. And then with clear
warnings in the docs around the place that if you set it to different
values accessing the same index (for different users, for example) weird
things may happen.

But I see the ALTER DATABASE a fairly common scenario (I know I would
use it), and from what I can tell thereis no risk at all with that.

//Magnus


Re: default_text_search_config and expression indexes

From
Oleg Bartunov
Date:
On Fri, 27 Jul 2007, Bruce Momjian wrote:

> Magnus Hagander wrote:
>>> However, the big problem is that the expressions used in expression
>>> indexes should not change their output based on the value of a GUC
>>> variable (because it would corrupt the index), but in the case above,
>>> default_text_search_config controls what configuration is used, and
>>> hence the output of to_tsvector is changed if default_text_search_config
>>> changes.
>>
>> It wuoldn't actually *corrupt* the index, right? You could end up with
>> wrong results, which might be regarded as corruption in one way, but as
>> long as you change the value back the index still works, no?
>
> Right, it would _temporarily_ corrupt it.  ;-)
>
>>> We have a few possible options:
>>>
>>>     1) Document the problem and do nothing else.
>>>     2) Make default_text_search_config a postgresql.conf-only
>>>        setting, thereby making it impossible to change by non-super
>>>        users, or make it a super-user-only setting.
>>>     3) Remove default_text_search_config and require the
>>>        configuration to be specified in each function call.
>>>
>>> If we remove default_text_search_config, it would also make ::tsvector
>>> casting useless as well.
>>
>> I think 3 is a really bad solution.
>>
>> 2 is a half-bad solution. Do we have a way to say that it can be set at
>> database-level for example, but not at user session? Making it
>> superuser-only to change it but not postgresql.conf-only could accomplish
>> that, along with warnings in the docs for the super user about the effects
>> on current indexes by changing it.
>
> OK, here is what I am thinking.  If we make default_text_search_config
> super-user-only, then the user can't do SET (using "zero_damaged_pages"
> as a superuser-only example):
>
>     test=> set zero_damaged_pages = on;
>     ERROR:  permission denied to set parameter "zero_damaged_pages"
>
>     test=> alter user guest set zero_damaged_pages = on;
>     ERROR:  permission denied to set parameter "zero_damaged_pages"
>
> but the super-user can set it in postgresql.conf, or:
>
>     test=# alter user guest set zero_damaged_pages = on;
>     ALTER ROLE
>
> or
>
>     test=# alter database vendor3 set zero_damaged_pages = on;
>     ALTER ROLE
>
> meaning while it will be super-user-only, the administrator can set the
> default for specific databases and users.  Is that the best approach?
>
> A user can still over-ride the default by specifying the configuration
> in the function call.

This is ok, but it will not work in hosting environment and still 
doesn't prevent errors.
    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


Re: default_text_search_config and expression indexes

From
Bruce Momjian
Date:
Oleg Bartunov wrote:
> > OK, here is what I am thinking.  If we make default_text_search_config
> > super-user-only, then the user can't do SET (using "zero_damaged_pages"
> > as a superuser-only example):
> >
> >     test=> set zero_damaged_pages = on;
> >     ERROR:  permission denied to set parameter "zero_damaged_pages"
> >
> >     test=> alter user guest set zero_damaged_pages = on;
> >     ERROR:  permission denied to set parameter "zero_damaged_pages"
> >
> > but the super-user can set it in postgresql.conf, or:
> >
> >     test=# alter user guest set zero_damaged_pages = on;
> >     ALTER ROLE
> >
> > or
> >
> >     test=# alter database vendor3 set zero_damaged_pages = on;
> >     ALTER ROLE
> >
> > meaning while it will be super-user-only, the administrator can set the
> > default for specific databases and users.  Is that the best approach?
> >
> > A user can still over-ride the default by specifying the configuration
> > in the function call.
> 
> This is ok, but it will not work in hosting environment and still 
> doesn't prevent errors.

Agreed.  super-user-only now seems strange to me because it isn't a
security issue, but rather an attempt to avoid people causing errors.

The fundamental issue is that if you do a query using tsvector and
tsquery everything will work find because default_text_search_config
will be the same for both queries.  The problem is if do an expression
index lookup that doesn't specify the configuration name and your
default_text_search_config doesn't match the index, or you INSERT or
UPDATE into an expression index with a mismatched
default_text_search_config.

If we do make default_text_search_config super-user-only it prevents a
database owner from doing ALTER DATABASE db1 SET
default_text_search_config = 'english', which seems like a pretty big
limitation because I think per-database default_text_search_config makes
the most sense.

And, again, if you specify the configuration in the expression index you
have to specify it in the WHERE clause and then
default_text_search_config is pretty useless.

If we required the configuration to always be specified, you could still
store multiple configurations in the same column by having a secondary
column hold the configuration name:
CREATE INDEX i on x USING gist (to_tsvector(config_col, body));

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: default_text_search_config and expression indexes

From
Bruce Momjian
Date:
Bruce Momjian wrote:
> We have to decide if we want a GUC default_text_search_config, and if so
> when can it be changed.
> 
> Right now there are three ways to create a tsvector (or tsquery)
> 
>     ::tsvector
>     to_tsvector(value)
>     to_tsvector(config, value)
> 
> (ignoring plainto_tsvector)
> 
> Only the last one specifies the configuration. The others use the
> configuration specified by default_text_search_config.  (We had an
> previous discussion on what the default value of
> default_text_search_config should be, and it was decided it should be
> set via initdb based on a flag or the locale.)
> 
> Now, because most people use a single configuration, they can just set
> default_text_search_config and there is no need to specify the
> configuration name.
> 
> However, expression indexes cause a problem here:
> 
>     http://momjian.us/expire/fulltext/HTML/textsearch-tables.html#TEXTSEARCH-TABLES-INDEX
> 
> We recommend that users create an expression index on the column they
> want to do a full text search on, e.g.
> 
>     CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector(body));
> 
> However, the big problem is that the expressions used in expression
> indexes should not change their output based on the value of a GUC
> variable (because it would corrupt the index), but in the case above,
> default_text_search_config controls what configuration is used, and
> hence the output of to_tsvector is changed if default_text_search_config
> changes.
> 
> We have a few possible options:
> 
>     1) Document the problem and do nothing else.
>     2) Make default_text_search_config a postgresql.conf-only
>        setting, thereby making it impossible to change by non-super
>        users, or make it a super-user-only setting.
>     3) Remove default_text_search_config and require the
>        configuration to be specified in each function call.
> 
> If we remove default_text_search_config, it would also make ::tsvector
> casting useless as well.

OK, I just found a case that I think is going to make #3 a requirement
(remove default_text_search_config).

How is a CREATE INDEX ... to_tsvector(col) going to restore from a
pg_dump?  I see no way of guaranteeing that the
default_text_search_config is correct on the restore, and in fact I
don't think we have any way of knowing the default_text_search_config
used for the index.

And if we have to require the configuration name in CREATE INDEX, it has
to be used in WHERE, so we might as well just remove the default
capability and always require the configuration name.

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: default_text_search_config and expression indexes

From
Alvaro Herrera
Date:
Bruce Momjian wrote:
> Bruce Momjian wrote:

> >     3) Remove default_text_search_config and require the
> >        configuration to be specified in each function call.
> > 
> > If we remove default_text_search_config, it would also make ::tsvector
> > casting useless as well.
> 
> OK, I just found a case that I think is going to make #3 a requirement
> (remove default_text_search_config).
> 
> How is a CREATE INDEX ... to_tsvector(col) going to restore from a
> pg_dump?  I see no way of guaranteeing that the
> default_text_search_config is correct on the restore, and in fact I
> don't think we have any way of knowing the default_text_search_config
> used for the index.

Make pg_dump emit only CREATE INDEX sentences with two-param format.  In
fact I think it would make sense to convert internally the one-param
format to two-param, before hitting the catalogs.

This would also solve your problem about usability of WHERE clauses, if
you rewrite the one-param calls to two-params before the optimizer kicks
in.

-- 
Alvaro Herrera                 http://www.amazon.com/gp/registry/DXLWNGRJD34J
"Nadie esta tan esclavizado como el que se cree libre no siendolo" (Goethe)


Re: default_text_search_config and expression indexes

From
Bruce Momjian
Date:
Alvaro Herrera wrote:
> Bruce Momjian wrote:
> > Bruce Momjian wrote:
> 
> > >     3) Remove default_text_search_config and require the
> > >        configuration to be specified in each function call.
> > > 
> > > If we remove default_text_search_config, it would also make ::tsvector
> > > casting useless as well.
> > 
> > OK, I just found a case that I think is going to make #3 a requirement
> > (remove default_text_search_config).
> > 
> > How is a CREATE INDEX ... to_tsvector(col) going to restore from a
> > pg_dump?  I see no way of guaranteeing that the
> > default_text_search_config is correct on the restore, and in fact I
> > don't think we have any way of knowing the default_text_search_config
> > used for the index.
> 
> Make pg_dump emit only CREATE INDEX sentences with two-param format.  In
> fact I think it would make sense to convert internally the one-param
> format to two-param, before hitting the catalogs.
>
> This would also solve your problem about usability of WHERE clauses, if
> you rewrite the one-param calls to two-params before the optimizer kicks
> in.

Yes, that could be done but as far as I know we weren't planning to have
those areas of our backend be aware of configuration parameters to
text search functions, and I doubt we want to do that for 8.3, if ever.
There are many tsearch functions and doing this would spill tsearch
function checks all over the backend, reducing our modularity.

The default capability just isn't worth it, and in addition is
error-prone.

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: default_text_search_config and expression indexes

From
Oleg Bartunov
Date:
On Mon, 30 Jul 2007, Bruce Momjian wrote:

> Oleg Bartunov wrote:
>>> OK, here is what I am thinking.  If we make default_text_search_config
>>> super-user-only, then the user can't do SET (using "zero_damaged_pages"
>>> as a superuser-only example):
>>>
>>>     test=> set zero_damaged_pages = on;
>>>     ERROR:  permission denied to set parameter "zero_damaged_pages"
>>>
>>>     test=> alter user guest set zero_damaged_pages = on;
>>>     ERROR:  permission denied to set parameter "zero_damaged_pages"
>>>
>>> but the super-user can set it in postgresql.conf, or:
>>>
>>>     test=# alter user guest set zero_damaged_pages = on;
>>>     ALTER ROLE
>>>
>>> or
>>>
>>>     test=# alter database vendor3 set zero_damaged_pages = on;
>>>     ALTER ROLE
>>>
>>> meaning while it will be super-user-only, the administrator can set the
>>> default for specific databases and users.  Is that the best approach?
>>>
>>> A user can still over-ride the default by specifying the configuration
>>> in the function call.
>>
>> This is ok, but it will not work in hosting environment and still
>> doesn't prevent errors.
>
> Agreed.  super-user-only now seems strange to me because it isn't a
> security issue, but rather an attempt to avoid people causing errors.
>
> The fundamental issue is that if you do a query using tsvector and
> tsquery everything will work find because default_text_search_config
> will be the same for both queries.  The problem is if do an expression
> index lookup that doesn't specify the configuration name and your
> default_text_search_config doesn't match the index, or you INSERT or
> UPDATE into an expression index with a mismatched
> default_text_search_config.
>
> If we do make default_text_search_config super-user-only it prevents a
> database owner from doing ALTER DATABASE db1 SET
> default_text_search_config = 'english', which seems like a pretty big
> limitation because I think per-database default_text_search_config makes
> the most sense.
>
> And, again, if you specify the configuration in the expression index you
> have to specify it in the WHERE clause and then
> default_text_search_config is pretty useless.

agree. Notice, this is very limited usage case.

>
> If we required the configuration to always be specified, you could still
> store multiple configurations in the same column by having a secondary
> column hold the configuration name:

I don't understand this. Let's don't discuss indexes at all, since indexes
doesn't know about configuratons at all

>
>     CREATE INDEX i on x USING gist (to_tsvector(config_col, body));
    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


Re: default_text_search_config and expression indexes

From
Oleg Bartunov
Date:
On Mon, 30 Jul 2007, Bruce Momjian wrote:

> Bruce Momjian wrote:
>> We have to decide if we want a GUC default_text_search_config, and if so
>> when can it be changed.
>>
>> Right now there are three ways to create a tsvector (or tsquery)
>>
>>     ::tsvector
>>     to_tsvector(value)
>>     to_tsvector(config, value)
>>
>> (ignoring plainto_tsvector)
>>
>> Only the last one specifies the configuration. The others use the
>> configuration specified by default_text_search_config.  (We had an
>> previous discussion on what the default value of
>> default_text_search_config should be, and it was decided it should be
>> set via initdb based on a flag or the locale.)
>>
>> Now, because most people use a single configuration, they can just set
>> default_text_search_config and there is no need to specify the
>> configuration name.
>>
>> However, expression indexes cause a problem here:
>>
>>     http://momjian.us/expire/fulltext/HTML/textsearch-tables.html#TEXTSEARCH-TABLES-INDEX
>>
>> We recommend that users create an expression index on the column they
>> want to do a full text search on, e.g.
>>
>>     CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector(body));
>>
>> However, the big problem is that the expressions used in expression
>> indexes should not change their output based on the value of a GUC
>> variable (because it would corrupt the index), but in the case above,
>> default_text_search_config controls what configuration is used, and
>> hence the output of to_tsvector is changed if default_text_search_config
>> changes.
>>
>> We have a few possible options:
>>
>>     1) Document the problem and do nothing else.
>>     2) Make default_text_search_config a postgresql.conf-only
>>        setting, thereby making it impossible to change by non-super
>>        users, or make it a super-user-only setting.
>>     3) Remove default_text_search_config and require the
>>        configuration to be specified in each function call.
>>
>> If we remove default_text_search_config, it would also make ::tsvector
>> casting useless as well.
>
> OK, I just found a case that I think is going to make #3 a requirement
> (remove default_text_search_config).
>
> How is a CREATE INDEX ... to_tsvector(col) going to restore from a
> pg_dump?  I see no way of guaranteeing that the
> default_text_search_config is correct on the restore, and in fact I
> don't think we have any way of knowing the default_text_search_config
> used for the index.
>
> And if we have to require the configuration name in CREATE INDEX, it has
> to be used in WHERE, so we might as well just remove the default
> capability and always require the configuration name.

this is very rare use case for text searching 
1. expression index without configuration name
2. default_text_search_config can be changed by somebody

If somebody really need it, then he should be adviced to use configuration 
name, else we don't guarantee that somebody could change 
default_text_search_config  variable and this could lead to 
incorrect dump/restore.

I don't think we should remove default_text_search_config because of 
this rare case.
    Regards,        Oleg

PS.

Bruce, I'm in the mountains the Northern Caucasia and internet is
a bit  unreliable :(

_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


Re: default_text_search_config and expression indexes

From
Bruce Momjian
Date:
Oleg Bartunov wrote:
> >> If we remove default_text_search_config, it would also make ::tsvector
> >> casting useless as well.
> >
> > OK, I just found a case that I think is going to make #3 a requirement
> > (remove default_text_search_config).
> >
> > How is a CREATE INDEX ... to_tsvector(col) going to restore from a
> > pg_dump?  I see no way of guaranteeing that the
> > default_text_search_config is correct on the restore, and in fact I
> > don't think we have any way of knowing the default_text_search_config
> > used for the index.
> >
> > And if we have to require the configuration name in CREATE INDEX, it has
> > to be used in WHERE, so we might as well just remove the default
> > capability and always require the configuration name.
> 
> this is very rare use case for text searching 
> 1. expression index without configuration name
> 2. default_text_search_config can be changed by somebody

If you are going to be using the configuration name with the create
expression index, you have to use it in the WHERE clause (or the index
doesn't work), and I assume that is 90% of the text search uses.  I
don't see it as rare at all.

> If somebody really need it, then he should be adviced to use configuration 
> name, else we don't guarantee that somebody could change 
> default_text_search_config  variable and this could lead to 
> incorrect dump/restore.
> 
> I don't think we should remove default_text_search_config because of 
> this rare case.

I still feel the default_text_search_config has to be removed.  We have
tried all sorts of ways to make it work but having it not be 100%
reliable for pg_dump/restore means it might as well be in /contrib and
unsupported.  If we have it in core, it has to work 100%.  We can't have
tons of examples that don't specify the configuration name and then
expect every create expression index and WHERE clause to use it. 
default_text_search_config _can_ work, but it seems so easy to break and
so easy to get wrong that I think it must be removed.

If we are going to keep it, I need someone to explain why my comments
above are wrong.  If I am right, someone has to remove
default_text_search_config from the patch.   I can do the documentation.

> Bruce, I'm in the mountains the Northern Caucasia and internet is
> a bit  unreliable :(

Thanks.  I noticed  a lag in your reply.  Hope you are having a good
time.

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: default_text_search_config and expression indexes

From
Oleg Bartunov
Date:
On Tue, 31 Jul 2007, Bruce Momjian wrote:

>>> And if we have to require the configuration name in CREATE INDEX, it has
>>> to be used in WHERE, so we might as well just remove the default
>>> capability and always require the configuration name.
>>
>> this is very rare use case for text searching
>> 1. expression index without configuration name
>> 2. default_text_search_config can be changed by somebody
>
> If you are going to be using the configuration name with the create
> expression index, you have to use it in the WHERE clause (or the index
> doesn't work), and I assume that is 90% of the text search uses.  I
> don't see it as rare at all.

What is a basis of your assumption ? In my opinion, it's very limited
use of text search, because it doesn't supports ranking. For 4-5 years
of tsearch2 usage I never used it and I never seem in mailing lists.
This is very user-oriented feature and we could probably ask 
-general people for their opinion.

>> If somebody really need it, then he should be adviced to use configuration
>> name, else we don't guarantee that somebody could change
>> default_text_search_config  variable and this could lead to
>> incorrect dump/restore.
>>
>> I don't think we should remove default_text_search_config because of
>> this rare case.
>
> I still feel the default_text_search_config has to be removed.  We have
> tried all sorts of ways to make it work but having it not be 100%
> reliable for pg_dump/restore means it might as well be in /contrib and
> unsupported.  If we have it in core, it has to work 100%.  We can't have
> tons of examples that don't specify the configuration name and then
> expect every create expression index and WHERE clause to use it.
> default_text_search_config _can_ work, but it seems so easy to break and
> so easy to get wrong that I think it must be removed.

I'd better say we don't support text searching using expression index
than remove default_text_search_config. Anyway, I don't feel myself
responisble for such important problem. We need more feedback from 
users.

>
> If we are going to keep it, I need someone to explain why my comments
> above are wrong.  If I am right, someone has to remove
> default_text_search_config from the patch.   I can do the documentation.

I'm in conference and then will be busy writing my applications and
earning money, Teodor is in vacation. I don't want to do 
hasty conclusion, since we're very tired to change our patch from 
one solution to another. We need consensus of developers and users.
I'm almost exhausted and have  no time  to continue this discussion.

Would you be so kind to write separate post about this problem and
call -hackers and -general for feedback. Let's experienced users
show their needs. We said everything and has nothing to add.

    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


Re: default_text_search_config and expression indexes

From
Bruce Momjian
Date:
Oleg Bartunov wrote:
> On Tue, 31 Jul 2007, Bruce Momjian wrote:
> 
> >>> And if we have to require the configuration name in CREATE INDEX, it has
> >>> to be used in WHERE, so we might as well just remove the default
> >>> capability and always require the configuration name.
> >>
> >> this is very rare use case for text searching
> >> 1. expression index without configuration name
> >> 2. default_text_search_config can be changed by somebody
> >
> > If you are going to be using the configuration name with the create
> > expression index, you have to use it in the WHERE clause (or the index
> > doesn't work), and I assume that is 90% of the text search uses.  I
> > don't see it as rare at all.
> 
> What is a basis of your assumption ? In my opinion, it's very limited
> use of text search, because it doesn't supports ranking. For 4-5 years
> of tsearch2 usage I never used it and I never seem in mailing lists.
> This is very user-oriented feature and we could probably ask 
> -general people for their opinion.

I doubt 'general' is going to understand the details of merging this
into the backend.  I assume we have enough people on hackers to decide
this.

Are you saying the majority of users have a separate column with a
trigger?  Does the trigger specify the configuation?  I don't see that
as a parameter argument to tsvector_update_trigger().  If you reload a
pg_dump, what does it use for the configuration?

Why is a separate column better than the index?  Just ranking?

The reason the expression index is nice is this feature has to be easy
to use for people who are new to full text and even PostgreSQL.  Right
now /contrib is fine for experts to use, but we want a larger user base
for this feature.

> >> If somebody really need it, then he should be adviced to use configuration
> >> name, else we don't guarantee that somebody could change
> >> default_text_search_config  variable and this could lead to
> >> incorrect dump/restore.
> >>
> >> I don't think we should remove default_text_search_config because of
> >> this rare case.
> >
> > I still feel the default_text_search_config has to be removed.  We have
> > tried all sorts of ways to make it work but having it not be 100%
> > reliable for pg_dump/restore means it might as well be in /contrib and
> > unsupported.  If we have it in core, it has to work 100%.  We can't have
> > tons of examples that don't specify the configuration name and then
> > expect every create expression index and WHERE clause to use it.
> > default_text_search_config _can_ work, but it seems so easy to break and
> > so easy to get wrong that I think it must be removed.
> 
> I'd better say we don't support text searching using expression index
> than remove default_text_search_config. Anyway, I don't feel myself
> responisble for such important problem. We need more feedback from 
> users.

Well, I am waiting for other hackers to get involved, but if they don't,
I have to evaluate it myself on the email lists.

> > If we are going to keep it, I need someone to explain why my comments
> > above are wrong.  If I am right, someone has to remove
> > default_text_search_config from the patch.   I can do the documentation.
> 
> I'm in conference and then will be busy writing my applications and
> earning money, Teodor is in vacation. I don't want to do 
> hasty conclusion, since we're very tired to change our patch from 
> one solution to another. We need consensus of developers and users.
> I'm almost exhausted and have  no time  to continue this discussion.
> 
> Would you be so kind to write separate post about this problem and
> call -hackers and -general for feedback. Let's experienced users
> show their needs. We said everything and has nothing to add.

If you have no time to continue discussion and perhaps update the patch,
we can consider this patch dead for 8.3 and we can hold it for 8.4
because I can guarantee you this is going to need more discussion and
patch modification before it gets into CVS.

This patch is being treated fairly and exactly the same as every other
patch.

Should we hold the patch for 8.4?

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: default_text_search_config and expression indexes

From
Oleg Bartunov
Date:
On Tue, 31 Jul 2007, Bruce Momjian wrote:

> Oleg Bartunov wrote:
>> On Tue, 31 Jul 2007, Bruce Momjian wrote:
>>
>>>>> And if we have to require the configuration name in CREATE INDEX, it has
>>>>> to be used in WHERE, so we might as well just remove the default
>>>>> capability and always require the configuration name.
>>>>
>>>> this is very rare use case for text searching
>>>> 1. expression index without configuration name
>>>> 2. default_text_search_config can be changed by somebody
>>>
>>> If you are going to be using the configuration name with the create
>>> expression index, you have to use it in the WHERE clause (or the index
>>> doesn't work), and I assume that is 90% of the text search uses.  I
>>> don't see it as rare at all.
>>
>> What is a basis of your assumption ? In my opinion, it's very limited
>> use of text search, because it doesn't supports ranking. For 4-5 years
>> of tsearch2 usage I never used it and I never seem in mailing lists.
>> This is very user-oriented feature and we could probably ask
>> -general people for their opinion.
>
> I doubt 'general' is going to understand the details of merging this
> into the backend.  I assume we have enough people on hackers to decide
> this.

I mean not technical details, but use case. Does they need expressional
index without ranking but sacrifice ability to use default configuration
in other cases too ? My prediction is that  people doesn't ever thought about 
this possibility until we said them about.

>
> Are you saying the majority of users have a separate column with a
> trigger?  Does the trigger specify the configuation?  I don't see that
> as a parameter argument to tsvector_update_trigger().  If you reload a
> pg_dump, what does it use for the configuration?
>

yes, separate column with custom trigger works fine. It's up to you how
to keep your data actual and it's up to you how to write trigger. 
Our tsvector_update_trigger() is a tsvector_update_trigger_example() !

> Why is a separate column better than the index?  Just ranking?

ranking + composite documents. I already mentioned, that this could be
rather expensive. Also, having separate column allow people various
ways to say what is a document and even change it.

>
> The reason the expression index is nice is this feature has to be easy
> to use for people who are new to full text and even PostgreSQL.  Right
> now /contrib is fine for experts to use, but we want a larger user base
> for this feature.

I agree here. This was one of the main reason of our work for 8.3.
Probably, we shold think in another direction - not to curtail tsearch2
and confuse rather big existing users, but to add an ability to save somehow
configuration used for creating of *document*
either implicitly (in expression index, or just gin(text_column)), or
explicitly (separate column). There is no problem with index itself !

>
>>
>> I'd better say we don't support text searching using expression index
>> than remove default_text_search_config. Anyway, I don't feel myself
>> responisble for such important problem. We need more feedback from
>> users.
>
> Well, I am waiting for other hackers to get involved, but if they don't,
> I have to evaluate it myself on the email lists.
>
>>> If we are going to keep it, I need someone to explain why my comments
>>> above are wrong.  If I am right, someone has to remove
>>> default_text_search_config from the patch.   I can do the documentation.
>>
>> I'm in conference and then will be busy writing my applications and
>> earning money, Teodor is in vacation. I don't want to do
>> hasty conclusion, since we're very tired to change our patch from
>> one solution to another. We need consensus of developers and users.
>> I'm almost exhausted and have  no time  to continue this discussion.
>>
>> Would you be so kind to write separate post about this problem and
>> call -hackers and -general for feedback. Let's experienced users
>> show their needs. We said everything and has nothing to add.
>
> If you have no time to continue discussion and perhaps update the patch,
> we can consider this patch dead for 8.3 and we can hold it for 8.4
> because I can guarantee you this is going to need more discussion and
> patch modification before it gets into CVS.
>
> This patch is being treated fairly and exactly the same as every other
> patch.

why do you say this ? I didn't complain about this.

>
> Should we hold the patch for 8.4?

If we're not agree to say in docs, that implicit usage of text search 
configuration in CREATE INDEX command doesn't supported. Could we leave
default_text_search_config for super-users, at least ?

Anyway, let's wait what other people say.
    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


Re: default_text_search_config and expression indexes

From
Ron Mayer
Date:
Bruce Momjian wrote:
> Oleg Bartunov wrote:
>> What is a basis of your assumption ? In my opinion, it's very limited
>> use of text search, because it doesn't supports ranking. For 4-5 years
>> of tsearch2 usage I never used it and I never seem in mailing lists.
>> This is very user-oriented feature and we could probably ask 
>> -general people for their opinion.

I think I asked about this kind of usage a couple years back;
and Oleg pointed out other reasons why it wasn't as good an
idea too.

http://archives.postgresql.org/pgsql-general/2005-10/msg00475.php
http://archives.postgresql.org/pgsql-general/2005-10/msg00477.php

The particular question I had asked why the functional index was
slower than maintaining the extra column; with the explanation
that the lossy index having to call the function (including
parsing, dictionary lookup, etc) for re-checking the data made
it inadvisable to avoid the extra column anyway.

> I doubt 'general' is going to understand the details of merging this
> into the backend.  I assume we have enough people on hackers to decide
> this.
> 
> Are you saying the majority of users have a separate column with a
> trigger?

I think so.   At least when I was using it in 2005 the second
column with the trigger was faster than using a functional index.

>> We need more feedback from users.
> 
> Well, I am waiting for other hackers to get involved, but if they don't,
> I have to evaluate it myself on the email lists.

Personally, I think documentation changes would be an OK way to
to handle it.   Something that makes it extremely clear to the
user the advantages of having the extra column and the risks
of avoiding them.


Re: default_text_search_config and expression indexes

From
Bruce Momjian
Date:
Oleg Bartunov wrote:
> On Tue, 31 Jul 2007, Bruce Momjian wrote:
> 
> > Oleg Bartunov wrote:
> >> On Tue, 31 Jul 2007, Bruce Momjian wrote:
> >>
> >>>>> And if we have to require the configuration name in CREATE INDEX, it has
> >>>>> to be used in WHERE, so we might as well just remove the default
> >>>>> capability and always require the configuration name.
> >>>>
> >>>> this is very rare use case for text searching
> >>>> 1. expression index without configuration name
> >>>> 2. default_text_search_config can be changed by somebody
> >>>
> >>> If you are going to be using the configuration name with the create
> >>> expression index, you have to use it in the WHERE clause (or the index
> >>> doesn't work), and I assume that is 90% of the text search uses.  I
> >>> don't see it as rare at all.
> >>
> >> What is a basis of your assumption ? In my opinion, it's very limited
> >> use of text search, because it doesn't supports ranking. For 4-5 years
> >> of tsearch2 usage I never used it and I never seem in mailing lists.
> >> This is very user-oriented feature and we could probably ask
> >> -general people for their opinion.
> >
> > I doubt 'general' is going to understand the details of merging this
> > into the backend.  I assume we have enough people on hackers to decide
> > this.
> 
> I mean not technical details, but use case. Does they need expressional
> index without ranking but sacrifice ability to use default configuration
> in other cases too ? My prediction is that  people doesn't ever thought about 
> this possibility until we said them about.

In a choice between expression indexes and default_text_search_config,
there is no question in my mind that expression indexes are more useful.
Lack of default_text_search_config only means you have to specify the
configuration name every time, and can't do casting to a text search
data type.

> > Are you saying the majority of users have a separate column with a
> > trigger?  Does the trigger specify the configuation?  I don't see that
> > as a parameter argument to tsvector_update_trigger().  If you reload a
> > pg_dump, what does it use for the configuration?
> >
> 
> yes, separate column with custom trigger works fine. It's up to you how
> to keep your data actual and it's up to you how to write trigger. 
> Our tsvector_update_trigger() is a tsvector_update_trigger_example() !

Well, that is the major problem --- that this is very error-prone,
especially considering that the tsvector_update_trigger() doesn't get it
right either.

> > Why is a separate column better than the index?  Just ranking?
> 
> ranking + composite documents. I already mentioned, that this could be
> rather expensive. Also, having separate column allow people various
> ways to say what is a document and even change it.

OK, I am confused why an expression index can't use those features if a
separate column can.  I realize the index can't store that information,
but why can the code pick it out of a heap column but not run the
function on the heap row to get that information.  I assume it is
something that is just hard to implement.

> > The reason the expression index is nice is this feature has to be easy
> > to use for people who are new to full text and even PostgreSQL.  Right
> > now /contrib is fine for experts to use, but we want a larger user base
> > for this feature.
> 
> I agree here. This was one of the main reason of our work for 8.3.
> Probably, we shold think in another direction - not to curtail tsearch2
> and confuse rather big existing users, but to add an ability to save somehow
> configuration used for creating of *document*
> either implicitly (in expression index, or just gin(text_column)), or
> explicitly (separate column). There is no problem with index itself !

Agreed.  We need to find a way to save the configuration when the output
of a text search function is stored, either in an expression index or
via a trigger into a separate column, but only if we allow the default
configuration to be changed by non-super-users.

> >
> > Should we hold the patch for 8.4?
> 
> If we're not agree to say in docs, that implicit usage of text search 
> configuration in CREATE INDEX command doesn't supported. Could we leave
> default_text_search_config for super-users, at least ?
> 
> Anyway, let's wait what other people say.

The big problem is that not many people have taken the time to fully
understand how full text search works. I hoped that putting the updated
documentation online would help:
http://momjian.us/expire/fulltext/HTML/textsearch.html

but it seems it hasn't.

What we could do it if we make default_text_search_config
super-user-only and tell users at the start that if
default_text_search_config doesn't match the language they want to use,
then they have to read a documentation section that explains the problem
of configuration mismatches.

The problem with that is that we should be setting
default_text_search_config in the pg_dump output, like we do for
client_encoding, but because it is a super-user-only, it will fail for
non-super-user restores.

So, I am back to thinking default_text_search_config isn't going to
work reliably for novice users.

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: default_text_search_config and expression indexes

From
Bruce Momjian
Date:
Ron Mayer wrote:
> Bruce Momjian wrote:
> > Oleg Bartunov wrote:
> >> What is a basis of your assumption ? In my opinion, it's very limited
> >> use of text search, because it doesn't supports ranking. For 4-5 years
> >> of tsearch2 usage I never used it and I never seem in mailing lists.
> >> This is very user-oriented feature and we could probably ask 
> >> -general people for their opinion.
> 
> I think I asked about this kind of usage a couple years back;
> and Oleg pointed out other reasons why it wasn't as good an
> idea too.
> 
> http://archives.postgresql.org/pgsql-general/2005-10/msg00475.php
> http://archives.postgresql.org/pgsql-general/2005-10/msg00477.php
> 
> The particular question I had asked why the functional index was
> slower than maintaining the extra column; with the explanation
> that the lossy index having to call the function (including
> parsing, dictionary lookup, etc) for re-checking the data made
> it inadvisable to avoid the extra column anyway.
> 
> > I doubt 'general' is going to understand the details of merging this
> > into the backend.  I assume we have enough people on hackers to decide
> > this.
> > 
> > Are you saying the majority of users have a separate column with a
> > trigger?
> 
> I think so.   At least when I was using it in 2005 the second
> column with the trigger was faster than using a functional index.

OK, it is good you measured it.  I wonder how GIN would behave because
it is not lossy.

> >> We need more feedback from users.
> > 
> > Well, I am waiting for other hackers to get involved, but if they don't,
> > I have to evaluate it myself on the email lists.
> 
> Personally, I think documentation changes would be an OK way to
> to handle it.   Something that makes it extremely clear to the
> user the advantages of having the extra column and the risks
> of avoiding them.

Sure, but you have make sure you use the right configuration in the
trigger, no?  Does the tsquery have to use the same configuration?

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: default_text_search_config and expression indexes

From
Ron Mayer
Date:
Bruce Momjian wrote:
> Ron Mayer wrote:
>> Bruce Momjian wrote:
>>> Oleg Bartunov wrote:
>>>> What is a basis of your assumption ? 
>> I think I asked about this kind of usage a couple years back;...
>>
>> http://archives.postgresql.org/pgsql-general/2005-10/msg00475.php
>> http://archives.postgresql.org/pgsql-general/2005-10/msg00477.php
>>
>> ...why the functional index was
>> slower than maintaining the extra column; with the explanation
>> that the lossy index having to call the function (including
>> parsing, dictionary lookup, etc) for re-checking the data ...
>> ...
>>>
>>> Are you saying the majority of users have a separate column with a
>>> trigger?
>> I think so.   At least when I was using it in 2005 the second
>> column with the trigger was faster than using a functional index.
> 
> OK, it is good you measured it.  I wonder how GIN would behave because
> it is not lossy.

Too bad I don't have the same database around anymore.
It seems the re-parsing for re-checking for the lossy index was very
expensive, tho.
In the end, I suspect it depends greatly on what fraction of rows match.

>>>> We need more feedback from users.
>>> Well, I am waiting for other hackers to get involved, but if they don't,
>>> I have to evaluate it myself on the email lists.
>> Personally, I think documentation changes would be an OK way to
>> to handle it.   Something that makes it extremely clear to the
>> user the advantages of having the extra column and the risks
>> of avoiding them.
> 
> Sure, but you have make sure you use the right configuration in the
> trigger, no?  Does the tsquery have to use the same configuration?

I wish I knew this myself. :-)   Whatever I had done happened to work
but that was largely through people on IRC walking me through it.


Re: default_text_search_config and expression indexes

From
Bruce Momjian
Date:
Ron Mayer wrote:
> >>>> We need more feedback from users.
> >>> Well, I am waiting for other hackers to get involved, but if they don't,
> >>> I have to evaluate it myself on the email lists.
> >> Personally, I think documentation changes would be an OK way to
> >> to handle it.   Something that makes it extremely clear to the
> >> user the advantages of having the extra column and the risks
> >> of avoiding them.
> > 
> > Sure, but you have make sure you use the right configuration in the
> > trigger, no?  Does the tsquery have to use the same configuration?
> 
> I wish I knew this myself. :-)   Whatever I had done happened to work
> but that was largely through people on IRC walking me through it.

This illustrates the major issue --- that this has to be simple for
people to get started, while keeping the capabilities for experienced
users.

I am now thinking that making users always specify the configuration
name and not allowing :: casting is going to be the best approach.  We
can always add more in 8.4 after it is in wide use.

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: default_text_search_config and expression indexes

From
Ron Mayer
Date:
Bruce Momjian wrote:
> Ron Mayer wrote:
>> I wish I knew this myself. :-)   Whatever I had done happened to work
>> but that was largely through people on IRC walking me through it.
> 
> This illustrates the major issue --- that this has to be simple for
> people to get started, while keeping the capabilities for experienced
> users.
> 
> I am now thinking that making users always specify the configuration
> name and not allowing :: casting is going to be the best approach.  We
> can always add more in 8.4 after it is in wide use.

That's fair.   Either the docs need to make it totally obvious or
the software should force people to do something safe.


Re: default_text_search_config and expression indexes

From
Heikki Linnakangas
Date:
Bruce Momjian wrote:
> Ron Mayer wrote:
>>>>>> We need more feedback from users.
>>>>> Well, I am waiting for other hackers to get involved, but if they don't,
>>>>> I have to evaluate it myself on the email lists.
>>>> Personally, I think documentation changes would be an OK way to
>>>> to handle it.   Something that makes it extremely clear to the
>>>> user the advantages of having the extra column and the risks
>>>> of avoiding them.
>>> Sure, but you have make sure you use the right configuration in the
>>> trigger, no?  Does the tsquery have to use the same configuration?
>> I wish I knew this myself. :-)   Whatever I had done happened to work
>> but that was largely through people on IRC walking me through it.
> 
> This illustrates the major issue --- that this has to be simple for
> people to get started, while keeping the capabilities for experienced
> users.
> 
> I am now thinking that making users always specify the configuration
> name and not allowing :: casting is going to be the best approach.  We
> can always add more in 8.4 after it is in wide use.

I just read the docs and I'm trying to get a grip of the problem here.

If I understood correctly, the basic issue is that a tsvector datum
created using configuration A is incompatible with a tsquery datum
created using configuration B, in the sense that you won't get
reasonable results if you use the tsquery to search the tsvector, or do
ranking or highlighting. If the configurations happen to be similar
enough, it can work, but not in general.

That underlying issue manifests itself in many ways, including:
- if you create table with a field of type tsvector, typically kept
up-to-date by triggers, and do a search on it using a different
configuration, you get incorrect results.
- using an expression index instead of a tsvector-field, and always
explicitly specifying the configuration, you can avoid that problem (a
query with a different configuration won't use the index). But an
expression index, without explicitly specifying the configuration, will
get corrupted if you change the default configuration.

Removing the default configuration setting altogether removes the 2nd
problem, but that's not good from a usability point of view. And it
doesn't solve the general issue, you can still do things like:
SELECT * FROM foo WHERE to_tsvector('confA', textcol) @@
to_tsquery('confB', 'query');

ISTM we should have a separate tsvector and tsquery data type for each
configuration, and throw an error if you try to mix and match them in a
query. to_tsquery and to_tsvector would be new kind of polymorphic
functions that work with the types. Or we could automatically create a
copy of them when you create a new configuration. We could have a
default configuration setting and rewrite queries that don't explicitly
specify a configuration to use the default.

You could still get into trouble if you alter the configuration after
starting to use it. We could solve that by not allowing you to ALTER
CONFIGURATION, at least not if it's used in tables or indexes. Forcing
people to create a new configuration, and to recreate all indexes and
tsvector columns every time you add a word to a stop-list, for example,
seems too onerous, though. Not sure what to do about that.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: default_text_search_config and expression indexes

From
Bruce Momjian
Date:
Heikki Linnakangas wrote:
> >>> Sure, but you have make sure you use the right configuration in the
> >>> trigger, no?  Does the tsquery have to use the same configuration?
> >> I wish I knew this myself. :-)   Whatever I had done happened to work
> >> but that was largely through people on IRC walking me through it.
> > 
> > This illustrates the major issue --- that this has to be simple for
> > people to get started, while keeping the capabilities for experienced
> > users.
> > 
> > I am now thinking that making users always specify the configuration
> > name and not allowing :: casting is going to be the best approach.  We
> > can always add more in 8.4 after it is in wide use.
> 
> I just read the docs and I'm trying to get a grip of the problem here.
> 
> If I understood correctly, the basic issue is that a tsvector datum
> created using configuration A is incompatible with a tsquery datum
> created using configuration B, in the sense that you won't get
> reasonable results if you use the tsquery to search the tsvector, or do
> ranking or highlighting. If the configurations happen to be similar
> enough, it can work, but not in general.

Right.

> That underlying issue manifests itself in many ways, including:
> - if you create table with a field of type tsvector, typically kept
> up-to-date by triggers, and do a search on it using a different
> configuration, you get incorrect results.

Right.

> - using an expression index instead of a tsvector-field, and always
> explicitly specifying the configuration, you can avoid that problem (a
> query with a different configuration won't use the index). But an
> expression index, without explicitly specifying the configuration, will
> get corrupted if you change the default configuration.

Right.

> Removing the default configuration setting altogether removes the 2nd
> problem, but that's not good from a usability point of view. And it
> doesn't solve the general issue, you can still do things like:
> SELECT * FROM foo WHERE to_tsvector('confA', textcol) @@
> to_tsquery('confB', 'query');

True, but in that case you are specifically naming different
configurations, so it is hopefully obvious you have a mismatch.

> ISTM we should have a separate tsvector and tsquery data type for each
> configuration, and throw an error if you try to mix and match them in a
> query. to_tsquery and to_tsvector would be new kind of polymorphic
> functions that work with the types. Or we could automatically create a
> copy of them when you create a new configuration. We could have a
> default configuration setting and rewrite queries that don't explicitly
> specify a configuration to use the default.

That is going to make multiple configurations quite complex in the
backend, and I think for little value.

> You could still get into trouble if you alter the configuration after
> starting to use it. We could solve that by not allowing you to ALTER
> CONFIGURATION, at least not if it's used in tables or indexes. Forcing
> people to create a new configuration, and to recreate all indexes and
> tsvector columns every time you add a word to a stop-list, for example,
> seems too onerous, though. Not sure what to do about that.

Yea, seems more work than is necessary.  If we require the configuration
to be always supplied, and document that mismatches are a problem, I
think we are in good shape.

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: default_text_search_config and expression indexes

From
Oleg Bartunov
Date:
On Wed, 8 Aug 2007, Bruce Momjian wrote:

> Heikki Linnakangas wrote:
>>>>> Sure, but you have make sure you use the right configuration in the
>>>>> trigger, no?  Does the tsquery have to use the same configuration?
>>>> I wish I knew this myself. :-)   Whatever I had done happened to work
>>>> but that was largely through people on IRC walking me through it.
>>>
>>> This illustrates the major issue --- that this has to be simple for
>>> people to get started, while keeping the capabilities for experienced
>>> users.
>>>
>>> I am now thinking that making users always specify the configuration
>>> name and not allowing :: casting is going to be the best approach.  We
>>> can always add more in 8.4 after it is in wide use.
>>
>> I just read the docs and I'm trying to get a grip of the problem here.
>>
>> If I understood correctly, the basic issue is that a tsvector datum
>> created using configuration A is incompatible with a tsquery datum
>> created using configuration B, in the sense that you won't get
>> reasonable results if you use the tsquery to search the tsvector, or do
>> ranking or highlighting. If the configurations happen to be similar
>> enough, it can work, but not in general.
>
> Right.

not fair. There are many cases when one can intentionally use different
configurations. But I agree, this is not for beginners.

>
>> That underlying issue manifests itself in many ways, including:
>> - if you create table with a field of type tsvector, typically kept
>> up-to-date by triggers, and do a search on it using a different
>> configuration, you get incorrect results.
>
> Right.

again, you might want to use different configuration.

>
>> - using an expression index instead of a tsvector-field, and always
>> explicitly specifying the configuration, you can avoid that problem (a
>> query with a different configuration won't use the index). But an
>> expression index, without explicitly specifying the configuration, will
>> get corrupted if you change the default configuration.
>
> Right.

the same problem if you drop constrain from table (accidently) and then
gets surprised by select results.

>
>> Removing the default configuration setting altogether removes the 2nd
>> problem, but that's not good from a usability point of view. And it
>> doesn't solve the general issue, you can still do things like:
>> SELECT * FROM foo WHERE to_tsvector('confA', textcol) @@
>> to_tsquery('confB', 'query');
>
> True, but in that case you are specifically naming different
> configurations, so it is hopefully obvious you have a mismatch.
>
>> ISTM we should have a separate tsvector and tsquery data type for each
>> configuration, and throw an error if you try to mix and match them in a
>> query. to_tsquery and to_tsvector would be new kind of polymorphic
>> functions that work with the types. Or we could automatically create a
>> copy of them when you create a new configuration. We could have a
>> default configuration setting and rewrite queries that don't explicitly
>> specify a configuration to use the default.
>
> That is going to make multiple configurations quite complex in the
> backend, and I think for little value.
>
>> You could still get into trouble if you alter the configuration after
>> starting to use it. We could solve that by not allowing you to ALTER
>> CONFIGURATION, at least not if it's used in tables or indexes. Forcing
>> people to create a new configuration, and to recreate all indexes and
>> tsvector columns every time you add a word to a stop-list, for example,
>> seems too onerous, though. Not sure what to do about that.
>
> Yea, seems more work than is necessary.  If we require the configuration
> to be always supplied, and document that mismatches are a problem, I
> think we are in good shape.

We should agree that all you describe is only for DUMMY users. 
From authors point of view I dislike your approach to treat text searching 
as a very limited tool. But I understand that we should preserve people from 
stupid errors.

I want for beginners easy setup and error-prone functionality,
but leaving experienced users to develop complex search engines.
Can we have separate safe interface for text searching and explicitly
recommend it for beginners ?
    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


Re: default_text_search_config and expression indexes

From
Bruce Momjian
Date:
Oleg Bartunov wrote:
> > Yea, seems more work than is necessary.  If we require the configuration
> > to be always supplied, and document that mismatches are a problem, I
> > think we are in good shape.
> 
> We should agree that all you describe is only for DUMMY users. 
> >From authors point of view I dislike your approach to treat text searching 
> as a very limited tool. But I understand that we should preserve people from 
> stupid errors.
> 
> I want for beginners easy setup and error-prone functionality,
> but leaving experienced users to develop complex search engines.
> Can we have separate safe interface for text searching and explicitly
> recommend it for beginners ?

I am glad we are moving this interface discussion forward.  It seems 
Heikki has similar concerns about the interface being error-prone.

It would be nice to have a novice and advanced interface, but we would
have to document both, and then that is going to be confusing for users.

As I see it, specifying the configuration name in every function call is
the novice interface, and avoids the most common errors.  I can see
defaulting the interface name as being an advanced interface, but I
don't think it has enough of a feature to be worth documenting and
implementing.

If we figure out something better in 8.4 we can implement it, but at
this point I can't think of any good solution to not specifying the
configuration name every time.

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: default_text_search_config and expression indexes

From
tomas@tuxteam.de
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Thu, Aug 09, 2007 at 02:36:41AM -0400, Bruce Momjian wrote:
> Oleg Bartunov wrote:
> > > Yea, seems more work than is necessary.  If we require the configuration
> > > to be always supplied, and document that mismatches are a problem, I
> > > think we are in good shape.
> > 
> > We should agree that all you describe is only for DUMMY users. 
> > >From authors point of view I dislike your approach to treat text searching 
> > as a very limited tool [...]

[...]

> I am glad we are moving this interface discussion forward.  It seems 
> Heikki has similar concerns about the interface being error-prone.
> 
> It would be nice to have a novice and advanced interface, but we would
> have to document both, and then that is going to be confusing for users.
> 
> As I see it, specifying the configuration name in every function call is
> the novice interface, and avoids the most common errors.  I can see
> defaulting the interface name as being an advanced interface, but I
> don't think it has enough of a feature to be worth documenting and
> implementing.
> 
> If we figure out something better in 8.4 we can implement it, but at
> this point I can't think of any good solution to not specifying the
> configuration name every time.

Maybe I'm missing something, but it seems to me that the configuration
is more attached to a column/index thatn to the whole database. If
there's a default in an expression, I'd rather expect this default to be
drawn from the index involved than from a global value (like a functional
index does now).

Regards
- -- tomás
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)

iD8DBQFGuuAoBcgs9XrR2kYRAqiiAJsFL+Iu/b/xYaLza5ozmi839Qh5awCeOp+f
SZHKDPUHZ3u99XzLBn2ZKjw=
=twEt
-----END PGP SIGNATURE-----



Re: default_text_search_config and expression indexes

From
Oleg Bartunov
Date:
On Thu, 9 Aug 2007, tomas@tuxteam.de wrote:

> Maybe I'm missing something, but it seems to me that the configuration
> is more attached to a column/index thatn to the whole database. If
> there's a default in an expression, I'd rather expect this default to be
> drawn from the index involved than from a global value (like a functional
> index does now).

I'm tired to repeat - index itself doesn't know about configuration !
It just index tsvector data type. tsvector in turn can be obtained
using various ways:
1. manually
2. to_tsvector

tsvector can be stored in a separate attribute or be fully virtual
like in expressional index.

Moreover, tsvector can be obtained using various configurations depending
on your application.

    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


Re: default_text_search_config and expression indexes

From
tomas@tuxteam.de
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Thu, Aug 09, 2007 at 02:03:13PM +0400, Oleg Bartunov wrote:
> On Thu, 9 Aug 2007, tomas@tuxteam.de wrote:
> 
> >Maybe I'm missing something [...]

> I'm tired to repeat - index itself doesn't know about configuration !
> It just index tsvector data type. tsvector in turn can be obtained
> using various ways:
> 1. manually
> 2. to_tsvector
> 
> tsvector can be stored in a separate attribute or be fully virtual
> like in expressional index.
> 
> Moreover, tsvector can be obtained using various configurations depending
> on your application.

Yep. I think I got that.

Now what will be the "normal" case for a new user?

* build an index with a virtual tsvector (using a configuration). Then I would expect the "index" to know the "whole"
functionto calculate its entries -- that would include the config used.
 
 This would be more "compatible" with the approach stated elsewhere to always mention explicitly the config.

* manually. Would a novice do that? Or is that advanced stuff?

Regards -- and sorry for my stupid questions :)

- -- tomás
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)

iD8DBQFGu1ciBcgs9XrR2kYRAnVqAJ9mSc463I44JxLLDPpUZ/JirUVH5ACeOhUr
2/4aXs0ukMnvP8YCq8pamwQ=
=IgfC
-----END PGP SIGNATURE-----



Re: default_text_search_config and expression indexes

From
Heikki Linnakangas
Date:
Oleg Bartunov wrote:
> On Wed, 8 Aug 2007, Bruce Momjian wrote:
>> Heikki Linnakangas wrote:
>>> If I understood correctly, the basic issue is that a tsvector datum
>>> created using configuration A is incompatible with a tsquery datum
>>> created using configuration B, in the sense that you won't get
>>> reasonable results if you use the tsquery to search the tsvector, or do
>>> ranking or highlighting. If the configurations happen to be similar
>>> enough, it can work, but not in general.
>>
>> Right.
> 
> not fair. There are many cases when one can intentionally use different
> configurations. But I agree, this is not for beginners.

Can you give an example of that?

I certainly can see the need to use different configurations in one
database, but what's the use case for comparing a tsvector created with
configuration A against a tsquery created with configuration B?

>>> - using an expression index instead of a tsvector-field, and always
>>> explicitly specifying the configuration, you can avoid that problem (a
>>> query with a different configuration won't use the index). But an
>>> expression index, without explicitly specifying the configuration, will
>>> get corrupted if you change the default configuration.
>>
>> Right.
> 
> the same problem if you drop constrain from table (accidently) and then
> gets surprised by select results.

The difference is that if you change the default configuration, you
won't expect that your queries start to return funny results. It looks
harmless, like changing the date style. If you drop a constraint, it's
much more obvious what the consequences are.

> We should agree that all you describe is only for DUMMY users. From
> authors point of view I dislike your approach to treat text searching as
> a very limited tool. But I understand that we should preserve people
> from stupid errors.
> 
> I want for beginners easy setup and error-prone functionality,
> but leaving experienced users to develop complex search engines.
> Can we have separate safe interface for text searching and explicitly
> recommend it for beginners ?

I don't see how any of the suggestions limits what you can do with it.
If we remove the default configuration parameter, you just have to be
explicit. If we go with the type-system I suggested, you could still add
casts and conversion functions between different tsvector types, where
it make sense.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: default_text_search_config and expression indexes

From
Heikki Linnakangas
Date:
Bruce Momjian wrote:
> Heikki Linnakangas wrote:
>> Removing the default configuration setting altogether removes the 2nd
>> problem, but that's not good from a usability point of view. And it
>> doesn't solve the general issue, you can still do things like:
>> SELECT * FROM foo WHERE to_tsvector('confA', textcol) @@
>> to_tsquery('confB', 'query');
> 
> True, but in that case you are specifically naming different
> configurations, so it is hopefully obvious you have a mismatch.

There's many more subtle ways to do that. For example, filling a
tsvector column using a DEFAULT clause. But then you sometimes fill it
in the application instead, with a different configuration. Or if one of
the function calls is buried in another user defined function.

I don't think explicitly naming the configuration gives enough protection.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: default_text_search_config and expression indexes

From
Bruce Momjian
Date:
Heikki Linnakangas wrote:
> Oleg Bartunov wrote:
> > On Wed, 8 Aug 2007, Bruce Momjian wrote:
> >> Heikki Linnakangas wrote:
> >>> If I understood correctly, the basic issue is that a tsvector datum
> >>> created using configuration A is incompatible with a tsquery datum
> >>> created using configuration B, in the sense that you won't get
> >>> reasonable results if you use the tsquery to search the tsvector, or do
> >>> ranking or highlighting. If the configurations happen to be similar
> >>> enough, it can work, but not in general.
> >>
> >> Right.
> > 
> > not fair. There are many cases when one can intentionally use different
> > configurations. But I agree, this is not for beginners.
> 
> Can you give an example of that?
> 
> I certainly can see the need to use different configurations in one
> database, but what's the use case for comparing a tsvector created with
> configuration A against a tsquery created with configuration B?

I assume you could have a configuration with different stop words or
synonymns and compare them.

> >>> - using an expression index instead of a tsvector-field, and always
> >>> explicitly specifying the configuration, you can avoid that problem (a
> >>> query with a different configuration won't use the index). But an
> >>> expression index, without explicitly specifying the configuration, will
> >>> get corrupted if you change the default configuration.
> >>
> >> Right.
> > 
> > the same problem if you drop constrain from table (accidently) and then
> > gets surprised by select results.
> 
> The difference is that if you change the default configuration, you
> won't expect that your queries start to return funny results. It looks
> harmless, like changing the date style. If you drop a constraint, it's
> much more obvious what the consequences are.
> 
> > We should agree that all you describe is only for DUMMY users. From
> > authors point of view I dislike your approach to treat text searching as
> > a very limited tool. But I understand that we should preserve people
> > from stupid errors.
> > 
> > I want for beginners easy setup and error-prone functionality,
> > but leaving experienced users to develop complex search engines.
> > Can we have separate safe interface for text searching and explicitly
> > recommend it for beginners ?
> 
> I don't see how any of the suggestions limits what you can do with it.
> If we remove the default configuration parameter, you just have to be
> explicit. If we go with the type-system I suggested, you could still add
> casts and conversion functions between different tsvector types, where
> it make sense.

I don't think the type system is workable given the ability to create
new configurations on the fly.  I think the configuration must be
specified each time.

At this point, if we keep discussing the tsearch2 API we are not going
to have this in 8.3.

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: default_text_search_config and expression indexes

From
Bruce Momjian
Date:
Heikki Linnakangas wrote:
> Bruce Momjian wrote:
> > Heikki Linnakangas wrote:
> >> Removing the default configuration setting altogether removes the 2nd
> >> problem, but that's not good from a usability point of view. And it
> >> doesn't solve the general issue, you can still do things like:
> >> SELECT * FROM foo WHERE to_tsvector('confA', textcol) @@
> >> to_tsquery('confB', 'query');
> > 
> > True, but in that case you are specifically naming different
> > configurations, so it is hopefully obvious you have a mismatch.
> 
> There's many more subtle ways to do that. For example, filling a
> tsvector column using a DEFAULT clause. But then you sometimes fill it
> in the application instead, with a different configuration. Or if one of
> the function calls is buried in another user defined function.
> 
> I don't think explicitly naming the configuration gives enough protection.

Oh, wow, OK, well in that case the text search API isn't ready and we
will have to hold this for 8.4.

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: default_text_search_config and expression indexes

From
Alvaro Herrera
Date:
Oleg Bartunov wrote:
> On Thu, 9 Aug 2007, tomas@tuxteam.de wrote:
>
>> Maybe I'm missing something, but it seems to me that the configuration
>> is more attached to a column/index thatn to the whole database. If
>> there's a default in an expression, I'd rather expect this default to be
>> drawn from the index involved than from a global value (like a functional
>> index does now).
>
> I'm tired to repeat - index itself doesn't know about configuration !

Is there a way to change that?  For example store the configuration in a
metapage or something?

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: default_text_search_config and expression indexes

From
Gregory Stark
Date:
"Alvaro Herrera" <alvherre@commandprompt.com> writes:

> Oleg Bartunov wrote:
>>
>> I'm tired to repeat - index itself doesn't know about configuration !
>
> Is there a way to change that?  For example store the configuration in a
> metapage or something?

I think Heikki's suggestion of having each configuration create a new type
would effectively do the same thing.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com


Re: default_text_search_config and expression indexes

From
"Mike Rylander"
Date:
On 8/13/07, Bruce Momjian <bruce@momjian.us> wrote:
> Heikki Linnakangas wrote:
> > Bruce Momjian wrote:
> > > Heikki Linnakangas wrote:
> > >> Removing the default configuration setting altogether removes the 2nd
> > >> problem, but that's not good from a usability point of view. And it
> > >> doesn't solve the general issue, you can still do things like:
> > >> SELECT * FROM foo WHERE to_tsvector('confA', textcol) @@
> > >> to_tsquery('confB', 'query');
> > >
> > > True, but in that case you are specifically naming different
> > > configurations, so it is hopefully obvious you have a mismatch.
> >
> > There's many more subtle ways to do that. For example, filling a
> > tsvector column using a DEFAULT clause. But then you sometimes fill it
> > in the application instead, with a different configuration. Or if one of
> > the function calls is buried in another user defined function.
> >
> > I don't think explicitly naming the configuration gives enough protection.
>
> Oh, wow, OK, well in that case the text search API isn't ready and we
> will have to hold this for 8.4.
>

I've been watching this thread with a mixture of dread and hope,
waiting to see where the developers' inclination will end up; whether
leaving a useful foot gun available will be allowed.

This is just my $0.02 as a fairly heavy user of the current tsearch2
code, but I sincerely hope you do not cripple the system by removing
the ability to store tsvectors built using arbitrary configurations in
a single column.  Yes, it can lead to unexpected results if you do not
know what you are doing, but if you have gone beyond building a single
tsearch2 configuration then you are required to know what you are
doing.  What's more, IMO the default configuration mechanism feels
very much like a CONSTRAINT, as Oleg suggests.  That point is one of
cognizance, where if one has gone to the trouble of setting up
multiple configurations and has learned enough to do so correctly,
then one necessarily understands the importance of the setting and can
use it (or not, and use explicit configurations) correctly.  The
default config lowers the bar to an acceptable level for beginners
that have no need of multiple configurations, and while I don't feel
too strongly, personally, about having a default, I think it is both
useful and helpful for new users -- it was for me.

Now, so this email isn't entirely complaining, and as a data point for
the discussion, I'll explain why I do not want to see tsearch2
crippled in the way suggested by Heikki and Bruce.

My application (http://open-ils.org, which run >80% of the public
libraries in Georgia, USA, http://gapines.org and
http://georgialibraries.org/lib/pines.html) requires that I be able to
search a corpus of bibliographic records in a mix of languages, and
potentially with mixed stop-word rules, with one query.  I cannot know
ahead of time what languages will be used in the corpus and I cannot
restrict any one query to one language.  To accomplish this, the
record itself will be inspected inside an INSERT/UPDATE trigger to
determine the language and type, and use the correct configuration for
creating the tsvector.  This will obviously result in a "mixed"
tsvector column, but that's exactly what I need.  I can filter on
record language if the user happens to specify a query language (and
thus configuration), or simply rank the assumed (IP based, perhaps, or
browser preference based) preferred language higher, or one of a
hundred other things.  But I won't be able to do any of that if
tsvectors are required to have one and only one configuration per
column.

Anyway, I felt I needed to provide some outside perspective to this,
as a user, since it seems that the external viewpoint (my particular
viewpoint, at least) was missing from the discussion.

Thanks, folks, for all the work on this so far!

--miker


Re: default_text_search_config and expression indexes

From
Heikki Linnakangas
Date:
Mike Rylander wrote:
> This is just my $0.02 as a fairly heavy user of the current tsearch2
> code, but I sincerely hope you do not cripple the system by removing
> the ability to store tsvectors built using arbitrary configurations in
> a single column.  Yes, it can lead to unexpected results if you do not
> know what you are doing, but if you have gone beyond building a single
> tsearch2 configuration then you are required to know what you are
> doing.  What's more, IMO the default configuration mechanism feels
> very much like a CONSTRAINT, as Oleg suggests.  That point is one of
> cognizance, where if one has gone to the trouble of setting up
> multiple configurations and has learned enough to do so correctly,
> then one necessarily understands the importance of the setting and can
> use it (or not, and use explicit configurations) correctly.  The
> default config lowers the bar to an acceptable level for beginners
> that have no need of multiple configurations, and while I don't feel
> too strongly, personally, about having a default, I think it is both
> useful and helpful for new users -- it was for me.

Thanks for chiming in. As a disclaimer: I've never used tsearch2 in a
real application myself.

> My application (http://open-ils.org, which run >80% of the public
> libraries in Georgia, USA, http://gapines.org and
> http://georgialibraries.org/lib/pines.html) requires that I be able to
> search a corpus of bibliographic records in a mix of languages, and
> potentially with mixed stop-word rules, with one query.  I cannot know
> ahead of time what languages will be used in the corpus and I cannot
> restrict any one query to one language.  To accomplish this, the
> record itself will be inspected inside an INSERT/UPDATE trigger to
> determine the language and type, and use the correct configuration for
> creating the tsvector.  This will obviously result in a "mixed"
> tsvector column, but that's exactly what I need.  I can filter on
> record language if the user happens to specify a query language (and
> thus configuration), or simply rank the assumed (IP based, perhaps, or
> browser preference based) preferred language higher, or one of a
> hundred other things.  But I won't be able to do any of that if
> tsvectors are required to have one and only one configuration per
> column.

Don't you need to use the right configuration to parse the query into a
tsquery as well?

What you have is basically the classic problem problem of representing
inheritance in a relational model. You have a superclass, bibliographic
record, and multiple subclasses, one per language with one extra field,
the corpus in the right language. You've solved it by storing them all
in one table, with an extra column (language) to denote which subclass
the record is. Alternatively, you could solve it by using inherited
tables, or having one table per language with a foreign key referencing
the master table containing the other fields, or having a single table
with one nullable field per configuration, and a check constraint to
check that exactly one of them is not null.

As a thought experiment, let me present another, not text search related
example that's isomorphic to your problem:

Imagine an international online auction system, where you can put items
for sale and specify a minimum price. Part of the database schema is:

CREATE TABLE item ( id INTEGER, description TEXT, currency CHAR(3), price NUMERIC
);

It clearly doesn't make sense to directly compare prices in different
currencies against each other. A query like WHERE price > 1000 doesn't
make sense, unless you also restrict the currency, or use an exchange
rate to convert between currencies. Also, different currencies might
have slightly different rules on how many decimal places are
significant. In this example, as well as your bibliographic scenario, we
can conveniently store prices in all currencies in the same field
because they're all numerics. If we wanted to enforce per-currency
rules, like NUMERIC(10, 2) for USD and NUMERIC(10,0) for Italian lires
(which doesn't really exists anymore, I know), we'd need to store them
in separate columns. And before the decimalisation in 1971, the British
pound was divided into 20 shillings, each of which was divided to 12
pence, so it wouldn't have fit to normal numeric field, and we would
again have to store that in a separate column.

What I'm trying to point out is that the problem isn't unique to text
search. It's an old problem, with many alternative solutions, even with
strong typing. Arguably if you store data in multiple languages in same
field, you have a denormalized schema. Granted, loose typing can be more
convenient, but you give up the benefits of strong typing as well.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: default_text_search_config and expression indexes

From
Heikki Linnakangas
Date:
Bruce Momjian wrote:
> Heikki Linnakangas wrote:
>> Bruce Momjian wrote:
>>> Heikki Linnakangas wrote:
>>>> Removing the default configuration setting altogether removes the 2nd
>>>> problem, but that's not good from a usability point of view. And it
>>>> doesn't solve the general issue, you can still do things like:
>>>> SELECT * FROM foo WHERE to_tsvector('confA', textcol) @@
>>>> to_tsquery('confB', 'query');
>>> True, but in that case you are specifically naming different
>>> configurations, so it is hopefully obvious you have a mismatch.
>> There's many more subtle ways to do that. For example, filling a
>> tsvector column using a DEFAULT clause. But then you sometimes fill it
>> in the application instead, with a different configuration. Or if one of
>> the function calls is buried in another user defined function.
>>
>> I don't think explicitly naming the configuration gives enough protection.
> 
> Oh, wow, OK, well in that case the text search API isn't ready and we
> will have to hold this for 8.4.

That would be unfortunate :(. Sorry I haven't looked at this earlier. Do
you think that implementing a strongly typed system is too much work for
8.3?

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: default_text_search_config and expression indexes

From
Oleg Bartunov
Date:
On Tue, 14 Aug 2007, Alvaro Herrera wrote:

> Oleg Bartunov wrote:
>> On Thu, 9 Aug 2007, tomas@tuxteam.de wrote:
>>
>>> Maybe I'm missing something, but it seems to me that the configuration
>>> is more attached to a column/index thatn to the whole database. If
>>> there's a default in an expression, I'd rather expect this default to be
>>> drawn from the index involved than from a global value (like a functional
>>> index does now).
>>
>> I'm tired to repeat - index itself doesn't know about configuration !
>
> Is there a way to change that?  For example store the configuration in a
> metapage or something?

it's useless, in general, since you could use different configuration to 
build tsvector.

    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


Re: default_text_search_config and expression indexes

From
"Mike Rylander"
Date:
On 8/14/07, Heikki Linnakangas <heikki@enterprisedb.com> wrote:
> Mike Rylander wrote:
[snip]
>
> Don't you need to use the right configuration to parse the query into a
> tsquery as well?
>

Only if the user (or user agent) can supply enough information to move
away from the configured default of, say, en-US.  And even then, other
tsvector configurations need to be searched.  Configurations are not,
of course, just about language/stemming, but also stop-words and
thesauri.  If it were just languages it wouldn't be too big of a deal
(other than in-app DDL management...).

> What you have is basically the classic problem problem of representing
> inheritance in a relational model. You have a superclass, bibliographic
> record, and multiple subclasses, one per language with one extra field,
> the corpus in the right language. You've solved it by storing them all
> in one table, with an extra column (language) to denote which subclass
> the record is. Alternatively, you could solve it by using inherited
> tables, or having one table per language with a foreign key referencing
> the master table containing the other fields, or having a single table
> with one nullable field per configuration, and a check constraint to
> check that exactly one of them is not null.
>

Sorry, communication problem here ... I provided an oversimplified
example meant more to show the issues than to find alternate
solutions, though I appreciate you taking the time to consider the
problem.  If I only had to consider one delimiting facet per record
then it would be much simpler. But with the complication that some
fields have stop-word lists (and some not), some use thesauri (and
different ones, at that), and these fields (as extracted from the
records) and their configurations (stem? use a stop-word list? use
thesaurus x, y or z?) are user-defined ...

> As a thought experiment, let me present another, not text search related
> example that's isomorphic to your problem:
>

Unfortunately, the problem you describe is not quite the same as the
problem I need to solve.

Fortunately, the current incarnation of tsearch2 does a pretty good
job of solving the problem if you store everything in one place and
use the query-time environment to apply some adjustments to the
ranking of items.  I could still work around this problem by creating
inherited tables, one for each configuration on each index-providing
table but I /really/ hope to avoid that.  Creating new configurations
for local requirements doesn't require creating new tables (and the
associated management overhead in the app) today, something I'd really
like to avoid.  In fact, I'm starting to sweat just thinking about
what the planner would go through with the number tables needed for
the potential configurations in an installation that makes use of
multiple thesauri and a mix of stop-word lists across, say, 30
languages.  Such a dataset is not uncommon.

In any case, thanks again for taking the time to think about the
problem.  I still think having the ability to store any old tsvector I
happen to have hanging around in any column of the correct type is a
GoodThing(tm).  I see from Oleg's message down-thread that that's the
way things will be (the tsvector type doesn't know about columns, just
lexem "tuples").

--miker


Re: default_text_search_config and expression indexes

From
Bruce Momjian
Date:
Heikki Linnakangas wrote:
> Bruce Momjian wrote:
> > Heikki Linnakangas wrote:
> >> Bruce Momjian wrote:
> >>> Heikki Linnakangas wrote:
> >>>> Removing the default configuration setting altogether removes the 2nd
> >>>> problem, but that's not good from a usability point of view. And it
> >>>> doesn't solve the general issue, you can still do things like:
> >>>> SELECT * FROM foo WHERE to_tsvector('confA', textcol) @@
> >>>> to_tsquery('confB', 'query');
> >>> True, but in that case you are specifically naming different
> >>> configurations, so it is hopefully obvious you have a mismatch.
> >> There's many more subtle ways to do that. For example, filling a
> >> tsvector column using a DEFAULT clause. But then you sometimes fill it
> >> in the application instead, with a different configuration. Or if one of
> >> the function calls is buried in another user defined function.
> >>
> >> I don't think explicitly naming the configuration gives enough protection.
> > 
> > Oh, wow, OK, well in that case the text search API isn't ready and we
> > will have to hold this for 8.4.
> 
> That would be unfortunate :(. Sorry I haven't looked at this earlier. Do
> you think that implementing a strongly typed system is too much work for
> 8.3?

Yea, probably.  See my other posting with an updated subject line.

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: default_text_search_config and expression indexes

From
Bruce Momjian
Date:
Mike Rylander wrote:
> This is just my $0.02 as a fairly heavy user of the current tsearch2
> code, but I sincerely hope you do not cripple the system by removing
> the ability to store tsvectors built using arbitrary configurations in
> a single column.  Yes, it can lead to unexpected results if you do not
> know what you are doing, but if you have gone beyond building a single
> tsearch2 configuration then you are required to know what you are
> doing.  What's more, IMO the default configuration mechanism feels
> very much like a CONSTRAINT, as Oleg suggests.  That point is one of
> cognizance, where if one has gone to the trouble of setting up
> multiple configurations and has learned enough to do so correctly,
> then one necessarily understands the importance of the setting and can
> use it (or not, and use explicit configurations) correctly.  The
> default config lowers the bar to an acceptable level for beginners
> that have no need of multiple configurations, and while I don't feel
> too strongly, personally, about having a default, I think it is both
> useful and helpful for new users -- it was for me.

What has really hurt the default GUC idea is how to do restores from a
pg_dump.  How do you make sure the right default is used on a restore,
particularly if multiple objects are being restored, and each has a
different default GUC.  I suppose your trigger handles that but that
isn't going to help with an expression index, nor in cases where the
default of the old database is different from the new one.

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: default_text_search_config and expression indexes

From
Alvaro Herrera
Date:
Bruce Momjian escribió:
> Mike Rylander wrote:
> > This is just my $0.02 as a fairly heavy user of the current tsearch2
> > code, but I sincerely hope you do not cripple the system by removing
> > the ability to store tsvectors built using arbitrary configurations in
> > a single column.  Yes, it can lead to unexpected results if you do not
> > know what you are doing, but if you have gone beyond building a single
> > tsearch2 configuration then you are required to know what you are
> > doing.  What's more, IMO the default configuration mechanism feels
> > very much like a CONSTRAINT, as Oleg suggests.  That point is one of
> > cognizance, where if one has gone to the trouble of setting up
> > multiple configurations and has learned enough to do so correctly,
> > then one necessarily understands the importance of the setting and can
> > use it (or not, and use explicit configurations) correctly.  The
> > default config lowers the bar to an acceptable level for beginners
> > that have no need of multiple configurations, and while I don't feel
> > too strongly, personally, about having a default, I think it is both
> > useful and helpful for new users -- it was for me.
> 
> What has really hurt the default GUC idea is how to do restores from a
> pg_dump.  How do you make sure the right default is used on a restore,
> particularly if multiple objects are being restored, and each has a
> different default GUC.  I suppose your trigger handles that but that
> isn't going to help with an expression index, nor in cases where the
> default of the old database is different from the new one.

I guess what should happen is that pg_dump should include a
SET default_text_search_config = 'foo'
just before the CREATE INDEX, like we do for other variables.  Of
course, in order for this to work, the index itself must know what value
was used on creation.  Oleg already dismissed my suggestion of putting
it into the index itself (a metapage or something).

Maybe store it in reloptions?

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: default_text_search_config and expression indexes

From
Alvaro Herrera
Date:
Oleg Bartunov wrote:
> On Tue, 14 Aug 2007, Alvaro Herrera wrote:
>
>> Oleg Bartunov wrote:
>>> On Thu, 9 Aug 2007, tomas@tuxteam.de wrote:
>>>
>>>> Maybe I'm missing something, but it seems to me that the configuration
>>>> is more attached to a column/index thatn to the whole database. If
>>>> there's a default in an expression, I'd rather expect this default to be
>>>> drawn from the index involved than from a global value (like a 
>>>> functional
>>>> index does now).
>>>
>>> I'm tired to repeat - index itself doesn't know about configuration !
>>
>> Is there a way to change that?  For example store the configuration in a
>> metapage or something?
>
> it's useless, in general, since you could use different configuration to 
> build tsvector.

Hmm, sorry, I think I just understood what this was about: so you mean
that the configuration is really *per row* and not per index?  So I can
store rows into an index using more than one configuration, and it will
work?

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: default_text_search_config and expression indexes

From
"Mike Rylander"
Date:
On 8/14/07, Alvaro Herrera <alvherre@commandprompt.com> wrote:
> Oleg Bartunov wrote:
> > On Tue, 14 Aug 2007, Alvaro Herrera wrote:
> >
> >> Oleg Bartunov wrote:
> >>> On Thu, 9 Aug 2007, tomas@tuxteam.de wrote:
> >>>
> >>>> Maybe I'm missing something, but it seems to me that the configuration
> >>>> is more attached to a column/index thatn to the whole database. If
> >>>> there's a default in an expression, I'd rather expect this default to be
> >>>> drawn from the index involved than from a global value (like a
> >>>> functional
> >>>> index does now).
> >>>
> >>> I'm tired to repeat - index itself doesn't know about configuration !
> >>
> >> Is there a way to change that?  For example store the configuration in a
> >> metapage or something?
> >
> > it's useless, in general, since you could use different configuration to
> > build tsvector.
>
> Hmm, sorry, I think I just understood what this was about: so you mean
> that the configuration is really *per row* and not per index?  So I can
> store rows into an index using more than one configuration, and it will
> work?

Can and does, to great success.  :)

--miker


Re: default_text_search_config and expression indexes

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Bruce Momjian escribi�:
>> What has really hurt the default GUC idea is how to do restores from a
>> pg_dump.

> I guess what should happen is that pg_dump should include a
> SET default_text_search_config = 'foo'
> just before the CREATE INDEX, like we do for other variables.

The nasty cases are in data-only dumps, that is, where you're trying to
load data into a table with pre-existing indexes or triggers.  A SET
like the above is at least as likely to be wrong as right, if the index
or trigger depends on it to tell it what to do.
        regards, tom lane


Re: default_text_search_config and expression indexes

From
Gregory Stark
Date:
"Mike Rylander" <mrylander@gmail.com> writes:

> My application (http://open-ils.org, which run >80% of the public
> libraries in Georgia, USA, http://gapines.org and
> http://georgialibraries.org/lib/pines.html) requires that I be able to
> search a corpus of bibliographic records in a mix of languages, and
> potentially with mixed stop-word rules, with one query.  I cannot know
> ahead of time what languages will be used in the corpus and I cannot
> restrict any one query to one language.  To accomplish this, the
> record itself will be inspected inside an INSERT/UPDATE trigger to
> determine the language and type, and use the correct configuration for
> creating the tsvector.  This will obviously result in a "mixed"
> tsvector column, but that's exactly what I need.  I can filter on
> record language if the user happens to specify a query language (and
> thus configuration), or simply rank the assumed (IP based, perhaps, or
> browser preference based) preferred language higher, or one of a
> hundred other things.  But I won't be able to do any of that if
> tsvectors are required to have one and only one configuration per
> column.
>
> Anyway, I felt I needed to provide some outside perspective to this,
> as a user, since it seems that the external viewpoint (my particular
> viewpoint, at least) was missing from the discussion.

This is *extremely* useful. I think it's precisely what we've been missing so
far. At least, what I've been missing.

So the question is what exactly happens in this case? If I search for "the"
does that mean it will ignore matches in English where that's a stop-word but
find me books on tea in French? Is that what I should expect to happen? What
if I search for "earl and the"? Does that find me French books on Early Grey
Tea but English books on all earls?

What happens if I use the same operator directly on the text column? Or
perhaps it's not even possible to specify stop-words when operating on a text
column? Should it be?

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com


Re: default_text_search_config and expression indexes

From
Bruce Momjian
Date:
Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > Bruce Momjian escribi�:
> >> What has really hurt the default GUC idea is how to do restores from a
> >> pg_dump.
> 
> > I guess what should happen is that pg_dump should include a
> > SET default_text_search_config = 'foo'
> > just before the CREATE INDEX, like we do for other variables.
> 
> The nasty cases are in data-only dumps, that is, where you're trying to
> load data into a table with pre-existing indexes or triggers.  A SET
> like the above is at least as likely to be wrong as right, if the index
> or trigger depends on it to tell it what to do.

Ouch.  I had not even thought that far.

FYI, yes, the default tsearch GUC controls operations per row _if_ you
have triggers or expression indexes that rely on the default
configuration GUC.  If you have specified the configuration, there is no
problem, and hence my conclusion that the default GUC is too
error-prone.

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: default_text_search_config and expression indexes

From
"Mike Rylander"
Date:
On 8/14/07, Gregory Stark <stark@enterprisedb.com> wrote:
> "Mike Rylander" <mrylander@gmail.com> writes:
>
> > My application (http://open-ils.org, which run >80% of the public
> > libraries in Georgia, USA, http://gapines.org and
> > http://georgialibraries.org/lib/pines.html) requires that I be able to
> > search a corpus of bibliographic records in a mix of languages, and
> > potentially with mixed stop-word rules, with one query.  I cannot know
> > ahead of time what languages will be used in the corpus and I cannot
> > restrict any one query to one language.  To accomplish this, the
> > record itself will be inspected inside an INSERT/UPDATE trigger to
> > determine the language and type, and use the correct configuration for
> > creating the tsvector.  This will obviously result in a "mixed"
> > tsvector column, but that's exactly what I need.  I can filter on
> > record language if the user happens to specify a query language (and
> > thus configuration), or simply rank the assumed (IP based, perhaps, or
> > browser preference based) preferred language higher, or one of a
> > hundred other things.  But I won't be able to do any of that if
> > tsvectors are required to have one and only one configuration per
> > column.
> >
> > Anyway, I felt I needed to provide some outside perspective to this,
> > as a user, since it seems that the external viewpoint (my particular
> > viewpoint, at least) was missing from the discussion.
>
> This is *extremely* useful. I think it's precisely what we've been missing so
> far. At least, what I've been missing.
>
> So the question is what exactly happens in this case? If I search for "the"
> does that mean it will ignore matches in English where that's a stop-word but
> find me books on tea in French? Is that what I should expect to happen? What
> if I search for "earl and the"? Does that find me French books on Early Grey
> Tea but English books on all earls?

Oh dear ... you went and got me started...

Each field type has a different set of configurations for potenial
use.  Title and subject fields, for instance, do not (generally) use
stop-word lists, so a subject search for "the" will match any record
with the lexem "the" in a subject field.  Title fields are a little
more complicated, because there is information in the bibliographic
record about how and when to skip leading articles, but generally
those are indexed as well for FTS purposes.  English LCSH subjects
generally don't have stop(like) words in them, so you'll probably just
get "French tea" records.  Title results would be a mix of "earls" and
"French tea" records (probably) and the correlation between the user's
preferred language (either chosen query lang or UI lang, whichever is
available) will help adjust the ranking, pushing what are likely to be
the more appropriate records to the top.

Note, however, that much of this multi-tsearch2-configuration setup is
not used in the implementation at http://gapines.org/ because, well,
there's not much need (read: demand from librarians) for that dataset
to support these more complex tricks.  It's basically all en-US and
exclude stop-words.  Other implementations are making more use of what
I describe above, including a (government mandated) French-English
bilingual institution who shall remain nameless for the time being...

>
> What happens if I use the same operator directly on the text column? Or
> perhaps it's not even possible to specify stop-words when operating on a text
> column? Should it be?

You mean with an expression index on a text column?  I haven't
considered using them for FTS.  It just feels easier and more flexible
to me to use an external tsvector column because of the fairly heavy
processing that goes into creating each tsvector value.  I may
re-evaluate that position now that CREATE INDEX CONCURRENTLY exists,
but I'm not developing with 8.2+ only features yet.  Once 8.3 is out
that may change.

Also, unless I misunderstand, you have to wrap the text column in the
function used to build the index.  For my purposes, that makes
building a generic FTS driver for my app (which, admittedly, only has
a Postgresql driver ;) ) more difficult than having a "hidden" extra
column.  Again, that could change if the benefits of CREATE INDEX
CONCURRENTLY end up outweighing simpler FTS driver code.

--miker


Re: default_text_search_config and expression indexes

From
Oleg Bartunov
Date:
On Tue, 14 Aug 2007, Alvaro Herrera wrote:

> Oleg Bartunov wrote:
>> On Tue, 14 Aug 2007, Alvaro Herrera wrote:
>>
>>> Oleg Bartunov wrote:
>>>> On Thu, 9 Aug 2007, tomas@tuxteam.de wrote:
>>>>
>>>>> Maybe I'm missing something, but it seems to me that the configuration
>>>>> is more attached to a column/index thatn to the whole database. If
>>>>> there's a default in an expression, I'd rather expect this default to be
>>>>> drawn from the index involved than from a global value (like a
>>>>> functional
>>>>> index does now).
>>>>
>>>> I'm tired to repeat - index itself doesn't know about configuration !
>>>
>>> Is there a way to change that?  For example store the configuration in a
>>> metapage or something?
>>
>> it's useless, in general, since you could use different configuration to
>> build tsvector.
>
> Hmm, sorry, I think I just understood what this was about: so you mean
> that the configuration is really *per row* and not per index?  So I can

in the very extreme case, yes. Index doesn't care about configuration.
Everything should works without index !

> store rows into an index using more than one configuration, and it will
> work?

why not. For one set of documents you can use one configuration
(parser+mappings), for another - different configuration.

    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83