Thread: tsearch_core patch: permissions and security issues
I've been looking at the tsearch patch a bit, and I think there needs to be more thought given to the permissions required to mess around with tsearch configuration objects. The TSParser objects reference functions declared to take and return INTERNAL arguments. This means that the underlying functions must be coded in C and can only be installed by a superuser, which in turn means that there is no scenario where it is really useful for a non-superuser to execute CREATE PARSER. What's more, allowing a non-superuser to do it creates security holes: if you can find an unrelated function taking the right number of INTERNAL arguments, you can install it as a TSParser support function. That trivially allows crashing the backend, and it could allow worse security holes than that. TSDictionary objects have exactly the same issues since they also depend on functions with INTERNAL arguments. At minimum this means that we should restrict CREATE/DROP/ALTER commands for these objects to superusers. (Which in turn means there's no point in tracking an ownership column for them; every superuser is the same as every other one, permissions-wise.) I'm wondering though whether this doesn't mean that we don't need manipulation commands for them at all. Is it likely that people will be adding parser or dictionary support to an installation on the fly? Maybe we can just create 'em all at initdb time and be done, similar to the way index access methods are treated. This doesn't say that it's not possible to add more; you can add an index access method on the fly too, if you want, by inserting stuff into pg_am by hand. I'm just wondering whether all that SQL-statement support and pg_dump support for custom parsers and dictionaries is really worth the code space and future maintenance effort it'll eat up. You could remove the immediate source of this objection if you could redesign the APIs for the underlying support functions to be more type-safe. I'm not sure how feasible or useful that would be though. The bottom-line question here is whether developing a new parser or dictionary implementation is really something that ordinary users might do. If not, then having all this SQL-level support for setting up catalog entries seems like wasted effort. TSConfiguration objects are a different story, since they have only type-safe dependencies on parsers, locales, and dictionaries. But they still need some more thought about permissions, because AFAICS mucking with a configuration can invalidate some other user's data. Do we want to allow runtime changes in a configuration that existing tsvector columns already depend on? How can we even recognize whether there is stored data that will be affected by a configuration change? (AFAICS the patch doesn't put anything into the pg_depend machinery that could deal with this.) And who gets to decide which configuration is default, anyway? I'm also a bit disturbed that you've made searches for TSConfiguration objects be search-path-sensitive. That is likely to create problems similar to those we've recently recognized for function lookup, eg, an insertion into a full-text-indexed column gets treated differently depending on the caller's search path. It's particularly bad to have the default object be search-path-dependent. We learned the hard way not to do that for default index operator classes; let's not make the same mistake again for tsearch configurations. Next, it took me a while to understand how Mapping objects fit into the scheme at all, and now that (I think) I understand, I'm wondering why treat them as an independent concept. Seems like the mapping from token types to dictionaries is really a property of a configuration, and we ought to be handling it through options of CREATE/ALTER CONFIGURATION commands, not as an apparently independent object type. The way the patch is doing it feels like implementing CREATE ATTRIBUTE as a separate command instead of having ALTER TABLE ADD COLUMN; it's just weird, and it's not obvious that dropping a configuration should make the associated mapping object go away. Lastly, I'm unhappy that the patch still keeps a lot of configuration information, such as stop word lists, in the filesystem rather than the database. It seems to me that the single easiest and most useful part of a configuration to change is the stop word list; but this setup guarantees that no one but a DBA can do that, and what's more that pg_dump won't record your changes. What's the point of having any non-superuser configuration capability at all, if stop words aren't part of what you can change? regards, tom lane
You bring up a very good point. There are fifteen new commands being added for full text indexing: alter-fulltext-config.sgml alter-fulltext-owner.sgmlcreate-fulltext-dict.sgml drop-fulltext-dict.sgmlalter-fulltext-dict.sgml alter-fulltext-parser.sgmlcreate-fulltext-map.sgml drop-fulltext-map.sgmlalter-fulltext-dictset.sgml comment-fulltext.sgmlcreate-fulltext-parser.sgml drop-fulltext-parser.sgmlalter-fulltext-map.sgml create-fulltext-config.sgmldrop-fulltext-config.sgml I think encoding is a good example to follow. We allow users to create new conversions (CREATE CONVERSION), but we don't allow them to create new encodings --- those are hard-coded in the backend. Which of the following full text objects: configdictmapdictsetparser can we hard-code into the backend, and just update for every major release like we do for encodings? --------------------------------------------------------------------------- Tom Lane wrote: > I've been looking at the tsearch patch a bit, and I think there needs to > be more thought given to the permissions required to mess around with > tsearch configuration objects. > > The TSParser objects reference functions declared to take and return > INTERNAL arguments. This means that the underlying functions must be > coded in C and can only be installed by a superuser, which in turn means > that there is no scenario where it is really useful for a non-superuser > to execute CREATE PARSER. What's more, allowing a non-superuser to do > it creates security holes: if you can find an unrelated function taking > the right number of INTERNAL arguments, you can install it as a TSParser > support function. That trivially allows crashing the backend, and it > could allow worse security holes than that. > > TSDictionary objects have exactly the same issues since they also depend > on functions with INTERNAL arguments. > > At minimum this means that we should restrict CREATE/DROP/ALTER commands > for these objects to superusers. (Which in turn means there's no point > in tracking an ownership column for them; every superuser is the same as > every other one, permissions-wise.) I'm wondering though whether this > doesn't mean that we don't need manipulation commands for them at all. > Is it likely that people will be adding parser or dictionary support to > an installation on the fly? Maybe we can just create 'em all at initdb > time and be done, similar to the way index access methods are treated. > This doesn't say that it's not possible to add more; you can add an > index access method on the fly too, if you want, by inserting stuff into > pg_am by hand. I'm just wondering whether all that SQL-statement > support and pg_dump support for custom parsers and dictionaries is > really worth the code space and future maintenance effort it'll eat up. > > You could remove the immediate source of this objection if you could > redesign the APIs for the underlying support functions to be more > type-safe. I'm not sure how feasible or useful that would be though. > The bottom-line question here is whether developing a new parser or > dictionary implementation is really something that ordinary users might > do. If not, then having all this SQL-level support for setting up > catalog entries seems like wasted effort. > > TSConfiguration objects are a different story, since they have only > type-safe dependencies on parsers, locales, and dictionaries. But they > still need some more thought about permissions, because AFAICS mucking > with a configuration can invalidate some other user's data. Do we want > to allow runtime changes in a configuration that existing tsvector > columns already depend on? How can we even recognize whether there is > stored data that will be affected by a configuration change? (AFAICS > the patch doesn't put anything into the pg_depend machinery that could > deal with this.) And who gets to decide which configuration is default, > anyway? > > I'm also a bit disturbed that you've made searches for TSConfiguration > objects be search-path-sensitive. That is likely to create problems > similar to those we've recently recognized for function lookup, eg, > an insertion into a full-text-indexed column gets treated differently > depending on the caller's search path. It's particularly bad to have > the default object be search-path-dependent. We learned the hard way > not to do that for default index operator classes; let's not make the > same mistake again for tsearch configurations. > > Next, it took me a while to understand how Mapping objects fit into > the scheme at all, and now that (I think) I understand, I'm wondering > why treat them as an independent concept. Seems like the mapping from > token types to dictionaries is really a property of a configuration, > and we ought to be handling it through options of CREATE/ALTER > CONFIGURATION commands, not as an apparently independent object type. > The way the patch is doing it feels like implementing CREATE ATTRIBUTE > as a separate command instead of having ALTER TABLE ADD COLUMN; it's > just weird, and it's not obvious that dropping a configuration should > make the associated mapping object go away. > > Lastly, I'm unhappy that the patch still keeps a lot of configuration > information, such as stop word lists, in the filesystem rather than the > database. It seems to me that the single easiest and most useful part > of a configuration to change is the stop word list; but this setup > guarantees that no one but a DBA can do that, and what's more that > pg_dump won't record your changes. What's the point of having any > non-superuser configuration capability at all, if stop words aren't part > of what you can change? > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 2: Don't 'kill -9' the postmaster -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > You could remove the immediate source of this objection if you could > redesign the APIs for the underlying support functions to be more > type-safe. I'm not sure how feasible or useful that would be though. > The bottom-line question here is whether developing a new parser or > dictionary implementation is really something that ordinary users might > do. If not, then having all this SQL-level support for setting up > catalog entries seems like wasted effort. Well assuming we have any SQL-level support at all I think we should strive to avoid these functions taking INTERNAL arguments. I feel like having them in the GIST interface has been a major impediment to more people defining GIST indexes for more datatypes. Because you need to write C code dealing with internal data structures to handle page splits the bar to implement GIST index operator classes is too high for most users. So instead of a simple SQL command we end up with contrib modules implementing each type of GIST index. A while back I proposed that we implement the same page-split algorithm that most (or all?) of those contrib modules copy-paste between them as a default implementation. That would allow defining a GIST index in terms of a handful of operators like "distance" which could be defined with a type-safe api. This would be less flexible than the existing generic solution but it would allow defining new GIST indexes without writing C code. > But they still need some more thought about permissions, because AFAICS > mucking with a configuration can invalidate some other user's data. ouch. could mucking with a configuration create a corrupt index? This sounds sort of analogous to the issues collation bring up. > It seems to me that the single easiest and most useful part of a > configuration to change is the stop word list; but this setup guarantees > that no one but a DBA can do that, and what's more that pg_dump won't record > your changes. I would second that, in the past I was expected to provide an administrative web interface to adjust the list of stop words. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Gregory Stark <stark@enterprisedb.com> writes: > Well assuming we have any SQL-level support at all I think we should > strive to avoid these functions taking INTERNAL arguments. I don't think I want to get into redesigning the patch at that level of detail, at least not for 8.3. It seems like something possibly worth thinking about for 8.4 though. The idea that we might want to change the API for parser and dictionary support routines seems like another good argument for not exposing user-level facilities for creating them right now. What I'm realizing as I look at it is that this is an enormous patch, and it's not as close to being ready to apply as I had supposed. If we don't scale it back, then either it doesn't get into 8.3 or 8.3 gets delayed a whole lot longer. So we need to look at what we can trim or postpone for a later release. So all these factors seem to me to point in the same direction: at least for the time being, we should treat TS parsers and dictionaries the way we treat index access methods. There'll be a catalog, which the adventurous can insert new entries into, but no SQL-level support for doing it, hence no pg_dump support. And we reserve the right to whack around the API for the functions referenced by the catalog entries. That still leaves us with the question of SQL-level support for TS configurations, which are built on top of parsers and dictionaries. We definitely need some level of capability for that. For the permissions and dependencies issues, the minimalistic approach is to say "only superusers can create or alter TS configurations, and if you alter one it's your responsibility to fix up any dependent tsvector columns or indexes." We currently handle index operator classes that way, so it's not completely ridiculous. Sure it would be nice to do better, but maybe that's a post-8.3 project. That gets us down to just needing to worry about whether we like the SQL representation of configurations. Which is still a nontrivial issue, but at least it seems manageable on a timescale that's reasonable for 8.3. regards, tom lane
Tom Lane wrote: > Gregory Stark <stark@enterprisedb.com> writes: >> Well assuming we have any SQL-level support at all I think we should >> strive to avoid these functions taking INTERNAL arguments. > That gets us down to just needing to worry about whether we like the > SQL representation of configurations. Which is still a nontrivial > issue, but at least it seems manageable on a timescale that's > reasonable for 8.3. O.k. I am not trying to throw any cold water on this, but with the limitations we are suggesting, does the patch gain us anything over just leaving tsearch in contrib? Sincerely, Joshua D. Drake > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 7: You can help support the PostgreSQL project by donating at > > http://www.postgresql.org/about/donate > -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutions since 1997 http://www.commandprompt.com/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate PostgreSQL Replication: http://www.commandprompt.com/products/
"Joshua D. Drake" <jd@commandprompt.com> writes: > O.k. I am not trying to throw any cold water on this, but with the > limitations we are suggesting, does the patch gain us anything over just > leaving tsearch in contrib? Well, if you want to take a hard-nosed approach, no form of the patch would gain us anything over leaving it in contrib, at least not from a functionality standpoint. The argument in favor has always been about perception, really: if it's a "core" feature not an "add-on", then people will take it more seriously. And there's a rather weak ease-of-use argument that you don't have to install a contrib module. (The idea that it's targeted at people who can't or won't install a contrib module is another reason why I think we can skip user-defined parsers and dictionaries ...) regards, tom lane
Joshua D. Drake wrote: > Tom Lane wrote: > > Gregory Stark <stark@enterprisedb.com> writes: > >> Well assuming we have any SQL-level support at all I think we should > >> strive to avoid these functions taking INTERNAL arguments. > > > That gets us down to just needing to worry about whether we like the > > SQL representation of configurations. Which is still a nontrivial > > issue, but at least it seems manageable on a timescale that's > > reasonable for 8.3. > > O.k. I am not trying to throw any cold water on this, but with the > limitations we are suggesting, does the patch gain us anything over just > leaving tsearch in contrib? The idea is that common operations like searching and mapping dictionaries will be easier to do, but the more complex stuff will require catalog manipulations. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Tom Lane wrote: > "Joshua D. Drake" <jd@commandprompt.com> writes: >> O.k. I am not trying to throw any cold water on this, but with the >> limitations we are suggesting, does the patch gain us anything over just >> leaving tsearch in contrib? > > Well, if you want to take a hard-nosed approach, no form of the patch > would gain us anything over leaving it in contrib, at least not from a > functionality standpoint. The argument in favor has always been about > perception, really: if it's a "core" feature not an "add-on", then > people will take it more seriously. And there's a rather weak > ease-of-use argument that you don't have to install a contrib module. > (The idea that it's targeted at people who can't or won't install a > contrib module is another reason why I think we can skip user-defined > parsers and dictionaries ...) Well my argument has always been the "core" feature argument. Perhaps I am missing some info here, but when I read what you wrote, I read that Tsearch will now be "harder" to work with. Not easier. :( Removal of pg_dump support kind of hurts us, as we already have problems with pg_dump support and tsearch2. Adding work to have to re-assign permissions to vector columns because we make changes... I would grant that having the SQL extensions would certainly be nice. Anyway, I am not trying to stop the progress. I would like to see Tsearch2 in core but I also don't want to add complexity. You did say here: And we reserve the right to whack around the API for the functions referenced by the catalog entries. Which kind of gets us back to upgrade problems doesn't it? Sincerely, Joshua D. Drake > > regards, tom lane > -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutions since 1997 http://www.commandprompt.com/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate PostgreSQL Replication: http://www.commandprompt.com/products/
"Joshua D. Drake" <jd@commandprompt.com> writes: > Well my argument has always been the "core" feature argument. Perhaps I > am missing some info here, but when I read what you wrote, I read that > Tsearch will now be "harder" to work with. Not easier. :( Then you misread it. What I was proposing was essentially that there won't be any need for pg_dump support because everything's built-in (at least as far as parsers/dictionaries go). As for the permissions issues, that's just formalizing something that's true today with the contrib module: if you change a configuration, it's *your* problem whether that invalidates any table entries, the system won't take care of it for you. regards, tom lane
Tom Lane wrote: > "Joshua D. Drake" <jd@commandprompt.com> writes: >> Well my argument has always been the "core" feature argument. Perhaps I >> am missing some info here, but when I read what you wrote, I read that >> Tsearch will now be "harder" to work with. Not easier. :( > > Then you misread it. What I was proposing was essentially that there > won't be any need for pg_dump support because everything's built-in > (at least as far as parsers/dictionaries go). > > As for the permissions issues, that's just formalizing something that's > true today with the contrib module: if you change a configuration, it's > *your* problem whether that invalidates any table entries, the system > won't take care of it for you. O.k. :) Joshua D. Drake > > regards, tom lane > -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutions since 1997 http://www.commandprompt.com/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate PostgreSQL Replication: http://www.commandprompt.com/products/
I an attempt to communicate what full text search does, and what features we are thinking of adding/removing, I have put up the introduction in HTML: http://momjian.us/expire/fulltext/HTML/fulltext-intro.html The links to the other sections don't work yet. --------------------------------------------------------------------------- Tom Lane wrote: > "Joshua D. Drake" <jd@commandprompt.com> writes: > > Well my argument has always been the "core" feature argument. Perhaps I > > am missing some info here, but when I read what you wrote, I read that > > Tsearch will now be "harder" to work with. Not easier. :( > > Then you misread it. What I was proposing was essentially that there > won't be any need for pg_dump support because everything's built-in > (at least as far as parsers/dictionaries go). > > As for the permissions issues, that's just formalizing something that's > true today with the contrib module: if you change a configuration, it's > *your* problem whether that invalidates any table entries, the system > won't take care of it for you. > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 6: explain analyze is your friend -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
>> Well assuming we have any SQL-level support at all I think we should >> strive to avoid these functions taking INTERNAL arguments. > That gets us down to just needing to worry about whether we like the > SQL representation of configurations. Which is still a nontrivial > issue, but at least it seems manageable on a timescale that's > reasonable for 8.3. Possible solution is to split pg_ts_dict (I'll talk about dictionaries, but the same way is possible to parsers, but now it's looked as overdesign) to two table like pg_am and pg_opclass. First table, pg_ts_dict_template (I don't know the exact name yet) which contains columns: oid, template_name, dict_init, dict_lexize and second: pg_ts_dict with colimns: oid, template_oid, owner, schema, dict_initoption. CREATE/ALTER/DROP DICTIONARY affects only second table and access to first one is only select/update/insert/delete similar to pg_am. IMHO, this interface solves problems with security and dumping. The reason to save SQLish interface to dictionaries is a simplicity of configuration. Snowball's stemmers are useful as is, but ispell dictionary requires some configuration action before using. Next, INTERNAL arguments parser's and dictionary's APIs are used because if performance reason. During creation of tsvector from text, there are a lot of calls of parsers and dictionaries. And internal structures of they states may be rather complex and cannot be matched in any pgsql's type, even in flat memory structure. > Next, it took me a while to understand how Mapping objects fit into> the scheme at all, and now that (I think) I understand,I'm wondering> why treat them as an independent concept. ALTER FULLTEXT CONFIGURATION cfgname ADD MAPPING FOR tokentypename[, ...] WITH dictname1[, ...]; ALTER FULLTEXT CONFIGURATION cfgname ALTER MAPPING FOR tokentypename[, ...] WITH dictname1[, ...]; ALTER FULLTEXT CONFIGURATION cfgname ALTER MAPPING [FOR tokentypename[, ...]] REPLACE olddictname TO newdictname; ALTER FULLTEXT CONFIGURATION cfgname DROP MAPPING [IF EXISTS] FOR tokentypename; Is it looking reasonable? > TSConfiguration objects are a different story, since they have only> type-safe dependencies on parsers, locales, and dictionaries. But they> still need some more thought about permissions, because AFAICS mucking> with a configuration caninvalidate some other user's data.Do we want> to allow runtime changes in a configuration that existing tsvector> columnsalready depend on? How can we even recognize whether there is> stored data that will be affected by a configurationchange? (AFAICS Very complex task: experienced users could use several configuration simultaneously. For example: indexing use configuration which doesn't reject stop-words, but for default searching use configuration which rejects stop-words. BTW, the same effects may be produced by dictionary's change. -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
> can we hard-code into the backend, and just update for every major > release like we do for encodings? Sorry, no one of them :(. We know projects which introduce new parser, new dictionary. Config and map are changes very often. -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
>> But they still need some more thought about permissions, because AFAICS >> mucking with a configuration can invalidate some other user's data. > > ouch. could mucking with a configuration create a corrupt index? Depending on what you mean 'corrupted'. It will not corrupted as non-readable or cause backend crash. But usage of such tsvector column could be limited - not all words will be searchable. > This sounds sort of analogous to the issues collation bring up. -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
Teodor Sigaev <teodor@sigaev.ru> writes: > The reason to save SQLish interface to dictionaries is a simplicity of > configuration. Snowball's stemmers are useful as is, but ispell dictionary > requires some configuration action before using. Yeah. I had been wondering about moving the dict_initoption over to the configuration entry --- is that sane at all? It would mean that dict_init functions would have to guard themselves against invalid options, but they probably ought to do that anyway. If we did that, I think we could have a fixed set of dictionaries without too much problem, and focus on just configurations as being user-alterable. >>> Next, it took me a while to understand how Mapping objects fit into >>> the scheme at all, and now that (I think) I understand, I'm wondering >>> why treat them as an independent concept. > ALTER FULLTEXT CONFIGURATION cfgname ADD MAPPING FOR tokentypename[, ...] WITH > dictname1[, ...]; > ALTER FULLTEXT CONFIGURATION cfgname ALTER MAPPING FOR tokentypename[, ...] WITH > dictname1[, ...]; > ALTER FULLTEXT CONFIGURATION cfgname ALTER MAPPING [FOR tokentypename[, ...]] > REPLACE olddictname TO newdictname; > ALTER FULLTEXT CONFIGURATION cfgname DROP MAPPING [IF EXISTS] FOR tokentypename; > Is it looking reasonable? Er ... what's the difference between the second and third forms? regards, tom lane
"Teodor Sigaev" <teodor@sigaev.ru> writes: >>> But they still need some more thought about permissions, because AFAICS >>> mucking with a configuration can invalidate some other user's data. >> >> ouch. could mucking with a configuration create a corrupt index? > > Depending on what you mean 'corrupted'. It will not corrupted as non-readable > or cause backend crash. But usage of such tsvector column could be limited - > not all words will be searchable. Am I correct to think of this like changing collations leaving your btree index "corrupt"? In that case it probably won't cause any backend crash either but you will get incorrect results. For example, returning different results depending on whether the index or a full table scan is used. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Tom Lane wrote: > Teodor Sigaev <teodor@sigaev.ru> writes: >> The reason to save SQLish interface to dictionaries is a simplicity of >> configuration. Snowball's stemmers are useful as is, but ispell dictionary >> requires some configuration action before using. > > Yeah. I had been wondering about moving the dict_initoption over to the > configuration entry --- is that sane at all? It would mean that It should be. Instances of ispell (and synonym, thesaurus) dictionaries are different only in dict_initoption part, so it will be only one entry in pg_ts_dict_template and several ones in pg_ts_dict. >> ALTER FULLTEXT CONFIGURATION cfgname ADD MAPPING FOR tokentypename[, ...] WITH >> dictname1[, ...]; >> ALTER FULLTEXT CONFIGURATION cfgname ALTER MAPPING FOR tokentypename[, ...] WITH >> dictname1[, ...]; sets dictionary's list for token's type(s) >> ALTER FULLTEXT CONFIGURATION cfgname ALTER MAPPING [FOR tokentypename[, ...]] >> REPLACE olddictname TO newdictname; Replace dictionary to another dictionary in dictionary's list for token's type(s). This command is very useful for tweaking configuration and for creating new configuration which differs from already existing one only by pair of dictionary. >> ALTER FULLTEXT CONFIGURATION cfgname DROP MAPPING [IF EXISTS] FOR tokentypename; >> Is it looking reasonable? > > Er ... what's the difference between the second and third forms? That changes are doable for several days. I'd like to make changes together with replacing of FULLTEXT keyword to TEXT SEARCHas you suggested. -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
On Thu, 14 Jun 2007, Tom Lane wrote: > Teodor Sigaev <teodor@sigaev.ru> writes: >> The reason to save SQLish interface to dictionaries is a simplicity of >> configuration. Snowball's stemmers are useful as is, but ispell dictionary >> requires some configuration action before using. > > Yeah. I had been wondering about moving the dict_initoption over to the > configuration entry --- is that sane at all? It would mean that > dict_init functions would have to guard themselves against invalid > options, but they probably ought to do that anyway. If we did that, > I think we could have a fixed set of dictionaries without too much > problem, and focus on just configurations as being user-alterable. currently, all dictionaries we provide are all template dictionaries, so users could change only parameters. But, there are reasons to allow users register new templates and in fact we know people/projects with application-dependent dictionaries. How they could dump/reload their dictionaries ? Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
> But, there are reasons to allow users register new templates and in fact > we know people/projects with application-dependent dictionaries. How > they could dump/reload their dictionaries ? The same way as pg_am does. -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
On Thu, 14 Jun 2007, Gregory Stark wrote: > "Teodor Sigaev" <teodor@sigaev.ru> writes: > >>>> But they still need some more thought about permissions, because AFAICS >>>> mucking with a configuration can invalidate some other user's data. >>> >>> ouch. could mucking with a configuration create a corrupt index? >> >> Depending on what you mean 'corrupted'. It will not corrupted as non-readable >> or cause backend crash. But usage of such tsvector column could be limited - >> not all words will be searchable. > > Am I correct to think of this like changing collations leaving your btree > index "corrupt"? In that case it probably won't cause any backend crash either > but you will get incorrect results. For example, returning different results > depending on whether the index or a full table scan is used. You're correct. But we can't defend users from all possible errors. Other side, that we need somehow to help user to identify what fts configuration was used to produce tsvector. For example, comment on tsvector column would be useful, but we don't know how to do this automatically. Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
> Am I correct to think of this like changing collations leaving your btree > index "corrupt"? In that case it probably won't cause any backend crash either > but you will get incorrect results. For example, returning different results > depending on whether the index or a full table scan is used. Without exotic cases, maximum disaster may be that queries with some words will return more or less results than should be. Because of wrong stemming or wrong match of stop-word or wrong mapping. By default, configuration is useful for most users and works for danish, dutch, finnish, french, german, hungarian, italian, norwegian, portuguese, spanish, swedish, russin and english languages. -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
Oleg Bartunov <oleg@sai.msu.su> writes: > You're correct. But we can't defend users from all possible errors. > Other side, that we need somehow to help user to identify what fts > configuration was used to produce tsvector. For example, comment on > tsvector column would be useful, but we don't know how to do this > automatically. Yeah, I was wondering about that too. The only way we could relax the superuser, you-better-know-what-you're-doing restriction on changing configurations would be if we had a way to identify which tsvector columns needed to be updated. Right now that's pretty hard to find out because the references to configurations are buried in the bodies of trigger functions. That whole trigger-function business is not the nicest part of tsearch2, either ... it'd be better if we could automate tsvector maintenance more. One thing I was thinking about is that rather than storing a physical tsvector column, people might index a "virtual" column using functional indexes: create index ... (to_tsvector('english', big_text_col)) which could be queried select ... where to_tsvector('english', big_text_col) @@ tsquery Assuming that the index is lossy, the index condition would have to be rechecked, so to_tsvector() would have to be recomputed, but only at the rows identified as candidate matches by the index. The I/O savings from eliminating the heap's tsvector column might counterbalance the extra CPU for recomputing tsvectors. Or not, but in any case this is attractive because it doesn't need any handmade maintenance support like a trigger --- the regular index maintenance code does it all. It strikes me that we could play the same kind of game we played to make nextval() references to sequences be recognized as dependencies on sequences. Invent a "regconfig" OID type that's just like regclass except it handles OIDs of ts_config entries instead of pg_class entries, and make the first argument of to_tsvector be one of those: create index ... (to_tsvector('english'::regconfig, big_text_col)) Now dependency.c can be taught to recognize the regconfig Const as depending on the referenced ts_config entry, and voila we have a pg_depend entry showing that the index depends on the configuration. What we actually do about it is another question, but this at least gets the knowledge into the system. [ thinks some more... ] If we revived the GENERATED AS patch, you could imagine computing tsvector columns via "GENERATED AS to_tsvector('english'::regconfig, big_text_col)" instead of a trigger, and then again you've got the dependency exposed where the system can see it. I don't wanna try to do that for 8.3, but it might be a good path to pursue in future, instead of assuming that triggers will be the way forevermore. Thoughts? regards, tom lane
Bruce Momjian wrote: > I an attempt to communicate what full text search does, and what > features we are thinking of adding/removing, I have put up the > introduction in HTML: > > http://momjian.us/expire/fulltext/HTML/fulltext-intro.html > Very good idea, Bruce! After reading the discussion and the introduction, here is what I think tsearch in core should at least accomplish in 8.3. Please bear in mind, that (a) I am talking from a user perspective (there might be technical arguments against my thoughts) and (b) I have no hands-on experience with Tsearch2 yet, so more experienced users might have different needs. - Basic full text search usable for non-superusers - Out-of-the-box working configuration for as many languages as reasonable(Teodor named quite a number of languages working as-is, so this is really an improvement over contrib, great!)- No foot-guns accessible to non-superuser - Agreement on function names, perhaps some should be changed. For instanceto_tsquery() and plainto_tsquery() seem rather unintuitive because they don't have a common prefix, and they arenot consistent about using underscores. Perhaps to_tsquery() and to_tsquery_plain()? - Future compatibility for allfeatures available to non-superusers - Stop words in tables, not in external files. - At least for superusers, all featuresavailable in contrib now, should be available, too (don't know about pg_dump). What I don't really like is the number of commands introduced without any strong reference to full text search. E.g. CREATE CONFIGURATION gives no hint at all that this is about full text search. IMHO there are more configurations than just full text ones. :-) So perhaps better spell this CREATE FULLTEXT CONFIGURATION etc.? (Think about tab completion in psql, for instance.) I guess this is in line with what Tom said about mapping objects and CREATE ATTRIBUTE vs. CREATE/ALTER CONFIGURATION. (http://archives.postgresql.org/pgsql-hackers/2007-06/msg00522.php) After all, I would really welcome having full text search capabilities in core. Best Regards Michael Paesold
Teodor Sigaev <teodor@sigaev.ru> writes: > Tom Lane wrote: >> Teodor Sigaev <teodor@sigaev.ru> writes: >>> The reason to save SQLish interface to dictionaries is a simplicity of >>> configuration. Snowball's stemmers are useful as is, but ispell dictionary >>> requires some configuration action before using. >> >> Yeah. I had been wondering about moving the dict_initoption over to the >> configuration entry --- is that sane at all? It would mean that > It should be. Instances of ispell (and synonym, thesaurus) dictionaries are > different only in dict_initoption part, so it will be only one entry in > pg_ts_dict_template and several ones in pg_ts_dict. No, I was thinking of still having just one pg_ts_dict catalog (no template) but removing its dictinit field. Instead, the init strings would be stored with configuration mapping entries. This would mean having to remember to provide the right option along with the dictionary name when doing ALTER CONFIGURATION ADD MAPPING. Not sure if that would be harder or easier to use than what you're thinking of. regards, tom lane
Michael Paesold <mpaesold@gmx.at> writes: > After reading the discussion and the introduction, here is what I think > tsearch in core should at least accomplish in 8.3. > ... > - Stop words in tables, not in external files. I realized that there's a pretty serious problem with doing that, which is encoding. We don't have any way to deal with preloaded catalog data that exceeds 7-bit-ASCII, because when you do CREATE DATABASE ... ENCODING it's going to be copied over exactly as-is. And there's plenty of not-ASCII stuff in the non-English stopword files. This is something we need to solve eventually, but I think it ties into the whole multiple locale can-of-worms; there's no way we're getting it done for 8.3. So I'm afraid we have to settle for stop words in external files for the moment. I do have two suggestions though: * Let's have just one stopword file for each language, with the convention that the file is stored in UTF8 no matter what language you're talking about. We can have the stopword reading code convert to the database encoding on-the-fly when it reads the file. Without this there's just a whole bunch of foot-guns there. We'd at least need to have encoding verification checks when reading the files, which seems hardly cheaper than just translating the data. * Let's fix it so the reference to the stoplist in the user-visible options is just a name, with no path or anything like that. (Similar to the handling of timezone_abbreviations.) Then it will be feasible to re-interpret the option as a reference to a named list in a catalog someday, when we solve the encoding problem. Right now the patch has things like + DATA(insert OID = 5140 ( "ru_stem_koi8" PGNSP PGUID 5135 5137 "dicts_data/russian.stop.koi8")); which is really binding the option pretty tightly to being a filename; not to mention the large security risks involved in letting anyone but a superuser have control of such an option. > What I don't really like is the number of commands introduced without > any strong reference to full text search. E.g. CREATE CONFIGURATION > gives no hint at all that this is about full text search. Yeah. We had some off-list discussion about this and concluded that TEXT SEARCH seemed to be the right phrase to use in the command names. That hasn't gotten reflected into the patch yet. regards, tom lane
"Oleg Bartunov" <oleg@sai.msu.su> writes: > On Thu, 14 Jun 2007, Gregory Stark wrote: > >> Am I correct to think of this like changing collations leaving your btree >> index "corrupt"? In that case it probably won't cause any backend crash either >> but you will get incorrect results. For example, returning different results >> depending on whether the index or a full table scan is used. > > You're correct. But we can't defend users from all possible errors. Sure, but it seems like a the line, at least in existing cases, is that if you fiddle with catalogs directly then you should know what consequences you need to be careful of. But when if you make changes through a supported, documented interface then the system will protect you from breaking things. Hm, I went to construct an example and accidentally found a precedent for not necessarily protecting users from themselves in every case: postgres=# create table x (i integer); CREATE TABLE postgres=# create function f(integer) returns integer as 'select $1' immutable strict language sql; CREATE FUNCTION postgres=# select f(1);f ---1 (1 row) postgres=# create index xi on x (f(i)); CREATE INDEX postgres=# insert into x values (1); INSERT 0 1 postgres=# insert into x values (2); INSERT 0 1 postgres=# create or replace function f(integer) returns integer as 'select -$1' immutable strict language sql; CREATE FUNCTION Uhm. Oops! And yes, the resulting index is, of course, corrupt: postgres=# insert into x (select random() from generate_series(1,2000)); INSERT 0 2000 postgres=# select count(*) from x where f(i) = -1;count ------- 0 (1 row) postgres=# set enable_bitmapscan = off; SET postgres=# set enable_indexscan = off; SET postgres=# select count(*) from x where f(i) = -1;count ------- 1003 (1 row) -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
>> It should be. Instances of ispell (and synonym, thesaurus) dictionaries are >> different only in dict_initoption part, so it will be only one entry in >> pg_ts_dict_template and several ones in pg_ts_dict. > > No, I was thinking of still having just one pg_ts_dict catalog (no template) > but removing its dictinit field. Instead, the init strings would be > stored with configuration mapping entries. > > This would mean having to remember to provide the right option along > with the dictionary name when doing ALTER CONFIGURATION ADD MAPPING. > Not sure if that would be harder or easier to use than what you're > thinking of. Hmm. Dictionary may present in several lists of dictionaries in one configuration. Suppose, it isn't practical to store dictinitoption several times. In other hand, the same dictionary (template) with different init option may present on configuration too. Typical example is configuration for russian language: lword, lpword tokens have dictionary's list {ispell_en, stem_en} nlword, nlpword tokens have dictionary's list {ispell_ru, stem_ru} stem_(ru|en) is a Snowball's stemmer, but ispell_(ru|en) is a ispell dictionary (template) with different dictinitoption. Next, configurations may share dictionaries. And, init option may be rather big.
I am confused by the CREATE FULLTEXT CONFIGURATION command: http://momjian.us/expire/fulltext/SGML/ref/create-fulltext-config.sgml First, why are we specifying the server locale here since it never changes: <varlistentry> <term><literal>LOCALE</literal></term> <listitem> <para> <replaceable class="PARAMETER">localename</replaceable> is the name of the locale. It should match server's locale (<varname>lc_ctype</varname>) to identify full-text configuration used by default. </para> </listitem> </varlistentry> Second, I can't figure out how to reference a non-default configuration. The description says: <varlistentry> <term><LITERAL>AS DEFAULT</LITERAL></term> <listitem> <para> Set <literal>default</literal> flagfor the configuration, which used to identify if this configuration is selectable on default (see <LITERAL>LOCALE</LITERAL>description above). It is possible to have <emphasis>maximum one</emphasis> configuration with the same locale and in the same schema with this flag enabled. </para> </listitem> </varlistentry> The documentation says that the first fulltext configuration found in the search patch is the one used, so how does a secondary configuration in the same schema actually get accessed by @@ or ::tsquery? Do you have to use to_tsquery() with the optional configuration name? Is this really the direction we want to go, having a default that gets picked up from the search_path, perhaps based on some encoding/locale match I can't figure out, or do we want to require the configuration to be specified always, and if we do that, how do we handle the @@ operator? I am thinking we should just have use the first fulltext configuration from the first schema in the search path and eliminate naming the configurations (same as schema name?). Allowing configuration names to be specified only sometimes is confusing. Or we can use a GUC to name the configuration we want to use specifically, rather than have a read-only tsearch_conf_name (is it read-only?) that is controlled by the search_path. And why are we talking locale here instead of encoding? And since we only have one encoding per database, how can there be more than one? Is this _client_ encoding? FYI, while the configuration selection needs work, the rest of the areas seem logical. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian <bruce@momjian.us> writes: > First, why are we specifying the server locale here since it never > changes: It's poorly described. What it should really say is the language that the text-to-be-searched is in. We can actually support multiple languages here today, the restriction being that there have to be stemmer instances for the languages with the database encoding you're using. With UTF8 encoding this isn't much of a restriction. We do need to put code into the dictionary stuff to enforce that you can't use a stemmer when the database encoding isn't compatible with it. I would prefer that we not drive any of this stuff off the server's LC_xxx settings, since as you say that restricts things to just one locale. > Second, I can't figure out how to reference a non-default > configuration. See the multi-argument versions of to_tsvector etc. I do see a problem with having to_tsvector(config, text) plus to_tsvector(text) where the latter implicitly references a config selected by a GUC variable: how can you tell whether a query using the latter matches a particular index using the former? There isn't anything in the current planner mechanisms that would make that work. regards, tom lane
On Thu, 14 Jun 2007, Tom Lane wrote: > Bruce Momjian <bruce@momjian.us> writes: >> First, why are we specifying the server locale here since it never >> changes: server's locale is used just for one purpose - to select what text search configuration to use by default. Any text search functions can accept text search configuration as an optional parameter. > > It's poorly described. What it should really say is the language > that the text-to-be-searched is in. We can actually support multiple > languages here today, the restriction being that there have to be > stemmer instances for the languages with the database encoding you're > using. With UTF8 encoding this isn't much of a restriction. We do need > to put code into the dictionary stuff to enforce that you can't use a > stemmer when the database encoding isn't compatible with it. > > I would prefer that we not drive any of this stuff off the server's > LC_xxx settings, since as you say that restricts things to just one > locale. something like CREATE TEXT SEARCH DICTIONARY dictname [LOCALE=ru_RU.UTF-8] and raise warning/error if database encoding doesn't match dictionary encoding if specified (not all dictionaries depend on encoding, so it should be an optional parameter). > >> Second, I can't figure out how to reference a non-default >> configuration. > > See the multi-argument versions of to_tsvector etc. > > I do see a problem with having to_tsvector(config, text) plus > to_tsvector(text) where the latter implicitly references a config > selected by a GUC variable: how can you tell whether a query using the > latter matches a particular index using the former? There isn't > anything in the current planner mechanisms that would make that work. Probably, having default text search configuration is not a good idea and we could just require it as a mandatory parameter, which could eliminate many confusion with selecting text search configuration. Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
On Thu, 14 Jun 2007, Tom Lane wrote: > Oleg Bartunov <oleg@sai.msu.su> writes: >> You're correct. But we can't defend users from all possible errors. >> Other side, that we need somehow to help user to identify what fts >> configuration was used to produce tsvector. For example, comment on >> tsvector column would be useful, but we don't know how to do this >> automatically. > > Yeah, I was wondering about that too. The only way we could relax the > superuser, you-better-know-what-you're-doing restriction on changing > configurations would be if we had a way to identify which tsvector > columns needed to be updated. Right now that's pretty hard to find out > because the references to configurations are buried in the bodies of > trigger functions. That whole trigger-function business is not the > nicest part of tsearch2, either ... it'd be better if we could automate > tsvector maintenance more. yes, trigger function is a complex stuff, our tsearch() trigger is an example of automated stuff. It could be written very easy on plpgsql, for example. =# create function my_update() returns trigger as $$ BEGIN NEW.fts= setweight( to_tsvector('english',NEW.t1),'A') || ' ' || setweight( to_tsvector('english',NEW.t2),'B');RETURN NEW; END; $$ language plpgsql; > > One thing I was thinking about is that rather than storing a physical > tsvector column, people might index a "virtual" column using functional > indexes: > > create index ... (to_tsvector('english', big_text_col)) > > which could be queried > > select ... where to_tsvector('english', big_text_col) @@ tsquery this is already possible for gin index create index gin_text_idx on test using gin ( ( coalesce(to_tsvector(title),'') || coalesce(to_tsvector(body),'') ) ); apod=# select title from test where (coalesce(to_tsvector(title),'') || coalesce(to_tsvector(body),'') ) @@ to_tsquery('supernovae') order by sdate desc limit 10; > > Assuming that the index is lossy, the index condition would have to be > rechecked, so to_tsvector() would have to be recomputed, but only at the > rows identified as candidate matches by the index. The I/O savings from > eliminating the heap's tsvector column might counterbalance the extra > CPU for recomputing tsvectors. Or not, but in any case this is > attractive because it doesn't need any handmade maintenance support like > a trigger --- the regular index maintenance code does it all. I'm afraid it wouldn't work for all cases. We already have headline() function which had to reparse document to produce text snippet and it's very slow and eats most select time. ALso, trigger stuff is a normal machinery for databases. > > It strikes me that we could play the same kind of game we played to make > nextval() references to sequences be recognized as dependencies on > sequences. Invent a "regconfig" OID type that's just like regclass > except it handles OIDs of ts_config entries instead of pg_class entries, > and make the first argument of to_tsvector be one of those: > > create index ... (to_tsvector('english'::regconfig, big_text_col)) > > Now dependency.c can be taught to recognize the regconfig Const as > depending on the referenced ts_config entry, and voila we have a > pg_depend entry showing that the index depends on the configuration. > What we actually do about it is another question, but this at least > gets the knowledge into the system. > interesting. And \di could display all configuration stuff for text search indexes ? > [ thinks some more... ] If we revived the GENERATED AS patch, > you could imagine computing tsvector columns via "GENERATED AS > to_tsvector('english'::regconfig, big_text_col)" instead of a > trigger, and then again you've got the dependency exposed where > the system can see it. I don't wanna try to do that for 8.3, > but it might be a good path to pursue in future, instead of assuming > that triggers will be the way forevermore. > > Thoughts? No way with standard. GENERATED AS says that "all columns references in an expression associated with a generated column must be to columns of the base table containing that generated column." tsvector could be result of rather complex select involving several tables. Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
On Wed, 2007-06-13 at 18:06 -0400, Bruce Momjian wrote: > You bring up a very good point. There are fifteen new commands being > added for full text indexing: > > alter-fulltext-config.sgml alter-fulltext-owner.sgml > create-fulltext-dict.sgml drop-fulltext-dict.sgml > alter-fulltext-dict.sgml alter-fulltext-parser.sgml > create-fulltext-map.sgml drop-fulltext-map.sgml > alter-fulltext-dictset.sgml comment-fulltext.sgml > create-fulltext-parser.sgml drop-fulltext-parser.sgml > alter-fulltext-map.sgml create-fulltext-config.sgml > drop-fulltext-config.sgml Although I'm happy to see tsearch finally hit the big time, I'm a bit disappointed to see so many new datatype-specific SQL commands created. That sets a bad precedent for other datatype authors, as well as all those people that want to invent new things like SKYLINE etc.. Whatever the reasons for the new commands, those reasons must also be potentially shared by authors of other datatypes too. Is there a way to genericise these commands so that we can offer those same benefits to all datatypes, rather than setting a double standard? Or do we think that when a Geodetic datatype tries to come into core we would have commands like CREATE COORDINATE TRANSFORM? Can we consider CREATE TYPE CONFIGURATION with subsets such as... CREATE TYPE CONFIGURATION name USING FULLTEXT (map) CREATE TYPE CONFIGURATION name USING FULLTEXT (dictionary) CREATE TYPE CONFIGURATION name USING FULLTEXT (parser) Your choice of syntax may vary, but my point is about creating a mechanism by which any datatype author can reference complex configuration details. We managed to do this for INDEXes and OPERATORs, so it seems a shame to go for the full 15 new commands when we could do the same thing with much fewer commands and ones that could then be utilised by others, e.g. PostGIS. Last minute change true, but only parser+docs changes suggested. I want the full tsearch2 functionality as much as anyone. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
> Probably, having default text search configuration is not a good idea > and we could just require it as a mandatory parameter, which could > eliminate many confusion with selecting text search configuration. Ugh. Having default configuration (by locale or by postgresql.conf or some other way) simplifies life a lot in most cases. -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
Tom Lane wrote: > Bruce Momjian <bruce@momjian.us> writes: > > First, why are we specifying the server locale here since it never > > changes: > > It's poorly described. What it should really say is the language > that the text-to-be-searched is in. We can actually support multiple > languages here today, the restriction being that there have to be > stemmer instances for the languages with the database encoding you're > using. With UTF8 encoding this isn't much of a restriction. We do need > to put code into the dictionary stuff to enforce that you can't use a > stemmer when the database encoding isn't compatible with it. > > I would prefer that we not drive any of this stuff off the server's > LC_xxx settings, since as you say that restricts things to just one > locale. The idea they had was to set the _default_ full text configuration to match the locale, e.g.UTF8.en_US. This works well for cases where we ship a number of pre-installed full text configurations in pg_catalog. But of course you can support multiple languages with that encoding/locale, so you have to have the ability to do other languages, but not necessarily by default. > > Second, I can't figure out how to reference a non-default > > configuration. > > See the multi-argument versions of to_tsvector etc. > > I do see a problem with having to_tsvector(config, text) plus > to_tsvector(text) where the latter implicitly references a config > selected by a GUC variable: how can you tell whether a query using the > latter matches a particular index using the former? There isn't > anything in the current planner mechanisms that would make that work. Well, now that I have gotten feedback, we have a few options: 1) Require the configuration to be always specified. The problem with this is that casting (::tsquery) and operators (@@) have no way to specify a configuration. 2) Use a GUC that you can set for the configuration, and perhaps default it if possible to match the locale. Is the default affected by search_path (ouch)? How do we make sure that any index that is accessed is using the same configuration that is being used by the query, e.g. ::tsquery? Do we have to store the configuration name in the index and somehow throw an error if it doesn't match? What about changes to the configuration after the index has been created, e.g. new stop words or dictionaries? The two big open issues are whether we allow a default configuration, and whether we require the configuration name to be always specified. My guess right now is that we use a GUC that will default if a pg_catalog configuration name matches the lc_ctype locale name, and we have to throw an error if an accessed index creation GUC doesn't match the current GUC. So we create a pg_catalog full text configuration named UTF8.en-US, and some others like ru_RU.UTF-8. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian wrote: > My guess right now is that we use a GUC that will default if a > pg_catalog configuration name matches the lc_ctype locale name, and we > have to throw an error if an accessed index creation GUC doesn't match > the current GUC. > > So we create a pg_catalog full text configuration named UTF8.en-US, and > some others like ru_RU.UTF-8. Do locale names vary across operating systems? If so, we might as well skip trying to find a default. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
"Simon Riggs" <simon@2ndquadrant.com> writes: > Although I'm happy to see tsearch finally hit the big time, I'm a bit > disappointed to see so many new datatype-specific SQL commands created. Per subsequent discussion we are down to just one new set of commands, CREATE/ALTER/DROP TEXT SEARCH CONFIGURATION, so it's not as big a footprint as it was to start with. > Can we consider CREATE TYPE CONFIGURATION with subsets such as... > CREATE TYPE CONFIGURATION name USING FULLTEXT (map) > CREATE TYPE CONFIGURATION name USING FULLTEXT (dictionary) > CREATE TYPE CONFIGURATION name USING FULLTEXT (parser) This seems entirely cosmetic, unless you have some proposal for allowing a uniform underlying implementation not only syntax. In the absence of some concrete cases to consider, I don't see how we could imagine that we know how to implement a useful generic approach. I have been thinking that it would be smart to try to use the generic "definition list" syntax, like CREATE OPERATOR and CREATE AGGREGATE. But the motivation for that is just to avoid defining more keywords (which has an overall impact on parser size and performance). It's not really going to do anything for us in terms of having an implementation that can be shared with anything else. regards, tom lane
Bruce Momjian <bruce@momjian.us> writes: > Do locale names vary across operating systems? Yes, which is the fatal flaw in the whole thing. The ru_RU part is reasonably well standardized, but the encoding part is not. Considering that encoding is exactly the part of it we don't care about for this purpose (because we should look to the database encoding instead), I think it's just going to make life harder not easier to model search language names on locales. I'd suggest allowing either full names ("swedish") or the standard two-letter abbreviations ("sv"). But let's stay away from locale names. regards, tom lane
> 1) Require the configuration to be always specified. The problem with > this is that casting (::tsquery) and operators (@@) have no way to > specify a configuration. it's not comfortable for most often cases > > 2) Use a GUC that you can set for the configuration, and perhaps > default it if possible to match the locale. Is the default affected by > search_path (ouch)? Right now it works so > > How do we make sure that any index that is accessed is using the same > configuration that is being used by the query, e.g. ::tsquery? Do we > have to store the configuration name in the index and somehow throw an > error if it doesn't match? What about changes to the configuration > after the index has been created, e.g. new stop words or dictionaries? That's possible intentional case, so we should not throw ERROR! > > The two big open issues are whether we allow a default configuration, > and whether we require the configuration name to be always specified. > > My guess right now is that we use a GUC that will default if a > pg_catalog configuration name matches the lc_ctype locale name, and we > have to throw an error if an accessed index creation GUC doesn't match > the current GUC. Where will index store index creation GUC? > > So we create a pg_catalog full text configuration named UTF8.en-US, and > some others like ru_RU.UTF-8. > -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
> I'd suggest allowing either full names ("swedish") or the standard > two-letter abbreviations ("sv"). But let's stay away from locale names. We can use database's encoding name (the same names used in initdb -E) -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
Teodor Sigaev <teodor@sigaev.ru> writes: >> My guess right now is that we use a GUC that will default if a >> pg_catalog configuration name matches the lc_ctype locale name, and we >> have to throw an error if an accessed index creation GUC doesn't match >> the current GUC. > Where will index store index creation GUC? It's not really the index's problem; IIUC the behavior of the gist and gin index opclasses is not locale-specific. It's the to_tsvector calls that built the tsvector heap column that have a locale specified or implicit. We need some way of annotating the heap column about this. In the case of a functional index you can expose the locale: create index ... (to_tsvector('english'::regconfig, mytextcol)) but there's still the problem that the planner cannot match that to a query specified as just WHERE to_tsvector(mytextcol) @@ query. regards, tom lane
Teodor Sigaev <teodor@sigaev.ru> writes: >> I'd suggest allowing either full names ("swedish") or the standard >> two-letter abbreviations ("sv"). But let's stay away from locale names. > We can use database's encoding name (the same names used in initdb -E) AFAICS the encoding name shouldn't be anywhere near this. The only reason the TS stuff needs an encoding spec is to figure out how to read an external stop word file. I think my suggestion upthread is a lot better: have just one stop word file per language, store them all in UTF8, and convert to database encoding when loading them. The database encoding is implicit and doesn't need to be mentioned anywhere in the TS configuration. regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > It's not really the index's problem; IIUC the behavior of the gist and > gin index opclasses is not locale-specific. It's the to_tsvector calls > that built the tsvector heap column that have a locale specified or > implicit. We need some way of annotating the heap column about this. > > In the case of a functional index you can expose the locale: > > create index ... (to_tsvector('english'::regconfig, mytextcol)) Maybe there should be a different type for each locale. I'm not exactly following this thread so I'm not entirely sure whether that would actually fit well but it's just a thought I had. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Gregory Stark <stark@enterprisedb.com> writes: > "Tom Lane" <tgl@sss.pgh.pa.us> writes: >> It's not really the index's problem; IIUC the behavior of the gist and >> gin index opclasses is not locale-specific. It's the to_tsvector calls >> that built the tsvector heap column that have a locale specified or >> implicit. We need some way of annotating the heap column about this. > Maybe there should be a different type for each locale. I had been idly wondering if we could do anything with using tsvector's typmod for the purpose ... regards, tom lane
On Friday 15 June 2007 00:46, Oleg Bartunov wrote: > On Thu, 14 Jun 2007, Tom Lane wrote: > > [ thinks some more... ] If we revived the GENERATED AS patch, > > you could imagine computing tsvector columns via "GENERATED AS > > to_tsvector('english'::regconfig, big_text_col)" instead of a > > trigger, and then again you've got the dependency exposed where > > the system can see it. I don't wanna try to do that for 8.3, > > but it might be a good path to pursue in future, instead of assuming > > that triggers will be the way forevermore. > > > > Thoughts? > > No way with standard. GENERATED AS says that "all columns references in an > expression associated with a generated column must be to columns of the > base table containing that generated column." > > tsvector could be result of rather complex select involving several tables. > Is there some reason for this restriction in the standard? I might be in favor of "extending" the standard to allow this case if not. -- Robert Treat Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL
On Thursday 14 June 2007 15:10, Teodor Sigaev wrote: > That changes are doable for several days. I'd like to make changes together > with replacing of FULLTEXT keyword to TEXT SEARCH as you suggested. AIUI the discussion on this change took place off list? Can we get a preview of what the commands will look like and maybe a summary of the discussion? It may sound a bit pedantic, but the concept/wording of "full text searching" is pretty well understood by the database community, so switching to just TEXT SEARCH sounds like we're adding ambiguity for reasons that escape me.... -- Robert Treat Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL
> The only reason the TS stuff needs an encoding spec is to figure out how > to read an external stop word file. I think my suggestion upthread is a > lot better: have just one stop word file per language, store them all in > UTF8, and convert to database encoding when loading them. The database Hmm. You mean to use language name in configuration, use current encoding to define which dictionary should be used (stemmers for the same language are different for different encoding) and recode dictionaries file from UTF8 to current locale. Did I understand you right? That's possible to do. But it's incompatible changes and cause some difficulties for DBA. If server locale is ISO (or KOI8 or any other) and file is in UTF8 then text editor/tools might be confused. -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
> It's not really the index's problem; IIUC the behavior of the gist and > gin index opclasses is not locale-specific. Right > It's the to_tsvector calls > that built the tsvector heap column that have a locale specified or > implicit. We need some way of annotating the heap column about this. It seems too restrictive to advanced users. > > In the case of a functional index you can expose the locale: > > create index ... (to_tsvector('english'::regconfig, mytextcol)) > > but there's still the problem that the planner cannot match that to > a query specified as just WHERE to_tsvector(mytextcol) @@ query. -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
Teodor Sigaev <teodor@sigaev.ru> writes: >> It's the to_tsvector calls >> that built the tsvector heap column that have a locale specified or >> implicit. We need some way of annotating the heap column about this. > It seems too restrictive to advanced users. Hm, are you trying to say that it's sane to have different tsvectors in a column computed under different language settings? Maybe we're all overthinking the problem. If the tsvector representation is presumed language-independent then I could see this being a workable approach. regards, tom lane
Teodor Sigaev <teodor@sigaev.ru> writes: > Hmm. You mean to use language name in configuration, use current encoding to > define which dictionary should be used (stemmers for the same language are > different for different encoding) and recode dictionaries file from UTF8 to > current locale. Did I understand you right? Right. > That's possible to do. But it's incompatible changes and cause some > difficulties for DBA. If server locale is ISO (or KOI8 or any other) > and file is in UTF8 then text editor/tools might be confused. Well, I'm not as worried about that as I am about the database being confused ;-). We need some way to deal with stopword files that are in a different encoding than the database encoding, and this has to be proof against accidental or malicious mistakes by the non-superuser users who are going to be able to specify which stopword file to use. So I don't want the specification that goes into the CREATE DICTIONARY command to involve an encoding. One possibility is that the user-visible specification is just a name (eg, "english"), but the actual filename out on the filesystem is, say, name.encoding.stop (eg, "english.utf8.stop") where we use PG's names for the encodings. We could just fail if there's not a file matching the database encoding, or we could try that and then try utf8, or some other rule. In any case I'd want it to verify and convert encoding as necessary while reading. regards, tom lane
On Fri, 2007-06-15 at 10:36 -0400, Tom Lane wrote: > "Simon Riggs" <simon@2ndquadrant.com> writes: > > Although I'm happy to see tsearch finally hit the big time, I'm a bit > > disappointed to see so many new datatype-specific SQL commands created. > > Per subsequent discussion we are down to just one new set of commands, > CREATE/ALTER/DROP TEXT SEARCH CONFIGURATION, so it's not as big a > footprint as it was to start with. Thats a lot better, thanks. I'm sure that will work better in PgAdmin and many other places too. > I have been thinking that it would be smart to try to use the generic > "definition list" syntax, like CREATE OPERATOR and CREATE AGGREGATE. > But the motivation for that is just to avoid defining more keywords > (which has an overall impact on parser size and performance). It's > not really going to do anything for us in terms of having an > implementation that can be shared with anything else. It's OK; ALTER RFID TAG NOMENCLATURE has a nice ring to it. :-) -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
> Hm, are you trying to say that it's sane to have different tsvectors in > a column computed under different language settings? Maybe we're all Yes, I think so. That might have sense for close languages. Norwegian languages has two dialects and one of them has advanced rules for compound words, russian and ukranian has similar rules etc. Operation @@ is language (and encoding) independent, it use just strcmp call. Most often usecase for mixing configuration is somewhere described by me in thread using two different configuration for indexing (tsvector creation) and search (tsquery creation). BTW, thesaurus dictionary could be used for similar reasons in search only configuration. OpenFTS doesn't use tsearch2 configuration at all, it has such infrastructure itself - so, tsvector shouldn't have any information about configuration. Most often change of configuration is a adding new stop words, which doesn't affect correctness of search. Removing stop words cause impossibility to find already indexed documents with query contains only removed stop-words. > overthinking the problem. If the tsvector representation is presumed > language-independent then I could see this being a workable approach. Actually, we should allow to only 'compatible' changes of configuration but it very hard (or even impossible) to formulate rules about that. Any dictionary has its specific dictinitoption changes tobecome incompatible with itself, the same is to compatibility between two dictionaries, list of dictionaries. In practice, we didn't see any disasters after changes in configuration - until reindexing search becomes less punctual. -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
> One possibility is that the user-visible specification is just a name > (eg, "english"), but the actual filename out on the filesystem is, > say, name.encoding.stop (eg, "english.utf8.stop") where we use PG's > names for the encodings. We could just fail if there's not a file > matching the database encoding, or we could try that and then try > utf8, or some other rule. In any case I'd want it to verify and > convert encoding as necessary while reading. I have no strong objection for UTF8-encoded files (stop words or ispell or synonym or thesaurus). Just recode it after reading. But configuration for different languages might be differ, for example russian (and any cyrillic-based) configuration is differ from west-european configuration based on different character sets. So, we should have non-obvious rules for stemmers to define which exact stemmer and stop-file should be used. For russian language with utf8 encoding it should use for lword english stemmer, but for italian language - italian stemmer. Any ASCII chars can't present in russian word, but might italian word can contains only ASCII. -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
Teodor Sigaev <teodor@sigaev.ru> writes: > But configuration for different languages might be differ, for example > russian (and any cyrillic-based) configuration is differ from > west-european configuration based on different character sets. Sure. I'm just assuming that the set of stopwords doesn't need to vary depending on the encoding you're using for a language --- that is, if you're willing to convert the encoding then the same stopword list file should serve for all encodings of a given language. Do you think this might be wrong? regards, tom lane
> Sure. I'm just assuming that the set of stopwords doesn't need to vary > depending on the encoding you're using for a language --- that is, if > you're willing to convert the encoding then the same stopword list file > should serve for all encodings of a given language. Do you think this > might be wrong? No. I believe that pgsql doesn't support encoding that can not be recoded from UTF8, at least for non-hieroglyph languages. -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
So, added to my plan (http://archives.postgresql.org/pgsql-hackers/2007-06/msg00618.php) n) single encoded files. That will touch snowball, ispell, synonym, thesaurus and simple dictionaries n+1) use encoding names instead of locale's names in configuration Tom Lane wrote: > Teodor Sigaev <teodor@sigaev.ru> writes: >> But configuration for different languages might be differ, for example >> russian (and any cyrillic-based) configuration is differ from >> west-european configuration based on different character sets. > > Sure. I'm just assuming that the set of stopwords doesn't need to vary > depending on the encoding you're using for a language --- that is, if > you're willing to convert the encoding then the same stopword list file > should serve for all encodings of a given language. Do you think this > might be wrong? > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 9: In versions below 8.0, the planner will ignore your desire to > choose an index scan if your joining column's datatypes do not > match -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
"Teodor Sigaev" <teodor@sigaev.ru> writes: >> Hm, are you trying to say that it's sane to have different tsvectors in >> a column computed under different language settings? Maybe we're all > > Yes, I think so. > > That might have sense for close languages. Norwegian languages has two dialects > and one of them has advanced rules for compound words, russian and ukranian has > similar rules etc. Operation @@ is language (and encoding) independent, it use > just strcmp call. To support this sanely though wouldn't you need to know which language rule a tsvector was generated with? Like, have a byte in the tsvector tagging it with the language rule forever more? What I'm wondering about is if you use a different rule than what was used when an index entry was inserted will you get different results using the index than you would doing a sequential scan and reapplying the operator to every datum? -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
> To support this sanely though wouldn't you need to know which language rule a > tsvector was generated with? Like, have a byte in the tsvector tagging it with > the language rule forever more? No. As corner case, dictionary might return just a number or a hash value. > > What I'm wondering about is if you use a different rule than what was used > when an index entry was inserted will you get different results using the > index than you would doing a sequential scan and reapplying the operator to > every datum? Rules are apllyed during creattion of tsvector, not during indexing of tsvectors. So, sequential and index scan will return identical results.
Teodor Sigaev wrote: > So, added to my plan > (http://archives.postgresql.org/pgsql-hackers/2007-06/msg00618.php) > n) single encoded files. That will touch snowball, ispell, synonym, thesaurus > and simple dictionaries > n+1) use encoding names instead of locale's names in configuration FYI, I am continuing with the documentation cleanup, though I will not do the /ref directory until we are sure which commands will be kept. We can later modify the documentation to match the new behavior. --------------------------------------------------------------------------- > > Tom Lane wrote: > > Teodor Sigaev <teodor@sigaev.ru> writes: > >> But configuration for different languages might be differ, for example > >> russian (and any cyrillic-based) configuration is differ from > >> west-european configuration based on different character sets. > > > > Sure. I'm just assuming that the set of stopwords doesn't need to vary > > depending on the encoding you're using for a language --- that is, if > > you're willing to convert the encoding then the same stopword list file > > should serve for all encodings of a given language. Do you think this > > might be wrong? > > > > regards, tom lane > > > > ---------------------------(end of broadcast)--------------------------- > > TIP 9: In versions below 8.0, the planner will ignore your desire to > > choose an index scan if your joining column's datatypes do not > > match > > -- > Teodor Sigaev E-mail: teodor@sigaev.ru > WWW: http://www.sigaev.ru/ -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
I have completed my first pass over the tsearch documentation: http://momjian.us/expire/fulltext/HTML/sql.html They are from section 14 and following. I have come up with a number of questions that I placed in SGML comments in these files: http://momjian.us/expire/fulltext/SGML/ Teodor/Oleg, let me know when you want to go over my questions. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Sun, 17 Jun 2007, Bruce Momjian wrote: > I have completed my first pass over the tsearch documentation: > > http://momjian.us/expire/fulltext/HTML/sql.html > > They are from section 14 and following. > > I have come up with a number of questions that I placed in SGML comments > in these files: > > http://momjian.us/expire/fulltext/SGML/ > > Teodor/Oleg, let me know when you want to go over my questions. Below are my answers (marked as ) Comments to editorial work of Bruce Momjian. fulltext-intro.sgml: it is useful to have a predefined list of lexemes. >Bruce, here should be list of types of lexemes ! </para></listitem> <!-- SEEMS UNNECESSARY It useless to attempt normalize <type>email address</type> using morphological dictionary of russian language, but looks reasonable to pick out <type>domain name</type> and be able to search for <type>domain name</type>. --> >I dont' understand where did you get this para :) fulltext-opfunc.sgml: All of the following functions that accept a configuration argument can use either an integer <!-- why an integer --> or a textual configuration name to select a configuration. > originally it was integer id, probably better use <type>oid</type> This returns the query used for searching an index. It can be used to test for an empty query. The <command>SELECT</> below returns <literal>'T'</>, <!-- lowercase? --> which corresponds to an empty query since GIN indexes do not support negate queries (a full index scan is inefficient): > capital case. This looks cumbersome, probably querytree() should > just return NULL. The integer option controls several behaviors which is done using bit-wise fields and <literal>|</literal> (for example, <literal>2|4</literal>): <!-- why so complex? --> > to avoid 2 arguments its <replaceable>id</replaceable> or <replaceable>ts_name</replaceable>; <!-- n if none is specified that the current configuration is used. > I don't understand this question <para> <!-- why? --> Note that the cascade dropping of the <function>headline</function> function cause dropping of the <literal>parser</literal> used in fulltext configuration <replaceable>tsname</replaceable>. </para> > hmm, probably it should be reversed - cascade dropping of the parser cause > dropping of the headline function. In example below, <literal>fulltext_idx</literal> is a GIN index:<!-- why isn't this automatic --> > It's explained above. The problem is that current index api doesn't allow > to say if search was lossy or exact, so to preserve performance of > GIN index we had to introduce @@@ operator, which is the same as @@, but > lossy. nly the <token>lword</token> lexeme, then a <acronym>TZ</acronym> definition like ' one 1:11' will not work since lexeme type <token>digit</token> is not assigned to the <acronym>TZ</acronym>. <!-- what do these numbers mean? --> </para> > nothing special, just numbers for example. <function>ts_debug</> displays information about every token of <replaceable class="PARAMETER">document</replaceable> as produced by the parser and processed by the configured dictionaries using the configuration specified by <replaceable class="PARAMETER">cfgname</replaceable> or <replaceable class="PARAMETER">oid</replaceable>. <!-- no need for oid > don't understand this comment. ts_debug accepts cfgname or its oid Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
Oleg Bartunov wrote: > On Sun, 17 Jun 2007, Bruce Momjian wrote: > > > I have completed my first pass over the tsearch documentation: > > > > http://momjian.us/expire/fulltext/HTML/sql.html > > > > They are from section 14 and following. > > > > I have come up with a number of questions that I placed in SGML comments > > in these files: > > > > http://momjian.us/expire/fulltext/SGML/ > > > > Teodor/Oleg, let me know when you want to go over my questions. > > Below are my answers (marked as ) OK. > > Comments to editorial work of Bruce Momjian. > > fulltext-intro.sgml: > > it is useful to have a predefined list of lexemes. > >Bruce, here should be list of types of lexemes ! Agreed. Are the list of lexemes parser-specific? > </para></listitem> > > <!-- > SEEMS UNNECESSARY > It useless to attempt normalize <type>email address</type> using > morphological dictionary of russian language, but looks reasonable to pick > out <type>domain name</type> and be able to search for <type>domain > name</type>. > --> > > I dont' understand where did you get this para :) Uh, it was in the SGML. I have removed it. > fulltext-opfunc.sgml: > > All of the following functions that accept a configuration argument can > use either an integer <!-- why an integer --> or a textual configuration > name to select a configuration. > > originally it was integer id, probably better use <type>oid</type> Uh, my question is why are you allowing specification as an integer/oid when the name works just fine. I don't see the value in allowing numbers here. > This returns the query used for searching an index. It can be used to test > for an empty query. The <command>SELECT</> below returns <literal>'T'</>, > <!-- lowercase? --> which corresponds to an empty query since GIN indexes > do not support negate queries (a full index scan is inefficient): > > > capital case. This looks cumbersome, probably querytree() should > > just return NULL. Agreed. > The integer option controls several behaviors which is done using bit-wise > fields and <literal>|</literal> (for example, <literal>2|4</literal>): > <!-- why so complex? --> > > > to avoid 2 arguments But I don't see why you would want to set two of those values --- they seem mutually exclusive, e.g. 1 divides the rank by the 1 + logarithm of the document length2 divides the rank by the length itself I assume you do either one, not both. > its <replaceable>id</replaceable> or <replaceable>ts_name</replaceable>; <!-- n > if none is specified that the current configuration is used. > > > I don't understand this question Same issue as above --- why allow a number here when the name works just fine. We don't allow tables to be specified by number, so why configurations? > <para> > <!-- why? --> > Note that the cascade dropping of the <function>headline</function> function > cause dropping of the <literal>parser</literal> used in fulltext configuration > <replaceable>tsname</replaceable>. > </para> > > > hmm, probably it should be reversed - cascade dropping of the parser cause > > dropping of the headline function. Agreed. > > In example below, <literal>fulltext_idx</literal> is > a GIN index:<!-- why isn't this automatic --> > > > It's explained above. The problem is that current index api doesn't allow > > to say if search was lossy or exact, so to preserve performance of > > GIN index we had to introduce @@@ operator, which is the same as @@, but > > lossy. Well, then we have to fix the API. Telling users to use a different operator based on what index is defined is just bad style. > nly the <token>lword</token> lexeme, then a <acronym>TZ</acronym> > definition like ' one 1:11' will not work since lexeme type > <token>digit</token> is not assigned to the <acronym>TZ</acronym>. > <!-- what do these numbers mean? --> > </para> OK, I changed it to be clearer. > > nothing special, just numbers for example. > > <function>ts_debug</> displays information about every token of > <replaceable class="PARAMETER">document</replaceable> as produced by the > parser and processed by the configured dictionaries using the configuration > specified by <replaceable class="PARAMETER">cfgname</replaceable> or > <replaceable class="PARAMETER">oid</replaceable>. <!-- no need for oid > > > don't understand this comment. ts_debug accepts cfgname or its oid Again, no need for oid. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Wed, 20 Jun 2007, Bruce Momjian wrote: > Oleg Bartunov wrote: >> On Sun, 17 Jun 2007, Bruce Momjian wrote: >> >>> I have completed my first pass over the tsearch documentation: >>> >>> http://momjian.us/expire/fulltext/HTML/sql.html >>> >>> They are from section 14 and following. >>> >>> I have come up with a number of questions that I placed in SGML comments >>> in these files: >>> >>> http://momjian.us/expire/fulltext/SGML/ >>> >>> Teodor/Oleg, let me know when you want to go over my questions. >> >> Below are my answers (marked as ) > > OK. >> >> Comments to editorial work of Bruce Momjian. >> >> fulltext-intro.sgml: >> >> it is useful to have a predefined list of lexemes. >> >> Bruce, here should be list of types of lexemes ! > > Agreed. Are the list of lexemes parser-specific? > yes, it it parser which defines types of lexemes. >> fulltext-opfunc.sgml: >> >> All of the following functions that accept a configuration argument can >> use either an integer <!-- why an integer --> or a textual configuration >> name to select a configuration. >> >> originally it was integer id, probably better use <type>oid</type> > > Uh, my question is why are you allowing specification as an integer/oid > when the name works just fine. I don't see the value in allowing > numbers here. for compatibility reason. Hmm, indeed, i don't recall where oid's could be important. > >> This returns the query used for searching an index. It can be used to test >> for an empty query. The <command>SELECT</> below returns <literal>'T'</>, >> <!-- lowercase? --> which corresponds to an empty query since GIN indexes >> do not support negate queries (a full index scan is inefficient): >> >>> capital case. This looks cumbersome, probably querytree() should >>> just return NULL. > > Agreed. > >> The integer option controls several behaviors which is done using bit-wise >> fields and <literal>|</literal> (for example, <literal>2|4</literal>): >> <!-- why so complex? --> >> >>> to avoid 2 arguments > > But I don't see why you would want to set two of those values --- they > seem mutually exclusive, e.g. > > 1 divides the rank by the 1 + logarithm of the document length > 2 divides the rank by the length itself > > I assume you do either one, not both. but what's about others variants ? What I missed is the definition of extent. From http://www.sai.msu.su/~megera/wiki/NewExtentsBasedRanking Extent is a shortest and non-nested sequence of words, which satisfy a query. > >> its <replaceable>id</replaceable> or <replaceable>ts_name</replaceable>; <!-- n >> if none is specified that the current configuration is used. >> >>> I don't understand this question > > Same issue as above --- why allow a number here when the name works just > fine. We don't allow tables to be specified by number, so why > configurations? > >> <para> >> <!-- why? --> >> Note that the cascade dropping of the <function>headline</function> function >> cause dropping of the <literal>parser</literal> used in fulltext configuration >> <replaceable>tsname</replaceable>. >> </para> >> >>> hmm, probably it should be reversed - cascade dropping of the parser cause >>> dropping of the headline function. > > Agreed. > >> >> In example below, <literal>fulltext_idx</literal> is >> a GIN index:<!-- why isn't this automatic --> >> >>> It's explained above. The problem is that current index api doesn't allow >>> to say if search was lossy or exact, so to preserve performance of >>> GIN index we had to introduce @@@ operator, which is the same as @@, but >>> lossy. > > Well, then we have to fix the API. Telling users to use a different > operator based on what index is defined is just bad style. This was raised by Heikki and we discussed it a bit in Ottawa, but it's unclear if it's doable for 8.3. @@@ operator is in rare use, so we could say it will be improved in future versions. > >> nly the <token>lword</token> lexeme, then a <acronym>TZ</acronym> >> definition like ' one 1:11' will not work since lexeme type >> <token>digit</token> is not assigned to the <acronym>TZ</acronym>. >> <!-- what do these numbers mean? --> >> </para> > > OK, I changed it to be clearer. > >>> nothing special, just numbers for example. >> >> <function>ts_debug</> displays information about every token of >> <replaceable class="PARAMETER">document</replaceable> as produced by the >> parser and processed by the configured dictionaries using the configuration >> specified by <replaceable class="PARAMETER">cfgname</replaceable> or >> <replaceable class="PARAMETER">oid</replaceable>. <!-- no need for oid >> >>> don't understand this comment. ts_debug accepts cfgname or its oid > > Again, no need for oid. We need to decide if we need oids as user-visible argument. I don't see any value, probably Teodor think other way. Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
Oleg Bartunov wrote: > On Wed, 20 Jun 2007, Bruce Momjian wrote: > >> Comments to editorial work of Bruce Momjian. > >> > >> fulltext-intro.sgml: > >> > >> it is useful to have a predefined list of lexemes. > >> > >> Bruce, here should be list of types of lexemes ! > > > > Agreed. Are the list of lexemes parser-specific? > > > > yes, it it parser which defines types of lexemes. OK, how will users get a list of supported lexemes? Do we need a list per supported parser? > >> fulltext-opfunc.sgml: > >> > >> All of the following functions that accept a configuration argument can > >> use either an integer <!-- why an integer --> or a textual configuration > >> name to select a configuration. > >> > >> originally it was integer id, probably better use <type>oid</type> > > > > Uh, my question is why are you allowing specification as an integer/oid > > when the name works just fine. I don't see the value in allowing > > numbers here. > > for compatibility reason. Hmm, indeed, i don't recall where oid's could be > important. Well, if neither of ussee no reason for it, let's remove it. We don't need to support a feature that has no usefulness. > >> This returns the query used for searching an index. It can be used to test > >> for an empty query. The <command>SELECT</> below returns <literal>'T'</>, > >> <!-- lowercase? --> which corresponds to an empty query since GIN indexes > >> do not support negate queries (a full index scan is inefficient): > >> > >>> capital case. This looks cumbersome, probably querytree() should > >>> just return NULL. > > > > Agreed. > > > >> The integer option controls several behaviors which is done using bit-wise > >> fields and <literal>|</literal> (for example, <literal>2|4</literal>): > >> <!-- why so complex? --> > >> > >>> to avoid 2 arguments > > > > But I don't see why you would want to set two of those values --- they > > seem mutually exclusive, e.g. > > > > 1 divides the rank by the 1 + logarithm of the document length > > 2 divides the rank by the length itself > > > > I assume you do either one, not both. > > but what's about others variants ? OK, here is the full list: 0 (the default) ignores document length1 divides the rank by the 1 + logarithm of the document length2 divides the rank bythe length itself4 divides the rank by the mean harmonic distance between extents8 divides the rank by the number of uniquewords in document16 divides the rank by 1 + logarithm of the number of unique words in document so which ones would be both enabled? > > What I missed is the definition of extent. > > >From http://www.sai.msu.su/~megera/wiki/NewExtentsBasedRanking > Extent is a shortest and non-nested sequence of words, which satisfy a query. I don't understand how that relates to this. > > > >> its <replaceable>id</replaceable> or <replaceable>ts_name</replaceable>; <!-- n > >> if none is specified that the current configuration is used. > >> > >>> I don't understand this question > > > > Same issue as above --- why allow a number here when the name works just > > fine. We don't allow tables to be specified by number, so why > > configurations? > > > >> <para> > >> <!-- why? --> > >> Note that the cascade dropping of the <function>headline</function> function > >> cause dropping of the <literal>parser</literal> used in fulltext configuration > >> <replaceable>tsname</replaceable>. > >> </para> > >> > >>> hmm, probably it should be reversed - cascade dropping of the parser cause > >>> dropping of the headline function. > > > > Agreed. > > > >> > >> In example below, <literal>fulltext_idx</literal> is > >> a GIN index:<!-- why isn't this automatic --> > >> > >>> It's explained above. The problem is that current index api doesn't allow > >>> to say if search was lossy or exact, so to preserve performance of > >>> GIN index we had to introduce @@@ operator, which is the same as @@, but > >>> lossy. > > > > Well, then we have to fix the API. Telling users to use a different > > operator based on what index is defined is just bad style. > > This was raised by Heikki and we discussed it a bit in Ottawa, but it's > unclear if it's doable for 8.3. @@@ operator is in rare use, so we could > say it will be improved in future versions. Uh, I am wondering if we just have to force heap access in all cases until it is fixed. > >> nly the <token>lword</token> lexeme, then a <acronym>TZ</acronym> > >> definition like ' one 1:11' will not work since lexeme type > >> <token>digit</token> is not assigned to the <acronym>TZ</acronym>. > >> <!-- what do these numbers mean? --> > >> </para> > > > > OK, I changed it to be clearer. > > > >>> nothing special, just numbers for example. > >> > >> <function>ts_debug</> displays information about every token of > >> <replaceable class="PARAMETER">document</replaceable> as produced by the > >> parser and processed by the configured dictionaries using the configuration > >> specified by <replaceable class="PARAMETER">cfgname</replaceable> or > >> <replaceable class="PARAMETER">oid</replaceable>. <!-- no need for oid > >> > >>> don't understand this comment. ts_debug accepts cfgname or its oid > > > > Again, no need for oid. > > We need to decide if we need oids as user-visible argument. I don't see > any value, probably Teodor think other way. This is a good time to clean up the API because there are going to be user-visible changes anyway. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Wed, 20 Jun 2007, Bruce Momjian wrote: > Oleg Bartunov wrote: >> On Wed, 20 Jun 2007, Bruce Momjian wrote: >>>> Comments to editorial work of Bruce Momjian. >>>> >>>> fulltext-intro.sgml: >>>> >>>> it is useful to have a predefined list of lexemes. >>>> >>>> Bruce, here should be list of types of lexemes ! >>> >>> Agreed. Are the list of lexemes parser-specific? >>> >> >> yes, it it parser which defines types of lexemes. > > OK, how will users get a list of supported lexemes? Do we need a list > per supported parser? it's documented, see "Parser functions" for token_type(); postgres=# select * from token_type('default'); tokid | alias | description -------+--------------+----------------------------------- 1 | lword | Latin word 2 | nlword | Non-latinword 3 | word | Word 4 | email | Email 5 | url | URL 6 | host |Host 7 | sfloat | Scientific notation 8 | version | VERSION 9 | part_hword | Part of hyphenatedword 10 | nlpart_hword | Non-latin part of hyphenated word 11 | lpart_hword | Latin part of hyphenated word 12 | blank | Space symbols 13 | tag | HTML Tag 14 | protocol | Protocol head 15 | hword | Hyphenated word 16 | lhword | Latin hyphenated word 17 | nlhword | Non-latin hyphenated word 18 | uri | URI 19 | file | File or path name 20 | float | Decimal notation 21 | int | Signed integer 22 | uint | Unsigned integer 23 | entity | HTML Entity >>>> The integer option controls several behaviors which is done using bit-wise >>>> fields and <literal>|</literal> (for example, <literal>2|4</literal>): >>>> <!-- why so complex? --> >>>> >>>>> to avoid 2 arguments >>> >>> But I don't see why you would want to set two of those values --- they >>> seem mutually exclusive, e.g. >>> >>> 1 divides the rank by the 1 + logarithm of the document length >>> 2 divides the rank by the length itself >>> >>> I assume you do either one, not both. >> >> but what's about others variants ? > > OK, here is the full list: > > 0 (the default) ignores document length > 1 divides the rank by the 1 + logarithm of the document length > 2 divides the rank by the length itself > 4 divides the rank by the mean harmonic distance between extents > 8 divides the rank by the number of unique words in document > 16 divides the rank by 1 + logarithm of the number of unique words in > document > > so which ones would be both enabled? no one ! This is a list of possible values of rank normalization flag, which could be ORed together. =# select rank_cd('1:1,2,3 4:5 6:7', '1&4',1); rank_cd ----------- 0.0279055 =# select rank_cd('1:1,2,3 4:5 6:7', '1&4',1|16); rank_cd ----------- 0.0139528 > >> >> What I missed is the definition of extent. >> >>> From http://www.sai.msu.su/~megera/wiki/NewExtentsBasedRanking >> Extent is a shortest and non-nested sequence of words, which satisfy a query. > > I don't understand how that relates to this. because of "4 divides the rank by the mean harmonic distance between extents" ^^^^^^^ it reflects how dense extents which satisfy query are in document. > >>> >>>> its <replaceable>id</replaceable> or <replaceable>ts_name</replaceable>; <!-- n >>>> if none is specified that the current configuration is used. >>>> >>>>> I don't understand this question >>> >>> Same issue as above --- why allow a number here when the name works just >>> fine. We don't allow tables to be specified by number, so why >>> configurations? >>> >>>> <para> >>>> <!-- why? --> >>>> Note that the cascade dropping of the <function>headline</function> function >>>> cause dropping of the <literal>parser</literal> used in fulltext configuration >>>> <replaceable>tsname</replaceable>. >>>> </para> >>>> >>>>> hmm, probably it should be reversed - cascade dropping of the parser cause >>>>> dropping of the headline function. >>> >>> Agreed. >>> >>>> >>>> In example below, <literal>fulltext_idx</literal> is >>>> a GIN index:<!-- why isn't this automatic --> >>>> >>>>> It's explained above. The problem is that current index api doesn't allow >>>>> to say if search was lossy or exact, so to preserve performance of >>>>> GIN index we had to introduce @@@ operator, which is the same as @@, but >>>>> lossy. >>> >>> Well, then we have to fix the API. Telling users to use a different >>> operator based on what index is defined is just bad style. >> >> This was raised by Heikki and we discussed it a bit in Ottawa, but it's >> unclear if it's doable for 8.3. @@@ operator is in rare use, so we could >> say it will be improved in future versions. > > Uh, I am wondering if we just have to force heap access in all cases > until it is fixed. no-no ! We'll lost performance of GIN index, which isn't lossy and don't need heap access. I don't see what's wrong if we say that some feature doesn't supported by text search operator with GIN index. >> We need to decide if we need oids as user-visible argument. I don't see >> any value, probably Teodor think other way. > > This is a good time to clean up the API because there are going to be > user-visible changes anyway. I agree. Keep in mind this, until we get more serious tasks done. Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
On Wed, 20 Jun 2007, Bruce Momjian wrote: >> >> We need to decide if we need oids as user-visible argument. I don't see >> any value, probably Teodor think other way. > > This is a good time to clean up the API because there are going to be > user-visible changes anyway. Bruce, just remove oid argument specification from documentation. Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
Oleg Bartunov wrote: > On Wed, 20 Jun 2007, Bruce Momjian wrote: > >> > >> We need to decide if we need oids as user-visible argument. I don't see > >> any value, probably Teodor think other way. > > > > This is a good time to clean up the API because there are going to be > > user-visible changes anyway. > > Bruce, just remove oid argument specification from documentation. Done. I am attaching the current function prototypes. If they don't match the C code, please let me know. I have also updated with some minor corrections I received from Erik. I will be adding more to the documentation hopefully this week: http://momjian.us/expire/fulltext/HTML/ -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + *** /pgsgml/fulltext-opfunc.sgml Sat Jun 16 23:30:11 2007 --- fulltext-opfunc.sgml Mon Jul 2 21:17:15 2007 *************** *** 141,147 **** <term> <synopsis> ! to_tsvector(<optional><replaceable class="PARAMETER">configuration</replaceable>,</optional> <replaceable class="PARAMETER">document</replaceable>TEXT) returns TSVECTOR </synopsis> </term> --- 141,147 ---- <term> <synopsis> ! to_tsvector(<optional><replaceable class="PARAMETER">conf_name</replaceable></optional>, <replaceable class="PARAMETER">document</replaceable>TEXT) returns TSVECTOR </synopsis> </term> *************** *** 285,306 **** <term> <synopsis> ! tsearch(<replaceable class="PARAMETER">vector_column_name</replaceable><optional>, (<replaceable class="PARAMETER">my_filter_name</replaceable>| <replaceable class="PARAMETER">text_column_name1</replaceable>) <optional>...</optional></optional>, <replaceable class="PARAMETER">text_column_nameN</replaceable>) </synopsis> </term> *************** *** 323,329 **** <term> <synopsis> ! stat(<replaceable class="PARAMETER">sqlquery</replaceable> text <optional>, weight text </optional>) returns SETOF statinfo </synopsis> <listitem> --- 322,328 ---- <term> <synopsis> ! stat(<optional><replaceable class="PARAMETER">sqlquery</replaceable> text </optional>, weight text </optional>) returnsSETOF statinfo </synopsis> <listitem> *************** *** 403,409 **** <term> <synopsis> ! to_tsquery(<optional><replaceable class="PARAMETER">configuration</replaceable>,</optional> <replaceable class="PARAMETER">querytext</replaceable>text) returns TSQUERY </synopsis> </term> --- 402,408 ---- <term> <synopsis> ! to_tsquery(<optional><replaceable class="PARAMETER">conf_name</replaceable></optional>, <replaceable class="PARAMETER">querytext</replaceable>text) returns TSQUERY </synopsis> </term> *************** *** 446,452 **** <term> <synopsis> ! plainto_tsquery(<optional><replaceable class="PARAMETER">configuration</replaceable>,</optional> <replaceable class="PARAMETER">querytext</replaceable>text) returns TSQUERY </synopsis> </term> --- 445,451 ---- <term> <synopsis> ! plainto_tsquery(<optional><replaceable class="PARAMETER">conf_name</replaceable></optional>, <replaceable class="PARAMETER">querytext</replaceable>text) returns TSQUERY </synopsis> </term> *************** *** 989,995 **** <term> <synopsis> ! rank(<optional> <replaceable class="PARAMETER">weights</replaceable> float4[], </optional> <replaceable class="PARAMETER">vector</replaceable>TSVECTOR, <replaceable class="PARAMETER">query</replaceable> TSQUERY, <optional> <replaceableclass="PARAMETER">normalization</replaceable> int4 </optional>) returns float4 </synopsis> </term> --- 988,994 ---- <term> <synopsis> ! rank(<optional> <replaceable class="PARAMETER">weights</replaceable> float4[]</optional>, <replaceable class="PARAMETER">vector</replaceable>TSVECTOR, <replaceable class="PARAMETER">query</replaceable> TSQUERY, <optional> <replaceableclass="PARAMETER">normalization</replaceable> int4 </optional>) returns float4 </synopsis> </term> *************** *** 1084,1090 **** <term> <synopsis> ! headline(<optional> <replaceable class="PARAMETER">id</replaceable> int4, | <replaceable class="PARAMETER">ts_name</replaceable>text, </optional> <replaceable class="PARAMETER">document</replaceable> text, <replaceableclass="PARAMETER">query</replaceable> TSQUERY, <optional> <replaceable class="PARAMETER">options</replaceable>text </optional>) returns text </synopsis> </term> --- 1083,1089 ---- <term> <synopsis> ! headline(<optional> <replaceable class="PARAMETER">ts_name</replaceable> text</optional>, <replaceable class="PARAMETER">document</replaceable>text, <replaceable class="PARAMETER">query</replaceable> TSQUERY, <optional> <replaceableclass="PARAMETER">options</replaceable> text </optional>) returns text </synopsis> </term> *************** *** 1351,1357 **** <term> <synopsis> ! lexize(<optional> <replaceable class="PARAMETER">oid</replaceable>, | <replaceable class="PARAMETER">dict_name</replaceable>text, <replaceable class="PARAMETER">lexeme</replaceable> text) returns text[] </synopsis> </term> --- 1350,1356 ---- <term> <synopsis> ! lexize(<optional> <replaceable class="PARAMETER">dict_name</replaceable> text</optional>, <replaceable class="PARAMETER">lexeme</replaceable>text) returns text[] </synopsis> </term> *************** *** 1858,1878 **** <title>Debugging</title> <para> ! Function <function>ts_debug</function> allows easy testing of your full text indexing configuration. </para> <synopsis> ! ts_debug(<optional><replaceable class="PARAMETER">cfgname</replaceable> | <replaceable class="PARAMETER">oid</replaceable></optional>,<replaceable class="PARAMETER">document</replaceable> TEXT) returns SETOFtsdebug </synopsis> <para> --- 1852,1870 ---- <title>Debugging</title> <para> ! Function <function>ts_debug</function> allows easy testing of your full text searching configuration. </para> <synopsis> ! ts_debug(<optional><replaceable class="PARAMETER">conf_name</replaceable></optional>, <replaceable class="PARAMETER">document</replaceable>TEXT) returns SETOF tsdebug </synopsis>
FYI, I have massively reorganized the text search documentation and it is getting closer to something I am happy with: http://momjian.us/expire/fulltext/HTML/textsearch.html -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
2007/7/7, Bruce Momjian <bruce@momjian.us>: > FYI, I have massively reorganized the text search documentation and it > is getting closer to something I am happy with: > > http://momjian.us/expire/fulltext/HTML/textsearch.html The following is the result of me proofreading, mainly searching for small mistakes such as spelling/grammatical errors (that means no document structure comments, etc). All corrections are relative to the version of the text at above URL at the time of me reading it :-). General It seems to be a recurring problem that commas are not put between the brackets when an argument is optional. For example: "to_tsvector([conf_name], document TEXT)" -> I guess this should be "to_tsvector([conf_name,] document TEXT)" Full-text vs. full text and stop-word vs. stop word are not used consistently. Also, capitalization of full text searching is not used consistently. 14.1. Introduction * "indexinging" - > "indexing" * "There is no linguistic support, even in English" -> "for" instead of "in"? * "e.g.satisfies" -> add a space before "satisfies" * "have several thousands derivatives" -> should this not use the singular form thousand? * "infinitive form" -> is this the right term? I think it only applies to verbs (also occurs in 14.4 and probably others) * "over how lexemes creation" -> not sure what this should be. "are created" maybe? * "Map synonyms to a single word. ispell." -> why is ispell a standalone word? * "so it is natural to introduce a new data type" -> this does not sound like documentation * "Also, full-text search operator @@" -> add "the" before "full-text" * "A document is any text file that can be opened, read, and modified" -> "file" sounds as if it should be a file on a filesystem. * "However, the document file must be uniquely identified in the database." -> why? * "COALESCE" -> should be a link * "during calculation of document rank" -> add "the" before "calculation" and before "document" * "which supports boolean operators, & (AND)" -> remove the ",". maybe add "the" before boolean * "parenthesis" -> "parentheses" * "Tsquery consists of" -> maybe add "A" before Tsquery 14.2. Operators And Functions ^^^ -> a non-capital "a" in "and" seems to be more consistent with the rest of the manual * "TSVECTOR, otherwise false:" -> "and false if not" or "and false otherwise" (occurs 3 times in this section) * "The text should be formatted to match the way a vector is displayed by SELECT." -> what a strange definition, I think something like "input format" or so should be used (and defined somewhere, didn't see it yet) (used twice in this section) * "tsearch([vector_column_name], my_filter_name | text_column_name1 [...], text_column_nameN)" -> I do not understand the notation * "The following rule is used: a function is applied to all subsequent TEXT columns until next matching column occurs." -> I don't get it * "stat([sqlquery text ], [weight text ]) returns SETOF statinfo" -> I guess that not both of the arguments are optional? * "stop-words candidates" -> stop-word candidates * "tsvectors are compared with each other using lexicographical ordering." -> of the output representation or something else? * "Accepts querytext, which should be single tokens separated by" -> replace "be" with "consist of" * "& and | or, and ! not" -> putting parentheses around the "and" "or" and "not" would be more readable. also, a comma is missing before the "|" sign * "break it onto tokens" -> into instead of onto * "since GIN indexes do not support negate queries" -> something like: "queries with negation" or "negated queries" (depending on what the correct rule is) * "Arguments to rewrite() function" -> "the .. functions" or "to .." (without the "function") * "can be column names of type tsquery" -> "names of columns of type tsquery" (the names are not of type tsquery, the columns are) * "we can change rewriting rule online" -> add "the", possibly use another word for "online" (it is not clear what that means to me) 14.3. Additional Controls * "Full text searching in PostgreSQL provides function" -> add "the" * "we see the resulting" -> maybe "we see that the resulting" "does not contain a, on, or it, word rats became rat, and the punctuation sign - was ignored" -> "does not contain the words" (or lexemes, or tokens), add "the" before "word rats", add quotes around the "-" * "on words" -> "into words" * "they are too frequent" -> "they occur too frequently" (I think a word cannot "be" frequent) * "The Punctuation sign -" -> "The punctuation sign -" + put quotes around the "-" * "which shows all details of full text machinery" -> add "the" before "full" * "is to mark out the different parts of document" -> add "a" before "document" * "by the 1 + logarithm" -> "by 1 + the logarithm" * "i.e., ordering of search results will not change" -> add "the" before "ordering", maybe also before "search" * "note that second example" -> add "the" before "second" * "than ones with labeled with D" -> "than ones labeled with D" or "than ones that are labeled with D" * "Unfortunately, it is almost impossible to avoid since full text indexing in a database should work without indexes" -> I don't get it * "to show part of each document" -> add "a" before "part" * "provides the function headline" -> add something, such as "to accomplish this" or "that implements such functionality" or something. * "ellipse-separated" -> "ellipsis-separated" * "the cascade dropping of the parser function cause dropping of the headling" -> I don't get the meaning of the sentence. I guess that "cause" should be "causes" and "headling" should be 'heading" 14.4. Dictionaries * "to use any word form in a query" -> "to use any derived form of a word in a query" * "infinitive" -> is this the right term? I think it only applies to verbs (used twice in this section) * "colour" -> is the manual supposed to be UK or US English? I cannot remember ever having read any UK-isms before * "substituted to their" -> replace "to" with "by" or "with" (native English speakers, help me here) * "see dictionary for integers Section 14.11 as an example" -> strange way of referring, I would put parenthesis around the section number, or alternatively put the section number before the title * "Lexemes come through a stack" -> replace "come through" with "are processed by" or something * "appears as a stop-word" -> "turns out to be a stop-word", also "stop word" is used elsewhere (without the "-") (this inconsistency occurs a lot in this section) * "Also, the ts_debug function ( Section 14.10 ) is very useful for this." -> the spaces around the section reference look strange. maybe replace "is very useful" by "can be used" * "and appear in almost every document" -> two times "and" sounds bad, replace this "and" by a comma * "discrimination value so they can be ignored in" -> cut this in two sentences: "discrimination value. Therefore, they can be ignored in the context of" * "word like a and it is useless to have them in an index" -> replace "word" with "words", make "a" somehow stand out (quotes?), replace "and" with "although" and "have" with "store" * "However stop words" -> "However, stop words" * "does affect ranking" -> "do affect ranking" (I think both can be considered correct, but like this one better) * "Relative paths in OPTION resolve relative to share/" -> and "share/" is relative to what? such references occur elsewhere in this section * "Synonym dictionary can be used" -> replace "dictonary" with "dictionaries", or alternatively, put "A" before "synonym" * "thesynonym" -> add a space * "en_stemm" -> "en_stem" * "abbeviated" -> "abbreviated" * "preferred terms, non-preferred, related terms" -> add "terms" after "non-preferred", or alternatively, remove all "terms" references apart from the last one * "in the thesaurus requires reindexing" -> replace "requires" with "require" * "It is possible to define only one dictionary." -> I guess that sentence wants to express that only one dictionary is allowed? In that case, change to "It is only possible to define one dictionary." * "Use asterisk" -> add "an" before "asterisk" * "thesubdictionary" -> "the subdictionary" * "It is still required that sample words should be known" -> don't use "required" and "should" together: "sample words are still required to be known" * "Since thesaurus dictionary" -> add "a" before "thesaurus" * "with parser" -> add "the" before "parser" * "but we can use plainto_tsquery and to_tsvector functions" -> add "the" before the name of the first function, or remove the "functions" part * "not a lexemes" -> "not lexemes" * "on OpenOffice Wiki" -> add "the" before "OpenOffice" * "does not supports" -> "does not support" * "support of" -> "support for" * "At present, Full text" -> I guess that "full" should not be capitalized * "see Snowball site" -> add "the" before "Snowball" * "which accepts a snowball stemmer" -> "that is accepted by a snowball stemmer" 14.5. Indexes * "speedup" -> "speed up" * "GiST(The Generalized Search Tree)-based" -> "GiST (Generalized Search Tree)-based" * "GIN(The Generalized Inverted Index)-based" -> "GIN (Generalized Inverted Index)-based" * "necessary consult the" -> add "to" before "consult" * "and could be result" -> remove the "be" * "transitive containment relation is realized" -> add "the" before "transitive" * "Knuth,1973" -> add a space after the comma * "i.e. parent is 'OR'-ed bit-strings" -> "i.e., a parent is the result of 'OR'-ing the bit-strings" * "of its limited" -> "of the limited" * "The likelihood of false drops" -> what are "drops"? maybe this needs to be "hits"? * "while longer one are" -> replace "one" with "ones" * "or the result" -> add "whether" before "the" * "currently is currently" -> remove the first "currently" * "but its performance" -> replace "its" with "their" * "heap, so" -> "heap. Therefore, " * "In example below" -> add "the" before "example" * "constraint_exclusion" -> why the underscore? should be a link 14.6. Configuration * "all of the options" -> maybe remove "of the" * "objects a set" -> add a comma before "a" 14.7. Limitations * "Length of" -> "The length of" (twice) * "less then" -> "less than" None of the numbers use commas to separate the thousands, except for one. 14.8. psql Support 14.9. Application Tutorial * "searchs" -> "searches" * "is last-modified date" -> add "the" after "is" 14.10. Debugging * "Word supernovaes" -> "The word supernovaes" * "end the dictionary stack" -> add "the" before "dictionary" * "specifies maximum length" -> add "the" before "maximum" 14.12. Example of Creating a Parser * "Note it should" -> insert "that" after "Note" * "The void function" -> replace "The" with "This" Nicolas -- Nicolas Barbier http://www.gnu.org/philosophy/no-word-attachments.html
Thanks, I applied this patch and rebuild HTML version. I was wondering how I was going to make all the changes accurately. ;-) --------------------------------------------------------------------------- Nicolas Barbier wrote: > 2007/7/7, Bruce Momjian <bruce@momjian.us>: > > > FYI, I have massively reorganized the text search documentation and it > > is getting closer to something I am happy with: > > > > http://momjian.us/expire/fulltext/HTML/textsearch.html > > The following is the result of me proofreading, mainly searching for > small mistakes such as spelling/grammatical errors (that means no > document structure comments, etc). > > All corrections are relative to the version of the text at above URL > at the time of me reading it :-). > > General > > It seems to be a recurring problem that commas are not put between the > brackets when an argument is optional. For example: > "to_tsvector([conf_name], document TEXT)" -> I guess this should be > "to_tsvector([conf_name,] document TEXT)" > > Full-text vs. full text and stop-word vs. stop word are not used > consistently. Also, capitalization of full text searching is not used > consistently. > -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
I think the tsearch documentation is nearing completion: http://momjian.us/expire/fulltext/HTML/textsearch.html but I am not happy with how tsearch is enabled in a user table: http://momjian.us/expire/fulltext/HTML/textsearch-app-tutorial.html Aside from the fact that it needs more examples, it only illustrates an example where someone creates a table, populates it, then adds a tsvector column, populates that, then creates an index. That seems quite inflexible. Is there a way to avoid having a separate tsvector column? What happens if the table is dynamic? How is that column updated based on table changes? Triggers? Where are the examples? Can you create an index like this: CREATE INDEX textsearch_id ON pgweb USING gin(to_tsvector(column)); That avoids having to have a separate column because you can just say: WHERE to_query('XXX') @@ to_tsvector(column) How do we make sure that the to_query is using the same text search configuration as the 'column' or index? Perhaps we should suggest: CREATE INDEX textsearch_idx ON pgweb USING gin(to_tsvector('english',column)); so that at least the configuration is documented in the index. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Tue, 17 Jul 2007, Bruce Momjian wrote: > I think the tsearch documentation is nearing completion: > > http://momjian.us/expire/fulltext/HTML/textsearch.html > > but I am not happy with how tsearch is enabled in a user table: > > http://momjian.us/expire/fulltext/HTML/textsearch-app-tutorial.html > > Aside from the fact that it needs more examples, it only illustrates an > example where someone creates a table, populates it, then adds a > tsvector column, populates that, then creates an index. > > That seems quite inflexible. Is there a way to avoid having a separate > tsvector column? What happens if the table is dynamic? How is that > column updated based on table changes? Triggers? Where are the > examples? Can you create an index like this: I agree, that there are could be more examples, but text search doesn't require something special ! *Example* of trigger function is documented on http://momjian.us/expire/fulltext/HTML/textsearch-opfunc.html > > CREATE INDEX textsearch_id ON pgweb USING gin(to_tsvector(column)); > > That avoids having to have a separate column because you can just say: > > WHERE to_query('XXX') @@ to_tsvector(column) yes, it's possible, but without ranking, since currently it's impossible to store any information in index (it's pg's feature). btw, this should works and for GiST index also. That kind of search is useful if there is another natural ordering of search results, for example, by timestamp. > > How do we make sure that the to_query is using the same text search > configuration as the 'column' or index? Perhaps we should suggest: please, keep in mind, it's not mandatory to use the same configuration at search time, that was used at index creation. > > CREATE INDEX textsearch_idx ON pgweb USING gin(to_tsvector('english',column)); > > so that at least the configuration is documented in the index. yes, it's better to always explicitly specify configuration name and not rely on default configuration. Unfortunately, configuration name doesn't saved in the index. Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
On Tue, 17 Jul 2007, Oleg Bartunov wrote: > On Tue, 17 Jul 2007, Bruce Momjian wrote: > >> I think the tsearch documentation is nearing completion: >> >> http://momjian.us/expire/fulltext/HTML/textsearch.html >> >> but I am not happy with how tsearch is enabled in a user table: >> >> http://momjian.us/expire/fulltext/HTML/textsearch-app-tutorial.html >> >> Aside from the fact that it needs more examples, it only illustrates an >> example where someone creates a table, populates it, then adds a >> tsvector column, populates that, then creates an index. >> >> That seems quite inflexible. Is there a way to avoid having a separate >> tsvector column? What happens if the table is dynamic? How is that >> column updated based on table changes? Triggers? Where are the >> examples? Can you create an index like this: > > I agree, that there are could be more examples, but text search doesn't > require something special ! > *Example* of trigger function is documented on > http://momjian.us/expire/fulltext/HTML/textsearch-opfunc.html > Bruce, below is an example of trigger for insert/update of example table create function pgweb_update() returns trigger as $$ BEGIN NEW.textsearch_index= setweight( to_tsvector( coalesce (title,'')), 'A' ) || ' ' || setweight( to_tsvector(coalesce(body,'')),'D'); RETURN NEW; END; $$ language plpgsql; CREATE TRIGGER fts_update BEFORE INSERT OR UPDATE ON pgweb FOR EACH ROW EXECUTE PROCEDURE pgweb_update(); > >> >> CREATE INDEX textsearch_id ON pgweb USING gin(to_tsvector(column)); >> >> That avoids having to have a separate column because you can just say: >> >> WHERE to_query('XXX') @@ to_tsvector(column) > > yes, it's possible, but without ranking, since currently it's impossible to > store any information in index (it's pg's feature). btw, this should > works and for GiST index also. > > That kind of search is useful if there is another natural ordering of search > results, for example, by timestamp. > >> >> How do we make sure that the to_query is using the same text search >> configuration as the 'column' or index? Perhaps we should suggest: > > please, keep in mind, it's not mandatory to use the same configuration > at search time, that was used at index creation. > one example is when text search index created without taking into account stop-words. Then you could search famous 'to be or not to be' with the same configuration, or ignore stop words with other. >> >> CREATE INDEX textsearch_idx ON pgweb USING >> gin(to_tsvector('english',column)); >> >> so that at least the configuration is documented in the index. > > yes, it's better to always explicitly specify configuration name and not rely > on default configuration. Unfortunately, configuration name doesn't saved in > the index. > > Regards, > Oleg > _____________________________________________________________ > Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), > Sternberg Astronomical Institute, Moscow University, Russia > Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ > phone: +007(495)939-16-83, +007(495)939-23-83 > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Have you searched our list archives? > > http://archives.postgresql.org > Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
Oleg Bartunov wrote: > On Tue, 17 Jul 2007, Bruce Momjian wrote: > > > I think the tsearch documentation is nearing completion: > > > > http://momjian.us/expire/fulltext/HTML/textsearch.html > > > > but I am not happy with how tsearch is enabled in a user table: > > > > http://momjian.us/expire/fulltext/HTML/textsearch-app-tutorial.html > > > > Aside from the fact that it needs more examples, it only illustrates an > > example where someone creates a table, populates it, then adds a > > tsvector column, populates that, then creates an index. > > > > That seems quite inflexible. Is there a way to avoid having a separate > > tsvector column? What happens if the table is dynamic? How is that > > column updated based on table changes? Triggers? Where are the > > examples? Can you create an index like this: > > I agree, that there are could be more examples, but text search doesn't > require something special ! > *Example* of trigger function is documented on > http://momjian.us/expire/fulltext/HTML/textsearch-opfunc.html Yes, I see that in tsearch() here: http://momjian.us/expire/fulltext/HTML/textsearch-opfunc.html#TEXTSEARC$ I assume my_filter_name is optional right? I have updated the prototype to be: tsearch([vector_column_name], [my_filter_name], text_column_name [, ... ]) Is this accurate? What does this text below it mean? There can be many functions and text columns specified in a tsearch()trigger. The following rule is used: a function is appliedto allsubsequent TEXT columns until the next matching column occurs. Why are we allowing my_filter_name here? Isn't that something for a custom trigger. Is calling it tsearch() a good idea? Why not tsvector_trigger(). > > CREATE INDEX textsearch_id ON pgweb USING gin(to_tsvector(column)); > > > > That avoids having to have a separate column because you can just say: > > > > WHERE to_query('XXX') @@ to_tsvector(column) > > yes, it's possible, but without ranking, since currently it's impossible > to store any information in index (it's pg's feature). btw, this should > works and for GiST index also. What if they use @@@. Wouldn't that work because it is going to check the heap? > That kind of search is useful if there is another natural ordering of search > results, for example, by timestamp. > > > > > How do we make sure that the to_query is using the same text search > > configuration as the 'column' or index? Perhaps we should suggest: > > please, keep in mind, it's not mandatory to use the same configuration > at search time, that was used at index creation. Well, sort of. If you have stop words in the tquery configuration, you aren't going to hit any matches in the tsvector, right? Same for synonymns, I suppose. I can see that stemming would work if there was a mismatch between tsquery and tsvector. > > CREATE INDEX textsearch_idx ON pgweb USING gin(to_tsvector('english',column)); > > > > so that at least the configuration is documented in the index. > > yes, it's better to always explicitly specify configuration name and not > rely on default configuration. > Unfortunately, configuration name doesn't saved in the index. I was more concerned that there is nothing documenting the configuration used by the index or the tsvector table column trigger. By doing: CREATE INDEX textsearch_idx ON pgweb USING gin(to_tsvector('english',column)); you guarantee that the index uses 'english' for all its entries. If you omit the 'english' or use a different configuration, it will heap scan the table, which at least gives the right answer. Also, how do you guarantee that tsearch() triggers always uses the same configuration? The existing tsearch() API seems to make that impossible. I am wondering if we need to add the configuration name as a mandatory parameter to tsearch(). -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Jul 17, 2007, at 16:24 , Bruce Momjian wrote: > I assume my_filter_name is optional right? I have updated the > prototype > to be: > > tsearch([vector_column_name], [my_filter_name], text_column_name > [, ... ]) Just a style point, but would [filter_name] be better than [my_filter_name]? You're not qualifying the others with my_ ... or is there something you want to tell us, Bruce? :) Michael Glaesemann grzm seespotcode net
Michael Glaesemann wrote: > > On Jul 17, 2007, at 16:24 , Bruce Momjian wrote: > > > I assume my_filter_name is optional right? I have updated the > > prototype > > to be: > > > > tsearch([vector_column_name], [my_filter_name], text_column_name > > [, ... ]) > > Just a style point, but would [filter_name] be better than > [my_filter_name]? You're not qualifying the others with my_ ... or is > there something you want to tell us, Bruce? :) Agreed. Done. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Tue, 17 Jul 2007, Bruce Momjian wrote: > Oleg Bartunov wrote: >> On Tue, 17 Jul 2007, Bruce Momjian wrote: >> >>> I think the tsearch documentation is nearing completion: >>> >>> http://momjian.us/expire/fulltext/HTML/textsearch.html >>> >>> but I am not happy with how tsearch is enabled in a user table: >>> >>> http://momjian.us/expire/fulltext/HTML/textsearch-app-tutorial.html >>> >>> Aside from the fact that it needs more examples, it only illustrates an >>> example where someone creates a table, populates it, then adds a >>> tsvector column, populates that, then creates an index. >>> >>> That seems quite inflexible. Is there a way to avoid having a separate >>> tsvector column? What happens if the table is dynamic? How is that >>> column updated based on table changes? Triggers? Where are the >>> examples? Can you create an index like this: >> >> I agree, that there are could be more examples, but text search doesn't >> require something special ! >> *Example* of trigger function is documented on >> http://momjian.us/expire/fulltext/HTML/textsearch-opfunc.html > > Yes, I see that in tsearch() here: > > http://momjian.us/expire/fulltext/HTML/textsearch-opfunc.html#TEXTSEARC$ > > I assume my_filter_name is optional right? I have updated the prototype > to be: > > tsearch([vector_column_name], [my_filter_name], text_column_name [, ... ]) > > Is this accurate? What does this text below it mean? no, this in inaccurate. First, vector_column_name is not optional argument, it's a name of tsvector column name. > > There can be many functions and text columns specified in a tsearch() > trigger. The following rule is used: a function is applied to all > subsequent TEXT columns until the next matching column occurs. The idea, is to provide user to preprocess text before applying tsearch machinery. my_filter_name() preprocess text_column_name1, text_column_name2,.... The original syntax allows to specify for every text columns their preprocessing functions. So, I suggest to keep original syntax, change 'vector_column_name' to 'tsvector_column_name'. > > Why are we allowing my_filter_name here? Isn't that something for a > custom trigger. Is calling it tsearch() a good idea? Why not > tsvector_trigger(). I don't see any benefit from the tsvector_trigger() name. If you want to add some semantic, than tsvector_update_trigger() would be better. Anyway, this trigger is an illustration. > >>> CREATE INDEX textsearch_id ON pgweb USING gin(to_tsvector(column)); >>> >>> That avoids having to have a separate column because you can just say: >>> >>> WHERE to_query('XXX') @@ to_tsvector(column) >> >> yes, it's possible, but without ranking, since currently it's impossible >> to store any information in index (it's pg's feature). btw, this should >> works and for GiST index also. > > What if they use @@@. Wouldn't that work because it is going to check > the heap? It would work, it'd recalculate to_tsvector(column) for rows found ( for GiST - to remove false hits and for weight information, for GIN - for weight information only). > >> That kind of search is useful if there is another natural ordering of search >> results, for example, by timestamp. >> >>> >>> How do we make sure that the to_query is using the same text search >>> configuration as the 'column' or index? Perhaps we should suggest: >> >> please, keep in mind, it's not mandatory to use the same configuration >> at search time, that was used at index creation. > > Well, sort of. If you have stop words in the tquery configuration, you > aren't going to hit any matches in the tsvector, right? Same for > synonymns, I suppose. I can see that stemming would work if there was a > mismatch between tsquery and tsvector. > >>> CREATE INDEX textsearch_idx ON pgweb USING gin(to_tsvector('english',column)); >>> >>> so that at least the configuration is documented in the index. >> >> yes, it's better to always explicitly specify configuration name and not >> rely on default configuration. >> Unfortunately, configuration name doesn't saved in the index. as Teodor corrected me, index doesn't know about configuration at all ! What accurate user could do, is to provide configuration name in the comment for tsvector column. Configuration name is an accessory of to_tsvector() function. In principle, tsvector as any data type could be obtained by any other ways, for example, OpenFTS construct tsvector following its own rules. > > I was more concerned that there is nothing documenting the configuration > used by the index or the tsvector table column trigger. By doing: again, index has nothing with configuration name. Our trigger function is an example, which uses default configuration name. User could easily write it's own trigger to keep tsvector column up to date and use configuration name as a parameter. > > CREATE INDEX textsearch_idx ON pgweb USING gin(to_tsvector('english',column)); > > you guarantee that the index uses 'english' for all its entries. If you > omit the 'english' or use a different configuration, it will heap scan > the table, which at least gives the right answer. sometimes it's useful not to use explicitly configuration name to be able to use index with different configuration. Just change tsearch_conf_name. > Also, how do you guarantee that tsearch() triggers always uses the same > configuration? The existing tsearch() API seems to make that > impossible. I am wondering if we need to add the configuration name as > a mandatory parameter to tsearch(). Using the same tsearch_conf_name, which could be defined by many ways, you guarantee to use the same configuration. Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
Oleg Bartunov wrote: > >> I agree, that there are could be more examples, but text search doesn't > >> require something special ! > >> *Example* of trigger function is documented on > >> http://momjian.us/expire/fulltext/HTML/textsearch-opfunc.html > > > > Yes, I see that in tsearch() here: > > > > http://momjian.us/expire/fulltext/HTML/textsearch-opfunc.html#TEXTSEARC$ > > > > I assume my_filter_name is optional right? I have updated the prototype > > to be: > > > > tsearch([vector_column_name], [my_filter_name], text_column_name [, ... ]) > > > > Is this accurate? What does this text below it mean? > > no, this in inaccurate. First, vector_column_name is not optional argument, > it's a name of tsvector column name. Fixed. > > There can be many functions and text columns specified in a tsearch() > > trigger. The following rule is used: a function is applied to all > > subsequent TEXT columns until the next matching column occurs. > > The idea, is to provide user to preprocess text before applying > tsearch machinery. my_filter_name() preprocess text_column_name1, > text_column_name2,.... > The original syntax allows to specify for every text columns their > preprocessing functions. > > So, I suggest to keep original syntax, change 'vector_column_name' to > 'tsvector_column_name'. OK, change made. > > Why are we allowing my_filter_name here? Isn't that something for a > > custom trigger. Is calling it tsearch() a good idea? Why not > > tsvector_trigger(). > > I don't see any benefit from the tsvector_trigger() name. If you want to add > some semantic, than tsvector_update_trigger() would be better. Anyway, > this trigger is an illustration. Well, the filter that removes '@' might be an example, but tsearch() is indeed sort of built-in trigger function to be used for simple cases. My point is that because it is only for simple cases, why add complexity and allow a filter? It seems best to just remove the filter idea and let people write their own triggers if they want that functionality. > >>> CREATE INDEX textsearch_id ON pgweb USING gin(to_tsvector(column)); > >>> > >>> That avoids having to have a separate column because you can just say: > >>> > >>> WHERE to_query('XXX') @@ to_tsvector(column) > >> > >> yes, it's possible, but without ranking, since currently it's impossible > >> to store any information in index (it's pg's feature). btw, this should > >> works and for GiST index also. > > > > What if they use @@@. Wouldn't that work because it is going to check > > the heap? > > It would work, it'd recalculate to_tsvector(column) for rows found > ( for GiST - to remove false hits and for weight information, for > GIN - for weight information only). Right. Currently to use text search on a table, you have to do three things: o add a tsvector column to the tableo add a trigger to keep the tsvector column currento add an index to the tsvectorcolumn My question is why bother with the first two steps? If you do: CREATE INDEX textsearch_idx ON pgweb USING gist(to_tsvector('english',column)); you don't need a separate column and a trigger to keep it current. The index is kept current as part of normal query processing. The only downside is that you have to do to_tsvector() in the heap to avoid false hits, but that seems minor compared to the disk savings of not having the separate column. Is to_tsvector() an expensive function? > >>> CREATE INDEX textsearch_idx ON pgweb USING gin(to_tsvector('english',column)); > >>> > >>> so that at least the configuration is documented in the index. > >> > >> yes, it's better to always explicitly specify configuration name and not > >> rely on default configuration. > >> Unfortunately, configuration name doesn't saved in the index. > > as Teodor corrected me, index doesn't know about configuration at all ! > What accurate user could do, is to provide configuration name in the > comment for tsvector column. Configuration name is an accessory of > to_tsvector() function. Well, if you create the index with the configuration name it is guaranteed to match: CREATE INDEX textsearch_idx ON pgweb USING gist(to_tsvector('english',column)); ------- And if someone does: WHERE 'friend'::tsquery @@ to_tsvector('english',column)) the index is used. Now if the default configuration is 'english' and they use: WHERE 'friend'::tsquery @@ to_tsvector(column)) the index is not used, but this just a good example of why default configurations aren't that useful. One problem I see is that if the default configuration is not 'english', then when the index consults the heap, it would be using a different configuration and yield incorrect results. I am unsure how to fix that. With the trigger idea, you have to be sure your configuration is the same every time you INSERT/UPDATE the table or the index will have mixed configuration entries and it will yield incorrect results, aside from the heap configuration lookup not matching the index. Once we nail this down we will have to have a documentation section about configuration mismatches. > In principle, tsvector as any data type could be obtained by any other ways, > for example, OpenFTS construct tsvector following its own rules. > > > > > I was more concerned that there is nothing documenting the configuration > > used by the index or the tsvector table column trigger. By doing: > > again, index has nothing with configuration name. > Our trigger function is an example, which uses default configuration name. > User could easily write it's own trigger to keep tsvector column up to date > and use configuration name as a parameter. Right. I am thinking beyond that issue. > > CREATE INDEX textsearch_idx ON pgweb USING gin(to_tsvector('english',column)); > > > > you guarantee that the index uses 'english' for all its entries. If you > > omit the 'english' or use a different configuration, it will heap scan > > the table, which at least gives the right answer. > > sometimes it's useful not to use explicitly configuration name > to be able to use index with different configuration. Just change > tsearch_conf_name. I assume you are saying the benefit is for tsquery to use a different configuration, not having some tsvector index entries using different configurations than others. > > Also, how do you guarantee that tsearch() triggers always uses the same > > configuration? The existing tsearch() API seems to make that > > impossible. I am wondering if we need to add the configuration name as > > a mandatory parameter to tsearch(). > > Using the same tsearch_conf_name, which could be defined by many ways, > you guarantee to use the same configuration. Yea, I am sure you _can_ do it. The question is how can we make it less error-prone. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Wed, 18 Jul 2007, Bruce Momjian wrote: > >>> Why are we allowing my_filter_name here? Isn't that something for a >>> custom trigger. Is calling it tsearch() a good idea? Why not >>> tsvector_trigger(). >> >> I don't see any benefit from the tsvector_trigger() name. If you want to add >> some semantic, than tsvector_update_trigger() would be better. Anyway, >> this trigger is an illustration. > > Well, the filter that removes '@' might be an example, but tsearch() is > indeed sort of built-in trigger function to be used for simple cases. > My point is that because it is only for simple cases, why add complexity > and allow a filter? It seems best to just remove the filter idea and > let people write their own triggers if they want that functionality. If you aware about documentation simplicity than we could just document two versions: 1. without filter function - simple, well understood syntax 2. with filter function - for advanced users I don't want to remove the feature which works for year without any problem. > >>>>> CREATE INDEX textsearch_id ON pgweb USING gin(to_tsvector(column)); >>>>> >>>>> That avoids having to have a separate column because you can just say: >>>>> >>>>> WHERE to_query('XXX') @@ to_tsvector(column) >>>> >>>> yes, it's possible, but without ranking, since currently it's impossible >>>> to store any information in index (it's pg's feature). btw, this should >>>> works and for GiST index also. >>> >>> What if they use @@@. Wouldn't that work because it is going to check >>> the heap? >> >> It would work, it'd recalculate to_tsvector(column) for rows found >> ( for GiST - to remove false hits and for weight information, for >> GIN - for weight information only). > > Right. Currently to use text search on a table, you have to do three > things: > > o add a tsvector column to the table > o add a trigger to keep the tsvector column current > o add an index to the tsvector column > > My question is why bother with the first two steps? If you do: > > CREATE INDEX textsearch_idx ON pgweb USING gist(to_tsvector('english',column)); > > you don't need a separate column and a trigger to keep it current. The > index is kept current as part of normal query processing. The only > downside is that you have to do to_tsvector() in the heap to avoid false > hits, but that seems minor compared to the disk savings of not having > the separate column. Is to_tsvector() an expensive function? Bruce, you oversimplify the text search, the document could be fully virtual, not a column(s), it could be a result of any SQL commands, so it could be very expensive just to obtain document, and yes, to_tsvector could be very expensive, depending on the document size, parser and dictionaries used. And, again, current postgres architecture forces to use heap to store positional and weight information for ranking. The use case for what you described is very limited - simple text search on one/several column of the same table without ranking. > >>>>> CREATE INDEX textsearch_idx ON pgweb USING gin(to_tsvector('english',column)); >>>>> >>>>> so that at least the configuration is documented in the index. >>>> >>>> yes, it's better to always explicitly specify configuration name and not >>>> rely on default configuration. >>>> Unfortunately, configuration name doesn't saved in the index. >> >> as Teodor corrected me, index doesn't know about configuration at all ! >> What accurate user could do, is to provide configuration name in the >> comment for tsvector column. Configuration name is an accessory of >> to_tsvector() function. > > Well, if you create the index with the configuration name it is > guaranteed to match: > > CREATE INDEX textsearch_idx ON pgweb USING gist(to_tsvector('english',column)); > ------- > And if someone does: > > WHERE 'friend'::tsquery @@ to_tsvector('english',column)) > > the index is used. Now if the default configuration is 'english' and > they use: > > WHERE 'friend'::tsquery @@ to_tsvector(column)) > > the index is not used, but this just a good example of why default > configurations aren't that useful. One problem I see is that if the > default configuration is not 'english', then when the index consults the > heap, it would be using a different configuration and yield incorrect > results. I am unsure how to fix that. again, you consider very simple case and actually, your example is a good example of usefulness of default configuration ! Just think before you develop your application, but this is very general rule. There are zillions situations you could do bad things, after all. Moreover, consider text search on text column, there is no way to specify configuration at all ! We rely on default configuration here CREATE INDEX textsearch_idx ON pgweb USING gin(title); > > With the trigger idea, you have to be sure your configuration is the same > every time you INSERT/UPDATE the table or the index will have mixed > configuration entries and it will yield incorrect results, aside from > the heap configuration lookup not matching the index. > > Once we nail this down we will have to have a documentation section > about configuration mismatches. So what is your proposal ? I'm lost a bit. I suggest to begin new thread :) > >> In principle, tsvector as any data type could be obtained by any other ways, >> for example, OpenFTS construct tsvector following its own rules. >> >>> >>> I was more concerned that there is nothing documenting the configuration >>> used by the index or the tsvector table column trigger. By doing: >> >> again, index has nothing with configuration name. >> Our trigger function is an example, which uses default configuration name. >> User could easily write it's own trigger to keep tsvector column up to date >> and use configuration name as a parameter. > > Right. I am thinking beyond that issue. > >>> CREATE INDEX textsearch_idx ON pgweb USING gin(to_tsvector('english',column)); >>> >>> you guarantee that the index uses 'english' for all its entries. If you >>> omit the 'english' or use a different configuration, it will heap scan >>> the table, which at least gives the right answer. >> >> sometimes it's useful not to use explicitly configuration name >> to be able to use index with different configuration. Just change >> tsearch_conf_name. > > I assume you are saying the benefit is for tsquery to use a different > configuration, not having some tsvector index entries using different > configurations than others. sure, but not necessarily. You can mix different configurations if they are 'compatible'. Documents could be created from different sources and could require different dictionaries. I don't want to limit people in creating complex applications. If you unsure, then always use explicit configuration name. What's the problem ? > >>> Also, how do you guarantee that tsearch() triggers always uses the same >>> configuration? The existing tsearch() API seems to make that >>> impossible. I am wondering if we need to add the configuration name as >>> a mandatory parameter to tsearch(). >> >> Using the same tsearch_conf_name, which could be defined by many ways, >> you guarantee to use the same configuration. > > Yea, I am sure you _can_ do it. The question is how can we make it less > error-prone. We have only ONE variable - configuration name. It could be explicitly defined and then there is no problem at all, or specified by GUC variable tsearch_conf_name. What could be simpler ! The question is do we need to define tsearch_conf_name automagically if it's not defined ? I inclined don't do that and just issue error. I believe this is enough error-prone. Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
Oleg Bartunov wrote: > On Wed, 18 Jul 2007, Bruce Momjian wrote: > > > > >>> Why are we allowing my_filter_name here? Isn't that something for a > >>> custom trigger. Is calling it tsearch() a good idea? Why not > >>> tsvector_trigger(). > >> > >> I don't see any benefit from the tsvector_trigger() name. If you want to add > >> some semantic, than tsvector_update_trigger() would be better. Anyway, > >> this trigger is an illustration. > > > > Well, the filter that removes '@' might be an example, but tsearch() is > > indeed sort of built-in trigger function to be used for simple cases. > > My point is that because it is only for simple cases, why add complexity > > and allow a filter? It seems best to just remove the filter idea and > > let people write their own triggers if they want that functionality. > > If you aware about documentation simplicity than we could just document > two versions: > 1. without filter function - simple, well understood syntax > 2. with filter function - for advanced users > > I don't want to remove the feature which works for year without any problem. Yes, this is what I want. I would like to show the simple usage first, then explain that a more complex usage is possible. This will help people get started using text search. Triggers and secondary columns are fine, but to start using it the CREATE INDEX-only case is best. I don't suggest we remove any capabilities, only suggest simple solutions. > >>>>> CREATE INDEX textsearch_id ON pgweb USING gin(to_tsvector(column)); > >>>>> > >>>>> That avoids having to have a separate column because you can just say: > >>>>> > >>>>> WHERE to_query('XXX') @@ to_tsvector(column) > >>>> > >>>> yes, it's possible, but without ranking, since currently it's impossible > >>>> to store any information in index (it's pg's feature). btw, this should > >>>> works and for GiST index also. > >>> > >>> What if they use @@@. Wouldn't that work because it is going to check > >>> the heap? > >> > >> It would work, it'd recalculate to_tsvector(column) for rows found > >> ( for GiST - to remove false hits and for weight information, for > >> GIN - for weight information only). > > > > Right. Currently to use text search on a table, you have to do three > > things: > > > > o add a tsvector column to the table > > o add a trigger to keep the tsvector column current > > o add an index to the tsvector column > > > > My question is why bother with the first two steps? If you do: > > > > CREATE INDEX textsearch_idx ON pgweb USING gist(to_tsvector('english',column)); > > > > you don't need a separate column and a trigger to keep it current. The > > index is kept current as part of normal query processing. The only > > downside is that you have to do to_tsvector() in the heap to avoid false > > hits, but that seems minor compared to the disk savings of not having > > the separate column. Is to_tsvector() an expensive function? > > Bruce, you oversimplify the text search, the document could be fully virtual, > not a column(s), it could be a result of any SQL commands, so it could be > very expensive just to obtain document, and yes, to_tsvector could be > very expensive, depending on the document size, parser and dictionaries used. > > And, again, current postgres architecture forces to use heap to store > positional and weight information for ranking. > > The use case for what you described is very limited - simple text search > on one/several column of the same table without ranking. Right, but I bet that that is all the majority of users need, at least at first as they start to use text search. > >>>>> CREATE INDEX textsearch_idx ON pgweb USING gin(to_tsvector('english',column)); > >>>>> > >>>>> so that at least the configuration is documented in the index. > >>>> > >>>> yes, it's better to always explicitly specify configuration name and not > >>>> rely on default configuration. > >>>> Unfortunately, configuration name doesn't saved in the index. > >> > >> as Teodor corrected me, index doesn't know about configuration at all ! > >> What accurate user could do, is to provide configuration name in the > >> comment for tsvector column. Configuration name is an accessory of > >> to_tsvector() function. > > > > Well, if you create the index with the configuration name it is > > guaranteed to match: > > > > CREATE INDEX textsearch_idx ON pgweb USING gist(to_tsvector('english',column)); > > ------- > > And if someone does: > > > > WHERE 'friend'::tsquery @@ to_tsvector('english',column)) > > > > the index is used. Now if the default configuration is 'english' and > > they use: > > > > WHERE 'friend'::tsquery @@ to_tsvector(column)) > > > > the index is not used, but this just a good example of why default > > configurations aren't that useful. One problem I see is that if the > > default configuration is not 'english', then when the index consults the > > heap, it would be using a different configuration and yield incorrect > > results. I am unsure how to fix that. > > again, you consider very simple case and actually, your example is a > good example of usefulness of default configuration ! Just think before > you develop your application, but this is very general rule. There are > zillions situations you could do bad things, after all. Right, but if we can reduce possible errors, that is better, and embedding the configuration name in the expression index funciton does that. > Moreover, consider text search on text column, there is no way to specify > configuration at all ! We rely on default configuration here > > CREATE INDEX textsearch_idx ON pgweb USING gin(title); Ah, this should be: CREATE INDEX textsearch_idx ON pgweb USING gin(to_tsvector('english', title)); What happens if you create a GIN index on a text column? Does it convert it to tsvector automatically? > > With the trigger idea, you have to be sure your configuration is the same > > every time you INSERT/UPDATE the table or the index will have mixed > > configuration entries and it will yield incorrect results, aside from > > the heap configuration lookup not matching the index. > > > > Once we nail this down we will have to have a documentation section > > about configuration mismatches. > > So what is your proposal ? I'm lost a bit. I suggest to begin new thread :) Well, the topic is tsearch documentation, so I don't see how a new thread would help us. My idea is to suggest using expression indexes as a simple way to use text search on a table, and mention the trigger idea as a more complex solution. Right now there is no mention of the expression index idea in the documentation. > >> sometimes it's useful not to use explicitly configuration name > >> to be able to use index with different configuration. Just change > >> tsearch_conf_name. > > > > I assume you are saying the benefit is for tsquery to use a different > > configuration, not having some tsvector index entries using different > > configurations than others. > > sure, but not necessarily. You can mix different configurations if they > are 'compatible'. Documents could be created from different sources and > could require different dictionaries. I don't want to limit people > in creating complex applications. If you unsure, then always use > explicit configuration name. What's the problem ? OK, I understand. You can store rows needing different synonym tables in the same column, and if you use the same synonym configuratin when searching, that is OK. Again, that is fine, but I would like to give people a simple way to use text search that doesn't get into that complexity. I realize you understand it well, but I can tell you that most people, including me, don't and you have to be simple at first to get people to understand what is happening. I am learning about the possibilities as I email you so obviously I didn't fully understand at first either, and I have been working on this for a while. Good documentation lays things out simply then adds complexity/flexibility after the fundamentals are understood. > >>> Also, how do you guarantee that tsearch() triggers always uses the same > >>> configuration? The existing tsearch() API seems to make that > >>> impossible. I am wondering if we need to add the configuration name as > >>> a mandatory parameter to tsearch(). > >> > >> Using the same tsearch_conf_name, which could be defined by many ways, > >> you guarantee to use the same configuration. > > > > Yea, I am sure you _can_ do it. The question is how can we make it less > > error-prone. > > We have only ONE variable - configuration name. It could be explicitly > defined and then there is no problem at all, or specified by > GUC variable tsearch_conf_name. What could be simpler ! > > The question is do we need to define tsearch_conf_name automagically if > it's not defined ? I inclined don't do that and just issue error. Agreed. Please throw an error. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Oleg, Teodor, I am confused by the following example. How does gin know to create a tsvector, or does it? Does gist know too? FYI, at some point we need to chat via instant messenger or IRC to discuss the open items. My chat information is here: http://momjian.us/main/contact.html --------------------------------------------------------------------------- SELECT title FROM pgweb WHERE textcat(title,body) @@ plainto_tsquery('create table') ORDER BY dlm DESC LIMIT 10; CREATE INDEX pgweb_idx ON pgweb USING gin(textcat(title,body)); -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Wed, 18 Jul 2007, Bruce Momjian wrote: > Oleg, Teodor, > > I am confused by the following example. How does gin know to create a > tsvector, or does it? Does gist know too? No, gist doesn't know. I don't remember why, Teodor ? For GIN see http://archives.postgresql.org/pgsql-hackers/2007-05/msg00625.php for discussion > > FYI, at some point we need to chat via instant messenger or IRC to > discuss the open items. My chat information is here: > > http://momjian.us/main/contact.html I send you invitation for google talk, I use only chat in gmail. My gmail account is obartunov@gmail.com > > --------------------------------------------------------------------------- > > SELECT title > FROM pgweb > WHERE textcat(title,body) @@ plainto_tsquery('create table') > ORDER BY dlm DESC LIMIT 10; > > CREATE INDEX pgweb_idx ON pgweb USING gin(textcat(title,body)); > > Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
I have added more documentation to try to show how full text search is used by user tables. I think this the documentaiton is almost done: http://momjian.us/expire/fulltext/HTML/textsearch-tables.html --------------------------------------------------------------------------- Oleg Bartunov wrote: > On Wed, 18 Jul 2007, Bruce Momjian wrote: > > > Oleg, Teodor, > > > > I am confused by the following example. How does gin know to create a > > tsvector, or does it? Does gist know too? > > No, gist doesn't know. I don't remember why, Teodor ? > > For GIN see http://archives.postgresql.org/pgsql-hackers/2007-05/msg00625.php > for discussion > > > > > FYI, at some point we need to chat via instant messenger or IRC to > > discuss the open items. My chat information is here: > > > > http://momjian.us/main/contact.html > > I send you invitation for google talk, I use only chat in gmail. > My gmail account is obartunov@gmail.com > > > > > --------------------------------------------------------------------------- > > > > SELECT title > > FROM pgweb > > WHERE textcat(title,body) @@ plainto_tsquery('create table') > > ORDER BY dlm DESC LIMIT 10; > > > > CREATE INDEX pgweb_idx ON pgweb USING gin(textcat(title,body)); > > > > > > Regards, > Oleg > _____________________________________________________________ > Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), > Sternberg Astronomical Institute, Moscow University, Russia > Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ > phone: +007(495)939-16-83, +007(495)939-23-83 -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Bruce, I sent you link to my wiki page with summary of changes http://www.sai.msu.su/~megera/wiki/ts_changes Your documentation looks rather old. Oleg On Tue, 24 Jul 2007, Bruce Momjian wrote: > > I have added more documentation to try to show how full text search is > used by user tables. I think this the documentaiton is almost done: > > http://momjian.us/expire/fulltext/HTML/textsearch-tables.html > > --------------------------------------------------------------------------- > > Oleg Bartunov wrote: >> On Wed, 18 Jul 2007, Bruce Momjian wrote: >> >>> Oleg, Teodor, >>> >>> I am confused by the following example. How does gin know to create a >>> tsvector, or does it? Does gist know too? >> >> No, gist doesn't know. I don't remember why, Teodor ? >> >> For GIN see http://archives.postgresql.org/pgsql-hackers/2007-05/msg00625.php >> for discussion >> >>> >>> FYI, at some point we need to chat via instant messenger or IRC to >>> discuss the open items. My chat information is here: >>> >>> http://momjian.us/main/contact.html >> >> I send you invitation for google talk, I use only chat in gmail. >> My gmail account is obartunov@gmail.com >> >>> >>> --------------------------------------------------------------------------- >>> >>> SELECT title >>> FROM pgweb >>> WHERE textcat(title,body) @@ plainto_tsquery('create table') >>> ORDER BY dlm DESC LIMIT 10; >>> >>> CREATE INDEX pgweb_idx ON pgweb USING gin(textcat(title,body)); >>> >>> >> >> Regards, >> Oleg >> _____________________________________________________________ >> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), >> Sternberg Astronomical Institute, Moscow University, Russia >> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ >> phone: +007(495)939-16-83, +007(495)939-23-83 > > Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
Hi, Le mercredi 25 juillet 2007, Bruce Momjian a écrit : > I have added more documentation to try to show how full text search is > used by user tables. I think this the documentaiton is almost done: > > http://momjian.us/expire/fulltext/HTML/textsearch-tables.html I've come to understand that GIN indexes are far more costly to update than GiST one, and Oleg's wiki advice users to partition data and use GiST index for live part and GIN index for archive part only. Is it worth mentioning this into this part of the documentation? And if mentioned here, partitioning step could certainly be part of the example... or let it as a user exercise, but then explaining why GIN is a good choice in the provided example. Hope this helps, regards, -- dim
Oleg Bartunov wrote: > Bruce, > > I sent you link to my wiki page with summary of changes > http://www.sai.msu.su/~megera/wiki/ts_changes > > Your documentation looks rather old. I have updated it to reflect your changes: http://momjian.us/expire/fulltext/HTML/textsearch-tables.html --------------------------------------------------------------------------- > > Oleg > On Tue, 24 Jul 2007, Bruce Momjian wrote: > > > > > I have added more documentation to try to show how full text search is > > used by user tables. I think this the documentaiton is almost done: > > > > http://momjian.us/expire/fulltext/HTML/textsearch-tables.html > > > > --------------------------------------------------------------------------- > > > > Oleg Bartunov wrote: > >> On Wed, 18 Jul 2007, Bruce Momjian wrote: > >> > >>> Oleg, Teodor, > >>> > >>> I am confused by the following example. How does gin know to create a > >>> tsvector, or does it? Does gist know too? > >> > >> No, gist doesn't know. I don't remember why, Teodor ? > >> > >> For GIN see http://archives.postgresql.org/pgsql-hackers/2007-05/msg00625.php > >> for discussion > >> > >>> > >>> FYI, at some point we need to chat via instant messenger or IRC to > >>> discuss the open items. My chat information is here: > >>> > >>> http://momjian.us/main/contact.html > >> > >> I send you invitation for google talk, I use only chat in gmail. > >> My gmail account is obartunov@gmail.com > >> > >>> > >>> --------------------------------------------------------------------------- > >>> > >>> SELECT title > >>> FROM pgweb > >>> WHERE textcat(title,body) @@ plainto_tsquery('create table') > >>> ORDER BY dlm DESC LIMIT 10; > >>> > >>> CREATE INDEX pgweb_idx ON pgweb USING gin(textcat(title,body)); > >>> > >>> > >> > >> Regards, > >> Oleg > >> _____________________________________________________________ > >> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), > >> Sternberg Astronomical Institute, Moscow University, Russia > >> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ > >> phone: +007(495)939-16-83, +007(495)939-23-83 > > > > > > Regards, > Oleg > _____________________________________________________________ > Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), > Sternberg Astronomical Institute, Moscow University, Russia > Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ > phone: +007(495)939-16-83, +007(495)939-23-83 -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Oleg Bartunov wrote: > >> Second, I can't figure out how to reference a non-default > >> configuration. > > > > See the multi-argument versions of to_tsvector etc. > > > > I do see a problem with having to_tsvector(config, text) plus > > to_tsvector(text) where the latter implicitly references a config > > selected by a GUC variable: how can you tell whether a query using the > > latter matches a particular index using the former? There isn't > > anything in the current planner mechanisms that would make that work. > > Probably, having default text search configuration is not a good idea > and we could just require it as a mandatory parameter, which could > eliminate many confusion with selecting text search configuration. We have to decide if we want a GUC default_text_search_config, and if so when can it be changed. Right now there are three ways to create a tsvector (or tsquery) ::tsvectorto_tsvector(value)to_tsvector(config, value) (ignoring plainto_tsvector) Only the last one specifies the configuration. The others use the configuration specified by default_text_search_config. (We had an previous discussion on what the default value of default_text_search_config should be, and it was decided it should be set via initdb based on a flag or the locale.) Now, because most people use a single configuration, they can just set default_text_search_config and there is no need to specify the configuration name. However, expression indexes cause a problem here: http://momjian.us/expire/fulltext/HTML/textsearch-tables.html#TEXTSEARCH-TABLES-INDEX We recommend that users create an expression index on the column they want to do a full text search on, e.g. CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector(body)); However, the big problem is that the expressions used in expression indexes should not change their output based on the value of a GUC variable (because it would corrupt the index), but in the case above, default_text_search_config controls what configuration is used, and hence the output of to_tsvector is changed if default_text_search_config changes. We have a few possible options: 1) Document the problem and do nothing else.2) Make default_text_search_config a postgresql.conf-only setting, therebymaking it impossible to change by non-super users, or make it a super-user-only setting.3) Remove default_text_search_configand require the configuration to be specified in each function call. If we remove default_text_search_config, it would also make ::tsvector casting useless as well. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
> > 1) Document the problem and do nothing else. > 2) Make default_text_search_config a postgresql.conf-only > setting, thereby making it impossible to change by non-super > users, or make it a super-user-only setting. > 3) Remove default_text_search_config and require the > configuration to be specified in each function call. > Hello, 2+. Regards Pavel Stehule
On Thu, 26 Jul 2007, Bruce Momjian wrote: > Oleg Bartunov wrote: >> Bruce, >> >> I sent you link to my wiki page with summary of changes >> http://www.sai.msu.su/~megera/wiki/ts_changes >> >> Your documentation looks rather old. > > I have updated it to reflect your changes: > > http://momjian.us/expire/fulltext/HTML/textsearch-tables.html Bruce, I noticed you miss many changes. For example, options for stemmer has changed (it's documented in my ts_changes), so in http://momjian.us/expire/fulltext/HTML/textsearch-tables.html#TEXTSEARCH-TABLES-CONFIGURATION ALTER TEXT SEARCH DICTIONARY en_stem SET OPTION 'english-utf8.stop'; should be ALTER TEXT SEARCH DICTIONARY en_stem SET OPTION 'StopFile=english-utf8.stop, Language=english'; Also, this is wrong DROP TEXT SEARCH CONFIGURATION MAPPING ON pg FOR email, url, sfloat, uri, float; it should be ALTER TEXT SEARCH CONFIGURATION pg DROP MAPPING FOR email, url, sfloat, uri, float; Configuration now doesn't have DEFAULT flag, so \dF should not display 'Y' => \dF pg_catalog | russian | Y public | pg | Y This is what I see now postgres=# \dF public.* List of fulltext configurations Schema | Name | Description --------+------+------------- public | pg | > > --------------------------------------------------------------------------- > > >> >> Oleg >> On Tue, 24 Jul 2007, Bruce Momjian wrote: >> >>> >>> I have added more documentation to try to show how full text search is >>> used by user tables. I think this the documentaiton is almost done: >>> >>> http://momjian.us/expire/fulltext/HTML/textsearch-tables.html >>> >>> --------------------------------------------------------------------------- >>> >>> Oleg Bartunov wrote: >>>> On Wed, 18 Jul 2007, Bruce Momjian wrote: >>>> >>>>> Oleg, Teodor, >>>>> >>>>> I am confused by the following example. How does gin know to create a >>>>> tsvector, or does it? Does gist know too? >>>> >>>> No, gist doesn't know. I don't remember why, Teodor ? >>>> >>>> For GIN see http://archives.postgresql.org/pgsql-hackers/2007-05/msg00625.php >>>> for discussion >>>> >>>>> >>>>> FYI, at some point we need to chat via instant messenger or IRC to >>>>> discuss the open items. My chat information is here: >>>>> >>>>> http://momjian.us/main/contact.html >>>> >>>> I send you invitation for google talk, I use only chat in gmail. >>>> My gmail account is obartunov@gmail.com >>>> >>>>> >>>>> --------------------------------------------------------------------------- >>>>> >>>>> SELECT title >>>>> FROM pgweb >>>>> WHERE textcat(title,body) @@ plainto_tsquery('create table') >>>>> ORDER BY dlm DESC LIMIT 10; >>>>> >>>>> CREATE INDEX pgweb_idx ON pgweb USING gin(textcat(title,body)); >>>>> >>>>> >>>> >>>> Regards, >>>> Oleg >>>> _____________________________________________________________ >>>> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), >>>> Sternberg Astronomical Institute, Moscow University, Russia >>>> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ >>>> phone: +007(495)939-16-83, +007(495)939-23-83 >>> >>> >> >> Regards, >> Oleg >> _____________________________________________________________ >> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), >> Sternberg Astronomical Institute, Moscow University, Russia >> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ >> phone: +007(495)939-16-83, +007(495)939-23-83 > > Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
On Fri, 27 Jul 2007, Pavel Stehule wrote: >> >> 1) Document the problem and do nothing else. >> 2) Make default_text_search_config a postgresql.conf-only >> setting, thereby making it impossible to change by non-super >> users, or make it a super-user-only setting. >> 3) Remove default_text_search_config and require the >> configuration to be specified in each function call. >> > > Hello, > > 2+. One of the most important purpose of integrating tsearch2 was to facilitate full-text search for people in hosting environment. Usually, they have no superuser rights. I'm asking don't forget about them ! There is no problem with current behaviour once user understand what he do. Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
2007/7/27, Oleg Bartunov <oleg@sai.msu.su>: > On Fri, 27 Jul 2007, Pavel Stehule wrote: > > >> > >> 1) Document the problem and do nothing else. > >> 2) Make default_text_search_config a postgresql.conf-only > >> setting, thereby making it impossible to change by non-super > >> users, or make it a super-user-only setting. > >> 3) Remove default_text_search_config and require the > >> configuration to be specified in each function call. > >> > > > > Hello, > > > > 2+. > > > One of the most important purpose of integrating tsearch2 was to > facilitate full-text search for people in hosting environment. Usually, > they have no superuser rights. I'm asking don't forget about them ! > > There is no problem with current behaviour once user understand what he do. > > I am not sure if postgresql is well for multilangual hosting environment. There is problem with locales. Without COLLATE support postgresql can't be used in similar environment. :( nice a day Pavel Stehule
On Fri, 27 Jul 2007, Pavel Stehule wrote: > 2007/7/27, Oleg Bartunov <oleg@sai.msu.su>: >> On Fri, 27 Jul 2007, Pavel Stehule wrote: >> >>>> >>>> 1) Document the problem and do nothing else. >>>> 2) Make default_text_search_config a postgresql.conf-only >>>> setting, thereby making it impossible to change by non-super >>>> users, or make it a super-user-only setting. >>>> 3) Remove default_text_search_config and require the >>>> configuration to be specified in each function call. >>>> >>> >>> Hello, >>> >>> 2+. >> >> >> One of the most important purpose of integrating tsearch2 was to >> facilitate full-text search for people in hosting environment. Usually, >> they have no superuser rights. I'm asking don't forget about them ! >> >> There is no problem with current behaviour once user understand what he do. >> >> > > I am not sure if postgresql is well for multilangual hosting > environment. There is problem with locales. Without COLLATE support > postgresql can't be used in similar environment. :( configuration has NOTHING with language ! This is a most frequent myth about configuration. It's just the way we chose for default_text_search_config to use language part of locale at initdb time. text search configuration is just a bind between parser to use for breaking document by lexems and mapping between lexeme type and dictionaries. > > nice a day > Pavel Stehule > Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
> configuration has NOTHING with language ! This is a most frequent myth about > configuration. It's just the way we chose for default_text_search_config to > use language part of locale at initdb time. > text search configuration is just a bind between parser to use for > breaking document by lexems and mapping between lexeme type and dictionaries. > > > I spoke about impossibility well configuration of postgresql without administrator's rights. For my czech environment is administrator's rights necessary too, because czech dictionary aren't in default installation. Regards Pavel Stehule
On Thu, Jul 26, 2007 at 06:23:51PM -0400, Bruce Momjian wrote: > Oleg Bartunov wrote: > > >> Second, I can't figure out how to reference a non-default > > >> configuration. > > > > > > See the multi-argument versions of to_tsvector etc. > > > > > > I do see a problem with having to_tsvector(config, text) plus > > > to_tsvector(text) where the latter implicitly references a config > > > selected by a GUC variable: how can you tell whether a query using the > > > latter matches a particular index using the former? There isn't > > > anything in the current planner mechanisms that would make that work. > > > > Probably, having default text search configuration is not a good idea > > and we could just require it as a mandatory parameter, which could > > eliminate many confusion with selecting text search configuration. > > We have to decide if we want a GUC default_text_search_config, and if so > when can it be changed. > > Right now there are three ways to create a tsvector (or tsquery) > > ::tsvector > to_tsvector(value) > to_tsvector(config, value) > > (ignoring plainto_tsvector) > > Only the last one specifies the configuration. The others use the > configuration specified by default_text_search_config. (We had an > previous discussion on what the default value of > default_text_search_config should be, and it was decided it should be > set via initdb based on a flag or the locale.) > > Now, because most people use a single configuration, they can just set > default_text_search_config and there is no need to specify the > configuration name. > > However, expression indexes cause a problem here: > > http://momjian.us/expire/fulltext/HTML/textsearch-tables.html#TEXTSEARCH-TABLES-INDEX > > We recommend that users create an expression index on the column they > want to do a full text search on, e.g. > > CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector(body)); > > However, the big problem is that the expressions used in expression > indexes should not change their output based on the value of a GUC > variable (because it would corrupt the index), but in the case above, > default_text_search_config controls what configuration is used, and > hence the output of to_tsvector is changed if default_text_search_config > changes. It wuoldn't actually *corrupt* the index, right? You could end up with wrong results, which might be regarded as corruption in one way, but as long as you change the value back the index still works, no? > We have a few possible options: > > 1) Document the problem and do nothing else. > 2) Make default_text_search_config a postgresql.conf-only > setting, thereby making it impossible to change by non-super > users, or make it a super-user-only setting. > 3) Remove default_text_search_config and require the > configuration to be specified in each function call. > > If we remove default_text_search_config, it would also make ::tsvector > casting useless as well. I think 3 is a really bad solution. 2 is a half-bad solution. Do we have a way to say that it can be set at database-level for example, but not at user session? Making it superuser-only to change it but not postgresql.conf-only could accomplish that, along with warnings in the docs for the super user about the effects on current indexes by changing it. //Magnus
Thanks, I found a few more places that needed updating. It should be accurate now. Thanks for the report. --------------------------------------------------------------------------- Oleg Bartunov wrote: > On Thu, 26 Jul 2007, Bruce Momjian wrote: > > > Oleg Bartunov wrote: > >> Bruce, > >> > >> I sent you link to my wiki page with summary of changes > >> http://www.sai.msu.su/~megera/wiki/ts_changes > >> > >> Your documentation looks rather old. > > > > I have updated it to reflect your changes: > > > > http://momjian.us/expire/fulltext/HTML/textsearch-tables.html > > > Bruce, I noticed you miss many changes. For example, > > > options for stemmer has changed (it's documented in my ts_changes), > so in > http://momjian.us/expire/fulltext/HTML/textsearch-tables.html#TEXTSEARCH-TABLES-CONFIGURATION > > ALTER TEXT SEARCH DICTIONARY en_stem SET OPTION 'english-utf8.stop'; > > should be > > > ALTER TEXT SEARCH DICTIONARY en_stem SET OPTION > 'StopFile=english-utf8.stop, Language=english'; > > > Also, this is wrong > > DROP TEXT SEARCH CONFIGURATION MAPPING ON pg FOR email, url, sfloat, uri, float; > > it should be > > ALTER TEXT SEARCH CONFIGURATION pg DROP MAPPING FOR email, url, sfloat, uri, float; > > Configuration now doesn't have DEFAULT flag, so \dF should not display 'Y' > > > => \dF > pg_catalog | russian | Y > public | pg | Y > > > This is what I see now > > postgres=# \dF public.* > List of fulltext configurations > Schema | Name | Description > --------+------+------------- > public | pg | > > > > > > > --------------------------------------------------------------------------- > > > > > >> > >> Oleg > >> On Tue, 24 Jul 2007, Bruce Momjian wrote: > >> > >>> > >>> I have added more documentation to try to show how full text search is > >>> used by user tables. I think this the documentaiton is almost done: > >>> > >>> http://momjian.us/expire/fulltext/HTML/textsearch-tables.html > >>> > >>> --------------------------------------------------------------------------- > >>> > >>> Oleg Bartunov wrote: > >>>> On Wed, 18 Jul 2007, Bruce Momjian wrote: > >>>> > >>>>> Oleg, Teodor, > >>>>> > >>>>> I am confused by the following example. How does gin know to create a > >>>>> tsvector, or does it? Does gist know too? > >>>> > >>>> No, gist doesn't know. I don't remember why, Teodor ? > >>>> > >>>> For GIN see http://archives.postgresql.org/pgsql-hackers/2007-05/msg00625.php > >>>> for discussion > >>>> > >>>>> > >>>>> FYI, at some point we need to chat via instant messenger or IRC to > >>>>> discuss the open items. My chat information is here: > >>>>> > >>>>> http://momjian.us/main/contact.html > >>>> > >>>> I send you invitation for google talk, I use only chat in gmail. > >>>> My gmail account is obartunov@gmail.com > >>>> > >>>>> > >>>>> --------------------------------------------------------------------------- > >>>>> > >>>>> SELECT title > >>>>> FROM pgweb > >>>>> WHERE textcat(title,body) @@ plainto_tsquery('create table') > >>>>> ORDER BY dlm DESC LIMIT 10; > >>>>> > >>>>> CREATE INDEX pgweb_idx ON pgweb USING gin(textcat(title,body)); > >>>>> > >>>>> > >>>> > >>>> Regards, > >>>> Oleg > >>>> _____________________________________________________________ > >>>> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), > >>>> Sternberg Astronomical Institute, Moscow University, Russia > >>>> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ > >>>> phone: +007(495)939-16-83, +007(495)939-23-83 > >>> > >>> > >> > >> Regards, > >> Oleg > >> _____________________________________________________________ > >> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), > >> Sternberg Astronomical Institute, Moscow University, Russia > >> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ > >> phone: +007(495)939-16-83, +007(495)939-23-83 > > > > > > Regards, > Oleg > _____________________________________________________________ > Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), > Sternberg Astronomical Institute, Moscow University, Russia > Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ > phone: +007(495)939-16-83, +007(495)939-23-83 -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Dimitri Fontaine wrote: -- Start of PGP signed section. > Hi, > > Le mercredi 25 juillet 2007, Bruce Momjian a ?crit?: > > I have added more documentation to try to show how full text search is > > used by user tables. I think this the documentaiton is almost done: > > > > http://momjian.us/expire/fulltext/HTML/textsearch-tables.html > > I've come to understand that GIN indexes are far more costly to update than > GiST one, and Oleg's wiki advice users to partition data and use GiST index > for live part and GIN index for archive part only. > > Is it worth mentioning this into this part of the documentation? > And if mentioned here, partitioning step could certainly be part of the > example... or let it as a user exercise, but then explaining why GIN is a > good choice in the provided example. Partitioning is already in the documentation: Partitioning of big collections and the proper use of GiST and GINindexes allows the implementation of very fast searcheswith onlineupdate. Partitioning can be done at the database level using tableinheritance and <varname>constraint_exclusion</>,or distributingdocuments over servers and collecting search results using the<filename>contrib/dblink</>extension module. The latter is possiblebecause ranking functions use only local information. I don't see a reason to provide an example beyond the existing examples of how to do partitioning. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Magnus Hagander wrote: > > However, the big problem is that the expressions used in expression > > indexes should not change their output based on the value of a GUC > > variable (because it would corrupt the index), but in the case above, > > default_text_search_config controls what configuration is used, and > > hence the output of to_tsvector is changed if default_text_search_config > > changes. > > It wuoldn't actually *corrupt* the index, right? You could end up with > wrong results, which might be regarded as corruption in one way, but as > long as you change the value back the index still works, no? Right, it would _temporarily_ corrupt it. ;-) > > We have a few possible options: > > > > 1) Document the problem and do nothing else. > > 2) Make default_text_search_config a postgresql.conf-only > > setting, thereby making it impossible to change by non-super > > users, or make it a super-user-only setting. > > 3) Remove default_text_search_config and require the > > configuration to be specified in each function call. > > > > If we remove default_text_search_config, it would also make ::tsvector > > casting useless as well. > > I think 3 is a really bad solution. > > 2 is a half-bad solution. Do we have a way to say that it can be set at > database-level for example, but not at user session? Making it > superuser-only to change it but not postgresql.conf-only could accomplish > that, along with warnings in the docs for the super user about the effects > on current indexes by changing it. OK, here is what I am thinking. If we make default_text_search_config super-user-only, then the user can't do SET (using "zero_damaged_pages" as a superuser-only example): test=> set zero_damaged_pages = on;ERROR: permission denied to set parameter "zero_damaged_pages"test=> alter user guestset zero_damaged_pages = on;ERROR: permission denied to set parameter "zero_damaged_pages" but the super-user can set it in postgresql.conf, or: test=# alter user guest set zero_damaged_pages = on;ALTER ROLE or test=# alter database vendor3 set zero_damaged_pages = on;ALTER ROLE meaning while it will be super-user-only, the administrator can set the default for specific databases and users. Is that the best approach? A user can still over-ride the default by specifying the configuration in the function call. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian wrote: > Magnus Hagander wrote: >>> However, the big problem is that the expressions used in expression >>> indexes should not change their output based on the value of a GUC >>> variable (because it would corrupt the index), but in the case above, >>> default_text_search_config controls what configuration is used, and >>> hence the output of to_tsvector is changed if default_text_search_config >>> changes. >> It wuoldn't actually *corrupt* the index, right? You could end up with >> wrong results, which might be regarded as corruption in one way, but as >> long as you change the value back the index still works, no? > > Right, it would _temporarily_ corrupt it. ;-) > >>> We have a few possible options: >>> >>> 1) Document the problem and do nothing else. >>> 2) Make default_text_search_config a postgresql.conf-only >>> setting, thereby making it impossible to change by non-super >>> users, or make it a super-user-only setting. >>> 3) Remove default_text_search_config and require the >>> configuration to be specified in each function call. >>> >>> If we remove default_text_search_config, it would also make ::tsvector >>> casting useless as well. >> I think 3 is a really bad solution. >> >> 2 is a half-bad solution. Do we have a way to say that it can be set at >> database-level for example, but not at user session? Making it >> superuser-only to change it but not postgresql.conf-only could accomplish >> that, along with warnings in the docs for the super user about the effects >> on current indexes by changing it. > > OK, here is what I am thinking. If we make default_text_search_config > super-user-only, then the user can't do SET (using "zero_damaged_pages" > as a superuser-only example): > > test=> set zero_damaged_pages = on; > ERROR: permission denied to set parameter "zero_damaged_pages" > > test=> alter user guest set zero_damaged_pages = on; > ERROR: permission denied to set parameter "zero_damaged_pages" > > but the super-user can set it in postgresql.conf, or: > > test=# alter user guest set zero_damaged_pages = on; > ALTER ROLE > > or > > test=# alter database vendor3 set zero_damaged_pages = on; > ALTER ROLE > > meaning while it will be super-user-only, the administrator can set the > default for specific databases and users. Is that the best approach? That's exactly the one I was trying to suggest. And then with clear warnings in the docs around the place that if you set it to different values accessing the same index (for different users, for example) weird things may happen. But I see the ALTER DATABASE a fairly common scenario (I know I would use it), and from what I can tell thereis no risk at all with that. //Magnus
On Fri, 27 Jul 2007, Bruce Momjian wrote: > Magnus Hagander wrote: >>> However, the big problem is that the expressions used in expression >>> indexes should not change their output based on the value of a GUC >>> variable (because it would corrupt the index), but in the case above, >>> default_text_search_config controls what configuration is used, and >>> hence the output of to_tsvector is changed if default_text_search_config >>> changes. >> >> It wuoldn't actually *corrupt* the index, right? You could end up with >> wrong results, which might be regarded as corruption in one way, but as >> long as you change the value back the index still works, no? > > Right, it would _temporarily_ corrupt it. ;-) > >>> We have a few possible options: >>> >>> 1) Document the problem and do nothing else. >>> 2) Make default_text_search_config a postgresql.conf-only >>> setting, thereby making it impossible to change by non-super >>> users, or make it a super-user-only setting. >>> 3) Remove default_text_search_config and require the >>> configuration to be specified in each function call. >>> >>> If we remove default_text_search_config, it would also make ::tsvector >>> casting useless as well. >> >> I think 3 is a really bad solution. >> >> 2 is a half-bad solution. Do we have a way to say that it can be set at >> database-level for example, but not at user session? Making it >> superuser-only to change it but not postgresql.conf-only could accomplish >> that, along with warnings in the docs for the super user about the effects >> on current indexes by changing it. > > OK, here is what I am thinking. If we make default_text_search_config > super-user-only, then the user can't do SET (using "zero_damaged_pages" > as a superuser-only example): > > test=> set zero_damaged_pages = on; > ERROR: permission denied to set parameter "zero_damaged_pages" > > test=> alter user guest set zero_damaged_pages = on; > ERROR: permission denied to set parameter "zero_damaged_pages" > > but the super-user can set it in postgresql.conf, or: > > test=# alter user guest set zero_damaged_pages = on; > ALTER ROLE > > or > > test=# alter database vendor3 set zero_damaged_pages = on; > ALTER ROLE > > meaning while it will be super-user-only, the administrator can set the > default for specific databases and users. Is that the best approach? > > A user can still over-ride the default by specifying the configuration > in the function call. This is ok, but it will not work in hosting environment and still doesn't prevent errors. Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
Oleg Bartunov wrote: > > OK, here is what I am thinking. If we make default_text_search_config > > super-user-only, then the user can't do SET (using "zero_damaged_pages" > > as a superuser-only example): > > > > test=> set zero_damaged_pages = on; > > ERROR: permission denied to set parameter "zero_damaged_pages" > > > > test=> alter user guest set zero_damaged_pages = on; > > ERROR: permission denied to set parameter "zero_damaged_pages" > > > > but the super-user can set it in postgresql.conf, or: > > > > test=# alter user guest set zero_damaged_pages = on; > > ALTER ROLE > > > > or > > > > test=# alter database vendor3 set zero_damaged_pages = on; > > ALTER ROLE > > > > meaning while it will be super-user-only, the administrator can set the > > default for specific databases and users. Is that the best approach? > > > > A user can still over-ride the default by specifying the configuration > > in the function call. > > This is ok, but it will not work in hosting environment and still > doesn't prevent errors. Agreed. super-user-only now seems strange to me because it isn't a security issue, but rather an attempt to avoid people causing errors. The fundamental issue is that if you do a query using tsvector and tsquery everything will work find because default_text_search_config will be the same for both queries. The problem is if do an expression index lookup that doesn't specify the configuration name and your default_text_search_config doesn't match the index, or you INSERT or UPDATE into an expression index with a mismatched default_text_search_config. If we do make default_text_search_config super-user-only it prevents a database owner from doing ALTER DATABASE db1 SET default_text_search_config = 'english', which seems like a pretty big limitation because I think per-database default_text_search_config makes the most sense. And, again, if you specify the configuration in the expression index you have to specify it in the WHERE clause and then default_text_search_config is pretty useless. If we required the configuration to always be specified, you could still store multiple configurations in the same column by having a secondary column hold the configuration name: CREATE INDEX i on x USING gist (to_tsvector(config_col, body)); -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian wrote: > We have to decide if we want a GUC default_text_search_config, and if so > when can it be changed. > > Right now there are three ways to create a tsvector (or tsquery) > > ::tsvector > to_tsvector(value) > to_tsvector(config, value) > > (ignoring plainto_tsvector) > > Only the last one specifies the configuration. The others use the > configuration specified by default_text_search_config. (We had an > previous discussion on what the default value of > default_text_search_config should be, and it was decided it should be > set via initdb based on a flag or the locale.) > > Now, because most people use a single configuration, they can just set > default_text_search_config and there is no need to specify the > configuration name. > > However, expression indexes cause a problem here: > > http://momjian.us/expire/fulltext/HTML/textsearch-tables.html#TEXTSEARCH-TABLES-INDEX > > We recommend that users create an expression index on the column they > want to do a full text search on, e.g. > > CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector(body)); > > However, the big problem is that the expressions used in expression > indexes should not change their output based on the value of a GUC > variable (because it would corrupt the index), but in the case above, > default_text_search_config controls what configuration is used, and > hence the output of to_tsvector is changed if default_text_search_config > changes. > > We have a few possible options: > > 1) Document the problem and do nothing else. > 2) Make default_text_search_config a postgresql.conf-only > setting, thereby making it impossible to change by non-super > users, or make it a super-user-only setting. > 3) Remove default_text_search_config and require the > configuration to be specified in each function call. > > If we remove default_text_search_config, it would also make ::tsvector > casting useless as well. OK, I just found a case that I think is going to make #3 a requirement (remove default_text_search_config). How is a CREATE INDEX ... to_tsvector(col) going to restore from a pg_dump? I see no way of guaranteeing that the default_text_search_config is correct on the restore, and in fact I don't think we have any way of knowing the default_text_search_config used for the index. And if we have to require the configuration name in CREATE INDEX, it has to be used in WHERE, so we might as well just remove the default capability and always require the configuration name. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian wrote: > Bruce Momjian wrote: > > 3) Remove default_text_search_config and require the > > configuration to be specified in each function call. > > > > If we remove default_text_search_config, it would also make ::tsvector > > casting useless as well. > > OK, I just found a case that I think is going to make #3 a requirement > (remove default_text_search_config). > > How is a CREATE INDEX ... to_tsvector(col) going to restore from a > pg_dump? I see no way of guaranteeing that the > default_text_search_config is correct on the restore, and in fact I > don't think we have any way of knowing the default_text_search_config > used for the index. Make pg_dump emit only CREATE INDEX sentences with two-param format. In fact I think it would make sense to convert internally the one-param format to two-param, before hitting the catalogs. This would also solve your problem about usability of WHERE clauses, if you rewrite the one-param calls to two-params before the optimizer kicks in. -- Alvaro Herrera http://www.amazon.com/gp/registry/DXLWNGRJD34J "Nadie esta tan esclavizado como el que se cree libre no siendolo" (Goethe)
Alvaro Herrera wrote: > Bruce Momjian wrote: > > Bruce Momjian wrote: > > > > 3) Remove default_text_search_config and require the > > > configuration to be specified in each function call. > > > > > > If we remove default_text_search_config, it would also make ::tsvector > > > casting useless as well. > > > > OK, I just found a case that I think is going to make #3 a requirement > > (remove default_text_search_config). > > > > How is a CREATE INDEX ... to_tsvector(col) going to restore from a > > pg_dump? I see no way of guaranteeing that the > > default_text_search_config is correct on the restore, and in fact I > > don't think we have any way of knowing the default_text_search_config > > used for the index. > > Make pg_dump emit only CREATE INDEX sentences with two-param format. In > fact I think it would make sense to convert internally the one-param > format to two-param, before hitting the catalogs. > > This would also solve your problem about usability of WHERE clauses, if > you rewrite the one-param calls to two-params before the optimizer kicks > in. Yes, that could be done but as far as I know we weren't planning to have those areas of our backend be aware of configuration parameters to text search functions, and I doubt we want to do that for 8.3, if ever. There are many tsearch functions and doing this would spill tsearch function checks all over the backend, reducing our modularity. The default capability just isn't worth it, and in addition is error-prone. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Mon, 30 Jul 2007, Bruce Momjian wrote: > Oleg Bartunov wrote: >>> OK, here is what I am thinking. If we make default_text_search_config >>> super-user-only, then the user can't do SET (using "zero_damaged_pages" >>> as a superuser-only example): >>> >>> test=> set zero_damaged_pages = on; >>> ERROR: permission denied to set parameter "zero_damaged_pages" >>> >>> test=> alter user guest set zero_damaged_pages = on; >>> ERROR: permission denied to set parameter "zero_damaged_pages" >>> >>> but the super-user can set it in postgresql.conf, or: >>> >>> test=# alter user guest set zero_damaged_pages = on; >>> ALTER ROLE >>> >>> or >>> >>> test=# alter database vendor3 set zero_damaged_pages = on; >>> ALTER ROLE >>> >>> meaning while it will be super-user-only, the administrator can set the >>> default for specific databases and users. Is that the best approach? >>> >>> A user can still over-ride the default by specifying the configuration >>> in the function call. >> >> This is ok, but it will not work in hosting environment and still >> doesn't prevent errors. > > Agreed. super-user-only now seems strange to me because it isn't a > security issue, but rather an attempt to avoid people causing errors. > > The fundamental issue is that if you do a query using tsvector and > tsquery everything will work find because default_text_search_config > will be the same for both queries. The problem is if do an expression > index lookup that doesn't specify the configuration name and your > default_text_search_config doesn't match the index, or you INSERT or > UPDATE into an expression index with a mismatched > default_text_search_config. > > If we do make default_text_search_config super-user-only it prevents a > database owner from doing ALTER DATABASE db1 SET > default_text_search_config = 'english', which seems like a pretty big > limitation because I think per-database default_text_search_config makes > the most sense. > > And, again, if you specify the configuration in the expression index you > have to specify it in the WHERE clause and then > default_text_search_config is pretty useless. agree. Notice, this is very limited usage case. > > If we required the configuration to always be specified, you could still > store multiple configurations in the same column by having a secondary > column hold the configuration name: I don't understand this. Let's don't discuss indexes at all, since indexes doesn't know about configuratons at all > > CREATE INDEX i on x USING gist (to_tsvector(config_col, body)); Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
On Mon, 30 Jul 2007, Bruce Momjian wrote: > Bruce Momjian wrote: >> We have to decide if we want a GUC default_text_search_config, and if so >> when can it be changed. >> >> Right now there are three ways to create a tsvector (or tsquery) >> >> ::tsvector >> to_tsvector(value) >> to_tsvector(config, value) >> >> (ignoring plainto_tsvector) >> >> Only the last one specifies the configuration. The others use the >> configuration specified by default_text_search_config. (We had an >> previous discussion on what the default value of >> default_text_search_config should be, and it was decided it should be >> set via initdb based on a flag or the locale.) >> >> Now, because most people use a single configuration, they can just set >> default_text_search_config and there is no need to specify the >> configuration name. >> >> However, expression indexes cause a problem here: >> >> http://momjian.us/expire/fulltext/HTML/textsearch-tables.html#TEXTSEARCH-TABLES-INDEX >> >> We recommend that users create an expression index on the column they >> want to do a full text search on, e.g. >> >> CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector(body)); >> >> However, the big problem is that the expressions used in expression >> indexes should not change their output based on the value of a GUC >> variable (because it would corrupt the index), but in the case above, >> default_text_search_config controls what configuration is used, and >> hence the output of to_tsvector is changed if default_text_search_config >> changes. >> >> We have a few possible options: >> >> 1) Document the problem and do nothing else. >> 2) Make default_text_search_config a postgresql.conf-only >> setting, thereby making it impossible to change by non-super >> users, or make it a super-user-only setting. >> 3) Remove default_text_search_config and require the >> configuration to be specified in each function call. >> >> If we remove default_text_search_config, it would also make ::tsvector >> casting useless as well. > > OK, I just found a case that I think is going to make #3 a requirement > (remove default_text_search_config). > > How is a CREATE INDEX ... to_tsvector(col) going to restore from a > pg_dump? I see no way of guaranteeing that the > default_text_search_config is correct on the restore, and in fact I > don't think we have any way of knowing the default_text_search_config > used for the index. > > And if we have to require the configuration name in CREATE INDEX, it has > to be used in WHERE, so we might as well just remove the default > capability and always require the configuration name. this is very rare use case for text searching 1. expression index without configuration name 2. default_text_search_config can be changed by somebody If somebody really need it, then he should be adviced to use configuration name, else we don't guarantee that somebody could change default_text_search_config variable and this could lead to incorrect dump/restore. I don't think we should remove default_text_search_config because of this rare case. Regards, Oleg PS. Bruce, I'm in the mountains the Northern Caucasia and internet is a bit unreliable :( _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
Oleg Bartunov wrote: > >> If we remove default_text_search_config, it would also make ::tsvector > >> casting useless as well. > > > > OK, I just found a case that I think is going to make #3 a requirement > > (remove default_text_search_config). > > > > How is a CREATE INDEX ... to_tsvector(col) going to restore from a > > pg_dump? I see no way of guaranteeing that the > > default_text_search_config is correct on the restore, and in fact I > > don't think we have any way of knowing the default_text_search_config > > used for the index. > > > > And if we have to require the configuration name in CREATE INDEX, it has > > to be used in WHERE, so we might as well just remove the default > > capability and always require the configuration name. > > this is very rare use case for text searching > 1. expression index without configuration name > 2. default_text_search_config can be changed by somebody If you are going to be using the configuration name with the create expression index, you have to use it in the WHERE clause (or the index doesn't work), and I assume that is 90% of the text search uses. I don't see it as rare at all. > If somebody really need it, then he should be adviced to use configuration > name, else we don't guarantee that somebody could change > default_text_search_config variable and this could lead to > incorrect dump/restore. > > I don't think we should remove default_text_search_config because of > this rare case. I still feel the default_text_search_config has to be removed. We have tried all sorts of ways to make it work but having it not be 100% reliable for pg_dump/restore means it might as well be in /contrib and unsupported. If we have it in core, it has to work 100%. We can't have tons of examples that don't specify the configuration name and then expect every create expression index and WHERE clause to use it. default_text_search_config _can_ work, but it seems so easy to break and so easy to get wrong that I think it must be removed. If we are going to keep it, I need someone to explain why my comments above are wrong. If I am right, someone has to remove default_text_search_config from the patch. I can do the documentation. > Bruce, I'm in the mountains the Northern Caucasia and internet is > a bit unreliable :( Thanks. I noticed a lag in your reply. Hope you are having a good time. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Tue, 31 Jul 2007, Bruce Momjian wrote: >>> And if we have to require the configuration name in CREATE INDEX, it has >>> to be used in WHERE, so we might as well just remove the default >>> capability and always require the configuration name. >> >> this is very rare use case for text searching >> 1. expression index without configuration name >> 2. default_text_search_config can be changed by somebody > > If you are going to be using the configuration name with the create > expression index, you have to use it in the WHERE clause (or the index > doesn't work), and I assume that is 90% of the text search uses. I > don't see it as rare at all. What is a basis of your assumption ? In my opinion, it's very limited use of text search, because it doesn't supports ranking. For 4-5 years of tsearch2 usage I never used it and I never seem in mailing lists. This is very user-oriented feature and we could probably ask -general people for their opinion. >> If somebody really need it, then he should be adviced to use configuration >> name, else we don't guarantee that somebody could change >> default_text_search_config variable and this could lead to >> incorrect dump/restore. >> >> I don't think we should remove default_text_search_config because of >> this rare case. > > I still feel the default_text_search_config has to be removed. We have > tried all sorts of ways to make it work but having it not be 100% > reliable for pg_dump/restore means it might as well be in /contrib and > unsupported. If we have it in core, it has to work 100%. We can't have > tons of examples that don't specify the configuration name and then > expect every create expression index and WHERE clause to use it. > default_text_search_config _can_ work, but it seems so easy to break and > so easy to get wrong that I think it must be removed. I'd better say we don't support text searching using expression index than remove default_text_search_config. Anyway, I don't feel myself responisble for such important problem. We need more feedback from users. > > If we are going to keep it, I need someone to explain why my comments > above are wrong. If I am right, someone has to remove > default_text_search_config from the patch. I can do the documentation. I'm in conference and then will be busy writing my applications and earning money, Teodor is in vacation. I don't want to do hasty conclusion, since we're very tired to change our patch from one solution to another. We need consensus of developers and users. I'm almost exhausted and have no time to continue this discussion. Would you be so kind to write separate post about this problem and call -hackers and -general for feedback. Let's experienced users show their needs. We said everything and has nothing to add. Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
Oleg Bartunov wrote: > On Tue, 31 Jul 2007, Bruce Momjian wrote: > > >>> And if we have to require the configuration name in CREATE INDEX, it has > >>> to be used in WHERE, so we might as well just remove the default > >>> capability and always require the configuration name. > >> > >> this is very rare use case for text searching > >> 1. expression index without configuration name > >> 2. default_text_search_config can be changed by somebody > > > > If you are going to be using the configuration name with the create > > expression index, you have to use it in the WHERE clause (or the index > > doesn't work), and I assume that is 90% of the text search uses. I > > don't see it as rare at all. > > What is a basis of your assumption ? In my opinion, it's very limited > use of text search, because it doesn't supports ranking. For 4-5 years > of tsearch2 usage I never used it and I never seem in mailing lists. > This is very user-oriented feature and we could probably ask > -general people for their opinion. I doubt 'general' is going to understand the details of merging this into the backend. I assume we have enough people on hackers to decide this. Are you saying the majority of users have a separate column with a trigger? Does the trigger specify the configuation? I don't see that as a parameter argument to tsvector_update_trigger(). If you reload a pg_dump, what does it use for the configuration? Why is a separate column better than the index? Just ranking? The reason the expression index is nice is this feature has to be easy to use for people who are new to full text and even PostgreSQL. Right now /contrib is fine for experts to use, but we want a larger user base for this feature. > >> If somebody really need it, then he should be adviced to use configuration > >> name, else we don't guarantee that somebody could change > >> default_text_search_config variable and this could lead to > >> incorrect dump/restore. > >> > >> I don't think we should remove default_text_search_config because of > >> this rare case. > > > > I still feel the default_text_search_config has to be removed. We have > > tried all sorts of ways to make it work but having it not be 100% > > reliable for pg_dump/restore means it might as well be in /contrib and > > unsupported. If we have it in core, it has to work 100%. We can't have > > tons of examples that don't specify the configuration name and then > > expect every create expression index and WHERE clause to use it. > > default_text_search_config _can_ work, but it seems so easy to break and > > so easy to get wrong that I think it must be removed. > > I'd better say we don't support text searching using expression index > than remove default_text_search_config. Anyway, I don't feel myself > responisble for such important problem. We need more feedback from > users. Well, I am waiting for other hackers to get involved, but if they don't, I have to evaluate it myself on the email lists. > > If we are going to keep it, I need someone to explain why my comments > > above are wrong. If I am right, someone has to remove > > default_text_search_config from the patch. I can do the documentation. > > I'm in conference and then will be busy writing my applications and > earning money, Teodor is in vacation. I don't want to do > hasty conclusion, since we're very tired to change our patch from > one solution to another. We need consensus of developers and users. > I'm almost exhausted and have no time to continue this discussion. > > Would you be so kind to write separate post about this problem and > call -hackers and -general for feedback. Let's experienced users > show their needs. We said everything and has nothing to add. If you have no time to continue discussion and perhaps update the patch, we can consider this patch dead for 8.3 and we can hold it for 8.4 because I can guarantee you this is going to need more discussion and patch modification before it gets into CVS. This patch is being treated fairly and exactly the same as every other patch. Should we hold the patch for 8.4? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Tue, 31 Jul 2007, Bruce Momjian wrote: > Oleg Bartunov wrote: >> On Tue, 31 Jul 2007, Bruce Momjian wrote: >> >>>>> And if we have to require the configuration name in CREATE INDEX, it has >>>>> to be used in WHERE, so we might as well just remove the default >>>>> capability and always require the configuration name. >>>> >>>> this is very rare use case for text searching >>>> 1. expression index without configuration name >>>> 2. default_text_search_config can be changed by somebody >>> >>> If you are going to be using the configuration name with the create >>> expression index, you have to use it in the WHERE clause (or the index >>> doesn't work), and I assume that is 90% of the text search uses. I >>> don't see it as rare at all. >> >> What is a basis of your assumption ? In my opinion, it's very limited >> use of text search, because it doesn't supports ranking. For 4-5 years >> of tsearch2 usage I never used it and I never seem in mailing lists. >> This is very user-oriented feature and we could probably ask >> -general people for their opinion. > > I doubt 'general' is going to understand the details of merging this > into the backend. I assume we have enough people on hackers to decide > this. I mean not technical details, but use case. Does they need expressional index without ranking but sacrifice ability to use default configuration in other cases too ? My prediction is that people doesn't ever thought about this possibility until we said them about. > > Are you saying the majority of users have a separate column with a > trigger? Does the trigger specify the configuation? I don't see that > as a parameter argument to tsvector_update_trigger(). If you reload a > pg_dump, what does it use for the configuration? > yes, separate column with custom trigger works fine. It's up to you how to keep your data actual and it's up to you how to write trigger. Our tsvector_update_trigger() is a tsvector_update_trigger_example() ! > Why is a separate column better than the index? Just ranking? ranking + composite documents. I already mentioned, that this could be rather expensive. Also, having separate column allow people various ways to say what is a document and even change it. > > The reason the expression index is nice is this feature has to be easy > to use for people who are new to full text and even PostgreSQL. Right > now /contrib is fine for experts to use, but we want a larger user base > for this feature. I agree here. This was one of the main reason of our work for 8.3. Probably, we shold think in another direction - not to curtail tsearch2 and confuse rather big existing users, but to add an ability to save somehow configuration used for creating of *document* either implicitly (in expression index, or just gin(text_column)), or explicitly (separate column). There is no problem with index itself ! > >> >> I'd better say we don't support text searching using expression index >> than remove default_text_search_config. Anyway, I don't feel myself >> responisble for such important problem. We need more feedback from >> users. > > Well, I am waiting for other hackers to get involved, but if they don't, > I have to evaluate it myself on the email lists. > >>> If we are going to keep it, I need someone to explain why my comments >>> above are wrong. If I am right, someone has to remove >>> default_text_search_config from the patch. I can do the documentation. >> >> I'm in conference and then will be busy writing my applications and >> earning money, Teodor is in vacation. I don't want to do >> hasty conclusion, since we're very tired to change our patch from >> one solution to another. We need consensus of developers and users. >> I'm almost exhausted and have no time to continue this discussion. >> >> Would you be so kind to write separate post about this problem and >> call -hackers and -general for feedback. Let's experienced users >> show their needs. We said everything and has nothing to add. > > If you have no time to continue discussion and perhaps update the patch, > we can consider this patch dead for 8.3 and we can hold it for 8.4 > because I can guarantee you this is going to need more discussion and > patch modification before it gets into CVS. > > This patch is being treated fairly and exactly the same as every other > patch. why do you say this ? I didn't complain about this. > > Should we hold the patch for 8.4? If we're not agree to say in docs, that implicit usage of text search configuration in CREATE INDEX command doesn't supported. Could we leave default_text_search_config for super-users, at least ? Anyway, let's wait what other people say. Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
Bruce Momjian wrote: > Oleg Bartunov wrote: >> What is a basis of your assumption ? In my opinion, it's very limited >> use of text search, because it doesn't supports ranking. For 4-5 years >> of tsearch2 usage I never used it and I never seem in mailing lists. >> This is very user-oriented feature and we could probably ask >> -general people for their opinion. I think I asked about this kind of usage a couple years back; and Oleg pointed out other reasons why it wasn't as good an idea too. http://archives.postgresql.org/pgsql-general/2005-10/msg00475.php http://archives.postgresql.org/pgsql-general/2005-10/msg00477.php The particular question I had asked why the functional index was slower than maintaining the extra column; with the explanation that the lossy index having to call the function (including parsing, dictionary lookup, etc) for re-checking the data made it inadvisable to avoid the extra column anyway. > I doubt 'general' is going to understand the details of merging this > into the backend. I assume we have enough people on hackers to decide > this. > > Are you saying the majority of users have a separate column with a > trigger? I think so. At least when I was using it in 2005 the second column with the trigger was faster than using a functional index. >> We need more feedback from users. > > Well, I am waiting for other hackers to get involved, but if they don't, > I have to evaluate it myself on the email lists. Personally, I think documentation changes would be an OK way to to handle it. Something that makes it extremely clear to the user the advantages of having the extra column and the risks of avoiding them.
Oleg Bartunov wrote: > On Tue, 31 Jul 2007, Bruce Momjian wrote: > > > Oleg Bartunov wrote: > >> On Tue, 31 Jul 2007, Bruce Momjian wrote: > >> > >>>>> And if we have to require the configuration name in CREATE INDEX, it has > >>>>> to be used in WHERE, so we might as well just remove the default > >>>>> capability and always require the configuration name. > >>>> > >>>> this is very rare use case for text searching > >>>> 1. expression index without configuration name > >>>> 2. default_text_search_config can be changed by somebody > >>> > >>> If you are going to be using the configuration name with the create > >>> expression index, you have to use it in the WHERE clause (or the index > >>> doesn't work), and I assume that is 90% of the text search uses. I > >>> don't see it as rare at all. > >> > >> What is a basis of your assumption ? In my opinion, it's very limited > >> use of text search, because it doesn't supports ranking. For 4-5 years > >> of tsearch2 usage I never used it and I never seem in mailing lists. > >> This is very user-oriented feature and we could probably ask > >> -general people for their opinion. > > > > I doubt 'general' is going to understand the details of merging this > > into the backend. I assume we have enough people on hackers to decide > > this. > > I mean not technical details, but use case. Does they need expressional > index without ranking but sacrifice ability to use default configuration > in other cases too ? My prediction is that people doesn't ever thought about > this possibility until we said them about. In a choice between expression indexes and default_text_search_config, there is no question in my mind that expression indexes are more useful. Lack of default_text_search_config only means you have to specify the configuration name every time, and can't do casting to a text search data type. > > Are you saying the majority of users have a separate column with a > > trigger? Does the trigger specify the configuation? I don't see that > > as a parameter argument to tsvector_update_trigger(). If you reload a > > pg_dump, what does it use for the configuration? > > > > yes, separate column with custom trigger works fine. It's up to you how > to keep your data actual and it's up to you how to write trigger. > Our tsvector_update_trigger() is a tsvector_update_trigger_example() ! Well, that is the major problem --- that this is very error-prone, especially considering that the tsvector_update_trigger() doesn't get it right either. > > Why is a separate column better than the index? Just ranking? > > ranking + composite documents. I already mentioned, that this could be > rather expensive. Also, having separate column allow people various > ways to say what is a document and even change it. OK, I am confused why an expression index can't use those features if a separate column can. I realize the index can't store that information, but why can the code pick it out of a heap column but not run the function on the heap row to get that information. I assume it is something that is just hard to implement. > > The reason the expression index is nice is this feature has to be easy > > to use for people who are new to full text and even PostgreSQL. Right > > now /contrib is fine for experts to use, but we want a larger user base > > for this feature. > > I agree here. This was one of the main reason of our work for 8.3. > Probably, we shold think in another direction - not to curtail tsearch2 > and confuse rather big existing users, but to add an ability to save somehow > configuration used for creating of *document* > either implicitly (in expression index, or just gin(text_column)), or > explicitly (separate column). There is no problem with index itself ! Agreed. We need to find a way to save the configuration when the output of a text search function is stored, either in an expression index or via a trigger into a separate column, but only if we allow the default configuration to be changed by non-super-users. > > > > Should we hold the patch for 8.4? > > If we're not agree to say in docs, that implicit usage of text search > configuration in CREATE INDEX command doesn't supported. Could we leave > default_text_search_config for super-users, at least ? > > Anyway, let's wait what other people say. The big problem is that not many people have taken the time to fully understand how full text search works. I hoped that putting the updated documentation online would help: http://momjian.us/expire/fulltext/HTML/textsearch.html but it seems it hasn't. What we could do it if we make default_text_search_config super-user-only and tell users at the start that if default_text_search_config doesn't match the language they want to use, then they have to read a documentation section that explains the problem of configuration mismatches. The problem with that is that we should be setting default_text_search_config in the pg_dump output, like we do for client_encoding, but because it is a super-user-only, it will fail for non-super-user restores. So, I am back to thinking default_text_search_config isn't going to work reliably for novice users. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Ron Mayer wrote: > Bruce Momjian wrote: > > Oleg Bartunov wrote: > >> What is a basis of your assumption ? In my opinion, it's very limited > >> use of text search, because it doesn't supports ranking. For 4-5 years > >> of tsearch2 usage I never used it and I never seem in mailing lists. > >> This is very user-oriented feature and we could probably ask > >> -general people for their opinion. > > I think I asked about this kind of usage a couple years back; > and Oleg pointed out other reasons why it wasn't as good an > idea too. > > http://archives.postgresql.org/pgsql-general/2005-10/msg00475.php > http://archives.postgresql.org/pgsql-general/2005-10/msg00477.php > > The particular question I had asked why the functional index was > slower than maintaining the extra column; with the explanation > that the lossy index having to call the function (including > parsing, dictionary lookup, etc) for re-checking the data made > it inadvisable to avoid the extra column anyway. > > > I doubt 'general' is going to understand the details of merging this > > into the backend. I assume we have enough people on hackers to decide > > this. > > > > Are you saying the majority of users have a separate column with a > > trigger? > > I think so. At least when I was using it in 2005 the second > column with the trigger was faster than using a functional index. OK, it is good you measured it. I wonder how GIN would behave because it is not lossy. > >> We need more feedback from users. > > > > Well, I am waiting for other hackers to get involved, but if they don't, > > I have to evaluate it myself on the email lists. > > Personally, I think documentation changes would be an OK way to > to handle it. Something that makes it extremely clear to the > user the advantages of having the extra column and the risks > of avoiding them. Sure, but you have make sure you use the right configuration in the trigger, no? Does the tsquery have to use the same configuration? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian wrote: > Ron Mayer wrote: >> Bruce Momjian wrote: >>> Oleg Bartunov wrote: >>>> What is a basis of your assumption ? >> I think I asked about this kind of usage a couple years back;... >> >> http://archives.postgresql.org/pgsql-general/2005-10/msg00475.php >> http://archives.postgresql.org/pgsql-general/2005-10/msg00477.php >> >> ...why the functional index was >> slower than maintaining the extra column; with the explanation >> that the lossy index having to call the function (including >> parsing, dictionary lookup, etc) for re-checking the data ... >> ... >>> >>> Are you saying the majority of users have a separate column with a >>> trigger? >> I think so. At least when I was using it in 2005 the second >> column with the trigger was faster than using a functional index. > > OK, it is good you measured it. I wonder how GIN would behave because > it is not lossy. Too bad I don't have the same database around anymore. It seems the re-parsing for re-checking for the lossy index was very expensive, tho. In the end, I suspect it depends greatly on what fraction of rows match. >>>> We need more feedback from users. >>> Well, I am waiting for other hackers to get involved, but if they don't, >>> I have to evaluate it myself on the email lists. >> Personally, I think documentation changes would be an OK way to >> to handle it. Something that makes it extremely clear to the >> user the advantages of having the extra column and the risks >> of avoiding them. > > Sure, but you have make sure you use the right configuration in the > trigger, no? Does the tsquery have to use the same configuration? I wish I knew this myself. :-) Whatever I had done happened to work but that was largely through people on IRC walking me through it.
Ron Mayer wrote: > >>>> We need more feedback from users. > >>> Well, I am waiting for other hackers to get involved, but if they don't, > >>> I have to evaluate it myself on the email lists. > >> Personally, I think documentation changes would be an OK way to > >> to handle it. Something that makes it extremely clear to the > >> user the advantages of having the extra column and the risks > >> of avoiding them. > > > > Sure, but you have make sure you use the right configuration in the > > trigger, no? Does the tsquery have to use the same configuration? > > I wish I knew this myself. :-) Whatever I had done happened to work > but that was largely through people on IRC walking me through it. This illustrates the major issue --- that this has to be simple for people to get started, while keeping the capabilities for experienced users. I am now thinking that making users always specify the configuration name and not allowing :: casting is going to be the best approach. We can always add more in 8.4 after it is in wide use. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian wrote: > Ron Mayer wrote: >> I wish I knew this myself. :-) Whatever I had done happened to work >> but that was largely through people on IRC walking me through it. > > This illustrates the major issue --- that this has to be simple for > people to get started, while keeping the capabilities for experienced > users. > > I am now thinking that making users always specify the configuration > name and not allowing :: casting is going to be the best approach. We > can always add more in 8.4 after it is in wide use. That's fair. Either the docs need to make it totally obvious or the software should force people to do something safe.
Bruce Momjian wrote: > Ron Mayer wrote: >>>>>> We need more feedback from users. >>>>> Well, I am waiting for other hackers to get involved, but if they don't, >>>>> I have to evaluate it myself on the email lists. >>>> Personally, I think documentation changes would be an OK way to >>>> to handle it. Something that makes it extremely clear to the >>>> user the advantages of having the extra column and the risks >>>> of avoiding them. >>> Sure, but you have make sure you use the right configuration in the >>> trigger, no? Does the tsquery have to use the same configuration? >> I wish I knew this myself. :-) Whatever I had done happened to work >> but that was largely through people on IRC walking me through it. > > This illustrates the major issue --- that this has to be simple for > people to get started, while keeping the capabilities for experienced > users. > > I am now thinking that making users always specify the configuration > name and not allowing :: casting is going to be the best approach. We > can always add more in 8.4 after it is in wide use. I just read the docs and I'm trying to get a grip of the problem here. If I understood correctly, the basic issue is that a tsvector datum created using configuration A is incompatible with a tsquery datum created using configuration B, in the sense that you won't get reasonable results if you use the tsquery to search the tsvector, or do ranking or highlighting. If the configurations happen to be similar enough, it can work, but not in general. That underlying issue manifests itself in many ways, including: - if you create table with a field of type tsvector, typically kept up-to-date by triggers, and do a search on it using a different configuration, you get incorrect results. - using an expression index instead of a tsvector-field, and always explicitly specifying the configuration, you can avoid that problem (a query with a different configuration won't use the index). But an expression index, without explicitly specifying the configuration, will get corrupted if you change the default configuration. Removing the default configuration setting altogether removes the 2nd problem, but that's not good from a usability point of view. And it doesn't solve the general issue, you can still do things like: SELECT * FROM foo WHERE to_tsvector('confA', textcol) @@ to_tsquery('confB', 'query'); ISTM we should have a separate tsvector and tsquery data type for each configuration, and throw an error if you try to mix and match them in a query. to_tsquery and to_tsvector would be new kind of polymorphic functions that work with the types. Or we could automatically create a copy of them when you create a new configuration. We could have a default configuration setting and rewrite queries that don't explicitly specify a configuration to use the default. You could still get into trouble if you alter the configuration after starting to use it. We could solve that by not allowing you to ALTER CONFIGURATION, at least not if it's used in tables or indexes. Forcing people to create a new configuration, and to recreate all indexes and tsvector columns every time you add a word to a stop-list, for example, seems too onerous, though. Not sure what to do about that. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote: > >>> Sure, but you have make sure you use the right configuration in the > >>> trigger, no? Does the tsquery have to use the same configuration? > >> I wish I knew this myself. :-) Whatever I had done happened to work > >> but that was largely through people on IRC walking me through it. > > > > This illustrates the major issue --- that this has to be simple for > > people to get started, while keeping the capabilities for experienced > > users. > > > > I am now thinking that making users always specify the configuration > > name and not allowing :: casting is going to be the best approach. We > > can always add more in 8.4 after it is in wide use. > > I just read the docs and I'm trying to get a grip of the problem here. > > If I understood correctly, the basic issue is that a tsvector datum > created using configuration A is incompatible with a tsquery datum > created using configuration B, in the sense that you won't get > reasonable results if you use the tsquery to search the tsvector, or do > ranking or highlighting. If the configurations happen to be similar > enough, it can work, but not in general. Right. > That underlying issue manifests itself in many ways, including: > - if you create table with a field of type tsvector, typically kept > up-to-date by triggers, and do a search on it using a different > configuration, you get incorrect results. Right. > - using an expression index instead of a tsvector-field, and always > explicitly specifying the configuration, you can avoid that problem (a > query with a different configuration won't use the index). But an > expression index, without explicitly specifying the configuration, will > get corrupted if you change the default configuration. Right. > Removing the default configuration setting altogether removes the 2nd > problem, but that's not good from a usability point of view. And it > doesn't solve the general issue, you can still do things like: > SELECT * FROM foo WHERE to_tsvector('confA', textcol) @@ > to_tsquery('confB', 'query'); True, but in that case you are specifically naming different configurations, so it is hopefully obvious you have a mismatch. > ISTM we should have a separate tsvector and tsquery data type for each > configuration, and throw an error if you try to mix and match them in a > query. to_tsquery and to_tsvector would be new kind of polymorphic > functions that work with the types. Or we could automatically create a > copy of them when you create a new configuration. We could have a > default configuration setting and rewrite queries that don't explicitly > specify a configuration to use the default. That is going to make multiple configurations quite complex in the backend, and I think for little value. > You could still get into trouble if you alter the configuration after > starting to use it. We could solve that by not allowing you to ALTER > CONFIGURATION, at least not if it's used in tables or indexes. Forcing > people to create a new configuration, and to recreate all indexes and > tsvector columns every time you add a word to a stop-list, for example, > seems too onerous, though. Not sure what to do about that. Yea, seems more work than is necessary. If we require the configuration to be always supplied, and document that mismatches are a problem, I think we are in good shape. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Wed, 8 Aug 2007, Bruce Momjian wrote: > Heikki Linnakangas wrote: >>>>> Sure, but you have make sure you use the right configuration in the >>>>> trigger, no? Does the tsquery have to use the same configuration? >>>> I wish I knew this myself. :-) Whatever I had done happened to work >>>> but that was largely through people on IRC walking me through it. >>> >>> This illustrates the major issue --- that this has to be simple for >>> people to get started, while keeping the capabilities for experienced >>> users. >>> >>> I am now thinking that making users always specify the configuration >>> name and not allowing :: casting is going to be the best approach. We >>> can always add more in 8.4 after it is in wide use. >> >> I just read the docs and I'm trying to get a grip of the problem here. >> >> If I understood correctly, the basic issue is that a tsvector datum >> created using configuration A is incompatible with a tsquery datum >> created using configuration B, in the sense that you won't get >> reasonable results if you use the tsquery to search the tsvector, or do >> ranking or highlighting. If the configurations happen to be similar >> enough, it can work, but not in general. > > Right. not fair. There are many cases when one can intentionally use different configurations. But I agree, this is not for beginners. > >> That underlying issue manifests itself in many ways, including: >> - if you create table with a field of type tsvector, typically kept >> up-to-date by triggers, and do a search on it using a different >> configuration, you get incorrect results. > > Right. again, you might want to use different configuration. > >> - using an expression index instead of a tsvector-field, and always >> explicitly specifying the configuration, you can avoid that problem (a >> query with a different configuration won't use the index). But an >> expression index, without explicitly specifying the configuration, will >> get corrupted if you change the default configuration. > > Right. the same problem if you drop constrain from table (accidently) and then gets surprised by select results. > >> Removing the default configuration setting altogether removes the 2nd >> problem, but that's not good from a usability point of view. And it >> doesn't solve the general issue, you can still do things like: >> SELECT * FROM foo WHERE to_tsvector('confA', textcol) @@ >> to_tsquery('confB', 'query'); > > True, but in that case you are specifically naming different > configurations, so it is hopefully obvious you have a mismatch. > >> ISTM we should have a separate tsvector and tsquery data type for each >> configuration, and throw an error if you try to mix and match them in a >> query. to_tsquery and to_tsvector would be new kind of polymorphic >> functions that work with the types. Or we could automatically create a >> copy of them when you create a new configuration. We could have a >> default configuration setting and rewrite queries that don't explicitly >> specify a configuration to use the default. > > That is going to make multiple configurations quite complex in the > backend, and I think for little value. > >> You could still get into trouble if you alter the configuration after >> starting to use it. We could solve that by not allowing you to ALTER >> CONFIGURATION, at least not if it's used in tables or indexes. Forcing >> people to create a new configuration, and to recreate all indexes and >> tsvector columns every time you add a word to a stop-list, for example, >> seems too onerous, though. Not sure what to do about that. > > Yea, seems more work than is necessary. If we require the configuration > to be always supplied, and document that mismatches are a problem, I > think we are in good shape. We should agree that all you describe is only for DUMMY users. From authors point of view I dislike your approach to treat text searching as a very limited tool. But I understand that we should preserve people from stupid errors. I want for beginners easy setup and error-prone functionality, but leaving experienced users to develop complex search engines. Can we have separate safe interface for text searching and explicitly recommend it for beginners ? Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
Oleg Bartunov wrote: > > Yea, seems more work than is necessary. If we require the configuration > > to be always supplied, and document that mismatches are a problem, I > > think we are in good shape. > > We should agree that all you describe is only for DUMMY users. > >From authors point of view I dislike your approach to treat text searching > as a very limited tool. But I understand that we should preserve people from > stupid errors. > > I want for beginners easy setup and error-prone functionality, > but leaving experienced users to develop complex search engines. > Can we have separate safe interface for text searching and explicitly > recommend it for beginners ? I am glad we are moving this interface discussion forward. It seems Heikki has similar concerns about the interface being error-prone. It would be nice to have a novice and advanced interface, but we would have to document both, and then that is going to be confusing for users. As I see it, specifying the configuration name in every function call is the novice interface, and avoids the most common errors. I can see defaulting the interface name as being an advanced interface, but I don't think it has enough of a feature to be worth documenting and implementing. If we figure out something better in 8.4 we can implement it, but at this point I can't think of any good solution to not specifying the configuration name every time. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Thu, Aug 09, 2007 at 02:36:41AM -0400, Bruce Momjian wrote: > Oleg Bartunov wrote: > > > Yea, seems more work than is necessary. If we require the configuration > > > to be always supplied, and document that mismatches are a problem, I > > > think we are in good shape. > > > > We should agree that all you describe is only for DUMMY users. > > >From authors point of view I dislike your approach to treat text searching > > as a very limited tool [...] [...] > I am glad we are moving this interface discussion forward. It seems > Heikki has similar concerns about the interface being error-prone. > > It would be nice to have a novice and advanced interface, but we would > have to document both, and then that is going to be confusing for users. > > As I see it, specifying the configuration name in every function call is > the novice interface, and avoids the most common errors. I can see > defaulting the interface name as being an advanced interface, but I > don't think it has enough of a feature to be worth documenting and > implementing. > > If we figure out something better in 8.4 we can implement it, but at > this point I can't think of any good solution to not specifying the > configuration name every time. Maybe I'm missing something, but it seems to me that the configuration is more attached to a column/index thatn to the whole database. If there's a default in an expression, I'd rather expect this default to be drawn from the index involved than from a global value (like a functional index does now). Regards - -- tomás -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.1 (GNU/Linux) iD8DBQFGuuAoBcgs9XrR2kYRAqiiAJsFL+Iu/b/xYaLza5ozmi839Qh5awCeOp+f SZHKDPUHZ3u99XzLBn2ZKjw= =twEt -----END PGP SIGNATURE-----
On Thu, 9 Aug 2007, tomas@tuxteam.de wrote: > Maybe I'm missing something, but it seems to me that the configuration > is more attached to a column/index thatn to the whole database. If > there's a default in an expression, I'd rather expect this default to be > drawn from the index involved than from a global value (like a functional > index does now). I'm tired to repeat - index itself doesn't know about configuration ! It just index tsvector data type. tsvector in turn can be obtained using various ways: 1. manually 2. to_tsvector tsvector can be stored in a separate attribute or be fully virtual like in expressional index. Moreover, tsvector can be obtained using various configurations depending on your application. Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Thu, Aug 09, 2007 at 02:03:13PM +0400, Oleg Bartunov wrote: > On Thu, 9 Aug 2007, tomas@tuxteam.de wrote: > > >Maybe I'm missing something [...] > I'm tired to repeat - index itself doesn't know about configuration ! > It just index tsvector data type. tsvector in turn can be obtained > using various ways: > 1. manually > 2. to_tsvector > > tsvector can be stored in a separate attribute or be fully virtual > like in expressional index. > > Moreover, tsvector can be obtained using various configurations depending > on your application. Yep. I think I got that. Now what will be the "normal" case for a new user? * build an index with a virtual tsvector (using a configuration). Then I would expect the "index" to know the "whole" functionto calculate its entries -- that would include the config used. This would be more "compatible" with the approach stated elsewhere to always mention explicitly the config. * manually. Would a novice do that? Or is that advanced stuff? Regards -- and sorry for my stupid questions :) - -- tomás -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.1 (GNU/Linux) iD8DBQFGu1ciBcgs9XrR2kYRAnVqAJ9mSc463I44JxLLDPpUZ/JirUVH5ACeOhUr 2/4aXs0ukMnvP8YCq8pamwQ= =IgfC -----END PGP SIGNATURE-----
Oleg Bartunov wrote: > On Wed, 8 Aug 2007, Bruce Momjian wrote: >> Heikki Linnakangas wrote: >>> If I understood correctly, the basic issue is that a tsvector datum >>> created using configuration A is incompatible with a tsquery datum >>> created using configuration B, in the sense that you won't get >>> reasonable results if you use the tsquery to search the tsvector, or do >>> ranking or highlighting. If the configurations happen to be similar >>> enough, it can work, but not in general. >> >> Right. > > not fair. There are many cases when one can intentionally use different > configurations. But I agree, this is not for beginners. Can you give an example of that? I certainly can see the need to use different configurations in one database, but what's the use case for comparing a tsvector created with configuration A against a tsquery created with configuration B? >>> - using an expression index instead of a tsvector-field, and always >>> explicitly specifying the configuration, you can avoid that problem (a >>> query with a different configuration won't use the index). But an >>> expression index, without explicitly specifying the configuration, will >>> get corrupted if you change the default configuration. >> >> Right. > > the same problem if you drop constrain from table (accidently) and then > gets surprised by select results. The difference is that if you change the default configuration, you won't expect that your queries start to return funny results. It looks harmless, like changing the date style. If you drop a constraint, it's much more obvious what the consequences are. > We should agree that all you describe is only for DUMMY users. From > authors point of view I dislike your approach to treat text searching as > a very limited tool. But I understand that we should preserve people > from stupid errors. > > I want for beginners easy setup and error-prone functionality, > but leaving experienced users to develop complex search engines. > Can we have separate safe interface for text searching and explicitly > recommend it for beginners ? I don't see how any of the suggestions limits what you can do with it. If we remove the default configuration parameter, you just have to be explicit. If we go with the type-system I suggested, you could still add casts and conversion functions between different tsvector types, where it make sense. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Bruce Momjian wrote: > Heikki Linnakangas wrote: >> Removing the default configuration setting altogether removes the 2nd >> problem, but that's not good from a usability point of view. And it >> doesn't solve the general issue, you can still do things like: >> SELECT * FROM foo WHERE to_tsvector('confA', textcol) @@ >> to_tsquery('confB', 'query'); > > True, but in that case you are specifically naming different > configurations, so it is hopefully obvious you have a mismatch. There's many more subtle ways to do that. For example, filling a tsvector column using a DEFAULT clause. But then you sometimes fill it in the application instead, with a different configuration. Or if one of the function calls is buried in another user defined function. I don't think explicitly naming the configuration gives enough protection. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote: > Oleg Bartunov wrote: > > On Wed, 8 Aug 2007, Bruce Momjian wrote: > >> Heikki Linnakangas wrote: > >>> If I understood correctly, the basic issue is that a tsvector datum > >>> created using configuration A is incompatible with a tsquery datum > >>> created using configuration B, in the sense that you won't get > >>> reasonable results if you use the tsquery to search the tsvector, or do > >>> ranking or highlighting. If the configurations happen to be similar > >>> enough, it can work, but not in general. > >> > >> Right. > > > > not fair. There are many cases when one can intentionally use different > > configurations. But I agree, this is not for beginners. > > Can you give an example of that? > > I certainly can see the need to use different configurations in one > database, but what's the use case for comparing a tsvector created with > configuration A against a tsquery created with configuration B? I assume you could have a configuration with different stop words or synonymns and compare them. > >>> - using an expression index instead of a tsvector-field, and always > >>> explicitly specifying the configuration, you can avoid that problem (a > >>> query with a different configuration won't use the index). But an > >>> expression index, without explicitly specifying the configuration, will > >>> get corrupted if you change the default configuration. > >> > >> Right. > > > > the same problem if you drop constrain from table (accidently) and then > > gets surprised by select results. > > The difference is that if you change the default configuration, you > won't expect that your queries start to return funny results. It looks > harmless, like changing the date style. If you drop a constraint, it's > much more obvious what the consequences are. > > > We should agree that all you describe is only for DUMMY users. From > > authors point of view I dislike your approach to treat text searching as > > a very limited tool. But I understand that we should preserve people > > from stupid errors. > > > > I want for beginners easy setup and error-prone functionality, > > but leaving experienced users to develop complex search engines. > > Can we have separate safe interface for text searching and explicitly > > recommend it for beginners ? > > I don't see how any of the suggestions limits what you can do with it. > If we remove the default configuration parameter, you just have to be > explicit. If we go with the type-system I suggested, you could still add > casts and conversion functions between different tsvector types, where > it make sense. I don't think the type system is workable given the ability to create new configurations on the fly. I think the configuration must be specified each time. At this point, if we keep discussing the tsearch2 API we are not going to have this in 8.3. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Heikki Linnakangas wrote: > Bruce Momjian wrote: > > Heikki Linnakangas wrote: > >> Removing the default configuration setting altogether removes the 2nd > >> problem, but that's not good from a usability point of view. And it > >> doesn't solve the general issue, you can still do things like: > >> SELECT * FROM foo WHERE to_tsvector('confA', textcol) @@ > >> to_tsquery('confB', 'query'); > > > > True, but in that case you are specifically naming different > > configurations, so it is hopefully obvious you have a mismatch. > > There's many more subtle ways to do that. For example, filling a > tsvector column using a DEFAULT clause. But then you sometimes fill it > in the application instead, with a different configuration. Or if one of > the function calls is buried in another user defined function. > > I don't think explicitly naming the configuration gives enough protection. Oh, wow, OK, well in that case the text search API isn't ready and we will have to hold this for 8.4. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Oleg Bartunov wrote: > On Thu, 9 Aug 2007, tomas@tuxteam.de wrote: > >> Maybe I'm missing something, but it seems to me that the configuration >> is more attached to a column/index thatn to the whole database. If >> there's a default in an expression, I'd rather expect this default to be >> drawn from the index involved than from a global value (like a functional >> index does now). > > I'm tired to repeat - index itself doesn't know about configuration ! Is there a way to change that? For example store the configuration in a metapage or something? -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
"Alvaro Herrera" <alvherre@commandprompt.com> writes: > Oleg Bartunov wrote: >> >> I'm tired to repeat - index itself doesn't know about configuration ! > > Is there a way to change that? For example store the configuration in a > metapage or something? I think Heikki's suggestion of having each configuration create a new type would effectively do the same thing. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
On 8/13/07, Bruce Momjian <bruce@momjian.us> wrote: > Heikki Linnakangas wrote: > > Bruce Momjian wrote: > > > Heikki Linnakangas wrote: > > >> Removing the default configuration setting altogether removes the 2nd > > >> problem, but that's not good from a usability point of view. And it > > >> doesn't solve the general issue, you can still do things like: > > >> SELECT * FROM foo WHERE to_tsvector('confA', textcol) @@ > > >> to_tsquery('confB', 'query'); > > > > > > True, but in that case you are specifically naming different > > > configurations, so it is hopefully obvious you have a mismatch. > > > > There's many more subtle ways to do that. For example, filling a > > tsvector column using a DEFAULT clause. But then you sometimes fill it > > in the application instead, with a different configuration. Or if one of > > the function calls is buried in another user defined function. > > > > I don't think explicitly naming the configuration gives enough protection. > > Oh, wow, OK, well in that case the text search API isn't ready and we > will have to hold this for 8.4. > I've been watching this thread with a mixture of dread and hope, waiting to see where the developers' inclination will end up; whether leaving a useful foot gun available will be allowed. This is just my $0.02 as a fairly heavy user of the current tsearch2 code, but I sincerely hope you do not cripple the system by removing the ability to store tsvectors built using arbitrary configurations in a single column. Yes, it can lead to unexpected results if you do not know what you are doing, but if you have gone beyond building a single tsearch2 configuration then you are required to know what you are doing. What's more, IMO the default configuration mechanism feels very much like a CONSTRAINT, as Oleg suggests. That point is one of cognizance, where if one has gone to the trouble of setting up multiple configurations and has learned enough to do so correctly, then one necessarily understands the importance of the setting and can use it (or not, and use explicit configurations) correctly. The default config lowers the bar to an acceptable level for beginners that have no need of multiple configurations, and while I don't feel too strongly, personally, about having a default, I think it is both useful and helpful for new users -- it was for me. Now, so this email isn't entirely complaining, and as a data point for the discussion, I'll explain why I do not want to see tsearch2 crippled in the way suggested by Heikki and Bruce. My application (http://open-ils.org, which run >80% of the public libraries in Georgia, USA, http://gapines.org and http://georgialibraries.org/lib/pines.html) requires that I be able to search a corpus of bibliographic records in a mix of languages, and potentially with mixed stop-word rules, with one query. I cannot know ahead of time what languages will be used in the corpus and I cannot restrict any one query to one language. To accomplish this, the record itself will be inspected inside an INSERT/UPDATE trigger to determine the language and type, and use the correct configuration for creating the tsvector. This will obviously result in a "mixed" tsvector column, but that's exactly what I need. I can filter on record language if the user happens to specify a query language (and thus configuration), or simply rank the assumed (IP based, perhaps, or browser preference based) preferred language higher, or one of a hundred other things. But I won't be able to do any of that if tsvectors are required to have one and only one configuration per column. Anyway, I felt I needed to provide some outside perspective to this, as a user, since it seems that the external viewpoint (my particular viewpoint, at least) was missing from the discussion. Thanks, folks, for all the work on this so far! --miker
Mike Rylander wrote: > This is just my $0.02 as a fairly heavy user of the current tsearch2 > code, but I sincerely hope you do not cripple the system by removing > the ability to store tsvectors built using arbitrary configurations in > a single column. Yes, it can lead to unexpected results if you do not > know what you are doing, but if you have gone beyond building a single > tsearch2 configuration then you are required to know what you are > doing. What's more, IMO the default configuration mechanism feels > very much like a CONSTRAINT, as Oleg suggests. That point is one of > cognizance, where if one has gone to the trouble of setting up > multiple configurations and has learned enough to do so correctly, > then one necessarily understands the importance of the setting and can > use it (or not, and use explicit configurations) correctly. The > default config lowers the bar to an acceptable level for beginners > that have no need of multiple configurations, and while I don't feel > too strongly, personally, about having a default, I think it is both > useful and helpful for new users -- it was for me. Thanks for chiming in. As a disclaimer: I've never used tsearch2 in a real application myself. > My application (http://open-ils.org, which run >80% of the public > libraries in Georgia, USA, http://gapines.org and > http://georgialibraries.org/lib/pines.html) requires that I be able to > search a corpus of bibliographic records in a mix of languages, and > potentially with mixed stop-word rules, with one query. I cannot know > ahead of time what languages will be used in the corpus and I cannot > restrict any one query to one language. To accomplish this, the > record itself will be inspected inside an INSERT/UPDATE trigger to > determine the language and type, and use the correct configuration for > creating the tsvector. This will obviously result in a "mixed" > tsvector column, but that's exactly what I need. I can filter on > record language if the user happens to specify a query language (and > thus configuration), or simply rank the assumed (IP based, perhaps, or > browser preference based) preferred language higher, or one of a > hundred other things. But I won't be able to do any of that if > tsvectors are required to have one and only one configuration per > column. Don't you need to use the right configuration to parse the query into a tsquery as well? What you have is basically the classic problem problem of representing inheritance in a relational model. You have a superclass, bibliographic record, and multiple subclasses, one per language with one extra field, the corpus in the right language. You've solved it by storing them all in one table, with an extra column (language) to denote which subclass the record is. Alternatively, you could solve it by using inherited tables, or having one table per language with a foreign key referencing the master table containing the other fields, or having a single table with one nullable field per configuration, and a check constraint to check that exactly one of them is not null. As a thought experiment, let me present another, not text search related example that's isomorphic to your problem: Imagine an international online auction system, where you can put items for sale and specify a minimum price. Part of the database schema is: CREATE TABLE item ( id INTEGER, description TEXT, currency CHAR(3), price NUMERIC ); It clearly doesn't make sense to directly compare prices in different currencies against each other. A query like WHERE price > 1000 doesn't make sense, unless you also restrict the currency, or use an exchange rate to convert between currencies. Also, different currencies might have slightly different rules on how many decimal places are significant. In this example, as well as your bibliographic scenario, we can conveniently store prices in all currencies in the same field because they're all numerics. If we wanted to enforce per-currency rules, like NUMERIC(10, 2) for USD and NUMERIC(10,0) for Italian lires (which doesn't really exists anymore, I know), we'd need to store them in separate columns. And before the decimalisation in 1971, the British pound was divided into 20 shillings, each of which was divided to 12 pence, so it wouldn't have fit to normal numeric field, and we would again have to store that in a separate column. What I'm trying to point out is that the problem isn't unique to text search. It's an old problem, with many alternative solutions, even with strong typing. Arguably if you store data in multiple languages in same field, you have a denormalized schema. Granted, loose typing can be more convenient, but you give up the benefits of strong typing as well. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Bruce Momjian wrote: > Heikki Linnakangas wrote: >> Bruce Momjian wrote: >>> Heikki Linnakangas wrote: >>>> Removing the default configuration setting altogether removes the 2nd >>>> problem, but that's not good from a usability point of view. And it >>>> doesn't solve the general issue, you can still do things like: >>>> SELECT * FROM foo WHERE to_tsvector('confA', textcol) @@ >>>> to_tsquery('confB', 'query'); >>> True, but in that case you are specifically naming different >>> configurations, so it is hopefully obvious you have a mismatch. >> There's many more subtle ways to do that. For example, filling a >> tsvector column using a DEFAULT clause. But then you sometimes fill it >> in the application instead, with a different configuration. Or if one of >> the function calls is buried in another user defined function. >> >> I don't think explicitly naming the configuration gives enough protection. > > Oh, wow, OK, well in that case the text search API isn't ready and we > will have to hold this for 8.4. That would be unfortunate :(. Sorry I haven't looked at this earlier. Do you think that implementing a strongly typed system is too much work for 8.3? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, 14 Aug 2007, Alvaro Herrera wrote: > Oleg Bartunov wrote: >> On Thu, 9 Aug 2007, tomas@tuxteam.de wrote: >> >>> Maybe I'm missing something, but it seems to me that the configuration >>> is more attached to a column/index thatn to the whole database. If >>> there's a default in an expression, I'd rather expect this default to be >>> drawn from the index involved than from a global value (like a functional >>> index does now). >> >> I'm tired to repeat - index itself doesn't know about configuration ! > > Is there a way to change that? For example store the configuration in a > metapage or something? it's useless, in general, since you could use different configuration to build tsvector. Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
On 8/14/07, Heikki Linnakangas <heikki@enterprisedb.com> wrote: > Mike Rylander wrote: [snip] > > Don't you need to use the right configuration to parse the query into a > tsquery as well? > Only if the user (or user agent) can supply enough information to move away from the configured default of, say, en-US. And even then, other tsvector configurations need to be searched. Configurations are not, of course, just about language/stemming, but also stop-words and thesauri. If it were just languages it wouldn't be too big of a deal (other than in-app DDL management...). > What you have is basically the classic problem problem of representing > inheritance in a relational model. You have a superclass, bibliographic > record, and multiple subclasses, one per language with one extra field, > the corpus in the right language. You've solved it by storing them all > in one table, with an extra column (language) to denote which subclass > the record is. Alternatively, you could solve it by using inherited > tables, or having one table per language with a foreign key referencing > the master table containing the other fields, or having a single table > with one nullable field per configuration, and a check constraint to > check that exactly one of them is not null. > Sorry, communication problem here ... I provided an oversimplified example meant more to show the issues than to find alternate solutions, though I appreciate you taking the time to consider the problem. If I only had to consider one delimiting facet per record then it would be much simpler. But with the complication that some fields have stop-word lists (and some not), some use thesauri (and different ones, at that), and these fields (as extracted from the records) and their configurations (stem? use a stop-word list? use thesaurus x, y or z?) are user-defined ... > As a thought experiment, let me present another, not text search related > example that's isomorphic to your problem: > Unfortunately, the problem you describe is not quite the same as the problem I need to solve. Fortunately, the current incarnation of tsearch2 does a pretty good job of solving the problem if you store everything in one place and use the query-time environment to apply some adjustments to the ranking of items. I could still work around this problem by creating inherited tables, one for each configuration on each index-providing table but I /really/ hope to avoid that. Creating new configurations for local requirements doesn't require creating new tables (and the associated management overhead in the app) today, something I'd really like to avoid. In fact, I'm starting to sweat just thinking about what the planner would go through with the number tables needed for the potential configurations in an installation that makes use of multiple thesauri and a mix of stop-word lists across, say, 30 languages. Such a dataset is not uncommon. In any case, thanks again for taking the time to think about the problem. I still think having the ability to store any old tsvector I happen to have hanging around in any column of the correct type is a GoodThing(tm). I see from Oleg's message down-thread that that's the way things will be (the tsvector type doesn't know about columns, just lexem "tuples"). --miker
Heikki Linnakangas wrote: > Bruce Momjian wrote: > > Heikki Linnakangas wrote: > >> Bruce Momjian wrote: > >>> Heikki Linnakangas wrote: > >>>> Removing the default configuration setting altogether removes the 2nd > >>>> problem, but that's not good from a usability point of view. And it > >>>> doesn't solve the general issue, you can still do things like: > >>>> SELECT * FROM foo WHERE to_tsvector('confA', textcol) @@ > >>>> to_tsquery('confB', 'query'); > >>> True, but in that case you are specifically naming different > >>> configurations, so it is hopefully obvious you have a mismatch. > >> There's many more subtle ways to do that. For example, filling a > >> tsvector column using a DEFAULT clause. But then you sometimes fill it > >> in the application instead, with a different configuration. Or if one of > >> the function calls is buried in another user defined function. > >> > >> I don't think explicitly naming the configuration gives enough protection. > > > > Oh, wow, OK, well in that case the text search API isn't ready and we > > will have to hold this for 8.4. > > That would be unfortunate :(. Sorry I haven't looked at this earlier. Do > you think that implementing a strongly typed system is too much work for > 8.3? Yea, probably. See my other posting with an updated subject line. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Mike Rylander wrote: > This is just my $0.02 as a fairly heavy user of the current tsearch2 > code, but I sincerely hope you do not cripple the system by removing > the ability to store tsvectors built using arbitrary configurations in > a single column. Yes, it can lead to unexpected results if you do not > know what you are doing, but if you have gone beyond building a single > tsearch2 configuration then you are required to know what you are > doing. What's more, IMO the default configuration mechanism feels > very much like a CONSTRAINT, as Oleg suggests. That point is one of > cognizance, where if one has gone to the trouble of setting up > multiple configurations and has learned enough to do so correctly, > then one necessarily understands the importance of the setting and can > use it (or not, and use explicit configurations) correctly. The > default config lowers the bar to an acceptable level for beginners > that have no need of multiple configurations, and while I don't feel > too strongly, personally, about having a default, I think it is both > useful and helpful for new users -- it was for me. What has really hurt the default GUC idea is how to do restores from a pg_dump. How do you make sure the right default is used on a restore, particularly if multiple objects are being restored, and each has a different default GUC. I suppose your trigger handles that but that isn't going to help with an expression index, nor in cases where the default of the old database is different from the new one. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian escribió: > Mike Rylander wrote: > > This is just my $0.02 as a fairly heavy user of the current tsearch2 > > code, but I sincerely hope you do not cripple the system by removing > > the ability to store tsvectors built using arbitrary configurations in > > a single column. Yes, it can lead to unexpected results if you do not > > know what you are doing, but if you have gone beyond building a single > > tsearch2 configuration then you are required to know what you are > > doing. What's more, IMO the default configuration mechanism feels > > very much like a CONSTRAINT, as Oleg suggests. That point is one of > > cognizance, where if one has gone to the trouble of setting up > > multiple configurations and has learned enough to do so correctly, > > then one necessarily understands the importance of the setting and can > > use it (or not, and use explicit configurations) correctly. The > > default config lowers the bar to an acceptable level for beginners > > that have no need of multiple configurations, and while I don't feel > > too strongly, personally, about having a default, I think it is both > > useful and helpful for new users -- it was for me. > > What has really hurt the default GUC idea is how to do restores from a > pg_dump. How do you make sure the right default is used on a restore, > particularly if multiple objects are being restored, and each has a > different default GUC. I suppose your trigger handles that but that > isn't going to help with an expression index, nor in cases where the > default of the old database is different from the new one. I guess what should happen is that pg_dump should include a SET default_text_search_config = 'foo' just before the CREATE INDEX, like we do for other variables. Of course, in order for this to work, the index itself must know what value was used on creation. Oleg already dismissed my suggestion of putting it into the index itself (a metapage or something). Maybe store it in reloptions? -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Oleg Bartunov wrote: > On Tue, 14 Aug 2007, Alvaro Herrera wrote: > >> Oleg Bartunov wrote: >>> On Thu, 9 Aug 2007, tomas@tuxteam.de wrote: >>> >>>> Maybe I'm missing something, but it seems to me that the configuration >>>> is more attached to a column/index thatn to the whole database. If >>>> there's a default in an expression, I'd rather expect this default to be >>>> drawn from the index involved than from a global value (like a >>>> functional >>>> index does now). >>> >>> I'm tired to repeat - index itself doesn't know about configuration ! >> >> Is there a way to change that? For example store the configuration in a >> metapage or something? > > it's useless, in general, since you could use different configuration to > build tsvector. Hmm, sorry, I think I just understood what this was about: so you mean that the configuration is really *per row* and not per index? So I can store rows into an index using more than one configuration, and it will work? -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
On 8/14/07, Alvaro Herrera <alvherre@commandprompt.com> wrote: > Oleg Bartunov wrote: > > On Tue, 14 Aug 2007, Alvaro Herrera wrote: > > > >> Oleg Bartunov wrote: > >>> On Thu, 9 Aug 2007, tomas@tuxteam.de wrote: > >>> > >>>> Maybe I'm missing something, but it seems to me that the configuration > >>>> is more attached to a column/index thatn to the whole database. If > >>>> there's a default in an expression, I'd rather expect this default to be > >>>> drawn from the index involved than from a global value (like a > >>>> functional > >>>> index does now). > >>> > >>> I'm tired to repeat - index itself doesn't know about configuration ! > >> > >> Is there a way to change that? For example store the configuration in a > >> metapage or something? > > > > it's useless, in general, since you could use different configuration to > > build tsvector. > > Hmm, sorry, I think I just understood what this was about: so you mean > that the configuration is really *per row* and not per index? So I can > store rows into an index using more than one configuration, and it will > work? Can and does, to great success. :) --miker
Alvaro Herrera <alvherre@commandprompt.com> writes: > Bruce Momjian escribi�: >> What has really hurt the default GUC idea is how to do restores from a >> pg_dump. > I guess what should happen is that pg_dump should include a > SET default_text_search_config = 'foo' > just before the CREATE INDEX, like we do for other variables. The nasty cases are in data-only dumps, that is, where you're trying to load data into a table with pre-existing indexes or triggers. A SET like the above is at least as likely to be wrong as right, if the index or trigger depends on it to tell it what to do. regards, tom lane
"Mike Rylander" <mrylander@gmail.com> writes: > My application (http://open-ils.org, which run >80% of the public > libraries in Georgia, USA, http://gapines.org and > http://georgialibraries.org/lib/pines.html) requires that I be able to > search a corpus of bibliographic records in a mix of languages, and > potentially with mixed stop-word rules, with one query. I cannot know > ahead of time what languages will be used in the corpus and I cannot > restrict any one query to one language. To accomplish this, the > record itself will be inspected inside an INSERT/UPDATE trigger to > determine the language and type, and use the correct configuration for > creating the tsvector. This will obviously result in a "mixed" > tsvector column, but that's exactly what I need. I can filter on > record language if the user happens to specify a query language (and > thus configuration), or simply rank the assumed (IP based, perhaps, or > browser preference based) preferred language higher, or one of a > hundred other things. But I won't be able to do any of that if > tsvectors are required to have one and only one configuration per > column. > > Anyway, I felt I needed to provide some outside perspective to this, > as a user, since it seems that the external viewpoint (my particular > viewpoint, at least) was missing from the discussion. This is *extremely* useful. I think it's precisely what we've been missing so far. At least, what I've been missing. So the question is what exactly happens in this case? If I search for "the" does that mean it will ignore matches in English where that's a stop-word but find me books on tea in French? Is that what I should expect to happen? What if I search for "earl and the"? Does that find me French books on Early Grey Tea but English books on all earls? What happens if I use the same operator directly on the text column? Or perhaps it's not even possible to specify stop-words when operating on a text column? Should it be? -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Tom Lane wrote: > Alvaro Herrera <alvherre@commandprompt.com> writes: > > Bruce Momjian escribi�: > >> What has really hurt the default GUC idea is how to do restores from a > >> pg_dump. > > > I guess what should happen is that pg_dump should include a > > SET default_text_search_config = 'foo' > > just before the CREATE INDEX, like we do for other variables. > > The nasty cases are in data-only dumps, that is, where you're trying to > load data into a table with pre-existing indexes or triggers. A SET > like the above is at least as likely to be wrong as right, if the index > or trigger depends on it to tell it what to do. Ouch. I had not even thought that far. FYI, yes, the default tsearch GUC controls operations per row _if_ you have triggers or expression indexes that rely on the default configuration GUC. If you have specified the configuration, there is no problem, and hence my conclusion that the default GUC is too error-prone. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On 8/14/07, Gregory Stark <stark@enterprisedb.com> wrote: > "Mike Rylander" <mrylander@gmail.com> writes: > > > My application (http://open-ils.org, which run >80% of the public > > libraries in Georgia, USA, http://gapines.org and > > http://georgialibraries.org/lib/pines.html) requires that I be able to > > search a corpus of bibliographic records in a mix of languages, and > > potentially with mixed stop-word rules, with one query. I cannot know > > ahead of time what languages will be used in the corpus and I cannot > > restrict any one query to one language. To accomplish this, the > > record itself will be inspected inside an INSERT/UPDATE trigger to > > determine the language and type, and use the correct configuration for > > creating the tsvector. This will obviously result in a "mixed" > > tsvector column, but that's exactly what I need. I can filter on > > record language if the user happens to specify a query language (and > > thus configuration), or simply rank the assumed (IP based, perhaps, or > > browser preference based) preferred language higher, or one of a > > hundred other things. But I won't be able to do any of that if > > tsvectors are required to have one and only one configuration per > > column. > > > > Anyway, I felt I needed to provide some outside perspective to this, > > as a user, since it seems that the external viewpoint (my particular > > viewpoint, at least) was missing from the discussion. > > This is *extremely* useful. I think it's precisely what we've been missing so > far. At least, what I've been missing. > > So the question is what exactly happens in this case? If I search for "the" > does that mean it will ignore matches in English where that's a stop-word but > find me books on tea in French? Is that what I should expect to happen? What > if I search for "earl and the"? Does that find me French books on Early Grey > Tea but English books on all earls? Oh dear ... you went and got me started... Each field type has a different set of configurations for potenial use. Title and subject fields, for instance, do not (generally) use stop-word lists, so a subject search for "the" will match any record with the lexem "the" in a subject field. Title fields are a little more complicated, because there is information in the bibliographic record about how and when to skip leading articles, but generally those are indexed as well for FTS purposes. English LCSH subjects generally don't have stop(like) words in them, so you'll probably just get "French tea" records. Title results would be a mix of "earls" and "French tea" records (probably) and the correlation between the user's preferred language (either chosen query lang or UI lang, whichever is available) will help adjust the ranking, pushing what are likely to be the more appropriate records to the top. Note, however, that much of this multi-tsearch2-configuration setup is not used in the implementation at http://gapines.org/ because, well, there's not much need (read: demand from librarians) for that dataset to support these more complex tricks. It's basically all en-US and exclude stop-words. Other implementations are making more use of what I describe above, including a (government mandated) French-English bilingual institution who shall remain nameless for the time being... > > What happens if I use the same operator directly on the text column? Or > perhaps it's not even possible to specify stop-words when operating on a text > column? Should it be? You mean with an expression index on a text column? I haven't considered using them for FTS. It just feels easier and more flexible to me to use an external tsvector column because of the fairly heavy processing that goes into creating each tsvector value. I may re-evaluate that position now that CREATE INDEX CONCURRENTLY exists, but I'm not developing with 8.2+ only features yet. Once 8.3 is out that may change. Also, unless I misunderstand, you have to wrap the text column in the function used to build the index. For my purposes, that makes building a generic FTS driver for my app (which, admittedly, only has a Postgresql driver ;) ) more difficult than having a "hidden" extra column. Again, that could change if the benefits of CREATE INDEX CONCURRENTLY end up outweighing simpler FTS driver code. --miker
On Tue, 14 Aug 2007, Alvaro Herrera wrote: > Oleg Bartunov wrote: >> On Tue, 14 Aug 2007, Alvaro Herrera wrote: >> >>> Oleg Bartunov wrote: >>>> On Thu, 9 Aug 2007, tomas@tuxteam.de wrote: >>>> >>>>> Maybe I'm missing something, but it seems to me that the configuration >>>>> is more attached to a column/index thatn to the whole database. If >>>>> there's a default in an expression, I'd rather expect this default to be >>>>> drawn from the index involved than from a global value (like a >>>>> functional >>>>> index does now). >>>> >>>> I'm tired to repeat - index itself doesn't know about configuration ! >>> >>> Is there a way to change that? For example store the configuration in a >>> metapage or something? >> >> it's useless, in general, since you could use different configuration to >> build tsvector. > > Hmm, sorry, I think I just understood what this was about: so you mean > that the configuration is really *per row* and not per index? So I can in the very extreme case, yes. Index doesn't care about configuration. Everything should works without index ! > store rows into an index using more than one configuration, and it will > work? why not. For one set of documents you can use one configuration (parser+mappings), for another - different configuration. Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83