Re: How does the tsearch configuration get selected? - Mailing list pgsql-hackers
From | Bruce Momjian |
---|---|
Subject | Re: How does the tsearch configuration get selected? |
Date | |
Msg-id | 200706151415.l5FEFVG29817@momjian.us |
In response to | Re: How does the tsearch configuration get selected? (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses | Re: How does the tsearch configuration get selected? |
 | Re: How does the tsearch configuration get selected? |
List | pgsql-hackers |
Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > First, why are we specifying the server locale here since it never
> > changes:
>
> It's poorly described. What it should really say is the language
> that the text-to-be-searched is in. We can actually support multiple
> languages here today, the restriction being that there have to be
> stemmer instances for the languages with the database encoding you're
> using. With UTF8 encoding this isn't much of a restriction. We do need
> to put code into the dictionary stuff to enforce that you can't use a
> stemmer when the database encoding isn't compatible with it.
>
> I would prefer that we not drive any of this stuff off the server's
> LC_xxx settings, since as you say that restricts things to just one
> locale.

The idea they had was to set the _default_ full text configuration to
match the locale, e.g. UTF8.en_US. This works well for cases where we
ship a number of pre-installed full text configurations in pg_catalog.
But of course you can support multiple languages with that
encoding/locale, so you have to have the ability to do other languages,
but not necessarily by default.

> > Second, I can't figure out how to reference a non-default
> > configuration.
>
> See the multi-argument versions of to_tsvector etc.
>
> I do see a problem with having to_tsvector(config, text) plus
> to_tsvector(text) where the latter implicitly references a config
> selected by a GUC variable: how can you tell whether a query using the
> latter matches a particular index using the former? There isn't
> anything in the current planner mechanisms that would make that work.

Well, now that I have gotten feedback, we have a few options:

1) Require the configuration to be always specified. The problem with
   this is that casting (::tsquery) and operators (@@) have no way to
   specify a configuration.

2) Use a GUC that you can set for the configuration, and perhaps
   default it if possible to match the locale.
   Is the default affected by search_path (ouch)? How do we make sure
   that any index that is accessed is using the same configuration that
   is being used by the query, e.g. ::tsquery? Do we have to store the
   configuration name in the index and somehow throw an error if it
   doesn't match? What about changes to the configuration after the
   index has been created, e.g. new stop words or dictionaries?

The two big open issues are whether we allow a default configuration,
and whether we require the configuration name to be always specified.

My guess right now is that we use a GUC that will default if a
pg_catalog configuration name matches the lc_ctype locale name, and we
have to throw an error if the GUC value in effect when an accessed
index was created doesn't match the current GUC. So we create a
pg_catalog full text configuration named UTF8.en-US, and some others
like ru_RU.UTF-8.

-- 
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
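
To make the guess above concrete, here is a minimal sketch in Python of the two rules it describes: defaulting the configuration when a pre-installed name matches lc_ctype, and raising an error when an index was built under a different configuration than the current one. This only models the idea and is not PostgreSQL code. The names `default_config`, `check_index_usable`, `FullTextIndex`, and `ConfigMismatch` are hypothetical, and the exact-string match against lc_ctype is an assumption, since the message does not settle how configuration names map to locale names.

```python
from dataclasses import dataclass
from typing import Optional, Set


class ConfigMismatch(Exception):
    """Raised when the current configuration differs from the index's."""


@dataclass(frozen=True)
class FullTextIndex:
    name: str
    config: str  # configuration name recorded when the index was built (assumed)


def default_config(lc_ctype: str, catalog_configs: Set[str]) -> Optional[str]:
    """Return the pre-installed configuration named exactly like lc_ctype.

    Returns None when no such configuration exists, meaning there is no
    default and the caller must name a configuration explicitly.
    """
    return lc_ctype if lc_ctype in catalog_configs else None


def check_index_usable(index: FullTextIndex, current_config: Optional[str]) -> None:
    """Raise ConfigMismatch unless the current configuration matches the index."""
    if current_config != index.config:
        raise ConfigMismatch(
            f"index {index.name!r} was built with {index.config!r}, "
            f"but the current configuration is {current_config!r}"
        )
```

For example, with a catalog containing "ru_RU.UTF-8", `default_config("ru_RU.UTF-8", catalog)` selects that configuration, while an lc_ctype of "C" yields no default. Note that this check compares names only, so it would not catch the other case raised above, where the stop words or dictionaries of a configuration change after the index is created.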