Re: How does the tsearch configuration get selected? - Mailing list pgsql-hackers

From Bruce Momjian
Subject Re: How does the tsearch configuration get selected?
Date
Msg-id 200706151415.l5FEFVG29817@momjian.us
In response to Re: How does the tsearch configuration get selected?  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: How does the tsearch configuration get selected?
Re: How does the tsearch configuration get selected?
List pgsql-hackers
Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > First, why are we specifying the server locale here since it never
> > changes:
> 
> It's poorly described.  What it should really say is the language
> that the text-to-be-searched is in.  We can actually support multiple
> languages here today, the restriction being that there have to be
> stemmer instances for the languages with the database encoding you're
> using.  With UTF8 encoding this isn't much of a restriction.  We do need
> to put code into the dictionary stuff to enforce that you can't use a
> stemmer when the database encoding isn't compatible with it.
> 
> I would prefer that we not drive any of this stuff off the server's
> LC_xxx settings, since as you say that restricts things to just one
> locale.

The idea they had was to set the _default_ full text configuration to
match the locale, e.g. UTF8.en_US.  This works well for cases where we
ship a number of pre-installed full text configurations in pg_catalog.
But of course you can support multiple languages with that
encoding/locale, so you have to have the ability to do other languages,
but not necessarily by default.

> > Second, I can't figure out how to reference a non-default
> > configuration.
> 
> See the multi-argument versions of to_tsvector etc.
> 
> I do see a problem with having to_tsvector(config, text) plus
> to_tsvector(text) where the latter implicitly references a config
> selected by a GUC variable: how can you tell whether a query using the
> latter matches a particular index using the former?  There isn't
> anything in the current planner mechanisms that would make that work.

Well, now that I have gotten feedback, we have a few options:

1)  Require the configuration to always be specified.  The problem with
this is that casting (::tsquery) and operators (@@) have no way to
specify a configuration.

2)  Use a GUC that you can set for the configuration, and perhaps
default it if possible to match the locale.  Is the default affected by
search_path (ouch)?

How do we make sure that any index that is accessed is using the same
configuration that is being used by the query, e.g. ::tsquery?  Do we
have to store the configuration name in the index and somehow throw an
error if it doesn't match?  What about changes to the configuration
after the index has been created, e.g. new stop words or dictionaries?
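
To make the last question concrete, here is a toy sketch in Python.  It
is not PostgreSQL code: make_lexemes, build_index, search, the stop-word
sets and the dict-based index are all invented for illustration.  It
only shows how an index built under one stop-word list can silently
disagree with a query parsed under a newer one.

```python
def make_lexemes(text, stop_words):
    """Lowercase, split on whitespace, drop stop words (stand-in for a real parser)."""
    return [w for w in text.lower().split() if w not in stop_words]


def build_index(docs, stop_words):
    """Map each lexeme to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for lexeme in make_lexemes(text, stop_words):
            index.setdefault(lexeme, set()).add(doc_id)
    return index


def search(index, query, stop_words):
    """Return the ids of documents that contain every lexeme of the query."""
    lexemes = make_lexemes(query, stop_words)
    if not lexemes:
        return set()
    return set.intersection(*[index.get(lexeme, set()) for lexeme in lexemes])


docs = {1: "the who live", 2: "live music"}
old_stop = {"the", "who"}   # stop words in effect when the index was built
new_stop = {"the"}          # "who" later removed from the stop-word list

index = build_index(docs, old_stop)

# The old index never stored "who", so the query finds nothing.
stale = search(index, "who", new_stop)

# Rebuilding the index under the new configuration finds document 1.
fresh = search(build_index(docs, new_stop), "who", new_stop)
```

In this model the only way to get consistent answers after a
configuration change is to rebuild the index, which is why the question
of detecting such changes matters.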

The two big open issues are whether we allow a default configuration,
and whether we require the configuration name to always be specified.

My guess right now is that we use a GUC whose default is the pg_catalog
configuration whose name matches the lc_ctype locale name, if there is
one, and that we have to throw an error if the GUC value an accessed
index was created under doesn't match the current GUC.
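
A minimal sketch of that guess, in Python rather than backend C, might
look like the following.  Everything here is hypothetical:
default_config, check_index, ConfigMismatch and the configuration names
are invented for the sketch, and it assumes the configuration name in
force at index creation is recorded somewhere the check can read it.

```python
class ConfigMismatch(Exception):
    """Raised when an index and the current session disagree on configuration."""


def default_config(lc_ctype, catalog_configs):
    """Return the configuration named exactly like lc_ctype, or None if absent."""
    return lc_ctype if lc_ctype in catalog_configs else None


def check_index(index_config, current_config):
    """Refuse to use an index that was built under a different configuration."""
    if index_config != current_config:
        raise ConfigMismatch(
            f"index built with {index_config!r}, session uses {current_config!r}"
        )


# Hypothetical pre-installed configurations; names invented for this sketch.
catalog_configs = {"en_US.UTF-8", "ru_RU.UTF-8"}

session_config = default_config("ru_RU.UTF-8", catalog_configs)
no_match = default_config("de_DE.UTF-8", catalog_configs)  # no default available

check_index("ru_RU.UTF-8", session_config)      # same config: passes silently
try:
    check_index("en_US.UTF-8", session_config)  # different config: rejected
    mismatch_detected = False
except ConfigMismatch:
    mismatch_detected = True
```

The sketch compares names only, so it would not notice a configuration
whose stop words or dictionaries changed while its name stayed the same.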

So we create a pg_catalog full text configuration named UTF8.en-US, and
some others like ru_RU.UTF-8.

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +

