Re: How does the tsearch configuration get selected? - Mailing list pgsql-hackers

From	Teodor Sigaev
Subject	Re: How does the tsearch configuration get selected?
Date	June 15, 2007 13:26:44
Msg-id	4672BDBD.2070500@sigaev.ru
In response to	Re: How does the tsearch configuration get selected? (Tom Lane <tgl@sss.pgh.pa.us>)
Responses	Re: How does the tsearch configuration get selected?
List	pgsql-hackers

> One possibility is that the user-visible specification is just a name
> (eg, "english"), but the actual filename out on the filesystem is,
> say, name.encoding.stop (eg, "english.utf8.stop") where we use PG's
> names for the encodings.  We could just fail if there's not a file
> matching the database encoding, or we could try that and then try
> utf8, or some other rule.  In any case I'd want it to verify and
> convert encoding as necessary while reading.

I have no strong objection to UTF8-encoded files (stop words, ispell, 
synonym or thesaurus). Just recode them after reading.
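
To make the "recode after reading" idea concrete, here is a rough sketch in Python (not PostgreSQL code). It assumes the name.encoding.stop file naming from the quoted proposal and the "try the database encoding, then utf8" fallback, which is only one of the rules suggested there. The function name read_stop_words and the PG_TO_PY table are hypothetical and exist only for this illustration.

```python
import os

# Hypothetical mapping from PostgreSQL-style encoding names to Python codec
# names; only the few entries needed for the illustration are listed.
PG_TO_PY = {"utf8": "utf-8", "koi8r": "koi8_r", "latin1": "latin-1"}


def read_stop_words(directory, name, db_encoding):
    """Return the stop words as byte strings in the database encoding.

    Looks for <name>.<db_encoding>.stop first and then <name>.utf8.stop
    (an assumed fallback rule), verifies the file's encoding while
    reading, and recodes every word to the database encoding.
    """
    # dict.fromkeys() drops the duplicate when db_encoding is already "utf8".
    for file_encoding in dict.fromkeys([db_encoding, "utf8"]):
        path = os.path.join(directory, f"{name}.{file_encoding}.stop")
        if not os.path.exists(path):
            continue
        with open(path, "rb") as f:
            raw = f.read()
        # decode() raises UnicodeDecodeError if the file is not valid in the
        # encoding its name claims; that is the "verify" step.
        text = raw.decode(PG_TO_PY[file_encoding])
        # encode() raises UnicodeEncodeError if a word cannot be represented
        # in the database encoding; that is the "convert" step.
        return [word.encode(PG_TO_PY[db_encoding]) for word in text.split()]
    raise FileNotFoundError(f"no stop file for {name!r} in {directory!r}")
```

With this rule, a koi8r database would still pick up russian.utf8.stop if no russian.koi8r.stop file exists, and the words would come back already recoded to koi8r.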

But configurations for different languages may differ: for example, a Russian 
(or any Cyrillic-based) configuration differs from a West European 
configuration, because they are based on different character sets. So we would 
need non-obvious rules to define exactly which stemmer and stop file should be used.
For the Russian language with utf8 encoding, lword should use the English stemmer, 
but for the Italian language it should use the Italian stemmer. A Russian word 
cannot contain any ASCII characters, but an Italian word may consist only of ASCII.
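
The rule above can be sketched roughly as follows, in Python rather than in actual tsearch code. The token type name lword is the one used above; nlword as the name for non-Latin words, the STEMMER_MAP table, and the function names are assumptions made only for this illustration, and the ASCII test is a simplification of what a real parser does.

```python
# Hypothetical per-language mapping from token type to stemmer.
# "lword" is an all-ASCII (Latin) word; "nlword" is assumed here to mean
# any word containing non-ASCII characters.
STEMMER_MAP = {
    # ASCII-only words in Russian text cannot be Russian, so they are
    # handed to the English stemmer.
    "russian": {"lword": "english", "nlword": "russian"},
    # Italian words may be pure ASCII, so both types go to the Italian stemmer.
    "italian": {"lword": "italian", "nlword": "italian"},
}


def token_type(word):
    """Classify a word as 'lword' if it is pure ASCII, otherwise 'nlword'."""
    return "lword" if word.isascii() else "nlword"


def pick_stemmer(language, word):
    """Return the stemmer name the configuration for `language` would use."""
    return STEMMER_MAP[language][token_type(word)]
```

So the same ASCII word is routed to a different stemmer depending on the language of the configuration, which is why a single rule based on encoding alone is not enough.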



-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
