Thread: tsearch2: language or encoding
Hi,

I'm wondering if a tsearch configuration is bound to a language or an encoding. If it's bound to a language, there's a serious design problem, I would think. An encoding or charset is not necessarily bound to a single language. We can find such examples everywhere (I'm not talking about Unicode here). LATIN1 includes English and several European languages. EUC-JP includes English and Japanese, etc. And we specify an encoding, not a language, as a char's property, so I would say the configuration should be bound to an encoding.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
Tatsuo Ishii <ishii@sraoss.co.jp> writes:
> I'm wondering if a tsearch configuration is bound to a language or
> an encoding. If it's bound to a language, there's a serious design
> problem, I would think. An encoding or charset is not necessarily
> bound to a single language. We can find such examples everywhere (I'm
> not talking about Unicode here). LATIN1 includes English and several
> European languages. EUC-JP includes English and Japanese, etc. And
> we specify an encoding, not a language, as a char's property, so I
> would say the configuration should be bound to an encoding.

Surely not, because then what do you do with utf8, which (allegedly) represents every language on earth?

As far as the word-stemming part goes, that is very clearly bound to a language, not an encoding. There may be some other parts of the code that really are better attached to an encoding --- Oleg, Teodor, your thoughts?

regards, tom lane
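A minimal sketch of the stemming point, not taken from the tsearch code: the same word has different byte sequences in LATIN1 and UTF-8, yet decodes to the same text, and what the stemmer does with it depends only on the language chosen. The suffix rules and the toy_stem function are hypothetical and far simpler than any real stemmer.

```python
# Toy illustration only: these suffix rules are an assumption made for this
# example and are NOT a real stemming algorithm.
TOY_SUFFIXES = {
    "english": ("ing", "ed", "s"),
    "german": ("en", "er", "e"),
}


def toy_stem(word: str, language: str) -> str:
    """Strip the first matching suffix listed for the given language."""
    for suffix in TOY_SUFFIXES[language]:
        # Keep at least three characters of stem so short words stay intact.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


word = "Bücher"

# The byte representation depends on the encoding...
latin1_bytes = word.encode("latin-1")
utf8_bytes = word.encode("utf-8")

# ...but once decoded, both yield the same text.
from_latin1 = latin1_bytes.decode("latin-1")
from_utf8 = utf8_bytes.decode("utf-8")

# The stem depends on the language, whichever encoding the text arrived in.
german_stem = toy_stem(from_utf8, "german")
english_stem = toy_stem(from_latin1, "english")
```

Here the encoding only decides how the characters are stored; choosing "german" or "english" is a separate decision that the encoding cannot make for you.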
On Fri, 2007-07-06 at 15:43 +0900, Tatsuo Ishii wrote:
> I'm wondering if a tsearch configuration is bound to a language or
> an encoding. If it's bound to a language, there's a serious design
> problem, I would think. An encoding or charset is not necessarily
> bound to a single language. We can find such examples everywhere (I'm
> not talking about Unicode here). LATIN1 includes English and several
> European languages. EUC-JP includes English and Japanese, etc. And
> we specify an encoding, not a language, as a char's property, so I
> would say the configuration should be bound to an encoding.

Perhaps the encoding could suggest a default language, but in many cases I see no direct connection between language and encoding, especially for European languages and encodings.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
Tatsuo,

an fts configuration isn't related to the encoding! It's entirely up to you how to combine the parser and dictionaries. The problem arises only if you want to define a so-called default configuration, which, as I'm now inclined to think, is a bad feature. We chose the locale name to identify the default configuration; for 8.3, people suggested using the language name.

Oleg

On Fri, 6 Jul 2007, Tatsuo Ishii wrote:
> Hi,
>
> I'm wondering if a tsearch configuration is bound to a language or
> an encoding. If it's bound to a language, there's a serious design
> problem, I would think. An encoding or charset is not necessarily
> bound to a single language. We can find such examples everywhere (I'm
> not talking about Unicode here). LATIN1 includes English and several
> European languages. EUC-JP includes English and Japanese, etc. And
> we specify an encoding, not a language, as a char's property, so I
> would say the configuration should be bound to an encoding.
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
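As a rough model of "a configuration is a parser combined with dictionaries", the sketch below is written in Python rather than taken from the tsearch sources. The names Config, stopwords and simple are hypothetical, and the convention that a dictionary returns None for "not recognized" and an empty string for "stop word" is a simplification assumed for this example.

```python
from typing import Callable, List, Optional

# Assumed convention for this sketch: a dictionary returns a normalized
# lexeme, "" to discard the token as a stop word, or None to pass it on.
Dictionary = Callable[[str], Optional[str]]
Parser = Callable[[str], List[str]]


def stopwords(token: str) -> Optional[str]:
    """Discard a few English stop words; pass everything else on."""
    return "" if token.lower() in {"the", "a", "of"} else None


def simple(token: str) -> Optional[str]:
    """Accept any token and lower-case it."""
    return token.lower()


class Config:
    """A parser plus an ordered chain of dictionaries; no encoding involved."""

    def __init__(self, parser: Parser, dictionaries: List[Dictionary]) -> None:
        self.parser = parser
        self.dictionaries = dictionaries

    def lexemes(self, text: str) -> List[str]:
        out: List[str] = []
        for token in self.parser(text):
            for dictionary in self.dictionaries:
                result = dictionary(token)
                if result is None:
                    continue  # not recognized, try the next dictionary
                if result:
                    out.append(result)
                break  # the first dictionary that recognizes the token wins
        return out


config = Config(str.split, [stopwords, simple])
result = config.lexemes("The encoding of a text")
```

Nothing in Config refers to an encoding: swapping in a different dictionary chain changes the language handling, while the input is already decoded text.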