Thread: tsearch2: language or encoding

tsearch2: language or encoding

From
Tatsuo Ishii
Date:
Hi,

I'm wondering whether a tsearch configuration is bound to a language or
an encoding. If it's bound to a language, there's a serious design
problem, I would think. An encoding or charset is not necessarily
bound to a single language. We can find such examples everywhere (I'm
not talking about Unicode here). LATIN1 includes English and several
European languages. EUC-JP includes English and Japanese, etc. And since
we specify an encoding, not a language, as a property of character data,
I would say the configuration should be bound to an encoding.
--
Tatsuo Ishii
SRA OSS, Inc. Japan


Re: tsearch2: language or encoding

From
Tom Lane
Date:
Tatsuo Ishii <ishii@sraoss.co.jp> writes:
> I'm wondering whether a tsearch configuration is bound to a language or
> an encoding. If it's bound to a language, there's a serious design
> problem, I would think. An encoding or charset is not necessarily
> bound to a single language. We can find such examples everywhere (I'm
> not talking about Unicode here). LATIN1 includes English and several
> European languages. EUC-JP includes English and Japanese, etc. And since
> we specify an encoding, not a language, as a property of character data,
> I would say the configuration should be bound to an encoding.

Surely not, because then what do you do with utf8, which (allegedly)
represents every language on earth?

As far as the word-stemming part goes, that is very clearly bound
to a language not an encoding.  There may be some other parts of
the code that really are better attached to an encoding --- Oleg,
Teodor, your thoughts?
        regards, tom lane
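
The stemming point can be illustrated with a minimal sketch. It assumes two hypothetical, deliberately crude suffix tables (they are not taken from tsearch2 or any real stemmer) and shows that two words stored in the same LATIN1 encoding still need different, language-specific rules:

```python
# Toy illustration only: these suffix rules are hypothetical and far
# cruder than any real stemmer.
SUFFIXES = {
    "english": ("ing", "ed", "s"),
    "german": ("ungen", "ung", "en"),
}


def toy_stem(word: str, language: str) -> str:
    """Strip the first matching suffix from the given language's table."""
    for suffix in SUFFIXES[language]:
        # Keep at least three characters of the word after stripping.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Both words are valid LATIN1 text, so the encoding alone cannot tell us
# which rule set applies.
english_bytes = "searching".encode("latin-1")
german_bytes = "prüfungen".encode("latin-1")

for raw, language in ((english_bytes, "english"), (german_bytes, "german")):
    word = raw.decode("latin-1")
    print(language, word, "->", toy_stem(word, language))
```

Applying the English table to the German word leaves it unchanged, which is the sense in which stemming follows the language rather than the encoding.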


Re: tsearch2: language or encoding

From
Simon Riggs
Date:
On Fri, 2007-07-06 at 15:43 +0900, Tatsuo Ishii wrote:

> I'm wondering whether a tsearch configuration is bound to a language or
> an encoding. If it's bound to a language, there's a serious design
> problem, I would think. An encoding or charset is not necessarily
> bound to a single language. We can find such examples everywhere (I'm
> not talking about Unicode here). LATIN1 includes English and several
> European languages. EUC-JP includes English and Japanese, etc. And since
> we specify an encoding, not a language, as a property of character data,
> I would say the configuration should be bound to an encoding.

Perhaps the encoding could suggest a default language, but I see no
direct connection in many cases between language and encoding,
especially for European languages and encodings.

--
  Simon Riggs
  EnterpriseDB  http://www.enterprisedb.com



Re: tsearch2: language or encoding

From
Oleg Bartunov
Date:
Tatsuo,

An fts configuration isn't related to the encoding! It's fully up to you
how to combine the parser and dictionaries.
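
As a rough conceptual model (not tsearch2's actual API; every name below is hypothetical), a configuration can be thought of as a parser plus an ordered list of dictionaries, where each token is handed to the dictionaries in turn until one of them recognizes it:

```python
import re
from typing import Callable, List, Optional

# A dictionary returns None for "not recognized", or a list of lexemes
# (an empty list means "recognized, but drop it", as for stop words).
Dictionary = Callable[[str], Optional[List[str]]]


def simple_parser(text: str) -> List[str]:
    """Hypothetical parser: lowercase, split on runs of non-word characters."""
    return [t for t in re.split(r"\W+", text.lower()) if t]


def stop_words(token: str) -> Optional[List[str]]:
    """Recognize a few stop words and drop them."""
    return [] if token in {"the", "a", "of"} else None


def plural_stripper(token: str) -> Optional[List[str]]:
    """Toy stand-in for a stemmer: strip a trailing 's' from longer words."""
    if len(token) > 3 and token.endswith("s"):
        return [token[:-1]]
    return None


def passthrough(token: str) -> Optional[List[str]]:
    """Fallback dictionary that accepts every token unchanged."""
    return [token]


def make_config(
    parser: Callable[[str], List[str]], dictionaries: List[Dictionary]
) -> Callable[[str], List[str]]:
    """Combine one parser with an ordered list of dictionaries."""

    def to_lexemes(text: str) -> List[str]:
        lexemes: List[str] = []
        for token in parser(text):
            for dictionary in dictionaries:
                result = dictionary(token)
                if result is not None:  # first dictionary to recognize it wins
                    lexemes.extend(result)
                    break
        return lexemes

    return to_lexemes


english_like = make_config(simple_parser, [stop_words, plural_stripper, passthrough])
plain = make_config(simple_parser, [passthrough])

print(english_like("The cats of a house"))
print(plain("The cats of a house"))
```

Nothing in this model refers to an encoding: swapping the dictionary list changes the behaviour, while the input text stays the same.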

The problem arises only if you want to define a so-called default
configuration, which, as I'm now inclined to think, is a bad feature.
We chose the locale name to identify the default configuration; for 8.3,
people suggested using the language name instead.

Oleg
On Fri, 6 Jul 2007, Tatsuo Ishii wrote:

> Hi,
>
> I'm wondering whether a tsearch configuration is bound to a language or
> an encoding. If it's bound to a language, there's a serious design
> problem, I would think. An encoding or charset is not necessarily
> bound to a single language. We can find such examples everywhere (I'm
> not talking about Unicode here). LATIN1 includes English and several
> European languages. EUC-JP includes English and Japanese, etc. And since
> we specify an encoding, not a language, as a property of character data,
> I would say the configuration should be bound to an encoding.
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83