Thread: default_text_search_config

default_text_search_config

From
Tatsuo Ishii
Date:
When I ran initdb -E EUC_JP --no-locale, I found the following in my
postgresql.conf:

default_text_search_config = 'pg_catalog.english'

The manual says:

default_text_search_config (string)
   Selects the text search configuration that is used by those
   variants of the text search functions that do not have an explicit
   argument specifying the configuration. See Chapter 12 for further
   information. The built-in default is pg_catalog.simple, but initdb
   will initialize the configuration file with a setting that
   corresponds to the chosen lc_ctype locale, if a configuration
   matching that locale can be identified.
 

So I thought the initial value should be pg_catalog.simple rather than
pg_catalog.english. If this is not a bug, what is the idea behind
mapping lc_ctype = C to 'pg_catalog.english'?
When is pg_catalog.simple supposed to be used?
--
Tatsuo Ishii
SRA OSS, Inc. Japan


Re: default_text_search_config

From
Tom Lane
Date:
Tatsuo Ishii <ishii@postgresql.org> writes:
> When I ran initdb -E EUC_JP --no-locale, I found the following in my
> postgresql.conf:

> default_text_search_config = 'pg_catalog.english'

> The manual says:

> default_text_search_config (string)

>     Selects the text search configuration that is used by those
>     variants of the text search functions that do not have an explicit
>     argument specifying the configuration. See Chapter 12 for further
>     information. The built-in default is pg_catalog.simple, but initdb
>     will initialize the configuration file with a setting that
>     corresponds to the chosen lc_ctype locale, if a configuration
>     matching that locale can be identified.

> So I thought the initial value should be pg_catalog.simple rather than
> pg_catalog.english. If this is not a bug, what is the idea behind
> mapping lc_ctype = C to 'pg_catalog.english'?
> When is pg_catalog.simple supposed to be used?

Well, that documentation is correct as far as it goes; what it doesn't
say is that initdb's mapping table explicitly maps C/POSIX locales to
english.  It seems like a reasonable default on this side of the water,
but maybe I'm being too North-American-centric.
        regards, tom lane
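
The lookup described above can be pictured roughly as follows. This is a minimal sketch in Python, not initdb's actual code: only the C/POSIX-to-english entries come from the message above, while the other table entries, the name `guess_ts_config`, and the language-prefix matching rule are assumptions made for illustration.

```python
# Hypothetical model of a locale -> text search configuration lookup.
# Only the C/POSIX -> english entries come from the discussion above;
# the remaining entries and the matching rule are illustrative assumptions.
LOCALE_TO_TS_CONFIG = [
    ("C", "english"),
    ("POSIX", "english"),
    ("en", "english"),
    ("fr", "french"),
    ("de", "german"),
]

BUILT_IN_DEFAULT = "simple"


def guess_ts_config(lc_ctype):
    """Return a schema-qualified configuration name for a locale name."""
    # Drop the encoding and modifier parts, e.g. "fr_FR.UTF-8@euro" -> "fr_FR".
    base = lc_ctype.split(".", 1)[0].split("@", 1)[0]
    # Keep only the language part, e.g. "fr_FR" -> "fr".
    language = base.split("_", 1)[0]
    for key, config in LOCALE_TO_TS_CONFIG:
        if language.lower() == key.lower():
            return "pg_catalog." + config
    # No entry matched: fall back to the built-in default.
    return "pg_catalog." + BUILT_IN_DEFAULT


if __name__ == "__main__":
    for name in ("C", "POSIX", "fr_FR.UTF-8", "ja_JP.eucJP"):
        print(name, "->", guess_ts_config(name))
```

Under these assumptions a C or POSIX locale yields pg_catalog.english, while a locale with no table entry (ja_JP.eucJP here) falls through to pg_catalog.simple, which is the case the manual's wording suggests.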


Re: default_text_search_config

From
Tatsuo Ishii
Date:
> Tatsuo Ishii <ishii@postgresql.org> writes:
> > When I ran initdb -E EUC_JP --no-locale, I found the following in my
> > postgresql.conf:
> 
> > default_text_search_config = 'pg_catalog.english'
> 
> > The manual says:
> 
> > default_text_search_config (string)
> 
> >     Selects the text search configuration that is used by those
> >     variants of the text search functions that do not have an explicit
> >     argument specifying the configuration. See Chapter 12 for further
> >     information. The built-in default is pg_catalog.simple, but initdb
> >     will initialize the configuration file with a setting that
> >     corresponds to the chosen lc_ctype locale, if a configuration
> >     matching that locale can be identified.
> 
> > So I thought the initial value should be pg_catalog.simple rather than
> > pg_catalog.english. If this is not a bug, what is the idea behind
> > mapping lc_ctype = C to 'pg_catalog.english'?
> > When is pg_catalog.simple supposed to be used?
> 
> Well, that documentation is correct as far as it goes; what it doesn't
> say is that initdb's mapping table explicitly maps C/POSIX locales to
> english.  It seems like a reasonable default on this side of the water,
> but maybe I'm being too North-American-centric.

OK. Are you going to add "initdb's mapping table explicitly maps
C/POSIX locales to english" to the doc? If not, I can do that part.
--
Tatsuo Ishii
SRA OSS, Inc. Japan


Re: default_text_search_config

From
Tom Lane
Date:
Tatsuo Ishii <ishii@postgresql.org> writes:
>> Well, that documentation is correct as far as it goes; what it doesn't
>> say is that initdb's mapping table explicitly maps C/POSIX locales to
>> english.  It seems like a reasonable default on this side of the water,
>> but maybe I'm being too North-American-centric.

> OK. Are you going to add "initdb's mapping table explicitly maps
> C/POSIX locales to english" to the doc? If not, I can do that part.

Before we worry about documenting the behavior, are you happy
about it?  What could be done differently?  I'm wondering if it makes
any sense to consider the specified database encoding while making
the text-search decision ...
        regards, tom lane


Re: default_text_search_config

From
Tatsuo Ishii
Date:
> Tatsuo Ishii <ishii@postgresql.org> writes:
> >> Well, that documentation is correct as far as it goes; what it doesn't
> >> say is that initdb's mapping table explicitly maps C/POSIX locales to
> >> english.  It seems like a reasonable default on this side of the water,
> >> but maybe I'm being too North-American-centric.
> 
> > OK. Are you going to add "initdb's mapping table explicitly maps
> > C/POSIX locales to english" to the doc? If not, I can do that part.
> 
> Before we worry about documenting the behavior, are you happy
> about it?  What could be done differently?  I'm wondering if it makes
> any sense to consider the specified database encoding while making
> the text-search decision ...

To me, the idea that a text-search configuration maps to a
locale/language seems totally wrong. IMO an encoding/charset
can include several languages, so a text-search configuration should
be mapped to an encoding/charset rather than a language. Apparently
this will not happen in the near future, however.

The good thing is that the english text-search configuration can handle
multibyte characters, so I can live with the current text-search
implementation.
--
Tatsuo Ishii
SRA OSS, Inc. Japan


Re: default_text_search_config

From
ITAGAKI Takahiro
Date:
Tatsuo Ishii <ishii@postgresql.org> wrote:

> To me, the idea that a text-search configuration maps to a
> locale/language seems totally wrong. IMO an encoding/charset
> can include several languages, so a text-search configuration should
> be mapped to an encoding/charset rather than a language.

I think mapping by encoding/charset *is* totally wrong, and mapping by
locale is reasonable. How would you treat LATIN1? It is used for French,
German, and so on. Moreover, UTF-8 can be used for almost all languages.

The tight mapping of EUC_JP <=> Japanese is a special case.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center




Re: default_text_search_config

From
Tatsuo Ishii
Date:
> Tatsuo Ishii <ishii@postgresql.org> wrote:
> 
> > To me, the idea that a text-search configuration maps to a
> > locale/language seems totally wrong. IMO an encoding/charset
> > can include several languages, so a text-search configuration should
> > be mapped to an encoding/charset rather than a language.
> 
> I think mapping by encoding/charset *is* totally wrong, and mapping by
> locale is reasonable. How would you treat LATIN1? It is used for French,
> German, and so on. Moreover, UTF-8 can be used for almost all languages.
> 
> The tight mapping of EUC_JP <=> Japanese is a special case.

What? I didn't say that an encoding/charset maps to a single
language. In fact EUC_JP includes Japanese, English (ASCII), Greek,
Cyrillic and so on. So for full text search to process EUC_JP text
properly, it must be able to process multiple languages at a time.

As you know, PostgreSQL allows only one locale per cluster, and the
fact that text search depends on the locale prevents it from
processing multi-language text.

The only solution I can think of today is to create a new parser that
can process EUC_JP properly (I mean one that can process not only
Japanese but also English) and use it on a C locale/EUC_JP cluster. I
would do this for 8.4 if I have time.
--
Tatsuo Ishii
SRA OSS, Inc. Japan


Re: default_text_search_config

From
ITAGAKI Takahiro
Date:
Tatsuo Ishii <ishii@postgresql.org> wrote:

> As you know, PostgreSQL allows only one locale per cluster, and the
> fact that text search depends on the locale prevents it from
> processing multi-language text.
>
> The only solution I can think of today is to create a new parser that
> can process EUC_JP properly (I mean one that can process not only
> Japanese but also English) and use it on a C locale/EUC_JP cluster. I
> would do this for 8.4 if I have time.

The correct solution is probably to support multiple locales in a
single database cluster. We should set the locale after deciding
the encoding now, but I think the current implementation is wrong
because the locale depends on the encoding, but the opposite is not true.
(locale = 'language_country.*encoding*')

If you are going to work on multi-language text-search support, we had
better settle the locale issue first. It might affect your new parser.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center




Re: default_text_search_config

From
Tatsuo Ishii
Date:
> The correct solution is probably to support multiple locales in a
> single database cluster. We should set the locale after deciding
> the encoding now, but I think the current implementation is wrong
> because the locale depends on the encoding, but the opposite is not true.
> (locale = 'language_country.*encoding*')
> 
> If you are going to work on multi-language text-search support, we had
> better settle the locale issue first. It might affect your new parser.

I'm not sure the locale-per-database solution is a silver bullet.
Even with it, we still cannot handle the case where, for example,
LATIN1-encoded text includes several languages at once and thus needs
multiple locales. Nor can we have columns or tables in different
languages at the same time, because that requires multiple locales.
The same can be said of Unicode. All in all, it seems a half-baked
solution to me.
--
Tatsuo Ishii
SRA OSS, Inc. Japan


Re: default_text_search_config

From
"Pavel Stehule"
Date:
>
> I'm not sure the locale-per-database solution is a silver bullet.
> Even with it, we still cannot handle the case where, for example,
> LATIN1-encoded text includes several languages at once and thus needs
> multiple locales. Nor can we have columns or tables in different
> languages at the same time, because that requires multiple locales.
> The same can be said of Unicode. All in all, it seems a half-baked
> solution to me.
> --

There is only one correct solution: support for COLLATE. With
COLLATE you can choose the locale per database, per table, per column,
or per operation. This is one area where PostgreSQL lags behind other
databases.

Pavel Stehule


Re: default_text_search_config

From
Tom Lane
Date:
Tatsuo Ishii <ishii@postgresql.org> writes:
> As you know, PostgreSQL allows only one locale per cluster, and the
> fact that text search depends on the locale prevents it from
> processing multi-language text.

I think you are confusing the capabilities of tsearch with the fact
that we have to pick one default setting.  There's nothing that
stops you from using a search configuration that includes multiple
dictionaries for different languages.
        regards, tom lane
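
To make the point about multiple dictionaries concrete: a text search configuration hands each token to an ordered list of dictionaries and uses the first one that recognizes it. The toy model below sketches only that idea in Python. The word lists and the names `make_word_list_dict`, `simple_dict`, and `normalize` are hypothetical and do not correspond to any PostgreSQL API.

```python
# Toy model of a dictionary chain. A "dictionary" is a function that returns
# a list of lexemes when it recognizes a token, an empty list for a stop
# word, or None when it does not know the token.
# All names and word lists here are hypothetical.


def make_word_list_dict(words, stop_words):
    """Build a dictionary from a token -> lexeme map and a stop-word set."""
    def lookup(token):
        token = token.lower()
        if token in stop_words:
            return []            # recognized, but discarded
        if token in words:
            return [words[token]]
        return None              # unknown: let the next dictionary try
    return lookup


def simple_dict(token):
    """Catch-all dictionary: accepts every token, lowercased."""
    return [token.lower()]


def normalize(tokens, dictionaries):
    """Pass each token down the chain until one dictionary recognizes it."""
    lexemes = []
    for token in tokens:
        for dictionary in dictionaries:
            result = dictionary(token)
            if result is not None:
                lexemes.extend(result)
                break
    return lexemes


english = make_word_list_dict({"dogs": "dog"}, {"the", "a"})
german = make_word_list_dict({"hunde": "hund"}, {"der", "die", "das", "und"})
chain = [english, german, simple_dict]

if __name__ == "__main__":
    print(normalize(["The", "dogs", "und", "Hunde", "PostgreSQL"], chain))
```

In this sketch the English and German word lists each handle their own tokens within one chain, and the catch-all at the end keeps anything neither of them knows.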


Re: default_text_search_config

From
Oleg Bartunov
Date:
On Fri, 5 Oct 2007, Tom Lane wrote:

> Tatsuo Ishii <ishii@postgresql.org> writes:
>> As you know, PostgreSQL allows only one locale per cluster, and the
>> fact that text search depends on the locale prevents it from
>> processing multi-language text.
>
> I think you are confusing the capabilities of tsearch with the fact
> that we have to pick one default setting.  There's nothing that
> stops you from using a search configuration that includes multiple
> dictionaries for different languages.

Exactly!

    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83