Thread: default_text_search_config
When I run initdb -E EUC_JP --no-locale, I find the following in my postgresql.conf:

default_text_search_config = 'pg_catalog.english'

The manual says:

default_text_search_config (string)

Selects the text search configuration that is used by those variants of the text search functions that do not have an explicit argument specifying the configuration. See Chapter 12 for further information. The built-in default is pg_catalog.simple, but initdb will initialize the configuration file with a setting that corresponds to the chosen lc_ctype locale, if a configuration matching that locale can be identified.

So I thought the initial value for it should be pg_catalog.simple, rather than pg_catalog.english. If this is not a bug, what is the idea behind lc_ctype = C corresponding to 'pg_catalog.english'? When is pg_catalog.simple supposed to be used?

--
Tatsuo Ishii
SRA OSS, Inc. Japan
Tatsuo Ishii <ishii@postgresql.org> writes:
> When I run initdb -E EUC_JP --no-locale, I find the following in my
> postgresql.conf:
>
> default_text_search_config = 'pg_catalog.english'
>
> The manual says:
>
> default_text_search_config (string)
>
> Selects the text search configuration that is used by those
> variants of the text search functions that do not have an explicit
> argument specifying the configuration. See Chapter 12 for further
> information. The built-in default is pg_catalog.simple, but initdb
> will initialize the configuration file with a setting that
> corresponds to the chosen lc_ctype locale, if a configuration
> matching that locale can be identified.
>
> So I thought the initial value for it should be pg_catalog.simple,
> rather than pg_catalog.english. If this is not a bug, what is the
> idea behind lc_ctype = C corresponding to 'pg_catalog.english'?
> When is pg_catalog.simple supposed to be used?

Well, that documentation is correct as far as it goes; what it doesn't say is that initdb's mapping table explicitly maps C/POSIX locales to english. It seems like a reasonable default on this side of the water, but maybe I'm being too North-American-centric.

regards, tom lane
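[Editor's sketch] The behavior described above can be pictured with a small Python model. It is not initdb's source (initdb is written in C, and its real table is longer). Only the C/POSIX to english entries are confirmed in this thread; the other table entries, the function name pick_ts_config, and the rule for extracting the language part of lc_ctype are assumptions made for illustration.

```python
# Illustrative model only, not initdb's actual code.
# Confirmed by the thread: C and POSIX map to english, and the built-in
# default is simple. Every other entry below is an assumed example.
TS_CONFIG_BY_LANGUAGE = {
    "c": "english",
    "posix": "english",
    "en": "english",   # assumed entry
    "de": "german",    # assumed entry
    "fr": "french",    # assumed entry
}

BUILT_IN_DEFAULT = "simple"


def pick_ts_config(lc_ctype: str) -> str:
    """Return a qualified text search configuration name for an lc_ctype value."""
    # Assumption: the language part is everything before '_', '.', or '@'.
    language = lc_ctype
    for stop in "_.@":
        language = language.split(stop, 1)[0]
    # Fall back to the built-in default when no table entry matches.
    name = TS_CONFIG_BY_LANGUAGE.get(language.lower(), BUILT_IN_DEFAULT)
    return "pg_catalog." + name


if __name__ == "__main__":
    for locale_name in ["C", "POSIX", "de_DE@euro", "ja_JP.eucJP"]:
        print(locale_name, "->", pick_ts_config(locale_name))
```

Under these assumptions, --no-locale (lc_ctype = C) lands on english because of the explicit table entry, while a locale with no entry, such as a Japanese one in this toy table, falls through to simple.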
> Tatsuo Ishii <ishii@postgresql.org> writes:
> > When I run initdb -E EUC_JP --no-locale, I find the following in my
> > postgresql.conf:
>
> > default_text_search_config = 'pg_catalog.english'
>
> > The manual says:
>
> > default_text_search_config (string)
>
> > Selects the text search configuration that is used by those
> > variants of the text search functions that do not have an explicit
> > argument specifying the configuration. See Chapter 12 for further
> > information. The built-in default is pg_catalog.simple, but initdb
> > will initialize the configuration file with a setting that
> > corresponds to the chosen lc_ctype locale, if a configuration
> > matching that locale can be identified.
>
> > So I thought the initial value for it should be pg_catalog.simple,
> > rather than pg_catalog.english. If this is not a bug, what is the
> > idea behind lc_ctype = C corresponding to 'pg_catalog.english'?
> > When is pg_catalog.simple supposed to be used?
>
> Well, that documentation is correct as far as it goes; what it doesn't
> say is that initdb's mapping table explicitly maps C/POSIX locales to
> english. It seems like a reasonable default on this side of the water,
> but maybe I'm being too North-American-centric.

OK. Are you going to add "initdb's mapping table explicitly maps C/POSIX locales to english" to the docs? If not, I can do that part.

--
Tatsuo Ishii
SRA OSS, Inc. Japan
Tatsuo Ishii <ishii@postgresql.org> writes:
>> Well, that documentation is correct as far as it goes; what it doesn't
>> say is that initdb's mapping table explicitly maps C/POSIX locales to
>> english. It seems like a reasonable default on this side of the water,
>> but maybe I'm being too North-American-centric.

> OK. Are you going to add "initdb's mapping table explicitly maps
> C/POSIX locales to english" to the docs? If not, I can do that part.

Before we worry about documenting the behavior, are you happy about it? What could be done differently? I'm wondering if it makes any sense to consider the specified database encoding while making the text-search decision ...

regards, tom lane
> Tatsuo Ishii <ishii@postgresql.org> writes:
> >> Well, that documentation is correct as far as it goes; what it doesn't
> >> say is that initdb's mapping table explicitly maps C/POSIX locales to
> >> english. It seems like a reasonable default on this side of the water,
> >> but maybe I'm being too North-American-centric.
>
> > OK. Are you going to add "initdb's mapping table explicitly maps
> > C/POSIX locales to english" to the docs? If not, I can do that part.
>
> Before we worry about documenting the behavior, are you happy
> about it? What could be done differently? I'm wondering if it makes
> any sense to consider the specified database encoding while making
> the text-search decision ...

To me, the idea that a text-search configuration maps to a locale/language seems totally wrong. IMO an encoding/charset can include several languages, and a text-search configuration should be mapped to an encoding/charset rather than to a language. Apparently this will not happen in the near future, however.

The good thing is that the english text-search configuration can handle multibyte characters, so I can live with the current text-search implementation.

--
Tatsuo Ishii
SRA OSS, Inc. Japan
Tatsuo Ishii <ishii@postgresql.org> wrote:

> To me, the idea that a text-search configuration maps to a
> locale/language seems totally wrong. IMO an encoding/charset can
> include several languages, and a text-search configuration should be
> mapped to an encoding/charset rather than to a language.

I think mapping by encoding/charset *is* totally wrong and mapping by locale is reasonable. How do you treat LATIN1? It can be used for French, German, and so on. Moreover, UTF-8 can be used for almost all languages.

The tight mapping of EUC_JP <=> Japanese is a special case in the world.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
> Tatsuo Ishii <ishii@postgresql.org> wrote:
>
> > To me, the idea that a text-search configuration maps to a
> > locale/language seems totally wrong. IMO an encoding/charset can
> > include several languages, and a text-search configuration should be
> > mapped to an encoding/charset rather than to a language.
>
> I think mapping by encoding/charset *is* totally wrong and mapping by
> locale is reasonable. How do you treat LATIN1? It can be used for
> French, German, and so on. Moreover, UTF-8 can be used for almost all
> languages.
>
> The tight mapping of EUC_JP <=> Japanese is a special case in the world.

What? I didn't say that an encoding/charset maps to a single language. Actually, EUC_JP includes Japanese, English (ASCII), Greek, Cyrillic and so on. So for full text search to process EUC_JP text properly, it must be able to process multiple languages at once.

You know that PostgreSQL allows only one locale per cluster, and the fact that text search depends on the locale prevents it from processing multi-language text.

The only solution I can think of today is to create a new parser that can process EUC_JP properly (I mean one that can process not only Japanese but also English) and use it on a C locale/EUC_JP cluster. I would do this for 8.4 if I have time.

--
Tatsuo Ishii
SRA OSS, Inc. Japan
Tatsuo Ishii <ishii@postgresql.org> wrote:

> You know that PostgreSQL allows only one locale per cluster, and the
> fact that text search depends on the locale prevents it from
> processing multi-language text.
>
> The only solution I can think of today is to create a new parser that
> can process EUC_JP properly (I mean one that can process not only
> Japanese but also English) and use it on a C locale/EUC_JP cluster.
> I would do this for 8.4 if I have time.

The correct solution is probably to have multiple locales in a single database cluster. We should set the locale after deciding the encoding now, but I think the current implementation is wrong because the locale depends on the encoding, but the opposite is not true.
(locale = 'language_country.*encoding*')

If you are going to work on multi-language text-search support, we had better get the locale issue done first. It might affect your new parser.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
> The correct solution is probably to have multiple locales in a
> single database cluster. We should set the locale after deciding
> the encoding now, but I think the current implementation is wrong
> because the locale depends on the encoding, but the opposite is not
> true.
> (locale = 'language_country.*encoding*')
>
> If you are going to work on multi-language text-search support, we
> had better get the locale issue done first. It might affect your new
> parser.

I'm not sure the locale-per-database solution is a silver bullet. Even with it, we still cannot solve the case where, for example, LATIN1-encoded text includes several languages at once and thus needs multiple locales. Nor can we have columns or tables in several different languages at the same time, because that requires multiple locales. The same can be said of Unicode. All in all, it seems a half-baked solution to me.

--
Tatsuo Ishii
SRA OSS, Inc. Japan
> I'm not sure the locale-per-database solution is a silver bullet.
> Even with it, we still cannot solve the case where, for example,
> LATIN1-encoded text includes several languages at once and thus needs
> multiple locales. Nor can we have columns or tables in several
> different languages at the same time, because that requires multiple
> locales. The same can be said of Unicode. All in all, it seems a
> half-baked solution to me.

There is only one correct solution: support for collations (COLLATE). With COLLATE you can choose the locale per database, per table, per column, and per operation. This is one area where PostgreSQL lags behind others.

Pavel Stehule
Tatsuo Ishii <ishii@postgresql.org> writes:
> You know that PostgreSQL allows only one locale per cluster, and the
> fact that text search depends on the locale prevents it from
> processing multi-language text.

I think you are confusing the capabilities of tsearch with the fact that we have to pick one default setting. There's nothing that stops you from using a search configuration that includes multiple dictionaries for different languages.

regards, tom lane
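[Editor's sketch] The point about one configuration holding dictionaries for several languages can be illustrated with a toy Python model of a dictionary chain: each dictionary is consulted in turn until one recognizes the token, and a stop word yields no lexemes. This is not tsearch's implementation. The word lists and the names word_list_dictionary, simple_dictionary, and lexize are hypothetical and exist only for this example.

```python
from typing import Callable, Dict, List, Optional

# A dictionary returns a list of lexemes for a token it recognizes
# (an empty list marks a stop word) or None for an unknown token.
Dictionary = Callable[[str], Optional[List[str]]]


def word_list_dictionary(entries: Dict[str, List[str]]) -> Dictionary:
    """Build a toy dictionary that only recognizes the listed words."""
    def lookup(token: str) -> Optional[List[str]]:
        return entries.get(token.lower())
    return lookup


def simple_dictionary(token: str) -> Optional[List[str]]:
    """Catch-all fallback: recognizes every token and lowercases it."""
    return [token.lower()]


def lexize(token: str, chain: List[Dictionary]) -> List[str]:
    """Consult each dictionary in order; the first one that recognizes the token wins."""
    for dictionary in chain:
        lexemes = dictionary(token)
        if lexemes is not None:
            return lexemes
    return []  # no dictionary recognized the token, so it is dropped


# Hypothetical word lists for two languages, followed by the fallback.
english = word_list_dictionary({"houses": ["house"], "the": []})
german = word_list_dictionary({"katzen": ["katze"], "und": []})
chain = [english, german, simple_dictionary]

if __name__ == "__main__":
    for word in ["Houses", "Katzen", "und", "PostgreSQL"]:
        print(word, "->", lexize(word, chain))
```

The order of the chain matters in this model: a catch-all dictionary has to come last, otherwise the language-specific ones after it are never consulted.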
On Fri, 5 Oct 2007, Tom Lane wrote:

> Tatsuo Ishii <ishii@postgresql.org> writes:
>> You know that PostgreSQL allows only one locale for a PostgreSQL
>> cluster, and the fact that text-search being depending on locale
>> prevent it from processing multi language text.
>
> I think you are confusing the capabilities of tsearch with the fact
> that we have to pick one default setting. There's nothing that
> stops you from using a search configuration that includes multiple
> dictionaries for different languages.

Exactly!

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83