Thread: tsearch in core patch

tsearch in core patch

From

Teodor Sigaev

Date:

21 June 2007, 14:44:43

http://www.sigaev.ru/misc/tsearch_core-0.52.gz

Plan was:

1) rename FULLTEXT to TEXT SEARCH in SQL command
done

2) rework Snowball stemmers as Tom suggested
done

3) ALTER FULLTEXT CONFIGURATION cfgname ADD/ALTER/DROP MAPPING
done

4) remove support of default configuration per scheme. Default configuration will be only one per locale.
done

5) single encoded files. That will touch snowball, ispell, synonym, thesaurus and simple dictionaries
done

6) use encoding names instead of locale's names in configuration
Ugh. I missed that knowing the encoding doesn't allow determining the exact
language --- how many languages use the ISO8859-1 locale? So, it's not done. Tom
pointed out that the locale name isn't portable, but there aren't many names for
the same locale (ru_RU.UTF-8, ru_RU.UTF8 for example). So it's possible to use an
array of locales instead of one name.

I didn't see comments about security hole pointed by Tom, so I repeat:

About security holes in PARSER/DICTIONARY. I see following ways to resolve it now:
1) Allow to superuser only to do CREATE/ALTER/DROP PARSER/DICTIONARY
    Disadvantage: hosting users will not be able to change dictionaries
2) Remove CREATE/ALTER/DROP PARSER, split pg_ts_dict to pg_ts_dict_template
    and pg_ts_dict and accordingly change CREATE/ALTER/DROP DICTIONARY
    Disadvantage: parser and dictionary's template will not dump/restore,
                  it should be restored manually (just a INSERT into
                  pg_ts_parser/pg_ts_dict_template)
3) Similar to previous point, but:
    * CREATE/ALTER/DROP PARSER - super-user only
    * CREATE/ALTER/DROP DICTIONARY TEMPLATE - super-user only
    * CREATE/ALTER/DROP DICTIONARY - allowed to non-superuser
    Disadvantage: new command CREATE/ALTER/DROP DICTIONARY TEMPLATE

Which way do we choose? or I miss some variant?

I would like to go by 3) way... Comments?

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
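
The privilege split proposed in option 3 can be sketched as a small lookup. This is only an illustration of the rule as stated in the message above, not code from the patch: the function name `may_run_ddl` and the two sets are hypothetical.

```python
# Hypothetical model of option 3: which text search object kinds
# require superuser rights for CREATE/ALTER/DROP.
SUPERUSER_ONLY = {"PARSER", "DICTIONARY TEMPLATE"}
ANY_USER = {"DICTIONARY"}


def may_run_ddl(object_kind: str, is_superuser: bool) -> bool:
    """Return True if a role may CREATE/ALTER/DROP the given object kind."""
    kind = object_kind.upper()
    if kind in SUPERUSER_ONLY:
        return is_superuser
    if kind in ANY_USER:
        return True
    raise ValueError(f"unknown object kind: {object_kind!r}")
```

Under this rule a hosting user could still define a DICTIONARY, while PARSER and DICTIONARY TEMPLATE stay with the superuser.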

Re: tsearch in core patch

From

Hannu Krosing

Date:

21 June 2007, 16:55:11

On one fine day, Thu, 2007-06-21 at 21:44, Teodor Sigaev wrote:
> http://www.sigaev.ru/misc/tsearch_core-0.52.gz
> 
> Plan was:
> 
> 1) rename FULLTEXT to TEXT SEARCH in SQL command
> done
> 
> 2) rework Snowball stemmer's as Tom suggested
> done
> 
> 3) ALTER FULLTEXT CONFIGURATION cfgname ADD/ALTER/DROP MAPPING
> done

Why not rename ALTER FULLTEXT CONFIGURATION --> ALTER TEXT SEARCH
CONFIGURATION here too ?

> 4) remove support of default configuration per scheme. Default configuration
>     will be only one per locale.
> done
> 
> 5) single encoded files. That will touch snowball, ispell, synonym, thesaurus
>     and simple dictionaries
> done
> 
> 6) use encoding names instead of locale's names in configuration
> Ugh. I missed that knowledge of encoding doesn't allow to determine exact 
> language

most languages can be written using UNICODE charset and UTF-8 encoding,
so neither charset nor encoding can be used to determine language.

>  --- how do many languages use ISO8859-1 locale?. 

ISO8859-1 is encoding, not locale.

> So, it's not done. Tom 
> pointed that locale's name isn't portable, but there isn't a lot of names of the 
> same locale (ru_RU.UTF-8, ru_RU.UTF8 for example). So it's possible to use array 
> of locales instead of one name.
> 
> I didn't see comments about security hole pointed by Tom, so I repeat:
> 
> About security holes in PARSER/DICTIONARY. I see following ways to resolve it now:
> 1) Allow to superuser only to do CREATE/ALTER/DROP PARSER/DICTIONARY
>     Disadvantage: hosting users will not be able to change dictionaries
> 2) Remove CREATE/ALTER/DROP PARSER, split pg_ts_dict to pg_ts_dict_template
>     and pg_ts_dict and accordingly change CREATE/ALTER/DROP DICTIONARY
>     Disadvantage: parser and dictionary's template will not dump/restore,
>                   it should be restored manually (just a INSERT into
>                   pg_ts_parser/pg_ts_dict_template)
> 3) Similar to previous point, but:
>     * CREATE/ALTER/DROP PARSER - super-user only
>     * CREATE/ALTER/DROP DICTIONARY TEMPLATE - super-user only
>     * CREATE/ALTER/DROP DICTIONARY - allowed to non-superuser
>     Disadvantage: new command CREATE/ALTER/DROP DICTIONARY TEMPLATE
> Which way do we choose? or I miss some variant?
> 
> I would like to go by 3) way... Comments?
>

Re: tsearch in core patch

From

Tom Lane

Date:

21 June 2007, 17:10:36

Hannu Krosing <hannu@skype.net> writes:
> On one fine day, Thu, 2007-06-21 at 21:44, Teodor Sigaev wrote:
>> 6) use encoding names instead of locale's names in configuration
>> Ugh. I missed that knowledge of encoding doesn't allow to determine exact 
>> language

> most languages can be written using UNICODE charset and UTF-8 encoding,
> so neither charset nor encoding can be used to determine language.

The recommendation I was making was to use the language name, not the
encoding name, in the user-visible configuration.
        regards, tom lane

Re: tsearch in core patch

From

Teodor Sigaev

Date:

22 June 2007, 06:06:05

>> 3) ALTER FULLTEXT CONFIGURATION cfgname ADD/ALTER/DROP MAPPING
>> done
> Why not rename ALTER FULLTEXT CONFIGURATION --> ALTER TEXT SEARCH
> CONFIGURATION here too ?

It's renamed too.

> most languages can be written using UNICODE charset and UTF-8 encoding,
> so neither charset nor encoding can be used to determine language.
yes


>>  --- how do many languages use ISO8859-1 locale?.
> ISO8859-1 is encoding, not locale.

I meant, if we'll use encoding name (for example PG_LATIN1) we couldn't 
distinguish languages which use that encoding (for example italian and finnish 
and some more), but using locale names it's possible: it_IT.ISO8859-1, 
fi_FI.ISO8859-1

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/

Re: tsearch in core patch

From

Teodor Sigaev

Date:

22 June 2007, 06:20:43

> The recommendation I was making was to use the language name, not the
> encoding name, in the user-visible configuration.
How does it determine language of db automatically?

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/

Re: tsearch in core patch

From

Bruce Momjian

Date:

22 June 2007, 10:54:49

Teodor Sigaev wrote:
> > The recommendation I was making was to use the language name, not the
> > encoding name, in the user-visible configuration.

> How does it determine language of db automatically?

I don't think we are going to do language selection automatically ---
the user is going to have to set tsearch_conf_name.

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +

Re: tsearch in core patch

From

Teodor Sigaev

Date:

22 June 2007, 11:12:45

> I don't think we are going to do language selection automatically ---
> the user is going to have to set tsearch_conf_name.

Are you suggesting removing a long-lived feature of tsearch? In that case we don't 
need cfglocale (or cfglanguage as Tom suggested) and cfgdefault columns in 
pg_ts_cfg at all. Just set up tsearch_conf_name.
-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/

Re: tsearch in core patch

From

Tom Lane

Date:

22 June 2007, 11:28:44

Teodor Sigaev <teodor@sigaev.ru> writes:
>> I don't think we are going to do language selection automatically ---
>> the user is going to have to set tsearch_conf_name.

> Are you suggesting removing a long-lived feature of tsearch? In that case we don't 
> need cfglocale (or cfglanguage as Tom suggested) and cfgdefault columns in 
> pg_ts_cfg at all. Just set up tsearch_conf_name.

Is the point here for initdb to be able to establish a sane default
initially?  Seems to me it can guess the language from the first
component of the locale (ru_RU -> russian).
        regards, tom lane

Re: tsearch in core patch

From

Alvaro Herrera

Date:

22 June 2007, 11:28:49

Teodor Sigaev wrote:

> >> --- how do many languages use ISO8859-1 locale?. 
> > ISO8859-1 is encoding, not locale.
> 
> I meant, if we'll use encoding name (for example PG_LATIN1) we couldn't 
> distinguish languages which use that encoding (for example italian and 
> finnish and some more), but using locale names it's possible: 
> it_IT.ISO8859-1, fi_FI.ISO8859-1

I don't understand.  Why use "it_IT.ISO8859-1"?  You just need to know
the language, so "it" is enough.  The _IT part specifies that it's the
italian spoken in Italy.  This may be irrelevant in most cases, but
consider that pt_PT and pt_BR are AFAIK somewhat different languages.

I very much doubt that the different spanishes are any different in the
stemming rules, so there's no need for es_ES, es_PE, es_AR, es_CL etc;
but in the case of portuguese I'm not so sure.  Maybe there are other
examples (like chinese, but I'm not sure how useful is tsearch for
chinese).

And the .ISO8859-1 part you don't need at all if you accept that the
files are UTF8 by design, as Tom proposed.

-- 
Alvaro Herrera                          Developer, http://www.PostgreSQL.org/
"Nadie esta tan esclavizado como el que se cree libre no siendolo" (Goethe)

Re: tsearch in core patch

From

Tom Lane

Date:

22 June 2007, 11:34:45

Alvaro Herrera <alvherre@commandprompt.com> writes:
> I very much doubt that the different spanishes are any different in the
> stemming rules, so there's no need for es_ES, es_PE, es_AR, es_CL etc;
> but in the case of portuguese I'm not so sure.  Maybe there are other
> examples (like chinese, but I'm not sure how useful is tsearch for
> chinese).

> And the .ISO8859-1 part you don't need at all if you accept that the
> files are UTF8 by design, as Tom proposed.

Also, the problem we're dealing with here is mainly lack of
standardization of the encoding part of locale names.  AFAIK, just about
everybody agrees on "es_ES", "ru_RU", etc; it's the part that comes
after that (if any) that is not too consistent across platforms.
So I see no problem in distinguishing between pt_PT and pt_BR if it
turns out we have to.  The trick is to not look at any more of the
locale name than that; and if we standardize on "stopword files are
UTF8" then I don't think we need to.
        regards, tom lane
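
The "don't look at any more of the locale name than that" rule can be sketched as follows. This is a minimal illustration, not code from the patch: `locale_base` is a hypothetical helper, and stripping an `@modifier` suffix is an added assumption that the thread does not discuss.

```python
def locale_base(locale_name: str) -> str:
    """Keep only the language_TERRITORY part of a locale name.

    The encoding suffix after '.' is dropped because its spelling varies
    across platforms (ru_RU.UTF-8 vs ru_RU.UTF8). Dropping an '@modifier'
    suffix is an assumption added for this sketch.
    """
    without_modifier = locale_name.split("@", 1)[0]
    return without_modifier.split(".", 1)[0]
```

With this, `ru_RU.UTF-8` and `ru_RU.UTF8` both reduce to `ru_RU`, while `pt_PT` and `pt_BR` remain distinguishable.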

Re: tsearch in core patch

From

Bruce Momjian

Date:

22 June 2007, 11:46:56

Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > I very much doubt that the different spanishes are any different in the
> > stemming rules, so there's no need for es_ES, es_PE, es_AR, es_CL etc;
> > but in the case of portuguese I'm not so sure.  Maybe there are other
> > examples (like chinese, but I'm not sure how useful is tsearch for
> > chinese).
> 
> > And the .ISO8859-1 part you don't need at all if you accept that the
> > files are UTF8 by design, as Tom proposed.
> 
> Also, the problem we're dealing with here is mainly lack of
> standardization of the encoding part of locale names.  AFAIK, just about
> everybody agrees on "es_ES", "ru_RU", etc; it's the part that comes
> after that (if any) that is not too consistent across platforms.
> So I see no problem in distinguishing between pt_PT and pt_BR if it
> turns out we have to.  The trick is to not look at any more of the
> locale name than that; and if we standardize on "stopword files are
> UTF8" then I don't think we need to.

OK, and the open question is when do we do this default setting.  If we
do it in initdb then we can isolate all the detection there.

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +

Re: tsearch in core patch

From

Oleg Bartunov

Date:

22 June 2007, 12:03:14

On Fri, 22 Jun 2007, Bruce Momjian wrote:

> Tom Lane wrote:
>> Alvaro Herrera <alvherre@commandprompt.com> writes:
>>> I very much doubt that the different spanishes are any different in the
>>> stemming rules, so there's no need for es_ES, es_PE, es_AR, es_CL etc;
>>> but in the case of portuguese I'm not so sure.  Maybe there are other
>>> examples (like chinese, but I'm not sure how useful is tsearch for
>>> chinese).
>>
>>> And the .ISO8859-1 part you don't need at all if you accept that the
>>> files are UTF8 by design, as Tom proposed.
>>
>> Also, the problem we're dealing with here is mainly lack of
>> standardization of the encoding part of locale names.  AFAIK, just about
>> everybody agrees on "es_ES", "ru_RU", etc; it's the part that comes
>> after that (if any) that is not too consistent across platforms.
>> So I see no problem in distinguishing between pt_PT and pt_BR if it
>> turns out we have to.  The trick is to not look at any more of the
>> locale name than that; and if we standardize on "stopword files are
>> UTF8" then I don't think we need to.
>
> OK, and the open question is when do we do this default setting.  If we
> do it in initdb then we can isolate all the detection there.

We can do that at initdb time, but we still have to decide how to map
human-readable language name and lang part of locale name. Are we going
to hardcode it ?

It's not friendly for hosting solutions, where people often have no access
to postgresql.conf, so they need to remember to set tsearch_conf_name.
It could be solved using the 'alter user ... set tsearch_conf_name' command, though.

    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: tsearch in core patch

From

Magnus Hagander

Date:

22 June 2007, 12:03:27

Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
>> I very much doubt that the different spanishes are any different in the
>> stemming rules, so there's no need for es_ES, es_PE, es_AR, es_CL etc;
>> but in the case of portuguese I'm not so sure.  Maybe there are other
>> examples (like chinese, but I'm not sure how useful is tsearch for
>> chinese).
> 
>> And the .ISO8859-1 part you don't need at all if you accept that the
>> files are UTF8 by design, as Tom proposed.
> 
> Also, the problem we're dealing with here is mainly lack of
> standardization of the encoding part of locale names.  AFAIK, just about
> everybody agrees on "es_ES", "ru_RU", etc; it's the part that comes
> after that (if any) that is not too consistent across platforms.

That may have been true until we started supporting Windows...
Swedish_Sweden.1252 is what I get on my machine, for example. Principle
is the same, but values certainly aren't.

//Magnus

Re: tsearch in core patch

From

Alvaro Herrera

Date:

22 June 2007, 12:25:02

Magnus Hagander wrote:
> Tom Lane wrote:
> > Alvaro Herrera <alvherre@commandprompt.com> writes:
> >> I very much doubt that the different spanishes are any different in the
> >> stemming rules, so there's no need for es_ES, es_PE, es_AR, es_CL etc;
> >> but in the case of portuguese I'm not so sure.  Maybe there are other
> >> examples (like chinese, but I'm not sure how useful is tsearch for
> >> chinese).
> > 
> >> And the .ISO8859-1 part you don't need at all if you accept that the
> >> files are UTF8 by design, as Tom proposed.
> > 
> > Also, the problem we're dealing with here is mainly lack of
> > standardization of the encoding part of locale names.  AFAIK, just about
> > everybody agrees on "es_ES", "ru_RU", etc; it's the part that comes
> > after that (if any) that is not too consistent across platforms.
> 
> That may have been true until we started supporting Windows...
> Swedish_Sweden.1252 is what I get on my machine, for example. Principle
> is the same, but values certainly aren't.

Well, at least the name is not itself translated, so a mapping table is
not right out of the question.  If they had put a name like
"Español_Chile" instead of "Spanish_Chile" we would be in serious
trouble.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: tsearch in core patch

From

Michael Glaesemann

Date:

22 June 2007, 12:34:19

On Jun 22, 2007, at 9:28 , Tom Lane wrote:

> Is the point here for initdb to be able to establish a sane default
> initially?  Seems to me it can guess the language from the first
> component of the locale (ru_RU -> russian).

How would this work for initdb with locale C?

Michael Glaesemann
grzm seespotcode net

Re: tsearch in core patch

From

teodor@sigaev.ru

Date:

22 June 2007, 12:38:32

>> That may have been true until we started supporting Windows...
>> Swedish_Sweden.1252 is what I get on my machine, for example. Principle
>> is the same, but values certainly aren't.
>
> Well, at least the name is not itself translated, so a mapping table is
> not right out of the question.  If they had put a name like
> "Español_Chile" instead of "Spanish_Chile" we would be in serious
> trouble.
I don't think so; in the opposite case you couldn't type or display it to change
the locale :).

So, final proposal:
rename cfglocale to cfglanguages and store in it an array of language names
taken from the first part of locale names:
russian   '{ru_RU, Russian_Russia}'
spanish   '{es_ES, es_CL, Spanish_Spain, Spanish_Chile}'

Comments?

Are there any obstacles to using GIN indexes in pg_catalog?
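
The array-per-configuration layout proposed above can be sketched in a few lines. The dictionary below is a hypothetical stand-in for the proposed cfglanguages column, and `config_for_locale` is an illustrative name; the actual catalog lookup in the patch may work differently.

```python
# Hypothetical contents of the proposed cfglanguages column:
# configuration name -> array of locale-name prefixes.
CFG_LANGUAGES = {
    "russian": ["ru_RU", "Russian_Russia"],
    "spanish": ["es_ES", "es_CL", "Spanish_Spain", "Spanish_Chile"],
}


def config_for_locale(locale_name: str):
    """Find the configuration whose array contains the locale's first part."""
    base = locale_name.split(".", 1)[0]  # ignore the encoding suffix
    for cfg_name, prefixes in CFG_LANGUAGES.items():
        if base in prefixes:
            return cfg_name
    return None  # no configuration lists this locale
```

The lookup is an array-membership test over every row, which is presumably why the question about GIN indexes in pg_catalog comes up.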

Re: tsearch in core patch

From

Bruce Momjian

Date:

22 June 2007, 12:46:48

Michael Glaesemann wrote:
> 
> On Jun 22, 2007, at 9:28 , Tom Lane wrote:
> 
> > Is the point here for initdb to be able to establish a sane default
> > initially?  Seems to me it can guess the language from the first
> > component of the locale (ru_RU -> russian).
> 
> How would this work for initdb with locale C?

Yea, that's a problem.  I am thinking we should just avoid the entire
issue and require it to be set by the user, and throw an error if the
configuration is not set.

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +

Re: tsearch in core patch

From

Alvaro Herrera

Date:

22 June 2007, 12:53:14

teodor@sigaev.ru wrote:

> So, final proposal:
> rename cfglocale to cfglanguages and store in it an array of language names
> taken from the first part of locale names:
> russian   '{ru_RU, Russian_Russia}'
> spanish   '{es_ES, es_CL, Spanish_Spain, Spanish_Chile}'
> 
> Comments?

Why not do it the other way around?
es_ES        spanish
Spanish_Spain    spanish
ru_RU        russian
pt_BR        portuguese_brazil

That way you don't need any funny index.  Or do you need the list of
locales for each language? (but even if you do, you can easily obtain it
by indexing both columns separately using btrees anyway)

-- 
Alvaro Herrera                               http://www.PlanetPostgreSQL.org/
"I can see support will not be a problem.  10 out of 10."    (Simon Wittber)
(http://archives.postgresql.org/pgsql-general/2004-12/msg00159.php)
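
The inverted layout suggested above, one row per locale name, can be sketched like this. The table contents are the four rows from the message; `locales_by_language` is a hypothetical helper showing that the per-language list remains recoverable without an array column.

```python
from collections import defaultdict

# One row per locale name, as in the message above.
LOCALE_TO_LANGUAGE = {
    "es_ES": "spanish",
    "Spanish_Spain": "spanish",
    "ru_RU": "russian",
    "pt_BR": "portuguese_brazil",
}


def locales_by_language(mapping):
    """Group locale names by language (the reverse direction)."""
    grouped = defaultdict(list)
    for locale_name, language in mapping.items():
        grouped[language].append(locale_name)
    return dict(grouped)
```

The forward lookup is a plain key lookup, and the reverse direction is a simple grouping, which corresponds to indexing each column separately.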

Re: tsearch in core patch

From

Tatsuo Ishii

Date:

22 June 2007, 12:53:15

> On Jun 22, 2007, at 9:28 , Tom Lane wrote:
> 
> > Is the point here for initdb to be able to establish a sane default
> > initially?  Seems to me it can guess the language from the first
> > component of the locale (ru_RU -> russian).
> 
> How would this work for initdb with locale C?

I'm worrying about that too.
--
Tatsuo Ishii
SRA OSS, Inc. Japan

Re: tsearch in core patch

From

teodor@sigaev.ru

Date:

22 June 2007, 13:12:31

> Why not do it the other way around?
> es_ES        spanish
> Spanish_Spain    spanish
> ru_RU        russian
> pt_BR        portuguese_brazil
>
> That way you don't need any funny index.  Or do you need the list of
> locales for each language? (but even if you do, you can easily obtain it
> by indexing both columns separately using btrees anyway)

Yes, that's possible, but that increases the number of identical configurations:
russian_win     Russian_Russia
russian_unix    ru_RU

They don't differ except in locale name.

Re: tsearch in core patch

From

Alvaro Herrera

Date:

22 June 2007, 13:18:21

teodor@sigaev.ru wrote:
> > Why not do it the other way around?
> > es_ES        spanish
> > Spanish_Spain    spanish
> > ru_RU        russian
> > pt_BR        portuguese_brazil
> >
> > That way you don't need any funny index.  Or do you need the list of
> > locales for each language? (but even if you do, you can easily obtain it
> > by indexing both columns separately using btrees anyway)
> 
> Yes, that's possible, but that increases the number of identical configurations:
> russian_win     Russian_Russia
> russian_unix    ru_RU
> 
> They don't differ except in locale name.

But why do you need them to be different at all?  Just make it

russian     Russian_Russia
russian     ru_RU

Does that not work for some reason?

What I was really suggesting was having a table mapping locale names
into "tsearch languages".  Then the configuration could be made based on
the language, not on the locale name.  So the stopword list is for
"russian", regardless of whether the locale is Russian_Russia or ru_RU.

Is this only for the stopword list, or does it also affect selecting a
stemmer?

Note: it's possible that the stopword list is different for brazilian
portuguese than portuguese portuguese, which is why I was suggesting
using a language "portuguese_brazil" and not just "portuguese".  Whereas
you need a single stopword list for all the countries speaking spanish,
which is why you need only one language called spanish.

-- 
Alvaro Herrera                        http://www.advogato.org/person/alvherre
"Llegará una época en la que una investigación diligente y prolongada sacará
a la luz cosas que hoy están ocultas" (Séneca, siglo I)

Re: tsearch in core patch

From

Tom Lane

Date:

22 June 2007, 13:38:19

Tatsuo Ishii <ishii@sraoss.co.jp> writes:
>> On Jun 22, 2007, at 9:28 , Tom Lane wrote:
>>> Is the point here for initdb to be able to establish a sane default
>>> initially?  Seems to me it can guess the language from the first
>>> component of the locale (ru_RU -> russian).
>> 
>> How would this work for initdb with locale C?

> I'm worrying about that too.

I would be surprised if C locale defaulted to anything except English.
I suppose it would be sensible to add a switch to allow people to select
a different language.  In any case, the only thing initdb would be doing
would be setting up an initial value of a table entry or GUC variable,
so you could always change it yourself later; it may not be worth
sweating too much about this.
        regards, tom lane
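
A sketch of the initdb-time default described above, assuming a hypothetical prefix table and an optional override switch. The names `LANGUAGE_BY_PREFIX` and `initial_config` are invented for illustration, and the English fallback for the C locale simply follows the suggestion in this message.

```python
# Hypothetical mapping from the first locale component to a configuration.
LANGUAGE_BY_PREFIX = {
    "ru": "russian",
    "es": "spanish",
    "it": "italian",
    "fi": "finnish",
}


def initial_config(locale_name: str, override=None):
    """Pick the initial text search configuration at initdb time."""
    if override is not None:
        return override  # an explicit switch always wins
    base = locale_name.split(".", 1)[0]
    if base in ("C", "POSIX"):
        return "english"  # assumed fallback, per the message above
    # ru_RU -> ru -> russian; None if the prefix is unknown
    return LANGUAGE_BY_PREFIX.get(base.split("_", 1)[0])
```

Since the result is only an initial value, a user can still change it later, as the message notes.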

Re: tsearch in core patch

From

Euler Taveira de Oliveira

Date:

23 June 2007, 15:01:54

Alvaro Herrera wrote:

> What I was really suggesting was having a table mapping locale names
> into "tsearch languages".  Then the configuration could be made based on
> the language, not on the locale name.  So the stopword list is for
> "russian", regardless of whether the locale is Russian_Russia or ru_RU.
> 
Agreed. But I'm afraid we couldn't map all of the locale names in a
right way. Man, it's a large list. ;)

> Is this only for the stopword list, or does it also affect selecting a
> stemmer?
> 
Both.

> Note: it's possible that the stopword list is different for brazilian
> portuguese than portuguese portuguese, which is why I was suggesting
> using a language "portuguese_brazil" and not just "portuguese".  Whereas
> you need a single stopword list for all the countries speaking spanish,
> which is why you need only one language called spanish.
> 
Indeed it's possible for portuguese, because we have some words that are
written in different ways, e.g.,
pt_BR     pt_PT     english
Mônica    Mónica    Monica
ação      acção     action
Irã       Irão      Iran
.
.
.

Will it be possible to disable stemming or stopwords removal? I'm asking
this 'cause sometimes stemming doesn't lead to good results and/or
stopwords are relevant. Maybe these could be GUC variables
('enable_stemming' and 'enable_stopwords').


--  Euler Taveira de Oliveira http://www.timbira.com/

Re: tsearch in core patch

From

Oleg Bartunov

Date:

23 June 2007, 16:12:23

On Sat, 23 Jun 2007, Euler Taveira de Oliveira wrote:

> Will it be possible to disable stemming or stopwords removal? I'm asking
> this 'cause sometimes stemming doesn't lead to good results and/or
> stopwords are relevant. Maybe these could be GUC variables
> ('enable_stemming' and 'enable_stopwords').

Just use another configuration.
    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: tsearch in core patch

From

Tatsuo Ishii

Date:

25 June 2007, 00:19:58

> I would be surprised if C locale defaulted to anything except English.

Don't be surprised. The mechanism of collation is too simple for
Japanese Kanji, and locale is not useful for Japanese anyway. That's
why Japanese installations of PostgreSQL tend to use C locale.
--
Tatsuo Ishii
SRA OSS, Inc. Japan

> I suppose it would be sensible to add a switch to allow people to select
> a different language.  In any case, the only thing initdb would be doing
> would be setting up an initial value of a table entry or GUC variable,
> so you could always change it yourself later; it may not be worth
> sweating too much about this.
> 
>             regards, tom lane

Re: tsearch in core patch

From

Teodor Sigaev

Date:

27 June 2007, 05:40:10

> But why do you need them to be different at all?  Just make it
> russian     Russian_Russia
> russian     ru_RU
> 
> Does that not work for some reason?
I'd like to have unique configuration names. So, if a user sets the GUC variable or 
calls a function with a configuration's name, then postgres should not have a choice 
--- it should use exactly the named configuration.

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/

Re: tsearch in core patch

From

Tom Lane

Date:

27 June 2007, 11:14:50

Teodor Sigaev <teodor@sigaev.ru> writes:
>> But why do you need them to be different at all?  Just make it
>> russian     Russian_Russia
>> russian     ru_RU
>> 
>> Does that not work for some reason?

> I'd like to have unique configuration names. So, if a user sets the GUC variable or 
> calls a function with a configuration's name, then postgres should not have a choice 
> --- it should use exactly the named configuration.

Sure, but the configuration name in this example is "russian", and it's
unique, no?
        regards, tom lane