Thread: tsearch in core patch
http://www.sigaev.ru/misc/tsearch_core-0.52.gz

Plan was:

1) rename FULLTEXT to TEXT SEARCH in SQL command
   done

2) rework Snowball stemmers as Tom suggested
   done

3) ALTER FULLTEXT CONFIGURATION cfgname ADD/ALTER/DROP MAPPING
   done

4) remove support of default configuration per schema. There will be only
   one default configuration per locale.
   done

5) single encoded files. That will touch snowball, ispell, synonym, thesaurus
   and simple dictionaries
   done

6) use encoding names instead of locale names in configuration
   Ugh. I missed that knowing the encoding doesn't allow us to determine the
   exact language --- how many languages use the ISO8859-1 locale? So, it's
   not done. Tom pointed out that locale names aren't portable, but there
   aren't many names for the same locale (ru_RU.UTF-8, ru_RU.UTF8 for
   example). So it's possible to use an array of locales instead of one name.

I didn't see comments about the security hole pointed out by Tom, so I repeat:

About security holes in PARSER/DICTIONARY. I see the following ways to resolve
it now:

1) Allow only superusers to do CREATE/ALTER/DROP PARSER/DICTIONARY
   Disadvantage: hosting users will not be able to change dictionaries
2) Remove CREATE/ALTER/DROP PARSER, split pg_ts_dict into pg_ts_dict_template
   and pg_ts_dict, and change CREATE/ALTER/DROP DICTIONARY accordingly
   Disadvantage: parsers and dictionary templates will not dump/restore;
   they must be restored manually (just an INSERT into
   pg_ts_parser/pg_ts_dict_template)
3) Similar to the previous point, but:
   * CREATE/ALTER/DROP PARSER - superuser only
   * CREATE/ALTER/DROP DICTIONARY TEMPLATE - superuser only
   * CREATE/ALTER/DROP DICTIONARY - allowed to non-superusers
   Disadvantage: new command CREATE/ALTER/DROP DICTIONARY TEMPLATE

Which way do we choose? Or did I miss some variant?

I would like to go with way 3)... Comments?

--
Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/

On a fine day, Thu, 2007-06-21 at 21:44, Teodor Sigaev wrote:
> http://www.sigaev.ru/misc/tsearch_core-0.52.gz
>
> Plan was:
>
> 1) rename FULLTEXT to TEXT SEARCH in SQL command
> done
>
> 2) rework Snowball stemmers as Tom suggested
> done
>
> 3) ALTER FULLTEXT CONFIGURATION cfgname ADD/ALTER/DROP MAPPING
> done

Why not rename ALTER FULLTEXT CONFIGURATION --> ALTER TEXT SEARCH
CONFIGURATION here too?

> 4) remove support of default configuration per schema. There will be only
> one default configuration per locale.
> done
>
> 5) single encoded files. That will touch snowball, ispell, synonym, thesaurus
> and simple dictionaries
> done
>
> 6) use encoding names instead of locale names in configuration
> Ugh. I missed that knowing the encoding doesn't allow us to determine the
> exact language

Most languages can be written using the UNICODE charset and UTF-8 encoding,
so neither charset nor encoding can be used to determine language.

> --- how many languages use the ISO8859-1 locale?

ISO8859-1 is an encoding, not a locale.

> So, it's not done. Tom pointed out that locale names aren't portable, but
> there aren't many names for the same locale (ru_RU.UTF-8, ru_RU.UTF8 for
> example). So it's possible to use an array of locales instead of one name.
>
> I didn't see comments about the security hole pointed out by Tom, so I repeat:
>
> About security holes in PARSER/DICTIONARY. I see the following ways to
> resolve it now:
> 1) Allow only superusers to do CREATE/ALTER/DROP PARSER/DICTIONARY
> Disadvantage: hosting users will not be able to change dictionaries
> 2) Remove CREATE/ALTER/DROP PARSER, split pg_ts_dict into pg_ts_dict_template
> and pg_ts_dict, and change CREATE/ALTER/DROP DICTIONARY accordingly
> Disadvantage: parsers and dictionary templates will not dump/restore;
> they must be restored manually (just an INSERT into
> pg_ts_parser/pg_ts_dict_template)
> 3) Similar to the previous point, but:
> * CREATE/ALTER/DROP PARSER - superuser only
> * CREATE/ALTER/DROP DICTIONARY TEMPLATE - superuser only
> * CREATE/ALTER/DROP DICTIONARY - allowed to non-superusers
> Disadvantage: new command CREATE/ALTER/DROP DICTIONARY TEMPLATE
> Which way do we choose? Or did I miss some variant?
>
> I would like to go with way 3)... Comments?
>

Hannu Krosing <hannu@skype.net> writes:
> On a fine day, Thu, 2007-06-21 at 21:44, Teodor Sigaev wrote:
>> 6) use encoding names instead of locale names in configuration
>> Ugh. I missed that knowing the encoding doesn't allow us to determine the
>> exact language

> Most languages can be written using the UNICODE charset and UTF-8 encoding,
> so neither charset nor encoding can be used to determine language.

The recommendation I was making was to use the language name, not the
encoding name, in the user-visible configuration.

regards, tom lane

>> 3) ALTER FULLTEXT CONFIGURATION cfgname ADD/ALTER/DROP MAPPING
>> done
> Why not rename ALTER FULLTEXT CONFIGURATION --> ALTER TEXT SEARCH
> CONFIGURATION here too?

It's renamed too.

> Most languages can be written using the UNICODE charset and UTF-8 encoding,
> so neither charset nor encoding can be used to determine language.

Yes.

>> --- how many languages use the ISO8859-1 locale?
> ISO8859-1 is an encoding, not a locale.

I meant that if we use an encoding name (for example PG_LATIN1), we can't
distinguish the languages that use that encoding (for example italian and
finnish and some more), but with locale names it's possible:
it_IT.ISO8859-1, fi_FI.ISO8859-1

--
Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/

> The recommendation I was making was to use the language name, not the
> encoding name, in the user-visible configuration.

How would it determine the language of the db automatically?

--
Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/

Teodor Sigaev wrote:
> > The recommendation I was making was to use the language name, not the
> > encoding name, in the user-visible configuration.
> How would it determine the language of the db automatically?

I don't think we are going to do language selection automatically ---
the user is going to have to set tsearch_conf_name.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +

> I don't think we are going to do language selection automatically ---
> the user is going to have to set tsearch_conf_name.

Are you suggesting that we remove a long-standing feature of tsearch? In that
case we don't need the cfglocale (or cfglanguage, as Tom suggested) and
cfgdefault columns in pg_ts_cfg at all. Just set tsearch_conf_name.

--
Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/

Teodor Sigaev <teodor@sigaev.ru> writes:
>> I don't think we are going to do language selection automatically ---
>> the user is going to have to set tsearch_conf_name.

> Are you suggesting that we remove a long-standing feature of tsearch? In that
> case we don't need the cfglocale (or cfglanguage, as Tom suggested) and
> cfgdefault columns in pg_ts_cfg at all. Just set tsearch_conf_name.

Is the point here for initdb to be able to establish a sane default
initially? Seems to me it can guess the language from the first
component of the locale (ru_RU -> russian).

regards, tom lane

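A minimal sketch of the guess described above (ru_RU -> russian), written in Python for brevity rather than in initdb's C. The table `LANGUAGE_BY_CODE` and the function name `guess_language` are hypothetical, and the language names on the right are illustrative only; nothing here is taken from the patch.

```python
# Hypothetical table: two-letter language codes -> configuration names.
# The entries are illustrative, not a list from the tsearch patch.
LANGUAGE_BY_CODE = {
    "ru": "russian",
    "es": "spanish",
    "it": "italian",
    "fi": "finnish",
}


def guess_language(locale_name):
    """Guess a language from the part of the locale name before the first '_'."""
    code = locale_name.split("_", 1)[0].lower()
    # None signals that no guess is possible and the user has to choose.
    return LANGUAGE_BY_CODE.get(code)


print(guess_language("ru_RU"))
print(guess_language("it_IT.ISO8859-1"))
print(guess_language("C"))
```

Returning `None` instead of a default keeps the "no guess possible" case visible to the caller, which matters for names that carry no language component at all.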
Teodor Sigaev wrote:
> >> --- how many languages use the ISO8859-1 locale?
> > ISO8859-1 is an encoding, not a locale.
>
> I meant that if we use an encoding name (for example PG_LATIN1), we can't
> distinguish the languages that use that encoding (for example italian and
> finnish and some more), but with locale names it's possible:
> it_IT.ISO8859-1, fi_FI.ISO8859-1

I don't understand. Why use "it_IT.ISO8859-1"? You just need to know
the language, so "it" is enough. The _IT part specifies that it's the
italian spoken in Italy. This may be irrelevant in most cases, but
consider that pt_PT and pt_BR are AFAIK somewhat different languages.

I very much doubt that the different spanishes are any different in the
stemming rules, so there's no need for es_ES, es_PE, es_AR, es_CL etc;
but in the case of portuguese I'm not so sure. Maybe there are other
examples (like chinese, but I'm not sure how useful is tsearch for
chinese).

And the .ISO8859-1 part you don't need at all if you accept that the
files are UTF8 by design, as Tom proposed.

--
Alvaro Herrera Developer, http://www.PostgreSQL.org/
"Nadie esta tan esclavizado como el que se cree libre no siendolo" (Goethe)

Alvaro Herrera <alvherre@commandprompt.com> writes:
> I very much doubt that the different spanishes are any different in the
> stemming rules, so there's no need for es_ES, es_PE, es_AR, es_CL etc;
> but in the case of portuguese I'm not so sure. Maybe there are other
> examples (like chinese, but I'm not sure how useful is tsearch for
> chinese).

> And the .ISO8859-1 part you don't need at all if you accept that the
> files are UTF8 by design, as Tom proposed.

Also, the problem we're dealing with here is mainly lack of
standardization of the encoding part of locale names. AFAIK, just about
everybody agrees on "es_ES", "ru_RU", etc; it's the part that comes
after that (if any) that is not too consistent across platforms.
So I see no problem in distinguishing between pt_PT and pt_BR if it
turns out we have to. The trick is to not look at any more of the
locale name than that; and if we standardize on "stopword files are
UTF8" then I don't think we need to.

regards, tom lane

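The "don't look at any more of the locale name than that" rule can be sketched as follows, assuming names of the usual `language_TERRITORY.codeset@modifier` shape. The function name `strip_encoding` is hypothetical, and the sketch is in Python only for brevity.

```python
def strip_encoding(locale_name):
    """Keep 'language_TERRITORY' and drop any '.codeset' or '@modifier' suffix."""
    for separator in (".", "@"):
        locale_name = locale_name.split(separator, 1)[0]
    return locale_name


# The two spellings of the same locale mentioned earlier in the thread
# collapse to one key, while pt_PT and pt_BR stay distinguishable.
print(strip_encoding("ru_RU.UTF-8"))
print(strip_encoding("ru_RU.UTF8"))
print(strip_encoding("pt_BR"))
```

Because only the prefix survives, the inconsistently spelled codeset part never takes part in the comparison.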
Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > I very much doubt that the different spanishes are any different in the
> > stemming rules, so there's no need for es_ES, es_PE, es_AR, es_CL etc;
> > but in the case of portuguese I'm not so sure. Maybe there are other
> > examples (like chinese, but I'm not sure how useful is tsearch for
> > chinese).
>
> > And the .ISO8859-1 part you don't need at all if you accept that the
> > files are UTF8 by design, as Tom proposed.
>
> Also, the problem we're dealing with here is mainly lack of
> standardization of the encoding part of locale names. AFAIK, just about
> everybody agrees on "es_ES", "ru_RU", etc; it's the part that comes
> after that (if any) that is not too consistent across platforms.
> So I see no problem in distinguishing between pt_PT and pt_BR if it
> turns out we have to. The trick is to not look at any more of the
> locale name than that; and if we standardize on "stopword files are
> UTF8" then I don't think we need to.

OK, and the open question is when do we do this default setting. If we
do it in initdb then we can isolate all the detection there.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +

On Fri, 22 Jun 2007, Bruce Momjian wrote:
> Tom Lane wrote:
>> Alvaro Herrera <alvherre@commandprompt.com> writes:
>>> I very much doubt that the different spanishes are any different in the
>>> stemming rules, so there's no need for es_ES, es_PE, es_AR, es_CL etc;
>>> but in the case of portuguese I'm not so sure. Maybe there are other
>>> examples (like chinese, but I'm not sure how useful is tsearch for
>>> chinese).
>>
>>> And the .ISO8859-1 part you don't need at all if you accept that the
>>> files are UTF8 by design, as Tom proposed.
>>
>> Also, the problem we're dealing with here is mainly lack of
>> standardization of the encoding part of locale names. AFAIK, just about
>> everybody agrees on "es_ES", "ru_RU", etc; it's the part that comes
>> after that (if any) that is not too consistent across platforms.
>> So I see no problem in distinguishing between pt_PT and pt_BR if it
>> turns out we have to. The trick is to not look at any more of the
>> locale name than that; and if we standardize on "stopword files are
>> UTF8" then I don't think we need to.
>
> OK, and the open question is when do we do this default setting. If we
> do it in initdb then we can isolate all the detection there.

We can do that at initdb time, but we still have to decide how to map
human-readable language names to the language part of locale names. Are we
going to hardcode it?

It's not friendly for hosting solutions, where people often have no
access to postgresql.conf, so they need to remember to set
tsearch_conf_name. It could be solved using the 'alter user ... set
tsearch_conf_name' command, though.

Regards, Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
>> I very much doubt that the different spanishes are any different in the
>> stemming rules, so there's no need for es_ES, es_PE, es_AR, es_CL etc;
>> but in the case of portuguese I'm not so sure. Maybe there are other
>> examples (like chinese, but I'm not sure how useful is tsearch for
>> chinese).
>
>> And the .ISO8859-1 part you don't need at all if you accept that the
>> files are UTF8 by design, as Tom proposed.
>
> Also, the problem we're dealing with here is mainly lack of
> standardization of the encoding part of locale names. AFAIK, just about
> everybody agrees on "es_ES", "ru_RU", etc; it's the part that comes
> after that (if any) that is not too consistent across platforms.

That may have been true until we started supporting Windows...
Swedish_Sweden.1252 is what I get on my machine, for example. Principle
is the same, but values certainly aren't.

//Magnus

Magnus Hagander wrote:
> Tom Lane wrote:
> > Alvaro Herrera <alvherre@commandprompt.com> writes:
> >> I very much doubt that the different spanishes are any different in the
> >> stemming rules, so there's no need for es_ES, es_PE, es_AR, es_CL etc;
> >> but in the case of portuguese I'm not so sure. Maybe there are other
> >> examples (like chinese, but I'm not sure how useful is tsearch for
> >> chinese).
> >
> >> And the .ISO8859-1 part you don't need at all if you accept that the
> >> files are UTF8 by design, as Tom proposed.
> >
> > Also, the problem we're dealing with here is mainly lack of
> > standardization of the encoding part of locale names. AFAIK, just about
> > everybody agrees on "es_ES", "ru_RU", etc; it's the part that comes
> > after that (if any) that is not too consistent across platforms.
>
> That may have been true until we started supporting Windows...
> Swedish_Sweden.1252 is what I get on my machine, for example. Principle
> is the same, but values certainly aren't.

Well, at least the name is not itself translated, so a mapping table is
not right out of the question. If they had put a name like
"Español_Chile" instead of "Spanish_Chile" we would be in serious
trouble.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

On Jun 22, 2007, at 9:28 , Tom Lane wrote:

> Is the point here for initdb to be able to establish a sane default
> initially? Seems to me it can guess the language from the first
> component of the locale (ru_RU -> russian).

How would this work for initdb with locale C?

Michael Glaesemann
grzm seespotcode net

>> That may have been true until we started supporting Windows...
>> Swedish_Sweden.1252 is what I get on my machine, for example. Principle
>> is the same, but values certainly aren't.
>
> Well, at least the name is not itself translated, so a mapping table is
> not right out of the question. If they had put a name like
> "Español_Chile" instead of "Spanish_Chile" we would be in serious
> trouble.

I don't think so; in the opposite case you couldn't type or display it to
change the locale :).

So, final proposal:
rename cfglocale to cfglanguages and store in it an array of language names
which is produced from the first part of locale names:
russian  '{ru_RU, Russian_Russia}'
spanish  '{es_ES, es_CL, Spanish_Spain, Spanish_Chile}'

Comments?

Are there any obstacles to using GIN indexes in pg_catalog?

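A rough model of the array proposal above, in Python for brevity: each configuration carries the list of locale prefixes it serves, and finding the default means asking which list contains the current prefix. The dictionary `CFG_LANGUAGES` and the function `find_configuration` are hypothetical stand-ins for the proposed cfglanguages column and its lookup. That the GIN question concerns indexing exactly this kind of containment search is an assumption, not something the message states.

```python
# Hypothetical stand-in for the proposed cfglanguages column:
# configuration name -> array of locale-name prefixes.
CFG_LANGUAGES = {
    "russian": ["ru_RU", "Russian_Russia"],
    "spanish": ["es_ES", "es_CL", "Spanish_Spain", "Spanish_Chile"],
}


def find_configuration(locale_prefix):
    """Return the configuration whose array contains the given prefix."""
    for cfgname, prefixes in CFG_LANGUAGES.items():
        if locale_prefix in prefixes:
            return cfgname
    return None  # no configuration claims this locale


print(find_configuration("Russian_Russia"))
print(find_configuration("es_CL"))
```

The linear scan is fine for a sketch; with many configurations, a lookup of this "array contains value" form is where an index would start to matter.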
Michael Glaesemann wrote:
>
> On Jun 22, 2007, at 9:28 , Tom Lane wrote:
>
> > Is the point here for initdb to be able to establish a sane default
> > initially? Seems to me it can guess the language from the first
> > component of the locale (ru_RU -> russian).
>
> How would this work for initdb with locale C?

Yea, that's a problem. I am thinking we should just avoid the entire issue
and require it to be set by the user, and throw an error if the
configuration is not set.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +

teodor@sigaev.ru wrote:

> So, final proposal:
> rename cfglocale to cfglanguages and store in it an array of language names
> which is produced from the first part of locale names:
> russian  '{ru_RU, Russian_Russia}'
> spanish  '{es_ES, es_CL, Spanish_Spain, Spanish_Chile}'
>
> Comments?

Why not do it the other way around?

es_ES          spanish
Spanish_Spain  spanish
ru_RU          russian
pt_BR          portuguese_brazil

That way you don't need any funny index. Or do you need the list of
locales for each language? (but even if you do, you can easily obtain it
by indexing both columns separately using btrees anyway)

--
Alvaro Herrera http://www.PlanetPostgreSQL.org/
"I can see support will not be a problem. 10 out of 10." (Simon Wittber)
(http://archives.postgresql.org/pgsql-general/2004-12/msg00159.php)

> On Jun 22, 2007, at 9:28 , Tom Lane wrote:
>
> > Is the point here for initdb to be able to establish a sane default
> > initially? Seems to me it can guess the language from the first
> > component of the locale (ru_RU -> russian).
>
> How would this work for initdb with locale C?

I'm worrying about that too.

--
Tatsuo Ishii
SRA OSS, Inc. Japan

> Why not do it the other way around?
> es_ES          spanish
> Spanish_Spain  spanish
> ru_RU          russian
> pt_BR          portuguese_brazil
>
> That way you don't need any funny index. Or do you need the list of
> locales for each language? (but even if you do, you can easily obtain it
> by indexing both columns separately using btrees anyway)

Yes, that's possible, but it increases the number of identical configurations:
russian_win   Russian_Russia
russian_unix  ru_RU

They don't differ except in the locale name.

teodor@sigaev.ru wrote:
> > Why not do it the other way around?
> > es_ES          spanish
> > Spanish_Spain  spanish
> > ru_RU          russian
> > pt_BR          portuguese_brazil
> >
> > That way you don't need any funny index. Or do you need the list of
> > locales for each language? (but even if you do, you can easily obtain it
> > by indexing both columns separately using btrees anyway)
>
> Yes, that's possible, but it increases the number of identical configurations:
> russian_win   Russian_Russia
> russian_unix  ru_RU
>
> They don't differ except in the locale name.

But why do you need them to be different at all? Just make it

russian  Russian_Russia
russian  ru_RU

Does that not work for some reason?

What I was really suggesting was having a table mapping locale names
into "tsearch languages". Then the configuration could be made based on
the language, not on the locale name. So the stopword list is for
"russian", regardless of whether the locale is Russian_Russia or ru_RU.

Is this only for the stopword list, or does it also affect selecting a
stemmer?

Note: it's possible that the stopword list is different for brazilian
portuguese than portuguese portuguese, which is why I was suggesting
using a language "portuguese_brazil" and not just "portuguese". Whereas
you need a single stopword list for all the countries speaking spanish,
which is why you need only one language called spanish.

--
Alvaro Herrera http://www.advogato.org/person/alvherre
"Llegará una época en la que una investigación diligente y prolongada sacará a
la luz cosas que hoy están ocultas" (Séneca, siglo I)

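The locale-to-language table suggested above can be modelled as a flat list of pairs, with the per-language list of locales derived from it on demand. This is a Python sketch for brevity; `LOCALE_TO_LANGUAGE`, `language_of` and `locales_of` are hypothetical names, and the rows are the examples from the thread plus the Russian_Russia pair.

```python
from collections import defaultdict

# One row per locale name; several rows may point at the same language.
LOCALE_TO_LANGUAGE = [
    ("es_ES", "spanish"),
    ("Spanish_Spain", "spanish"),
    ("ru_RU", "russian"),
    ("Russian_Russia", "russian"),
    ("pt_BR", "portuguese_brazil"),
]

# Forward lookup: locale name -> language (plays the role of the first index).
language_of = dict(LOCALE_TO_LANGUAGE)

# Reverse lookup: language -> locale names (plays the role of the second index).
locales_of = defaultdict(list)
for locale_name, language in LOCALE_TO_LANGUAGE:
    locales_of[language].append(locale_name)

print(language_of["Russian_Russia"])
print(locales_of["russian"])
```

With this layout a single configuration named "russian" serves both spellings of the locale, so no russian_win/russian_unix duplicates are needed.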
Tatsuo Ishii <ishii@sraoss.co.jp> writes:
>> On Jun 22, 2007, at 9:28 , Tom Lane wrote:
>>> Is the point here for initdb to be able to establish a sane default
>>> initially? Seems to me it can guess the language from the first
>>> component of the locale (ru_RU -> russian).
>>
>> How would this work for initdb with locale C?

> I'm worrying about that too.

I would be surprised if C locale defaulted to anything except English.
I suppose it would be sensible to add a switch to allow people to select
a different language. In any case, the only thing initdb would be doing
would be setting up an initial value of a table entry or GUC variable,
so you could always change it yourself later; it may not be worth
sweating too much about this.

regards, tom lane

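The selection order sketched in the message above could look like this: an explicit switch wins, the C locale falls back to English, and anything else is left to a locale-based guess. The function `initial_language`, its `override` parameter (standing in for the suggested switch) and the `guess` callback are all hypothetical; Python is used only for brevity.

```python
def initial_language(locale_name, guess, override=None):
    """Pick the language that would be recorded as the initial default."""
    if override is not None:
        # An explicit choice by the person running initdb always wins.
        return override
    if locale_name in ("C", "POSIX"):
        # These names carry no language information; fall back to English.
        return "english"
    # Otherwise defer to whatever locale-based guess is available.
    return guess(locale_name)


def no_guess(locale_name):
    """Placeholder guesser used for the demonstration below."""
    return None


print(initial_language("C", no_guess))
print(initial_language("C", no_guess, override="japanese"))
```

Since the result is only an initial value, a wrong pick can still be changed afterwards, as the message points out.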
Alvaro Herrera wrote:

> What I was really suggesting was having a table mapping locale names
> into "tsearch languages". Then the configuration could be made based on
> the language, not on the locale name. So the stopword list is for
> "russian", regardless of whether the locale is Russian_Russia or ru_RU.
>
Agreed. But I'm afraid we couldn't map all of the locale names correctly.
Man, it's a large list. ;)

> Is this only for the stopword list, or does it also affect selecting a
> stemmer?
>
Both.

> Note: it's possible that the stopword list is different for brazilian
> portuguese than portuguese portuguese, which is why I was suggesting
> using a language "portuguese_brazil" and not just "portuguese". Whereas
> you need a single stopword list for all the countries speaking spanish,
> which is why you need only one language called spanish.
>
Indeed it's possible for portuguese, because we have some words that are
written in different ways, e.g.,

pt_BR    pt_PT    english
Mônica   Mónica   Monica
ação     acção    action
Irã      Irão     Iran
. . .

Will it be possible to disable stemming or stopword removal? I'm asking
this 'cause sometimes stemming doesn't lead to good results and/or
stopwords are relevant. Maybe it could be GUC variables
('enable_stemming' and 'enable_stopwords').

--
Euler Taveira de Oliveira
http://www.timbira.com/

On Sat, 23 Jun 2007, Euler Taveira de Oliveira wrote:

> Will it be possible to disable stemming or stopword removal? I'm asking
> this 'cause sometimes stemming doesn't lead to good results and/or
> stopwords are relevant. Maybe it could be GUC variables
> ('enable_stemming' and 'enable_stopwords').

Just use another configuration.

Regards, Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

> I would be surprised if C locale defaulted to anything except English.

Don't be surprised. The mechanism of collation is too simple for
Japanese Kanji, and locale is not useful for Japanese anyway. That's why
Japanese installations of PostgreSQL tend to use C locale.

--
Tatsuo Ishii
SRA OSS, Inc. Japan

> I suppose it would be sensible to add a switch to allow people to select
> a different language. In any case, the only thing initdb would be doing
> would be setting up an initial value of a table entry or GUC variable,
> so you could always change it yourself later; it may not be worth
> sweating too much about this.
>
> regards, tom lane

> But why do you need them to be different at all? Just make it
> russian  Russian_Russia
> russian  ru_RU
>
> Does that not work for some reason?

I'd like to have unique configuration names. So, if a user sets the GUC
variable or calls a function with a configuration's name, then postgres
should not have a choice --- it should use exactly the indicated
configuration.

--
Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/

Teodor Sigaev <teodor@sigaev.ru> writes:
>> But why do you need them to be different at all? Just make it
>> russian  Russian_Russia
>> russian  ru_RU
>>
>> Does that not work for some reason?

> I'd like to have unique configuration names. So, if a user sets the GUC
> variable or calls a function with a configuration's name, then postgres
> should not have a choice --- it should use exactly the indicated
> configuration.

Sure, but the configuration name in this example is "russian", and it's
unique, no?

regards, tom lane