Thread: AW: like and optimization
> > I made a reproducible example of things going wrong with a "en_US"
> > locale which is the widely-used (single-byte) ISO-8859-1 Latin 1 charset.
>
> en_US uses multi-pass collation rules. It's those collation rules, not
> the charset per se, that cause the problem.

Just to understand things correctly. Is the Like optimization disabled
for all non-ASCII char sets, or (imho correctly) for non charset ordered
collations (LC_COLLATE) ?

Thus can you enable index optimization by simply setting LC_COLLATE to C
if your LANG is not set to C ?

Andreas

Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at> writes:
> Just to understand things correctly. Is the Like optimization disabled
> for all non-ASCII char sets, or (imho correctly) for non charset ordered
> collations (LC_COLLATE) ?

Currently it's disabled whenever LC_COLLATE is neither C nor POSIX.
We can add other names to the "OK" list as we verify that they are safe
(see locale_is_like_safe() in src/backend/utils/adt/selfuncs.c).

regards, tom lane

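The rule described above amounts to a whitelist lookup on the LC_COLLATE name. What follows is a minimal sketch in Python, not the code in selfuncs.c: the names LIKE_SAFE_COLLATIONS, collation_is_like_safe and current_collation_is_like_safe are hypothetical, and the list holds only "C" and "POSIX" because those are the only two names the message mentions.

```python
import locale

# Assumption: the "OK" list is a plain set of locale names.
# Only the two names given in the message are included.
LIKE_SAFE_COLLATIONS = {"C", "POSIX"}


def collation_is_like_safe(lc_collate):
    """Return True if the named collation is on the verified-safe list."""
    return lc_collate in LIKE_SAFE_COLLATIONS


def current_collation_is_like_safe():
    """Apply the same check to this process's current LC_COLLATE setting."""
    # setlocale(category, None) queries the setting without changing it.
    return collation_is_like_safe(locale.setlocale(locale.LC_COLLATE, None))


print(collation_is_like_safe("C"))
print(collation_is_like_safe("en_US"))
```

Any name that has not been explicitly verified, including en_US, is treated as unsafe, which matches the conservative default described in the message.
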
Tom Lane writes:

> Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at> writes:
> > Just to understand things correctly. Is the Like optimization disabled
> > for all non-ASCII char sets, or (imho correctly) for non charset ordered
> > collations (LC_COLLATE) ?
>
> Currently it's disabled whenever LC_COLLATE is neither C nor POSIX.
> We can add other names to the "OK" list as we verify that they are safe
> (see locale_is_like_safe() in src/backend/utils/adt/selfuncs.c).

I have pretty severe doubts that any locale for a language that uses the
Latin, Cyrillic, or Greek alphabets (i.e., those that are conceptually
similar to English) is like-optimization safe (for the optimization
algorithm in its current state), at least across all platforms.
Somewhere a vendor is going to adhere to some ISO standard and implement
the same multi-pass "letters first" rules that we observed in en_US.

There should be some extensive "stress test" that a locale should have to
pass before being labelled safe.

--
Peter Eisentraut      peter_e@gmx.net       http://yi.org/peter-e/

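To illustrate why a multi-pass collation can defeat the optimization, the sketch below assumes the optimizer rewrites LIKE 'prefix%' as a range condition low <= col < high, where high is the prefix with its last character incremented. That rewrite is an assumption here, since the thread does not spell out the algorithm. The toy multipass_key (alphanumerics first, raw string as tie-breaker) is loosely modeled on the "letters first" rules mentioned above and is not the real en_US definition; all function names are hypothetical.

```python
def prefix_range(prefix):
    """Return (low, high) so that LIKE 'prefix%' becomes low <= col < high."""
    # Assumed derivation: bump the last character of the prefix by one.
    return prefix, prefix[:-1] + chr(ord(prefix[-1]) + 1)


def bytewise_key(s):
    """C/POSIX-style collation: compare code points left to right."""
    return s


def multipass_key(s):
    """Toy 'letters first' collation: pass 1 ignores punctuation and case,
    pass 2 falls back to the raw string to break ties."""
    letters = "".join(ch for ch in s.lower() if ch.isalnum())
    return (letters, s)


def range_scan_finds(value, prefix, key):
    """Would an index ordered by `key` return `value` for LIKE 'prefix%'?"""
    low, high = prefix_range(prefix)
    return key(low) <= key(value) < key(high)


prefix, value = "a-", "a-z"
print(value.startswith(prefix))                        # row matches LIKE 'a-%'
print(range_scan_finds(value, prefix, bytewise_key))   # found under C collation
print(range_scan_finds(value, prefix, multipass_key))  # missed by the toy collation
```

Under the toy ordering the row 'a-z' matches the pattern but sorts after the upper bound 'a.', so a range scan would skip it. A locale stress test of the kind proposed could run comparisons like this over many prefixes and values.
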
Peter Eisentraut wrote:
>
> Tom Lane writes:
>
> > Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at> writes:
> > > Just to understand things correctly. Is the Like optimization disabled
> > > for all non-ASCII char sets, or (imho correctly) for non charset ordered
> > > collations (LC_COLLATE) ?
> >
> > Currently it's disabled whenever LC_COLLATE is neither C nor POSIX.
> > We can add other names to the "OK" list as we verify that they are safe
> > (see locale_is_like_safe() in src/backend/utils/adt/selfuncs.c).
>
> I have pretty severe doubts that any locale for a language that uses the
> Latin, Cyrillic, or Greek alphabets (i.e., those that are conceptually
> similar to English) is like-optimization safe (for the optimization
> algorithm in its current state), at least across all platforms.
> Somewhere a vendor is going to adhere to some ISO standard and implement
> the same multi-pass "letters first" rules that we observed in en_US.

Is there any possibility to use, in a portable way, only our own locale
definition files, without reimplementing all the sorting, uppercasing,
etc.?

If we had control over the locale definition contents we would be much
better off when optimizing as well.

And IIRC SQL9x prescribes support for multiple locales (or at least
multiple collating sequences) within one database simultaneously.

> There should be some extensive "stress test" that a locale should have to
> pass before being labelled safe.

Sure.

-------------
Hannu

Hannu Krosing <hannu@tm.ee> writes:
> Is there any possibility to use, in a portable way, only our own locale
> definition files, without reimplementing all the sorts uppercases etc. ?

AFAIK there is not --- the standard C library APIs do not specify how to
represent this information. Thus, we'd have to provide our own complete
implementation of locale-specific comparisons, etc, etc. Not to mention
acquiring all the raw data for the locale definitions.

I think we'd be nuts to try to develop and maintain our own
implementation of that. What we should probably think about is somehow
piggybacking on someone else's i18n library work, with just enough
tweaking of the source so that it can cope efficiently with N different
locales at runtime, instead of only one.

The situation is not too much different for timezones, BTW. Might make
sense to deal with both of those problems in the same way.

Are there any BSD-license locale and/or timezone libraries that we might
assimilate in this way? We could use an LGPL'd library if there is no
other alternative, but I'd just as soon not open up the license issue.

regards, tom lane

On Mon, Jan 22, 2001 at 05:46:09PM -0500, Tom Lane wrote:
> Hannu Krosing <hannu@tm.ee> writes:
> > Is there any possibility to use, in a portable way, only our own locale
> > definition files, without reimplementing all the sorts uppercases etc. ?
>
> The situation is not too much different for timezones, BTW. Might make
> sense to deal with both of those problems in the same way.

The timezone situation is much better, in that there is a separate
organization which maintains a timezone database and code to operate
on it. It wouldn't be necessary to include the package with PG, because
it can be obtained from a standard place. You would only need scripts
to download, build, and integrate it.

> Are there any BSD-license locale and/or timezone libraries that we might
> assimilate in this way? We could use an LGPL'd library if there is no
> other alternative, but I'd just as soon not open up the license issue.

POSIX systems include a set of commands for dumping locales in a standard
format, and building from them. Instead of shipping locales and code to
operate on them, one might include a script to run these tools (where
they exist) to dump an existing locale, edit it a bit, and build a more
PG-friendly locale.

Nathan Myers
ncm@zembu.com

> And IIRC SQL9x prescribe support for multiple locales (or at least
> multiple collating sequences) within one database simultaneously.

Sounds like the SQL92/99 COLLATE feature is the way we should go, IMHO.
--
Tatsuo Ishii

On Mon, Jan 22, 2001 at 05:46:09PM -0500, Tom Lane wrote:
...
> Are there any BSD-license locale and/or timezone libraries that we might
> assimilate in this way? We could use an LGPL'd library if there is no
> other alternative, but I'd just as soon not open up the license issue.

The "Citrus Project" is coming up with i18n for BSD.

FYI

Patrick

On Mon, Jan 22, 2001 at 03:09:03PM -0800, Nathan Myers wrote:
...
> Posix systems include a set of commands for dumping locales in a standard
> format, and building from them. Instead of shipping locales and code to
> operate on them, one might include a script to run these tools (where
> they exist) to dump an existing locale, edit it a bit, and build a more
> PG-friendly locale.

Is there really a standard format for locales? Apparently there are 3
different ways of doing LC_COLLATE ?!

Cheers,

Patrick