Thread: AW: like and optimization

AW: like and optimization

From
Zeugswetter Andreas SB
Date:
> > I made a reproducible example of things going wrong with an "en_US"
> > locale which is the widely-used (single-byte) ISO-8859-1 Latin 1 charset.
> 
> en_US uses multi-pass collation rules.  It's those collation rules, not
> the charset per se, that cause the problem.

Just to understand things correctly. Is the Like optimization disabled
for all non-ASCII char sets, or (imho correctly) for non charset ordered 
collations (LC_COLLATE) ?

So can you enable the index optimization simply by setting
LC_COLLATE to C, even if your LANG is not set to C?

Andreas


Re: AW: like and optimization

From
Tom Lane
Date:
Zeugswetter Andreas SB  <ZeugswetterA@wien.spardat.at> writes:
> Just to understand things correctly. Is the Like optimization disabled
> for all non-ASCII char sets, or (imho correctly) for non charset ordered 
> collations (LC_COLLATE) ?

Currently it's disabled whenever LC_COLLATE is neither C nor POSIX.
We can add other names to the "OK" list as we verify that they are safe
(see locale_is_like_safe() in src/backend/utils/adt/selfuncs.c).
        regards, tom lane
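
A minimal sketch of the rule described above, in Python rather than the C of
the backend. It is not the PostgreSQL source: the names SAFE_COLLATIONS,
fixed_prefix and plan_like are hypothetical, and only the "C or POSIX" safe
list comes from the message. Escape characters in the pattern are ignored
for brevity.

```python
import re

# Collation names considered safe for the LIKE prefix optimization.
# Per the message above, only "C" and "POSIX" are on the list so far.
SAFE_COLLATIONS = frozenset({"C", "POSIX"})


def locale_is_like_safe(lc_collate):
    """Return True if LIKE prefixes may be turned into index range scans."""
    return lc_collate in SAFE_COLLATIONS


def fixed_prefix(pattern):
    """Return the literal part of a LIKE pattern before the first wildcard."""
    # Assumption: no escape character handling, only '%' and '_' wildcards.
    return re.match(r"[^%_]*", pattern).group(0)


def plan_like(pattern, lc_collate):
    """Pick a (hypothetical) plan name for `col LIKE pattern`."""
    if fixed_prefix(pattern) and locale_is_like_safe(lc_collate):
        return "index range scan"
    return "sequential scan"
```

Under these assumptions, plan_like("abc%", "C") picks the range scan, while
the same pattern under "en_US" falls back to a sequential scan.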


Re: AW: like and optimization

From
Peter Eisentraut
Date:
Tom Lane writes:

> Zeugswetter Andreas SB  <ZeugswetterA@wien.spardat.at> writes:
> > Just to understand things correctly. Is the Like optimization disabled
> > for all non-ASCII char sets, or (imho correctly) for non charset ordered
> > collations (LC_COLLATE) ?
>
> Currently it's disabled whenever LC_COLLATE is neither C nor POSIX.
> We can add other names to the "OK" list as we verify that they are safe
> (see locale_is_like_safe() in src/backend/utils/adt/selfuncs.c).

I have pretty severe doubts that any locale for a language that uses the
Latin, Cyrillic, or Greek alphabets (i.e., those that are conceptually
similar to English) is like-optimization safe (for the optimization
algorithm in its current state), at least across all platforms.
Somewhere a vendor is going to adhere to some ISO standard and implement
the same multi-pass "letters first" rules that we observed in en_US.

There should be some extensive "stress test" that a locale should have to
pass before being labelled safe.

-- 
Peter Eisentraut      peter_e@gmx.net       http://yi.org/peter-e/
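
To make the failure mode and the proposed stress test concrete, here is a
small Python sketch. The two-pass key below is a simplified stand-in for
"letters first" rules, not the actual en_US definition on any platform, and
all function names are hypothetical. The range [prefix, prefix with its last
character incremented) is one common way to express the optimization; the
sketch assumes the last character of the prefix is not the highest code
point.

```python
from itertools import product


def byte_order_key(s):
    """Collation key for plain code point ordering (like LC_COLLATE=C)."""
    return s


def letters_first_key(s):
    """Simplified two-pass key: letters only first, raw string as tie-breaker."""
    return ("".join(c.lower() for c in s if c.isalpha()), s)


def prefix_bounds(prefix):
    """Return (lower, upper) so that lower <= s < upper under byte ordering."""
    return prefix, prefix[:-1] + chr(ord(prefix[-1]) + 1)


def missed_by_range_scan(prefix, strings, key):
    """Strings that match `prefix%` but fall outside the range under `key`."""
    lower, upper = prefix_bounds(prefix)
    lo, hi = key(lower), key(upper)
    return [s for s in strings
            if s.startswith(prefix) and not (lo <= key(s) < hi)]


def collation_passes_stress_test(key, alphabet="ab-.", max_len=3):
    """Exhaustively try every short string over `alphabet` as a prefix."""
    strings = ["".join(t)
               for n in range(1, max_len + 1)
               for t in product(alphabet, repeat=n)]
    return all(not missed_by_range_scan(p, strings, key) for p in strings)
```

With this stand-in collation, a range scan for the prefix "a-" misses the
row "a-z", because the first pass compares "az" against "a" and places the
row beyond the upper bound "a.". The exhaustive check passes for
byte_order_key and fails for letters_first_key. A real test would have to
use the platform's own comparison routine instead of these toy keys.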



Re: AW: like and optimization

From
Hannu Krosing
Date:
Peter Eisentraut wrote:
> 
> Tom Lane writes:
> 
> > Zeugswetter Andreas SB  <ZeugswetterA@wien.spardat.at> writes:
> > > Just to understand things correctly. Is the Like optimization disabled
> > > for all non-ASCII char sets, or (imho correctly) for non charset ordered
> > > collations (LC_COLLATE) ?
> >
> > Currently it's disabled whenever LC_COLLATE is neither C nor POSIX.
> > We can add other names to the "OK" list as we verify that they are safe
> > (see locale_is_like_safe() in src/backend/utils/adt/selfuncs.c).
> 
> I have pretty severe doubts that any locale for a language that uses the
> Latin, Cyrillic, or Greek alphabets (i.e., those that are conceptually
> similar to English) is like-optimization safe (for the optimization
> algorithm in its current state), at least across all platforms.
> Somewhere a vendor is going to adhere to some ISO standard and implement
> the same multi-pass "letters first" rules that we observed in en_US.

Is there any possibility to use, in a portable way, only our own locale 
definition files, without reimplementing all the sorts, uppercases, etc.?

If we had control over the locale definition contents we would be much
better off when optimizing as well.

And IIRC SQL9x prescribes support for multiple locales (or at least
multiple collating sequences) within one database simultaneously.

> There should be some extensive "stress test" that a locale should have to
> pass before being labelled safe.

Sure.

-------------
Hannu


Re: AW: like and optimization

From
Tom Lane
Date:
Hannu Krosing <hannu@tm.ee> writes:
> Is there any possibility to use, in a portable way, only our own locale 
> definition files, without reimplementing all the sorts, uppercases, etc.?

AFAIK there is not --- the standard C library APIs do not specify how to
represent this information.  Thus, we'd have to provide our own complete
implementation of locale-specific comparisons, etc, etc.  Not to mention
acquiring all the raw data for the locale definitions.

I think we'd be nuts to try to develop and maintain our own
implementation of that.  What we should probably think about is somehow
piggybacking on someone else's i18n library work, with just enough
tweaking of the source so that it can cope efficiently with N different
locales at runtime, instead of only one.

The situation is not too much different for timezones, BTW.  Might make
sense to deal with both of those problems in the same way.

Are there any BSD-license locale and/or timezone libraries that we might
assimilate in this way?  We could use an LGPL'd library if there is no
other alternative, but I'd just as soon not open up the license issue.
        regards, tom lane
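
As an illustration of what "N different locales at runtime" could look like
to a caller, here is a toy Python sketch. It is not based on any existing
i18n library: the registry, the collation names, and the key functions are
all hypothetical. The point is only that the collation is passed per call
instead of being a single process-wide setting.

```python
# Hypothetical registry mapping a collation name to a sort-key function.
COLLATIONS = {
    # Plain code point ordering, as with LC_COLLATE=C.
    "C": lambda s: s,
    # Toy two-pass ordering: letters only first, raw string as tie-breaker.
    "letters_first": lambda s: (
        "".join(c.lower() for c in s if c.isalpha()),
        s,
    ),
}


def sort_with(collation, values):
    """Sort `values` under the named collation without touching global state."""
    return sorted(values, key=COLLATIONS[collation])
```

Sorting ["b", "a-z", "B", "ab"] gives ["B", "a-z", "ab", "b"] under "C" and
["ab", "a-z", "B", "b"] under "letters_first", and both calls can be made
in the same process.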


Re: AW: like and optimization

From
ncm@zembu.com (Nathan Myers)
Date:
On Mon, Jan 22, 2001 at 05:46:09PM -0500, Tom Lane wrote:
> Hannu Krosing <hannu@tm.ee> writes:
> > Is there any possibility to use, in a portable way, only our own locale 
> > definition files, without reimplementing all the sorts, uppercases, etc.?
> 
> The situation is not too much different for timezones, BTW.  Might make
> sense to deal with both of those problems in the same way.

The timezone situation is much better, in that there is a separate
organization which maintains a timezone database and code to operate
on it.  It wouldn't be necessary to include the package with PG, 
because it can be got at a standard place.  You would only need 
scripts to download, build, and integrate it.

> Are there any BSD-license locale and/or timezone libraries that we might
> assimilate in this way?  We could use an LGPL'd library if there is no
> other alternative, but I'd just as soon not open up the license issue.

Posix systems include a set of commands for dumping locales in a standard 
format, and building from them.  Instead of shipping locales and code to 
operate on them, one might include a script to run these tools (where 
they exist) to dump an existing locale, edit it a bit, and build a more 
PG-friendly locale.

Nathan Myers
ncm@zembu.com


Re: AW: like and optimization

From
Tatsuo Ishii
Date:
> And IIRC SQL9x prescribes support for multiple locales (or at least
> multiple collating sequences) within one database simultaneously.

Sounds like the SQL92/99 COLLATE feature is the way we should go, IMHO.
--
Tatsuo Ishii


Re: AW: like and optimization

From
Patrick Welche
Date:
On Mon, Jan 22, 2001 at 05:46:09PM -0500, Tom Lane wrote:
... 
> Are there any BSD-license locale and/or timezone libraries that we might
> assimilate in this way?  We could use an LGPL'd library if there is no
> other alternative, but I'd just as soon not open up the license issue.

The "Citrus Project" is coming up with i18n for BSD.

FYI

Patrick


Re: AW: like and optimization

From
Patrick Welche
Date:
On Mon, Jan 22, 2001 at 03:09:03PM -0800, Nathan Myers wrote:
... 
> Posix systems include a set of commands for dumping locales in a standard 
> format, and building from them.  Instead of shipping locales and code to 
> operate on them, one might include a script to run these tools (where 
> they exist) to dump an existing locale, edit it a bit, and build a more 
> PG-friendly locale.

Is there really a standard format for locales? Apparently there are 3 different
ways of doing LC_COLLATE?!

Cheers,

Patrick