Thread: default locale considered harmful? (was Re: [GENERAL] Using index for "like 'ABC%'" type query)

Andrew Sullivan <andrew@libertyrms.info> writes:
>> [ can't LIKE use an index? ]

> Yes.  But locale is enabled by default now, and if you use anything
> other than C (and other-than-C is increasingly the default on the
> supported platforms), it won't work.  Just make sure that your locale
> is C when you do initdb.

I recall someone floating a proposal that initdb should by default
initialize the database in C locale, not whatever-it-finds-in-the-
environment.  To get a non-C locale you'd have to give an explicit
command-line switch --- essentially, reversing the sense of the present
"initdb --no-locale" option.

I'm beginning to think that would be a good idea, given the increasing
prevalence of en_US as a platform locale setting.  Comments?

            regards, tom lane


Re: default locale considered harmful? (was Re: [GENERAL]

From
Bruce Momjian
Date:
Tom Lane wrote:
> Andrew Sullivan <andrew@libertyrms.info> writes:
> >> [ can't LIKE use an index? ]
>
> > Yes.  But locale is enabled by default now, and if you use anything
> > other than C (and other-than-C is increasingly the default on the
> > supported platforms), it won't work.  Just make sure that your locale
> > is C when you do initdb.
>
> I recall someone floating a proposal that initdb should by default
> initialize the database in C locale, not whatever-it-finds-in-the-
> environment.  To get a non-C locale you'd have to give an explicit
> command-line switch --- essentially, reversing the sense of the present
> "initdb --no-locale" option.
>
> I'm beginning to think that would be a good idea, given the increasing
> prevalence of en_US as a platform locale setting.  Comments?

Agreed, or we could special-case en_US to be C locale.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: default locale considered harmful?

From
Tatsuo Ishii
Date:
> Andrew Sullivan <andrew@libertyrms.info> writes:
> >> [ can't LIKE use an index? ]
> 
> > Yes.  But locale is enabled by default now, and if you use anything
> > other than C (and other-than-C is increasingly the default on the
> > supported platforms), it won't work.  Just make sure that your locale
> > is C when you do initdb.
> 
> I recall someone floating a proposal that initdb should by default
> initialize the database in C locale, not whatever-it-finds-in-the-
> environment.  To get a non-C locale you'd have to give an explicit
> command-line switch --- essentially, reversing the sense of the present
> "initdb --no-locale" option.
> 
> I'm beginning to think that would be a good idea, given the increasing
> prevalence of en_US as a platform locale setting.  Comments?

I agree with that initdb should by default initialize the database in
C locale. I have found on a local list too many users in Japan are
suffered by the locale problem and I'm getting tired of saying "you
should not forget to explicitly specify --no-locale siwtch".
--
Tatsuo Ishii



Re: default locale considered harmful? (was Re: [GENERAL]

From
Peter Eisentraut
Date:
Tom Lane writes:

> I recall someone floating a proposal that initdb should by default
> initialize the database in C locale, not whatever-it-finds-in-the-
> environment.  To get a non-C locale you'd have to give an explicit
> command-line switch --- essentially, reversing the sense of the present
> "initdb --no-locale" option.

If you're concerned about speed, let's think about fixing the real
problems, not about disabling the feature altogether.  A while ago I
proposed an easy solution that made LIKE use an index based on strxfrm
order instead.  It was rejected on the grounds that it would prevent a
future enhancement of the LIKE mechanism to use the locale-enabled
collation order, but no one seems to be seriously interested in
implementing that.  I still have the patch; we can reconsider it if you
like.

(Btw., LIKE using the locale-enabled collation sequence is hardly going to
work, because most locales compare strings backwards from the end to the
start in the second pass, so something like LIKE 'foo%' can easily give
inconsistent results, since you don't know what the end of the string
really is.  It's better to think of pattern matching as
character-by-character matching.)

-- 
Peter Eisentraut   peter_e@gmx.net



On Fri, Apr 18, 2003 at 10:48:21AM -0400, Tom Lane wrote:
> prevalence of en_US as a platform locale setting.  Comments?

I tend to agree, especially since there's no real fix after you've
initdb'd.

--
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110


Re: default locale considered harmful? (was Re: [GENERAL] Using

From
Olleg Samojlov
Date:
Tom Lane wrote:
> I'm beginning to think that would be a good idea, given the increasing
> prevalence of en_US as a platform locale setting.  Comments?

Americans will be agree with you. :-)

IMHO:

LC_CTYPE  -\
LC_COLLATE - Ideally make client depending for international databases. 
But it seems impossible to use indexes. May be add parameter to CREATE 
INDEX, which locale (locales) use for sorting text data?

LC_NUMERIC client depending
LC_TIME  client depending, default ISO
LC_MONETARY not usefull, lacking monetary type
LC_MESSAGES client depending for notice (client messages), server 
depending for console or syslog messages, agree with default C

Summary: Exchange information with local services (stdout, syslog, 
files, etc) must be according local locale. With client - according 
client locale.

-- 
Olleg



Re: default locale considered harmful? (was Re: [GENERAL]

From
Bruce Momjian
Date:
Has the single-byte LIKE penalty been eliminated, so we don't need to
consider using C as the default locale for initdb, right?

If fixed, how was it done?

---------------------------------------------------------------------------

Peter Eisentraut wrote:
> Tom Lane writes:
> 
> > I recall someone floating a proposal that initdb should by default
> > initialize the database in C locale, not whatever-it-finds-in-the-
> > environment.  To get a non-C locale you'd have to give an explicit
> > command-line switch --- essentially, reversing the sense of the present
> > "initdb --no-locale" option.
> 
> If you're concerned about speed, let's think about fixing the real
> problems, not about disabling the feature altogether.  A while ago I
> proposed an easy solution that made LIKE use an index based on strxfrm
> order instead.  It was rejected on the grounds that it would prevent a
> future enhancement of the LIKE mechanism to use the locale-enabled
> collation order, but no one seems to be seriously interested in
> implementing that.  I still have the patch; we can reconsider it if you
> like.
> 
> (Btw., LIKE using the locale-enabled collation sequence is hardly going to
> work, because most locales compare strings backwards from the end to the
> start in the second pass, so something like LIKE 'foo%' can easily give
> inconsistent results, since you don't know what the end of the string
> really is.  It's better to think of pattern matching as
> character-by-character matching.)
> 
> -- 
> Peter Eisentraut   peter_e@gmx.net
> 
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
> 

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: default locale considered harmful? (was Re: [GENERAL]

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Has the single-byte LIKE penalty been eliminated, so we don't need to
> consider using C as the default locale for initdb, right?

I'm still of the opinion that we should make C the default locale.
But I'm not sure where the consensus is, so I've not made the change.

> If fixed, how was it done?

Peter has provided a hack whereby one can create a LIKE-supporting index
in a non-C locale.  But a *default* index in a non-C locale is still not
going to support LIKE ... and the hacked index will not support ordinary
comparison or ordering operators.  So I think there's still a lot left
to be desired here.
        regards, tom lane


Re: default locale considered harmful? (was Re: [GENERAL]

From
Peter Eisentraut
Date:
Tom Lane writes:

> Peter has provided a hack whereby one can create a LIKE-supporting index
> in a non-C locale.  But a *default* index in a non-C locale is still not
> going to support LIKE ... and the hacked index will not support ordinary
> comparison or ordering operators.  So I think there's still a lot left
> to be desired here.

I don't understand why you call this a hack.  Pattern matching and string
comparison simply work differently, so the proper solution is to use
different operator classes.  After all, that's what operator classes exist
for.  What is left to be desired?

-- 
Peter Eisentraut   peter_e@gmx.net



Re: default locale considered harmful? (was Re: [GENERAL]

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> I don't understand why you call this a hack.  Pattern matching and string
> comparison simply work differently, so the proper solution is to use
> different operator classes.  After all, that's what operator classes exist
> for.  What is left to be desired?

I think that a more general solution would be the ability to select a
locale (and hence a sort order) per-column, as the SQL spec envisions.
Then you'd just select C locale for columns you wanted to do pattern
matching for.

Admittedly, you'd still need the opclass-based approach for cases where
you wanted both pattern matching and a non-C-locale sort order ... but
I doubt that constitutes the majority of cases.

I guess my main concern is that we should not feel that this approach
takes the heat off us to support multiple locales.  As a solution to the
narrow problem of LIKE performance, it's okay --- but it's not getting
us any nearer to a solution to the general locale problem.
        regards, tom lane


Re: default locale considered harmful? (was Re: [GENERAL]

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Has the single-byte LIKE penalty been eliminated, so we don't need to
> > consider using C as the default locale for initdb, right?
> 
> I'm still of the opinion that we should make C the default locale.
> But I'm not sure where the consensus is, so I've not made the change.
> 
> > If fixed, how was it done?
> 
> Peter has provided a hack whereby one can create a LIKE-supporting index
> in a non-C locale.  But a *default* index in a non-C locale is still not
> going to support LIKE ... and the hacked index will not support ordinary
> comparison or ordering operators.  So I think there's still a lot left
> to be desired here.

So, my understanding is that you would create something such as:
CREATE INDEX iix ON tab (LIKE col)

and that does LIKE lookups and knows how to do col LIKE 'abc%', but it
can't be used for >= or ORDER BY, but it can be used for equality tests?

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: default locale considered harmful? (was Re: [GENERAL]

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> So, my understanding is that you would create something such as:
>     CREATE INDEX iix ON tab (LIKE col)
> and that does LIKE lookups and knows how to do col LIKE 'abc%', but it
> can't be used for >= or ORDER BY, but it can be used for equality tests?

Hm.  Right at the moment, it wouldn't be used for equality tests unless
you spelled equality as "a ~=~ b".  I wonder whether that's necessary
though; couldn't we dispense with that operator and use ordinary
equality as the BTEqual member of these opclasses?  Are there any
locales that claim that not-physically-identical strings are equal?
        regards, tom lane


Re: default locale considered harmful? (was Re: [GENERAL]

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > So, my understanding is that you would create something such as:
> >     CREATE INDEX iix ON tab (LIKE col)
> > and that does LIKE lookups and knows how to do col LIKE 'abc%', but it
> > can't be used for >= or ORDER BY, but it can be used for equality tests?
> 
> Hm.  Right at the moment, it wouldn't be used for equality tests unless
> you spelled equality as "a ~=~ b".  I wonder whether that's necessary
> though; couldn't we dispense with that operator and use ordinary
> equality as the BTEqual member of these opclasses?  Are there any
> locales that claim that not-physically-identical strings are equal?

Let me see if I understand.  

Our default indexes will be able to do =, >, <, ORDER BY, and the
special index will be able to do LIKE, ORDER BY, and maybe equals.  Do I
have that correct?

Looking at CVS, I see the warning about non-C locales has been removed. 
Should we instead mention the new LIKE index method?
# (Be sure to maintain the correspondence with locale_is_like_safe() in selfuncs.c.)if test x`pg_getlocale COLLATE` !=
xC&& test x`pg_getlocale COLLATE` != xPOSIX; then    echo "This locale setting will prevent the use of indexes for
patternmatching"    echo "operations.  If that is a concern, rerun $CMDNAME with the collation order"    echo "set to
\"C\". For more information see the Administrator's Guide."fi
 

Doing LIKE with single-byte encodings would be easy because it would be
only 256 compares to find the min/max char values, but that doesn't work
with multi-byte encodings, right?

This LIKE/encoding problem is a tricky one because it gives poor
performance with little warning to users.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: default locale considered harmful? (was Re: [GENERAL]

From
Peter Eisentraut
Date:
Tom Lane writes:

> I think that a more general solution would be the ability to select a
> locale (and hence a sort order) per-column, as the SQL spec envisions.

It is a general solution, but not for this problem.  The problem was to
make all locales equally suitable for certain optimizations, not to make
locales available in more places.  I won't pretend to anyone that this
little change will bring us anywhere closer to a solution for that other
problem.

> Then you'd just select C locale for columns you wanted to do pattern
> matching for.

That's wrong, for a number of reasons:

First of all, I don't agree at all that cases where you want both pattern
matching and collation are rare; in fact, I rarely see a case where you
don't want both.  Designing a system on that assumption is not sound,
because all operations should be equally possible in all situations.

Second, we will eventually want pattern matching operations to be locale
aware.  Case-sensitive matching needs this, because case mappings depend
on the locale.  The character class features of POSIX regexps also need
this.  So you cannot make locales and well-performing pattern matching
mutually exclusive.

Third, keep in mind that datums with different locales cannot be combined
liberally.  So systems built the way you propose become crippled in ways
that will be hard to understand and justify.

Finally, the locale of a datum should be a property that describes that
language of the stored data and that can be used for that specific purpose
without concerns and tradeoffs with the internal doings of the
optimization engine.

-- 
Peter Eisentraut   peter_e@gmx.net



Re: default locale considered harmful? (was Re: [GENERAL]

From
Peter Eisentraut
Date:
Tom Lane writes:

> Are there any locales that claim that not-physically-identical strings
> are equal?

In Unicode there are plenty such combinations.

-- 
Peter Eisentraut   peter_e@gmx.net



Re: default locale considered harmful? (was Re: [GENERAL]

From
Peter Eisentraut
Date:
Bruce Momjian writes:

> Our default indexes will be able to do =, >, <, ORDER BY, and the
> special index will be able to do LIKE, ORDER BY, and maybe equals.  Do I
> have that correct?

The default operator class supports comparisons (=, >, <, etc.) and ORDER
BY based on those operators.  The other operator class supports pattern
matching operations (LIKE, SIMILAR, POSIX regexps).

> Looking at CVS, I see the warning about non-C locales has been removed.
> Should we instead mention the new LIKE index method?

I don't see a need.  The old warning was mainly because once you
initdb'ed, you were basically stuck with your choice.  Now we have plenty
of options to query and adjust things later.

> Doing LIKE with single-byte encodings would be easy because it would be
> only 256 compares to find the min/max char values, but that doesn't work
> with multi-byte encodings, right?

This has nothing to do with encodings.

-- 
Peter Eisentraut   peter_e@gmx.net



Re: default locale considered harmful? (was Re: [GENERAL]

From
Bruce Momjian
Date:
Peter Eisentraut wrote:
> Bruce Momjian writes:
> 
> > Our default indexes will be able to do =, >, <, ORDER BY, and the
> > special index will be able to do LIKE, ORDER BY, and maybe equals.  Do I
> > have that correct?
> 
> The default operator class supports comparisons (=, >, <, etc.) and ORDER
> BY based on those operators.  The other operator class supports pattern
> matching operations (LIKE, SIMILAR, POSIX regexps).
> 
> > Looking at CVS, I see the warning about non-C locales has been removed.
> > Should we instead mention the new LIKE index method?
> 
> I don't see a need.  The old warning was mainly because once you
> initdb'ed, you were basically stuck with your choice.  Now we have plenty
> of options to query and adjust things later.

How are people going to know to use these special LIKE indexes?

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073