Thread: default locale considered harmful? (was Re: [GENERAL] Using index for "like 'ABC%'" type query)
default locale considered harmful? (was Re: [GENERAL] Using index for "like 'ABC%'" type query)
From
Tom Lane
Date:
Andrew Sullivan <andrew@libertyrms.info> writes: >> [ can't LIKE use an index? ] > Yes. But locale is enabled by default now, and if you use anything > other than C (and other-than-C is increasingly the default on the > supported platforms), it won't work. Just make sure that your locale > is C when you do initdb. I recall someone floating a proposal that initdb should by default initialize the database in C locale, not whatever-it-finds-in-the- environment. To get a non-C locale you'd have to give an explicit command-line switch --- essentially, reversing the sense of the present "initdb --no-locale" option. I'm beginning to think that would be a good idea, given the increasing prevalence of en_US as a platform locale setting. Comments? regards, tom lane
Tom Lane wrote: > Andrew Sullivan <andrew@libertyrms.info> writes: > >> [ can't LIKE use an index? ] > > > Yes. But locale is enabled by default now, and if you use anything > > other than C (and other-than-C is increasingly the default on the > > supported platforms), it won't work. Just make sure that your locale > > is C when you do initdb. > > I recall someone floating a proposal that initdb should by default > initialize the database in C locale, not whatever-it-finds-in-the- > environment. To get a non-C locale you'd have to give an explicit > command-line switch --- essentially, reversing the sense of the present > "initdb --no-locale" option. > > I'm beginning to think that would be a good idea, given the increasing > prevalence of en_US as a platform locale setting. Comments? Agreed, or we could special-case en_US to be C locale. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
> Andrew Sullivan <andrew@libertyrms.info> writes: > >> [ can't LIKE use an index? ] > > > Yes. But locale is enabled by default now, and if you use anything > > other than C (and other-than-C is increasingly the default on the > > supported platforms), it won't work. Just make sure that your locale > > is C when you do initdb. > > I recall someone floating a proposal that initdb should by default > initialize the database in C locale, not whatever-it-finds-in-the- > environment. To get a non-C locale you'd have to give an explicit > command-line switch --- essentially, reversing the sense of the present > "initdb --no-locale" option. > > I'm beginning to think that would be a good idea, given the increasing > prevalence of en_US as a platform locale setting. Comments? I agree with that initdb should by default initialize the database in C locale. I have found on a local list too many users in Japan are suffered by the locale problem and I'm getting tired of saying "you should not forget to explicitly specify --no-locale siwtch". -- Tatsuo Ishii
Tom Lane writes: > I recall someone floating a proposal that initdb should by default > initialize the database in C locale, not whatever-it-finds-in-the- > environment. To get a non-C locale you'd have to give an explicit > command-line switch --- essentially, reversing the sense of the present > "initdb --no-locale" option. If you're concerned about speed, let's think about fixing the real problems, not about disabling the feature altogether. A while ago I proposed an easy solution that made LIKE use an index based on strxfrm order instead. It was rejected on the grounds that it would prevent a future enhancement of the LIKE mechanism to use the locale-enabled collation order, but no one seems to be seriously interested in implementing that. I still have the patch; we can reconsider it if you like. (Btw., LIKE using the locale-enabled collation sequence is hardly going to work, because most locales compare strings backwards from the end to the start in the second pass, so something like LIKE 'foo%' can easily give inconsistent results, since you don't know what the end of the string really is. It's better to think of pattern matching as character-by-character matching.) -- Peter Eisentraut peter_e@gmx.net
Re: default locale considered harmful? (was Re: [GENERAL] Using index for "like 'ABC%'" type query)
From
Andrew Sullivan
Date:
On Fri, Apr 18, 2003 at 10:48:21AM -0400, Tom Lane wrote: > prevalence of en_US as a platform locale setting. Comments? I tend to agree, especially since there's no real fix after you've initdb'd. -- ---- Andrew Sullivan 204-4141 Yonge Street Liberty RMS Toronto, Ontario Canada <andrew@libertyrms.info> M2P 2A8 +1 416 646 3304 x110
Tom Lane wrote: > I'm beginning to think that would be a good idea, given the increasing > prevalence of en_US as a platform locale setting. Comments? Americans will be agree with you. :-) IMHO: LC_CTYPE -\ LC_COLLATE - Ideally make client depending for international databases. But it seems impossible to use indexes. May be add parameter to CREATE INDEX, which locale (locales) use for sorting text data? LC_NUMERIC client depending LC_TIME client depending, default ISO LC_MONETARY not usefull, lacking monetary type LC_MESSAGES client depending for notice (client messages), server depending for console or syslog messages, agree with default C Summary: Exchange information with local services (stdout, syslog, files, etc) must be according local locale. With client - according client locale. -- Olleg
Has the single-byte LIKE penalty been eliminated, so we don't need to consider using C as the default locale for initdb, right? If fixed, how was it done? --------------------------------------------------------------------------- Peter Eisentraut wrote: > Tom Lane writes: > > > I recall someone floating a proposal that initdb should by default > > initialize the database in C locale, not whatever-it-finds-in-the- > > environment. To get a non-C locale you'd have to give an explicit > > command-line switch --- essentially, reversing the sense of the present > > "initdb --no-locale" option. > > If you're concerned about speed, let's think about fixing the real > problems, not about disabling the feature altogether. A while ago I > proposed an easy solution that made LIKE use an index based on strxfrm > order instead. It was rejected on the grounds that it would prevent a > future enhancement of the LIKE mechanism to use the locale-enabled > collation order, but no one seems to be seriously interested in > implementing that. I still have the patch; we can reconsider it if you > like. > > (Btw., LIKE using the locale-enabled collation sequence is hardly going to > work, because most locales compare strings backwards from the end to the > start in the second pass, so something like LIKE 'foo%' can easily give > inconsistent results, since you don't know what the end of the string > really is. It's better to think of pattern matching as > character-by-character matching.) > > -- > Peter Eisentraut peter_e@gmx.net > > > ---------------------------(end of broadcast)--------------------------- > TIP 2: you can get off all lists at once with the unregister command > (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Has the single-byte LIKE penalty been eliminated, so we don't need to > consider using C as the default locale for initdb, right? I'm still of the opinion that we should make C the default locale. But I'm not sure where the consensus is, so I've not made the change. > If fixed, how was it done? Peter has provided a hack whereby one can create a LIKE-supporting index in a non-C locale. But a *default* index in a non-C locale is still not going to support LIKE ... and the hacked index will not support ordinary comparison or ordering operators. So I think there's still a lot left to be desired here. regards, tom lane
Tom Lane writes: > Peter has provided a hack whereby one can create a LIKE-supporting index > in a non-C locale. But a *default* index in a non-C locale is still not > going to support LIKE ... and the hacked index will not support ordinary > comparison or ordering operators. So I think there's still a lot left > to be desired here. I don't understand why you call this a hack. Pattern matching and string comparison simply work differently, so the proper solution is to use different operator classes. After all, that's what operator classes exist for. What is left to be desired? -- Peter Eisentraut peter_e@gmx.net
Peter Eisentraut <peter_e@gmx.net> writes: > I don't understand why you call this a hack. Pattern matching and string > comparison simply work differently, so the proper solution is to use > different operator classes. After all, that's what operator classes exist > for. What is left to be desired? I think that a more general solution would be the ability to select a locale (and hence a sort order) per-column, as the SQL spec envisions. Then you'd just select C locale for columns you wanted to do pattern matching for. Admittedly, you'd still need the opclass-based approach for cases where you wanted both pattern matching and a non-C-locale sort order ... but I doubt that constitutes the majority of cases. I guess my main concern is that we should not feel that this approach takes the heat off us to support multiple locales. As a solution to the narrow problem of LIKE performance, it's okay --- but it's not getting us any nearer to a solution to the general locale problem. regards, tom lane
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > Has the single-byte LIKE penalty been eliminated, so we don't need to > > consider using C as the default locale for initdb, right? > > I'm still of the opinion that we should make C the default locale. > But I'm not sure where the consensus is, so I've not made the change. > > > If fixed, how was it done? > > Peter has provided a hack whereby one can create a LIKE-supporting index > in a non-C locale. But a *default* index in a non-C locale is still not > going to support LIKE ... and the hacked index will not support ordinary > comparison or ordering operators. So I think there's still a lot left > to be desired here. So, my understanding is that you would create something such as: CREATE INDEX iix ON tab (LIKE col) and that does LIKE lookups and knows how to do col LIKE 'abc%', but it can't be used for >= or ORDER BY, but it can be used for equality tests? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > So, my understanding is that you would create something such as: > CREATE INDEX iix ON tab (LIKE col) > and that does LIKE lookups and knows how to do col LIKE 'abc%', but it > can't be used for >= or ORDER BY, but it can be used for equality tests? Hm. Right at the moment, it wouldn't be used for equality tests unless you spelled equality as "a ~=~ b". I wonder whether that's necessary though; couldn't we dispense with that operator and use ordinary equality as the BTEqual member of these opclasses? Are there any locales that claim that not-physically-identical strings are equal? regards, tom lane
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > So, my understanding is that you would create something such as: > > CREATE INDEX iix ON tab (LIKE col) > > and that does LIKE lookups and knows how to do col LIKE 'abc%', but it > > can't be used for >= or ORDER BY, but it can be used for equality tests? > > Hm. Right at the moment, it wouldn't be used for equality tests unless > you spelled equality as "a ~=~ b". I wonder whether that's necessary > though; couldn't we dispense with that operator and use ordinary > equality as the BTEqual member of these opclasses? Are there any > locales that claim that not-physically-identical strings are equal? Let me see if I understand. Our default indexes will be able to do =, >, <, ORDER BY, and the special index will be able to do LIKE, ORDER BY, and maybe equals. Do I have that correct? Looking at CVS, I see the warning about non-C locales has been removed. Should we instead mention the new LIKE index method? # (Be sure to maintain the correspondence with locale_is_like_safe() in selfuncs.c.)if test x`pg_getlocale COLLATE` != xC&& test x`pg_getlocale COLLATE` != xPOSIX; then echo "This locale setting will prevent the use of indexes for patternmatching" echo "operations. If that is a concern, rerun $CMDNAME with the collation order" echo "set to \"C\". For more information see the Administrator's Guide."fi Doing LIKE with single-byte encodings would be easy because it would be only 256 compares to find the min/max char values, but that doesn't work with multi-byte encodings, right? This LIKE/encoding problem is a tricky one because it gives poor performance with little warning to users. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Tom Lane writes: > I think that a more general solution would be the ability to select a > locale (and hence a sort order) per-column, as the SQL spec envisions. It is a general solution, but not for this problem. The problem was to make all locales equally suitable for certain optimizations, not to make locales available in more places. I won't pretend to anyone that this little change will bring us anywhere closer to a solution for that other problem. > Then you'd just select C locale for columns you wanted to do pattern > matching for. That's wrong, for a number of reasons: First of all, I don't agree at all that cases where you want both pattern matching and collation are rare; in fact, I rarely see a case where you don't want both. Designing a system on that assumption is not sound, because all operations should be equally possible in all situations. Second, we will eventually want pattern matching operations to be locale aware. Case-sensitive matching needs this, because case mappings depend on the locale. The character class features of POSIX regexps also need this. So you cannot make locales and well-performing pattern matching mutually exclusive. Third, keep in mind that datums with different locales cannot be combined liberally. So systems built the way you propose become crippled in ways that will be hard to understand and justify. Finally, the locale of a datum should be a property that describes that language of the stored data and that can be used for that specific purpose without concerns and tradeoffs with the internal doings of the optimization engine. -- Peter Eisentraut peter_e@gmx.net
Tom Lane writes: > Are there any locales that claim that not-physically-identical strings > are equal? In Unicode there are plenty such combinations. -- Peter Eisentraut peter_e@gmx.net
Bruce Momjian writes: > Our default indexes will be able to do =, >, <, ORDER BY, and the > special index will be able to do LIKE, ORDER BY, and maybe equals. Do I > have that correct? The default operator class supports comparisons (=, >, <, etc.) and ORDER BY based on those operators. The other operator class supports pattern matching operations (LIKE, SIMILAR, POSIX regexps). > Looking at CVS, I see the warning about non-C locales has been removed. > Should we instead mention the new LIKE index method? I don't see a need. The old warning was mainly because once you initdb'ed, you were basically stuck with your choice. Now we have plenty of options to query and adjust things later. > Doing LIKE with single-byte encodings would be easy because it would be > only 256 compares to find the min/max char values, but that doesn't work > with multi-byte encodings, right? This has nothing to do with encodings. -- Peter Eisentraut peter_e@gmx.net
Peter Eisentraut wrote: > Bruce Momjian writes: > > > Our default indexes will be able to do =, >, <, ORDER BY, and the > > special index will be able to do LIKE, ORDER BY, and maybe equals. Do I > > have that correct? > > The default operator class supports comparisons (=, >, <, etc.) and ORDER > BY based on those operators. The other operator class supports pattern > matching operations (LIKE, SIMILAR, POSIX regexps). > > > Looking at CVS, I see the warning about non-C locales has been removed. > > Should we instead mention the new LIKE index method? > > I don't see a need. The old warning was mainly because once you > initdb'ed, you were basically stuck with your choice. Now we have plenty > of options to query and adjust things later. How are people going to know to use these special LIKE indexes? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073