Re: [HACKERS] indexable and locale - Mailing list pgsql-hackers
From | Tatsuo Ishii |
---|---|
Subject | Re: [HACKERS] indexable and locale |
Date | |
Msg-id | 199910190055.JAA16894@ext16.sra.co.jp |
In response to | Re: [HACKERS] indexable and locale (Tom Lane <tgl@sss.pgh.pa.us>) |
List | pgsql-hackers |
> Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> >> Attached is a patch to the old problem discussed feverishly before 6.5.
>
> > ... I think your patches break
> > non-ascii multi-byte character sets data and should be surrounded by
> > #ifdef LOCALE rather than replacing current codes surrounded by
> > #ifndef LOCALE.
>
> I am worried about this patch too.  Under MULTIBYTE could it
> generate invalid characters?

I assume you are talking about the following code fragment in the
patches:

    prefix[prefixlen]++;

This would not generate invalid characters under MULTIBYTE, since the
patch skips multi-byte characters with:

    if ((unsigned) prefix[prefixlen] < 126)

This would not make non-ASCII multi-byte characters indexable, however.

> Also, do all non-ASCII locales sort
> codes 0-126 in the same order as ASCII?  I didn't think they do,
> but I'm not an expert.

As far as I know they do. At least, all encodings that MULTIBYTE mode
can handle have the same code points as ASCII in the 0-126 range. They
have the following characteristics:

o code points 0x00-0x7f are compatible with ASCII.

o code points 0x80 and above form variable-length multi-byte characters.
  For example, in ISO-8859-1 (German, French, etc.) the multi-byte
  length is always 1, while in EUC_JP (Japanese) it is 2 to 3.

> The approach I was considering for fixing the problem was to use a
> loop that would repeatedly try to generate a string greater than the
> prefix string.  The basic loop step would increment the rightmost
> byte as Goran has done (or, if it's already up to the limit, chop
> it off and increment the next character position).  Then test to
> see whether the '<' operator actually believes the result is
> greater than the given prefix, and repeat if not.  This avoids making
> any strong assumptions about the sort order of different character
> codes.  However, there are two significant issues that would have
> to be surmounted to make it work reliably:

Sounds like a good idea.

> 1. In MULTIBYTE mode incrementing the rightmost byte might yield
> an illegal multibyte character.  Some way to prevent or detect this
> would be needed, lest it confuse the comparison operator.  I think
> we have some multibyte routines that could be used to check for
> a valid result, but I haven't looked into it.

I don't think this is an issue as long as locale isn't enabled. For
multibyte encodings (Japanese, Chinese, etc.) locale is totally useless,
and I usually don't enable it.

> 2. I think there are some locales out there that have context-
> sensitive sorting rules, ie, a given character string may sort
> differently than you'd expect from considering the characters in
> isolation.  For example, in German isn't "ss" treated specially?
> If "pqrss" does not sort between "pqrs" and "pqrt" then the entire
> premise of *both* sides of the LIKE optimization falls apart,
> because you can't be sure what will happen when comparing a prefix
> string like "pqrs" against longer strings from the database.
> I do not know if this is really a problem, nor what we could do
> to avoid it if it is.

I'm not sure about it, but I am afraid it could be a problem. I think
the real solution would be to support the standard CREATE COLLATION.
---
Tatsuo Ishii
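For reference, the loop described in the quoted text (increment the rightmost byte, chop it off once it reaches the limit, and check each candidate with the comparison itself) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the patch or from the backend: the names `next_greater_prefix` and `str_cmp_fn` and the 256-byte buffer are hypothetical, the comparator stands in for whatever the '<' operator really uses, and the limit of 126 is taken from the fragment quoted above. Bytes at or above the limit are simply chopped, so multi-byte sequences are dropped rather than incremented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical comparator type: returns <0, 0 or >0, like strcmp(). */
typedef int (*str_cmp_fn) (const char *a, const char *b);

/*
 * Hypothetical helper: rewrite buf (NUL-terminated, writable) in place
 * into a string that cmp reports as greater than the original prefix.
 * Returns false if no such string was found (buf is then empty) or if
 * the prefix does not fit the fixed-size scratch buffer.
 */
bool
next_greater_prefix(char *buf, str_cmp_fn cmp)
{
    char        orig[256];      /* assumed upper bound on prefix length */
    size_t      len;

    assert(buf != NULL && cmp != NULL);

    len = strlen(buf);
    if (len >= sizeof(orig))
        return false;
    memcpy(orig, buf, len + 1);

    while (len > 0)
    {
        unsigned char *last = (unsigned char *) &buf[len - 1];

        /* Step the rightmost byte upward, never going past 126. */
        while (*last < 126)
        {
            (*last)++;
            /* Trust the comparison, not the byte values. */
            if (cmp(buf, orig) > 0)
                return true;
        }

        /* Byte is at the limit (or is non-ASCII): chop it and retry. */
        buf[--len] = '\0';
    }
    return false;
}
```

With `strcmp` as the comparator this reduces to plain byte order, so "abc" becomes "abd" and "ab~" becomes "ac". Passing `strcoll` after a `setlocale()` call is the case where the explicit comparison step would actually matter. Note that the sketch only checks the candidate against the prefix itself, so it does nothing about the context-sensitive sorting problem raised in point 2.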