Re: Yet another LIKE-indexing scheme - Mailing list pgsql-hackers
From | Erich Stamberger |
---|---|
Subject | Re: Yet another LIKE-indexing scheme |
Date | |
Msg-id | Pine.LNX.4.21.0009031546500.4051-100000@gewi.kfunigraz.ac.at Whole thread Raw |
In response to | Yet another LIKE-indexing scheme (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: Yet another LIKE-indexing scheme
|
List | pgsql-hackers |
Hello, On Sat, 2 Sep 2000, Tom Lane wrote: > This would, for example, fix the example given by Erich Stamberger: > > > Another interresting feature of Czech collation is: > > > > H < "CH" < I > > > > and: > > > > B < C < C + CARON < D .. < H < "CH" < I > > > > So what happens with "WHERE name like 'Czec%`" ? > > Our existing code fails because it generates WHERE name >= 'Czec' AND > name < 'Czed'; it will therefore not find names beginning 'Czech' > because those are in another part of the index, between 'Czeh' and > 'Czei'. But WHERE name >= 'Cze' AND name < 'Czf' would work. The Problem is: What tells us, that it is 'f' which sorts after 'e' in that locale? In the "C" locale you can simply add One to the character's code to get the next one, since the numerical ordering of the encoding is identical to the collation: 'e' + 1 = 'f' and 'e' < 'f'. This is *not* true for *every* real-world language-encoding-pair in the world - at least not for characters with codes above 127 and characters with codes below 65. In the example above we are in luck, although there are additional characters between 'e' and 'f' in Czech sorting: Collation => .. < 'e' = 'e + acute' = 'e + caron' < 'f' Encoding => 101 < 233 < 236 > 102 To my knowledge the ISO C API doesn't provide an interface to collation information (IAPITA!). There are no succ() and pred() functions like in PASCAL for example. And even if these functions could be emulated, I'm not sure about possible "side effects". French for example, has even more funny rules ("funny" from a programmer's point of view): Accented characters which appear later in a string are more important than accented characters which appear earlier. IMHO, using the OS's locale support in databases asks for trouble anyway: Who guaratees that the strcolls/localedefs floating around behave the same way? What, if some kind soul of system admistrator updates the OS and fixes a buggy locale definition file (maybe without knowing)? The next UPDATE or INSERT coming along will damage the indices of all databases using the affected locale. Even a simple SELECT may yield strange results. > > Are there examples where this still doesn't work? (Funny sort rules > for trigraphs would break it, I'm sure, unless we drop two characters > instead of just one.) > I don't know if there are any locales, where removing/appending "something" from/to a string can result in a higher/lower collation weight: "xyzab < xyz" or "xyz > xyzab". > Obviously we could still keep the last character in ASCII locale. > That would be a good thing since it'd reduce the number of tuples > scanned. Is there a portable way to determine whether it's safe to > do so in other locales? (Some inquiry function about whether the sort > ordering has any digraph or two-to-one rules might help, but I don't > know if there is one.) > Even ASCII (7-bit) encoded *locales* may be in big trouble here: gewi:~$ cat en.txt 1 2 ? ?2 ?A ?a -A -a + - / a b A B gewi:~$ export LANG="C" gewi:~$ sort en.txt + - -A -a / 1 2 ? ?2 ?A ?a A B a b gewi:~$ export LANG="en_US" gewi:~$ sort en.txt - ? / + 1 ?2 2 -A ?A A -a ?a a B b .. at least strings with punctuation characters will fail in certain cases. -- Erich (still thinking)
pgsql-hackers by date: