Re: 7.3 no longer using indexes for LIKE queries - Mailing list pgsql-general
From | Matthew Gabeler-Lee |
---|---|
Subject | Re: 7.3 no longer using indexes for LIKE queries |
Date | |
Msg-id | ABABFB80F35AD311848B0090279918EF010B9B64@ZYCOSNT2.hq.zycos.com |
In response to | 7.3 no longer using indexes for LIKE queries (Matthew Gabeler-Lee <mgabelerlee@zycos.com>) |
List | pgsql-general |
Looking at the list archives, it looks like there are some general classes of problems:

1) Multibyte charsets; not knowing much about these, I'd guess that initially none of them are 'sane'.

2) A longer string sorting less than a shorter string ('ABC\NNN' < 'ABC'); this can be tested by iterating through bytes (or potentially pairs of bytes, to catch characters that need two special bytes as a digraph or something) appended to a base string and seeing if any of them trigger the longer-less condition.

3) Characters that are ignored in a sort (e.g. 'ABC\NNNDEF' sorts the same as 'ABCDEF'); again testable by iterating through bytes or pairs of bytes looking for the condition.

4) Accent folding; I'm not entirely sure LIKE is supposed to do this. For the rest of this message I'm going to pretend that the LIKE operator shouldn't fold accented characters.

Employing extreme paranoia, and assuming great ignorance of any specialties of the locale being tested, I think this should cover all the bases:

```
// pglub(x) = postgres' upper bound string for the like optimization on 'x%'
// <, >, = mean sorts less than, etc.
report insane if locale is multibyte
for every 2-byte string $a
    let $b = pglub($a)
    for every 2-byte string $c
        report insane if "$a$c" < $a or "$a$c" > $b
for some 4-byte string(s) of plain characters $d
    for every 2-byte string $e
        report insane if $d = "$d[0,1]$e$d[2,3]"
else report sane
```

(A C sketch of this probe appears below, after the quoted message.)

On a fast machine, this would probably take a couple of minutes [1] to run per locale, so it would definitely be a one-time thing, not a compile-time thing, but it would be practical to run on a one-time or per-release basis to generate a source file listing the sane locales.

[1] The two nested loops over 2-byte strings amount to iterating through all 2^32 combinations. It takes my PC a few seconds just to do 'for(unsigned n=0;n<0xffffffff;++n);', so I extrapolated an estimate for the work beyond the bare counting. The second loop runs so many fewer times that its runtime is in the noise.

With some slight modifications, this sanity-finding routine could be used to generate tables for use by a function that finds the proper bounds for the LIKE optimization on locales that suffer only from problem #2 (a sketch of such a bound finder also appears below). Yes, this embeds knowledge about locales into postgres, but it does so through exhaustive examination, not through personal expertise.

An addendum on accent folding: to my ignorant mind, characters with different accents are different characters. My French teacher certainly marked me off if I got the accents wrong. It seems to me that the most common place one wants to think about this is in full-text searching and the like. In that case, maybe I'm daft, but perhaps the thing to do is to create a functional index, where the function being indexed strips the accents off characters. Does the SQL spec have anything to say about accent folding in comparisons?

-Matt

-----Original Message-----
From: Bruce Momjian [mailto:pgman@candle.pha.pa.us]

I think we do know the C locale is OK for this. It is just the other ones we don't know about, and we don't even know. Is there some test we can write for this?
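To make the probe concrete, here is a minimal C sketch of the same idea, assuming a single-byte locale and using setlocale() and strcoll() for the comparisons. The helper names (crude_upper_bound, locale_is_sane) are hypothetical; pglub() is approximated by simply bumping the last byte of the prefix (skipping 0xFF, which a real bound finder must handle by truncating and carrying), and the loops are cut down from 2-byte strings to single bytes so the sketch runs in well under a second:

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>

/*
 * Sketch of the per-locale sanity probe described in the post,
 * assuming a single-byte locale.  Helper names are hypothetical.
 */

/* Crude stand-in for pglub(): copy the prefix, bump its last byte. */
static void crude_upper_bound(const char *prefix, char *out)
{
    size_t len = strlen(prefix);

    memcpy(out, prefix, len + 1);
    out[len - 1]++;
}

static int locale_is_sane(const char *locname)
{
    char a[4], b[4], s[8];

    if (setlocale(LC_COLLATE, locname) == NULL)
        return 0;                   /* unknown locale: call it insane */

    /* Problem #2: does appending any byte escape the [a, b) range? */
    for (int c1 = 1; c1 < 0xFF; c1++)   /* 0xFF skipped, see above */
    {
        snprintf(a, sizeof(a), "A%c", c1);
        crude_upper_bound(a, b);

        for (int c2 = 1; c2 <= 0xFF; c2++)
        {
            snprintf(s, sizeof(s), "A%c%c", c1, c2);
            if (strcoll(s, a) < 0 || strcoll(s, b) >= 0)
                return 0;
        }
    }

    /* Problem #3: is any single byte collated as if it weren't there? */
    for (int c = 1; c <= 0xFF; c++)
    {
        snprintf(s, sizeof(s), "AB%cCD", c);
        if (strcoll(s, "ABCD") == 0)
            return 0;
    }

    return 1;
}

int main(int argc, char **argv)
{
    const char *loc = (argc > 1) ? argv[1] : "C";

    printf("%s: %s\n", loc, locale_is_sane(loc) ? "sane" : "insane");
    return 0;
}
```

The exhaustive version described in the post would replace the single-byte loops with loops over all 2-byte strings, which is where the couple-of-minutes-per-locale estimate comes from.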
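And for the table-generating variant mentioned above, a similarly hedged sketch of a strcoll()-validated bound finder: bump the last byte of the prefix until the candidate collates above the prefix, truncating and carrying on overflow. The name like_upper_bound and its interface are hypothetical (it mirrors the general shape of PostgreSQL's make_greater_string), and note the caveat that collating above the prefix alone does not bound every extension on an 'insane' locale — detecting exactly that is what the probe above is for:

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical sketch: find a string that collates above the prefix,
 * for use as the upper bound of the LIKE 'prefix%' range.  Bump the
 * last byte; on overflow (0xFF), truncate it and carry into the
 * previous byte.  Returns 1 on success, 0 if no bound was found.
 */
static int like_upper_bound(const char *prefix, char *out, size_t outsz)
{
    size_t len = strlen(prefix);

    if (len + 1 > outsz)
        return 0;
    memcpy(out, prefix, len + 1);

    while (len > 0)
    {
        unsigned char *last = (unsigned char *) &out[len - 1];

        while (*last < 0xFF)
        {
            (*last)++;
            /* Caveat: sorting above the prefix does not by itself
             * bound every extension on an 'insane' locale. */
            if (strcoll(out, prefix) > 0)
                return 1;
        }
        out[--len] = '\0';      /* overflowed: truncate and carry */
    }
    return 0;
}

int main(void)
{
    char bound[16];

    setlocale(LC_COLLATE, "C"); /* assumption: a single-byte locale */
    if (like_upper_bound("ABC", bound, sizeof(bound)))
        printf("upper bound for LIKE 'ABC%%' is '%s'\n", bound);
    return 0;
}
```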