Thread: LIKE indexing proposal
I would like to re-table my proposal from many moons ago to allow pattern-matching operations to use indexes under any locale. The idea is that since pattern matching works on a character-by-character basis, we also need to interact with the index using operators that do a comparison strictly on a character-by-character basis. So we define a separate set of comparison operators and operator classes for each character type (text, varchar, bpchar, name), teach the optimizer to use it for suitably anchored pattern-matching operations, and tell users to create indexes using that special operator class if they want pattern matching to use indexes. Here are a couple of details to discuss: I named the operators #<#, #>=#, etc. If someone can think of better names, let me know. Since character-by-character comparison is essentially binary comparison, I named the operator classes, text_binary_ops, etc. Another idea is to name them text_like_ops or text_pattern_ops or whatever, so that if some change in the pattern matching operations would later force us to alter the behavior of the operators away from being strictly memcmp(), we would be free to change them. Should there be a special case for the C locale? Without extra code, if you use the C locale and would like to use indexes for both equality and pattern comparisons, you need two indexes with different operator classes. That might seem wasteful, but then all locales would really work the same. I'm unclear on how the selectivity estimation should work. The system doesn't collect statistics based on the new #<#-style operators, so the estimates are based on the normal comparison, which might have little to do with reality. -- Peter Eisentraut peter_e@gmx.net
Peter Eisentraut <peter_e@gmx.net> writes: > I would like to re-table my proposal from many moons ago to allow > pattern-matching operations to use indexes under any locale. Do you have better answers to the objections that were raised the last time? In particular, I wonder whether it's worth putting effort into this at all, rather than trying to make the contrib full-text-indexing support mainstream-quality. > Since character-by-character comparison is essentially binary comparison, > I named the operator classes, text_binary_ops, etc. Another idea is to > name them text_like_ops or text_pattern_ops or whatever, text_binary_ops seems a bit of a contradiction in terms :-(. I'd prefer either of the other choices. > Should there be a special case for the C locale? Yes, if only on backwards-compatibility grounds. Database setups that worked well before should not get arbitrarily broken. > I'm unclear on how the selectivity estimation should work. The system > doesn't collect statistics based on the new #<#-style operators, so the > estimates are based on the normal comparison, which might have little to > do with reality. Not sure that this really matters; the selectivity estimation for LIKE is so crude that I doubt it would get significantly worse. But one could conceive of storing statistics computed both ways in pg_statistic. regards, tom lane
Tom Lane writes: > Peter Eisentraut <peter_e@gmx.net> writes: > > I would like to re-table my proposal from many moons ago to allow > > pattern-matching operations to use indexes under any locale. > > Do you have better answers to the objections that were raised the last > time? The objection last time was that, according to the SQL standard, LIKE patterns should use the locale-specific collation order for the fixed parts (instead of character-by-character comparisons), and that under that rule my proposed solution would fall down. Time has shown that this objection is poor on four grounds: 1. LIKE does not work that way, and omitting optimizations now based on possible future changes is a stupid approach. 2. If LIKE were to be changed that way, we can always take the optimization out. 3. Even if LIKE were changed, certainly POSIX regexps would never be changed, so you can still use the optimization there. 4. Nobody is seriously proposing to change LIKE that way. And if someone did, I would object, because the established collation rules make getting consistent results from this approach impossible. -- Peter Eisentraut peter_e@gmx.net
> Peter Eisentraut <peter_e@gmx.net> writes: > > I would like to re-table my proposal from many moons ago to allow > > pattern-matching operations to use indexes under any locale. Wouldn't existing b-trees be sufficient, if they could be 'scanned' starting with the operator >= ? Thus a LIKE 'ABC%' could be done by stepping an (ascending) index fom x >= 'ABC' up to the first key that does not have 'ABC' as first characters ? Seems this would be of more general use, than an extra index for LIKE, no ? Some upper bound would still be needed for a selectivity estimate, but for estimation purposes it would not really need to be watertight. Andreas
> The objection last time was that, according to the SQL standard, LIKE > patterns should use the locale-specific collation order for the fixed > parts (instead of character-by-character comparisons), Yes, I think that LIKE definitely needs to be character-by-character and I can think of no example in the german language that would make general sense (like ss and single byte sharp s (ß) would definitely be nonsense if handled as equal for LIKE). Andreas
"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes: > Wouldn't existing b-trees be sufficient, if they could be 'scanned' starting > with the operator >= ? Thus a LIKE 'ABC%' could be done by stepping an (ascending) > index fom x >= 'ABC' up to the first key that does not have 'ABC' as first > characters ? You've apparently forgotten all our previous history on that subject :-(. The above does not work in the presence of special sort rules for digraphs, etc. For example, that LIKE should certainly match ABCH ... but there are locales in which "CH" sorts after "D" and would not be found by an indexscan that runs from ABC to ABD. Even with no digraphs, the optimization is broken by locales that treat spaces as second-class citizens, prefer caseless to case-sensitive comparisons, etc. It doesn't work in en_US locale, for example. regards, tom lane
> You've apparently forgotten all our previous history on that subject > :-(. The above does not work in the presence of special sort rules for > digraphs, etc. For example, that LIKE should certainly match ABCH ... > but there are locales in which "CH" sorts after "D" and would not be > found by an indexscan that runs from ABC to ABD. Sorry, so is the start condition of >= 'ABC' also not valid ? A new index scan method with the pattern as input would need to know when it can safely stop (which is the problem I understand), but would have the advantage of beeing able to filter unwanted tuples before the heap access. Andreas